# **Test Assignment - Data Extraction and NLP**

### **Objective**
> **To extract article text from the given URLs and
perform text analysis to compute the variables listed below.**

**Step 1 : Data Extraction**
> For each of the articles, extract the article text and save the extracted
article in a text file with URL_ID as its file name.
While extracting text, please make sure your program extracts only the article title and the article
text.

**Step 2 : Data Analysis**
> For each of the extracted texts from the article, perform textual analysis and compute variables given below:
1. POSITIVE SCORE
2. NEGATIVE SCORE
3. POLARITY SCORE
4. SUBJECTIVITY SCORE
5. AVG SENTENCE LENGTH
6. PERCENTAGE OF COMPLEX WORDS
7. FOG INDEX
8. AVG NUMBER OF WORDS PER SENTENCE
9. COMPLEX WORD COUNT
10. WORD COUNT
11. SYLLABLE PER WORD
12. PERSONAL PRONOUNS
13. AVG WORD LENGTH

In [1]:
# Import files from gdrive
import gdown
url = 'https://drive.google.com/drive/folders/1VU-vbGBYz7E0QTRh_iPPKaTnlfWB54Xf?usp=drive_link'
gdown.download_folder(url, quiet=True, use_cookies=False, remaining_ok=True)




['/content/Test Assignment/MasterDictionary/negative-words.txt',
 '/content/Test Assignment/MasterDictionary/positive-words.txt',
 '/content/Test Assignment/StopWords/StopWords_Auditor.txt',
 '/content/Test Assignment/StopWords/StopWords_Currencies.txt',
 '/content/Test Assignment/StopWords/StopWords_DatesandNumbers.txt',
 '/content/Test Assignment/StopWords/StopWords_Generic.txt',
 '/content/Test Assignment/StopWords/StopWords_GenericLong.txt',
 '/content/Test Assignment/StopWords/StopWords_Geographic.txt',
 '/content/Test Assignment/StopWords/StopWords_Names.txt',
 '/content/Test Assignment/Input.xlsx',
 '/content/Test Assignment/Objective.docx',
 '/content/Test Assignment/Output Data Structure.xlsx',
 '/content/Test Assignment/Text Analysis.docx']

In [2]:
# Import necessary packages
import requests
from bs4 import BeautifulSoup
import pandas as pd
import os
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.sentiment import SentimentIntensityAnalyzer
from nltk.tokenize import sent_tokenize


# Load NLTK resources
nltk.download('punkt')
nltk.download('vader_lexicon')
nltk.download('stopwords')

import warnings
warnings.filterwarnings("ignore")


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


# **Data Extraction**

In [3]:
# Function to extract the article title and text from a URL listed in the Input.xlsx file

def extract_article_text(url):
    try:
        response = requests.get(url)
        soup = BeautifulSoup(response.text, 'html.parser')

        # Find and remove unwanted elements (e.g., header, footer, etc.)
        for element in soup(["header", "footer"]):
            element.decompose()

        # Extract article title and text
        article_title = soup.find('title').text.strip()
        article_text = ""

        # Extract text from <div class="td-post-content tagdiv-type">
        article_div = soup.find('div', class_='td-post-content tagdiv-type')
        if article_div:
            article_text = article_div.get_text()
        return article_title, article_text

    except Exception as exc:
        # Report the actual error instance rather than the Exception class
        print(f"Error while extracting article from {url}: {exc}")
        return None, None


In [4]:
# Function to save the article title and text to a text file

def save_article_to_file(url_id, article_title, article_text):
    if not os.path.exists("articles"):
        os.mkdir("articles")

    with open(f"articles/{url_id}.txt", "w", encoding="utf-8") as file:
        file.write(f"Title: {article_title}\n\n")
        file.write(article_text)
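
The saving step can also be written with `pathlib`, which creates the output folder only if it is missing and returns the path that was written. This is a minimal sketch, not the notebook's own function: the name `save_article`, the `out_dir` parameter, and the sample ID `bctech0000` are assumptions made for illustration.

```python
from pathlib import Path


def save_article(url_id, title, text, out_dir="articles"):
    """Write one article to <out_dir>/<url_id>.txt and return the file path."""
    out_path = Path(out_dir)
    # exist_ok=True makes repeated runs safe when the folder already exists
    out_path.mkdir(parents=True, exist_ok=True)
    target = out_path / f"{url_id}.txt"
    # Same layout as the notebook: a title line, a blank line, then the body
    target.write_text(f"Title: {title}\n\n{text}", encoding="utf-8")
    return target


if __name__ == "__main__":
    # "bctech0000" is a made-up ID used only for this demonstration
    written = save_article("bctech0000", "Sample title", "Sample body text.")
    print(written)
```

Returning the path makes it easy to log or verify each file after the extraction loop.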


In [6]:
def main():
    input_file = "/content/Test Assignment/Input.xlsx"
    df = pd.read_excel(input_file)

    for index, row in df.iterrows():
        url_id = row["URL_ID"]
        url = row["URL"]

        # Extract article title and text
        article_title, article_text = extract_article_text(url)

        # Check if extraction was successful
        if article_title and article_text:
            save_article_to_file(url_id, article_title, article_text)
            print(f"Article {url_id} extracted and saved successfully.")
        else:
            print(f"Failed to extract article {url_id}.")

if __name__ == "__main__":
    main()

Article bctech2011 extracted and saved successfully.
Article bctech2012 extracted and saved successfully.
Article bctech2013 extracted and saved successfully.
Article bctech2014 extracted and saved successfully.
Article bctech2015 extracted and saved successfully.
Article bctech2016 extracted and saved successfully.
Article bctech2017 extracted and saved successfully.
Article bctech2018 extracted and saved successfully.
Article bctech2019 extracted and saved successfully.
Article bctech2020 extracted and saved successfully.
Article bctech2021 extracted and saved successfully.
Article bctech2022 extracted and saved successfully.
Article bctech2023 extracted and saved successfully.
Article bctech2024 extracted and saved successfully.
Article bctech2025 extracted and saved successfully.
Article bctech2026 extracted and saved successfully.
Article bctech2027 extracted and saved successfully.
Article bctech2028 extracted and saved successfully.
Article bctech2029 extracted and saved success

# **Data Analysis**

## **> Sentiment Analysis**

> For each of the extracted texts from the article, compute variables given below:
1. POSITIVE SCORE
2. NEGATIVE SCORE
3. POLARITY SCORE
4. SUBJECTIVITY SCORE

In [7]:
# Function to load positive and negative dictionaries from files
def load_dictionaries(positive_dict_file, negative_dict_file):
    with open(positive_dict_file, 'r', encoding='latin-1', errors='replace') as file:
        positive_words = set(file.read().splitlines())
    with open(negative_dict_file, 'r', encoding='latin-1', errors='replace') as file:
        negative_words = set(file.read().splitlines())
    return positive_words, negative_words
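
Because the scoring function lowercases every token before the lookup, dictionary entries that contain uppercase letters, surrounding spaces, or blank lines would never match. The sketch below normalizes entries while loading. Whether the MasterDictionary files actually contain such entries is not verified here, and the helper name `load_word_set` is hypothetical.

```python
def load_word_set(path, encoding="latin-1"):
    """Read one word per line and return a set of lowercase, non-empty entries."""
    with open(path, "r", encoding=encoding, errors="replace") as handle:
        # strip() removes stray whitespace; blank lines are skipped entirely
        return {line.strip().lower() for line in handle if line.strip()}
```

It could be called once per file, for example `positive_words = load_word_set(positive_dict_file)`.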

In [8]:
# Function to perform sentiment analysis and calculate scores
def calculate_sentiment_scores(text, positive_words, negative_words):
    sia = SentimentIntensityAnalyzer()
    tokens = word_tokenize(text)

    positive_score = 0
    negative_score = 0

    for word in tokens:
        # Lowercase the token; non-alphabetic tokens such as punctuation are skipped below
        word = word.lower()
        if word.isalpha():
            # Check if the word is in the positive dictionary
            if word in positive_words:
                positive_score += 1
            # Check if the word is in the negative dictionary
            if word in negative_words:
                negative_score += 1

    # Calculate sentiment analysis metrics
    polarity_score = (positive_score - negative_score) / ((positive_score + negative_score) + 0.000001)
    subjectivity_score = (positive_score + negative_score) / (len(tokens) + 0.000001)

    return positive_score, negative_score, polarity_score, subjectivity_score
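
The two formulas above can be checked by hand on a short sentence. The sketch below uses only the standard library: a simple regex stands in for `word_tokenize`, and the tiny word sets are made up for the example, so the numbers are illustrative rather than comparable to the notebook's results.

```python
import re


def score_tokens(tokens, positive_words, negative_words):
    """Apply the notebook's polarity and subjectivity formulas to a token list."""
    alpha = [t.lower() for t in tokens if t.isalpha()]
    positive = sum(t in positive_words for t in alpha)
    negative = sum(t in negative_words for t in alpha)
    # The small constant keeps the division defined when no sentiment words occur
    polarity = (positive - negative) / ((positive + negative) + 0.000001)
    # As in the notebook, the denominator counts every token, punctuation included
    subjectivity = (positive + negative) / (len(tokens) + 0.000001)
    return positive, negative, polarity, subjectivity


# Stand-in tokenizer: runs of word characters, or single punctuation marks
sentence = "The rollout was smooth and reliable, but the setup was slow."
tokens = re.findall(r"\w+|[^\w\s]", sentence)

# Toy dictionaries, assumed for illustration only
pos, neg, pol, subj = score_tokens(tokens, {"smooth", "reliable"}, {"slow"})
print(pos, neg, round(pol, 3), round(subj, 3))
```

The same arithmetic explains the reported results: for bctech2012, 19 positive and 6 negative words give (19 - 6) / (19 + 6) = 0.52, which is the Polarity_Score shown in the output table.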

In [9]:
def main():
    input_data_file = "/content/Test Assignment/Output Data Structure.xlsx"
    positive_dict_file = "/content/Test Assignment/MasterDictionary/positive-words.txt"
    negative_dict_file = "/content/Test Assignment/MasterDictionary/negative-words.txt"
    articles_dir = "/content/articles"

    # Load dictionaries
    positive_words, negative_words = load_dictionaries(positive_dict_file, negative_dict_file)

    # Read output data structure Excel file
    output_data = pd.read_excel(input_data_file)

    results = []
    for index, row in output_data.iterrows():
        url_id = row["URL_ID"]
        url = row["URL"]
        article_file = os.path.join(articles_dir, f"{url_id}.txt")

        if os.path.exists(article_file):
            # Read article text from file
            with open(article_file, 'r', encoding='utf-8') as article:
                article_text = article.read()

            # Perform sentiment analysis
            positive_score, negative_score, polarity_score, subjectivity_score = calculate_sentiment_scores(article_text, positive_words, negative_words)

            results.append({
                "URL_ID": url_id,
                "URL": url,
                "Positive_Score": positive_score,
                "Negative_Score": negative_score,
                "Polarity_Score": polarity_score,
                "Subjectivity_Score": subjectivity_score
            })

    # Create DataFrame from results
    df1 = pd.DataFrame(results)

    # Save results to Excel
    df1.to_excel("Sentiment_Analysis.xlsx", index=False)

if __name__ == "__main__":
    main()

In [10]:
sentiment_analysis = pd.read_excel("Sentiment_Analysis.xlsx")
sentiment_analysis.head(10)

Unnamed: 0,URL_ID,URL,Positive_Score,Negative_Score,Polarity_Score,Subjectivity_Score
0,bctech2011,https://insights.blackcoffer.com/ml-and-ai-bas...,138,45,0.508197,0.059667
1,bctech2012,https://insights.blackcoffer.com/streamlined-i...,19,6,0.52,0.04363
2,bctech2013,https://insights.blackcoffer.com/efficient-dat...,19,10,0.310345,0.041076
3,bctech2014,https://insights.blackcoffer.com/effective-man...,13,6,0.368421,0.034545
4,bctech2015,https://insights.blackcoffer.com/streamlined-t...,17,3,0.7,0.028986
5,bctech2016,https://insights.blackcoffer.com/efficient-aws...,10,11,-0.047619,0.039179
6,bctech2017,https://insights.blackcoffer.com/streamlined-e...,6,4,0.2,0.02457
7,bctech2018,https://insights.blackcoffer.com/automated-ort...,6,3,0.333333,0.020045
8,bctech2019,https://insights.blackcoffer.com/streamlining-...,7,15,-0.363636,0.039855
9,bctech2020,https://insights.blackcoffer.com/efficient-dat...,25,9,0.470588,0.041769


## **> Text Analysis**

> For each of the extracted texts from the article, perform textual analysis and compute variables given below:
5. AVG SENTENCE LENGTH
6. PERCENTAGE OF COMPLEX WORDS
7. FOG INDEX
8. AVG NUMBER OF WORDS PER SENTENCE
9. COMPLEX WORD COUNT
10. WORD COUNT
11. SYLLABLE PER WORD
12. PERSONAL PRONOUNS
13. AVG WORD LENGTH

In [11]:
# 1. Function to calculate average sentence length
# Average Sentence Length = the number of words / the number of sentences
def calculate_avg_sentence_length(sentences):
    total_words = sum(len(word_tokenize(sentence)) for sentence in sentences)
    total_sentences = len(sentences)
    return total_words / total_sentences

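# 2. Function to calculate percentage of complex words
# Percentage of Complex words = the number of complex words / the number of words
# Note: this function treats every token longer than two characters as complex; it does not count syllables.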
def calculate_percentage_complex_words(text):
    words = word_tokenize(text)
    complex_words = [word for word in words if len(word) > 2]
    return len(complex_words) / len(words)

# 3. Function to calculate fog index
# Fog Index = 0.4 * (Average Sentence Length + Percentage of Complex words)
def calculate_fog_index(avg_sentence_length, percentage_complex_words):
    return 0.4 * (avg_sentence_length + percentage_complex_words)

# 4. Function to calculate average number of words per sentence
def calculate_avg_words_per_sentence(words, sentences):
    return len(words) / len(sentences)
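
`calculate_percentage_complex_words` selects tokens by character length (`len(word) > 2`), while the notebook elsewhere defines complex words as those with more than two syllables. The sketch below shows how the share and the fog index could be computed from a syllable-based test instead. The helpers `rough_syllables` and `fog_from_words` are hypothetical, the syllable rule is a simplified vowel-group count, and the input is assumed to be already split into sentences and words.

```python
import re


def rough_syllables(word):
    """Approximate syllables as vowel groups, ignoring a silent trailing 'e'."""
    word = word.lower()
    if word.endswith("e") and not word.endswith("le"):
        word = word[:-1]
    # Each run of consecutive vowels counts once; every word gets at least one
    return max(len(re.findall(r"[aeiou]+", word)), 1)


def fog_from_words(sentences):
    """sentences: a list of sentences, each given as a list of word strings."""
    words = [w for sentence in sentences for w in sentence]
    avg_sentence_length = len(words) / len(sentences)
    complex_share = sum(rough_syllables(w) > 2 for w in words) / len(words)
    fog_index = 0.4 * (avg_sentence_length + complex_share)
    return avg_sentence_length, complex_share, fog_index


# Made-up sample: two sentences, six words
sample = [["Automation", "reduces", "manual", "effort"], ["It", "works"]]
asl, share, fog = fog_from_words(sample)
print(asl, round(share, 3), round(fog, 3))
```

The fog formula itself can be verified against the reported output: for bctech2011, 0.4 × (17.327684 + 0.740789) = 7.227389, which matches the Fog_Index column.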

In [12]:
# 5. Function to calculate complex word count
# Complex words are words in the text that contain more than two syllables, i.e., syllable count > 2
def calculate_complex_word_count(text):
    words = word_tokenize(text)
    complex_words = [word for word in words if count_syllables(word) > 2]
    return len(complex_words)

# 6. Function for Word count
# We count whitespace-separated tokens, skipping NLTK English stop words and tokens that begin with , . ! or ?
def words_count(text):
    stop_words = set(stopwords.words("english"))
    cleaned_words = [word for word in text.split() if word not in stop_words and re.match(r'^[,.!?]', word) is None]
    return len(cleaned_words)

# Function to count syllables in a word
# We determine the syllable count for each word by counting groups of consecutive vowels.
# A trailing "e" is treated as silent and dropped first, unless the word ends in "le".
def count_syllables(word):
   vowels = "aeiouAEIOU"
   count = 0
   if word[-1] in ['e', 'E'] and word[-2:] != 'le' and word[-2:] != 'LE':
       word = word[:-1]
   for index, letter in enumerate(word):
       if index == 0 and letter.lower() in vowels:
           count += 1
       elif letter.lower() in vowels and not (word[index - 1].lower() in vowels):
           count += 1

   return max(count, 1) # Return at least one syllable

# 7. Function to calculate syllable count per word
def calculate_syllable_count_per_word(text):
    words = word_tokenize(text)
    syllables_per_word = sum(count_syllables(word) for word in words)
    return syllables_per_word/max(len(words), 1) # Avoid division by zero error


# 8. Function to calculate personal pronoun count
# We use regex to find the count of Personal Pronouns (“I,” “we,” “my,” “ours,” and “us”) mentioned in the text.
def calculate_personal_pronouns(text):
    pronouns = ["I", "we", "my", "ours", "us"]
    pattern = r'\b(?:' + '|'.join(pronouns) + r')\b'   # \b is used to match word boundaries
    matches = re.findall(pattern, text)
    return len(matches)


# 9. Function to calculate average word length
# Average Word Length = Sum of the total number of characters in each word/Total number of words
def calculate_avg_word_length(text):
    words = word_tokenize(text)
    total_characters = sum(len(word) for word in words)
    return total_characters / len(words)
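
The pronoun regex above is case-sensitive, so it skips "We" and "My" at the start of a sentence, although it also avoids counting the country abbreviation "US" as "us". A possible variant, sketched below under the assumption that capitalized pronouns should be counted, matches case-insensitively and then discards the exact string "US". The function name `count_personal_pronouns` is hypothetical.

```python
import re

# Same pronoun list as the notebook, matched without regard to case
PRONOUN_PATTERN = re.compile(r"\b(?:I|we|my|ours|us)\b", re.IGNORECASE)


def count_personal_pronouns(text):
    """Count personal pronouns in any casing, except the all-caps token 'US'."""
    matches = PRONOUN_PATTERN.findall(text)
    return sum(1 for match in matches if match != "US")


sample = "We shipped it in the US. It helped us, and I think my team agrees."
print(count_personal_pronouns(sample))
```

One trade-off to keep in mind: a case-insensitive match also counts a standalone lowercase "i", for example in "i.e.", so the result is still an approximation.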

In [13]:
def main():
    output_data_file = "/content/Test Assignment/Output Data Structure.xlsx"
    articles_dir = "/content/articles"

    # Read output data structure Excel file
    output_data = pd.read_excel(output_data_file)

    results_ = []
    for index, row in output_data.iterrows():
        url_id = row["URL_ID"]
        article_file = os.path.join(articles_dir, f"{url_id}.txt")

        if os.path.exists(article_file):
            # Read article text from file
            with open(article_file, 'r', encoding='utf-8') as article:
                article_text = article.read()

            # Tokenize sentences for text analysis
            sentences = sent_tokenize(article_text)
            words = word_tokenize(article_text)

            # Calculate text analysis metrics
            avg_sentence_length = calculate_avg_sentence_length(sentences)
            percentage_complex_words = calculate_percentage_complex_words(article_text)
            fog_index = calculate_fog_index(avg_sentence_length, percentage_complex_words)
            avg_words_per_sentence = calculate_avg_words_per_sentence(words, sentences)
            complex_word_count = calculate_complex_word_count(article_text)
            word_count = words_count(article_text)
            syllable_count_per_word = calculate_syllable_count_per_word(article_text)
            personal_pronoun_count = calculate_personal_pronouns(article_text)
            avg_word_length = calculate_avg_word_length(article_text)

            results_.append({
                "URL_ID": url_id,
                "Avg_Sentence_Length": avg_sentence_length,
                "Percentage_Complex_Words": percentage_complex_words,
                "Fog_Index": fog_index,
                "Avg_Words_Per_Sentence": avg_words_per_sentence,
                "Complex_Word_Count": complex_word_count,
                "Word_Count": word_count,
                "Syllable_Count_Per_Word": syllable_count_per_word,
                "Personal_Pronoun_Count": personal_pronoun_count,
                "Avg_Word_Length": avg_word_length
            })

    # Create DataFrame from results
    df2 = pd.DataFrame(results_)

    # Save results to Excel
    df2.to_excel("Text_Analysis.xlsx", index=False)

if __name__ == "__main__":
    main()

In [14]:
text_analysis = pd.read_excel("Text_Analysis.xlsx")
text_analysis.head(10)

Unnamed: 0,URL_ID,Avg_Sentence_Length,Percentage_Complex_Words,Fog_Index,Avg_Words_Per_Sentence,Complex_Word_Count,Word_Count,Syllable_Count_Per_Word,Personal_Pronoun_Count,Avg_Word_Length
0,bctech2011,17.327684,0.740789,7.227389,17.327684,923,1929,1.968047,2,5.57059
1,bctech2012,11.019231,0.806283,4.730205,11.019231,167,391,2.057592,1,6.057592
2,bctech2013,20.171429,0.784703,8.382452,20.171429,174,473,1.944759,1,5.628895
3,bctech2014,10.377358,0.812727,4.476034,10.377358,157,388,1.996364,1,5.869091
4,bctech2015,23.793103,0.715942,9.803618,23.793103,162,434,1.868116,1,5.269565
5,bctech2016,10.113208,0.796642,4.36394,10.113208,152,364,1.927239,1,5.606343
6,bctech2017,14.034483,0.800983,5.934186,14.034483,103,290,1.899263,1,5.700246
7,bctech2018,17.96,0.781737,7.496695,17.96,117,299,1.886414,1,5.514477
8,bctech2019,21.230769,0.764493,8.798105,21.230769,117,351,1.833333,1,5.438406
9,bctech2020,28.068966,0.785012,11.541591,28.068966,263,539,2.041769,1,5.821867


In [15]:
data = pd.merge(sentiment_analysis, text_analysis, on='URL_ID')
data.head()

Unnamed: 0,URL_ID,URL,Positive_Score,Negative_Score,Polarity_Score,Subjectivity_Score,Avg_Sentence_Length,Percentage_Complex_Words,Fog_Index,Avg_Words_Per_Sentence,Complex_Word_Count,Word_Count,Syllable_Count_Per_Word,Personal_Pronoun_Count,Avg_Word_Length
0,bctech2011,https://insights.blackcoffer.com/ml-and-ai-bas...,138,45,0.508197,0.059667,17.327684,0.740789,7.227389,17.327684,923,1929,1.968047,2,5.57059
1,bctech2012,https://insights.blackcoffer.com/streamlined-i...,19,6,0.52,0.04363,11.019231,0.806283,4.730205,11.019231,167,391,2.057592,1,6.057592
2,bctech2013,https://insights.blackcoffer.com/efficient-dat...,19,10,0.310345,0.041076,20.171429,0.784703,8.382452,20.171429,174,473,1.944759,1,5.628895
3,bctech2014,https://insights.blackcoffer.com/effective-man...,13,6,0.368421,0.034545,10.377358,0.812727,4.476034,10.377358,157,388,1.996364,1,5.869091
4,bctech2015,https://insights.blackcoffer.com/streamlined-t...,17,3,0.7,0.028986,23.793103,0.715942,9.803618,23.793103,162,434,1.868116,1,5.269565


In [16]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 147 entries, 0 to 146
Data columns (total 15 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   URL_ID                    147 non-null    object 
 1   URL                       147 non-null    object 
 2   Positive_Score            147 non-null    int64  
 3   Negative_Score            147 non-null    int64  
 4   Polarity_Score            147 non-null    float64
 5   Subjectivity_Score        147 non-null    float64
 6   Avg_Sentence_Length       147 non-null    float64
 7   Percentage_Complex_Words  147 non-null    float64
 8   Fog_Index                 147 non-null    float64
 9   Avg_Words_Per_Sentence    147 non-null    float64
 10  Complex_Word_Count        147 non-null    int64  
 11  Word_Count                147 non-null    int64  
 12  Syllable_Count_Per_Word   147 non-null    float64
 13  Personal_Pronoun_Count    147 non-null    int64  
 14  Avg_Word_L

In [20]:
data.describe().T.round(3)

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Positive_Score,147.0,12.735,13.851,0.0,6.0,10.0,16.5,138.0
Negative_Score,147.0,5.81,5.887,0.0,2.0,4.0,8.0,45.0
Polarity_Score,147.0,0.362,0.363,-0.833,0.143,0.429,0.559,1.0
Subjectivity_Score,147.0,0.033,0.015,0.0,0.023,0.03,0.042,0.078
Avg_Sentence_Length,147.0,36.12,27.193,10.113,23.718,28.023,36.938,204.0
Percentage_Complex_Words,147.0,0.763,0.032,0.658,0.739,0.764,0.784,0.835
Fog_Index,147.0,14.753,10.879,4.364,9.786,11.515,15.073,81.916
Avg_Words_Per_Sentence,147.0,36.119,27.193,10.113,23.718,28.023,36.938,204.0
Complex_Word_Count,147.0,102.816,92.59,21.0,53.0,74.0,131.0,923.0
Word_Count,147.0,338.918,219.796,26.0,196.5,283.0,418.5,1929.0


In [22]:
# index=False keeps the DataFrame's row index out of the saved sheet
data.to_excel("Output Data Structure.xlsx", index=False)



**----------------------------------------------------------------------------**

**By Mrudula A P**