# Explanation of Approach:

The provided Python script performs sentiment analysis and calculates various text metrics on a dataset of web articles. Here's a breakdown of the approach:

## 1.Data Loading:
Reads input data from an Excel file containing URLs of web articles.

## 2.Loading Dictionaries and Stopwords:
Loads positive and negative word dictionaries.
Loads a collection of stopwords from text files.

## 3.Text Processing and Analysis:
Utilizes the BeautifulSoup library to scrape HTML content of each article.
Extracts article title and text content.
Cleans and tokenizes the text, removing punctuation and stopwords.
Calculates positive and negative scores, as well as polarity and subjectivity scores.
Computes various text metrics such as average sentence length, percentage of complex words, Fog Index, and more.

## 4.Result Compilation:
Stores the results for each article, including sentiment scores and text metrics, in a DataFrame.

## 5.DataFrame Operations:
Rounds numeric columns in the DataFrame for clarity.
Displays the first few rows of the DataFrame.

## 6.Output:
Saves the DataFrame to a CSV file named 'Output Data.csv.'

# Running the .py file:

To execute the code and generate the output, follow these steps:

- Ensure that Python is installed on your system.
- Install the required dependencies by running
- Download NLTK data by running the following Python script

## Dependencies:

- pandas
- requests
- beautifulsoup4
- nltk

# Importing Necessary Libraries and Downloading NLTK resources

In [35]:
import pandas as pd
import os
import re
import requests
from bs4 import BeautifulSoup
import string
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import nltk
# Download NLTK resources
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\DELL\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\DELL\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

# Setting path and Reading Input Data

In [36]:
path = r"C:\Users\DELL\Downloads" 

# Reading input data from Excel
input_df = pd.read_excel(f"{path}\Input.xlsx")

# Loading positive,negative and stop words

In [37]:
# Loading positive and negative words
PSF = rf"{path}\MasterDictionary\MasterDictionary\positive-words.txt"
with open(PSF, 'r') as pos:
    positivewords = pos.read().split('\n')

NSF = rf"{path}\MasterDictionary\MasterDictionary\negative-words.txt"
with open(NSF, 'r', encoding="ISO-8859-1") as neg:
    negativewords = neg.read().split('\n')

# Loading stopwords
stopwords_files = f"{path}\ST\StopWords"
stopwords_list = []
for filename in os.listdir(stopwords_files):
    if filename.endswith('.txt'):
        with open(os.path.join(stopwords_files, filename), 'r', encoding='ISO-8859-1') as file:
            for line in file:
                words = line.split('|')[0].strip()
                stopwords_list.append(words)

stopwords_list = list(set(stopwords_list))
stopwords_list.sort()

# Data Processing and Analysis

In [38]:
# Creating an empty list to store the result data for each article.
result_data = []

# Looping through each row in the input dataset
for index, row in input_df.iterrows():
    article_url = row['URL']

    # Sending a HTTP GET request to the article URL and parse the HTML content.
    response = requests.get(article_url)
    soup = BeautifulSoup(response.content, 'html.parser')

    # Extracting the article title from the HTML.
    article_title = soup.find('h1')
    if article_title:
        article_title_text = article_title.text
    else:
        article_title_text = "Title not found"

    # Extracting the article text content from the HTML.
    article_text_elements = soup.findAll(attrs={'class': 'td-post-content'})
    if article_text_elements:
        article_text = article_text_elements[0].text.replace('\n', " ")
    else:
        article_text = "Content not found"

    # Removing punctuation from the article text.
    article_text1 = article_text.translate(str.maketrans('', '', string.punctuation))

    # Tokenizing the cleaned article text into words.
    text_tokens = word_tokenize(article_text1)

    # Removing stopwords from the tokenized text.
    nostopwords_tokens = [word for word in text_tokens if not word in stopwords_list]

    # Calculating the number of words in the text.
    words_count = len(nostopwords_tokens)

    # Calculating positive and negative scores using dictionary.
    Positive_Score = sum(1 for word in nostopwords_tokens if word in positivewords)
    Negative_Score = sum(1 for word in nostopwords_tokens if word in negativewords)

    # Calculating Polarity and Subjectivity Scores.
    Polarity_Score = (Positive_Score - Negative_Score) / ((Positive_Score + Negative_Score) + 0.000001)
    Subjectivity_Score = (Positive_Score + Negative_Score) / ((words_count) + 0.000001)
    def cal_avg_sentence_length(text):
        total_characters = len(re.sub(r'\s', '', text))
        total_sentences = len(re.split(r'[?!.]', text))
        avg_sentence_length = total_characters / total_sentences if total_sentences > 0 else 0
        return avg_sentence_length
    Average_Sentence_Length = cal_avg_sentence_length(article_text)
      # Define functions to calculate average number of words per sentence.
    def cal_avg_number_of_words_per_sentence(text):
        sentences = re.split(r'[.!?]', text)
        avg_number_of_words_per_sentence = sum(len(sentence.split()) for sentence in sentences) / len(sentences) if sentences else 0
        return avg_number_of_words_per_sentence
    Average_Number_Of_Words_Per_Sentence = cal_avg_number_of_words_per_sentence(article_text)
      # Define functions to calculate average number of words per sentence.
    def cal_avg_number_of_words_per_sentence(text):
        sentences = re.split(r'[.!?]', text)
        avg_number_of_words_per_sentence = sum(len(sentence.split()) for sentence in sentences) / len(sentences) if sentences else 0
        return avg_number_of_words_per_sentence
    Average_Number_Of_Words_Per_Sentence = cal_avg_number_of_words_per_sentence(article_text)

  # Define a function to count syllables in a word.
    def count_syllables(word):
        vowels = "AEIOUaeiou"
        syllable_count = 0
        prev_char_was_vowel = False
        for char in word:
            if char in vowels:
                if not prev_char_was_vowel:
                    syllable_count += 1
                    prev_char_was_vowel = True
                else:
                    prev_char_was_vowel = False
        if word.endswith(('e', 'E')) and not word.endswith(('ed', 'es')):    # Adjustment for words ending with "ed" and "es"
            syllable_count -= 1
        syllable_count = max(1, syllable_count)
        return syllable_count
    words = re.findall(r'\b\w+\b', article_text)

  # Counting the number of complex words based on syllable count.
    Complex_Word_Count = sum(1 for word in words if count_syllables(word) > 2)

  # Calculating the average syllables per word based on syllable count.
    syllable_counts = [count_syllables(word) for word in words]
    Syllable_Per_Word = sum(syllable_counts)/len(syllable_counts)
  # Calculating the word count after removing stopwords using stopwords class of nltk package.
    stop_words = stopwords.words('english')
    word_count = [word for word in text_tokens if not word in stop_words]
    Word_Count = len(word_count)

  # Calculating the percentage of complex words.
    Percentage_of_Complex_Words = (Complex_Word_Count / Word_Count) * 100

  # Calculating the Fog Index.
    Fog_Index = 0.4 * (Average_Sentence_Length + Percentage_of_Complex_Words)

  # Count personal pronouns in the article text.
    pronouns = r'\b(?:I|we|my|ours|us)\b'
    pronoun_matches = re.findall(pronouns, article_text, flags=re.IGNORECASE)
    filtered_pronouns = [pronoun for pronoun in pronoun_matches if pronoun.lower() != "us"]
    Personal_Pronouns = len(filtered_pronouns)

  # Calculate the average word length.
    total_characters = sum(len(word) for word in words)
    total_words = len(article_text.split())
    Average_Word_Length = total_characters / total_words if total_words > 0 else 0

    # Extract URL_ID and URL from the current row and append the result data to the list.
    url_id = row['URL_ID']
    url = row['URL']
    result_data.append({
        'URL_ID': url_id,
        'URL': url,
        'POSITIVE SCORE': Positive_Score,
        'NEGATIVE SCORE': Negative_Score,
        'POLARITY SCORE': Polarity_Score,
        'SUBJECTIVITY SCORE': Subjectivity_Score,
        'AVG SENTENCE LENGTH': Average_Sentence_Length,
        'PERCENTAGE OF COMPLEX WORDS': Percentage_of_Complex_Words,
        'FOG INDEX': Fog_Index,
        'AVG NUMBER OF WORDS PER SENTENCE': Average_Number_Of_Words_Per_Sentence,
        'COMPLEX WORD COUNT': Complex_Word_Count,
        'WORD COUNT': Word_Count,
        'SYLLABLE PER WORD': Syllable_Per_Word,
        'PERSONAL PRONOUNS': Personal_Pronouns,
        'AVG WORD LENGTH': Average_Word_Length
    })


# Rounds numeric columns and Displays the first few rows 

In [39]:
# Create a DataFrame from the result data.
results_df = pd.DataFrame(result_data)

# Round numeric columns
for column in results_df.columns:
    if pd.api.types.is_numeric_dtype(results_df[column]):
        results_df[column] = results_df[column].round(3)

# Display output 

In [54]:
# Display the first few rows of the DataFrame
results_df.head()

Unnamed: 0,URL_ID,URL,POSITIVE SCORE,NEGATIVE SCORE,POLARITY SCORE,SUBJECTIVITY SCORE,AVG SENTENCE LENGTH,PERCENTAGE OF COMPLEX WORDS,FOG INDEX,AVG NUMBER OF WORDS PER SENTENCE,COMPLEX WORD COUNT,WORD COUNT,SYLLABLE PER WORD,PERSONAL PRONOUNS,AVG WORD LENGTH
0,blackassign0001,https://insights.blackcoffer.com/rising-it-cit...,33,5,0.737,0.061,72.772,2.561,30.133,15.468,19,742,1.19,11,4.582
1,blackassign0002,https://insights.blackcoffer.com/rising-it-cit...,56,31,0.287,0.103,100.402,7.104,43.003,17.841,66,929,1.332,3,5.478
2,blackassign0003,https://insights.blackcoffer.com/internet-dema...,37,23,0.233,0.089,117.351,12.43,51.912,18.702,89,716,1.414,12,6.117
3,blackassign0004,https://insights.blackcoffer.com/rise-of-cyber...,35,71,-0.34,0.164,124.288,8.616,53.162,20.269,61,708,1.362,5,5.972
4,blackassign0005,https://insights.blackcoffer.com/ott-platform-...,21,8,0.448,0.078,96.775,7.745,41.808,17.05,34,439,1.29,6,5.543


# Save as CSV file

In [55]:
# Save the DataFrame to a CSV file
results_df.to_csv('Output Data.csv', index=False)
print("CSV File Saved")

CSV File Saved


# Final Executable Code

In [56]:
import pandas as pd
import os
import re
import requests
from bs4 import BeautifulSoup
import string
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import nltk

nltk.download('punkt')
nltk.download('stopwords')

path = r"C:\Users\DELL\Downloads"  

# Read input data from Excel
input_df = pd.read_excel(f"{path}\Input.xlsx")

# Load positive and negative words
PSF = rf"{path}\MasterDictionary\MasterDictionary\positive-words.txt"
with open(PSF, 'r') as pos:
    positivewords = pos.read().split('\n')

NSF = rf"{path}\MasterDictionary\MasterDictionary\negative-words.txt"
with open(NSF, 'r', encoding="ISO-8859-1") as neg:
    negativewords = neg.read().split('\n')

# Load stopwords
stopwords_files = f"{path}\ST\StopWords"
stopwords_list = []
for filename in os.listdir(stopwords_files):
    if filename.endswith('.txt'):
        with open(os.path.join(stopwords_files, filename), 'r', encoding='ISO-8859-1') as file:
            for line in file:
                words = line.split('|')[0].strip()
                stopwords_list.append(words)

stopwords_list = list(set(stopwords_list))
stopwords_list.sort()

# Create an empty list to store the result data for each article.
result_data = []

# Loop through each row in the input dataset
for index, row in input_df.iterrows():
    article_url = row['URL']

    # Send an HTTP GET request to the article URL and parse the HTML content.
    response = requests.get(article_url)
    soup = BeautifulSoup(response.content, 'html.parser')

    # Extract the article title from the HTML.
    article_title = soup.find('h1')
    if article_title:
        article_title_text = article_title.text
    else:
        article_title_text = "Title not found"

    # Extract the article text content from the HTML.
    article_text_elements = soup.findAll(attrs={'class': 'td-post-content'})
    if article_text_elements:
        article_text = article_text_elements[0].text.replace('\n', " ")
    else:
        article_text = "Content not found"

    # Remove punctuation from the article text.
    article_text1 = article_text.translate(str.maketrans('', '', string.punctuation))

    # Tokenize the cleaned article text into words.
    text_tokens = word_tokenize(article_text1)

    # Remove stopwords from the tokenized text.
    nostopwords_tokens = [word for word in text_tokens if not word in stopwords_list]

    # Calculate the number of words in the text.
    words_count = len(nostopwords_tokens)

    # Calculate positive and negative scores using dictionary.
    Positive_Score = sum(1 for word in nostopwords_tokens if word in positivewords)
    Negative_Score = sum(1 for word in nostopwords_tokens if word in negativewords)

    # Calculate Polarity and Subjectivity Scores.
    Polarity_Score = (Positive_Score - Negative_Score) / ((Positive_Score + Negative_Score) + 0.000001)
    Subjectivity_Score = (Positive_Score + Negative_Score) / ((words_count) + 0.000001)
    def cal_avg_sentence_length(text):
        total_characters = len(re.sub(r'\s', '', text))
        total_sentences = len(re.split(r'[?!.]', text))
        avg_sentence_length = total_characters / total_sentences if total_sentences > 0 else 0
        return avg_sentence_length
    Average_Sentence_Length = cal_avg_sentence_length(article_text)
      # Define functions to calculate average number of words per sentence.
    def cal_avg_number_of_words_per_sentence(text):
        sentences = re.split(r'[.!?]', text)
        avg_number_of_words_per_sentence = sum(len(sentence.split()) for sentence in sentences) / len(sentences) if sentences else 0
        return avg_number_of_words_per_sentence
    Average_Number_Of_Words_Per_Sentence = cal_avg_number_of_words_per_sentence(article_text)
      # Define functions to calculate average number of words per sentence.
    def cal_avg_number_of_words_per_sentence(text):
        sentences = re.split(r'[.!?]', text)
        avg_number_of_words_per_sentence = sum(len(sentence.split()) for sentence in sentences) / len(sentences) if sentences else 0
        return avg_number_of_words_per_sentence
    Average_Number_Of_Words_Per_Sentence = cal_avg_number_of_words_per_sentence(article_text)

  # Define a function to count syllables in a word.
    def count_syllables(word):
        vowels = "AEIOUaeiou"
        syllable_count = 0
        prev_char_was_vowel = False
        for char in word:
            if char in vowels:
                if not prev_char_was_vowel:
                    syllable_count += 1
                    prev_char_was_vowel = True
                else:
                    prev_char_was_vowel = False
        if word.endswith(('e', 'E')) and not word.endswith(('ed', 'es')):    # Adjustment for words ending with "ed" and "es"
            syllable_count -= 1
        syllable_count = max(1, syllable_count)
        return syllable_count
    words = re.findall(r'\b\w+\b', article_text)

  # Count the number of complex words based on syllable count.
    Complex_Word_Count = sum(1 for word in words if count_syllables(word) > 2)

  # Calculate the average syllables per word based on syllable count.
    syllable_counts = [count_syllables(word) for word in words]
    Syllable_Per_Word = sum(syllable_counts)/len(syllable_counts)
  # Calculate the word count after removing stopwords using stopwords class of nltk package.
    stop_words = stopwords.words('english')
    word_count = [word for word in text_tokens if not word in stop_words]
    Word_Count = len(word_count)

  # Calculate the percentage of complex words.
    Percentage_of_Complex_Words = (Complex_Word_Count / Word_Count) * 100

  # Calculate the Fog Index.
    Fog_Index = 0.4 * (Average_Sentence_Length + Percentage_of_Complex_Words)

  # Count personal pronouns in the article text.
    pronouns = r'\b(?:I|we|my|ours|us)\b'
    pronoun_matches = re.findall(pronouns, article_text, flags=re.IGNORECASE)
    filtered_pronouns = [pronoun for pronoun in pronoun_matches if pronoun.lower() != "us"]
    Personal_Pronouns = len(filtered_pronouns)

  # Calculate the average word length.
    total_characters = sum(len(word) for word in words)
    total_words = len(article_text.split())
    Average_Word_Length = total_characters / total_words if total_words > 0 else 0

    # Extract URL_ID and URL from the current row and append the result data to the list.
    url_id = row['URL_ID']
    url = row['URL']
    result_data.append({
        'URL_ID': url_id,
        'URL': url,
        'POSITIVE SCORE': Positive_Score,
        'NEGATIVE SCORE': Negative_Score,
        'POLARITY SCORE': Polarity_Score,
        'SUBJECTIVITY SCORE': Subjectivity_Score,
        'AVG SENTENCE LENGTH': Average_Sentence_Length,
        'PERCENTAGE OF COMPLEX WORDS': Percentage_of_Complex_Words,
        'FOG INDEX': Fog_Index,
        'AVG NUMBER OF WORDS PER SENTENCE': Average_Number_Of_Words_Per_Sentence,
        'COMPLEX WORD COUNT': Complex_Word_Count,
        'WORD COUNT': Word_Count,
        'SYLLABLE PER WORD': Syllable_Per_Word,
        'PERSONAL PRONOUNS': Personal_Pronouns,
        'AVG WORD LENGTH': Average_Word_Length
    })

# Create a DataFrame from the result data.
results_df = pd.DataFrame(result_data)

# Round numeric columns
for column in results_df.columns:
    if pd.api.types.is_numeric_dtype(results_df[column]):
        results_df[column] = results_df[column].round(3)

# Display the first few rows of the DataFrame
results_df.head()

# Save the DataFrame to a CSV file
results_df.to_csv('Output Data.csv', index=False)


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\DELL\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\DELL\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
