<center>
    <h1>Blackcoffer</h1>
    <h4>Submitted by: Prachi Nikalje</h4>
</center>

<html><body><h2>1. Setup and Initialization</h2>
    <p>Begin by importing necessary libraries and initializing any variables or settings needed for the conversion process.</p>

    <h2>2. Data Retrieval and Preprocessing</h2>
    <ul>
        <li>Iterate through the dataset or data source to retrieve URLs.</li>
        <li>Attempt to fetch text content from each URL using the <code>extract_text_from_url()</code> function.</li>
        <li>If fetching fails, print an error message and drop the corresponding row from the DataFrame.</li>
    </ul>

    <h2>3. Text Processing and Metric Calculation</h2>
    <ul>
        <li>If text content is successfully fetched, process it using <code>text_process()</code> function.</li>
        <li>Tokenize the processed text into words.</li>
        <li>Calculate various metrics based on the text content, including positive and negative scores, polarity and subjectivity scores, average sentence length, percentage of complex words, Fog index, etc.</li>
    </ul>

    <h2>4. Save the ouput into .csv file<h2>
</body>
</html>

In [1]:
import pandas as pd
import requests
from bs4 import BeautifulSoup
import numpy as np
import nltk
import string
from nltk.tokenize import word_tokenize

# Download NLTK resources
nltk.download('punkt')

[nltk_data] Downloading package punkt to /home/prachi/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [14]:
#importing input file
df = pd.read_excel('Input.xlsx', usecols=['URL_ID', 'URL'])

In [15]:
df.shape

(100, 2)

In [16]:
df.head(10)

Unnamed: 0,URL_ID,URL
0,blackassign0001,https://insights.blackcoffer.com/rising-it-cit...
1,blackassign0002,https://insights.blackcoffer.com/rising-it-cit...
2,blackassign0003,https://insights.blackcoffer.com/internet-dema...
3,blackassign0004,https://insights.blackcoffer.com/rise-of-cyber...
4,blackassign0005,https://insights.blackcoffer.com/ott-platform-...
5,blackassign0006,https://insights.blackcoffer.com/the-rise-of-t...
6,blackassign0007,https://insights.blackcoffer.com/rise-of-cyber...
7,blackassign0008,https://insights.blackcoffer.com/rise-of-inter...
8,blackassign0009,https://insights.blackcoffer.com/rise-of-cyber...
9,blackassign0010,https://insights.blackcoffer.com/rise-of-cyber...


In [17]:
def extract_text_from_url(url, retries=3, backoff_factor=0.3):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'}
    
    for attempt in range(retries):
        try:
            page = requests.get(url, headers=headers)
            page.raise_for_status()  # Raises an HTTPError for bad responses
            soup = BeautifulSoup(page.content, 'html.parser')

            # Extract the content
            content_tags = soup.find_all(attrs={'class': 'td-post-content'})
            content = ' '.join([tag.get_text(strip=True) for tag in content_tags])

            # Clean the text
            content = content.replace('\xa0', ' ').replace('\n', ' ')

            return content
        except requests.exceptions.RequestException as e:
            print(f"Attempt {attempt + 1} failed: {e}")
            time.sleep(backoff_factor * (2 ** attempt))  # Exponential backoff
            
    print(f"Failed to fetch {url} after {retries} attempts")
    return None

In [18]:
# Custom function for text processing
def text_process(text):
    nopunc = [char for char in text if char not in string.punctuation]
    nopunc = ''.join(nopunc)
    return ' '.join([word.lower() for word in nopunc.split() if word.lower() not in stopwords])


In [19]:
# Load stopwords and dictionaries
stopwords_files = [
    "StopWords/StopWords_Auditor.txt",
    "StopWords/StopWords_Currencies.txt",
    "StopWords/StopWords_DatesandNumbers.txt",
    "StopWords/StopWords_Generic.txt",
    "StopWords/StopWords_GenericLong.txt",
    "StopWords/StopWords_Geographic.txt",
    "StopWords/StopWords_Names.txt"
]

stopwords = set()
for file in stopwords_files:
    with open(file, 'r', encoding='ISO-8859-1') as f:
        stopwords.update(f.read().split())


In [20]:
positive_words = set(open("MasterDictionary/positive-words.txt", 'r').read().split())
negative_words = set(open("MasterDictionary/negative-words.txt", 'r', encoding="ISO-8859-1").read().split())


In [21]:
# Helper function to count syllables
def syllable_count(word):
    vowels = 'aeiou'
    word = word.lower()
    count = 0
    if word[0] in vowels:
        count += 1
    for index in range(1, len(word)):
        if word[index] in vowels and word[index - 1] not in vowels:
            count += 1
    if word.endswith("es") or word.endswith("ed"):
        count -= 1
    return max(1, count)

In [22]:
# Initialize lists to store scores for each URL
positive_scores = []
negative_scores = []
polarity_scores = []
subjectivity_scores = []
avg_sentence_lengths = []
percentage_of_complex_wo = []
fog_indices = []
avg_words_per_sent = []
complex_word_counts = []
word_counts = []
syllable_counts = []
personal_pronouns = []
avg_word_lengths = []

In [23]:
# Personal pronouns list
personal_pronouns_list = ['i', 'we', 'my', 'ours', 'us']

In [24]:
# Iterate over each URL
cnt = 1
for i, row in df.iterrows():
    print("url_id: ", cnt)
    url = row['URL']
    try:
        text = extract_text_from_url(url)
        if text:
            # Process the extracted text
            processed_text = text_process(text)
            
            # Tokenize the processed text
            tokens = word_tokenize(processed_text)
            
            # Calculate metrics
            positive_score_url = sum(1 for word in tokens if word in positive_words)
            negative_score_url = sum(1 for word in tokens if word in negative_words)
            polarity_score_url = (positive_score_url - negative_score_url) / (positive_score_url + negative_score_url + 1e-6)
            subjectivity_score_url = (positive_score_url + negative_score_url) / (len(tokens) + 1e-6)
            
            sentences = processed_text.split('.')
            avg_sentence_length = len(tokens) / len(sentences)
            
            complex_words = [word for word in tokens if len(word) > 2 and syllable_count(word) > 2]
            percentage_of_complex_words = len(complex_words) / len(tokens)
            
            fog_index = 0.4 * (avg_sentence_length + percentage_of_complex_words)
            
            avg_words_per_sentence = len(tokens) / len(sentences)
            
            complex_word_count_url = len(complex_words)
            word_count = len(tokens)
            
            syllable_count_total = sum(syllable_count(word) for word in tokens)
            
            personal_pronoun_count = sum(1 for word in tokens if word in personal_pronouns_list)
            
            avg_word_length = sum(len(word) for word in tokens) / len(tokens)
            
            # Append scores to respective lists
            positive_scores.append(positive_score_url)
            negative_scores.append(negative_score_url)
            polarity_scores.append(polarity_score_url)
            subjectivity_scores.append(subjectivity_score_url)
            avg_sentence_lengths.append(avg_sentence_length)
            percentage_of_complex_wo.append(percentage_of_complex_words)
            fog_indices.append(fog_index)
            avg_words_per_sent.append(avg_words_per_sentence)
            complex_word_counts.append(complex_word_count_url)
            word_counts.append(word_count)
            syllable_counts.append(syllable_count_total)
            personal_pronouns.append(personal_pronoun_count)
            avg_word_lengths.append(avg_word_length)
    except Exception:
        
        print("Deleted the url.",url)
        # Drop the row where URL matches the problematic URL
        df.drop(df.index[df['URL'] == url], inplace=True)
    cnt+=1


url_id:  1
url_id:  2
url_id:  3
url_id:  4
url_id:  5
url_id:  6
url_id:  7
url_id:  8
url_id:  9
url_id:  10
url_id:  11
url_id:  12
url_id:  13
url_id:  14
url_id:  15
url_id:  16
url_id:  17
url_id:  18
url_id:  19
url_id:  20
url_id:  21
url_id:  22
url_id:  23
url_id:  24
url_id:  25
url_id:  26
url_id:  27
url_id:  28
url_id:  29
url_id:  30
url_id:  31
url_id:  32
url_id:  33
url_id:  34
url_id:  35
url_id:  36
Attempt 1 failed: 404 Client Error: Not Found for url: https://insights.blackcoffer.com/how-neural-networks-can-be-applied-in-various-areas-in-the-future/
Deleted the url. https://insights.blackcoffer.com/how-neural-networks-can-be-applied-in-various-areas-in-the-future/
url_id:  37
url_id:  38
url_id:  39
url_id:  40
url_id:  41
url_id:  42
url_id:  43
url_id:  44
url_id:  45
url_id:  46
url_id:  47
url_id:  48
url_id:  49
Attempt 1 failed: 404 Client Error: Not Found for url: https://insights.blackcoffer.com/covid-19-environmental-impact-for-the-future/
Deleted the url

In [None]:
# Create a DataFrame to store the scores
scores_df = pd.DataFrame({
    'URL_ID':df['URL_ID'],
    'URL': df['URL'],
    'Positive Score': positive_scores,
    'Negative Score': negative_scores,
    'Polarity Score': polarity_scores,
    'Subjectivity Score': subjectivity_scores,
    'Average Sentence Length': avg_sentence_lengths,
    'Percentage of Complex Words': percentage_of_complex_words,
    'Fog Index': fog_indices,
    'Average Words per Sentence': avg_words_per_sentence,
    'Complex Word Count': complex_word_counts,
    'Word Count': word_counts,
    'Syllable Count': syllable_counts,
    'Personal Pronouns': personal_pronouns,
    'Average Word Length': avg_word_lengths
})


In [None]:
# Display the DataFrame
print(scores_df)

# Save the DataFrame to a CSV file
scores_df.to_csv('output_scores.csv', index=False)


# END