# Project: What's The Hot Topic In Town? - kelvin.ahiakpor & emmanuel.acquaye

This notebook addresses Phases 2 and 3 of the What's The Hot Topic In Town? project: **Display News** and **Social Sentiment Analysis**, respectively.

### Natural Language Processing

# Phase 2           
Display News

The self-created rubric in our repository describes the requirement for this phase as follows.   
**Description:** Show the top 10 news headlines with one-paragraph summaries and allow the user to select and
read desired articles.

### Repository Link 

Here is a link to our repository:

[What's The Hot Topic In Town?](https://github.com/kelvin-ahiakpor/Whats.The.Hot.Topic.In.Town)

### Imports

In [1]:
import re
import pandas as pd
import ipywidgets as widgets 

from collections import Counter
from transformers import pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

**Load the default sentiment analysis pipeline**   
`distilbert-base-uncased-finetuned-sst-2-english`

In [2]:
sentiment_analyzer = pipeline("sentiment-analysis", 
                              model="distilbert/distilbert-base-uncased-finetuned-sst-2-english")

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


**Widget creation**

In [3]:
upload_widget = widgets.FileUpload(multiple=False)
upload_button = widgets.Button(description="Process Upload")

### Upload a news data csv

The following functions allow file uploads in both Jupyter and Google Colab.

**Note:** This is for demonstration purposes only.  
In our deployed app this process is automated: after scraping with Scrapy, the news data CSV is saved to a SQLite database and then loaded back into the application, which performs natural language processing and identifies the top 10 news articles.
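The save-and-reload step described in the note can be sketched with the standard library alone. This is a minimal illustration, not the deployed app's code: the table name `news`, its column names, the sample rows, and the helpers `save_rows` and `load_rows` are all hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical sample standing in for the scraped CSV; headers follow the notebook's data.
SAMPLE_CSV = """TITLE,COUNTRY,BODY,Website Link
Headline one,zimbabwe,Body of the first article.,https://example.com/one
Headline two,zimbabwe,Body of the second article.,https://example.com/two
"""


def save_rows(conn, rows):
    """Create the (assumed) news table if needed and insert the scraped rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS news (title TEXT, country TEXT, body TEXT, link TEXT)"
    )
    conn.executemany(
        "INSERT INTO news (title, country, body, link) VALUES (?, ?, ?, ?)",
        [(r["TITLE"], r["COUNTRY"], r["BODY"], r["Website Link"]) for r in rows],
    )
    conn.commit()


def load_rows(conn):
    """Read the stored articles back in insertion order."""
    return conn.execute(
        "SELECT title, country, body, link FROM news ORDER BY rowid"
    ).fetchall()


conn = sqlite3.connect(":memory:")  # in-memory database, for the demo only
save_rows(conn, list(csv.DictReader(io.StringIO(SAMPLE_CSV))))
stored = load_rows(conn)
print(len(stored), "articles stored")
```

In the notebook itself the upload widget below replaces this step, so the sketch is only meant to show the shape of the round trip.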

**Global variables**  
Dictionary of upload information  
File path  
Upload widget  
Upload status

In [4]:
upload_info = None
file_path = None

In [5]:
upload_completed = False

In [6]:
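def validate_file_upload(uploaded_files):
    # The widget uses multiple=False, so at most one file is expected:
    # save the first file found and return its name
    for filename, file_info in uploaded_files.items():
        save_file(filename, file_info['content'])
        return filename
    return None  # nothing was uploaded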

In [7]:
def save_file(filename, file_content):
    with open(filename, 'wb') as f:
        f.write(file_content)
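`validate_file_upload` calls `.items()` on the widget value, which matches the dict that ipywidgets 7 returns (the `value={}` in the output below suggests that version). ipywidgets 8 returns a tuple of dicts instead, each carrying a `name` key. The sketch below shows one way to handle both shapes; `iter_uploads` is a hypothetical helper, not part of the notebook, and the mock values stand in for real widget values.

```python
def iter_uploads(value):
    """Yield (filename, content_bytes) pairs from a FileUpload value.

    Assumption: ipywidgets 7 exposes a dict keyed by filename, whereas
    ipywidgets 8 exposes a tuple of dicts that carry a 'name' key.
    """
    if isinstance(value, dict):
        for filename, file_info in value.items():
            yield filename, bytes(file_info["content"])
    else:
        for file_info in value:
            yield file_info["name"], bytes(file_info["content"])


# Mock values shaped like the two formats (no widget needed for the demo)
v7_value = {"newsData.csv": {"content": b"TITLE,BODY\n"}}
v8_value = ({"name": "newsData.csv", "content": memoryview(b"TITLE,BODY\n")},)

print(list(iter_uploads(v7_value)))
print(list(iter_uploads(v8_value)))
```

Converting the content with `bytes()` keeps the result uniform, since the newer format is expected to hand back a `memoryview` rather than `bytes`.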

In [8]:
def process_files(button):
    global upload_completed
    global file_path
    uploaded_files = upload_widget.value
    # Handle file validation and save
    file_path = validate_file_upload(uploaded_files)
    if file_path:
        print(f"File path set to: {file_path}")
        upload_completed = True  # Mark the upload as complete
    else:
        print("No valid file uploaded.")
    
    if upload_completed:
        print()
        print("Upload complete. You can now proceed to the run the next cells.")

In [9]:
def show_upload_widget():
    global upload_button
    upload_button = widgets.Button(description="Upload File")
    upload_button.on_click(process_files)
    
    # Display the widgets
    display(upload_widget)
    display(upload_button)
    
    print("Please upload a file and then click the 'Upload File' button.")

In [10]:
show_upload_widget()

FileUpload(value={}, description='Upload')

Button(description='Upload File', style=ButtonStyle())

Please upload a file and then click the 'Upload File' button.
File path set to: newsData.csv

Upload complete. You can now proceed to the run the next cells.


### Load the uploaded news data file

In [11]:
if upload_completed:
    newsdata = pd.read_csv(file_path)
    display(newsdata)
else:
    print("Upload a file first and then proceed to load the data.")

| Unnamed: 0 | TITLE | COUNTRY | BODY | Website Link |
|---|---|---|---|---|
| 0 | Zimbabwean in South Africa Challenges Exorbita... | zimbabwe | A Zimbabwean resident in South Africa, Takudzw... | https://www.newzimbabwe.com/zimbabwean-in-sout... |
| 1 | American Star Tiffany Haddish Gives Zimbabwe a... | zimbabwe | VISITING American comedienne, Tiffany Haddish,... | https://www.herald.co.zw/american-star-tiffany... |
| 2 | Insect Farming New Frontier for Smallholder Li... | zimbabwe | Zimbabwe, just like many other countries, is l... | https://www.herald.co.zw/insect-farming-new-fr... |
| 3 | Zanu-PF, Swapo in Key Engagement | zimbabwe | ZANU PF and SWAPO of Namibia share common bond... | https://www.herald.co.zw/zanu-pf-swapo-in-key-... |
| 4 |  | zimbabwe |  |  |
| 5 | Zanu-PF Official in Court Over $4 Million Frau... | zimbabwe | A senior official from Zimbabwe's ruling Zanu ... | https://www.263chat.com/zanu-pf-official-in-co... |
| 6 | Despite Hostility, LGBTQI+ Activists in Zimbab... | zimbabwe | Munich, Germany — LGBTQI+ people have long be... |  |
| 7 | Zim Swimmer Weisthuizen Out of the Olympics | zimbabwe | ZIMBABWE swimmer Paige Van Der Weisthuizen was... | https://www.newzimbabwe.com/zim-swimmer-weisth... |
| 8 | Dynamos' Mangombe in a Fix | zimbabwe | Castle Lager Premiership giants, Dynamos head ... | https://www.herald.co.zw/dynamos-mangombe-in-a... |
| 9 | UN Tourism Delegation Visits First Lady | zimbabwe | Victoria Falls — United Nations Tourism Secret... | https://www.herald.co.zw/un-tourism-delegation... |



In [13]:
newsdata = newsdata.dropna()
newsdata_body = newsdata['BODY']

**Function to split text into sentences**

In [14]:
def split_text_by_sentences(text):
    sentences = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s', text)
    return sentences
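A short demonstration of how this pattern behaves may help. The sample text is made up; the function is repeated here only so the snippet runs on its own.

```python
import re


def split_text_by_sentences(text):
    # Same pattern as in the cell above
    return re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s', text)


sample = "Mr. Moyo arrived in Harare. Was the U.S. delegation late? Talks resume on Monday."
for sentence in split_text_by_sentences(sample):
    print(sentence)

# An exclamation mark is not treated as a sentence boundary by this pattern
print(split_text_by_sentences("Great news! More soon."))
```

The two negative lookbehinds keep titles such as "Mr." and dotted abbreviations such as "U.S." from ending a sentence, while the positive lookbehind only accepts a full stop or question mark before the whitespace.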

**Preprocess the data by converting it to lower case** 

In [15]:
def preprocess(text):
    text = text.lower()  # Lowercase
    # Add more preprocessing steps if needed
    return text

In [16]:
# assign() returns a new DataFrame, which avoids the SettingWithCopyWarning
# raised when writing into the result of dropna()
newsdata = newsdata.assign(cleaned_body=newsdata['BODY'].apply(preprocess))


### Use TF-IDF to retrieve most important terms in the news data and select top 10 news articles
`TfidfVectorizer` is used to transform the article bodies into numeric form and to find the top terms by their summed TF-IDF scores. Occurrences of these terms are then counted in each article, and the articles are ranked by their total number of occurrences.

**Vectorize the text**

In [17]:
vectorizer = TfidfVectorizer(stop_words=list(ENGLISH_STOP_WORDS), max_features=1000)
X = vectorizer.fit_transform(newsdata['cleaned_body'])

**Sum the TF-IDF scores of each term across all articles**

In [18]:
feature_names = vectorizer.get_feature_names_out()
tfidf_scores = X.sum(axis=0).A1
term_scores = dict(zip(feature_names, tfidf_scores))
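To make the column sum concrete, here is a hand-rolled sketch on three made-up documents. It assumes the smoothed IDF formula and per-document L2 normalisation that scikit-learn applies by default; it omits stop-word removal and `max_features`, so it is an approximation of the idea rather than a replacement for `TfidfVectorizer`.

```python
import math
from collections import Counter

# Made-up documents, already lower-cased and whitespace-tokenisable
docs = [
    "sadc summit harare",
    "tourism harare growth",
    "sadc trade sadc",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term appears
df = Counter(term for tokens in tokenized for term in set(tokens))
# Smoothed inverse document frequency (assumed to match scikit-learn's default)
idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}

term_scores = Counter()
for tokens in tokenized:
    weights = {term: tf * idf[term] for term, tf in Counter(tokens).items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))  # L2-normalise each document
    for term, w in weights.items():
        term_scores[term] += w / norm  # equivalent to summing one matrix column

for term, score in term_scores.most_common(3):
    print(f"{term}: {score:.3f}")
```

A term that is frequent within documents and appears in several of them, like "sadc" here, accumulates the largest total, which is the property the notebook relies on when picking top terms.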

**Select the 20 highest-scoring terms by TF-IDF, which are then used to pick the top 10 articles**

In [19]:
top_terms = Counter(term_scores).most_common(20)

In [20]:
print("Top Relevant Terms:")
for term, score in top_terms:
    print(f"{term}: {score}")

article_scores = []

# Calculate relevance score for each article based on top terms
for index, row in newsdata.iterrows():
    relevance_score = 0
    for term, _ in top_terms:
        relevance_score += row['cleaned_body'].count(term)  # Count occurrences of the term
    article_scores.append({'TITLE': row['TITLE'],
                           'URL': row['Website Link'],
                           'BODY':row['BODY'],
                           'Relevance': relevance_score,
                           })

Top Relevant Terms:
said: 1.617239418765897
sadc: 1.5866304053154328
zimbabwe: 1.5387162220256676
africa: 1.0980387025610683
investment: 1.0527951933568427
tourism: 1.0083783864340194
harare: 0.9963720158183158
country: 0.9404634824655383
police: 0.9276228224586293
million: 0.9266161434507789
region: 0.9027383259263309
season: 0.8665253037480395
mr: 0.8141305710770922
president: 0.7944616990579325
government: 0.742501399340561
year: 0.7418823191337467
percent: 0.7307642736221417
international: 0.7276504583290118
mnangagwa: 0.7108881419472728
economic: 0.707578804741229


In [21]:
# Convert list to DataFrame
article_scores_df = pd.DataFrame(article_scores)

# Sort articles by relevance score and get top 10
top_10_articles = article_scores_df.sort_values(by='Relevance', ascending=False).head(10)

In [22]:
top_10_articles

| Unnamed: 0 | TITLE | URL | BODY | Relevance |
|---|---|---|---|---|
| 18 | Afreximbank Earmarks U.S.$400m for Zim Firms | https://www.herald.co.zw/afreximbank-earmarks-... | The African Export-Import Bank (Afreximbank) h... | 136 |
| 8 | VP Chiwenga Rallies SADC to Boost Trade | https://www.herald.co.zw/vp-chiwenga-rallies-s... | SADC countries must use their regional advanta... | 77 |
| 28 | Mnangagwa Taking Over SADC Chairmanship a Trav... | https://www.newzimbabwe.com/mnangagwa-taking-o... | ZIMBABWEANS living abroad have raised concern ... | 75 |
| 12 | Tourism Receipts Up 35pc to U.S.$241m | https://www.herald.co.zw/tourism-receipts-up-3... | Zimbabwe's tourism receipts grew by an estimat... | 63 |
| 20 | US Diplomat Runs Over 11-Year-Old Girl, Flees ... | https://www.herald.co.zw/us-diplomat-runs-over... | A United States Embassy official allegedly ran... | 52 |
| 15 | AfCFTA Pursues Exports for Africa's Food Security | https://www.herald.co.zw/afcfta-pursues-export... | THE African Continental Free Trade Area (AfCFT... | 49 |
| 7 | UN Tourism Delegation Visits First Lady | https://www.herald.co.zw/un-tourism-delegation... | Victoria Falls — United Nations Tourism Secret... | 48 |
| 14 | Air Ambulances Training Starts. . .300 Special... | https://www.herald.co.zw/air-ambulances-traini... | THE Russian HeliDrive Air ambulances medical s... | 44 |
| 13 | Air Zim Resumes Harare-Joburg Flights | https://www.herald.co.zw/air-zim-resumes-harar... | Air Zimbabwe yesterday resumed flights from Ha... | 32 |
| 29 | Police Crackdown On Vendors, Touts, Car Wash O... | https://www.newzimbabwe.com/police-crackdown-o... | The Zimbabwe Republic Police (ZRP) in conjunct... | 28 |


# Phase 3           
Sentiment Analysis

### Natural Language Processing

**Description:** Apply the DistilBERT base uncased SST-2 sentiment analysis model from Hugging Face to
understand and display the emotional tone of news articles, either negative or positive.   

#### About distilBERT
The `distilbert-base-uncased-finetuned-sst-2-english` model is applied to the dataset (more specifically, the body of each article) to determine the article's sentiment. Each article is split into sentences, and each sentence is passed to the model. The confidence scores of the sentences labelled positive are averaged, as are those of the sentences labelled negative, and the two averages are compared. The label with the higher average is then displayed as the article's overall sentiment, either positive or negative.
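The aggregation step can be tried without loading the model. The sketch below uses made-up per-sentence results in the `label`/`score` format the pipeline returns; `aggregate_sentiment` is a hypothetical helper that mirrors the averaging described above.

```python
def aggregate_sentiment(results):
    """Return (overall_label, average_score) from per-sentence results."""
    positive = [r["score"] for r in results if r["label"] == "POSITIVE"]
    negative = [r["score"] for r in results if r["label"] == "NEGATIVE"]
    average_positive = sum(positive) / len(positive) if positive else 0
    average_negative = sum(negative) / len(negative) if negative else 0
    # Ties go to POSITIVE, as in the notebook's comparison
    label = "POSITIVE" if average_positive >= average_negative else "NEGATIVE"
    return label, max(average_positive, average_negative)


# Made-up per-sentence results, for illustration only
mock_results = [
    {"label": "POSITIVE", "score": 0.90},
    {"label": "POSITIVE", "score": 0.92},
    {"label": "POSITIVE", "score": 0.94},
    {"label": "NEGATIVE", "score": 0.99},
]
label, score = aggregate_sentiment(mock_results)
print(label, f"{score:.4f}")
```

Because each average ignores how many sentences carry the label, a single confident negative sentence outweighs three slightly less confident positive ones in this example. That is worth keeping in mind when reading the overall labels.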

In [23]:
for _, row in top_10_articles.iterrows():
    sentences = split_text_by_sentences(row['BODY'])
    results = []

    # Classify each sentence; every result holds a label and a confidence score
    for sentence in sentences:
        result = sentiment_analyzer(sentence)[0]
        results.append(result)
        #print(f"Sentence: {sentence}\nSentiment: {result['label']}, Probability: {result['score']:.4f}\n")

    # Average the confidence scores separately for each label
    positive_scores = [result['score'] for result in results if result['label'] == 'POSITIVE']
    negative_scores = [result['score'] for result in results if result['label'] == 'NEGATIVE']

    average_positive = sum(positive_scores) / len(positive_scores) if positive_scores else 0
    average_negative = sum(negative_scores) / len(negative_scores) if negative_scores else 0

    # The label with the higher average confidence becomes the overall sentiment
    overall_sentiment = "POSITIVE" if average_positive >= average_negative else "NEGATIVE"
    average_score = max(average_positive, average_negative)

    # The title comes from the same row, so no second lookup over the DataFrame is needed
    print()
    print("TITLE OF ARTICLE: " + row['TITLE'])
    print(f"Overall Sentiment: {overall_sentiment}")
    print(f"Average Sentiment Score: {average_score:.4f}")

    # Optional: display the probabilities for each sentence
    # for i, result in enumerate(results):
    #     print(f"Sentence {i+1} Sentiment: {result['label']}, Probability: {result['score']:.4f}")


TITLE OF ARTICLE: Afreximbank Earmarks U.S.$400m for Zim Firms
Overall Sentiment: POSITIVE
Average Sentiment Score: 0.9633

TITLE OF ARTICLE: VP Chiwenga Rallies SADC to Boost Trade
Overall Sentiment: POSITIVE
Average Sentiment Score: 0.9920

TITLE OF ARTICLE: Mnangagwa Taking Over SADC Chairmanship a Travesty - Diasporans Accuse Regional Bloc of Legitimizing Tyranny
Overall Sentiment: NEGATIVE
Average Sentiment Score: 0.9585

TITLE OF ARTICLE: Tourism Receipts Up 35pc to U.S.$241m
Overall Sentiment: POSITIVE
Average Sentiment Score: 0.9916

TITLE OF ARTICLE: US Diplomat Runs Over 11-Year-Old Girl, Flees Country
Overall Sentiment: NEGATIVE
Average Sentiment Score: 0.9706

TITLE OF ARTICLE: AfCFTA Pursues Exports for Africa's Food Security
Overall Sentiment: POSITIVE
Average Sentiment Score: 0.9948

TITLE OF ARTICLE: UN Tourism Delegation Visits First Lady
Overall Sentiment: NEGATIVE
Average Sentiment Score: 0.9741

TITLE OF ARTICLE: Air Ambulances Training Starts. . .300 Specialised J