# Final Project in Text Analysis and Natural Language Processing

### Title: Predicting financial risk using NLP, a sentiment- and entity-based approach

#### Table of contents:


<a href='#web_scrape'>1. Web scraping the data from Google News</a><br>
    
<a href='#data_cleaning'>2. Text Cleaning & Deduplication</a>

<a href='#data_pre'>3. Data Preprocessing</a>	

<a href='#data_nlp'>4. Text Analysis and NLP </a>

### Importing the needed libraries

Here I import the libraries needed for the scraping and cleaning steps:

In [1]:
# Libraries needed for scraping

from serpapi import GoogleSearch
import pandas as pd
from datetime import datetime, timedelta
import time
import re

## Web scraping the data from Google News <a id='web_scrape'></a>

In this code snippet, I scrape recent financial news headlines, each with a short description, using SerpAPI (Google News results) for the following companies:

- Apple
- Tesla
- Nvidia
- Amazon

The objective is to retrieve high-quality, recent news headlines and snippets (within the last 90 days) that can be used for further sentiment analysis, entity recognition, and risk scoring.
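
To make the 90-day window concrete, here is a minimal sketch of how a custom date-range string could be derived from an end date instead of being typed by hand. The helper name `build_tbs_window` is hypothetical and not part of the original notebook; the `cdr:1,cd_min:...,cd_max:...` layout simply mirrors the `tbs` value used in the scraper.

```python
from datetime import datetime, timedelta


def build_tbs_window(end_date, days=90):
    """Return a `tbs` custom-date-range string covering `days` days up to `end_date`.

    Hypothetical helper, shown for illustration only.
    """
    start_date = end_date - timedelta(days=days)

    def fmt(d):
        # Non-padded M/D/YYYY, matching the form used in the scraper's `tbs` value.
        return f"{d.month}/{d.day}/{d.year}"

    return f"cdr:1,cd_min:{fmt(start_date)},cd_max:{fmt(end_date)}"


# A 90-day window ending on 14 April 2025 starts on 14 January 2025.
window = build_tbs_window(datetime(2025, 4, 14))
```

Deriving the string from a single end date keeps the query range and the 90-day filter applied to parsed dates in step with each other.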

In [None]:
# ✅ Step 1: Your SerpAPI Key
API_KEY = "--"

# ✅ Step 2: Parse Relative/Absolute Dates
def parse_relative_date(text):
    """Convert a date string such as "Apr 4, 2025" or "3 days ago" to a datetime, or None."""
    if not text:  # the article had no "date" field
        return None
    today = datetime.today()

    # Absolute dates, e.g. "Apr 4, 2025" (abbreviated) or "April 4, 2025" (full month name)
    if re.search(r"\w{3,} \d{1,2}, \d{4}", text):
        for fmt in ("%b %d, %Y", "%B %d, %Y"):
            try:
                return datetime.strptime(text.strip(), fmt)
            except ValueError:
                continue
        return None

    # Relative dates, e.g. "45 minutes ago", "3 days ago", "2 weeks ago"
    match = re.search(r"(\d+)\s*(minute|hour|day|week|month)", text)
    if not match:
        return None
    amount, unit = int(match.group(1)), match.group(2)
    if unit == "month":
        return today - timedelta(days=30 * amount)  # approximate a month as 30 days
    return today - timedelta(**{unit + "s": amount})  # minutes, hours, days or weeks

# ✅ Step 3: Scraper Function
def scrape_serpapi_news(company, target_articles=600):
    print(f"🔍 Scraping {company}...")
    results = []
    page = 0
    collected = 0
    today = datetime.today()
    ninety_days_ago = today - timedelta(days=90)

    while collected < target_articles:
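        params = {
            "engine": "google",
            "q": f"{company} stock",
            "tbm": "nws",  # restrict the search to Google News results
            "api_key": API_KEY,
            "num": 100,
            "start": page * 100,  # offset of the first result on this page
            # Custom date range (M/D/YYYY). It starts in January 2024, which is wider
            # than 90 days, so the date check below is what enforces the 90-day cutoff.
            "tbs": "cdr:1,cd_min:1/14/2024,cd_max:4/14/2025"
        }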

        search = GoogleSearch(params)
        response = search.get_dict()
        news_results = response.get("news_results", [])

        if not news_results:
            print(f"⚠️ No more news results found at page {page} for {company}")
            break

        for article in news_results:
            pub_date_raw = article.get("date")
            parsed_date = parse_relative_date(pub_date_raw)

            if parsed_date and parsed_date >= ninety_days_ago:
                results.append({
                    "company": company,
                    "title": article.get("title"),
                    "link": article.get("link"),
                    "snippet": article.get("snippet"),
                    "source": article.get("source"),
                    "published": parsed_date.strftime("%m/%d/%Y")
                })

        collected = len(results)
        page += 1
        time.sleep(1)

    print(f"✅ Collected {len(results)} valid articles for {company}")
    return pd.DataFrame(results[:target_articles])

# ✅ Step 4: List of Companies
companies = ["Apple", "Tesla", "Nvidia", "Amazon"]

# ✅ Step 5: Scrape & Save
all_dfs = []
for company in companies:
    df = scrape_serpapi_news(company, target_articles=600)
    all_dfs.append(df)

final_df = pd.concat(all_dfs, ignore_index=True)
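
Since the key is blanked out in the cell above, one common alternative is to read it from an environment variable so that it never appears in the notebook. The sketch below is an assumption about how that could look, not the setup actually used here; the variable name `SERPAPI_API_KEY` and the helper `load_api_key` are both hypothetical.

```python
import os


def load_api_key(env_var="SERPAPI_API_KEY"):
    """Return the API key stored in `env_var`, or raise if it is unset or empty.

    Both the helper and the variable name are illustrative assumptions.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable before scraping.")
    return key


# Usage in the scraping cell (instead of a hard-coded string):
# API_KEY = load_api_key()
```

Failing early with a clear message avoids sending requests with an empty key and then reading an empty result list as "no news found".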

### Reading the data

For confidentiality purposes, I have hidden my API key. The scraped data was saved to a CSV file, which is read below:

In [4]:
final_df = pd.read_csv("C:\\Users\\Mustafa Ansari\\Downloads\\Scrapped Google News.csv")

In [5]:
final_df.head()

| Unnamed: 0 | company | title | link | snippet | source | published |
|---|---|---|---|---|---|---|
| 0 | Apple | Where Will Apple Stock Be In 5 Years? | https://www.forbes.com/sites/investor-hub/arti... | The shares currently trade approximately 12% b... | Forbes | 03/15/2025 |
| 1 | Apple | Apple Joins AI Data Center Race After Siri Mess | https://www.investors.com/news/technology/appl... | Apple is in the process of placing orders for ... | Investor's Business Daily | 03/24/2025 |
| 2 | Apple | How Bad Could Sustained Tariffs Be for Apple S... | https://www.morningstar.com/stocks/how-bad-cou... | If tariffs persist, Apple's profit margins cou... | Morningstar | 04/09/2025 |
| 3 | Apple | Analysts revisit Apple stock price targets as ... | https://www.thestreet.com/investing/analysts-r... | Wedbush analyst Dan Ives, a committed Apple bu... | TheStreet | 03/24/2025 |
| 4 | Apple | Watch These Apple Stock Price Levels Amid Tari... | https://www.investopedia.com/watch-these-apple... | Apple shares gained ground Wednesday after los... | Investopedia | 04/09/2025 |
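
The `Unnamed: 0` column in the preview is most likely the DataFrame index that was written out when the scraped results were saved. This is an assumption, since the saving step is not shown. The sketch below uses a toy frame and in-memory buffers (not the real file) to show how the column arises and two ways to avoid it.

```python
import io

import pandas as pd

# Toy stand-in for the scraped data; the real CSV is not reproduced here.
toy = pd.DataFrame({
    "company": ["Apple", "Tesla"],
    "title": ["A headline", "Another headline"],
})

# Saving with the default index=True writes the row index as an unnamed first column.
buffer = io.StringIO()
toy.to_csv(buffer)
buffer.seek(0)
with_index_col = pd.read_csv(buffer)  # the index comes back as "Unnamed: 0"

# Option 1: treat the first column as the index when reading.
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)

# Option 2: do not write the index in the first place.
clean_buffer = io.StringIO()
toy.to_csv(clean_buffer, index=False)
clean_buffer.seek(0)
no_index = pd.read_csv(clean_buffer)
```

Either option keeps the extra column out of later steps. The cleaning cell below already uses `index=False` when saving its output.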


## Text Cleaning & Deduplication <a id='data_cleaning'></a>

In this step, I first combine the title and snippet into a single `text` column. I then strip surrounding whitespace and drop entries that are empty or very short (10 characters or fewer), remove exact duplicates of the combined text, and finally reset the index for tidy output.

In [22]:
# Original Data
original_count = len(final_df)
df = final_df.copy()

# STEP 1: Combine title + snippet
df["text"] = df["title"].fillna("") + ". " + df["snippet"].fillna("")

# STEP 2: Remove short/empty text
df["text"] = df["text"].str.strip()
before_empty_filter = len(df)
df = df[df["text"].str.len() > 10]
after_empty_filter = len(df)
empty_dropped = before_empty_filter - after_empty_filter

# STEP 3: Remove exact duplicates
before_dedup = len(df)
df = df.drop_duplicates(subset=["text"])
after_dedup = len(df)
duplicates_dropped = before_dedup - after_dedup

# STEP 4: Reset index
df = df.reset_index(drop=True)

# Save cleaned version
df.to_csv(r"C:\Users\Mustafa Ansari\Downloads\news_cleaned_deduplicated.csv", index=False)


In [21]:
# Summary
print("Cleaning Summary:")
print(f"• Original articles: {original_count}")
print(f"• Removed short/empty text: {empty_dropped}")
print(f"• Removed duplicates: {duplicates_dropped}")
print(f" Final cleaned articles: {len(df)}")

Cleaning Summary:
• Original articles: 1929
• Removed short/empty text: 0
• Removed duplicates: 146
 Final cleaned articles: 1783
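
To show how the counts in the summary relate to each other, here is a small sketch that runs the same three steps on a toy frame. The rows are invented for illustration and are not taken from the scraped data.

```python
import pandas as pd

# Invented rows: one exact duplicate and one row with no title or snippet.
toy = pd.DataFrame({
    "title": ["Apple stock rises", "Apple stock rises", None, "Tesla deliveries fall"],
    "snippet": [
        "Shares gained 2% on Monday.",
        "Shares gained 2% on Monday.",
        None,
        "Q1 numbers missed estimates.",
    ],
})

# Step 1: combine title + snippet; a row with neither collapses to "." after stripping.
toy["text"] = (toy["title"].fillna("") + ". " + toy["snippet"].fillna("")).str.strip()

# Step 2: drop short/empty text (10 characters or fewer).
kept = toy[toy["text"].str.len() > 10]
empty_dropped = len(toy) - len(kept)

# Step 3: drop exact duplicates of the combined text, then reset the index.
deduped = kept.drop_duplicates(subset=["text"]).reset_index(drop=True)
duplicates_dropped = len(kept) - len(deduped)

# The counts always add up: original = short/empty + duplicates + final.
total_accounted = empty_dropped + duplicates_dropped + len(deduped)
```

The same identity holds for the real run reported above: 1929 = 0 + 146 + 1783.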


In [17]:
df.head()

| Unnamed: 0 | company | title | link | snippet | source | published | text |
|---|---|---|---|---|---|---|---|
| 0 | Apple | Where Will Apple Stock Be In 5 Years? | https://www.forbes.com/sites/investor-hub/arti... | The shares currently trade approximately 12% b... | Forbes | 03/15/2025 | Where Will Apple Stock Be In 5 Years?. The sha... |
| 1 | Apple | Apple Joins AI Data Center Race After Siri Mess | https://www.investors.com/news/technology/appl... | Apple is in the process of placing orders for ... | Investor's Business Daily | 03/24/2025 | Apple Joins AI Data Center Race After Siri Mes... |
| 2 | Apple | How Bad Could Sustained Tariffs Be for Apple S... | https://www.morningstar.com/stocks/how-bad-cou... | If tariffs persist, Apple's profit margins cou... | Morningstar | 04/09/2025 | How Bad Could Sustained Tariffs Be for Apple S... |
| 3 | Apple | Analysts revisit Apple stock price targets as ... | https://www.thestreet.com/investing/analysts-r... | Wedbush analyst Dan Ives, a committed Apple bu... | TheStreet | 03/24/2025 | Analysts revisit Apple stock price targets as ... |
| 4 | Apple | Watch These Apple Stock Price Levels Amid Tari... | https://www.investopedia.com/watch-these-apple... | Apple shares gained ground Wednesday after los... | Investopedia | 04/09/2025 | Watch These Apple Stock Price Levels Amid Tari... |
