Authors: Allard Marc-Antoine

Course: FIN-407

---

## News Dataset Exploration

- Exploration of the FNSPID dataset (Financial News and Stock Price Integration Dataset)
    - Overview of the tools used
    - Overview of the dataset
- Comprehensive Exploratory Data Analysis (EDA) including topic extraction and analysis
- Summary statistics
- Continuation and proposed methods for analysis

---

In [None]:
# TODO: in Google Colab, install the dependencies first
# !pip install -r requirements.txt

In [3]:
%load_ext autoreload
%autoreload 2

import warnings; warnings.simplefilter('ignore')
import os, codecs, string, random
import numpy as np
from numpy.random import seed as random_seed
from numpy.random import shuffle as random_shuffle
import pandas as pd 
import statsmodels.api as sm
import seaborn as sns 
import matplotlib.pyplot as plt
from datetime import timedelta
from datetime import datetime
%matplotlib inline  

seed = 42
random.seed(seed)
np.random.seed(seed)

#NLP libraries
import spacy, nltk, gensim, sklearn
import pyLDAvis.gensim_models

#Vader
import vaderSentiment
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

# pandas plotting backend (matplotlib; switch to "plotly" for interactive plots)
pd.options.plotting.backend = 'matplotlib'
# Plot styling and settings 
plt.style.use('fivethirtyeight')
pd.set_option("display.max_columns", 50)
pd.set_option("display.max_rows", 700)
pd.set_option('display.max_colwidth', None)

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


### 1. Exploration of the FNSPID dataset (Financial News and Stock Price Integration Dataset)

#### 1.1 Overview of the sample datasets

> Note that the datasets we are interested in are the summarized news and the GPT sentiment scores.

In [77]:
# 1. news_data_summarized/aa.csv -> date x url x Text summary x Mark
path_1 = '../../data/FNSPID_Financial_News_Dataset/data_processor/news_data_summarized/aa.csv'

summarized_news = pd.read_csv(path_1)

print(f"Number of summarized news items: {len(summarized_news)}")

oldest_date = summarized_news['Date'].min()
newest_date = summarized_news['Date'].max()

# Display the range of dates
print()
print("Oldest Date:", oldest_date)
print("Newest Date:", newest_date)

Number of summarized news items: 1502

Oldest Date: 2016-03-22 04:39:00+00:00
Newest Date: 2024-01-17 03:52:00+00:00


In [79]:
# 2. news_data_sentiment_scored_by_gpt/aa.csv -> date x url x Text summary x sentiment
path_2 = '../../data/FNSPID_Financial_News_Dataset/data_processor/news_data_sentiment_scored_by_gpt/aa.csv'

sentiment_news = pd.read_csv(path_2)
print(f"Number of sentiment-labeled news items: {len(sentiment_news)}")

oldest_date = sentiment_news['Date'].min()
newest_date = sentiment_news['Date'].max()

# Display the range of dates
print()
print("Oldest Date:", oldest_date)
print("Newest Date:", newest_date)

Number of sentiment-labeled news items: 1502

Oldest Date: 2016-03-22 04:39:00+00:00
Newest Date: 2024-01-17 03:52:00+00:00
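
The date ranges above come from `min()` and `max()` on the raw `Date` strings, which is only reliable because the timestamps share a single ISO-like format that sorts lexicographically. A minimal sketch of the more robust route, parsing the column first with `pd.to_datetime(..., utc=True)`, is shown below. The `toy_dates` values are illustrative and are not read from the CSV files.

```python
import pandas as pd

# Illustrative timestamps in the same format as the 'Date' column (assumption)
toy_dates = pd.Series([
    "2016-03-22 04:39:00+00:00",
    "2024-01-17 03:52:00+00:00",
    "2020-06-11 13:12:35+00:00",
])

# Parse to timezone-aware datetimes so comparisons are chronological,
# not string-based
parsed = pd.to_datetime(toy_dates, utc=True)

oldest, newest = parsed.min(), parsed.max()
span_days = (newest - oldest).days  # whole days covered by the sample

print("Oldest Date:", oldest)
print("Newest Date:", newest)
print("Span in days:", span_days)
```

Working with parsed timestamps also makes date arithmetic, such as the span above, directly available.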


---
### 2. EDA 

#### 2.1 Exploration

In [63]:
path_2 = '../../data/All_external.csv'

external_news = pd.read_csv(path_2, nrows=5000000)

oldest_date = external_news['Date'].min()
newest_date = external_news['Date'].max()

# Display the range of dates
print("Oldest Date:", oldest_date)
print("Newest Date:", newest_date)

Oldest Date: 1914-09-16 00:00:00 UTC
Newest Date: 2020-06-11 13:12:35 UTC


In [64]:
external_news['Date'] = pd.to_datetime(external_news['Date'])

# Filter rows to include only entries from 1999 or later
external_news_filtered = external_news[external_news['Date'].dt.year >= 1999]

# Display the range of dates after filtering
oldest_date_filtered = external_news_filtered['Date'].min()
newest_date_filtered = external_news_filtered['Date'].max()

print("Oldest Date after filtering:", oldest_date_filtered)
print("Newest Date after filtering:", newest_date_filtered)

# Display the yearly count of entries after filtering
yearly_counts_filtered = external_news_filtered['Date'].dt.year.value_counts().sort_index()
print("\nYearly Counts of News Entries after filtering:")
print(yearly_counts_filtered)

Oldest Date after filtering: 1999-08-31 00:00:00+00:00
Newest Date after filtering: 2020-06-11 13:12:35+00:00

Yearly Counts of News Entries after filtering:
Date
1999      3081
2000     16176
2001     21974
2002     22179
2003     21557
2004     24386
2005     30718
2006     35964
2007     80965
2008    175001
2009    172170
2010    271717
2011    356624
2012    359954
2013    327242
2014    381490
2015    580577
2016    469798
2017    344705
2018    523476
2019    578683
2020    201557
Name: count, dtype: int64


In [73]:
external_news_filtered.head()

Unnamed: 0,Date,Article_title,Stock_symbol,Url,Publisher,Author,Article,Lsa_summary,Luhn_summary,Textrank_summary,Lexrank_summary
0,2020-06-05 06:30:54+00:00,Stocks That Hit 52-Week Highs On Friday,A,https://www.benzinga.com/news/20/06/16190091/stocks-that-hit-52-week-highs-on-friday,Benzinga Insights,,,,,,
1,2020-06-03 06:45:20+00:00,Stocks That Hit 52-Week Highs On Wednesday,A,https://www.benzinga.com/news/20/06/16170189/stocks-that-hit-52-week-highs-on-wednesday,Benzinga Insights,,,,,,
2,2020-05-26 00:30:07+00:00,71 Biggest Movers From Friday,A,https://www.benzinga.com/news/20/05/16103463/71-biggest-movers-from-friday,Lisa Levin,,,,,,
3,2020-05-22 08:45:06+00:00,46 Stocks Moving In Friday's Mid-Day Session,A,https://www.benzinga.com/news/20/05/16095921/46-stocks-moving-in-fridays-mid-day-session,Lisa Levin,,,,,,
4,2020-05-22 07:38:59+00:00,"B of A Securities Maintains Neutral on Agilent Technologies, Raises Price Target to $88",A,https://www.benzinga.com/news/20/05/16095304/b-of-a-securities-maintains-neutral-on-agilent-technologies-raises-price-target-to-88,Vick Meyer,,,,,,


In [72]:
average_title_length = external_news_filtered['Article_title'].apply(lambda x: len(x.split())).mean()

# Display the average title length, measured in words
print("Average String Length of Article Titles (in words):", average_title_length)

Average String Length of Article Titles (in words): 9.676495811794974
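
The `apply(lambda x: len(x.split()))` call above raises an error as soon as a title is missing, because `NaN` has no `split` method. A possible alternative, sketched here on a hypothetical `toy_titles` Series, uses the vectorized `.str` accessor, which leaves missing values as `NaN` and lets `mean()` skip them.

```python
import pandas as pd

# Hypothetical titles, including one missing value
toy_titles = pd.Series(["Stocks hit new highs", "Fed holds rates", None])

# .str.split() yields a list of words per title (NaN stays NaN),
# .str.len() then counts the words in each list
word_counts = toy_titles.str.split().str.len()

# mean() ignores NaN by default
mean_words = word_counts.mean()
print("Average title length (in words):", mean_words)
```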


In [86]:
num_entries_with_inflation = external_news_filtered['Article_title'].str.contains('inflation', case=False).sum()

# Display the number of entries containing 'inflation' in the Article_title
print("Number of Entries with 'inflation' in Article Title:", num_entries_with_inflation)

Number of Entries with 'inflation' in Article Title: 6326


In [87]:
# Define the list of lexicon terms
lexicon_terms = [
    "price", "cost of living", "high bill", "inflation", "expensive", "gasoline price",
    "high rent", "low rent", "energy costs", "deflation", "disinflation", "sale",
    "sell-off", "low bill", "low cost", "cheap" 
]

# Create a regex pattern to match any of the lexicon terms as whole words
pattern = '|'.join(r'\b{}\b'.format(term) for term in lexicon_terms)

# Filter DataFrame to include only entries where 'Article_title' contains any of the lexicon terms
titles_with_lexicon_terms = external_news_filtered[external_news_filtered['Article_title'].str.contains(pattern, case=False)]

# Count the number of article titles containing any of the lexicon terms
num_titles_with_lexicon_terms = len(titles_with_lexicon_terms)

# Display the count of article titles containing any of the lexicon terms
print("Number of Article Titles containing any of the Lexicon Terms:", num_titles_with_lexicon_terms)

Number of Article Titles containing any of the Lexicon Terms: 117864
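
To illustrate what the whole-word pattern does and does not match, here is a minimal sketch on a handful of made-up titles. `toy_titles` and `toy_terms` are hypothetical; `re.escape` and `na=False` are additions that are not in the cell above (they guard against regex metacharacters in a term and against missing titles, respectively).

```python
import re
import pandas as pd

# Made-up titles for illustration only (not taken from the dataset)
toy_titles = pd.Series([
    "Oil price jumps after OPEC meeting",
    "Company reprices its stock options",
    "Asset sale expected next quarter",
    "Wholesale demand stays flat",
    None,
])

toy_terms = ["price", "sale", "sell-off"]

# \b anchors each term at word boundaries; re.escape protects any
# regex metacharacters a term might contain
toy_pattern = "|".join(r"\b{}\b".format(re.escape(term)) for term in toy_terms)

# na=False counts missing titles as non-matches instead of returning NaN
mask = toy_titles.str.contains(toy_pattern, case=False, na=False)
print(mask.tolist())
```

Whole-word matching skips "reprices" and "Wholesale", but it also skips plural or inflected forms such as "prices", so the counts above are a lower bound for a looser notion of "mentions a lexicon term".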


In [89]:
if len(titles_with_lexicon_terms) > 0:
    print("Sample of up to 100 titles containing a lexicon term:")
    for title in titles_with_lexicon_terms['Article_title'].sample(min(100, len(titles_with_lexicon_terms))):
        print("-", title)
else:
    print("No titles found containing a lexicon term.")

Sample of up to 100 titles containing a lexicon term:
- Why CVR Refining Is So Cheap
- Ronald Stoeferle: Gold Is Dirt Cheap Right Now
- Jefferies Upgrades Masco to Buy, Raises Price Target to $40
- Exxon's California refinery sale seen pushed back to 2016
- Northland Securities Downgrades Plantronics to Market Perform, Lowers Price Target to $35
- CreditRiskMonitor: A Cheap Micro-Cap With A Recession-Proof Business Model
- Clarksons Platou Initiates Coverage On Alliance Resource with Buy Rating, Announces $20 Price Target
- Dow Chemical Lifts Dividend & Buyback, Ups Asset Sale Goal - Analyst Blog
- TREASURIES-30-year Treasury bonds trade a point higher in price
- Good News: Emerging Markets ETFs Are Cheap
- Oil & Gas Stocks Making Afternoon Rally: Rex Energy Shares Up 8.9%, PetroQuest Up 5.7%, Halcon, Jones, Wildhorse Up 5%, QEP, Approach Up 4%, Enduro Up 3.7%, W&T, Ring Up 3.1%; Not Seeing Specific News To Justify Sector-Wide Price Action
- Barclays Shaves Price Targets On 3 Casino Stocks
- If

In [100]:
# Path to the extracted CSV file
csv_file_path = '../../data/nasdaq_external_data.csv'

columns_to_read = ['Date', 'Article_title', 'Url', 'Lsa_summary']

# Read the specified columns from the CSV file into a pandas DataFrame
nasdaq_data = pd.read_csv(csv_file_path, usecols=columns_to_read, nrows=2000000)

# Display the first few rows of the DataFrame to verify
nasdaq_data.head()

Unnamed: 0,Date,Article_title,Url,Lsa_summary
0,2023-12-16 23:00:00 UTC,Interesting A Put And Call Options For August 2024,https://www.nasdaq.com/articles/interesting-a-put-and-call-options-for-august-2024,"Because the $125.00 strike represents an approximate 10% discount to the current trading price of the stock (in other words it is out-of-the-money by that percentage), there is also the possibility that the put contract would expire worthless. Of course, a lot of upside could potentially be left on the table if A shares really soar, which is why looking at the trailing twelve month trading history for Agilent Technologies, Inc., as well as studying the business fundamentals becomes important. Below is a chart showing A's trailing twelve month trading history, with the $150.00 strike highlighted in red: Considering the fact that the $150.00 strike represents an approximate 8% premium to the current trading price of the stock (in other words it is out-of-the-money by that percentage), there is also the possibility that the covered call contract would expire worthless, in which case the investor would keep both their shares of stock and the premium collected."
1,2023-12-12 00:00:00 UTC,Wolfe Research Initiates Coverage of Agilent Technologies (A) with Outperform Recommendation,https://www.nasdaq.com/articles/wolfe-research-initiates-coverage-of-agilent-technologies-a-with-outperform-recommendation,"Fintel reports that on December 13, 2023, Wolfe Research initiated coverage of Agilent Technologies (NYSE:A) with a Outperform recommendation. Agilent instruments, software, services, solutions, and people provide trusted answers to customers' most challenging questions. Fintel is one of the most comprehensive investing research platforms available to individual investors, traders, financial advisors, and small hedge funds."
2,2023-12-12 00:00:00 UTC,Agilent Technologies Reaches Analyst Target Price,https://www.nasdaq.com/articles/agilent-technologies-reaches-analyst-target-price-0,"In recent trading, shares of Agilent Technologies, Inc. (Symbol: A) have crossed above the average analyst 12-month target price of $132.36, changing hands for $133.74/share. And so with A crossing above that average target price of $132.36/share, investors in A have been given a good signal to spend fresh time assessing the company and deciding for themselves: is $132.36 just one stop on the way to an even higher target, or has the valuation gotten stretched to the point where it is time to think about taking some chips off the table? The Top 25 Broker Analyst Picks of the S&P 500 » Also see:  PXMD Insider Buying  ETM Insider Buying  Funds Holding FXG The views and opinions expressed herein are the views and opinions of the author and do not necessarily reflect those of Nasdaq, Inc."
3,2023-12-07 00:00:00 UTC,Agilent (A) Enhances BioTek Cytation C10 With New Technology,https://www.nasdaq.com/articles/agilent-a-enhances-biotek-cytation-c10-with-new-technology,"Per a Grand View Research report, the global microplate reader market is expected to grow at a CAGR of 7.6% during the forecast period 2023-2030. A Mordor Intelligence report indicates the global live cell imaging market size will reach $2.95 billion by 2028, exhibiting a CAGR of 7.06% between 2023 and 2028. Some better-ranked stocks in the broader technology sector are Badger Meter BMI, Arista Networks ANET and Adobe ADBE."
4,2023-12-07 00:00:00 UTC,"Pre-Market Most Active for Dec 7, 2023 : SQQQ, PLTR, TQQQ, ALT, UBER, PFE, GILD, MRK, AMD, NIO, JBLU, AI",https://www.nasdaq.com/articles/pre-market-most-active-for-dec-7-2023-%3A-sqqq-pltr-tqqq-alt-uber-pfe-gild-mrk-amd-nio-jblu,"ProShares UltraPro Short QQQ (SQQQ) is -0.15 at $16.37, with 2,061,124 shares traded. Over the last four weeks they have had 6 up revisions for the earnings forecast, for the fiscal quarter ending Dec 2023. Over the last four weeks they have had 4 up revisions for the earnings forecast, for the fiscal quarter ending Mar 2024."


In [104]:
nasdaq_data['Date'] = pd.to_datetime(nasdaq_data['Date'])

# Filter rows to include only entries from 1999 or later
nasdaq_data_filtered = nasdaq_data[nasdaq_data['Date'].dt.year >= 1999]


# Display the range of dates after filtering
oldest_date_filtered = nasdaq_data_filtered['Date'].min()
newest_date_filtered = nasdaq_data_filtered['Date'].max()

print("Oldest Date after filtering:", oldest_date_filtered)
print("Newest Date after filtering:", newest_date_filtered)

# Display the yearly count of entries after filtering
yearly_counts_filtered = nasdaq_data_filtered['Date'].dt.year.value_counts().sort_index()
print("\nYearly Counts of News Entries after filtering:")
print(yearly_counts_filtered)

Oldest Date after filtering: 2009-04-08 00:00:00+00:00
Newest Date after filtering: 2024-01-09 00:00:00+00:00

Yearly Counts of News Entries after filtering:
Date
2009       610
2010      7658
2011     12014
2012     29369
2013     55070
2014     73541
2015     86726
2016    103930
2017    133108
2018    137456
2019     96346
2020    117461
2021    143280
2022    219785
2023    783576
2024        70
Name: count, dtype: int64


In [107]:
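# Reassign instead of using inplace=True: nasdaq_data_filtered is a slice of
# nasdaq_data, and in-place changes on a slice trigger SettingWithCopyWarning
nasdaq_data_filtered = nasdaq_data_filtered.dropna()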
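
A filtered DataFrame obtained by boolean indexing may be treated by pandas as a view of its parent, so in-place modifications on it are ambiguous. A minimal sketch of an explicit pattern, on a hypothetical `toy` frame with the same column names: take a `.copy()` after filtering, then reassign the result of `dropna`. Restricting `dropna` to the `Lsa_summary` column via `subset` is an assumption made here for illustration.

```python
import pandas as pd

# Hypothetical frame mirroring the columns used in this notebook
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2009-04-08", "1998-01-01", "2023-12-07"], utc=True),
    "Lsa_summary": ["Inflation cooled.", "Old article.", None],
})

# Explicit copy: toy_filtered is now independent of toy
toy_filtered = toy[toy["Date"].dt.year >= 1999].copy()

# Drop rows with a missing summary and reassign (no inplace=True)
toy_filtered = toy_filtered.dropna(subset=["Lsa_summary"])

print(len(toy), len(toy_filtered))
```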

In [112]:
# Define the list of lexicon terms
lexicon_terms = [
    "price", "cost of living", "high bill", "inflation", "expensive", "gasoline price",
    "high rent", "low rent", "energy costs", "deflation", "disinflation", "sale",
    "sell-off", "low bill", "low cost", "cheap" 
]

# Create a regex pattern to match any of the lexicon terms as whole words
# (kept for reference; the filter below only uses the keyword 'inflation')
pattern = '|'.join(r'\b{}\b'.format(term) for term in lexicon_terms)

# Filter DataFrame to include only entries where 'Lsa_summary' contains 'inflation'
titles_with_lexicon_terms = nasdaq_data_filtered[nasdaq_data_filtered['Lsa_summary'].str.contains('inflation', case=False)]

# Count the number of summaries containing 'inflation'
num_titles_with_lexicon_terms = len(titles_with_lexicon_terms)

# Display the count of summaries containing 'inflation'
print("Number of LSA summaries containing 'inflation':", num_titles_with_lexicon_terms)

Number of LSA summaries containing 'inflation': 53600


In [113]:
if len(titles_with_lexicon_terms) > 0:
    print("Sample of up to 3 LSA summaries containing 'inflation':")
    for title in titles_with_lexicon_terms['Lsa_summary'].sample(min(3, len(titles_with_lexicon_terms))):
        print("-", title, '\n')
else:
    print("No summaries found containing 'inflation'.")

Sample of up to 3 LSA summaries containing 'inflation':
- United Airlines stock (NASDAQ: UAL) currently trades at $49 per share, more than 20% below its level in March 2021, and it has the potential for sizable gains. Our detailed analysis of United Airlines’ upside post-inflation shock captures trends in the company’s stock during the turbulent market conditions seen over 2022. Its steady increase in ASM and load factor will likely bolster its top-line growth, and a higher operating margin would translate into bottom-line expansion in the near term, boding well for its stock. 

- Anticipating Further Moves Gold’s future trajectory remains uncertain, as accelerating inflation and the resilience of the U.S. economy could herald a rate hike as early as November. Echoing this sentiment, Loretta Mester from the Cleveland Federal Reserve Bank stated that quelling inflation might necessitate another interest rate increment, followed by a period of stability. This article was originally posted on FX Emp

#### 2.2 Summary Stats

In [None]:
# TODO
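
One possible starting point for this section, sketched on a hypothetical `toy_news` frame that reuses the column names seen above (`Date`, `Article_title`, `Lsa_summary`). The choice of statistics is an assumption, not the author's final list.

```python
import pandas as pd

# Hypothetical sample with the same column names as the NASDAQ data
toy_news = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2019-01-02", "2019-06-30", "2020-03-15", "2020-03-16"], utc=True
    ),
    "Article_title": [
        "Stocks hit new highs",
        "Fed holds rates steady",
        "Oil price slumps",
        "Gold is cheap right now",
    ],
    "Lsa_summary": ["Shares rallied.", None, "Crude fell.", None],
})

# Words per title, computed with the vectorized .str accessor
title_words = toy_news["Article_title"].str.split().str.len()

summary_stats = {
    "n_articles": len(toy_news),
    "n_missing_summary": int(toy_news["Lsa_summary"].isna().sum()),
    "mean_title_words": float(title_words.mean()),
    "articles_per_year": toy_news["Date"].dt.year.value_counts().sort_index().to_dict(),
}
print(summary_stats)
```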

#### 2.3 Topic Analysis (inflation related)

In [None]:
# TODO
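
Before fitting a topic model, a simple keyword-based proxy can show how often inflation is mentioned over time. The sketch below is not a topic model: it only computes the yearly share of summaries that contain the word "inflation". The `toy_news` frame and its contents are hypothetical.

```python
import pandas as pd

# Hypothetical summaries; column names follow the NASDAQ data above
toy_news = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2021-03-01", "2021-07-15", "2022-02-10", "2022-09-30", "2022-11-05"],
        utc=True,
    ),
    "Lsa_summary": [
        "Inflation expectations rose sharply.",
        "The company reported record revenue.",
        "Accelerating inflation could herald a rate hike.",
        "Quelling inflation might require another increase.",
        "Shares traded higher after earnings.",
    ],
})

# True where the summary mentions 'inflation' as a whole word
mentions = toy_news["Lsa_summary"].str.contains(
    r"\binflation\b", case=False, na=False
)

# Mean of a boolean Series per year = share of summaries mentioning the term
share_by_year = mentions.groupby(toy_news["Date"].dt.year).mean()
print(share_by_year)
```

A yearly share is easier to compare across years than a raw count, since the number of articles per year varies widely in this dataset.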

---
### 3. Continuation and Proposed Methods

> ...