# 0. Problem statement

We are given the top 25 news headlines (based on votes from the "worldnews" subreddit in Reddit) for every day from Aug. 8, 2008 to Jul. 1, 2016 along with the corresponding movements in the DJI index: daily returns >= 0% are labelled with a 1 and daily returns < 0% are labelled with a 0. 

**Can we use NLP to develop a model that can forecast whether DJI returns are positive or not given the top 25 headlines for a day?**

# I. Imports

In [1]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from matplotlib.pyplot import figure
import seaborn as sns
import pandas_datareader.data as reader
import datetime as dt
from datetime import timedelta
from typing import Iterable, Union
import itertools
from plotly import graph_objects as go
from plotly.subplots import make_subplots
from plotly import express as px

import unicodedata
import re
import nltk
from nltk import word_tokenize, sent_tokenize
from nltk.corpus import stopwords, wordnet
from nltk.stem import LancasterStemmer, WordNetLemmatizer

In [2]:
import contractions

In [3]:
import requests

In [4]:
from functools import partial

In [5]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer

In [6]:
# Ensure that the following commands are run at least once on your machine to ensure that NLTK pre-processing works as expected.
# nltk.download('punkt')
# nltk.download('stopwords')
# nltk.download('wordnet')
# nltk.download('omw-1.4')
# nltk.download('vader_lexicon')

# II. Data loading
I downloaded the relevant dataset from Kaggle (https://www.kaggle.com/datasets/aaron7sun/stocknews?resource=download) and have saved it to my local machine. It is part of my GitHub Repository as well, so the following command should work if you've cloned the repository appropriately.

In [7]:
pwd

'/home/murali/personal_projects/stock-price-forecasts/eda'

In [8]:
df = pd.read_csv("data/headlines_vs_DJI.csv")

# III. Preliminary look at the data

In [9]:
df.head()

Unnamed: 0,Date,Label,Top1,Top2,Top3,Top4,Top5,Top6,Top7,Top8,...,Top16,Top17,Top18,Top19,Top20,Top21,Top22,Top23,Top24,Top25
0,2008-08-08,0,"b""Georgia 'downs two Russian warplanes' as cou...",b'BREAKING: Musharraf to be impeached.',b'Russia Today: Columns of troops roll into So...,b'Russian tanks are moving towards the capital...,"b""Afghan children raped with 'impunity,' U.N. ...",b'150 Russian tanks have entered South Ossetia...,"b""Breaking: Georgia invades South Ossetia, Rus...","b""The 'enemy combatent' trials are nothing but...",...,b'Georgia Invades South Ossetia - if Russia ge...,b'Al-Qaeda Faces Islamist Backlash',"b'Condoleezza Rice: ""The US would not act to p...",b'This is a busy day: The European Union has ...,"b""Georgia will withdraw 1,000 soldiers from Ir...",b'Why the Pentagon Thinks Attacking Iran is a ...,b'Caucasus in crisis: Georgia invades South Os...,b'Indian shoe manufactory - And again in a se...,b'Visitors Suffering from Mental Illnesses Ban...,"b""No Help for Mexico's Kidnapping Surge"""
1,2008-08-11,1,b'Why wont America and Nato help us? If they w...,b'Bush puts foot down on Georgian conflict',"b""Jewish Georgian minister: Thanks to Israeli ...",b'Georgian army flees in disarray as Russians ...,"b""Olympic opening ceremony fireworks 'faked'""",b'What were the Mossad with fraudulent New Zea...,b'Russia angered by Israeli military sale to G...,b'An American citizen living in S.Ossetia blam...,...,b'Israel and the US behind the Georgian aggres...,"b'""Do not believe TV, neither Russian nor Geor...",b'Riots are still going on in Montreal (Canada...,b'China to overtake US as largest manufacturer',b'War in South Ossetia [PICS]',b'Israeli Physicians Group Condemns State Tort...,b' Russia has just beaten the United States ov...,b'Perhaps *the* question about the Georgia - R...,b'Russia is so much better at war',"b""So this is what it's come to: trading sex fo..."
2,2008-08-12,0,b'Remember that adorable 9-year-old who sang a...,"b""Russia 'ends Georgia operation'""","b'""If we had no sexual harassment we would hav...","b""Al-Qa'eda is losing support in Iraq because ...",b'Ceasefire in Georgia: Putin Outmaneuvers the...,b'Why Microsoft and Intel tried to kill the XO...,b'Stratfor: The Russo-Georgian War and the Bal...,"b""I'm Trying to Get a Sense of This Whole Geor...",...,b'U.S. troops still in Georgia (did you know t...,b'Why Russias response to Georgia was right',"b'Gorbachev accuses U.S. of making a ""serious ...","b'Russia, Georgia, and NATO: Cold War Two'",b'Remember that adorable 62-year-old who led y...,b'War in Georgia: The Israeli connection',b'All signs point to the US encouraging Georgi...,b'Christopher King argues that the US and NATO...,b'America: The New Mexico?',"b""BBC NEWS | Asia-Pacific | Extinction 'by man..."
3,2008-08-13,0,b' U.S. refuses Israel weapons to attack Iran:...,"b""When the president ordered to attack Tskhinv...",b' Israel clears troops who killed Reuters cam...,b'Britain\'s policy of being tough on drugs is...,b'Body of 14 year old found in trunk; Latest (...,b'China has moved 10 *million* quake survivors...,"b""Bush announces Operation Get All Up In Russi...",b'Russian forces sink Georgian ships ',...,b'Elephants extinct by 2020?',b'US humanitarian missions soon in Georgia - i...,"b""Georgia's DDOS came from US sources""","b'Russian convoy heads into Georgia, violating...",b'Israeli defence minister: US against strike ...,b'Gorbachev: We Had No Choice',b'Witness: Russian forces head towards Tbilisi...,b' Quarter of Russians blame U.S. for conflict...,b'Georgian president says US military will ta...,b'2006: Nobel laureate Aleksander Solzhenitsyn...
4,2008-08-14,1,b'All the experts admit that we should legalis...,b'War in South Osetia - 89 pictures made by a ...,b'Swedish wrestler Ara Abrahamian throws away ...,b'Russia exaggerated the death toll in South O...,b'Missile That Killed 9 Inside Pakistan May Ha...,"b""Rushdie Condemns Random House's Refusal to P...",b'Poland and US agree to missle defense deal. ...,"b'Will the Russians conquer Tblisi? Bet on it,...",...,b'Bank analyst forecast Georgian crisis 2 days...,"b""Georgia confict could set back Russia's US r...",b'War in the Caucasus is as much the product o...,"b'""Non-media"" photos of South Ossetia/Georgia ...",b'Georgian TV reporter shot by Russian sniper ...,b'Saudi Arabia: Mother moves to block child ma...,b'Taliban wages war on humanitarian aid workers',"b'Russia: World ""can forget about"" Georgia\'s...",b'Darfur rebels accuse Sudan of mounting major...,b'Philippines : Peace Advocate say Muslims nee...


In [10]:
df.dtypes

Date     object
Label     int64
Top1     object
Top2     object
Top3     object
Top4     object
Top5     object
Top6     object
Top7     object
Top8     object
Top9     object
Top10    object
Top11    object
Top12    object
Top13    object
Top14    object
Top15    object
Top16    object
Top17    object
Top18    object
Top19    object
Top20    object
Top21    object
Top22    object
Top23    object
Top24    object
Top25    object
dtype: object

In [11]:
df = df.rename(columns={"Date": "date", "Label": "label"})

In [12]:
df["date"] = pd.to_datetime(df["date"])

In [13]:
df[["date", "label"]].describe(datetime_is_numeric=True)

Unnamed: 0,date,label
count,1989,1989.0
mean,2012-07-20 11:52:23.891402752,0.535445
min,2008-08-08 00:00:00,0.0
25%,2010-07-30 00:00:00,0.0
50%,2012-07-19 00:00:00,1.0
75%,2014-07-14 00:00:00,1.0
max,2016-07-01 00:00:00,1.0
std,,0.498867


# IV. Data pre-processing

There are various different ways to process text. Some methods may be applicable to our use case and others may not! It will be useful to apply some standard text processing functions on the head of randomly selected columns of text in our DataFrame to evaluate each method's utility.

In [14]:
# Let's make sure we can see the full string every time
pd.set_option('display.max_colwidth', None)

In [15]:
def remove_accented_chars(text: str) -> str:
    try:
        text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')
        return text
    except:
        print(f"{text} is causing issues.")

In [16]:
def replace_contractions(text):
    """Replace contractions in string of text"""
    return contractions.fix(text)

In [17]:
def remove_special_characters(text: str, remove_digits: bool=False) -> str:
    pattern = r'[^a-zA-z0-9\s]' if not remove_digits else r'[^a-zA-z\s]'
    text = re.sub(pattern, '', text)
    return text

In [18]:
def remove_stopwords(words: Iterable) -> Iterable:
    """Remove stop words from list of tokenized words"""
    new_words = []                        # Create empty list to store pre-processed words.
    for word in words:
        if word not in stopwords.words('english'):
            new_words.append(word)        # Append processed words to new list.
    return new_words

**Let us evaluate our pre-processing functions on our DataFrame's Top1 column's head:**

In [19]:
# Original df["Top1"].head()
df["Top1"].head()

0                       b"Georgia 'downs two Russian warplanes' as countries move to brink of war"
1    b'Why wont America and Nato help us? If they wont help us now, why did we help them in Iraq?'
2     b'Remember that adorable 9-year-old who sang at the opening ceremonies? That was fake, too.'
3                                           b' U.S. refuses Israel weapons to attack Iran: report'
4                                          b'All the experts admit that we should legalise drugs '
Name: Top1, dtype: object

In [20]:
df["Top1"].head().apply(remove_accented_chars)

0                       b"Georgia 'downs two Russian warplanes' as countries move to brink of war"
1    b'Why wont America and Nato help us? If they wont help us now, why did we help them in Iraq?'
2     b'Remember that adorable 9-year-old who sang at the opening ceremonies? That was fake, too.'
3                                           b' U.S. refuses Israel weapons to attack Iran: report'
4                                          b'All the experts admit that we should legalise drugs '
Name: Top1, dtype: object

**We don't see much impact of the function above on the head of our Top1 series, because there aren't any accented characters in these headlines. Nevertheless, this will be useful processing to do on our text.**

In [21]:
df["Top1"].head().apply(replace_contractions)

0                               b"Georgia 'downs two Russian warplanes' as countries move to brink of war"
1    b'Why will not America and Nato help us? If they will not help us now, why did we help them in Iraq?'
2             b'Remember that adorable 9-year-old who sang at the opening ceremonies? That was fake, too.'
3                                                 b' YOU.S. refuses Israel weapons to attack Iran: report'
4                                                  b'All the experts admit that we should legalise drugs '
Name: Top1, dtype: object

**We see that replacing contractions has replaced the U in "U.S" from the headline in row 4. We could try removing the special characters and see how this holds up.**

In [22]:
df["Top1"].head().apply(remove_special_characters)

0                      bGeorgia downs two Russian warplanes as countries move to brink of war
1    bWhy wont America and Nato help us If they wont help us now why did we help them in Iraq
2       bRemember that adorable 9yearold who sang at the opening ceremonies That was fake too
3                                           b US refuses Israel weapons to attack Iran report
4                                       bAll the experts admit that we should legalise drugs 
Name: Top1, dtype: object

In [23]:
df["Top1"].head().apply(remove_special_characters).apply(replace_contractions)

0                              bGeorgia downs two Russian warplanes as countries move to brink of war
1    bWhy will not America and Nato help us If they will not help us now why did we help them in Iraq
2               bRemember that adorable 9yearold who sang at the opening ceremonies That was fake too
3                                                   b US refuses Israel weapons to attack Iran report
4                                               bAll the experts admit that we should legalise drugs 
Name: Top1, dtype: object

**This works, but it will be useful to replace hyphens with a space. Otherwise, the text that is split by hyphens would probably be bunched up together, similar to how "9-year-old" has become "9yearold" in headline #3. In addition, we need to do something about the "b" that is at the beginning of many of these headlines.**

In [24]:
def replace_hyphen_with_space(text: str) -> str:
    return text.replace("-", " ")

In [25]:
df["Top1"].head().apply(replace_hyphen_with_space)

0                       b"Georgia 'downs two Russian warplanes' as countries move to brink of war"
1    b'Why wont America and Nato help us? If they wont help us now, why did we help them in Iraq?'
2     b'Remember that adorable 9 year old who sang at the opening ceremonies? That was fake, too.'
3                                           b' U.S. refuses Israel weapons to attack Iran: report'
4                                          b'All the experts admit that we should legalise drugs '
Name: Top1, dtype: object

In [26]:
def remove_unnecessary_prefix_from_headlines(text: str) -> str:
    if ("b\"" == text[0:2]) or ("b\'" == text[0:2]):
        return text[2:]
    else:
        return text

**Note that our assumption in the function above is that the first two characters are b" or b'. So, we will need to strip any preceding whitespaces before running this function.**

In [27]:
df["Top1"].head().str.lstrip().apply(remove_unnecessary_prefix_from_headlines)

0                       Georgia 'downs two Russian warplanes' as countries move to brink of war"
1    Why wont America and Nato help us? If they wont help us now, why did we help them in Iraq?'
2     Remember that adorable 9-year-old who sang at the opening ceremonies? That was fake, too.'
3                                            U.S. refuses Israel weapons to attack Iran: report'
4                                          All the experts admit that we should legalise drugs '
Name: Top1, dtype: object

**Similar to removing these prefixes, it would be good to remove the "'s" that stems that indicates that something belongs to the noun it is being applied on.**

In [28]:
def remove_possessive_s(text: str) -> str:
    return text.replace("\'s", "")

**Let us also standardise country names. For example, it would be good to replace "US" with "america" after removing all the punctuation**

In [29]:
def standardise_country_names(text: str) -> str:
    text = text.replace("US", "america")
    text = text.replace("UK", "britain")
    text = text.lower().replace("great britain", "britain")
    
    return text

**Getting headlines from different geographies means that we should also standardise our spelling. Let us change everything to UK style.**

In [30]:
url ="https://raw.githubusercontent.com/hyperreality/American-British-English-Translator/master/data/american_spellings.json"
american_to_british_dict = requests.get(url).json() 

In [31]:
def text_to_british_spelling(text: str, american_to_british_dict: dict) -> str:
    for american_spelling, british_spelling in american_to_british_dict.items():
        text = text.replace(american_spelling, british_spelling)
  
    return text

**Pre-processing text prior to stemming / lemmatisation**

In [32]:
def normalise_text(text_series: pd.Series) -> pd.Series:
    text_series = text_series.apply(remove_accented_chars)
    text_series = text_series.str.lstrip().apply(remove_unnecessary_prefix_from_headlines)
    text_series = text_series.apply(replace_hyphen_with_space)
    text_series = text_series.apply(remove_possessive_s)
    text_series = text_series.apply(remove_special_characters)
    text_series = text_series.apply(replace_contractions)
    text_series = text_series.apply(standardise_country_names)
    text_series = text_series.str.lower()
    text_series = text_series.apply(partial(text_to_british_spelling, american_to_british_dict = american_to_british_dict))
    text_series = text_series.apply(word_tokenize).apply(remove_stopwords)
    
    return text_series

In [33]:
normalised_text = normalise_text(df["Top1"].head())
normalised_text

0    [georgia, downs, two, russian, warplanes, countries, move, brink, war]
1                           [america, nato, help, us, help, us, help, iraq]
2       [remember, adorable, 9, year, old, sang, opening, ceremonies, fake]
3                 [america, refuses, israel, weapons, attack, iran, report]
4                                         [experts, admit, legalise, drugs]
Name: Top1, dtype: object

**Stemming vs Lemmatisation: which one do we use?**

The answer is clearly lemmatisation (from the examples shown below).

In [34]:
def stem_words(text_series: pd.Series) -> pd.Series:
    """Stem words in list of tokenized words"""
    stemmer = LancasterStemmer()
    
    return text_series.apply(lambda x: [stemmer.stem(y) for y in x])

In [35]:
stem_words(normalised_text)

0    [georg, down, two, russ, warpl, country, mov, brink, war]
1               [americ, nato, help, us, help, us, help, iraq]
2          [rememb, ad, 9, year, old, sang, op, ceremony, fak]
3          [americ, refus, israel, weapon, attack, ir, report]
4                                   [expert, admit, leg, drug]
Name: Top1, dtype: object

In [36]:
def lemmatize_verbs(text_series: Iterable) -> Iterable:
    """Lemmatize verbs in list of tokenized words"""
    lemmatizer = WordNetLemmatizer()    
    return text_series.apply(lambda x: [lemmatizer.lemmatize(y, pos='v') for y in x])

In [37]:
lemmatize_verbs(normalised_text)

0    [georgia, down, two, russian, warplanes, countries, move, brink, war]
1                          [america, nato, help, us, help, us, help, iraq]
2         [remember, adorable, 9, year, old, sing, open, ceremonies, fake]
3                 [america, refuse, israel, weapons, attack, iran, report]
4                                         [experts, admit, legalise, drug]
Name: Top1, dtype: object

**Now that we have a decent set of pre-processing steps, let us apply this to samples from other columns of our df.**

In [38]:
print("Original text:")
print(df["Top2"].head())
print("\n")
print("Processed text:")
lemmatize_verbs(normalise_text(df["Top2"].head()))

Original text:
0                                                                                                             b'BREAKING: Musharraf to be impeached.'
1                                                                                                         b'Bush puts foot down on Georgian conflict'
2                                                                                                                  b"Russia 'ends Georgia operation'"
3    b"When the president ordered to attack Tskhinvali [the capital of South Ossetia], we knew then we were doomed. How come he didn't realize that?"
4                                                                                     b'War in South Osetia - 89 pictures made by a Russian soldier.'
Name: Top2, dtype: object


Processed text:


0                                                                         [break, musharraf, impeach]
1                                                               [bush, put, foot, georgian, conflict]
2                                                                   [russia, end, georgia, operation]
3    [president, order, attack, tskhinvali, [, capital, south, ossetia, ], know, doom, come, realise]
4                                           [war, south, osetia, 89, picture, make, russian, soldier]
Name: Top2, dtype: object

In [39]:
print("Original text:")
print(df["Top3"].head())
print("\n")
print("Processed text:")
lemmatize_verbs(normalise_text(df["Top3"].head()))

Original text:
0    b'Russia Today: Columns of troops roll into South Ossetia; footage from fighting (YouTube)'
1             b"Jewish Georgian minister: Thanks to Israeli training, we're fending off Russia "
2                               b'"If we had no sexual harassment we would have no children..."'
3                                          b' Israel clears troops who killed Reuters cameraman'
4                     b'Swedish wrestler Ara Abrahamian throws away medal in Olympic hissy fit '
Name: Top3, dtype: object


Processed text:


0    [russia, today, columns, troop, roll, south, ossetia, footage, fight, youtube]
1                 [jewish, georgian, minister, thank, israeli, train, fend, russia]
2                                             [sexual, harassment, would, children]
3                                  [israel, clear, troop, kill, reuters, cameraman]
4     [swedish, wrestler, ara, abrahamian, throw, away, medal, olympic, hissy, fit]
Name: Top3, dtype: object

In [40]:
print("Original text:")
print(df["Top15"].head())
print("\n")
print("Processed text:")
lemmatize_verbs(normalise_text(df["Top15"].head()))

Original text:
0                                                                       b'Did World War III start today?'
1                                       b'The French Team is Stunned by Phelps and the 4x100m Relay Team'
2                                                                 b'The 11 Top Party Cities in the World'
3                                                            b'Why Russias response to Georgia was right'
4    b'Russia apparently is sabotaging infrastructure to cripple the already battered Georgian military.'
Name: Top15, dtype: object


Processed text:


0                                                                 [world, war, iii, start, today]
1                                               [french, team, stun, phelps, 4x100m, relay, team]
2                                                                 [11, top, party, cities, world]
3                                                             [russias, response, georgia, right]
4    [russia, apparently, sabotage, infrastructure, cripple, already, batter, georgian, military]
Name: Top15, dtype: object

In [41]:
print("Original text:")
print(df["Top25"].head())
print("\n")
print("Processed text:")
lemmatize_verbs(normalise_text(df["Top25"].head()))

Original text:
0                                                         b"No Help for Mexico's Kidnapping Surge"
1                                           b"So this is what it's come to: trading sex for food."
2                                     b"BBC NEWS | Asia-Pacific | Extinction 'by man not climate'"
3          b'2006: Nobel laureate Aleksander Solzhenitsyn accuses U.S., NATO of encircling Russia'
4    b'Philippines : Peace Advocate say Muslims need assurance Christians not out to convert them'
Name: Top25, dtype: object


Processed text:


0                                                                 [help, mexico, kidnap, surge]
1                                                                      [come, trade, sex, food]
2                                          [bbc, news, asia, pacific, extinction, man, climate]
3    [2006, nobel, laureate, aleksander, solzhenitsyn, accuse, america, nato, encircle, russia]
4            [philippines, peace, advocate, say, muslims, need, assurance, christians, convert]
Name: Top25, dtype: object

**From a quick look at the processed text, it looks like we've done a good job in removing many of the unnecessary text that could serve as noise for our model. There is still some irrelevant text in our headlines though (like news channel names like BBC).**

How do we deal with the problem above? We could add them to a list of unnecessary words that we could remove (similar to how we got rid of stop-words). However, we are only able to see a fraction of words that are not useful. This means that we can't be comprehensive enough if we keep adding one word at a time to this list. 

Is there some way we could generalise this to add a large set of words that are (for the most part) not necessary?
- Are country names necessary for the model to detect sentiment? 
- Are news channel names relevant (even if the headlines is about the news channel)? 

These are potential considerations to help tune our model. But for now, let us see how good our model can be with our current level of pre-processing!

In [42]:
df.head(1)

Unnamed: 0,date,label,Top1,Top2,Top3,Top4,Top5,Top6,Top7,Top8,...,Top16,Top17,Top18,Top19,Top20,Top21,Top22,Top23,Top24,Top25
0,2008-08-08,0,"b""Georgia 'downs two Russian warplanes' as countries move to brink of war""",b'BREAKING: Musharraf to be impeached.',b'Russia Today: Columns of troops roll into South Ossetia; footage from fighting (YouTube)',"b'Russian tanks are moving towards the capital of South Ossetia, which has reportedly been completely destroyed by Georgian artillery fire'","b""Afghan children raped with 'impunity,' U.N. official says - this is sick, a three year old was raped and they do nothing""",b'150 Russian tanks have entered South Ossetia whilst Georgia shoots down two Russian jets.',"b""Breaking: Georgia invades South Ossetia, Russia warned it would intervene on SO's side""","b""The 'enemy combatent' trials are nothing but a sham: Salim Haman has been sentenced to 5 1/2 years, but will be kept longer anyway just because they feel like it.""",...,"b'Georgia Invades South Ossetia - if Russia gets involved, will NATO absorb Georgia and unleash a full scale war?'",b'Al-Qaeda Faces Islamist Backlash',"b'Condoleezza Rice: ""The US would not act to prevent an Israeli strike on Iran."" Israeli Defense Minister Ehud Barak: ""Israel is prepared for uncompromising victory in the case of military hostilities.""'",b'This is a busy day: The European Union has approved new sanctions against Iran in protest at its nuclear programme.',"b""Georgia will withdraw 1,000 soldiers from Iraq to help fight off Russian forces in Georgia's breakaway region of South Ossetia""",b'Why the Pentagon Thinks Attacking Iran is a Bad Idea - US News &amp; World Report',b'Caucasus in crisis: Georgia invades South Ossetia',"b'Indian shoe manufactory - And again in a series of ""you do not like your work?""'",b'Visitors Suffering from Mental Illnesses Banned from Olympics',"b""No Help for Mexico's Kidnapping Surge"""


In [43]:
normalise_text(df["Top5"])

0                                                             [afghan, children, raped, impunity, un, official, says, sick, three, year, old, raped, nothing]
1                                                                                                              [olympic, opening, ceremony, fireworks, faked]
2                                                                                                            [ceasefire, georgia, putin, outmanoeuvres, west]
3       [body, 14, year, old, found, trunk, latest, ransom, paid, kidnapping, victim, mexico, head, cop, quits, prez, dissolves, suspect, elite, task, force]
4                                                                                                  [missile, killed, 9, inside, pakistan, may, launched, cia]
                                                                                ...                                                                          
1984                                                

In [45]:
df = df.fillna("")

In [47]:
processed_df = df.drop(columns=["label", "date"]).apply(normalise_text, axis=1)

In [49]:
processed_df.loc[1000, "Top1"]

['cuban',
 'president',
 'ral',
 'castro',
 'willing',
 'hold',
 'limits',
 'talks',
 'america']

In [75]:
# for col in processed_df.columns:
#     if col == "Top1":
#         all_headlines = processed_df[col]
#     else:
#         all_headlines += processed_df[col]

# SCRATCH

In [123]:
# sid = SentimentIntensityAnalyzer()
# up = []
# neutral = []
# down = []

# for i in range(len(processed_df)):
#     t=processed_df.iloc[i].apply(lambda headline: sid.polarity_scores(" ".join(headline))) 
#     q=t.apply(lambda scores: scores["compound"])
#     up += [len(q[q>0])]
#     neutral += [len(q[q==0])]
#     down += [len(q[q<0])]
    
# stats = df[["date", "label"]]
# stats["up"] = up
# stats["neutral"] = neutral
# stats["down"] = down

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  stats["up"] = up
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  stats["neutral"] = neutral
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  stats["down"] = down


In [81]:
final_df = pd.DataFrame(df["label"])
final_df = final_df.assign(headlines = all_headlines)

In [90]:
final_df["headlines"] = final_df["headlines"].str.join(" ")

In [159]:
final_df.head(2)

NameError: name 'final_df' is not defined

# V. Building the model

In [154]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=1000)
data_features = vectorizer.fit_transform(processed_df['Top1'].str.join(" "))

data_features = data_features.toarray()

data_features.shape

(1989, 1000)

In [153]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(max_features=1000)
data_features = vectorizer.fit_transform(processed_df['Top1'].str.join(" "))

data_features = data_features.toarray()

data_features.shape

NameError: name 'final_df' is not defined

In [156]:
from sklearn.model_selection import train_test_split

X_train_count, X_test_count, y_train_count, y_test_count = train_test_split(data_features, df["label"], test_size=0.3, random_state=42)

In [158]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

forest = RandomForestClassifier(n_estimators=10, n_jobs=4)

forest = forest.fit(X_train_count, y_train_count)

print(forest)

print(np.mean(cross_val_score(forest, data_features, df["label"], cv=10)))

RandomForestClassifier(n_estimators=10, n_jobs=4)
0.5067813816557536


In [4]:
import fasttext

ModuleNotFoundError: No module named 'fasttext'

# VI. Testing the model against test data for the last one month

# VII. Conclusion + Next steps