## Predicting Stock Market Data with Daily News Headlines

This notebook tries several approaches to analyzing daily news headlines in order to predict stock market movement.

In [3]:
import pandas as pd
import numpy as np

In [4]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from collections import Counter
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\prans_\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\prans_\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

We have the DJIA index values and the news headlines for each day, and we try to predict the market movement from this data. Our approach is as follows:

1. Basic info on the DataFrame - missing values, 0/1 count
2. Data cleaning
3. Dictionaries for up/down
4. Sentiment analysis for up/down
5. N-grams
6. Feature extraction - bag of words / tf-idf
7. Performing classification

## Importing Data

The data comes from the combined DJIA and Reddit news table, which covers 2008-08-08 to 2016-07-01.

In [3]:
data = pd.read_csv('Combined_News_DJIA.csv')

In [4]:
data.head()

Unnamed: 0,Date,Label,Top1,Top2,Top3,Top4,Top5,Top6,Top7,Top8,...,Top16,Top17,Top18,Top19,Top20,Top21,Top22,Top23,Top24,Top25
0,2008-08-08,0,"b""Georgia 'downs two Russian warplanes' as cou...",b'BREAKING: Musharraf to be impeached.',b'Russia Today: Columns of troops roll into So...,b'Russian tanks are moving towards the capital...,"b""Afghan children raped with 'impunity,' U.N. ...",b'150 Russian tanks have entered South Ossetia...,"b""Breaking: Georgia invades South Ossetia, Rus...","b""The 'enemy combatent' trials are nothing but...",...,b'Georgia Invades South Ossetia - if Russia ge...,b'Al-Qaeda Faces Islamist Backlash',"b'Condoleezza Rice: ""The US would not act to p...",b'This is a busy day: The European Union has ...,"b""Georgia will withdraw 1,000 soldiers from Ir...",b'Why the Pentagon Thinks Attacking Iran is a ...,b'Caucasus in crisis: Georgia invades South Os...,b'Indian shoe manufactory - And again in a se...,b'Visitors Suffering from Mental Illnesses Ban...,"b""No Help for Mexico's Kidnapping Surge"""
1,2008-08-11,1,b'Why wont America and Nato help us? If they w...,b'Bush puts foot down on Georgian conflict',"b""Jewish Georgian minister: Thanks to Israeli ...",b'Georgian army flees in disarray as Russians ...,"b""Olympic opening ceremony fireworks 'faked'""",b'What were the Mossad with fraudulent New Zea...,b'Russia angered by Israeli military sale to G...,b'An American citizen living in S.Ossetia blam...,...,b'Israel and the US behind the Georgian aggres...,"b'""Do not believe TV, neither Russian nor Geor...",b'Riots are still going on in Montreal (Canada...,b'China to overtake US as largest manufacturer',b'War in South Ossetia [PICS]',b'Israeli Physicians Group Condemns State Tort...,b' Russia has just beaten the United States ov...,b'Perhaps *the* question about the Georgia - R...,b'Russia is so much better at war',"b""So this is what it's come to: trading sex fo..."
2,2008-08-12,0,b'Remember that adorable 9-year-old who sang a...,"b""Russia 'ends Georgia operation'""","b'""If we had no sexual harassment we would hav...","b""Al-Qa'eda is losing support in Iraq because ...",b'Ceasefire in Georgia: Putin Outmaneuvers the...,b'Why Microsoft and Intel tried to kill the XO...,b'Stratfor: The Russo-Georgian War and the Bal...,"b""I'm Trying to Get a Sense of This Whole Geor...",...,b'U.S. troops still in Georgia (did you know t...,b'Why Russias response to Georgia was right',"b'Gorbachev accuses U.S. of making a ""serious ...","b'Russia, Georgia, and NATO: Cold War Two'",b'Remember that adorable 62-year-old who led y...,b'War in Georgia: The Israeli connection',b'All signs point to the US encouraging Georgi...,b'Christopher King argues that the US and NATO...,b'America: The New Mexico?',"b""BBC NEWS | Asia-Pacific | Extinction 'by man..."
3,2008-08-13,0,b' U.S. refuses Israel weapons to attack Iran:...,"b""When the president ordered to attack Tskhinv...",b' Israel clears troops who killed Reuters cam...,b'Britain\'s policy of being tough on drugs is...,b'Body of 14 year old found in trunk; Latest (...,b'China has moved 10 *million* quake survivors...,"b""Bush announces Operation Get All Up In Russi...",b'Russian forces sink Georgian ships ',...,b'Elephants extinct by 2020?',b'US humanitarian missions soon in Georgia - i...,"b""Georgia's DDOS came from US sources""","b'Russian convoy heads into Georgia, violating...",b'Israeli defence minister: US against strike ...,b'Gorbachev: We Had No Choice',b'Witness: Russian forces head towards Tbilisi...,b' Quarter of Russians blame U.S. for conflict...,b'Georgian president says US military will ta...,b'2006: Nobel laureate Aleksander Solzhenitsyn...
4,2008-08-14,1,b'All the experts admit that we should legalis...,b'War in South Osetia - 89 pictures made by a ...,b'Swedish wrestler Ara Abrahamian throws away ...,b'Russia exaggerated the death toll in South O...,b'Missile That Killed 9 Inside Pakistan May Ha...,"b""Rushdie Condemns Random House's Refusal to P...",b'Poland and US agree to missle defense deal. ...,"b'Will the Russians conquer Tblisi? Bet on it,...",...,b'Bank analyst forecast Georgian crisis 2 days...,"b""Georgia confict could set back Russia's US r...",b'War in the Caucasus is as much the product o...,"b'""Non-media"" photos of South Ossetia/Georgia ...",b'Georgian TV reporter shot by Russian sniper ...,b'Saudi Arabia: Mother moves to block child ma...,b'Taliban wages war on humanitarian aid workers',"b'Russia: World ""can forget about"" Georgia\'s...",b'Darfur rebels accuse Sudan of mounting major...,b'Philippines : Peace Advocate say Muslims nee...


### Basic Info on DataFrames

In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1989 entries, 0 to 1988
Data columns (total 27 columns):
Date     1989 non-null object
Label    1989 non-null int64
Top1     1989 non-null object
Top2     1989 non-null object
Top3     1989 non-null object
Top4     1989 non-null object
Top5     1989 non-null object
Top6     1989 non-null object
Top7     1989 non-null object
Top8     1989 non-null object
Top9     1989 non-null object
Top10    1989 non-null object
Top11    1989 non-null object
Top12    1989 non-null object
Top13    1989 non-null object
Top14    1989 non-null object
Top15    1989 non-null object
Top16    1989 non-null object
Top17    1989 non-null object
Top18    1989 non-null object
Top19    1989 non-null object
Top20    1989 non-null object
Top21    1989 non-null object
Top22    1989 non-null object
Top23    1988 non-null object
Top24    1986 non-null object
Top25    1986 non-null object
dtypes: int64(1), object(26)
memory usage: 419.6+ KB


In [6]:
data.isnull().sum()

Date     0
Label    0
Top1     0
Top2     0
Top3     0
Top4     0
Top5     0
Top6     0
Top7     0
Top8     0
Top9     0
Top10    0
Top11    0
Top12    0
Top13    0
Top14    0
Top15    0
Top16    0
Top17    0
Top18    0
Top19    0
Top20    0
Top21    0
Top22    0
Top23    1
Top24    3
Top25    3
dtype: int64

There are only 7 missing values, all in the last 3 columns (`Top23`, `Top24` and `Top25`).

In [7]:
import nltk
import spacy

In [8]:
import spacy
import en_core_web_md

### Data Pre-Processing

We will do some basic pre-processing of the text data. This includes:
1. Converting to lower case.
2. Removing the leading `b` from each headline. It is the prefix of a Python bytes literal left over in the stored text, not a bold tag.
3. Removing the punctuation, which also removes the quotes that wrapped the bytes literal.

After doing this, we concatenate all 25 headline columns into one single column. This gives one clean string per day containing that day's top 25 headlines.
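
The three steps and the concatenation can be sketched on a couple of shortened headlines without pandas. This is only an illustration: `clean_headline` and `combine_headlines` are hypothetical helper names, the sketch strips the `b` only when it really is a bytes-literal prefix (the notebook cells below drop the first character unconditionally), and a missing headline is represented here by `None`.

```python
import re

def clean_headline(raw):
    """Lower-case a headline, drop a leading bytes-literal `b`, strip punctuation."""
    text = str(raw).lower()
    if text.startswith(("b'", 'b"')):
        text = text[1:]  # drop only the `b`; the quotes go away with the punctuation
    return re.sub(r"[^\w\s]", "", text).strip()

def combine_headlines(headlines):
    """Join the cleaned headlines of one day into a single space-separated string."""
    return " ".join(clean_headline(h) for h in headlines if h is not None)

# Shortened example headlines; None stands in for a missing Top23-Top25 entry
day = [
    "b'BREAKING: Musharraf to be impeached.'",
    'b"Georgia \'downs two Russian warplanes\'"',
    None,
]
print(combine_headlines(day))
```

Skipping missing entries explicitly avoids turning them into the literal text `nan`, which is what `astype(str)` does with a missing value.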

In [10]:
data = data.apply(lambda x: x.astype(str).str.lower())

In [11]:
date = data["Date"]
label = data['Label']

In [12]:
del data['Date']
del data["Label"]

In [13]:
data = data.apply(lambda x: x.astype(str).str[1:])

In [14]:
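# regex=True is required in recent pandas versions, where str.replace treats the pattern literally by default
data = data.apply(lambda x: x.astype(str).str.replace(r'[^\w\s]', '', regex=True))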

In [15]:
# Join the 25 headline columns of each day into one space-separated string
top_cols = ['Top' + str(i) for i in range(1, 26)]
data['combined'] = data[top_cols].agg(' '.join, axis=1)

In [16]:
# Drop the 25 individual headline columns now that they are combined
data = data.drop(columns=['Top' + str(i) for i in range(1, 26)])

In [17]:
data['Date'] = date
data['Label'] = label

In [18]:
data.head()

Unnamed: 0,combined,Date,Label
0,georgia downs two russian warplanes as countri...,2008-08-08,0
1,why wont america and nato help us if they wont...,2008-08-11,1
2,remember that adorable 9yearold who sang at th...,2008-08-12,0
3,us refuses israel weapons to attack iran repo...,2008-08-13,0
4,all the experts admit that we should legalise ...,2008-08-14,1


In [19]:
data['Label'].value_counts()

1    1065
0     924
Name: Label, dtype: int64

In [20]:
data['Date'].max()

'2016-07-01'

In [21]:
data['Date'].min()

'2008-08-08'

In [22]:
data[data.Date.str.contains('2008-08-08')].combined.tolist()

['georgia downs two russian warplanes as countries move to brink of war breaking musharraf to be impeached russia today columns of troops roll into south ossetia footage from fighting youtube russian tanks are moving towards the capital of south ossetia which has reportedly been completely destroyed by georgian artillery fire afghan children raped with impunity un official says  this is sick a three year old was raped and they do nothing 150 russian tanks have entered south ossetia whilst georgia shoots down two russian jets breaking georgia invades south ossetia russia warned it would intervene on sos side the enemy combatent trials are nothing but a sham salim haman has been sentenced to 5 12 years but will be kept longer anyway just because they feel like it georgian troops retreat from s osettain capital presumably leaving several hundred people killed video did the us prep georgia for war with russia rice gives green light for israel to attack iran says us has no veto over israeli

In [23]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1989 entries, 0 to 1988
Data columns (total 3 columns):
combined    1989 non-null object
Date        1989 non-null object
Label       1989 non-null object
dtypes: object(3)
memory usage: 46.7+ KB


### Up and Down dictionaries

We will now split the data into up-market and down-market days. We then tokenize each corpus, drop the stop words, and count the most frequent words in each. These counts are our up and down dictionaries.

In [24]:
ups = data[data.Label == '1']
ups_corpus = ups['combined'].tolist()

In [25]:
downs = data[data.Label == '0']
downs_corpus = downs['combined'].tolist()

In [26]:
# Join with a space so the last word of one day does not fuse with the first word of the next
ups_corpus_str = ' '.join(ups_corpus)
downs_corpus_str = ' '.join(downs_corpus)

In [27]:
stop_words = set(stopwords.words('english'))

In [28]:
word_tokens_up = word_tokenize(ups_corpus_str)
filtered_sentence_ups = [w for w in word_tokens_up if w not in stop_words]

In [29]:
word_tokens_down = word_tokenize(downs_corpus_str)
filtered_sentence_downs = [w for w in word_tokens_down if w not in stop_words]

In [30]:
ups_counter = Counter(filtered_sentence_ups)
print(ups_counter.most_common(10))

[('us', 1975), ('says', 1384), ('new', 1102), ('government', 1035), ('police', 956), ('people', 944), ('world', 844), ('israel', 837), ('years', 823), ('war', 813)]


In [31]:
downs_counter = Counter(filtered_sentence_downs)
print(downs_counter.most_common(10))

[('us', 1759), ('says', 1162), ('new', 1024), ('government', 876), ('people', 862), ('police', 848), ('world', 796), ('years', 750), ('war', 719), ('israel', 696)]
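
The two top-10 lists contain the same ten words, so raw counts say little about market direction. One possible follow-up, sketched here on made-up token lists rather than the notebook's data, is to rank words by the log-ratio of their smoothed relative frequencies in the two corpora. The function name `log_ratio`, the `min_count` cut-off and the add-one smoothing are illustrative assumptions, not part of the original analysis.

```python
import math
from collections import Counter

def log_ratio(up_tokens, down_tokens, min_count=2):
    """Score each word by log(P(word | up) / P(word | down)) with add-one smoothing."""
    up, down = Counter(up_tokens), Counter(down_tokens)
    n_up, n_down = sum(up.values()), sum(down.values())
    vocab = set(up) | set(down)
    scores = {}
    for word in vocab:
        if up[word] + down[word] < min_count:
            continue  # too rare to say anything about
        p_up = (up[word] + 1) / (n_up + len(vocab))
        p_down = (down[word] + 1) / (n_down + len(vocab))
        scores[word] = math.log(p_up / p_down)
    return scores

# Toy token lists (hypothetical, for illustration only)
up_tokens = "us says rally growth us deal growth".split()
down_tokens = "us says crisis war us crisis".split()

scores = log_ratio(up_tokens, down_tokens)
for word, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{word:8s} {score:+.2f}")
```

Words that are equally common in both corpora, such as `us` here, score close to zero, while words concentrated in one corpus move to the ends of the ranking.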


### Sentiment Analysis - Using VADER and TextBlob

In [32]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer as SIA
nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\prans_\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


True

In [33]:
headlines = ['positive vibes', 'negative shit']
sia = SIA()
results = []

for line in headlines:
    pol_score = sia.polarity_scores(line)
    pol_score['headline'] = line
    results.append(pol_score)

print(results[:3])

[{'neg': 0.0, 'neu': 0.217, 'pos': 0.783, 'compound': 0.5574, 'headline': 'positive vibes'}, {'neg': 1.0, 'neu': 0.0, 'pos': 0.0, 'compound': -0.8074, 'headline': 'negative shit'}]


In [34]:
from textblob import TextBlob

In [35]:
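def sentiments(x):
    # Compute TextBlob polarity (-1 to 1) and subjectivity (0 to 1) for each day's combined headlines
    polarity = []
    subjectivity = []
    for text in x['combined']:
        blob = TextBlob(text)
        polarity.append(blob.sentiment.polarity)
        subjectivity.append(blob.sentiment.subjectivity)
    # The columns are added to x in place, so the caller's DataFrame is modified as well
    x['polarity'] = polarity
    x['subjectivity'] = subjectivity
    return x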

In [36]:
analyze_sentiments = sentiments(data)

In [37]:
analyze_sentiments.head()

Unnamed: 0,combined,Date,Label,polarity,subjectivity
0,georgia downs two russian warplanes as countri...,2008-08-08,0,-0.048568,0.267549
1,why wont america and nato help us if they wont...,2008-08-11,1,0.109325,0.374806
2,remember that adorable 9yearold who sang at th...,2008-08-12,0,-0.044302,0.536234
3,us refuses israel weapons to attack iran repo...,2008-08-13,0,0.005842,0.364021
4,all the experts admit that we should legalise ...,2008-08-14,1,0.035469,0.375099


In [38]:
len(analyze_sentiments[(analyze_sentiments['Label'] == '0') & (analyze_sentiments['polarity'] < 0)])

319

In [39]:
len(analyze_sentiments[(analyze_sentiments['Label'] == '0')])

924

In [40]:
len(analyze_sentiments[(analyze_sentiments['Label'] == '1') & (analyze_sentiments['polarity'] > 0)])

722

In [41]:
len(analyze_sentiments[analyze_sentiments['Label'] == '1'])

1065

**Thus, the headlines have negative polarity on about 35% of down days (319 of 924) and positive polarity on about 68% of up days (722 of 1065). These figures are conditioned on the market direction, so they do not by themselves show that positive sentiment predicts a rise: the remaining 605 down days (about 65%) have zero or positive polarity as well, so headline polarity is mostly non-negative whichever way the market moves.**
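
The counts above are conditioned on the label. To ask the predictive question directly, that is, how often the market rises given the sign of the polarity, the conditioning has to be reversed. The sketch below does this on a small made-up sample; `polarity_table` and `p_up_given` are hypothetical helper names, and the labels are strings because the `Label` column was converted to text during cleaning.

```python
from collections import Counter

def polarity_table(labels, polarities):
    """Count days by (label, sign of polarity)."""
    table = Counter()
    for label, pol in zip(labels, polarities):
        sign = "pos" if pol > 0 else "neg" if pol < 0 else "zero"
        table[(label, sign)] += 1
    return table

def p_up_given(table, sign):
    """Estimate P(Label == '1' | polarity sign) from the count table."""
    up, down = table[("1", sign)], table[("0", sign)]
    return up / (up + down) if up + down else float("nan")

# Toy data (hypothetical): labels as strings, polarity as floats
labels = ["1", "1", "1", "0", "0", "0", "1", "0"]
polarities = [0.10, 0.05, -0.02, 0.03, -0.04, 0.08, 0.01, 0.02]

table = polarity_table(labels, polarities)
print("P(up | polarity > 0) =", p_up_given(table, "pos"))
print("P(up | polarity < 0) =", p_up_given(table, "neg"))
```

If the two estimates come out close to each other, as they do in this toy sample, polarity carries little information about direction.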

### Performing Tf-Idf

In [44]:
import os.path
from gensim import corpora
from gensim.models import LsiModel
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from gensim.models.coherencemodel import CoherenceModel
import matplotlib.pyplot as plt

In [45]:
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
import string


In [46]:
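from sklearn.feature_extraction.text import CountVectorizer

# Assumption: `sample` is the single-day corpus printed by In [22] (the 2008-08-08 headlines),
# which matches the `sample` output shown in the classification section
sample = data[data.Date.str.contains('2008-08-08')].combined.tolist()

vec = CountVectorizer(binary=False)  # binary=False is the default, so the argument could be omitted
vec.fit(sample)

# Term counts, one column per vocabulary word
a = pd.DataFrame(vec.transform(sample).toarray(), columns=sorted(vec.vocabulary_.keys()))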

In [47]:
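from sklearn.feature_extraction.text import TfidfVectorizer
vec = TfidfVectorizer()
vec.fit(sample)

a = pd.DataFrame(vec.transform(sample).toarray(), columns=sorted(vec.vocabulary_.keys()))
a = a.transpose()
# The column is labelled 'idf', but the values are tf-idf weights. With a single document
# every term has the same idf, so they reduce to L2-normalised term counts.
a.columns = ['idf']
a = a.sort_values('idf', ascending=False)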

In [48]:
a.head(10)

Unnamed: 0,idf
of,0.337385
the,0.337385
georgia,0.276042
ossetia,0.2147
to,0.2147
south,0.2147
and,0.184028
russian,0.153357
war,0.153357
in,0.153357
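
The table is topped by stop words such as `of` and `the` because the vectorizer was fitted on a single document: every term then has the same idf, and the weights are just normalised counts. The idf part only starts to matter with several documents. The sketch below computes tf-idf by hand with the smoothed idf, `ln((1 + n) / (1 + df)) + 1`, and L2 row normalisation that scikit-learn documents as its defaults. It is an illustration on toy documents, the function name `tfidf` is hypothetical, and it splits on whitespace rather than using scikit-learn's token pattern.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return (idf, rows): smoothed idf per term and L2-normalised tf-idf per document."""
    tokenised = [doc.split() for doc in docs]
    n = len(tokenised)
    # Document frequency: in how many documents each term appears
    df = Counter(word for tokens in tokenised for word in set(tokens))
    idf = {word: math.log((1 + n) / (1 + df[word])) + 1 for word in df}
    rows = []
    for tokens in tokenised:
        weights = {word: count * idf[word] for word, count in Counter(tokens).items()}
        norm = math.sqrt(sum(v * v for v in weights.values())) or 1.0
        rows.append({word: v / norm for word, v in weights.items()})
    return idf, rows

# Toy corpus (hypothetical): 'georgia' appears in every document, the other words in one
docs = ["georgia war russia", "georgia market rally", "georgia oil price"]
idf, rows = tfidf(docs)
print(idf["georgia"], idf["war"])
print(rows[0])

# With a single document every idf equals 1, so only the counts matter
single_idf, single_rows = tfidf(["the war the"])
print(single_idf, single_rows[0])
```

A term that occurs in every document gets the minimum idf of 1, so rarer terms are weighted up relative to it.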


## Performing Classification - Using SpaCy and Scikit-Learn

In [49]:
sample

['georgia downs two russian warplanes as countries move to brink of war breaking musharraf to be impeached russia today columns of troops roll into south ossetia footage from fighting youtube russian tanks are moving towards the capital of south ossetia which has reportedly been completely destroyed by georgian artillery fire afghan children raped with impunity un official says  this is sick a three year old was raped and they do nothing 150 russian tanks have entered south ossetia whilst georgia shoots down two russian jets breaking georgia invades south ossetia russia warned it would intervene on sos side the enemy combatent trials are nothing but a sham salim haman has been sentenced to 5 12 years but will be kept longer anyway just because they feel like it georgian troops retreat from s osettain capital presumably leaving several hundred people killed video did the us prep georgia for war with russia rice gives green light for israel to attack iran says us has no veto over israeli

In [50]:
data.head()

Unnamed: 0,combined,Date,Label,polarity,subjectivity
0,georgia downs two russian warplanes as countri...,2008-08-08,0,-0.048568,0.267549
1,why wont america and nato help us if they wont...,2008-08-11,1,0.109325,0.374806
2,remember that adorable 9yearold who sang at th...,2008-08-12,0,-0.044302,0.536234
3,us refuses israel weapons to attack iran repo...,2008-08-13,0,0.005842,0.364021
4,all the experts admit that we should legalise ...,2008-08-14,1,0.035469,0.375099


In [51]:
data_classify = data[['combined', 'Label']]

In [52]:
data_classify.head()

Unnamed: 0,combined,Label
0,georgia downs two russian warplanes as countri...,0
1,why wont america and nato help us if they wont...,1
2,remember that adorable 9yearold who sang at th...,0
3,us refuses israel weapons to attack iran repo...,0
4,all the experts admit that we should legalise ...,1


In [54]:
data_classify.Label.value_counts()

1    1065
0     924
Name: Label, dtype: int64

In [55]:
sample = data_classify[:100].sample(100)  # the first 100 days, in shuffled order

In [56]:
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(sample.combined)
X_train_counts.shape

(100, 7073)

In [57]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape

(100, 7073)

In [58]:
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(X_train_tfidf, sample.Label)

In [59]:
from sklearn.pipeline import Pipeline
text_clf = Pipeline([('vect', CountVectorizer()),
                    ('tfidf', TfidfTransformer()),
                    ('clf', MultinomialNB())])

In [60]:
text_clf = text_clf.fit(sample.combined, sample.Label)

In [61]:
sample_test=data.sample(20)

In [62]:
# Fraction of the 20 sampled days whose label is predicted correctly
predicted = text_clf.predict(sample_test.combined)
np.mean(predicted == sample_test.Label)

0.25

In [63]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
from sklearn.base import TransformerMixin
from sklearn.pipeline import Pipeline

In [64]:
import string
import spacy
from spacy.lang.en.stop_words import STOP_WORDS

# Punctuation characters to drop from the token list
punctuations = string.punctuation

# spaCy's built-in English stop-word set
stop_words = STOP_WORDS

# Load the English model imported earlier; the 'en' shortcut no longer works in spaCy 3.
# The parser and NER components are not needed for lemmatization, so they are disabled.
nlp = spacy.load('en_core_web_md', disable=['parser', 'ner'])

# Creating our tokenizer function
def spacy_tokenizer(sentence):
    # Run the pipeline to get tokens with lemmas (a blank English() pipeline only tokenizes in spaCy 3)
    mytokens = nlp(sentence)

    # Lemmatizing each token and converting it into lowercase ("-PRON-" is the spaCy 2 pronoun lemma)
    mytokens = [ word.lemma_.lower().strip() if word.lemma_ != "-PRON-" else word.lower_ for word in mytokens ]

    # Removing stop words, punctuation and empty strings
    mytokens = [ word for word in mytokens if word and word not in stop_words and word not in punctuations ]

    # return preprocessed list of tokens
    return mytokens


In [65]:
class predictors(TransformerMixin):
    def transform(self, X, **transform_params):
        # Cleaning Text
        return [clean_text(text) for text in X]

    def fit(self, X, y=None, **fit_params):
        return self

    def get_params(self, deep=True):
        return {}

# Basic function to clean the text
def clean_text(text):
    # Removing spaces and converting text into lowercase
    return text.strip().lower()

In [66]:
bow_vector = CountVectorizer(tokenizer = spacy_tokenizer, ngram_range=(1,1))
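
`ngram_range=(1,1)` restricts the vocabulary to single words. The N-grams step from the outline would widen that range, for example `ngram_range=(1, 2)` to include two-word phrases as well. What such a range produces can be sketched in plain Python; `ngrams` and `ngram_range_features` are hypothetical helper names, and tokens are split on whitespace here instead of using the spaCy tokenizer.

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_range_features(text, lo=1, hi=2):
    """Collect the n-grams of every length from lo to hi inclusive."""
    tokens = text.split()
    features = []
    for n in range(lo, hi + 1):
        features.extend(ngrams(tokens, n))
    return features

print(ngram_range_features("russia ends georgia operation", 1, 2))
```

Bigrams keep phrases such as `south ossetia` together, at the cost of a much larger and sparser vocabulary.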

In [67]:
tfidf_vector = TfidfVectorizer(tokenizer = spacy_tokenizer)

In [68]:
from sklearn.model_selection import train_test_split

X = data_classify['combined'] # the features we want to analyze
ylabels = data_classify['Label'] # the labels, or answers, we want to test against

X_train, X_test, y_train, y_test = train_test_split(X, ylabels, test_size=0.3)

In [72]:
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression()

# Create pipeline using tf-idf features
pipe = Pipeline([("cleaner", predictors()),
                 ('vectorizer', tfidf_vector),
                 ('classifier', classifier)])

# model generation
pipe.fit(X_train,y_train)

Pipeline(memory=None,
         steps=[('cleaner', <__main__.predictors object at 0x000001A211B7EF60>),
                ('vectorizer',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=1.0, max_features=None,
                                 min_df=1, ngram_range=(1, 1), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop...
                                 tokenizer=<function spacy_tokenizer at 0x000001A211E40488>,
                                 use_idf=True, vocabulary=None)),
                ('classifier',
                 LogisticRegression(C=1.0, class_weight=None, dual=False,
                                    fit_intercept=True, intercept_scaling=1,
         

In [73]:
from sklearn import metrics
# Predicting with a test dataset
predicted = pipe.predict(X_test)

# Model Accuracy
print("Logistic Regression Accuracy:",metrics.accuracy_score(y_test, predicted))
#print("Logistic Regression Precision:",metrics.precision_score(y_test, predicted))
#print("Logistic Regression Recall:",metrics.recall_score(y_test, predicted))

Logistic Regression Accuracy: 0.509212730318258


**Thus we get about 51% accuracy on the held-out 30% of the data with logistic regression on tf-idf features, which is essentially chance level for a two-class problem. We need a better approach to improve accuracy.**
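
A quick way to put that accuracy in context is the majority-class baseline, the accuracy of always predicting the more frequent label. The sketch below computes it from the class counts printed by `value_counts()` earlier (1065 up days, 924 down days). It uses the whole dataset rather than the random test split, so the baseline on the actual test set may differ slightly; `majority_baseline` is a hypothetical helper name.

```python
from collections import Counter

def majority_baseline(labels):
    """Return the most frequent label and the accuracy of always predicting it."""
    counts = Counter(labels)
    label, count = counts.most_common(1)[0]
    return label, count / len(labels)

# Class counts taken from the value_counts() output earlier in the notebook
labels = ["1"] * 1065 + ["0"] * 924

label, accuracy = majority_baseline(labels)
print(f"always predict {label!r}: accuracy {accuracy:.3f}")
```

Any model worth keeping should clear this baseline on the same test set.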