This notebook analyzes data taken from Kaggle. For background on the data, visit 
https://www.kaggle.com/aaron7sun/stocknews. The dataset was originally intended for fitting models on each day's headlines to predict whether the Dow Jones Industrial Average (DJIA) would go up or down. That analysis has been done in a different notebook. Since we have this collection of headlines covering about 8 years, I also decided to try out some more learning methods with it. This notebook contains a topic modeling analysis of the headlines from 2008-2016.

In [10]:
from nltk.corpus import stopwords 
from nltk.stem.wordnet import WordNetLemmatizer
import string
import pandas as pd
from gensim.models import ldamodel
from gensim import corpora
import pyLDAvis.gensim as gensimvis  # newer pyLDAvis releases name this module pyLDAvis.gensim_models
import pyLDAvis
import re
from gensim.utils import ClippedCorpus

We split the data at the date '2015-01-01', as this gives an approximately neat split between classes. Headlines dated before it go to the training set and the rest go to the test set.

In [11]:
data = pd.read_csv('C:/Users/manas/OneDrive/Documents/practice/stocknews/RedditNews.csv')
train = data[data['Date']<'2015-01-01']
test = data[data['Date']>'2014-12-31']
news = list(train['News'])
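
The date filter above compares plain strings. That works as long as the Date column holds zero-padded YYYY-MM-DD text (an assumption about the CSV, not something the notebook checks), because such strings sort in calendar order. A minimal sketch with made-up dates:

```python
# Hypothetical sample dates; the real values come from the Date column.
dates = ["2014-12-30", "2014-12-31", "2015-01-01", "2015-01-02"]

# Strings are compared character by character, which matches calendar
# order for zero-padded YYYY-MM-DD values.
train_dates = [d for d in dates if d < "2015-01-01"]
test_dates = [d for d in dates if d > "2014-12-31"]

# Every date lands in exactly one of the two splits.
assert sorted(train_dates + test_dates) == sorted(dates)
print(train_dates)
print(test_dates)
```

If the column used another format, such as DD/MM/YYYY, the same comparison would silently give a wrong split, so parsing the column with pd.to_datetime first would be the safer route.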

The stop words come from nltk.corpus. Apart from those, there were some problematic tokens that were fragments or variants of common words, so I've added them to the list as well. The cell below removes the stop words, normalises (lemmatizes) each headline, and stores the cleaned result in data_clean.

In [12]:
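# NLTK's English stop words plus corpus-specific tokens that proved problematic
stop = set(stopwords.words('english') + ['ba', 'bthe', 'b', 'r', 'n', 'amp',
           'girl', 'woman', 'world', 'u', 'year'])
lemma = WordNetLemmatizer()
def clean(doc):
    # Replace every non-letter character with a space
    punc_free = re.sub("[^a-zA-Z]", " ", doc)
    # Lowercase and drop stop words
    stop_free = " ".join([i for i in punc_free.lower().split() if i not in stop])
    # Lemmatize what is left; because filtering happens first, forms such as
    # "women" or "years" pass the filter and are then reduced to "woman" and "year"
    normalized = " ".join(lemma.lemmatize(word) for word in stop_free.split())
    return normalized
# One list of tokens per training headline
data_clean = [clean(doc).split() for doc in news]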
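
The order of the two steps in clean() matters. The sketch below is an illustration only: TOY_LEMMAS, STOP, and toy_lemmatize are hypothetical stand-ins for WordNetLemmatizer and the real stop list, used so the example runs without NLTK data. It compares filtering before lemmatizing (what clean() does) with the reverse order, which may explain why "woman" and "year" still show up in the topic output even though both are in the stop list.

```python
# Toy stand-in for WordNetLemmatizer: maps a few plural forms to their lemma.
TOY_LEMMAS = {"women": "woman", "years": "year", "girls": "girl"}
STOP = {"woman", "year", "girl", "the", "of"}

def toy_lemmatize(word):
    return TOY_LEMMAS.get(word, word)

def stop_then_lemmatize(text):
    # Same order as clean(): filter first, lemmatize second.
    kept = [w for w in text.lower().split() if w not in STOP]
    return [toy_lemmatize(w) for w in kept]

def lemmatize_then_stop(text):
    # Reverse order: lemmatize first so inflected forms are filtered too.
    lemmas = [toy_lemmatize(w) for w in text.lower().split()]
    return [w for w in lemmas if w not in STOP]

headline = "Thousands of women protest after years of abuse"
print(stop_then_lemmatize(headline))
print(lemmatize_then_stop(headline))
```

In the first variant "woman" and "year" survive; in the second they are removed.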

Next we convert the dataset to a gensim dictionary for further use and filter it: tokens that appear in fewer than 100 headlines, or in more than 5% of all headlines, are dropped. Each headline is then converted to a bag-of-words vector.

In [13]:
dictionary = corpora.Dictionary(data_clean)
dictionary.filter_extremes(no_below = 100, no_above = 0.05)
d_mat = [dictionary.doc2bow(doc) for doc in data_clean]
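
To make the effect of no_below and no_above concrete, here is a simplified pure-Python sketch of document-frequency filtering. It is not gensim's implementation (filter_by_doc_frequency is a hypothetical helper, and gensim additionally caps the vocabulary size through keep_n); it only illustrates the two thresholds on a tiny made-up corpus.

```python
from collections import Counter

def filter_by_doc_frequency(docs, no_below, no_above):
    """Keep tokens found in at least `no_below` documents and in at most
    a `no_above` fraction of all documents."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each token once per document
    max_docs = no_above * len(docs)
    return {tok for tok, df in doc_freq.items() if no_below <= df <= max_docs}

# Made-up tokenized headlines
docs = [
    ["oil", "price", "rise"],
    ["oil", "minister", "resign"],
    ["oil", "price", "fall"],
    ["oil", "court", "ruling"],
]
kept = filter_by_doc_frequency(docs, no_below=2, no_above=0.75)
print(sorted(kept))
```

Here "oil" is dropped for being too common (it is in every document) and the one-off words are dropped for being too rare, leaving only "price".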

I then clip the corpus to its first 5000 documents (ClippedCorpus limits the number of documents, not words), pass it to the LdaModel class, and fit an LDA model with 10 topics.

In [14]:
clipped_corpus = ClippedCorpus(d_mat, 5000)
Lda = ldamodel.LdaModel
ldamod = Lda(clipped_corpus, num_topics = 10, id2word = dictionary, passes = 15, random_state = 20)
ldamod.print_topics(10)

[(0,
  '0.030*"border" + 0.028*"ban" + 0.020*"chinese" + 0.018*"french" + 0.016*"army" + 0.015*"aid" + 0.014*"war" + 0.014*"government" + 0.012*"china" + 0.012*"country"'),
 (1,
  '0.020*"police" + 0.018*"law" + 0.018*"uk" + 0.017*"anti" + 0.017*"power" + 0.017*"germany" + 0.016*"protest" + 0.016*"hong" + 0.015*"kong" + 0.014*"china"'),
 (2,
  '0.030*"child" + 0.029*"north" + 0.027*"korea" + 0.023*"woman" + 0.019*"nsa" + 0.019*"snowden" + 0.018*"new" + 0.017*"abuse" + 0.015*"south" + 0.015*"cup"'),
 (3,
  '0.064*"state" + 0.038*"iraq" + 0.036*"islamic" + 0.028*"isi" + 0.017*"american" + 0.016*"soldier" + 0.015*"iraqi" + 0.014*"muslim" + 0.012*"group" + 0.011*"member"'),
 (4,
  '0.038*"minister" + 0.020*"court" + 0.017*"oil" + 0.017*"saudi" + 0.017*"prime" + 0.016*"government" + 0.015*"australian" + 0.014*"school" + 0.014*"country" + 0.014*"president"'),
 (5,
  '0.022*"year" + 0.019*"china" + 0.017*"death" + 0.015*"tax" + 0.013*"phone" + 0.013*"egypt" + 0.013*"missile" + 0.013*"syrian" 

The output above lists the topics along with the words most strongly associated with each.
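
The topic strings printed by print_topics are handy to read but awkward to process. As a small sketch, the hypothetical helper parse_topic below turns one such string into (word, weight) pairs with a regular expression. The sample string is copied from the start of topic 3 in the output; gensim's show_topic method returns similar pairs directly, so this is only needed when all you have is the printed text.

```python
import re

def parse_topic(topic_str):
    """Turn '0.030*"border" + 0.028*"ban"' into [('border', 0.03), ('ban', 0.028)]."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_str)
    return [(word, float(weight)) for weight, word in pairs]

topic_3 = '0.064*"state" + 0.038*"iraq" + 0.036*"islamic" + 0.028*"isi"'
terms = parse_topic(topic_3)
print(terms)
```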

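The next step is to pass a headline to the model and see which topic it assigns.

In [15]:
# NOTE: the sample comes from `train`, so the model may already have seen it;
# build the list from test['News'] to score a genuinely unseen headline.
testnews = list(train['News'])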
clean_test = clean(testnews[1]).split()
samplen = dictionary.doc2bow(clean_test)
lda_vector = ldamod[samplen]
print(testnews[1])
print(ldamod.print_topic(max(lda_vector, key=lambda item: item[1])[0]))




North Korean defector details 'human experiments' -- use of mentally and physically handicapped children in chemical weapons tests 'the last straw'
0.030*"child" + 0.029*"north" + 0.027*"korea" + 0.023*"woman" + 0.019*"nsa" + 0.019*"snowden" + 0.018*"new" + 0.017*"abuse" + 0.015*"south" + 0.015*"cup"
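
The last line of the cell picks the most probable topic from the model's output, which is a list of (topic_id, probability) pairs. A minimal sketch of that selection, using made-up probabilities rather than real model output:

```python
# Hypothetical output of ldamod[bow]: (topic_id, probability) pairs.
lda_vector = [(0, 0.05), (2, 0.71), (5, 0.24)]

# max() with a key on the probability returns the whole winning pair.
best_topic, best_prob = max(lda_vector, key=lambda item: item[1])
print(best_topic, best_prob)

# Sorting instead also exposes the runner-up topics.
ranked = sorted(lda_vector, key=lambda item: item[1], reverse=True)
print(ranked)
```

Looking at the full ranking rather than only the top entry can help when a headline is split between two topics with similar probabilities.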


In [16]:
pyLDAvis.enable_notebook()

As we can see, the model does a fairly good job of predicting a related topic in this one case. Below, the topics and their related words are displayed. 
As you hover over the topics, you can see that, intuitively, the model has done a fairly good job of grouping commonly associated words together.

In [17]:
plot = gensimvis.prepare(ldamod,clipped_corpus, dictionary)
pyLDAvis.display(plot)



A possible next step is to combine this topic model with the predictive model, to see whether specific topics tend to coincide with stocks rising or falling, and to use that signal in the prediction.
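
One way this combination could be set up, purely as a sketch under assumptions: average the topic distributions of all headlines from the same day into one feature vector per date, which could then be joined to the daily up/down label. The function daily_topic_features, the sample dates, and the probabilities below are all hypothetical; nothing like this exists in the notebook yet.

```python
from collections import defaultdict

NUM_TOPICS = 3  # the notebook uses 10; 3 keeps the example short

# Hypothetical (date, topic distribution) pairs, one per headline.
headline_topics = [
    ("2015-01-02", [(0, 0.8), (2, 0.2)]),
    ("2015-01-02", [(1, 0.6), (2, 0.4)]),
    ("2015-01-05", [(2, 1.0)]),
]

def daily_topic_features(rows, num_topics):
    """Average the topic probabilities of all headlines sharing a date."""
    sums = defaultdict(lambda: [0.0] * num_topics)
    counts = defaultdict(int)
    for date, dist in rows:
        for topic_id, prob in dist:
            sums[date][topic_id] += prob
        counts[date] += 1
    return {d: [s / counts[d] for s in sums[d]] for d in sums}

features = daily_topic_features(headline_topics, NUM_TOPICS)
for date, vec in sorted(features.items()):
    print(date, [round(v, 2) for v in vec])
```

Each date then has a fixed-length vector that a classifier can consume alongside, or instead of, the raw headline features.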