## Topic Modelling with Gensim and spaCy

Topic modelling is a method for finding groups of words (i.e., topics) in a collection of documents that best represent the information in that collection. It can also be thought of as a form of text mining: a way to find recurring patterns of words in textual material.
There are many techniques for obtaining topic models. In this notebook we will use three of them: LDA, LSI and HDP.

In [1]:
import gensim
import numpy as np
import spacy
from spacy import displacy
from gensim.corpora import Dictionary
from gensim.models import LdaModel
import matplotlib.pyplot as plt

In [2]:
import warnings
import os
warnings.filterwarnings('ignore')  # Let's not pay heed to them right now
%matplotlib inline

We will be using the Lee corpus, which is a shortened version of the [Lee Background Corpus](http://www.socsci.uci.edu/~mdlee/lee_pincombe_welsh_document.PDF). The shortened version consists of 300 documents selected from the Australian Broadcasting Corporation's news mail service. The documents are the texts of headline stories from around 2000-2001.

In [3]:
# Locate the Lee corpus that ships with gensim's test data
test_data_dir = os.path.join(gensim.__path__[0], 'test', 'test_data')
lee_train_file = os.path.join(test_data_dir, 'lee_background.cor')
with open(lee_train_file) as f:
    text = f.read()

Uncomment and run the step below only if you have not yet installed spaCy's English model. Other ways to download the model are described in the [spaCy documentation](https://spacy.io/usage/models).

In [4]:
#!pip3 install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz

In [4]:
import en_core_web_sm

In [5]:
nlp = en_core_web_sm.load()

To preprocess the data we will use the nlp pipeline of spaCy, an industry-grade text-processing package.
First we add a few stop words to spaCy's existing stop word list, using the code below.

In [6]:
stop_words = [u'say', u'\'s', u'Mr', u'be', u'said', u'says', u'saying']
for stopword in stop_words:
    lexeme = nlp.vocab[stopword]
    lexeme.is_stop = True

In [7]:
doc = nlp(text)

In [8]:
doc

Hundreds of people have been forced to vacate their homes in the Southern Highlands of New South Wales as strong winds today pushed a huge bushfire towards the town of Hill Top. A new blaze near Goulburn, south-west of Sydney, has forced the closure of the Hume Highway. At about 4:00pm AEDT, a marked deterioration in the weather as a storm cell moved east across the Blue Mountains forced authorities to make a decision to evacuate people from homes in outlying streets at Hill Top in the New South Wales southern highlands. An estimated 500 residents have left their homes for nearby Mittagong. The New South Wales Rural Fire Service says the weather conditions which caused the fire to burn in a finger formation have now eased and about 60 fire units in and around Hill Top are optimistic of defending all properties. As more than 100 blazes burn on New Year's Eve in New South Wales, fire crews have been called to new fire at Gunning, south of Goulburn. While few details are available at this

In [9]:
texts, article = [], []
for w in doc:
    # if it's not a stop word or punctuation mark, add it to our article!
    if w.text != '\n' and not w.is_stop and not w.is_punct and not w.like_num and w.lemma_ != '-PRON-':
        # we add the lemmatized version of the word
        article.append(w.lemma_)
    # if it's a new line, it means we're onto our next document
    if w.text == '\n':
        texts.append(article)
        article = []
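The grouping logic of the loop above can be tried without spaCy. The following is a minimal sketch under stated assumptions: tokens are plain strings, `STOP_WORDS` is a hypothetical stop list, `str.isalpha` stands in for spaCy's punctuation and number checks, and lowercasing stands in for lemmatization.

```python
# Toy stand-in for the spaCy loop: none of these names come from spaCy.
STOP_WORDS = {"the", "a", "of", "say"}  # hypothetical stop list


def split_into_documents(tokens, stop_words=STOP_WORDS):
    """Group tokens into documents, closing a document at each newline token."""
    documents, current = [], []
    for token in tokens:
        if token == "\n":
            # A newline marks the end of the current document.
            documents.append(current)
            current = []
        elif token.isalpha() and token.lower() not in stop_words:
            # Keep alphabetic tokens that are not stop words.
            current.append(token.lower())
    return documents


toy_tokens = ["The", "fire", "spread", ".", "\n", "A", "storm", "of", "hail", "\n"]
toy_docs = split_into_documents(toy_tokens)
print(toy_docs)  # [['fire', 'spread'], ['storm', 'hail']]
```

As in the loop above, a document is only emitted when a newline is seen, so tokens that follow the final newline are not returned.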

In [10]:
#print(nlp.Defaults.stop_words)
#i = 0
#for stop in nlp.Defaults.stop_words:
    #print(str(i) +' -- ' + stop)
    #i+=1

Let's take a look at the tokenized version of one document.

In [11]:
texts[2]

['the',
 'national',
 'road',
 'toll',
 'christmas',
 'new',
 'year',
 'holiday',
 'period',
 'stand',
 'few',
 'time',
 'year',
 'people',
 'die',
 'new',
 'south',
 'wales',
 'road',
 'fatality',
 'queensland',
 'victoria',
 'western',
 'australia',
 'northern',
 'territory',
 'south',
 'australia',
 'record',
 'death',
 'act',
 'tasmania',
 'remain',
 'fatality',
 'free']

Sometimes topic models make more sense when 'New' and 'York' are treated as 'New_York' - we can do this by creating a bigram model and modifying our corpus accordingly.

In [12]:
bigram = gensim.models.Phrases(texts)

In [13]:
texts = [bigram[line] for line in texts]

In [14]:
texts[2]

['the',
 'national',
 'road',
 'toll',
 'christmas',
 'new',
 'year',
 'holiday',
 'period',
 'stand',
 'few',
 'time',
 'year',
 'people_die',
 'new_south',
 'wales',
 'road',
 'fatality',
 'queensland',
 'victoria',
 'western_australia',
 'northern_territory',
 'south',
 'australia',
 'record',
 'death',
 'act',
 'tasmania',
 'remain',
 'fatality',
 'free']
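To give a feel for how phrase detection can decide that two words belong together, here is a minimal pure-Python sketch of count-based bigram scoring in the style of Mikolov et al. (2013): a pair scores highly when it occurs together more often than its individual word counts would suggest. The function names, the toy sentences, and the `min_count` and `threshold` values are assumptions chosen for illustration; this is not gensim's implementation.

```python
from collections import Counter


def find_bigrams(sentences, min_count=1, threshold=1.0):
    """Score adjacent word pairs and keep those scoring above the threshold."""
    unigram_counts = Counter(w for s in sentences for w in s)
    bigram_counts = Counter(pair for s in sentences for pair in zip(s, s[1:]))
    vocab_size = len(unigram_counts)
    kept = {}
    for (a, b), n_ab in bigram_counts.items():
        # Discount rare pairs by min_count, then normalise by the word counts.
        score = (n_ab - min_count) * vocab_size / (unigram_counts[a] * unigram_counts[b])
        if score > threshold:
            kept[(a, b)] = score
    return kept


def merge_bigrams(sentence, bigrams):
    """Join accepted pairs with an underscore, scanning left to right."""
    merged, i = [], 0
    while i < len(sentence):
        if i + 1 < len(sentence) and (sentence[i], sentence[i + 1]) in bigrams:
            merged.append(sentence[i] + "_" + sentence[i + 1])
            i += 2
        else:
            merged.append(sentence[i])
            i += 1
    return merged


toy_sentences = [
    ["new", "south", "wales", "fire"],
    ["new", "south", "wales", "road"],
    ["new", "year", "road", "toll"],
    ["south", "australia", "fire"],
]
toy_bigrams = find_bigrams(toy_sentences)
print(merge_bigrams(toy_sentences[0], toy_bigrams))  # ['new', 'south_wales', 'fire']
```

In this toy data only `('south', 'wales')` passes the threshold, because "new" and "south" also appear in other contexts. Raising `min_count` or `threshold` makes the detector more conservative.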

Convert the documents into a bag-of-words model.

In [15]:
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

In [16]:
len(corpus)

299
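A bag-of-words representation keeps only which words occur in a document and how often, discarding word order. The sketch below shows the idea in plain Python; `build_vocabulary` and `to_bow` are hypothetical helpers, and the ids they assign need not match the ones gensim's `Dictionary` would produce.

```python
from collections import Counter


def build_vocabulary(texts):
    """Assign an integer id to each distinct token, in order of first appearance."""
    token2id = {}
    for text in texts:
        for token in text:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id


def to_bow(text, token2id):
    """Return a sorted list of (token_id, count) pairs; unknown tokens are ignored."""
    counts = Counter(token2id[t] for t in text if t in token2id)
    return sorted(counts.items())


toy_texts = [["road", "toll", "road"], ["fire", "road"]]
toy_token2id = build_vocabulary(toy_texts)
toy_corpus = [to_bow(t, toy_token2id) for t in toy_texts]
print(toy_corpus)  # [[(0, 2), (1, 1)], [(0, 1), (2, 1)]]
```

Each document becomes a sparse vector: only the ids of words that actually occur are stored, which keeps the representation compact for large vocabularies.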

### LDA

LDA, or Latent Dirichlet Allocation, is arguably the most famous topic modelling algorithm. Here we create a simple topic model with 10 topics.
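LDA treats every topic as a probability distribution over words and every document as a mixture of topics. The sketch below illustrates that assumption with two made-up topics: the probability of a word in a document is the topic-weighted sum of its per-topic probabilities. All names and numbers here are hypothetical and are not taken from the model trained in this notebook.

```python
# Hypothetical per-topic word distributions (each sums to 1).
topic_word = {
    "bushfire": {"fire": 0.6, "wind": 0.3, "test": 0.1},
    "cricket": {"fire": 0.1, "wind": 0.1, "test": 0.8},
}
# Hypothetical topic mixture for a single document (sums to 1).
doc_topic = {"bushfire": 0.75, "cricket": 0.25}


def word_probability(word, doc_topic, topic_word):
    """p(word | doc) = sum over topics of p(topic | doc) * p(word | topic)."""
    return sum(
        weight * topic_word[topic].get(word, 0.0)
        for topic, weight in doc_topic.items()
    )


for word in ("fire", "wind", "test"):
    print(word, round(word_probability(word, doc_topic, topic_word), 3))
```

Training an LDA model works in the opposite direction: given only the observed word counts, it estimates the topic-word distributions and the per-document mixtures that best explain them.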

In [17]:
ldamodel = LdaModel(corpus=corpus, num_topics=10, id2word=dictionary)

In [18]:
ldamodel.show_topics()

[(0,
  '0.014*"the" + 0.005*"pakistan" + 0.005*"group" + 0.005*"government" + 0.004*"attack" + 0.004*"people" + 0.004*"call" + 0.003*"india" + 0.003*"australia" + 0.003*"day"'),
 (1,
  '0.030*"the" + 0.007*"australian" + 0.007*"government" + 0.004*"force" + 0.004*"official" + 0.004*"people" + 0.004*"australia" + 0.004*"afghanistan" + 0.004*"kill" + 0.004*"israeli"'),
 (2,
  '0.015*"the" + 0.004*"force" + 0.004*"area" + 0.004*"australian" + 0.004*"us" + 0.003*"day" + 0.003*"a" + 0.003*"year" + 0.003*"government" + 0.003*"fire"'),
 (3,
  '0.009*"the" + 0.006*"australia" + 0.005*"world" + 0.005*"australian" + 0.004*"people" + 0.004*"test" + 0.004*"good" + 0.004*"day" + 0.004*"start" + 0.004*"be"'),
 (4,
  '0.016*"the" + 0.006*"force" + 0.006*"year" + 0.004*"people" + 0.004*"metre" + 0.004*"australia" + 0.004*"israeli" + 0.004*"fire" + 0.003*"new" + 0.003*"report"'),
 (5,
  '0.008*"australia" + 0.006*"the" + 0.006*"day" + 0.005*"australian" + 0.004*"test" + 0.004*"man" + 0.004*"qantas" + 0

### LSI

LSI stands for Latent Semantic Indexing. It is a popular information retrieval method that works by decomposing the original term-document matrix to retain its key topics. Gensim's implementation uses a singular value decomposition (SVD).
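To make the decomposition idea concrete, the sketch below finds only the strongest latent direction of a tiny term-document matrix. It uses plain power iteration rather than a full truncated SVD, so it is a simplified stand-in for what an LSI implementation computes; the matrix, the term list and the function name are assumptions made for illustration.

```python
import math


def leading_topic(matrix, iterations=100):
    """Return (singular_value, term_vector) for the strongest latent direction.

    `matrix` is a list of rows, one row per term and one column per document.
    """
    n_terms, n_docs = len(matrix), len(matrix[0])
    u = [1.0] * n_terms
    for _ in range(iterations):
        # Project terms onto documents (v = A^T u), then back onto terms (u = A v).
        v = [sum(matrix[i][j] * u[i] for i in range(n_terms)) for j in range(n_docs)]
        u = [sum(matrix[i][j] * v[j] for j in range(n_docs)) for i in range(n_terms)]
        norm = math.sqrt(sum(x * x for x in u))
        u = [x / norm for x in u]
    # With u normalised, the length of A^T u is the leading singular value.
    v = [sum(matrix[i][j] * u[i] for i in range(n_terms)) for j in range(n_docs)]
    sigma = math.sqrt(sum(x * x for x in v))
    return sigma, u


toy_terms = ["fire", "wind", "test", "match"]
toy_matrix = [
    [2, 1, 0, 0],  # fire
    [1, 2, 0, 0],  # wind
    [0, 0, 1, 1],  # test
    [0, 0, 1, 1],  # match
]
sigma, term_vector = leading_topic(toy_matrix)
print(round(sigma, 3), [round(x, 3) for x in term_vector])
```

Here the leading direction loads equally on "fire" and "wind" and almost not at all on "test" and "match", which is how the signed weights in an LSI topic listing can be read: terms with large weights of the same sign tend to co-occur.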

In [19]:
from gensim.models import CoherenceModel, LsiModel, HdpModel

In [20]:
lsimodel = LsiModel(corpus=corpus, num_topics=10, id2word=dictionary)

In [21]:
lsimodel.show_topics(num_topics=5)  # Showing only the top 5 topics

[(0,
  '0.556*"the" + 0.145*"israeli" + 0.144*"force" + 0.138*"palestinian" + 0.130*"arafat" + 0.121*"people" + 0.120*"government" + 0.118*"australian" + 0.116*"australia" + 0.114*"us"'),
 (1,
  '-0.358*"palestinian" + -0.344*"israeli" + -0.332*"arafat" + 0.206*"the" + -0.177*"israel" + -0.154*"sharon" + -0.148*"hamas" + -0.148*"official" + -0.139*"west_bank" + 0.131*"australia"'),
 (2,
  '0.290*"afghanistan" + 0.280*"force" + 0.248*"us" + 0.216*"al_qaeda" + 0.198*"bin_laden" + -0.193*"the" + 0.145*"tora_bora" + 0.141*"pakistan" + 0.133*"fighter" + 0.130*"afghan"'),
 (3,
  '-0.394*"fire" + -0.288*"area" + -0.223*"sydney" + -0.181*"firefighter" + -0.173*"south" + -0.167*"north" + -0.160*"wind" + 0.155*"australia" + -0.151*"wales" + -0.151*"new_south"'),
 (4,
  '-0.340*"the" + 0.247*"test" + 0.209*"day" + 0.204*"good" + 0.178*"match" + 0.168*"win" + -0.165*"company" + 0.149*"play" + 0.136*"wicket" + 0.135*"australia"')]

### HDP

HDP, the Hierarchical Dirichlet Process, is an unsupervised topic model that figures out the number of topics on its own.
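The reason no topic count has to be fixed in advance is the Dirichlet process prior, which spreads weight over an unbounded sequence of topics. One common way to describe that prior is stick-breaking: repeatedly break off a random fraction of what remains of a unit-length stick. The sketch below simulates this with the standard library only; the function name, the `concentration` value and the `truncation` level are assumptions for illustration and say nothing about how gensim's `HdpModel` is implemented.

```python
import random


def stick_breaking(concentration, truncation, seed=0):
    """Draw topic weights by breaking a unit stick `truncation` times."""
    rng = random.Random(seed)
    remaining = 1.0
    weights = []
    for _ in range(truncation):
        # Break off a Beta(1, concentration) fraction of the remaining stick.
        fraction = rng.betavariate(1.0, concentration)
        weights.append(remaining * fraction)
        remaining *= 1.0 - fraction
    return weights


toy_weights = stick_breaking(concentration=1.0, truncation=20)
significant = [w for w in toy_weights if w > 0.01]
print(len(significant), "topics carry more than 1% of the weight")
```

Only a handful of weights are typically non-negligible, so the effective number of topics emerges from the prior and the data rather than from a `num_topics` argument. A larger `concentration` spreads the weight over more topics.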

In [22]:
hdpmodel = HdpModel(corpus=corpus, id2word=dictionary)

In [23]:
hdpmodel.show_topics(num_topics=10)

[(0,
  '0.005*the + 0.005*israeli + 0.003*palestinian + 0.003*government + 0.003*kill + 0.003*group + 0.002*match + 0.002*attack + 0.002*australia + 0.002*play + 0.002*sharon + 0.002*gaza_strip + 0.002*meeting + 0.002*team + 0.002*leave + 0.002*west_bank + 0.002*rafter + 0.002*howard + 0.002*hamas + 0.002*president'),
 (1,
  '0.006*the + 0.004*company + 0.002*australian + 0.002*cent + 0.002*staff + 0.002*austar + 0.002*entitlement + 0.002*receive + 0.002*read + 0.002*$ + 0.001*administrator + 0.001*official + 0.001*cease + 0.001*morning + 0.001*report + 0.001*alarming + 0.001*share + 0.001*redundant + 0.001*pay + 0.001*homeless'),
 (2,
  '0.004*airport + 0.003*taliban + 0.002*opposition + 0.002*kandahar + 0.002*kill + 0.002*night + 0.002*civilian + 0.002*bombing + 0.002*us + 0.002*near + 0.002*unsportsmanlike + 0.001*city + 0.001*villawood + 0.001*leave + 0.001*agha + 0.001*gul + 0.001*wound + 0.001*half + 0.001*lali + 0.001*the'),
 (3,
  '0.004*israeli + 0.004*arafat + 0.002*sharon + 

**pyLDAvis** is designed to help users interpret the topics in a topic model that has been fit to a corpus of text data. The package extracts information from a fitted LDA topic model to inform an interactive web-based visualization.

In [24]:
# Note: pyLDAvis 3.3 and later renamed this module to pyLDAvis.gensim_models
import pyLDAvis.gensim

In [25]:
pyLDAvis.enable_notebook()
pyLDAvis.gensim.prepare(ldamodel, corpus, dictionary)