# Homework 5: Topic Modelling with Latent Dirichlet Allocation

------------------------------------------------------
*Machine Learning, Master in Big Data Analytics, 2018-2019*

*Pablo M. Olmos olmos@tsc.uc3m.es*

------------------------------------------------------

The goal of this homework is, first, to introduce the pre-processing tasks that must be run over a corpus of documents before its structure can be analyzed with topic models. We will use the library [nltk](https://www.nltk.org/) to perform document tokenization, stemming, and lemmatization. Then, you will explore the library [gensim](https://radimrehurek.com/gensim/) and run LDA over the processed corpus.


The data set we’ll use is a list of over one million news headlines published over a period of 15 years and can be downloaded from [Kaggle](https://www.kaggle.com/therohk/million-headlines/data).

In [1]:
# Load the dataset from the CSV and save it to 'data_text'

import pandas as pd
# on_bad_lines='skip' requires pandas >= 1.3; older versions use error_bad_lines=False instead
data = pd.read_csv('abcnews-date-text.csv', on_bad_lines='skip')

# We only need the headline text column. We keep the first 300,000 headlines
# so that model training does not take too long.

# .copy() avoids a SettingWithCopyWarning when the 'index' column is added
data_text = data[:300000][['headline_text']].copy()
data_text['index'] = data_text.index

documents = data_text

In [2]:
# preview documents

documents.head(50)

| | headline_text | index |
|---|---|---|
| 0 | aba decides against community broadcasting lic... | 0 |
| 1 | act fire witnesses must be aware of defamation | 1 |
| 2 | a g calls for infrastructure protection summit | 2 |
| 3 | air nz staff in aust strike for pay rise | 3 |
| 4 | air nz strike to affect australian travellers | 4 |
| 5 | ambitious olsson wins triple jump | 5 |
| 6 | antic delighted with record breaking barca | 6 |
| 7 | aussie qualifier stosur wastes four memphis match | 7 |
| 8 | aust addresses un security council over iraq | 8 |
| 9 | australia is locked into war timetable opp | 9 |


## Data pre-processing
We will perform the following steps for data pre-processing:

1. Tokenization.
2. Lowercase the words and remove punctuation.
3. Remove stop words.
4. Stemming and Lemmatization.

**Tokenization** is the process of splitting the given text into smaller pieces called tokens. Words, numbers, punctuation marks, and others can be considered as tokens.

**Stop words** are the most common words in a language, such as “the”, “a”, “on”, “is”, “all”. These words do not carry important meaning and are usually removed from texts.

**Stemming** is a process of reducing words to their word stem, base or root form (for example, books — book, looked — look).

**Lemmatization**, like stemming, aims to reduce inflectional forms to a common base form. Unlike stemming, lemmatization does not simply chop off inflections. Instead, it uses lexical knowledge bases to get the correct base forms of words.
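
As a rough illustration of the first three steps, here is a minimal standard-library sketch. The stop-word set `TOY_STOPWORDS` and the function `toy_preprocess` are hypothetical names invented for this example; the notebook itself relies on gensim's `simple_preprocess` and `STOPWORDS`, whose behaviour is richer than this. Stemming and lemmatization are left out because they need nltk.

```python
import re

# Hypothetical, deliberately tiny stop-word list; gensim's STOPWORDS is far larger.
TOY_STOPWORDS = {"the", "a", "on", "is", "all", "for", "in", "to", "of"}


def toy_preprocess(text, min_len=4):
    """Tokenize, lowercase, and drop stop words and short tokens."""
    # Lowercase first, then keep alphabetic runs only; this discards
    # punctuation and digits in a single pass.
    tokens = re.findall(r"[a-z]+", text.lower())
    # Keep tokens that are not stop words and have at least min_len characters.
    return [t for t in tokens if t not in TOY_STOPWORDS and len(t) >= min_len]


print(toy_preprocess("Air NZ staff in Aust strike for pay rise!"))
```

The `min_len=4` default mirrors the `len(token) > 3` filter used later in this notebook.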


In [3]:
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.stem.porter import *
import numpy as np
np.random.seed(400)



In [4]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\niall\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [5]:
# Lemmatizer example
print(WordNetLemmatizer().lemmatize('went', pos = 'v')) # past tense to base form (lemma)

go


In [6]:
# stemmer example
stemmer = SnowballStemmer("english")
original_words = ['caresses', 'flies', 'dies', 'mules', 'denied','died', 'agreed', 'owned', 
           'humbled', 'sized','meeting', 'stating', 'siezing', 'itemization','sensational', 
           'traditional', 'reference', 'colonizer','plotted']
singles = [stemmer.stem(plural) for plural in original_words]

pd.DataFrame(data={'original word':original_words, 'stemmed':singles })

| | original word | stemmed |
|---|---|---|
| 0 | caresses | caress |
| 1 | flies | fli |
| 2 | dies | die |
| 3 | mules | mule |
| 4 | denied | deni |
| 5 | died | die |
| 6 | agreed | agre |
| 7 | owned | own |
| 8 | humbled | humbl |
| 9 | sized | size |


In [7]:
# Pre-processing steps applied to every headline

# Create the lemmatizer once instead of on every call
lemmatizer = WordNetLemmatizer()

def lemmatize_stemming(token):
    # Lemmatize the token as a verb, then stem the resulting lemma
    return stemmer.stem(lemmatizer.lemmatize(token, pos='v'))

# Tokenize, drop stop words and tokens of 3 characters or fewer, then lemmatize and stem
def preprocess(text):
    result = []
    for token in simple_preprocess(text):
        if token not in STOPWORDS and len(token) > 3:
            result.append(lemmatize_stemming(token))
    return result

In [8]:
# Preview a document after preprocessing

document_num = 4310
doc_sample = documents[documents['index'] == document_num].values[0][0]

print("Original document: ")
words = []
for word in doc_sample.split(' '):
    words.append(word)
print(words)
print("\n\nTokenized and lemmatized document: ")
print(preprocess(doc_sample))

Original document: 
['rain', 'helps', 'dampen', 'bushfires']


Tokenized and lemmatized document: 
['rain', 'help', 'dampen', 'bushfir']


In [9]:
# preprocess all the headlines, saving the list of results as 'processed_docs'
processed_docs = documents['headline_text'].map(preprocess)

In [10]:
# Bag of words on the dataset
# Create a dictionary from 'processed_docs'
dictionary = gensim.corpora.Dictionary(processed_docs)

# Checking dictionary created
count = 0
for k, v in dictionary.iteritems():
    print(k, v)
    count += 1
    if count > 10:
        break

0 broadcast
1 communiti
2 decid
3 licenc
4 awar
5 defam
6 wit
7 call
8 infrastructur
9 protect
10 summit


In [11]:
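# Remove tokens that appear in fewer than 15 documents or in more than 10% of the documents,
# then keep at most the 100000 most frequent of the remaining tokens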

dictionary.filter_extremes(no_below=15, no_above=0.1, keep_n=100000)
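
The document-frequency filtering above can be sketched in plain Python. This is only an approximation of the idea under stated assumptions: `toy_filter_extremes` and the toy corpus are hypothetical, the `keep_n` cap is omitted, and gensim's own handling of the `no_above` threshold may differ in rounding.

```python
from collections import Counter


def toy_filter_extremes(docs, no_below, no_above):
    """Return the set of tokens that survive document-frequency filtering."""
    # Document frequency: the number of documents containing each token
    # at least once (set(doc) prevents double counting within a document).
    doc_freq = Counter(token for doc in docs for token in set(doc))
    # Convert the no_above fraction into an absolute document count.
    max_docs = no_above * len(docs)
    return {tok for tok, df in doc_freq.items() if no_below <= df <= max_docs}


docs = [
    ["rain", "help", "fire"],
    ["rain", "polic"],
    ["rain", "help"],
    ["rain", "court"],
]
kept = toy_filter_extremes(docs, no_below=2, no_above=0.75)
print(sorted(kept))
```

In this toy corpus "rain" occurs in every document and is dropped as too common, while "fire", "polic", and "court" occur only once and are dropped as too rare.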

In [12]:
# Obtain a bag-of-words document representation

bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs]

In [13]:
document_num = 20

bow_corpus[document_num]

[(75, 1), (76, 1), (77, 1), (78, 1)]

In [14]:
# Preview the BOW representation of the sample document

# document_num is 20 here, as set in the previous cell
bow_doc = bow_corpus[document_num]

# Each entry is a (token id, count) pair
for word_id, count in bow_doc:
    print("Word {} (\"{}\") appears {} time.".format(word_id, dictionary[word_id], count))

Word 75 ("attack") appears 1 time.
Word 76 ("busi") appears 1 time.
Word 77 ("prepar") appears 1 time.
Word 78 ("terrorist") appears 1 time.
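
To make the `(token id, count)` format concrete, here is a small pure-Python sketch of a bag-of-words encoding. The helpers `build_vocab` and `to_bow` are hypothetical stand-ins for `gensim.corpora.Dictionary` and `doc2bow`; ids here are assigned in order of first appearance, which need not match gensim's numbering.

```python
from collections import Counter


def build_vocab(docs):
    """Map each distinct token to an integer id, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab


def to_bow(doc, vocab):
    """Encode a tokenized document as a sorted list of (token_id, count) pairs."""
    # Tokens missing from the vocabulary are silently skipped.
    counts = Counter(vocab[t] for t in doc if t in vocab)
    return sorted(counts.items())


docs = [["rain", "help", "dampen", "bushfir"], ["rain", "rain", "flood"]]
vocab = build_vocab(docs)
print(vocab)
print(to_bow(docs[1], vocab))
```

Word order is discarded: only which tokens occur, and how often, is retained.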


## Calling LDA using Bag of Words

**Question 1 (4 points):** Run LDA with 10 topics over the processed corpus of headlines using the LDA implementation in the [gensim](https://radimrehurek.com/gensim/) library.

In [33]:
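# Running LDA using Bag of Words
from gensim.models.ldamodel import LdaModel

# Note: the question asks for 10 topics, but this run uses num_topics=11,
# which is why the outputs below list topic ids 0 to 10.
lda = LdaModel(corpus=bow_corpus, id2word=dictionary, num_topics=11, update_every=1, passes=1)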


**Question 2 (2 points):** Print the 10 most probable words in every topic. Can you interpret the meaning of each topic?

In [38]:
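# Show the 10 most probable words per topic as (word, probability) pairs.
# show_topics() returns at most 10 topics by default; pass num_topics=-1 to list all of them.

lda.show_topics(formatted=False)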



[(3,
  [('water', 0.061926607),
   ('plan', 0.039452564),
   ('group', 0.020389773),
   ('govt', 0.020238133),
   ('council', 0.017405178),
   ('attack', 0.017086718),
   ('green', 0.017064193),
   ('home', 0.016446479),
   ('time', 0.012299835),
   ('question', 0.011681642)]),
 (9,
  [('polic', 0.06812261),
   ('urg', 0.03079354),
   ('help', 0.023746645),
   ('closer', 0.02069798),
   ('school', 0.017897306),
   ('resid', 0.01558193),
   ('driver', 0.014870643),
   ('australian', 0.014323024),
   ('probe', 0.013226297),
   ('arrest', 0.013063837)]),
 (5,
  [('charg', 0.04166429),
   ('court', 0.035764642),
   ('face', 0.033504464),
   ('accus', 0.021950422),
   ('murder', 0.021135312),
   ('case', 0.015734756),
   ('indigen', 0.015662214),
   ('road', 0.015103767),
   ('fight', 0.014883258),
   ('jail', 0.014655143)]),
 (7,
  [('elect', 0.024042891),
   ('offer', 0.019546736),
   ('fear', 0.0193431),
   ('centr', 0.018985467),
   ('deal', 0.0187054),
   ('want', 0.018188717),
   ('pl

**Question 3 (2 points):**

Print the topic proportions of document number 17320.

In [40]:
vec = bow_corpus[17320]

lda.get_document_topics(vec)

[(0, 0.022727275),
 (1, 0.27278066),
 (2, 0.022727393),
 (3, 0.022730427),
 (4, 0.022727273),
 (5, 0.022727704),
 (6, 0.022727974),
 (7, 0.022727916),
 (8, 0.02272795),
 (9, 0.022727273),
 (10, 0.5226682)]
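
One reading that is consistent with these numbers can be checked with a short calculation. The sketch below assumes gensim's default symmetric prior (`alpha = 1 / num_topics`) and assumes, purely for illustration, that the document contributes three tokens, two assigned to topic 10 and one to topic 1. That token count is inferred from the output, not read from the data.

```python
num_topics = 11
alpha = 1.0 / num_topics  # assumed symmetric Dirichlet prior

# Assumed (hypothetical) hard assignment of the document's three tokens.
counts = [0.0] * num_topics
counts[10] = 2.0
counts[1] = 1.0

# Posterior pseudo-counts are prior plus assigned tokens; normalizing
# them gives the topic proportions.
gamma = [alpha + c for c in counts]
total = sum(gamma)
theta = [g / total for g in gamma]

for topic_id, proportion in enumerate(theta):
    print(topic_id, round(proportion, 4))
```

Under these assumptions every unused topic receives `(1/11) / 4 = 1/44`, which is about 0.0227, so the small equal values in the output reflect the prior rather than evidence from the headline.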

**Question 4 (2 points):**

Find the 10 most similar documents to document number 17320.

In [42]:
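from gensim.similarities import MatrixSimilarity

# Cosine similarity is computed on the raw bag-of-words vectors;
# num_best=10 returns the 10 highest-scoring documents
index = MatrixSimilarity(bow_corpus, num_features=len(dictionary), num_best=10)

# The query document itself is part of the index, so it appears in the results
similar_docs = index[vec]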

In [44]:
print(similar_docs)

[(17320, 0.9999999403953552), (218668, 0.6666666269302368), (71926, 0.6666666269302368), (89070, 0.5773502588272095), (160198, 0.5773502588272095), (80394, 0.5773502588272095), (19351, 0.5773502588272095), (222894, 0.5773502588272095), (100011, 0.5773502588272095), (258039, 0.5773502588272095)]
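
The scores above are cosine similarities between bag-of-words vectors. The following sketch shows how values such as 0.667 and 0.577 can arise; the token ids and documents are hypothetical, and `cosine` is an illustrative helper rather than part of gensim.

```python
import math


def cosine(bow_a, bow_b):
    """Cosine similarity between two sparse (token_id, count) lists."""
    a, b = dict(bow_a), dict(bow_b)
    # Only token ids present in both documents contribute to the dot product.
    dot = sum(weight * b.get(token_id, 0) for token_id, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


query = [(0, 1), (1, 1), (2, 1)]         # three distinct tokens
shares_two = [(0, 1), (1, 1), (5, 1)]    # three tokens, two in common
single_token = [(0, 1)]                  # one token, shared with the query

print(cosine(query, shares_two))    # 2 / (sqrt(3) * sqrt(3))
print(cosine(query, single_token))  # 1 / (sqrt(3) * 1)
```

Because this measure works on word overlap, two headlines about the same topic that share no tokens score 0. Comparing the documents' LDA topic distributions instead would be one way to capture topical similarity.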
