# Topic Modelling


# Table of contents
* [Import libraries](#importlibraries)
* [Task 2: Topic Modelling](#task_2)
    * [Part 1: LDA Model](#step3)
        * [Preparing Data](#step3_1)
        * [Training LDA model 1st (Topics =4)](#step3_2)
        * [Training LDA model 2nd (Topics =8)](#step3_3)
    * [Part 2: Nonnegative Matrix Factorization](#step4)
        * [Preparing Data](#step4_1)
        * [Training NMF model](#step4_2)


# Introduction
<a id="introduction"></a>

This assignment has two main tasks.

The first task uses a neural network method and classical machine learning methods to build three text classifiers that predict three classes (InfoTheory, CompVis, and Math) from the Abstract field.

The second task uses the LDA and NMF methods to perform topic modelling.

# Import libraries
<a id="importlibraries"></a>
Import some libraries for this assignment:

In [1]:
# Third-party packages (uncomment to install).
#!pip install torch
#!pip install torchtext
#!pip install spacy
#!pip install pandas
#!pip install numpy
#!pip install nltk
#!pip install scikit-learn
#!pip install seaborn
#!pip install matplotlib
#!pip3 install gensim
#!pip3 install pyLDAvis
# warnings, time, pickle, pprint, re and string ship with Python and need no installation.

In [2]:
import torch
import torch.optim as optim
from torchtext import data
from torchtext.data import TabularDataset
import torch.nn as nn
import spacy
from spacy.lang.en.stop_words import STOP_WORDS as en_stop
import warnings
import pandas as pd
import time
import numpy as np
from nltk.corpus import stopwords
from nltk import word_tokenize, pos_tag   
from nltk.tokenize import wordpunct_tokenize
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score, confusion_matrix, matthews_corrcoef
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB, GaussianNB,BernoulliNB
from sklearn.svm import LinearSVC, SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
import seaborn as sns
import matplotlib.pyplot as plt
from gensim.models import Phrases
from gensim.corpora import Dictionary
from gensim.models import LdaModel
from gensim import corpora, models
import pickle
import pyLDAvis.gensim
from pprint import pprint
import re
import string
import pyLDAvis
import pyLDAvis.sklearn
%matplotlib inline

# Task 2: Topic Modelling
<a id="task_2"></a>
In this task, we use the LDA and NMF methods to perform topic modelling and then use visualisation to analyse the results.

## Part 1: LDA Model
<a id="step3"></a>

### Preparing Data
<a id="step3_1"></a>

In [59]:
text_data = []
# Read the crawled articles.
df = pd.read_csv('Monash_crawled.csv')
# Collect the article bodies into a list of strings.
docs = df['body'].tolist()
#print(docs[0][0:500])

In [60]:
# Tokenize the documents.
# Split the documents into tokens by using RegexpTokenizer function.
tokenizer = RegexpTokenizer(r'\w+')
for idx in range(len(docs)):
    # Convert the documents into lowercase.
    docs[idx] = docs[idx].lower()
    # Split the documents into words.
    docs[idx] = tokenizer.tokenize(docs[idx])  

# Remove numbers, but not words that contain numbers.
docs = [[token for token in doc if not token.isnumeric()] for doc in docs]

# Remove one character words.
docs = [[token for token in doc if len(token) > 1] for doc in docs]

# Use WordNetLemmatizer to lemmatize the documents
lemmatizer = WordNetLemmatizer()
docs = [[lemmatizer.lemmatize(token) for token in doc] for doc in docs]
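
The cleaning rules in this cell can be tried on a single sentence without any NLTK data. The sketch below is a minimal stand-in, not the notebook's pipeline: it assumes `re.findall(r'\w+', ...)` behaves like `RegexpTokenizer(r'\w+').tokenize(...)` for this pattern, it skips lemmatization, and the function name `preprocess_sketch` and the sample sentence are made up for illustration.

```python
import re

def preprocess_sketch(text):
    """Lowercase, split into word tokens, drop pure numbers and 1-character tokens."""
    tokens = re.findall(r'\w+', text.lower())
    # Drop tokens that are entirely numeric; "b2b" survives because it also has letters.
    tokens = [t for t in tokens if not t.isnumeric()]
    # Drop one-character tokens such as "a".
    tokens = [t for t in tokens if len(t) > 1]
    return tokens

sample = "In 2020, a B2B firm tested 3 masks."
sample_tokens = preprocess_sketch(sample)
print(sample_tokens)
```

Note that punctuation is never matched by `\w+`, so it disappears at the tokenization step rather than needing a separate rule.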

In [61]:
# Detect bigrams; candidate pairs seen fewer than 20 times are ignored (min_count=20).
bigram = Phrases(docs, min_count=20)

# Add bigrams to docs
for idx in range(len(docs)):
    for token in bigram[docs[idx]]:
        if '_' in token:
            # Token is a bigram, add to document.
            docs[idx].append(token)
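
`min_count` is not the only filter: a candidate pair also has to pass a score threshold. The sketch below assumes the default scoring formula described in the gensim documentation for `Phrases`, namely `(count_ab - min_count) * vocab_size / (count_a * count_b)`, with a pair accepted when the score exceeds `threshold` (assumed default 10.0). All counts are invented for illustration, and `phrase_score` is a hypothetical helper, not a gensim function.

```python
def phrase_score(count_a, count_b, count_ab, vocab_size, min_count):
    """Score for the candidate bigram "a b" (assumed default Phrases formula)."""
    return (count_ab - min_count) * vocab_size / (count_a * count_b)

MIN_COUNT = 20      # same value as in the notebook
THRESHOLD = 10.0    # assumed gensim default

# Invented counts: two words that almost always occur together.
score_rare_pair = phrase_score(count_a=60, count_b=55, count_ab=50,
                               vocab_size=5000, min_count=MIN_COUNT)
# Invented counts: two very frequent words that are sometimes adjacent.
score_common_pair = phrase_score(count_a=5000, count_b=8000, count_ab=500,
                                 vocab_size=5000, min_count=MIN_COUNT)

print(round(score_rare_pair, 2), score_rare_pair > THRESHOLD)
print(round(score_common_pair, 2), score_common_pair > THRESHOLD)
```

Under this formula, lowering `threshold` or `min_count` admits more bigrams, so both values are worth revisiting if the detected phrases look too sparse or too noisy.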

In [62]:
# Build a dictionary that maps each unique token to an integer id and records token frequencies.
dictionary = Dictionary(docs)

# Drop tokens that occur in fewer than 20 documents or in more than 50% of the documents.
dictionary.filter_extremes(no_below=20, no_above=0.5)
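
To make the two thresholds concrete, here is a small pure-Python sketch of a document-frequency filter. It is only an approximation of what `filter_extremes` does (it ignores the separate cap on vocabulary size, for instance), and the helper name `filter_by_doc_freq` and the toy documents are assumptions made for illustration.

```python
from collections import Counter

def filter_by_doc_freq(docs, no_below, no_above):
    """Return the tokens kept by a document-frequency filter."""
    num_docs = len(docs)
    # Count, for every token, the number of documents that contain it.
    doc_freq = Counter(token for doc in docs for token in set(doc))
    return {token for token, df in doc_freq.items()
            if df >= no_below and df / num_docs <= no_above}

toy_docs = [
    ['fire', 'smoke', 'the'],
    ['fire', 'the'],
    ['ship', 'the'],
    ['ship', 'cruise', 'the'],
]
kept = filter_by_doc_freq(toy_docs, no_below=2, no_above=0.5)
print(sorted(kept))
```

In the toy data, `the` is dropped for being in every document, while `smoke` and `cruise` are dropped for appearing in only one.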

In [63]:
# Convert each document into a bag-of-words vector of (token_id, count) pairs.
corpus = [dictionary.doc2bow(doc) for doc in docs]
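
A bag-of-words vector is simply a list of `(token_id, count)` pairs for the tokens that are in the dictionary. The sketch below mimics that idea in plain Python; `doc_to_bow` and the toy id mapping are hypothetical, and the ids gensim assigns in practice will differ.

```python
from collections import Counter

def doc_to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs; tokens missing from token2id are skipped."""
    counts = Counter(token2id[token] for token in doc if token in token2id)
    return sorted(counts.items())

toy_token2id = {'coast': 0, 'island': 1, 'research': 2}
toy_doc = ['island', 'coast', 'island', 'volcano']
toy_bow = doc_to_bow(toy_doc, toy_token2id)
print(toy_bow)
```

Tokens removed by the dictionary filter (here the made-up word `volcano`) silently vanish from the vector, and word order is lost.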


Preview the bag-of-words vector for a sample preprocessed document.

In [64]:
doc_125 = corpus[125]
for i in range(len(doc_125)):
    print("Word {} (\"{}\") appears {} time.".format(doc_125[i][0], dictionary[doc_125[i][0]], doc_125[i][1]))

Word 50 ("during") appears 1 time.
Word 68 ("ha_been") appears 1 time.
Word 75 ("high") appears 1 time.
Word 79 ("important") appears 1 time.
Word 85 ("known") appears 1 time.
Word 92 ("make") appears 1 time.
Word 105 ("off") appears 1 time.
Word 133 ("science") appears 1 time.
Word 134 ("scientist") appears 1 time.
Word 148 ("statement") appears 1 time.
Word 156 ("think") appears 1 time.
Word 166 ("very") appears 1 time.
Word 174 ("world") appears 1 time.
Word 190 ("anything") appears 1 time.
Word 193 ("at_monash") appears 1 time.
Word 206 ("coast") appears 1 time.
Word 254 ("island") appears 2 time.
Word 279 ("monday") appears 1 time.
Word 284 ("much") appears 1 time.
Word 285 ("near") appears 1 time.
Word 290 ("on_monday") appears 1 time.
Word 320 ("research") appears 1 time.
Word 326 ("say") appears 1 time.
Word 335 ("show") appears 1 time.
Word 343 ("strong") appears 1 time.
Word 352 ("thought") appears 1 time.
Word 380 ("before") appears 1 time.
Word 411 ("monash_university") app

In [65]:
print('Number of unique tokens: %d' % len(dictionary))
print('Number of documents: %d' % len(corpus))

Number of unique tokens: 1412
Number of documents: 366


In [66]:
# save the corpus and the dictionary
pickle.dump(corpus, open('corpus.pkl', 'wb'))
dictionary.save('dictionary.gensim')

In [67]:
# Create tf-idf model object
tfidf = models.TfidfModel(corpus)
# apply transformation to the entire corpus
corpus_tfidf = tfidf[corpus]
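
For intuition, the sketch below computes tf-idf weights by hand on a toy corpus. It assumes gensim's default weighting (raw count times `log2(N / df)`, followed by L2 normalisation of each document); `tfidf_sketch` is a hypothetical helper written for this illustration. Note also that the `LdaModel` calls in this notebook are given `corpus`, not `corpus_tfidf`, so the topics below are learned from raw counts.

```python
import math
from collections import Counter

def tfidf_sketch(bow_corpus):
    """Weight each (token_id, count) pair by count * log2(N / df), then L2-normalise."""
    num_docs = len(bow_corpus)
    # Document frequency: in how many documents does each token id occur?
    doc_freq = Counter(token_id for doc in bow_corpus for token_id, _ in doc)
    weighted_corpus = []
    for doc in bow_corpus:
        weights = [(token_id, count * math.log2(num_docs / doc_freq[token_id]))
                   for token_id, count in doc]
        norm = math.sqrt(sum(w * w for _, w in weights))
        # Tokens that occur in every document get weight 0 and are dropped.
        weighted_corpus.append([(token_id, w / norm) for token_id, w in weights if w > 0])
    return weighted_corpus

toy_corpus = [[(0, 1), (1, 2)], [(0, 1), (2, 1)]]
toy_tfidf = tfidf_sketch(toy_corpus)
print(toy_tfidf)
```

Token 0 appears in both toy documents, so its weight is zero and only the document-specific tokens remain.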

### Training LDA model 1st (Topics =4)<a id="step3_2"></a>

In [68]:
# Set training parameters.
# Number of topics.
NUM_TOPICS = 4
# Number of documents processed at a time by the training algorithm.
chunksize = 2000
# Number of passes over the whole corpus (epochs).
passes = 20
# Maximum number of inference iterations per document.
iterations = 400
# Do not evaluate perplexity during training; it slows training down.
eval_every = None

# Build the id-to-word mapping. Dictionary.id2token is filled lazily,
# so access one item first to trigger its creation.
temp = dictionary[0]
id2word = dictionary.id2token

model = LdaModel(
    corpus=corpus,
    id2word=id2word,
    chunksize=chunksize,
    alpha='auto',
    eta='auto',
    iterations=iterations,
    num_topics=NUM_TOPICS,
    passes=passes,
    eval_every=eval_every,
    random_state=5125
)
outputfile = f'model{NUM_TOPICS}.gensim'
print("Saving model in " + outputfile)
print("")
model.save(outputfile)

Saving model in model4.gensim



In [69]:
# Return a DataFrame with the top words of each LDA topic, one column per topic.
def lda_topics_words(model, NUM_TOPICS):
    word_dict = {}
    for i in range(NUM_TOPICS):
        # show_topic returns (word, probability) pairs sorted by probability.
        words = model.show_topic(i, topn=15)
        word_dict['Topic ' + str(i+1)] = [word for word, _ in words]
    return pd.DataFrame(word_dict)

In [70]:
lda_topics_words(model, NUM_TOPICS)

| | Topic 1 | Topic 2 | Topic 3 | Topic 4 |
|---|---|---|---|---|
| 0 | wuhan | she | fire | you |
| 1 | chinese | her | cent | area |
| 2 | flight | woman | per | your |
| 3 | confirmed | patient | per_cent | work |
| 4 | ship | study | climate | say |
| 5 | passenger | mask | smoke | student |
| 6 | outbreak | face | air | should |
| 7 | sydney | his | pandemic | cell |
| 8 | student | say | change | school |
| 9 | japan | my | bushfires | our |


In [71]:
# display the result of LDA
lda_display_4 = pyLDAvis.gensim.prepare(model, corpus, dictionary, sort_topics=False)
pyLDAvis.display(lda_display_4)
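
The intertopic distance map places topics according to how different their word distributions are. By default pyLDAvis measures this with the Jensen-Shannon divergence and then projects the distances to two dimensions with principal coordinate analysis. The sketch below shows only the divergence step on made-up distributions; `jensen_shannon` is written here for illustration and is not the pyLDAvis implementation.

```python
import numpy as np

def jensen_shannon(p, q):
    """Jensen-Shannon divergence (natural log) between two probability vectors."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = 0.5 * (p + q)

    def kl(a, b):
        # Terms with a == 0 contribute 0 to the Kullback-Leibler sum.
        mask = a > 0
        return np.sum(a[mask] * np.log(a[mask] / b[mask]))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Made-up topic-word distributions over a 4-word vocabulary.
topic_a = [0.70, 0.20, 0.05, 0.05]
topic_b = [0.65, 0.25, 0.05, 0.05]
topic_c = [0.05, 0.05, 0.20, 0.70]

print(jensen_shannon(topic_a, topic_b))  # similar topics
print(jensen_shannon(topic_a, topic_c))  # dissimilar topics
```

Topics whose circles overlap in the visualisation are therefore those whose word distributions are close under this measure.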

### Training LDA model 2nd (Topics =8)
<a id="step3_3"></a>

In [72]:
# Set training parameters.
# Number of topics.
NUM_TOPICS = 8
# Number of documents processed at a time by the training algorithm.
chunksize = 2000
# Number of passes over the whole corpus (epochs).
passes = 20
# Maximum number of inference iterations per document.
iterations = 400
# Do not evaluate perplexity during training; it slows training down.
eval_every = None

# Build the id-to-word mapping. Dictionary.id2token is filled lazily,
# so access one item first to trigger its creation.
temp = dictionary[0]
id2word = dictionary.id2token

model = LdaModel(
    corpus=corpus,
    id2word=id2word,
    chunksize=chunksize,
    alpha='auto',
    eta='auto',
    iterations=iterations,
    num_topics=NUM_TOPICS,
    passes=passes,
    eval_every=eval_every,
    random_state=5125
)
outputfile = f'model{NUM_TOPICS}.gensim'
print("Saving model in " + outputfile)
print("")
model.save(outputfile)

Saving model in model8.gensim



In [73]:
lda_topics_words(model, NUM_TOPICS)

| | Topic 1 | Topic 2 | Topic 3 | Topic 4 | Topic 5 | Topic 6 | Topic 7 | Topic 8 |
|---|---|---|---|---|---|---|---|---|
| 0 | wuhan | she | fire | school | student | area | say | ship |
| 1 | symptom | her | smoke | you | chinese | study | you | cruise |
| 2 | chinese | woman | air | should | wuhan | patient | his | passenger |
| 3 | patient | mask | bushfires | home | flight | you | change | princess |
| 4 | outbreak | face | cent | pandemic | ban | analysis | what | cruise_ship |
| 5 | confirmed | face_mask | per | covid | travel | data | climate | diamond_princess |
| 6 | hospital | store | per_cent | state | island | research | like | diamond |
| 7 | spread | business | million | professor | february_february | using | how | japan |
| 8 | infected | hand | bushfire | food | january_january | between | them | flight |
| 9 | disease | just | climate | need | pictured | used | specie | board |
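
One simple way to compare the 4-topic and 8-topic models is to measure how much the top-word lists overlap. The sketch below uses the Jaccard index on word lists copied from the two tables in this section (only the ten rows displayed); this comparison is an illustrative addition rather than part of the original analysis, and `jaccard` is a helper defined here.

```python
def jaccard(words_a, words_b):
    """Share of distinct words that two word lists have in common."""
    a, b = set(words_a), set(words_b)
    return len(a & b) / len(a | b)

# Top words copied from the tables above (first ten rows as displayed).
lda4_topic1 = ['wuhan', 'chinese', 'flight', 'confirmed', 'ship',
               'passenger', 'outbreak', 'sydney', 'student', 'japan']
lda8_topic1 = ['wuhan', 'symptom', 'chinese', 'patient', 'outbreak',
               'confirmed', 'hospital', 'spread', 'infected', 'disease']
lda8_topic8 = ['ship', 'cruise', 'passenger', 'princess', 'cruise_ship',
               'diamond_princess', 'diamond', 'japan', 'flight', 'board']

print(jaccard(lda4_topic1, lda8_topic1))
print(jaccard(lda4_topic1, lda8_topic8))
print(jaccard(lda8_topic1, lda8_topic8))
```

Topic 1 of the 4-topic model shares four of its ten displayed words with each of the two 8-topic topics, which share none with each other. That pattern is consistent with the broader topic splitting into an outbreak topic and a cruise-ship topic, although a ten-word overlap is only a rough indicator.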


In [74]:
# display the result of LDA
lda_display_8 = pyLDAvis.gensim.prepare(model, corpus, dictionary, sort_topics=False)
pyLDAvis.display(lda_display_8)

## Part 2: Nonnegative Matrix Factorization
<a id="step4"></a>

### Preparing Data
<a id="step4_1"></a>

In [75]:
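# re and string were already imported above; repeating the imports is harmless.
import re
import string
# 'en' is the spaCy 2.x shortcut name. The parser and NER are disabled because only lemmas are needed.
nlp = spacy.load('en', disable=['parser', 'ner'])
# Silence library warnings in the notebook output.
warnings.filterwarnings('ignore')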

In [76]:
df = pd.read_csv('Monash_crawled.csv')

In [77]:
# Lowercase the text, then remove text in square brackets, punctuation, and words containing digits.
def clean_text(text):
    # Lowercase the text.
    text = text.lower()
    # Remove bracketed text such as "[1]".
    text = re.sub(r'\[.*?\]', '', text)
    # Remove punctuation.
    text = re.sub(r'[%s]' % re.escape(string.punctuation), '', text)
    # Remove words that contain digits.
    text = re.sub(r'\w*\d\w*', '', text)
    return text

# Create the df_clean data frame from the cleaned article bodies.
df_clean = pd.DataFrame(df.body.apply(clean_text))

# Lemmatize the text with spaCy and join the lemmas back into one string.
# Note: this function reuses the name of the WordNetLemmatizer instance from Part 1.
def lemmatizer(text):
    doc = nlp(text)
    return " ".join(word.lemma_ for word in doc)

df_clean["body_lemmatize"] = df_clean['body'].apply(lemmatizer)
# spaCy 2.x lemmatizes pronouns to the placeholder '-PRON-'; strip it out.
df_clean['body_lemmatize_clean'] = df_clean['body_lemmatize'].str.replace('-PRON-', '')
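
The effect of the three regular expressions in `clean_text` is easiest to see on one sentence. The block below repeats the function so that it runs on its own; the sample sentence is made up.

```python
import re
import string

def clean_text(text):
    text = text.lower()
    # Remove bracketed text, then punctuation, then any word containing a digit.
    text = re.sub(r'\[.*?\]', '', text)
    text = re.sub(r'[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub(r'\w*\d\w*', '', text)
    return text

sample = "The WHO [2] said COVID19 cases rose by 30%!"
cleaned = clean_text(sample)
print(repr(cleaned))
print(cleaned.split())
```

Two details are worth noticing. Words that contain digits are removed entirely here, whereas the LDA preprocessing in Part 1 kept them, and the deletions leave runs of spaces behind, which later tokenization ignores.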

In [78]:
df_clean

| | body | body_lemmatize | body_lemmatize_clean |
|---|---|---|---|
| 0 | canberra\n has experienced its worst air qual... | canberra \n have experience -PRON- bad air ... | canberra \n have experience bad air qualit... |
| 1 | as\n dawn broke over a blackened australi... | as \n dawn break over a blacken austral... | as \n dawn break over a blacken austral... |
| 2 | your babys brain and body grow a lot during t... | -PRON- babys brain and body grow a lot durin... | babys brain and body grow a lot during the ... |
| 3 | living in polluted cities may make your bones... | live in polluted city may make -PRON- bone w... | live in polluted city may make bone weak an... |
| 4 | researchers have developed a new battery they... | researcher have develop a new battery -PR... | researcher have develop a new battery cl... |
| ... | ... | ... | ... |
| 361 | published aedt march updated... | publish aedt march updat... | publish aedt march updat... |
| 362 | published aedt march updated... | publish aedt march updat... | publish aedt march updat... |
| 363 | published aedt march updated... | publish aedt march updat... | publish aedt march updat... |
| 364 | published aedt march updated... | publish aedt march updat... | publish aedt march updat... |



Using only nouns:

In [79]:
# Keep only the nouns: tokens whose POS tag starts with 'NN' (NN, NNS, NNP, NNPS).
def nouns(text):
    is_noun = lambda pos: pos[:2] == 'NN'
    tokenized = word_tokenize(text)
    all_nouns = [word for (word, pos) in pos_tag(tokenized) if is_noun(pos)] 
    return ' '.join(all_nouns)

In [80]:
df_nouns = pd.DataFrame(df_clean.body_lemmatize_clean.apply(nouns))
df_nouns.to_csv('df_nouns.csv', index=False)
df_nouns = pd.read_csv('df_nouns.csv')
df_nouns.head()

| | body_lemmatize_clean |
|---|---|
| 0 | canberra experience air quality record bushfir... |
| 1 | dawn break landscape picture emerge disaster s... |
| 2 | babys brain body lot month baby pace expert mi... |
| 3 | live city bone research study people particle ... |
| 4 | researcher battery claim power phone day vehic... |


### Training NMF model <a id="step4_2"></a>

In [81]:
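from sklearn.decomposition import NMF
# Maximum vocabulary size for the tf-idf vectorizer.
n_features = 4000
# Number of topics (NMF components).
n_components = 6
# Number of top words to show per topic.
n_top_words = 15
# Regularization strength (newer scikit-learn releases replace `alpha` with alpha_W and alpha_H).
alpha = 0.1
# Mix between L1 and L2 regularization.
l1_ratio = 0.5
# Ignore terms that appear in fewer than 2 documents.
min_df = 2
# Ignore terms that appear in more than 95% of the documents.
max_df = 0.95

tfidf_vectorizer = TfidfVectorizer(max_df=max_df, min_df=min_df,
                                   max_features=n_features,
                                   stop_words='english')

tfidf = tfidf_vectorizer.fit_transform(df_nouns['body_lemmatize_clean'].values.astype(str))
# fit_transform fits the model and returns the document-topic matrix, so a separate fit() call is not needed.
nmf = NMF(n_components=n_components, random_state=1, alpha=alpha, l1_ratio=l1_ratio)
nmf_output = nmf.fit_transform(tfidf)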
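
NMF approximates the document-term matrix `V` by a product `W @ H` of two non-negative matrices: `W` holds the topic weights of each document and `H` the term weights of each topic. The sketch below uses the classic multiplicative update rules on a toy matrix to show the idea. It is not what the cell above runs: scikit-learn's `NMF` defaults to a coordinate descent solver, and the regularization (`alpha`, `l1_ratio`) is left out here. The function name and the toy matrix are assumptions made for illustration.

```python
import numpy as np

def nmf_multiplicative(V, n_components, n_iter=200, seed=0, eps=1e-10):
    """Factorise V (docs x terms) into W (docs x topics) and H (topics x terms)."""
    rng = np.random.default_rng(seed)
    n_docs, n_terms = V.shape
    W = rng.random((n_docs, n_components))
    H = rng.random((n_components, n_terms))
    for _ in range(n_iter):
        # Multiplicative updates keep every entry non-negative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy document-term matrix with two obvious groups of terms.
V = np.array([[2.0, 1.0, 0.0, 0.0],
              [4.0, 2.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 3.0],
              [0.0, 0.0, 2.0, 6.0]])

W, H = nmf_multiplicative(V, n_components=2)
relative_error = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
print(W.shape, H.shape, relative_error)
```

In the notebook, `nmf_output` plays the role of `W` and `nmf.components_` the role of `H`, which is why the top words of a topic are read from the rows of `components_`.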

In [82]:
# Return a DataFrame with the top words of each NMF topic, one column per topic.
def nmf_topics_words(model, n_top_words):
    # Vocabulary terms, indexed like the columns of the tf-idf matrix.
    # (Newer scikit-learn releases name this method get_feature_names_out().)
    feature_names = tfidf_vectorizer.get_feature_names()
    word_dict = {}
    for i, topic in enumerate(model.components_):
        # Indices of the n_top_words largest weights, in descending order.
        words_index = topic.argsort()[:-n_top_words - 1:-1]
        words = [feature_names[index] for index in words_index]
        word_dict['Topic ' + str(i+1)] = words

    return pd.DataFrame(word_dict)
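
The reversed slice applied to `argsort()` in this function is compact but easy to misread. The toy example below, with invented weights and vocabulary, shows what it selects.

```python
import numpy as np

toy_feature_names = ['air', 'cruise', 'mask', 'ship', 'virus']
toy_topic_weights = np.array([0.10, 0.70, 0.05, 0.40, 0.20])
n_top = 3

# argsort() is ascending; the reversed slice takes the last n_top indices, largest first.
top_idx = toy_topic_weights.argsort()[:-n_top - 1:-1]
top_words = [toy_feature_names[i] for i in top_idx]
print(top_words)

# An equivalent form that some readers find clearer (the toy data has no ties).
alt_idx = np.argsort(-toy_topic_weights)[:n_top]
print(list(alt_idx))
```

Both forms return the indices of the largest weights in descending order, so either can be used to look up the top words of a topic.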

In [83]:
nmf_topics_words(nmf,n_top_words)

| | Topic 1 | Topic 2 | Topic 3 | Topic 4 | Topic 5 | Topic 6 |
|---|---|---|---|---|---|---|
| 0 | coronavirus | climate | student | ship | flight | chemist |
| 1 | virus | bushfire | ban | cruise | island | mask |
| 2 | people | smoke | university | princess | christma | store |
| 3 | case | change | travel | diamond | evacuee | hand |
| 4 | china | air | semester | passenger | wuhan | sanitiser |
| 5 | health | quality | school | board | zealand | warehouse |
| 6 | patient | season | education | japan | passenger | face |
| 7 | wuhan | weather | china | yokohama | facility | chatswood |
| 8 | outbreak | year | australia | quarantine | qanta | stock |
| 9 | symptom | australia | country | vessel | plane | customer |


In [84]:
# Visualise the intertopic distances for the NMF model.
pyLDAvis.enable_notebook()
vis_nmf = pyLDAvis.sklearn.prepare(nmf, tfidf, tfidf_vectorizer, mds="pcoa", sort_topics=False)
vis_nmf