# The Problem

Topic modeling is an unsupervised machine learning method for discovering hidden semantic structure in a collection of documents. Applied to research papers, it lets us learn topic representations for the papers in a corpus. The same model can be applied to other kinds of document labels, such as the tags on a website's posts.


[The research paper dataset can be found here](https://github.com/susanli2016/Machine-Learning-with-Python/blob/master/dataset.csv)

# Text Cleaning


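The `tokenize` function below handles the text cleaning and returns a list of tokens: it skips whitespace, replaces URLs with `URL` and @-mentions with `SCREEN_NAME`, and lowercases everything else.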

In [1]:
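from spacy.lang.en import English

# The blank English pipeline is enough for tokenizing, so no trained model is loaded.
# (The original spacy.load('en') call was unused, and the 'en' shortcut fails on spaCy 3.)
parser = English()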

def tokenize(text):
    lda_tokens = []
    tokens = parser(text)
    for token in tokens:
        if token.orth_.isspace():
            continue
        elif token.like_url:
            lda_tokens.append('URL')
        elif token.orth_.startswith('@'):
            lda_tokens.append('SCREEN_NAME')
        else:
            lda_tokens.append(token.lower_)
    return lda_tokens

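We use NLTK’s WordNet interface, which gives access to word meanings, synonyms, antonyms, and more. `get_lemma` uses `wn.morphy` to find a word's base form and falls back to the word itself when none is found; `get_lemma2` uses `WordNetLemmatizer` as an alternative way to get the root word.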

In [2]:
import nltk
nltk.download('wordnet')
from nltk.corpus import wordnet as wn
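def get_lemma(word):
    # morphy returns None when WordNet has no base form for the word
    lemma = wn.morphy(word)
    return word if lemma is None else lemma

from nltk.stem.wordnet import WordNetLemmatizer
def get_lemma2(word):
    # Alternative lemmatizer; it treats the word as a noun by default
    return WordNetLemmatizer().lemmatize(word)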

[nltk_data] Downloading package wordnet to /home/haseeb/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Next, we load NLTK's English stopword list so that common words can be filtered out.

In [3]:
nltk.download('stopwords')
en_stop = set(nltk.corpus.stopwords.words('english'))

[nltk_data] Downloading package stopwords to /home/haseeb/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


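Now we can define a function to prepare the text for topic modeling. It tokenizes the text, keeps only tokens longer than four characters, drops stopwords, and lemmatizes what remains: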

In [4]:
def prepare_text_for_lda(text):
    tokens = tokenize(text)
    tokens = [token for token in tokens if len(token) > 4]
    tokens = [token for token in tokens if token not in en_stop]
    tokens = [get_lemma(token) for token in tokens]
    return tokens
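
The order of these steps matters. As a rough illustration, the sketch below reproduces the filtering with a whitespace tokenizer and a three-word stop list. Both are stand-ins assumed for this example (not spaCy or NLTK), the names `toy_tokenize` and `toy_prepare` are hypothetical, and the lemmatization step is left out. It shows that the `URL` placeholder is removed by the length filter while `SCREEN_NAME` survives.

```python
# Stand-in stop list (assumption): the real code uses NLTK's English stopwords.
TOY_STOP = {"about", "their", "which"}

def toy_tokenize(text):
    """Whitespace tokenizer that mimics the placeholder rules of tokenize()."""
    tokens = []
    for raw in text.split():
        if raw.startswith(("http://", "https://")):
            tokens.append("URL")
        elif raw.startswith("@"):
            tokens.append("SCREEN_NAME")
        else:
            tokens.append(raw.lower())
    return tokens

def toy_prepare(text):
    tokens = toy_tokenize(text)
    tokens = [t for t in tokens if len(t) > 4]         # drops short tokens, including 'URL'
    tokens = [t for t in tokens if t not in TOY_STOP]  # drops stopwords
    return tokens

result = toy_prepare("Read about Scalable Metasearch at https://example.org via @alice")
print(result)
```

If URLs should stay visible to the model, one option would be to apply the length filter before inserting the placeholders, or to use a longer placeholder string.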

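Let's prepare our data for LDA. We read the file line by line and keep a random sample of roughly 1% of the lines, printing each sampled token list as we go: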

In [5]:
import random
text_data = []
with open('dataset.csv') as f:
    for line in f:
        tokens = prepare_text_for_lda(line)
        if random.random() > .99:
            print(tokens)
            text_data.append(tokens)

['record', 'subtype', 'facility', 'database', 'system']
['beyond', 'popularity', 'really', 'simple', 'system', 'scale']
['polygon', 'assist', 'compression', 'synthetic', 'image']
['towards', 'highly', 'scalable', 'effective', 'metasearch', 'engine']
['subject', 'directory', 'drive', 'system', 'organize', 'access', 'large', 'statistical', 'database']
['biometric', 'base', 'level', 'secure', 'access', 'control', 'implantable', 'medical', 'devices', 'emergency']
['modeling', 'telephone', 'network', 'group', 'structure', 'complex', 'network', 'perspective']
['similarity']
['project', 'schema', 'language', 'medley', 'melee']
['design', 'relational', 'storage']
['packet', 'distribution', 'algorithm', 'sensor', 'network']
['overview', 'management', 'platform', 'digital', 'advertising']
['adaptive', 'modulation', 'coding', 'switching', 'scheme', 'multicasting', 'system']
['novel', 'complexity', 'lattice', 'filter', 'overflow', 'property', 'close', 'normalize', 'lattice']
['detecting', 'origin'
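
Because the sample is drawn with `random.random()`, each run selects different lines, so the topics found later will vary between runs. If a repeatable sample is wanted, one option is a seeded generator. The sketch below is an assumption layered on the original code: the `sample_lines` helper and the seed value are hypothetical, and it runs on synthetic lines rather than the real dataset.

```python
import random

def sample_lines(lines, keep_prob=0.01, seed=42):
    """Keep each line with probability keep_prob, using a private seeded RNG."""
    rng = random.Random(seed)
    # rng.random() < keep_prob selects with the same probability as random() > .99
    return [line for line in lines if rng.random() < keep_prob]

lines = [f"title {i}" for i in range(100_000)]
first = sample_lines(lines)
second = sample_lines(lines)
print(len(first), first == second)
```

Using a private `random.Random` instance keeps the sampling independent of any other code that touches the global random state.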

# LDA with Gensim

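We build a gensim dictionary from the token lists, convert each document to a bag-of-words vector, and save both the dictionary and the corpus for later use.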

In [6]:
from gensim import corpora
dictionary = corpora.Dictionary(text_data)
corpus = [dictionary.doc2bow(text) for text in text_data]
import pickle
with open('corpus.pkl', 'wb') as f:
    pickle.dump(corpus, f)
dictionary.save('dictionary.gensim')
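
To make the bag-of-words format concrete, here is a pure-Python sketch of a `(token_id, count)` representation. It is not gensim's implementation: `build_vocab` and `to_bow` are hypothetical helpers, and the ids they assign will not generally match those in a gensim `Dictionary`.

```python
from collections import Counter

def build_vocab(docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab

def to_bow(doc, vocab):
    """Return sorted (token_id, count) pairs, ignoring tokens missing from vocab."""
    counts = Counter(vocab[token] for token in doc if token in vocab)
    return sorted(counts.items())

docs = [
    ['design', 'relational', 'storage'],
    ['packet', 'distribution', 'algorithm', 'sensor', 'network'],
]
vocab = build_vocab(docs)
bow = to_bow(['sensor', 'network', 'network', 'unseen'], vocab)
print(bow)
```

Tokens that never appeared in the training documents (here `'unseen'`) simply drop out of the vector, which is also why a new document can end up with very few entries.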

Let's ask LDA for five topics and print the four highest-weighted words of each:

In [7]:
import gensim
NUM_TOPICS = 5
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics = NUM_TOPICS, id2word=dictionary, passes=15)
ldamodel.save('model5.gensim')
topics = ldamodel.print_topics(num_words=4)
for topic in topics:
    print(topic)

(0, '0.073*"lattice" + 0.040*"compression" + 0.040*"image" + 0.040*"polygon"')
(1, '0.049*"system" + 0.027*"database" + 0.027*"subject" + 0.027*"directory"')
(2, '0.038*"access" + 0.037*"secure" + 0.037*"base" + 0.037*"emergency"')
(3, '0.059*"system" + 0.032*"popularity" + 0.032*"beyond" + 0.032*"simple"')
(4, '0.092*"network" + 0.050*"group" + 0.050*"telephone" + 0.050*"modeling"')
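
Each printed topic is a formatted string of `weight*"word"` terms. If the weights are needed as numbers, the string can be parsed; the sketch below does so for the last topic shown, with `parse_topic` being a hypothetical helper written for this example. gensim's `show_topic` method returns `(word, probability)` pairs directly, which is usually the simpler route when the model object is at hand.

```python
import re

topic_line = '0.092*"network" + 0.050*"group" + 0.050*"telephone" + 0.050*"modeling"'

def parse_topic(line):
    """Turn a print_topics string into a list of (word, weight) pairs."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', line)
    return [(word, float(weight)) for weight, word in pairs]

parsed = parse_topic(topic_line)
print(parsed)
```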


Now let's test the model on a new document:

In [8]:
new_doc = 'Practical Bayesian Optimization of Machine Learning Algorithms'
new_doc = prepare_text_for_lda(new_doc)
new_doc_bow = dictionary.doc2bow(new_doc)
print(new_doc_bow)
print(ldamodel.get_document_topics(new_doc_bow))

[(53, 1)]
[(0, 0.10002815), (1, 0.10001865), (2, 0.100026354), (3, 0.59989154), (4, 0.10003535)]
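
The first output line is the bag-of-words vector of the new title, and the second lists one `(topic_id, probability)` pair per topic. Only one token of the title was found in the dictionary (the single `(53, 1)` entry), so the assignment rests on very little evidence. A short sketch for picking the most probable topic, using the values printed above:

```python
# Values copied from the get_document_topics output above
doc_topics = [(0, 0.10002815), (1, 0.10001865), (2, 0.100026354),
              (3, 0.59989154), (4, 0.10003535)]

# The dominant topic is the pair with the largest probability
best_topic, best_prob = max(doc_topics, key=lambda pair: pair[1])

# The probabilities form a distribution, so they sum to about 1
total = sum(prob for _, prob in doc_topics)
print(best_topic, round(best_prob, 2), round(total, 2))
```

Here the document falls into topic 3, the one led by "system" in the topic listing.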


# Visualize

In [10]:
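# Newer pyLDAvis releases expose this module as pyLDAvis.gensim_models
import pyLDAvis.gensim

# Reload the artifacts saved earlier
dictionary = gensim.corpora.Dictionary.load('dictionary.gensim')
with open('corpus.pkl', 'rb') as f:
    corpus = pickle.load(f)
lda = gensim.models.ldamodel.LdaModel.load('model5.gensim')

# sort_topics=False keeps the topic ids consistent with the model's own numbering
lda_display = pyLDAvis.gensim.prepare(lda, corpus, dictionary, sort_topics=False)
pyLDAvis.display(lda_display)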

