## Data loading and preprocessing

We load a dataset of Amazon product reviews from the "Electronics" category. We do not know which types of products these reviews cover, and we want to uncover that by applying **topic modeling** (LDA) to this corpus of reviews.

In [1]:
import pandas as pd

reviews = pd.read_csv('reviews.csv', delimiter = '\t') # in our file, the values are actually TAB-separated
reviews

FileNotFoundError: [Errno 2] No such file or directory: 'reviews.csv'

### Preprocessing

We next preprocess the review texts to eliminate as much as possible of the "noise" that could affect the topic modeling. We will apply common text preprocessing steps:

- Tokenization: we split the texts into words/tokens
- Stopword and punctuation removal: we eliminate all tokens that are in the list of stopwords (and punctuation)
- Additionally, we remove all non-content words, i.e., all words with part-of-speech-tags that do not correspond to nouns, verbs, adjectives, ...
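
The filtering rules above can be sketched without any NLP library. The following is only an illustration and assumes the tokens have already been tagged: the hand-written `(text, lemma, pos)` triples, the tiny stopword set, and the helper `keep_token` are hypothetical stand-ins for what a tagger and a real stopword list would supply.

```python
# Hypothetical, hand-tagged tokens: (surface text, lemma, coarse POS tag).
tagged = [
    ("The", "the", "DET"),
    ("batteries", "battery", "NOUN"),
    ("died", "die", "VERB"),
    ("very", "very", "ADV"),
    ("quickly", "quickly", "ADV"),
    (".", ".", "PUNCT"),
]

# Illustrative stand-ins for the stopword list and the non-content POS tags.
example_stopwords = {"the", "a", "an", "."}
example_stop_tags = {"ADV", "PRON", "CCONJ", "PUNCT", "PART", "DET",
                     "ADP", "SPACE", "NUM", "SYM"}

def keep_token(text, pos):
    """Keep a token only if it is non-empty, not a stopword, and a content word."""
    return (text.strip() != ""
            and text.lower() not in example_stopwords
            and pos not in example_stop_tags)

# Keep the lowercased lemma of every token that passes the filter.
tokens = [lemma.lower() for text, lemma, pos in tagged if keep_token(text, pos)]
print(tokens)
```

Only the noun and the verb survive here; the determiner, the adverbs, and the punctuation mark are filtered out.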

We will use spaCy to perform tokenization and filtering. 
- To do that, one first needs to install spaCy and download the corresponding model
- We will use the model "en_core_web_md", which bundles tokenization, part-of-speech tagging, and other components for English

To install spaCy, run the following on the command line: 
- *pip install spacy*

To download the model, run the following on the command line: 
- *python -m spacy download en_core_web_md*


In [None]:
import spacy
import wordcloud
from wordcloud import STOPWORDS

# removing the repetitions if there are any, converting the list to set
stopwords = set(list(STOPWORDS) + ['.', "?", "!", ",", "(", ")", ":", ";", "\"", "'", "=", "-"])
stop_tags = ['ADV','PRON','CCONJ','PUNCT','PART','DET','ADP','SPACE', 'NUM', 'SYM']
print(stopwords)

nlp = spacy.load("en_core_web_md")

In [None]:
# tokenize every review, drop stopwords and tokens with non-content POS tags, and keep the lowercased lemmas
reviews["tokens"] = reviews.content.apply(lambda x: [
    t.lemma_.lower() for t in nlp(x, disable=["parser", "ner"])
    if t.text.strip() != "" and t.text.lower() not in stopwords and t.pos_ not in stop_tags])

In [None]:
ind = 45
reviews.iloc[ind].tokens

## Preparation for topic modeling with LDA

We will carry out topic modeling with the LDA implementation from the popular library *gensim*. To apply gensim's LDA, we first need to prepare some data structures, starting with the *dictionary* of words that serves as the vocabulary over which the topics are modeled. 

We can influence how big the dictionary will be via the parameters of the *filter_extremes* method:  

- by setting the minimal number of documents (reviews) in which the token has to appear to be included in the dictionary (this eliminates the most infrequent terms, unlikely to be relevant for any topic)

- by setting the maximal percentage of documents in which the token is allowed to appear (this eliminates the most frequent terms, likely to be "part of" every topic) 
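
To make the two thresholds concrete, here is a minimal plain-Python sketch of document-frequency filtering. It is not gensim's implementation: the toy documents and the helper `filter_by_document_frequency` are hypothetical, and gensim's own method has further options (such as a cap on the vocabulary size) that this sketch ignores.

```python
from collections import Counter

# Toy tokenized documents (hypothetical data).
docs = [
    ["battery", "charger", "cable"],
    ["battery", "screen", "cable"],
    ["battery", "speaker"],
    ["battery", "cable", "headphone"],
]

def filter_by_document_frequency(docs, no_below, no_above):
    """Keep tokens found in at least `no_below` documents and in at most
    a fraction `no_above` of all documents."""
    # Document frequency: count each token once per document.
    df = Counter(token for doc in docs for token in set(doc))
    max_docs = no_above * len(docs)
    return sorted(t for t, n in df.items() if no_below <= n <= max_docs)

vocabulary = filter_by_document_frequency(docs, no_below=2, no_above=0.75)
print(vocabulary)
```

In this toy corpus "battery" appears in every document and is dropped as too frequent, while the tokens that appear in a single document are dropped as too rare, so only "cable" remains.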

In [None]:
from gensim.corpora.dictionary import Dictionary
dictionary = Dictionary(reviews['tokens'])
dictionary.filter_extremes(no_below=5, no_above=0.1)

l = list(dictionary.items())
print(l)
print(len(l))


We next create the corpus as a list of sparse document vectors, using the *doc2bow* method of the gensim Dictionary object: 

In [None]:
corpus = [dictionary.doc2bow(a) for a in reviews['tokens']]
corpus
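
To see what such a sparse vector looks like, the following plain-Python sketch builds a bag-of-words representation by hand. The vocabulary `token2id` and the helper `to_bow` are hypothetical and only mimic the general shape of the result: a list of `(token_id, count)` pairs in which tokens outside the vocabulary are skipped.

```python
from collections import Counter

# Hypothetical vocabulary mapping tokens to integer ids.
token2id = {"battery": 0, "cable": 1, "screen": 2}

def to_bow(tokens, token2id):
    """Return sorted (token_id, count) pairs, skipping out-of-vocabulary tokens."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

bow = to_bow(["battery", "cable", "battery", "adapter"], token2id)
print(bow)
```

The vector is sparse because vocabulary entries that do not occur in the document (here "screen") are simply absent from the list.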

## Running topic modeling (LDA)

Having prepared (1) the dictionary (i.e., the vocabulary) and (2) the corpus (i.e., sparse document vectors over the terms of the vocabulary), we are now ready to run topic modeling. 

- *gensim* contains several classes for this. The standard one is *gensim.models.ldamodel.LdaModel*. We will, however, use *gensim.models.LdaMulticore*, which trains the LDA model faster by parallelizing the computation across multiple CPU cores, if available. 

*gensim.models.LdaMulticore* takes, among others, the following parameters: 
- *corpus*: the list of sparse document bag-of-word vectors (which we've built already)
- *id2word*: the mapping from vocabulary IDs to actual tokens, which is exactly what our "dictionary" contains
- *num_topics*: number of topics we would like to induce
- *iterations*: max. number of iterations through the corpus when inferring the topic distribution of a corpus
- *workers*: how many worker processes to use for training the topic model in parallel



In [None]:
from gensim.models import LdaMulticore
# alternative: gensim.models.ldamodel.LdaModel

lda_model = LdaMulticore(corpus=corpus, id2word=dictionary, iterations=50, num_topics=20, workers = 4)

In [None]:
lda_model.print_topics(num_words=5)


How are topics represented in documents? Or, put differently, which documents belong to which topics?
- This information is stored in our *lda_model* object after training
- We can obtain the topic distribution of every document with lda_model[corpus] and then index the result for the document we're interested in

In [None]:
ind = 35
print(lda_model[corpus][ind])
reviews["content"][ind]
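
The topic distribution of a document comes back as a list of `(topic_id, probability)` pairs. As a small sketch of how one might pick the dominant topic from such a list, consider the following; the values in `doc_topics` and the helper `dominant_topic` are hypothetical. Topics with a very low probability are typically left out of the list, so the probabilities shown need not sum to 1.

```python
# Hypothetical output for one document: (topic_id, probability) pairs.
doc_topics = [(2, 0.12), (7, 0.61), (13, 0.25)]

def dominant_topic(doc_topics):
    """Return the (topic_id, probability) pair with the highest probability."""
    return max(doc_topics, key=lambda pair: pair[1])

topic_id, prob = dominant_topic(doc_topics)
print(topic_id, prob)
```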

## Evaluating topics

- Question #1: How good are the topics we induced? 
- Question #2: We define the number of topics in advance. What is the optimal number of topics? Optimal w.r.t. what criteria?

One widely adopted way to evaluate topic models is with various coherence measures: these quantify (in various ways) the semantic similarity / lexical association between the most prominent words of each topic. If the words that have the highest weight in each topic are not semantically similar/associated, then the topic is not coherent. 
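
The intuition can be illustrated with a deliberately simplified co-occurrence score. This sketch is loosely modeled on document co-occurrence measures and is not gensim's implementation; the toy documents and the helpers `doc_freq` and `cooccurrence_coherence` are hypothetical, and the measures used in practice differ in how they count co-occurrences and normalize the result.

```python
import math
from itertools import combinations

# Toy documents as sets of tokens (hypothetical data).
docs = [
    {"battery", "charger", "cable"},
    {"battery", "charger"},
    {"screen", "pixel"},
    {"screen", "pixel", "cable"},
]

def doc_freq(*words):
    """Number of documents that contain all of the given words."""
    return sum(1 for doc in docs if all(w in doc for w in words))

def cooccurrence_coherence(top_words):
    """Average of log((D(a, b) + 1) / D(a)) over all word pairs.
    Higher means the words co-occur more often.
    Assumes every word occurs in at least one document."""
    pairs = list(combinations(top_words, 2))
    total = sum(math.log((doc_freq(a, b) + 1) / doc_freq(a)) for a, b in pairs)
    return total / len(pairs)

coherent = cooccurrence_coherence(["battery", "charger"])
mixed = cooccurrence_coherence(["battery", "pixel"])
print(coherent, mixed)
```

Words that regularly appear in the same reviews ("battery", "charger") score higher than words that never do ("battery", "pixel"), which is the behavior we expect from a coherence measure.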

A number of coherence measures are already implemented in *gensim* in the class *gensim.models.CoherenceModel*. We will use some of those to quantify the coherence of the topics we induced. 
 

In [None]:
# We will run topic modeling max_topics times, each time with a different number of topics (1 to max_topics)
max_topics = 30 
models = []

for i in range(max_topics):
    print("Training LDA with " + str(i+1) + " topics.")
    
    lda_model = LdaMulticore(corpus=corpus, id2word=dictionary, iterations=100, num_topics=i+1, workers = 4, random_state=100)
    models.append(lda_model)
          
    print("Done.")    

In [None]:
models[6].print_topics(num_words = 5)

In [None]:
from gensim.models import CoherenceModel

coherence_measure = "c_uci" # 'u_mass', 'c_v', 'c_uci', 'c_npmi'
scores = []

for i in range(len(models)):
    print("Computing coherence for the LDA model with " + str(i+1) + " topics.")
    cm = CoherenceModel(model=models[i], corpus=corpus, texts = reviews["tokens"], dictionary=dictionary, coherence=coherence_measure)
    score = cm.get_coherence()
    scores.append(score)
    print("Done.")
    

In [None]:
import matplotlib.pyplot as plt

num_topics = [i+1 for i in range(len(scores))]
                                
_=plt.plot(num_topics, scores)
_=plt.xlabel('Number of Topics')
_=plt.ylabel('Coherence Score')
plt.show()
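
Besides reading the plot, one could pick the number of topics with the highest coherence programmatically. The sketch below assumes that higher scores are better and that the score at index 0 belongs to the model with 1 topic; the values in `example_scores` and the helper `best_num_topics` are hypothetical.

```python
# Hypothetical coherence scores for models with 1, 2, 3, ... topics.
example_scores = [-2.1, -1.4, -0.9, -1.1, -1.3]

def best_num_topics(scores):
    """Return the topic count whose score is highest (index 0 means 1 topic)."""
    best_index = max(range(len(scores)), key=lambda i: scores[i])
    return best_index + 1

print(best_num_topics(example_scores))
```

In practice the highest score is only a starting point: nearby topic counts with similar coherence are worth inspecting by hand as well.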

In [None]:
# according to the coherence scores computed above, 15 topics looks like a good choice
# let's see what those topics look like

num_tops = 15
models[num_tops - 1].print_topics(num_words = 5)

## Topic Visualization

There is a dedicated library, *pyLDAvis*, that we can use to visualize topic models induced with *gensim* (among others). All you need to do is: 

- import the pyLDAvis package
- allow it to run inside the Jupyter notebook
- instantiate/prepare the display object, feeding the LDA model as an argument
- call the display function

In [None]:
import pyLDAvis.gensim_models
pyLDAvis.enable_notebook()  # visualize inside a notebook

In [None]:
lda_display = pyLDAvis.gensim_models.prepare(models[num_tops - 1], corpus, dictionary)
pyLDAvis.display(lda_display)