# Module 6 - Topic Modelling

The original data for the next steps can be downloaded from the [Kaggle News Category Dataset](https://www.kaggle.com/rmisra/news-category-dataset).

It is important to read about the dataset that you are using so that you understand what it contains and also what it doesn't contain.

### Subset exploration

Often we only need part of a dataset, so we load the whole thing first and then narrow it down to the subset we want to analyse.

In [None]:
import json

# load the complete dataset
with open('data/News_Category_Dataset_v2.json', 'r') as f:
    news_list = f.readlines()

# convert each line (string) to json (dict)
news_json = list(map(json.loads,news_list))

print("Number of stories: ",len(news_json))

# view the first 10 elements in the list
news_json[:10]

What categories are available in this dataset?

In [None]:
{story['category'] for story in news_json}

Extract just the science stories from the dataset...

In [None]:
# filter the list for stories that are in the category SCIENCE
science_json = [story for story in news_json if story['category']=='SCIENCE']

# for each, create the 'story' by adding together the headline and the short_description
science_stories = [story['headline']+' - '+story['short_description'] for story in science_json]

print("Number of science stories: ",len(science_stories))

# look at first 10
science_stories[:10]

In [None]:
science_json

### Word frequency

How do we find anything meaningful in these science news stories?

We could start by just extracting words and looking at the frequencies...

In [None]:
import re

story1 = science_stories[0]

# raw string so that \W is passed to the regex engine unchanged
re.split(r'\W+',story1.lower())

In [None]:
story1

In [None]:
word_counts = {}

for story in science_stories:
    words = re.split(r'\W+',story.lower())
    for word in words:
        if not word:
            continue  # re.split can yield empty strings at the start or end
        if word in word_counts:
            word_counts[word] += 1
        else:
            word_counts[word] = 1
        
# sort the word_counts by counts
sorted_counts = {k: v for k, v in sorted(word_counts.items(), key=lambda item: item[1],reverse=True)}

sorted_counts

### Alternatives to finding information in text

This does give us some information, but there are some problems:
- small meaningless words are dominating the count
- words that are most significant are spread out amongst the list
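The first problem is easy to see on a toy example. The sketch below counts words with `collections.Counter`, once on everything and once after dropping a small hand-picked stopword set. The sample sentences, the `STOPWORDS` set and the `count_words` helper are assumptions made up for illustration; they are not taken from the dataset or from gensim.

```python
import re
from collections import Counter

# made-up sample text (an assumption, not from the dataset)
sample_stories = [
    "The moon is the closest object to the earth",
    "A new study of the moon and the tides",
]

# a tiny hand-picked stopword set (an assumption, far smaller than a real list)
STOPWORDS = {"the", "is", "to", "a", "of", "and"}

def count_words(stories, stopwords=frozenset()):
    """Count lowercase word tokens, skipping any word in `stopwords`."""
    counts = Counter()
    for story in stories:
        words = re.findall(r"\w+", story.lower())
        counts.update(w for w in words if w not in stopwords)
    return counts

raw_counts = count_words(sample_stories)
filtered_counts = count_words(sample_stories, STOPWORDS)

print(raw_counts.most_common(3))
print(filtered_counts.most_common(3))
```

Without the filter, "the" tops the list; with it, "moon" does. A stopword list only addresses the first problem, though. It does nothing to rank the remaining words by how significant they are.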

The field of **Information Retrieval** has developed techniques to help with this issue. We're going to look at two...
1. TF/IDF as a better term frequency
2. LDA for topic modelling

First we need some additional packages not installed in our Jupyter environment...
- [gensim](https://radimrehurek.com/gensim/) for topic modelling
- [pyLDAvis](https://github.com/bmabey/pyLDAvis) for interactive visualisation of topic models

In [None]:
# Install a pip package in the current Jupyter kernel
import sys
!{sys.executable} -m pip install gensim
!{sys.executable} -m pip install pyLDAvis

In [None]:
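from gensim.models import TfidfModel
from gensim.corpora import Dictionary
from gensim.utils import tokenize
from gensim.utils import simple_preprocess
from gensim.models.ldamodel import LdaModel
import pyLDAvis
import pandas as pd

# the token-level stopword filter was moved and renamed in gensim 4
try:
    from gensim.parsing.preprocessing import remove_stopword_tokens as remove_stopwords
except ImportError:
    from gensim.corpora.textcorpus import remove_stopwords

# the gensim helper module was renamed in pyLDAvis 3
try:
    import pyLDAvis.gensim_models as gensimvis
except ImportError:
    import pyLDAvis.gensim as gensimvis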

### Pre-processing with gensim

Let's bring our stories into a dataframe and use some of the gensim tools...

In [None]:
stories_df = pd.DataFrame(science_stories,columns=['story'])
stories_df

In [None]:
# get a list of tokens for first story
tokens = list(tokenize(stories_df['story'][0],lowercase=True))
tokens

In [None]:
# get a list of tokens for first story using simple_preprocess
tokens = list(simple_preprocess(stories_df['story'][0],min_len=3))
tokens

In [None]:
# remove the 'stopwords' from first story
remove_stopwords(tokens)

In [None]:
# do this for whole dataframe
stories_df['terms'] = [remove_stopwords(simple_preprocess(story,min_len=3)) for story in stories_df['story']]
stories_df

In [None]:
vocab = Dictionary(stories_df['terms'])
print(vocab.token2id)

### Term Frequency, Inverse Document Frequency (TF/IDF)

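TF/IDF works on a Bag of Words (BoW) representation, in which each document is reduced to counts of the terms it contains and word order is discarded. For more information on these terms, see: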
- [A gentle introduction to the Bag-of-words model](https://machinelearningmastery.com/gentle-introduction-bag-words-model/)
- [tf-idf Wikipedia](https://en.wikipedia.org/wiki/Tf–idf)
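To make the idea concrete, here is a minimal pure-Python sketch of a bag of words and a TF/IDF weighting. It uses the raw term count multiplied by log base 2 of the inverse document frequency, then scales each document to unit length. The toy documents and the `tfidf` helper are assumptions for illustration. The sketch is meant to show the general shape of the calculation, not to reproduce gensim's `TfidfModel` exactly.

```python
import math
from collections import Counter

# three made-up, already-tokenised documents (an assumption for illustration)
docs = [
    ["moon", "orbit", "earth"],
    ["moon", "landing", "nasa"],
    ["moon", "earth", "tides", "tides"],
]

# document frequency: in how many documents does each term appear?
doc_freq = Counter(term for doc in docs for term in set(doc))

def tfidf(doc, doc_freq, n_docs):
    """Return {term: weight}: count * log2(n_docs / doc_freq), scaled to unit length."""
    bow = Counter(doc)  # bag of words: term -> count
    weights = {t: c * math.log2(n_docs / doc_freq[t]) for t, c in bow.items()}
    weights = {t: w for t, w in weights.items() if w > 0}  # drop zero weights
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}

weights = tfidf(docs[2], doc_freq, len(docs))
print(weights)
```

In the third document, "moon" appears in every document, so its weight is zero and it drops out. "tides" is both frequent in the document and rare in the corpus, so it gets the highest weight.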

In [None]:
# convert corpus to BoW format
corpus = [vocab.doc2bow(terms) for terms in stories_df['terms']]  

# fit a tf-idf model to the corpus
model = TfidfModel(corpus)

# apply model to the first corpus document
tfidf_doc = model[corpus[0]] 

In [None]:
tfidf_doc

In [None]:
[(vocab[w[0]],w[1]) for w in tfidf_doc]

In [None]:
[(vocab[w[0]],w[1]) for w in tfidf_doc if w[1]>0.3]

In [None]:
stories_df['terms'][0]

In [None]:
# try the second story
terms = stories_df['terms'][1]
print("terms: ",terms)
tfidf_doc2 = model[corpus[1]]
tfidf2 = [(vocab[w[0]],w[1]) for w in tfidf_doc2 if w[1]>0.1]
print("tf/idf: ",tfidf2)

### Most relevant terms

What is probably more interesting is the top n terms by TF/IDF weight, which we expect to be the most relevant to each story.

Let's create a function to take the top 5 terms based on tf/idf.

In [None]:
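def get_tfidf(idx, n=5):
    """Return the n terms with the highest tf/idf weight in document idx."""
    term_values = [(vocab[el[0]],el[1]) for el in model[corpus[idx]] if el[1]>0]
    # sort by weight, highest first, and keep only the term strings
    srt = sorted(term_values, key=lambda x: x[1],reverse=True)
    return [term for term, _ in srt[:n]]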

In [None]:
get_tfidf(1)

In [None]:
get_tfidf(0)

Now we apply this function to the whole dataframe.

In [None]:
stories_df['tfidf'] = stories_df.index.map(get_tfidf)
stories_df

In [None]:
stories_df.iloc[2]

Although TF/IDF does a good job of distinguishing between documents, identifying what is unique about each one, it doesn't use human meaning-making.

Algorithmic 'semantics' is not the same as human semantics.

It is worth considering how this might be a problem in a world that increasingly uses computation to process language.


### Latent Dirichlet Allocation (LDA)

However, there are approaches that are closer to human meaning-making than TF/IDF. LDA is one. For more detail on LDA, see the [LDA Wikipedia page](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation).
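LDA treats each document as a mixture of topics and each topic as a probability distribution over words. The sketch below illustrates only that idea, using hand-written numbers: the probability of seeing a word in a document is the topic-weighted sum of the word's probability in each topic. The topic names, the probabilities and the `word_probability` helper are all made up for illustration; a real model learns these values from the corpus.

```python
# hypothetical topics: each maps words to probabilities that sum to 1
topic_word = {
    "space":   {"moon": 0.5, "nasa": 0.4, "virus": 0.1},
    "biology": {"moon": 0.1, "nasa": 0.1, "virus": 0.8},
}

# hypothetical document: mostly about space, a little about biology
doc_topic = {"space": 0.9, "biology": 0.1}

def word_probability(word, doc_topic, topic_word):
    """P(word | doc) = sum over topics of P(topic | doc) * P(word | topic)."""
    return sum(p_topic * topic_word[topic].get(word, 0.0)
               for topic, p_topic in doc_topic.items())

for word in ["moon", "nasa", "virus"]:
    print(word, round(word_probability(word, doc_topic, topic_word), 3))
```

Fitting an LDA model runs this reasoning in reverse: given only the words in each document, it estimates topic-word and document-topic distributions that explain them well.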

In [None]:
# create an lda model from our corpus and vocab - we need to specify the number of topics
lda_model = LdaModel(corpus=corpus, id2word=vocab, num_topics=20)

In [None]:
# view the topics in the model
for topic in lda_model.show_topics(num_topics=20,num_words=15):
    print("Topic "+str(topic[0])+"\n"+topic[1]+"\n")

For each document, we can get the probability that the document belongs to a particular topic.

In [None]:
doc = stories_df['story'][1]
print("doc:\n",doc)
doc_topics = lda_model.get_document_topics(corpus[1],minimum_probability=0.3)
print("doc_topics:\n",doc_topics)
for topic in doc_topics:
    terms = [term for term, prob in lda_model.show_topic(topic[0])]
    print(terms)

We can create a function to get the top terms for the top topic for each document. This will enable us to assign the top topic words to the original dataframe.
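When picking the "top" topic, it is safest not to assume that the list of topics is ordered by probability. A minimal sketch, using a hand-written list shaped like the (topic id, probability) pairs printed above. The numbers and the helper name `top_topic_id` are made up for illustration.

```python
# hypothetical (topic_id, probability) pairs for one document
example_topics = [(2, 0.15), (7, 0.62), (11, 0.23)]

def top_topic_id(doc_topics):
    """Return the id of the most probable topic, or None if the list is empty."""
    if not doc_topics:
        return None
    topic_id, _prob = max(doc_topics, key=lambda pair: pair[1])
    return topic_id

print(top_topic_id(example_topics))  # 7
```

Taking the first element would return topic 2 here, even though topic 7 is far more probable.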

In [None]:
def get_topic_terms(idx):
    doc_topics = lda_model.get_document_topics(corpus[idx])
    # pick the most probable topic; the list is not ordered by probability
    top_topic = max(doc_topics, key=lambda t: t[1])
    return [term for term, prob in lda_model.show_topic(top_topic[0])]

In [None]:
# try out the function
get_topic_terms(1)

In [None]:
# add to our original dataframe
stories_df['lda'] = stories_df.index.map(get_topic_terms)
stories_df

To help us explore the model, we can visualise the topics using pyLDAvis. **NOTE:** This visualisation can take a while to produce (up to 5 minutes) so be patient!

In [None]:
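# Visualize the topics (the helper module is gensim_models in pyLDAvis 3, gensim before that)
try:
    import pyLDAvis.gensim_models as gensimvis
except ImportError:
    import pyLDAvis.gensim as gensimvis
pyLDAvis.enable_notebook()
vis = gensimvis.prepare(lda_model, corpus, vocab)
vis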