# Topic Modelling - Notes and Playground
(July 06, 2019)

Rather than writing a formal literature review, I want to learn just enough to apply the high-level concepts of topic modelling. I will start with blog posts that use standard Python libraries, then move on to more in-depth articles and papers on better implementations.

## Table of Contents

1. ["Applying Machine Learning to Classify an Unsupervised Text Document"](#1)
    * Data -> TF-IDF -> K-means
    
2. ["pyLDAvis - Library for Interactive Topic Model Visualization"](#2)
    * Port of an R package
    
3. ["LDA in Python - How to Grid Search Best Topic Models"](#3)

4. ["Unsupervised LDA to Semi-Supervised GuidedLDA"](#4)

## Import Libraries and Data

In [111]:
# Imports

# Standard Libraries
from itertools import combinations

# Numerical Libraries
import numpy as np
from scipy.stats import skew, kurtosis
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Visual Libraries
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import Image

# Local Package Libraries
import sys
sys.path.append("../..")

from src.data.make_dataset import *
from src.features.general_helper_functions import *
from src.features.text_cleaning import *

In [2]:
# Settings

# Silence pandas' chained-assignment warning (SettingWithCopyWarning)
pd.options.mode.chained_assignment = None

%load_ext autoreload
%autoreload 2

from IPython.display import display, HTML
display(HTML("<style>.container {width:100% !important;}</style>"))

%matplotlib inline

In [3]:
# raw data import
_, watchlist_raw, stock_prices_raw = get_raw_data()
article_raw = clean_and_open_business_wire_data_01(None)

In [4]:
# Base data cleaning and formatting
articles = article_raw.copy()
articles.reset_index(inplace=True)
articles.time = pd.to_datetime(articles.time)

watchlist = watchlist_raw.copy()

<a id="1"></a>
## Applying Machine Learning to Classify an Unsupervised Text Document

* The author builds a model that splits a data set into two clusters.
* Steps:
    1. Vectorize the text data with sklearn's TfidfVectorizer
    2. Fit sklearn's KMeans clustering algorithm with true_k = 2
* The input is a list of strings that already appear to be relatively clean.
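
The two steps above can be sketched on a toy corpus. This is only a minimal, self-contained illustration, not the blog author's code: the `toy_docs` list, the `toy_` variable names, and the `n_init` and `random_state` values are made up for the example.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy documents (not taken from the Business Wire data)
toy_docs = [
    "company reports quarter financial results",
    "company announces fourth quarter financial results",
    "global drug market research forecast",
    "global market analysis and drug pipeline forecast",
]

# Step 1: turn each string into a TF-IDF weighted sparse vector
toy_vectorizer = TfidfVectorizer(stop_words="english")
toy_X = toy_vectorizer.fit_transform(toy_docs)

# Step 2: cluster the vectors into two groups
toy_kmeans = KMeans(
    n_clusters=2,
    init="k-means++",
    max_iter=100,
    n_init=10,
    random_state=0)
toy_labels = toy_kmeans.fit_predict(toy_X)

# Show which cluster each document was assigned to
for doc, label in zip(toy_docs, toy_labels):
    print(label, doc)
```

The cluster ids themselves are arbitrary; only the grouping of documents is meaningful.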

In [5]:
# Import
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
import numpy as np
import pandas as pd

In [6]:
titles_cleaned = clean_text(articles, "title").title.values

In [7]:
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(titles_cleaned)

print(X.shape)

(8433, 7676)


In [8]:
true_num_topics = 2  # Obviously not the true number of topics, but good enough for a first pass

naive_kmeans_topic_model = KMeans(
    n_clusters=true_num_topics, 
    init="k-means++", 
    max_iter=100, 
    n_init=1)

naive_kmeans_topic_model.fit(X)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=100,
       n_clusters=2, n_init=1, n_jobs=None, precompute_distances='auto',
       random_state=None, tol=0.0001, verbose=0)

We can now look at the top words in each centroid:

In [9]:
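# Sort the term indices of each centroid by descending weight
order_centroids = naive_kmeans_topic_model.cluster_centers_.argsort()[:, ::-1]
# Note: get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() there
terms = vectorizer.get_feature_names()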

for i in range(true_num_topics):
    print("Cluster %d:" % i)
    for ind in order_centroids[i, :10]:
        print(" %s" % terms[ind])
    print("\n")

Cluster 0:
 announce
 result
 quarter
 financial
 conference
 investor
 pharmaceutical
 report
 present
 healthcare


Cluster 1:
 market
 researchandmarkets
 com
 global
 pipeline
 research
 review
 forecast
 analysis
 drug




<a id="2"></a>
## pyLDAvis - Library for Interactive Topic Model Visualization
https://nbviewer.jupyter.org/github/bmabey/pyLDAvis/blob/master/notebooks/pyLDAvis_overview.ipynb#topic=1&lambda=0.5&term=

pyLDAvis is a Python port of the R package LDAvis for interactive topic model visualization.

It can be used with topic models from both gensim and sklearn.

**pyLDAvis**
https://pyldavis.readthedocs.io/en/latest/

**Gensim's LDA**
https://radimrehurek.com/gensim/models/ldamodel.html

In [95]:
import gensim
from gensim import corpora
from gensim.models.ldamodel import LdaModel
from gensim.models.tfidfmodel import TfidfModel
# Note: in newer pyLDAvis releases this module is named pyLDAvis.gensim_models
import pyLDAvis.gensim

pyLDAvis.enable_notebook()

In [112]:
tokenized_titles_cleaned = [remove_short_words(text).split(" ") for text in titles_cleaned]
dictionary = corpora.Dictionary(tokenized_titles_cleaned)
dictionary.filter_extremes(no_below=5, no_above=0.4)

In [113]:
corpus = [dictionary.doc2bow(text) for text in tokenized_titles_cleaned]

In [114]:
tfidf_corpus = TfidfModel(corpus)[corpus]

In [115]:
lda = LdaModel(tfidf_corpus, num_topics=20)

In [116]:
for topic_id in range(4):
    print("Topic:", topic_id)
    print([(dictionary[item[0]], np.round(item[1],3)) for item in lda.get_topic_terms(topic_id)], "\n")

Topic: 0
[('needle', 0.024), ('bone', 0.019), ('free', 0.019), ('system', 0.016), ('advance', 0.014), ('market', 0.014), ('diagnostics', 0.014), ('molecular', 0.013), ('earn', 0.013), ('research', 0.012)] 

Topic: 1
[('quarter', 0.058), ('result', 0.056), ('fourth', 0.047), ('year', 0.047), ('financial', 0.038), ('full', 0.037), ('report', 0.033), ('first', 0.032), ('synthetic', 0.017), ('johnson', 0.013)] 

Topic: 2
[('financial', 0.049), ('quarter', 0.047), ('result', 0.045), ('chief', 0.041), ('officer', 0.04), ('appoint', 0.029), ('report', 0.029), ('operate', 0.029), ('third', 0.027), ('president', 0.027)] 

Topic: 3
[('trend', 0.037), ('forecast', 0.032), ('market', 0.03), ('global', 0.029), ('researchandmarkets', 0.027), ('analysis', 0.025), ('price', 0.022), ('growth', 0.019), ('size', 0.019), ('share', 0.016)] 



In [117]:
pyLDAvis.gensim.prepare(lda, tfidf_corpus, dictionary)



<a id="3"></a>
## LDA in Python - How to Grid Search Best Topic Models

https://www.machinelearningplus.com/nlp/topic-modeling-python-sklearn-examples/
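
A minimal sketch of the idea in the linked article, assuming scikit-learn's `LatentDirichletAllocation` and `GridSearchCV`. The toy documents, the `grid_` variable names, the parameter grid, and `cv=2` are placeholders chosen for illustration, not values from the article or from this project. `GridSearchCV` ranks candidates with the estimator's own `score` method, which for LDA is an approximate log-likelihood, so higher is better.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV

# Hypothetical toy documents (not taken from the Business Wire data)
grid_docs = [
    "company reports first quarter financial results",
    "company announces fourth quarter financial results",
    "company reports full year financial results",
    "company appoints chief financial officer",
    "global drug market research forecast",
    "global market analysis and drug pipeline review",
    "global market size share and growth forecast",
    "market trend analysis and price forecast",
]

# Bag-of-words counts are the usual input for LDA
grid_counts = CountVectorizer(stop_words="english").fit_transform(grid_docs)

# Candidate numbers of topics to compare (illustrative values)
param_grid = {"n_components": [2, 3, 4]}

search = GridSearchCV(
    LatentDirichletAllocation(max_iter=10, random_state=0),
    param_grid=param_grid,
    cv=2)
search.fit(grid_counts)

print("Best params:", search.best_params_)
print("Best mean held-out score:", search.best_score_)
```

On a corpus this small the winning `n_components` is not meaningful; the point is only the shape of the search.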

<a id="4"></a>
## Unsupervised LDA to Semi-Supervised GuidedLDA

https://www.freecodecamp.org/news/how-we-changed-unsupervised-lda-to-semi-supervised-guidedlda-e36a95f3a164/