https://medium.com/ml2vec/topic-modeling-is-an-unsupervised-learning-approach-to-clustering-documents-to-discover-topics-fdfbf30e27df

Topic Modeling is an unsupervised learning approach to clustering documents, to discover topics based on their contents.

In this post, we will walk through two different approaches to topic modeling and compare their results. These approaches are LDA (Latent Dirichlet Allocation) and NMF (Non-negative Matrix Factorization). Let’s talk about each of these before we move on to code. We will look at their definitions and some basic math that describes how they work.

# LDA  - LDA2VEC

LDA, or Latent Dirichlet Allocation, is a probabilistic model. To obtain cluster assignments, it uses two probability values: P(word | topic) and P(topic | document). Every word in every document starts with a random topic assignment, and the two probabilities are computed from those assignments. The model then revisits each word in each document and reassigns its topic based on the current probabilities. This procedure is repeated, recalculating the probabilities each time, until the algorithm converges.
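
To make the role of the two probabilities concrete, here is a toy sketch of the reassignment step for a single word in a single document: each topic is scored by P(word | topic) × P(topic | document) and the scores are normalized. The topic names, the numbers, and the helper `topic_posterior` are invented for illustration. This is only the intuition described above, not the inference procedure of any particular library (gensim's LdaModel, used later, relies on a variational method).

```python
# Toy illustration of the topic-reassignment step for one word in one document.
# All probabilities below are made up for the example.

p_word_given_topic = {"sport": 0.02, "politics": 0.001}  # P(word="match" | topic)
p_topic_given_doc = {"sport": 0.3, "politics": 0.7}      # P(topic | document)


def topic_posterior(p_w_t, p_t_d):
    """Return P(topic | word, document), proportional to P(word | topic) * P(topic | document)."""
    scores = {topic: p_w_t[topic] * p_t_d[topic] for topic in p_w_t}
    total = sum(scores.values())
    return {topic: score / total for topic, score in scores.items()}


posterior = topic_posterior(p_word_given_topic, p_topic_given_doc)
best_topic = max(posterior, key=posterior.get)
print(posterior, best_topic)
```

Even though the document as a whole leans towards "politics", the word is far more typical of "sport", so the product favours "sport" for this particular word.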

# NMF

Non-negative Matrix Factorization is a linear-algebraic model that factors high-dimensional vectors into a low-dimensional representation. Like Principal Component Analysis (PCA), NMF reduces dimensionality; unlike PCA, it takes advantage of the fact that the vectors are non-negative. By factoring them into the lower-dimensional form, NMF forces the coefficients to also be non-negative.
Given the original matrix A, we can obtain two matrices W and H such that A ≈ WH.

NMF has an inherent clustering property, such that W and H represent the following information about A:

- A (Document-word matrix): the input, which records which words appear in which documents.

- W (Basis vectors): the topics (clusters) discovered from the documents.

- H (Coefficient matrix): the membership weights for the topics in each document.
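
As a rough illustration of the factorization, the sketch below factors a tiny matrix with the classic multiplicative update rules (Lee and Seung). It is a minimal sketch, not the solver used by scikit-learn. The function name `nmf_multiplicative` and the toy counts are made up. One assumption to note: A is laid out here with one row per word and one column per document, so that the columns of W are the topics and the columns of H are the per-document weights, matching the roles listed above. scikit-learn expects the transpose (documents as rows), in which case the roles of the two factors swap.

```python
import numpy as np


def nmf_multiplicative(A, n_topics, n_iter=200, seed=0, eps=1e-9):
    """Approximate A (words x documents) as W @ H with non-negative W and H."""
    rng = np.random.default_rng(seed)
    n_words, n_docs = A.shape
    W = rng.random((n_words, n_topics)) + eps  # words x topics (basis vectors)
    H = rng.random((n_topics, n_docs)) + eps   # topics x documents (coefficients)
    for _ in range(n_iter):
        # Multiplicative updates keep every entry non-negative
        H *= (W.T @ A) / (W.T @ W @ H + eps)
        W *= (A @ H.T) / (W @ H @ H.T + eps)
    return W, H


# Toy word-by-document counts: words 0-1 occur in documents 0-1, words 2-3 in documents 2-3
A = np.array([
    [2, 3, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 1, 2],
    [0, 0, 2, 3],
], dtype=float)

W, H = nmf_multiplicative(A, n_topics=2)
error = np.linalg.norm(A - W @ H)
print(np.round(W, 2))
print(np.round(H, 2))
print(error)
```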

We will apply topic modeling to the ABC Million Headlines dataset (recently published on Kaggle: https://www.kaggle.com/therohk/million-headlines).

In [1]:
import pandas as pd
import numpy as np
import scipy as sp
import sklearn
import sys
from nltk.corpus import stopwords
import nltk
from gensim.models import ldamodel
import gensim.corpora
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.decomposition import NMF
from sklearn.preprocessing import normalize
import pickle

In [2]:
!dir

 Volume in drive C is Windows
 Volume Serial Number is 7A75-B79E

 Directory of C:\Users\CAMNG3\documents_clustering

23/10/2019  10:37    <DIR>          .
23/10/2019  10:37    <DIR>          ..
21/10/2019  15:47    <DIR>          .ipynb_checkpoints
21/10/2019  11:27        55,392,904 abcnews-date-text.csv
22/10/2019  13:40    <DIR>          clean_data
22/10/2019  13:18     1,037,965,801 glove.6B.300d.txt
22/10/2019  10:03           509,723 LDA2Vec.ipynb
21/10/2019  10:52                24 README.md
23/10/2019  10:37           199,561 Topic_modeling.ipynb
               5 File(s)  1,094,068,013 bytes
               4 Dir(s)  398,598,397,952 bytes free


In [3]:
data = pd.read_csv("abcnews-date-text.csv")
data_text = data[['headline_text']]

In [4]:
data_text.head(3)

Unnamed: 0,headline_text
0,aba decides against community broadcasting lic...
1,act fire witnesses must be aware of defamation
2,a g calls for infrastructure protection summit


In [5]:
len(data_text)

1103663

We need to remove stopwords first

In [33]:
data_text = data_text.sample(frac=0.1, random_state=3)

In [34]:
len(data_text)

110366

# Considering the large amount of data, we can use PySpark to speed up the pre-processing phase

In [6]:
import time
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession, SQLContext


In [7]:
sc = SparkContext("local", "App Name")

In [8]:
#The sql function on a SQLContext enables applications to run SQL queries
sqlContext = SQLContext(sc)

In [9]:
session= SparkSession.builder.appName('pandasToSparkDF').getOrCreate()

In [10]:
# Distribute the headlines as an RDD split into 1000 partitions
# (despite its name, num_part holds the RDD itself, not a partition count)
num_part = sc.parallelize(data_text['headline_text'], 1000)

In [11]:
# from a pandas DataFrame to a Spark DataFrame
df = sqlContext.createDataFrame(data_text)

In [12]:
df.show(10)

+--------------------+
|       headline_text|
+--------------------+
|aba decides again...|
|act fire witnesse...|
|a g calls for inf...|
|air nz staff in a...|
|air nz strike to ...|
|ambitious olsson ...|
|antic delighted w...|
|aussie qualifier ...|
|aust addresses un...|
|australia is lock...|
+--------------------+
only showing top 10 rows



In [13]:
stop = stopwords.words('english')

In [14]:
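# Use a set so that each stopword lookup inside the lambda is a constant-time check
stop_set = set(stop)
start = time.time()
# Each result is wrapped in an outer list so that createDataFrame later yields a single array column
gigi = num_part.map(lambda x: [[word for word in x.split() if word not in stop_set]])
end = time.time()
# Note: map() is lazy, so this only times the definition of the transformation, not its execution
print(end-start)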

0.0009653568267822266


In [15]:
gigi.take(10)

[[['aba', 'decides', 'community', 'broadcasting', 'licence']],
 [['act', 'fire', 'witnesses', 'must', 'aware', 'defamation']],
 [['g', 'calls', 'infrastructure', 'protection', 'summit']],
 [['air', 'nz', 'staff', 'aust', 'strike', 'pay', 'rise']],
 [['air', 'nz', 'strike', 'affect', 'australian', 'travellers']],
 [['ambitious', 'olsson', 'wins', 'triple', 'jump']],
 [['antic', 'delighted', 'record', 'breaking', 'barca']],
 [['aussie', 'qualifier', 'stosur', 'wastes', 'four', 'memphis', 'match']],
 [['aust', 'addresses', 'un', 'security', 'council', 'iraq']],
 [['australia', 'locked', 'war', 'timetable', 'opp']]]

In [16]:
#from rdd to dataframe
s = sqlContext.createDataFrame(gigi)

In [17]:
s.show(10)

+--------------------+
|                  _1|
+--------------------+
|[aba, decides, co...|
|[act, fire, witne...|
|[g, calls, infras...|
|[air, nz, staff, ...|
|[air, nz, strike,...|
|[ambitious, olsso...|
|[antic, delighted...|
|[aussie, qualifie...|
|[aust, addresses,...|
|[australia, locke...|
+--------------------+
only showing top 10 rows



In [18]:
data_pd = s.toPandas()

In [19]:
len(data_pd)

1103663

In [109]:
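# Save the cleaned data because the preprocessing took very long.
# Join the tokens in a copy so that data_pd keeps its token lists for the LDA input.
data_out = data_pd.copy()
data_out['_1'] = data_out['_1'].apply(' '.join)
data_out.to_csv("complete.csv", index=False)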

In [182]:
# start = time.time()
# data_text['headline_text'] = data_text['headline_text'].apply(lambda x : [word for word in x.split() if word not in stopwords.words()])
# end = time.time()
# print(end-start)

#after 30 seconds was still running
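
The commented-out pandas version above is probably slow because `stopwords.words()` is evaluated again for every single word and, without a language argument, it returns the stopwords of every language as a list. A minimal sketch of the usual remedy is to build the stopword collection once, as a set, and reuse it. The tiny `STOPWORDS` set and the helper `remove_stopwords` are hypothetical stand-ins so the example runs without downloading the NLTK corpus.

```python
# Hypothetical mini stopword list; in the notebook this would be set(stopwords.words('english'))
STOPWORDS = frozenset({"a", "against", "be", "for", "in", "is", "of", "on", "the", "to"})


def remove_stopwords(headline, stopwords=STOPWORDS):
    """Split a headline on whitespace and drop every token found in the stopword set."""
    return [word for word in headline.split() if word not in stopwords]


headlines = [
    "aba decides against community broadcasting licence",
    "act fire witnesses must be aware of defamation",
]

# The set is built once, so each membership test is a constant-time lookup
cleaned = [remove_stopwords(headline) for headline in headlines]
print(cleaned)
```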

In [21]:
# get the words as an array for LDA input
train_headlines = [value[0] for value in data_pd.iloc[0:].values]

In [22]:
train_headlines

[['aba', 'decides', 'community', 'broadcasting', 'licence'],
 ['act', 'fire', 'witnesses', 'must', 'aware', 'defamation'],
 ['g', 'calls', 'infrastructure', 'protection', 'summit'],
 ['air', 'nz', 'staff', 'aust', 'strike', 'pay', 'rise'],
 ['air', 'nz', 'strike', 'affect', 'australian', 'travellers'],
 ['ambitious', 'olsson', 'wins', 'triple', 'jump'],
 ['antic', 'delighted', 'record', 'breaking', 'barca'],
 ['aussie', 'qualifier', 'stosur', 'wastes', 'four', 'memphis', 'match'],
 ['aust', 'addresses', 'un', 'security', 'council', 'iraq'],
 ['australia', 'locked', 'war', 'timetable', 'opp'],
 ['australia', 'contribute', '10', 'million', 'aid', 'iraq'],
 ['barca', 'take', 'record', 'robson', 'celebrates', 'birthday'],
 ['bathhouse', 'plans', 'move', 'ahead'],
 ['big', 'hopes', 'launceston', 'cycling', 'championship'],
 ['big', 'plan', 'boost', 'paroo', 'water', 'supplies'],
 ['blizzard', 'buries', 'united', 'states', 'bills'],
 ['brigadier', 'dismisses', 'reports', 'troops', 'harassed'

In [340]:
#total number of unique words
from itertools import chain

# Flatten the tokenized headlines and count the distinct words
docs_temp = chain.from_iterable(train_headlines)
len(set(docs_temp))

13318

# Implementing LDA

In [23]:
#Initialize the number of Topics we need to cluster:
num_topics = 10

We will use the gensim library for LDA. First, we build an id2word dictionary that maps each unique word to an integer id. Then, for each headline, we use the dictionary to obtain a mapping from word ids to their counts in that headline. The LDA model uses both of these mappings.

In [24]:
#Here we assigned a unique integer id to all words appearing in the corpus
id2word = gensim.corpora.Dictionary(train_headlines)

# To convert documents to vectors, we’ll use a document
#representation called bag-of-words. In this representation, 
#each document is represented by one vector where each vector 
#element represents a question-answer pair, in the style of:
# “How many times does the word system appear in the document? Once.”
# ex ['alfa','beta'] => (34,1),(35,1)
#    ['alfa','alfa'] => (34,2)

corpus = [id2word.doc2bow(text) for text in train_headlines]
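
To show what the dictionary and the bag-of-words conversion produce, here is a small pure-Python sketch that mimics the idea without using gensim. The helpers `build_vocabulary` and `to_bag_of_words` are hypothetical, and the integer ids they assign will generally differ from the ones gensim would choose.

```python
from collections import Counter


def build_vocabulary(tokenized_docs):
    """Assign a unique integer id to every distinct token, in order of first appearance."""
    token_to_id = {}
    for doc in tokenized_docs:
        for token in doc:
            token_to_id.setdefault(token, len(token_to_id))
    return token_to_id


def to_bag_of_words(doc, token_to_id):
    """Return sorted (token_id, count) pairs for one document, ignoring unknown tokens."""
    counts = Counter(token_to_id[token] for token in doc if token in token_to_id)
    return sorted(counts.items())


docs = [["alfa", "beta"], ["alfa", "alfa"]]
vocab = build_vocabulary(docs)
bows = [to_bag_of_words(doc, vocab) for doc in docs]
print(vocab)
print(bows)
```

Word order inside a headline is discarded; only the id and the count survive, which is all the LDA model needs.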

In [25]:
LDA = ldamodel.LdaModel(corpus=corpus, id2word=id2word, num_topics=num_topics)

Generating LDA topics:
We will iterate over the topics, get the top words in each cluster, and add them to a DataFrame, then print these words.

In [28]:
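def get_lda_topics(model, num_topics):
    """Return a DataFrame with one column per topic holding its top 10 words."""
    word_dict = {}
    for i in range(num_topics):
        # show_topic returns (word, probability) pairs for topic i
        words = model.show_topic(i, topn=10)
        word_dict['Topic # ' + '{:02d}'.format(i+1)] = [word for word, _ in words]
    return pd.DataFrame(word_dict)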

In [29]:
get_lda_topics(LDA,num_topics)

Unnamed: 0,Topic # 01,Topic # 02,Topic # 03,Topic # 04,Topic # 05,Topic # 06,Topic # 07,Topic # 08,Topic # 09,Topic # 10
0,indigenous,child,australian,government,north,police,one,trump,australia,wa
1,afl,life,sex,queensland,south,man,people,melbourne,world,sa
2,open,league,brisbane,two,canberra,nsw,family,adelaide,turnbull,tasmanian
3,trial,big,years,women,coast,court,final,donald,test,health
4,nrl,guilty,school,say,calls,election,royal,first,china,nt
5,darwin,hospital,power,us,state,death,john,day,national,call
6,drum,children,could,dead,tasmania,perth,says,sydney,new,missing
7,program,former,says,attack,fire,murder,violence,win,2016,city
8,road,man,street,killed,gold,charged,commission,australian,australias,qld
9,case,police,bill,war,korea,woman,budget,report,record,says


# Implementing LDA2VEC using tensorflow

https://github.com/nateraw/Lda2vec-Tensorflow

Pretrained Embeddings

This repo can load a wide variety of pretrained embedding files (see nlppipe.py for more info). The examples are all using GloVe embeddings. You can download them from https://github.com/stanfordnlp/GloVe


# Preprocessing

In [30]:
import lda2vec
import pandas as pd
from lda2vec.nlppipe import Preprocessor
import spacy

Using TensorFlow backend.


In [31]:
# Should we load pretrained embeddings from file
load_embeds = True

# Where to save preprocessed data
clean_data_dir = "./clean_data/"

In [32]:
data_text = pd.read_csv('complete.csv')

In [33]:
# If spaCy cannot find the model, download it with
#   python -m spacy download en_core_web_sm
# and place it in the data sub-directory of the spacy package.
# Initialize a preprocessor
P = Preprocessor(data_text, "_1", max_features=30000, maxlen=10000, min_count=30)

In [34]:
# Run the preprocessing on your dataframe
P.preprocess()


---------- Tokenizing Texts ----------


1103663it [02:55, 6285.66it/s]


Removing 80342 low frequency tokens out of 95494 total tokens

---------- Getting Skipgrams ----------


1103663it [01:11, 15501.62it/s]
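
The log above reports two steps: removing low-frequency tokens and building skip-grams. The sketch below illustrates both ideas in plain Python. It is not the lda2vec library's implementation: the function names, the `window` parameter, and the exact pairing rule are assumptions made for illustration, with `min_count` playing the same role as the argument passed to the preprocessor.

```python
from collections import Counter


def filter_rare_tokens(tokenized_docs, min_count):
    """Drop tokens that occur fewer than min_count times across the whole corpus."""
    counts = Counter(token for doc in tokenized_docs for token in doc)
    return [[token for token in doc if counts[token] >= min_count] for doc in tokenized_docs]


def skipgram_pairs(tokenized_docs, window=2):
    """Yield (doc_id, pivot, target) for every pair of tokens at most `window` positions apart."""
    for doc_id, doc in enumerate(tokenized_docs):
        for i, pivot in enumerate(doc):
            lo, hi = max(0, i - window), min(len(doc), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    yield doc_id, pivot, doc[j]


docs = [["air", "nz", "strike", "pay"], ["air", "nz", "strike", "travellers"]]
kept = filter_rare_tokens(docs, min_count=2)
pairs = list(skipgram_pairs(kept, window=1))
print(kept)
print(pairs)
```

Keeping the document id next to each (pivot, target) pair is what lets a model of this kind learn document-level topic weights alongside word vectors.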


In [77]:
#The output from the preprocess is 2 dictionaries containing all the mappings between words and words_idx 
#and vice-versa and the skipgrams csv

# P.idx_to_word
# P.word_to_idx
# P.skipgrams_df

{1: 'to',
 2: 'in',
 3: 'for',
 4: 'of',
 5: 'on',
 6: 'over',
 7: 'the',
 8: 'police',
 9: 'at',
 10: 'with',
 11: 'after',
 12: 'new',
 13: 'man',
 14: 'a',
 15: 'and',
 16: 'up',
 17: 'as',
 18: 'says',
 19: 'by',
 20: 'from',
 21: 'us',
 22: 'govt',
 23: 'out',
 24: 'court',
 25: 'be',
 26: 'council',
 27: 'more',
 28: 'interview',
 29: 'fire',
 30: 'not',
 31: 'nt',
 32: 'plan',
 33: 'australia',
 34: 'nsw',
 35: 'qld',
 36: 'water',
 37: 'wa',
 38: 'crash',
 39: 'sydney',
 40: 'death',
 41: 'into',
 42: 'back',
 43: 'off',
 44: 'health',
 45: 'against',
 46: 'no',
 47: 'charged',
 48: 'australian',
 49: 'murder',
 50: 'down',
 51: 'report',
 52: 'sa',
 53: 'hospital',
 54: 'an',
 55: 'day',
 56: 'call',
 57: 'calls',
 58: 'may',
 59: 'win',
 60: 'car',
 61: 'world',
 62: 'killed',
 63: 'government',
 64: 'accused',
 65: 'coast',
 66: 'urged',
 67: 'woman',
 68: 'two',
 69: 'home',
 70: 'about',
 71: 'missing',
 72: 'found',
 73: 'is',
 74: 'm',
 75: 'north',
 76: 'set',
 77: 'sou

In [35]:
#if you get an error just go inside the library lda2vec and 
if load_embeds:
    # load embedding matrix from file path
    embedding_matrix = P.load_glove("./glove.6B.300d.txt")
else:
    print("embedding matrix not loaded or used")
    embedding_matrix = None



In [36]:
embedding_matrix

array([[ 0.77465525,  0.46353653,  0.67612951, ...,  0.78450666,
         0.18352231, -0.0420995 ],
       [-0.26710001,  0.23902   , -0.26073   , ...,  0.1964    ,
        -0.54051   ,  0.33379   ],
       [-0.62333   , -0.42434001, -0.035321  , ..., -0.13613001,
         0.09868   ,  0.60900003],
       ...,
       [ 0.67116859, -0.13476656,  0.01170877, ...,  0.15757811,
        -0.0334881 , -0.03941384],
       [ 0.73619998,  0.010199  , -0.30054   , ...,  1.31190002,
        -0.39732999, -0.71890002],
       [ 0.27189001,  0.19475999, -0.05003   , ...,  0.35427001,
        -0.09823   ,  0.15094   ]])

In [37]:
P.save_data(clean_data_dir, embedding_matrix=embedding_matrix)

# Train the model

In [38]:
from lda2vec import utils, model


In [39]:
# Where I saved preprocessed data
data_path = "./clean_data/"
# Whether or not to load saved embeddings file
load_embeds = True

In [40]:
# load data from files
(idx_to_word, word_to_idx, freqs, pivot_ids,
 target_ids, doc_ids, embed_matrix) = utils.load_preprocessed_data(data_path, load_embed_matrix=load_embeds)

In [41]:
len(freqs)

15152

In [42]:
# Number of unique documents
num_docs = doc_ids.max() + 1
# Number of unique words in vocabulary (int)
vocab_size = len(freqs)
# Embed layer dimension size
# If not loading embeds, change 128 to whatever size you want.
embed_size = embed_matrix.shape[1] if load_embeds else 128
# Number of topics to cluster into
num_topics = 5
# Amount of iterations over entire dataset
num_epochs = 200
# Batch size - Increase/decrease depending on memory usage
batch_size = 8192
# Epoch that we want to "switch on" LDA loss
switch_loss_epoch = 0
# Pretrained embeddings value
pretrained_embeddings = embed_matrix if load_embeds else None
# If True, save logdir, otherwise don't
save_graph = True

In [43]:
# Initialize the model
m = model(num_docs,
          vocab_size,
          num_topics,
          embedding_size=embed_size,
          pretrained_embeddings=pretrained_embeddings,
          freqs=freqs,
          batch_size = batch_size,
          save_graph_def=save_graph)


In [45]:
# Train the model
m.train(pivot_ids,
        target_ids,
        doc_ids,
        len(pivot_ids),
        num_epochs,
        idx_to_word=idx_to_word,
        switch_loss_epoch=switch_loss_epoch)

Visualizing the Results
We can now visualize the results of our model using pyLDAvis:

In [None]:
utils.generate_ldavis_data(data_path, m, idx_to_word, freqs, vocab_size)

# LSA (Latent Semantic Analysis)

The core idea is to take a matrix of what we have — documents and terms — and decompose it into a separate document-topic matrix and a topic-term matrix.

Given m documents and n words in our vocabulary, we can construct an m × n matrix A in which each row represents a document and each column represents a word. LSA models typically replace raw counts in the document-term matrix with a tf-idf score.

Once we have our document-term matrix A, we can start thinking about our latent topics. Here’s the thing: in all likelihood, A is very sparse, very noisy, and very redundant across its many dimensions. As a result, to find the few latent topics that capture the relationships among the words and documents, we want to perform dimensionality reduction on A.

This dimensionality reduction can be performed using truncated SVD. SVD, or singular value decomposition, is a technique in linear algebra that factorizes any matrix M into the product of three separate matrices: M = U S V*, where S is a diagonal matrix of the singular values of M and V* is the transpose of V for real matrices. Truncated SVD keeps only the k largest singular values and the corresponding columns of U and V, where k is the number of topics we want.
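
As a small illustration of the decomposition and its truncation, the following sketch uses NumPy's dense `np.linalg.svd` on a made-up 3 × 4 document-term matrix. The matrix values and the choice k = 2 are arbitrary assumptions; for a large sparse matrix a randomized or sparse solver would be used instead.

```python
import numpy as np

# Toy document-term matrix: 3 documents (rows) x 4 terms (columns)
M = np.array([
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
])

# Full (thin) SVD: S holds the singular values in decreasing order
U, S, Vt = np.linalg.svd(M, full_matrices=False)
reconstructed = U @ np.diag(S) @ Vt

# Truncated SVD: keep only the k largest singular values
k = 2
M_k = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]

# Each document expressed in the k-dimensional "topic" space
doc_topic = U[:, :k] * S[:k]

print(np.round(S, 3))
print(np.round(M_k, 3))
print(np.round(doc_topic, 3))
```

The rows of `Vt[:k]` play the role of the topic-term matrix, and `doc_topic` plays the role of the document-topic matrix.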

In [99]:
data_text = pd.read_csv('complete.csv')

In [87]:
data_text['_1'] = data_text['_1'].map(lambda x : x.lower())

Applying Tf-idf to create Document-Term Matrix

Now we have our data ready. We will apply a tf-idf vectorizer to create a document-term matrix, using sklearn’s TfidfVectorizer to build a matrix with 30,000 terms.

In [111]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [117]:
#ngram_range: this just means I’ll look at unigrams, bigrams and trigrams
# max features = number of terms kept
vectorizer = TfidfVectorizer(stop_words=stop, max_features=30000, max_df=0.5, use_idf=True, ngram_range=(1,3))
X = vectorizer.fit_transform(data_text['_1'])

In [119]:
print(X.shape) # check shape of the document-term matrix

(1103663, 30000)


In [120]:
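# get_feature_names() was removed in scikit-learn 1.2; on recent versions use get_feature_names_out()
terms = vectorizer.get_feature_names()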

In [122]:
len(terms)

30000
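
To give a feel for what a tf-idf weight is, here is a pure-Python sketch on three toy headlines. It assumes the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalization of each document vector, which is my understanding of scikit-learn's default settings. The function `tfidf` and the toy documents are hypothetical, and the sketch handles unigrams only.

```python
import math
from collections import Counter


def tfidf(tokenized_docs):
    """Return (idf per term, list of L2-normalized tf-idf dicts, one per document)."""
    n_docs = len(tokenized_docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter(token for doc in tokenized_docs for token in set(doc))
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1.0 for term, count in df.items()}
    vectors = []
    for doc in tokenized_docs:
        tf = Counter(doc)
        weights = {term: count * idf[term] for term, count in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({term: w / norm for term, w in weights.items()} if norm else {})
    return idf, vectors


docs = [["police", "charge", "man"], ["police", "rescue", "man"], ["police", "budget"]]
idf, vectors = tfidf(docs)
print(idf)
print(vectors[2])
```

A term that appears in every document ("police") receives the lowest weight, while a term unique to one document ("budget") receives the highest.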

Clustering text documents using k-means
In this step we will cluster the text documents using the k-means algorithm. The clustering is used purely for plotting purposes here.

In [123]:
from sklearn.cluster import KMeans

In [124]:
num_clusters=10
km = KMeans(n_clusters=num_clusters)
km.fit(X)
clusters = km.labels_.tolist()

KeyboardInterrupt: 
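
The run above was interrupted, so no cluster labels are available from it. For intuition about what k-means does, here is a minimal sketch of Lloyd's algorithm on four toy 2-D points using NumPy. The function `kmeans` and the data are made up, and this dense implementation is not meant for a matrix of the size used above, where a mini-batch variant such as scikit-learn's MiniBatchKMeans may be a more practical option.

```python
import numpy as np


def kmeans(X, n_clusters, n_iter=20, seed=0):
    """Plain Lloyd's algorithm: alternate nearest-center assignment and center updates."""
    rng = np.random.default_rng(seed)
    # Start from n_clusters distinct data points chosen at random
    centers = X[rng.choice(len(X), size=n_clusters, replace=False)].astype(float)
    for _ in range(n_iter):
        # Squared distance from every point to every center: shape (n_points, n_clusters)
        distances = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = distances.argmin(axis=1)
        for k in range(n_clusters):
            members = X[labels == k]
            if len(members):  # keep the old center if a cluster ends up empty
                centers[k] = members.mean(axis=0)
    return labels, centers


X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
labels, centers = kmeans(X, n_clusters=2)
print(labels)
print(centers)
```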

# Topic Modeling

The next step is to represent each and every term and document as a vector. We will use the document-term matrix and decompose it into multiple matrices. This is basically the LSA part.

We will use sklearn’s randomized_svd to perform the matrix decomposition. You need some knowledge of LSA and Singular Value Decomposition (SVD) to understand the part below.

In the definition of SVD, an original matrix A is approximated as a product A ≈ UΣV* where U and V have orthonormal columns, and Σ is non-negative diagonal.

In [125]:
# applying LSA

In [127]:
from sklearn.utils.extmath import randomized_svd

In [None]:
U, Sigma, VT = randomized_svd(X, n_components=10, n_iter=100, random_state=122)

In [None]:
# printing the concepts

In [None]:
for i, comp in enumerate(VT):
    # pair each term with its weight in this concept and keep the 7 heaviest
    terms_comp = zip(terms, comp)
    sorted_terms = sorted(terms_comp, key=lambda x: x[1], reverse=True)[:7]
    print("Concept "+str(i)+": ")
    for t in sorted_terms:
        print(t[0])
    print(" ")  # blank line between concepts