https://medium.com/ml2vec/topic-modeling-is-an-unsupervised-learning-approach-to-clustering-documents-to-discover-topics-fdfbf30e27df

Topic Modeling is an unsupervised learning approach to clustering documents, to discover topics based on their contents.

In this post, we will walk through two different approaches to topic modeling and compare their results. These approaches are LDA (Latent Dirichlet Allocation) and NMF (Non-negative Matrix Factorization). Let’s talk about each of these before we move on to the code. We will look at their definitions and some basic math that describes how they work.

# LDA

LDA, or Latent Dirichlet Allocation, is a probabilistic model. To obtain cluster assignments, it uses two probability values: P(word | topic) and P(topic | document). These values are first computed from an initial random assignment of words to topics; they are then recomputed for each word in each document to decide that word's topic assignment. This procedure is iterated, recalculating the probabilities many times, until the algorithm converges.
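The iterative procedure described above resembles collapsed Gibbs sampling, one common way to fit LDA. The following is only a minimal sketch under that assumption: the function name `gibbs_lda`, the smoothing values `alpha` and `beta`, and the toy corpus are all hypothetical, and it runs a fixed number of sweeps instead of testing for convergence. It is not necessarily how gensim fits its model internally.

```python
import random

def gibbs_lda(docs, num_topics, vocab_size, iters=50, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler; docs is a list of lists of word ids."""
    rng = random.Random(seed)
    # Count tables behind the two probabilities mentioned in the text.
    doc_topic = [[0] * num_topics for _ in docs]                # basis for P(topic | document)
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # basis for P(word | topic)
    topic_total = [0] * num_topics

    # Initial random assignment of every word occurrence to a topic.
    assign = []
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(num_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assign.append(z_doc)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Take the word's current assignment out of the counts.
                z = assign[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Unnormalized weight per topic: P(word | topic) * P(topic | document).
                weights = [
                    (topic_word[k][w] + beta) / (topic_total[k] + beta * vocab_size)
                    * (doc_topic[d][k] + alpha)
                    for k in range(num_topics)
                ]
                # Draw a new topic for this word and put it back into the counts.
                z = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return assign, doc_topic, topic_word

# Hypothetical toy corpus: word ids 0-2 and 3-5 never appear in the same document.
toy_docs = [[0, 1, 2, 0], [3, 4, 5, 3], [1, 2, 0], [4, 5, 3]]
assign, doc_topic, topic_word = gibbs_lda(toy_docs, num_topics=2, vocab_size=6)
print(doc_topic)  # per-document topic counts
```

Each row of `doc_topic` sums to the length of the corresponding document, since every word occurrence holds exactly one topic assignment at any time.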

# NMF

Non-negative Matrix Factorization is a linear-algebraic model that factors high-dimensional vectors into a low-dimensional representation. Like Principal Component Analysis (PCA), it reduces dimensionality, but NMF additionally takes advantage of the fact that the vectors are non-negative: by factoring them into the lower-dimensional form, it forces the coefficients to be non-negative as well.
Given the original matrix A, we can obtain two matrices W and H such that A ≈ WH.

NMF has an inherent clustering property, such that W and H represent the following information about A:

- A (document-word matrix): the input, recording which words appear in which documents.

- W (basis vectors): the topics (clusters) discovered from the documents.

- H (coefficient matrix): the membership weights for the topics in each document.
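To make the factorization A ≈ WH concrete, here is a small sketch using the classic multiplicative update rules for the Frobenius-norm objective. This is one common NMF algorithm, chosen here for brevity; it is not necessarily the solver a library such as scikit-learn uses by default. The function name `nmf_multiplicative`, the toy matrix, and the iteration count are assumptions for illustration.

```python
import numpy as np

def nmf_multiplicative(A, rank, iters=200, seed=0, eps=1e-9):
    """Factor a non-negative matrix A (m x n) into W (m x rank) and H (rank x n)."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    # Start from small positive random factors.
    W = rng.random((m, rank)) + eps
    H = rng.random((rank, n)) + eps
    for _ in range(iters):
        # Multiplicative updates only ever multiply by non-negative ratios,
        # so W and H stay non-negative throughout.
        H *= (W.T @ A) / (W.T @ W @ H + eps)
        W *= (A @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Hypothetical 4 x 4 count matrix with two obvious blocks.
A = np.array([[2, 1, 0, 0],
              [1, 2, 0, 0],
              [0, 0, 1, 2],
              [0, 0, 2, 1]], dtype=float)

W, H = nmf_multiplicative(A, rank=2)
print(W.shape, H.shape)        # shapes of the two factors
print(np.round(W @ H, 2))      # low-rank reconstruction of A
```

The product `W @ H` only approximates A, because a rank-2 factorization cannot reproduce every entry of the matrix exactly.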

We will apply topic modeling to the ABC news dataset "A Million News Headlines" (published on Kaggle recently: https://www.kaggle.com/therohk/million-headlines)

In [4]:
import pandas as pd
import numpy as np
import scipy as sp
import sklearn
import sys
from nltk.corpus import stopwords
import nltk
from gensim.models import ldamodel
import gensim.corpora
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.decomposition import NMF
from sklearn.preprocessing import normalize
import pickle

In [7]:
!dir

 Volume in drive C is Windows
 Volume Serial Number is 7A75-B79E

 Directory of C:\Users\CAMNG3\documents_clustering

21/10/2019  11:32    <DIR>          .
21/10/2019  11:32    <DIR>          ..
21/10/2019  11:05    <DIR>          .ipynb_checkpoints
21/10/2019  11:27        55,392,904 abcnews-date-text.csv
21/10/2019  10:52                24 README.md
21/10/2019  11:32            10,090 Topic_modeling.ipynb
               3 File(s)     55,403,018 bytes
               3 Dir(s)  404,868,915,200 bytes free


In [297]:
# on_bad_lines="warn" (pandas >= 1.3) replaces warn_bad_lines/error_bad_lines, which were removed in pandas 2.0
data = pd.read_csv("abcnews-date-text.csv", on_bad_lines="warn")
data_text = data[['headline_text']]

In [298]:
data_text.head(3)

Unnamed: 0,headline_text
0,aba decides against community broadcasting lic...
1,act fire witnesses must be aware of defamation
2,a g calls for infrastructure protection summit


In [299]:
len(data_text)

1103663

We need to remove stopwords first. Because that is slow on more than a million headlines, we start by sampling 1% of the data.

In [300]:
data_text = data_text.sample(frac=0.01, random_state=3)

In [301]:
len(data_text)

11037

# Considering the large amount of data, we can use Spark to speed everything up

In [302]:
# import pyspark without an alias, so that `sp` keeps referring to scipy
import pyspark
from pyspark.sql import SparkSession
from pyspark import SparkContext
from pyspark.sql import SQLContext
import time

In [303]:
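spark = SparkSession.builder.appName('pandasToSparkDF').getOrCreate()  # `spark` is only predefined in the pyspark shell
sc = spark.sparkContext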

In [304]:
#The sql function on a SQLContext enables applications to run SQL queries
sqlContext = SQLContext(sc)

In [305]:
session= SparkSession.builder.appName('pandasToSparkDF').getOrCreate()

In [306]:
# Distribute the headlines as an RDD, choosing how many partitions to split them into (here 5)
num_part = sc.parallelize(data_text['headline_text'],5)

In [307]:
# from pandas DataFrame to Spark DataFrame
df = spark.createDataFrame(data_text)

In [308]:
df.show(10)

+--------------------+
|       headline_text|
+--------------------+
|millions of dolla...|
|burrow and fannin...|
|teens rampage in ...|
|owners urged to p...|
|australians urged...|
|us markets fail t...|
|industrial relati...|
|indigenous missin...|
|police cite falli...|
|long jail term so...|
+--------------------+
only showing top 10 rows



In [309]:
rdd_df=df.select('headline_text').rdd

In [310]:
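# Build the stopword set once; calling stopwords.words() for every word is very slow
stop_words = set(stopwords.words())
start = time.time()
# map() is lazy: this line only records the transformation, so the timing below
# does not include the filtering itself, which runs later on take()/toPandas()
num_part = num_part.map(lambda x: [[word for word in x.split() if word not in stop_words]])
end = time.time()
print(end-start)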

0.0009996891021728516
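The near-zero timing above is expected: Spark transformations such as `map` are lazy, so the call only records what should be done, and the work happens when an action such as `take` or `toPandas` runs. Python's built-in `map` behaves in a similar way, which allows a rough analogy that does not need a Spark installation. The function `slow_upper` and the sample headlines are hypothetical stand-ins, not part of the notebook.

```python
import time

def slow_upper(text):
    """Stand-in for expensive per-record work."""
    time.sleep(0.01)
    return text.upper()

headlines = ["teens rampage in vacant house", "flying clydesdale"] * 10

start = time.perf_counter()
lazy = map(slow_upper, headlines)   # builds an iterator; nothing is computed yet
build_time = time.perf_counter() - start

start = time.perf_counter()
result = list(lazy)                 # forces the computation, like an action in Spark
run_time = time.perf_counter() - start

print(f"build: {build_time:.4f}s, run: {run_time:.4f}s")
```

To time the Spark job itself, the timer would have to wrap an action rather than the `map` call.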


In [311]:
num_part.take(3)

[[['millions', 'dollars', 'could', 'saved', 'less', 'invasive', 'surge']],
 [['burrow', 'fanning', 'survive', 'elimination', 'heats', 'fiji', 'pro']],
 [['teens', 'rampage', 'vacant', 'house']]]

In [312]:
#from rdd to dataframe
s = sqlContext.createDataFrame(num_part)

In [313]:
s.show()

+--------------------+
|                  _1|
+--------------------+
|[millions, dollar...|
|[burrow, fanning,...|
|[teens, rampage, ...|
|[owners, urged, p...|
|[australians, urg...|
|[us, markets, fai...|
|[industrial, rela...|
|[indigenous, miss...|
|[police, falling,...|
|[long, jail, term...|
|[three, trapped, ...|
|[racq, calls, roa...|
|[woman, doused, p...|
|[dish, celebrates...|
|[philippines, flo...|
|[flying, clydesdale]|
|[scientist, surpr...|
|[taxi, driver, gr...|
|[water, ski, prop...|
|[wikimedia, boss,...|
+--------------------+
only showing top 20 rows



In [314]:
data_pd = s.toPandas()

In [315]:
len(data_pd)

11037

In [316]:
# save the data because the computation took a very long time
data_pd.to_csv("trial.csv")

In [182]:
# start = time.time()
# data_text['headline_text'] = data_text['headline_text'].apply(lambda x : [word for word in x.split() if word not in stopwords.words()])
# end = time.time()
# print(end-start)

# after 30 seconds it was still running
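One likely reason the pandas version is so slow is that `stopwords.words()` is evaluated again for every single word, and each lookup then scans a list. Building the stopword collection once as a set avoids both costs. The sketch below uses a tiny hypothetical stopword list so that it runs without downloading NLTK data; in the notebook the set would come from `set(stopwords.words())`.

```python
# Hypothetical mini stopword list, used only so the sketch is self-contained.
STOP_WORDS = frozenset({"of", "in", "to", "and", "be", "for", "a", "against"})

def remove_stopwords(headline, stop_words=STOP_WORDS):
    """Split a headline on whitespace and drop stopwords (set lookup is O(1) on average)."""
    return [word for word in headline.split() if word not in stop_words]

print(remove_stopwords("teens rampage in vacant house"))
# ['teens', 'rampage', 'vacant', 'house']
```

With pandas, the same function could be passed to `data_text['headline_text'].apply(remove_stopwords)`.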

In [327]:
# get the words as a list of token lists for the LDA input
train_headlines = [value[0] for value in data_pd.iloc[0:].values]

In [340]:
# total number of unique words
from itertools import chain

# flatten the list of token lists into one list of words
docs_temp = list(chain.from_iterable(train_headlines))
len(set(docs_temp))

13318

# Implementing LDA

In [341]:
#Initialize the number of Topics we need to cluster:
num_topics = 5

We will use the gensim library for LDA. First, we build an id2word dictionary, which maps each word id to its word. Then, for each headline, we use the dictionary to obtain a mapping from word ids to their counts in that headline. The LDA model uses both of these mappings.

In [352]:
id2word = gensim.corpora.Dictionary(train_headlines)

# To convert documents to vectors, we'll use a document representation
# called bag-of-words. In this representation, each document is represented
# by one vector where each vector element represents a question-answer pair,
# in the style of:
# "How many times does the word system appear in the document? Once."
# e.g. ['alfa','beta'] => (34,1),(35,1)
#      ['alfa','alfa'] => (34,2)

corpus = [id2word.doc2bow(text) for text in train_headlines]

In [355]:
corpus[:5]


[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1)],
 [(7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1)],
 [(14, 1), (15, 1), (16, 1), (17, 1)],
 [(18, 1), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1)],
 [(23, 1), (24, 1), (25, 1), (26, 1)]]
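The output above shows each headline as a list of (word id, count) pairs, with id 23 shared by the last two headlines. The idea can be mimicked in plain Python, as in the simplified sketch below. The helper names `build_vocab` and `to_bow` are hypothetical, and ids are assigned in order of first appearance here, which may differ from the ids gensim's `Dictionary` assigns.

```python
from collections import Counter

def build_vocab(docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    token2id = {}
    for doc in docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id

def to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs for one tokenized document."""
    counts = Counter(token2id[token] for token in doc if token in token2id)
    return sorted(counts.items())

docs = [["teens", "rampage", "vacant", "house"], ["house", "fire", "house"]]
token2id = build_vocab(docs)
bow = [to_bow(doc, token2id) for doc in docs]
print(bow)
# [[(0, 1), (1, 1), (2, 1), (3, 1)], [(3, 2), (4, 1)]]
```

A word that occurs in several documents keeps the same id everywhere, and a word repeated within a document appears once with a count greater than 1.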