### Topic Modelling Demo Code

#### Things I want to do -
- Identify a package to build / train LDA model
- Use visualization to explore Documents -> Topics Distribution -> Word distribution

In [None]:
# !pip install pyLDAvis gensim --user

In [1]:
import numpy as np
import pandas as pd

# Visualization
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter
import seaborn as sns
import pyLDAvis.gensim  # in pyLDAvis >= 3.3 this module is named pyLDAvis.gensim_models

# Text Preprocessing and model building
from gensim.corpora import Dictionary
import nltk
from nltk.stem import WordNetLemmatizer
import re
# Iteratively read files
import glob
import os

# For displaying images in ipython
from IPython.display import HTML, display



In [2]:
%matplotlib inline
plt.style.use('ggplot')
plt.rcParams['figure.figsize'] = (14.0, 8.7)
#warnings.filterwarnings('ignore')
pd.options.display.float_format = '{:,.2f}'.format

<h2>Latent Dirichlet Allocation</h2>
<h3>From Documents -- DTM -- LDA Model</h3>

Topic modeling aims to automatically summarize large collections of documents to facilitate organization and management, as well as search and recommendations. It can also help readers understand a collection, to the extent that humans can interpret the topic descriptions.

<img src="images/lda2.png" alt="lda" style="width:60%">
<img src="images/docs_to_lda.png" alt="ldaflow" style="width:100%">
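The Documents -> Topics -> Words flow can be sketched as a toy generative process. This is only an illustration using the standard library, not how Gensim trains a model; the vocabulary, the two topic-word distributions, and the helper names (`sample_dirichlet`, `generate_document`) are all hypothetical.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word, doc_alpha, vocab, n_words, rng):
    """Generate one toy document following the LDA generative story."""
    # 1. Document -> topic distribution
    theta = sample_dirichlet(doc_alpha, rng)
    words = []
    for _ in range(n_words):
        # 2. Pick a topic for this word position
        topic = rng.choices(range(len(theta)), weights=theta)[0]
        # 3. Pick a word from that topic's word distribution
        word = rng.choices(vocab, weights=topic_word[topic])[0]
        words.append(word)
    return theta, words

rng = random.Random(0)
vocab = ["goal", "match", "phone", "software"]   # hypothetical vocabulary
topic_word = [
    [0.45, 0.45, 0.05, 0.05],   # a "sport"-like topic
    [0.05, 0.05, 0.45, 0.45],   # a "tech"-like topic
]
theta, words = generate_document(topic_word, [0.5, 0.5], vocab, 8, rng)
print([round(p, 2) for p in theta], words)
```

Training an LDA model runs this story in reverse: given only the words, it estimates the per-document topic mixtures and the per-topic word distributions.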

### Load Data

In [3]:
# User-defined function to read and store the BBC data from multiple folders
def load_data(folder_names, root_path):
    doc_list = []
    for folder in folder_names:
        # glob returns the paths of all text documents in <root_path>/bbc/<folder>
        pattern = os.path.join(root_path, 'bbc', folder, '*.txt')
        for path in glob.glob(pattern):
            with open(path, encoding='latin1') as f:
                lines = f.readlines()
            heading = lines[0].strip()  # the first line of each file is the headline
            body = ' '.join([l.strip() for l in lines[1:]])
            # The folder name doubles as the category label
            doc_list.append([folder, heading, body])
        print("Completed loading data from folder: %s" % folder)

    print("Completed Loading entire text")

    return doc_list

In [4]:
folder_names = ['business','entertainment','politics','sport','tech']
docs = load_data(folder_names = folder_names, root_path = os.getcwd())

Completed loading data from folder: business
Completed loading data from folder: entertainment
Completed loading data from folder: politics
Completed loading data from folder: sport
Completed loading data from folder: tech
Completed Loading entire text


In [5]:
docs = pd.DataFrame(docs, columns=['Category', 'Heading', 'Article'])
print(docs.head())
print('\nShape of data is {}\n'.format(docs.shape))

   Category                            Heading  \
0  business    UK economy facing 'major risks'   
1  business  Aids and climate top Davos agenda   
2  business   Asian quake hits European shares   
3  business   India power shares jump on debut   
4  business    Lacroix label bought by US firm   

                                             Article  
0   The UK manufacturing sector will continue to ...  
1   Climate change and the fight against Aids are...  
2   Shares in Europe's leading reinsurers and tra...  
3   Shares in India's largest power producer, Nat...  
4   Luxury goods group LVMH has sold its loss-mak...  

Shape of data is (2225, 3)



### Extract Raw Corpus

In [6]:
articles = docs.Article.tolist()

In [7]:
print(type(articles))
print(articles[0:2])

<class 'list'>
[' The UK manufacturing sector will continue to face "serious challenges" over the next two years, the British Chamber of Commerce (BCC) has said.  The group\'s quarterly survey of companies found exports had picked up in the last three months of 2004 to their best levels in eight years. The rise came despite exchange rates being cited as a major concern. However, the BCC found the whole UK economy still faced "major risks" and warned that growth is set to slow. It recently forecast economic growth will slow from more than 3% in 2004 to a little below 2.5% in both 2005 and 2006.  Manufacturers\' domestic sales growth fell back slightly in the quarter, the survey of 5,196 firms found. Employment in manufacturing also fell and job expectations were at their lowest level for a year.  "Despite some positive news for the export sector, there are worrying signs for manufacturing," the BCC said. "These results reinforce our concern over the sector\'s persistent inability to sus

In [8]:
wordnet_lemmatizer = WordNetLemmatizer()

### Preprocessing of Raw Text

In [9]:
from nltk.corpus import stopwords
import nltk
# nltk.download('punkt')
# nltk.download('wordnet')
# nltk.download('stopwords')


In [10]:
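# Note: this rebinds `stopwords` from the NLTK corpus reader to a plain list,
# so re-running this cell without re-importing raises AttributeError.
stopwords = stopwords.words('english')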

In [13]:
stopwords[0:10] , f"Total words in stopwords list {len(stopwords)}"

(['i',
  'me',
  'my',
  'myself',
  'we',
  'our',
  'ours',
  'ourselves',
  'you',
  "you're"],
 'Total words in stopwords list 179')

In [14]:
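# Preprocess one raw article into a list of cleaned tokens
def preprocessText(x):
    temp = x.lower()                                        # normalize case
    temp = re.sub(r'[^\w]', ' ', temp)                      # replace every non-word character with a space
    temp = nltk.word_tokenize(temp)                         # split into word tokens
    temp = [wordnet_lemmatizer.lemmatize(w) for w in temp]  # reduce each token to its lemma
    # Stop words are removed after lemmatization, so a stop word whose lemma
    # differs from its surface form will not match the list.
    temp = [word for word in temp if word not in stopwords]
    return temp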

### Stemming
Rule-based removal of suffixes; the result need not be a real word -
readily - strip "ily" --> read
volley  - strip "y" --> volle

### Lemmatization
Dictionary-based method of reducing words to their root / base form -
volley --> volley
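
The contrast between the two ideas can be sketched with a toy suffix stripper and a toy lookup table. Both are hypothetical stand-ins written for illustration: `toy_stem` is not the Porter algorithm, and `TOY_LEMMAS` only imitates what a real lexicon such as WordNet provides.

```python
def toy_stem(word, suffixes=("ily", "ing", "y", "s")):
    """Strip the first matching suffix; no dictionary check is made."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Hypothetical lookup table standing in for a real lexicon
TOY_LEMMAS = {"shares": "share", "firms": "firm", "volley": "volley"}

def toy_lemmatize(word):
    """Return the dictionary base form, or the word unchanged if unknown."""
    return TOY_LEMMAS.get(word, word)

for w in ["readily", "volley", "shares"]:
    print(w, "-> stem:", toy_stem(w), "| lemma:", toy_lemmatize(w))
```

The stemmer turns "volley" into the non-word "volle", while the lookup leaves it intact, which is why the notebook lemmatizes before building the dictionary.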


In [15]:
articles_final = [preprocessText(article) for article in articles]

In [None]:
articles_final[0:2]

### Transformation of Preprocessed text into Vector form using Gensim

In [17]:
# Create a dictionary representation of the documents.
dictionary = Dictionary(articles_final)

# Filter out words that occur in fewer than 20 documents, or in more than 50% of the documents.
dictionary.filter_extremes(no_below=20, no_above=0.5)

In [18]:
print(dictionary)

Dictionary(3101 unique tokens: ['12', '18', '2', '2003', '2004']...)
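
To make the effect of `no_below` and `no_above` concrete, here is a minimal standard-library sketch of document-frequency filtering. It is an approximation of the idea, not Gensim's implementation (for instance, it ignores the `keep_n` cap), and `filter_by_document_frequency` and the toy documents are hypothetical.

```python
from collections import Counter

def filter_by_document_frequency(tokenized_docs, no_below, no_above):
    """Keep tokens found in at least `no_below` documents and in at most
    a fraction `no_above` of all documents."""
    doc_freq = Counter()
    for doc in tokenized_docs:
        doc_freq.update(set(doc))   # count each token once per document
    max_docs = no_above * len(tokenized_docs)
    return sorted(t for t, df in doc_freq.items() if no_below <= df <= max_docs)

toy_docs = [
    ["said", "market", "share"],
    ["said", "market", "film"],
    ["said", "game", "film"],
    ["said", "game", "phone"],
]
kept = filter_by_document_frequency(toy_docs, no_below=2, no_above=0.5)
print(kept)
```

Here "said" is dropped because it appears in every document, and "share" and "phone" are dropped because each appears in only one.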


In [19]:
# Bag-of-words representation of the documents.
corpus = [dictionary.doc2bow(doc) for doc in articles_final]

In [20]:
print('Number of unique tokens: %d' % len(dictionary))
print('Number of documents: %d' % len(corpus))

Number of unique tokens: 3101
Number of documents: 2225
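
Each entry of `corpus` is a list of `(token_id, count)` pairs. The following standard-library sketch shows the idea on toy data; the helpers `build_token_ids` and `to_bow` are hypothetical, and the way ids are assigned here need not match the ids Gensim assigns.

```python
from collections import Counter

def build_token_ids(tokenized_docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    token2id = {}
    for doc in tokenized_docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id

def to_bow(doc, token2id):
    """Convert one tokenized document into sorted (token_id, count) pairs,
    skipping tokens that are not in the vocabulary."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

toy_docs = [["share", "market", "share"], ["market", "film"]]
token2id = build_token_ids(toy_docs)
bow = to_bow(["share", "share", "film", "unknown"], token2id)
print(bow)
```

Word order is discarded and out-of-vocabulary tokens are ignored, which is why filtering the dictionary first also shrinks every document vector.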


In [None]:
corpus[0]

### Train LDA model using Gensim

In [28]:
# Train LDA model.
from gensim.models import LdaModel

# Set training parameters.
num_topics = 5
chunksize = 2000
passes = 10
# iterations = 400
eval_every = None  # Don't evaluate model perplexity, takes too much time.

# Make an index-to-word dictionary.
temp = dictionary[0]  # This is only to "load" the dictionary.
id2word = dictionary.id2token
# print(id2word)

model = LdaModel(
    corpus=corpus,
    id2word=id2word,
    chunksize=chunksize,
    alpha='auto',
    eta='auto',
#     iterations=iterations,
    num_topics=num_topics,
    passes=passes,
    eval_every=eval_every
)

### Model exploration: Top K words in each topic

In [23]:
import pprint

In [30]:
# Print the top 20 keywords for each of the 5 topics
pprint.pprint(model.print_topics(num_words= 20))
doc_lda = model[corpus]

[(0,
  '0.013*"mobile" + 0.011*"people" + 0.011*"technology" + 0.010*"phone" + '
  '0.007*"firm" + 0.007*"user" + 0.006*"use" + 0.006*"new" + 0.006*"microsoft" '
  '+ 0.006*"one" + 0.005*"music" + 0.005*"pc" + 0.005*"service" + '
  '0.005*"software" + 0.005*"computer" + 0.005*"device" + 0.005*"could" + '
  '0.005*"network" + 0.005*"digital" + 0.005*"system"'),
 (1,
  '0.015*"people" + 0.014*"service" + 0.010*"broadband" + 0.008*"online" + '
  '0.008*"internet" + 0.008*"net" + 0.008*"could" + 0.008*"million" + '
  '0.008*"uk" + 0.008*"new" + 0.006*"bt" + 0.006*"one" + 0.006*"access" + '
  '0.006*"call" + 0.005*"number" + 0.005*"tv" + 0.005*"user" + 0.005*"card" + '
  '0.005*"website" + 0.005*"blog"'),
 (2,
  '0.035*"game" + 0.010*"film" + 0.009*"dvd" + 0.008*"player" + 0.008*"best" + '
  '0.008*"like" + 0.008*"time" + 0.007*"one" + 0.007*"title" + 0.007*"world" + '
  '0.006*"play" + 0.006*"first" + 0.006*"new" + 0.005*"award" + 0.005*"2" + '
  '0.005*"top" + 0.005*"next" + 0.005*"well" 
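
Each topic is printed as a single string of `weight*"word"` terms. If the pairs are needed as data, one option is to parse that string, as in this small sketch (`parse_topic_string` is a hypothetical helper); Gensim's `model.show_topic(topic_id, topn)` should return the same pairs directly without any parsing.

```python
import re

def parse_topic_string(topic_string):
    """Turn '0.013*"mobile" + 0.011*"people"' into [(word, weight), ...]."""
    pairs = re.findall(r'(\d*\.?\d+)\*"([^"]+)"', topic_string)
    return [(word, float(weight)) for weight, word in pairs]

# First three terms of topic 0 from the output above
topic_0 = '0.013*"mobile" + 0.011*"people" + 0.011*"technology"'
parsed = parse_topic_string(topic_0)
print(parsed)
```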

### Model Visualization using PyLDAvis

In [31]:
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(model, corpus, dictionary=dictionary)
vis

### Assign Topic Model Numbers to original Data Frame as Column

In [32]:
# Assigns the topics to the documents in corpus
lda_corpus = model[corpus]

In [41]:
mappings = {4: "politics", 2: "game", 0: "mobile tech", 1: "isp", 3: "piracy"}

In [33]:
topics = []

for doc in lda_corpus:
    temp_id = []
    temp_score = []
    for doc_tuple in doc:
        temp_id.append(doc_tuple[0])
        temp_score.append(doc_tuple[1])
    index = np.argmax(temp_score)
    topics.append(temp_id[index])
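
The loop above picks, for each document, the topic id with the highest probability. A compact sketch of the same selection on hypothetical toy data, in the `(topic_id, probability)` format that `model[corpus]` yields, might look like this; `dominant_topic`, `toy_lda_corpus`, and `toy_mappings` are illustrative names only.

```python
def dominant_topic(doc_topics):
    """Return the topic id with the highest probability for one document."""
    topic_id, _ = max(doc_topics, key=lambda pair: pair[1])
    return topic_id

# Hypothetical per-document output; low-probability topics may be absent
toy_lda_corpus = [
    [(0, 0.71), (1, 0.25), (3, 0.04)],
    [(2, 0.55), (4, 0.45)],
]
toy_mappings = {0: "mobile tech", 2: "game"}   # subset of labels, for illustration
assigned = [dominant_topic(doc) for doc in toy_lda_corpus]
labels = [toy_mappings[t] for t in assigned]
print(assigned, labels)
```

Keeping only the top topic discards the rest of the mixture, so a document split 55/45 between two topics is labelled the same way as one that is 99% a single topic.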

In [34]:
docs["Topic_num"] = topics

In [None]:
docs.tail(n= 40)

In [38]:
docs.columns

Index(['Category', 'Heading', 'Article', 'Topic_num'], dtype='object')

In [42]:
docs["new_label"] = docs["Topic_num"].apply(lambda x: mappings[x])

In [45]:
docs.tail(n=20)

| | Category | Heading | Article | Topic_num | new_label |
|---|---|---|---|---|---|
| 2205 | tech | Cheaper chip for mobiles | A mobile phone chip which combines a modem an... | 0 | mobile tech |
| 2206 | tech | Progress on new internet domains | By early 2005 the net could have two new doma... | 1 | isp |
| 2207 | tech | Slim PlayStation triples sales | Sony PlayStation 2's slimmer shape has proved... | 2 | game |
| 2208 | tech | Loyalty cards idea for TV addicts | Viewers could soon be rewarded for watching T... | 1 | isp |
| 2209 | tech | Apple iPod family expands market | Apple has expanded its iPod family with the r... | 0 | mobile tech |
| 2210 | tech | DVD copy protection strengthened | DVDs will be harder to copy thanks to new ant... | 0 | mobile tech |
| 2211 | tech | Millions buy MP3 players in US | One in 10 adult Americans - equivalent to 22 ... | 1 | isp |
| 2212 | tech | US woman sues over ink cartridges | A US woman is suing Hewlett Packard (HP), say... | 4 | politics |
| 2213 | tech | The Force is strong in Battlefront | The warm reception that has greeted Star Wars... | 2 | game |
| 2214 | tech | Seamen sail into biometric future | The luxury cruise liner Crystal Harmony, curr... | 1 | isp |
