### Topic Modelling Demo Code

#### Things I want to do
- Identify a package to build and train an LDA model
- Use visualization to explore the document -> topic distribution -> word distribution chain

In [2]:
!pip install pyLDAvis gensim



In [3]:
import numpy as np
import pandas as pd

# Visualization
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter
import seaborn as sns
import pyLDAvis.gensim  # renamed to pyLDAvis.gensim_models in pyLDAvis >= 3.3

# Text Preprocessing and model building
from gensim.corpora import Dictionary
import nltk
from nltk.stem import WordNetLemmatizer
import re
# Iteratively read files
import glob
import os

# For displaying images in ipython
from IPython.display import HTML, display

scipy.sparse.sparsetools is a private module for scipy.sparse, and should not be used.
  _deprecated()


In [4]:
%matplotlib inline
plt.style.use('ggplot')
plt.rcParams['figure.figsize'] = (14.0, 8.7)
#warnings.filterwarnings('ignore')
pd.options.display.float_format = '{:,.2f}'.format

<h2>Latent Dirichlet Allocation</h2>
<h3>From Documents -- DTM -- LDA Model</h3>

Topic modeling aims to automatically summarize large collections of documents to facilitate organization and management, as well as search and recommendations. It can also help people understand a collection, to the extent that the topic descriptions it produces are interpretable by humans.

<img src="images/lda2.png" alt="lda" style="width:60%">
<img src="images/docs_to_lda.png" alt="ldaflow" style="width:100%">
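
The generative story that LDA assumes can be sketched in a few lines of NumPy. This is only an illustration of the idea, not the Gensim implementation: the vocabulary, the number of topics, and the Dirichlet parameters below are made-up toy values.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["game", "player", "market", "shares", "phone", "mobile"]  # toy vocabulary (assumption)
num_topics = 3
doc_length = 8

# One word distribution per topic (each row sums to 1), drawn from a Dirichlet prior.
topic_word = rng.dirichlet(alpha=np.full(len(vocab), 0.5), size=num_topics)

# One topic mixture for the document, also drawn from a Dirichlet prior.
doc_topic = rng.dirichlet(alpha=np.full(num_topics, 0.5))

# For each word position: pick a topic, then pick a word from that topic.
topic_ids = rng.choice(num_topics, size=doc_length, p=doc_topic)
words = [vocab[rng.choice(len(vocab), p=topic_word[k])] for k in topic_ids]

print("topic mixture:", np.round(doc_topic, 2))
print("generated document:", " ".join(words))
```

Training an LDA model runs this story in reverse: given only the words, it estimates the topic-word distributions and the per-document topic mixtures.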

### Load Data

In [5]:
# User-defined function to read and store BBC data from multiple folders
def load_data(folder_names, root_path):
    doc_list = []
    for folder in folder_names:
        # Match every .txt file in <root_path>/bbc/<folder>/
        pattern = os.path.join(root_path, 'bbc', folder, '*.txt')
        for path in glob.glob(pattern):  # glob returns the paths of all matching files
            with open(path, encoding='latin1') as f:
                lines = f.readlines()
            heading = lines[0].strip()  # the first line is the headline
            body = ' '.join([l.strip() for l in lines[1:]])
            # The folder name doubles as the category label
            doc_list.append([folder, heading, body])
        print("Completed loading data from folder: %s" % folder)

    print("Completed Loading entire text")

    return doc_list

In [6]:
folder_names = ['business','entertainment','politics','sport','tech']
docs = load_data(folder_names = folder_names, root_path = os.getcwd())

Completed loading data from folder: business
Completed loading data from folder: entertainment
Completed loading data from folder: politics
Completed loading data from folder: sport
Completed loading data from folder: tech
Completed Loading entire text


In [7]:
docs = pd.DataFrame(docs, columns=['Category', 'Heading', 'Article'])
print(docs.head())
print('\nShape of data is {}\n'.format(docs.shape))

   Category                            Heading  \
0  business    UK economy facing 'major risks'   
1  business  Aids and climate top Davos agenda   
2  business   Asian quake hits European shares   
3  business   India power shares jump on debut   
4  business    Lacroix label bought by US firm   

                                             Article  
0   The UK manufacturing sector will continue to ...  
1   Climate change and the fight against Aids are...  
2   Shares in Europe's leading reinsurers and tra...  
3   Shares in India's largest power producer, Nat...  
4   Luxury goods group LVMH has sold its loss-mak...  

Shape of data is (2225, 3)



### Extract Raw Corpus

In [8]:
articles = docs.Article.tolist()

In [9]:
print(type(articles))
print(articles[0:2])

<class 'list'>
[' The UK manufacturing sector will continue to face "serious challenges" over the next two years, the British Chamber of Commerce (BCC) has said.  The group\'s quarterly survey of companies found exports had picked up in the last three months of 2004 to their best levels in eight years. The rise came despite exchange rates being cited as a major concern. However, the BCC found the whole UK economy still faced "major risks" and warned that growth is set to slow. It recently forecast economic growth will slow from more than 3% in 2004 to a little below 2.5% in both 2005 and 2006.  Manufacturers\' domestic sales growth fell back slightly in the quarter, the survey of 5,196 firms found. Employment in manufacturing also fell and job expectations were at their lowest level for a year.  "Despite some positive news for the export sector, there are worrying signs for manufacturing," the BCC said. "These results reinforce our concern over the sector\'s persistent inability to sus

In [10]:
wordnet_lemmatizer = WordNetLemmatizer()

### Preprocessing of Raw Text

In [48]:
from nltk.corpus import stopwords
import nltk
# nltk.download('punkt')
# nltk.download('wordnet')
# nltk.download('stopwords')

In [51]:
# nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/paragpradhan/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
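# Note: this rebinds the name `stopwords` from the NLTK corpus reader to a set of words,
# so re-import `stopwords` from nltk.corpus before running this cell a second time.
stopwords = set(stopwords.words('english'))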

In [56]:
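# Function to preprocess the raw text of one article
def preprocessText(x):
    temp = x.lower()                                         # normalise case
    temp = re.sub(r'[^\w]', ' ', temp)                       # replace punctuation with spaces
    temp = nltk.word_tokenize(temp)                          # split into tokens
    temp = [wordnet_lemmatizer.lemmatize(w) for w in temp]   # reduce words to their lemma
    temp = [word for word in temp if word not in stopwords]  # drop English stop words
    return temp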

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/paragpradhan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/paragpradhan/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [16]:
articles_final = [preprocessText(article) for article in articles]

In [None]:
articles_final[0:2]
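
To see what the preprocessing steps do to a single sentence, here is a minimal, dependency-free sketch. It is an approximation of `preprocessText`, not a replacement: a plain whitespace split stands in for `nltk.word_tokenize`, the stop-word list is a tiny hand-picked set (an assumption for illustration), and lemmatization is skipped.

```python
import re

# Tiny illustrative stop-word list; the notebook uses NLTK's English list instead.
STOP_WORDS = {"the", "to", "over", "has", "a", "of", "and", "in", "will"}

def preprocess_sketch(text):
    text = text.lower()                   # normalise case
    text = re.sub(r"[^\w]", " ", text)    # replace punctuation with spaces
    tokens = text.split()                 # whitespace split stands in for word_tokenize
    return [t for t in tokens if t not in STOP_WORDS]  # no lemmatization in this sketch

sample = 'The UK manufacturing sector will continue to face "serious challenges" over the next two years.'
print(preprocess_sketch(sample))
# ['uk', 'manufacturing', 'sector', 'continue', 'face', 'serious', 'challenges', 'next', 'two', 'years']
```

With the WordNet lemmatizer in place, plural nouns such as "challenges" and "years" would typically also be reduced to "challenge" and "year".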

### Transformation of Preprocessed text into Vector form using Gensim

In [18]:
# Create a dictionary representation of the documents.
dictionary = Dictionary(articles_final)

# Filter out words that occur in fewer than 20 documents or in more than 50% of the documents.
dictionary.filter_extremes(no_below=20, no_above=0.5)

In [19]:
print(dictionary)

Dictionary(3202 unique tokens: ['12', '18', '2', '2003', '2004']...)


In [20]:
# Bag-of-words representation of the documents.
corpus = [dictionary.doc2bow(doc) for doc in articles_final]

In [21]:
print('Number of unique tokens: %d' % len(dictionary))
print('Number of documents: %d' % len(corpus))

Number of unique tokens: 3202
Number of documents: 2225
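
The two steps above (frequency filtering, then bag-of-words conversion) can be sketched in plain Python on a toy corpus. This only mirrors the idea behind `filter_extremes` and `doc2bow`; the documents, thresholds, and the alphabetical id assignment are assumptions made for the example, and Gensim assigns ids in its own way.

```python
from collections import Counter

# Toy tokenised documents (hypothetical data, not the BBC corpus).
docs_tokens = [
    ["shares", "market", "growth", "market"],
    ["game", "player", "market"],
    ["game", "phone", "mobile", "phone"],
    ["game", "growth", "shares"],
]

# Document frequency: in how many documents each token appears.
doc_freq = Counter(tok for doc in docs_tokens for tok in set(doc))

no_below, no_above = 2, 0.5  # toy thresholds (the notebook uses 20 and 0.5)
max_docs = no_above * len(docs_tokens)

# Keep tokens that are neither too rare nor too common.
kept = sorted(t for t, df in doc_freq.items() if no_below <= df <= max_docs)
token2id = {tok: i for i, tok in enumerate(kept)}

def to_bow(doc):
    """Return sorted (token_id, count) pairs for the tokens that survived filtering."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

print(token2id)                           # {'growth': 0, 'market': 1, 'shares': 2}
print([to_bow(d) for d in docs_tokens])   # [[(0, 1), (1, 2), (2, 1)], [(1, 1)], [], [(0, 1), (2, 1)]]
```

Here "game" is dropped because it appears in more than half of the documents, and "player", "phone", and "mobile" are dropped because each appears in only one.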


### Train LDA model using Gensim

In [22]:
dictionary

<gensim.corpora.dictionary.Dictionary at 0x7f9b79679fa0>

In [27]:
# Train LDA model.
from gensim.models import LdaModel

# Set training parameters.
num_topics = 5
chunksize = 2000
passes = 10
# iterations = 400
eval_every = None  # Don't evaluate model perplexity, takes too much time.

# Make an index-to-word dictionary.
temp = dictionary[0]  # Accessing an item forces the dictionary to build id2token.
id2word = dictionary.id2token

model = LdaModel(
    corpus=corpus,
    id2word=id2word,
    chunksize=chunksize,
    alpha='auto',
    eta='auto',
#     iterations=iterations,
    num_topics=num_topics,
    passes=passes,
    eval_every=eval_every
)

### Model exploration: Top K words in each topic

In [36]:
import pprint

# Print the top 20 keywords for each of the 5 topics
pprint.pprint(model.print_topics(num_words=20))
doc_lda = model[corpus]

[(0,
  '0.027*"game" + 0.014*"you" + 0.011*"i" + 0.008*"more" + 0.007*"can" + '
  '0.007*"player" + 0.007*"or" + 0.007*"first" + 0.006*"one" + 0.006*"like" + '
  '0.006*"time" + 0.006*"world" + 0.006*"gadget" + 0.006*"all" + 0.006*"than" '
  '+ 0.005*"there" + 0.005*"play" + 0.005*"title" + 0.005*"what" + '
  '0.005*"new"'),
 (1,
  '0.012*"mobile" + 0.011*"more" + 0.011*"people" + 0.010*"technology" + '
  '0.009*"phone" + 0.007*"can" + 0.007*"service" + 0.006*"than" + 0.006*"user" '
  '+ 0.006*"new" + 0.005*"digital" + 0.005*"or" + 0.005*"one" + 0.005*"about" '
  '+ 0.005*"microsoft" + 0.005*"high" + 0.005*"we" + 0.005*"could" + '
  '0.005*"music" + 0.005*"mr"'),
 (2,
  '0.011*"his" + 0.010*"i" + 0.009*"mr" + 0.008*"we" + 0.006*"after" + '
  '0.005*"who" + 0.004*"â" + 0.004*"u" + 0.004*"there" + 0.004*"new" + '
  '0.004*"over" + 0.004*"last" + 0.004*"out" + 0.004*"time" + '
  '0.004*"government" + 0.003*"if" + 0.003*"t" + 0.003*"more" + 0.003*"about" '
  '+ 0.003*"two"'),
 (3,
  '0.009

### Model Visualization using PyLDAvis

In [35]:
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(model, corpus, dictionary=dictionary)
vis

### Assign Topic Model Numbers to original Data Frame as Column

In [37]:
# Assigns the topics to the documents in corpus
lda_corpus = model[corpus]

In [38]:
topics = []

for doc in lda_corpus:
    temp_id = []
    temp_score = []
    for doc_tuple in doc:
        temp_id.append(doc_tuple[0])
        temp_score.append(doc_tuple[1])
    index = np.argmax(temp_score)
    topics.append(temp_id[index])
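
The same "pick the highest-scoring topic" logic can be checked on a toy input. The probabilities below are invented for illustration; each inner list has the same shape as one entry of `lda_corpus`, that is, a list of `(topic_id, probability)` pairs in which low-probability topics may be missing.

```python
# Each document is a list of (topic_id, probability) pairs (hypothetical values).
toy_lda_corpus = [
    [(0, 0.05), (1, 0.90), (3, 0.05)],
    [(2, 0.60), (4, 0.40)],
    [(0, 0.98)],
]

# For every document, keep the topic id whose probability is largest.
dominant = [max(doc, key=lambda pair: pair[1])[0] for doc in toy_lda_corpus]
print(dominant)  # [1, 2, 0]
```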

In [39]:
docs["Topic_num"] = topics

In [41]:
docs.tail()

| | Category | Heading | Article | Topic_num |
|---|---|---|---|---|
| 2220 | tech | Warning over Windows Word files | Writing a Microsoft Word document can be a da... | 3 |
| 2221 | tech | Fast lifts rise into record books | Two high-speed lifts at the world's tallest b... | 0 |
| 2222 | tech | Nintendo adds media playing to DS | Nintendo is releasing an adapter for its DS h... | 0 |
| 2223 | tech | Fast moving phone viruses appear | Security firms are warning about several mobi... | 1 |
| 2224 | tech | Hacker threat to Apple's iTunes | Users of Apple's music jukebox iTunes need to... | 1 |
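
One way to judge how well the assigned topics line up with the original categories is a cross-tabulation. The sketch below uses only the five rows shown above, rebuilt by hand into a small frame called `toy` (a name introduced here); on the full data the same call would be made on `docs["Category"]` and `docs["Topic_num"]`.

```python
import pandas as pd

# The five tail rows shown above, reduced to the two columns of interest.
toy = pd.DataFrame({
    "Category": ["tech"] * 5,
    "Topic_num": [3, 0, 0, 1, 1],
})

# Count how many articles of each category fall into each topic.
table = pd.crosstab(toy["Category"], toy["Topic_num"])
print(table)
```

A category that maps cleanly onto a single topic shows up as one dominant cell in its row. Here the five tech articles are spread across topics 0, 1, and 3.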
