# Quick Topic Modeling Example

* In this notebook we go over the basic steps of topic modeling, using the LexisNexis data on vaping collected in a previous notebook.

* We'll use the `lda` and `pyLDAvis` modules, which you may need to install first:

``!pip install lda pyLDAvis``

## 1. Load data from CSV file

Use the `pandas` library, which provides easy and powerful access to data in CSV files.

In [110]:
import pandas as pd

In [60]:
data = pd.read_csv('data/articles.csv')

In [10]:
data.head()

| | prebody | body | postbody |
|---|---|---|---|
| 0 | \n\n\n New Straits Ti... | \n\n\nKUALA LUMPUR: MANUFACTURERS and sellers ... | LOAD-DATE: November 1, 2015\n\nLANGUAGE: ENGLI... |
| 1 | \n\n\n Spokesman Revie... | \n\n\nAt Smokin' Legal Vaperz, Alex Overman ca... | LOAD-DATE: December 31, 2015\n\nLANGUAGE: ENGL... |
| 2 | \n\n\n Spokesman Revie... | \n\n\nEDITORIAL\n\nVaping, like smoking, shoul... | LOAD-DATE: January 1, 2016\n\nLANGUAGE: ENGLIS... |
| 3 | \n\n\n The York Dispatc... | \n\n\nReady to quit? Find resources below\n\nC... | LOAD-DATE: April 14, 2015\n\nLANGUAGE: ENGLISH... |
| 4 | \n\n\n The Calgary He... | \n\n\nCalgary city council's proposed ban on v... | LOAD-DATE: June 27, 2015\n\nLANGUAGE: ENGLISH\... |


In [112]:
# the body text can be accessed like this:
data.body

0      \n\n\nKUALA LUMPUR: MANUFACTURERS and sellers ...
1      \n\n\nAt Smokin' Legal Vaperz, Alex Overman ca...
2      \n\n\nEDITORIAL\n\nVaping, like smoking, shoul...
3      \n\n\nReady to quit? Find resources below\n\nC...
4      \n\n\nCalgary city council's proposed ban on v...
5      \n\n\nKUALA LUMPUR: THE Health Ministry will n...
6      \n\n\nHIDDEN RISKS: There has been a long deba...
7      \n\n\nBANGOR, Maine -- In its latest effort to...
8      \n\n\nBANGOR, Maine -- In its latest effort to...
9      \n\n\nTHERE are restaurants in Kuala Lumpur of...
10     \n\n\nAS things are on hyper mode in Kuala Lum...
11     \n\n\nQuitting smoking cigarettes was easy for...
12     \n\n\nThe government has not decided on whethe...
13     \n\n\nThe Welsh government's proposal to ban t...
14     \n\n\nSept. 06--SHAMOKIN -- A new business tha...
15     \n\n\nKUANTAN: THE Pahang Islamic Religious an...
16     \n\n\nVaping advocates found relief at city ha...
17     \n\n\nKyra Donaldson, a 

## 2. Create document-feature matrix

In [11]:
from sklearn.feature_extraction.text import CountVectorizer

In [90]:
cnt_vect = CountVectorizer(min_df=20, stop_words='english')
cnt_vect

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=20,
        ngram_range=(1, 1), preprocessor=None, stop_words='english',
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)
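
With `min_df=20`, the vectorizer keeps only terms that occur in at least 20 documents, and `stop_words='english'` additionally drops common English function words. The following is a minimal pure-Python sketch of the document-frequency filter only, not scikit-learn's implementation; the helper `filter_by_doc_freq` and the toy documents are hypothetical.

```python
import re
from collections import Counter

def filter_by_doc_freq(docs, min_df):
    """Return the sorted vocabulary of terms found in at least `min_df` documents."""
    doc_freq = Counter()
    for doc in docs:
        # Count each term once per document (document frequency, not term frequency).
        tokens = set(re.findall(r"\b\w\w+\b", doc.lower()))
        doc_freq.update(tokens)
    return sorted(term for term, df in doc_freq.items() if df >= min_df)

# Hypothetical toy corpus; the real notebook uses data.body.
toy_docs = [
    "vaping ban proposed by city council",
    "city council debates vaping rules",
    "health ministry studies vaping",
]
vocab_min2 = filter_by_doc_freq(toy_docs, min_df=2)
print(vocab_min2)  # ['city', 'council', 'vaping']
```

Raising `min_df` shrinks the vocabulary, which tends to speed up model fitting at the cost of discarding rare terms.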

In [128]:
simple_vect = CountVectorizer(token_pattern='\\b\\w+\\b')
text1 = 'This is SOME text that I am going to turn into a document term matrix!'
text2 = 'Another sentence here that we want to add'

In [133]:
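toy_matrix = simple_vect.fit_transform([text1, text2])

# On scikit-learn >= 1.0, use get_feature_names_out() instead of get_feature_names()
simple_vect.get_feature_names()

# Show the toy document-term matrix with terms as rows and documents as columns
pd.DataFrame(toy_matrix.todense(),
             columns=simple_vect.get_feature_names(),
             index=['text1', 'text2']).T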


| | text1 | text2 |
|---|---|---|
| a | 1 | 0 |
| add | 0 | 1 |
| am | 1 | 0 |
| another | 0 | 1 |
| document | 1 | 0 |
| going | 1 | 0 |
| here | 0 | 1 |
| i | 1 | 0 |
| into | 1 | 0 |
| is | 1 | 0 |
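
The toy example overrides `token_pattern` so that one-letter words such as `a` and `i` survive; the default pattern `(?u)\b\w\w+\b` requires at least two characters. A small sketch using only the standard library shows the difference (the `tokenize` helper is hypothetical, and it mimics rather than reproduces `CountVectorizer`):

```python
import re
from collections import Counter

def tokenize(text, pattern):
    """Lowercase the text and return every match of the token pattern."""
    return re.findall(pattern, text.lower())

text1 = 'This is SOME text that I am going to turn into a document term matrix!'

default_tokens = tokenize(text1, r"\b\w\w+\b")    # default: two or more characters
single_ok_tokens = tokenize(text1, r"\b\w+\b")    # toy example: one or more characters

# Tokens that only the relaxed pattern keeps.
dropped = sorted(set(single_ok_tokens) - set(default_tokens))
counts = Counter(single_ok_tokens)

print(dropped)          # ['a', 'i']
print(counts["text"])   # 1
```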


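In [92]:
# Fit the vectorizer on the article bodies to build the document-term matrix
matrix = cnt_vect.fit_transform(data.body)
matrix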

<464x1117 sparse matrix of type '<class 'numpy.int64'>'
	with 57455 stored elements in Compressed Sparse Row format>

In [76]:
cnt_vect.get_feature_names()

['000',
 '10',
 '100',
 '10th',
 '11',
 '12',
 '13',
 '14',
 '15',
 '150',
 '16',
 '17',
 '18',
 '19',
 '20',
 '200',
 '2003',
 '2006',
 '2007',
 '2008',
 '2009',
 '2010',
 '2011',
 '2012',
 '2013',
 '2014',
 '2015',
 '2016',
 '2017',
 '21',
 '22',
 '222',
 '23',
 '24',
 '25',
 '250',
 '26',
 '28',
 '29',
 '30',
 '300',
 '31',
 '35',
 '360',
 '40',
 '400',
 '450',
 '50',
 '500',
 '530',
 '60',
 '65',
 '70',
 '700',
 '80',
 '800',
 '8th',
 '90',
 '95',
 '99',
 '___',
 'abdul',
 'ability',
 'able',
 'absolutely',
 'abuse',
 'acceptable',
 'accepted',
 'access',
 'accessories',
 'according',
 'account',
 'accounted',
 'act',
 'action',
 'active',
 'activities',
 'activity',
 'actual',
 'actually',
 'adam',
 'add',
 'added',
 'addict',
 'addicted',
 'addiction',
 'addictive',
 'addicts',
 'adding',
 'addition',
 'additional',
 'additionally',
 'additive',
 'additives',
 'address',
 'administration',
 'adolescents',
 'adopt',
 'adopted',
 'ads',
 'adult',
 'adults',
 'advantages',
 'adverse

## 3. Fit the LDA model

In [83]:
import lda

In [84]:
model = lda.LDA(n_topics=5)

In [93]:
model.fit(matrix)

<lda.lda.LDA at 0x10bbd3710>

In [48]:
model.doc_topic_

array([[  2.24971879e-04,   3.46681665e-01,   2.76940382e-01,
          1.66704162e-01,   2.09448819e-01],
       [  1.74367916e-04,   1.62336530e-01,   9.60767219e-02,
          3.00087184e-01,   4.41325196e-01],
       [  3.60765550e-01,   5.20255183e-01,   1.30781499e-02,
          1.05582137e-01,   3.18979266e-04],
       ..., 
       [  3.39665787e-01,   1.76077397e-01,   1.75901495e-04,
          6.17414248e-02,   4.22339490e-01],
       [  5.30434783e-01,   4.14078675e-04,   4.14078675e-04,
          2.81987578e-01,   1.86749482e-01],
       [  3.93788820e-01,   4.14078675e-04,   4.14078675e-04,
          4.02070393e-01,   2.03312629e-01]])
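
Each row of `model.doc_topic_` is a probability distribution over the five topics, so it sums to 1 and its largest entry marks the document's dominant topic. A sketch using the first three rows printed above, kept in plain Python lists to stay dependency-free (with the real array, `model.doc_topic_.argmax(axis=1)` should give the same indices):

```python
# First three rows copied from the printed doc_topic_ array.
doc_topic_rows = [
    [2.24971879e-04, 3.46681665e-01, 2.76940382e-01, 1.66704162e-01, 2.09448819e-01],
    [1.74367916e-04, 1.62336530e-01, 9.60767219e-02, 3.00087184e-01, 4.41325196e-01],
    [3.60765550e-01, 5.20255183e-01, 1.30781499e-02, 1.05582137e-01, 3.18979266e-04],
]

# Each row should sum to 1, up to the rounding of the printed values.
row_sums = [sum(row) for row in doc_topic_rows]

# Index of the largest probability in each row, i.e. the dominant topic.
dominant = [max(range(len(row)), key=row.__getitem__) for row in doc_topic_rows]
print(dominant)  # [1, 4, 1]
```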

In [55]:
model.topic_word_.shape

(5, 247)

In [105]:
matrix2 = cnt_vect.transform(data.body[:10])
matrix2

<10x1117 sparse matrix of type '<class 'numpy.int64'>'
	with 1815 stored elements in Compressed Sparse Row format>

In [106]:
model.transform(matrix2)



array([[ 0.01959271,  0.04362574,  0.46087429,  0.39479868,  0.08110858],
       [ 0.05151475,  0.2908398 ,  0.01769963,  0.27227386,  0.36767197],
       [ 0.09534772,  0.18483585,  0.11838185,  0.38861033,  0.21282426],
       [ 0.07181134,  0.38003069,  0.03227905,  0.46655596,  0.04932296],
       [ 0.00313397,  0.07231488,  0.01601725,  0.55261716,  0.35591673],
       [ 0.01179145,  0.00565075,  0.49954988,  0.38388034,  0.09912757],
       [ 0.13278075,  0.07818933,  0.08257561,  0.70544624,  0.00100806],
       [ 0.2162294 ,  0.15946174,  0.0040799 ,  0.36138612,  0.25884284],
       [ 0.2162294 ,  0.15946174,  0.0040799 ,  0.36138612,  0.25884284],
       [ 0.00074296,  0.17144157,  0.38328393,  0.40533007,  0.03920147]])
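
Rows 7 and 8 of the output above are identical, and the corresponding entries of `data.body` begin with the same text, which suggests the corpus contains duplicate articles. Duplicates give their terms extra weight when the model is fit. A minimal sketch of spotting them by normalized text (the helper `find_duplicates` and the sample strings are hypothetical):

```python
from collections import defaultdict

def find_duplicates(texts):
    """Group indices of texts that are identical after whitespace/case normalization."""
    groups = defaultdict(list)
    for idx, text in enumerate(texts):
        key = " ".join(text.lower().split())
        groups[key].append(idx)
    return [idxs for idxs in groups.values() if len(idxs) > 1]

# Made-up stand-ins for article bodies; the last two differ only in spacing.
sample_bodies = [
    "\n\n\nCalgary city council's proposed ban",
    "\n\n\nBANGOR, Maine -- In its latest effort",
    "\n\n\nBANGOR, Maine --  In its latest effort",
]
duplicate_groups = find_duplicates(sample_bodies)
print(duplicate_groups)  # [[1, 2]]
```

For exact duplicates, `data.drop_duplicates(subset='body')` in pandas is a shorter route; whether to drop them depends on whether repeated wire stories should count more than once.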

## 4. Visualize the topics with pyLDAvis

In [86]:
%matplotlib inline
import pyLDAvis
import numpy as np

In [245]:
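# Approximate document lengths by whitespace-separated token counts
doc_lengths = [len(t.split()) for t in data.body.values]
# On scikit-learn >= 1.0, use cnt_vect.get_feature_names_out() instead
vocab = cnt_vect.get_feature_names()
# Total count of each vocabulary term across all documents (a 1 x V nested list)
term_frequency = matrix.sum(axis=0).tolist()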

data_dict = {'topic_term_dists': model.topic_word_,
        'doc_topic_dists': model.doc_topic_,
        'doc_lengths': np.array(doc_lengths),
        'vocab': np.array(vocab),
        'term_frequency': term_frequency[0]}
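
The five inputs to `pyLDAvis.prepare` need to agree in shape: with K topics, D documents and V vocabulary terms, `topic_term_dists` is K x V, `doc_topic_dists` is D x K, `doc_lengths` has D entries, and `vocab` and `term_frequency` have V entries each. A hedged sketch of a sanity check to run beforehand (the function `check_lda_inputs` is hypothetical and not part of pyLDAvis):

```python
def check_lda_inputs(topic_term, doc_topic, doc_lengths, vocab, term_frequency):
    """Raise ValueError if the dimensions of the LDA outputs are inconsistent."""
    n_topics, n_terms = len(topic_term), len(topic_term[0])
    n_docs = len(doc_topic)
    if any(len(row) != n_terms for row in topic_term):
        raise ValueError("topic_term rows differ in length")
    if any(len(row) != n_topics for row in doc_topic):
        raise ValueError("doc_topic must have one column per topic")
    if len(doc_lengths) != n_docs:
        raise ValueError("doc_lengths must have one entry per document")
    if len(vocab) != n_terms or len(term_frequency) != n_terms:
        raise ValueError("vocab and term_frequency must have one entry per term")
    return n_topics, n_docs, n_terms

# Toy inputs: 2 topics, 3 documents, 4 terms.
shape = check_lda_inputs(
    topic_term=[[0.4, 0.3, 0.2, 0.1], [0.1, 0.2, 0.3, 0.4]],
    doc_topic=[[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]],
    doc_lengths=[10, 12, 7],
    vocab=["ban", "health", "shop", "vape"],
    term_frequency=[5, 9, 8, 7],
)
print(shape)  # (2, 3, 4)
```

A check like this would flag the `(5, 247)` shape printed for `model.topic_word_` earlier, which does not match the 1117-column matrix, presumably because that cell was run against an earlier fit.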



In [95]:
vis_data = pyLDAvis.prepare(**data_dict)

In [96]:
pyLDAvis.display(vis_data)

In [231]:
vis_data.topic_coordinates

| topic | Freq | cluster | topics | x | y |
|---|---|---|---|---|---|
| 3 | 44.579511 | 1 | 1 | 0.251403 | 0.086521 |
| 1 | 21.382101 | 1 | 2 | -0.01367 | -0.166762 |
| 4 | 15.488031 | 1 | 3 | -0.087 | 0.221213 |
| 0 | 9.77671 | 1 | 4 | 0.10988 | -0.112079 |
| 2 | 8.773647 | 1 | 5 | -0.260614 | -0.028892 |
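
In the table above, the index `topic` is the model's zero-based topic number, while the `topics` column is the label used in the pyLDAvis plot, which appears to be the rank by `Freq`. Topic 1 in the visualization therefore corresponds to row 3 of `model.topic_word_`. A small sketch of that lookup, with the pairs copied from the table (the dictionary names are hypothetical):

```python
# (model topic index -> pyLDAvis label) pairs copied from vis_data.topic_coordinates
model_to_vis = {3: 1, 1: 2, 4: 3, 0: 4, 2: 5}

# Invert the mapping to go from a label in the plot back to the model's index.
vis_to_model = {vis: model for model, vis in model_to_vis.items()}

print(vis_to_model[1])  # 3, the model index of the largest topic in the plot
```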


## 5. Inspect the top terms of each topic

In [183]:
import re





In [256]:
term_dict={}
for tnum,topic in enumerate(model.topic_word_):
    terms = list(np.array(vocab)[np.argsort(topic)])[:-20:-1]
    term_dict[tnum]=terms
    
pd.DataFrame(term_dict)

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | cigarettes | said | vaping | cigarettes | said |
| 1 | said | vaping | said | smoking | cigarettes |
| 2 | health | vape | vape | vaping | public |
| 3 | percent | nicotine | ban | tobacco | vaping |
| 4 | vaping | vapor | health | cigarette | city |
| 5 | tobacco | cigarettes | products | nicotine | ban |
| 6 | nicotine | like | industry | smokers | smoking |
| 7 | products | liquid | government | health | electronic |
| 8 | cigarette | shop | state | people | health |
| 9 | use | store | ministry | smoke | devices |
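
Note that the slice `[:-20:-1]` walks backwards from the last element and stops before index -20, so it returns 19 terms rather than 20. A small sketch of the difference, with `sorted` standing in for `np.argsort` and made-up weights:

```python
scores = [0.05, 0.30, 0.10, 0.25, 0.20, 0.10]   # hypothetical topic-word weights
words = ["ban", "vaping", "city", "said", "health", "shop"]

# Indices in ascending order of score, like np.argsort.
order = sorted(range(len(scores)), key=scores.__getitem__)
ranked = [words[i] for i in order]

top_n = 3
short_by_one = ranked[:-top_n:-1]   # stops before index -top_n: top_n - 1 items
exactly_n = ranked[::-1][:top_n]    # reverse first, then take top_n items

print(short_by_one)  # ['vaping', 'said']
print(exactly_n)     # ['vaping', 'said', 'health']
```

Reversing first and then slicing keeps the count explicit, which matters when the number of terms is reported or compared across topics.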


In [269]:
topic=2
topN=40
topic_terms = list(np.array(vocab)[np.argsort(model.topic_word_[topic-1])][:-topN:-1])
term_re = r'\b({})\b'.format('|'.join(topic_terms)) 
term_re

'\\b(said|vaping|vape|nicotine|vapor|cigarettes|like|liquid|shop|store|people|customers|products|years|juice|quit|new|devices|year|time|day|business|shops|used|smoking|liquids|industry|want|flavors|stores|vapes|started|smoker|trying|know|called|sell|smoke|opened)\\b'
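
The pattern above is an alternation of the topic's top terms wrapped in word boundaries. A sketch of the same highlighting idea on a short made-up sentence, with two optional hardening steps that are assumptions rather than part of the original cell: `re.escape` protects terms containing regex metacharacters, and `re.IGNORECASE` also catches capitalized occurrences.

```python
import re

topic_terms = ["vaping", "vape", "shop", "nicotine"]   # a few terms from the topic

# Build one alternation pattern; escape each term in case it contains metacharacters.
pattern = re.compile(
    r"\b({})\b".format("|".join(re.escape(t) for t in topic_terms)),
    flags=re.IGNORECASE,
)

sentence = "Vaping fans say the vape shop sells nicotine-free liquid."  # made-up text

# Wrap every matched term in Markdown bold and upper-case it.
highlighted = pattern.sub(lambda m: "**" + m.group(1).upper() + "**", sentence)
print(highlighted)
# **VAPING** fans say the **VAPE** **SHOP** sells **NICOTINE**-free liquid.
```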

In [225]:
from IPython.display import Markdown

In [268]:
Markdown(re.sub(term_re, lambda x: '**' + x.group(1).upper() + '**',data.body.iloc[463]))




Responding to riders complaints, BART plans to join the ranks of **PUBLIC** transit
agencies snuffing out e-**CIGARETTE** **USE** on its trains and stations.

The Bay Area Rapid Transit Board board has scheduled a Feb. 12 final vote on the
ban, after giving unanimous initial approval last week.

The American Lung Association called for the measure, saying it is important to
protect the **HEALTH** of riders from second-hand vapors and particle pollution from
electronic **CIGARETTES**.

"If someone pulls out one of this **DEVICES** on a crowded BART train, you're
stuck," **SAID** Serena Chen, regional advocacy director for the American Lung
Association in California. "It's not harmless water vapor. They are particles of
**NICOTINE** and other substances that are listed as harmful toxics."

E-**CIGARETTES** heat a **LIQUID** to produce vapors that can carry **NICOTINE** to the user
-- along with a variety of flavored substances. Sales of the **DEVICES** are booming
as an alternative to **TOBACCO** **CIGARETTES**.

And complaints about **USE** of the **DEVICES** on trains also are growing, **SAID** BART
Director Robert Raburn of Oakland.

While the district bans **SMOKING**, it has had no policy on electronic **CIGARETTES**.
Vaping devises would be subject to the same rules as **CIGARETTES**.

An e-**CIGARETTE** industry spokesman **SAID** Tuesday the measure overreaches because
it bans **VAPING** **DEVICES** everywhere on BART property -- not just on train cars.

"As a matter of etiquette, you can see a ban on trains," **SAID** Greg Conley,
president of the American Vaping Association. "But the proposal goes too far and
it's emblematic of the anti-**CIGARETTE** groups treating electronic **CIGARETTES** the
same."

The **VAPING** **ASSOCIATION** contends e-**CIGARETTES** are much safer than **TOBACCO**
**CIGARETTES** and help many smokers wean themselves off **TOBACCO**.

The **LUNG** **ASSOCIATION**, however, says many people who have never smoked get hooked
on **NICOTINE** through e-**CIGARETTES**

At a BART meeting Thursday night, five speakers spoke in favor of the ban.

Penalties would be $100 for first-time violators and $200 for second time
offenders.

Caltrans, AC Transit, Santa Clara Valley Transportation Authority and San
Francisco Muni **PUBLIC** transit systems already have banned electronic **CIGARETTES**,
says the American Lung Association.

But some other transit operators such as the San Diego Metropolitan Transit
System have not restricted e-**CIGARETTES** as they await **STATE** and federal
guidelines.

A bill introduced this week by State Sen. Mark Leno, D-San Francisco, would
classify e-**CIGARETTES** as **TOBACCO** **PRODUCTS** and bar them from **PUBLIC** transit
systems, the work place, schools, and in restaurants and bars.

Contact Denis Cuff at 925-943-8267. Follow him at Twitter.com/deniscuff .



In [221]:
pd.DataFrame(model.doc_topic_)

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | 0.022976 | 0.034275 | 0.490019 | 0.361959 | 0.090772 |
| 1 | 0.052163 | 0.279604 | 0.000247 | 0.254883 | 0.413103 |
| 2 | 0.000487 | 0.209732 | 0.175669 | 0.433577 | 0.180535 |
| 3 | 0.087205 | 0.348075 | 0.000248 | 0.502112 | 0.062360 |
| 4 | 0.000484 | 0.073123 | 0.000484 | 0.596126 | 0.329782 |
| 5 | 0.032994 | 0.000407 | 0.476986 | 0.346640 | 0.142974 |
| 6 | 0.120203 | 0.086391 | 0.062722 | 0.723753 | 0.006932 |
| 7 | 0.233188 | 0.174964 | 0.020670 | 0.326346 | 0.244833 |
| 8 | 0.218632 | 0.201164 | 0.000291 | 0.326346 | 0.253566 |
| 9 | 0.009583 | 0.195054 | 0.343431 | 0.411437 | 0.040495 |
