# Topic Modeling

Here, given a main topic (politics, business, world, etc.), we will try to detect its top N topics. Topic modeling is inherently difficult and does not always produce reasonable topics, but we will give it a try.

In [1]:
import pandas as pd
import os
import sys

from multiprocessing import cpu_count
from loguru import logger
from pathlib import Path
from pprint import pprint
from time import time, strftime, gmtime

In [2]:
data_folder = Path.home() / 'Data' / 'cc_news'
model_input_folder = data_folder / 'model_output' 

In [3]:
# Configuring the logger
config = {"handlers": [{"sink": sys.stdout,"colorize": True,
          "format": "<green>{time:YYYY-MM-DD HH:mm:ss}</green> | <level>{level: <8}</level> | <cyan>{name}</cyan>:<cyan>{function}</cyan>:<cyan>{line}</cyan> - <level>{message}</level>"}]}
logger.configure(**config)

[1]

In [4]:
# Read data from previous section
df = pd.read_csv(model_input_folder / 'data_topic_model_ready.csv')

In [5]:
# Printing a sample doc
pprint(df.sample(1).iloc[0].text)

('Why ROC: G.W. Lisk Skip to content RochesterFirst Rochester 23  Sponsored By '
 'Search Primary Menu News Local News State News National News International '
 'Washington Business News Entertainment News Your Local Election HQ Adam '
 'Interviews News 8 Archives Crime Education Weird News Digital Exclusives Top '
 'Stories News 8 at Noon: Online broadcast for December 18, 2019 Top Stories '
 'Where to watch: SU vs. Oakland Public Market mainstay Scott’s II closing '
 'this week after 28 years in business Despite team’s record, Tre White is '
 'Bills’ only Pro-Bowler How much would ‘The 12 Days of Christmas’ gifts cost '
 'today? Weather Weather Today’s Forecast 8-Day Forecast Almanac Interactive '
 'Radar Map Center Weather Cameras Weather Dogs Hourly Forecast Weekend '
 'Forecast Weather Watchers Traffic Closings and Delays Sports Local Sports '
 'National Sports Rochester Pro Teams The Bills Report High School Sports '
 'College Sports Buffalo Sabres Section V Best Orange Nation Ev

In [6]:
# The distribution of our classes: 
df.main_topic.value_counts()

business_economy    48536
sports              27930
politics             8785
entertainment        6514
world                1115
Name: main_topic, dtype: int64

## 1. Subsetting the data
Here, we will arbitrarily subset the data to "politics" and see what topics come up.

In [7]:
data = df[df.main_topic == 'politics']
data = data.text.values.tolist()

## 2. Preprocessing


The preprocessing pipeline for topic modeling is involved.

First, I would like to tokenize the text and then detect collocations, which `gensim` calls `Phrases`.

In [8]:
from gensim.utils import simple_preprocess
from gensim.models import Phrases
from gensim.models.phrases import Phraser

In [10]:
# Tokenize removing punctuations with gensim
# Create a generator function 
def sent_to_words(sentences):
    for sentence in sentences:
        yield(simple_preprocess(str(sentence), deacc=True))

In [11]:
data_tokenized = list(sent_to_words(data))

In [12]:
# Inspect an example
print(data_tokenized[1111])

['impeachment', 'hits', 'house', 'floor', 'what', 'to', 'watch', 'on', 'historic', 'vote', 'fnewslocal', 'the', 'kitchen', 'newsinside', 'editionevents', 'preparedweather', 'sportsbig', 'game', 'coverageinstant', 'replayhigh', 'athletesa', 'liveelder', 'eatscool', 'schoolsbig', 'rulessaqtv', 'listingssa', 'picksksat', 'kidsksat', 'tvksat', 'communitysa', 'salutesh', 'backyard', 'kitchenchristus', 'circle', 'ksan', 'antonio', 'river', 'authorityday', 'of', 'the', 'deadksat', 'expertsfoodauto', 'to', 'donewslettersif', 'you', 'are', 'disabled', 'and', 'need', 'help', 'with', 'the', 'public', 'file', 'call', 'to', 'to', 'donewsletters', 'fpoliticslaurie', 'december', 'pmtags', 'jerrold', 'nadler', 'bill', 'clinton', 'michael', 'pence', 'mitch', 'mcconnell', 'john', 'roberts', 'adam', 'schiff', 'justin', 'amash', 'gerald', 'ford', 'richard', 'nixon', 'nancy', 'pelosi', 'donald', 'trump', 'charles', 'schumer', 'george', 'bush', 'politics', 'government', 'jimmy', 'cartersign', 'up', 'for', '

In [18]:
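# Train the bigram and trigram Phrases models.
# The higher the threshold, the fewer phrases you will get.
# The threshold is applied to a collocation score: by default the Mikolov et al. formula,
# or normalized pointwise mutual information if scoring='npmi' is passed.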
bigram = Phrases(data_tokenized, threshold=20)
trigram = Phrases(bigram[data_tokenized], threshold=20)

In [19]:
# Run the models with Phraser which is a lot faster
bigram_model = Phraser(bigram)
trigram_model = Phraser(trigram)
os.system("play /usr/share/sounds/sound-icons/trumpet-1.wav") 

In [20]:
#Inspect
print(trigram_model[bigram_model[data_tokenized[42]]])

['is', 'this', 'final', 'jeopardy', 'for', 'democrats', 'subscribe', 'nowfor_full_windsopen_city', 'settingsfull', 'forecastusa', 'todayphoto', 'videoscrime', 'newsthe', 'job', 'adsdeath', 'noticespublic', 'noticesbusiness', 'directoryusa', 'today', 'todayphoto', 'videoscrime', 'newsthe', 'job', 'adsdeath', 'noticespublic', 'noticesbusiness', 'directoryusa', 'today', 'accountaccess_billreport_delivery_issuespause', 'guidehelp_centersign_outhave_an', 'existing_account_sign_inalready', 'have_subscription_activate', 'your_accountdon_have', 'an_account_create', 'oneget', 'the', 'newsshare_this_story_let', 'friends', 'in', 'your_social_network', 'know', 'what', 'you', 'are', 'reading', 'this', 'final', 'jeopardy', 'for', 'democrats', 'what', 'the', 'wager', 'wrong', 'answers', 'will_leave', 'the', 'party', 'with', 'nothing', 'against', 'the', 'president', 'post', 'to', 'link_has_been_sent', 'to', 'your_friend_email_address', 'posted_link_has_been', 'posted', 'to', 'your_facebook_feed', 'joi
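
To make the `threshold` argument more concrete, here is a minimal sketch of the default score that gensim documents for `Phrases`: `(count(a, b) - min_count) * N / (count(a) * count(b))`, where `N` is the vocabulary size. The function name `default_phrase_score` and all counts below are made up for illustration; they are not taken from this corpus.

```python
def default_phrase_score(count_a, count_b, count_ab, vocab_size, min_count=5):
    """Score the bigram (a, b) with the formula documented for gensim's default scorer."""
    return (count_ab - min_count) * vocab_size / (count_a * count_b)

# Hypothetical counts: "nancy" and "pelosi" almost always occur together,
# while "the" and "house" co-occur often but are both very frequent on their own.
score_name = default_phrase_score(count_a=120, count_b=110, count_ab=105, vocab_size=50_000)
score_common = default_phrase_score(count_a=90_000, count_b=4_000, count_ab=1_500, vocab_size=50_000)

threshold = 20
print(f"nancy pelosi: {score_name:.1f} -> phrase: {score_name > threshold}")
print(f"the house:    {score_common:.3f} -> phrase: {score_common > threshold}")
```

Under these assumed counts, only the rare-but-inseparable pair clears the threshold, which is why raising `threshold` yields fewer phrases.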

### Stopword removal
Now we can get rid of stop words

In [21]:
from nltk.corpus import stopwords

In [22]:
stop_words = stopwords.words('english')
stop_words.extend(['news', "local", 'say'])

In [23]:
# The documents are already tokenized, so filter the token lists directly
data_no_stopwords = [[tok for tok in doc if tok not in stop_words] for doc in data_tokenized]

In [24]:
# Let's finally get our bigrams and trigrams
corpus_trigrammed = [trigram_model[bigram_model[doc]] for doc in data_no_stopwords]
os.system("play /usr/share/sounds/sound-icons/trumpet-1.wav") 

0

In [25]:
print(corpus_trigrammed[42])

['final', 'jeopardy', 'democrats', 'subscribe', 'nowfor_full_windsopen_city', 'settingsfull', 'forecastusa', 'todayphoto', 'videoscrime', 'newsthe', 'job', 'adsdeath', 'noticespublic', 'noticesbusiness', 'directoryusa', 'today', 'todayphoto', 'videoscrime', 'newsthe', 'job', 'adsdeath', 'noticespublic', 'noticesbusiness', 'directoryusa', 'today', 'accountaccess_billreport_delivery_issuespause', 'guidehelp_centersign', 'outhave', 'existing_account_sign_inalready', 'subscription_activate', 'accountdon', 'account_create_oneget', 'newsshare', 'story_let_friends', 'social_network_know', 'reading', 'final', 'jeopardy', 'democrats', 'wager', 'wrong', 'answers', 'leave', 'party', 'nothing', 'president', 'post', 'link', 'sent', 'friend_email', 'address_posted', 'link', 'posted', 'facebook_feed_join', 'conversationto_find', 'facebook_commenting_please_read', 'conversation_guidelines', 'faqscomments_welcome', 'new', 'improved_comments', 'subscribers', 'test', 'see_whether', 'improve', 'experience

### Lemmatization

For lemmatization, which gives much better results than stemming (e.g. better, best -> good), we will use the spaCy package.

In [26]:
import spacy

In [27]:
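# The 'en' shortcut used in the original run was removed in spaCy 3.
# Assuming it pointed at the small English model, load that model by its full name.
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])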

In [28]:
start_time = time()
lemmatized= []
for sent in corpus_trigrammed:
    doc = nlp(" ".join(sent)) 
    lemmatized.append([token.lemma_ for token in doc if token.pos_ in ['NOUN', 'ADJ', 'VERB', 'ADV']])
elapsed = strftime("%H:%M:%S", gmtime(time() - start_time))
logger.info(f'It took {elapsed} to run this script!')
os.system("play /usr/share/sounds/sound-icons/trumpet-1.wav")

2020-01-12 21:43:55 | INFO     | __main__:<module>:7 - It took 00:05:15 to run this script!


0

## 3. Training the Model 
OK now we can train our LDA model...

In [29]:
from gensim.corpora import Dictionary
from gensim.models import LdaModel

In [30]:
# Create Dictionary
id_2_word = Dictionary(lemmatized)

# Filter out words that occur in fewer than 5 documents or in more than 70% of the documents.
id_2_word.filter_extremes(no_below=5, no_above=0.7)

In [31]:
len(id_2_word)

12879
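
As a rough illustration of what the document-frequency bounds do, here is a plain-Python sketch that keeps a token only if it appears in at least `no_below` documents and in at most a `no_above` share of them. The helper `filter_by_doc_frequency` and the toy documents are hypothetical, and the sketch ignores the `keep_n` cap that `filter_extremes` also applies.

```python
from collections import Counter

def filter_by_doc_frequency(docs, no_below=5, no_above=0.7):
    """Return the set of tokens whose document frequency lies within the bounds."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each token once per document
    max_docs = no_above * len(docs)
    return {tok for tok, df in doc_freq.items() if no_below <= df <= max_docs}

toy_docs = [
    ["trump", "impeachment", "vote"],
    ["trump", "senate", "vote"],
    ["trump", "weather"],
    ["trump", "impeachment"],
]
# "trump" is in every document (too common); "senate" and "weather" are in one (too rare)
kept = filter_by_doc_frequency(toy_docs, no_below=2, no_above=0.75)
print(sorted(kept))
```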

In [32]:
# Convert the corpus to a BOW corpus
# Term Document Frequency 
corpus_bow = [id_2_word.doc2bow(doc) for doc in lemmatized]
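
For readers unfamiliar with the bag-of-words format, the sketch below mimics the shape of what `doc2bow` returns: a sorted list of `(token_id, count)` pairs, with tokens missing from the vocabulary dropped. The helper `to_bow` and the three-word vocabulary are assumptions for illustration only.

```python
from collections import Counter

def to_bow(doc, token2id):
    """Map a tokenized document to sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(token2id[tok] for tok in doc if tok in token2id)
    return sorted(counts.items())

token2id = {"impeachment": 0, "trump": 1, "vote": 2}  # hypothetical vocabulary
doc = ["trump", "vote", "trump", "weather", "impeachment"]
print(to_bow(doc, token2id))
```

Word order is discarded at this point, which is why phrase detection has to happen before this step.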

In [33]:
# And train the model at k=30 first
start_time = time()
lda_model =  LdaModel(corpus_bow,
                      num_topics = 30, 
                      id2word = id_2_word,
                      random_state=42,
                      passes = 10,
                      alpha='auto',
                      eta='auto',
                      per_word_topics=True,
                      eval_every=None
                      )
elapsed = strftime("%H:%M:%S", gmtime(time() - start_time))
logger.info(f'It took {elapsed} to run this script!')
os.system("play /usr/share/sounds/sound-icons/trumpet-1.wav")

2020-01-12 21:45:34 | INFO     | __main__:<module>:14 - It took 00:00:42 to run this script!


0

In [34]:
pprint(lda_model.print_topics())

[(13,
  '0.031*"gun" + 0.030*"sign" + 0.030*"trump" + 0.023*"quiz" + 0.023*"page" + '
  '0.022*"politic" + 0.021*"trend" + 0.020*"remember" + 0.019*"website" + '
  '0.018*"technology"'),
 (7,
  '0.057*"disable" + 0.030*"recipe" + 0.029*"minute" + 0.026*"break" + '
  '0.024*"public_file_call_bestcw" + '
  '0.024*"healthlive_healthymodern_votingtv_listingsjax" + 0.017*"monitor" + '
  '0.017*"jail" + 0.017*"result" + 0.016*"dolphinstampa_bay"'),
 (20,
  '0.023*"live" + 0.020*"trade" + 0.017*"trump" + 0.016*"penny" + 0.015*"home" '
  '+ 0.014*"vote" + 0.014*"community" + 0.011*"life" + 0.010*"pence" + '
  '0.010*"politic"'),
 (3,
  '0.019*"hour" + 0.018*"sport" + 0.017*"editor_submit" + 0.017*"vote" + '
  '0.015*"obituary_subscribe_start_subscription" + 0.013*"com" + 0.011*"bid" + '
  '0.011*"business" + 0.010*"purchase_photos_submit_letter" + 0.010*"break"'),
 (2,
  '0.036*"sport" + 0.021*"cheat" + 0.018*"home" + 0.014*"health" + '
  '0.013*"open" + 0.011*"debate" + 0.010*"shadow" + 0.009

### Observation

So we see some interesting things:

1. A topic about guns, perhaps with Trump signing a law.
2. Several impeachment-related topics, and bingo: December 18, 2019 was the day President Trump was impeached by the House.
3. Some not-so-useful topics.

## 4. Evaluation and hyperparameter tuning
Let's get our Coherence scores: 

In [35]:
from gensim.models import CoherenceModel

# Compute Coherence Score
coherence_model_lda = CoherenceModel(model=lda_model, texts=lemmatized, dictionary=id_2_word, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('Coherence Score: ', coherence_lda)

Coherence Score:  0.4719949540934286


Not too good... But how do we know we have selected the right k? That is the difficult part; usually we run a grid search over k and compare the coherence scores:

In [36]:
start_time = time()
coherence_scores = {}
for k in range(5, 55, 5):
    lda_model =  LdaModel(corpus_bow,
                              num_topics = k, 
                              id2word = id_2_word,
                              random_state=42,
                              passes = 10,
                              alpha='auto',
                              eta='auto',
                              per_word_topics=True,
                              eval_every=None
                          )
     
    coherence_model_lda = CoherenceModel(model=lda_model, texts=lemmatized, dictionary=id_2_word, coherence='c_v')
    coherence_scores[k] = coherence_model_lda.get_coherence()
    print(f"done with => {k}")
elapsed = strftime("%H:%M:%S", gmtime(time() - start_time))
logger.info(f'It took {elapsed} to run this script!')
os.system("play /usr/share/sounds/sound-icons/trumpet-1.wav") 

done with => 5
done with => 10
done with => 15
done with => 20
done with => 25
done with => 30
done with => 35
done with => 40
done with => 45
done with => 50
2020-01-12 21:57:59 | INFO     | __main__:<module>:19 - It took 00:07:27 to run this script!


0

In [37]:
coherence_scores

{5: 0.5473958641578343,
 10: 0.5176837263824698,
 15: 0.5023169205958522,
 20: 0.4715409043492447,
 25: 0.47389579648455443,
 30: 0.4719949540934286,
 35: 0.4655300933667939,
 40: 0.4904883165471784,
 45: 0.44097625928281636,
 50: 0.4399632918719032}

In [38]:
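# Reset the timer: the original run reused start_time from the previous cell,
# so the 00:16:59 logged below also includes that cell's runtime.
start_time = time()
coherence_scores = {}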
for k in range(5, 16):
    lda_model =  LdaModel(corpus_bow,
                              num_topics = k, 
                              id2word = id_2_word,
                              random_state=42,
                              passes = 10,
                              alpha='auto',
                              eta='auto',
                              per_word_topics=True,
                              eval_every=None
                          )
     
    coherence_model_lda = CoherenceModel(model=lda_model, texts=lemmatized, dictionary=id_2_word, coherence='c_v')
    coherence_scores[k] = coherence_model_lda.get_coherence()
    print(f"done with => {k}")
elapsed = strftime("%H:%M:%S", gmtime(time() - start_time))
logger.info(f'It took {elapsed} to run this script!')
os.system("play /usr/share/sounds/sound-icons/trumpet-1.wav") 

done with => 5
done with => 6
done with => 7
done with => 8
done with => 9
done with => 10
done with => 11
done with => 12
done with => 13
done with => 14
done with => 15
2020-01-12 22:07:31 | INFO     | __main__:<module>:18 - It took 00:16:59 to run this script!


0

In [39]:
coherence_scores

{5: 0.5473958641578343,
 6: 0.5625845655663847,
 7: 0.5807315378865728,
 8: 0.5788981105242129,
 9: 0.5597550378448293,
 10: 0.5176837263824698,
 11: 0.5298071688287204,
 12: 0.5128208361628906,
 13: 0.5371891025082393,
 14: 0.5154255305395259,
 15: 0.5023169205958522}
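
A small sketch for reading the scores above: it ranks the candidate values of k by coherence. The scores are copied from the output above and rounded to four decimals; the variable names are illustrative only.

```python
# Coherence scores copied (rounded) from the fine-grained search output
fine_scores = {5: 0.5474, 6: 0.5626, 7: 0.5807, 8: 0.5789, 9: 0.5598,
               10: 0.5177, 11: 0.5298, 12: 0.5128, 13: 0.5372, 14: 0.5154, 15: 0.5023}

# Rank the candidate k values from most to least coherent
ranked = sorted(fine_scores.items(), key=lambda kv: kv[1], reverse=True)
best_k, best_score = ranked[0]
print(f"best k = {best_k} (c_v = {best_score:.4f})")
for k, score in ranked[1:4]:
    print(f"k = {k}: {score:.4f} ({best_score - score:.4f} below the best)")
```

The top few values are close to each other, and all runs share one `random_state`, so gaps this small may well shift with a different seed.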

### Results 
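Coherence peaks at k = 7 (0.581), with k = 8 close behind (0.579), while k = 9 scores somewhat lower (0.560). The rest of this section uses a model retrained with k = 9.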

In [None]:
lda_model_optim =  LdaModel(corpus_bow,
                              num_topics = 9, 
                              id2word = id_2_word,
                              random_state=42,
                              passes = 10,
                              alpha='auto',
                              eta='auto',
                              per_word_topics=True,
                              eval_every=None
                          )


## 5. Scoring each document and getting the top topic for each

In [65]:
# Collect one row per document; DataFrame.append was removed in pandas 2.0
rows = []
for doc_result in lda_model_optim[corpus_bow]:
    # With per_word_topics=True each result is a tuple whose first item
    # is the list of (topic_id, probability) pairs for the document
    topic_num, prop_topic = max(doc_result[0], key=lambda x: x[1])  # => dominant topic
    wp = lda_model_optim.show_topic(topic_num)
    topic_keywords = ", ".join(word for word, prop in wp)
    rows.append((int(topic_num), round(float(prop_topic), 2), topic_keywords))
df_topics = pd.DataFrame(rows, columns=['top_topic', 'perc_contrib', 'topic_kw'])
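
The dominant-topic selection in the cell above boils down to picking the pair with the highest probability from each document's topic distribution. Here is a minimal standalone sketch on made-up distributions; `dominant_topic` and the toy values are hypothetical and not produced by the model.

```python
def dominant_topic(doc_topics):
    """Return the (topic_id, probability) pair with the highest probability."""
    return max(doc_topics, key=lambda pair: pair[1])

# Hypothetical per-document topic distributions in (topic_id, probability) form
toy_distributions = [
    [(0, 0.10), (4, 0.66), (7, 0.24)],
    [(2, 0.55), (3, 0.45)],
]
for doc_id, doc_topics in enumerate(toy_distributions):
    topic_id, prob = dominant_topic(doc_topics)
    print(doc_id, topic_id, round(prob, 2))
```

The second toy document shows why the contribution is worth keeping: a dominant topic at 0.55 is a much weaker assignment than one at 0.8 or above.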

In [66]:
df_topics

| | top_topic | perc_contrib | topic_kw |
|---|---|---|---|
| 0 | 7.0 | 0.66 | say, disable, police, man, break, public_file_... |
| 1 | 3.0 | 0.67 | sport, live, weather, home, impeachment, meet,... |
| 2 | 7.0 | 0.57 | say, disable, police, man, break, public_file_... |
| 3 | 2.0 | 0.55 | sign, sport, subscribe, business, home, opinio... |
| 4 | 0.0 | 0.39 | say, state, would, year, law, make, people, al... |
| ... | ... | ... | ... |
| 8780 | 2.0 | 0.42 | sign, sport, subscribe, business, home, opinio... |
| 8781 | 0.0 | 0.71 | say, state, would, year, law, make, people, al... |
| 8782 | 7.0 | 0.83 | say, disable, police, man, break, public_file_... |
| 8783 | 4.0 | 0.43 | trump, share, impeachment, facebookshare_tweet... |


In [77]:
# value_counts() already sorts by count in descending order, so no extra sort is needed
topic_dist = df_topics.top_topic.value_counts().to_frame()

In [78]:
topic_dist

| topic id | top_topic (count) |
|---|---|
| 0.0 | 2066 |
| 6.0 | 1537 |
| 5.0 | 1429 |
| 3.0 | 845 |
| 4.0 | 734 |
| 2.0 | 715 |
| 7.0 | 618 |
| 8.0 | 446 |
| 1.0 | 395 |


In [79]:
# Topics sorted by the distribution 
for t in topic_dist.index:
    wp = lda_model_optim.show_topic(int(t))
    topic_keywords = ", ".join([word for word, prop in wp])
    print(topic_keywords, "\n")

say, state, would, year, law, make, people, also, election, time 

impeachment, trump, vote, impeach, house, debate, abuse, article, say, expect 

trump, impeachment, say, president, politic, vote, election, support, impeach, call 

sport, live, weather, home, impeachment, meet, video, trump, vote, team 

trump, share, impeachment, facebookshare_tweet_email_print, continue, advertise, term, op_ed, terrorism_israel_russia_north, wj_email_subscribe 

sign, sport, subscribe, business, home, opinion, log, thank, read, email 

say, disable, police, man, break, public_file_call, charge, minute, shoot, live 

weather, live, impeachment, week, sport, trump, day, home, meet, video 

link, tip, post, violation, weather, photo, sport, new, weekend, send 



In [80]:
# Add Doc to the dataframe
df_topics['text'] = df[df.main_topic=='politics'].text.tolist()
df_topics['url'] = df[df.main_topic=='politics'].url.tolist()

In [81]:
df_topics

| | top_topic | perc_contrib | topic_kw | text | url |
|---|---|---|---|---|---|
| 0 | 7.0 | 0.66 | say, disable, police, man, break, public_file_... | Volunteers Travel 2,000 Miles to Help Return S... | https://www.news4jax.com/inside-edition/2019/1... |
| 1 | 3.0 | 0.67 | sport, live, weather, home, impeachment, meet,... | NBC/WSJ poll: Public remains split on Trump’s ... | https://www.kark.com/news/national-news/nbc-ws... |
| 2 | 7.0 | 0.57 | say, disable, police, man, break, public_file_... | Feds: Man whose number found on NJ shooter was... | https://www.news4jax.com/news/2019/12/18/feds-... |
| 3 | 2.0 | 0.55 | sign, sport, subscribe, business, home, opinio... | Visclosky set to vote today for both impeachme... | https://www.nwitimes.com/news/local/govt-and-p... |
| 4 | 0.0 | 0.39 | say, state, would, year, law, make, people, al... | Domestic dispute leads to firing of deputy she... | https://www.fox46charlotte.com/news/domestic-d... |
| ... | ... | ... | ... | ... | ... |
| 8780 | 2.0 | 0.42 | sign, sport, subscribe, business, home, opinio... | Court: Part of ‘Obamacare’ invalid, more revie... | https://www.kmvt.com/content/news/Court-Obamac... |
| 8781 | 0.0 | 0.71 | say, state, would, year, law, make, people, al... | Director of Communications and Intergovernment... | https://medford.wickedlocal.com/news/20191218/... |
| 8782 | 7.0 | 0.83 | say, disable, police, man, break, public_file_... | Singer Camila Cabello apologizes for past raci... | https://www.clickondetroit.com/entertainment/2... |
| 8783 | 4.0 | 0.43 | trump, share, impeachment, facebookshare_tweet... | Trump administration setting up asylum seekers... | https://www.kget.com/border-report-tour/trump-... |


In [82]:
# Thresholded at 0.5 contribution
# value_counts() already sorts by count in descending order, so no extra sort is needed
topic_dist_thresh = df_topics[df_topics.perc_contrib >= 0.5].top_topic.value_counts().to_frame()

In [83]:
topic_dist_thresh

| topic id | top_topic (count) |
|---|---|
| 0.0 | 1548 |
| 6.0 | 1230 |
| 5.0 | 1093 |
| 4.0 | 619 |
| 3.0 | 598 |
| 2.0 | 514 |
| 7.0 | 439 |
| 8.0 | 371 |
| 1.0 | 336 |


In [84]:
# Topics sorted by the distribution Thresholded at 0.5 contribution
for t in topic_dist_thresh.index:
    wp = lda_model_optim.show_topic(int(t))
    topic_keywords = ", ".join([word for word, prop in wp])
    print(topic_keywords, "\n")

say, state, would, year, law, make, people, also, election, time 

impeachment, trump, vote, impeach, house, debate, abuse, article, say, expect 

trump, impeachment, say, president, politic, vote, election, support, impeach, call 

trump, share, impeachment, facebookshare_tweet_email_print, continue, advertise, term, op_ed, terrorism_israel_russia_north, wj_email_subscribe 

sport, live, weather, home, impeachment, meet, video, trump, vote, team 

sign, sport, subscribe, business, home, opinion, log, thank, read, email 

say, disable, police, man, break, public_file_call, charge, minute, shoot, live 

weather, live, impeachment, week, sport, trump, day, home, meet, video 

link, tip, post, violation, weather, photo, sport, new, weekend, send 



## 6. Comments and Conclusion

We did find the main topic discussed on that day: impeachment, in several variants (partisan comments, public opinion, courts, elections, etc.).
However, the topics are not easy to label, and several words that should not be in the top n of the topic-word distributions show up there, such as weather and sports.

A quick look at the word distributions of our topics and a spot check of a few records reveal the main problem immediately. Since we used the whole body of each HTML file, many subsections of the web pages (navigation menus, subscription prompts, and so on) came along with it. A more advisable approach would be to parse only the main page content, but since that depends largely on each domain's schema, it is out of scope for our exercises.
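
As a schema-free alternative to parsing the main content, one heuristic (not used in this notebook) is to treat tokens that appear on nearly every page of the same domain as navigation boilerplate. The sketch below is only an illustration under that assumption; the function `boilerplate_tokens_by_domain`, its thresholds, and the example URLs and documents are all hypothetical.

```python
from collections import Counter, defaultdict
from urllib.parse import urlparse

def boilerplate_tokens_by_domain(urls, docs, min_share=0.9, min_pages=3):
    """For each domain, flag tokens that occur in at least `min_share` of its pages."""
    pages = defaultdict(list)
    for url, doc in zip(urls, docs):
        pages[urlparse(url).netloc].append(set(doc))
    flagged = {}
    for domain, token_sets in pages.items():
        if len(token_sets) < min_pages:
            continue  # too few pages to judge this domain
        doc_freq = Counter(tok for tokens in token_sets for tok in tokens)
        flagged[domain] = {t for t, n in doc_freq.items() if n / len(token_sets) >= min_share}
    return flagged

urls = [
    "https://www.example.com/a",
    "https://www.example.com/b",
    "https://www.example.com/c",
    "https://other.example.org/x",
]
docs = [
    ["weather", "sport", "impeachment", "vote"],
    ["weather", "sport", "police", "charge"],
    ["weather", "sport", "election"],
    ["weather", "trump"],
]
flagged = boilerplate_tokens_by_domain(urls, docs)
print(flagged)
```

The flagged tokens could then be removed per domain before building the dictionary. A real story that dominates one site on a given day would be flagged too, so the thresholds would need checking against the data.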

Also, instead of topics, Named Entity Recognition could be applied to the corpus to find the persons, locations and organizations that appear in the news most often.

Based on that, we could then run topic modeling to see which keywords are being discussed in connection with them.

For instance, 

Google (ORG) -> CEO (PERSON) -> Announces blah blah... (topics)

