## Topic Modeling
- This notebook walks through a topic modeling process using `data/interim/subset_first_15000.gzip`
- At the end of the notebook, the dataset is saved with a topic label for each article

#### Import Libraries

In [27]:
# Change to parent directory
import os
os.chdir(os.pardir)

In [28]:
import re
import pickle 
from pprint import pprint
import pandas as pd
pd.set_option('display.max_colwidth', 200)

import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

In [30]:
from src.data_prep.topic_modeling_helpers import (
    preprocess_text,
    make_corpus,
    extract_labels,
    find_best_matching_topic,
)

In [4]:
import gensim
import pyLDAvis.gensim
import pyLDAvis

#### Import file 
- file_directory: `data/interim/`

In [5]:
filename = "subset_first_15000.gzip"
df = pd.read_parquet(os.path.join('data', 'interim', filename), engine="pyarrow")
df.head(2)

Unnamed: 0,date,author,title,article,url,section,publication
0,2016-12-09 18:31:00,Lee Drutman,We should take concerns about the health of liberal democracy seriously,"This post is part of Polyarchy, an independent blog produced by the political reform program at New America, a Washington think tank devoted to developing new ideas and new voices. Imagine you are...",https://www.vox.com/polyarchy/2016/12/9/13898340/democracy-warning-signs,unknown,Vox
1,2016-10-07 21:26:46,Scott Davis,Colts GM Ryan Grigson says Andrew Luck's contract makes it difficult to build the team,"The Indianapolis Colts made Andrew Luck the highest-paid player in NFL history this offseason with a five-year, $122-million contract with $89 million guaranteed. However, they're already finding...",https://www.businessinsider.com/colts-gm-ryan-grigson-andrew-luck-contract-2016-10,unknown,Business Insider


#### Apply pre-processing script to `article` column

In [6]:
# Apply preprocess_text, which removes punctuation, converts to lowercase,
# and removes unicode encoding
papers = df['article'].apply(preprocess_text)

In [7]:
# inspect first 300 chars
papers.tail(1).values[0][:300]

'this week artist studios in california florida japan new york and rhode island advertise on hyperallergic with nectar ads the 128th installment of a series in which artists send in a photo and a description of their workspace want to take part submit your studio  just check out the submission guidel'
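The source of `preprocess_text` is not shown in this notebook. The following is a minimal sketch of the three steps named in the comment above (punctuation removal, lowercasing, unicode clean-up), assuming a plain regex-based approach. The name `preprocess_text_sketch` is hypothetical, and the real helper may behave differently.

```python
import re
import unicodedata


def preprocess_text_sketch(text):
    """Hypothetical stand-in for preprocess_text; not the project's actual helper."""
    # Decompose compatibility characters (e.g. non-breaking space -> space),
    # then drop anything that still has no ASCII representation
    text = unicodedata.normalize("NFKD", text)
    text = text.encode("ascii", "ignore").decode("ascii")
    # Lowercase everything
    text = text.lower()
    # Remove punctuation: keep only word characters and whitespace
    text = re.sub(r"[^\w\s]", "", text)
    return text.strip()


sample = preprocess_text_sketch("Hello, World!\xa0It's 2016.")
```

Like the output above, this sketch does not collapse repeated spaces, so removed punctuation can leave double spaces behind.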

#### Prepare elements for topic modeling   
`corpus`, `id2word`, `bigrams`, `data_lemmatized`

In [8]:
# Create corpus, dictionary, lemmatized data
corpus, id2word, bigrams, data_lemmatized = make_corpus(papers)
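`make_corpus` is a project helper whose code is not shown here. As an illustration of the data structures involved, the sketch below builds a bag-of-words corpus and an id-to-word mapping from already tokenized documents, using only the standard library. It assumes the helper produces structures of a similar shape (gensim represents each document as a list of `(token_id, count)` pairs); bigram detection and lemmatization are left out, and `build_bow` is a hypothetical name.

```python
from collections import Counter


def build_bow(tokenized_docs):
    """Hypothetical sketch: map tokens to integer ids and count them per document."""
    word2id = {}
    corpus = []
    for tokens in tokenized_docs:
        # Assign the next free integer id to each unseen token
        for tok in tokens:
            word2id.setdefault(tok, len(word2id))
        # Count token ids within this document
        counts = Counter(word2id[tok] for tok in tokens)
        corpus.append(sorted(counts.items()))
    id2word = {i: w for w, i in word2id.items()}
    return corpus, id2word


docs = [["police", "protest", "police"], ["art", "protest"]]
corpus_sketch, id2word_sketch = build_bow(docs)
```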

#### Grid search for `n_topics`, `alpha`, `eta`
- **WARNING**: takes a long time! Code adapted from this [Medium article](https://towardsdatascience.com/evaluate-topic-model-in-python-latent-dirichlet-allocation-lda-7d57484bb5d0)

In [None]:
# import numpy as np
# import tqdm
# from gensim.models import CoherenceModel

# # supporting function
# def compute_coherence_values(corpus, dictionary, k, a, b):
    
#     # Use the `dictionary` argument rather than the global `id2word`
#     lda_model = gensim.models.LdaMulticore(corpus=corpus,
#                                            id2word=dictionary,
#                                            num_topics=k, 
#                                            random_state=100,
#                                            chunksize=100,
#                                            passes=10,
#                                            alpha=a,
#                                            eta=b,
#                                            per_word_topics=True)

#     coherence_model_lda = CoherenceModel(model=lda_model, texts=data_lemmatized, dictionary=dictionary, coherence='c_v')
#     return coherence_model_lda.get_coherence()

# grid = {}
# grid['Validation_Set'] = {}
# # Topics range
# min_topics = 10
# max_topics = 16
# step_size = 2
# topics_range = range(min_topics, max_topics, step_size)
# # Alpha parameter
# alpha = list(np.arange(0.5, 1.1, 0.25))
# alpha.append('symmetric')
# alpha.append('asymmetric')
# # Beta parameter
# beta = list(np.arange(0.5, 1.1, 0.25))
# beta.append('symmetric')

# # Validation sets
# num_of_docs = len(corpus)
# corpus_sets = [# gensim.utils.ClippedCorpus(corpus, num_of_docs*0.25), 
#                # gensim.utils.ClippedCorpus(corpus, num_of_docs*0.5), 
#                # gensim.utils.ClippedCorpus(corpus, int(num_of_docs*0.75)), 
#                corpus]
# corpus_title = [
#                 # '75% Corpus', 
#                 '100% Corpus']
# model_results = {'Validation_Set': [],
#                  'Topics': [],
#                  'Alpha': [],
#                  'Beta': [],
#                  'Coherence': []
#                 }
# # Can take a long time to run
# if 1 == 1:
#     pbar = tqdm.tqdm(total=60)
    
#     # iterate through validation corpora
#     for i in range(len(corpus_sets)):
#         # iterate through number of topics
#         for k in topics_range:
#             # iterate through alpha values
#             for a in alpha:
#                 # iterate through beta values
#                 for b in beta:
#                     # get the coherence score for the given parameters
#                     cv = compute_coherence_values(corpus=corpus_sets[i], dictionary=id2word, 
#                                                   k=k, a=a, b=b)
#                     # Save the model results
#                     model_results['Validation_Set'].append(corpus_title[i])
#                     model_results['Topics'].append(k)
#                     model_results['Alpha'].append(a)
#                     model_results['Beta'].append(b)
#                     model_results['Coherence'].append(cv)
                    
#                     pbar.update(1)
#     grid_summary = pd.DataFrame(model_results)
#     grid_summary.to_csv('data/interim/lda_tuning_gridsearch_results.csv', index=False)
#     pbar.close()

# print('>>> DONE! <<<')

In [None]:
# Check 10 best coherence values
# grid_summary.sort_values(by='Coherence', ascending=False).head(10)
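The progress bar total of 60 in the grid search follows from the size of the parameter grid: 3 topic counts, 5 alpha settings, 4 beta settings, and one corpus. A small sketch that enumerates the same grid with `itertools.product`, with the float values written out by hand instead of generated by `np.arange`:

```python
from itertools import product

# Same ranges as the commented-out grid search above
topics_range = range(10, 16, 2)  # 10, 12, 14
alpha_grid = [0.5, 0.75, 1.0, "symmetric", "asymmetric"]
beta_grid = [0.5, 0.75, 1.0, "symmetric"]

# Every (n_topics, alpha, beta) combination evaluated on the full corpus
param_grid = list(product(topics_range, alpha_grid, beta_grid))
n_runs = len(param_grid)
print(n_runs)
```

Each combination trains a full LDA model with 10 passes, which is why the search is slow.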

In [9]:
# Optimized parameters from grid-searching (above)
n_topics = 14
alpha, eta = 'asymmetric', 1

# Build LDA model
lda_model = gensim.models.LdaMulticore(corpus=corpus,
                                       id2word=id2word,
                                       num_topics=n_topics,
                                       random_state=100,
                                       chunksize=100,
                                       passes=10,
                                       alpha=alpha,
                                       eta=eta)

# Visualize the topics
pyLDAvis.enable_notebook()
LDAvis_prepared = pyLDAvis.gensim.prepare(lda_model, corpus, id2word)
LDAvis_prepared

In [10]:
# Print topic_number and keywords (& weights)
pprint(lda_model.print_topics(n_topics))

[(0,
  '0.010*"government" + 0.009*"bill" + 0.008*"trump" + 0.008*"plan" + '
  '0.008*"state" + 0.007*"law" + 0.007*"deal" + 0.007*"could" + 0.006*"policy" '
  '+ 0.006*"may"'),
 (1,
  '0.014*"work" + 0.012*"artist" + 0.012*"art" + 0.005*"space" + 0.004*"image" '
  '+ 0.004*"painting" + 0.004*"exhibition" + 0.004*"world" + 0.004*"create" + '
  '0.004*"new"'),
 (2,
  '0.008*"country" + 0.008*"people" + 0.008*"government" + 0.006*"military" + '
  '0.006*"city" + 0.005*"attack" + 0.005*"official" + 0.005*"report" + '
  '0.005*"area" + 0.004*"kill"'),
 (3,
  '0.013*"family" + 0.013*"police" + 0.012*"people" + 0.010*"black" + '
  '0.008*"man" + 0.008*"child" + 0.007*"protest" + 0.006*"violence" + '
  '0.006*"today" + 0.006*"community"'),
 (4,
  '0.015*"film" + 0.014*"show" + 0.012*"movie" + 0.008*"character" + '
  '0.008*"story" + 0.008*"season" + 0.006*"episode" + 0.005*"first" + '
  '0.005*"s" + 0.005*"watch"'),
 (5,
  '0.067*"sometimes_previous" + 0.014*"flight" + 0.014*"indicates_expand

#### Extract topics from model
- add topic labels as `topic` column in original dataset

In [26]:
# Add a new `topic` column to the original dataset
df['topic'] = extract_labels(lda_model, data_lemmatized, corpus, n_topics)
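The implementation of `extract_labels` is not shown. One common way to label documents with an LDA model is to take the topic with the highest probability for each document. The sketch below assumes that approach and works on hand-written lists of `(topic_id, probability)` pairs, the shape gensim's `get_document_topics` returns for a single document. `dominant_topic` and the example probabilities are illustrative only.

```python
def dominant_topic(doc_topics):
    """Return the topic id with the highest probability for one document."""
    return max(doc_topics, key=lambda pair: pair[1])[0]


# Made-up topic distributions for two documents (not real model output)
doc_topic_dists = [
    [(0, 0.10), (3, 0.85), (4, 0.05)],
    [(1, 0.60), (2, 0.40)],
]
labels = [dominant_topic(dist) for dist in doc_topic_dists]
```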

In [27]:
# Find the best_topic_number matching racial-violence keywords, i.e.,
# ['black', 'man', 'woman', 'police', 'violence', 'kill', 'arrest']
best_topic_number = find_best_matching_topic(lda_model, n_topics)

Matching {topic: total keywords}-candidates are :
 {2: 1, 3: 4, 10: 1}
Best matching topic number is: 3
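The candidate counts printed above are consistent with counting how many of the target keywords appear among each topic's top words. The following sketch assumes that is what `find_best_matching_topic` does (its code is not shown) and applies it to topics 2, 3, and 4 as printed earlier in the notebook. `count_keyword_matches` is a hypothetical name.

```python
import re

KEYWORDS = {"black", "man", "woman", "police", "violence", "kill", "arrest"}

# Topic strings copied from the print_topics output above
topics = [
    (2, '0.008*"country" + 0.008*"people" + 0.008*"government" + 0.006*"military" + '
        '0.006*"city" + 0.005*"attack" + 0.005*"official" + 0.005*"report" + '
        '0.005*"area" + 0.004*"kill"'),
    (3, '0.013*"family" + 0.013*"police" + 0.012*"people" + 0.010*"black" + '
        '0.008*"man" + 0.008*"child" + 0.007*"protest" + 0.006*"violence" + '
        '0.006*"today" + 0.006*"community"'),
    (4, '0.015*"film" + 0.014*"show" + 0.012*"movie" + 0.008*"character" + '
        '0.008*"story" + 0.008*"season" + 0.006*"episode" + 0.005*"first" + '
        '0.005*"s" + 0.005*"watch"'),
]


def count_keyword_matches(topics, keywords):
    """Map topic id -> number of keywords found among the topic's top words."""
    matches = {}
    for topic_id, description in topics:
        # Pull the quoted words out of the weight*"word" description string
        words = re.findall(r'"([^"]+)"', description)
        n_hits = len(keywords.intersection(words))
        if n_hits:
            matches[topic_id] = n_hits
    return matches


candidates = count_keyword_matches(topics, KEYWORDS)
best_topic = max(candidates, key=candidates.get)
```

A match count over ten top words is a coarse signal, so the sampled articles below are worth reading before trusting the label.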


In [28]:
# Inspect a few examples
print(df[df.topic==best_topic_number].shape[0])
df[df.topic==best_topic_number].sample(3)

402


Unnamed: 0,date,author,title,article,url,section,publication,topic
7181,2016-07-08 11:33:16,German Lopez,This tweet perfectly captures how many people feel after the Dallas shooting,"In the aftermath of the mass shooting in Dallas and the police shootings of Alton Sterling and Philando Castile, much of the reaction has been highly polarized, with some people blaming Black Live...",https://www.vox.com/2016/7/8/12126544/dallas-shooting-police-tweet,unknown,Vox,3
536,2018-12-09 00:00:00,Ilana NovickHakim Bishara,"Activists Protest at Whitney Museum, Demanding Vice Chairman and Owner of Tear Gas Manufacturer “Must Go”",Members and supporters of activist group Decolonize This Place emphasized that Warren Kanders is only a symptom of a larger problem. Advertise on Hyperallergic with Nectar Ads “I’d rather see this...,https://hyperallergic.com/475198/activists-protest-at-whitney-museum-demanding-vice-chairman-and-owner-of-tear-gas-manufacturer-must-go/,unknown,Hyperallergic,3
2643,2017-03-13 16:45:00,Denise Benson,Montreal's Stonewall: How the Sex Garage Raid Mobilized a Generation of LGBT Activists,"In our Dancing vs. The State series,THUMP explores nightlife's complicated relationship to law enforcement, past and present. I've always believed that dance clubs and parties have the potential t...",https://www.vice.com/en_us/article/4x8pjq/montreal-sex-garage-raid-feature,Noisey,Vice,3


In [30]:
# Inspect the full text of one randomly sampled article from the best-matching topic
df[df.topic==best_topic_number].article.sample(1).values

array([' Prime Minister Justin Trudeau says Canada has been hit with a terrorist attack, after two men opened fire in a Quebec City mosque, killing six people and injuring 19 others, some gravely.  The victims ranged from 35 to 70 years old, police said at an early morning press conference near the scene. Another 39 were at the mosque, but escaped uninjured.  Two male suspects have been arrested. One was arrested near the scene of the shooting, while the second was stopped after a chase that culminated near a bridge to the east of the city. Police said they do not believe there are any other suspects. "We condemn this terrorist attack on Muslims in a centre of worship and refuge,"\xa0Trudeau said in a statement released late Sunday night, hours after police reported having two men in custody and the situation at the Islamic Cultural Centre of Quebec "under control." Continue reading on VICE\xa0News'],
      dtype=object)

#### Save labeled data 
- Save as `data/interim/labeled_subset_first_15000.gzip`

In [None]:
df.to_parquet('data/interim/labeled_subset_first_15000.gzip', compression='gzip')

In [None]:
with open("models/lda_model_n14_first15000.pkl","wb") as fout:
    pickle.dump(lda_model, fout)

---