## Topic Modeling
- This notebook walks through a topic modeling process using `data/interim/subset_first_15000.gzip`
- At the end of the notebook, a labeled dataset is produced

#### Import Libraries

In [1]:
# Change to parent directory
import os
os.chdir(os.pardir)

In [2]:
import re
import pickle 
from pprint import pprint
import pandas as pd
pd.set_option('display.max_colwidth', 200)

import warnings
# warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings('ignore')


#### Helper Function to process raw data by chunks 
(Functions from `data_prep` package)  
  
For each chunk:
- preprocess text file (remove empty articles, impute nans)
- topic model it
- find the topic that best matches "crime"
    - keywords: ['black', 'man', 'woman', 'police', 'violence', 'kill', 'arrest']
- save the labeled news from each chunk into the `data/interim` folder
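
The keyword-matching step can be pictured roughly as follows. This is a minimal sketch, not the code in `src.data_prep.topic_modeling_helpers`: the function name `pick_crime_topic`, the `topic_words` input (a mapping from topic number to that topic's top words), and the tie-breaking rule (the lowest topic number wins) are all assumptions made for illustration.

```python
# Hypothetical sketch: pick the topic whose top words overlap most with the crime keywords.
CRIME_KEYWORDS = ['black', 'man', 'woman', 'police', 'violence', 'kill', 'arrest']

def pick_crime_topic(topic_words, keywords=CRIME_KEYWORDS):
    """Return (best_topic, candidates), where candidates maps topic -> keyword hits."""
    keyword_set = set(keywords)
    candidates = {}
    for topic_no, words in topic_words.items():
        hits = len(keyword_set.intersection(words))
        if hits > 0:
            candidates[topic_no] = hits
    if not candidates:
        return None, candidates
    # Assumed tie-break: most hits first, then the lowest topic number
    best_topic = min(candidates, key=lambda t: (-candidates[t], t))
    return best_topic, candidates

# Made-up top words for three topics (not taken from a fitted model)
example_topics = {
    0: ['market', 'stock', 'man'],
    1: ['police', 'arrest', 'kill', 'court'],
    2: ['game', 'team', 'season'],
}
best, candidates = pick_crime_topic(example_topics)
print(candidates)
print(best)
```

A topic with no keyword hits never becomes a candidate, so a chunk without any crime-like topic would return `None` in this sketch.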

In [4]:
from src.data_prep.topic_modeling_helpers import (preprocess_text, make_corpus, extract_labels,
                                                  find_best_matching_topic, build_lda_model)
from src.data_prep.preprocessing_helpers import impute_nans, remove_empty_articles

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/jhonsen/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [7]:
# CONSTANTs (Hyperparameters) from previous notebook (TopicModelingFirstBatch.ipynb)
ALPHA = 'asymmetric'
ETA = 1
NTOPICS = 14

# Helper functions
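def create_topic_model(dataset, start_row, end_row):
    """Fit an LDA model on one chunk, label every article with a topic, and save the results.

    Returns the rows labeled with the best-matching ("crime") topic, that topic's number,
    and the name of the parquet file written to data/interim.
    """
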
    # Preprocess 
    papers = dataset['article'].apply(preprocess_text)
    # Prepare topic modeling input
    corpus, id2word, bigrams, data_lemmatized = make_corpus(papers)
    # Build model & print the topic number with best matching keywords
    lda_model = build_lda_model(corpus, id2word, n_topics=NTOPICS, alpha=ALPHA, eta=ETA)
    best_topic_no = find_best_matching_topic(lda_model, n_topics=NTOPICS)
    # label document
    dataset['topic'] = extract_labels(lda_model, data_lemmatized, corpus, n_topics=NTOPICS)
    print(f'Topic {best_topic_no} has ', dataset[dataset.topic==best_topic_no].shape[0], ' rows')
    # Save file & model
    filename = f'labeled_crime_row{start_row}_to_row{end_row}.gzip'
    filepath = os.path.join('data', 'interim', filename)
    dataset.to_parquet(filepath, compression='gzip')

    model_path = os.path.join('models', f'lda_model_n{NTOPICS}_row{start_row}_to_row{end_row}.pkl')
    with open(model_path, "wb") as fout:
        pickle.dump(lda_model, fout)
    print(f'LDA model saved as {model_path}')
    
    return dataset[dataset.topic==best_topic_no], best_topic_no, filename
        
def process_chunk(dataset, start_row, end_row):
    return create_topic_model(impute_nans(remove_empty_articles(dataset)), start_row, end_row)


---

#### Processing each chunk
**BEWARE** This takes **13 hours** to run locally! 
  
(You can skip this step and continue with the next cell)
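
The row ranges that appear in each chunk's filename follow directly from the chunk size. A small sketch, assuming 1-based, inclusive row numbers as in the filenames used in this notebook; `chunk_row_ranges` is a hypothetical helper and not part of the `src` package. It caps the last range at the total row count, which simple running counters do not.

```python
def chunk_row_ranges(total_rows, chunksize):
    """Yield 1-based, inclusive (start_row, end_row) pairs, one per chunk."""
    for start_row in range(1, total_rows + 1, chunksize):
        end_row = min(start_row + chunksize - 1, total_rows)
        yield start_row, end_row

# 25,000 rows is an illustrative figure, not the size of the real dataset
ranges = list(chunk_row_ranges(total_rows=25000, chunksize=10000))
filenames = [f'labeled_crime_row{s}_to_row{e}.gzip' for s, e in ranges]
print(filenames)
```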

In [9]:
data_directory = 'data'
raw_filepath = os.path.join(os.path.relpath('.'), 'data', 'raw', 'all-the-news-2-1.csv')

chunksize = 10000  # Roughly equivalent to a <25 MB parquet file

def start():
    confirm = input("Do you want to start? [y/n] ")
    if confirm.lower() != 'y':
        return None

    # Keep the row counters local: incrementing module-level names inside
    # a function would raise UnboundLocalError
    start_row, end_row = 1, chunksize
    crime_news_frames = []
    index_records = []
    for chunk in pd.read_csv(raw_filepath, header=0,
                             chunksize=chunksize,
                             encoding='utf-8',
                             usecols=["date", "author", "title", "publication", "section", "url", "article"],
                             parse_dates=['date']
                            ):
        crime_news, best_topic_no, fname = process_chunk(chunk, start_row, end_row)

        index_records.append({'filename': fname, 'topic': best_topic_no,
                              'start_row': start_row, 'end_row': end_row})
        crime_news_frames.append(crime_news)

        print(f'\t===== Finished with first {end_row} rows ====\n')
        start_row += chunksize
        end_row += chunksize

    # DataFrame.append was removed in pandas 2.0, so build each frame once
    crime_topic_index = pd.DataFrame(index_records, columns=['filename', 'topic', 'start_row', 'end_row'])
    all_crime_news = pd.concat(crime_news_frames, ignore_index=True)

    filepath = os.path.join('data', 'processed', 'crime_topic_index.gzip')
    crime_topic_index.to_parquet(filepath, compression='gzip')
    return all_crime_news

# Un-comment below to start!
# all_crime_news = start()

Matching {topic: total keywords}-candidates are :
 {0: 1, 4: 3, 9: 3}
Best matching topic number is: 4
Topic 4 has  1032  rows
LDA model saved as models/lda_model_n14_row1_to_row10000.pkl
	===== Finished with first 10000 rows ====

Matching {topic: total keywords}-candidates are :
 {0: 1, 3: 1, 4: 7, 11: 2}
Best matching topic number is: 4
Topic 4 has  630  rows
LDA model saved as models/lda_model_n14_row10001_to_row20000.pkl
	===== Finished with first 20000 rows ====

Matching {topic: total keywords}-candidates are :
 {4: 2, 5: 6, 7: 1, 8: 1, 10: 1}
Best matching topic number is: 5
Topic 5 has  1049  rows
LDA model saved as models/lda_model_n14_row20001_to_row30000.pkl
	===== Finished with first 30000 rows ====

Matching {topic: total keywords}-candidates are :
 {3: 3, 5: 1, 7: 2}
Best matching topic number is: 3
Topic 3 has  210  rows
LDA model saved as models/lda_model_n14_row30001_to_row40000.pkl
	===== Finished with first 40000 rows ====

Matching {topic: total keywords}-candidate

#### Save labeled data in `data/processed/`-folder
- Save all news labeled as crime as one file, `all_crime_news_labeled.gzip`
- Save the topic index map as `crime_topic_index.gzip`

In [None]:
# filename = 'all_crime_news_labeled.gzip'
# filepath = os.path.join('data', 'processed', filename)
# all_crime_news.to_parquet(filepath, compression='gzip')

---

#### Manual Inspection of Topic Model Accuracy
- Load each file saved during chunk processing (above) from `data/interim`
- Skim the "title" and "article" columns to check whether they are really crime-related news

In [58]:
topic_index = pd.read_parquet(os.path.join('data', 'processed', 'crime_topic_index.gzip'), engine="pyarrow")
topic_index.head(3)

Unnamed: 0,filename,topic,start_row,end_row
0,labeled_crime_row1_to_row10000.gzip,4,1,10000
1,labeled_crime_row10001_to_row20000.gzip,4,10001,20000
2,labeled_crime_row20001_to_row30000.gzip,5,20001,30000


In [71]:
# Total number of files to inspect
topic_index.shape[0]

269

## `TO DO:` Inspect each chunk for accuracy of "crime" label
- Read "title" and "article" of labeled news in `data/interim` folder
- Mark non-relevant filenames for next step

In [74]:
################# MODIFY start and end ############## 

# Increments of 5 are easy on the eyes
start = 0
end = 5 

#####################################################

# display setting
display_row = 3
colnames = ['topic','title','article']
selected_chunk = topic_index.iloc[start:end]

for index, row in selected_chunk.iterrows():

    if row['topic'] != 99:  # 99 is a dummy topic

        # Use separate names so the start/end settings above are not overwritten
        row_start, row_end, topic = row['start_row'], row['end_row'], row['topic']
        filename = row['filename']

        # Load the chunk and keep the first few articles assigned to the crime topic
        article = pd.read_parquet(os.path.join('data', 'interim', filename),
                                  engine="pyarrow").query(f'topic=={topic}').head(display_row)[colnames]
        select_topic_index = topic_index[(topic_index['start_row']==row_start) & (topic_index['end_row']==row_end)][['filename','topic']]

        display(pd.merge(select_topic_index, article, how='inner', on='topic'))



Unnamed: 0,filename,topic,title,article
0,labeled_crime_row1_to_row10000.gzip,4,"Venezuela detains six military, police officials: family members, activists","CARACAS (Reuters) - Venezuelan authorities have arrested six members of the country’s military and police forces over the weekend, according to relatives of the detainees and human rights activist..."
1,labeled_crime_row1_to_row10000.gzip,4,"Paradise, California, wildfire: why the fire threat to California is only growing","PARADISE, CALIFORNIA — Brook Jenkins moved to the town of Paradise to escape a rough neighborhood in nearby Chico and raise her three children in an idyllic small town, filled with trees. Paradise..."
2,labeled_crime_row1_to_row10000.gzip,4,Teen prisoners rioted and lit British Columbia's ‘super jail’ on fire this week,"A six-hour riot that started with a fire and devolved into a rampage through a youth ""super jail"" in British Columbia this week has confirmed long-held fears over what would happen when the provin..."


Unnamed: 0,filename,topic,title,article
0,labeled_crime_row10001_to_row20000.gzip,4,We Asked a Law Professor Whether the Government Could Really Ban Rough Sex,"John Doe was a freshman at George Mason University when he started seeing Jane Roe, a student at a different university (both subjects have been anonymous in media accounts and court documents). T..."
1,labeled_crime_row10001_to_row20000.gzip,4,Justin Bieber Fan Arrested for Trespassing At Singer's Beverly Hills Home,A Justin Bieber fan was arrested at the singer's Beverly Hills home Monday after cops say she wandered onto the property looking for the singer ... for the third time this week. Law enforcement so...
2,labeled_crime_row10001_to_row20000.gzip,4,Baylor University paid ex-football coach $15 million after sex scandal,(Reuters) - Baylor University in Texas paid more than $15.1 million to its former head football coach Art Briles after firing him in 2016 for failing to address students’ complaints of rape and se...


Unnamed: 0,filename,topic,title,article
0,labeled_crime_row20001_to_row30000.gzip,5,The Tour That Celebrates the Lives—Not Deaths—of Jack the Ripper's Victims,"Ever since Jack the Ripper claimed his first victim 130 years ago, investigating his legacy has become both a mainstream activity and a legitimized hobby. There have been TV programs, films, video..."
1,labeled_crime_row20001_to_row30000.gzip,5,Unexploded device spotted on one of attacked oil tankers -U.S. source,"WASHINGTON, June 13 (Reuters) - An unexploded device, believed to be a limpet mine, was spotted on the side of one of two oil tankers attacked on Thursday in the Gulf of Oman, a U.S. official told..."
2,labeled_crime_row20001_to_row30000.gzip,5,Hong Kong court favors gay couple in landmark victory for LGBT+ rights,BANGKOK (Thomson Reuters Foundation) - Hong Kong’s top court on Thursday ruled in favor of a gay civil servant fighting for spousal and tax benefits for his husband in the latest legal victory ove...


Unnamed: 0,filename,topic,title,article
0,labeled_crime_row30001_to_row40000.gzip,3,Forever 21 Pulls 'Rapey' Tee After Customer Complaints,"Forever 21 has pulled a controversial t-shirt after getting a huge backlash from consumers who called the gear ""shameful"" and ""rapey."" The graphic tee featured a slogan that seemingly referred to ..."
1,labeled_crime_row30001_to_row40000.gzip,3,Giegling Co-Founder Responds to Article Alleging He Made Sexist Remarks,"Update [June 22 2017, 11.20 AM]: London's Sunfall festival announced that it has removed Giegling from its bill following the allegations of Konstantin's sexist remarks about female DJs. Konstant..."
2,labeled_crime_row30001_to_row40000.gzip,3,Study: black people simply saying they’re multiracial makes others think they’re better-looking,"Newly published research indicates that black people are perceived as more attractive if they claim to be multiracial, regardless of the way they look. Let that sink in: All the study subjects had..."


Unnamed: 0,filename,topic,title,article
0,labeled_crime_row40001_to_row50000.gzip,10,"After scandals, Pope orders his diplomats to toe the line","VATICAN CITY (Reuters) - Pope Francis on Thursday told his ambassadors around the world, some of whom have been involved in sexual and financial scandals, to live humble, exemplary lives and be cl..."
1,labeled_crime_row40001_to_row50000.gzip,10,Chyna Autopsy Shows Toxic Cocktail of Rx Pills and Booze,WWE legend Chyna had traces of multiple prescription drugs in her system -- including oxycodone and Valium -- along with alcohol when she died ... this according to the autopsy report obtained by ...
2,labeled_crime_row40001_to_row50000.gzip,10,New York City terror attack: what we know so far,"A 29-year-old man drove a rental truck into a pedestrian and bike path along the Hudson River in Lower Manhattan in New York City Tuesday, killing eight people and injuring 11 in the deadliest ter..."


##### Record the filenames of non-relevant crime news below

In [75]:
# Insert the filenames whose titles and articles don't appear to be crime-related
non_crime_filenames = []
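
Once the list is filled in, the flagged chunks can be separated from the rest of the index. A minimal sketch on plain Python records; `split_by_relevance` is a hypothetical helper, and the flagged filename is only an example, not a judgment about the real data. With the pandas index itself, the equivalent filter would be `topic_index[~topic_index['filename'].isin(non_crime_filenames)]`.

```python
def split_by_relevance(index_records, non_crime_filenames):
    """Split index records into (kept, dropped) based on the flagged filenames."""
    flagged = set(non_crime_filenames)
    kept = [r for r in index_records if r['filename'] not in flagged]
    dropped = [r for r in index_records if r['filename'] in flagged]
    return kept, dropped

# First three rows of the topic index, written out as plain records
records = [
    {'filename': 'labeled_crime_row1_to_row10000.gzip', 'topic': 4},
    {'filename': 'labeled_crime_row10001_to_row20000.gzip', 'topic': 4},
    {'filename': 'labeled_crime_row20001_to_row30000.gzip', 'topic': 5},
]
example_flags = ['labeled_crime_row20001_to_row30000.gzip']  # illustrative only
kept, dropped = split_by_relevance(records, example_flags)
print(len(kept), len(dropped))
```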

---