In [222]:
import pandas as pd
import numpy as np
import random
from collections import Counter, defaultdict
from ast import literal_eval
import re
import unicodedata
from nlp_preprocessing import *
from topic_modeling import *
from sklearn.feature_extraction.text import TfidfVectorizer, ENGLISH_STOP_WORDS, CountVectorizer
import spacy

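# The 'en' shortcut only works on spaCy 2.x; spaCy 3+ needs a full model name such as 'en_core_web_sm'.
sp_nlp = spacy.load('en')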

pd.set_option("display.max_rows", 50)
pd.set_option("display.max_columns", None)
# pd.set_option('display.max_colwidth', None)

pd.reset_option('display.max_colwidth')

%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


# Data Import

In [141]:
df = pd.read_csv("../Data/data_NLP_round1.csv")

df.head()

Unnamed: 0,number,global_bias,title,date,summary,link,news_title,news_source,news_link,bias,paras,authors,publish_date,text
0,5,From the Left,Trump Administration Drops Citizenship Questio...,"July 3rd, 2019",['The Trump Administration dropped plans to ad...,https://www.allsides.com/story/trump-administr...,Trump Responds After His Administration Drops ...,HuffPost,https://www.huffpost.com/entry/trump-citizensh...,Left,President Donald Trump spoke out Tuesday on hi...,"['Antonia Blumberg', 'Huffpost Us', 'Reporter']",2019-07-03 08:13:05+05:30,“A very sad time for America when the Supreme ...
1,5,From the Right,Trump Administration Drops Citizenship Questio...,"July 3rd, 2019",['The Trump Administration dropped plans to ad...,https://www.allsides.com/story/trump-administr...,Trump administration drops push for citizenshi...,Washington Times,https://www.washingtontimes.com/news/2019/jul/...,Lean Right,President Trump’s quest to add a citizenship q...,"['The Washington Times Http', 'Stephen Dinan']",2019-07-02 00:00:00,President Trump‘s quest to add a citizenship q...
2,15,From the Left,Iran to Surpass Uranium Enrichment Breaching N...,"July 7th, 2019","['On Sunday, Iranian officials said the countr...",https://www.allsides.com/story/iran-surpass-ur...,Iran Announces New Breach of Nuclear Deal Limi...,New York Times (News),https://www.nytimes.com/2019/07/07/world/middl...,Lean Left,Iran said on Sunday that within hours it would...,"['David D. Kirkpatrick', 'David E. Sanger']",2019-07-07 00:00:00,Iran said on Sunday that within hours it would...
3,15,From the Right,Iran to Surpass Uranium Enrichment Breaching N...,"July 7th, 2019","['On Sunday, Iranian officials said the countr...",https://www.allsides.com/story/iran-surpass-ur...,Iran raises uranium enrichment as nuclear deal...,Washington Times,https://www.washingtontimes.com/news/2019/jul/...,Lean Right,Iran announced Sunday it will raise its level ...,"['The Washington Times Http', 'Jon Gambrell', ...",2019-07-07 00:00:00,"TEHRAN, Iran — Iran announced Sunday it will r..."
4,25,From the Left,Social Media Summit Draws Wide Range of Coverage,"July 12th, 2019","[""The 'Social Media Summit' hosted by Presiden...",https://www.allsides.com/story/social-media-su...,Trump accuses social media companies of ‘terri...,Washington Post,https://www.washingtonpost.com/technology/2019...,Lean Left,"President Trump assailed Facebook, Google and ...","['Tony Romm', 'Senior Tech Policy Reporter']",2019-07-11 00:00:00,“Some of you are extraordinary. The crap you t...


Since each news article can contain slightly different Unicode formatting, it's best to convert everything to ASCII to make the data easier to work with. Characters with an ASCII equivalent (such as accented letters) are converted; anything else (such as curly quotes) is dropped. Since we are working with English text, the hope is that most of the data is retained.
**We can come back to this later to check how much data is being dropped.**

In [143]:
# Converting everything to ASCII and removing any odd formatting.
df['text_ascii'] = df.text.map(lambda x: unicodedata.normalize('NFKD', x).encode('ascii', 'ignore').decode('ascii'))
df[['text','text_ascii']].sample()

Unnamed: 0,text,text_ascii
277,Syria’s prime minister escaped a daring assass...,Syrias prime minister escaped a daring assassi...
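
One way to check how much data the ASCII conversion drops is to classify every character as kept, converted, or dropped. The sketch below is only an illustration: `to_ascii` mirrors the lambda in the cell above, while `ascii_conversion_stats` is a hypothetical helper that is not part of the notebook or its modules.

```python
import unicodedata

def to_ascii(text):
    """Same conversion as the cell above: decompose, then drop non-ASCII."""
    return unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('ascii')

def ascii_conversion_stats(text):
    """Count characters that are kept, converted to an ASCII equivalent, or dropped."""
    stats = {'kept': 0, 'converted': 0, 'dropped': 0}
    for ch in text:
        if ord(ch) < 128:
            stats['kept'] += 1        # already ASCII
        elif to_ascii(ch):
            stats['converted'] += 1   # e.g. an accented letter reduced to its base letter
        else:
            stats['dropped'] += 1     # e.g. a curly apostrophe with no ASCII decomposition
    return stats

sample = 'Syria\u2019s caf\u00e9'   # curly apostrophe and an accented e
print(to_ascii(sample))
print(ascii_conversion_stats(sample))
```

Applied to the real data, something like `df.text.map(ascii_conversion_stats).apply(pd.Series).sum()` should give corpus-wide totals, assuming the `text` column contains no missing values.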


# Pre-processing to work on

1. Better cleaning process: what should happen before lemmatization, what after, and what else is needed?
1. Compound term extraction, including punctuation-separated and space-separated terms
1. Named entity extraction and linkage (e.g. hong_kong vs hong kong)
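
As a starting point for the compound-term item, here is a rough sketch that treats frequently repeated pairs of capitalised words as compound candidates and joins them with an underscore. Both function names and the `min_count` threshold are assumptions made for illustration; a real solution would more likely build on spaCy's entity recognition.

```python
from collections import Counter

def find_compound_candidates(token_lists, min_count=2):
    """Return adjacent capitalised word pairs seen at least min_count times."""
    counts = Counter()
    for tokens in token_lists:
        for first, second in zip(tokens, tokens[1:]):
            if first[:1].isupper() and second[:1].isupper():
                counts[(first, second)] += 1
    return {pair for pair, n in counts.items() if n >= min_count}

def link_compounds(tokens, compounds):
    """Join known compound pairs with an underscore, e.g. Hong Kong -> hong_kong."""
    linked, i = [], 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in compounds:
            linked.append('_'.join(pair).lower())
            i += 2
        else:
            linked.append(tokens[i])
            i += 1
    return linked

docs = [
    'protest in Hong Kong continue'.split(),
    'Hong Kong police respond'.split(),
]
compounds = find_compound_candidates(docs)
print([link_compounds(d, compounds) for d in docs])
```

This only handles two-word terms and will also pick up pairs where the first word is capitalised merely because it starts a sentence, so the candidates would need manual review.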

# Breaking Into Paras

Let's break each news article out into paragraphs and expand them into a new dataframe.  
These paragraphs will be treated as individual documents to vectorize and topic model. After that, for a given overall news headline, the paragraphs from the left- and right-biased articles will be compared so that they can be paired up.

In [37]:
df_expanded = df[['number','global_bias','title','news_source','text_ascii']].copy(deep=True)

# Splitting each article's text into a list of paragraphs
df_expanded['text_paras_list'] = df_expanded.text_ascii.str.split('\n\n')

# Exploding the paragraphs into a dataframe, where each row has a paragraph
df_expanded_col = pd.DataFrame(df_expanded.text_paras_list.explode())
df_expanded_col.rename(columns={'text_paras_list':'text_paras'}, inplace=True)

# Joining the exploded dataframe back, so that other metadata can be associated with it
df_expanded = df_expanded.join(df_expanded_col).reset_index()
df_expanded.rename(columns={'index':'article'}, inplace=True)
df_expanded.drop(columns='text_paras_list', inplace=True)

# Numbering the paragraphs within each article (0-based)
df_expanded['para_count'] = df_expanded.groupby('article').cumcount()
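
With one paragraph per row, the pairing step described at the start of this section could eventually look something like the sketch below. It is an assumption-heavy illustration: raw word counts stand in for whatever vectors the later steps produce, and `cosine_similarity` and `pair_paragraphs` are hypothetical helpers, not functions from `topic_modeling`.

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def pair_paragraphs(left_paras, right_paras):
    """For each left paragraph, return (left_idx, best_right_idx, score).

    Assumes right_paras is not empty.
    """
    right_vecs = [Counter(p.lower().split()) for p in right_paras]
    pairs = []
    for i, para in enumerate(left_paras):
        vec = Counter(para.lower().split())
        scores = [cosine_similarity(vec, rv) for rv in right_vecs]
        best = max(range(len(scores)), key=scores.__getitem__)
        pairs.append((i, best, scores[best]))
    return pairs

left = ['iran uranium enrichment limit', 'census citizenship question dropped']
right = ['citizenship question census', 'iran raises uranium enrichment']
print(pair_paragraphs(left, right))
```

In the notebook, `left` and `right` would be the `text_paras` values for the two `global_bias` groups that share the same `number`.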

# Pre-processing

## Lemmatization

Lemmatizing first preserves as much of each word's meaning as possible, while separating out punctuation as needed. It also preserves entity names.  
**Compound words still need to be linked somehow.**

In [209]:
%%time

df_expanded['text_paras_lemma'] = df_expanded.text_paras.map(spacy_lemmatization)
df_expanded[['text_paras', 'text_paras_lemma']].sample(2)

Wall time: 12min 47s


Unnamed: 0,text_paras,temp
23653,I am ready and willing to support strong candi...,be ready and willing to support strong candid...
26308,"On Wednesday, Neal formally requested that the...","on Wednesday , Neal formally request that the ..."
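
The body of `spacy_lemmatization` lives in `nlp_preprocessing` and is not shown here. Judging only from the sample output, where "I am ready" becomes "be ready", it appears to join token lemmas and drop pronouns. The sketch below is a guess at that behaviour: `lemmatize_text` and `fake_nlp` are hypothetical names, and the stand-in pipeline exists only so that the example runs without downloading a spaCy model.

```python
from types import SimpleNamespace

def lemmatize_text(text, nlp):
    """Join the lemma of every token, skipping spaCy 2.x's '-PRON-' placeholder."""
    return ' '.join(tok.lemma_ for tok in nlp(text) if tok.lemma_ != '-PRON-')

def fake_nlp(text):
    """Stand-in for a loaded spaCy pipeline: returns objects with a lemma_ attribute."""
    lemmas = {'I': '-PRON-', 'am': 'be', 'candidates': 'candidate'}
    return [SimpleNamespace(text=w, lemma_=lemmas.get(w, w)) for w in text.split()]

print(lemmatize_text('I am ready to support strong candidates', fake_nlp))
```

With a real pipeline the call would be `lemmatize_text(text, sp_nlp)`. Note that spaCy 3 no longer emits `-PRON-`, so the filter only matters on spaCy 2.x.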


In [237]:
pd.set_option('display.max_colwidth', None)
print(df_expanded.sample()[['text_paras','text_paras_lemma']])
pd.reset_option('display.max_colwidth')

      text_paras  \
7124  Google didn't immediately respond to a request for comment, but the company has said its competitive edge comes from offering a product that billions of people choose to use each day. Alphabet's shares opened Tuesday up roughly 1%, ahead of the broader market, after The Wall Street Journal first reported news of the impending suit.

## Misc Cleaning

Miscellaneous cleaning of the documents. Currently this just removes email addresses, website links, and any non-alphanumeric characters.

In [238]:
df_expanded['text_paras_misc_clean'] = df_expanded.text_paras_lemma.map(cleaning)
df_expanded[['text_paras_lemma','text_paras_misc_clean']].sample(2)

Unnamed: 0,text_paras_lemma,text_paras_misc_clean
25126,President Donald Trump say late Wednesday that...,President Donald Trump say late Wednesday that...
25920,the emergency declaration unsettle many lawmak...,the emergency declaration unsettle many lawmak...
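
The `cleaning` function itself is defined in `nlp_preprocessing`. As a rough idea of what the three steps described above might involve, here is a minimal regex-based sketch. The patterns and the name `clean_text` are assumptions, not the notebook's actual implementation.

```python
import re

EMAIL_RE = re.compile(r'\S+@\S+')                  # anything that looks like an email address
URL_RE = re.compile(r'(?:https?://|www\.)\S+')     # http(s) links and bare www. links
NON_ALNUM_RE = re.compile(r'[^A-Za-z0-9\s]')       # everything except letters, digits, whitespace

def clean_text(text):
    """Replace emails, links, and non-alphanumeric characters with spaces."""
    text = EMAIL_RE.sub(' ', text)
    text = URL_RE.sub(' ', text)
    return NON_ALNUM_RE.sub(' ', text)

print(clean_text('email tips@example.com or visit https://example.com now!'))
print(clean_text('the "H-bomb" be "homemade"'))
```

Emails and links are removed before the punctuation pass so that their pieces do not survive as stray tokens. Replacing with a space rather than an empty string keeps neighbouring words from being glued together, at the cost of extra whitespace.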


In [249]:
pd.set_option('display.max_colwidth', None)
print(df_expanded.loc[18300,['text_paras','text_paras_misc_clean']])
pd.reset_option('display.max_colwidth')

text_paras               All the components of the "H-bomb" were "homemade," so North Korea could produce "powerful nuclear weapons as many as it wants," the KCNA quoted Kim as saying.
text_paras_misc_clean              all the component of the  h  bomb  be  homemade   so North Korea could produce  powerful nuclear weapon as many as  want   the KCNA quote Kim as say 
Name: 18300, dtype: object


In [247]:
pd.set_option('display.max_colwidth', None)
print(df_expanded.sample()[['text_paras','text_paras_misc_clean']])
pd.reset_option('display.max_colwidth')

      text_paras  \
1916   Ms. Noel and Mr. Thomas were charged with conspiracy to defraud the United States and with making false records. They both surrendered to the F.B.I. on Tuesday morning and pleaded not guilty at a hearing in Federal District Court in Manhattan in the afternoon. Bail was set at $100,000 each.

      text_paras_misc_clean
1916    Ms Noel and Mr Thomas be charge with conspiracy to defraud the United States and with make false record

In [None]:
%%time

# Extra stop words: honorifics, filler verbs, and newsletter/ad boilerplate.
custom_stop_words = ['ad', 'advertisement', '000', 'mr', 'ms', 'said', 'going', 'dont', 'think', 'know', 'want', 'like', 'im', 'thats', 'told',
                     'lot', 'hes', 'really', 'say', 'added', 'come', 'great', 'newsletter', 'daily', 'sign', 'app',
                     'click', 'inbox', 'latest', 'jr', 'everybody', '`']

df_expanded['text_paras_stopwords'] = df_expanded.text_paras_misc_clean.map(lambda x: remove_stopwords(x, custom_words=custom_stop_words))

# df_expanded['text_paras_stopwords'] = df_expanded.text_paras_stopwords.map(lambda x: remove_stopwords(x, remove_words_list = [], \
#                                                                                                      custom_words = custom_stop_words))
df_expanded[['text_paras_lemma','text_paras_stopwords']].sample(2)



In [172]:
# spacy_text = sp_nlp(df_expanded.loc[18300,'text_paras_stopwords'])
# [[token.text, token.ent_type_] for token in spacy_text]

[['component', ''],
 ['`', ''],
 ['`', 'WORK_OF_ART'],
 ['h', 'WORK_OF_ART'],
 ['bomb', 'WORK_OF_ART'],
 ['`', ''],
 ['`', 'WORK_OF_ART'],
 ['`', 'WORK_OF_ART'],
 ['`', 'WORK_OF_ART'],
 ['homemade', 'WORK_OF_ART'],
 ['`', ''],
 ['`', ''],
 ['north', 'GPE'],
 ['korea', 'GPE'],
 ['could', ''],
 ['produce', ''],
 ['`', ''],
 ['`', ''],
 ['powerful', ''],
 ['nuclear', ''],
 ['weapon', ''],
 ['many', ''],
 ['want', ''],
 ['`', ''],
 ['`', ''],
 ['kcna', ''],
 ['quote', ''],
 ['kim', 'PERSON'],
 ['say', '']]

In [176]:
# df_nlp_round1['text_final'] = df_nlp_round1['text_stopwords']

df_expanded['text_final'] = df_expanded['text_paras_stopwords']

In [187]:
df_expanded['text_paras_stopwords'].str.contains('mr').sum()

4738
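
Note that `str.contains('mr')` matches the substring anywhere in the text, so the count above may include paragraphs where "mr" only appears inside a longer word. The toy example below, with made-up paragraphs, shows the difference between a substring check and a whole-word check.

```python
import re

# Made-up paragraphs: only the first one contains 'mr' as a standalone word.
paras = ['mr trump meet the senator', 'a former comrade speak', 'shamrock']

substring_hits = sum('mr' in p for p in paras)
word_hits = sum(bool(re.search(r'\bmr\b', p)) for p in paras)

print(substring_hits, word_hits)
```

The pandas equivalent of the whole-word check would be `df_expanded['text_paras_stopwords'].str.contains(r'\bmr\b').sum()`.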

In [174]:
%%time

params = {'stop_words': 'english', 'min_df': 10, 'max_df': 0.5, 'ngram_range': (1, 1)}

tfidf = TfidfVectorizer(**params)
review_word_matrix_tfidf = tfidf.fit_transform(df_expanded['text_final'])
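# get_feature_names() was removed in scikit-learn 1.2; on newer versions use get_feature_names_out().
review_vocab_tfidf = tfidf.get_feature_names()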

lda_tfidf, score_tfidf, topic_matrix_tfidf, word_matrix_tfidf = lda_topic_modeling(review_word_matrix_tfidf, vocab = review_vocab_tfidf, n = 20)

iteration: 1 of max_iter: 100
iteration: 2 of max_iter: 100
...
iteration: 32 of max_iter: 100
iteration: 33 of 

### Exploring The Topic Models

Let's take a look at the topic model to see what we've got.

In [175]:
top_words_for_all_topics(word_matrix_tfidf, 20, 20)

Topic 0
king, register, professor, nbc, water, barrett, university, harvard, clip, grassley, trudeau, todd, picture, lake, patrick, spillway, plenty, feeling, writer, positive, 

Topic 1
trump, investigation, mr, committee, say, house, president, fbi, mueller, comey, attorney, counsel, flynn, impeachment, general, special, russia, report, probe, official, 

Topic 2
percent, say, tax, health, year, job, care, pay, plan, billion, million, government, cut, rate, budget, worker, 000, insurance, economy, federal, 

Topic 3
newsletter, daily, manage, sign, uranium, cumming, brunson, turkish, dowd, audience, opinion, conway, participation, hammer, yang, erdogan, liar, stuff, rosatom, quarter, 

Topic 4
llc, copyright, 2020, times, click, permission, reprint, washington, buzz, nooyi, charles, word, post, 2006, warfare, pepsico, typically, drink, examiner, ag, 

Topic 5
police, say, officer, protester, black, protest, city, people, video, mr, man, shooting, trump, walker, arrest, church, violen
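
For reference, listing the top words per topic comes down to sorting each topic's word weights and keeping the highest-ranked vocabulary entries. The sketch below shows the idea on plain lists. `top_words_per_topic` is a hypothetical stand-in and may differ from how `top_words_for_all_topics` in `topic_modeling` works.

```python
def top_words_per_topic(topic_word_weights, vocab, n_words=5):
    """Return the n_words highest-weighted vocabulary entries for each topic."""
    top = []
    for weights in topic_word_weights:
        # Indices of the vocabulary, ordered from highest to lowest weight.
        ranked = sorted(range(len(vocab)), key=lambda i: weights[i], reverse=True)
        top.append([vocab[i] for i in ranked[:n_words]])
    return top

vocab = ['police', 'tax', 'trump', 'health']
weights = [
    [0.9, 0.1, 0.5, 0.0],   # toy topic 0
    [0.0, 0.8, 0.1, 0.7],   # toy topic 1
]
print(top_words_per_topic(weights, vocab, n_words=2))
```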

Looking at the top words for each topic, there are a number of filler words that we could remove to make the topics more meaningful. Additionally, all numbers except years can be removed. Lastly, we need a way to detect compound words, especially place names such as Hong Kong and North America.
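
Removing numbers while keeping years could be done with a small token filter like the one below. It is a sketch under a stated assumption: a "year" is taken to be any four-digit token from 1900 to 2099, which may need adjusting. `drop_non_year_numbers` is a hypothetical helper.

```python
import re

# Assumption: years of interest fall between 1900 and 2099.
YEAR_RE = re.compile(r'^(?:19|20)\d{2}$')

def drop_non_year_numbers(text):
    """Remove purely numeric tokens unless they look like a year."""
    kept = []
    for token in text.split():
        if token.isdigit() and not YEAR_RE.match(token):
            continue
        kept.append(token)
    return ' '.join(kept)

print(drop_non_year_numbers('bail set at 100 000 in 2019'))
```

This would remove tokens such as `000` without listing them as custom stop words, while years like `2020` remain available to the topic model.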

In [130]:
custom_stop_words = ['000', 'mr', 'said', 'going', 'dont', 'think', 'know', 'want', 'like', 'im', 'thats', 'told',
                     'lot', 'hes', 'really', 'say', 'added', 'come', 'great', 'newsletter', 'daily', 'sign', 'app',
                     'click', 'inbox', 'latest', 'jr', 'everybody']