# Homework

There are two topics to explore: TF-IDF and stopwords. Be sure to add analysis markdown cells to record any insights you gained, any questions that came to mind along the way, and any discussion points you want to raise next time we meet.
    
Add new cells to do the TFIDF work just below where `df.head(10)` is printed out, above the `Stopwords` section.

_Important note: if you find something interesting in the data that you want to explore, but it isn't part of the homework, immediately stop what you are doing and EXPLORE IT! Being curious and digging into interesting patterns is more important than completing homework tasks. Just be sure to add markdown cells to record your analysis, findings, and any questions that came up along the way._

## TF-IDF: Term Frequency Inverse Document Frequency

Here is a good page that describes TFIDF and how to calculate it: http://www.tfidf.com/

Calculate the TFIDF scores for the following words: `smell, the, this, washington, money, road, and`

How are their scores different from each other? What do you think this means?
- How do you interpret words with low scores? What about high scores?

Which words in the articles have the highest and lowest TFIDF scores?

More TFIDF resources:
- https://nlp.stanford.edu/IR-book/html/htmledition/tf-idf-weighting-1.html#7990
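
Before working on the full data set, it may help to see the calculation on a tiny example. The sketch below uses a made-up three-document corpus (not taken from `news.csv`) and hypothetical helper names (`term_frequency`, `inverse_document_frequency`, `tf_idf`). It assumes the plain definitions: term count divided by document length, and the natural log of total documents over documents containing the term.

```python
import math
from collections import Counter

# Toy corpus (hypothetical, not taken from news.csv).
docs = [
    "the road to washington is the long road",
    "the money and the power",
    "this is the smell of money",
]
tokenized = [doc.split() for doc in docs]


def term_frequency(term, tokens):
    """Share of the document's tokens that are `term`."""
    return Counter(tokens)[term] / len(tokens)


def inverse_document_frequency(term, corpus):
    """Natural log of (number of documents / documents containing `term`).

    Assumes the term occurs in at least one document.
    """
    containing = sum(1 for tokens in corpus if term in tokens)
    return math.log(len(corpus) / containing)


def tf_idf(term, tokens, corpus):
    return term_frequency(term, tokens) * inverse_document_frequency(term, corpus)


for term in ["the", "money", "road"]:
    scores = [round(tf_idf(term, tokens, tokenized), 4) for tokens in tokenized]
    print(term, scores)
```

In this toy corpus `the` appears in every document, so its IDF is `log(1) = 0` and its TF-IDF is zero everywhere, while `road` appears in only one document and scores highest there.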

In [1]:
import itertools

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from matplotlib import colors
from matplotlib.ticker import PercentFormatter

## `news.csv` Data Set

4 columns: 
- article id
- article title
- article text
- label

In [2]:
# Read the data
df = pd.read_csv('data/news.csv')

# Get shape and head
shape = df.shape
print(f"shape of the dataset: {shape} \n")
df.columns = ['id', 'title', 'text', 'label']
df.head(10)

shape of the dataset: (6335, 4) 



Unnamed: 0,id,title,text,label
0,8476,You Can Smell Hillary’s Fear,"Daniel Greenfield, a Shillman Journalism Fello...",FAKE
1,10294,Watch The Exact Moment Paul Ryan Committed Pol...,Google Pinterest Digg Linkedin Reddit Stumbleu...,FAKE
2,3608,Kerry to go to Paris in gesture of sympathy,U.S. Secretary of State John F. Kerry said Mon...,REAL
3,10142,Bernie supporters on Twitter erupt in anger ag...,"— Kaydee King (@KaydeeKing) November 9, 2016 T...",FAKE
4,875,The Battle of New York: Why This Primary Matters,It's primary day in New York and front-runners...,REAL
5,6903,"Tehran, USA","\nI’m not an immigrant, but my grandparents ...",FAKE
6,7341,Girl Horrified At What She Watches Boyfriend D...,"Share This Baylee Luciani (left), Screenshot o...",FAKE
7,95,‘Britain’s Schindler’ Dies at 106,A Czech stockbroker who saved more than 650 Je...,REAL
8,4869,Fact check: Trump and Clinton at the 'commande...,Hillary Clinton and Donald Trump made some ina...,REAL
9,2909,Iran reportedly makes new push for uranium con...,Iranian negotiators reportedly have made a las...,REAL


# TF (Term Frequency)

In [13]:
# Ref. https://stackoverflow.com/questions/49941646/python-counter-function-to-count-words-in-documents-with-more-then-one-occurre
import re
from collections import Counter

target_words = ['smell', 'the', 'this', 'washington', 'money', 'road', 'and']

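# Strip punctuation, lower-case, and split on single spaces to get a word list.
# regex=True is needed on pandas >= 2.0, where str.replace defaults to literal matching.
df['words'] = df['text'].str.replace(r'[(,“,”,),%,$,+,.,\,,@,—,‘,’,!,;]', '', regex=True).str.lower().str.strip().str.split(' ')
# NOTE: str.len() on the raw text counts characters, not words, so despite its name
# this column holds the article length in characters (the outputs below reflect that).
# df['words'].str.len() would give the number of tokens instead.
df['words_count'] = df['text'].str.len()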


df['frequencies'] = df['words'].apply(
    lambda words: 
    Counter({k: v for k,v in Counter(words).items() if k in target_words}))
df.head(10)

Unnamed: 0,id,title,text,label,words,words_count,frequencies
0,8476,You Can Smell Hillary’s Fear,"Daniel Greenfield, a Shillman Journalism Fello...",FAKE,"[daniel, greenfield, a, shillman, journalism, ...",7518,"{'the': 82, 'this': 6, 'and': 29, 'smell': 1}"
1,10294,Watch The Exact Moment Paul Ryan Committed Pol...,Google Pinterest Digg Linkedin Reddit Stumbleu...,FAKE,"[google, pinterest, digg, linkedin, reddit, st...",2646,"{'this': 3, 'and': 10, 'the': 14}"
2,3608,Kerry to go to Paris in gesture of sympathy,U.S. Secretary of State John F. Kerry said Mon...,REAL,"[us, secretary, of, state, john, f, kerry, sai...",2543,"{'this': 2, 'the': 22, 'and': 7}"
3,10142,Bernie supporters on Twitter erupt in anger ag...,"— Kaydee King (@KaydeeKing) November 9, 2016 T...",FAKE,"[kaydee, king, kaydeeking, november, 9, 2016, ...",2660,"{'the': 25, 'this': 3, 'and': 4}"
4,875,The Battle of New York: Why This Primary Matters,It's primary day in New York and front-runners...,REAL,"[it's, primary, day, in, new, york, and, front...",1840,"{'and': 12, 'the': 19, 'this': 2}"
5,6903,"Tehran, USA","\nI’m not an immigrant, but my grandparents ...",FAKE,"[im, not, an, immigrant, but, my, grandparents...",13333,"{'the': 113, 'and': 83, 'this': 10}"
6,7341,Girl Horrified At What She Watches Boyfriend D...,"Share This Baylee Luciani (left), Screenshot o...",FAKE,"[share, this, baylee, luciani, left, screensho...",3171,"{'this': 4, 'the': 34, 'and': 12}"
7,95,‘Britain’s Schindler’ Dies at 106,A Czech stockbroker who saved more than 650 Je...,REAL,"[a, czech, stockbroker, who, saved, more, than...",783,"{'the': 9, 'and': 1}"
8,4869,Fact check: Trump and Clinton at the 'commande...,Hillary Clinton and Donald Trump made some ina...,REAL,"[hillary, clinton, and, donald, trump, made, s...",13863,"{'and': 40, 'the': 141, 'money': 1, 'washingto..."
9,2909,Iran reportedly makes new push for uranium con...,Iranian negotiators reportedly have made a las...,REAL,"[iranian, negotiators, reportedly, have, made,...",4296,"{'the': 42, 'and': 11, 'this': 1}"
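
To see what the cleaning step above does to a single string, here is a small sketch using the standard `re` module with the same character class. The sample sentence is made up, and `clean` is a hypothetical helper name. It also compares `split(' ')` with `split()`, since the two behave differently when the text contains consecutive spaces.

```python
import re

# Same character class as the cell above.
PUNCTUATION = r'[(,“,”,),%,$,+,.,\,,@,—,‘,’,!,;]'

# Made-up sample text with a double space after "State".
sample = "U.S. Secretary of State  John F. Kerry said: “Hello, world!”"


def clean(text):
    """Remove the listed punctuation, lower-case, and trim the ends."""
    return re.sub(PUNCTUATION, '', text).lower().strip()


by_space = clean(sample).split(' ')    # splits on every single space
by_whitespace = clean(sample).split()  # splits on runs of whitespace

print(by_space)
print(by_whitespace)
```

Two things are worth checking in the printed lists: splitting on `' '` leaves an empty-string token where the double space was, and the colon in `said:` survives because `:` is not in the character class. Both would affect word counts.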


In [14]:
df_transpose = pd.DataFrame.from_dict(df['frequencies'].to_dict())\
                    .transpose().fillna(0).astype(int)
df_transpose['id'] = df['id']
df_transpose['label'] = df['label']
df_transpose['words_count'] = df['words_count']
df_transpose.head(5)

Unnamed: 0,the,this,and,smell,money,washington,road,id,label,words_count
0,82,6,29,1,0,0,0,8476,FAKE,7518
1,14,3,10,0,0,0,0,10294,FAKE,2646
2,22,2,7,0,0,0,0,3608,REAL,2543
3,25,3,4,0,0,0,0,10142,FAKE,2660
4,19,2,12,0,0,0,0,875,REAL,1840


In [15]:
# Calculate TF: term frequency from https://en.wikipedia.org/wiki/Tf%E2%80%93idf
tf = df_transpose.copy()
tf[target_words] = tf[target_words].div(tf['words_count'], axis=0)
tf.head(10)
# tf.to_csv('data/tf.csv')

Unnamed: 0,the,this,and,smell,money,washington,road,id,label,words_count
0,0.010907,0.000798,0.003857,0.000133,0.0,0.0,0.0,8476,FAKE,7518
1,0.005291,0.001134,0.003779,0.0,0.0,0.0,0.0,10294,FAKE,2646
2,0.008651,0.000786,0.002753,0.0,0.0,0.0,0.0,3608,REAL,2543
3,0.009398,0.001128,0.001504,0.0,0.0,0.0,0.0,10142,FAKE,2660
4,0.010326,0.001087,0.006522,0.0,0.0,0.0,0.0,875,REAL,1840
5,0.008475,0.00075,0.006225,0.0,0.0,0.0,0.0,6903,FAKE,13333
6,0.010722,0.001261,0.003784,0.0,0.0,0.0,0.0,7341,FAKE,3171
7,0.011494,0.0,0.001277,0.0,0.0,0.0,0.0,95,REAL,783
8,0.010171,0.0,0.002885,0.0,7.2e-05,7.2e-05,0.0,4869,REAL,13863
9,0.009777,0.000233,0.002561,0.0,0.0,0.0,0.0,2909,REAL,4296
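
The `words_count` column used as the denominator above comes from `df['text'].str.len()`, which measures the length of each article in characters. The sketch below, on a made-up sentence, shows how much the term frequency changes depending on whether the denominator is characters or tokens. In the notebook, `df['words'].str.len()` should give the token count, assuming `words` holds the list of tokens.

```python
from collections import Counter

text = "the road and the money"  # made-up example text
tokens = text.split()

the_count = Counter(tokens)["the"]

tf_by_chars = the_count / len(text)     # denominator: number of characters
tf_by_tokens = the_count / len(tokens)  # denominator: number of words

print(len(text), len(tokens))
print(tf_by_chars, tf_by_tokens)
```

Dividing by characters shrinks every TF value by roughly the average word length, so the relative ranking of words within one document is unchanged, but the absolute scores are not comparable with sources that define TF over word counts.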


# IDF (Inverse Document Frequency)

In [16]:
# From df_transpose, convert each term frequency to a boolean.
# To calculate DF, we need to check whether each word appears in the document or not.

idf_prep = df_transpose.copy()
idf_prep[target_words] = idf_prep[target_words].apply(lambda x: x > 0)
idf_prep.head(10)

Unnamed: 0,the,this,and,smell,money,washington,road,id,label,words_count
0,True,True,True,True,False,False,False,8476,FAKE,7518
1,True,True,True,False,False,False,False,10294,FAKE,2646
2,True,True,True,False,False,False,False,3608,REAL,2543
3,True,True,True,False,False,False,False,10142,FAKE,2660
4,True,True,True,False,False,False,False,875,REAL,1840
5,True,True,True,False,False,False,False,6903,FAKE,13333
6,True,True,True,False,False,False,False,7341,FAKE,3171
7,True,False,True,False,False,False,False,95,REAL,783
8,True,False,True,False,True,True,False,4869,REAL,13863
9,True,True,True,False,False,False,False,2909,REAL,4296


In [18]:
# Calculate IDF as log(N / df), the plain (unsmoothed) inverse document frequency
# from https://en.wikipedia.org/wiki/Tf%E2%80%93idf

number_of_docs = idf_prep.shape[0]  # 6335
doc_freq = idf_prep[target_words].sum()

idf = np.log(number_of_docs/doc_freq)
idf

smell         5.288109
the           0.025257
this          0.254612
washington    1.434643
money         1.853114
road          3.181691
and           0.061355
dtype: float64
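
For comparison, here is a sketch of the plain `log(N / df)` form next to one smoothed variant, `log(N / (1 + df)) + 1`. Other libraries use slightly different smoothing formulas, so treat this as one option rather than the definition. The function names are hypothetical. A document frequency of 32 reproduces the IDF reported for `smell` above; the other count is made up for illustration.

```python
import math


def plain_idf(n_docs, doc_freq):
    """log(N / df): zero when a term appears in every document."""
    return math.log(n_docs / doc_freq)


def smooth_idf(n_docs, doc_freq):
    """One smoothed variant: log(N / (1 + df)) + 1."""
    return math.log(n_docs / (1 + doc_freq)) + 1


n_docs = 6335  # number of articles, from the shape printed earlier

# 32 matches the reported IDF for 'smell'; 6000 is a hypothetical common word.
for label, doc_freq in [("rare (df=32)", 32), ("common (df=6000)", 6000)]:
    print(f"{label}: plain={plain_idf(n_docs, doc_freq):.6f} "
          f"smooth={smooth_idf(n_docs, doc_freq):.6f}")
```

The ordering of words is the same under both forms; the smoothed variant mainly shifts the values up by about 1, so very common words keep a small positive weight.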

# TF-IDF

In [31]:
tfidf = tf.copy()
tfidf[target_words] = tf[target_words] * idf

tfidf

Unnamed: 0,the,this,and,smell,money,washington,road,id,label,words_count
0,0.000274,0.000202,0.000236,0.000701,0.000000,0.000000,0.000000,8476,FAKE,7549
1,0.000133,0.000288,0.000231,0.000000,0.000000,0.000000,0.000000,10294,FAKE,2654
2,0.000217,0.000199,0.000168,0.000000,0.000000,0.000000,0.000000,3608,REAL,2559
3,0.000236,0.000286,0.000092,0.000000,0.000000,0.000000,0.000000,10142,FAKE,2673
4,0.000258,0.000274,0.000396,0.000000,0.000000,0.000000,0.000000,875,REAL,1860
...,...,...,...,...,...,...,...,...,...,...
6330,0.000202,0.000062,0.000089,0.000000,0.000000,0.000348,0.000000,4490,REAL,4124
6331,0.000276,0.000124,0.000389,0.000000,0.000000,0.000300,0.000000,8062,FAKE,14343
6332,0.000360,0.000127,0.000235,0.000000,0.000617,0.000717,0.000265,8622,FAKE,12013
6333,0.000219,0.000181,0.000183,0.000000,0.000263,0.000000,0.000000,4021,REAL,7043
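
The table above has one score per word per document, so "highest" and "lowest" depend on how the scores are summarized. The sketch below uses a few hypothetical score rows (not the real results) to compare two summaries: the maximum score a word reaches in any single document, and its mean score across documents. In pandas, `tfidf[target_words].max()` and `tfidf[target_words].mean()` should give the corresponding summaries.

```python
# Hypothetical per-document TF-IDF scores (one dict per document), for illustration only.
scores = [
    {"the": 0.00027, "smell": 0.00070, "money": 0.0},
    {"the": 0.00013, "smell": 0.0, "money": 0.0},
    {"the": 0.00036, "smell": 0.0, "money": 0.00062},
]

words = list(scores[0])

# Highest score each word reaches in any single document.
max_score = {w: max(row[w] for row in scores) for w in words}

# Mean score across documents; frequent words accumulate many small scores.
mean_score = {w: sum(row[w] for row in scores) / len(scores) for w in words}

top_by_max = max(max_score, key=max_score.get)
top_by_mean = max(mean_score, key=mean_score.get)
print(top_by_max, top_by_mean)
```

With these made-up numbers the two summaries disagree: the rare word wins on maximum score, while the common word wins on the mean because it scores a little in every document. It is worth stating which summary you used when answering the homework question.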


## JP


## Stopwords

TODO: 

Here are a couple of links about `stopwords` to read. 
- https://kavita-ganesan.com/what-are-stop-words/
- https://nlp.stanford.edu/IR-book/html/htmledition/dropping-common-terms-stop-words-1.html

The Python NLP toolkit NLTK has a set of built-in stopwords that it uses for its algorithms. Install NLTK:

`pip install nltk`

You'll also need to install some of the NLTK resources. Go to this link and follow the command-line installation instructions: https://www.nltk.org/data.html#command-line-installation

`python -m nltk.downloader stopwords`

Run the code below to print out the stopwords.

In [2]:
# to install the natural language toolkit
# $ pip install nltk

# to install the "stopwords" resource
# $ python -m nltk.downloader stopwords

from nltk.corpus import stopwords
stop = stopwords.words('english')
stop

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

### Questions: 
Given what you learned about TF-IDF, do you think stopwords will have a high TFIDF score or a low score? 

Why might it be useful to remove stopwords from the text when doing NLP machine learning? What types of words are left over after stopwords are removed?

What are some of the potential limitations of removing all the English stopwords when doing news analysis?
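
As a starting point for these questions, here is a minimal sketch of stopword removal. To keep it runnable without downloading NLTK data, it uses a small hand-picked subset of the list printed above (`STOP_SUBSET`, a hypothetical name). With NLTK installed, `set(stopwords.words('english'))` could be passed in instead. The helper `remove_stopwords` and both sample sentences are made up for illustration.

```python
# Hand-picked subset of the NLTK English stopword list printed above.
STOP_SUBSET = {
    "the", "and", "this", "is", "a", "of", "to", "in", "it",
    "who", "was", "be", "or", "for", "on", "by", "about",
}


def remove_stopwords(text, stop_words):
    """Lower-case, split on whitespace, and drop tokens found in stop_words."""
    return [tok for tok in text.lower().split() if tok not in stop_words]


headline = "The money was on the road to Washington"
print(remove_stopwords(headline, STOP_SUBSET))

band = "The Who is on tour"
print(remove_stopwords(band, STOP_SUBSET))
```

The first sentence keeps its content words, but the second loses the band name entirely because every word in "The Who" is a stopword. Names, titles, and quoted phrases built from common words are one place where blanket stopword removal can discard meaning in news text.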