# Homework

There are two topics to explore: TF-IDF and Stopwords. Be sure to add analysis markdown cells to record any insights you learned, any questions that popped into your head along the way, and any discussion points you want to talk about next time we meet.
    
Add new cells to do the TFIDF work just below where `df.head(10)` is printed out, above the `Stopwords` section.

_Important Note: if you find something interesting in the data that you want to explore, but it isn't part of the homework, immediately stop what you are doing and EXPLORE IT! Being curious and digging into interesting patterns is more important than completing homework tasks. Just be sure to add markdown cells to record your questions, analysis, and findings._

## TF-IDF: Term Frequency Inverse Document Frequency

Here is a good page that describes TFIDF and how to calculate it: http://www.tfidf.com/
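
As a minimal sketch of the calculation described on that page, the example below computes TF-IDF by hand for a made-up three-document corpus. The corpus and the helper name `tf_idf` are assumptions for illustration. It uses TF = (count of the term in the document) / (number of terms in the document) and IDF = ln(number of documents / number of documents containing the term).

```python
import math
from collections import Counter

# Hypothetical toy corpus, already tokenized.
docs = [
    ["the", "road", "to", "washington"],
    ["the", "money", "and", "the", "road"],
    ["the", "smell", "of", "money"],
]

def tf_idf(term, doc, corpus):
    """Return the TF-IDF score of `term` in `doc` relative to `corpus`."""
    tf = Counter(doc)[term] / len(doc)
    docs_with_term = sum(1 for d in corpus if term in d)
    if docs_with_term == 0:
        return 0.0  # term never occurs, so there is nothing to score
    idf = math.log(len(corpus) / docs_with_term)
    return tf * idf

for term in ["the", "road", "smell"]:
    print(term, [round(tf_idf(term, d, docs), 4) for d in docs])
```

In this toy corpus, a term that appears in every document has an IDF of ln(1) = 0, so its score is 0 no matter how often it occurs.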

Calculate the TFIDF scores for the following words: `smell, the, this, washington, money, road, and`

How are their scores different from each other? What do you think this means?
- How do you interpret words with low scores? What about high scores?

Which words in the articles have the highest and lowest TF-IDF scores?

More TFIDF resources
- https://nlp.stanford.edu/IR-book/html/htmledition/tf-idf-weighting-1.html#7990

In [2]:
import itertools

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from matplotlib import colors
from matplotlib.ticker import PercentFormatter

## `news.csv` Data Set

4 columns: 
- article id
- article title
- article text
- label

In [3]:
#Read the data
df=pd.read_csv('data/news.csv')

#Get shape and head
shape = df.shape
print(f"shape of the dataset: {shape} \n")
df.columns = ['id', 'title', 'text', 'label']
df.head(10)

shape of the dataset: (6335, 4) 



Unnamed: 0,id,title,text,label
0,8476,You Can Smell Hillary’s Fear,"Daniel Greenfield, a Shillman Journalism Fello...",FAKE
1,10294,Watch The Exact Moment Paul Ryan Committed Pol...,Google Pinterest Digg Linkedin Reddit Stumbleu...,FAKE
2,3608,Kerry to go to Paris in gesture of sympathy,U.S. Secretary of State John F. Kerry said Mon...,REAL
3,10142,Bernie supporters on Twitter erupt in anger ag...,"— Kaydee King (@KaydeeKing) November 9, 2016 T...",FAKE
4,875,The Battle of New York: Why This Primary Matters,It's primary day in New York and front-runners...,REAL
5,6903,"Tehran, USA","\r\nI’m not an immigrant, but my grandparent...",FAKE
6,7341,Girl Horrified At What She Watches Boyfriend D...,"Share This Baylee Luciani (left), Screenshot o...",FAKE
7,95,‘Britain’s Schindler’ Dies at 106,A Czech stockbroker who saved more than 650 Je...,REAL
8,4869,Fact check: Trump and Clinton at the 'commande...,Hillary Clinton and Donald Trump made some ina...,REAL
9,2909,Iran reportedly makes new push for uranium con...,Iranian negotiators reportedly have made a las...,REAL


# TF (Term Frequency)

In [4]:
# Ref. https://stackoverflow.com/questions/49941646/python-counter-function-to-count-words-in-documents-with-more-then-one-occurre
from collections import Counter

# Strip punctuation, lowercase, and split on single spaces.
# regex=True is required in pandas >= 2.0, where str.replace defaults to literal matching.
df['words'] = (
    df['text']
    .str.replace(r'[(“”)%$+.,@—‘’!;]', '', regex=True)
    .str.lower()
    .str.strip()
    .str.split(' ')
)
# Note: this is the character length of each article, not the number of words.
df['words_count'] = df['text'].str.len()

# Count how often each token occurs in each article.
df['frequencies'] = df['words'].apply(Counter)
df.head(10)

Unnamed: 0,id,title,text,label,words,words_count,frequencies
0,8476,You Can Smell Hillary’s Fear,"Daniel Greenfield, a Shillman Journalism Fello...",FAKE,"[daniel, greenfield, a, shillman, journalism, ...",7549,"{'daniel': 1, 'greenfield': 1, 'a': 38, 'shill..."
1,10294,Watch The Exact Moment Paul Ryan Committed Pol...,Google Pinterest Digg Linkedin Reddit Stumbleu...,FAKE,"[google, pinterest, digg, linkedin, reddit, st...",2654,"{'google': 1, 'pinterest': 1, 'digg': 1, 'link..."
2,3608,Kerry to go to Paris in gesture of sympathy,U.S. Secretary of State John F. Kerry said Mon...,REAL,"[us, secretary, of, state, john, f, kerry, sai...",2559,"{'us': 2, 'secretary': 2, 'of': 14, 'state': 2..."
3,10142,Bernie supporters on Twitter erupt in anger ag...,"— Kaydee King (@KaydeeKing) November 9, 2016 T...",FAKE,"[kaydee, king, kaydeeking, november, 9, 2016, ...",2673,"{'kaydee': 1, 'king': 1, 'kaydeeking': 1, 'nov..."
4,875,The Battle of New York: Why This Primary Matters,It's primary day in New York and front-runners...,REAL,"[it's, primary, day, in, new, york, and, front...",1860,"{'it's': 1, 'primary': 1, 'day': 1, 'in': 13, ..."
5,6903,"Tehran, USA","\r\nI’m not an immigrant, but my grandparent...",FAKE,"[im, not, an, immigrant, but, my, grandparents...",13363,"{'im': 2, 'not': 5, 'an': 13, 'immigrant': 1, ..."
6,7341,Girl Horrified At What She Watches Boyfriend D...,"Share This Baylee Luciani (left), Screenshot o...",FAKE,"[share, this, baylee, luciani, left, screensho...",3177,"{'share': 1, 'this': 4, 'baylee': 8, 'luciani'..."
7,95,‘Britain’s Schindler’ Dies at 106,A Czech stockbroker who saved more than 650 Je...,REAL,"[a, czech, stockbroker, who, saved, more, than...",783,"{'a': 1, 'czech': 1, 'stockbroker': 1, 'who': ..."
8,4869,Fact check: Trump and Clinton at the 'commande...,Hillary Clinton and Donald Trump made some ina...,REAL,"[hillary, clinton, and, donald, trump, made, s...",13981,"{'hillary': 1, 'clinton': 20, 'and': 40, 'dona..."
9,2909,Iran reportedly makes new push for uranium con...,Iranian negotiators reportedly have made a las...,REAL,"[iranian, negotiators, reportedly, have, made,...",4332,"{'iranian': 2, 'negotiators': 1, 'reportedly':..."


In [5]:
df_transpose = pd.DataFrame.from_dict(df['frequencies'].to_dict())\
                    .transpose().fillna(0).astype(int)
df_transpose.head(5)

Unnamed: 0,daniel,greenfield,a,shillman,journalism,fellow,at,the,freedom,center,...,applause\r\n\r\nsince,twitter\r\n\r\nthe,attack-dog,"""joyfully""\r\n\r\nother",land\r\n\r\nother,surges\r\n\r\ncullen,fergus,"govern""\r\n\r\no'connell",studious,"worked"""
0,1,1,38,1,1,1,5,82,1,1,...,0,0,0,0,0,0,0,0,0,0
1,0,0,15,0,0,0,2,14,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,7,0,0,0,3,22,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,7,0,0,0,2,25,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,5,0,0,0,0,19,0,0,...,0,0,0,0,0,0,0,0,0,0
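
Column names such as `applause\r\n\r\nsince` in the table above suggest that splitting on a single space leaves line breaks and quote characters inside tokens. One possible alternative, sketched here with a hypothetical `tokenize` helper and a made-up sample string, is to extract word-like runs with a regular expression. The pattern is an assumption, and switching to it would change the vocabulary and every downstream number.

```python
import re
from collections import Counter

# Word-like runs: letters/digits, optionally joined by internal apostrophes or hyphens.
TOKEN_RE = re.compile(r"[a-z0-9]+(?:['’-][a-z0-9]+)*")

def tokenize(text):
    """Lowercase `text` and return its word-like tokens (hypothetical helper)."""
    return TOKEN_RE.findall(text.lower())

# Made-up sample containing the kinds of debris seen in the column names.
sample = 'Applause\r\n\r\nSince the "attack-dog" worked, it\'s over.'
tokens = tokenize(sample)
print(tokens)
print(Counter(tokens).most_common(3))
```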


In [6]:
# Calculate TF: term frequency from https://en.wikipedia.org/wiki/Tf%E2%80%93idf

tf = df_transpose.copy()
tf['words_count'] = df['words_count']
tf = tf.div(tf['words_count'], axis=0).drop('words_count', axis=1)
tf.head(10)
# tf.to_csv('data/tf.csv')

Unnamed: 0,daniel,greenfield,a,shillman,journalism,fellow,at,the,freedom,center,...,applause\r\n\r\nsince,twitter\r\n\r\nthe,attack-dog,"""joyfully""\r\n\r\nother",land\r\n\r\nother,surges\r\n\r\ncullen,fergus,"govern""\r\n\r\no'connell",studious,"worked"""
0,0.000132,0.000132,0.005034,0.000132,0.000132,0.000132,0.000662,0.010862,0.000132,0.000132,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.005652,0.0,0.0,0.0,0.000754,0.005275,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.002735,0.0,0.0,0.0,0.001172,0.008597,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.002619,0.0,0.0,0.0,0.000748,0.009353,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.002688,0.0,0.0,0.0,0.0,0.010215,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.003592,0.0,0.0,0.000299,0.000898,0.008456,0.00015,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.001574,0.0,0.0,0.0,0.001574,0.010702,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.001277,0.0,0.0,0.0,0.001277,0.011494,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,0.0,0.0,0.003004,0.0,0.0,0.0,0.001287,0.010085,0.0,0.000143,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,0.0,0.0,0.004848,0.0,0.0,0.0,0.000693,0.009695,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
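
Note that `words_count` comes from `df['text'].str.len()`, which is the number of characters in each article, so the TF values shown are counts divided by character length. The usual definition divides by the number of terms in the document instead. Here is a minimal sketch of that variant on a made-up two-row frame; the data and the column names `n_chars`, `n_tokens`, and `tf_the` are illustrative assumptions.

```python
import pandas as pd

# Made-up two-article frame standing in for the news data.
toy = pd.DataFrame({"text": ["the cat saw the dog", "a dog barked"]})
toy["words"] = toy["text"].str.lower().str.split()

# .str.len() on a column of strings gives the number of characters,
# whereas on a column of lists it gives the number of tokens per row.
toy["n_chars"] = toy["text"].str.len()
toy["n_tokens"] = toy["words"].str.len()

# Term frequency of "the", normalized by token count.
toy["tf_the"] = toy["words"].apply(lambda w: w.count("the")) / toy["n_tokens"]
print(toy[["n_chars", "n_tokens", "tf_the"]])
```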


# IDF (Inverse Document Frequency)

In [7]:
# Convert the term counts in df_transpose to booleans.
# To calculate document frequency (DF), we only need to know whether each word occurs in each document.

idf_prep = df_transpose > 0
idf_prep.head(10)

Unnamed: 0,daniel,greenfield,a,shillman,journalism,fellow,at,the,freedom,center,...,applause\r\n\r\nsince,twitter\r\n\r\nthe,attack-dog,"""joyfully""\r\n\r\nother",land\r\n\r\nother,surges\r\n\r\ncullen,fergus,"govern""\r\n\r\no'connell",studious,"worked"""
0,True,True,True,True,True,True,True,True,True,True,...,False,False,False,False,False,False,False,False,False,False
1,False,False,True,False,False,False,True,True,False,False,...,False,False,False,False,False,False,False,False,False,False
2,False,False,True,False,False,False,True,True,False,False,...,False,False,False,False,False,False,False,False,False,False
3,False,False,True,False,False,False,True,True,False,False,...,False,False,False,False,False,False,False,False,False,False
4,False,False,True,False,False,False,False,True,False,False,...,False,False,False,False,False,False,False,False,False,False
5,False,False,True,False,False,True,True,True,True,False,...,False,False,False,False,False,False,False,False,False,False
6,False,False,True,False,False,False,True,True,False,False,...,False,False,False,False,False,False,False,False,False,False
7,False,False,True,False,False,False,True,True,False,False,...,False,False,False,False,False,False,False,False,False,False
8,False,False,True,False,False,False,True,True,False,True,...,False,False,False,False,False,False,False,False,False,False
9,False,False,True,False,False,False,True,True,False,False,...,False,False,False,False,False,False,False,False,False,False


In [8]:
# Calculate IDF: plain (unsmoothed) inverse document frequency, ln(N / df), from https://en.wikipedia.org/wiki/Tf%E2%80%93idf

number_of_docs = idf_prep.shape[0] #6335
doc_freq = idf_prep.sum()

idf = np.log(number_of_docs/doc_freq)
idf

daniel                      4.410040
greenfield                  6.114788
a                           0.065223
shillman                    7.144407
journalism                  3.941661
                              ...   
surges\r\n\r\ncullen        8.753845
fergus                      8.753845
govern"\r\n\r\no'connell    8.753845
studious                    8.753845
worked"                     8.753845
Length: 177500, dtype: float64
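
The values above match the plain form ln(N / df): a term found in only one of the 6335 articles gets ln(6335) ≈ 8.7538. For comparison, here is a sketch of a smoothed variant, using the formula scikit-learn documents for `smooth_idf=True`, ln((1 + N) / (1 + df)) + 1. The document frequencies below are made up for illustration and are not taken from the data set.

```python
import numpy as np
import pandas as pd

n_docs = 6335  # number of articles, as reported by df.shape
# Illustrative document frequencies (assumed, not measured).
doc_freq = pd.Series({"rare": 1, "common": 3000, "everywhere": 6335})

idf_plain = np.log(n_docs / doc_freq)
idf_smooth = np.log((1 + n_docs) / (1 + doc_freq)) + 1

print(pd.DataFrame({"plain": idf_plain, "smooth": idf_smooth}).round(4))
```

With this smoothing, a term that occurs in every document keeps a weight of 1 instead of dropping to 0.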

# TF-IDF

In [13]:
tfidf = tf * idf

tfidf = tfidf.set_index(df['id'])

In [14]:
# Compute the extremes from the score columns only, then collect them in a separate
# frame so the summary columns never mix with the word columns.
positive = tfidf[tfidf > 0]  # ignore zero scores when looking for the minimum
tfidf_summary = pd.DataFrame({
    'min_value': positive.min(axis=1),
    'min_word': positive.idxmin(axis=1),
    'max_value': tfidf.max(axis=1),
    'max_word': tfidf.idxmax(axis=1),
})
tfidf_summary

id,min_value,min_word,max_value,max_word
8476,0.000052,who,0.008972,fbi
10294,0.000083,by,0.008189,ryan
3608,0.000093,it,0.007024,kerry
10142,0.000045,that,0.006550,berniesteachers
875,0.000061,on,0.006574,delegates
...,...,...,...,...
4490,0.000028,on,0.019056,pagliano
8062,0.000033,been,0.004623,newshour
8622,0.000036,about,0.009143,oligarchy
4021,0.000067,it,0.010681,ethiopia
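
To compare the homework words, one approach is to select their columns from the score matrix and summarize them. The sketch below uses a small made-up matrix standing in for `tfidf`; the numbers are invented. `reindex` is used so that a word missing from the vocabulary shows up as NaN instead of raising a `KeyError`.

```python
import pandas as pd

# Made-up scores standing in for the real `tfidf` frame (rows = article ids).
toy_tfidf = pd.DataFrame(
    {"the": [0.0001, 0.0002, 0.0001],
     "money": [0.0, 0.0040, 0.0010],
     "smell": [0.0090, 0.0, 0.0]},
    index=pd.Index([8476, 10294, 3608], name="id"),
)

words = ["smell", "the", "money", "washington"]
selected = toy_tfidf.reindex(columns=words)  # absent words become NaN columns

# Mean over the articles where the word actually occurs, and the highest score.
summary = pd.DataFrame({
    "mean_when_present": selected[selected > 0].mean(),
    "max_score": selected.max(),
})
print(summary)
```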


## Stopwords

TODO: 

Here are a couple of links about `stopwords` to read. 
- https://kavita-ganesan.com/what-are-stop-words/
- https://nlp.stanford.edu/IR-book/html/htmledition/dropping-common-terms-stop-words-1.html

The Python NLP toolkit NLTK has a set of built-in stopwords it uses for its algorithms. Install NLTK:

`pip install nltk`

You'll also need to install some of the NLTK resources. Go to this link and follow the command-line installation instructions: https://www.nltk.org/data.html#command-line-installation

`python -m nltk.downloader stopwords`

Run the code below to print out the stopwords:

In [2]:
# to install the natural language toolkit
# $ pip install nltk

# to install the "stopwords" resource
# $ python -m nltk.downloader stopwords

from nltk.corpus import stopwords
stop = stopwords.words('english')
stop

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

### Questions: 
Given what you learned about TF-IDF, do you think stopwords will have a high TFIDF score or a low score? 

Why might it be useful to remove stopwords from the text when doing NLP machine learning? What types of words are left over after stopwords are removed?

What are some of the potential limitations of removing all the English stopwords when doing news analysis?
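
As a starting point for exploring these questions, here is a minimal sketch of stopword removal on a token list. To stay self-contained it uses a small hand-picked subset of the NLTK list printed above; in the notebook you would presumably use `set(stop)` instead. The sample sentence is made up.

```python
# Small subset of the NLTK English stopword list shown above (assumption:
# in the notebook, replace this with set(stopwords.words('english'))).
stop_subset = {"the", "and", "to", "in", "of", "a", "is", "this", "on"}

tokens = "the secretary of state is going to paris in a gesture of sympathy".split()

# Keep only tokens that are not stopwords; a set gives fast membership tests.
content_tokens = [t for t in tokens if t not in stop_subset]
print(content_tokens)
```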