# Homework

There are two topics to explore: TF-IDF and Stopwords. Be sure to add analysis markdown cells to record any insights you learned, any questions that popped into your head along the way, and any discussion points you want to talk about next time we meet.
    
Add new cells to do the TFIDF work just below where `df.head(10)` is printed out, above the `Stopwords` section.

_Important Note: if you find something interesting in the data that you want to explore, but it isn't part of the homework, immediately stop what you are doing and EXPLORE IT! Being curious and digging into interesting patterns is more important than completing homework tasks. Just be sure to add markdown cells to record your questions, analysis, and findings._

## TF-IDF: Term Frequency Inverse Document Frequency

Here is a good page that describes TFIDF and how to calculate it: http://www.tfidf.com/

Calculate the TFIDF scores for the following words: `smell, the, this, washington, money, road, and`

How are their scores different from each other? What do you think this means?
- How do you interpret words with low scores? What about high scores?

Which words in the articles have the highest and lowest TFIDF scores?

More TFIDF resources
- https://nlp.stanford.edu/IR-book/html/htmledition/tf-idf-weighting-1.html#7990
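
As a warm-up before touching `news.csv`, here is a minimal sketch of one common TF-IDF formulation (term count divided by document length, times the log of total documents over documents containing the term). The toy corpus and the helper names `tf`, `idf`, and `tf_idf` are assumptions made for illustration only, and other references use slightly different smoothing.

```python
import math
from collections import Counter

# Hypothetical toy corpus, not taken from news.csv.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the bird sang",
]
tokenized = [doc.split() for doc in docs]

def tf(term, tokens):
    # Share of the document's tokens that are this term.
    return Counter(tokens)[term] / len(tokens)

def idf(term, corpus):
    # Assumes the term appears in at least one document (otherwise df is 0).
    df = sum(1 for tokens in corpus if term in tokens)
    return math.log(len(corpus) / df)

def tf_idf(term, tokens, corpus):
    return tf(term, tokens) * idf(term, corpus)

# "the" is in every document, so its idf (and tf-idf) is zero.
score_the = tf_idf("the", tokenized[0], tokenized)
# "mat" is in only one document, so it scores higher there.
score_mat = tf_idf("mat", tokenized[0], tokenized)
print(score_the, score_mat)
```

A word that shows up everywhere carries no weight under this definition, while a word confined to one document stands out in it.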

In [1]:
import itertools
import re

from collections import Counter

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
from matplotlib import colors
from matplotlib.ticker import PercentFormatter

## `news.csv` Data Set

4 columns: 
- article id
- article title
- article text
- label

In [2]:
#Read the data
df=pd.read_csv('data/news.csv')

#Get shape and head
shape = df.shape
print(f"shape of the dataset: {shape} \n")

df.head(10)

shape of the dataset: (6335, 4) 



Unnamed: 0.1,Unnamed: 0,title,text,label
0,8476,You Can Smell Hillary’s Fear,"Daniel Greenfield, a Shillman Journalism Fello...",FAKE
1,10294,Watch The Exact Moment Paul Ryan Committed Pol...,Google Pinterest Digg Linkedin Reddit Stumbleu...,FAKE
2,3608,Kerry to go to Paris in gesture of sympathy,U.S. Secretary of State John F. Kerry said Mon...,REAL
3,10142,Bernie supporters on Twitter erupt in anger ag...,"— Kaydee King (@KaydeeKing) November 9, 2016 T...",FAKE
4,875,The Battle of New York: Why This Primary Matters,It's primary day in New York and front-runners...,REAL
5,6903,"Tehran, USA","\nI’m not an immigrant, but my grandparents ...",FAKE
6,7341,Girl Horrified At What She Watches Boyfriend D...,"Share This Baylee Luciani (left), Screenshot o...",FAKE
7,95,‘Britain’s Schindler’ Dies at 106,A Czech stockbroker who saved more than 650 Je...,REAL
8,4869,Fact check: Trump and Clinton at the 'commande...,Hillary Clinton and Donald Trump made some ina...,REAL
9,2909,Iran reportedly makes new push for uranium con...,Iranian negotiators reportedly have made a las...,REAL


In [3]:
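term_count = dict()
for doc_id, (body, title) in enumerate(zip(df['text'], df['title'])):
    all_words = body + " " + title
    word_collection = Counter()
    # raw string avoids the invalid "\W" escape-sequence warning
    for word in re.split(r"\W+", all_words):
        if word:
            word_collection[word.lower()] += 1
    # NOTE: len(all_words) is the character count of the document, not the
    # number of tokens, so tf below is normalized by characters
    word_collection["__total__"] = len(all_words)
    term_count[doc_id] = word_collection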
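
The cell above stores `len(all_words)` under `__total__`, which is the length of the string in characters. If a token-based denominator is wanted instead, something like the following sketch could be used. The helper name `count_terms` and the sample sentence are hypothetical, and switching to it would change every tf-idf value printed later in this notebook.

```python
import re
from collections import Counter

def count_terms(text):
    """Return (per-word counts, total number of tokens) for one document."""
    tokens = [w.lower() for w in re.split(r"\W+", text) if w]
    return Counter(tokens), len(tokens)

# Hypothetical sample text, loosely modelled on a title in the data set.
counts, total = count_terms("You Can Smell Hillary's Fear, you can")
print(counts["you"], total)
```

Returning the total separately also avoids storing a bookkeeping key in the same `Counter` as the real words.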

In [13]:
num_docs = len(term_count.keys())
assert num_docs == len(df)
num_docs

6335

In [29]:
all_idfs = {}
all_words = set()
for body, title in zip(df["text"], df["title"]):
    full_words = body + " " + title
    full_words = {word.lower() for word in re.split(r"\W+", full_words) if word}
    all_words.update(full_words)

# idf = log(N / (1 + df)), where df is the number of documents containing the word
for word in all_words:
    num_docs_with_term = sum(1 for doc_id in term_count if term_count[doc_id][word])
    idf = np.log(num_docs/(1+num_docs_with_term))
    all_idfs[word] = idf
all_idfs["all"]
all_idfs["s0"]

8.060697912195295

In [30]:
def tf_idf(doc_id, term, term_count=term_count, num_docs=num_docs):
    tf = term_count[doc_id][term] / term_count[doc_id]["__total__"]
    idf = all_idfs[term]
    result = tf * idf
    return result

In [31]:
doc_id = 0
for term in ("smell", "the", "this", "washington", "money", "road", "and"):
    t = tf_idf(doc_id, term)
    print(f"term {term}\t tf-idf {t}")

term smell	 tf-idf 0.001393225793371872
term the	 tf-idf 0.00024359543106162621
term this	 tf-idf 0.00017505353807473458
term washington	 tf-idf 0.0
term money	 tf-idf 0.0
term road	 tf-idf 0.0
term and	 tf-idf 0.00022259865501157171


As mentioned in the other notebook, a term gets a small tf-idf when it is common across all the documents (low idf), when it is rare in this document (low tf), or both; a term that never appears in the document, like "washington" here, scores exactly 0. A large tf-idf means the term is rare in other documents but common (or at least well-represented) in this particular document.

Let's find the min and max tf-idf across all terms in all documents:

In [33]:
# doc_id, tf-idf, term
min_tfidf = (None, 1, None)
max_tfidf = (None, 0, None)
for doc_id, body, title in zip(range(num_docs), df["text"], df["title"]):
    all_words = body + " " + title
    unique_words = {word.lower() for word in re.split(r"\W+", all_words) if word}
    scores = {word:tf_idf(doc_id, word) for word in unique_words}
    min_term = min(scores, key=scores.get)
    max_term = max(scores, key=scores.get)
    if scores[min_term] < min_tfidf[1]:
        min_tfidf = (doc_id, scores[min_term], min_term)
    if scores[max_term] > max_tfidf[1]:
        max_tfidf = (doc_id, scores[max_term], max_term)

print(f"min: {min_tfidf}")
print(f"max: {max_tfidf}")

min: (2060, 6.177516654519451e-06, 'to')
max: (6328, 0.1905175248085637, 'bum')


"to" being the minimum doesn't surprise me. It's a common word across the whole collection, so its idf is tiny, and a long document that uses it only sparingly ends up with a near-zero tf-idf. "bum" is a less common word, and used very differently in the UK vs US, so it's not a surprise that it's rarely used in the set of docs, but I'm a little surprised it's used often enough in one doc to give it a big tf-idf. Curious about how many times that word appears in that doc.

In [38]:
term_count[6328]["bum"]

2

In [39]:
sum(1 for doc_id in term_count if term_count[doc_id]["bum"])

4

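Interesting. So the word "bum" only appears in 4 documents, which gives it a high idf of log(6335/5), about 7.14. It appears twice in one of those 4, and since tf is the count divided by the document's `__total__`, two occurrences in what is presumably a short document are enough to give it a high tf-idf.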

In [34]:
print(f"to's idf: {all_idfs['to']}")
print(f"bum's idf: {all_idfs['bum']}")

to's idf: 0.038949242506745134
bum's idf: 7.144407180321139


Let's compare those to the absolute minimum and maximum idf words:

In [36]:
min_idf = min(all_idfs, key=all_idfs.get)
max_idf = max(all_idfs, key=all_idfs.get)
print(f"min: {min_idf} {all_idfs[min_idf]} max: {max_idf} {all_idfs[max_idf]}")

min: the 0.019767900195936484 max: stringing 8.060697912195295


So the absolute min and max idfs aren't that far off from the idfs of the min/max tf-idf terms. Curious how many docs the max-idf word is in.

In [40]:
sum(1 for doc_id in term_count if term_count[doc_id]["stringing"])

1

In [43]:
list(doc_id for doc_id in term_count if term_count[doc_id]["stringing"])

[5296]

In [44]:
term_count[5296]["stringing"]

1

In [45]:
tf_idf(5296, "stringing")

0.0011340317828074417

So "stringing" only appears in one document, and so has a quite high idf, but since it only appears once in that document its overall tf-idf is lower than a word that appears even twice in another document.
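
The idf values printed above can be reproduced by hand from the document counts, which also makes it possible to back out the term frequency each score implies. This is only a sanity-check sketch: the helper name `idf` is an assumption, and the constants are copied from the outputs in this notebook.

```python
import math

num_docs = 6335  # from the shape of the dataset printed earlier

def idf(num_docs_with_term, num_docs=num_docs):
    # Same smoothing as the notebook: log(N / (1 + df)).
    return math.log(num_docs / (1 + num_docs_with_term))

idf_bum = idf(4)        # "bum" appears in 4 documents
idf_stringing = idf(1)  # "stringing" appears in 1 document

# tf-idf values copied from the cell outputs above; dividing by idf
# recovers the term frequency that produced each score.
tf_bum = 0.1905175248085637 / idf_bum
tf_stringing = 0.0011340317828074417 / idf_stringing
print(idf_bum, idf_stringing, tf_bum, tf_stringing)
```

The implied tf works out to roughly 0.027 for "bum" and 0.00014 for "stringing", so the gap between the two scores comes almost entirely from term frequency rather than from idf.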

## Stopwords

TODO: 

Here are a couple of links about `stopwords` to read. 
- https://kavita-ganesan.com/what-are-stop-words/
- https://nlp.stanford.edu/IR-book/html/htmledition/dropping-common-terms-stop-words-1.html

The Python NLP toolkit NLTK has a set of built-in stopwords it uses for its algorithms. Install NLTK:

`pip install nltk`

You'll also need to install some of the NLTK resources. Go to this link and follow the cmd line install instructions: https://www.nltk.org/data.html#command-line-installation

`python -m nltk.downloader stopwords`

Run the below code to print out the stopwords

In [71]:
# to install the natural language toolkit
# $ pip install nltk

# to install the "stopwords" resource
# $ python -m nltk.downloader stopwords

from nltk.corpus import stopwords
stop = stopwords.words('english')
stop

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

### Questions: 
Given what you learned about TF-IDF, do you think stopwords will have a high TFIDF score or a low score? 

Why might it be useful to remove stopwords from the text when doing NLP machine learning? What types of words are left over after stopwords are removed?

What are some of the potential limitations of removing all the English stopwords when doing news analysis?

I'd expect stopwords to have a low tf-idf since the point of a stopword is that it's a common, glue-like word, like conjunctions. Two articles can be on immensely different subjects, but I would expect the use of glue words like "a", "an", "the", etc, will be fairly consistent between them, since those are a function of the language generally, rather than the subject. 

Given the above, including stopwords can throw off your statistics about the words in the dataset being analyzed. If you are doing statistics on the most/least common words in the dataset, the presence of words that are simultaneously common and don't convey much meaning will risk throwing off the results of the analysis. 
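
To make that concrete, here is a small sketch of how stopwords dominate a most-common-word count and what is left once they are filtered out. The sentence is made up, and `stop_subset` is a hand-picked handful of entries from the NLTK list printed above, so the block runs without downloading anything.

```python
import re
from collections import Counter

# A few entries from the NLTK English stopword list shown above.
stop_subset = {"the", "a", "an", "and", "of", "to", "in", "is"}

# Hypothetical sentence, not taken from news.csv.
text = ("The secretary of state is expected to travel to Paris "
        "and London in the morning, the aide said")

tokens = [w.lower() for w in re.split(r"\W+", text) if w]
kept = [w for w in tokens if w not in stop_subset]

top_all = Counter(tokens).most_common(1)[0][0]
print(top_all)
print(kept)
```

Before filtering, the most frequent token is a stopword; after filtering, only the content-bearing words remain.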

For instance, it could be interesting to look into which word is used most often by fake articles compared to real ones (so find minimum idf across the corpus). If you leave stop words in the dataset, that's likely to be a meaningless stop word, and won't give you much useful information to act on. 

In fact, let's do that. First, the full corpus min idf:

In [72]:
min_idf = min(all_idfs, key=all_idfs.get)
print(f"min: {min_idf} {all_idfs[min_idf]}")

min: the 0.019767900195936484


Now we need to re-compute the `all_idfs` collection for just the reals and fakes. Shouldn't need to re-compute the term_count, since that's per-document, and real vs fake is a document-by-document property.

In [75]:
fake_idfs = dict()
real_idfs = dict()
all_fake_words = set()
all_real_words = set()
num_fake_docs = 0
num_real_docs = 0
# note: tried this with a df.loc to limit it to just fake. that didn't work, so filtering instead.
for body, title, label in zip(df["text"], df["title"], df["label"]):
    full_words = body + " " + title
    full_words = {word.lower() for word in re.split(r"\W+", full_words) if word}
    if label == "REAL":
        all_real_words.update(full_words)
        num_real_docs += 1
    elif label == "FAKE":
        all_fake_words.update(full_words)
        num_fake_docs += 1

# NOTE: num_docs_with_term is counted over every document in term_count
# (real and fake together), while the numerator is the per-class doc count
for word in all_fake_words:
    num_docs_with_term = sum(1 for doc_id in term_count if term_count[doc_id][word])
    idf = np.log(num_fake_docs/(1+num_docs_with_term))
    fake_idfs[word] = idf
for word in all_real_words:
    num_docs_with_term = sum(1 for doc_id in term_count if term_count[doc_id][word])
    idf = np.log(num_real_docs/(1+num_docs_with_term))
    real_idfs[word] = idf

In [76]:
min_fake = min(fake_idfs, key=fake_idfs.get)
min_real = min(real_idfs, key=real_idfs.get)
print(f"fake min: {min_fake} {fake_idfs[min_fake]}\t real min: {min_real} {real_idfs[min_real]}")

fake min: the -0.6744848636717585	 real min: the -0.6722749180209556


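They're the same word for both. The idfs are strongly negative because the document frequency is still counted over the whole corpus while the numerator is the per-class document count, so the ratio inside the log is about one half. Now let's try re-doing this but removing the stop words.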

In [78]:
fake_idfs_no_stop = dict()
real_idfs_no_stop = dict()
all_fake_words = set()
all_real_words = set()
num_fake_docs = 0
num_real_docs = 0
set_stops = set(stop)
# note: tried this with a df.loc to limit it to just fake. that didn't work, so filtering instead.
for body, title, label in zip(df["text"], df["title"], df["label"]):
    full_words = body + " " + title
    full_words = {word.lower() for word in re.split(r"\W+", full_words) if word}
    full_words = full_words - set_stops
    if label == "REAL":
        all_real_words.update(full_words)
        num_real_docs += 1
    elif label == "FAKE":
        all_fake_words.update(full_words)
        num_fake_docs += 1

# NOTE: as before, num_docs_with_term is counted over every document in
# term_count (real and fake together), not only the documents in the class
for word in all_fake_words:
    num_docs_with_term = sum(1 for doc_id in term_count if term_count[doc_id][word])
    idf = np.log(num_fake_docs/(1+num_docs_with_term))
    fake_idfs_no_stop[word] = idf
for word in all_real_words:
    num_docs_with_term = sum(1 for doc_id in term_count if term_count[doc_id][word])
    idf = np.log(num_real_docs/(1+num_docs_with_term))
    real_idfs_no_stop[word] = idf

In [80]:
min_fake = min(fake_idfs_no_stop, key=fake_idfs_no_stop.get)
min_real = min(real_idfs_no_stop, key=real_idfs_no_stop.get)
print(f"fake min: {min_fake} {fake_idfs_no_stop[min_fake]}\t real min: {min_real} {real_idfs_no_stop[min_real]}")

fake min: one -0.22793609422421762	 real min: one -0.22572614857341483


In [81]:
max_fake = max(fake_idfs_no_stop, key=fake_idfs_no_stop.get)
max_real = max(real_idfs_no_stop, key=real_idfs_no_stop.get)
print(f"fake max: {max_fake} {fake_idfs_no_stop[max_fake]}\t real max: {max_real} {real_idfs_no_stop[max_real]}")

fake max: flogs 7.366445148327599	 real max: stringing 7.368655093978402


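So the most common word (min idf) across both real and fake is `one`. Not an enormously surprising word, and not something I would base anything on. The rarest word is different between them, but every word that appears in only one document ties for the maximum idf, so `max` just returns whichever of those it encounters first, and I wouldn't read much into the specific words.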
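
One caveat about the per-class cells above: `num_docs_with_term` is summed over every entry in `term_count`, so the document frequency mixes real and fake articles. A possible alternative, sketched here on made-up rows, counts document frequency within each label in a single pass. The function name `class_idfs` and the toy `rows` are assumptions, and running this on `news.csv` would give different numbers from the outputs recorded above.

```python
import math
import re
from collections import defaultdict

# Hypothetical (text, label) rows standing in for news.csv.
rows = [
    ("the senate passed the bill", "REAL"),
    ("the court heard one case", "REAL"),
    ("one weird trick shocks the senate", "FAKE"),
    ("you will not believe this one", "FAKE"),
]

def class_idfs(rows):
    """Compute idf per label, counting documents only within that label."""
    n_docs = defaultdict(int)
    doc_freq = defaultdict(lambda: defaultdict(int))
    for text, label in rows:
        n_docs[label] += 1
        # Use a set so each word counts once per document.
        for word in {w.lower() for w in re.split(r"\W+", text) if w}:
            doc_freq[label][word] += 1
    return {
        label: {w: math.log(n_docs[label] / (1 + df)) for w, df in freqs.items()}
        for label, freqs in doc_freq.items()
    }

idfs = class_idfs(rows)
print(idfs["REAL"]["the"], idfs["FAKE"]["senate"])
```

With the `1 + df` smoothing a word found in every document of a class still gets a slightly negative idf, but it can never fall below log(N / (N + 1)), unlike the values near -0.67 seen above.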