# Homework: turn the FakeNews text column into tfidf vectors the easy way
  
Vocab:
- corpus: set of all documents in a dataset

## `news.csv` Data Set

4 columns: 
- article id
- article title
- article text
- label

In [1]:
import itertools
import re

from collections import Counter

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
from matplotlib import colors
from matplotlib.ticker import PercentFormatter

In [2]:
#Read the data
df=pd.read_csv('data/news.csv')

#Get shape and head
shape = df.shape
print(f"shape of the dataset: {shape} \n")

df.head(10)

shape of the dataset: (6335, 4) 



Unnamed: 0.1,Unnamed: 0,title,text,label
0,8476,You Can Smell Hillary’s Fear,"Daniel Greenfield, a Shillman Journalism Fello...",FAKE
1,10294,Watch The Exact Moment Paul Ryan Committed Pol...,Google Pinterest Digg Linkedin Reddit Stumbleu...,FAKE
2,3608,Kerry to go to Paris in gesture of sympathy,U.S. Secretary of State John F. Kerry said Mon...,REAL
3,10142,Bernie supporters on Twitter erupt in anger ag...,"— Kaydee King (@KaydeeKing) November 9, 2016 T...",FAKE
4,875,The Battle of New York: Why This Primary Matters,It's primary day in New York and front-runners...,REAL
5,6903,"Tehran, USA","\nI’m not an immigrant, but my grandparents ...",FAKE
6,7341,Girl Horrified At What She Watches Boyfriend D...,"Share This Baylee Luciani (left), Screenshot o...",FAKE
7,95,‘Britain’s Schindler’ Dies at 106,A Czech stockbroker who saved more than 650 Je...,REAL
8,4869,Fact check: Trump and Clinton at the 'commande...,Hillary Clinton and Donald Trump made some ina...,REAL
9,2909,Iran reportedly makes new push for uranium con...,Iranian negotiators reportedly have made a las...,REAL


## Machine Learning Framework API Patterns: Transformers and Estimators

Machine learning pipelines take a raw dataset, transform it over a series of steps, and finally output a prediction. For example: text documents -> tokenization -> word counts -> idf calculation -> tfidf calculation -> build classification model -> predictions

The individual steps in a pipeline are called Transformers. Some Transformers can take the input data and directly transform it without any understanding of the full dataset. For example, a stopword filter transformer might take a set of document token arrays, and remove stopwords from each document token array.

Transformers usually have a `transform()` function to do their work.

Other types of Transformers need to capture some understanding of the entire dataset before they can transform the input data. For example, a min/max normalization transformer needs to pass over the entire dataset once to figure out the minimum and maximum values, then pass over it a second time to calculate the min/max normalization.

These types of Transformers are called Estimators. They usually have a `fit()` function that does the first pass over the data, calculates the required state (the min and max values), and returns a Transformer that holds this state. You can then call `transform()` on that object to transform the data (calculate the min/max norms).

Scikit Learn, Spark, and a few others I've run across follow this pattern.
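To make the fit/transform split concrete, here is a minimal stdlib-only sketch of the min/max example. The class name `MinMaxScalerSketch` is made up for illustration and is not part of any library. It follows the scikit-learn convention of `fit()` returning the estimator itself; Spark's estimators instead return a separate fitted model object.

```python
class MinMaxScalerSketch:
    """Toy estimator: fit() learns state, transform() applies it."""

    def fit(self, values):
        # First pass over the data: capture the state the transform needs.
        self.min_ = min(values)
        self.max_ = max(values)
        return self  # returning self allows scaler.fit(x).transform(x)

    def transform(self, values):
        # Second pass: rescale every value using the fitted min and max.
        # (Assumes max > min; a real implementation must guard against a zero span.)
        span = self.max_ - self.min_
        return [(v - self.min_) / span for v in values]


data = [2.0, 4.0, 10.0]
scaler = MinMaxScalerSketch().fit(data)
scaled = scaler.transform(data)
print(scaled)  # [0.0, 0.25, 1.0]
```

Because the state lives on the fitted object, the same `scaler` can later transform new data using the min and max learned from the original dataset.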

## Word Tokenization & Token Counts

CountVectorizer: Converts a collection of text documents into a matrix of token counts

https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

In [3]:
# example
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['this is the first document',
           'this document is the second document',
           'and this is the third one',
           'is this the first document',         
         ]

vectorizer = CountVectorizer()

estimator = vectorizer.fit(corpus)
X_train_counts = estimator.transform(corpus)

X_train_counts = vectorizer.fit_transform(corpus)

# dictionary of all terms in corpus
#     Warning: term overload...this does not mean a python dict, it's a dictionary in the "list of words" sense.
print(vectorizer.get_feature_names())

# token count matrix from corpus
X_train_counts.toarray()

['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']


array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
       [0, 2, 0, 1, 0, 1, 1, 0, 1],
       [1, 0, 0, 1, 1, 0, 1, 1, 1],
       [0, 1, 1, 1, 0, 0, 1, 0, 1]])
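For intuition, the same vocabulary and count matrix can be rebuilt with nothing but the standard library. This is only a sketch of the idea, not scikit-learn's implementation: it assumes the default settings (lowercasing, and a token pattern that keeps runs of two or more word characters) and returns plain lists instead of a sparse matrix.

```python
import re
from collections import Counter

corpus = [
    "this is the first document",
    "this document is the second document",
    "and this is the third one",
    "is this the first document",
]

# Same idea as CountVectorizer's default tokenizer: lowercase, then keep
# runs of two or more word characters.
token_pattern = re.compile(r"\b\w\w+\b")
tokenized = [token_pattern.findall(doc.lower()) for doc in corpus]

# The vocabulary is the sorted set of all tokens; each token's position
# in this list is its column index in the count matrix.
vocabulary = sorted({tok for doc in tokenized for tok in doc})

# One row per document, one column per vocabulary term.
count_matrix = []
for doc in tokenized:
    term_counts = Counter(doc)
    count_matrix.append([term_counts[term] for term in vocabulary])

print(vocabulary)
for row in count_matrix:
    print(row)
```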

### Task: Run the text column from the FakeNews dataset through the CountVectorizer

Task 1a:
- turn the raw "text" column into a count vector
  - Try to keep all data in the pandas dataframe. So when creating the count vectors, put them back into the pandas dataframe as a new column
- what is the size of the corpus dictionary?
- what is the shape of the fitted document token count matrix?
- from a TF-IDF perspective, how does the output of the CountVectorizer relate to the TF-IDF calculation?

Task 1b:
- look at the CountVectorizer documentation and find the `min_df` and `max_df` parameters to the CountVectorizer constructor
  - use these params to filter out terms (tokens) that are only in 5 documents or fewer
  - use these params to filter out terms (tokens) that are in 95% of the documents or more
- what is the size of the corpus dictionary?
- why are these two types of term filters useful?

In [4]:
# build count matrix for Fake News dataset, put it back on the data frame as a new field
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(df["text"])


In [5]:
df['count_vector'] = counts

In [6]:
df.head(5)

Unnamed: 0.1,Unnamed: 0,title,text,label,count_vector
0,8476,You Can Smell Hillary’s Fear,"Daniel Greenfield, a Shillman Journalism Fello...",FAKE,"(0, 15786)\t1\n (0, 26358)\t1\n (0, 54359)..."
1,10294,Watch The Exact Moment Paul Ryan Committed Pol...,Google Pinterest Digg Linkedin Reddit Stumbleu...,FAKE,"(0, 15786)\t1\n (0, 26358)\t1\n (0, 54359)..."
2,3608,Kerry to go to Paris in gesture of sympathy,U.S. Secretary of State John F. Kerry said Mon...,REAL,"(0, 15786)\t1\n (0, 26358)\t1\n (0, 54359)..."
3,10142,Bernie supporters on Twitter erupt in anger ag...,"— Kaydee King (@KaydeeKing) November 9, 2016 T...",FAKE,"(0, 15786)\t1\n (0, 26358)\t1\n (0, 54359)..."
4,875,The Battle of New York: Why This Primary Matters,It's primary day in New York and front-runners...,REAL,"(0, 15786)\t1\n (0, 26358)\t1\n (0, 54359)..."


Out of curiosity, what's at the top and bottom of the dictionary?

In [7]:
vectorizer.get_feature_names()[:30]

['00',
 '000',
 '0000',
 '000000031',
 '00000031',
 '000035',
 '00006',
 '0001',
 '0001pt',
 '0002',
 '000billion',
 '000ft',
 '000km',
 '000x',
 '001',
 '0011',
 '002',
 '003',
 '004',
 '004s',
 '005',
 '005s',
 '006',
 '00684',
 '006s',
 '007',
 '007s',
 '008',
 '008s',
 '009']

Numbers. That's not super surprising, I guess.

In [8]:
vectorizer.get_feature_names()[-30:]

['שני',
 'שעת',
 'שתי',
 'תאמצנה',
 'תוצאה',
 'תחל',
 'תיירות',
 'תנותק',
 'תעודת',
 'תתרכז',
 'أن',
 'إجلاء',
 'الأمر',
 'الجرحى',
 'الدولية',
 'القادمون',
 'اللجنة',
 'تحتاج',
 'تعرفه',
 'تنجح',
 'حلب',
 'عربي',
 'عن',
 'لم',
 'ما',
 'محاولات',
 'من',
 'هذا',
 'والمرضى',
 'ยงade']

WTF? 

I had assumed that the articles in the dataset were pure English. That's clearly not true. Reading the entries, it looks like the posts containing these words are quoting people who posted in those languages. I'm not sure this changes anything about the analysis so far, but will have to consider that if we do English-specific things to the text later, it might do surprising things to Arabic or Hebrew.

In [9]:
# what's the size of the corpus dictionary?
print(f"feature dict length: {len(vectorizer.get_feature_names())}")
# what's the shape of the counts matrix
print(f"{counts.shape}")

feature dict length: 67659
(6335, 67659)


#### from a TF-IDF perspective, how does the output of the CountVectorizer relate to the TF-IDF calculation

In the manual computation of the TF-IDF, I had a "term_count", which was the number of times a term occurred in the given doc, and the tf was then just the term_count of that term divided by the # of terms in the doc. The CountVectorizer is that term_count, precomputed. To get the TF, you'll need to sum the whole vector to get the total # of terms in the document, but that's fairly trivial for a vector of numbers. 

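To get the IDF for a given term, you need its document frequency: the number of docs with a nonzero count at the fixed position in the vector corresponding to that term. (Summing that position across docs gives the term's total count in the corpus, which is not the same thing.) So, the CountVectorizer is pre-computing a few of the steps needed for TF-IDF.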
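As a small illustration, here is how TF and document frequency fall out of the count matrix for the toy corpus from the CountVectorizer example. It is a stdlib-only sketch; the matrix is copied in by hand so the block runs on its own.

```python
# Count matrix from the toy corpus (rows = documents, columns = terms).
vocabulary = ["and", "document", "first", "is", "one", "second", "the", "third", "this"]
count_matrix = [
    [0, 1, 1, 1, 0, 0, 1, 0, 1],
    [0, 2, 0, 1, 0, 1, 1, 0, 1],
    [1, 0, 0, 1, 1, 0, 1, 1, 1],
    [0, 1, 1, 1, 0, 0, 1, 0, 1],
]

# TF: each count divided by the total number of counted tokens in that document.
tf = [[c / sum(row) for c in row] for row in count_matrix]

# Document frequency: number of documents with a nonzero count in each column.
doc_freq = [
    sum(1 for row in count_matrix if row[j] > 0)
    for j in range(len(vocabulary))
]

# Column sum, for comparison: total occurrences of each term in the corpus.
corpus_count = [sum(row[j] for row in count_matrix) for j in range(len(vocabulary))]

col = vocabulary.index("document")
print(tf[1][col])         # 2 of the 6 counted tokens in document 1
print(doc_freq[col])      # "document" appears in 3 of the 4 documents
print(corpus_count[col])  # but it occurs 4 times in total
```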


## Create TFIDF Vectors

TfidfTransformer: Converts a count matrix into a tfidf matrix

https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html

In [10]:
from sklearn.feature_extraction.text import TfidfTransformer

tf_transformer = TfidfTransformer(use_idf=True, smooth_idf=True).fit(X_train_counts)
X_train_tf = tf_transformer.transform(X_train_counts)

# dictionary term idf values
print(tf_transformer.idf_)

# tfidf matrix from corpus
X_train_tf.toarray()

[1.91629073 1.22314355 1.51082562 1.         1.91629073 1.91629073
 1.         1.91629073 1.        ]


array([[0.        , 0.46979139, 0.58028582, 0.38408524, 0.        ,
        0.        , 0.38408524, 0.        , 0.38408524],
       [0.        , 0.6876236 , 0.        , 0.28108867, 0.        ,
        0.53864762, 0.28108867, 0.        , 0.28108867],
       [0.51184851, 0.        , 0.        , 0.26710379, 0.51184851,
        0.        , 0.26710379, 0.51184851, 0.26710379],
       [0.        , 0.46979139, 0.58028582, 0.38408524, 0.        ,
        0.        , 0.38408524, 0.        , 0.38408524]])
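The numbers printed above can be reproduced by hand, which also shows where `TfidfTransformer` departs from the textbook formula. With the settings used here (`smooth_idf=True` and the default `norm='l2'`), it uses the raw count as the tf, computes the idf as `ln((1 + n) / (1 + df)) + 1`, and then scales each row to unit Euclidean length. A stdlib-only sketch under those assumptions:

```python
import math

# Count matrix for the toy corpus (rows = documents, columns = terms).
count_matrix = [
    [0, 1, 1, 1, 0, 0, 1, 0, 1],
    [0, 2, 0, 1, 0, 1, 1, 0, 1],
    [1, 0, 0, 1, 1, 0, 1, 1, 1],
    [0, 1, 1, 1, 0, 0, 1, 0, 1],
]
n_docs = len(count_matrix)
n_terms = len(count_matrix[0])

# Document frequency per term: number of rows with a nonzero count.
doc_freq = [sum(1 for row in count_matrix if row[j] > 0) for j in range(n_terms)]

# Smoothed idf: ln((1 + n) / (1 + df)) + 1
idf = [math.log((1 + n_docs) / (1 + df)) + 1 for df in doc_freq]


def tfidf_row(row):
    # Raw count times idf, then L2-normalize the whole row.
    weighted = [count * weight for count, weight in zip(row, idf)]
    norm = math.sqrt(sum(x * x for x in weighted))
    return [x / norm for x in weighted]


tfidf = [tfidf_row(row) for row in count_matrix]

print([round(w, 8) for w in idf])
print([round(x, 8) for x in tfidf[0]])
```

The rounded values should line up with `idf_` and the first row of the tfidf matrix shown above.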

### Task: Run the token count data (that used the `min_df` and `max_df` parameters) from the FakeNews dataset through the TfidfTransformer

- What's different between the counts matrix and the tfidf matrix?

So much easier than manually calculating tfidf, right?

In [11]:
# calculate tfidf for the text column in the FakeNews dataset

# Stuck here. Can do by hand like:

fake_news_vectorizer = CountVectorizer()

fake_train_counts = fake_news_vectorizer.fit_transform(df['text'])

fake_tfidf = TfidfTransformer(use_idf=True).fit_transform(fake_train_counts)

# but that's cheating, isn't it? I should be using the counts off of the dataframe itself, yes?

# if I try 
# df['count_vector'] = counts
#  
# I get an exception that talks about a float conversion
#
# if I try
# df['count_vector'] = [count for count in counts]
# I also get the float conversion exception
#
# If I try
# df['count_vector'] = [ar for ar in counts.toarray()]
# I get a third exception that talks about size-1 arrays.
#TfidfTransformer(use_idf=True).fit_transform(df['count_vector'])

#
# In theory I could just keep going with fake_train_counts, but you specifically asked me to keep as much
# as possible in the dataframe, and everything I try to put it there is causing exceptions, so I'm a bit stuck.
#

In [12]:
fake_train_counts

<6335x67659 sparse matrix of type '<class 'numpy.int64'>'
	with 2158282 stored elements in Compressed Sparse Row format>

In [13]:
fake_tfidf

<6335x67659 sparse matrix of type '<class 'numpy.float64'>'
	with 2158282 stored elements in Compressed Sparse Row format>

In [14]:
features = fake_news_vectorizer.get_feature_names()

In [15]:
features.index("stringing")

57613

In [16]:
array_fake_tfidf = fake_tfidf.toarray()

In [17]:
array_fake_tfidf[5296][57613]

0.04669418737191859

Wait...why is that different from what we computed when done by hand? In the `02_tfidf_stopwords` one, we got "stringing" in doc 5296 as a tfidf of `0.00113`. Let's make sure we're talking about the same document here, and compare the others we pulled.

In [18]:
doc_id = 0
for term in ("smell", "the", "this", "washington", "money", "road", "and"):
    index = features.index(term)
    print(f"term {term}\t tf-idf {array_fake_tfidf[doc_id][index]}")

term smell	 tf-idf 0.0271874367707218
term the	 tf-idf 0.4139414589780448
term this	 tf-idf 0.03186928087028816
term washington	 tf-idf 0.0
term money	 tf-idf 0.0
term road	 tf-idf 0.0
term and	 tf-idf 0.13816442254810016


It looks like it's talking about the same document (the missing entries are the same) but all of the actual computed *values* are totally different. Why? Ooooh, wait: this is computing tf-idf on just the *text* part of the data frame. In the other one we computed it on the zip of the text and the title. Some of the docs had empty bodies and just titles. So, in *theory*, if I make a new column that's the combination of the text and the title, it should give me the same tfidf as the other doc. Partly I want to make sure I didn't f-up the computation in the earlier doc.

In [19]:
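# Note: this concatenates with no separator, so the last word of the text and the
# first word of the title fuse into one token; df['text'] + ' ' + df['title'] would avoid that.
all_words = df['text'] + df['title']
df['combined'] = all_words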

In [20]:
all_words_vectorizer = CountVectorizer()

all_words_counts = all_words_vectorizer.fit_transform(df['combined'])

all_words_tf_transformer = TfidfTransformer(use_idf=True).fit(all_words_counts)

all_trained_df = all_words_tf_transformer.transform(all_words_counts)

all_features = all_words_vectorizer.get_feature_names()

array_all_tfidf = all_trained_df.toarray()

In [21]:
doc_id = 0
for term in ("smell", "the", "this", "washington", "money", "road", "and"):
    index = all_features.index(term)
    print(f"term {term}\t tf-idf {array_all_tfidf[doc_id][index]}")

term smell	 tf-idf 0.054309331718960276
term the	 tf-idf 0.41168399638548503
term this	 tf-idf 0.03177445894424231
term washington	 tf-idf 0.0
term money	 tf-idf 0.0
term road	 tf-idf 0.0
term and	 tf-idf 0.13749704067972163


Huh. So, it did change the values, but not significantly. Certainly not enough to make them match what was computed in the previous exercise. Why? Waaaait. The common words have *high* tfidf? That's wrong. Could the TfidfTransformer be returning a token count rather than the tfidf?

Just in case I'm getting the tf here and not the tfidf, let's grab the assumed tf, and look up the computed idf and multiply them together:

In [22]:
index = all_features.index("smell")
idf_maybe = all_words_tf_transformer.idf_[index]*array_all_tfidf[0][index]
idf_maybe

0.33984039188931553

The manually computed tf-idf for `smell` in doc 0 from the other worksheet was `0.001393225793371872`.

In [97]:
index = all_features.index("the")
idf_maybe2 = all_words_tf_transformer.idf_[index]*array_all_tfidf[0][index]
idf_maybe2

0.4199533934266698

The manually computed tf-idf for `the` in doc 0 from the other worksheet was `0.00024359543106162621`. This is not only a different value, the assumed tf * the assumed idf for a super-common word (the) is still *higher* than the assumed computed tfidf for a less common word. This isn't right.

In [60]:
all_words_tf_transformer.idf_

array([5.65965837, 2.63842081, 8.65539064, ..., 9.06085575, 9.06085575,
       9.06085575])

In [62]:
s0_index = all_features.index("s0")
all_words_tf_transformer.idf_[s0_index]

9.060855752934316

Even the IDF values are different. 

In [66]:
df[:1]["combined"]

0    Daniel Greenfield, a Shillman Journalism Fello...
Name: combined, dtype: object

Going to set that aside for the moment...I don't like it, since I'm *super* confident the tfidf this is returning is wrong (common words are bigger than uncommon ones, that's got to be wrong), but I'm not sure how to fix it. Any fix I figure out should be applicable to the process below, though, so let's keep going on the process.
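One likely explanation for the mismatch is that the two calculations use different formulas rather than one of them being wrong. By default `TfidfTransformer` uses the raw count as the tf (not count divided by document length), adds 1 to a smoothed idf so that even a term in every document keeps a weight of 1, and L2-normalizes each row. The sketch below compares the two weightings on made-up numbers; the counts, document frequencies, and corpus size are hypothetical, and the "manual" formula is an assumption about what the earlier worksheet computed.

```python
import math

# Hypothetical numbers for one document in a 1,000-document corpus.
n_docs = 1000
doc_length = 400  # total tokens in the document
terms = {
    # term: (count in this document, number of documents containing it)
    "the": (40, 990),
    "smell": (2, 5),
}


def manual_tfidf(count, df):
    # Assumed textbook variant: relative frequency times log(N / df).
    return (count / doc_length) * math.log(n_docs / df)


def sklearn_style_weight(count, df):
    # TfidfTransformer defaults: raw count times smoothed idf.
    # (The row is L2-normalized afterwards, which rescales every term in the
    # document by the same factor and so does not change their ordering.)
    return count * (math.log((1 + n_docs) / (1 + df)) + 1)


for term, (count, df) in terms.items():
    print(term, manual_tfidf(count, df), sklearn_style_weight(count, df))
```

Under the manual formula "smell" outweighs "the", while under the scikit-learn defaults the sheer count of "the" wins. If that is undesirable, the constructor's `sublinear_tf=True` option (which replaces the tf with `1 + log(tf)`) or filtering with `max_df` or stop words would dampen very frequent terms.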

Next tasks:
- look at the CountVectorizer documentation and find the `min_df` and `max_df` parameters to the CountVectorizer constructor
  - use these params to filter out terms (tokens) that are only in 5 documents or fewer
  - use these params to filter out terms (tokens) that are in 95% of the documents or more
- what is the size of the corpus dictionary?
- why are these two types of term filters useful?


In [73]:
all_words_counts

<6335x68637 sparse matrix of type '<class 'numpy.int64'>'
	with 2172815 stored elements in Compressed Sparse Row format>

In [74]:
# max_df as an int is an absolute document count: any term that appears in more than 5 docs is ignored.
# So this keeps only the rare terms (those in 5 docs or fewer), which are the ones the task wants filtered out.
uncommon_words_vectorizer = CountVectorizer(max_df=5)
uncommon_words_counts = uncommon_words_vectorizer.fit_transform(df['combined'])

uncommon_words_counts


<6335x46736 sparse matrix of type '<class 'numpy.int64'>'
	with 81756 stored elements in Compressed Sparse Row format>

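So the full set has 68,637 terms in its vocabulary, and 46,736 of them appear in 5 documents or fewer. Filtering those out (for example with `min_df=6`) would therefore leave 68,637 - 46,736 = 21,901 terms.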

In [75]:
# min_df as a float becomes a percentage. This ignores any word that appears in less than 5% of the documents.
common_words_vectorizer = CountVectorizer(min_df=0.05)
common_words_counts = common_words_vectorizer.fit_transform(df["combined"])

common_words_counts

<6335x1327 sparse matrix of type '<class 'numpy.int64'>'
	with 1245783 stored elements in Compressed Sparse Row format>

This is only 1,327 words. This is reaching toward a set of stop words: things that appear in lots of the documents will tend to be commonly used words that won't uniquely identify a document.

In [81]:
features = common_words_vectorizer.get_feature_names()
features[-30:]

['with',
 'within',
 'without',
 'woman',
 'women',
 'won',
 'word',
 'words',
 'work',
 'worked',
 'workers',
 'working',
 'works',
 'world',
 'worse',
 'worst',
 'worth',
 'would',
 'wouldn',
 'written',
 'wrong',
 'wrote',
 'year',
 'years',
 'yes',
 'yet',
 'york',
 'you',
 'young',
 'your']

These kinds of filters are useful since they explicitly find common and uncommon words across the corpus. Not everything in the common words set is a stop word, but they are words that won't uniquely identify documents.
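The semantics of the two parameters are easy to mix up, so here is a small stdlib-only sketch of the filtering rule as the CountVectorizer documentation describes it: terms with a document frequency strictly below `min_df` or strictly above `max_df` are dropped, where an int is an absolute document count and a float is a proportion of the documents. The helper `filter_vocabulary` and the tiny corpus are made up for illustration.

```python
from collections import Counter

docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran"],
    ["the", "bird", "flew"],
]

# Document frequency: in how many documents does each term appear?
doc_freq = Counter(term for doc in docs for term in set(doc))


def filter_vocabulary(doc_freq, n_docs, min_df=1, max_df=1.0):
    """Keep terms whose document frequency lies within [min_df, max_df].

    Ints are absolute document counts; floats are proportions of n_docs.
    """
    low = min_df * n_docs if isinstance(min_df, float) else min_df
    high = max_df * n_docs if isinstance(max_df, float) else max_df
    return sorted(t for t, df in doc_freq.items() if low <= df <= high)


# Drop terms found in fewer than 2 docs or in more than 75% of docs.
kept = filter_vocabulary(doc_freq, len(docs), min_df=2, max_df=0.75)
print(kept)  # ['cat', 'sat']
```

In these terms, the task's two filters would correspond to something like `min_df=6` (drop terms in 5 documents or fewer) and a `max_df` just under `0.95`, passed together to a single vectorizer.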

## Machine Learning Pipelines

Most ML frameworks have a pipeline framework, where you can add multiple different transformers into a parent transformer, then you only call `fit()` and `transform()` on the pipeline object. Internally the pipeline will call `fit()` and `transform()` on each individual transformer and output the final matrix of data.

In [82]:
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ('count', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
]).fit(corpus)

X = pipe.transform(corpus)
X.toarray()

array([[0.        , 0.46979139, 0.58028582, 0.38408524, 0.        ,
        0.        , 0.38408524, 0.        , 0.38408524],
       [0.        , 0.6876236 , 0.        , 0.28108867, 0.        ,
        0.53864762, 0.28108867, 0.        , 0.28108867],
       [0.51184851, 0.        , 0.        , 0.26710379, 0.51184851,
        0.        , 0.26710379, 0.51184851, 0.26710379],
       [0.        , 0.46979139, 0.58028582, 0.38408524, 0.        ,
        0.        , 0.38408524, 0.        , 0.38408524]])

In [83]:
print(f"count vectorizer dictionary: {pipe['count'].get_feature_names()}")
print()
print(f"tfidf transformer's idf data: {pipe['tfidf'].idf_}")

count vectorizer dictionary: ['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']

tfidf transformer's idf data: [1.91629073 1.22314355 1.51082562 1.         1.91629073 1.91629073
 1.         1.91629073 1.        ]
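To show what "internally the pipeline calls `fit()` and `transform()` on each step" means, here is a toy pipeline in plain Python. All three classes are hypothetical and exist only for this sketch; the real `sklearn.pipeline.Pipeline` does more (for example, it does not run the final step's `transform()` during `fit()`).

```python
class PipelineSketch:
    def __init__(self, steps):
        self.steps = steps  # list of (name, transformer) pairs

    def fit(self, data):
        # Fit each step on the output of the previous step.
        for _, step in self.steps:
            data = step.fit(data).transform(data)
        return self

    def transform(self, data):
        # Push the data through every fitted step in order.
        for _, step in self.steps:
            data = step.transform(data)
        return data


class Lowercase:
    def fit(self, docs):
        return self  # stateless: nothing to learn

    def transform(self, docs):
        return [d.lower() for d in docs]


class VocabularyCounter:
    def fit(self, docs):
        # Learn the vocabulary from whitespace-separated tokens.
        self.vocabulary_ = sorted({w for d in docs for w in d.split()})
        return self

    def transform(self, docs):
        return [[d.split().count(w) for w in self.vocabulary_] for d in docs]


raw_docs = ["A b", "b B c"]
pipe_sketch = PipelineSketch(
    [("lower", Lowercase()), ("count", VocabularyCounter())]
).fit(raw_docs)
result = pipe_sketch.transform(raw_docs)
print(result)
```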


### Task: Build a Pipeline to generate a tfidf document matrix



In [39]:
# build an ML pipeline to calc tfidf on the combined text + title column of the FakeNews dataset
fake_pipe = Pipeline([("count", CountVectorizer()), ("tfidf", TfidfTransformer()),]).fit(df['combined'])

transformed = fake_pipe.transform(df["combined"])
arr = transformed.toarray()
arr[:4]



array([[0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.02742222, 0.        , ..., 0.        , 0.        ,
        0.        ]])

In [40]:
fake_pipe_features = fake_pipe["count"].get_feature_names()

In [41]:
len(fake_pipe_features)

68637

In [94]:
index = fake_pipe_features.index("s0")
fake_pipe['tfidf'].idf_[index]

9.060855752934316

On the plus side, this at least returns the same value for the idf we got in the previous computations.

## Extra Credit: Text Normalization

There are different algorithms for text normalization, such as stemming and lemmatization. These algorithms aren't built into scikit-learn, but other text processing libraries like `nltk` have implementations. Both have pros and cons. I really like lemmatization, but it has a heavier processing cost.

Pick one, stemming or lemmatization, and integrate it into your raw text to tfidf pipeline (there are articles out there on how to integrate `nltk` into a scikit-learn pipeline).

How did this change the vocabulary size?

So...I'm feeling contrary, so I'll try stemming. The pipeline will have to be (to my present understanding) that we run the corpus through a stemming transformer, which will change the vocab enormously, then through the CountVectorizer to get the counts of the stemmed vocab, then to the TfidfTransformer to run the TF-IDF computation on it.

In [23]:
from nltk.stem import SnowballStemmer

In [24]:
snowball = SnowballStemmer("english")

From some reading, it looks like the right way to do this is to override the analyzer in the CountVectorizer and add the stemmer there. That also allows you to tell the CountVectorizer to use stop words.

So, our extended CountVectorizer would look like:

In [29]:
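class StemmedCountVectorizer(CountVectorizer):
    # one stemmer instance shared by every vectorizer of this class
    snowball_stemmer = SnowballStemmer("english")

    def build_analyzer(self):
        # the stock analyzer lowercases, tokenizes, and drops any configured stop words
        analyzer = super().build_analyzer()
        # stem each token the stock analyzer emits
        return lambda doc: [self.snowball_stemmer.stem(w) for w in analyzer(doc)]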

So you can make a vectorizer simply with:

In [30]:
stemmer_vector = StemmedCountVectorizer(stop_words="english")

Then the full stemmed pipeline would look like

In [33]:
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ('count', StemmedCountVectorizer(stop_words="english")),
    ('tfidf', TfidfTransformer()),
]).fit(df["combined"])

X = pipe.transform(df["combined"])

In [34]:
pipe_features = pipe["count"].get_feature_names()

In [35]:
pipe_features[:5]

['00', '000', '0000', '000000031', '00000031']

In [42]:
pipe_features[-5:]

['مورد', 'هذا', 'والمرضى', 'کدآمایی', 'ยงade']

In [36]:
len(pipe_features)

46742

Stemming shrank the size of the feature set (the vocab) by a *lot*. It was 68,637 before stemming, 46,742 after. Note that this pipeline also added `stop_words="english"`, so a small part of the drop comes from the stop-word filter rather than from stemming. This probably didn't change the likely un-stemmable things (the numbers at the start of the features) or the non-English things at the end of the features. It might be interesting to drop those entirely since the analysis we're doing is entirely English-based.
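The mechanism behind the shrinkage can be seen on a tiny example: wrapping the analyzer with a stemmer maps several surface forms onto one vocabulary entry. The sketch below uses a deliberately naive, hypothetical `toy_stem` function in place of the Snowball stemmer so it runs without `nltk`; real stemmers apply far more careful rules.

```python
import re


def toy_stem(word):
    # Hypothetical, deliberately naive stemmer: strip a few common suffixes.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def analyzer(doc):
    # Stand-in for the stock analyzer: lowercase, tokens of 2+ word characters.
    return re.findall(r"\b\w\w+\b", doc.lower())


def stemmed_analyzer(doc):
    # Same wrapping idea as StemmedCountVectorizer.build_analyzer.
    return [toy_stem(tok) for tok in analyzer(doc)]


docs = ["Workers worked on working works", "The worker works"]
plain_vocab = sorted({t for d in docs for t in analyzer(d)})
stemmed_vocab = sorted({t for d in docs for t in stemmed_analyzer(d)})

print(len(plain_vocab), plain_vocab)
print(len(stemmed_vocab), stemmed_vocab)
```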