# Homework: turn the FakeNews text column into tfidf vectors the easy way
  
Vocab:
- corpus: set of all documents in a dataset

## `news.csv` Data Set

4 columns: 
- article id
- article title
- article text
- label

In [65]:
# Read the data
import pandas as pd

df = pd.read_csv('data/news.csv')

#Get shape and head
shape = df.shape
print(f"shape of the dataset: {shape} \n")

df.head(10)

shape of the dataset: (6335, 4) 



Unnamed: 0.1,Unnamed: 0,title,text,label
0,8476,You Can Smell Hillary’s Fear,"Daniel Greenfield, a Shillman Journalism Fello...",FAKE
1,10294,Watch The Exact Moment Paul Ryan Committed Pol...,Google Pinterest Digg Linkedin Reddit Stumbleu...,FAKE
2,3608,Kerry to go to Paris in gesture of sympathy,U.S. Secretary of State John F. Kerry said Mon...,REAL
3,10142,Bernie supporters on Twitter erupt in anger ag...,"— Kaydee King (@KaydeeKing) November 9, 2016 T...",FAKE
4,875,The Battle of New York: Why This Primary Matters,It's primary day in New York and front-runners...,REAL
5,6903,"Tehran, USA","\nI’m not an immigrant, but my grandparents ...",FAKE
6,7341,Girl Horrified At What She Watches Boyfriend D...,"Share This Baylee Luciani (left), Screenshot o...",FAKE
7,95,‘Britain’s Schindler’ Dies at 106,A Czech stockbroker who saved more than 650 Je...,REAL
8,4869,Fact check: Trump and Clinton at the 'commande...,Hillary Clinton and Donald Trump made some ina...,REAL
9,2909,Iran reportedly makes new push for uranium con...,Iranian negotiators reportedly have made a las...,REAL


## Machine Learning Framework API Patterns: Transformers and Estimators

Machine learning pipelines take a raw dataset, transform the data over a series of steps, and finally output some prediction. For example: text documents -> tokenization -> word counts -> idf calculation -> tfidf calculation -> build classification model -> predictions

The individual steps in a pipeline are called Transformers. Some Transformers can take the input data and directly transform it without any understanding of the full dataset. For example, a stopword filter transformer might take a set of document token arrays, and remove stopwords from each document token array.

Transformers usually have a `transform()` function to do their work.

Other types of Transformers need to capture some understanding of the entire dataset before they can transform the input data. For example, a min/max normalization transformer needs to pass over the entire dataset once to figure out the minimum and maximum values, then pass over the dataset a second time to calculate the min/max normalization.

These types of Transformers are called Estimators and usually have a `fit()` function, which does the first pass over the data, calculates the required state (min and max values), and returns a Transformer that holds this state. Then you can call `transform()` on this object to transform the data (calculate the min/max norms).

scikit-learn, Spark, and a few others I've run across follow this pattern.
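
As a rough illustration of the pattern, here is a hand-rolled sketch in plain Python. The class names `StopwordFilter` and `MinMaxNormalizer` are made up for this example and are not part of any framework. In this sketch `fit()` returns the object itself, which is how scikit-learn behaves; as far as I know, Spark instead returns a separate fitted model object.

```python
class StopwordFilter:
    """Stateless transformer: needs no knowledge of the full dataset."""

    def __init__(self, stopwords):
        self.stopwords = set(stopwords)

    def transform(self, docs):
        # drop stopwords from each document's token list
        return [[tok for tok in doc if tok not in self.stopwords] for doc in docs]


class MinMaxNormalizer:
    """Estimator: must see the whole dataset before it can transform."""

    def fit(self, values):
        # first pass: capture the state (min and max values)
        self.min_ = min(values)
        self.max_ = max(values)
        return self

    def transform(self, values):
        # second pass: rescale using the stored state
        # (this sketch assumes max > min, otherwise it divides by zero)
        span = self.max_ - self.min_
        return [(v - self.min_) / span for v in values]


filtered = StopwordFilter(["is", "the"]).transform(
    [["this", "is", "the", "first", "document"]]
)
print(filtered)  # [['this', 'first', 'document']]

scaler = MinMaxNormalizer().fit([2, 4, 10])
print(scaler.transform([2, 4, 10]))  # [0.0, 0.25, 1.0]
```

Because the fitted state lives on the object, the same `scaler` can later transform data it has never seen, using the min and max learned during `fit()`.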

## Word Tokenization & Token Counts

CountVectorizer: Converts a collection of text documents into a matrix of token counts

https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

In [8]:
# example
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['this is the first document',
           'this document is the second document',
           'and this is the third one',
           'is this the first document',         
         ]

vectorizer = CountVectorizer()

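# option 1: fit (learn the vocabulary), then transform (count the tokens)
estimator = vectorizer.fit(corpus)
X_train_counts = estimator.transform(corpus)

# option 2: do both steps in a single call
X_train_counts = vectorizer.fit_transform(corpus)

# dictionary of all terms in corpus
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
print(vectorizer.get_feature_names_out().tolist())

# token count matrix from corpus
X_train_counts.toarray()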

['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']


array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
       [0, 2, 0, 1, 0, 1, 1, 0, 1],
       [1, 0, 0, 1, 1, 0, 1, 1, 1],
       [0, 1, 1, 1, 0, 0, 1, 0, 1]])

### Task: Run the text column from the FakeNews dataset through the CountVectorizer

Task 1a:
- turn the raw text column into count vectors
  - Try to keep all data in the pandas dataframe. So when creating the count vectors, put them back into the pandas dataframe as a new column
- what is the size of the corpus dictionary?
- what is the shape of the fitted document token count matrix?
- from a TF-IDF perspective, how does the output of the CountVectorizer relate to the TF-IDF calculation?

Task 1b:
- look at the CountVectorizer documentation and find the `min_df` and `max_df` parameters to the CountVectorizer constructor
  - use these params to filter out terms (tokens) that appear in 5 documents or fewer
  - use these params to filter out terms (tokens) that appear in 95% of the documents or more
- what is the size of the corpus dictionary?
- why are these two types of term filters useful?

In [62]:
# build count matrix for Fake News dataset
df["text"].head(5)



0    Daniel Greenfield, a Shillman Journalism Fello...
1    Google Pinterest Digg Linkedin Reddit Stumbleu...
2    U.S. Secretary of State John F. Kerry said Mon...
3    — Kaydee King (@KaydeeKing) November 9, 2016 T...
4    It's primary day in New York and front-runners...
Name: text, dtype: object

## Create TFIDF Vectors

TfidfTransformer: Converts a count matrix into a tfidf matrix

https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html

In [9]:
from sklearn.feature_extraction.text import TfidfTransformer

tf_transformer = TfidfTransformer(use_idf=True, smooth_idf=True).fit(X_train_counts)
X_train_tf = tf_transformer.transform(X_train_counts)

# dictionary term idf values
print(tf_transformer.idf_)

# tfidf matrix from corpus
X_train_tf.toarray()

[1.91629073 1.22314355 1.51082562 1.         1.91629073 1.91629073
 1.         1.91629073 1.        ]


array([[0.        , 0.46979139, 0.58028582, 0.38408524, 0.        ,
        0.        , 0.38408524, 0.        , 0.38408524],
       [0.        , 0.6876236 , 0.        , 0.28108867, 0.        ,
        0.53864762, 0.28108867, 0.        , 0.28108867],
       [0.51184851, 0.        , 0.        , 0.26710379, 0.51184851,
        0.        , 0.26710379, 0.51184851, 0.26710379],
       [0.        , 0.46979139, 0.58028582, 0.38408524, 0.        ,
        0.        , 0.38408524, 0.        , 0.38408524]])
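
For comparison, the numbers printed above can be reproduced by hand. The sketch below assumes the settings used in this example (`use_idf=True`, `smooth_idf=True`) plus the transformer's default L2 row normalization, under which the idf of a term is `ln((1 + n) / (1 + df)) + 1`, where `n` is the number of documents and `df` is the number of documents containing the term. The variable and function names are my own.

```python
import math

# token counts for the four-document toy corpus (rows = documents)
counts = [
    [0, 1, 1, 1, 0, 0, 1, 0, 1],
    [0, 2, 0, 1, 0, 1, 1, 0, 1],
    [1, 0, 0, 1, 1, 0, 1, 1, 1],
    [0, 1, 1, 1, 0, 0, 1, 0, 1],
]
n_docs = len(counts)
n_terms = len(counts[0])

# document frequency: in how many documents each term appears
doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

# smoothed idf: ln((1 + n) / (1 + df)) + 1
idf = [math.log((1 + n_docs) / (1 + df)) + 1 for df in doc_freq]


def tfidf_row(row):
    # weight each count by its idf, then scale the row to unit L2 length
    weighted = [tf * w for tf, w in zip(row, idf)]
    norm = math.sqrt(sum(x * x for x in weighted))
    return [x / norm for x in weighted]


tfidf = [tfidf_row(row) for row in counts]

print([round(w, 8) for w in idf])
print([round(x, 8) for x in tfidf[0]])
```

Terms that appear in every document (`is`, `the`, `this`) get the minimum idf of 1, while terms that appear in a single document get the largest weight.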

### Task: Run the token count data (that used the `min_df` and `max_df` parameters) from the FakeNews dataset through the TfidfTransformer

- What's different between the counts matrix and the tfidf matrix?

So much easier than manually calculating tfidf, right?

In [59]:
# calculate tfidf for the text column in the FakeNews dataset





## Machine Learning Pipelines

Most ML frameworks have a pipeline abstraction, where you can add multiple transformers to a parent transformer, then you only call `fit()` and `transform()` on the pipeline object. Internally, the pipeline calls `fit()` and `transform()` on each individual transformer and outputs the final matrix of data.

In [10]:
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ('count', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
]).fit(corpus)

X = pipe.transform(corpus)
X.toarray()

array([[0.        , 0.46979139, 0.58028582, 0.38408524, 0.        ,
        0.        , 0.38408524, 0.        , 0.38408524],
       [0.        , 0.6876236 , 0.        , 0.28108867, 0.        ,
        0.53864762, 0.28108867, 0.        , 0.28108867],
       [0.51184851, 0.        , 0.        , 0.26710379, 0.51184851,
        0.        , 0.26710379, 0.51184851, 0.26710379],
       [0.        , 0.46979139, 0.58028582, 0.38408524, 0.        ,
        0.        , 0.38408524, 0.        , 0.38408524]])

In [11]:
print(f"count vectorizer dictionary: {pipe['count'].get_feature_names_out().tolist()}")
print()
print(f"tfidf transformer's idf data: {pipe['tfidf'].idf_}")

count vectorizer dictionary: ['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']

tfidf transformer's idf data: [1.91629073 1.22314355 1.51082562 1.         1.91629073 1.91629073
 1.         1.91629073 1.        ]
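
The internal chaining described at the top of this section can be sketched in a few lines of plain Python. This is a simplified illustration, not scikit-learn's actual `Pipeline` code; `Tokenize`, `CountTokens`, and `MiniPipeline` are hypothetical names invented for the example.

```python
class Tokenize:
    """Stateless step: fit() has nothing to learn."""

    def fit(self, docs):
        return self

    def transform(self, docs):
        return [doc.lower().split() for doc in docs]


class CountTokens:
    """Stateful step: fit() learns the vocabulary."""

    def fit(self, token_lists):
        self.vocab_ = sorted({tok for toks in token_lists for tok in toks})
        return self

    def transform(self, token_lists):
        return [[toks.count(term) for term in self.vocab_] for toks in token_lists]


class MiniPipeline:
    def __init__(self, steps):
        self.steps = steps

    def fit(self, data):
        # fit each step on the output of the previous step
        for step in self.steps[:-1]:
            data = step.fit(data).transform(data)
        self.steps[-1].fit(data)
        return self

    def transform(self, data):
        # push the data through every fitted step in order
        for step in self.steps:
            data = step.transform(data)
        return data


mini = MiniPipeline([Tokenize(), CountTokens()]).fit(
    ["this is the first document", "this document is the second document"]
)
print(mini.steps[1].vocab_)
# ['document', 'first', 'is', 'second', 'the', 'this']
print(mini.transform(["the second document"]))
# [[1, 0, 0, 1, 1, 0]]
```

The caller only touches `fit()` and `transform()` on the pipeline object; each step's learned state stays inside that step.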


### Task: Build a Pipeline to generate a tfidf document matrix



In [64]:
# build an ML pipeline to calc tfidf on the text column of the FakeNews dataset




## Extra Credit: Text Normalization

There are different algorithms for text normalization, such as stemming and lemmatization. These algorithms aren't built into scikit-learn, but other text processing libraries like `nltk` have implementations. Both have pros and cons. I really like lemmatization, but it has a heavier processing cost.

Pick one, stemming or lemmatization, and integrate it into your raw text to tfidf pipeline (there are articles out there on how to integrate `nltk` into a scikit-learn pipeline).

How did this change the vocabulary size?
