## TFIDF Practice

Today we will be taking a closer look at TFIDF as a descriptive measure of rare terms, and practicing some basic plots with TFIDF.

In [1]:
# Import libraries and setup code
import pandas as pd
import seaborn as sns

# sklearn
from sklearn.feature_extraction.text import TfidfVectorizer, TfidfTransformer, CountVectorizer

%matplotlib inline

## A basic example, revisited.

The best way to understand TFIDF is to use it on a basic example again.  For today's mini-practice, we will revisit the cat, the rat, and the bat.

In [2]:
corpus = ["I am a cat.", "I am a bat.", "I am a rat."] # which animal are you?

## 1. Vectorize your corpus using TfidfVectorizer and set to a variable called "X"
1. Initialize TfidfVectorizer to a new variable.
1. Set X to the return of fit_transform.

In [None]:
# Initialize TfidfVectorizer to a new variable.

# Set X to the return of fit_transform.
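
One possible way to complete this step is sketched below. The variable name `vectorizer` is an arbitrary choice for illustration, and the defaults of `TfidfVectorizer` are assumed.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["I am a cat.", "I am a bat.", "I am a rat."]

# Initialize TfidfVectorizer with its default settings.
vectorizer = TfidfVectorizer()

# fit_transform learns the vocabulary and returns a sparse document-term matrix.
X = vectorizer.fit_transform(corpus)

# The default token pattern keeps only tokens of two or more characters,
# so "I" and "a" are dropped and four terms remain.
print(X.shape)
print(sorted(vectorizer.vocabulary_))
```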

## 2. Look at your vectorized corpus, X, as an array or a dense matrix.
What does it look like?
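
A minimal sketch of how the matrix might be inspected, assuming the default `TfidfVectorizer` settings (smoothed IDF and L2-normalized rows). The name `dense` is just illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["I am a cat.", "I am a bat.", "I am a rat."]
X = TfidfVectorizer().fit_transform(corpus)

# X is a SciPy sparse matrix; printing it lists only the non-zero entries.
print(type(X))

# toarray() converts it to a dense NumPy array: one row per document,
# one column per term.
dense = X.toarray()
print(dense.round(3))
```

With the default settings each row has unit length, and the term shared by every document ("am") gets a lower weight than the animal that is unique to the document.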

## 3. Initialize a new dataframe with an array from X.
Also, set the columns to the output of the TF-IDF vectorizer object's `.get_feature_names_out()` method (`.get_feature_names()` in scikit-learn versions before 1.0).  Without it, you won't be able to tell which word feature corresponds to which matrix column.

Each row will correspond to one document from the original corpus object.  Verify that each row matches the original dataset with the word features.

In [None]:
# setup your dataframe here

## 4. Aggregate your data with mean, median, min, max.  Plot each of your results with a "bar" or "barh" figure.

## 5. Refactor your existing code into a function.
- Your method should accept a corpus object, a vectorizer type (tfidf or countvectorizer), and aggregate function parameter (string or function -- your choice!).
- Your method should output a figure.

An example use case of your code would be:
> ```python
>  corpus = ["I am a rat", "I am a cat", "I am a bat"]
>
>  # TFIDF plot with max aggregation
>  vectorize_and_plot(corpus, vectorizer="tfidf", agg_func="max")
>  [your plot here] 
>
>  # COUNT plot with max aggregation
>  vectorize_and_plot(corpus, vectorizer="count", agg_func="max")
>  [your plot here] 
>  ```

## 6. Use your function to compare CountVectorizer vs TfidfVectorizer 
- Use the original corpus object
- THEN try using the new corpus object below

In [43]:
## Original corpus
# corpus = ["I am a cat.", "I am a bat.", "I am a rat."] # which animal are you?

## New corpus
# corpus = ["I am a cat.", "I am a bat.", "I am a rat.", "the cat is not a rat", "There is not a cat that sat"] # which animal are you?


## 7. Check out this awesome pipeline.
- Fit the corpus object
- Try to run a basic prediction

_This is not a real-world problem, but hopefully you get a sense of the basics of looking at text data and its application in sklearn._

In [41]:
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.datasets import fetch_20newsgroups

# setup our data

""" We use this list to filter which categories we want from our sample newsgroups dataset """
categories = [
    'alt.atheism',
    'talk.religion.misc',
    'comp.graphics',
    'sci.space',
]

training_data = fetch_20newsgroups(
    subset       =  'train', 
    categories   =  categories,
    shuffle      =  True, 
    random_state =  42,
    remove       = ('headers', 'footers', 'quotes'))

test_data = fetch_20newsgroups(
    subset       =  'test', 
    categories   =  categories,
    shuffle      =  True, 
    random_state =  42,
    remove       =  ('headers', 'footers', 'quotes')
)

""" Our training data """
X_train = training_data.data
y_train = training_data.target

""" Our testing data"""
X_test = test_data.data
y_test = test_data.target

# Remember all that code needed to vectorize our text data before modeling?  
# We still need to use it in order to do EDA and evaluate our dataset before we model. 
# DO NOT GET IN THE HABIT OF MODELING WITHOUT EDA!
pipeline = Pipeline([
    ('vect', CountVectorizer()),     # You will have questions about this
    ('tfidf', TfidfTransformer()),
    # ('tfidf', TfidfVectorizer()),
    ('clf', LogisticRegression()),
])

# Fit our data to the pipeline AS IF it were any other model like we've previously done in sklearn
# Note:  X_train is literal text data, RAW format!
model = pipeline.fit(X_train, y_train)
model.score(X_test, y_test)

0.73318551367331852
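
A basic prediction passes raw strings to `predict`, just as raw strings were passed to `fit`. The sketch below uses a tiny made-up dataset (`texts`, `labels`, and `target_names` are hypothetical stand-ins) so that it runs without downloading the newsgroups data; in the notebook you would call `model.predict(...)` and look up names in `training_data.target_names` instead.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy stand-in data (hypothetical), not the newsgroups dataset.
texts = ["the rocket reached orbit", "nasa launched a satellite",
         "render the 3d image", "the gpu draws polygons"]
labels = [0, 0, 1, 1]
target_names = ["sci.space", "comp.graphics"]

toy_pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", LogisticRegression()),
])
toy_pipeline.fit(texts, labels)

# The pipeline vectorizes the new text with the fitted vocabulary, then classifies it.
new_docs = ["a satellite in orbit"]
predicted = toy_pipeline.predict(new_docs)
print([target_names[i] for i in predicted])
print(toy_pipeline.predict_proba(new_docs).round(3))
```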

## 8. Get the vectorized matrix from the "model" object returned by the Pipeline instance.
Plot the vectorized data with aggregation.  Your previous function will work well if you refactor it.

_It helps to look at each class subset separately to see how individual features may contribute to prediction.  Looking closely at your data is essential for choosing a model, anticipating how it will perform, and confidently evaluating and reporting your findings._

In [35]:
## Here are some hints to jumpstart your work

# 1. Inspect the pipeline object.  All of the "steps" within the pipeline are contained inside.
pipeline.steps

# 2. Use the 2nd step object, "tfidf", to get a reference to the object that can transform data
# this will get "TfidfTransformer" object in the steps list ('tfidf', TfidfTransformer)
step2_transformer = pipeline.steps[1][1] 

# 3. Use step2_transformer to .fit_transform() the count matrix produced by the "vect" step
# (TfidfTransformer expects term counts, not raw text), then examine your training dataset with proper EDA

## 9. Update the pipeline to use only CountVectorizer.
Also experiment with different classification models:
- KNN
- Random Forest
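
A sketch of how the swap might look, again on a made-up toy dataset (`texts` and `labels` are hypothetical). The hyperparameters are assumptions chosen so the example runs on four documents: `n_neighbors=1` is needed only because the toy set is smaller than KNN's default of five neighbors.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Toy stand-in data (hypothetical); use X_train / y_train in the notebook.
texts = ["the rocket reached orbit", "nasa launched a satellite",
         "render the 3d image", "the gpu draws polygons"]
labels = [0, 0, 1, 1]

candidates = {
    "knn": KNeighborsClassifier(n_neighbors=1),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=42),
}

predictions = {}
for name, clf in candidates.items():
    # Raw counts only: the TfidfTransformer step is left out.
    pipe = Pipeline([("vect", CountVectorizer()), ("clf", clf)])
    pipe.fit(texts, labels)
    predictions[name] = pipe.predict(["a satellite in orbit"])

print(predictions)
```

On the real data, compare each pipeline with `pipe.score(X_test, y_test)` as in step 7.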