# Assessment Review

First up:

## Natural Language Processing Practice

Using the 'Spooky Authors' dataset: https://www.kaggle.com/c/spooky-author-identification/overview

In [None]:
# Imports
import pandas as pd
import string
import matplotlib.pyplot as plt

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
import nltk
from nltk.corpus import stopwords
from nltk import word_tokenize

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, plot_confusion_matrix

**Student 1:**

Please grab the dataset and look at a few aspects of this dataset (shape, some examples, etc). We'll be using just the train csv for this, for ease of use!

In [None]:
# Grab the train set from the competition 
df = None

In [None]:
# Encoding our target from author initials to numbers
le = LabelEncoder()
df['target'] = le.fit_transform(df['author'])

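As a rough illustration of what the encoding step does, here is a pure-Python sketch. The short `authors` list is made up for the example (EAP, HPL and MWS are the author initials used in this dataset), and `label_to_id` is a hypothetical name. `LabelEncoder` similarly sorts the distinct labels and assigns integers in that order.

```python
# Made-up sample of author labels (not real rows from the dataset)
authors = ["EAP", "HPL", "MWS", "EAP", "MWS"]

# Sort the distinct labels, then number them in that order
classes = sorted(set(authors))
label_to_id = {label: i for i, label in enumerate(classes)}

# Replace each label with its integer id
encoded = [label_to_id[a] for a in authors]

print(label_to_id)
print(encoded)
```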
In [None]:
# Checking that change
df.head()

In [None]:
# Grabbing our inputs and target
X = df['text']
y = df['target']

# Train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [None]:
# Grabbing a list of stopwords from NLTK, imported above
# We're also using the string library to add punctuation to our list
stopwords_list = stopwords.words('english') + list(string.punctuation)

In [None]:
stopwords_list[:20]

**Student 2:**

What is the point of a list of stopwords? How/why will we use this list?

- 


### "Bag of Words" - Count Vectorizer

Useful link to the 'User Guide' part of the documentation on this: https://scikit-learn.org/stable/modules/feature_extraction.html#common-vectorizer-usage

In [None]:
# Instantiating our vectorizer
count_vectorizer = CountVectorizer()

# Training on the train set, then transforming the train set
X_count_train = count_vectorizer.fit_transform(X_train)
# Transforming the test set
X_count_test = count_vectorizer.transform(X_test)

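To make the fit/transform split in the cell above concrete, here is a minimal pure-Python sketch of the bag-of-words idea. The toy sentences and the `to_counts` helper are assumptions for illustration only. The real `CountVectorizer` also lowercases text and, by default, drops single-character tokens, so its output will differ in detail. The key point is that the vocabulary is learned from the training documents only, so words that appear only in the test document are ignored.

```python
from collections import Counter

# Hypothetical toy documents standing in for rows of the `text` column
train_docs = ["the raven quoth nevermore", "the house of usher"]
test_doc = "the raven left the house"

# "Fit": build a sorted vocabulary from the training documents only
vocabulary = sorted({word for doc in train_docs for word in doc.split()})

def to_counts(doc):
    # "Transform": count each vocabulary word; words outside the
    # vocabulary (like "left") are simply dropped
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]

print(vocabulary)
print(to_counts(test_doc))
```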
In [None]:
# Instantiating a classifier to use on this text - Multinomial Naive Bayes
nb_classifier = MultinomialNB() 

# Fitting the classifier
nb_classifier.fit(X_count_train, y_train)

# Getting our predictions for the train and test sets
train_preds = nb_classifier.predict(X_count_train)
test_preds = nb_classifier.predict(X_count_test)

In [None]:
# Let's see how we did!
print(accuracy_score(y_test, test_preds))
plot_confusion_matrix(nb_classifier, X_count_test, y_test, 
                      values_format = ".4g") # to make numbers readable
plt.show()

**Student 3:**

Discuss! How did we do? What could we change?

- 


We're about to try this on a few different vectorizers, so let's make that easier!

**Student 4:**

Write a function where we can provide an instantiated vectorizer, an instantiated classifier, and all of our train and test data, and the function will spit out the accuracy score and confusion matrix just like above:

In [None]:
def classify_vectorized_text(vectorizer, classifier, Xtrain, Xtest, ytrain, ytest):
    '''
    Fit and transform text data using the provided vectorizer, then fit and
    predict with the provided classifier, in order to see the resulting
    accuracy score and confusion matrix.
    For Xtrain, Xtest, ytrain, ytest, expect the output of an
    sklearn train/test split.
    -
    Inputs:
    vectorizer: an instantiated sklearn vectorizer
    classifier: an instantiated sklearn classifier
    Xtrain: training input data
    Xtest: testing input data
    ytrain: training true result
    ytest: testing true result
    -
    Outputs:
    train_preds: predicted results for the train set
    test_preds: predicted results for the test set
    '''
    
    pass

**Student 5:**

Please add in something that was missing from our first Count Vectorizer:

Link to the documentation: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

In [None]:
# Create a new vectorizer here


**Student 6:**

Please create a new classifier and compare the results, using our previously-defined function!

In [None]:
# Create a new classifier here

In [None]:
# Use the function here

Compare: 

- 


### TF-IDF: Term-Frequency - Inverse Document-Frequency

Bryan talked about this... but what even is it?

From [the documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html):

> "The goal of using tf-idf instead of the raw frequencies of occurrence of a token in a given document is to scale down the impact of tokens that occur very frequently in a given corpus and that are hence empirically less informative than features that occur in a small fraction of the training corpus."

Basically, it's a statistic that reflects how important a word is to a document within a corpus. The term frequency measures how often a word appears within the document in question, while the document frequency counts how many documents in the whole corpus contain that word; taking the inverse of the document frequency down-weights words that show up almost everywhere. If a word appears often in our document but relatively rarely in the rest of the corpus, it is probably an important word for that specific document!

In this example, the training corpus is every sentence in the `text` column in our train set, and the document is the individual sentence that we're trying to classify (per row).

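To make this concrete, here is a minimal pure-Python sketch on a made-up three-sentence corpus. The sentences and the helper names `idf` and `tfidf` are illustrative assumptions, not part of the notebook. It uses the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1`, which is scikit-learn's default, but skips the L2 normalization that `TfidfVectorizer` also applies by default, so the numbers will not match its output exactly. In the first sentence 'the' and 'raven' both appear twice, yet 'raven' scores higher because it occurs in only one document.

```python
import math
from collections import Counter

# Hypothetical toy corpus: each string plays the role of one row of `text`
corpus = [
    "the raven quoth the raven nevermore",
    "the house of usher fell",
    "the creature fled the house",
]
docs = [doc.split() for doc in corpus]
n_docs = len(docs)

# Document frequency: number of documents that contain each word
doc_freq = Counter(word for doc in docs for word in set(doc))

def idf(word):
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
    return math.log((1 + n_docs) / (1 + doc_freq[word])) + 1

def tfidf(doc):
    # Term frequency (raw count within the document) times IDF
    term_freq = Counter(doc)
    return {word: count * idf(word) for word, count in term_freq.items()}

scores = tfidf(docs[0])
top_word = max(scores, key=scores.get)
print(top_word, scores[top_word])
print("the", scores["the"])
```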
Reference: http://www.tfidf.com/

We'll be using Sklearn's [`TfidfVectorizer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html), which is 'equivalent to CountVectorizer followed by TfidfTransformer'

In [None]:
# Instantiating the vectorizer
tfidf = TfidfVectorizer(stop_words=stopwords_list, use_idf=True)

# Training on the train set, then transforming the train set
tfidf_train = tfidf.fit_transform(X_train)
# Transforming the test set
tfidf_test = tfidf.transform(X_test)

In [None]:
tfidf_df = pd.DataFrame(tfidf.idf_, index=tfidf.get_feature_names(),columns=["idf_weights"])

In [None]:
tfidf_df.sort_values(by='idf_weights', ascending=False).head(10)

In [None]:
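# Let's look at a specific example for one row
# Column j of the matrix belongs to the word whose vocabulary_ index is j,
# so sort the vocabulary by index (the dict's key order is not column order)
feature_names = sorted(tfidf.vocabulary_, key=tfidf.vocabulary_.get)
tfidf_test_df = pd.DataFrame(tfidf_test.toarray(), columns=feature_names)

test_doc = tfidf_test_df.iloc[16]
# idxmax returns the word (column label) with the highest TF-IDF value
print(test_doc.idxmax())
print(test_doc[test_doc.idxmax()])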

This tells you which word has the highest TF-IDF value in the 17th document of our test set; in the original run, that word was 'chivalry'.

**Student 7:**

What does this tell you about the word "chivalry" in this document of our test set?

- 


In [None]:
# Using our function to compare the results...
tfidf = TfidfVectorizer(stop_words=stopwords_list, use_idf=True)
nb_tfidf = MultinomialNB()

tfidf_train_preds, tfidf_test_preds = classify_vectorized_text(tfidf, nb_tfidf, X_train, X_test, y_train, y_test)

Compare:

- 


In [None]:
# We can also use our function to try different classifiers
tfidf = TfidfVectorizer(stop_words=stopwords_list, use_idf=True)
rfc = RandomForestClassifier(n_estimators=100)

rfc_train_preds, rfc_test_preds = classify_vectorized_text(tfidf, rfc, X_train, X_test, y_train, y_test)

Compare: 

- 


## Further Review!

![](kmeans.gif)

**Student 8:**

Please describe the steps of a k-means clustering algorithm:

- 


![](pca.gif)

**Student 9:**

Please describe how principal component analysis works:

- 
