Lambda School Data Science

*Unit 4, Sprint 1, Module 3*

---

# Document Classification (Assignment)

This notebook is for you to practice skills during lecture.

Today's guided module project and assignment will be different. You already know how to do classification. You already know how to extract features from documents. So? That means you're ready to combine and practice those skills in a Kaggle competition. We will open with a five-minute sprint explaining the competition, and then give you 25 minutes to work. After those 25 minutes are up, I will give a 5-minute demo of an NLP technique that will help you with document classification (*and **maybe** the competition*).

Today's all about having fun and practicing your skills.

## Sections
* <a href="#p1">Part 1</a>: Text Feature Extraction & Classification Pipelines
* <a href="#p2">Part 2</a>: Latent Semantic Indexing
* <a href="#p3">Part 3</a>: Word Embeddings with Spacy
* <a href="#p4">Part 4</a>: Post Lecture Assignment

# Text Feature Extraction & Classification Pipelines (Learn)
<a id="p1"></a>

## Follow Along 

What you should be doing now:
1. Join the Kaggle Competition
2. Download the data
3. Train a model (try using the pipe method I just demoed)

### Load Competition Data

In [1]:
import pandas as pd

# You may need to change the path
train = pd.read_csv('./whiskey-reviews-dspt4/train.csv')
test = pd.read_csv('./whiskey-reviews-dspt4/test.csv')
print(train.shape, test.shape)

(4087, 3) (1022, 2)


In [2]:
train.head()

Unnamed: 0,id,description,ratingCategory
0,1321,"\nSometimes, when whisky is batched, a few lef...",1
1,3861,\nAn uncommon exclusive bottling of a 6 year o...,0
2,655,\nThis release is a port version of Amrut’s In...,1
3,555,\nThis 41 year old single cask was aged in a s...,1
4,1965,"\nQuite herbal on the nose, with aromas of dri...",1


In [3]:
# Distribution of ratingCategory: 0 (Excellent), 1 (Good), 2 (Poor)
train.ratingCategory.value_counts()

1    2881
0    1141
2      65
Name: ratingCategory, dtype: int64

In [4]:
# Read a few reviews from the "Excellent" category
pd.set_option('display.max_colwidth', 0)
train[train.ratingCategory == 0].sample(3)

Unnamed: 0,id,description,ratingCategory
3101,3835,"\nThe bourbon cask-matured partner to the Sherry Matured bottling (see above), this expression similarly comprises a vatting of whisky distilled in 1973, 1977, 1988, 1991, 2002, and 2006. Both Contrasting Casks are offered in non-chill filtered format. Apple pie, with buttery caramel, milk chocolate, and coconut on the nose. Zesty spices on the early palate, then custard, contrasting lemon, and a note of char. Spicy fruits, black pepper, and more char in the finish. £100",0
1779,4548,"\nMedium-bodied but viscous in texture. Clearly aged in a bourbon cask-there's plenty of honeyed malt and vanilla throughout. Hints of bourbon even peek through occasionally, along with some subtle peat. Soft melon mid-palate yields to dried spice, sea salt, bitter chocolate and herbal notes on the finish. Fairly dry for a 12 year old-particularly on the finish. Best served as an aperitif.",0
2860,4279,"\nReminiscent of old-style sherry malts, with round, mellow, oaky notes and a dusty, waxy, farmy feel that evolves into dark fruit and striking gunpowder on the nose. Odd and enticing. The oily palate shows taffy, blistering white pepper, fresh pine sawdust, and ripe fruits, with chocolatey suggestions of Tootsie Rolls. Starts big, then moves to a quick, spicy-hot resolution. People who think all-wheat whisky is mild or weak really need to try this luscious beauty. $35 CAD",0


In [5]:
# Read a few reviews from the "Poor" category
train[train.ratingCategory == 2].sample(3)

Unnamed: 0,id,description,ratingCategory
3196,5104,"\nAn unaged whiskey from Carroll County, Iowa, with rye grain and sugar mash. The nose is all kinds of barnyard funk: hay, horse, and manure. Underneath is a bite of sugar and vanilla. On the palate it’s less funky, with sugar, strawberry, and a grappa-like note. The rye spice emerges mid-palate, but it’s fleeting and leads to an edgy and fractured finish. Time in oak might help, but there are issues here that wood won’t resolve.",2
999,5076,"\nQuite pale in color. Very youthful and naked, with damp peat, leafy smoke, charred oak, and black licorice, pears in honey and vanilla-tinged barley. Quite an eye-opener for a non-Islay whisky. It’s a little green and ornery. Certainly an entertaining whisky, but a few more years in the barrel would round this whisky out, meld the flavors together, and add depth. \r\n",2
1667,5081,"\nAged for “at least” one month, this bourbon is a collaboration with the band Fierce Dead Rabbit. Better Days is pale gold and noticeably cloudy. On the nose it’s paste, yeasty bread dough, and wet pavement. On the palate it is all over the place with raw oak, cinnamon, almond, and black pepper. There's no balance and no integration. The finish is short, hot, and dry. As whiskey ages, it goes through odd, awkward phases, and that's where this one is.",2


### Split the Training Set into Train/Validation

In [6]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(train['description'], 
                                                    train['ratingCategory'], 
                                                    test_size=0.2, 
                                                    stratify=train['ratingCategory'],
                                                    random_state=42)

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(3269,) (818,) (3269,) (818,)


### Define Pipeline Components

In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

vect = TfidfVectorizer(stop_words='english', ngram_range=(1,2))
clf = RandomForestClassifier()

pipe = Pipeline([('vect', vect), ('clf', clf)])

### Define Your Search Space
You're looking for both the best hyperparameters of your vectorizer and your classification model. 
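
As a minimal sketch of how such a search space is keyed: each entry in the parameter grid is named `<step name>__<parameter>`, using the step names passed to `Pipeline`. The toy documents, labels, and tiny grid below are made up for illustration (they are not competition data) and are only meant to run quickly.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Made-up toy reviews (hypothetical, NOT competition data): 0 = liked, 1 = disliked
toy_docs = [
    "rich honey and vanilla finish",
    "smooth oak with sweet caramel",
    "lovely sherry fruit and spice",
    "balanced malt with gentle smoke",
    "harsh solvent nose and bitter finish",
    "thin watery palate with raw grain",
    "hot rough finish with no balance",
    "cloudy and yeasty with wet cardboard",
]
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]

toy_pipe = Pipeline([
    ("vect", TfidfVectorizer(stop_words="english")),
    ("clf", RandomForestClassifier(random_state=42)),
])

# Keys follow the pattern <step name>__<parameter name>
toy_parameters = {
    "vect__ngram_range": [(1, 1), (1, 2)],
    "clf__n_estimators": [10, 25],
}

toy_search = GridSearchCV(toy_pipe, toy_parameters, cv=2)
toy_search.fit(toy_docs, toy_labels)

# The best parameters come back under the same step-prefixed names
print(sorted(toy_search.best_params_))

# Any name listed by get_params() is a valid key for the grid
print("vect__max_df" in toy_pipe.get_params())
```

Calling `get_params().keys()` on your own pipeline is a quick way to check which names the search will accept.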

In [8]:
# from sklearn.model_selection import GridSearchCV

# parameters = {
#     'vect__max_df': (0.75, 1.0),
#     'vect__min_df': (2, 5, 10),
#     'vect__max_features': (5000, 11000),
#     'clf__n_estimators': (100, 500),
#     'clf__max_depth': (5, 10, 20, None)
# }

# grid_search = GridSearchCV(pipe, parameters, cv=5, n_jobs=-1, verbose=8)
# grid_search.fit(X_train, y_train)

In [9]:
# from sklearn.metrics import accuracy_score
# grid_search.best_params_

In [10]:
# grid_search.best_score_

### Make a Submission File
*Note:* In a typical Kaggle competition, you are only allowed two submissions a day, so you only submit if you feel you cannot achieve higher test accuracy. For this competition the max daily submissions are capped at **20**. Submit for each demo and for your assignment. 
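
One possible way to keep submission files from overwriting each other is to put a timestamp in the filename. The sketch below assumes a hypothetical helper, `save_submission`, and uses made-up ids and predictions written to a temporary directory; swap in your own predictions and output path.

```python
import tempfile
from datetime import datetime
from pathlib import Path

import pandas as pd


def save_submission(ids, preds, out_dir, stamp=None):
    """Write an id/ratingCategory CSV with a timestamped name (hypothetical helper)."""
    stamp = stamp or datetime.now().strftime("%Y%m%d-%H%M%S")
    submission = pd.DataFrame({"id": ids, "ratingCategory": preds})
    # Cast to integer so the category column is not written as 1.0, 0.0, ...
    submission["ratingCategory"] = submission["ratingCategory"].astype("int64")
    path = Path(out_dir) / f"submission-{stamp}.csv"
    submission.to_csv(path, index=False)
    return path


# Made-up ids and float predictions, for illustration only
demo_dir = tempfile.mkdtemp()
demo_path = save_submission([101, 102, 103], [1.0, 0.0, 1.0], demo_dir)
print(demo_path.name)
print(pd.read_csv(demo_path))
```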

In [11]:
# # Predictions on test sample
# pred = grid_search.predict(test['description'])

In [12]:
# submission = pd.DataFrame({'id': test['id'], 'ratingCategory':pred})
# submission['ratingCategory'] = submission['ratingCategory'].astype('int64')

In [13]:
# # Make Sure the Category is an Integer
# submission.head()

In [14]:
# subNumber = 0

In [15]:
# # Save your Submission File
# # Best to Use an Integer or Timestamp for different versions of your model

# submission.to_csv(f'./whiskey-reviews-dspt4/submission{subNumber}.csv', index=False)
# subNumber += 1

## Challenge

You're trying to achieve a minimum of 70% Accuracy on your model.
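
To put the 70% target in context, the class counts printed earlier imply that always predicting the most common category already scores close to it on the training data. A small worked calculation follows, assuming the training distribution roughly carries over to the Kaggle test set (which this notebook does not confirm):

```python
# Class counts copied from the value_counts() output earlier in this notebook
class_counts = {1: 2881, 0: 1141, 2: 65}

total = sum(class_counts.values())
majority_class = max(class_counts, key=class_counts.get)

# Accuracy of a model that always predicts the majority class
baseline_accuracy = class_counts[majority_class] / total

print(f"total reviews: {total}")
print(f"majority class: {majority_class}")
print(f"baseline accuracy: {baseline_accuracy:.3f}")
```

A model that barely clears 70% may therefore be doing little more than predicting category 1, so it is worth comparing your validation accuracy against this baseline.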

# Latent Semantic Indexing (Learn)
<a id="p2"></a>

## Follow Along
1. Join the Kaggle Competition
2. Download the data
3. Train a model & try: 
    - Creating a Text Extraction & Classification Pipeline
    - Tune the pipeline with a `GridSearchCV` or `RandomizedSearchCV`
    - Add some Latent Semantic Indexing (LSI) into your pipeline. *Note:* You can grid search a nested pipeline, but you have to chain the step names with double underscores, e.g. `lsi__svd__n_components`
4. Make a submission to Kaggle 
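
The nested-pipeline note in step 3 can be sketched as follows. Here the vectorizer and the SVD are wrapped in an inner pipeline named `lsi`, so the SVD parameter is reached as `lsi__svd__n_components`. The step names, toy documents, and labels are illustrative assumptions. In the flat pipeline defined later in this section, where `lsi` is the `TruncatedSVD` step itself, the key is simply `lsi__n_components`.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Inner pipeline: TF-IDF followed by truncated SVD (together, the "LSI" step)
toy_lsi = Pipeline([
    ("vect", TfidfVectorizer(stop_words="english")),
    ("svd", TruncatedSVD(n_components=2, random_state=42)),
])

# Outer pipeline: the nested LSI step feeds a classifier
toy_nested = Pipeline([
    ("lsi", toy_lsi),
    ("clf", RandomForestClassifier(n_estimators=10, random_state=42)),
])

# Nested parameter names chain every level with double underscores
print("lsi__svd__n_components" in toy_nested.get_params())

# Made-up toy reviews (hypothetical, NOT competition data)
toy_nested_docs = [
    "rich honey and vanilla finish",
    "smooth oak with sweet caramel",
    "lovely sherry fruit and spice",
    "balanced malt with gentle smoke",
    "harsh solvent nose and bitter finish",
    "thin watery palate with raw grain",
    "hot rough finish with no balance",
    "cloudy and yeasty with wet cardboard",
]
toy_nested_labels = [0, 0, 0, 0, 1, 1, 1, 1]

# The same name works with set_params and as a key in a search grid
toy_nested.set_params(lsi__svd__n_components=3)
toy_nested.fit(toy_nested_docs, toy_nested_labels)

# Each document is now represented by 3 LSI components
toy_reduced = toy_nested.named_steps["lsi"].transform(toy_nested_docs)
print(toy_reduced.shape)
```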


### Define Pipeline Components

In [16]:
import scipy.stats as stats
from sklearn.model_selection import RandomizedSearchCV
from sklearn.decomposition import TruncatedSVD

vect = TfidfVectorizer(stop_words='english',
                       ngram_range=(1, 2),
                       max_df=1.0,
                       max_features=11000,
                       min_df=2)
lsi = TruncatedSVD(algorithm='randomized',
                   n_iter=10)
clf = RandomForestClassifier(max_depth=None,
                             n_estimators=100,
                             random_state=42)

pipe = Pipeline([('vect', vect), ('lsi', lsi), ('clf', clf)])

### Define Your Search Space
You're looking for the best hyperparameters of your vectorizer, your LSI step, and your classification model. 

In [17]:
# from sklearn.model_selection import GridSearchCV

# parameters = {
#     'lsi__n_components': [10, 100, 250],
# }

# grid_search = GridSearchCV(pipe, parameters, cv=5, n_jobs=4, verbose=6)
# grid_search.fit(X_train, y_train)

In [18]:
# lsi.get_params().keys()

In [19]:
# from sklearn.metrics import accuracy_score
# grid_search.best_params_

In [20]:
# grid_search.best_score_

### Make a Submission File

In [21]:
# # Predictions on test sample
# pred = grid_search.predict(test['description'])

In [22]:
# submission = pd.DataFrame({'id': test['id'], 'ratingCategory':pred})
# submission['ratingCategory'] = submission['ratingCategory'].astype('int64')

In [23]:
# Make Sure the Category is an Integer
# submission.head()

In [24]:
# # Save your Submission File
# # Best to Use an Integer or Timestamp for different versions of your model

# submission.to_csv(f'./whiskey-reviews-dspt4/submission{subNumber}.csv', index=False)
# subNumber += 1

## Challenge

Continue to apply Latent Semantic Indexing (LSI) to various datasets. 

# Word Embeddings with Spacy (Learn)
<a id="p3"></a>

## Follow Along

In [25]:
# Apply to your Dataset
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import GradientBoostingClassifier
from scipy.stats import randint

param_dist = {
    'max_depth' : randint(3,10),
    'min_samples_leaf': randint(2,15)
}
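
The `param_dist` above is defined but never passed to a search in this notebook. A minimal sketch of how it could presumably be used with `RandomizedSearchCV` and `GradientBoostingClassifier` follows; the synthetic features from `make_classification`, the `demo_` names, and the small `n_iter`, `cv`, and `n_estimators` values are assumptions chosen so the example runs quickly.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in features (hypothetical); in the notebook these would be document vectors
X_demo, y_demo = make_classification(n_samples=120, n_features=10, random_state=42)

demo_param_dist = {
    "max_depth": randint(3, 10),         # samples integers 3..9 (upper bound excluded)
    "min_samples_leaf": randint(2, 15),  # samples integers 2..14
}

demo_search = RandomizedSearchCV(
    GradientBoostingClassifier(n_estimators=20, random_state=42),
    param_distributions=demo_param_dist,
    n_iter=3,          # number of random parameter settings to try
    cv=3,
    random_state=42,
)
demo_search.fit(X_demo, y_demo)

print(demo_search.best_params_)
print(round(demo_search.best_score_, 3))
```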

In [26]:
import spacy
nlp = spacy.load("en_core_web_lg")

In [27]:
# Bring over helper function
def get_word_vectors(docs):
    return [nlp(doc).vector for doc in docs]

In [None]:
X_test = get_word_vectors(X_test)
X_train = get_word_vectors(X_train)

In [None]:
from sklearn.neural_network import MLPClassifier

clf = MLPClassifier(solver='lbfgs',
                    max_iter=500,
                    alpha=1e-5,
                    hidden_layer_sizes=(16, 2),
                    random_state=42)
clf.fit(X_train, y_train)

In [None]:
# Evaluate on the held-out validation split (named X_test / y_test above)
from sklearn.metrics import accuracy_score
y_preds = clf.predict(X_test)
accuracy_score(y_test, y_preds)  # signature is (y_true, y_pred)

### Make a Submission File

In [None]:
# Predictions on test sample: embed the descriptions first, as was done for training
pred = clf.predict(get_word_vectors(test['description']))

In [None]:
submission = pd.DataFrame({'id': test['id'], 'ratingCategory':pred})
submission['ratingCategory'] = submission['ratingCategory'].astype('int64')

In [None]:
# Make Sure the Category is an Integer
submission.head()

In [None]:
# Save your Submission File
# Best to Use an Integer or Timestamp for different versions of your model
subNumber=3
submission.to_csv(f'./whiskey-reviews-dspt4/submission{subNumber}.csv', index=False)
subNumber += 1

## Challenge

What you should be doing now:
1. Join the Kaggle Competition
2. Download the data
3. Train a model & try: 
    - Creating a Text Extraction & Classification Pipeline
    - Tune the pipeline with a `GridSearchCV` or `RandomizedSearchCV`
    - Add some Latent Semantic Indexing (LSI) into your pipeline. *Note:* You can grid search a nested pipeline, but you have to chain the step names with double underscores, e.g. `lsi__svd__n_components`
    - Try to extract word embeddings with spaCy and use those embeddings as your features for a classification model.
4. Make a submission to Kaggle 
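
On the word-embedding step: spaCy's `Doc.vector` defaults to the average of the document's token vectors, which is what the `get_word_vectors` helper relies on. The sketch below imitates that averaging with a made-up three-dimensional embedding table, so it runs without downloading `en_core_web_lg`. The table, `toy_doc_vector`, and the whitespace tokenizer are illustrative assumptions, not spaCy's implementation.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings, made up for illustration only
toy_embeddings = {
    "smoky": np.array([1.0, 0.0, 0.0]),
    "sweet": np.array([0.0, 1.0, 0.0]),
    "finish": np.array([0.0, 0.0, 1.0]),
}


def toy_doc_vector(text, embeddings, dim=3):
    """Average the vectors of known tokens; return zeros if none are known."""
    vectors = [embeddings[tok] for tok in text.lower().split() if tok in embeddings]
    if not vectors:
        return np.zeros(dim)
    return np.mean(vectors, axis=0)


toy_texts = ["smoky sweet finish", "sweet sweet", "unknown words"]

# One fixed-length row per document, ready to use as classifier features
toy_features = np.vstack([toy_doc_vector(t, toy_embeddings) for t in toy_texts])
print(toy_features.shape)
print(toy_features)
```

Because every document collapses to a single averaged vector, word order and document length are lost, which is one thing to keep in mind when comparing embeddings against TF-IDF features.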

# Post Lecture Assignment
<a id="p4"></a>

Your primary assignment this afternoon is to achieve a minimum of 70% accuracy on the Kaggle competition. Once you have achieved 70% accuracy, please work on the following: 

1. Research "Sentiment Analysis". Provide answers in markdown to the following questions: 
    - What is "Sentiment Analysis"? 
    - Is Document Classification different from "Sentiment Analysis"? Provide evidence for your response.
    - How do you create labeled sentiment data? Are those labels really sentiment?
    - What are common applications of sentiment analysis?
2. Research why word embeddings worked better for the lecture notebook than on the whiskey competition.
    - This [text classification documentation](https://developers.google.com/machine-learning/guides/text-classification/step-2-5) from Google might be of interest
    - Neural Networks are becoming more popular for document classification. Why is that the case?