Lambda School Data Science

*Unit 4, Sprint 1, Module 3*

---

# Document Classification (Prepare)

Today's guided module project will be different. You already know how to do classification. You ready know how to extract features from documents. So? That means you're ready to combine and practice those skills in a kaggle competition. We we will open with a five minute sprint explaining the competition, and then give you 25 minutes to work. After those twenty five minutes are up, I will give a 5-minute demo an NLP technique that will help you with document classification (*and **maybe** the competition*).

Today's all about having fun and practicing your skills. The competition will begin

## Learning Objectives
* <a href="#p0">Part 0</a>: Kaggle Competition
* <a href="#p1">Part 1</a>: Text Feature Extraction & Classification Pipelines
* <a href="#p2">Part 2</a>: Latent Semantic Indexing
* <a href="#p3">Part 3</a>: Word Embeddings with Spacy

# Text Feature Extraction & Classification Pieplines (Learn)
<a id="p1"></a>

## Overview

Sklearn pipelines allow you to stitch together multiple components of a machine learning process. The idea is that you can pass you raw data and get predictions out of the pipeline. This ability to pass raw input and receive a prediction from a singular class makes pipelines well suited for production, because you can pickle a a pipeline without worry about other data preprocessing steps. 

In [1]:
# Import Statements
from sklearn.pipeline import Pipeline
from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
import pandas as pd


In [2]:
# Train and test whiskey description data
train = pd.read_csv('./data/train.csv')
test = pd.read_csv('./data/test.csv')

train.shape, test.shape

((2586, 3), (288, 2))

In [3]:
# Save a piece of train for validation
train, val = train_test_split(train, test_size=0.05)

train.shape, val.shape

((2456, 3), (130, 3))

In [4]:
train.head()

Unnamed: 0,id,description,category
1284,2096,Diurachs are the inhabitants of the Isle of Ju...,1
2375,3772,A slightly perfumed nose offers up the slightl...,1
2149,3419,Tullibardine 225 Sauternes is finished in Saut...,1
602,1011,Similar to the standard Forty Creek Barrel Sel...,4
225,353,Think of walking in a prairie meadow at a stat...,2


In [5]:
# Define X and y
X = train['description']
y = train['category']

In [18]:
from sklearn.model_selection import RandomizedSearchCV

In [6]:
# Instantiate models and pipelines
svd = TruncatedSVD(algorithm='randomized',
                   n_iter=10)
vect = TfidfVectorizer(stop_words='english')
clf = SGDClassifier()
lsi = Pipeline([('vect', vect), ('svd', svd)])
pipe = Pipeline([('lsi', lsi), ('clf', clf)])

In [20]:
# Define search params
parameters = {
    'lsi__svd__n_components': [10,100, 250],
    'lsi__svd__n_iter': [2, 5, 10, 15],
    'lsi__vect__ngram_range': [(1,1), (1,2), (1,3)],
    'lsi__vect__max_df': [0.95, 0.975, 1.0],
    'clf__max_iter': [1000, 3000, 5000, 7500],
    'clf__loss': ['hinge', 'log', 'modified_huber', 'squared_hinge', 'perceptron'],
    'clf__alpha': [0.0001, 0.0005, 0.001, 0.005]
}

# run the search
rand_search = RandomizedSearchCV(pipe,parameters, cv=5, n_iter=200, n_jobs=-1, verbose=10)
rand_search.fit(X, y)

Fitting 5 folds for each of 200 candidates, totalling 1000 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   5 tasks      | elapsed:   23.9s
[Parallel(n_jobs=-1)]: Done  10 tasks      | elapsed:   29.5s
[Parallel(n_jobs=-1)]: Done  17 tasks      | elapsed:  1.0min
[Parallel(n_jobs=-1)]: Done  24 tasks      | elapsed:  2.9min
[Parallel(n_jobs=-1)]: Done  33 tasks      | elapsed:  3.7min
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:  3.9min
[Parallel(n_jobs=-1)]: Done  53 tasks      | elapsed:  4.7min
[Parallel(n_jobs=-1)]: Done  64 tasks      | elapsed:  5.2min
[Parallel(n_jobs=-1)]: Done  77 tasks      | elapsed:  6.6min
[Parallel(n_jobs=-1)]: Done  90 tasks      | elapsed:  7.2min
[Parallel(n_jobs=-1)]: Done 105 tasks      | elapsed:  7.9min
[Parallel(n_jobs=-1)]: Done 120 tasks      | elapsed: 10.2min
[Parallel(n_jobs=-1)]: Done 137 tasks      | elapsed: 11.8min
[Parallel(n_jobs=-1)]: Done 154 tasks      | elapsed: 13.7min
[Parallel(n_jobs=-1)]: Done 173 tasks      | elapsed: 15

RandomizedSearchCV(cv=5, error_score='raise-deprecating',
          estimator=Pipeline(memory=None,
     steps=[('lsi', Pipeline(memory=None,
     steps=[('vect', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm=...m_state=None, shuffle=True, tol=None,
       validation_fraction=0.1, verbose=0, warm_start=False))]),
          fit_params=None, iid='warn', n_iter=200, n_jobs=-1,
          param_distributions={'lsi__svd__n_components': [10, 100, 250], 'lsi__svd__n_iter': [2, 5, 10, 15], 'lsi__vect__ngram_range': [(1, 1), (1, 2), (1, 3)], 'lsi__vect__max_df': [0.95, 0.975, 1.0], 'clf__max_iter': [1000, 3000, 5000, 7500], 'clf__loss': ['hinge', 'log', 'modified_huber', 'squared_hinge', 'perceptron'], 'clf__alpha': [0.0001, 0.0005, 0.001, 0.005]},
          pre_dispatch='2*n_jobs', random_state=None, r

In [22]:
# Get best score
rand_search.best_score_

0.9385179153094463

In [9]:
# Define validation X and y
X_val = val['description']
y_val = val['category']

In [10]:
# Get val score
grid_search.score(X_val, y_val)

0.9615384615384616

### Make a Submission File
*Note:* You are only allowed two submissions a day. Only submit if you feel you cannot achieve higher test accuracy. 

In [11]:
# Predictions on test sample
pred = grid_search.predict(test['description'])

In [12]:
submission = pd.DataFrame({'id': test['id'], 'category':pred})
submission['category'] = submission['category'].astype('int64')

In [13]:
# Make Sure the Category is an Integer
submission.head()

Unnamed: 0,id,category
0,955,2
1,3532,3
2,1390,4
3,1024,1
4,1902,1


In [14]:
submission.dtypes

id          int64
category    int64
dtype: object

In [15]:
# Save your Submission File
# Best to Use an Integer or Timestamp for different versions of your model
submission.to_csv('./data/submission2.csv', index=False)

## Challenge

Continue to apply Latent Semantic Indexing (LSI) to various datasets. 

# Word Embeddings with Spacy (Learn)
<a id="p3"></a>

### STILL MESSING AROUND BELOW HERE

In [16]:
import spacy
nlp = spacy.load("en_core_web_lg")

In [30]:
def get_word_vectors(docs):
    return [nlp(doc).vector for doc in docs]

In [39]:
X = get_word_vectors(train['description'])
y = train['category'] 


In [40]:
model = xgb.XGBClassifier()

In [41]:
model.fit(X, y)

IndexError: tuple index out of range

In [None]:
vect = TfidfVectorizer(tokenizer=get_word_vectors())
clf = SGDClassifier()
lsi = Pipeline([('vect', vect), ('svd', svd)])
pipe = Pipeline([('lsi', lsi), ('clf', clf)])

In [34]:
parameters = {
    'lsi__svd__n_components': [10,100, 250],
    'lsi__vect__max_df': [0.95, 0.975, 1.0]
}

grid_search = GridSearchCV(pipe,parameters, cv=5, n_jobs=-1, verbose=1)
grid_search.fit(X, y)

Fitting 5 folds for each of 9 candidates, totalling 45 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.


AttributeError: 'numpy.ndarray' object has no attribute 'lower'

## Follow Along

In [59]:
X = get_word_vectors(train['description'])
y = train['category']

In [60]:
rfc.fit(X, y)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [61]:
X_val = get_word_vectors(val['description'])
y_val = val['category']

#rfc.predict(X_val)

In [62]:
rfc.score(X_val, y_val)

0.6923076923076923

## Challenge

What you should be doing now:
1. Join the Kaggle Competition
2. Download the data
3. Train a model & try: 
    - Creating a Text Extraction & Classification Pipeline
    - Tune the pipeline with a `GridSearchCV` or `RandomizedSearchCV`
    - Add some Latent Semantic Indexing (lsi) into your pipeline. *Note:* You can grid search a nested pipeline, but you have to use double underscores ie `lsi__svd__n_components`
    - Try to extract word embeddings with Spacy and use those embeddings as your features for a classification model.
4. Make a submission to Kaggle 

# Review

To review this module: 
* Continue working on the Kaggle comeptition
* Find another text classification task to work on