Lambda School Data Science

*Unit 4, Sprint 1, Module 3*

---

# Document Classification (Assignment)

This notebook is for you to practice skills during lecture.

Today's guided module project and assignment will be different. You already know how to do classification. You already know how to extract features from documents. So? That means you're ready to combine and practice those skills in a Kaggle competition. We will open with a five-minute sprint explaining the competition, and then give you 25 minutes to work. After those 25 minutes are up, I will give a five-minute demo of an NLP technique that will help you with document classification (*and **maybe** the competition*).

Today's all about having fun and practicing your skills.

## Sections
* <a href="#p1">Part 1</a>: Text Feature Extraction & Classification Pipelines
* <a href="#p2">Part 2</a>: Latent Semantic Indexing
* <a href="#p3">Part 3</a>: Word Embeddings with spaCy
* <a href="#p4">Part 4</a>: Post Lecture Assignment

# Text Feature Extraction & Classification Pipelines (Learn)
<a id="p1"></a>

## Follow Along 

What you should be doing now:
1. Join the Kaggle Competition
2. Download the data
3. Train a model (try using the pipe method I just demoed)

### Load Competition Data

In [19]:
import pandas as pd

# You may need to change the path
train = pd.read_csv('./train.csv')
test = pd.read_csv('./test.csv')

In [2]:
pd.set_option('max_colwidth', 0)
train.head(2)

Unnamed: 0,id,description,ratingCategory
0,1321,"\nSometimes, when whisky is batched, a few leftover barrels are returned to the warehouse. Canadian Club recently pulled and vatted several of these from the 1970s. Acetone, Granny Smith apples, and fresh-cut white cedar showcase this long age. Complex and spicy, yet reserved, this dram is ripe with strawberries, canned pears, cloves, pepper, and faint flowers, then slightly pulling oak tannins. Distinct, elegant, and remarkably vibrant, this ancient Canadian Club is anything but tired. (Australia only) A$133",1
1,3861,"\nAn uncommon exclusive bottling of a 6 year old cask strength malt. Light gold in color, the nose is vegetal, more peat bog than peat smoke, with an undercurrent of pastry cream and rose. It’s an odd combination of aromas. The entry is flavorful and inviting with smoked pineapple, clove, and rose. Peak smoke arrives in full force in the mid-palate, which drops the sweet and becomes spicy. The finish is mostly smoke, but with a pleasant minty coolness. (Wyoming only)",0


In [3]:
train.shape

(4087, 3)

In [4]:
test.shape

(1022, 2)

In [5]:
test.head(2)

Unnamed: 0,id,description
0,3461,"\nStyle: Speyside single malt scotch Color: Walnut Aroma: Richly sherried and thick, with notes of nuts and toffee. Wood resins contribute spice and variety. Fruitcake at Christmas. Palate: Thick, chewy in texture, and quite ripe. Again the fruitcake. Very deep and mature with some underlying maltiness. Dry, spicy, oak notes fight off all that sherry and add balance and complexity. Long, soothing finish. \r\n"
1,2604,"\nVery bright and lively, with a nice balance of flavors. Zesty fruit (lemon, peach, ripe pineapple, golden raisin) on a bed of layered sweetness (creamy vanilla, light honey, lightly toasted marshmallow, and a hint of coconut). Gently dry, delicately spicy, dried citrus finish. Light enough and with enough zing to enjoy before dinner, but it should stand up well enough after dinner, too. This is a nice whisky, but it shows a lighter, more elegant side of Glenrothes. It doesn’t express the rich, opulent notes often shown in bottlings like the 1972 Vintage, for example."


### Define Pipeline Components

In [6]:
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer

In [7]:
vect = TfidfVectorizer(stop_words='english', ngram_range=(1,2))
rfc = RandomForestClassifier()

pipe = Pipeline([
    ('vect', vect), 
    ('rfc', rfc)
])

### Define Your Search Space
You're looking for both the best hyperparameters of your vectorizer and your classification model. 
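
Before launching a search it can help to count how many fits it implies. The sketch below is only an illustration: it copies the second `parameters` dict from the cell that follows and uses scikit-learn's `ParameterGrid` to expand it. The names `n_candidates`, `cv_folds`, and `total_fits` are introduced here for the example. Each key follows the `<step name>__<parameter>` convention, so `vect__min_df` sets `min_df` on the pipeline step named `vect`.

```python
from sklearn.model_selection import ParameterGrid

# Same search space as the second `parameters` dict in this notebook.
parameters = {
    'vect__min_df': (0.01, 0.02),
    'vect__max_df': (0.5, 0.6),
    'vect__max_features': (500, None, 1000),
    'rfc__max_depth': (None, 5),
    'rfc__n_estimators': (250, 500),
}

# ParameterGrid enumerates every combination of the listed values.
grid = ParameterGrid(parameters)
n_candidates = len(grid)

# Each candidate is fitted once per cross-validation fold.
cv_folds = 3
total_fits = n_candidates * cv_folds

print(n_candidates, total_fits)  # 48 144
```

The two numbers match the "48 candidates, totalling 144 fits" line that `GridSearchCV` prints when it runs with `cv=3`.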

In [8]:
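# First attempt at a search space. It is overwritten by the dict below,
# so it has no effect on the search.
parameters = {
    'vect__max_df': (0.6, 0.9),
    'vect__min_df': (.02, .05),
    'vect__max_features': (5000, None),
    'rfc__n_estimators': (100, 250),
    'rfc__max_depth': (15, None)
}

# Search space actually used: 2 * 2 * 3 * 2 * 2 = 48 candidates
parameters = {
    'vect__min_df': (0.01, 0.02),
    'vect__max_df': (0.5, 0.6),
    'vect__max_features': (500, None, 1000),
    'rfc__max_depth': (None, 5),
    'rfc__n_estimators': (250, 500)
}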

grid_search = GridSearchCV(pipe, parameters, cv=3, n_jobs=4, verbose=1)
grid_search.fit(train['description'], train['ratingCategory'])

Fitting 3 folds for each of 48 candidates, totalling 144 fits


[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:   50.8s
[Parallel(n_jobs=4)]: Done 144 out of 144 | elapsed:  2.1min finished


GridSearchCV(cv=3, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('vect',
                                        TfidfVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.float64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                                                        max_df=1.0,
                                                        max_features=None,
                                                        min_df=1,
                                                        ngram_range=(1, 2),
                                                        no

In [9]:
grid_search.best_params_

{'rfc__max_depth': None,
 'rfc__n_estimators': 500,
 'vect__max_df': 0.6,
 'vect__max_features': 1000,
 'vect__min_df': 0.01}

In [10]:
grid_search.best_score_

0.7411322379551312

In [11]:
# from sklearn.metrics import accuracy_score

# # Evaluate on test data
# y_test = grid_search.predict(test['description'])
# # accuracy_score((test['description']).target, y_test)

### Make a Submission File
*Note:* In a typical Kaggle competition you are allowed only two submissions a day, so you submit only when you feel you cannot achieve higher test accuracy. For this competition, daily submissions are capped at **20**. Submit once for each demo and once for your assignment.

In [12]:
# Predictions on test sample
pred = grid_search.predict(test['description'])

In [13]:
submission = pd.DataFrame({'id': test['id'], 'ratingCategory':pred})
submission['ratingCategory'] = submission['ratingCategory'].astype('int64')

In [14]:
# Make Sure the Category is an Integer
submission.head()

Unnamed: 0,id,ratingCategory
0,3461,1
1,2604,1
2,3341,1
3,3764,1
4,2306,1


In [15]:
subNumber = 0

In [16]:
# Save your Submission File
# Best to Use an Integer or Timestamp for different versions of your model
import os

# to_csv does not create missing directories, so make sure the folder exists first
os.makedirs('./dspt4-which-whiskey', exist_ok=True)
submission.to_csv(f'./dspt4-which-whiskey/submission{subNumber}.csv', index=False)
subNumber += 1

## Challenge

You're trying to achieve a minimum of 70% Accuracy on your model.

# Latent Semantic Indexing (Learn)
<a id="p2"></a>

## Follow Along
1. Join the Kaggle Competition
2. Download the data
3. Train a model & try: 
    - Creating a Text Extraction & Classification Pipeline
    - Tune the pipeline with a `GridSearchCV` or `RandomizedSearchCV`
    - Add some Latent Semantic Indexing (LSI) into your pipeline. *Note:* You can grid search a nested pipeline, but you have to use double underscores for each level of nesting, e.g. `lsi__svd__n_components`
4. Make a submission to Kaggle 
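
The double-underscore rule for nested pipelines can be sketched as follows. This is an illustrative layout rather than the notebook's own pipeline: the step names `lsi`, `vect`, `svd`, and `clf` and the value ranges are assumptions chosen to match the `lsi__svd__n_components` example in the note.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Inner pipeline: TF-IDF followed by truncated SVD (the LSI step).
lsi = Pipeline([
    ('vect', TfidfVectorizer(stop_words='english')),
    ('svd', TruncatedSVD(n_components=100)),
])

# Outer pipeline: LSI features feed a classifier.
pipe = Pipeline([
    ('lsi', lsi),
    ('clf', RandomForestClassifier()),
])

# Each level of nesting adds one `name__` prefix to the parameter key.
params = {
    'lsi__vect__max_df': (0.6, 0.95),
    'lsi__svd__n_components': (100, 250),
    'clf__n_estimators': (50, 100),
}

# Every search-space key must be a parameter name the pipeline recognizes.
valid_names = pipe.get_params().keys()
all_valid = all(name in valid_names for name in params)
print(all_valid)  # True
```

Checking keys against `pipe.get_params()` before calling `GridSearchCV` catches misspelled step names immediately instead of partway through a long search.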


### Define Pipeline Components

In [None]:
params = {
    'vect__max_df':(0.6, 0.95),
    'svd__n_components': (100, 250),
    'clf__n_estimators': (50, 100)
}

In [None]:
import scipy.stats as stats
from sklearn.model_selection import RandomizedSearchCV
from sklearn.decomposition import TruncatedSVD
import xgboost as xgb
# Create Pipeline Components

vect = TfidfVectorizer(stop_words='english', ngram_range=(1,2))
rfc = RandomForestClassifier()

svd = TruncatedSVD(n_components=100, # Just here for demo. 
                   algorithm='randomized',
                   n_iter=10)
# Instantiate vectorizer object
tfidf = TfidfVectorizer(stop_words = 'english',
                         #tokenizer = tokenize,
                         ngram_range = (1,2),
                         min_df = 0.25, 
                         max_df = 0.8,
                         max_features = 5000
                       )


In [None]:
pipe = Pipeline([
    ('vect', vect),      # TF-IDF Vectorizer
    ('svd', svd),        # Truncated SVD Dimensionality Reduction
    ('clf', rfc),         # RandomForest Classifier
])

### Define Your Search Space
You're looking for both the best hyperparameters of your vectorizer and your classification model. 

In [None]:
# parameters = {
#     'lsi__svd__n_components': [10,100,250],
#     'vect__max_df': (0.75, 1.0),
#     'clf__max_depth':(5,10,15,20)
# }

grid_search = GridSearchCV(pipe,params, cv=5, n_jobs=4, verbose=1)
grid_search.fit(train['description'], train['ratingCategory'])

### Make a Submission File

In [None]:
# Predictions on test sample
pred = grid_search.predict(test['description'])

In [None]:
submission = pd.DataFrame({'id': test['id'], 'ratingCategory':pred})
submission['ratingCategory'] = submission['ratingCategory'].astype('int64')

In [None]:
# Make Sure the Category is an Integer
submission.head()

In [None]:
# Save your Submission File
# Best to Use an Integer or Timestamp for different versions of your model

submission.to_csv(f'./data/submission{subNumber}.csv', index=False)
subNumber += 1

## Challenge

Continue to apply Latent Semantic Indexing (LSI) to various datasets. 

# Word Embeddings with spaCy (Learn)
<a id="p3"></a>

## Follow Along

In [17]:
# Apply to your Dataset

import sys
!{sys.executable} -m pip install xgboost
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import GradientBoostingClassifier

from scipy.stats import randint
from xgboost import XGBClassifier

# param_dist = {
    
#     'max_depth' : (1,150,300),
#     'min_samples_leaf': (1, 0),
#     'n_estimators' : (10,200,400),
#     'bootstrap': (True, False),
#     'max_features': ('auto', 'sqrt'),
#     'min_samples_leaf': [1, 2, 4],
#     'min_samples_split': [2, 5, 10],
#}
param_dist = {
    'eta' : (0.1, 0.3),
    'min_child_weight' : (1,2,4),
    'max_depth' : (1,4,7,9),
    'alpha' : (0,1,3),

}
params = {
        'min_child_weight': [1, 5, 10],
        'gamma': [0.5, 1, 1.5, 2, 5],
        'subsample': [0.6, 0.8, 1.0],
        'colsample_bytree': [0.6, 0.8, 1.0],
        'max_depth': [3, 4, 5]
        }
# param_dist = {
#     'bootstrap': [True, False],
#     'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, None],
#     'max_features': ['auto', 'sqrt'],
#     'min_samples_leaf': [1, 2, 4],
#     'min_samples_split': [2, 5, 10],
#     'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000
# }

Collecting xgboost
  Downloading xgboost-1.0.2-py3-none-manylinux1_x86_64.whl (109.7 MB)
Installing collected packages: xgboost
Successfully installed xgboost-1.0.2


In [20]:
# Continue Word Embedding Work Here
# Requires spaCy and the large English model, e.g.:
#   pip install spacy
#   python -m spacy download en_core_web_lg
import spacy
nlp = spacy.load("en_core_web_lg")

In [62]:
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=3, shuffle = True, random_state = 1001)

In [49]:
def get_word_vectors(docs):
    return [nlp(doc).vector for doc in docs]

In [50]:
X = get_word_vectors(train['description'])
len(X) == len(train['description'])

True

In [71]:
X_df=pd.DataFrame(X)

In [72]:
X_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,290,291,292,293,294,295,296,297,298,299
0,-0.046755,0.217486,-0.139249,-0.086381,0.108063,0.082181,-0.069926,-0.078897,-0.046305,1.675343,...,-0.163038,0.011581,-0.037443,0.01424,-0.143216,0.018097,0.000748,-0.209037,-0.072346,0.042847
1,-0.019244,0.206962,-0.056247,-0.029087,0.079248,-0.001398,-0.02488,-0.057035,-0.025045,1.641834,...,-0.161042,0.091476,-0.045018,-0.110397,-0.12778,0.029475,0.054958,-0.138565,-0.067064,0.065419
2,0.046398,0.189816,-0.111451,-0.031625,0.081704,-0.025362,-0.026995,-0.062688,-0.021074,1.624273,...,-0.109123,0.022506,-0.044898,-0.07754,-0.114823,0.008062,0.054627,-0.146851,-0.135264,0.072096
3,-0.064541,0.224933,-0.091545,-0.078598,0.032475,0.032356,-0.103998,-0.050203,-0.022012,1.528916,...,-0.11006,0.047583,-0.052992,-0.06531,-0.171562,0.033367,0.090011,-0.240344,-0.095537,0.11319
4,-0.092036,0.261761,-0.18837,-0.070761,0.026548,0.098266,-0.13369,-0.031325,-0.051093,1.566256,...,-0.128262,0.054398,-0.086081,-0.126515,-0.222649,0.035877,0.11067,-0.330467,-0.044748,0.109515


In [58]:
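# `silent` and `nthread` are deprecated in recent xgboost releases;
# `verbosity` and `n_jobs` are their replacements.
xgb = XGBClassifier(learning_rate=0.02, n_estimators=600, objective='binary:logistic',
                    verbosity=0, n_jobs=1)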

In [76]:
grid_search = GridSearchCV(xgb,params, cv=5, n_jobs=4, verbose=1)
#GridSearchCV(estimator=xgb, param_grid=params, scoring='roc_auc', n_jobs=4, cv=skf.split(X_df,train['ratingCategory']), verbose=3 )
#


In [None]:
grid_search.fit(X_df, train['ratingCategory'])

Fitting 5 folds for each of 405 candidates, totalling 2025 fits


[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed: 19.5min
[Parallel(n_jobs=4)]: Done 192 tasks      | elapsed: 607.5min
[Parallel(n_jobs=4)]: Done 442 tasks      | elapsed: 744.1min
[Parallel(n_jobs=4)]: Done 792 tasks      | elapsed: 991.8min
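
An exhaustive grid over this `params` dict is expensive: 405 candidates times 5 folds is 2025 fits. A randomized search instead samples a fixed number of candidates. The sketch below only counts fits and does not train anything; the budget `n_iter = 20` and `random_state=42` are arbitrary values chosen for illustration.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Same search space as the `params` dict used for the XGBoost grid search.
params = {
    'min_child_weight': [1, 5, 10],
    'gamma': [0.5, 1, 1.5, 2, 5],
    'subsample': [0.6, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.8, 1.0],
    'max_depth': [3, 4, 5],
}

cv_folds = 5

# Exhaustive search: every combination, once per fold.
full_fits = len(ParameterGrid(params)) * cv_folds

# Randomized search: only `n_iter` sampled combinations, once per fold.
n_iter = 20  # hypothetical budget
sampled = list(ParameterSampler(params, n_iter=n_iter, random_state=42))
sampled_fits = len(sampled) * cv_folds

print(full_fits, sampled_fits)  # 2025 100
```

`RandomizedSearchCV(xgb, params, n_iter=20, cv=5, n_jobs=4, random_state=42)` would run the same sampling with actual model fits. It trades a guarantee of finding the best grid point for a much smaller, predictable runtime.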


In [None]:
grid_search.best_params_

In [98]:
X_test = get_word_vectors(test['description'])

In [67]:
# X is a plain Python list of vectors, so X.shape raises AttributeError;
# inspect its dimensions with len() instead
len(X), len(X[0])

In [None]:
grid_search.best_score_

In [85]:
rfc.fit(X, train['ratingCategory'])

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

In [86]:
test.head(2)

Unnamed: 0,id,description
0,3461,"\nStyle: Speyside single malt scotch Color: Walnut Aroma: Richly sherried and thick, with notes of nuts and toffee. Wood resins contribute spice and variety. Fruitcake at Christmas. Palate: Thick, chewy in texture, and quite ripe. Again the fruitcake. Very deep and mature with some underlying maltiness. Dry, spicy, oak notes fight off all that sherry and add balance and complexity. Long, soothing finish. \r\n"
1,2604,"\nVery bright and lively, with a nice balance of flavors. Zesty fruit (lemon, peach, ripe pineapple, golden raisin) on a bed of layered sweetness (creamy vanilla, light honey, lightly toasted marshmallow, and a hint of coconut). Gently dry, delicately spicy, dried citrus finish. Light enough and with enough zing to enjoy before dinner, but it should stand up well enough after dinner, too. This is a nice whisky, but it shows a lighter, more elegant side of Glenrothes. It doesn’t express the rich, opulent notes often shown in bottlings like the 1972 Vintage, for example."


In [100]:
# Predict on the test data. The Kaggle test set has no labels,
# so accuracy can only be checked by submitting.
y_test = grid_search.predict(X_test)

### Make a Submission File

In [89]:
# # Predictions on test sample
# pred = rfc.predict(test['description'])

In [134]:
submission = pd.DataFrame({'id': test['id'], 'ratingCategory':y_test})
submission['ratingCategory'] = submission['ratingCategory'].astype('int64')

In [114]:
# Make Sure the Category is an Integer
submission.head()

Unnamed: 0,id,ratingCategory
0,3461,1
1,2604,1
2,3341,1
3,3764,1
4,2306,1


In [135]:
# Save your Submission File
# Best to Use an Integer or Timestamp for different versions of your model

submission.to_csv(f'./data/submission{subNumber}.csv', index=False)
subNumber += 1

In [1]:
from fastai.text import *

In [2]:
import pandas as pd

In [3]:
df = pd.read_csv('train_whiskey.csv')

In [4]:
df.iloc[1857]


id                                                     921
text     \nSignificantly darker than the rest, well-bal...
label                                                    1
Name: 1857, dtype: object

In [5]:
df.label.value_counts()


1    2881
0    1141
2      65
Name: label, dtype: int64
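
These counts are imbalanced, which matters when reading accuracy scores. As a quick worked example, the snippet below computes the accuracy of a model that always predicts the most common label. The counts are copied from the `value_counts()` output; the variable names are introduced here for illustration.

```python
# Label counts copied from the value_counts() output in this notebook.
label_counts = {1: 2881, 0: 1141, 2: 65}

total = sum(label_counts.values())

# A majority-class model always predicts the most frequent label.
majority_label = max(label_counts, key=label_counts.get)
baseline_accuracy = label_counts[majority_label] / total

print(total, majority_label, round(baseline_accuracy, 4))  # 4087 1 0.7049
```

On the training data, always predicting class 1 already scores about 70.5%, so the 70% target and the scores reported in this notebook are best judged against that baseline rather than against 50%.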

In [6]:
data_lm = TextLMDataBunch.from_csv('', 'train_whiskey.csv')
data_clas = TextClasDataBunch.from_csv('', 'train_whiskey.csv', bs=16)

Your valid set contained the following unknown labels, the corresponding items have been discarded.
4697, 2506, 4568, 4762, 721...
  if getattr(ds, 'warn', False): warn(ds.warn)


## Challenge

What you should be doing now:
1. Join the Kaggle Competition
2. Download the data
3. Train a model & try: 
    - Creating a Text Extraction & Classification Pipeline
    - Tune the pipeline with a `GridSearchCV` or `RandomizedSearchCV`
    - Add some Latent Semantic Indexing (LSI) into your pipeline. *Note:* You can grid search a nested pipeline, but you have to use double underscores for each level of nesting, e.g. `lsi__svd__n_components`
    - Try to extract word embeddings with spaCy and use those embeddings as your features for a classification model.
4. Make a submission to Kaggle 
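
To make the word-embedding step concrete: spaCy's `Doc.vector` defaults to an average of the token vectors, so each document becomes one fixed-length feature row. The toy sketch below mimics that idea with NumPy only. The 3-dimensional vectors and the helper `doc_vector` are made up for illustration, and skipping unknown tokens is a simplification rather than a description of spaCy's behavior.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings; the spaCy vectors used here have 300 dimensions.
toy_vectors = {
    "peat": np.array([1.0, 0.0, 0.0]),
    "smoke": np.array([0.8, 0.2, 0.0]),
    "sweet": np.array([0.0, 1.0, 0.0]),
    "vanilla": np.array([0.0, 0.8, 0.2]),
}


def doc_vector(text, vectors, dim=3):
    """Average the vectors of known tokens; unknown tokens are skipped."""
    tokens = [tok for tok in text.lower().split() if tok in vectors]
    if not tokens:
        return np.zeros(dim)
    return np.mean([vectors[tok] for tok in tokens], axis=0)


docs = ["peat smoke", "sweet vanilla", "peat and vanilla"]

# One row per document, one column per embedding dimension.
X = np.vstack([doc_vector(doc, toy_vectors) for doc in docs])
print(X.shape)  # (3, 3)
```

The resulting matrix can be passed to any scikit-learn classifier in place of TF-IDF features. Averaging discards word order and dilutes rare but informative words, which is one thing to keep in mind when comparing it with the TF-IDF pipeline.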

In [7]:
learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.5)
learn.fit_one_cycle(1, 1e-2)


epoch,train_loss,valid_loss,accuracy,time
0,4.249985,3.778995,0.344431,06:25


In [8]:
learn.predict('The lagavulin had a very peaty aftertaste', n_words=10)

'The lagavulin had a very peaty aftertaste in its nose of distilled apple and balanced with vanilla'

In [9]:
learn.save_encoder('ft_enc')

In [10]:
data_clas.show_batch()

text,target
"xxbos \n xxmaj following xxmaj tennessee ’s practice of charcoal - filtering the distillate before aging , this whiskey is soft around the edges yet delivers plenty of intensity . xxmaj the mouthwatering peanut aromas evoke memories of xxunk open a school xxunk while the palate delivers abundant fruit : orange marmalade and caramel apple . xxmaj fine bitter - sweet balance suggests burnt sugar , xxmaj mexican chocolate",166
"xxbos \n a 12 year old blended whisky was created in 1972 consisting of 70 different malt whiskies and 12 grain whiskies . xxmaj the blend was then placed in three sherry casks , where it was matured for the next 36 years ( highly irregular , to say the least ) . xxmaj the quality of the sherry casks is quite evident , as",112
"xxbos \n xxmaj there ’s something enigmatic and highly attractive about this distillery and its xxunk brother , xxmaj brora . xxmaj perhaps it ’s the xxunk ; there are plenty of independent releases of varied and often xxunk quality , and you ’re never quite sure what ’s going to turn up . xxmaj even iconic expressions such as the 30 year olds can vary from quite sublime",865
"xxbos \n xxmaj amber with orange and ruby hues . xxmaj aromas of xxmaj islay peat smoke , wine fruit , and background floral and spice notes . xxmaj flavors reveal more of what the aroma suggests - peat smoke , fruit , and floral / spice notes - with a balancing oak and subtle sea salt and seaweed . xxmaj the peat smoke particularly comes in on the",1651
"xxbos \n xxmaj past bottlings were distilled at the xxmaj bushmills or xxmaj cooley distilleries ( xxunk the rare , original 1951 xxmaj vintage , which was from the old xxup b. xxmaj daly distillery ) . xxmaj this one is triple distilled , so think xxmaj bushmills . ( xxmaj cooley distills their whisky twice . ) xxmaj in the past , i ’ve noticed a lot of",3718


In [11]:
data_lm = (TextList.from_csv('','train_whiskey.csv', cols='text')
                   .split_by_rand_pct()
                   .label_for_lm()
                   .databunch())
data_lm.save()

In [19]:
data_lm.show_batch()

idx,text
0,"48 xxbos \n xxmaj antique amber color . xxmaj mature complex aromas - especially in wood spice notes . xxmaj in addition to wood spices , i also found notes of roasted nuts , xxunk , orange marmalade , anise , and subtle brine . xxmaj thick and syrupy in texture , with complex flavors that echo its aroma . xxmaj the depth on the palate is incredible !"
1,"i am pleased to see an older expression , because i have always felt that the previous releases , generally 10 years old or less , needed a little more time to blossom . xxmaj sweeter notes of peaches and cream , pineapple upside - down cake , banana bread , and vanilla wafer are balanced by a spicy , resinous dry finish . a textural whisky that coats the"
2,". xxmaj instead of xxmaj tormore ’s normal xxunk xxunk , this flows sweetly over the tongue , leaving fruit leather , stewed rhubarb , and with water , rosewater and fresh wild strawberry . a lovely xxmaj tormore ! £ xxunk xxbos \n xxmaj this is bursting with barley , sweet oak , and all - butter shortbread . xxmaj there are charcoal sticks too , which"
3,"add complexity and intrigue without masking xxmaj glenmorangie ’s lovely subtle complexity . xxmaj this one does a pretty good job of it , although there ’s a lot of fruit here ( an obvious xxunk of the wine ) . xxmaj complex fruit , with notes of plum , raspberry , nectarine , blueberries , and a hint of lemon . xxmaj underneath the fruit , there ’s nougat"
4,"with vanilla pod , shoreline , and smoke wrapped in a woolen blanket . xxmaj the palate shows more smoke , light chocolate , xxmaj ardbeg oiliness , and soot . xxmaj it ’s fresh and charming , but ultimately is a xxunk beaten on xxunk . xxbos \n xxmaj floral on the nose , with notes of orange blossom , as well as cinnamon , caramel , coffee"


In [20]:
learn = language_model_learner(data_lm, AWD_LSTM)
learn.fit_one_cycle(2, 1e-2)
learn.save('mini_train_lm')
learn.save_encoder('mini_train_encoder')

epoch,train_loss,valid_loss,accuracy,time
0,4.405022,3.647685,0.350298,05:37
1,3.869565,3.520652,0.36126,05:31


In [21]:
learn.show_results()

text,target,pred
xxbos \n xxmaj there ’s no doubt that this is from xxmaj glenlivet ; there ’s still that pure,"combination of fruit and flowers , now given a little xxunk toward a more concentrated expression : the flowers are","whisky of xxmaj , spice , but aged a xxunk xxunk , the xxunk mature whisky . xxmaj xxmaj ,"
of xxmaj american bourbon casks and xxmaj spanish sherry casks that previously held heavily - peated whisky . xxmaj the,"nose opens with a whiff of ozone ; then lemon , vanilla , and coconut appear , along with fleeting","nose is with a hint of a , the the , cinnamon , and a . . with with a"
"time he tastes it he 's taken back in time to xxup xxunk 's “ xxmaj green ” tour ,","so what 's not to love ? xxmaj dried apple dustiness gives way to pineapple , melon , and xxunk","with i ’s going going be ? xxmaj this fruit , , way to a , and , and a"
"- then buy the 12 year old . xxbos \n xxmaj buttery , nutty , and richly spiced ,","this has all the hallmarks of a sherry - finished whiskey — dried fruit , hazelnuts , chocolate — but","with is a the flavors of a whisky whisky flavored whisky . a fruit , vanilla , and , and"
"hint of mint and anise . xxmaj palate : xxmaj similar to its aroma . xxmaj rich , full ,","thick , and somewhat viscous . a beautiful marriage of lush , ripe fruit and spicy , resinous oak notes","and , and with sweet . xxmaj very whisky of xxmaj and sweet fruit , a finish sweet finish ."


In [22]:
data_clas = (TextList.from_csv('', 'train_whiskey.csv', cols='text', vocab=data_lm.vocab)
                   .split_by_rand_pct()
                   .label_from_df(cols='label')
                   .databunch(bs=42))


In [16]:
data_clas.show_batch()

text,target
"xxbos \n xxmaj following xxmaj tennessee ’s practice of charcoal - filtering the distillate before aging , this whiskey is soft around the edges yet delivers plenty of intensity . xxmaj the mouthwatering peanut aromas evoke memories of xxunk open a school xxunk while the palate delivers abundant fruit : orange marmalade and caramel apple . xxmaj fine bitter - sweet balance suggests burnt sugar , xxmaj mexican chocolate",166
"xxbos \n xxmaj though xxmaj brora has acquired cult status , it has to be said that for a few years these xxmaj special xxmaj release xxmaj broras went through an off - putting xxunk phase , which might well have put off xxunk to this legendary closed site , who must have wondered what all the xxunk was about . \r \n xxmaj one nose of the",90
"xxbos \n xxmaj the xxmaj ichiro of the title is xxmaj ichiro xxmaj akuto , xxunk of the family which owned the now demolished xxmaj hanyu distillery , and proprietor of the brand new and incredibly cute ( yes … distilleries can be cute ) xxmaj chichibu distillery — even the name ’s cute . \r \n xxmaj this release is a vatting of different ( un -",1803
"xxbos \n xxmaj the xxmaj eigashima distillery , on the xxmaj xxunk xxmaj strait near xxmaj xxunk , may be the least well known of xxmaj japan ’s single malt plants , but has a sound claim to be the country ’s oldest , as its xxunk to make whisky was xxunk in xxunk — four years before xxmaj yamazaki was built . xxmaj it has , however ,",4820
xxbos \n xxmaj background on the xxmaj master ’s xxmaj collection : this is the fourth of the 100 % pot still whiskeys from xxmaj woodford xxmaj reserve in their xxmaj master ’s xxmaj collection series ( the previous being two different xxmaj four xxmaj grain releases and a xxmaj sonoma - xxmaj cutrer wine finish expression ) . xxmaj all four have a common pot still character to,4512


In [23]:
learn = text_classifier_learner(data_clas, AWD_LSTM)
learn.load_encoder('mini_train_encoder')
learn.fit_one_cycle(2, slice(1e-3,1e-2))
learn.save('mini_train_clas')

epoch,train_loss,valid_loss,accuracy,time
0,0.829527,0.632345,0.694002,05:19
1,0.651403,0.607579,0.718482,05:32


In [24]:
learn.show_results()

text,target,prediction
"xxbos \n xxmaj very well - balanced and mellow on the nose and palate . xxmaj sweet notes of mature dark rum , toffee , nougat , and candy corn dovetail with dried apricot , golden raisin , hot cinnamon , soft mint tea , and vanilla . xxmaj polished leather and tobacco leaves on a long , contemplative finish . xxmaj this",0,1
"xxbos \n xxmaj now the one that peat xxunk wait xxunk for every year , which makes it the bottling that produces the most debate . xxmaj for me , this is up there with last year ’s bottling , which itself xxunk in a return to high standards after a slight dropping - off in xxunk . \r \n xxmaj this is different , however . xxmaj",1,0
"xxbos \n xxmaj the xxmaj eigashima distillery , on the xxmaj xxunk xxmaj xxunk near xxmaj xxunk , may be the least well known of xxmaj japan ’s single malt plants , but has a sound xxunk to be the country ’s oldest , as its xxunk to make whisky was xxunk in xxunk — four years before xxmaj yamazaki was built . xxmaj it has , however ,",0,1
"xxbos \n xxmaj the ( sadly xxunk ) xxmaj xxunk distillery is at the opposite extreme to xxmaj eigashima . xxmaj peated malt , small stills , and sherry casks give a single malt of uncompromising weight and solidity . xxmaj those of you who thought xxmaj japan was all about the ethereal and xxunk , think again . xxmaj in xxunk terms , if xxmaj eigashima is the",1,0
"xxbos \n xxmaj color is antique gold while the aroma is dry but creamy , with notes of vanilla , marshmallow , honey , and tropical fruit ( pineapple , coconut ) . xxmaj palate is malty and creamy up front , with vanilla , marshmallow , and a hint of honey ; briefly becoming fruity ( again , the tropical fruits ) before turning dry and oaky ,",1,1


In [31]:
learn.predict(TextList.from_csv('', 'test_whiskey.csv', cols='text'))

(Category 1, tensor(1), tensor([0.2970, 0.6913, 0.0116]))

In [42]:
test = TextList.from_csv('', 'test_whiskey.csv', cols='text')

In [104]:
test_id = TextList.from_csv('', 'test_whiskey.csv', cols='id')

In [97]:
y_test = [learn.predict(text) for text in test]

In [98]:
rating_category = [c[0].obj for c in y_test]

In [87]:
dir(category)

['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__int__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 'apply_tfms',
 'data',
 'obj',
 'show']

In [None]:
# Predictions on test sample
pred = grid_search.predict(test['description'])

In [94]:
category.obj

1

In [36]:
data_clas_test = (TextList.from_csv('', 'test_whiskey.csv', cols='text')
                   .split_none()
                   .label_from_df(cols='text')
                   .databunch(bs=42))

In [37]:
y_test = learn.predict(data_clas_test)

In [103]:
test['id']

IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices

In [100]:
subNumber=7

In [105]:
submission = pd.DataFrame({'id': test_id, 'ratingCategory':rating_category})
submission['ratingCategory'] = submission['ratingCategory'].astype('int64')

In [106]:
submission.head()

Unnamed: 0,id,ratingCategory
0,3461,1
1,2604,1
2,3341,1
3,3764,1
4,2306,1


In [108]:
# Save your Submission File
# Best to Use an Integer or Timestamp for different versions of your model

submission.to_csv(f'./data/submission{subNumber}.csv', index=False)
subNumber += 1

# Post Lecture Assignment
<a id="p4"></a>

Your primary assignment this afternoon is to achieve a minimum of 70% accuracy on the Kaggle competition. Once you have achieved 70% accuracy, please work on the following: 

1. Research "Sentiment Analysis". Provide answers in markdown to the following questions: 
    - What is "Sentiment Analysis"? 
    - Is Document Classification different from "Sentiment Analysis"? Provide evidence for your response.
    - How do you create labeled sentiment data? Are those labels really sentiment?
    - What are common applications of sentiment analysis?
2. Research why word embeddings worked better for the lecture notebook than on the whiskey competition.
    - This [text classification documentation](https://developers.google.com/machine-learning/guides/text-classification/step-2-5) from Google might be of interest
    - Neural Networks are becoming more popular for document classification. Why is that the case?