# Wine Classification - Using Basic NLP

Notebook by [Prashant Brahmbhatt](https://github.com/hashbanger)

[Data Source](https://www.kaggle.com/zynicide/wine-reviews)

References are from [Shanglun Wang](https://www.toptal.com/resume/shanglun-sean-wang)'s tutorial.

In [1]:
import pandas as pd
import numpy as np
from collections import Counter
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.model_selection import train_test_split



In [2]:
df = pd.read_csv('winemag-data-130k-v2.csv')

In [3]:
df.head()

| Unnamed: 0.1 | Unnamed: 0 | country | description | designation | points | price | province | region_1 | region_2 | taster_name | taster_twitter_handle | title | variety | winery |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Italy | Aromas include tropical fruit, broom, brimston... | Vulkà Bianco | 87 |  | Sicily & Sardinia | Etna |  | Kerin O’Keefe | @kerinokeefe | Nicosia 2013 Vulkà Bianco (Etna) | White Blend | Nicosia |
| 1 | 1 | Portugal | This is ripe and fruity, a wine that is smooth... | Avidagos | 87 | 15.0 | Douro |  |  | Roger Voss | @vossroger | Quinta dos Avidagos 2011 Avidagos Red (Douro) | Portuguese Red | Quinta dos Avidagos |
| 2 | 2 | US | Tart and snappy, the flavors of lime flesh and... |  | 87 | 14.0 | Oregon | Willamette Valley | Willamette Valley | Paul Gregutt | @paulgwine | Rainstorm 2013 Pinot Gris (Willamette Valley) | Pinot Gris | Rainstorm |
| 3 | 3 | US | Pineapple rind, lemon pith and orange blossom ... | Reserve Late Harvest | 87 | 13.0 | Michigan | Lake Michigan Shore |  | Alexander Peartree |  | St. Julian 2013 Reserve Late Harvest Riesling ... | Riesling | St. Julian |
| 4 | 4 | US | Much like the regular bottling from 2012, this... | Vintner's Reserve Wild Child Block | 87 | 65.0 | Oregon | Willamette Valley | Willamette Valley | Paul Gregutt | @paulgwine | Sweet Cheeks 2012 Vintner's Reserve Wild Child... | Pinot Noir | Sweet Cheeks |


### Cleaning the data

Here we use **Counter** to count how many reviews each wine variety has.

In [4]:
counter = Counter(df['variety'].tolist())

Forming a dictionary that maps each of the top ten variety names to its rank (0 for the most common).

In [5]:
top_10_varieties = {name[0]: index for index, name in enumerate(counter.most_common(10))}
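
To make the counting and ranking steps concrete, here is a minimal sketch on a made-up list of labels (the list `varieties` is hypothetical and stands in for `df['variety'].tolist()`). It shows that `most_common` returns `(name, count)` pairs ordered by frequency, and how `enumerate` turns that order into a rank.

```python
from collections import Counter

# Hypothetical toy labels standing in for df['variety'].tolist()
varieties = ["Pinot Noir", "Chardonnay", "Pinot Noir", "Riesling",
             "Chardonnay", "Pinot Noir"]

counter = Counter(varieties)

# most_common(2) returns (name, count) pairs, most frequent first
top_2 = counter.most_common(2)

# Map each of the top names to its rank (0 = most frequent)
top_2_varieties = {name: rank for rank, (name, _count) in enumerate(top_2)}

print(top_2)
print(top_2_varieties)
```

Any variety outside the top list (here "Riesling") is simply absent from the dictionary, which is what the later filtering step relies on.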

Getting the rows whose variety is among the **top_10_varieties**

In [6]:
top_df = df[df['variety'].apply(lambda x: x in top_10_varieties)]

_______

Getting the description of all the reviews for the **top_10_varieties**

In [7]:
description = list(top_df['description'])

Mapping each review's variety to its rank in **top_10_varieties**, stored as a NumPy array

In [8]:
varietal_list = np.array([top_10_varieties[i] for i in list(top_df['variety'])])
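
The filtering and label-encoding steps can also be written with pandas' own `isin` and `map`. The sketch below is an alternative formulation on a hypothetical toy frame (`toy` and `ranks` are made up for illustration), not the notebook's exact code.

```python
import pandas as pd

# Hypothetical toy data standing in for the wine reviews
toy = pd.DataFrame({
    "variety": ["Pinot Noir", "Chardonnay", "Zinfandel", "Pinot Noir"],
    "description": ["cherry and earth", "buttery oak",
                    "jammy spice", "silky red fruit"],
})
# Hypothetical rank dictionary, playing the role of top_10_varieties
ranks = {"Pinot Noir": 0, "Chardonnay": 1}

# Keep only rows whose variety is one of the ranked names
kept = toy[toy["variety"].isin(list(ranks))]

# Texts and integer labels stay aligned because both come from `kept`
texts = list(kept["description"])
labels = kept["variety"].map(ranks).to_numpy()

print(texts)
print(labels)
```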

________

### Vectorizing

In [9]:
count_vect = CountVectorizer()

In [10]:
x_train_counts = count_vect.fit_transform(description)
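
As a rough illustration of what `fit_transform` produces, the sketch below runs `CountVectorizer` on three made-up sentences (the `docs` list is hypothetical). Each row of the result is one document and each column is one vocabulary word, with the cell holding how often that word occurs.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy descriptions
docs = ["ripe cherry and plum", "crisp lemon and lime", "cherry cherry oak"]

vect = CountVectorizer()
counts = vect.fit_transform(docs)      # sparse matrix: documents x vocabulary

dense = counts.toarray()
cherry_col = vect.vocabulary_["cherry"]  # column index assigned to "cherry"

print(dense.shape)              # (number of documents, vocabulary size)
print(dense[:, cherry_col])     # count of "cherry" in each document
```

With default settings the vectorizer lowercases the text and ignores single-character tokens, so the vocabulary here consists of the eight distinct words.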

### Transforming

In [11]:
tfidf = TfidfTransformer()

In [12]:
x_train_tfidf = tfidf.fit_transform(x_train_counts)
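
The effect of the TF-IDF step is easier to see on a tiny count matrix. In this sketch the matrix `counts` is made up, with columns imagined as the words "and", "cherry" and "lemon". With the default settings, `TfidfTransformer` uses the smoothed weight `idf = ln((1 + n) / (1 + df)) + 1` and then scales every row to unit length, so words that occur in fewer documents get larger weights.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Hypothetical counts; columns: "and", "cherry", "lemon"
counts = np.array([[1, 1, 0],
                   [1, 0, 1],
                   [1, 1, 0]])

tfidf = TfidfTransformer()                 # defaults: smooth_idf=True, norm='l2'
weights = tfidf.fit_transform(counts).toarray()

# Smoothed IDF computed by hand for comparison
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)
manual_idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

print(tfidf.idf_)
print(manual_idf)
print(weights.round(3))
```

In the first row, "cherry" ends up with a higher weight than "and" even though both occur once, because "and" appears in every document.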

____

### Splitting the data

In [13]:
X_train, X_test, y_train, y_test = train_test_split(x_train_tfidf, varietal_list, test_size = 0.25,
                                                    random_state = 0)
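
Note that the split above happens after `CountVectorizer` and `TfidfTransformer` were fitted on every description, so the vocabulary and IDF weights have already seen the test reviews. One common alternative, sketched here on hypothetical toy data rather than the wine reviews, is to split the raw text first and let a pipeline fit the vectorizer on the training part only. `TfidfVectorizer` is scikit-learn's combination of `CountVectorizer` followed by `TfidfTransformer`; the texts and labels below are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy reviews with two classes (0 = red, 1 = white)
texts = ["cherry plum earth", "lemon lime crisp", "dark cherry spice",
         "green apple citrus", "plum and cherry", "zesty lemon peel",
         "earthy red cherry", "crisp lime zest"]
labels = [0, 1, 0, 1, 0, 1, 0, 1]

# Split the raw text first, so the test reviews stay unseen
text_train, text_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0)

# The pipeline fits the vectorizer and the classifier on training text only
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(text_train, y_train)

# Words in the test reviews that were not seen during fitting are ignored
predictions = model.predict(text_test)
print(predictions)
```

Whether this changes the scores noticeably on the wine data is not something this notebook measures; the sketch only shows the mechanics.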

### Building the Model

#### Using Naive Bayes

In [14]:
classifier_nb = MultinomialNB()
classifier_nb.fit(X_train, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

Predicting

In [16]:
y_pred = classifier_nb.predict(X_test)

In [17]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))

             precision    recall  f1-score   support

          0       0.55      0.90      0.68      3370
          1       0.62      0.97      0.76      2868
          2       0.56      0.66      0.61      2379
          3       0.67      0.74      0.71      2229
          4       0.76      0.59      0.67      1744
          5       0.94      0.53      0.68      1309
          6       0.96      0.24      0.38      1248
          7       1.00      0.02      0.04      1041
          8       0.97      0.18      0.30       867
          9       0.00      0.00      0.00       776

avg / total       0.68      0.63      0.58     17831
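
To read the report: precision for a class is the share of its predictions that are correct, and recall is the share of its true samples that were found. Class 9 shows 0.00 for both, which is what happens when a class is never predicted. The sketch below recomputes both metrics by hand on made-up labels (`y_true` and `y_hat` are hypothetical, and the helper `precision_recall` is written for this illustration only), treating the undefined case as 0.0.

```python
import numpy as np

# Hypothetical labels; class 2 is never predicted
y_true = np.array([0, 0, 1, 1, 2, 2])
y_hat = np.array([0, 0, 0, 1, 0, 1])

def precision_recall(y_true, y_hat, label):
    """Per-class precision and recall, with 0.0 when the ratio is undefined."""
    true_pos = np.sum((y_hat == label) & (y_true == label))
    predicted = np.sum(y_hat == label)   # how often the class was predicted
    actual = np.sum(y_true == label)     # how often the class really occurs
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    return float(precision), float(recall)

for label in (0, 1, 2):
    print(label, precision_recall(y_true, y_hat, label))
```

The frequent class 0 gets perfect recall but low precision because it absorbs samples from the other classes, the same pattern visible for classes 0 and 1 in the report.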





In [18]:
# Fraction of test reviews whose predicted variety matches the true one
accuracy = np.mean(y_pred == y_test)

print("Accuracy: %.2f%%" % (accuracy * 100))

Accuracy: 62.94%
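
The same number is available from scikit-learn's `accuracy_score`. The sketch below uses made-up labels (`y_true` and `y_hat` are hypothetical) to show that it matches the share of positions where prediction and truth agree.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels: four of the five predictions are correct
y_true = np.array([0, 1, 2, 2, 1])
y_hat = np.array([0, 1, 1, 2, 1])

manual = np.mean(y_true == y_hat)        # share of matching positions
library = accuracy_score(y_true, y_hat)  # same quantity from scikit-learn

print("manual:  %.2f" % manual)
print("library: %.2f" % library)
```

Accuracy also equals the support-weighted average of per-class recall, which is why the 0.63 in the recall column of the report's last row agrees with the 62.94% printed here.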
