# CAPSTONE - FINDINGS AND TECHNICAL REPORT: WINE BLIND TASTING (PROVINCE)

I have built classification models that, given a wine description written by an expert or semi-expert, predict which province the wine comes from.
The models I am using are the same ones that I described in the country Jupyter notebook.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.naive_bayes import MultinomialNB, BernoulliNB, GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier

pd.set_option('display.max_columns',500)
%config InlineBackend.figure_format = 'retina'
%matplotlib inline

# SMALL DATASET

In [2]:
df = pd.read_csv('small_wineV1.csv')

In [3]:
df.head()

Unnamed: 0,country,description,designation,points,price,province,region_1,taster_name,title,variety,winery,vintage
0,France,"Buxom and heady, this is a delightfully rich, ...",Vieilles Vignes,99,114.0,Châteauneuf-du-Pape,Rhône Valley,Anna Lee C. Iijima,Domaine de la Janasse 2016 Vieilles Vignes Red...,Rhône-style Red Blend,Domaine de la Janasse,2016.0
1,France,"Sultry and silken on the palate, this wine sta...",La Réserve,98,175.0,Châteauneuf-du-Pape,Rhône Valley,Anna Lee C. Iijima,Domaine le Clos du Caillou 2016 La Réserve Red...,Rhône-style Red Blend,Domaine le Clos du Caillou,2016.0
2,Portugal,The wine's fine perfumed black plum fruits giv...,,98,120.0,Port,Port Blend,Roger Voss,Fonseca 2017 Port,Port,Fonseca,2017.0
3,France,"Veins of vanilla, smoke and toast amplify blac...",Hommage à Henry Tacussel,98,80.0,Châteauneuf-du-Pape,Rhône Valley,Anna Lee C. Iijima,Domaine Moulin-Tacussel 2016 Hommage à Henry T...,Grenache,Domaine Moulin-Tacussel,2016.0
4,France,"This juicy, fruit-forward wine drenches the pa...",La Muse,97,88.0,Châteauneuf-du-Pape,Rhône Valley,Anna Lee C. Iijima,Guillaume Gonnet 2016 La Muse Red (Châteauneuf...,Rhône-style Red Blend,Guillaume Gonnet,2016.0


In [4]:
print("There are {} provinces producing wine in this dataset such as {}... \n".
      format(len(df.province.unique()), ", ".join(df.province.unique()[0:5])))

There are 1112 provinces producing wine in this dataset such as Châteauneuf-du-Pape, Port, Champagne, Trento, Santa Maria Valley... 



In [5]:
print('The province baseline is', df.province.value_counts(normalize=True).max())

The province baseline is 0.024616533578445454
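
The baseline printed above is the accuracy of a classifier that always predicts the most frequent province. A minimal sketch on made-up labels (the counts below are hypothetical, not taken from the dataset) shows why the largest normalized value count equals that accuracy:

```python
import pandas as pd

# Hypothetical toy labels: 5 x "Barolo", 3 x "Alsace", 2 x "Cahors"
labels = pd.Series(["Barolo"] * 5 + ["Alsace"] * 3 + ["Cahors"] * 2)

# Share of the most frequent class, computed the same way as in the notebook
baseline = labels.value_counts(normalize=True).max()

# Accuracy obtained by always predicting the most frequent class
majority = labels.value_counts().idxmax()
accuracy = (labels == majority).mean()

print(majority, baseline, accuracy)
```

Any model worth keeping should score clearly above this number on the held-out data.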


By looking at the provinces, we can see that the small dataset contains 1112 distinct province values, each with a different number of entries, and many with only one. The latter are of no use, because we cannot train and test on a single sample. I will keep the provinces that appear more than 100 times. I will also reduce the number of classes by merging some of them, for example by folding the Bordeaux sub-appellations into 'Bordeaux'.


In [6]:
# Take an explicit copy so that the assignment below does not trigger SettingWithCopyWarning
df1 = df[df['province'].map(df['province'].value_counts()) > 100].copy()

In [7]:
# Merge sub-appellations into a parent label (substring replacement, applied in order)
df1['province'] = df1['province'].apply(lambda x: str(x).replace('Blaye Côtes de Bordeaux', 'Bordeaux').
                                    replace('Bordeaux Rosé', 'Bordeaux').replace('Bordeaux Supérieur', 'Bordeaux').
                                   replace('Bordeaux Blanc', 'Bordeaux').replace('Sta. Rita Hills', 'California').
                                   replace('Cava', 'Penedes').replace('Mendocino County', 'California').
                                   replace('Russian River Valley', 'Sonoma Valley').replace('North Coast', 'Sonoma Valley'))


In [8]:
print("The new number of the provinces is", len(df1.province.unique()))

The new number of the provinces is 77
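
The merge step uses chained str.replace calls, which rewrite substrings, so any label that merely contains 'Cava' would be changed as well. Whether that matters depends on the labels actually present in the data. As a hedged alternative (not the method used in this notebook), an exact-match mapping with Series.replace only touches labels equal to a key. The labels below are hypothetical, and 'Cavaillon' is included only to show the difference:

```python
import pandas as pd

# Hypothetical labels; 'Cavaillon' is here only to show the substring pitfall
provinces = pd.Series(["Bordeaux Blanc", "Cava", "Cavaillon", "Alsace"])

# Exact-match mapping: only labels equal to a key are rewritten
merge_map = {"Bordeaux Blanc": "Bordeaux", "Cava": "Penedes"}
exact = provinces.replace(merge_map)

# Substring replacement, in the style of the chained str.replace call
substring = provinces.apply(
    lambda x: str(x).replace("Bordeaux Blanc", "Bordeaux").replace("Cava", "Penedes")
)

print(exact.tolist())
print(substring.tolist())
```

A dictionary also keeps the merge rules in one place, which makes them easier to review and reuse for the big dataset.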


In [9]:
print('The province baseline is', df1.province.value_counts(normalize=True).max())

The province baseline is 0.05765158806544755


### DEFINE THE TARGET AND SPLIT THE DATA

In [11]:
X1 = df1.description
y1 = df1.province

In [12]:
#Split and stratify the predictor and the target
X_train1, X_test1, y_train1, y_test1 = train_test_split(X1, y1, test_size=0.2, stratify = y1, random_state = 1)
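
Passing stratify to train_test_split keeps the class proportions the same in the training and test sets, which matters here because some provinces have only a little over 100 rows. A small sketch on hypothetical labels (the 80/20 class mix is made up for illustration):

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 x "a", 20 x "b"
y = ["a"] * 80 + ["b"] * 20
X = list(range(len(y)))  # placeholder features

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1
)

# Both splits keep the 80/20 class mix
train_counts = Counter(y_train)
test_counts = Counter(y_test)
print(train_counts, test_counts)
```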

### COUNTVECTORIZER AND TFIDFVECTORIZER WITH LOGISTIC REGRESSION

The scores did not differ much across configurations. The best-performing model was CountVectorizer with Logistic Regression using Lasso (L1) regularization and multi_class 'ovr', with a test accuracy of about 0.563. I also ran 5-fold cross-validation on the training set for this model; the mean score of about 0.543 is close to the test accuracy.
 
All of the models have better scores than the baseline.


In [13]:
cvec = CountVectorizer(strip_accents='unicode', stop_words="english", ngram_range=(1, 1))

In [43]:
# CountVectorizer and Logistic Regression with Ridge regularization
pipeline_ridge1 = Pipeline([
    ('vect', cvec),
    ('cls', LogisticRegression(penalty = 'l2', solver='lbfgs', multi_class='ovr', verbose = 1, n_jobs = 2))
])
pipeline_ridge1.fit(X_train1, y_train1)
predicted_ridge1 = pipeline_ridge1.predict(X_test1)
pipeline_ridge1.score(X_test1, y_test1)

[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  46 tasks      | elapsed:    6.9s
[Parallel(n_jobs=2)]: Done  77 out of  77 | elapsed:   10.4s finished


0.5553416746871992

In [44]:
print(classification_report(y_test1, predicted_ridge1))

                         precision    recall  f1-score   support

      Adelaida District       0.10      0.04      0.06        26
             Alentejano       0.50      0.52      0.51        50
       Alexander Valley       0.00      0.00      0.00        24
                 Alsace       0.85      0.95      0.90       152
             Alto Adige       0.87      0.97      0.92        35
          Amador County       0.15      0.06      0.09        31
        Anderson Valley       0.44      0.26      0.33        31
             Barbaresco       0.36      0.24      0.29        38
                 Barolo       0.69      0.79      0.74       158
               Bordeaux       0.56      0.83      0.67       155
 Brunello di Montalcino       0.57      0.64      0.60        84
             Burgenland       0.65      0.51      0.57        39
                 Cahors       0.60      0.47      0.53        38
             California       0.40      0.59      0.48       240
               Carneros 

In [14]:
# CountVectorizer and Logistic Regression with Lasso regularization
pipelines1 = Pipeline([
    ('vect', cvec),
    ('cls', LogisticRegression(penalty = 'l1', solver='saga', multi_class='ovr', verbose = 1, n_jobs = 2))
])
pipelines1.fit(X_train1, y_train1)
predicteds1 = pipelines1.predict(X_test1)
pipelines1.score(X_test1, y_test1)

[Parallel(n_jobs=2)]: Using backend ThreadingBackend with 2 concurrent workers.


max_iter reached after 3 seconds

[Parallel(n_jobs=2)]: Done  46 tasks      | elapsed:  1.1min



[Parallel(n_jobs=2)]: Done  77 out of  77 | elapsed:  1.9min finished


0.563041385948027
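
The "max_iter reached" messages in the log indicate that the saga solver stopped at its iteration cap (100 by default in scikit-learn) rather than at its convergence tolerance. The sketch below shows one way to check this and to raise the cap. The toy corpus and the value max_iter=1000 are assumptions chosen for illustration, not settings used in this notebook:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus, only to make the sketch runnable
texts = ["ripe black cherry and oak", "cherry plum and toast",
         "crisp lemon and lime zest", "lime green apple and lemon"]
labels = ["red", "red", "white", "white"]

max_iter = 1000  # arbitrary illustrative value, larger than the default of 100
pipe = Pipeline([
    ("vect", CountVectorizer(stop_words="english")),
    ("cls", LogisticRegression(penalty="l1", solver="saga", max_iter=max_iter)),
])
pipe.fit(texts, labels)

# n_iter_ holds the iterations actually used; reaching max_iter means the
# solver stopped at the cap rather than at its convergence tolerance
n_iter = int(pipe.named_steps["cls"].n_iter_.max())
print(n_iter, "iterations; hit cap:", n_iter >= max_iter)
```

On the real data a higher cap increases training time, so it may be worth trying on a subsample first.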

In [15]:
print(classification_report(y_test1, predicteds1))

                         precision    recall  f1-score   support

      Adelaida District       0.12      0.04      0.06        26
             Alentejano       0.51      0.52      0.51        50
       Alexander Valley       0.00      0.00      0.00        24
                 Alsace       0.84      0.95      0.90       152
             Alto Adige       0.87      0.94      0.90        35
          Amador County       0.25      0.10      0.14        31
        Anderson Valley       0.56      0.29      0.38        31
             Barbaresco       0.45      0.24      0.31        38
                 Barolo       0.69      0.83      0.75       158
               Bordeaux       0.56      0.83      0.67       155
 Brunello di Montalcino       0.60      0.63      0.62        84
             Burgenland       0.62      0.46      0.53        39
                 Cahors       0.67      0.47      0.55        38
             California       0.41      0.63      0.50       240
               Carneros 

In [16]:
#Cross validation score
cross = cross_val_score(pipelines1, X_train1, y_train1, cv=5)
print(cross)
print(cross.mean())
 


[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  46 tasks      | elapsed:   49.7s
[Parallel(n_jobs=2)]: Done  77 out of  77 | elapsed:  1.4min finished
[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  46 tasks      | elapsed:   49.7s
[Parallel(n_jobs=2)]: Done  77 out of  77 | elapsed:  1.4min finished
[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  46 tasks      | elapsed:   48.4s
[Parallel(n_jobs=2)]: Done  77 out of  77 | elapsed:  1.4min finished
[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  46 tasks      | elapsed:   48.2s
[Parallel(n_jobs=2)]: Done  77 out of  77 | elapsed:  1.4min finished
[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  46 tasks      | elapsed:   48.8s


[0.53513996 0.53590664 0.54766917 0.54232164 0.55211182]
0.5426298489078663


[Parallel(n_jobs=2)]: Done  77 out of  77 | elapsed:  1.4min finished
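
To put a number on how close the cross-validation result is to the held-out accuracy, the fold scores can be summarized with their mean and spread. The values below are copied from the outputs above; the snippet only does the arithmetic:

```python
from statistics import mean, stdev

# Fold accuracies and held-out accuracy as reported in the outputs above
cv_scores = [0.53513996, 0.53590664, 0.54766917, 0.54232164, 0.55211182]
test_accuracy = 0.563041385948027

cv_mean = mean(cv_scores)
cv_std = stdev(cv_scores)  # sample standard deviation across folds
gap = test_accuracy - cv_mean

print(f"CV mean {cv_mean:.4f} +/- {cv_std:.4f}, test minus CV mean {gap:.4f}")
```

Each fold trains on 80% of the training set, so a cross-validation mean slightly below the test accuracy is not surprising.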


In [45]:
# CountVectorizer and Logistic Regression with Lasso regularization
pipelines1 = Pipeline([
    ('vect', cvec),
    ('cls', LogisticRegression(penalty = 'l1', solver='saga', multi_class='ovr', verbose = 1, n_jobs = 2))
])
pipelines1.fit(X_train1, y_train1)
predicteds1 = pipelines1.predict(X_test1)
pipelines1.score(X_test1, y_test1)

[Parallel(n_jobs=2)]: Using backend ThreadingBackend with 2 concurrent workers.


max_iter reached after 3 seconds





[Parallel(n_jobs=2)]: Done  46 tasks      | elapsed:  1.3min



[Parallel(n_jobs=2)]: Done  77 out of  77 | elapsed:  2.1min finished


0.5625601539942252

In [50]:
print(classification_report(y_test1, predicteds1))

                         precision    recall  f1-score   support

      Adelaida District       0.12      0.04      0.06        26
             Alentejano       0.50      0.52      0.51        50
       Alexander Valley       0.00      0.00      0.00        24
                 Alsace       0.84      0.95      0.90       152
             Alto Adige       0.87      0.94      0.90        35
          Amador County       0.25      0.10      0.14        31
        Anderson Valley       0.56      0.29      0.38        31
             Barbaresco       0.43      0.24      0.31        38
                 Barolo       0.69      0.83      0.75       158
               Bordeaux       0.56      0.83      0.67       155
 Brunello di Montalcino       0.61      0.63      0.62        84
             Burgenland       0.62      0.46      0.53        39
                 Cahors       0.67      0.47      0.55        38
             California       0.41      0.63      0.49       240
               Carneros 

In [49]:
# CountVectorizer and Logistic Regression with Lasso regularization and multinomial multi_class
pipelines2 = Pipeline([
    ('vect', cvec),
    ('cls', LogisticRegression(penalty = 'l1', solver='saga', multi_class='multinomial', verbose = 1, n_jobs = 2))
])
pipelines2.fit(X_train1, y_train1)
predicteds2 = pipelines2.predict(X_test1)
pipelines2.score(X_test1, y_test1)

[Parallel(n_jobs=2)]: Using backend ThreadingBackend with 2 concurrent workers.


max_iter reached after 134 seconds


[Parallel(n_jobs=2)]: Done   1 out of   1 | elapsed:  2.2min finished


0.5469201154956689

In [51]:
print(classification_report(y_test1, predicteds2))

                         precision    recall  f1-score   support

      Adelaida District       0.09      0.04      0.05        26
             Alentejano       0.41      0.42      0.42        50
       Alexander Valley       0.00      0.00      0.00        24
                 Alsace       0.85      0.93      0.89       152
             Alto Adige       0.82      0.89      0.85        35
          Amador County       0.14      0.06      0.09        31
        Anderson Valley       0.50      0.32      0.39        31
             Barbaresco       0.31      0.26      0.29        38
                 Barolo       0.70      0.77      0.73       158
               Bordeaux       0.58      0.78      0.67       155
 Brunello di Montalcino       0.56      0.60      0.58        84
             Burgenland       0.65      0.51      0.57        39
                 Cahors       0.59      0.50      0.54        38
             California       0.40      0.60      0.48       240
               Carneros 

In [52]:
vect = TfidfVectorizer(stop_words='english', sublinear_tf=True,
                             ngram_range = (1,1), strip_accents='unicode')
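
The sublinear_tf=True option replaces each raw term frequency tf with 1 + ln(tf), so a word repeated ten times counts for a little over three times as much as a word used once, not ten times as much. The plain-Python sketch below shows only that scaling. The helper name is hypothetical, and the IDF weighting and normalization that TfidfVectorizer also applies are left out:

```python
import math

def sublinear_tf(tf):
    """Scale a raw term count as 1 + ln(tf); zero counts stay zero."""
    return 1.0 + math.log(tf) if tf > 0 else 0.0

for count in (0, 1, 2, 10):
    print(count, round(sublinear_tf(count), 3))
```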

In [53]:
# TfidfVectorizer and Logistic Regression with Ridge regularization
pipelines3 = Pipeline([
    ('vect', vect),
    ('cls', LogisticRegression(penalty = 'l2', solver='lbfgs', multi_class='ovr', verbose = 1, n_jobs = 2))
])
pipelines3.fit(X_train1, y_train1)
predicteds3 = pipelines3.predict(X_test1)
pipelines3.score(X_test1, y_test1)

[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  46 tasks      | elapsed:    4.4s
[Parallel(n_jobs=2)]: Done  77 out of  77 | elapsed:    6.2s finished


0.5534167468719923

In [54]:
print(classification_report(y_test1, predicteds3))

                         precision    recall  f1-score   support

      Adelaida District       0.00      0.00      0.00        26
             Alentejano       0.71      0.34      0.46        50
       Alexander Valley       0.00      0.00      0.00        24
                 Alsace       0.72      1.00      0.84       152
             Alto Adige       0.85      1.00      0.92        35
          Amador County       0.00      0.00      0.00        31
        Anderson Valley       0.33      0.03      0.06        31
             Barbaresco       0.00      0.00      0.00        38
                 Barolo       0.60      0.91      0.72       158
               Bordeaux       0.44      0.94      0.60       155
 Brunello di Montalcino       0.59      0.56      0.57        84
             Burgenland       0.74      0.36      0.48        39
                 Cahors       0.88      0.37      0.52        38
             California       0.35      0.82      0.49       240
               Carneros 



In [55]:
# TfidfVectorizer and Logistic Regression with Lasso regularization
pipelines4 = Pipeline([
    ('vect', vect),
    ('cls', LogisticRegression(penalty = 'l1', solver='saga', multi_class='ovr'))
])
pipelines4.fit(X_train1, y_train1)
predicteds4 = pipelines4.predict(X_test1)
pipelines4.score(X_test1, y_test1)



0.5601539942252165

In [57]:
print(classification_report(y_test1, predicteds4))

                         precision    recall  f1-score   support

      Adelaida District       0.00      0.00      0.00        26
             Alentejano       0.67      0.44      0.53        50
       Alexander Valley       0.00      0.00      0.00        24
                 Alsace       0.76      0.96      0.85       152
             Alto Adige       0.84      0.91      0.88        35
          Amador County       0.00      0.00      0.00        31
        Anderson Valley       0.62      0.16      0.26        31
             Barbaresco       0.38      0.08      0.13        38
                 Barolo       0.60      0.87      0.71       158
               Bordeaux       0.48      0.88      0.62       155
 Brunello di Montalcino       0.52      0.51      0.51        84
             Burgenland       0.59      0.44      0.50        39
                 Cahors       0.69      0.53      0.60        38
             California       0.36      0.80      0.50       240
               Carneros 



In [58]:
# TfidfVectorizer and Logistic Regression with Lasso regularization and multinomial multi_class
pipelines5 = Pipeline([
    ('vect', vect),
    ('cls', LogisticRegression(solver='saga', multi_class='multinomial', penalty = 'l1'))
])
pipelines5.fit(X_train1, y_train1)
predicteds5 = pipelines5.predict(X_test1)
pipelines5.score(X_test1, y_test1)



0.5551010587102984

In [59]:
print(classification_report(y_test1, predicteds5))

                         precision    recall  f1-score   support

      Adelaida District       0.00      0.00      0.00        26
             Alentejano       0.66      0.42      0.51        50
       Alexander Valley       0.00      0.00      0.00        24
                 Alsace       0.78      0.97      0.86       152
             Alto Adige       0.81      0.86      0.83        35
          Amador County       0.00      0.00      0.00        31
        Anderson Valley       0.67      0.19      0.30        31
             Barbaresco       0.31      0.11      0.16        38
                 Barolo       0.59      0.82      0.69       158
               Bordeaux       0.49      0.88      0.63       155
 Brunello di Montalcino       0.49      0.48      0.48        84
             Burgenland       0.61      0.49      0.54        39
                 Cahors       0.67      0.47      0.55        38
             California       0.36      0.79      0.50       240
               Carneros 



### RANDOM FOREST CLASSIFIER, COUNTVECTORIZER AND TFIDFVECTORIZER

Logistic Regression performs better than the Random Forest Classifier on this dataset, with higher accuracy (about 0.56 versus about 0.52 at best).

Among the Random Forest models, the best-performing one was Random Forest with CountVectorizer.

All of the models have better scores than the baseline.


In [60]:
cvecsr = CountVectorizer(strip_accents='unicode', stop_words="english", ngram_range=(1, 1))

In [61]:
# Random Forest with CountVectorizer
pipelines6  = Pipeline([
    ('vect', cvecsr),
    #('tfidf', TfidfTransformer()),
    ('cls', RandomForestClassifier(n_estimators=200, n_jobs = 2))
])
pipelines6.fit(X_train1, y_train1)
predicteds6 = pipelines6.predict(X_test1)
pipelines6.score(X_test1, y_test1)

0.5238209817131858

In [62]:
print(classification_report(y_test1, predicteds6))

                         precision    recall  f1-score   support

      Adelaida District       0.00      0.00      0.00        26
             Alentejano       0.82      0.18      0.30        50
       Alexander Valley       0.00      0.00      0.00        24
                 Alsace       0.69      0.99      0.81       152
             Alto Adige       0.74      0.97      0.84        35
          Amador County       0.00      0.00      0.00        31
        Anderson Valley       0.00      0.00      0.00        31
             Barbaresco       1.00      0.05      0.10        38
                 Barolo       0.49      0.97      0.65       158
               Bordeaux       0.35      0.96      0.51       155
 Brunello di Montalcino       0.73      0.19      0.30        84
             Burgenland       0.70      0.36      0.47        39
                 Cahors       0.80      0.11      0.19        38
             California       0.37      0.79      0.50       240
               Carneros 



In [63]:
# Random Forest with TfidfVectorizer and max_depth = 10
pipelines7 = Pipeline([
    ('vect', vect),
    
    ('cls', RandomForestClassifier(n_estimators=200, n_jobs = 2, max_depth = 10 ))
])
pipelines7.fit(X_train1, y_train1)
predicteds7 = pipelines7.predict(X_test1)
pipelines7.score(X_test1, y_test1)

0.43695861405197306

In [64]:
print(classification_report(y_test1, predicteds7))

                         precision    recall  f1-score   support

      Adelaida District       0.00      0.00      0.00        26
             Alentejano       0.00      0.00      0.00        50
       Alexander Valley       0.00      0.00      0.00        24
                 Alsace       0.61      1.00      0.76       152
             Alto Adige       0.59      0.54      0.57        35
          Amador County       0.00      0.00      0.00        31
        Anderson Valley       0.00      0.00      0.00        31
             Barbaresco       0.00      0.00      0.00        38
                 Barolo       0.37      1.00      0.54       158
               Bordeaux       0.25      0.99      0.40       155
 Brunello di Montalcino       0.00      0.00      0.00        84
             Burgenland       0.00      0.00      0.00        39
                 Cahors       0.00      0.00      0.00        38
             California       0.31      0.88      0.46       240
               Carneros 



### NAIVE BAYES, COUNTVECTORIZER AND TFIDFVECTORIZER

The best Naive Bayes accuracy (about 0.48) came from MultinomialNB with CountVectorizer. MultinomialNB is one of the two classic Naive Bayes variants used in text classification.

Logistic Regression and some of the Random Forest Classifier models perform better than Naive Bayes on this dataset, with higher accuracy.

All of the models have better scores than the baseline.
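
The two Naive Bayes classifiers tried below treat the same vectorizer output differently: MultinomialNB models how often each word occurs, while BernoulliNB (with its default binarize=0.0) only uses whether a word is present. A short sketch on a hypothetical mini-corpus shows the two views of one description:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-corpus used only to fix a vocabulary
corpus = ["cherry oak", "lemon lime"]
vec = CountVectorizer()
vec.fit(corpus)

vocab = sorted(vec.vocabulary_, key=vec.vocabulary_.get)  # column order
counts = vec.transform(["cherry lemon lemon lemon"]).toarray()[0]

# MultinomialNB models the counts; BernoulliNB (binarize=0.0 by default)
# thresholds them to presence/absence before fitting
presence = (counts > 0).astype(int)

print(vocab)
print(counts.tolist(), presence.tolist())
```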


In [65]:
# MultinomialNB with CountVectorizer
pipelines8 = Pipeline([
    ('vect', cvec),
    ('cls', MultinomialNB())
])
pipelines8.fit(X_train1, y_train1)
predicteds8 = pipelines8.predict(X_test1)
pipelines8.score(X_test1, y_test1)

0.4793070259865255

In [66]:
print(classification_report(y_test1, predicteds8))

                         precision    recall  f1-score   support

      Adelaida District       0.00      0.00      0.00        26
             Alentejano       0.60      0.12      0.20        50
       Alexander Valley       0.00      0.00      0.00        24
                 Alsace       0.64      1.00      0.78       152
             Alto Adige       0.70      0.94      0.80        35
          Amador County       0.00      0.00      0.00        31
        Anderson Valley       0.00      0.00      0.00        31
             Barbaresco       0.00      0.00      0.00        38
                 Barolo       0.45      1.00      0.62       158
               Bordeaux       0.31      1.00      0.48       155
 Brunello di Montalcino       0.56      0.06      0.11        84
             Burgenland       0.67      0.15      0.25        39
                 Cahors       1.00      0.08      0.15        38
             California       0.33      0.80      0.47       240
               Carneros 



In [67]:
# BernoulliNB with CountVectorizer
pipelines9 = Pipeline([
    ('vect', cvec),
    ('cls', BernoulliNB())
])
pipelines9.fit(X_train1, y_train1)
predicteds9 = pipelines9.predict(X_test1)
pipelines9.score(X_test1, y_test1)

0.40495668912415783

In [69]:
print(classification_report(y_test1, predicteds9))

                         precision    recall  f1-score   support

      Adelaida District       0.00      0.00      0.00        26
             Alentejano       0.00      0.00      0.00        50
       Alexander Valley       0.00      0.00      0.00        24
                 Alsace       0.62      1.00      0.76       152
             Alto Adige       0.83      0.29      0.43        35
          Amador County       0.00      0.00      0.00        31
        Anderson Valley       0.00      0.00      0.00        31
             Barbaresco       0.00      0.00      0.00        38
                 Barolo       0.39      1.00      0.56       158
               Bordeaux       0.25      1.00      0.39       155
 Brunello di Montalcino       0.00      0.00      0.00        84
             Burgenland       0.00      0.00      0.00        39
                 Cahors       0.00      0.00      0.00        38
             California       0.25      0.86      0.39       240
               Carneros 

In [70]:
# MultinomialNB with TfidfVectorizer
pipelines10 = Pipeline([
    ('vect',vect) ,
    ('cls', MultinomialNB())
])
pipelines10.fit(X_train1, y_train1)
predicteds10 = pipelines10.predict(X_test1)
pipelines10.score(X_test1, y_test1)

0.4025505293551492

In [71]:
print(classification_report(y_test1, predicteds10))

                         precision    recall  f1-score   support

      Adelaida District       0.00      0.00      0.00        26
             Alentejano       0.00      0.00      0.00        50
       Alexander Valley       0.00      0.00      0.00        24
                 Alsace       0.61      1.00      0.75       152
             Alto Adige       0.79      0.54      0.64        35
          Amador County       0.00      0.00      0.00        31
        Anderson Valley       0.00      0.00      0.00        31
             Barbaresco       0.00      0.00      0.00        38
                 Barolo       0.37      1.00      0.54       158
               Bordeaux       0.25      1.00      0.40       155
 Brunello di Montalcino       0.00      0.00      0.00        84
             Burgenland       0.00      0.00      0.00        39
                 Cahors       0.00      0.00      0.00        38
             California       0.23      0.94      0.37       240
               Carneros 



In [72]:
# BernoulliNB with TfidfVectorizer
pipelines11 = Pipeline([
    ('vect', vect),
    ('cls', BernoulliNB())
])
pipelines11.fit(X_train1, y_train1)
predicteds11 = pipelines11.predict(X_test1)
pipelines11.score(X_test1, y_test1)

0.40495668912415783

The best score came from CountVectorizer and Logistic Regression with Lasso regularization. All of the other models also performed better than the baseline.
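
For a side-by-side view, the test accuracies reported in the cell outputs above can be collected and ranked. The numbers are copied from those outputs and rounded to four decimals; the model labels are shorthand chosen here for readability:

```python
# Test accuracies as reported in the cell outputs above (rounded to 4 places)
baseline = 0.0577
scores = {
    "CountVectorizer + LR (L2, ovr)": 0.5553,
    "CountVectorizer + LR (L1, ovr)": 0.5630,
    "CountVectorizer + LR (L1, multinomial)": 0.5469,
    "TfidfVectorizer + LR (L2, ovr)": 0.5534,
    "TfidfVectorizer + LR (L1, ovr)": 0.5602,
    "TfidfVectorizer + LR (L1, multinomial)": 0.5551,
    "CountVectorizer + Random Forest": 0.5238,
    "TfidfVectorizer + Random Forest (max_depth=10)": 0.4370,
    "CountVectorizer + MultinomialNB": 0.4793,
    "CountVectorizer + BernoulliNB": 0.4050,
    "TfidfVectorizer + MultinomialNB": 0.4026,
    "TfidfVectorizer + BernoulliNB": 0.4050,
}

# Rank models from best to worst and show each one relative to the baseline
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
for name, acc in ranked:
    print(f"{acc:.4f}  ({acc / baseline:4.1f}x baseline)  {name}")

best_name, best_acc = ranked[0]
```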


# BIG DATASET

By looking at the provinces, we can see that there are 450 provinces in this dataset, each with a different number of entries, and many with only one. The latter are of no use, because we cannot train and test on a single sample. I will keep the provinces that appear more than 100 times. I will also reduce the number of classes by merging some of them.

I am going to train the models that performed best on the small dataset.

In [18]:
vino = pd.read_csv('big_wineV1.csv')

In [45]:
vino.head()

Unnamed: 0.1,Unnamed: 0,country,description,designation,points,price,province,region_1,taster_name,title,variety,winery,vintage
0,1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,,Roger Voss,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos,2011.0
1,2,US,"Tart and snappy, the flavors of lime flesh and...",,87,14.0,Oregon,Willamette Valley,Paul Gregutt,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,Rainstorm,2013.0
2,3,US,"Pineapple rind, lemon pith and orange blossom ...",Reserve Late Harvest,87,13.0,Michigan,Lake Michigan Shore,Alexander Peartree,St. Julian 2013 Reserve Late Harvest Riesling ...,Riesling,St. Julian,2013.0
3,4,US,"Much like the regular bottling from 2012, this...",Vintner's Reserve Wild Child Block,87,65.0,Oregon,Willamette Valley,Paul Gregutt,Sweet Cheeks 2012 Vintner's Reserve Wild Child...,Pinot Noir,Sweet Cheeks,2012.0
4,5,Spain,Blackberry and raspberry aromas show a typical...,Ars In Vitro,87,15.0,Northern Spain,Navarra,Michael Schachner,Tandem 2011 Ars In Vitro Tempranillo-Merlot (N...,Tempranillo-Merlot,Tandem,2011.0


In [19]:
# Take an explicit copy so that the province column can be reassigned safely below
vino1 = vino[vino['province'].map(vino['province'].value_counts()) > 100].copy()

In [22]:
vino1.province = vino1.province.apply(lambda x: str(x).replace('Blaye Côtes de Bordeaux', 'Bordeaux').
                                    replace('Bordeaux Rosé', 'Bordeaux').replace('Bordeaux Supérieur', 'Bordeaux').
                                   replace('Bordeaux Blanc', 'Bordeaux').replace('Sta. Rita Hills', 'California').
                                   replace('Cava', 'Penedes').replace('Mendocino County', 'California').
                                   replace('Russian River Valley', 'Sonoma Valley').replace('North Coast', 'Sonoma Valley')
                                  )
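
The merge above can also be written as an exact-match lookup. This is only a sketch: `PROVINCE_MERGES` and `merge_provinces` are hypothetical names, not part of the notebook. Unlike the chained `str.replace` calls, `Series.replace` with a dict only rewrites values that match a key in full, so a province name that merely contains one of the keys is left alone.

```python
import pandas as pd

# Hypothetical mapping with the same pairs as the chained replace() calls above.
PROVINCE_MERGES = {
    'Blaye Côtes de Bordeaux': 'Bordeaux',
    'Bordeaux Rosé': 'Bordeaux',
    'Bordeaux Supérieur': 'Bordeaux',
    'Bordeaux Blanc': 'Bordeaux',
    'Sta. Rita Hills': 'California',
    'Cava': 'Penedes',
    'Mendocino County': 'California',
    'Russian River Valley': 'Sonoma Valley',
    'North Coast': 'Sonoma Valley',
}


def merge_provinces(provinces: pd.Series) -> pd.Series:
    """Replace whole province names found in PROVINCE_MERGES; keep the rest."""
    return provinces.replace(PROVINCE_MERGES)


# Tiny illustrative Series (not taken from the dataset).
demo = pd.Series(['Bordeaux Blanc', 'Cava', 'Oregon', 'North Coast'])
merged = merge_provinces(demo)
print(merged.tolist())
```

With the real data the call would be `vino1['province'] = merge_provinces(vino1['province'])`.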

In [21]:
len(vino1.province.unique())

84

In [49]:
print("There are {} types of province in this dataset such as {}... \n".
      format(len(vino1.province.unique()), ", ".join(vino.province.unique()[0:5])))

There are 84 types of province in this dataset such as Douro, Oregon, Michigan, Northern Spain, Sicily & Sardinia... 



In [51]:
print('The province baseline is', vino1.province.value_counts(normalize=True).max())

The province baseline is 0.30702786353264433
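
The baseline is the accuracy of always predicting the most frequent province. As a sanity check, the same number can be obtained from scikit-learn's `DummyClassifier`. The sketch below uses a small made-up label array (`y_demo`), not the wine data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Made-up labels: 6 of 10 samples belong to the majority class.
y_demo = np.array(['California'] * 6 + ['Bordeaux'] * 3 + ['Oregon'] * 1)
X_demo = np.zeros((len(y_demo), 1))  # features are ignored by this strategy

dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(X_demo, y_demo)

# Accuracy of always answering 'California' equals the majority-class share.
baseline_accuracy = dummy.score(X_demo, y_demo)
print(baseline_accuracy)
```

Any model worth keeping should beat this number on the test set.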


### DEFINE THE VARIABLE PROVINCE AS A TARGET

In [23]:
yb = vino1.province
Xb =  vino1.description

In [24]:
yb.shape, Xb.shape

((126653,), (126653,))

In [25]:
#Split and stratify the data
X_trainb, X_testb, y_trainb, y_testb = train_test_split(Xb, yb, test_size=0.2, stratify = yb, random_state = 1)
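
Because the classes are very unbalanced, `stratify` matters here: it keeps each province's share the same in the train and test sets. A minimal sketch of how this could be checked, using made-up labels (`y_demo`, `X_demo`) rather than the wine data:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up, unbalanced labels: 60 / 30 / 10 samples per class.
y_demo = pd.Series(['California'] * 60 + ['Bordeaux'] * 30 + ['Oregon'] * 10)
X_demo = pd.Series([f'description {i}' for i in range(len(y_demo))])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=1)

# Class shares in the full data and in each split, side by side.
proportions = pd.DataFrame({
    'full': y_demo.value_counts(normalize=True),
    'train': y_tr.value_counts(normalize=True),
    'test': y_te.value_counts(normalize=True),
})
print(proportions)
```

On the real data the same table can be built from `yb`, `y_trainb` and `y_testb`.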

### LOGISTIC REGRESSION, COUNTVECTORIZER AND TFIDFVECTORIZER

The best accuracy I obtained on this dataset with Logistic Regression was with CountVectorizer and multi_class 'ovr' (0.710 with Lasso regularization, 0.707 with Ridge). The same combination was best in the small dataset. The cross-validation score (mean 0.705) is very similar to the test accuracy for this model.

All of the models have better scores than the baseline.

In [26]:
cvec = CountVectorizer(strip_accents='unicode', stop_words="english", ngram_range=(1, 1))

In [95]:
cvec.fit_transform(Xb)

<126653x30257 sparse matrix of type '<class 'numpy.int64'>'
	with 3008837 stored elements in Compressed Sparse Row format>
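
The cell above fits the vectorizer on all of `Xb` only to inspect the vocabulary size; inside a `Pipeline` the vectorizer is fitted again on the training data alone. To illustrate what the chosen options do, here is a small sketch on two made-up sentences (`docs` is not taken from the dataset): `strip_accents='unicode'` folds accented characters to plain ones, and `stop_words='english'` drops common English words.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up descriptions, only for illustration.
docs = ["The Rosé is fresh and crisp",
        "A Supérieur blend with ripe plum and oak"]

cv = CountVectorizer(strip_accents='unicode', stop_words='english', ngram_range=(1, 1))
counts = cv.fit_transform(docs)  # sparse matrix: one row per document

# vocabulary_ maps each kept token to its column index.
vocab = sorted(cv.vocabulary_)
print(vocab)
print(counts.shape)
```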

In [96]:
# CountVectorizer and Logistic Regression with Ridge regularization
pipeline = Pipeline([
    ('vect', cvec),
    ('cls', LogisticRegression(penalty = 'l2', solver='lbfgs', multi_class='ovr', verbose = 1, n_jobs = 2))
])
pipeline.fit(X_trainb, y_trainb)
predicte = pipeline.predict(X_testb)
pipeline.score(X_testb, y_testb)

[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  46 tasks      | elapsed:   29.1s
[Parallel(n_jobs=2)]: Done  84 out of  84 | elapsed:   52.8s finished


0.7073546247680708

In [97]:
print(classification_report(y_testb, predicte))

                      precision    recall  f1-score   support

    Aconcagua Valley       0.00      0.00      0.00        22
          Alentejano       0.45      0.26      0.33       171
            Alentejo       0.36      0.11      0.17        35
              Alsace       0.68      0.77      0.72       502
             America       0.00      0.00      0.00        20
           Andalucia       0.90      0.47      0.62        38
     Australia Other       0.29      0.12      0.17        49
            Bairrada       0.94      0.48      0.64        33
          Beaujolais       0.62      0.55      0.58       206
            Bordeaux       0.60      0.76      0.67       917
    British Columbia       0.50      0.03      0.06        30
          Burgenland       0.50      0.31      0.39       118
            Burgundy       0.54      0.68      0.60       700
    Cachapoal Valley       0.14      0.02      0.04        42
          California       0.86      0.98      0.91      7777
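
Several small provinces in the report get a precision of 0.00 because the model never predicts them, so overall accuracy hides how poorly the rare classes are handled. A sketch of how this could be quantified with macro and weighted F1, on made-up labels. It assumes a scikit-learn version that supports the `zero_division` argument (0.22 or later), which may be newer than the one used to produce the outputs here.

```python
from sklearn.metrics import classification_report, f1_score

# Made-up labels: 'America' is present in y_true but never predicted.
y_true = ['California'] * 8 + ['Alsace'] + ['America']
y_pred = ['California'] * 8 + ['Alsace'] + ['California']

# zero_division=0 reports undefined precision as 0 without emitting a warning.
report = classification_report(y_true, y_pred, zero_division=0)
print(report)

# Macro F1 weights every class equally; weighted F1 weights by class support.
macro_f1 = f1_score(y_true, y_pred, average='macro', zero_division=0)
weighted_f1 = f1_score(y_true, y_pred, average='weighted', zero_division=0)
print(macro_f1, weighted_f1)
```

A large gap between the two averages indicates that the score is carried by the big classes.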
       

In [27]:
# CountVectorizer and Logistic Regression with Lasso regularization
pipelineb1 = Pipeline([
    ('vect', cvec),
    ('cls', LogisticRegression(penalty = 'l1', solver='saga', multi_class='ovr', verbose = 1, n_jobs = 2))
])
pipelineb1.fit(X_trainb, y_trainb)
predictedb1 = pipelineb1.predict(X_testb)
pipelineb1.score(X_testb, y_testb)

[Parallel(n_jobs=2)]: Using backend ThreadingBackend with 2 concurrent workers.


max_iter reached after 31 seconds




max_iter reached after 48 seconds
max_iter reached after 27 seconds
max_iter reached after 28 seconds
max_iter reached after 28 seconds
max_iter reached after 81 seconds
max_iter reached after 38 seconds
max_iter reached after 24 seconds
max_iter reached after 48 seconds
max_iter reached after 32 seconds
max_iter reached after 91 seconds
max_iter reached after 52 seconds
max_iter reached after 35 seconds
max_iter reached after 82 seconds
max_iter reached after 28 seconds
max_iter reached after 54 seconds
max_iter reached after 65 seconds
max_iter reached after 64 seconds
max_iter reached after 31 seconds
max_iter reached after 254 seconds
max_iter reached after 44 seconds
max_iter reached after 52 seconds
max_iter reached after 31 seconds
max_iter reached after 51 seconds
max_iter reached after 40 seconds
max_iter reached after 66 seconds
max_iter reached after 31 seconds
max_iter reached after 55 seconds
max_iter reached after 41 seconds
max_iter reached after 54 seconds
max_iter reac

[Parallel(n_jobs=2)]: Done  46 tasks      | elapsed: 20.3min


max_iter reached after 66 seconds
max_iter reached after 37 seconds
max_iter reached after 135 seconds
max_iter reached after 72 seconds
max_iter reached after 64 seconds
max_iter reached after 125 seconds
max_iter reached after 96 seconds
max_iter reached after 37 seconds
max_iter reached after 142 seconds
max_iter reached after 67 seconds
max_iter reached after 32 seconds
max_iter reached after 50 seconds
max_iter reached after 195 seconds
max_iter reached after 41 seconds
max_iter reached after 84 seconds
max_iter reached after 57 seconds
max_iter reached after 33 seconds
max_iter reached after 42 seconds
max_iter reached after 42 seconds
max_iter reached after 70 seconds
max_iter reached after 63 seconds
max_iter reached after 66 seconds
max_iter reached after 89 seconds
max_iter reached after 46 seconds
max_iter reached after 60 seconds
max_iter reached after 31 seconds
max_iter reached after 43 seconds
max_iter reached after 27 seconds
max_iter reached after 23 seconds
max_iter r

[Parallel(n_jobs=2)]: Done  84 out of  84 | elapsed: 42.0min finished


0.7100390825470767
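
The repeated `max_iter reached` lines in the log mean that the saga solver stopped at the iteration limit (scikit-learn's default is `max_iter=100`) before meeting its convergence tolerance, so the coefficients of those one-vs-rest fits may not be fully optimised. The sketch below shows one way to detect this and to compare against a larger limit. It uses synthetic data from `make_classification`, and `fit_saga` is a hypothetical helper; recent scikit-learn releases may also emit deprecation warnings for the `penalty` argument.

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression

# Synthetic binary problem, only for illustration.
X_demo, y_demo = make_classification(n_samples=300, n_features=20, random_state=1)


def fit_saga(max_iter):
    """Fit an L1 logistic regression with saga.

    Returns (epochs actually run, True if a ConvergenceWarning was raised).
    """
    clf = LogisticRegression(penalty='l1', solver='saga',
                             max_iter=max_iter, random_state=1)
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter('always')
        clf.fit(X_demo, y_demo)
    hit_limit = any(issubclass(w.category, ConvergenceWarning) for w in caught)
    return int(clf.n_iter_.max()), hit_limit


epochs_short, limited_short = fit_saga(max_iter=2)
epochs_long, limited_long = fit_saga(max_iter=5000)
print(epochs_short, limited_short)
print(epochs_long, limited_long)
```

On the wine data a larger `max_iter` would lengthen the already long training time, so it is a trade-off rather than a free improvement.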

In [28]:
#Cross Validation
cross = cross_val_score(pipelineb1, X_trainb, y_trainb, cv=5)
print(cross)
print(cross.mean())

[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  46 tasks      | elapsed: 15.0min
[Parallel(n_jobs=2)]: Done  84 out of  84 | elapsed: 30.1min finished
[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  46 tasks      | elapsed: 14.7min
[Parallel(n_jobs=2)]: Done  84 out of  84 | elapsed: 30.6min finished
[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  46 tasks      | elapsed: 15.8min
[Parallel(n_jobs=2)]: Done  84 out of  84 | elapsed: 30.6min finished
[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  46 tasks      | elapsed: 14.4min
[Parallel(n_jobs=2)]: Done  84 out of  84 | elapsed: 28.9min finished
[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  46 tasks      | elapsed: 15.3min
[Parallel(n_jobs=2)]: Done  84 out of  8

[0.70323661 0.70395386 0.70620465 0.70485974 0.70477461]
0.7046058942774598
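
To make the comparison between cross-validation and test accuracy explicit, the fold scores printed above can be summarised as a mean and a spread. The numbers in this sketch are copied from the outputs of this notebook; the variable names are only illustrative.

```python
import numpy as np

# Fold accuracies and test accuracy as printed in the cells above.
cv_scores = np.array([0.70323661, 0.70395386, 0.70620465, 0.70485974, 0.70477461])
test_accuracy = 0.7100390825470767

cv_mean = cv_scores.mean()
cv_std = cv_scores.std(ddof=1)   # sample standard deviation across folds
gap = test_accuracy - cv_mean    # positive: test set scored slightly higher

print(f"CV accuracy: {cv_mean:.4f} +/- {cv_std:.4f}; test minus CV: {gap:.4f}")
```

The folds differ by well under one percentage point, which suggests the model's accuracy is stable across splits.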


In [29]:
print(classification_report(y_testb, predictedb1))

                      precision    recall  f1-score   support

    Aconcagua Valley       0.00      0.00      0.00        22
          Alentejano       0.45      0.27      0.34       171
            Alentejo       0.67      0.11      0.20        35
              Alsace       0.68      0.79      0.73       502
             America       0.00      0.00      0.00        20
           Andalucia       0.85      0.61      0.71        38
     Australia Other       0.29      0.10      0.15        49
            Bairrada       1.00      0.55      0.71        33
          Beaujolais       0.64      0.53      0.58       206
            Bordeaux       0.59      0.75      0.66       917
    British Columbia       0.00      0.00      0.00        30
          Burgenland       0.54      0.38      0.45       118
            Burgundy       0.55      0.71      0.62       700
    Cachapoal Valley       0.00      0.00      0.00        42
          California       0.86      0.97      0.91      7777
       

In [126]:
vect = TfidfVectorizer(stop_words='english', sublinear_tf=True, max_features=1000,
                             ngram_range = (1,1), strip_accents='unicode')
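
Two of these options change the features noticeably. `sublinear_tf=True` replaces a raw term count `tf` with `1 + log(tf)`, so a word repeated many times in one description counts for less. `max_features=1000` keeps only the 1000 terms with the highest total count in the corpus. A small sketch on made-up documents, with idf and normalisation switched off so that the sublinear effect is visible on its own:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up documents: 'oak' appears three times in the first one.
docs = ["oak oak oak cherry", "cherry plum"]

# use_idf=False and norm=None isolate the sublinear term-frequency scaling.
tf_only = TfidfVectorizer(sublinear_tf=True, use_idf=False, norm=None)
weights = tf_only.fit_transform(docs).toarray()
oak_weight = weights[0, tf_only.vocabulary_['oak']]
print(oak_weight, 1 + np.log(3))

# max_features keeps the terms with the highest total count over the corpus.
top2 = TfidfVectorizer(max_features=2).fit(docs)
kept = sorted(top2.vocabulary_)
print(kept)
```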

In [127]:
# TfidfVectorizer and Logistic Regression with Ridge regularization
pipelinev_ridge = Pipeline([
    ('vect', vect),
    ('cls', LogisticRegression(penalty = 'l2', solver='lbfgs', multi_class='ovr', verbose = 1, n_jobs = 2))
])
pipelinev_ridge.fit(X_trainb, y_trainb)
predictedv_ridge = pipelinev_ridge.predict(X_testb)
pipelinev_ridge.score(X_testb, y_testb)

[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  46 tasks      | elapsed:   13.2s
[Parallel(n_jobs=2)]: Done  85 out of  85 | elapsed:   23.8s finished


0.5773804740859783

In [130]:
print(classification_report(y_testb, predictedv_ridge))

                               precision    recall  f1-score   support

                    Aglianico       0.00      0.00      0.00        66
                     Albariño       0.39      0.12      0.18       130
                      Barbera       0.50      0.05      0.09       135
                Blaufränkisch       0.62      0.11      0.19        45
                      Bonarda       0.00      0.00      0.00        21
     Bordeaux-style Red Blend       0.58      0.65      0.61      1189
   Bordeaux-style White Blend       0.51      0.15      0.23       154
               Cabernet Franc       0.75      0.28      0.41       282
           Cabernet Sauvignon       0.53      0.70      0.60      2001
    Cabernet Sauvignon-Merlot       0.00      0.00      0.00        25
    Cabernet Sauvignon-Shiraz       0.00      0.00      0.00        21
                    Carmenère       0.40      0.19      0.26       122
              Champagne Blend       0.71      0.42      0.53       279
     

In [128]:
# TfidfVectorizer and Logistic Regression with Lasso regularization, with max_features = 1000
pipelinev = Pipeline([
    ('vect', TfidfVectorizer(stop_words='english', sublinear_tf=True, max_features=1000,
                             ngram_range = (1,1), strip_accents='unicode')),
    ('cls', LogisticRegression(penalty = 'l1', solver='saga', multi_class='ovr', verbose = 1, n_jobs = 2))
])
pipelinev.fit(X_trainb, y_trainb)
predictedv = pipelinev.predict(X_testb)
pipelinev.score(X_testb, y_testb)

[Parallel(n_jobs=2)]: Using backend ThreadingBackend with 2 concurrent workers.


max_iter reached after 10 seconds




max_iter reached after 13 seconds
max_iter reached after 9 seconds
max_iter reached after 13 seconds
max_iter reached after 6 seconds
max_iter reached after 13 seconds
convergence after 89 epochs took 19 seconds
convergence after 38 epochs took 10 seconds
max_iter reached after 15 seconds
max_iter reached after 6 seconds
max_iter reached after 6 seconds
max_iter reached after 13 seconds
max_iter reached after 17 seconds
convergence after 35 epochs took 9 seconds
max_iter reached after 11 seconds
max_iter reached after 13 seconds
max_iter reached after 8 seconds
max_iter reached after 7 seconds
max_iter reached after 7 seconds
max_iter reached after 7 seconds
max_iter reached after 6 seconds
max_iter reached after 13 seconds
max_iter reached after 8 seconds
convergence after 30 epochs took 4 seconds
convergence after 76 epochs took 11 seconds
max_iter reached after 11 seconds
max_iter reached after 6 seconds
max_iter reached after 7 seconds
convergence after 78 epochs took 11 seconds
ma

[Parallel(n_jobs=2)]: Done  46 tasks      | elapsed:  3.7min


max_iter reached after 8 seconds
max_iter reached after 8 seconds
max_iter reached after 12 seconds
max_iter reached after 29 seconds
convergence after 84 epochs took 14 seconds
max_iter reached after 13 seconds
convergence after 33 epochs took 7 seconds
max_iter reached after 7 seconds
convergence after 26 epochs took 6 seconds
max_iter reached after 9 seconds
convergence after 99 epochs took 18 seconds
convergence after 44 epochs took 10 seconds
max_iter reached after 7 seconds
max_iter reached after 7 seconds
max_iter reached after 19 seconds
convergence after 31 epochs took 5 seconds
max_iter reached after 8 seconds
max_iter reached after 12 seconds
convergence after 34 epochs took 7 seconds
max_iter reached after 7 seconds
convergence after 34 epochs took 6 seconds
convergence after 39 epochs took 9 seconds
max_iter reached after 6 seconds
max_iter reached after 7 seconds
convergence after 49 epochs took 9 seconds
max_iter reached after 13 seconds
max_iter reached after 8 seconds


[Parallel(n_jobs=2)]: Done  85 out of  85 | elapsed:  6.8min finished


0.5838087585375653

In [129]:
print(classification_report(y_testb, predictedv))

                               precision    recall  f1-score   support

                    Aglianico       0.40      0.03      0.06        66
                     Albariño       0.35      0.13      0.19       130
                      Barbera       0.39      0.09      0.14       135
                Blaufränkisch       0.64      0.16      0.25        45
                      Bonarda       0.00      0.00      0.00        21
     Bordeaux-style Red Blend       0.60      0.66      0.63      1189
   Bordeaux-style White Blend       0.45      0.21      0.28       154
               Cabernet Franc       0.71      0.30      0.42       282
           Cabernet Sauvignon       0.54      0.69      0.60      2001
    Cabernet Sauvignon-Merlot       0.00      0.00      0.00        25
    Cabernet Sauvignon-Shiraz       0.00      0.00      0.00        21
                    Carmenère       0.43      0.24      0.31       122
              Champagne Blend       0.68      0.45      0.54       279
     

In [31]:
# TfidfVectorizer and multinomial Logistic Regression with Lasso regularization, without max_features = 1000
pipelinev_mul = Pipeline([
    ('vect', vect),
    ('cls', LogisticRegression(solver='saga', multi_class='multinomial', penalty = 'l1',verbose = 1, n_jobs = 2 ))
])
pipelinev_mul.fit(X_trainb, y_trainb)
predictedv_mul = pipelinev_mul.predict(X_testb)
pipelinev_mul.score(X_testb, y_testb)

[Parallel(n_jobs=2)]: Using backend ThreadingBackend with 2 concurrent workers.


max_iter reached after 593 seconds


[Parallel(n_jobs=2)]: Done   1 out of   1 | elapsed:  9.9min finished


0.7027357782953693

### RANDOM FOREST CLASSIFIER WITH COUNTVECTORIZER AND TFIDFVECTORIZER

In this case Random Forest performs slightly better with TfidfVectorizer (0.543) than with CountVectorizer (0.537).

Logistic Regression performs better than the Random Forest Classifier on this dataset, with higher accuracy.

All of the models have better scores than the baseline.


In [102]:
cvecbr = CountVectorizer(strip_accents='unicode', stop_words="english", ngram_range=(1, 1),  max_features=1000)

In [103]:
# Random Forest with CountVectorizer
pipelinebr1 = Pipeline([
    ('vect', cvecbr),
    ('cls', RandomForestClassifier(n_estimators=200, n_jobs = 2))
])
pipelinebr1.fit(X_trainb, y_trainb)
predictedbr1 = pipelinebr1.predict(X_testb)
pipelinebr1.score(X_testb, y_testb)

0.5367731238403537

In [104]:
print(classification_report(y_testb, predictedbr1))

                      precision    recall  f1-score   support

    Aconcagua Valley       0.00      0.00      0.00        22
          Alentejano       0.29      0.01      0.02       171
            Alentejo       0.00      0.00      0.00        35
              Alsace       0.58      0.61      0.60       502
             America       0.00      0.00      0.00        20
           Andalucia       1.00      0.03      0.05        38
     Australia Other       0.00      0.00      0.00        49
            Bairrada       0.00      0.00      0.00        33
          Beaujolais       0.57      0.20      0.29       206
            Bordeaux       0.38      0.72      0.50       917
    British Columbia       0.00      0.00      0.00        30
          Burgenland       0.50      0.02      0.03       118
            Burgundy       0.36      0.52      0.42       700
    Cachapoal Valley       0.00      0.00      0.00        42
          California       0.56      0.99      0.72      7777
       

In [14]:
# Random Forest with TfidfVectorizer
pipelinebr3 = Pipeline([
    ('vect', vect),
    ('cls', RandomForestClassifier(n_estimators=200, n_jobs = 2))
])
pipelinebr3.fit(X_trainb, y_trainb)
predictedbr3 = pipelinebr3.predict(X_testb)
pipelinebr3.score(X_testb, y_testb)

0.5429710631242352

In [15]:
print(classification_report(y_testb, predictedbr3))

                      precision    recall  f1-score   support

    Aconcagua Valley       0.00      0.00      0.00        22
          Alentejano       0.69      0.06      0.12       171
            Alentejo       0.00      0.00      0.00        35
              Alsace       0.64      0.57      0.60       502
             America       0.00      0.00      0.00        20
           Andalucia       1.00      0.05      0.10        38
     Australia Other       0.00      0.00      0.00        49
            Bairrada       0.00      0.00      0.00        33
          Beaujolais       0.78      0.17      0.28       206
            Bordeaux       0.40      0.73      0.52       917
    British Columbia       0.00      0.00      0.00        30
          Burgenland       0.00      0.00      0.00       118
            Burgundy       0.40      0.55      0.46       700
    Cachapoal Valley       0.00      0.00      0.00        42
          California       0.52      1.00      0.68      7777
       

### NAIVE BAYES, COUNTVECTORIZER AND TFIDFVECTORIZER

As in the small dataset, the best Naive Bayes accuracy came from MultinomialNB with CountVectorizer (0.635).

Logistic Regression still performs better on this dataset, with higher accuracy.

All of the models have better scores than the baseline.


In [16]:
vect = CountVectorizer(ngram_range=(1, 1), strip_accents='unicode', stop_words='english')

In [17]:
# MultinomialNB with CountVectorizer and TfidfTransformer
pipelinebn = Pipeline([
    ('vect', vect),
    ('tfidf', TfidfTransformer()),
    ('cls', MultinomialNB())
])
pipelinebn.fit(X_trainb, y_trainb)
predictedbn = pipelinebn.predict(X_testb)
pipelinebn.score(X_testb, y_testb)

0.42951324464095375

In [18]:
print(classification_report(y_testb, predictedbn))

                      precision    recall  f1-score   support

    Aconcagua Valley       0.00      0.00      0.00        22
          Alentejano       0.00      0.00      0.00       171
            Alentejo       0.00      0.00      0.00        35
              Alsace       0.58      0.16      0.25       502
             America       0.00      0.00      0.00        20
           Andalucia       0.00      0.00      0.00        38
     Australia Other       0.00      0.00      0.00        49
            Bairrada       0.00      0.00      0.00        33
          Beaujolais       0.00      0.00      0.00       206
            Bordeaux       0.47      0.46      0.47       917
    British Columbia       0.00      0.00      0.00        30
          Burgenland       0.00      0.00      0.00       118
            Burgundy       0.62      0.22      0.33       700
    Cachapoal Valley       0.00      0.00      0.00        42
          California       0.39      1.00      0.56      7777
       

In [19]:
# MultinomialNB with CountVectorizer
pipelinebn1 = Pipeline([
    ('vect', vect),
    ('cls', MultinomialNB())
])
pipelinebn1.fit(X_trainb, y_trainb)
predictedbn1 = pipelinebn1.predict(X_testb)
pipelinebn1.score(X_testb, y_testb)

0.6347953100943508

In [23]:
print(classification_report(y_testb, predictedbn1))

                      precision    recall  f1-score   support

    Aconcagua Valley       0.00      0.00      0.00        22
          Alentejano       0.68      0.10      0.17       171
            Alentejo       0.00      0.00      0.00        35
              Alsace       0.51      0.74      0.60       502
             America       0.00      0.00      0.00        20
           Andalucia       0.00      0.00      0.00        38
     Australia Other       0.00      0.00      0.00        49
            Bairrada       0.00      0.00      0.00        33
          Beaujolais       1.00      0.04      0.07       206
            Bordeaux       0.41      0.84      0.55       917
    British Columbia       0.00      0.00      0.00        30
          Burgenland       0.67      0.03      0.06       118
            Burgundy       0.37      0.72      0.49       700
    Cachapoal Valley       0.00      0.00      0.00        42
          California       0.78      0.98      0.87      7777
       

In [24]:
# BernoulliNB with CountVectorizer
pipelinebn2 = Pipeline([
    ('vect', vect),
    ('cls', BernoulliNB())
])
pipelinebn2.fit(X_trainb, y_trainb)
predictedbn2 = pipelinebn2.predict(X_testb)
pipelinebn2.score(X_testb, y_testb)

0.5981998341952548

In [25]:
print(classification_report(y_testb, predictedbn2))

                      precision    recall  f1-score   support

    Aconcagua Valley       0.00      0.00      0.00        22
          Alentejano       0.00      0.00      0.00       171
            Alentejo       0.00      0.00      0.00        35
              Alsace       0.52      0.68      0.59       502
             America       0.00      0.00      0.00        20
           Andalucia       0.00      0.00      0.00        38
     Australia Other       0.00      0.00      0.00        49
            Bairrada       0.00      0.00      0.00        33
          Beaujolais       0.00      0.00      0.00       206
            Bordeaux       0.36      0.85      0.51       917
    British Columbia       0.00      0.00      0.00        30
          Burgenland       0.00      0.00      0.00       118
            Burgundy       0.33      0.66      0.44       700
    Cachapoal Valley       0.00      0.00      0.00        42
          California       0.71      0.99      0.83      7777
       

In [26]:
vect1 = TfidfVectorizer(stop_words='english', sublinear_tf=True,
                             ngram_range = (1,1), strip_accents='unicode')

In [27]:
# MultinomialNB with TfidfVectorizer
pipelinebn3 = Pipeline([
    ('vect',vect1) ,
    ('cls', MultinomialNB())
])
pipelinebn3.fit(X_trainb, y_trainb)
predictedbn3 = pipelinebn3.predict(X_testb)
pipelinebn3.score(X_testb, y_testb)

0.4303027910465438

In [28]:
#print(classification_report(y_testb, predictedbn3))

In [29]:
# BernoulliNB with TfidfVectorizer
pipelinebn4 = Pipeline([
    ('vect', vect1),
    ('cls', BernoulliNB())
])
pipelinebn4.fit(X_trainb, y_trainb)
predictedbn4 = pipelinebn4.predict(X_testb)
pipelinebn4.score(X_testb, y_testb)

0.5981998341952548
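
This score is identical to the BernoulliNB result with CountVectorizer. A likely explanation is that `BernoulliNB` binarizes its input at 0 by default (`binarize=0.0`), so it only sees whether a word is present, and raw counts and tf-idf weights lead to the same binary matrix. The sketch below checks this on a few made-up descriptions (`docs`, `labels` and `fit_predict` are illustrative names only).

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import Pipeline

# Made-up descriptions and provinces, only for illustration.
docs = ["ripe cherry and oak", "crisp lime and green apple",
        "cherry cherry plum with oak", "lime zest and apple"]
labels = ["California", "Oregon", "California", "Oregon"]


def fit_predict(vectorizer):
    """Fit vectorizer + BernoulliNB on docs and return the predictions for docs."""
    pipe = Pipeline([('vect', vectorizer), ('cls', BernoulliNB())])
    pipe.fit(docs, labels)
    return pipe.predict(docs)


preds_count = fit_predict(CountVectorizer(stop_words='english'))
preds_tfidf = fit_predict(TfidfVectorizer(stop_words='english', sublinear_tf=True))

# Both vectorizers produce the same presence/absence pattern after binarization.
same = np.array_equal(preds_count, preds_tfidf)
print(same)
```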

In [30]:
print(classification_report(y_testb, predictedbn4))

                      precision    recall  f1-score   support

    Aconcagua Valley       0.00      0.00      0.00        22
          Alentejano       0.00      0.00      0.00       171
            Alentejo       0.00      0.00      0.00        35
              Alsace       0.52      0.68      0.59       502
             America       0.00      0.00      0.00        20
           Andalucia       0.00      0.00      0.00        38
     Australia Other       0.00      0.00      0.00        49
            Bairrada       0.00      0.00      0.00        33
          Beaujolais       0.00      0.00      0.00       206
            Bordeaux       0.36      0.85      0.51       917
    British Columbia       0.00      0.00      0.00        30
          Burgenland       0.00      0.00      0.00       118
            Burgundy       0.33      0.66      0.44       700
    Cachapoal Valley       0.00      0.00      0.00        42
          California       0.71      0.99      0.83      7777
       

### CONCLUSION:

The Logistic Regression models performed better than all the other models. The best accuracy (0.710, against a baseline of 0.307) came from CountVectorizer with Lasso regularization. The big dataset gives better results than the small dataset in all the models.
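
The accuracies printed in the big-dataset cells can be collected into one table to make the ranking explicit. The values in this sketch are copied from the outputs of this notebook; the row labels are shorthand introduced here, not names used in the code.

```python
import pandas as pd

baseline = 0.30702786353264433

# Test accuracies as printed in the big-dataset cells.
results = pd.Series({
    'LogReg L2 + CountVectorizer': 0.7073546247680708,
    'LogReg L1 + CountVectorizer': 0.7100390825470767,
    'LogReg L2 + TfidfVectorizer (max_features=1000)': 0.5773804740859783,
    'LogReg L1 + TfidfVectorizer (max_features=1000)': 0.5838087585375653,
    'LogReg L1 multinomial + TfidfVectorizer': 0.7027357782953693,
    'RandomForest + CountVectorizer': 0.5367731238403537,
    'RandomForest + TfidfVectorizer': 0.5429710631242352,
    'MultinomialNB + CountVectorizer + TfidfTransformer': 0.42951324464095375,
    'MultinomialNB + CountVectorizer': 0.6347953100943508,
    'BernoulliNB + CountVectorizer': 0.5981998341952548,
    'MultinomialNB + TfidfVectorizer': 0.4303027910465438,
    'BernoulliNB + TfidfVectorizer': 0.5981998341952548,
}, name='accuracy')

# Sort from best to worst and show how far each model is above the baseline.
summary = results.sort_values(ascending=False).to_frame()
summary['lift_over_baseline'] = summary['accuracy'] - baseline
print(summary.round(3))
```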