# CAPSTONE PART 4: FINDINGS AND TECHNICAL REPORT. WINE BLIND TASTING (VARIETY)

I have built a classification model that, based on a wine description given by an expert or semi-expert, is able to tell what grape was used to produce that wine.

For training and testing, I am going to use the same models I defined for the other targets (country and province).

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
from PIL import Image


pd.set_option('display.max_columns',500)
%config InlineBackend.figure_format = 'retina'
%matplotlib inline

In [2]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.naive_bayes import MultinomialNB, BernoulliNB, GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier

## SMALL DATASET

In [3]:
df = pd.read_csv('small_wineV1.csv', )

In [34]:
df.head()

Unnamed: 0,country,description,designation,points,price,province,region_1,taster_name,title,variety,winery,vintage
0,France,"Buxom and heady, this is a delightfully rich, ...",Vieilles Vignes,99,114.0,Châteauneuf-du-Pape,Rhône Valley,Anna Lee C. Iijima,Domaine de la Janasse 2016 Vieilles Vignes Red...,Rhône-style Red Blend,Domaine de la Janasse,2016.0
1,France,"Sultry and silken on the palate, this wine sta...",La Réserve,98,175.0,Châteauneuf-du-Pape,Rhône Valley,Anna Lee C. Iijima,Domaine le Clos du Caillou 2016 La Réserve Red...,Rhône-style Red Blend,Domaine le Clos du Caillou,2016.0
2,Portugal,The wine's fine perfumed black plum fruits giv...,,98,120.0,Port,Port Blend,Roger Voss,Fonseca 2017 Port,Port,Fonseca,2017.0
3,France,"Veins of vanilla, smoke and toast amplify blac...",Hommage à Henry Tacussel,98,80.0,Châteauneuf-du-Pape,Rhône Valley,Anna Lee C. Iijima,Domaine Moulin-Tacussel 2016 Hommage à Henry T...,Grenache,Domaine Moulin-Tacussel,2016.0
4,France,"This juicy, fruit-forward wine drenches the pa...",La Muse,97,88.0,Châteauneuf-du-Pape,Rhône Valley,Anna Lee C. Iijima,Guillaume Gonnet 2016 La Muse Red (Châteauneuf...,Rhône-style Red Blend,Guillaume Gonnet,2016.0


In [6]:
print("There are {} types of grapes(varieties) in this dataset such as {}... \n".
      format(len(df.variety.unique()), ", ".join(df.variety.unique()[0:5])))


There are 541 types of grapes(varieties) in this dataset such as Rhône-style Red Blend, Port, Grenache, Grenache-Mourvèdre, Champagne Blend... 



In [7]:
print('The variety baseline is', df.variety.value_counts(normalize=True).max())

The variety baseline is 0.1092695257518743
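
The baseline is the accuracy obtained by always predicting the most frequent variety. As a minimal sketch, assuming a made-up label series (`toy_variety` is hypothetical and not taken from the wine data), the same figure can be reproduced with scikit-learn's `DummyClassifier`:

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier

# Hypothetical toy labels standing in for df.variety
toy_variety = pd.Series(['Pinot Noir'] * 5 + ['Chardonnay'] * 3 + ['Port'] * 2)

# Baseline as computed in the notebook: share of the most frequent class
baseline = toy_variety.value_counts(normalize=True).max()

# A classifier that always predicts the majority class reaches the same accuracy
X_placeholder = np.zeros((len(toy_variety), 1))  # the features are ignored
dummy = DummyClassifier(strategy='most_frequent').fit(X_placeholder, toy_variety)
dummy_accuracy = dummy.score(X_placeholder, toy_variety)

print(baseline, dummy_accuracy)
```

Any model worth keeping should score clearly above this number on the test set.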


In [7]:
cvec = CountVectorizer(strip_accents='unicode',
                       stop_words="english", 
                       ngram_range=(1, 1))

X_all = cvec.fit_transform(df['description'])
columns = cvec.get_feature_names()
X_all

<34813x15200 sparse matrix of type '<class 'numpy.int64'>'
	with 910512 stored elements in Compressed Sparse Row format>

### DEFINE THE VARIABLE VARIETY AS A TARGET


In [4]:
# The target is the variable 'variety'. This dataset has 541 different types of grapes.
y = df.variety
X =  df.description

In [5]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

### COUNTVECTORIZER  AND LOGISTIC REGRESSION 

The scores are close, but the best-performing model combines CountVectorizer with Logistic Regression using Lasso (L1) regularization and one-vs-rest (`ovr`) multiclass handling, with a test accuracy of about 0.598. The 5-fold cross-validation mean for this model (about 0.582) is close to that test accuracy.

All of the models have better scores than the baseline.
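
The practical difference between Lasso (L1) and Ridge (L2) regularization is that L1 drives many coefficients to exactly zero, while L2 only shrinks them. The sketch below illustrates this on synthetic data, not on the wine descriptions. It assumes the `liblinear` solver (swapped in for `saga`/`lbfgs` only because it converges quickly on a toy problem) and an arbitrary `C=0.1`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary problem: 30 features, only 5 of them informative
X_toy, y_toy = make_classification(n_samples=300, n_features=30, n_informative=5,
                                   n_redundant=0, random_state=0)

# Same solver and regularization strength, only the penalty differs
lasso = LogisticRegression(penalty='l1', solver='liblinear', C=0.1).fit(X_toy, y_toy)
ridge = LogisticRegression(penalty='l2', solver='liblinear', C=0.1).fit(X_toy, y_toy)

# Count coefficients that are exactly zero under each penalty
zeros_lasso = int(np.sum(lasso.coef_ == 0))
zeros_ridge = int(np.sum(ridge.coef_ == 0))
print('zero coefficients  lasso:', zeros_lasso, ' ridge:', zeros_ridge)
```

With a vocabulary of thousands of words, this sparsity acts as a built-in feature selection.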


In [37]:
# CountVectorizer and Logistic Regression with Ridge regularization
pipeline_ridge = Pipeline([
    ('vect', cvec),
    ('cls', LogisticRegression(penalty = 'l2', solver='lbfgs', multi_class='ovr', verbose = 1, n_jobs = 2))
])
pipeline_ridge.fit(X_train, y_train)
predicted_ridge = pipeline_ridge.predict(X_test)
pipeline_ridge.score(X_test, y_test)

[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  46 tasks      | elapsed:    6.7s
[Parallel(n_jobs=2)]: Done 196 tasks      | elapsed:   21.9s
[Parallel(n_jobs=2)]: Done 446 tasks      | elapsed:   48.0s
[Parallel(n_jobs=2)]: Done 498 out of 498 | elapsed:   53.1s finished


0.577193738331179

In [7]:
# CountVectorizer and Logistic Regression with Lasso regularization
cvec = CountVectorizer(strip_accents='unicode', stop_words="english", ngram_range=(1, 1))
pipeline = Pipeline([
    ('vect', cvec),
    ('cls', LogisticRegression(penalty = 'l1', solver='saga', multi_class='ovr'))
])
pipeline.fit(X_train, y_train)
predicted = pipeline.predict(X_test)
pipeline.score(X_test, y_test)



0.5980180956484275

In [8]:
print(classification_report(y_test, predicted))

                                    precision    recall  f1-score   support

                         Aglianico       0.50      0.12      0.19        25
                          Albariño       0.52      0.33      0.41        36
                           Albillo       0.00      0.00      0.00         1
                          Aleatico       0.00      0.00      0.00         1
                          Alicante       0.00      0.00      0.00         1
                 Alicante Bouschet       0.00      0.00      0.00         2
                           Aligoté       0.00      0.00      0.00         4
                Alsace white blend       0.00      0.00      0.00         2
                           Altesse       0.00      0.00      0.00         1
                         Alvarinho       0.71      0.36      0.48        14
               Alvarinho-Trajadura       0.00      0.00      0.00         1
                            Arinto       1.00      0.33      0.50         3
           



In [9]:
cross = cross_val_score(pipeline, X_train, y_train, cv=5)
print(cross)
print(cross.mean())



[0.55576495 0.56470797 0.58318264 0.59838769 0.60905045]
0.5822187383358466


In [12]:
# CountVectorizer and Logistic Regression with Lasso regularization and multi_class multinomial
cvec = CountVectorizer(strip_accents='unicode', stop_words="english", ngram_range=(1, 1))
pipeline = Pipeline([
    ('vect', cvec),
    ('cls', LogisticRegression(penalty = 'l1', solver='saga', multi_class='multinomial'))
])
pipeline.fit(X_train, y_train)
predicted = pipeline.predict(X_test)
pipeline.score(X_test, y_test)



0.5878213413758437

### TFIDFVECTORIZER  AND LOGISTIC REGRESSION 

CountVectorizer only counts word frequencies. TfidfVectorizer counts word frequencies and also computes the Inverse Document Frequency (IDF), which measures how much information a word provides, that is, whether the term is common or rare across all the documents.
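
A small sketch of the IDF idea, assuming a made-up three-document corpus (`docs` is hypothetical): a word that appears in every document gets the lowest weight, and a word that appears in only one gets a higher weight.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini corpus, not taken from the wine data
docs = ['cherry oak tannins', 'cherry citrus', 'cherry oak vanilla']

tfidf = TfidfVectorizer().fit(docs)

# Map each term to the IDF weight learned for it
idf_by_term = {term: tfidf.idf_[idx] for term, idx in tfidf.vocabulary_.items()}

# 'cherry' is in all 3 documents, 'oak' in 2, 'citrus' in 1
for term in ['cherry', 'oak', 'citrus']:
    print(term, round(idf_by_term[term], 3))
```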

I am going to use the same parameters that I used with CountVectorizer, but I am also going to include `sublinear_tf` to scale the term frequency (replacing tf with 1 + log(tf)) and, in one model, `max_features = 2000`, which keeps only the top 2000 terms ordered by term frequency across the corpus.
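
The effect of these two parameters can be sketched on a made-up corpus (`docs` is hypothetical). Setting `use_idf=False` and `norm=None` is an assumption made here only to isolate the term-frequency part. The notebook models keep the defaults.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini corpus; 'cherry' appears 4 times in the first document
docs = ['cherry cherry cherry cherry oak', 'oak vanilla', 'cherry citrus']

# Raw term frequency versus sublinear term frequency
raw = TfidfVectorizer(use_idf=False, norm=None).fit(docs)
sub = TfidfVectorizer(use_idf=False, norm=None, sublinear_tf=True).fit(docs)

col = raw.vocabulary_['cherry']             # same column index in both vectorizers
raw_tf = raw.transform(docs[:1])[0, col]    # plain count
sub_tf = sub.transform(docs[:1])[0, col]    # 1 + ln(count)
print(raw_tf, sub_tf)

# max_features keeps only the most frequent terms across the whole corpus
top2 = TfidfVectorizer(max_features=2).fit(docs)
kept_terms = sorted(top2.vocabulary_)
print(kept_terms)
```

Sublinear scaling dampens the influence of a word that is repeated many times within one description.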


The best-performing model used Lasso regularization with multinomial multiclass (accuracy of about 0.587).

All of the models have better scores than the baseline.


In [18]:
# TfidfVectorizer and Logistic Regression with Lasso regularization
pipeline = Pipeline([
    ('vect', TfidfVectorizer(stop_words='english', sublinear_tf=True,
                             ngram_range = (1,1), strip_accents='unicode')),
    ('cls', LogisticRegression(penalty = 'l1', solver='saga', multi_class='ovr'))
])
pipeline.fit(X_train, y_train)
predicted = pipeline.predict(X_test)
pipeline.score(X_test, y_test)



0.5839437024271148

In [15]:
# TfidfVectorizer and Logistic Regression including Lasso regularization and multinomial multi_class
pipelinesv = Pipeline([
    ('vect', TfidfVectorizer(stop_words='english', sublinear_tf=True,
                             ngram_range = (1,1), strip_accents='unicode')),
    ('cls', LogisticRegression(solver='saga', multi_class='multinomial', penalty = 'l1'))
])
pipelinesv.fit(X_train, y_train)
predictedsv = pipelinesv.predict(X_test)
pipelinesv.score(X_test, y_test)



0.5866724113169611

In [16]:
# TfidfVectorizer and Logistic Regression including Lasso regularization and max_features = 2000
pipelinesv = Pipeline([
    ('vect', TfidfVectorizer(stop_words='english', sublinear_tf=True,
                             ngram_range = (1,1), strip_accents='unicode',  max_features = 2000)),
    ('cls', LogisticRegression(solver='saga', multi_class='multinomial', penalty = 'l1'))
])
pipelinesv.fit(X_train, y_train)
predictedsv = pipelinesv.predict(X_test)
pipelinesv.score(X_test, y_test)



0.5830820048829527

In [17]:
# TfidfVectorizer and Logistic Regression including Ridge regularization with the sag solver
pipelinesv = Pipeline([
    ('vect', TfidfVectorizer(stop_words='english', sublinear_tf=True,
                             ngram_range = (1,1), strip_accents='unicode')),
    ('cls', LogisticRegression(solver='sag', multi_class='multinomial', penalty = 'l2'))
])
pipelinesv.fit(X_train, y_train)
predictedsv = pipelinesv.predict(X_test)
pipelinesv.score(X_test, y_test)

0.5727416343530087

### RANDOM FOREST CLASSIFIER, COUNTVECTORIZER AND TFIDFVECTORIZER

Logistic Regression performs better than the Random Forest Classifier on this dataset, with higher accuracy.

The best-performing Random Forest model was the one with CountVectorizer (accuracy of about 0.549).

All of the models have better scores than the baseline.

In [19]:
# Random Forest with CountVectorizer
cvecsr = CountVectorizer(strip_accents='unicode', stop_words="english", ngram_range=(1, 1),  max_features=1000)
pipelinesr = Pipeline([
    ('vect', cvecsr),
    #('tfidf', TfidfTransformer()),
    ('cls', RandomForestClassifier(n_estimators=200, n_jobs = 2))
])
pipelinesr.fit(X_train, y_train)
predictedsr = pipelinesr.predict(X_test)
pipelinesr.score(X_test, y_test)

0.5487577193738331

In [20]:
# Random Forest with TfidfVectorizer
pipelinesr = Pipeline([
    ('vect', TfidfVectorizer(stop_words='english', sublinear_tf=True,
                             ngram_range = (1,1), strip_accents='unicode')),
    
    ('cls', RandomForestClassifier(n_estimators=200, n_jobs = 2))
])
pipelinesr.fit(X_train, y_train)
predictedsr = pipelinesr.predict(X_test)
pipelinesr.score(X_test, y_test)

0.5455981617119058

In [21]:
# Random Forest with TfidfVectorizer and max_depth = 10
pipelinesr = Pipeline([
    ('vect', TfidfVectorizer(stop_words='english', sublinear_tf=True,
                             ngram_range = (1,1), strip_accents='unicode')),
    
    ('cls', RandomForestClassifier(n_estimators=200, n_jobs = 2, max_depth = 10 ))
])
pipelinesr.fit(X_train, y_train)
predictedsr = pipelinesr.predict(X_test)
pipelinesr.score(X_test, y_test)

0.33074824070084735

### NAIVE BAYES, COUNTVECTORIZER AND TFIDFVECTORIZER

The best Naive Bayes accuracy (about 0.471) came from MultinomialNB with CountVectorizer. MultinomialNB is one of the two classic Naive Bayes variants used in text classification.

Logistic Regression and some of the Random Forest Classifier models perform better than Naive Bayes on this dataset, with higher accuracy.

All of the models have better scores than the baseline.


In [None]:
cvect = CountVectorizer(lowercase=True, strip_accents='unicode', stop_words='english')

In [23]:
# MultinomialNB with CountVectorizer and TfidfTransformer
pipelinesn = Pipeline([
    ('vect', cvect),
    ('tfidf', TfidfTransformer()),
    ('cls', MultinomialNB())
])
pipelinesn.fit(X_train, y_train)
predictedsn = pipelinesn.predict(X_test)
pipelinesn.score(X_test, y_test)

0.3742639666810283

In [24]:
# MultinomialNB with CountVectorizer
pipelinesn1 = Pipeline([
    ('vect', cvect),
    ('cls', MultinomialNB())
])
pipelinesn1.fit(X_train, y_train)
predictedsn1 = pipelinesn1.predict(X_test)
pipelinesn1.score(X_test, y_test)

0.4707740916271722

In [26]:
# BernoulliNB
pipelinesn2 = Pipeline([
    ('vect', cvect),
    ('cls', BernoulliNB())
])
pipelinesn2.fit(X_train, y_train)
predictedsn2 = pipelinesn2.predict(X_test)
pipelinesn2.score(X_test, y_test)

0.4229498779261812

In [27]:
vect1 = TfidfVectorizer(stop_words='english', sublinear_tf=True,
                             ngram_range = (1,1), strip_accents='unicode')

In [28]:
# MultinomialNB with TfidfVectorizer
pipelinesn3 = Pipeline([
    ('vect',vect1) ,
    ('cls', MultinomialNB())
])
pipelinesn3.fit(X_train, y_train)
predictedsn3 = pipelinesn3.predict(X_test)
pipelinesn3.score(X_test, y_test)

0.3744075829383886

In [29]:
# BernoulliNB with TfidfVectorizer
pipelinesn4 = Pipeline([
    ('vect', vect1),
    ('cls', BernoulliNB())
])
pipelinesn4.fit(X_train, y_train)
predictedsn4 = pipelinesn4.predict(X_test)
pipelinesn4.score(X_test, y_test)

0.4229498779261812
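
The two BernoulliNB scores above are identical. This is most likely because `BernoulliNB` binarizes its input by default (`binarize=0.0`), so it only sees whether a word is present, and counts and TF-IDF weights collapse to the same 0/1 matrix. A minimal sketch on a made-up corpus (`docs` and `labels` are hypothetical):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus, not taken from the wine data
docs = ['ripe cherry and oak', 'cherry cherry tannins', 'crisp citrus and apple',
        'apple citrus zest', 'oak tannins cherry', 'zest apple crisp']
labels = ['red', 'red', 'white', 'white', 'red', 'white']

# Same classifier, two different vectorizers
nb_counts = Pipeline([('vect', CountVectorizer()),
                      ('cls', BernoulliNB())]).fit(docs, labels)
nb_tfidf = Pipeline([('vect', TfidfVectorizer(sublinear_tf=True)),
                     ('cls', BernoulliNB())]).fit(docs, labels)

# After binarization both pipelines feed the same 0/1 features to the model
new_docs = ['cherry oak', 'citrus zest']
same_probabilities = np.allclose(nb_counts.predict_proba(new_docs),
                                 nb_tfidf.predict_proba(new_docs))
print(same_probabilities)
```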

# SMALL DATASET WITH VARIETIES REDUCTION

Looking at the varieties, there are 541 of them in the small dataset. I am going to limit the number of varieties by keeping only those that appear more than 50 times. I am also going to merge some varieties that are similar by replacing their names.

I am going to train some of the models I defined above with this dataset.

In [9]:
# .copy() avoids pandas' SettingWithCopyWarning when df1.variety is reassigned below
df1 = df[df['variety'].map(df['variety'].value_counts()) > 50].copy()

In [12]:
df1.variety = df1.variety.apply(lambda x: str(x).replace('Syrah', 'Shiraz').
                                    replace('Pinot Gris', 'Pinot Grigio').replace('Petite Sirah', 'Shiraz').
                                   replace('Grenache', 'Garnacha').replace('Garnacha Blanc', 'Garnacha').
                                   replace('Rosato', 'Rosé').replace('Pinot Bianco', 'Pinot Blanc').
                                   replace('Zinfandel', 'Primitivo').replace('Alvarinho', 'Alvariño').
                                   replace('Alvariño', 'Albariño'))

In [14]:
len(df1.variety.unique())

57
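
The filtering and renaming steps above can be sketched on a toy frame. The data, the threshold of 2 and the `synonyms` table are all hypothetical. The sketch also swaps the notebook's chained substring `str.replace` calls for an exact-match `Series.replace` with a dictionary, which is an alternative and not the method used above.

```python
import pandas as pd

# Hypothetical toy data standing in for df
toy = pd.DataFrame({'variety': ['Syrah'] * 3 + ['Shiraz'] * 2 +
                               ['Pinot Gris'] * 3 + ['Albillo']})

# Keep only varieties seen more than 2 times (the notebook uses 50)
counts = toy['variety'].map(toy['variety'].value_counts())
kept = toy[counts > 2].copy()

# Merge synonyms with an exact-match mapping
synonyms = {'Syrah': 'Shiraz', 'Pinot Gris': 'Pinot Grigio'}
kept['variety'] = kept['variety'].replace(synonyms)

print(kept['variety'].value_counts())
```

In this toy example the two original 'Shiraz' rows are dropped by the frequency filter before the merge, even though the merged group would pass the threshold. Merging first and filtering afterwards would keep them.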

In [15]:
print('The variety baseline is', df1.variety.value_counts(normalize=True).max())

The variety baseline is 0.11847514638096425


In [16]:
X1 = df1.description
y1 = df1.variety

In [17]:
#split and stratify
X_train1, X_test1, y_train1, y_test1 = train_test_split(X1, y1, test_size=0.2, stratify = y1, random_state = 1)
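
Passing `stratify` keeps the class proportions the same in the training and test sets, which matters when some varieties are rare. A minimal sketch with made-up labels (`toy_X` and `toy_y` are hypothetical):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 60% / 30% / 10%
toy_y = pd.Series(['Chardonnay'] * 60 + ['Pinot Noir'] * 30 + ['Fiano'] * 10)
toy_X = pd.Series(['description %d' % i for i in range(len(toy_y))])

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, stratify=toy_y, random_state=1)

# Both parts keep the 60/30/10 split of the full data
train_share = y_tr.value_counts(normalize=True)
test_share = y_te.value_counts(normalize=True)
print(train_share)
print(test_share)
```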

### COUNTVECTORIZER  AND TFIDFVECTORIZER WITH LOGISTIC REGRESSION 

The scores are close, but the best-performing model again combines CountVectorizer with Logistic Regression using Lasso (L1) regularization and one-vs-rest (`ovr`) multiclass handling, with a test accuracy of about 0.642. The 5-fold cross-validation mean for this model (about 0.634) is close to that test accuracy.

All of the models have better scores than the baseline.


In [18]:
cvec = CountVectorizer(strip_accents='unicode', stop_words="english", ngram_range=(1, 1))

In [160]:
# CountVectorizer and LogisticRegression with Ridge Regularization
pipeline_ridge1 = Pipeline([
    ('vect', cvec),
    ('cls', LogisticRegression(penalty = 'l2', solver='lbfgs', multi_class='ovr', verbose = 1, n_jobs = 2))
])
pipeline_ridge1.fit(X_train1, y_train1)
predicted_ridge1 = pipeline_ridge1.predict(X_test1)
pipeline_ridge1.score(X_test1, y_test1)

[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  46 tasks      | elapsed:   11.0s
[Parallel(n_jobs=2)]: Done  57 out of  57 | elapsed:   12.8s finished


0.63640610401744

In [42]:
# CountVectorizer and LogisticRegression with Lasso Regularization
cvec = CountVectorizer(strip_accents='unicode', stop_words="english", ngram_range=(1, 1))
pipelines1 = Pipeline([
    ('vect', cvec),
    ('cls', LogisticRegression(penalty = 'l1', solver='saga', multi_class='ovr', verbose = 1, n_jobs = 2))
])
pipelines1.fit(X_train1, y_train1)
predicteds1 = pipelines1.predict(X_test1)
pipelines1.score(X_test1, y_test1)

[Parallel(n_jobs=2)]: Using backend ThreadingBackend with 2 concurrent workers.

max_iter reached after 5 seconds
[... the same "max_iter reached" message repeats for the remaining one-vs-rest fits ...]

[Parallel(n_jobs=2)]: Done  46 tasks      | elapsed:  3.5min
[Parallel(n_jobs=2)]: Done  57 out of  57 | elapsed:  4.1min finished


0.6418561195889131

In [43]:
print(classification_report(y_test1, predicteds1))

                            precision    recall  f1-score   support

                 Aglianico       0.33      0.14      0.20        21
                  Albariño       0.52      0.36      0.42        42
                   Barbera       0.63      0.31      0.41        39
             Blaufränkisch       0.56      0.60      0.58        15
  Bordeaux-style Red Blend       0.73      0.85      0.78       368
Bordeaux-style White Blend       0.64      0.52      0.58        69
            Cabernet Franc       0.53      0.23      0.32        92
        Cabernet Sauvignon       0.51      0.64      0.56       444
                 Carmenère       0.62      0.37      0.47        27
           Champagne Blend       0.83      0.72      0.77        95
                Chardonnay       0.67      0.82      0.73       568
              Chenin Blanc       0.60      0.12      0.20        25
                     Fiano       0.40      0.36      0.38        11
                     G-S-M       0.33      0.07

In [26]:
#Cross validation scores
cross = cross_val_score(pipelines1, X_train1, y_train1, cv=5)
print(cross)
print(cross.mean())

[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  46 tasks      | elapsed:  2.6min
[Parallel(n_jobs=2)]: Done  57 out of  57 | elapsed:  3.1min finished
[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  46 tasks      | elapsed:  2.5min
[Parallel(n_jobs=2)]: Done  57 out of  57 | elapsed:  2.8min finished
[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  46 tasks      | elapsed:  2.8min
[Parallel(n_jobs=2)]: Done  57 out of  57 | elapsed:  3.2min finished
[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  46 tasks      | elapsed:  2.3min
[Parallel(n_jobs=2)]: Done  57 out of  57 | elapsed:  2.6min finished
[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  46 tasks      | elapsed:  2.7min
[Parallel(n_jobs=2)]: Done  57 out of  5

[0.63100775 0.63616939 0.63875365 0.63219512 0.63345057]
0.6343152956196116


In [38]:
# CountVectorizer and LogisticRegression with Lasso Regularization and multinomial as a multiclass
pipelines2 = Pipeline([
    ('vect', cvec),
    ('cls', LogisticRegression(penalty = 'l1', solver='saga', multi_class='multinomial', verbose = 1, n_jobs = 2))
])
pipelines2.fit(X_train1, y_train1)
predicteds2 = pipelines2.predict(X_test1)
pipelines2.score(X_test1, y_test1)

[Parallel(n_jobs=2)]: Using backend ThreadingBackend with 2 concurrent workers.


max_iter reached after 284 seconds


[Parallel(n_jobs=2)]: Done   1 out of   1 | elapsed:  4.7min finished


0.6348489567113049

In [41]:
print(classification_report(y_test1, predicteds2))

                            precision    recall  f1-score   support

                 Aglianico       0.27      0.14      0.19        21
                  Albariño       0.45      0.33      0.38        42
                   Barbera       0.64      0.36      0.46        39
             Blaufränkisch       0.69      0.73      0.71        15
  Bordeaux-style Red Blend       0.72      0.84      0.78       368
Bordeaux-style White Blend       0.59      0.49      0.54        69
            Cabernet Franc       0.47      0.21      0.29        92
        Cabernet Sauvignon       0.51      0.61      0.55       444
                 Carmenère       0.48      0.37      0.42        27
           Champagne Blend       0.78      0.69      0.73        95
                Chardonnay       0.67      0.79      0.72       568
              Chenin Blanc       0.67      0.16      0.26        25
                     Fiano       0.62      0.45      0.53        11
                     G-S-M       0.17      0.07

In [28]:
vect = TfidfVectorizer(stop_words='english', sublinear_tf=True,
                             ngram_range = (1,1), strip_accents='unicode')

In [169]:
# TfidfVectorizer and LogisticRegression with Ridge Regularization
pipelines3 = Pipeline([
    ('vect', vect),
    ('cls', LogisticRegression(penalty = 'l2', solver='lbfgs', multi_class='ovr', verbose = 1, n_jobs = 2))
])
pipelines3.fit(X_train1, y_train1)
predicteds3 = pipelines3.predict(X_test1)
pipelines3.score(X_test1, y_test1)

[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  46 tasks      | elapsed:    7.0s
[Parallel(n_jobs=2)]: Done  57 out of  57 | elapsed:    7.8s finished


0.611336032388664

In [170]:
print(classification_report(y_test1, predicteds3))

                            precision    recall  f1-score   support

                 Aglianico       0.00      0.00      0.00        21
                  Albariño       0.77      0.24      0.36        42
                   Barbera       1.00      0.15      0.27        39
             Blaufränkisch       1.00      0.07      0.12        15
  Bordeaux-style Red Blend       0.66      0.82      0.73       368
Bordeaux-style White Blend       0.72      0.42      0.53        69
            Cabernet Franc       0.60      0.10      0.17        92
        Cabernet Sauvignon       0.47      0.68      0.55       444
                 Carmenère       0.75      0.11      0.19        27
           Champagne Blend       0.89      0.66      0.76        95
                Chardonnay       0.56      0.87      0.68       568
              Chenin Blanc       0.00      0.00      0.00        25
                     Fiano       0.00      0.00      0.00        11
                     G-S-M       0.00      0.00


In [171]:
# TfidfVectorizer and LogisticRegression with Lasso Regularization
pipelines4 = Pipeline([
    ('vect', vect),
    ('cls', LogisticRegression(penalty = 'l1', solver='saga', multi_class='ovr', ))
])
pipelines4.fit(X_train1, y_train1)
predicteds4 = pipelines4.predict(X_test1)
pipelines4.score(X_test1, y_test1)



0.6248832139520398

In [172]:
print(classification_report(y_test1, predicteds4))

                            precision    recall  f1-score   support

                 Aglianico       0.50      0.05      0.09        21
                  Albariño       0.68      0.40      0.51        42
                   Barbera       0.89      0.21      0.33        39
             Blaufränkisch       0.75      0.20      0.32        15
  Bordeaux-style Red Blend       0.70      0.82      0.76       368
Bordeaux-style White Blend       0.66      0.42      0.51        69
            Cabernet Franc       0.55      0.13      0.21        92
        Cabernet Sauvignon       0.49      0.66      0.56       444
                 Carmenère       0.69      0.33      0.45        27
           Champagne Blend       0.86      0.63      0.73        95
                Chardonnay       0.60      0.85      0.70       568
              Chenin Blanc       0.67      0.08      0.14        25
                     Fiano       1.00      0.18      0.31        11
                     G-S-M       0.00      0.00


In [173]:
# TfidfVectorizer and LogisticRegression with Lasso Regularization and multinomial as a multiclass
pipelines5 = Pipeline([
    ('vect', TfidfVectorizer(stop_words='english', sublinear_tf=True,
                             ngram_range = (1,1), strip_accents='unicode')),
    ('cls', LogisticRegression(solver='saga', multi_class='multinomial', penalty = 'l1'))
])
pipelines5.fit(X_train1, y_train1)
predicteds5 = pipelines5.predict(X_test1)
pipelines5.score(X_test1, y_test1)



0.6334475241357832

In [174]:
print(classification_report(y_test1, predicteds5))

                            precision    recall  f1-score   support

                 Aglianico       0.25      0.05      0.08        21
                  Albariño       0.71      0.40      0.52        42
                   Barbera       0.83      0.26      0.39        39
             Blaufränkisch       0.83      0.33      0.48        15
  Bordeaux-style Red Blend       0.72      0.82      0.77       368
Bordeaux-style White Blend       0.65      0.46      0.54        69
            Cabernet Franc       0.57      0.14      0.23        92
        Cabernet Sauvignon       0.51      0.67      0.58       444
                 Carmenère       0.44      0.30      0.36        27
           Champagne Blend       0.83      0.68      0.75        95
                Chardonnay       0.61      0.82      0.70       568
              Chenin Blanc       0.50      0.08      0.14        25
                     Fiano       1.00      0.27      0.43        11
                     G-S-M       0.00      0.00


### RANDOM FOREST CLASSIFIER, COUNTVECTORIZER AND TFIDFVECTORIZER

Logistic Regression performs better than the Random Forest Classifier on this dataset, with higher accuracy.

The best-performing Random Forest model was the one with CountVectorizer (accuracy of about 0.597).

All of the models have better scores than the baseline.


In [13]:
cvecsr = CountVectorizer(strip_accents='unicode', stop_words="english", ngram_range=(1, 1))

In [14]:
# Random Forest with CountVectorizer
pipelines6  = Pipeline([
    ('vect', cvecsr),
    #('tfidf', TfidfTransformer()),
    ('cls', RandomForestClassifier(n_estimators=200, n_jobs = 2))
])
pipelines6.fit(X_train1, y_train1)
predicteds6 = pipelines6.predict(X_test1)
pipelines6.score(X_test1, y_test1)

0.597477421364061

In [19]:
print(classification_report(y_test1, predicteds6))

                            precision    recall  f1-score   support

                 Aglianico       0.00      0.00      0.00        21
                  Albariño       0.77      0.24      0.36        42
                   Barbera       1.00      0.08      0.14        39
             Blaufränkisch       0.75      0.40      0.52        15
  Bordeaux-style Red Blend       0.55      0.84      0.67       368
Bordeaux-style White Blend       0.68      0.43      0.53        69
            Cabernet Franc       0.75      0.03      0.06        92
        Cabernet Sauvignon       0.48      0.68      0.56       444
                 Carmenère       1.00      0.26      0.41        27
           Champagne Blend       0.86      0.58      0.69        95
                Chardonnay       0.53      0.88      0.66       568
              Chenin Blanc       0.00      0.00      0.00        25
                     Fiano       0.00      0.00      0.00        11
                     G-S-M       0.00      0.00


In [20]:
# Random Forest with TfidfVectorizer and max_depth = 10
pipelines7 = Pipeline([
    ('vect', TfidfVectorizer(stop_words='english', sublinear_tf=True,
                             ngram_range = (1,1), strip_accents='unicode')),
    
    ('cls', RandomForestClassifier(n_estimators=200, n_jobs = 2, max_depth = 10 ))
])
pipelines7.fit(X_train1, y_train1)
predicteds7 = pipelines7.predict(X_test1)
pipelines7.score(X_test1, y_test1)

0.35970102771722207

In [21]:
print(classification_report(y_test1, predicteds7))

                            precision    recall  f1-score   support

                 Aglianico       0.00      0.00      0.00        21
                  Albariño       0.00      0.00      0.00        42
                   Barbera       0.00      0.00      0.00        39
             Blaufränkisch       0.00      0.00      0.00        15
  Bordeaux-style Red Blend       0.55      0.70      0.61       368
Bordeaux-style White Blend       0.00      0.00      0.00        69
            Cabernet Franc       0.00      0.00      0.00        92
        Cabernet Sauvignon       0.78      0.21      0.34       444
                 Carmenère       0.00      0.00      0.00        27
           Champagne Blend       0.94      0.16      0.27        95
                Chardonnay       0.37      0.80      0.50       568
              Chenin Blanc       0.00      0.00      0.00        25
                     Fiano       0.00      0.00      0.00        11
                     G-S-M       0.00      0.00



### NAIVE BAYES, COUNTVECTORIZER AND TFIDFVECTORIZER

The best accuracy among the Naive Bayes models came from MultinomialNB with CountVectorizer. MultinomialNB is one of the two classic Naive Bayes variants used in text classification.

Logistic Regression and some of the Random Forest Classifier models perform better than Naive Bayes on this dataset, with higher accuracy.

All of the models score better than the baseline.
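
To make the difference between the two Naive Bayes variants concrete, here is a minimal sketch on an invented four-document corpus (the texts and the `red`/`white` labels are assumptions for illustration, not the wine data). MultinomialNB models word counts, while BernoulliNB models only whether each word is present or absent.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.pipeline import Pipeline

# Invented toy corpus; labels are coarse on purpose.
docs = [
    "ripe cherry plum firm tannins",
    "dark plum blackberry tannins oak",
    "crisp lemon apple bright acidity",
    "fresh citrus lemon mineral acidity",
]
labels = ["red", "red", "white", "white"]


def make_nb(classifier):
    """Build a CountVectorizer + Naive Bayes pipeline."""
    return Pipeline([
        ("vect", CountVectorizer(stop_words="english")),
        ("cls", classifier),
    ])


multinomial = make_nb(MultinomialNB()).fit(docs, labels)
bernoulli = make_nb(BernoulliNB()).fit(docs, labels)

new_doc = ["cherry plum tannins"]
pred_multinomial = multinomial.predict(new_doc)[0]
pred_bernoulli = bernoulli.predict(new_doc)[0]
print(pred_multinomial, pred_bernoulli)
```

On such a clean toy example both variants agree; on the real descriptions they differ because repeated words and absent words are weighted differently.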


In [25]:
# MultinomialNB with CountVectorizer
pipelines8 = Pipeline([
    ('vect', cvec),
    ('cls', MultinomialNB())
])
pipelines8.fit(X_train1, y_train1)
predicteds8 = pipelines8.predict(X_test1)
pipelines8.score(X_test1, y_test1)

0.5317658050451572

In [31]:
print(classification_report(y_test1, predicteds8))

                            precision    recall  f1-score   support

                 Aglianico       0.00      0.00      0.00        21
                  Albariño       0.00      0.00      0.00        42
                   Barbera       0.00      0.00      0.00        39
             Blaufränkisch       0.00      0.00      0.00        15
  Bordeaux-style Red Blend       0.44      0.73      0.55       368
Bordeaux-style White Blend       0.69      0.16      0.26        69
            Cabernet Franc       0.25      0.01      0.02        92
        Cabernet Sauvignon       0.41      0.65      0.50       444
                 Carmenère       0.00      0.00      0.00        27
           Champagne Blend       0.73      0.64      0.69        95
                Chardonnay       0.55      0.82      0.66       568
              Chenin Blanc       0.00      0.00      0.00        25
                     Fiano       0.00      0.00      0.00        11
                     G-S-M       0.00      0.00



In [26]:
# BernoulliNB
pipelines9 = Pipeline([
    ('vect', cvec),
    ('cls', BernoulliNB())
])
pipelines9.fit(X_train1, y_train1)
predicteds9 = pipelines9.predict(X_test1)
pipelines9.score(X_test1, y_test1)

0.4796013702896294

In [29]:
# MultinomialNB with TfidfVectorizer
pipelines10 = Pipeline([
    ('vect',vect) ,
    ('cls', MultinomialNB())
])
pipelines10.fit(X_train1, y_train1)
predicteds10 = pipelines10.predict(X_test1)
pipelines10.score(X_test1, y_test1)

0.4198069137340392

This dataset, with the reduced set of varieties, performs better: accuracy improved in all the models. As with the other dataset, the best accuracy comes from the model with CountVectorizer, Logistic Regression and Lasso regularization.

# BIG DATASET VARIETY REDUCTIONS

Looking at the varieties, there are 743 distinct values in the big dataset. I set a minimum number of samples per variety: I dropped the varieties that appear 100 times or fewer, and I merged names that belong to the same grape family (for example, Syrah into Shiraz).

I am going to train some of the models defined above on this dataset.
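
The two reduction steps can be sketched on a toy frame as follows. The counts, the `MIN_SAMPLES` threshold and the two-entry `synonyms` dictionary are assumptions for illustration. Note that `Series.replace` with a dictionary matches whole values only, whereas the notebook's chained `str.replace` calls also rewrite substrings inside blend names.

```python
import pandas as pd

# Toy stand-in for the wine data; variety names are real, counts are invented.
toy = pd.DataFrame({
    "variety": ["Syrah"] * 3 + ["Shiraz"] * 2 + ["Pinot Gris"] * 3 + ["Fiano"],
})

MIN_SAMPLES = 2  # hypothetical threshold; the notebook uses 100

# Step 1: keep only rows whose variety appears more than MIN_SAMPLES times.
counts = toy["variety"].value_counts()
kept = toy[toy["variety"].map(counts) > MIN_SAMPLES].copy()

# Step 2: merge synonyms with an exact-match mapping
# (names that are not in the dictionary stay unchanged).
synonyms = {"Syrah": "Shiraz", "Pinot Gris": "Pinot Grigio"}
kept["variety"] = kept["variety"].replace(synonyms)

print(kept["variety"].value_counts())
```

Because the filter runs before the merge, the two `Shiraz` rows in this toy frame are dropped even though they would have passed the threshold once combined with `Syrah`. Merging first and filtering second is an alternative ordering worth considering.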

In [12]:
vino = pd.read_csv('big_wineV1.csv')

In [8]:
vino.head()

Unnamed: 0.1,Unnamed: 0,country,description,designation,points,price,province,region_1,taster_name,title,variety,winery,vintage
0,1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,,Roger Voss,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos,2011.0
1,2,US,"Tart and snappy, the flavors of lime flesh and...",,87,14.0,Oregon,Willamette Valley,Paul Gregutt,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,Rainstorm,2013.0
2,3,US,"Pineapple rind, lemon pith and orange blossom ...",Reserve Late Harvest,87,13.0,Michigan,Lake Michigan Shore,Alexander Peartree,St. Julian 2013 Reserve Late Harvest Riesling ...,Riesling,St. Julian,2013.0
3,4,US,"Much like the regular bottling from 2012, this...",Vintner's Reserve Wild Child Block,87,65.0,Oregon,Willamette Valley,Paul Gregutt,Sweet Cheeks 2012 Vintner's Reserve Wild Child...,Pinot Noir,Sweet Cheeks,2012.0
4,5,Spain,Blackberry and raspberry aromas show a typical...,Ars In Vitro,87,15.0,Northern Spain,Navarra,Michael Schachner,Tandem 2011 Ars In Vitro Tempranillo-Merlot (N...,Tempranillo-Merlot,Tandem,2011.0


In [13]:
vino1 = vino[vino['variety'].map(vino['variety'].value_counts()) > 100]

In [16]:
# Merge synonymous variety names. The replacements are substring-based and
# applied in order, so e.g. 'Alvarinho' -> 'Alvariño' -> 'Albariño'.
vino1.variety = vino1.variety.apply(lambda x: str(x).replace('Syrah', 'Shiraz').
                                    replace('Pinot Gris', 'Pinot Grigio').replace('Petite Sirah', 'Shiraz').
                                    replace('Grenache', 'Garnacha').replace('Garnacha Blanc', 'Garnacha').
                                    replace('Rosato', 'Rosé').replace('Pinot Bianco', 'Pinot Blanc').
                                    replace('Zinfandel', 'Primitivo').replace('Alvarinho', 'Alvariño').
                                    replace('Alvariño', 'Albariño'))

In [17]:
len(vino1.variety.unique())

85

In [18]:
print("There are {} types of grapes(varieties) in this dataset such as {}... \n".
      format(len(vino.variety.unique()), ", ".join(vino.variety.unique()[0:5])))

There are 743 types of grapes(varieties) in this dataset such as Portuguese Red, Pinot Gris, Riesling, Pinot Noir, Tempranillo-Merlot... 



In [19]:
print('The variety baseline is', vino1.variety.value_counts(normalize=True).max())

The variety baseline is 0.11419319067554863
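
The baseline above is the share of the most common variety, which is the accuracy of a model that always predicts that variety. A minimal sketch, using invented labels, shows that scikit-learn's `DummyClassifier` with `strategy="most_frequent"` gives the same number:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Toy labels: the class mix is invented for illustration.
y = np.array(["Pinot Noir"] * 5 + ["Chardonnay"] * 3 + ["Riesling"] * 2)
X = np.zeros((len(y), 1))  # features are ignored by the dummy model

# Accuracy of always predicting the most frequent class.
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
dummy_accuracy = baseline.score(X, y)

# Same quantity computed directly: share of the most common label.
_, label_counts = np.unique(y, return_counts=True)
majority_share = label_counts.max() / label_counts.sum()

print(dummy_accuracy, majority_share)
```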


### DEFINE THE VARIABLE VARIETY AS A TARGET

In [20]:
yb = vino1.variety
Xb =  vino1.description

In [21]:
yb.shape, Xb.shape

((124447,), (124447,))

In [22]:
#Split and stratify the data
X_trainb, X_testb, y_trainb, y_testb = train_test_split(Xb, yb, test_size=0.2, stratify = yb, random_state = 1)

### COUNT VECTORIZER  AND LOGISTIC REGRESSION 

I trained some of the models defined above on this dataset.

The scores did not differ much (0.638 with Ridge, 0.645 with Lasso), but the best-performing model was CountVectorizer and Logistic Regression with Lasso regularization and multi_class='ovr'. I ran 5-fold cross-validation on this model, and the mean score (0.640) is close to the test accuracy.
 
All of the models score better than the baseline.


In [11]:
cvec = CountVectorizer(strip_accents='unicode', stop_words="english", ngram_range=(1, 1))

In [117]:
cvec.fit_transform(Xb)

<124447x30170 sparse matrix of type '<class 'numpy.int64'>'
	with 2959640 stored elements in Compressed Sparse Row format>

In [118]:
#CountVectorizer and Logistic Regression with Ridge regularization
pipeline = Pipeline([
    ('vect', cvec),
    ('cls', LogisticRegression(penalty = 'l2', solver='lbfgs', multi_class='ovr', verbose = 1, n_jobs = 2))
])
pipeline.fit(X_trainb, y_trainb)
predicte = pipeline.predict(X_testb)
pipeline.score(X_testb, y_testb)

[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  46 tasks      | elapsed:   29.5s
[Parallel(n_jobs=2)]: Done  85 out of  85 | elapsed:   54.5s finished


0.6383688228204099

In [120]:
print(classification_report(y_testb, predicte))

                               precision    recall  f1-score   support

                    Aglianico       0.76      0.47      0.58        66
                     Albariño       0.62      0.45      0.52       130
                      Barbera       0.62      0.41      0.50       135
                Blaufränkisch       0.72      0.40      0.51        45
                      Bonarda       1.00      0.24      0.38        21
     Bordeaux-style Red Blend       0.63      0.68      0.65      1189
   Bordeaux-style White Blend       0.64      0.47      0.54       154
               Cabernet Franc       0.60      0.35      0.44       282
           Cabernet Sauvignon       0.57      0.69      0.62      2001
    Cabernet Sauvignon-Merlot       0.50      0.04      0.07        25
    Cabernet Sauvignon-Shiraz       0.67      0.10      0.17        21
                    Carmenère       0.57      0.39      0.46       122
              Champagne Blend       0.71      0.57      0.64       279
     

In [None]:
print(cross_val_score(pipelines1, X_train1, y_train1, cv=5))

## Review the results

In [23]:

#CountVectorizer and Logistic Regression and Lasso regularization
pipeline = Pipeline([
    ('vect', cvec),
    ('cls', LogisticRegression(penalty = 'l1', solver='saga', multi_class='ovr', verbose = 1, n_jobs = 2))
])
pipeline.fit(X_trainb, y_trainb)
predicted = pipeline.predict(X_testb)
pipeline.score(X_testb, y_testb)

[Parallel(n_jobs=2)]: Using backend ThreadingBackend with 2 concurrent workers.


max_iter reached after 41 seconds




max_iter reached after 56 seconds
max_iter reached after 33 seconds
max_iter reached after 48 seconds
max_iter reached after 25 seconds
max_iter reached after 48 seconds
max_iter reached after 128 seconds
max_iter reached after 80 seconds
max_iter reached after 29 seconds
max_iter reached after 28 seconds
max_iter reached after 45 seconds
max_iter reached after 176 seconds
max_iter reached after 65 seconds
max_iter reached after 55 seconds
max_iter reached after 49 seconds
max_iter reached after 33 seconds
max_iter reached after 28 seconds
max_iter reached after 191 seconds
max_iter reached after 30 seconds
max_iter reached after 23 seconds
max_iter reached after 33 seconds
max_iter reached after 47 seconds
max_iter reached after 26 seconds
max_iter reached after 54 seconds
max_iter reached after 83 seconds
max_iter reached after 34 seconds
max_iter reached after 25 seconds
max_iter reached after 25 seconds
max_iter reached after 59 seconds
max_iter reached after 27 seconds
max_iter re

[Parallel(n_jobs=2)]: Done  46 tasks      | elapsed: 20.2min


max_iter reached after 98 seconds
max_iter reached after 29 seconds
max_iter reached after 43 seconds
max_iter reached after 205 seconds
max_iter reached after 77 seconds
max_iter reached after 53 seconds
max_iter reached after 26 seconds
max_iter reached after 112 seconds
max_iter reached after 82 seconds
max_iter reached after 43 seconds
max_iter reached after 181 seconds
max_iter reached after 105 seconds
max_iter reached after 112 seconds
max_iter reached after 39 seconds
max_iter reached after 29 seconds
max_iter reached after 35 seconds
max_iter reached after 29 seconds
max_iter reached after 84 seconds
max_iter reached after 22 seconds
max_iter reached after 121 seconds
max_iter reached after 156 seconds
max_iter reached after 94 seconds
max_iter reached after 31 seconds
max_iter reached after 29 seconds
max_iter reached after 56 seconds
max_iter reached after 91 seconds
max_iter reached after 29 seconds
max_iter reached after 27 seconds
max_iter reached after 30 seconds
max_ite

[Parallel(n_jobs=2)]: Done  85 out of  85 | elapsed: 40.6min finished


0.6448774608276416
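
The log above shows most of the one-vs-rest fits stopping at `max_iter` rather than converging. A minimal sketch of how one might check this, on synthetic data from `make_classification` (an assumption, not the wine data), is to compare a small and a large `max_iter` and record whether scikit-learn raises a `ConvergenceWarning`. The `fit_l1` helper is hypothetical.

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression

# Synthetic binary problem standing in for one one-vs-rest sub-problem.
X, y = make_classification(n_samples=300, n_features=20, random_state=0)


def fit_l1(max_iter):
    """Fit an L1 logistic regression with saga; report whether it hit max_iter."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        model = LogisticRegression(penalty="l1", solver="saga",
                                   max_iter=max_iter, random_state=0)
        model.fit(X, y)
    hit_limit = any(issubclass(w.category, ConvergenceWarning) for w in caught)
    return model, hit_limit


short_model, short_hit = fit_l1(max_iter=5)
long_model, long_hit = fit_l1(max_iter=5000)

print("iterations:", short_model.n_iter_, long_model.n_iter_)
print("hit max_iter:", short_hit, long_hit)
```

Whether a larger `max_iter` would change the accuracy reported here is not something this sketch can show. It only illustrates how to detect that the solver stopped early.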

In [24]:
cross = cross_val_score(pipeline, X_trainb, y_trainb, cv=5)
print(cross)
print(cross.mean())

[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  46 tasks      | elapsed: 13.8min
[Parallel(n_jobs=2)]: Done  85 out of  85 | elapsed: 27.6min finished
[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  46 tasks      | elapsed: 15.2min
[Parallel(n_jobs=2)]: Done  85 out of  85 | elapsed: 29.7min finished
[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  46 tasks      | elapsed: 14.2min
[Parallel(n_jobs=2)]: Done  85 out of  85 | elapsed: 29.2min finished
[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  46 tasks      | elapsed: 14.8min
[Parallel(n_jobs=2)]: Done  85 out of  85 | elapsed: 29.7min finished
[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  46 tasks      | elapsed: 13.7min
[Parallel(n_jobs=2)]: Done  85 out of  8

[0.64490574 0.64416562 0.63433623 0.63470044 0.64270985]
0.6401635763847414
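
Passing the whole `Pipeline` to `cross_val_score`, as done above, means the vectorizer is refitted on the training part of every fold, so no vocabulary leaks from the held-out part. A minimal sketch on an invented six-document corpus (texts, labels and the three-fold split are assumptions):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

# Invented toy corpus with two balanced classes.
docs = [
    "ripe cherry plum firm tannins",
    "dark plum blackberry tannins oak",
    "black cherry spice and oak tannins",
    "crisp lemon apple bright acidity",
    "fresh citrus lemon mineral acidity",
    "green apple citrus zest crisp",
]
labels = ["red", "red", "red", "white", "white", "white"]

pipe = Pipeline([
    ("vect", CountVectorizer(stop_words="english")),
    ("cls", LogisticRegression(max_iter=1000)),
])

# Stratified folds keep the class proportions in every split.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)
scores = cross_val_score(pipe, docs, labels, cv=cv)

print(scores, scores.mean())
```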


In [122]:
#CountVectorizer and Logistic Regression and Lasso regularization
pipeline = Pipeline([
    ('vect', cvec),
    ('cls', LogisticRegression(penalty = 'l1', solver='saga', multi_class='ovr', verbose = 1, n_jobs = 2))
])
pipeline.fit(X_trainb, y_trainb)
predicted = pipeline.predict(X_testb)

[Parallel(n_jobs=2)]: Using backend ThreadingBackend with 2 concurrent workers.


max_iter reached after 39 seconds




max_iter reached after 59 seconds
max_iter reached after 39 seconds
max_iter reached after 58 seconds
max_iter reached after 28 seconds
convergence after 885 epochs took 202 seconds
max_iter reached after 52 seconds
max_iter reached after 142 seconds
max_iter reached after 84 seconds
max_iter reached after 32 seconds
max_iter reached after 29 seconds
max_iter reached after 47 seconds
convergence after 1330 epochs took 423 seconds
max_iter reached after 183 seconds
max_iter reached after 63 seconds
max_iter reached after 46 seconds
max_iter reached after 37 seconds
max_iter reached after 27 seconds
max_iter reached after 22 seconds
max_iter reached after 153 seconds
max_iter reached after 25 seconds
max_iter reached after 22 seconds
max_iter reached after 33 seconds
max_iter reached after 45 seconds
max_iter reached after 25 seconds
max_iter reached after 53 seconds
max_iter reached after 80 seconds
max_iter reached after 35 seconds
max_iter reached after 25 seconds
max_iter reached aft

[Parallel(n_jobs=2)]: Done  46 tasks      | elapsed: 19.4min


max_iter reached after 90 seconds
max_iter reached after 27 seconds
max_iter reached after 37 seconds
max_iter reached after 185 seconds
max_iter reached after 69 seconds
max_iter reached after 49 seconds
max_iter reached after 24 seconds
max_iter reached after 103 seconds
max_iter reached after 75 seconds
max_iter reached after 39 seconds
max_iter reached after 168 seconds
max_iter reached after 98 seconds
max_iter reached after 102 seconds
max_iter reached after 32 seconds
max_iter reached after 24 seconds
max_iter reached after 35 seconds
max_iter reached after 28 seconds
max_iter reached after 81 seconds
max_iter reached after 21 seconds
max_iter reached after 118 seconds
max_iter reached after 145 seconds
max_iter reached after 28 seconds
max_iter reached after 85 seconds
max_iter reached after 27 seconds
max_iter reached after 54 seconds
max_iter reached after 86 seconds
max_iter reached after 30 seconds
max_iter reached after 28 seconds
max_iter reached after 29 seconds
max_iter

[Parallel(n_jobs=2)]: Done  85 out of  85 | elapsed: 38.4min finished


In [124]:
pipeline.score(X_testb, y_testb)

0.6449578143832865

In [125]:
print(classification_report(y_testb, predicted))

                               precision    recall  f1-score   support

                    Aglianico       0.77      0.50      0.61        66
                     Albariño       0.66      0.47      0.55       130
                      Barbera       0.66      0.42      0.51       135
                Blaufränkisch       0.59      0.38      0.46        45
                      Bonarda       0.86      0.29      0.43        21
     Bordeaux-style Red Blend       0.64      0.69      0.66      1189
   Bordeaux-style White Blend       0.65      0.45      0.54       154
               Cabernet Franc       0.60      0.36      0.45       282
           Cabernet Sauvignon       0.57      0.69      0.63      2001
    Cabernet Sauvignon-Merlot       0.00      0.00      0.00        25
    Cabernet Sauvignon-Shiraz       0.50      0.05      0.09        21
                    Carmenère       0.60      0.43      0.50       122
              Champagne Blend       0.71      0.56      0.62       279
     

I did not run CountVectorizer and Logistic Regression with Lasso regularization and the multiclass option, because it did not give good results on the other datasets.

### TFIDFVECTORIZER  AND LOGISTIC REGRESSION 

I defined the same models that worked better on the small dataset. Both use max_features = 1000; the first uses Ridge regularization and the second uses Lasso regularization.
The best accuracy is with Lasso regularization (0.584 against 0.577 with Ridge).
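
As a small illustration of what `max_features` does, the sketch below fits two `TfidfVectorizer` instances on an invented corpus (the texts and the cap of 5 are assumptions). With the cap, only the most frequent terms across the corpus are kept, which shrinks the feature matrix the classifier sees.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented toy corpus.
docs = [
    "ripe cherry plum firm tannins",
    "dark plum blackberry tannins oak",
    "crisp lemon apple bright acidity",
    "fresh citrus lemon mineral acidity",
]

full = TfidfVectorizer(stop_words="english", sublinear_tf=True)
capped = TfidfVectorizer(stop_words="english", sublinear_tf=True, max_features=5)

X_full = full.fit_transform(docs)
X_capped = capped.fit_transform(docs)

print(X_full.shape, X_capped.shape)
print(sorted(capped.vocabulary_))
```

The trade-off is the one seen in the scores of this section: fewer features train faster, but rare, variety-specific words may be discarded.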

In [126]:
vect = TfidfVectorizer(stop_words='english', sublinear_tf=True, max_features=1000,
                             ngram_range = (1,1), strip_accents='unicode')

In [127]:
# TfidfVectorizer and Logistic Regression with Ridge regularization.
pipelinev_ridge = Pipeline([
    ('vect', vect),
    ('cls', LogisticRegression(penalty = 'l2', solver='lbfgs', multi_class='ovr', verbose = 1, n_jobs = 2))
])
pipelinev_ridge.fit(X_trainb, y_trainb)
predictedv_ridge = pipelinev_ridge.predict(X_testb)
pipelinev_ridge.score(X_testb, y_testb)

[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  46 tasks      | elapsed:   13.2s
[Parallel(n_jobs=2)]: Done  85 out of  85 | elapsed:   23.8s finished


0.5773804740859783

In [130]:
print(classification_report(y_testb, predictedv_ridge))

                               precision    recall  f1-score   support

                    Aglianico       0.00      0.00      0.00        66
                     Albariño       0.39      0.12      0.18       130
                      Barbera       0.50      0.05      0.09       135
                Blaufränkisch       0.62      0.11      0.19        45
                      Bonarda       0.00      0.00      0.00        21
     Bordeaux-style Red Blend       0.58      0.65      0.61      1189
   Bordeaux-style White Blend       0.51      0.15      0.23       154
               Cabernet Franc       0.75      0.28      0.41       282
           Cabernet Sauvignon       0.53      0.70      0.60      2001
    Cabernet Sauvignon-Merlot       0.00      0.00      0.00        25
    Cabernet Sauvignon-Shiraz       0.00      0.00      0.00        21
                    Carmenère       0.40      0.19      0.26       122
              Champagne Blend       0.71      0.42      0.53       279
     

In [128]:
# TfidfVectorizer (max_features = 1000) and Logistic Regression with Lasso regularization
pipelinev = Pipeline([
    ('vect', TfidfVectorizer(stop_words='english', sublinear_tf=True, max_features=1000,
                             ngram_range = (1,1), strip_accents='unicode')),
    ('cls', LogisticRegression(penalty = 'l1', solver='saga', multi_class='ovr', verbose = 1, n_jobs = 2))
])
pipelinev.fit(X_trainb, y_trainb)
predictedv = pipelinev.predict(X_testb)
pipelinev.score(X_testb, y_testb)

[Parallel(n_jobs=2)]: Using backend ThreadingBackend with 2 concurrent workers.


max_iter reached after 10 seconds




max_iter reached after 13 seconds
max_iter reached after 9 seconds
max_iter reached after 13 seconds
max_iter reached after 6 seconds
max_iter reached after 13 seconds
convergence after 89 epochs took 19 seconds
convergence after 38 epochs took 10 seconds
max_iter reached after 15 seconds
max_iter reached after 6 seconds
max_iter reached after 6 seconds
max_iter reached after 13 seconds
max_iter reached after 17 seconds
convergence after 35 epochs took 9 seconds
max_iter reached after 11 seconds
max_iter reached after 13 seconds
max_iter reached after 8 seconds
max_iter reached after 7 seconds
max_iter reached after 7 seconds
max_iter reached after 7 seconds
max_iter reached after 6 seconds
max_iter reached after 13 seconds
max_iter reached after 8 seconds
convergence after 30 epochs took 4 seconds
convergence after 76 epochs took 11 seconds
max_iter reached after 11 seconds
max_iter reached after 6 seconds
max_iter reached after 7 seconds
convergence after 78 epochs took 11 seconds
ma

[Parallel(n_jobs=2)]: Done  46 tasks      | elapsed:  3.7min


max_iter reached after 8 seconds
max_iter reached after 8 seconds
max_iter reached after 12 seconds
max_iter reached after 29 seconds
convergence after 84 epochs took 14 seconds
max_iter reached after 13 seconds
convergence after 33 epochs took 7 seconds
max_iter reached after 7 seconds
convergence after 26 epochs took 6 seconds
max_iter reached after 9 seconds
convergence after 99 epochs took 18 seconds
convergence after 44 epochs took 10 seconds
max_iter reached after 7 seconds
max_iter reached after 7 seconds
max_iter reached after 19 seconds
convergence after 31 epochs took 5 seconds
max_iter reached after 8 seconds
max_iter reached after 12 seconds
convergence after 34 epochs took 7 seconds
max_iter reached after 7 seconds
convergence after 34 epochs took 6 seconds
convergence after 39 epochs took 9 seconds
max_iter reached after 6 seconds
max_iter reached after 7 seconds
convergence after 49 epochs took 9 seconds
max_iter reached after 13 seconds
max_iter reached after 8 seconds


[Parallel(n_jobs=2)]: Done  85 out of  85 | elapsed:  6.8min finished


0.5838087585375653

In [129]:
print(classification_report(y_testb, predictedv))



                               precision    recall  f1-score   support

                    Aglianico       0.40      0.03      0.06        66
                     Albariño       0.35      0.13      0.19       130
                      Barbera       0.39      0.09      0.14       135
                Blaufränkisch       0.64      0.16      0.25        45
                      Bonarda       0.00      0.00      0.00        21
     Bordeaux-style Red Blend       0.60      0.66      0.63      1189
   Bordeaux-style White Blend       0.45      0.21      0.28       154
               Cabernet Franc       0.71      0.30      0.42       282
           Cabernet Sauvignon       0.54      0.69      0.60      2001
    Cabernet Sauvignon-Merlot       0.00      0.00      0.00        25
    Cabernet Sauvignon-Shiraz       0.00      0.00      0.00        21
                    Carmenère       0.43      0.24      0.31       122
              Champagne Blend       0.68      0.45      0.54       279
     

I am not going to run TfidfVectorizer and Logistic Regression with Lasso regularization and without max_features = 1000, because on the other datasets it did not perform as well as the other models.


### RANDOM FOREST CLASSIFIER WITH COUNTVECTORIZER AND TFIDFVECTORIZER



Logistic Regression performs better than Random Forest Classifier on this dataset, with higher accuracy.

The best-performing Random Forest model was the one with CountVectorizer and no max_features limit (0.566), almost tied with TfidfVectorizer (0.564).

All of the models score better than the baseline.
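
The Random Forest scores in this section are very close to each other, and the notebook does not fix `random_state`, so part of the gap may be run-to-run noise. A minimal sketch on synthetic data (an assumption, not the wine data) shows how fixing the seed makes a run repeatable and how accuracy moves across seeds. The `fit_predict` helper is hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic three-class problem standing in for the vectorized descriptions.
X, y = make_classification(n_samples=400, n_features=20, n_informative=5,
                           n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          stratify=y, random_state=1)


def fit_predict(seed):
    """Train a forest with a fixed seed and predict the test split."""
    forest = RandomForestClassifier(n_estimators=50, random_state=seed)
    return forest.fit(X_tr, y_tr).predict(X_te)


# Same seed: identical predictions. Different seeds: accuracy can vary.
same_seed_equal = np.array_equal(fit_predict(0), fit_predict(0))
accuracies = [float((fit_predict(seed) == y_te).mean()) for seed in range(5)]

print(same_seed_equal)
print(accuracies)
```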


In [132]:
# Random Forest with CountVectorizer and 1000 max_features
cvecbr = CountVectorizer(strip_accents='unicode', stop_words="english", ngram_range=(1, 1),  max_features=1000)
pipelinebr1 = Pipeline([
    ('vect', cvecbr),
    ('cls', RandomForestClassifier(n_estimators=200, n_jobs = 2))
])
pipelinebr1.fit(X_trainb, y_trainb)
predictedbr1 = pipelinebr1.predict(X_testb)
pipelinebr1.score(X_testb, y_testb)

0.5511852149457613

In [134]:
# Random Forest with CountVectorizer
cvecbr = CountVectorizer(strip_accents='unicode', stop_words="english", ngram_range=(1, 1))
pipelinebr2 = Pipeline([
    ('vect', cvecbr),
    ('cls', RandomForestClassifier(n_estimators=200, n_jobs = 2))
])
pipelinebr2.fit(X_trainb, y_trainb)
predictedbr2 = pipelinebr2.predict(X_testb)
pipelinebr2.score(X_testb, y_testb)

0.5655685014061872

In [135]:
print(classification_report(y_testb, predictedbr2))



                               precision    recall  f1-score   support

                    Aglianico       1.00      0.02      0.03        66
                     Albariño       0.95      0.16      0.28       130
                      Barbera       0.92      0.24      0.39       135
                Blaufränkisch       0.00      0.00      0.00        45
                      Bonarda       0.00      0.00      0.00        21
     Bordeaux-style Red Blend       0.51      0.73      0.60      1189
   Bordeaux-style White Blend       0.88      0.14      0.24       154
               Cabernet Franc       0.92      0.08      0.15       282
           Cabernet Sauvignon       0.44      0.76      0.56      2001
    Cabernet Sauvignon-Merlot       0.00      0.00      0.00        25
    Cabernet Sauvignon-Shiraz       0.00      0.00      0.00        21
                    Carmenère       1.00      0.20      0.33       122
              Champagne Blend       0.83      0.30      0.45       279
     

In [136]:
# Random Forest with TfidfVectorizer
pipelinebr3 = Pipeline([
    ('vect', TfidfVectorizer(stop_words='english', sublinear_tf=True,
                             ngram_range = (1,1), strip_accents='unicode')),
    
    ('cls', RandomForestClassifier(n_estimators=200, n_jobs = 2))
])
pipelinebr3.fit(X_trainb, y_trainb)
predictedbr3 = pipelinebr3.predict(X_testb)
pipelinebr3.score(X_testb, y_testb)

0.5640417838489353

In [138]:
print(classification_report(y_testb, predictedbr3))



                               precision    recall  f1-score   support

                    Aglianico       1.00      0.02      0.03        66
                     Albariño       0.96      0.17      0.29       130
                      Barbera       0.94      0.23      0.37       135
                Blaufränkisch       0.00      0.00      0.00        45
                      Bonarda       0.00      0.00      0.00        21
     Bordeaux-style Red Blend       0.53      0.72      0.61      1189
   Bordeaux-style White Blend       0.76      0.10      0.18       154
               Cabernet Franc       0.95      0.06      0.12       282
           Cabernet Sauvignon       0.45      0.75      0.56      2001
    Cabernet Sauvignon-Merlot       0.00      0.00      0.00        25
    Cabernet Sauvignon-Shiraz       0.00      0.00      0.00        21
                    Carmenère       1.00      0.21      0.35       122
              Champagne Blend       0.82      0.29      0.43       279
     

### NAIVE BAYES, COUNTVECTORIZER AND TFIDFVECTORIZER

As with the small dataset, the best Naive Bayes accuracy came from MultinomialNB with CountVectorizer.

All of the models have better scores than the baseline.

In [139]:
vect = CountVectorizer(ngram_range=(1, 1), strip_accents='unicode', stop_words='english')

In [140]:
# MultinomialNB with CountVectorizer followed by TfidfTransformer
pipelinebn = Pipeline([
    ('vect', vect),
    ('tfidf', TfidfTransformer()),
    ('cls', MultinomialNB())
])
pipelinebn.fit(X_trainb, y_trainb)
predictedbn = pipelinebn.predict(X_testb)
pipelinebn.score(X_testb, y_testb)

0.40020088388911207

In [None]:
print(classification_report(y_testb, predictedbn))

In [141]:
# MultinomialNB with CountVectorizer
pipelinebn1 = Pipeline([
    ('vect', vect),
    ('cls', MultinomialNB())
])
pipelinebn1.fit(X_trainb, y_trainb)
predictedbn1 = pipelinebn1.predict(X_testb)
pipelinebn1.score(X_testb, y_testb)

0.5276014463640016

In [142]:
# BernoulliNB
pipelinebn2 = Pipeline([
    ('vect', vect),
    ('cls', BernoulliNB())
])
pipelinebn2.fit(X_trainb, y_trainb)
predictedbn2 = pipelinebn2.predict(X_testb)
pipelinebn2.score(X_testb, y_testb)

0.5020891924467658

In [143]:
vect1 = TfidfVectorizer(stop_words='english', sublinear_tf=True,
                             ngram_range = (1,1), strip_accents='unicode')

In [144]:
# MultinomialNB with TfidfVectorizer
pipelinebn3 = Pipeline([
    ('vect',vect1) ,
    ('cls', MultinomialNB())
])
pipelinebn3.fit(X_trainb, y_trainb)
predictedbn3 = pipelinebn3.predict(X_testb)
pipelinebn3.score(X_testb, y_testb)

0.4001205303334673

In [146]:
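# BernoulliNB with TfidfVectorizer (BernoulliNB binarizes its input, so the score matches the CountVectorizer run)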
pipelinebn4 = Pipeline([
    ('vect', vect1),
    ('cls', BernoulliNB())
])
pipelinebn4.fit(X_trainb, y_trainb)
predictedbn4 = pipelinebn4.predict(X_testb)
pipelinebn4.score(X_testb, y_testb)

0.5020891924467658
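
The BernoulliNB score is identical for CountVectorizer and TfidfVectorizer features. This is expected: by default BernoulliNB binarizes its input at a threshold of 0.0, and both vectorizers produce non-zero entries in exactly the same positions when they share tokenization settings. A minimal sketch on an invented corpus (texts and labels are assumptions):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import Pipeline

# Invented toy corpus.
docs = [
    "ripe cherry plum firm tannins",
    "dark plum blackberry tannins oak",
    "black cherry spice and oak tannins",
    "crisp lemon apple bright acidity",
    "fresh citrus lemon mineral acidity",
    "green apple citrus zest crisp",
]
labels = ["red", "red", "red", "white", "white", "white"]

# Same tokenization settings for both vectorizers.
shared = dict(stop_words="english", strip_accents="unicode", ngram_range=(1, 1))

count_nb = Pipeline([("vect", CountVectorizer(**shared)),
                     ("cls", BernoulliNB())]).fit(docs, labels)
tfidf_nb = Pipeline([("vect", TfidfVectorizer(sublinear_tf=True, **shared)),
                     ("cls", BernoulliNB())]).fit(docs, labels)

new_docs = ["plum and oak with soft tannins", "lemon zest and crisp acidity"]
same_labels = np.array_equal(count_nb.predict(new_docs), tfidf_nb.predict(new_docs))
same_probs = np.allclose(count_nb.predict_proba(new_docs),
                         tfidf_nb.predict_proba(new_docs))

print(same_labels, same_probs)
```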

## Conclusion:
The best accuracy on this dataset (0.645) came from Logistic Regression with CountVectorizer, Lasso regularization and multi_class 'ovr'. The dataset with the reduced number of varieties performs better in all of the trained models.
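
For a side-by-side view, the test accuracies reported in the big-dataset cells above can be collected in one table. The sketch below only restates those scores, rounded to four decimals, next to the 0.1142 baseline. The row labels are shorthand introduced here, not names used in the notebook.

```python
import pandas as pd

BASELINE = 0.1142  # share of the most common variety after the reduction

# Test accuracies copied from the big-dataset cells, rounded to four decimals.
accuracy = pd.Series({
    "LogReg L1 + CountVectorizer": 0.6450,
    "LogReg L2 + CountVectorizer": 0.6384,
    "LogReg L1 + Tfidf (1000 features)": 0.5838,
    "LogReg L2 + Tfidf (1000 features)": 0.5774,
    "RandomForest + CountVectorizer": 0.5656,
    "RandomForest + Tfidf": 0.5640,
    "RandomForest + CountVectorizer (1000 features)": 0.5512,
    "MultinomialNB + CountVectorizer": 0.5276,
    "BernoulliNB + CountVectorizer": 0.5021,
    "BernoulliNB + Tfidf": 0.5021,
    "MultinomialNB + CountVectorizer + TfidfTransformer": 0.4002,
    "MultinomialNB + Tfidf": 0.4001,
})

summary = pd.DataFrame({
    "accuracy": accuracy,
    "gain_over_baseline": accuracy - BASELINE,
}).sort_values("accuracy", ascending=False)

print(summary.round(4))
```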