# CAPSTONE - FINDINGS AND TECHNICAL REPORT WINE BLIND TASTING (COUNTRY)

I have built a classification model that, given a wine description written by an expert or semi-expert taster, predicts which country the wine was produced in. The small dataset contains 42 different countries and the big dataset contains 43. I have treated Washington and California as countries in their own right because these two regions are so heavily represented among the US wines.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.naive_bayes import MultinomialNB, BernoulliNB, GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
pd.set_option('display.max_columns',500)
%config InlineBackend.figure_format = 'retina'
%matplotlib inline

## SMALL DATASET

In [2]:
df = pd.read_csv('small_wineV1.csv')

In [3]:
df.head()

Unnamed: 0,country,description,designation,points,price,province,region_1,taster_name,title,variety,winery,vintage
0,France,"Buxom and heady, this is a delightfully rich, ...",Vieilles Vignes,99,114.0,Châteauneuf-du-Pape,Rhône Valley,Anna Lee C. Iijima,Domaine de la Janasse 2016 Vieilles Vignes Red...,Rhône-style Red Blend,Domaine de la Janasse,2016.0
1,France,"Sultry and silken on the palate, this wine sta...",La Réserve,98,175.0,Châteauneuf-du-Pape,Rhône Valley,Anna Lee C. Iijima,Domaine le Clos du Caillou 2016 La Réserve Red...,Rhône-style Red Blend,Domaine le Clos du Caillou,2016.0
2,Portugal,The wine's fine perfumed black plum fruits giv...,,98,120.0,Port,Port Blend,Roger Voss,Fonseca 2017 Port,Port,Fonseca,2017.0
3,France,"Veins of vanilla, smoke and toast amplify blac...",Hommage à Henry Tacussel,98,80.0,Châteauneuf-du-Pape,Rhône Valley,Anna Lee C. Iijima,Domaine Moulin-Tacussel 2016 Hommage à Henry T...,Grenache,Domaine Moulin-Tacussel,2016.0
4,France,"This juicy, fruit-forward wine drenches the pa...",La Muse,97,88.0,Châteauneuf-du-Pape,Rhône Valley,Anna Lee C. Iijima,Guillaume Gonnet 2016 La Muse Red (Châteauneuf...,Rhône-style Red Blend,Guillaume Gonnet,2016.0


In [4]:
df = df.dropna(axis=0, subset=['country'])

In [5]:
print("There are {} countries producing wine in this dataset such as {}... \n".
      format(len(df.country.unique()), ", ".join(df.country.unique()[0:5])))  

There are 40 countries producing wine in this dataset such as France, Portugal, Italy, US, Chile... 



In [6]:
print('The country baseline is', df.country.value_counts(normalize=True).max())

The country baseline is 0.3593197943297044
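The baseline printed above is the share of the most frequent country, that is, the accuracy a model would reach by always predicting that country. As a rough illustration, the same quantity can be computed without pandas. The label list and the helper name `majority_baseline` are made up for this sketch and are not taken from the dataset.

```python
from collections import Counter

def majority_baseline(labels):
    """Return (most common label, its relative frequency)."""
    counts = Counter(labels)
    label, count = counts.most_common(1)[0]
    return label, count / len(labels)

# Hypothetical labels, for illustration only.
toy_countries = ["France", "France", "Italy", "US", "France", "Italy"]
top_country, baseline = majority_baseline(toy_countries)
print(top_country, baseline)
```

Any model worth keeping should score clearly above this number on held-out data.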


In [7]:
cvec = CountVectorizer(strip_accents='unicode',
                       stop_words="english", 
                       ngram_range=(1, 1))

X_all = cvec.fit_transform(df['description'])
columns = cvec.get_feature_names()
X_all

<34813x15200 sparse matrix of type '<class 'numpy.int64'>'
	with 910512 stored elements in Compressed Sparse Row format>

### DEFINE THE VARIABLE COUNTRY AS A TARGET


In [8]:
X = df.description
y = df.country

In [9]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

### COUNT VECTORIZER  AND LOGISTIC REGRESSION 

In CountVectorizer:
- I am using strip_accents='unicode' to remove accents and perform other character normalization.
- I am using stop_words='english' to eliminate the most common words, such as 'and', 'the' and 'of', because they are not relevant to the vocabulary used to describe wine.
- The ngram_range is (1, 1) because I only want to extract single words (unigrams).
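To make the three options concrete, here is a minimal pure-Python sketch of what they do to one description. It is not scikit-learn's implementation: the stop-word set is a tiny stand-in for the real 'english' list, and the helper names `strip_accents` and `unigram_counts` are hypothetical.

```python
import re
import unicodedata

# Tiny stand-in stop-word list (assumption); the real 'english' list is much longer.
STOP_WORDS = {"and", "the", "of", "a", "with", "this", "is"}

def strip_accents(text):
    """Decompose characters (NFKD) and drop the combining accent marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def unigram_counts(description):
    """Lowercase, strip accents, keep tokens of 2+ word characters, drop stop words."""
    cleaned = strip_accents(description.lower())
    tokens = re.findall(r"\b\w\w+\b", cleaned)
    counts = {}
    for token in tokens:
        if token not in STOP_WORDS:
            counts[token] = counts.get(token, 0) + 1
    return counts

counts = unigram_counts(
    "This Rhône blend is rich, with the rich fruit of a Châteauneuf-du-Pape"
)
print(counts)
```

Each description becomes one row of counts like this, and the vocabulary over all descriptions defines the columns of the sparse matrix.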

In Logistic Regression:
- penalty: 'l1' applies Lasso regularization and 'l2' applies Ridge regularization.
- 'saga': this solver supports Lasso regularization and is usually faster than 'liblinear' on large datasets.
- 'lbfgs': the solver I use with Ridge regularization.
- multi_class: I used 'ovr' (one-vs-rest) in the first run and changed it to 'multinomial' in the second, but the accuracy was higher in the first.
- random_state: I am leaving it at its default. Because 'saga' shuffles the data, its results can vary slightly between runs.

The best-performing model used Ridge regularization with multi_class='ovr'.
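The difference between 'ovr' and 'multinomial' can be sketched as two ways of turning per-class linear scores into probabilities. The scores below are hypothetical and shared by both schemes purely for illustration. In a real fit the two settings also learn different weights, so this shows only the probability mapping, not the training.

```python
import math

def one_vs_rest_probs(scores):
    """One independent sigmoid per class: each answers 'this class or not?'."""
    return [1.0 / (1.0 + math.exp(-s)) for s in scores]

def multinomial_probs(scores):
    """A single softmax over all classes, so the probabilities sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical linear scores for three countries, for illustration only.
countries = ["France", "Italy", "US"]
scores = [2.0, 0.5, -1.0]

ovr = one_vs_rest_probs(scores)
softmax = multinomial_probs(scores)
print(dict(zip(countries, ovr)), sum(ovr))
print(dict(zip(countries, softmax)), sum(softmax))
```

The one-vs-rest values do not sum to 1 on their own because each binary problem is solved separately, which is also why the 'ovr' runs below report one fitting task per country.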


In [10]:
#CountVectorizer and Logistic Regression with Ridge Regularization and multiclass 'ovr' 
pipeline_ridge = Pipeline([
    ('vect', cvec),
    ('cls', LogisticRegression(penalty = 'l2', solver='lbfgs', multi_class='ovr', verbose = 1, n_jobs = 2))
])
pipeline_ridge.fit(X_train, y_train)
predicted_ridge = pipeline_ridge.predict(X_test)
pipeline_ridge.score(X_test, y_test)

[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  39 out of  39 | elapsed:    6.4s finished


0.9133993968117191

In [11]:
print(classification_report(y_test, predicted_ridge))

              precision    recall  f1-score   support

   Argentina       0.70      0.48      0.57       127
   Australia       0.79      0.89      0.83       173
     Austria       0.88      0.82      0.85       182
      Brazil       0.00      0.00      0.00         2
    Bulgaria       0.53      0.80      0.64        10
      Canada       0.00      0.00      0.00        12
       Chile       0.55      0.44      0.49       142
     Croatia       0.00      0.00      0.00         3
     England       1.00      0.22      0.36         9
      France       0.88      0.96      0.92      1461
     Georgia       0.80      0.44      0.57         9
     Germany       0.95      0.92      0.93       120
      Greece       1.00      1.00      1.00        21
     Hungary       0.80      0.67      0.73         6
       India       0.00      0.00      0.00         1
      Israel       0.80      0.97      0.88        36
       Italy       0.98      0.98      0.98       927
      Kosovo       0.00    



In [12]:
cross = cross_val_score(pipeline_ridge, X_train, y_train, cv=5)
print(cross)
print(cross.mean())

[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  37 out of  37 | elapsed:    4.4s finished
[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  36 out of  39 | elapsed:    3.6s remaining:    0.3s
[Parallel(n_jobs=2)]: Done  39 out of  39 | elapsed:    3.8s finished
[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  36 out of  39 | elapsed:    3.9s remaining:    0.3s
[Parallel(n_jobs=2)]: Done  39 out of  39 | elapsed:    4.1s finished
[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  36 out of  39 | elapsed:    3.9s remaining:    0.3s
[Parallel(n_jobs=2)]: Done  39 out of  39 | elapsed:    4.1s finished
[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  39 out of  39 | elapsed:    4.2s finished


[0.91013247 0.91178053 0.9091725  0.90577234 0.91143114]
0.9096577975748428
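The five scores above come from 5-fold cross-validation: the training data is cut into five parts, and each part is used once for validation while the other four are used for fitting. Below is a minimal sketch of plain, unshuffled k-fold index splitting. Note the assumption: for classifiers, `cross_val_score` with an integer `cv` uses stratified folds, so this simplified version illustrates the idea rather than reproducing the exact splits. The helper name `k_fold_indices` is hypothetical.

```python
def k_fold_indices(n_samples, n_splits=5):
    """Yield (train_indices, validation_indices) for plain, unshuffled k-fold."""
    fold_sizes = [n_samples // n_splits] * n_splits
    for i in range(n_samples % n_splits):
        fold_sizes[i] += 1  # spread the remainder over the first folds
    start = 0
    for size in fold_sizes:
        stop = start + size
        validation = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, validation
        start = stop

folds = list(k_fold_indices(n_samples=12, n_splits=5))
for train, validation in folds:
    print(len(train), validation)
```

Every sample is used for validation exactly once, which is why the mean of the fold scores is a steadier estimate than a single train/test split.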


In [13]:
#CountVectorizer and Logistic Regression with Lasso Regularization and multiclass 'ovr' 
pipelinea = Pipeline([
    ('vect', cvec),
    ('cls', LogisticRegression(penalty = 'l1', solver='saga', multi_class='ovr', verbose = 1, n_jobs = 2))
])
pipelinea.fit(X_train, y_train)
predicteda = pipelinea.predict(X_test)
pipelinea.score(X_test, y_test)

[Parallel(n_jobs=2)]: Using backend ThreadingBackend with 2 concurrent workers.


max_iter reached after 2 seconds





[Parallel(n_jobs=2)]: Done  39 out of  39 | elapsed:  1.5min finished


0.9112451529513141

In [14]:
print(classification_report(y_test, predicteda))

              precision    recall  f1-score   support

   Argentina       0.72      0.50      0.59       127
   Australia       0.79      0.87      0.82       173
     Austria       0.86      0.84      0.85       182
      Brazil       0.00      0.00      0.00         2
    Bulgaria       0.56      0.90      0.69        10
      Canada       0.00      0.00      0.00        12
       Chile       0.59      0.45      0.51       142
     Croatia       0.00      0.00      0.00         3
     England       0.50      0.33      0.40         9
      France       0.88      0.95      0.92      1461
     Georgia       0.57      0.44      0.50         9
     Germany       0.93      0.89      0.91       120
      Greece       1.00      1.00      1.00        21
     Hungary       1.00      0.33      0.50         6
       India       0.00      0.00      0.00         1
      Israel       0.81      0.94      0.87        36
       Italy       0.98      0.98      0.98       927
      Kosovo       0.00    



In [15]:
cross = cross_val_score(pipelinea, X_train, y_train, cv=5)
print(cross)
print(cross.mean())

[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  37 out of  37 | elapsed:  1.1min finished
[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  39 out of  39 | elapsed:  1.1min finished
[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  39 out of  39 | elapsed:  1.1min finished
[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  39 out of  39 | elapsed:  1.1min finished
[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  39 out of  39 | elapsed:  1.1min finished


[0.91013247 0.90998745 0.909711   0.90972847 0.91269127]
0.9104501322263772


In [19]:
#CountVectorizer and Logistic Regression with Lasso Regularization and multiclass 'multinomial' 
pipeline = Pipeline([
    ('vect', cvec),
    ('cls', LogisticRegression(penalty = 'l1', solver='saga', multi_class='multinomial', verbose = 1, n_jobs = 2))
])
pipeline.fit(X_train, y_train)
predicted = pipeline.predict(X_test)
pipeline.score(X_test, y_test)

[Parallel(n_jobs=2)]: Using backend ThreadingBackend with 2 concurrent workers.


max_iter reached after 77 seconds


[Parallel(n_jobs=2)]: Done   1 out of   1 | elapsed:  1.3min finished


0.9123940830101968

### TFIDFVECTORIZER  AND LOGISTIC REGRESSION 

CountVectorizer only counts word frequencies. TfidfVectorizer counts word frequencies and also computes the inverse document frequency (IDF), which measures how much information a word provides, that is, whether the term is common or rare across all the documents.

I am going to use the same parameters as with CountVectorizer, adding sublinear_tf to scale the term frequency and, in one run, max_features = 2000, which keeps only the top 2000 features ordered by term frequency across the corpus.

In Logistic Regression and GridSearchCV the parameters are the same as those I defined above.

The best-performing model was the one with Lasso regularization and all the features.
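As a worked illustration of the weighting, the sketch below computes the smoothed IDF, ln((1 + n) / (1 + df)) + 1, and the sublinear term frequency, 1 + ln(tf), on a made-up tokenized corpus. These are the formulas scikit-learn documents for its default `smooth_idf=True` and for `sublinear_tf=True`. The final L2 row normalization is left out, and the corpus and helper names are assumptions for the example.

```python
import math

def smoothed_idf(n_documents, document_frequency):
    """idf = ln((1 + n) / (1 + df)) + 1, the smoothed form."""
    return math.log((1 + n_documents) / (1 + document_frequency)) + 1

def sublinear_tf(term_count):
    """With sublinear scaling the raw count tf is replaced by 1 + ln(tf)."""
    return 1 + math.log(term_count) if term_count > 0 else 0.0

# Hypothetical corpus, already tokenized.
corpus = [
    ["cherry", "oak", "cherry"],
    ["citrus", "oak"],
    ["cherry", "plum"],
    ["oak", "smoke", "oak", "oak"],
]
n_docs = len(corpus)
doc_freq = {}
for doc in corpus:
    for term in set(doc):
        doc_freq[term] = doc_freq.get(term, 0) + 1

idf = {term: smoothed_idf(n_docs, df) for term, df in doc_freq.items()}
first_doc_weights = {
    term: sublinear_tf(corpus[0].count(term)) * idf[term] for term in set(corpus[0])
}
print(idf)
print(first_doc_weights)
```

Rare terms such as "citrus" get a larger IDF than terms that appear in most documents, such as "oak", which is the sense in which IDF measures how informative a word is.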


In [20]:
vect = TfidfVectorizer(stop_words='english', sublinear_tf=True,
                             ngram_range = (1,1), strip_accents='unicode')

In [23]:
# TfidfVectorizer and Logistic Regression with Ridge regularization
pipeline_ridge = Pipeline([
    ('vect', vect),
    ('cls', LogisticRegression(penalty = 'l2', solver='lbfgs', verbose = 1, n_jobs = 2))
])
pipeline_ridge.fit(X_train, y_train)
predicted_ridge = pipeline_ridge.predict(X_test)
pipeline_ridge.score(X_test, y_test)

[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  39 out of  39 | elapsed:    4.9s finished


0.9003303173919288

In [25]:
# TfidfVectorizer and Logistic Regression with Lasso regularization and multiclass 'ovr'
pipeline = Pipeline([
    ('vect', vect),
    ('cls', LogisticRegression(penalty = 'l1', solver='saga', multi_class='ovr'))
])
pipeline.fit(X_train, y_train)
predicted = pipeline.predict(X_test)
pipeline.score(X_test, y_test)



0.9065058164584231

In [34]:
# TfidfVectorizer and Logistic Regression with Ridge regularization (the default penalty) and multiclass 'multinomial'
pipelinesv = Pipeline([
    ('vect', vect),
    ('cls', LogisticRegression(solver='saga', multi_class='multinomial'))
])
pipelinesv.fit(X_train, y_train)
predictedsv = pipelinesv.predict(X_test)
pipelinesv.score(X_test, y_test)
    
    

0.9063622002010627

In [36]:
# TFIDFVECTORIZER AND LOGISTIC REGRESSION including Lasso regularization and 'multinomial' as the multiclass option
pipelinesv = Pipeline([
    ('vect', vect),
    ('cls', LogisticRegression(solver='saga', multi_class='multinomial', penalty = 'l1'))
])
pipelinesv.fit(X_train, y_train)
predictedsv = pipelinesv.predict(X_test)
pipelinesv.score(X_test, y_test)



0.9096653741203504

In [24]:
# TFIDFVECTORIZER AND LOGISTIC REGRESSION including Lasso regularization and max_features = 2000
pipelinesv = Pipeline([
    ('vect', TfidfVectorizer(stop_words='english', sublinear_tf=True,
                             ngram_range = (1,1), strip_accents='unicode',  max_features = 2000)),
    ('cls', LogisticRegression(solver='saga', multi_class='multinomial', penalty = 'l1'))
])
pipelinesv.fit(X_train, y_train)
predictedsv = pipelinesv.predict(X_test)
pipelinesv.score(X_test, y_test)



0.9053568863995404

In [17]:
# TFIDFVECTORIZER AND LOGISTIC REGRESSION including Ridge regularization and solver 'sag'
pipelinesv = Pipeline([
    ('vect', vect),
    ('cls', LogisticRegression(solver='sag', multi_class='multinomial', penalty = 'l2'))
])
pipelinesv.fit(X_train, y_train)
predictedsv = pipelinesv.predict(X_test)
pipelinesv.score(X_test, y_test)

0.9053568863995404

### RANDOM FOREST CLASSIFIER, COUNTVECTORIZER AND TFIDFVECTORIZER

I am going to use the Random Forest Classifier because it is one of the most widespread classifiers and it can handle a large dataset with high dimensionality. Random forests are considered highly accurate and robust because of the number of decision trees participating in the process.
They are much less prone to overfitting than a single decision tree. The main reason is that the forest averages the predictions of all its trees, which cancels out much of the error of the individual trees.

The Random Forest Classifier parameters:

- n_estimators is the number of trees in the forest. Usually, the more trees, the better the model learns the data. However, adding a lot of trees can slow down training considerably, so I settled on 200 as the sweet spot.
- n_jobs is the number of jobs to run in parallel for both fit and predict; I set it to 2.
- max_depth is the maximum depth of each tree in the forest. The deeper the tree, the more splits it has and the more information it captures about the data. I set it to 10 in the last random forest, which is the worst performer of all the models I have created.

The choice between TfidfVectorizer and CountVectorizer makes little difference here: TfidfVectorizer performs slightly better than CountVectorizer.

Logistic Regression performs better than the Random Forest Classifier on this dataset, with higher accuracy. Both have better scores than the baseline.
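The averaging argument can be illustrated with a small simulation. This is only a sketch under a strong assumption: each "tree" is modelled as the true value plus independent Gaussian noise. Real trees in a forest are correlated, so the actual reduction in error is smaller than what this toy shows.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is repeatable

TRUE_VALUE = 1.0
N_TREES = 200    # same number of trees as n_estimators=200
N_TRIALS = 500

def noisy_prediction():
    """One hypothetical 'tree': the true value plus independent noise."""
    return TRUE_VALUE + rng.gauss(0, 1)

single_tree = [noisy_prediction() for _ in range(N_TRIALS)]
forest = [
    statistics.mean(noisy_prediction() for _ in range(N_TREES))
    for _ in range(N_TRIALS)
]

print("variance of a single tree:", statistics.variance(single_tree))
print("variance of the forest average:", statistics.variance(forest))
```

The averaged prediction varies far less than any single noisy predictor, which is the effect the forest relies on.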

In [9]:
# Random Forest with CountVectorizer
cvecsr = CountVectorizer(strip_accents='unicode', stop_words="english", ngram_range=(1, 1),  max_features=1000)
pipelinesr = Pipeline([
    ('vect', cvecsr),
    #('tfidf', TfidfTransformer()),
    ('cls', RandomForestClassifier(n_estimators=200, n_jobs = 2))
])
pipelinesr.fit(X_train, y_train)
predictedsr = pipelinesr.predict(X_test)
pipelinesr.score(X_test, y_test)

0.8820910527071665

In [16]:
# Random Forest with TfidfVectorizer
pipelinesr = Pipeline([
    ('vect', vect),
    
    ('cls', RandomForestClassifier(n_estimators=200, n_jobs = 2))
])
pipelinesr.fit(X_train, y_train)
predictedsr = pipelinesr.predict(X_test)
pipelinesr.score(X_test, y_test)

0.8822346689645267

In [21]:
# Random Forest with TfidfVectorizer and max_depth = 10
pipelinesr = Pipeline([
    ('vect', vect),
    
    ('cls', RandomForestClassifier(n_estimators=200, n_jobs = 2, max_depth = 10 ))
])
pipelinesr.fit(X_train, y_train)
predictedsr = pipelinesr.predict(X_test)
pipelinesr.score(X_test, y_test)

0.6882091052707167

### NAIVE BAYES, COUNTVECTORIZER AND TFIDFVECTORIZER

Naive Bayes is a classification algorithm based on Bayes' rule. As in all other classification models, the goal is to determine the probability of a class label given a certain combination of feature values.

I am going to use MultinomialNB and BernoulliNB. Both are suitable for discrete data. The difference is that MultinomialNB works with occurrence counts, while BernoulliNB is designed for binary/boolean features.

The best accuracy came from MultinomialNB with CountVectorizer.
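To show how the multinomial variant uses occurrence counts, here is a minimal pure-Python sketch with add-one (Laplace) smoothing. It is not scikit-learn's code. The four toy descriptions, their labels and the helper names are invented for the example.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed per-class word log-likelihoods."""
    vocab = sorted({word for doc in docs for word in doc.split()})
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())
    model = {}
    for label, n_docs in class_counts.items():
        total = sum(word_counts[label].values())
        log_prior = math.log(n_docs / len(labels))
        log_likelihood = {
            word: math.log((word_counts[label][word] + alpha)
                           / (total + alpha * len(vocab)))
            for word in vocab
        }
        model[label] = (log_prior, log_likelihood)
    return model

def predict(model, doc):
    """Pick the class with the highest log posterior; unknown words are ignored."""
    scores = {}
    for label, (log_prior, log_likelihood) in model.items():
        scores[label] = log_prior + sum(
            log_likelihood[word] for word in doc.split() if word in log_likelihood
        )
    return max(scores, key=scores.get)

# Hypothetical training data, for illustration only.
docs = ["cherry tannins oak", "cherry plum oak",
        "citrus mineral crisp", "lemon citrus crisp"]
labels = ["Italy", "Italy", "Germany", "Germany"]

model = train_multinomial_nb(docs, labels)
print(predict(model, "oak cherry"), predict(model, "crisp lemon"))
```

A Bernoulli variant would instead record only whether each word is present or absent in a description, ignoring how many times it occurs.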

In [44]:
cvect = CountVectorizer(lowercase=True, strip_accents='unicode', stop_words='english')

In [40]:
# MultinomialNB with CountVectorizer and TfidTransformer
pipelinesn = Pipeline([
    ('vect', cvect),
    ('tfidf', TfidfTransformer()),
    ('cls', MultinomialNB())
])
pipelinesn.fit(X_train, y_train)
predictedsn = pipelinesn.predict(X_test)
pipelinesn.score(X_test, y_test)

0.7834266839006175

In [45]:
# MultinomialNB with CountVectorizer
pipelinesn1 = Pipeline([
    ('vect', cvect),
    ('cls', MultinomialNB())
])
pipelinesn1.fit(X_train, y_train)
predictedsn1 = pipelinesn1.predict(X_test)
pipelinesn1.score(X_test, y_test)

0.8710326008904208

In [46]:
# BernoulliNB with CountVectorizer
pipelinesn2 = Pipeline([
    ('vect', cvect),
    ('cls', BernoulliNB())
])
pipelinesn2.fit(X_train, y_train)
predictedsn2 = pipelinesn2.predict(X_test)
pipelinesn2.score(X_test, y_test)

0.8593996840442338

In [47]:
vect1 = TfidfVectorizer(stop_words='english', sublinear_tf=True,
                             ngram_range = (1,1), strip_accents='unicode')

In [48]:
# MultinomialNB with TfidfVectorizer 
pipelinesn3 = Pipeline([
    ('vect',vect1) ,
    ('cls', MultinomialNB())
])
pipelinesn3.fit(X_train, y_train)
predictedsn3 = pipelinesn3.predict(X_test)
pipelinesn3.score(X_test, y_test)

0.7837139164153382

In [49]:
# BernoulliNB with TfidfVectorizer
pipelinesn4 = Pipeline([
    ('vect', vect1),
    ('cls', BernoulliNB())
])
pipelinesn4.fit(X_train, y_train)
predictedsn4 = pipelinesn4.predict(X_test)
pipelinesn4.score(X_test, y_test)

0.8593996840442338

The best accuracy comes from the Logistic Regression model with Lasso regularization. Random Forest gave higher accuracy than Naive Bayes (about 0.88 versus 0.87 at best), but the difference is not large.


## SMALL DATASET WITH COUNTRY REDUCTION AND STRATIFIED SPLIT

I am going to work only with the most represented countries, those that appear more than 10 times, because a model cannot be trained and tested on a single sample.


I am only going to run the models that worked best before.

As before, the best accuracy was in CountVectorizer and Logistic Regression with Ridge regularization. In this case the model trained above, without country reduction and without stratification, works a little better.
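The two steps, dropping rare countries and splitting so that each country keeps roughly the same share in train and test, can be sketched in pure Python. This is an illustration of the idea only and does not reproduce scikit-learn's exact splitting algorithm. The label counts and the helper name `stratified_split` are hypothetical.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.2, seed=1):
    """Return (train_idx, test_idx) keeping each class's share roughly equal."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = max(1, round(len(indices) * test_size))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Hypothetical label counts; "Kosovo" appears only 3 times and is filtered out.
labels = ["France"] * 50 + ["Italy"] * 30 + ["Chile"] * 20 + ["Kosovo"] * 3
counts = Counter(labels)
kept = [label for label in labels if counts[label] > 10]

train_idx, test_idx = stratified_split(kept, test_size=0.2)
test_counts = Counter(kept[i] for i in test_idx)
print(test_counts)
```

With a 20% test share, every remaining country contributes one fifth of its rows to the test set, so no class ends up missing from either side of the split.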

In [24]:
df1 = df[df['country'].map(df['country'].value_counts()) > 10]

In [25]:
print('The country reduction baseline is', df1.country.value_counts(normalize=True).max())

The country reduction baseline is 0.360002302357038


In [26]:
y1 = df1.country
X1 =  df1.description

In [27]:
# Split and stratify the data
X_train1, X_test1, y_train1, y_test1 = train_test_split(X1, y1, test_size=0.2, stratify = y1, random_state= 1)

##  Logistic Regression, CountVectorizer and TfidfVectorizer 

In [28]:
# COUNTVECTORIZER AND LOGISTIC REGRESSION including Lasso regularization
cvec1 = CountVectorizer(strip_accents='unicode', stop_words="english", ngram_range=(1, 1))
pipeline1 = Pipeline([
    ('vect', cvec1),
    ('cls', LogisticRegression(penalty = 'l1', solver='saga', multi_class='ovr', n_jobs=2, verbose = 1)
    )])
pipeline1.fit(X_train1, y_train1)
predicted1 = pipeline1.predict(X_test1)
pipeline1.score(X_test1, y_test1)

[Parallel(n_jobs=2)]: Using backend ThreadingBackend with 2 concurrent workers.


max_iter reached after 7 seconds






[Parallel(n_jobs=2)]: Done  26 out of  26 | elapsed:  1.2min finished


0.9138129496402878

In [29]:
print(classification_report(y_test1, predicted1))

              precision    recall  f1-score   support

   Argentina       0.66      0.45      0.53       130
   Australia       0.83      0.86      0.84       162
     Austria       0.85      0.87      0.86       183
      Brazil       0.00      0.00      0.00         2
    Bulgaria       0.64      1.00      0.78         7
      Canada       0.00      0.00      0.00        10
       Chile       0.63      0.54      0.58       158
     Croatia       0.00      0.00      0.00         3
     England       0.75      0.55      0.63        11
      France       0.88      0.96      0.92      1488
     Georgia       0.75      0.33      0.46         9
     Germany       0.94      0.83      0.88       123
      Greece       1.00      1.00      1.00        18
     Hungary       0.67      0.80      0.73         5
      Israel       0.84      0.97      0.90        32
       Italy       0.98      0.98      0.98       912
     Moldova       0.50      0.29      0.36         7
 New Zealand       0.77    



In [30]:
cross = cross_val_score(pipeline1, X_train1, y_train1, cv=5)
print(cross)
print(cross.mean())


[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  26 out of  26 | elapsed:   55.5s finished
[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  26 out of  26 | elapsed:   51.8s finished
[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  26 out of  26 | elapsed:   53.1s finished
[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  26 out of  26 | elapsed:   52.6s finished
[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  26 out of  26 | elapsed:   52.7s finished


[0.91132651 0.92006467 0.912527   0.91015484 0.91007389]
0.912829381493086


In [31]:
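# COUNTVECTORIZER AND LOGISTIC REGRESSION including Ridge regularization
# Note: this cell fits and scores on the original, unstratified split (X_train, X_test),
# so its score matches In [10]; use X_train1/X_test1 to evaluate on the reduced, stratified data.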
pipeline_ridge = Pipeline([
    ('vect', cvec),
    ('cls', LogisticRegression(penalty = 'l2', solver='lbfgs', verbose = 1, n_jobs = 2))
])
pipeline_ridge.fit(X_train, y_train)
predicted_ridge = pipeline_ridge.predict(X_test)
pipeline_ridge.score(X_test, y_test)

[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  39 out of  39 | elapsed:    6.5s finished


0.9133993968117191

In [32]:
cross = cross_val_score(pipeline_ridge, X_train, y_train, cv=5)
print(cross)
print(cross.mean())

[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  37 out of  37 | elapsed:    4.6s finished
[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  36 out of  39 | elapsed:    3.8s remaining:    0.3s
[Parallel(n_jobs=2)]: Done  39 out of  39 | elapsed:    4.0s finished
[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  36 out of  39 | elapsed:    3.9s remaining:    0.3s
[Parallel(n_jobs=2)]: Done  39 out of  39 | elapsed:    4.1s finished
[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  36 out of  39 | elapsed:    3.9s remaining:    0.3s
[Parallel(n_jobs=2)]: Done  39 out of  39 | elapsed:    4.1s finished
[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  36 out of  39 | elapsed:    4.0s remaining:    0.3s
[Parallel(n_jobs=2)]: Don

[0.91013247 0.91178053 0.9091725  0.90577234 0.91143114]
0.9096577975748428


In [33]:
# TFIDFVECTORIZER AND LOGISTIC REGRESSION including Lasso regularization
pipelines1 = Pipeline([
    ('vect', TfidfVectorizer(stop_words='english', sublinear_tf=True,
                             ngram_range = (1,1), strip_accents='unicode')),
    ('cls', LogisticRegression(solver='saga', multi_class='multinomial', penalty = 'l1', n_jobs=2, verbose = 1))
])
pipelines1.fit(X_train1, y_train1)
predicteds1 = pipelines1.predict(X_test1)
pipelines1.score(X_test1, y_test1)

[Parallel(n_jobs=2)]: Using backend ThreadingBackend with 2 concurrent workers.


max_iter reached after 21 seconds


[Parallel(n_jobs=2)]: Done   1 out of   1 | elapsed:   21.6s finished


0.9107913669064748

## TfidfVectorizer and Random Forest Classifier

In [34]:
# TFIDFVECTORIZER AND RANDOM FOREST CLASSIFIER
pipelinesr6 = Pipeline([
    ('vect', TfidfVectorizer(stop_words='english', sublinear_tf=True,
                             ngram_range = (1,1), strip_accents='unicode')),
    
    ('cls', RandomForestClassifier(n_estimators=200, n_jobs = 2, verbose = 1))
])
pipelinesr6.fit(X_train1, y_train1)
predictedsr6 = pipelinesr6.predict(X_test1)
pipelinesr6.score(X_test1, y_test1)

[Parallel(n_jobs=2)]: Using backend ThreadingBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  46 tasks      | elapsed:    6.8s
[Parallel(n_jobs=2)]: Done 196 tasks      | elapsed:   29.5s
[Parallel(n_jobs=2)]: Done 200 out of 200 | elapsed:   30.2s finished
[Parallel(n_jobs=2)]: Using backend ThreadingBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  46 tasks      | elapsed:    0.2s
[Parallel(n_jobs=2)]: Done 196 tasks      | elapsed:    0.7s
[Parallel(n_jobs=2)]: Done 200 out of 200 | elapsed:    0.7s finished
[Parallel(n_jobs=2)]: Using backend ThreadingBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  46 tasks      | elapsed:    0.1s
[Parallel(n_jobs=2)]: Done 196 tasks      | elapsed:    0.4s
[Parallel(n_jobs=2)]: Done 200 out of 200 | elapsed:    0.4s finished


0.8847535565454807

## Naive Bayes, CountVectorizer and TfidfVectorizer

In [31]:
# MultinomialNB with CountVectorizer
cvect1 = CountVectorizer(lowercase=True, strip_accents='unicode', stop_words='english')
pipelinesn2 = Pipeline([
    ('vect', cvect1),
    ('cls', MultinomialNB())
])
pipelinesn2.fit(X_train1, y_train1)
predictedsn2 = pipelinesn2.predict(X_test1)
pipelinesn2.score(X_test1, y_test1)

0.8716769650811899

In [32]:
vect1 = TfidfVectorizer(stop_words='english', sublinear_tf=True,
                             ngram_range = (1,1), strip_accents='unicode')
# BernoulliNB with  TfidfVectorizer
pipelinesn5 = Pipeline([
    ('vect', vect1),
    ('cls', BernoulliNB())
])
pipelinesn5.fit(X_train1, y_train1)
predictedsn5 = pipelinesn5.predict(X_test1)
pipelinesn5.score(X_test1, y_test1)

0.8577381807730996

As with the full small dataset, the best accuracy comes from the Logistic Regression model with Lasso regularization. Random Forest gave higher accuracy than Naive Bayes, but the difference is not large.

# BIG DATASET

In [4]:
vino = pd.read_csv('big_wineV1.csv')

In [26]:
vino.head()

Unnamed: 0.1,Unnamed: 0,country,description,designation,points,price,province,region_1,taster_name,title,variety,winery,vintage
0,1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,,Roger Voss,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos,2011.0
1,2,US,"Tart and snappy, the flavors of lime flesh and...",,87,14.0,Oregon,Willamette Valley,Paul Gregutt,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,Rainstorm,2013.0
2,3,US,"Pineapple rind, lemon pith and orange blossom ...",Reserve Late Harvest,87,13.0,Michigan,Lake Michigan Shore,Alexander Peartree,St. Julian 2013 Reserve Late Harvest Riesling ...,Riesling,St. Julian,2013.0
3,4,US,"Much like the regular bottling from 2012, this...",Vintner's Reserve Wild Child Block,87,65.0,Oregon,Willamette Valley,Paul Gregutt,Sweet Cheeks 2012 Vintner's Reserve Wild Child...,Pinot Noir,Sweet Cheeks,2012.0
4,5,Spain,Blackberry and raspberry aromas show a typical...,Ars In Vitro,87,15.0,Northern Spain,Navarra,Michael Schachner,Tandem 2011 Ars In Vitro Tempranillo-Merlot (N...,Tempranillo-Merlot,Tandem,2011.0


In [27]:
print("There are {} types of grapes(varieties) in this dataset such as {}... \n".
      format(len(vino.variety.unique()), ", ".join(vino.variety.unique()[0:5])))
print("There are {} provinces producing wine in this dataset such as {}... \n".
      format(len(vino.province.unique()), ", ".join(vino.province.unique()[0:5])))
print("There are {} countries producing wine in this dataset such as {}... \n".
      format(len(vino.country.unique()), ", ".join(vino.country.unique()[0:5]))) 
print("There are {} winery producing wine in this dataset such as {}... \n".
      format(len(vino.winery.unique()), ", ".join(vino.winery.unique()[0:5])))

There are 743 types of grapes(varieties) in this dataset such as Portuguese Red, Pinot Gris, Riesling, Pinot Noir, Tempranillo-Merlot... 

There are 450 provinces producing wine in this dataset such as Douro, Oregon, Michigan, Northern Spain, Sicily & Sardinia... 

There are 43 countries producing wine in this dataset such as Portugal, US, Spain, Italy, France... 

There are 17390 winery producing wine in this dataset such as Quinta dos Avidagos, Rainstorm, St. Julian, Sweet Cheeks, Tandem... 



In [28]:
print('The baselines of my features are:')
print('The country baseline is', vino.country.value_counts(normalize=True).max())
print('The variety baseline is', vino.variety.value_counts(normalize=True).max())
print('The province baseline is', vino.province.value_counts(normalize=True).max())

The baselines of my features are:
The country baseline is 0.4436020682021501
The variety baseline is 0.10773989583096413
The province baseline is 0.2948097830207275


### DEFINE THE VARIABLE COUNTRY AS A TARGET

In [5]:
# The target is the variable 'country'. This dataset has 43 different countries.
yb = vino.country
Xb =  vino.description

In [6]:
#Split the data
X_trainb, X_testb, y_trainb, y_testb = train_test_split(Xb, yb, test_size=0.2)

### COUNTVECTORIZER  AND LOGISTIC REGRESSION 

The scores differ only slightly. The best-performing model was CountVectorizer with Logistic Regression, Ridge regularization and one-vs-rest (ovr) multiclass. I also ran 5-fold cross-validation on this model, and the mean score (0.853) is close to the test accuracy (0.852).

All of the models score better than the baseline.
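
The difference between the `ovr` and `multinomial` settings can be sketched conceptually. With one-vs-rest, each country gets its own binary classifier and the highest score wins, which is consistent with the `Done 43 out of 43` log lines (one task per country). With multinomial, a single softmax is taken over all class scores. The scores below are hypothetical, and this is an illustration of the idea, not scikit-learn's internal code:

```python
import math

def ovr_predict(scores):
    """One-vs-rest: an independent sigmoid per class, pick the largest."""
    probs = {c: 1 / (1 + math.exp(-s)) for c, s in scores.items()}
    return max(probs, key=probs.get), probs

def multinomial_predict(scores):
    """Multinomial: one softmax over all classes, probabilities sum to 1."""
    top = max(scores.values())
    exps = {c: math.exp(s - top) for c, s in scores.items()}
    total = sum(exps.values())
    probs = {c: e / total for c, e in exps.items()}
    return max(probs, key=probs.get), probs

# Hypothetical linear scores for one wine description
toy_scores = {"France": 1.2, "Italy": 0.3, "Spain": -0.8}
ovr_label, ovr_probs = ovr_predict(toy_scores)
multi_label, multi_probs = multinomial_predict(toy_scores)
print(ovr_label, multi_label)
```

Both approaches pick the same class here; they differ in how the per-class models are trained and in whether the raw probabilities are coupled.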


In [27]:
cvec = CountVectorizer(strip_accents='unicode', stop_words="english", ngram_range=(1, 1))

In [28]:
# CountVectorizer and Logistic Regression with Ridge regularization
pipeline_ridge = Pipeline([
    ('vect', cvec),
    ('cls', LogisticRegression(penalty = 'l2', solver='lbfgs', multi_class='ovr', verbose = 1, n_jobs = 2))
])
pipeline_ridge.fit(X_trainb, y_trainb)
predicted_ridge = pipeline_ridge.predict(X_testb)
pipeline_ridge.score(X_testb, y_testb)

[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  43 out of  43 | elapsed:   27.6s finished


0.8517872711421098

In [29]:
print(classification_report(y_testb, predicted_ridge))


                        precision    recall  f1-score   support

             Argentina       0.58      0.47      0.52       796
             Australia       0.74      0.57      0.64       490
               Austria       0.76      0.66      0.70       572
Bosnia and Herzegovina       0.00      0.00      0.00         1
                Brazil       0.00      0.00      0.00        11
              Bulgaria       0.79      0.54      0.64        28
                Canada       0.00      0.00      0.00        48
                 Chile       0.61      0.54      0.57       964
                 China       0.00      0.00      0.00         3
               Croatia       0.60      0.21      0.32        14
                Cyprus       0.00      0.00      0.00         2
        Czech Republic       0.00      0.00      0.00         5
               England       1.00      0.11      0.19        19
                France       0.80      0.87      0.83      3940
               Georgia       0.60      
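
Several rare countries in the report show 0.00 precision and recall. A minimal sketch of how the per-class numbers are computed, assuming the common convention of reporting 0.0 when a class is never predicted (the labels below are hypothetical):

```python
def per_class_report(y_true, y_pred):
    """Return {label: (precision, recall, support)} for every label seen."""
    report = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        predicted = sum(p == label for p in y_pred)
        actual = sum(t == label for t in y_true)
        # A class that is never predicted has no defined precision; report 0.0
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        report[label] = (precision, recall, actual)
    return report

y_true = ["France", "France", "France", "Chile", "Chile", "Cyprus"]
y_pred = ["France", "France", "Chile", "Chile", "France", "France"]
report = per_class_report(y_true, y_pred)
for label, row in report.items():
    print(label, row)
```

Because accuracy is dominated by the large classes, a model can score well overall while never predicting the smallest countries.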

In [30]:
cross = cross_val_score(pipeline_ridge, X_trainb, y_trainb, cv=5)
print(cross)
print(cross.mean())

[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  41 out of  41 | elapsed:   21.7s finished
[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  43 out of  43 | elapsed:   22.1s finished
[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  43 out of  43 | elapsed:   20.3s finished
[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  43 out of  43 | elapsed:   20.1s finished
[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  43 out of  43 | elapsed:   22.9s finished


[0.85395067 0.85283251 0.85332449 0.85063519 0.85509033]
0.8531666384300459
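
To check that the cross-validation result agrees with the single test split, the fold scores printed above can be summarised with their mean and spread. This sketch only re-uses the numbers already reported in this notebook:

```python
from statistics import mean, stdev

# Fold scores and test accuracy copied from the outputs above
fold_scores = [0.85395067, 0.85283251, 0.85332449, 0.85063519, 0.85509033]
holdout_accuracy = 0.8517872711421098

cv_mean = mean(fold_scores)
cv_std = stdev(fold_scores)
gap = abs(cv_mean - holdout_accuracy)
print(f"CV mean {cv_mean:.4f}, std {cv_std:.4f}, gap to test split {gap:.4f}")
```

The gap to the test accuracy is of the same order as the fold-to-fold spread, which supports reading the two scores as equivalent.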


In [32]:
# CountVectorizer and Logistic Regression with Lasso regularization
cvec = CountVectorizer(strip_accents='unicode', stop_words="english", ngram_range=(1, 1))
pipeline = Pipeline([
    ('vect', cvec),
    ('cls', LogisticRegression(penalty = 'l1', solver='saga', multi_class='ovr'))
])
pipeline.fit(X_trainb, y_trainb)
predicted = pipeline.predict(X_testb)
pipeline.score(X_testb, y_testb)



0.849740343428983
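
The practical difference between the Lasso (L1) and Ridge (L2) penalties is how they shrink word coefficients: L1 pushes small coefficients to exactly zero, while L2 scales all of them down without zeroing any. A conceptual sketch with hypothetical coefficients, using the standard proximal steps for each penalty (this is not what the `saga` or `lbfgs` solvers do line by line):

```python
def l1_shrink(w, t):
    """Soft-thresholding, the proximal step for an L1 penalty t*|w|."""
    if w > t:
        return w - t
    if w < -t:
        return w + t
    return 0.0

def l2_shrink(w, t):
    """Proximal step for an L2 penalty 0.5*t*w**2: uniform scaling."""
    return w / (1 + t)

# Hypothetical coefficients for five words
weights = [0.9, -0.05, 0.02, -0.6, 0.08]
l1_weights = [l1_shrink(w, 0.1) for w in weights]
l2_weights = [l2_shrink(w, 0.1) for w in weights]
print(l1_weights)
print(l2_weights)
```

With thousands of vocabulary terms, the L1 model therefore ends up using only a subset of the words, which can help interpretation even when accuracy is similar.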

In [33]:
print(classification_report(y_testb, predicted))


                        precision    recall  f1-score   support

             Argentina       0.61      0.45      0.52       796
             Australia       0.74      0.55      0.63       490
               Austria       0.75      0.67      0.71       572
Bosnia and Herzegovina       0.00      0.00      0.00         1
                Brazil       0.00      0.00      0.00        11
              Bulgaria       0.68      0.46      0.55        28
                Canada       0.00      0.00      0.00        48
                 Chile       0.61      0.55      0.58       964
                 China       0.00      0.00      0.00         3
               Croatia       0.71      0.36      0.48        14
                Cyprus       0.00      0.00      0.00         2
        Czech Republic       0.00      0.00      0.00         5
               England       0.33      0.05      0.09        19
                France       0.79      0.87      0.83      3940
               Georgia       0.75      

In [34]:
cross = cross_val_score(pipeline, X_trainb, y_trainb, cv=5)
print(cross)
print(cross.mean())



[0.85238839 0.85098522 0.85128667 0.85006636 0.85362037]
0.8516694032909164


In [9]:
# CountVectorizer and Logistic Regression with Lasso regularization and multinomial multiclass
cvecb = CountVectorizer(strip_accents='unicode', stop_words="english", ngram_range=(1, 1))
pipelineb = Pipeline([ 
    ('vect', cvecb),
    ('cls', LogisticRegression(penalty = 'l1', solver='saga', multi_class='multinomial',verbose = 1, n_jobs = 2))
])
pipelineb.fit(X_trainb, y_trainb)
predictedb = pipelineb.predict(X_testb)
pipelineb.score(X_testb, y_testb)

[Parallel(n_jobs=2)]: Using backend ThreadingBackend with 2 concurrent workers.


max_iter reached after 1346 seconds


[Parallel(n_jobs=2)]: Done   1 out of   1 | elapsed: 22.4min finished


0.8528865471361965

In [11]:
print(classification_report(y_testb, predictedb))

                        precision    recall  f1-score   support

             Argentina       0.58      0.50      0.54       795
               Armenia       0.00      0.00      0.00         1
             Australia       0.71      0.59      0.64       513
               Austria       0.75      0.67      0.71       616
Bosnia and Herzegovina       0.00      0.00      0.00         1
                Brazil       1.00      0.07      0.13        14
              Bulgaria       0.74      0.57      0.64        30
                Canada       0.57      0.08      0.14        51
                 Chile       0.61      0.55      0.58       920
               Croatia       0.60      0.18      0.27        17
                Cyprus       0.00      0.00      0.00         4
               England       0.60      0.43      0.50        14
                France       0.81      0.85      0.83      4125
               Georgia       0.38      0.21      0.27        14
               Germany       0.79      

### TFIDFVECTORIZER  AND LOGISTIC REGRESSION 

I defined the same models that worked best in the small dataset. The first uses max_features = 1000; the second drops max_features and adds Lasso regularization.
The best accuracy is with the Lasso regularization (0.851 versus 0.812).
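
For reference, here is a minimal sketch of the weighting that `TfidfVectorizer(sublinear_tf=True)` applies, assuming scikit-learn's default smoothed idf and L2 row normalisation. Tokenisation here is a plain whitespace split with no stop-word removal, and the documents are hypothetical:

```python
import math
from collections import Counter

def tfidf_sublinear(docs):
    """TF-IDF with tf -> 1 + ln(tf), idf = ln((1+n)/(1+df)) + 1, L2-normalised rows."""
    tokenised = [d.lower().split() for d in docs]
    n = len(tokenised)
    df = Counter(term for doc in tokenised for term in set(doc))
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
    rows = []
    for doc in tokenised:
        tf = Counter(doc)
        weights = {t: (1 + math.log(c)) * idf[t] for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()})
    return rows

docs = ["cherry cherry cherry oak", "cherry plum", "citrus mineral"]
rows = tfidf_sublinear(docs)
print(rows[0])
```

The sublinear term means a word repeated three times counts for about twice a single occurrence rather than three times. `max_features=1000` additionally keeps only the 1000 most frequent terms in the corpus, which may explain part of the lower score of the first pipeline.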

In [31]:
# TfidfVectorizer and Logistic Regression with max_features = 1000
pipelinev = Pipeline([
    ('vect', TfidfVectorizer(stop_words='english', sublinear_tf=True, max_features=1000,
                             ngram_range = (1,1), strip_accents='unicode')),
    ('cls', LogisticRegression(solver='saga', multi_class='multinomial'))
])
pipelinev.fit(X_trainb, y_trainb)
predictedv = pipelinev.predict(X_testb)
pipelinev.score(X_testb, y_testb)

0.8124028656987984

In [37]:
# TfidfVectorizer and Logistic Regression without max_features and with Lasso regularization
pipelinev = Pipeline([
    ('vect', TfidfVectorizer(stop_words='english', sublinear_tf=True,
                             ngram_range = (1,1), strip_accents='unicode')),
    ('cls', LogisticRegression(solver='saga', multi_class='multinomial', penalty = 'l1' ))
])
pipelinev.fit(X_trainb, y_trainb)
predictedv = pipelinev.predict(X_testb)
pipelinev.score(X_testb, y_testb)



0.8509533376293544

### RANDOM FOREST CLASSIFIER WITH COUNTVECTORIZER AND TFIDFVECTORIZER

Logistic Regression performs better than Random Forest Classifier on this dataset, with higher accuracy.

Among the Random Forest models, the best was Random Forest with TfidfVectorizer, although the three scores are nearly identical (0.748 to 0.751).

All of the models score better than the baseline.


In [14]:
# Random Forest with CountVectorizer (max_features = 1000)
cvecbr = CountVectorizer(strip_accents='unicode', stop_words="english", ngram_range=(1, 1),  max_features=1000)
pipelinebr = Pipeline([
    ('vect', cvecbr),
    ('cls', RandomForestClassifier(n_estimators=200, n_jobs = 2))
])
pipelinebr.fit(X_trainb, y_trainb)
predictedbr = pipelinebr.predict(X_testb)
pipelinebr.score(X_testb, y_testb)

0.7478109245290171

In [41]:
# Random Forest with CountVectorizer (no max_features)
cvecbr = CountVectorizer(strip_accents='unicode', stop_words="english", ngram_range=(1, 1))
pipelinebr = Pipeline([
    ('vect', cvecbr),
    ('cls', RandomForestClassifier(n_estimators=200, n_jobs = 2))
])
pipelinebr.fit(X_trainb, y_trainb)
predictedbr = pipelinebr.predict(X_testb)
pipelinebr.score(X_testb, y_testb)

0.7511087525112771

In [15]:
# Random Forest with TfidfVectorizer
pipelinebr = Pipeline([
    ('vect', TfidfVectorizer(stop_words='english', sublinear_tf=True,
                             ngram_range = (1,1), strip_accents='unicode')),
    
    ('cls', RandomForestClassifier(n_estimators=200, n_jobs = 2))
])
pipelinebr.fit(X_trainb, y_trainb)
predictedbr = pipelinebr.predict(X_testb)
pipelinebr.score(X_testb, y_testb)

0.7511466585800387

### NAIVE BAYES, COUNTVECTORIZER AND TFIDFVECTORIZER

As in the small dataset, the best accuracy came from MultinomialNB with CountVectorizer. MultinomialNB is one of the two classic Naive Bayes variants used in text classification.

Logistic Regression models perform better than Naive Bayes on this dataset, with higher accuracy.

All of the models score better than the baseline.


In [52]:
vect = CountVectorizer(ngram_range=(1, 1), strip_accents='unicode', stop_words='english')

In [30]:
# MultinomialNB with CountVectorizer followed by TfidfTransformer
pipelinebn = Pipeline([
    ('vect', vect),
    ('tfidf', TfidfTransformer()),
    ('cls', MultinomialNB())
])
pipelinebn.fit(X_trainb, y_trainb)
predictedbn = pipelinebn.predict(X_testb)
pipelinebn.score(X_testb, y_testb)

0.7296160115234449

In [53]:
# MultinomialNB with CountVectorizer
pipelinebn1 = Pipeline([
    ('vect', vect),
    ('cls', MultinomialNB())
])
pipelinebn1.fit(X_trainb, y_trainb)
predictedbn1 = pipelinebn1.predict(X_testb)
pipelinebn1.score(X_testb, y_testb)

0.8195292066259808

In [55]:
cross = cross_val_score(pipelinebn1, X_trainb, y_trainb, cv=5)
print(cross)
print(cross.mean())



[0.81735549 0.81896552 0.81977157 0.81816458 0.81834132]
0.8185196955332505


In [36]:
# BernoulliNB with CountVectorizer
pipelinebn2 = Pipeline([
    ('vect', vect),
    ('cls', BernoulliNB())
])
pipelinebn2.fit(X_trainb, y_trainb)
predictedbn2 = pipelinebn2.predict(X_testb)
pipelinebn2.score(X_testb, y_testb)

0.8143739812744021

In [31]:
vect1 = TfidfVectorizer(stop_words='english', sublinear_tf=True,
                             ngram_range = (1,1), strip_accents='unicode')

In [37]:
# MultinomialNB with TfidfVectorizer
pipelinebn3 = Pipeline([
    ('vect',vect1) ,
    ('cls', MultinomialNB())
])
pipelinebn3.fit(X_trainb, y_trainb)
predictedbn3 = pipelinebn3.predict(X_testb)
pipelinebn3.score(X_testb, y_testb)

0.7299571661422993

In [38]:
# BernoulliNB with TfidfVectorizer
pipelinebn4 = Pipeline([
    ('vect', vect1),
    ('cls', BernoulliNB())
])
pipelinebn4.fit(X_trainb, y_trainb)
predictedbn4 = pipelinebn4.predict(X_testb)
pipelinebn4.score(X_testb, y_testb)

0.8143739812744021
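
The two BernoulliNB runs return exactly the same score (0.8143739812744021). A likely explanation is that `BernoulliNB` binarizes its input (its `binarize` parameter defaults to 0.0), so only the presence or absence of each word matters and the count or TF-IDF weight is discarded. A small sketch with hypothetical feature rows:

```python
def binarize(matrix, threshold=0.0):
    """Map every value above the threshold to 1 and everything else to 0."""
    return [[1 if v > threshold else 0 for v in row] for row in matrix]

# Hypothetical rows for the same two documents over the same three-word vocabulary
count_features = [[2, 0, 1], [0, 3, 0]]
tfidf_features = [[0.81, 0.0, 0.59], [0.0, 1.0, 0.0]]

same_after_binarizing = binarize(count_features) == binarize(tfidf_features)
print(same_after_binarizing)
```

Since both vectorizers here share the same tokenisation and stop-word settings, the binarized matrices match and the model sees identical inputs.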

The best-performing model was CountVectorizer with Logistic Regression and Ridge regularization.

# BIG DATA WITH COUNTRY REDUCTIONS

This dataset contains only the wines from countries that appear more than 10 times.

I am going to train and test the models that performed best above.
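
The filtering step counts how often each country occurs and keeps only the rows whose country passes the threshold. A plain-Python sketch of that idea with hypothetical rows (the notebook itself does this with pandas `value_counts` and `map`):

```python
from collections import Counter

def drop_rare_classes(rows, min_count):
    """Keep (label, text) rows whose label appears more than min_count times."""
    counts = Counter(label for label, _ in rows)
    return [(label, text) for label, text in rows if counts[label] > min_count]

# Hypothetical toy rows: (country, description)
toy_rows = [("France", "a")] * 3 + [("Cyprus", "b")] + [("Italy", "c")] * 2
kept = drop_rare_classes(toy_rows, min_count=1)
kept_labels = {label for label, _ in kept}
print(len(kept), sorted(kept_labels))
```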

In [35]:
# Filter out rare countries, define the target and predictor, then make a stratified train/test split.
vino1 = vino[vino['country'].map(vino['country'].value_counts()) > 10]
y1 = vino1.country
X1 =  vino1.description
X_train1, X_test1, y_train1, y_test1 = train_test_split(X1, y1, test_size=0.2, stratify = y1, random_state= 1)
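
Stratifying keeps each country's share roughly equal in the training and test sets, which matters when some classes are small. It also needs at least two samples per class, which the filter above guarantees. A simplified plain-Python sketch of the idea on hypothetical labels (scikit-learn's implementation differs in detail):

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.2, seed=1):
    """Return (train_idx, test_idx) with each class split in about the same ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    train_idx, test_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        n_test = max(1, round(len(idxs) * test_size))
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Hypothetical imbalanced labels
toy_labels = ["US"] * 50 + ["France"] * 30 + ["Cyprus"] * 10
train_idx, test_idx = stratified_split(toy_labels)
test_counts = Counter(toy_labels[i] for i in test_idx)
print(test_counts)
```

Fixing the seed (like `random_state=1` in the cell above) makes the split reproducible, so scores from different models are compared on the same test rows.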

In [36]:
print('The variety baseline is', vino1.variety.value_counts(normalize=True).max())

The variety baseline is 0.10777643784083066


In [37]:
len(vino1.country.unique())

34

###  LOGISTIC REGRESSION, COUNTVECTORIZER AND TFIDFVECTORIZER  

The scores differ only slightly. The best test accuracy came from CountVectorizer with Logistic Regression, Ridge regularization and one-vs-rest (ovr) multiclass (0.854). Its 5-fold cross-validation mean (0.842) is about one point lower than the test accuracy, while for the Lasso model the two are close (0.850 versus 0.853).

All of the models score better than the baseline.


In [38]:
#CountVectorizer and Logistic Regression with Lasso regularization
cvec = CountVectorizer(strip_accents='unicode', stop_words="english", ngram_range=(1, 1))
pipelinesb1 = Pipeline([
    ('vect', cvec),
    ('cls', LogisticRegression(penalty = 'l1', solver='saga', multi_class='ovr', verbose = 1, n_jobs = 2))
])
pipelinesb1.fit(X_train1, y_train1)
predictedsb1 = pipelinesb1.predict(X_test1)
pipelinesb1.score(X_test1, y_test1)

[Parallel(n_jobs=2)]: Using backend ThreadingBackend with 2 concurrent workers.


max_iter reached after 112 seconds
max_iter reached after 112 seconds




max_iter reached after 22 seconds
max_iter reached after 27 seconds
max_iter reached after 44 seconds
max_iter reached after 99 seconds
max_iter reached after 26 seconds
max_iter reached after 17 seconds
max_iter reached after 23 seconds
max_iter reached after 119 seconds
max_iter reached after 26 seconds
max_iter reached after 79 seconds
max_iter reached after 38 seconds
max_iter reached after 213 seconds
max_iter reached after 29 seconds
max_iter reached after 50 seconds
max_iter reached after 22 seconds
max_iter reached after 16 seconds
max_iter reached after 24 seconds
max_iter reached after 127 seconds
max_iter reached after 24 seconds
max_iter reached after 20 seconds
max_iter reached after 17 seconds
max_iter reached after 84 seconds
max_iter reached after 25 seconds
max_iter reached after 97 seconds
max_iter reached after 25 seconds
max_iter reached after 70 seconds
max_iter reached after 17 seconds
max_iter reached after 26 seconds
max_iter reached after 132 seconds
max_iter r

[Parallel(n_jobs=2)]: Done  34 out of  34 | elapsed: 19.3min finished


0.8528631020098597

In [39]:
print(classification_report(y_test1, predictedsb1))


                precision    recall  f1-score   support

     Argentina       0.62      0.49      0.55       789
     Australia       0.75      0.61      0.67       509
       Austria       0.82      0.64      0.72       608
        Brazil       0.00      0.00      0.00         9
      Bulgaria       0.54      0.23      0.32        31
        Canada       0.33      0.02      0.04        52
         Chile       0.61      0.55      0.58       923
       Croatia       0.86      0.38      0.52        16
Czech Republic       0.00      0.00      0.00         3
       England       0.33      0.12      0.18        16
        France       0.80      0.88      0.84      4098
       Georgia       0.00      0.00      0.00        16
       Germany       0.78      0.67      0.72       484
        Greece       0.71      0.59      0.64        93
       Hungary       0.92      0.38      0.54        29
        Israel       0.63      0.38      0.48       112
         Italy       0.95      0.96      0.95  

In [40]:
#Cross Validation Score
cross = cross_val_score(pipelinesb1, X_train1, y_train1, cv=5)
print(cross)
print(cross.mean())

[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  34 out of  34 | elapsed: 13.9min finished
[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  34 out of  34 | elapsed: 13.9min finished
[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  34 out of  34 | elapsed: 13.9min finished
[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  34 out of  34 | elapsed: 13.9min finished
[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  34 out of  34 | elapsed: 13.6min finished


[0.85380643 0.84861842 0.85116367 0.84776218 0.84929557]
0.8501292565395634


In [42]:
#CountVectorizer and Logistic Regression with Ridge regularization
pipelinesb2 = Pipeline([
    ('vect', cvec),
    ('cls', LogisticRegression(penalty = 'l2', solver='lbfgs', verbose = 1, n_jobs = 2))
])
pipelinesb2.fit(X_train1, y_train1)
predictedsb2 = pipelinesb2.predict(X_test1)
pipelinesb2.score(X_test1, y_test1)

[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  34 out of  34 | elapsed:   20.0s finished


0.8540766021994691

In [56]:
#Cross Validation Score
cross = cross_val_score(pipelinesb2, X_train1, y_train1, cv=5)
print(cross)
print(cross.mean())



[0.84333697 0.83994502 0.84471726 0.83846956 0.84218016]
0.841729793779019


In [43]:
print(classification_report(y_test1, predictedsb2))


                precision    recall  f1-score   support

     Argentina       0.61      0.50      0.55       789
     Australia       0.74      0.62      0.68       509
       Austria       0.81      0.62      0.71       608
        Brazil       0.00      0.00      0.00         9
      Bulgaria       0.61      0.35      0.45        31
        Canada       0.00      0.00      0.00        52
         Chile       0.60      0.55      0.58       923
       Croatia       0.71      0.31      0.43        16
Czech Republic       0.00      0.00      0.00         3
       England       0.67      0.25      0.36        16
        France       0.80      0.88      0.84      4098
       Georgia       0.00      0.00      0.00        16
       Germany       0.79      0.68      0.73       484
        Greece       0.75      0.58      0.65        93
       Hungary       0.91      0.34      0.50        29
        Israel       0.62      0.42      0.50       112
         Italy       0.96      0.96      0.96  

In [44]:
# TfidfVectorizer and Logistic Regression with Lasso regularization and multinomial multiclass
pipelinesb2 = Pipeline([
    ('vect', TfidfVectorizer(stop_words='english', sublinear_tf=True,
                             ngram_range = (1,1), strip_accents='unicode')),
    ('cls', LogisticRegression(solver='saga', multi_class='multinomial', penalty = 'l1' ))
])
pipelinesb2.fit(X_train1, y_train1)
predictedsb2 = pipelinesb2.predict(X_test1)
pipelinesb2.score(X_test1, y_test1)



0.8496397421312097

In [45]:
print(classification_report(y_test1, predictedsb2))


                precision    recall  f1-score   support

     Argentina       0.64      0.46      0.54       789
     Australia       0.74      0.58      0.65       509
       Austria       0.80      0.63      0.70       608
        Brazil       0.00      0.00      0.00         9
      Bulgaria       0.64      0.29      0.40        31
        Canada       1.00      0.02      0.04        52
         Chile       0.64      0.58      0.60       923
       Croatia       1.00      0.38      0.55        16
Czech Republic       0.00      0.00      0.00         3
       England       0.50      0.12      0.20        16
        France       0.80      0.87      0.83      4098
       Georgia       0.00      0.00      0.00        16
       Germany       0.79      0.62      0.69       484
        Greece       0.74      0.55      0.63        93
       Hungary       1.00      0.41      0.59        29
        Israel       0.66      0.39      0.49       112
         Italy       0.95      0.95      0.95  

### RANDOM FOREST CLASSIFIER WITH TFIDFVECTORIZER

In [46]:
# Random Forest with TfidfVectorizer
pipelinesb3 = Pipeline([
    ('vect', TfidfVectorizer(stop_words='english', sublinear_tf=True,
                             ngram_range = (1,1), strip_accents='unicode')),
    
    ('cls', RandomForestClassifier(n_estimators=200, n_jobs = 2))
])
pipelinesb3.fit(X_train1, y_train1)
predictedsb3 = pipelinesb3.predict(X_test1)
pipelinesb3.score(X_test1, y_test1)

0.7532802427000379

### NAIVE BAYES, COUNTVECTORIZER AND TFIDFVECTORIZER

In [48]:
# MultinomialNB with TfidfVectorizer
pipelinebn4 = Pipeline([
    ('vect', TfidfVectorizer(stop_words='english', sublinear_tf=True,
                             ngram_range = (1,1), strip_accents='unicode')),
    ('cls', MultinomialNB())
])
pipelinebn4.fit(X_train1, y_train1)
predictedbn4 = pipelinebn4.predict(X_test1)
pipelinebn4.score(X_test1, y_test1)

0.7284034888130452

As with the full set of countries, the best accuracy comes from the Logistic Regression model with Ridge regularization.

Random Forest gave higher accuracy than Naive Bayes, but the difference is not large (0.753 versus 0.728).

The best score is obtained when the number of target countries is reduced, the same as in the small dataset.

## CONCLUSIONS

- Logistic Regression with CountVectorizer is the best model when the target is the country. In the small dataset it works better with Ridge regularization across all the countries. However, Lasso regularization performs better in the big dataset.

- All the models perform better than the baseline.

- The best accuracy is on the small dataset with a reduced number of countries.
