#### This dataset collects the results of a chemical analysis of wines produced in a single region of Italy from three different cultivars. Although the original dataset had 30 attributes, the UCI version (https://archive.ics.uci.edu/ml/datasets/wine) was reduced to 13. The task is to predict, from these chemical measurements, which cultivar a wine comes from.

In [31]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import GridSearchCV, cross_val_score
%matplotlib inline

#### We load the dataset and look at its first 15 records.

In [32]:
wineDS = pd.read_csv("wine_con_nombres.csv")
wineDS.head(15)

,Class,Alcohol,Malic acid,Ash,Alcalinity of ash,Magnesium,Total phenols,Flavanoids,Nonflavanoid phenols,Proanthocyanins,Color intensity,Hue,OD280OD315,Proline
0,1,14.23,1.71,2.43,15.6,127,2.8,3.06,0.28,2.29,5.64,1.04,3.92,1065
1,1,13.2,1.78,2.14,11.2,100,2.65,2.76,0.26,1.28,4.38,1.05,3.4,1050
2,1,13.16,2.36,2.67,18.6,101,2.8,3.24,0.3,2.81,5.68,1.03,3.17,1185
3,1,14.37,1.95,2.5,16.8,113,3.85,3.49,0.24,2.18,7.8,0.86,3.45,1480
4,1,13.24,2.59,2.87,21.0,118,2.8,2.69,0.39,1.82,4.32,1.04,2.93,735
5,1,14.2,1.76,2.45,15.2,112,3.27,3.39,0.34,1.97,6.75,1.05,2.85,1450
6,1,14.39,1.87,2.45,14.6,96,2.5,2.52,0.3,1.98,5.25,1.02,3.58,1290
7,1,14.06,2.15,2.61,17.6,121,2.6,2.51,0.31,1.25,5.05,1.06,3.58,1295
8,1,14.83,1.64,2.17,14.0,97,2.8,2.98,0.29,1.98,5.2,1.08,2.85,1045
9,1,13.86,1.35,2.27,16.0,98,2.98,3.15,0.22,1.85,7.22,1.01,3.55,1045
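For readers who do not have `wine_con_nombres.csv`, the following is a minimal sketch that loads what should be the same 178-row UCI data from scikit-learn's bundled copy. It is an assumption that the two files match; the name `wine_alt` is hypothetical, and scikit-learn uses lowercase column names (`alcohol`, `malic_acid`, ...) and labels the classes 0 to 2 instead of 1 to 3.

```python
import pandas as pd
from sklearn.datasets import load_wine

# Assumption: scikit-learn's bundled wine data stands in for the CSV used above.
bunch = load_wine(as_frame=True)

# bunch.frame holds the 13 features plus a "target" column.
wine_alt = bunch.frame.rename(columns={"target": "Class"})

# scikit-learn labels the classes 0, 1, 2; shift them to 1, 2, 3 as in the CSV.
wine_alt["Class"] = wine_alt["Class"] + 1

print(wine_alt.shape)
print(wine_alt["Class"].value_counts().sort_index())
```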


#### `describe()` gives summary statistics for each attribute (mean, min, max, etc.). Every column has a count of 178, so no attribute has missing values.

In [33]:
wineDS.describe()

,Class,Alcohol,Malic acid,Ash,Alcalinity of ash,Magnesium,Total phenols,Flavanoids,Nonflavanoid phenols,Proanthocyanins,Color intensity,Hue,OD280OD315,Proline
count,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0
mean,1.938202,13.000618,2.336348,2.366517,19.494944,99.741573,2.295112,2.02927,0.361854,1.590899,5.05809,0.957449,2.611685,746.893258
std,0.775035,0.811827,1.117146,0.274344,3.339564,14.282484,0.625851,0.998859,0.124453,0.572359,2.318286,0.228572,0.70999,314.907474
min,1.0,11.03,0.74,1.36,10.6,70.0,0.98,0.34,0.13,0.41,1.28,0.48,1.27,278.0
25%,1.0,12.3625,1.6025,2.21,17.2,88.0,1.7425,1.205,0.27,1.25,3.22,0.7825,1.9375,500.5
50%,2.0,13.05,1.865,2.36,19.5,98.0,2.355,2.135,0.34,1.555,4.69,0.965,2.78,673.5
75%,3.0,13.6775,3.0825,2.5575,21.5,107.0,2.8,2.875,0.4375,1.95,6.2,1.12,3.17,985.0
max,3.0,14.83,5.8,3.23,30.0,162.0,3.88,5.08,0.66,3.58,13.0,1.71,4.0,1680.0


#### `info()` confirms that no attribute has missing values and that all of them are numeric.

In [34]:
wineDS.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 178 entries, 0 to 177
Data columns (total 14 columns):
Class                   178 non-null int64
Alcohol                 178 non-null float64
Malic acid              178 non-null float64
Ash                     178 non-null float64
Alcalinity of ash       178 non-null float64
Magnesium               178 non-null int64
Total phenols           178 non-null float64
Flavanoids              178 non-null float64
Nonflavanoid phenols    178 non-null float64
Proanthocyanins         178 non-null float64
Color intensity         178 non-null float64
Hue                     178 non-null float64
OD280OD315              178 non-null float64
Proline                 178 non-null int64
dtypes: float64(11), int64(3)
memory usage: 19.5 KB
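The same two checks can be made explicitly instead of being read off the `info()` output. This is a minimal sketch on a toy frame built from the first three rows shown earlier (only three columns are included); on the real data the same calls would be applied to `wineDS`. The names `sample`, `missing_per_column` and `all_numeric` are hypothetical.

```python
import pandas as pd

# Toy frame: first three rows of the dataset, a subset of the columns.
sample = pd.DataFrame({
    "Class": [1, 1, 1],
    "Alcohol": [14.23, 13.2, 13.16],
    "Magnesium": [127, 100, 101],
})

# Number of missing values in each column.
missing_per_column = sample.isna().sum()

# True only if every column has a numeric dtype.
all_numeric = all(pd.api.types.is_numeric_dtype(t) for t in sample.dtypes)

print(missing_per_column)
print(all_numeric)
```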


#### The following correlation matrix shows how strongly each pair of attributes is correlated. The two most correlated attributes are "Flavanoids" and "Total phenols" (0.865).

In [35]:
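import numpy as np
# wineDS is already a DataFrame; wineDF is kept as a separate name for the correlation step
wineDF = pd.DataFrame(wineDS)
# Pairwise Pearson correlation between all columns (including Class)
unaCorrM = wineDF.corr()
# Shade each cell by its value to make strong correlations stand out
unaCorrM.style.background_gradient()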

,Class,Alcohol,Malic acid,Ash,Alcalinity of ash,Magnesium,Total phenols,Flavanoids,Nonflavanoid phenols,Proanthocyanins,Color intensity,Hue,OD280OD315,Proline
Class,1.0,-0.328222,0.437776,-0.0496432,0.517859,-0.209179,-0.719163,-0.847498,0.489109,-0.49913,0.265668,-0.617369,-0.78823,-0.633717
Alcohol,-0.328222,1.0,0.0943969,0.211545,-0.310235,0.270798,0.289101,0.236815,-0.155929,0.136698,0.546364,-0.0717472,0.0723432,0.64372
Malic acid,0.437776,0.0943969,1.0,0.164045,0.2885,-0.0545751,-0.335167,-0.411007,0.292977,-0.220746,0.248985,-0.561296,-0.36871,-0.192011
Ash,-0.0496432,0.211545,0.164045,1.0,0.443367,0.286587,0.12898,0.115077,0.18623,0.00965194,0.258887,-0.0746669,0.00391123,0.223626
Alcalinity of ash,0.517859,-0.310235,0.2885,0.443367,1.0,-0.0833331,-0.321113,-0.35137,0.361922,-0.197327,0.018732,-0.273955,-0.276769,-0.440597
Magnesium,-0.209179,0.270798,-0.0545751,0.286587,-0.0833331,1.0,0.214401,0.195784,-0.256294,0.236441,0.19995,0.0553982,0.0660039,0.393351
Total phenols,-0.719163,0.289101,-0.335167,0.12898,-0.321113,0.214401,1.0,0.864564,-0.449935,0.612413,-0.0551364,0.433681,0.699949,0.498115
Flavanoids,-0.847498,0.236815,-0.411007,0.115077,-0.35137,0.195784,0.864564,1.0,-0.5379,0.652692,-0.172379,0.543479,0.787194,0.494193
Nonflavanoid phenols,0.489109,-0.155929,0.292977,0.18623,0.361922,-0.256294,-0.449935,-0.5379,1.0,-0.365845,0.139057,-0.26264,-0.50327,-0.311385
Proanthocyanins,-0.49913,0.136698,-0.220746,0.00965194,-0.197327,0.236441,0.612413,0.652692,-0.365845,1.0,-0.0252499,0.295544,0.519067,0.330417
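The most correlated pair can also be found programmatically instead of by scanning the table. The sketch below uses a 4x4 excerpt of the matrix above (values copied from it); on the real data one would presumably start from `unaCorrM` with the `Class` row and column dropped. The names `corr`, `pairs` and `top_pair` are hypothetical.

```python
import numpy as np
import pandas as pd

# Excerpt of the correlation matrix shown above.
names = ["Alcohol", "Total phenols", "Flavanoids", "OD280OD315"]
corr = pd.DataFrame(
    [[1.0, 0.289101, 0.236815, 0.0723432],
     [0.289101, 1.0, 0.864564, 0.699949],
     [0.236815, 0.864564, 1.0, 0.787194],
     [0.0723432, 0.699949, 0.787194, 1.0]],
    index=names, columns=names)

# Strict upper triangle: each pair appears once and the diagonal is excluded.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)

# Rank pairs by absolute correlation, strongest first.
pairs = corr.abs().where(mask).stack().dropna().sort_values(ascending=False)
top_pair = pairs.index[0]

print(top_pair, round(float(pairs.iloc[0]), 3))
```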


#### Now we split the dataset into training and test sets.

In [55]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import (RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier)
# Features: the 13 chemical attributes; target: the class label
X = wineDS.drop('Class', axis = 1)
y = wineDS['Class']
# Hold out 20% of the rows for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=42)
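With only 178 rows and three classes of unequal size, a purely random split can leave the test set with class proportions that differ from the full data. One possible variation, not used in this notebook, is to pass `stratify` to `train_test_split`. The sketch below assumes scikit-learn's bundled copy of the wine data as a stand-in for the CSV; the `*_alt`, `X_tr` and `X_te` names are hypothetical.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

# Assumption: scikit-learn's bundled wine data stands in for the CSV.
X_alt, y_alt = load_wine(return_X_y=True)

# stratify=y_alt keeps the class proportions roughly equal in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_alt, y_alt, test_size=0.2, random_state=42, stratify=y_alt)

print(X_tr.shape, X_te.shape)
```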

In [59]:
from sklearn.metrics import accuracy_score
modelos = [RandomForestClassifier(random_state=77), GradientBoostingClassifier(random_state=77), AdaBoostClassifier(random_state=77)]

#### Here we cross-validate each model on the training set, then fit it and predict on the test set:

In [61]:
from sklearn.model_selection import cross_val_score, GridSearchCV

for model in modelos:
    # 5-fold cross-validation on the training set only
    unResultado = cross_val_score(model, X_train, y_train, cv=5)
    unMensaje = ("{0}:\n\tMean accuracy (training CV) \t= {1:.3f} "
           "(+/- {2:.3f})".format(model.__class__.__name__,
                                  unResultado.mean(),
                                  unResultado.std()))
    print(unMensaje)
    # Refit on the full training set and score once on the held-out test set
    model.fit(X_train, y_train)
    unaPrediccion_test = model.predict(X_test)
    unaPrecision_test = accuracy_score(y_test, unaPrediccion_test)
    print("\tAccuracy (test)\t\t= {0:.3f}".format(unaPrecision_test))

RandomForestClassifier:
	Mean accuracy (training CV) 	= 0.965 (+/- 0.023)
	Accuracy (test)		= 1.000
GradientBoostingClassifier:
	Mean accuracy (training CV) 	= 0.944 (+/- 0.036)
	Accuracy (test)		= 0.944
AdaBoostClassifier:
	Mean accuracy (training CV) 	= 0.915 (+/- 0.018)
	Accuracy (test)		= 0.917


#### As the results above show, the best model in this case was Random Forest, with the highest cross-validated accuracy (0.965) and a test accuracy of 1.000.
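A single accuracy figure does not show which classes get confused with each other. The notebook imports `confusion_matrix` and `classification_report` without using them; the following sketch shows how they might be applied to the Random Forest model. It assumes scikit-learn's bundled wine data in place of the CSV (so classes are labelled 0 to 2), and the names `clf`, `cm` and `y_pred` are hypothetical.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Assumption: scikit-learn's bundled wine data stands in for the CSV.
X_alt, y_alt = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_alt, y_alt, test_size=0.2, random_state=42)

# Same model settings as in the comparison above.
clf = RandomForestClassifier(random_state=77).fit(X_tr, y_tr)
y_pred = clf.predict(X_te)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_te, y_pred, labels=[0, 1, 2])
print(cm)

# Per-class precision, recall and F1.
print(classification_report(y_te, y_pred, labels=[0, 1, 2], zero_division=0))
```

Off-diagonal entries in `cm` count the misclassified wines, so a perfect test score corresponds to a purely diagonal matrix.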