## Importing libraries

In [57]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score, mean_squared_error
import urllib.request
import os
from IPython.display import Markdown as md
%matplotlib inline

## Downloading and importing the white wine dataset

The data for our analysis come from samples of white wine from northern Portugal and describe the wine's physicochemical characteristics. The [white wine dataset](https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv) is public and available for research, and the [data description](https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality.names) is publicly available as well.

#### Downloading the data and creating the data frame

In [9]:
download_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv'
downloaded_file = 'data/winequality-white.csv'
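# download only if not present
if not os.path.isfile(downloaded_file):
    # make sure the target folder exists before writing the file
    os.makedirs(os.path.dirname(downloaded_file), exist_ok=True)
    urllib.request.urlretrieve(download_url, downloaded_file)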
data = pd.read_csv(downloaded_file, sep=";")
data.head(5)

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.0 | 0.27 | 0.36 | 20.7 | 0.045 | 45.0 | 170.0 | 1.001 | 3.0 | 0.45 | 8.8 | 6 |
| 1 | 6.3 | 0.3 | 0.34 | 1.6 | 0.049 | 14.0 | 132.0 | 0.994 | 3.3 | 0.49 | 9.5 | 6 |
| 2 | 8.1 | 0.28 | 0.4 | 6.9 | 0.05 | 30.0 | 97.0 | 0.9951 | 3.26 | 0.44 | 10.1 | 6 |
| 3 | 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47.0 | 186.0 | 0.9956 | 3.19 | 0.4 | 9.9 | 6 |
| 4 | 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47.0 | 186.0 | 0.9956 | 3.19 | 0.4 | 9.9 | 6 |


### About the data

The data contain 11 variables; the 12th column holds the score for the wine's quality, which ranges from 0 to 10.

In [22]:
columns = ["Fixed acidity", "Volatile acidity",
           "Citric acid", "Residual sugar", "Chlorides",
           "Free sulfur dioxide", "Total sulfur dioxide",
           "Density", "pH", "Sulphates", "Alcohol", "Quality"]
md_text = ""
# list every feature; the last column (quality) is the target, so it is skipped
for i, c_name in enumerate(columns[:-1]):
    md_text = md_text + "\n    {} - {}".format(i+1, c_name)
md("""#### List of variables
{}

""".format(md_text))


#### List of variables

    1 - Fixed acidity
    2 - Volatile acidity
    3 - Citric acid
    4 - Residual sugar
    5 - Chlorides
    6 - Free sulfur dioxide
    7 - Total sulfur dioxide
    8 - Density
    9 - pH
    10 - Sulphates
    11 - Alcohol



#### Splitting the data into training and test sets

We use the `train_test_split` function to split the data, keeping 80% for training and 20% for testing.

In [23]:
X = data.iloc[:, :-1].values
y = data.iloc[:, -1].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
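
To make the 80/20 split concrete, here is a minimal sketch of the idea using only the standard library. It is not scikit-learn's implementation: the helper name `split_indices` and the toy sample count are assumptions made for illustration, and the shuffle will not reproduce the exact rows that `train_test_split` selects. It only shows that a fixed seed gives a repeatable partition and that the test share is rounded up.

```python
import math
import random

def split_indices(n_samples, test_size=0.2, seed=0):
    """Shuffle row indices with a fixed seed and cut off the test share.

    Hypothetical helper for illustration; not part of scikit-learn.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)        # same seed, same order every run
    n_test = math.ceil(test_size * n_samples)   # test share is rounded up
    return indices[n_test:], indices[:n_test]   # (train indices, test indices)

# Toy example with 10 rows (assumed size, not the wine dataset)
train_idx, test_idx = split_indices(10)
print(len(train_idx), len(test_idx))
```

Fixing the seed plays the same role as `random_state = 0` in the cell above: rerunning the notebook yields the same training and test rows, so the metrics stay comparable between runs.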

### Scaling the data

In [58]:
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
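
The cell above fits the scaler on the training set only and reuses those statistics for the test set. The following sketch shows that arithmetic for a single feature with the standard library; the helper names and the toy values are assumptions made for illustration, not data from the wine set. `StandardScaler` uses the population standard deviation, which is what `statistics.pstdev` computes.

```python
import statistics

def fit_scaler(column):
    """Return the mean and population standard deviation of one feature."""
    return statistics.fmean(column), statistics.pstdev(column)

def transform(column, mean, std):
    """Standardize values with previously fitted statistics."""
    return [(x - mean) / std for x in column]

# Hypothetical toy values for one feature
train_col = [8.0, 10.0, 12.0]
test_col = [9.0, 13.0]

mean, std = fit_scaler(train_col)                # statistics come from training data only
train_scaled = transform(train_col, mean, std)   # has mean 0 and standard deviation 1
test_scaled = transform(test_col, mean, std)     # reuses the training mean and std
print(train_scaled, test_scaled)
```

Reusing the training statistics keeps information about the test set out of the preprocessing step, which is why the notebook calls `fit_transform` on `X_train` but only `transform` on `X_test`.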

### Training the model

We will use several regression models to compare their results and select the best predictor.

#### Multiple Linear Regression

##### Training

In [59]:
regressor = LinearRegression()
regressor.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

##### Predicting on the test set

In [60]:
y_pred = regressor.predict(X_test)
# y_pred = y_pred.astype(int)
np.set_printoptions(precision=2)
print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1))


[[5.59 5.  ]
 [5.47 6.  ]
 [6.09 7.  ]
 ...
 [5.55 6.  ]
 [5.5  6.  ]
 [4.8  4.  ]]
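
The predictions are continuous while the quality scores are integers. The commented-out `astype(int)` line would truncate each prediction rather than round it to the nearest score. The sketch below compares both options on the six rows visible in the output above (values copied as printed, so already rounded to two decimals); six rows say nothing about the full test set, they only show that the two conversions produce different labels.

```python
# (prediction, actual) pairs copied from the printed output above
shown = [(5.59, 5), (5.47, 6), (6.09, 7), (5.55, 6), (5.5, 6), (4.8, 4)]

# int() drops the fractional part, like astype(int) does for these positive values
truncated = [int(pred) for pred, _ in shown]
# round() picks the nearest integer (exact .5 ties go to the even neighbour)
rounded = [round(pred) for pred, _ in shown]

print("truncated:", truncated)
print("rounded:  ", rounded)
```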


##### Evaluating model performance

In [61]:
print("R2 score : {}".format(r2_score(y_test, y_pred)))
print("Mean squared error: {}".format(mean_squared_error(y_test,y_pred)))

R2 score : 0.2513476761101383
Mean squared error: 0.6598453517957835
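
To read these numbers, it helps to see how they are defined. The sketch below computes the mean squared error and R² by hand on hypothetical toy values (the helper names and the toy data are assumptions for illustration), then takes the square root of the MSE reported above to express the error in quality points.

```python
import math

def mse_manual(y_true, y_hat):
    """Average of the squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_hat)) / len(y_true)

def r2_manual(y_true, y_hat):
    """1 - SS_res / SS_tot: share of the variance explained by the model."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_hat))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Hypothetical toy values, not taken from the wine data
y_true = [5, 6, 7, 6]
y_hat = [5.5, 6.0, 6.5, 6.0]
print(mse_manual(y_true, y_hat), r2_manual(y_true, y_hat))

# Root mean squared error for the MSE printed in the output above
rmse_reported = math.sqrt(0.6598453517957835)
print(round(rmse_reported, 2))
```

An R² of about 0.25 means the linear model explains roughly a quarter of the variance in quality on the test set, and the RMSE of roughly 0.81 means a typical prediction is off by a little under one quality point.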
