## Libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

## Load the Data

In [4]:
dataset_url = 'http://mlr.cs.umass.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv'
data = pd.read_csv(dataset_url,sep = ';')

In [5]:
data.head()

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.997 | 3.26 | 0.65 | 9.8 | 5 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.998 | 3.16 | 0.58 | 9.8 | 6 |
| 4 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |


## Some statistics
A few summary statistics give a quick overview of the data.


In [7]:
data.shape

(1599, 12)

The dataset has **1599 rows** and **12 columns**.

In [8]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
fixed acidity           1599 non-null float64
volatile acidity        1599 non-null float64
citric acid             1599 non-null float64
residual sugar          1599 non-null float64
chlorides               1599 non-null float64
free sulfur dioxide     1599 non-null float64
total sulfur dioxide    1599 non-null float64
density                 1599 non-null float64
pH                      1599 non-null float64
sulphates               1599 non-null float64
alcohol                 1599 non-null float64
quality                 1599 non-null int64
dtypes: float64(11), int64(1)
memory usage: 150.0 KB


**data.info()** gives you the **column names, the number of non-null rows per column, the data types, and the memory usage**. You can also get just the data types by typing **data.dtypes**.
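
If you want the same information in a form you can work with programmatically, a small sketch like the one below may help. The `toy` frame is a hypothetical stand-in for the wine data (only two columns, three rows), not the real dataset.

```python
import pandas as pd

# Hypothetical mini-frame that mimics two of the wine columns.
toy = pd.DataFrame({
    "alcohol": [9.4, 9.8, 10.2],
    "quality": [5, 5, 6],
})

# dtypes is a Series that maps each column name to its data type.
print(toy.dtypes)

# isnull().sum() counts missing values per column,
# the complement of the "non-null" counts shown by info().
print(toy.isnull().sum())

dtype_names = toy.dtypes.astype(str).to_dict()
missing_counts = toy.isnull().sum().to_dict()
```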

In [14]:
data.describe()

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 |
| mean | 8.319637 | 0.527821 | 0.270976 | 2.538806 | 0.087467 | 15.874922 | 46.467792 | 0.996747 | 3.311113 | 0.658149 | 10.422983 | 5.636023 |
| std | 1.741096 | 0.17906 | 0.194801 | 1.409928 | 0.047065 | 10.460157 | 32.895324 | 0.001887 | 0.154386 | 0.169507 | 1.065668 | 0.807569 |
| min | 4.6 | 0.12 | 0.0 | 0.9 | 0.012 | 1.0 | 6.0 | 0.99007 | 2.74 | 0.33 | 8.4 | 3.0 |
| 25% | 7.1 | 0.39 | 0.09 | 1.9 | 0.07 | 7.0 | 22.0 | 0.9956 | 3.21 | 0.55 | 9.5 | 5.0 |
| 50% | 7.9 | 0.52 | 0.26 | 2.2 | 0.079 | 14.0 | 38.0 | 0.99675 | 3.31 | 0.62 | 10.2 | 6.0 |
| 75% | 9.2 | 0.64 | 0.42 | 2.6 | 0.09 | 21.0 | 62.0 | 0.997835 | 3.4 | 0.73 | 11.1 | 6.0 |
| max | 15.9 | 1.58 | 1.0 | 15.5 | 0.611 | 72.0 | 289.0 | 1.00369 | 4.01 | 2.0 | 14.9 | 8.0 |


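**data.describe()** reports, for each column:
* count: the number of non-null rows
* mean: the mean value of the column
* std: the standard deviation of the column
* min: the **min** value of the column
* 25%, 50%, 75%: the quartiles (50% is the median; the interquartile range, IQR, is the 75% value minus the 25% value)
* max: the **max** value of the column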
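
To see how the quartile rows relate to the IQR, here is a minimal sketch on a made-up five-value series (the values and the name `toy` are assumptions for illustration only).

```python
import pandas as pd

# Hypothetical values, chosen so the quartiles are easy to check by hand.
values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], name="toy")

summary = values.describe()
print(summary)

# The IQR is not printed by describe(); derive it from the 25% and 75% rows.
iqr = summary["75%"] - summary["25%"]
print("IQR:", iqr)
```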

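As all the values are numerical, we don't need any encoding. The features do have **different scales**, though, so we will apply **Standardisation** to them. (Tree-based models such as random forests are largely insensitive to feature scale, but many other models are not, so it is a good habit.)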

## Preprocessing 

In [16]:
from sklearn.model_selection import train_test_split

In [17]:
from sklearn import preprocessing

This preprocessing module contains utilities for **scaling, transforming, and wrangling data**.

In [37]:
#model importing
from sklearn.ensemble import RandomForestRegressor

In [20]:
# tools to help us perform cross-validation.
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

In [21]:
#performance metrics
from sklearn.metrics import mean_squared_error,r2_score

In [22]:
# storing large numpy arrays
# (joblib is a standalone package; sklearn.externals.joblib was removed in scikit-learn 0.23)
import joblib

Joblib is an alternative to Python's pickle package, and we'll use it because it's more efficient for storing large numpy arrays.

Let's separate the target (`quality`) from the input features:

In [23]:
x= data.drop('quality',axis = 1)
y = data.quality

Now let's split the data into training and test sets:

In [24]:
X_train,X_test,y_train,y_test = train_test_split(x,y,test_size = 0.2, random_state = 123, stratify = y)

**Stratify** (`stratify = y`) makes the split preserve the distribution of the target values, so your **training set looks similar to your test set**, making your evaluation metrics more reliable.
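
A quick way to convince yourself of this is to compare class shares in both splits. The sketch below uses hypothetical, deliberately imbalanced labels (80 wines of quality 5, 20 of quality 6) rather than the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 80% quality 5, 20% quality 6.
labels = np.array([5] * 80 + [6] * 20)
features = np.arange(100).reshape(-1, 1)

_, _, y_tr, y_te = train_test_split(
    features, labels, test_size=0.2, random_state=123, stratify=labels
)

# With stratify, the share of quality-6 samples matches in both splits.
train_share = np.mean(y_tr == 6)
test_share = np.mean(y_te == 6)
print(train_share, test_share)
```

Without `stratify`, the two shares would usually differ from split to split, especially for rare classes.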

## Data preprocessing steps

**Standardization** subtracts each feature's mean and divides by the feature's standard deviation: **z = (x - u) / sigma**. After standardization, every feature has **mean = 0** and **standard deviation = 1**.
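
The formula is easy to check by hand with NumPy. The five values below are made up for illustration; `u` and `sigma` follow the notation of the formula.

```python
import numpy as np

# Hypothetical values for a single feature.
x = np.array([8.4, 9.4, 10.2, 11.1, 14.9])

u = x.mean()       # feature mean
sigma = x.std()    # population standard deviation (ddof=0), as used by preprocessing.scale

z = (x - u) / sigma
print(z)
print(z.mean(), z.std())
```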

In [25]:
X_train_scaled = preprocessing.scale(X_train)
X_train_scaled

array([[ 0.51358886,  2.19680282, -0.164433  , ...,  1.08415147,
        -0.69866131, -0.58608178],
       [-1.73698885, -0.31792985, -0.82867679, ...,  1.46964764,
         1.2491516 ,  2.97009781],
       [-0.35201795,  0.46443143, -0.47100705, ..., -0.13658641,
        -0.35492962, -0.20843439],
       ...,
       [-0.98679628,  1.10708533, -0.93086814, ...,  0.24890976,
        -0.98510439,  0.35803669],
       [-0.69826067,  0.46443143, -1.28853787, ...,  1.08415147,
        -0.35492962, -0.68049363],
       [ 3.1104093 , -0.62528606,  2.08377675, ..., -1.61432173,
         0.79084268, -0.39725809]])

You can confirm that the scaled dataset is indeed centered at zero, with unit variance:

In [27]:
print(X_train_scaled.mean(axis = 0))

[ 1.16664562e-16 -3.05550043e-17 -8.47206937e-17 -2.22218213e-17
  2.22218213e-17 -6.38877362e-17 -4.16659149e-18 -2.54439854e-15
 -8.70817622e-16 -4.08325966e-16 -1.17220107e-15]


In [29]:
print(X_train_scaled.std(axis=0))

[1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]


#### However, we won't use `preprocessing.scale` in practice, because it does not remember the training means and standard deviations, so the same values cannot be applied to the test set. Instead, we'll use the Transformer API, which allows you to "fit" a preprocessing step using the training data the same way you'd fit a model.

Here's what that process looks like:

- Fit the transformer on the training set (saving the means and standard deviations)
- Apply the transformer to the training set (scaling the training data)
- Apply the transformer to the test set (using the same means and standard deviations)
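
The three steps above can be sketched on synthetic data as follows. The arrays `train` and `test` are random stand-ins (an assumption for illustration), so the printed numbers will not match the wine data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
train = rng.normal(loc=10.0, scale=2.0, size=(80, 3))
test = rng.normal(loc=10.0, scale=2.0, size=(20, 3))

# 1. Fit: the scaler stores the training means and standard deviations.
scaler = StandardScaler().fit(train)

# 2. Transform the training set with those statistics.
train_scaled = scaler.transform(train)

# 3. Transform the test set with the SAME training statistics.
test_scaled = scaler.transform(test)

print(scaler.mean_)                # training means saved by fit()
print(train_scaled.mean(axis=0))   # numerically zero
print(test_scaled.mean(axis=0))    # close to zero, but not exactly
```

The test set ends up only approximately centered, which is expected: it was scaled with statistics it never contributed to.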


In [31]:
scaler = preprocessing.StandardScaler().fit(X_train)

In [32]:
X_train_scaled = scaler.transform(X_train)

In [33]:
X_test_scaled  = scaler.transform(X_test)

#### Note that we're transforming the test set using the means and standard deviations from the training set, not from the test set itself. In practice, when we set up the cross-validation pipeline, we won't even need to fit the transformer manually. Instead, we'll simply declare the class object inside a pipeline, like this:

In [35]:
from sklearn.pipeline import make_pipeline

In [38]:
pipeline = make_pipeline(preprocessing.StandardScaler(), 
                         RandomForestRegressor(n_estimators=100))
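
To see what the pipeline does for you, here is a small sketch on synthetic data. `X_toy`, `y_toy`, and the reduced `n_estimators=10` are assumptions chosen to keep it fast. Fitting the pipeline fits the scaler on the training rows only, and `predict` reuses those statistics automatically.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X_toy = rng.normal(size=(60, 4))
y_toy = 2.0 * X_toy[:, 0] + rng.normal(scale=0.1, size=60)

toy_pipeline = make_pipeline(
    StandardScaler(),
    RandomForestRegressor(n_estimators=10, random_state=0),
)

# Fit on the first 40 rows only; the scaler learns its means from these rows.
toy_pipeline.fit(X_toy[:40], y_toy[:40])

# make_pipeline names each step after its lowercased class name.
step_names = [name for name, _ in toy_pipeline.steps]
fitted_scaler = toy_pipeline.named_steps["standardscaler"]

# predict() scales the held-out rows with the training statistics first.
preds = toy_pipeline.predict(X_toy[40:])
print(step_names)
```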

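#### We can list the tunable hyperparameters by printing the pipeline's parameters. Note that `pipeline.get_params` without parentheses prints the bound method together with the pipeline's repr; call `pipeline.get_params()` to get the parameters as a dictionary.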

In [39]:
print(pipeline.get_params)

<bound method Pipeline.get_params of Pipeline(memory=None,
     steps=[('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('randomforestregressor', RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=None,
           oob_score=False, random_state=None, verbose=0, warm_start=False))])>
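
If you prefer a cleaner listing, calling the method returns a dictionary whose keys are exactly the names a parameter grid expects. This is a small sketch; `demo_pipeline` is a fresh object built only for this illustration.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

demo_pipeline = make_pipeline(StandardScaler(),
                              RandomForestRegressor(n_estimators=100))

# get_params() (with parentheses) returns a dict of every tunable parameter.
params = demo_pipeline.get_params()

# Keys use the pattern <step name>__<parameter name>.
forest_keys = sorted(k for k in params if k.startswith("randomforestregressor__"))
for key in forest_keys:
    print(key, "=", params[key])
```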


In [40]:
# Note: 'auto' was removed in scikit-learn 1.3; on newer versions use 1.0 (all features) instead.
hyperparameters = { 'randomforestregressor__max_features' : ['auto', 'sqrt', 'log2'],
                  'randomforestregressor__max_depth': [None, 5, 3, 1]}

In [41]:
clf = GridSearchCV(pipeline, hyperparameters, cv=10)


In [42]:
clf.fit(X_train,y_train)

GridSearchCV(cv=10, error_score='raise-deprecating',
       estimator=Pipeline(memory=None,
     steps=[('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('randomforestregressor', RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decr...ors=100, n_jobs=None,
           oob_score=False, random_state=None, verbose=0, warm_start=False))]),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'randomforestregressor__max_features': ['auto', 'sqrt', 'log2'], 'randomforestregressor__max_depth': [None, 5, 3, 1]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)
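
After fitting, the search object exposes which combination won. The sketch below repeats the idea on synthetic data with a smaller, hypothetical grid (`toy_grid`, `cv=3`, `n_estimators=10`) so that it runs in a moment; the winning values are specific to this toy data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X_toy = rng.normal(size=(90, 4))
y_toy = 3.0 * X_toy[:, 0] - X_toy[:, 1] + rng.normal(scale=0.1, size=90)

toy_grid = {
    "randomforestregressor__max_features": ["sqrt", "log2"],
    "randomforestregressor__max_depth": [None, 3],
}

search = GridSearchCV(
    make_pipeline(StandardScaler(),
                  RandomForestRegressor(n_estimators=10, random_state=0)),
    toy_grid,
    cv=3,
)
search.fit(X_toy, y_toy)

# Best combination and its mean cross-validated score (R^2 for regressors).
print(search.best_params_)
print(search.best_score_)

# One entry per combination in the grid: 2 x 2 = 4 candidates here.
n_candidates = len(search.cv_results_["params"])
```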

#### Refit on the entire training set
   - No additional code is needed: `GridSearchCV` automatically refits the best model on the whole training set when `clf.refit == True` (the default).

In [46]:
# Evaluate the model pipeline on the test data
pred = clf.predict(X_test)
print(r2_score(y_test, pred))
print(mean_squared_error(y_test, pred))

0.4692608511411104
0.342471875


### This is not a great score: an R² of about 0.47 means the model explains less than half of the variance in quality. Still, the whole exercise shows how a typical data science workflow goes.
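
To put the two numbers printed above in context, it helps to know what a trivial model would score. The sketch below uses a handful of made-up quality values: a model that always predicts the mean gets an R² of 0, and a perfect model gets 1. Taking the square root of the MSE gives an error in the original units, roughly 0.59 quality points for the MSE reported above.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical true quality scores.
y_true = np.array([5, 5, 6, 7, 5, 6])

# Baseline: always predict the mean quality.
baseline = np.full(y_true.shape, y_true.mean())

r2_baseline = r2_score(y_true, baseline)      # 0 by definition of R^2
r2_perfect = r2_score(y_true, y_true)         # 1 for perfect predictions
mse_baseline = mean_squared_error(y_true, baseline)

# RMSE expresses the error in the same units as the target.
rmse_reported = np.sqrt(0.342471875)
print(r2_baseline, r2_perfect, mse_baseline, rmse_reported)
```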

In [47]:
# before you go, let's save the model
joblib.dump(clf, 'rf_regressor.pkl')

['rf_regressor.pkl']

In [48]:
# whenever you want to load the model again
clf2 = joblib.load('rf_regressor.pkl')
 
# Predict data set using loaded model
clf2.predict(X_test)

array([6.56, 5.75, 5.03, 5.53, 6.25, 5.63, 4.96, 4.75, 5.02, 5.96, 5.37,
       5.74, 5.84, 5.1 , 5.79, 5.6 , 6.58, 5.82, 5.74, 6.94, 5.5 , 5.62,
       5.  , 6.11, 5.89, 5.05, 5.43, 5.13, 6.01, 5.96, 5.85, 6.48, 5.98,
       5.02, 5.  , 5.93, 5.06, 6.02, 5.05, 6.02, 4.86, 5.86, 6.65, 5.05,
       6.24, 5.37, 5.49, 5.66, 5.2 , 6.35, 6.  , 5.37, 5.83, 5.18, 5.61,
       5.74, 5.38, 5.43, 4.96, 5.33, 5.3 , 5.19, 5.11, 5.86, 6.  , 5.29,
       6.4 , 5.06, 5.23, 6.71, 5.73, 5.86, 5.1 , 5.03, 5.28, 5.94, 5.47,
       5.08, 5.26, 5.18, 6.4 , 5.65, 6.17, 6.42, 5.09, 6.05, 6.32, 6.35,
       5.61, 5.91, 5.92, 5.34, 6.43, 5.72, 5.73, 5.8 , 6.64, 6.77, 5.66,
       6.73, 5.04, 5.43, 5.14, 6.38, 5.03, 4.78, 5.74, 5.01, 5.85, 5.99,
       5.93, 5.5 , 6.07, 5.41, 5.17, 5.24, 5.96, 5.15, 4.98, 6.02, 5.87,
       5.11, 5.85, 6.06, 5.18, 5.42, 5.38, 5.9 , 5.64, 5.36, 5.91, 6.09,
       5.14, 5.12, 5.13, 6.38, 5.03, 5.18, 6.75, 5.44, 5.23, 5.09, 5.54,
       6.08, 5.3 , 5.27, 5.14, 6.48, 5.78, 5.14, 5.
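
If you want to verify that a saved model survives the round trip, a minimal sketch looks like this. It trains a small stand-in model on synthetic data and writes to a temporary directory; `toy_regressor.pkl` is a hypothetical file name. Only load joblib or pickle files that you trust, since loading can execute arbitrary code.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
X_toy = rng.normal(size=(50, 3))
y_toy = X_toy[:, 0] + rng.normal(scale=0.1, size=50)

model = RandomForestRegressor(n_estimators=10, random_state=0).fit(X_toy, y_toy)

# Save to a temporary directory and load the model back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_regressor.pkl")
    joblib.dump(model, path)
    restored = joblib.load(path)

# The restored model should give identical predictions.
same_predictions = np.array_equal(model.predict(X_toy), restored.predict(X_toy))
print(same_predictions)
```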

## You can find this tutorial [here](https://elitedatascience.com/python-machine-learning-tutorial-scikit-learn), as part of [this](https://elitedatascience.com/machine-learning-projects-for-beginners) series.