# Capstone 2

## Preprocessing and Training Data Development
In this step, the data is preprocessed as follows:

1. Create dummy (indicator) features for categorical variables
2. Split the data into training and test datasets
3. Standardize the magnitude of numeric features using a scaler (z-scoring) fitted on the training data only

In [1]:
import pandas as pd

# Both files are semicolon-delimited.
reds = pd.read_csv('../downloads/DataFolder/winequality-red.csv', sep=';')
whites = pd.read_csv('../downloads/DataFolder/winequality-white.csv', sep=';')

# Tag each row with its wine type before combining the two sets.
whites['type'] = 'white'
reds['type'] = 'red'

all_wines = pd.concat([whites, reds])
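
A side note on the concatenation above: `pd.concat` keeps each frame's original row labels by default, so the combined frame can contain the same index value once for a white wine and once for a red. The sketch below illustrates this on two made-up frames (`toy_whites` and `toy_reds` are hypothetical stand-ins, not the real data) and shows `ignore_index=True` as one possible way to renumber the rows.

```python
import pandas as pd

# Hypothetical stand-ins for the two wine files (values are made up).
toy_whites = pd.DataFrame({"alcohol": [9.5, 10.0]})
toy_reds = pd.DataFrame({"alcohol": [10.8, 11.3]})
toy_whites["type"] = "white"
toy_reds["type"] = "red"

# Default behaviour: each frame keeps its own row labels.
kept = pd.concat([toy_whites, toy_reds])

# Optional alternative: discard the old labels and renumber from 0.
renumbered = pd.concat([toy_whites, toy_reds], ignore_index=True)

print(kept.index.tolist())        # [0, 1, 0, 1]
print(renumbered.index.tolist())  # [0, 1, 2, 3]
```

Whether duplicate labels matter depends on later steps; they are harmless as long as rows are only ever matched by position.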

## Create dummy features

In [2]:
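# 'type' is the only non-numeric column; drop_first=True keeps a single
# indicator column (type_white) instead of two redundant ones.
all_wines = pd.get_dummies(all_wines, drop_first=True)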
all_wines.sample(10)

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality | type_white |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 665 | 7.8 | 0.28 | 0.22 | 1.4 | 0.056 | 24.0 | 130.0 | 0.9944 | 3.28 | 0.48 | 9.5 | 5 | 1 |
| 1777 | 6.5 | 0.18 | 0.26 | 1.4 | 0.041 | 40.0 | 141.0 | 0.9941 | 3.34 | 0.72 | 9.5 | 6 | 1 |
| 377 | 7.3 | 0.2 | 0.44 | 1.4 | 0.045 | 21.0 | 98.0 | 0.9924 | 3.15 | 0.46 | 10.0 | 7 | 1 |
| 815 | 6.2 | 0.3 | 0.17 | 2.8 | 0.04 | 24.0 | 125.0 | 0.9939 | 3.01 | 0.46 | 9.0 | 5 | 1 |
| 797 | 9.3 | 0.37 | 0.44 | 1.6 | 0.038 | 21.0 | 42.0 | 0.99526 | 3.24 | 0.81 | 10.8 | 7 | 0 |
| 1898 | 7.2 | 0.31 | 0.41 | 8.6 | 0.053 | 15.0 | 89.0 | 0.9976 | 3.29 | 0.64 | 9.9 | 6 | 1 |
| 1052 | 7.6 | 0.29 | 0.42 | 1.3 | 0.035 | 18.0 | 86.0 | 0.9908 | 2.99 | 0.39 | 11.3 | 5 | 1 |
| 4877 | 5.9 | 0.54 | 0.0 | 0.8 | 0.032 | 12.0 | 82.0 | 0.99286 | 3.25 | 0.36 | 8.8 | 5 | 1 |
| 3182 | 5.5 | 0.12 | 0.33 | 1.0 | 0.038 | 23.0 | 131.0 | 0.99164 | 3.25 | 0.45 | 9.8 | 5 | 1 |
| 866 | 6.8 | 0.49 | 0.22 | 2.3 | 0.071 | 13.0 | 24.0 | 0.99438 | 3.41 | 0.83 | 11.3 | 6 | 0 |
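
To make the effect of `drop_first=True` concrete, here is a minimal sketch on a made-up three-row frame (`toy` is hypothetical, not the wine data). `dtype=int` is passed only so the indicator prints as 0/1 regardless of pandas version; the notebook cell above does not set it.

```python
import pandas as pd

# Toy frame standing in for the combined wine data (values are made up).
toy = pd.DataFrame({
    "alcohol": [9.5, 10.8, 11.3],
    "type": ["white", "red", "white"],
})

# With two categories, drop_first=True drops the first one in sorted
# order ("red") and keeps a single indicator column, type_white.
encoded = pd.get_dummies(toy, drop_first=True, dtype=int)

print(encoded.columns.tolist())        # ['alcohol', 'type_white']
print(encoded["type_white"].tolist())  # [1, 0, 1]
```

Keeping one column instead of two avoids a perfectly redundant pair of features: `type_red` would always equal `1 - type_white`.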


## Split the data into training and test sets

In [3]:
from sklearn.model_selection import train_test_split
X = all_wines.drop('quality', axis=1)
y = all_wines['quality']

# Stratify on quality so both splits keep the same class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, stratify=y
)

### Check the counts of each quality level in each split

In [4]:
print('Training:\n',y_train.value_counts().sort_index(),sep='')
print('\nTest:\n',y_test.value_counts().sort_index(),sep='')

Training:
3      20
4     145
5    1432
6    1900
7     723
8     129
9       3
Name: quality, dtype: int64

Test:
3     10
4     71
5    706
6    936
7    356
8     64
9      2
Name: quality, dtype: int64
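
Raw counts are hard to compare across splits of different sizes. As a rough check that stratification worked, the sketch below converts the counts printed above into per-class shares. The two Series are typed in by hand from that output; in the notebook itself, `y_train.value_counts(normalize=True)` would presumably give the same shares directly.

```python
import pandas as pd

# Counts copied from the printed output above.
train_counts = pd.Series({3: 20, 4: 145, 5: 1432, 6: 1900, 7: 723, 8: 129, 9: 3})
test_counts = pd.Series({3: 10, 4: 71, 5: 706, 6: 936, 7: 356, 8: 64, 9: 2})

# Share of each quality level within its own split.
comparison = pd.DataFrame({
    "train_share": train_counts / train_counts.sum(),
    "test_share": test_counts / test_counts.sum(),
})
comparison["abs_diff"] = (comparison["train_share"] - comparison["test_share"]).abs()

print(comparison.round(4))
print("test fraction:", round(test_counts.sum() / (train_counts.sum() + test_counts.sum()), 3))
```

The shares agree to within a fraction of a percentage point for every quality level, and the test set holds about 33% of the rows, matching `test_size=0.33`.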


## Center and scale the data

In [5]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(X_train)

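# transform() returns a bare NumPy array, so restore the column names and
# the original row index to keep the features aligned with y_train / y_test.
column_names = X_train.columns
X_train = pd.DataFrame(scaler.transform(X_train), columns=column_names, index=X_train.index)
X_test = pd.DataFrame(scaler.transform(X_test), columns=column_names, index=X_test.index)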
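
The point of fitting the scaler on the training set only is that the test set is transformed with the training mean and standard deviation, so no information from the test rows leaks into preprocessing. A minimal sketch on synthetic data follows; `toy_train` and `toy_test` are hypothetical, and passing `index=` through is an optional choice that keeps the scaled rows alignable with their labels.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Synthetic single-feature data (hypothetical, not the wine data).
rng = np.random.default_rng(0)
toy_train = pd.DataFrame({"alcohol": rng.normal(10.5, 1.2, 200)}, index=range(100, 300))
toy_test = pd.DataFrame({"alcohol": rng.normal(10.5, 1.2, 50)}, index=range(300, 350))

# Learn mean and standard deviation from the training rows only.
scaler = StandardScaler().fit(toy_train)

# Apply the same training statistics to both splits.
train_scaled = pd.DataFrame(
    scaler.transform(toy_train), columns=toy_train.columns, index=toy_train.index
)
test_scaled = pd.DataFrame(
    scaler.transform(toy_test), columns=toy_test.columns, index=toy_test.index
)

# Training data is exactly centered and scaled; test data only approximately.
print(train_scaled["alcohol"].mean(), train_scaled["alcohol"].std(ddof=0))
print(test_scaled["alcohol"].mean(), test_scaled["alcohol"].std(ddof=0))
```

The training column comes out with mean 0 and unit (population) standard deviation up to floating-point error, while the test column is close to but not exactly standardized, which is expected.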

In [6]:
X_train.to_csv('../data/X_train.csv')
X_test.to_csv('../data/X_test.csv')
y_train.to_csv('../data/y_train.csv')
y_test.to_csv('../data/y_test.csv')
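
Because `to_csv` writes the row index as the first column by default, a later step has to read that column back as the index rather than as a feature. The sketch below shows one way to round-trip the files, using small made-up objects and a temporary directory instead of the real `../data/` paths (`X_toy`, `y_toy`, and the file names are hypothetical).

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical miniature versions of the saved objects.
X_toy = pd.DataFrame({"alcohol": [-0.5, 1.2, -0.7]}, index=[665, 797, 866])
y_toy = pd.Series([5, 7, 6], index=[665, 797, 866], name="quality")

with tempfile.TemporaryDirectory() as tmp:
    x_path = Path(tmp) / "X_toy.csv"
    y_path = Path(tmp) / "y_toy.csv"

    # Same calls as in the cell above: the index is written as column 0.
    X_toy.to_csv(x_path)
    y_toy.to_csv(y_path)

    # index_col=0 restores that first column as the row index.
    X_back = pd.read_csv(x_path, index_col=0)
    y_back = pd.read_csv(y_path, index_col=0)["quality"]

print(X_back.columns.tolist())             # ['alcohol']
print(X_back.index.equals(y_back.index))   # True
```

Without `index_col=0`, the saved index would come back as an extra column and could be mistaken for a feature.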