# Capstone 2

## Preprocessing and Training Data Development
In this step, the data are preprocessed in three stages:

1. Create dummy (indicator) features for categorical variables
2. Split the data into training and test datasets
3. Standardize the magnitude of numeric features using a scaler (z-scoring) fit on the training data only
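
As a rough illustration of what z-scoring does, the sketch below standardizes a short list of numbers by hand using only the standard library. The `z_score` helper and the alcohol readings are hypothetical, made up for this example. It uses the population standard deviation, which is also what scikit-learn's `StandardScaler` uses.

```python
from statistics import fmean, pstdev


def z_score(values):
    """Return (x - mean) / std for each x, using the population standard deviation."""
    mean = fmean(values)
    std = pstdev(values)  # would be 0 for a constant column, which this sketch does not handle
    return [(x - mean) / std for x in values]


# Hypothetical alcohol readings, for illustration only
alcohol = [9.0, 10.0, 11.0, 12.0, 13.0]
scaled = z_score(alcohol)

print([round(z, 3) for z in scaled])
```

After scaling, the values have mean 0 and standard deviation 1, so features measured on very different scales become comparable.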

In [1]:
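import pandas as pd

# The wine-quality CSV files are semicolon-delimited
reds = pd.read_csv('../downloads/DataFolder/winequality-red.csv', sep=';')
whites = pd.read_csv('../downloads/DataFolder/winequality-white.csv', sep=';')

# Tag each row with its wine type before combining
whites['type'] = 'white'
reds['type'] = 'red'

# Stack both datasets; each frame keeps its original row labels
all_wines = pd.concat([whites, reds])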
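
One detail worth knowing: `pd.concat` keeps each frame's own row labels, so labels can repeat in the combined frame. The sketch below shows this on two small made-up frames, along with the `ignore_index=True` option that renumbers the rows. The notebook does not use that option, and nothing here suggests it needs to; this is only a possible alternative.

```python
import pandas as pd

# Toy stand-ins for the two CSV files (values are made up)
whites = pd.DataFrame({'alcohol': [10.6, 11.0, 9.2], 'type': 'white'})
reds = pd.DataFrame({'alcohol': [9.4, 9.8], 'type': 'red'})

stacked = pd.concat([whites, reds])                        # keeps original labels
renumbered = pd.concat([whites, reds], ignore_index=True)  # fresh 0..n-1 labels

print(stacked.index.tolist())      # labels restart at 0 for the second frame
print(renumbered.index.tolist())   # one continuous range
```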

## Create dummy features

In [2]:
# 'type' is the only categorical column; drop_first=True keeps a single type_white indicator
all_wines = pd.get_dummies(all_wines, drop_first=True)
all_wines.sample(10)

| index | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality | type_white |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1183 | 6.4 | 0.26 | 0.3 | 2.2 | 0.025 | 33.0 | 134.0 | 0.992 | 3.21 | 0.47 | 10.6 | 6 | 1 |
| 3185 | 6.5 | 0.21 | 0.4 | 7.3 | 0.041 | 49.0 | 115.0 | 0.99268 | 3.21 | 0.43 | 11.0 | 6 | 1 |
| 3568 | 7.8 | 0.15 | 0.34 | 1.1 | 0.035 | 31.0 | 93.0 | 0.99096 | 3.07 | 0.72 | 11.3 | 7 | 1 |
| 4500 | 7.8 | 0.27 | 0.33 | 2.4 | 0.053 | 36.0 | 175.0 | 0.992 | 3.2 | 0.55 | 11.0 | 6 | 1 |
| 4099 | 6.4 | 0.5 | 0.2 | 2.4 | 0.059 | 19.0 | 112.0 | 0.99314 | 3.18 | 0.4 | 9.2 | 6 | 1 |
| 3956 | 5.4 | 0.24 | 0.18 | 2.3 | 0.05 | 22.0 | 145.0 | 0.99207 | 3.24 | 0.46 | 10.3 | 5 | 1 |
| 4066 | 7.1 | 0.44 | 0.27 | 8.4 | 0.057 | 60.0 | 160.0 | 0.99257 | 3.16 | 0.36 | 11.8 | 6 | 1 |
| 4689 | 6.7 | 0.16 | 0.32 | 12.5 | 0.035 | 18.0 | 156.0 | 0.99666 | 2.88 | 0.36 | 9.0 | 6 | 1 |
| 4895 | 6.5 | 0.24 | 0.19 | 1.2 | 0.041 | 30.0 | 111.0 | 0.99254 | 2.99 | 0.46 | 9.4 | 6 | 1 |
| 292 | 7.4 | 0.28 | 0.42 | 19.8 | 0.066 | 53.0 | 195.0 | 1.0 | 2.96 | 0.44 | 9.1 | 5 | 1 |
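
To make the effect of `drop_first=True` concrete, the sketch below runs `pd.get_dummies` on a tiny made-up frame with and without the option. With two categories, a single indicator carries all the information, so the `type_red` column is redundant.

```python
import pandas as pd

# Toy frame (made-up values) with one categorical column
toy = pd.DataFrame({'alcohol': [10.6, 9.4, 11.0],
                    'type': ['white', 'red', 'white']})

full = pd.get_dummies(toy)                      # one indicator per category
reduced = pd.get_dummies(toy, drop_first=True)  # first category (alphabetically) dropped

print(list(full.columns))
print(list(reduced.columns))
```

Numeric columns pass through unchanged. Depending on the pandas version, the indicator values are stored as integers or booleans.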


## Split the data into training and test sets

In [3]:
from sklearn.model_selection import train_test_split

# Hold out 33% of the rows; stratify so both splits keep the same quality distribution
X = all_wines.drop('quality', axis=1)
y = all_wines['quality']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, stratify=y)

### Check the counts of each quality level in each split

In [4]:
print('Training:\n',y_train.value_counts().sort_index(),sep='')
print('\nTest:\n',y_test.value_counts().sort_index(),sep='')

Training:
3      20
4     145
5    1432
6    1900
7     723
8     129
9       3
Name: quality, dtype: int64

Test:
3     10
4     71
5    706
6    936
7    356
8     64
9      2
Name: quality, dtype: int64
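
The counts above show each quality level split in roughly a 2:1 ratio, which is what `stratify` is for. The sketch below reproduces the idea on a made-up, imbalanced target. The `random_state=0` argument is an addition for reproducibility and is not part of the original call.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy, imbalanced target: 180 rows of class 5, 90 of class 6, 30 of class 7
y = pd.Series([5] * 180 + [6] * 90 + [7] * 30, name='quality')
X = pd.DataFrame({'feature': range(len(y))})

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y,
    test_size=0.33,
    stratify=y,       # keep class proportions the same in both splits
    random_state=0,   # assumption: fixed seed so the split can be repeated
)

train_share = y_tr.value_counts(normalize=True).sort_index()
test_share = y_te.value_counts(normalize=True).sort_index()

print(train_share.round(2))
print(test_share.round(2))
```

Without `stratify`, a rare level such as quality 9 could end up entirely in one split.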


## Center and scale the data

In [5]:
from sklearn.preprocessing import StandardScaler

# Fit on the training set only so test-set statistics do not leak into the scaler
scaler = StandardScaler()
scaler.fit(X_train)

column_names = X_train.columns
X_train_scaled = pd.DataFrame(scaler.transform(X_train), columns=column_names)
X_test_scaled = pd.DataFrame(scaler.transform(X_test), columns=column_names)

# Restore the unscaled 0/1 indicator; resetting the index aligns it with the new 0..n-1 rows
X_train_scaled['type_white'] = X_train.reset_index(drop=True)['type_white']
X_test_scaled['type_white'] = X_test.reset_index(drop=True)['type_white']
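
The cell above scales every column and then restores the original `type_white` values. One possible alternative, sketched below on made-up frames, is to fit the scaler on the numeric columns only so the indicator is never touched. The `numeric_cols` name is introduced here for illustration and does not appear in the notebook.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy training and test frames (made-up values)
X_train = pd.DataFrame({'alcohol': [9.0, 10.0, 11.0, 12.0],
                        'pH': [3.0, 3.2, 3.4, 3.6],
                        'type_white': [1, 0, 1, 0]})
X_test = pd.DataFrame({'alcohol': [10.5, 11.5],
                       'pH': [3.1, 3.5],
                       'type_white': [0, 1]})

# Every column except the 0/1 indicator gets standardized
numeric_cols = [c for c in X_train.columns if c != 'type_white']

# Learn means and standard deviations from the training rows only
scaler = StandardScaler().fit(X_train[numeric_cols])

X_train_scaled = X_train.copy()
X_test_scaled = X_test.copy()
X_train_scaled[numeric_cols] = scaler.transform(X_train[numeric_cols])
X_test_scaled[numeric_cols] = scaler.transform(X_test[numeric_cols])

print(X_train_scaled.round(3))
print(X_test_scaled.round(3))
```

Because the frames are copied rather than rebuilt from arrays, the original row labels are kept and no index reset is needed.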

In [6]:
# The index is not written, so row order is the only link between the X and y files
X_train_scaled.to_csv('../data/X_train.csv', index=False)
X_test_scaled.to_csv('../data/X_test.csv', index=False)
y_train.to_csv('../data/y_train.csv', index=False)
y_test.to_csv('../data/y_test.csv', index=False)
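
When these files are read back later, the targets arrive as a one-column table rather than a Series. The sketch below shows one way the round trip might look, using an in-memory buffer in place of `'../data/y_train.csv'` and made-up values. `header=True` is passed explicitly so the behavior does not depend on the pandas version's default.

```python
import io

import pandas as pd

# Toy target with non-contiguous row labels, as produced by a shuffled split
y_train = pd.Series([6, 5, 7], index=[1183, 292, 3568], name='quality')

buffer = io.StringIO()  # in-memory stand-in for the CSV file
y_train.to_csv(buffer, index=False, header=True)
buffer.seek(0)

# read_csv returns a DataFrame; select the column to get a Series back
restored = pd.read_csv(buffer)['quality']

print(restored.tolist())
print(restored.index.tolist())
```

The original row labels are gone after the round trip, so features and targets match by position only.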