## Let's Choose and Transform the Features

In [25]:
# Let's use pandas to load the dataset!
import pandas as pd

data = pd.read_csv('cereal.csv')
data.head()

Unnamed: 0,name,mfr,type,calories,protein,fat,sodium,fiber,carbo,sugars,potass,vitamins,shelf,weight,cups,rating
0,100% Bran,N,C,70,4,1,130,10.0,5.0,6,280,25,3,1.0,0.33,68.402973
1,100% Natural Bran,Q,C,120,3,5,15,2.0,8.0,8,135,0,3,1.0,1.0,33.983679
2,All-Bran,K,C,70,4,1,260,9.0,7.0,5,320,25,3,1.0,0.33,59.425505
3,All-Bran with Extra Fiber,K,C,50,4,0,140,14.0,8.0,0,330,25,3,1.0,0.5,93.704912
4,Almond Delight,R,C,110,2,2,200,1.0,14.0,8,-1,25,3,1.0,0.75,34.384843


### Select Only the Important Features

In [26]:
# let's save the name column in a separate Series for the moment...
name = data.name

# remove the 'shelf' and 'name' columns from the dataset
# (columns.difference also returns the remaining columns in alphabetical order)
data = data[data.columns.difference(['shelf','name'])]
data.head()

Unnamed: 0,calories,carbo,cups,fat,fiber,mfr,potass,protein,rating,sodium,sugars,type,vitamins,weight
0,70,5.0,0.33,1,10.0,N,280,4,68.402973,130,6,C,25,1.0
1,120,8.0,1.0,5,2.0,Q,135,3,33.983679,15,8,C,0,1.0
2,70,7.0,0.33,1,9.0,K,320,4,59.425505,260,5,C,25,1.0
3,50,8.0,0.5,0,14.0,K,330,4,93.704912,140,0,C,25,1.0
4,110,14.0,0.75,2,1.0,R,-1,2,34.384843,200,8,C,25,1.0
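
The column order in the output above differs from the original file because `columns.difference` returns the remaining labels sorted. The sketch below contrasts it with `DataFrame.drop`, which keeps the original order. The `toy` frame and its values are hypothetical; only the column names mirror the cereal dataset.

```python
import pandas as pd

# Hypothetical toy frame; column names mirror the cereal dataset.
toy = pd.DataFrame({
    "name": ["A", "B"],
    "mfr": ["N", "K"],
    "calories": [70, 120],
    "shelf": [3, 1],
    "fat": [1, 5],
})

# Index.difference returns the remaining labels in sorted order.
by_difference = toy[toy.columns.difference(["shelf", "name"])]
print(list(by_difference.columns))  # ['calories', 'fat', 'mfr']

# DataFrame.drop keeps the original left-to-right order.
by_drop = toy.drop(columns=["shelf", "name"])
print(list(by_drop.columns))  # ['mfr', 'calories', 'fat']
```

Either approach removes the same columns; the choice only matters if later steps depend on column position.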


### Dummy Variables (One Hot Encoding)

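Categorical columns such as mfr and type must be converted into multiple columns with binary values (one-hot encoding). With `drop_first=True`, the first category of each variable is dropped and is represented by zeros in all of its remaining columns (for example, type C becomes `type_H = 0`).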

In [27]:
# convert mfr and type to one-hot encodings with pandas!
data = pd.get_dummies(data,prefix=['mfr'], columns = ['mfr'] , drop_first=True)
data = pd.get_dummies(data,prefix=['type'], columns = ['type'] , drop_first=True)
data.head()

Unnamed: 0,calories,carbo,cups,fat,fiber,potass,protein,rating,sodium,sugars,vitamins,weight,mfr_G,mfr_K,mfr_N,mfr_P,mfr_Q,mfr_R,type_H
0,70,5.0,0.33,1,10.0,280,4,68.402973,130,6,25,1.0,0,0,1,0,0,0,0
1,120,8.0,1.0,5,2.0,135,3,33.983679,15,8,0,1.0,0,0,0,0,1,0,0
2,70,7.0,0.33,1,9.0,320,4,59.425505,260,5,25,1.0,0,1,0,0,0,0,0
3,50,8.0,0.5,0,14.0,330,4,93.704912,140,0,25,1.0,0,1,0,0,0,0,0
4,110,14.0,0.75,2,1.0,-1,2,34.384843,200,8,25,1.0,0,0,0,0,0,1,0
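
To make the effect of `drop_first=True` concrete, here is a minimal sketch on a hypothetical toy frame (the values are made up for illustration). `dtype=int` is passed as an assumption so the result is 0/1 regardless of pandas version, since recent pandas releases return booleans by default.

```python
import pandas as pd

# Hypothetical toy frame with three manufacturer codes.
toy = pd.DataFrame({
    "mfr": ["A", "G", "K", "G"],
    "calories": [70, 110, 120, 100],
})

# One column per category.
full = pd.get_dummies(toy, columns=["mfr"], dtype=int)
print(list(full.columns))  # ['calories', 'mfr_A', 'mfr_G', 'mfr_K']

# drop_first=True removes the first category (mfr_A here).
reduced = pd.get_dummies(toy, columns=["mfr"], drop_first=True, dtype=int)
print(list(reduced.columns))  # ['calories', 'mfr_G', 'mfr_K']

# The dropped category is encoded as zeros in the remaining columns.
print(reduced.loc[0, ["mfr_G", "mfr_K"]].tolist())  # [0, 0]
```

Dropping one category avoids a redundant column, because its value can always be inferred from the others.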


### Data Normalization

Many ML algorithms, especially distance-based ones, work best when all features share a comparable scale. Let's put all numerical data in a range between 0 and 1.

The sklearn package provides the Min-Max Scaler (`MinMaxScaler`) for this.

It applies the following formula to each column:

$x_{\text{scaled}}=\frac{x-\min(X)}{\max(X)-\min(X)}$

More examples: <a href="https://github.com/drzamoramora/AI01-MachineLearning/blob/master/Semana%202%20-%20Regresion%20Lineal%20Multiple/2-Normalizacion-Transformacion.ipynb">here</a>

In [30]:
from sklearn.preprocessing import MinMaxScaler

data[data.columns] = MinMaxScaler().fit_transform(data)
data.head()

Unnamed: 0,calories,carbo,cups,fat,fiber,potass,protein,rating,sodium,sugars,vitamins,weight,mfr_G,mfr_K,mfr_N,mfr_P,mfr_Q,mfr_R,type_H
0,0.181818,0.25,0.064,0.2,0.714286,0.848943,0.6,0.665593,0.40625,0.4375,0.25,0.5,0.0,0.0,1.0,0.0,0.0,0.0,0.0
1,0.636364,0.375,0.6,1.0,0.142857,0.410876,0.4,0.210685,0.046875,0.5625,0.0,0.5,0.0,0.0,0.0,0.0,1.0,0.0,0.0
2,0.181818,0.333333,0.064,0.2,0.642857,0.969789,0.6,0.546941,0.8125,0.375,0.25,0.5,0.0,1.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.375,0.2,0.0,1.0,1.0,0.6,1.0,0.4375,0.0625,0.25,0.5,0.0,1.0,0.0,0.0,0.0,0.0,0.0
4,0.545455,0.625,0.4,0.4,0.071429,0.0,0.2,0.215987,0.625,0.5625,0.25,0.5,0.0,0.0,0.0,0.0,0.0,1.0,0.0
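
As a sanity check, the min-max formula can be applied by hand with plain pandas. In this sketch the values 70, 120, and 50 are calorie values from the first rows of the dataset, while 160 is an assumed column maximum, chosen because it is consistent with the scaled outputs shown above.

```python
import pandas as pd

# 70, 120 and 50 come from the first rows of the dataset;
# 160 is an assumed maximum for the column (illustration only).
toy = pd.DataFrame({"calories": [70.0, 120.0, 50.0, 160.0]})

# Apply x_scaled = (x - min) / (max - min) to each column.
scaled = (toy - toy.min()) / (toy.max() - toy.min())
print(scaled["calories"].round(6).tolist())
# [0.181818, 0.636364, 0.0, 1.0]
```

This should match what `MinMaxScaler` computes per column with its default range of 0 to 1. One difference to keep in mind: the manual formula divides by zero for a constant column, so it is only a sketch and not a replacement for the scaler.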


### Save Data to New CSV

In [36]:
# restore the name column saved earlier
data['name'] = name
data.to_csv("cereal_norm.csv", index = False)

In [34]:
data.head()

Unnamed: 0,calories,carbo,cups,fat,fiber,potass,protein,rating,sodium,sugars,vitamins,weight,mfr_G,mfr_K,mfr_N,mfr_P,mfr_Q,mfr_R,type_H,name
0,0.181818,0.25,0.064,0.2,0.714286,0.848943,0.6,0.665593,0.40625,0.4375,0.25,0.5,0.0,0.0,1.0,0.0,0.0,0.0,0.0,100% Bran
1,0.636364,0.375,0.6,1.0,0.142857,0.410876,0.4,0.210685,0.046875,0.5625,0.0,0.5,0.0,0.0,0.0,0.0,1.0,0.0,0.0,100% Natural Bran
2,0.181818,0.333333,0.064,0.2,0.642857,0.969789,0.6,0.546941,0.8125,0.375,0.25,0.5,0.0,1.0,0.0,0.0,0.0,0.0,0.0,All-Bran
3,0.0,0.375,0.2,0.0,1.0,1.0,0.6,1.0,0.4375,0.0625,0.25,0.5,0.0,1.0,0.0,0.0,0.0,0.0,0.0,All-Bran with Extra Fiber
4,0.545455,0.625,0.4,0.4,0.071429,0.0,0.2,0.215987,0.625,0.5625,0.25,0.5,0.0,0.0,0.0,0.0,0.0,1.0,0.0,Almond Delight
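
The `index = False` argument matters when the file is read back later. The sketch below, which uses a hypothetical two-row frame and an in-memory buffer instead of a file on disk, shows what a round trip looks like with and without the index.

```python
import io
import pandas as pd

# Hypothetical two-row frame standing in for the normalized data.
toy = pd.DataFrame({
    "calories": [0.18, 0.64],
    "name": ["100% Bran", "100% Natural Bran"],
})

# index=False: only the data columns are written.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without = list(pd.read_csv(without_index).columns)
print(cols_without)  # ['calories', 'name']

# Default (index=True): the row index is written as an unnamed
# first column and comes back as 'Unnamed: 0'.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with = list(pd.read_csv(with_index).columns)
print(cols_with)  # ['Unnamed: 0', 'calories', 'name']
```

Writing without the index keeps the saved file limited to the feature columns plus `name`.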
