**Dealing with categorical features**

Scikit-learn estimators do not accept categorical (string) features by default:

* Need to encode categorical features numerically
* Convert to 'dummy variables'

In [None]:
# import required modules
import pandas as pd
import numpy as np
 
# create dataset
df = pd.DataFrame({'Temperature': ['Hot', 'Cold', 'Warm', 'Cold','Hot', 'Cold', 'Warm', 'Cold'],
                   })

# There are three different classes and eight rows
# display dataset
display(df)

# get_dummies orders the new columns alphabetically: Cold, Hot, Warm
# create dummy variables (dtype=int gives 0/1 instead of True/False in recent pandas)
pd.get_dummies(df, dtype=int)

Unnamed: 0,Temperature
0,Hot
1,Cold
2,Warm
3,Cold
4,Hot
5,Cold
6,Warm
7,Cold


Unnamed: 0,Temperature_Cold,Temperature_Hot,Temperature_Warm
0,0,1,0
1,1,0,0
2,0,0,1
3,1,0,0
4,0,1,0
5,1,0,0
6,0,0,1
7,1,0,0


<font color='blue'> Now we can pass this data to scikit-learn functions </font>.
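
As an alternative to `pd.get_dummies`, scikit-learn ships its own encoder, which can later be reused on new data. The cell below is a minimal sketch, assuming the same toy `Temperature` column as above; it is one possible way to do the encoding, not the only one.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Same toy dataset as above
df = pd.DataFrame({'Temperature': ['Hot', 'Cold', 'Warm', 'Cold',
                                   'Hot', 'Cold', 'Warm', 'Cold']})

# Fit the encoder; the learned categories are sorted: Cold, Hot, Warm
encoder = OneHotEncoder()
# The default output is a sparse matrix, so convert it to a dense array
encoded = encoder.fit_transform(df[['Temperature']]).toarray()

print(encoder.categories_)  # categories found in the column
print(encoded[:3])          # one-hot rows for Hot, Cold, Warm
```

The fitted `encoder` can then be applied to unseen rows with `encoder.transform`, which keeps the column order consistent between training and test data.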

**Handling missing data**

We will work with the diabetes dataset, which you can download from Kaggle [here](https://www.kaggle.com/saurabh00007/diabetescsv)

<font color='red'> I modified the dataset and removed some values, so it now contains missing entries (NaN) </font>. 

In [None]:
from google.colab import files
uploaded = files.upload()
import pandas as pd
df = pd.read_csv("diabetes.csv")

Saving diabetes.csv to diabetes.csv


In [None]:
df.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,,148,72,35,,33.6,0.627,50,1
1,1.0,85,66,29,,26.6,0.351,31,0
2,8.0,183,64,0,0.0,23.3,0.672,32,1
3,1.0,89,66,23,94.0,28.1,0.167,21,0
4,0.0,137,40,35,168.0,43.1,2.288,33,1


In [None]:
df.shape

(768, 9)

In [None]:
df.Outcome

0      1
1      0
2      1
3      0
4      1
      ..
763    0
764    0
765    0
766    1
767    0
Name: Outcome, Length: 768, dtype: int64

In [None]:
# total number of NaN values in the DataFrame
df.isnull().sum().sum()

3

In [None]:
# are there NaN values in the Insulin column?
df['Insulin'].isnull().values.any()

True

In [None]:
# how many?
df['Insulin'].isnull().sum()

2

In [None]:
# replace NaN by 0 in this column (assign back instead of using inplace on a column)
df['Insulin'] = df['Insulin'].fillna(0)

In [None]:
df.Insulin[0:5]

0      0.0
1      0.0
2      0.0
3     94.0
4    168.0
Name: Insulin, dtype: float64

In [None]:
# replace NaN by 0 in the whole DataFrame
df.replace(np.nan, 0, inplace=True)

In [None]:
# let's check
df.isnull().sum().sum()  # no NaN values left

0

In [None]:
# put a NaN value back in (use .loc to avoid chained assignment)
df.loc[0, 'Insulin'] = np.nan


In [None]:
#check
df['Insulin']

0        NaN
1        0.0
2        0.0
3       94.0
4      168.0
       ...  
763    180.0
764      0.0
765    112.0
766      0.0
767      0.0
Name: Insulin, Length: 768, dtype: float64

In [None]:
df.shape

(768, 9)

In [None]:
# keep only the rows without NaN values
df = df.dropna()
df.shape


(767, 9)

<font color='red'> There are many ways to replace NaN values, for instance with the column mean or median </font>. 
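
To illustrate replacing NaN values with a mean or median, here is a minimal sketch on a small made-up column (the values in `toy` are hypothetical, not taken from the diabetes dataset). It shows two common routes: `fillna` in pandas and `SimpleImputer` in scikit-learn.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy column with two missing values
toy = pd.DataFrame({'Insulin': [np.nan, 94.0, 168.0, np.nan, 88.0]})

# Option 1: pandas, fill the gaps with the column median
filled_median = toy['Insulin'].fillna(toy['Insulin'].median())

# Option 2: scikit-learn, fill the gaps with the column mean
imputer = SimpleImputer(strategy='mean')
filled_mean = imputer.fit_transform(toy[['Insulin']])

print(filled_median.tolist())
print(filled_mean.ravel())
```

The scikit-learn imputer remembers the statistic it learned in `fit`, so the same value can be applied to test data later, for example as a step in a pipeline.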



---

# Centering and scaling data

![](https://drive.google.com/uc?export=view&id=1lnIEwLbElkma7P1Y7iOSeksBpUy7XVOd)


**Why scale your data?**

* Many models rely on some form of distance between data points
* Features on larger scales can unduly influence the model
* Example: k-NN uses distance explicitly when making predictions
* We want features to be on a similar scale
* The remedy: normalizing (scaling and centering) the features
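
The points above can be made concrete with a small sketch. The data below is hypothetical (ages in years, incomes in dollars) and the helper `income_share` is introduced only for illustration. It measures how much of the squared Euclidean distance between two people comes from the income column, before and after scaling.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: columns are [age in years, income in dollars]
X = np.array([[25., 50000.],
              [65., 50100.],
              [26., 51000.],
              [40., 48000.]])

def income_share(a, b):
    """Fraction of the squared Euclidean distance between a and b due to income."""
    sq_diff = (a - b) ** 2
    return sq_diff[1] / sq_diff.sum()

# Persons 0 and 1 differ by 40 years of age but only 100 dollars of income
raw_share = income_share(X[0], X[1])

# Standardize both columns, then measure the same share again
X_scaled = StandardScaler().fit_transform(X)
scaled_share = income_share(X_scaled[0], X_scaled[1])

print(f"income share of squared distance, raw:    {raw_share:.3f}")
print(f"income share of squared distance, scaled: {scaled_share:.3f}")
```

On the raw data the income column dominates the distance simply because its numbers are larger; after scaling, the large age gap accounts for most of it.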


**Standard way to normalize your data**

* Subtract the mean and divide by standard deviation
* All features are centered around zero and have variance one
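
The recipe above can be written directly with NumPy. This is a minimal sketch that standardizes a small array by hand and compares the result with `StandardScaler`; it assumes the population standard deviation (`ddof=0`), which is NumPy's default for `std`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1., -1.,  2.],
              [2.,  0.,  0.],
              [0.,  1., -1.]])

# Subtract the column mean and divide by the column standard deviation
mean = X.mean(axis=0)
std = X.std(axis=0)  # population standard deviation (ddof=0)
X_manual = (X - mean) / std

# The same transformation with scikit-learn
X_sklearn = StandardScaler().fit_transform(X)

print(np.allclose(X_manual, X_sklearn))
```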

**Scaling in a pipeline**

In [1]:
from sklearn import preprocessing
import numpy as np
X_train = np.array([[ 1., -1.,  2.],
                    [ 2.,  0.,  0.],
                    [ 0.,  1., -1.]])
scaler = preprocessing.StandardScaler().fit(X_train)
X_scaled = scaler.transform(X_train)


In [2]:
# check: each column should have mean 0 and standard deviation 1
print(X_scaled.mean(axis=0))
print(X_scaled.std(axis=0))


[0. 0. 0.]
[1. 1. 1.]


In [4]:
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# the scaler is fitted on the training data only, then the model is trained on the scaled data
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_train, y_train)

# the test data is scaled with the training statistics, so no test information leaks into the fit
pipe.score(X_test, y_test)

0.96