In [1]:
%pdb
%load_ext autotime

Automatic pdb calling has been turned ON


# Preprocessing data 

1. Handling NaN values
2. Transforming categorical data
3. Splitting into train and test sets
4. Normalization and standardization
5. Feature selection

## Working with NaN 

1. Deleting NaN
2. Imputing NaN

### How to find and delete NaN

In [2]:
import pandas as pd
import numpy as np
from io import StringIO

time: 521 ms


In [3]:
csv_data = '''A,B,C,D
              1.0,2.0,3.0,4.0
              5.0,6.0,,8.0
              10.0,11.0,12.0'''

time: 309 µs


In [4]:
df = pd.read_csv(StringIO(csv_data))
df

Unnamed: 0,A,B,C,D
0,1.0,2.0,3.0,4.0
1,5.0,6.0,,8.0
2,10.0,11.0,12.0,


time: 16.6 ms


In [9]:
df.isnull() # Return a df of True/False values marking where the NaNs are
df.isnull().sum() # Count the NaNs in each column

A    0
B    0
C    1
D    1
dtype: int64

time: 3.27 ms


In [6]:
df.to_numpy() # Convert the df into a numpy array (useful for sklearn, which works on numpy arrays)

array([[ 1.,  2.,  3.,  4.],
       [ 5.,  6., nan,  8.],
       [10., 11., 12., nan]])

time: 2.48 ms


In [7]:
df.dropna() # Drop all rows with NaNs
df.dropna(axis=1) # Drop all columns with NaNs
df.dropna(how='all') # Drop only the rows where every column is NaN
df.dropna(thresh=4) # Drop rows that have fewer than 4 non-NaN values
df.dropna(subset=['C']) # Drop rows that have NaN in column C
df.dropna(subset=['C','D']) # Drop rows that have NaN in column C or D

Unnamed: 0,A,B,C,D
0,1.0,2.0,3.0,4.0


time: 16.1 ms
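
Only the result of the last expression in the cell above is displayed. As a rough sketch, the shape of each variant can be compared side by side; the CSV string is rebuilt here so the block runs on its own, and the dictionary keys are just illustrative labels.

```python
from io import StringIO

import pandas as pd

# Same toy data as above: one NaN in column C, one in column D
csv_data = "A,B,C,D\n1.0,2.0,3.0,4.0\n5.0,6.0,,8.0\n10.0,11.0,12.0"
df_nan = pd.read_csv(StringIO(csv_data))

# Each dropna variant applied to the same frame
variants = {
    "rows with any NaN": df_nan.dropna(),
    "columns with any NaN": df_nan.dropna(axis=1),
    "rows that are all NaN": df_nan.dropna(how="all"),
    "rows with < 4 values": df_nan.dropna(thresh=4),
    "rows with NaN in C": df_nan.dropna(subset=["C"]),
    "rows with NaN in C or D": df_nan.dropna(subset=["C", "D"]),
}

# (rows, columns) that remain after each call
shapes = {name: frame.shape for name, frame in variants.items()}
for name, shape in shapes.items():
    print(f"{name}: {shape}")
```

None of these calls modifies `df_nan` itself; each returns a new DataFrame.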


### Imputing NaN values

In [12]:
from sklearn.impute import SimpleImputer
# docs https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html

time: 404 µs


In [11]:
imr = SimpleImputer(missing_values=np.nan, strategy='mean')
imr = imr.fit(df.to_numpy()) # Learn the mean of each column
imputed_data = imr.transform(df.to_numpy()) # Replace each NaN with its column mean
imputed_data

array([[ 1. ,  2. ,  3. ,  4. ],
       [ 5. ,  6. ,  7.5,  8. ],
       [10. , 11. , 12. ,  6. ]])

time: 6.74 ms
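
As a cross-check, the same mean imputation can be sketched in pandas alone with `fillna`. The frame `df_demo` below is a hypothetical copy of the toy data, rebuilt so the block is self-contained.

```python
import numpy as np
import pandas as pd

# Hypothetical copy of the toy data with the same two NaNs
df_demo = pd.DataFrame({
    "A": [1.0, 5.0, 10.0],
    "B": [2.0, 6.0, 11.0],
    "C": [3.0, np.nan, 12.0],
    "D": [4.0, 8.0, np.nan],
})

# df_demo.mean() skips NaN by default, so each NaN gets the mean
# of the observed values in its own column
filled = df_demo.fillna(df_demo.mean())
print(filled)
```

The filled values (7.5 in column C, 6.0 in column D) match the `SimpleImputer` output above.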


### Some notes on how sklearn works
Most sklearn classes are built around 2 basic methods:

1. Fit
2. Transform

**Fit** computes or learns parameters from the data:
- fit a model on the training sample to learn its weights
- calculate statistics from the data to use them later
- ...

**Transform** applies the learned parameters to change the data:
- fill in NaN values using the statistics computed in fit
- ...

![alt](fit_transform_workflow.png)
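
A minimal sketch of this workflow with `SimpleImputer`: the statistics are learned from the training data only and then reused on new data. The arrays `X_train_demo` and `X_new_demo` are made-up values for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up arrays, used only to show the fit/transform split
X_train_demo = np.array([[1.0, 2.0],
                         [3.0, np.nan],
                         [5.0, 6.0]])
X_new_demo = np.array([[np.nan, 10.0],
                       [7.0, np.nan]])

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")

# fit: learn the column means from the training data only
imputer.fit(X_train_demo)
learned_means = imputer.statistics_

# transform: apply the learned means, both to the training data and to new data
X_train_filled = imputer.transform(X_train_demo)
X_new_filled = imputer.transform(X_new_demo)

print(learned_means)
print(X_new_filled)
```

The NaNs in `X_new_demo` are filled with the training means (3.0 and 4.0), not with statistics of `X_new_demo` itself. `fit_transform` is a shortcut that does both steps on the same data.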

## Transforming categorical data

In [5]:
df = pd.DataFrame([ 
                ['green', 'M', 10.1, 'class1'],
                ['red', 'L', 13.5, 'class2'],
                ['blue', 'XL', 15.3, 'class1']])
df.columns = ['color', 'size', 'price', 'label']
df

Unnamed: 0,color,size,price,label
0,green,M,10.1,class1
1,red,L,13.5,class2
2,blue,XL,15.3,class1


time: 8.81 ms


In [6]:
size_mapping = {
    'XL': 3,
    'L': 2,
    'M': 1,
}
inv_size_mapping = {v: k for k, v in size_mapping.items()}
df['size'] = df['size'].map(size_mapping)
df
# This is easy because the sizes are ordered: XL = L + 1 = M + 2

Unnamed: 0,color,size,price,label
0,green,1,10.1,class1
1,red,2,13.5,class2
2,blue,3,15.3,class1


time: 8.28 ms
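
The inverse mapping is defined above but not used. As a small sketch, it can turn the integer codes back into the original size labels; the Series `sizes` is a stand-in for `df['size']` so the block runs on its own.

```python
import pandas as pd

size_mapping = {"XL": 3, "L": 2, "M": 1}
# Swap keys and values: integer code -> size label
inv_size_mapping = {v: k for k, v in size_mapping.items()}

# Stand-in for df['size'] before the mapping was applied
sizes = pd.Series(["M", "L", "XL"])

encoded = sizes.map(size_mapping)        # labels -> integers
decoded = encoded.map(inv_size_mapping)  # integers -> labels

print(encoded.tolist())
print(decoded.tolist())
```

Any value that is missing from the mapping dictionary becomes NaN after `map`, so it is worth checking the result with `isnull()`.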


In [7]:
from sklearn.preprocessing import LabelEncoder
class_le = LabelEncoder()
y = class_le.fit_transform(df['label'].to_numpy())
y

array([0, 1, 0])

time: 887 ms
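
`LabelEncoder` keeps the learned classes, so the integer codes can be mapped back to the original labels. A short sketch, with the label array typed in by hand instead of read from `df`:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Same labels as in the df above, typed in by hand
labels = np.array(["class1", "class2", "class1"])

le = LabelEncoder()
codes = le.fit_transform(labels)

# classes_ holds the sorted unique labels; the position is the integer code
known_classes = le.classes_.tolist()

# inverse_transform maps the integer codes back to the labels
restored = le.inverse_transform(codes)

print(known_classes)
print(restored)
```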


### Categorical data

Sizes have a natural order, but colors do not. So if we encode the colors as integers, for example:

- blue = 1
- red = 2
- green = 3

then the algorithm will assume that green is greater than blue. That is not true: the colors have no order, and none of them should weigh more than another. Such an encoding can lead to worse predictions.

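#### One-hot encoding
One-hot encoding replaces a nominal feature with one binary column per category. Each row has 1 in the column of its own category and 0 in all the others, so no order between the categories is implied.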
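
To make the idea concrete, here is a hand-rolled sketch of one-hot encoding in plain numpy. It only illustrates the idea and is not how sklearn implements it; the array `colors` repeats the color column from the df above.

```python
import numpy as np

# The color column from the df above
colors = np.array(["green", "red", "blue"])

# categories: sorted unique values; codes: index of each row's category
categories, codes = np.unique(colors, return_inverse=True)

# Row i of the identity matrix is the one-hot vector for category i
one_hot = np.eye(len(categories), dtype=int)[codes]

print(categories)
print(one_hot)
```

The columns follow the sorted category order (blue, green, red), so the first row, green, is encoded as `[0, 1, 0]`.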

In [24]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
X = df[['color', 'size', 'price']].to_numpy()
ct = ColumnTransformer([('Color', OneHotEncoder(), [0])], remainder='passthrough')
X = ct.fit_transform(X)
X

array([[0.0, 1.0, 0.0, 1, 10.1],
       [0.0, 0.0, 1.0, 2, 13.5],
       [1.0, 0.0, 0.0, 3, 15.3]], dtype=object)

time: 5.34 ms


In [8]:
# The simplest way to do one-hot encoding
pd.get_dummies(df[['color', 'size', 'price']])

Unnamed: 0,size,price,color_blue,color_green,color_red
0,1,10.1,0,1,0
1,2,13.5,0,0,1
2,3,15.3,1,0,0


time: 13.7 ms


In [15]:
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data' 
df_wine = pd.read_csv(url, header=None)
df_wine.columns = ['class_label', 'alcohol', 'apple_acid', 'ash', 
                   'ash_alkalinity', 'magnesium', 'total_phenol', 
                   'flavanoids', 'non-flavanoid phenols', 'proanthocyanins', 
                   'color_intensity', 'tint','00280/00315_dilute_ wines', 'proline'] 
print(f'Class labels {np.unique(df_wine.class_label)}')

Class labels [1 2 3]
time: 786 ms


In [16]:
# Creating test and train set via sklearn
from sklearn.model_selection import train_test_split
X, y = df_wine.iloc[:, 1:].to_numpy(), df_wine.iloc[:, 0].to_numpy()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

time: 60.8 ms


### Scaling the features 

Normalization rescales a feature's values into the [0, 1] interval:

$x_{norm}^{(i)} = \frac{x^{(i)}-x_{min}}{x_{max} - x_{min}}$

In [21]:
from sklearn.preprocessing import MinMaxScaler
mms = MinMaxScaler()
X_train_norm = mms.fit_transform(X_train)
X_test_norm = mms.transform(X_test)

time: 1.37 ms
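
The formula can be checked by hand on a small example. The sketch below applies it column by column in plain numpy, taking the minimum and maximum from the training data only, which mirrors what `fit_transform` followed by `transform` does above. The arrays are made-up values.

```python
import numpy as np

# Made-up data: 3 training samples and 1 test sample, 2 features each
X_train_demo = np.array([[1.0, 200.0],
                         [2.0, 400.0],
                         [3.0, 600.0]])
X_test_demo = np.array([[2.5, 100.0]])

# "fit": per-column minimum and maximum of the training data
x_min = X_train_demo.min(axis=0)
x_max = X_train_demo.max(axis=0)

# "transform": apply the same formula to train and test
X_train_scaled = (X_train_demo - x_min) / (x_max - x_min)
X_test_scaled = (X_test_demo - x_min) / (x_max - x_min)

print(X_train_scaled)
print(X_test_scaled)
```

The training columns land exactly in [0, 1]. Test values can fall outside that interval when they lie beyond the training minimum or maximum, as the second test feature does here (-0.25).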
