# Python Tutorials

### Data transformation - numeric
(feature engineering)

Solvertank Digital Science   
[http://www.solvertank.com](http://www.solvertank.com)   

## Load data

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
# read the 'datavis' sheet into a DataFrame
df = pd.read_excel('datavis.xlsx', sheet_name='datavis')

In [3]:
df.head(5)

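| | status | bmi | bp | sex | category | region |
|---|---|---|---|---|---|---|
| 0 | -0.107226 | NaN | -0.040099 | M | White | 4 |
| 1 | NaN | -0.055785 | 0.025315 | F | Blue | 3 |
| 2 | 0.012648 | 0.000261 | -0.011409 | F | Silver | 5 |
| 3 | -0.052738 | -0.018062 | 0.080401 | F | NaN | 4 |
| 4 | -0.009147 | 0.001339 | -0.002228 | F | Silver | 1 |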


### Null data

In [4]:
# count null values per column
df.isnull().sum()

status       29
bmi          75
bp            0
sex          25
category    153
region        0
dtype: int64

In [5]:
# list columns with null data
df.columns[df.isna().any()].tolist()

['status', 'bmi', 'sex', 'category']
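
The two checks above can be combined into one summary that also shows the share of missing values per column. This is a minimal sketch on a made-up toy frame, not the tutorial's `datavis.xlsx`; the names `toy` and `null_summary` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the tutorial's data (values are made up).
toy = pd.DataFrame({
    "bmi": [0.01, np.nan, -0.02, np.nan],
    "bp": [0.03, 0.02, -0.01, 0.05],
    "sex": ["M", None, "F", "F"],
})

# Count and percentage of missing values per column.
null_summary = pd.DataFrame({
    "n_null": toy.isna().sum(),
    "pct_null": toy.isna().mean() * 100,
})

# Keep only the columns that actually contain nulls.
null_summary = null_summary[null_summary["n_null"] > 0]
print(null_summary)
```

Seeing the percentage next to the count can help decide whether dropping rows or imputing values is the more reasonable choice for a column.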

In [6]:
# list rows where bmi is null
df[df['bmi'].isnull()]

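| | status | bmi | bp | sex | category | region |
|---|---|---|---|---|---|---|
| 0 | -0.107226 | NaN | -0.040099 | M | White | 4 |
| 8 | -0.020045 | NaN | -0.005671 | M | NaN | 1 |
| 17 | 0.016281 | NaN | -0.043542 | F | Silver | 4 |
| 19 | -0.103593 | NaN | -0.026328 | F | White | 1 |
| 20 | -0.005515 | NaN | 0.049415 | NaN | Silver | 2 |
| 24 | -0.001882 | NaN | -0.026328 | M | Silver | 3 |
| 32 | 0.027178 | NaN | 0.028758 | F | NaN | 3 |
| 38 | -0.023677 | NaN | -0.064199 | M | NaN | 4 |
| 61 | 0.041708 | NaN | 0.052858 | F | Gold | 2 |
| 64 | 0.048974 | NaN | -0.053871 | M | NaN | 2 |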


In [7]:
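# option 1: remove rows where bmi is null
# (.copy() avoids SettingWithCopyWarning on later column assignments)
df = df[df['bmi'].notnull()].copy()

In [8]:
# option 2: replace null bmi values with zero
# (run instead of option 1; it has no effect once the nulls are gone)
df['bmi'] = df['bmi'].fillna(0)

In [9]:
# option 3: replace null bmi values with the column mean
df['bmi'] = df['bmi'].fillna(df['bmi'].mean())

In [10]:
# option 4: replace null bmi values with the column median
df['bmi'] = df['bmi'].fillna(df['bmi'].median())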
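
The null-handling cells above are alternatives: once one of them has run, `bmi` has no nulls left for the others to act on. The sketch below applies each strategy to the same untouched column so the results can be compared side by side. The toy values and variable names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Made-up column with two missing values.
toy = pd.DataFrame({"bmi": [1.0, np.nan, 3.0, 8.0, np.nan]})

# Each strategy starts from the original column, so none affects the others.
dropped = toy.dropna(subset=["bmi"]).copy()              # remove rows
zero_filled = toy["bmi"].fillna(0)                       # constant fill
mean_filled = toy["bmi"].fillna(toy["bmi"].mean())       # mean of non-null values
median_filled = toy["bmi"].fillna(toy["bmi"].median())   # median of non-null values

comparison = pd.DataFrame({
    "original": toy["bmi"],
    "zero": zero_filled,
    "mean": mean_filled,
    "median": median_filled,
})
print(comparison)
```

On skewed data the mean and median fills can differ noticeably (here 4.0 versus 3.0), which is one reason the median is often preferred when outliers are present.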

### Outliers

In [11]:
# drop rows more than 3 standard deviations from the mean
upper_lim = df['bmi'].mean() + df['bmi'].std() * 3
lower_lim = df['bmi'].mean() - df['bmi'].std() * 3
df = df[(df['bmi'] < upper_lim) & (df['bmi'] > lower_lim)].copy()
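
The same 3-standard-deviation rule can be written as a reusable function based on z-scores. This is a sketch under assumptions: `drop_beyond_sigma` is a hypothetical helper name, and the toy data is made up so that exactly one value falls outside the limits.

```python
import pandas as pd


def drop_beyond_sigma(frame, column, n_sigma=3.0):
    """Return rows whose `column` lies within n_sigma standard deviations of the mean.

    Hypothetical helper for illustration. Rows with a null value in `column`
    are dropped too, because comparisons with NaN evaluate to False.
    """
    mean = frame[column].mean()
    std = frame[column].std()  # sample standard deviation (ddof=1), the pandas default
    z_scores = (frame[column] - mean) / std
    return frame[z_scores.abs() < n_sigma].copy()


# 20 ordinary values plus one extreme value.
toy = pd.DataFrame({"bmi": [0.9, 1.1] * 10 + [100.0]})
kept = drop_beyond_sigma(toy, "bmi")
print(len(toy), "->", len(kept))
```

Note that the mean and standard deviation are themselves pulled toward extreme values, so in very small samples a single outlier may never exceed 3 standard deviations.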

In [12]:
# drop rows outside the 5th-95th percentile range
upper_lim = df['bmi'].quantile(.95)
lower_lim = df['bmi'].quantile(.05)
df = df[(df['bmi'] < upper_lim) & (df['bmi'] > lower_lim)].copy()

In [13]:
# cap outliers at the 5th and 95th percentiles instead of dropping rows
upper_lim = df['bmi'].quantile(.95)
lower_lim = df['bmi'].quantile(.05)
df.loc[(df['bmi'] > upper_lim),'bmi'] = upper_lim
df.loc[(df['bmi'] < lower_lim),'bmi'] = lower_lim
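
Capping can also be expressed with `Series.clip`, which applies both limits in one call and leaves the row count unchanged. A minimal sketch on made-up data follows; the column name `bmi_capped` is an assumption, chosen so the original values stay available for comparison.

```python
import pandas as pd

# Made-up values 1.0 .. 101.0.
toy = pd.DataFrame({"bmi": [float(v) for v in range(1, 102)]})

# Percentile limits computed once, on the unmodified column.
lower_lim = toy["bmi"].quantile(0.05)
upper_lim = toy["bmi"].quantile(0.95)

# Values below lower_lim become lower_lim; values above upper_lim become upper_lim.
toy["bmi_capped"] = toy["bmi"].clip(lower=lower_lim, upper=upper_lim)

print(toy["bmi"].agg(["min", "max"]))
print(toy["bmi_capped"].agg(["min", "max"]))
```

Unlike dropping rows, capping keeps every observation, which matters when the other columns of an outlier row still carry useful information.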

### Normalizing and standardizing

In [14]:
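# remove rows where bmi is null before scaling
# (.copy() avoids SettingWithCopyWarning when adding the new columns below)
df = df[df['bmi'].notnull()].copy()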

In [15]:
# min-max normalization: rescale bmi to the [0, 1] range
df['bmi_normalized'] = (df['bmi'] - df['bmi'].min()) / (df['bmi'].max() - df['bmi'].min())

In [16]:
# standardization (z-score): rescale bmi to zero mean and unit standard deviation
df['bmi_standardized'] = (df['bmi'] - df['bmi'].mean()) / df['bmi'].std()
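
To check what the two formulas produce, the sketch below applies them to a small made-up column and inspects the resulting ranges. The toy values are assumptions chosen to make the arithmetic easy to follow.

```python
import pandas as pd

toy = pd.DataFrame({"bmi": [2.0, 4.0, 6.0, 8.0]})
col = toy["bmi"]

# Min-max normalization: the minimum maps to 0 and the maximum to 1.
toy["bmi_normalized"] = (col - col.min()) / (col.max() - col.min())

# Standardization: subtract the mean, divide by the sample standard deviation.
toy["bmi_standardized"] = (col - col.mean()) / col.std()

print(toy)
print(toy[["bmi_normalized", "bmi_standardized"]].agg(["min", "max", "mean", "std"]))
```

Both formulas divide by a spread measure, so a constant column (max equal to min, or zero standard deviation) yields NaN and should be handled separately.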

### References

https://towardsdatascience.com/feature-engineering-for-machine-learning-3a5e293a5114   

https://developers.google.com/machine-learning/data-prep/   

https://colab.research.google.com/github/google/eng-edu/blob/master/ml/fe/exercises/intro_to_modeling.ipynb?utm_source=ss-data-prep&utm_campaign=colab-external&utm_medium=referral&utm_content=intro_to_modeling