# **Feature Engineering for Machine Learning**

**Today's Outline:**
- Data Preprocessing & Feature Engineering Basics
- Feature Scaling & Transformation
- Feature Encoding (Categorical Data)
- Feature Cleaning & Imputation
- Data Splitting

==========

### Importing Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set()

==========

## Feature Engineering (Data Pre-processing)

Scikit-Learn Preprocessing Module:
https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing 

### Data Preprocessing & Feature Engineering Basics

In [None]:
from IPython.display import Image
Image("imgs/feature-engineering.jpeg")

- Feature Scaling & Transformation
    - Normalization
    - Standardization
- Feature Encoding (Categorical Data)
    - Label Encoding
    - One-Hot Encoding
- Feature Cleaning & Imputation
    - Outliers
    - Missing Data

### Importing Dataset

In [None]:
df = pd.read_csv('data/feature_data.csv')

In [None]:
df

In [None]:
df = df[['Age','Salary','Region','Purchased']]
df

In [None]:
df.info()

In [None]:
df.select_dtypes(include=['float64'])
# df.select_dtypes(exclude=['float64'])

### Feature Scaling & Transformation

- Normalization (*sklearn.preprocessing.MinMaxScaler*)
- Standardization (*sklearn.preprocessing.StandardScaler*)

In [None]:
from IPython.display import Image
Image("imgs/std-norm.png")

##### Normal Distribution

In [None]:
ages = pd.read_csv('data/ages.csv')
ages

In [None]:
plt.figure(figsize=(10,5))
plt.bar(ages['Ages'],ages['Frequency'], width=1)

In [None]:
from IPython.display import Image
Image("imgs/standard-normal-distribution.png")

##### Normalization / Scaling (sklearn.preprocessing.MinMaxScaler)

In [None]:
from IPython.display import Image
Image("imgs/rescaling.gif")

Documentation: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html

In [None]:
df

In [None]:
from sklearn.preprocessing import MinMaxScaler

In [None]:
sc = MinMaxScaler()

In [None]:
sc.fit(df[['Salary']])

In [None]:
sc.transform(df[['Salary']])

In [None]:
sc.fit_transform(df[['Salary']])
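
As a sanity check on what `MinMaxScaler` computes, the sketch below applies the min-max formula `(x - min) / (max - min)` by hand and compares it with the scaler's output. The salary values are made up for illustration and are not taken from `feature_data.csv`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical salaries, used only for illustration (not from feature_data.csv)
salaries = np.array([[30000.0], [45000.0], [60000.0], [90000.0]])

# Manual min-max scaling: (x - min) / (max - min), computed per column
col_min = salaries.min(axis=0)
col_max = salaries.max(axis=0)
manual = (salaries - col_min) / (col_max - col_min)

# The same transformation with scikit-learn (default feature_range is (0, 1))
scaled = MinMaxScaler().fit_transform(salaries)

print(manual.ravel())
print(np.allclose(manual, scaled))
```

With the default `feature_range`, the smallest value maps to 0 and the largest to 1, so a single extreme value compresses everything else into a narrow band.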

##### Standardization (sklearn.preprocessing.StandardScaler)

In [None]:
from IPython.display import Image
Image("imgs/normalizing.gif")

In [None]:
from IPython.display import Image
Image("imgs/standardization.gif")

In [None]:
from IPython.display import Image
Image("imgs/std.png")

Documentation: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
sc = StandardScaler()

In [None]:
sc.fit(df[['Salary']])

In [None]:
sc.transform(df[['Salary']])

In [None]:
sc.fit_transform(df[['Salary']])
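
The following sketch reproduces `StandardScaler` by hand using `z = (x - mean) / std`, assuming the same made-up salary values as an illustration (they are not from `feature_data.csv`). Note that the scaler uses the population standard deviation (`ddof=0`), which is NumPy's default but not pandas'.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical salaries, used only for illustration (not from feature_data.csv)
salaries = np.array([[30000.0], [45000.0], [60000.0], [90000.0]])

# Manual standardization: subtract the mean, divide by the population std (ddof=0)
manual = (salaries - salaries.mean(axis=0)) / salaries.std(axis=0)

sc = StandardScaler()
scaled = sc.fit_transform(salaries)

# inverse_transform maps the z-scores back to the original units
restored = sc.inverse_transform(scaled)

print(np.allclose(manual, scaled))
print(scaled.mean(), scaled.std())
```

After the transformation the column has mean 0 and standard deviation 1, up to floating-point error.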

### Encoding Categorical Features

- Label / Integer Encoding (*sklearn.preprocessing.LabelEncoder*)
- One-Hot / Dummy Encoding (*sklearn.preprocessing.OneHotEncoder*)

##### Label / Integer Encoding (sklearn.preprocessing.LabelEncoder)

In [None]:
from IPython.display import Image
Image("imgs/labelencoding.png")

Documentation: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html

In [None]:
df

In [None]:
from sklearn.preprocessing import LabelEncoder

In [None]:
enc = LabelEncoder()

In [None]:
enc.fit_transform(df['Purchased'])

In [None]:
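# LabelEncoder is designed for target labels (y); for input features such as
# Region, sklearn.preprocessing.OrdinalEncoder is the equivalent transformer.
# The integer codes imply an order that a nominal feature does not have.
enc.fit_transform(df['Region'])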
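
To see which integer was assigned to which category, the fitted encoder exposes `classes_`, and `inverse_transform` maps codes back to labels. A minimal sketch, assuming a small made-up list of region names rather than the real `Region` column:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical region values, used only for illustration
regions = pd.Series(['India', 'Brazil', 'USA', 'Brazil'])

enc = LabelEncoder()
codes = enc.fit_transform(regions)

# classes_ holds the sorted unique labels; a label's position is its integer code
print(list(enc.classes_))
print(list(codes))

# Recover the original labels from the integer codes
decoded = enc.inverse_transform(codes)
print(list(decoded))
```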

##### One-Hot / Dummy Encoding (sklearn.preprocessing.OneHotEncoder)

In [None]:
from IPython.display import Image
Image("imgs/onehotencoding.png")

Documentation: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html

In [None]:
df

In [None]:
from sklearn.preprocessing import OneHotEncoder

In [None]:
enc = OneHotEncoder()

In [None]:
enc.fit_transform(df[['Purchased']]).toarray()

In [None]:
enc.fit_transform(df[['Region']]).toarray()

In [None]:
pd.get_dummies(df['Region'])
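
One practical difference between `OneHotEncoder` and `pd.get_dummies` is that the fitted encoder remembers its categories, so it can be reused on new data. The sketch below assumes made-up region values and uses `handle_unknown='ignore'` so that a category unseen during fitting becomes an all-zero row instead of raising an error.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data, used only for illustration
train = pd.DataFrame({'Region': ['India', 'Brazil', 'USA', 'Brazil']})
new = pd.DataFrame({'Region': ['USA', 'Japan']})  # 'Japan' was not seen during fit

# Unknown categories are encoded as all zeros rather than raising an error
enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(train)

# categories_ lists the learned categories per input column, in output-column order
columns = list(enc.categories_[0])
encoded_new = enc.transform(new).toarray()

print(columns)
print(encoded_new)
```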

### Feature Cleaning & Imputation

- Missing Data
- Outliers

##### Handling Missing Data

Documentation: https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html

In [None]:
df.info()

In [None]:
from sklearn.impute import SimpleImputer

In [None]:
imp = SimpleImputer()

In [None]:
imp.fit_transform(df[['Salary']])
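
`SimpleImputer()` defaults to `strategy='mean'`, and `fit_transform` returns a new array without modifying the DataFrame. The sketch below, which assumes a small made-up salary column, compares the mean and median strategies and shows one way to write the result back.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data with one missing salary, used only for illustration
data = pd.DataFrame({'Salary': [30000.0, np.nan, 40000.0, 110000.0]})

# Mean imputation (the default strategy) is pulled upward by the large value
mean_filled = SimpleImputer(strategy='mean').fit_transform(data[['Salary']]).ravel()

# Median imputation is more robust to skewed columns
median_filled = SimpleImputer(strategy='median').fit_transform(data[['Salary']]).ravel()

print(mean_filled[1], median_filled[1])

# fit_transform does not change `data`; assign the result to keep it
data['Salary'] = median_filled
print(data['Salary'].isna().sum())
```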

In [None]:
from sklearn.impute import MissingIndicator

In [None]:
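# features='all' returns one mask column per input column, even when nothing is missing
miss = MissingIndicator(features='all')

In [None]:
# fit_transform returns a 2-D boolean array of shape (n_rows, 1);
# flatten it to 1-D so it can be used as a row mask
df[miss.fit_transform(df[['Salary']]).ravel()]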

##### Dealing with Outliers

In [None]:
from IPython.display import Image
Image("imgs/outliers.gif")

In [None]:
df['Salary'].plot.hist(bins=10)

In [None]:
df['Salary'].plot.box()

In [None]:
q1 = df['Salary'].quantile(0.25)

In [None]:
q3 = df['Salary'].quantile(0.75)

In [None]:
iqr = q3 - q1
iqr

In [None]:
lower = q1 - 1.5 * iqr
lower

In [None]:
upper = q3 + 1.5 * iqr
upper

In [None]:
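# Keep rows inside the IQR fences. Rows with a missing Salary are dropped too,
# because comparisons against NaN evaluate to False.
df = df[(df['Salary'] >= lower) & (df['Salary'] <= upper)]
df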

In [None]:
df['Salary'].plot.hist(bins=3)

In [None]:
df['Salary'].plot.box()
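
The IQR steps above can be collected into a small helper. This is a sketch under stated assumptions: `iqr_bounds` is a hypothetical function name, the multiplier `k=1.5` is the conventional box-plot choice, and the values are made up for illustration.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return (lower, upper) fences at q1 - k*IQR and q3 + k*IQR."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical values (in thousands) with one obvious outlier
salaries = pd.Series([30.0, 32.0, 35.0, 36.0, 38.0, 40.0, 200.0])

lower, upper = iqr_bounds(salaries)

# between() is inclusive on both ends by default
mask = salaries.between(lower, upper)
filtered = salaries[mask]

print(lower, upper)
print(filtered.tolist())
```

Building the mask separately makes it easy to inspect the flagged rows (`salaries[~mask]`) before deciding whether to drop, cap, or keep them.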

### Splitting Data for Training & Testing

##### What is a Feature?

In [None]:
from IPython.display import Image
Image("imgs/features.png")

##### Splitting Data for Testing & Training

In [None]:
from IPython.display import Image
Image("imgs/splitting.png")

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
from sklearn.datasets import load_iris

In [None]:
iris = load_iris()

In [None]:
X, y = iris.data, iris.target

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
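
When the split is combined with the scaling shown earlier, a common practice is to fit the scaler on the training set only and reuse it on the test set, so that test-set statistics do not leak into training. The sketch below illustrates that workflow on the iris data. The `stratify=y` argument is an addition not used in the cell above; it keeps class proportions equal in both splits.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# stratify=y preserves the class balance in the train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Learn the mean and std from the training data only
sc = StandardScaler()
X_train_scaled = sc.fit_transform(X_train)

# Apply the training statistics to the test data (transform, not fit_transform)
X_test_scaled = sc.transform(X_test)

print(X_train.shape, X_test.shape)
print(np.bincount(y_test))
```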

===========

# THANK YOU!