# Python Tutorials

### Data transformation - categorical
(feature engineering)

Solvertank Digital Science   
[http://www.solvertank.com](http://www.solvertank.com)   
<img src="cube.gif" align="left" width="50" />

## Load data

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
df = pd.read_excel('datavis.xlsx', sheet_name='datavis')
# keep an untouched copy so later sections can restore the original data
df2 = df.copy()
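
Plain assignment only binds a second name to the same DataFrame, whereas `.copy()` creates an independent one. The sketch below shows the difference on a small made-up frame (the names `original`, `alias` and `backup` are illustrative and not part of the tutorial data):

```python
import pandas as pd

# Toy frame standing in for the tutorial's data (values are made up).
original = pd.DataFrame({"sex": ["F", None, "M"]})

alias = original            # same object, just a second name
backup = original.copy()    # independent copy of the data

# Modify the frame through its first name.
original["sex"] = original["sex"].fillna("O")

alias_nulls = int(alias["sex"].isnull().sum())    # the alias sees the change
backup_nulls = int(backup["sex"].isnull().sum())  # the copy is untouched
print(alias_nulls, backup_nulls)
```

Only the copy still contains the missing value, which is what a backup used for restoring data needs.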

### Null data

In [None]:
# list rows with null data
df[df['sex'].isnull()]

In [None]:
# replace null data - categorical
df['sex'] = df['sex'].fillna('O')
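
Filling with a constant such as `'O'` is one option. Another common choice is the most frequent value of the column. A minimal sketch on made-up data (the frame `toy` is hypothetical, not the tutorial's dataset):

```python
import pandas as pd

# Toy frame (made-up values) with gaps in a categorical column.
toy = pd.DataFrame({"sex": ["F", "M", None, "F", None]})

null_counts = toy.isnull().sum()      # number of missing values per column
most_common = toy["sex"].mode()[0]    # most frequent non-null value
filled = toy["sex"].fillna(most_common)

print(null_counts)
print(filled.tolist())
```

Which strategy fits best depends on the data; a separate label keeps the fact that the value was missing visible, while the mode does not.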

### Create categorical feature based on numeric
"binning" or "bucketing"   
http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.cut.html

In [None]:
# fixed number of equal-width bins
df['bp_cat'] = pd.cut(df['bp'], 3, labels=['Low', 'Mid', 'High'])

In [None]:
# predefined bin edges
df["bp_cat"] = pd.cut(df["bp"], bins=[-0.2,-0.02,0.02,0.2], labels=["Low", "Mid", "High"])

In [None]:
# quantile bins (five groups with roughly equal counts)
df["bp_cat"] = pd.qcut(df["bp"], q=5, labels=[0, 1, 2, 3, 4])
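
To see how equal-width and quantile binning differ, the following sketch applies `pd.cut` and `pd.qcut` to the same skewed series. The values are made up for illustration and are not taken from the `bp` column:

```python
import pandas as pd

# Made-up, skewed values; not the tutorial's 'bp' column.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100])

# Equal-width bins: the value range is split into 3 intervals of the same width.
equal_width = pd.cut(values, 3, labels=["Low", "Mid", "High"])

# Quantile bins: the edges are chosen so each bin holds about the same number of rows.
equal_count = pd.qcut(values, q=3, labels=["Low", "Mid", "High"])

width_counts = equal_width.value_counts().to_dict()
count_counts = equal_count.value_counts().to_dict()
print(width_counts)
print(count_counts)
```

With an outlier like this, equal-width binning puts almost every row into one bin, while quantile binning spreads the rows evenly.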

### Convert numeric type to categorical type

In [None]:
# list numeric columns
for column in df.select_dtypes(include=['number']).columns:
    print(column)

In [None]:
# treat the numeric 'region' column as categorical by storing it as object
df['region'] = df['region'].astype(object)
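
pandas also offers a dedicated `category` dtype as an alternative to `object`. The sketch below compares the two on made-up region codes; the real `region` column may look different:

```python
import pandas as pd

# Made-up numeric region codes; the real 'region' column may differ.
toy = pd.DataFrame({"region": [3, 1, 3, 2]})

as_object = toy["region"].astype(object)        # generic Python objects
as_category = toy["region"].astype("category")  # pandas categorical dtype

categories = list(as_category.cat.categories)   # distinct values, sorted
codes = list(as_category.cat.codes)             # integer position of each value

print(as_object.dtype, as_category.dtype)
print(categories, codes)
```

Note that `select_dtypes(include=['object'])` does not pick up `category` columns, so code that filters on `object` would need `include=['object', 'category']` if this dtype is used.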

### Create numerical features based on categories
"one-hot encoding"   
See also: https://medium.com/@contactsunny/label-encoder-vs-one-hot-encoder-in-machine-learning-3fc273365621

In [None]:
# restore the original data
df = df2.copy()

In [None]:
# removing null rows
df = df[df['category'].notnull()]

In [None]:
encoded_columns = pd.get_dummies(df['category'])

In [None]:
df = df.join(encoded_columns)

In [None]:
df.head(5)
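
As an illustration of the same steps on a small frame, the sketch below one-hot encodes a made-up `category` column. The `prefix` and `dtype` arguments are optional additions that keep the new column names recognizable and the values as 0/1 integers:

```python
import pandas as pd

# Made-up category labels for illustration.
toy = pd.DataFrame({"category": ["Gold", "Silver", "Gold", "Blue"]})

# One indicator column per distinct label, named 'category_<label>'.
dummies = pd.get_dummies(toy["category"], prefix="category", dtype=int)

# Attach the indicator columns to the original frame.
encoded = toy.join(dummies)
print(encoded)
```

Each row has exactly one indicator set to 1, the one matching its original label.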

### Encode categorical into numeric
"label encoding"   
See also: https://medium.com/@contactsunny/label-encoder-vs-one-hot-encoder-in-machine-learning-3fc273365621

In [None]:
# restore the original data
df = df2.copy()

In [None]:
# removing null rows (copy so that new columns can be added without warnings)
df = df[df['category'].notnull()].copy()

In [None]:
#https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html
from sklearn import preprocessing
le = preprocessing.LabelEncoder()

In [None]:
df['category_code'] = le.fit_transform(df['category'])

In [None]:
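# or for all categorical columns: one encoder per column, so each mapping stays recoverable
encoders = {}
for column in df.select_dtypes(include=['object']).columns:
    encoders[column] = preprocessing.LabelEncoder()
    # astype(str) lets columns with missing values be encoded (NaN becomes the label 'nan')
    df[column] = encoders[column].fit_transform(df[column].astype(str))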

In [None]:
df.head(5)
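
A fitted `LabelEncoder` keeps the mapping between labels and codes, so the codes can be turned back into labels. A minimal sketch on made-up labels (the series `toy` is hypothetical):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up category labels for illustration.
toy = pd.Series(["Gold", "Silver", "Gold", "Blue"])

encoder = LabelEncoder()
codes = encoder.fit_transform(toy)

classes = list(encoder.classes_)                   # labels in sorted order
restored = list(encoder.inverse_transform(codes))  # codes mapped back to labels

print(classes)
print(list(codes))
print(restored)
```

The codes follow the alphabetical order of the labels, not any business meaning, so a model may read an ordering into them that does not exist. One-hot encoding avoids that at the cost of more columns.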

### Grouping categories

In [None]:
# restore the original data
df = df2.copy()

In [None]:
# removing null rows (copy so that new columns can be added without warnings)
df = df[df['category'].notnull()].copy()

In [None]:
conditions = [
    df['category'].str.contains('Platinum'),
    df['category'].str.contains('Gold'),
    df['category'].str.contains('Silver'),
    df['category'].str.contains('Blue'),
    df['category'].str.contains('White')
]

In [None]:
choices = [
    'Premium', 
    'Premium', 
    'Value', 
    'Value', 
    'Value'
]

In [None]:
df['category_group'] = np.select(conditions, choices, default='Other')

In [None]:
df.head(10)
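
`np.select` checks the conditions in order and uses the choice of the first one that matches; rows matching none get the default. The sketch below shows this on made-up labels, including one that matches two conditions:

```python
import numpy as np
import pandas as pd

# Made-up labels; 'Gold Silver' matches both conditions on purpose.
toy = pd.DataFrame({"category": ["Gold Plus", "Silver", "Gold Silver", "Green"]})

conditions = [
    toy["category"].str.contains("Gold"),
    toy["category"].str.contains("Silver"),
]
choices = ["Premium", "Value"]

# The first matching condition wins; unmatched rows fall back to 'Other'.
toy["category_group"] = np.select(conditions, choices, default="Other")
groups = toy["category_group"].tolist()
print(groups)
```

Because the order of the conditions decides ties, the most specific or most important patterns should come first. For columns that may still contain missing values, `str.contains(..., na=False)` keeps the conditions boolean.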
