# Feature Encoding 
![transform](encoding.jpg)

This Jupyter notebook explains how the following feature encodings can be applied before feeding data to a machine learning model for training.
- Label Encoding
- Mapping (Categorical ordinal data).
- Binary encoding
- OneHotEncoding
- Date encoding

Finally, we will see how to carve out the features and the label and keep them separate.


### 1.> Let's load the required libraries.

In [2]:
from sklearn.preprocessing import LabelEncoder
import numpy as np
import pandas as pd

### 2.> Visualize and understand the data.

In [3]:

# Load the data and look at its first 5 rows.
data = pd.read_csv("sample_data.csv")
data.head()


Unnamed: 0,datetime,age,car_color,satisfaction,upsell,car_make
0,07/12/20 18:41,48,gray,satisfied,done,Dodge
1,07/12/20 15:54,44,blue,extremely satisfied,not done,Ford
2,07/12/20 11:23,42,green,not satisfied,done,BMW
3,06/12/20 23:59,59,amber,very satisfied,not done,Hyundai
4,07/12/20 15:41,50,hazel,slightly satisfied,done,Hyundai


In [4]:
data.dtypes

datetime        object
age              int64
car_color       object
satisfaction    object
upsell          object
car_make        object
dtype: object
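
Before choosing an encoding, it helps to know which columns hold text and how many distinct values each one has. The following is a minimal sketch on a small made-up frame (the `sample` data is an assumption, not the notebook's CSV):

```python
import pandas as pd

# Illustrative stand-in for the real data.
sample = pd.DataFrame({
    'age': [48, 44, 42],
    'car_color': ['gray', 'blue', 'gray'],
    'upsell': ['done', 'not done', 'done'],
})

# Non-numeric columns are the candidates for encoding.
text_columns = sample.select_dtypes(exclude='number').columns.tolist()

# The number of distinct values per column hints at the encoding to use
# (2 values: binary encoding, a handful of values: one-hot or mapping).
unique_counts = sample[text_columns].nunique()
print(unique_counts)
```
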

### 3.> Label Encoding

In [5]:
# Let's print the distinct (unique) values of the last column, car_make.
unique_car_make_values = data['car_make'].unique()

# Print the unique values
print(unique_car_make_values)

['Dodge' 'Ford' 'BMW' 'Hyundai' 'Audi' 'Honda' 'Chevrolet']


In [6]:
# Label encoding assigns each distinct text value an integer based on its alphabetical (sorted) order.
# Let's encode car_make with this technique.
# The column holds text, and machine learning models work with numbers, so we encode each make as a number from 0 to 6.
# LabelEncoder sorts the values alphabetically and then assigns codes from 0 to N-1.
# In this example that gives => Audi(0), BMW(1), Chevrolet(2), Dodge(3), Ford(4), Honda(5), Hyundai(6)
# Note that this is categorical nominal data: Hyundai is not "greater" than any other brand; the numbers only reflect alphabetical order.
labelencoder = LabelEncoder()
data['car_make'] = labelencoder.fit_transform(data['car_make'])
data.head()

Unnamed: 0,datetime,age,car_color,satisfaction,upsell,car_make
0,07/12/20 18:41,48,gray,satisfied,done,3
1,07/12/20 15:54,44,blue,extremely satisfied,not done,4
2,07/12/20 11:23,42,green,not satisfied,done,1
3,06/12/20 23:59,59,amber,very satisfied,not done,6
4,07/12/20 15:41,50,hazel,slightly satisfied,done,6


In [7]:
data.dtypes

datetime        object
age              int64
car_color       object
satisfaction    object
upsell          object
car_make         int32
dtype: object
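
To check which number was assigned to which make, the fitted encoder can be inspected. This is a small sketch on an illustrative series of the seven makes (the variable names `car_makes`, `encoder` and `label_to_code` are assumptions made for this example):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Illustrative sample containing the seven distinct makes.
car_makes = pd.Series(['Dodge', 'Ford', 'BMW', 'Hyundai', 'Audi', 'Honda', 'Chevrolet'])

encoder = LabelEncoder()
codes = encoder.fit_transform(car_makes)

# classes_ holds the sorted labels; the position of each label is its code.
label_to_code = {label: code for code, label in enumerate(encoder.classes_)}
print(label_to_code)

# inverse_transform turns the codes back into the original text labels.
decoded = encoder.inverse_transform(codes)
print(list(decoded))
```

Keeping the fitted encoder around makes it possible to translate model inputs or outputs back to readable labels later.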

### 4.> Mapping (Categorical ordinal data).

In [8]:
# Let's encode satisfaction.
# Print distinct values of satisfaction.
unique_satisfaction_values = data['satisfaction'].unique()

# Print the unique values
print(unique_satisfaction_values)

['satisfied' 'extremely satisfied' 'not satisfied' 'very satisfied'
 'slightly satisfied']


In [9]:
# Satisfaction is categorical ordinal data: we understand that "extremely satisfied" ranks above all other values, all the way down to "not satisfied".
# We cannot use label encoding here, because it would sort the values alphabetically and assign codes as follows:
#   extremely satisfied	:	0
#   not satisfied		:	1
#   satisfied		    :	2
#   slightly satisfied	:	3
#   very satisfied		:	4

# But because this is categorical ordinal data, we want to assign numbers based on the meaning of each satisfaction level:
#   not satisfied		:	0
#   slightly satisfied	:	1
#   satisfied		    :	2
#   very satisfied		:	3
#   extremely satisfied	:	4

# So we will use the pandas .map() method and pass it this satisfaction mapping to encode the data.


# Define a mapping from satisfaction labels to ordinal numbers
satisfaction_mapping = {
    'not satisfied': 0,
    'slightly satisfied': 1,
    'satisfied': 2,
    'very satisfied': 3,
    'extremely satisfied': 4
}

# Apply the mapping to the satisfaction column and overwrite the original 'satisfaction' column
data['satisfaction'] = data['satisfaction'].map(satisfaction_mapping)


In [106]:
data.head()

Unnamed: 0,datetime,age,car_color,satisfaction,upsell,car_make
0,07/12/20 18:41,48,gray,2,done,3
1,07/12/20 15:54,44,blue,4,not done,4
2,07/12/20 11:23,42,green,0,done,1
3,06/12/20 23:59,59,amber,3,not done,6
4,07/12/20 15:41,50,hazel,1,done,6


In [10]:
data.dtypes

datetime        object
age              int64
car_color       object
satisfaction     int64
upsell          object
car_make         int32
dtype: object
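
One thing to watch with `.map()`: any label that is missing from the mapping silently becomes `NaN`. The sketch below shows a quick check on a made-up sample (the value `'Satisfied '` with a capital letter and trailing space is a hypothetical dirty entry, not something taken from the CSV):

```python
import pandas as pd

satisfaction_mapping = {
    'not satisfied': 0,
    'slightly satisfied': 1,
    'satisfied': 2,
    'very satisfied': 3,
    'extremely satisfied': 4,
}

# Illustrative sample with one label that is not a key in the mapping.
sample = pd.Series(['satisfied', 'not satisfied', 'Satisfied ', 'very satisfied'])

encoded = sample.map(satisfaction_mapping)

# Labels missing from the mapping become NaN, which also makes the column float.
unmapped = sample[encoded.isna()].unique().tolist()
print(unmapped)
print(encoded.dtype)
```

If `unmapped` is not empty, the raw labels need cleaning (or the mapping needs extending) before the column is used.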

### 5.> Binary encoding

In [11]:
# We can use labelencoder.fit_transform to encode binary data.
# But before we do, we need to be aware that fit_transform assigns numbers in alphabetical order.
# The following example shows why the alphabetical order of the text values matters for binary encoding.

# First, double-check the unique/distinct values.
unique_upsell_values = data['upsell'].unique()

# Print the unique values
print(unique_upsell_values)


['done' 'not done']


In [12]:
# So yes, this column has just two distinct values: ['done' 'not done']
# Let's fit_transform "upsell".
labelencoder = LabelEncoder()

# Create an empty DataFrame with pandas, and fit_transform to this new dataframe.
test_data = pd.DataFrame()
test_data['test_upsell_encoded'] = labelencoder.fit_transform(data['upsell'])

# Create another combined frame to see values side by side.
data_combined = pd.concat((data, test_data), axis=1)
data_combined.head(10)

Unnamed: 0,datetime,age,car_color,satisfaction,upsell,car_make,test_upsell_encoded
0,07/12/20 18:41,48,gray,2,done,3,0
1,07/12/20 15:54,44,blue,4,not done,4,1
2,07/12/20 11:23,42,green,0,done,1,0
3,06/12/20 23:59,59,amber,3,not done,6,1
4,07/12/20 15:41,50,hazel,1,done,6,0
5,08/12/20 10:57,43,brown,3,done,0,0
6,06/12/20 18:38,30,brown,3,not done,6,1
7,05/12/20 21:13,31,blue,2,done,5,0
8,07/12/20 22:29,56,gray,3,done,5,0
9,07/12/20 07:34,47,gray,3,done,3,0


In [13]:
# We can see every "done" is encoded as 0, and every "not done" as 1.
# Let's try converting test_upsell_encoded to boolean.
data_combined = data_combined.astype({'test_upsell_encoded': bool})
data_combined.head()


Unnamed: 0,datetime,age,car_color,satisfaction,upsell,car_make,test_upsell_encoded
0,07/12/20 18:41,48,gray,2,done,3,False
1,07/12/20 15:54,44,blue,4,not done,4,True
2,07/12/20 11:23,42,green,0,done,1,False
3,06/12/20 23:59,59,amber,3,not done,6,True
4,07/12/20 15:41,50,hazel,1,done,6,False


In [14]:
# We can see every "done" is encoded as False, and every "not done" as True.
# But this does not match the business meaning of the values.
# We wanted it the other way round: "not done" should be False and "done" should be True.

# This shows why we should apply an explicit mapping instead of relying on the alphabetical order used by "fit_transform".
# Define a mapping from upsell labels to 0 and 1 based on the correct business meaning.
upsell_mapping = {
    'not done': 0,
    'done': 1
}

# Apply the mapping to the upsell column and overwrite the original 'upsell' column
data['upsell'] = data['upsell'].map(upsell_mapping)
data.head()


Unnamed: 0,datetime,age,car_color,satisfaction,upsell,car_make
0,07/12/20 18:41,48,gray,2,1,3
1,07/12/20 15:54,44,blue,4,0,4
2,07/12/20 11:23,42,green,0,1,1
3,06/12/20 23:59,59,amber,3,0,6
4,07/12/20 15:41,50,hazel,1,1,6


In [15]:
# OK, now every "done" is 1, and every "not done" is 0.
# We can now convert it to boolean values.
data = data.astype({'upsell': bool})
data.head()


Unnamed: 0,datetime,age,car_color,satisfaction,upsell,car_make
0,07/12/20 18:41,48,gray,2,True,3
1,07/12/20 15:54,44,blue,4,False,4
2,07/12/20 11:23,42,green,0,True,1
3,06/12/20 23:59,59,amber,3,False,6
4,07/12/20 15:41,50,hazel,1,True,6


In [16]:
data.dtypes

datetime        object
age              int64
car_color       object
satisfaction     int64
upsell            bool
car_make         int32
dtype: object
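
For a two-valued column, the same result can also be reached in one step by comparing against the positive label. This is an alternative sketch, not the approach used above; note that any value other than `'done'` (including a missing value) ends up as `False`:

```python
import pandas as pd

# Illustrative sample of the raw text column.
sample = pd.Series(['done', 'not done', 'done'], name='upsell')

# The comparison yields booleans directly: 'done' -> True, everything else -> False.
upsell_bool = sample == 'done'
print(upsell_bool.tolist())
```
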

### 6.> OneHotEncoding


In [17]:
# Let's look at the distinct values of "car_color", so we know how many additional columns one-hot encoding will add.
# First, double-check the unique/distinct values.
print(data['car_color'].unique())


['gray' 'blue' 'green' 'amber' 'hazel' 'brown']


In [18]:
# There are 6 distinct values, so six new columns will be created; in each row, the column matching that row's car_color is set to True (1) and the others to False (0).

# One-hot encoding can easily be done with pandas' "get_dummies" function.
data = pd.get_dummies(data,prefix=['carColor'], columns = ['car_color'])
data.head()

Unnamed: 0,datetime,age,satisfaction,upsell,car_make,carColor_amber,carColor_blue,carColor_brown,carColor_gray,carColor_green,carColor_hazel
0,07/12/20 18:41,48,2,True,3,False,False,False,True,False,False
1,07/12/20 15:54,44,4,False,4,False,True,False,False,False,False
2,07/12/20 11:23,42,0,True,1,False,False,False,False,True,False
3,06/12/20 23:59,59,3,False,6,True,False,False,False,False,False
4,07/12/20 15:41,50,1,True,6,False,False,False,False,False,True


In [19]:
data.dtypes

datetime          object
age                int64
satisfaction       int64
upsell              bool
car_make           int32
carColor_amber      bool
carColor_blue       bool
carColor_brown      bool
carColor_gray       bool
carColor_green      bool
carColor_hazel      bool
dtype: object
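
Two optional `get_dummies` arguments are worth knowing: `dtype=int` produces 0/1 instead of True/False, and `drop_first=True` drops the first (alphabetically smallest) category, since it is implied when all remaining columns are 0. A minimal sketch on an illustrative frame with the six colors:

```python
import pandas as pd

# Illustrative sample with the six distinct colors.
sample = pd.DataFrame({'car_color': ['gray', 'blue', 'green', 'amber', 'hazel', 'brown']})

encoded = pd.get_dummies(
    sample,
    prefix=['carColor'],
    columns=['car_color'],
    drop_first=True,  # drops carColor_amber, the first category in sorted order
    dtype=int,        # 0/1 integers instead of booleans
)
print(encoded.columns.tolist())
print(encoded)
```

Whether dropping a column helps depends on the model; it is mostly relevant for linear models, where the full set of dummy columns is redundant.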

### 7.> Date encoding

In [20]:
# Now let's see how we can handle dates.
# As of now, the datetime column has dtype "object", which means pandas treats it as text and not as a date.

# Let's convert it into a datetime object.
data['datetime'] = pd.to_datetime(data['datetime'], format='%d/%m/%y %H:%M')
data.head()

Unnamed: 0,datetime,age,satisfaction,upsell,car_make,carColor_amber,carColor_blue,carColor_brown,carColor_gray,carColor_green,carColor_hazel
0,2020-12-07 18:41:00,48,2,True,3,False,False,False,True,False,False
1,2020-12-07 15:54:00,44,4,False,4,False,True,False,False,False,False
2,2020-12-07 11:23:00,42,0,True,1,False,False,False,False,True,False
3,2020-12-06 23:59:00,59,3,False,6,True,False,False,False,False,False
4,2020-12-07 15:41:00,50,1,True,6,False,False,False,False,False,True


In [21]:
data.dtypes

datetime          datetime64[ns]
age                        int64
satisfaction               int64
upsell                      bool
car_make                   int32
carColor_amber              bool
carColor_blue               bool
carColor_brown              bool
carColor_gray               bool
carColor_green              bool
carColor_hazel              bool
dtype: object

In [22]:
# We can also do this while loading the CSV file into a pandas DataFrame.
# Passing the explicit day-first format via date_format (pandas 2.0+) avoids the deprecated date_parser argument
# and prevents "07/12/20" from being read month-first as 12 July.

data_loaded_again = pd.read_csv("sample_data.csv", parse_dates=['datetime'], date_format='%d/%m/%y %H:%M')


data_loaded_again.head()

Unnamed: 0,datetime,age,car_color,satisfaction,upsell,car_make
0,2020-12-07 18:41:00,48,gray,satisfied,done,Dodge
1,2020-12-07 15:54:00,44,blue,extremely satisfied,not done,Ford
2,2020-12-07 11:23:00,42,green,not satisfied,done,BMW
3,2020-12-06 23:59:00,59,amber,very satisfied,not done,Hyundai
4,2020-12-07 15:41:00,50,hazel,slightly satisfied,done,Hyundai


In [23]:
data_loaded_again.dtypes

datetime        datetime64[ns]
age                      int64
car_color               object
satisfaction            object
upsell                  object
car_make                object
dtype: object
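
Many models cannot use a `datetime64` column directly, so a common next step is to break it into numeric parts with the `.dt` accessor. The sketch below uses two timestamps in the same day-first format; the new column names (`year`, `month`, `day`, `hour`, `day_of_week`) are assumptions chosen for illustration:

```python
import pandas as pd

# Two illustrative timestamps in day/month/year format.
sample = pd.DataFrame({'datetime': ['07/12/20 18:41', '06/12/20 23:59']})
sample['datetime'] = pd.to_datetime(sample['datetime'], format='%d/%m/%y %H:%M')

# Extract numeric components from the datetime column.
sample['year'] = sample['datetime'].dt.year
sample['month'] = sample['datetime'].dt.month
sample['day'] = sample['datetime'].dt.day
sample['hour'] = sample['datetime'].dt.hour
sample['day_of_week'] = sample['datetime'].dt.dayofweek  # Monday=0 ... Sunday=6

print(sample)
```

Which components are useful depends on the problem; for example, hour and day of week may matter more than the year when all rows fall within a few days.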

### 8.> Carve out feature and label

In [24]:
# Just before sending data to a machine learning model, we store all features in X and the labels in Y.


# Remove "upsell" from data to get all the features. Store them in X.
X = data.drop(columns=['upsell'])
X.head(10)

Unnamed: 0,datetime,age,satisfaction,car_make,carColor_amber,carColor_blue,carColor_brown,carColor_gray,carColor_green,carColor_hazel
0,2020-12-07 18:41:00,48,2,3,False,False,False,True,False,False
1,2020-12-07 15:54:00,44,4,4,False,True,False,False,False,False
2,2020-12-07 11:23:00,42,0,1,False,False,False,False,True,False
3,2020-12-06 23:59:00,59,3,6,True,False,False,False,False,False
4,2020-12-07 15:41:00,50,1,6,False,False,False,False,False,True
5,2020-12-08 10:57:00,43,3,0,False,False,True,False,False,False
6,2020-12-06 18:38:00,30,3,6,False,False,True,False,False,False
7,2020-12-05 21:13:00,31,2,5,False,True,False,False,False,False
8,2020-12-07 22:29:00,56,3,5,False,False,False,True,False,False
9,2020-12-07 07:34:00,47,3,3,False,False,False,True,False,False


In [25]:
# And just "upsell" is our final label. Store it in Y.
Y = pd.DataFrame(data['upsell'])
Y.head(10)

Unnamed: 0,upsell
0,True
1,False
2,True
3,False
4,True
5,True
6,False
7,True
8,True
9,True


## Now X and Y are two data frames: one holds the features and the other the label. Both are ready to feed into a machine learning model for training.
