## Feature Encoding

A dataset generally contains two types of features:

1.   Numerical (integers, floats)
2.   Categorical (nominal, ordinal)

---
Most machine learning models expect numeric input, so categorical features cannot be passed in directly. We need to convert them into numeric features first.



Categorical variables are of 2 types: ordinal and nominal.

*   Ordinal variables have a natural order. (Good, Better, Best), (First, Second, Third)
*   Nominal variables have no ordering between their values. (Cat, Dog, Monkey), (Apple, Banana, Mango)

Depending on whether a categorical variable is ordinal or nominal, we apply a different encoding technique to it.
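
For an ordinal variable, the order has to come from us, because the labels alone do not carry it. The following is a minimal pure-Python sketch of that idea; the `reviews` list and the names `quality_order` and `rank_of` are made up for illustration and are not part of the dataset used in this notebook.

```python
# Hypothetical ordinal feature: the order is supplied by hand, lowest first.
quality_order = ["Good", "Better", "Best"]

# Map each label to its position in the hand-written order.
rank_of = {label: rank for rank, label in enumerate(quality_order)}

# Hypothetical column of review labels.
reviews = ["Better", "Good", "Best", "Good"]
encoded_reviews = [rank_of[label] for label in reviews]

print(encoded_reviews)  # [1, 0, 2, 0]
```

Here the integers are meaningful: 0 < 1 < 2 mirrors Good < Better < Best.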

In [0]:
#let's create a dataframe
import pandas as pd
df = pd.DataFrame ({'country' : ['India','U.S','Australia','India','Australia','India','U.S'],
                    'Age' : [44,34,28,27,30,42,25],
                    'Salary' : [72000,44000,35000,27000,32000,56000,45000],
                    'Purchased' : ['yes','no','yes','yes','no','yes','no']
                    })

In [0]:
#Let's check our dataframe
print(df)

     country  Age  Salary Purchased
0      India   44   72000       yes
1        U.S   34   44000        no
2  Australia   28   35000       yes
3      India   27   27000       yes
4  Australia   30   32000        no
5      India   42   56000       yes
6        U.S   25   45000        no


In [0]:
#check the datatypes
df.dtypes

country      object
Age           int64
Salary        int64
Purchased    object
dtype: object

Here we have 2 categorical features:

*   country
*   Purchased

---

Age and Salary hold numeric values, so only country and Purchased need to be encoded before the data can be passed to a model.



### Label Encoding

In [0]:
df['country'].unique()  # list the distinct categories in the country column

array(['India', 'U.S', 'Australia'], dtype=object)

So here we have 3 categories in the country column:


*   India
*   U.S
*   Australia



In label encoding, each category is assigned a unique integer from 0 to (n-1), where n is the number of categories.

In [0]:
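from sklearn.preprocessing import LabelEncoder  # import LabelEncoder from the sklearn library
# Note: scikit-learn documents LabelEncoder as meant for target labels;
# OrdinalEncoder is the equivalent intended for input features.
le = LabelEncoder()    # create an instance of LabelEncoder

df['country_temp'] = le.fit_transform(df['country'])   # apply label encoding to the country column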

In [0]:
df['country_temp']

0    1
1    2
2    0
3    1
4    0
5    1
6    2
Name: country_temp, dtype: int64

Here we can see that the country feature has been transformed into numeric values. Label encoding assigns the codes in alphabetical (sorted) order, as we can see here:
*   Australia -----> 0
*   India  --------> 1
*   U.S   ---------> 2
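
The mapping above can be reproduced by hand, which makes the "sorted order" rule explicit. This is only a pure-Python sketch of the idea, not how `LabelEncoder` is implemented internally; the names `classes`, `code_of` and `country_codes` are illustrative.

```python
# Same country values as in the dataframe above.
countries = ["India", "U.S", "Australia", "India", "Australia", "India", "U.S"]

# Sort the distinct labels, then number them 0..n-1.
classes = sorted(set(countries))
code_of = {label: code for code, label in enumerate(classes)}

country_codes = [code_of[label] for label in countries]

print(classes)        # ['Australia', 'India', 'U.S']
print(country_codes)  # [1, 2, 0, 1, 0, 1, 2]
```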

### Problem With Label Encoding
Here we have assigned numeric values, i.e. (0-Australia), (1-India), (2-U.S), in the same column. The problem is that a machine learning model does not see these values as independent labels: it sees numbers, and 0 < 1 < 2. The model might therefore treat the categories as ordered, but our country feature has no ordering. We cannot say Australia < India < U.S.

We use One Hot encoding to overcome this problem. It is also known as nominal encoding. Here we create 3 different columns [India, Australia, U.S]. We assign 1 if that label is present in a particular row; otherwise we mark it as 0.
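
One way to see the difference is to compare how "far apart" the categories look under each encoding. The sketch below is a pure-Python illustration under the assumption that a model uses something like squared distance between feature values; the helper names `one_hot` and `squared_distance` are made up for this example.

```python
from itertools import combinations

classes = ["Australia", "India", "U.S"]
label_code = {label: code for code, label in enumerate(classes)}


def one_hot(label):
    """Return a 0/1 vector with a single 1 at the position of `label`."""
    return [1 if label == c else 0 for c in classes]


def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


pairs = list(combinations(classes, 2))

# Label codes: Australia and U.S look twice as far apart as the other pairs.
label_gaps = [(label_code[a] - label_code[b]) ** 2 for a, b in pairs]

# One-hot vectors: every pair of countries is equally far apart.
one_hot_gaps = [squared_distance(one_hot(a), one_hot(b)) for a, b in pairs]

print(pairs)
print(label_gaps)    # [1, 4, 1]
print(one_hot_gaps)  # [2, 2, 2]
```

With label codes the gaps are an artifact of alphabetical order; with one-hot vectors no country is closer to another than any other.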

In [0]:
# we will use get_dummies to do One Hot encoding
# dtype=int keeps the 0/1 output; pandas 2.0+ returns True/False by default
pd.get_dummies(df['country'], dtype=int)

   Australia  India  U.S
0          0      1    0
1          0      0    1
2          1      0    0
3          0      1    0
4          1      0    0
5          0      1    0
6          0      0    1


*  In the first row, India is assigned 1, and Australia and U.S are assigned 0.
*  Similarly, in the 2nd row U.S is assigned 1 and the other columns are assigned 0.

We can drop the first column here, since it adds a feature without adding information.
Reason: suppose we keep only the India and U.S columns. When both are 0 in a row, the remaining label (Australia) must be the one present, so its column is implied.
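
The dropped column can always be rebuilt from the ones we keep, which is why nothing is lost. A small sketch, using the India and U.S columns from the one-hot output above (the list names are illustrative):

```python
# Columns kept after dropping the first one (values copied from the output above).
india = [1, 0, 0, 1, 0, 1, 0]
us = [0, 1, 0, 0, 0, 0, 1]

# Each row has exactly one label, so the missing column is 1 minus the others.
australia = [1 - i - u for i, u in zip(india, us)]

print(australia)  # [0, 0, 1, 0, 1, 0, 0]
```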

In [0]:
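# Dropping the first column (Australia)
# dtype=int keeps the 0/1 output; pandas 2.0+ returns True/False by default
pd.get_dummies(df['country'], drop_first=True, dtype=int)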

   India  U.S
0      1    0
1      0    1
2      0    0
3      1    0
4      0    0
5      1    0
6      0    1


Here we have applied one hot encoding to a single feature, but real-world datasets often have many categorical features. Suppose our dataset has 50 categorical features with 3 different labels in each. If we apply one hot encoding, the number of features grows: we get 150 columns, or 100 if we drop the first column of each feature. This makes our model more complex.

Depending on the dataset, there are different techniques we can apply to overcome this dimensionality problem.
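
The column count is simple arithmetic, sketched below. The helper `one_hot_width` is a made-up name for illustration; it only counts columns and does not encode anything.

```python
def one_hot_width(cardinalities, drop_first=False):
    """Number of one-hot columns for features with the given category counts."""
    return sum(k - 1 if drop_first else k for k in cardinalities)


# 50 categorical features with 3 labels each, as in the example above.
cardinalities = [3] * 50

full_width = one_hot_width(cardinalities)
reduced_width = one_hot_width(cardinalities, drop_first=True)

print(full_width)     # 150
print(reduced_width)  # 100
```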

### Binary Encoding
This is not as intuitive as the previous ones. Here the labels are first ordinal encoded, and then those integers are converted into binary codes. Finally, each digit of the binary string becomes a separate feature.
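
The two steps can be sketched in pure Python. This is an illustration of the idea only, not the `category_encoders` implementation; it assumes ordinal codes are handed out in order of first appearance and start at 1, and the function name `binary_encode` and the placeholder labels A to G are made up.

```python
def binary_encode(values):
    """Ordinal-encode `values` (codes start at 1), then split each code into bits."""
    # Step 1: ordinal codes in order of first appearance (assumption: start at 1).
    code_of = {}
    for value in values:
        code_of.setdefault(value, len(code_of) + 1)

    # Step 2: write every code with the same number of binary digits.
    width = max(code_of.values()).bit_length()
    return [[int(bit) for bit in format(code_of[value], f"0{width}b")]
            for value in values]


# Seven placeholder labels standing in for a 7-category feature.
labels = list("ABCDEFG")
rows = binary_encode(labels)

for label, bits in zip(labels, rows):
    print(label, bits)
# A [0, 0, 1]
# B [0, 1, 0]
# C [0, 1, 1]
# D [1, 0, 0]
# E [1, 0, 1]
# F [1, 1, 0]
# G [1, 1, 1]
```

Each position in the bit list would become one feature column, so seven categories fit into three columns.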

In [0]:
# create one more column, occupation
df['occupation'] = ['Self-employed','Freelancer','Family-business','Data-scientist','Pensioner','Manager','Daily-wage-worker']
print(df['occupation'])

0        Self-employed
1           Freelancer
2      Family-business
3       Data-scientist
4            Pensioner
5              Manager
6    Daily-wage-worker
Name: occupation, dtype: object


We have seven different categories here, and there is no ordering among them either.

In [0]:
#install category_encoders first
!pip install category_encoders

Collecting category_encoders
Installing collected packages: category-encoders
Successfully installed category-encoders-2.1.0


In [0]:
# we will use BinaryEncoder from category_encoders library to do binary encoding
import category_encoders as ce
encoder = ce.BinaryEncoder(cols = ['occupation'])
df_binary = encoder.fit_transform(df)
print(df_binary)

     country  Age  Salary  ... occupation_1  occupation_2  occupation_3
0      India   44   72000  ...            0             0             1
1        U.S   34   44000  ...            0             1             0
2  Australia   28   35000  ...            0             1             1
3      India   27   27000  ...            1             0             0
4  Australia   30   32000  ...            1             0             1
5      India   42   56000  ...            1             1             0
6        U.S   25   45000  ...            1             1             1

[7 rows x 9 columns]


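We had 7 different categories in occupation. One hot encoding would have given us 7 features, but Binary Encoding needs only 3 binary digits to tell them apart. (The printed frame has 9 columns, which suggests this version of the library also emits a fourth column, occupation_0, hidden by the `...` in the output.) Binary Encoding is very useful when we have many categories within a single feature, because it helps us reduce the dimensionality.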
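
The saving grows quickly with the number of categories. The sketch below compares column counts, assuming ordinal codes run from 1 to n and no extra column is reserved for unknown or missing values (a real encoder may add one); `binary_width` is a hypothetical helper name.

```python
def binary_width(n_categories):
    """Binary digits needed to write every code from 1 to n_categories."""
    return n_categories.bit_length()


# One hot encoding needs n columns; binary encoding needs about log2(n).
widths = {n: binary_width(n) for n in (7, 100, 1000)}

for n, width in widths.items():
    print(f"{n} categories: one-hot={n} columns, binary={width} columns")
# 7 categories: one-hot=7 columns, binary=3 columns
# 100 categories: one-hot=100 columns, binary=7 columns
# 1000 categories: one-hot=1000 columns, binary=10 columns
```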

In [0]:
'''We have seen 3 basic feature encoding techniques here; there are many more.
              We will look at them with some practical uses and some real-world datasets.'''