So far we have looked at the following feature encoding techniques.

* Label Encoding
* One Hot Encoding
* Binary Encoding
* Mapping
* pd.factorize

In this notebook we will look at 2 new encoding techniques.

* Frequency Encoding
* Mean Encoding

In [1]:
import pandas as pd #import pandas
import numpy as np #import numpy
from sklearn.preprocessing import LabelEncoder  #importing LabelEncoder

In [2]:
train = pd.read_csv('/content/datasets_9961_14084_Train.csv')

In [3]:
train.head()

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,FDA15,9.3,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138
1,DRC01,5.92,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228
2,FDN15,17.5,Low Fat,0.01676,Meat,141.618,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.27
3,FDX07,19.2,Regular,0.0,Fruits and Vegetables,182.095,OUT010,1998,,Tier 3,Grocery Store,732.38
4,NCD19,8.93,Low Fat,0.0,Household,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052


In [4]:
#check the size of the dataset
print('Data has {} Number of rows'.format(train.shape[0]))
print('Data has {} Number of columns'.format(train.shape[1]))

Data has 8523 Number of rows
Data has 12 Number of columns


In [5]:
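#let's keep our categorical variables (plus the target, Item_Outlet_Sales) in one table
#.copy() gives an independent DataFrame, so later assignments do not act on a slice of train
cat_data = train[['Item_Identifier','Item_Fat_Content','Item_Type','Outlet_Identifier','Outlet_Size','Outlet_Location_Type','Outlet_Type','Item_Outlet_Sales']].copy()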


In [6]:
cat_data.head()   #check the head of categorical data

Unnamed: 0,Item_Identifier,Item_Fat_Content,Item_Type,Outlet_Identifier,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,FDA15,Low Fat,Dairy,OUT049,Medium,Tier 1,Supermarket Type1,3735.138
1,DRC01,Regular,Soft Drinks,OUT018,Medium,Tier 3,Supermarket Type2,443.4228
2,FDN15,Low Fat,Meat,OUT049,Medium,Tier 1,Supermarket Type1,2097.27
3,FDX07,Regular,Fruits and Vegetables,OUT010,,Tier 3,Grocery Store,732.38
4,NCD19,Low Fat,Household,OUT013,High,Tier 3,Supermarket Type1,994.7052


In [7]:
#Let's start where we left off
print(cat_data['Item_Type'].nunique())
print(cat_data['Item_Type'].unique())

16
['Dairy' 'Soft Drinks' 'Meat' 'Fruits and Vegetables' 'Household'
 'Baking Goods' 'Snack Foods' 'Frozen Foods' 'Breakfast'
 'Health and Hygiene' 'Hard Drinks' 'Canned' 'Breads' 'Starchy Foods'
 'Others' 'Seafood']


**Frequency Encoding**

Frequency encoding replaces each label with its relative frequency, i.e. the fraction of rows in which that label appears.

In [8]:
fe = cat_data['Item_Type'].value_counts(ascending=True)/len(cat_data)  #count the frequency of labels
print(fe)

Seafood                  0.007509
Breakfast                0.012906
Starchy Foods            0.017365
Others                   0.019829
Hard Drinks              0.025109
Breads                   0.029450
Meat                     0.049865
Soft Drinks              0.052212
Health and Hygiene       0.061011
Baking Goods             0.076030
Canned                   0.076147
Dairy                    0.080019
Frozen Foods             0.100434
Household                0.106770
Snack Foods              0.140795
Fruits and Vegetables    0.144550
Name: Item_Type, dtype: float64


In [9]:
cat_data['Item_Type'].map(fe).head(10)  #map frequency to item type

0    0.080019
1    0.052212
2    0.049865
3    0.144550
4    0.106770
5    0.076030
6    0.140795
7    0.140795
8    0.100434
9    0.100434
Name: Item_Type, dtype: float64


This technique is useful when the frequency of a label is related to the target variable.
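One thing to keep in mind with frequency encoding is that two labels which occur equally often receive the same encoded value, so the model can no longer tell them apart. Below is a minimal sketch on a made-up toy column (the data and the names `toy`, `freq` and `collisions` are assumptions for illustration, not part of the Big Mart data).

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "Item_Type": ["Dairy", "Dairy", "Meat", "Meat",
                  "Seafood", "Dairy", "Meat", "Seafood"]
})

# normalize=True returns relative frequencies instead of raw counts
freq = toy["Item_Type"].value_counts(normalize=True)

# Replace every label with its relative frequency
toy["Item_Type_freq"] = toy["Item_Type"].map(freq)

# Labels that share a frequency collapse to the same encoded value
collisions = freq[freq.duplicated(keep=False)]

print(toy)
print(collisions)
```

Here Dairy and Meat both appear in 3 of the 8 rows, so both are encoded as 0.375.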

**Mean Encoding**

It is a very popular approach among Kagglers. We will not go into its technicalities here. We will just look at its use and its drawback.

We go through the following steps for mean encoding:

1. Group by the categorical variable and obtain the aggregated sum over the target

2. Group by the categorical variable and obtain the aggregated count over the target

3. Divide step 1 by step 2 (sum / count), which gives the mean of the target for each label
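As a quick check, the per-label sum divided by the per-label count is exactly what `groupby(...).mean()` returns. A minimal sketch on made-up numbers (the toy values and variable names are assumptions for illustration):

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "Outlet_Identifier": ["A", "A", "B", "B", "B"],
    "Item_Outlet_Sales": [100.0, 300.0, 50.0, 150.0, 250.0],
})

grouped = toy.groupby("Outlet_Identifier")["Item_Outlet_Sales"]

target_sum = grouped.sum()      # sum of the target per label
target_count = grouped.count()  # number of rows per label

# mean of the target per label = sum / count
mean_by_steps = target_sum / target_count

# the same result in a single call
mean_direct = grouped.mean()

print(mean_by_steps)
```

For label A this gives 400 / 2 = 200, and for label B it gives 450 / 3 = 150.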


In [10]:
#get the mean of target variable label wise
me = cat_data.groupby('Outlet_Identifier')['Item_Outlet_Sales'].mean()
print(me)

Outlet_Identifier
OUT010     339.351662
OUT013    2298.995256
OUT017    2340.675263
OUT018    1995.498739
OUT019     340.329723
OUT027    3694.038558
OUT035    2438.841866
OUT045    2192.384798
OUT046    2277.844267
OUT049    2348.354635
Name: Item_Outlet_Sales, dtype: float64


In [11]:
#map each label to the mean of the target variable for that label
cat_data['Outlet_Identifier'].map(me).head(10)

0    2348.354635
1    1995.498739
2    2348.354635
3     339.351662
4    2298.995256
5    1995.498739
6    2298.995256
7    3694.038558
8    2192.384798
9    2340.675263
Name: Outlet_Identifier, dtype: float64


Here we have mapped each label to the mean of the target variable for that label.

When a categorical variable has a large number of labels, mean encoding is a good way to go about encoding, as it does not create any new columns. The encoded values are also correlated with the target.

The disadvantage of mean encoding is that it is prone to overfitting, because the encoded values are computed from the target itself.
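One common way to reduce this overfitting, not used elsewhere in this notebook, is additive smoothing: the mean of each label is pulled towards the global mean, and labels with few rows are pulled the most. The sketch below uses made-up data, and the smoothing weight `m` is a hypothetical value chosen only for illustration.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "outlet": ["A", "A", "A", "A", "B"],
    "sales": [100.0, 120.0, 80.0, 100.0, 500.0],
})

global_mean = toy["sales"].mean()
stats = toy.groupby("outlet")["sales"].agg(["mean", "count"])

m = 2  # hypothetical smoothing weight; larger m pulls harder towards the global mean

# Weighted average of the label mean and the global mean
smoothed = (stats["count"] * stats["mean"] + m * global_mean) / (stats["count"] + m)

toy["outlet_me_smoothed"] = toy["outlet"].map(smoothed)

print(stats)
print(smoothed)
```

Label B has a single row, so its plain mean simply repeats that row's target value. After smoothing, its encoded value moves noticeably towards the global mean, while label A, which has more rows, moves less.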

In [12]:
#check value counts in Outlet_Size
cat_data['Outlet_Size'].value_counts()

Medium    2793
Small     2388
High       932
Name: Outlet_Size, dtype: int64

It is an ordinal variable, so we will make a dictionary and assign:

* Small-----> 0
* Medium -----> 1
* High -----> 2

In [13]:
#Check the null values
cat_data['Outlet_Size'].isnull().sum()

2410

In [14]:
#fill the null values with an 'Others' category for now
#assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
cat_data = cat_data.assign(Outlet_Size=cat_data['Outlet_Size'].fillna('Others'))


In [15]:
#prepare a dictionary to map ('Others' stands in for the missing values)
size_fe = {"Small" : 0, "Medium" : 1, "High" : 2, "Others" : 3}
cat_data['Outlet_Size'].map(size_fe).head(10)

0    1
1    1
2    1
3    3
4    2
5    1
6    2
7    1
8    3
9    3
Name: Outlet_Size, dtype: int64
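When mapping with a dictionary, any label that is missing from the dictionary silently becomes `NaN`. A small sanity check can catch this. The sketch below uses a made-up Series with a deliberately misspelled label (`sizes`, `size_map` and `unmapped` are names assumed for illustration).

```python
import pandas as pd

# Toy data, invented for illustration; "medium" is deliberately not in the dictionary
sizes = pd.Series(["Small", "Medium", "High", "Others", "medium"])

size_map = {"Small": 0, "Medium": 1, "High": 2, "Others": 3}
encoded = sizes.map(size_map)

# Labels that were not found in the dictionary come back as NaN
unmapped = sizes[encoded.isna()].unique()

print(encoded)
print("Labels without a mapping:", list(unmapped))
```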

In [16]:
cat_data['Outlet_Location_Type'].value_counts()

Tier 3    3350
Tier 2    2785
Tier 1    2388
Name: Outlet_Location_Type, dtype: int64

Here Tier 1, Tier 2 and Tier 3 are ordinal labels. We can use Label Encoding or map the values.

* Tier 3 -----> 1
* Tier 2 -----> 2
* Tier 1 -----> 3

In [17]:
location_fe = {"Tier 3" : 1, "Tier 2" : 2, "Tier 1" : 3}
cat_data['Outlet_Location_Type'].map(location_fe).head(10)

0    3
1    1
2    3
3    1
4    1
5    1
6    1
7    1
8    2
9    2
Name: Outlet_Location_Type, dtype: int64
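As an alternative to a hand-written dictionary, the same ordering can be expressed with an ordered pandas `Categorical`, whose integer codes follow the order we give. This is only a sketch on made-up values. The names `loc_type`, `order` and `codes` are assumptions for illustration.

```python
import pandas as pd

# Toy data, invented for illustration only
loc_type = pd.Series(["Tier 1", "Tier 3", "Tier 2", "Tier 3"])

# Lowest to highest, matching the mapping Tier 3 -> 1, Tier 2 -> 2, Tier 1 -> 3
order = ["Tier 3", "Tier 2", "Tier 1"]
cat = pd.Categorical(loc_type, categories=order, ordered=True)

# .codes starts at 0, so add 1 to get the same values as the dictionary
codes = pd.Series(cat.codes, index=loc_type.index) + 1

print(codes)
```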

In [18]:
#Check last variable and do the encoding
cat_data['Outlet_Type'].value_counts()

Supermarket Type1    5577
Grocery Store        1083
Supermarket Type3     935
Supermarket Type2     928
Name: Outlet_Type, dtype: int64


The labels here are nominal, so it is better to use a nominal encoding. Since we have only 4 labels, we can try one hot encoding or binary encoding.

In [19]:
#drop_first=True drops the first label (Grocery Store), so 4 labels give 3 columns
pd.get_dummies(cat_data['Outlet_Type'],drop_first=True).head()

Unnamed: 0,Supermarket Type1,Supermarket Type2,Supermarket Type3
0,1,0,0
1,0,1,0
2,1,0,0
3,0,0,0
4,1,0,0
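To see what `drop_first=True` does, we can compare the output with and without it. The dropped label is the first one in sorted order, and its rows are represented by zeros in all remaining columns. A minimal sketch on a made-up Series (the names `outlet_type`, `full`, `reduced` and `dropped` are assumptions for illustration):

```python
import pandas as pd

# Toy data, invented for illustration only
outlet_type = pd.Series(
    ["Supermarket Type1", "Grocery Store", "Supermarket Type3", "Supermarket Type2"],
    name="Outlet_Type",
)

full = pd.get_dummies(outlet_type, dtype=int)                      # one column per label
reduced = pd.get_dummies(outlet_type, drop_first=True, dtype=int)  # first label dropped

# The label that lost its column
dropped = set(full.columns) - set(reduced.columns)

print(full)
print(reduced)
print("Dropped label:", dropped)
```

The row for Grocery Store contains only zeros in `reduced`, which is why row 3 in the output above is all zeros.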



Next we will use all the encoding techniques we have learnt so far on different datasets, so that you get some practice and a better understanding of when to use which encoding.