
We are going to apply different encoding techniques to the Big Mart Sales dataset from Kaggle.

Link : https://www.kaggle.com/brijbhushannanda1979/bigmart-sales-data

Things to learn -

* Identifying data types as ordinal, nominal, and continuous.
* Applying different types of encoding.
* Challenges with different encoding techniques.
* Choosing the appropriate encoding techniques.

In [1]:
import pandas as pd #import pandas
import numpy as np #import numpy
from sklearn.preprocessing import LabelEncoder  #importing LabelEncoder


In [2]:
train = pd.read_csv('/content/datasets_9961_14084_Train.csv')

In [3]:
#check the head of dataset
train.head(5)

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,FDA15,9.3,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138
1,DRC01,5.92,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228
2,FDN15,17.5,Low Fat,0.01676,Meat,141.618,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.27
3,FDX07,19.2,Regular,0.0,Fruits and Vegetables,182.095,OUT010,1998,,Tier 3,Grocery Store,732.38
4,NCD19,8.93,Low Fat,0.0,Household,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052


In [5]:
#check the size of the dataset
print('Data has {} Number of rows'.format(train.shape[0]))
print('Data has {} Number of columns'.format(train.shape[1]))

Data has 8523 Number of rows
Data has 12 Number of columns


In [6]:
#check the information of the dataset
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8523 entries, 0 to 8522
Data columns (total 12 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Item_Identifier            8523 non-null   object 
 1   Item_Weight                7060 non-null   float64
 2   Item_Fat_Content           8523 non-null   object 
 3   Item_Visibility            8523 non-null   float64
 4   Item_Type                  8523 non-null   object 
 5   Item_MRP                   8523 non-null   float64
 6   Outlet_Identifier          8523 non-null   object 
 7   Outlet_Establishment_Year  8523 non-null   int64  
 8   Outlet_Size                6113 non-null   object 
 9   Outlet_Location_Type       8523 non-null   object 
 10  Outlet_Type                8523 non-null   object 
 11  Item_Outlet_Sales          8523 non-null   float64
dtypes: float64(4), int64(1), object(7)
memory usage: 799.2+ KB



As we can see here, we have 7 categorical variables and 5 numeric variables. The first task is to identify these categorical variables as nominal or ordinal.
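
The split between categorical and numeric columns can also be read off programmatically. The sketch below is only an illustration: `sample` is a hypothetical three-row frame built from a few of the columns shown above, not the real training data.

```python
import pandas as pd

# Hypothetical three-row sample using a few of the columns shown above.
sample = pd.DataFrame({
    "Item_Identifier": ["FDA15", "DRC01", "FDN15"],
    "Item_Weight": [9.3, 5.92, 17.5],
    "Outlet_Establishment_Year": [1999, 2009, 1999],
    "Outlet_Size": ["Medium", "Medium", "High"],
})

# Non-numeric columns are the categorical candidates.
cat_cols = sample.select_dtypes(exclude="number").columns.tolist()
# int64 and float64 columns both count as numeric.
num_cols = sample.select_dtypes(include="number").columns.tolist()

print(cat_cols)
print(num_cols)
```

Note that the dtype only tells us a column is categorical. Whether it is nominal or ordinal still has to be decided by looking at the values.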

In [8]:
#let's keep our categorical variables in one table (.copy() lets us add columns later without SettingWithCopyWarning)
cat_data = train[['Item_Identifier','Item_Fat_Content','Item_Type','Outlet_Identifier','Outlet_Size','Outlet_Location_Type','Outlet_Type']].copy()

In [10]:
cat_data.head()   #check the head of categorical data

Unnamed: 0,Item_Identifier,Item_Fat_Content,Item_Type,Outlet_Identifier,Outlet_Size,Outlet_Location_Type,Outlet_Type
0,FDA15,Low Fat,Dairy,OUT049,Medium,Tier 1,Supermarket Type1
1,DRC01,Regular,Soft Drinks,OUT018,Medium,Tier 3,Supermarket Type2
2,FDN15,Low Fat,Meat,OUT049,Medium,Tier 1,Supermarket Type1
3,FDX07,Regular,Fruits and Vegetables,OUT010,,Tier 3,Grocery Store
4,NCD19,Low Fat,Household,OUT013,High,Tier 3,Supermarket Type1


In [9]:
cat_data.apply(lambda x: x.nunique()) #check the number of unique values in each column

Item_Identifier         1559
Item_Fat_Content           5
Item_Type                 16
Outlet_Identifier         10
Outlet_Size                3
Outlet_Location_Type       3
Outlet_Type                4
dtype: int64

Now think about which encoding technique we can apply here.

* A first thought would be to apply one-hot encoding to the features that have 3-5 unique categories.
* But what if there is some ordering among the categories? So first we should identify which variables are nominal and which are ordinal.
* Let's check them one by one.

In [11]:
#check the top 10 frequency in Item_Identifier
cat_data['Item_Identifier'].value_counts().head(10)

FDW13    10
FDG33    10
FDG09     9
FDT07     9
NCY18     9
FDX04     9
NCL31     9
NCF42     9
FDX20     9
FDW26     9
Name: Item_Identifier, dtype: int64

The values in Item_Identifier have no ordering, as we can see, so this is a nominal categorical variable.

This column has 1559 unique values. If we one-hot encode it (dropping the first category), we get 1558 new features. We cannot feed that many sparse features into our model: it makes the model far more complex, and with only 8523 rows it is likely to hurt accuracy.

In [12]:
pd.get_dummies(cat_data['Item_Identifier'],drop_first=True)  #applying one hot encoding

Unnamed: 0,DRA24,DRA59,DRB01,DRB13,DRB24,DRB25,DRB48,DRC01,DRC12,DRC13,DRC24,DRC25,DRC27,DRC36,DRC49,DRD01,DRD12,DRD13,DRD15,DRD24,DRD25,DRD27,DRD37,DRD49,DRD60,DRE01,DRE03,DRE12,DRE13,DRE15,DRE25,DRE27,DRE37,DRE48,DRE49,DRE60,DRF01,DRF03,DRF13,DRF15,...,NCW05,NCW06,NCW17,NCW18,NCW29,NCW30,NCW41,NCW42,NCW53,NCW54,NCX05,NCX06,NCX17,NCX18,NCX29,NCX30,NCX41,NCX42,NCX53,NCX54,NCY05,NCY06,NCY17,NCY18,NCY29,NCY30,NCY41,NCY42,NCY53,NCY54,NCZ05,NCZ06,NCZ17,NCZ18,NCZ29,NCZ30,NCZ41,NCZ42,NCZ53,NCZ54
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8518,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
8519,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
8520,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
8521,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


As expected, a single feature has turned into 1558 features, so one-hot encoding is a bad idea here. We should avoid one-hot encoding when a feature has too many categories.
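
The column count follows directly from the number of categories. Here is a minimal sketch on a hypothetical five-row column (the values are borrowed from the head of the dataset, and the variable names are made up for illustration):

```python
import pandas as pd

# Hypothetical mini-column with 4 distinct identifiers.
items = pd.Series(["FDA15", "DRC01", "FDN15", "FDA15", "NCD19"], name="Item_Identifier")

full = pd.get_dummies(items)                      # one column per category
reduced = pd.get_dummies(items, drop_first=True)  # drops the first category in sorted order

print(full.shape)
print(reduced.shape)
print(list(reduced.columns))
```

With k categories we get k columns, or k - 1 with `drop_first=True`, which is why 1559 identifiers became 1558 features.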

So one-hot encoding has failed us here. The next candidate is label encoding, but we know that label encoding assigns integers to the categories in alphabetical order, which imposes a ranking that does not exist for a nominal variable. So we cannot apply label encoding either.
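
To see the alphabetical ranking in action, here is a small sketch with size labels. The label 'Small' is an assumption, since only 'Medium' and 'High' are visible in the rows printed above.

```python
from sklearn.preprocessing import LabelEncoder

# 'Small' is an assumed label; 'Medium' and 'High' appear in the data above.
sizes = ["Small", "Medium", "High", "Medium"]

le = LabelEncoder()
codes = le.fit_transform(sizes)

print(list(le.classes_))  # classes are stored in sorted order
print(codes.tolist())
```

The encoder stores its classes sorted, so High becomes 0, Medium 1, and Small 2. The integers carry an order, but it is the alphabetical one, not a meaningful one.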

That leaves one technique we have learnt previously: binary encoding. Let's apply it and see what we get.

In [15]:
#apply binary encoding on Item_Identifier
import category_encoders as ce                              #import category_encoders
encoder = ce.BinaryEncoder(cols=['Item_Identifier'])        #create an instance of the binary encoder
df_binary = encoder.fit_transform(cat_data)                 #fit and transform on cat_data
df_binary.head(5)




Unnamed: 0,Item_Identifier_0,Item_Identifier_1,Item_Identifier_2,Item_Identifier_3,Item_Identifier_4,Item_Identifier_5,Item_Identifier_6,Item_Identifier_7,Item_Identifier_8,Item_Identifier_9,Item_Identifier_10,Item_Identifier_11,Item_Fat_Content,Item_Type,Outlet_Identifier,Outlet_Size,Outlet_Location_Type,Outlet_Type
0,0,0,0,0,0,0,0,0,0,0,0,1,Low Fat,Dairy,OUT049,Medium,Tier 1,Supermarket Type1
1,0,0,0,0,0,0,0,0,0,0,1,0,Regular,Soft Drinks,OUT018,Medium,Tier 3,Supermarket Type2
2,0,0,0,0,0,0,0,0,0,0,1,1,Low Fat,Meat,OUT049,Medium,Tier 1,Supermarket Type1
3,0,0,0,0,0,0,0,0,0,1,0,0,Regular,Fruits and Vegetables,OUT010,,Tier 3,Grocery Store
4,0,0,0,0,0,0,0,0,0,1,0,1,Low Fat,Household,OUT013,High,Tier 3,Supermarket Type1


Binary encoding has given us 12 new features (Item_Identifier_0 to Item_Identifier_11), which is far fewer than the 1558 we got from one-hot encoding. So binary encoding has rescued us here.

We have applied binary encoding, but the result gives little intuition about how the new features are built. What happens is this: the labels are first ordinal-encoded as integers, each integer is then converted to its binary representation, and each digit of that binary string becomes its own feature.
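
These steps can be reproduced by hand. The following is a simplified sketch in plain pandas, not the actual category_encoders implementation. It assumes codes start at 1 in order of first appearance, which is what the output above suggests, and it uses only as many columns as the toy data needs.

```python
import pandas as pd

# Hypothetical mini-column; the last value repeats an earlier identifier.
items = pd.Series(["FDA15", "DRC01", "FDN15", "FDX07", "NCD19", "DRC01"])

# Step 1: ordinal codes, starting at 1 in order of first appearance.
codes = pd.factorize(items)[0] + 1

# Step 2: number of binary digits needed for the largest code.
n_bits = int(codes.max()).bit_length()

# Step 3: one column per binary digit, most significant digit first.
binary = pd.DataFrame(
    {f"Item_Identifier_{i}": (codes >> (n_bits - 1 - i)) & 1 for i in range(n_bits)}
)

print(binary)
```

Five distinct identifiers fit into 3 binary columns here, and the repeated identifier gets the same bit pattern both times. The number of columns grows with the logarithm of the number of categories, not with the number itself.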

There are other, more intuitive ways to reduce the number of features. We will look at them later.

**Encoding Item_Fat_Content**

In [16]:
#check the unique values 
cat_data['Item_Fat_Content'].unique()

array(['Low Fat', 'Regular', 'low fat', 'LF', 'reg'], dtype=object)

Here we have 5 unique values, but if we look closely there are only 2 distinct categories: Low Fat and Regular. The others are just abbreviations or lowercase variants of them.

In [None]:
low_fat = ['LF','low fat']
cat_data['Item_Fat_Content'] = cat_data['Item_Fat_Content'].replace(low_fat,'Low Fat') #replace 'LF' and 'low fat' with 'Low Fat'
cat_data['Item_Fat_Content'] = cat_data['Item_Fat_Content'].replace('reg','Regular')   #replace 'reg' with 'Regular'

In [17]:
cat_data['Item_Fat_Content'].unique()

array(['Low Fat', 'Regular'], dtype=object)

Now we have 2 categories in Item_Fat_Content, and there is an ordering between them: Low Fat has less fat content than Regular. So it is an ordinal variable.

In [18]:
#Apply LabelEncoder
le = LabelEncoder()
cat_data['Item_Fat_Content_temp'] = le.fit_transform(cat_data['Item_Fat_Content'])
print(cat_data['Item_Fat_Content'].head())
print(cat_data['Item_Fat_Content_temp'].head())

0    Low Fat
1    Regular
2    Low Fat
3    Regular
4    Low Fat
Name: Item_Fat_Content, dtype: object
0    0
1    1
2    0
3    1
4    0
Name: Item_Fat_Content_temp, dtype: int64




Here we only had 2 categories, 'Low Fat' and 'Regular', so label encoding has worked. It has mapped:

* Low Fat ------- 0
* Regular ------- 1

Here the alphabetical order happens to match the real order of the categories, but you will not always be this lucky.

**We can use map to do ordinal encoding**

In [19]:
#prepare a dict to map
mapping = {'Low Fat' : 0,'Regular': 1} #map Low Fat as 0 and Regular as 1
cat_data['Item_Fat_Content_temp1'] = cat_data['Item_Fat_Content'].map(mapping)
cat_data['Item_Fat_Content_temp1'].head()



0    0
1    1
2    0
3    1
4    0
Name: Item_Fat_Content_temp1, dtype: int64


Mapping is useful when the categories have an order, because we choose the integer for each category ourselves.
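
One thing to watch with `map`: any label that is missing from the dictionary becomes NaN, and the result is then stored as float. A minimal sketch on a hypothetical series that still contains the uncleaned labels 'LF' and 'reg':

```python
import pandas as pd

# Hypothetical series that still contains two uncleaned labels.
fat = pd.Series(["Low Fat", "Regular", "LF", "reg"])

mapping = {"Low Fat": 0, "Regular": 1}
encoded = fat.map(mapping)

print(encoded.isna().sum())  # number of labels the dict did not cover
print(encoded.dtype)
```

So it is worth checking `isna().sum()` after mapping. A non-zero count means some category was left out of the dictionary.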

**Use Pandas pd.factorize method.**

It encodes the categories in the order in which they first appear. If Low Fat appears first, it is encoded as 0 and Regular as 1; if Regular appears first, the codes are swapped.



In [20]:
factorized,index = pd.factorize(cat_data['Item_Fat_Content'])  #using pd.factorize it gives us factorized array and index values
print(factorized)
print(index)

[0 1 0 ... 0 1 0]
Index(['Low Fat', 'Regular'], dtype='object')
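
Because the codes depend on the order of appearance, the same category can get a different code on differently ordered data. The sketch below uses a hypothetical series in which 'Regular' comes first, and also shows the `sort=True` option of `pd.factorize`, which assigns codes in sorted order instead.

```python
import pandas as pd

# Hypothetical series where 'Regular' appears before 'Low Fat'.
fat = pd.Series(["Regular", "Low Fat", "Regular"])

codes, uniques = pd.factorize(fat)                            # order of first appearance
codes_sorted, uniques_sorted = pd.factorize(fat, sort=True)   # sorted order

print(codes.tolist(), uniques.tolist())
print(codes_sorted.tolist(), uniques_sorted.tolist())
```

Without `sort=True`, Regular is encoded as 0 here. With it, Low Fat gets 0 regardless of where it first appears.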


In this Notebook we have seen 2 new encoding techniques.

* Mapping
* pd.factorize

We have seen the usage of different methods, their advantages and disadvantages.

In [21]:
#Let's look at item type column
print(cat_data['Item_Type'].nunique())  #check number of unique values
print(cat_data['Item_Type'].unique())   #check the unique values

16
['Dairy' 'Soft Drinks' 'Meat' 'Fruits and Vegetables' 'Household'
 'Baking Goods' 'Snack Foods' 'Frozen Foods' 'Breakfast'
 'Health and Hygiene' 'Hard Drinks' 'Canned' 'Breads' 'Starchy Foods'
 'Others' 'Seafood']


There is no ordering among these categories, so Item_Type is a nominal variable and we have to apply a nominal encoding technique. I leave it up to you to decide which technique to apply; we will look at other techniques in the next notebook.