# Exploratory Data Analysis (EDA) - Encoding Techniques

This notebook dives deeper into different encoding techniques. Encoding is the process of creating a code for categorical values.  

In [1]:
# Package imports
import os 
import pandas as pd 
import numpy as np

In [42]:
# Read dataset 
df = pd.read_excel("/Users/umreenimam/Documents/BMCC/Lesson Materials/Weeks 3 - 4/Week 4/Lab/public_emdat_project.xlsx")

##### Encoding Categorical Variables

There are few ways to encode categorical variables. Below you will see an example of each way. 

<hr>

##### Find and Replace

In [10]:
# Lets create a copy of the dataframe for this example
df_encode_copy = df.copy()

In [11]:
# FIND AND REPLACE 
# Find a categorical variable column and manually replace them with numbers
df_encode_copy["Disaster Group"].value_counts()

Disaster Group
Natural          10045
Technological     5739
Name: count, dtype: int64

In [13]:
# Create a dictionary that contains the "translation" from categorical value to numerical value
cat_to_num = {"Disaster Group": 
              {"Natural": 0, 
               "Technological": 1
               }
            }

In [14]:
# Replace values using df.replace()
df_encode_copy = df_encode_copy.replace(cat_to_num)
df_encode_copy.head()

Unnamed: 0,DisNo.,Historic,Classification Key,Disaster Group,Disaster Subgroup,Disaster Type,Disaster Subtype,External IDs,Event Name,ISO,...,Reconstruction Costs ('000 US$),"Reconstruction Costs, Adjusted ('000 US$)",Insured Damage ('000 US$),"Insured Damage, Adjusted ('000 US$)",Total Damage,"Total Damage, Adjusted",CPI,Admin Units,Entry Date,Last Update
0,1999-9388-DJI,No,nat-cli-dro-dro,0,Climatological,Drought,Drought,,,DJI,...,,,,,,,58.111474,"[{""adm1_code"":1093,""adm1_name"":""Ali Sabieh""},{...",2006-03-01,2023-09-25
1,1999-9388-SDN,No,nat-cli-dro-dro,0,Climatological,Drought,Drought,,,SDN,...,,,,,,,56.514291,"[{""adm1_code"":2757,""adm1_name"":""Northern Darfu...",2006-03-08,2023-09-25
2,1999-9388-SOM,No,nat-cli-dro-dro,0,Climatological,Drought,Drought,,,SOM,...,,,,,,,56.514291,"[{""adm1_code"":2691,""adm1_name"":""Bay""},{""adm1_c...",2006-03-08,2023-09-25
3,2000-0001-AGO,No,tec-tra-roa-roa,1,Transport,Road,Road,,,AGO,...,,,,,,,56.514291,,2004-10-27,2023-09-25
4,2000-0002-AGO,No,nat-hyd-flo-riv,0,Hydrological,Flood,Riverine flood,,,AGO,...,,,,,10000.0,17695.0,56.514291,"[{""adm2_code"":4214,""adm2_name"":""Baia Farta""},{...",2005-02-03,2023-09-25


The values in "Disaster Group" have been replaced with either a 0 or 1 according to the dictionary

#### Issues

Issues with this way of encoding is the time it would take find and create multiple dictionaries to encode. Let's take a look at another way

<hr>

##### Label Encoding

In [15]:
# Use pandas to convert a column to a category
# Then use cat.codes to convert those categories to numbers

# Copy original df
df_label_encode = df.copy()
df_label_encode["Disaster Type"] = df_label_encode["Disaster Type"].astype('category')
df_label_encode.dtypes

DisNo.                                               object
Historic                                             object
Classification Key                                   object
Disaster Group                                       object
Disaster Subgroup                                    object
Disaster Type                                      category
Disaster Subtype                                     object
External IDs                                         object
Event Name                                           object
ISO                                                  object
Country                                              object
Subregion                                            object
Region                                               object
Location                                             object
Origin                                               object
Associated Types                                     object
OFDA/BHA Response                       

In [16]:
# Now that Disaster Type has been 'categorized'
# Use cat.codes to convert column to numbers
df_label_encode["Disaster Type"] = df_label_encode["Disaster Type"].cat.codes
df_label_encode.head()

Unnamed: 0,DisNo.,Historic,Classification Key,Disaster Group,Disaster Subgroup,Disaster Type,Disaster Subtype,External IDs,Event Name,ISO,...,Reconstruction Costs ('000 US$),"Reconstruction Costs, Adjusted ('000 US$)",Insured Damage ('000 US$),"Insured Damage, Adjusted ('000 US$)",Total Damage,"Total Damage, Adjusted",CPI,Admin Units,Entry Date,Last Update
0,1999-9388-DJI,No,nat-cli-dro-dro,Natural,Climatological,5,Drought,,,DJI,...,,,,,,,58.111474,"[{""adm1_code"":1093,""adm1_name"":""Ali Sabieh""},{...",2006-03-01,2023-09-25
1,1999-9388-SDN,No,nat-cli-dro-dro,Natural,Climatological,5,Drought,,,SDN,...,,,,,,,56.514291,"[{""adm1_code"":2757,""adm1_name"":""Northern Darfu...",2006-03-08,2023-09-25
2,1999-9388-SOM,No,nat-cli-dro-dro,Natural,Climatological,5,Drought,,,SOM,...,,,,,,,56.514291,"[{""adm1_code"":2691,""adm1_name"":""Bay""},{""adm1_c...",2006-03-08,2023-09-25
3,2000-0001-AGO,No,tec-tra-roa-roa,Technological,Transport,26,Road,,,AGO,...,,,,,,,56.514291,,2004-10-27,2023-09-25
4,2000-0002-AGO,No,nat-hyd-flo-riv,Natural,Hydrological,13,Riverine flood,,,AGO,...,,,,,10000.0,17695.0,56.514291,"[{""adm2_code"":4214,""adm2_name"":""Baia Farta""},{...",2005-02-03,2023-09-25


The 'Disaster Type' column has now been encoded as numerical variables. The benefit of this approach is that you get to use the pandas category data type which cuts down on data size, its ability to order, and plotting support 

<hr>

##### One Hot Encoding

This approach has a benefit over Label Encoding due to the way it encodes categorical variables. The issue with Label Encoding is that the numerical variables can be misinterpreted by a machine learning algorithm. When you know that your data will be eventually fed into an algorithm for further testing, try the One Hot Encoding approach. 

One Hot Encoding creates a new column for each category and assigns each value a 0 or 1 (True or False). This allows for not weighting a value improperly, but does make your dataset larger. 

In [18]:
# Use pd.get_dummies() function for One Hot Encoding
# Use the 'prefix' argument to change the label
get_dummies_df = pd.get_dummies(df, columns = ["Disaster Subgroup"], prefix = ["subgroup"])
get_dummies_df.head()

Unnamed: 0,DisNo.,Historic,Classification Key,Disaster Group,Disaster Type,Disaster Subtype,External IDs,Event Name,ISO,Country,...,Last Update,subgroup_Biological,subgroup_Climatological,subgroup_Extra-terrestrial,subgroup_Geophysical,subgroup_Hydrological,subgroup_Industrial accident,subgroup_Meteorological,subgroup_Miscellaneous accident,subgroup_Transport
0,1999-9388-DJI,No,nat-cli-dro-dro,Natural,Drought,Drought,,,DJI,Djibouti,...,2023-09-25,False,True,False,False,False,False,False,False,False
1,1999-9388-SDN,No,nat-cli-dro-dro,Natural,Drought,Drought,,,SDN,Sudan,...,2023-09-25,False,True,False,False,False,False,False,False,False
2,1999-9388-SOM,No,nat-cli-dro-dro,Natural,Drought,Drought,,,SOM,Somalia,...,2023-09-25,False,True,False,False,False,False,False,False,False
3,2000-0001-AGO,No,tec-tra-roa-roa,Technological,Road,Road,,,AGO,Angola,...,2023-09-25,False,False,False,False,False,False,False,False,True
4,2000-0002-AGO,No,nat-hyd-flo-riv,Natural,Flood,Riverine flood,,,AGO,Angola,...,2023-09-25,False,False,False,False,True,False,False,False,False


The categorical columns have now been successfully transformed into either True or False values, however the dataset has become much larger. Since get_dummies() returns the full dataframe, it will be beneficial to filter out objects using select_dtypes when ready to perform final analyses.

<hr>

##### Custom Binary Encoding

A combination of label encoding and one hot encoding. This approach uses np.where(), a conditional function. 

*np.where(condtion, value if true, value if false)*

In [20]:
df["Disaster Group Code"] = np.where(df["Disaster Group"] == "Natural", 1, 0)
df.head()

Unnamed: 0,DisNo.,Historic,Classification Key,Disaster Group,Disaster Subgroup,Disaster Type,Disaster Subtype,External IDs,Event Name,ISO,...,"Reconstruction Costs, Adjusted ('000 US$)",Insured Damage ('000 US$),"Insured Damage, Adjusted ('000 US$)",Total Damage,"Total Damage, Adjusted",CPI,Admin Units,Entry Date,Last Update,Disaster Group Code
0,1999-9388-DJI,No,nat-cli-dro-dro,Natural,Climatological,Drought,Drought,,,DJI,...,,,,,,58.111474,"[{""adm1_code"":1093,""adm1_name"":""Ali Sabieh""},{...",2006-03-01,2023-09-25,1
1,1999-9388-SDN,No,nat-cli-dro-dro,Natural,Climatological,Drought,Drought,,,SDN,...,,,,,,56.514291,"[{""adm1_code"":2757,""adm1_name"":""Northern Darfu...",2006-03-08,2023-09-25,1
2,1999-9388-SOM,No,nat-cli-dro-dro,Natural,Climatological,Drought,Drought,,,SOM,...,,,,,,56.514291,"[{""adm1_code"":2691,""adm1_name"":""Bay""},{""adm1_c...",2006-03-08,2023-09-25,1
3,2000-0001-AGO,No,tec-tra-roa-roa,Technological,Transport,Road,Road,,,AGO,...,,,,,,56.514291,,2004-10-27,2023-09-25,0
4,2000-0002-AGO,No,nat-hyd-flo-riv,Natural,Hydrological,Flood,Riverine flood,,,AGO,...,,,,10000.0,17695.0,56.514291,"[{""adm2_code"":4214,""adm2_name"":""Baia Farta""},{...",2005-02-03,2023-09-25,1


This is a useful approach to consolidate yes/no values in a column, and highlights how important domain knowledge is to solving the problem at hand.

<hr>

##### Using Scikit-Learn

Another approach is uing the library, scikit-learn, a very popular data science library for Python. It is useful for encoding when trying to build a predictive model.

To install the library, I suggest creating a new environment to avoid conflicts with other packages. 

*pip install -U scikit-learn* 

In [21]:
# Import packages
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder

In [23]:
# The equivalent of label encoding in scikit-learn is OrdinalEncoder 
# Create an instance of the OrdinalEncoder
# Create a new column, 'Disaster Type Code' by using the fit_transform() function to create numerical values for each Disaster Type
# View the original 'Disaster Type' and 'Disaster Type Code' to see the new encoded values
ord_encoder = OrdinalEncoder()
df["Disaster Type Code"] = ord_encoder.fit_transform(df[["Disaster Type"]])
df[["Disaster Type", "Disaster Type Code"]].head(10)

Unnamed: 0,Disaster Type,Disaster Type Code
0,Drought,5.0
1,Drought,5.0
2,Drought,5.0
3,Road,26.0
4,Flood,13.0
5,Extreme temperature,10.0
6,Road,26.0
7,Road,26.0
8,Fire (Miscellaneous),12.0
9,Road,26.0


In [43]:
# This next example uses OneHotEncoder to encode values
# Create an instance of OneHotEncoder()
# Use fit_transform() to transform categorical values to numerical
onehot_encode_copy = df.copy()
onehot_encoder = OneHotEncoder()
onehot_results = onehot_encoder.fit_transform(onehot_encode_copy[["Disaster Type"]])

# Put the results into a dataframe for viewing 
onehot_df = pd.DataFrame(onehot_results.toarray(), columns = onehot_encode_copy["Disaster Type"].unique())

In [46]:
# Join dataframe back to original dataframe
onehot_encode_copy = onehot_encode_copy.join(onehot_df)
df.head(10)

Unnamed: 0,DisNo.,Historic,Classification Key,Disaster Group,Disaster Subgroup,Disaster Type,Disaster Subtype,External IDs,Event Name,ISO,...,Infestation,Miscellaneous accident (General),Poisoning,Mass movement (dry),Industrial accident (General),Radiation,Oil spill,Impact,Animal incident,Glacial lake outburst flood
0,1999-9388-DJI,No,nat-cli-dro-dro,Natural,Climatological,Drought,Drought,,,DJI,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1999-9388-SDN,No,nat-cli-dro-dro,Natural,Climatological,Drought,Drought,,,SDN,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,1999-9388-SOM,No,nat-cli-dro-dro,Natural,Climatological,Drought,Drought,,,SOM,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,2000-0001-AGO,No,tec-tra-roa-roa,Technological,Transport,Road,Road,,,AGO,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
4,2000-0002-AGO,No,nat-hyd-flo-riv,Natural,Hydrological,Flood,Riverine flood,,,AGO,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,2000-0003-BGD,No,nat-met-ext-col,Natural,Meteorological,Extreme temperature,Cold wave,,,BGD,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,2000-0004-BRA,No,tec-tra-roa-roa,Technological,Transport,Road,Road,,,BRA,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
7,2000-0005-CHN,No,tec-tra-roa-roa,Technological,Transport,Road,Road,,,CHN,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
8,2000-0006-CHN,No,tec-mis-fir-fir,Technological,Miscellaneous accident,Fire (Miscellaneous),Fire (Miscellaneous),,Hotel,CHN,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,2000-0007-EGY,No,tec-tra-roa-roa,Technological,Transport,Road,Road,,,EGY,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0


These two methods are useful when building a predictive model, but the pandas syntax may be a little simpler and straightforward. 

There are many different ways to encode categorical values to numerical values, but your ultimate goal will determine the best method for you to use.

For a more advance way of encoding, you can take a look at the library *category_encoders*. This library provides many of the same approaches to encoding, just in a different way.