## Understanding Categorical Data

**Reference articles:** [1](https://medium.com/towards-data-science/understanding-feature-engineering-part-2-categorical-data-f54324193e63) & [2](https://towardsdatascience.com/categorical-encoding-using-label-encoding-and-one-hot-encoder-911ef77fb5bd)

Most machine learning algorithms **cannot work directly with categorical data**; the data needs some engineering and transformation before we can start modeling on it.

- **What is numerical data?**
 - Numeric data represents data in the form of *scalar values*, i.e., observations, recordings, measurements.
 - Continuous data

- **What is categorical data? What are labels?**
 - Categorical data represents *discrete values* which belong to a specific finite set of categories/classes.
 - These are also known as **labels** or classes in the context of attributes or variables which are to be predicted (i.e. response/target variables) 
 - These discrete values can be text or numeric in nature (or even unstructured data like images!). 

#### Types of categorical data
There are two major classes of categorical data:
1. **Nominal**
- No concept of ordering amongst values of that attribute
- e.g. weather
    - sunny, cloudy, windy, snowy etc.
- e.g. movie/music/video games genre, country names, cuisine

2. **Ordinal**
- Have some sense of order amongst values
- e.g. clothes size (XS, S, M, L...)
- e.g. education level (Undergraduate, Graduate, Post-Graduate...)
- e.g. employment level (staff, manager, associate...)
- e.g. shoe size
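
The distinction between the two classes can be made concrete with pandas categoricals. The sketch below is only an illustration: the toy values are hypothetical, and using `pd.CategoricalDtype` is an assumption of this example rather than something the rest of the notebook relies on.

```python
import pandas as pd

# Hypothetical toy values, not taken from any dataset in this notebook

# Nominal: categories carry no order
weather = pd.Series(['sunny', 'cloudy', 'windy', 'sunny'], dtype='category')

# Ordinal: the category order is declared explicitly
size_dtype = pd.CategoricalDtype(categories=['XS', 'S', 'M', 'L'], ordered=True)
sizes = pd.Series(['M', 'XS', 'L', 'S'], dtype=size_dtype)

print(weather.cat.ordered)        # False
print(sizes.cat.ordered)          # True

# Integer codes follow the declared order, so comparisons are meaningful
print(sizes.cat.codes.tolist())   # [2, 0, 3, 1]
print((sizes > 'S').tolist())     # [True, False, True, False]
```

Order comparisons such as `sizes > 'S'` are only defined for the ordered categorical; the nominal one has no notion of "greater than".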

**NOTE:**

Any standard feature engineering workflow involves some form of *transformation* of categorical values to numeric labels, followed by applying an **encoding scheme** to these values.

### Transforming Nominal Attributes
- Nominal attributes are discrete categorical values with no sense of order amongst them. 
- The idea is to transform these attributes into a numerical format that a model can work with.

We look at a dataset for **Video Game Sales.**

In [1]:
# Import libraries
import numpy as np
import pandas as pd

In [2]:
# Read dataset
games_df = pd.read_csv('vgsales.csv')
games_df.head()

Unnamed: 0,Rank,Name,Platform,Year,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales
0,1,Wii Sports,Wii,2006.0,Sports,Nintendo,41.49,29.02,3.77,8.46,82.74
1,2,Super Mario Bros.,NES,1985.0,Platform,Nintendo,29.08,3.58,6.81,0.77,40.24
2,3,Mario Kart Wii,Wii,2008.0,Racing,Nintendo,15.85,12.88,3.79,3.31,35.82
3,4,Wii Sports Resort,Wii,2009.0,Sports,Nintendo,15.75,11.01,3.28,2.96,33.0
4,5,Pokemon Red/Pokemon Blue,GB,1996.0,Role-Playing,Nintendo,11.27,8.89,10.22,1.0,31.37


In [5]:
# Subset
games_df[['Name', 'Platform', 'Year', 'Genre', 'Publisher']].iloc[:10]

Unnamed: 0,Name,Platform,Year,Genre,Publisher
0,Wii Sports,Wii,2006.0,Sports,Nintendo
1,Super Mario Bros.,NES,1985.0,Platform,Nintendo
2,Mario Kart Wii,Wii,2008.0,Racing,Nintendo
3,Wii Sports Resort,Wii,2009.0,Sports,Nintendo
4,Pokemon Red/Pokemon Blue,GB,1996.0,Role-Playing,Nintendo
5,Tetris,GB,1989.0,Puzzle,Nintendo
6,New Super Mario Bros.,DS,2006.0,Platform,Nintendo
7,Wii Play,Wii,2006.0,Misc,Nintendo
8,New Super Mario Bros. Wii,Wii,2009.0,Platform,Nintendo
9,Duck Hunt,NES,1984.0,Shooter,Nintendo


- Let's focus on the **Genre** attribute, which is a nominal attribute.
- **Platform** and **Publisher** are also nominal attributes.

In [18]:
# Unique genres in dataframe
genres = np.unique(games_df['Genre'])
print('Total number of unique genres: ', len(genres))
print(genres)

Total number of unique genres:  12
['Action' 'Adventure' 'Fighting' 'Misc' 'Platform' 'Puzzle' 'Racing'
 'Role-Playing' 'Shooter' 'Simulation' 'Sports' 'Strategy']


There are *12* unique video game genres.

Now, we generate a label encoding scheme that maps each category to a numeric value using `scikit-learn`.

In [24]:
# Label encoder
from sklearn.preprocessing import LabelEncoder

# Label encoder object
genre_label_encoder = LabelEncoder()

# Fit the label encoder and return encoded labels
genre_labels = genre_label_encoder.fit_transform(games_df['Genre'])
genre_labels

array([10,  4,  6, ...,  6,  5,  4])

In [22]:
# For each distinct class in 'genres' map to index (as dictionary)
genre_mappings = {index: label for index, label in enumerate(genre_label_encoder.classes_)}
genre_mappings

{0: 'Action',
 1: 'Adventure',
 2: 'Fighting',
 3: 'Misc',
 4: 'Platform',
 5: 'Puzzle',
 6: 'Racing',
 7: 'Role-Playing',
 8: 'Shooter',
 9: 'Simulation',
 10: 'Sports',
 11: 'Strategy'}

- A mapping scheme has been generated in which each distinct genre value is mapped to a number (its index) with the help of the `LabelEncoder` object *genre_label_encoder*.
- The transformed labels are stored in the *genre_labels* variable, which we can write back to our dataframe.

In [20]:
# Write the transformed labels to the games_df dataframe
games_df['GenreLabel'] = genre_labels
games_df[['Name', 'Platform', 'Year', 'Genre', 'GenreLabel', 'Publisher']].iloc[:10]

Unnamed: 0,Name,Platform,Year,Genre,GenreLabel,Publisher
0,Wii Sports,Wii,2006.0,Sports,10,Nintendo
1,Super Mario Bros.,NES,1985.0,Platform,4,Nintendo
2,Mario Kart Wii,Wii,2008.0,Racing,6,Nintendo
3,Wii Sports Resort,Wii,2009.0,Sports,10,Nintendo
4,Pokemon Red/Pokemon Blue,GB,1996.0,Role-Playing,7,Nintendo
5,Tetris,GB,1989.0,Puzzle,5,Nintendo
6,New Super Mario Bros.,DS,2006.0,Platform,4,Nintendo
7,Wii Play,Wii,2006.0,Misc,3,Nintendo
8,New Super Mario Bros. Wii,Wii,2009.0,Platform,4,Nintendo
9,Duck Hunt,NES,1984.0,Shooter,8,Nintendo
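
The mapping can also be applied in reverse. The following minimal sketch uses a hypothetical toy list in place of `games_df['Genre']` and shows that `LabelEncoder` assigns codes in sorted label order, and that `inverse_transform` recovers the original labels.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy genres standing in for games_df['Genre']
toy_genres = ['Sports', 'Platform', 'Racing', 'Sports', 'Puzzle']

encoder = LabelEncoder()
codes = encoder.fit_transform(toy_genres)

# classes_ is sorted, so codes follow the alphabetical order of the labels
print(list(encoder.classes_))   # ['Platform', 'Puzzle', 'Racing', 'Sports']
print(codes.tolist())           # [3, 0, 2, 3, 1]

# inverse_transform maps integer codes back to the original labels
decoded = encoder.inverse_transform(codes)
print(list(decoded))
```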


### Transforming Ordinal Attributes

- Ordinal attributes are categorical attributes with a sense of order amongst the values. 

We look at a dataset for **Pokemon** (available on Kaggle).

In [135]:
# Read data
pokemon_df = pd.read_csv('Pokemon.csv')
pokemon_df.head()

Unnamed: 0,#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,Gen 1,False
1,2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60,Gen 1,False
2,3,Venusaur,Grass,Poison,525,80,82,83,100,100,80,Gen 1,False
3,3,VenusaurMega Venusaur,Grass,Poison,625,80,100,123,122,120,80,Gen 1,False
4,4,Charmander,Fire,,309,39,52,43,60,50,65,Gen 1,False


The dataframe appears to be ordered by the `#` column (and hence by generation), so we [shuffle](https://datagy.io/pandas-shuffle-dataframe/#:~:text=One%20of%20the%20easiest%20ways,Dataframe%2C%20in%20a%20random%20order) the observations before working with them.

**NOTE:**

Why use **random_state**?
- We can reproduce our results by passing an integer to the **random_state** argument; the shuffled dataframe will then look the same each time the code is run.

In [136]:
# Shuffling a Pandas dataframe with .sample()
# Return the entire dataframe by passing in frac=1 ==> return 100% of the dataframe
pokemon_df = pokemon_df.sample(random_state=1, frac=1).reset_index(drop=True) # Reset index
pokemon_df.head()

Unnamed: 0,#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,6,CharizardMega Charizard Y,Fire,Flying,634,78,104,78,159,115,100,Gen 1,False
1,460,Abomasnow,Grass,Ice,494,90,92,75,92,85,60,Gen 4,False
2,161,Sentret,Normal,,215,35,46,34,35,45,20,Gen 2,False
3,667,Litleo,Fire,Normal,369,62,50,58,73,54,72,Gen 6,False
4,224,Octillery,Water,,480,75,105,75,105,75,45,Gen 2,False
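
As a quick check of the reproducibility claim, the sketch below shuffles a hypothetical toy dataframe twice with the same seed and confirms that both shuffles agree and that no rows are lost.

```python
import pandas as pd

# Hypothetical toy frame; any dataframe behaves the same way
toy_df = pd.DataFrame({'id': range(10)})

shuffle_a = toy_df.sample(frac=1, random_state=1).reset_index(drop=True)
shuffle_b = toy_df.sample(frac=1, random_state=1).reset_index(drop=True)

# Same seed, same permutation
print(shuffle_a.equals(shuffle_b))

# Every row is still present exactly once
print(sorted(shuffle_a['id'].tolist()) == list(range(10)))
```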


In [137]:
# No. of unique generations in dataframe
poke_gen = np.unique(pokemon_df['Generation'])
print('Total number of unique generations: ', len(poke_gen))
print(poke_gen)

Total number of unique generations:  6
['Gen 1' 'Gen 2' 'Gen 3' 'Gen 4' 'Gen 5' 'Gen 6']


Each Pokémon typically belongs to a *specific generation*, based on the video games in which it was first released. This attribute is ordinal because most Pokémon belonging to Generation 1 were introduced earlier in the video games than those of Generation 2, and so on.

**NOTE:**

No generic function can infer the intended order of an ordinal feature on its own; the order has to be supplied by us. So, we use a custom encoding scheme.

In [138]:
# Ordinal mapping
gen_ordinal_map = { 'Gen 1': 1, 'Gen 2': 2, 'Gen 3': 3, 'Gen 4': 4, 'Gen 5': 5, 'Gen 6': 6}
pokemon_df['GenerationLabel'] = pokemon_df['Generation'].map(gen_ordinal_map)
pokemon_df[['Name','Generation','GenerationLabel']].iloc[:10]

Unnamed: 0,Name,Generation,GenerationLabel
0,CharizardMega Charizard Y,Gen 1,1
1,Abomasnow,Gen 4,4
2,Sentret,Gen 2,2
3,Litleo,Gen 6,6
4,Octillery,Gen 2,2
5,Helioptile,Gen 6,6
6,Dialga,Gen 4,4
7,DeoxysDefense Forme,Gen 3,3
8,Rapidash,Gen 1,1
9,Swanna,Gen 5,5
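
An alternative to a hand-written dictionary is scikit-learn's `OrdinalEncoder`, which accepts the intended order through its `categories` parameter. The sketch below is illustrative only: the toy dataframe and the value `'Gen 10'` are hypothetical, included to show why relying on string sorting can go wrong for ordinal labels.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical values; 'Gen 10' is invented to show the sorting pitfall
toy = pd.DataFrame({'Generation': ['Gen 2', 'Gen 10', 'Gen 1']})

# LabelEncoder sorts labels as strings, so 'Gen 10' lands before 'Gen 2'
print(LabelEncoder().fit(toy['Generation']).classes_.tolist())
# ['Gen 1', 'Gen 10', 'Gen 2']

# OrdinalEncoder takes the intended order explicitly
order = ['Gen 1', 'Gen 2', 'Gen 10']
ordinal = OrdinalEncoder(categories=[order])
toy['GenerationRank'] = ordinal.fit_transform(toy[['Generation']]).ravel()
print(toy['GenerationRank'].tolist())   # [1.0, 2.0, 0.0]
```

Note that `OrdinalEncoder` numbers the categories from 0 and returns floats, whereas the dictionary above starts at 1; either is fine as long as the order is preserved.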


### Encoding Categorical Attributes

Typical feature engineering on categorical data involves two steps:
- a transformation process which we depicted in the previous sections
- a compulsory **encoding process** where we apply specific encoding schemes to create dummy variables or features for each category/values in a specific categorical attribute

**Why do we need to encode the categorical attributes?**
- Consider video game genres. If we fed the GenreLabel attribute directly into a machine learning model as a feature, the model would treat it as a continuous numeric feature and assume that 10 (Sports) is greater than 6 (Racing). That ordering is meaningless: the Sports genre is neither bigger nor smaller than Racing. These are simply different categories that cannot be compared directly.
- Hence we need an additional layer of encoding schemes where dummy features are created for each unique value or category out of all the distinct categories *per attribute*.
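
The problem can be seen numerically. The sketch below takes three label codes from the genre mapping shown earlier (Sports = 10, Racing = 6, Action = 0) and compares distances on the raw codes with distances between one-hot vectors. Building the one-hot vectors with `np.eye` is a shortcut chosen for this illustration.

```python
import numpy as np

# Label codes from the genre mapping above: Sports=10, Racing=6, Action=0
sports, racing, action = 10, 6, 0

# On raw integer codes, Sports looks "closer" to Racing than to Action
print(abs(sports - racing), abs(sports - action))   # 4 10

# With one-hot vectors, every pair of distinct genres is equally far apart
one_hot = np.eye(12)
d_sports_racing = np.linalg.norm(one_hot[sports] - one_hot[racing])
d_sports_action = np.linalg.norm(one_hot[sports] - one_hot[action])
print(np.isclose(d_sports_racing, d_sports_action))   # True
```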

#### 1. One-hot encoding

- Given the numeric representation of a categorical attribute with *m* labels (after transformation), the one-hot encoding scheme transforms the attribute into *m* binary features, each of which can only contain a value of 1 or 0.
- Each observation in the categorical feature is thus converted into a vector of size *m* with only one of the values as 1 (indicating it as active).
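
The definition above can be sketched from scratch in a few lines of NumPy. The labels below are hypothetical, and indexing an identity matrix is just one convenient way to build the vectors; it is not how the encoders used in this notebook are implemented.

```python
import numpy as np

# Hypothetical integer labels for an attribute with m = 4 categories
labels = np.array([2, 0, 3, 2])
m = 4

# Row i of the identity matrix is the one-hot vector for label i
one_hot = np.eye(m, dtype=int)[labels]
print(one_hot)
# [[0 0 1 0]
#  [1 0 0 0]
#  [0 0 0 1]
#  [0 0 1 0]]

# Exactly one active entry per observation
print(one_hot.sum(axis=1))   # [1 1 1 1]
```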

In [139]:
# Two attributes of interest 'Generation' and 'Legendary' status
pokemon_df[['Name','Generation','Legendary']].iloc[:10]

Unnamed: 0,Name,Generation,Legendary
0,CharizardMega Charizard Y,Gen 1,False
1,Abomasnow,Gen 4,False
2,Sentret,Gen 2,False
3,Litleo,Gen 6,False
4,Octillery,Gen 2,False
5,Helioptile,Gen 6,False
6,Dialga,Gen 4,True
7,DeoxysDefense Forme,Gen 3,True
8,Rapidash,Gen 1,False
9,Swanna,Gen 5,False


##### Transforming 'Generation' and 'Legendary' categorical attributes to numeric representation


In [140]:
from sklearn.preprocessing import LabelEncoder
# 'Generation' attribute

# Label encoder object
gen_label_encoder = LabelEncoder()
# Transform and map
# Fit the label encoder and return encoded labels
gen_labels = gen_label_encoder.fit_transform(pokemon_df['Generation'])
# print(gen_labels)
# Transformation as a new feature Gen_label
pokemon_df['Gen_label'] = gen_labels

In [141]:
# Label encoder object
legend_label_encoder = LabelEncoder()
# Transform and map
# Fit the label encoder and return encoded labels
legend_labels = legend_label_encoder.fit_transform(pokemon_df['Legendary'])
# Transformation as a new feature Legend_label
pokemon_df['Legend_label'] = legend_labels

In [142]:
subset_poke_df = pokemon_df[['Name', 'Generation', 'Gen_label', 'Legendary', 'Legend_label']]
subset_poke_df.iloc[:10]

Unnamed: 0,Name,Generation,Gen_label,Legendary,Legend_label
0,CharizardMega Charizard Y,Gen 1,0,False,0
1,Abomasnow,Gen 4,3,False,0
2,Sentret,Gen 2,1,False,0
3,Litleo,Gen 6,5,False,0
4,Octillery,Gen 2,1,False,0
5,Helioptile,Gen 6,5,False,0
6,Dialga,Gen 4,3,True,1
7,DeoxysDefense Forme,Gen 3,2,True,1
8,Rapidash,Gen 1,0,False,0
9,Swanna,Gen 5,4,False,0


**Gen_label** and **Legend_label** are the numeric representations of the categorical features.

We now apply the **one-hot encoding** scheme to these features.

In [143]:
from sklearn.preprocessing import OneHotEncoder

# Encode Generation Labels using one-hot encoding scheme 

# One-hot encoder object
gen_one_hot = OneHotEncoder()

# Fit the one-hot encoder and return encoded labels
gen_feature_arr = gen_one_hot.fit_transform(subset_poke_df[['Gen_label']]).toarray()

# Gen_feature labels from classes
# label_encoder.classes_ ==> holds the label for each class
gen_feature_labels = gen_label_encoder.classes_

# Put it together in a dataframe
gen_features = pd.DataFrame(gen_feature_arr, columns = gen_feature_labels)

In [144]:
# Encode Legendary Labels using one-hot encoding scheme

# One-hot encoder object
legend_one_hot = OneHotEncoder()

# Fit the one-hot encoder and return encoded labels
legend_feature_arr = legend_one_hot.fit_transform(subset_poke_df[['Legend_label']]).toarray()

# Legend_feature labels from classes
# label_encoder.classes_ ==> holds the label for each class
legend_feature_labels = [ 'Legendary_' + str(cls_label) for cls_label in legend_label_encoder.classes_ ]

# Put it together as a dataframe
legend_features = pd.DataFrame(legend_feature_arr, columns = legend_feature_labels)

In [145]:
ohe_pokemon_df = pd.concat([subset_poke_df, gen_features, legend_features], axis = 1) # Column-wise concat
ohe_pokemon_df[4:10]

Unnamed: 0,Name,Generation,Gen_label,Legendary,Legend_label,Gen 1,Gen 2,Gen 3,Gen 4,Gen 5,Gen 6,Legendary_False,Legendary_True
4,Octillery,Gen 2,1,False,0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
5,Helioptile,Gen 6,5,False,0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0
6,Dialga,Gen 4,3,True,1,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
7,DeoxysDefense Forme,Gen 3,2,True,1,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0
8,Rapidash,Gen 1,0,False,0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
9,Swanna,Gen 5,4,False,0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0


Thus you can see that **6** dummy variables or binary features have been created for *Generation* and **2** for *Legendary* since those are the total number of distinct categories in each of these attributes respectively. 

Active state of a category is indicated by the 1 value in one of these dummy variables.

##### Encoding new data to the dataframe
Suppose you built this encoding scheme on your training data and trained a model on it. Now you have new data whose features must be engineered in the same way before making predictions, as follows.

In [146]:
new_poke_df = pd.DataFrame([['PikaZoom', 'Gen 3', True], 
                           ['CharMyToast', 'Gen 4', False]],
                       columns=['Name', 'Generation', 'Legendary'])
new_poke_df

Unnamed: 0,Name,Generation,Legendary
0,PikaZoom,Gen 3,True
1,CharMyToast,Gen 4,False


**NOTE:**

We can use the **.transform()** function from `LabelEncoder` and `OneHotEncoder` objects on the new data.

In [147]:
# Transforming Generation attribute from new data
new_gen_labels = gen_label_encoder.transform(new_poke_df['Generation'])
new_poke_df['Gen_label'] = new_gen_labels

# Transforming Legendary attribute from new data
new_legend_labels = legend_label_encoder.transform(new_poke_df['Legendary'])
new_poke_df['Legend_label'] = new_legend_labels

# Display the new dataframe together with its label columns
new_poke_df

Unnamed: 0,Name,Generation,Legendary,Gen_label,Legend_label
0,PikaZoom,Gen 3,True,2,1
1,CharMyToast,Gen 4,False,3,0


We now have numerical labels, so we can apply the encoding scheme.

In [148]:
# Transforming Gen_label using one-hot encoding
new_gen_features_arr = gen_one_hot.transform(new_poke_df[['Gen_label']]).toarray()
new_gen_features = pd.DataFrame(new_gen_features_arr, columns = gen_feature_labels)

# Transforming Legend_label using one-hot encoding
new_legend_label_arr = legend_one_hot.transform(new_poke_df[['Legend_label']]).toarray()
new_legend_features = pd.DataFrame(new_legend_label_arr, columns = legend_feature_labels)

In [149]:
# Concat all original and new dataframe
new_poke_df_ohe = pd.concat([new_poke_df, new_gen_features, new_legend_features], axis = 1)
new_poke_df_ohe

Unnamed: 0,Name,Generation,Legendary,Gen_label,Legend_label,Gen 1,Gen 2,Gen 3,Gen 4,Gen 5,Gen 6,Legendary_False,Legendary_True
0,PikaZoom,Gen 3,True,2,1,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0
1,CharMyToast,Gen 4,False,3,0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
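
New data may also contain a category that was never seen during fitting. The sketch below, which assumes a reasonably recent scikit-learn in which `OneHotEncoder` accepts string columns directly, uses hypothetical toy data (including an invented `'Gen 7'`) to show the `handle_unknown='ignore'` option. With the default setting, transforming an unseen category raises an error instead.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training and new data; 'Gen 7' never appears in training
train = pd.DataFrame({'Generation': ['Gen 1', 'Gen 2', 'Gen 3']})
new = pd.DataFrame({'Generation': ['Gen 2', 'Gen 7']})

# handle_unknown='ignore' encodes unseen categories as an all-zero row
encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit(train[['Generation']])

encoded = encoder.transform(new[['Generation']]).toarray()
print(encoder.categories_[0].tolist())   # ['Gen 1', 'Gen 2', 'Gen 3']
print(encoded.tolist())                  # [[0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
```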


**NOTE:**

Another way to apply one-hot encoding is to use **get_dummies(...)** function from Pandas.

In [150]:
gen_onehot_features = pd.get_dummies(pokemon_df['Generation'])
# New dataframe
pd.concat([pokemon_df[['Name', 'Generation']], gen_onehot_features], axis=1).iloc[4:10]

Unnamed: 0,Name,Generation,Gen 1,Gen 2,Gen 3,Gen 4,Gen 5,Gen 6
4,Octillery,Gen 2,0,1,0,0,0,0
5,Helioptile,Gen 6,0,0,0,0,0,1
6,Dialga,Gen 4,0,0,0,1,0,0
7,DeoxysDefense Forme,Gen 3,0,0,1,0,0,0
8,Rapidash,Gen 1,1,0,0,0,0,0
9,Swanna,Gen 5,0,0,0,0,1,0
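
One caveat with `get_dummies` is that it only creates columns for the categories present in the data it is given, so new data can end up with fewer columns than the training data. A possible workaround, sketched below on hypothetical toy series, is to reindex the new dummy columns against the training columns.

```python
import pandas as pd

# Hypothetical training and new data
train = pd.Series(['Gen 1', 'Gen 2', 'Gen 3'], name='Generation')
new = pd.Series(['Gen 3', 'Gen 1'], name='Generation')

train_dummies = pd.get_dummies(train, dtype=int)
new_dummies = pd.get_dummies(new, dtype=int)
print(list(new_dummies.columns))   # ['Gen 1', 'Gen 3'], 'Gen 2' is missing

# Align to the training columns, filling absent categories with 0
aligned = new_dummies.reindex(columns=train_dummies.columns, fill_value=0)
print(aligned.values.tolist())     # [[0, 0, 1], [1, 0, 0]]
```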


#### 2. Dummy Coding Scheme

- The dummy coding scheme is similar to the one-hot encoding scheme, except that when it is applied to a categorical feature with *m* distinct labels, we get *m - 1* binary features.
- Thus each value of the categorical variable is converted into a vector of size *m - 1*.
- The extra feature is dropped entirely: if the category values range over {0, 1, …, m - 1}, either the 0th or the (m - 1)th feature column is removed, and the corresponding category is represented by a vector of all zeros.

In [153]:
# Dummy coding

# Drop first encoded feature column
gen_dummy_features = pd.get_dummies(pokemon_df['Generation'], drop_first = True)

# Putting dummy coded features with 'Name' and 'Generation'
pd.concat([pokemon_df[['Name', 'Generation']], gen_dummy_features], axis = 1).iloc[4:15]

Unnamed: 0,Name,Generation,Gen 2,Gen 3,Gen 4,Gen 5,Gen 6
4,Octillery,Gen 2,1,0,0,0,0
5,Helioptile,Gen 6,0,0,0,0,1
6,Dialga,Gen 4,0,0,1,0,0
7,DeoxysDefense Forme,Gen 3,0,1,0,0,0
8,Rapidash,Gen 1,0,0,0,0,0
9,Swanna,Gen 5,0,0,0,1,0
10,Tyrogue,Gen 2,1,0,0,0,0
11,Exeggutor,Gen 1,0,0,0,0,0
12,Silcoon,Gen 3,0,1,0,0,0
13,SharpedoMega Sharpedo,Gen 3,0,1,0,0,0


In [155]:
# Dropping last encoded feature column
gen_onehot_features_new = pd.get_dummies(pokemon_df['Generation'])
gen_dummy_features_new = gen_onehot_features_new.iloc[:,:-1]

# Putting dummy coded features with 'Name' and 'Generation'
pd.concat([pokemon_df[['Name', 'Generation']], gen_dummy_features_new], axis =1).iloc[4:15]

Unnamed: 0,Name,Generation,Gen 1,Gen 2,Gen 3,Gen 4,Gen 5
4,Octillery,Gen 2,0,1,0,0,0
5,Helioptile,Gen 6,0,0,0,0,0
6,Dialga,Gen 4,0,0,0,1,0
7,DeoxysDefense Forme,Gen 3,0,0,1,0,0
8,Rapidash,Gen 1,1,0,0,0,0
9,Swanna,Gen 5,0,0,0,0,1
10,Tyrogue,Gen 2,0,1,0,0,0
11,Exeggutor,Gen 1,1,0,0,0,0
12,Silcoon,Gen 3,0,0,1,0,0
13,SharpedoMega Sharpedo,Gen 3,0,0,1,0,0
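
A common motivation for dropping one column, which is an addition here and not stated in the text above, is that the *m* one-hot columns always sum to 1 and are therefore redundant alongside an intercept term in a linear model. The sketch below checks this with a matrix rank on hypothetical labels.

```python
import numpy as np

# Hypothetical labels for an attribute with m = 3 categories
labels = np.array([0, 1, 2, 1, 0])
one_hot = np.eye(3)[labels]
intercept = np.ones((len(labels), 1))

# Full one-hot plus an intercept: 4 columns, but only rank 3
full = np.hstack([intercept, one_hot])
print(full.shape[1], np.linalg.matrix_rank(full))      # 4 3

# Dummy coding (first column dropped): 3 columns, rank 3
dummy = np.hstack([intercept, one_hot[:, 1:]])
print(dummy.shape[1], np.linalg.matrix_rank(dummy))    # 3 3
```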


### Curse of Dimensionality

- Categorical data starts creating problems when the number of distinct categories in a feature becomes very large
- For any categorical feature of **m** distinct labels, you get m separate features
    - This increases the size of the feature set, causing problems such as storage issues and higher time, space and memory costs for model training
- [Curse of dimensionality](https://towardsdatascience.com/the-curse-of-dimensionality-50dc6e49aa1e) also comes into play
    - With a huge number of features and not enough representative samples, model performance starts degrading and often leads to overfitting
- If we have more features than observations, then we run the risk of massively overfitting our model
- With certain machine learning algorithms such as clustering algorithms, when we have too many features, observations become harder to cluster
    - Too many dimensions cause every observation in the dataset to appear nearly equidistant from all the others
    - And because clustering uses a distance measure such as Euclidean distance to quantify the similarity between observations, this is a big problem
    - If the distances are all approximately equal, then all the observations appear equally alike (as well as equally different), and no meaningful clusters can be formed
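
The distance effect described above can be observed in a small simulation. The sketch below draws uniformly random points (an assumption made purely for illustration) and measures how much the farthest and nearest distances from one point differ, relative to the nearest distance. The seed, point count and dimensions are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily

def distance_contrast(n_points, n_dims):
    """Return (max - min) / min over distances from one point to the rest."""
    points = rng.random((n_points, n_dims))
    dists = np.linalg.norm(points[1:] - points[0], axis=1)
    return (dists.max() - dists.min()) / dists.min()

# The contrast shrinks as the number of dimensions grows
for dims in (2, 10, 100, 1000):
    print(dims, round(distance_contrast(500, dims), 3))
```

A contrast close to zero means the nearest and farthest neighbours are almost the same distance away, which is what makes distance-based clustering unreliable in high dimensions.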