# Unit 3: EDA (Exploratory Data Analysis)
---

## Lesson 3.5: Handling Categorical Values

In [57]:
# First we will load all the necessary libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [53]:
# Loading the Dataset
df = pd.read_csv("Data/titanic_leapcode_train.csv")

In [54]:
# Fill missing values with each column's most frequent value so they don't cause issues later
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='most_frequent')
df = pd.DataFrame(imputer.fit_transform(df), columns = df.columns)

df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,B96 B98,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,B96 B98,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,B96 B98,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0,B96 B98,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,24.0,1,2,W./C. 6607,23.45,B96 B98,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0,C148,C


In [43]:
# Checking if missing values are handled
df.isnull().sum()

PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Cabin          0
Embarked       0
dtype: int64
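
One side effect of imputing every column at once is worth checking: the imputer returns a single array, so numeric columns can come back with the `object` dtype. The sketch below uses a small made-up frame (`toy` is hypothetical, not the Titanic data) and pandas' `infer_objects()` as one possible way to restore numeric dtypes; the notebook above does not do this.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with one numeric and one text column, each with a gap
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 38.0, 22.0],
    "Embarked": ["S", "C", np.nan, "S"],
})

# Same imputation pattern as in the cell above
imputer = SimpleImputer(strategy="most_frequent")
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)
print(filled.dtypes)

# Let pandas re-infer the best dtype for each column
restored = filled.infer_objects()
print(restored.dtypes)
```

Comparing the two printed dtype listings shows whether any numeric column needs to be converted back before plotting or modelling.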

### Why Can't We Use Categorical Values Directly?

Most machine learning models **only work with numbers**; they can't interpret text values like `"male"` or `"Braund, Mr. Owen Harris"`.

If we pass a column of text to a scikit-learn model, it will usually raise an error because the text can't be converted to numbers.

So we need to **transform** (convert) categorical values into numbers.
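
As a quick illustration of that error, here is a minimal sketch that tries to fit a model on a text column. The frame `toy` and the choice of `LogisticRegression` are assumptions made for this example only; any scikit-learn estimator that expects numeric input should behave similarly.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: one text feature and a binary target
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Survived": [0, 1, 1, 0],
})

model = LogisticRegression()
try:
    # The feature column still contains strings, so fitting should fail
    model.fit(toy[["Sex"]], toy["Survived"])
    error_message = None
except ValueError as err:
    error_message = str(err)

print(error_message)
```

The printed message should mention that a string could not be converted to a float, which is exactly the problem that encoding solves.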

---

### How to Handle Categorical Values

There are two main methods:

### 1. **Label Encoding**

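We convert each category to a whole number. scikit-learn's `LabelEncoder` assigns the numbers in alphabetical order of the categories.

Example: `["male", "female"]` -> `[1, 0]`

Example: `["S", "C", "Q"]` -> `[2, 0, 1]`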

In [56]:
from sklearn.preprocessing import LabelEncoder

# defining the LabelEncoder
le = LabelEncoder()

# fitting the encoder on the "Sex" column and transforming it in one step
df["Sex"] = le.fit_transform(df["Sex"])
df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Embarked_C,Embarked_Q,Embarked_S
0,1,0,3,"Braund, Mr. Owen Harris",1,22.0,1,0,A/5 21171,7.25,B96 B98,S,0.0,0.0,1.0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",0,38.0,1,0,PC 17599,71.2833,C85,C,1.0,0.0,0.0
2,3,1,3,"Heikkinen, Miss. Laina",0,26.0,0,0,STON/O2. 3101282,7.925,B96 B98,S,0.0,0.0,1.0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",0,35.0,1,0,113803,53.1,C123,S,0.0,0.0,1.0
4,5,0,3,"Allen, Mr. William Henry",1,35.0,0,0,373450,8.05,B96 B98,S,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",1,27.0,0,0,211536,13.0,B96 B98,S,0.0,0.0,1.0
887,888,1,1,"Graham, Miss. Margaret Edith",0,19.0,0,0,112053,30.0,B42,S,0.0,0.0,1.0
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",0,24.0,1,2,W./C. 6607,23.45,B96 B98,S,0.0,0.0,1.0
889,890,1,1,"Behr, Mr. Karl Howell",1,26.0,0,0,111369,30.0,C148,C,1.0,0.0,0.0


Instead of `"male"` or `"female"`, the "Sex" column now contains 1s and 0s: the LabelEncoder mapped `"female"` to 0 and `"male"` to 1.
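
To check which number was given to which category, the fitted encoder keeps the categories in its `classes_` attribute, and `inverse_transform` converts the numbers back to text. The sketch below uses a short made-up list rather than the Titanic column, and the names `le_demo` and `mapping` are chosen for this example only.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample values standing in for the "Sex" column
le_demo = LabelEncoder()
codes = le_demo.fit_transform(["male", "female", "female", "male"])

# classes_ is sorted, and each category's position is its code
mapping = {label: int(code) for code, label in enumerate(le_demo.classes_)}
print(mapping)

# Turn the codes back into the original text labels
decoded = le_demo.inverse_transform(codes)
print(list(decoded))
```

Keeping the fitted encoder around is useful because the same mapping can then be applied to new data with `transform`.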

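### 2. **One-Hot Encoding**

We create one new column per category. In each row, the column that matches the row's category is set to 1 and all the others are set to 0.

<img src="../Photos/OneHotEncoding.png" width="600">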

In [55]:
from sklearn.preprocessing import OneHotEncoder

# One-hot encoder that returns a regular (dense) NumPy array
encoder = OneHotEncoder(sparse_output=False)

# Fitting the encoder and transforming the column into an array with one column per category
encoded_array = encoder.fit_transform(df[["Embarked"]])

# Wrapping the array in a dataframe, naming the new columns and reusing df's index
encoded_df = pd.DataFrame(
    encoded_array,
    columns=encoder.get_feature_names_out(["Embarked"]),
    index=df.index
)

# Concatenating the new dataframe with the old one
df = pd.concat([df, encoded_df], axis=1)

df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Embarked_C,Embarked_Q,Embarked_S
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,B96 B98,S,0.0,0.0,1.0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,1.0,0.0,0.0
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,B96 B98,S,0.0,0.0,1.0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,0.0,0.0,1.0
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,B96 B98,S,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0,B96 B98,S,0.0,0.0,1.0
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0,B42,S,0.0,0.0,1.0
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,24.0,1,2,W./C. 6607,23.45,B96 B98,S,0.0,0.0,1.0
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0,C148,C,1.0,0.0,0.0


The OneHotEncoder created the columns Embarked_C, Embarked_Q and Embarked_S. If Embarked = S, then Embarked_S will equal 1 and the other two columns will equal 0. The same goes for the other values.
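
For quick exploration, pandas offers `get_dummies`, which produces the same kind of columns in one call. This is an alternative to the `OneHotEncoder` used above, not the method from this lesson, and the series `ports` is a made-up sample.

```python
import pandas as pd

# Hypothetical sample values standing in for the "Embarked" column
ports = pd.Series(["S", "C", "S", "Q"], name="Embarked")

# One column per category, named with the "Embarked" prefix
dummies = pd.get_dummies(ports, prefix="Embarked", dtype=float)
print(dummies)

# Every row has exactly one column set to 1
row_sums = dummies.sum(axis=1)
print(row_sums.tolist())
```

Unlike a fitted `OneHotEncoder`, `get_dummies` does not remember the categories it saw, so it is less convenient when the same encoding must be applied to new data later.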

---

# When Do We Use Them?



## Label Encoding

### Use Label Encoding When:
- The column has **only two categories** (like `"Yes"` / `"No"`, `"Male"` / `"Female"`)
- Or the **order of categories matters** (e.g., `"Low"`, `"Medium"`, `"High"`)

### Don't Use It When:
- The column has more than two categories and there is no order to the values (e.g., `"Sunny"`, `"Rainy"`, `"Cloudy"`)
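
One caveat when the order matters: `LabelEncoder` numbers categories alphabetically, which is usually not the real order. The sketch below compares it with scikit-learn's `OrdinalEncoder`, which accepts an explicit order. `OrdinalEncoder` is not used elsewhere in this lesson, and the `Level` column is made-up data.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical column with a natural order: Low < Medium < High
levels = pd.DataFrame({"Level": ["Low", "High", "Medium", "Low"]})

# Alphabetical codes: High=0, Low=1, Medium=2 (not the real order)
alphabetical = LabelEncoder().fit_transform(levels["Level"])
print(alphabetical.tolist())

# Explicit order: Low=0, Medium=1, High=2
ordinal = OrdinalEncoder(categories=[["Low", "Medium", "High"]])
ordered = ordinal.fit_transform(levels[["Level"]]).ravel()
print(ordered.tolist())
```

Note that `OrdinalEncoder` expects a 2D input (hence the double brackets) and returns floats.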

## One-Hot Encoding

### Use One-Hot Encoding When:
- There is **no order** or ranking between the categories
- You want to avoid the model making **false assumptions** about the values

### Don't Use It When:
- There are **too many unique values** (like 100 different countries or the name of every person on a boat) because it will create too many columns
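
The column growth is easy to measure. This minimal sketch one-hot encodes a made-up column with 100 different values (the `Country` series is hypothetical) using pandas' `get_dummies`:

```python
import pandas as pd

# Hypothetical column with 100 unique values, one per row
countries = pd.Series([f"country_{i}" for i in range(100)], name="Country")

# One new column is created for every unique value
wide = pd.get_dummies(countries)
print(wide.shape)
```

A quick way to spot this risk before encoding is to check `df["column"].nunique()`; a column like "Name" in the Titanic data has a different value in almost every row.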