# Unit 3: EDA (Exploratory Data Analysis)
---

## Lesson 3.5: Handling Categorical Values

In [None]:
# First we will load all the necessary libraries
import pandas as pd
import numpy as np

In [None]:
# Loading the Dataset
df = pd.read_csv("Data/titanic_leapcode_train.csv")

In [None]:
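# Fill missing values with each column's most frequent value so they don't cause issues later
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy="most_frequent")
# fit_transform returns a NumPy array, so we rebuild the DataFrame with the original column names
df = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
# The rebuilt columns come back with the generic object dtype; infer_objects() restores numeric dtypes where possible
df = df.infer_objects()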

df

In [None]:
# Checking if missing values are handled
df.isnull().sum()

### Why Can't We Use Categorical Values Directly?

Most machine learning models **only work with numbers**. They can't understand words like `"Male"` or `"Braund, Mr. Owen Harris"`.

If we give a model a column of text, it will usually throw an error.

So we need to **transform** (convert) the categories into numbers.
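
As a rough illustration of that error, the sketch below tries to fit a scikit-learn `LogisticRegression` on a tiny made-up table. The `toy` DataFrame and its values are invented for this example and are not taken from the Titanic file. The fit is expected to fail because the `Sex` column still holds text.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Made-up toy data: one text column and a 0/1 target
toy = pd.DataFrame({"Sex": ["male", "female", "female", "male"]})
target = [0, 1, 1, 0]

error_message = None
try:
    LogisticRegression().fit(toy, target)
except (ValueError, TypeError) as err:
    # The model cannot turn "male" / "female" into numbers on its own
    error_message = str(err)

print("Model failed:", error_message)
```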

---

### How to Handle Categorical Values

There are two main methods:

### 1. **Label Encoding**

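We convert each category to a number.

Example: `["Male", "Female"]` -> `[1, 0]`

Example: `["S", "C", "Q"]` -> `[2, 0, 1]`

scikit-learn's `LabelEncoder` numbers the categories in alphabetical order, so `"C"` becomes 0, `"Q"` becomes 1, and `"S"` becomes 2.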

In [None]:
from sklearn.preprocessing import LabelEncoder

# Define the LabelEncoder
le = LabelEncoder()

# Fit the encoder and transform the column in one step
df["Sex"] = le.fit_transform(df["Sex"])
df

Instead of "Male" or "Female", the "Sex" column now contains 0s and 1s. The LabelEncoder assigned one number to each category.
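
To check which number stands for which category, we can look at the encoder's `classes_` attribute. Here is a small sketch on a made-up list of values (the `ports` list and the `le_demo` name are only for illustration):

```python
from sklearn.preprocessing import LabelEncoder

# Made-up sample values, similar to the "Embarked" column
ports = ["S", "C", "Q", "S"]

le_demo = LabelEncoder()
codes = le_demo.fit_transform(ports)

# classes_ lists the categories in the order of their codes (0, 1, 2, ...)
mapping = {str(label): int(code) for code, label in enumerate(le_demo.classes_)}
print(mapping)

# inverse_transform turns the numbers back into the original labels
restored = le_demo.inverse_transform(codes)
print(list(restored))
```

Keeping the fitted encoder around is useful because `inverse_transform` lets us read the original labels back later.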

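### 2. **One-Hot Encoding**

We create one new column for each category. A row gets a 1 in the column of its own category and a 0 in all the others.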

<img src="../Photos/OneHotEncoding.png" width="600">

In [None]:
from sklearn.preprocessing import OneHotEncoder

# Define the one-hot encoder (sparse_output=False returns a regular NumPy array)
encoder = OneHotEncoder(sparse_output=False)

# Fit the encoder and transform the column; this returns an array with one column per category
encoded_array = encoder.fit_transform(df[["Embarked"]])

# Wrap the array in a DataFrame and name the new columns
encoded_df = pd.DataFrame(
    encoded_array,
    columns=encoder.get_feature_names_out(["Embarked"]),
    index=df.index
)

# Join the new columns to the original DataFrame
# axis=1 joins by columns; axis=0 would join by rows
df = pd.concat([df, encoded_df], axis=1)

df

The OneHotEncoder created the columns `Embarked_C`, `Embarked_Q`, and `Embarked_S`. If Embarked = S, then `Embarked_S` equals 1 and the other two columns equal 0. The same goes for the other values.
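
For quick exploration, pandas also offers `pd.get_dummies`, which builds the same kind of columns in one call. This is a minimal sketch on a made-up table (the `toy` DataFrame is an assumption for illustration, not the Titanic data):

```python
import pandas as pd

# Made-up sample values, similar to the "Embarked" column
toy = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# dtype=int gives 0/1 instead of True/False
dummies = pd.get_dummies(toy["Embarked"], prefix="Embarked", dtype=int)

# Replace the original text column with the new 0/1 columns
toy_encoded = pd.concat([toy.drop(columns="Embarked"), dummies], axis=1)
print(toy_encoded)
```

Unlike `pd.get_dummies`, a fitted `OneHotEncoder` remembers the categories it saw, which helps when the same encoding has to be applied to new data later.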

---

# When Do We Use Them?



## Label Encoding

### Use Label Encoding When:
- The column has **only two categories** (like `"Yes"` / `"No"`, `"Male"` / `"Female"`)
- Or the **order of categories matters** (e.g., `"Low"`, `"Medium"`, `"High"`)

### Don't Use It When:
- There are more than two categories and they have **no order** (e.g., `"Sunny"`, `"Rainy"`, `"Cloudy"`), because the model may treat the numbers as a ranking
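
One caveat for ordered categories: `LabelEncoder` numbers the categories alphabetically, so `"High"` would come before `"Low"`. One way to keep the real order is to spell it out, for example with scikit-learn's `OrdinalEncoder` and its `categories` parameter. The sketch below uses a made-up `Priority` column (the column name and values are assumptions for illustration):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Made-up ordered categories
toy = pd.DataFrame({"Priority": ["Low", "High", "Medium", "Low"]})

# List the categories from lowest to highest so the codes follow that order
ordinal = OrdinalEncoder(categories=[["Low", "Medium", "High"]])
toy["Priority_code"] = ordinal.fit_transform(toy[["Priority"]]).ravel()
print(toy)
```

With this setup `"Low"` maps to 0, `"Medium"` to 1, and `"High"` to 2, so larger numbers really do mean a higher level.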

## One-Hot Encoding

### Use One-Hot Encoding When:
- There is **no order** or ranking between the categories
- You want to avoid the model making **false assumptions** about the values

### Don't Use It When:
- There are **too many unique values** (like 100 different countries or the name of every person on a boat) because it will create too many columns
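
A quick way to spot such columns is to count the unique values before encoding. The sketch below uses a made-up table, and the `MAX_CATEGORIES` cut-off is a hypothetical value chosen only for this example:

```python
import pandas as pd

# Made-up data: "Embarked" has few categories, "Name" is different on every row
toy = pd.DataFrame({
    "Embarked": ["S", "C", "Q", "S"],
    "Name": ["Passenger A", "Passenger B", "Passenger C", "Passenger D"],
})

# Hypothetical cut-off for this example; pick one that suits your data
MAX_CATEGORIES = 3

# Number of distinct values in each column
unique_counts = toy.nunique()
print(unique_counts)

# Keep only the columns with few enough categories to one-hot encode
one_hot_candidates = unique_counts[unique_counts <= MAX_CATEGORIES].index.tolist()
print(one_hot_candidates)
```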