<a href="https://colab.research.google.com/github/poonamaswani/DataScienceAndAI/blob/main/CAM_DS_C101_Demo_4_1_5_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**First things first** - please go to 'File' and select 'Save a copy in Drive' so that you have your own version of this activity set up and ready to use.
Remember to update the portfolio index link to your own work once completed!

# Demonstration 4.1.5 Dealing with categorical data
The Titanic data set is often used to explain concepts in machine learning. The data set contains information about the passengers aboard the RMS Titanic, and includes features like passenger age, sex, ticket class, fare paid, and survival status.

Follow the demonstration to see why converting categorical data matters, using the Titanic data set as an example. In this video, you will learn how to encode categorical data with the ordinal and one-hot encoding methods.

In [None]:
# Import the necessary libraries.
import pandas as pd
import numpy as np

In [None]:
# Import the modified Titanic file (data set) from GitHub with a URL.
url = "https://raw.githubusercontent.com/fourthrevlxd/cam_dsb/main/modified_titanic.csv"

# Read the CSV file into a new DataFrame.
titanic_2 = pd.read_csv(url)

# Print the shape (rows, columns), then display the first few rows of the DataFrame.
print(titanic_2.shape)
titanic_2.head()

(891, 12)


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,Third,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,First,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,Third,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,First,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,Third,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


## a. Ordinal encoding for passenger class

> Ordinal data represents categories with a specific order or ranking, such as ratings (low, medium, high) or education levels (high school, bachelor's, master's, PhD). Encoding techniques for ordinal data maintain this inherent order, often translating categories into integers representing their rank. Here, passenger class is mapped to First = 0, Second = 1, and Third = 2.

In [None]:
# Import OrdinalEncoder.
from sklearn.preprocessing import OrdinalEncoder

# Define the correct order for the categories.
order = ['First', 'Second', 'Third']

# Create the OrdinalEncoder with the specified categories.
ordinal_encoder = OrdinalEncoder(categories=[order])

# Fit and transform the column. The encoder expects 2D input, so select the
# column with double brackets (a one-column DataFrame), then cast the result to int.
titanic_2['Pclass'] = ordinal_encoder.fit_transform(titanic_2[['Pclass']]).astype(int)

# Display the DataFrame.
titanic_2.head(10)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,2,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,0,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,2,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,0,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,2,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
5,6,0,2,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
6,7,0,0,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
7,8,0,2,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S
8,9,1,2,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
9,10,1,1,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C
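
To confirm which integer each class received, the fitted encoder can be inspected. The following is a minimal sketch on a small toy DataFrame (the four values are illustrative and are not taken from the Titanic file). It assumes the same `order` list as above and shows that `inverse_transform` recovers the original labels.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy data standing in for the Pclass column (illustrative values only).
toy = pd.DataFrame({"Pclass": ["Third", "First", "Second", "First"]})

# Same category order as in the demonstration.
order = ["First", "Second", "Third"]
encoder = OrdinalEncoder(categories=[order])

# fit_transform returns a 2D float array; cast to int and flatten to 1D.
codes = encoder.fit_transform(toy[["Pclass"]]).astype(int).ravel()

# Build a label-to-code lookup from the fitted categories.
mapping = {label: code for code, label in enumerate(encoder.categories_[0])}

# inverse_transform maps the integer codes back to the original labels.
restored = encoder.inverse_transform(codes.reshape(-1, 1)).ravel()

print(mapping)
print(codes.tolist())
print(restored.tolist())
```

Keeping the fitted encoder around is useful because the same mapping can then be applied to new data with `encoder.transform`.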


## b. One-hot encoding for sex and port of embarkation
> Nominal data represents categories that do not have any intrinsic order, such as colours (red, blue, green) or cities (New York, London, Tokyo). A common encoding technique for nominal data is one-hot encoding, which creates one new column per category and marks the category each row belongs to.

In [None]:
# One-hot encoding for Sex and Embarked.
titanic_3 = pd.get_dummies(titanic_2,
                           columns=["Sex", "Embarked"])

# View DataFrame.
titanic_3.head(10)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Age,SibSp,Parch,Ticket,Fare,Cabin,Sex_female,Sex_male,Embarked_C,Embarked_Q,Embarked_S
0,1,0,2,"Braund, Mr. Owen Harris",22.0,1,0,A/5 21171,7.25,,False,True,False,False,True
1,2,1,0,"Cumings, Mrs. John Bradley (Florence Briggs Th...",38.0,1,0,PC 17599,71.2833,C85,True,False,True,False,False
2,3,1,2,"Heikkinen, Miss. Laina",26.0,0,0,STON/O2. 3101282,7.925,,True,False,False,False,True
3,4,1,0,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",35.0,1,0,113803,53.1,C123,True,False,False,False,True
4,5,0,2,"Allen, Mr. William Henry",35.0,0,0,373450,8.05,,False,True,False,False,True
5,6,0,2,"Moran, Mr. James",,0,0,330877,8.4583,,False,True,False,True,False
6,7,0,0,"McCarthy, Mr. Timothy J",54.0,0,0,17463,51.8625,E46,False,True,False,False,True
7,8,0,2,"Palsson, Master. Gosta Leonard",2.0,3,1,349909,21.075,,False,True,False,False,True
8,9,1,2,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",27.0,0,2,347742,11.1333,,True,False,False,False,True
9,10,1,1,"Nasser, Mrs. Nicholas (Adele Achem)",14.0,1,0,237736,30.0708,,True,False,True,False,False
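
The new columns above hold `True`/`False` values, which is the default in recent pandas versions. If 0/1 integers are preferred, `pd.get_dummies` accepts a `dtype` argument. The sketch below uses a three-row toy DataFrame (illustrative values, not the Titanic file) to show the idea.

```python
import pandas as pd

# Toy data standing in for the Sex and Embarked columns (illustrative values only).
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
})

# dtype=int produces 0/1 columns instead of booleans.
encoded = pd.get_dummies(toy, columns=["Sex", "Embarked"], dtype=int)

print(encoded)
```

Within each encoded feature, exactly one column is 1 in every row, which is where the name "one-hot" comes from.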


In [None]:
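# One-hot encode again, dropping the first category of each feature
# (Sex_female and Embarked_C). The dropped column is redundant because
# it can be inferred from the remaining ones.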
titanic_4 = pd.get_dummies(titanic_2,
                           columns=["Sex", "Embarked"],
                           drop_first=True)

In [None]:
# View DataFrame.
titanic_4.head(10)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Age,SibSp,Parch,Ticket,Fare,Cabin,Sex_male,Embarked_Q,Embarked_S
0,1,0,2,"Braund, Mr. Owen Harris",22.0,1,0,A/5 21171,7.25,,True,False,True
1,2,1,0,"Cumings, Mrs. John Bradley (Florence Briggs Th...",38.0,1,0,PC 17599,71.2833,C85,False,False,False
2,3,1,2,"Heikkinen, Miss. Laina",26.0,0,0,STON/O2. 3101282,7.925,,False,False,True
3,4,1,0,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",35.0,1,0,113803,53.1,C123,False,False,True
4,5,0,2,"Allen, Mr. William Henry",35.0,0,0,373450,8.05,,True,False,True
5,6,0,2,"Moran, Mr. James",,0,0,330877,8.4583,,True,True,False
6,7,0,0,"McCarthy, Mr. Timothy J",54.0,0,0,17463,51.8625,E46,True,False,True
7,8,0,2,"Palsson, Master. Gosta Leonard",2.0,3,1,349909,21.075,,True,False,True
8,9,1,2,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",27.0,0,2,347742,11.1333,,False,False,True
9,10,1,1,"Nasser, Mrs. Nicholas (Adele Achem)",14.0,1,0,237736,30.0708,,False,False,False
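
With `drop_first=True`, the dropped category becomes the baseline: a row whose remaining columns are all `False` is read as belonging to it. One caveat, sketched below on toy data (illustrative values; whether the modified Titanic file contains missing `Embarked` values is not shown here), is that a missing value also encodes as all `False`, so it becomes indistinguishable from the baseline.

```python
import pandas as pd

# Toy data with one missing port (illustrative values only).
toy = pd.DataFrame({"Embarked": ["S", "C", "Q", None]})

# Full one-hot encoding and the reduced version with the first category dropped.
# dtype=bool is passed explicitly so the result does not depend on the pandas version.
full = pd.get_dummies(toy, columns=["Embarked"], dtype=bool)
reduced = pd.get_dummies(toy, columns=["Embarked"], drop_first=True, dtype=bool)

# Infer the dropped column: "C" is implied when Q and S are both False.
implied_c = ~(reduced["Embarked_Q"] | reduced["Embarked_S"])

print(full["Embarked_C"].tolist())
print(implied_c.tolist())
```

The two printed lists agree for the first three rows but differ for the row with the missing value. Handling missing values before encoding, or passing `dummy_na=True` to `pd.get_dummies`, are two possible ways to keep that row distinguishable.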


# Key information
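This demonstration showed why understanding the business context matters when converting categorical data. The context determines whether a feature is ordinal (such as passenger class) or nominal (such as sex or port of embarkation), and therefore which encoding is appropriate. Any modification made to the original data, including the choice of encoding, can affect a machine learning model's accuracy.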


## Reflect
What are the practical applications of this technique?

> Select the pen from the toolbar to add your entry.