<a href="https://colab.research.google.com/github/nkatara/iiit-ai-ml/blob/main/U1_MH2_Titanic_Classification_updated.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Advanced Certification in AIML
## A Program by IIIT-H and TalentSprint


## Learning Objectives


At the end of the mini-hackathon you will be able to:
* Perform Data preprocessing
* Apply different ML algorithms on the **Titanic** dataset
* Build an ensemble model with VotingClassifier


## Dataset Description

The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of many passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

[Data Set Link: Kaggle competition](https://www.kaggle.com/competitions/titanic)

<br/>

### Data Set Characteristics:

**PassengerId:** Id of the Passenger

**Survived:** Survived or Not information

**Pclass:** Socio-economic status (SES)
  * 1st = Upper
  * 2nd = Middle
  * 3rd = Lower

**Name:** Surname, First Names of the Passenger

**Sex:** Gender of the Passenger

**Age:** Age of the Passenger

**SibSp:** No. of siblings/spouses of the passenger aboard the Titanic

**Parch:** No. of parents/children of the passenger aboard the Titanic

**Ticket:** Ticket number

**Fare:** Passenger fare

**Cabin:** Cabin number

**Embarked:** Port of Embarkation
  * S = Southampton
  * C = Cherbourg
  * Q = Queenstown


## Problem Statement

Build a predictive model that answers the question: “what sort of people were more likely to survive?” using the Titanic passenger data (i.e., name, age, gender, socio-economic class, etc.).

In [None]:
# @title Download the datasets
from IPython import get_ipython

ipython = get_ipython()

notebook="U1_MH1_Data_Munging" #name of the notebook

def setup():
    # Download the training and submission CSV files into the working directory.
    # run_line_magic replaces the deprecated ipython.magic call.
    ipython.run_line_magic("sx", "wget https://cdn.iiith.talentsprint.com/aiml/Experiment_related_data/titanic.csv")
    ipython.run_line_magic("sx", "wget https://cdn.iiith.talentsprint.com/aiml/Experiment_related_data/test_titanic.csv")
    print("Data downloaded successfully")
    return

setup()

In [None]:
!ls

## Exercise 1 - Load and Explore the Data (2 Marks)

* Understand different features in the training dataset
* Understand the data types of each column
* Identify the columns with missing values




#### Import Required Packages

In [None]:
import pandas as pd
import numpy as np

In [None]:
# Load the dataset
df = pd.read_csv("titanic.csv")

In [None]:
# Getting information about the dataset
df.head()

In [None]:
df.describe()


In [None]:
df.info()
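
`df.info()` reports non-null counts, so the missing values have to be inferred from them. A per-column tally makes the gaps explicit. The sketch below uses a small made-up frame named `toy` (an assumption for illustration, not the real data); the same calls should work on `df`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Titanic data (values are made up)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# Count and percentage of missing entries per column
missing_count = toy.isnull().sum()
missing_pct = toy.isnull().mean() * 100

missing_summary = pd.DataFrame({"missing": missing_count, "percent": missing_pct})
print(missing_summary.sort_values("missing", ascending=False))
```

Sorting by the count puts the columns that need the most attention at the top.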

## Exercise 02: Split the data into train and test sets (1 Mark)
Note: Apply all your data preprocessing steps in the train set first and keep the test set aside.

In [None]:
from sklearn.model_selection import train_test_split

# Split BEFORE preprocessing
train_df, test_df = train_test_split(df, test_size=0.3, random_state=42)

print(train_df.shape)
print(test_df.shape)


In [None]:
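# Handle missing values using statistics computed on the training split only
train_df = train_df.copy()  # work on a copy to avoid chained-assignment warnings
train_df['Age'] = train_df['Age'].fillna(train_df['Age'].median())
train_df['Embarked'] = train_df['Embarked'].fillna(train_df['Embarked'].mode()[0])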
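
The note for this exercise asks for preprocessing to be worked out on the train set while the test set is kept aside. One way to read that is: learn every fill value from the training rows, then reuse those same values on the test rows. Here is a minimal sketch on made-up data; the names `toy`, `age_fill`, and `embarked_fill` are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data (made up) with missing values in both columns
toy = pd.DataFrame({
    "Age": [22, np.nan, 26, 35, np.nan, 54, 2, 27, 14, np.nan],
    "Embarked": ["S", "C", np.nan, "S", "S", "Q", "S", np.nan, "C", "S"],
})

toy_train, toy_test = train_test_split(toy, test_size=0.3, random_state=42)
toy_train, toy_test = toy_train.copy(), toy_test.copy()

# Learn the fill values from the training rows only
age_fill = toy_train["Age"].median()
embarked_fill = toy_train["Embarked"].mode()[0]

# Apply the same values to both splits
for part in (toy_train, toy_test):
    part["Age"] = part["Age"].fillna(age_fill)
    part["Embarked"] = part["Embarked"].fillna(embarked_fill)

print(age_fill, embarked_fill)
print(toy_test)
```

Because the test rows never contribute to `age_fill` or `embarked_fill`, the evaluation is not influenced by information from the held-out data.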

## Exercise 03: Data Cleaning and Processing (15 Marks)
### 3.1 Working on the "Cabin" column (2 Marks)
Find the unique entries in the Cabin column. Passengers can be labelled in two categories: those with a cabin and those without. Check the data type of each entry in Cabin (use: type). A string entry means the passenger has a cabin, while a missing entry (NaN, stored as a float) means no cabin. Write a function that converts string entries to 1 and all other entries to 0, apply it to the Cabin column, and store the result in a new column named "Has_cabin" containing only 0 or 1 entries.





In [None]:
# Find Unique Entries in the Cabin Column
df['Cabin'].unique()

In [None]:
# Check the Data Type of Each Entry
df['Cabin'].apply(type).unique()

In [None]:
#Write a Function to Convert Cabin → 1 or 0
def cabin_to_binary(value):
  if isinstance(value, str):
    return 1
  else:
    return 0

In [None]:
# Apply the Function and Create "Has_cabin" Column
# 1:passenger has a cabin(string)
# 0:passenger does NOT have a cabin(NaN)
df['Has_cabin'] = df['Cabin'].apply(cabin_to_binary)
df[['Cabin', 'Has_cabin']].head()

### 3.2 Working on "SibSp" & "Parch" columns (1 Mark)
Combine the "SibSp" and "Parch" columns into a new column named "family_size" that represents the total number of passengers on one ticket. Each ticket may cover siblings/spouses (SibSp = number of siblings/spouses aboard) or parents/children (Parch = number of parents/children aboard) along with the passenger who booked the ticket, so family_size = SibSp + Parch + 1.

  

In [None]:
#Create column family_size
#family_size = SibSp + Parch + 1
df['family_size'] = df['SibSp'] + df['Parch'] + 1


In [None]:
# Inspect the new column
df[['SibSp', 'Parch', 'family_size']].head()

### 3.3 Working on the "Embarked" column (2 Marks)
The "Embarked" column represents the port of embarkation: Cherbourg (C), Queenstown (Q), and Southampton (S), so its entries fall into three categories. Fill the missing rows in this column with the most frequent category, then map the categorical string entries to numerical values.



In [None]:
# Check unique values in "Embarked"
df['Embarked'].unique()

In [None]:
#Find the most frequent (mode) value
mode_embarked = df['Embarked'].mode()[0]
mode_embarked

In [None]:
# Fill missing values with the most frequent category
# Assign the result back instead of using inplace=True on a column selection
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])

In [None]:
# Map categorical values to numerical values
embarked_mapping = {'S': 0, 'C': 1, 'Q': 2}
df['Embarked'] = df['Embarked'].map(embarked_mapping)
df['Embarked'].head()
print(embarked_mapping)
print(df['Embarked'])

### 3.4 Working on the "Age" column (2 Marks)
Find the number of NaN entries in the Age column and their row indices. Calculate the mean and standard deviation of the Age column and check its distribution. Fill the missing values with randomly generated integers between (mean - standard deviation) and (mean + standard deviation). Use: np.isnan, np.random.randint, and the concept of slicing a DataFrame. Finally, convert the Age column to an integer data type.



In [None]:
# Find the number of NaN entries in the Age column
df['Age'].isnull().sum()

In [None]:
# Find the row indices of NaN Age values
nan_age_index = df[df['Age'].isnull()].index
nan_age_index

In [None]:
# Calculate mean and standard deviation of Age
age_mean = df['Age'].mean()
age_std  = df['Age'].std()

print("Mean Age:", age_mean)
print("Std Age:", age_std)


In [None]:
# Check the distribution of Age
import seaborn as sns
import matplotlib.pyplot as plt

sns.histplot(df['Age'], kde=True)
plt.title("Age Distribution")
plt.show()


In [None]:
# Fill missing Age values with random integers
low  = int(age_mean - age_std) # Compute the range
high = int(age_mean + age_std) # Compute the range
random_ages = np.random.randint(low, high, size=df['Age'].isnull().sum()) # Generate random integers for missing values
df.loc[df['Age'].isnull(), 'Age'] = random_ages # Replace NaN values using slicing
df['Age'] = df['Age'].astype(int) # Convert Age column to integer type
df['Age'].isnull().sum() # check for any NAN
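
The random fill above gives different ages on every run, and `np.random.randint` excludes the upper bound. If reproducibility matters, a seeded generator is one option. The sketch below is an assumption-laden variant on a made-up series: the seed value, the `ages` data, and the choice to include the upper bound are all illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)  # fixed seed, chosen arbitrarily

ages = pd.Series([22.0, np.nan, 26.0, 35.0, np.nan, 54.0])  # made-up ages
mean, std = ages.mean(), ages.std()
low, high = int(mean - std), int(mean + std)

# endpoint=True makes `high` inclusive; np.random.randint would exclude it
mask = ages.isnull()
filled = ages.copy()
filled[mask] = rng.integers(low, high, size=mask.sum(), endpoint=True)
filled = filled.astype(int)

print(low, high, filled.tolist())
```

Re-running the cell with the same seed yields the same filled values, which makes accuracy comparisons between runs easier to interpret.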

### 3.5 Working on the "Sex" column (1 Mark)
Map the Sex column as 'female' : 0, 'male': 1, and convert it into an integer data type.



In [None]:
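# Treat placeholder strings as missing, fill with the most frequent value, then encode
df['Sex'] = df['Sex'].astype('object')
df['Sex'] = df['Sex'].replace(['nan', 'None', '', ' '], np.nan)
df['Sex'] = df['Sex'].fillna(df['Sex'].mode()[0])
# Map to 0/1 and cast to int; the cast raises if any unexpected label remains
df['Sex'] = df['Sex'].map({'female': 0, 'male': 1}).astype(int)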

### 3.6  Optional- Working on the "Name" column :
Fetch titles from the name. We can map these titles with numbers and convert them into an integer. Use: concept of the regular expression.
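
A minimal sketch of the title extraction, assuming names follow the `Surname, Title. First names` pattern. The sample names and the `title_mapping` dictionary are illustrative assumptions; any title not in the mapping falls back to 0.

```python
import pandas as pd

# Sample names in the assumed "Surname, Title. First names" format
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Heikkinen, Miss. Laina",
    "Palsson, Master. Gosta Leonard",
    "Uruchurtu, Don. Manuel E",
])

# Capture the text between the comma and the first period
titles = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

# Hypothetical mapping; unseen or rare titles become 0
title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4}
title_codes = titles.map(title_mapping).fillna(0).astype(int)

print(titles.tolist())
print(title_codes.tolist())
```

On the real data the same two steps could be applied to `df['Name']` before the column is dropped.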

### 3.7 Optional- Working on the "Fare" column :
We can convert fare into categorical entries like Low, Medium, and High.
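
One possible way to bin fares is by quantile, so that each band holds roughly the same number of passengers. The sketch below uses made-up fares; the three-way split and the label names follow the text, while the choice of `pd.qcut` over fixed thresholds is an assumption.

```python
import pandas as pd

# Made-up fares for illustration
fares = pd.Series([7.25, 71.28, 7.92, 53.10, 8.05, 8.46, 51.86, 21.08, 11.13])

# Split into three equal-sized groups by quantile
fare_band = pd.qcut(fares, q=3, labels=["Low", "Medium", "High"])

# Ordinal codes for models that need numeric input: Low=0, Medium=1, High=2
fare_code = fare_band.cat.codes

print(pd.DataFrame({"Fare": fares, "band": fare_band, "code": fare_code}))
```

If the bin edges are learned on the train set, they should be reused on the test set (for example with `pd.cut` and the stored edges) rather than recomputed.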



### 3.8 Drop the columns (1 Mark)

Drop the columns "PassengerId", "Name", "SibSp", "Parch", "Ticket", and "Cabin".

Now apply different ML algorithms and check the accuracy of your model.



In [None]:
df.drop(columns=['PassengerId', 'Name', 'SibSp', 'Parch', 'Ticket', 'Cabin'], inplace=True)


In [None]:
#Split Features and Target
X = df.drop('Survived', axis=1)
y = df['Survived']


In [None]:
#Train/Test Split
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

X_train

In [None]:
#Helper function to evaluate models
from sklearn.metrics import accuracy_score

def evaluate_model(model):
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    return accuracy_score(y_test, preds)


In [None]:
#Logistic Regression
from sklearn.linear_model import LogisticRegression
log_reg = LogisticRegression(max_iter=1000)
acc_log = evaluate_model(log_reg)

acc_log


In [None]:
#Decision Tree
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier(random_state=42)
acc_dt = evaluate_model(dt)
acc_dt

In [None]:
#Random Forest
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=200, random_state=42)
acc_rf = evaluate_model(rf)
acc_rf

In [None]:
#K-Nearest Neighbors
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=7)
acc_knn = evaluate_model(knn)
acc_knn

In [None]:
#Support Vector Machine
from sklearn.svm import SVC
svm = SVC(kernel='rbf')
acc_svm = evaluate_model(svm)
acc_svm

### 3.9 Apply StandardScaler (1 Mark)

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
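
To make the scaling step concrete: `StandardScaler` subtracts the per-column mean and divides by the per-column standard deviation, both learned during `fit`. The sketch below checks this by hand on made-up arrays; `sc`, `train`, and `test` are illustrative names chosen to avoid clashing with the notebook's variables.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up feature matrices
train = np.array([[20.0, 10.0], [30.0, 20.0], [40.0, 60.0]])
test = np.array([[30.0, 30.0]])

sc = StandardScaler().fit(train)  # learns mean_ and scale_ from train only

# Manual z-score using the training statistics (population std, ddof=0)
manual = (test - train.mean(axis=0)) / train.std(axis=0)

print(sc.transform(test))
print(manual)
```

The two printed arrays should match, which shows that the test rows are scaled with the training mean and standard deviation rather than their own.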


### 3.10 Create a single function for preprocessing the test set (X_test) and apply it. (4 Marks)
#### **Note**: All the pre-processing steps that were applied on the train set before ML Modelling are also applied on the test set before passing through the predict function.

In [None]:
def preprocess_test_set(X_test, age_mean, age_std, embarked_mode, sex_mode, scaler):
    """Apply the training-time preprocessing steps to a test set.

    embarked_mode and sex_mode are expected in encoded (numeric) form,
    as computed from the already-processed X_train.
    """
    X_test = X_test.copy()

    # 1. Has_cabin: derive from Cabin if present; keep an existing column otherwise
    if 'Cabin' in X_test.columns:
        X_test['Has_cabin'] = X_test['Cabin'].apply(lambda x: 1 if isinstance(x, str) else 0)
    elif 'Has_cabin' not in X_test.columns:
        X_test['Has_cabin'] = 0

    # 2. family_size: derive from SibSp and Parch if present
    if 'SibSp' in X_test.columns and 'Parch' in X_test.columns:
        X_test['family_size'] = X_test['SibSp'] + X_test['Parch'] + 1
    elif 'family_size' not in X_test.columns:
        X_test['family_size'] = 1

    # 3. Embarked: encode only if still text, then fill gaps with the encoded mode
    if not pd.api.types.is_numeric_dtype(X_test['Embarked']):
        X_test['Embarked'] = X_test['Embarked'].map({'S': 0, 'C': 1, 'Q': 2})
    X_test['Embarked'] = X_test['Embarked'].fillna(embarked_mode)

    # 4. Sex: encode only if still text, then fill gaps with the encoded mode
    if not pd.api.types.is_numeric_dtype(X_test['Sex']):
        X_test['Sex'] = X_test['Sex'].map({'female': 0, 'male': 1})
    X_test['Sex'] = X_test['Sex'].fillna(sex_mode)

    # 5. Age: fill gaps with random integers in [mean - std, mean + std), cast to int
    low = int(age_mean - age_std)
    high = int(age_mean + age_std)
    missing_age = X_test['Age'].isnull()
    if missing_age.any():
        X_test.loc[missing_age, 'Age'] = np.random.randint(low, high, missing_age.sum())
    X_test['Age'] = X_test['Age'].astype(int)

    # 6. Drop unused columns (only those that exist)
    cols_to_drop = ['PassengerId', 'Name', 'SibSp', 'Parch', 'Ticket', 'Cabin']
    X_test = X_test.drop(columns=cols_to_drop, errors='ignore')

    # 7. Scale with the scaler fitted on the training data
    return scaler.transform(X_test)


In [None]:
# Compute the statistics from the training set before applying the function above
age_mean = X_train['Age'].mean()
age_std  = X_train['Age'].std()
embarked_mode = X_train['Embarked'].mode()[0]
sex_mode = X_train['Sex'].mode()[0]

scaler = StandardScaler()
scaler.fit(X_train)   # fit ONLY on training data



In [None]:
X_test_processed = preprocess_test_set( X_test, age_mean, age_std, embarked_mode, sex_mode, scaler )
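
As an alternative to a hand-written function, scikit-learn's `Pipeline` and `ColumnTransformer` can bundle imputation, encoding, and scaling so that the test set automatically receives the steps fitted on the train set. This is a different technique from the one used in this notebook, shown only as a sketch on made-up rows; the column subset, imputation strategy, and all `toy_*` names are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy rows (made up) with a few missing values
toy_X_train = pd.DataFrame({
    "Age": [22, np.nan, 26, 35, 54, 2, 27, 14],
    "Fare": [7.25, 71.28, 7.92, 53.1, 51.86, 21.08, 11.13, 30.07],
    "Sex": ["male", "female", "female", "female", "male", "male", "female", "female"],
})
toy_y_train = [0, 1, 1, 1, 0, 0, 1, 1]
toy_X_test = pd.DataFrame({
    "Age": [np.nan, 40],
    "Fare": [np.nan, 8.05],
    "Sex": ["male", "female"],
})

# Numeric columns: median imputation, then standard scaling
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
# Categorical column: one-hot encoding, ignoring labels unseen during fit
categorical = OneHotEncoder(handle_unknown="ignore")

preprocess = ColumnTransformer([
    ("num", numeric, ["Age", "Fare"]),
    ("cat", categorical, ["Sex"]),
])

pipe = Pipeline([("prep", preprocess), ("clf", LogisticRegression(max_iter=1000))])
pipe.fit(toy_X_train, toy_y_train)    # all statistics are learned here only
toy_pred = pipe.predict(toy_X_test)   # the fitted steps are replayed on the test rows
print(toy_pred)
```

The trade-off is that custom steps such as the random age fill would need to be wrapped in a transformer before they fit into this structure.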


### 3.11 Apply the StandardScaler transformation to X_test (1 Mark)

In [None]:
scaler = StandardScaler()
scaler.fit(X_train)   # fit ONLY on training data


In [None]:
X_test_scaled = scaler.transform(X_test)

## Exercise 4. Apply multiple ML algorithms along with an ensemble technique (Voting Classifier) and display the accuracy (7 Marks)
#### Expected Accuracy >= 80%


In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score


In [None]:
log_reg = LogisticRegression(max_iter=1000)
dt = DecisionTreeClassifier(random_state=42)
rf = RandomForestClassifier(n_estimators=200, random_state=42)
knn = KNeighborsClassifier(n_neighbors=7)
svm = SVC(kernel='rbf', probability=True)


In [None]:
models = {
    "Logistic Regression": log_reg,
    "Decision Tree": dt,
    "Random Forest": rf,
    "KNN": knn,
    "SVM": svm
}

accuracies = {}

for name, model in models.items():
    model.fit(X_train_scaled, y_train)
    preds = model.predict(X_test_scaled)
    accuracies[name] = accuracy_score(y_test, preds)


In [None]:
voting_clf = VotingClassifier(
    estimators=[
        ('lr', log_reg),
        ('rf', rf),
        ('svm', svm)
    ],
    voting='soft'
)

voting_clf.fit(X_train_scaled, y_train)
voting_preds = voting_clf.predict(X_test_scaled)
accuracies["Voting Classifier"] = accuracy_score(y_test, voting_preds)


In [None]:
for model_name, acc in accuracies.items():
    print(f"{model_name}: {acc*100:.2f}%")
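
The ensemble above uses `voting='soft'`, which averages predicted class probabilities, whereas `voting='hard'` takes a majority vote over predicted labels. The sketch below compares the two on synthetic data generated with `make_classification`; the data, the estimator settings, and the helper `make_estimators` are assumptions for illustration, and the scores say nothing about the Titanic results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic data, used only to illustrate the API
Xs, ys = make_classification(n_samples=300, n_features=6, random_state=0)
Xs_tr, Xs_te, ys_tr, ys_te = train_test_split(Xs, ys, test_size=0.3, random_state=0)

def make_estimators():
    # Fresh estimators for each ensemble; SVC needs probability=True for soft voting
    return [
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("svm", SVC(probability=True, random_state=0)),
    ]

hard = VotingClassifier(make_estimators(), voting="hard").fit(Xs_tr, ys_tr)
soft = VotingClassifier(make_estimators(), voting="soft").fit(Xs_tr, ys_tr)

print("hard:", hard.score(Xs_te, ys_te))
print("soft:", soft.score(Xs_te, ys_te))

# Soft voting exposes averaged class probabilities
soft_proba = soft.predict_proba(Xs_te[:3])
print(soft_proba)
```

Soft voting tends to help when the base models produce reasonably calibrated probabilities; otherwise hard voting may be the safer choice.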


## Exercise 5. Pre-process the test set (3 Marks)
Apply the same preprocessing function and standard scaler to this test set before passing it to the predict function.

#### Understanding the test set:

In [None]:
print(X_test.columns)

#### Note: The initial train set had no missing entries in the "Fare" column, but the submission test set has one missing entry in this column.

#### The preprocess function therefore needs a minor change to fill that entry.

In [None]:
fare_median = X_train['Fare'].median()


In [None]:
def preprocess_test_set(X_test, age_mean, age_std, embarked_mode, sex_mode, fare_median, scaler):
    """Apply the training-time preprocessing steps to a test set.

    embarked_mode and sex_mode are expected in encoded (numeric) form,
    as computed from the already-processed X_train.
    """
    X_test = X_test.copy()

    # 1. Has_cabin: derive from Cabin if present; keep an existing column otherwise
    if 'Cabin' in X_test.columns:
        X_test['Has_cabin'] = X_test['Cabin'].apply(lambda x: 1 if isinstance(x, str) else 0)
    elif 'Has_cabin' not in X_test.columns:
        X_test['Has_cabin'] = 0

    # 2. family_size: derive from SibSp and Parch if present
    if 'SibSp' in X_test.columns and 'Parch' in X_test.columns:
        X_test['family_size'] = X_test['SibSp'] + X_test['Parch'] + 1
    elif 'family_size' not in X_test.columns:
        X_test['family_size'] = 1

    # 3. Embarked: encode only if still text, then fill gaps with the encoded mode
    if not pd.api.types.is_numeric_dtype(X_test['Embarked']):
        X_test['Embarked'] = X_test['Embarked'].map({'S': 0, 'C': 1, 'Q': 2})
    X_test['Embarked'] = X_test['Embarked'].fillna(embarked_mode)

    # 4. Sex: encode only if still text, then fill gaps with the encoded mode
    if not pd.api.types.is_numeric_dtype(X_test['Sex']):
        X_test['Sex'] = X_test['Sex'].map({'female': 0, 'male': 1})
    X_test['Sex'] = X_test['Sex'].fillna(sex_mode)

    # 5. Age: fill gaps with random integers in [mean - std, mean + std), cast to int
    low = int(age_mean - age_std)
    high = int(age_mean + age_std)
    missing_age = X_test['Age'].isnull()
    if missing_age.any():
        X_test.loc[missing_age, 'Age'] = np.random.randint(low, high, missing_age.sum())
    X_test['Age'] = X_test['Age'].astype(int)

    # 6. Fare: fill gaps with the training median
    X_test['Fare'] = X_test['Fare'].fillna(fare_median)

    # 7. Drop unused columns (only those that exist)
    cols_to_drop = ['PassengerId', 'Name', 'SibSp', 'Parch', 'Ticket', 'Cabin']
    X_test = X_test.drop(columns=cols_to_drop, errors='ignore')

    # 8. Scale with the scaler fitted on the training data
    return scaler.transform(X_test)


In [None]:
X_test_processed = preprocess_test_set( X_test, age_mean, age_std, embarked_mode, sex_mode, fare_median, scaler )

## Exercise 6. Prediction for test data (2 Marks)

In [None]:
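# Compare column names and missing-value counts of the two sets.
# print is needed because only the last expression in a cell is displayed.
print(X_train.columns)
print(X_test.columns)
print(pd.DataFrame(X_test).isnull().sum())
print(pd.DataFrame(X_train).isnull().sum())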

In [None]:
X_test_processed

In [None]:
pd.DataFrame(X_test_processed).isnull().sum()

In [None]:
y_pred = voting_clf.predict(X_test_processed)
y_pred


In [None]:
print(X_test.columns.tolist())

In [None]:
pred_df = pd.DataFrame({ 'Survived': y_pred })
pred_df
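
The linked Kaggle competition expects a submission with a `PassengerId` column next to `Survived`. Since `PassengerId` is dropped during preprocessing, it would need to be saved from the raw test file beforehand. The sketch below uses hypothetical ids and predictions (`passenger_ids` and `toy_preds` are stand-ins, not values from this notebook).

```python
import pandas as pd

# Hypothetical stand-ins: ids kept from the raw test file, and model predictions
passenger_ids = pd.Series([892, 893, 894], name="PassengerId")
toy_preds = [0, 1, 0]

submission = pd.DataFrame({"PassengerId": passenger_ids, "Survived": toy_preds})

# With no path, to_csv returns the CSV text; pass a file name to write to disk
csv_text = submission.to_csv(index=False)
print(csv_text)
```

Passing a path such as `"submission.csv"` to `to_csv` would write the file instead of returning a string.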