# Task 2: Data Preprocessing for Machine Learning – AI Bootcamp

Download Titanic Dataset here: https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv

#### About this file

The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone on board, resulting in the death of 1502 out of 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

In this challenge, we ask you to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (i.e., name, age, gender, socio-economic class, etc.).

## Section 1: Data Loading & Exploration

### **Task 1**: Load and Inspect a Dataset

*Instruction*: Load the `titanic.csv` dataset and display the first 5 rows. Show basic info and describe statistics of the dataset.
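
A minimal sketch of this task, assuming pandas is installed. To stay self-contained it reads a three-row inline sample (rows copied from the outputs shown later in this notebook) rather than the downloaded file; the `SAMPLE_CSV` name is only illustrative. For the real dataset, replace the `read_csv` argument with `'titanic.csv'`.

```python
import io
import pandas as pd

# Inline stand-in for titanic.csv (three rows taken from outputs in this notebook).
SAMPLE_CSV = """Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
0,3,Mr. Owen Harris Braund,male,22.0,1,0,7.25
1,3,Miss. Laina Heikkinen,female,26.0,0,0,7.925
1,1,Mrs. Jacques Heath (Lily May Peel) Futrelle,female,35.0,1,0,53.1
"""

# For the real dataset, use: df = pd.read_csv('titanic.csv')
df = pd.read_csv(io.StringIO(SAMPLE_CSV))

print(df.head())         # first 5 rows (only 3 here, since the sample is small)
df.info()                # column dtypes and non-null counts (prints directly)
summary = df.describe()  # count, mean, std, min, quartiles, max of numeric columns
print(summary)
```

`describe()` covers only the numeric columns by default, so `Name` and `Sex` do not appear in the summary.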

## Section 2: Handling Missing Values

### **Task 2**: Identify and Handle Missing Data

*Instruction*:



*   Display the number of missing values per column.
*   Fill missing `Age` values with the median.
*   Drop the second row in the dataset.



In [None]:
import pandas as pd

# Load the dataset
df = pd.read_csv('titanic.csv')

# Display number of missing values per column
print("Missing values per column:")
print(df.isnull().sum())

# Fill missing Age values with the median.
# Assign the result back: fillna(inplace=True) on a column selection
# stops working in pandas 3.0.
median_age = df['Age'].median()
df['Age'] = df['Age'].fillna(median_age)

# Drop the second row (index 1)
df.drop(index=1, inplace=True)

# Optional: Display the updated dataframe
df.head()

Missing values per column:
Survived                   0
Pclass                     0
Name                       0
Sex                        0
Age                        0
Siblings/Spouses Aboard    0
Parents/Children Aboard    0
Fare                       0
dtype: int64




|   | Survived | Pclass | Name | Sex | Age | Siblings/Spouses Aboard | Parents/Children Aboard | Fare |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | Mr. Owen Harris Braund | male | 22.0 | 1 | 0 | 7.25 |
| 2 | 1 | 3 | Miss. Laina Heikkinen | female | 26.0 | 0 | 0 | 7.925 |
| 3 | 1 | 1 | Mrs. Jacques Heath (Lily May Peel) Futrelle | female | 35.0 | 1 | 0 | 53.1 |
| 4 | 0 | 3 | Mr. William Henry Allen | male | 35.0 | 0 | 0 | 8.05 |
| 5 | 0 | 3 | Mr. James Moran | male | 27.0 | 0 | 0 | 8.4583 |
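
In this copy of the dataset every column reports zero missing values, so the median fill has nothing to change. The sketch below repeats the same three steps on a small made-up frame (hypothetical values, not taken from `titanic.csv`) so that the effect of each step is visible.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing ages (not taken from titanic.csv).
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, 35.0, np.nan],
    'Fare': [7.25, 71.2833, 7.925, 53.1, 8.05],
})

missing_before = toy.isnull().sum()  # per-column count of NaN values
median_age = toy['Age'].median()     # NaN is skipped: median of 22, 26, 35

# Assign the filled column back instead of calling fillna(inplace=True)
# on a column selection.
toy['Age'] = toy['Age'].fillna(median_age)

# Drop the second row by its index label (1); the other labels are unchanged.
toy = toy.drop(index=1)

print(missing_before)
print(toy)
```

The row labels keep a gap at 1 after the drop, as in the output above. Calling `reset_index(drop=True)` would renumber them if a contiguous index is needed.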


## Section 3: Encoding Categorical Features

### **Task 3**: Convert Categorical to Numeric

*Instruction*: Convert `Sex` and `Pclass` columns to numeric using:


*   Label Encoding for `Sex`
*   One-Hot Encoding for `Pclass`



In [None]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Load the dataset
df = pd.read_csv("titanic.csv")

# Fill missing Age values with the median (assign back instead of inplace=True)
df['Age'] = df['Age'].fillna(df['Age'].median())

# Drop the second row (index 1)
df.drop(index=1, inplace=True)

# Label Encoding for 'Sex'
label_encoder = LabelEncoder()
df['Sex'] = label_encoder.fit_transform(df['Sex'])

# One-Hot Encoding for 'Pclass'
df = pd.get_dummies(df, columns=['Pclass'], prefix='Pclass')

# Display the result
print(df.head())


   Survived                                         Name  Sex   Age  \
0         0                       Mr. Owen Harris Braund    1  22.0   
2         1                        Miss. Laina Heikkinen    0  26.0   
3         1  Mrs. Jacques Heath (Lily May Peel) Futrelle    0  35.0   
4         0                      Mr. William Henry Allen    1  35.0   
5         0                              Mr. James Moran    1  27.0   

   Siblings/Spouses Aboard  Parents/Children Aboard     Fare  Pclass_1  \
0                        1                        0   7.2500     False   
2                        0                        0   7.9250     False   
3                        1                        0  53.1000      True   
4                        0                        0   8.0500     False   
5                        0                        0   8.4583     False   

   Pclass_2  Pclass_3  
0     False      True  
2     False      True  
3     False     False  
4     False      True  
5     False      True  
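
The two encodings can be checked on a small frame. The sketch below uses made-up rows (hypothetical, not taken from `titanic.csv`). `LabelEncoder` sorts its classes, so `female` maps to 0 and `male` to 1, which matches the `Sex` column in the output above. Passing `dtype=int` to `get_dummies` is an optional choice made here to get 0/1 columns instead of booleans. Note that scikit-learn documents `LabelEncoder` as intended for target labels; for input features, `OrdinalEncoder` is the suggested alternative.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy rows (not taken from titanic.csv).
toy = pd.DataFrame({
    'Sex': ['male', 'female', 'female', 'male'],
    'Pclass': [3, 1, 2, 3],
})

# Label encoding: classes are sorted, so 'female' -> 0 and 'male' -> 1.
encoder = LabelEncoder()
toy['Sex'] = encoder.fit_transform(toy['Sex'])
mapping = {label: code for code, label in enumerate(encoder.classes_)}

# One-hot encoding: one indicator column per distinct Pclass value.
# dtype=int yields 0/1 columns instead of the default booleans.
toy = pd.get_dummies(toy, columns=['Pclass'], prefix='Pclass', dtype=int)

print(mapping)
print(toy)
```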



## Section 4: Feature Scaling

### **Task 4**: Scale Numerical Features

*Instruction*: Use `StandardScaler` to scale the `Age` and `Fare` columns.

In [None]:
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Load the dataset
df = pd.read_csv('titanic.csv')

# Check for missing values
print(df[['Age', 'Fare']].isnull().sum())

# Fill missing values if any (e.g., with median); assign back instead of inplace=True
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Fare'] = df['Fare'].fillna(df['Fare'].median())

# Initialize the scaler
scaler = StandardScaler()

# Fit and transform the 'Age' and 'Fare' columns
df[['Age_scaled', 'Fare_scaled']] = scaler.fit_transform(df[['Age', 'Fare']])

# Display the first few rows to check results
print(df[['Age', 'Age_scaled', 'Fare', 'Fare_scaled']].head())


Age     0
Fare    0
dtype: int64
    Age  Age_scaled     Fare  Fare_scaled
0  22.0   -0.529366   7.2500    -0.503586
1  38.0    0.604265  71.2833     0.783412
2  26.0   -0.245958   7.9250    -0.490020
3  35.0    0.391709  53.1000     0.417948
4  35.0    0.391709   8.0500    -0.487507
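
`StandardScaler` standardizes each column as z = (x - mean) / std, where std is the population standard deviation (`ddof=0`). The sketch below checks this by hand on the five `Age` and `Fare` values shown in the output above. Because the mean and standard deviation here come from those five rows alone (an assumption made to keep the example self-contained), the scaled values differ from the notebook output, which was fitted on the full dataset.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Age and Fare of the first five passengers, copied from the output above.
df = pd.DataFrame({
    'Age': [22.0, 38.0, 26.0, 35.0, 35.0],
    'Fare': [7.25, 71.2833, 7.925, 53.1, 8.05],
})

# StandardScaler computes z = (x - mean) / std with the population
# standard deviation (ddof=0).
scaled = StandardScaler().fit_transform(df[['Age', 'Fare']])

# The same calculation by hand with pandas.
manual = (df - df.mean()) / df.std(ddof=0)

print(np.allclose(scaled, manual.to_numpy()))   # both approaches agree
print(scaled.mean(axis=0), scaled.std(axis=0))  # roughly 0 and exactly-ish 1 per column
```

After scaling, each column has mean 0 and standard deviation 1 (up to floating-point error), which puts `Age` and `Fare` on a comparable scale.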




## Section 5: Preprocessing Pipeline

### **Task 5**: Build Preprocessing Pipeline

*Instruction*: Using `ColumnTransformer` and `Pipeline` from `sklearn`, build a pipeline that:



*   Imputes missing values
*   Scales numeric data
*   Encodes categorical data



In [None]:
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Load the dataset
df = pd.read_csv('titanic.csv')

# Check for missing values
print(df[['Age', 'Fare']].isnull().sum())

# Fill missing values if any (e.g., with median); assign back instead of inplace=True
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Fare'] = df['Fare'].fillna(df['Fare'].median())

# Initialize the scaler
scaler = StandardScaler()

# Fit and transform the 'Age' and 'Fare' columns
df[['Age_scaled', 'Fare_scaled']] = scaler.fit_transform(df[['Age', 'Fare']])

# Display the first few rows to check results
print(df[['Age', 'Age_scaled', 'Fare', 'Fare_scaled']].head())



Age     0
Fare    0
dtype: int64
    Age  Age_scaled     Fare  Fare_scaled
0  22.0   -0.529366   7.2500    -0.503586
1  38.0    0.604265  71.2833     0.783412
2  26.0   -0.245958   7.9250    -0.490020
3  35.0    0.391709  53.1000     0.417948
4  35.0    0.391709   8.0500    -0.487507
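
The cell above repeats the scaling from Task 4 and does not use `ColumnTransformer` or `Pipeline`. One way the instruction could be met is sketched below. It runs on made-up rows with gaps (hypothetical, not taken from `titanic.csv`), and the imputation strategies, step names, and column groupings are illustrative choices rather than anything specified in the task.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy rows with gaps (not taken from titanic.csv).
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, 35.0],
    'Fare': [7.25, 71.2833, np.nan, 53.1],
    'Sex': ['male', 'female', np.nan, 'female'],
    'Pclass': [3, 1, 3, 1],
})

numeric_features = ['Age', 'Fare']
categorical_features = ['Sex', 'Pclass']

# Numeric columns: fill gaps with the median, then standardize.
numeric_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
])

# Categorical columns: fill gaps with the most frequent value, then one-hot encode.
categorical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore')),
])

# Route each column group to its own pipeline.
# sparse_threshold=0 asks for a dense array as the combined output.
preprocessor = ColumnTransformer(
    [
        ('num', numeric_pipeline, numeric_features),
        ('cat', categorical_pipeline, categorical_features),
    ],
    sparse_threshold=0,
)

features = preprocessor.fit_transform(toy)
print(features.shape)
```

Wrapping the steps this way means the medians, scaling statistics, and category lists are learned in a single `fit` call, so the same fitted `preprocessor` can later be applied to new data with `transform`.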




## Section 6: Feature Engineering

### **Task 6**: Create a New Feature

*Instruction*: Create a new feature `FamilySize` = `Siblings/Spouses Aboard` + `Parents/Children Aboard` + 1.

In [None]:
import pandas as pd

# Load the dataset
df = pd.read_csv('titanic.csv')

# Strip any whitespace from column names
df.columns = df.columns.str.strip()

# Create the FamilySize feature
df['FamilySize'] = df['Siblings/Spouses Aboard'] + df['Parents/Children Aboard'] + 1

# Display the result
print(df[['Siblings/Spouses Aboard', 'Parents/Children Aboard', 'FamilySize']].head())



   Siblings/Spouses Aboard  Parents/Children Aboard  FamilySize
0                        1                        0           2
1                        1                        0           2
2                        0                        0           1
3                        1                        0           2
4                        0                        0           1
