# Task 2: Data Preprocessing for Machine Learning – AI Bootcamp

Download the Titanic dataset here: https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv

#### About this file

The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone on board, resulting in the death of 1502 out of 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

In this challenge, we ask you to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (i.e., name, age, gender, socio-economic class, etc.).

## Section 1: Data Loading & Exploration

### **Task 1**: Load and Inspect a Dataset

*Instruction*: Load the `titanic.csv` dataset and display the first 5 rows. Show basic info and descriptive statistics for the dataset.

In [None]:
import pandas as pd
# Colab only: opens a file picker so you can upload titanic.csv
from google.colab import files
uploaded = files.upload()

# Load the dataset
df = pd.read_csv('titanic.csv')

# Display the first 5 rows
print("🔹 First 5 rows of the dataset:")
print(df.head())

# Show basic info about the dataset
# (info() prints its report itself and returns None, so it is not wrapped in print)
print("\n🔹 Dataset Info:")
df.info()

# Show descriptive statistics
print("\n🔹 Descriptive Statistics:")
print(df.describe())


Saving titanic.csv to titanic.csv
🔹 First 5 rows of the dataset:
   Survived  Pclass                                               Name  \
0         0       3                             Mr. Owen Harris Braund   
1         1       1  Mrs. John Bradley (Florence Briggs Thayer) Cum...   
2         1       3                              Miss. Laina Heikkinen   
3         1       1        Mrs. Jacques Heath (Lily May Peel) Futrelle   
4         0       3                            Mr. William Henry Allen   

      Sex   Age  Siblings/Spouses Aboard  Parents/Children Aboard     Fare  
0    male  22.0                        1                        0   7.2500  
1  female  38.0                        1                        0  71.2833  
2  female  26.0                        0                        0   7.9250  
3  female  35.0                        1                        0  53.1000  
4    male  35.0                        0                        0   8.0500  

🔹 Dataset Info:
<class 'pan
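
As a rough, self-contained sketch of the same inspection steps, the snippet below rebuilds the five rows printed above as an in-memory DataFrame and inspects it. The `Name` column is left out because one name is truncated in the printed output. The names `sample`, `buffer`, `info_text` and `summary` are illustrative and not part of the original notebook. Note that `DataFrame.info()` writes its report to a stream and returns `None`, so the sketch captures it through the `buf` argument.

```python
import io

import pandas as pd

# The five rows shown in the head() output above, without the Name column
sample = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0],
    "Pclass": [3, 1, 3, 1, 3],
    "Sex": ["male", "female", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
    "Siblings/Spouses Aboard": [1, 1, 0, 1, 0],
    "Parents/Children Aboard": [0, 0, 0, 0, 0],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
})

# Shape is (rows, columns)
print(sample.shape)

# info() returns None; pass a buffer to capture its text report
buffer = io.StringIO()
sample.info(buf=buffer)
info_text = buffer.getvalue()
print(info_text)

# describe() summarises numeric columns by default;
# include="all" would also cover text columns such as Sex
summary = sample.describe()
print(summary.loc[["mean", "min", "max"], ["Age", "Fare"]])
```

These statistics describe only the five sample rows, not the full dataset.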

## Section 2: Handling Missing Values

### **Task 2**: Identify and Handle Missing Data

*Instruction*:

*   Display the number of missing values per column.
*   Fill missing `Age` values with the median.
*   Drop the second row in the dataset.

In [None]:
import pandas as pd

# Load the dataset
df = pd.read_csv('titanic.csv')

# 1. Display the number of missing values per column
print("🔹 Missing values per column:")
print(df.isnull().sum())

# 2. Fill missing 'Age' values with the median
# (assign the result back; fillna(inplace=True) on df['Age'] is deprecated chained assignment)
age_median = df['Age'].median()
df['Age'] = df['Age'].fillna(age_median)

# 3. Drop the second row (index 1)
df.drop(index=1, inplace=True)

# Optional: Display the first few rows to confirm
print("\n🔹 First 5 rows after processing:")
print(df.head())


🔹 Missing values per column:
Survived                   0
Pclass                     0
Name                       0
Sex                        0
Age                        0
Siblings/Spouses Aboard    0
Parents/Children Aboard    0
Fare                       0
dtype: int64

🔹 First 5 rows after processing:
   Survived  Pclass                                         Name     Sex  \
0         0       3                       Mr. Owen Harris Braund    male   
2         1       3                        Miss. Laina Heikkinen  female   
3         1       1  Mrs. Jacques Heath (Lily May Peel) Futrelle  female   
4         0       3                      Mr. William Henry Allen    male   
5         0       3                              Mr. James Moran    male   

    Age  Siblings/Spouses Aboard  Parents/Children Aboard     Fare  
0  22.0                        1                        0   7.2500  
2  26.0                        0                        0   7.9250  
3  35.0                     


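In the output shown for this task, every column reports zero missing values, so the median fill leaves this particular file unchanged. To see the steps act on an actual gap, here is a minimal sketch on a small hand-built frame. The values come from the first five rows printed in Task 1, with one `Age` deliberately blanked; that blank is an assumption made for illustration, not something present in the data. The name `toy` is likewise illustrative.

```python
import numpy as np
import pandas as pd

# Ages and fares from the first five rows, with one Age blanked on purpose
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, 35.0],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
})

# 1. Count missing values per column
missing_before = toy.isna().sum()
print(missing_before)

# 2. Fill missing Age values with the median of the observed ages.
#    Assigning the result back avoids chained inplace assignment.
age_median = toy["Age"].median()
toy["Age"] = toy["Age"].fillna(age_median)

# 3. Drop the second row by position, which also works when the
#    index labels are not 0, 1, 2, ...
toy = toy.drop(toy.index[1])

print(toy)
```

Dropping by `toy.index[1]` rather than by the label `1` is a design choice: both agree for a freshly loaded file, but only the positional form stays correct after rows have been filtered or reordered.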

## Section 3: Encoding Categorical Features

### **Task 3**: Convert Categorical to Numeric

*Instruction*: Convert the `Sex` and `Pclass` columns to numeric using:

*   Label Encoding for `Sex`
*   One-Hot Encoding for `Pclass`

In [None]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Load the dataset
df = pd.read_csv('titanic.csv')

# Fill missing Age values with the median to avoid issues
# (assign the result back instead of using inplace=True on a column selection)
df['Age'] = df['Age'].fillna(df['Age'].median())

# --- Label Encoding for 'Sex' ---
label_encoder = LabelEncoder()
df['Sex'] = label_encoder.fit_transform(df['Sex'])

# --- One-Hot Encoding for 'Pclass' ---
df = pd.get_dummies(df, columns=['Pclass'], prefix='Class')

# Display the first 5 rows
print("🔹 Transformed DataFrame:")
print(df.head())


🔹 Transformed DataFrame:
   Survived                                               Name  Sex   Age  \
0         0                             Mr. Owen Harris Braund    1  22.0   
1         1  Mrs. John Bradley (Florence Briggs Thayer) Cum...    0  38.0   
2         1                              Miss. Laina Heikkinen    0  26.0   
3         1        Mrs. Jacques Heath (Lily May Peel) Futrelle    0  35.0   
4         0                            Mr. William Henry Allen    1  35.0   

   Siblings/Spouses Aboard  Parents/Children Aboard     Fare  Class_1  \
0                        1                        0   7.2500    False   
1                        1                        0  71.2833     True   
2                        0                        0   7.9250    False   
3                        1                        0  53.1000     True   
4                        0                        0   8.0500    False   

   Class_2  Class_3  
0    False     True  
1    False    False  
2    Fa


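Two details are worth knowing here. The scikit-learn documentation describes `LabelEncoder` as a tool for target labels rather than input features, and `pd.get_dummies` only creates columns for classes that actually occur in the data. The sketch below is a pandas-only alternative that takes both points into account; it is not the notebook's method. It uses an explicit mapping for `Sex`, chosen to match the 0 = female, 1 = male codes visible in the output above, and a categorical `Pclass` with all three classes declared. The names `toy`, `sex_codes` and `encoded` are illustrative.

```python
import pandas as pd

# Sex and Pclass from the first five rows of the dataset
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "female", "male"],
    "Pclass": [3, 1, 3, 1, 3],
})

# Explicit label encoding: the mapping is visible and stable
sex_codes = {"female": 0, "male": 1}
toy["Sex"] = toy["Sex"].map(sex_codes)

# Declare every class up front so that a class missing from this
# sample (here: 2) still gets its own column
toy["Pclass"] = pd.Categorical(toy["Pclass"], categories=[1, 2, 3])

# One-hot encode Pclass; dtype=int yields 0/1 instead of False/True
encoded = pd.get_dummies(toy, columns=["Pclass"], prefix="Class", dtype=int)
print(encoded)
```

Without the categorical step, this five-row sample would produce only `Class_1` and `Class_3`, which is a common source of mismatched columns between training and test data.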

## Section 4: Feature Scaling

### **Task 4**: Scale Numerical Features

*Instruction*: Use `StandardScaler` to scale the `Age` and `Fare` columns.

In [None]:
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Load dataset
df = pd.read_csv('titanic.csv')

# Fill missing Age values (required before scaling)
df['Age'] = df['Age'].fillna(df['Age'].median())

# Fill missing Fare values if any
df['Fare'] = df['Fare'].fillna(df['Fare'].median())

# Initialize the scaler
scaler = StandardScaler()

# Apply scaling to Age and Fare
df[['Age', 'Fare']] = scaler.fit_transform(df[['Age', 'Fare']])

# Display the first 5 rows
print("🔹 Scaled 'Age' and 'Fare' columns:")
print(df[['Age', 'Fare']].head())


🔹 Scaled 'Age' and 'Fare' columns:
        Age      Fare
0 -0.529366 -0.503586
1  0.604265  0.783412
2 -0.245958 -0.490020
3  0.391709  0.417948
4  0.391709 -0.487507



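For reference, `StandardScaler` with its default settings subtracts each column's mean and divides by its population standard deviation (`ddof=0`). A minimal pandas sketch of that formula on five sample rows follows. The numbers differ from the output above because the mean and standard deviation here come from five rows rather than the full file. The names `toy`, `means`, `stds` and `scaled` are illustrative.

```python
import pandas as pd

# Age and Fare from the first five rows of the dataset
toy = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
})

# z = (x - mean) / std, with the population standard deviation (ddof=0).
# pandas' own .std() defaults to ddof=1, so ddof is set explicitly.
means = toy.mean()
stds = toy.std(ddof=0)
scaled = (toy - means) / stds

print(scaled.round(3))

# Each scaled column has mean ~0 and population std ~1
print(scaled.mean().round(6))
print(scaled.std(ddof=0).round(6))
```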

## Section 5: Preprocessing Pipeline

### **Task 5**: Build Preprocessing Pipeline

*Instruction*: Using `ColumnTransformer` and `Pipeline` from `sklearn`, build a pipeline that:

*   Imputes missing values
*   Scales numeric data
*   Encodes categorical data

In [None]:
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer

# Load dataset
df = pd.read_csv('titanic.csv')

# Select relevant columns for processing
features = ['Age', 'Fare', 'Sex', 'Pclass']
df = df[features]

# Separate numeric and categorical columns
numeric_features = ['Age', 'Fare']
categorical_features = ['Sex', 'Pclass']

# Pipeline for numeric features: Impute then scale
numeric_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

# Pipeline for categorical features: Impute then encode
categorical_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])

# Combine using ColumnTransformer
preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_pipeline, numeric_features),
    ('cat', categorical_pipeline, categorical_features)
])

# Apply the preprocessing pipeline
processed_data = preprocessor.fit_transform(df)

# Optional: Show resulting NumPy array shape
print("🔹 Transformed data shape:", processed_data.shape)


## Section 6: Feature Engineering

### **Task 6**: Create a New Feature

*Instruction*: Create a new feature `FamilySize` = `Siblings/Spouses Aboard` + `Parents/Children Aboard` + 1.

In [22]:
import pandas as pd

# Reload the full dataset: Task 5 reduced df to four columns,
# so the family columns used below would otherwise be missing
df = pd.read_csv('titanic.csv')

# Check available columns
print("Available columns:")
print(df.columns)

# Create the FamilySize feature using the correct column names
df['FamilySize'] = df['Siblings/Spouses Aboard'] + df['Parents/Children Aboard'] + 1

# Show the result
print(df[['Siblings/Spouses Aboard', 'Parents/Children Aboard', 'FamilySize']].head())


Available columns:
Index(['Survived', 'Pclass', 'Name', 'Sex', 'Age', 'Siblings/Spouses Aboard',
       'Parents/Children Aboard', 'Fare'],
      dtype='object')
   Siblings/Spouses Aboard  Parents/Children Aboard  FamilySize
0                        1                        0           2
1                        1                        0           2
2                        0                        0           1
3                        1                        0           2
4                        0                        0           1
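
The same formula can be checked on the five rows shown in the output above. This sketch uses `DataFrame.assign`, which returns a new frame instead of modifying the original one. The names `toy` and `with_family` are illustrative.

```python
import pandas as pd

# Family columns from the first five rows of the dataset
toy = pd.DataFrame({
    "Siblings/Spouses Aboard": [1, 1, 0, 1, 0],
    "Parents/Children Aboard": [0, 0, 0, 0, 0],
})

# FamilySize counts siblings/spouses, parents/children, and the passenger
with_family = toy.assign(
    FamilySize=toy["Siblings/Spouses Aboard"] + toy["Parents/Children Aboard"] + 1
)
print(with_family)
```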
