# Task 2: Data Preprocessing for Machine Learning – AI Bootcamp

Download the Titanic dataset here: https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv

#### About this file

The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone on board, resulting in the death of 1502 out of 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

In this challenge, we ask you to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (i.e., name, age, gender, socio-economic class, etc.).

## Section 1: Data Loading & Exploration

### **Task 1**: Load and Inspect a Dataset

*Instruction*: Load the `titanic.csv` dataset and display the first 5 rows. Show basic info and describe statistics of the dataset.

In [1]:
import pandas as pd

df = pd.read_csv('titanic.csv')
print(df.head())
df.info()  # info() prints its report itself and returns None, so it is not wrapped in print()
print(df.describe())

   Survived  Pclass                                               Name  \
0         0       3                             Mr. Owen Harris Braund   
1         1       1  Mrs. John Bradley (Florence Briggs Thayer) Cum...   
2         1       3                              Miss. Laina Heikkinen   
3         1       1        Mrs. Jacques Heath (Lily May Peel) Futrelle   
4         0       3                            Mr. William Henry Allen   

      Sex   Age  Siblings/Spouses Aboard  Parents/Children Aboard     Fare  
0    male  22.0                        1                        0   7.2500  
1  female  38.0                        1                        0  71.2833  
2  female  26.0                        0                        0   7.9250  
3  female  35.0                        1                        0  53.1000  
4    male  35.0                        0                        0   8.0500  
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 887 entries, 0 to 886
Data columns (total 8

## Section 2: Handling Missing Values

### **Task 2**: Identify and Handle Missing Data

*Instruction*:

*   Display the number of missing values per column.
*   Fill missing `Age` values with the median.
*   Drop the second row in the dataset.

In [2]:
# Your code here
import pandas as pd
import numpy as np

# Sample DataFrame with missing values
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, np.nan, 30, 22],
    'City': ['New York', np.nan, 'Los Angeles', 'Chicago']
}

df = pd.DataFrame(data)

# 1. Display the number of missing values per column
missing_values = df.isnull().sum()
print("Missing values per column:")
print(missing_values)

# 2. Fill missing Age values with the median
# (assign the result back; calling fillna(inplace=True) on df['Age'] acts on an
# intermediate object and raises a FutureWarning in recent pandas versions)
median_age = df['Age'].median()
df['Age'] = df['Age'].fillna(median_age)

# 3. Drop the second row in the dataset (index label 1)
df.drop(index=1, inplace=True)

# Display the DataFrame after handling missing data
print("\nDataFrame after handling missing data:")
print(df)


Missing values per column:
Name    0
Age     1
City    1
dtype: int64

DataFrame after handling missing data:
      Name   Age         City
0    Alice  25.0     New York
2  Charlie  30.0  Los Angeles
3    David  22.0      Chicago
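`df.drop(index=1)` removes the row whose index *label* is 1. That is the second row only while the DataFrame still has its default `RangeIndex`. A minimal sketch, on a hypothetical toy frame (not the Titanic data) with non-default labels, of dropping the second row by position and renumbering the rows afterwards:

```python
import pandas as pd

# Hypothetical toy frame whose index labels do not match row positions
df = pd.DataFrame(
    {"Name": ["Alice", "Bob", "Charlie", "David"], "Age": [25.0, 25.0, 30.0, 22.0]},
    index=[10, 11, 12, 13],
)

# df.drop(index=1) would raise a KeyError here because no row is labelled 1.
# df.index[1] looks up the label of the second row, so the drop is positional.
df_positional = df.drop(index=df.index[1])

# reset_index(drop=True) renumbers the remaining rows 0..n-1
df_clean = df_positional.reset_index(drop=True)

print(df_clean)
```

Whether to reset the index is a judgment call: keeping the gap preserves a link to the original row numbers, while resetting avoids surprises in later positional code.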


## Section 3: Encoding Categorical Features

### **Task 3**: Convert Categorical to Numeric

*Instruction*: Convert `Sex` and `Pclass` columns to numeric using:


*   Label Encoding for `Sex`
*   One-Hot Encoding for `Pclass`



In [3]:
from sklearn.preprocessing import LabelEncoder

# Your code here
import pandas as pd

# Sample DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Sex': ['female', 'male', 'male', 'female'],
    'Pclass': [1, 2, 1, 3],
}

df = pd.DataFrame(data)

# Display original DataFrame
print("Original DataFrame:")
print(df)

# 1. Label Encoding for 'Sex'
# LabelEncoder assigns integer codes in sorted label order: female -> 0, male -> 1
df['Sex'] = LabelEncoder().fit_transform(df['Sex'])

# 2. One-Hot Encoding for 'Pclass'
df = pd.get_dummies(df, columns=['Pclass'], prefix='Pclass', drop_first=True)

# Display the modified DataFrame
print("\nDataFrame after converting categorical to numeric:")
print(df)


Original DataFrame:
      Name     Sex  Pclass
0    Alice  female       1
1      Bob    male       2
2  Charlie    male       1
3    David  female       3

DataFrame after converting categorical to numeric:
      Name  Sex  Pclass_2  Pclass_3
0    Alice    0     False     False
1      Bob    1      True     False
2  Charlie    1     False     False
3    David    0     False      True
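The one-hot columns above are booleans, and `drop_first=True` leaves out `Pclass_1`. As a small sketch on the same toy values, passing `dtype=int` to `pd.get_dummies` gives 0/1 integers instead, which some downstream code expects:

```python
import pandas as pd

df = pd.DataFrame({
    "Sex": ["female", "male", "male", "female"],
    "Pclass": [1, 2, 1, 3],
})

# dtype=int yields 0/1 integers instead of True/False.
# drop_first=True omits the first level (Pclass_1), so a row with zeros in
# both remaining columns means Pclass == 1.
encoded = pd.get_dummies(
    df, columns=["Pclass"], prefix="Pclass", drop_first=True, dtype=int
)

print(encoded)
```

Dropping the first level avoids a redundant column for linear models; tree-based models usually work with or without it.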


## Section 4: Feature Scaling

### **Task 4**: Scale Numerical Features

*Instruction*: Use `StandardScaler` to scale the `Age` and `Fare` columns.

In [4]:
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Sample DataFrame with Age and Fare columns
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 22, 35],
    'Fare': [100, 150, 75, 200]
}

df = pd.DataFrame(data)

# Display the original DataFrame
print("Original DataFrame:")
print(df)

# Instantiate the StandardScaler
scaler = StandardScaler()

# Scale the Age and Fare columns
df[['Age', 'Fare']] = scaler.fit_transform(df[['Age', 'Fare']])

# Display the DataFrame after scaling
print("\nDataFrame after scaling Age and Fare:")
print(df)


Original DataFrame:
      Name  Age  Fare
0    Alice   25   100
1      Bob   30   150
2  Charlie   22    75
3    David   35   200

DataFrame after scaling Age and Fare:
      Name       Age      Fare
0    Alice -0.606092 -0.650945
1      Bob  0.404061  0.390567
2  Charlie -1.212183 -1.171700
3    David  1.414214  1.432078
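To see where the scaled numbers come from, the sketch below recomputes them by hand on the same toy values. `StandardScaler` standardizes each column as `(x - mean) / std`, using the population standard deviation (`ddof=0`), which is also NumPy's default:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Same Age and Fare values as the sample DataFrame above
X = np.array([[25, 100], [30, 150], [22, 75], [35, 200]], dtype=float)

# z = (x - mean) / std, computed per column with the population std (ddof=0)
manual = (X - X.mean(axis=0)) / X.std(axis=0)

scaled = StandardScaler().fit_transform(X)

# The hand-computed values agree with StandardScaler
print(np.allclose(manual, scaled))
print(manual)
```

After scaling, each column has mean 0 and standard deviation 1, so `Age` and `Fare` contribute on a comparable scale.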


## Section 5: Preprocessing Pipeline

### **Task 5**: Build Preprocessing Pipeline

*Instruction*: Using `ColumnTransformer` and `Pipeline` from `sklearn`, build a pipeline that:



*   Imputes missing values
*   Scales numeric data
*   Encodes categorical data



In [5]:
# Your code here
import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Sample DataFrame
data = {
    'Age': [25, np.nan, 30, 22, np.nan],
    'Fare': [100, 150, 75, 200, 50],
    'Sex': ['female', 'male', 'male', 'female', np.nan],
    'Pclass': [1, 2, 1, 3, 2]
}

df = pd.DataFrame(data)

# Define the columns
numeric_features = ['Age', 'Fare']
categorical_features = ['Sex', 'Pclass']

# Create a preprocessing pipeline
numeric_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),  # Impute missing values with the mean
    ('scaler', StandardScaler())                  # Scale numeric data
])

categorical_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),  # Impute missing with 'missing' value
    ('onehot', OneHotEncoder(handle_unknown='ignore'))                     # One-Hot encode categorical data
])

# Combine preprocessing for numeric and categorical data
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_pipeline, numeric_features),
        ('cat', categorical_pipeline, categorical_features)
    ])

# Fit and transform the data
X_processed = preprocessor.fit_transform(df)

# Optionally convert the result back to a DataFrame with readable column names
processed_columns = (
    ['Age_scaled', 'Fare_scaled'] +
    list(preprocessor.named_transformers_['cat']['onehot'].get_feature_names_out(categorical_features))
)

df_processed = pd.DataFrame(X_processed, columns=processed_columns)

# Display the processed DataFrame
print("Processed DataFrame:")
print(df_processed)


Processed DataFrame:
   Age_scaled  Fare_scaled  Sex_female  Sex_male  Sex_missing  Pclass_1  \
0   -0.260820    -0.278543         1.0       0.0          0.0       1.0   
1    0.000000     0.649934         0.0       1.0          0.0       0.0   
2    1.695332    -0.742781         0.0       1.0          0.0       1.0   
3   -1.434511     1.578410         1.0       0.0          0.0       0.0   
4    0.000000    -1.207020         0.0       0.0          1.0       0.0   

   Pclass_2  Pclass_3  
0       0.0       0.0  
1       1.0       0.0  
2       0.0       0.0  
3       0.0       1.0  
4       1.0       0.0  
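One reason to wrap these steps in a pipeline is that the imputation values, scaling statistics, and category lists are learned once and then reused on new rows. A minimal sketch, assuming hypothetical toy `train` and `new` frames (not the Titanic data) and a shortened column list:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data; the column names mirror the cell above
train = pd.DataFrame({
    "Age": [25, np.nan, 30, 22],
    "Fare": [100, 150, 75, 200],
    "Sex": ["female", "male", "male", "female"],
})
new = pd.DataFrame({"Age": [np.nan], "Fare": [120], "Sex": ["unknown"]})

numeric_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
])

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_pipeline, ["Age", "Fare"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Sex"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

# Means, standard deviations, and categories are learned from train only
preprocessor.fit(train)

# The fitted statistics are reused unchanged on the unseen row
X_new = preprocessor.transform(new)

print(X_new)
```

The missing `Age` is filled with the training mean and so scales to 0, and the unseen `Sex` value becomes all zeros in the one-hot columns because of `handle_unknown="ignore"`.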


## Section 6: Feature Engineering

### **Task 6**: Create a New Feature

*Instruction*: Create a new feature `FamilySize` = `Siblings/Spouses Aboard` + `Parents/Children Aboard` + 1.

In [6]:
import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Sample DataFrame with relevant columns
data = {
    'Siblings/Spouses Aboard': [1, 0, 2, 1, 0],
    'Parents/Children Aboard': [0, 1, 0, 1, 1],
    'Age': [22, 30, np.nan, 25, 40],
    'Fare': [7.25, 71.83, 8.05, 53.10, np.nan],
    'Sex': ['female', 'male', 'female', 'male', 'female'],
    'Pclass': [1, 3, 1, 2, 3]
}

df = pd.DataFrame(data)

# Create the new feature FamilySize
df['FamilySize'] = df['Siblings/Spouses Aboard'] + df['Parents/Children Aboard'] + 1

# Define the columns
numeric_features = ['FamilySize', 'Age', 'Fare']
categorical_features = ['Sex', 'Pclass']

# Create a preprocessing pipeline
numeric_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),  # Impute missing values with the mean
    ('scaler', StandardScaler())                  # Scale numeric data
])

categorical_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),  # Impute missing with 'missing'
    ('onehot', OneHotEncoder(handle_unknown='ignore'))                     # One-Hot encode categorical data
])

# Combine preprocessing for numeric and categorical data
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_pipeline, numeric_features),
        ('cat', categorical_pipeline, categorical_features)
    ])

# Fit and transform the data
X_processed = preprocessor.fit_transform(df)

# Optionally convert the result back to a DataFrame with readable column names
feature_names = (
    ['FamilySize_scaled', 'Age_scaled', 'Fare_scaled'] +
    list(preprocessor.named_transformers_['cat']['onehot'].get_feature_names_out(categorical_features))
)

df_processed = pd.DataFrame(X_processed, columns=feature_names)

# Display the processed DataFrame
print("Processed DataFrame:")
print(df_processed)

Processed DataFrame:
   FamilySize_scaled  Age_scaled  Fare_scaled  Sex_female  Sex_male  Pclass_1  \
0          -0.816497   -1.186295    -1.102568         1.0       0.0       1.0   
1          -0.816497    0.122720     1.458030         0.0       1.0       0.0   
2           1.224745    0.000000    -1.070848         1.0       0.0       1.0   
3           1.224745   -0.695414     0.715385         0.0       1.0       0.0   
4          -0.816497    1.758989     0.000000         1.0       0.0       0.0   

   Pclass_2  Pclass_3  
0       0.0       0.0  
1       0.0       1.0  
2       0.0       0.0  
3       1.0       0.0  
4       0.0       1.0  
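For reference, the same `FamilySize` formula can be checked by hand on the five passengers shown in the `df.head()` output of Task 1. This is a small sketch with those two columns copied in, not a run over the full file:

```python
import pandas as pd

# Values copied from the df.head() output in Task 1
df = pd.DataFrame({
    "Siblings/Spouses Aboard": [1, 1, 0, 1, 0],
    "Parents/Children Aboard": [0, 0, 0, 0, 0],
})

# The +1 counts the passenger themselves
df["FamilySize"] = df["Siblings/Spouses Aboard"] + df["Parents/Children Aboard"] + 1

print(df["FamilySize"].tolist())  # [2, 2, 1, 2, 1]
```

A value of 1 therefore means the passenger travelled without any recorded family members.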
