
## Data Preparation

*   Feature Scaling
*   Feature Binning
*   Column Transformer
*   Function Transformer
*   Sklearn Pipeline

In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder
from sklearn.pipeline import Pipeline

In [2]:
titanic = pd.read_csv('titanic.csv')

**Feature Scaling**

In [3]:
#Select features for scaling
features_to_scale = ['Age', 'Fare']

In [4]:
# Handle missing values
# Fill missing values in 'Age' and 'Fare' with the median
titanic['Age'] = titanic['Age'].fillna(titanic['Age'].median())
titanic['Fare'] = titanic['Fare'].fillna(titanic['Fare'].median())
#titanic['Age'].fillna(titanic['Age'].median(), inplace=True)
#titanic['Fare'].fillna(titanic['Fare'].median(), inplace=True)

In [5]:
#Apply MinMaxScaler to the selected features
scaler = MinMaxScaler()
scaled_features = scaler.fit_transform(titanic[features_to_scale])
#print(scaled_features)
print(type(scaled_features))

<class 'numpy.ndarray'>


In [6]:

# Convert scaled features into a DataFrame
scaled_df = pd.DataFrame(scaled_features, columns=features_to_scale)
print(scaled_df)


          Age      Fare
0    0.271174  0.014151
1    0.472229  0.139136
2    0.321438  0.015469
3    0.434531  0.103644
4    0.434531  0.015713
..        ...       ...
886  0.334004  0.025374
887  0.233476  0.058556
888  0.346569  0.045771
889  0.321438  0.058556
890  0.396833  0.015127

[891 rows x 2 columns]
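With its default settings, MinMaxScaler rescales each column to the range [0, 1] using (x - min) / (max - min). The sketch below checks that formula against the scaler on a handful of made-up values; the array `ages` is illustrative and is not read from titanic.csv.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Illustrative values only (not taken from titanic.csv)
ages = np.array([[2.0], [22.0], [38.0], [80.0]])

# Manual min-max scaling, computed per column: (x - min) / (max - min)
col_min = ages.min(axis=0)
col_max = ages.max(axis=0)
manual = (ages - col_min) / (col_max - col_min)

# The same computation done by scikit-learn
scaled = MinMaxScaler().fit_transform(ages)

matches = np.allclose(manual, scaled)
print(matches)
```

The smallest value in a column maps to 0 and the largest to 1, which is why the scaled 'Age' and 'Fare' values above all fall between 0 and 1.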


In [7]:
# Write the scaled features back into the original DataFrame (overwrites 'Age' and 'Fare')
print(titanic[features_to_scale])
titanic[features_to_scale] = scaled_df
print(titanic[features_to_scale])
print(type(titanic[features_to_scale]))

      Age     Fare
0    22.0   7.2500
1    38.0  71.2833
2    26.0   7.9250
3    35.0  53.1000
4    35.0   8.0500
..    ...      ...
886  27.0  13.0000
887  19.0  30.0000
888  28.0  23.4500
889  26.0  30.0000
890  32.0   7.7500

[891 rows x 2 columns]
          Age      Fare
0    0.271174  0.014151
1    0.472229  0.139136
2    0.321438  0.015469
3    0.434531  0.103644
4    0.434531  0.015713
..        ...       ...
886  0.334004  0.025374
887  0.233476  0.058556
888  0.346569  0.045771
889  0.321438  0.058556
890  0.396833  0.015127

[891 rows x 2 columns]
<class 'pandas.core.frame.DataFrame'>


In [8]:
# Display the results
print("After MinMax Scaling:")
print(titanic.shape)
print(titanic.head())

After MinMax Scaling:
(891, 12)
   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex       Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  0.271174      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  0.472229      1   
2                             Heikkinen, Miss. Laina  female  0.321438      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  0.434531      1   
4                           Allen, Mr. William Henry    male  0.434531      0   

   Parch            Ticket      Fare Cabin Embarked  
0      0         A/5 21171  0.014151   NaN        S  
1      0          PC 17599  0.139136   C85        C  
2      0  STON/O2. 3101282  0.015469   NaN        S  
3      0            113803  0.103644  C123      

# **Feature Binning**

In [9]:
# Extract Title from Name
titanic['Title'] = titanic['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)
print(titanic['Title'])

0        Mr
1       Mrs
2      Miss
3       Mrs
4        Mr
       ... 
886     Rev
887    Miss
888    Miss
889      Mr
890      Mr
Name: Title, Length: 891, dtype: object
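To see what the regular expression captures, here is a small sketch that applies the same pattern to three names copied from the output above. The pattern matches a space, then a run of letters, then a literal dot, and keeps only the letters.

```python
import pandas as pd

# Three names as they appear in the Titanic data shown above
names = pd.Series([
    'Braund, Mr. Owen Harris',
    'Heikkinen, Miss. Laina',
    'Futrelle, Mrs. Jacques Heath (Lily May Peel)',
])

# Capture the word that sits between a space and a period
titles = names.str.extract(r' ([A-Za-z]+)\.', expand=False)
title_list = titles.tolist()
print(title_list)
```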


In [10]:
# Define title mapping dictionary
title_mapping = {
    'Mr': 'Mr', 'Mrs': 'Mrs', 'Miss': 'Miss', 'Master': 'Master',
    'Capt': 'Officer', 'Col': 'Officer', 'Major': 'Officer', 'Dr': 'Officer', 'Rev': 'Officer',
    'Countess': 'Royalty', 'Sir': 'Royalty', 'Lady': 'Royalty', 'Jonkheer': 'Royalty', 'Don': 'Royalty',
    'Mlle': 'Miss', 'Ms': 'Miss', 'Mme': 'Mrs'
}
print(type(titanic['Title']))
# Map extracted titles to broader categories; titles missing from the mapping become 'Other'
titanic['Title_Group'] = titanic['Title'].map(title_mapping).fillna('Other')
print(titanic)
#filtered_titanic = titanic[titanic['Title_Group'] == 'Other'].tail(100)
#print(filtered_titanic)

<class 'pandas.core.series.Series'>
     PassengerId  Survived  Pclass  \
0              1         0       3   
1              2         1       1   
2              3         1       3   
3              4         1       1   
4              5         0       3   
..           ...       ...     ...   
886          887         0       2   
887          888         1       1   
888          889         0       3   
889          890         1       1   
890          891         0       3   

                                                  Name     Sex       Age  \
0                              Braund, Mr. Owen Harris    male  0.271174   
1    Cumings, Mrs. John Bradley (Florence Briggs Th...  female  0.472229   
2                               Heikkinen, Miss. Laina  female  0.321438   
3         Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  0.434531   
4                             Allen, Mr. William Henry    male  0.434531   
..                                                 
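Series.map returns NaN for any key that is missing from the dictionary, so an unexpected title would otherwise end up as a missing value. The sketch below shows this on a shortened mapping; the title 'Dona' and the fallback label 'Other' are illustrative assumptions.

```python
import pandas as pd

titles = pd.Series(['Mr', 'Capt', 'Mlle', 'Dona'])

# Shortened version of the mapping; 'Dona' is deliberately left out
mapping = {'Mr': 'Mr', 'Capt': 'Officer', 'Mlle': 'Miss'}

grouped = titles.map(mapping)             # unmapped titles become NaN
grouped_filled = grouped.fillna('Other')  # give them an explicit bin

print(grouped.isna().tolist())
print(grouped_filled.tolist())
```

Checking `grouped.isna().sum()` before filling is a quick way to find out whether the mapping covers every title in the data.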

In [11]:
# Drop the original 'Title' column (optional)
titanic.drop(columns=['Title'], inplace=True)
print(titanic.head())

# Encode the categorical feature using One-Hot Encoding (optional for ML models)
titanic = pd.get_dummies(titanic, columns=['Title_Group'], dtype=int)

   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex       Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  0.271174      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  0.472229      1   
2                             Heikkinen, Miss. Laina  female  0.321438      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  0.434531      1   
4                           Allen, Mr. William Henry    male  0.434531      0   

   Parch            Ticket      Fare Cabin Embarked Title_Group  
0      0         A/5 21171  0.014151   NaN        S          Mr  
1      0          PC 17599  0.139136   C85        C         Mrs  
2      0  STON/O2. 3101282  0.015469   NaN        S        Miss  
3      0            113803  0.10

In [12]:
# Display the updated dataset with new feature
print(titanic.head())
print(titanic.shape)

   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex       Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  0.271174      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  0.472229      1   
2                             Heikkinen, Miss. Laina  female  0.321438      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  0.434531      1   
4                           Allen, Mr. William Henry    male  0.434531      0   

   Parch            Ticket      Fare Cabin Embarked  Title_Group_Master  \
0      0         A/5 21171  0.014151   NaN        S                   0   
1      0          PC 17599  0.139136   C85        C                   0   
2      0  STON/O2. 3101282  0.015469   NaN        S                   0

**Column Transformer**

In [13]:
# Sample DataFrame
data = pd.DataFrame({
    'Age': [22, None, 24, 22, None, 24],
    'Sex': ['male', 'female', 'female','male', 'female', 'female'],
    'Embarked': ['B', 'B', 'C', 'C', 'C','S'],
    'Fare': [7.25, 71.83, 8.05,7.25, 71.83, 8.05]
})

print("Original DataFrame:")
print(data)

# Define ColumnTransformer
preprocessor = ColumnTransformer(transformers=[
    ('age_imputer', SimpleImputer(strategy='mean'), ['Age']),
    ('ohe', OneHotEncoder(), ['Sex','Embarked'])
], remainder='passthrough')  # Pass through other columns like 'Fare'

# Transform the data
transformed_data = preprocessor.fit_transform(data)

# Output shape and transformed data
print("\nTransformed Data Shape:")
print(transformed_data.shape)
#print(type(transformed_data))


transformed_df = pd.DataFrame(transformed_data)
print("\nTransformed DataFrame:")
print(transformed_df)

Original DataFrame:
    Age     Sex Embarked   Fare
0  22.0    male        B   7.25
1   NaN  female        B  71.83
2  24.0  female        C   8.05
3  22.0    male        C   7.25
4   NaN  female        C  71.83
5  24.0  female        S   8.05

Transformed Data Shape:
(6, 7)

Transformed DataFrame:
      0    1    2    3    4    5      6
0  22.0  0.0  1.0  1.0  0.0  0.0   7.25
1  23.0  1.0  0.0  1.0  0.0  0.0  71.83
2  24.0  1.0  0.0  0.0  1.0  0.0   8.05
3  22.0  0.0  1.0  0.0  1.0  0.0   7.25
4  23.0  1.0  0.0  0.0  1.0  0.0  71.83
5  24.0  1.0  0.0  0.0  0.0  1.0   8.05


**Function Transformer**

In [14]:
# Sample DataFrame
data = pd.DataFrame({
    'Age': [22, None, 24,22, None, 24],
    'Sex': ['male', 'female', 'female','male', 'female', 'female'],
    'Embarked': ['B', 'B', 'C', None, 'C','S'],
    'Fare': [7.25, 71.83, 8.05,7.25, 71.83, 8.05]
})

print("Original DataFrame:")
print(data)

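# Custom function to impute 'Embarked' with its most frequent value (mode)
def impute_embarked(X):
    X = X.copy()  # Work on a copy so the caller's data is not modified
    X['Embarked'] = X['Embarked'].fillna(X['Embarked'].mode()[0])  # Fill missing values
    print(X['Embarked'])
    return X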

# Define ColumnTransformer
# Note: each transformer receives the original input columns, so 'ohe' below
# still sees the missing 'Embarked' value rather than the imputed one.
preprocessor = ColumnTransformer(transformers=[
    ('age_imputer', SimpleImputer(strategy='mean'), ['Age']),
    ('embarked_imputer', FunctionTransformer(impute_embarked), ['Embarked']),
    ('ohe', OneHotEncoder(), ['Sex','Embarked'])
], remainder='passthrough')  # Pass through other columns like 'Fare'

# Transform the data
transformed_data = preprocessor.fit_transform(data)

# Output shape and transformed data
print("\nTransformed Data Shape:")
print(transformed_data.shape)
print(transformed_data)


transformed_df = pd.DataFrame(transformed_data)
print("\nTransformed DataFrame:")
print(transformed_df)


Original DataFrame:
    Age     Sex Embarked   Fare
0  22.0    male        B   7.25
1   NaN  female        B  71.83
2  24.0  female        C   8.05
3  22.0    male     None   7.25
4   NaN  female        C  71.83
5  24.0  female        S   8.05
0    B
1    B
2    C
3    B
4    C
5    S
Name: Embarked, dtype: object

Transformed Data Shape:
(6, 9)
[[22.0 'B' 0.0 1.0 1.0 0.0 0.0 0.0 7.25]
 [23.0 'B' 1.0 0.0 1.0 0.0 0.0 0.0 71.83]
 [24.0 'C' 1.0 0.0 0.0 1.0 0.0 0.0 8.05]
 [22.0 'B' 0.0 1.0 0.0 0.0 0.0 1.0 7.25]
 [23.0 'C' 1.0 0.0 0.0 1.0 0.0 0.0 71.83]
 [24.0 'S' 1.0 0.0 0.0 0.0 1.0 0.0 8.05]]

Transformed DataFrame:
      0  1    2    3    4    5    6    7      8
0  22.0  B  0.0  1.0  1.0  0.0  0.0  0.0   7.25
1  23.0  B  1.0  0.0  1.0  0.0  0.0  0.0  71.83
2  24.0  C  1.0  0.0  0.0  1.0  0.0  0.0   8.05
3  22.0  B  0.0  1.0  0.0  0.0  0.0  1.0   7.25
4  23.0  C  1.0  0.0  0.0  1.0  0.0  0.0  71.83
5  24.0  S  1.0  0.0  0.0  0.0  1.0  0.0   8.05


**Sklearn Pipeline**

In [15]:
# Sample DataFrame
data = pd.DataFrame({
    'Age': [22, None, 24,22, None, 24],
    'Sex': ['male', 'female', 'female','male', 'female', 'female'],
    'Embarked': ['B', 'B', 'C', None, 'C','S'],
    'Fare': [7.25, 71.83, 8.05,7.25, 71.83, 8.05]
})

print("Original DataFrame:")
print(data)

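# Custom function to impute 'Embarked' with its most frequent value (mode)
def impute_embarked(X):
    X = X.copy()  # Work on a copy so the caller's data is not modified
    X['Embarked'] = X['Embarked'].fillna(X['Embarked'].mode()[0])  # Fill missing values
    print(X['Embarked'])
    return X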

preprocessor = ColumnTransformer(transformers=[
    ('age_imputer', SimpleImputer(strategy='mean'), ['Age']),
    ('embarked_encoder', Pipeline(steps=[
        ('imputer', FunctionTransformer(impute_embarked)),  # Impute Embarked first
        ('onehot', OneHotEncoder())  # Then apply OneHotEncoder
    ]), ['Embarked']),
    ('ohe', OneHotEncoder(), ['Sex'])
], remainder='passthrough')  # Pass through other columns like 'Fare'


# Transform the data
transformed_data = preprocessor.fit_transform(data)

# Output shape and transformed data
print("\nTransformed Data Shape:")
print(transformed_data.shape)
print(transformed_data)

transformed_df = pd.DataFrame(transformed_data)
print("\nTransformed DataFrame:")
print(transformed_df)


Original DataFrame:
    Age     Sex Embarked   Fare
0  22.0    male        B   7.25
1   NaN  female        B  71.83
2  24.0  female        C   8.05
3  22.0    male     None   7.25
4   NaN  female        C  71.83
5  24.0  female        S   8.05
0    B
1    B
2    C
3    B
4    C
5    S
Name: Embarked, dtype: object

Transformed Data Shape:
(6, 7)
[[22.    1.    0.    0.    0.    1.    7.25]
 [23.    1.    0.    0.    1.    0.   71.83]
 [24.    0.    1.    0.    1.    0.    8.05]
 [22.    1.    0.    0.    0.    1.    7.25]
 [23.    0.    1.    0.    1.    0.   71.83]
 [24.    0.    0.    1.    1.    0.    8.05]]

Transformed DataFrame:
      0    1    2    3    4    5      6
0  22.0  1.0  0.0  0.0  0.0  1.0   7.25
1  23.0  1.0  0.0  0.0  1.0  0.0  71.83
2  24.0  0.0  1.0  0.0  1.0  0.0   8.05
3  22.0  1.0  0.0  0.0  0.0  1.0   7.25
4  23.0  0.0  1.0  0.0  1.0  0.0  71.83
5  24.0  0.0  0.0  1.0  1.0  0.0   8.05
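As an alternative to the custom function, the impute-then-encode chain can probably be built from SimpleImputer with `strategy='most_frequent'`. The sketch below swaps that in for `impute_embarked`; it is not the approach used in the cell above. It assumes missing values are stored as NaN (as `pd.read_csv` produces) rather than None, and the toy column is illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Illustrative column with one missing value stored as NaN
embarked = pd.DataFrame({'Embarked': ['S', 'C', np.nan, 'S']})

# Steps run in order: the encoder only ever sees the imputed column
embarked_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder()),
])

# OneHotEncoder returns a sparse matrix by default, so convert it for printing
encoded = embarked_pipeline.fit_transform(embarked).toarray()
print(encoded)
```

Because the missing row is filled with 'S' before encoding, the result has only two columns (C and S) and no separate category for the missing value.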


# Lab Task

**Apply a feature creation preprocessing step to the Titanic dataset: create a FamilySize feature for each passenger using the equation FamilySize = SibSp + Parch + 1.**
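A minimal sketch of the equation on a few made-up rows. The column names match the Titanic dataset, but the values are illustrative.

```python
import pandas as pd

# Illustrative rows using the Titanic column names
passengers = pd.DataFrame({'SibSp': [1, 0, 3], 'Parch': [0, 0, 2]})

# Siblings/spouses + parents/children + the passenger themselves
passengers['FamilySize'] = passengers['SibSp'] + passengers['Parch'] + 1

family_sizes = passengers['FamilySize'].tolist()
print(family_sizes)
```

A passenger travelling alone has SibSp = 0 and Parch = 0, so the + 1 term gives a FamilySize of 1.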

**Apply ColumnTransformer, FunctionTransformer and Sklearn Pipeline on the Titanic dataset.**
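One possible way to combine the three tools is sketched below on a toy frame. The helper `add_family_size`, the step names, and the choice of columns are assumptions made for illustration. The FunctionTransformer runs first inside an outer Pipeline, so the ColumnTransformer can then select the new FamilySize column.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, MinMaxScaler, OneHotEncoder

# Illustrative rows using Titanic column names
df = pd.DataFrame({
    'Age': [22.0, None, 26.0, 35.0],
    'Sex': ['male', 'female', 'female', 'male'],
    'SibSp': [1, 1, 0, 0],
    'Parch': [0, 2, 0, 0],
})

def add_family_size(X):
    X = X.copy()  # avoid modifying the caller's DataFrame
    X['FamilySize'] = X['SibSp'] + X['Parch'] + 1
    return X

columns = ColumnTransformer(transformers=[
    # Impute Age with the median, then scale it to [0, 1]
    ('age', Pipeline(steps=[
        ('impute', SimpleImputer(strategy='median')),
        ('scale', MinMaxScaler()),
    ]), ['Age']),
    ('sex', OneHotEncoder(), ['Sex']),
    ('family', 'passthrough', ['FamilySize']),
], sparse_threshold=0)  # always return a dense array; other columns are dropped

preprocess = Pipeline(steps=[
    ('family_size', FunctionTransformer(add_family_size)),  # runs first
    ('columns', columns),                                   # then per-column steps
])

result = preprocess.fit_transform(df)
print(result.shape)
```

The output has four columns: scaled Age, two one-hot columns for Sex, and FamilySize. SibSp and Parch are dropped because the ColumnTransformer keeps the default `remainder='drop'`.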

**Use ColumnTransformer, FunctionTransformer and Sklearn Pipeline to preprocess the following dataset:**

https://www.kaggle.com/datasets/kamilpytlak/personal-key-indicators-of-heart-disease