## FEATURE ENGINEERING: Binning, Decomposition, Aggregation, Creation of Features

In [20]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, OrdinalEncoder
from category_encoders import BinaryEncoder, CountEncoder

In [21]:
df = pd.read_excel('/kaggle/input/train-2441656/train.xlsx')

## Tasks

Binning of features:

Bin the 'Age' column into different age groups (e.g., child, adult, elderly).

Bin the 'Fare' column into different fare ranges (e.g., low, medium, high).

In [22]:
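# Bin the 'Age' column into different age groups (e.g., child, adult, elderly).
print(df['Age'].min())
print(df['Age'].max())
# Right-closed intervals: (0, 17] -> child, (17, 59] -> adult, (59, 80] -> elderly
bins = [0, 17, 59, 80]
labels = ['child', 'adult', 'elderly']
# Rows with a missing age yield NaN and are dropped from the displayed result
pd.cut(df['Age'], bins=bins, labels=labels).dropna()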

0.42
80.0


0      adult
1      adult
2      adult
3      adult
4      adult
       ...  
885    adult
886    adult
887    adult
889    adult
890    adult
Name: Age, Length: 714, dtype: category
Categories (3, object): ['child' < 'adult' < 'elderly']

In [23]:
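# Bin the 'Fare' column into different fare ranges (e.g., low, medium, high).
print(df['Fare'].min())
print(df['Fare'].max())
# Bin edges: 0, the 25th and 75th percentiles, and the maximum fare
bins = [0] + list(df['Fare'].quantile([0.25, 0.75]).values) + [df['Fare'].max()]
labels = ['Low', 'Medium', 'High']
# pd.cut uses right-closed intervals, so a fare of exactly 0 falls outside the
# first bin, becomes NaN, and is removed by dropna()
pd.cut(df['Fare'], bins=bins, labels=labels).dropna()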

0.0
512.3292


0         Low
1        High
2      Medium
3        High
4      Medium
        ...  
886    Medium
887    Medium
888    Medium
889    Medium
890       Low
Name: Fare, Length: 876, dtype: category
Categories (3, object): ['Low' < 'Medium' < 'High']
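
The printed minimum fare is 0.0 and the first bin edge is also 0, so zero-fare rows land outside the first interval `(0, Q1]`; this is consistent with the result having fewer rows than the dataset. A minimal sketch of how `include_lowest=True` closes that gap, using made-up fares and hypothetical fixed edges in place of the real quantiles:

```python
import pandas as pd

# Toy fares (illustrative values, not taken from the dataset).
fares = pd.Series([0.0, 7.25, 14.45, 31.0, 512.33], name="Fare")

# Hypothetical fixed edges standing in for [0, Q1, Q3, max].
edges = [0, 8, 31, 512.33]
labels = ["Low", "Medium", "High"]

# Default behaviour: intervals are (left, right], so 0.0 is outside (0, 8].
default_bins = pd.cut(fares, bins=edges, labels=labels)

# include_lowest=True also closes the first interval on the left: [0, 8].
closed_bins = pd.cut(fares, bins=edges, labels=labels, include_lowest=True)

print(default_bins.isna().sum())  # number of fares left unbinned by default
print(closed_bins.tolist())
```

If the quantile-based bins should split the data by rank rather than by fixed edges, `pd.qcut` is another option worth considering.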

Aggregation of features:

Group the dataset by 'Pclass' and calculate the average 'Age' and 'Fare' for each class.

Group the dataset by 'Sex' and calculate the total number of passengers and the average 'Age' for each gender.

In [24]:
# Group the dataset by 'Pclass' and calculate the average 'Age' and 'Fare' for each class.
print('Average age by Pclass:')
print(df.groupby('Pclass')['Age'].mean())
print('\n')
print('Average fare by Pclass:')
print(df.groupby('Pclass')['Fare'].mean())

Average age by Pclass:
Pclass
1    38.233441
2    29.877630
3    25.140620
Name: Age, dtype: float64


Average fare by Pclass:
Pclass
1    84.154687
2    20.662183
3    13.675550
Name: Fare, dtype: float64
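
The second aggregation task (passenger count and average 'Age' per 'Sex') has no cell above. One possible way to do it is with named aggregation; the sketch below runs on a small made-up frame that only borrows the column names, so the numbers are illustrative and not results from the dataset.

```python
import pandas as pd

# Toy rows with the same column names as the dataset (values are illustrative).
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "male"],
    "Age": [22.0, 38.0, 26.0, None, 35.0],
})

# 'size' counts every row in the group (missing ages included),
# while 'mean' skips missing ages.
by_sex = toy.groupby("Sex").agg(
    passengers=("Age", "size"),
    mean_age=("Age", "mean"),
)
print(by_sex)
```

On the real data the same call would be applied to `df` instead of `toy`.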


Decomposition of features:
    
Decompose the 'Name' column into two new columns: 'Title' (extracted from the name prefix) and 'LastName' (extracted from the last name).

In [25]:
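# Names look like "Last, Title. First ..."; take the text before the first period,
# split it on the comma, and strip the whitespace left around each part
df['Title'] = df['Name'].apply(lambda name: name.split('.')[0].split(',')[1].strip())
df['Last Name'] = df['Name'].apply(lambda name: name.split('.')[0].split(',')[0].strip())
df[['Name','Title','Last Name']].dropna()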

|  | Name | Title | Last Name |
|---|---|---|---|
| 0 | Braund, Mr. Owen Harris | Mr | Braund |
| 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | Mrs | Cumings |
| 2 | Heikkinen, Miss. Laina | Miss | Heikkinen |
| 3 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | Mrs | Futrelle |
| 4 | Allen, Mr. William Henry | Mr | Allen |
| ... | ... | ... | ... |
| 886 | Montvila, Rev. Juozas | Rev | Montvila |
| 887 | Graham, Miss. Margaret Edith | Miss | Graham |
| 888 | Johnston, Miss. Catherine Helen "Carrie" | Miss | Johnston |
| 889 | Behr, Mr. Karl Howell | Mr | Behr |
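
The same split can also be done without `apply`, using a single regular expression. This is a sketch that assumes every name follows the "Last, Title. First" pattern seen in the rows above; the group names `LastName` and `Title` are chosen here for illustration.

```python
import pandas as pd

# Three names copied from the output above.
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Montvila, Rev. Juozas",
], name="Name")

# Last name: everything before the comma.
# Title: everything between the comma and the first period.
parts = names.str.extract(r"^(?P<LastName>[^,]+),\s*(?P<Title>[^.]+)\.")
print(parts)
```

Rows that do not match the pattern come back as NaN, which makes unusual name formats easy to spot.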


Feature creation:
    
Create a new feature called 'FamilySize' by summing the 'SibSp' and 'Parch' columns.

Create a new feature called 'IsAlone' to indicate whether a passenger is traveling alone or with family.


In [26]:
df['FamilySize'] = df['SibSp']+df['Parch']
df['IsAlone'] = df['FamilySize'].apply(lambda size: 'No' if size >= 1 else 'Yes')
df[['Name','SibSp','Parch','FamilySize','IsAlone']]

|  | Name | SibSp | Parch | FamilySize | IsAlone |
|---|---|---|---|---|---|
| 0 | Braund, Mr. Owen Harris | 1 | 0 | 1 | No |
| 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 1 | 0 | 1 | No |
| 2 | Heikkinen, Miss. Laina | 0 | 0 | 0 | Yes |
| 3 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 1 | 0 | 1 | No |
| 4 | Allen, Mr. William Henry | 0 | 0 | 0 | Yes |
| ... | ... | ... | ... | ... | ... |
| 886 | Montvila, Rev. Juozas | 0 | 0 | 0 | Yes |
| 887 | Graham, Miss. Margaret Edith | 0 | 0 | 0 | Yes |
| 888 | Johnston, Miss. Catherine Helen "Carrie" | 1 | 2 | 3 | No |
| 889 | Behr, Mr. Karl Howell | 0 | 0 | 0 | Yes |
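
Note that 'FamilySize' as defined here counts relatives only and excludes the passenger, so 0 means travelling alone. A vectorized variant of the same logic, sketched on a toy frame with illustrative values:

```python
import numpy as np
import pandas as pd

# Toy rows with the dataset's column names (values are illustrative).
toy = pd.DataFrame({"SibSp": [1, 0, 1], "Parch": [0, 0, 2]})

# Relatives aboard, not counting the passenger.
toy["FamilySize"] = toy["SibSp"] + toy["Parch"]

# np.where evaluates the whole column at once instead of row by row.
toy["IsAlone"] = np.where(toy["FamilySize"] == 0, "Yes", "No")
print(toy)
```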


Feature transformation:
    
Encode categorical features (e.g., 'Sex', 'Embarked') using appropriate techniques (e.g., one-hot encoding, label encoding).
Name the top 5 categorical encoding techniques, list the major differences between them, and describe the scenarios each one suits best.

In [27]:
saved_econders={}

In [28]:
saved_econders={}
df.dropna(subset=['Sex', 'Embarked'],inplace=True)
features = df[['Sex', 'Embarked']]

def OneHotEncoderFunction(features_to_encode): 
    encoder = OneHotEncoder(sparse_output=False)
    encoder.fit(features_to_encode)
    saved_econders['OneHotEncoder_'+'_'.join(features_to_encode.columns)] = encoder

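def LabelEncoderFunction(features_to_encode):
    # LabelEncoder works on one 1-D column at a time, so fit one encoder per column.
    # Note: this function is defined but is not included in encoder_functions below.
    for col in features_to_encode.columns:
        encoder = LabelEncoder()
        encoder.fit(features_to_encode[col])
        saved_econders['LabelEncoder_'+ str(col)] = encoder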

def OrdinalEncoderFunction(features_to_encode):
    categories = []
    for col in features_to_encode.columns:
        categories.append(sorted(list(features_to_encode[col].unique())))
    encoder = OrdinalEncoder(categories=categories)
    encoder.fit(features_to_encode)
    saved_econders['OrdinalEncoder_'+'_'.join(features_to_encode.columns)] = encoder

def BinaryEncoderFunction(features_to_encode):
    encoder = BinaryEncoder(cols=features_to_encode.columns)
    encoder.fit(features_to_encode)
    saved_econders['BinaryEncoder_'+'_'.join(features_to_encode.columns)] = encoder

def CountEncoderFunction(features_to_encode):
    encoder = CountEncoder(cols=features_to_encode.columns)
    encoder.fit(features_to_encode)
    saved_econders['CountEncoder_'+'_'.join(features_to_encode.columns)] = encoder

encoder_functions = [OneHotEncoderFunction,OrdinalEncoderFunction,BinaryEncoderFunction,CountEncoderFunction]

for fun in encoder_functions:
    fun(features)

saved_econders

{'OneHotEncoder_Sex_Embarked': OneHotEncoder(sparse_output=False),
 'OrdinalEncoder_Sex_Embarked': OrdinalEncoder(categories=[['female', 'male'], ['C', 'Q', 'S']]),
 'BinaryEncoder_Sex_Embarked': BinaryEncoder(cols=Index(['Sex', 'Embarked'], dtype='object'),
               mapping=[{'col': 'Sex',
                         'mapping':     Sex_0  Sex_1
  1      0      1
  2      1      0
 -1      0      0
 -2      0      0},
                        {'col': 'Embarked',
                         'mapping':     Embarked_0  Embarked_1
  1           0           1
  2           1           0
  3           1           1
 -1           0           0
 -2           0           0}]),
 'CountEncoder_Sex_Embarked': CountEncoder(cols=Index(['Sex', 'Embarked'], dtype='object'),
              combine_min_nan_groups=True)}

In [29]:
encoded_dfs = {}

for encoder_name, encoder in saved_econders.items():
    if 'OneHotEncoder' in encoder_name:
        transformed_data = encoder.transform(features)
        columns = encoder.get_feature_names_out(features.columns)
        encoded_df = pd.DataFrame(transformed_data, columns=columns)
        
    elif 'LabelEncoder' in encoder_name:
        column = encoder_name.split('_')[-1]
        transformed_data = encoder.transform(features[column])
        encoded_df = pd.DataFrame(transformed_data, columns=[column + '_encoded'])

    elif 'OrdinalEncoder' in encoder_name:
        transformed_data = encoder.transform(features)
        columns = features.columns + '_ordinal'
        encoded_df = pd.DataFrame(transformed_data, columns=columns)

    elif 'BinaryEncoder' in encoder_name:
        transformed_data = encoder.transform(features)
        columns = transformed_data.columns
        encoded_df = pd.DataFrame(transformed_data, columns=columns)

    elif 'CountEncoder' in encoder_name:
        transformed_data = encoder.transform(features)
        columns = transformed_data.columns
        encoded_df = pd.DataFrame(transformed_data, columns=columns)

    # Storing the DataFrame in a dictionary for later use
    encoded_dfs[encoder_name] = encoded_df

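# Display each encoded DataFrame; the loop variable is deliberately not named df,
# so the main DataFrame is not overwritten
for encoder_name, encoded_df in encoded_dfs.items():
    print(f"\n{encoder_name}:\n", encoded_df)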


OneHotEncoder_Sex_Embarked:
      Sex_female  Sex_male  Embarked_C  Embarked_Q  Embarked_S
0           0.0       1.0         0.0         0.0         1.0
1           1.0       0.0         1.0         0.0         0.0
2           1.0       0.0         0.0         0.0         1.0
3           1.0       0.0         0.0         0.0         1.0
4           0.0       1.0         0.0         0.0         1.0
..          ...       ...         ...         ...         ...
884         0.0       1.0         0.0         0.0         1.0
885         1.0       0.0         0.0         0.0         1.0
886         1.0       0.0         0.0         0.0         1.0
887         0.0       1.0         1.0         0.0         0.0
888         0.0       1.0         0.0         1.0         0.0

[889 rows x 5 columns]

OrdinalEncoder_Sex_Embarked:
      Sex_ordinal  Embarked_ordinal
0            1.0               2.0
1            0.0               0.0
2            0.0               2.0
3            0.0               

### Top 5 Categorical Encoding Techniques

1. 'OneHotEncoder' 
2. 'LabelEncoder' 
3. 'OrdinalEncoder' 
4. 'BinaryEncoder' 
5. 'CountEncoder'

Major differences between them, with the most suitable scenarios for each:

1. 'OneHotEncoder'
   - Converts each unique category into a separate binary (0/1) column.
   - Imposes no order on the categories and represents every category explicitly.
   - Can increase dimensionality sharply when a feature has many unique categories, producing sparse matrices.
   - Suitable scenarios: nominal data without an inherent order (e.g., colors, gender), especially with models that treat inputs as numeric magnitudes and would otherwise read integer codes as an ordering (e.g., linear regression, neural networks).

2. 'LabelEncoder'
   - Encodes categories as integers from 0 to n-1, assigned in sorted order of the labels.
   - Simple and efficient, but the integer codes imply an order that nominal data does not have, and the sorted order need not match the real ranking of ordinal data.
   - In scikit-learn it is intended for encoding the target variable rather than input features.
   - Suitable scenarios: encoding class labels (the target), or quick integer codes for tree-based models (e.g., decision trees, random forests), which split on thresholds and are less sensitive to an arbitrary order.

3. 'OrdinalEncoder'
   - Encodes categories as integers based on a specified order.
   - Preserves that order, which suits models that can use ordinal information.
   - Requires the category order to be specified; without one, the categories are taken in sorted order, which may not reflect any real ranking.
   - Suitable scenarios: ordinal data where categories have a clear hierarchy or ranking (e.g., education levels, satisfaction ratings).
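
To make the difference between 'LabelEncoder' and 'OrdinalEncoder' concrete, here is a small sketch on a made-up ordinal feature (the `low`/`medium`/`high` levels are an assumption for illustration, not a column in this dataset).

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical ordinal levels, intended ranking: low < medium < high.
levels = np.array(["low", "high", "medium", "low"])

# LabelEncoder sorts its classes, so the codes follow alphabetical order:
# high -> 0, low -> 1, medium -> 2.
label_codes = LabelEncoder().fit_transform(levels)

# OrdinalEncoder with an explicit category list keeps the intended ranking:
# low -> 0, medium -> 1, high -> 2.
ordinal = OrdinalEncoder(categories=[["low", "medium", "high"]])
ordinal_codes = ordinal.fit_transform(levels.reshape(-1, 1)).ravel()

print(label_codes)
print(ordinal_codes)
```

Only the second encoding keeps "medium" between "low" and "high".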

4. 'BinaryEncoder'
   - Assigns each category an integer code and writes that code in binary, one column per bit.
   - Needs roughly log2(n) columns for n categories instead of n, so it is lower-dimensional and less sparse than OneHotEncoder for high-cardinality data.
   - Harder to interpret than OneHotEncoder. It is meant for categories with no inherent order, although unrelated categories end up sharing bit columns.
   - Suitable scenarios: nominal data with many categories, and large datasets where one-hot encoding would be too wide and sparse.
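
The idea behind binary encoding can be sketched by hand with NumPy. This is a simplified illustration of the technique, not the `category_encoders` implementation; the bit patterns it produces follow the same 1 -> 01, 2 -> 10, 3 -> 11 scheme visible in the 'Embarked' mapping printed earlier.

```python
import numpy as np
import pandas as pd

# Toy column (illustrative values).
ports = pd.Series(["S", "C", "S", "Q"], name="Embarked")

# Step 1: integer codes starting at 1, in order of first appearance.
codes = pd.factorize(ports)[0] + 1  # S -> 1, C -> 2, Q -> 3

# Step 2: write each code in binary, most significant bit first.
n_bits = int(codes.max()).bit_length()
shifts = np.arange(n_bits - 1, -1, -1)
bits = (codes[:, None] >> shifts) & 1

binary = pd.DataFrame(bits, columns=[f"Embarked_{i}" for i in range(n_bits)])
print(binary)
```

Three categories fit in two columns here, where one-hot encoding would need three.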

5. 'CountEncoder'
   - Replaces each category with its frequency count in the data the encoder was fitted on.
   - Keeps a single column per feature and captures how common each category is.
   - Loses information when different categories share the same count, and implicitly treats frequency as a meaningful signal.
   - Suitable scenarios: high-cardinality categorical data, particularly with tree-based algorithms that handle numerical features well (e.g., decision trees, random forests).
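
Count encoding is easy to reproduce with plain pandas, which also shows its main weakness. A minimal sketch on made-up values:

```python
import pandas as pd

# Toy column (illustrative values).
ports = pd.Series(["S", "C", "S", "Q", "C"], name="Embarked")

# Frequency of each category in the data.
counts = ports.value_counts()

# Replace every category with its frequency.
encoded = ports.map(counts)
print(encoded.tolist())
```

Here "S" and "C" both occur twice, so they receive the same code and can no longer be told apart.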
