## DATA PREPROCESSING AND FEATURE ENGINEERING IN MACHINE LEARNING

### Objective:
This assignment aims to equip you with practical skills in data preprocessing, feature engineering, and feature selection, which are crucial for building efficient machine learning models. You will work with a provided dataset and apply techniques such as scaling and encoding, along with feature selection methods including Isolation Forest and Predictive Power Score (PPS) analysis.

#### Dataset:
The provided "Adult" dataset is used to predict whether a person's income exceeds $50K/yr based on census data.


### Tasks:

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv("adult_with_headers.csv")

In [3]:
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


##### 1.	Handle missing values as per the best practices (imputation, removal, etc.).

In [4]:
df.isnull().sum()

age               0
workclass         0
fnlwgt            0
education         0
education_num     0
marital_status    0
occupation        0
relationship      0
race              0
sex               0
capital_gain      0
capital_loss      0
hours_per_week    0
native_country    0
income            0
dtype: int64
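
The zero counts above only cover values that pandas parses as NaN. Many copies of the Adult data mark unknown entries with a "?" string instead, and such entries would not appear in `isnull()`. Whether `adult_with_headers.csv` does this is an assumption worth checking. The sketch below uses a small hypothetical frame (`toy`) rather than the real file, and shows one way to convert the placeholder to NaN and impute with the most frequent value.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real data. The " ?" marker (with a leading
# space) is an assumption about how the CSV encodes unknown values.
toy = pd.DataFrame({
    "workclass": [" Private", " ?", " State-gov", " Private"],
    "occupation": [" Sales", " ?", " ?", " Tech-support"],
    "age": [39, 50, 38, 53],
})
text_cols = ["workclass", "occupation"]

# Strip surrounding whitespace, then turn the "?" placeholder into NaN.
toy[text_cols] = toy[text_cols].apply(lambda s: s.str.strip())
toy[text_cols] = toy[text_cols].replace("?", np.nan)

# Now the placeholder rows are counted as missing.
missing_counts = toy.isnull().sum()
print(missing_counts)

# Impute each categorical column with its most frequent value (the mode).
for col in text_cols:
    toy[col] = toy[col].fillna(toy[col].mode()[0])
```

Mode imputation keeps every row. Dropping the affected rows is the simpler alternative when only a small share of the data is missing.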

##### ●	Apply scaling techniques to numerical features:
a.	Standard Scaling
b.	Min-Max Scaling
●	Discuss the scenarios where each scaling technique is preferred and why.

In [5]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler

In [6]:
df.columns

Index(['age', 'workclass', 'fnlwgt', 'education', 'education_num',
       'marital_status', 'occupation', 'relationship', 'race', 'sex',
       'capital_gain', 'capital_loss', 'hours_per_week', 'native_country',
       'income'],
      dtype='object')

In [15]:
num_features = df.select_dtypes(include=["number"]).columns
num_features

Index(['age', 'fnlwgt', 'education_num', 'capital_gain', 'capital_loss',
       'hours_per_week'],
      dtype='object')

In [11]:
scaler_standard = StandardScaler()
scaler_standard

In [12]:
df_standard_scaled = df.copy()

In [16]:
df_standard_scaled[num_features] = scaler_standard.fit_transform(df[num_features])
df_standard_scaled

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,income
0,0.030671,State-gov,-1.063611,Bachelors,1.134739,Never-married,Adm-clerical,Not-in-family,White,Male,0.148453,-0.21666,-0.035429,United-States,<=50K
1,0.837109,Self-emp-not-inc,-1.008707,Bachelors,1.134739,Married-civ-spouse,Exec-managerial,Husband,White,Male,-0.145920,-0.21666,-2.222153,United-States,<=50K
2,-0.042642,Private,0.245079,HS-grad,-0.420060,Divorced,Handlers-cleaners,Not-in-family,White,Male,-0.145920,-0.21666,-0.035429,United-States,<=50K
3,1.057047,Private,0.425801,11th,-1.197459,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,-0.145920,-0.21666,-0.035429,United-States,<=50K
4,-0.775768,Private,1.408176,Bachelors,1.134739,Married-civ-spouse,Prof-specialty,Wife,Black,Female,-0.145920,-0.21666,-0.035429,Cuba,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,-0.849080,Private,0.639741,Assoc-acdm,0.746039,Married-civ-spouse,Tech-support,Wife,White,Female,-0.145920,-0.21666,-0.197409,United-States,<=50K
32557,0.103983,Private,-0.335433,HS-grad,-0.420060,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,-0.145920,-0.21666,-0.035429,United-States,>50K
32558,1.423610,Private,-0.358777,HS-grad,-0.420060,Widowed,Adm-clerical,Unmarried,White,Female,-0.145920,-0.21666,-0.035429,United-States,<=50K
32559,-1.215643,Private,0.110960,HS-grad,-0.420060,Never-married,Adm-clerical,Own-child,White,Male,-0.145920,-0.21666,-1.655225,United-States,<=50K


In [19]:
scaler_minmax = MinMaxScaler()
scaler_minmax

In [20]:
df_minmax_scaled = df.copy()

In [21]:
df_minmax_scaled[num_features] = scaler_minmax.fit_transform(df[num_features])
df_minmax_scaled

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,income
0,0.301370,State-gov,0.044302,Bachelors,0.800000,Never-married,Adm-clerical,Not-in-family,White,Male,0.021740,0.0,0.397959,United-States,<=50K
1,0.452055,Self-emp-not-inc,0.048238,Bachelors,0.800000,Married-civ-spouse,Exec-managerial,Husband,White,Male,0.000000,0.0,0.122449,United-States,<=50K
2,0.287671,Private,0.138113,HS-grad,0.533333,Divorced,Handlers-cleaners,Not-in-family,White,Male,0.000000,0.0,0.397959,United-States,<=50K
3,0.493151,Private,0.151068,11th,0.400000,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0.000000,0.0,0.397959,United-States,<=50K
4,0.150685,Private,0.221488,Bachelors,0.800000,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0.000000,0.0,0.397959,Cuba,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,0.136986,Private,0.166404,Assoc-acdm,0.733333,Married-civ-spouse,Tech-support,Wife,White,Female,0.000000,0.0,0.377551,United-States,<=50K
32557,0.315068,Private,0.096500,HS-grad,0.533333,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0.000000,0.0,0.397959,United-States,>50K
32558,0.561644,Private,0.094827,HS-grad,0.533333,Widowed,Adm-clerical,Unmarried,White,Female,0.000000,0.0,0.397959,United-States,<=50K
32559,0.068493,Private,0.128499,HS-grad,0.533333,Never-married,Adm-clerical,Own-child,White,Male,0.000000,0.0,0.193878,United-States,<=50K
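
On when each scaler is preferred: standard scaling is usually the safer default for models that work best with centered features of comparable variance (linear models, SVMs, PCA). Min-max scaling suits cases where a bounded range such as [0, 1] is required. Both are affected by outliers, and min-max scaling more visibly so, because a single extreme value sets the maximum. The sketch below illustrates this on a made-up column; the numbers are illustrative and not taken from the dataset.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy column: mostly small values plus one extreme value, loosely mimicking
# a skewed feature. Illustrative numbers only.
values = np.array([[0.0], [0.0], [10.0], [20.0], [1000.0]])

standard = StandardScaler().fit_transform(values)
minmax = MinMaxScaler().fit_transform(values)

# Standard scaling: mean 0 and unit variance; the output range is unbounded.
print(standard.ravel())

# Min-max scaling: output bounded to [0, 1]; the extreme value becomes 1 and
# squeezes the remaining values towards 0.
print(minmax.ravel())
```

The same squeezing is visible in the `capital_gain` column of the min-max output above, where most rows are 0 or very close to it.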


Data Exploration and Preprocessing:
●	Load the dataset and conduct basic data exploration (summary statistics, missing values, data types).

#### 2. Encoding Techniques:
●	Apply One-Hot Encoding to categorical variables with fewer than 5 categories.
●	Use Label Encoding for categorical variables with more than 5 categories.
●	Discuss the pros and cons of One-Hot Encoding and Label Encoding.
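
As a rough illustration of the trade-off: one-hot encoding implies no order between categories but adds one column per category, while label encoding keeps a single column but introduces an arbitrary numeric order that some models may treat as meaningful. The sketch below splits columns by their number of distinct values on a hypothetical frame (`toy`). The task leaves the case of exactly 5 categories open; sending it to label encoding is an assumption made here, and pandas category codes are used as a stand-in for `LabelEncoder`.

```python
import pandas as pd

# Hypothetical frame with one low-cardinality and one high-cardinality column.
toy = pd.DataFrame({
    "sex": ["Male", "Female", "Male", "Female", "Male", "Female"],
    "occupation": ["Sales", "Tech-support", "Adm-clerical",
                   "Exec-managerial", "Prof-specialty", "Craft-repair"],
})

# Split columns by number of distinct categories (threshold of 5 assumed).
low_card = [c for c in toy.columns if toy[c].nunique() < 5]
high_card = [c for c in toy.columns if toy[c].nunique() >= 5]

# One-hot: one indicator column per category, no implied order.
encoded = pd.get_dummies(toy, columns=low_card)

# Label-style encoding: one integer per category, assigned in sorted order.
for col in high_card:
    encoded[col] = encoded[col].astype("category").cat.codes

print(encoded)
```

Tree-based models tend to tolerate the arbitrary order of integer codes better than linear or distance-based models, which is one reason the choice of encoding often depends on the downstream model.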


In [33]:
df_onehot = pd.get_dummies(df, columns=["sex", "race"], drop_first=True)
df_onehot.head()

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,capital_gain,capital_loss,hours_per_week,native_country,income,capital_balance,sex_ Male,race_ Asian-Pac-Islander,race_ Black,race_ Other,race_ White
0,39,7,77516,Bachelors,13,Never-married,1,Not-in-family,2174,0,40,39,<=50K,2174,True,False,False,False,True
1,50,6,83311,Bachelors,13,Married-civ-spouse,4,Husband,0,0,13,39,<=50K,0,True,False,False,False,True
2,38,4,215646,HS-grad,9,Divorced,6,Not-in-family,0,0,40,39,<=50K,0,True,False,False,False,True
3,53,4,234721,11th,7,Married-civ-spouse,6,Husband,0,0,40,39,<=50K,0,True,False,True,False,False
4,28,4,338409,Bachelors,13,Married-civ-spouse,10,Wife,0,0,40,5,<=50K,0,False,False,True,False,False


In [23]:
from sklearn.preprocessing import LabelEncoder

In [24]:
label_enc = LabelEncoder()
label_enc

In [32]:
df["workclass"] = label_enc.fit_transform(df["workclass"])
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,income,capital_balance
0,39,7,77516,Bachelors,13,Never-married,1,Not-in-family,White,Male,2174,0,40,39,<=50K,2174
1,50,6,83311,Bachelors,13,Married-civ-spouse,4,Husband,White,Male,0,0,13,39,<=50K,0
2,38,4,215646,HS-grad,9,Divorced,6,Not-in-family,White,Male,0,0,40,39,<=50K,0
3,53,4,234721,11th,7,Married-civ-spouse,6,Husband,Black,Male,0,0,40,39,<=50K,0
4,28,4,338409,Bachelors,13,Married-civ-spouse,10,Wife,Black,Female,0,0,40,5,<=50K,0


In [31]:
df["occupation"] = label_enc.fit_transform(df["occupation"])
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,income,capital_balance
0,39,7,77516,Bachelors,13,Never-married,1,Not-in-family,White,Male,2174,0,40,39,<=50K,2174
1,50,6,83311,Bachelors,13,Married-civ-spouse,4,Husband,White,Male,0,0,13,39,<=50K,0
2,38,4,215646,HS-grad,9,Divorced,6,Not-in-family,White,Male,0,0,40,39,<=50K,0
3,53,4,234721,11th,7,Married-civ-spouse,6,Husband,Black,Male,0,0,40,39,<=50K,0
4,28,4,338409,Bachelors,13,Married-civ-spouse,10,Wife,Black,Female,0,0,40,5,<=50K,0


In [30]:
df["native_country"] = label_enc.fit_transform(df["native_country"])
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,income,capital_balance
0,39,7,77516,Bachelors,13,Never-married,1,Not-in-family,White,Male,2174,0,40,39,<=50K,2174
1,50,6,83311,Bachelors,13,Married-civ-spouse,4,Husband,White,Male,0,0,13,39,<=50K,0
2,38,4,215646,HS-grad,9,Divorced,6,Not-in-family,White,Male,0,0,40,39,<=50K,0
3,53,4,234721,11th,7,Married-civ-spouse,6,Husband,Black,Male,0,0,40,39,<=50K,0
4,28,4,338409,Bachelors,13,Married-civ-spouse,10,Wife,Black,Female,0,0,40,5,<=50K,0
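
The cells above reuse a single `label_enc` object, so each `fit_transform` call overwrites the mapping learned for the previous column, and only the last column could be decoded later. A minimal sketch of one alternative is to keep one encoder per column. The `encoders` dictionary and the `toy` frame are hypothetical names introduced for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical miniature of the categorical columns.
toy = pd.DataFrame({
    "workclass": ["Private", "State-gov", "Private"],
    "occupation": ["Sales", "Adm-clerical", "Sales"],
})

# Fit a separate encoder for each column and keep it for later decoding.
encoders = {}
for col in ["workclass", "occupation"]:
    enc = LabelEncoder()
    toy[col] = enc.fit_transform(toy[col])
    encoders[col] = enc

# Each stored encoder can map its integer codes back to the original labels.
original_workclass = encoders["workclass"].inverse_transform(toy["workclass"])
print(list(original_workclass))
```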


#### 3. Feature Engineering:
●	Create at least 2 new features that could be beneficial for the model. Explain the rationale behind your choices.
●	Apply a transformation (e.g., log transformation) to at least one skewed numerical feature and justify your choice.


In [29]:
df["capital_balance"] = df["capital_gain"] - df["capital_loss"]
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,income,capital_balance
0,39,7,77516,Bachelors,13,Never-married,1,Not-in-family,White,Male,2174,0,40,39,<=50K,2174
1,50,6,83311,Bachelors,13,Married-civ-spouse,4,Husband,White,Male,0,0,13,39,<=50K,0
2,38,4,215646,HS-grad,9,Divorced,6,Not-in-family,White,Male,0,0,40,39,<=50K,0
3,53,4,234721,11th,7,Married-civ-spouse,6,Husband,Black,Male,0,0,40,39,<=50K,0
4,28,4,338409,Bachelors,13,Married-civ-spouse,10,Wife,Black,Female,0,0,40,5,<=50K,0


In [36]:
df["age_group"] = pd.cut(df["age"], bins=[16,30,50,80,100],
                         labels=["Young","Middle","Senior","Elder"])
df[["age","age_group"]]

Unnamed: 0,age,age_group
0,39,Middle
1,50,Middle
2,38,Middle
3,53,Senior
4,28,Young
...,...,...
32556,27,Young
32557,40,Middle
32558,58,Senior
32559,22,Young
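
For the transformation task, `capital_gain` is a natural candidate: in the rows shown it is 0 for most people, with occasional large values, which suggests a strong right skew. A minimal sketch using `np.log1p` (log of 1 + x, which is defined at 0) follows. The values in `toy` are illustrative, and the `capital_gain_log` column name is hypothetical.

```python
import numpy as np
import pandas as pd

# Illustrative values: mostly zeros with a few large gains.
toy = pd.DataFrame({"capital_gain": [0, 0, 0, 2174, 99999]})

# log1p compresses large values while leaving zeros at zero.
toy["capital_gain_log"] = np.log1p(toy["capital_gain"])

# Compare skewness before and after the transformation.
skew_before = toy["capital_gain"].skew()
skew_after = toy["capital_gain_log"].skew()
print(skew_before, skew_after)
```

Note that `capital_balance` can be negative whenever losses exceed gains, so `log1p` would not apply to it directly.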
