### 1. Data Exploration and Preprocessing

In [1]:
import pandas as pd

#### Load the Dataset and Conduct Basic Data Exploration

In [2]:
# Load the dataset
df = pd.read_csv('adult_with_headers.csv')

In [3]:
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [4]:
# Basic exploration
df.info()
df.describe(include='all')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             32561 non-null  int64 
 1   workclass       32561 non-null  object
 2   fnlwgt          32561 non-null  int64 
 3   education       32561 non-null  object
 4   education_num   32561 non-null  int64 
 5   marital_status  32561 non-null  object
 6   occupation      32561 non-null  object
 7   relationship    32561 non-null  object
 8   race            32561 non-null  object
 9   sex             32561 non-null  object
 10  capital_gain    32561 non-null  int64 
 11  capital_loss    32561 non-null  int64 
 12  hours_per_week  32561 non-null  int64 
 13  native_country  32561 non-null  object
 14  income          32561 non-null  object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB


Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,income
count,32561.0,32561,32561.0,32561,32561.0,32561,32561,32561,32561,32561,32561.0,32561.0,32561.0,32561,32561
unique,,9,,16,,7,15,6,5,2,,,,42,2
top,,Private,,HS-grad,,Married-civ-spouse,Prof-specialty,Husband,White,Male,,,,United-States,<=50K
freq,,22696,,10501,,14976,4140,13193,27816,21790,,,,29170,24720
mean,38.581647,,189778.4,,10.080679,,,,,,1077.648844,87.30383,40.437456,,
std,13.640433,,105550.0,,2.57272,,,,,,7385.292085,402.960219,12.347429,,
min,17.0,,12285.0,,1.0,,,,,,0.0,0.0,1.0,,
25%,28.0,,117827.0,,9.0,,,,,,0.0,0.0,40.0,,
50%,37.0,,178356.0,,10.0,,,,,,0.0,0.0,40.0,,
75%,48.0,,237051.0,,12.0,,,,,,0.0,0.0,45.0,,


#### Handle Missing Values

In [5]:
# Check for missing values
missing_values = df.isnull().sum()
print("Missing values:\n", missing_values)

Missing values:
 age               0
workclass         0
fnlwgt            0
education         0
education_num     0
marital_status    0
occupation        0
relationship      0
race              0
sex               0
capital_gain      0
capital_loss      0
hours_per_week    0
native_country    0
income            0
dtype: int64


In [6]:
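# isnull() found no missing values above, so dropna() removes no rows here;
# it is kept as a safeguard in case the source file changes.
# Note: placeholder strings such as '?' are not detected by isnull().
df.dropna(inplace=True)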
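
Because `isnull()` reported no missing values above, a quick check for placeholder strings is a reasonable next step. Some copies of the Adult dataset mark missing entries with `?`. Whether `adult_with_headers.csv` does so is an assumption to verify. The sketch below uses a made-up toy frame, not the real data:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data; the values are illustrative only.
toy = pd.DataFrame({
    "workclass": ["Private", " ?", "State-gov", "?"],
    "age": [39, 50, 38, 53],
})

# Hypothetical list of string columns to inspect.
cat_cols = ["workclass"]

# Strip stray whitespace so ' ?' and '?' are treated alike.
for col in cat_cols:
    toy[col] = toy[col].str.strip()

# Count '?' placeholders per column.
placeholder_counts = (toy[cat_cols] == "?").sum()

# Convert placeholders to NaN so isnull() and dropna() can see them.
toy_clean = toy.copy()
toy_clean[cat_cols] = toy_clean[cat_cols].replace("?", np.nan)
missing_after = toy_clean.isnull().sum()

print(placeholder_counts)
print(missing_after)
```

If the real file contains such placeholders, the same pattern could be applied to `df` before calling `dropna()` or imputing.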

#### Apply Scaling Techniques to Numerical Features

In [7]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler

In [10]:
# Select numerical features
numerical_features = ['age', 'fnlwgt', 'education_num', 'capital_gain', 'capital_loss', 'hours_per_week']

In [11]:
# Standard Scaling
scaler_standard = StandardScaler()
df_standard_scaled = df.copy()
df_standard_scaled[numerical_features] = scaler_standard.fit_transform(df_standard_scaled[numerical_features])

In [12]:
# Min-Max Scaling
scaler_minmax = MinMaxScaler()
df_minmax_scaled = df.copy()
df_minmax_scaled[numerical_features] = scaler_minmax.fit_transform(df_minmax_scaled[numerical_features])

#### Discussion on Scaling Techniques:

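Standard Scaling: Rescales each feature to a mean of 0 and a standard deviation of 1. It suits roughly Gaussian features and models that expect centered inputs, such as linear models, SVMs, and PCA. The output range is not bounded.

Min-Max Scaling: Rescales each feature to a fixed range, typically 0 to 1. It is useful when a bounded range is required. Like standard scaling, it is a linear transformation, so it preserves the shape of the original distribution. It is sensitive to outliers: a single extreme value sets the top of the range and compresses the remaining values toward 0. This matters for capital_gain, whose 75th percentile is 0 while its standard deviation is about 7,385.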
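
The effect of an extreme value on the two scalers can be seen on a small example. This is a minimal sketch with made-up numbers that loosely mimic a mostly-zero column with one large value. They are not taken from the dataset.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Illustrative values: mostly zeros plus one very large value.
x = np.array([[0.0], [0.0], [0.0], [0.0], [2000.0], [99999.0]])

x_std = StandardScaler().fit_transform(x)
x_mm = MinMaxScaler().fit_transform(x)

# StandardScaler: mean 0 and (population) standard deviation 1.
print(x_std.mean(), x_std.std())

# MinMaxScaler: minimum 0 and maximum 1. The single large value pins the
# top of the range, so every other point is squeezed close to 0.
print(x_mm.ravel())

# Both scalers are linear transformations, so the two results are
# perfectly correlated with each other.
linear_agreement = np.corrcoef(x_std.ravel(), x_mm.ravel())[0, 1]
print(linear_agreement)
```

Neither scaler changes the skew of the data. Reducing skew requires a non-linear transformation such as a logarithm.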

### 2. Encoding Techniques

In [13]:
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

In [14]:
# One-Hot Encoding for categorical variables with fewer than 5 categories
one_hot_columns = ['sex', 'income']
df_one_hot_encoded = pd.get_dummies(df, columns=one_hot_columns)

In [16]:
# Label Encoding for categorical variables with 5 or more categories
label_columns = ['workclass', 'education', 'marital_status', 'occupation', 'relationship', 'race', 'native_country']
label_encoders = {}
df_label_encoded = df.copy()

In [17]:
for col in label_columns:
    le = LabelEncoder()
    df_label_encoded[col] = le.fit_transform(df_label_encoded[col])
    label_encoders[col] = le

#### Discussion on Encoding Techniques:

One-Hot Encoding: Useful for categorical variables with a small number of categories. It avoids the assumption of ordinal relationships between categories. However, it can lead to high-dimensional data when there are many categories.

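Label Encoding: Useful for categorical variables with many categories because it keeps a single column per feature. scikit-learn's LabelEncoder assigns integers in sorted (alphabetical) order of the category names, so it does not preserve a meaningful order such as education level. The integer codes can also imply ordinal relationships where none exist, which can mislead linear and distance-based models. Tree-based models are less affected.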
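
The difference between the encoders can be sketched on a few education levels. The three-level subset and its ordering are chosen here for illustration. `OrdinalEncoder` with an explicit category list is shown as one way to keep an intended order and is not part of the notebook above.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Three education levels in their natural order (illustrative subset).
levels = ["HS-grad", "Bachelors", "Masters"]
edu = pd.DataFrame({"education": ["Masters", "HS-grad", "Bachelors", "HS-grad"]})

# LabelEncoder sorts classes alphabetically:
# Bachelors=0, HS-grad=1, Masters=2.
le = LabelEncoder()
label_codes = le.fit_transform(edu["education"])

# OrdinalEncoder with an explicit category list keeps the intended order:
# HS-grad=0, Bachelors=1, Masters=2.
oe = OrdinalEncoder(categories=[levels])
ordinal_codes = oe.fit_transform(edu[["education"]]).ravel()

# One-hot encoding implies no order, at the cost of one column per category.
one_hot = pd.get_dummies(edu, columns=["education"])

print(list(le.classes_))
print(label_codes, ordinal_codes)
print(one_hot.columns.tolist())
```

In the dataset itself, `education_num` already provides an ordered numeric version of `education`.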

### 3. Feature Engineering

#### Create New Features

In [20]:
# Create age-group feature
df['age_group'] = pd.cut(df['age'], bins=[0, 18, 30, 50, 100], labels=['Child', 'Young Adult', 'Adult', 'Senior'])

In [21]:
# Create hours-per-week-binned feature
df['hours_per_week-binned'] = pd.cut(df['hours_per_week'], bins=[0, 25, 40, 60, 100], labels=['Part_time', 'Full_time', 'Overtime', 'Extreme'])

# Rationale: Age groups and work-hour categories let a model capture threshold-like, non-linear effects of age and working hours on income that the raw numeric values express only indirectly.
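
How `pd.cut` treats values on a bin edge affects which group a row lands in. A minimal sketch using the same bins and labels as the age-group cell, applied to a few hand-picked ages:

```python
import pandas as pd

# Hand-picked ages that sit on or next to the bin edges.
ages = pd.Series([17, 18, 19, 30, 31, 50, 51])

age_group = pd.cut(
    ages,
    bins=[0, 18, 30, 50, 100],
    labels=["Child", "Young Adult", "Adult", "Senior"],
)

# By default pd.cut uses right-closed intervals such as (0, 18] and (18, 30],
# so 18 falls in 'Child', 30 in 'Young Adult', and 50 in 'Adult'.
print(age_group.tolist())
```

Values outside the outermost edges (here, at or below 0 or above 100) become NaN, so the bin range should cover the column's minimum and maximum.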

#### Apply Transformation to a Skewed Numerical Feature

In [22]:
import numpy as np

In [23]:
# Log transformation
df['capital_gain_log'] = np.log1p(df['capital_gain'])

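# Justification: capital_gain is heavily right-skewed (its 75th percentile is 0 while its mean is about 1,078). log1p computes log(1 + x), which is defined at 0 and compresses the long right tail, reducing the skew. The spike at 0 remains after the transformation.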
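
The effect of `log1p` on skew can be checked numerically. The sketch below uses synthetic zero-inflated data as a stand-in. It is not the real `capital_gain` column, and the distribution parameters are arbitrary.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed values: about 90% zeros, the rest log-normal.
rng = np.random.default_rng(0)
is_zero = rng.random(1000) < 0.9
gain = pd.Series(np.where(is_zero, 0.0, rng.lognormal(8, 1, 1000)))

# log(1 + x) is defined at x = 0, unlike a plain logarithm.
gain_log = np.log1p(gain)

print("skew before:", gain.skew())
print("skew after: ", gain_log.skew())

# np.expm1 inverts np.log1p, so the original scale can be recovered.
recovered = np.expm1(gain_log)
```

The skew drops but does not vanish, because the large share of zeros is unchanged by the transformation.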

### 4. Feature Selection

#### Use Isolation Forest to Identify and Remove Outliers

In [24]:
from sklearn.ensemble import IsolationForest

In [25]:
# Apply Isolation Forest
iso_forest = IsolationForest(contamination=0.1, random_state=42)
outliers = iso_forest.fit_predict(df[numerical_features])



In [26]:
outliers

array([ 1,  1,  1, ...,  1,  1, -1])

In [27]:
# Remove outliers
df_no_outliers = df[outliers == 1]

In [28]:
df_no_outliers

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,income,age-group,age_group,hours_per_week-binned,capital_gain_log
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K,Adult,Adult,Full_time,7.684784
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K,Adult,Adult,Part_time,0.000000
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K,Adult,Adult,Full_time,0.000000
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K,Senior,Senior,Full_time,0.000000
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K,Young Adult,Young Adult,Full_time,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32555,22,Private,310152,Some-college,10,Never-married,Protective-serv,Not-in-family,White,Male,0,0,40,United-States,<=50K,Young Adult,Young Adult,Full_time,0.000000
32556,27,Private,257302,Assoc-acdm,12,Married-civ-spouse,Tech-support,Wife,White,Female,0,0,38,United-States,<=50K,Young Adult,Young Adult,Full_time,0.000000
32557,40,Private,154374,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States,>50K,Adult,Adult,Full_time,0.000000
32558,58,Private,151910,HS-grad,9,Widowed,Adm-clerical,Unmarried,White,Female,0,0,40,United-States,<=50K,Senior,Senior,Full_time,0.000000


#### Apply PPS to Find Relationships Between Features

In [30]:
%pip install ppscore

Collecting ppscore
  Downloading ppscore-1.3.0.tar.gz (17 kB)

Building wheels for collected packages: ppscore
  Building wheel for ppscore (setup.py): started
  Building wheel for ppscore (setup.py): finished with status 'done'
  Created wheel for ppscore: filename=ppscore-1.3.0-py2.py3-none-any.whl size=13150 sha256=bbab240d87094bf5eca83c1491cd58742c7ab585d4c43994b6465b5ed2973e49
  Stored in directory: c:\users\hp\appdata\local\pip\cache\wheels\5c\80\75\b631985b161d4a29cc0cf94b5f64b00be6297b0968ff1337ce
Successfully built ppscore
Installing collected packages: ppscore
Successfully installed ppscore-1.3.0
Note: you may need to restart the kernel to use updated packages.


In [33]:
import ppscore as pps
import warnings
warnings.filterwarnings('ignore')

In [34]:
# Calculate PPS matrix
pps_matrix = pps.matrix(df)

In [35]:
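# Compare PPS with correlation matrix
# Pearson correlation is defined only for numeric columns, so restrict to those
# (pandas 2.0 and later raise an error if string columns are passed to corr()).
correlation_matrix = df.select_dtypes(include='number').corr()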

In [36]:

print("PPS Matrix:\n", pps_matrix)
print("Correlation Matrix:\n", correlation_matrix)

PPS Matrix:
                     x                      y   ppscore            case  \
0                 age                    age  1.000000  predict_itself   
1                 age              workclass  0.011232  classification   
2                 age                 fnlwgt  0.000000      regression   
3                 age              education  0.052315  classification   
4                 age          education_num  0.000000      regression   
..                ...                    ...       ...             ...   
356  capital_gain_log                 income  0.297578  classification   
357  capital_gain_log              age-group  0.000000  classification   
358  capital_gain_log              age_group  0.000000  classification   
359  capital_gain_log  hours_per_week-binned  0.055588  classification   
360  capital_gain_log       capital_gain_log  1.000000  predict_itself   

     is_valid_score               metric  baseline_score   model_score  \
0              True     

#### Discussion on Outliers and Feature Relationships:

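Outliers: Outliers can distort model fitting, particularly for models that are sensitive to extreme values, such as linear and distance-based methods. Isolation Forest scores each row by how easily random splits isolate it and flags the most easily isolated rows. With contamination=0.1, about 10% of rows are flagged by construction, so the removed rows are the most anomalous 10% and not necessarily errors. Whether removing them improves accuracy should be checked on held-out data.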
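
The role of the `contamination` parameter can be illustrated on data that contains no planted outliers at all. This is a minimal sketch on synthetic draws from a normal distribution, not on the dataset used above.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic data with no planted outliers: 1000 draws from a 2-D standard normal.
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 2))

# Same settings as the notebook cell above.
iso = IsolationForest(contamination=0.1, random_state=42)
labels = iso.fit_predict(X)  # 1 = inlier, -1 = flagged as outlier

# The share of flagged rows tracks the contamination setting,
# not the number of "true" anomalies in the data.
flagged_fraction = float(np.mean(labels == -1))
print(flagged_fraction)
```

Because the flagged share follows `contamination`, the value is a modeling choice that is worth tuning or justifying, for example by inspecting the flagged rows.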

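PPS vs Correlation Matrix: A Pearson correlation matrix captures only linear relationships between numeric columns and is symmetric. PPS (Predictive Power Score) is asymmetric, ranges from 0 to 1, handles categorical columns, and can detect non-linear relationships. In the output above, for example, capital_gain_log predicts income with a PPS of about 0.30 but has a PPS of 0 for age_group.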
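
The gap between linear correlation and predictive power shows up on a purely non-linear relationship. The sketch below does not call `ppscore`. It computes a hand-rolled, PPS-style score as one minus the ratio of a decision tree's cross-validated mean absolute error to the error of always predicting the median. This is an assumption about how to mirror the idea, not a reproduction of the library's exact procedure.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.tree import DecisionTreeRegressor

# Symmetric, purely non-linear relationship: y = x^2 on [-1, 1].
x = np.linspace(-1.0, 1.0, 2001)
y = x ** 2

# Pearson correlation is ~0 because there is no linear component.
pearson = np.corrcoef(x, y)[0, 1]

# Out-of-fold predictions from a decision tree (folds are shuffled because
# the data is ordered).
cv = KFold(n_splits=4, shuffle=True, random_state=0)
tree = DecisionTreeRegressor(random_state=0)
pred = cross_val_predict(tree, x.reshape(-1, 1), y, cv=cv)

# PPS-style score: 1 - MAE(model) / MAE(naive median baseline), floored at 0.
mae_model = np.mean(np.abs(y - pred))
mae_naive = np.mean(np.abs(y - np.median(y)))
pps_like = max(0.0, 1.0 - mae_model / mae_naive)

print(round(float(pearson), 3), round(float(pps_like), 3))
```

Here the correlation is essentially zero while the tree-based score is close to 1, which is the kind of relationship a correlation matrix alone would miss.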