<a href="https://colab.research.google.com/github/gummadidalaashishkumar/AI-ML-Internship-Task-4/blob/main/Adult_Income_Feature_Engineering_Encoding_Scaling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [30]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Load the dataset
df = pd.read_csv('adult.csv')

# Display initial data
print("Dataset Shape:", df.shape)
df.head()

Dataset Shape: (48842, 15)


,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income
0,25,Private,226802,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K
1,38,Private,89814,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,<=50K
2,28,Local-gov,336951,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,>50K
3,44,Private,160323,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,>50K
4,18,?,103497,Some-college,10,Never-married,?,Own-child,White,Female,0,0,30,United-States,<=50K


1. Identify Categorical and Numerical Features
We first distinguish between columns that contain numbers (numerical) and those that contain text/categories (categorical).

In [31]:
# Identify numerical and categorical columns
numerical_cols = df.select_dtypes(include=['int64', 'float64']).columns.tolist()
categorical_cols = df.select_dtypes(include=['object']).columns.tolist()

print(f"Numerical Features: {numerical_cols}")
print(f"Categorical Features: {categorical_cols}")

Numerical Features: ['age', 'fnlwgt', 'educational-num', 'capital-gain', 'capital-loss', 'hours-per-week']
Categorical Features: ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'gender', 'native-country', 'income']


2. & 3. Label and One-Hot Encoding
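Label Encoding: Used for the target variable income (<=50K, >50K) and for gender, as both are binary. LabelEncoder assigns integer codes in sorted class order, so <=50K becomes 0 and >50K becomes 1, and Female becomes 0 and Male becomes 1.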

One-Hot Encoding: Used for nominal features like workclass, occupation, and race where no natural order exists.

In [32]:
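# Label Encoding for binary features.
# A separate encoder per column keeps each fitted mapping (classes_)
# available for inverse_transform later.
income_le = LabelEncoder()
gender_le = LabelEncoder()
df['income'] = income_le.fit_transform(df['income'])
df['gender'] = gender_le.fit_transform(df['gender'])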

# One-Hot Encoding for remaining features
# We drop 'education' because 'educational-num' already represents it numerically.
df_encoded = df.drop('education', axis=1)
cols_to_ohe = ['workclass', 'marital-status', 'occupation', 'relationship', 'race', 'native-country']
df_encoded = pd.get_dummies(df_encoded, columns=cols_to_ohe)

print(f"New shape after One-Hot Encoding: {df_encoded.shape}")
df_encoded.head()

New shape after One-Hot Encoding: (48842, 92)


,age,fnlwgt,educational-num,gender,capital-gain,capital-loss,hours-per-week,income,workclass_?,workclass_Federal-gov,...,native-country_Portugal,native-country_Puerto-Rico,native-country_Scotland,native-country_South,native-country_Taiwan,native-country_Thailand,native-country_Trinadad&Tobago,native-country_United-States,native-country_Vietnam,native-country_Yugoslavia
0,25,226802,7,1,0,0,40,0,False,False,...,False,False,False,False,False,False,False,True,False,False
1,38,89814,9,1,0,0,50,0,False,False,...,False,False,False,False,False,False,False,True,False,False
2,28,336951,12,1,0,0,40,1,False,False,...,False,False,False,False,False,False,False,True,False,False
3,44,160323,10,1,7688,0,40,1,False,False,...,False,False,False,False,False,False,False,True,False,False
4,18,103497,10,0,0,0,30,0,True,False,...,False,False,False,False,False,False,False,True,False,False
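
The two encodings and the resulting column count can be checked on a small example. The sketch below uses a hypothetical four-row frame (the name toy and its values are made up for illustration, loosely modelled on the Adult columns) and shows that the number of columns after pd.get_dummies equals the untouched columns plus the number of distinct categories in each encoded column.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy rows, loosely modelled on the Adult columns.
toy = pd.DataFrame({
    "age": [25, 38, 28, 44],
    "workclass": ["Private", "Private", "Local-gov", "?"],
    "race": ["Black", "White", "White", "Black"],
    "income": ["<=50K", "<=50K", ">50K", ">50K"],
})

# Label encoding: classes are sorted, so "<=50K" -> 0 and ">50K" -> 1.
income_le = LabelEncoder()
toy["income"] = income_le.fit_transform(toy["income"])

# One-hot encoding: each distinct category becomes its own indicator column.
cols_to_ohe = ["workclass", "race"]
expected_cols = (toy.shape[1] - len(cols_to_ohe)) + sum(
    toy[c].nunique() for c in cols_to_ohe
)
toy_encoded = pd.get_dummies(toy, columns=cols_to_ohe)

print(list(income_le.classes_))
print(toy_encoded.shape[1], expected_cols)
```

The same arithmetic accounts for the shape reported above: 15 original columns, minus education and the 6 one-hot encoded columns, leaves 8, so the remaining 84 of the 92 columns are indicator columns.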


4. Scale Numerical Features
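We use StandardScaler to transform each numerical feature so it has a mean of 0 and a standard deviation of 1, i.e. z = (x - mean) / std computed per column. This keeps large-magnitude features (like fnlwgt) from overpowering small-magnitude ones (like age) in distance computations and gradient updates.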

In [33]:
scaler = StandardScaler()
df_scaled = df_encoded.copy()

# Features to scale
num_to_scale = ['age', 'fnlwgt', 'educational-num', 'capital-gain', 'capital-loss', 'hours-per-week']
df_scaled[num_to_scale] = scaler.fit_transform(df_scaled[num_to_scale])

# Save the processed dataset for GitHub submission
df_scaled.to_csv('processed_adult_income.csv', index=False)
print("Processed dataset saved as 'processed_adult_income.csv'")
df_scaled.head()

Processed dataset saved as 'processed_adult_income.csv'


,age,fnlwgt,educational-num,gender,capital-gain,capital-loss,hours-per-week,income,workclass_?,workclass_Federal-gov,...,native-country_Portugal,native-country_Puerto-Rico,native-country_Scotland,native-country_South,native-country_Taiwan,native-country_Thailand,native-country_Trinadad&Tobago,native-country_United-States,native-country_Vietnam,native-country_Yugoslavia
0,-0.995129,0.351675,-1.197259,1,-0.144804,-0.217127,-0.034087,0,False,False,...,False,False,False,False,False,False,False,True,False,False
1,-0.046942,-0.945524,-0.419335,1,-0.144804,-0.217127,0.77293,0,False,False,...,False,False,False,False,False,False,False,True,False,False
2,-0.776316,1.394723,0.74755,1,-0.144804,-0.217127,-0.034087,1,False,False,...,False,False,False,False,False,False,False,True,False,False
3,0.390683,-0.277844,-0.030373,1,0.886874,-0.217127,-0.034087,1,False,False,...,False,False,False,False,False,False,False,True,False,False
4,-1.505691,-0.815954,-0.030373,0,-0.144804,-0.217127,-0.841104,0,True,False,...,False,False,False,False,False,False,False,True,False,False
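
As a sanity check on what StandardScaler computes, the sketch below reproduces the transformation by hand on a toy column. It assumes only the five age values visible in the first head() output, so the resulting numbers differ from the table above, where the mean and standard deviation come from all 48842 rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# The first five 'age' values from the initial head() output, as a toy column.
ages = np.array([[25.0], [38.0], [28.0], [44.0], [18.0]])

# Manual z-score using the population standard deviation (ddof=0),
# which is what StandardScaler uses.
manual = (ages - ages.mean(axis=0)) / ages.std(axis=0)

scaled = StandardScaler().fit_transform(ages)

print(np.allclose(manual, scaled))
print(scaled.mean(axis=0), scaled.std(axis=0))
```

The manual result matches the scaler's output, and the scaled column has a mean of 0 and a standard deviation of 1 up to floating-point error.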


5. & 6. Model Readiness and Impact of Scaling
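Before Scaling: The fnlwgt column has values in the hundreds of thousands, while educational-num ranges from 1 to 16. A distance-based model like KNN would let fnlwgt dominate every distance computation simply because its numbers are larger, regardless of how informative the feature actually is.

After Scaling: The six numerical features each have a mean of 0 and a standard deviation of 1, so they are on a comparable scale. The label-encoded and one-hot columns are left as binary indicators.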
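
To make the KNN point concrete, the sketch below takes rows 0 and 1 from the head() outputs above, restricted to age, fnlwgt and educational-num, and measures how much of the squared Euclidean distance between them comes from fnlwgt alone. The helper fnlwgt_share is a hypothetical name introduced for this illustration.

```python
import numpy as np

# Rows 0 and 1 from the head() outputs: (age, fnlwgt, educational-num).
raw_a = np.array([25.0, 226802.0, 7.0])
raw_b = np.array([38.0, 89814.0, 9.0])
scaled_a = np.array([-0.995129, 0.351675, -1.197259])
scaled_b = np.array([-0.046942, -0.945524, -0.419335])


def fnlwgt_share(a, b):
    """Fraction of the squared Euclidean distance contributed by fnlwgt (index 1)."""
    sq = (a - b) ** 2
    return sq[1] / sq.sum()


raw_share = fnlwgt_share(raw_a, raw_b)
scaled_share = fnlwgt_share(scaled_a, scaled_b)
print(f"raw: {raw_share:.6f}, scaled: {scaled_share:.3f}")
```

On the raw values, fnlwgt accounts for more than 99.99% of the squared distance; on the scaled values it accounts for roughly half, with age and educational-num contributing the rest.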

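Impact on Algorithms:

Gradient Descent: Algorithms like Logistic Regression and Neural Networks typically converge faster on standardized inputs.

Distance-based: KNN and K-Means give each numerical feature a comparable influence on distances, instead of letting the largest-magnitude feature dominate.

Interpretability: Coefficients of the scaled features in linear models become more directly comparable, since each one is expressed per standard deviation of its feature.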

7. Summary of Feature Engineering
Feature Expansion: One-Hot Encoding increased the column count from 15 to 92, giving every category its own indicator column so the model can learn category-specific effects.

Target Label Encoding: The income label was converted to 0 and 1 with LabelEncoder, making it ready for binary classification.

Data Uniformity: Standard Scaling put the numerical features on a common scale, removing the effect of their different units of measurement (hours vs. dollars vs. years).