# Feature Encoding & Scaling

Feature encoding converts categorical data into numbers, and feature scaling brings numerical values to a comparable scale.
Together they put the data in a form that machine learning models can consume, and they often improve training.

In [19]:
import pandas as pd
import numpy as np

In [20]:
df = pd.read_csv("adult_income_dataset.csv")

In [21]:
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income
0,25,Private,226802,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K
1,38,Private,89814,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,<=50K
2,28,Local-gov,336951,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,>50K
3,44,Private,160323,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,>50K
4,18,?,103497,Some-college,10,Never-married,?,Own-child,White,Female,0,0,30,United-States,<=50K


In [22]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 15 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   age              48842 non-null  int64 
 1   workclass        48842 non-null  object
 2   fnlwgt           48842 non-null  int64 
 3   education        48842 non-null  object
 4   educational-num  48842 non-null  int64 
 5   marital-status   48842 non-null  object
 6   occupation       48842 non-null  object
 7   relationship     48842 non-null  object
 8   race             48842 non-null  object
 9   gender           48842 non-null  object
 10  capital-gain     48842 non-null  int64 
 11  capital-loss     48842 non-null  int64 
 12  hours-per-week   48842 non-null  int64 
 13  native-country   48842 non-null  object
 14  income           48842 non-null  object
dtypes: int64(6), object(9)
memory usage: 5.6+ MB


In [23]:
df.describe()

Unnamed: 0,age,fnlwgt,educational-num,capital-gain,capital-loss,hours-per-week
count,48842.0,48842.0,48842.0,48842.0,48842.0,48842.0
mean,38.643585,189664.1,10.078089,1079.067626,87.502314,40.422382
std,13.71051,105604.0,2.570973,7452.019058,403.004552,12.391444
min,17.0,12285.0,1.0,0.0,0.0,1.0
25%,28.0,117550.5,9.0,0.0,0.0,40.0
50%,37.0,178144.5,10.0,0.0,0.0,40.0
75%,48.0,237642.0,12.0,0.0,0.0,45.0
max,90.0,1490400.0,16.0,99999.0,4356.0,99.0


## Identifying Categorical and Numerical Features

Categorical features hold text labels (dtype `object` here), while numerical features hold
continuous or discrete numbers. The two groups need different preprocessing techniques.


In [24]:
categorical_cols = df.select_dtypes(include=["object"]).columns
numerical_cols = df.select_dtypes(include=["int64", "float64"]).columns

categorical_cols, numerical_cols

(Index(['workclass', 'education', 'marital-status', 'occupation',
        'relationship', 'race', 'gender', 'native-country', 'income'],
       dtype='object'),
 Index(['age', 'fnlwgt', 'educational-num', 'capital-gain', 'capital-loss',
        'hours-per-week'],
       dtype='object'))

## Label Encoding (For Ordered Categories)

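Label Encoding maps each category to an integer, which is most meaningful where the categories have a natural order.
Note that `LabelEncoder` assigns codes by sorting the category values, not by any domain ranking: in the output further down, `11th` becomes 1, `Assoc-acdm` becomes 7 and `HS-grad` becomes 11, which does not match `educational-num`.
Here it is applied to every categorical column, including the target `income`.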

In [25]:
from sklearn.preprocessing import LabelEncoder

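# Keep one fitted encoder per column so each mapping can be inverted later.
encoders = {}

for col in categorical_cols:
    encoders[col] = LabelEncoder()
    df[col] = encoders[col].fit_transform(df[col])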
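
The difference between sorted codes and a true ordering can be sketched on a toy column. This is a minimal illustration, not the notebook's method: the three-row sample and the order `11th < HS-grad < Bachelors` are assumptions made for the example, and `OrdinalEncoder` is swapped in to show how an explicit order can be supplied.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Toy column (hypothetical data) using three education levels from the dataset.
toy = pd.DataFrame({"education": ["HS-grad", "11th", "Bachelors", "HS-grad"]})

# LabelEncoder sorts the labels, so codes follow string order, not schooling order.
label_codes = LabelEncoder().fit_transform(toy["education"])

# OrdinalEncoder accepts an explicit category order
# (assumed here: 11th < HS-grad < Bachelors).
education_order = [["11th", "HS-grad", "Bachelors"]]
ordinal = OrdinalEncoder(categories=education_order)
ordinal_codes = ordinal.fit_transform(toy[["education"]]).ravel()

print(label_codes)    # [2 0 1 2]
print(ordinal_codes)  # [1. 0. 2. 1.]
```

With the sorted codes, `Bachelors` (1) ranks below `HS-grad` (2); with the explicit order it ranks above, which is what an order-aware model would expect.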

## Feature Scaling

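Numerical features are standardized with StandardScaler, which rescales each column to zero mean and unit variance (it does not bound values to a fixed range).
This helps algorithms that are sensitive to feature magnitude, such as gradient-based and distance-based models.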

In [26]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
df[numerical_cols] = scaler.fit_transform(df[numerical_cols])
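
What the scaler computes can be checked by hand on a small array. The sketch below uses made-up values (an assumption for illustration) and compares `StandardScaler` with the z-score formula `(x - mean) / std`. `StandardScaler` uses the population standard deviation (`ddof=0`), whereas `df.describe()` reports the sample standard deviation (`ddof=1`), so the two differ slightly.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy values (hypothetical), shaped as a single feature column.
x = np.array([[17.0], [25.0], [38.0], [44.0], [90.0]])

scaled = StandardScaler().fit_transform(x)

# Manual z-score using the population standard deviation (ddof=0).
manual = (x - x.mean(axis=0)) / x.std(axis=0, ddof=0)

print(np.allclose(scaled, manual))  # True
```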

In [27]:
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income
0,-0.995129,4,0.351675,1,-1.197259,4,7,3,2,1,-0.144804,-0.217127,-0.034087,39,0
1,-0.046942,4,-0.945524,11,-0.419335,2,5,0,4,1,-0.144804,-0.217127,0.77293,39,0
2,-0.776316,2,1.394723,7,0.74755,2,11,0,4,1,-0.144804,-0.217127,-0.034087,39,1
3,0.390683,4,-0.277844,15,-0.030373,2,7,0,2,1,0.886874,-0.217127,-0.034087,39,1
4,-1.505691,0,-0.815954,15,-0.030373,4,0,3,4,0,-0.144804,-0.217127,-0.841104,39,0


## Model Readiness Check

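After encoding and scaling, every column is numeric, so the dataset can be passed
to machine learning models. Note that the `?` placeholders seen in `workclass` and
`occupation` were encoded as an ordinary category (code 0) rather than treated as missing values.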
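
The readiness claim above can be verified programmatically. The helper below is a hypothetical sketch (the name `is_model_ready` and the toy frames are assumptions, not part of the notebook): it checks that every column has a numeric dtype and that no value is missing.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def is_model_ready(frame: pd.DataFrame) -> bool:
    """Return True when every column is numeric and no value is missing."""
    all_numeric = all(is_numeric_dtype(frame[col]) for col in frame.columns)
    no_missing = not frame.isna().any().any()
    return all_numeric and no_missing


# Toy frames (hypothetical data) standing in for the processed and raw datasets.
ready = pd.DataFrame({"age": [-0.99, 0.39], "workclass": [4, 0]})
not_ready = pd.DataFrame({"age": [25, 44], "workclass": ["Private", "?"]})

print(is_model_ready(ready))      # True
print(is_model_ready(not_ready))  # False
```

Calling `is_model_ready(df)` on the processed frame before saving it would catch a column that was left unencoded.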


In [28]:
df.to_csv("adult_income_processed.csv", index=False)
