# Data Preprocessing

---
Before we train and test our machine learning algorithms, we first have to make sure our data is consistent and of high quality so that the models can perform well. We will do the following:

- Handle missing values.
- Encode categorical features.
- Standardize numerical features.

`To view the final preprocess_data() function, scroll down to the bottom.`

In [132]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler

---

## Missing Values: Categorical

In [109]:
# Load project_adult.csv
df = pd.read_csv('../data/raw/project_adult.csv', index_col=0)  

# Print shape of df
print(f'''
Number of rows: {df.shape[0]}
Number of features: {df.shape[1]}
      ''')

# Check missing values
print(df.isnull().sum())


Number of rows: 26048
Number of features: 15
      
age               0
workclass         0
fnlwgt            0
education         0
education-num     0
marital-status    0
occupation        0
relationship      0
race              0
sex               0
capital-gain      0
capital-loss      0
hours-per-week    0
native-country    0
income            0
dtype: int64


In [110]:
# Checking the unique values in the categorical variables
df_categorical = df[['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'native-country']]

for cat in df_categorical:
    values = df_categorical[cat].unique()
    print(f'''
Unique values for {cat}:
{values}
''')


Unique values for workclass:
['Local-gov' 'Private' 'Self-emp-not-inc' '?' 'Federal-gov' 'Self-emp-inc'
 'State-gov' 'Without-pay' 'Never-worked']


Unique values for education:
['Bachelors' 'Assoc-voc' '9th' 'Some-college' '10th' 'HS-grad'
 'Prof-school' 'Assoc-acdm' '11th' '12th' 'Masters' '7th-8th' 'Doctorate'
 '5th-6th' '1st-4th' 'Preschool']


Unique values for marital-status:
['Never-married' 'Married-civ-spouse' 'Separated' 'Divorced' 'Widowed'
 'Married-spouse-absent' 'Married-AF-spouse']


Unique values for occupation:
['Prof-specialty' 'Exec-managerial' 'Craft-repair' 'Farming-fishing'
 'Other-service' 'Machine-op-inspct' 'Sales' 'Handlers-cleaners'
 'Transport-moving' 'Protective-serv' '?' 'Adm-clerical' 'Priv-house-serv'
 'Tech-support' 'Armed-Forces']


Unique values for relationship:
['Not-in-family' 'Husband' 'Other-relative' 'Unmarried' 'Own-child' 'Wife']


Unique values for race:
['White' 'Black' 'Asian-Pac-Islander' 'Amer-Indian-Eskimo' 'Other']


Unique values for 

The value `'?'` marks an unknown field in the following columns: `workclass`, `occupation`, `native-country`. We will treat it as a missing value. If the share of rows that contain `'?'` in any of these fields is around 5-10%, we will drop those rows. If it is more, we will consider imputing instead.
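
As a minimal sketch of the imputation alternative mentioned above, the snippet below fills `'?'` with each column's most frequent category. The toy DataFrame and its values are made up for illustration, and mode imputation is only one possible strategy, not necessarily the one this notebook would choose.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (values are made up for illustration)
toy = pd.DataFrame({
    'workclass': ['Private', '?', 'Private', 'State-gov'],
    'occupation': ['Sales', 'Sales', '?', 'Tech-support'],
})

# Treat the '?' placeholder as a real missing value
toy = toy.replace('?', np.nan)

# Most frequent category per column (first row of the mode table)
modes = toy.mode().iloc[0]

# Fill each column's missing entries with that column's mode
imputed = toy.fillna(modes)

print(imputed)
```

Mode imputation keeps every row but pushes the unknown entries toward the majority category, which is why dropping is the simpler choice when only a small share of rows is affected.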

In [111]:
# Number of rows that contain at least one field with '?'
num_of_rows = df[df[['workclass','occupation','native-country']].isin(['?']).any(axis=1)].shape[0]

print(f"Number of rows with at least one field with '?': {num_of_rows}")
print(f"Percentage of total rows: {num_of_rows / df.shape[0]:.2%}")

Number of rows with at least one field with '?': 1891
Percentage of total rows: 7.26%


In [112]:
# Drop rows where '?' is present
df = df[~df[['workclass','occupation','native-country']].isin(['?']).any(axis=1)]

---

## Dropping Redundant Columns

We have two features, `education` and `education-num`, and `education-num` seems to be the numerical representation of `education`. It is standard practice to drop the categorical column to avoid redundancy. Before dropping `education`, we will check that it matches `education-num` in every row.

In [113]:
# Create the mapping dictionary from unique pairs
edu_map = dict(df[['education', 'education-num']].drop_duplicates().values)

# Check if each row's education-num matches the mapping for education
mismatch_mask = df['education-num'] != df['education'].map(edu_map)

# Print rows where there is a mismatch
print(df[mismatch_mask])

Empty DataFrame
Columns: [age, workclass, fnlwgt, education, education-num, marital-status, occupation, relationship, race, sex, capital-gain, capital-loss, hours-per-week, native-country, income]
Index: []


Every row's categorical education value matches its corresponding numerical value according to our mapping. So we can safely drop the redundant column and keep the mapping for reference. 
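
The same question can also be asked in both directions: does every label have exactly one code, and does every code have exactly one label? The sketch below shows one way to check this with `groupby` on a made-up toy frame; it is an illustrative alternative, not the check used in this notebook.

```python
import pandas as pd

# Toy frame with made-up rows, mirroring the two education columns
toy = pd.DataFrame({
    'education': ['Bachelors', 'HS-grad', 'Bachelors', 'Masters'],
    'education-num': [13, 9, 13, 14],
})

# Distinct codes seen for each label, and distinct labels seen for each code
codes_per_label = toy.groupby('education')['education-num'].nunique()
labels_per_code = toy.groupby('education-num')['education'].nunique()

# The mapping is one-to-one only if both counts are 1 everywhere
is_one_to_one = bool((codes_per_label == 1).all() and (labels_per_code == 1).all())

print(is_one_to_one)
```

If two labels shared one code, dropping `education` would lose information, so checking the second direction is a cheap extra safeguard.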

In [114]:
# Dropping 'education' column
df = df.drop(columns=['education'])

In [115]:
# Education map for reference
print(edu_map)

{'Bachelors': 13, 'Assoc-voc': 11, '9th': 5, 'Some-college': 10, '10th': 6, 'HS-grad': 9, 'Assoc-acdm': 12, '11th': 7, '12th': 8, 'Masters': 14, '7th-8th': 4, 'Doctorate': 16, 'Prof-school': 15, '5th-6th': 3, '1st-4th': 2, 'Preschool': 1}


---

## Encoding Categorical Features

For algorithms like **Perceptron** and **Adaline**, we encode categorical features into numbers because these models compute **weighted sums of inputs**, and they can’t handle raw text or categories—only numeric values that can be multiplied by weights.
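
To make the weighted-sum argument concrete, here is a small sketch on a made-up two-row frame. The weights `w` and bias `b` are hypothetical values chosen for illustration, not learned parameters.

```python
import numpy as np
import pandas as pd

# Toy frame with one numeric and one categorical feature (made-up values)
toy = pd.DataFrame({'age': [33, 58], 'sex': ['Male', 'Female']})

# One-hot encode the categorical column into 0/1 indicator columns
encoded = pd.get_dummies(toy, columns=['sex'], dtype=int)
print(encoded.columns.tolist())

# Hypothetical weights, one per encoded column, plus a bias term
w = np.array([0.1, -0.5, 0.5])
b = 0.0

# Net input of a linear unit: weighted sum of the numeric inputs
net_input = encoded.to_numpy() @ w + b
print(net_input)
```

Each category gets its own weight, so the model can learn a separate contribution for `sex_Female` and `sex_Male` instead of being forced to treat the categories as ordered numbers.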

In [None]:
# One-hot encoding categorical variables
df = pd.get_dummies(df, columns=[col for col in df_categorical.columns if col != 'education'], dtype=int)

# Verifying one-hot encoding
df.head()

Unnamed: 0,age,fnlwgt,education-num,capital-gain,capital-loss,hours-per-week,income,workclass_Federal-gov,workclass_Local-gov,workclass_Private,...,native-country_Portugal,native-country_Puerto-Rico,native-country_Scotland,native-country_South,native-country_Taiwan,native-country_Thailand,native-country_Trinadad&Tobago,native-country_United-States,native-country_Vietnam,native-country_Yugoslavia
5514,33,198183,13,0,0,50,>50K,0,1,0,...,0,0,0,0,0,0,0,1,0,0
19777,36,86459,11,0,1887,50,>50K,0,0,1,...,0,0,0,0,0,0,0,1,0,0
10781,58,203039,5,0,0,40,<=50K,0,0,0,...,0,0,0,0,0,0,0,1,0,0
32240,21,180190,11,0,0,46,<=50K,0,0,1,...,0,0,0,0,0,0,0,1,0,0
9876,27,279872,10,0,0,40,<=50K,0,0,1,...,0,0,0,0,0,0,0,1,0,0


---

## Standardize Numerical Features: Feature Scaling

Both Perceptron and Adaline are linear models trained with gradient-based updates (or update rules proportional to input values).

- When features are on very different scales (e.g. `age` in years vs. `capital-loss` in dollars), the larger-scaled feature dominates the weight updates.
- This can cause unstable training, slower convergence, or even failure to converge.

Benefits:

- **Perceptron**: scaling helps the algorithm find a cleaner decision boundary because the step updates won’t be skewed toward large-magnitude features.
- **Adaline**: since it minimizes MSE via gradient descent, scaling makes the optimization surface smoother → gradients are balanced → much faster and more stable convergence.

We will use the `StandardScaler` class from scikit-learn's `preprocessing` module.
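
Standardization rescales each feature to zero mean and unit variance, `z = (x - mean) / std`, computed column by column. The sketch below uses a made-up array to check that a manual computation agrees with `StandardScaler`, which uses the population standard deviation (`ddof=0`).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data: two columns on very different scales
x = np.array([[20.0, 0.0],
              [40.0, 1000.0],
              [60.0, 5000.0]])

# Manual standardization per column; np.std defaults to ddof=0
manual = (x - x.mean(axis=0)) / x.std(axis=0)

# Same transformation through scikit-learn
scaled = StandardScaler().fit_transform(x)

print(np.allclose(manual, scaled))
```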

In [118]:
# Numeric columns
numeric_cols = ['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', 'hours-per-week']

# Standardizing features
scaler = StandardScaler()

df[numeric_cols] = scaler.fit_transform(df[numeric_cols])

# Verifying results
df.head()

Unnamed: 0,age,fnlwgt,education-num,capital-gain,capital-loss,hours-per-week,income,workclass_Federal-gov,workclass_Local-gov,workclass_Private,...,native-country_Portugal,native-country_Puerto-Rico,native-country_Scotland,native-country_South,native-country_Taiwan,native-country_Thailand,native-country_Trinadad&Tobago,native-country_United-States,native-country_Vietnam,native-country_Yugoslavia
5514,-0.413348,0.079327,1.128346,-0.147456,-0.21998,0.763915,>50K,0,1,0,...,0,0,0,0,0,0,0,1,0,0
19777,-0.185241,-0.981089,0.344207,-0.147456,4.442817,0.763915,>50K,0,0,1,...,0,0,0,0,0,0,0,1,0,0
10781,1.487546,0.125418,-2.008209,-0.147456,-0.21998,-0.071383,<=50K,0,0,0,...,0,0,0,0,0,0,0,1,0,0
32240,-1.325777,-0.091451,0.344207,-0.147456,-0.21998,0.429796,<=50K,0,0,1,...,0,0,0,0,0,0,0,1,0,0
9876,-0.869562,0.85467,-0.047862,-0.147456,-0.21998,-0.071383,<=50K,0,0,1,...,0,0,0,0,0,0,0,1,0,0


---

## Label Encoding for the Response Variable

Now we will convert the categorical response variable `income` into a binary label, so that each row of the response column is either 0 or 1.

In [None]:
# Encoding response column
df['income'] = df['income'].map({'>50K': 1, '<=50K': 0})

# Verify results
df.head()

Unnamed: 0,age,fnlwgt,education-num,capital-gain,capital-loss,hours-per-week,income,workclass_Federal-gov,workclass_Local-gov,workclass_Private,...,native-country_Portugal,native-country_Puerto-Rico,native-country_Scotland,native-country_South,native-country_Taiwan,native-country_Thailand,native-country_Trinadad&Tobago,native-country_United-States,native-country_Vietnam,native-country_Yugoslavia
5514,-0.413348,0.079327,1.128346,-0.147456,-0.21998,0.763915,1,0,1,0,...,0,0,0,0,0,0,0,1,0,0
19777,-0.185241,-0.981089,0.344207,-0.147456,4.442817,0.763915,1,0,0,1,...,0,0,0,0,0,0,0,1,0,0
10781,1.487546,0.125418,-2.008209,-0.147456,-0.21998,-0.071383,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
32240,-1.325777,-0.091451,0.344207,-0.147456,-0.21998,0.429796,0,0,0,1,...,0,0,0,0,0,0,0,1,0,0
9876,-0.869562,0.85467,-0.047862,-0.147456,-0.21998,-0.071383,0,0,0,1,...,0,0,0,0,0,0,0,1,0,0


- `'>50K' → 1` (positive class)
- `'<=50K' → 0` (negative class)
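
One thing to watch with `Series.map` and a dictionary is that any label not found in the dictionary silently becomes `NaN`. The sketch below shows this on a made-up Series in which one label has a stray leading space; the Series and the name `label_map` are illustrative assumptions.

```python
import pandas as pd

# Made-up labels; the second one carries a leading space
raw = pd.Series(['>50K', ' <=50K', '<=50K'])
label_map = {'>50K': 1, '<=50K': 0}

# Mapping without cleaning: the unmatched label turns into NaN
naive = raw.map(label_map)
print(naive.isna().sum())

# Stripping whitespace first lets every label match the dictionary
clean = raw.str.strip().map(label_map)
assert clean.notna().all(), "unmapped income labels found"
clean = clean.astype(int)
print(clean.tolist())
```

A quick `notna().all()` check after the mapping is a cheap way to catch unexpected label spellings before training.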

---

## Separate `X` (Features) & `y` (Labels)

The last step is to separate the features from the labels so the data is ready for training.

In [135]:
# Separate features and target
X = df.drop(columns=['income'])
y = df[['income']]

---

## `preprocess_data()` Function

Now we will bring everything together in a single function that performs the steps we have done so far, except for scaling, which is left to the training pipeline.

In [140]:
def preprocess_data(df):
    """
    Preprocess the Adult Income dataset.

    This function performs several preprocessing steps to clean and prepare
    the dataset for machine learning models. Steps include stripping whitespace,
    handling missing values, dropping redundant columns, one-hot encoding
    categorical features, encoding the target variable, and returning feature
    and target matrices.

    Parameters
    ----------
    df : pandas.DataFrame
        Raw dataset containing both numeric and categorical features,
        along with the 'income' target column.

    Returns
    -------
    X : pandas.DataFrame
        Feature matrix after preprocessing (numeric + one-hot encoded categorical variables).

    y : pandas.Series
        Target vector with binary-encoded 'income' values (1 for >50K, 0 for <=50K).

    Notes
    -----
    - Rows containing missing values in categorical or target columns are dropped.
    - The 'education' column is removed since 'education-num' is retained.
    - One-hot encoding is applied to categorical features (excluding 'education').
    - Scaling is **not** performed here to avoid data leakage; it should be done
      later within a training pipeline using only the training set.
    - Ensure train/test split is done before fitting scalers or models.
    """

    # 0) Peek
    print("First 5 rows before transformation:\n", df.head(), "\n*********************\n")

    # 1) Define categorical columns up front
    cat_cols = ['workclass', 'education', 'marital-status', 'occupation',
                'relationship', 'race', 'sex', 'native-country']

    # 2) Normalize placeholders & whitespace
    df = df.copy()
    df = df.applymap(lambda x: x.strip() if isinstance(x, str) else x)
    df.replace('?', np.nan, inplace=True)

    # 3) Drop rows missing required categoricals or target
    df.dropna(subset=cat_cols + ['income'], inplace=True)

    # 4) Drop redundant 'education' (keep 'education-num')
    df.drop(columns=['education'], inplace=True)

    # 5) One-hot encode categoricals (education already removed)
    ohe_cols = [c for c in cat_cols if c != 'education']
    df = pd.get_dummies(df, columns=ohe_cols, dtype=int)

    # 6) Target encode AFTER stripping spaces
    df['income'] = df['income'].map({'>50K': 1, '<=50K': 0})

    # 7) Split features/target (y is a Series, as documented in the docstring)
    y = df['income']
    X = df.drop(columns=['income'])

    # 8) Do NOT scale here to avoid leakage; scale in a Pipeline or on X_train only
    # Example (outside): scaler.fit(X_train[numeric_cols]); X_train[numeric_cols] = scaler.transform(...)

    print("\n*********************\nFirst 5 rows of X:\n", X.head(), "\n*********************")
    print("First 5 rows of y:\n", y.head(), "\n*********************\n")
    return X, y
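
The docstring recommends splitting before scaling. A minimal sketch of that order is shown below, assuming `X` and `y` look like the output of `preprocess_data()`. Here they are replaced by small random stand-ins so the snippet runs on its own; the split ratio and `random_state` are arbitrary choices.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Random stand-ins for the preprocessed features and labels (not real data)
rng = np.random.default_rng(0)
X = pd.DataFrame({
    'age': rng.integers(18, 80, size=20).astype(float),
    'hours-per-week': rng.integers(10, 60, size=20).astype(float),
    'sex_Male': rng.integers(0, 2, size=20),
})
y = pd.Series(rng.integers(0, 2, size=20), name='income')

numeric_cols = ['age', 'hours-per-week']

# Split first, so the test rows never influence the scaler
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
X_train, X_test = X_train.copy(), X_test.copy()

# Fit on the training set only, then apply the same statistics to both sets
scaler = StandardScaler()
X_train[numeric_cols] = scaler.fit_transform(X_train[numeric_cols])
X_test[numeric_cols] = scaler.transform(X_test[numeric_cols])

print(X_train[numeric_cols].mean().round(6))
```

Only the numeric columns are scaled; the one-hot indicator columns stay as 0/1. The test set is transformed with the training mean and standard deviation, so its columns will generally not have exactly zero mean.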

