# Data Preprocessing

---
Before we train and test our machine learning algorithms, we first have to make sure the data is consistent and of high quality, because this directly affects model performance. We will have to do the following:

- Handle missing values.
- Encode categorical features.
- Standardize numerical features.

`To view the final preprocess_data() function, scroll down to the bottom.`

In [112]:
import pandas as pd
import numpy as np
from pathlib import Path
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

---

## Missing Values: Categorical

In [113]:
# Load project_adult.csv
df = pd.read_csv('../data/raw/project_adult.csv', index_col=0)  

# Print shape of df
print(f'''
Number of rows: {df.shape[0]}
Number of features: {df.shape[1]}
      ''')

# Check missing values
print(df.isnull().sum())


Number of rows: 26048
Number of features: 15
      
age               0
workclass         0
fnlwgt            0
education         0
education-num     0
marital-status    0
occupation        0
relationship      0
race              0
sex               0
capital-gain      0
capital-loss      0
hours-per-week    0
native-country    0
income            0
dtype: int64


In [114]:
# Checking the unique values in categorical variables
df_categorical = df[['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'native-country']]

for cat in df_categorical:
    values = df_categorical[cat].unique()
    print(f'''
Unique values for {cat}:
{values}
''')


Unique values for workclass:
['Local-gov' 'Private' 'Self-emp-not-inc' '?' 'Federal-gov' 'Self-emp-inc'
 'State-gov' 'Without-pay' 'Never-worked']


Unique values for education:
['Bachelors' 'Assoc-voc' '9th' 'Some-college' '10th' 'HS-grad'
 'Prof-school' 'Assoc-acdm' '11th' '12th' 'Masters' '7th-8th' 'Doctorate'
 '5th-6th' '1st-4th' 'Preschool']


Unique values for marital-status:
['Never-married' 'Married-civ-spouse' 'Separated' 'Divorced' 'Widowed'
 'Married-spouse-absent' 'Married-AF-spouse']


Unique values for occupation:
['Prof-specialty' 'Exec-managerial' 'Craft-repair' 'Farming-fishing'
 'Other-service' 'Machine-op-inspct' 'Sales' 'Handlers-cleaners'
 'Transport-moving' 'Protective-serv' '?' 'Adm-clerical' 'Priv-house-serv'
 'Tech-support' 'Armed-Forces']


Unique values for relationship:
['Not-in-family' 'Husband' 'Other-relative' 'Unmarried' 'Own-child' 'Wife']


Unique values for race:
['White' 'Black' 'Asian-Pac-Islander' 'Amer-Indian-Eskimo' 'Other']


Unique values for 

The value `'?'` marks an unknown field in the following columns: `workclass`, `occupation`, `native-country`. Let's check how many rows contain `'?'`.

In [115]:
# Number of rows that contain at least one field with '?'
num_of_rows = df[df[['workclass','occupation','native-country']].isin(['?']).any(axis=1)].shape[0]

print(f"Number of rows with at least one field with '?': {num_of_rows}")
print(f"Percentage of total rows: {(num_of_rows / df.shape[0]) * 100:.2f}%")

Number of rows with at least one field with '?': 1891
Percentage of total rows: 7.26%
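
The row count above can be complemented with a per-column breakdown. The following is a minimal sketch on a made-up toy frame (the values are illustrative, not taken from `project_adult.csv`), showing how `'?'` placeholders can be counted per column and per row:

```python
import pandas as pd

# Toy stand-in for the Adult data; the values are made up for illustration.
toy = pd.DataFrame({
    "workclass": ["Private", "?", "State-gov", "?"],
    "occupation": ["Sales", "?", "Tech-support", "Sales"],
    "native-country": ["United-States", "Mexico", "?", "United-States"],
})

# Boolean mask of cells equal to the '?' placeholder.
is_placeholder = toy.eq("?")

# Count '?' placeholders per column.
placeholder_counts = is_placeholder.sum()

# Count rows that contain at least one '?'.
rows_with_placeholder = int(is_placeholder.any(axis=1).sum())

print(placeholder_counts)
print(f"Rows with at least one '?': {rows_with_placeholder}")
```

A per-column view like this helps judge whether one column is responsible for most of the unknown values.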


Instead of dropping these rows, we will replace `'?'` with the label `"Unknown"`, which will act as another category and be encoded later on. We are doing this because when we make predictions on `project_validation_inputs.csv`, we don't want to drop any rows.

In [116]:
# Replacing "?" to "Unknown"
df = df.replace("?", "Unknown")  

---

## Dropping Redundant Columns

We have two features, `education` and `education-num`, and `education-num` seems to be the numerical representation of `education`. It is standard practice to drop the categorical column to avoid redundancy. We will check that `education` matches `education-num` in each row before dropping `education`.

In [117]:
# Create the mapping dictionary from unique pairs
edu_map = dict(df[['education-num', 'education']].drop_duplicates().values)

# Check if each row's education-num matches the mapping for education
mismatch_mask = df['education'] != df['education-num'].map(edu_map)

# Print rows where there is a mismatch
print(df[mismatch_mask])

Empty DataFrame
Columns: [age, workclass, fnlwgt, education, education-num, marital-status, occupation, relationship, race, sex, capital-gain, capital-loss, hours-per-week, native-country, income]
Index: []


Every row's categorical education value matches its corresponding numerical value according to our mapping. So we can safely drop the redundant column and keep the mapping for reference. 
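
As a cross-check, the same property can be tested in both directions with a group-by. This is a sketch on toy data (the rows are illustrative), assuming the goal is to confirm that each code has exactly one label and each label exactly one code:

```python
import pandas as pd

# Toy rows using the same column names as the dataset.
toy = pd.DataFrame({
    "education": ["Bachelors", "HS-grad", "Bachelors", "Masters"],
    "education-num": [13, 9, 13, 14],
})

# Number of distinct labels seen for each numeric code, and vice versa.
labels_per_code = toy.groupby("education-num")["education"].nunique()
codes_per_label = toy.groupby("education")["education-num"].nunique()

# The mapping is one-to-one only if every group contains a single value.
is_one_to_one = bool((labels_per_code == 1).all() and (codes_per_label == 1).all())
print(f"One-to-one mapping: {is_one_to_one}")
```

If either count exceeded 1 for some group, dropping `education` would lose information.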

In [118]:
# Dropping 'education' column
df = df.drop(columns=['education'])

In [119]:
# Education map for reference
print(edu_map)

{13: 'Bachelors', 11: 'Assoc-voc', 5: '9th', 10: 'Some-college', 6: '10th', 9: 'HS-grad', 15: 'Prof-school', 12: 'Assoc-acdm', 7: '11th', 8: '12th', 14: 'Masters', 4: '7th-8th', 16: 'Doctorate', 3: '5th-6th', 2: '1st-4th', 1: 'Preschool'}


---

## Encoding Categorical Features

For algorithms like **Perceptron** and **Adaline**, we encode categorical features into numbers because these models compute **weighted sums of inputs**, and they can’t handle raw text or categories—only numeric values that can be multiplied by weights.

In [120]:
# One-hot encoding categorical variables
df = pd.get_dummies(df, columns=[col for col in df_categorical.columns if col != 'education'], dtype=int)

# Verifying one-hot encoding
df.head()

Unnamed: 0,age,fnlwgt,education-num,capital-gain,capital-loss,hours-per-week,income,workclass_Federal-gov,workclass_Local-gov,workclass_Never-worked,...,native-country_Puerto-Rico,native-country_Scotland,native-country_South,native-country_Taiwan,native-country_Thailand,native-country_Trinadad&Tobago,native-country_United-States,native-country_Unknown,native-country_Vietnam,native-country_Yugoslavia
5514,33,198183,13,0,0,50,>50K,0,1,0,...,0,0,0,0,0,0,1,0,0,0
19777,36,86459,11,0,1887,50,>50K,0,0,0,...,0,0,0,0,0,0,1,0,0,0
10781,58,203039,5,0,0,40,<=50K,0,0,0,...,0,0,0,0,0,0,1,0,0,0
32240,21,180190,11,0,0,46,<=50K,0,0,0,...,0,0,0,0,0,0,1,0,0,0
9876,27,279872,10,0,0,40,<=50K,0,0,0,...,0,0,0,0,0,0,1,0,0,0
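
To make the weighted-sum argument concrete, here is a small sketch on toy data. The weights and bias are hypothetical values chosen only for illustration; they are not learned from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame with one numeric and one categorical feature.
toy = pd.DataFrame({"age": [25, 40], "sex": ["Male", "Female"]})

# One-hot encode the categorical column.
# Resulting columns: age, sex_Female, sex_Male
encoded = pd.get_dummies(toy, columns=["sex"], dtype=int)

# Hypothetical weights (one per encoded column) and bias.
weights = np.array([0.1, -0.5, 0.5])
bias = -3.0

# Net input z = X @ w + b, the quantity a Perceptron thresholds
# and Adaline fits with a linear activation.
net_input = encoded.to_numpy() @ weights + bias

print(encoded)
print(net_input)
```

Each category gets its own weight, so the model can treat `Male` and `Female` independently instead of assuming an ordering between them.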


---

## Standardize Numerical Features: Feature Scaling

Both Perceptron and Adaline are linear models trained with gradient-based updates (or update rules proportional to input values).

- When features are on very different scales (e.g. `age` in years vs. `capital-loss` in dollars), the larger-scaled feature dominates the weight updates.
- This can cause unstable training, slower convergence, or even failure to converge.

Benefits:

- **Perceptron**: scaling helps the algorithm find a cleaner decision boundary because the step updates won’t be skewed toward large-magnitude features.
- **Adaline**: since it minimizes MSE via gradient descent, scaling makes the optimization surface smoother → gradients are balanced → much faster and more stable convergence.
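
Standardization itself is a simple computation: subtract each column's mean and divide by its standard deviation. The sketch below does this by hand with NumPy on made-up values, using the population standard deviation (`ddof=0`):

```python
import numpy as np

# Two toy features on very different scales (made-up values).
age = np.array([25.0, 35.0, 45.0, 55.0])
capital_loss = np.array([0.0, 0.0, 1887.0, 0.0])
X = np.column_stack([age, capital_loss])

# z-score per column: (x - mean) / std, with the population std (ddof=0).
col_mean = X.mean(axis=0)
col_std = X.std(axis=0)
X_std = (X - col_mean) / col_std

print(X_std.mean(axis=0))  # approximately 0 for each column
print(X_std.std(axis=0))   # approximately 1 for each column
```

After this transformation both columns have comparable magnitudes, so neither one dominates the weight updates purely because of its units.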

We will use the `StandardScaler` class from scikit-learn's `preprocessing` module.

In [121]:
# Numeric columns
numeric_cols = ['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', 'hours-per-week']

# Standardizing features
scaler = StandardScaler()

df[numeric_cols] = scaler.fit_transform(df[numeric_cols])

# Verifying results
df.head()

Unnamed: 0,age,fnlwgt,education-num,capital-gain,capital-loss,hours-per-week,income,workclass_Federal-gov,workclass_Local-gov,workclass_Never-worked,...,native-country_Puerto-Rico,native-country_Scotland,native-country_South,native-country_Taiwan,native-country_Thailand,native-country_Trinadad&Tobago,native-country_United-States,native-country_Unknown,native-country_Vietnam,native-country_Yugoslavia
5514,-0.408756,0.080051,1.133702,-0.145715,-0.217998,0.77946,>50K,0,1,0,...,0,0,0,0,0,0,1,0,0,0
19777,-0.188857,-0.981653,0.357049,-0.145715,4.457168,0.77946,>50K,0,0,0,...,0,0,0,0,0,0,1,0,0,0
10781,1.423734,0.126197,-1.97291,-0.145715,-0.217998,-0.03151,<=50K,0,0,0,...,0,0,0,0,0,0,1,0,0,0
32240,-1.288351,-0.090935,0.357049,-0.145715,-0.217998,0.455072,<=50K,0,0,0,...,0,0,0,0,0,0,1,0,0,0
9876,-0.848554,0.856334,-0.031277,-0.145715,-0.217998,-0.03151,<=50K,0,0,0,...,0,0,0,0,0,0,1,0,0,0


---

## Label Encoding for the Response Variable

Now we will convert the categorical response variable `income` into binary labels.

In [122]:
# Encoding response column
df['income'] = df['income'].map({'>50K': 1, '<=50K': 0})

# Verify results
df.head()

Unnamed: 0,age,fnlwgt,education-num,capital-gain,capital-loss,hours-per-week,income,workclass_Federal-gov,workclass_Local-gov,workclass_Never-worked,...,native-country_Puerto-Rico,native-country_Scotland,native-country_South,native-country_Taiwan,native-country_Thailand,native-country_Trinadad&Tobago,native-country_United-States,native-country_Unknown,native-country_Vietnam,native-country_Yugoslavia
5514,-0.408756,0.080051,1.133702,-0.145715,-0.217998,0.77946,1,0,1,0,...,0,0,0,0,0,0,1,0,0,0
19777,-0.188857,-0.981653,0.357049,-0.145715,4.457168,0.77946,1,0,0,0,...,0,0,0,0,0,0,1,0,0,0
10781,1.423734,0.126197,-1.97291,-0.145715,-0.217998,-0.03151,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
32240,-1.288351,-0.090935,0.357049,-0.145715,-0.217998,0.455072,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
9876,-0.848554,0.856334,-0.031277,-0.145715,-0.217998,-0.03151,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0


- `'>50K' → 1` (positive class)
- `'<=50K' → 0` (negative class)
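
One thing worth guarding against is that `Series.map` with a dictionary silently produces `NaN` for any value that is not a key. The sketch below uses toy labels, where `'>50K.'` (with a trailing period) is a hypothetical unexpected variant, and shows one way to detect unmapped labels:

```python
import pandas as pd

# Toy labels; '>50K.' is a hypothetical variant that the mapping does not cover.
income = pd.Series(["<=50K", ">50K", ">50K.", "<=50K"])

encoded = income.map({">50K": 1, "<=50K": 0})

# Series.map returns NaN for values missing from the dictionary,
# so any NaN here points to an unexpected label.
unmapped = income[encoded.isna()].unique().tolist()
print(f"Unmapped labels: {unmapped}")
```

An empty list means every label was covered and the column can be cast to an integer type safely.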

---

## Separate `X` (Features) & `y` (Labels)

The last step is to separate the features and labels so the data is ready for training.

In [123]:
# Separate features and target
X = df.drop(columns=['income'])  # `columns=` already implies the axis
y = df['income']                 # 1-D Series of labels

---

## `preprocess_data()` Function

Now we will bring everything together and create one function that does everything we have done so far.

In [124]:
def preprocess_data(df: pd.DataFrame, scaler: StandardScaler | None = None):
    """
    Preprocessing for the Adult Income flat file.

    - Strip strings, turn '?' -> 'Unknown'
    - Keep all rows (do not drop missing placeholders)
    - Drop 'education' (keep 'education-num')
    - One-hot encode categoricals
    - Encode target 'income' to {<=50K:0, >50K:1} if present
    - Standardize numeric columns (fit on train if scaler is None; otherwise transform)

    Returns
    -------
    X : DataFrame
    y : Series | None
    scaler : StandardScaler
    """
    # fixed columns
    cat_cols = ['workclass', 'education', 'marital-status', 'occupation',
                'relationship', 'race', 'sex', 'native-country']
    num_cols = ['age', 'fnlwgt', 'education-num', 'capital-gain',
                'capital-loss', 'hours-per-week']

    # work on a copy so the caller's DataFrame is left untouched
    df = df.copy()
    has_target = 'income' in df.columns

    # 1) clean strings/placeholders
    for c in df.select_dtypes(include="object").columns:
        df[c] = df[c].map(lambda x: x.strip() if isinstance(x, str) else x)
    df = df.replace('?', 'Unknown')

    # 2) drop redundant education (no-op if the column is absent)
    df = df.drop(columns=['education'], errors='ignore')

    # 3) one-hot encode categoricals
    ohe_cols = [c for c in cat_cols if c != 'education' and c in df.columns]
    if ohe_cols:
        df = pd.get_dummies(df, columns=ohe_cols, dtype=int)

    # 4) split target
    y = None
    if has_target:
        y = df['income'].map({'>50K': 1, '<=50K': 0}).astype('int64')
        X = df.drop(columns=['income'])
    else:
        X = df

    # 5) standardize numerics
    present_num = [c for c in num_cols if c in X.columns]
    fitted_scaler = scaler if scaler is not None else StandardScaler()
    if present_num:
        # Assign whole columns (not through .loc) so the int64 columns are
        # replaced by float64 ones instead of being written into in place.
        X[present_num] = X[present_num].astype('float64')
        if scaler is None:
            X[present_num] = fitted_scaler.fit_transform(X[present_num])
        else:
            X[present_num] = fitted_scaler.transform(X[present_num])

    return X, y, fitted_scaler


We'll use the `preprocess_data` function to clean the Adult Income dataset: split it into train/test, preprocess with scaling, and then save separate CSVs for features (`X`) and target (`y`). Finally, we'll show how to preprocess the validation inputs and reload the CSVs.

- We load the raw dataset `project_adult.csv` and split into train and test sets.
- We stratify by the target (`income`) to preserve class balance.
- We fit the scaler on the training set (`scaler=None`).
- We reuse that scaler on the test set (to avoid leakage).
- We save features and labels separately for both train and test.
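
The fit-on-train, reuse-on-test pattern from the list above can be sketched on toy data. The numbers are made up, and the variable names carry a `_toy` suffix to keep them apart from the notebook's own variables:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single-feature data.
X_train_toy = np.array([[20.0], [30.0], [40.0], [50.0]])
X_test_toy = np.array([[60.0], [70.0]])

toy_scaler = StandardScaler()

# fit_transform learns the mean and std from the training data only.
X_train_toy_std = toy_scaler.fit_transform(X_train_toy)

# transform reuses those training statistics; nothing is learned from the test data.
X_test_toy_std = toy_scaler.transform(X_test_toy)

print(toy_scaler.mean_, toy_scaler.scale_)
print(X_test_toy_std.ravel())
```

The scaled test values are not centered at zero, which is expected: they are expressed relative to the training distribution, exactly as unseen data would be at prediction time.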

In [125]:
# Load raw data
df = pd.read_csv("../data/raw/project_adult.csv", index_col=0)

# Split into train and test
train_df, test_df = train_test_split(
    df, test_size=0.3, stratify=df["income"], random_state=42
)

# Preprocess (fit scaler on train, reuse on test)
X_train, y_train, scaler = preprocess_data(train_df, scaler=None)
X_test,  y_test,  _      = preprocess_data(test_df,  scaler=scaler)

# Ensure processed folder exists
processed_path = Path("../data/processed")
processed_path.mkdir(parents=True, exist_ok=True)

# Save features
X_train.to_csv(processed_path / "X_train.csv", index=False)
X_test.to_csv(processed_path / "X_test.csv", index=False)

# Save targets (Series → CSV with single column)
y_train.to_csv(processed_path / "y_train.csv", index=False)
y_test.to_csv(processed_path / "y_test.csv", index=False)

print("Saved X_train, y_train, X_test, y_test to ../data/processed/")


Saved X_train, y_train, X_test, y_test to ../data/processed/
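
One caveat of one-hot encoding each split separately is that a rare category (for example a country with only a handful of rows) may appear in one split but not in another, which yields feature matrices with different columns. A possible safeguard, sketched here on toy data with hypothetical variable names, is to reindex every other split to the training columns:

```python
import pandas as pd

# Toy splits: 'Cuba' and 'Mexico' appear only in train, 'Japan' only in test.
train_toy = pd.DataFrame({"native-country": ["United-States", "Mexico", "Cuba"]})
test_toy = pd.DataFrame({"native-country": ["United-States", "Japan"]})

X_train_toy = pd.get_dummies(train_toy, columns=["native-country"], dtype=int)
X_test_toy = pd.get_dummies(test_toy, columns=["native-country"], dtype=int)

# Reindex the test columns to the training layout: categories missing from
# the test split become all-zero columns, and categories unseen during
# training (here 'Japan') are dropped.
X_test_aligned = X_test_toy.reindex(columns=X_train_toy.columns, fill_value=0)

print(list(X_train_toy.columns) == list(X_test_aligned.columns))
```

Comparing `X_train.columns` with `X_test.columns` (and later `X_val.columns`) is a cheap way to find out whether this alignment step is needed for the real data.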


The validation set has no `income` column, so we just transform it using the scaler from training and save the features as `X_val.csv`.


In [126]:
# Load validation inputs
val_df = pd.read_csv("../data/raw/project_validation_inputs.csv", index_col=0)

# Preprocess using the same scaler from training
X_val, _, _ = preprocess_data(val_df, scaler=scaler)

# Save features only
X_val.to_csv(processed_path / "X_val.csv", index=False)

print("Saved X_val.csv to ../data/processed/")

Saved X_val.csv to ../data/processed/



---

## Reload Processed CSVs for Training

In [127]:
# Reload features
X_train = pd.read_csv("../data/processed/X_train.csv")
X_test  = pd.read_csv("../data/processed/X_test.csv")

# Reload targets (squeezed into Series)
y_train = pd.read_csv("../data/processed/y_train.csv").squeeze("columns")
y_test  = pd.read_csv("../data/processed/y_test.csv").squeeze("columns")

# Reload validation features
X_val = pd.read_csv("../data/processed/X_val.csv")
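
The save-and-reload round trip for a target column can be checked in isolation. This sketch writes a toy Series to an in-memory buffer instead of a file (an assumption made to keep the example self-contained) and reads it back the same way as the cell above:

```python
import io

import pandas as pd

# Toy target column.
y_toy = pd.Series([0, 1, 1, 0], name="income")

# Write to an in-memory buffer; index=False drops the row index, as in the notebook.
buffer = io.StringIO()
y_toy.to_csv(buffer, index=False)
buffer.seek(0)

# read_csv returns a one-column DataFrame; squeeze("columns") turns it into a Series.
y_reloaded = pd.read_csv(buffer).squeeze("columns")

print(type(y_reloaded).__name__, y_reloaded.tolist())
```

Because the files are saved with `index=False`, features and labels line up by row position only, so the reloaded `X` and `y` should not be shuffled or filtered independently.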