# Data Preprocessing

The data used to train and test our machine learning algorithms must be consistent and of high quality for the models to perform well, so we preprocess it before modeling. We will have to do the following:

- Handle missing values.
- Encode categorical features.
- Standardize numerical features.

We will be using the `preprocess_data()` function we built for a previous project that uses the same data.
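
As a rough illustration of the three steps listed above, here is a minimal sketch on a tiny made-up frame. The rows and the `toy` variable are hypothetical and only mimic the shape of the Adult Income columns; the full function further down handles the real schema.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy rows shaped like the Adult Income data (not real records).
toy = pd.DataFrame({
    "age": [25, 40, 58],
    "workclass": [" Private", " ?", " State-gov"],
    "hours-per-week": [40, 50, 30],
})

# 1) Missing values: strip whitespace, then relabel the '?' placeholder.
toy["workclass"] = toy["workclass"].str.strip().replace("?", "Unknown")

# 2) Categorical features: one 0/1 indicator column per category.
toy = pd.get_dummies(toy, columns=["workclass"], dtype=int)

# 3) Numerical features: rescale to zero mean and unit variance.
num_cols = ["age", "hours-per-week"]
toy[num_cols] = StandardScaler().fit_transform(toy[num_cols])

print(toy.round(2))
```

Relabeling `?` as its own category, instead of dropping those rows, keeps every record available for training.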

---

## Importing Libraries

In [2]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler

## `preprocess_data()` Function

In [3]:
def preprocess_data(df: pd.DataFrame, scaler: StandardScaler | None = None):
    """
    Preprocessing for the Adult Income flat file.

    - Strip strings, turn '?' -> 'Unknown'
    - Keep all rows (do not drop missing placeholders)
    - Drop 'education' (keep 'education-num')
    - One-hot encode categoricals
    - Encode target 'income' to {<=50K:0, >50K:1} if present
    - Standardize numeric columns (fit on train if scaler is None; otherwise transform)

    Returns
    -------
    X : DataFrame
    y : Series | None
    scaler : StandardScaler
    """

    # fixed columns
    cat_cols = ['workclass', 'education', 'marital-status', 'occupation',
                'relationship', 'race', 'sex', 'native-country']
    num_cols = ['age', 'fnlwgt', 'education-num', 'capital-gain',
                'capital-loss', 'hours-per-week']

    df = df.copy()
    has_target = 'income' in df.columns

    # 1) clean strings/placeholders
    for c in df.select_dtypes(include="object").columns:
        df[c] = df[c].map(lambda x: x.strip() if isinstance(x, str) else x)
    df.replace('?', 'Unknown', inplace=True)

    # 2) drop redundant education
    if 'education' in df.columns:
        df.drop(columns=['education'], inplace=True, errors='ignore')

    # 3) one-hot encode categoricals
    ohe_cols = [c for c in cat_cols if c != 'education' and c in df.columns]
    if ohe_cols:
        df = pd.get_dummies(df, columns=ohe_cols, dtype=int)

    # 4) split target
    y = None
    if has_target:
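        # Some copies of the Adult test split write labels with a trailing '.'
        # (e.g. '>50K.'), so strip it before mapping to avoid NaN targets.
        df['income'] = df['income'].str.rstrip('.').map({'>50K': 1, '<=50K': 0})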
        y = df['income'].astype('int64')
        X = df.drop(columns=['income'])
    else:
        X = df

    # 5) standardize numerics
    present_num = [c for c in num_cols if c in X.columns]
    fitted_scaler = scaler if scaler is not None else StandardScaler()
    if present_num:
        # Replace the columns outright (rather than writing through .loc) so the
        # integer columns become float64 instead of hitting a dtype clash.
        X[present_num] = X[present_num].astype('float64')
        if scaler is None:
            X[present_num] = fitted_scaler.fit_transform(X[present_num])
        else:
            X[present_num] = fitted_scaler.transform(X[present_num])

    return X, y, fitted_scaler
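
One thing to watch when calling the function on separate train and test splits: `pd.get_dummies` only creates columns for the categories it actually sees, so the two feature matrices can end up with different columns. The sketch below shows one possible way to handle that, using made-up `train` and `test` frames (hypothetical data, not the project's files): fit the scaler on the training split, reuse it on the test split, and align the test columns to the training columns.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical train/test frames; the test split lacks the 'Unknown' category.
train = pd.DataFrame({"age": [20, 30, 40],
                      "workclass": ["Private", "Unknown", "Private"]})
test = pd.DataFrame({"age": [30, 50],
                     "workclass": ["Private", "Private"]})

X_train = pd.get_dummies(train, columns=["workclass"], dtype=int)
X_test = pd.get_dummies(test, columns=["workclass"], dtype=int)

# Fit the scaler on the training split only, then reuse it on the test split
# so both splits are scaled with the same mean and standard deviation.
scaler = StandardScaler()
X_train[["age"]] = scaler.fit_transform(X_train[["age"]])
X_test[["age"]] = scaler.transform(X_test[["age"]])

# Align the one-hot columns: categories missing from the test split are added
# as all-zero columns, and any the training split never saw are dropped.
X_test = X_test.reindex(columns=X_train.columns, fill_value=0)

print(X_test)
```

With the function above, the same pattern would presumably look like `X_train, y_train, scaler = preprocess_data(train_df)`, then `X_test, y_test, _ = preprocess_data(test_df, scaler=scaler)`, followed by the same `reindex` call. Here `train_df` and `test_df` are placeholder names for the loaded splits.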