# Main Notebook
This notebook is for the main analysis and experimentation.

In [305]:
# Enable autoreloading of imported modules
%load_ext autoreload
%autoreload 2

import sys
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


# Add the repo root to access the courselib
repo_root = os.path.abspath(os.path.join(os.getcwd(), ".."))
courselib_path = os.path.join(repo_root, "AppliedML", "courselib")
if courselib_path not in sys.path:
    sys.path.insert(0, courselib_path)
    print(f"{courselib_path} added to sys.path.")
else:
    print("Courselib path already in sys.path.")

from utils.preprocessing import encode_features, preprocess_data

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
Courselib path already in sys.path.


Because this project is meant to integrate with courselib, we downloaded the current GitHub repository (up to week 11), and our code will be integrated into the courselib libraries.

In [306]:
from utils.loaders import load_uciadult

# ensure the data directory exists / else create it
os.makedirs('data', exist_ok=True)

# get the data
df = load_uciadult()

Loading from local `data/adult.data`...


In [307]:
# check for missing values
df.isnull().sum()

age                  0
workclass         1836
fnlwgt               0
education            0
education-num        0
marital-status       0
occupation        1843
relationship         0
race                 0
sex                  0
capital-gain         0
capital-loss         0
hours-per-week       0
native-country     583
income               0
dtype: int64

### Data Cleaning

1.  **Handle Duplicates**: We remove any duplicate rows from the dataset.
2.  **Handle Missing Values**: Instead of removing rows with missing data, we treat missing values in the categorical columns as a distinct category called 'Missing', because the fact that a value is missing may itself carry information, especially for categorical features.
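The two cleaning steps above can be sketched as follows. This is a minimal illustration on a toy frame, not the actual `preprocess_data` from courselib: the helper name `clean_sketch` and its `categorical_cols` parameter are assumptions made for the example. It assigns the filled columns back instead of calling `fillna(..., inplace=True)` on a column selection, which is the pattern the pandas warning further down recommends.

```python
import numpy as np
import pandas as pd


def clean_sketch(df: pd.DataFrame, categorical_cols: list) -> pd.DataFrame:
    """Hypothetical stand-in for preprocess_data(): drop duplicates, fill missing categoricals."""
    # Drop exact duplicate rows; the original index labels are kept, so gaps can remain
    out = df.drop_duplicates().copy()
    # Assign the filled columns back instead of using inplace=True on a column selection
    out[categorical_cols] = out[categorical_cols].fillna("Missing")
    return out


# Toy data: rows 0 and 1 are duplicates, row 2 has a missing workclass
toy = pd.DataFrame({
    "age": [39, 39, 50, 28],
    "workclass": ["Private", "Private", np.nan, "State-gov"],
})
cleaned = clean_sketch(toy, ["workclass"])
print(cleaned)
```

Note that the index of `cleaned` keeps a gap where the duplicate was removed; calling `reset_index(drop=True)` afterwards would restore a contiguous index.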


In [None]:
from utils.preprocessing import preprocess_data

# preprocess the data
df = preprocess_data(df)


24 duplicate observations in the dataset were removed.


The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df[col].fillna('Missing', inplace=True)


# Exploratory Data Analysis

In [None]:
# basic info about the data set 
df.info()

In [None]:
# summary stats about the variables
df.describe()

In [None]:
# distribution of the target variable
plt.figure(figsize=(7, 5))
bars = df['income'].value_counts().sort_index().plot(
    kind='bar',
    color=['blue', 'red'],
    edgecolor='black'
)
plt.title('Class Balance of Income', fontsize=16)
plt.xlabel('Income (0 = <=50K, 1 = >50K)', fontsize=12)
plt.ylabel('Count', fontsize=12)
plt.xticks([0, 1], ['<=50K', '>50K'], rotation=0, fontsize=11)
plt.yticks(fontsize=11)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()

In [None]:
import seaborn as sns

# relationship between income and age
plt.figure(figsize=(10, 6))
sns.histplot(data=df, x='age', hue='income', multiple='stack', bins=30, palette='viridis')
plt.title('Age Distribution by Income Level', fontsize=16)
plt.xlabel('Age', fontsize=12)
plt.ylabel('Count', fontsize=12)
plt.legend(title='Income', labels=['>50K', '<=50K'])
plt.show()

In [None]:
# income by education
plt.figure(figsize=(12, 7))
sns.countplot(y='education', hue='income', data=df, order=df['education'].value_counts().index, palette='magma')
plt.title('Income Level by Education', fontsize=16)
plt.xlabel('Count', fontsize=12)
plt.ylabel('Education Level', fontsize=12)
plt.legend(title='Income', labels=['>50K', '<=50K'])
plt.tight_layout()
plt.show()

In [None]:
# first glimpse into the data set
df.head()

In [None]:
# Visualize the frequency distribution of native-country
df['native-country'].value_counts().head(10).plot(kind='bar')
plt.title("Top 10 native-country values by frequency")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

In [None]:
# Share of >50K earners per occupation, to check whether target encoding is reasonable
df.groupby('occupation')['income'].mean().sort_values().plot(kind='barh')
plt.title("Mean income by occupation")
plt.xlabel("P(income > 50K)")
plt.tight_layout()
plt.show()


# Feature Engineering


In [None]:

# encoding strategies for each column
encoding_strategies = {
    'one-hot': ['workclass', 'marital-status', 'relationship', 'race', 'sex'],
    # Order follows `education-num` in the UCI Adult data (Preschool = 1, ..., Doctorate = 16)
    'ordinal': {'education': ['Preschool', '1st-4th', '5th-6th', '7th-8th', '9th', '10th', '11th', '12th', 'HS-grad',
                              'Some-college', 'Assoc-voc', 'Assoc-acdm', 'Bachelors', 'Masters', 'Prof-school',
                              'Doctorate']},
    'target': ['income', 'occupation', 'native-country'],  # Target column must be the first in the list
    # 'frequency': ['native-country'],  # compare with target encoding
}

# Apply the encoding
df_encoded = encode_features(df.copy(), encoding_strategies)

# Display the first few rows of the encoded dataframe
df_encoded.head()


# Train-Test Set

Now we will split the data into train and test sets to prepare for model training and evaluation.

In [None]:
# Split encoded data into train/test sets
from sklearn.model_selection import train_test_split

X = df_encoded.drop('income', axis=1)
y = df_encoded['income']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score

# 1. Create the model object
model = LogisticRegression(max_iter=1000)

# 2. Fit the model
model.fit(X_train, y_train)

# 3. Predict on the test set
y_pred = model.predict(X_test)

# 4. Evaluate performance
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))


In [None]:
from sklearn.ensemble import RandomForestClassifier

rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
y_pred_rf = rf_model.predict(X_test)

print("RF Accuracy:", accuracy_score(y_test, y_pred_rf))
print(classification_report(y_test, y_pred_rf))

In [None]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=model.classes_)
disp.plot()
plt.title("Confusion Matrix for Logistic Regression")
plt.show()
cm_rf = confusion_matrix(y_test, y_pred_rf)
disp_rf = ConfusionMatrixDisplay(confusion_matrix=cm_rf, display_labels=rf_model.classes_)
disp_rf.plot()
plt.title("Confusion Matrix for Random Forest")
plt.show()


In [None]:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(model, X_train, y_train, cv=5, scoring='accuracy')
print("Cross-validated accuracy:", scores.mean())



## Categorical Feature Analysis

To determine the best encoding strategy, we can analyze the relationship between each categorical feature and the target variable (`income`). We'll calculate the mean income for each category to see if there's a natural ordering.


In [None]:

# Analyze the relationship between categorical features and income
categorical_cols = df.select_dtypes(include=['object', 'category']).columns.tolist()
if 'income' in categorical_cols:
    categorical_cols.remove('income') # Remove target variable

for col in categorical_cols:
    print(f"--- {col} ---")
    # Group by the column and calculate the mean of the target variable
    # We can do this because the target is 0 or 1
    print(df.groupby(col)['income'].mean().sort_values(ascending=False))
    print("\n")


# Feature Engineering

Now we will apply the encoding strategies we defined in our `preprocessing.py` file. We will create a dictionary to specify which encoding to use for each feature type.
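To make concrete what two of these encodings compute, here is a small sketch on made-up data. It is not the courselib implementation in `preprocessing.py`; the toy values and the `_freq` / `_target` column names are assumptions for illustration only.

```python
import pandas as pd

# Made-up toy data, not taken from the Adult dataset
toy = pd.DataFrame({
    "native-country": ["United-States", "United-States", "Mexico",
                       "United-States", "Mexico", "India"],
    "income": [1, 0, 0, 1, 0, 1],
})

# Frequency encoding: replace each category with its relative frequency
freq_map = toy["native-country"].value_counts(normalize=True)
toy["native-country_freq"] = toy["native-country"].map(freq_map)

# Naive target encoding: replace each category with the mean of the target.
# The means are computed on all rows here for brevity, which leaks the target;
# out-of-fold means are the usual way to avoid that.
target_map = toy.groupby("native-country")["income"].mean()
toy["native-country_target"] = toy["native-country"].map(target_map)
print(toy)
```

Frequency encoding only uses how often a category occurs, while target encoding uses the label, which is why the latter needs more care to avoid leakage.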

# What I did

(You can delete this section if you agree with these changes. If not, please let me know what you would do differently.)

In `preprocessing.py` I made the following changes:
1. Removed the extra `{}` argument from the `ordinal_encode` call, changing
    if 'ordinal' in encoding_strategies:
        df_encoded = ordinal_encode(df_encoded, encoding_strategies['ordinal'],{})
to
    if 'ordinal' in encoding_strategies:
        df_encoded = ordinal_encode(df_encoded, encoding_strategies['ordinal'])

2. To strip leading/trailing whitespace from string columns, I added the following to `preprocess_data()`:

for col in df.select_dtypes(include='object').columns:
    df[col] = df[col].str.strip()

3. To fix the “index not in index” errors, I changed the following in `target_encode()`:

    df.loc[val_indices, encoded_col_name] = val_fold_data[col].map(target_mean_map)
to
    mapped_values = val_fold_data[col].map(target_mean_map)
    df.iloc[val_indices, df.columns.get_loc(encoded_col_name)] = mapped_values
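
The error in item 3 is consistent with positional fold indices being used as index labels after duplicate rows were dropped. That is an assumption, since the body of `target_encode()` is not shown here. The sketch below reproduces the effect on a toy frame; the data, the fold indices, and the `occupation_encoded` column name are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame whose index has a gap, as after drop_duplicates() without reset_index()
df = pd.DataFrame(
    {"occupation": ["Sales", "Tech", "Sales", "Tech"], "income": [0, 1, 1, 1]},
    index=[0, 1, 3, 4],
)
df["occupation_encoded"] = np.nan

# Positional indices of a validation fold, as a KFold-style splitter would return them
val_indices = np.array([2, 3])
val_fold_data = df.iloc[val_indices]
target_mean_map = {"Sales": 0.0, "Tech": 1.0}  # made-up means from the training fold
mapped_values = val_fold_data["occupation"].map(target_mean_map)

# Label-based lookup fails: the label 2 does not exist in the index
try:
    df.loc[val_indices, "occupation_encoded"]
    loc_failed = False
except KeyError:
    loc_failed = True

# Position-based assignment works regardless of the index labels
col_pos = df.columns.get_loc("occupation_encoded")
df.iloc[val_indices, col_pos] = mapped_values.to_numpy()
print(loc_failed, df["occupation_encoded"].tolist())
```

An alternative with the same effect would be to call `reset_index(drop=True)` right after removing duplicates, so that labels and positions coincide again.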

In `main.ipynb` I made the following changes:
1. To make sure everyone can import the preprocessing helpers, I added this at the beginning:

from utils.preprocessing import encode_features, preprocess_data

2. Removed "occupation" from the "one-hot" list, so that it is only target-encoded.

And for the improvements from Kaggle:

1. Convert object columns to category to improve efficiency:

for col in df.select_dtypes(include=['object']).columns:
    df[col] = df[col].astype('category')

But this currently causes many bugs and I haven't found a clean fix yet.

2. Apply both frequency encoding and target encoding to native-country for comparison

3. Visualize the frequency distribution of high-cardinality categories (e.g., native-country)

4. Visualize the average income by occupation to evaluate whether target encoding is appropriate

I also added the train/test split.
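
Regarding the `category` conversion in item 1 of the Kaggle list: one plausible source of the bugs (an assumption, since the failing code is not shown) is that a categorical column only accepts values from its declared categories. Filling missing values with 'Missing' then fails unless that category is added first, as this toy sketch shows.

```python
import numpy as np
import pandas as pd

# Toy categorical column with one missing value
s = pd.Series(["Private", np.nan, "State-gov"], dtype="category")

# Filling with a value that is not a declared category raises an error
# (TypeError in recent pandas, ValueError in older versions)
try:
    s.fillna("Missing")
    fill_failed = False
except (TypeError, ValueError):
    fill_failed = True

# Declaring the category first makes the fill valid
filled = s.cat.add_categories(["Missing"]).fillna("Missing")
print(fill_failed, filled.tolist())
```

If this is the cause, converting to `category` only after the missing values have been filled would avoid the problem as well.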