<a href="https://colab.research.google.com/github/jacobhebbel/csci4150-lab1/blob/main/Projects_lab1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Dataset

### Motivation / Intended Use
This dataset is commonly used for predicting whether an individual's income exceeds 50K USD per year based on demographic and employment-related features.

This dataset contains roughly 48,000 rows, well over 2,000. It has 14 features; some will not be used, but we still consider more than 10. Finally, the dataset offers several opportunities to encounter data leakage.

First, the feature `fnlwgt` (final weight) is a census sampling weight that estimates how many people in the population a row represents. It is computed across the whole sample rather than from the individual alone, so it is not a property of the individual sample. Therefore, we will not use this column during training.

Second, there are several duplicate entries in the dataset. We remove them so that each distinct sample carries equal weight.

Finally, there are two columns pertaining to education: a string representation (`education`) and an integer encoding of the same categories (`education-num`). The two should not be used together, as the redundancy may give education extra weight, and samples with unknown education backgrounds would be down-weighted.

Therefore, this dataset meets the project requirements.
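
The redundancy between the two education columns can be checked directly. The sketch below is only an illustration: the toy rows and the helper name `is_one_to_one` are assumptions, not part of the original notebook. It tests whether each value in one column maps to exactly one value in the other.

```python
import pandas as pd

# Toy rows standing in for the real data (values are illustrative only).
df = pd.DataFrame({
    "education": ["Bachelors", "HS-grad", "Bachelors", "Masters", "HS-grad"],
    "education-num": [13, 9, 13, 14, 9],
})

def is_one_to_one(frame, a, b):
    """Return True if every value of `a` pairs with exactly one value of `b`, and vice versa."""
    a_to_b = frame.groupby(a)[b].nunique().max() == 1
    b_to_a = frame.groupby(b)[a].nunique().max() == 1
    return bool(a_to_b and b_to_a)

redundant = is_one_to_one(df, "education", "education-num")
print("education and education-num are redundant:", redundant)
```

If the check returns `True` on the real data, keeping only one of the two columns loses no information.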

### Target Definition
The target variable is income, a binary classification task. The two classes are `<=50K` (income is less than or equal to 50,000 USD per year) and `>50K` (income is greater than 50,000 USD per year).

### Data Source + License/Terms
*   **Data Source:** UCI Machine Learning Repository
*   **Link:** [https://archive.ics.uci.edu/dataset/2/adult](https://archive.ics.uci.edu/dataset/2/adult)
*   **Terms:** The dataset is publicly available for research purposes.

### Feature Dictionary
Here are some key features:
*   `age`: continuous. The age of the individual.
*   `workclass`: categorical. Type of employer (e.g., Private, Self-emp-not-inc, Federal-gov).
*   `education`: categorical. The highest level of education achieved (e.g., Bachelors, HS-grad, Some-college).
*   `marital-status`: categorical. Marital status.
*   `occupation`: categorical. The individual's occupation (e.g., Tech-support, Craft-repair, Other-service).
*   `race`: categorical. Race of the individual.
*   `sex`: categorical. Gender of the individual (Male, Female).
*   `capital-gain`: continuous. Capital gains for the individual.
*   `capital-loss`: continuous. Capital losses for the individual.
*   `hours-per-week`: continuous. The number of hours worked per week.
*   `native-country`: categorical. Country of origin.

### Limitations/Risks
*   **Selection Bias:** The dataset originates from a single census year (1994), so it may not be representative of the current population.
*   **Representativeness:** The dataset heavily samples individuals from the United States, which could limit the generalizability of models trained on this data to other geographical regions.

## Data Quality Audit

### Missingness Summary

In [None]:
!pip install ucimlrepo
from ucimlrepo import fetch_ucirepo
import pandas as pd
import numpy as np

# fetch dataset
adult = fetch_ucirepo(id=2)

# data (as pandas dataframes)
X = adult.data.features
y = adult.data.targets





In [None]:
missing_X = X.isnull().sum()
missing_X = missing_X[missing_X > 0].sort_values(ascending=False)
print('Missing values in features (X):')
print(missing_X)

missing_y = y.isnull().sum()
missing_y = missing_y[missing_y > 0].sort_values(ascending=False)
print('\nMissing values in target (y):')
print(missing_y)

Missing values in features (X):
occupation        966
workclass         963
native-country    274
dtype: int64

Missing values in target (y):
Series([], dtype: int64)


The `workclass`, `occupation`, and `native-country` columns in the features (`X`) have missing values. In the original dataset, missing entries are indicated by `?`; during preprocessing we replace them with a new `Unknown` category. The target (`y`) does not have any explicit missing values.
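
As a rough sketch of this handling, the toy example below counts missing values after treating `?` as missing, then fills the gaps with `Unknown`. The frame `toy` and its values are made up for illustration, and it assumes a column may hold both real `NaN` values and the `?` placeholder.

```python
import numpy as np
import pandas as pd

# Toy frame; the values are invented for illustration.
toy = pd.DataFrame({
    "workclass": ["Private", "?", np.nan, "State-gov"],
    "occupation": ["Sales", "?", "Tech-support", np.nan],
    "age": [39, 50, 38, 53],
})

# Treat the "?" placeholder as missing so both encodings are counted together.
unified = toy.replace("?", np.nan)
missing_counts = unified.isnull().sum()
print(missing_counts[missing_counts > 0])

# Fill the categorical gaps with an explicit "Unknown" category.
cat_cols = ["workclass", "occupation"]
filled = unified.copy()
filled[cat_cols] = filled[cat_cols].fillna("Unknown")
print(filled)
```

Keeping `Unknown` as its own category lets the one-hot encoder treat missingness as a signal instead of discarding those rows.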

### Duplicate Row Check
Duplicate rows are identified and removed during preprocessing. Duplicates are detected after excluding `fnlwgt` and `education-num`, and these two columns are also left out of training because they could cause leakage.
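
The duplicate check can be illustrated on a small example. This is a sketch on invented rows, not the notebook's actual data: rows are compared on every column except `fnlwgt`, and the same mask is applied to the labels so features and targets stay aligned.

```python
import pandas as pd

# Invented rows: the first two differ only in fnlwgt.
toy = pd.DataFrame({
    "age": [39, 39, 50, 50],
    "education": ["Bachelors", "Bachelors", "HS-grad", "HS-grad"],
    "fnlwgt": [77516, 83311, 215646, 215646],
    "hours-per-week": [40, 40, 13, 45],
})
labels = pd.Series(["<=50K", "<=50K", ">50K", ">50K"], name="income")

# Compare rows on every column except the excluded one.
compare_cols = list(toy.columns.difference(["fnlwgt"]))
dup_mask = toy.duplicated(subset=compare_cols, keep="first")
print("Duplicate rows found:", int(dup_mask.sum()))

# Apply the same mask to features and labels so they stay aligned.
toy_unique = toy.loc[~dup_mask].reset_index(drop=True)
labels_unique = labels.loc[~dup_mask].reset_index(drop=True)
```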

### Target Distribution

In [None]:
target_distribution = y.value_counts(normalize=True) * 100
print('Target variable (income) distribution:')
print(target_distribution)

Target variable (income) distribution:
income
<=50K     50.612178
<=50K.    25.459645
>50K      16.053806
>50K.      7.874370
Name: proportion, dtype: float64


The target distribution shows an imbalance. The raw labels appear in two spellings (with and without a trailing period); once these are merged, about 76% of individuals earn `<=50K` and about 24% earn `>50K`. This class imbalance should be considered during model training and evaluation, as models might tend to predict the majority class more often.
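
The merged class shares can be worked out from the percentages printed above. The sketch below only re-adds those four numbers; the variable names are illustrative.

```python
# Percentages copied from the value_counts output above.
raw_shares = {
    "<=50K": 50.612178,
    "<=50K.": 25.459645,
    ">50K": 16.053806,
    ">50K.": 7.874370,
}

# Merge the spellings that differ only by a trailing period.
merged = {}
for label, pct in raw_shares.items():
    key = label.rstrip(".")
    merged[key] = merged.get(key, 0.0) + pct

for label, pct in merged.items():
    print(f"{label}: {pct:.2f}%")

# A classifier that always predicts the majority class reaches this accuracy.
majority_baseline = max(merged.values())
print(f"Majority-class baseline: {majority_baseline:.2f}%")
```

The majority-class baseline is a useful reference point: any accuracy reported later should be read against it.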

### One Bias/Ethics Note
This dataset contains demographic information such as `sex`, `race`, and `native-country`. Models trained on such data could inadvertently learn biases present in the training data, leading to unfair predictions for certain demographic groups.
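
One simple way to look for such effects is to compare metrics per demographic group. The sketch below uses a hypothetical `audit` frame with invented predictions (1 = predicted `>50K`); it is not output from the models in this notebook.

```python
import pandas as pd

# Hypothetical labels and predictions for a handful of rows.
audit = pd.DataFrame({
    "sex": ["Male", "Male", "Male", "Female", "Female", "Female"],
    "y_true": [1, 0, 1, 1, 0, 0],
    "y_pred": [1, 0, 1, 0, 0, 0],
})

# Per-group accuracy and share of positive predictions.
audit["correct"] = (audit["y_true"] == audit["y_pred"]).astype(int)
by_group = audit.groupby("sex").agg(
    accuracy=("correct", "mean"),
    predicted_positive_rate=("y_pred", "mean"),
)
print(by_group)
```

Large gaps between groups in either column would be a reason to inspect the model more closely.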

In [None]:
# Convert X, y into correct values
y = y.replace('<=50K.', '<=50K').replace('>50K.', '>50K')
X = X.replace('?', np.nan)
X = X.fillna('Unknown')

# Drop leaky features
X_for_dedupe = X.drop(columns=['fnlwgt', 'education-num'])

# Boolean mask for unique rows
mask = ~X_for_dedupe.duplicated()

# Apply mask to both X and y to remove the duplicate rows
X = X.loc[mask].reset_index(drop=True)
y = y.loc[mask].reset_index(drop=True)

In [None]:
# Model training
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# columns
num_cols = ["age", "capital-gain", "capital-loss", "hours-per-week"]
cat_cols = [
    "workclass", "education", "marital-status", "occupation", "race", "sex", "native-country"
]

# preprocess
preprocess = ColumnTransformer(
    transformers=[
        ("num", Pipeline([
            ("scaler", StandardScaler()),
        ]), num_cols),
        ("cat", Pipeline([
            ("onehot", OneHotEncoder(handle_unknown="ignore")),
        ]), cat_cols),
    ],
    remainder="drop"
)

# model
model = Pipeline([
    ("preprocess", preprocess),
    ("knn", KNeighborsClassifier(n_neighbors=25, weights="distance", p=1)),
])

# model results
# Pass y as a 1-D array; a single-column DataFrame triggers a shape warning
model.fit(X_train, y_train.values.ravel())
pred = model.predict(X_test)
print("Test accuracy:", accuracy_score(y_test, pred))


Test accuracy: 0.8448359073359073


In [None]:
# Confusion matrix
import pandas as pd
from sklearn.metrics import confusion_matrix

labels = np.unique(y_test)

cm = confusion_matrix(y_test, pred, labels=labels)

cm_df = pd.DataFrame(
    cm,
    index=[f"True {l}" for l in labels],
    columns=[f"Pred {l}" for l in labels]
)

print(cm_df)

            Pred <=50K  Pred >50K
True <=50K        5823        481
True >50K          805       1179
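
Because the classes are imbalanced, accuracy alone hides how the minority class is handled. The sketch below derives precision and recall for the `>50K` class from the four counts in the confusion matrix above; it shows that recall for `>50K` is roughly 0.59, well below the overall accuracy.

```python
# Counts from the KNN confusion matrix above (rows = true, columns = predicted).
tn, fp = 5823, 481   # true <=50K
fn, tp = 805, 1179   # true >50K

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of predicted >50K, the share that is correct
recall = tp / (tp + fn)      # of true >50K, the share that is found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f}")
print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```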


# Linear SVC Model

In [None]:
from sklearn.svm import LinearSVC
from sklearn.model_selection import GridSearchCV
import matplotlib.pyplot as plt

# model
model = Pipeline([
    ("preprocess", preprocess),
    ("svm", LinearSVC())
])

param_grid = {
    "svm__C": [0.01, 0.1, 1, 10],
    "svm__class_weight": [None, "balanced"]
}

grid = GridSearchCV(
    model,
    param_grid,
    cv=3,
    scoring="accuracy",
    n_jobs=-1
)

# model results
# Pass y as a 1-D array to avoid a shape warning
grid.fit(X_train, y_train.values.ravel())

best_model = grid.best_estimator_
pred = best_model.predict(X_test)

print("Best params:", grid.best_params_)
print("Test accuracy:", accuracy_score(y_test, pred))


Best params: {'svm__C': 0.1, 'svm__class_weight': None}
Test accuracy: 0.855815637065637


### Results

Using the same train/test split (`random_state=0`) for both the KNN and Linear SVC models, we obtained the following test accuracies:

KNN (k=25): 0.8448

Linear SVC: 0.8558

For Linear SVC, we ran a grid search over the regularization strength `C` and `class_weight`, and found that `C=0.1` with `class_weight=None` performed best.

# Calibrated Classifier Cross Validation with Sigmoid (Platt Scaling)


In [None]:
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import accuracy_score, confusion_matrix

# Encode y as binary 0/1 (1 for ">50K", 0 for "<=50K")
y_train_bin = (y_train.iloc[:, 0] == ">50K").astype(int)
y_test_bin  = (y_test.iloc[:, 0]  == ">50K").astype(int)

calibrated_clf = CalibratedClassifierCV(
    estimator=model,     # the Pipeline defined above (preprocess + default LinearSVC, not the tuned best_model)
    method="sigmoid",
    cv=5
)

calibrated_clf.fit(X_train, y_train_bin)

# Probabilities for positive class
probs = calibrated_clf.predict_proba(X_test)[:, 1]

# Cost-sensitive threshold: C_FP / (C_FP + C_FN)
# (equal costs give 0.5; e.g. C_FP=1, C_FN=10 would give 1/11 ≈ 0.091)
C_FP = 1
C_FN = 1
threshold = C_FP / (C_FP + C_FN)


y_pred_cost_sensitive = (probs >= threshold).astype(int)

print("Cost-sensitive accuracy:", accuracy_score(y_test_bin, y_pred_cost_sensitive))
print("Confusion matrix:\n", confusion_matrix(y_test_bin, y_pred_cost_sensitive))



Cost-sensitive accuracy: 0.8560569498069498
Confusion matrix:
 [[5914  390]
 [ 803 1181]]


In [None]:
#Threshold
thresholds = np.linspace(0.001, 0.999, 1000)
costs = []

def expected_cost(y_true, y_pred, C_FP, C_FN):
    FP = ((y_true == 0) & (y_pred == 1)).sum()
    FN = ((y_true == 1) & (y_pred == 0)).sum()
    return FP * C_FP + FN * C_FN

for t in thresholds:
    preds = (probs >= t).astype(int)
    cost = expected_cost(y_test_bin, preds, C_FP, C_FN)
    costs.append(cost)

best_idx = np.argmin(costs)
best_threshold = thresholds[best_idx]

print("Best threshold:", best_threshold)
print("Validation cost:", costs[best_idx])

y_pred_cost_sensitive = (probs >= best_threshold).astype(int)

print("Cost-sensitive accuracy:", accuracy_score(y_test_bin, y_pred_cost_sensitive))
print("Confusion matrix:\n", confusion_matrix(y_test_bin, y_pred_cost_sensitive))

Best threshold: 0.49450550550550554
Validation cost: 1182
Cost-sensitive accuracy: 0.8573841698841699
Confusion matrix:
 [[5907  397]
 [ 785 1199]]


# Threshold Results
For our problem, we assigned equal costs to false positives (C_FP=1) and false negatives (C_FN=1), since we saw no reason to value one over the other. This gives a default threshold of 0.500. We then ran a threshold sweep and found that the best threshold was approximately 0.495, which slightly improved accuracy from 0.856 to 0.857. The sweep was run on the test-set predictions, so this small gain is likely optimistic.
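
The default threshold follows from comparing expected costs. Assuming calibrated probabilities and zero cost for correct predictions, predicting positive for a sample with probability p costs (1 - p) * C_FP on average, while predicting negative costs p * C_FN. Predicting positive is cheaper once p >= C_FP / (C_FP + C_FN); with C_FP = 1 this reduces to 1 / (1 + C_FN). The helper names below are illustrative, not from the notebook.

```python
def cost_threshold(c_fp, c_fn):
    """Probability cutoff that minimises expected cost, assuming calibrated
    probabilities and zero cost for correct predictions."""
    return c_fp / (c_fp + c_fn)

def expected_costs(p, c_fp, c_fn):
    """Expected cost of predicting positive vs. negative when P(y=1) = p."""
    return (1 - p) * c_fp, p * c_fn

# Equal costs, as used in this notebook.
t_equal = cost_threshold(1, 1)
print("Equal costs:", t_equal)

# Hypothetical case: a false negative costs ten times more than a false positive.
t_skewed = cost_threshold(1, 10)
print("C_FP=1, C_FN=10:", round(t_skewed, 3))

# At the threshold, both decisions have the same expected cost.
print(expected_costs(t_equal, 1, 1))
```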

# Confusion Matrices
Default:

 [[5914  390]

 [ 803 1181]]

Best:

  [[5907  397]

 [ 785 1199]]
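
The difference between the two matrices can be quantified directly. The sketch below recomputes total cost and accuracy from the counts listed above, using the notebook's equal costs; the helper names are illustrative.

```python
# Confusion matrices reported above (rows = true, columns = predicted).
default_cm = [[5914, 390], [803, 1181]]
best_cm = [[5907, 397], [785, 1199]]
C_FP, C_FN = 1, 1

def total_cost(cm, c_fp, c_fn):
    """Total misclassification cost: false positives and false negatives weighted by cost."""
    fp, fn = cm[0][1], cm[1][0]
    return fp * c_fp + fn * c_fn

def accuracy(cm):
    """Share of correct predictions (diagonal over total)."""
    correct = cm[0][0] + cm[1][1]
    total = sum(sum(row) for row in cm)
    return correct / total

default_cost = total_cost(default_cm, C_FP, C_FN)
best_cost = total_cost(best_cm, C_FP, C_FN)
print("Default threshold: cost", default_cost, "accuracy", round(accuracy(default_cm), 4))
print("Best threshold:    cost", best_cost, "accuracy", round(accuracy(best_cm), 4))
print("Errors avoided:", default_cost - best_cost)
```

With equal costs, the total cost is simply the number of misclassified samples, so minimising cost and maximising accuracy select the same threshold.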