<a href="https://colab.research.google.com/github/jacobhebbel/csci4150-lab1/blob/main/Projects_lab1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Dataset

### Motivation / Intended Use
This dataset is commonly used for predicting whether an individual's income exceeds 50K USD per year based on demographic and employment-related features.

This dataset contains 48,842 rows, well over the 2,000-row threshold. It has 14 features; some will not be used, but we will still consider at least 10. Finally, there are several opportunities for encountering data leakage.

First, the feature `fnlwgt` (final weight) is a census sampling weight: an estimate of how many people in the population a record represents. It is derived from the sample as a whole rather than from the individual, so it is not a property of the sample itself. Therefore, we will not use this column during training.

Second, the dataset contains several duplicate entries. We remove them so that each distinct sample carries equal weight.

Finally, two columns describe education: `education`, a string, and `education-num`, an integer encoding of the same categories. They should not both be used, because that would count the same information twice, give education extra weight, and could diminish the influence of samples with unknown education backgrounds.
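
The redundancy between the two education columns can be checked directly. The sketch below uses a small hypothetical frame (`toy`) in place of the real data, and the helper `is_one_to_one` is an illustrative name, not part of any library. It tests whether each value in one column maps to exactly one value in the other.

```python
import pandas as pd

# Hypothetical toy rows that mimic the two education columns in the Adult data.
toy = pd.DataFrame({
    "education": ["Bachelors", "HS-grad", "Bachelors", "Masters", "HS-grad"],
    "education-num": [13, 9, 13, 14, 9],
})


def is_one_to_one(df, a, b):
    """Return True if each value of `a` pairs with exactly one value of `b`, and vice versa."""
    a_to_b = df.groupby(a)[b].nunique().max() == 1
    b_to_a = df.groupby(b)[a].nunique().max() == 1
    return bool(a_to_b and b_to_a)


redundant = is_one_to_one(toy, "education", "education-num")
print("education and education-num are redundant:", redundant)
```

Running the same check on the real feature frame would confirm whether keeping only one of the two columns loses any information.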

Therefore, this dataset is compliant with the project outlines.

### Target Definition
The target variable is income, a binary classification task. The two classes are `<=50K` (income is less than or equal to 50,000 USD per year) and `>50K` (income is greater than 50,000 USD per year).

### Data Source + License/Terms
*   **Data Source:** UCI Machine Learning Repository
*   **Link:** [https://archive.ics.uci.edu/dataset/2/adult](https://archive.ics.uci.edu/dataset/2/adult)
*   **Terms:** The dataset is publicly available for research purposes.

### Feature Dictionary
Here are some key features:
*   `age`: continuous. The age of the individual.
*   `workclass`: categorical. Type of employer (e.g., Private, Self-emp-not-inc, Federal-gov).
*   `education`: categorical. The highest level of education achieved (e.g., Bachelors, HS-grad, Some-college).
*   `marital-status`: categorical. Marital status.
*   `occupation`: categorical. The individual's occupation (e.g., Tech-support, Craft-repair, Other-service).
*   `race`: categorical. Race of the individual.
*   `sex`: categorical. Gender of the individual (Male, Female).
*   `capital-gain`: continuous. Capital gains for the individual.
*   `capital-loss`: continuous. Capital losses for the individual.
*   `hours-per-week`: continuous. The number of hours worked per week.
*   `native-country`: categorical. Country of origin.

### Limitations/Risks
*   **Selection Bias:** The dataset originates from a specific census year (1994) and may not be representative of the current population.
*   **Representativeness:** The dataset heavily samples individuals from the United States, which could limit the generalizability of models trained on this data to other geographical regions.

## Data Quality Audit

### Missingness Summary

In [6]:
from ucimlrepo import fetch_ucirepo

# fetch dataset
adult = fetch_ucirepo(id=2)

# data (as pandas dataframes)
X = adult.data.features
y = adult.data.targets

# metadata
print(adult.metadata)

# variable information
print(adult.variables)



{'uci_id': 2, 'name': 'Adult', 'repository_url': 'https://archive.ics.uci.edu/dataset/2/adult', 'data_url': 'https://archive.ics.uci.edu/static/public/2/data.csv', 'abstract': 'Predict whether annual income of an individual exceeds $50K/yr based on census data. Also known as "Census Income" dataset. ', 'area': 'Social Science', 'tasks': ['Classification'], 'characteristics': ['Multivariate'], 'num_instances': 48842, 'num_features': 14, 'feature_types': ['Categorical', 'Integer'], 'demographics': ['Age', 'Income', 'Education Level', 'Other', 'Race', 'Sex'], 'target_col': ['income'], 'index_col': None, 'has_missing_values': 'yes', 'missing_values_symbol': 'NaN', 'year_of_dataset_creation': 1996, 'last_updated': 'Tue Sep 24 2024', 'dataset_doi': '10.24432/C5XW20', 'creators': ['Barry Becker', 'Ronny Kohavi'], 'intro_paper': None, 'additional_info': {'summary': "Extraction was done by Barry Becker from the 1994 Census database.  A set of reasonably clean records was extracted using the fol

In [7]:
missing_X = X.isnull().sum()
missing_X = missing_X[missing_X > 0].sort_values(ascending=False)
print('Missing values in features (X):')
print(missing_X)

missing_y = y.isnull().sum()
missing_y = missing_y[missing_y > 0].sort_values(ascending=False)
print('\nMissing values in target (y):')
print(missing_y)

Missing values in features (X):
occupation        966
workclass         963
native-country    274
dtype: int64

Missing values in target (y):
Series([], dtype: int64)


The `workclass`, `occupation`, and `native-country` columns in the features (`X`) have missing values. The original dataset marks unknown entries with `?`; during preprocessing we treat both `?` and NaN entries as a separate `Unknown` category. The target (`y`) does not have any explicit missing values.
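
To put the counts in proportion, the short calculation below divides each NaN count from the output above by the 48,842 instances reported in the dataset metadata. It covers only the NaN entries counted by `isnull()`, so any remaining `?` strings are not included in these figures.

```python
# NaN counts copied from the missingness output above; total rows from the dataset metadata.
missing_counts = {"occupation": 966, "workclass": 963, "native-country": 274}
n_rows = 48842

# Share of rows with a missing value, in percent, per column.
missing_pct = {col: 100 * n / n_rows for col, n in missing_counts.items()}
for col, pct in missing_pct.items():
    print(f"{col}: {pct:.2f}% missing")
```

Each affected column is missing in roughly 2% of rows or fewer, which is small enough that a dedicated `Unknown` category is a reasonable alternative to dropping rows.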

### Duplicate Row Check
Duplicate rows were identified and removed, comparing rows with `fnlwgt` and `education-num` excluded. Those two columns are also left out of the model inputs, as they were features that would cause leaks.
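
As a minimal sketch of this check, the example below counts and removes duplicates on a hypothetical toy frame (`X_toy`, `y_toy`) rather than the real data. It assumes, as in this notebook, that `fnlwgt` should be ignored when deciding whether two rows are the same.

```python
import pandas as pd

# Hypothetical toy features and target; the real notebook uses the Adult data.
X_toy = pd.DataFrame({
    "age": [39, 39, 50, 39],
    "fnlwgt": [77516, 12345, 83311, 99999],
    "sex": ["Male", "Male", "Male", "Female"],
})
y_toy = pd.DataFrame({"income": ["<=50K", "<=50K", ">50K", "<=50K"]})

# Compare rows without fnlwgt, so rows that differ only in their weight count as duplicates.
compare_cols = X_toy.drop(columns=["fnlwgt"])
dup_mask = compare_cols.duplicated()  # True for the second and later copies of a row

n_duplicates = int(dup_mask.sum())

# Apply the same mask to features and target so they stay aligned.
X_unique = X_toy.loc[~dup_mask].reset_index(drop=True)
y_unique = y_toy.loc[~dup_mask].reset_index(drop=True)
print(f"Removed {n_duplicates} duplicate row(s); {len(X_unique)} remain.")
```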

### Target Distribution

In [8]:
target_distribution = y.value_counts(normalize=True) * 100
print('Target variable (income) distribution:')
print(target_distribution)

Target variable (income) distribution:
income
<=50K     50.612178
<=50K.    25.459645
>50K      16.053806
>50K.      7.874370
Name: proportion, dtype: float64


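The raw labels appear in two spellings each, with and without a trailing period (`<=50K` and `<=50K.`, `>50K` and `>50K.`), so they must be merged before modeling. Once merged, the target distribution shows an imbalance, with the majority of individuals (`~76%`) earning `<=50K` and a smaller proportion (`~24%`) earning `>50K`. This class imbalance should be considered during model training and evaluation, as models might tend to predict the majority class more often.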
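
The merged class shares can be worked out from the printed proportions. The sketch below copies the four percentages from the output above into a Series (`raw` is a name chosen for this example), strips the trailing period from each label, and sums the matching entries. It also computes the accuracy a classifier would get by always predicting the majority class, which is a useful floor for judging any model.

```python
import pandas as pd

# Proportions (in percent) copied from the value_counts output above.
raw = pd.Series({
    "<=50K": 50.612178,
    "<=50K.": 25.459645,
    ">50K": 16.053806,
    ">50K.": 7.874370,
})

# Strip the trailing period so both spellings of each class share one label.
merged = raw.groupby(lambda label: label.rstrip(".")).sum()
print(merged.round(2))

# Accuracy of a classifier that always predicts the majority class.
majority_baseline = merged.max() / merged.sum()
print(f"Majority-class baseline accuracy: {majority_baseline:.4f}")
```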

### One Bias/Ethics Note
This dataset contains demographic information such as `sex`, `race`, and `native-country`. Models trained on such data could inadvertently learn biases present in the training data, leading to unfair predictions for certain demographic groups.
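
One way to look for such disparities is to compare a metric across demographic groups. The sketch below uses made-up labels and predictions, not output from the model in this notebook, and the column names `true` and `pred` are hypothetical. It computes recall for the `>50K` class separately for each `sex` group.

```python
import pandas as pd

# Hypothetical labels and predictions for a handful of people; not real model output.
results = pd.DataFrame({
    "sex":  ["Male", "Male", "Male", "Female", "Female", "Female"],
    "true": [">50K", "<=50K", ">50K", ">50K", "<=50K", ">50K"],
    "pred": [">50K", "<=50K", ">50K", "<=50K", "<=50K", ">50K"],
})

# Recall for >50K: among people who truly earn >50K, the share predicted as >50K.
positives = results[results["true"] == ">50K"]
recall_by_group = (positives["pred"] == ">50K").groupby(positives["sex"]).mean()
print(recall_by_group)
```

A large gap between groups on a check like this would be a signal to investigate further; it does not by itself establish the cause.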

In [9]:
!pip install ucimlrepo
from ucimlrepo import fetch_ucirepo
import pandas as pd
import numpy as np

adult = fetch_ucirepo(id=2)

X = adult.data.features
y = adult.data.targets



In [10]:
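# Merge target labels that differ only by a trailing period
y = y.replace('<=50K.', '<=50K').replace('>50K.', '>50K')
# Treat '?' as missing, then give all missing entries their own 'Unknown' category
X = X.replace('?', np.nan)
X = X.fillna('Unknown')

# Exclude the leaky features when comparing rows for duplicates
X_for_dedupe = X.drop(columns=['fnlwgt', 'education-num'])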

# Boolean mask for unique rows
mask = ~X_for_dedupe.duplicated()

# Apply mask to both X and y to remove the duplicate rows
X = X.loc[mask].reset_index(drop=True)
y = y.loc[mask].reset_index(drop=True)

In [11]:
# Model training
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# columns
num_cols = ["age", "capital-gain", "capital-loss", "hours-per-week"]
cat_cols = [
    "workclass", "education", "marital-status", "occupation", "race", "sex", "native-country"
]

# preprocess
preprocess = ColumnTransformer(
    transformers=[
        ("num", Pipeline([
            ("scaler", StandardScaler()),
        ]), num_cols),
        ("cat", Pipeline([
            ("onehot", OneHotEncoder(handle_unknown="ignore")),
        ]), cat_cols),
    ],
    remainder="drop"
)

# model
model = Pipeline([
    ("preprocess", preprocess),
    ("knn", KNeighborsClassifier(n_neighbors=25, weights="distance", p=1)),
])

# model results
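# y_train is a single-column DataFrame; pass it as a 1-D array to avoid a shape warning
model.fit(X_train, y_train.values.ravel())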
pred = model.predict(X_test)
print("Test accuracy:", accuracy_score(y_test, pred))

Test accuracy: 0.8448359073359073


In [12]:
# Confusion matrix
import pandas as pd
from sklearn.metrics import confusion_matrix

labels = np.unique(y_test)

cm = confusion_matrix(y_test, pred, labels=labels)

cm_df = pd.DataFrame(
    cm,
    index=[f"True {l}" for l in labels],
    columns=[f"Pred {l}" for l in labels]
)

print(cm_df)

            Pred <=50K  Pred >50K
True <=50K        5823        481
True >50K          805       1179
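
Because the classes are imbalanced, accuracy alone can hide weak performance on the minority class. The arithmetic below derives precision, recall, and F1 for the `>50K` class from the four counts in the confusion matrix above, and compares accuracy with the baseline of always predicting `<=50K`. Treating `>50K` as the positive class is a choice made for this example.

```python
# Counts copied from the confusion matrix above, treating >50K as the positive class.
tn, fp = 5823, 481   # true <=50K: predicted <=50K, predicted >50K
fn, tp = 805, 1179   # true >50K:  predicted <=50K, predicted >50K

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)   # of those predicted >50K, the share that truly are
recall = tp / (tp + fn)      # of those truly >50K, the share the model found
f1 = 2 * precision * recall / (precision + recall)
majority_baseline = (tn + fp) / total  # accuracy of always predicting <=50K

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
print(f"majority baseline={majority_baseline:.4f}")
```

The accuracy here matches the test accuracy reported earlier, while the recall shows that a substantial share of true `>50K` cases are predicted as `<=50K`.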
