In [None]:
from ucimlrepo import fetch_ucirepo

# fetch dataset
adult = fetch_ucirepo(id=2)

# data (as pandas dataframes)
X = adult.data.features
y = adult.data.targets

# metadata
print(adult.metadata)

# variable information
print(adult.variables)



In [None]:
# importing libraries. Run this first!!

import pandas as pd
import numpy as np
import sklearn as sk

## Dataset Analysis

### About
We chose the [Adult dataset](https://archive.ics.uci.edu/dataset/2/adult) (ID 2) from the UCI Machine Learning Repository. This dataset compiles US Census information relevant to predicting income. We will use it to predict whether a sample has an income over 50K USD a year.

### Does it comply with the requirements?
The Adult dataset contains 48,842 rows, well over the required 2,000. It contains 14 features; some will not be used, but we will still consider at least 10. Finally, there are several opportunities for encountering data leakage.

First, the feature *fnlwgt* (final weight) is a census sampling weight: an estimate of how many people in the population a row represents. It is computed from the wider population rather than from the individual, which means this feature is not a property of the sample itself. Therefore, we will not consider this column during training.

Second, there are several duplicate entries in the dataset. We must remove them so that each sample appears once and carries equal weight; otherwise, the same row could land in both the training and test sets.

Finally, there are two columns pertaining to education: a string representation (*education*) and a numeric encoding of the same categories (*education-num*). The two should not be used together, as that would give education more weight than other features and would diminish samples with unknown education backgrounds. We keep the string column and drop *education-num*.
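To confirm that two columns really encode the same information, we can check that each string maps to exactly one number and vice versa. The sketch below runs on a small toy frame rather than the real data; the rows and the `is_one_to_one` name are illustrative assumptions.

```python
import pandas as pd

# Toy stand-in for the two education columns; the rows are illustrative only.
toy = pd.DataFrame({
    "education": ["Bachelors", "HS-grad", "Bachelors", "Masters"],
    "education-num": [13, 9, 13, 14],
})

# Count distinct numeric codes per string label, and distinct labels per code.
codes_per_label = toy.groupby("education")["education-num"].nunique()
labels_per_code = toy.groupby("education-num")["education"].nunique()

# The columns are redundant if the mapping is one-to-one in both directions.
is_one_to_one = bool((codes_per_label == 1).all() and (labels_per_code == 1).all())
print(is_one_to_one)
```

If the same check holds on the full feature table, dropping either column loses no information.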

Therefore, this dataset complies with the project requirements.

In [None]:
# this block will drop some columns

print('columns before dropping:\n', X.columns)

dropped_columns = [
    'fnlwgt',
    'education-num'
]

new_data = X.drop(columns=dropped_columns)

print('columns after dropping:\n', new_data.columns)

# now new_data has all the right variables

### Removing duplicates

In [None]:
# drops duplicate entries

print('num rows before dropping duplicates', len(new_data), len(y))
new_data = new_data.drop_duplicates()
# drop_duplicates keeps the original index labels, so select the labels by label (.loc)
new_labels = y.loc[new_data.index]
print('num rows after dropping duplicates', len(new_data), len(new_labels))
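The step above depends on `drop_duplicates` keeping the original index labels of the rows that survive, so the labels can be selected with the same index. A minimal sketch on toy data (the frames and values are made up for illustration):

```python
import pandas as pd

# Toy features and labels sharing the same default index; values are illustrative.
features = pd.DataFrame({
    "age": [39, 50, 39, 28],
    "workclass": ["State-gov", "Private", "State-gov", "Private"],
})
labels = pd.DataFrame({"income": ["<=50K", ">50K", "<=50K", "<=50K"]})

# duplicated() flags every repeat of an earlier row.
num_duplicates = int(features.duplicated().sum())

# The first occurrence is kept and its index label is preserved.
deduped = features.drop_duplicates()

# Selecting by label keeps features and labels aligned row for row.
aligned_labels = labels.loc[deduped.index]
print(num_duplicates, list(deduped.index), len(aligned_labels))
```

Note that this compares features only. Rows with identical features but different labels would also be collapsed to the first occurrence.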

Now that we've addressed the major sources of leakage, we can safely split our data.

In [None]:
from sklearn.model_selection import train_test_split


def split_from_seed(data: pd.DataFrame):
    # 80/20 train/test split; the fixed random_state makes it reproducible
    train, test = train_test_split(
        data,
        test_size=0.2,
        random_state=44
    )
    return train, test


train_X, test_X = split_from_seed(new_data)
train_y, test_y = new_labels.loc[train_X.index], new_labels.loc[test_X.index]

print(f'num samples in train: {len(train_X)}, should equal {len(train_y)}')
print(f'num samples in test: {len(test_X)}, should equal {len(test_y)}')

Some columns will have missing values. In that case, we can do one of two things:
1. Drop all rows with missing values
2. Replace missing values

We take the second approach. We can use sklearn's SimpleImputer inside a ColumnTransformer, which will replace missing values in the selected columns with the mode (most common value) of each column.
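A minimal sketch of mode imputation, assuming `SimpleImputer(strategy="most_frequent")` wrapped in a `ColumnTransformer`. The toy frames, the choice of `workclass` as the column to impute, and the transformer name `"mode"` are illustrative assumptions, not the final pipeline. The imputer is fit on the training split only, so the test split does not influence the fill value.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Toy train/test frames with a missing categorical value; values are illustrative.
train = pd.DataFrame({
    "workclass": pd.Series(["Private", "Private", np.nan, "State-gov"], dtype=object),
    "age": [39, 50, 28, 41],
})
test = pd.DataFrame({
    "workclass": pd.Series([np.nan, "State-gov"], dtype=object),
    "age": [33, 60],
})

# Impute only the listed columns; pass the remaining columns through unchanged.
imputer = ColumnTransformer(
    transformers=[
        ("mode", SimpleImputer(strategy="most_frequent"), ["workclass"]),
    ],
    remainder="passthrough",
)

# Learn the mode from the training split, then reuse it on the test split.
train_filled = imputer.fit_transform(train)
test_filled = imputer.transform(test)

# Transformed columns come first in the output, followed by passthrough columns.
print(train_filled)
print(test_filled)
```

The output is a NumPy array rather than a DataFrame, with the imputed column first. Depending on how the data was loaded, missing entries may also appear as a placeholder string instead of NaN, in which case they would need converting to NaN before imputing.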