# Train-test partition

For this problem we assume that we are interested in the individuals who earn more than $50k (e.g., to offer them loans). We refer to these individuals as the positive class and to everyone else as the negative class.

Before exploring the data, we split it into training and test sets, so that the test set plays no part in any later decision. Comparing several candidate models without bias normally requires a third partition, the validation (also known as development) set. In this problem we will not hold one out; instead, we will use cross-validation to compare models.

Throughout the problem we will assume that the records come from a random sample, i.e., they are independent and identically distributed. 
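
As a rough illustration of how cross-validation can stand in for a validation set, the sketch below scores a single model with 5-fold cross-validation. The synthetic data from `make_classification`, the `LogisticRegression` model, the number of folds, and the accuracy metric are all assumptions chosen for the example; none of them come from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the training data (assumption: about 24% positives).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.76, 0.24], random_state=0
)

# Each of the 5 folds is held out once while the model is fit on the other 4.
model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")

print(f"Fold accuracies: {scores.round(3)}")
print(f"Mean accuracy: {scores.mean():.3f}")
```

Because every record is used for validation exactly once, no separate validation partition has to be carved out of the training data.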

In [1]:
%load_ext autoreload
%autoreload 2

## Dependencies

In [2]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import sys

sys.path.append('../')
from src.constants import *
from src.utils import save_information

## Reading the table

In [3]:
df = pd.read_excel(ORIGINAL_DATASET_PATH, sheet_name=SHEET_NAME)

def format_columns(df):
    """Normalize column names in place: lowercase, trimmed, spaces replaced by underscores."""
    columns = df.columns.tolist()
    formatted_columns = [col.lower().strip().replace(" ", "_") for col in columns]
    df.columns = formatted_columns

format_columns(df)

number = (df[ORIGINAL_TARGET] == POSITIVE_CLASS).sum()
percentage = (df[ORIGINAL_TARGET] == POSITIVE_CLASS).mean() * 100

print(f"Number of rows: {len(df)}.")
print(f"Number of individuals earning over $50k: {number}.")
print(f"Percentage of individuals earning over $50k: {percentage:.2f}%.")

Number of rows: 32561.
Number of individuals earning over $50k: 7841.
Percentage of individuals earning over $50k: 24.08%.
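
The column-name normalization applied by `format_columns` can be seen on a toy frame. The sketch below is a non-mutating variant (the helper name `normalized_columns` and the example column names are hypothetical, chosen only for illustration); it applies the same lowercase, strip, and replace steps.

```python
import pandas as pd


def normalized_columns(frame):
    """Return a copy of `frame` with lowercase, trimmed, underscore-separated column names."""
    return frame.rename(columns=lambda col: col.lower().strip().replace(" ", "_"))


# Hypothetical column names, used only to show the transformation.
toy = pd.DataFrame(columns=["Age", " Marital Status ", "Capital Gain"])
cleaned = normalized_columns(toy)

print(cleaned.columns.tolist())
```

Returning a new frame instead of mutating the argument is a design choice for the sketch; the in-place version used in this notebook gives the same column names.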


## Dataset split

With 32,561 observations in the sample, we hold out 10% as a test set: roughly 3,000 records are enough to estimate a model's performance on the population with reasonable precision. Because the positive class is neither small in absolute terms (7,841 records) nor a small share of the data (24.08%), we do not stratify the split on the target column.
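
One way to make "sufficient" concrete is to treat test accuracy as a binomial proportion measured on independent records, which gives the standard error sqrt(p(1 - p) / n). The sketch below evaluates it for a test set of 3,257 records. The choice of accuracy as the metric, the candidate accuracy values, and the helper name `accuracy_standard_error` are assumptions made for illustration.

```python
import math


def accuracy_standard_error(p, n):
    """Standard error of an accuracy estimate p measured on n independent records."""
    return math.sqrt(p * (1 - p) / n)


n_test = 3257  # 10% of 32,561 records, rounded up

# Hypothetical accuracy levels; p = 0.5 is the worst case for the standard error.
for p in (0.5, 0.8, 0.9):
    se = accuracy_standard_error(p, n_test)
    print(f"accuracy={p:.2f}: standard error={se:.4f}, 95% half-width={1.96 * se:.4f}")
```

Even in the worst case (p = 0.5), the approximate 95% interval half-width stays under two percentage points at this test-set size.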

In [4]:
X = df.drop(columns=ORIGINAL_TARGET)
y = df[ORIGINAL_TARGET].values

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.10, random_state=SEED)

In [5]:
print(f"Size of X_train: {len(X_train)}")
print(f"Size of y_train: {len(y_train)}\n")
print(f"Size of X_test: {len(X_test)}")
print(f"Size of y_test: {len(y_test)}")

Size of X_train: 29304
Size of y_train: 29304

Size of X_test: 3257
Size of y_test: 3257
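
The decision not to stratify can be sanity-checked by comparing the positive-class rate in each partition with and without the `stratify` argument of `train_test_split`. The sketch below does this on synthetic labels that reproduce the class counts reported above. The label vector, the placeholder feature matrix, and the seed `0` are assumptions; the notebook's own `SEED` and data are not used.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the same counts as the dataset: 7,841 positives out of 32,561.
rng = np.random.default_rng(0)
n_rows, n_positive = 32561, 7841
y = np.zeros(n_rows, dtype=int)
y[:n_positive] = 1
rng.shuffle(y)
X = np.arange(n_rows).reshape(-1, 1)  # placeholder features


def positive_rates(stratify):
    """Split 90/10 and return the positive-class rate in the train and test parts."""
    _, _, y_tr, y_te = train_test_split(
        X, y, test_size=0.10, random_state=0, stratify=stratify
    )
    return y_tr.mean(), y_te.mean()


plain_train, plain_test = positive_rates(stratify=None)
strat_train, strat_test = positive_rates(stratify=y)

print(f"Unstratified: train={plain_train:.4f}, test={plain_test:.4f}")
print(f"Stratified:   train={strat_train:.4f}, test={strat_test:.4f}")
```

If the unstratified rates differ by only a small fraction of a percentage point to a point or so, the plain random split is a reasonable choice; a larger gap would be an argument for passing `stratify=y`.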


## Saving information

In [6]:
save_information(X_train, y_train, ORIGINAL_TARGET, TRAIN_DATASET_PATH)
save_information(X_test, y_test, ORIGINAL_TARGET, TEST_DATASET_PATH)
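
The helper `save_information` lives in `src.utils` and its body is not shown here. Purely as a hypothetical stand-in, the sketch below reattaches the target column to the features and writes one table per partition. The CSV format, the function name `save_information_sketch`, and the toy data are all assumptions; the real helper may behave differently.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def save_information_sketch(X, y, target_name, path):
    """Hypothetical stand-in: append the target column to X and write it to a CSV file."""
    table = X.copy()
    table[target_name] = y  # a NumPy array is assigned positionally, row by row
    table.to_csv(path, index=False)


# Toy data, used only to exercise the sketch.
X_demo = pd.DataFrame({"age": [39, 50], "education": ["Bachelors", "HS-grad"]})
y_demo = np.array(["<=50K", ">50K"])

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "train.csv")
    save_information_sketch(X_demo, y_demo, "income", demo_path)
    restored = pd.read_csv(demo_path)

print(restored)
```

Assigning the target as a NumPy array matters after a shuffled split: the rows of `X_train` keep their original index labels, and positional assignment avoids any index alignment between features and target.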