# Load, Split, and Balance

In [13]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder

data = pd.read_csv('data.csv')

# drop rows with missing values
data = data.dropna()

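# Encode the string columns as integers, using one encoder per column
# so that each fitted mapping can still be inverted later
state_encoder = LabelEncoder()
county_encoder = LabelEncoder()
data['State'] = state_encoder.fit_transform(data['State'])
data['County'] = county_encoder.fit_transform(data['County'])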

For now, we've decided to keep the County variable because it may capture patterns or trends specific to a county that would otherwise be lost. However, because there are so many counties, it makes the model more complex, which could lead to overfitting. If it does, we'll consider removing it later.
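As a rough way to gauge how much complexity County adds, the sketch below counts distinct values per column and keeps one encoder per column so the integer codes can be mapped back to names. The toy rows and the `encoders` dictionary are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy rows standing in for data.csv
toy = pd.DataFrame({
    "State": ["Texas", "Texas", "Ohio", "Ohio"],
    "County": ["Harris", "Dallas", "Franklin", "Hamilton"],
})

# One encoder per column, so each mapping can be inverted later
encoders = {}
for col in ["State", "County"]:
    encoders[col] = LabelEncoder()
    toy[col + "_code"] = encoders[col].fit_transform(toy[col])

# Number of distinct values per column: a rough gauge of added complexity
cardinality = toy[["State", "County"]].nunique()
print(cardinality)

# Recover the original state names from the integer codes
decoded_states = encoders["State"].inverse_transform(toy["State_code"])
print(list(decoded_states))
```

On the real data, a County cardinality in the thousands would be a sign that the integer codes carry little ordinal meaning and deserve a second look if overfitting appears.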

### Balancing Dataset

We chose to use quantile thresholds for the ChildPoverty variable, dividing it into 4 classes at its quartiles. This ensures that each class has an approximately equal number of instances, which helps in balancing the dataset. Quantizing also converts the continuous variable into a categorical one, making it suitable as a target for classification.
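To illustrate why quantile thresholds give near-equal class sizes, here is a minimal sketch on synthetic data. The gamma-distributed values are an assumption made only so the example is self-contained; they are not the real ChildPoverty column.

```python
import numpy as np
import pandas as pd

# Hypothetical skewed values standing in for the ChildPoverty column
rng = np.random.default_rng(0)
poverty = pd.Series(rng.gamma(shape=2.0, scale=10.0, size=1000), name="ChildPoverty")

# retbins=True also returns the five bin edges (min, three quartiles, max)
classes, edges = pd.qcut(poverty, 4, labels=False, retbins=True)

# Class sizes come out (near-)equal even though the values are skewed
counts = classes.value_counts().sort_index()
print(counts)
print(edges)
```

The bin edges are unevenly spaced because they follow the data's quartiles, which is what keeps the class counts level.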

We should balance only the training dataset, because this ensures that the model learns equally from all classes and prevents bias towards any particular class. We shouldn't balance the test dataset, because it should represent the real-world distribution of the data. Balancing the test set would artificially alter the distribution of classes, leading to a biased estimate of the model's performance.
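The effect is easier to see on a deliberately imbalanced toy dataset. The sketch below is an illustration under assumed data (the `feature` and `label` columns are hypothetical): it downsamples only the training split and leaves the test split at its original class ratio.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 rows of class 0, 20 rows of class 1
toy = pd.DataFrame({"feature": range(100), "label": [0] * 80 + [1] * 20})

train, test = train_test_split(
    toy, test_size=0.2, random_state=42, stratify=toy["label"]
)

# Downsample every class in the training split to the size of the rarest class
n_min = train["label"].value_counts().min()
train_balanced = train.groupby("label").sample(n=n_min, random_state=42)

print(train_balanced["label"].value_counts().to_dict())  # equal counts per class
print(test["label"].value_counts().to_dict())            # original 4:1 ratio kept
```

Because the split is stratified, the test set keeps the 4:1 ratio of the full data, so metrics computed on it reflect the distribution the model would actually face.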

In [17]:
from sklearn.model_selection import train_test_split

data['ChildPovertyClass'] = pd.qcut(data['ChildPoverty'], 4, labels=False)
train_data, test_data = train_test_split(data, test_size=0.2, random_state=42, stratify=data['ChildPovertyClass'])

# Balance the training set by downsampling every class to the size of the smallest one
min_class_count = train_data['ChildPovertyClass'].value_counts().min()
train_data = train_data.groupby('ChildPovertyClass').sample(n=min_class_count, random_state=42).reset_index(drop=True)

