## Equidistant train-test split and k-fold cross-validation

The purpose of the custom train/test split and k-fold cross-validation routine is to stratify the data by location without using the 'State' identifier. Stratifying county data by both 'State' and the 'METRO13' indicator (whether the county is in a metro area) produces imbalanced strata for states with a disproportionately high number of metro counties, or for states with few counties overall.

This notebook iterates through a randomly permuted list of row indices. At each step it adds one index to the test set and m indices to the training set, where m is the value of the 'training_rows_per_test_row' variable, so the resulting test fraction is 1/(1 + m). The m training indices belong to the m counties closest to the test county that have not already been assigned to the training or test set. This process is applied to the metro counties and the nonmetro counties separately and the results are then combined, so that the split is also stratified by the 'METRO13' indicator. A copy of the training data is saved to a .csv file before the StandardScaler is applied and cross-validation begins.
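The allocation rule described above can be illustrated on toy data. The following is a minimal sketch, not the notebook's implementation: it assumes hypothetical 2-D centroids in `coords` and computes distances directly, whereas the notebook looks up a precomputed table of the 20 nearest counties. The function name `neighbor_split` is made up for this example.

```python
import numpy as np

def neighbor_split(coords, m, rng):
    """Toy split: take one test point, then send its m nearest unassigned points to training."""
    remaining = list(rng.permutation(len(coords)))
    train, test = [], []
    while remaining:
        i = remaining.pop(0)          # next index in the shuffled order becomes a test point
        test.append(i)
        # Rank the still-unassigned points by distance to the new test point
        by_distance = sorted(remaining, key=lambda j: np.linalg.norm(coords[j] - coords[i]))
        for j in by_distance[:m]:     # fewer than m may be left at the very end
            train.append(j)
            remaining.remove(j)
    return train, test

rng = np.random.default_rng(0)
coords = rng.uniform(size=(50, 2))    # hypothetical county centroids
train, test = neighbor_split(coords, m=4, rng=rng)
print(len(train), len(test))          # 40 10, i.e. a test fraction of 1/(1 + 4)
```

Because every test point claims its nearest unassigned neighbors for training, test points end up spread across the map rather than clustered.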

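Then, for cross-validation, the same process is applied repeatedly to the training set to divide it into k subsets of approximately equal size (we have chosen k = 5). Each pass splits off one subset, with m set to one less than the number of subsets still to be formed, so k - 1 passes perform an actual split and the rows left over form the last subset. From these subsets, training and holdout sets are defined and saved to .csv files along with the final test set.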
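To see why peeling off one subset at a time gives folds of nearly equal size, the arithmetic can be sketched without any data. This is an illustrative helper (the name `peel_fold_sizes` and the row count 1003 are assumptions, not values from the notebook). It relies on the fact that each step of the split takes a holdout row first, so a final incomplete group still contributes one holdout row, which is a ceiling division.

```python
def peel_fold_sizes(n, k):
    """Sizes of the k folds when one fold is split off per pass, for j = k, k-1, ..., 1 folds left."""
    sizes = []
    remaining = n
    for j in range(k, 0, -1):
        # One holdout row per (j - 1) training rows: ceil(remaining / j) rows land in this fold
        fold = -(-remaining // j)
        sizes.append(fold)
        remaining -= fold
    return sizes

print(peel_fold_sizes(1003, 5))   # [201, 201, 201, 200, 200]
```

With j folds left, splitting off 1/j of the remaining rows always removes about n/k rows, so the fold sizes differ by at most one.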

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

np.random.seed(689503320)

import seaborn as sns
sns.set_style('whitegrid')

from sklearn.preprocessing import StandardScaler

In [2]:
data = pd.read_csv('../data/data_all.csv')

In [3]:
# Include calculated data of the 20 closest neighbors to each county
closest_neighboring_counties = pd.read_csv('../data/closest_neighboring_counties.csv')
MAX_NEIGHBORS = len(closest_neighboring_counties.columns) - 2

In [4]:
closest_neighboring_counties.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3142 entries, 0 to 3141
Data columns (total 22 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   FIPS        3142 non-null   int64 
 1   County      3142 non-null   object
 2   NEAREST_1   3142 non-null   int64 
 3   NEAREST_2   3142 non-null   int64 
 4   NEAREST_3   3142 non-null   int64 
 5   NEAREST_4   3142 non-null   int64 
 6   NEAREST_5   3142 non-null   int64 
 7   NEAREST_6   3142 non-null   int64 
 8   NEAREST_7   3142 non-null   int64 
 9   NEAREST_8   3142 non-null   int64 
 10  NEAREST_9   3142 non-null   int64 
 11  NEAREST_10  3142 non-null   int64 
 12  NEAREST_11  3142 non-null   int64 
 13  NEAREST_12  3142 non-null   int64 
 14  NEAREST_13  3142 non-null   int64 
 15  NEAREST_14  3142 non-null   int64 
 16  NEAREST_15  3142 non-null   int64 
 17  NEAREST_16  3142 non-null   int64 
 18  NEAREST_17  3142 non-null   int64 
 19  NEAREST_18  3142 non-null   int64 
 20  NEAREST_

In [5]:
# Merge with county census data / closest neighboring counties
data = pd.merge(left = data.sort_values(by='FIPS'), right = closest_neighboring_counties.sort_values(by='FIPS'), how='inner')
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3142 entries, 0 to 3141
Data columns (total 57 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   State                      3142 non-null   object 
 1   County                     3142 non-null   object 
 2   SNAPSPTH17                 3142 non-null   float64
 3   REDEMP_SNAPS17             3142 non-null   float64
 4   PCT_SNAP17                 3142 non-null   float64
 5   PC_SNAPBEN17               3142 non-null   float64
 6   PCT_NSLP17                 3142 non-null   float64
 7   PCT_SBP17                  3142 non-null   float64
 8   PCT_SFSP17                 3142 non-null   float64
 9   PCT_WIC17                  3142 non-null   float64
 10  PCT_CACFP17                3142 non-null   float64
 11  PCT_OBESE_ADULTS17         3142 non-null   float64
 12  GROCPTH16                  3142 non-null   float64
 13  SUPERCPTH16                3142 non-null   float

In [6]:
# Metro and nonmetro counties
data_metro = data.loc[data['METRO13'] == 1].copy()
data_nonmetro = data.loc[data['METRO13'] == 0].copy()

In [7]:
# Train/test split with a test size of 20%: each test county's closest unassigned neighbors (out of at most 20) become training data
training_rows_per_test_row = 4

In [8]:
# Designates 1 test sample for every training_rows_per_test_row training samples
def remove_county_and_neighbors(df, remaining_indices, train_indices, test_indices):
    
    # Move next index to test set
    i = remaining_indices[0]
    county = df.loc[i]  # i is an index label taken from slice.index, so look it up by label
    test_indices.append(i)
    remaining_indices = remaining_indices[1:]
    
    # Move indices of closest neighboring counties to training set
    a = 0
    j = 0
    while a < training_rows_per_test_row and j < MAX_NEIGHBORS:
        neighbor_fips = county['NEAREST_' + str(j + 1)]
        neighbor = df.loc[df['FIPS'] == neighbor_fips]
        if len(neighbor.index) > 0:
            ind = neighbor.index[0]
        else:
            ind = -1
        if ind in remaining_indices:
            train_indices.append(ind)
            remaining_indices.remove(ind)
            a += 1
        j += 1
    # If too few unassigned neighbors were found, fill up the training set with the next indices in the permutation
    while a < training_rows_per_test_row and len(remaining_indices) > 0:
        train_indices.append(remaining_indices[0])
        remaining_indices = remaining_indices[1:]
        a += 1
    return remaining_indices, train_indices, test_indices

In [9]:
def distance_preserving_train_test_split(df, slice):
    remaining_indices = list(slice.index)
    remaining_indices = list(np.random.permutation(remaining_indices))
    train_indices = []
    test_indices = []

    # Iterate through shuffled indices, adding one row to the test set and training_rows_per_test_row rows as training samples
    while len(remaining_indices) > 0:
        remaining_indices, train_indices, test_indices = remove_county_and_neighbors(df, remaining_indices, train_indices, test_indices)
    # The collected indices are index labels, so select rows by label
    df_train = df.loc[train_indices]
    df_test = df.loc[test_indices]
    return df_train, df_test

In [10]:
# Fit a StandardScaler on the numeric columns of the full dataset (before splitting)
scaler = StandardScaler()
categorical_variables = ['State','County','FIPS','PERPOV10','METRO13']
categorical_data = data[categorical_variables]
neighboring_counties_labels = ['NEAREST_' + str(j + 1) for j in range(MAX_NEIGHBORS)]
neighboring_data = data[neighboring_counties_labels]
noncategorical_data = data.drop(columns=categorical_variables + neighboring_counties_labels)
noncategorical_labels = noncategorical_data.columns

scaler.fit(noncategorical_data)

In [11]:
data = pd.concat([categorical_data, noncategorical_data, neighboring_data],axis=1)

In [12]:
metro_train, metro_test = distance_preserving_train_test_split(df=data, slice=data_metro)

In [13]:
nonmetro_train, nonmetro_test = distance_preserving_train_test_split(df=data, slice=data_nonmetro)

In [14]:
data_train = pd.concat([metro_train, nonmetro_train])
data_test = pd.concat([metro_test, nonmetro_test])

In [15]:
# Export unscaled data
data_train.drop(neighboring_counties_labels,axis=1).to_csv('../data/data_train_unscaled.csv', index=False)

In [16]:
# Now scale the data
data_unscaled = pd.concat([data_train, data_test])

categorical_data = data_unscaled[categorical_variables]
neighboring_data = data_unscaled[neighboring_counties_labels]
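noncategorical_data = data_unscaled.drop(columns=categorical_variables + neighboring_counties_labels)
noncategorical_labels = noncategorical_data.columns

data_scaled = scaler.transform(noncategorical_data)
# Keep the original index so that pd.concat(axis=1) aligns each scaled row with its own county
noncategorical_data = pd.DataFrame(data_scaled, columns=noncategorical_labels, index=noncategorical_data.index)
# Sort by index so that index labels equal row positions again
data = pd.concat([categorical_data, noncategorical_data, neighboring_data], axis=1).sort_index()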
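A note on index alignment when wrapping the scaler output: `pd.concat(..., axis=1)` matches rows by index label, not by position, while a DataFrame built from a bare NumPy array gets a fresh 0..n-1 index. The toy example below (hypothetical three-row frame, with a division standing in for `scaler.transform`) sketches what happens when the rows have been shuffled.

```python
import numpy as np
import pandas as pd

# Hypothetical frame whose index labels are out of positional order, as after a shuffled split
df = pd.DataFrame({'name': ['a', 'b', 'c'], 'x': [10.0, 20.0, 30.0]}, index=[2, 0, 1])
values = df[['x']].to_numpy() / 10.0   # stand-in for scaler.transform(...), returns a plain array

# Without an explicit index the array rows are labeled 0, 1, 2 and get paired with the wrong names
misaligned = pd.concat([df[['name']], pd.DataFrame(values, columns=['x_scaled'])], axis=1)

# Passing the original index keeps every scaled value on its own row
aligned = pd.concat([df[['name']], pd.DataFrame(values, columns=['x_scaled'], index=df.index)], axis=1)

print(misaligned.loc[2, 'x_scaled'])   # 3.0, although row 'a' (label 2) had x = 10.0
print(aligned.loc[2, 'x_scaled'])      # 1.0
```

Passing `index=` when rebuilding the DataFrame is usually enough to avoid this kind of silent mismatch.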

In [17]:
# Set aside scaled test data
data_test = data.loc[data_test.index].drop(neighboring_counties_labels, axis=1)
data_train = data.drop(data_test.index,axis=0)

In [18]:
# Cross-validation on scaled data

data_metro = data_train.loc[data_train['METRO13'] == 1].copy()
data_nonmetro = data_train.loc[data_train['METRO13'] == 0].copy()

In [19]:
# Create sets for k-fold cross-validation
k = 5

metro_to_k_split = data_metro.copy()
nonmetro_to_k_split = data_nonmetro.copy()

metro_k_splits = []
nonmetro_k_splits = []

# Split off one fold per pass without overwriting k
for folds_left in range(k, 0, -1):
    # Global read by remove_county_and_neighbors: with j folds left, each holdout row claims j - 1 rows
    training_rows_per_test_row = folds_left - 1
    metro_remaining, metro_split = distance_preserving_train_test_split(df=data, slice=metro_to_k_split)
    metro_k_splits.append(metro_split)
    metro_to_k_split = metro_remaining.copy()
    nonmetro_remaining, nonmetro_split = distance_preserving_train_test_split(df=data, slice=nonmetro_to_k_split)
    nonmetro_k_splits.append(nonmetro_split)
    nonmetro_to_k_split = nonmetro_remaining.copy()

In [20]:
# Define holdout and training sets and drop closest neighbor data
holdout_sets = [pd.concat([metro_k_splits[i], nonmetro_k_splits[i]]).drop(neighboring_counties_labels,axis=1) for i in range(len(metro_k_splits))]
training_sets = [pd.concat(holdout_sets[:i] + holdout_sets[i+1:]) for i in range(len(metro_k_splits))]

In [21]:
# Save training, test, and holdout data
for i in range(len(holdout_sets)):
    holdout_sets[i].to_csv('../data/data_holdout_' + str(i) + '.csv')
    training_sets[i].to_csv('../data/data_train_' + str(i) + '.csv')
data_test.to_csv('../data/data_test.csv')