# Data Pre-Processing

This notebook prepares the data in four steps:
* reduce the feature set
* split into train and test sets
* balance the training set
* impute missing values

In [259]:
import numpy as np
import pandas as pd

In [260]:
df = pd.read_csv("data/NSDUH_2015_RFD_Tab.tsv.gz", sep="\t", compression="gzip")
df.shape


(57146, 2682)



## Remove Features
Keep only the roughly 400 pre-screened features and the two computed RFD scores (HERRFD and TOTRFD).

In [261]:
columns = np.array(pd.read_excel('data/clean_vars.xlsx')['vars'])
columns = np.append(columns,("HERRFD","TOTRFD"))
df1 = df[columns]

In [262]:
df1.shape

(57146, 401)



## Train-Test Split

A simple stratified 80/20 split. The total score (TOTRFD), binned in steps of 0.1, is used for stratification.

In [263]:
from sklearn.model_selection import train_test_split

In [264]:
# The first 399 columns are the features; the last two are HERRFD and TOTRFD
X = df1.iloc[:,0:399]
y = df1.iloc[:,399:401]
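
The positional slices above rely on HERRFD and TOTRFD being the last two columns. As a rough sketch, the same selection can be written by column name. The toy frame and its feature names (`feat_a`, `feat_b`) are made up for illustration and are not part of the NSDUH data.

```python
import pandas as pd

# Toy stand-in for df1: two hypothetical features followed by the two scores
toy = pd.DataFrame({
    "feat_a": [1, 2, 3],
    "feat_b": [4, 5, 6],
    "HERRFD": [0.0, 0.1, 0.0],
    "TOTRFD": [0.0, 0.3, 0.2],
})

target_cols = ["HERRFD", "TOTRFD"]
X_toy = toy.drop(columns=target_cols)  # everything except the two scores
y_toy = toy[target_cols]               # the two score columns, in this order
```

Selecting by name keeps working if the column order ever changes, which the positional version would not.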

In [265]:
# Use binning for stratification
bins     = np.linspace(0, 1, 11)
y_binned = np.digitize(y.iloc[:,1], bins)
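
To make the binning concrete, here is a small sketch of how `np.digitize` labels scores with these bin edges. The example scores are invented, and the sketch assumes the scores lie between 0 and 1.

```python
import numpy as np

bins = np.linspace(0, 1, 11)                      # edges 0.0, 0.1, ..., 1.0
scores = np.array([0.0, 0.05, 0.25, 0.95, 1.0])   # made-up example scores

# With the default right=False, label i means bins[i-1] <= score < bins[i]
labels = np.digitize(scores, bins)
print(labels)  # [ 1  1  3 10 11]
```

A score of exactly 1.0 falls past the last edge and gets its own label, 11, so it forms a separate stratum.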

In [266]:
seed = 67689
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    stratify=y_binned, 
                                                    test_size=0.2, random_state=seed)
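
As a minimal illustration of what `stratify` does, the following sketch splits a toy data set with a 90/10 class ratio and checks that both parts keep that ratio. The arrays are invented for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 rows, 90 in stratum 0 and 10 in stratum 1
X_toy = np.arange(100).reshape(-1, 1)
strata = np.array([0] * 90 + [1] * 10)

X_tr, X_te, s_tr, s_te = train_test_split(
    X_toy, strata, stratify=strata, test_size=0.2, random_state=67689
)

# Both parts keep the 90/10 ratio: 72/8 in train, 18/2 in test
print(np.bincount(s_tr), np.bincount(s_te))
```

Stratification needs at least two rows in every stratum, so very sparse bins would have to be merged before splitting.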

In [267]:
df_test = pd.concat([X_test, y_test], axis=1)
df_test.shape

(11430, 401)



## Balance Training Data

Balancing the classes gives better training results.
We balance before imputing because imputation is expensive, but the balancing step does not support NaNs, so they are temporarily recoded as 77777.

In [283]:
print("Before balancing:")
print("RFD = 0:", y_train[y_train.iloc[:,1] == 0].count().iloc[1])
print("RFD > 0:", y_train[y_train.iloc[:,1] > 0].count().iloc[1])

Before balancing:
RFD = 0: 42882
RFD > 0: 2834


In [284]:
from imblearn.under_sampling import NearMiss

In [286]:
# Temporary placeholder for NaN; it is turned back into NaN before imputation
X_train = X_train.fillna(77777)
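
A placeholder value is only safe if it never occurs as a real value. The sketch below, on an invented toy frame, checks for such a collision first and then shows that the recoding can be undone without loss. The name `SENTINEL` is introduced here for illustration.

```python
import numpy as np
import pandas as pd

SENTINEL = 77777  # same placeholder value as in the notebook

toy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, 94.0, 2.0]})

# Refuse to continue if the placeholder already appears as a real value
if (toy == SENTINEL).any().any():
    raise ValueError("placeholder collides with an existing value")

filled = toy.fillna(SENTINEL)                # NaN -> 77777
restored = filled.replace(SENTINEL, np.nan)  # 77777 -> NaN again
```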

In [287]:
y_train_binned2 = (y_train.iloc[:,1] > 0).astype(int)

In [288]:
# Balance between TOTRFD == 0 and TOTRFD > 0
Xy_train = pd.concat([X_train, y_train], axis=1)

nm1 = NearMiss(version=1)
Xy_resampled_nm1, y_resampled_nm1 = nm1.fit_resample(Xy_train, y_train_binned2)
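
For intuition, the idea behind NearMiss version 1 is to keep the majority-class rows whose average distance to their nearest minority-class rows is smallest. The following is a simplified NumPy sketch of that selection rule, not imblearn's implementation. The function name `nearmiss1_indices` and the toy points are made up for illustration.

```python
import numpy as np

def nearmiss1_indices(X_major, X_minor, n_keep, k=3):
    """Indices of the n_keep majority rows closest, on average,
    to their k nearest minority rows (simplified NearMiss-1 rule)."""
    # Pairwise Euclidean distances, shape (n_major, n_minor)
    diff = X_major[:, None, :] - X_minor[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    k = min(k, X_minor.shape[0])
    # Mean distance to the k nearest minority rows, per majority row
    score = np.sort(dist, axis=1)[:, :k].mean(axis=1)
    return np.argsort(score, kind="stable")[:n_keep]

X_minor = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
X_major = np.array([[0.5, 0.5], [5.0, 5.0], [10.0, 10.0], [0.2, 0.2], [8.0, 8.0]])

# Keep as many majority rows as there are minority rows
keep = nearmiss1_indices(X_major, X_minor, n_keep=len(X_minor), k=2)
print(sorted(keep.tolist()))  # [0, 1, 3]
```

Because the rule is distance based, a large placeholder such as 77777 can dominate the distances in columns with many missing values.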

In [289]:
# Works whether fit_resample returned a NumPy array or a DataFrame
df_train_bal = pd.DataFrame(data=np.asarray(Xy_resampled_nm1), columns=df1.columns)
y_train_bal = df_train_bal.iloc[:,399:401]
df_train_bal.shape

(5668, 401)

In [290]:
print("After balancing:")
print("RFD = 0:", y_train_bal[y_train_bal.iloc[:,1] == 0].count().iloc[1])
print("RFD > 0:", y_train_bal[y_train_bal.iloc[:,1] > 0].count().iloc[1])

After balancing:
RFD = 0: 2834
RFD > 0: 2834


## Impute Missing Values

Several codes carry no information for us, as do the NaNs we recoded as 77777. We first set all of them to NaN and then let the multivariate imputation fill them in.

In [187]:
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

In [268]:
def setToNaN(df_In):
    """Return a copy of df_In with uninformative codes replaced by NaN."""
    df_Out = df_In
    # Map logically assigned codes (8x) onto their regular counterparts (9x)
    df_Out = df_Out.replace([81, 981, 9981, 99981], 91)
    df_Out = df_Out.replace([83, 983, 9983, 99983], 93)
    df_Out = df_Out.replace([89, 989, 9989, 99989], 99)
    
    # Set everything not of interest, including the 77777 placeholder, to NaN
    dropvals = [77777, 85, 985, 9985, 94, 994, 9994, 97, 997, 99997, 98, 998, 9998, "."]
    df_Out = df_Out.replace(dropvals, np.nan)
    
    return df_Out
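
A quick way to see what `setToNaN` does is to run it on a tiny frame. In the sketch below the function is repeated so the snippet runs on its own. The toy values are invented.

```python
import numpy as np
import pandas as pd

def setToNaN(df_In):
    df_Out = df_In
    # Map logically assigned codes onto their regular counterparts
    df_Out = df_Out.replace([81, 981, 9981, 99981], 91)
    df_Out = df_Out.replace([83, 983, 9983, 99983], 93)
    df_Out = df_Out.replace([89, 989, 9989, 99989], 99)
    # Set everything not of interest to NaN
    dropvals = [77777, 85, 985, 9985, 94, 994, 9994, 97, 997, 99997, 98, 998, 9998, "."]
    df_Out = df_Out.replace(dropvals, np.nan)
    return df_Out

toy = pd.DataFrame({"a": [81, 985, 5], "b": [77777, 93, 99989]})
out = setToNaN(toy)
print(out)
# Column a becomes 91, NaN, 5 and column b becomes NaN, 93, 99
```

The input frame is left unchanged, because `replace` returns a new object each time.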

In [296]:
# Replace codes without content with nan to prepare imputer
df_train_imp = setToNaN(df_train_bal)
df_test_imp = setToNaN(df_test)

In [304]:
imp = IterativeImputer(max_iter=5, random_state=0, initial_strategy='most_frequent')

In [None]:
# Imputation for train (5668 rows)
df_train_imp = imp.fit_transform(df_train_imp)

In [None]:
# Imputation for test (11430 rows); note that this refits the imputer on the test set
df_test_imp = imp.fit_transform(df_test_imp)
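
The cell above calls `fit_transform` on the test set, which fits the imputer a second time. A common alternative, shown here only as a sketch on invented toy data, is to fit on the training rows and apply the fitted imputer to the test rows with `transform`. Whether that is preferable here depends on how different the balanced training set is from the unbalanced test set.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy data: three columns, the third being the sum of the first two
rng = np.random.default_rng(0)
data = rng.normal(size=(60, 3))
data[:, 2] = data[:, 0] + data[:, 1]
data[rng.random(data.shape) < 0.1] = np.nan   # knock out about 10% of the cells
toy_train, toy_test = data[:40], data[40:]

toy_imp = IterativeImputer(max_iter=5, random_state=0, initial_strategy="most_frequent")
toy_train_imp = toy_imp.fit_transform(toy_train)  # learn the imputation models on train
toy_test_imp = toy_imp.transform(toy_test)        # reuse them on test, no refit
```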

In [None]:
# Convert the NumPy arrays returned by the imputer back to DataFrames
df_train_imp0 = pd.DataFrame(data=df_train_imp, columns=df1.columns)
df_test_imp0  = pd.DataFrame(data=df_test_imp, columns=df1.columns)

## Export Results

In [None]:
df_train_imp0.to_csv("data/train_data.tsv.gz", sep="\t", compression="gzip")
df_test_imp0.to_csv("data/test_data.tsv.gz", sep="\t", compression="gzip")
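
Since `to_csv` also writes the row index by default, the exported files start with an unnamed index column. The following sketch, using an invented toy frame and a temporary file, shows how reading the file back with `index_col=0` recovers the original columns.

```python
import os
import tempfile

import numpy as np
import pandas as pd

toy = pd.DataFrame({"feat_a": [1.5, 2.5], "TOTRFD": [0.0, 0.3]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.tsv.gz")
    toy.to_csv(path, sep="\t", compression="gzip")

    # Without index_col the index shows up as an extra "Unnamed: 0" column
    naive = pd.read_csv(path, sep="\t", compression="gzip")
    # With index_col=0 the first column is used as the index again
    restored = pd.read_csv(path, sep="\t", compression="gzip", index_col=0)

print(list(naive.columns))     # ['Unnamed: 0', 'feat_a', 'TOTRFD']
print(list(restored.columns))  # ['feat_a', 'TOTRFD']
```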