## Preprocessing
After the EDA we can start to preprocess our data, which primarily includes splitting and encoding. 

### Rubric Questions
Discuss how you split the dataset and why.

Is your dataset IID? Does it have group structure?
We know that the data is iid. By definition "all samples stem from the same generative process and the generative process is assumed to have no memory of past generated samples." Here, this is the case. Also, we know that our data does not have group structure as well - "data has group structure if samples are collected from different subjects, experiments, measurement devices"

Is it a time-series data?
No. 

How should you split the dataset given your ML question to best mimic future use when you deploy the model?
The goal here is to predict relationship status, so we need to include representatives of each relationship group in the training, validation, and test set. Otherwise, there are not other apparent classes or groups. 

How many features do you have in the preprocessed data?
45 features and 1 target variable

In [1]:
#import dependencies
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

#preprocessing tools
from sklearn.model_selection import StratifiedKFold
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder, MinMaxScaler
from sklearn.model_selection import train_test_split

In [2]:
#read in data set and print
df = pd.read_csv('/users/ktoleary.13/Desktop/DATA1030_Proj/mach_rel_classification/data/data_edit1.csv', delimiter=',')
print(df.columns)
df.head()

Index(['Q1A', 'Q2A', 'Q3A', 'Q4A', 'Q5A', 'Q6A', 'Q7A', 'Q8A', 'Q9A', 'Q10A',
       'Q11A', 'Q12A', 'Q13A', 'Q14A', 'Q15A', 'Q16A', 'Q17A', 'Q18A', 'Q19A',
       'Q20A', 'testelapse', 'education', 'urban', 'gender', 'engnat', 'age',
       'hand', 'religion', 'orientation', 'race', 'voted', 'married',
       'familysize', 'major', 'marriedstr', 'Mscore', 'PITtime', 'NITtime',
       'PVHtime', 'CVHtime', 'voc_fake', 'voc_conf', 'extraver', 'agreeable',
       'conscient', 'neuroticism', 'openness'],
      dtype='object')


Unnamed: 0,Q1A,Q2A,Q3A,Q4A,Q5A,Q6A,Q7A,Q8A,Q9A,Q10A,...,NITtime,PVHtime,CVHtime,voc_fake,voc_conf,extraver,agreeable,conscient,neuroticism,openness
0,2.0,4.0,4.0,5.0,5.0,5.0,3.0,2.0,2.0,4.0,...,6799.0,12788.0,7297.0,1,11,8.0,7.0,11.0,7.0,6.0
1,4.0,4.0,3.0,4.0,5.0,2.0,5.0,5.0,2.0,4.0,...,7035.0,6411.0,7910.0,0,10,6.0,10.0,7.0,8.0,6.0
2,4.0,2.0,3.0,2.0,4.0,2.0,4.0,1.0,2.0,2.0,...,8907.0,15823.0,6442.0,0,8,6.0,6.0,8.0,7.0,4.0
3,5.0,5.0,1.0,3.0,5.0,5.0,5.0,5.0,3.0,1.0,...,8994.0,27609.0,6739.0,0,8,5.0,5.0,8.0,10.0,2.0
4,2.0,4.0,2.0,2.0,2.0,2.0,4.0,4.0,1.0,2.0,...,12294.0,4149.0,8038.0,0,10,9.0,10.0,7.0,5.0,5.0


In [3]:
#get types of each column
for name in list(df.columns):
    print(name, "contains", df[name].unique())
df.shape

Q1A contains [2. 4. 5. 3. 1. 0.]
Q2A contains [4. 2. 5. 1. 3. 0.]
Q3A contains [ 4.  3.  1.  2.  5. nan]
Q4A contains [ 5.  4.  2.  3.  1. nan]
Q5A contains [5. 4. 2. 3. 1. 0.]
Q6A contains [ 5.  2.  1.  4.  3. nan]
Q7A contains [ 3.  5.  4.  1.  2. nan]
Q8A contains [2. 5. 1. 4. 3. 0.]
Q9A contains [ 2.  3.  1.  4.  5. nan]
Q10A contains [ 4.  2.  1.  3.  5. nan]
Q11A contains [ 5.  4.  3.  2.  1. nan]
Q12A contains [5. 4. 2. 3. 1. 0.]
Q13A contains [1. 4. 5. 2. 3. 0.]
Q14A contains [ 5.  3.  4.  2.  1. nan]
Q15A contains [4. 1. 5. 3. 2. 0.]
Q16A contains [ 5.  4.  2.  1.  3. nan]
Q17A contains [1. 2. 3. 4. 5. 0.]
Q18A contains [5. 4. 1. 3. 2. 0.]
Q19A contains [ 3.  1.  5.  2.  4. nan]
Q20A contains [4. 2. 1. 5. 3. 0.]
testelapse contains [ 191.  185.  230. ... 1441. 3574. 1572.]
education contains [4 2 1 3 0]
urban contains [3 1 2 0]
gender contains [2 1 3 0]
engnat contains [2 1 0]
age contains [ 31  37  40  79  32  42  45  36  50  43  48  33  34  38  39  56  35  41
  55  62  52  4

(26043, 47)

Most data is clean, but some of the individual questions for the surveys have NaNs. Because zero has no meaning on the likert-scale, we can convert these values to zero and establish a category for the missing values. This way, we can check if there is potential meaning in these missing values but wrap them in a more usable data type. 

In [4]:
#the tranformer doesnt like the NaNs, we'll turn these into zeros, which is also used in the married column
mquestions = ['Q1A', 'Q2A', 'Q3A', 'Q4A', 'Q5A', 'Q6A', 'Q7A', 'Q8A', 'Q9A', 'Q10A','Q11A', 'Q12A', 'Q13A', 'Q14A', \
          'Q15A', 'Q16A', 'Q17A', 'Q18A', 'Q19A', 'Q20A']
#Fill the nans with 0. This is the same encoding scheme for married col. Does not adjust the mscore at all
for q in mquestions:
    df[q] = df[q].fillna(0)

#get types of each column
for name in mquestions:
    print(name, "contains", df[name].unique())

Q1A contains [2. 4. 5. 3. 1. 0.]
Q2A contains [4. 2. 5. 1. 3. 0.]
Q3A contains [4. 3. 1. 2. 5. 0.]
Q4A contains [5. 4. 2. 3. 1. 0.]
Q5A contains [5. 4. 2. 3. 1. 0.]
Q6A contains [5. 2. 1. 4. 3. 0.]
Q7A contains [3. 5. 4. 1. 2. 0.]
Q8A contains [2. 5. 1. 4. 3. 0.]
Q9A contains [2. 3. 1. 4. 5. 0.]
Q10A contains [4. 2. 1. 3. 5. 0.]
Q11A contains [5. 4. 3. 2. 1. 0.]
Q12A contains [5. 4. 2. 3. 1. 0.]
Q13A contains [1. 4. 5. 2. 3. 0.]
Q14A contains [5. 3. 4. 2. 1. 0.]
Q15A contains [4. 1. 5. 3. 2. 0.]
Q16A contains [5. 4. 2. 1. 3. 0.]
Q17A contains [1. 2. 3. 4. 5. 0.]
Q18A contains [5. 4. 1. 3. 2. 0.]
Q19A contains [3. 1. 5. 2. 4. 0.]
Q20A contains [4. 2. 1. 5. 3. 0.]


In [5]:
#DETERMINE ENCODERS
#most everything is already encodeded and bounded in a range
#categorical - questions that are answered on likert
ord_ft = ['Q1A', 'Q2A', 'Q3A', 'Q4A', 'Q5A', 'Q6A', 'Q7A', 'Q8A', 'Q9A', 'Q10A','Q11A', 'Q12A', 'Q13A', 'Q14A', \
          'Q15A', 'Q16A', 'Q17A', 'Q18A', 'Q19A', 'Q20A','education', 'extraver', 'agreeable', 'conscient', 'neuroticism', 'openness']
onehot_ft = ['race', 'voted','familysize', 'major','urban', 'gender', 'engnat', 'orientation','hand', 'religion','voc_fake', 'voc_conf']
minmax_ft = ['Mscore', 'age']                
stnd_ft = ['testelapse','PITtime', 'NITtime', 'PVHtime','CVHtime']    

#0 are people who didn't answer
df['married'].value_counts()
# stratified K Fold to represent groups

2    12521
1     8359
3     5163
Name: married, dtype: int64

Once we've established the encoders, we can define our target variables, split the data, and finish preprocessing. Because there is a distinctive group structure in our target variable, I chose to do a stratified kfold split. That way, the train,validation, and testing sets will always have a representative of the target variable's categories. 

In [6]:
#define X and y and encode
y1 = df['married']
X1 = df.loc[:, df.columns != 'married'] # all other columns are features    

#set the random state
random_state = 77

# collect all the encoders
preprocessor = ColumnTransformer(
    transformers=[
        ('ord', OrdinalEncoder(), ord_ft), #ord fts already in ascending, numeric order
        ('onehot', OneHotEncoder(sparse=False,handle_unknown='ignore'), onehot_ft),
        ('minmax', MinMaxScaler(), minmax_ft),
        ('std', StandardScaler(), stnd_ft)])

clf = Pipeline(steps=[('preprocessor', preprocessor)])

In [7]:
#Split first and then preprocess second
# do KFold split on other
kf = StratifiedKFold(n_splits=6,shuffle=True,random_state=random_state) 
#include shuffle so that young ages aren't overrepresented
for train_index, other_index in kf.split(X1,y1):
    X1_train = X1.iloc[train_index]
    y1_train = y1.iloc[train_index]
    X1_other = X1.iloc[other_index]
    y1_other = y1.iloc[other_index]
    #create another stratified fold split for validation and test set
    for val_index, test_index in kf.split(X1_other,y1_other):
        X1_val = X1.iloc[val_index]
        y1_val = y1.iloc[val_index]
        X1_test = X1.iloc[test_index]
        y1_test = y1.iloc[test_index]
    print('train balance:')
    print(y1_train.value_counts(normalize=True))
    print('val balance:')
    print(y1_val.value_counts(normalize=True))
    #apply transformer
    X_train_prep = clf.fit_transform(X1_train)
    X_val_prep = clf.transform(X1_val)
    X_test_prep = clf.transform(X1_test)

train balance:
2    0.480785
1    0.320984
3    0.198231
Name: married, dtype: float64
val balance:
2    0.461857
1    0.338585
3    0.199558
Name: married, dtype: float64
train balance:
2    0.480785
1    0.320984
3    0.198231
Name: married, dtype: float64
val balance:
2    0.455777
1    0.342454
3    0.201769
Name: married, dtype: float64
train balance:
2    0.480785
1    0.320984
3    0.198231
Name: married, dtype: float64
val balance:
2    0.455777
1    0.339690
3    0.204533
Name: married, dtype: float64
train balance:
2    0.480809
1    0.320923
3    0.198268
Name: married, dtype: float64
val balance:
2    0.463920
1    0.335361
3    0.200719
Name: married, dtype: float64
train balance:
2    0.480763
1    0.320969
3    0.198268
Name: married, dtype: float64
val balance:
2    0.461156
1    0.328725
3    0.210119
Name: married, dtype: float64
train balance:
2    0.480763
1    0.320969
3    0.198268
Name: married, dtype: float64
val balance:
2    0.463091
1    0.335361
3    0.20154