# Preprocessing for classification model

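This notebook preprocesses the filtered dataset for a classification model: it imputes 'unknown' categories, builds a few derived features, splits the data, and encodes and scales the predictors.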

In [2068]:
#Importing all libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import math
import os
from itertools import combinations
from scipy.stats import chi2_contingency
import category_encoders as ce
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import OneHotEncoder


In [2069]:
#Importing dataset cleaned and filtered from EDA
profiles = pd.read_csv('../data/profiles_preprocessed1.csv', index_col=False)
profiles = profiles.loc[:, ~profiles.columns.str.contains('^Unnamed')]
profiles.head()

Unnamed: 0,age,body_type,diet,drinks,drugs,education,ethnicity,height,job,offspring,orientation,religion,sex,sign,smokes,status,ethnicity_grouped,dogs,cats
0,22,large,anything,a little,never,college,"asian, white",75.0,service,no kids,straight,agnosticism,m,gemini,yes,single,"asian, white",likes dogs,likes cats
1,35,regular,other,a lot,sometimes,other,white,70.0,service,no kids,straight,agnosticism,m,cancer,no,single,white,likes dogs,likes cats
2,38,regular,anything,a little,unknown,masters,,68.0,unknown,unknown,straight,unknown,m,pisces,no,available,,no dogs,has cats
3,23,regular,vegetarian,a little,unknown,college,white,71.0,student,unsure,straight,unknown,m,pisces,no,single,white,no dogs,likes cats
4,29,regular,unknown,a little,never,college,"asian, black, other",66.0,creative,unknown,straight,unknown,m,aquarius,no,single,rare_ethnicity,likes dogs,likes cats


## Scoping the problem

For the classification problem, we want to predict what body_type the user belongs to, using habits as predictors.

Habits:
- diet
- drinks
- drugs
- smokes

Also, age, sex and height will be considered.

### Filtering the dataset.

Dropping all variables that are outside the scope.

In [2070]:
profiles = profiles.drop(['education', 'ethnicity_grouped', 'ethnicity', 'job', 'offspring', 'orientation', 'religion', 'sign', 'status', 'dogs', 'cats'], axis=1)

In [2071]:
profiles.head()

Unnamed: 0,age,body_type,diet,drinks,drugs,height,sex,smokes
0,22,large,anything,a little,never,75.0,m,yes
1,35,regular,other,a lot,sometimes,70.0,m,no
2,38,regular,anything,a little,unknown,68.0,m,no
3,23,regular,vegetarian,a little,unknown,71.0,m,no
4,29,regular,unknown,a little,never,66.0,m,no


## Imputation and second cardinality reduction for categorical variables

Defining functions for reusable code:

1) Printing proportions with and without the value 'unknown'
2) Imputation of 'unknown' value according to the distribution of the rest of the values in the variable.

In [2072]:
def pre_encode_cat(variable):  # prints the preliminary counts and proportions before encoding
    count = profiles[variable].value_counts()
    print(f"Value counts for {count}")
    print(f"Proportions for {profiles[variable].value_counts(1)}")
    proportions = (profiles.loc[profiles[variable] != 'unknown', variable]).value_counts(1)
    print(f"Proportions without the value 'unknown' {proportions}")

In [2073]:
def prop_imputer(variable):
    # Proportions of the known values (everything except 'unknown')
    proportions = profiles.loc[profiles[variable] != 'unknown', variable].value_counts(normalize=True)
    # Indices where variable == 'unknown', shuffled for random assignment
    unknown_idx = profiles[profiles[variable] == 'unknown'].index
    shuffled_idx = np.random.permutation(unknown_idx)
    n_unknown = len(unknown_idx)
    # Number of unknowns to assign to each known value
    n_assign = [int(round(p * n_unknown)) for p in proportions]
    # Assign each value to its own slice of the shuffled indices (one pass, once all counts are known)
    start = 0
    for value, n in zip(proportions.index, n_assign):
        profiles.loc[shuffled_idx[start:start + n], variable] = value
        start += n
    # Rounding can leave a few rows unassigned; give those to the most frequent value
    profiles.loc[shuffled_idx[start:], variable] = proportions.index[0]
    print(f"new values: {profiles[variable].value_counts()}")
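
The proportional imputation described above can also be sketched with a seeded random draw, which makes the result reproducible between runs. This is a minimal illustration on toy data, not the notebook's method: the helper name `impute_by_proportion` and its `seed` parameter are assumptions made for this example. Each replacement is sampled independently, so the final proportions match the known ones only approximately, whereas `prop_imputer` assigns fixed counts.

```python
import numpy as np
import pandas as pd

def impute_by_proportion(series, missing="unknown", seed=42):
    """Replace `missing` by sampling from the distribution of the known values."""
    rng = np.random.default_rng(seed)  # seeded generator for reproducibility
    known = series[series != missing]
    proportions = known.value_counts(normalize=True)
    mask = series == missing
    out = series.copy()
    # Draw one replacement per missing row, weighted by the known proportions
    out[mask] = rng.choice(
        proportions.index.to_numpy(), size=mask.sum(), p=proportions.to_numpy()
    )
    return out

# Toy data (illustrative values only)
toy = pd.Series(["never"] * 8 + ["sometimes"] * 2 + ["unknown"] * 5)
imputed = impute_by_proportion(toy)
print(imputed.value_counts())
```

Because the helper returns a new Series instead of modifying a global DataFrame, it can be applied to any column without side effects.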
    

### *'body_type'*

In [2074]:
#checking unique variables and proportions
pre_encode_cat(variable='body_type')

Value counts for body_type
regular    46091
large       8361
unknown     5494
Name: count, dtype: int64
Proportions for body_type
regular    0.768875
large      0.139476
unknown    0.091649
Name: proportion, dtype: float64
Proportions without the value 'unknown' body_type
regular    0.846452
large      0.153548
Name: proportion, dtype: float64


This is the target variable, so the 'unknown' values must be imputed before encoding. Its categories were reduced to make prediction easier: at this point the data contain only 'regular', 'large' and 'unknown'.

After imputation we will have two classes: 'regular' and 'large'.

In [2075]:
prop_imputer(variable='body_type')

new values: body_type
regular    50741
large       9205
Name: count, dtype: int64


### *'diet'*

In [2076]:
#checking unique variables and proportions
pre_encode_cat(variable='diet')

Value counts for diet
anything      27881
unknown       24395
vegetarian     4986
other          1790
vegan           702
kosher          115
halal            77
Name: count, dtype: int64
Proportions for diet
anything      0.465102
unknown       0.406950
vegetarian    0.083175
other         0.029860
vegan         0.011711
kosher        0.001918
halal         0.001284
Name: proportion, dtype: float64
Proportions without the value 'unknown' diet
anything      0.784254
vegetarian    0.140249
other         0.050350
vegan         0.019746
kosher        0.003235
halal         0.002166
Name: proportion, dtype: float64


Imputation of the value 'unknown' with the mode 'anything'

In [2077]:
profiles['diet'] = profiles['diet'].replace('unknown', 'anything')
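
The mode is hard-coded in the cell above. As a hedged alternative, the mode can be derived from the known values so that the code does not depend on a literal string. The toy Series below is an assumption for illustration only.

```python
import pandas as pd

# Toy data (illustrative values only)
diet = pd.Series(["anything", "unknown", "vegetarian", "anything", "unknown"])

# Mode of the known values only, so 'unknown' can never be picked as the mode
mode_value = diet[diet != "unknown"].mode().iloc[0]
diet_imputed = diet.replace("unknown", mode_value)

print(mode_value)
print(diet_imputed.tolist())
```

Excluding 'unknown' before taking the mode matters for columns where the missing marker is itself the most frequent value.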

### *'drinks'*

In [2078]:
#checking unique variables and proportions
pre_encode_cat(variable='drinks')

Value counts for drinks
a little      47737
a lot          5957
not at all     3267
unknown        2985
Name: count, dtype: int64
Proportions for drinks
a little      0.796333
a lot         0.099373
not at all    0.054499
unknown       0.049795
Name: proportion, dtype: float64
Proportions without the value 'unknown' drinks
a little      0.838065
a lot         0.104580
not at all    0.057355
Name: proportion, dtype: float64


In [2079]:
#Imputation of the value 'unknown' with the mode 'a little'
profiles['drinks'] = profiles['drinks'].replace('unknown', 'a little')

### *'drugs'*

In [2080]:
#checking unique values with total counts
pre_encode_cat(variable='drugs')

Value counts for drugs
never        37724
unknown      14080
sometimes     7732
often          410
Name: count, dtype: int64
Proportions for drugs
never        0.629300
unknown      0.234878
sometimes    0.128983
often        0.006839
Name: proportion, dtype: float64
Proportions without the value 'unknown' drugs
never        0.822483
sometimes    0.168578
often        0.008939
Name: proportion, dtype: float64


Imputation replaces 'unknown' with one of three values ('never', 'sometimes', 'often'), in the same proportions in which those values occur among the known entries.

In [2081]:
prop_imputer(variable='drugs')

new values: drugs
never        49305
sometimes    10106
often          535
Name: count, dtype: int64


### *'smokes'*

In [2082]:
pre_encode_cat(variable='smokes')

Value counts for smokes
no              46127
yes              8307
not answered     5512
Name: count, dtype: int64
Proportions for smokes
no              0.769476
yes             0.138575
not answered    0.091949
Name: proportion, dtype: float64
Proportions without the value 'unknown' smokes
no              0.769476
yes             0.138575
not answered    0.091949
Name: proportion, dtype: float64


In [2083]:
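# 'smokes' marks missing answers as 'not answered'; rename them to 'unknown' so prop_imputer can handle them
profiles['smokes'] = profiles['smokes'].replace('not answered', 'unknown')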

In [2084]:
prop_imputer(variable='smokes')

new values: smokes
no     50798
yes     9148
Name: count, dtype: int64


## Splitting the dataset, encoding categorical variables and transforming numerical ones.

Three subsets will be used: train (70%), validation (15%) and test set (15%)

### Definition of X and y (predictors and labels).

In [2085]:
X = profiles.drop(['body_type'], axis=1)
y = profiles['body_type']

## TEST: Creating new variables as a combination of habits, binning the age and height groups and defining risk groups

In [2086]:
# Combining pairs of habits
X['smokes_drinks'] = X['smokes'] + '_' + X['drinks']
X['drinks_drugs'] = X['drinks'] + '_' + X['drugs']
X['smokes_drugs'] = X['smokes'] + '_' + X['drugs']


In [2087]:
# Age ranges
X['age_group'] = pd.cut(X['age'], bins=[0, 25, 40, 60, 100], labels=['young', 'adult', 'middle_aged', 'senior'])

# Height ranges
X['height_group'] = pd.cut(X['height'], bins=[0, 63, 69, 75, 100], labels=['short', 'avg', 'tall', 'very_tall'])
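
To make the bin boundaries explicit: by default `pd.cut` builds right-closed intervals, so with the edges above an age of 25 falls in 'young' and 26 in 'adult'. The sketch below uses made-up ages to show this, and also that values outside every bin (including the lowest edge itself) become NaN.

```python
import pandas as pd

# Made-up ages placed on and around the bin edges
ages = pd.Series([18, 25, 26, 40, 41, 60, 61])
age_group = pd.cut(
    ages,
    bins=[0, 25, 40, 60, 100],
    labels=["young", "adult", "middle_aged", "senior"],
)
print(age_group.tolist())
# ['young', 'young', 'adult', 'adult', 'middle_aged', 'middle_aged', 'senior']

# 0 sits on the open left edge and 101 is above the last edge: both become NaN
outside = pd.cut(pd.Series([0, 101]), bins=[0, 25, 40, 60, 100]).isna().tolist()
print(outside)  # [True, True]
```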


In [2088]:
# Risk combinations
X['risky_behavior'] = ((X['smokes'] != 'no') & (X['drinks'] != 'not at all') & (X['drugs'] != 'never')).astype(int)
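
Note that the flag above is a conjunction: a user is marked as risky only when all three conditions hold at once. A small sketch on made-up rows illustrates this.

```python
import pandas as pd

# Made-up rows: only the first one satisfies all three conditions
toy = pd.DataFrame({
    "smokes": ["yes", "yes", "no"],
    "drinks": ["a lot", "not at all", "a lot"],
    "drugs": ["sometimes", "often", "often"],
})

toy["risky_behavior"] = (
    (toy["smokes"] != "no")
    & (toy["drinks"] != "not at all")
    & (toy["drugs"] != "never")
).astype(int)

print(toy["risky_behavior"].tolist())  # [1, 0, 0]
```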


In [2089]:
X.head()

Unnamed: 0,age,diet,drinks,drugs,height,sex,smokes,smokes_drinks,drinks_drugs,smokes_drugs,age_group,height_group,risky_behavior
0,22,anything,a little,never,75.0,m,yes,yes_a little,a little_never,yes_never,young,tall,0
1,35,other,a lot,sometimes,70.0,m,no,no_a lot,a lot_sometimes,no_sometimes,adult,tall,0
2,38,anything,a little,never,68.0,m,no,no_a little,a little_never,no_never,adult,avg,0
3,23,vegetarian,a little,never,71.0,m,no,no_a little,a little_never,no_never,young,tall,0
4,29,anything,a little,never,66.0,m,no,no_a little,a little_never,no_never,adult,avg,0


In [2090]:
X.isna().sum()

age               0
diet              0
drinks            0
drugs             0
height            0
sex               0
smokes            0
smokes_drinks     0
drinks_drugs      0
smokes_drugs      0
age_group         0
height_group      0
risky_behavior    0
dtype: int64

In [2091]:
X_train, X_temp, y_train, y_temp = train_test_split(X, y, random_state=42, test_size=0.3, stratify=y)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.50, stratify=y_temp, random_state=42)

In [2092]:
print(len(X_train))
print(len(X_test))
print(len(X_val))

41962
8992
8992
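
The two-step split can be sanity-checked on toy data: holding out 30% and then halving that hold-out yields 70/15/15, and `stratify` keeps the class ratio close to the original in each subset. The data below are synthetic and the variable names are assumptions for this sketch.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic data: 1000 rows, 15% positive class (illustrative only)
X_toy = pd.DataFrame({"feature": np.arange(1000)})
y_toy = pd.Series([1] * 150 + [0] * 850)

# Step 1: 70% train, 30% temporary hold-out
X_tr, X_tmp, y_tr, y_tmp = train_test_split(
    X_toy, y_toy, test_size=0.3, stratify=y_toy, random_state=42
)
# Step 2: split the hold-out in half -> 15% validation, 15% test
X_va, X_te, y_va, y_te = train_test_split(
    X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=42
)

for name, part in [("train", y_tr), ("val", y_va), ("test", y_te)]:
    print(name, len(part), round(part.mean(), 3))
```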


In [2093]:
print(X['risky_behavior'].value_counts())

risky_behavior
0    56781
1     3165
Name: count, dtype: int64


### Separation of numerical and categorical subsets

In [2094]:
X_num_train = X_train[['age', 'height']]
X_cat_train = X_train.drop(['age', 'height'], axis=1)

X_num_val = X_val[['age', 'height']]
X_cat_val = X_val.drop(['age', 'height'], axis=1)

X_num_test = X_test[['age', 'height']]
X_cat_test = X_test.drop(['age', 'height'], axis=1)

In [2095]:
print(len(X_cat_test))

8992


### Transforming numerical variables

In [2096]:
scaler = StandardScaler()
X_num_train_scaled = scaler.fit_transform(X_num_train)
X_num_train_scaled = pd.DataFrame(X_num_train_scaled, columns=X_num_train.columns, index=X_num_train.index)

X_num_val_scaled = scaler.transform(X_num_val)
X_num_val_scaled = pd.DataFrame(X_num_val_scaled, columns=X_num_val.columns, index=X_num_val.index)

X_num_test_scaled = scaler.transform(X_num_test)
X_num_test_scaled = pd.DataFrame(X_num_test_scaled, columns=X_num_test.columns, index=X_num_test.index)

### Encoding categorical variables

#### y = *'body_type'* - (target variable)
We have two classes: {'regular': 0, 'large': 1}

In [2097]:
body_type_mapping = {'regular': 0, 'large': 1}
y_train_enc = y_train.map(body_type_mapping)
y_val_enc = y_val.map(body_type_mapping)
y_test_enc = y_test.map(body_type_mapping)
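
One caveat with `Series.map` and a dictionary: any label missing from the mapping silently becomes NaN. A quick check such as the following sketch (toy labels, hypothetical variable names) can surface unmapped values before they reach the model.

```python
import pandas as pd

body_type_mapping = {"regular": 0, "large": 1}

# Toy labels; 'unknown' is deliberately absent from the mapping
y_toy = pd.Series(["regular", "large", "unknown"])
y_enc = y_toy.map(body_type_mapping)

# Labels that were not covered by the mapping
unmapped = y_toy[y_enc.isna()].unique().tolist()
print(unmapped)  # ['unknown']
```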

In [2098]:
#le = LabelEncoder()
#y_train_enc = le.fit_transform(y_train)
#y_val_enc = le.transform(y_val)
#y_test_enc = le.transform(y_test)

In [2099]:
X.columns

Index(['age', 'diet', 'drinks', 'drugs', 'height', 'sex', 'smokes',
       'smokes_drinks', 'drinks_drugs', 'smokes_drugs', 'age_group',
       'height_group', 'risky_behavior'],
      dtype='object')

#### X = *'diet', 'drinks', 'drugs', 'sex', 'smokes', 'smokes_drinks', 'drinks_drugs', 'smokes_drugs', 'age_group', 'height_group', 'risky_behavior'* - (predictors)

##### *'diet', 'drinks', 'drugs', 'smokes_drinks', 'drinks_drugs', 'smokes_drugs'* - One-Hot Encoding

##### *'age_group', 'height_group'* - Ordinal Encoding

##### *'sex'*, *'smokes'* - Binary Encoding

##### *'risky_behavior'* - It does not need encoding

In [2100]:
print(np.sum(X['age_group'].isna()))

0


One-Hot Encoding

In [2101]:
columns_to_encode = ['diet', 'drinks', 'drugs', 'smokes_drinks', 'drinks_drugs', 'smokes_drugs']
X_cat_train_oh = X_cat_train[columns_to_encode]
X_cat_val_oh = X_cat_val[columns_to_encode]
X_cat_test_oh = X_cat_test[columns_to_encode]

ohe = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
X_cat_train_oh = ohe.fit_transform(X_cat_train_oh)

X_cat_val_oh = ohe.transform(X_cat_val_oh)
X_cat_test_oh = ohe.transform(X_cat_test_oh)


In [2102]:
print(X_cat_train_oh.shape)
print(X_cat_val_oh.shape)
print(X_cat_test_oh.shape)

(41962, 33)
(8992, 33)
(8992, 33)


Conversion from NumPy array to DataFrame

In [2103]:
X_cat_train_oh = pd.DataFrame(X_cat_train_oh, columns=ohe.get_feature_names_out(), index=X_cat_train.index)
X_cat_val_oh = pd.DataFrame(X_cat_val_oh, columns=ohe.get_feature_names_out(), index=X_cat_val.index)
X_cat_test_oh = pd.DataFrame(X_cat_test_oh, columns=ohe.get_feature_names_out(), index=X_cat_test.index)


Ordinal Encoding

In [2104]:
X_cat_train_oe = X_cat_train[['age_group', 'height_group']].copy()
X_cat_val_oe = X_cat_val[['age_group', 'height_group']].copy()
X_cat_test_oe = X_cat_test[['age_group', 'height_group']].copy()

In [2105]:
height_ordered_labels = [['short', 'avg', 'tall', 'very_tall']]
age_ordered_labels = [['young', 'adult', 'middle_aged', 'senior']]

encoder1 = OrdinalEncoder(categories=height_ordered_labels)
encoder2 = OrdinalEncoder(categories=age_ordered_labels)

X_cat_train_oe['height_group'] = encoder1.fit_transform(X_cat_train_oe[['height_group']])
X_cat_val_oe['height_group'] = encoder1.transform(X_cat_val_oe[['height_group']])
X_cat_test_oe['height_group'] = encoder1.transform(X_cat_test_oe[['height_group']])

X_cat_train_oe['age_group'] = encoder2.fit_transform(X_cat_train_oe[['age_group']])
X_cat_val_oe['age_group'] = encoder2.transform(X_cat_val_oe[['age_group']])
X_cat_test_oe['age_group'] = encoder2.transform(X_cat_test_oe[['age_group']])
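
As a possible simplification, a single `OrdinalEncoder` can handle both columns when it receives one ordered category list per column, in the same order as the columns. This is a sketch on toy data, not a change to the cell above.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy data (illustrative values only)
groups = pd.DataFrame({
    "age_group": ["young", "senior", "adult"],
    "height_group": ["tall", "short", "very_tall"],
})

# One ordered category list per column, matching the column order
encoder = OrdinalEncoder(categories=[
    ["young", "adult", "middle_aged", "senior"],
    ["short", "avg", "tall", "very_tall"],
])
encoded = encoder.fit_transform(groups)
print(encoded.tolist())  # [[0.0, 2.0], [3.0, 0.0], [1.0, 3.0]]
```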

In [2106]:
print(X_cat_train_oe.shape)
print(X_cat_val_oe.shape)
print(X_cat_test_oe.shape)

(41962, 2)
(8992, 2)
(8992, 2)


Binary Encoding for 'sex' and 'smokes'

In [2107]:
sex_mapping = {'m': 0, 'f': 1}
smokes_mapping = {'yes': 1, 'no': 0}


X_cat_train['sex'] = X_cat_train['sex'].map(sex_mapping)
X_cat_train['smokes'] = X_cat_train['smokes'].map(smokes_mapping)

X_cat_val['sex'] = X_cat_val['sex'].map(sex_mapping)
X_cat_val['smokes'] = X_cat_val['smokes'].map(smokes_mapping)

X_cat_test['sex'] = X_cat_test['sex'].map(sex_mapping)
X_cat_test['smokes'] = X_cat_test['smokes'].map(smokes_mapping)

Concatenation of the one-hot encoded, ordinal encoded and binary encoded columns with the unencoded column 'risky_behavior'

In [2108]:
additional_columns = ['sex', 'smokes', 'risky_behavior']
X_cat_train = pd.concat([X_cat_train_oe, X_cat_train_oh, X_cat_train[additional_columns]], axis=1)
X_cat_val = pd.concat([X_cat_val_oe, X_cat_val_oh, X_cat_val[additional_columns]], axis=1)
X_cat_test = pd.concat([X_cat_test_oe, X_cat_test_oh, X_cat_test[additional_columns]], axis=1)

In [2109]:
print(X_cat_train.shape)
print(X_cat_val.shape)
print(X_cat_test.shape)

(41962, 38)
(8992, 38)
(8992, 38)


### Concatenation of encoded categorical + transformed numerical

In [2110]:
X_train_scaled = pd.concat([X_num_train_scaled, X_cat_train], axis=1)
X_val_scaled = pd.concat([X_num_val_scaled, X_cat_val], axis=1)
X_test_scaled = pd.concat([X_num_test_scaled, X_cat_test], axis=1)

## Exporting datasets for modeling

In [2111]:
X_train_scaled.to_csv('X_train.csv')
X_val_scaled.to_csv('X_val.csv')
X_test_scaled.to_csv('X_test.csv')

y_train_enc = pd.DataFrame(y_train_enc)
y_val_enc = pd.DataFrame(y_val_enc)
y_test_enc = pd.DataFrame(y_test_enc)

y_train_enc.to_csv('y_train.csv')
y_val_enc.to_csv('y_val.csv')
y_test_enc.to_csv('y_test.csv')
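
Because `to_csv` writes the index by default, reading these files back with a plain `read_csv` produces an extra 'Unnamed: 0' column, like the one filtered out at the top of this notebook. The sketch below uses an in-memory buffer instead of a file to show that `index_col=0` restores the original index.

```python
import io
import pandas as pd

df = pd.DataFrame({"age": [0.5, -1.2]}, index=[53915, 47663])

buffer = io.StringIO()
df.to_csv(buffer)            # the index is written as an unnamed first column

buffer.seek(0)
naive = pd.read_csv(buffer)  # the index comes back as a column named 'Unnamed: 0'

buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)  # index restored, no extra column

print(naive.columns.tolist())    # ['Unnamed: 0', 'age']
print(restored.index.tolist())   # [53915, 47663]
```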

In [2112]:
X_test_scaled = X_test_scaled.apply(lambda col: col.astype(bool) if col.dtype == 'object' else col)

In [2113]:
X_test_scaled.dtypes

age                                  float64
height                               float64
age_group                            float64
height_group                         float64
diet_anything                        float64
diet_halal                           float64
diet_kosher                          float64
diet_other                           float64
diet_vegan                           float64
diet_vegetarian                      float64
drinks_a little                      float64
drinks_a lot                         float64
drinks_not at all                    float64
drugs_never                          float64
drugs_often                          float64
drugs_sometimes                      float64
smokes_drinks_no_a little            float64
smokes_drinks_no_a lot               float64
smokes_drinks_no_not at all          float64
smokes_drinks_yes_a little           float64
smokes_drinks_yes_a lot              float64
smokes_drinks_yes_not at all         float64
drinks_dru

In [2114]:
print(np.sum(X_test_scaled['age'].isna()))

0


In [2115]:
X_train_scaled.head()

Unnamed: 0,age,height,age_group,height_group,diet_anything,diet_halal,diet_kosher,diet_other,diet_vegan,diet_vegetarian,...,drinks_drugs_not at all_sometimes,smokes_drugs_no_never,smokes_drugs_no_often,smokes_drugs_no_sometimes,smokes_drugs_yes_never,smokes_drugs_yes_often,smokes_drugs_yes_sometimes,sex,smokes,risky_behavior
53915,-0.247409,0.437289,1.0,2.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0,0,0
47663,-0.036281,-0.078533,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0,0,0
21816,1.019358,-1.626001,2.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1,0,0
4681,-0.986357,-0.336444,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1,0,0
32877,1.124921,0.179378,2.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0,0,0
