# Preprocessing for classification model

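This notebook prepares the filtered dataset from the EDA step for a classification model: it keeps the columns in scope, imputes the 'unknown' values and encodes the categorical variables.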

In [267]:
#Importing all libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import math
import os
from itertools import combinations
from scipy.stats import chi2_contingency
import category_encoders as ce
from sklearn.preprocessing import StandardScaler


In [268]:
#Importing dataset cleaned and filtered from EDA
profiles = pd.read_csv('../data/profiles_eda.csv', index_col=False)
profiles = profiles.loc[:, ~profiles.columns.str.contains('^Unnamed')]
profiles.head()

,age,body_type,diet,drinks,drugs,education,ethnicity,height,job,offspring,orientation,religion,sex,sign,smokes,status,dogs,cats
0,22,larger,anything,a little,never,college,asian,75.0,service,no kids,straight,agnosticism,m,gemini,yes,single,likes dogs,likes cats
1,35,average,other,a lot,sometimes,other,white,70.0,service,no kids,straight,agnosticism,m,cancer,no,single,likes dogs,likes cats
2,38,thin,anything,a little,unknown,masters,unknown,68.0,unknown,unknown,straight,unknown,m,pisces,no,available,no dogs,has cats
3,23,thin,vegetarian,a little,unknown,college,white,71.0,student,unsure,straight,unknown,m,pisces,no,single,no dogs,likes cats
4,29,fit,unknown,a little,never,college,asian,66.0,creative,unknown,straight,unknown,m,aquarius,no,single,likes dogs,likes cats


## Scoping the problem

For the classification problem, we want to predict which body_type a user belongs to, using habits as predictors.

Habits:
- diet
- drinks
- drugs
- smokes

Age, sex and height will also be considered.

### Filtering the dataset

Dropping all variables that are outside the scope.

In [269]:
profiles = profiles.drop(['education', 'ethnicity', 'job', 'offspring', 'orientation', 'religion', 'sign', 'status', 'dogs', 'cats'], axis=1)

In [270]:
profiles.head()

,age,body_type,diet,drinks,drugs,height,sex,smokes
0,22,larger,anything,a little,never,75.0,m,yes
1,35,average,other,a lot,sometimes,70.0,m,no
2,38,thin,anything,a little,unknown,68.0,m,no
3,23,thin,vegetarian,a little,unknown,71.0,m,no
4,29,fit,unknown,a little,never,66.0,m,no


### 01.- Encoding categorical variables

Defining functions for reusable code:

1) Printing proportions with and without the value 'unknown'
2) Imputation of 'unknown' value according to the distribution of the rest of the values in the variable.

In [271]:
def pre_encode_cat(variable):  # prints the preliminary counts and proportions before encoding
    count = profiles[variable].value_counts()
    print(f"Value counts for {count}")
    print(f"Proportions for {profiles[variable].value_counts(1)}")
    proportions = (profiles.loc[profiles[variable] != 'unknown', variable]).value_counts(1)
    print(f"Proportions without the value 'unknown' {proportions}")

In [272]:
def prop_imputer(variable):
    # Proportions of the known values (everything except 'unknown')
    proportions = (profiles.loc[profiles[variable] != 'unknown', variable]).value_counts(1)
    # Find the indices where variable == 'unknown'
    unknown_idx = profiles[profiles[variable] == 'unknown'].index
    # Shuffle indices for random assignment (unseeded, so the chosen rows differ between runs)
    shuffled_idx = np.random.permutation(unknown_idx)
    # Number of unknowns
    n_unknown = len(unknown_idx)
    # Calculate how many unknowns to assign to each value
    n_assign = [int(round(p * n_unknown)) for p in proportions]
    # Assign each value to its own slice of the shuffled indices.
    # Because of rounding, the slices may not cover every unknown exactly,
    # so a few 'unknown' rows can remain after this step.
    start = 0
    for value, n in zip(proportions.index, n_assign):
        profiles.loc[shuffled_idx[start:start + n], variable] = value
        start += n
    print(f"new values: {profiles[variable].value_counts()}")
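
As a point of comparison, the sketch below shows one way the same idea could be written so that it never leaves a residual 'unknown' and gives the same result on every run. It is not the notebook's function: the name `impute_proportionally`, the `seed` parameter and the largest-remainder step are assumptions added for illustration, and it expects a Series with a unique index.

```python
import numpy as np
import pandas as pd


def impute_proportionally(series, placeholder="unknown", seed=0):
    """Return a copy of `series` in which `placeholder` is replaced by the
    observed values, in proportion to how often each one occurs."""
    result = series.copy()
    known = result[result != placeholder]
    unknown_idx = result.index[result == placeholder]
    if len(unknown_idx) == 0 or known.empty:
        return result

    proportions = known.value_counts(normalize=True)
    exact = proportions * len(unknown_idx)
    counts = np.floor(exact).astype(int)
    # Largest-remainder step (an assumption, not in the notebook): give the
    # leftover slots to the values with the biggest fractional parts, so the
    # counts add up to the number of placeholders.
    leftover = len(unknown_idx) - int(counts.sum())
    for label in (exact - counts).sort_values(ascending=False).index[:leftover]:
        counts.loc[label] += 1

    rng = np.random.default_rng(seed)  # seeded for reproducibility
    shuffled = rng.permutation(unknown_idx.to_numpy())
    fill_values = np.repeat(counts.index.to_numpy(), counts.to_numpy())
    result.loc[shuffled] = fill_values
    return result


demo = pd.Series(["a"] * 6 + ["b"] * 3 + ["unknown"] * 4)
filled = impute_proportionally(demo)
print(filled.value_counts())
```

Returning a new Series instead of modifying a global DataFrame makes the helper easier to test on small inputs like `demo`.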
    

#### Encoding 'body_type'

In [273]:
#checking unique values and proportions
pre_encode_cat(variable='body_type')

Value counts for body_type
fit        24951
average    14652
thin        6488
unknown     5293
curvy       4933
larger      3073
other        553
Name: count, dtype: int64
Proportions for body_type
fit        0.416245
average    0.244432
thin       0.108236
unknown    0.088301
curvy      0.082295
larger     0.051265
other      0.009225
Name: proportion, dtype: float64
Proportions without the value 'unknown' body_type
fit        0.456560
average    0.268106
thin       0.118719
curvy      0.090265
larger     0.056231
other      0.010119
Name: proportion, dtype: float64


Ordinal encoding seems to be the best option for this variable, and the 'unknown' values must be imputed first.<br>
Since this will be the target variable, we reduce the number of categories to make prediction easier.<br>
Replacements: 'other' with 'unknown', 'thin' with 'average', 'curvy' with 'larger'.

This leaves three classes: 'average', 'fit' and 'larger'.

In [274]:
profiles['body_type'] = profiles['body_type'].replace('other', 'unknown')
profiles['body_type'] = profiles['body_type'].replace('thin', 'average')
profiles['body_type'] = profiles['body_type'].replace('curvy', 'larger')
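
The three replacements can also be expressed as a single dictionary passed to `Series.replace`. The sketch below runs on a small made-up Series (`sample` is illustrative, not the real data) to show the effect of the collapse.

```python
import pandas as pd

# Mapping taken from the replacements listed above.
collapse = {"other": "unknown", "thin": "average", "curvy": "larger"}

# Illustrative values only; one of each original category.
sample = pd.Series(["fit", "thin", "curvy", "other", "average", "larger", "unknown"])
collapsed = sample.replace(collapse)
print(collapsed.tolist())
```

Keeping the mapping in one place makes it easy to see at a glance which original categories feed each of the final classes.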


In [275]:
prop_imputer(variable='body_type')


new values: body_type
fit        27647
average    23424
larger      8871
unknown        1
Name: count, dtype: int64


In [276]:

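# One 'unknown' remains because the rounded per-class counts add up to one less than the number of unknowns; assign it to 'average'
profiles['body_type'] = profiles['body_type'].replace('unknown', 'average')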

In [277]:
print(profiles['body_type'].value_counts(0))

body_type
fit        27647
average    23425
larger      8871
Name: count, dtype: int64


In [278]:

body_type_mapping = {
    'fit': 0,
    'average': 1,
    'larger': 2
}

profiles['body_type'] = profiles['body_type'].map(body_type_mapping)
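
`Series.map` returns NaN for any label that is missing from the dictionary, so a leftover 'unknown' would silently become a missing target value. A quick check like the one below can catch that; it is a sketch on made-up values, and `sample`, `encoded` and `unmapped` are illustrative names.

```python
import pandas as pd

body_type_mapping = {"fit": 0, "average": 1, "larger": 2}

# Illustrative values; 'unknown' is deliberately absent from the mapping.
sample = pd.Series(["fit", "average", "larger", "unknown"])
encoded = sample.map(body_type_mapping)

# Labels that the mapping did not cover show up as NaN after map().
unmapped = sample[encoded.isna()].unique().tolist()
print(unmapped)
```

An empty list here means every label was encoded.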

In [279]:
print(profiles['body_type'].value_counts())

body_type
0    27647
1    23425
2     8871
Name: count, dtype: int64


#### Encoding 'diet'

In [280]:
#checking unique values and proportions
pre_encode_cat(variable='diet')

Value counts for diet
anything      27881
unknown       24392
vegetarian     4986
other          1790
vegan           702
kosher          115
halal            77
Name: count, dtype: int64
Proportions for diet
anything      0.465125
unknown       0.406920
vegetarian    0.083179
other         0.029862
vegan         0.011711
kosher        0.001918
halal         0.001285
Name: proportion, dtype: float64
Proportions without the value 'unknown' diet
anything      0.784254
vegetarian    0.140249
other         0.050350
vegan         0.019746
kosher        0.003235
halal         0.002166
Name: proportion, dtype: float64


The 'unknown' values are imputed with the mode, 'anything'.

In [281]:
profiles['diet'] = profiles['diet'].replace('unknown', 'anything')

In [282]:
#One-hot encoding for the variable 'diet'
profiles = pd.get_dummies(profiles, columns=['diet'])
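
To illustrate what `pd.get_dummies` does to the 'diet' column, the sketch below encodes a tiny made-up frame. Passing `dtype=int` is optional and is an assumption of this example: it yields 0/1 columns instead of booleans, which some downstream tools may handle more conveniently.

```python
import pandas as pd

# Illustrative rows only, not the real dataset.
sample = pd.DataFrame({
    "age": [22, 35, 23],
    "diet": ["anything", "other", "vegetarian"],
})

# One new column per category; the original 'diet' column is dropped.
encoded = pd.get_dummies(sample, columns=["diet"], dtype=int)
print(encoded.columns.tolist())
print(encoded)
```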

#### Encoding 'drinks'

In [283]:
#checking unique values and proportions
pre_encode_cat(variable='drinks')

Value counts for drinks
a little      47737
a lot          5957
not at all     3267
unknown        2982
Name: count, dtype: int64
Proportions for drinks
a little      0.796373
a lot         0.099378
not at all    0.054502
unknown       0.049747
Name: proportion, dtype: float64
Proportions without the value 'unknown' drinks
a little      0.838065
a lot         0.104580
not at all    0.057355
Name: proportion, dtype: float64


In [284]:
#Imputation of the value 'unknown' with the mode 'a little'
profiles['drinks'] = profiles['drinks'].replace('unknown', 'a little')

In [285]:
drinks_mapping = {
    'not at all': 0,
    'a little': 1,
    'a lot': 2    
}

profiles['drinks'] = profiles['drinks'].map(drinks_mapping)

#### Encoding 'drugs'

In [286]:
#checking unique values with total counts
pre_encode_cat(variable='drugs')

Value counts for drugs
never        37723
unknown      14078
sometimes     7732
often          410
Name: count, dtype: int64
Proportions for drugs
never        0.629315
unknown      0.234856
sometimes    0.128989
often        0.006840
Name: proportion, dtype: float64
Proportions without the value 'unknown' drugs
never        0.822479
sometimes    0.168582
often        0.008939
Name: proportion, dtype: float64


Imputation replaces 'unknown' with one of three values ('never', 'sometimes', 'often') in the same proportions in which they appear among the known values.<br>
That is, the 14078 'unknown' values are replaced as follows: 11579 with 'never', 2373 with 'sometimes' and 126 with 'often'.

In [287]:
prop_imputer(variable='drugs')

new values: drugs
never        49302
sometimes    10105
often          536
Name: count, dtype: int64
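
The counts in this output can be checked by hand. The short calculation below starts from the known counts and the number of unknowns printed earlier for 'drugs', and applies the same round-to-nearest rule as the imputer.

```python
# Counts copied from the value_counts output for 'drugs'.
known_counts = {"never": 37723, "sometimes": 7732, "often": 410}
n_unknown = 14078

total_known = sum(known_counts.values())
# Number of unknowns handed to each value, rounded to the nearest integer.
assigned = {
    label: round(count / total_known * n_unknown)
    for label, count in known_counts.items()
}
# Final counts after imputation.
after = {label: known_counts[label] + assigned[label] for label in known_counts}
print(assigned)
print(after)
```

Here the rounded counts happen to add up to exactly 14078, which is why no 'unknown' is left over for this variable.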


In [288]:
#ordinal encoding for 'drugs'
drugs_mapping = {
    'never': 0,
    'sometimes': 1,
    'often': 2    
}

profiles['drugs'] = profiles['drugs'].map(drugs_mapping)

#### Encoding 'smokes'

In [289]:
pre_encode_cat('smokes')

Value counts for smokes
no              46126
yes              8307
not answered     5510
Name: count, dtype: int64
Proportions for smokes
no              0.769498
yes             0.138582
not answered    0.091921
Name: proportion, dtype: float64
Proportions without the value 'unknown' smokes
no              0.769498
yes             0.138582
not answered    0.091921
Name: proportion, dtype: float64


In [290]:
#replacing 'not answered' with 'unknown' to make the imputer function work
profiles['smokes'] = profiles['smokes'].replace('not answered', 'unknown')


Proportional imputation

In [291]:
prop_imputer('smokes')

new values: smokes
no     50795
yes     9148
Name: count, dtype: int64


In [292]:
profiles['smokes'].head()

0    yes
1     no
2     no
3     no
4     no
Name: smokes, dtype: object

Binary encoding for the variable 'smokes'

In [293]:
profiles['smokes'] = profiles['smokes'].map({'yes': 1, 'no': 0})


#### Encoding 'sex'

Binary encoding for the variable 'sex'

In [294]:
profiles['sex'] = profiles['sex'].map({'m': 1, 'f': 0})

In [295]:
print(f"Total number of features = {len(profiles.columns)}")

Total number of features = 13


### 02.- Transformation of numerical variables

In [296]:
num_var = profiles.select_dtypes(include='number')
print(num_var)

       age  body_type  drinks  drugs  height  sex  smokes
0       22          2       1      0    75.0    1       1
1       35          1       2      1    70.0    1       0
2       38          1       1      0    68.0    1       0
3       23          1       1      0    71.0    1       0
4       29          0       1      0    66.0    1       0
...    ...        ...     ...    ...     ...  ...     ...
59938   59          0       1      0    62.0    0       0
59939   24          0       2      1    72.0    1       0
59940   42          1       0      0    71.0    1       0
59941   27          0       1      2    73.0    1       1
59942   39          1       1      0    68.0    1       1

[59943 rows x 7 columns]


In [297]:
profiles.head()

,age,body_type,drinks,drugs,height,sex,smokes,diet_anything,diet_halal,diet_kosher,diet_other,diet_vegan,diet_vegetarian
0,22,2,1,0,75.0,1,1,True,False,False,False,False,False
1,35,1,2,1,70.0,1,0,False,False,False,True,False,False
2,38,1,1,0,68.0,1,0,True,False,False,False,False,False
3,23,1,1,0,71.0,1,0,False,False,False,False,False,True
4,29,0,1,0,66.0,1,0,True,False,False,False,False,False
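
The cells in this section only inspect the numeric columns; 'age' and 'height' are exported on their original scales. If the model chosen later turns out to be sensitive to feature scale, those two columns could be standardized. The sketch below is an assumption about how that might look, not a step the notebook performs. It computes z-scores with plain pandas, using the population standard deviation (`ddof=0`), which is the convention `StandardScaler` follows.

```python
import pandas as pd

# First five rows of 'age' and 'height' from the table above.
sample = pd.DataFrame({
    "age": [22, 35, 38, 23, 29],
    "height": [75.0, 70.0, 68.0, 71.0, 66.0],
})

cols = ["age", "height"]
# z-score: subtract the column mean, divide by the population std (ddof=0).
scaled = (sample[cols] - sample[cols].mean()) / sample[cols].std(ddof=0)
print(scaled.round(3))
```

In a modeling workflow the mean and standard deviation would normally be estimated on the training split only and then reused on the test split.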


### 03.- Exporting the dataset for modeling

In [298]:
# index=False avoids an extra 'Unnamed: 0' column when the file is read back
profiles.to_csv('profiles_processed.csv', index=False)
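
`DataFrame.to_csv` writes the row index by default, which is typically where an 'Unnamed: 0' column like the one stripped at the top of this notebook comes from. The sketch below shows the difference on a tiny made-up frame, using in-memory buffers instead of files.

```python
import io

import pandas as pd

# Illustrative rows only.
sample = pd.DataFrame({"age": [22, 35], "sex": [1, 1]})

# Default behaviour: the index is written as an unlabeled first column.
with_index = io.StringIO()
sample.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# With index=False only the data columns are written.
without_index = io.StringIO()
sample.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```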