# Pre-Processing

* Address or complete the following steps:
    * Creating dummy features
    * Scale standardization
    * Split data into training and testing subsets

### Load Dataset

In [1]:
import pandas as pd
import numpy as np

import warnings
warnings.simplefilter(action="ignore", category=FutureWarning)

In [2]:
df = pd.read_csv('../data/dataset_missing_as_nan.csv', index_col=0)

In [3]:
df.head()

Unnamed: 0,Age,Number of sexual partners,First sexual intercourse,Num of pregnancies,Smokes,Smokes (years),Smokes (packs/year),Hormonal Contraceptives,Hormonal Contraceptives (years),IUD,...,STDs: Time since first diagnosis,STDs: Time since last diagnosis,Dx:Cancer,Dx:CIN,Dx:HPV,Dx,Hinselmann,Schiller,Citology,Biopsy
0,18,4.0,15.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,,,0,0,0,0,0,0,0,0
1,15,1.0,14.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,,,0,0,0,0,0,0,0,0
2,34,1.0,,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,,,0,0,0,0,0,0,0,0
3,52,5.0,16.0,4.0,1.0,37.0,37.0,1.0,3.0,0.0,...,,,1,0,1,0,0,0,0,0
4,46,3.0,21.0,4.0,0.0,0.0,0.0,1.0,15.0,0.0,...,,,0,0,0,0,0,0,0,0


In [4]:
df.shape

(855, 36)

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 855 entries, 0 to 857
Data columns (total 36 columns):
 #   Column                              Non-Null Count  Dtype  
---  ------                              --------------  -----  
 0   Age                                 855 non-null    int64  
 1   Number of sexual partners           829 non-null    float64
 2   First sexual intercourse            848 non-null    float64
 3   Num of pregnancies                  799 non-null    float64
 4   Smokes                              845 non-null    float64
 5   Smokes (years)                      845 non-null    float64
 6   Smokes (packs/year)                 845 non-null    float64
 7   Hormonal Contraceptives             750 non-null    float64
 8   Hormonal Contraceptives (years)     750 non-null    float64
 9   IUD                                 741 non-null    float64
 10  IUD (years)                         741 non-null    float64
 11  STDs                                753 non-null  

### Creating Dummy Features

Dummy features allow categorical variables to be used in regression models. Creating them turns a categorical column into one boolean column per unique value, so that a model can treat each category as its own feature.
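
As a minimal sketch of the idea, assuming a hypothetical toy column `color` that is not part of this project's data, `pd.get_dummies` expands one categorical column into one indicator column per category:

```python
import pandas as pd

# Hypothetical toy data, not taken from the project's dataset
toy = pd.DataFrame({"color": ["red", "green", "red", "blue"]})

# One indicator column per unique category; dtype=int gives 0/1 instead of True/False
dummies = pd.get_dummies(toy, columns=["color"], dtype=int)
print(dummies)
```

For linear regression models it is common to pass `drop_first=True` as well, so that the indicator columns are not perfectly collinear.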

This is unlikely to help with the data used in this project, because all of our categorical columns are already boolean.

A related feature-engineering method that could help with this dataset is to add a feature that records whether the categorical columns were left missing.

A large group of participants left most of the survey questions unanswered. We could create a feature denoting whether those columns were left blank and then set the missing values to 0. This removes the missing values while still distinguishing the patients who left questions blank from those who answered 'No'.

To do this, we begin by checking all of the missing values in our data.

In [6]:
# inspect the missing values
df.isna().sum()

Age                                     0
Number of sexual partners              26
First sexual intercourse                7
Num of pregnancies                     56
Smokes                                 10
Smokes (years)                         10
Smokes (packs/year)                    10
Hormonal Contraceptives               105
Hormonal Contraceptives (years)       105
IUD                                   114
IUD (years)                           114
STDs                                  102
STDs (number)                         102
STDs:condylomatosis                   102
STDs:cervical condylomatosis          102
STDs:vaginal condylomatosis           102
STDs:vulvo-perineal condylomatosis    102
STDs:syphilis                         102
STDs:pelvic inflammatory disease      102
STDs:genital herpes                   102
STDs:molluscum contagiosum            102
STDs:AIDS                             102
STDs:HIV                              102
STDs:Hepatitis B                  

We want to confirm that the missing values in the STDs columns all come from the same participants. To do so, we count the missing values among only the rows where the 'STDs' question was left blank.

In [7]:
# ensure the STDs columns are missing for all the same rows
blank_std_index = df[df['STDs'].isna()].index
df.loc[blank_std_index].isna().sum()

Age                                     0
Number of sexual partners              12
First sexual intercourse                1
Num of pregnancies                      9
Smokes                                  0
Smokes (years)                          0
Smokes (packs/year)                     0
Hormonal Contraceptives                92
Hormonal Contraceptives (years)        92
IUD                                    98
IUD (years)                            98
STDs                                  102
STDs (number)                         102
STDs:condylomatosis                   102
STDs:cervical condylomatosis          102
STDs:vaginal condylomatosis           102
STDs:vulvo-perineal condylomatosis    102
STDs:syphilis                         102
STDs:pelvic inflammatory disease      102
STDs:genital herpes                   102
STDs:molluscum contagiosum            102
STDs:AIDS                             102
STDs:HIV                              102
STDs:Hepatitis B                  
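
The same check can be made explicit rather than read off the counts. The sketch below uses a hypothetical toy frame, and the helper name `missing_together` is an assumption for illustration. It confirms that every row is missing either all of the selected columns or none of them:

```python
import numpy as np
import pandas as pd

def missing_together(frame, cols):
    """Return True if each row is missing either all of `cols` or none of them."""
    na = frame[cols].isna()
    # any() and all() agree row by row only when missingness is all-or-nothing
    return bool((na.any(axis=1) == na.all(axis=1)).all())

# Hypothetical toy data standing in for the STDs columns
toy = pd.DataFrame({
    "STDs": [0.0, np.nan, 1.0, np.nan],
    "STDs (number)": [0.0, np.nan, 2.0, np.nan],
    "Age": [18, 15, 34, 52],
})
std_cols = [col for col in toy.columns if col.startswith("STDs")]
print(missing_together(toy, std_cols))
```

Applied to the notebook's `df`, `cols` would be whichever STDs columns are expected to be blank together.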

Each STDs column shown is missing for all 102 of these rows, so the same 102 women left the STDs questions unanswered. Because the behavior was so consistent, and because these women clearly answered other questions (none of them skipped the smoking questions, for example), leaving these questions blank may indicate something important about this group. We can add a new boolean feature to our data and give only these women the value 1 to show that they belong to the group that chose not to answer.

In [8]:
df['Missing STDs columns'] = df['STDs'].isna()
df['Missing STDs columns'] = df['Missing STDs columns'].astype(int)
df['Missing STDs columns'].value_counts()

Missing STDs columns
0    753
1    102
Name: count, dtype: int64

Now we can fill the questions they left blank with 0s to eliminate the missing values, and our new feature will indicate that they belong to a different group from the rest of the participants.

In [9]:
# fill N/A with 0 for only the STDs columns
std_cols = [col for col in df if col.startswith('STDs')]
df[std_cols] = df[std_cols].fillna(0)

# fill Hormonal Contraceptives and IUD columns with 0 only for the rows where the STDs columns are missing
contraceptive_cols = ['Hormonal Contraceptives', 'Hormonal Contraceptives (years)', 'IUD', 'IUD (years)']
df.loc[blank_std_index, contraceptive_cols] = df.loc[blank_std_index, contraceptive_cols].fillna(0)

Check the total number of missing values to see what is still missing.

In [10]:
df.isna().sum()

Age                                    0
Number of sexual partners             26
First sexual intercourse               7
Num of pregnancies                    56
Smokes                                10
Smokes (years)                        10
Smokes (packs/year)                   10
Hormonal Contraceptives               13
Hormonal Contraceptives (years)       13
IUD                                   16
IUD (years)                           16
STDs                                   0
STDs (number)                          0
STDs:condylomatosis                    0
STDs:cervical condylomatosis           0
STDs:vaginal condylomatosis            0
STDs:vulvo-perineal condylomatosis     0
STDs:syphilis                          0
STDs:pelvic inflammatory disease       0
STDs:genital herpes                    0
STDs:molluscum contagiosum             0
STDs:AIDS                              0
STDs:HIV                               0
STDs:Hepatitis B                       0
STDs:HPV        

A relatively small number of missing values remain in a few columns that were left untouched during data wrangling. We can fill the numeric columns with their medians and the boolean columns with their modes so that no missing values are left to confuse our models.

In [11]:
from sklearn.impute import SimpleImputer

# fill the boolean columns with the most frequent value
mode_imputer = SimpleImputer(strategy='most_frequent')
bool_filter_df = ['Smokes', 'Hormonal Contraceptives', 'IUD']
df[bool_filter_df] = mode_imputer.fit_transform(df[bool_filter_df])

# fill the rest of the columns with the median
median_imputer = SimpleImputer(strategy='median')
df = pd.DataFrame(median_imputer.fit_transform(df), columns=df.columns)
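
For intuition about the two strategies, here is a small sketch on hypothetical toy data (the column names and values are invented for illustration): `most_frequent` fills a 0/1 column with its mode, and `median` fills a numeric column with its median.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data, not taken from the project's dataset
toy = pd.DataFrame({
    "flag": [0.0, 0.0, 1.0, np.nan],    # boolean-like column
    "years": [1.0, 3.0, 10.0, np.nan],  # numeric column
})

# The mode of flag is 0.0; the median of years is 3.0
toy[["flag"]] = SimpleImputer(strategy="most_frequent").fit_transform(toy[["flag"]])
toy[["years"]] = SimpleImputer(strategy="median").fit_transform(toy[["years"]])
print(toy)
```

The median is less sensitive than the mean to a few large values such as the 10 in this toy column, which is one reason to prefer it for skewed features.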

Check our missing values once again to make sure they are all filled.

In [12]:
df.isna().sum()

Age                                   0
Number of sexual partners             0
First sexual intercourse              0
Num of pregnancies                    0
Smokes                                0
Smokes (years)                        0
Smokes (packs/year)                   0
Hormonal Contraceptives               0
Hormonal Contraceptives (years)       0
IUD                                   0
IUD (years)                           0
STDs                                  0
STDs (number)                         0
STDs:condylomatosis                   0
STDs:cervical condylomatosis          0
STDs:vaginal condylomatosis           0
STDs:vulvo-perineal condylomatosis    0
STDs:syphilis                         0
STDs:pelvic inflammatory disease      0
STDs:genital herpes                   0
STDs:molluscum contagiosum            0
STDs:AIDS                             0
STDs:HIV                              0
STDs:Hepatitis B                      0
STDs:HPV                              0


### Scale Standardization

Scale standardization converts each value to the number of standard deviations it lies from the mean of its feature. The result shows how extreme (or how typical) a value is relative to the rest of the column.
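
As a small worked illustration of that definition, the sketch below takes the five ages shown in `df.head()` as a toy sample, computes the z-scores by hand, and compares them with `StandardScaler`, which uses the population standard deviation (`ddof=0`):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy sample: the five ages shown in df.head(), as a single-column array
ages = np.array([[18.0], [15.0], [34.0], [52.0], [46.0]])

# z = (x - mean) / standard deviation; np.std defaults to ddof=0
manual_z = (ages - ages.mean()) / ages.std()
scaled = StandardScaler().fit_transform(ages)

print(np.allclose(manual_z, scaled))
```

After scaling, the column has mean 0 and standard deviation 1, so a value of 2.0 reads as "two standard deviations above average" regardless of the original units.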

Standardization may help with our dataset because some features take much larger values than others and could dominate distance calculations. It also expresses each value in the simpler terms of how far it sits from the average.

I'm not sure if standardization will help or hinder the accuracy of the models, so I will save one dataset that uses standardization and one that does not. This way we can try modeling on both datasets and see which gets better results.

In [13]:
# split into train and test data
from sklearn.model_selection import train_test_split

X = df.drop(['Hinselmann', 'Schiller', 'Citology', 'Biopsy'], axis=1)
y = df[['Hinselmann', 'Schiller', 'Citology', 'Biopsy']]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

processed_train = pd.concat([X_train, y_train], axis=1)
processed_test = pd.concat([X_test, y_test], axis=1)

# Verify the shapes
print("Shape of X_train:", X_train.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of train:", processed_train.shape)

print("Shape of X_test:", X_test.shape)
print("Shape of y_test:", y_test.shape)
print("Shape of test:", processed_test.shape)

processed_train.to_csv('../data/processed_train.csv')
processed_test.to_csv('../data/processed_test.csv')

Shape of X_train: (684, 33)
Shape of y_train: (684, 4)
Shape of train: (684, 37)
Shape of X_test: (171, 33)
Shape of y_test: (171, 4)
Shape of test: (171, 37)


Now we can standardize the data and save that set separately.

In [14]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# 0/1 indicator columns are left unscaled
bool_cols = ['Smokes', 'Hormonal Contraceptives', 'IUD', 'STDs', 'STDs:condylomatosis', 'STDs:cervical condylomatosis',
             'STDs:vaginal condylomatosis', 'STDs:vulvo-perineal condylomatosis', 'STDs:syphilis', 'STDs:pelvic inflammatory disease',
             'STDs:genital herpes', 'STDs:molluscum contagiosum', 'STDs:AIDS', 'STDs:HIV', 'STDs:Hepatitis B', 'STDs:HPV',
             'Dx:Cancer', 'Dx:CIN', 'Dx:HPV', 'Dx', 'Missing STDs columns']

# count and duration columns are standardized; each feature appears in exactly one of the two lists
numeric_cols = ['Age', 'Number of sexual partners', 'First sexual intercourse', 'Num of pregnancies', 'Smokes (years)',
                'Smokes (packs/year)', 'Hormonal Contraceptives (years)', 'IUD (years)', 'STDs (number)',
                'STDs: Number of diagnosis', 'STDs: Time since first diagnosis', 'STDs: Time since last diagnosis']

scaler = scaler.fit(X_train[numeric_cols])
X_train_numeric_standardized = pd.DataFrame(scaler.transform(X_train[numeric_cols]), columns=numeric_cols)
X_test_numeric_standardized = pd.DataFrame(scaler.transform(X_test[numeric_cols]), columns=numeric_cols)

X_train_standardized = pd.concat([X_train_numeric_standardized, X_train[bool_cols].reset_index(drop=True)], axis=1)
X_test_standardized = pd.concat([X_test_numeric_standardized, X_test[bool_cols].reset_index(drop=True)], axis=1)

y_train.reset_index(drop=True, inplace=True)
y_test.reset_index(drop=True, inplace=True)

train = pd.concat([X_train_standardized, y_train], axis=1)
test = pd.concat([X_test_standardized, y_test], axis=1)

# Verify the shapes
print("Shape of X_train_standardized:", X_train_standardized.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of train:", train.shape)

print("Shape of X_test_standardized:", X_test_standardized.shape)
print("Shape of y_test:", y_test.shape)
print("Shape of test:", test.shape)

train.to_csv('../data/standardized_train.csv')
test.to_csv('../data/standardized_test.csv')

Shape of X_train_standardized: (684, 33)
Shape of y_train: (684, 4)
Shape of train: (684, 37)
Shape of X_test_standardized: (171, 33)
Shape of y_test: (171, 4)
Shape of test: (171, 37)
