# Machine Learning - Predicting Treatment Abandonment with scikit-learn
By **Daniel Palacio** (github.com/palaciodaniel) - 2020

## STEP TWO (ALTERNATE) - Preparing the DataFrame with Ordinal Encoding

The challenge in this alternate step is the same as in the original one: every column must be converted to numeric values, otherwise we cannot fit our Machine Learning model on them.

However, instead of relying only on _dummies_, here we will use Ordinal Encoding on several columns, specifically the ones whose values range from "Low" to "High".

Let's start with a brief exploratory analysis to remember the layout of the DataFrame...

In [1]:
import pandas as pd
patients_df = pd.read_csv("df_patients.csv", header = 0, index_col = 0)
patients_df.head()

Unnamed: 0,Name,Genre,Age,Neuroticism,Motivation,Resourcefulness,Social Expectation,Introspection,Discipline,Victimhood,Finished
0,Clayton Murillo,M,67,Medium,Medium,Low,Medium,High,Medium,False,False
1,Mary Johnson,F,35,Low,High,Low,Medium,Low,Medium,False,True
2,Angela Taylor,F,25,Low,Low,Low,Medium,Low,High,False,False
3,Sarah Massey,F,48,Low,High,High,Low,Medium,High,False,True
4,Joseph Lam,M,48,Low,Medium,Low,High,Medium,Medium,True,False


In [2]:
# Check data types on every column from 'patients_df'

patients_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 100 entries, 0 to 99
Data columns (total 11 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   Name                100 non-null    object
 1   Genre               100 non-null    object
 2   Age                 100 non-null    int64 
 3   Neuroticism         100 non-null    object
 4   Motivation          100 non-null    object
 5   Resourcefulness     100 non-null    object
 6   Social Expectation  100 non-null    object
 7   Introspection       100 non-null    object
 8   Discipline          100 non-null    object
 9   Victimhood          100 non-null    bool  
 10  Finished            100 non-null    bool  
dtypes: bool(2), int64(1), object(8)
memory usage: 8.0+ KB
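
Since the plan depends on the text columns holding only "Low", "Medium" and "High", it can help to verify that before encoding anything. The following is a minimal sketch, assuming a small hypothetical `sample_df` that stands in for a few rows of 'patients_df'; the names `ordinal_cols`, `expected_levels` and `unexpected` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Hypothetical stand-in for a few rows of 'patients_df'
sample_df = pd.DataFrame({
    "Age": [67, 35, 25],
    "Neuroticism": ["Medium", "Low", "Low"],
    "Motivation": ["Medium", "High", "Low"],
    "Victimhood": [False, False, False],
})

# Columns that are expected to contain only the three ordinal levels
ordinal_cols = ["Neuroticism", "Motivation"]
expected_levels = {"Low", "Medium", "High"}

# For each column, collect any value that is not one of the expected levels
unexpected = {}
for col in ordinal_cols:
    extra = set(sample_df[col].unique()) - expected_levels
    if extra:
        unexpected[col] = extra

# An empty dictionary means every ordinal column is clean
print(unexpected)
```

If `unexpected` is not empty, the offending values (for example a misspelled level) should be fixed before moving on to the encoding.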


- - -
We will start by deleting the column "Name", since it is not necessary for the model. A patient's name carries no information about whether the treatment will be finished, so it has no predictive value.

In [3]:
prepared_df = patients_df.drop("Name", axis = 1)
prepared_df.head()

Unnamed: 0,Genre,Age,Neuroticism,Motivation,Resourcefulness,Social Expectation,Introspection,Discipline,Victimhood,Finished
0,M,67,Medium,Medium,Low,Medium,High,Medium,False,False
1,F,35,Low,High,Low,Medium,Low,Medium,False,True
2,F,25,Low,Low,Low,Medium,Low,High,False,False
3,F,48,Low,High,High,Low,Medium,High,False,True
4,M,48,Low,Medium,Low,High,Medium,Medium,True,False


As a second step, we transform the gender column (categorical) into dummies.

In [4]:
# Fix the typo in the column name by renaming 'Genre' to 'Gender'

prepared_df = prepared_df.rename(columns = {"Genre": "Gender"})
prepared_df["Gender"].head()

0    M
1    F
2    F
3    F
4    M
Name: Gender, dtype: object

Notice that the dummy columns are generated in alphabetical order: "(F)emale" precedes "(M)ale", so "Gender_F" comes first and "Gender_M" second. Once we keep only "Gender_M" further below, females end up with the value 0 and males with the value 1.

In [5]:
gender_dummies = pd.get_dummies(prepared_df["Gender"], prefix = "Gender")
gender_dummies.head()

Unnamed: 0,Gender_F,Gender_M
0,0,1
1,1,0
2,1,0
3,1,0
4,0,1


When we attach the dummies columns to the main DataFrame, we will see there is a small inconvenience...

In [6]:
prepared_df = pd.concat([gender_dummies, prepared_df], axis = 1)
prepared_df.head()

Unnamed: 0,Gender_F,Gender_M,Gender,Age,Neuroticism,Motivation,Resourcefulness,Social Expectation,Introspection,Discipline,Victimhood,Finished
0,0,1,M,67,Medium,Medium,Low,Medium,High,Medium,False,False
1,1,0,F,35,Low,High,Low,Medium,Low,Medium,False,True
2,1,0,F,25,Low,Low,Low,Medium,Low,High,False,False
3,1,0,F,48,Low,High,High,Low,Medium,High,False,True
4,0,1,M,48,Low,Medium,Low,High,Medium,Medium,True,False


...so for that reason we need to drop the original "Gender" column from 'prepared_df'.

Also, "Gender_F" will be removed as well, because it is redundant (if "Gender_M" is 0, then the patient is female).

In [7]:
prepared_df = prepared_df.drop(["Gender", "Gender_F"], axis = 1)
prepared_df.head()

Unnamed: 0,Gender_M,Age,Neuroticism,Motivation,Resourcefulness,Social Expectation,Introspection,Discipline,Victimhood,Finished
0,1,67,Medium,Medium,Low,Medium,High,Medium,False,False
1,0,35,Low,High,Low,Medium,Low,Medium,False,True
2,0,25,Low,Low,Low,Medium,Low,High,False,False
3,0,48,Low,High,High,Low,Medium,High,False,True
4,1,48,Low,Medium,Low,High,Medium,Medium,True,False
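
The create, concatenate and drop sequence above can also be condensed into a single call. This is only a sketch of an alternative, assuming a hypothetical `sample_df` with the same "Gender" values as the first five patients; it relies on the `columns`, `drop_first` and `dtype` parameters of `pd.get_dummies`.

```python
import pandas as pd

# Hypothetical stand-in for the first rows of 'prepared_df'
sample_df = pd.DataFrame({
    "Gender": ["M", "F", "F", "F", "M"],
    "Age": [67, 35, 25, 48, 48],
})

# columns=["Gender"] replaces the original column with its dummies,
# drop_first=True discards the first level in alphabetical order ("F"),
# and dtype="uint8" stores the result as 0/1 integers
encoded_df = pd.get_dummies(
    sample_df,
    columns=["Gender"],
    prefix="Gender",
    drop_first=True,
    dtype="uint8",
)

print(encoded_df)
```

One difference from the step-by-step version is the column order: here "Gender_M" is appended after the untouched columns instead of being placed first.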


The following part of the cleaning process covers the majority of the remaining columns: "Neuroticism", "Motivation", "Resourcefulness", "Social Expectation", "Introspection" and "Discipline". This is the main difference from the original second step.

Since these columns contain ordinal values ("Low" - "Medium" - "High"), we could import _OrdinalEncoder_ from _sklearn.preprocessing_ and apply it to them. However, by default it sorts the categories in alphabetical order, so we would get 0 for "High", 1 for "Low" and 2 for "Medium", which would be highly confusing to interpret unless we passed the desired order explicitly through its _categories_ parameter.

Luckily, we can transform these values in a single line of code, like this:

In [8]:
prepared_df = prepared_df.replace({"Low": 0, "Medium": 1, "High": 2})
prepared_df.head()

Unnamed: 0,Gender_M,Age,Neuroticism,Motivation,Resourcefulness,Social Expectation,Introspection,Discipline,Victimhood,Finished
0,1,67,1,1,0,1,2,1,False,False
1,0,35,0,2,0,1,0,1,False,True
2,0,25,0,0,0,1,0,2,False,False
3,0,48,0,2,2,0,1,2,False,True
4,1,48,0,1,0,2,1,1,True,False
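
For comparison, the same mapping can be obtained with _OrdinalEncoder_ when the order of the levels is given explicitly. The sketch below is not the approach used in this notebook; it assumes a hypothetical `sample_df` with two of the ordinal columns, and the names `level_order` and `encoded_df` are illustrative.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical stand-in for two ordinal columns of 'prepared_df'
sample_df = pd.DataFrame({
    "Neuroticism": ["Medium", "Low", "Low"],
    "Motivation": ["Medium", "High", "Low"],
})

ordinal_cols = ["Neuroticism", "Motivation"]
level_order = ["Low", "Medium", "High"]

# One category list per column, so "Low" -> 0, "Medium" -> 1, "High" -> 2
encoder = OrdinalEncoder(categories=[level_order] * len(ordinal_cols))

# fit_transform returns a float array by default, so it is wrapped
# back into a DataFrame and cast to small integers
encoded_df = pd.DataFrame(
    encoder.fit_transform(sample_df[ordinal_cols]),
    columns=ordinal_cols,
    index=sample_df.index,
).astype("uint8")

print(encoded_df)
```

Unlike a DataFrame-wide `replace`, the encoder only touches the listed columns and raises an error on a value that is not in `level_order`, which can be useful when the data has not been checked beforehand.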


To finish the cleaning process, we have to transform the boolean columns into integers (so that 0 = False and 1 = True). However, we can extend that operation to the rest of the DataFrame, so that we can be totally certain that all values are integers.

Since every value lies between 0 and 255, we can use "uint8" to encode the columns.

In [9]:
prepared_df = prepared_df.astype("uint8")
prepared_df.head()

Unnamed: 0,Gender_M,Age,Neuroticism,Motivation,Resourcefulness,Social Expectation,Introspection,Discipline,Victimhood,Finished
0,1,67,1,1,0,1,2,1,0,0
1,0,35,0,2,0,1,0,1,0,1
2,0,25,0,0,0,1,0,2,0,0
3,0,48,0,2,2,0,1,2,0,1
4,1,48,0,1,0,2,1,1,1,0
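
The cast to "uint8" only preserves the data if every value really fits in the 0 to 255 range. A minimal sketch of a guard for that assumption follows; `sample_df`, `limits` and `fits` are hypothetical names used for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a few columns of 'prepared_df'
sample_df = pd.DataFrame({
    "Age": [67, 35, 25],
    "Motivation": [1, 2, 0],
    "Finished": [False, True, False],
})

# Smallest and largest values that "uint8" can represent (0 and 255)
limits = np.iinfo("uint8")

# Compare as plain integers so that booleans are handled as 0/1
as_int = sample_df.astype("int64")
fits = bool(((as_int >= limits.min) & (as_int <= limits.max)).all().all())

if not fits:
    raise ValueError("Some values fall outside the uint8 range")

compact_df = sample_df.astype("uint8")
print(compact_df.dtypes)
```

If a wider range were ever needed (for instance, a column with negative numbers), a larger integer type such as "int16" would be the safer choice.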


Let's confirm that all columns are integers...

In [10]:
prepared_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 100 entries, 0 to 99
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   Gender_M            100 non-null    uint8
 1   Age                 100 non-null    uint8
 2   Neuroticism         100 non-null    uint8
 3   Motivation          100 non-null    uint8
 4   Resourcefulness     100 non-null    uint8
 5   Social Expectation  100 non-null    uint8
 6   Introspection       100 non-null    uint8
 7   Discipline          100 non-null    uint8
 8   Victimhood          100 non-null    uint8
 9   Finished            100 non-null    uint8
dtypes: uint8(10)
memory usage: 1.8 KB


We are ready for the next step. We only have to save this new DataFrame to another CSV file and that's all for now.

In [11]:
prepared_df.to_csv("df_ordinal_prepared.csv", index = True)
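
One detail to keep in mind for the next step: a CSV file does not store column types, so the "uint8" encoding is not kept when the file is read back. The sketch below illustrates this with an in-memory buffer instead of a real file; `sample_df` and `buffer` are hypothetical stand-ins.

```python
import io

import pandas as pd

# Hypothetical stand-in for 'prepared_df', already encoded as uint8
sample_df = pd.DataFrame({"Gender_M": [1, 0], "Age": [67, 35]}, dtype="uint8")

# Write to an in-memory buffer instead of "df_ordinal_prepared.csv"
buffer = io.StringIO()
sample_df.to_csv(buffer, index=True)
buffer.seek(0)

# Reading back: the index is restored from the first column,
# but integer columns are parsed with pandas' default integer type
restored_df = pd.read_csv(buffer, header=0, index_col=0)
dtype_after_read = str(restored_df["Age"].dtype)

# Re-apply the compact type if it matters downstream
restored_df = restored_df.astype("uint8")
print(dtype_after_read, restored_df.dtypes.tolist())
```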