# Machine Learning - Predicting Treatment Abandonment with scikit-learn
By **Daniel Palacio** (github.com/palaciodaniel) - 2020

## STEP TWO - Adapting the DataFrame

Having created a working DataFrame with the patients' personality attributes and whether or not they finished their treatment, it is now time to clean it so that it becomes suitable for Machine Learning.

The main challenge in this step is converting every column to numeric values; otherwise we will not be able to fit them to our Machine Learning model. As we will see, only one column ("Age") is currently stored as an integer, while the rest hold strings or booleans.

In [1]:
# Loading DataFrame into variable 'patients_df'

import pandas as pd

patients_df = pd.read_csv("df_patients.csv", header = 0, index_col = 0)

patients_df.head()

Unnamed: 0,Name,Genre,Age,Neuroticism,Motivation,Resourcefulness,Social Expectation,Introspection,Discipline,Victimhood,Finished
0,Clayton Murillo,M,67,Medium,Medium,Low,Medium,High,Medium,False,False
1,Mary Johnson,F,35,Low,High,Low,Medium,Low,Medium,False,True
2,Angela Taylor,F,25,Low,Low,Low,Medium,Low,High,False,False
3,Sarah Massey,F,48,Low,High,High,Low,Medium,High,False,True
4,Joseph Lam,M,48,Low,Medium,Low,High,Medium,Medium,True,False


In [2]:
# Check data types on every column from 'patients_df'

patients_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 100 entries, 0 to 99
Data columns (total 11 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   Name                100 non-null    object
 1   Genre               100 non-null    object
 2   Age                 100 non-null    int64 
 3   Neuroticism         100 non-null    object
 4   Motivation          100 non-null    object
 5   Resourcefulness     100 non-null    object
 6   Social Expectation  100 non-null    object
 7   Introspection       100 non-null    object
 8   Discipline          100 non-null    object
 9   Victimhood          100 non-null    bool  
 10  Finished            100 non-null    bool  
dtypes: bool(2), int64(1), object(8)
memory usage: 8.0+ KB
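
The dtype listing above can also be queried programmatically, which helps when a DataFrame has too many columns to scan by eye. The following is a minimal sketch on a made-up frame (`toy_df` and its values are illustrative assumptions, not the real `patients_df`), using `DataFrame.select_dtypes` to separate text columns from boolean ones.

```python
import pandas as pd

# Hypothetical miniature of the patients table, for illustration only.
toy_df = pd.DataFrame({
    "Name": ["Clayton Murillo", "Mary Johnson"],
    "Age": [67, 35],
    "Motivation": ["Medium", "High"],
    "Victimhood": [False, False],
})

# Columns that are neither numeric nor boolean still need encoding (or dropping).
non_numeric_cols = toy_df.select_dtypes(exclude=["number", "bool"]).columns.tolist()

# Boolean columns only need a cast to integers.
bool_cols = toy_df.select_dtypes(include="bool").columns.tolist()

print("Needs encoding or dropping:", non_numeric_cols)
print("Needs a cast to int:", bool_cols)
```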


In [3]:
# Cast the boolean columns to integers (False becomes 0, True becomes 1)

patients_df[["Victimhood", "Finished"]] = patients_df[["Victimhood", "Finished"]].astype("int")
patients_df.head()

Unnamed: 0,Name,Genre,Age,Neuroticism,Motivation,Resourcefulness,Social Expectation,Introspection,Discipline,Victimhood,Finished
0,Clayton Murillo,M,67,Medium,Medium,Low,Medium,High,Medium,0,0
1,Mary Johnson,F,35,Low,High,Low,Medium,Low,Medium,0,1
2,Angela Taylor,F,25,Low,Low,Low,Medium,Low,High,0,0
3,Sarah Massey,F,48,Low,High,High,Low,Medium,High,0,1
4,Joseph Lam,M,48,Low,Medium,Low,High,Medium,Medium,1,0


In [4]:
# Dropping column 'Name' (this one isn't necessary for the model)

patients_df = patients_df.drop("Name", axis = 1)
patients_df.head()

Unnamed: 0,Genre,Age,Neuroticism,Motivation,Resourcefulness,Social Expectation,Introspection,Discipline,Victimhood,Finished
0,M,67,Medium,Medium,Low,Medium,High,Medium,0,0
1,F,35,Low,High,Low,Medium,Low,Medium,0,1
2,F,25,Low,Low,Low,Medium,Low,High,0,0
3,F,48,Low,High,High,Low,Medium,High,0,1
4,M,48,Low,Medium,Low,High,Medium,Medium,1,0


In [5]:
# Transforming categorical columns into dummies

categorical_cols = ["Genre", "Neuroticism", "Motivation", "Resourcefulness",
                    "Social Expectation", "Introspection", "Discipline"]

df_dummies = pd.get_dummies(patients_df[categorical_cols])

print(df_dummies)

    Genre_F  Genre_M  Neuroticism_High  Neuroticism_Low  Neuroticism_Medium  \
0         0        1                 0                0                   1   
1         1        0                 0                1                   0   
2         1        0                 0                1                   0   
3         1        0                 0                1                   0   
4         0        1                 0                1                   0   
..      ...      ...               ...              ...                 ...   
95        0        1                 1                0                   0   
96        0        1                 0                0                   1   
97        0        1                 0                1                   0   
98        0        1                 0                0                   1   
99        0        1                 0                0                   1   

    Motivation_High  Motivation_Low  Motivation_Med

In [6]:
# Dropping original categorical columns from 'patients_df'

patients_df = patients_df.drop(["Genre", "Neuroticism", "Motivation", "Resourcefulness",
                                "Social Expectation", "Introspection", "Discipline"], axis = 1)
patients_df.head()

Unnamed: 0,Age,Victimhood,Finished
0,67,0,0
1,35,0,1
2,25,0,0
3,48,0,1
4,48,1,0


In [7]:
# Concatenating the three remaining columns of 'patients_df' ('Age', 'Victimhood', 'Finished') with 'df_dummies'.

prepared_df = pd.concat([patients_df, df_dummies], axis = 1)
print(prepared_df.head())
print("DataFrame's shape: ", prepared_df.shape)

   Age  Victimhood  Finished  Genre_F  Genre_M  Neuroticism_High  \
0   67           0         0        0        1                 0   
1   35           0         1        1        0                 0   
2   25           0         0        1        0                 0   
3   48           0         1        1        0                 0   
4   48           1         0        0        1                 0   

   Neuroticism_Low  Neuroticism_Medium  Motivation_High  Motivation_Low  ...  \
0                0                   1                0               0  ...   
1                1                   0                1               0  ...   
2                1                   0                0               1  ...   
3                1                   0                1               0  ...   
4                1                   0                0               0  ...   

   Resourcefulness_Medium  Social Expectation_High  Social Expectation_Low  \
0                       0       
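
The encode, drop and concatenate steps in cells 5 to 7 can likely be collapsed into a single call, because `pd.get_dummies` accepts a `columns` argument and returns the whole frame with those columns replaced by their dummies. This is only a sketch on a made-up frame (`toy_df` is an illustrative assumption), and `dtype=int` is passed so the result holds 0/1 integers regardless of the pandas version.

```python
import pandas as pd

# Hypothetical miniature of the patients table, for illustration only.
toy_df = pd.DataFrame({
    "Genre": ["M", "F", "F"],
    "Age": [67, 35, 25],
    "Motivation": ["Medium", "High", "Low"],
    "Finished": [0, 1, 0],
})

# Encode the listed columns in place; untouched columns are kept as they are.
encoded_df = pd.get_dummies(toy_df, columns=["Genre", "Motivation"], dtype=int)

print(encoded_df.columns.tolist())
print(encoded_df)
```

Note that the untouched columns come first and the dummy columns follow, so a reordering step would still be needed to place the target column last.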

In [8]:
# Removing redundant column "Genre_F" (it is always the opposite of "Genre_M", so it adds no information)

prepared_df = prepared_df.drop("Genre_F", axis = 1)
prepared_df.head()

Unnamed: 0,Age,Victimhood,Finished,Genre_M,Neuroticism_High,Neuroticism_Low,Neuroticism_Medium,Motivation_High,Motivation_Low,Motivation_Medium,...,Resourcefulness_Medium,Social Expectation_High,Social Expectation_Low,Social Expectation_Medium,Introspection_High,Introspection_Low,Introspection_Medium,Discipline_High,Discipline_Low,Discipline_Medium
0,67,0,0,1,0,0,1,0,0,1,...,0,0,0,1,1,0,0,0,0,1
1,35,0,1,0,0,1,0,1,0,0,...,0,0,0,1,0,1,0,0,0,1
2,25,0,0,0,0,1,0,0,1,0,...,0,0,0,1,0,1,0,1,0,0
3,48,0,1,0,0,1,0,1,0,0,...,0,0,1,0,0,0,1,1,0,0
4,48,1,0,1,0,1,0,0,0,1,...,0,1,0,0,0,0,1,0,0,1
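
To see why "Genre_F" carries no extra information, the sketch below checks on a made-up column that the two dummies always sum to 1 (`toy_df` is an illustrative assumption). It also shows the `drop_first` option of `pd.get_dummies`, which drops the first level automatically. Be aware that `drop_first=True` would drop one level from every encoded column, including the three-level attributes, which is not what this notebook does.

```python
import pandas as pd

# Hypothetical two-level column, for illustration only.
toy_df = pd.DataFrame({"Genre": ["M", "F", "F", "M"]})

# Full one-hot encoding: one column per level.
full_df = pd.get_dummies(toy_df, columns=["Genre"], dtype=int)

# Each row has exactly one of the two dummies set, so one column determines the other.
is_complement = bool((full_df["Genre_F"] == 1 - full_df["Genre_M"]).all())
print("Genre_F is the complement of Genre_M:", is_complement)

# drop_first=True removes the first level (here 'F') during encoding.
reduced_df = pd.get_dummies(toy_df, columns=["Genre"], drop_first=True, dtype=int)
print(reduced_df.columns.tolist())
```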


In [9]:
# Reordering dummy columns so that new order is 'Low', 'Medium' and then 'High'.
# Also, target column "Finished" will be moved to the very end.

prepared_df = prepared_df[["Age", "Genre_M", "Victimhood",
                           "Discipline_Low", "Discipline_Medium", "Discipline_High", 
                           "Introspection_Low", "Introspection_Medium", "Introspection_High",
                           "Motivation_Low", "Motivation_Medium", "Motivation_High",
                           "Neuroticism_Low", "Neuroticism_Medium", "Neuroticism_High",
                           "Resourcefulness_Low", "Resourcefulness_Medium", "Resourcefulness_High",
                           "Social Expectation_Low", "Social Expectation_Medium", "Social Expectation_High",
                           "Finished"]]
prepared_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 100 entries, 0 to 99
Data columns (total 22 columns):
 #   Column                     Non-Null Count  Dtype
---  ------                     --------------  -----
 0   Age                        100 non-null    int64
 1   Genre_M                    100 non-null    uint8
 2   Victimhood                 100 non-null    int64
 3   Discipline_Low             100 non-null    uint8
 4   Discipline_Medium          100 non-null    uint8
 5   Discipline_High            100 non-null    uint8
 6   Introspection_Low          100 non-null    uint8
 7   Introspection_Medium       100 non-null    uint8
 8   Introspection_High         100 non-null    uint8
 9   Motivation_Low             100 non-null    uint8
 10  Motivation_Medium          100 non-null    uint8
 11  Motivation_High            100 non-null    uint8
 12  Neuroticism_Low            100 non-null    uint8
 13  Neuroticism_Medium         100 non-null    uint8
 14  Neuroticism_High           
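
The column order used in cell 9 follows a regular pattern (each attribute expanded into 'Low', 'Medium', 'High'), so the list could also be generated rather than typed by hand. The following is a sketch of that idea; the helper names `attributes`, `levels` and `ordered_cols` are assumptions introduced here, not part of the original notebook.

```python
# Attributes that were one-hot encoded, in the order used in cell 9.
attributes = ["Discipline", "Introspection", "Motivation",
              "Neuroticism", "Resourcefulness", "Social Expectation"]

# Desired order of the levels within each attribute.
levels = ["Low", "Medium", "High"]

# Leading columns, then every attribute/level pair, then the target column last.
ordered_cols = (
    ["Age", "Genre_M", "Victimhood"]
    + [f"{attribute}_{level}" for attribute in attributes for level in levels]
    + ["Finished"]
)

print(len(ordered_cols))
print(ordered_cols[:6])
```

With the real data, the resulting list could then be used as `prepared_df[ordered_cols]`.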

In [10]:
# Renaming 'Genre_M' to 'Sex_Male'
prepared_df = prepared_df.rename(columns = {"Genre_M": "Sex_Male"})
print(prepared_df["Sex_Male"][:5])

0    1
1    0
2    0
3    0
4    1
Name: Sex_Male, dtype: uint8


In [11]:
# Saving DataFrame 'prepared_df' to a CSV file

prepared_df.to_csv("df_prepared.csv", index = True)
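
Since the file is written with `index = True`, it should be read back with `index_col = 0`, as was done when loading "df_patients.csv" at the top of this notebook. The sketch below checks such a round trip on a made-up frame written to a temporary directory (`toy_df` and the file name are illustrative assumptions). Keep in mind that a CSV file does not store dtypes, so columns such as the `uint8` dummies will be re-inferred by pandas on load.

```python
import os
import tempfile

import pandas as pd

# Hypothetical miniature of the prepared table, for illustration only.
toy_df = pd.DataFrame({"Age": [67, 35], "Sex_Male": [1, 0], "Finished": [0, 1]})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "toy_prepared.csv")

    # Write the index as the first CSV column...
    toy_df.to_csv(csv_path, index=True)

    # ...and tell pandas to treat that first column as the index again.
    reloaded_df = pd.read_csv(csv_path, header=0, index_col=0)

print(reloaded_df)
```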