# Machine Learning - Predicting Treatment Abandonment with scikit-learn
By **Daniel Palacio** (github.com/palaciodaniel) - 2020

## STEP THREE - Applying Machine Learning
With a fully numeric DataFrame in hand, we are ready to fit a Machine Learning model to it. Because the target column "Finished" is binary (whether or not the patient finished the treatment), we will use a Logistic Regression model.

This model requires us to scale the data, so that columns with larger numeric ranges, such as "Age", do not dominate the binary columns.

Several metrics are used to assess the quality of the predictions: accuracy, precision, recall, a confusion matrix and cross-validated accuracy.

In [1]:
# Importing required libraries

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score, recall_score, precision_score
import pandas as pd

In [2]:
# Loading prepared DataFrame (only numeric values)

df = pd.read_csv("df_prepared.csv", header = 0, index_col = 0)

df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 100 entries, 0 to 99
Data columns (total 22 columns):
 #   Column                     Non-Null Count  Dtype
---  ------                     --------------  -----
 0   Age                        100 non-null    int64
 1   Sex_Male                   100 non-null    int64
 2   Victimhood                 100 non-null    int64
 3   Discipline_Low             100 non-null    int64
 4   Discipline_Medium          100 non-null    int64
 5   Discipline_High            100 non-null    int64
 6   Introspection_Low          100 non-null    int64
 7   Introspection_Medium       100 non-null    int64
 8   Introspection_High         100 non-null    int64
 9   Motivation_Low             100 non-null    int64
 10  Motivation_Medium          100 non-null    int64
 11  Motivation_High            100 non-null    int64
 12  Neuroticism_Low            100 non-null    int64
 13  Neuroticism_Medium         100 non-null    int64
 14  Neuroticism_High           

### Final preparations

In [3]:
# Feature columns

X = df.iloc[:,:-1].to_numpy()
print(X[:5])

[[67  1  0  0  1  0  0  0  1  0  1  0  0  1  0  1  0  0  0  1  0]
 [35  0  0  0  1  0  1  0  0  0  0  1  1  0  0  1  0  0  0  1  0]
 [25  0  0  0  0  1  1  0  0  1  0  0  1  0  0  1  0  0  0  1  0]
 [48  0  0  0  0  1  0  1  0  0  0  1  1  0  0  0  0  1  1  0  0]
 [48  1  1  0  1  0  0  1  0  0  1  0  1  0  0  1  0  0  0  0  1]]


In [4]:
# Target column

y = df.iloc[:, -1].to_numpy()
print(y[:5])

[0 1 0 1 0]


In [5]:
# Dividing between training and test subsets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, shuffle = True, random_state = 24)

print("X_train:", X_train.shape, type(X_train))
print("X_test:", X_test.shape, type(X_test))
print("y_train:", y_train.shape, type(y_train))
print("y_test:", y_test.shape, type(y_test))

X_train: (80, 21) <class 'numpy.ndarray'>
X_test: (20, 21) <class 'numpy.ndarray'>
y_train: (80,) <class 'numpy.ndarray'>
y_test: (20,) <class 'numpy.ndarray'>


In [6]:
# Scaling the data

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
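
The scaler is fitted on the training subset only, and the test subset is then transformed with the statistics learned there, so the test rows do not influence the preprocessing. A minimal sketch of that behaviour follows; the toy arrays and the `toy_` names are made up for illustration and are not taken from the dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical): one age-like column and one binary column
toy_train = np.array([[20.0, 0.0], [40.0, 1.0], [60.0, 1.0], [80.0, 0.0]])
toy_test = np.array([[50.0, 1.0]])

toy_scaler = StandardScaler()
toy_train_scaled = toy_scaler.fit_transform(toy_train)  # learns mean and std from the training rows
toy_test_scaled = toy_scaler.transform(toy_test)        # reuses those training statistics

print("Learned means:", toy_scaler.mean_)
print("Learned stds:", toy_scaler.scale_)
print("Scaled test row:", toy_test_scaled)
```

After scaling, each training column has mean 0 and standard deviation 1, so the age-like column and the binary column end up on comparable scales.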

### Model application

In [7]:
# Instantiating and fitting a Logistic Regression model

log_reg = LogisticRegression(solver="liblinear", random_state=24)

log_reg.fit(X_train_scaled, y_train)

LogisticRegression(random_state=24, solver='liblinear')

In [8]:
# Making predictions with the Logistic Regression model

y_pred = log_reg.predict(X_test_scaled)
print("y_pred:", y_pred.shape, y_pred)
print("y_test:", y_test.shape, y_test)

y_pred: (20,) [1 0 1 0 0 1 0 1 1 0 0 0 1 1 1 0 1 0 0 1]
y_test: (20,) [1 1 1 0 1 1 0 1 1 1 1 0 0 1 1 0 1 1 0 0]
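
The labels returned by `predict` hide how confident the model is. As a rough sketch, `predict_proba` exposes the estimated class probabilities, and for a binary problem the predicted label corresponds to thresholding the class-1 probability at 0.5. The data below is a synthetic stand-in generated with `make_classification`, and the `demo` names are hypothetical; it is not the patient dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (hypothetical), not the patient dataset
X_demo, y_demo = make_classification(n_samples=100, n_features=5, random_state=0)

demo_model = LogisticRegression(solver="liblinear", random_state=24)
demo_model.fit(X_demo, y_demo)

# Column 1 holds the estimated probability of class 1
proba_class_1 = demo_model.predict_proba(X_demo)[:, 1]
labels = demo_model.predict(X_demo)

# For a binary problem, predict() matches thresholding that probability at 0.5
labels_from_proba = (proba_class_1 > 0.5).astype(int)
print("Same labels:", np.array_equal(labels, labels_from_proba))
```

Inspecting the probabilities would make it possible to try a threshold other than 0.5 if missing patients who abandon treatment turned out to be costlier than raising false alarms.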


### Scores

In [9]:
# Accuracy, precision and recall scores
print("Accuracy:", round(accuracy_score(y_test, y_pred), 2))
print("Precision:", round(precision_score(y_test, y_pred), 2))
print("Recall:", round(recall_score(y_test, y_pred), 2))

Accuracy: 0.65
Precision: 0.8
Recall: 0.62


In [10]:
# Confusion Matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print(conf_matrix)

[[5 2]
 [5 8]]


In [11]:
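# Unpacking the confusion matrix into its four counts
# (for binary labels, ravel() returns them in the order tn, fp, fn, tp)
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()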
print("True Positives:", tp)
print("True Negatives:", tn)
print("False Positives:", fp)
print("False Negatives:", fn)

True Positives: 8
True Negatives: 5
False Positives: 2
False Negatives: 5
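
The accuracy, precision and recall scores can be recomputed by hand from these four counts, which is a quick way to check that the numbers agree with each other. The variable names below are illustrative.

```python
# Counts taken from the confusion matrix printed above
tp, tn, fp, fn = 8, 5, 2, 5

accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all predictions that are correct
precision = tp / (tp + fp)                  # share of predicted positives that are truly positive
recall = tp / (tp + fn)                     # share of actual positives that were found

print("Accuracy:", round(accuracy, 2))    # 0.65
print("Precision:", round(precision, 2))  # 0.8
print("Recall:", round(recall, 2))        # 0.62
```

These values match the scores reported by `accuracy_score`, `precision_score` and `recall_score`.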


In [12]:
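# Cross Validation Score (5 folds, default accuracy scoring)
# Note: X is passed unscaled here, unlike the train/test evaluation above
from sklearn.model_selection import cross_val_score

scores = cross_val_score(log_reg, X, y, cv=5)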
print("Cross Validation Accuracy Scores:", scores)

Cross Validation Accuracy Scores: [0.85 0.8  0.75 0.6  0.6 ]
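
The cross-validation call above receives the unscaled `X`, whereas the train/test evaluation used scaled features. One possible way to keep the two consistent is to wrap the scaler and the model in a pipeline, so that the scaler is refitted on the training part of every fold. The sketch below runs on synthetic stand-in data from `make_classification`; the `demo` names and `scaled_log_reg` are hypothetical, and in the notebook the real `X` and `y` would take their place.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical); in the notebook this would be X and y
X_demo, y_demo = make_classification(n_samples=100, n_features=21, random_state=0)

# The pipeline refits the scaler on the training part of every fold
scaled_log_reg = make_pipeline(
    StandardScaler(),
    LogisticRegression(solver="liblinear", random_state=24),
)

demo_scores = cross_val_score(scaled_log_reg, X_demo, y_demo, cv=5)
print("Fold accuracies:", demo_scores)
print("Mean accuracy:", demo_scores.mean())
```

With only 100 rows, each fold holds 20 samples, so individual fold scores vary widely; the mean across folds is usually a steadier summary than any single fold.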
