# Machine Learning - Predicting Treatment Abandonment with scikit-learn
By **Daniel Palacio** (github.com/palaciodaniel) - 2020

## STEP FOUR - Improving model with GridSearch
Because the dataset deals with a sensitive subject (mental health), it is important to look for the best-scoring model we can get. And since the dataset is very small, an exhaustive search is affordable: we can use GridSearch to try every combination of a set of candidate hyperparameters and keep the one that works best for this case.


In [1]:
# Importing required libraries

from sklearn.linear_model import LogisticRegression

from sklearn.preprocessing import StandardScaler

from sklearn.model_selection import GridSearchCV # This is the new one we are importing.
from sklearn.model_selection import train_test_split

from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score,recall_score,precision_score
import pandas as pd

In [2]:
# Loading prepared DataFrame (only numeric values)

df = pd.read_csv("df_prepared.csv", header = 0, index_col = 0)

df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 100 entries, 0 to 99
Data columns (total 22 columns):
 #   Column                     Non-Null Count  Dtype
---  ------                     --------------  -----
 0   Age                        100 non-null    int64
 1   Sex_Male                   100 non-null    int64
 2   Victimhood                 100 non-null    int64
 3   Discipline_Low             100 non-null    int64
 4   Discipline_Medium          100 non-null    int64
 5   Discipline_High            100 non-null    int64
 6   Introspection_Low          100 non-null    int64
 7   Introspection_Medium       100 non-null    int64
 8   Introspection_High         100 non-null    int64
 9   Motivation_Low             100 non-null    int64
 10  Motivation_Medium          100 non-null    int64
 11  Motivation_High            100 non-null    int64
 12  Neuroticism_Low            100 non-null    int64
 13  Neuroticism_Medium         100 non-null    int64
 14  Neuroticism_High           

### Final preparations

In [3]:
# Feature columns

X = df.iloc[:,:-1].to_numpy()
print(X[:5])

[[67  1  0  0  1  0  0  0  1  0  1  0  0  1  0  1  0  0  0  1  0]
 [35  0  0  0  1  0  1  0  0  0  0  1  1  0  0  1  0  0  0  1  0]
 [25  0  0  0  0  1  1  0  0  1  0  0  1  0  0  1  0  0  0  1  0]
 [48  0  0  0  0  1  0  1  0  0  0  1  1  0  0  0  0  1  1  0  0]
 [48  1  1  0  1  0  0  1  0  0  1  0  1  0  0  1  0  0  0  0  1]]


In [4]:
# Target column

y = df.iloc[:, -1].to_numpy()
print(y[:5])

[0 1 0 1 0]


In [5]:
# Dividing between training and test subsets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, shuffle = True, random_state = 24)

print("X_train:", X_train.shape, type(X_train))
print("X_test:", X_test.shape, type(X_test))
print("y_train:", y_train.shape, type(y_train))
print("y_test:", y_test.shape, type(y_test))

X_train: (80, 21) <class 'numpy.ndarray'>
X_test: (20, 21) <class 'numpy.ndarray'>
y_train: (80,) <class 'numpy.ndarray'>
y_test: (20,) <class 'numpy.ndarray'>


In [6]:
# Scaling the data

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

Up to this point there is little difference from Step 3; now we finally apply GridSearch.

### Model application

**IMPORTANT NOTE:** Logistic Regression has more parameters than the ones selected here. However, GridSearch tries every possible combination of the values it is given, and some of those combinations are not compatible with each other. Spending processing time on them would only produce a long list of warnings, which is the main reason some parameters were ruled out.

Also, scikit-learn's __[documentation for Logistic Regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)__ notes that the 'liblinear' solver is a good choice for small datasets.
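
To make "all possible combinations" concrete, here is a minimal sketch that counts the candidates in the grid used in this step with scikit-learn's `ParameterGrid`. The second grid (`split_grid`) is a hypothetical example, not part of this notebook: it shows how a list of dictionaries can keep incompatible solver/penalty pairs from ever being combined.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in this step: 2 penalties x 13 values of C x 1 solver.
params = {
    "penalty": ["l1", "l2"],
    "C": [0.01, 0.1, 0.25, 0.5, 0.75, 1.0, 1.25, 1.5, 1.75, 2.0, 2.5, 5, 10],
    "solver": ["liblinear"],
}

n_combinations = len(ParameterGrid(params))
n_folds = 5  # GridSearchCV's default number of cross-validation folds
n_fits = n_combinations * n_folds

print("Combinations:", n_combinations)
print("Model fits with 5-fold cross-validation:", n_fits)

# Hypothetical alternative: each dictionary is expanded on its own,
# so 'lbfgs' is never paired with the 'l1' penalty it does not support.
split_grid = [
    {"solver": ["liblinear"], "penalty": ["l1", "l2"], "C": [0.1, 1.0, 10]},
    {"solver": ["lbfgs"], "penalty": ["l2"], "C": [0.1, 1.0, 10]},
]
n_split = len(ParameterGrid(split_grid))
print("Combinations in the split grid:", n_split)
```

With only 26 candidates, the search stays cheap even on a modest machine, which is why an exhaustive search is reasonable here.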

In [7]:
# Instantiating and fitting a Logistic Regression model with GridSearch

log_reg = LogisticRegression(random_state = 24)

params = {
    "penalty": ["l1", "l2"],
    "C": [0.01, 0.1, 0.25, 0.5, 0.75, 1.0, 1.25, 1.5, 1.75, 2.0, 2.5, 5, 10],
    "solver": ["liblinear"]
}

grid_search = GridSearchCV(log_reg, param_grid = params)

grid_search.fit(X_train_scaled, y_train)

GridSearchCV(estimator=LogisticRegression(random_state=24),
             param_grid={'C': [0.01, 0.1, 0.25, 0.5, 0.75, 1.0, 1.25, 1.5, 1.75,
                               2.0, 2.5, 5, 10],
                         'penalty': ['l1', 'l2'], 'solver': ['liblinear']})
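
One possible refinement, which is not what this notebook does: since the scaler was fitted on the whole training subset before the search, every cross-validation fold is scaled with statistics that include its own validation rows. Wrapping the scaler and the model in a pipeline lets GridSearch refit the scaler inside each fold. The sketch below assumes a synthetic stand-in dataset created with `make_classification` (the real `df_prepared.csv` is not used) and a shortened grid over `C` only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the prepared DataFrame (assumption): 100 rows, 21 features.
X, y = make_classification(n_samples=100, n_features=21, random_state=24)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=24
)

# The scaler is refitted on the training part of every cross-validation fold.
pipeline = make_pipeline(
    StandardScaler(),
    LogisticRegression(solver="liblinear", random_state=24),
)

# Parameters of a pipeline step are addressed as "<step name>__<parameter>".
params = {"logisticregression__C": [0.01, 0.1, 1.0, 10]}

grid_search = GridSearchCV(pipeline, param_grid=params)
grid_search.fit(X_train, y_train)  # unscaled data; the pipeline does the scaling

y_pred = grid_search.predict(X_test)
print(grid_search.best_params_)
```

With a pipeline, `X_train_scaled` and `X_test_scaled` are no longer needed, because scaling happens inside `fit` and `predict`.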

In [8]:
# Making predictions, now with Grid Search

y_pred = grid_search.predict(X_test_scaled)
print("y_pred:", y_pred.shape, y_pred)
print("y_test:", y_test.shape, y_test)

y_pred: (20,) [1 0 1 0 0 1 0 1 1 1 1 0 0 1 1 1 1 0 1 1]
y_test: (20,) [1 1 1 0 1 1 0 1 1 1 1 0 0 1 1 0 1 1 0 0]


### Scores

In [9]:
# Confusion Matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print(conf_matrix)

[[ 4  3]
 [ 3 10]]


In [10]:
# Unpacking the matrix: for binary labels, ravel() returns tn, fp, fn, tp (in that order)
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print("True Positives:", tp)
print("True Negatives:", tn)
print("False Positives:", fp)
print("False Negatives:", fn)

True Positives: 10
True Negatives: 4
False Positives: 3
False Negatives: 3
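
These four counts are all that accuracy, precision and recall need. As a quick check, the short sketch below recomputes the three scores by hand from the counts printed above, without calling scikit-learn.

```python
# Counts taken from the confusion matrix printed above.
tn, fp, fn, tp = 4, 3, 3, 10

accuracy = (tp + tn) / (tp + tn + fp + fn)  # correct predictions / all predictions
precision = tp / (tp + fp)                  # how many predicted positives were right
recall = tp / (tp + fn)                     # how many actual positives were found

print("Accuracy:", round(accuracy, 2))
print("Precision:", round(precision, 2))
print("Recall:", round(recall, 2))
```

Precision and recall coincide here only because the number of false positives happens to equal the number of false negatives.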


In [11]:
# The best parameters

print("GridSearch revealed that the best parameters were the following:\n", grid_search.best_params_)

# New accuracy, precision and recall scores

print("\nThe new (and improved) scores are the following:")
print("Accuracy:", round(accuracy_score(y_test,y_pred), 2))
print("Precision:", round(precision_score(y_test,y_pred), 2))
print("Recall:", round(recall_score(y_test,y_pred), 2))

GridSearch revealed that the best parameters were the following:
 {'C': 0.01, 'penalty': 'l2', 'solver': 'liblinear'}

The new (and improved) scores are the following:
Accuracy: 0.7
Precision: 0.77
Recall: 0.77
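
By default, GridSearchCV ranks the candidates for a classifier by their mean cross-validated accuracy. If the positive class is the one we most want to catch, the `scoring` argument can point the search at recall instead. The following is a minimal sketch under assumptions: it uses a synthetic stand-in dataset from `make_classification`, skips scaling for brevity, and searches a shortened grid over `C` only. It also shows how `cv_results_` exposes the score of every candidate, not just the winner.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training subset (assumption): 80 rows, 21 features.
X_train, y_train = make_classification(n_samples=80, n_features=21, random_state=24)

grid_search = GridSearchCV(
    LogisticRegression(solver="liblinear", random_state=24),
    param_grid={"C": [0.01, 0.1, 1.0, 10]},
    scoring="recall",  # rank candidates by recall instead of accuracy
)
grid_search.fit(X_train, y_train)

# One row per candidate, best-ranked first.
results = pd.DataFrame(grid_search.cv_results_)[
    ["param_C", "mean_test_score", "std_test_score", "rank_test_score"]
].sort_values("rank_test_score")

print(results)
print("Best cross-validated recall:", round(grid_search.best_score_, 2))
```

Looking at `std_test_score` next to `mean_test_score` gives a sense of how much each candidate's score moves from fold to fold, which matters on a dataset this small.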


### Conclusion
Using GridSearch improved the scores over the previous step. The model is still far from robust, which matters even more given how delicate the subject is, as already mentioned. However, considering that the dataset contains only 100 observations, it is a promising start.

To keep improving the model, we would need many more observations: thousands of them.
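
One way to see how fragile scores are at this sample size is to repeat the cross-validation many times and look at the spread. The sketch below is only an illustration under assumptions: it uses a synthetic stand-in dataset of 100 rows from `make_classification` rather than the real data, and it fixes `C=0.01` with the 'liblinear' solver to mirror the parameters GridSearch selected.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the full dataset (assumption): 100 rows, 21 features.
X, y = make_classification(n_samples=100, n_features=21, random_state=24)

model = make_pipeline(
    StandardScaler(),
    LogisticRegression(C=0.01, solver="liblinear", random_state=24),
)

# 5 folds repeated 10 times with different shuffles: 50 accuracy scores.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=24)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")

print("Number of scores:", len(scores))
print("Mean accuracy:", round(scores.mean(), 2))
print("Standard deviation:", round(scores.std(), 2))
print("Lowest / highest:", round(scores.min(), 2), "/", round(scores.max(), 2))
```

Each fold here holds only 20 rows, so a single misclassified observation moves the accuracy by 0.05; more data would narrow that spread.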