# Load Libraries

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split, StratifiedShuffleSplit, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, plot_roc_curve, classification_report, plot_confusion_matrix
from sklearn.preprocessing import MinMaxScaler
from imblearn.over_sampling import SMOTE
from warnings import simplefilter

In [None]:
#To ignore possible unnecessary warnings
simplefilter(action='ignore')
pd.set_option('display.max_columns', None)

# Load Data

In [None]:
#Use the path to your cleaned data
df = pd.read_csv('../../../ml-usecase-classification-humanresourcesattrition/data/HR_cleaned.csv')
df.head()

# Split into train and test
First, split the data into features and label.

In [None]:
X = df.drop(columns=['Attrition'])
y = df['Attrition']

To split the data you can use `train_test_split()` or `StratifiedShuffleSplit()`. The main advantage of `StratifiedShuffleSplit()` is that the train and test sets keep the same ratio of negative to positive cases as the full dataset. We will use and compare both, with 70% of the dataset for the training set and the remaining 30% for the testing set.
___
- `train_test_split()`: On this dataset, using this method, the training set will have a greater proportion of positive cases than the testing set.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2)

In [None]:
y_train.value_counts()/len(y_train)

In [None]:
y_test.value_counts()/len(y_test)

___
- `StratifiedShuffleSplit()`: Both training set and testing set have the same proportion of positive cases.

In [None]:
split = StratifiedShuffleSplit(n_splits=1, test_size=0.3, random_state=2)
#split() returns positional indices, so select the rows with iloc
for train_index, test_index in split.split(df, df['Attrition']):
    strat_train_set = df.iloc[train_index]
    strat_test_set = df.iloc[test_index]

In [None]:
strat_train_set['Attrition'].value_counts()/len(strat_train_set)

In [None]:
strat_test_set['Attrition'].value_counts()/len(strat_test_set)

In [None]:
X_train = strat_train_set.drop(columns=['Attrition'])
X_test = strat_test_set.drop(columns = ['Attrition'])
y_train = strat_train_set['Attrition']
y_test = strat_test_set['Attrition']
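As a side note, `train_test_split()` also accepts a `stratify` argument that preserves the class ratio in a single call. The sketch below illustrates this on a toy DataFrame; the data and the variable names are assumptions made for illustration only and do not come from the HR dataset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy, imbalanced data (hypothetical; stands in for the HR dataset)
toy = pd.DataFrame({
    'feature': range(100),
    'Attrition': [1] * 20 + [0] * 80,
})

X_toy = toy.drop(columns=['Attrition'])
y_toy = toy['Attrition']

# stratify=y_toy keeps the positive rate equal in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=2, stratify=y_toy
)

train_ratio = y_tr.mean()
test_ratio = y_te.mean()
print(train_ratio, test_ratio)
```

Both printed ratios should match the positive rate of the full toy dataset, which is the same property the `StratifiedShuffleSplit()` cells check above.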

# Scale the Data
You might need to scale your features so that features with large ranges do not dominate the model. There are many scalers to choose from. In this case, the scaler used is `MinMaxScaler()` from *Scikit-Learn*. You can check all the scalers available in *Scikit-Learn* on [this link](https://scikit-learn.org/stable/modules/classes.html?highlight=preprocessing#module-sklearn.preprocessing). Search for `MinMaxScaler()`, `StandardScaler()` or `RobustScaler()` and try to understand which of them you should use in each case. For a visual comparison of the scalers that can help you choose the most suitable one, please check [this link](https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html).

In [None]:
numeric_features = ['Age', 'DailyRate','DistanceFromHome','HourlyRate','MonthlyIncome','MonthlyRate','NumCompaniesWorked',
                    'PercentSalaryHike','TotalWorkingYears','YearsAtCompany','YearsInCurrentRole','YearsSinceLastPromotion',
                    'YearsWithCurrManager','WorkLifeBalance','TrainingTimesLastYear']

minmax = MinMaxScaler()
#Fit the scaler on the training set only, then reuse it on the test set to avoid data leakage
X_train_scaled = pd.DataFrame(minmax.fit_transform(X_train[numeric_features]))
X_test_scaled = pd.DataFrame(minmax.transform(X_test[numeric_features]))

X_train_scaled.columns = X_train[numeric_features].columns
X_test_scaled.columns = X_test[numeric_features].columns

X_train_scaled = pd.concat([X_train_scaled.reset_index(), X_train.drop(columns=numeric_features).reset_index()], axis=1).drop(columns = ['index'])
X_test_scaled = pd.concat([X_test_scaled.reset_index(), X_test.drop(columns=numeric_features).reset_index()], axis=1).drop(columns = ['index'])

In [None]:
X_train_scaled.head()
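The scaling step can also be written so that the original index and column order are kept, which avoids the `reset_index()` and `concat()` bookkeeping. This is a minimal sketch on toy data; the frames, values and variable names are hypothetical and only illustrate the pattern of fitting on the training set and transforming the test set.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy frames (hypothetical values, not the HR data)
train = pd.DataFrame({'Age': [20.0, 30.0, 40.0], 'Dept': [0, 1, 0]}, index=[5, 7, 9])
test = pd.DataFrame({'Age': [25.0, 50.0], 'Dept': [1, 1]}, index=[2, 3])
numeric = ['Age']

scaler = MinMaxScaler()
train_scaled = train.copy()
test_scaled = test.copy()

# Learn min/max from the training set only
train_scaled[numeric] = scaler.fit_transform(train[numeric])
# Reuse the training min/max on the test set
test_scaled[numeric] = scaler.transform(test[numeric])

print(train_scaled)
print(test_scaled)
```

Because the test set is scaled with the training minimum and maximum, test values can fall outside the [0, 1] range. That is expected and is not a sign of an error.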

As you could see at the EDA (*Exploratory Data Analysis*) stage, the *Attrition* column, which is the label column, is imbalanced, and this is a problem for a classification model. To solve this, you can use a technique called SMOTE (Synthetic Minority Oversampling Technique). You can read more about SMOTE on [this link](https://machinelearningmastery.com/smote-oversampling-for-imbalanced-classification/). Note that SMOTE must be applied to the **training set** only; the test set has to keep its original class distribution.

In [None]:
y.value_counts()

In [None]:
oversample = SMOTE()
X_train_sampled, y_train_sampled = oversample.fit_resample(X_train_scaled, y_train)

Some models have an argument that compensates for class imbalance automatically. If you are using `LogisticRegression()` from *Scikit-Learn*, the argument is called *class_weight* and you just have to set it to *'balanced'*. Check the documentation of `LogisticRegression()` on [this link](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) to understand this argument. The name of the argument can change from model to model, but the purpose is the same. Check the documentation of each model to see what the argument is called and how to use it. If you use such an argument in your model, you don't need to use SMOTE.
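To make the *'balanced'* setting more concrete, the sketch below computes the weights that Scikit-Learn derives for a toy label vector and compares them with the formula `n_samples / (n_classes * count_per_class)`. The labels are hypothetical and are not taken from the HR dataset.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels (hypothetical): 90 negatives, 10 positives
y_toy = np.array([0] * 90 + [1] * 10)

# 'balanced' uses n_samples / (n_classes * count_per_class)
weights = compute_class_weight(class_weight='balanced', classes=np.array([0, 1]), y=y_toy)
manual = len(y_toy) / (2 * np.bincount(y_toy))

print(dict(zip([0, 1], weights)))
print(manual)
```

The minority class receives the larger weight, so its errors count more during training.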

# Modeling

For the modeling, we choose `LogisticRegression()`. You should also try other models and see which one performs best. We will use `GridSearchCV()`, which optimizes the model hyperparameters using cross-validation, so you have to choose how many folds to use. We recommend 5 or 10 folds. To know which parameters can be optimized, consult the documentation of the model you are using. You should also think about which parameters are worth optimizing, to save computation time: the more parameters and values in the grid, the longer the search takes to run. `GridSearchCV()` selects the best model according to the metric you choose. A common choice for binary classification problems is *roc_auc*.

- **Logistic Regression**

In [None]:
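#liblinear supports both the 'l1' and 'l2' penalties searched below (the default solver does not support 'l1')
lr = LogisticRegression(solver='liblinear')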

#Define a set of hyperparameters to optimize and their values
parameters = {'penalty':['l1','l2'], 'C': [0.001,0.01,0.1,1,10,100,1000]}
lr_model = GridSearchCV(estimator=lr,param_grid=parameters, cv=5, scoring='roc_auc', refit=True)

#Train the model
lr_model.fit(X_train_sampled, y_train_sampled)
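One optional variation is to place the scaler and the model in a `Pipeline`, so that the scaler is refitted inside each cross-validation fold. The sketch below assumes synthetic data from `make_classification` and hypothetical step names (`scale`, `clf`); it uses `class_weight='balanced'` instead of SMOTE to stay within Scikit-Learn, and it is not the configuration used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic, imbalanced data (hypothetical stand-in for the HR features)
X_syn, y_syn = make_classification(
    n_samples=300, n_features=8, weights=[0.85, 0.15], random_state=2
)

pipe = Pipeline([
    ('scale', MinMaxScaler()),
    # liblinear supports both 'l1' and 'l2' penalties
    ('clf', LogisticRegression(solver='liblinear', class_weight='balanced')),
])

# Step-name prefixes ('clf__') route each value to the right pipeline step
grid = {'clf__penalty': ['l1', 'l2'], 'clf__C': [0.01, 0.1, 1, 10]}
search = GridSearchCV(pipe, param_grid=grid, cv=5, scoring='roc_auc', refit=True)
search.fit(X_syn, y_syn)

print(search.best_params_, round(search.best_score_, 3))
```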

You can see the hyperparameters of the best estimator:

In [None]:
lr_model.best_estimator_

You can see the results for each set of parameters:

In [None]:
pd.DataFrame(lr_model.cv_results_).sort_values(by='rank_test_score').head(10)

# Model Evaluation

To evaluate the model you should use the test set. You can use the `lr_model` object to predict because `GridSearchCV()` has an argument called *refit*. When this argument is set to **True**, the best set of hyperparameters is refitted on the whole training set and `lr_model` uses that estimator for its predictions.

- **Predictions**

In [None]:
y_preds= lr_model.predict(X_test_scaled)
y_preds_proba = lr_model.predict_proba(X_test_scaled)
preds = pd.concat([pd.Series(y_preds), pd.DataFrame(y_preds_proba)], axis=1)
preds.columns = ['y_preds','y_pred_proba_0','y_pred_proba_1']
preds.head(10)

___
- **Metrics**

In [None]:
precision, recall, threshold = precision_recall_curve(y_test, y_preds_proba[:,1])
df_metrics = pd.concat([pd.DataFrame(precision), pd.DataFrame(recall), pd.DataFrame(threshold)], axis=1)
df_metrics.columns = ['Precision','Recall','Threshold']
df_metrics['f1'] = 2* ((df_metrics.Precision * df_metrics.Recall)/(df_metrics.Precision + df_metrics.Recall))
df_metrics

You can plot the ROC curve and check the AUC value for the test set.

In [None]:
plot_roc_curve(lr_model, X_test_scaled, y_test)

You can compare the results with the training set. If the values are too different (like 0.78 for test and 0.96 for train), it means that the model is overfitting and we have to solve that. One way is to do a better optimization of the hyperparameters. Another way is to try other models.

In [None]:
plot_roc_curve(lr_model, X_train_sampled, y_train_sampled)
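If you prefer numbers to plots, the same comparison can be made with `roc_auc_score()`. This is a minimal sketch on synthetic data; the dataset, the plain `LogisticRegression` model and the variable names are assumptions for illustration, not the notebook's fitted `lr_model`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic data (hypothetical stand-in for the HR dataset)
X_syn, y_syn = make_classification(n_samples=400, n_features=10, random_state=2)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=2, stratify=y_syn
)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# AUC is computed from the positive-class probabilities
auc_train = roc_auc_score(y_tr, model.predict_proba(X_tr)[:, 1])
auc_test = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])

print(f"train AUC={auc_train:.3f}  test AUC={auc_test:.3f}  gap={auc_train - auc_test:.3f}")
```

A large positive gap between the two values is the overfitting signal described above.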

You can check the classification report provided by Scikit Learn.

In [None]:
print(classification_report(y_test, preds['y_preds']))

You can check the confusion matrix to see where your model fails the most.

In [None]:
plot_confusion_matrix(lr_model, X_test_scaled, y_test)
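The counts behind the plot can also be read directly with `confusion_matrix()`. The sketch below uses hypothetical toy labels and predictions to show how the four cells map to true negatives, false positives, false negatives and true positives.

```python
from sklearn.metrics import confusion_matrix

# Toy labels and predictions (hypothetical)
y_true = [0, 0, 0, 1, 1, 1, 0, 1]
y_pred = [0, 1, 0, 1, 0, 1, 0, 1]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(tn, fp, fn, tp, precision, recall)
```

Many false negatives (employees who leave but are predicted to stay) lower the recall, while many false positives lower the precision.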

The previous classification report uses the default decision threshold of 0.5, which is not necessarily the best one. What we can do is find the maximum *f1* value in *df_metrics* and get the corresponding threshold. Then we take the previously computed probabilities and, if the probability of class 1 is greater than or equal to the threshold, we assign class 1. Otherwise, we assign class 0. Let's do it and check the results.

In [None]:
maxf1 = df_metrics.f1.max()

#idxmax() returns a single row label, so the threshold is a scalar even if several rows tie for the best f1
maxf1threshold = df_metrics.loc[df_metrics.f1.idxmax(), 'Threshold']

preds.loc[preds['y_pred_proba_1'] >= maxf1threshold, 'y_preds_t'] = 1
preds.loc[preds['y_pred_proba_1'] < maxf1threshold, 'y_preds_t'] = 0

In [None]:
print(classification_report(y_test, preds['y_preds_t']))
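The threshold selection can be checked end to end on a small example. The following sketch uses hypothetical labels and probabilities, computes the F1 score for every candidate threshold, and confirms that applying the chosen threshold reproduces the best F1.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_recall_curve

# Toy labels and positive-class probabilities (hypothetical)
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])
proba = np.array([0.10, 0.35, 0.40, 0.45, 0.60, 0.70, 0.20, 0.90])

precision, recall, thresholds = precision_recall_curve(y_true, proba)

# precision and recall have one more entry than thresholds; drop the last pair
p, r = precision[:-1], recall[:-1]
denom = p + r
# Guard against 0/0 when precision and recall are both zero
f1 = np.divide(2 * p * r, denom, out=np.zeros_like(denom), where=denom > 0)

best = int(np.argmax(f1))
best_threshold = float(thresholds[best])

# Predict class 1 when the probability reaches the threshold
y_pred_t = (proba >= best_threshold).astype(int)
print(best_threshold, f1_score(y_true, y_pred_t))
```

Note that `precision_recall_curve()` returns one more precision and recall value than thresholds, which is why the last pair is dropped before computing F1.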

The results aren't better, which means that the model needs to be improved. Test other models, or try other combinations of hyperparameters or other approaches.

## After developing your model, you will want to save it. Let's do it using joblib.

In [None]:
import joblib

The next cell will create a file for your model and save it in the `HR_API` folder under the current working directory. The folder must already exist.

In [None]:
filename = './HR_API/HR_model_load.sav'
joblib.dump(lr_model, filename)

Now you can load the model whenever you want in other files. Let's confirm this by loading it into a new variable and comparing the predictions.

In [None]:
lr_model_copy = joblib.load('./HR_API/HR_model_load.sav')
y_preds_copy = lr_model_copy.predict(X_test_scaled)
#True only if every prediction of the reloaded model matches the original
(y_preds == y_preds_copy).all()
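The save and load round trip can be tried in isolation as follows. This sketch assumes synthetic data, a small `LogisticRegression` model and a hypothetical file name inside a temporary directory, so it does not touch the `HR_API` folder.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data and a small model (hypothetical stand-ins)
X_syn, y_syn = make_classification(n_samples=100, n_features=5, random_state=2)
model = LogisticRegression(max_iter=1000).fit(X_syn, y_syn)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name; any writable path works
    path = os.path.join(tmp_dir, 'model.joblib')
    joblib.dump(model, path)
    restored = joblib.load(path)

# The restored model should reproduce the original predictions exactly
same = np.array_equal(model.predict(X_syn), restored.predict(X_syn))
print(same)
```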