# **Modeling and Evaluation Notebook using conventional ML**

## Objectives

* Answer business requirement 2:
 - The client can input a patient's information and predict whether that patient is likely to be readmitted.
* Fit and evaluate a classification model to predict whether a patient will be readmitted.

## Inputs

* outputs/datasets/collection/HospitalReadmissions.csv

## Outputs

* Data cleaning, feature engineering and modelling pipelines
* Feature importance plot 

## Additional Comments

* No additional comments 


---

# Change working directory

* We assume the notebooks are stored in a subfolder, so when running a notebook in the editor you need to change the working directory.

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [None]:
import os

current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir
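
Because the cell above moves up one level every time it runs, re-running it can leave the notebook above the project folder. A minimal sketch of an alternative is shown below. It assumes the project root is the first folder that contains an `outputs` directory; the helper name `find_project_root` and the `jupyter_notebooks` folder name are hypothetical and used only for illustration.

```python
import tempfile
from pathlib import Path


def find_project_root(start, marker="outputs"):
    """Walk upwards from `start` until a folder containing `marker` is found."""
    start = Path(start).resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"no parent of {start} contains '{marker}'")


# Small demonstration on a throwaway folder tree.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp).resolve()
    (root / "outputs").mkdir()
    notebooks = root / "jupyter_notebooks"  # hypothetical subfolder name
    notebooks.mkdir()
    found_from_subfolder = find_project_root(notebooks)
    found_from_root = find_project_root(root)
    demo_ok = found_from_subfolder == root and found_from_root == root
```

In the notebook this could be used as `os.chdir(find_project_root(os.getcwd()))`, which gives the same result no matter how many times the cell is run.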

---

## Libraries needed for the notebook

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

# to make a pipeline
from sklearn.pipeline import Pipeline

# Feature Engineering
from feature_engine.selection import SmartCorrelatedSelection
from feature_engine.encoding import OrdinalEncoder
from feature_engine.outliers import Winsorizer

# Feat Scaling
from sklearn.preprocessing import StandardScaler

# Feat Selection
from sklearn.feature_selection import SelectFromModel
from sklearn.decomposition import PCA

# ML algorithms
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, ExtraTreesClassifier, AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier

# for hyperparameter tuning
from sklearn.model_selection import GridSearchCV

# to split the dataset
from sklearn.model_selection import train_test_split

# to balance the target variable
from imblearn.over_sampling import SMOTE

# to evaluate the models
from sklearn.metrics import make_scorer, recall_score

warnings.filterwarnings("ignore")
sns.set_style("darkgrid")

# Load Data

In [None]:
data_path = 'outputs/datasets/collection/HospitalReadmissions.csv'

df = pd.read_csv(data_path).drop(labels=['medical_specialty'], axis=1)
df.head()

In [None]:
categorical_columns = df.select_dtypes(include=['object']).columns.tolist()
categorical_columns

---

## Classification ML Pipeline

### ML pipeline for Data Cleaning and Feature Engineering

In [None]:
def PipelineDataCleaningAndFeatureEngineering():
    pipeline_base = Pipeline([
        ("OrdinalEncoder", OrdinalEncoder(encoding_method='arbitrary',variables=categorical_columns)),
        ('Winsorizer_iqr', Winsorizer(variables=[
            'time_in_hospital', 'n_procedures','n_inpatient', 'n_medications','n_lab_procedures'],
                                capping_method='iqr', tail='both', fold=1.5)),
        ("SmartCorrelatedSelection", SmartCorrelatedSelection(variables=None,
        method="spearman", threshold=0.4, selection_method="variance")),
    ])

    return pipeline_base


PipelineDataCleaningAndFeatureEngineering()
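
To make the `Winsorizer` step more concrete, the sketch below reproduces IQR capping by hand with pandas. It assumes the usual IQR rule (lower bound = Q1 - fold * IQR, upper bound = Q3 + fold * IQR); the helper `iqr_cap` and the sample values are illustrative and are not part of the notebook's data.

```python
import pandas as pd


def iqr_cap(series, fold=1.5):
    """Cap values outside [Q1 - fold*IQR, Q3 + fold*IQR] at those bounds."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - fold * iqr, q3 + fold * iqr
    return series.clip(lower=lower, upper=upper), (lower, upper)


# Made-up lengths of stay with one extreme value.
stays = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 5, 40], name="time_in_hospital")
capped, (lower, upper) = iqr_cap(stays)
```

Only the extreme value is pulled back to the upper bound; values inside the bounds are left unchanged.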

## ML Pipeline with Data

### ML Pipeline for Modelling and Hyperparameter Optimisation

In [None]:
def PipelineClf(model):
    pipeline_base = Pipeline([
        ("scaler", StandardScaler()),
        ("feat_selection",SelectFromModel(model)),
        ("model", model),
    ])

    return pipeline_base

Custom Class for Hyperparameter Optimisation

In [None]:
class HyperparameterOptimizationSearch:

    def __init__(self, models, params):
        self.models = models
        self.params = params
        self.keys = models.keys()
        self.grid_searches = {}

    def fit(self, X, y, cv, n_jobs, verbose=1, scoring=None, refit=False):
        for key in self.keys:
            print(f"\nRunning GridSearchCV for {key} \n")

            model = PipelineClf(self.models[key])
            params = self.params[key]
            gs = GridSearchCV(model, params, cv=cv, n_jobs=n_jobs,
                            verbose=verbose, scoring=scoring, )
            gs.fit(X, y)
            self.grid_searches[key] = gs

    def score_summary(self, sort_by='mean_score'):
        def row(key, scores, params):
            d = {
                'estimator': key,
                'min_score': min(scores),
                'max_score': max(scores),
                'mean_score': np.mean(scores),
                'std_score': np.std(scores),
            }
            return pd.Series({**params, **d})

        rows = []
        for k in self.grid_searches:
            params = self.grid_searches[k].cv_results_['params']
            scores = []
            for i in range(self.grid_searches[k].cv):
                key = "split{}_test_score".format(i)
                r = self.grid_searches[k].cv_results_[key]
                scores.append(r.reshape(len(params), 1))

            all_scores = np.hstack(scores)
            for p, s in zip(params, all_scores):
                rows.append((row(k, s, p)))

        df = pd.concat(rows, axis=1).T.sort_values([sort_by], ascending=False)
        columns = ['estimator', 'min_score',
                'mean_score', 'max_score', 'std_score']
        columns = columns + [c for c in df.columns if c not in columns]
        return df[columns], self.grid_searches
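
The sketch below shows, on a synthetic dataset, what `score_summary` reads from `cv_results_`: one `split{i}_test_score` array per fold, with one entry per parameter combination. It uses the fitted attribute `n_splits_` rather than `cv`, since `range(gs.cv)` only works when `cv` is passed as an integer. The data and the parameter grid are made up for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data, not the readmissions dataset.
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

gs = GridSearchCV(LogisticRegression(max_iter=1000), {"C": [0.1, 1.0]},
                  cv=5, scoring="recall")
gs.fit(X_demo, y_demo)

# One row per parameter combination, one column per fold.
per_fold = np.column_stack(
    [gs.cv_results_[f"split{i}_test_score"] for i in range(gs.n_splits_)]
)
summary_mean = per_fold.mean(axis=1)
```

The row-wise mean of this array matches `cv_results_["mean_test_score"]`, which is the value reported as `mean_score` in the summary table.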

### Split Train and Test set

In [None]:
X_train, X_test, y_train, y_test = train_test_split(df.drop(['readmitted'], axis=1),
                                                    df['readmitted'],
                                                    test_size=0.2,
                                                    random_state=0)

print(X_train.shape,y_train.shape, X_test.shape, y_test.shape)

Data Cleaning Pipeline

In [None]:
pipeline_data_cleaning_feat_eng = PipelineDataCleaningAndFeatureEngineering()
X_train = pipeline_data_cleaning_feat_eng.fit_transform(X_train)
X_test = pipeline_data_cleaning_feat_eng.transform(X_test)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

##### Check Target distribution of the train set

In [None]:
y_train.value_counts().plot(kind='bar', title='Train Set Target Distribution')
plt.show()

* The target looks relatively balanced. Oversampling with SMOTE is still applied, to the train set only, so that both classes are equally represented and the model is not biased toward the majority class.

In [None]:
oversample = SMOTE(sampling_strategy='minority', random_state=0)
X_train, y_train = oversample.fit_resample(X_train, y_train)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

In [None]:
y_train.value_counts().plot(kind='bar', title='Train Set Target Distribution')
plt.show()
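
As a rough illustration of what SMOTE does, the sketch below builds one synthetic sample by interpolating between a minority-class point and one of its neighbours. It is a simplified picture of the interpolation step only (the neighbour search is omitted), and the two points are made up.

```python
import numpy as np

rng = np.random.default_rng(0)


def synthetic_point(x, neighbour, rng):
    """Return a point on the line segment between x and its neighbour."""
    gap = rng.uniform(0.0, 1.0)
    return x + gap * (neighbour - x)


x = np.array([1.0, 10.0])          # a minority-class sample
neighbour = np.array([3.0, 14.0])  # one of its nearest minority neighbours
new_point = synthetic_point(x, neighbour, rng)
```

One consequence worth keeping in mind: because the categorical columns were ordinally encoded before oversampling, interpolated samples can carry fractional category codes that do not correspond to any real category.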

### Grid Search CV - Sklearn

In [None]:
# Each algorithm is listed once: a repeated dictionary key would silently overwrite the earlier entry.
models_quick_search = {
    "LogisticRegression": LogisticRegression(random_state=0),
    "XGBClassifier": XGBClassifier(random_state=0),
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=0),
    "RandomForestClassifier": RandomForestClassifier(random_state=0),
    "GradientBoostingClassifier": GradientBoostingClassifier(random_state=0),
    "ExtraTreesClassifier": ExtraTreesClassifier(random_state=0),
    "AdaBoostClassifier": AdaBoostClassifier(random_state=0),
}

# Empty grids: every algorithm is run with its default hyperparameters.
params_quick_search = {
    "LogisticRegression": {},
    "XGBClassifier": {},
    "DecisionTreeClassifier": {},
    "RandomForestClassifier": {},
    "GradientBoostingClassifier": {},
    "ExtraTreesClassifier": {},
    "AdaBoostClassifier": {},
}

Using default hyperparameters to find best algorithm, scored by recall (as per business requirement 2)

Quick GridSearch CV - Binary Classifier

In [None]:
search = HyperparameterOptimizationSearch(models=models_quick_search, params=params_quick_search)
search.fit(X_train, y_train,
        scoring =  make_scorer(recall_score, pos_label=1),
        n_jobs=-1, cv=5)

Results of GridSearch

In [None]:
grid_search_summary, grid_search_pipelines = search.score_summary(sort_by='mean_score')
grid_search_summary 

The top two algorithms rated by mean score for recall were RandomForestClassifier and ExtraTreesClassifier.
 - RandomForestClassifier : 0.543352
 - ExtraTreesClassifier : 0.542135

These scores are low, so we will proceed with an extensive hyperparameter search to see whether the score can be improved.

### Extensive search on the most suitable algorithms

#### Explanation for the hyperparameter selection

* *n_estimators*:
    - Defines the number of trees in the forest.
    - *Effect:* Increasing the number of trees generally improves the model's performance, as it allows for better averaging and more robust predictions.
* *max_depth*:
    - Sets the maximum depth of each decision tree in the forest. It limits how deep the tree can grow.
    - *Effect:* Restricting max_depth helps to prevent overfitting by limiting the model's complexity.
* *min_samples_leaf*:
    - This is the minimum number of samples that must be present in a leaf node.
    - *Effect:* With min_samples_leaf set to 1, leaf nodes can contain a single sample. This setting allows the trees to be very flexible, but it can also make the model more prone to overfitting, especially if the trees are deep.
* *min_samples_split*:
    - This is the minimum number of samples required to split an internal node.
    - *Effect:* A setting of 2 for min_samples_split means that a node must have at least 2 samples to be split. This is the most permissive setting and allows the tree to grow to its maximum depth, which could lead to overfitting if not controlled by other parameters (like max_depth).
* *max_leaf_nodes*:
    - This limits the number of leaf nodes in each decision tree.
    - *Effect:* Limiting the number of leaf nodes to 5 enforces strong regularization on the model, leading to simpler trees. This can prevent overfitting, especially in cases with noisy data or small datasets, but it may also reduce the model's ability to capture complex patterns, potentially leading to underfitting.
* *class_weight*:
    - The class_weight parameter is used to adjust the weights of classes in the loss function. Setting this to 'balanced' automatically adjusts the weights inversely proportional to class frequencies in the input data.
    - *Effect:* If the dataset is imbalanced (one class significantly outnumbers the other), using 'balanced' helps the model to pay more attention to the minority class. This can improve performance on the less frequent class by reducing the bias towards the majority class.
* *max_features*:
    - Determines the number of features to consider when looking for the best split at each node.
    - *Effect:* Setting max_features to None means that all features will be considered when determining the best split. This can lead to more accurate but less diverse trees, as each tree could potentially use the same features and become similar to one another, slightly reducing the benefit of randomness in the forest.
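
The 'balanced' setting described above can be reproduced by hand. The sketch below uses the documented scikit-learn heuristic, `n_samples / (n_classes * count_per_class)`; the helper name `balanced_class_weights` and the 60/40 label split are illustrative assumptions.

```python
import numpy as np


def balanced_class_weights(y):
    """Weights inversely proportional to class frequencies."""
    classes, counts = np.unique(y, return_counts=True)
    weights = len(y) / (len(classes) * counts)
    return dict(zip(classes.tolist(), weights.tolist()))


# Made-up target with 60 samples of class 0 and 40 of class 1.
y_demo = np.array([0] * 60 + [1] * 40)
weights = balanced_class_weights(y_demo)

# After oversampling to equal class sizes, both weights become 1.0.
y_equal = np.array([0] * 50 + [1] * 50)
weights_equal = balanced_class_weights(y_equal)
```

This suggests that, on a train set that has already been oversampled to equal class sizes, `class_weight='balanced'` is likely to have little additional effect.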

In [None]:
models_search = {
    "RandomForestClassifier":RandomForestClassifier(random_state=42),
    "ExtraTreesClassifier":ExtraTreesClassifier(random_state=42),
}

params_search = {
    "RandomForestClassifier":{'model__n_estimators': [150,250],
                            'model__max_depth': [None,15],
                            'model__min_samples_split': [2,75],
                            'model__min_samples_leaf': [1,75],
                            'model__max_leaf_nodes': [5,25],
                            'model__class_weight': ['balanced'],
                            'model__max_features': [ None,'sqrt'],
                            },
    "ExtraTreesClassifier":{'model__max_depth': [None,15],
                            'model__min_samples_split': [2,75],
                            'model__min_samples_leaf': [1,75],
                            'model__max_leaf_nodes': [5,25],
                            'model__class_weight': ['balanced'],
                            'model__max_features': [ None,'sqrt'],
                            },
}

In [None]:
search = HyperparameterOptimizationSearch(models=models_search, params=params_search)
search.fit(X_train, y_train,
        scoring =  make_scorer(recall_score, pos_label=1),
        n_jobs=-1, cv=5)
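
To get a feel for the cost of this search, the sketch below counts the parameter combinations in the grids defined above and multiplies them by the number of folds. The helper `n_candidates` is a hypothetical name used for illustration.

```python
from math import prod

rf_grid = {
    "model__n_estimators": [150, 250],
    "model__max_depth": [None, 15],
    "model__min_samples_split": [2, 75],
    "model__min_samples_leaf": [1, 75],
    "model__max_leaf_nodes": [5, 25],
    "model__class_weight": ["balanced"],
    "model__max_features": [None, "sqrt"],
}
# The ExtraTrees grid is the same without n_estimators.
et_grid = {k: v for k, v in rf_grid.items() if k != "model__n_estimators"}


def n_candidates(grid):
    """Number of combinations a full grid search will try."""
    return prod(len(values) for values in grid.values())


cv_folds = 5
rf_fits = n_candidates(rf_grid) * cv_folds  # 64 combinations * 5 folds = 320
et_fits = n_candidates(et_grid) * cv_folds  # 32 combinations * 5 folds = 160
```

Each additional value in any list multiplies the number of fits, so the grid grows quickly as options are added.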

Results of GridSearch

In [None]:
grid_search_summary, grid_search_pipelines = search.score_summary(sort_by='mean_score')
grid_search_summary.head(10)

The extensive hyperparameter search showed that RandomForestClassifier performed best, with a mean recall of 0.600187, an improvement over the quick search with default hyperparameters.

Save the best model and parameters

In [None]:
best_model = grid_search_summary.iloc[0,0]
best_model

In [None]:
best_parameters = grid_search_pipelines[best_model].best_params_
best_parameters

Define the pipeline using the findings from hyperparameter optimisation

In [None]:
pipeline_clf = grid_search_pipelines[best_model].best_estimator_
pipeline_clf

In [None]:
X_train.head()

### Feature Importance

With the optimal pipeline found, the importance of features to the model can be assessed using the **.feature_importances_** attribute.

In [None]:
## Feature Importance from SelectFromModel
# create DataFrame to display feature importance
df_feature_importance = (pd.DataFrame(data={
    'Feature': X_train.columns[pipeline_clf['feat_selection'].get_support()],
    'Importance': pipeline_clf['model'].feature_importances_})
    .sort_values(by='Importance', ascending=False)
)

# re-assign best_features order
best_features = df_feature_importance['Feature'].to_list()

# Most important features statement and plot
print(f"* These are the {len(best_features)} most important features in descending order. "
    f"The model was trained on them: \n{df_feature_importance['Feature'].to_list()}")

df_feature_importance.plot(kind='bar', x='Feature', y='Importance')
plt.show()

* These are the 6 most important features in descending order. The model was trained on them: 
*['n_medications', 'n_lab_procedures', 'diag_1', 'diag_2', 'age', 'diag_3']*
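
The feature importance cell relies on the boolean mask from `SelectFromModel.get_support()` lining up with the importances of the final model, which was trained only on the selected columns. The sketch below checks that alignment on a synthetic dataset; the column names `f0` to `f7` and the data are made up.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with hypothetical column names.
X_arr, y_demo = make_classification(n_samples=300, n_features=8,
                                    n_informative=3, random_state=0)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(8)])

model = RandomForestClassifier(n_estimators=50, random_state=0)
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("feat_selection", SelectFromModel(model)),
    ("model", model),
])
pipe.fit(X_demo, y_demo)

mask = pipe["feat_selection"].get_support()        # True for kept columns
selected = X_demo.columns[mask]                     # names of kept columns
importances = pipe["model"].feature_importances_    # one value per kept column
```

The number of selected column names equals the number of importances, which is why the two can be placed side by side in one DataFrame.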

### Evaluate Pipeline on Train and Test sets

In [None]:
from sklearn.metrics import classification_report, confusion_matrix


def confusion_matrix_and_report(X, y, pipeline, label_map):

    prediction = pipeline.predict(X)

    print('---  Confusion Matrix  ---')
    print(pd.DataFrame(confusion_matrix(y_true=prediction, y_pred=y),
        columns=[["Actual " + sub for sub in label_map]],
        index=[["Prediction " + sub for sub in label_map]]
        ))
    print("\n")

    print('---  Classification Report  ---')
    print(classification_report(y, prediction, target_names=label_map), "\n")


def clf_performance(X_train, y_train, X_test, y_test, pipeline, label_map):
    print("#### Train Set #### \n")
    confusion_matrix_and_report(X_train, y_train, pipeline, label_map)

    print("#### Test Set ####\n")
    confusion_matrix_and_report(X_test, y_test, pipeline, label_map)
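
A note on orientation: `confusion_matrix` puts `y_true` on the rows and `y_pred` on the columns. The function above passes the predictions as `y_true`, so its rows are predictions and its columns are actual values, which matches the "Prediction" and "Actual" labels it prints. The sketch below illustrates this with made-up labels.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up binary labels for illustration.
actual = np.array([0, 0, 0, 1, 1, 1, 1, 0])
predicted = np.array([0, 1, 0, 1, 0, 1, 1, 0])

rows_actual = confusion_matrix(y_true=actual, y_pred=predicted)
rows_predicted = confusion_matrix(y_true=predicted, y_pred=actual)
```

Swapping the arguments transposes the matrix. Also note that `label_map` is applied in sorted class order, so its first entry names the lowest class value; it is worth confirming that this matches how `readmitted` is encoded.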

In [None]:
clf_performance(X_train=X_train, y_train=y_train,
                X_test=X_test, y_test=y_test,
                pipeline=pipeline_clf,
                label_map= ['Will readmit', 'Will not readmit']
                )

The model scored the following:
* Train set:
    - Recall on 'Will readmit': 44%
    - Precision on 'Will not readmit': 54%
* Test set:
    - Recall on 'Will readmit': 46%
    - Precision on 'Will not readmit': 52%

The above scores are well below our required 70% recall for 'Will readmit' and 60% precision for 'Will not readmit'. However, the model did not overfit, and we were able to determine the most important features, so we will use those to see whether performance improves.

## Refit pipeline with best features

We can refit the ML pipeline with the most important features to determine whether we get the same result as one fitted with all variables.

In [None]:
best_features

#### Rewrite the ML Pipelines

In [None]:
def PipelineDataCleaningAndFeatureEngineeringBestFeatures():
    pipeline_base = Pipeline([
        ("OrdinalEncoder", OrdinalEncoder(encoding_method='arbitrary',variables=[
            'age', 'diag_1', 'diag_3', 'diag_2'])),
        ('Winsorizer_iqr', Winsorizer(variables=['n_medications','n_lab_procedures'],
                                capping_method='iqr', tail='both', fold=1.5)),
    ])

    return pipeline_base

In [None]:
def PipelineClf(model):
    pipeline_base = Pipeline([
        ("scaler", StandardScaler()),
        ("model", model),
    ])

    return pipeline_base

#### Split Train and Tests Sets Using Only Most Important Features

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(['readmitted'], axis=1),
    df['readmitted'],
    test_size=0.3,
    random_state=0
)

print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)
X_train.head()

We filter the sets with the best features.

In [None]:
X_train = X_train.filter(best_features)
X_test = X_test.filter(best_features)

print(X_train.shape, X_test.shape)
X_train.head()

### Handle Target imbalance

In [None]:
pipeline_data_cleaning_feat_eng = PipelineDataCleaningAndFeatureEngineeringBestFeatures()
X_train = pipeline_data_cleaning_feat_eng.fit_transform(X_train)
X_test = pipeline_data_cleaning_feat_eng.transform(X_test)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)
X_train.head()

In [None]:
X_test.head()

In [None]:
y_train.value_counts().plot(kind='bar', title='Train Set Target Distribution')
plt.show()

In [None]:
oversample = SMOTE(sampling_strategy='minority', random_state=0)
X_train, y_train = oversample.fit_resample(X_train, y_train)

print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

In [None]:
y_train.value_counts().plot(kind='bar', title='Train Set Target Distribution')
plt.show()

### GridSearch CV

In [None]:
models_search = {'RandomForestClassifier': RandomForestClassifier(random_state=42)}

In [None]:
best_parameters

In [None]:
params_search = {"RandomForestClassifier":{'model__class_weight': ['balanced'],
                                        'model__max_depth': [None],
                                        'model__max_features': [None],
                                        'model__max_leaf_nodes': [5],
                                        'model__min_samples_leaf': [1],
                                        'model__min_samples_split': [2],
                                        'model__n_estimators': [250]
                                        }
                }

In [None]:
quick_search = HyperparameterOptimizationSearch(models=models_search, params=params_search)
quick_search.fit(X_train, y_train,
                scoring=make_scorer(recall_score, pos_label=1),
                n_jobs=-1, cv=5)

Checking the results

In [None]:
grid_search_summary, grid_search_pipelines = quick_search.score_summary(sort_by='mean_score')
grid_search_summary.head()

This time the mean recall score improved slightly to 0.631326, using the best features and the best parameters determined previously.

Defining the best classification pipeline

In [None]:
best_model = grid_search_summary.iloc[0, 0]

pipeline_clf = grid_search_pipelines[best_model].best_estimator_
pipeline_clf

In [None]:
best_parameters = grid_search_pipelines[best_model].best_params_
best_parameters

### Assess feature importance

In [None]:
best_features = X_train.columns

# create DataFrame to display feature importance
df_feature_importance = (pd.DataFrame(data={
    'Feature': best_features,
    'Importance': pipeline_clf['model'].feature_importances_})
    .sort_values(by='Importance', ascending=False)
)


# Most important features statement and plot
print(f"* These are the {len(best_features)} most important features in descending order. "
    f"The model was trained on them: \n{df_feature_importance['Feature'].to_list()}")

df_feature_importance.plot(kind='bar', x='Feature', y='Importance')
plt.show()

### Evaluate Pipeline on Train and Test Sets

In [None]:
clf_performance(X_train=X_train, y_train=y_train,
                X_test=X_test, y_test=y_test,
                pipeline=pipeline_clf,
                label_map= ['Will readmit', 'Will not readmit'] 
                )

The model scored the following:
* Train set:
    - Recall on 'Will readmit': 45%
    - Precision on 'Will not readmit': 54%
* Test set:
    - Recall on 'Will readmit': 46%
    - Precision on 'Will not readmit': 52%

Overall, the model did not overfit, but it also did not reach our targets. The feature importance remained the same.

---

## Refit pipeline using PCA

We will now try to improve the model's performance using PCA. We keep the previous data cleaning and feature engineering pipeline and, as we saw in the Feature Engineering notebook, we use all the components, since a large number of them is needed to explain more than 80% of the dataset's variance.
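
The statement about explained variance can be checked with a cumulative sum over `explained_variance_ratio_`. The sketch below does this on synthetic data with 13 features; the data are made up, so the resulting number of components does not describe the readmissions dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with 13 features.
X_demo, _ = make_classification(n_samples=500, n_features=13, n_informative=10,
                                n_redundant=0, random_state=0)
X_scaled = StandardScaler().fit_transform(X_demo)

pca = PCA(random_state=0).fit(X_scaled)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative variance reaches 80%.
n_for_80 = int(np.searchsorted(cumulative, 0.80) + 1)
```

When `n_for_80` is close to the total number of features, PCA offers little dimensionality reduction, which is the situation described above.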

In [None]:
def PipelineDataCleaningAndFeatureEngineeringPCA():
    pipeline_base = Pipeline([
        ("OrdinalEncoder", OrdinalEncoder(encoding_method='arbitrary',variables=categorical_columns)),
        ('Winsorizer_iqr', Winsorizer(variables=[
            'time_in_hospital', 'n_procedures','n_inpatient', 'n_medications','n_lab_procedures'],
                                capping_method='iqr', tail='both', fold=1.5)),
        ("SmartCorrelatedSelection", SmartCorrelatedSelection(variables=None,
        method="spearman", threshold=0.4, selection_method="variance")),
    ])

    return pipeline_base

In [None]:
def PipelineClf(model):
    pipeline_base = Pipeline([
        ("scaler", StandardScaler()),
        ("PCA", PCA(n_components=13, random_state=0)),
        ("model", model),
    ])

    return pipeline_base

Split to train and test set

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(['readmitted'], axis=1),
    df['readmitted'],
    test_size=0.2,
    random_state=0
)

print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

#### Handle Target Imbalance

In [None]:
pipeline_data_cleaning_feat_eng = PipelineDataCleaningAndFeatureEngineeringPCA()
X_train = pipeline_data_cleaning_feat_eng.fit_transform(X_train)
X_test = pipeline_data_cleaning_feat_eng.transform(X_test)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

In [None]:
y_train.value_counts().plot(kind='bar', title='Train Set Target Distribution')
plt.show()

In [None]:
oversample = SMOTE(sampling_strategy='minority', random_state=0)
X_train, y_train = oversample.fit_resample(X_train, y_train)

print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

In [None]:
y_train.value_counts().plot(kind='bar', title='Train Set Target Distribution')
plt.show()

#### GridSearch 

In [None]:
# Each algorithm is listed once: a repeated dictionary key would silently overwrite the earlier entry.
models_quick_search = {
    "LogisticRegression": LogisticRegression(random_state=42),
    "XGBClassifier": XGBClassifier(random_state=42),
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=42),
    "RandomForestClassifier": RandomForestClassifier(random_state=42),
    "GradientBoostingClassifier": GradientBoostingClassifier(random_state=42),
    "ExtraTreesClassifier": ExtraTreesClassifier(random_state=42),
    "AdaBoostClassifier": AdaBoostClassifier(random_state=42),
}

# Empty grids: every algorithm is run with its default hyperparameters.
params_quick_search = {
    "LogisticRegression": {},
    "XGBClassifier": {},
    "DecisionTreeClassifier": {},
    "RandomForestClassifier": {},
    "GradientBoostingClassifier": {},
    "ExtraTreesClassifier": {},
    "AdaBoostClassifier": {},
}

In [None]:
search = HyperparameterOptimizationSearch(models=models_quick_search, params=params_quick_search)
search.fit(X_train, y_train,
        scoring =  make_scorer(recall_score, pos_label=1),
        n_jobs=-1, cv=5)

In [None]:
grid_search_summary, grid_search_pipelines = search.score_summary(sort_by='mean_score')
grid_search_summary

From the grid search we still get the 'RandomForestClassifier' and the 'ExtraTreesClassifier' as the best performing algorithms.
  - RandomForestClassifier - 0.580993
  - ExtraTreesClassifier - 0.578558

Extensive search

In [None]:
models_quick_search = {
    "RandomForestClassifier":RandomForestClassifier(random_state=42),
}

params_quick_search = {
    "RandomForestClassifier":{'model__n_estimators': [150,250],
                            'model__max_depth': [None,15],
                            'model__min_samples_split': [2,75],
                            'model__min_samples_leaf': [1,75],
                            'model__max_leaf_nodes': [5,25],
                            'model__class_weight': ['balanced'],
                            'model__max_features': [ None,'sqrt'],
                            }
}

In [None]:
search = HyperparameterOptimizationSearch(models=models_quick_search, params=params_quick_search)
search.fit(X_train, y_train,
        scoring =  make_scorer(recall_score, pos_label=1),
        n_jobs=-1, cv=5)

In [None]:
grid_search_summary, grid_search_pipelines = search.score_summary(sort_by='mean_score')
grid_search_summary

After the extensive hyperparameter search, the mean recall score did not improve and remained low at 0.530431.

In [None]:
best_model = grid_search_summary.iloc[0,0]
best_model

In [None]:
best_parameters = grid_search_pipelines[best_model].best_params_
best_parameters

In [None]:
pipeline_clf = grid_search_pipelines[best_model].best_estimator_
pipeline_clf

#### Assess features importance

In [None]:
## PCA 
pca = pipeline_clf['PCA']
model = pipeline_clf['model']
# Get the PCA components (coefficients of original features)
components = pca.components_
# Get the feature names
feature_names = X_train.columns

# Calculate the importance of original features by multiplying the component coefficients
# with the feature importances from the model (if the model has such an attribute)
df_feature_importance = pd.DataFrame({
    'Feature': feature_names,
    'Importance': components.T @ model.feature_importances_
}).sort_values(by='Importance', ascending=False)

# re-assign best_features order
best_features = df_feature_importance['Feature'].to_list()

# Most important features statement and plot
print(f"* These are the {len(best_features)} most important features in descending order. "
    f"The model was trained on them: \n{df_feature_importance['Feature'].to_list()}")

df_feature_importance.plot(kind='bar', x='Feature', y='Importance')
plt.show()
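
One caveat with the projection above: PCA loadings are signed, so positive and negative coefficients can cancel when they are summed. The sketch below shows, on a made-up 2x2 example, how weighting absolute loadings avoids this. It is an alternative way of reading the result, not the method used in this notebook.

```python
import numpy as np

# Rows are principal components, columns are original features (made-up values).
components = np.array([[0.8, -0.6],
                       [0.6,  0.8]])
# Hypothetical importances the model assigns to each component.
component_importance = np.array([0.5, 0.5])

signed = components.T @ component_importance            # loadings can cancel
absolute = np.abs(components).T @ component_importance  # magnitudes only
absolute_share = absolute / absolute.sum()              # normalised to sum to 1
```

Here both features load equally strongly on the components, yet the signed score of the second feature is much smaller because its two loadings have opposite signs.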

#### Evaluate Pipeline on Train and Test sets

In [None]:
clf_performance(X_train=X_train, y_train=y_train,
                X_test=X_test, y_test=y_test,
                pipeline=pipeline_clf,
                label_map= ['Will readmit', 'Will not readmit']
                )

Overall performance with PCA improved, but the model still fell short of the target we set of at least 70% recall on 'Will readmit'.

* Train set:
    - Recall on 'Will readmit': 69%
    - Precision on 'Will not readmit': 64%
* Test set:
    - Recall on 'Will readmit': 67%
    - Precision on 'Will not readmit': 60%

Although we didn't reach 70% on recall, we managed to keep the precision for 'Will not readmit' at or above 60%, which is one of the criteria we set.

---

# Push files to Repo

We will generate the following files:

* Train set
* Test set
* Data cleaning and Feature Engineering pipeline
* Modeling pipeline
* Feature importance plot

In [None]:
import joblib
import os

version = 'v1'
file_path = f'outputs/ml_pipeline/predict_readmission/{version}'

# exist_ok=True avoids an error when the folder already exists
os.makedirs(file_path, exist_ok=True)

### Train Set

In [None]:
print(X_train.shape)
X_train.head()

In [None]:
X_train.to_csv(f"{file_path}/X_train.csv", index=False)

In [None]:
y_train

In [None]:
y_train.to_csv(f"{file_path}/y_train.csv", index=False)

### Test Set

In [None]:
print(X_test.shape)
X_test.head()

In [None]:
X_test.to_csv(f"{file_path}/X_test.csv", index=False)

In [None]:
y_test

In [None]:
y_test.to_csv(f"{file_path}/y_test.csv", index=False)

## ML Pipelines: Data Cleaning and Feat Eng pipeline and Modelling Pipeline

We will save 2 pipelines:

* Both should be used in conjunction to predict Live Data.
* To predict on Train Set, Test Set we use only pipeline_clf, since the data is already processed.

Pipeline responsible for Data Cleaning and Feature Engineering.

In [None]:
pipeline_data_cleaning_feat_eng

In [None]:
joblib.dump(value=pipeline_data_cleaning_feat_eng ,
            filename=f"{file_path}/clf_pipeline_data_cleaning_feat_eng.pkl")

In [None]:
pipeline_clf

In [None]:
joblib.dump(value=pipeline_clf ,
            filename=f"{file_path}/clf_pipeline_model.pkl")
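
The sketch below shows how the two saved pipelines could be loaded and chained to predict on live data. It uses small stand-in pipelines and a temporary folder so that it runs on its own; the step names, file names and data are hypothetical, and in the notebook the saved `.pkl` files above would be loaded instead.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Made-up data and stand-in pipelines for illustration.
rng = np.random.default_rng(0)
X_raw = rng.normal(size=(100, 4))
y_demo = (X_raw[:, 0] > 0).astype(int)

cleaning = Pipeline([("stand_in_cleaning", MinMaxScaler())])
modelling = Pipeline([("scaler", StandardScaler()),
                      ("model", LogisticRegression())])
modelling.fit(cleaning.fit_transform(X_raw), y_demo)

# Save both pipelines, then load them back as a deployed app would.
with tempfile.TemporaryDirectory() as tmp:
    joblib.dump(cleaning, Path(tmp) / "cleaning.pkl")
    joblib.dump(modelling, Path(tmp) / "model.pkl")
    loaded_cleaning = joblib.load(Path(tmp) / "cleaning.pkl")
    loaded_model = joblib.load(Path(tmp) / "model.pkl")

# Live data goes through the cleaning pipeline first, then the model pipeline.
live = rng.normal(size=(3, 4))
live_pred = loaded_model.predict(loaded_cleaning.transform(live))
```

The order matters: the modelling pipeline was fitted on the output of the cleaning pipeline, so live data must pass through the same cleaning step before prediction.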

## Feature Importance plot

In [None]:
df_feature_importance.plot(kind='bar',x='Feature',y='Importance')
plt.savefig(f'{file_path}/features_importance.png', bbox_inches='tight')

## Conclusions

The model we trained in this notebook did not meet the metric criteria set in our case study for accurately predicting patients who are at risk of being readmitted, but it remains reasonably reliable, with caution, at predicting patients who will not be readmitted.

At this point we could temporarily deploy this model, but we would have to warn users about the reliability of the results until a more accurate and reliable model is developed. This could be achieved through further examination of the features, the application of more feature engineering techniques, and possibly the development of an Artificial Neural Network.