# `BUILDING AN ML MODEL TO PREDICT CUSTOMER CHURN`
#### Using the CRISP-DM framework


## `Business Understanding`


#### Goal/Description
To create a machine learning model that predicts whether a telecom customer will churn.



# %% [markdown]
# #### `Null Hypothesis`
# There is no relationship between a customer being tech savvy and customer retention.
# 
# #### `Alternate Hypothesis`
# There is a relationship between a customer being tech savvy and customer retention.
# 
# ###### NB: A tech-savvy customer is one who has online security, device protection, or both.
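# %% [markdown]
# The definition in the note above can be turned into a column directly. The following is a minimal sketch on a toy frame, assuming `OnlineSecurity` and `DeviceProtection` already hold booleans; the column name `tech_savvy` and the toy values are illustrative assumptions, not part of the original data.

```python
# %%
import pandas as pd

# Toy rows standing in for the customer table (values are illustrative only)
toy = pd.DataFrame({
    "OnlineSecurity":   [True, False, True, False],
    "DeviceProtection": [False, False, True, True],
    "Churn":            [False, True, False, True],
})

# A customer is "tech savvy" if they have online security, device protection, or both
toy["tech_savvy"] = toy["OnlineSecurity"] | toy["DeviceProtection"]

# Count churned and retained customers within each group
contingency = pd.crosstab(toy["tech_savvy"], toy["Churn"])
print(contingency)
```

# %% [markdown]
# The resulting table puts the retention of the two groups side by side, which is the comparison the hypotheses describe.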



# %% [markdown]
# ### `Key Metrics and Success Criteria`
# 
# The success of this project will be evaluated against the following key metrics and success criteria:
# 
# • Model Accuracy: The ability of the machine learning model to accurately predict customer churn.
# 
# • Model Interpretability: The degree to which the model's predictions and insights can be understood and utilized by stakeholders.
# 
# • Business Impact: The effectiveness of retention strategies implemented based on the model's recommendations in reducing customer churn rates and improving overall customer satisfaction and retention.
# 
# 



# %% [markdown]
# #### `Analytical Questions`
# - How do tenure and monthly charges affect customer churn?
# - What is the likelihood that a customer with online security and device protection will churn?
# - What is the relationship between the type of contract and the likelihood of customer churn?
# - Are customers with dependents and online security likely to churn?



# %% [markdown]
# ## `Data Understanding`

# %% [markdown]
# #### Data Source
# The data was sourced from a telecommunications company and divided into three parts:
# - 3000 rows as the training data
# - 2000 rows as the evaluation data
# - 2000 rows as the test data



# %% [markdown]
# ### `Issues`
# - Some columns encode the same negative value in several ways, e.g. `No`, `No internet service`, `false`.
# 
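# %% [markdown]
# One way to handle the issue above is to map every spelling onto a single boolean. The sketch below is a minimal illustration on a hypothetical series; the helper `to_bool` and the label sets `NEGATIVE` and `POSITIVE` are assumptions made for this example, not names from the original notebook.

```python
# %%
import pandas as pd

# Hypothetical raw values mixing the spellings seen in the data
raw = pd.Series(["Yes", "No", "No internet service", "false", True, "No phone service"])

# Every spelling of "negative" maps to False and every "positive" to True
NEGATIVE = {"No", "No internet service", "No phone service", "false", "False"}
POSITIVE = {"Yes", "true", "True"}

def to_bool(value):
    """Return True/False for known labels; leave anything else unchanged."""
    if isinstance(value, bool):
        return value
    if value in NEGATIVE:
        return False
    if value in POSITIVE:
        return True
    return value

clean = raw.map(to_bool)
print(clean.tolist())
```

# %% [markdown]
# Checking `isinstance(value, bool)` first keeps values that are already booleans untouched, and unknown labels pass through so they can be inspected rather than silently changed.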



# %% [markdown]
# #### Data Exploration



# %% [markdown]
# ##### `Libraries`

# %%
# Libraries imported
# Data access and handling
import sqlalchemy as sa
import pyodbc
from dotenv import dotenv_values
import pandas as pd
import numpy as np
import collections

# Statistics
from scipy import stats
from scipy.stats import kruskal

# Visualization
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
import seaborn as sns

# Filter warnings
import warnings
warnings.filterwarnings('ignore')

# Preprocessing and pipelines
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.utils import resample
from sklearn.impute import SimpleImputer

# Models
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from catboost import CatBoostClassifier
import xgboost as xgb
from xgboost import XGBClassifier

# Balancing the dataset
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as imbpipeline

# Feature selection
from sklearn.feature_selection import mutual_info_classif, SelectKBest

# Train/test splitting and cross-validated hyperparameter tuning
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV

# Evaluation metrics (explicit names instead of a wildcard import)
from sklearn.metrics import (accuracy_score, precision_score, recall_score, f1_score,
                             classification_report, confusion_matrix, roc_curve, auc)

# joblib for model persistence
import joblib



# %% [markdown]
# ##### `Accessing the second set of data in CSV format`

# %%
##Accessing the second set of data 
csv_df = pd.read_csv("data/LP2_Telco-churn-second-2000.csv")
csv_df.info()

# %%
# Describing the Dataframe
csv_df.describe(include='all').T



# %% [markdown]
# ##### `Merging the Two Dataframes`

# %%
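# Stack the first dataset on top of the CSV data
# (sql_df is assumed to have been loaded in an earlier cell that is not shown here)
com_df = pd.concat([sql_df, csv_df], ignore_index=True)
print(com_df.shape)
com_df.head(5)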

# %%
#Checking the datatypes of the columns
datatypes = com_df.dtypes
datatypes

# %% [markdown]
# ##### Converting the TotalCharges datatype to float64

# %%
#Converting TotalCharges column to numeric
com_df['TotalCharges'] = pd.to_numeric(com_df['TotalCharges'], errors='coerce')
com_df=com_df.reset_index()

# %%
# Checking the Null value
com_df.isnull().sum()

# %%
# Keep a copy taken before the True/False normalization below; it is used later for modelling
data = com_df.copy()
com_df.head(5)

# %%
#Dropping the index column
com_df = com_df.drop(['index'], axis = 1 )

# %% [markdown]
# ##### Replacing all negatives with False and positives with True

# %%
com_df.replace(['No','No internet service','false','No phone service'], "False", inplace = True)

com_df.replace('Yes',"True", inplace = True)



# %%
com_df['SeniorCitizen'] = com_df['SeniorCitizen'] == 1


# %%
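# replace() returns a new Series, so assign it back; by this point the negative labels
# in this column have already become 'False', so that is the value mapped to 'None'
com_df['InternetService'] = com_df['InternetService'].replace('False', 'None')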

# %%
datatypes = com_df.dtypes
datatypes

# %% [markdown]
# ##### Making the True/False to Boolean

# %%
com_df.replace({'True': True, 'False': False}, inplace = True)

# %%
com_df.to_csv("data/customer_churn_merged.csv", index=False)

# %% [markdown]
# ### Univariate Analysis

# %%
# Distribution of the variables
com_df.hist(density = True,figsize = (20, 15), facecolor = 'lightgreen', alpha = 0.75,grid = False)

plt.show()

# %%
# Visualize the distribution of categorical columns
categoricals = [column for column in com_df.columns if com_df[column].dtype == "O"]
for column in categoricals:
    if column not in ['customerID']:
        fig = px.histogram(com_df, x=column, text_auto=True, color=column,
                           title=f"Distribution of customers based on {column}")
        fig.update_layout(uniformtext_minsize=8, uniformtext_mode='hide', xaxis_tickangle=-45)
        fig.show()


# %% [markdown]
# #### OBSERVATION
# - Gender is evenly distributed.
# - Over 50% of customers are on month-to-month contracts.
# - Electronic check is the most used payment method, chosen by 30% of customers.
# 

# %%
fig = plt.figure(figsize = (5, 4))
 
# Creating plot
plt.boxplot(com_df.tenure)
plt.show()

# %%
fig = plt.figure(figsize = (5, 4))
 
# Creating plot
plt.boxplot(com_df.MonthlyCharges)
plt.show()

# %% [markdown]
# ### Bivariate Analysis

# %%
# Summarizing the relationships between the variables with a heatmap of the correlations
correlation_matrix = com_df.corr(numeric_only = True)
plt.figure(figsize = (10, 8))
sns.heatmap(correlation_matrix, annot = True,cmap = 'vlag')
plt.title("Correlation heatmap of the Telecom Dataset")
plt.show()

# %% [markdown]
#  ## `Answering the Analytical Questions`
# 

# %% [markdown]
# ##### `How do tenure and monthly charges affect customer churn?`
# 

# %%

# Group customers by tenure; tenures outside the 10-70 range fall outside the bins and stay unlabelled
bins = [10, 30, 50, 70]
labels = ['Newbie', 'Young', 'Oldies']
df = com_df
df['tenure Group'] = pd.cut(df['tenure'], bins=bins, labels=labels)

# Mean monthly charge for each tenure group, split by churn status
streamers = com_df.groupby(['tenure Group', 'Churn'])['MonthlyCharges'].mean().sort_values(ascending=True)

streamers.plot(kind='bar', title='How do tenure and monthly charges affect customer churn?', figsize=(10, 6), cmap='Dark2', rot=30)

plt.show()


# %% [markdown]
# #### OBSERVATION
# - Across new, existing and old customers, those with higher monthly charges are the ones churning.
# - A loyalty promotion could help lock in old customers.
# - A sign-up discount could likewise help lock in new customers.

# %% [markdown]
# ##### `What is the likelihood of a customer with online security and device protection to churn?`
# 

# %%
# Churn rate (share of customers who churned) per group; a plain count would only show group sizes
cust_retention = (com_df.dropna(subset=['Churn'])
                  .assign(Churn=lambda d: d['Churn'].astype(float))
                  .groupby(['OnlineSecurity', 'DeviceProtection'])['Churn'].mean()
                  .sort_values(ascending=True))
cust_retention.plot(kind='bar', title='Churn rate by online security and device protection', figsize=(10, 6), cmap='Dark2', rot=30)

# %% [markdown]
# #### OBSERVATION
# - Customers with no security services at all are more likely to churn.
# - Offering basic security services could reduce customer doubt and, in turn, churn.

# %% [markdown]
# #### `What is the relationship between the type of contract and the likelihood of a customer churn?`
# 

# %%
# Churn rate (share of customers who churned) per contract type; a plain count would only show group sizes
cust_contract = (com_df.dropna(subset=['Churn'])
                 .assign(Churn=lambda d: d['Churn'].astype(float))
                 .groupby('Contract')['Churn'].mean()
                 .sort_values(ascending=True))
cust_contract.plot(kind='bar', title='Churn rate by type of contract', figsize=(10, 6), cmap='Dark2', rot=30)

# %% [markdown]
# #### OBSERVATION
# - Month-to-Month Customers are more likely to churn as they are likely to be floating users.

# %% [markdown]
# #### `Are customers with dependents and online security likely to churn?`

# %%
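# Churn rate (share of customers who churned) per group; a plain count would only show group sizes
cust_contract = (com_df.dropna(subset=['Churn'])
                 .assign(Churn=lambda d: d['Churn'].astype(float))
                 .groupby(['OnlineSecurity', 'Dependents'])['Churn'].mean()
                 .sort_values(ascending=True))
cust_contract.plot(kind='bar', title='Churn rate by online security and dependents', figsize=(10, 6), cmap='Dark2', rot=30)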

# %% [markdown]
# #### OBSERVATION
# Customers with both Online Security and Dependents are less likely to churn.

# %%
com_df.isnull().sum()


# %%
#Dropping Empty rows
com_df = com_df.dropna(subset=['OnlineSecurity','OnlineBackup','DeviceProtection','MultipleLines','TotalCharges','Churn'],axis = 0)

# %%
#finding duplicates
duplicate = com_df[com_df.duplicated()]
duplicate.shape

# %% [markdown]
# ##### OBSERVATION 
# No duplicates found

# %% [markdown]
# #### `HYPOTHESIS`

# %%
#Checking Normality of the data 

def check_normality(data,name):
    test_stat_normality, p_value_normality = stats.shapiro(data)
    print("p value:%.20f" % p_value_normality)
    if p_value_normality < 0.05:
        print(f"Reject null hypothesis >> The data for {name} is not normally distributed")
    else:
        print(f"Fail to reject null hypothesis >> The data for {name} is normally distributed")

# %%
#Hypothesis

df_tech = com_df.loc[com_df.OnlineSecurity & com_df.DeviceProtection]
online = com_df.loc[com_df.OnlineSecurity]
device = com_df.loc[com_df.DeviceProtection]


# %%
#Normality Checks
check_normality(df_tech.TotalCharges,'Online Security and Device Protection')
check_normality(online.TotalCharges,'Online Security')
check_normality(device.TotalCharges,'Device Protection')

# %%
# Levene's test for equality of variances across the three groups
stat, pvalue_levene = stats.levene(df_tech.TotalCharges, online.TotalCharges,device.TotalCharges )

print("p value:%.10f" % pvalue_levene)
if pvalue_levene < 0.05:
    print("Reject null hypothesis >> The variances of the samples are different.")
else:
    print("Fail to reject null hypothesis >> The variances of the samples are same.")

# %% [markdown]
# ##### Observation 
# - Data samples are not normally distributed
# - The variances of the samples are different
# - Therefore a Non-Parametric test must be done (Kruskal Test)
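# %% [markdown]
# The reasoning in the observation above (check normality, check equal variances, then pick the test) can be written as one small helper. This is a sketch on synthetic data, not the notebook's own procedure: the function name `compare_groups` is hypothetical, and falling back to one-way ANOVA when both checks pass is an assumption about what the parametric alternative would be.

```python
# %%
import numpy as np
from scipy import stats

def compare_groups(*groups, alpha=0.05):
    """Pick one-way ANOVA or Kruskal-Wallis based on normality and equal-variance checks."""
    # Shapiro-Wilk on each group: p < alpha suggests the group is not normally distributed
    all_normal = all(stats.shapiro(g).pvalue >= alpha for g in groups)
    # Levene's test: p < alpha suggests the group variances differ
    equal_var = stats.levene(*groups).pvalue >= alpha

    if all_normal and equal_var:
        test_name, result = "anova", stats.f_oneway(*groups)
    else:
        test_name, result = "kruskal", stats.kruskal(*groups)
    return test_name, result.pvalue

# Synthetic, right-skewed samples (illustrative only, not the telecom data)
rng = np.random.default_rng(27)
a = rng.exponential(scale=1.0, size=200)
b = rng.exponential(scale=1.5, size=200)
c = rng.exponential(scale=3.0, size=200)

test_name, p_value = compare_groups(a, b, c)
print(test_name, p_value)
```

# %% [markdown]
# Because the synthetic samples are strongly skewed, the normality check fails and the helper falls through to the Kruskal-Wallis test, mirroring the decision taken for `TotalCharges`.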

# %%
#Kruskal Test

stat, p = kruskal(df_tech.TotalCharges, online.TotalCharges,device.TotalCharges)
print('Statistics=%.3f, p=%.15f' % (stat, p))

if p > 0.05:
    print('All sample distributions are the same (fail to reject H0)')
else:
    print('One or more sample distributions are not equal distributions (reject null Hypothesis)')

# %% [markdown]
# ##### OBSERVATION
# Reject the null Hypothesis

# %% [markdown]
# ### `Data preparation`

# %% [markdown]
# #### Feature Correlation and Selection

# %%
# Summarize the relationships between the variables with a heatmap of the correlations
correlation_matrix = df.corr(numeric_only= True).round(3)
mask = np.triu(np.ones_like(correlation_matrix, dtype=bool))
plt.figure(figsize = (10, 8))
sns.heatmap(correlation_matrix, annot=True,cmap='vlag',mask=mask)
plt.title("Correlation heatmap of the dataset")
plt.show()

# %%

# Build the modelling frame from the copy taken before the True/False normalization
df1 = data.drop(columns=['index', 'customerID', 'gender', 'TotalCharges'])

# %%
# Dropping row with null value
df1.dropna(axis = 0, inplace = True)

# %%
def str_convert(df,column_name):
    df[column_name]=df[column_name].replace({1: 'Yes', 0: 'No'})

    return df

# %%
df1 = str_convert(df1,'SeniorCitizen')


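# Convert only genuine booleans to 'Yes'/'No'. A plain `x == True` test also matches the number 1
# (and `x == False` the number 0), which would corrupt numeric columns such as tenure.
df1 = df1.map(lambda x: ('Yes' if x else 'No') if isinstance(x, (bool, np.bool_)) else x)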


df1['tenure'] = pd.to_numeric(df1['tenure'], errors = 'coerce', downcast = 'integer')


# %%
def cleaner(df):
    """Apply the training-time cleaning steps to a new dataframe."""
    df = df.drop(columns=['customerID', 'gender', 'TotalCharges'])
    df['SeniorCitizen'] = df['SeniorCitizen'].replace({1: 'Yes', 0: 'No'})
    # Convert only genuine booleans; `x == True` would also match the number 1
    df = df.map(lambda x: ('Yes' if x else 'No') if isinstance(x, (bool, np.bool_)) else x)
    df['tenure'] = pd.to_numeric(df['tenure'], errors='coerce', downcast='integer')

    return df

# %% [markdown]
# #### `Distribution of the dependent variable`

# %% [markdown]
# ##### Dataset classification
# 
# - Checking to see if the binary dependent variables are evenly distributed or not 
# - With the current levels of disparity between the two classes what stratification method will be best
# 

# %%
# Separate majority and minority classes
df1_stay = df1[df1.Churn== 'No']
df1_left = df1[df1.Churn=="Yes"]

print((len(df1_stay)/len(df1)),(len(df1_left)/len(df1)))
print(len(df1_left))

# %% [markdown]
# ##### Observation
# - About 70% of the customers stayed, so the churned customers represent the minority class.
# - Undersampling would discard a large share of the majority class to balance the data.
# - Random oversampling would add too many duplicates of the minority class to the balanced data.
# - For this dataset, it will be best to use SMOTE to balance the classes.
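# %% [markdown]
# To make the difference from plain duplication concrete, the sketch below generates synthetic minority points by interpolating between neighbouring minority samples, which is the core idea behind SMOTE. It is a simplified NumPy illustration on synthetic data, not the `imblearn` implementation used later in the pipeline; the function `smote_like_oversample` and all values are assumptions made for this example.

```python
# %%
import numpy as np

def smote_like_oversample(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    # Pairwise distances between minority samples
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)            # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, len(X_min), size=n_new)              # random minority points
    picked = neighbours[base, rng.integers(0, k, size=n_new)]   # one random neighbour each
    gap = rng.random((n_new, 1))                                # position along the segment
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Synthetic 70/30 split (illustrative only, not the telecom data)
rng = np.random.default_rng(27)
X_major = rng.normal(0.0, 1.0, size=(70, 2))
X_minor = rng.normal(2.0, 1.0, size=(30, 2))

synthetic = smote_like_oversample(X_minor, n_new=len(X_major) - len(X_minor))
X_minor_balanced = np.vstack([X_minor, synthetic])
print(len(X_major), len(X_minor_balanced))
```

# %% [markdown]
# Each synthetic point lies on the line segment between two real minority samples, so the minority class grows without exact duplicates and without discarding any majority rows.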

# %%
df1.dtypes

# %%
df1.head(4)

# %% [markdown]
# #### `Modeling`

# %%
df1.dtypes

# %%
# Dropping row with null value
df1.dropna(axis = 0, inplace = True)

# %%
X=df1.drop(columns=['Churn'],axis=1)
y=df1['Churn'].replace({'Yes': 1, 'No': 0})


# %%
# Looking at the descriptive statistics of the columns with categorical values
cats = [column for column in X.columns if (X[column].dtype == "O")]
print("Summary table of the Descriptive Statistics of Columns with Categorical Values")
df1[cats].describe(include="all")

# %%
# Looking at the descriptive statistics of the columns with numeric values
numerics = [column for column in X.columns if (X[column].dtype != "O")]
print("Summary table of the Descriptive Statistics of Columns with Numeric Values")
df1[numerics].describe()

# %%
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.5, random_state = 27)

# %%
y_train_encoded = pd.to_numeric(y_train)
y_test_encoded = pd.to_numeric(y_test)


# %% [markdown]
# ##### `Making pipelines`
# 

# %%
scaler = StandardScaler()
encoder = OneHotEncoder(handle_unknown="ignore", sparse_output=False)

# putting numeric columns to scaler and categorical to encoder
num_transformer = Pipeline(steps=[
    ('num_imputer', SimpleImputer(strategy='median')),
    ('num', scaler)
])
cat_transformer = Pipeline(steps=[
    ('cat_imputer', SimpleImputer(strategy='most_frequent')),
    ('cat', encoder)
])


# %%
# Combine the numeric and categorical transformers into one preprocessor
preprocessor = ColumnTransformer(
    transformers=[('num', num_transformer, numerics),
                  ('cat', cat_transformer, cats)])

# %%
#Calling the models of interest

log_mod =  (LogisticRegression(random_state = 27 ))
svc_mod = SVC(random_state=27,probability= True)

catboost_mod = (CatBoostClassifier(random_state=27, verbose = False))
xgboost_mod = XGBClassifier(random_state=27)
 

# %%
# Create a dictionary of the model pipelines
all_models_pipelines = {"Logistic_Regressor": (LogisticRegression(random_state = 27 )),
              "SVM": SVC(random_state = 27,probability = True),
              "CatBoost": (CatBoostClassifier(random_state=27, verbose = False)),
              "Xgboost":XGBClassifier(random_state=27)
              }
    

# %%
# Create a function to model and return comparative model evaluation scores,perform the SMOTE on each model pipeline,to calculate and compare accuracy

def evaluate_models(model_pipelines = all_models_pipelines, X_test = X_test, y_test = y_test_encoded):


    # Dictionary for trained models
    trained_models = dict()

    # Create a dataframe matrix to all pipelines
    all_confusion_matrix = []
    
    
    # List to receive scores
    performances = []
    for name, model_pipeline in model_pipelines.items():
        final_pipeline = imbpipeline(steps=[("preprocessor", preprocessor), 
                                   ('smote-sampler',SMOTE(random_state = 0)),
                                   ("feature_selection",SelectKBest(mutual_info_classif, k = 'all')),
                           ("model", model_pipeline)])
    


        
        final_pipeline.fit(X_train,  y_train)
       

        # Predict and calculate performance scores
        y_pred = final_pipeline.predict(X_test)
        performances.append([name,
                             accuracy_score(y_test, y_pred),  # accuracy
                             precision_score(y_test, y_pred, average="weighted"),  # precisions
                             recall_score(y_test, y_pred,average="weighted"),  # recall
                             f1_score(y_test, y_pred, average="weighted")
                             ])

        # Print classification report
        model_pipeline_report = classification_report(y_test, y_pred)
        print("This is the classification report of the",name, "model", "\n", model_pipeline_report, "\n")

        # Defining the Confusion Matrix
        model_pipeline_conf_mat = confusion_matrix(y_test, y_pred)
        model_pipeline_conf_mat = pd.DataFrame(model_pipeline_conf_mat).reset_index(drop = True)
        print(f"Below is the confusion matrix for the {name} model")

        # Visualizing the Confusion Matrix
        f, ax = plt.subplots()
        sns.heatmap(model_pipeline_conf_mat, annot = True, linewidth = 1.0,fmt = ".0f", cmap = "RdPu", ax=ax)
        ax.set_xlabel("Prediction")
        ax.set_ylabel("Actual")
        plt.show()

        # Store trained model
        trained_model_name = "trained_" + str(name).lower()
        trained_models[trained_model_name] = final_pipeline
        
        print("\n", "-----   -----"*6, "\n",  "-----   -----"*6)
    
    # Compile accuracy
    df_compare = pd.DataFrame(performances, columns = ["model", "accuracy", "precision", "recall", "f1_score"])
    df_compare.set_index("model", inplace = True)
    df_compare.sort_values(by = ["f1_score", "accuracy"], ascending = False, inplace=True)
    return df_compare, trained_models

# %%
# Run the function to train models and return performances
all_models_eval, trained_models = evaluate_models()
all_models_eval

# %% [markdown]
# #### `Visualizing Evaluation Using ROC - AUC`

# %%
from sklearn.metrics import roc_curve,auc

all_roc_data = {}
fig, ax = plt.subplots()

for name, model in trained_models.items():
    # Predicted probability of the positive class (churn)
    y_score = model.predict_proba(X_test)[:, 1]

    fpr, tpr, thresholds = roc_curve(y_test_encoded, y_score)

    roc_auc = auc(fpr, tpr)
    roc_data_df = pd.DataFrame({'False Positive Rate': fpr, 'True Positive Rate': tpr, 'Threshold': thresholds})
    all_roc_data[name] = roc_data_df

    ax.plot(fpr, tpr, label=f'{name} (AUC = {roc_auc:.2f})')

# Draw the chance line once; FPR goes on the x-axis and TPR on the y-axis
ax.plot([0, 1], [0, 1], linestyle='--', color='k', label='Random')
ax.set_xlabel('False Positive Rate')
ax.set_ylabel('True Positive Rate')
ax.set_title('ROC Curve for all pipelines')

plt.legend()
plt.show()

# %%

lr_roc_data = all_roc_data["trained_logistic_regressor"]
svm_roc_data = all_roc_data["trained_svm"]
catboost_roc_data = all_roc_data["trained_catboost"]
xgboost_roc_data = all_roc_data["trained_xgboost"]

# %% [markdown]
# ##### `Business Impact Assessment`
# 
# - The true positive rate (sensitivity) is acceptable, but it needs to be raised further before production
# - The acceptable threshold to meet the criteria is 0.4812 for the Logistic Regression model
# - The acceptable threshold to meet the criteria is 0.3703 for the SVM model
# - The acceptable threshold to meet the criteria is 0.2398 for the CatBoost model
# - The acceptable threshold to meet the criteria is 0.2189 for the XGBoost model
# 
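# %% [markdown]
# A threshold like the ones listed above can be read off the ROC data once a target sensitivity is fixed. The sketch below shows one way to do this on synthetic scores; the helper `threshold_for_tpr` is hypothetical and the target of 0.80 is an assumed value, since the notebook does not state the criterion it used.

```python
# %%
import numpy as np
from sklearn.metrics import roc_curve

def threshold_for_tpr(y_true, y_score, target_tpr):
    """Return the highest threshold whose true positive rate reaches target_tpr."""
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    # roc_curve returns thresholds in decreasing order, so tpr is non-decreasing
    idx = int(np.argmax(tpr >= target_tpr))
    return thresholds[idx], tpr[idx], fpr[idx]

# Synthetic scores (illustrative only): positives tend to score higher
rng = np.random.default_rng(27)
y_true = rng.integers(0, 2, size=500)
y_score = np.clip(0.35 * y_true + rng.normal(0.4, 0.2, size=500), 0, 1)

# 0.80 is an assumed target sensitivity, not a value from the notebook
threshold, tpr_at, fpr_at = threshold_for_tpr(y_true, y_score, target_tpr=0.80)

# Classify as positive whenever the score reaches the chosen threshold
y_pred = (y_score >= threshold).astype(int)
print(threshold, tpr_at, fpr_at)
```

# %% [markdown]
# Lowering the threshold raises sensitivity at the cost of more false positives, so the returned false positive rate should be weighed against the cost of contacting customers who would not have churned.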

# %% [markdown]
# #### `Hyperparameter Tuning`

# %%
## XGBoost Classifier
xgb_clf = Pipeline(steps=[("preprocessor", preprocessor), 
                          ("model", XGBClassifier(random_state=27))])

# Defining the parameter grid for GridSearchCV
param_grid_xgboost = {"model__learning_rate": [0.1, 0.3, 0.5, 0.7, 1.0],
               "model__max_depth": [5, 10, 15, 20, 25, 30, 35],
               "model__booster": ["gbtree", "gblinear", "dart"],
               "model__n_estimators":  list(range(2, 11, 2))
              }

# %%
# Running the grid search cross-validation with the above set of parameters
grid_search_model = GridSearchCV(estimator = xgb_clf, param_grid = param_grid_xgboost, n_jobs=-1, scoring = "accuracy")


# Fitting the model to the training data
grid_search_model.fit(X_train,y_train_encoded)

print("Best parameter (CV score=%0.5f):" % grid_search_model.best_score_)
print(f"The best parameters for the GSCV XGB are: {grid_search_model.best_params_}")

# %%
# Looking at the best combination of hyperparameters for the model
best_gs_params = grid_search_model.best_params_
print("The best combination of hyperparameters for the model will be:")
for param_name in sorted(best_gs_params.keys()):
    print(f"{param_name} : {best_gs_params[param_name]}")

# %%
# Defining the best version of the model with the best parameters
best_gs_model = Pipeline(steps=[("preprocessor", preprocessor), 
                          ("model",XGBClassifier(random_state=27,
                              booster="gblinear",
                              learning_rate=1.0,
                              
                              n_estimators=6
                              ))])

# Fit the model to the training data
best_gs_model.fit(X_train, y_train_encoded)

# Predict on the test data
best_gs_pred = best_gs_model.predict(X_test)

print(best_gs_pred)
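
# %% [markdown]
# Retyping the best parameters by hand can drift from what the search actually found. As a hedged alternative, the sketch below reuses the fitted search object directly; it runs on synthetic data with a small Logistic Regression grid, and every name prefixed with `demo_` is an assumption made for this example.

```python
# %%
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary data (illustrative only, not the telecom data)
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=27)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=27)

demo_pipeline = Pipeline(steps=[("scaler", StandardScaler()),
                                ("model", LogisticRegression(max_iter=500, random_state=27))])
demo_grid = {"model__C": [0.1, 1.0, 10.0]}

demo_search = GridSearchCV(demo_pipeline, param_grid=demo_grid, scoring="accuracy", cv=3)
demo_search.fit(X_tr, y_tr)

# With refit=True (the default), best_estimator_ is the whole pipeline
# refitted on the full training data using the best parameters
best_demo_model = demo_search.best_estimator_
demo_pred = best_demo_model.predict(X_te)
print(demo_search.best_params_)
```

# %% [markdown]
# The same pattern would apply to `grid_search_model` above: its `best_estimator_` attribute already holds a fitted pipeline with the winning parameters.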

# %%
# Confusion Matrix
best_gs_conf_mat = (pd.DataFrame(confusion_matrix(y_test_encoded, best_gs_pred)).reset_index(drop=True))

# Visualizing the Confusion Matrix
f, ax = plt.subplots()
sns.heatmap(best_gs_conf_mat, annot=True, linewidth=1.0, fmt=".0f", cmap="RdPu", ax=ax)

# %%
logistic_model = LogisticRegression(random_state=27)
logistic_pipeline = Pipeline(steps=[("preprocessor", preprocessor), ("model", logistic_model)])

# Define the parameter grid for GridSearchCV
param_distributions = {
    'model__penalty': ['l2'],
    'model__solver': ['lbfgs', 'liblinear', 'newton-cg'],
    'model__max_iter': [500, 700, 1000]
}

# Create the GridSearchCV object
clf = GridSearchCV(logistic_pipeline, param_grid=param_distributions, scoring='accuracy', error_score='raise')

# Fit the GridSearchCV on the training data
search_model = clf.fit(X_train, y_train_encoded)

print("Best parameter (CV score=%0.5f):" % search_model.best_score_)
print(f"The best parameters for the GSCV Logistic Regression are: {search_model.best_params_}")


# %%

# Predict on the test data
search_pred = search_model.predict(X_test)

# # Get the best hyperparameters
# best_params = search.best_params_

# print(best_params)
# Defining the best version of the model with the best parameters
best_search_model = Pipeline(steps=[("preprocessor", preprocessor), 
                          ("model",LogisticRegression(random_state=27,
                              max_iter=500,
                              penalty='l2',
                              solver = 'newton-cg',
                              verbose=0
                              ))])

# Fit the model to the training data
best_search_model.fit(X_train, y_train_encoded)

# Predict on the test data with the tuned Logistic Regression pipeline
best_search_pred = best_search_model.predict(X_test)

print(best_search_pred)

# %%
# Confusion Matrix
best_search_conf_mat = (pd.DataFrame(confusion_matrix(y_test_encoded, best_search_pred)).reset_index(drop=True))

# Visualizing the Confusion Matrix
f, ax = plt.subplots()
sns.heatmap(best_search_conf_mat, annot=True, linewidth=1.0, fmt=".0f", cmap="RdPu", ax=ax)

# %%
cat_model = (CatBoostClassifier(random_state=27))
cat_pipeline = Pipeline(steps=[("preprocessor", preprocessor), ("model", cat_model)])

# Define the parameter grid for GridSearchCV
param_distributions = {
    #'model__C': uniform(scale=4),
    'model__depth': [6],                   # Depth of the trees
    'model__learning_rate': [0.1,1],          # Learning rate of the model
    'model__l2_leaf_reg': [3],              # L2 regularization term on weights
    'model__rsm': [0.2,0.8],                   # Random Selection Rate (regularization by introducing randomness)
    'model__iterations': [500,800],            # Number of boosting iterations
    'model__loss_function': ['MultiClass'], # Loss function for multi-class classification
    'model__eval_metric': ['Accuracy'],    # Evaluation metric

}

# Create the GridSearchCV object
cat_clf = GridSearchCV(cat_pipeline, param_grid=param_distributions, scoring='accuracy', error_score='raise')

# Fit the GridSearchCV on the training data
cat_gs_model = cat_clf.fit(X_train, y_train_encoded)

print("Best parameter (CV score=%0.5f):" % cat_gs_model.best_score_)
print(f"The best parameters for the GSCV CatBoost are: {cat_gs_model.best_params_}")


# %%
# Predict on the test data with the tuned CatBoost search (not the Logistic Regression one)
cat_gs__pred = cat_gs_model.predict(X_test)

# # Get the best hyperparameters
# best_params = search.best_params_

# print(best_params)
# Defining the best version of the model with the best parameters
best_gs_catboost_model = Pipeline(steps=[("preprocessor", preprocessor), 
                          ("model",CatBoostClassifier(random_state=27,
                              iterations=500,
                              depth=6,
                              eval_metric = 'Accuracy',
                              l2_leaf_reg=3,
                              learning_rate=0.1,
                              rsm = 0.8,
                              loss_function = 'MultiClass',
                              

                              ))])

# Fit the model to the training data
best_gs_catboost_model.fit(X_train, y_train_encoded)

# Predict on the test data
best_catboost_pred = best_gs_catboost_model.predict(X_test)

print(best_catboost_pred)


# %%
# Confusion Matrix
best_catboost_conf_mat = (pd.DataFrame(confusion_matrix(y_test_encoded, best_catboost_pred)).reset_index(drop=True))

# Visualizing the Confusion Matrix
f, ax = plt.subplots()
sns.heatmap(best_catboost_conf_mat, annot=True, linewidth=1.0, fmt=".0f", cmap="RdPu", ax=ax)

# %%
svc_model = SVC(random_state=27)
svc_pipeline = Pipeline(steps=[("preprocessor", preprocessor), ("model", svc_model)])

# Define the parameter distributions for GridSearchCV
param_distributions = {
    
    'model__break_ties': [True],                   
    'model__kernel': ['linear','rbf','poly'],         
    'model__max_iter': [-1],              
    'model__coef0': [0.0,0.2],                
    'model__probability': [True ],           
    'model__shrinking': [True,False], 
    'model__verbose': [True],    
    'model__tol' : [0.0001,0.1]
}

# Create the GridSearchCV object
svc_clf = GridSearchCV(svc_pipeline, param_grid=param_distributions, scoring = 'accuracy',error_score='raise')

# Fit the GridSearchCV on your training data
svc_gs_model = svc_clf.fit(X_train, y_train_encoded)

print("Best parameter (CV score=%0.5f):" % svc_gs_model.best_score_)
print(f"\nThe best parameters for the GSCV SVC are: {svc_gs_model.best_params_}")

# %%
# Defining the best version of the model with the best parameters
best_gs_svc_model = Pipeline(steps=[("preprocessor", preprocessor), 
                          ("model",SVC(random_state=27,
                              break_ties=True,
                              coef0 = 0,
                              kernel = 'linear',
                              probability = True,
                              max_iter =-1,
                              shrinking = True,                                                      
                              ))])

# Fit the model to the training data
best_gs_svc_model.fit(X_train, y_train_encoded)

# Predict on the test data
best_svc_pred = best_gs_svc_model.predict(X_test)

print(best_svc_pred)

# %%
# Confusion Matrix
best_svc_conf_mat = (pd.DataFrame(confusion_matrix(y_test_encoded, best_svc_pred)).reset_index(drop=True))

# Visualizing the Confusion Matrix
f, ax = plt.subplots()
sns.heatmap(best_svc_conf_mat, annot=True, linewidth=1.0, fmt=".0f", cmap="RdPu", ax=ax)

# %% [markdown]
# #### `Testing one of the models with the test data set`

# %%
test_data = pd.read_excel('data/Telco-churn-last-2000.xlsx')
test_data = cleaner(test_data)
best_svc_pred = best_gs_svc_model.predict(test_data)

print(best_svc_pred)


# %% [markdown]
# #### `Persist the model`

# %%


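# Save the fitted pipelines (preprocessing, SMOTE, feature selection and model) returned by evaluate_models
for name, fitted_pipeline in trained_models.items():
    joblib.dump(fitted_pipeline, f'models/{name}.joblib')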


# %%
# Persist the tuned pipelines themselves (not their prediction arrays) so they can be reloaded for inference
joblib.dump(best_gs_model, 'models/tuned/best_gs_model.joblib')
joblib.dump(best_search_model, 'models/tuned/best_search_model.joblib')
joblib.dump(best_gs_catboost_model, 'models/tuned/best_gs_catboost_model.joblib')
joblib.dump(best_gs_svc_model, 'models/tuned/best_gs_svc_model.joblib')


