![](https://media.tegna-media.com/assets/WNEP/images/c5383479-1ee1-43d8-84aa-680322a8a778/c5383479-1ee1-43d8-84aa-680322a8a778_1920x1080.jpg)

# Business Problem

- Develop a machine learning model that predicts whether a person has diabetes, given their characteristics.

# Dataset Story

- The dataset is part of a larger dataset held at the National Institute of Diabetes and Digestive and Kidney Diseases in the US.
  The data were collected for diabetes research on Pima Indian women aged 21 years and older living in Phoenix,
  Arizona, the fifth-largest city in the US.
  
- The target variable is "Outcome", where 1 indicates a positive diabetes test result and 0 indicates a negative result.

- **Pregnancies:** Number of pregnancies

- **Glucose:** 2-hour plasma glucose concentration in oral glucose tolerance test

- **BloodPressure:** Diastolic blood pressure (mm Hg)

- **SkinThickness:** Triceps skin fold thickness (mm)

- **Insulin:** 2-hour serum insulin (mu U/ml)

- **DiabetesPedigreeFunction:** A function that scores the likelihood of diabetes based on family history

- **BMI:** Body mass index

- **Age:** Age (years)

- **Outcome:** Have the disease (1) or not (0)

# Road Map

- 1.Import Required Libraries
- 2.Adjusting Row Column Settings
- 3.Loading the data Set
- 4.Exploratory Data Analysis
- 5.Capturing / Detecting Numeric and Categorical Variables
- 6.Analysis of Categorical Variables
- 7.Analysis of Numerical Variables
- 8.Analysis of Categorical Variables by Target
- 9.Analysis of Numeric Variables by Target
- 10.Examining the Logarithm of the Dependent Variable
- 11.Correlation Analysis
- 12.The Relationship Between Variables
- 13.Base Model Before Feature Engineering
    - 13.1.RandomForestClassifier
    - 13.2.Logistic Regression
    - 13.3.K-Nearest Neighbors (KNN)
    - 13.4.Support Vector Classifier (SVC)
    - 13.5.Decision Tree Classifier
    - 13.6.AdaBoost Classifier
    - 13.7.Gradient Boosting Classifier
    - 13.8.XGBoost Classifier
    - 13.9.LightGBM Classifier
    - 13.10.Comparison of Metrics for Different Models
    - 13.11.Visualization of the Decision Tree
    - 13.12.Plot Importance of Variables According to Base Model
- 14.Feature Engineering
- 15.Missing Value Analysis
- 16.Outlier Analysis
- 17.Feature Extraction
- 18.Encoding
- 19.Standardization Process
- 20.Model Building
    - 20.1.RandomForestClassifier
        - 20.1.1.Random Forest Classifier Hyperparameter Optimization
    - 20.2.Logistic Regression
        - 20.2.1.Logistic Regression Hyperparameter Optimization
    - 20.3.K-Nearest Neighbors (KNN)
        - 20.3.1.K-Nearest Neighbors (KNN) Hyperparameter Optimization
    - 20.4.Support Vector Classifier (SVC)
        - 20.4.1.Support Vector Classifier (SVC) Hyperparameter Optimization
    - 20.5.Decision Tree Classifier
        - 20.5.1.Decision Tree Classifier Hyperparameter Optimization
    - 20.6.AdaBoost Classifier
        - 20.6.1.AdaBoost Classifier Hyperparameter Optimization
    - 20.7.Gradient Boosting Classifier
        - 20.7.1.Gradient Boosting Classifier Hyperparameter Optimization
    - 20.8.XGBoost Classifier Hyperparameter Optimization
    - 20.9.LightGBM Classifier
        - 20.9.1.LightGBM Classifier Hyperparameter Optimization
    - 20.10.Comparison of Metrics for Different Models After Feature Engineering
    - 20.11.Comparison of Metrics for Different Models After Hyperparameter Optimization
    - 20.12.Comparison of Metrics Before and After Hyperparameter Optimization

# 1. Import Required Libraries

In [None]:
import itertools
import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import seaborn as sns
from lightgbm import LGBMClassifier
from sklearn import tree
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier, RandomForestClassifier
from sklearn.impute import KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import LabelEncoder, RobustScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

warnings.simplefilter(action="ignore")


# 2. Adjusting Row Column Settings

In [None]:
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_rows', 20)
pd.set_option('display.float_format', lambda x: '%.3f' % x)

# 3. Loading the data Set

In [None]:
df = pd.read_csv("/kaggle/input/diabetes-dataset/diabetes.csv")

# 4. Exploratory Data Analysis

In [None]:
def check_df(dataframe, head=5):
    print("##################### Shape #####################")
    print(dataframe.shape)
    print("##################### Types #####################")
    print(dataframe.dtypes)
    print("##################### Head #####################")
    print(dataframe.head(head))
    print("##################### Tail #####################")
    print(dataframe.tail(head))
    print("##################### NA #####################")
    print(dataframe.isnull().sum())
    print("##################### Quantiles #####################")
    print(dataframe.quantile([0, 0.05, 0.50, 0.95, 0.99, 1]).T)

check_df(df)

# 5. Capturing / Detecting Numeric and Categorical Variables

In [None]:
def grab_col_names(dataframe, cat_th=10, car_th=20):
    """

    Returns the names of categorical, numeric and categorical but cardinal variables in the data set.
    Note Categorical variables include categorical variables with numeric appearance.

    Parameters
    ------
        dataframe: dataframe
                Variable names of the dataframe to be taken
        cat_th: int, optional
                class threshold for numeric but categorical variables
        car_th: int, optional
                class threshold for categorical but cardinal variables

    Returns
    ------
        cat_cols: list
                Categorical variable list
        num_cols: list
                Numeric variable list
        cat_but_car: list
                List of cardinal variables with categorical appearance

    Examples
    ------
        import seaborn as sns
        df = sns.load_dataset("iris")
        print(grab_col_names(df))


    Notes
    ------
        cat_cols + num_cols + cat_but_car = total number of variables
        num_but_cat is already included in cat_cols; it is returned separately for reporting only.

    """

    # cat_cols, cat_but_car
    cat_cols = [col for col in dataframe.columns if dataframe[col].dtypes == "O"] 

    num_but_cat = [col for col in dataframe.columns if dataframe[col].nunique() < cat_th and
                   dataframe[col].dtypes != "O"]

    cat_but_car = [col for col in dataframe.columns if dataframe[col].nunique() > car_th and
                   dataframe[col].dtypes == "O"]

    cat_cols = cat_cols + num_but_cat

    cat_cols = [col for col in cat_cols if col not in cat_but_car] 

    num_cols = [col for col in dataframe.columns if dataframe[col].dtypes != "O"] 

    num_cols = [col for col in num_cols if col not in num_but_cat] 
    
    print(f"Observations: {dataframe.shape[0]}") 
    print(f"Variables: {dataframe.shape[1]}") 
    print(f'cat_cols: {len(cat_cols)}') 
    print(f'num_cols: {len(num_cols)}') 
    print(f'cat_but_car: {len(cat_but_car)}') 
    print(f'num_but_cat: {len(num_but_cat)}') 


    return cat_cols, num_cols, cat_but_car, num_but_cat

In [None]:
cat_cols, num_cols, cat_but_car,  num_but_cat = grab_col_names(df)

In [None]:
cat_cols

In [None]:
num_cols

In [None]:
cat_but_car

In [None]:
num_but_cat
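
The classification rule used by `grab_col_names` can be tried on a small frame. The following is a minimal sketch, not the notebook's function: the toy columns, the threshold of 3, and the variable names are assumptions made for illustration, and it tests for non-numeric columns with `pd.api.types.is_numeric_dtype` rather than comparing the dtype to `"O"`.

```python
import pandas as pd

# Hypothetical toy frame: one text column, one low-cardinality numeric
# column, and one genuinely numeric column.
toy = pd.DataFrame({
    "city": ["a", "b", "a", "c"],
    "flag": [0, 1, 0, 1],
    "height": [1.5, 1.6, 1.7, 1.8],
})

toy_cat_th = 3  # numeric columns with fewer unique values than this count as categorical

text_cols = [c for c in toy.columns if not pd.api.types.is_numeric_dtype(toy[c])]
toy_num_but_cat = [c for c in toy.columns
                   if pd.api.types.is_numeric_dtype(toy[c]) and toy[c].nunique() < toy_cat_th]
toy_num_cols = [c for c in toy.columns
                if pd.api.types.is_numeric_dtype(toy[c]) and c not in toy_num_but_cat]

print(text_cols, toy_num_but_cat, toy_num_cols)
```

In the diabetes data every column is numeric, so only the second rule applies there: `Outcome` has two unique values and ends up in `cat_cols`.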

# 6. Analysis of Categorical Variables

In [None]:
def cat_summary(dataframe, col_name, plot=False):
    print(pd.DataFrame({col_name: dataframe[col_name].value_counts(),
                        "Ratio": 100 * dataframe[col_name].value_counts() / len(dataframe)}))
    print("##########################################")
    if plot:
        sns.countplot(x=dataframe[col_name], data=dataframe)
        plt.show(block=True)

In [None]:
# We did it this way because there is only one categorical variable.

cat_summary(df, "Outcome", plot=True)

In [None]:
# If there were more than one categorical variable, we would loop through all categorical variables one by one as follows to run the function.

for col in cat_cols:
    cat_summary(df, col, plot=True)

# 7. Analysis of Numerical Variables

In [None]:
def num_summary(dataframe, numerical_col, plot=False):
    quantiles = [0.05, 0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90, 0.95, 0.99]
    print(dataframe[numerical_col].describe(quantiles).T)

    if plot:
        dataframe[numerical_col].hist(bins=20)
        plt.xlabel(numerical_col)
        plt.title(numerical_col)
        plt.show(block=True)

In [None]:
for col in num_cols:
    num_summary(df, col, plot=True)

# 8. Analysis of Categorical Variables by Target

In [None]:
def target_summary_with_cat(dataframe, target, categorical_col, plot=False):
    print(pd.DataFrame({'TARGET_MEAN': dataframe.groupby(categorical_col)[target].mean()}), end='\n\n\n')
    if plot:
        sns.barplot(x=categorical_col, y=target, data=dataframe)
        plt.show(block=True)

In [None]:
for col in cat_cols:
    target_summary_with_cat(df, "Outcome", col, plot=True)

# 9. Analysis of Numeric Variables by Target

In [None]:
def target_summary_with_num(dataframe, target, numerical_col, plot=False):
    print(pd.DataFrame({numerical_col+'_mean': dataframe.groupby(target)[numerical_col].mean()}), end='\n\n\n')
    if plot:
        sns.barplot(x=target, y=numerical_col, data=dataframe)
        plt.show(block=True)

In [None]:
for col in num_cols:
    target_summary_with_num(df, "Outcome", col, plot=True)

# 10. Examining the Logarithm of the Dependent Variable

In [None]:
np.log1p(df["Outcome"]).hist(bins=50)
plt.show(block=True)

# 11. Correlation Analysis

In [None]:
corr = df[num_cols].corr()

In [None]:
corr

In [None]:
# Correlation heatmap without using functions

sns.set(rc={"figure.figsize": (12, 12)})
corr_values = corr.round(2)
sns.heatmap(corr, cmap="RdBu", annot=corr_values)
plt.show(block=True)

In [None]:
# Creation of correlation heat map using the function

def high_correlated_cols(dataframe, plot=False, corr_th=0.70):
    corr = dataframe.corr()
    cor_matrix = corr.abs()
    # np.bool was removed in NumPy 1.24; the built-in bool works in all versions
    upper_triangle_matrix = cor_matrix.where(np.triu(np.ones(cor_matrix.shape), k=1).astype(bool))
    drop_list = [col for col in upper_triangle_matrix.columns if any(upper_triangle_matrix[col] > corr_th)]
    if plot:
        import matplotlib.pyplot as plt
        import seaborn as sns
        sns.set(rc={"figure.figsize": (12, 12)})
        corr_values = corr.round(2)
        sns.heatmap(corr, cmap="RdBu", annot=corr_values)
        plt.show(block=True)
    return drop_list

In [None]:
high_correlated_cols(df, plot=True)
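
The upper-triangle step inside `high_correlated_cols` is easier to see on a tiny example. This is a sketch under assumed toy data (the column names `a`, `b`, `c` and their values are hypothetical): `b` is an exact multiple of `a`, so it is the only column flagged.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "b" is exactly 2 * "a"; "c" is only weakly related.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
})

corr_abs = toy.corr().abs()

# Keep only entries strictly above the diagonal so each pair is checked once.
mask = np.triu(np.ones(corr_abs.shape, dtype=bool), k=1)
upper = corr_abs.where(mask)

threshold = 0.70
drop_candidates = [col for col in upper.columns if (upper[col] > threshold).any()]
print(drop_candidates)
```

Because only the upper triangle is inspected, the later column of a correlated pair is the one proposed for removal, and the earlier column is kept.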

# 12. The Relationship Between Variables

In [None]:
# Calculate the counts of each outcome
outcome_counts = df['Outcome'].value_counts()

# Calculate the total number of patients
total_patients = outcome_counts.sum()

# Calculate the percentages
percentages = outcome_counts / total_patients * 100

# Create labels with both quantity and percentage
labels = [f'0 - Non-Diabetic\n({outcome_counts[0]} / {percentages[0]:.1f}%)',
          f'1 - Diabetic\n({outcome_counts[1]} / {percentages[1]:.1f}%)']

# Plot the pie chart with labels and percentages
plt.figure(figsize=(8, 6))
plt.pie(outcome_counts, labels=labels, autopct='%1.1f%%', colors=['purple', 'lightgray'])
plt.title('Distribution of the Outcome Variable')
plt.show()


In [None]:
sns.pairplot(data=df, vars=['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age'], hue='Outcome', height=5)
plt.show(block=True)

In [None]:
# Create all pairwise combinations of the numeric features
feature_combinations = list(itertools.combinations(['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age'], 2))

# Create a separate bubble chart for each pair of features (bubble size = BMI)
for i, (feature1, feature2) in enumerate(feature_combinations):
    fig = px.scatter(df, x=feature1, y=feature2, color='Outcome', size='BMI',
                     title=f'{feature1} vs {feature2} Bubble Chart')

    # block=True is a matplotlib option; plotly figures do not need it
    fig.show()

# 13. Base Model Before Feature Engineering

In [None]:
# Creating the Dependent Variable.

y = df["Outcome"]

# Creating Independent Variables.

X = df.drop("Outcome", axis=1)

# Splitting the Data into Training and Test Sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=17)

# 13.1.RandomForestClassifier

In [None]:
# Random Forest Classifier Model Training

rf_model = RandomForestClassifier(random_state=46).fit(X_train, y_train)

# Prediction using Random Forest Classifier Model

y_pred = rf_model.predict(X_test)

print("RandomForestClassifier:")
# scikit-learn metrics expect (y_true, y_pred); swapping them exchanges recall and precision
print(f"Accuracy: {round(accuracy_score(y_test, y_pred), 4)}")
print(f"Recall: {round(recall_score(y_test, y_pred), 4)}")
print(f"Precision: {round(precision_score(y_test, y_pred), 4)}")
print(f"F1: {round(f1_score(y_test, y_pred), 4)}")
print(f"AUC: {round(roc_auc_score(y_test, y_pred), 4)}")

# 13.2.Logistic Regression

In [None]:
lr_model = LogisticRegression()
lr_model.fit(X_train, y_train)
lr_pred = lr_model.predict(X_test)

print("Logistic Regression:")
# scikit-learn metrics expect (y_true, y_pred)
print(f"Accuracy: {round(accuracy_score(y_test, lr_pred), 4)}")
print(f"Recall: {round(recall_score(y_test, lr_pred), 4)}")
print(f"Precision: {round(precision_score(y_test, lr_pred), 4)}")
print(f"F1: {round(f1_score(y_test, lr_pred), 4)}")
print(f"AUC: {round(roc_auc_score(y_test, lr_pred), 4)}")


# 13.3.K-Nearest Neighbors (KNN)

In [None]:
knn_model = KNeighborsClassifier()
knn_model.fit(X_train, y_train)
knn_pred = knn_model.predict(X_test)

print("K-Nearest Neighbors (KNN):")
# scikit-learn metrics expect (y_true, y_pred)
print(f"Accuracy: {round(accuracy_score(y_test, knn_pred), 4)}")
print(f"Recall: {round(recall_score(y_test, knn_pred), 4)}")
print(f"Precision: {round(precision_score(y_test, knn_pred), 4)}")
print(f"F1: {round(f1_score(y_test, knn_pred), 4)}")
print(f"AUC: {round(roc_auc_score(y_test, knn_pred), 4)}")


# 13.4.Support Vector Classifier (SVC)

In [None]:
svc_model = SVC()
svc_model.fit(X_train, y_train)
svc_pred = svc_model.predict(X_test)

print("Support Vector Classifier (SVC):")
# scikit-learn metrics expect (y_true, y_pred)
print(f"Accuracy: {round(accuracy_score(y_test, svc_pred), 4)}")
print(f"Recall: {round(recall_score(y_test, svc_pred), 4)}")
print(f"Precision: {round(precision_score(y_test, svc_pred), 4)}")
print(f"F1: {round(f1_score(y_test, svc_pred), 4)}")
print(f"AUC: {round(roc_auc_score(y_test, svc_pred), 4)}")


# 13.5.Decision Tree Classifier

In [None]:
dt_model = DecisionTreeClassifier()
dt_model.fit(X_train, y_train)
dt_pred = dt_model.predict(X_test)

print("Decision Tree Classifier:")
# scikit-learn metrics expect (y_true, y_pred)
print(f"Accuracy: {round(accuracy_score(y_test, dt_pred), 4)}")
print(f"Recall: {round(recall_score(y_test, dt_pred), 4)}")
print(f"Precision: {round(precision_score(y_test, dt_pred), 4)}")
print(f"F1: {round(f1_score(y_test, dt_pred), 4)}")
print(f"AUC: {round(roc_auc_score(y_test, dt_pred), 4)}")


# 13.6.AdaBoost Classifier

In [None]:
ada_model = AdaBoostClassifier()
ada_model.fit(X_train, y_train)
ada_pred = ada_model.predict(X_test)

print("AdaBoost Classifier:")
# scikit-learn metrics expect (y_true, y_pred)
print(f"Accuracy: {round(accuracy_score(y_test, ada_pred), 4)}")
print(f"Recall: {round(recall_score(y_test, ada_pred), 4)}")
print(f"Precision: {round(precision_score(y_test, ada_pred), 4)}")
print(f"F1: {round(f1_score(y_test, ada_pred), 4)}")
print(f"AUC: {round(roc_auc_score(y_test, ada_pred), 4)}")


# 13.7.Gradient Boosting Classifier

In [None]:
gb_model = GradientBoostingClassifier()
gb_model.fit(X_train, y_train)
gb_pred = gb_model.predict(X_test)

print("Gradient Boosting Classifier:")
# scikit-learn metrics expect (y_true, y_pred)
print(f"Accuracy: {round(accuracy_score(y_test, gb_pred), 4)}")
print(f"Recall: {round(recall_score(y_test, gb_pred), 4)}")
print(f"Precision: {round(precision_score(y_test, gb_pred), 4)}")
print(f"F1: {round(f1_score(y_test, gb_pred), 4)}")
print(f"AUC: {round(roc_auc_score(y_test, gb_pred), 4)}")


# 13.8.XGBoost Classifier

In [None]:
xgb_model = XGBClassifier()
xgb_model.fit(X_train, y_train)
xgb_pred = xgb_model.predict(X_test)

print("XGBoost Classifier:")
# scikit-learn metrics expect (y_true, y_pred)
print(f"Accuracy: {round(accuracy_score(y_test, xgb_pred), 4)}")
print(f"Recall: {round(recall_score(y_test, xgb_pred), 4)}")
print(f"Precision: {round(precision_score(y_test, xgb_pred), 4)}")
print(f"F1: {round(f1_score(y_test, xgb_pred), 4)}")
print(f"AUC: {round(roc_auc_score(y_test, xgb_pred), 4)}")


# 13.9.LightGBM Classifier

In [None]:
lgbm_model = LGBMClassifier()
lgbm_model.fit(X_train, y_train)
lgbm_pred = lgbm_model.predict(X_test)

print("LightGBM Classifier:")
# scikit-learn metrics expect (y_true, y_pred)
print(f"Accuracy: {round(accuracy_score(y_test, lgbm_pred), 4)}")
print(f"Recall: {round(recall_score(y_test, lgbm_pred), 4)}")
print(f"Precision: {round(precision_score(y_test, lgbm_pred), 4)}")
print(f"F1: {round(f1_score(y_test, lgbm_pred), 4)}")
print(f"AUC: {round(roc_auc_score(y_test, lgbm_pred), 4)}")


# 13.10.Comparison of Metrics for Different Models

In [None]:
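# Dictionary containing the metric results, hard-coded from the outputs of sections 13.1-13.9;
# re-run those cells and update the numbers if the printed scores differ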
metrics = {
    "Model": ["Random Forest", "Logistic Regression", "KNN", "SVC", "Decision Tree", "AdaBoost", "Gradient Boosting", "XGBoost", "LightGBM"],
    "Accuracy": [0.7706, 0.7879, 0.7619, 0.7446, 0.7186, 0.7532, 0.7706, 0.7706, 0.7619],
    "Recall": [0.7059, 0.7667, 0.6711, 0.6833, 0.6053, 0.6765, 0.7, 0.7059, 0.6857],
    "Precision": [0.5926, 0.5679, 0.6296, 0.5062, 0.5679, 0.5679, 0.6049, 0.5926, 0.5926],
    "F1": [0.6443, 0.6525, 0.6497, 0.5816, 0.586, 0.6174, 0.649, 0.6443, 0.6358],
    "AUC": [0.7517, 0.781, 0.7388, 0.7247, 0.6897, 0.7309, 0.7506, 0.7517, 0.7404]
}

# Creating a DataFrame from the metrics dictionary
results_df = pd.DataFrame(metrics)

# Sorting the DataFrame by accuracy in descending order
results_df = results_df.sort_values(by="Accuracy", ascending=False)

# Creating the figure for the graph
fig = go.Figure()

# Colors for the metrics
colors = ["purple", "green", "blue", "orange", "red"]

# Adding traces for each metric in the specified order
for metric, color in zip(["Accuracy", "Recall", "Precision", "F1", "AUC"], colors):
    fig.add_trace(go.Bar(
        x=results_df["Model"],
        y=results_df[metric],
        marker_color=color,
        name=metric,
        text=results_df[metric],
        textposition='auto'
    ))

# Setting axis labels and title
fig.update_layout(
    xaxis_title="Model",
    yaxis_title="Metric Score",
    title="Comparison of Metrics for Different Models"
)

# Displaying the graph
fig.show()




**Conclusion**

**Accuracy:** Accuracy represents the overall correctness rate of the model's predictions. It indicates the proportion of correctly classified cases out of the total data.

**Recall:** Recall measures the proportion of actual diabetic cases that are correctly identified by the model. It shows how well the model captures the true positive cases of diabetes.

**Precision:** Precision calculates the proportion of predicted diabetic cases that are actually true positive cases. It indicates the accuracy of the model's positive predictions for diabetes.

**F1 Score:** F1 score is the harmonic mean of recall and precision. It provides a balanced measure by considering both recall and precision equally, evaluating the overall performance of the model.

**AUC (Area Under the Curve):** AUC represents the area under the Receiver Operating Characteristic (ROC) curve. It reflects the model's ability to distinguish between classes and provides an overall measure of performance.
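
The first four definitions above can be checked by hand against a confusion matrix. The sketch below uses made-up labels (`y_true` and `y_hat` are hypothetical, not model output from this notebook) and compares the manual formulas with scikit-learn's functions. It also shows that passing the arguments in the wrong order exchanges recall and precision.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Hypothetical labels, chosen only to make the arithmetic easy to follow.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_hat = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)        # share of actual positives that were found
precision = tp / (tp + fp)     # share of predicted positives that are correct
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, accuracy_score(y_true, y_hat))
print(recall, recall_score(y_true, y_hat))
print(precision, precision_score(y_true, y_hat))
print(f1, f1_score(y_true, y_hat))

# With the arguments swapped, "recall" is really the precision.
swapped_recall = recall_score(y_hat, y_true)
print(swapped_recall, precision)
```

Accuracy and F1 are symmetric in their two arguments, which is why a swapped call can go unnoticed if only those two scores are inspected.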

# 13.11.Visualization of the Decision Tree

In [None]:
# Setting the features and target variable
x = df[['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']]
# A 1-D target avoids the column-vector DataConversionWarning that reshape(-1, 1) triggers in fit()
y = df["Outcome"]

# Splitting the data into training and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=1)

# Defining and training the Decision Tree Classifier
clf = DecisionTreeClassifier(max_depth=3)
clf = clf.fit(x_train, y_train)

# Making predictions on the test set
y_pred = clf.predict(x_test)

# Generating the text representation of the decision tree and printing it
text_representation = tree.export_text(clf)
print(text_representation)

# Setting the feature and target class names
feature_names = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']
target_names = ['0', '1']

# Generating and saving the visualization of the decision tree
fig = plt.figure(figsize=(25, 20))
plot = tree.plot_tree(clf, feature_names=feature_names, class_names=target_names, filled=True)
fig.savefig('tree1.png')

# 13.12.Plot Importance of Variables According to Base Model

In [None]:
def plot_importance(model, features, num=None, save=False):
    # num=None plots every feature; pass an int to keep only the top `num` features
    feature_imp = pd.DataFrame({'Value': model.feature_importances_, 'Feature': features.columns})
    plt.figure(figsize=(10, 10))
    sns.set(font_scale=1)
    sns.barplot(x="Value", y="Feature", data=feature_imp.sort_values(by="Value", ascending=False)[0:num])
    plt.title(f'Feature Importance - {model.__class__.__name__}')
    plt.tight_layout()
    if save:
        # Save before plt.show(); saving afterwards writes an empty figure
        plt.savefig('importances.png')
    plt.show(block=True)


In [None]:
model_name = [rf_model, dt_model, xgb_model, lgbm_model]

In [None]:
for i in model_name:
    plot_importance(i, X)

# 14. Feature Engineering

**In this section, we will perform the following feature engineering operations.**

- Missing Value Detection
- Outlier Detection
- Feature Extraction

# 15. Missing Value Analysis

In [None]:
# Detecting variables whose missing observations were recorded as zero.

zero_colunms = [col for col in df.columns if (df[col].min() == 0 and col not in  ["Pregnancies", "Outcome"])]

In [None]:
zero_colunms

In [None]:
df.isnull().sum()

In [None]:
# Replacing the zeros that stand for missing observations with NaN.

for col in zero_colunms:
    df[col] = np.where(df[col] == 0, np.nan, df[col])

In [None]:
df.isnull().sum()
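
The zero-to-NaN step can be illustrated on a toy frame. This is a minimal sketch with made-up values: the three columns are borrowed from the dataset, but the numbers and the name `zero_as_missing` are assumptions. `Pregnancies` is left alone because a zero there is a valid count.

```python
import pandas as pd

# Hypothetical values; a zero Glucose or Insulin reading is physiologically
# implausible, so it is treated as "not measured".
toy = pd.DataFrame({
    "Pregnancies": [0, 2, 1],
    "Glucose": [148, 0, 85],
    "Insulin": [0, 0, 94],
})

zero_as_missing = [c for c in toy.columns if c != "Pregnancies" and toy[c].min() == 0]

for c in zero_as_missing:
    # mask() replaces the matching entries with NaN; the column becomes float
    toy[c] = toy[c].mask(toy[c] == 0)

missing_counts = toy.isnull().sum()
print(missing_counts)
```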

In [None]:
def missing_values_table(dataframe, na_name=False, plot=False):
    na_columns = [col for col in dataframe.columns if dataframe[col].isnull().sum() > 0]
    n_miss = dataframe[na_columns].isnull().sum().sort_values(ascending=False)
    ratio = (dataframe[na_columns].isnull().sum() / dataframe.shape[0] * 100).sort_values(ascending=False)
    missing_df = pd.concat([n_miss, np.round(ratio, 2)], axis=1, keys=['n_miss', 'ratio'])
    print(missing_df, end="\n")
    
    if plot:
        # Plotting the missing values
        plt.figure(figsize=(10, 8))
        bars = plt.bar(missing_df.index, missing_df['ratio'], color='purple')
        plt.xlabel('Features')
        plt.ylabel('Percentage of Missing Values')
        plt.title('Missing Values by Feature')
        
        for bar in bars:
            yval = bar.get_height()
            plt.text(bar.get_x() + bar.get_width() / 2, yval, f'{yval:.2f}%', ha='center', va='bottom')
        
        plt.xticks(rotation=90)
        plt.tight_layout()
        plt.show()
    
    if na_name:
        return na_columns


In [None]:
na_columns = missing_values_table(df, na_name=True, plot=True)

In [None]:
def missing_vs_target(dataframe, target, na_columns, plot=False):
    temp_df = dataframe.copy()
    for col in na_columns:
        temp_df[col + '_NA_FLAG'] = np.where(temp_df[col].isnull(), 1, 0)
    na_flags = temp_df.loc[:, temp_df.columns.str.contains("_NA_")].columns
    for col in na_flags:
        print(pd.DataFrame({"TARGET_MEAN": temp_df.groupby(col)[target].mean(),
                            "Count": temp_df.groupby(col)[target].count()}), end="\n\n\n")
        if plot:
            # Plotting the target mean by NA flag
            plt.figure(figsize=(6, 4))
            temp_df.groupby(col)[target].mean().plot(kind='bar', color='purple')
            plt.xlabel(col)
            plt.ylabel('Target Mean')
            plt.title(f'Target Mean by {col}')
            plt.xticks(rotation=0)
            plt.tight_layout()
            plt.show()
            print("######################################################################")



In [None]:
missing_vs_target(df, "Outcome", na_columns, plot=True)

In [None]:
"""
# Option 1
# Filling the missing observations filled with NaN in the data set with the median value of that column.

for col in zero_colunms:
    df.loc[df[col].isnull(), col] = df[col].median()

df.isnull().sum()

df.head(10)
"""

In [None]:
# Option 2
# Filling the missing values with KNNImputer

dff = df[na_columns]

In [None]:
rs = RobustScaler()

In [None]:
dff = pd.DataFrame(rs.fit_transform(dff), columns=dff.columns)

In [None]:
dff.head()

In [None]:
dff = pd.DataFrame(KNNImputer(n_neighbors=5).fit_transform(dff), columns = dff.columns)

In [None]:
dff.head()

In [None]:
dff = pd.DataFrame(rs.inverse_transform(dff), columns=dff.columns)

In [None]:
df[na_columns] = dff

In [None]:
df.head(10)

In [None]:
df.isnull().sum()
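
The scale, impute, inverse-scale sequence above can be sketched on a toy frame. The values and `n_neighbors=2` are assumptions chosen for illustration. Scaling first keeps a column with a large range from dominating the neighbour distances, and the inverse transform returns the result to the original units.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import RobustScaler

# Hypothetical values with one gap per column.
toy = pd.DataFrame({
    "Glucose": [90.0, 100.0, np.nan, 150.0, 160.0],
    "BMI": [22.0, 24.0, 23.0, 35.0, np.nan],
})

scaler = RobustScaler()
scaled = scaler.fit_transform(toy)  # NaNs are ignored in fit and kept in the output
imputed = KNNImputer(n_neighbors=2).fit_transform(scaled)
restored = pd.DataFrame(scaler.inverse_transform(imputed), columns=toy.columns)

print(restored)
```

Observed values pass through unchanged; only the gaps are filled, each with the average of that column over the nearest rows.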

# 16. Outlier Analysis

In [None]:
def outlier_thresholds(dataframe, col_name, q1=0.05, q3=0.95):
    quartile1 = dataframe[col_name].quantile(q1)
    quartile3 = dataframe[col_name].quantile(q3)
    interquantile_range = quartile3 - quartile1
    up_limit = quartile3 + 1.5 * interquantile_range
    low_limit = quartile1 - 1.5 * interquantile_range
    return low_limit, up_limit
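
A worked example shows how wide the limits become with the 5th and 95th percentiles that `outlier_thresholds` uses by default. This sketch uses hypothetical data (the integers 1 to 100) and repeats the same arithmetic with the conventional quartiles for comparison.

```python
import pandas as pd

values = pd.Series(range(1, 101), dtype="float64")  # hypothetical data: 1, 2, ..., 100

# Percentile-based limits, as in the function's defaults (q1=0.05, q3=0.95)
q_low = values.quantile(0.05)
q_high = values.quantile(0.95)
spread = q_high - q_low
low_limit = q_low - 1.5 * spread
up_limit = q_high + 1.5 * spread

# The same rule with the conventional quartiles (0.25 and 0.75)
iqr = values.quantile(0.75) - values.quantile(0.25)
low_limit_iqr = values.quantile(0.25) - 1.5 * iqr
up_limit_iqr = values.quantile(0.75) + 1.5 * iqr

print(low_limit, up_limit)          # about -127.7 and 228.7
print(low_limit_iqr, up_limit_iqr)  # about -48.5 and 149.5
```

The wider percentile-based band flags only very extreme observations, which suits a small dataset where trimming many rows would discard information.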

In [None]:
def check_outlier(dataframe, col_name):
    low_limit, up_limit = outlier_thresholds(dataframe, col_name)
    if dataframe[(dataframe[col_name] > up_limit) | (dataframe[col_name] < low_limit)].any(axis=None):
        return True
    else:
        return False

In [None]:
def check_outlier(dataframe, col_name, plot=False):
    low_limit, up_limit = outlier_thresholds(dataframe, col_name)
    outliers = dataframe[(dataframe[col_name] > up_limit) | (dataframe[col_name] < low_limit)]
    if outliers.any(axis=None):
        if plot:
            plt.figure(figsize=(8, 6))
            sns.boxplot(x=dataframe[col_name])
            plt.title(f'Outliers in {col_name}')
            plt.show()
        return True
    else:
        return False


In [None]:
def replace_with_thresholds(dataframe, variable, q1=0.05, q3=0.95):
    # Pass the caller's q1/q3 through instead of hard-coding the defaults
    low_limit, up_limit = outlier_thresholds(dataframe, variable, q1=q1, q3=q3)
    dataframe.loc[(dataframe[variable] < low_limit), variable] = low_limit
    dataframe.loc[(dataframe[variable] > up_limit), variable] = up_limit

In [None]:
for col in df.columns:
    # Check once per column so each boxplot is drawn only once
    has_outlier = check_outlier(df, col, plot=True)
    print(col, has_outlier)
    if has_outlier:
        replace_with_thresholds(df, col)

In [None]:
for col in df.columns:
    print(col, check_outlier(df, col))

# 17. Feature Extraction

In [None]:
# Creating a new age variable by categorizing the age variable

df.loc[(df["Age"] >= 21) & (df["Age"] < 50), "NEW_AGE_CAT"] = "mature"
df.loc[(df["Age"] >= 50), "NEW_AGE_CAT"] = "senior"

In [None]:
# BMI up to 18.5 is underweight, 18.5 to 24.9 is healthy, 24.9 to 29.9 is overweight, and above 29.9 is obese

df['NEW_BMI'] = pd.cut(x=df['BMI'], bins=[0, 18.5, 24.9, 29.9, 100],labels=["Underweight", "Healthy", "Overweight", "Obese"])
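
It helps to see which side of each bin edge `pd.cut` includes. By default the intervals are closed on the right, so a value equal to an edge falls into the lower bin. The sketch below uses a few hypothetical BMI values placed on or near the edges; the bins and labels are the ones from the cell above.

```python
import pandas as pd

# Hypothetical BMI values sitting on or next to the bin edges.
bmi = pd.Series([18.5, 24.9, 25.0, 29.9, 30.0])

bmi_labels = pd.cut(bmi, bins=[0, 18.5, 24.9, 29.9, 100],
                    labels=["Underweight", "Healthy", "Overweight", "Obese"])

print(bmi_labels.tolist())
```

A BMI of exactly 18.5 is labelled "Underweight" here. If the lower bin should exclude its upper edge, `right=False` makes the intervals closed on the left instead.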

In [None]:
# Converting Glucose Value to Categorical Variable

df["NEW_GLUCOSE"] = pd.cut(x=df["Glucose"], bins=[0, 140, 200, 300], labels=["Normal", "Prediabetes", "Diabetes"])

In [None]:
# Creating a categorical variable from age and body mass index together

df.loc[(df["BMI"] < 18.5) & ((df["Age"] >= 21) & (df["Age"] < 50)), "NEW_AGE_BMI_NOM"] = "underweightmature"
df.loc[(df["BMI"] < 18.5) & (df["Age"] >= 50), "NEW_AGE_BMI_NOM"] = "underweightsenior"
df.loc[((df["BMI"] >= 18.5) & (df["BMI"] < 25)) & ((df["Age"] >= 21) & (df["Age"] < 50)), "NEW_AGE_BMI_NOM"] = "healthymature"
df.loc[((df["BMI"] >= 18.5) & (df["BMI"] < 25)) & (df["Age"] >= 50), "NEW_AGE_BMI_NOM"] = "healthysenior"
df.loc[((df["BMI"] >= 25) & (df["BMI"] < 30)) & ((df["Age"] >= 21) & (df["Age"] < 50)), "NEW_AGE_BMI_NOM"] = "overweightmature"
df.loc[((df["BMI"] >= 25) & (df["BMI"] < 30)) & (df["Age"] >= 50), "NEW_AGE_BMI_NOM"] = "overweightsenior"
# Obese means BMI >= 30; a "> 18.5" condition here would overwrite the healthy and overweight labels
df.loc[(df["BMI"] >= 30) & ((df["Age"] >= 21) & (df["Age"] < 50)), "NEW_AGE_BMI_NOM"] = "obesemature"
df.loc[(df["BMI"] >= 30) & (df["Age"] >= 50), "NEW_AGE_BMI_NOM"] = "obesesenior"

In [None]:
# Creating a categorical variable by considering age and glucose values together

df.loc[(df["Glucose"] < 70) & ((df["Age"] >= 21) & (df["Age"] < 50)), "NEW_AGE_GLUCOSE_NOM"] = "lowmature"
df.loc[(df["Glucose"] < 70) & (df["Age"] >= 50), "NEW_AGE_GLUCOSE_NOM"] = "lowsenior"
df.loc[((df["Glucose"] >= 70) & (df["Glucose"] < 100)) & ((df["Age"] >= 21) & (df["Age"] < 50)), "NEW_AGE_GLUCOSE_NOM"] = "normalmature"
df.loc[((df["Glucose"] >= 70) & (df["Glucose"] < 100)) & (df["Age"] >= 50), "NEW_AGE_GLUCOSE_NOM"] = "normalsenior"
df.loc[((df["Glucose"] >= 100) & (df["Glucose"] <= 125)) & ((df["Age"] >= 21) & (df["Age"] < 50)), "NEW_AGE_GLUCOSE_NOM"] = "hiddenmature"
df.loc[((df["Glucose"] >= 100) & (df["Glucose"] <= 125)) & (df["Age"] >= 50), "NEW_AGE_GLUCOSE_NOM"] = "hiddensenior"
df.loc[(df["Glucose"] > 125) & ((df["Age"] >= 21) & (df["Age"] < 50)), "NEW_AGE_GLUCOSE_NOM"] = "highmature"
df.loc[(df["Glucose"] > 125) & (df["Age"] >= 50), "NEW_AGE_GLUCOSE_NOM"] = "highsenior"

In [None]:
# Deriving a Categorical variable with Insulin Value.

def set_insulin(dataframe, col_name="Insulin"):
    # Called row by row through df.apply(axis=1), so `dataframe` is a single row here
    if 16 <= dataframe[col_name] <= 166:
        return "Normal"
    else:
        return "Abnormal"

df["NEW_INSULIN_SCORE"] = df.apply(set_insulin, axis=1)
df["NEW_GLUCOSE * INSULIN"] = df["Glucose"] * df["Insulin"]


In [None]:
# Note: Pregnancies can be 0, so this product is 0 for those rows.

df["NEW_GLUCOSE * PREGNANCIES"] = df["Glucose"] * df["Pregnancies"]

In [None]:
# Converting column names to uppercase.

df.columns = [col.upper() for col in df.columns]

In [None]:
df.head()

In [None]:
# Analysis of Variables.

def grab_col_names(dataframe, cat_th=10, car_th=20):
    """

    Returns the names of categorical, numeric and categorical but cardinal variables in the data set.
    Note Categorical variables include categorical variables with numeric appearance.

    Parameters
    ------
        dataframe: dataframe
                Variable names of the dataframe to be taken
        cat_th: int, optional
                class threshold for numeric but categorical variables
        car_th: int, optional
                class threshold for categorical but cardinal variables

    Returns
    ------
        cat_cols: list
                Categorical variable list
        num_cols: list
                Numeric variable list
        cat_but_car: list
                List of cardinal variables with categorical appearance

    Examples
    ------
        import seaborn as sns
        df = sns.load_dataset("iris")
        print(grab_col_names(df))


    Notes
    ------
        cat_cols + num_cols + cat_but_car = total number of variables
        num_but_cat is already included in cat_cols; it is returned separately for reporting only.

    """

    # cat_cols, cat_but_car
    cat_cols = [col for col in dataframe.columns if dataframe[col].dtypes == "O"] 

    num_but_cat = [col for col in dataframe.columns if dataframe[col].nunique() < cat_th and
                   dataframe[col].dtypes != "O"]

    cat_but_car = [col for col in dataframe.columns if dataframe[col].nunique() > car_th and
                   dataframe[col].dtypes == "O"]

    cat_cols = cat_cols + num_but_cat

    cat_cols = [col for col in cat_cols if col not in cat_but_car] 

    num_cols = [col for col in dataframe.columns if dataframe[col].dtypes != "O"] 

    num_cols = [col for col in num_cols if col not in num_but_cat] 
    
    print(f"Observations: {dataframe.shape[0]}")  # number of observations in the dataframe
    print(f"Variables: {dataframe.shape[1]}")  # number of variables in the dataframe
    print(f'cat_cols: {len(cat_cols)}')  # number of categorical variables
    print(f'num_cols: {len(num_cols)}')  # number of numeric variables
    print(f'cat_but_car: {len(cat_but_car)}')  # number of cardinal variables
    print(f'num_but_cat: {len(num_but_cat)}')  # number of numeric-looking variables that are categorical


    return cat_cols, num_cols, cat_but_car, num_but_cat

In [None]:
cat_cols, num_cols, cat_but_car, num_but_cat = grab_col_names(df)

In [None]:
cat_cols

In [None]:
num_cols

In [None]:
cat_but_car

In [None]:
num_but_cat
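
To make the thresholds concrete, here is a minimal sketch on a made-up three-column frame. The `toy` data, the tiny `toy_cat_th = 3` value, and the use of `is_numeric_dtype` in place of the `dtypes == "O"` comparison are assumptions for illustration only; they are not part of the notebook's pipeline.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical toy data: one numeric column, one binary target, one text category.
toy = pd.DataFrame({
    "AGE": [21, 35, 48, 52, 60, 33],
    "OUTCOME": [0, 1, 0, 1, 1, 0],
    "NEW_BMI_CAT": ["Healthy", "Obese", "Overweight", "Obese", "Healthy", "Obese"],
})

toy_cat_th = 3  # assumed threshold, lowered because the toy frame has only 6 rows

# Text columns are categorical by type.
toy_text_cols = [c for c in toy.columns if not is_numeric_dtype(toy[c])]
# Numeric columns with fewer than toy_cat_th unique values are treated as categorical.
toy_num_but_cat = [c for c in toy.columns
                   if is_numeric_dtype(toy[c]) and toy[c].nunique() < toy_cat_th]
# Every other numeric column stays numeric.
toy_num_cols = [c for c in toy.columns
                if is_numeric_dtype(toy[c]) and c not in toy_num_but_cat]
toy_cat_cols = toy_text_cols + toy_num_but_cat

print(toy_cat_cols)
print(toy_num_cols)
```

`OUTCOME` has only two unique values, so it lands in the categorical list even though it is stored as integers. This is why the target has to be excluded by hand before one-hot encoding.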

# 18. Encoding

In [None]:
# Label Encoding

def label_encoder(dataframe, binary_col):
    labelencoder = LabelEncoder()
    dataframe[binary_col] = labelencoder.fit_transform(dataframe[binary_col])
    return dataframe

In [None]:
binary_cols = [col for col in df.columns if df[col].dtypes == "O" and df[col].nunique() == 2]

In [None]:
binary_cols

In [None]:
for col in binary_cols:
    df = label_encoder(df, col)

In [None]:
df.head(10)
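
As a small illustration of what label encoding does to a two-class text column, the sketch below runs `LabelEncoder` on a hypothetical frame. The column values are assumed for the example; the point is only that classes are sorted alphabetically before being mapped to 0 and 1.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical binary column; the labels are assumed for illustration.
toy = pd.DataFrame({"NEW_INSULIN_SCORE": ["Normal", "Abnormal", "Normal", "Abnormal"]})

encoder = LabelEncoder()
toy["NEW_INSULIN_SCORE_ENC"] = encoder.fit_transform(toy["NEW_INSULIN_SCORE"])

# classes_ holds the sorted labels; a label's position is its encoded value.
print(list(encoder.classes_))
print(toy["NEW_INSULIN_SCORE_ENC"].tolist())
```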

In [None]:
# One-Hot Encoding

# Updating the cat_cols list

In [None]:
cat_cols = [col for col in cat_cols if col not in binary_cols and col not in ["OUTCOME"]]

In [None]:
cat_cols

In [None]:
def one_hot_encoder(dataframe, categorical_cols, drop_first=False):
    dataframe = pd.get_dummies(dataframe, columns=categorical_cols, drop_first=drop_first)
    return dataframe

In [None]:
df = one_hot_encoder(df, cat_cols, drop_first=True)

In [None]:
df.head(10)
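
The effect of `drop_first=True` is easiest to see on a tiny example. The frame below is hypothetical: with three categories, only two dummy columns are produced, and the first category in sorted order becomes the implicit baseline.

```python
import pandas as pd

# Hypothetical frame with one three-level categorical column.
toy = pd.DataFrame({
    "AGE": [25, 40, 58],
    "NEW_BMI_CAT": ["Healthy", "Obese", "Overweight"],
})

encoded = pd.get_dummies(toy, columns=["NEW_BMI_CAT"], drop_first=True)

# "Healthy" has no column of its own; a row with both dummies off means "Healthy".
print(list(encoded.columns))
```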

# 19. Standardization Process

In [None]:
num_cols

In [None]:
scaler = RobustScaler()

In [None]:
df[num_cols] = scaler.fit_transform(df[num_cols])

In [None]:
df.head(10)

In [None]:
df.shape
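
One point to keep in mind: fitting the scaler on the full frame lets statistics from rows that later end up in the test set influence the training features. A hedged alternative is to split first and fit `RobustScaler` on the training part only. The sketch below uses synthetic data from `make_classification` as a stand-in; all `_demo` names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler

# Synthetic stand-in data, not the diabetes set.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=17)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.30, random_state=17)

scaler_demo = RobustScaler()
X_tr_scaled = scaler_demo.fit_transform(X_tr)  # medians and IQRs learned from training rows only
X_te_scaled = scaler_demo.transform(X_te)      # test rows reuse the training statistics

# After robust scaling, each training column has a median of (numerically) zero.
print(np.median(X_tr_scaled, axis=0).round(6))
```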

# 20. Model Building

In [None]:
# Creating the Dependent Variable.

y = df["OUTCOME"]

# Creating Independent Variables.

X = df.drop("OUTCOME", axis=1)

# Splitting the Data into Training and Test Sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=17)
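
With an imbalanced target, a plain random split can leave the test set with a different class ratio than the full data. A possible refinement, shown here on made-up labels rather than the notebook's `y`, is to pass `stratify` to `train_test_split`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 rows, 35% positive class.
rng = np.random.default_rng(17)
X_demo = rng.normal(size=(100, 3))
y_demo = np.array([1] * 35 + [0] * 65)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=17, stratify=y_demo
)

# The positive rate is kept close to 0.35 in both parts.
print(round(y_demo.mean(), 3), round(y_tr.mean(), 3), round(y_te.mean(), 3))
```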

# 20.1.RandomForestClassifier

In [None]:
# Random Forest Classifier Model Training

rf_model = RandomForestClassifier(random_state=46).fit(X_train, y_train)

# Prediction using Random Forest Classifier Model

y_pred = rf_model.predict(X_test)

print("RandomForestClassifier:")
# scikit-learn metrics expect the arguments in (y_true, y_pred) order.
print(f"Accuracy: {round(accuracy_score(y_test, y_pred), 4)}")
print(f"Recall: {round(recall_score(y_test, y_pred), 4)}")
print(f"Precision: {round(precision_score(y_test, y_pred), 4)}")
print(f"F1: {round(f1_score(y_test, y_pred), 4)}")
print(f"AUC: {round(roc_auc_score(y_test, y_pred), 4)}")
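
The scikit-learn metric functions take the ground truth first, as `(y_true, y_pred)`. Accuracy and F1 are unaffected by the order, but recall and precision trade places when the arguments are reversed. A small sketch with made-up labels shows the effect:

```python
from sklearn.metrics import precision_score, recall_score

# Made-up labels: 4 actual positives, 3 predicted positives, 2 of them correct.
y_true_demo = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_demo = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]

recall = recall_score(y_true_demo, y_pred_demo)        # TP / (TP + FN) = 2 / 4
precision = precision_score(y_true_demo, y_pred_demo)  # TP / (TP + FP) = 2 / 3
swapped = recall_score(y_pred_demo, y_true_demo)       # reversed order yields the precision

print(recall, round(precision, 4), round(swapped, 4))
```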

In [None]:
def plot_importance(model, features, num=len(X), save=False):
    feature_imp = pd.DataFrame({'Value': model.feature_importances_, 'Feature': features.columns})
    plt.figure(figsize=(10, 10))
    sns.set(font_scale=1)
    sns.barplot(x="Value", y="Feature", data=feature_imp.sort_values(by="Value", ascending=False)[0:num])
    plt.title('Feature Importance - RandomForestClassifier')
    plt.tight_layout()
    if save:
        plt.savefig('importances.png')  # save before show(), otherwise the saved figure can be empty
    plt.show(block=True)

In [None]:
plot_importance(rf_model, X)

# 20.1.1.Random Forest Classifier Hyperparameter Optimization

In [None]:
rf_model = RandomForestClassifier(random_state=46)
parameters = {'n_estimators': [100, 200, 300],
              'max_depth': [None, 5, 10],
              'min_samples_split': [2, 5, 10]}
rf_grid = GridSearchCV(rf_model, parameters, cv=5).fit(X_train, y_train)

best_rf_model = rf_grid.best_estimator_

# Predicting with the best model
y_pred = best_rf_model.predict(X_test)
print("Random Forest Classifier - Hyperparameter Optimization")
print(f"Accuracy: {round(accuracy_score(y_test, y_pred), 4)}")
print(f"Recall: {round(recall_score(y_test, y_pred), 4)}")
print(f"Precision: {round(precision_score(y_test, y_pred), 4)}")
print(f"F1: {round(f1_score(y_test, y_pred), 4)}")
print(f"AUC: {round(roc_auc_score(y_test, y_pred), 4)}")
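
By default `GridSearchCV` ranks candidates with the estimator's own `score` method, which is accuracy for classifiers, and `best_score_` is the mean cross-validated score of the best candidate rather than a test-set score. If missing a diabetic patient is the costlier error, one option is to search on recall instead. The sketch below is an assumption-laden illustration on synthetic data, using a small decision tree so it runs quickly; it is not the notebook's tuned model.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic, mildly imbalanced stand-in data.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, weights=[0.65, 0.35], random_state=46
)

param_grid = {"max_depth": [3, 5, None], "min_samples_split": [2, 10]}

# scoring="recall" makes the search rank candidates by mean cross-validated recall.
grid = GridSearchCV(DecisionTreeClassifier(random_state=46), param_grid,
                    cv=5, scoring="recall")
grid.fit(X_demo, y_demo)

print(grid.best_params_)
print(round(grid.best_score_, 4))  # mean CV recall of the best candidate
```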

# 20.2.Logistic Regression

In [None]:
lr_model = LogisticRegression(random_state=46).fit(X_train, y_train)

lr_pred = lr_model.predict(X_test)

print("Logistic Regression:")
print(f"Accuracy: {round(accuracy_score(y_test, lr_pred), 4)}")
print(f"Recall: {round(recall_score(y_test, lr_pred), 4)}")
print(f"Precision: {round(precision_score(y_test, lr_pred), 4)}")
print(f"F1: {round(f1_score(y_test, lr_pred), 4)}")
print(f"AUC: {round(roc_auc_score(y_test, lr_pred), 4)}")

# 20.2.1.Logistic Regression Hyperparameter Optimization

In [None]:
lr_model = LogisticRegression(random_state=46, solver='liblinear')  # liblinear supports both 'l1' and 'l2'
parameters = {'C': [0.1, 1, 10],
              'penalty': ['l1', 'l2']}
lr_grid = GridSearchCV(lr_model, parameters, cv=5).fit(X_train, y_train)

best_lr_model = lr_grid.best_estimator_

print("Logistic Regression - Hyperparameter Optimization")
print(f"Best Parameters: {lr_grid.best_params_}")
print(f"CV Accuracy: {round(lr_grid.best_score_, 4)}")
print(f"Recall: {round(recall_score(y_test, best_lr_model.predict(X_test)), 4)}")
print(f"Precision: {round(precision_score(y_test, best_lr_model.predict(X_test)), 4)}")
print(f"F1: {round(f1_score(y_test, best_lr_model.predict(X_test)), 4)}")
print(f"AUC: {round(roc_auc_score(y_test, best_lr_model.predict(X_test)), 4)}")

# 20.3.K-Nearest Neighbors (KNN)

In [None]:
knn_model = KNeighborsClassifier().fit(X_train, y_train)

knn_pred = knn_model.predict(X_test)

print("K-Nearest Neighbors (KNN):")
print(f"Accuracy: {round(accuracy_score(y_test, knn_pred), 4)}")
print(f"Recall: {round(recall_score(y_test, knn_pred), 4)}")
print(f"Precision: {round(precision_score(y_test, knn_pred), 4)}")
print(f"F1: {round(f1_score(y_test, knn_pred), 4)}")
print(f"AUC: {round(roc_auc_score(y_test, knn_pred), 4)}")

# 20.3.1.K-Nearest Neighbors (KNN) Hyperparameter Optimization

In [None]:
knn_model = KNeighborsClassifier()
parameters = {'n_neighbors': [3, 5, 7],
              'weights': ['uniform', 'distance']}
knn_grid = GridSearchCV(knn_model, parameters, cv=5).fit(X_train, y_train)

best_knn_model = knn_grid.best_estimator_

print("K-Nearest Neighbors (KNN) - Hyperparameter Optimization")
print(f"Best Parameters: {knn_grid.best_params_}")
print(f"CV Accuracy: {round(knn_grid.best_score_, 4)}")
print(f"Recall: {round(recall_score(y_test, best_knn_model.predict(X_test)), 4)}")
print(f"Precision: {round(precision_score(y_test, best_knn_model.predict(X_test)), 4)}")
print(f"F1: {round(f1_score(y_test, best_knn_model.predict(X_test)), 4)}")
print(f"AUC: {round(roc_auc_score(y_test, best_knn_model.predict(X_test)), 4)}")

# 20.4.Support Vector Classifier (SVC)

In [None]:
svc_model = SVC(random_state=46).fit(X_train, y_train)

svc_pred = svc_model.predict(X_test)

print("Support Vector Classifier (SVC):")
print(f"Accuracy: {round(accuracy_score(y_test, svc_pred), 4)}")
print(f"Recall: {round(recall_score(y_test, svc_pred), 4)}")
print(f"Precision: {round(precision_score(y_test, svc_pred), 4)}")
print(f"F1: {round(f1_score(y_test, svc_pred), 4)}")
print(f"AUC: {round(roc_auc_score(y_test, svc_pred), 4)}")

# 20.4.1.Support Vector Classifier (SVC) Hyperparameter Optimization

In [None]:
svc_model = SVC(random_state=46)
parameters = {'C': [0.1, 1, 10],
              'kernel': ['linear', 'rbf']}
svc_grid = GridSearchCV(svc_model, parameters, cv=5).fit(X_train, y_train)

best_svc_model = svc_grid.best_estimator_

print("Support Vector Classifier (SVC) - Hyperparameter Optimization")
print(f"Best Parameters: {svc_grid.best_params_}")
print(f"CV Accuracy: {round(svc_grid.best_score_, 4)}")
print(f"Recall: {round(recall_score(y_test, best_svc_model.predict(X_test)), 4)}")
print(f"Precision: {round(precision_score(y_test, best_svc_model.predict(X_test)), 4)}")
print(f"F1: {round(f1_score(y_test, best_svc_model.predict(X_test)), 4)}")
print(f"AUC: {round(roc_auc_score(y_test, best_svc_model.predict(X_test)), 4)}")

# 20.5.Decision Tree Classifier

In [None]:
dt_model = DecisionTreeClassifier(random_state=46).fit(X_train, y_train)

dt_pred = dt_model.predict(X_test)

print("Decision Tree Classifier:")
print(f"Accuracy: {round(accuracy_score(y_test, dt_pred), 4)}")
print(f"Recall: {round(recall_score(y_test, dt_pred), 4)}")
print(f"Precision: {round(precision_score(y_test, dt_pred), 4)}")
print(f"F1: {round(f1_score(y_test, dt_pred), 4)}")
print(f"AUC: {round(roc_auc_score(y_test, dt_pred), 4)}")

In [None]:
def plot_importance(model, features, num=len(X), save=False):
    feature_imp = pd.DataFrame({'Value': model.feature_importances_, 'Feature': features.columns})
    plt.figure(figsize=(10, 10))
    sns.set(font_scale=1)
    sns.barplot(x="Value", y="Feature", data=feature_imp.sort_values(by="Value", ascending=False)[0:num])
    plt.title(f'Feature Importance - {model.__class__.__name__}')
    plt.tight_layout()
    if save:
        plt.savefig('importances.png')  # save before show(), otherwise the saved figure can be empty
    plt.show(block=True)

In [None]:
plot_importance(dt_model, X)

# 20.5.1.Decision Tree Classifier Hyperparameter Optimization

In [None]:
dt_model = DecisionTreeClassifier(random_state=46)
parameters = {'max_depth': [None, 5, 10],
              'min_samples_split': [2, 5, 10]}
dt_grid = GridSearchCV(dt_model, parameters, cv=5).fit(X_train, y_train)

best_dt_model = dt_grid.best_estimator_

print("Decision Tree Classifier - Hyperparameter Optimization")
print(f"Best Parameters: {dt_grid.best_params_}")
print(f"CV Accuracy: {round(dt_grid.best_score_, 4)}")
print(f"Recall: {round(recall_score(y_test, best_dt_model.predict(X_test)), 4)}")
print(f"Precision: {round(precision_score(y_test, best_dt_model.predict(X_test)), 4)}")
print(f"F1: {round(f1_score(y_test, best_dt_model.predict(X_test)), 4)}")
print(f"AUC: {round(roc_auc_score(y_test, best_dt_model.predict(X_test)), 4)}")

# 20.6.AdaBoost Classifier

In [None]:
ada_model = AdaBoostClassifier(random_state=46).fit(X_train, y_train)

ada_pred = ada_model.predict(X_test)

print("AdaBoost Classifier:")
print(f"Accuracy: {round(accuracy_score(y_test, ada_pred), 4)}")
print(f"Recall: {round(recall_score(y_test, ada_pred), 4)}")
print(f"Precision: {round(precision_score(y_test, ada_pred), 4)}")
print(f"F1: {round(f1_score(y_test, ada_pred), 4)}")
print(f"AUC: {round(roc_auc_score(y_test, ada_pred), 4)}")

# 20.6.1.AdaBoost Classifier Hyperparameter Optimization

In [None]:
ada_model = AdaBoostClassifier(random_state=46)
parameters = {'n_estimators': [50, 100, 200],
              'learning_rate': [0.1, 0.5, 1.0]}
ada_grid = GridSearchCV(ada_model, parameters, cv=5).fit(X_train, y_train)

best_ada_model = ada_grid.best_estimator_

print("AdaBoost Classifier - Hyperparameter Optimization")
print(f"Best Parameters: {ada_grid.best_params_}")
print(f"CV Accuracy: {round(ada_grid.best_score_, 4)}")
print(f"Recall: {round(recall_score(y_test, best_ada_model.predict(X_test)), 4)}")
print(f"Precision: {round(precision_score(y_test, best_ada_model.predict(X_test)), 4)}")
print(f"F1: {round(f1_score(y_test, best_ada_model.predict(X_test)), 4)}")
print(f"AUC: {round(roc_auc_score(y_test, best_ada_model.predict(X_test)), 4)}")

# 20.7.Gradient Boosting Classifier

In [None]:
gb_model = GradientBoostingClassifier(random_state=46).fit(X_train, y_train)

gb_pred = gb_model.predict(X_test)

print("Gradient Boosting Classifier:")
print(f"Accuracy: {round(accuracy_score(y_test, gb_pred), 4)}")
print(f"Recall: {round(recall_score(y_test, gb_pred), 4)}")
print(f"Precision: {round(precision_score(y_test, gb_pred), 4)}")
print(f"F1: {round(f1_score(y_test, gb_pred), 4)}")
print(f"AUC: {round(roc_auc_score(y_test, gb_pred), 4)}")

In [None]:
plot_importance(gb_model, X)

# 20.7.1.Gradient Boosting Classifier Hyperparameter Optimization

In [None]:
gb_model = GradientBoostingClassifier(random_state=46)
parameters = {'n_estimators': [50, 100, 200],
              'learning_rate': [0.1, 0.5, 1.0]}
gb_grid = GridSearchCV(gb_model, parameters, cv=5).fit(X_train, y_train)

best_gb_model = gb_grid.best_estimator_

print("Gradient Boosting Classifier - Hyperparameter Optimization")
print(f"Best Parameters: {gb_grid.best_params_}")
print(f"CV Accuracy: {round(gb_grid.best_score_, 4)}")
print(f"Recall: {round(recall_score(y_test, best_gb_model.predict(X_test)), 4)}")
print(f"Precision: {round(precision_score(y_test, best_gb_model.predict(X_test)), 4)}")
print(f"F1: {round(f1_score(y_test, best_gb_model.predict(X_test)), 4)}")
print(f"AUC: {round(roc_auc_score(y_test, best_gb_model.predict(X_test)), 4)}")

# 20.8.XGBoost Classifier

In [None]:
xgb_model = XGBClassifier(random_state=46).fit(X_train, y_train)

xgb_pred = xgb_model.predict(X_test)

print("XGBoost Classifier:")
print(f"Accuracy: {round(accuracy_score(y_test, xgb_pred), 4)}")
print(f"Recall: {round(recall_score(y_test, xgb_pred), 4)}")
print(f"Precision: {round(precision_score(y_test, xgb_pred), 4)}")
print(f"F1: {round(f1_score(y_test, xgb_pred), 4)}")
print(f"AUC: {round(roc_auc_score(y_test, xgb_pred), 4)}")

In [None]:
plot_importance(xgb_model, X)

# 20.8.1.XGBoost Classifier Hyperparameter Optimization

In [None]:
xgb_model = XGBClassifier(random_state=46)
parameters = {'n_estimators': [50, 100, 200],
              'learning_rate': [0.1, 0.5, 1.0]}
xgb_grid = GridSearchCV(xgb_model, parameters, cv=5).fit(X_train, y_train)

best_xgb_model = xgb_grid.best_estimator_

print("XGBoost Classifier - Hyperparameter Optimization")
print(f"Best Parameters: {xgb_grid.best_params_}")
print(f"CV Accuracy: {round(xgb_grid.best_score_, 4)}")
print(f"Recall: {round(recall_score(y_test, best_xgb_model.predict(X_test)), 4)}")
print(f"Precision: {round(precision_score(y_test, best_xgb_model.predict(X_test)), 4)}")
print(f"F1: {round(f1_score(y_test, best_xgb_model.predict(X_test)), 4)}")
print(f"AUC: {round(roc_auc_score(y_test, best_xgb_model.predict(X_test)), 4)}")


# 20.9.LightGBM Classifier

In [None]:
lgbm_model = LGBMClassifier(random_state=46).fit(X_train, y_train)

lgbm_pred = lgbm_model.predict(X_test)

print("LightGBM Classifier:")
print(f"Accuracy: {round(accuracy_score(y_test, lgbm_pred), 4)}")
print(f"Recall: {round(recall_score(y_test, lgbm_pred), 4)}")
print(f"Precision: {round(precision_score(y_test, lgbm_pred), 4)}")
print(f"F1: {round(f1_score(y_test, lgbm_pred), 4)}")
print(f"AUC: {round(roc_auc_score(y_test, lgbm_pred), 4)}")

In [None]:
plot_importance(lgbm_model, X)

# 20.9.1.LightGBM Classifier Hyperparameter Optimization

In [None]:

lgbm_model = LGBMClassifier(random_state=46)
parameters = {'n_estimators': [50, 100, 200],
              'learning_rate': [0.1, 0.5, 1.0]}
lgbm_grid = GridSearchCV(lgbm_model, parameters, cv=5).fit(X_train, y_train)

best_lgbm_model = lgbm_grid.best_estimator_

print("LightGBM Classifier - Hyperparameter Optimization")
print(f"Best Parameters: {lgbm_grid.best_params_}")
print(f"CV Accuracy: {round(lgbm_grid.best_score_, 4)}")
print(f"Recall: {round(recall_score(y_test, best_lgbm_model.predict(X_test)), 4)}")
print(f"Precision: {round(precision_score(y_test, best_lgbm_model.predict(X_test)), 4)}")
print(f"F1: {round(f1_score(y_test, best_lgbm_model.predict(X_test)), 4)}")
print(f"AUC: {round(roc_auc_score(y_test, best_lgbm_model.predict(X_test)), 4)}")

# 20.10.Comparison of Metrics for Different Models After Feature Engineering

In [None]:
# Dictionary containing the metric results
metrics = {
    "Model": ["Random Forest", "Logistic Regression", "KNN", "SVC", "Decision Tree", "AdaBoost", "Gradient Boosting", "XGBoost", "LightGBM"],
    "Accuracy": [0.7749, 0.7532, 0.7792, 0.7792, 0.7229, 0.7662, 0.7532, 0.7879, 0.7662],
    "Recall": [0.7101, 0.6875, 0.7027, 0.7419, 0.5955, 0.6957, 0.6667, 0.7105, 0.7015],
    "Precision": [0.6049, 0.5432, 0.642, 0.5679, 0.6543, 0.5926, 0.5926, 0.6667, 0.5802],
    "F1": [0.6533, 0.6069, 0.671, 0.6434, 0.6235, 0.64, 0.6275, 0.6879, 0.6351],
    "AUC": [0.7563, 0.733, 0.759, 0.7674, 0.6992, 0.746, 0.7296, 0.7682, 0.7471]
}

# Creating a DataFrame from the metrics dictionary
results_df = pd.DataFrame(metrics)

# Sorting the DataFrame by accuracy in descending order
results_df = results_df.sort_values(by="Accuracy", ascending=False)

# Creating the figure for the graph
fig = go.Figure()

# Colors for the metrics
colors = ["purple", "green", "blue", "orange", "red"]

# Adding traces for each metric in the specified order
for metric, color in zip(["Accuracy", "Recall", "Precision", "F1", "AUC"], colors):
    fig.add_trace(go.Bar(
        x=results_df["Model"],
        y=results_df[metric],
        marker_color=color,
        name=metric,
        text=results_df[metric],
        textposition='auto'
    ))

# Setting axis labels and title
fig.update_layout(
    xaxis_title="Model",
    yaxis_title="Metric Score",
    title="Comparison of Metrics for Different Models After Feature Engineering"
)

# Displaying the graph
fig.show()
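
Hard-coded metric lists are easy to get out of sync with the models that produced them. As a hedged alternative, the table can be assembled in a loop over fitted models. The sketch below uses synthetic data and only two models and two metrics to stay short; `models`, `rows`, and `results_demo` are hypothetical names.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, not the diabetes set.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=46)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.30, random_state=17)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=46),
}

rows = []
for name, model in models.items():
    pred = model.fit(X_tr, y_tr).predict(X_te)
    # Ground truth goes first in every metric call.
    rows.append({"Model": name,
                 "Accuracy": round(accuracy_score(y_te, pred), 4),
                 "F1": round(f1_score(y_te, pred), 4)})

results_demo = pd.DataFrame(rows).sort_values(by="Accuracy", ascending=False)
print(results_demo)
```

The resulting frame has the same shape as `results_df`, so the plotting loop can consume it unchanged.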


# 20.11.Comparison of Metrics for Different Models After Hyperparameter Optimization

In [None]:
# Dictionary containing the metric results
metrics = {
    "Model": ["Random Forest", "Logistic Regression", "KNN", "SVC", "Decision Tree", "AdaBoost", "Gradient Boosting", "XGBoost", "LightGBM"],
    "Accuracy": [0.7749, 0.7654, 0.7487, 0.7691, 0.7188, 0.745, 0.771, 0.758, 0.7487],
    "Recall": [0.6986, 0.6875, 0.6986, 0.7049, 0.618, 0.7, 0.6901, 0.6835, 0.6579],
    "Precision": [0.6296, 0.5432, 0.6296, 0.5309, 0.679, 0.6049, 0.6049, 0.6667, 0.6173],
    "F1": [0.6623, 0.6069, 0.6623, 0.6056, 0.6471, 0.649, 0.6447, 0.675, 0.6369],
    "AUC": [0.7544, 0.733, 0.7544, 0.7407, 0.7174, 0.7506, 0.7451, 0.753, 0.7289]
}

# Creating a DataFrame from the metrics dictionary
results_df = pd.DataFrame(metrics)

# Sorting the DataFrame by accuracy in descending order
results_df = results_df.sort_values(by="Accuracy", ascending=False)

# Creating the figure for the graph
fig = go.Figure()

# Colors for the metrics
colors = ["purple", "green", "blue", "orange", "red"]

# Adding traces for each metric in the specified order
for metric, color in zip(["Accuracy", "Recall", "Precision", "F1", "AUC"], colors):
    fig.add_trace(go.Bar(
        x=results_df["Model"],
        y=results_df[metric],
        marker_color=color,
        name=metric,
        text=results_df[metric],
        textposition='auto'
    ))

# Setting axis labels and title
fig.update_layout(
    xaxis_title="Model",
    yaxis_title="Metric Score",
    title="Comparison of Metrics for Different Models After Hyperparameter Optimization"
)

# Displaying the graph
fig.show()


# 20.12.Comparison of Metrics Before and After Hyperparameter Optimization

In [None]:
# Metric names
metrics = ["Accuracy", "Recall", "Precision", "F1", "AUC"]

# Random Forest metric values before hyperparameter optimization
before_values = [0.7749, 0.7101, 0.6049, 0.6533, 0.7563]

# Random Forest metric values after hyperparameter optimization
after_values = [0.7749, 0.6986, 0.6296, 0.6623, 0.7544]

# Index for the x-axis
x = np.arange(len(metrics))

# Width of the bars
width = 0.35

# Creating the figure and axes
fig, ax = plt.subplots()

# Plotting the bars for before values
rects1 = ax.bar(x - width/2, before_values, width, label='Before')

# Plotting the bars for after values
rects2 = ax.bar(x + width/2, after_values, width, label='After')

# Setting labels and title
ax.set_xlabel('Metrics')
ax.set_ylabel('Metric Score')
ax.set_title('Comparison of Metrics Before and After Hyperparameter Optimization')
ax.set_xticks(x)
ax.set_xticklabels(metrics)
ax.legend()

# Function to attach the metric values on top of the bars
def autolabel(rects):
    for rect in rects:
        height = rect.get_height()
        ax.annotate(f'{round(height, 4)}', xy=(rect.get_x() + rect.get_width() / 2, height),
                    xytext=(0, 3), textcoords='offset points',
                    ha='center', va='bottom')

# Attaching metric values on top of the bars
autolabel(rects1)
autolabel(rects2)

# Displaying the plot
plt.show()
