## Introduction

In this assignment, I was tasked with implementing two machine learning models that give good predictions for cases of diabetes.

I will go through all the important steps of the machine learning workflow, including treatment of the dataset, then selecting, training, applying and evaluating the models.

I will explain the various steps I have taken during the process, and provide visualisations.

Finally, I will compare the prediction results of the two models and draw sensible conclusions.

## Preparing the data

Pre-processing the data is a critical phase in machine learning, as this will serve as the foundation for all the following steps.

Various actions can and should be taken when dealing with the dataset, and which ones are appropriate usually depends on the goal of the project.

Handling missing or zero values, correcting inaccuracies and inconsistencies, and similar tasks can strongly affect the outcome.

If these steps are not taken, the model can become biased and distorted, and its outputs can become unreliable and, in some cases, even dangerous.

To address these issues, I will treat the data with some well-tested methods.

First, the dataset is loaded into a dataframe using pandas.

Using the dataframe, we can observe the first few entries and column names. This helps with understanding the data's features and prepares us for the necessary preprocessing steps.

In [None]:
import numpy as np
import pandas as pd

df = pd.read_csv('diabetes.csv')
df.head()

The summary statistics show the central tendency and spread of each feature's distribution.

This step is crucial to identify anomalies and understand the scale of each feature.

In [None]:
df.describe()

Here, we can see the data types and non-null counts. This helps us determine the cleanliness of the data and the potential need for type conversions and for handling missing values.

In [None]:
df.info()

Understanding the balance of the classes is very important.

Real-world datasets are often imbalanced, in which case we have to take measures to prevent inaccuracies in the output.

When one of the classes is overrepresented, the model can become biased towards it, which is a serious problem when predicting medical conditions.

In [None]:
df["Outcome"].value_counts()

Visualising the data provides insights into the distribution, correlations, and patterns within the dataset.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

df.hist(figsize = (12, 8))
plt.tight_layout()
plt.show()

df.plot(kind = 'box', subplots = True, layout = (3,3), figsize = (12, 8))
plt.tight_layout()
plt.show()

plt.figure(figsize = (10, 8))
sns.heatmap(df.corr(), annot = True, fmt = ".2f")
plt.show()

As we saw earlier, some of the columns contain zero values. For measurements such as blood pressure or BMI, a zero is not a plausible reading and most likely marks a missing value.

To address the issue, we impute these zeros with the column median as a reasonable estimate.

We only impute the columns where this makes sense and leave out others, such as `Pregnancies`, where zero is a valid value.

The median is used rather than the mean because it is far less affected by outliers, which helps preserve the integrity of the dataset.

In [None]:
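# Zeros in these columns stand for missing measurements, so they are replaced by the column median.
# Note: the medians are computed on the full dataset, i.e. before the train/test split.
col_to_impute = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
for col in col_to_impute:
    df[col] = df[col].replace(0, np.nan)
    df[col] = df[col].fillna(df[col].median())

# Separate the features from the target label
X = df.drop(['Outcome'], axis = 1)
y = df['Outcome']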
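
To illustrate why the median is preferred here, the following minimal sketch compares the two fill values on a small, hypothetical list of insulin readings (the numbers are made up for illustration and are not taken from the dataset). A single extreme reading pulls the mean far more than the median.

```python
from statistics import mean, median

# Hypothetical readings: 0 marks a missing measurement, 846 is an extreme value.
readings = [0, 94, 168, 88, 0, 543, 846, 175]

# Ignore the placeholder zeros when computing the fill value.
observed = [r for r in readings if r != 0]
fill_mean = mean(observed)
fill_median = median(observed)

# Replace each zero with the median of the observed values.
imputed = [r if r != 0 else fill_median for r in readings]

print(f"mean fill:   {fill_mean:.1f}")
print(f"median fill: {fill_median:.1f}")
print("imputed:", imputed)
```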

The next step is one of the most important data processing steps in machine learning.

Right now, the dataset is in one piece, and training the model on all of it would be ill-advised.

After training, we would have no choice but to test the model on the same data it was trained on.

This would give an overly optimistic estimate of performance and hide any overfitting, as the model would already have seen the data and its correct labels during training.

To prevent that, we split the dataset into a training set and a test set following a conventional 80-20 ratio.

This widely accepted practice allows us to train our models on the majority of the data while setting aside a portion for unbiased evaluation.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
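
As an optional variation that is not used in this notebook, `train_test_split` also accepts a `stratify` argument that keeps the class proportions the same in both parts. The sketch below shows the effect on synthetic data; `X_demo` and `y_demo` are hypothetical stand-ins, not the diabetes dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 150 rows, one third of them positive.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(150, 3))
y_demo = np.array([0] * 100 + [1] * 50)

# stratify=y_demo asks for the same class ratio in the training and test parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print("positive rate overall:", round(y_demo.mean(), 3))
print("positive rate train:  ", round(y_tr.mean(), 3))
print("positive rate test:   ", round(y_te.mean(), 3))
```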

The dataset is now properly split but, as we saw earlier, the various features have different scales.

To ensure that each feature contributes proportionally to the final model, we need to scale the data.

Standardization modifies the features to have a mean of zero and a standard deviation of one, which is particularly beneficial for algorithms sensitive to the scale of the data, such as SVM.

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
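
The arithmetic behind this step can be sketched by hand with NumPy. The toy values below are illustrative only. The point to notice is that the mean and standard deviation come from the training rows alone and are then reused for the test rows, which mirrors calling `fit_transform` on the training set and `transform` on the test set.

```python
import numpy as np

# Toy feature matrix: two columns on different scales (illustrative values).
X_tr = np.array([[85.0, 22.0], [120.0, 31.5], [150.0, 40.0], [99.0, 26.5]])
X_te = np.array([[110.0, 30.0]])

mu = X_tr.mean(axis=0)      # per-feature mean, from the training rows only
sigma = X_tr.std(axis=0)    # per-feature population standard deviation (ddof=0)

X_tr_scaled = (X_tr - mu) / sigma
X_te_scaled = (X_te - mu) / sigma   # test rows reuse the training statistics

print("train means after scaling:", X_tr_scaled.mean(axis=0).round(6))
print("train stds after scaling: ", X_tr_scaled.std(axis=0).round(6))
```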

Finally, we implement Stratified K-Folds, for use in the grid search later.

The purpose of this is to make sure that each fold of the dataset contains the same percentage of samples of the classes, which is important in our case.

In this case, we are using 5 folds for cross-validation, which means that 4 folds are used for training and 1 for validation, repeated 5 times so that each fold serves as the validation set once.

This method is beneficial for maintaining a representative training process, especially when the dataset shows an imbalance.

In [None]:
from sklearn.model_selection import StratifiedKFold

stratified_kfold = StratifiedKFold(n_splits = 5, shuffle = True, random_state = 42)
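
To see what stratification does in practice, the sketch below runs the same splitter on a synthetic label vector (`y_demo` is a hypothetical stand-in with one third positives) and prints the positive rate of every validation fold.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

y_demo = np.array([0] * 100 + [1] * 50)
X_demo = np.zeros((len(y_demo), 1))   # the feature values do not affect how folds are formed

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

fold_rates = []
for fold, (train_idx, val_idx) in enumerate(skf.split(X_demo, y_demo), start=1):
    rate = y_demo[val_idx].mean()     # share of positive labels in this validation fold
    fold_rates.append(rate)
    print(f"fold {fold}: {len(train_idx)} train rows, "
          f"{len(val_idx)} validation rows, positive rate {rate:.3f}")
```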

## Support Vector Machine

Support Vector Machines (SVM) are a type of supervised learning used for classification and regression. The algorithm works by finding the hyperplane (decision boundary) that best divides a dataset into classes.

The best hyperplane for an SVM is the one with the largest margin between the two classes. The margin is the gap between the decision boundary and the closest points of each class. This is why SVM is also known as a maximum-margin classifier.

The points closest to the boundary are called support vectors. In general, the larger the margin around them, the lower the generalisation error of the classifier.

Linear SVM: When the data is linearly separable, the algorithm finds a hyperplane that separates the classes with the maximum margin.

Non-linear SVM: When the dataset cannot be separated linearly, SVM uses the kernel trick to map the input space into a higher-dimensional space where a hyperplane can be used to separate the classes.

The kernel trick uses kernels (e.g. polynomial, RBF), which compute the high-dimensional relationships without having to actually do the transformation.
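
A small numerical check can make this concrete. The sketch below assumes a degree-2 polynomial kernel of the form `(gamma * <x, z> + coef0) ** degree` with `gamma = 1` and `coef0 = 1`, applied to 2-D inputs, and compares it with the dot product of an explicitly constructed 6-dimensional feature map. The helper names `poly_kernel` and `explicit_map` are hypothetical and exist only for this illustration.

```python
import numpy as np

def poly_kernel(x, z):
    """Degree-2 polynomial kernel with gamma = 1 and coef0 = 1."""
    return (np.dot(x, z) + 1.0) ** 2

def explicit_map(x):
    """The 6-dimensional feature map this kernel corresponds to for 2-D inputs."""
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([1.0, s * x1, s * x2, x1 ** 2, x2 ** 2, s * x1 * x2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

k_implicit = poly_kernel(x, z)                            # works in the original 2-D space
k_explicit = np.dot(explicit_map(x), explicit_map(z))     # works in the 6-D space

print("kernel value:        ", k_implicit)
print("explicit dot product:", k_explicit)
```

Both routes give the same number, but the kernel never has to build the higher-dimensional vectors.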

Hyperparameters:
- `C` (for all kernels): sets the weight given to classification error. Larger values penalise misclassified training points more heavily.
- `gamma` (for RBF and polynomial kernels): defines the reach of an individual training example's influence, essential for non-linear kernel functions.
- `degree` (d, for polynomial kernel): determines the flexibility of the decision boundary, enabling it to take on more complex shapes.
- `coef0` (r, for polynomial kernel): controls how strongly higher-order terms influence the model relative to lower-order ones.

The selection and tuning of hyperparameters is very important, as they can greatly influence the model's ability to classify new data correctly.

We tune these hyperparameters through grid search with cross-validation to find the most effective combination, with the aim of achieving a high recall.

We focus on recall because it measures the proportion of actual positive cases that the model correctly identifies.

Recall is often prioritised in medicine, because false negatives have serious consequences.

If a patient receives a false negative result, they might go on believing that they do not have the condition, which can seriously harm their health.

A false positive, however, might cause the patient some anxiety for a period of time, but since they do not actually have diabetes, the mistaken result is a comparatively minor issue.

In [None]:
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
import time

svm_start = time.time()

svm_parameters = [
    {'kernel': ['linear'], 'C': [0.0001, 0.001, 0.1, 1, 10, 100]},
    {'kernel': ['rbf'], 'C': [0.0001, 0.001, 0.1, 1, 10, 100], 'gamma': [0.0001, 0.001, 0.01, 0.1, 1]},
    {'kernel': ['poly'], 'C': [0.0001, 0.001, 0.1, 1, 10, 100], 'gamma': [0.0001, 0.001, 0.01, 0.1, 1], 'degree': [2, 3], 'coef0': [0.0, 1.0]}
]

print("Conducting grid search, please wait...\n")

svm_grid_search = GridSearchCV(SVC(class_weight = 'balanced'), svm_parameters, cv = stratified_kfold, scoring = 'recall', n_jobs = -1)

svm_grid_search.fit(X_train, y_train)

svm_end = time.time()

svm_grid_duration = svm_end - svm_start

print(f"SVM grid search took: {svm_grid_duration} seconds to complete\n")

print("Best Parameters: ", svm_grid_search.best_params_)

print("\nEnd of SVM grid search")
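
To put the measured duration into context, the size of the search can be worked out from the grid itself. The sketch below copies the value lists from the grid above and counts the candidate combinations per kernel and the resulting number of cross-validation fits with 5 folds.

```python
from math import prod

C_values = [0.0001, 0.001, 0.1, 1, 10, 100]
gamma_values = [0.0001, 0.001, 0.01, 0.1, 1]

grid = [
    {"kernel": ["linear"], "C": C_values},
    {"kernel": ["rbf"], "C": C_values, "gamma": gamma_values},
    {"kernel": ["poly"], "C": C_values, "gamma": gamma_values,
     "degree": [2, 3], "coef0": [0.0, 1.0]},
]

# Each sub-grid contributes the product of the lengths of its value lists.
per_kernel = {g["kernel"][0]: prod(len(v) for v in g.values()) for g in grid}
n_candidates = sum(per_kernel.values())
n_fits = n_candidates * 5   # one fit per candidate per cross-validation fold

print(per_kernel)
print("candidates:", n_candidates, "| cross-validation fits:", n_fits)
```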

With the optimal hyperparameters found by the grid search, we can create a new model.

In [None]:
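# Note: best_params_ holds only the searched keys, so class_weight = 'balanced' from the
# grid search is not carried over here; svm_grid_search.best_estimator_ would retain it.
best_svm = SVC(**svm_grid_search.best_params_)
best_svm.fit(X_train, y_train)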
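
One detail is worth checking when a model is rebuilt this way: `best_params_` contains only the keys that were part of the search, so any setting fixed on the estimator passed to `GridSearchCV` is not included. The sketch below demonstrates this on synthetic data (`X_demo`, `y_demo` and the small grid are hypothetical) by comparing a rebuilt model with `best_estimator_`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Small synthetic, imbalanced problem used only for this demonstration.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=5, weights=[0.7, 0.3], random_state=0
)

search = GridSearchCV(SVC(class_weight="balanced"), {"C": [0.1, 1, 10]},
                      cv=3, scoring="recall")
search.fit(X_demo, y_demo)

rebuilt = SVC(**search.best_params_)   # only the searched keys are passed on
refit = search.best_estimator_         # clone of the original estimator, refitted

print("best_params_:        ", search.best_params_)
print("rebuilt class_weight:", rebuilt.get_params()["class_weight"])
print("refit class_weight:  ", refit.get_params()["class_weight"])
```

If the class weighting should apply to the final model as well, using `best_estimator_` directly, or passing `class_weight` again when rebuilding, keeps the two consistent.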

Using the trained SVM model, we generate predictions on the test set.

These predictions allow us to critically evaluate the model's performance on unseen data, which is essential for gauging how the model might perform in real world scenarios.

We assess the model's effectiveness using a variety of metrics:
- Accuracy: the proportion of all predictions that are correct.
- Precision: the proportion of predicted positive instances that are actually positive.
- Recall: the proportion of actual positive instances that the model identifies.
- F1: the harmonic mean of precision and recall, useful when we need a single metric to convey the balance between the two.

The evaluation phase is not just about assessing model accuracy; it is about understanding the model's predictive behaviour.

We utilise a confusion matrix to visualise true and false positives and negatives.

Such analyses are crucial for medical decision making where the cost of a false negative can be much higher than that of a false positive, as previously mentioned.

In [None]:
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix

y_pred_svm = best_svm.predict(X_test)

accuracy_svm = accuracy_score(y_test, y_pred_svm)
con_mat_svm = confusion_matrix(y_test, y_pred_svm)
class_report_svm = classification_report(y_test, y_pred_svm)

print("\nSVM accuracy:", accuracy_svm)
print("\nSVM classification report:\n", class_report_svm)

plt.figure(figsize = (8, 6))
sns.heatmap(con_mat_svm, annot = True, fmt = 'd', cmap = 'Blues', xticklabels = ['Not Diabetic', 'Diabetic'], yticklabels=['Not Diabetic', 'Diabetic'])
plt.title('Confusion Matrix')
plt.ylabel('Actual Class')
plt.xlabel('Predicted Class')
plt.show()

The confusion matrix provides a visualisation of the model's predictions. We can see that out of 154 instances, the model correctly predicted 88 non-diabetic cases and 29 diabetic cases, giving us an accuracy of about 76%. 

However, more critical for us is the number of false negatives. There were 26 such cases, which is significant in medical diagnoses where failing to identify the condition can have serious consequences.

Looking at the classification report, we see the model's precision, recall, and f1-scores.

The recall for detecting diabetic cases stands at 53%, which implies that the model is not sensitive enough to positive cases.

The weighted average F1-score, which combines precision and recall, is 75%. Since this average is weighted by class size, it mostly reflects the larger non-diabetic class.

A balance across the classes is crucial, since the model should not overly favour one class over the other, and the low recall on diabetic cases shows that this balance has not been fully achieved.
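
The figures quoted for the diabetic class can be re-derived from the confusion-matrix counts. In the sketch below, the 11 false positives are not stated directly in the text but follow from the totals (154 - 88 - 29 - 26), and the helper `binary_metrics` is a hypothetical name used only here.

```python
def binary_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall and F1 for the positive class."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)   # of all predicted positives, how many are correct
    recall = tp / (tp + fn)      # of all actual positives, how many are found
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# SVM counts: 29 true positives, 26 false negatives, 88 true negatives,
# and 11 false positives derived from the total of 154 test instances.
svm = binary_metrics(tp=29, fp=11, fn=26, tn=88)

for name, value in svm.items():
    print(f"{name:9s} {value:.3f}")
```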

In summary, the SVM model demonstrates a decent performance. However, using another type of machine learning model might give us better results for our particular task.

## Ensemble

Ensemble methods combine several models to improve the accuracy of predictions.

These methods work on the assumption that a group of weak learners can come together to form a stronger learner.

The advantage of using them is that they help in reducing overfitting and bias, and can lead to better performance compared to using a single model.

### AdaBoost with RandomForest

Adaptive Boosting (AdaBoost) builds a model from the training data, then creates a second model that tries to correct the errors of the first by giving more weight to the examples it misclassified.

The process is repeated, adding models until the training data is predicted well or the maximum number of models has been added.
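
A single boosting round can be sketched in a few lines. This is the textbook binary update under assumed inputs (five samples, one hypothetical first-round mistake), not a reproduction of scikit-learn's internals: the misclassified sample gains weight so that the next model concentrates on it.

```python
import math

weights = [0.2] * 5                                   # uniform starting weights
misclassified = [False, False, True, False, False]    # hypothetical first-round mistakes
learning_rate = 1.0

# Weighted error of the current model.
error = sum(w for w, miss in zip(weights, misclassified) if miss)

# Say of the model in the final vote: larger when the error is smaller.
alpha = learning_rate * math.log((1 - error) / error)

# Increase the weight of misclassified samples, then renormalise to sum to 1.
boosted = [w * math.exp(alpha) if miss else w for w, miss in zip(weights, misclassified)]
total = sum(boosted)
weights = [w / total for w in boosted]

print(f"error = {error:.2f}, alpha = {alpha:.3f}")
print("updated weights:", [round(w, 3) for w in weights])
```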

RandomForest is an algorithm that builds multiple decision trees and merges them together to get a more accurate and stable prediction.

It is known for high performance and scalability.

Each tree in a RandomForest is built from a sample drawn with replacement from the training set.

When splitting a node during the construction of the tree, the split chosen is no longer the best split among all the features.

Instead, the split which is picked is the best split among a random subset of the features.

As a result, the bias of the forest increases slightly but due to averaging, its variance decreases, which results in an overall better model.
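
The two sources of randomness described above can be sketched with NumPy. The sample and feature counts are hypothetical, and the square-root rule for the subset size is an assumption made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples, n_features = 10, 8     # hypothetical sizes for illustration

# Bootstrap: draw n_samples row indices with replacement for one tree.
boot_idx = rng.integers(0, n_samples, size=n_samples)

# Rows that were never drawn are "out of bag" for this tree.
oob_idx = np.setdiff1d(np.arange(n_samples), boot_idx)

# At each split, only a random subset of the features is considered.
k = max(1, int(np.sqrt(n_features)))
candidate_features = rng.choice(n_features, size=k, replace=False)

print("bootstrap rows:    ", boot_idx)
print("out-of-bag rows:   ", oob_idx)
print("candidate features:", candidate_features)
```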

For this task, the AdaBoost algorithm is used with RandomForest as the base classifier, and these are the chosen parameters:
- `base_estimator__max_depth`: maximum depth of the trees. Deeper trees can capture more complex patterns but also lead to overfitting.
- `base_estimator__min_samples_split`: minimum number of samples required to split an internal node.
- `base_estimator__n_estimators`: number of trees in the RandomForest. More trees can give us better performance but also require more resources.
- `n_estimators` for AdaBoost: maximum number of estimators at which boosting is terminated.
- `learning_rate`: shrinks the contribution of the classifiers. There is a trade-off between the learning_rate and n_estimators.
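
The prefix in front of the nested parameter names depends on the installed scikit-learn version, because the `base_estimator` argument of `AdaBoostClassifier` was renamed to `estimator` in version 1.2. The sketch below shows one way to look up the valid prefix at runtime; `demo_grid` is a hypothetical, shortened grid used only for this check.

```python
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier

ada = AdaBoostClassifier(RandomForestClassifier(n_estimators=10, random_state=42))

# get_params() is deep by default, so nested names such as '<prefix>__max_depth' are listed.
param_names = ada.get_params()

prefix = "estimator" if "estimator__max_depth" in param_names else "base_estimator"

demo_grid = {
    f"{prefix}__max_depth": [5, 10, None],
    f"{prefix}__n_estimators": [10, 50, 100],
    "n_estimators": [30, 50],
}

print("nested prefix in this environment:", prefix)
print("all grid keys valid:", all(name in param_names for name in demo_grid))
```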

I chose these parameters because they control the bias-variance trade-off, the learning speed and the complexity of the model.

Just like we did with the SVM model, we perform a grid search with cross-validation, focusing on achieving the best recall we can, while also measuring computing time.

In [None]:
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier

ada_start = time.time()

rnd_clf = RandomForestClassifier(class_weight = 'balanced', random_state = 42)

ada_classifier = AdaBoostClassifier(rnd_clf, random_state = 42, algorithm = 'SAMME')

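# The 'base_estimator__' prefix reaches into the wrapped RandomForest. scikit-learn 1.2 renamed
# this argument to 'estimator', so newer versions expect the prefix 'estimator__' instead.
ada_parameters = {
    'base_estimator__max_depth': [5, 10, None],
    'base_estimator__min_samples_split': [2, 4],
    'base_estimator__n_estimators': [10, 50, 100],
    'n_estimators': [30, 50],
    'learning_rate': [0.01, 0.1, 1]
}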

print("Conducting grid search, please wait...\n")

ada_grid_search = GridSearchCV(ada_classifier, param_grid = ada_parameters, cv = stratified_kfold, scoring = 'recall', n_jobs = -1)

ada_grid_search.fit(X_train, y_train)

ada_end = time.time()

ada_grid_duration = ada_end - ada_start

print(f"Ensemble grid search took: {ada_grid_duration} seconds to complete\n")

print("Best Parameters: ", ada_grid_search.best_params_)

print("\nEnd of ensemble grid search")

After the grid search has found the best values, we can retrieve the best model, which the search has already refitted on the full training set.

This model is expected to outperform the individual weak learners, giving us better predictions.

In [None]:
best_ada = ada_grid_search.best_estimator_

After training, we can now predict on the test set.

We use the same metrics for the results as we did with the previous SVM model, including accuracy, precision, recall, f1 and a visual confusion matrix.

In [None]:
y_pred_ada = best_ada.predict(X_test)

accuracy_ada = accuracy_score(y_test, y_pred_ada)
con_mat_ada = confusion_matrix(y_test, y_pred_ada)
class_report_ada = classification_report(y_test, y_pred_ada)

print("AdaBoost with RandomForest Accuracy:", accuracy_ada)
print("AdaBoost with RandomForest Classification Report:\n", class_report_ada)

plt.figure(figsize = (8, 6))
sns.heatmap(con_mat_ada, annot = True, fmt = 'd', cmap = 'Blues', xticklabels = ['Not Diabetic', 'Diabetic'], yticklabels = ['Not Diabetic', 'Diabetic'])
plt.title('Confusion Matrix')
plt.ylabel('Actual Class')
plt.xlabel('Predicted Class')
plt.show()

The results from the final ensemble model show balanced performance across different metrics.

The model achieved an accuracy of around 75%. The true value of the model, however, is seen in the other metrics, which account for the actual use case of diabetes prediction.

The confusion matrix shows that the model predicted 70 non-diabetic and 45 diabetic cases correctly.

There are 29 false positives and, much more importantly, 10 false negatives, which, as previously discussed, are the primary focus of this project.

Precision for non-diabetic predictions is 88%, showing a high likelihood that a non-diabetic prediction by the model is correct.

Recall for diabetic predictions is 82%, which is particularly important. While it's not perfect, it means the model is quite reliable at catching positive cases.

F1, which balances precision and recall, is at 78% for non-diabetic and 70% for diabetic predictions.

These scores suggest that the model is better at identifying non-diabetic cases but still performs well on diabetic cases.
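
The per-class figures can be cross-checked directly from the confusion matrix reported above. The sketch below uses the four counts from the text and computes precision, recall and F1 for both classes at once with NumPy.

```python
import numpy as np

# Rows are actual classes, columns are predicted classes
# (order: not diabetic, diabetic), using the counts reported in the text.
cm = np.array([[70, 29],
               [10, 45]])

correct = np.diag(cm)
precision = correct / cm.sum(axis=0)   # correct / everything predicted as that class
recall = correct / cm.sum(axis=1)      # correct / everything that truly is that class
f1 = 2 * precision * recall / (precision + recall)
accuracy = correct.sum() / cm.sum()

for i, label in enumerate(["not diabetic", "diabetic"]):
    print(f"{label:13s} precision {precision[i]:.2f}  recall {recall[i]:.2f}  f1 {f1[i]:.2f}")
print(f"accuracy {accuracy:.2f}")
```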

In conclusion, the model performs well, especially when considering our case of diabetes prediction.

The number of false negatives is kept low, and while there are a decent number of false positives, in a medical setting these would simply mean that the patients would be sent in for further testing to determine their true condition.

## Comparison

The purpose of this assignment was to implement, tune and analyse two different machine learning models that can effectively predict cases of diabetes.

The two models I have implemented are a Support Vector Machine (SVM) and an ensemble using an AdaBoost with RandomForest approach.

While both models aim to provide good predictions, they approach the classification problem differently, which leads to differences in their performance and effectiveness for this task.

### SVM:
The SVM model displayed an accuracy of about 76%, indicating a decent ability to distinguish between the classes.

It reaches a precision of 77% for non-diabetic cases, but its recall for the diabetic class is mediocre at 53%.

This low sensitivity to diabetic cases is concerning, because it reflects the risk of misclassifying people with diabetes as non-diabetic, which could have serious consequences.

### AdaBoost with RandomForest:
The AdaBoost with RandomForest ensemble model shows us a similar accuracy of about 75%.

Its recall for diabetic cases is significantly higher than the SVM at 82%, highlighting the ensemble's ability to identify the critical positive cases.

This is vital in medical diagnoses, as it reduces the risk of overlooking patients in need of medical care.

### Advantages and Disadvantages:
The SVM model, with its kernel trick and support vectors, is good at finding the hyperplane for classification tasks and works well with smaller datasets.

However, when the dataset is not linearly separable, its effectiveness depends on choosing a suitable kernel, and it may require careful tuning of hyperparameters to balance results.

The AdaBoost with RandomForest model combines the strength of multiple decision trees: averaging within the forest reduces variance, while boosting targets the remaining errors, which may explain why it performs better in our case.

Its disadvantage is in its complexity and the potential for overfitting if not properly tuned, though this is somewhat mitigated by the RandomForest's capacity for generalisation.

## Conclusion
Considering the importance of minimising false negatives in predicting diabetes, the AdaBoost with RandomForest model stands out as the more suitable model.

Its high recall rate means that fewer diabetic cases go undetected, which is the primary objective of this application of the model.

Although SVM offers a strong baseline, the ensemble captured the patterns in the data better on the metric that matters most here, proving to be more effective for this specific medical diagnosis task.