### Classification

* This notebook demonstrates binary classification by training a model to predict whether a patient should be tested for diabetes based on medical data.

*Supervised* machine learning trains a model to predict a label from a set of features using data with known labels. The trained model can be thought of as a function *f* that maps the features *x* to the label *y*:


> *f([x<sub>1</sub>, x<sub>2</sub>, x<sub>3</sub>, ...]) = y*

*Classification* is a supervised learning task where the model predicts the probability of each class and assigns a label. The simplest case is binary classification, where the label is either 0 or 1 (for example, "False" or "True").
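
As a minimal sketch of the "predict probabilities, then assign a label" step, the snippet below uses made-up probabilities (not output from any real model) and picks the most likely class for each case. The names `class_probabilities` and `predicted_labels` are illustrative assumptions.

```python
import numpy as np

# Hypothetical predicted probabilities for four patients (made-up values).
# Column 0 is P(class 0), column 1 is P(class 1); each row sums to 1.
class_probabilities = np.array([
    [0.90, 0.10],
    [0.20, 0.80],
    [0.55, 0.45],
    [0.40, 0.60],
])

# Assign each patient the class with the highest probability
predicted_labels = class_probabilities.argmax(axis=1)
print(predicted_labels)  # [0 1 0 1]
```

With only two classes, this amounts to labeling a case 1 whenever the probability of class 1 exceeds 0.5.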



In [None]:
import pandas as pd

# load the training dataset
diabetes = pd.read_csv('../../generated/data/raw/diabetes.csv')
print(diabetes.head())

* The data contains diagnostic information for patients tested for diabetes.
* The final column (**Diabetic**) is the label:
  - **0** for patients who tested negative
  - **1** for patients who tested positive
* Most other columns (**Pregnancies**, **PlasmaGlucose**, **DiastolicBloodPressure**, etc.) are features used to predict the Diabetic label.
* We'll separate features as _**X**_ and the label as _**y**_.

In [None]:
# Separate features and labels
features = ['Pregnancies','PlasmaGlucose','DiastolicBloodPressure','TricepsThickness','SerumInsulin','BMI','DiabetesPedigree','Age']
label = 'Diabetic'
X, y = diabetes[features].values, diabetes[label].values

for n in range(0,4):
    print("Patient", str(n+1), "\n  Features:",list(X[n]), "\n  Label:", y[n])

#### *️⃣ Compare the feature distributions for each label value.

In [None]:
from matplotlib import pyplot as plt

features = ['Pregnancies','PlasmaGlucose','DiastolicBloodPressure','TricepsThickness','SerumInsulin','BMI','DiabetesPedigree','Age']
for col in features:
    diabetes.boxplot(column=col, by='Diabetic', figsize=(6,6))
    plt.title(col)
plt.show()

* Some features show noticeable differences in distribution for each label value.
* **Pregnancies** and **Age** have markedly different distributions between diabetic and non-diabetic patients.
* These features may help predict whether a patient is diabetic.
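
The visual comparison can also be backed with numbers. The sketch below groups a tiny made-up sample (not the real dataset) by label and summarizes two of the feature columns; applied to the `diabetes` DataFrame, the same `groupby` call should give a per-label summary of the actual data.

```python
import pandas as pd

# Tiny made-up sample for illustration only; column names follow the dataset
sample = pd.DataFrame({
    'Age':         [22, 25, 31, 45, 52, 60],
    'Pregnancies': [0, 1, 2, 5, 7, 8],
    'Diabetic':    [0, 0, 0, 1, 1, 1],
})

# Mean and median of each feature, computed separately for each label value
summary = sample.groupby('Diabetic')[['Age', 'Pregnancies']].agg(['mean', 'median'])
print(summary)
```

A large gap between the two rows for a feature suggests that the feature carries information about the label.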

#### Split the data

Split the dataset into two parts: one for training the model and one for testing its predictions.

Holding out a test set lets us:
* Evaluate the model's performance by comparing its predictions on the test set with the actual labels.
* Assess the model's accuracy on data it did not see during training.


The dataset contains known label values, which allows us to train a classifier to learn the relationship between features and labels.
The `scikit-learn` package provides the `train_test_split` function to randomly divide the data.
Here, 70% of the data is used for training and 30% is reserved for testing, which is a common choice.
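
To see what `test_size` and `random_state` do in isolation, here is a small sketch on toy arrays. The names `X_demo` and `y_demo` and their values are arbitrary and stand in for the real features and labels.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 samples with 2 features each (values are arbitrary)
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.array([0, 1] * 5)

# Hold out 30% of the rows; random_state fixes the shuffle so the split is repeatable
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.30, random_state=0)
print(X_tr.shape, X_te.shape)  # (7, 2) (3, 2)

# Repeating the call with the same random_state reproduces the same split
X_tr2, X_te2, _, _ = train_test_split(X_demo, y_demo, test_size=0.30, random_state=0)
print(np.array_equal(X_te, X_te2))  # True
```

Fixing `random_state` keeps results comparable between runs; omitting it gives a different random split each time.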

In [None]:
from sklearn.model_selection import train_test_split

# Split data 70%-30% into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)

print(f'Training cases: {X_train.shape[0]}\nTest cases: {X_test.shape[0]}')

#### Train and Evaluate a Binary Classification Model
Train the model by fitting it to the training features (`X_train`) and the training labels (`y_train`).

* Choose an algorithm for training; here, we use *logistic regression*, a common classification method.
* Set a regularization parameter, which counteracts bias in the sample and helps the model generalize by avoiding overfitting to the training data.

**Note**: Settings for a machine learning algorithm, such as the regularization rate, are generally referred to as *hyperparameters*. *Parameters* are values learned from the data during training (such as the model's coefficients), while *hyperparameters* are defined externally from the data, before training starts.
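
In scikit-learn, `LogisticRegression` takes `C`, the inverse of the regularization strength, which is why the notebook passes `C=1/reg`. The sketch below illustrates the effect on synthetic data generated with `make_classification` (a stand-in, not the diabetes dataset); `X_demo`, `y_demo`, and `norms` are illustrative names.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (not the diabetes dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

norms = {}
for reg in (0.01, 100):
    # C is the inverse of regularization strength: a larger reg means a smaller C
    clf = LogisticRegression(C=1/reg, solver="liblinear").fit(X_demo, y_demo)
    norms[reg] = np.linalg.norm(clf.coef_)
    print(f"reg={reg}: C={1/reg}, coefficient norm={norms[reg]:.3f}")
```

A higher regularization rate shrinks the learned coefficients toward zero, which limits how closely the model can fit quirks of the training sample.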

In [None]:
# Train the model
from sklearn.linear_model import LogisticRegression

# Set regularization rate
reg = 0.01

# Train a logistic regression model on the training set.
# scikit-learn's C is the inverse of regularization strength, so C = 1/reg.
model = LogisticRegression(C=1/reg, solver="liblinear").fit(X_train, y_train)
print(model)

#### Evaluate the Model with Test Data
Use the trained model to predict labels for the test set, then compare these predictions to the actual labels to evaluate performance.


In [None]:
predictions = model.predict(X_test)
print('Predicted labels: ', predictions)
print('Actual labels:    ', y_test)

#### Check the accuracy of the predictions
The arrays of labels are too long to compare by eye, so we use the metrics provided by scikit-learn to evaluate the model.

The simplest metric is *accuracy*: the proportion of labels the model predicted correctly.

In [None]:
from sklearn.metrics import accuracy_score

print('Accuracy: ', accuracy_score(y_test, predictions))

##### Interpretation:
An accuracy of 0.789 means the model correctly predicted about 79% of the test cases.

Accuracy is shown as a decimal between 0 and 1, where 1.0 indicates perfect predictions and 0.0 means none were correct.
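
Accuracy is the number of correct predictions divided by the total number of predictions. The sketch below computes it by hand on made-up labels (illustrative values, not the notebook's results) and checks the result against `accuracy_score`.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels for illustration only
actual    = np.array([0, 1, 1, 0, 1, 0, 0, 1, 0, 0])
predicted = np.array([0, 1, 0, 0, 1, 0, 1, 1, 0, 0])

# Accuracy = correct predictions / total predictions (8 of 10 match here)
manual_accuracy = np.mean(actual == predicted)
print(manual_accuracy, accuracy_score(actual, predicted))
```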

## Summary

Here we prepared our data by splitting it into training and test datasets, and applied logistic regression, a way of assigning binary labels to our data. Our model was able to predict whether patients had diabetes with what appears to be reasonable accuracy. But is this good enough? In the next notebook we will look at alternatives to accuracy that can be much more useful in machine learning.