<a href="https://colab.research.google.com/github/tritonhacks/TritonHacks2025-ML-starter-kit/blob/main/ml.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# TritonHacks 2025: Introduction to AI/ML Starter Kit Part II: Machine Learning (ML)

Welcome to the Intro to AI/ML Starter Kit for TritonHacks 2025! This is the second of two notebooks in this repo, and it focuses on creating ML models.

## Importing Libraries
As always, we will begin by importing libraries. In this notebook, we'll use scikit-learn to help us train our models. The library offers a wide variety of tools that make the training process much easier. Feel free to check out the official documentation on [scikit-learn's website](https://scikit-learn.org/stable/) to see how these functions work under the hood.

In [1]:
# Import libraries
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
import pandas as pd

## Importing Data
We'll also need to import our data. Check out `Importing Data` in the EDA file for more detailed instructions on this step.

In [None]:
# Method #1: Direct Upload
from google.colab import files
uploaded = files.upload()
path_to_file = 'healthcare-dataset-stroke-data.csv'

In [None]:
# Method #2: Mount Google Drive

path_to_file = 'healthcare-dataset-stroke-data.csv' # REPLACE WITH PATH TO YOUR FILE

from google.colab import drive
drive.mount('/content/drive')
path_to_file = '/content/drive/MyDrive/' + path_to_file

In [8]:
df = pd.read_csv(path_to_file)

## Preprocessing
Now, we need to get our data ready for training. In the EDA file, we identified explanatory variables and analyzed a few graphs showing the relationships between them. In this notebook, we will train our model on the following variables: gender, age, hypertension, heart disease, avg. glucose level, BMI, and smoking status. Feel free to play around with the selection to see which combination yields a better model.

Let's first remove the variables we don't need and look at the new dataframe:

In [None]:
df = df.drop(['id', 'ever_married', 'work_type', 'Residence_type'], axis = 1)
df.head()

### One Hot Encoding
Notice how entries in `gender` and `smoking_status` are formatted as *strings*. But wait, our ML model can't take strings as inputs! We need to perform **one-hot encoding** (OHE) on the data to make sure all entries are expressed numerically. This means that each value a variable can take gets its own column, filled with True/False (or 1s and 0s). Take `smoking_status`, for instance: the variable can take on the values formerly smoked, never smoked, smokes, or unknown. Once we OHE, we get the following values:

<center>

| | Formerly Smoked | Never Smoked | Smokes | Unknown | Status |
|:-| :-: | :-: | :-: | :-: | :-:|
|Person 1 | 1 | 0 | 0 | 0 | Formerly Smoked |
|Person 2 | 0 | 1 | 0 | 0 | Never Smoked |
|Person 3 | 0 | 0 | 1 | 0 | Smokes |
|Person 4 | 0 | 0 | 0 | 1 | Unknown |

</center>
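
To make the table above concrete, here is a minimal plain-Python sketch of the same idea. The `one_hot` helper and the sample values are made up for illustration; in the notebook itself, `pd.get_dummies` does this work for us.

```python
# Toy one-hot encoding in plain Python (illustrative helper, not part of pandas)
statuses = ["formerly smoked", "never smoked", "smokes", "Unknown"]  # made-up sample

def one_hot(values):
    """Return (categories, rows); each row has a single 1 in its category's slot."""
    categories = sorted(set(values))
    rows = [[1 if value == cat else 0 for cat in categories] for value in values]
    return categories, rows

categories, rows = one_hot(statuses)
print(categories)
for status, row in zip(statuses, rows):
    print(f"{status:>16}: {row}")
```

Every row contains exactly one 1, which is where the name "one-hot" comes from.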

Let's edit the dataframe to OHE both `gender` and `smoking_status`:

In [None]:
# One-hot encoding

one_hot = pd.get_dummies(
    df,
    columns=['gender', 'smoking_status'],
    prefix=['gender', 'smoking']
)
df = one_hot
# Fill any missing numeric values with that column's median
df = df.fillna(df.median(numeric_only=True))
df.head()

Now, we need to separate our explanatory and response variables. The explanatory variables (commonly denoted `X`) are the inputs used to train the model, while the response variable (commonly denoted `y`) holds the actual outcomes. Comparing the model's predictions against `y` tells us how good our model is.

In [12]:
# Define X and y

X = df.drop(["stroke"], axis = 1)
y = df["stroke"]

Recall that in the EDA notebook, we discussed a few data normalization processes. We'll use scikit-learn's **Z-score normalization** to rescale `age`, `avg_glucose_level`, and `bmi` so that their values don't dominate the training process. (These features vary over much larger ranges than the 0/1 columns, so without rescaling the ML model is more prone to fit along these variables and overlook the others.)

In [13]:
# Scaling dominant numeric features

to_scale = ['age', 'avg_glucose_level', 'bmi']
not_scaled = [col for col in X.columns if col not in to_scale]

scaler = StandardScaler() # Z-score normalization
scaled_part = scaler.fit_transform(X[to_scale])
scaled_df = pd.DataFrame(scaled_part, columns=to_scale)

X = pd.concat([scaled_df, X[not_scaled].reset_index(drop=True)], axis=1)
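
For reference, Z-score normalization replaces each value $x$ with $z = \frac{x - \mu}{\sigma}$, where $\mu$ is the column mean and $\sigma$ its standard deviation. The sketch below works this out by hand on made-up ages; the `z_scores` helper is hypothetical and only mirrors what `StandardScaler` computes per column.

```python
from statistics import mean, pstdev

ages = [20.0, 40.0, 60.0, 80.0]  # made-up values for illustration

def z_scores(values):
    """Rescale values to mean 0 and standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation, as StandardScaler uses
    return [(v - mu) / sigma for v in values]

scaled = z_scores(ages)
print([round(z, 3) for z in scaled])
```

After scaling, the column has mean 0 and standard deviation 1, regardless of the original units.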

### Train Test Split
But wait, if we train our model on 100% of the data, how do we evaluate the quality of our model? We need the `train_test_split` function to divide the data into two batches. One batch is used to *train* the model and the other to *test* the model. Typically, 70-80% of the data is allocated for training and the remaining 20-30% for testing.

However, there's still a small issue: in the EDA notebook, we saw that the number of datapoints for non-stroke cases is significantly larger than the number of datapoints for stroke cases. On any given train-test split, the vast majority of the data will be non-stroke cases. That's not very useful for our model, because it can score well by always predicting non-stroke without learning to differentiate between the two classes. To solve this imbalance in data, we will *undersample* the non-stroke class until the stroke and non-stroke classes are roughly the same size.

In [None]:
# Undersample the majority class, then train-test split
from imblearn.under_sampling import RandomUnderSampler

rus = RandomUnderSampler(random_state=42)
X_resampled, y_resampled = rus.fit_resample(X, y)

X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled, test_size = 0.3, random_state = 0)
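
As a rough illustration of what random undersampling does, the plain-Python sketch below keeps a random subset of each class so that every class ends up as small as the rarest one. The `random_undersample` helper and the 95/5 label counts are made up for this example; it is a simplified stand-in for `RandomUnderSampler`, not its actual implementation.

```python
import random
from collections import Counter

def random_undersample(labels, seed=42):
    """Return sorted row indices with every class cut down to the smallest class size."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    smallest = min(len(indices) for indices in by_class.values())
    kept = []
    for indices in by_class.values():
        kept.extend(rng.sample(indices, smallest))  # sample without replacement
    return sorted(kept)

labels = [0] * 95 + [1] * 5  # made-up imbalance: 95 non-stroke, 5 stroke
kept = random_undersample(labels)
counts = Counter(labels[i] for i in kept)
print(counts)
```

The trade-off is that most of the non-stroke rows are thrown away, so the balanced dataset is much smaller than the original.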

## Training a Model

Let's start training models! The first machine learning model we will use is a linear classifier. Think of it as an extension of the linear equations you have likely seen in algebra class. We use a classifier because we are predicting one of two outcomes: the patient either does or does not have a stroke. Since we aren't predicting a numeric value, we use `LogisticRegression` from `sklearn`. If you're trying to predict a numeric value, use `LinearRegression` instead.

In [None]:
# Linear Classifier

from sklearn.linear_model import LogisticRegression

linear_classifier = LogisticRegression()
linear_classifier.fit(X_train, y_train)
accuracy = (linear_classifier.predict(X_test) == y_test).mean()
print(f'The linear classifier has a {accuracy * 100:.2f}% accuracy!')
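
To show how a linear equation becomes a classifier, here is a minimal sketch of the idea behind logistic regression: compute a weighted sum of the features, squash it into a probability with the sigmoid function, and predict "stroke" when that probability exceeds 0.5. The weights, bias, and patient values below are hypothetical, not the ones learned by `linear_classifier`.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_probability(features, weights, bias):
    """Linear equation (w . x + b) followed by the sigmoid."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)

weights = [0.8, -0.5]   # hypothetical learned weights
bias = -0.2             # hypothetical learned bias
patient = [1.5, 0.3]    # hypothetical (already scaled) feature values

probability = predict_probability(patient, weights, bias)
label = 1 if probability > 0.5 else 0
print(f"P(stroke) = {probability:.3f}, predicted label = {label}")
```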

### Precision-Recall Curves (PR Curve)
In the linear classifier, we quantified the performance of our model by looking at the proportion of predictions that were correct. Another way to analyze the quality of our model is using a precision-recall curve (PR curve). The graph plots precision against recall for a given model at each decision threshold level.

**Precision:**
The precision measures the proportion of True Positives predicted by the model out of all positive predictions (True Positive + False Positive). It is calculated as:
$$\text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}}$$

**Recall:**
The recall measures the proportion of True Positives predicted by the model out of all *actual* positive values (True Positive + False Negative). It is given by:
$$\text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}}$$

When we train a model, the model actually outputs its prediction as a probability. If the value is larger than the decision threshold (0.5 by default), the model outputs "stroke"; otherwise, it outputs "non-stroke". In a precision-recall curve, we vary this threshold value and observe how the precision and recall of the model respond. The area under the curve (AUC) summarizes how well the model performs: the closer it is to 1, the better.
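
The effect of moving the threshold can be worked out by hand. The sketch below uses made-up labels and probabilities, and the `precision_recall_at` helper is hypothetical; it simply applies the two formulas above at a few thresholds.

```python
def precision_recall_at(y_true, y_scores, threshold):
    """Compute precision and recall when predicting 1 for scores above the threshold."""
    predicted = [1 if score > threshold else 0 for score in y_scores]
    tp = sum(1 for t, p in zip(y_true, predicted) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, predicted) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, predicted) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 1.0  # convention when nothing is predicted positive
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

y_true = [0, 0, 1, 1, 0, 1]                    # made-up actual labels
y_scores = [0.1, 0.4, 0.35, 0.8, 0.7, 0.9]     # made-up predicted probabilities

for threshold in (0.3, 0.5, 0.75):
    precision, recall = precision_recall_at(y_true, y_scores, threshold)
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
```

On this toy data, raising the threshold increases precision (fewer false alarms) while recall drops or stays flat (more missed strokes), which is the trade-off the PR curve visualizes.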

In [None]:
from sklearn.metrics import precision_recall_curve, auc
from matplotlib import pyplot as plt

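def precision_recall_score(X_data, y_data, model):
  # Predicted probability of the positive class (stroke) for each row of X_data
  y_scores = model.predict_proba(X_data)[:, 1]
  precision, recall, thresholds = precision_recall_curve(y_data, y_scores)
  auc_score = auc(recall, precision)  # area under the precision-recall curve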

  # Plot precision-recall curve
  plt.figure(figsize=(8, 6))
  plt.plot(recall, precision, label=f'Precision-Recall Curve (AUC = {auc_score:.2f})')
  plt.xlabel('Recall')
  plt.ylabel('Precision')
  plt.title('Precision-Recall Curve')
  plt.legend()
  plt.show()

  return precision, recall, auc_score

### K-Nearest Neighbors (KNN) Classifier

Next, let's make a KNN Classifier. How does this ML model work? Think of every patient we have as a data point in a plot, where each point has a color indicating whether that patient had a stroke or not. To predict whether a new patient has a stroke, we plot it as a point alongside the other data points and look at the `k` closest points to it, hence the name `k` nearest neighbors. Among those `k` points, if more represent patients with stroke than patients without, we predict that our new patient has a stroke. If fewer do, we predict that our patient has no stroke. If the counts are equal, the decision may be decided by a coin flip (or you could just make the number of neighbors we look at an odd number)!

Currently, we are looking at the nine closest neighbors. Feel free to play around with the number of neighbors (`n_neighbors`) and see what happens with the predictions!

In [None]:
# KNN Classification
from sklearn.neighbors import KNeighborsClassifier

KNN = KNeighborsClassifier(n_neighbors = 9)
KNN.fit(X_train, y_train)
y_predicted = KNN.predict(X_test)
acc = accuracy_score(y_test, y_predicted)  # (true labels, predicted labels)
print(f'Accuracy: {acc}')

precision, recall, auc_score = precision_recall_score(X_test, y_test, KNN)
print(f'AUC Score: {auc_score}')
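
For intuition, the neighbor-voting idea can be sketched in a few lines of plain Python. The `knn_predict` helper and the 2D points are made up for illustration; `KNeighborsClassifier` is far more efficient and handles many more features.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, new_point, k=3):
    """Predict the label that is most common among the k closest training points."""
    distances = sorted(
        (math.dist(point, new_point), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Made-up 2D points: label 0 = non-stroke, label 1 = stroke
train_points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1)]
train_labels = [0, 0, 0, 1, 1]

print(knn_predict(train_points, train_labels, (0.15, 0.15), k=3))
print(knn_predict(train_points, train_labels, (0.95, 1.0), k=3))
```

The first new point sits among the label-0 points and the second among the label-1 points, so the votes go 3-0 and 2-1 respectively.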

### Decision Tree Classifier

Another ML model we can make is the Decision Tree. You will eventually see a diagram of this, but the essence of how this works is that we ask a series of questions regarding patient info in order to determine whether the patient has a stroke or not.

One crucial aspect of the Decision Tree is how deep you want to make the tree (`max_depth`). Currently, we have the depth of the tree set at 4, but feel free to play around with the number to see how this affects the predictions.

In [None]:
from sklearn.tree import DecisionTreeClassifier as DT

dt = DT(max_depth = 4, random_state = 42)
dt.fit(X_train, y_train)
accuracy = dt.score(X_test, y_test)
print(f"Accuracy: {accuracy}")
precision, recall, auc_score = precision_recall_score(X_test, y_test, dt)
print(f'AUC Score: {auc_score}')
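
Conceptually, a trained Decision Tree behaves like a set of nested `if` statements. The sketch below is a hand-written toy tree of depth 2; its questions and thresholds are entirely hypothetical and were not learned from the dataset, so it only illustrates the structure that `dt` learns automatically.

```python
def toy_tree_predict(patient):
    """Walk a hand-written, purely illustrative decision tree of depth 2."""
    # Question 1 (hypothetical threshold): is the patient older than 60?
    if patient["age"] > 60:
        # Question 2a: does the patient have hypertension?
        if patient["hypertension"] == 1:
            return "Stroke"
        return "Non-stroke"
    # Question 2b (hypothetical threshold): is the average glucose level above 200?
    if patient["avg_glucose_level"] > 200:
        return "Stroke"
    return "Non-stroke"

example = {"age": 70, "hypertension": 1, "avg_glucose_level": 100}
print(toy_tree_predict(example))
```

Increasing `max_depth` corresponds to allowing more nested questions before a prediction is made.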

#### Visualization

In [None]:
from sklearn.tree import export_graphviz
import graphviz

dot_data = export_graphviz(dt, out_file=None, feature_names=X.columns, class_names=['Non-stroke', 'Stroke'], filled=True, rounded=True, special_characters=True)
graph = graphviz.Source(dot_data)
graph

### Random Forest

A Random Forest is essentially a collection of Decision Trees, each trained on a slightly different view of the data. Each Decision Tree takes a "vote" on whether a particular patient has a stroke or not. Our prediction is whichever decision receives the most votes.

Like before, you can play around with the depth of each Decision Tree. However, the number of Decision Trees (`n_estimators`) could also affect the predictions, so try changing that as well!

In [None]:
from sklearn.ensemble import RandomForestClassifier as RF

rf = RF(n_estimators = 100, max_depth = 4, random_state = 0)
rf.fit(X_train, y_train)
accuracy = rf.score(X_test, y_test)
print(f"Accuracy: {accuracy}")

precision, recall, auc_score = precision_recall_score(X_test, y_test, rf)
print(f'AUC Score: {auc_score}')
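
The voting step can be sketched on its own. The `majority_vote` helper and the per-tree predictions below are made up. Note that this shows hard voting as described above, whereas scikit-learn's `RandomForestClassifier` averages each tree's predicted probabilities, which usually leads to the same answer.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the label predicted by the most trees."""
    return Counter(tree_predictions).most_common(1)[0][0]

# Made-up predictions from five trees for one patient (1 = stroke, 0 = non-stroke)
votes = [1, 0, 1, 1, 0]
print(majority_vote(votes))
```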

### Multi-layer Perceptron (MLP)
A Multi-layer Perceptron, also known as a feedforward neural network, forms the basis for most deep learning algorithms. The model essentially consists of layers of "neurons", where each neuron in layer $l$ is connected to every neuron in layer $l+1$. In the case of our MLP model, we will use 2 hidden layers of 128 and 64 neurons respectively. Our input layer will have one neuron for each feature column we pass to the network (`X.shape[1]` gives the exact count). Since we are doing a binary classification task, scikit-learn's `MLPClassifier` uses a single output neuron whose value is the predicted probability of stroke; an equivalent design would use 2 output neurons, one for predicting stroke and the other for predicting non-stroke. When we pass the input to the MLP, each neuron will perform a specific set of operations on the values (which we optimize when we train the model) before passing the output to neurons in the next layer. The process propagates through the network until it reaches the output layer, where we determine which classification is more likely.

**Some Extra Info:**
If you're wondering why MLP models are so common in deep learning, it's because these models allow us to approximate complicated functions as a composition of non-linear transformations. Let's take a neuron $x_1$ in layer 1 which is connected to neuron $x_2$ in layer 2. Ignoring the other neurons that feed into $x_2$, the transition from $x_1$ to $x_2$ is given by: $$x_2 = \text{ReLU}(w_{1,2} \cdot x_1 + b_{1,2})$$ If $w_{1,2} \cdot x_1 + b_{1,2}$ looks familiar, you're right! It's a linear transformation of $x_1$. We call $w_{1,2}$ the weight between the two neurons and $b_{1,2}$ the bias. When we train our model, we are specifically optimizing these values (through [backpropagation](https://en.wikipedia.org/wiki/Backpropagation)). The magic behind these models is *activation functions* like ReLU. The function is super simple: $$\text{ReLU}(x) =
\begin{cases}
0, & \text{if } x \leq 0 \\
x, & \text{if } x > 0
\end{cases}$$
but its nonlinear nature is key to allowing ML models to act as universal function approximators.
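
To make the weight, bias, and ReLU steps concrete, here is a minimal plain-Python sketch of one fully connected layer. The `dense_layer` helper and all of the numbers are hypothetical; a real network learns its weights and biases during training.

```python
def relu(x):
    """ReLU activation: negative inputs become 0, positive inputs pass through."""
    return max(0.0, x)

def dense_layer(inputs, weights, biases):
    """Each output neuron computes ReLU(sum of weight * input, plus its bias)."""
    return [
        relu(sum(w * x for w, x in zip(neuron_weights, inputs)) + bias)
        for neuron_weights, bias in zip(weights, biases)
    ]

inputs = [1.0, -2.0]                   # values from the previous layer (made up)
weights = [[0.5, 0.25], [-1.0, 1.0]]   # one row of hypothetical weights per output neuron
biases = [0.1, 0.0]                    # one hypothetical bias per output neuron

hidden = dense_layer(inputs, weights, biases)
print(hidden)
```

The second neuron's weighted sum is negative, so ReLU clips it to 0; stacking many such layers is what lets the network represent non-linear relationships.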

In [None]:
from sklearn.neural_network import MLPClassifier
mlp = MLPClassifier(hidden_layer_sizes = (128, 64), max_iter = 1000, random_state = 0)
mlp.fit(X_train, y_train)
y_predicted = mlp.predict(X_test)
acc = accuracy_score(y_test, y_predicted)  # (true labels, predicted labels)
print(f'Accuracy: {acc}')

precision, recall, auc_score = precision_recall_score(X_test, y_test, mlp)
print(f'AUC Score: {auc_score}')

And that concludes the machine learning notebook! Feel free to try out one of these models for your project. The five ML models that we chose certainly aren't the only models out there, so consider checking out other ML models that you might want to make!