# Day 2: Basic ML Concepts & Medical Dataset Exploration

Welcome to **Day 2** of our elective: **AI in White Coat**. Today, we’ll build on our HPC familiarity and start diving into **core ML concepts** and **practical dataset exploration**—this time using **BloodMNIST**, a dataset of blood cell images from the [MedMNIST](https://medmnist.com/) collection.

By the end of this session, you’ll:

1. Understand basic ML terminology (classification vs. regression).
2. Practice a *prompt-first* approach to load and explore a small **medical** dataset (BloodMNIST) from [Hugging Face Datasets](https://huggingface.co/datasets/MedMNIST/bloodmnist).
3. Perform a simple train/test split and build a baseline classifier (e.g., Logistic Regression).
4. Evaluate your model’s accuracy and examine a confusion matrix.

---
## 1. Quick Recap of Day 1
- We learned how to log in to our HPC at `10.20.110.114`.
- We tried out basic Linux commands and verified GPU availability.
- We also practiced a "prompt-first" approach using an LLM.

If you haven’t already, please create or navigate to your `day2_notebook` folder on the HPC. Let’s get started!

## 2. Classification vs. Regression (Conceptual Overview)

In **Machine Learning**, we often distinguish between:
- **Classification**: Predicting discrete labels (e.g., disease vs. no disease, normal vs. abnormal imaging). 
- **Regression**: Predicting continuous values (e.g., a patient’s blood pressure, hospital length of stay).

### Clinical Examples
- **Classification**: Is a given microscopic blood cell image indicative of a particular cell type or abnormality?
- **Regression**: Predict the exact quantity of certain blood components based on image features.

Today, we’ll do a **classification** example using the **BloodMNIST** dataset. Future sessions may explore more specialized tasks or deeper architectures.
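
To make the distinction concrete, here is a minimal sketch on made-up numbers (the feature and target values are hypothetical, not clinical data). The regression part predicts a number with a least-squares line, while the classification part predicts a category with a simple threshold rule.

```python
import numpy as np

# Hypothetical toy data: one numeric feature per patient (values are made up).
feature = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

# Regression target: a continuous quantity.
continuous_target = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.1])
# Classification target: a discrete label (0 = normal, 1 = abnormal).
discrete_label = np.array([0, 0, 0, 1, 1, 1])

new_feature = 4.2  # an unseen example

# Regression: fit a straight line by least squares and predict a number.
slope, intercept = np.polyfit(feature, continuous_target, deg=1)
predicted_value = slope * new_feature + intercept

# Classification: place a threshold midway between the two groups
# and predict a category.
threshold = (feature[discrete_label == 0].max()
             + feature[discrete_label == 1].min()) / 2
predicted_label = int(new_feature >= threshold)

print(f"Regression output (continuous): {predicted_value:.2f}")
print(f"Classification output (discrete): {predicted_label}")
```

The regression output can take any real value, whereas the classification output is restricted to the label set.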

## 3. Installing and Importing Libraries (Prompt-First)

On some HPCs, libraries like `datasets`, `numpy`, `scikit-learn`, etc., might not be installed by default. You may need to prompt the LLM for an installation script.

### Example Prompt
```
I am on a remote HPC (Ubuntu). I have Python 3 and pip.
Please generate a shell command to:
1. Upgrade pip
2. Install datasets, scikit-learn, numpy, matplotlib
```

After receiving the shell command from the LLM, run it in your terminal or from a notebook cell (e.g., the shell escape `!pip install ...` or the `%pip install ...` magic).

In [None]:
# ====== LLM-GENERATED CODE CELL (Install Dependencies) ======
# If these libraries are already installed, you can skip or comment this out.
# Example:
# !pip install --upgrade pip
# !pip install datasets scikit-learn numpy matplotlib

# Uncomment or paste your LLM output here if needed.


Now let’s **import** our libraries. If something is missing, prompt the LLM for troubleshooting help or re-install the relevant package.

In [None]:
# ====== Python Imports (adjust as needed) ======
import numpy as np
import matplotlib.pyplot as plt
from datasets import load_dataset
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix

print("Imports complete.")
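
If one of the imports fails, it can help to check which packages the current interpreter can actually see before asking the LLM for help. The sketch below uses only the standard library. The helper name `missing_packages` is hypothetical. Note that the import name can differ from the pip name (`scikit-learn` is imported as `sklearn`).

```python
import importlib.util


def missing_packages(import_names):
    """Return the import names that this interpreter cannot find."""
    return [name for name in import_names
            if importlib.util.find_spec(name) is None]


# Import names (not pip names) for the libraries used in this notebook.
required = ["numpy", "matplotlib", "datasets", "sklearn"]
missing = missing_packages(required)
print("Missing packages:", missing if missing else "none")
```

Any names it reports are the ones to re-install in the environment your notebook kernel uses.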

## 4. Loading the BloodMNIST Dataset

**BloodMNIST** is part of the [MedMNIST](https://medmnist.com/) suite, containing images of blood cells with multiple classes.

### Prompt Example
```
I want to load the "MedMNIST/bloodmnist" dataset from Hugging Face in a Python notebook.
Please generate code using the 'datasets' library.
Then show how to explore the dataset (like printing a few samples).
```

We’ll do it step by step.

In [None]:
# ====== LLM-GENERATED CODE CELL (Load Dataset) ======
# Example prompt's result:

# We'll load the BloodMNIST dataset from Hugging Face.
bloodmnist = load_dataset("MedMNIST/bloodmnist")
print(bloodmnist)

# Let's see a sample from the 'train' split.
sample = bloodmnist['train'][0]
sample

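MedMNIST defines **train**, **validation**, and **test** splits for BloodMNIST. Check the `print(bloodmnist)` output above for the exact split names in the Hugging Face copy.

You can explore how many samples are in each split and how they’re structured. BloodMNIST images are 28x28 RGB color images (the same height and width as MNIST, but with three color channels), so they’re small and convenient for quick demos!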

In [None]:
# ====== Dataset Exploration ======
# For instance, let's print the sizes of each split.

print("Train size:", len(bloodmnist['train']))
print("Test size: ", len(bloodmnist['test']))

# If there's a 'validation' split, let's see:
if 'validation' in bloodmnist:
    print("Validation size:", len(bloodmnist['validation']))
else:
    print("No split named 'validation' found; see print(bloodmnist) for the available split names.")
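
Beyond split sizes, it is worth confirming the shape and channel count of a single image, since that determines the feature length after flattening. The helper below is a sketch (the name `describe_image` is hypothetical) and is demonstrated on a random stand-in array rather than real data.

```python
import numpy as np


def describe_image(image):
    """Summarize the array form of one image (PIL image or array-like)."""
    arr = np.asarray(image)
    # A 2D array is single-channel; otherwise the last axis holds the channels.
    channels = 1 if arr.ndim == 2 else arr.shape[-1]
    return {
        "shape": arr.shape,
        "dtype": str(arr.dtype),
        "channels": channels,
        "min": int(arr.min()),
        "max": int(arr.max()),
        "flattened_length": int(arr.size),
    }


# Stand-in for one sample: a random 28x28 RGB image (not real data).
rng = np.random.default_rng(0)
fake_image = rng.integers(0, 256, size=(28, 28, 3), dtype=np.uint8)
print(describe_image(fake_image))
```

With the dataset loaded as above, `describe_image(bloodmnist['train'][0]['image'])` should report the same fields for a real sample, assuming the image column is named `image` as in the earlier cells.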

## 5. Visualizing a Few Samples

Before jumping into modeling, let’s see what these images look like. Since they’re small, it’s easy to plot them using `matplotlib`.

In [None]:
# ====== LLM-GENERATED CODE CELL (Show Samples) ======
# Example prompt:
# "Please generate code to display 5 random images from the BloodMNIST train split
#   using matplotlib, including their labels."

import random
fig, axes = plt.subplots(1, 5, figsize=(10, 2))
for ax in axes:
    idx = random.randint(0, len(bloodmnist['train']) - 1)
    sample = bloodmnist['train'][idx]
    image = np.array(sample['image'])
    label = sample['label']
    # cmap only applies to single-channel arrays; it is ignored for RGB images.
    ax.imshow(image, cmap='gray')
    ax.set_title(f"Label: {label}")
    ax.axis('off')

plt.tight_layout()
plt.show()

The **label** is an integer code for the type of blood cell (e.g., neutrophil, lymphocyte, platelet).
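
Integer labels are easier to read once mapped to names. The mapping below follows the eight BloodMNIST classes as listed in the MedMNIST documentation. Treat it as an assumption and verify it against the label metadata of the dataset you loaded. The helper `class_distribution` is a hypothetical name, and the example labels are made up.

```python
from collections import Counter

# Assumed label-to-name mapping (verify against the dataset's own metadata).
BLOODMNIST_CLASSES = {
    0: "basophil",
    1: "eosinophil",
    2: "erythroblast",
    3: "immature granulocyte",
    4: "lymphocyte",
    5: "monocyte",
    6: "neutrophil",
    7: "platelet",
}


def class_distribution(labels, names=BLOODMNIST_CLASSES):
    """Count how often each class occurs, keyed by class name."""
    counts = Counter(int(label) for label in labels)
    return {names.get(k, f"class {k}"): counts[k] for k in sorted(counts)}


# Made-up labels standing in for a real label column.
example_labels = [6, 6, 1, 7, 4, 6, 1, 0]
print(class_distribution(example_labels))
```

On the real data, something like `class_distribution(bloodmnist['train']['label'])` would show how balanced the classes are, assuming the label column is named `label` as in the cell above.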

## 6. Preparing Data for Classification

Since we’ll do a **baseline** classification with scikit-learn, we need numeric arrays. BloodMNIST images are 28x28 pixels with 3 color channels (RGB), so **flattening** each one gives a 1D vector of 28*28*3 = 2352 values (a single-channel image like classic MNIST would give 784).

**Note**: This is **not** how you’d typically do modern image-based classification—usually, you’d use a CNN. But it demonstrates the ML workflow in a simple manner.

In [None]:
# ====== Data Conversion to Numpy Arrays ======
# We'll write a helper function that flattens images and returns an (X, y) tuple.

def flatten_data(dataset_split):
    X_list = []
    y_list = []
    for item in dataset_split:
        image = np.array(item['image']).flatten()  # Flatten to 1D (2352 values for 28x28 RGB)
        label = item['label']
        X_list.append(image)
        y_list.append(label)
    return np.array(X_list), np.array(y_list)

# We'll create train/test sets from the dataset.
X_train, y_train = flatten_data(bloodmnist['train'])
X_test, y_test = flatten_data(bloodmnist['test'])

print("X_train shape:", X_train.shape)
print("y_train shape:", y_train.shape)
print("X_test shape: ", X_test.shape)
print("y_test shape: ", y_test.shape)

You should see `(N, D)` for `X_train`, where `D` is the flattened image length (2352 for 28x28 RGB images), and `(N,)` for `y_train`. This is the layout scikit-learn expects.
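
As an alternative to looping over items, the flattening can be done in one vectorized step, and the pixel values can be rescaled to the range [0, 1] at the same time. This is a sketch under the assumption that pixels are 8-bit (0 to 255). The helper name `flatten_and_scale` is hypothetical, and the demo uses random stand-in images.

```python
import numpy as np


def flatten_and_scale(images):
    """Stack images into an (N, D) float array with values in [0, 1]."""
    stacked = np.stack([np.asarray(img) for img in images])
    flat = stacked.reshape(len(stacked), -1)  # one row per image
    return flat.astype(np.float32) / 255.0    # assumes 8-bit pixel values


# Four random 28x28 RGB stand-in images (not real data).
rng = np.random.default_rng(0)
fake_images = rng.integers(0, 256, size=(4, 28, 28, 3), dtype=np.uint8)
X_demo = flatten_and_scale(fake_images)
print(X_demo.shape, X_demo.dtype)
```

Each 28x28 RGB image becomes a row of 2352 features. Scaled inputs also tend to make iterative solvers such as the one in logistic regression converge more easily.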

## 7. (Optional) Create a Validation Split

Often, we reserve some data for **validation** to tune hyperparameters. If BloodMNIST doesn’t have a separate validation set, we can create one manually.

### Prompt Example:
```
Using scikit-learn's train_test_split, please create a 10% validation split out of X_train.
```

For brevity, we’ll use only the train and test splits for now. If you wish to add a validation split, go ahead!
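
If you do want a validation split, one possible result of the prompt above looks like the sketch below. It uses random stand-in arrays (with 16 features instead of the real feature count, to keep it light) and adds `stratify` so each class keeps roughly the same proportion in both parts. The variable names are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data: 200 samples, 8 classes with 25 samples each (not real data).
rng = np.random.default_rng(42)
X_demo = rng.random((200, 16))
y_demo = np.repeat(np.arange(8), 25)

# Hold out 10% of the training data for validation, preserving class balance.
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.10, stratify=y_demo, random_state=42
)

print("Train part:", X_tr.shape, "Validation part:", X_val.shape)
```

Fixing `random_state` makes the split reproducible, which matters when comparing models across runs.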

## 8. Training a Simple Logistic Regression Model

We’ll do a **baseline** classifier using logistic regression. Later, we’ll see more sophisticated approaches.

### Prompt Example
```
Generate Python code to train a LogisticRegression model with max_iter=1000 on (X_train, y_train)
Then evaluate its accuracy on (X_test, y_test) and print the result.
```


In [None]:
# ====== LLM-GENERATED CODE CELL (Logistic Regression) ======
# Example:
# Note: raw pixel values (0-255) can slow solver convergence; rescaling X to [0, 1] often helps.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

test_preds = clf.predict(X_test)
test_acc = accuracy_score(y_test, test_preds)
print("Test Accuracy:", test_acc)
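
Logistic regression is sensitive to feature scale, so a common variation is to chain a scaler and the classifier in a scikit-learn pipeline. The sketch below is not the notebook's own model. It runs on synthetic, well-separated data with pixel-like value ranges, and all variable names are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class data with pixel-like magnitudes (not real images).
rng = np.random.default_rng(0)
n_per_class, n_features = 60, 12
class_means = [40.0, 120.0, 200.0]
X_demo = np.vstack([
    rng.normal(mean, 20.0, size=(n_per_class, n_features))
    for mean in class_means
])
y_demo = np.repeat(np.arange(3), n_per_class)

# Simple holdout: shuffle the indices and keep the last 25% for testing.
order = rng.permutation(len(y_demo))
split = int(0.75 * len(order))
train_idx, test_idx = order[:split], order[split:]

# The scaler is fit on the training data only, then applied to the test data.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_demo[train_idx], y_demo[train_idx])
demo_acc = model.score(X_demo[test_idx], y_demo[test_idx])
print("Holdout accuracy on synthetic data:", demo_acc)
```

The same pipeline object can be fit on `X_train` and `y_train` from this notebook in place of the plain `LogisticRegression`. Whether it improves accuracy on BloodMNIST is something to check empirically.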

## 9. Confusion Matrix

A **confusion matrix** can help us see how well the model is distinguishing between the different blood cell types.

### Prompt Example
```
Please generate code to compute a confusion matrix for test_preds and y_test using scikit-learn.
Then display it using matplotlib.
```


In [None]:
# ====== LLM-GENERATED CODE CELL (Confusion Matrix) ======
cm = confusion_matrix(y_test, test_preds)
print("Confusion Matrix:")
print(cm)

plt.matshow(cm, cmap=plt.cm.Blues)
plt.title("Confusion Matrix")
plt.colorbar()
plt.xlabel("Predicted")
plt.ylabel("True")
plt.show()
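
Raw counts can be hard to compare when classes have different sizes. In scikit-learn's confusion matrix, rows are true labels and columns are predicted labels, so dividing each diagonal entry by its row total gives per-class recall. The sketch below uses a made-up 3x3 matrix, and the helper name `per_class_recall` is hypothetical.

```python
import numpy as np


def per_class_recall(cm):
    """Fraction of each true class that was predicted correctly."""
    cm = np.asarray(cm, dtype=float)
    row_totals = cm.sum(axis=1)
    # Classes with no true samples get recall 0 instead of a division error.
    return np.divide(np.diag(cm), row_totals,
                     out=np.zeros_like(row_totals), where=row_totals > 0)


# Made-up confusion matrix: rows are true classes, columns are predictions.
toy_cm = np.array([
    [8, 2, 0],
    [1, 6, 3],
    [0, 0, 10],
])
recall = per_class_recall(toy_cm)
hardest = int(np.argmin(recall))
print("Per-class recall:", recall)
print("Hardest class index:", hardest)
```

Here class 1 has the lowest recall (6 of 10 correct), so it would be the first place to look for systematic confusions.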

## 10. Observations & Clinical Relevance

- **Model Performance**: We’re using a very simple approach—flattening images and using logistic regression. The accuracy may be limited.
- **Why It Matters**: Even a basic workflow reveals the process of training, testing, and interpreting results. In real-world practice, one might use **convolutional neural networks (CNNs)** for image classification.
- **Clinical Tie-In**: In a real scenario, classifying blood cells automatically could help in high-volume labs or prescreening tasks to identify abnormal cells.
- **Next Steps**: We’ll explore more sophisticated models or domain-specific techniques in future sessions.

## 11. Assignment #2: A Deeper Dive

**Task**: Using the same BloodMNIST dataset (or another MedMNIST subset if you’re adventurous):
1. **Perform an Extended EDA**:
   - Prompt the LLM to generate code for: class distribution, more image plots, and summary stats.
2. **Try a Different Classifier** (e.g., `RandomForestClassifier` or `SGDClassifier`) and compare accuracy with Logistic Regression.
3. **Document** your findings in your daily log:
   - How does the confusion matrix differ?
   - Which classes are easiest/hardest to predict?
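
One way to organize the comparison is a small loop that fits each candidate on the same training data and records test accuracy. This is only a starting sketch on synthetic data. The helper name `compare_classifiers` and the chosen settings are illustrative, not recommended values.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.metrics import accuracy_score


def compare_classifiers(models, X_train, y_train, X_test, y_test):
    """Fit each model and return a dict of test accuracies by model name."""
    results = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        results[name] = accuracy_score(y_test, model.predict(X_test))
    return results


# Synthetic 3-class data (not real images): class k is shifted by 3*k.
rng = np.random.default_rng(1)
y_all = np.repeat(np.arange(3), 50)
X_all = rng.normal(size=(150, 10)) + y_all[:, None] * 3.0
order = rng.permutation(150)
tr, te = order[:110], order[110:]

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "sgd": SGDClassifier(random_state=0),
}
demo_results = compare_classifiers(
    candidates, X_all[tr], y_all[tr], X_all[te], y_all[te]
)
for name, acc in demo_results.items():
    print(f"{name}: {acc:.3f}")
```

Swapping in `X_train`, `y_train`, `X_test`, and `y_test` from this notebook gives the comparison the assignment asks for.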

### Bonus
- Use an LLM to **optimize hyperparameters** (e.g., `n_estimators` in `RandomForestClassifier`) and see if performance improves.
- Reflect on how this method might be extended to detecting **rare** abnormal cells.

**Feel free to add** any interesting observations or creative solutions. Good luck, and see you on **Day 3**!

# End of Day 2 Notebook

Today, you learned how to:
- Install and import necessary libraries.
- Load a **medical** dataset (BloodMNIST) from Hugging Face.
- Convert data to a format suitable for scikit-learn.
- Train, evaluate, and interpret a baseline classifier.

Remember to record your progress, roadblocks, and reflections in your **Daily Log** or **Portfolio**. If you have questions, ask your mentors or consult the LLM for additional guidance!
