# Day 2: Basic ML Concepts & Dataset Exploration

Welcome to **Day 2** of our elective: **AI in White Coat**. Today, we’ll build on our HPC familiarity and start diving into **core ML concepts** and **practical dataset exploration**. By the end of this session, you’ll:

1. Understand basic ML terminology (classification vs. regression).
2. Practice a *prompt-first* approach to load and explore a small open-source dataset from [Hugging Face](https://huggingface.co/datasets).
3. Perform a simple train/test split and build a baseline classifier (e.g., Logistic Regression).
4. Evaluate your model’s accuracy and examine a confusion matrix.

---
## 1. Quick Recap of Day 1
- We learned how to log in to our HPC at `10.20.110.114`.
- We tried out basic Linux commands and verified GPU availability.
- We also practiced a "prompt-first" approach using an LLM.

If you haven’t already, please create or navigate to your `day2_notebook` folder on the HPC. Let’s get started!

## 2. Classification vs. Regression (Conceptual Overview)

In **Machine Learning**, we often distinguish between:
- **Classification**: Predicting discrete labels (e.g., disease vs. no disease, normal vs. abnormal imaging). 
- **Regression**: Predicting continuous values (e.g., a patient’s blood pressure, hospital length of stay).

### Clinical Examples
- **Classification**: Is a given chest X-ray indicative of pneumonia (Yes/No)?
- **Regression**: Estimate a patient’s ejection fraction (a % measurement) based on various inputs.

Today, we’ll do a **classification** example using a small dataset from Hugging Face. In future sessions, we’ll explore more advanced or specialized tasks.
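
To make the distinction between the two task types concrete, here is a minimal sketch on synthetic data (the numbers are made up and are not clinical measurements). The same single feature feeds a classifier, which returns a discrete label, and a regressor, which returns a continuous value. All variable names here are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Synthetic, made-up data: one feature, 200 samples.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))

# Classification target: a discrete label (0 or 1).
y_class = (X[:, 0] > 0).astype(int)
# Regression target: a continuous value with a little noise.
y_reg = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

classifier = LogisticRegression().fit(X, y_class)
regressor = LinearRegression().fit(X, y_reg)

new_point = np.array([[1.5]])
predicted_label = classifier.predict(new_point)[0]   # an integer class
predicted_value = regressor.predict(new_point)[0]    # a float

print("Predicted class:", predicted_label)
print("Predicted value:", round(float(predicted_value), 2))
```

The only difference in the workflow is the type of target and the estimator chosen; fitting and predicting look the same in both cases.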

## 3. Installing and Importing Libraries (Prompt-First)

On some HPCs, libraries like `datasets`, `numpy`, `scikit-learn`, etc., might not be installed by default. You may need to prompt the LLM for an installation script.

### Example Prompt
```
I am on a remote HPC (Ubuntu). I have Python 3 and pip.
Please generate a shell command to:
1. Upgrade pip
2. Install datasets, scikit-learn, numpy, matplotlib
```

After receiving the shell command from the LLM, run it in your terminal or from a notebook cell (e.g., `!pip install ...`, or `%pip install ...` to install into the environment of the running kernel).

In [None]:
# ====== LLM-GENERATED CODE CELL (Install Dependencies) ======
# If these libraries are already installed, you can skip or comment this out.
# Example:
# !pip install --upgrade pip
# !pip install datasets scikit-learn numpy matplotlib

# Uncomment or paste your LLM output here if needed.


Now let’s **import** our libraries. If something is missing, prompt the LLM for troubleshooting help or re-install the relevant package.

In [None]:
# ====== Python Imports (adjust as needed) ======
import numpy as np
import matplotlib.pyplot as plt
from datasets import load_dataset
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix

print("Imports complete.")
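
If an import fails, it can help to know exactly which packages are missing before asking the LLM for help. The sketch below is one possible check using only the standard library; the helper name `missing_packages` is hypothetical and not part of any library.

```python
import importlib.util

def missing_packages(import_names):
    """Return the names from import_names that cannot be found in this environment."""
    return [name for name in import_names if importlib.util.find_spec(name) is None]

# Import names for this notebook; note that scikit-learn is imported as "sklearn".
required = ["numpy", "matplotlib", "datasets", "sklearn"]
missing = missing_packages(required)
print("Missing packages:", missing if missing else "none")
```

You can paste the printed list directly into your troubleshooting prompt.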

## 4. Loading a Sample Dataset from Hugging Face

Let’s pick a small dataset for a classification demonstration. One option is the [*Banking77*](https://huggingface.co/datasets/banking77) dataset (a text dataset for intent classification of banking queries). Another is a small image dataset like [*beans*](https://huggingface.co/datasets/beans) (bean leaf photos labeled as healthy or as one of two diseases, angular leaf spot or bean rust).

### Prompt Example
```
I want to load the "beans" dataset from Hugging Face in a Python notebook.
Please generate code using the 'datasets' library.
Then show how to explore the dataset (like printing a few samples).
```

We'll do it step by step.

In [None]:
# ====== LLM-GENERATED CODE CELL (Load Dataset) ======
# Paste your prompt's result here. Example:

beans = load_dataset("beans")
print(beans)

# Let's see a sample:
sample = beans['train'][0]
sample

Depending on which dataset you choose, you’ll see its **train/test/validation** splits and some sample features. For image datasets, a sample typically holds a decoded image object or a file path, together with an integer label.

### Quick Exploration
- Check the number of samples in each split.
- Look at a few random samples to see the labels.


In [None]:
# ====== LLM-GENERATED CODE CELL (Exploration) ======
# Example prompt:
# "Please show me code to shuffle the dataset, take a random sample, and display the label."
# Then paste the code below.

print("Train size:", len(beans['train']))
print("Validation size:", len(beans['validation']))
print("Test size:", len(beans['test']))

# Example random sample
import random
rand_index = random.randint(0, len(beans['train']) - 1)
random_sample = beans['train'][rand_index]
random_sample
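
Beyond looking at single samples, counting how many examples fall in each class is a quick sanity check. The sketch below uses stand-in values so it runs without downloading anything; the helper `class_counts` is hypothetical, and the class-name order shown is an assumption about the beans dataset.

```python
from collections import Counter

def class_counts(labels, class_names):
    """Map each class name to how many times its integer label appears."""
    counts = Counter(labels)
    return {name: counts.get(index, 0) for index, name in enumerate(class_names)}

# Stand-in values so the sketch runs on its own.
# With the notebook's dataset you would pass beans['train']['labels'] and
# beans['train'].features['labels'].names instead.
example_labels = [0, 2, 1, 2, 2, 0]
example_names = ["angular_leaf_spot", "bean_rust", "healthy"]  # assumed order

for name, count in class_counts(example_labels, example_names).items():
    print(f"{name}: {count}")
```

A strongly imbalanced count would be a reason to look beyond plain accuracy when evaluating the model later.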

## 5. From Dataset to NumPy Arrays (or Tensors)

Many ML algorithms (like scikit-learn’s Logistic Regression) expect numeric arrays. For image data, we may flatten or preprocess images. For text data, we may tokenize. 

Here, let’s do a **simple approach**:
1. For an image dataset, convert the images into flattened arrays; for tabular data, select the relevant feature columns.
2. Convert labels to numeric form.

**Note**: This is a simplified example. For more complex tasks (e.g., deep learning on images), you’d likely use a framework like PyTorch or TensorFlow with specialized data loaders.


In [None]:
# ====== LLM-GENERATED CODE CELL (Data Conversion) ======
# Example prompt:
# "Please generate code to convert the Hugging Face image dataset into flattened numpy arrays and numeric labels"

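def dataset_to_arrays(split):
    """Flatten every image in one dataset split into a row of X; collect integer labels in y."""
    X_list = []
    y_list = []
    for item in split:
        # np.array turns the image into an (H, W, C) array; flatten makes it 1-D.
        # All images must share one size, or np.array(X_list) will not form a 2-D matrix.
        image_np = np.array(item['image']).flatten()
        label = item['labels']  # beans uses 'labels'; many other datasets use 'label'
        X_list.append(image_np)
        y_list.append(label)
    return np.array(X_list), np.array(y_list)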

# Convert train/validation/test splits.
X_train, y_train = dataset_to_arrays(beans['train'])
X_val, y_val = dataset_to_arrays(beans['validation'])
X_test, y_test = dataset_to_arrays(beans['test'])

print("Shapes:")
print("X_train:", X_train.shape)
print("y_train:", y_train.shape)
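
Flattening full-size images produces very long feature vectors: a 500x500 RGB image (an assumed size here) becomes 750,000 numbers per sample. One possible way to keep memory and training time manageable is to downsample and rescale before flattening. The sketch below keeps every 8th pixel, which is a crude stand-in for proper resizing; the helper `image_to_features` and its `step` parameter are hypothetical.

```python
import numpy as np

def image_to_features(image, step=8):
    """Downsample an image by keeping every `step`-th pixel, scale to [0, 1], and flatten."""
    pixels = np.asarray(image, dtype=np.float32)
    small = pixels[::step, ::step]   # subsample rows and columns; channels are untouched
    return (small / 255.0).ravel()

# Stand-in for one 500x500 RGB image (an assumed size); a PIL image or an array both work.
fake_image = np.full((500, 500, 3), 255, dtype=np.uint8)
features = image_to_features(fake_image)
print(features.shape)
```

Scaling pixel values to the range 0 to 1 also tends to help gradient-based solvers such as the one behind Logistic Regression converge.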


## 6. Building a Simple Classifier

We’ll use **Logistic Regression** as our baseline classifier. Although it’s not typically used for raw images, this is just to demonstrate the ML workflow:
1. **Fit** a classifier on `(X_train, y_train)`.
2. **Validate** on `(X_val, y_val)` (optional, but good practice).
3. **Test** final performance on `(X_test, y_test)`.

### Train-Test Split Example (Alternative)
If your dataset doesn't have separate splits, you can manually create them using `train_test_split` from scikit-learn. For example:
```
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
```
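
If you also want a validation set, one common pattern is to call `train_test_split` twice. The sketch below runs on synthetic stand-in data; the `demo_` names are chosen so they do not overwrite the notebook's own variables, and the 60/20/20 proportions are only an example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 samples, 4 features, 2 balanced classes.
rng = np.random.default_rng(0)
demo_X = rng.normal(size=(100, 4))
demo_y = np.array([0, 1] * 50)

# First hold out 20% as the test set, keeping class proportions (stratify).
demo_X_rest, demo_X_test, demo_y_rest, demo_y_test = train_test_split(
    demo_X, demo_y, test_size=0.2, random_state=42, stratify=demo_y
)
# Then split the remaining 80 samples into 60 train / 20 validation.
demo_X_train, demo_X_val, demo_y_train, demo_y_val = train_test_split(
    demo_X_rest, demo_y_rest, test_size=0.25, random_state=42, stratify=demo_y_rest
)

print(len(demo_X_train), len(demo_X_val), len(demo_X_test))
```

Fixing `random_state` makes the split reproducible, and `stratify` keeps the class balance the same in every subset.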

Let’s proceed with the dataset’s default splits.

In [None]:
# ====== LLM-GENERATED CODE CELL (Logistic Regression) ======
# Example prompt:
# "Generate code to train a LogisticRegression model on (X_train, y_train),
# evaluate it on (X_val, y_val), then test on (X_test, y_test). Print accuracy."

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

print("Validation Phase")
val_preds = clf.predict(X_val)
val_acc = accuracy_score(y_val, val_preds)
print("Validation Accuracy:", val_acc)

print("\nTest Phase")
test_preds = clf.predict(X_test)
test_acc = accuracy_score(y_test, test_preds)
print("Test Accuracy:", test_acc)
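
Logistic Regression often trains more reliably when features are standardized first. As a sketch of that idea, the example below wraps a scaler and the classifier in a scikit-learn pipeline. It uses scikit-learn's bundled digits images as a stand-in (not the beans dataset), so it runs without any download; the variable names are illustrative.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data: 8x8 digit images, already flattened to 64 features per sample.
digits_X, digits_y = load_digits(return_X_y=True)
dX_train, dX_test, dy_train, dy_test = train_test_split(
    digits_X, digits_y, test_size=0.2, random_state=0, stratify=digits_y
)

# The scaler is fit on the training data only, then reused at prediction time.
scaled_clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_clf.fit(dX_train, dy_train)

digits_acc = accuracy_score(dy_test, scaled_clf.predict(dX_test))
print("Digits test accuracy:", round(digits_acc, 3))
```

Putting the scaler inside the pipeline avoids leaking statistics from the validation or test data into training.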

## 7. Evaluating with a Confusion Matrix

A **confusion matrix** helps visualize how many items were correctly classified vs. misclassified for each class.

### Prompt Example
```
Please generate a code snippet that computes a confusion matrix
for test_preds and y_test using scikit-learn.
Then display it using matplotlib.
```


In [None]:
# ====== LLM-GENERATED CODE CELL (Confusion Matrix) ======
# Example:

cm = confusion_matrix(y_test, test_preds)
print("Confusion Matrix:")
print(cm)

plt.matshow(cm, cmap=plt.cm.Blues)
plt.title("Confusion Matrix")
plt.colorbar()
plt.xlabel("Predicted")
plt.ylabel("True")
plt.show()
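
A confusion matrix also gives per-class recall: the diagonal entry divided by the row total. For a disease class, this is the sensitivity. The sketch below uses a small made-up set of labels; the `demo_` names are illustrative and kept separate from the notebook's `cm`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Small made-up example with three classes (0, 1, 2).
demo_true = [0, 0, 0, 1, 1, 2, 2, 2, 2]
demo_pred = [0, 0, 1, 1, 1, 2, 2, 0, 2]

demo_cm = confusion_matrix(demo_true, demo_pred)  # rows = true class, columns = predicted

# Per-class recall: correct predictions (diagonal) divided by true count (row sum).
per_class_recall = demo_cm.diagonal() / demo_cm.sum(axis=1)

for class_index, recall in enumerate(per_class_recall):
    print(f"class {class_index}: recall = {recall:.2f}")
```

Classes with low recall are the ones the model misses most often, which is usually the first thing to check in a clinical setting.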

## 8. Observations & Clinical Relevance

- **Model Performance**: Logistic Regression treats every pixel as an independent feature and ignores spatial structure, so accuracy on raw images is usually modest. This is a simplified demonstration.
- **Why It Matters**: Even a basic workflow reveals the process of training, validating, testing, and interpreting results. In real-world practice, you’d consider deep CNNs or specialized architectures for images.
- **Next Steps**: We’ll explore more sophisticated models or domain-specific techniques in future sessions.


## 9. Assignment #2: A Deeper Dive

**Task**: Using the same dataset (or a different open-source dataset if you wish):
1. **Perform an Extended EDA** (Exploratory Data Analysis). 
   - Prompt the LLM to generate code that computes class distribution, plots a few images (if image data), or prints summary statistics.
2. **Try a Different Classifier** (e.g., `RandomForestClassifier` from scikit-learn) and compare accuracy with Logistic Regression.
3. **Document** your findings in your daily log:
   - How does the confusion matrix differ?
   - Which classes are easiest/hardest to predict?

### Bonus
- Use an LLM to **optimize hyperparameters** (e.g., number of estimators in RandomForest) and see if performance improves.
- Reflect on how you might apply these methods to actual medical images or clinical data.
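
As a starting point for the hyperparameter bonus, here is a minimal sketch of a grid search with scikit-learn's `GridSearchCV`. It runs on the bundled digits images as a stand-in (not the beans dataset), and the grid values are illustrative assumptions rather than recommendations.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Stand-in data: scikit-learn's bundled digit images.
gs_X, gs_y = load_digits(return_X_y=True)
gs_X_train, gs_X_test, gs_y_train, gs_y_test = train_test_split(
    gs_X, gs_y, test_size=0.2, random_state=0, stratify=gs_y
)

# A deliberately tiny grid; the values are illustrative, not recommendations.
param_grid = {"n_estimators": [25, 100], "max_depth": [5, None]}
search = GridSearchCV(
    RandomForestClassifier(random_state=0), param_grid, cv=3, scoring="accuracy"
)
search.fit(gs_X_train, gs_y_train)  # cross-validates on the training split only

print("Best parameters:", search.best_params_)
print("Held-out test accuracy:", round(search.score(gs_X_test, gs_y_test), 3))
```

The test split is touched only once, after the search has picked its parameters, so the reported accuracy is not biased by the tuning.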

**Feel free to add** any interesting observations or creative solutions. Good luck, and see you on **Day 3**!

# End of Day 2 Notebook

Today, you learned how to:
- Install and import necessary libraries.
- Load a dataset from Hugging Face.
- Convert data to a format suitable for scikit-learn.
- Train, evaluate, and interpret a baseline classifier.

Remember to record your progress, roadblocks, and reflections in your **Daily Log** or **Portfolio**. If you have questions, ask your mentors or consult the LLM for additional guidance!
