# Lecture 06

**Date:** Jan 25, 2024

🚀 Welcome to the Computational Biology <sub>micro</sub>Hackathon! 🚀

Today, we dive into health data to unravel insights and innovations in understanding and diagnosing diabetes.
Diabetes is a significant health challenge affecting millions, and your mission is to use data to contribute to the fight against this chronic condition.

## The Quest

**🌐 Mission:** Your mission is to explore a health-related dataset collected from over 400,000 individuals.
Dive deep into the features, uncover patterns, and create solutions that can aid in the diagnosis and understanding of diabetes.

**📈 Data Exploration:** Your challenge is to navigate through the vast dataset containing responses from individuals across the United States.
Understand the nuances of diabetes and its risk factors hidden within the features provided.

**💻 Code for Health:** Use your coding prowess to analyze the data and train a classifier using sklearn that predicts whether a patient has diabetes.
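
As a rough illustration of the workflow described above, here is a minimal sketch that fits and evaluates an sklearn classifier. It does not use the course dataset: the data comes from `make_classification`, and the class imbalance, the choice of `LogisticRegression`, and every variable name are assumptions made only for this example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (assumption: numeric features, binary label).
X, y = make_classification(
    n_samples=2000,
    n_features=10,
    n_informative=5,
    n_clusters_per_class=1,
    weights=[0.85, 0.15],  # illustrative imbalance, not taken from the course data
    random_state=0,
)

# Hold out a validation split so the score is not measured on the training rows.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# class_weight="balanced" reweights the classes by inverse frequency.
model = LogisticRegression(class_weight="balanced", max_iter=1000)
model.fit(X_train, y_train)

val_score = balanced_accuracy_score(y_val, model.predict(X_val))
print(f"Validation balanced accuracy: {val_score:.3f}")
```

Any sklearn estimator with `fit` and `predict` could be swapped in for `LogisticRegression`; the validation split is there so you can estimate generalization before submitting.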

## The Rules

**🔍 Guidelines:** You may use the course website, pandas+numpy+sklearn documentation, and search engines.

**🤖 Generative AI:** You may not use any generative AI model such as ChatGPT, Bing Chat, Bard, etc.

**⏰ Time Limit:** You have until the end of class to submit your code and model.

**🤝 Collaboration:** You can work together, but each student needs to have their own submission.

## The Submission

**📤 Export with Joblib:** As part of the submission process, ensure your sklearn model is exportable using the joblib library.
This will enable seamless evaluation and comparison of your models against the held-out test data.

```python
import joblib
joblib.dump(model, 'model.joblib')
```

My autograder runs the following code.

```python
import joblib

# X_test and y_test are the held-out data kept by the autograder.
student_model = joblib.load("model.joblib")
X_processed = preprocess(X_test)
score = student_model.score(X_processed, y_test)
```

Because the autograder calls the `preprocess` function from your code, you need to download the Jupyter notebook as a `.py` file and turn that in along with your `model.joblib` file.
You have unlimited submissions.
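
Before submitting, it may help to confirm locally that your saved model reloads and predicts the same way. The sketch below is one way to do that, under assumptions: the data is random, `preprocess` is an identity placeholder with the same signature as the notebook's function, and the file is written to a temporary directory rather than your submission folder.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression


def preprocess(X):
    """Placeholder with the notebook's signature; identity for this demo."""
    return X


# Tiny synthetic dataset, used only so there is something to fit (assumption).
rng = np.random.default_rng(0)
X_demo = rng.random((200, 4))
y_demo = (X_demo[:, 0] > 0.5).astype(int)

model = LogisticRegression().fit(preprocess(X_demo), y_demo)

# Save and reload the model the same way the submission instructions describe.
path = os.path.join(tempfile.mkdtemp(), "model.joblib")
joblib.dump(model, path)
reloaded = joblib.load(path)

# The reloaded model should reproduce the original predictions exactly.
same = np.array_equal(
    model.predict(preprocess(X_demo)), reloaded.predict(preprocess(X_demo))
)
print("Round trip OK:", same)
```

The key point is that the model is fit on the output of `preprocess`, so the features it sees at grading time have the same shape and meaning as the ones it was trained on.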

Make sure you put your team name somewhere in your leaderboard submission name.
For example, "Linda Belcher (Team: Bob's Burgers)".


## The Leaderboard

The overall score for your submission is a weighted sum of your model's balanced accuracy on the train set (weight 0.2) and the test set (weight 0.8), plus a speed bonus equal to 0.03 divided by the average time per run in milliseconds.

```python
import time

import joblib
from sklearn.metrics import balanced_accuracy_score

student_model = joblib.load("model.joblib")

# Balanced accuracy on the train set
X_train_processed = preprocess(X_train)
y_pred = student_model.predict(X_train_processed)
score_train = balanced_accuracy_score(y_train, y_pred)

# Balanced accuracy on the held-out test set
X_test_processed = preprocess(X_test)
y_pred = student_model.predict(X_test_processed)
score_test = balanced_accuracy_score(y_test, y_pred)

# Average wall-clock time per run in milliseconds (preprocessing is timed once)
n_runs = 5
t_start = time.time()
X_test_processed = preprocess(X_test)
for _ in range(n_runs):
    _ = student_model.predict(X_test_processed)
t_end = time.time()
t_delta = (t_end - t_start) * 1000 / n_runs

overall_score = 0.2 * score_train + 0.8 * score_test + 0.03 / t_delta
```
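
To get a feel for how the three terms trade off, the formula can be wrapped in a small helper. The function name `overall_score_from` and the input values below are hypothetical; only the weights come from the leaderboard code.

```python
def overall_score_from(score_train, score_test, t_delta_ms):
    """Combine the terms with the weights used in the leaderboard snippet."""
    return 0.2 * score_train + 0.8 * score_test + 0.03 / t_delta_ms


# Hypothetical scores and timings, chosen only for illustration.
slow = overall_score_from(score_train=0.75, score_test=0.74, t_delta_ms=10.0)
fast = overall_score_from(score_train=0.75, score_test=0.74, t_delta_ms=1.0)
print(f"slow: {slow:.3f}, fast: {fast:.3f}")
```

With these made-up inputs, the accuracy terms contribute 0.742 in both cases, while the speed term adds 0.003 at 10 ms per run and 0.03 at 1 ms per run. Because the speed term is inversely proportional to time, it grows quickly as the average time drops below a millisecond.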

## The Rewards

**💯 Grading:** To receive full autograder credit (10 points) for today's <sub>micro</sub>Hackathon, your model must have a score above 0.7 on the test set.
The teaching team tested the most straightforward solution, which had a score of 0.74.

**🌟 Model Mastery:**

-   Each teaching team member will be assigned a section of the class.
    Towards the end of class, the instructor will average the **top three unique test scores** as your team score (subject to change based on the results).
    The team with the highest score will be able to drop a homework assignment.
-   If any valid submission has an overall score above `0.90`, I will bring in treats for the whole class.
    Each student will have a choice between candy or a rubber duck.
-   The submission with the highest overall score will be able to choose a rubber duck today!

Any and all disputes will be settled with a rock-paper-scissors competition.

## Begin!

## Note

This <sub>micro</sub>Hackathon is also a way for you to check your level of Python mastery for this course.
You are expected, and encouraged, to look things up in the documentation and course website.

```python
import joblib
import numpy as np
import pandas as pd
```

```python
def preprocess(X):
    """This is your general function to ensure compatibility with my autograder.
    If you make any changes to your features, you should do them here, then return them.

    Args:
        X (np.ndarray): Normalized features from the CSV file with MinMaxScaler().

    Returns:
        np.ndarray: Your processed features that are used for fitting.
    """
    # Process X
    return X
```
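
As one possible way to fill in `preprocess`, the sketch below keeps only a subset of feature columns. The column indices in `SELECTED_COLUMNS` are hypothetical placeholders, not a recommendation for this dataset; the point is only that every feature change lives inside the function so the autograder applies the same transformation.

```python
import numpy as np

# Hypothetical column indices; pick real ones from your own data exploration.
SELECTED_COLUMNS = [0, 2, 3]


def preprocess(X):
    """Keep only the selected feature columns (illustrative example)."""
    X = np.asarray(X, dtype=float)
    return X[:, SELECTED_COLUMNS]


# Small demo array standing in for the normalized features.
X_demo = np.arange(12, dtype=float).reshape(3, 4)
X_out = preprocess(X_demo)
print(X_out.shape)
```

Whatever you do inside `preprocess`, fit your model on its output, so the model expects the same number and order of columns that the autograder will pass to it.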