# Lecture 06

**Date:** Jan 25, 2024

🚀 Welcome to the Computational Biology <sub>micro</sub>Hackathon! 🚀

Today, we dive into health data to unravel insights and innovations in understanding and diagnosing diabetes.
Diabetes is a significant health challenge affecting millions, and your mission is to use data to contribute to the fight against this chronic condition.

## The Quest

**🌐 Mission:** Explore a health-related dataset collected from over 400,000 individuals.
Dive deep into the features, uncover patterns, and create solutions that can aid in the diagnosis and understanding of diabetes.

**📈 Data Exploration:** Your challenge is to navigate through the vast dataset containing responses from individuals across the United States.
Understand the nuances of diabetes and its risk factors hidden within the features provided.

**💻 Code for Health:** Use your coding prowess to analyze the data and train a classifier using sklearn to predict whether a patient has diabetes.
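
As a starting point, here is a minimal sketch of fitting and validating an sklearn classifier. The data is a synthetic stand-in generated on the spot (the array shapes, the label rule, and the choice of `LogisticRegression` are all assumptions for illustration, not the course dataset or the teaching team's solution).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 500 rows, 4 numeric features (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
# Hypothetical binary label driven by the first two features plus noise.
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=500) > 0).astype(int)

# Hold out 20% of the rows to estimate performance on unseen data.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# For a classifier, score() reports mean accuracy on the given data.
val_score = model.score(X_val, y_val)
print(f"validation score: {val_score:.3f}")
```

Keeping a validation split of your own gives you a rough estimate of how the model will behave on data it has not seen before you submit.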

## The Rules

**🔍 Guidelines:** Respect the data, understand the context, and keep your solutions ethical and impactful.

**⏰ Time Limit:** The clock is ticking!
You have 50 minutes to unveil the potential insights hidden in the data.

**🤝 Collaboration:** Form teams and combine your diverse skills for a holistic approach.

## The Submission

**📤 Export with Joblib:** As part of the submission process, ensure your sklearn model is exportable using the joblib library.
This will enable seamless evaluation and comparison of your models against the held-out test data.

```python
# Use the filename the autograder loads.
joblib.dump(model, 'model.joblib')
```

My autograder has the following code.

```python
student_model = joblib.load("model.joblib")
X_processed = process_features(X_test)
score = student_model.score(X_processed, y_test)
```
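
Before submitting, it can help to rehearse the autograder's steps locally. The sketch below saves a model, reloads it, and scores it the same way as the code above. The data is synthetic, `process_features` is the unchanged placeholder, and the temporary directory is only there to avoid leaving files behind; all of these are assumptions for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression


def process_features(X):
    # Placeholder matching the template: return the features unchanged.
    return X


# Synthetic stand-in data (assumption): the label depends on the first column.
rng = np.random.default_rng(1)
X_train = rng.normal(size=(200, 3))
y_train = (X_train[:, 0] > 0).astype(int)
X_test = rng.normal(size=(50, 3))
y_test = (X_test[:, 0] > 0).astype(int)

model = LogisticRegression().fit(process_features(X_train), y_train)

# Round trip: dump to disk, then load as the autograder would.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.joblib")
    joblib.dump(model, path)
    student_model = joblib.load(path)

X_processed = process_features(X_test)
score = student_model.score(X_processed, y_test)

# The reloaded model should predict exactly like the original.
same_predictions = np.array_equal(
    student_model.predict(X_processed), model.predict(X_processed)
)
print(score, same_predictions)
```

Note that `score` depends on the estimator type: scikit-learn classifiers report mean accuracy, while regressors report R-squared.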

You have unlimited submissions.

## The Rewards

**💯 Grading:** To receive full credit for today's <sub>micro</sub>Hackathon, your model must have an R-squared value above 0.5 on the test set.
Your teaching team tested the most straightforward solution, which had an R-squared value of 0.74.

**🌟 Model Mastery:** The true test of your model's prowess awaits! After fine-tuning your solutions, you'll have the opportunity to submit your models to Gradescope.
Your models will undergo evaluation against test data that hasn't been provided.
The groups achieving the highest R-squared values will earn an additional mystery prize, to be determined.

## Begin!

```python
import joblib
import numpy as np
import pandas as pd
```

```python
def process_features(X):
    """This is your general function to ensure compatibility with my autograder.
    If you make any changes to your features, you should do them here, then return them.

    Args:
        X (np.ndarray): The original features from the CSV file.

    Returns:
        np.ndarray: Your processed features that you use for fitting.
    """
    return X
```
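
To illustrate how `process_features` might be filled in, here is a small sketch under stated assumptions: the specific steps (replacing missing values with zero and appending the square of the first column) are hypothetical choices, not recommendations for this dataset. The point is that the function is stateless, so the autograder applies exactly the same transformation to the test features that you applied before fitting.

```python
import numpy as np


def process_features(X):
    """Hypothetical example of a stateless feature transformation."""
    X = np.asarray(X, dtype=float)
    # Replace missing values with 0.0 (an illustrative choice, not a recommendation).
    X = np.nan_to_num(X, nan=0.0)
    # Append the square of the first column as an extra feature (illustrative).
    squared = X[:, [0]] ** 2
    return np.hstack([X, squared])


# Tiny demo array with one missing value.
X_demo = np.array([[1.0, np.nan], [2.0, 3.0]])
X_out = process_features(X_demo)
print(X_out)
```

Whatever you put in this function, fit your model on `process_features(X_train)` so the column count and ordering match what the autograder passes to `score`.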