# CS foreach Curriculum Workshop 10/24/2024: Introduction to AI/ML

This Jupyter Notebook is a supplemental demo for the Intro to AI/ML Workshop hosted on 10/24/2024. It builds a simple linear regression model that predicts sleep quality on a scale of 1-10 from a number of other factors.

In [145]:
# Import all relevant libraries
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

If you are using Google Colab to run this notebook, uncomment the code cell below. Download the `Health_Sleep_Statistics.csv` file to your computer, run the cell, and you will be prompted to upload the file to this notebook.

In [146]:
# from google.colab import files
# uploaded = files.upload()

In [None]:
# Load the sleep data
sleep_data = pd.read_csv('Health_Sleep_Statistics.csv')
sleep_data.head()

# Data Encoding

We want to predict the sleep quality score found in the "Sleep Quality" column using the other variables in the dataset.

To start, let's perform One-Hot Encoding on the following columns: "Gender", "Physical Activity Level", "Dietary Habits", "Sleep Disorders", and "Medication Usage", so that every categorical variable is represented by numbers.
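
To make the idea concrete, here is a minimal plain-Python sketch of what One-Hot Encoding produces. It is only an illustration, not how sklearn's `OneHotEncoder` is implemented, and the category values (`"low"`, `"medium"`, `"high"`) are made up rather than taken from the dataset.

```python
# Toy illustration of One-Hot Encoding in plain Python (not sklearn's implementation).
# The category values below are made up for this example.
activity_levels = ["low", "high", "medium", "low"]

# Sort the distinct categories so the column order is deterministic.
categories = sorted(set(activity_levels))

# Each row becomes a list with 1.0 in the position of its category and 0.0 elsewhere.
one_hot_rows = [
    [1.0 if value == category else 0.0 for category in categories]
    for value in activity_levels
]

print(categories)
for value, row in zip(activity_levels, one_hot_rows):
    print(value, row)
```

Each original column turns into one new column per category, and every row has exactly one 1.0 among them.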

In [148]:
# Initialize sklearn's One Hot Encoder
encoder = OneHotEncoder()

In [149]:
# Perform One-Hot Encoding on "Gender"
encoded_gender = encoder.fit_transform(sleep_data[['Gender']])
gender_df = pd.DataFrame(encoded_gender.toarray(), columns=encoder.get_feature_names_out(['Gender']))

In [150]:
# Perform One-Hot Encoding on "Physical Activity Level"
encoded_physical_activity = encoder.fit_transform(sleep_data[['Physical Activity Level']])
physical_activity_df = pd.DataFrame(encoded_physical_activity.toarray(), columns=encoder.get_feature_names_out(['Physical Activity Level']))

In [151]:
# Perform One-Hot Encoding on "Dietary Habits"
encoded_dietary_habits = encoder.fit_transform(sleep_data[['Dietary Habits']])
dietary_habits_df = pd.DataFrame(encoded_dietary_habits.toarray(), columns=encoder.get_feature_names_out(['Dietary Habits']))

In [152]:
# Perform One-Hot Encoding on "Sleep Disorders"
encoded_sleep_disorders = encoder.fit_transform(sleep_data[['Sleep Disorders']])
sleep_disorders_df = pd.DataFrame(encoded_sleep_disorders.toarray(), columns=encoder.get_feature_names_out(['Sleep Disorders']))

In [153]:
# Perform One-Hot Encoding on "Medication Usage"
encoded_medication_usage = encoder.fit_transform(sleep_data[['Medication Usage']])
medication_usage_df = pd.DataFrame(encoded_medication_usage.toarray(), columns=encoder.get_feature_names_out(['Medication Usage']))

In [154]:
# Join all of the One-Hot encoded data together
encoded_sleep_data = (sleep_data
                      .join(gender_df)
                      .join(physical_activity_df)
                      .join(dietary_habits_df)
                      .join(sleep_disorders_df)
                      .join(medication_usage_df))

Let's drop the original columns that we just One-Hot encoded, since the encoded versions replace them.

In [155]:
encoded_sleep_data = encoded_sleep_data.drop(columns=['Gender', 'Physical Activity Level', 'Dietary Habits', 'Sleep Disorders', 'Medication Usage'])

Let's also index by User ID:

In [156]:
encoded_sleep_data = encoded_sleep_data.set_index('User ID')

The only columns that still need to be converted to numeric values are "Bedtime" and "Wake-up Time". We can convert these times to minutes after midnight and use those values in the model.

In [157]:
def convert_to_minutes(time):
    """Convert an 'HH:MM' time string to the number of minutes after midnight."""
    time_components = time.split(':')
    minutes = int(time_components[0]) * 60 + int(time_components[1])
    return minutes
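
As a quick sanity check, the sketch below runs the conversion on a few sample times. It assumes the times are 24-hour `HH:MM` strings; the sample values are made up. It also shows one caveat: a bedtime just after midnight maps to a small number, far from a bedtime just before midnight. `minutes_since_noon` is a hypothetical helper, not part of the notebook, that shows one possible way to keep such bedtimes close together.

```python
def convert_to_minutes(time):
    """Convert an 'HH:MM' string to minutes after midnight (same logic as the notebook)."""
    hours, minutes = time.split(':')
    return int(hours) * 60 + int(minutes)


def minutes_since_noon(time):
    """Hypothetical helper: count minutes from 12:00 so late-night times stay adjacent."""
    return (convert_to_minutes(time) - 12 * 60) % (24 * 60)


print(convert_to_minutes('23:30'))  # 1410
print(convert_to_minutes('00:30'))  # 30
print(minutes_since_noon('23:30'))  # 690
print(minutes_since_noon('00:30'))  # 750
```

Whether this shift helps depends on the data; the notebook itself uses minutes after midnight.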

In [158]:
# Convert bedtimes and wake-up times to minutes
bedtime_in_minutes = encoded_sleep_data['Bedtime'].apply(convert_to_minutes)
wakeuptime_in_minutes = encoded_sleep_data['Wake-up Time'].apply(convert_to_minutes)

In [159]:
# Add them to the encoded sleep data DataFrame
encoded_sleep_data['Bedtime - Min'] = bedtime_in_minutes
encoded_sleep_data['Wake-up Time - Min'] = wakeuptime_in_minutes

We'll also drop the original columns for Bedtime and Wake-up Time:

In [None]:
encoded_sleep_data = encoded_sleep_data.drop(columns=['Bedtime', 'Wake-up Time'])
encoded_sleep_data

Everything is quantified now! We can start creating the model.
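
Before handing the data to sklearn, it can help to see what fitting a linear regression means. The sketch below fits a straight line to a single made-up feature using the ordinary least squares formulas. The names `fit_line`, `hours_slept`, and `quality` are illustrative assumptions; the notebook's model uses many features and sklearn's own solver.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of x and y divided by the variance of x.
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    # The fitted line always passes through the point (mean_x, mean_y).
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up data that lies exactly on the line quality = 2 * hours - 6.
hours_slept = [5, 6, 7, 8]
quality = [4, 6, 8, 10]

slope, intercept = fit_line(hours_slept, quality)
print(slope, intercept)  # 2.0 -6.0
```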

In [161]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

In [162]:
# The target: the column we want the model to predict
sleep_scores = encoded_sleep_data['Sleep Quality']

In [None]:
# The features: every remaining column except the target
data_no_scores = encoded_sleep_data.drop(columns=['Sleep Quality'])
data_no_scores

In [176]:
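# An integer train_size is a row count (95 rows), not a percentage; the remaining rows form the test set
train_sleep_data, test_sleep_data, train_sleep_score, test_sleep_score = train_test_split(
    data_no_scores, sleep_scores, train_size=95, random_state=2)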
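
In `train_test_split`, an integer `train_size` is read as a number of rows, while a float between 0 and 1 is read as a fraction of the data. The plain-Python sketch below mimics that idea on a made-up 100-item list. `simple_train_test_split` is a hypothetical helper, not sklearn's algorithm, so the rows it selects will differ from what sklearn picks for the same seed.

```python
import random


def simple_train_test_split(rows, train_size, seed):
    """Shuffle rows reproducibly, then cut them into a train list and a test list.

    An int train_size is a row count; a float in (0, 1) is a fraction of the rows.
    """
    if isinstance(train_size, float):
        train_size = int(train_size * len(rows))
    shuffled = rows[:]                      # copy so the input list is left untouched
    random.Random(seed).shuffle(shuffled)   # a fixed seed makes the shuffle repeatable
    return shuffled[:train_size], shuffled[train_size:]


rows = list(range(100))  # stand-in for 100 rows of data
train_rows, test_rows = simple_train_test_split(rows, train_size=95, seed=2)
print(len(train_rows), len(test_rows))  # 95 5
```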

In [None]:
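# Fit the model on the training rows
model = LinearRegression().fit(train_sleep_data, train_sleep_score)
# score() returns the R^2 coefficient of determination (at most 1.0), not a percentage accuracy
print('R^2 on the training data: ' + str(model.score(train_sleep_data, train_sleep_score)))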

In [None]:
# R^2 on the held-out rows shows how well the model generalizes to data it has not seen
print('R^2 on the test data: ' + str(model.score(test_sleep_data, test_sleep_score)))
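
For a regression model, `score` returns the coefficient of determination, R^2, rather than a percentage. A value of 1.0 means perfect predictions, 0.0 means the model does no better than always predicting the mean, and negative values are possible. The sketch below computes R^2 by hand on made-up numbers to show what the value measures; `r_squared` is an illustrative helper, not part of the notebook.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


actual_scores = [4, 6, 8, 10]     # made-up true sleep quality scores
predicted_scores = [5, 6, 8, 9]   # made-up model predictions

print(r_squared(actual_scores, predicted_scores))  # 0.9
print(r_squared(actual_scores, [7, 7, 7, 7]))      # 0.0, always predicting the mean
```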