<a href="https://colab.research.google.com/github/mairabermeo/Data-Science/blob/master/Project2_Proposal.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Investigating the Relationship Between Sleep Patterns and Academic Performance Among First-Year College Students**

## **Abstract**

This project explores how lifestyle behaviors, specifically sleep patterns, may influence academic performance among college students. Poor sleep habits are common in university settings and have been linked to reduced cognitive function, yet their direct relationship to academic outcomes remains underexamined in real-world data. This study seeks to identify whether variations in sleep consistency, duration, and timing are associated with changes in students’ academic achievement.

To investigate this, we conduct a data-driven analysis using a combination of statistical and machine learning models. Our approach includes exploratory data analysis, feature transformation, and the application of several predictive techniques to identify patterns and relationships in the data. By comparing multiple modeling strategies, we assess both linear and non-linear effects, as well as the potential influence of individual characteristics.

The analysis reveals that certain sleep behaviors show a measurable relationship with academic outcomes. These findings highlight the role that sleep habits may play in academic success and offer insights that can inform future support strategies for student well-being and performance.

## **Introduction**

Academic performance during the first year of college is an important predictor of student retention, graduation, and long-term success. While universities invest heavily in academic support programs such as tutoring, advising, and orientation, personal habits like sleep often receive less attention. Many college students experience irregular sleep schedules, short sleep durations, and late bedtimes. Research has shown that sleep quality and consistency affect memory, focus, and overall cognitive functioning. Understanding how sleep behaviors influence academic outcomes could help improve student success and well-being.

This project investigates the relationship between sleep habits and GPA among first-year college students. We use data from the CMU Sleep Study, which includes information from 634 students across Carnegie Mellon University, the University of Washington, and the University of Notre Dame. Each student wore a Fitbit for one month, which recorded their sleep patterns, including total sleep time, bedtime variability, and daytime naps. The dataset also includes demographic information (such as gender, race, and first-generation college status), as well as academic data (cumulative and term GPA).

Our primary research questions focus on whether longer or more consistent sleep is associated with higher GPA, whether daytime sleep has a positive or negative impact on academic performance, and how demographic factors influence these relationships. To answer these questions, we apply multiple modeling techniques, including linear regression, K-nearest neighbors, Huber regression, quantile regression, random forest, and gradient boosting. This combination of traditional and machine learning models allows us to evaluate both explanatory and predictive aspects of the data, fulfilling the project’s analytical and technical requirements.

## **Research Questions**
**Research Question 1:** To what extent are total sleep time and bedtime consistency associated with term GPA in first-year college students?

**Research Question 2:** Is there a relationship between daytime sleep and GPA, and does this relationship change depending on other sleep habits?

Understanding these relationships can help schools promote better academic outcomes through simple behavioral changes. If students who sleep more or go to bed at regular times tend to perform better, schools can include sleep education in first-year programs. This might take the form of workshops, health campaigns, or academic coaching that includes advice on sleep routines.

If daytime sleep is shown to negatively affect GPA, students could be encouraged to adjust their schedules to limit naps and prioritize nighttime sleep. By learning how small changes in daily habits affect academic performance, students may be better prepared to succeed in college. Schools can support this by giving students the tools and information to improve their routines in realistic and effective ways.


## **Data to be Used**

The dataset used in this project comes from the Carnegie Mellon University Statistics Data Repository. It is publicly available and can be accessed at the following link:

**[CMU Sleep Study Dataset](https://cmustatistics.github.io/data-repository/psychology/cmu-sleep.html)**

The dataset contains information from 634 first-year college students enrolled at three institutions: Carnegie Mellon University, the University of Washington, and the University of Notre Dame. Data were collected during the spring term of each student’s first year. Each participant wore a Fitbit device for approximately one month. These devices recorded sleep-related data, including total sleep time, bedtime variability, sleep midpoint, and daytime sleep. Researchers identified and classified sleep episodes using Fitbit tracking data. In addition to sleep data, the dataset includes academic records such as cumulative GPA (from prior terms), term GPA (from the spring term), and course load information. Demographic variables are also provided, including gender, race (categorized as underrepresented or not), and first-generation college student status.

The dataset is provided as a downloadable CSV file and does not require any web scraping or API access. It will be loaded from GitHub directly into a Google Colab notebook using Python’s pandas library. All cleaning, transformation, and analysis will be conducted within that environment using standard Python tools. This structured dataset provides all the necessary information for answering the project’s research questions without requiring additional data collection.



## **Approach**
**Research Approach**

We will begin by importing the dataset using the pandas library and inspecting its structure with functions like .head(), .info(), and .describe(). Any columns that contain numeric values but are stored as object types will be converted to numeric using pd.to_numeric() to ensure proper analysis. Categorical variables such as gender, race, and first-generation status will be encoded numerically to prepare them for regression models. We will address missing values either by imputing them or removing affected rows, depending on the extent of the issue. If required, we will also normalize or standardize continuous variables like sleep metrics to ensure consistent interpretation across modeling techniques. The cleaned and prepared dataset will then be stored in a separate DataFrame for further use.
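The steps above can be sketched roughly as follows. This is a minimal illustration, not the project's final code: the rows are made up, the `gender` and `first_gen` columns and their string labels are hypothetical (the real file may already store these as numeric codes), and the choice to drop rows without a GPA and fill missing sleep values with the median is only one of the options mentioned.

```python
import io

import pandas as pd

# Stand-in for the downloaded CSV: made-up rows, assumed column names.
raw_csv = io.StringIO(
    "subject_id,gender,first_gen,TotalSleepTime,term_gpa\n"
    "1,female,no,412.5,3.45\n"
    "2,male,yes,unknown,2.90\n"
    "3,male,no,365.0,\n"
    "4,female,yes,398.2,3.80\n"
)
df = pd.read_csv(raw_csv)

# Inspect the structure of the data.
print(df.head())
df.info()
print(df.describe())

# "unknown" makes TotalSleepTime load as text; coerce it to numbers,
# turning unparseable entries into NaN.
df["TotalSleepTime"] = pd.to_numeric(df["TotalSleepTime"], errors="coerce")

# Encode the categorical columns as 0/1 indicator columns.
df_encoded = pd.get_dummies(
    df, columns=["gender", "first_gen"], drop_first=True, dtype=int
)

# Check how much is missing before deciding how to handle it.
print(df_encoded.isna().mean())

# One possible policy: drop rows without the outcome, impute the predictor.
clean_df = df_encoded.dropna(subset=["term_gpa"]).copy()
median_sleep = clean_df["TotalSleepTime"].median()
clean_df["TotalSleepTime"] = clean_df["TotalSleepTime"].fillna(median_sleep)
```

With the real file, the `io.StringIO` object would be replaced by the path or URL of the CSV.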


**Exploratory Data Analysis (EDA)**

We will conduct exploratory analysis by computing summary statistics and using visual tools to understand the data. We will generate a correlation heatmap using seaborn.heatmap() to visualize the relationships between sleep metrics and GPA. We will also create scatterplots to examine how TotalSleepTime and midpoint_sleep relate to term_gpa. To assess the distribution of academic performance, we will plot a histogram or kernel density estimate of the GPA. Additionally, we will use boxplots and violin plots to explore GPA variation across demographic groups such as gender or first-generation status. All of these plots will be coded in Python using matplotlib and seaborn.
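A rough sketch of this exploratory step is shown below. It runs on synthetic data, so the numbers it produces say nothing about the real study. The column names are assumed to match the dataset, and `plot_overview` is a hypothetical helper introduced only for this illustration.

```python
import numpy as np
import pandas as pd

# Made-up data standing in for the real file; column names are assumed.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "TotalSleepTime": rng.normal(400, 50, n),
    "midpoint_sleep": rng.normal(300, 60, n),
    "demo_firstgen": rng.integers(0, 2, n),
})
noise = rng.normal(0, 0.4, n)
df["term_gpa"] = (2.0 + 0.003 * df["TotalSleepTime"] + noise).clip(0, 4)

# Summary statistics: the correlation matrix behind the heatmap and
# the GPA distribution within each demographic group.
numeric_cols = ["TotalSleepTime", "midpoint_sleep", "term_gpa"]
corr = df[numeric_cols].corr()
gpa_by_group = df.groupby("demo_firstgen")["term_gpa"].describe()
print(corr.round(2))
print(gpa_by_group.round(2))


def plot_overview(data, corr_matrix):
    """Draw a heatmap, scatterplot, histogram with KDE, and boxplot."""
    # Imported here so the summaries above run without plotting libraries.
    import matplotlib.pyplot as plt
    import seaborn as sns

    fig, axes = plt.subplots(2, 2, figsize=(11, 8))
    sns.heatmap(corr_matrix, annot=True, cmap="coolwarm",
                vmin=-1, vmax=1, ax=axes[0, 0])
    sns.scatterplot(data=data, x="TotalSleepTime", y="term_gpa",
                    ax=axes[0, 1])
    sns.histplot(data["term_gpa"], kde=True, ax=axes[1, 0])
    sns.boxplot(data=data, x="demo_firstgen", y="term_gpa", ax=axes[1, 1])
    fig.tight_layout()
    return fig
```

In a notebook cell, calling `plot_overview(df, corr)` would render the four panels together.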


**Data Preparation**

During data preparation, we will apply code to clean and transform variables identified as problematic during EDA. We will handle outliers, standardize variables when necessary, and implement feature engineering techniques such as creating interaction terms between sleep behaviors and demographic variables. Continuous variables needed for distance-based models will be scaled appropriately. At the end of this stage, we will organize the final cleaned dataset into a new DataFrame, which will be used for modeling.
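As a hedged illustration of these preparation steps, the sketch below clips outliers with a 1.5 × IQR rule, standardizes the sleep variables, and builds two interaction terms. The data are synthetic, the column names are assumed, `clip_iqr` is a hypothetical helper, and the IQR rule is just one common choice for handling outliers rather than a decision the project has made.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up data with one planted outlier; column names are assumed.
rng = np.random.default_rng(1)
n = 100
df = pd.DataFrame({
    "TotalSleepTime": rng.normal(400, 50, n),
    "daytime_sleep": rng.exponential(30, n),
    "demo_firstgen": rng.integers(0, 2, n),
})
df.loc[0, "TotalSleepTime"] = 900.0


def clip_iqr(series, k=1.5):
    """Clip values lying more than k * IQR outside the quartiles."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)


sleep_cols = ["TotalSleepTime", "daytime_sleep"]
prepped = df.copy()

# Handle outliers first so they do not distort the scaling.
for col in sleep_cols:
    prepped[col] = clip_iqr(prepped[col])

# Standardize continuous variables (zero mean, unit variance),
# which matters for distance-based models such as KNN.
scaler = StandardScaler()
prepped[sleep_cols] = scaler.fit_transform(prepped[sleep_cols])

# Interaction terms: sleep x demographic, and sleep x daytime sleep.
prepped["sleep_x_firstgen"] = (
    prepped["TotalSleepTime"] * prepped["demo_firstgen"]
)
prepped["sleep_x_daytime"] = (
    prepped["TotalSleepTime"] * prepped["daytime_sleep"]
)
```

When the data are later split for modeling, fitting the scaler on the training portion only avoids leaking information from the test set.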


**Prepped Data Review**

After completing data preparation, we will re-run key visualizations on the cleaned data. This includes generating updated scatterplots, histograms, and correlation matrices to confirm the data integrity and consistency of patterns. We will also re-check for missing values and confirm that all variables are in the correct format for analysis. This step helps ensure the data is ready for model fitting.
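The missing-value and format checks could be collected in a small helper like the one below. This is only a sketch: `review_prepped` is a hypothetical function, and the tiny DataFrames are made up to show a passing and a failing case.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def review_prepped(df, expected_numeric):
    """Summarize missing values and columns that are not numeric."""
    missing = df.isna().sum()
    non_numeric = [
        col for col in expected_numeric if not is_numeric_dtype(df[col])
    ]
    return {
        "n_rows": len(df),
        "missing_total": int(missing.sum()),
        "columns_with_missing": missing[missing > 0].index.tolist(),
        "non_numeric_columns": non_numeric,
    }


# Made-up prepared data that should pass both checks.
prepped = pd.DataFrame({
    "TotalSleepTime": [0.4, -1.1, 0.7],
    "demo_firstgen": [0, 1, 0],
    "term_gpa": [3.4, 2.9, 3.8],
})
numeric_cols = ["TotalSleepTime", "term_gpa"]
report = review_prepped(prepped, expected_numeric=numeric_cols)
print(report)

# A deliberately broken copy shows what a failed check looks like.
broken = prepped.assign(term_gpa=["3.4", None, "3.8"])
broken_report = review_prepped(broken, expected_numeric=numeric_cols)
print(broken_report)
```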


**Investigative Analysis and Results**

We will build and evaluate a variety of regression models to answer our research questions. These models include Linear Regression, K-Nearest Neighbors Regressor, Huber Regressor, Quantile Regression, Random Forest Regressor, and Gradient Boosting Regressor. Each model will be trained on the prepared data. To assess model performance, we will compute metrics such as Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE). For the Quantile Regressor, we will use pinball loss. For ensemble models like Random Forest and Gradient Boosting, we will generate feature importance plots to interpret which sleep or demographic variables most strongly predict GPA.



### **Team Roles and Responsibilities**

#### **Maira**

* **Abstract**

  * Write a concise summary of the problem, approach, and key outcomes of the project.

* **Introduction**

  * Explain the purpose and motivation behind the project.
  * Describe the research questions and why they were chosen.
  * Provide background on the dataset and justify its use.

* **Research Approach**

  * Outline the overall strategy used for data handling and analysis.
  * Explain the workflow from exploration to modeling, including data management practices.

* **Exploratory Data Analysis (EDA)**

  * Analyze the raw dataset to uncover patterns and insights.
  * Identify potential data issues or trends.
  * Visualize key features and distributions using Python.
  * Summarize early findings that will shape the direction of the analysis.

* **Data Preparation**

  * Clean the dataset by handling missing values, duplicates, or inconsistencies.
  * Perform feature engineering to enhance the dataset for modeling.
  * Ensure the dataset is ready for analysis and modeling.

* **Prepped Data Review**

  * Re-run EDA on the cleaned dataset.
  * Validate that the data is in good shape for statistical modeling.

#### **Jannat**
* **Investigative Analysis & Results**

  * Build and test statistical models (e.g., regression).
  * Evaluate model performance and interpret the results.
  * Use evidence from the analysis to answer the research questions.

* **Conclusions**

  * Summarize overall findings and insights.
  * Reflect on how the research questions were addressed.
  * Suggest possible next steps or extensions for the project.