# COGS 108 - Data Checkpoint

# Names

- Benjamin Zhang-Li
- Ilia Aballa
- Keshav Gupta
- Nura Nejad

# Research Question

Which measurable factors, reported through self-surveys—such as sleep duration, physical activity, and screen time—are most strongly associated with getting adequate sleep schedules (7–9 hours per night) among college students (ages 18-25) in the US?

## Background and Prior Work

Maintaining a healthy sleep schedule is essential for college students in the U.S, as it directly impacts their academic performance, mental health, and overall well-being. However, research from the Centers for Disease Control and Prevention (CDC) shows that around 75% of college students report experiencing sleep disturbances or poor sleep quality, largely due to demanding schedules and the challenges of balancing coursework, social activities, and work commitments.<a name="cite_ref-1"></a>[<sup>1</sup>](#cite_note-1) Lifestyle factors such as caffeine intake, dietary habits, screen time, alcohol and substance use, stress levels, and study habits all play a huge role in shaping sleep patterns. For instance, the CDC notes that inconsistent sleep patterns, common among college students, contribute to issues like “daytime sleepiness,” which can impair focus and academic success.<a name="cite_ref-1"></a>[<sup>1</sup>](#cite_note-1)

Our project seeks to understand which of these lifestyle habits most strongly affect sleep consistency and quality, focusing on students with demanding college schedules. To do so, we will examine measurable sleep factors like total sleep hours, consistency in average bedtime and wake time, and overall sleep quality. By analyzing these metrics, we aim to pinpoint lifestyle behaviors that correlate with healthier sleep patterns among college students in the U.S., hoping to provide practical insights into developing better sleep routines.

Previous research has highlighted key lifestyle and mental health factors that impact sleep consistency among U.S. college students. For instance, one study found that habits like caffeine use, substance use, and high stress are strongly associated with sleep issues, showing how these daily choices affect students’ ability to keep a regular sleep schedule.<a name="cite_ref-2"></a>[<sup>2</sup>](#cite_note-2) Similarly, research on sleep hygiene—such as managing screen time before bed, limiting caffeine intake, and moderation of alcohol use—demonstrates how these habits impact sleep quality, often making it difficult for students to maintain a steady routine.<a name="cite_ref-3"></a>[<sup>3</sup>](#cite_note-3) Another recent study on college students' sleep trends highlights that irregular schedules and high-stress levels are key contributors to poor sleep quality, emphasizing the importance of exploring how lifestyle and mental health impact sleep patterns.<a name="cite_ref-3"></a>[<sup>3</sup>](#cite_note-3)

---

1. <a name="cite_note-1"></a> [^](#cite_ref-1) Mbous, Y.P.V., Nili, M., Mohamed, R., & Dwibedi, N. (2022). *Psychosocial Correlates of Insomnia Among College Students*. Preventing Chronic Disease. [Link to study](https://www.cdc.gov/pcd/issues/2022/22_0060.htm)

2. <a name="cite_note-2"></a> [^](#cite_ref-2) Emerson, J. (2024). *The Importance of Sleep for College Students*. University of South Florida. [Link to article](https://admissions.usf.edu/blog/the-importance-of-sleep-for-college-students)

3. <a name="cite_note-3"></a> [^](#cite_ref-3) Zhou, J., Qu, J., Ji, S., et al. (2022). *Research trends in college students' sleep from 2012 to 2021: A bibliometric analysis*. Front Psychiatry. [Link to study](https://pmc.ncbi.nlm.nih.gov/articles/PMC9530190/)

# Hypothesis


We hypothesize that, among U.S. college students aged 18-25, factors such as consistent bedtime and wake time, average duration of sleep, lower caffeine intake, limited screen time, and reduced substance use will correlate strongly with maintaining a sleep schedule of 7-9 hours. We predict these variables will have a significant positive impact on sleep quality due to their direct influence on sleep duration and overall sleep hygiene. The positive impacts of a sleep schedule of 7-9 hours will contribute to positive mental health and stress levels and GPA of at least 3.5.

# Data

## Data overview

- Dataset #1
  - Dataset Name: Mental Health & Technology Usage Dataset
  - Link to the dataset: https://docs.google.com/spreadsheets/d/e/2PACX-1vSFmPYlesdIKLYwQpsMuzlEFwbe5A6bdx_mydBtx-6oZRoFIIMvfPrf5GBhoIRD_AklksWy1tBDpTa0/pub?output=csv  
  - Number of observations: 10,000
  - Number of variables: 14
  - Description: 
    - This dataset offers information on how various factors impact and individual’s mental health. Demographic information is taken such as age and gender. Variables such as daily screen time, self-reported mental health and sleep quality scores from scales of 1-10 are included. The dataset also includes other lifestyle details such as gaming, physical activity, support system access, and work environment effects. There are 10,000 unique observations, though we anticipate that this number will decrease greatly since the age range is from 18-65, and our focus is individuals aged 18-25.
  - Our main focus will be on the following variables:
    - Sleep factors: 
      - Self-reported number of sleep hours
    - Lifestyle factors: 
      - Technology Usage: Reported screen time, technology usage, social media usage, and gaming hours
      - Mental Health: Self-reported stress levels and mental health status 
      - Physical activity: Self-reported measure of daily physical activity 
    - Demographic data: 
      - Age and Gender
  - These variables will be useful in our project because usage of technology and social media may contribute to higher levels of anxiety and depression that may impact sleep quality while online support and support system usage may contribute positively by lowering levels of stress. 
  - Mental health status and gaming hours are also an important variable because it allows us to measure how mental health can affect sleep and sleep quality.<br>
- Dataset #2
  - Dataset Name: Analyzing Student Sleep Patterns and Predicting Sleep Quality
  - Link to the dataset: https://docs.google.com/spreadsheets/d/e/2PACX-1vQRUIlfM1na4PQQAaXLBWv7uoF_opKHsJs_a0spgsqEQC2_HPWvYjt2bOJ2sxIAMA9rIsqGJcDtTGA5/pub?output=csv
  - Number of observations: 500
  - Number of variables: 10
  - Description: 
    - This dataset explores sleep behaviors and lifestyle factors among university students, focusing on variables such as sleep duration, bedtime and waketime on weekdays and weekends, study hours, screen time, caffeine intake, and physical activity levels. The data also includes demographic factors (age, gender, university year) and self-reported sleep quality. 
    - The dataset aims to provide insights into how lifestyle and academic habits relate to students' sleep patterns and overall sleep quality, offering potential areas of focus for improving student wellness.
  - Our main focus will be on the following variables:
    - Sleep Factors: Quality and duration	
      - Quality: Self-reported, allowing a subjective measure of sleep satisfaction or restfulness	
      - Duration: Measured by actual sleep hours, differentiated between weekdays and weekends.
    - Substance Use: Caffeine
      - Caffeine Intake: Detailed data on daily caffeine consumption.
    - Lifestyle Factors: 
      - Screen time: Reported screen time hours, providing insights into possible influences on sleep quality.
      - Physical Activity Levels: Self-reported measure of physical activity, relevant to overall health and sleep quality.
    - Demographics:
      - Age and Gender: Essential demographic data that can influence sleep patterns and lifestyle choices.
      - University Year: Provides insight into academic demands related to sleep patterns.<br>

## Import Statements for Data Loading, Cleaning, Tidying, and Wrangling

In [1]:
import pandas as pd
import numpy as np

ModuleNotFoundError: No module named 'pandas'

## Mental Health & Technology Usage Dataset

In [6]:
# load csv
mental_health_data = pd.read_csv('https://docs.google.com/spreadsheets/d/e/2PACX-1vSFmPYlesdIKLYwQpsMuzlEFwbe5A6bdx_mydBtx-6oZRoFIIMvfPrf5GBhoIRD_AklksWy1tBDpTa0/pub?output=csv')

mental_health_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   User_ID                   10000 non-null  object 
 1   Age                       10000 non-null  int64  
 2   Gender                    10000 non-null  object 
 3   Technology_Usage_Hours    10000 non-null  float64
 4   Social_Media_Usage_Hours  10000 non-null  float64
 5   Gaming_Hours              10000 non-null  float64
 6   Screen_Time_Hours         10000 non-null  float64
 7   Mental_Health_Status      10000 non-null  object 
 8   Stress_Level              10000 non-null  object 
 9   Sleep_Hours               10000 non-null  float64
 10  Physical_Activity_Hours   10000 non-null  float64
 11  Support_Systems_Access    10000 non-null  object 
 12  Work_Environment_Impact   10000 non-null  object 
 13  Online_Support_Usage      10000 non-null  object 
dtypes: floa

In [7]:
mental_health_data.head()

Unnamed: 0,User_ID,Age,Gender,Technology_Usage_Hours,Social_Media_Usage_Hours,Gaming_Hours,Screen_Time_Hours,Mental_Health_Status,Stress_Level,Sleep_Hours,Physical_Activity_Hours,Support_Systems_Access,Work_Environment_Impact,Online_Support_Usage
0,USER-00001,23,Female,6.57,6.0,0.68,12.36,Good,Low,8.01,6.71,No,Negative,Yes
1,USER-00002,21,Male,3.01,2.57,3.74,7.61,Poor,High,7.28,5.88,Yes,Positive,No
2,USER-00003,51,Male,3.04,6.14,1.26,3.16,Fair,High,8.04,9.81,No,Negative,No
3,USER-00004,25,Female,3.84,4.48,2.59,13.08,Excellent,Medium,5.62,5.28,Yes,Negative,Yes
4,USER-00005,53,Male,1.2,0.56,0.29,12.63,Good,Low,5.55,4.0,No,Positive,Yes


In [9]:
# check number of missing values per column
missing_data = mental_health_data.isnull().sum()
print('Missing values in each column:\n' , missing_data)

Missing values in each column:
 User_ID                     0
Age                         0
Gender                      0
Technology_Usage_Hours      0
Social_Media_Usage_Hours    0
Gaming_Hours                0
Screen_Time_Hours           0
Mental_Health_Status        0
Stress_Level                0
Sleep_Hours                 0
Physical_Activity_Hours     0
Support_Systems_Access      0
Work_Environment_Impact     0
Online_Support_Usage        0
dtype: int64


In [11]:
# clean categorical columns
if 'Gender' in mental_health_data.columns:
	mental_health_data['Gender'] = mental_health_data['Gender'].str.capitalize()
    
if 'Mental_Health_Status' in mental_health_data.columns:
	mental_health_data['Mental_Health_Status'] = mental_health_data['Mental_Health_Status'].str.capitalize()

In [13]:
# display values in categorical columns
print("Unique values in Gender:", mental_health_data['Gender'].unique() if 'Gender' in mental_health_data.columns else 'No Gender Column')

print("Unique values in Mental Health Status:", mental_health_data['Mental_Health_Status'].unique() if 'Mental_Health_Status' in mental_health_data.columns else 'No Mental Health Status Column')

Unique values in Gender: ['Female' 'Male' 'Other']
Unique values in Mental Health Status: ['Good' 'Poor' 'Fair' 'Excellent']


In [14]:
# convert columns to numeric (if needed)
numeric_columns = ['Screen_Time', 'Physical_Activity', 'Sleep_Hours'] 

for col in numeric_columns:
	if col in mental_health_data.columns:
	      mental_health_data[col] = pd.to_numeric(mental_health_data[col], errors='coerce')

In [17]:
# ensure all sleep hours are within 24 hour values
if 'Sleep_Hours' in mental_health_data.columns:
	mental_health_data = mental_health_data[(mental_health_data['Sleep_Hours'] >= 0) & (mental_health_data['Sleep_Hours'] <= 24)]

In [34]:
# summary stats to verify cleaning
mental_health_data.describe()

Unnamed: 0,Age,Technology_Usage_Hours,Social_Media_Usage_Hours,Gaming_Hours,Screen_Time_Hours,Sleep_Hours,Physical_Activity_Hours
count,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0
mean,41.5186,6.474341,3.972321,2.515598,7.975765,6.500724,5.00386
std,13.920217,3.169022,2.313707,1.446748,4.042608,1.450933,2.905044
min,18.0,1.0,0.0,0.0,1.0,4.0,0.0
25%,29.0,3.76,1.98,1.26,4.52,5.26,2.49
50%,42.0,6.425,3.95,2.52,7.9,6.5,4.99
75%,54.0,9.2125,5.99,3.79,11.5,7.76,7.54
max,65.0,12.0,8.0,5.0,15.0,9.0,10.0


## Analyzing Student Sleep Patterns and Predicting Sleep Quality

In [19]:
# load in data
sleep_data = pd.read_csv('https://docs.google.com/spreadsheets/d/e/2PACX-1vQRUIlfM1na4PQQAaXLBWv7uoF_opKHsJs_a0spgsqEQC2_HPWvYjt2bOJ2sxIAMA9rIsqGJcDtTGA5/pub?output=csv')

In [20]:
sleep_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 14 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Student_ID           500 non-null    int64  
 1   Age                  500 non-null    int64  
 2   Gender               500 non-null    object 
 3   University_Year      500 non-null    object 
 4   Sleep_Duration       500 non-null    float64
 5   Study_Hours          500 non-null    float64
 6   Screen_Time          500 non-null    float64
 7   Caffeine_Intake      500 non-null    int64  
 8   Physical_Activity    500 non-null    int64  
 9   Sleep_Quality        500 non-null    int64  
 10  Weekday_Sleep_Start  500 non-null    float64
 11  Weekend_Sleep_Start  500 non-null    float64
 12  Weekday_Sleep_End    500 non-null    float64
 13  Weekend_Sleep_End    500 non-null    float64
dtypes: float64(7), int64(5), object(2)
memory usage: 54.8+ KB


In [21]:
sleep_data.head()

Unnamed: 0,Student_ID,Age,Gender,University_Year,Sleep_Duration,Study_Hours,Screen_Time,Caffeine_Intake,Physical_Activity,Sleep_Quality,Weekday_Sleep_Start,Weekend_Sleep_Start,Weekday_Sleep_End,Weekend_Sleep_End
0,1,24,Other,2nd Year,7.7,7.9,3.4,2,37,10,14.16,4.05,7.41,7.06
1,2,21,Male,1st Year,6.3,6.0,1.9,5,74,2,8.73,7.1,8.21,10.21
2,3,22,Male,4th Year,5.1,6.7,3.9,5,53,5,20.0,20.47,6.88,10.92
3,4,24,Other,4th Year,6.3,8.6,2.8,4,55,9,19.82,4.08,6.69,9.42
4,5,20,Male,4th Year,4.7,2.7,2.7,0,85,3,20.98,6.12,8.98,9.01


In [22]:
# converting time columns to uniform 24-hr format
time_columns = ['Weekday_Sleep_Start', 'Weekend_Sleep_Start', 'Weekday_Sleep_End', 'Weekend_Sleep_End']

In [24]:
# 24 hr format
for col in time_columns:
	sleep_data[col] = pd.to_datetime(sleep_data[col], errors='coerce').dt.hour

In [25]:
# handle missing values
missing_data = sleep_data.isnull().sum()
print("Missing values in each column:\n", missing_data)

Missing values in each column:
 Student_ID             0
Age                    0
Gender                 0
University_Year        0
Sleep_Duration         0
Study_Hours            0
Screen_Time            0
Caffeine_Intake        0
Physical_Activity      0
Sleep_Quality          0
Weekday_Sleep_Start    0
Weekend_Sleep_Start    0
Weekday_Sleep_End      0
Weekend_Sleep_End      0
dtype: int64


In [26]:
# drop rows with missing values
sleep_data.dropna(inplace=True)

In [35]:
# clean categorical variables
sleep_data['Gender'] = sleep_data['Gender'].str.capitalize()
sleep_data['University_Year'] = sleep_data['University_Year'].str.lower()

In [36]:
# display unique values in categorical columns
print('Unique values in Gender:', sleep_data['Gender'].unique())
print('Unique values in Univeristy Year:', sleep_data['University_Year'].unique())

Unique values in Gender: ['Other' 'Male' 'Female']
Unique values in Univeristy Year: ['2nd year' '1st year' '4th year' '3rd year']


In [33]:
# summary stats to verify cleaning
print(sleep_data.describe())

       Student_ID        Age  Sleep_Duration  Study_Hours  Screen_Time  \
count  500.000000  500.00000      500.000000   500.000000   500.000000   
mean   250.500000   21.53600        6.472400     5.981600     2.525000   
std    144.481833    2.33315        1.485764     3.475725     0.859414   
min      1.000000   18.00000        4.000000     0.100000     1.000000   
25%    125.750000   20.00000        5.100000     2.900000     1.800000   
50%    250.500000   21.00000        6.500000     6.050000     2.600000   
75%    375.250000   24.00000        7.800000     8.800000     3.300000   
max    500.000000   25.00000        9.000000    12.000000     4.000000   

       Caffeine_Intake  Physical_Activity  Sleep_Quality  Weekday_Sleep_Start  \
count       500.000000         500.000000     500.000000                500.0   
mean          2.462000          62.342000       5.362000                  0.0   
std           1.682325          35.191674       2.967249                  0.0   
min      

# Ethics & Privacy

In our project, there are potential unintended consequences that could arise from our analysis. By focusing on specific measurable factors, we may inadvertently oversimplify the complexity of sleep behaviors among students. For instance, our findings could reinforce stereotypes or generalizations about students with irregular sleep schedules without fully considering underlying causes like socioeconomic pressures, mental health challenges, or cultural factors. Such oversights might lead to conclusions that are not inclusive or effective for all students.

Our datasets may exclude or mislabel students who have underlying health conditions such as insomnia, mental health issues, or chronic illnesses. These populations might be underrepresented, and their data could be disproportionately treated as outliers and removed before the datasets were made accessible to us. Additionally, we acknowledge that the anonymized nature of our datasets limits our ability to confirm that the data truly represents college students across the US, which could introduce regional or demographic biases.

Socioeconomic factors present another challenge. For example, students from lower-income backgrounds may have formed sleep habits during high school while managing jobs or other responsibilities, contributing to what might be labeled as "unhealthy" sleep schedules. Our project does not directly measure these socioeconomic factors, and we acknowledge that this could limit the depth of our analysis. Similarly, interpersonal relationships and cultural practices, such as international students managing time zone differences to stay connected with family, could also influence sleep patterns in ways not captured in our surveys.

Moreover, we do not explicitly address how habits may have changed post-pandemic (e.g., altered sleep patterns due to remote learning or shifting routines) or the impact of environmental factors such as shared living spaces, loud roommates, or limited access to comfortable sleeping arrangements. These factors could significantly influence sleep quality but are not accounted for in our analysis.

To mitigate privacy concerns, our datasets anonymize participants by assigning numbers to each observation. While this ensures no personal or identifying information is used, it also limits our ability to contextualize the data fully. Participants' self-reported survey responses may also contain inaccuracies due to social desirability bias or hesitancy to disclose substance or alcohol use, potentially skewing our results.

To mitigate these biases and unintended consequences, we will:
1. Acknowledge the limitations of our data and analysis, particularly the absence of socioeconomic and environmental context.    
2. Avoid broad generalizations, presenting findings as trends specific to the datasets.
3. Recommend future research to explore underrepresented populations and additional variables like socioeconomic status, mental health, and environmental factors.

By critically reflecting on these aspects, we aim to ensure our study is ethical, inclusive, and aware of its limitations.

# Team Expectations 

* We will agree on clear task assignments for each project phase, with set deadlines that all members commit to meeting. Each member will be responsible for updating the group on their progress and any roadblocks encountered.
* We are committed to fostering collaboration by encouraging participation from each team member, ensuring that everyone feels included by encouraging each other to participate and valuing everyone’s ideas.
* We will provide honest and constructive feedback on our collaborative work, supporting one another in producing high-quality outcomes.
* We agree to take turns, as needed, in leadership responsibilities such as planning meetings, setting tasks, sending reminders, and submitting assignments.
* If a member faces difficulty meeting a deadline, they will inform the group promptly. The team will help redistribute tasks or support the member, ensuring project progress and shared responsibility.
* If any conflicts come up, we are committed to handling them respectfully. Team members not involved will help mediate, and those involved will work together to solve the issue and keep the group’s harmony.
* Before finalizing any submission, each member will review both their contributions and those of others, providing constructive feedback to ensure high-quality work across all project sections.

# Project Timeline Proposal

| Meeting Date | Meeting Time | Completed Before Meeting                                    | Discuss at Meeting                                                                                          |
|--------------|--------------|------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------|
| 10/02        | 11 AM        | Initial brainstorming of potential research questions      | Discuss and finalize five potential project ideas; determine preferred communication methods and project goals |
| 10/07        | 11 AM        | Conduct initial background research on each idea; review previous years’ projects | Discuss insights from previous projects; identify relevant data sources; narrow down to 2-3 viable research questions |
| 10/14        | 11 AM        | Review background research on shortlisted questions        | Finalize a research question, select a previous project for the project review assignment                   |
| 10/21        | 11 AM        | Complete project review notes; submit assignment           | Refine research question: "Which measurable sleep and lifestyle factors are most strongly associated with healthy sleep schedules among college students in the US?" based on course content; conduct additional research; divide responsibilities for project proposal assignment |
| 10/29        | 6 PM         | Gather notes for project proposal                          | Collaborate on finalizing the project proposal draft; assign final touches for submission                   |
| 10/31        | 6 PM         | Prepare survey questions individually                      | Create a survey to gather data from college students (ages 18-25) in the US; finalize survey questions and begin distributing online |
| 11/04        | 11 AM        | Review relevant datasets; gather preliminary data; distribute survey link | Prepare for data checkpoint; discuss potential challenges in data collection and cleaning; finalize selected dataset for Checkpoint 1 |
| 11/11        | 3 PM         | Continue distributing survey link; each team member reviewed relevant datasets | Finalize the datasets for this study; ensure all datasets align with the research question                  |
| 11/12        | 1 PM         | Work on Checkpoint 1 submission; review survey progress (45 responses so far) | Complete final touches and submit Checkpoint 1; outline initial steps for exploratory data analysis (EDA); assign tasks for data wrangling and cleaning |
| 11/18        | 11 AM        | Initial data cleaning completed; survey distribution ongoing | Review EDA progress; discuss any issues with data visualization and analysis; make adjustments as necessary for Checkpoint 2 |
| 11/25        | 11 AM        | Complete EDA Checkpoint                                    | Submit EDA for Checkpoint 2; outline steps for further analysis and insights gathering; assign tasks for drafting results and discussion sections |
| 12/02        | 11 AM        | Begin drafting results, discussion, and conclusion; survey distribution near completion | Review findings; finalize insights and visualizations; organize the draft for final project submission       |
| 12/05        | 7 PM         | Finalize survey data collection and analysis               | Ensure all survey responses are integrated into analysis and results                                        |
| 12/09        | 11 AM        | Complete project                                           | Submit final project and complete any required group surveys or feedback forms                              |