# Sleep Quality Classification
## Hanna Chang

---
## Abstract
This project explores how machine learning can be used to classify sleep quality from a range of health and lifestyle features. Three algorithms were applied: Logistic Regression on PCA-reduced features, Random Forest on the full feature set, and Gradient Boosting, also on the full feature set. The models were evaluated using accuracy, F1 score, execution time, and confusion matrices. The dataset was sourced from Kaggle and includes information such as stress level, heart rate, BMI, and physical activity level. Our results show that Logistic Regression with PCA achieved the highest accuracy (98.7%) and the lowest computation time (0.07 s), while Gradient Boosting was a close second in accuracy (97.3%). Feature importance was also analyzed to interpret which lifestyle factors most influence sleep quality.


<div style="page-break-before: always;"></div>

## Table of Contents
1. [Abstract](#abstract)
2. [1. Preliminaries](#1-preliminaries)
   - [1.1 Goal and Motivations](#11-goal-and-motivations)
   - [1.2 Data](#12-data)
   - [1.3 Preprocessing](#13-preprocessing)
3. [2. Methodology](#2-methodology)
   - [2.1 Dimensionality Reduction (PCA)](#21-dimensionality-reduction-pca)
   - [2.2 Feature Importance](#22-feature-importance)
   - [Correlation Heatmap](#correlation-heatmap)
4. [3. Modeling](#3-modeling)
   - [3.1 Logistic Regression (with PCA)](#31-logistic-regression-with-pca)
   - [3.2 Random Forest (without PCA)](#32-random-forest-without-pca)
   - [3.3 Gradient Boosting (without PCA)](#33-gradient-boosting-without-pca)
   - [3.4 Feature Importance Chart](#34-feature-importance-chart)
   - [3.5 Model Configuration and Evolution](#35-model-configuration-and-evolution)
5. [4. Results & Analysis](#4-results--analysis)
6. [5. Discussion](#5-discussion)
7. [6. Conclusion](#6-conclusion)
8. [7. References](#7-references)

<div style="page-break-before: always;"></div>

---

## 1. Preliminaries

### 1.1 Goal and Motivations
The goal of this project is to develop a classification system capable of predicting the quality of an individual's sleep based on their lifestyle and health indicators. In modern life, poor sleep affects millions, contributing to mental and physical health issues. With increased access to wearable health tech, it becomes feasible to use data science to better understand what lifestyle patterns correlate with good or poor sleep quality.

### 1.2 Data
The dataset was sourced from Kaggle and is titled **"Sleep Health and Lifestyle Dataset"**. It includes 374 records and 15 variables, ranging from age and gender to BMI, stress level, blood pressure, and sleep duration. The target variable is "Quality of Sleep" and it has multiple ordinal classes.

### 1.3 Preprocessing

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load the dataset and drop rows with missing values
df = pd.read_csv("Sleep_health_and_lifestyle_dataset.csv")
df.dropna(inplace=True)

# One-hot encode categorical columns, dropping the first level of each
df_encoded = pd.get_dummies(df, drop_first=True)

# Separate the features from the target
X = df_encoded.drop("Quality of Sleep", axis=1)
y = df_encoded["Quality of Sleep"]
```

#### 1.3.1 Column Renaming and Dropping
We retained all relevant features and dropped only rows with missing data. One-hot encoding was applied to categorical variables such as occupation and gender.

#### 1.3.2 Encoding
Categorical columns were transformed using `pd.get_dummies()` with `drop_first=True` to avoid multicollinearity. The final dataset had 41 features after encoding.
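
To make the effect of `drop_first=True` concrete, here is a minimal sketch on a toy frame. The rows and values are hypothetical and only mimic the style of the dataset's columns; they are not taken from the real data.

```python
import pandas as pd

# Hypothetical toy rows (assumption): not records from the Kaggle dataset
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Female"],
    "Occupation": ["Nurse", "Doctor", "Engineer"],
    "Stress Level": [6, 8, 4],
})

# drop_first=True removes the first (sorted) level of each categorical column,
# so k categories become k - 1 indicator columns
encoded = pd.get_dummies(toy, drop_first=True)
print(list(encoded.columns))
```

The dropped level ("Female", "Doctor") becomes the implicit baseline: a row with zeros in every indicator for that column belongs to it.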

#### 1.3.3 Data Visualization
Visual analysis of the sleep quality distribution and feature relationships helped guide the choice of modeling strategy. Bar plots and histograms showed distributions of stress level, heart rate, and physical activity across different sleep quality scores. Key histograms for Stress Level, Sleep Duration, Heart Rate, Age, and Daily Steps are shown below.

#### Distribution Plots

#### Stress Level
![Stress Level](hist_Stress_Level.png)

#### Sleep Duration
![Sleep Duration](hist_Sleep_Duration.png)

#### Heart Rate
![Heart Rate](hist_Heart_Rate.png)

#### Age
![Age](hist_Age.png)

#### Daily Steps
![Daily Steps](hist_Daily_Steps.png)

#### Distribution of Sleep Quality
![Sleep Quality Distribution](sleepdist.png)

<div style="page-break-before: always;"></div>

---

## 2. Methodology

### 2.1 Dimensionality Reduction (PCA)
PCA was applied to reduce the dimensionality while preserving 95% of the variance, resulting in 17 principal components. This transformation was used for training the logistic regression model.

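```python
from sklearn.decomposition import PCA

# Standardize the features so each contributes equally to the components
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# A float n_components keeps just enough components to explain 95% of the variance
pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(X_scaled)
```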
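
One way to check how many components a 95% variance threshold keeps is to inspect the cumulative explained variance ratio. The sketch below uses synthetic data as a stand-in (an assumption, with hypothetical names `X_demo` and `pca_demo`), so the component count it prints will not match the 17 reported for the sleep dataset.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in (assumption): 10 features driven by 3 latent factors plus noise
latent = rng.normal(size=(200, 3))
X_demo = latent @ rng.normal(size=(3, 10)) + 0.1 * rng.normal(size=(200, 10))

X_demo_scaled = StandardScaler().fit_transform(X_demo)
pca_demo = PCA(n_components=0.95).fit(X_demo_scaled)

# Cumulative share of variance explained by the retained components
cumulative = np.cumsum(pca_demo.explained_variance_ratio_)
print(pca_demo.n_components_, cumulative[-1])
```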

### 2.2 Feature Importance

Random Forest was trained on the original feature set and used to derive feature importances. The top features were:
1. Sleep Duration
2. Stress Level
3. Heart Rate
4. Age
5. Daily Steps

Other features like BMI category and occupation type also contributed slightly but less significantly.


```python
from sklearn.ensemble import RandomForestClassifier

# Fit on the full scaled feature set and read the impurity-based importances
rf = RandomForestClassifier()
rf.fit(X_scaled, y)
importances = rf.feature_importances_
```
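
A ranking like the one above can be produced by pairing each importance with its column name and sorting. This is a minimal sketch on synthetic data; the feature names (`feature_0`, ...) and the dataset are placeholders, not the sleep features.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with placeholder column names (assumption)
X_arr, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(6)])

rf_demo = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_demo, y_demo)

# Attach names to the importances and sort from most to least important
ranked = pd.Series(rf_demo.feature_importances_, index=X_demo.columns)
ranked = ranked.sort_values(ascending=False)
print(ranked.head(5))
```

Impurity-based importances sum to 1, so each value can be read as a share of the forest's total impurity reduction.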

### Correlation Heatmap

![Correlation Heatmap](correlation_heatmap.png)

<div style="page-break-before: always;"></div>

---

## 3. Modeling

### 3.1 Logistic Regression (with PCA)


```python
from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_pca, y)
```


- Input: PCA-transformed features
- Accuracy: **98.7%**
- F1 Score: **0.985**
- Execution Time: **0.07 seconds**
- Performed best when classifying the extreme classes (6 and 9)

**Figure 1**: Confusion Matrix - Logistic Regression  
![Logistic Regression Confusion Matrix](Figure_1.png)
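
For readers unfamiliar with how such a matrix is built, the sketch below computes one with `sklearn.metrics.confusion_matrix`. The labels are hypothetical and chosen only to illustrate the layout; they are not the model's actual predictions.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels (assumption): not the project's real predictions
y_true = [6, 6, 7, 8, 9, 9]
y_pred = [6, 7, 7, 8, 9, 9]

# Rows are true classes, columns are predicted classes, in the order given by labels
cm = confusion_matrix(y_true, y_pred, labels=[6, 7, 8, 9])
print(cm)
```

Correct predictions sit on the diagonal; here the single off-diagonal entry is a true class 6 predicted as 7.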

### 3.2 Random Forest (without PCA)


```python
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_scaled, y)
```


- Input: Full scaled feature set
- Accuracy: **96.0%**
- F1 Score: **0.958**
- Execution Time: **0.17 seconds**

**Figure 2**: Confusion Matrix - Random Forest  
![Random Forest Confusion Matrix](Figure_2.png)

### 3.3 Gradient Boosting (without PCA)


```python
from sklearn.ensemble import GradientBoostingClassifier

gbc = GradientBoostingClassifier()
gbc.fit(X_scaled, y)
```


- Input: Full scaled feature set
- Accuracy: **97.3%**
- F1 Score: **0.973**
- Execution Time: **0.81 seconds**

**Figure 3**: Confusion Matrix - Gradient Boosting  
![Gradient Boosting Confusion Matrix](Figure_4.png)

### 3.4 Feature Importance Chart
**Figure 4**: Feature Importance - Random Forest  
![Random Forest Feature Importance](Figure_3.png)

### 3.5 Model Configuration and Evolution
To better understand model behavior, we started from baseline hyperparameters used in course practice and explored how changing them affected accuracy and runtime:

- **Logistic Regression (with PCA)**: Used `max_iter=1000` to ensure the solver converged on the PCA-transformed input. Achieved 98.7% accuracy with minimal compute time (0.07s).

- **Random Forest**:
  - Initial config: `n_estimators=100`, `random_state=42`
  - This gave 96.0% accuracy and strong feature interpretability.
  - A trial with `n_estimators=200` slightly improved accuracy to ~96.5% but doubled the execution time, so `n_estimators=100` was selected for efficiency.

- **Gradient Boosting**:
  - Used standard `n_estimators=100` and `learning_rate=0.1`.
  - We tested a lower learning rate (0.05) and higher tree depth, but accuracy gains were marginal (<0.5%) while training time increased significantly.

These experiments showed that the small accuracy gains from basic hyperparameter changes did not justify the added runtime, given the already strong results. We therefore kept the baseline configurations for the final evaluation.
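
A comparison of this kind can be sketched as a loop that times each configuration and records its accuracy. The code below is illustrative only: it uses synthetic data and an assumed 80/20 split, so its numbers will differ from those reported above.

```python
import time
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): not the sleep dataset
X_demo, y_demo = make_classification(
    n_samples=374, n_features=10, n_informative=5, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

trials = {}
for n_trees in (100, 200):
    model = RandomForestClassifier(n_estimators=n_trees, random_state=42)
    start = time.perf_counter()
    model.fit(X_train, y_train)
    fit_seconds = time.perf_counter() - start
    # score() returns mean accuracy on the held-out data
    trials[n_trees] = {"accuracy": model.score(X_test, y_test), "fit_seconds": fit_seconds}

for n_trees, row in trials.items():
    print(n_trees, round(row["accuracy"], 3), round(row["fit_seconds"], 3))
```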


<div style="page-break-before: always;"></div>

---

## 4. Results & Analysis

| Model                     | Accuracy | Weighted F1 | Execution Time (s) |
|--------------------------|----------|--------------|---------------------|
| Logistic Regression (PCA)| 0.9867   | 0.985        | 0.07                |
| Random Forest            | 0.9600   | 0.958        | 0.17                |
| Gradient Boosting        | 0.9733   | 0.973        | 0.81                |

Logistic Regression with PCA achieved the highest accuracy and fastest execution, while Gradient Boosting showed strong predictive power at the cost of longer compute time. Random Forest remains a strong, interpretable baseline.
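
The reported accuracies are multiples of 1/75 (74/75, 72/75, 73/75), which would be consistent with a 20% hold-out of the 374 records, although the split is not stated in this report. The sketch below shows one way such a table could be produced. The split parameters, the synthetic data, and the use of a `Pipeline` (which fits scaling and PCA on the training portion only) are all assumptions, not the project's actual evaluation code.

```python
import time
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): same row count as the dataset, different content
X_demo, y_demo = make_classification(
    n_samples=374, n_features=10, n_informative=5, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

models = {
    "Logistic Regression (PCA)": make_pipeline(
        StandardScaler(), PCA(n_components=0.95), LogisticRegression(max_iter=1000)
    ),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
}

results = {}
for name, model in models.items():
    start = time.perf_counter()
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, preds),
        "weighted_f1": f1_score(y_test, preds, average="weighted"),
        "seconds": time.perf_counter() - start,  # fit plus predict time
    }

for name, row in results.items():
    print(name, {k: round(v, 3) for k, v in row.items()})
```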

---

## 5. Discussion
- **Logistic Regression** with PCA offered the best combination of speed and accuracy on this small dataset (98.7% accuracy in 0.07 s).
- **Random Forest** is robust and interpretable, providing direct insight into which features influence predictions.
- **Gradient Boosting** achieved strong predictive accuracy (97.3%) but was the most computationally expensive model (0.81 s).
- A heatmap of feature correlations supports the selection of top features, while histograms confirm reasonable distributions.

---

## 6. Conclusion
This study demonstrates that machine learning models can effectively classify sleep quality. Among the tested models, Logistic Regression with PCA emerged as the most efficient, while Gradient Boosting achieved comparable predictive performance. These findings suggest potential for real-time sleep quality classification in wearable technologies. Future work could include ensemble stacking, neural networks, or deeper time-series analysis.

<div style="page-break-before: always;"></div>

---

## 7. References
- Dataset: http://kaggle.com/datasets/uom190346a/sleep-health-and-lifestyle-dataset/data
- Scikit-learn documentation: https://scikit-learn.org
- GeeksforGeeks: tutorials on PCA, Random Forest, and Boosting

---
