# Lessons Learned from Previous Failure

- Understand the data
- Ensure proper data use
- Understand basic differences between linear and non-linear problems and their algorithmic solutions

# Student Performance - Multiple Linear Regression
This project explores how study habits, previous scores, extracurricular activities, sleep, and practice tests relate to a synthetic **Performance Index** (10 – 100).

**Author:** Joshua E. Brown  
**Date:** 2025-08-06


# Project Directory Structure

gradient_descent_scratch_p_project0/
- data/
  - aircraft_data.csv
- g_d_s_p0/
  - .spyproject
- notebooks/
  - project_documentation.ipynb
- src/
  - main.py
- plots/
- README.md
- requirements.txt
- .gitignore


# Dataset Selection

> - Title: Student Performance (Multiple Linear Regression)
> - Compiled by: Nikhil Narayan
> - Hosting site: Kaggle
> - License: Open-source
> - URL: https://www.kaggle.com/datasets/nikhil7280/student-performance-multiple-linear-regression


# Student Performance (Multiple Linear Regression) — Project Documentation

> *This file is intended to live at the top of your Jupyter notebook as a fully-formatted Markdown documentation page. Add code cells **after** this block.*

## 1. Introduction  

This project investigates how common study-related factors relate to an overall **Performance Index**, using a synthetic dataset of 10,000 students. The goal is to demonstrate the end-to-end workflow for multiple linear regression (and optional alternatives), from data inspection through evaluation and interpretation.

### Key Points  
- **Synthetic data**: Relationships uncovered here are illustrative, not prescriptive.  
- **Target**: *Performance Index* (integer 10 – 100, higher = better).  
- **Predictors**: Study hours, prior scores, extracurricular participation, sleep hours, practice papers.

## 2. Dataset Overview  

| Feature | Type | Description | Typical Range |
|---------|------|-------------|---------------|
| Hours Studied | Numeric | Total hours spent studying | 1 – 9 |
| Previous Scores | Numeric | Score in prior test (%) | 40 – 99 |
| Extracurricular Activities | Categorical | **Yes/No** participation | – |
| Sleep Hours | Numeric | Average sleep per day (hours) | 4 – 9 |
| Sample Question Papers Practiced | Integer | Count of practice papers | 0 – 9 |
| **Performance Index** | Integer | Overall academic performance (target) | 10 – 100 |
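
A quick way to confirm that a loaded frame matches the table is to count values that fall outside the listed ranges. The sketch below is illustrative only: the helper name `out_of_range_counts` and the toy rows are assumptions made for this example, and only three of the documented columns are checked.

```python
import pandas as pd

# Typical ranges for three of the numeric columns, taken from the table above.
TYPICAL_RANGES = {
    "Previous Scores": (40, 99),
    "Sleep Hours": (4, 9),
    "Performance Index": (10, 100),
}


def out_of_range_counts(df, ranges):
    """Return {column: number of rows outside [low, high]} for each listed column.

    Missing values are counted as out of range, because
    Series.between() evaluates to False for NaN.
    """
    counts = {}
    for column, (low, high) in ranges.items():
        outside = ~df[column].between(low, high)  # inclusive on both ends
        counts[column] = int(outside.sum())
    return counts


# Hypothetical toy rows (not real dataset values) with one deliberate outlier.
toy = pd.DataFrame({
    "Previous Scores": [55, 72, 99, 120],
    "Sleep Hours": [4, 7, 9, 6],
    "Performance Index": [35, 60, 91, 88],
})
report = out_of_range_counts(toy, TYPICAL_RANGES)
print(report)
```

On the real CSV, a non-zero count would suggest either a data-entry issue or that the "typical range" in the table is too narrow.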

## 3. Notebook Roadmap  

1. **Environment Setup**  
   - Import NumPy, pandas, seaborn/matplotlib, scikit-learn.  
   - Configure plot aesthetics & random seeds.

2. **Data Loading & Inspection**  
   - Load the CSV.  
   - Display shape, `head()`, `info()`, summary stats, and missing-value counts.

3. **Exploratory Data Analysis (EDA)**  
   - *Univariate*: Histograms/KDEs; bar chart for extracurriculars.  
   - *Bivariate*: Scatterplots of each predictor vs. **Performance Index** with regression lines; boxplot for extracurricular comparison.  
   - *Correlation Matrix*: Pearson coefficients + heatmap to spot multicollinearity.

4. **Data Preparation**  
   - Encode `Extracurricular Activities` (Yes → 1, No → 0).  
   - Train/test split (e.g., 80/20).  
   - Optionally scale features (e.g., `StandardScaler`), fitting the scaler on the training split only so that no test-set information leaks into training.

5. **Modeling**  
   - **Multiple Linear Regression**: Fit OLS model, show coefficients.  
   - **Metrics**: R², MAE, RMSE on both splits; plot predicted vs. actual.  
   - **Diagnostics** (optional): Residuals vs. fitted, Q-Q plot, VIF for multicollinearity.

6. **Feature Importance & Interpretation**  
   - Discuss sign and magnitude of coefficients.  
   - Relate findings to plausible academic behaviors (with the synthetic caveat).

7. **Model Enhancement (Optional)**  
   - Regularization: Ridge, Lasso, ElasticNet.  
   - Tree-based or ensemble regressors for non-linear relationships.  
   - Cross-validation for robust hyperparameter tuning.

8. **Conclusions**  
   - Summarize insights, model performance, and next steps (e.g., testing on real data).

9. **Appendix**  
   - Helper functions, plotting utilities, reproducibility notes (library versions, seeds).
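
Roadmap steps 4 and 5 (encode, split, fit, score) can be sketched in a few lines of scikit-learn. This is a minimal sketch, not the notebook's actual code: the function names, the seed, and the placeholder data generator are assumptions made here so the example runs without the CSV, and the generator uses only a subset of the documented columns with invented coefficients.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

SEED = 42  # arbitrary value, fixed only for reproducibility
TARGET = "Performance Index"


def make_placeholder_data(n_rows=500, seed=SEED):
    """Build a stand-in frame using a subset of the documented column names.

    The coefficients are invented for illustration; they are NOT the ones
    used to generate the Kaggle dataset.
    """
    rng = np.random.default_rng(seed)
    df = pd.DataFrame({
        "Hours Studied": rng.integers(1, 10, n_rows),
        "Previous Scores": rng.integers(40, 100, n_rows),
        "Extracurricular Activities": rng.choice(["Yes", "No"], n_rows),
        "Sleep Hours": rng.integers(4, 10, n_rows),
    })
    noise = rng.normal(0.0, 2.0, n_rows)
    df[TARGET] = 2.5 * df["Hours Studied"] + 1.0 * df["Previous Scores"] - 30.0 + noise
    return df


def run_pipeline(df):
    # Step 4: encode Yes/No as 1/0, then hold out 20% of the rows for testing.
    X = df.drop(columns=[TARGET]).copy()
    X["Extracurricular Activities"] = X["Extracurricular Activities"].map({"Yes": 1, "No": 0})
    y = df[TARGET]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=SEED
    )

    # Step 5: fit ordinary least squares and score both splits.
    model = LinearRegression().fit(X_train, y_train)
    scores = {}
    for name, X_part, y_part in [("train", X_train, y_train), ("test", X_test, y_test)]:
        pred = model.predict(X_part)
        scores[name] = {
            "r2": r2_score(y_part, pred),
            "mae": mean_absolute_error(y_part, pred),
            "rmse": float(np.sqrt(mean_squared_error(y_part, pred))),
        }
    coefficients = pd.Series(model.coef_, index=X.columns)
    return model, coefficients, scores


model, coefficients, scores = run_pipeline(make_placeholder_data())
print(coefficients.round(3))
print(pd.DataFrame(scores).round(3))
```

To run this on the real data, replace the `make_placeholder_data()` call with a `pd.read_csv(...)` of the downloaded file. Comparing the train and test columns of the score table is a simple first check for overfitting.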

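For the optional model-enhancement step in the roadmap, regularization strength can be tuned with cross-validation. The sketch below assumes Ridge regression, a 5-fold split, and an arbitrary grid of `alpha` values; the feature matrix is random placeholder data with invented coefficients, not the student dataset.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

SEED = 0  # arbitrary value, fixed only for reproducibility

# Placeholder numeric features and target (invented for this example).
rng = np.random.default_rng(SEED)
X = rng.normal(size=(300, 4))
true_coef = np.array([3.0, 1.5, 0.0, -2.0])
y = X @ true_coef + rng.normal(scale=0.5, size=300)

# Scaling lives inside the pipeline, so each fold fits its own scaler
# on that fold's training rows only.
pipeline = make_pipeline(StandardScaler(), Ridge())

# make_pipeline names each step after its lowercased class name ("ridge").
param_grid = {"ridge__alpha": [0.01, 0.1, 1.0, 10.0, 100.0]}

cv = KFold(n_splits=5, shuffle=True, random_state=SEED)
search = GridSearchCV(
    pipeline,
    param_grid,
    cv=cv,
    scoring="neg_root_mean_squared_error",  # scikit-learn maximizes, hence the sign
)
search.fit(X, y)

best_alpha = search.best_params_["ridge__alpha"]
best_rmse = -search.best_score_  # flip the sign back to a plain RMSE
print(f"best alpha: {best_alpha}, cross-validated RMSE: {best_rmse:.3f}")
```

The same pattern should carry over to `Lasso` or `ElasticNet` by swapping the estimator and adjusting the parameter names in the grid.
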
## 4. Usage Notes  

- **License**: Dataset is free to share and adapt; cite the Kaggle source when distributing derivative work.  
- **Educational Focus**: Use this notebook to practice feature engineering, regression modeling, and critical evaluation of synthetic results before tackling real-world datasets.
