# Project 1 - Building & Evaluating ML Algorithms

In this project, you will explore a dataset containing **music-related attributes** and build machine learning models for classification and regression tasks. This project requires you to:

1. Carry out exploratory data analysis to gain insight from the data
2. Apply data visualization techniques
3. Build transformation pipelines for data preprocessing and data cleaning
4. Select machine learning algorithms for regression and classification tasks
5. Design pipelines for hyperparameter tuning and model selection
6. Implement performance evaluation metrics and evaluate results
7. Report observations and propose business-centric solutions and mitigation strategies

You will select your **own classification and regression tasks** based on the dataset.

## Deliverables

As part of this project, you should deliver the following materials:

1. [**4-page IEEE-format paper**](https://www.ieee.org/conferences/publishing/templates.html). Write a paper of no more than 4 pages addressing the **``tasks``** posed below. When writing this report, consider a business-oriented person as your reader (e.g., your PhD advisor or your internship manager). Tell *the story* of the dataset's goal and propose solutions by addressing (at least) the **``tasks posed below``**.

2. **Python Code**. Create two separate notebooks: (1) "training.ipynb", used for training and hyperparameter tuning, and (2) "test.ipynb", used for evaluating the final trained models on the test set. The "test.ipynb" notebook should load all trained objects and simply evaluate their performance, so don't forget to **push the trained models** to your repository so that we can run it.

All of your code should run without any errors and be well-documented. 

3. **README.md file**. Edit the readme.md file in your repository to explain how to use your code. If there are user-defined parameters, your readme.md file must clearly list them and demonstrate how to set them.

This is an **individual assignment**. 

These deliverables are **due Monday, March 3 @ 11:59pm**. Late submissions will not be accepted, so please plan accordingly.

---

# About the Dataset

This dataset contains attributes of songs played on Spotify until 2022, including their duration and various musical characteristics. It provides an opportunity to analyze how different song features relate to playtime and genre classification. The dataset is available in the ```Spotify_Song_Attributes.csv``` file.

### Attribute Description


1. **trackName** - The name of the track.  
2. **artistName** - The name of the artist or band associated with the track.  
3. **msPlayed** - The duration in milliseconds that the track was played.  
4. **genre** - The genre or genres associated with the track.  
5. **danceability** - A measure of how suitable a track is for dancing.  
6. **energy** - The energy level of the track.  
7. **key** - The key of the track (e.g., C, D, E).  
8. **loudness** - The overall loudness of the track in decibels (dB).  
9. **mode** - The modality of the track (1 = major, 0 = minor).  
10. **speechiness** - The presence of spoken words in the track.  
11. **acousticness** - The acousticness of the track.  
12. **instrumentalness** - The probability of the track being instrumental.  
13. **liveness** - A measure of the presence of a live audience in the track.  
14. **valence** - The musical positiveness or happiness conveyed by the track.  
15. **tempo** - The tempo of the track in beats per minute (BPM).  
16. **type** - The type of the Spotify track.  
17. **id** - The unique identifier of the track.  
18. **uri** - The Spotify URI for the track.  
19. **track_href** - A link to the Spotify Web API endpoint for the track.  
20. **analysis_url** - A link to the audio analysis of the track.  
21. **duration_ms** - The duration of the track in milliseconds.  
22. **time_signature** - The time signature of the track.

---

# Assignment

## **Step 1: Exploratory Data Analysis (EDA)**  

### **What to Do**  
- Load the dataset and inspect its structure using `.head()`, `.info()`, `.describe()`.  
- Check for missing values and determine how to handle them (drop, impute, etc.).  
- Identify duplicates and remove them if necessary.  
- Analyze basic statistics of numerical features:  
  - Mean, median, standard deviation, min, max.  
  - Correlations between variables.  
- Check the distribution of key features using:  
  - Histograms  
  - Box plots   
  - Other plots as appropriate (e.g., density or violin plots)
- Analyze relationships between features using:  
  - Correlation heatmaps  
  - Scatter plots for key relationships  
  - If applicable, analyze categorical features (e.g., genre) using bar charts.
  - Other plots as appropriate (e.g., pair plots)
- Check for potential outliers and determine how to handle them.  
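The checks listed above can be sketched in a few lines of pandas. This is only an illustration under stated assumptions: the small synthetic `df` stands in for the real CSV (in the notebook you would load `Spotify_Song_Attributes.csv` with `pd.read_csv`), the helper name `eda_summary` is hypothetical, and the 1.5 × IQR rule is just one common way to flag outliers.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data (assumption). In the notebook:
# df = pd.read_csv("Spotify_Song_Attributes.csv")
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "danceability": rng.uniform(0, 1, 200),
    "energy": rng.uniform(0, 1, 200),
    "tempo": rng.normal(120, 25, 200),
    "genre": rng.choice(["pop", "rock", "jazz"], 200),
})
df = pd.concat([df, df.iloc[:5]], ignore_index=True)  # inject 5 duplicate rows
df.loc[10:19, "genre"] = np.nan                       # inject 10 missing values


def eda_summary(frame: pd.DataFrame) -> dict:
    """Collect missing values, duplicates, basic statistics and outlier counts."""
    numeric = frame.select_dtypes(include="number")
    q1, q3 = numeric.quantile(0.25), numeric.quantile(0.75)
    iqr = q3 - q1
    # Flag values outside 1.5 * IQR of the quartiles (one common convention).
    outlier_mask = (numeric < q1 - 1.5 * iqr) | (numeric > q3 + 1.5 * iqr)
    return {
        "missing": frame.isna().sum(),                 # per-column missing counts
        "n_duplicates": int(frame.duplicated().sum()),  # fully repeated rows
        "stats": numeric.describe(),                   # mean, std, min, max, quartiles
        "corr": numeric.corr(),                        # pairwise Pearson correlations
        "n_outliers": outlier_mask.sum(),              # per-column outlier counts
    }


summary = eda_summary(df)
print(summary["missing"])
print("duplicates:", summary["n_duplicates"])
```

The correlation matrix in `summary["corr"]` is the natural input for a heatmap, and the per-column outlier counts can guide which box plots are worth showing in the report.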

**``Task 1. Provide a summary of findings from EDA (bullet points or short analysis).``**

**``Task 2. Provide at least three visualizations showing trends or insights from the dataset.``**  

---



## **Step 2: Data Preprocessing & Cleaning Pipelines**  

### **What to Do**  
- Handle missing values (e.g., drop rows, impute with mean/median).  
- Normalize or standardize numerical features if necessary.  
- Encode categorical variables (if applicable).  
- Remove outliers if they affect model performance.  
- Split the dataset into training (80%) and testing (20%) sets.  
- Store preprocessing steps in a pipeline for reuse.  
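One possible way to bundle these steps is a scikit-learn `ColumnTransformer`, sketched below on synthetic data. The column choices, the median imputation, treating `key` as a categorical feature, and `random_state=42` are illustrative assumptions, not requirements of the assignment.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for a few dataset columns (assumption).
rng = np.random.default_rng(0)
n = 100
X = pd.DataFrame({
    "energy": rng.uniform(0, 1, n),
    "loudness": rng.normal(-8, 3, n),
    "key": rng.choice(["C", "D", "E", "G"], n),
})
X.loc[::10, "energy"] = np.nan        # simulate missing values
y = rng.normal(120, 25, n)            # e.g. tempo as a regression target

numeric_cols = ["energy", "loudness"]
categorical_cols = ["key"]

preprocessor = ColumnTransformer(
    [
        # Numeric columns: fill gaps with the median, then standardize.
        ("num", Pipeline([
            ("impute", SimpleImputer(strategy="median")),
            ("scale", StandardScaler()),
        ]), numeric_cols),
        # Categorical columns: one-hot encode, ignoring unseen categories.
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ],
    sparse_threshold=0,  # always return a dense array
)

# 80/20 split with a fixed seed so the same test set can be recreated later.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit on the training split only, then reuse the fitted steps on the test split.
X_train_t = preprocessor.fit_transform(X_train)
X_test_t = preprocessor.transform(X_test)
```

Fitting the preprocessor on the training split only keeps test-set statistics out of the imputer and scaler, which matters when the test performance is reported later.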

**``Task 3. Provide a written summary of the preprocessing steps.``**  

---


## **Step 3: Select a Classification and Regression Task**  

 
- Pick **one classification problem** (e.g., predict high/low `danceability`, predict a song’s `energy` category, etc.).  
- Pick **one regression problem** (e.g., predict a song’s `tempo` based on features, predict `loudness` based on other audio properties, etc.).  
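As an illustration of how a continuous attribute could be turned into a classification target, the sketch below builds a high/low `danceability` label with a median split. The synthetic data, the median threshold, and the column name `high_danceability` are assumptions made for the example; any defensible task definition works.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for two dataset columns (assumption).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "danceability": rng.uniform(0, 1, 100),
    "tempo": rng.normal(120, 25, 100),
})

# Classification target: 1 if danceability is above the median, else 0.
# In the notebook, compute this threshold on the training split only.
threshold = df["danceability"].median()
df["high_danceability"] = (df["danceability"] > threshold).astype(int)

y_class = df["high_danceability"]   # classification target
y_reg = df["tempo"]                 # regression target

# Check class balance before choosing evaluation metrics.
class_balance = y_class.value_counts(normalize=True)
print(class_balance)
```

If a target is derived from a feature in this way, that source feature (here `danceability`) should be dropped from the model inputs; otherwise the classifier can recover the label trivially.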

**``Task 4. Clearly state the target variable for both classification and regression, and explain why each task is interesting.``**  

---

## **Step 4: Training Machine Learning Models**  

### **Classification Task**  
- Train and compare:  
    - Logistic Regression  
    - Random Forest Classifier  

- Tune hyperparameters using `GridSearchCV` or `RandomizedSearchCV`.  

- Measure performance using:  
    - Accuracy  
    - Precision, Recall, F1-score  
    - Confusion matrix  
    - ROC curve and AUC  

- Save trained models using joblib
- Save the preprocessing pipeline (scalers, encoders, etc.)
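The classification workflow above might look roughly like the following sketch. It uses `make_classification` as a stand-in for the preprocessed song features, and the hyperparameter grids, the F1 scoring choice, and the `classifier_<name>.joblib` file names are assumptions for illustration only.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary problem standing in for the real features (assumption).
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Each candidate is a pipeline plus a small, illustrative hyperparameter grid.
candidates = {
    "logreg": (
        Pipeline([("scale", StandardScaler()),
                  ("clf", LogisticRegression(max_iter=1000))]),
        {"clf__C": [0.1, 1.0, 10.0]},
    ),
    "forest": (
        Pipeline([("clf", RandomForestClassifier(random_state=0))]),
        {"clf__n_estimators": [50, 100], "clf__max_depth": [3, None]},
    ),
}

# Hypothetical output folder; in the project, use a folder inside your repository.
model_dir = Path(tempfile.mkdtemp())

results = {}
for name, (pipe, grid) in candidates.items():
    search = GridSearchCV(pipe, grid, cv=5, scoring="f1")
    search.fit(X_train, y_train)
    pred = search.predict(X_train)
    proba = search.predict_proba(X_train)[:, 1]
    results[name] = {
        "best_params": search.best_params_,
        "cv_f1": search.best_score_,             # mean cross-validated F1
        "accuracy": accuracy_score(y_train, pred),  # training-set metrics
        "f1": f1_score(y_train, pred),
        "roc_auc": roc_auc_score(y_train, proba),
    }
    # Persist the whole fitted pipeline (preprocessing + model) in one file.
    joblib.dump(search.best_estimator_, model_dir / f"classifier_{name}.joblib")

print(results)
```

Saving the full pipeline rather than the bare estimator means the scaler travels with the model, so the test notebook cannot accidentally apply different preprocessing. Note that the test split is created here but deliberately left untouched.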


### **Regression Task**  
- Train and compare:  
    - Linear Regression (with and without regularization, e.g., Ridge/Lasso)  
    - Decision Tree Regressor  

- Tune hyperparameters to optimize model performance using `GridSearchCV` or `RandomizedSearchCV`.    

- Compare models using:  
    - R² (Coefficient of Determination)  
    - Mean Absolute Error (MAE)  
    - Mean Squared Error (MSE)  
    - Root Mean Squared Error (RMSE)

- Save trained models using joblib
- Save the preprocessing pipeline (scalers, encoders, etc.)
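A comparable sketch for the regression task is shown below. The data comes from `make_regression` rather than the real dataset, and the `alpha` and tree grids, the scoring choice, and the helper name `regression_metrics` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem standing in for the real features (assumption).
X, y = make_regression(n_samples=300, n_features=8, n_informative=4,
                       noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)


def regression_metrics(y_true, y_pred) -> dict:
    """Return R², MAE, MSE and RMSE for one set of predictions."""
    mse = mean_squared_error(y_true, y_pred)
    return {
        "r2": r2_score(y_true, y_pred),
        "mae": mean_absolute_error(y_true, y_pred),
        "mse": mse,
        "rmse": float(np.sqrt(mse)),
    }


# Estimators with small, illustrative grids (an empty grid means no tuning).
candidates = {
    "linear": (LinearRegression(), {}),
    "ridge": (Ridge(), {"model__alpha": [0.1, 1.0, 10.0]}),
    "lasso": (Lasso(max_iter=10000), {"model__alpha": [0.01, 0.1, 1.0]}),
    "tree": (DecisionTreeRegressor(random_state=0),
             {"model__max_depth": [3, 5, None],
              "model__min_samples_leaf": [1, 5]}),
}

fitted, train_metrics = {}, {}
for name, (estimator, grid) in candidates.items():
    # Scaling is not needed for the tree but is harmless and keeps the loop uniform.
    pipe = Pipeline([("scale", StandardScaler()), ("model", estimator)])
    search = GridSearchCV(pipe, grid, cv=5,
                          scoring="neg_root_mean_squared_error")
    search.fit(X_train, y_train)
    fitted[name] = search.best_estimator_
    train_metrics[name] = regression_metrics(y_train, search.predict(X_train))

# Compare coefficients to see the effect of regularization.
coefs = {name: fitted[name].named_steps["model"].coef_
         for name in ("linear", "ridge", "lasso")}
print(train_metrics)
```

Comparing the entries of `coefs` side by side is one way to discuss regularization: Ridge tends to shrink coefficients, while Lasso can set some of them exactly to zero, which hints at which features matter most. The fitted pipelines in `fitted` can be saved with `joblib.dump` in the same way as the classifiers.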


**``Task 5. After completing all steps above, provide the following:``** 
- Training performance metrics for each model.  
- A short explanation of which model performed better and why.
- Are there any differences when adding regularization into regression? Which features are more important? 

---

## **Step 5: Performance Evaluation**  

- Load Test Data
    - Load the original dataset
    - Apply the same preprocessing pipeline used in training.ipynb
    - Extract the 20% test set (the same as used during training)

- Load Trained Models & Pipeline
    - Load the saved classification model
    - Load the saved regression model
    - Load the preprocessing pipeline

**Evaluate Classification Model**

- Generate predictions on the test set. Compute:
    - Accuracy
    - Precision, Recall, F1-score
    - Confusion Matrix
    - ROC curve and AUC

**Evaluate Regression Model**

- Generate predictions on the test set. Compute:
    - R² (Coefficient of Determination)
    - MAE (Mean Absolute Error)
    - MSE (Mean Squared Error)
    - RMSE (Root Mean Squared Error)
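The evaluation flow can be sketched as follows for the classification model; the regression model follows the same pattern with the regression metrics. To stay self-contained, the example trains and saves a small model first (the part that would live in training.ipynb). The synthetic data, the `SEED` constant, the helper `split`, and the file name `classifier.joblib` are assumptions for illustration.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

SEED = 42  # must be identical in both notebooks to recreate the same split

# Synthetic stand-in for the real dataset (assumption).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)


def split(X, y):
    """Deterministic 80/20 split shared by training and evaluation."""
    return train_test_split(X, y, test_size=0.2, stratify=y, random_state=SEED)


# --- training.ipynb side: fit and save the pipeline ---
X_train, _, y_train, _ = split(X, y)
model_path = Path(tempfile.mkdtemp()) / "classifier.joblib"  # hypothetical path
pipeline = Pipeline([("scale", StandardScaler()),
                     ("clf", LogisticRegression(max_iter=1000))])
joblib.dump(pipeline.fit(X_train, y_train), model_path)

# --- test.ipynb side: recreate the test split, load, evaluate ---
_, X_test, _, y_test = split(X, y)
model = joblib.load(model_path)

y_pred = model.predict(X_test)
y_score = model.predict_proba(X_test)[:, 1]  # probability of the positive class

cm = confusion_matrix(y_test, y_pred)
auc = roc_auc_score(y_test, y_score)
report = classification_report(y_test, y_pred, output_dict=True)

print(cm)
print(f"ROC AUC: {auc:.3f}")
print(f"Accuracy: {report['accuracy']:.3f}")
```

Reusing the same seed and split function in both notebooks is what guarantees that the evaluated rows were never seen during training. An alternative is to save the test split to disk in training.ipynb and load it in test.ipynb.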

**``Task 6. After completing all steps above, provide the following:``**

- Compare models and justify which one is better for each task.
- At least one visualization for the classification task (e.g., confusion matrix, ROC curve, precision-recall curve).  

---

## **Step 6: Report Findings & Business Insights**  

**``Task 7. Interpret the results.``**
- What trends did you observe?  
- How well do these models generalize?  
- How can this analysis be useful to music streaming platforms?  


---

# Submit Your Solution

Confirm that you've successfully completed the assignment.

Along with the Notebook, include a PDF of the notebook with your solutions.

```add``` and ```commit``` the final version of your work, and ```push``` your code to your GitHub repository.

Submit the URL of your GitHub Repository as your assignment submission on Canvas.

---