In [1]:
!pip install pandas
!pip install seaborn





\<indicate target task (i.e. classification or regression) here\>

## The Dataset

-- [Return to Table of Contents](#music-dataset---stintsy-s14-project-(flexbomb)) --

**`Music`** is a universal language, transcending cultures and time. It is a powerful art form that can evoke a wide range of emotions, from joy and excitement to sadness and reflection. Music can be used to express oneself, to connect with others, and to celebrate life. We are studying this dataset because it could be essential for studying music and developing new music technologies. In this notebook in particular, it will be used to train machine learning models to perform a variety of tasks. These models can then be used to create new products and services, such as personalized music streaming services and intelligent music assistants.

The dataset is provided as a `.csv` file where it can be viewed in Excel and Notepad. 

This dataset contains 17,996 **rows** across 17 **columns**. Each row represents **1 song**, while columns represent **audio features**. The following are the columns in the dataset and their descriptions:

| Column Name | Description |
| --- | --- |
| **`Artist Name`** | Name of artist |
| **`Track Name`** | Name of song |
| **`Popularity`** | A value between 0 and 100, calculated by an algorithm and is based, in the most part, on the total number of plays the track has had and how recent those plays are |
| **`danceability`** | Describes how suitable a track is for dancing; 0.0 is least danceable and 1.0 is most danceable |
| **`energy`** | A measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity |
| **`key`** | The key the track is in, integers map to pitches using standard [Pitch Class notation](https://en.wikipedia.org/wiki/Pitch_class); -1 if no key was detected |
| **`loudness`** | The quality of a sound that is the primary psychological correlate of physical strength (amplitude), values are averaged across the entire track; in decibels (dB) |
| **`mode`** | Indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived; 1 is Major and 0 is Minor |
| **`speechiness`** | The presence of spoken words in a track; >0.66 is probably made entirely of spoken words, 0.33-0.66 may contain both music and speech, <0.33 most likely represents music |
| **`acousticness`** | A confidence measure from 0.0 to 1.0 of whether the track is acoustic |
| **`instrumentalness`** | Predicts whether a track contains no vocals; >0.5 is intended to represent instrumental tracks |
| **`liveness`** | Detects the presence of an audience in the recording; >0.8 provides strong likelihood that the track is live |
| **`valence`** | A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track; tracks with high valence sound more positive, and vice versa|
| **`tempo`** | The overall estimated tempo of a track in beats per minute (BPM) |
| **`duration_in min/ms`** | Duration in millisecond (ms) |
| **`time_signature`** | A notational convention to specify how many beats are in each bar |
| **`Class`** | corresponds to the genre of the track |

State a brief description of the dataset.

• Provide a description of the collection process executed to build the dataset. Discuss the
implications of the data collection method on the generated conclusions and insights.
Note that you may need to look at relevant sources related to the dataset to acquire
necessary information for this part of the project.

• Describe the structure of the dataset file. <br>
    o What does each row and column represent? <br>
    o How many instances are there in the dataset? <br>
    o How many features are there in the dataset? <br>
    o If the dataset is composed of different files that you will combine in the succeeding
steps, describe the structure and the contents of each file.

• Discuss the features in each dataset file. What does each feature represent? All features,
even those which are not used for the study, should be described to the reader. The
purpose of each feature in the dataset should be clear to the reader of the notebook
without having to go through an external link.

## List of Requirements
-- [Return to Table of Contents](#music-dataset---stintsy-s14-project-(flexbomb)) --

1. [Numpy](https://numpy.org/)
2. [Matplotlib](https://matplotlib.org/)
3. [CSV](https://docs.python.org/3/library/csv.html)

For this notebook, **numpy**, **matplotlib**, and **csv** must be imported.

In [2]:
%pip install numpy
%pip install matplotlib
%pip install pandas
%pip install seaborn

Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.


In [3]:
import numpy as np
import matplotlib.pyplot as plt
import csv
import pandas as pd
import seaborn as sns
import math 

plt.style.use('ggplot')

%matplotlib inline
%load_ext autoreload
%autoreload 2

## Reading the Dataset
-- [Return to Table of Contents](#music-dataset---stintsy-s14-project-(flexbomb)) --

Here we will load the dataset using `csv`. We use the [`reader`](https://docs.python.org/3/library/csv.html) function to load the dataset. The path will have to be changed depending on the location of the file in your machine.


In [4]:
music_df = pd.read_csv('music.csv')
music_df.head()

Unnamed: 0,Artist Name,Track Name,Popularity,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_in min/ms,time_signature,Class
0,Bruno Mars,That's What I Like (feat. Gucci Mane),60.0,0.854,0.564,1.0,-4.964,1,0.0485,0.0171,,0.0849,0.899,134.071,234596.0,4,5
1,Boston,Hitch a Ride,54.0,0.382,0.814,3.0,-7.23,1,0.0406,0.0011,0.00401,0.101,0.569,116.454,251733.0,4,10
2,The Raincoats,No Side to Fall In,35.0,0.434,0.614,6.0,-8.334,1,0.0525,0.486,0.000196,0.394,0.787,147.681,109667.0,4,6
3,Deno,Lingo (feat. J.I & Chunkz),66.0,0.853,0.597,10.0,-6.528,0,0.0555,0.0212,,0.122,0.569,107.033,173968.0,4,5
4,Red Hot Chili Peppers,Nobody Weird Like Me - Remastered,53.0,0.167,0.975,2.0,-4.279,1,0.216,0.000169,0.0161,0.172,0.0918,199.06,229960.0,4,10


The dataset is now loaded in the ???.

Show the contents of the...

## Exploratory Data Analysis Questions
-- [Return to Table of Contents](#music-dataset---stintsy-s14-project-(flexbomb)) --

1. [**`Question 4`**](#question-4): Distribution of Features: Plot histograms or kernel density plots for Valence, Tempo, Liveness, Loudness, Acousticness, and Energy to understand their distributions.
2. [**`Question 5`**](#question-5): Class Distribution: Plot the distribution of the Class variable (target variable). Understand the balance between different classes.
3. [**`Question 6`**](#question-6): Relationship Between Features and Class: Use scatter plots or box plots to visualize the relationship between each feature and the target Class variable Identify any patterns or clusters that may exist.

## Data Preprocessing and Cleaning

-- [Return to Table of Contents](#music-dataset---stintsy-s14-project-(flexbomb)) --

In [5]:
!pip install pandas
!pip install seaborn





\<indicate target task (i.e. classification or regression) here\>

## The Dataset

-- [Return to Table of Contents](#music-dataset---stintsy-s14-project-(flexbomb)) --

**`Music`** is a universal language, transcending cultures and time. It is a powerful art form that can evoke a wide range of emotions, from joy and excitement to sadness and reflection. Music can be used to express oneself, to connect with others, and to celebrate life. We are studying this dataset because it could be essential for studying music and developing new music technologies. In this notebook in particular, it will be used to train machine learning models to perform a variety of tasks. These models can then be used to create new products and services, such as personalized music streaming services and intelligent music assistants.

The dataset is provided as a `.csv` file where it can be viewed in Excel and Notepad. 

This dataset contains 17,996 **rows** across 17 **columns**. Each row represents **1 song**, while columns represent **audio features**. The following are the columns in the dataset and their descriptions:

| Column Name | Description |
| --- | --- |
| **`Artist Name`** | Name of artist |
| **`Track Name`** | Name of song |
| **`Popularity`** | A value between 0 and 100, calculated by an algorithm and is based, in the most part, on the total number of plays the track has had and how recent those plays are |
| **`danceability`** | Describes how suitable a track is for dancing; 0.0 is least danceable and 1.0 is most danceable |
| **`energy`** | A measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity |
| **`key`** | The key the track is in, integers map to pitches using standard [Pitch Class notation](https://en.wikipedia.org/wiki/Pitch_class); -1 if no key was detected |
| **`loudness`** | The quality of a sound that is the primary psychological correlate of physical strength (amplitude), values are averaged across the entire track; in decibels (dB) |
| **`mode`** | Indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived; 1 is Major and 0 is Minor |
| **`speechiness`** | The presence of spoken words in a track; >0.66 is probably made entirely of spoken words, 0.33-0.66 may contain both music and speech, <0.33 most likely represents music |
| **`acousticness`** | A confidence measure from 0.0 to 1.0 of whether the track is acoustic |
| **`instrumentalness`** | Predicts whether a track contains no vocals; >0.5 is intended to represent instrumental tracks |
| **`liveness`** | Detects the presence of an audience in the recording; >0.8 provides strong likelihood that the track is live |
| **`valence`** | A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track; tracks with high valence sound more positive, and vice versa|
| **`tempo`** | The overall estimated tempo of a track in beats per minute (BPM) |
| **`duration_in min/ms`** | Duration in millisecond (ms) |
| **`time_signature`** | A notational convention to specify how many beats are in each bar |
| **`Class`** | corresponds to the genre of the track |

State a brief description of the dataset.

• Provide a description of the collection process executed to build the dataset. Discuss the
implications of the data collection method on the generated conclusions and insights.
Note that you may need to look at relevant sources related to the dataset to acquire
necessary information for this part of the project.

• Describe the structure of the dataset file. <br>
    o What does each row and column represent? <br>
    o How many instances are there in the dataset? <br>
    o How many features are there in the dataset? <br>
    o If the dataset is composed of different files that you will combine in the succeeding
steps, describe the structure and the contents of each file.

• Discuss the features in each dataset file. What does each feature represent? All features,
even those which are not used for the study, should be described to the reader. The
purpose of each feature in the dataset should be clear to the reader of the notebook
without having to go through an external link.

## List of Requirements
-- [Return to Table of Contents](#music-dataset---stintsy-s14-project-(flexbomb)) --

1. [Numpy](https://numpy.org/)
2. [Matplotlib](https://matplotlib.org/)
3. [CSV](https://docs.python.org/3/library/csv.html)

For this notebook, **numpy**, **matplotlib**, and **csv** must be imported.

In [None]:
%pip install numpy
%pip install matplotlib
%pip install pandas
%pip install seaborn

Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.
^C
Note: you may need to restart the kernel to use updated packages.


In [None]:
import numpy as np
import matplotlib.pyplot as plt
import csv
import pandas as pd
import seaborn as sns
import math 

plt.style.use('ggplot')

%matplotlib inline
%load_ext autoreload
%autoreload 2

## Reading the Dataset
-- [Return to Table of Contents](#music-dataset---stintsy-s14-project-(flexbomb)) --

Here we will load the dataset using `csv`. We use the [`reader`](https://docs.python.org/3/library/csv.html) function to load the dataset. The path will have to be changed depending on the location of the file in your machine.


In [None]:
music_df = pd.read_csv('music.csv')
music_df.head()

The dataset is now loaded in the ???.

Show the contents of the...

## Exploratory Data Analysis Questions
-- [Return to Table of Contents](#music-dataset---stintsy-s14-project-(flexbomb)) --

1. [**`Question 4`**](#question-4): Distribution of Features: Plot histograms or kernel density plots for Valence, Tempo, Liveness, Loudness, Acousticness, and Energy to understand their distributions.
2. [**`Question 5`**](#question-5): Class Distribution: Plot the distribution of the Class variable (target variable). Understand the balance between different classes.
3. [**`Question 6`**](#question-6): Relationship Between Features and Class: Use scatter plots or box plots to visualize the relationship between each feature and the target Class variable Identify any patterns or clusters that may exist.

## Data Preprocessing and Cleaning

-- [Return to Table of Contents](#music-dataset---stintsy-s14-project-(flexbomb)) --

Before we can begin exploring the data, we must first clean the dataset. This is to prevent inconsistencies that may cause problems or errors during analysis.

First, we will organize the dataset columns to make it easier to understand. Also, some columns are renamed for shorter accessibility.

Before we begin exploring our data, it is important to understand what are the characteristics of our features to help us make smart decisions to our pre-processing steps. It identifies which pre-processing techniques can be done so that exploratory data analysis and data modeeling is most accurate. Most importantly, this is done to prevent inconsistencies that may cause problems or errors during analysis.

### Column Renaming 

First, we will organize the dataset columns to make it easier to understand. Also, some columns are renamed for shorter accessibility.

In [None]:
column_mapping = {
    'Artist Name': 'artist',
    'Track Name': 'track',
    'Popularity': 'popularity',
    'danceability': 'dance',
    'energy': 'energy',
    'key': 'key',
    'loudness': 'loudness',
    'mode': 'mode',
    'speechiness': 'speechiness',
    'acousticness': 'acousticness',
    'instrumentalness': 'instrumentalness',
    'liveness': 'liveness',
    'valence': 'valence',
    'tempo': 'tempo',
    'duration_in min/ms': 'duration',
    'time_signature': 'time_signature',
    'Class': 'class'
}

# Rename columns in the DataFrame
music_df.rename(columns=column_mapping, inplace=True)
print(music_df.columns)

### Duplicates

We then check if there are any duplicated data in the dataset. We do this by calling the ``pandas.DataFrame.duplicated`` function. The function checks and returns the duplicated values.

In [None]:
numDuplicates = music_df.duplicated().sum()

print(f"Number of duplicates in the dataset: {numDuplicates}")

# Display duplicated rows
duplicated_rows_data = music_df[music_df.duplicated()]
print("Duplicated rows:")
print(duplicated_rows_data)

As displayed above there are **``0 duplicates``** in the dataset. 

### Null Values

Then, check which columns have **NaN or Null** values and **count** how many null values each column has.

In [None]:
# show nan_count per variable
nan_counts = music_df.isna().sum()
for feature, nan_count in nan_counts.items():
    print(f"{feature}: {nan_count}")

plt.figure(figsize=(12, 6))
sns.barplot(x=nan_counts.index, y=nan_counts.values, palette='viridis')
plt.title('NaN Counts per Feature')
plt.xlabel('Features')
plt.ylabel('Number of NaN Values')
plt.xticks(rotation=45, ha='right')  # Adjust rotation for better readability
plt.show()

We can observe here that `popularity`, `key`, and `instrumentalness` has a high count of NaN values. We can use imputation to replace the NaN values with a measure the mean of the column for the non-categorical features. As for categorical features, we will drop the rows as shown below. We will use `pandas.DataFrame.fillna` and `pandas.DataFrame.mean` for our regression values while `pandas.DataFrame.dropna` to drop rows with null categorical values.

In [None]:
# Fill NaN for numerical features with mean
numerical_features = ['popularity', 'dance', 'energy', 'loudness', 'speechiness', 'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo']
music_df[numerical_features] = music_df[numerical_features].fillna(music_df[numerical_features].mean())

# Fill NaN for categorical features with a specific value (e.g., 'Unknown')
categorical_features = ['artist', 'track', 'key', 'mode', 'time_signature', 'class']
music_df_cleaned = music_df.dropna(subset=categorical_features)

# Display NaN counts per feature
nan_counts = music_df.isna().sum()
for feature, nan_count in nan_counts.items():
    print(f"{feature}: {nan_count}")


Now, let's create a graph of the amount of null data in each column for better visualization. As shown below, we can see that all of the columns now have 0 NaN values after performing imputation and dropping of rows. 

In [None]:
# Check for null data in each column
null_data = music_df.isnull()

# Create a bar plot using Seaborn
plt.figure(figsize=(12, 6))
sns.barplot(x=null_data.columns, y=null_data.sum(), palette='pastel')
plt.title('Null Data Counts per Column')
plt.xlabel('Columns')
plt.ylabel('Number of Null Values')
plt.xticks(rotation=45, ha='right')  

In [None]:
music_df = music_df.dropna(subset=['key'])

In [None]:
# Check for null data in each column
null_data = music_df.isnull()

# Create a bar plot using Seaborn
plt.figure(figsize=(12, 6))
sns.barplot(x=null_data.columns, y=null_data.sum(), palette='pastel')
plt.title('Null Data Counts per Column')
plt.xlabel('Columns')
plt.ylabel('Number of Null Values')
plt.xticks(rotation=45, ha='right')  

### Outliers

Descriptive statistics provide a comprehensive overview of the numerical characteristics of each feature, offering insights into their central tendency (mean, median), dispersion (standard deviation), and range (min, max). Additionally, the identification of outliers is crucial to understanding potential anomalies that may influence the distribution and subsequently impact the modeling process.

First, let's use `pandas.DataFrame.describe` to describe our data.

In [None]:
music_df.describe()

We can observe here that some of the features namely, `energy`, `acousticness`, and `instrumentalness` have a min value of `0.000020`,`0.000000`, and `0.000001` respectively. We will eliminate instances with such small values in our dataset in an effort to reduce skewness that is heavily close to 0. However, this may significantly affect the size of our because these cover around 25% of our current size. Still, we opt to provide a higher accuracy even on a smaller dataset. 

In [None]:
music_df_filtered = music_df[(music_df['energy'] >= 0.01) & (music_df['acousticness'] >= 0.01) & (music_df['instrumentalness'] >= 0.01)]
music_df_filtered.describe()

With this, we can use Boxplots from `seaborn` to identify possible outliers for the **numerical/continuous features**.  

In [None]:
def display_boxplot(data, numerical_features):
    numeric_features = data[numerical_features].columns

    num_features = len(numeric_features)
    num_cols = 2  # You can adjust the number of columns as needed
    num_rows = math.ceil(num_features / num_cols)

    fig, axes = plt.subplots(nrows=num_rows, ncols=num_cols, figsize=(10, 2 * num_rows))
    fig.suptitle('Boxplots of Numeric Features', y=1.02)

    for idx, feature in enumerate(numeric_features):
        ax = axes[idx // 2, idx % 2]
        sns.boxplot(x=data[feature], ax=ax)
        ax.set_title(f'Boxplot of {feature}')

    # Adjust layout
    plt.tight_layout()
    plt.show()

In [None]:
display_boxplot(music_df_filtered, numerical_features)

We can observe here that some of the features namely, `popularity`, `dance`, `loudness`, `speechiness`, `instrumentalness`, `liveness`, and `tempo` have an outstanding number of outliers way past the whiskers of the boxplots.

We can use `Interquartile Range` to remove outliers. This technique sets up a boundary outside Quartile 1 (Q1) and Quartile 3 (Q3). To do this, a multiplier of `1.5` is used to the data of the IQR, and the result is subtracted to Q1, while is added to Q3. Any instances in the data that are more than these boundaries are considered as outliers. (3.2 - Identifying Outliers: IQR Method | STAT 200, n.d.).

Using this will remove all the instances that have values below 25% and above 75% of the dataset based on two columns `loudness` and `tempo`.  


We first get the quantile values based on these features.

In [None]:
outlier_features = ['popularity', 'dance', 'loudness', 'speechiness', 'instrumentalness', 'liveness', 'tempo']

# Calculate the IQR for each feature
Q1 = music_df_filtered[outlier_features].quantile(0.25)
Q3 = music_df_filtered[outlier_features].quantile(0.75)

print("Quantile 1: ", Q1)
print("Quantile 1: ", Q3)

Then we subtract Q3 from Q1.

In [None]:
IQR = Q3 - Q1
print(IQR)

As stated, we will use the multiplier `1.5`, store this on variable `multiplier`.

In [None]:
# Define a multiplier for IQR (e.g., 1.5)
multiplier = 1.5

Now, we can check for outliers using the logic of Interquartile Range.

In [None]:
# Identify outliers based on the IQR
outliers = (music_df_filtered[outlier_features] < (Q1 - multiplier * IQR)) | (music_df_filtered[outlier_features] > (Q3 + multiplier * IQR))

With the outliers determined, we can now filter our dataset accordingly. Store the filtered dataset in `music_no_outliers_df`

In [None]:
# Create a DataFrame without outliers
music_no_outliers_df = music_df_filtered[~outliers.any(axis=1)]

# Display information about removed outliers
print("Number of rows before removing outliers:", music_df.shape[0])
print("Number of rows after removing outliers:", music_no_outliers_df.shape[0])

In [None]:
display_boxplot(music_no_outliers_df, numerical_features)

### Data Normalization



In [None]:
def display_histogram(data, columns):
    # Set up the figure and axes
    fig, axes = plt.subplots(nrows=3, ncols=4, figsize=(15, 8))

    # Flatten the axes for easier iteration
    axes = axes.flatten()

    # Create histograms for each column using a for loop
    for i, column in enumerate(columns):
        ax = axes[i]
        ax.hist(data[column], bins=20, color='skyblue', edgecolor='black')
        ax.set_title(f'{column} distribution')
        ax.set_xlabel('Value')
        ax.set_ylabel('Frequency')

    # Adjust layout
    plt.tight_layout()

    # Show the plot
    plt.show()

In [None]:
display_histogram(music_no_outliers_df, ['popularity', 'dance', 'energy', 'loudness', 'speechiness', 'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo'])

The distributions above tells that each of the features are either positively or negatively skewed. 

In [None]:
music_no_outliers_normalized = music_no_outliers_df.copy()
music_no_outliers_normalized

In [None]:
for column in ['energy', 'speechiness', 'acousticness', 'instrumentalness', 'liveness']:
    # Determine the range
    min_val = np.min(music_no_outliers_df[column])
    max_val = np.max(music_no_outliers_df[column])

    # Apply Min-Max Normalization
    normalized_data = (music_no_outliers_df[column] - min_val) / (max_val - min_val)
    music_no_outliers_normalized[column] = normalized_data

In [None]:
display_histogram(music_no_outliers_normalized, numerical_features)

In [None]:
music_no_outliers_normalized.describe()

Now that we have cleaned all columns that will be used for this notebook. We can now begin the [Exploratory Data Analysis](#exploratory-data-analysis).

## Exploratory Data Analysis

-- [Return to Table of Contents](#music-dataset---stintsy-s14-project-(flexbomb)) --

### Question 1: Question

### EDA Question 1 Results

# MUSIC DATASET - STINTSY S14 PROJECT (FLEXBOMB)
<a id='MUSIC_DATASET'></a>
This notebook is an exploratory data analysis on the Music Dataset. The dataset will be explained, cleaned, and explored by the end of this notebook.

| **`Table of Contents`** |
| --- |
| [The Dataset](#the-dataset) |
| [List of Requirements](#List-of-Requirements) |
| [Reading the Dataset](#reading-the-dataset) |
| [Data Preprocessing and Cleaning](#Data-Preprocessing-and-Cleaning) |
| [Exploratory Data Analysis](#exploratory-data-analysis) |
| - [Question 1](#Question-1:-question) |

<br>

**`Authors`**: 
- Fausto, Lorane Bernadeth M. <br>
- Nadela, Cymon <br>
- Oliva, Irah <br>

## The Dataset

-- [Return to Table of Contents](#music-dataset---stintsy-s14-project-(flexbomb)) --

**`Music`** is a universal language, transcending cultures and time. It is a powerful art form that can evoke a wide range of emotions, from joy and excitement to sadness and reflection. Music can be used to express oneself, to connect with others, and to celebrate life. We are studying this dataset because it could be essential for studying music and developing new music technologies. In this notebook in particular, it will be used to train machine learning models to perform a variety of tasks. These models can then be used to create new products and services, such as personalized music streaming services and intelligent music assistants.

The dataset is provided as a `.csv` file where it can be viewed in Excel and Notepad. 

This dataset contains 17,996 **rows** across 17 **columns**. Each row represents **1 song**, while columns represent **audio features**. The following are the columns in the dataset and their descriptions:

| Column Name | Description |
| --- | --- |
| **`Artist Name`** | Name of artist |
| **`Track Name`** | Name of song |
| **`Popularity`** | A value between 0 and 100, calculated by an algorithm and is based, in the most part, on the total number of plays the track has had and how recent those plays are |
| **`danceability`** | Describes how suitable a track is for dancing; 0.0 is least danceable and 1.0 is most danceable |
| **`energy`** | A measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity |
| **`key`** | The key the track is in, integers map to pitches using standard [Pitch Class notation](https://en.wikipedia.org/wiki/Pitch_class); -1 if no key was detected |
| **`loudness`** | The quality of a sound that is the primary psychological correlate of physical strength (amplitude), values are averaged across the entire track; in decibels (dB) |
| **`mode`** | Indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived; 1 is Major and 0 is Minor |
| **`speechiness`** | The presence of spoken words in a track; >0.66 is probably made entirely of spoken words, 0.33-0.66 may contain both music and speech, <0.33 most likely represents music |
| **`acousticness`** | A confidence measure from 0.0 to 1.0 of whether the track is acoustic |
| **`instrumentalness`** | Predicts whether a track contains no vocals; >0.5 is intended to represent instrumental tracks |
| **`liveness`** | Detects the presence of an audience in the recording; >0.8 provides strong likelihood that the track is live |
| **`valence`** | A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track; tracks with high valence sound more positive, and vice versa|
| **`tempo`** | The overall estimated tempo of a track in beats per minute (BPM) |
| **`duration_in min/ms`** | Duration in millisecond (ms) |
| **`time_signature`** | A notational convention to specify how many beats are in each bar |
| **`Class`** | corresponds to the genre of the track |

## List of Requirements
-- [Return to Table of Contents](#music-dataset---stintsy-s14-project-(flexbomb)) --

1. [Numpy](https://numpy.org/)
2. [Matplotlib](https://matplotlib.org/)
3. [CSV](https://docs.python.org/3/library/csv.html)

For this notebook, **numpy**, **matplotlib**, and **csv** must be imported.

In [None]:
%pip install numpy
%pip install matplotlib
%pip install pandas
%pip install seaborn

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.neural_network import MLPClassifier
from scipy.stats import randint

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import csv
import pandas as pd
import seaborn as sns
import math 

plt.style.use('ggplot')

%matplotlib inline
%load_ext autoreload
%autoreload 2

## Reading the Dataset
-- [Return to Table of Contents](#music-dataset---stintsy-s14-project-(flexbomb)) --

Here we will load the dataset using `csv`. We use the [`reader`](https://docs.python.org/3/library/csv.html) function to load the dataset. The path will have to be changed depending on the location of the file in your machine.


In [None]:
music_df = pd.read_csv('music.csv')
music_df.head()

## Data Preprocessing and Cleaning

-- [Return to Table of Contents](#music-dataset---stintsy-s14-project-(flexbomb)) --

Before we can begin exploring the data, we must first clean the dataset. This is to prevent inconsistencies that may cause problems or errors during analysis.

First, we will organize the dataset columns to make it easier to understand. Also, some columns are renamed for shorter accessibility.

Before we begin exploring our data, it is important to understand what are the characteristics of our features to help us make smart decisions to our pre-processing steps. It identifies which pre-processing techniques can be done so that exploratory data analysis and data modeeling is most accurate. Most importantly, this is done to prevent inconsistencies that may cause problems or errors during analysis.

### Column Renaming 

First, we will organize the dataset columns to make it easier to understand. Also, some columns are renamed for shorter accessibility.

In [None]:
column_mapping = {
    'Artist Name': 'artist',
    'Track Name': 'track',
    'Popularity': 'popularity',
    'danceability': 'dance',
    'energy': 'energy',
    'key': 'key',
    'loudness': 'loudness',
    'mode': 'mode',
    'speechiness': 'speechiness',
    'acousticness': 'acousticness',
    'instrumentalness': 'instrumentalness',
    'liveness': 'liveness',
    'valence': 'valence',
    'tempo': 'tempo',
    'duration_in min/ms': 'duration',
    'time_signature': 'time_signature',
    'Class': 'class'
}

# Rename columns in the DataFrame
music_df.rename(columns=column_mapping, inplace=True)
print(music_df.columns)

### Duplicates

We then check if there are any duplicated data in the dataset. We do this by calling the ``pandas.DataFrame.duplicated`` function. The function checks and returns the duplicated values.

In [None]:
numDuplicates = music_df.duplicated().sum()

print(f"Number of duplicates in the dataset: {numDuplicates}")

# Display duplicated rows
duplicated_rows_data = music_df[music_df.duplicated()]
print("Duplicated rows:")
print(duplicated_rows_data)

As displayed above there are **``0 duplicates``** in the dataset. 

### Null Values

Then, check which columns have **NaN or Null** values and **count** how many null values each column has.

In [None]:
# show nan_count per variable
nan_counts = music_df.isna().sum()
for feature, nan_count in nan_counts.items():
    print(f"{feature}: {nan_count}")

plt.figure(figsize=(12, 6))
sns.barplot(x=nan_counts.index, y=nan_counts.values, palette='viridis')
plt.title('NaN Counts per Feature')
plt.xlabel('Features')
plt.ylabel('Number of NaN Values')
plt.xticks(rotation=45, ha='right')  # Adjust rotation for better readability
plt.show()

We can observe here that `popularity`, `key`, and `instrumentalness` has a high count of NaN values. We can use imputation to replace the NaN values with a measure the mean of the column for the non-categorical features. As for categorical features, we will drop the rows as shown below. We will use `pandas.DataFrame.fillna` and `pandas.DataFrame.mean` for our regression values while `pandas.DataFrame.dropna` to drop rows with null categorical values.

In [None]:
# Fill NaN for numerical features with mean
numerical_features = ['popularity', 'dance', 'energy', 'loudness', 'speechiness', 'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo']
music_df[numerical_features] = music_df[numerical_features].fillna(music_df[numerical_features].mean())

# Fill NaN for categorical features with a specific value (e.g., 'Unknown')
categorical_features = ['artist', 'track', 'key', 'mode', 'time_signature', 'class']
music_df_cleaned = music_df.dropna(subset=categorical_features)

# Display NaN counts per feature
nan_counts = music_df.isna().sum()
for feature, nan_count in nan_counts.items():
    print(f"{feature}: {nan_count}")


Now, let's create a graph of the amount of null data in each column for better visualization. As shown below, we can see that all of the columns now have 0 NaN values after performing imputation and dropping of rows. 

In [None]:
# Check for null data in each column
null_data = music_df.isnull()

# Create a bar plot using Seaborn
plt.figure(figsize=(12, 6))
sns.barplot(x=null_data.columns, y=null_data.sum(), palette='pastel')
plt.title('Null Data Counts per Column')
plt.xlabel('Columns')
plt.ylabel('Number of Null Values')
plt.xticks(rotation=45, ha='right')  

In [None]:
music_df = music_df.dropna(subset=['key'])

In [None]:
# Check for null data in each column
null_data = music_df.isnull()

# Create a bar plot using Seaborn
plt.figure(figsize=(12, 6))
sns.barplot(x=null_data.columns, y=null_data.sum(), palette='pastel')
plt.title('Null Data Counts per Column')
plt.xlabel('Columns')
plt.ylabel('Number of Null Values')
plt.xticks(rotation=45, ha='right')  

### Outliers

Descriptive statistics provide a comprehensive overview of the numerical characteristics of each feature, offering insights into their central tendency (mean, median), dispersion (standard deviation), and range (min, max). Additionally, the identification of outliers is crucial to understanding potential anomalies that may influence the distribution and subsequently impact the modeling process.

First, let's use `pandas.DataFrame.describe` to describe our data.

In [None]:
music_df.describe()

We can observe here that some of the features namely, `energy`, `acousticness`, and `instrumentalness` have a min value of `0.000020`,`0.000000`, and `0.000001` respectively. We will eliminate instances with such small values in our dataset in an effort to reduce skewness that is heavily close to 0. However, this may significantly affect the size of our because these cover around 25% of our current size. Still, we opt to provide a higher accuracy even on a smaller dataset. 

In [None]:
music_df_filtered = music_df[(music_df['energy'] >= 0.01) & (music_df['acousticness'] >= 0.01) & (music_df['instrumentalness'] >= 0.01)]
music_df_filtered.describe()

With this, we can use Boxplots from `seaborn` to identify possible outliers for the **numerical/continuous features**.  

In [None]:
def display_boxplot(data, numerical_features):
    numeric_features = data[numerical_features].columns

    num_features = len(numeric_features)
    num_cols = 2  # You can adjust the number of columns as needed
    num_rows = math.ceil(num_features / num_cols)

    fig, axes = plt.subplots(nrows=num_rows, ncols=num_cols, figsize=(10, 2 * num_rows))
    fig.suptitle('Boxplots of Numeric Features', y=1.02)

    for idx, feature in enumerate(numeric_features):
        ax = axes[idx // 2, idx % 2]
        sns.boxplot(x=data[feature], ax=ax)
        ax.set_title(f'Boxplot of {feature}')

    # Adjust layout
    plt.tight_layout()
    plt.show()

In [None]:
display_boxplot(music_df_filtered, numerical_features)

We can observe here that some of the features namely, `popularity`, `dance`, `loudness`, `speechiness`, `instrumentalness`, `liveness`, and `tempo` have an outstanding number of outliers way past the whiskers of the boxplots.

We can use `Interquartile Range` to remove outliers. This technique sets up a boundary outside Quartile 1 (Q1) and Quartile 3 (Q3). To do this, a multiplier of `1.5` is used to the data of the IQR, and the result is subtracted to Q1, while is added to Q3. Any instances in the data that are more than these boundaries are considered as outliers. (3.2 - Identifying Outliers: IQR Method | STAT 200, n.d.).

Using this will remove all the instances that have values below 25% and above 75% of the dataset based columns.  


We first get the quantile values based on these features.

In [None]:
outlier_features = ['popularity', 'dance', 'loudness', 'speechiness', 'instrumentalness', 'liveness', 'tempo']

# Calculate the IQR for each feature
Q1 = music_df_filtered[outlier_features].quantile(0.25)
Q3 = music_df_filtered[outlier_features].quantile(0.75)

print("Quantile 1: ", Q1)
print("Quantile 1: ", Q3)

Then we subtract Q3 from Q1.

In [None]:
IQR = Q3 - Q1
print(IQR)

As stated, we will use the multiplier `1.5`, store this on variable `multiplier`.

In [None]:
# Define a multiplier for IQR (e.g., 1.5)
multiplier = 1.5

Now, we can check for outliers using the logic of Interquartile Range.

In [None]:
# Identify outliers based on the IQR
outliers = (music_df_filtered[outlier_features] < (Q1 - multiplier * IQR)) | (music_df_filtered[outlier_features] > (Q3 + multiplier * IQR))

With the outliers determined, we can now filter our dataset accordingly. Store the filtered dataset in `music_no_outliers_df`

In [None]:
# Create a DataFrame without outliers
music_no_outliers_df = music_df_filtered[~outliers.any(axis=1)]

# Display information about removed outliers
print("Number of rows before removing outliers:", music_df.shape[0])
print("Number of rows after removing outliers:", music_no_outliers_df.shape[0])

In [None]:
display_boxplot(music_no_outliers_df, numerical_features)

### Data Normalization



Looking at the boxplots above, it seems that there are still outliers even if we remove some of them using Interquartile Range. We can visualize `music_no_outliers_df` through a histogram for us to understand how each of the features are characterized under the normal distribution. To do this we will use the module `matplotlib.pyplot`.

In [None]:
def display_histogram(data, columns):
    # Set up the figure and axes
    fig, axes = plt.subplots(nrows=3, ncols=4, figsize=(15, 8))

    # Flatten the axes for easier iteration
    axes = axes.flatten()

    # Create histograms for each column using a for loop
    for i, column in enumerate(columns):
        ax = axes[i]
        ax.hist(data[column], bins=20, color='skyblue', edgecolor='black')
        ax.set_title(f'{column} distribution')
        ax.set_xlabel('Value')
        ax.set_ylabel('Frequency')

    # Adjust layout
    plt.tight_layout()

    # Show the plot
    plt.show()

In [None]:
display_histogram(music_no_outliers_df, ['popularity', 'dance', 'energy', 'loudness', 'speechiness', 'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo'])

We can see that the features `speechiness`, `acousticness`, `instrumentalness`, and `liveness` are negatively skewed, with more instances having a value that is closer to 0. Features `dance`, `energy`, `and loudness` are positively skewed, on the other hand. While the rest are closer to the normal distribution. Based on the outliers and histograms, it seems that the outliers remained even after the outlier removal step does contribute to the overall meaning of the data. For example, `liveness` infers that the songs tend to decrease in frequency as the value of being lively increases from zero (0) to one (1).

We will try to normalize the values on these features using Min-Max Scaling. This algoritm scales the data according to the lowest possible and highest possible values, so that it is focused (*scaled*) while maintaining its relative position (Loukas, 2023). First, let's store a copy of `music_no_outliers_df` to a variable called `music_no_outliers_normalized`.

In [None]:
music_no_outliers_normalized = music_no_outliers_df.copy()
music_no_outliers_normalized

Then, for each column below, normalize each of the data and store back to their respective columns.

In [None]:
for column in ['energy', 'speechiness', 'acousticness', 'instrumentalness', 'liveness']:
    # Determine the range
    min_val = np.min(music_no_outliers_df[column])
    max_val = np.max(music_no_outliers_df[column])

    # Apply Min-Max Normalization
    normalized_data = (music_no_outliers_df[column] - min_val) / (max_val - min_val)
    music_no_outliers_normalized[column] = normalized_data

We can now display the histogram of the normalized data. 

In [None]:
display_histogram(music_no_outliers_normalized, numerical_features)

In [None]:
music_no_outliers_normalized.describe()

As we can see, the features `speechiness`, `acousticness`, `instrumentalness`, and `liveness` are still negatively skewed and the normalization has failed. This may be due to where the values of the instanes lie in the data. Looking at the description of `music_no_outliers_normalized` above, it seems that 25% of the data based on these features are below 0.07 which is relatively small compared to the max value of 1.00. Because of this, it may be best to retain the data as is after the normalization attempt so that we can have a better how these features are related in the next section of our notebook.

Now that we have cleaned all columns that will be used for this notebook. We can now begin the [Exploratory Data Analysis](#exploratory-data-analysis).

## Exploratory Data Analysis Questions
-- [Return to Table of Contents](#music-dataset---stintsy-s14-project-(flexbomb)) --

1. [**`Question 4`**](#question-4): Distribution of Features: Plot histograms or kernel density plots for Valence, Tempo, Liveness, Loudness, Acousticness, and Energy to understand their distributions.
2. [**`Question 5`**](#question-5): Class Distribution: Plot the distribution of the Class variable (target variable). Understand the balance between different classes.
3. [**`Question 6`**](#question-6): Relationship Between Features and Class: Use scatter plots or box plots to visualize the relationship between each feature and the target Class variable Identify any patterns or clusters that may exist.

### Question 1: Question

In [None]:
#correlation_matrix = music_df[selected_features + [target_variable]].corr()
correlation_matrix = music_df.select_dtypes(include='number').corr()

# Display correlation coefficients
print("Correlation Matrix:")
print(correlation_matrix)

# Visualize correlation matrix as a heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f", linewidths=.5)
plt.title('Correlation Heatmap')
plt.show()

### EDA Question 1 Results
This shows the relationship between 13 different audio features, including popularity, dance, energy, key, loudness, mode, speechiness, acousticness, instrumentalness, liveness, valence, tempo, duration, time signature, and class.

The correlation values range from -1 to 1, with 1 indicating a perfect positive correlation, -1 indicating a perfect negative correlation, and 0 indicating no correlation.

**Positive correlations**

1. `Popularity` is positively correlated with `dance`, `energy`, and `loudness`. This suggests that more popular songs tend to be more danceable, energetic, and louder.
2. `Dance` is positively correlated with `energy`, `loudness`, and `valence`. This suggests that danceable songs tend to be more energetic, louder, and more positive.
3. `Energy` is positively correlated with `loudness` and `valence`. This suggests that energetic songs tend to be louder and more positive.
4. `Speechiness` is positively correlated with `tempo`. This suggests that songs with more speech tend to have a faster tempo.
5. `Acousticness` is positively correlated with `time signature`. This suggests that acoustic songs tend to have a more regular time signature.

**Negative correlations**

1. `Popularity` is negatively correlated with `acousticness`. This suggests that more popular songs tend to be less acoustic.
2. `Dance` is negatively correlated with `speechiness` and `acousticness`. This suggests that danceable songs tend to have less speech and are less acoustic.
3. `Energy` is negatively correlated with `acousticness`. This suggests that energetic songs tend to be less acoustic.
4. `Instrumentalness` is negatively correlated with `valence`. This suggests that instrumental songs tend to be less positive.
5. `Liveness` is negatively correlated with `acousticness`. This suggests that livelier songs tend to be less acoustic.
6. `Tempo` is negatively correlated with `acousticness` and `duration`. This suggests that faster songs tend to be less acoustic and shorter.
7. `Duration` is negatively correlated with `class`. This suggests that longer songs are less likely to be classified as a particular genre.

Overall, the correlation heatmap provides insights into the relationships between different audio features. For example, we can see that more popular songs tend to be more danceable, energetic, and louder, while acoustic songs tend to be less popular, less danceable, and more mellow.

To also gain insights for our Machine Learning, we provided insights for `mode` being our target variable.

1. `Mode` is positively correlated with `key`, `loudness`, `speechiness`, `acousticness`, `instrumentalness`, `liveness`, `valence`, `tempo`, and `duration`. This means that songs in **major mode** tend to be *more popular, danceable, energetic, loud, speechy, acoustic, instrumental, live, positive, fast, and long.*

2. `Mode` is negatively correlated with `popularity`, `danceability`, `energy`, and `valence`. This means that songs in **minor mode** tend to be *less popular, danceable, energetic, and positive.*

### Question 2: What is the distribution of *`mode`*?

Here, we will compare the number of *tracks* for each *mode*.

In [None]:
music_modeCount_df = music_no_outliers_normalized['mode'].value_counts()
print(music_modeCount_df)

music_modeCount_df.plot.bar(figsize=(6,4)).invert_xaxis()
plt.xlabel('Mode')
plt.ylabel('Number of Tracks')
plt.title('Number of Tracks for each Mode')

### EDA Question 2 Results

We can see that the dataset has more tracks in the major (1) mode, than in the minor (0) mode. There are 2467 tracks in the major (1) mode while minor (0) mode has 1426 tracks.

### Question 3: What is the distribution of the feature `class`?

Here, we will compare the number of *tracks* for each *class*.

In [None]:
plt.figure(figsize=(12, 6))
ax = sns.countplot(x='class', data=music_no_outliers_normalized, order=music_no_outliers_normalized['class'].value_counts(ascending=True).index)
ax.bar_label(ax.containers[0])
plt.title('Distribution of Class')
plt.xlabel('Class')
plt.ylabel('Number of Tracks')
plt.xticks(rotation=45)
plt.show()

### EDA Question 3 Results

We can see that the lowest track count is 38 for class 7 while the highest has 970 tracks for class 10. Although, class 9 has 962 tracks which is coming at a close second. This means that most of the tracks in the dataset are at class 9 & 10.

### Question 4: What is the distribution of the `mode` for each `class`?

Here, we will compare the number of *modes* for each *class*.

In [None]:
ax = pd.crosstab(music_no_outliers_normalized['class'], music_no_outliers_normalized['mode']).plot(kind='bar', stacked=True)
ax.bar_label(ax.containers[0])
ax.bar_label(ax.containers[1])

plt.title('Class Distribution by Mode')
plt.xlabel('Class')
plt.ylabel('Number of Tracks')
plt.show()

### EDA Question 4 Results

We can see that there is a different mode distribution for each class. 

| Class | Mode 0 | Mode 1 | Total |
| --- | --- | --- | --- |
| 0 | 52 | 168 | 220 |
| 1 | 79 | 175 | 254 |
| 2 | 104 | 208 | 312 |
| 3 | 38 | 77 | 115 |
| 4 | 11 | 171 | 182 |
| 5 | 85 | 105 | 190 |
| 6 | 202 | 368 | 570 |
| 7 | 17 | 21 | 38 |
| 8 | 38 | 42 | 80 |
| 9 | 455 | 507 | 962 |
| 10 | 345 | 625 | 970 |

# Data Modelling

In [None]:
X = music_no_outliers_normalized.select_dtypes(include='number')
X.drop(['mode', 'class'], axis=1, inplace=True)
y = music_no_outliers_normalized['class']
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3, stratify=y, random_state=42)

print('Training data shape: ', X_train.shape)
print('Training labels shape: ', y_train.shape)
print('Test data shape: ', X_test.shape)
print('Test labels shape: ', y_test.shape)

## Logistic Regression

### Model Training
Logistic Regression 
Let's instantiate an `SGDClassifier` object and set the following hyperparameters:
- Loss function: 'log_loss'
- Initial learning rate: 0.001
- Maximum iterations: 200
- Learning rate: 'constant'
- Random state: 1
- Verbose: 1


In [None]:
model = SGDClassifier(loss='log_loss', 
                      eta0=0.001, 
                      max_iter=200, 
                      learning_rate='constant', 
                      random_state=1, 
                      verbose=1)

model.fit(X_train, y_train)

Let's predict the training data first

In [None]:
predictions = model.predict(X_train)
num_correct = np.count_nonzero((y_train==predictions))
print(num_correct)
accuracy = num_correct/len(y_train)
print("Accuracy :", accuracy * 100)


Then, compare the results with test data.

In [None]:
predictions = model.predict(X_test)
num_correct = np.count_nonzero((y_test==predictions))
print(num_correct)
accuracy = num_correct/len(y_test)
print("Accuracy :", accuracy * 100)

In [None]:
prob = pd.DataFrame(model.predict_proba(X_test), columns=['y = 0', 'y = 1'])
prob['actual'] = y_test.values
prob

### Hyperparameter Tuning

To tune hyperparameters for our `SGDClassifier`, we can used  grid search. Scikit-learn provides the `GridSearchCV` classes for this purpose.

Set our Pipeline that includes a standard scaler and the `SGDClassifier`
We can set the parameter grid as a parameter for our `GridSearchCV` as such: 
> 'classifier__alpha': [0.0001, 0.001, 0.01, 0.1, 1.0] <br>
> 'classifier__max_iter': [100, 200, 300]<br>
> 'classifier__eta0': [0.001, 0.01, 0.1]<br>

We can Instantiate the `GridSearchCV` with our made pipeline, param_grid, cross validation = 5, scoring = accuracy, and verbose = 1


In [None]:

# Define the pipeline with a standard scaler and SGDClassifier
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', SGDClassifier(loss='log_loss', random_state=1, verbose=1))
])

# Define the hyperparameters 
param_grid = {
    'classifier__alpha': [0.0001, 0.001, 0.01, 0.1, 1.0],
    'classifier__max_iter': [100, 200, 300],
    'classifier__eta0': [0.001, 0.01, 0.1],
}

# Perform grid search with cross-validation
grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='accuracy', verbose=1)
grid_search.fit(X_train, y_train)

# Get the best parameters and model
best_params = grid_search.best_params_
best_model = grid_search.best_estimator_

# Evaluate the best model on the test set
accuracy_test = best_model.score(X_test, y_test)

print(f"Best Parameters: {best_params}")
print(f"Test Accuracy with Best Model: {accuracy_test}")


After performing the grid search, we can now locate the best parameters for our Logistic regression.

- Best Parameters: {'classifier__alpha': 0.01, 'classifier__eta0': 0.001, 'classifier__max_iter': 100}
- Test Accuracy with Best Model: 0.6410844629822732

We can finally select `best_model` SGDClassifier from above that contains the best parameters.

In [None]:
final_model = SGDClassifier(alpha=best_params['classifier__alpha'],
                            max_iter=best_params['classifier__max_iter'],
                            eta0=best_params['classifier__eta0'],
                            loss='log_loss',
                            random_state=1,
                            verbose=1)

final_model.fit(X_train, y_train)

In [None]:
final_accuracy = best_model.score(X_train, y_train)
print(f"Final Model Accuracy on Train Set: {final_accuracy}")

final_accuracy = best_model.score(X_test, y_test)
print(f"Final Model Accuracy on Test Set: {final_accuracy}")

### Binomial Logistic Regression Result:
| Description | Without Tuning Hyperparameter | Tuned Hyperparameter
| --- | --- | --- |
| Epochs after Converge | `13` | `100` |
| Average Loss at Converge | `3.950169` | `0.839764` |
| Accuracy of the Model with X_train | `40.73%` | `63.82%` |
| Accuracy of the Model with X_test | `43.40%` | `62.33%` |


## Neural Network

In [None]:
grid = {
    'activation': ['tanh', 'relu'],
    'solver': ['sgd', 'adam'],
    'alpha': [0.001, 0.0001, 0.5],
    'learning_rate': ['constant','adaptive'],
    'max_iter': [500, 1000, 2000],
}

In [None]:
clf = GridSearchCV(MLPClassifier(random_state=42, hidden_layer_sizes=(50, 50, 50)), grid, cv = 5, verbose=3)

In [None]:
clf.fit(X_train, y_train)

In [None]:
print("best parameters of the model: ", clf.best_params_)
print("best score of the model: ", clf.best_score_)

In [None]:
predictions = clf.predict(X_train)

In [None]:
accuracy = np.count_nonzero((y_train==predictions))/len(y_train)
print(accuracy)

In [None]:
predictions = clf.predict(X_test)

In [None]:
accuracy = np.count_nonzero((y_test==predictions))/len(y_test)
print(accuracy)

In [None]:
report = classification_report(y_test, predictions)
print("Classification Report:\n", report)

## Random Forest

A **`random forest machine learning model`** is an ensemble model that makes use of multiple decision trees trained through bagging. It takes multiple decision tree classifier models and trains all models on bootstrapped data which is sampled from the training data without replacement because each decision tree is independent from each other. Each tree can be trained in parallel with the bootstrapped data. This should work well for our data because each tree can learn very specific features unique to each seed in the dataset that make it a specific category. To reduce overfitting, bagging and random partitions are used to increase the model's ability to generalize on new data. 

In [None]:
rf_classifier = RandomForestClassifier(random_state=42)

rf_classifier.fit(X_train, y_train)

predictions = rf_classifier.predict(X_test)
accuracy = accuracy_score(y_test, predictions)

predictions_train = rf_classifier.predict(X_train)
accuracy_train = accuracy_score(y_train, predictions_train)

print(f"Accuracy on training data: {accuracy_train}")
print(f"Accuracy on test data: {accuracy}")

We get 71.43% accuracy on the test data without hyperparameter tuning, and 100% accuracy on the training data. Random forests naturally get very high accuracies on the training set, which means it is not a cause for concern.

## Hyperparameter Tuning

### Random Forest

We will be using *random search* to tune the hyperparameters of the random forest. The hyperparameter values are randomized at each interation within a specified range.

- `n_estimators`: This is the total amount of decision trees used in the random forest, initially set to 100. The most frequent label generated by all the trees is the final label given by the ensemble model. 
- `max_depth`: The maximum depth all the trees are limited to, which can help reduce overfitting on training data. This is initially set to None.
- `max_features`: Set to 'sqrt'. The total amount of features to consider to consider at each split of all trees.
- `min_samples_split`: Set to 2. The minimum amount of data points that are needed to split a decision node. A higher value will prevent the model from making splits that are too specific to certain data points, or noise.
- `min_samples_leaf`: Set to 1. The minimum number of samples a leaf node can have, increasing this can also reduce overfitting.

In [None]:
param_dist = {'n_estimators': randint(100,300),
              'max_depth': randint(1,30),
                'max_features': randint(1,5),
               'min_samples_split': randint (2, 10),
               'min_samples_leaf': randint(1, 5)}

Random search will run for 10 iterations using accuracy as the main evaluation metric. 

In [None]:
random_search = RandomizedSearchCV(estimator=rf_classifier, param_distributions=param_dist, n_iter=10, cv=5, scoring='accuracy', random_state=42)

random_search.fit(X_train, y_train)

for i in range(len(random_search.cv_results_['params'])):
    print(f"Iteration {i + 1}:")
    print("Hyperparameters:", random_search.cv_results_['params'][i])
    print("Mean Test Score (Accuracy):", random_search.cv_results_['mean_test_score'][i])
    print('----------------------------------')

best_rf = random_search.best_estimator_

predictions = best_rf.predict(X_test)
predictions_train = best_rf.predict(X_train)

In [None]:
report = classification_report(y_test, predictions)

print("Classification Report:\n", report)

The model has a precision 0.71 for both classes, which means the model is not better at predicting true positives for one class compared to the other. It has an overall accuracy of 71%.

In [None]:
print("Best model hyperparameters:", random_search.best_params_)
print("Accuracy of best model on test set: ", accuracy_score(predictions, y_test))
print("Accuracy of best model on training set: ", accuracy_score(predictions_train, y_train))

# References

3.2 - Identifying outliers: IQR Method | STAT 200. (n.d.). PennState: Statistics Online Courses. https://online.stat.psu.edu/stat200/lesson/3/3.2

Priyanka, P. (2023, September 5). Audio Normalization - Poudel priyanka - Medium. Medium. https://medium.com/@poudelnipriyanka/audio-normalization-9dbcedfefcc0

Magnolia International Ltd. (n.d.). TC Electronic | Loudness explained. https://www.tcelectronic.com/de/loudness-explained.html

MasterClass. (n.d.). Beats per Minute explained: How to find a song’s BPM - 2023 - MasterClass. https://www.masterclass.com/articles/how-to-find-the-bpm-of-a-song

## The Dataset

-- [Return to Table of Contents](#music-dataset---stintsy-s14-project-(flexbomb)) --

**`Music`** is a universal language, transcending cultures and time. It is a powerful art form that can evoke a wide range of emotions, from joy and excitement to sadness and reflection. Music can be used to express oneself, to connect with others, and to celebrate life. We are studying this dataset because it could be essential for studying music and developing new music technologies. In this notebook in particular, it will be used to train machine learning models to perform a variety of tasks. These models can then be used to create new products and services, such as personalized music streaming services and intelligent music assistants.

The dataset is provided as a `.csv` file where it can be viewed in Excel and Notepad. 

This dataset contains 17,996 **rows** across 17 **columns**. Each row represents **1 song**, while columns represent **audio features**. The following are the columns in the dataset and their descriptions:

| Column Name | Description |
| --- | --- |
| **`Artist Name`** | Name of artist |
| **`Track Name`** | Name of song |
| **`Popularity`** | A value between 0 and 100, calculated by an algorithm and is based, in the most part, on the total number of plays the track has had and how recent those plays are |
| **`danceability`** | Describes how suitable a track is for dancing; 0.0 is least danceable and 1.0 is most danceable |
| **`energy`** | A measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity |
| **`key`** | The key the track is in, integers map to pitches using standard [Pitch Class notation](https://en.wikipedia.org/wiki/Pitch_class); -1 if no key was detected |
| **`loudness`** | The quality of a sound that is the primary psychological correlate of physical strength (amplitude), values are averaged across the entire track; in decibels (dB) |
| **`mode`** | Indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived; 1 is Major and 0 is Minor |
| **`speechiness`** | The presence of spoken words in a track; >0.66 is probably made entirely of spoken words, 0.33-0.66 may contain both music and speech, <0.33 most likely represents music |
| **`acousticness`** | A confidence measure from 0.0 to 1.0 of whether the track is acoustic |
| **`instrumentalness`** | Predicts whether a track contains no vocals; >0.5 is intended to represent instrumental tracks |
| **`liveness`** | Detects the presence of an audience in the recording; >0.8 provides strong likelihood that the track is live |
| **`valence`** | A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track; tracks with high valence sound more positive, and vice versa|
| **`tempo`** | The overall estimated tempo of a track in beats per minute (BPM) |
| **`duration_in min/ms`** | Duration in millisecond (ms) |
| **`time_signature`** | A notational convention to specify how many beats are in each bar |
| **`Class`** | corresponds to the genre of the track |

State a brief description of the dataset.

• Provide a description of the collection process executed to build the dataset. Discuss the
implications of the data collection method on the generated conclusions and insights.
Note that you may need to look at relevant sources related to the dataset to acquire
necessary information for this part of the project.

• Describe the structure of the dataset file. <br>
    o What does each row and column represent? <br>
    o How many instances are there in the dataset? <br>
    o How many features are there in the dataset? <br>
    o If the dataset is composed of different files that you will combine in the succeeding
steps, describe the structure and the contents of each file.

• Discuss the features in each dataset file. What does each feature represent? All features,
even those which are not used for the study, should be described to the reader. The
purpose of each feature in the dataset should be clear to the reader of the notebook
without having to go through an external link.

## List of Requirements
-- [Return to Table of Contents](#music-dataset---stintsy-s14-project-(flexbomb)) --

1. [Numpy](https://numpy.org/)
2. [Matplotlib](https://matplotlib.org/)
3. [CSV](https://docs.python.org/3/library/csv.html)

For this notebook, **numpy**, **matplotlib**, and **csv** must be imported.

In [None]:
%pip install numpy
%pip install matplotlib
%pip install pandas
%pip install seaborn

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import csv
import pandas as pd
import seaborn as sns
import math 

plt.style.use('ggplot')

%matplotlib inline
%load_ext autoreload
%autoreload 2

## Reading the Dataset
-- [Return to Table of Contents](#music-dataset---stintsy-s14-project-(flexbomb)) --

Here we will load the dataset using `csv`. We use the [`reader`](https://docs.python.org/3/library/csv.html) function to load the dataset. The path will have to be changed depending on the location of the file in your machine.


In [None]:
music_df = pd.read_csv('music.csv')
music_df.head()

The dataset is now loaded in the ???.

Show the contents of the...

## Exploratory Data Analysis Questions
-- [Return to Table of Contents](#music-dataset---stintsy-s14-project-(flexbomb)) --

1. [**`Question 4`**](#question-4): Distribution of Features: Plot histograms or kernel density plots for Valence, Tempo, Liveness, Loudness, Acousticness, and Energy to understand their distributions.
2. [**`Question 5`**](#question-5): Class Distribution: Plot the distribution of the Class variable (target variable). Understand the balance between different classes.
3. [**`Question 6`**](#question-6): Relationship Between Features and Class: Use scatter plots or box plots to visualize the relationship between each feature and the target Class variable Identify any patterns or clusters that may exist.

## Data Preprocessing and Cleaning

-- [Return to Table of Contents](#music-dataset---stintsy-s14-project-(flexbomb)) --