# Introduction: Spotify genre classifier

## Description:
This Python notebook presents a machine learning approach to classifying Spotify tracks by genre. It uses a dataset of track metadata and audio features (such as danceability, energy, and tempo) to train and evaluate a predictive model that assigns a genre to each track.

Key Features:

## Data Exploration and Preprocessing: 
The notebook begins with a thorough exploration of the dataset, identifying key features and potential challenges. It covers data cleaning, handling missing values, and encoding categorical variables to prepare the data for model training.
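
The cleaning and encoding steps described above can be sketched on a toy frame. This is only an illustration: the column names come from the dataset, but the rows are invented, and the `genre_code` column and `genre_names` variable are hypothetical names, not part of the notebook.

```python
import pandas as pd

# Toy frame using column names from the dataset; the rows are invented.
toy = pd.DataFrame({
    "artists": ["A;B", None, "C", "D"],
    "explicit": [False, True, True, False],
    "track_genre": ["acoustic", "rock", "rock", "acoustic"],
})

# Handle missing values: drop incomplete rows, then rebuild the index.
toy = toy.dropna().reset_index(drop=True)

# Encode the boolean flag as a 0/1 integer.
toy["explicit"] = toy["explicit"].astype(int)

# Encode the genre label as an integer code.
# `genre_names` maps each code back to its original label.
toy["genre_code"], genre_names = pd.factorize(toy["track_genre"])

print(toy)
print(list(genre_names))
```

`pd.factorize` assigns codes in order of first appearance, so the mapping depends on row order. Sorting the labels first is one way to make it stable.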

# Data

The data is provided by [Spotify](https://www.kaggle.com/datasets/thedevastator/spotify-tracks-genre-dataset/data).

## Metric: ROC AUC

Once we have a grasp of the data (reading through the [column descriptions](https://www.kaggle.com/datasets/thedevastator/spotify-tracks-genre-dataset/data) helps immensely), we need to understand the metric by which our model is judged. In this case, it is a common classification metric known as the Receiver Operating Characteristic Area Under the Curve (ROC AUC, also sometimes called AUROC).

The ROC AUC may sound intimidating, but it is relatively straightforward once you get your head around the two individual concepts. The Receiver Operating Characteristic (ROC) curve graphs the true positive rate versus the false positive rate:

![image](https://en.wikipedia.org/wiki/Partial_Area_Under_the_ROC_Curve#/media/File:Basic_AUC_annotated.png)

A single line on the graph indicates the curve for a single model, and movement along a line indicates changing the threshold used for classifying a positive instance. The threshold starts at 0 in the upper right and goes to 1 in the lower left. A curve that is to the left of and above another curve indicates a better model. For example, the blue model is better than the red model, which is better than the black diagonal line that represents a naive random-guessing model.
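
To make the threshold sweep concrete, here is a minimal sketch in plain Python that computes the (false positive rate, true positive rate) points for one model. The labels and scores are invented, and `roc_points` is a hypothetical helper, not a function from the notebook.

```python
def roc_points(y_true, y_score):
    """Return (fpr, tpr) pairs obtained by sweeping the decision threshold."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    # Start above the highest score (nothing is predicted positive) and
    # lower the threshold until everything is predicted positive.
    thresholds = [float("inf")] + sorted(set(y_score), reverse=True)
    points = []
    for t in thresholds:
        tp = sum(1 for y, s in zip(y_true, y_score) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(y_true, y_score) if s >= t and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


# Toy labels and predicted probabilities, invented for illustration.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

points = roc_points(y_true, y_score)
for fpr, tpr in points:
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

The first point is the lower-left corner (strictest threshold) and the last is the upper-right corner (loosest threshold). In practice, `sklearn.metrics.roc_curve` does the same job.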

The [Area Under the Curve (AUC)](http://gim.unmc.edu/dxtests/roc3.htm) explains itself by its name! It is simply the area under the ROC curve. (This is the integral of the curve.) This metric is between 0 and 1 with a better model scoring higher. A model that simply guesses at random will have an ROC AUC of 0.5.
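
As a rough illustration of "area under the curve", the sketch below applies the trapezoid rule to a list of ROC points. The points are invented for the example, and `auc_trapezoid` is a hypothetical helper name.

```python
def auc_trapezoid(points):
    """Approximate the area under a curve given as (fpr, tpr) points."""
    pts = sorted(points)  # order by false positive rate, then true positive rate
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        # Area of the trapezoid between two neighbouring points.
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Invented ROC points for a toy model, plus the random-guess diagonal.
model_curve = [(0.0, 0.0), (0.0, 0.5), (0.5, 0.5), (0.5, 1.0), (1.0, 1.0)]
random_curve = [(0.0, 0.0), (1.0, 1.0)]

model_auc = auc_trapezoid(model_curve)
random_auc = auc_trapezoid(random_curve)
print(model_auc, random_auc)
```

The diagonal yields 0.5, matching the random-guess baseline described above. With scikit-learn available, `sklearn.metrics.roc_auc_score` computes the score directly from labels and predicted probabilities.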

When we measure a classifier according to the ROC AUC, we do not generate 0 or 1 predictions, but rather a probability between 0 and 1. This may be confusing because we usually like to think in terms of accuracy, but when we get into problems with imbalanced classes, accuracy is not the best metric. For example, if I wanted to build a model that could detect terrorists with 99.9999% accuracy, I would simply make a model that predicted every single person was not a terrorist. Clearly, this would not be effective (the recall would be zero), so we use more advanced metrics such as ROC AUC or the [F1 score](https://en.wikipedia.org/wiki/F1_score) to more accurately reflect the performance of a classifier. A model with a high ROC AUC can usually also reach a high accuracy at a suitable threshold, but the [ROC AUC is a better representation of model performance.](https://datascience.stackexchange.com/questions/806/advantages-of-auc-vs-standard-accuracy)
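
The accuracy trap described above is easy to reproduce. The following sketch uses invented data (10 positives among 10,000 samples) and a "model" that always predicts the negative class.

```python
# Toy data, invented for illustration: 10 positives among 10,000 samples.
y_true = [1] * 10 + [0] * 9990

# A "model" that always predicts the negative class.
y_pred = [0] * len(y_true)

# Accuracy: fraction of predictions that match the label.
correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
accuracy = correct / len(y_true)

# Recall: fraction of actual positives that the model found.
true_positives = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
recall = true_positives / sum(y_true)

print(f"accuracy={accuracy:.4f}  recall={recall:.1f}")
```

The accuracy is very high even though the model never finds a single positive, which is why accuracy alone can mislead on imbalanced data.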

Now that we know the background of the data we are using and the metric to maximize, let's get into exploring the data. In this notebook, as mentioned previously, we will stick to the main data sources and simple models which we can build upon in future work.

In [5]:
import pandas as pd

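# Load the Spotify tracks dataset.
df = pd.read_csv('./data/spotify-data.csv')
# Drop the '0' column (presumably a leftover index) and the track identifier;
# neither carries information about the genre.
df = df.drop(columns=['0', 'track_id'])
print("shape", df.shape)
print("col", df.columns.values)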

shape (114000, 19)
col ['artists' 'album_name' 'track_name' 'popularity' 'duration_ms' 'explicit'
 'danceability' 'energy' 'key' 'loudness' 'mode' 'speechiness'
 'acousticness' 'instrumentalness' 'liveness' 'valence' 'tempo'
 'time_signature' 'track_genre']


In [6]:
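# Remove rows that contain missing values.
df.dropna(inplace=True)

# Print each column and confirm that no null values remain.
for col in df.columns:
    print("Column:", col)
    print(df[col])
    print("Null values: ", df[col].isnull().sum())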

Column: artists
0                    Gen Hoshino
1                   Ben Woodward
2         Ingrid Michaelson;ZAYN
3                   Kina Grannis
4               Chord Overstreet
                   ...          
113995             Rainy Lullaby
113996             Rainy Lullaby
113997             Cesária Evora
113998          Michael W. Smith
113999             Cesária Evora
Name: artists, Length: 113999, dtype: object
Null values:  0
Column: album_name
0                                                    Comedy
1                                          Ghost (Acoustic)
2                                            To Begin Again
3         Crazy Rich Asians (Original Motion Picture Sou...
4                                                   Hold On
                                ...                        
113995    #mindfulness - Soft Rain for Mindful Meditatio...
113996    #mindfulness - Soft Rain for Mindful Meditatio...
113997                                              Best Of
1

In [7]:
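# Preprocessing
# Cast every entry to str before splitting. A missing value (NaN) is stored as a
# float, which is the likely source of the single float entry seen before dropna().
df['artists'] = df['artists'].astype(str)
df['artists'].apply(type).value_counts()  # every entry should now be str
# Split multi-artist strings such as 'Ingrid Michaelson;ZAYN' into lists.
df['artists'] = df['artists'].str.split(';')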

## Exploratory Data Analysis (EDA): 
Here we will provide insights into the distribution of key variables, relationships between features, and an understanding of the data patterns. Visualizations aid in uncovering trends that contribute to the decision-making process.

Feature Engineering: The notebook implements feature engineering techniques to enhance the predictive power of the model. This involves creating new features, transforming existing ones, and selecting relevant variables to improve the model's ability to capture underlying patterns.

Model Selection: Multiple machine learning algorithms are explored for Spotify genre prediction, including but not limited to logistic regression, decision trees, random forests, and support vector machines. The notebook includes a comparative analysis of their performance metrics, helping users choose the most suitable model for their specific use case.

Model Training and Evaluation: The selected model is trained on the preprocessed dataset, and its performance is evaluated using various metrics such as accuracy, precision, recall, and F1 score. The notebook emphasizes the importance of choosing an evaluation metric that aligns with the goals of the genre classifier.

Hyperparameter Tuning: To optimize the model's performance, the notebook incorporates hyperparameter tuning techniques, fine-tuning the model for better accuracy and robustness.

Deployment Considerations: The notebook concludes with a discussion on deploying the trained model into a production environment. It provides insights into model deployment options, considerations for scalability, and integration with existing systems.

By leveraging this Python notebook, users can integrate an automated genre prediction model into their own systems, reducing the time and resources required for manual genre labeling.
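
The evaluation metrics named above (accuracy, precision, recall, F1 score) extend to a multi-class problem such as genre prediction by computing them per class and averaging. The sketch below shows one common choice, macro averaging, in plain Python. The labels and predictions are invented, and `macro_scores` is a hypothetical helper rather than the notebook's own evaluation code.

```python
def macro_scores(y_true, y_pred):
    """Return macro-averaged (precision, recall, f1) over all labels."""
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        # Guard against division by zero when a class is never predicted or never present.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Toy genre labels and predictions, invented for illustration.
y_true = ["rock", "rock", "jazz", "jazz", "pop", "pop"]
y_pred = ["rock", "jazz", "jazz", "jazz", "pop", "rock"]

accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
precision, recall, f1 = macro_scores(y_true, y_pred)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Macro averaging weights every genre equally. With scikit-learn available, `sklearn.metrics.precision_recall_fscore_support` with `average="macro"` offers comparable functionality.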


In [20]:
import matplotlib.pyplot as plt

genres = df['track_genre'].value_counts()

# Every genre has between 999 and 1000 songs, so the classes are balanced.
print(genres.min())
print(genres.max())

999
1000
