## Preprocessing
Preprocessing for the project covers two datasets: the *Billboard “The Hot 100” Songs* dataset, which is processed to derive class labels from song rankings, and the *Million Song Dataset*, which provides the songs that need to be classified.
### Preprocessing: The *Billboard “The Hot 100” Songs* dataset
The *Billboard “The Hot 100” Songs* dataset is available via the Kaggle API at https://www.kaggle.com/datasets/dhruvildave/billboard-the-hot-100-songs. It includes the Hot 100 songs from 1958 to 2021 (inclusive). Because the *Million Song Dataset* includes titles from 1922 to 2010, preprocessing filters the list of songs so that titles more recent than 2010 are removed. The most important features of the dataset are “song”, “artist”, and “rank”.
Preprocessing the *Billboard “The Hot 100” Songs* dataset involved five main steps:
1. The dataset was downloaded through the Kaggle API. It had no missing values in the *song*, *artist*, and *rank* columns.
2. The downloaded CSV file was filtered to remove titles more recent than 2010, then grouped by both *song* and *artist* and aggregated to take the minimum rank of each group. Each *song* and *artist* pair therefore keeps its minimum rank (its highest Billboard position), because the project is only interested in the highest rank each song achieved.
3. Each data point was then assigned a class: ranks 1 to 10 were defined as class 0, ranks 11 to 20 as class 1, and so on up to rank 100 (ranks 91 to 100 are class 9).
4. The *song*, *artist*, and *class* columns were saved to a CSV file for the last data processing step, where they are used to label the *Million Song Dataset* before it is used to train our models.
5. As part of an effort to enhance the models, a new Spotify dataset was processed and the Billboard dataset was updated to include all available entries prior to 2023.
PySpark was determined to be the best tool for processing the *Billboard “The Hot 100” Songs* dataset. The script is located at /src/preprocessing/billboard_preprocessor.py
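
The filtering, grouping, and classification steps above can be illustrated with a minimal sketch. It is not the project's PySpark script: it uses only the Python standard library so that it stays self-contained. The function names `rank_to_class` and `best_rank_per_song` are hypothetical, the sample rows are made up, and the sketch assumes that the CSV has a `date` column holding ISO dates and that "more recent than 2010" refers to the chart date.

```python
import csv
import io


def rank_to_class(rank: int) -> int:
    """Map ranks 1-10 to class 0, 11-20 to class 1, ..., 91-100 to class 9."""
    if not 1 <= rank <= 100:
        raise ValueError(f"rank must be between 1 and 100, got {rank}")
    return (rank - 1) // 10


def best_rank_per_song(rows, last_year=2010):
    """Return {(song, artist): minimum rank}, skipping chart weeks after last_year."""
    best = {}
    for row in rows:
        # Assumption: "date" is an ISO string such as "2009-06-13".
        if int(row["date"][:4]) > last_year:
            continue
        key = (row["song"], row["artist"])
        rank = int(row["rank"])
        # Keep the smallest rank seen so far for this (song, artist) pair.
        best[key] = min(rank, best.get(key, rank))
    return best


# Made-up rows for illustration only; not taken from the real dataset.
SAMPLE_CSV = """date,rank,song,artist
2009-06-13,14,Song A,Artist X
2009-06-20,3,Song A,Artist X
2010-01-02,57,Song B,Artist Y
2015-03-07,1,Song C,Artist Z
"""

best = best_rank_per_song(csv.DictReader(io.StringIO(SAMPLE_CSV)))
classes = {key: rank_to_class(rank) for key, rank in best.items()}
print(classes)
```

In this sample, Song A keeps its best rank of 3 and lands in class 0, Song B lands in class 5, and Song C is dropped by the year filter.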


### Preprocessing: The *Million Song Dataset*
1. Download the Million Song Dataset by following the instructions at http://millionsongdataset.com/pages/getting-dataset/. The direct link provided there is a subset of 10,000 songs intended for testing; a link to a relational database for the entire dataset is also available. There is a download function that takes a URL and downloads the subset.
2. Extract the dataset with the extract_file function, which accepts the file path of the downloaded file.
3. Collect the paths of all h5 files in the extracted dataset with the glob module and store them in a list.
4. Each h5 file has three groups relevant to song processing: analysis, metadata, and musicbrainz. Each group contains its own "songs" dataset with its own features for each song. Using dask's read_hdf, with each key being the name of the group followed by the name of the dataset (songs), read all the features of one group for every song into a single dataframe.
5. Merge the per-group dataframes with dask's concat method, specifying axis = 1 for a column-wise concatenation.
6. Process the dataset with the process function: select which features to include and drop any song that has a null value for any of those features.
7. Write the final dataframe to a CSV file.
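
Steps 3 to 7 can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's script: the helper names `find_h5_files`, `load_songs`, and `write_csv` are hypothetical, the keys assume the "group name followed by dataset name" pattern described in step 4, and the dask calls require dask and PyTables to be installed. Column-wise concatenation here relies on the three per-group dataframes having matching row order, which should be checked against the real data.

```python
import glob
import os

# Keys assumed from step 4: group name followed by the "songs" dataset name.
SONG_KEYS = ["/analysis/songs", "/metadata/songs", "/musicbrainz/songs"]


def find_h5_files(root):
    """Step 3: recursively collect every .h5 path under root as a sorted list."""
    pattern = os.path.join(root, "**", "*.h5")
    return sorted(glob.glob(pattern, recursive=True))


def load_songs(h5_paths, columns):
    """Steps 4-6: read each group, join column-wise, keep `columns`, drop nulls."""
    import dask.dataframe as dd  # imported lazily; needs dask and PyTables

    # One dataframe per group, each covering every file in h5_paths.
    per_group = [dd.read_hdf(h5_paths, key=key) for key in SONG_KEYS]
    songs = dd.concat(per_group, axis=1)  # column-wise concatenation
    # Keep only the selected features and drop songs with any null among them.
    return songs[columns].dropna()


def write_csv(songs, out_path):
    """Step 7: write the resulting dask dataframe to a single CSV file."""
    songs.to_csv(out_path, single_file=True, index=False)
```

A call such as `write_csv(load_songs(find_h5_files(root), columns), out_path)` would chain the steps, where `root`, `columns`, and `out_path` are placeholders for the extracted dataset directory, the chosen feature names, and the output path.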

The Million Song Dataset was not used in the final dataset for our models. We found that it was missing data for important features such as danceability, and that singles were hard to identify because some songs have different titles in the Million Song Dataset than in the Billboard dataset, with no consistent pattern. Another issue was the overrepresentation of unranked songs from the Million Song Dataset, which was affecting our model. Instead, we used Spotify's API to obtain the necessary features, such as danceability, for the songs in the Billboard dataset. The Billboard dataset had a large enough sample of songs that performance would be fine on its own, without the additional songs from the Million Song Dataset skewing the class distribution towards the unranked category.

