# Spotify Recommendation System Project

**Purpose:** Build a scalable music recommendation system using PySpark and Spotify datasets.

**Group members:**

* Raakin Bhatti
* Aneesh Bulusu
* Walid Farhat
* Long Nguyen
* Strahinja Radakovic

# Introduction

**What is our big data problem? What is our goal?**
* Build a scalable music recommendation system using a large dataset of songs and their audio features from Spotify.
* Given the names of some songs, the algorithm will predict and recommend songs similar to the input songs based on their audio features (e.g. danceability, energy, acousticness, etc.) and categorical data like genres.
* Use Spark/PySpark to process large-scale data and develop machine learning algorithms.

**Why we chose building Recommendation Systems:**
* Recommendation systems are widely used in modern digital platforms to enhance user experience (e.g., Spotify, Netflix, Amazon).
* Help users discover relevant content, which helps to increase engagement and satisfaction.

**Why Spotify dataset?**
* Spotify is a leading music streaming platform which has rich data on songs, audio features, and artists.
* The dataset provides an opportunity to analyze music preferences and recommend personalized songs or playlists.

**What is a big data challenge?**
* Spotify data involves thousands of users, songs, and interactions, requiring storage and processing at scale.
* PySpark is well-suited for handling and analyzing big data with its distributed computing capabilities.

**Broader applications:**
* Insights from the project can extend to other recommendation systems.
* Demonstrates the integration of big data tools and machine learning for real-world applications.


**Filtering method:**

Our recommendation system uses Content-Based Filtering rather than Collaborative Filtering. Content-Based Filtering analyzes the audio characteristics of songs you’ve previously enjoyed and uses them to make personalized suggestions. We do not use Collaborative (user-based) Filtering because we cannot collect user-level information from Spotify, such as `user_id`.

**What is the Target Variable for this project?**

In a content-based recommendation system, we do not have a traditional "target variable" like in supervised learning. Instead, the goal is to calculate similarity metrics between songs based on their features. However, you can think of the **song similarity score** (e.g., cosine similarity, Euclidean distance) as the implicit target metric for creating recommendations.

* The goal is to recommend songs similar to the input songs based on their audio features and genres.
* For our recommendation system, the focus is on matching songs based on audio features like danceability, energy, acousticness, etc., and categorical data like genre.
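
To make the similarity score concrete, here is a minimal plain-Python sketch of the two metrics mentioned above. It is independent of Spark, and the four-feature vectors are illustrative values only, not the feature set or implementation used later in the project.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen for this sketch when a vector is all zeros
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Illustrative values only: [danceability, energy, acousticness, valence]
song_a = [0.483, 0.303, 0.694, 0.139]
song_b = [0.572, 0.454, 0.477, 0.515]

similarity = cosine_similarity(song_a, song_b)
distance = euclidean_distance(song_a, song_b)
print(f"cosine similarity: {similarity:.3f}")
print(f"euclidean distance: {distance:.3f}")
```

A higher cosine similarity (or a lower Euclidean distance) means the two songs are closer in feature space, so a recommender can rank candidate songs by either score.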

**What are the interesting Features of the dataset?**

**Numerical Features (Audio Characteristics):** 

* danceability: Indicates how suitable a track is for dancing.
* energy: Represents the intensity and activity level of a track.
* loudness: Measures the decibel level of the track.
* acousticness: Likelihood of the track being acoustic.
* instrumentalness: Determines the degree to which a track is instrumental.
* valence: Describes the musical positiveness conveyed by a track.
* tempo: The speed of the song in beats per minute (BPM).
* duration_ms: Song duration, which can help differentiate between shorter and longer tracks.

**Categorical Features:**

* genre: A key factor in identifying similar songs.
* key: Musical key in which the song is composed.
* mode: Indicates whether the song is in a major or minor scale.

**Meta Information (Optional):**

* popularity: While not directly linked to audio characteristics, it can serve as a secondary ranking factor in the recommendations.
* year: Could help in filtering songs by era if needed.

# Setting up Spark

This section is optional. If Spark, Hadoop, etc. are not installed on your local machine, the cells below help set up Spark for running this Jupyter Notebook.

In [1]:
## OPTIONAL: Setting up Spark in Jupyter Notebook

# !apt-get install openjdk-8-jdk-headless -qq > /dev/null
# !wget https://dlcdn.apache.org/spark/spark-3.5.3/spark-3.5.3-bin-hadoop3.tgz
# !tar -xvf spark-3.5.3-bin-hadoop3.tgz
# !pip install findspark
# 
# import os
# os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
# os.environ["SPARK_HOME"] = "/content/spark-3.5.3-bin-hadoop3"
# 
# import findspark
# findspark.init()
# findspark.find()

In [2]:
## OPTIONAL: Test if PySpark is ready to go

# from pyspark.sql import SparkSession
# 
# spark = SparkSession.builder.appName("Test").getOrCreate()
# print(f"Spark version: {spark.version}")
# 
# spark.stop()

## expected result: 3.5.3 or similar

# Data Collection

**Required Tasks:**
* Load the dataset into a PySpark DataFrame.
* Verify the dataset schema and check if the data is loaded correctly.

**Output:** A PySpark DataFrame loaded and ready for processing, with the schema verified.


### How did we get the dataset?

We extracted the dataset from Spotify using the Spotify API and the `Spotipy` Python library.

The Python files used for extraction and transformation (from JSON to CSV) are attached. Please note that these files are for reference only, because running them requires setting up a virtual environment.
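
As a rough idea of what a JSON-to-CSV transformation step can look like, here is a minimal sketch using only the Python standard library. The record layout and field names are hypothetical; the attached scripts and the real API responses may be shaped differently.

```python
import csv
import io
import json

# Hypothetical API-style records; the real extraction output may be shaped differently
raw_json = """
[
  {"track_id": "id_001", "track_name": "Song A", "danceability": 0.48, "energy": 0.30},
  {"track_id": "id_002", "track_name": "Song B", "danceability": 0.57, "energy": 0.45}
]
"""

records = json.loads(raw_json)
fieldnames = ["track_id", "track_name", "danceability", "energy"]

# StringIO stands in for a real file opened with open(path, "w", newline="")
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
writer.writeheader()
writer.writerows(records)

csv_text = buffer.getvalue()
print(csv_text)
```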

### Loading datasets

In [3]:
import findspark
findspark.init()
findspark.find()

'C:\\Program Files\\spark-3.5.3'

In [4]:
# import libraries
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when, count
import pandas as pd
import matplotlib.pyplot as plt

# initialize a SparkSession
spark = SparkSession.builder \
    .appName("SpotifyRecommendationSystem") \
    .config("spark.executor.memory", "4g") \
    .config("spark.driver.memory", "4g") \
    .getOrCreate()

print(f"Spark version: {spark.version}")

Spark version: 3.5.3


In [5]:
# load dataset into a DataFrame
file_path = "./dataset/spotify_dataset.csv" # Note: The dataset file is too large (>250 MB) to commit to a GitHub repo.
spotify_df = spark.read.csv(file_path, header=True, inferSchema=True)

# display the first few rows
spotify_df.show(5)


+---+-------------+----------------+--------------------+----------+----+--------+------------+------+---+--------+----+-----------+------------+----------------+--------+-------+-------+-----------+--------------+
|_c0|  artist_name|      track_name|            track_id|popularity|year|   genre|danceability|energy|key|loudness|mode|speechiness|acousticness|instrumentalness|liveness|valence|  tempo|duration_ms|time_signature|
+---+-------------+----------------+--------------------+----------+----+--------+------------+------+---+--------+----+-----------+------------+----------------+--------+-------+-------+-----------+--------------+
|  0|   Jason Mraz| I Won't Give Up|53QF56cjZA9RTuuMZ...|        68|2012|acoustic|       0.483| 0.303|  4| -10.058|   1|     0.0429|       0.694|             0.0|   0.115|  0.139|133.406|   240166.0|           3.0|
|  1|   Jason Mraz|93 Million Miles|1s8tP3jP4GZcyHDsj...|        50|2012|acoustic|       0.572| 0.454|  3| -10.286|   1|     0.0258|       0

# Data Inspection and Validation

In this section, we will do the following tasks:

* Check the data schema with column names and data types.
* Convert data types if needed.
* Check for the summary statistics of the dataset.
* Check for missing values. Handle missing values properly.
* Check for outliers. Handle the outliers.
* Check for distinct values.

In [6]:
# print schema to verify column names and data types
spotify_df.printSchema()

root
 |-- _c0: integer (nullable = true)
 |-- artist_name: string (nullable = true)
 |-- track_name: string (nullable = true)
 |-- track_id: string (nullable = true)
 |-- popularity: string (nullable = true)
 |-- year: string (nullable = true)
 |-- genre: string (nullable = true)
 |-- danceability: string (nullable = true)
 |-- energy: string (nullable = true)
 |-- key: string (nullable = true)
 |-- loudness: string (nullable = true)
 |-- mode: string (nullable = true)
 |-- speechiness: string (nullable = true)
 |-- acousticness: string (nullable = true)
 |-- instrumentalness: string (nullable = true)
 |-- liveness: string (nullable = true)
 |-- valence: double (nullable = true)
 |-- tempo: double (nullable = true)
 |-- duration_ms: double (nullable = true)
 |-- time_signature: double (nullable = true)



**Data type conversions:**

[Spotify's API documentation](https://developer.spotify.com/documentation/web-api/reference/get-audio-features) for audio features specifies the meaning and data type of each feature, so we cast the columns in our dataset to match it.

In [7]:
# casting columns to their appropriate data types as per Spotify's documentation
spotify_df = spotify_df \
    .withColumn("popularity", col("popularity").cast("int")) \
    .withColumn("year", col("year").cast("int")) \
    .withColumn("danceability", col("danceability").cast("float")) \
    .withColumn("energy", col("energy").cast("float")) \
    .withColumn("key", col("key").cast("int")) \
    .withColumn("loudness", col("loudness").cast("float")) \
    .withColumn("mode", col("mode").cast("int")) \
    .withColumn("speechiness", col("speechiness").cast("float")) \
    .withColumn("acousticness", col("acousticness").cast("float")) \
    .withColumn("instrumentalness", col("instrumentalness").cast("float")) \
    .withColumn("liveness", col("liveness").cast("float")) \
    .withColumn("tempo", col("tempo").cast("float")) \
    .withColumn("time_signature", col("time_signature").cast("int"))

# re-check the updated schema
spotify_df.printSchema()


root
 |-- _c0: integer (nullable = true)
 |-- artist_name: string (nullable = true)
 |-- track_name: string (nullable = true)
 |-- track_id: string (nullable = true)
 |-- popularity: integer (nullable = true)
 |-- year: integer (nullable = true)
 |-- genre: string (nullable = true)
 |-- danceability: float (nullable = true)
 |-- energy: float (nullable = true)
 |-- key: integer (nullable = true)
 |-- loudness: float (nullable = true)
 |-- mode: integer (nullable = true)
 |-- speechiness: float (nullable = true)
 |-- acousticness: float (nullable = true)
 |-- instrumentalness: float (nullable = true)
 |-- liveness: float (nullable = true)
 |-- valence: double (nullable = true)
 |-- tempo: float (nullable = true)
 |-- duration_ms: double (nullable = true)
 |-- time_signature: integer (nullable = true)
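
One thing to keep in mind when casting: with Spark's default (non-ANSI) settings in this version, a string that cannot be parsed as a number becomes null instead of raising an error, so the cast itself can introduce missing values. The plain-Python sketch below only mimics that idea; `cast_or_none` is a hypothetical helper and the raw values are made up for illustration.

```python
def cast_or_none(value, target_type):
    """Mimic a lenient cast: return None when the value cannot be parsed."""
    if value is None:
        return None
    try:
        return target_type(value)
    except (TypeError, ValueError):
        return None

# Hypothetical raw strings, as they might appear in a CSV column read as string
raw_popularity = ["68", "50", "acoustic", None, "57"]

cast_popularity = [cast_or_none(v, int) for v in raw_popularity]
null_count = sum(v is None for v in cast_popularity)

print(cast_popularity)
print(f"nulls after cast: {null_count}")
```

This is why it is worth checking for missing values after the conversion step rather than before it.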



In [8]:
# count total rows in the dataset
total_rows = spotify_df.count()
print(f"Total rows in the dataset: {total_rows}")

Total rows in the dataset: 1159764


Remarks: The dataset contains 1,159,764 rows, which is quite large, indicating the need for big data tools like PySpark.

**Check for Summary Statistics:**

Next, we check the summary statistics of the dataset. In [Spark](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.describe.html), calling `.describe()` without arguments computes statistics for all numeric and string columns. Since statistics such as the mean and standard deviation are not meaningful for string columns, we first select the numeric columns based on their data types.

In [9]:
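# filter numeric columns using dtypes
# (decimal types are reported as e.g. 'decimal(10,2)', so they are matched by prefix)
numeric_columns = [name for name, dtype in spotify_df.dtypes if dtype in ('int', 'bigint', 'double', 'float') or dtype.startswith('decimal')]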

# select only numeric columns
numeric_df = spotify_df.select(*numeric_columns)

# show summary statistics for numerical columns only
numeric_df.describe().show()

+-------+-----------------+------------------+------------------+-----------------+------------------+------------------+------------------+-----------------+-------------------+-------------------+-------------------+-------------------+-------------------+------------------+------------------+------------------+
|summary|              _c0|        popularity|              year|     danceability|            energy|               key|          loudness|             mode|        speechiness|       acousticness|   instrumentalness|           liveness|            valence|             tempo|       duration_ms|    time_signature|
+-------+-----------------+------------------+------------------+-----------------+------------------+------------------+------------------+-----------------+-------------------+-------------------+-------------------+-------------------+-------------------+------------------+------------------+------------------+
|  count|          1159764|           1158091|      

**Remarks for summary statistics:**

* `popularity`: Ranges from 0 to 100, indicating a reasonable scale for popularity.
* `year`: The dataset includes tracks from 0 to 2023. The value 0 seems anomalous and might need further investigation.
* Other features like `danceability`, `energy`, `tempo`, and `duration_ms` have a wide range of values, which may need normalization or standardization for machine learning.
* Potential outliers: Columns like `tempo` (min = -24.073) and `loudness` (min = -58.1) have unusual values that might indicate outliers or data entry issues.

In [10]:
# check for missing values in each column
missing_values = spotify_df.select([
    count(when(col(c).isNull(), c)).alias(c) for c in spotify_df.columns
])
missing_values.show()

+---+-----------+----------+--------+----------+----+-----+------------+------+---+--------+----+-----------+------------+----------------+--------+-------+-----+-----------+--------------+
|_c0|artist_name|track_name|track_id|popularity|year|genre|danceability|energy|key|loudness|mode|speechiness|acousticness|instrumentalness|liveness|valence|tempo|duration_ms|time_signature|
+---+-----------+----------+--------+----------+----+-----+------------+------+---+--------+----+-----------+------------+----------------+--------+-------+-----+-----------+--------------+
|  0|          0|         0|       0|      1673| 623|    0|        1184|   408|162|      94|  33|         16|           8|               3|       2|      0|    0|          0|             0|
+---+-----------+----------+--------+----------+----+-----+------------+------+---+--------+----+-----------+------------+----------------+--------+-------+-----+-----------+--------------+



**Remarks for missing values:**

* Columns with missing data: popularity (1,673 missing values), year (623), and others such as danceability, energy, key, loudness, and mode.
* Given the dataset has over 1.15M rows, the proportion of missing data is very small (less than 0.15% for every affected column; for example, 1,673 / 1,159,764 ≈ 0.14% for `popularity`).
* In this case, we will drop rows with missing values in features that are critical for our recommendation system.
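
The proportion of missing data per column can be computed directly from the counts in the table above. This is a small plain-Python sketch that copies those counts by hand; it does not query the DataFrame.

```python
total_rows = 1_159_764

# Null counts copied from the missing-value table above
missing_counts = {
    "popularity": 1673,
    "year": 623,
    "danceability": 1184,
    "energy": 408,
    "key": 162,
    "loudness": 94,
    "mode": 33,
    "speechiness": 16,
    "acousticness": 8,
    "instrumentalness": 3,
    "liveness": 2,
}

# Percentage of rows with a null in each column
missing_pct = {name: 100 * n / total_rows for name, n in missing_counts.items()}

for name, pct in missing_pct.items():
    print(f"{name}: {pct:.4f}%")

worst_column = max(missing_pct, key=missing_pct.get)
print(f"column with the most missing values: {worst_column}")
```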

In [11]:
# drop rows with missing data in critical columns
columns_to_check = ['popularity', 'year', 'danceability', 'energy', 'key', 'loudness', 'mode', 'speechiness', 'acousticness', 'instrumentalness', 'liveness']

spotify_df_clean = spotify_df.na.drop(subset=columns_to_check)

# verify the number of rows after dropping
cleaned_rows = spotify_df_clean.count()
print(f"Rows after dropping missing data: {cleaned_rows}")

dropped_rows = total_rows - cleaned_rows
print(f"Number of rows dropped: {dropped_rows}")


Rows after dropping missing data: 1158091
Number of rows dropped: 1673


**Anomalies and Outliers**:

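We will investigate the following things:

* Check the distinct values of the `year` column to make sure there are no abnormal values (such as 1500).
* Check the numerical columns against the value ranges given in the Spotify documentation.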



In [12]:
# get distinct years and their frequencies
year_distribution = spotify_df_clean.groupBy("year").count().orderBy("year")

year_distribution.show(50)

+----+-----+
|year|count|
+----+-----+
|2000|43944|
|2001|42316|
|2002|42084|
|2003|42250|
|2004|43293|
|2005|43708|
|2006|45419|
|2007|45920|
|2008|47336|
|2009|46810|
|2010|46818|
|2011|46381|
|2012|54725|
|2013|53105|
|2014|53120|
|2015|51569|
|2016|40246|
|2017|56171|
|2018|56541|
|2019|55739|
|2020|55035|
|2021|53529|
|2022|53637|
|2023|38395|
+----+-----+



Remarks for `year`: The years range from 2000 to 2023, which are expected and normal.

**Check for anomalies in numerical columns:**

Compare data against the value ranges specified in the [Spotify documentation](https://developer.spotify.com/documentation/web-api/reference/get-audio-features):
* acousticness: [0, 1]
* danceability: [0, 1]
* energy: [0, 1] 
* instrumentalness: [0, 1]
* key: [-1, 11]
* liveness: [0, 1]
* mode: {0, 1}
* speechiness: [0, 1]
* tempo: [1, 350] (note: this is a generous working range used for the check below; Spotify docs don't strictly enforce a range for tempo.)
* time_signature: [3, 7]
* valence: [0, 1]

In [13]:
# define the range checks for each feature
range_checks = {
    "acousticness": (0, 1),
    "danceability": (0, 1),
    "energy": (0, 1),
    "instrumentalness": (0, 1),
    "key": (-1, 11),
    "liveness": (0, 1),
    "mode": (0, 1),
    "speechiness": (0, 1),
    "tempo": (1, 350),
    "time_signature": (3, 7),
    "valence": (0, 1),
}

# identify anomalies in each feature
anomalies = {}
for feature, (min_val, max_val) in range_checks.items():
    anomalies[feature] = spotify_df_clean.filter((col(feature) < min_val) | (col(feature) > max_val)).count()

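# print the anomalies count for each feature
# (the loop variable is not named `count`, which would shadow pyspark.sql.functions.count)
for feature, n_anomalies in anomalies.items():
    print(f"Number of anomalies in {feature}: {n_anomalies}")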


Number of anomalies in acousticness: 0
Number of anomalies in danceability: 0
Number of anomalies in energy: 0
Number of anomalies in instrumentalness: 0
Number of anomalies in key: 0
Number of anomalies in liveness: 0
Number of anomalies in mode: 0
Number of anomalies in speechiness: 0
Number of anomalies in tempo: 1198
Number of anomalies in time_signature: 13816
Number of anomalies in valence: 0


From the results:

* No Anomalies:
    * Most features (acousticness, danceability, energy, instrumentalness, etc.) have no anomalies. These are clean and can be used as-is.
* tempo Anomalies:
    * 1,198 anomalies fall outside the range [1, 350] used in the check. These might include invalid or outlier values (e.g., extremely high or low BPM).
* time_signature Anomalies:
    * 13,816 anomalies fall outside the range [3, 7]. These could represent tracks with unusual time signatures, potentially errors or special cases.

In [15]:
# Show rows with tempo anomalies
tempo_anomalies = spotify_df_clean.filter((col("tempo") < 1) | (col("tempo") > 350))

# Descriptive statistics for tempo anomalies
tempo_anomalies.describe("tempo").show()


+-------+-----+
|summary|tempo|
+-------+-----+
|  count| 1198|
|   mean|  0.0|
| stddev|  0.0|
|    min|  0.0|
|    max|  0.0|
+-------+-----+



**Remarks for Tempo:**

* All 1,198 tempo anomalies have a value of exactly 0.0 (the min, max, and mean are all 0.0), so 0.0 effectively acts as a placeholder for a missing tempo. Since these rows make up only about 0.1% of the dataset, we can remove them.
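
The removal itself is a simple range filter. The sketch below shows the idea in plain Python on a few hypothetical records; in PySpark the same idea would be a `filter` on the `tempo` column of the cleaned DataFrame.

```python
# Hypothetical records; a tempo of 0.0 is treated as "missing"
tracks = [
    {"track_id": "id_001", "tempo": 133.406},
    {"track_id": "id_002", "tempo": 0.0},
    {"track_id": "id_003", "tempo": 140.182},
]

TEMPO_MIN, TEMPO_MAX = 1, 350  # same bounds as the range check above

# Keep only tracks whose tempo falls inside the accepted range
valid_tracks = [t for t in tracks if TEMPO_MIN <= t["tempo"] <= TEMPO_MAX]
removed = len(tracks) - len(valid_tracks)

print(f"kept {len(valid_tracks)} tracks, removed {removed}")
```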

In [25]:
# Show rows with time_signature anomalies
time_signature_anomalies = spotify_df_clean.filter((col("time_signature") < 3) | (col("time_signature") > 7))

# Descriptive statistics for time_signature anomalies
time_signature_anomalies.describe("time_signature").show()


+-------+-------------------+
|summary|     time_signature|
+-------+-------------------+
|  count|              13816|
|   mean| 0.9111899247249565|
| stddev|0.28447970650303633|
|    min|                  0|
|    max|                  1|
+-------+-------------------+



**Distinct values**

Let's check the distinct values in the most important columns: `artist_name`, `track_name`, `genre`.

In [16]:
# Count distinct values in key columns
key_columns = ['artist_name', 'track_name', 'genre']
distinct_counts = {col_name: spotify_df.select(col_name).distinct().count() for col_name in key_columns}

# the loop variable is not named `count`, which would shadow pyspark.sql.functions.count
for col_name, n_distinct in distinct_counts.items():
    print(f"Distinct values in {col_name}: {n_distinct}")

Distinct values in artist_name: 64159
Distinct values in track_name: 882285
Distinct values in genre: 392


# Data Filtering

In this section, we will identify and retain only the relevant features for building the recommendation system. We will consider dropping irrelevant or redundant columns like `_c0` that may not contribute to the model.

* Given that the `time_signature` feature has more than 13K anomalous values and is of little importance for finding similarities between songs, we will drop this feature.
* The `_c0` feature is only the row index of the dataset and will not contribute to the model, so we will drop it.

In [17]:
# retain only relevant features for the recommendation system
selected_columns = [
    'artist_name', 'track_name', 'track_id', 'popularity', 'year', 
    'genre', 'danceability', 'energy', 'key', 'loudness', 'mode', 
    'speechiness', 'acousticness', 'instrumentalness', 'liveness', 
    'valence', 'tempo', 'duration_ms'
]

# create a filtered DataFrame
spotify_df_filtered = spotify_df_clean.select(*selected_columns)

# show the schema of the filtered DataFrame
spotify_df_filtered.printSchema()

# verify the filtering process
spotify_df_filtered.show(5)


root
 |-- artist_name: string (nullable = true)
 |-- track_name: string (nullable = true)
 |-- track_id: string (nullable = true)
 |-- popularity: integer (nullable = true)
 |-- year: integer (nullable = true)
 |-- genre: string (nullable = true)
 |-- danceability: float (nullable = true)
 |-- energy: float (nullable = true)
 |-- key: integer (nullable = true)
 |-- loudness: float (nullable = true)
 |-- mode: integer (nullable = true)
 |-- speechiness: float (nullable = true)
 |-- acousticness: float (nullable = true)
 |-- instrumentalness: float (nullable = true)
 |-- liveness: float (nullable = true)
 |-- valence: double (nullable = true)
 |-- tempo: float (nullable = true)
 |-- duration_ms: double (nullable = true)

+-------------+----------------+--------------------+----------+----+--------+------------+------+---+--------+----+-----------+------------+----------------+--------+-------+-------+-----------+
|  artist_name|      track_name|            track_id|popularity|year|   gen

# Data Transformations

Here are the goals of this section:

1. Handle categorical features: Features like `genre` and `artist_name` are categorical and need to be encoded to numeric values for modeling.
2. Normalize or scale numeric features: Features like `danceability`, `energy`, `tempo`, etc., may have varying ranges, which can affect model performance. Normalize or standardize these values.
3. Generate additional features (if needed): Create new features based on existing ones. For example, you might engineer a feature for "popularity bucket" if grouping songs by popularity levels makes sense for your recommendation algorithm.

## Encoding categorical features

We will encode `genre` and `artist_name` to numeric values using PySpark's `StringIndexer`.

In [26]:
## First, we want to see the unique values in the 'genre' column 

# group by 'genre', count occurrences, and order by count in descending order
genre_counts = spotify_df.groupBy("genre").count().orderBy(col("count").desc())

genre_counts.show(10)

+-----------+-----+
|      genre|count|
+-----------+-----+
|black-metal|21852|
|     gospel|21621|
|    ambient|21385|
|   acoustic|21095|
|   alt-rock|20917|
|        emo|20840|
|     indian|20580|
|      k-pop|19994|
|    new-age|19908|
|      blues|19682|
+-----------+-----+
only showing top 10 rows



**Remarks for genre's unique values:** The distinct values in the genre column are single-valued entries, including hyphenated names such as "black-metal", "alt-rock", "new-age", etc. These hyphenated values are single genres and do not represent multi-valued entries. Therefore, we can safely encode this column with `StringIndexer` without first splitting or otherwise preprocessing it. This simplifies the encoding process and ensures that each genre maps to a single, unique numeric value.
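
By default, `StringIndexer` orders labels by descending frequency, so the most frequent label receives index 0.0. The plain-Python sketch below imitates that ordering for four of the genres listed above (counts copied from the table). It is an illustration of the idea, not Spark's implementation.

```python
# Genre counts copied from the table above (four genres only, for brevity)
genre_counts = {
    "acoustic": 21095,
    "black-metal": 21852,
    "ambient": 21385,
    "gospel": 21621,
}

# Sort by descending count, breaking ties alphabetically
ordered = sorted(genre_counts, key=lambda g: (-genre_counts[g], g))

# Assign consecutive indices as doubles, like the genre_index column
genre_index = {genre: float(i) for i, genre in enumerate(ordered)}
print(genre_index)
```

Note that these indices are only identifiers: the numeric gap between two genre indices reflects frequency rank, not musical similarity.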

In [22]:
from pyspark.ml.feature import StringIndexer

# encode the 'genre' column
genre_indexer = StringIndexer(inputCol="genre", outputCol="genre_index")
spotify_df_encoded = genre_indexer.fit(spotify_df_filtered).transform(spotify_df_filtered)

# verify the encoding of a few rows
spotify_df_encoded.select("genre", "genre_index").show(10)


+--------+-----------+
|   genre|genre_index|
+--------+-----------+
|acoustic|        3.0|
|acoustic|        3.0|
|acoustic|        3.0|
|acoustic|        3.0|
|acoustic|        3.0|
|acoustic|        3.0|
|acoustic|        3.0|
|acoustic|        3.0|
|acoustic|        3.0|
|acoustic|        3.0|
+--------+-----------+
only showing top 10 rows



In [33]:
## View the distinct values of genre and their corresponding genre_index

distinct_genre_mapping = spotify_df_encoded.select("genre", "genre_index").distinct()

distinct_genre_mapping = distinct_genre_mapping.orderBy("genre_index")

distinct_genre_mapping.show(10, truncate=False)

+-----------+-----------+
|genre      |genre_index|
+-----------+-----------+
|black-metal|0.0        |
|gospel     |1.0        |
|ambient    |2.0        |
|acoustic   |3.0        |
|alt-rock   |4.0        |
|emo        |5.0        |
|indian     |6.0        |
|k-pop      |7.0        |
|new-age    |8.0        |
|blues      |9.0        |
+-----------+-----------+
only showing top 10 rows



In [31]:
## Check the unique values in the 'artist_name' column 

# group by 'artist_name', count occurrences, and order by count in descending order
artist_name_counts = spotify_df.groupBy("artist_name").count().orderBy(col("count").desc())

artist_name_counts.show(10)

+--------------------+-----+
|         artist_name|count|
+--------------------+-----+
|         Traditional| 4058|
|       Grateful Dead| 2320|
|Johann Sebastian ...| 2125|
|   Giacomo Meyerbeer| 1345|
|       Elvis Presley| 1242|
|Wolfgang Amadeus ...| 1084|
|    Armin van Buuren| 1061|
|     Astor Piazzolla|  932|
|         Hans Zimmer|  863|
|       Andrei Krylov|  841|
+--------------------+-----+
only showing top 10 rows



In [34]:
# encode the 'artist_name' column
artist_indexer = StringIndexer(inputCol="artist_name", outputCol="artist_index")

# drop any existing 'artist_index' first so the cell can be re-run safely
# (DataFrame.drop is a no-op when the column does not exist)
spotify_df_encoded = spotify_df_encoded.drop("artist_index")
spotify_df_encoded = artist_indexer.fit(spotify_df_encoded).transform(spotify_df_encoded)

In [35]:
# View the distinct values of artist_name and their corresponding artist_index
distinct_artist_mapping = spotify_df_encoded.select("artist_name", "artist_index").distinct()

distinct_artist_mapping = distinct_artist_mapping.orderBy("artist_index")

distinct_artist_mapping.show(10, truncate=False)

+-----------------------+------------+
|artist_name            |artist_index|
+-----------------------+------------+
|Traditional            |0.0         |
|Grateful Dead          |1.0         |
|Johann Sebastian Bach  |2.0         |
|Giacomo Meyerbeer      |3.0         |
|Elvis Presley          |4.0         |
|Armin van Buuren       |5.0         |
|Wolfgang Amadeus Mozart|6.0         |
|Astor Piazzolla        |7.0         |
|Hans Zimmer            |8.0         |
|Andrei Krylov          |9.0         |
+-----------------------+------------+
only showing top 10 rows



In [36]:
## OPTIONAL: 
# drop the original categorical columns to avoid redundancy (optional)
spotify_df_transformed = spotify_df_encoded.drop("genre", "artist_name")

# Verify encoding
spotify_df_transformed.show(5)

+----------------+--------------------+----------+----+------------+------+---+--------+----+-----------+------------+----------------+--------+-------+-------+-----------+-----------+------------+
|      track_name|            track_id|popularity|year|danceability|energy|key|loudness|mode|speechiness|acousticness|instrumentalness|liveness|valence|  tempo|duration_ms|genre_index|artist_index|
+----------------+--------------------+----------+----+------------+------+---+--------+----+-----------+------------+----------------+--------+-------+-------+-----------+-----------+------------+
| I Won't Give Up|53QF56cjZA9RTuuMZ...|        68|2012|       0.483| 0.303|  4| -10.058|   1|     0.0429|       0.694|             0.0|   0.115|  0.139|133.406|   240166.0|        3.0|       461.0|
|93 Million Miles|1s8tP3jP4GZcyHDsj...|        50|2012|       0.572| 0.454|  3| -10.286|   1|     0.0258|       0.477|         1.37E-5|  0.0974|  0.515|140.182|   216387.0|        3.0|       461.0|
|Do Not Le

In [38]:
spotify_df_encoded.show(5)

+-------------+----------------+--------------------+----------+----+--------+------------+------+---+--------+----+-----------+------------+----------------+--------+-------+-------+-----------+-----------+------------+
|  artist_name|      track_name|            track_id|popularity|year|   genre|danceability|energy|key|loudness|mode|speechiness|acousticness|instrumentalness|liveness|valence|  tempo|duration_ms|genre_index|artist_index|
+-------------+----------------+--------------------+----------+----+--------+------------+------+---+--------+----+-----------+------------+----------------+--------+-------+-------+-----------+-----------+------------+
|   Jason Mraz| I Won't Give Up|53QF56cjZA9RTuuMZ...|        68|2012|acoustic|       0.483| 0.303|  4| -10.058|   1|     0.0429|       0.694|             0.0|   0.115|  0.139|133.406|   240166.0|        3.0|       461.0|
|   Jason Mraz|93 Million Miles|1s8tP3jP4GZcyHDsj...|        50|2012|acoustic|       0.572| 0.454|  3| -10.286|   1|

## Normalize Numerical Features

We will normalize numerical features like `danceability`, `energy`, `tempo`, etc., to have values between 0 and 1 using `MinMaxScaler`.

There are a few reasons why we should do scaling for these numerical values:

* Scaling helps to ensure consistency across features, which means that all features contribute equally during model training, especially for distance-based algorithms.
* Scaling helps to standardize different scales across features. For example, features like `loudness` (negative values) and `tempo` (positive, larger range) have much different scales.
* Scaling also improves model performance because many ML algorithms, such as neural networks and matrix factorization (used in recommendation systems), will converge faster and perform better with normalized input data.
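
For reference, min-max scaling maps each value `x` of a feature to `(x - min) / (max - min)`. The sketch below applies that formula in plain Python to illustrative values on two very different scales. The helper `min_max_scale` is hypothetical, and returning 0.5 for a constant column is a convention chosen for this sketch.

```python
def min_max_scale(values):
    """Rescale a list of numbers to [0, 1] using (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.5 for _ in values]  # convention for a constant column
    return [(x - lo) / (hi - lo) for x in values]

# Illustrative values on very different scales
loudness = [-10.058, -10.286, -58.1, -3.2]
tempo = [133.406, 140.182, 60.0, 200.0]

scaled_loudness = min_max_scale(loudness)
scaled_tempo = min_max_scale(tempo)

print([round(v, 3) for v in scaled_loudness])
print([round(v, 3) for v in scaled_tempo])
```

After scaling, both features lie in [0, 1], so neither dominates a distance calculation simply because of its original units.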

In [40]:
from pyspark.ml.feature import MinMaxScaler, VectorAssembler

# assemble all numerical features into a single vector
numeric_features = [
    'danceability', 'energy', 'key', 'loudness', 'mode', 
    'speechiness', 'acousticness', 'instrumentalness', 'liveness', 
    'valence', 'tempo', 'duration_ms', 'popularity'
]

assembler = VectorAssembler(inputCols=numeric_features, outputCol="features_vector")
spotify_df_vectorized = assembler.transform(spotify_df_encoded)

# normalize the feature vector
scaler = MinMaxScaler(inputCol="features_vector", outputCol="scaled_features")
scaler_model = scaler.fit(spotify_df_vectorized)
spotify_df_normalized = scaler_model.transform(spotify_df_vectorized)

# keep only the scaled features and other relevant columns
final_columns = ['track_id', 'year', 'scaled_features', 'genre_index', 'artist_index']
spotify_df_final = spotify_df_normalized.select(*final_columns)

# Verify the final transformed DataFrame
spotify_df_final.show(5)


+--------------------+----+--------------------+-----------+------------+
|            track_id|year|     scaled_features|genre_index|artist_index|
+--------------------+----+--------------------+-----------+------------+
|53QF56cjZA9RTuuMZ...|2012|[0.48640485840895...|        3.0|       461.0|
|1s8tP3jP4GZcyHDsj...|2012|[0.57603226934337...|        3.0|       461.0|
|7BRCa8MPiyuvr2VU3...|2012|[0.41188320370473...|        3.0|      4401.0|
|63wsZUhUZLlh1Osyr...|2012|[0.39476334464300...|        3.0|        98.0|
|6nXIYClvJAfi6ujLi...|2012|[0.43303123841708...|        3.0|      3862.0|
+--------------------+----+--------------------+-----------+------------+
only showing top 5 rows



## Verify data preparation

Before proceeding with EDA, we verify that the prepared data is ready for the next steps. We’ll conduct the following checks:

1. check for missing values.
2. verify data types.
3. check for outliers.
4. verify data balancing. This means that, if needed, we will verify that categorical features like `genre_index` or `year` are reasonably balanced, so that the recommendation system is not biased toward over-represented categories.
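
The outlier check (item 3) could be approached with Tukey's interquartile-range (IQR) fences. The following is only a sketch in plain Python, under the assumption that the check is run on a raw column such as `duration_ms` rather than on the already bounded `scaled_features`. The helper `iqr_outliers` and the sample durations are hypothetical. On the full Spark DataFrame, the quartiles could instead be estimated with `DataFrame.approxQuantile`.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values that fall outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical track durations in milliseconds (not from the dataset)
durations_ms = [
    180_000, 200_000, 210_000, 220_000,
    230_000, 240_000, 250_000, 3_600_000,
]

outliers = iqr_outliers(durations_ms)
print(outliers)
```

Whether flagged tracks should be dropped, capped, or kept is a separate decision; very long tracks may be legitimate (for example, live recordings or mixes).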

In [42]:
from pyspark.sql.functions import col, when, count

# check for missing or null values in each column
missing_values_check = spotify_df_final.select([
    count(when(col(c).isNull(), c)).alias(c) for c in spotify_df_final.columns
])

# show results
missing_values_check.show()


+--------+----+---------------+-----------+------------+
|track_id|year|scaled_features|genre_index|artist_index|
+--------+----+---------------+-----------+------------+
|       0|   0|              0|          0|           0|
+--------+----+---------------+-----------+------------+



In [43]:
# print schema to verify data types
spotify_df_final.printSchema()


root
 |-- track_id: string (nullable = true)
 |-- year: integer (nullable = true)
 |-- scaled_features: vector (nullable = true)
 |-- genre_index: double (nullable = false)
 |-- artist_index: double (nullable = false)



In [50]:
from pyspark.sql.functions import count, countDistinct, min, max

# check distinct values of genre_index and artist_index
spotify_df_final.select("genre_index").distinct().show()
spotify_df_final.select("artist_index").distinct().show()


+-----------+
|genre_index|
+-----------+
|       70.0|
|        8.0|
|       67.0|
|        0.0|
|       69.0|
|        7.0|
|       49.0|
|       29.0|
|       75.0|
|       64.0|
|       47.0|
|       42.0|
|       44.0|
|       35.0|
|       62.0|
|       18.0|
|       80.0|
|        1.0|
|       39.0|
|       37.0|
+-----------+
only showing top 20 rows

+------------+
|artist_index|
+------------+
|       496.0|
|       596.0|
|      4800.0|
|      8779.0|
|       769.0|
|      7554.0|
|      5858.0|
|     34033.0|
|     10681.0|
|      9923.0|
|     10930.0|
|     20467.0|
|      5776.0|
|     20974.0|
|     13956.0|
|      6433.0|
|     20893.0|
|      5983.0|
|     14452.0|
|     20864.0|
+------------+
only showing top 20 rows



In [48]:
# count the number of records per genre
spotify_df_final.groupBy("genre_index").count().orderBy("count", ascending=False).show()

# count the number of records per year
spotify_df_final.groupBy("year").count().orderBy("year").show()


+-----------+-----+
|genre_index|count|
+-----------+-----+
|        0.0|21852|
|        1.0|21621|
|        2.0|21385|
|        3.0|21095|
|        4.0|20917|
|        5.0|20840|
|        6.0|20580|
|        7.0|19994|
|        8.0|19908|
|        9.0|19682|
|       10.0|19379|
|       11.0|19327|
|       12.0|19153|
|       13.0|18905|
|       14.0|18788|
|       15.0|18784|
|       16.0|18592|
|       17.0|18474|
|       18.0|18037|
|       19.0|17958|
+-----------+-----+
only showing top 20 rows

+----+-----+
|year|count|
+----+-----+
|2000|43944|
|2001|42316|
|2002|42084|
|2003|42250|
|2004|43293|
|2005|43708|
|2006|45419|
|2007|45920|
|2008|47336|
|2009|46810|
|2010|46818|
|2011|46381|
|2012|54725|
|2013|53105|
|2014|53120|
|2015|51569|
|2016|40246|
|2017|56171|
|2018|56541|
|2019|55739|
+----+-----+
only showing top 20 rows



**Remarks on data balancing:**

* The distribution of `genre_index` is fairly balanced among the 20 largest genres shown (roughly 18,000 to 22,000 tracks each). No single genre dominates excessively, so the dataset should not introduce significant genre bias when training the model.
* The `year` distribution is also reasonably balanced across the years shown (2000 to 2019), with some variation in track counts. For example, 2012 to 2019 mostly have higher counts than earlier years, with a dip in 2016. This could reflect more recent data collection or increased music production in those years.

In conclusion, the distribution of `genre_index` and `year` looks good, and we can proceed without additional rebalancing.
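
As a rough quantitative check on these remarks, one option is the ratio between the largest and smallest group counts. The sketch below uses the per-year counts copied from the output above (only the 20 rows that were displayed); the `imbalance_ratio` helper is a hypothetical name, and no threshold for "balanced" is implied by the source.

```python
# Track counts per year, copied from the displayed output (2000-2019 only)
year_counts = {
    2000: 43944, 2001: 42316, 2002: 42084, 2003: 42250, 2004: 43293,
    2005: 43708, 2006: 45419, 2007: 45920, 2008: 47336, 2009: 46810,
    2010: 46818, 2011: 46381, 2012: 54725, 2013: 53105, 2014: 53120,
    2015: 51569, 2016: 40246, 2017: 56171, 2018: 56541, 2019: 55739,
}


def imbalance_ratio(counts):
    """Largest group count divided by the smallest; 1.0 is perfectly balanced."""
    return max(counts.values()) / min(counts.values())


ratio = imbalance_ratio(year_counts)
busiest = max(year_counts, key=year_counts.get)
quietest = min(year_counts, key=year_counts.get)

print(f"largest year: {busiest}, smallest year: {quietest}, ratio: {ratio:.2f}")
```

A ratio close to 1 supports the conclusion that no rebalancing is needed; the same helper could be applied to the full `genre_index` counts once they are collected.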

# Exploratory Data Analysis (EDA) - IN PROGRESS

In this section, we explore the dataset to find patterns and correlations. We will use visualizations to show trends and to validate the preprocessing done in earlier steps.

Here are some tasks that we will perform:

* Check the distribution of numerical features.
* Correlation analysis.
* Analyze popularity trends over the years and by genre.
* Check the distribution by genre.
* Plot the relationship between energy and danceability.
* Plot the relationship between valence and tempo.
* etc.

## Distribution of numerical features

In this section, we will check the distribution of important numerical features (e.g., popularity, danceability, tempo, valence) to understand data spread and skewness.

More info about each audio feature can be found in [Spotify's API documentation](https://developer.spotify.com/documentation/web-api/reference/get-audio-features).

Reasons to choose these numerical features:

* Popularity: This is a key metric for user preference and relevance. Tracks with higher popularity are more likely to be recommended.
* Danceability: This feature reflects how suitable a track is for dancing, an important factor in user satisfaction for genres like pop or electronic.
* Energy: This feature shows the intensity and activity level of a song, which can help classify songs by mood or genre (e.g., calm vs. energetic).
* Tempo: This is the speed of a song, a key factor for many users when selecting playlists (e.g., playlists for workout, study, or relaxation).
* Valence: This indicates the emotional tone of the song (happy vs. sad), which is important for personalized recommendations based on mood.
* Duration (`duration_ms`): This feature is relevant for filtering or grouping songs by length (e.g., short tracks for a quick playlist or long tracks for background music).

In [None]:
import matplotlib.pyplot as plt

# Select numerical features to visualize
numerical_features = ["popularity", "danceability", "energy", "tempo", "valence", "duration_ms"]

# Convert a 1% sample of the PySpark DataFrame to Pandas for plotting
# (a fixed seed keeps the sample, and therefore the plots, reproducible)
numerical_data = spotify_df.select(numerical_features).sample(fraction=0.01, seed=42).toPandas()

# Plot histograms for each feature
for feature in numerical_features:
    plt.figure(figsize=(6, 4))
    plt.hist(numerical_data[feature].dropna(), bins=30, alpha=0.7)
    plt.title(f"Distribution of {feature}")
    plt.xlabel(feature)
    plt.ylabel("Frequency")
    plt.show()
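
Skewness can also be quantified rather than only read off the histograms. The sketch below computes population skewness (third central moment divided by the cubed standard deviation) in plain Python on made-up values; the `skewness` helper and both sample lists are hypothetical. On the Spark DataFrame itself, `pyspark.sql.functions.skewness` offers a built-in aggregate for the same purpose.

```python
def skewness(values):
    """Population skewness: third central moment over the cubed std deviation."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # variance
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5


# Hypothetical samples (not from the dataset)
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 1, 2, 10]  # long tail toward large values

print(skewness(symmetric))
print(skewness(right_skewed))
```

A value near zero suggests a roughly symmetric distribution, while a clearly positive value indicates a long right tail, which is a common reason to consider a log transform before scaling.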

## Correlation analysis - IN PROGRESS

In [None]:
from pyspark.ml.stat import Correlation
from pyspark.ml.feature import VectorAssembler
import pandas as pd

# Assemble features into a vector for correlation computation
vector_col = "features_vector"
assembler = VectorAssembler(inputCols=numerical_features, outputCol=vector_col)
spotify_df_vector = assembler.transform(spotify_df.select(numerical_features))

# Compute Pearson correlation matrix
correlation_matrix = Correlation.corr(spotify_df_vector, vector_col, "pearson").head()[0].toArray()

# Convert to Pandas DataFrame for easier visualization
correlation_df = pd.DataFrame(correlation_matrix, columns=numerical_features, index=numerical_features)
print("Correlation Matrix:")
print(correlation_df)

# Visualize the correlation matrix
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(10, 8))
sns.heatmap(correlation_df, annot=True, cmap="coolwarm", fmt=".2f")
plt.title("Feature Correlation Heatmap")
plt.show()
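
For reference, each cell of the heatmap is a Pearson coefficient between two features. The following plain-Python sketch shows what that coefficient measures on toy data; the `pearson` helper and the sample lists are hypothetical and only illustrate the formula, not the Spark implementation.

```python
import math


def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical feature values (not from the dataset)
feature_a = [0.2, 0.4, 0.6, 0.8]
rises_with_a = [0.1, 0.3, 0.5, 0.7]  # increases linearly with feature_a
falls_with_a = [0.9, 0.7, 0.5, 0.3]  # decreases linearly with feature_a

print(pearson(feature_a, rises_with_a))
print(pearson(feature_a, falls_with_a))
```

Coefficients near +1 or -1 between two audio features indicate that they carry largely redundant information, which is worth knowing before they are combined in a similarity measure.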


In [None]:
# In progress
# min-max scaling for numerical features

from pyspark.ml.feature import MinMaxScaler, VectorAssembler

# Assemble features
assembler = VectorAssembler(inputCols=['danceability', 'energy'], outputCol='features_raw')
df = assembler.transform(spotify_df)

# Apply Min-Max Scaling
scaler = MinMaxScaler(inputCol='features_raw', outputCol='features_scaled')
scaler_model = scaler.fit(df)
df_scaled = scaler_model.transform(df)
