## Spotify Music Recommendation Based on the Sentiment of Social Media Post

## Table of contents

1. Introduction. <br>
    - 1.1. Project goals. <br>
    - 1.2. Why logistic regression? <br>
2. Data <br>
    - 2.1. Data acquisition and description. <br>
        - 2.1.1. Social media post data. <br>
        - 2.1.2. Music data. <br>
        - 2.1.3. User's playlist data. <br>
    - 2.2. Data analysis. <br>
        - 2.2.1. Social media post data. <br>
        - 2.2.2. Music data. <br>
        - 2.2.3. User's playlist data. <br>
3. Methodology. <br>
    - 3.1. Extract Transform Load (ETL) pipeline. <br>
        - 3.1.1. Data transformation in the ETL pipeline. <br>
        - 3.1.2. The python package for ETL pipeline. <br>
            - Features as of version 1.1.0. <br>
            - Possible features for future. <br>
    - 3.2. Data pre-processing. <br>
        - 3.2.1. The preprocessing steps <br>
        - 3.2.2. Feature extraction with TfidfVectorizer <br>
    - 3.3. Logistic regression model. <br>
        - 3.3.1. Mathematical formulation. <br>
    - 3.4. Understanding the user's preferences in music. <br>
        - 3.4.1. Algorithmic workflow. <br>
4. Results. <br>
5. Conclusion. <br>

## 1. Introduction
This project recommends music based on the sentiment of a social media post. Within the scope of this project, I trained a *Logistic Regression* model on Twitter posts and evaluated how well it detects sentiment from social media posts. The model reached 79% test accuracy. I then used the trained model to detect a user's sentiment and to recommend music based on that sentiment and the user's personal playlist.

### 1.1. Project goals
In the ever-evolving landscape of personalized music experiences, the project seeks to redefine the art of music curation by integrating the dynamic realm of social media sentiment analysis. The primary objective is to offer users a tailored music recommendation system that not only aligns with their individual tastes but also resonates with the emotional context conveyed through their social media posts.

#### Key Components:

1. Sentiment Analysis Model:

    Train a logistic regression model to discern sentiment from social media posts. By understanding the emotional nuances expressed in user-generated content, I aim to capture the mood and preferences that influence music choices.

2. User Playlist Integration:

    Leverage the sentiment scores obtained from the logistic regression model alongside the user's existing playlist data. By incorporating individual playlist preferences, the system strives to provide a holistic understanding of a user's musical inclinations.

3. Recommendation Engine:

    Develop a sophisticated recommendation engine that synthesizes sentiment analysis results and playlist data. The engine will dynamically adapt to users' changing emotions and preferences, ensuring that music suggestions are not only personalized but also responsive to the evolving sentiments expressed in their social media interactions.

In essence, the project work envisions a music recommendation system that transcends traditional genre-based approaches. By harnessing the power of sentiment analysis and user playlists, I aspire to create a deeply personalized and emotionally intelligent music streaming experience, revolutionizing the way users discover and connect with their favorite tunes.

### 1.2. Why logistic regression?

Explainability, or interpretability, of machine learning and deep learning models is essential for building trust, ensuring ethical use, complying with regulations, improving model performance, and fostering collaboration between humans and AI systems. As machine learning applications continue to impact various aspects of society, the need for transparency and interpretability becomes increasingly critical. Over the last few years, *Explainable AI* has become one of the most pressing topics for researchers[2]. Several studies discuss why explaining *black-box* models is important [2][3][4].

<figure>
    <img src="images/map-of-explainability-approaches.jpeg" style="width:80%">
    <figcaption align="center"> 
        Figure 1: Map of explainability approaches.
    </figcaption>
</figure>

Figure 1 above shows the explainability approaches for different kinds of models. Logistic Regression models are simple, transparent, and easy to explain. Although explaining the trained model is outside the scope of this project, I started with this algorithm because of its simplicity and transparency.

The rest of the report is organized as follows: Section 2 describes and analyses the datasets. Section 3 explains the methodology adopted. Section 4 contains the results of this project and, finally, Section 5 holds the discussion, remarks, and future work.

## 2. Data

For this project, I required three datasets: one to train a Logistic Regression model that detects sentiment from social media posts, one that contains music information, and one that contains a user's playlist. The user's playlist dataset helps to understand the user's taste in music. Section 2.1 explains the acquisition and description of these datasets, and section 2.2 presents some analysis of them.

### 2.1. Data acquisition and description

As mentioned above, I collected three datasets for this project. Section 2.1.1 covers the acquisition of the social media posts, 2.1.2 the music data, and 2.1.3 the user's playlist data.

#### *2.1.1. Social media post data (Sentiment140)*
The *Sentiment140* dataset is a popular open-source dataset that contains 1.6 million Twitter posts from many users. The dataset was originally collected by Alec Go and colleagues[[1]](#References). I collected the dataset from Kaggle. A short description of this dataset is provided below:

* Metadata URL: https://www.kaggle.com/datasets/kazanova/sentiment140
* Data URL: https://www.kaggle.com/datasets/kazanova/sentiment140
* Data Type: ZIP

|       Column number                 |                    Column Name        |                                  Description        |
|-------------------------------------|---------------------------------------|-----------------------------------------------------|
|             0                       |                    target             |      Polarity of the tweet (0 = negative, 4 = positive). |
|             1                       |                    id                 |      The id of the tweet.                            |
|             2                       |                    date               |      The date of the tweet.                          |
|             3                       |                    flag               |      The query. If there is no query, then this value is NO_QUERY. |
|             4                       |                    user               |      The user that tweeted                           |
|             5                       |                    text               |      The text of the tweet                           |

#### *2.1.2. Music dataset*
This is an open source dataset that contains a comprehensive list of the most famous songs as listed on Spotify. The dataset offers a wealth of features beyond what is typically available in similar datasets. It provides insights into each song's attributes, popularity, and presence on various music platforms. A short description of this dataset is provided below:

* Metadata URL: https://www.kaggle.com/datasets/nelgiriyewithana/top-spotify-songs-2023
* Data URL: https://www.kaggle.com/datasets/nelgiriyewithana/top-spotify-songs-2023
* Data Type: ZIP

|       Column number                 |                    Column Name        |                                  Description        |
|-------------------------------------|---------------------------------------|-----------------------------------------------------|
|             0                       |        track_name                     |      Name of the song.                              |   
|             1                       |        artist(s)_name                 |      Name of the artist(s) of the song.             |
|             2                       |        artist_count                   |      Number of artists contributing to the song.    |
|             3                       |        released_year                  |      Year when the song was released.               |
|             4                       |        released_month                 |      Month when the song was released.              |
|             5                       |        released_day                   |      Day of the month when the song was released.   |
|             6                       |        in_spotify_playlists           |      Number of Spotify playlists the song is included in. |
|             7                       |        in_spotify_charts              |      Presence and rank of the song on Spotify charts. |
|             8                       |        streams                        |      Total number of streams on Spotify.             |
|             9                       |        in_apple_playlists             |      Number of Apple Music playlists the song is included in. |
|             10                      |        in_apple_charts                |      Presence and rank of the song on Apple Music charts. |
|             11                      |        in_deezer_playlists            |      Number of Deezer playlists the song is included in. |
|             12                      |        in_deezer_charts               |      Presence and rank of the song on Deezer charts. |
|             13                      |        in_shazam_charts               |      Presence and rank of the song on Shazam charts. |
|             14                      |        bpm                            |      Beats per minute, a measure of song tempo.      |
|             15                      |        key                            |      Key of the song.                                |
|             16                      |        mode                           |      Mode of the song (major or minor).              |
|             17                      |        danceability_%                 |      Percentage indicating how suitable the song is for dancing. |
|             18                      |        valence_%                      |      Positivity of the song's musical content.       |
|             19                      |        energy_%                       |      Perceived energy level of the song.             |
|             20                      |        acousticness_%                 |      Amount of acoustic sound in the song.           |
|             21                      |        instrumentalness_%             |      Amount of instrumental content in the song.     |
|             22                      |        liveness_%                     |      Presence of live performance elements.          |
|             23                      |        speechiness_%                  |      Amount of spoken words in the song.             |

#### *2.1.3. User's playlist data*
Finding a user's playlist with the same attributes as the dataset in [section 2.1.2](#212-music-dataset) was a challenge, and I could not find a suitable one for this project. However, a user's playlist is a subset of all available songs. If the dataset in [section 2.1.2](#212-music-dataset) represents all available songs, then a user's playlist is a subset of that dataset. I therefore used a random 10% of the data from [section 2.1.2](#212-music-dataset) as the user's playlist in order to understand the user's preferences in music. The playlist has been uploaded to Google Drive.

* Metadata URL: https://www.kaggle.com/datasets/nelgiriyewithana/top-spotify-songs-2023
* Data URL: https://www.kaggle.com/datasets/nelgiriyewithana/top-spotify-songs-2023
* Data Type: CSV
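
The sampling idea described above can be sketched as follows. This is a minimal illustration, not the code used in the project: the function name `sample_playlist`, the fixed seed, and the placeholder song names are assumptions made for the example.

```python
import random


def sample_playlist(songs, fraction=0.10, seed=42):
    """Return a random subset of `songs` that stands in for a user's playlist."""
    if not songs:
        return []
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    size = max(1, round(len(songs) * fraction))
    rng = random.Random(seed)  # fixed seed (assumption) so the sample is reproducible
    return rng.sample(songs, size)


# Hypothetical placeholder catalogue; the real input would be the music dataset.
all_songs = [f"song_{i}" for i in range(820)]
playlist = sample_playlist(all_songs)
print(len(playlist), "songs sampled")
```

Because the sample is drawn without replacement, no song appears twice in the simulated playlist.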

### 2.2. Data analysis

Now that all the datasets have been collected, I analyse each of them individually to understand the data. Section 2.2.1 analyses the social media posts, 2.2.2 the music data, and 2.2.3 the user's playlist data.

#### *2.2.1 Social media post data (Sentiment140)*

Figure 2 shows the first and last five rows of the Sentiment140 dataset.

<figure>
    <img src="images/tweets-uncleaned.png" style="width: 100%"/>
    <figcaption align="center"> Figure 2: First and last five rows of Sentiment140. </figcaption>
</figure>

I am interested in the columns *"target"* and *"text"*. The "target" column represents the sentiment of the tweet and the "text" column holds the tweet itself. The tweets contain "@user" tags, links, and some special characters. These are cleaned during the data preprocessing step.

Figure 3 shows the target distribution of Sentiment140. The dataset is evenly distributed over the two sentiments: of the 1.6 million tweets, 800,000 are labeled *positive* and 800,000 *negative*.

<figure>
    <img src="images/tweets-target-destribution.png" style="width:60%">
    <figcaption align="center"> 
        Figure 3: Target distribution of Sentiment140. Negative sentiments are labeled as 0 and positives as 4.
    </figcaption>
</figure>

Let us also have a look at some random tweets and their sentiments to understand the validity of the data. Table 1 displays such examples.

<table>
    <tr>
        <th> Tweet </th><th> Sentiment </th>
    </tr>
    <tr>
        <td> im meeting up with one of my besties tonight! Cant wait!!  - GIRL TALK!! </td> <td> positive </td>
    </tr>
    <tr>
        <td> I LOVE @Health4UandPets u guys r the best!! </td> <td> positive </td>
    </tr>
    <tr>
        <td> @DaRealSunisaKim Thanks for the Twitter add, Sunisa! I got to meet you once at a HIN show here in the DC area and you were a sweetheart. </td> <td> positive </td>
    </tr>
    <tr>
        <td> is upset that he can't update his Facebook by texting it... and might cry as a result  School today also. Blah! </td> <td> negative </td>
    </tr>
    <tr>
        <td> @Kenichan I dived many times for the ball. Managed to save 50%  The rest go out of bounds </td> <td> negative </td>
    </tr>
    <tr>
        <td> my whole body feels itchy and like its on fire </td> <td> negative </td>
    </tr>
    <caption>Table 1: Random tweets from Sentiment140 dataset.</caption>
</table>

The tweets in this dataset appear to be well labeled. After observing the target distribution and the labels, I am satisfied that the dataset is suitable for training a model. Further clean-up is performed in the data preprocessing step.

#### *2.2.2. Music dataset*

The music dataset is labeled with several attributes by Spotify. The *"valence"* attribute represents the positivity of a song. I used this attribute to determine the song's sentiment. Table 2 holds a few examples from this dataset and figure 4 visualizes the histograms of the features.

<table>
    <tr>
        <th> track name </th>
        <th> artist(s) name </th>
        <th> artist count </th>
        <th> released year </th>
        <th> released month </th>
        <th> released day </th>
        <th> in spotify playlists </th>
        <th> in spotify charts </th>
        <th> streams </th>
        <th> in apple playlists </th>
        <th> in apple charts </th>
        <th> in deezer playlists </th>
        <th> in deezer charts </th>
        <th> in shazam charts </th>
        <th> bpm </th>
        <th> key </th>
        <th> mode </th>
        <th> danceability_% </th>
        <th> valence_% </th>
        <th> energy_% </th>
        <th> acousticness_% </th>
        <th> instrumentalness_% </th>
        <th> liveness_% </th>
        <th> speechiness_% </th>
    </tr>
    <tr>
        <td> Seven (feat. Latto) (Explicit Ver.) </td> 
        <td> Latto, Jung Kook </td> 
        <td> 2 </td> 
        <td> 2023 </td> 
        <td> 7 </td> 
        <td> 14 </td> 
        <td> 553 </td> 
        <td> 147 </td> 
        <td> 141381703 </td> 
        <td> 43 </td> 
        <td> 263 </td> 
        <td> 45 </td> 
        <td> 10 </td> 
        <td> 826 </td> 
        <td> 125 </td> 
        <td> B </td> 
        <td> Major </td> 
        <td> 80 </td> 
        <td> 89 </td> 
        <td> 83 </td> 
        <td> 31 </td> 
        <td> 0 </td> 
        <td> 8 </td> 
        <td> 4 </td>
    </tr>
    <tr>
        <td> LALA </td> 
        <td> Myke Towers </td> 
        <td> 1 </td> 
        <td> 2023 </td> 
        <td> 3 </td> 
        <td> 23 </td> 
        <td> 1474 </td> 
        <td> 48 </td> 
        <td> 133716286 </td> 
        <td> 48 </td> 
        <td> 126 </td> 
        <td> 58 </td> 
        <td> 14 </td> 
        <td> 382 </td> 
        <td> 92 </td> 
        <td> C# </td> 
        <td> Major </td> 
        <td> 71 </td> 
        <td> 61 </td> 
        <td> 74 </td> 
        <td> 7 </td> 
        <td> 0 </td> 
        <td> 10 </td> 
        <td> 4 </td>
    </tr>
    <tr>
        <td> vampire </td> 
        <td> Olivia Rodrigo </td> 
        <td> 1 </td> 
        <td> 2023 </td> 
        <td> 6 </td> 
        <td> 30 </td> 
        <td> 1397 </td> 
        <td> 113 </td> 
        <td> 140003974 </td> 
        <td> 94 </td> 
        <td> 207 </td> 
        <td> 91 </td> 
        <td> 14 </td> 
        <td> 949 </td> 
        <td> 138 </td> 
        <td> F </td> 
        <td> Major </td> 
        <td> 51 </td> 
        <td> 32 </td> 
        <td> 53 </td> 
        <td> 17 </td> 
        <td> 0 </td> 
        <td> 31 </td> 
        <td> 6 </td>
    </tr>
    <caption>Table 2: A few examples from the Spotify songs dataset.</caption>
</table>

<figure>
<img src="images/all-songs-feature-destribution.png" style="width:100%">
<figcaption align="center"> Figure 4: Histogram of song's attributes. </figcaption>
</figure>


#### *2.2.3. User's playlist data*

The user's playlist was created from a random 10% of the Spotify songs. The playlist contains 82 songs. Figure 5 visualizes the histograms of the songs' attributes in the user's playlist.


<figure>
<img src="images/user-playlist-feature-destribution.png" style="width:100%">
<figcaption align="center"> Figure 5: Histogram of song's attributes from user's playlist. </figcaption>
</figure>


## 3. Methodology

<figure>
<img src="images/methodology.png" style="width:100%">
<figcaption align="center"> Figure 6: Methodology followed in this project work. </figcaption>
</figure>

The methodology, as outlined in Figure 6, has been followed while conducting this project work. The steps are as follows:

<pre>
<code>
1. Run three Extract Transform Load (ETL) pipelines that extract the three datasets mentioned in section 2 from their sources, transform them, and load them into a SQLite database.
2. Fetch the tweets data from the SQLite DB and split it into 80% train and 20% test data.
3. Train the Logistic Regression model with the training set.
4. Test the trained model with the test dataset and evaluate the model's performance.
5. Take an input text from the user.
6. Detect the sentiment of the user from the text.
7. Fetch the songs from the user's playlist in the SQLite DB that match the sentiment detected in step-6.
8. Calculate the average bpm, danceability, energy, acousticness, instrumentalness, liveness, and speechiness of the songs that have been fetched in step-7.
9. Fetch the first ten songs from all songs that match the sentiment of step-6 and whose bpm, danceability, energy, acousticness, instrumentalness, liveness, and speechiness values are greater than or equal to the average values calculated in step-8. Arrange the songs by valence in descending order.
10. Return the songs of step-9 as output.
</code>
</pre>
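
Step 2 of the list above, the 80/20 split, can be sketched in plain Python as shown below. This is only an illustration under stated assumptions: the helper name `split_train_test`, the seed, and the toy tweets are hypothetical, and the project itself may use a library routine for the split.

```python
import random


def split_train_test(rows, test_fraction=0.2, seed=0):
    """Shuffle `rows` and split them into a train part and a test part."""
    shuffled = list(rows)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Toy (text, target) pairs using the Sentiment140 label convention (0 / 4).
tweets = [(f"tweet {i}", 0 if i % 2 == 0 else 4) for i in range(10)]
train, test = split_train_test(tweets)
print(len(train), "training rows,", len(test), "test rows")
```

Shuffling before splitting matters when the stored rows are ordered by label, because an unshuffled split could otherwise leave one sentiment out of the test set.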

Understanding what songs a user likes is explained in Steps 6 to 10, which together make up the User's Songs Preference Algorithm. I have broken down this algorithm in detail in Section 3.4. This section explains the step-by-step process I used to figure out and cater to each person's specific music preferences.

The ETL pipeline used in this project is explained in section 3.1. *In section 3.1.2, I present a Python package dedicated to running ETL pipelines*, followed by the data pre-processing in section 3.2. A short description of the Logistic Regression algorithm is given in section 3.3.

### 3.1. Extract Transform Load (ETL) pipeline

An Extract, Transform, Load (ETL) pipeline is the backbone of data integration: it collects data from diverse sources, transforms it, and loads it into a unified destination. This process helps ensure data quality, consistency, and accessibility, so that raw data can be turned into something that is ready for analysis. In this sense, an ETL pipeline acts as a bridge between raw data and actionable insight for a variety of analytical and operational purposes.

<figure>
<img src="images/etl-pipeline.png" style="width:100%">
<figcaption align="center"> Figure 6: Extract Transform Load (ETL) Pipeline used in this project work. </figcaption>
</figure>

I used three different datasets for this project, so three ETL pipelines were created to extract, transform, and load the datasets from their sources into a SQLite DB. The figure 6 above visualizes the structure of the ETL pipelines used in this project. An ETL queue has been used to run the pipelines sequentially. The item number in the figure shows the execution sequence of the pipelines in the queue.
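
The idea of running several pipelines sequentially from a queue can be sketched with the standard library alone. The sketch below is not the etl-pipeline-runner API: the functions `extract`, `drop_nulls`, `load`, and `run_queue`, as well as the inline CSV text, are hypothetical names and data chosen for illustration.

```python
import csv
import io
import sqlite3
from collections import deque


def extract(csv_text):
    """Extract: parse CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def drop_nulls(rows):
    """Transform: discard rows that contain an empty or missing value."""
    return [r for r in rows if all(v not in (None, "") for v in r.values())]


def load(conn, table, rows):
    """Load: write the rows into a SQLite table, replacing any old copy."""
    if not rows:
        return
    columns = list(rows[0])
    col_sql = ", ".join(f'"{c}"' for c in columns)
    placeholders = ", ".join("?" for _ in columns)
    conn.execute(f'DROP TABLE IF EXISTS "{table}"')
    conn.execute(f'CREATE TABLE "{table}" ({col_sql})')
    conn.executemany(
        f'INSERT INTO "{table}" ({col_sql}) VALUES ({placeholders})',
        [tuple(r[c] for c in columns) for r in rows],
    )
    conn.commit()


def run_queue(conn, pipelines):
    """Run (table, csv_text, transform) pipelines one after another."""
    queue = deque(pipelines)
    while queue:
        table, csv_text, transform = queue.popleft()
        load(conn, table, transform(extract(csv_text)))


conn = sqlite3.connect(":memory:")
songs_csv = "track_name,bpm\nLALA,92\nvampire,\n"  # second row has a missing bpm
run_queue(conn, [("songs", songs_csv, drop_nulls)])
rows = conn.execute("SELECT track_name, bpm FROM songs").fetchall()
print(rows)
```

Each queue item bundles its own transform, which is what lets one pipeline clean tweets while the others only drop null values.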


#### *3.1.1 Data transformation in the ETL pipeline*

The Twitter dataset has been used for model training. It is evident from figure 2 and table 1 that the tweets contain special characters, @user tags, and URLs. Social media posts also commonly contain single characters and extra white space, which carry no meaning for training a model. I therefore cleaned the dataset by removing all of these. After that, I removed the stop words and finally lemmatized the text. Stop words and lemmatization are discussed in section 3.2. Figure 7 shows how the cleaning process was executed for each tweet. This takes place in the *transformation step of the Twitter ETL pipeline*.

<figure>
<img src="images/cleaning-tweets.png" style="width:100%">
<figcaption align="center"> Figure 7: Transformation in ETL pipeline for Twitter dataset. </figcaption>
</figure>
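
The regular-expression part of this cleaning can be sketched as below. It is a simplified illustration rather than the project's transformation code: the function name `clean_tweet` is hypothetical, lowercasing and the removal of digits are assumptions, and stop-word removal and lemmatization are left out because they need external language resources.

```python
import re


def clean_tweet(text):
    """Remove URLs, @user tags, special characters, single letters and extra spaces."""
    text = text.lower()                                 # assumption: case is not kept
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # URLs (before stripping punctuation)
    text = re.sub(r"@\w+", " ", text)                   # @user tags
    text = re.sub(r"[^a-z\s]", " ", text)               # special characters and digits
    text = re.sub(r"\b[a-z]\b", " ", text)              # single characters
    return re.sub(r"\s+", " ", text).strip()            # collapse white space


example = "@Kenichan I dived many times for the ball. Managed to save 50%  The rest go out of bounds"
print(clean_tweet(example))
# dived many times for the ball managed to save the rest go out of bounds
```

The order of the substitutions matters: URLs and @user tags are removed first, because stripping punctuation earlier would break them into ordinary-looking words.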

For the other two pipelines, only null values have been dropped. No other transformation is applied.

#### *3.1.2 The python package for ETL pipeline*

Motivated by jayvee[5], Arni Islam[10] and I developed an open-source Python package[11] during this course work[6]. We worked on two different projects with different goals, so our ETL pipelines had different structures. My challenge was to extract data from a Kaggle archive, apply the transformations visualized in figures 6 and 7, and finally load the data into the database. Her challenge was to extract data from a source that returns a CSV file directly and to apply her own transformations. We found that these are basic requirements for any data science project.

Since Python is one of the most popular languages in the domain of data science and machine learning[7,8,9], many resources are being developed to support engineers. Considering the basic needs of data science and machine learning projects, we initiated this open-source package. We believe the package offers a combination of simplicity, flexibility, reliability, and community support that makes it a useful tool for data engineers and analysts involved in ETL workflows. Its user-friendly design and Python-based customization are intended to help users streamline their data processing tasks. Figure 8 demonstrates an example pipeline creation with [etl-pipeline-runner](https://github.com/prantoamt/etl-pipeline-runner).

<figure>
<img src="images/etl-pipeline-runner-example-code.png" style="width:100%">
<figcaption align="center"> Figure 8: Example code to run a pipeline with etl-pipeline-runner package. </figcaption>
</figure>

*Note: If you are trying to extract data from Kaggle, you will need credentials. There is a section in the package documentation[11] on how to set up Kaggle credentials.*

##### Features as of version 1.1.0:
1. Can extract data from kaggle. (Contributed by me)
2. Can extract data from a source that returns direct CSV file. (Contributed by Arni)
3. Can handle csv files. (Contributed by me and Arni)
4. Can perform transformation based on individual project requirements. (Contributed by me and Arni)
5. Can load data to SQLite databases. (Contributed by me and Arni)
6. Can run multiple pipelines in a Queue. (Contributed by me and Arni)

The repository contains several unit test cases and system test cases to make the package robust. All the test cases so far have been contributed by me and Arni.

##### Possible features for future:

1. Extract archives from other sources than kaggle.
2. Extract data from a database.
3. Handle XL/XLS files.
4. Load data to other types of databases.

*You are cordially invited to make feature requests, contribute to the package, and help fulfil the needs of the data science community.*

### 3.2. Data pre-processing

The success of sentiment analysis relies heavily on effective text preprocessing, which transforms raw textual data into a format suitable for machine learning algorithms. In this project, the Natural Language Toolkit (nltk) and the TfidfVectorizer, which is part of scikit-learn, were used to preprocess the Twitter posts before training the logistic regression model.

#### *3.2.1. The preprocessing steps:*

1. Tokenization:
    The raw text data, consisting of Twitter posts, was tokenized into individual words or phrases using nltk's tokenization capabilities. This step is crucial for breaking down the text into its basic components, allowing for further analysis.

2. Stopword removal:
    Common words that do not contribute significantly to the sentiment of the text, known as stopwords, were removed from the tokenized data. This helps reduce noise in the dataset and focuses the model on more meaningful words.

3. Lemmatization:
    Lemmatization, the process of reducing words to their base or root form, was applied to ensure that variations of words (e.g., "running" and "ran") were treated as the same feature. This promotes a more consistent and meaningful representation of the text.

4. Tfidf transformation:
    To convert the tokenized and preprocessed text into a numerical format suitable for machine learning models, the TfidfVectorizer from scikit-learn was used. Tfidf (Term Frequency-Inverse Document Frequency) considers the importance of each term not only within a specific document but also across the entire corpus. This transformation helps highlight words that are more discriminative and relevant to the sentiment of the Twitter posts.

*Steps 2 and 3 above were performed in the ETL pipeline transformation.*

#### *3.2.2. Feature extraction with TfidfVectorizer:*

1. Term Frequency (TF)
    The TfidfVectorizer computes the Term Frequency (TF) for each term in the document, indicating how frequently a term appears in a specific Twitter post.

2. Inverse Document Frequency (IDF)
    The Inverse Document Frequency (IDF) is calculated to assess the significance of a term across the entire corpus. Rare terms that occur in only a few documents receive a higher IDF score.

3. Tfidf representation
    The combination of TF and IDF produces the final Tfidf representation for each term in the dataset. This vectorization process results in a numerical matrix that captures the importance of each term in each Twitter post.
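
To make the three quantities above concrete, the following sketch computes them by hand for a tiny corpus. It is an illustration only, not the vectorizer used in the project: the function name `tfidf` and the three example documents are hypothetical, and the smoothed IDF formula with L2 normalization is chosen on the assumption that it mirrors the documented default behaviour of scikit-learn's TfidfVectorizer.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return (vocabulary, matrix) with one L2-normalized Tfidf row per document."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({term for doc in tokenized for term in doc})
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in tokenized for term in set(doc))
    # Smoothed inverse document frequency (assumed variant, see lead-in).
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    matrix = []
    for doc in tokenized:
        tf = Counter(doc)  # term frequency: raw count within this document
        row = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in row)) or 1.0
        matrix.append([w / norm for w in row])
    return vocab, matrix


docs = ["love this song", "hate this song", "love love music"]
vocab, matrix = tfidf(docs)
for doc, row in zip(docs, matrix):
    print(doc, [round(w, 2) for w in row])
```

In the second document, "hate" appears in only one post and therefore receives a larger weight than "this" or "song", which appear in two.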

### 3.3. Logistic regression model

In the realm of predictive modeling, Logistic Regression stands as a powerful statistical technique for binary classification tasks. Unlike linear regression, which predicts continuous outcomes, logistic regression is specifically tailored for predicting the probability of an observation belonging to a particular category. In the context of this project, the goal is to employ Logistic Regression as a predictive tool for identifying the likelihood of sentiment being positive or negative.

#### *3.3.1. Mathematical formulation*

The logistic regression model transforms the linear combination of input features into a probability using the logistic function (sigmoid function). For a binary classification task, the logistic regression equation can be expressed as follows:

$$ P(Y=1) = \frac{1}{1+e^{-(\beta_0+\beta_1X_1+\beta_2X_2+...+\beta_nX_n)}} $$

Here,

- $P(Y=1)$ is the probability of the event Y occurring (e.g., positive sentiment),
- $\beta_0$ is the intercept term,
- $\beta_1, \beta_2,...,\beta_n$ are the coefficients corresponding to the input features $X_1, X_2,...,X_n$,
- $e$ is the base of the natural logarithm.
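
The equation above can be evaluated directly, as the following sketch shows. The function name `predict_proba` and the coefficient values are made up for illustration and are not the trained model's parameters.

```python
import math


def predict_proba(features, coefficients, intercept):
    """Return P(Y=1) for one observation using the logistic (sigmoid) function."""
    z = intercept + sum(b * x for b, x in zip(coefficients, features))
    # Two algebraically equal forms; picking by sign avoids overflow in exp().
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


# Hypothetical example: beta_0 = -1.0, beta_1 = 2.0, beta_2 = -1.0.
p = predict_proba(features=[0.5, 0.0], coefficients=[2.0, -1.0], intercept=-1.0)
label = "positive" if p >= 0.5 else "negative"
print(p, label)
```

Here the linear combination is exactly zero, so the probability is 0.5; a threshold of 0.5 on $P(Y=1)$ is the usual way to turn the probability into a positive or negative label.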

### 3.4. Understanding the user's preferences in music

To enhance the personalization of music recommendations, this project work focuses on a comprehensive analysis of the user's taste by considering key audio features in their current playlist. The algorithm developed for this purpose intricately examines the average values of bpm, danceability, energy, acousticness, instrumentalness, liveness, and speechiness, providing a nuanced understanding of the musical elements that resonate with the user.

#### *3.4.1. Algorithmic workflow:*

1. Sentiment Analysis:

    The initial step involves determining the user's sentiment through the trained model discussed in section 3.3. This sentiment analysis lays the groundwork for tailoring music recommendations to match the user's emotional context.

2. Playlist Filtering:

    Songs from the user's playlist that align with the predicted sentiment are extracted. This ensures that the subsequent analysis focuses exclusively on the subset of songs associated with the user's prevailing sentiment.

3. Calculation of Audio Feature Averages:

    The algorithm then computes the average values of essential audio features— bpm, danceability, energy, acousticness, instrumentalness, liveness, and speechiness—based on the songs selected from the user's playlist. This step provides a consolidated representation of the musical characteristics favored by the user.

4. Top Ten Song Selection:

    From the broader set of songs that match the user's sentiment, the algorithm identifies the top ten songs. This selection is refined further by considering songs with audio feature values greater than or equal to the previously calculated averages. This meticulous filtering ensures that the recommended songs not only align with the user's sentiment but also exhibit specific musical attributes reflective of their preferences.

5. Outcome and Significance:

    By adopting this feature-driven approach, the algorithm aims to transcend conventional music recommendation systems. Instead of solely relying on genre or popularity, the emphasis is placed on the nuanced analysis of audio features that contribute to the user's musical experience. This approach offers a more refined and personalized music recommendation, aligning closely with the user's sentiment and preferences as captured by the intricate interplay of bpm, danceability, energy, and other key features.

In summary, this algorithmic workflow represents a step forward in understanding and accommodating user-specific music preferences. By leveraging sentiment analysis and audio feature analysis, the system endeavors to offer a more immersive and tailored musical journey for users, contributing to a richer and more personalized music streaming experience.
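
The workflow above can be sketched on in-memory data as follows. This is a simplified illustration and not the project's implementation, which reads from the SQLite DB: the function names, the toy songs, and in particular the valence threshold of 50 used to label a song as positive or negative are assumptions.

```python
FEATURES = ["bpm", "danceability", "energy", "acousticness",
            "instrumentalness", "liveness", "speechiness"]


def song_sentiment(song, threshold=50):
    """Label a song by valence. The threshold of 50 is an assumption."""
    return "positive" if song["valence"] >= threshold else "negative"


def recommend(sentiment, playlist, all_songs, limit=10):
    """Recommend up to `limit` songs matching the sentiment and the user's averages."""
    # Playlist filtering: keep the user's songs that match the detected sentiment.
    liked = [s for s in playlist if song_sentiment(s) == sentiment]
    if not liked:
        return []
    # Average of each audio feature over the filtered playlist songs.
    averages = {f: sum(s[f] for s in liked) / len(liked) for f in FEATURES}
    # Candidates: same sentiment and every feature at or above the user's average.
    candidates = [
        s for s in all_songs
        if song_sentiment(s) == sentiment
        and all(s[f] >= averages[f] for f in FEATURES)
    ]
    candidates.sort(key=lambda s: s["valence"], reverse=True)
    return candidates[:limit]


def make_song(name, valence, level):
    """Build a toy song whose audio features all share the same value."""
    song = {"track_name": name, "valence": valence}
    song.update({f: level for f in FEATURES})
    return song


playlist = [make_song("p1", 80, 40), make_song("p2", 70, 60), make_song("p3", 20, 90)]
all_songs = [make_song("a", 90, 55), make_song("b", 95, 30),
             make_song("c", 60, 50), make_song("d", 10, 99)]
picks = [s["track_name"] for s in recommend("positive", playlist, all_songs)]
print(picks)
```

Requiring every feature to reach the user's average is a strict filter, so the result can contain fewer than ten songs, as it does in this toy example.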

## 4. Results

## 5. Conclusion

## References
1. Go, A., Bhayani, R. and Huang, L., 2009. *"Twitter sentiment classification using distant supervision."* CS224N Project Report, Stanford, 1(2009), p.12
2. Linardatos P., Papastefanopoulos V., Kotsiantis S., *"Explainable AI: A Review of Machine Learning Interpretability Methods."* Entropy 2021, 23, 18. https://doi.org/10.3390/e23010018
3. Xu, F., Uszkoreit, H., Du, Y., Fan, W., Zhao, D., Zhu, J. (2019). *"Explainable AI: A Brief Survey on History, Research Areas, Approaches and Challenges."* In: Tang, J., Kan, MY., Zhao, D., Li, S., Zan, H. (eds) Natural Language Processing and Chinese Computing. NLPCC 2019. Lecture Notes in Computer Science, vol 11839. Springer, Cham. https://doi.org/10.1007/978-3-030-32236-6_51
4. Holzinger, A., Saranti, A., Molnar, C., Biecek, P., Samek, W. (2022). *"Explainable AI Methods - A Brief Overview."* In: Holzinger, A., Goebel, R., Fong, R., Moon, T., Müller, KR., Samek, W. (eds) xxAI - Beyond Explainable AI. xxAI 2020. Lecture Notes in Computer Science, vol 13200. Springer, Cham. https://doi.org/10.1007/978-3-031-04083-2_2
5. The JValue Project, *Professorship of Open Source Software at the University of Erlangen*, https://jvalue.github.io/jayvee/
6. Advanced Data Engineering, *Professorship for Open-Source Software, Friedrich-Alexander University Erlangen-Nürnberg*, https://oss.cs.fau.de/teaching/specific/made/
7. Raschka S., Patterson J., Nolet C., *"Machine Learning in Python: Main Developments and Technology Trends in Data Science, Machine Learning, and Artificial Intelligence"*, Information 2020, 11, 193. https://doi.org/10.3390/info11040193
8. A. Nagpal and G. Gabrani, *"Python for Data Analytics, Scientific and Technical Applications"*, 2019 Amity International Conference on Artificial Intelligence (AICAI), Dubai, United Arab Emirates, 2019, pp. 140-145, doi: 10.1109/AICAI.2019.8701341.
9. Z. Dobesova, *"Programming language Python for data processing,"* 2011 International Conference on Electrical and Control Engineering, Yichang, China, 2011, pp. 4866-4869, doi: 10.1109/ICECENG.2011.6057428.
10. Arni Islam, https://github.com/islam15-8789
11. etl-pipeline-runner, https://github.com/prantoamt/etl-pipeline-runner
