## Spotify Music Recommendation Based on the Sentiment of Social Media Posts

## Table of Contents

1. [Introduction](#1.-introduction)
2. [Data](#2.-data)
    - [Data Acquisition and Description](#2.1-data-acquisition-and-description)
    - [Data Analysis](#2.2-data-analysis)
3. [ETL Pipeline](#3.-etl-pipeline)
4. [Methodology](#4.-methodology)
5. [Results](#5.-results)
6. [Conclusion](#6.-conclusion)

## 1. Introduction
This project recommends music based on the sentiment of social media posts. I trained a *Logistic Regression* model on Twitter posts and evaluated how well it detects sentiment in social media text. The model reached 79% test accuracy. I then used the trained model to detect a user's sentiment and recommend songs based on that sentiment and the user's personal playlist.

In an era where music and social media intertwine, this endeavor seeks to revolutionize personalized music discovery. By analyzing the sentiment expressed in users' social media posts, I aim to craft a dynamic recommendation system for Spotify, tailoring playlists that resonate with the emotional context of individuals. Through the fusion of advanced sentiment analysis and machine learning, this project endeavors to enhance the user experience by delivering music that not only aligns with personal preferences but also harmonizes with the sentiments conveyed in their digital expressions.

The rest of the report is organized as follows. [Section 2](#2.-data) describes and analyzes the datasets. [Section 3](#etl-pipeline) explains the ETL pipeline architecture. [Section 4](#methodology) describes the methodology adopted in this project. [Section 5](#results) presents and discusses the results, followed by concluding remarks in [section 6](#conclusion).

## 2. Data

### 2.1. Data Acquisition and Description

### *2.1.1. Social Media Post Data (Sentiment140)*
The *Sentiment140* dataset is a popular open-source dataset that contains 1.6 million Twitter posts from many users. The dataset was originally collected by Alec Go and colleagues[[1]](#references). I obtained the dataset from Kaggle. *This dataset is used for model training.* A short description of the dataset is provided below:

* Metadata URL: https://www.kaggle.com/datasets/kazanova/sentiment140
* Data URL: https://www.kaggle.com/datasets/kazanova/sentiment140
* Data Type: ZIP

|       Column number                 |                    Column Name        |                                  Description        |
|-------------------------------------|---------------------------------------|-----------------------------------------------------|
|             0                       |                    target             |      Polarity of the tweet (0 = negative, 4 = positive). |
|             1                       |                    id                 |      The id of the tweet.                            |
|             2                       |                    date               |      The date of the tweet.                          |
|             3                       |                    flag               |      The query. If there is no query, then this value is NO_QUERY. |
|             4                       |                    user               |      The user that tweeted                           |
|             5                       |                    text               |      The text of the tweet                           |
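
The column layout in the table above can be applied when reading the raw file. The following is a minimal sketch using only the Python standard library, assuming the CSV has no header row; the function name `read_sentiment140` and the two sample rows are made up for illustration and are not taken from the dataset.

```python
import csv
import io

# Column names from the table above; the raw CSV is assumed to have no header row.
SENTIMENT140_COLUMNS = ["target", "id", "date", "flag", "user", "text"]


def read_sentiment140(handle):
    """Yield one dict per tweet from an open, header-less CSV file handle."""
    reader = csv.DictReader(handle, fieldnames=SENTIMENT140_COLUMNS)
    for row in reader:
        row["target"] = int(row["target"])  # 0 = negative, 4 = positive
        yield row


# Tiny in-memory sample (hypothetical rows) standing in for the real file.
sample = io.StringIO(
    '"0","1","Mon Apr 06 22:19:45 PDT 2009","NO_QUERY","some_user","feeling sad today"\n'
    '"4","2","Mon Apr 06 22:20:00 PDT 2009","NO_QUERY","other_user","what a great day"\n'
)
rows = list(read_sentiment140(sample))
```

For the real file, the same function could be called with a handle from `open(path, encoding="latin-1", newline="")`. The encoding is an assumption and may need adjusting for your copy of the data.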

### *2.1.2. Music Dataset*
This is an open dataset that contains a comprehensive list of the most famous songs of 2023 as listed on Spotify. The dataset offers a wealth of features beyond what is typically available in similar datasets. It provides insights into each song's attributes, popularity, and presence on various music platforms. *This dataset has been used for song recommendation.* A short description of this dataset is provided below:

* Metadata URL: https://www.kaggle.com/datasets/nelgiriyewithana/top-spotify-songs-2023
* Data URL: https://www.kaggle.com/datasets/nelgiriyewithana/top-spotify-songs-2023
* Data Type: ZIP

|       Column number                 |                    Column Name        |                                  Description        |
|-------------------------------------|---------------------------------------|-----------------------------------------------------|
|             0                       |        track_name                     |      Name of the song.                              |   
|             1                       |        artist(s)_name                 |      Name of the artist(s) of the song.             |
|             2                       |        artist_count                   |      Number of artists contributing to the song.    |
|             3                       |        released_year                  |      Year when the song was released.               |
|             4                       |        released_month                 |      Month when the song was released.              |
|             5                       |        released_day                   |      Day of the month when the song was released.   |
|             6                       |        in_spotify_playlists           |      Number of Spotify playlists the song is included in. |
|             7                       |        in_spotify_charts              |      Presence and rank of the song on Spotify charts. |
|             8                       |        streams                        |      Total number of streams on Spotify.             |
|             9                       |        in_apple_playlists             |      Number of Apple Music playlists the song is included in. |
|             10                      |        in_apple_charts                |      Presence and rank of the song on Apple Music charts. |
|             11                      |        in_deezer_playlists            |      Number of Deezer playlists the song is included in. |
|             12                      |        in_deezer_charts               |      Presence and rank of the song on Deezer charts. |
|             13                      |        in_shazam_charts               |      Presence and rank of the song on Shazam charts. |
|             14                      |        bpm                            |      Beats per minute, a measure of song tempo.      |
|             15                      |        key                            |      Key of the song.                                |
|             16                      |        mode                           |      Mode of the song (major or minor).              |
|             17                      |        danceability_%                 |      Percentage indicating how suitable the song is for dancing. |
|             18                      |        valence_%                      |      Positivity of the song's musical content.       |
|             19                      |        energy_%                       |      Perceived energy level of the song.             |
|             20                      |        acousticness_%                 |      Amount of acoustic sound in the song.           |
|             21                      |        instrumentalness_%             |      Amount of instrumental content in the song.     |
|             22                      |        liveness_%                     |      Presence of live performance elements.          |
|             23                      |        speechiness_%                  |      Amount of spoken words in the song.             |

### *2.1.3. User's playlist data*
A user's playlist is a subset of all available songs. Treating the dataset in [section 2.1.2](#2.1.2.-music-dataset) as the set of all available songs, I used 10% of it as the user's playlist in order to model the user's music preferences.

* Metadata URL: https://www.kaggle.com/datasets/nelgiriyewithana/top-spotify-songs-2023
* Data URL: https://www.kaggle.com/datasets/nelgiriyewithana/top-spotify-songs-2023
* Data Type: CSV
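
Drawing a random 10% subset can be sketched as follows. This is an illustration rather than the project's actual code: the function name `sample_playlist`, the seed, and the placeholder track names are all assumptions.

```python
import random


def sample_playlist(songs, fraction=0.10, seed=42):
    """Return a random subset of `songs` containing about `fraction` of them."""
    k = round(len(songs) * fraction)
    # A seeded generator keeps the simulated playlist reproducible between runs.
    return random.Random(seed).sample(songs, k)


all_songs = [f"song_{i}" for i in range(100)]  # placeholder track names
playlist = sample_playlist(all_songs)
```

Fixing the seed means the same "user" is simulated every time, which makes later recommendation results comparable across runs.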

### 2.2 Data Analysis

### *2.2.1 Social Media Post Data (Sentiment140)*

Figure 1 shows the first and last five rows of the Sentiment140 dataset.

<figure>
    <img src="images/tweets-uncleaned.png" style="width: 100%"/>
    <figcaption align="center"> Figure 1: First and last five rows of Sentiment140. </figcaption>
</figure>

I am interested in the *"target"* and *"text"* columns. The "target" column holds the sentiment of the tweet and the "text" column holds the tweet itself. The tweet text contains "@user" mentions, links, and some special characters. These are removed during the data preprocessing step.
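
A minimal sketch of this kind of clean-up with regular expressions is shown below. The exact patterns are assumptions, not the project's actual preprocessing code, and `clean_tweet` is a hypothetical helper name.

```python
import re


def clean_tweet(text):
    """Strip links, @user mentions, and special characters from a tweet."""
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # links
    text = re.sub(r"@\w+", " ", text)                   # @user mentions
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)         # remaining special characters
    return re.sub(r"\s+", " ", text).strip()            # collapse repeated whitespace


print(clean_tweet("I LOVE @Health4UandPets u guys r the best!!"))
# prints: I LOVE u guys r the best
```

Note that removing every non-alphanumeric character also splits contractions such as "can't" into "can t". Whether that is acceptable depends on the tokenizer used downstream.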

Figure 2 shows the target distribution of Sentiment140. The dataset is evenly split between the two sentiments: 800,000 *positive* tweets and 800,000 *negative* tweets.

<figure>
<img src="images/tweets-target-destribution.png" style="width:80%">
<figcaption align="center"> Figure 2: Target distribution of Sentiment140. </figcaption>
</figure>

The targets are labeled 0 for negative sentiment and 4 for positive sentiment. During preprocessing, I relabeled the target "4" as "1".
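
The relabeling step can be sketched in plain Python as follows; `LABEL_MAP` and `relabel` are illustrative names, not taken from the project code.

```python
# Sentiment140 encodes polarity as 0 (negative) and 4 (positive).
LABEL_MAP = {0: 0, 4: 1}


def relabel(targets):
    """Map raw Sentiment140 targets to binary labels (0 = negative, 1 = positive)."""
    return [LABEL_MAP[t] for t in targets]


labels = relabel([0, 4, 4, 0])  # -> [0, 1, 1, 0]
```

Using an explicit mapping means an unexpected target value raises a `KeyError` instead of silently passing through.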

Let us also have a look at some random tweets and their sentiments to understand the validity of the data. Table 1 displays such examples.

<table>
    <tr>
        <th> Tweet </th><th> Sentiment </th>
    </tr>
    <tr>
        <td> im meeting up with one of my besties tonight! Cant wait!!  - GIRL TALK!! </td> <td> positive </td>
    </tr>
    <tr>
        <td> I LOVE @Health4UandPets u guys r the best!! </td> <td> positive </td>
    </tr>
    <tr>
        <td> @DaRealSunisaKim Thanks for the Twitter add, Sunisa! I got to meet you once at a HIN show here in the DC area and you were a sweetheart. </td> <td> positive </td>
    </tr>
    <tr>
        <td> is upset that he can't update his Facebook by texting it... and might cry as a result  School today also. Blah! </td> <td> negative </td>
    </tr>
    <tr>
        <td> @Kenichan I dived many times for the ball. Managed to save 50%  The rest go out of bounds </td> <td> negative </td>
    </tr>
    <tr>
        <td> my whole body feels itchy and like its on fire </td> <td> negative </td>
    </tr>
    <caption>Table 1: Random tweets from Sentiment140 dataset.</caption>
</table>

The tweets in this dataset appear to be well labeled. Given the target distribution and the quality of the labels, I consider the dataset suitable for training a model. Further clean-up is performed in the data preprocessing step.

### *2.2.2. Music Dataset*

The music dataset is labeled with several attributes by Spotify. The *"valence"* attribute represents the positivity of a song. I used this attribute to determine each song's sentiment. Table 2 shows a few examples from this dataset, and figure 3 visualizes the distribution of its features.

<table>
    <tr>
        <th> track name </th>
        <th> artist(s) name </th>
        <th> artist count </th>
        <th> released year </th>
        <th> released month </th>
        <th> released day </th>
        <th> in spotify playlists </th>
        <th> in spotify charts </th>
        <th> streams </th>
        <th> in apple playlists </th>
        <th> in apple charts </th>
        <th> in deezer playlists </th>
        <th> in deezer charts </th>
        <th> in shazam charts </th>
        <th> bpm </th>
        <th> key </th>
        <th> mode </th>
        <th> danceability_% </th>
        <th> valence_% </th>
        <th> energy_% </th>
        <th> acousticness_% </th>
        <th> instrumentalness_% </th>
        <th> liveness_% </th>
        <th> speechiness_% </th>
    </tr>
    <tr>
        <td> Seven (feat. Latto) (Explicit Ver.) </td> 
        <td> Latto, Jung Kook </td> 
        <td> 2 </td> 
        <td> 2023 </td> 
        <td> 7 </td> 
        <td> 14 </td> 
        <td> 553 </td> 
        <td> 147 </td> 
        <td> 141381703 </td> 
        <td> 43 </td> 
        <td> 263 </td> 
        <td> 45 </td> 
        <td> 10 </td> 
        <td> 826 </td> 
        <td> 125 </td> 
        <td> B </td> 
        <td> Major </td> 
        <td> 80 </td> 
        <td> 89 </td> 
        <td> 83 </td> 
        <td> 31 </td> 
        <td> 0 </td> 
        <td> 8 </td> 
        <td> 4 </td>
    </tr>
    <tr>
        <td> LALA </td> 
        <td> Myke Towers </td> 
        <td> 1 </td> 
        <td> 2023 </td> 
        <td> 3 </td> 
        <td> 23 </td> 
        <td> 1474 </td> 
        <td> 48 </td> 
        <td> 133716286 </td> 
        <td> 48 </td> 
        <td> 126 </td> 
        <td> 58 </td> 
        <td> 14 </td> 
        <td> 382 </td> 
        <td> 92 </td> 
        <td> C# </td> 
        <td> Major </td> 
        <td> 71 </td> 
        <td> 61 </td> 
        <td> 74 </td> 
        <td> 7 </td> 
        <td> 0 </td> 
        <td> 10 </td> 
        <td> 4 </td>
    </tr>
    <tr>
        <td> vampire </td> 
        <td> Olivia Rodrigo </td> 
        <td> 1 </td> 
        <td> 2023 </td> 
        <td> 6 </td> 
        <td> 30 </td> 
        <td> 1397 </td> 
        <td> 113 </td> 
        <td> 140003974 </td> 
        <td> 94 </td> 
        <td> 207 </td> 
        <td> 91 </td> 
        <td> 14 </td> 
        <td> 949 </td> 
        <td> 138 </td> 
        <td> F </td> 
        <td> Major </td> 
        <td> 51 </td> 
        <td> 32 </td> 
        <td> 53 </td> 
        <td> 17 </td> 
        <td> 0 </td> 
        <td> 31 </td> 
        <td> 6 </td>
    </tr>
    <caption>Table 2: A few examples from the Spotify songs dataset.</caption>
</table>

<figure>
<img src="images/all-songs-feature-destribution.png" style="width:100%">
<figcaption align="center"> Figure 3: Distribution of the songs' attributes. </figcaption>
</figure>
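
One simple way to turn `valence_%` into a song sentiment is a fixed threshold. The sketch below assumes a cut-off of 50, which is an illustrative choice and not necessarily the rule used in this project; `song_sentiment` is a hypothetical helper.

```python
def song_sentiment(valence_percent, threshold=50):
    """Label a song 1 (positive) if valence_% >= threshold, else 0 (negative)."""
    return 1 if valence_percent >= threshold else 0


# valence_% values of the three songs listed in Table 2
examples = {"Seven (feat. Latto) (Explicit Ver.)": 89, "LALA": 61, "vampire": 32}
labels = {name: song_sentiment(v) for name, v in examples.items()}
```

With this assumed threshold, the first two songs are labeled positive and "vampire" is labeled negative, matching the 0/1 encoding used for the tweets.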


### *2.2.3. User's playlist data*

The user's playlist was created from a random 10% sample of the Spotify songs. The playlist contains 82 songs. Figure 4 visualizes the distribution of song attributes in the user's playlist.


<figure>
<img src="images/user-playlist-feature-destribution.png" style="width:100%">
<figcaption align="center"> Figure 4: Distribution of song attributes in the user's playlist. </figcaption>
</figure>


## 3. ETL Pipeline

## 4. Methodology

## 5. Results

## 6. Conclusion

## References
1. Go, A., Bhayani, R. and Huang, L., 2009. Twitter sentiment classification using distant supervision. CS224N Project Report, Stanford, 1(2009), p.12