# Tweetify: Mapping the Intersection of Twitter and Spotify 

*By: Kat Limqueco, University of California, Los Angeles (UCLA)*

---

> **Keywords**: Social Media Analytics, Sentiment Analysis, Natural Language Processing, Unsupervised Learning, Music Consumption 

## Table of Contents 
1. **Introduction**
   - 1.1 Background 
   - 1.2 Significance of the Study 
   - 1.3 Objectives and Scope of the Study
   - 1.4 Hypotheses and Research Questions

2. **Literature Review**
   - 2.1 Twitter and Music Consumption
   - 2.2 Spotify and its Impact on Music Preferences
   - 2.3 Role of Social Media in Shaping Music Trends

3. **Methodology**
   - 3.1 Data Collection
     - 3.1.1 Twitter Data Extraction
     - 3.1.2 Spotify Data Extraction
     - 3.1.3 Billboard Data Acquisition
   - 3.2 Data Cleaning
     - 3.2.1 Removing Special Characters
   - 3.3 Data Processing 
     - 3.3.1 Processing Track URIs
     - 3.3.2 Extracting Audio Features 
     - 3.3.3 Determining Track Info
     - 3.3.4 Retrieving Track Lyrics
     - 3.3.5 Deriving Unique Words from Lyrics
     - 3.3.6 Identifying Genres
   - 3.4 Analytical Approach
     - 3.4.1 Natural Language Processing
     - 3.4.2 Unsupervised Machine Learning

4. **Data Analysis and Findings**
   - 4.1 Exploratory Data Analysis
     - 4.1.1 Data Description
     - 4.1.2 Sentiment Analysis
       - Treemap 
       - Genre Bar Chart
     - 4.1.3 Data Visualization 
       - Audio Radar Chart
       - Audio Heat Map
       - Tempo Histogram
       - Key Histogram
       - Popularity and Year Histogram 
       - Geolocation Bubble Map 
     - 4.1.4 Key Observations 
   - 4.2 Cluster Analysis
     - 4.2.1 Clustering Methodology
     - 4.2.2 Results of Cluster Analysis
   - 4.3 Comparative Analysis
     - 4.3.1 Twitter vs Billboard: Audio Features
     - 4.3.2 Twitter vs Billboard: Genre Trends

5. **Playlist Curation**
   - 5.1 Creating a Playlist Representing 2022

6. **Discussion**
   - 6.1 Interpretation of Findings
   - 6.2 Implications and Contributions of the Study
   - 6.3 Limitations and Challenges
   - 6.4 Recommendations for Future Research

7. **Conclusion**
   - 7.1 Summary of Findings
   - 7.2 Study Conclusions

8. **References**

9. **Appendices**
   - 9.1 Data Collection Code
   - 9.2 Data Cleaning and Processing Code
   - 9.3 Data Analysis Code
   - 9.4 Sample Playlist for 2022

---

## 1. Introduction 

### 1.1 Background

### 1.2 Research Questions

### 1.3 Objectives and Scope of the Study


The primary objective of this study is to understand the music consumption patterns reflected in Twitter data, particularly in the links to Spotify tracks that users share, using techniques such as Natural Language Processing and Machine Learning. Specifically, the study aims to:

1. **Examine** the auditory preferences reflected in the genres and audio features of tracks shared on Twitter.
2. **Utilize** unsupervised learning techniques such as K-Means clustering to identify distinct genre-based musical clusters within the tweeted Spotify tracks.
3. **Create** a sonically representative playlist for the year 2022 from the tracks shared by Twitter users, and explore the data-driven curation of music.
4. **Assess** the prominence of specific audio features in shared tracks and compare these trends with those in the broader music industry as reflected in the Billboard charts. This comparison could reveal industry-wide patterns or anomalies and so enrich the understanding of contemporary music consumption trends.

---

## 2. Literature Review

---

## 3. Methodology 

### 3.1 Data Collection

#### 3.1.1 Twitter Data Extraction

The data for this study were collected from Twitter with the `snscrape` tool and cover the whole of 2022. The tool was used to scrape tweets containing links to Spotify tracks, and these tweets form the primary dataset of the study.

The `snscrape` search was limited to tweets containing "open.spotify.com/track" and geographically restricted to the United States. It was further refined by specifying a separate date range for each month of 2022.
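
The month-by-month date ranges described above can be sketched as follows. This is an illustration only, not the exact command used in the study: the `since:`/`until:` search operators are assumed to delimit each month, and `geo_filter` is a hypothetical placeholder for the location operator, which the text does not specify.

```python
from datetime import date

SEARCH_TERM = "open.spotify.com/track"  # search term stated in the text


def monthly_queries(year, geo_filter=""):
    """Build one search query per calendar month of `year`.

    `geo_filter` is a hypothetical placeholder for whatever location
    operator was used; the text does not say which one.
    """
    queries = []
    for month in range(1, 13):
        start = date(year, month, 1)
        # first day of the following month (rolls over to January of year + 1)
        end = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
        parts = [f'"{SEARCH_TERM}"', geo_filter,
                 f"since:{start.isoformat()}", f"until:{end.isoformat()}"]
        # drop the geo filter when it is left empty
        queries.append(" ".join(p for p in parts if p))
    return queries


queries = monthly_queries(2022)
print(queries[0])
print(queries[-1])
```

Building one query per month keeps each scrape small and makes it easy to rerun a single month if a run fails.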

The fields extracted from each tweet include:

- `username`: the Twitter handle of the user posting the tweet.
- `date`: the timestamp of the tweet.
- `rawContent`: the actual content of the tweet.
- `friendsCount`: the number of accounts the user was following ("friends" in Twitter's terminology) at the time of the tweet.
- `followersCount`: the number of followers of the user at the time of the tweet.
- `replyCount`: the number of replies to the tweet.
- `retweetCount`: the number of retweets of the tweet.
- `quoteCount`: the number of quotes of the tweet.
- `place`: the geographical location associated with the tweet.
- `outlinks`: the external links present in the tweet.

The extracted data were then processed with `jq` to create a JSON output of unique users, which was saved to a `us_tweets.json` file for further analysis.
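
The `jq` filter itself is not shown, so the following is only a rough Python equivalent under two assumptions: that the scraper emits one JSON object per line with the user fields nested under a `user` key, and that "unique users" means keeping one record per username. The helper names are hypothetical.

```python
import json

# Field names follow the list above; the nesting under "user" is an assumption
# about the scraper's JSON layout.
TWEET_FIELDS = ["date", "rawContent", "replyCount", "retweetCount",
                "quoteCount", "place", "outlinks"]
USER_FIELDS = ["username", "friendsCount", "followersCount"]


def flatten_tweet(tweet):
    """Project one tweet record onto the ten fields used in this study."""
    user = tweet.get("user") or {}
    row = {name: user.get(name) for name in USER_FIELDS}
    row.update({name: tweet.get(name) for name in TWEET_FIELDS})
    return row


def first_tweet_per_user(jsonl_text):
    """Parse JSON-lines text and keep the first record seen for each username."""
    seen = {}
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        row = flatten_tweet(json.loads(line))
        seen.setdefault(row["username"], row)
    return list(seen.values())


# three hand-written records, two of them from the same user
sample = "\n".join([
    json.dumps({"user": {"username": "a", "friendsCount": 1, "followersCount": 2},
                "date": "2022-01-01", "rawContent": "song one", "outlinks": []}),
    json.dumps({"user": {"username": "a", "friendsCount": 1, "followersCount": 2},
                "date": "2022-01-02", "rawContent": "song two", "outlinks": []}),
    json.dumps({"user": {"username": "b", "friendsCount": 3, "followersCount": 4},
                "date": "2022-01-03", "rawContent": "song three", "outlinks": []}),
])
rows = first_tweet_per_user(sample)
print([r["username"] for r in rows])
```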

Using this data allows an in-depth exploration of evolving trends in music consumption and preference, and shows how social media activity can reflect broader cultural and societal trends.

#### 3.1.2 Spotify Data Extraction

#### 3.1.3 Billboard Data Acquisition

#### Importing Libraries

In [1]:
# data manipulation
import pandas as pd
import numpy as np
import re 
import warnings
import ast 
from collections import Counter

# api requests
import spotipy 
from spotipy.oauth2 import SpotifyClientCredentials
import lyricsgenius
import requests

# nlp libraries
import nltk 
from nltk.tokenize import word_tokenize
from nltk.corpus import words 
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import spacy 
from spacy.lang.en.stop_words import STOP_WORDS

# data visualization 
import plotly.graph_objects as go 
import plotly.io as pio 
import plotly.figure_factory as ff


In [None]:
# tokens, keys, and credentials

# spotipy
BASE_URL = 'https://api.spotify.com/v1/'
AUTH_URL = 'https://accounts.spotify.com/api/token'
SPOTIFY_CLIENT_ID = 'YOUR CLIENT ID'
SPOTIFY_CLIENT_SECRET = 'YOUR CLIENT SECRET'
auth_manager = SpotifyClientCredentials(client_id = SPOTIFY_CLIENT_ID, client_secret = SPOTIFY_CLIENT_SECRET)
sp = spotipy.Spotify(auth_manager = auth_manager)


# lyricsgenius 
GENIUS_CLIENT_ACCESS_TOKEN = 'YOUR GENIUS TOKEN'
genius = lyricsgenius.Genius(GENIUS_CLIENT_ACCESS_TOKEN)


In [None]:
# nltk downloads
nltk.download('punkt') 
nltk.download('vader_lexicon')
nltk.download('words')  # needed for nltk.corpus.words, used to filter dictionary words

# spacy model 
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])


In [2]:
# data visualization themes 
pio.templates.default = 'plotly_dark'

In [None]:
# load the data
df = pd.read_csv('../data/raw/2022-year-us.csv')

In [None]:
# getting a glimpse of the dataset
df.head()

In [None]:
# checking shape and size of dataset
print('Dataset size:', df.shape)
print('Dataset columns:', df.columns)

### 3.2 Data Cleaning

#### 3.2.1 Removing Special Characters 


In [None]:
# text processing
LINK_PATTERN = re.compile(r'http\S+')
SPECIAL_CHAR_PATTERN = re.compile(r'[^\w\s\n]+')
WHITESPACE_PATTERN = re.compile(r'\s+')

In [None]:
def clean_text(text):
    """
    Cleans a given text by removing links, special characters, multiple whitespaces and converting 
    to lowercase.

    Args:
        text (str): The text to clean.

    Returns:
        str: The cleaned text.
    """
    # remove links
    text = LINK_PATTERN.sub('', text)
    # remove special characters
    text = SPECIAL_CHAR_PATTERN.sub('', text)
    # replace multiple whitespaces with a single space
    text = WHITESPACE_PATTERN.sub(' ', text)
    # convert to lowercase
    text = text.lower()
    return text
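
As a quick check of what `clean_text` keeps and discards, the sketch below repeats the three patterns and applies them to a made-up tweet. The sample text is hypothetical.

```python
import re

LINK_PATTERN = re.compile(r'http\S+')
SPECIAL_CHAR_PATTERN = re.compile(r'[^\w\s\n]+')
WHITESPACE_PATTERN = re.compile(r'\s+')


def clean_text(text):
    """Same steps as above: drop links, drop special characters, collapse spaces, lowercase."""
    text = LINK_PATTERN.sub('', text)
    text = SPECIAL_CHAR_PATTERN.sub('', text)
    text = WHITESPACE_PATTERN.sub(' ', text)
    return text.lower()


# hypothetical tweet text, for illustration only
sample = "Now playing: Don't Stop!!! https://open.spotify.com/track/abc123 #NowPlaying"
cleaned = clean_text(sample)
print(cleaned)  # now playing dont stop nowplaying
```

Two side effects are worth keeping in mind: apostrophes are removed, so contractions collapse ("don't" becomes "dont"), and hashtags lose their `#` but keep the tag text. Leading and trailing whitespace is reduced to a single space rather than stripped.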

### 3.3 Data Processing

#### 3.3.1 Processing Track URIs

In [None]:
def extract_track_uris(outlink):
    """
    Extracts track URIs from a given outlink.

    Args:
        outlink (str): The outlink from which to extract track URIs.

    Returns:
        str: The extracted track URI, or None if the outlink contains no track ID.

    Raises:
        Exception: Re-raised if an error occurs while processing the outlink
            (for example, when the outlink is not a string).
    """
    # initialize track_uri to None
    track_uri = None
    try:
        # extract the track ID from the link using regular expressions
        match = re.search(r'(?<=\/track\/)[a-zA-Z0-9]+', outlink)
        if match:
            # build the track URI
            track_uri = 'spotify:track:' + match.group(0)
    except Exception as e:
        # if an error occurs, log it and re-raise to stop execution
        print(f"Error occurred while processing link {outlink}: {e}")
        raise
    # return the track URI, or None if no match was found
    return track_uri
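
The look-behind pattern can be tried on its own. The sketch below uses a condensed, hypothetical helper (`to_track_uri`) and made-up links to show that query parameters are ignored and that non-track links yield `None`.

```python
import re

TRACK_ID_PATTERN = re.compile(r'(?<=/track/)[a-zA-Z0-9]+')


def to_track_uri(outlink):
    """Return 'spotify:track:<id>' for a Spotify track link, else None."""
    match = TRACK_ID_PATTERN.search(outlink)
    return f"spotify:track:{match.group(0)}" if match else None


# hypothetical links, for illustration only
track_link = "https://open.spotify.com/track/abc123XYZ?si=shareToken"
album_link = "https://open.spotify.com/album/def456UVW"
print(to_track_uri(track_link))  # spotify:track:abc123XYZ
print(to_track_uri(album_link))  # None
```

Because the character class is limited to letters and digits, matching stops at the `?` that starts the query string.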

#### 3.3.2 Extracting Audio Features

In [None]:
def get_audio_features(df):
    """
    Retrieves audio features for each track URI in a given DataFrame.

    Args:
        df (pd.DataFrame): The DataFrame containing track URIs.

    Returns:
        pd.DataFrame: A DataFrame containing audio features for each track.
    """
    # extract URIs from the 'track_uris' column
    uris = df['track_uris'].tolist()

    # initialize list to store audio features for each track
    audio_features = []

    # create batches of up to 100 URIs each
    for i in range(0, len(uris), 100):
        batch_uris = uris[i:i+100]

        # filter out None values
        batch_uris = [uri for uri in batch_uris if uri is not None]

        if batch_uris:
            # retrieve audio features for the batch of URIs using the Spotify API
            batch_features = sp.audio_features(batch_uris)
            
            # keep only tracks for which features were returned (the API gives None otherwise)
            audio_features.extend(f for f in batch_features if f is not None)

    # convert audio features list to a Pandas DataFrame
    audio_features_df = pd.DataFrame(audio_features)
    # drop rows with missing values
    audio_features_df.dropna(inplace=True)

    return audio_features_df
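
The batching step can be isolated into a small helper. This is a sketch of the same slicing idea with a hypothetical `batched` function and made-up URIs. It filters missing values before slicing, which is a slight variation on the function above.

```python
def batched(items, size=100):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# 250 hypothetical URIs with a few missing values mixed in
uris = [f"spotify:track:{i}" if i % 50 else None for i in range(250)]
valid_uris = [uri for uri in uris if uri is not None]
batch_sizes = [len(batch) for batch in batched(valid_uris)]
print(batch_sizes)  # [100, 100, 45]
```

Filtering first keeps every batch except the last one full, which can slightly reduce the number of API requests when many URIs are missing.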

#### 3.3.3 Determining Track Info

In [None]:
def get_track_info(df):
    """
    Retrieves track popularity, release year, artist name, song name, and album name for each track URI in a given DataFrame.

    Args:
        df (pd.DataFrame): The DataFrame containing track URIs.

    Returns:
        pd.DataFrame: The input DataFrame with added 'popularity', 'year', 'artist', 'song_name', and 'album_name' columns.

    Note:
        - The function prints the error message when an error occurs while processing a track URI.
        - The function appends None to the 'popularity', 'year', 'artist', 'song_name', and 'album_name' columns when an error occurs.
    """
    # initialize lists to store popularity, release year, artist, song name, and album name
    popularity_list = []
    year_list = []
    artist_list = []
    song_name_list = []
    album_name_list = []

    for uri in df['uri']:
        try:
            # get track info using Spotify API
            track_info = sp.track(uri)
            
            # extract popularity, release year, artist, song name, and album name
            popularity_list.append(track_info.get('popularity'))
            year_list.append(track_info['album']['release_date'][:4] if track_info.get('album') else None)
            artist_list.append(track_info['artists'][0]['name'] if track_info.get('artists') else None)
            song_name_list.append(track_info.get('name'))
            album_name_list.append(track_info['album']['name'] if track_info.get('album') else None)
            
        except spotipy.client.SpotifyException as e:
            # on a Spotify API error, log the error and append None to popularity, year, artist, song name, and album name lists
            print(f"Error occurred while processing URI {uri}: {e}")
            popularity_list.append(None)
            year_list.append(None)
            artist_list.append(None)
            song_name_list.append(None)
            album_name_list.append(None)

    # add popularity, year, artist, song name, and album name to the dataframe
    df['popularity'] = popularity_list
    df['year'] = year_list
    df['artist'] = artist_list
    df['song_name'] = song_name_list
    df['album_name'] = album_name_list
    
    return df
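
The field extraction can be tested without calling the API by separating it into a pure function. The sketch below uses a hypothetical helper name and a hand-written payload that only mimics the keys read by `get_track_info`.

```python
def parse_track_info(track_info):
    """Pull the five fields used above out of one track payload (a dict)."""
    album = track_info.get('album') or {}
    artists = track_info.get('artists') or []
    release_date = album.get('release_date')
    return {
        'popularity': track_info.get('popularity'),
        # release dates start with the four-digit year
        'year': release_date[:4] if release_date else None,
        # only the first-listed artist is kept
        'artist': artists[0]['name'] if artists else None,
        'song_name': track_info.get('name'),
        'album_name': album.get('name'),
    }


# hand-written payload, for illustration only
payload = {
    'name': 'Example Song',
    'popularity': 73,
    'artists': [{'name': 'Example Artist'}, {'name': 'Featured Artist'}],
    'album': {'name': 'Example Album', 'release_date': '2019-11-29'},
}
info = parse_track_info(payload)
print(info)
print(parse_track_info({}))
```

Note that `year` is a string at this point, so it needs converting to an integer before any numeric analysis.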

#### 3.3.4 Retrieving Track Lyrics

In [None]:
def get_lyrics_from_dataframe(df):
    """
    Extracts lyrics for each track in the given DataFrame using the Genius API.

    Args:
        df (pd.DataFrame): The DataFrame containing track URIs.

    Returns:
        pd.DataFrame: The input DataFrame with an added 'lyrics' column.

    Note:
        - The function prints an error message when an error occurs while processing a track URI.
        - The function appends an empty string to the 'lyrics' column when an error occurs or lyrics are not found.
    """
    # loop through each row in the DataFrame
    for index, row in df.iterrows():
        track_uri = row['uri']

        try:
            # get track information from Spotify
            track_info = sp.track(track_uri)
            track_name = track_info['name']
            artist_name = track_info['artists'][0]['name']

            # extract lyrics using Genius API (search_song returns None when no match is found)
            song = genius.search_song(track_name, artist_name)
            lyrics = song.lyrics if song else ''
        except Exception as e:
            # if an error occurred, log the error and set lyrics to an empty string
            print(f"Error occurred while processing URI {track_uri}: {e}")
            lyrics = ''
        
        # clean lyrics
        cleaned_lyrics = clean_text(lyrics)
        
        # add cleaned lyrics to the DataFrame
        df.at[index, 'lyrics'] = cleaned_lyrics
    
    return df
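
One detail of the cleaning step matters for lyrics: `clean_text` removes square brackets but keeps the words inside them, so a section label such as `[Bridge]` would survive as the token `bridge`. Whether the retrieved lyrics contained such labels is not shown here. If they did, a pre-cleaning step along the following lines could remove them. The pattern and helper name are hypothetical.

```python
import re

# matches bracketed markers such as "[Chorus]" or "[Verse 1: Artist]"
SECTION_HEADER_PATTERN = re.compile(r'\[[^\]]*\]')


def strip_section_headers(lyrics):
    """Remove bracketed section markers from raw lyrics text."""
    return SECTION_HEADER_PATTERN.sub(' ', lyrics)


# made-up lyrics, for illustration only
raw = "[Verse 1]\nWalking home tonight\n[Bridge]\nHold on to me"
tokens = strip_section_headers(raw).split()
print(tokens)
```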

#### 3.3.5 Deriving Unique Words from Lyrics

In [None]:
# convert nltk corpus words list to a set for faster lookup
word_list = set(words.words())

In [None]:
def extract_unique_words(df, column_name, stopwords_file):
    """
    Extracts unique words from a specified column in a DataFrame, excluding stop words.

    Args:
        df (pd.DataFrame): The DataFrame containing the text data.
        column_name (str): The column from which to extract unique words.
        stopwords_file (str): The path to a file containing custom stop words.

    Returns:
        pd.DataFrame: The input DataFrame with an added 'unique_words' column.
    """
    # load stop words from file
    with open(stopwords_file) as f:
        custom_stopwords = set([word.strip() for word in f])

    # add all stop words to the stop words set
    stop_words = STOP_WORDS.union(custom_stopwords)

    # fill NA w/ empty strings
    df[column_name] = df[column_name].fillna('')

    # use spaCy's pipe method for batch processing
    docs = list(nlp.pipe(df[column_name].tolist()))

    # tokenize the specified column and create a new column for processed words
    # ('CCONJ' is the coordinating-conjunction tag in current spaCy models; 'CONJ' is kept for older ones)
    df['processed_words'] = [[token.lemma_ for token in doc if token.lemma_.lower() not in stop_words and token.pos_ not in {'CONJ', 'CCONJ', 'DET'}] for doc in docs]

    # extract unique words and remove non-dictionary words
    df['unique_words'] = df['processed_words'].apply(lambda x: [word for word in set(x) if word in word_list])

    # drop the 'processed_words' column
    df.drop('processed_words', axis=1, inplace=True)

    return df
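
The filtering logic can be illustrated without spaCy or NLTK. The sketch below skips lemmatization and part-of-speech filtering and uses tiny stand-in word sets, so it only demonstrates the set operations: stop words are dropped, duplicates collapse, and non-dictionary tokens are discarded.

```python
# tiny stand-ins for the spaCy stop-word list and the NLTK word corpus
STOP_WORDS_DEMO = {'the', 'a', 'and', 'i', 'you'}
DICTIONARY_DEMO = {'love', 'night', 'heart', 'run'}


def unique_dictionary_words(tokens, stop_words, dictionary):
    """Return the sorted set of tokens that are dictionary words and not stop words."""
    kept = {tok.lower() for tok in tokens if tok.lower() not in stop_words}
    return sorted(kept & dictionary)


tokens = "i love the night and you love the nite".split()
result = unique_dictionary_words(tokens, STOP_WORDS_DEMO, DICTIONARY_DEMO)
print(result)  # ['love', 'night']
```

Because each word is kept at most once per track, any later count over `unique_words` measures how many tracks contain a word, not how often the word is sung.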

#### 3.3.6 Identifying Genres

In [None]:
def get_access_token():
    """
    Gets an access token from the Spotify API.

    Returns:
        str: The access token.

    Raises:
        Exception: If the request to the Spotify API fails or the response cannot be decoded as JSON.
    """
    try:
        # POST
        auth_response = requests.post(AUTH_URL, {
            'grant_type': 'client_credentials',
            'client_id': SPOTIFY_CLIENT_ID,
            'client_secret': SPOTIFY_CLIENT_SECRET,
        })

        # convert the response to JSON
        auth_response_data = auth_response.json()

        # save and return the access token
        return auth_response_data['access_token']
    except Exception as e:
        print(f"Error occurred while getting access token: {e}")
        raise e

In [None]:
def extract_subgenres(df_audio):
    """
    Extracts subgenres for each track in a given DataFrame using the Spotify API.

    Args:
        df_audio (pd.DataFrame): The DataFrame containing track URIs.

    Returns:
        pd.DataFrame: A DataFrame containing track URIs, artist URIs, and genres.
    """
    dict_genre = {}

    # setup the Spotipy client
    auth_manager = SpotifyClientCredentials(client_id=SPOTIFY_CLIENT_ID, client_secret=SPOTIFY_CLIENT_SECRET)
    sp = spotipy.Spotify(auth_manager=auth_manager, requests_timeout=10, retries=10)

    # convert uri column to an iterable list
    track_uris = df_audio['uri'].apply(lambda uri: uri.split(':')[-1]).to_list()

    # loop through track URIs and pull artist URI using the API,
    # then use artist URI to pull genres associated with that artist
    # store all these in a dictionary
    for i, t_uri in enumerate(track_uris, start=1):
        print(f"Processing track {i} of {len(track_uris)}")

        dict_genre[t_uri] = {'artist_uri': "", "genres":[]}

        try:
            r = sp.track(t_uri)
            a_uri = r['artists'][0]['uri'].split(':')[-1]
            dict_genre[t_uri]['artist_uri'] = a_uri

            s = sp.artist(a_uri)
            dict_genre[t_uri]['genres'] = s['genres']
        except spotipy.SpotifyException as e:
            print(f"Error occurred while processing URI {t_uri}: {e}")

    # convert dictionary into dataframe with track_uri as the first column
    df_genre = pd.DataFrame.from_dict(dict_genre, orient='index')
    df_genre.insert(0, 'track_uri', df_genre.index)
    df_genre.reset_index(inplace=True, drop=True)

    return df_genre
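
Since genres are attached to artists, tracks by the same artist trigger identical artist lookups. A possible refinement, sketched below with hypothetical names and a stand-in lookup function in place of the API client, is to cache genres per artist.

```python
def genres_by_track(track_artists, fetch_genres):
    """Map each track ID to its artist's genres, looking up each artist once.

    `track_artists` maps track ID -> artist ID; `fetch_genres` is any callable
    that returns a list of genres for an artist ID (for example a wrapper
    around an API client).
    """
    cache = {}
    result = {}
    for track_id, artist_id in track_artists.items():
        if artist_id not in cache:
            cache[artist_id] = fetch_genres(artist_id)
        result[track_id] = cache[artist_id]
    return result


# stand-in lookup that records how often it is called
calls = []
fake_catalog = {'artist_a': ['pop', 'dance pop'], 'artist_b': ['indie rock']}


def fake_fetch(artist_id):
    calls.append(artist_id)
    return fake_catalog[artist_id]


tracks = {'t1': 'artist_a', 't2': 'artist_a', 't3': 'artist_b'}
genres = genres_by_track(tracks, fake_fetch)
print(genres['t2'], len(calls))
```

Three tracks need only two lookups here. On a dataset with many repeated artists the saving in requests could be substantial.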

---

## 4. Data Analysis and Findings

In [35]:
# load processed data

# audio features
US_2022_audio = pd.read_csv('../data/processed/audio-feat/US_2022_audio.csv')

# lyrics and unique words 
US_2022_lyrics = pd.read_csv('../data/processed/lyrics/US_2022_lyrics.csv')

# genres
US_2022_genres = pd.read_csv('../data/processed/genres/US_2022_genres.csv')

# track info
US_2022_info = pd.read_csv('../data/processed/track-info/US_2022_track_info.csv')

# geolocation
US_2022_geoloc = pd.read_csv('../data/processed/location/US_2022_geoloc.csv')

### 4.1 Exploratory Data Analysis

#### 4.1.1 Data Description

#### 4.1.2 Sentiment Analysis

##### Treemap

In [7]:
def count_words(df, sentiment=None):
    """
    Counts the occurrence of unique words in a DataFrame based on specified sentiment.

    Args:
        df (pd.DataFrame): The DataFrame containing the unique words.
        sentiment (str, optional): The sentiment of words to count. Options are 'positive', 'negative', or None.
                                   If None, counts all words. Defaults to None.

    Returns:
        collections.Counter: A Counter object with counts of unique words.
    """
    # initialize SentimentIntensityAnalyzer
    sia = SentimentIntensityAnalyzer()

    # initialize a Counter object
    word_counter = Counter()
    
    # iterate over the 'unique_words' column
    for words_string in df['unique_words']:
        # convert string representation of list to list
        words_list = ast.literal_eval(words_string)

        # check if the list of words is not empty
        if words_list:
            if sentiment == 'positive':
                # filter positive words
                words_list = [word for word in words_list if sia.polarity_scores(word)['compound'] > 0]
            elif sentiment == 'negative':
                # filter negative words
                words_list = [word for word in words_list if sia.polarity_scores(word)['compound'] < 0]

            # update the counter with the list of words
            word_counter.update(words_list)

    return word_counter

def get_top_words(word_counter, n=50):
    """
    Get the top n words from a word counter.

    Args:
        word_counter (collections.Counter): A Counter object with counts of unique words.
        n (int, optional): The number of top words to return. Defaults to 50.

    Returns:
        list: A list of top n words.
    """
    # get the n most common words
    top_words = word_counter.most_common(n)
    
    # prepare a list of words
    words = [item[0] for item in top_words]

    return words

def create_treemap(word_counter, title, color_scale):
    """
    Creates a treemap visualization of word counts.

    Args:
        word_counter (collections.Counter): A Counter object with counts of unique words.
        title (str): The title of the treemap.
        color_scale (str): The colorscale of the treemap.
    """
    # get the 50 most common words
    top_50_words = word_counter.most_common(50)
    
    # prepare data for the treemap
    words = [item[0] for item in top_50_words]
    counts = [item[1] for item in top_50_words]

    # create a treemap
    fig = go.Figure(go.Treemap(
        labels=words,
        parents=[""]*len(words),
        values=counts,
        marker=dict(
            colors=counts,  # set color 
            colorscale=color_scale,  # choose a colorscale
        ),
        hovertemplate='<b>%{label} </b> <br> Count: %{value}',
        name=''
    ))

    # update layout
    fig.update_layout(
        title=title,
        autosize=False,
        width=500,
        height=500,
    )

    fig.show()

def treemap(df):
    """
    Creates a treemap visualization of the top 50 unique words in a DataFrame.

    Args:
        df (pd.DataFrame): The DataFrame containing the unique words.
    """
    word_counter = count_words(df)
    create_treemap(word_counter, "Top 50 Unique Words", 'plotly3')

def neg_sent_treemap(df):
    """
    Creates a treemap visualization of the top 50 negative words in a DataFrame.

    Args:
        df (pd.DataFrame): The DataFrame containing the unique words.
    """
    word_counter = count_words(df, 'negative')
    create_treemap(word_counter, "Top 50 Negative Words", 'burg')

def pos_sent_treemap(df):
    """
    Creates a treemap visualization of the top 50 positive words in a DataFrame.

    Args:
        df (pd.DataFrame): The DataFrame containing the unique words.
    """
    word_counter = count_words(df, 'positive')
    create_treemap(word_counter, "Top 50 Positive Words", 'aggrnyl')


def get_sentiment_scores(words):
    """
    Compute sentiment scores (polarity) for a list of words using nltk's SentimentIntensityAnalyzer.

    Args:
        words (list): A list of words to compute sentiment scores for.

    Returns:
        pd.DataFrame: A DataFrame with words as the index and their associated sentiment scores as the values.
    """
    # initialize SentimentIntensityAnalyzer
    sia = SentimentIntensityAnalyzer()

    # Compute sentiment scores
    sentiment_scores = {word: sia.polarity_scores(word)['compound'] for word in words}

    # Create a DataFrame from the sentiment scores
    df = pd.DataFrame(list(sentiment_scores.items()), columns=['word', 'sentiment_score'])

    return df

In [8]:
treemap(US_2022_lyrics)

In [9]:
neg_sent_treemap(US_2022_lyrics)

In [10]:
pos_sent_treemap(US_2022_lyrics)

In [11]:
# Get word counters
word_counter_all = count_words(US_2022_lyrics)
word_counter_negative = count_words(US_2022_lyrics, 'negative')
word_counter_positive = count_words(US_2022_lyrics, 'positive')

# Get top 50 words from each category
top_50_words_all = get_top_words(word_counter_all)
top_50_words_negative = get_top_words(word_counter_negative)
top_50_words_positive = get_top_words(word_counter_positive)

# Get sentiment scores for each category and print them
df_scores_all = get_sentiment_scores(top_50_words_all)
df_scores_negative = get_sentiment_scores(top_50_words_negative)
df_scores_positive = get_sentiment_scores(top_50_words_positive)

print("Sentiment scores for all words:\n", df_scores_all)
print("\nSentiment scores for negative words:\n", df_scores_negative)
print("\nSentiment scores for positive words:\n", df_scores_positive)


Sentiment scores for all words:
       word  sentiment_score
0     love           0.6369
1     time           0.0000
2     feel           0.0000
3     baby           0.0000
4     good           0.4404
5     life           0.0000
6      day           0.0000
7      man           0.0000
8   bridge           0.0000
9    leave          -0.0516
10     low          -0.2732
11   thing           0.0000
12   night           0.0000
13    girl           0.0000
14   break           0.0000
15    mind           0.0000
16     win           0.5859
17     eye           0.0000
18   heart           0.0000
19    find           0.0000
20     run           0.0000
21   light           0.0000
22     bad          -0.5423
23   world           0.0000
24    lose          -0.4019
25    live           0.0000
26    hold           0.0000
27    long           0.0000
28    head           0.0000
29     big           0.0000
30    real           0.0000
31    talk           0.0000
32     boy           0.0000
33  ticket     

##### Genre Bar Chart


In [53]:
def genre_count(df):
    """
    Creates a horizontal bar chart visualization of the top 25 genres in a DataFrame.

    Args:
        df (pd.DataFrame): The DataFrame containing the genres.

    Note:
        This function does not return any values. It directly shows the plot using plotly.
    """
    # initialize a Counter object
    genre_counter = Counter()
    
    # iterate over the 'genres' column
    for genres_string in df['genres']:
        # convert string representation of list to list
        genres_list = ast.literal_eval(genres_string)

        # check if the list of genres is not empty
        if genres_list:
            # update the counter with the list of genres
            genre_counter.update(genres_list)
    
    # get the 25 most common genres
    top_25_genres = genre_counter.most_common(25)
    
    # prepare data for the bar chart
    genres = [item[0] for item in top_25_genres]
    counts = [item[1] for item in top_25_genres]

    # create a horizontal bar chart
    fig = go.Figure(data=[go.Bar(
        y=genres,
        x=counts,
        orientation='h',
        marker=dict(
            color=counts,  # set color 
            colorscale='plotly3',  # choose a colorscale
            line_width=0
        )
    )])

    # update layout
    fig.update_layout(
        title="Top 25 Genres",
        xaxis_title="Count",
        yaxis_title="Genre",
        yaxis={'categoryorder':'total ascending', 'nticks': len(genres)},
    )

    fig.show()
    
def count_unique_genres(df):
    """
    Counts the number of unique genres in a DataFrame and prints the result.

    Args:
        df (pd.DataFrame): The DataFrame containing the genres.
    """
    # initialize a set to store unique genres
    unique_genres = set()

    # iterate over the 'genres' column
    for genres_string in df['genres']:
        # convert string representation of list to list
        genres_list = ast.literal_eval(genres_string)

        # update the set with the list of genres
        unique_genres.update(genres_list)

    # print the count of unique genres
    print(f"The number of unique genres is: {len(unique_genres)}")


In [37]:
genre_count(US_2022_genres)

In [55]:
count_unique_genres(US_2022_genres)


The number of unique genres is: 2033
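
The counting pattern used by both functions can be seen on a toy column. The rows below are hypothetical and are written the way a CSV round-trip stores Python lists, which is why `ast.literal_eval` is needed.

```python
import ast
from collections import Counter

# hypothetical rows, stored as string representations of lists
genre_column = ["['pop', 'dance pop']", "[]", "['pop', 'indie rock']"]

genre_counter = Counter()
for genres_string in genre_column:
    # turn "['pop', 'dance pop']" back into a real list before counting
    genre_counter.update(ast.literal_eval(genres_string))

print(genre_counter.most_common(1))  # [('pop', 2)]
print(len(genre_counter))            # 3
```

A track contributes one count to every genre of its artist, so the bar lengths in the chart add up to more than the number of tracks.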


#### 4.1.3 Data Visualization 

##### Audio Radar Chart  

In [15]:
def radar_chart(df):
    """
    Creates a radar chart visualization of the mean values of various features in a DataFrame.

    Note:
        This function does not return any values. It directly shows the plot using plotly.
    """
    # selected features
    cols = ['danceability', 'energy', 'liveness', 'valence', 'acousticness', 'speechiness', 'instrumentalness']

    # subset dataframe
    df = df[cols]

    # normalize to 0-1 range and take mean values
    df_mean = ((df - df.min()) / (df.max() - df.min())).mean()

    # convert to 0-100 scale for readability
    df_mean = (df_mean * 100).round(2)

    # single color for all lines
    line_color = '#ff00a0'

    # prepare custom data for hovertemplate
    hover_data = ['{}: {}'.format(col, val) for col, val in zip(cols, df_mean.values)]

    # build radar chart
    fig = go.Figure(data=go.Scatterpolar(
        r=df_mean.values,
        theta=cols,
        customdata=np.array(hover_data),
        hovertemplate='%{customdata}<extra></extra>',
        fill='toself',
        line_color=line_color
    ))

    fig.update_layout(
        polar=dict(
            radialaxis=dict(
                visible=True,
                range=[0, 100] 
            )),
        showlegend=False
    )

    fig.show()

In [38]:
radar_chart(US_2022_audio)
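
The values plotted on the radar chart are means of min-max scaled features, not raw means. The plain-Python sketch below uses a hypothetical helper and made-up values to show what that implies.

```python
def min_max_mean(values):
    """Min-max scale `values` to [0, 1], then return the mean on a 0-100 scale."""
    lo, hi = min(values), max(values)
    scaled = [(v - lo) / (hi - lo) for v in values]
    return round(100 * sum(scaled) / len(scaled), 2)


# two hypothetical feature columns with different raw means (0.5 and 0.75)
low_range = [0.25, 0.5, 0.75]
high_range = [0.5, 0.75, 1.0]
print(min_max_mean(low_range))   # 50.0
print(min_max_mean(high_range))  # 50.0
```

Both columns give the same result because scaling measures each value relative to the observed minimum and maximum. The chart therefore shows where the average track sits within the range seen in this dataset. The sketch also assumes the values are not all identical, since that would make the denominator zero.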

##### Audio Heat Map 

In [56]:
def audio_heatmap(df):
    """
    Creates a heatmap from a DataFrame based on the audio feature columns.

    Args:
        df (pd.DataFrame): The DataFrame to analyze.

    Note:
        This function does not return any values. It directly shows the plot using plotly.
    """
    # selected features
    cols = ['danceability', 'energy', 'liveness', 'valence', 'acousticness', 'speechiness', 'instrumentalness']

    # subset dataframe
    df_subset = df[cols]

    # compute correlation matrix
    corr_matrix = df_subset.corr()

    # create a heatmap
    fig = ff.create_annotated_heatmap(
        z=corr_matrix.values,
        x=list(corr_matrix.columns),
        y=list(corr_matrix.index),
        annotation_text=corr_matrix.round(2).values,
        colorscale='plotly3',
        showscale=True,
        reversescale=True
    )

    # update layout
    fig.update_layout(
        title='Correlation Matrix',
        width=800, 
        height=700,
    )

    fig.show()
    
def compute_correlation_matrix(df):
    """
    Computes a correlation matrix for the audio feature columns in a DataFrame.

    Args:
        df (pd.DataFrame): The DataFrame to analyze.

    Returns:
        pd.DataFrame: The correlation matrix as a DataFrame.
    """
    # selected features
    cols = ['danceability', 'energy', 'liveness', 'valence', 'acousticness', 'speechiness', 'instrumentalness']

    # subset dataframe
    df_subset = df[cols]

    # compute correlation matrix
    corr_matrix = df_subset.corr()

    return corr_matrix


In [39]:
audio_heatmap(US_2022_audio)

In [57]:
compute_correlation_matrix(US_2022_audio)

| | danceability | energy | liveness | valence | acousticness | speechiness | instrumentalness |
|---|---|---|---|---|---|---|---|
| danceability | 1.0 | -0.012935 | -0.120451 | 0.38246 | -0.151883 | 0.209669 | -0.14911 |
| energy | -0.012935 | 1.0 | 0.184762 | 0.26967 | -0.60629 | 0.069139 | -0.026353 |
| liveness | -0.120451 | 0.184762 | 1.0 | 0.025412 | -0.078038 | 0.085743 | -0.021665 |
| valence | 0.38246 | 0.26967 | 0.025412 | 1.0 | -0.093878 | 0.061986 | -0.130159 |
| acousticness | -0.151883 | -0.60629 | -0.078038 | -0.093878 | 1.0 | -0.060241 | 0.044685 |
| speechiness | 0.209669 | 0.069139 | 0.085743 | 0.061986 | -0.060241 | 1.0 | -0.155616 |
| instrumentalness | -0.14911 | -0.026353 | -0.021665 | -0.130159 | 0.044685 | -0.155616 | 1.0 |
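
Each entry in the matrix above is a Pearson correlation coefficient, which is what `DataFrame.corr()` computes by default. The sketch below spells out the calculation in plain Python on made-up values.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # sums of products of deviations from the means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# hypothetical values: acousticness falls as energy rises
energy = [0.25, 0.5, 0.75, 1.0]
acousticness = [1.0, 0.75, 0.5, 0.25]
r = pearson(energy, acousticness)
print(round(r, 6))  # -1.0
```

A perfectly inverse relationship gives -1. The observed value of about -0.61 between energy and acousticness indicates a fairly strong but far from perfect inverse relationship.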


##### Tempo Histogram

In [20]:
def tempo_hist(df):
    """
    Creates a histogram-like bar chart of the distribution of tempo values in a DataFrame.

    Note:
        This function does not return any values. It directly shows the plot using plotly.
    """
    # round the 'tempo' column to the nearest whole number
    df['tempo'] = df['tempo'].round(0)

    # count occurrences of each tempo
    tempo_counts = df['tempo'].value_counts()
    tempos = tempo_counts.index
    counts = tempo_counts.values
    max_count = max(counts)

    # create a bar chart (which looks like a histogram)
    fig = go.Figure(data=go.Bar(
        x=tempos,
        y=counts,
        marker_color=[f'rgba(255, 0, 160, {i/max_count})' for i in counts]  # simulate gradient
    ))
    
    # update layout
    fig.update_layout(
        title="Tempo Distribution",
        xaxis_title="Tempo",
        yaxis_title="Count",
        bargap=0.05  # Gap between bars
    )

    fig.show()

In [42]:
tempo_hist(US_2022_audio)

In [58]:
US_2022_audio['tempo'].describe()

count    11211.000000
mean       121.163233
std         29.435554
min         40.000000
25%         97.000000
50%        120.000000
75%        140.000000
max        220.000000
Name: tempo, dtype: float64

##### Key Histogram 

In [43]:
def plot_key_distribution(df):
    """
    Creates a bar chart of the distribution of keys in a DataFrame.

    Note:
        This function does not return any values. It directly shows the plot using plotly.
    """
    # create a dictionary to map integers to musical keys
    key_mapping = {
        0: "C",
        1: "C♯/D♭",
        2: "D",
        3: "D♯/E♭",
        4: "E",
        5: "F",
        6: "F♯/G♭",
        7: "G",
        8: "G♯/A♭",
        9: "A",
        10: "A♯/B♭",
        11: "B"
    }

    # create a copy of the DataFrame to avoid modifying the original one
    df_copy = df.copy()
    
    # replace integer keys with musical keys in the 'key' column of the copied DataFrame
    df_copy['key'] = df_copy['key'].map(key_mapping)

    # count the occurrences of each key
    key_counts = df_copy['key'].value_counts().reset_index()
    key_counts.columns = ['key', 'count']

    # create a bar chart
    fig = go.Figure(data=go.Bar(
        x=key_counts['key'],
        y=key_counts['count'],
        marker_color='rgba(5, 215, 243, 1)'  # single color for all bars
    ))

    # update layout
    fig.update_layout(
        title="Key Distribution",
        xaxis_title="Key",
        yaxis_title="Count",
        bargap=0.05  # gap between bars
    )

    fig.show()


In [59]:
US_2022_audio['key'].describe()

count    11211.000000
mean         5.356703
std          3.608056
min          0.000000
25%          2.000000
50%          5.000000
75%          9.000000
max         11.000000
Name: key, dtype: float64

In [44]:
plot_key_distribution(US_2022_audio)

##### Popularity and Year Histogram 

In [45]:
def pop_hist(df):
    """
    Creates a histogram-like bar chart of the distribution of popularity values in a DataFrame.

    Note:
        This function does not return any values. It directly shows the plot using plotly.
    """
    # exclude 0 popularity
    df = df[df['popularity'] != 0]

    # count occurrences of each popularity score
    popularity_counts = df['popularity'].value_counts()
    popularity_scores = popularity_counts.index
    counts = popularity_counts.values
    max_count = max(counts)

    # create chart 
    fig = go.Figure(data=go.Bar(
        x=popularity_scores,
        y=counts,
        marker_color=[f'rgba(5, 215, 243, {i**0.5/max_count**0.5})' for i in counts]  # simulate gradient
    ))
    
    # update layout
    fig.update_layout(
        title="Popularity Distribution",
        xaxis_title="Popularity",
        yaxis_title="Count",
        bargap=0.05 
    )

    fig.show()

In [60]:
US_2022_info['popularity'].describe()

count    11211.000000
mean        43.957809
std         25.916302
min          0.000000
25%         25.000000
50%         48.000000
75%         64.000000
max         95.000000
Name: popularity, dtype: float64

In [47]:
pop_hist(US_2022_info)

In [62]:
def year_hist(df):
    """
    Creates a horizontal bar chart of the distribution of years in a DataFrame.

    Args:
        df (pd.DataFrame): The DataFrame to analyze.

    Note:
        This function does not return any values. It directly shows the plot using plotly.
    """
    # count occurrences of each year
    counter = Counter(df['year'])
    
    # get keys and values from the counter and sort based on the keys (years)
    years_counts = sorted(counter.items())
    years = [item[0] for item in years_counts]
    counts = [item[1] for item in years_counts]
    
    max_count = max(counts)

    # create a horizontal bar chart
    fig = go.Figure(data=go.Bar(
        y=years,  # switched with x
        x=counts,  # switched with y
        marker_color=[f'rgba(255, 18, 79,  {i**0.5/max_count**0.5})' for i in counts],  # simulate gradient
        orientation='h'  # This makes the bars horizontal
    ))

    # update layout
    fig.update_layout(
        title="Year Distribution",
        xaxis_title="Count",  # switched with yaxis
        yaxis_title="Year",  # switched with xaxis
        bargap=0.05  # gap between bars
    )

    fig.show()

In [63]:
US_2022_info['year'].describe()

count    11211.000000
mean      2012.848809
std         12.623530
min       1950.000000
25%       2009.000000
50%       2019.000000
75%       2022.000000
max       2023.000000
Name: year, dtype: float64

In [49]:
year_hist(US_2022_info)

##### Geolocation Bubble Map 

In [None]:
def cities_data(df, cities_filepath):
    """
    Merges a DataFrame with a cities DataFrame.

    Args:
        df (pd.DataFrame): The DataFrame to merge. Must contain a 'fullName' column.
        cities_filepath (str): The file path of the cities CSV file.

    Returns:
        pd.DataFrame: The merged DataFrame.
    """
    # load the cities CSV file
    cities_df = pd.read_csv(cities_filepath)

    # split the 'fullName' into two separate columns: 'city' and 'state_id'
    df[['city', 'state_id']] = df['fullName'].str.split(', ', expand=True)

    # convert city and state_id to lowercase in both dataframes to avoid mismatch due to case differences
    df[['city', 'state_id']] = df[['city', 'state_id']].apply(lambda x: x.str.lower())
    cities_df[['city', 'state_id']] = cities_df[['city', 'state_id']].apply(lambda x: x.str.lower())

    # merge the dataframes on 'city' and 'state_id'
    df = pd.merge(df, cities_df[['city', 'state_id', 'lat', 'lng']], on=['city', 'state_id'], how='left')

    return df


In [50]:
def create_bubble_map(df):
    """
    Creates a bubble map showing the geographical distribution of Spotify streams.

    Args:
        df (pd.DataFrame): The DataFrame containing the 'city', 'state_id', 'lat', and 'lng' columns.

    Returns:
        plotly.graph_objects.Figure: The bubble map figure.
    """
    # count the number of occurrences of each city and state
    df_counts = df.groupby(['city', 'state_id', 'lat', 'lng']).size().reset_index(name='counts')

    # create the bubble map
    fig = go.Figure(data=go.Scattergeo(
        lat=df_counts['lat'],
        lon=df_counts['lng'],
        text=df_counts['city'] + ', ' + df_counts['state_id'].str.upper(),
        mode='markers',
        marker=dict(
            size=df_counts['counts'],
            color=df_counts['counts'],  # assign color values as actual counts
            colorscale='tealgrn',  # apply the colorscale
            sizemode='area',
            sizeref=2. * df_counts['counts'].max() / (40. ** 2), 
            sizemin=4,
            showscale=True,
            colorbar=dict(title="Counts"),
            cmin=df_counts['counts'].min(),  
            cmax=df_counts['counts'].max()  
        )
    ))

    # update layout to focus on USA
    fig.update_geos(scope='usa')

    # additional layout settings
    fig.update_layout(
        title={
            'text': "Geographical Distribution of Spotify Stream Tweets",
            'x': 0.5,  # center the title
            'xanchor': 'center'  # specify the 'x' as the center
        },
        geo=dict(
            landcolor='rgb(217, 217, 217)',
            subunitcolor='rgb(217, 217, 217)',
            countrycolor='rgb(217, 217, 217)',
            showlakes=True,
            lakecolor='rgb(255, 255, 255)',
            showsubunits=True)
    )

    return fig

In [51]:
create_bubble_map(US_2022_geoloc)

#### 4.1.4 Key Observations

**Lyrics**:
- In the analysis of all words, the most common words include "love", "time", "feel", "baby", and "good", which make up a significant portion of the total word count. Among the top 50 words, "love" has the highest sentiment score (0.6369), underscoring the pervasiveness of love as a theme in music.

- Further sentiment analysis of all words shows that many of the frequently used words have neutral sentiment scores. For example, the words "time", "feel", "baby", and "life" all have a sentiment score of 0. This trend suggests a propensity for lyrics to employ words that are emotionally neutral, neither veering towards strong positivity nor negativity.

- Upon categorizing words based on sentiment, unique patterns emerge. The most common negative words encompass "leave", "low", "bad", "lose", and "bitch", with sentiment scores ranging from -0.0516 to -0.5859. These words, many of which denote loss, negativity, and conflict, indicate a prevalence of such themes in the music shared on Twitter.

- In contrast, the most frequent positive words include "love", "good", "win", "play", and "hand", with sentiment scores ranging from 0.3400 to 0.6369. These words, symbolizing love, positivity, victory, and cooperation, suggest that these positive themes are as pervasive in music as their negative counterparts.

- Ultimately, this provides insight into the diverse themes and sentiments in the music shared on Twitter, revealing a complex mix of positivity, negativity, love, conflict, success, and loss. 


**Audio Features**:
- The tempo of the songs, measured in beats per minute (BPM), has a mean of approximately 121.16 BPM and a standard deviation of about 29.44, with the middle half of tracks falling between 97 and 140 BPM. This suggests a preference for moderately paced songs among users. The tempo range is quite wide, from 40 BPM to 220 BPM, indicating a diversity of musical styles and moods.

- The popularity of the songs, ranging from 0 to 95, has a mean score of approximately 43.96 and a median of 48. The 75th percentile is 64, meaning that a quarter of the shared songs have a popularity score of 64 or higher. This suggests that Twitter users lean toward sharing music that is already well-received and popular.

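- The year of the songs spans from 1950 to 2023, with a mean of about 2012.85. The median is 2019 and the 75th percentile is 2022, so at least half of the shared songs were released in 2019 or later, while a long tail of older tracks pulls the mean down. This indicates a preference for relatively recent music, reflecting the evolving tastes of listeners and the dynamism of the music industry.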

- The key of the songs has a mean value of approximately 5.36 and a median of 5, which corresponds to "F" in the provided key mapping. Because the key is a categorical pitch-class code rather than a quantity, this average does not by itself show a preference for F; the counts in the key distribution chart are the better guide to which keys are most common. The standard deviation of 3.61 and the full range from 0 ("C") to 11 ("B") indicate a broad diversity of musical keys in the shared songs.

- The mean audio features present an interesting pattern. Energy and danceability, with mean values of 63.53 and 59.46 respectively, are relatively high. This suggests a preference for energetic and danceable music. In contrast, instrumentalness is quite low (mean value of 8.02), indicating a preference for music with vocals. Speechiness, acousticness, valence, and liveness have moderate mean values, suggesting a balanced mix of these elements in the music shared on Twitter.
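
Since the `key` field is a pitch-class code rather than a quantity, the most frequent key is usually a more informative summary than the mean. The following is a minimal sketch of that idea using only the standard library. The helper `most_common_key` is hypothetical, and the `sample_keys` values are made up for illustration; they are not drawn from `US_2022_audio`.

```python
from collections import Counter

# same integer-to-pitch-class mapping used in plot_key_distribution
KEY_MAPPING = {
    0: "C", 1: "C♯/D♭", 2: "D", 3: "D♯/E♭", 4: "E", 5: "F",
    6: "F♯/G♭", 7: "G", 8: "G♯/A♭", 9: "A", 10: "A♯/B♭", 11: "B",
}


def most_common_key(keys):
    """Return (key name, count) for the most frequent key code (hypothetical helper)."""
    code, count = Counter(keys).most_common(1)[0]
    return KEY_MAPPING[code], count


# illustrative values only, not taken from the dataset
sample_keys = [0, 0, 0, 0, 11, 11, 11, 9, 9, 2]

# the mean lands near code 5 ("F") even though no sample track is in F
mean_code = sum(sample_keys) / len(sample_keys)

# the mode reports the key that actually occurs most often
top_key, top_count = most_common_key(sample_keys)
print(f"mean code: {mean_code:.1f}, most common key: {top_key} ({top_count} tracks)")
```

In this toy sample the mean points at a key that never occurs, which is why the bar chart of counts (or the mode) is the safer basis for statements about key preference.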

**Correlations between Audio Features**:
- Danceability and Valence show a moderate positive correlation (0.38), indicating that tracks with higher danceability tend to have more positive or happier content.

- There's a strong negative correlation (-0.61) between Energy and Acousticness. This suggests that tracks with higher energy are typically less acoustic and more likely to involve electronic or amplified instruments.

- Speechiness shows a mild positive correlation with Danceability (0.21), suggesting that more danceable tracks tend to have more spoken words.

- Instrumentalness has a mild negative correlation with Speechiness (-0.16) and Danceability (-0.15), indicating that tracks with a higher proportion of instrumental content are less likely to be danceable or have spoken words.

- Liveness shows a mild positive correlation with Energy (0.18), suggesting that more energetic tracks are slightly more likely to have been recorded with a live audience.
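
One way to check which pairs deserve discussion is to rank every feature pair by the absolute value of its correlation. The sketch below does this with the values copied from the correlation matrix reported earlier; the `CORR` dictionary and the `strongest_pairs` helper are illustrative names, not part of the original notebook.

```python
# pairwise correlations copied from the reported matrix (upper triangle only)
CORR = {
    ("danceability", "energy"): -0.012935,
    ("danceability", "liveness"): -0.120451,
    ("danceability", "valence"): 0.38246,
    ("danceability", "acousticness"): -0.151883,
    ("danceability", "speechiness"): 0.209669,
    ("danceability", "instrumentalness"): -0.14911,
    ("energy", "liveness"): 0.184762,
    ("energy", "valence"): 0.26967,
    ("energy", "acousticness"): -0.60629,
    ("energy", "speechiness"): 0.069139,
    ("energy", "instrumentalness"): -0.026353,
    ("liveness", "valence"): 0.025412,
    ("liveness", "acousticness"): -0.078038,
    ("liveness", "speechiness"): 0.085743,
    ("liveness", "instrumentalness"): -0.021665,
    ("valence", "acousticness"): -0.093878,
    ("valence", "speechiness"): 0.061986,
    ("valence", "instrumentalness"): -0.130159,
    ("acousticness", "speechiness"): -0.060241,
    ("acousticness", "instrumentalness"): 0.044685,
    ("speechiness", "instrumentalness"): -0.155616,
}


def strongest_pairs(corr, n=5):
    """Return the n pairs with the largest absolute correlation (hypothetical helper)."""
    ranked = sorted(corr.items(), key=lambda item: abs(item[1]), reverse=True)
    return ranked[:n]


for (a, b), r in strongest_pairs(CORR):
    print(f"{a:>14} vs {b:<16} r = {r:+.2f}")
```

Ranking by magnitude rather than sign keeps strong negative relationships, such as energy and acousticness, at the top of the list alongside the positive ones.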

**Genres**:
- The genre with the highest occurrence is 'rap', appearing in 1,311 instances. This indicates a strong preference for rap music among Twitter users who share Spotify links.

- Following closely, 'pop' and 'hip hop' genres are also quite popular, appearing in 1,059 and 985 instances respectively. This suggests a significant interest in these genres.

- The 'rock' and 'trap' genres, while less frequent than the top three genres, still appear in a substantial number of instances (732 and 513 respectively), indicating their relevance in the music shared on Twitter.

- The total number of unique genres identified is 2,033. This high number points to a remarkable diversity in the music genres shared on Twitter, highlighting the platform's role in disseminating a wide range of music styles.
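
Counts like these can be produced by tallying every genre label attached to every track. The sketch below assumes each track carries a list of genre strings, which is an assumption about the data layout rather than something the notebook states here; `count_genres` is a hypothetical helper and `track_genres` is toy data, not the study's dataset.

```python
from collections import Counter


def count_genres(track_genres):
    """Tally genre labels across tracks and report how many distinct labels exist."""
    # flatten the per-track lists so each label occurrence is counted once
    counts = Counter(genre for genres in track_genres for genre in genres)
    return counts, len(counts)


# toy data for illustration only; a track may have several genres or none
track_genres = [
    ["rap", "hip hop", "trap"],
    ["pop"],
    ["rap", "pop"],
    ["rock"],
    [],
]

counts, n_unique = count_genres(track_genres)
print(counts.most_common(3))
print(f"{n_unique} unique genres")
```

Because a single track can carry several labels, occurrence counts of this kind can sum to more than the number of tracks, which is worth keeping in mind when comparing genres.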


### 4.2 Cluster Analysis