# **Set-up:**

Install these additional packages from the terminal:

```bash
pip install scikit-learn
pip install rpy2
pip install spotipy
pip install nltk
pip install wordcloud
```

Import necessary packages

*⚠️ Note: Do not run this more than once. Restart the kernel before running this code chunk.*

In [2]:
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
from wordcloud import WordCloud
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from plotnine import *
import plotnine as p9
import re
from scrapy import Selector
import requests
import json
import statsmodels.api as sm
import spotipy
import base64
from requests import post
from spotipy.oauth2 import SpotifyClientCredentials
from datetime import datetime
from sklearn import *
from base64 import *

import os
os.chdir(os.path.expanduser("../"))                 # change directory to main project directory

from dees_package.spotify_functions import *  

Check that we are in the correct current working directory

*⚠️ Note: We should be in the main project directory*

In [None]:
print("Current working directory:", os.getcwd())

Open JSON file containing credentials

*⚠️ Note: Our credentials should be stored in a file named `credentials.json` in the root of the project folder*

In [None]:
credentials_file_path = './credentials.json'

with open(credentials_file_path, 'r') as f:
    credentials = json.load(f)
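
The Spotify cells further down read `client_id` and `client_secret` from this dictionary. As a rough sketch, a check like the one below could fail early when either key is missing. `validate_credentials` is a hypothetical helper (not part of `dees_package`), and the sample values are placeholders.

```python
import json

REQUIRED_KEYS = ("client_id", "client_secret")  # keys read by the Spotify cells

def validate_credentials(creds):
    """Return creds unchanged, or raise KeyError naming the missing keys."""
    missing = [key for key in REQUIRED_KEYS if not creds.get(key)]
    if missing:
        raise KeyError(f"credentials.json is missing: {', '.join(missing)}")
    return creds

# Placeholder content in the shape credentials.json is assumed to have
sample = json.loads('{"client_id": "abc123", "client_secret": "s3cret"}')
checked = validate_credentials(sample)
```

In the notebook this would be called as `validate_credentials(credentials)` right after loading the file.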

# Expand Dataframe from merged YouTube Data



## **For Ruikai:**
All the raw and (almost fully) cleaned data that we get from YouTube + Genius is in the CSV file `raw_compiled_data.csv`.
I edited your code a little to take out the cleaning steps that were already done.

In [None]:
scraped_df = pd.read_csv('../data/raw_compiled_data.csv')
new_merge = scraped_df.copy()   # assumption: the cells below refer to this dataframe as new_merge

### Under 'wikipedia_categories', there are separate links for different potential genres

We have noticed that:
* Links are separated by a comma ','
* Every song has at least one category, 'music'
* Most songs have only one category, some have two, and songs with three or more categories are extremely rare

Therefore, we can count the commas and add one to get the number of categories (N comma-separated links contain N - 1 commas), with the function as such:

*⚠️ Note for Ruikai: sorry, but this one will need to be redone. The Wikipedia categories are all in one string, so this doesn't run.*

In [None]:
def get_category_number(x):
    # N comma-separated links contain N - 1 commas, so add one to the comma count
    string = str(x)
    return string.count(',') + 1
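
A string with N comma-separated links contains N - 1 commas, and a missing value has no links at all. The sketch below shows one way to handle both cases, assuming the column holds plain comma-separated strings. `count_categories` is a hypothetical name, and the sample strings are made up.

```python
def count_categories(raw):
    """Count comma-separated entries in a category string; blanks count as zero."""
    if not isinstance(raw, str):
        return 0  # e.g. NaN from pandas for a missing value
    parts = [part.strip() for part in raw.split(",")]
    return sum(1 for part in parts if part)

# Made-up category strings for illustration
one = count_categories("https://en.wikipedia.org/wiki/Music")
two = count_categories(
    "https://en.wikipedia.org/wiki/Music, https://en.wikipedia.org/wiki/Electronic_music"
)
none = count_categories(float("nan"))
```

This would miscount any link that itself contains a comma, so it is only a starting point.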

In [None]:
new_merge['category_number'] = new_merge['wikipedia_categories'].apply(lambda x: get_category_number(x))

In [None]:
new_merge.head()

In [None]:
new_merge2 = new_merge.head(150).copy()   # copy so a new column can be added without a SettingWithCopyWarning

In [None]:
big_merge = new_merge.head(200).copy()    # copy so a new column can be added without a SettingWithCopyWarning
big_merge['lyrics'] = big_merge.apply(lambda row: scrape_lyrics(my_session, row['Genius_URL']), axis=1)

In [None]:
new_merge2['lyrics'] = new_merge2.apply(lambda row: scrape_lyrics(my_session, row['Genius_URL']), axis=1)

new_merge2.head()

# Clean and Analyse Data

### We create a new dataframe with only the necessary columns, removing 'None' values and duplicates

In [None]:
new_merge3 = new_merge2.dropna()

df = new_merge3[['Artist', 'Song', 'like_count', 'view_count', 'comment_count', 'lyrics', 'category_number']].dropna().drop_duplicates(subset = ['Song'])

df = df[df['lyrics'] != '']

df.head()

### Analyse sentiment with the imported package

We create a function and apply it to the dataframe

In [None]:
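nltk.download('vader_lexicon', quiet=True)   # lexicon required by the VADER analyser
sid = SentimentIntensityAnalyzer()

def get_sentiment_score(lyric):
    # Returns the VADER scores in the order [negative, neutral, positive, compound]
    scores = sid.polarity_scores(lyric)
    return [scores['neg'], scores['neu'], scores['pos'], scores['compound']]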

In [None]:
# Score each lyric once, then unpack [neg, neu, pos, compound] into separate columns
sentiment_scores = df['lyrics'].apply(get_sentiment_score)
df['sentiment_positive'] = sentiment_scores.str[2]
df['sentiment_neutral'] = sentiment_scores.str[1]
df['sentiment_negative'] = sentiment_scores.str[0]
df['sentiment_compound'] = sentiment_scores.str[3]

df.head()

### We define lexical richness as the percentage of unique words among all words in a song, a quantitative measure of how varied its vocabulary is. The function below computes it:

In [None]:
def get_lexical_richness(lyric):
    words = lyric.split()
    if not words:                  # avoid dividing by zero on empty lyrics
        return 0
    # Percentage of distinct words, rounded to the nearest integer
    return round(len(set(words)) / len(words) * 100)
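
To make the definition concrete, here is a small worked example on a made-up lyric. `lexical_richness_pct` is a hypothetical stand-in that follows the same definition (unique words divided by total words, times 100) and also guards against empty input.

```python
def lexical_richness_pct(lyric):
    """Percentage of distinct whitespace-separated words in a lyric."""
    words = lyric.split()
    if not words:
        return 0  # no words, so no meaningful ratio
    return round(len(set(words)) / len(words) * 100)

# 6 words in total, 4 of them distinct: la, love, me, do
score = lexical_richness_pct("la la la love me do")
```

Here 4 unique words out of 6 give 66.7, which rounds to 67. Case and punctuation are not normalised, so "Love" and "love" count as different words under this definition.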

In [None]:
df['lexical_richness'] = df['lyrics'].apply(lambda x: get_lexical_richness(x))

df.head()

### Find song length (number of words in the lyrics) as well

In [None]:
df['song_length'] = df['lyrics'].apply(lambda x: len(x.split()))

df.head()

In [None]:
df['sentiment_compound_absolute'] = df['sentiment_compound'].abs()

df.head()

# Integrate Spotify API

Lastly, we integrate the Spotify API to add further variables for each song

In [None]:
client_id = credentials['client_id']
client_secret = credentials['client_secret']

client_creds = f"{client_id}:{client_secret}"
base64_client_creds = b64encode(client_creds.encode()).decode()

auth_url = 'https://accounts.spotify.com/api/token'
headers = {
    'Authorization': f'Basic {base64_client_creds}'
}
payload = {
    'grant_type': 'client_credentials'
}

response = requests.post(auth_url, headers=headers, data=payload)

response.json()
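
The `Authorization` header in the request above is the string `client_id:client_secret`, Base64-encoded and prefixed with `Basic`. The sketch below isolates that step so it can be checked without contacting Spotify. `basic_auth_header` is a hypothetical helper and the credentials are placeholders.

```python
from base64 import b64encode, b64decode

def basic_auth_header(client_id, client_secret):
    """Build the HTTP Basic header used for the client-credentials request."""
    raw = f"{client_id}:{client_secret}".encode("utf-8")
    return {"Authorization": "Basic " + b64encode(raw).decode("ascii")}

header = basic_auth_header("my-id", "my-secret")  # placeholder credentials

# Decoding the token part recovers the original "id:secret" pair
decoded = b64decode(header["Authorization"].split(" ", 1)[1]).decode("utf-8")
```

Because Base64 is an encoding and not encryption, the header should be treated as being as sensitive as the credentials themselves.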


### Using the `spotipy` package and its search() function, we can extract fields from the JSON response such as the release date, a popularity score, whether the song is explicit, and the number of markets the song was available in at its initial release

### Integrating these into our existing dataframe:

In [None]:
df['release_date'] = df['Song'].apply(lambda x: get_release_date(x, client_id, client_secret))
df['popularity'] = df['Song'].apply(lambda x: get_popularity(x, client_id, client_secret))
df['explicitness'] = df['Song'].apply(lambda x: get_explicitness(x, client_id, client_secret))
df['markets'] = df['Song'].apply(lambda x: get_market_number(x, client_id, client_secret))

### We want to convert our date to datetime format for ease of plotting later on

In [None]:
def convert_date(x):
    # Return a pandas Timestamp, or None when the value cannot be parsed
    try:
        return pd.to_datetime(x)
    except Exception:
        return None
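
The Spotify Web API reports release dates at day, month, or year precision, so the raw strings may look like `2015-06-01`, `2015-06`, or `1999`. Here is a minimal standard-library sketch of parsing all three, assuming the helper functions return the date as a string. `parse_release_date` is a hypothetical name.

```python
from datetime import datetime

# Formats tried in order, from most to least precise
DATE_FORMATS = ("%Y-%m-%d", "%Y-%m", "%Y")

def parse_release_date(value):
    """Parse a release-date string; return None if no format matches."""
    if not isinstance(value, str):
        return None
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            continue  # try the next, less precise format
    return None

full = parse_release_date("2015-06-01")
year_only = parse_release_date("1999")   # missing month/day default to 1 January
bad = parse_release_date("unknown")
```

Year-only dates are placed on 1 January, which is worth keeping in mind when plotting against time.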

In [None]:
# Unparseable dates become NaT; a trailing .dropna() would be undone by index alignment on assignment
df['release_date'] = df['release_date'].apply(convert_date)
df.head()

### For the number of markets of song release, we found some interesting patterns:

On its initial release, a song is available in either:
* all 184 markets in the world
* slightly fewer than 184 markets (possibly a sign of censorship in some countries, hinting that the song may be culturally inappropriate or politically sensitive)
* or very few markets (<50) (a sign that the song was deliberately released only in some markets, targeting niche audiences)

This motivates the function below, which categorises songs into a high, medium, or low level of outreach

In [None]:
def market_availability_category(x):
    number = int(x)
    if number == 184:
        return 'High'
    elif 50 < number < 184:
        return 'Medium'
    else:
        return 'Low'
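
A quick way to sanity-check the thresholds is to run a few boundary values through the same rules. The sketch below reproduces the logic with the market total as a parameter. `market_tier` and `total_markets` are hypothetical names, and 184 is the figure quoted above.

```python
def market_tier(count, total_markets=184):
    """Classify a market count with the same thresholds as the notebook."""
    if count == total_markets:
        return "High"      # released everywhere
    if 50 < count < total_markets:
        return "Medium"    # missing from a few markets
    return "Low"           # 50 markets or fewer

tiers = {n: market_tier(n) for n in (184, 183, 51, 50, 3)}
```

A count of exactly 50 lands in "Low", and any count above the total would also land in "Low", so the total may need updating if the number of markets differs from 184.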

In [None]:
df['markets'] = df['markets'].apply(lambda x: market_availability_category(x))
df.head()

### Similarly, for song categories:

We initially attempted to obtain song genres via YouTube, Genius or Spotify. However, we faced significant difficulties because:
* The data is not explicitly available: these platforms expose only limited data to the public due to privacy reasons
* It is very difficult to get the genre via the API itself

Therefore, we turned to Wikipedia, an open source of data, to find each song's genre/category. However, because Wikipedia's categorisation is limited, we focus on the number of categories, i.e. the number of Wikipedia pages a song appears under, instead.
* Most songs do not belong to any specific category on Wikipedia and are simply categorised as "music".
* Most of the other songs belong to two Wikipedia categories, "music" and something else, such as "electro"
* The remaining songs are a small minority that belong to three or more Wikipedia categories

This is our rationale for using broad categories. Songs with only one category are labelled "Low" in terms of category popularity, songs with two "Medium", and songs with three or more "High". The function below:

In [None]:
def category_popularity(x):
    number = int(x)
    if number == 1:
        return 'Low'
    elif number == 2:
        return 'Medium'
    else:
        return 'High'
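
To see how the three tiers split a sample, the labels can be tallied with `collections.Counter`. This is an illustrative sketch on made-up counts. `category_tier` is a hypothetical variant that, unlike the function above, sends a count of zero to "Low" instead of "High".

```python
from collections import Counter

def category_tier(n_categories):
    """Map a category count to a popularity tier."""
    if n_categories <= 1:
        return "Low"       # only the generic "music" category (or none)
    if n_categories == 2:
        return "Medium"
    return "High"          # three or more categories

sample_counts = [1, 1, 2, 1, 3, 2, 1]   # made-up category counts
tier_counts = Counter(category_tier(n) for n in sample_counts)
```

On the real data, `df['category_number'].value_counts()` would give the same kind of tally after the labels are applied.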

In [None]:
df['category_number'] = df['category_number'].apply(lambda x: category_popularity(x))
df.head()

In [None]:
wordcloud = WordCloud().generate(df['lyrics'].iloc[0])   # word cloud for the first song's lyrics

plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

In [None]:
corr_df = df[['like_count','view_count','comment_count', 'sentiment_positive', 'sentiment_neutral', 'sentiment_negative', 'sentiment_compound_absolute', 'lexical_richness', 'song_length', 'popularity']].corr()
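
By default `DataFrame.corr()` computes the Pearson correlation for every pair of columns. As a reminder of what each entry in the matrix means, here is a plain-Python sketch of the coefficient on made-up numbers. `pearson` is a hypothetical helper, not something the notebook uses.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Made-up values: likes rise with views, richness falls with views
views = [100, 200, 300, 400]
likes = [10, 20, 30, 40]
richness = [40, 30, 20, 10]

r_pos = pearson(views, likes)      # perfectly linear, increasing
r_neg = pearson(views, richness)   # perfectly linear, decreasing
```

Values near 1 or -1 indicate a strong linear relationship, while values near 0 indicate little linear relationship.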

In [None]:
# Reshape the matrix to long form: one row per (index, variable) pair
corr_df2 = corr_df.melt(ignore_index=False).reset_index()

corr_df2['rounded_value'] = corr_df2['value'].apply(lambda x: np.round(x, 2))

In [None]:
g = p9.ggplot(
        mapping = p9.aes('index', 'variable', fill = 'value'),
        data = corr_df2
    ) + \
        p9.geom_tile() + \
        p9.geom_label(
            p9.aes(label = 'rounded_value'),
            fill = 'white',
            size = 8
        ) + \
        p9.scale_fill_distiller() + \
        p9.theme_minimal() + \
        p9.labs(
            title = 'Correlation Matrix',
            x = '',
            y = ''
        ) + \
        p9.theme(
            axis_text_x = p9.element_text(angle = 90)
        )

g

In [None]:
hist = p9.ggplot(
    mapping = p9.aes(x = 'sentiment_compound'),
    data = df
) + \
geom_histogram(binwidth=0.05)

hist

In [None]:
boxplot = (
    ggplot(df) +
    aes(x = 'explicitness', y = 'popularity') +
    geom_boxplot()
)

boxplot

In [None]:
line = (
    ggplot(df) +
    aes(x = 'release_date', y = 'song_length', colour = 'explicitness') +
    geom_point(alpha = 0.5) +
    geom_smooth(method = "lm") +
    scale_x_datetime(
        limits=(datetime(2000, 1, 1), datetime(2024, 1, 1)),
    )
)

line

In [None]:
contour = (
    ggplot(df) +
    aes(x = 'popularity', y = 'song_length') +
    geom_bin2d() +
    theme_classic()
)

contour

In [None]:
df.to_json("../data/json_for_plot.json")

In [None]:
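# plotnine does not provide geom_contour_filled or geom_contour, so this
# version draws the contours with matplotlib's triangulation-based functions
contour_df = df[['lexical_richness', 'sentiment_compound', 'popularity']].dropna()
x, y, z = (contour_df[col].astype(float) for col in contour_df.columns)

fig, ax = plt.subplots()
filled = ax.tricontourf(x, y, z)                          # filled contour levels
ax.tricontour(x, y, z, colors='black', linewidths=0.5)    # contour outlines
fig.colorbar(filled, label='popularity')
ax.set_xlabel('lexical_richness')
ax.set_ylabel('sentiment_compound')
plt.show()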

In [None]:
distribution = (
    ggplot(df) +
    aes(x = 'popularity', colour = 'category_number', fill = 'category_number') +
    geom_density(alpha = 0.2)
)

distribution

In [None]:
%load_ext rpy2.ipython

In [None]:
%%R
install.packages("IRkernel")

In [None]:
%%R -i df
# Pass the pandas dataframe into R with -i, then plot with ggplot2
# (assumes ggplot2 is installed in R; geom_contour expects x/y values on a regular grid)
library(ggplot2)
ggplot(df, aes(x = lexical_richness, y = sentiment_compound, z = popularity)) +
    geom_contour_filled() +
    geom_contour(colour = "black") +
    scale_fill_viridis_d()