## Assignment 2 Data Analysis using Pandas

This assignment contains one question, detailed below. It is due on Sunday, October 17, 2021, at 23:59. Each late day results in a loss of 20% of the total points.

### Question 1 (100 points) How to create the next Squid Game?

![hitsong](https://pbs.twimg.com/media/E_U24rOVEAAfrJm?format=jpg)

The 21st century has witnessed technological advances in the music industry that allow consumers to store music digitally, for example on MP3 players or iPods. The growing prevalence of smartphones and the digitization of music prompted the establishment and wide adoption of music-listening apps such as Spotify, Google Play Music, and Apple Music, which gradually replaced CDs. This shift in music consumption, from purchasing physical albums to purchasing single tracks, not only changed the customer experience but also fundamentally changed the economics of the music industry.

In response to this evolution, Chris Anderson (2004) proposed the long tail theory to characterize music consumption in the digital era: a large number of tracks that were once unknown each gain some popularity and together form a long tail of the consumption distribution. This implies that popularity may spread across a wider range of music and artists, increasing sales of lesser-known tracks from nearly zero to a few.

More recently, the emergence of streaming platforms such as Pandora and Spotify, together with the use of artificial intelligence in music recommendation, has gradually produced a spillover effect (Aguiar and Waldfogel 2018): music listened to by other users with similar histories is recommended, so a track's popularity grows as it spreads from a few users to a larger group. This has pushed a short list of tracks to become exceptionally popular. In 2018, Professor Serguei Netessine of the Wharton School of the University of Pennsylvania stated in his podcast, “We found that, if anything, you see more and more concentration of demand at the top”. Although the podcast focused on movie sales, sales of experience goods such as theater and music behave in a similar fashion.

In the book “All You Need to Know About the Music Business”, Passman (2019) highlights key differences between the music business in the streaming era and in the era of record sales. In the days of record sales, artists were paid the same amount for each record sold, regardless of whether a buyer listened to it once or a thousand times. Today, the more listens a track gets, the more money the artist makes. Meanwhile, record sales did not have strong negative spillover effects, as fans of different artists and genres would purchase what they liked anyway. In fact, a hit album would bring a lot of people into record stores, which increased the chances of selling other records. In the streaming world, that is no longer true: the more listens one artist gets, the less money other artists make. In other words, music consumption is undergoing a radical shift that may affect the definition of popularity in the streaming era, yet this shift remains severely underexplored.

Inspired by the evolution of the music industry in recent decades, and by recent challenges to the long tail theory given the high concentration of popularity in a short list of tracks, this assignment investigates the popularity of music tracks on a streaming platform, which differs substantially from popularity measured by album sales and has not been extensively explored. In particular, it does not consider the level of advertisement or inclusion in Spotify playlists, which Aguiar and Waldfogel (2018) have examined.

References:
- Aguiar, L. & Waldfogel, J. (2018), Platforms, Promotion, and Product Discovery: Evidence from Spotify Playlists, JRC Digital Economy Working Paper 2018-04, JRC Technical Reports, JRC112023
- Passman, D. S. (2019), All You Need to Know About the Music Business: 10th Edition, Simon & Schuster, US



**Question 1.1** (30 points): We will retrieve information about the top 100 songs on [Spotifycharts](https://spotifycharts.com/) for each day from September 30 to October 4. For each day, we can scrape the following characteristics from the chart page. For example, from the ["Global Top 200 on September 30"](https://spotifycharts.com/regional/global/daily/2021-09-30), we want to extract the following information about the top song **STAY**:
- spotify id (5PjdY0CKGZdEuoNab3yDmX)
- Song name (STAY (with Justin Bieber))
- Artist (The Kid LAROI)
- Number of streams (7,714,466)

![spotifycharts](https://aristake.com/wp-content/uploads/2021/09/Spotify-charts-HEADER-1.png)


After scraping the top 100 songs, save the data as a dataframe ```spotify_top_songs_global```. 

Similarly, scrape the top 100 songs of the Portuguese and Japanese markets for September 30 to October 4, and save the data as the dataframes ```spotify_top_songs_portugal``` and ```spotify_top_songs_japan```, respectively.


You can concatenate these three dataframes as ```spotify_top_songs``` for the next questions.

Note: if you are not able to scrape the data, download the csv files from the top right corner of the website, but you will not receive points for this question.

Hint: you can play with the website to check the correct url for each chart.
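As a minimal sketch of the hint above, the daily chart URLs can be generated rather than typed out. The URL pattern is an assumption inferred from the Global chart link in this question, and `chart_urls` is a hypothetical helper name; check the pattern against the website for each market before relying on it.

```python
import pandas as pd

# Assumed URL pattern, inferred from the Global chart link in this question:
# https://spotifycharts.com/regional/<market>/daily/<YYYY-MM-DD>
BASE_URL = "https://spotifycharts.com/regional/{market}/daily/{date}"


def chart_urls(market, start="2021-09-30", end="2021-10-04"):
    """Return one daily-chart URL per day between start and end (inclusive)."""
    days = pd.date_range(start=start, end=end, freq="D")
    return [BASE_URL.format(market=market, date=d.strftime("%Y-%m-%d")) for d in days]


global_urls = chart_urls("global")
print(global_urls[0])
```

Passing a different market code to `chart_urls` would produce the list for that market, provided the code matches the one the website uses in its URLs.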

In [1]:
pip install cloudscraper

Note: you may need to restart the kernel to use updated packages.


In [2]:
import pandas as pd
import numpy as np
import requests
import cloudscraper
from bs4 import BeautifulSoup
import csv

Because Spotify enforces anti-bot measures to prevent web scraping, we will rely on a package named ```cloudscraper``` to bypass the mechanism. Essentially, you can use the following code to scrape the website:

In [3]:
scraper = cloudscraper.create_scraper()
global_pages = ['https://spotifycharts.com/regional/global/daily/2021-09-30',
            'https://spotifycharts.com/regional/global/daily/2021-10-01',
            'https://spotifycharts.com/regional/global/daily/2021-10-02',
            'https://spotifycharts.com/regional/global/daily/2021-10-03',
            'https://spotifycharts.com/regional/global/daily/2021-10-04']

art_list = []
song_list = []
id_list = []
streams_list = []
for url in global_pages:
    page = scraper.get(url)
    soup = BeautifulSoup(page.text, "html.parser")
    songs = soup.find("table", {"class": "chart-table"})
    # Keep only the first 100 rows of the daily chart table
    for s in songs.find('tbody').find_all('tr')[:100]:
        # Drop the 3-character prefix in front of the artist name
        artist = s.find("td", {"class": "chart-table-track"}).find("span").text[3:]
        art_list.append(artist)
        title = s.find("td", {"class": "chart-table-track"}).find("strong").text
        song_list.append(title)
        # The track link ends with "track/<spotify id>"
        songid = s.find("td", {"class": "chart-table-image"}).find("a").get("href")
        songid = songid.split("track/")[1]
        id_list.append(songid)
        # Stream counts are kept as strings here (e.g. "7,714,466")
        streams = s.find("td", {"class": "chart-table-streams"}).text
        streams_list.append(streams)

spotify_top_songs_global = pd.DataFrame({'Spotify id': id_list, 'Song name': song_list, 'Artist' : art_list, 'Streams': streams_list})
#spotify_top_songs_global

In [4]:
scraper = cloudscraper.create_scraper()
japan_pages = ['https://spotifycharts.com/regional/jp/daily/2021-09-30',
              'https://spotifycharts.com/regional/jp/daily/2021-10-01',
              'https://spotifycharts.com/regional/jp/daily/2021-10-02',
              'https://spotifycharts.com/regional/jp/daily/2021-10-03',
              'https://spotifycharts.com/regional/jp/daily/2021-10-04']
art_listj = []
song_listj = []
id_listj = []
streams_listj = []
for url in japan_pages:
    page = scraper.get(url)
    soup = BeautifulSoup(page.text, "html.parser")
    songs = soup.find("table", {"class": "chart-table"})
    # Keep only the first 100 rows of the daily chart table
    for s in songs.find('tbody').find_all('tr')[:100]:
        # Drop the 3-character prefix in front of the artist name
        artist = s.find("td", {"class": "chart-table-track"}).find("span").text[3:]
        art_listj.append(artist)
        title = s.find("td", {"class": "chart-table-track"}).find("strong").text
        song_listj.append(title)
        # The track link ends with "track/<spotify id>"
        songid = s.find("td", {"class": "chart-table-image"}).find("a").get("href")
        songid = songid.split("track/")[1]
        id_listj.append(songid)
        streams = s.find("td", {"class": "chart-table-streams"}).text
        streams_listj.append(streams)

spotify_top_songs_japan = pd.DataFrame({'Spotify id': id_listj, 'Song name': song_listj, 'Artist' : art_listj, 'Streams': streams_listj})
#spotify_top_songs_japan

In [5]:
scraper = cloudscraper.create_scraper()
portugal_pages = ['https://spotifycharts.com/regional/pt/daily/2021-09-30',
                 'https://spotifycharts.com/regional/pt/daily/2021-10-01',
                 'https://spotifycharts.com/regional/pt/daily/2021-10-02',
                 'https://spotifycharts.com/regional/pt/daily/2021-10-03',
                 'https://spotifycharts.com/regional/pt/daily/2021-10-04']
art_listp = []
song_listp = []
id_listp = []
streams_listp = []
for url in portugal_pages:
    page = scraper.get(url)
    soup = BeautifulSoup(page.text, "html.parser")
    songs = soup.find("table", {"class": "chart-table"})
    # Keep only the first 100 rows of the daily chart table
    for s in songs.find('tbody').find_all('tr')[:100]:
        # Drop the 3-character prefix in front of the artist name
        artist = s.find("td", {"class": "chart-table-track"}).find("span").text[3:]
        art_listp.append(artist)
        title = s.find("td", {"class": "chart-table-track"}).find("strong").text
        song_listp.append(title)
        # The track link ends with "track/<spotify id>"
        songid = s.find("td", {"class": "chart-table-image"}).find("a").get("href")
        songid = songid.split("track/")[1]
        id_listp.append(songid)
        streams = s.find("td", {"class": "chart-table-streams"}).text
        streams_listp.append(streams)

spotify_top_songs_portugal = pd.DataFrame({'Spotify id': id_listp, 'Song name': song_listp, 'Artist' : art_listp, 'Streams': streams_listp})
#spotify_top_songs_portugal

spotify_top_songs = pd.concat([spotify_top_songs_global, spotify_top_songs_portugal, spotify_top_songs_japan])
spotify_top_songs

Unnamed: 0,Spotify id,Song name,Artist,Streams
0,5PjdY0CKGZdEuoNab3yDmX,STAY (with Justin Bieber),The Kid LAROI,7714466
1,5Z9KJZvQzH6PFmb8SNkxuk,INDUSTRY BABY (feat. Jack Harlow),Lil Nas X,6517968
2,02MWAaffLxlfxAUY7c5dvx,Heat Waves,Glass Animals,4460880
3,3FeVmId7tL5YN8B7R3imoM,My Universe,"Coldplay, BTS",4142687
4,6PQ88X9TkUIAUIZJHW2upE,Bad Habits,Ed Sheeran,4077321
...,...,...,...,...
495,2YQ8TlTmNheRI3VafoDpod,10月無口な君を忘れる,あたらよ,38331
496,3QIAwtEEDOrv0g5NKCGrXZ,花束,back number,38136
497,19fhOFi6pNGeZe5uiFlm7c,優しい彗星,YOASOBI,37380
498,3bbIIVIwBoLqVcLebiEJFo,のびしろ,Creepy Nuts,37239
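
Once the three dataframes are concatenated, the market a row came from is no longer visible. The following is a minimal sketch, on toy stand-ins rather than the scraped data, of one way to keep a market label and convert the comma-formatted stream counts to integers. The `Market` column and the `toy_*` names are assumptions made for illustration.

```python
import pandas as pd

# Toy stand-ins for the three scraped dataframes (values are illustrative only).
toy_frames = {
    "global": pd.DataFrame({"Spotify id": ["a", "b"], "Streams": ["7,714,466", "6,517,968"]}),
    "portugal": pd.DataFrame({"Spotify id": ["a", "c"], "Streams": ["30,000", "25,000"]}),
    "japan": pd.DataFrame({"Spotify id": ["d", "e"], "Streams": ["38,331", "38,136"]}),
}

# Tag each dataframe with its market before concatenating, and rebuild a clean 0..n-1 index.
combined = pd.concat(
    [df.assign(Market=market) for market, df in toy_frames.items()],
    ignore_index=True,
)

# Convert stream counts such as "7,714,466" to integers.
combined["Streams"] = combined["Streams"].str.replace(",", "", regex=False).astype(int)
print(combined)
```

With `ignore_index=True` the combined dataframe gets a unique row index, which avoids the repeated index labels that a plain `pd.concat` of the three dataframes carries over.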


**Question 1.2** (20 points) Now use the Spotify API to get more information about the songs. The detailed [documentation](https://developer.spotify.com/documentation/web-api/) should guide you through the entire process.

First, get the audio features of the songs in ```spotify_top_songs```. The [API endpoint](https://developer.spotify.com/console/get-audio-features-several-tracks/) for getting the audio features of several tracks is explained in detail in the Spotify console. The response contains [audio feature objects](https://developer.spotify.com/documentation/web-api/reference/#object-audiofeaturesobject) in JSON format; save them as the dataframe ```spotify_top_songs_acoustic_features``` with these features:
- danceability
- energy
- key
- loudness
- mode
- speechiness
- acousticness
- instrumentalness
- liveness
- valence
- tempo
- id
- duration_ms
- time_signature

Note: if you are not able to get this data, download the csv file from Moodle to continue the analysis, but you will not receive points for this question.

Hint1: when you request acoustic features for multiple tracks, the URL contains the track ids joined by ```%2C```. For example, for the two tracks STAY (5PjdY0CKGZdEuoNab3yDmX) and INDUSTRY BABY (5Z9KJZvQzH6PFmb8SNkxuk), the URL is: `https://api.spotify.com/v1/audio-features?ids=5PjdY0CKGZdEuoNab3yDmX%2C5Z9KJZvQzH6PFmb8SNkxuk`

Hint2: Spotify requires authentication (a token) for access to its data. Go to the Spotify [developer platform](https://developer.spotify.com/console/get-audio-features-several-tracks/) to request a token and include it in your requests. The token may expire if you have not used it for a while; in that case, simply request a new one.

Hint3: Spotify restricts the number of tracks that can be requested in each API call (up to 100), so you may need to make several separate calls and combine the results afterwards.
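
The three hints can be combined into a small sketch that splits the track ids into batches of at most 100 and builds one URL per batch. The helper names `chunked` and `audio_features_url` and the placeholder ids are hypothetical; no request is sent here.

```python
def chunked(ids, size=100):
    """Split a list of track ids into consecutive batches of at most `size`."""
    return [ids[i:i + size] for i in range(0, len(ids), size)]


def audio_features_url(batch):
    """Build the several-tracks URL, joining ids with the URL-encoded comma %2C."""
    return "https://api.spotify.com/v1/audio-features?ids=" + "%2C".join(batch)


# Placeholder ids for illustration only; real code would use the unique 'Spotify id' values.
track_ids = [f"track{n:03d}" for n in range(266)]
urls = [audio_features_url(batch) for batch in chunked(track_ids)]
print(len(urls))  # 266 ids -> 3 batches (100, 100, 66)
```

Each URL would then be requested with the authorization header from Hint2. As far as the API documentation describes it, the several-tracks response wraps the objects in an `audio_features` list, so the per-batch results can be collected into one list before building the dataframe.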

In [6]:
# request a new token from Spotify to replace the below one
access_token = 'BQBr6C4lMS2JkjW04sC0Ua7AYgNoV11u2FJFCbL-KvkkPiWRhzQbPOLVfH918BcFJHvMHWF28ClV8UrViJhXcDehO5TdDdLnb_juRMYa9LVML5OWNhtSPPfRTSRSMfgAjAORmLUVgX-ZT00pt48-K89M1ZI6SHBhk6A'
headers = {'Authorization': 'Bearer {token}'.format(token=access_token)}

In [7]:
# Audio features to keep, in the column order of the resulting dataframe
feature_names = ['danceability', 'energy', 'key', 'loudness', 'mode', 'speechiness',
                 'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo',
                 'id', 'duration_ms', 'time_signature']

unique = spotify_top_songs['Spotify id'].unique()
features = []

for idd in unique:
    # One request per track: the single-track endpoint returns one audio feature object
    response = requests.get('https://api.spotify.com/v1/audio-features/' + idd, headers=headers)
    track = response.json()
    features.append([track[name] for name in feature_names])

# Build the dataframe once, after the loop; the id column is named 'id_' for the later merges
columns = ['id_' if name == 'id' else name for name in feature_names]
spotify_top_songs_acoustic_features = pd.DataFrame(features, columns=columns)

spotify_top_songs_acoustic_features


Unnamed: 0,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,id_,duration_ms,time_signature
0,0.591,0.764,1,-5.484,1,0.0483,0.03830,0.000000,0.1030,0.478,169.928,5PjdY0CKGZdEuoNab3yDmX,141806,4
1,0.741,0.691,10,-7.395,0,0.0672,0.02210,0.000000,0.0476,0.892,150.087,5Z9KJZvQzH6PFmb8SNkxuk,212353,4
2,0.761,0.525,11,-6.900,1,0.0944,0.44000,0.000007,0.0921,0.531,80.870,02MWAaffLxlfxAUY7c5dvx,238805,4
3,0.588,0.701,9,-6.390,1,0.0402,0.00813,0.000000,0.2000,0.443,104.988,3FeVmId7tL5YN8B7R3imoM,228000,4
4,0.808,0.897,11,-3.712,0,0.0348,0.04690,0.000031,0.3640,0.591,126.026,6PQ88X9TkUIAUIZJHW2upE,231041,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
262,0.825,0.652,1,-3.183,0,0.0802,0.58100,0.000000,0.0931,0.931,95.977,7qiZfU4dY1lWllzX7mPBI3,233713,4
263,0.593,0.663,0,-6.325,0,0.0834,0.06640,0.000000,0.3060,0.559,186.083,118KEprYqS3XNWdBoKzkEH,168982,4
264,0.262,0.470,5,-4.663,1,0.0433,0.04290,0.000000,0.2360,0.307,75.096,2YQ8TlTmNheRI3VafoDpod,332286,4
265,0.519,0.713,2,-3.612,1,0.0324,0.01780,0.000000,0.3530,0.505,159.963,3QIAwtEEDOrv0g5NKCGrXZ,286053,4


**Question 1.3** (5 points) 
Merge the dataframe ```spotify_top_songs_acoustic_features``` with ```spotify_top_songs``` to enrich the chart data with the acoustic features, and check the resulting number of rows and columns.

In [8]:
# Question 1.3
spotify_songs = spotify_top_songs.merge(spotify_top_songs_acoustic_features, left_on= 'Spotify id', right_on = 'id_', how= 'outer')
spotify_songs = spotify_songs.drop(labels='id_', axis = 1)
spotify_songs


Unnamed: 0,Spotify id,Song name,Artist,Streams,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature
0,5PjdY0CKGZdEuoNab3yDmX,STAY (with Justin Bieber),The Kid LAROI,7714466,0.591,0.764,1,-5.484,1,0.0483,0.0383,0.0,0.1030,0.478,169.928,141806,4
1,5PjdY0CKGZdEuoNab3yDmX,STAY (with Justin Bieber),The Kid LAROI,8070390,0.591,0.764,1,-5.484,1,0.0483,0.0383,0.0,0.1030,0.478,169.928,141806,4
2,5PjdY0CKGZdEuoNab3yDmX,STAY (with Justin Bieber),The Kid LAROI,7997464,0.591,0.764,1,-5.484,1,0.0483,0.0383,0.0,0.1030,0.478,169.928,141806,4
3,5PjdY0CKGZdEuoNab3yDmX,STAY (with Justin Bieber),The Kid LAROI,6865128,0.591,0.764,1,-5.484,1,0.0483,0.0383,0.0,0.1030,0.478,169.928,141806,4
4,5PjdY0CKGZdEuoNab3yDmX,STAY (with Justin Bieber),The Kid LAROI,7014304,0.591,0.764,1,-5.484,1,0.0483,0.0383,0.0,0.1030,0.478,169.928,141806,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1495,7qiZfU4dY1lWllzX7mPBI3,Shape of You,Ed Sheeran,43184,0.825,0.652,1,-3.183,0,0.0802,0.5810,0.0,0.0931,0.931,95.977,233713,4
1496,118KEprYqS3XNWdBoKzkEH,Anniversary,HIRAIDAI,42347,0.593,0.663,0,-6.325,0,0.0834,0.0664,0.0,0.3060,0.559,186.083,168982,4
1497,2YQ8TlTmNheRI3VafoDpod,10月無口な君を忘れる,あたらよ,38331,0.262,0.470,5,-4.663,1,0.0433,0.0429,0.0,0.2360,0.307,75.096,332286,4
1498,3QIAwtEEDOrv0g5NKCGrXZ,花束,back number,38136,0.519,0.713,2,-3.612,1,0.0324,0.0178,0.0,0.3530,0.505,159.963,286053,4
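
To check the result of such a merge, `merge` itself offers two safeguards. The sketch below uses toy data (not the scraped charts) to show `validate`, which fails early if a track id is duplicated in the features table, and `indicator`, which records whether each chart row found a match.

```python
import pandas as pd

# Toy data: three chart rows (song-days) and features for two unique tracks.
chart = pd.DataFrame({"Spotify id": ["a", "a", "b"], "Streams": [10, 12, 7]})
features = pd.DataFrame({"id_": ["a", "b"], "energy": [0.76, 0.69]})

merged = chart.merge(
    features,
    left_on="Spotify id",
    right_on="id_",
    how="left",
    validate="many_to_one",  # raises MergeError if a track id repeats in `features`
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
).drop(columns="id_")

print(merged.shape)                     # rows and columns after the merge
print(merged["_merge"].value_counts())  # how many chart rows found their features
```

With a many-to-one merge, the number of rows should equal the number of chart rows, and the number of columns should grow by the number of feature columns.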


**Question 1.4** (5 points) Show the top 3 most popular artists in terms of the number of unique songs on the chart in the global, Portuguese and Japanese markets, respectively.

In [9]:
df_glob_dup = spotify_top_songs_global.drop_duplicates(subset= 'Spotify id')
spotify_top_songs_global_art = df_glob_dup['Artist'].value_counts()
print(spotify_top_songs_global_art.head(3))

df_port_dup = spotify_top_songs_portugal.drop_duplicates(subset= 'Spotify id')
spotify_top_songs_port_art = df_port_dup['Artist'].value_counts()
print(spotify_top_songs_port_art.head(3))

df_japan_dup = spotify_top_songs_japan.drop_duplicates(subset= 'Spotify id')
spotify_top_songs_japan_art = df_japan_dup['Artist'].value_counts()
print(spotify_top_songs_japan_art.head(3))


Olivia Rodrigo    7
Doja Cat          5
Drake             4
Name: Artist, dtype: int64
Doja Cat          4
Olivia Rodrigo    4
Lil Nas X         3
Name: Artist, dtype: int64
YOASOBI     13
BTS          7
HIRAIDAI     6
Name: Artist, dtype: int64


**Question 1.5** (5 points) Show the average values of the acoustic features of songs in the global market for each quartile of the duration distribution (0-25%, 25-50%, 50-75%, 75-100%).

In [10]:
spotify_top_songs_acoustic_features3 = spotify_top_songs_global.merge(spotify_top_songs_acoustic_features, left_on= 'Spotify id', right_on = 'id_', how= 'left')
spotify_top_songs_acoustic_features3 = spotify_top_songs_acoustic_features3.drop(labels='id_', axis = 1)


In [13]:
global_dur_sorted = spotify_top_songs_acoustic_features3.sort_values(by ='duration_ms')
global_dur_sorted['QuantileRank'] = pd.qcut(global_dur_sorted['duration_ms'], q = 4, labels= ('0-25','25-50','50-75','75-100'))


In [14]:
# numeric_only=True skips the text columns (id, song name, artist, streams)
global_dur_sorted1 = global_dur_sorted.groupby(['QuantileRank']).mean(numeric_only=True)
global_dur_sorted1

QuantileRank,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature
0-25,0.693444,0.667944,5.793651,-5.687722,0.460317,0.100042,0.21722,0.026354,0.14624,0.524548,122.803595,148858.97619,3.960317
25-50,0.699488,0.655784,4.656,-5.451072,0.68,0.092906,0.25521,0.004012,0.159536,0.61948,129.609736,178881.624,3.88
50-75,0.689528,0.695312,4.824,-5.532992,0.64,0.073694,0.227148,7e-05,0.169395,0.55644,126.697424,207387.048,4.0
75-100,0.66646,0.604258,6.137097,-6.189847,0.669355,0.107756,0.214051,0.001707,0.135916,0.424843,126.064653,254500.354839,3.887097
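
The cells above compute the quartiles over all chart rows, so a song that charted on several days is counted several times. The mechanics of `pd.qcut` followed by a grouped mean can be sketched on toy data as follows; the values are illustrative and only one feature is averaged.

```python
import pandas as pd

# Toy data: eight songs with distinct durations and one acoustic feature.
toy = pd.DataFrame({
    "duration_ms": [120000, 150000, 180000, 200000, 210000, 230000, 250000, 300000],
    "energy": [0.5, 0.7, 0.6, 0.8, 0.4, 0.6, 0.9, 0.3],
})

# qcut splits at the 25th, 50th and 75th percentiles, so each bin holds about a quarter of the rows.
toy["QuantileRank"] = pd.qcut(
    toy["duration_ms"], q=4, labels=["0-25", "25-50", "50-75", "75-100"]
)

# Average the feature within each duration quartile.
quartile_means = toy.groupby("QuantileRank", observed=True)["energy"].mean()
print(quartile_means)
```

To compute the quartiles over unique songs instead, one option would be to apply `drop_duplicates(subset='Spotify id')` before calling `pd.qcut`.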


**Question 1.6** (5 points) Show the top 3 artists with the most total streams in the global, Portuguese and Japanese markets.

In [15]:
top_songs_global11 = spotify_top_songs_global.copy()
# Remove the thousands separators and convert the stream counts to numbers
top_songs_global11['intStreams'] = pd.to_numeric(top_songs_global11["Streams"].str.replace(',', ''), errors='coerce')
top_songs_global11 = top_songs_global11[['Artist','intStreams']].groupby(['Artist']).sum()
top_songs_global12 = top_songs_global11.nlargest(3, 'intStreams', keep= 'all')
top_songs_global12

Artist,intStreams
Lil Nas X,64552221
Doja Cat,58792737
Olivia Rodrigo,55254893


In [16]:
top_songs_portugal11 = spotify_top_songs_portugal.copy()
# Remove the thousands separators and convert the stream counts to numbers
top_songs_portugal11['intStreams'] = pd.to_numeric(top_songs_portugal11["Streams"].str.replace(',', ''), errors='coerce')
top_songs_portugal11 = top_songs_portugal11[['Artist','intStreams']].groupby(['Artist']).sum()
top_songs_portugal12 = top_songs_portugal11.nlargest(3, 'intStreams', keep= 'all')
top_songs_portugal12

Artist,intStreams
Lil Nas X,490634
CKay,320281
Doja Cat,314178


In [17]:
top_songs_japan11 = spotify_top_songs_japan.copy()
# Remove the thousands separators and convert the stream counts to numbers
top_songs_japan11['intStreams'] = pd.to_numeric(top_songs_japan11["Streams"].str.replace(',', ''), errors='coerce')
top_songs_japan11 = top_songs_japan11[['Artist','intStreams']].groupby(['Artist']).sum()
top_songs_japan12 = top_songs_japan11.nlargest(3, 'intStreams', keep= 'all')
top_songs_japan12

Artist,intStreams
YOASOBI,7197817
BTS,4291855
Official HIGE DANdism,3138171


**Question 1.7** (5 points) Show the number of songs for each key (rows) in the Portuguese and Japanese markets (columns).

In [18]:
# Question 1.7: count the unique songs in each key, separately for the Portuguese and Japanese markets
spotify_songs_port_key = spotify_top_songs_portugal.merge(spotify_top_songs_acoustic_features, left_on= 'Spotify id', right_on = 'id_', how= 'left')
df_port_dup_key = spotify_songs_port_key.drop_duplicates(subset= 'Spotify id')
df_port_dup_key = df_port_dup_key[['Spotify id','key']].groupby(['key']).count()
df_port_dup_key

key,Spotify id
0,11
1,16
2,6
3,5
4,4
5,12
6,9
7,11
8,17
9,8


In [19]:
spotify_songs_jap_key = spotify_top_songs_japan.merge(spotify_top_songs_acoustic_features, left_on= 'Spotify id', right_on = 'id_', how= 'left')
df_jap_dup_key = spotify_songs_jap_key.drop_duplicates(subset= 'Spotify id')
df_jap_dup_key = df_jap_dup_key[['Spotify id','key']].groupby(['key']).count()
df_jap_dup_key

key,Spotify id
0,5
1,16
2,11
3,4
4,3
5,10
6,6
7,10
8,12
9,9
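
The two tables above list the counts per market separately. A single table with keys as rows and markets as columns, as the question asks, could be produced with `pd.crosstab`. The sketch below uses toy data with an assumed `Market` column; it is not the notebook's own result.

```python
import pandas as pd

# Toy data: one row per unique song per market, with its musical key (0-11).
toy = pd.DataFrame({
    "Market": ["Portugal", "Portugal", "Portugal", "Japan", "Japan", "Japan"],
    "key": [0, 1, 1, 1, 8, 8],
})

# Rows are keys, columns are markets, cells are song counts (0 where a key is absent).
key_by_market = pd.crosstab(toy["key"], toy["Market"])
print(key_by_market)
```

In the notebook, the input could be built by concatenating the de-duplicated Portuguese and Japanese dataframes after adding a market label to each.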


**Question 1.8** (5 points) Show the top 5 artists with the most song-days in the global market (if a song appeared on 2 days, it counts as 2 song-days).

In [20]:
# Question 1.8
top_songs_global8 = spotify_top_songs_global.copy()
top_songs_global8 = top_songs_global8[['Spotify id', 'Artist']].groupby(['Artist']).count()
top_songs_global81 = top_songs_global8.nlargest(5, 'Spotify id', keep= 'all')
top_songs_global81

Artist,Spotify id
Olivia Rodrigo,32
Doja Cat,25
Billie Eilish,20
Drake,20
The Weeknd,20


**Question 1.9** (10 points) Compare the acoustic features of top songs in Portugal and in Japan by checking the correlations between rank and acoustic features, using Pearson and Spearman correlations.


In [21]:
songs_japan_9 = spotify_top_songs_japan.copy()
r1=[*range(1, 101, 1)] * 5
songs_japan_9['Rank'] = r1
spotify_jap_feat = songs_japan_9.merge(spotify_top_songs_acoustic_features, left_on= 'Spotify id', right_on = 'id_', how= 'left')
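# Correlate only the numeric columns; the text columns cannot be correlated
japan_rank_corr_Pearson = spotify_jap_feat.select_dtypes('number').corr(method='pearson')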
japan_rank_corr_Pearson.head(1)

Unnamed: 0,Rank,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature
Rank,1.0,-0.022101,-0.014775,-0.056961,0.053387,-0.028735,0.051946,-0.001515,0.115483,-0.183397,-0.00457,-0.021304,0.102141,


In [22]:
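# Correlate only the numeric columns; the text columns cannot be correlated
japan_rank_corr_Spearman = spotify_jap_feat.select_dtypes('number').corr(method='spearman')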
japan_rank_corr_Spearman.head(1)

Unnamed: 0,Rank,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature
Rank,1.0,-0.011061,-0.011571,-0.051502,0.060453,-0.028735,0.122617,0.03259,0.159398,-0.177359,-0.006005,-0.027492,0.095879,


In [24]:
songs_portugal_9 = spotify_top_songs_portugal.copy()
r2=[*range(1, 101, 1)] * 5
songs_portugal_9['Rank'] = r2
spotify_port_feat = songs_portugal_9.merge(spotify_top_songs_acoustic_features, left_on= 'Spotify id', right_on = 'id_', how= 'left')
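# Correlate only the numeric columns; the text columns cannot be correlated
portugal_rank_corr_Pearson = spotify_port_feat.select_dtypes('number').corr(method='pearson')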
portugal_rank_corr_Pearson.head(1)

Unnamed: 0,Rank,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature
Rank,1.0,0.083782,-0.0178,-0.059467,-0.166626,-0.015522,0.024322,0.021286,-0.065531,-0.087367,-0.086033,0.023731,0.045797,0.025701


In [25]:
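# Correlate only the numeric columns; the text columns cannot be correlated
portugal_rank_corr_Spearman = spotify_port_feat.select_dtypes('number').corr(method='spearman')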
portugal_rank_corr_Spearman.head(1)

Unnamed: 0,Rank,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature
Rank,1.0,0.083223,-0.018528,-0.068003,-0.144227,-0.015522,-0.032392,0.004246,-0.110757,0.047954,-0.084194,0.009893,0.046917,0.024947
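
To compare the two methods more easily, the `Rank` row of each correlation matrix can be placed side by side. The following is a minimal sketch on toy data; the values are made up so that one feature is perfectly monotone in rank without being linear.

```python
import pandas as pd

# Toy data: loudness falls steadily with rank but not linearly; valence is unrelated.
toy = pd.DataFrame({
    "Rank": [1, 2, 3, 4, 5, 6],
    "loudness": [-3.0, -4.0, -4.5, -6.0, -7.0, -12.0],
    "valence": [0.5, 0.9, 0.2, 0.7, 0.4, 0.6],
})

# One column per method: each holds the correlation of every feature with Rank.
comparison = pd.DataFrame({
    method: toy.corr(method=method)["Rank"].drop("Rank")
    for method in ("pearson", "spearman")
})
print(comparison)
```

Spearman works on ranks, so it reaches -1 for any strictly decreasing relationship, while Pearson only does so when the relationship is linear. Since `Rank` is itself ordinal, the Spearman column is arguably the more natural one to read.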


**Question 1.10** (10 points) 
Compare the acoustic features of top songs in Portugal and in Japan, by checking whether the differences between feature values are statistically significant or not. Show the features ranked by the absolute magnitude of differences with statistical significance level of at least p<0.05.

In [41]:
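# Assumption: songs_portugal_10 and songs_japan_10 are not defined in the cells shown;
# they are taken here to be copies of the scraped market dataframes.
songs_portugal_10 = spotify_top_songs_portugal.copy()
songs_japan_10 = spotify_top_songs_japan.copy()

PortugalSongs1 = songs_portugal_10.merge(spotify_top_songs_acoustic_features, left_on= 'Spotify id', right_on = 'id_', how= 'left')
PortugalSongs1 = PortugalSongs1.drop_duplicates(subset= 'Spotify id')
PortugalSongs1 = PortugalSongs1.iloc[:,4:]  # drop the 4 chart columns, keep the features
PortugalSongs1 = PortugalSongs1.drop(labels='id_', axis = 1)

JapanSongs1 = songs_japan_10.merge(spotify_top_songs_acoustic_features, left_on= 'Spotify id', right_on = 'id_', how= 'left')
JapanSongs1 = JapanSongs1.drop_duplicates(subset= 'Spotify id')
JapanSongs1 = JapanSongs1.iloc[:,4:]  # drop the 4 chart columns, keep the features
JapanSongs1 = JapanSongs1.drop(labels='id_', axis = 1)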

In [45]:
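from scipy.stats import ttest_ind

feat_pval = []
for feature in PortugalSongs1.columns:
    # Two-sample t-test, assuming equal variances in both markets
    t_test, pvalue = ttest_ind(PortugalSongs1[feature], JapanSongs1[feature], equal_var=True)
    # Keep only the features that differ at the 5% significance level
    if pvalue < 0.05:
        feat_pval.append({'Feature': feature, 'P value': pvalue})

final_df = pd.DataFrame(feat_pval)
# sort_values returns a new dataframe; the result is not assigned, so final_df keeps the feature order
final_df.sort_values(by= 'P value', ascending= False)
final_df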

Unnamed: 0,Feature,P value
0,danceability,6.218529e-06
1,energy,5.146515e-08
2,loudness,8.543756e-09
3,mode,9.214955e-05
4,speechiness,7.281897e-07
5,acousticness,1.826706e-07
6,liveness,0.001126757
7,valence,0.01881358
8,duration_ms,5.732983e-09
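
The table above lists the significant features in their original order. The question also asks for a ranking by the absolute magnitude of the differences, which could be sketched as follows. The data are synthetic, the column names `Mean difference` and `P value` are chosen for illustration, and Welch's t-test (`equal_var=False`) is swapped in here as an assumption, since it does not require equal variances in the two markets.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)

# Synthetic stand-ins for the per-market feature tables: energy differs, valence does not.
portugal = pd.DataFrame({"energy": rng.normal(0.70, 0.05, 100),
                         "valence": rng.normal(0.50, 0.05, 100)})
japan = pd.DataFrame({"energy": rng.normal(0.60, 0.05, 100),
                      "valence": rng.normal(0.50, 0.05, 100)})

rows = []
for feature in portugal.columns:
    # Welch's t-test: does not assume equal variances in the two samples
    _, pvalue = ttest_ind(portugal[feature], japan[feature], equal_var=False)
    rows.append({"Feature": feature,
                 "Mean difference": portugal[feature].mean() - japan[feature].mean(),
                 "P value": pvalue})

result = pd.DataFrame(rows)
# Keep the significant features, then rank them by the absolute size of the difference.
significant = result[result["P value"] < 0.05]
ranked = significant.sort_values("Mean difference", key=lambda s: s.abs(), ascending=False)
print(ranked)
```

Raw mean differences depend on each feature's scale, so a feature measured in milliseconds such as `duration_ms` would dominate the ranking. Dividing each difference by the pooled standard deviation is one way to make the features comparable.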
