## Data Analysis using Pandas

### How to create the next Squid Game?

![hitsong](https://pbs.twimg.com/media/E_U24rOVEAAfrJm?format=jpg)

The 21st century has seen technological advances in the music industry that let consumers store music on digital devices such as MP3 players and iPods. The growing prevalence of smartphones and the digitization of music prompted the launch and wide adoption of music-listening apps such as Spotify, Google Play Music and Apple Music, among others, which gradually replaced CDs. This shift in music consumption, from purchasing physical albums to purchasing single tracks, not only changed the customer experience but also fundamentally changed the economics of the music industry.

In response to this evolution, Chris Anderson (2004) proposed the long tail theory to characterize music consumption in the digital era: a large number of tracks that were once unknown each gain some popularity and together form a long tail in the consumption distribution. This implies that popularity may spread across a wider range of music and artists, raising sales of lesser-known tracks from nearly zero to a few.

More recently, the emergence of streaming platforms such as Pandora and Spotify, together with the use of artificial intelligence in music recommendation, has gradually produced a spill-over effect (Aguiar and Waldfogel 2018): music listened to by other users with similar histories is recommended, so a track's popularity grows as it spreads from a few users to a larger group. This has pushed a short list of tracks to become exceptionally popular. In 2018, Professor Serguei Netessine of the Wharton School of the University of Pennsylvania stated in his podcast that, “We found that, if anything, you see more and more concentration of demand at the top”. Although the podcast focused on movie sales, sales of other experience goods such as theater and music behave in a similar fashion.

In the book “All You Need to Know About the Music Business”, Passman (2019) highlights key differences between the music business in the streaming era and in the era of record sales. In the days of record sales, artists were paid the same amount for each record sold, regardless of whether the buyer listened to it once or a thousand times. Today, the more listens a track gets, the more money the artist makes. Meanwhile, record sales did not have strong spillover effects, as fans of different artists and genres would purchase what they liked anyway. If anything, a hit album brought many people into record stores, which increased the chances of selling other records. In the streaming world, that is no longer true: the more listens one artist gets, the less money other artists make. In other words, music consumption is undergoing a radical shift that may affect the definition of popularity in the streaming era, yet this shift remains severely underexplored.

Inspired by the evolution of the music industry in recent decades and by the recent challenge to the long tail theory, given the high concentration of popularity in a short list of tracks, this assignment investigates the popularity of music tracks on streaming platforms, which differs substantially from popularity measured by album sales and has not been extensively explored. In particular, it does so without considering the level of advertisement or the inclusion in Spotify playlists that Aguiar and Waldfogel (2018) have noted.

References:
- Aguiar, L. & Waldfogel, J. (2018), Platforms, Promotion, and Product Discovery: Evidence from Spotify Playlists, JRC Digital Economy Working Paper 2018-04, JRC Technical Reports, JRC112023
- Passman, D. S. (2019), All You Need to Know About the Music Business, 10th Edition, Simon & Schuster, US



**Question 1.1**: We will retrieve information on the top 100 songs on [Spotifycharts](https://spotifycharts.com/) for each day from September 30th to October 4th, 2021. For each day, we can scrape the following characteristics from the chart page. For example, from the ["Global Top 200 on September 30"](https://spotifycharts.com/regional/global/daily/2021-09-30), we want to extract the information about the top song **STAY** as:
- spotify id (5PjdY0CKGZdEuoNab3yDmX)
- Song name (STAY (with Justin Bieber))
- Artist (The Kid LAROI)
- Number of streams (7,714,466)

![spotifycharts](https://aristake.com/wp-content/uploads/2021/09/Spotify-charts-HEADER-1.png)


After scraping the top 100 songs, save the data as a dataframe ```spotify_top_songs_global```. 

Then, similarly, scrape the top 100 songs of the Portuguese and Japanese markets for September 30th to October 4th, and save the data as the dataframes ```spotify_top_songs_portugal``` and ```spotify_top_songs_japan```, respectively.


You can concatenate these three dataframes as ```spotify_top_songs``` for the next question.

Note: if you are not able to scrape the data, download the csv files from the top right corner of the website, but you will not receive the points for this question.

Hint: explore the website to find the correct URL for each chart.
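As a small illustration of the hint, the daily chart URLs appear to follow the pattern `regional/<region code>/daily/<YYYY-MM-DD>`, as in the example link above. The sketch below builds the five URLs for one market under that assumption; the helper name `chart_urls` is hypothetical, and the region codes (`global`, `pt`, `jp`) are the ones used later in this notebook.

```python
from datetime import date, timedelta

BASE_URL = "https://spotifycharts.com/regional"

def chart_urls(region_code, start, end):
    """Return one daily-chart URL per day from start to end, inclusive."""
    n_days = (end - start).days + 1
    return [
        f"{BASE_URL}/{region_code}/daily/{(start + timedelta(days=i)).isoformat()}"
        for i in range(n_days)
    ]

# Five URLs for the Portuguese market, 30 September to 4 October 2021
urls = chart_urls("pt", date(2021, 9, 30), date(2021, 10, 4))
print(urls[0])
```

Calling the same helper with `"global"` or `"jp"` would give the URLs for the other two markets.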

In [1]:
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup
import cloudscraper
from time import sleep
from datetime import date, timedelta
import os

Given that Spotify enforces anti-bot measures to prevent web scraping, we will rely on a package named ```cloudscraper``` to bypass the mechanism. Essentially, you can use the following code to scrape such a website:

In [2]:
scraper = cloudscraper.create_scraper()

dates=[]
url_list=[]
final_g = []

#Map url for each date between the 30th September and the 4th October

url = "https://spotifycharts.com/regional/global/daily"
start_date= date(2021, 9, 30)
end_date= date(2021, 10, 4)
delta=end_date-start_date

for i in range(delta.days+1):
    day = start_date+timedelta(days=i)
    day_string= day.strftime("%Y-%m-%d")
    dates.append(day_string)
    
def add_url():
    for date in dates:
        c_string = url+"/"+date
        url_list.append(c_string)

add_url()

# Define a function that goes through each row of the chart table and finds the
# song ID, song name, artist name and number of streams for the given date

def song_scrape(x):
    # x is the chart URL; `songs` is the chart table parsed in the loop below
    url_date= x.split("daily/")[1]
    region="Global"

    # Keep only the first 100 rows (the top 100 songs) of the chart
    for tr in songs.find("tbody").findAll("tr")[:100]:
        songid= tr.find("td", {"class": "chart-table-image"}).find("a").get("href").split("track/")[1]
        songname= tr.find("td", {"class": "chart-table-track"}).find("strong").text
        artist= tr.find("td", {"class": "chart-table-track"}).find("span").text.replace("by ","").strip()
        n_streams= tr.find("td", {"class": "chart-table-streams"}).text

        final_g.append([songid, songname, artist, n_streams, url_date, region])
            
#Create array of all of our song info by looping in our 5 urls

for u in url_list:
    read_page= scraper.get(u)
    soup= BeautifulSoup(read_page.text, "html.parser")
    songs= soup.find("table", {"class":"chart-table"})
    song_scrape(u)
    
    
#Convert everything to data frame
spotify_top_songs_global = pd.DataFrame(final_g, columns= ["Song ID", 
                                                           "Song Name", 
                                                           "Artist", 
                                                           "N° of streams", 
                                                           "Chart Date",
                                                           "Region"])

spotify_top_songs_global['Rank']=[*range(1,101,1)]*5
spotify_top_songs_global

,Song ID,Song Name,Artist,N° of streams,Chart Date,Region,Rank
0,5PjdY0CKGZdEuoNab3yDmX,STAY (with Justin Bieber),The Kid LAROI,7714466,2021-09-30,Global,1
1,5Z9KJZvQzH6PFmb8SNkxuk,INDUSTRY BABY (feat. Jack Harlow),Lil Nas X,6517968,2021-09-30,Global,2
2,02MWAaffLxlfxAUY7c5dvx,Heat Waves,Glass Animals,4460880,2021-09-30,Global,3
3,3FeVmId7tL5YN8B7R3imoM,My Universe,"Coldplay, BTS",4142687,2021-09-30,Global,4
4,6PQ88X9TkUIAUIZJHW2upE,Bad Habits,Ed Sheeran,4077321,2021-09-30,Global,5
...,...,...,...,...,...,...,...
495,4OwhwvKESFtuu06dTgct7i,Tiroteo - Remix,"Marc Seguí, Rauw Alejandro, Pol Granch",972729,2021-10-04,Global,96
496,5QO79kh1waicV47BqGRL3g,Save Your Tears,The Weeknd,968693,2021-10-04,Global,97
497,2gMXnyrvIjhVBUZwvLZDMP,Before You Go,Lewis Capaldi,961416,2021-10-04,Global,98
498,1dIWPXMX4kRHj6Dt2DStUQ,Chosen (feat. Ty Dolla $ign),"Blxst, Tyga",954619,2021-10-04,Global,99
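
The scraping loop above assumes every request succeeds. Since the site applies anti-bot measures, a request may occasionally be refused, so one cautious option is to retry with a short pause. The sketch below is only an illustration: `fetch_with_retry` is a hypothetical helper, it takes the getter (for example `scraper.get`) as an argument, and the demonstration uses a fake response object rather than the real website.

```python
from time import sleep

def fetch_with_retry(get, url, retries=3, delay=1.0):
    """Call get(url) until it returns status 200 or the retries run out."""
    last_status = None
    for attempt in range(retries):
        response = get(url)
        last_status = response.status_code
        if last_status == 200:
            return response
        if attempt < retries - 1:
            sleep(delay * (attempt + 1))  # wait a little longer after each failure
    raise RuntimeError(f"giving up on {url}: last status {last_status}")

# Demonstration with a fake getter that fails once, then succeeds
class _FakeResponse:
    def __init__(self, status_code):
        self.status_code = status_code

_statuses = iter([503, 200])
demo = fetch_with_retry(lambda url: _FakeResponse(next(_statuses)),
                        "https://example.invalid", delay=0)
print(demo.status_code)
```

In the notebook, the call would look like `read_page = fetch_with_retry(scraper.get, u)`.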


In [3]:
# Same code for Portugal 

dates_p=[]
url_list_p=[]
final_p = []

url_p = "https://spotifycharts.com/regional/pt/daily"
start_date= date(2021, 9, 30)
end_date= date(2021, 10, 4)
delta=end_date-start_date

for i in range(delta.days+1):
    day = start_date+timedelta(days=i)
    day_string= day.strftime("%Y-%m-%d")
    dates_p.append(day_string)
    
def add_url():
    for date in dates_p:
        c_string = url_p+"/"+date
        url_list_p.append(c_string)

add_url()

def song_scrape(x):
    # x is the chart URL; `songs` is the chart table parsed in the loop below
    url_date= x.split("daily/")[1]
    region="Portugal"

    # Keep only the first 100 rows (the top 100 songs) of the chart
    for tr in songs.find("tbody").findAll("tr")[:100]:
        songid= tr.find("td", {"class": "chart-table-image"}).find("a").get("href").split("track/")[1]
        songname= tr.find("td", {"class": "chart-table-track"}).find("strong").text
        artist= tr.find("td", {"class": "chart-table-track"}).find("span").text.replace("by ","").strip()
        n_streams= tr.find("td", {"class": "chart-table-streams"}).text

        final_p.append([songid, songname, artist, n_streams, url_date, region])
        
for u in url_list_p:
    read_page= scraper.get(u)
    soup= BeautifulSoup(read_page.text, "html.parser")
    songs= soup.find("table", {"class":"chart-table"})
    song_scrape(u)

spotify_top_songs_portugal = pd.DataFrame(final_p, columns= ["Song ID", 
                                                             "Song Name", 
                                                             "Artist", 
                                                             "N° of streams", 
                                                             "Chart Date",
                                                             "Region"])
spotify_top_songs_portugal['Rank']=[*range(1,101,1)]*5
spotify_top_songs_portugal

,Song ID,Song Name,Artist,N° of streams,Chart Date,Region,Rank
0,5Z9KJZvQzH6PFmb8SNkxuk,INDUSTRY BABY (feat. Jack Harlow),Lil Nas X,49560,2021-09-30,Portugal,1
1,5fwSHlTEWpluwOM0Sxnh5k,Pepas,Farruko,45858,2021-09-30,Portugal,2
2,2Xr1dTzJee307rmrkt8c0g,love nwantiti (ah ah ah),CKay,39748,2021-09-30,Portugal,3
3,7aZusA4cWXz3Wv9e9uhavz,Quer Voar,Matuê,38730,2021-09-30,Portugal,4
4,5PjdY0CKGZdEuoNab3yDmX,STAY (with Justin Bieber),The Kid LAROI,34311,2021-09-30,Portugal,5
...,...,...,...,...,...,...,...
495,04sN26COy28wTXYj3dMoiZ,Bored,Billie Eilish,7122,2021-10-04,Portugal,96
496,0DsPj89zlY3Us7xb5cXK5h,"Trava na Pose, Chama no Zoom, Dá um Close (fea...","DJ Patrick Muniz, Dj Olliver, Mc Topre",7046,2021-10-04,Portugal,97
497,1m0UFnuTktOkksvjbF9z0m,Ramenez la coupe à la maison,Vegedream,7026,2021-10-04,Portugal,98
498,275Brpw83x3q0mBa9MpCx3,Volta,T-Rex,7020,2021-10-04,Portugal,99


In [4]:
# Same code for Japan

dates_j=[]
url_list_j=[]
final_j = []

url_j = "https://spotifycharts.com/regional/jp/daily"
start_date= date(2021, 9, 30)
end_date= date(2021, 10, 4)
delta=end_date-start_date

for i in range(delta.days+1):
    day = start_date+timedelta(days=i)
    day_string= day.strftime("%Y-%m-%d")
    dates_j.append(day_string)
    
def add_url():
    for date in dates_j:
        c_string = url_j+"/"+date
        url_list_j.append(c_string)

add_url()

def song_scrape(x):
    # x is the chart URL; `songs` is the chart table parsed in the loop below
    url_date= x.split("daily/")[1]
    region= "Japan"

    # Keep only the first 100 rows (the top 100 songs) of the chart
    for tr in songs.find("tbody").findAll("tr")[:100]:
        songid= tr.find("td", {"class": "chart-table-image"}).find("a").get("href").split("track/")[1]
        songname= tr.find("td", {"class": "chart-table-track"}).find("strong").text
        artist= tr.find("td", {"class": "chart-table-track"}).find("span").text.replace("by ","").strip()
        n_streams= tr.find("td", {"class": "chart-table-streams"}).text

        final_j.append([songid, songname, artist, n_streams, url_date, region])
        
for u in url_list_j:
    read_page= scraper.get(u)
    soup= BeautifulSoup(read_page.text, "html.parser")
    songs= soup.find("table", {"class":"chart-table"})
    song_scrape(u)
    
spotify_top_songs_japan = pd.DataFrame(final_j, columns= ["Song ID", 
                                                          "Song Name", 
                                                          "Artist", 
                                                          "N° of streams", 
                                                          "Chart Date", 
                                                          "Region"])
spotify_top_songs_japan['Rank']=[*range(1,101,1)]*5
spotify_top_songs_japan

,Song ID,Song Name,Artist,N° of streams,Chart Date,Region,Rank
0,5eXBXreN3d1zdj6Sa8dS0u,Permission to Dance,BTS,211629,2021-09-30,Japan,1
1,2bgTY4UwhfBYhGT4HUYStN,Butter,BTS,210937,2021-09-30,Japan,2
2,5m1i6hq7dmRlp3c1utE48L,水平線,back number,206640,2021-09-30,Japan,3
3,7dH0dpi751EoguDDg3xx6J,ドライフラワー,優里,202431,2021-09-30,Japan,4
4,6wDntdm888mDo458RaYjGl,Cry Baby,Official HIGE DANdism,193861,2021-09-30,Japan,5
...,...,...,...,...,...,...,...
495,2YQ8TlTmNheRI3VafoDpod,10月無口な君を忘れる,あたらよ,38331,2021-10-04,Japan,96
496,3QIAwtEEDOrv0g5NKCGrXZ,花束,back number,38136,2021-10-04,Japan,97
497,19fhOFi6pNGeZe5uiFlm7c,優しい彗星,YOASOBI,37380,2021-10-04,Japan,98
498,3bbIIVIwBoLqVcLebiEJFo,のびしろ,Creepy Nuts,37239,2021-10-04,Japan,99


In [5]:
spotify_top_songs= pd.concat([spotify_top_songs_global, spotify_top_songs_japan, spotify_top_songs_portugal ])
spotify_top_songs

,Song ID,Song Name,Artist,N° of streams,Chart Date,Region,Rank
0,5PjdY0CKGZdEuoNab3yDmX,STAY (with Justin Bieber),The Kid LAROI,7714466,2021-09-30,Global,1
1,5Z9KJZvQzH6PFmb8SNkxuk,INDUSTRY BABY (feat. Jack Harlow),Lil Nas X,6517968,2021-09-30,Global,2
2,02MWAaffLxlfxAUY7c5dvx,Heat Waves,Glass Animals,4460880,2021-09-30,Global,3
3,3FeVmId7tL5YN8B7R3imoM,My Universe,"Coldplay, BTS",4142687,2021-09-30,Global,4
4,6PQ88X9TkUIAUIZJHW2upE,Bad Habits,Ed Sheeran,4077321,2021-09-30,Global,5
...,...,...,...,...,...,...,...
495,04sN26COy28wTXYj3dMoiZ,Bored,Billie Eilish,7122,2021-10-04,Portugal,96
496,0DsPj89zlY3Us7xb5cXK5h,"Trava na Pose, Chama no Zoom, Dá um Close (fea...","DJ Patrick Muniz, Dj Olliver, Mc Topre",7046,2021-10-04,Portugal,97
497,1m0UFnuTktOkksvjbF9z0m,Ramenez la coupe à la maison,Vegedream,7026,2021-10-04,Portugal,98
498,275Brpw83x3q0mBa9MpCx3,Volta,T-Rex,7020,2021-10-04,Portugal,99
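
Note that the concatenated output keeps each dataframe's own row labels, so the labels 0 to 499 repeat once per market. This does not affect the later merge, but it can surprise label-based lookups. The toy example below (made-up two-row frames, not the scraped data) shows the difference that `ignore_index=True` makes.

```python
import pandas as pd

# Two tiny stand-ins for the per-market dataframes
global_df = pd.DataFrame({"Song ID": ["a", "b"], "Region": "Global"})
japan_df = pd.DataFrame({"Song ID": ["c", "d"], "Region": "Japan"})

stacked = pd.concat([global_df, japan_df])                        # keeps original labels
renumbered = pd.concat([global_df, japan_df], ignore_index=True)  # one running index

print(stacked.index.tolist())
print(renumbered.index.tolist())
```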


**Question 1.2** Now go to the Spotify platform and use its API to get more information. The detailed [documentation](https://developer.spotify.com/documentation/web-api/) should guide you through the entire process.

First, get the audio features for the songs in ```spotify_top_songs```. You can check the API for getting audio features for several tracks [here](https://developer.spotify.com/console/get-audio-features-several-tracks/). Essentially, you need to call the [API endpoint](https://developer.spotify.com/console/get-audio-features-several-tracks/), whose page gives very detailed explanations. You should then receive the [Audio feature object](https://developer.spotify.com/documentation/web-api/reference/#object-audiofeaturesobject) as JSON; save it as the dataframe ```spotify_top_songs_acoustic_features``` with these features:
- danceability
- energy
- key
- loudness
- mode
- speechiness
- acousticness
- instrumentalness
- liveness
- valence
- tempo
- id
- duration_ms
- time_signature

Note: if you are not able to get this data, download the csv file from Moodle to continue the analysis, but you will not receive the grade for this question.

Hint1: when you request acoustic features for multiple tracks, the URL contains the track IDs joined by ```%2C``` (a URL-encoded comma). For example, for the two tracks STAY (5PjdY0CKGZdEuoNab3yDmX) and INDUSTRY BABY (5Z9KJZvQzH6PFmb8SNkxuk), the URL is: `https://api.spotify.com/v1/audio-features?ids=5PjdY0CKGZdEuoNab3yDmX%2C5Z9KJZvQzH6PFmb8SNkxuk`

Hint2: Spotify requires authentication (a token) to access its data. Go to the Spotify [developer platform](https://developer.spotify.com/console/get-audio-features-several-tracks/) to request a token and include it in your requests. The token expires after a while; when that happens, simply request a new one.

Hint3: Spotify restricts the number of tracks that can be requested in each API call (up to 100), so you may need to make several separate calls and then combine the results.
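
To illustrate Hint1 and Hint3 together, the sketch below splits a list of track IDs into batches of at most 100 and builds one request URL per batch. It only constructs URLs and sends nothing; `batched_feature_urls` is a hypothetical helper name, and the 250 placeholder IDs are made up for the demonstration.

```python
AUDIO_FEATURES_URL = "https://api.spotify.com/v1/audio-features"

def batched_feature_urls(track_ids, batch_size=100):
    """Build one multi-track URL per batch of at most batch_size IDs."""
    urls = []
    for start in range(0, len(track_ids), batch_size):
        batch = track_ids[start:start + batch_size]
        # IDs are joined by %2C, the URL-encoded comma from Hint1
        urls.append(f"{AUDIO_FEATURES_URL}?ids={'%2C'.join(batch)}")
    return urls

# The two tracks from the chart example
print(batched_feature_urls(["5PjdY0CKGZdEuoNab3yDmX", "5Z9KJZvQzH6PFmb8SNkxuk"])[0])

# 250 placeholder IDs need three calls: 100 + 100 + 50
fake_ids = [f"id{i:03d}" for i in range(250)]
batch_urls = batched_feature_urls(fake_ids)
print(len(batch_urls))
```

Each URL would then be requested with the authorization header from Hint2, and the per-batch results combined into one dataframe.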

In [6]:
# Request a new token from Spotify and paste it below.
# Tokens expire, and a real token should not be shared or committed.

access_token = 'YOUR_SPOTIFY_ACCESS_TOKEN'
headers = {
    'Authorization': 'Bearer {token}'.format(token=access_token)
}


In [8]:
audio_features_base_url = 'https://api.spotify.com/v1/audio-features/'

unique_ids = spotify_top_songs['Song ID'].unique()

# Features read directly from the response, in the column order used below
feature_names = ['danceability', 'energy', 'key', 'loudness', 'mode', 'speechiness',
                 'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo']

features = []

for songid in unique_ids:
    song_features_url = audio_features_base_url+songid
    response = requests.get(song_features_url, headers=headers)
    # Stop with a clear HTTP error (e.g. expired token) instead of a KeyError below
    response.raise_for_status()

    features_data = response.json()

    # One row per song: the audio features, then the song ID, duration and time signature
    features.append([features_data[name] for name in feature_names]
                    + [songid, features_data['duration_ms'], features_data['time_signature']])

spotify_top_songs_acoustic_features = pd.DataFrame(features, columns=['Danceability',
                                                                      'Energy',
                                                                      'Key',
                                                                      'Loudness',
                                                                      'Mode',
                                                                      'Speechiness',
                                                                      'Acousticness',
                                                                      'Instrumentalness',
                                                                      'Liveness',
                                                                      'Valence',
                                                                      'Tempo',
                                                                      'Song ID',
                                                                      'Duration (ms)',
                                                                      'Time Signature'])
spotify_top_songs_acoustic_features

,Danceability,Energy,Key,Loudness,Mode,Speechiness,Acousticness,Instrumentalness,Liveness,Valence,Tempo,Song ID,Duration (ms),Time Signature
0,0.591,0.764,1,-5.484,1,0.0483,0.03830,0.000000,0.1030,0.478,169.928,5PjdY0CKGZdEuoNab3yDmX,141806,4
1,0.741,0.691,10,-7.395,0,0.0672,0.02210,0.000000,0.0476,0.892,150.087,5Z9KJZvQzH6PFmb8SNkxuk,212353,4
2,0.761,0.525,11,-6.900,1,0.0944,0.44000,0.000007,0.0921,0.531,80.870,02MWAaffLxlfxAUY7c5dvx,238805,4
3,0.588,0.701,9,-6.390,1,0.0402,0.00813,0.000000,0.2000,0.443,104.988,3FeVmId7tL5YN8B7R3imoM,228000,4
4,0.808,0.897,11,-3.712,0,0.0348,0.04690,0.000031,0.3640,0.591,126.026,6PQ88X9TkUIAUIZJHW2upE,231041,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
262,0.725,0.756,4,-5.013,1,0.0572,0.36200,0.000685,0.1030,0.828,100.070,3VvA1wSxukMLsvXoXtlwWx,186133,4
263,0.800,0.658,1,-6.142,0,0.0790,0.25000,0.000000,0.1110,0.462,140.042,7hxHWCCAIIxFLCzvDgnQHX,195429,4
264,0.841,0.728,7,-3.370,1,0.0484,0.08470,0.000000,0.1490,0.430,130.049,6gBFPUFcJLzWGx4lenP6h2,243837,4
265,0.771,0.515,10,-9.342,0,0.0543,0.41600,0.000022,0.0467,0.314,124.002,20cn2KYYgyuxXRC3WynYZn,233247,4


**Question 1.3**
Merge the dataframe ```spotify_top_songs_acoustic_features``` with ```spotify_top_songs``` to enrich the chart data with the acoustic features, then check the resulting number of rows and columns.

In [9]:
# Question 1.3
spotify_all= spotify_top_songs.merge(spotify_top_songs_acoustic_features, on="Song ID", how="outer")
spotify_all["N° of streams"]=spotify_all["N° of streams"].str.replace(',', '').astype(float)
#spotify_all= spotify_all.drop("Unnamed: 0",1)
spotify_all.to_csv('spotify_all.csv',encoding="utf-8")
spotify_all.shape



(1500, 20)
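
The shape check confirms that no chart rows were lost or duplicated (3 markets × 5 days × 100 songs = 1500 rows). As an additional, optional safeguard, `merge` can verify the relationship itself. The toy sketch below uses made-up data and a left join (the cell above uses an outer join) to show `validate`, which raises an error if a song ID appears twice in the features table, and `indicator`, which flags chart rows that found no features.

```python
import pandas as pd

# Toy chart rows: song "a" charts on two days, song "c" has no features
charts = pd.DataFrame({
    "Song ID": ["a", "a", "b", "c"],
    "Chart Date": ["2021-09-30", "2021-10-01", "2021-09-30", "2021-09-30"],
})
features = pd.DataFrame({"Song ID": ["a", "b"], "Tempo": [120.0, 95.5]})

merged = charts.merge(features, on="Song ID", how="left",
                      validate="many_to_one",  # each song ID at most once in features
                      indicator=True)          # adds a "_merge" column

missing = merged.loc[merged["_merge"] == "left_only", "Song ID"].tolist()
print(merged.shape)
print(missing)
```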

**Question 1.4** Show the top 3 most popular artists in terms of the number of unique songs on the chart in the global, Portuguese and Japanese markets, respectively.

In [11]:
# Question 1.4

spotify_all_new_idreg = spotify_all.drop_duplicates(subset=['Song ID', 'Region'])
spotify_all_new_idreg = spotify_all_new_idreg.groupby('Region')['Artist'].apply(lambda x: x.value_counts().iloc[0:3]).to_frame()
spotify_all_new_idreg

Region,Artist,Unique songs
Global,Olivia Rodrigo,7
Global,Doja Cat,5
Global,Billie Eilish,4
Japan,YOASOBI,13
Japan,BTS,7
Japan,HIRAIDAI,6
Portugal,Doja Cat,4
Portugal,Olivia Rodrigo,4
Portugal,The Weeknd,3


**Question 1.5** Show the average value of the acoustic features of songs in the global market by duration quartile (0-25%, 25-50%, 50-75%, 75-100%).

In [12]:
spotify_global= spotify_all[(spotify_all["Region"]=="Global")].copy()
spotify_global = spotify_global.drop_duplicates(subset=['Song ID']).iloc[:,7:]

spotify_global["Quartiles"]=pd.qcut(spotify_global['Duration (ms)'], 4, labels=["0-25%", "25-50%", "50-75%", "75-100%"])
spotify_global.groupby(["Quartiles"]).mean()


Quartiles,Danceability,Energy,Key,Loudness,Mode,Speechiness,Acousticness,Instrumentalness,Liveness,Valence,Tempo,Duration (ms),Time Signature
0-25%,0.687724,0.669379,5.827586,-5.686793,0.482759,0.101397,0.236579,0.022901,0.170731,0.524966,125.483103,149746.172414,3.965517
25-50%,0.706552,0.674103,4.862069,-5.19231,0.689655,0.091662,0.244524,0.003462,0.179862,0.629069,128.085862,178988.758621,3.896552
50-75%,0.693821,0.688429,4.928571,-5.571,0.607143,0.076025,0.25305,6.6e-05,0.163929,0.553393,124.055429,207381.214286,4.0
75-100%,0.661931,0.603,6.103448,-6.181621,0.689655,0.101766,0.22598,0.004259,0.134655,0.429514,124.879241,254341.551724,3.862069
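
To make the binning step concrete, the toy example below (eight made-up songs, not the chart data) shows how `pd.qcut` assigns equal-sized duration quartiles before the group means are taken.

```python
import pandas as pd

# Eight made-up songs, sorted by duration for readability
songs = pd.DataFrame({
    "Duration (ms)": [120_000, 150_000, 170_000, 185_000,
                      200_000, 215_000, 240_000, 300_000],
    "Energy": [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2],
})

labels = ["0-25%", "25-50%", "50-75%", "75-100%"]
# qcut cuts at the sample quantiles, so each bin receives two of the eight songs
songs["Quartiles"] = pd.qcut(songs["Duration (ms)"], 4, labels=labels)

counts = songs["Quartiles"].value_counts().sort_index()
summary = songs.groupby("Quartiles", observed=True)["Energy"].mean()
print(counts)
print(summary)
```

With real data the bins are only approximately equal in size, because the number of songs is rarely divisible by four and ties in duration fall into the same bin.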


**Question 1.6** Show the top 3 artists with the most total streams in the global, Portuguese and Japanese markets.

In [13]:
# Question 1.6

#Global top 3 artists with the most total streams
spotify_global=spotify_all[(spotify_all["Region"]=="Global")]
global_streams_artist= spotify_global.groupby(['Region','Artist'])["N° of streams"].sum().sort_values(ascending=False).to_frame().head(3)

#Portugal top 3 artists with the most total streams
spotify_portugal=spotify_all[(spotify_all["Region"]=="Portugal")]
portugal_streams_artist= spotify_portugal.groupby(['Region','Artist'])["N° of streams"].sum().sort_values(ascending=False).to_frame().head(3)

#Japan top 3 artists with the most total streams
spotify_japan=spotify_all[(spotify_all["Region"]=="Japan")]
japan_streams_artist= spotify_japan.groupby(['Region','Artist'])["N° of streams"].sum().sort_values(ascending=False).to_frame().head(3)

#All
streams_artist=pd.concat([japan_streams_artist,portugal_streams_artist,global_streams_artist])
streams_artist

Region,Artist,N° of streams
Japan,YOASOBI,7197817.0
Japan,BTS,4291855.0
Japan,Official HIGE DANdism,3138171.0
Portugal,Lil Nas X,490634.0
Portugal,CKay,320281.0
Portugal,Doja Cat,314178.0
Global,Lil Nas X,64552221.0
Global,Doja Cat,58792737.0
Global,Olivia Rodrigo,55254893.0
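
The same top-N ranking can also be obtained for all markets in one pass, without filtering each region separately. The sketch below uses a small made-up table and keeps the top 2 artists per region; for the assignment the argument to `head` would be 3.

```python
import pandas as pd

# Made-up chart rows: one row per song per day
plays = pd.DataFrame({
    "Region": ["Global"] * 4 + ["Japan"] * 3,
    "Artist": ["A", "B", "A", "C", "X", "Y", "X"],
    "N° of streams": [10.0, 7.0, 5.0, 1.0, 4.0, 3.0, 2.0],
})

# Total streams per (region, artist) pair
totals = plays.groupby(["Region", "Artist"])["N° of streams"].sum()

# Sort once, then keep the first rows of each region
top = totals.sort_values(ascending=False).groupby(level="Region").head(2)
print(top)
```

Because the sort happens before the grouping, `head` picks each region's highest totals; the rows stay in descending order of streams rather than being grouped by region.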


**Question 1.7** Show the number of songs in each key (rows) for the Portuguese and Japanese markets (columns).

In [14]:
# Question 1.7
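# Name each count series after its market so the result does not depend on
# the default column name, which differs between pandas versions
portugal_nsongs=spotify_portugal.drop_duplicates(subset=['Song ID'], keep="first").value_counts("Key").sort_index().rename("Portugal").to_frame()
japan_nsongs=spotify_japan.drop_duplicates(subset=['Song ID'], keep="first").value_counts("Key").sort_index().rename("Japan").to_frame()

# Inner merge: only keys that occur in both markets are kept
songs_per_key=portugal_nsongs.merge(japan_nsongs, on = "Key")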
songs_per_key

Key,Portugal,Japan
0,11,5
1,16,16
2,6,11
3,5,4
4,4,3
5,12,10
6,9,6
7,11,10
8,17,12
9,8,9
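
An alternative way to build this kind of key-by-market table is `pd.crosstab`, which counts every combination directly and fills combinations that never occur with 0 instead of dropping the key. The sketch below uses made-up rows, not the chart data.

```python
import pandas as pd

# Made-up rows: song "a" charts twice in Portugal and once in Japan
songs = pd.DataFrame({
    "Song ID": ["a", "a", "b", "c", "a", "d"],
    "Region": ["Portugal", "Portugal", "Portugal", "Japan", "Japan", "Japan"],
    "Key": [0, 0, 5, 5, 0, 11],
})

# Count each song once per market, then tabulate key against market
unique_songs = songs.drop_duplicates(subset=["Song ID", "Region"])
songs_per_key = pd.crosstab(unique_songs["Key"], unique_songs["Region"])
print(songs_per_key)
```

Here key 11 appears only in the Japanese rows, so its Portugal cell is 0 rather than missing.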


**Question 1.8** Show the top 5 artists with the most song-days in the global market (if a song appeared on 2 days, it counts as 2 song-days).

In [15]:
# Question 1.8

global_songdays_artist=spotify_global[['Artist', 'Chart Date']].groupby(['Artist']).count().sort_values(by='Chart Date',ascending=False).head(5)
global_songdays_artist

Artist,Chart Date
Olivia Rodrigo,32
Doja Cat,25
The Weeknd,20
Billie Eilish,20
Drake,20


**Question 1.9** Compare the acoustic features of top songs in Portugal and in Japan by checking the correlations between rank and acoustic features, using Pearson and Spearman correlations.


In [16]:
# Correlate numeric columns only (text columns such as song name cannot be
# correlated), then keep the row of correlations with Rank
portugal_corr_p=spotify_portugal.select_dtypes('number').corr(method='pearson').loc[['Rank']]
portugal_corr_s=spotify_portugal.select_dtypes('number').corr(method='spearman').loc[['Rank']]

japan_corr_p=spotify_japan.select_dtypes('number').corr(method='pearson').loc[['Rank']]
japan_corr_s=spotify_japan.select_dtypes('number').corr(method='spearman').loc[['Rank']]

corr_matrix=pd.concat([portugal_corr_p,portugal_corr_s, japan_corr_p, japan_corr_s])
corr_matrix.insert(0,'Correlation type',['Pearson Portugal','Spearman Portugal','Pearson Japan', 'Spearman Japan' ])

corr_matrix.drop("N° of streams", axis=1)

,Correlation type,Rank,Danceability,Energy,Key,Loudness,Mode,Speechiness,Acousticness,Instrumentalness,Liveness,Valence,Tempo,Duration (ms),Time Signature
Rank,Pearson Portugal,1.0,0.083782,-0.0178,-0.059467,-0.166626,-0.015522,0.024322,0.021286,-0.065531,-0.087367,-0.086033,0.023731,0.045797,0.025701
Rank,Spearman Portugal,1.0,0.083223,-0.018528,-0.068003,-0.144227,-0.015522,-0.032392,0.004246,-0.110757,0.047954,-0.084194,0.009893,0.046917,0.024947
Rank,Pearson Japan,1.0,-0.022101,-0.014775,-0.056961,0.053387,-0.028735,0.051946,-0.001515,0.115483,-0.183397,-0.00457,-0.021304,0.102141,
Rank,Spearman Japan,1.0,-0.011061,-0.011571,-0.051502,0.060453,-0.028735,0.122617,0.03259,0.159398,-0.177359,-0.006005,-0.027492,0.095879,
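
The two coefficients answer slightly different questions: Pearson measures how linear a relationship is, while Spearman applies the same calculation to ranks and therefore only measures whether it is monotonic. The toy example below (a made-up feature that grows as the cube of the rank) illustrates the difference.

```python
import pandas as pd

# A perfectly monotonic but non-linear relationship
toy = pd.DataFrame({"Rank": [1, 2, 3, 4, 5, 6]})
toy["Feature"] = toy["Rank"] ** 3

pearson = toy.corr(method="pearson").loc["Rank", "Feature"]
spearman = toy.corr(method="spearman").loc["Rank", "Feature"]

# Spearman is 1 because the ordering is identical; Pearson is below 1
# because the points do not lie on a straight line
print(round(pearson, 3), round(spearman, 3))
```

Since chart rank is itself an ordinal variable, the Spearman rows in the table above are arguably the more natural ones to read, although in this data both methods give similarly weak correlations.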


**Question 1.10** 
Compare the acoustic features of top songs in Portugal and in Japan, by checking whether the differences between feature values are statistically significant or not. Show the features ranked by the absolute magnitude of differences with statistical significance level of at least p<0.05.

In [19]:
from scipy.stats import ttest_ind

dictionary_ttest={}

spotify_portugal_features= spotify_portugal.drop_duplicates(subset=['Song ID'], keep="first").set_index("Rank").iloc[:,6:]

spotify_japan_features= spotify_japan.drop_duplicates(subset=['Song ID'], keep="first").set_index("Rank").iloc[:,6:]

for feature in spotify_portugal_features.columns:
    ttest, pvalue = ttest_ind(spotify_portugal_features[feature], spotify_japan_features[feature], equal_var=True)

    if pvalue < 0.05:
        dictionary_ttest[feature] = pvalue
        print("- The differences between \033[1m{}\033[0m values are statistically significant. We reject the null hypothesis".format(feature))
    else:
        print("- The differences between {} values are not statistically significant. We fail to reject the null hypothesis".format(feature))

rank_significantfeatures = pd.DataFrame.from_dict(dictionary_ttest,orient='index').sort_values(by=0,ascending=True)
rank_significantfeatures.rename({0: 'P-value'}, axis=1, inplace=True)
rank_significantfeatures

- The differences between **Danceability** values are statistically significant. We reject the null hypothesis
- The differences between **Energy** values are statistically significant. We reject the null hypothesis
- The differences between Key values are not statistically significant. We fail to reject the null hypothesis
- The differences between **Loudness** values are statistically significant. We reject the null hypothesis
- The differences between **Mode** values are statistically significant. We reject the null hypothesis
- The differences between **Speechiness** values are statistically significant. We reject the null hypothesis
- The differences between **Acousticness** values are statistically significant. We reject the null hypothesis
- The differences between Instrumentalness values are not statistically significant. We fail to reject the null hypothesis
- The differences between **Liveness** values are statistically significant. We reject the null hypothesis
- The differences between **Valence** values are statistically significant. We reject the null hypothesis
...

Feature,P-value
Duration (ms),5.732983e-09
Loudness,8.543756e-09
Energy,5.146515e-08
Acousticness,1.826706e-07
Speechiness,7.281897e-07
Danceability,6.218529e-06
Mode,9.214955e-05
Liveness,0.001126757
Valence,0.01881358
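
The table above orders the significant features by p-value. The question also asks for a ranking by the absolute magnitude of the differences, which can be sketched as follows. The data here is made up (three toy features named `a`, `b`, `c`), and the same two-sample t-test with `equal_var=True` is assumed.

```python
import pandas as pd
from scipy.stats import ttest_ind

# Made-up feature values for two markets; "b" is identical in both
portugal = pd.DataFrame({"a": [0.1, 0.2, 0.3, 0.4, 0.5],
                         "b": [1.0, 2.0, 3.0, 4.0, 5.0],
                         "c": [0.0, 0.1, 0.2, 0.3, 0.4]})
japan = pd.DataFrame({"a": [5.1, 5.2, 5.3, 5.4, 5.5],
                      "b": [1.0, 2.0, 3.0, 4.0, 5.0],
                      "c": [1.0, 1.1, 1.2, 1.3, 1.4]})

rows = []
for feature in portugal.columns:
    _, pvalue = ttest_ind(portugal[feature], japan[feature], equal_var=True)
    diff = portugal[feature].mean() - japan[feature].mean()
    rows.append({"Feature": feature, "Mean difference": diff,
                 "Abs difference": abs(diff), "P-value": pvalue})

result = pd.DataFrame(rows).set_index("Feature")

# Keep significant features only, largest absolute difference first
ranked = result[result["P-value"] < 0.05].sort_values("Abs difference", ascending=False)
print(ranked)
```

On the real data the features are on very different scales (duration in milliseconds, most others between 0 and 1), so raw differences are dominated by the large-scale features; dividing each difference by a standard deviation before ranking would be one way to make them comparable.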
