# In this notebook, we will use the Apple Music, LastFM, and Spotify APIs to collect the top 5 tracks per region

Leveraging client and secret keys, we will authenticate with each of the above platforms and collect data about the top tracks per region.  After collecting the data, the goal is to gather information about the tracks and identify any trends in the data.

# After collecting data from each platform, we will clean and aggregate additional information about the tracks

Each API offers different insights into the data. After collecting the top 5 by region, Apple's API gives us a clear genre for each song.  To standardize the data, we will use Apple's API to collect the genre information for every track.

Once that is complete, Spotify's API gives us detailed information about the audio features of each track.  We will be reviewing the following data points:

**Danceability** describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable. 
    
**Energy** is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.

**Pitch Class** = Tonal counterparts:
    0=C
    1=C♯
    2=D
    3=D♯
    4=E
    5=F
    6=F♯
    7=G
    8=G♯
    9=A
    10=A♯
    11=B
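
As a minimal sketch, the integer `key` value that Spotify returns can be turned into a note name with a simple lookup. The names `PITCH_CLASSES` and `key_to_note` are illustrative and not part of the original notebook; returning `None` for out-of-range values assumes Spotify's convention of reporting -1 when no key was detected.

```python
# Note names indexed by pitch class, following the table above
PITCH_CLASSES = ["C", "C♯", "D", "D♯", "E", "F", "F♯", "G", "G♯", "A", "A♯", "B"]


def key_to_note(key):
    """Return the note name for a pitch-class integer, or None if it is out of range."""
    if 0 <= key < len(PITCH_CLASSES):
        return PITCH_CLASSES[key]
    return None


note_for_seven = key_to_note(7)
```

A helper like this could be applied to the `key` column once the audio features have been collected.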

**Loudness** - The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB. 

**Speechiness** detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks. 
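
The thresholds above can be sketched as a small helper that buckets a speechiness score. The function name and the bucket labels are illustrative assumptions; only the 0.33 and 0.66 cut-offs come from the description.

```python
def describe_speechiness(value):
    """Bucket a speechiness score using the 0.33 and 0.66 thresholds described above."""
    if value > 0.66:
        return "mostly spoken word"
    if value >= 0.33:
        return "music and speech"
    return "mostly music"


example_label = describe_speechiness(0.4)
```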

**Acousticness** - a confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic. 

**Instrumentalness** - Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0. 

**Liveness** - Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live. 

**Valence** - A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).

In [1]:
#importing the necessary libraries we will need to run our data and analysis

%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

import sys
import csv

from googletrans import Translator 

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import requests
import json
from pprint import pprint 

import spotipy
import spotipy.util as util
from spotipy.oauth2 import SpotifyClientCredentials
import spotipy.oauth2 as oauth2

from config import lastfm_api_key, CLIENT_ID, CLIENT_SECRET
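
The `config` module imported above is not shown in the notebook. One possible layout, assuming the keys are stored in environment variables, is sketched below. The environment variable names are hypothetical; only `lastfm_api_key`, `CLIENT_ID`, and `CLIENT_SECRET` come from the import statement.

```python
# config.py (hypothetical layout; the original file is not shown)
import os

# The environment variable names below are assumptions chosen for illustration.
# Each value falls back to an empty string when the variable is not set.
lastfm_api_key = os.environ.get("LASTFM_API_KEY", "")
CLIENT_ID = os.environ.get("SPOTIFY_CLIENT_ID", "")
CLIENT_SECRET = os.environ.get("SPOTIFY_CLIENT_SECRET", "")
```

Keeping the keys out of the notebook itself makes it safer to share the notebook with others.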

# Using a quick Google search, we found the top countries for music.  We then selected the countries for which data was readily available across all of our platforms 

Countries include:
<ul>
    <li>Hong Kong</li>
    <li>Netherlands</li>
    <li>Australia</li>
    <li>Canada</li>
    <li>France</li>
    <li>Japan</li>
    <li>United Kingdom</li>
    <li>Germany</li>
    <li>United States</li>
</ul>

In [3]:
#list of countries we narrow down to sort through our data

countries = ['hk', 'nl', 'au', 'ca', 'fr', 'jp', 'gb', 'de', 'us' ]

# Apple Music API to get the top songs in the above countries

In [4]:
#create a dataframe that takes in all the top 5 songs from each country on our list

song_list = []
artist_list = []
album_list = []
country_list = []
genre_list = []
rank = []

for country in countries:
    music_url = f"https://rss.itunes.apple.com/api/v1/{country}/apple-music/top-songs/all/5/explicit.json"
    
    try:
        # The feed returns the ranked songs as a list under feed -> results
        results = requests.get(music_url).json()["feed"]["results"]
        for x, track in enumerate(results[:5]):
            # Read every field first so a missing key cannot leave the lists misaligned
            name, artist = track['name'], track['artistName']
            album, genre = track["collectionName"], track["genres"][0]["name"]
            rank.append(x+1)
            song_list.append(name)
            country_list.append(country)
            artist_list.append(artist)
            album_list.append(album)
            genre_list.append(genre)
        
    except (requests.RequestException, ValueError, KeyError, IndexError):
        print(f"Can't find {country}. Skipping... ")

# Create a dataframe with the country, song name and details, and the source        
apple_top_df = pd.DataFrame({
    "Country" : country_list,
    "Name" : song_list,
    "Artist" : artist_list,
    "Album" : album_list,
    "Genre" : genre_list,
    "Rank" : rank
})

#Clean the dataframe to match our pre-decided naming convention for the group
apple_top_df['Country'] = apple_top_df['Country'].replace({'hk': 'Hong Kong',
                           'nl': 'Netherlands',
                           'au': 'Australia',
                           'ca': 'Canada',
                           'fr': 'France',
                           'jp': 'Japan',
                           'gb': 'UK',
                           'de': 'Germany',
                           'us': 'US'
                          })
apple_top_df['Source'] = 'Apple'
apple_top_df.head()

Unnamed: 0,Country,Name,Artist,Album,Genre,Rank,Source
0,Hong Kong,Señorita,Shawn Mendes & Camila Cabello,Señorita - Single,流行樂,1,Apple
1,Hong Kong,Into the Unknown,Idina Menzel & AURORA,Frozen 2 (Original Motion Picture Soundtrack /...,原聲帶,2,Apple
2,Hong Kong,說好不哭,周杰倫 & 阿信,說好不哭 - Single,國語流行樂,3,Apple
3,Hong Kong,Show Yourself,Idina Menzel & Evan Rachel Wood,Frozen 2 (Original Motion Picture Soundtrack /...,原聲帶,4,Apple
4,Hong Kong,讓愛高飛 (劇集《多功能老婆》片尾曲),周柏豪,讓愛高飛 (劇集《多功能老婆》片尾曲) - Single,廣東歌,5,Apple
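
The flattening step can also be sketched without a network call. The example below is a minimal illustration, assuming the feed JSON has the `feed -> results` shape used in the cell above; `flatten_feed` and `sample_feed` are hypothetical names, and the sample payload is hand-written from the first row of the output rather than a captured response.

```python
def flatten_feed(feed_json, country, top_n=5):
    """Turn one country's feed into a list of row dictionaries, ranked from 1."""
    rows = []
    for position, track in enumerate(feed_json["feed"]["results"][:top_n], start=1):
        rows.append({
            "Country": country,
            "Name": track["name"],
            "Artist": track["artistName"],
            "Album": track["collectionName"],
            "Genre": track["genres"][0]["name"],
            "Rank": position,
        })
    return rows


# Hand-written sample shaped like the feed; not a real API response
sample_feed = {
    "feed": {
        "results": [
            {
                "name": "Señorita",
                "artistName": "Shawn Mendes & Camila Cabello",
                "collectionName": "Señorita - Single",
                "genres": [{"name": "流行樂"}],
            }
        ]
    }
}

rows = flatten_feed(sample_feed, "hk")
```

A list of dictionaries like `rows` can be passed straight to `pd.DataFrame`, which avoids keeping six parallel lists in step.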


In [5]:
#translating the genre names to English and standardizing the labels so the dataframe is more readable 

translator = Translator()

apple_top_df['Genre'] = apple_top_df['Genre'].apply(translator.translate, dest='en').apply(getattr, args=('text',))
apple_top_df['Genre'] = apple_top_df['Genre'].replace({'pop music' : 'Pop', 
                                                                   'Mandarin pop music' : 'Pop',
                                                                   'Alternative Music' : 'Alternative',
                                                                   'R & B, soul' : 'R&B/Soul',
                                                                   'priest' : 'Priest'
                                                                })
apple_top_df.head()

Unnamed: 0,Country,Name,Artist,Album,Genre,Rank,Source
0,Hong Kong,Señorita,Shawn Mendes & Camila Cabello,Señorita - Single,Pop,1,Apple
1,Hong Kong,Into the Unknown,Idina Menzel & AURORA,Frozen 2 (Original Motion Picture Soundtrack /...,Soundtrack,2,Apple
2,Hong Kong,說好不哭,周杰倫 & 阿信,說好不哭 - Single,Pop,3,Apple
3,Hong Kong,Show Yourself,Idina Menzel & Evan Rachel Wood,Frozen 2 (Original Motion Picture Soundtrack /...,Soundtrack,4,Apple
4,Hong Kong,讓愛高飛 (劇集《多功能老婆》片尾曲),周柏豪,讓愛高飛 (劇集《多功能老婆》片尾曲) - Single,Guangdong song,5,Apple
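
Because live translation depends on an external service, a fixed lookup table is one possible alternative for genre labels that are already known. The sketch below is an assumption-laden illustration: `GENRE_LOOKUP` and `normalize_genre` are hypothetical names, and the mapping only covers the four labels visible in the outputs above.

```python
# Mapping taken from the before/after outputs shown above; extend it as new labels appear
GENRE_LOOKUP = {
    "流行樂": "Pop",
    "原聲帶": "Soundtrack",
    "國語流行樂": "Pop",
    "廣東歌": "Guangdong song",
}


def normalize_genre(genre):
    """Return the standardized label, leaving unknown genres unchanged."""
    return GENRE_LOOKUP.get(genre, genre)


normalized = [normalize_genre(g) for g in ["流行樂", "原聲帶", "Alternative"]]
```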


# LastFM API to get the top tracks by region in radio

In [6]:
#start of lastFM data search

jsonFormat = '&format=json'

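# geo.gettoptracks returns the chart for a given country; chart.gettoptracks is the global chart and ignores the country parameter
lastfm_url = 'http://ws.audioscrobbler.com/2.0/?method=geo.gettoptracks'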
lastfm_url_end = '&api_key=' + lastfm_api_key + jsonFormat

In [7]:
# Get the top 5 tracks
limit = '5'

countries = ["Hong Kong", "Netherlands", "Australia", 
             "Canada", "France", "Japan", "united kingdom", "Germany", "united states"]
song_list = []
artist_list = []
country_list = []
rank = []

for country in countries:
    urlgeo = lastfm_url+'&limit=' + limit + '&country=' + country + lastfm_url_end
    try:
        # The ranked tracks are returned as a list under tracks -> track
        top_tracks = requests.get(urlgeo).json()["tracks"]["track"]
        for x, track in enumerate(top_tracks[:5]):
            # Read both fields first so a missing key cannot leave the lists misaligned
            name, artist = track['name'], track['artist']['name']
            rank.append(x+1)
            song_list.append(name)
            country_list.append(country)
            artist_list.append(artist)
    except (requests.RequestException, ValueError, KeyError):
        print(f"Couldn't find {country}")

In [9]:
#Build the dataframe for LastFM Data
lastfm_top_df=pd.DataFrame({
    "Country" : country_list,
    "Artist" : artist_list,
    "Name" : song_list,
    "Rank" : rank
})

lastfm_top_df["Source"] = "LastFM"

lastfm_top_df.head()

Unnamed: 0,Country,Artist,Name,Rank,Source
0,Hong Kong,Billie Eilish,Everything I Wanted,1,LastFM
1,Hong Kong,Billie Eilish,bad guy,2,LastFM
2,Hong Kong,The Weeknd,Heartless,3,LastFM
3,Hong Kong,Dua Lipa,doN'T StArT nOw,4,LastFM
4,Hong Kong,Mariah Carey,All I Want for Christmas Is You,5,LastFM


In [10]:
# Get the album information for each track
# Need to use another API call to get Album information
albums = []
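for index, row in lastfm_top_df.iterrows():
    try:
        # Pass the query as params so requests URL-encodes the artist and track names
        track_params = {'method': 'track.getInfo', 'artist': row['Artist'], 'track': row['Name'],
                        'autocorrect': 1, 'api_key': lastfm_api_key, 'format': 'json'}
        lastfm_response = requests.get('http://ws.audioscrobbler.com/2.0/', params=track_params).json()
        albums.append(lastfm_response['track']['album']['title'])
    except (requests.RequestException, ValueError, KeyError):
        print(f"Couldn't find track information for {row['Name']}")
        # Fall back to the track name when no album title is returned
        albums.append(row['Name'])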

Couldn't find track information for doN'T StArT nOw
Couldn't find track information for doN'T StArT nOw
Couldn't find track information for doN'T StArT nOw
Couldn't find track information for doN'T StArT nOw
Couldn't find track information for All I Want for Christmas Is You
Couldn't find track information for doN'T StArT nOw
Couldn't find track information for doN'T StArT nOw
Couldn't find track information for doN'T StArT nOw
Couldn't find track information for doN'T StArT nOw
Couldn't find track information for doN'T StArT nOw
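
Track and artist names can contain apostrophes, spaces, or ampersands, which are unsafe to concatenate directly into a URL. The sketch below shows one way to build an encoded query string with the standard library. `build_lastfm_url` and the placeholder key `"YOUR_KEY"` are hypothetical; the root URL and parameter names are the ones used in the cell above. This is not a diagnosis of the failures listed above, which may have other causes such as a track without album data.

```python
from urllib.parse import urlencode

LASTFM_ROOT = "http://ws.audioscrobbler.com/2.0/"


def build_lastfm_url(method, api_key, **params):
    """Build a Last.fm request URL with every parameter value percent-encoded."""
    query = {"method": method, "api_key": api_key, "format": "json", **params}
    return LASTFM_ROOT + "?" + urlencode(query)


example_url = build_lastfm_url(
    "track.getInfo",
    "YOUR_KEY",  # placeholder, not a real key
    artist="Dua Lipa",
    track="doN'T StArT nOw",
    autocorrect=1,
)
```

Passing a `params` dictionary to `requests.get` achieves the same encoding without building the string by hand.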


In [11]:
lastfm_top_df["Album"] = albums
lastfm_top_df.replace(to_replace="united kingdom", value="UK", inplace=True)
lastfm_top_df.replace(to_replace="united states", value="US", inplace=True)

# Switching to Apple Music's API to get Genre information for LastFM Tracks

As mentioned earlier, Apple Music provides cleaner genre information than the other APIs.  We decided to standardize on Apple for consistency.

In [13]:
genre_list = []

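for song in lastfm_top_df["Name"]:
    # Search the iTunes catalog by track name; params lets requests URL-encode the title
    search_params = {'term': song, 'media': 'music', 'entity': 'song', 'limit': 1}

    try:
        response = requests.get('http://itunes.apple.com/search', params=search_params).json()
        genre_list.append(response["results"][0]['primaryGenreName'])
    except (requests.RequestException, ValueError, KeyError, IndexError):
        # Keep the list aligned with the dataframe when no match is returned
        print(f"Couldn't find {song}")
        genre_list.append("NA")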

lastfm_top_df["Genre"] = genre_list
lastfm_top_df = lastfm_top_df[['Country', 'Name', 'Artist', 'Album', 'Genre', 'Rank', 'Source']]

lastfm_top_df.head()

Unnamed: 0,Country,Name,Artist,Album,Genre,Rank,Source
0,Hong Kong,Everything I Wanted,Billie Eilish,everything i wanted,Alternative,1,LastFM
1,Hong Kong,bad guy,Billie Eilish,"Now That's What I Call Music, Vol. 71",Alternative,2,LastFM
2,Hong Kong,Heartless,The Weeknd,Heartless,Country,3,LastFM
3,Hong Kong,doN'T StArT nOw,Dua Lipa,doN'T StArT nOw,Pop,4,LastFM
4,Hong Kong,All I Want for Christmas Is You,Mariah Carey,Merry Christmas,Holiday,5,LastFM
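
In the output above, "Heartless" by The Weeknd is labeled Country, which suggests that a title-only search can match a different song with the same name. One possible refinement, sketched below, is to include the artist in the search term. `build_itunes_search_url` is a hypothetical helper, and whether the combined term actually improves the match depends on the search service.

```python
from urllib.parse import urlencode


def build_itunes_search_url(track_name, artist_name=None):
    """Build an iTunes search URL, optionally qualifying the track title with the artist."""
    term = f"{artist_name} {track_name}" if artist_name else track_name
    params = {"term": term, "media": "music", "entity": "song", "limit": 1}
    return "https://itunes.apple.com/search?" + urlencode(params)


title_only_url = build_itunes_search_url("Heartless")
qualified_url = build_itunes_search_url("Heartless", "The Weeknd")
```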


# Spotify API to obtain top tracks by geo

In [14]:
#start of spotify data search: authenticate with the client credentials flow

credentials = oauth2.SpotifyClientCredentials(
        client_id=CLIENT_ID,
        client_secret=CLIENT_SECRET)

token = credentials.get_access_token()
spotify = spotipy.Spotify(auth=token)

In [16]:
playlist_name = []
p_id = []
owners_p = []

# Search for "Top 50" playlists and keep only those published by the spotifycharts account
playlists = spotify.search(q="Top 50", type="playlist", limit=50)
for playlist in playlists["playlists"]["items"]:
    owner = playlist["owner"]["display_name"]
    if owner == "spotifycharts":
        playlist_name.append(playlist["name"])
        p_id.append(playlist["uri"])
        owners_p.append(owner)

play_final = {
    "Playlist Name":playlist_name,
    "Playlist_ID":p_id,
    "Owners":owners_p
}
playlist_df = pd.DataFrame(play_final)
playlist_df.head()

Unnamed: 0,Playlist Name,Playlist_ID,Owners
0,Global Top 50,spotify:playlist:37i9dQZEVXbMDoHDwVN2tF,spotifycharts
1,Brazil Top 50,spotify:playlist:37i9dQZEVXbMXbN3EUUhlg,spotifycharts
2,Spain Top 50,spotify:playlist:37i9dQZEVXbNFJfN1Vw8d9,spotifycharts
3,Germany Top 50,spotify:playlist:37i9dQZEVXbJiZcmkrIHGU,spotifycharts
4,Mexico Top 50,spotify:playlist:37i9dQZEVXbO3qyFxbkOE1,spotifycharts


## We noticed that Hong Kong didn't show up in this list even though it follows the same naming convention.  So we added in code to pick up the appropriate playlist

In [17]:
playlist_name_t = []
p_id_t = []
owners_p_t=[]

# Repeat the search for Hong Kong and keep only the spotifycharts playlists
playlists = spotify.search(q="Hong Kong Top 50", type="playlist", limit=50)
for playlist in playlists["playlists"]["items"]:
    owner = playlist["owner"]["display_name"]
    if owner == "spotifycharts":
        playlist_name_t.append(playlist["name"])
        p_id_t.append(playlist["uri"])
        owners_p_t.append(owner)

play_final_t = {
    "Playlist Name":playlist_name_t,
    "Playlist_ID":p_id_t,
    "Owners":owners_p_t
}
playlist_df_t = pd.DataFrame(play_final_t)
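
The two search cells above apply the same owner filter, so the logic could be shared in one helper. The sketch below is an illustration under assumptions: `playlists_by_owner` is a hypothetical name, and `sample_response` is a hand-written dictionary shaped like the search result used above, not a captured API response.

```python
def playlists_by_owner(search_response, owner_name="spotifycharts"):
    """Return one row per playlist whose owner display name matches owner_name."""
    rows = []
    for playlist in search_response["playlists"]["items"]:
        if playlist["owner"]["display_name"] == owner_name:
            rows.append({
                "Playlist Name": playlist["name"],
                "Playlist_ID": playlist["uri"],
                "Owners": owner_name,
            })
    return rows


# Hand-written sample; the second entry is invented to show the filter at work
sample_response = {
    "playlists": {
        "items": [
            {
                "name": "Global Top 50",
                "uri": "spotify:playlist:37i9dQZEVXbMDoHDwVN2tF",
                "owner": {"display_name": "spotifycharts"},
            },
            {
                "name": "My Top 50",
                "uri": "spotify:playlist:example",
                "owner": {"display_name": "someone_else"},
            },
        ]
    }
}

chart_rows = playlists_by_owner(sample_response)
```

With a helper like this, each search would reduce to one call, and the resulting rows could be concatenated before building the dataframe.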


In [18]:
final_playists_df = pd.concat([playlist_df,playlist_df_t])
final_playists_df.reset_index(inplace=True, drop=True)
final_playists_df

Unnamed: 0,Playlist Name,Playlist_ID,Owners
0,Global Top 50,spotify:playlist:37i9dQZEVXbMDoHDwVN2tF,spotifycharts
1,Brazil Top 50,spotify:playlist:37i9dQZEVXbMXbN3EUUhlg,spotifycharts
2,Spain Top 50,spotify:playlist:37i9dQZEVXbNFJfN1Vw8d9,spotifycharts
3,Germany Top 50,spotify:playlist:37i9dQZEVXbJiZcmkrIHGU,spotifycharts
4,Mexico Top 50,spotify:playlist:37i9dQZEVXbO3qyFxbkOE1,spotifycharts
5,Netherlands Top 50,spotify:playlist:37i9dQZEVXbKCF6dqVpDkS,spotifycharts
6,Argentina Top 50,spotify:playlist:37i9dQZEVXbMMy2roB9myp,spotifycharts
7,Italy Top 50,spotify:playlist:37i9dQZEVXbIQnj7RRhdSX,spotifycharts
8,Australia Top 50,spotify:playlist:37i9dQZEVXbJPcfkRz0wJ0,spotifycharts
9,Japan Top 50,spotify:playlist:37i9dQZEVXbKXQ4mDTEBXq,spotifycharts


In [19]:
tracks = []
p_id2=[]
test = []
p_name = []
t_ids = []
tracker=[]
# Strip the "spotify:playlist:" prefix to get the bare playlist IDs
p_list = final_playists_df["Playlist_ID"]
for uri_parts in p_list.str.split(":"):
    p_id2.append(uri_parts[2])

for x in p_id2:
    response_json = spotify.user_playlist_tracks("spotifycharts", x)
    cur_play = final_playists_df.loc[final_playists_df["Playlist_ID"] == f"spotify:playlist:{x}"]["Playlist Name"].unique()[0]
    # Keep only the first five tracks of each playlist, ranked 1 to 5
    for position, t in enumerate(response_json["items"][:5], start=1):
        tracks.append(t["track"]["name"])
        t_ids.append(t["track"]["id"])
        test.append(x)
        p_name.append(cur_play)
        tracker.append(position)

music_df = pd.DataFrame({
    "Playlist Name" : p_name,
    "Playlist ID" : test,
    "Name" : tracks,
    "ID":t_ids,
    "Rank":tracker
})
music_df["Playlist Name"] = music_df["Playlist Name"].replace({'United States Top 50' : 'US Top 50', 'United Kingdom Top 50' : 'UK Top 50', 'Hong Kong Viral 50': 'HK Viral 50'})
music_df

ConnectionError: HTTPSConnectionPool(host='api.spotify.com', port=443): Max retries exceeded with url: /v1/users/spotifycharts/playlists/37i9dQZEVXbKCF6dqVpDkS/tracks?limit=100&offset=0 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x000002CD848A70C8>: Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond'))
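
A timeout like the one above stops the cell, and every later cell that depends on `music_df` then fails with a `NameError`. One common mitigation is to retry the call with an increasing delay. The sketch below is a generic illustration, not part of the original notebook: `call_with_retries` is a hypothetical helper, and it uses Python's built-in `ConnectionError` for the demonstration. The exception raised by `requests` is a different class (`requests.exceptions.ConnectionError`), so it would need to be passed through `retry_on` in the notebook.

```python
import time


def call_with_retries(func, attempts=3, base_delay=1.0,
                      retry_on=(ConnectionError,), sleep=time.sleep):
    """Call func(), retrying with exponential backoff when a listed exception is raised."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except retry_on:
            # Re-raise once the final attempt has failed
            if attempt == attempts:
                raise
            # Wait base_delay, then 2x, 4x, ... before the next attempt
            sleep(base_delay * 2 ** (attempt - 1))


# Simulated flaky call: fails twice, then succeeds
calls = {"count": 0}


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated timeout")
    return "ok"


result = call_with_retries(flaky, sleep=lambda seconds: None)
```

In the notebook, the playlist request could be wrapped as `call_with_retries(lambda: spotify.user_playlist_tracks("spotifycharts", x), retry_on=(requests.exceptions.ConnectionError,))`.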

In [20]:
# The first word of the playlist name is the country (e.g. "Germany Top 50" -> "Germany")
music_df["Country"] = music_df["Playlist Name"].str.split(" ").str[0]
music_df.to_csv("music.csv")

music_df

NameError: name 'music_df' is not defined

In [21]:
m_country=["Hong","Netherlands","Russia","Australia","Canada","France","Japan","Germany","US", "UK"]
final_music_1=music_df.loc[music_df["Country"].isin(m_country)]
spotify_top_df =final_music_1.reset_index()

spotify_top_df

NameError: name 'music_df' is not defined

## Similar to above, we are leveraging Apple Music's API to get the Genre information

In [22]:
spot_album_list = []
spot_genre_list = []
spot_artist_list = []

for song in spotify_top_df["Name"]:
    # Search the iTunes catalog by track name; params lets requests URL-encode the title
    search_params = {'term': song, 'media': 'music', 'entity': 'song', 'limit': 1}

    try:
        response = requests.get('http://itunes.apple.com/search', params=search_params).json()
        result = response["results"][0]
        genre, album, artist = result['primaryGenreName'], result["collectionName"], result['artistName']
    except (requests.RequestException, ValueError, KeyError, IndexError):
        print(f"Couldn't find {song}")
        genre, album, artist = "NA", "NA", "NA"

    # Append once per song so the three lists always stay the same length
    spot_genre_list.append(genre)
    spot_album_list.append(album)
    spot_artist_list.append(artist)

NameError: name 'spotify_top_df' is not defined

In [23]:
spotify_top_df["Genre"] = spot_genre_list
spotify_top_df["Album"] = spot_album_list
spotify_top_df["Artist"] = spot_artist_list
spotify_top_df["Source"] = 'Spotify'
# spotify_top_df['Rank'] = np.arange(len(spotify_top_df))
# spotify_top_df['Rank'] = spotify_top5['Rank'] % 5 + 1

spotify_top_df = spotify_top_df[['Country', 'Name', 'Artist', 'Album', 'Genre', 'Rank', 'Source']]
spotify_top_df['Country'] = spotify_top_df['Country'].replace({'Hong' : 'Hong Kong'}) 


spotify_top_df

NameError: name 'spotify_top_df' is not defined

# Now that we have the data from each of the APIs, we will create a consolidated DataFrame to run our audio features analysis

In [24]:
music = [apple_top_df, lastfm_top_df, spotify_top_df]

merge_df = pd.concat(music)

merge_df

NameError: name 'spotify_top_df' is not defined

## Obtain the Spotify ID for each of the tracks to get the audio features

## The API expects a Spotify ID, not a track name, so we must do this to get the information we need for analysis 

In [25]:

spotify_id = []
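for index, row in merge_df.iterrows():
    try:
        res = spotify.search(row["Name"], type="track", limit=1)
        spotify_id.append(res["tracks"]["items"][0]["id"])
    except Exception:
        # No match for the full title: retry with any parenthetical suffix removed
        head, sep, tail = row["Name"].partition('(')
        res = spotify.search(head, type="track", limit=1)
        items = res["tracks"]["items"]
        # Append None when the fallback also finds nothing, so the list stays aligned with merge_df
        spotify_id.append(items[0]["id"] if items else None)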
        
merge_df["Spotify ID"] = spotify_id
merge_df.head()

NameError: name 'merge_df' is not defined
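
The fallback in the cell above relies on `str.partition` to drop a parenthetical suffix from the track name. A minimal standalone sketch of that step is shown below; `strip_parenthetical` is a hypothetical name, and unlike the cell above it also trims the trailing space left before the parenthesis.

```python
def strip_parenthetical(title):
    """Return the part of a title before the first '(' with surrounding spaces removed."""
    head, _, _ = title.partition("(")
    return head.strip()


short_title = strip_parenthetical("讓愛高飛 (劇集《多功能老婆》片尾曲)")
unchanged_title = strip_parenthetical("bad guy")
```

Titles without a parenthesis pass through unchanged, so the helper is safe to apply to every row.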

## Now that we have the Spotify ID, we can grab the audio features for each track

In [26]:
danceability = []
energy = []
key = []
loudness = []
speechiness = []
acousticness = []
instrumentalness = []
liveness = []
valence = []
tempo = []
duration_ms = []

for index, row in merge_df.iterrows():
    try:
        # audio_features returns a list with one entry per requested track
        features = spotify.audio_features(tracks=row['Spotify ID'])[0]
        danceability.append(features["danceability"])
        energy.append(features["energy"])
        key.append(features["key"])
        loudness.append(features["loudness"])
        speechiness.append(features["speechiness"])
        acousticness.append(features["acousticness"])
        instrumentalness.append(features["instrumentalness"])
        liveness.append(features["liveness"])
        valence.append(features["valence"])
        tempo.append(features["tempo"])
        duration_ms.append(features["duration_ms"])
    except Exception:
        print(f"Couldn't find details for {row['Spotify ID']}")
        # Append a placeholder to every list so they stay aligned with the rows of merge_df
        for values in (danceability, energy, key, loudness, speechiness, acousticness,
                       instrumentalness, liveness, valence, tempo, duration_ms):
            values.append(None)

NameError: name 'merge_df' is not defined
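
The feature lists collected above still need to line up row for row with the tracks before they can be attached to the dataframe. One way to guarantee that is to build one dictionary per track and fill missing values with `None`. The sketch below is illustrative only: `collect_features` and `fake_fetch` are hypothetical names, the numeric values are invented, and in the notebook the fetch function would wrap `spotify.audio_features`.

```python
FEATURE_NAMES = [
    "danceability", "energy", "key", "loudness", "speechiness", "acousticness",
    "instrumentalness", "liveness", "valence", "tempo", "duration_ms",
]


def collect_features(track_ids, fetch_features):
    """Return one dict per track ID, using None for any feature that is unavailable."""
    rows = []
    for track_id in track_ids:
        try:
            features = fetch_features(track_id) or {}
        except Exception:
            # A failed lookup still produces a row, so rows stay aligned with track_ids
            features = {}
        rows.append({name: features.get(name) for name in FEATURE_NAMES})
    return rows


def fake_fetch(track_id):
    """Stand-in for the API call; the values are invented for the example."""
    if track_id == "missing":
        return None
    return {"danceability": 0.7, "energy": 0.4, "tempo": 120.0}


feature_rows = collect_features(["abc", "missing"], fake_fetch)
```

Because `feature_rows` has exactly one entry per track, it could be turned into a dataframe and joined to the merged data column by column.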

In [27]:
# to_csv returns None when given a path, so there is nothing to assign
merge_df.to_csv('track_analysis.csv')
merge_df

NameError: name 'merge_df' is not defined