# Web Scraping with the Spotipy API

To get the audio features for all of our songs, we are going to use Spotipy, a Python client for the Spotify Web API. Our final goal is to build 3 DataFrames containing the names of the songs, the artists and their respective audio features obtained from Spotify:
- **100 Hot Songs DataFrame:** DataFrame containing the 100 hottest current mainstream hits
- **Not Hot Songs DataFrame:** DataFrame containing an extensive list of songs from the past
- **All Songs DataFrame:** DataFrame combining the content of the previous two

## Importing the libraries

In [3]:
import pandas as pd
import numpy as np
import pprint
import math
import sys
# sys.path entries must be directories, so add the project folder that contains config.py
sys.path.insert(1, '/Users/Hector_Martin/Documents/Labs/music_recommender_project')
from config import *
import spotipy
import json
from spotipy.oauth2 import SpotifyClientCredentials

## Reading the files to get the DataFrames:

In [5]:
hotsongs = pd.read_csv('/Users/Hector_Martin/Documents/Labs/music_recommender_project/data/hot100.csv')
nothotsongs = pd.read_csv('/Users/Hector_Martin/Documents/Labs/music_recommender_project/data/nothotsongs.csv')

display(hotsongs)
display(nothotsongs)

| Unnamed: 0 | songs | artists |
|---|---|---|
| 0 | Wait For U | Future Featuring Drake & Tems |
| 1 | As It Was | Harry Styles |
| 2 | First Class | Jack Harlow |
| 3 | Puffin On Zootiez | Future |
| 4 | Heat Waves | Glass Animals |
| ... | ... | ... |
| 95 | Ahhh Ha | Lil Durk |
| 96 | Rumors | Gucci Mane Featuring Lil Durk |
| 97 | Over | Lucky Daye |
| 98 | Shake It | Kay Flock, Cardi B, Dougie B & Bory300 |


| Unnamed: 0 | songs | artists |
|---|---|---|
| 0 | Georgia On My Mind | Michael Bolton |
| 1 | Insane In The Brain | Cypress Hill |
| 2 | I Don't Like Mondays | The Boomtown Rats |
| 3 | Goodbye | Night Ranger |
| 4 | Rooms On Fire | Stevie Nicks |
| ... | ... | ... |
| 5123 | Angel | Natasha Bedingfield |
| 5124 | Automatically Sunshine | The Supremes |
| 5125 | Vaquero (Cowboy) | The Fireballs |
| 5126 | The Blizzard | Jim Reeves |
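
The `Unnamed: 0` column in both outputs is the row index that was written into the CSV files when they were saved. A minimal sketch of how `index_col=0` avoids it, assuming the files have the layout shown above; the in-memory `csv_text` is a stand-in for `hot100.csv`, not the real file:

```python
import io

import pandas as pd

# In-memory stand-in for hot100.csv, using the first rows shown above.
csv_text = """,songs,artists
0,Wait For U,Future Featuring Drake & Tems
1,As It Was,Harry Styles
2,First Class,Jack Harlow
"""

# Default read: the unnamed first column becomes a regular column.
with_index_column = pd.read_csv(io.StringIO(csv_text))

# index_col=0 treats the first column as the row index instead.
without_index_column = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(with_index_column.columns))     # ['Unnamed: 0', 'songs', 'artists']
print(list(without_index_column.columns))  # ['songs', 'artists']
```

Keeping the extra column is harmless for the steps that follow, since only the `songs` column is used for the search.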


## Initializing Spotipy with our credentials:

In [8]:
sp = spotipy.Spotify(auth_manager=SpotifyClientCredentials(client_id= client_id,
                                                           client_secret= client_secret_id))
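
Keeping the credentials in `config.py` works, but they can also be read from environment variables so that they never end up in the repository. The following is a minimal sketch under that assumption: `load_credentials` is a hypothetical helper (not part of Spotipy), and the variable names `SPOTIPY_CLIENT_ID` and `SPOTIPY_CLIENT_SECRET` are the ones the Spotipy documentation uses.

```python
import os


def load_credentials(env=None):
    """Return Spotify credentials as keyword arguments (hypothetical helper)."""
    env = os.environ if env is None else env
    names = ("SPOTIPY_CLIENT_ID", "SPOTIPY_CLIENT_SECRET")
    # Fail early with a clear message instead of a confusing auth error later.
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {
        "client_id": env["SPOTIPY_CLIENT_ID"],
        "client_secret": env["SPOTIPY_CLIENT_SECRET"],
    }


# Possible usage with the client from the cell above (not executed here):
# sp = spotipy.Spotify(auth_manager=SpotifyClientCredentials(**load_credentials()))
```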

## Creating the function to collect the audio features for a list of songs and store the results in a DataFrame:

### Getting the list of hot songs:

In [9]:
hot_songs_list = hotsongs['songs'].tolist()
hot_songs_list

['Wait For U',
 'As It Was',
 'First Class',
 'Puffin On Zootiez',
 'Heat Waves',
 'Big Energy',
 'Enemy',
 '712PM',
 'Stay',
 "I'm Dat N***a",
 "I'm On One",
 'Love You Better',
 'Woman',
 'Ghost',
 'Keep It Burnin',
 'Super Gremlin',
 'Thats What I Want',
 'Bad Habits',
 'About Damn Time',
 'Massaging Me',
 'Shivers',
 'Cold Heart (PNAU Remix)',
 'abcdefu',
 'For A Nut',
 'Provenza',
 'Chickens',
 'MAMIII',
 'Bam Bam',
 'Gold Stacks',
 'Wasted On You',
 'Industry Baby',
 "'Til You Can't",
 'Need To Know',
 'Boyfriend',
 'AA',
 'Easy On Me',
 'One Right Now',
 'In A Minute',
 'Voodoo',
 'Numb Little Bug',
 'Sweetest Pie',
 'To The Moon!',
 'Thousand Miles',
 'Honest',
 'We Jus Wanna Get High',
 'Holy Ghost',
 'Hrs And Hrs',
 "Doin' This",
 'Good 4 U',
 'Trouble With A Heartbreak',
 "We Don't Talk About Bruno",
 'Right On',
 'Never Say Never',
 'Frozen',
 'If I Was A Cowboy',
 "She's All I Wanna Be",
 'Take My Name',
 'Back To The Basics',
 "When You're Gone",
 'The Way Things Going',


### Let's do the same with the Not Hot Songs:

In [10]:
not_hot_songs_list = nothotsongs['songs'].tolist()
not_hot_songs_list

['Georgia On My Mind',
 'Insane In The Brain',
 "I Don't Like Mondays",
 'Goodbye',
 'Rooms On Fire',
 "Him Or Me - What's It Gonna Be?",
 'Through The Years',
 'Chicago',
 'I Like It',
 'Superstar/Until You Come Back To Me',
 'My Old Car',
 'Jones Vs. Jones',
 'Mercy, Mercy, Mercy',
 "You're A Lady",
 "Feel That You're Feelin'",
 'All I Have To Offer You (Is Me)',
 "Let's Get Together",
 'I Will Follow Him',
 'Out Of Sight',
 'Apple Of My Eye',
 "What's Your Name",
 'Midnight In Moscow',
 'Peace Will Come (According To Plan)',
 'Funny How Time Slips Away',
 'Spice Of Life',
 "I'd Rather Leave While I'm In Love",
 'My Heroes Have Always Been Cowboys',
 'Get Off My Back Woman',
 'Feelin  So Good',
 'Cheeseburger In Paradise',
 'Those Lazy-Hazy-Crazy Days Of Summer',
 'Ask Me What You Want',
 "Wasn't The Summer Short?",
 'Almost Hear You Sigh',
 'Woodstock',
 'As Good As I Once Was',
 "Love (Makes the World Go 'Round)",
 'Save The Best For Last',
 "Mama Can't Buy You Love",
 'What About 

### Getting the audio features based on the lists of songs
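
The lists above contain titles only, so a search for a common title such as 'Goodbye' or 'Stay' may return a track by a different artist. Spotify's search syntax supports `track:` and `artist:` field filters, and the artist names are already in the DataFrames. A minimal sketch of building such a query, assuming the Billboard-style credits shown earlier; `build_query` is a hypothetical helper and its marker list is a guess at the common separators:

```python
def build_query(song, artist):
    """Build a Spotify search query from a title and an artist credit."""
    primary = artist
    # Keep only the text before the first collaboration marker.
    for marker in (" Featuring ", " & ", ", "):
        primary = primary.split(marker)[0]
    return f"track:{song} artist:{primary}"


print(build_query("Wait For U", "Future Featuring Drake & Tems"))
# track:Wait For U artist:Future
```

The resulting string could be passed as `q` to `sp.search`. Note that splitting on " & " also shortens the names of duos or bands that contain an ampersand, so this is only a rough heuristic.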

In [11]:
def get_audio_features(list_of_songs, list_name):
    '''
    Search Spotify for each song title in list_of_songs and return a DataFrame with the audio features.
    Songs are processed in chunks of 1000 and each chunk is pickled to 'audiofeats/audio_feats_<chunk>.pkl'.
    The result is built from every pickle file found in 'audiofeats', so remember to delete these files
    (or store them somewhere else) before calling the function again.
    list_name is currently unused. The function relies on the global Spotipy client sp created above.
    '''
    import glob
    import math
    import os
    import pickle
    from time import sleep

    # Placeholder row used for the songs that Spotify could not find.
    keys = ['danceability', 'energy', 'key', 'loudness', 'mode', 'speechiness',
            'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo',
            'type', 'id', 'uri', 'track_href', 'analysis_url', 'duration_ms',
            'time_signature']
    blank = {key: "Null" for key in keys}

    chunk_size = 1000
    chunks = math.ceil(len(list_of_songs) / chunk_size)
    os.makedirs("audiofeats", exist_ok=True)  # make sure the output folder exists

    for i in range(chunks):  # chunks = 4 -> 0,1,2,3
        song_ids = []
        # [0:1000], [1000:2000], [2000:3000], ...; the last slice may be shorter
        chunk = list_of_songs[chunk_size * i:chunk_size * (i + 1)]
        for index, song in enumerate(chunk):
            print("Looking for song: ", index)
            try:
                songs = sp.search(q=song, limit=1)
                song_ids.append(songs['tracks']['items'][0]['id'])
            except Exception:  # If the song is not found, print it and keep going.
                print(song)
                song_ids.append("")

        print("Getting audio features")
        audio_feats = [sp.audio_features(song_id)[0] if song_id else blank for song_id in song_ids]
        name = "audiofeats/audio_feats_" + str(i) + ".pkl"
        with open(name, "wb") as handle:
            pickle.dump(audio_feats, handle)
        print("Created file: ", name)
        print('Sleeping for 30 seconds')
        sleep(30)

    # Grouping the files together in a list.
    # The sort is lexicographic, so the order is only correct for up to 10 chunks.
    pkls = sorted(glob.glob("audiofeats/*.pkl"))

    df = pd.DataFrame()
    for pkl in pkls:
        try:
            print(pkl)
            with open(pkl, "rb") as handle:
                audio_feats = pickle.load(handle)
            audio_feats_df = pd.DataFrame(audio_feats)
            df = pd.concat([df, audio_feats_df], axis=0).reset_index(drop=True)
        except Exception:
            print("Corrupted", pkl)
            continue

    return df
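
The function above requests the audio features for one track ID at a time. To my knowledge, Spotify's audio features endpoint accepts several IDs per request (up to 100), which would reduce the number of calls considerably. A minimal sketch of the batching step only, with no network access; `batched` is a hypothetical helper and the IDs are placeholders:

```python
def batched(items, size):
    """Yield consecutive slices of items with at most size elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


song_ids = [f"id_{n}" for n in range(250)]  # placeholder IDs, not real Spotify IDs
batches = list(batched(song_ids, 100))
print([len(batch) for batch in batches])  # [100, 100, 50]

# Possible usage with the Spotipy client (not executed here):
# for batch in batched(song_ids, 100):
#     feats = sp.audio_features(batch)
```

Empty IDs from failed searches would have to be filtered out before batching and their placeholder rows put back afterwards, so that the rows still line up with the song list.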

In [None]:
hot_songs_af = get_audio_features(hot_songs_list, 'hotsongs')

Looking for song:  0
Looking for song:  1
Looking for song:  2
Looking for song:  3
Looking for song:  4
Looking for song:  5
Looking for song:  6
Looking for song:  7
Looking for song:  8
Looking for song:  9
Looking for song:  10
Looking for song:  11
Looking for song:  12
Looking for song:  13
Looking for song:  14
Looking for song:  15
Looking for song:  16
Looking for song:  17
Looking for song:  18
Looking for song:  19
Looking for song:  20
Looking for song:  21
Looking for song:  22
Looking for song:  23
Looking for song:  24
Looking for song:  25
Looking for song:  26
Looking for song:  27
Looking for song:  28
Looking for song:  29
Looking for song:  30
Looking for song:  31
Looking for song:  32
Looking for song:  33
Looking for song:  34
Looking for song:  35
Looking for song:  36
Looking for song:  37
Looking for song:  38
Looking for song:  39
Looking for song:  40
Looking for song:  41
Looking for song:  42
Looking for song:  43
Looking for song:  44
Looking for song:  4

In [None]:
not_hot_songs_af = get_audio_features(not_hot_songs_list, 'nothotsongs')

### Function to concatenate the Audio features to the song DataFrames:

In [None]:
def add_audio_features(df, audio_features_df, csvname):
    '''
    It concatenates the audio features to the song DataFrame column-wise and stores the result in a CSV file.
    Input:
    - df: original DataFrame of songs
    - audio_features_df: DataFrame of audio features, one row per song in the same order
    - csvname: name of the CSV file you would like to use (without extension, saved in 'data/')
    '''
    if df.shape[0] > 100:
        # For the longer list, rows 1000-1999 are dropped before aligning with the audio features.
        df = df.drop(df.index[1000:2000]).reset_index(drop=True)
    df_concat = pd.concat([df, audio_features_df], axis=1)
    # Remove the songs that were not found on Spotify.
    df_concat = df_concat[df_concat['mode'] != 'Null'].reset_index(drop=True)
    df_concat.to_csv('data/' + csvname + '.csv', index=False)
    return df_concat
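
Concatenating with `axis=1` aligns the two DataFrames on their row index, so the result is only meaningful when both have the same number of rows in the same order. A minimal sketch with made-up data (the titles and values are illustrative, not taken from the real files):

```python
import pandas as pd

songs = pd.DataFrame({"songs": ["A", "B", "C"]})
# One row of features per song; "Null" marks a song that was not found.
features = pd.DataFrame({"mode": [1, "Null", 0], "tempo": [120.0, "Null", 95.5]})

combined = pd.concat([songs, features], axis=1)
found = combined[combined["mode"] != "Null"].reset_index(drop=True)
print(found)

# With fewer feature rows the concatenation still succeeds, but the missing
# cells are filled with NaN, which the "Null" filter does not remove.
padded = pd.concat([songs, features.iloc[:2]], axis=1)
print(padded["mode"].isna().sum())  # 1
```

Comparing `len(df)` with `len(audio_features_df)` before concatenating would catch this kind of misalignment early.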

In [None]:
hotconcat_df = add_audio_features(hotsongs, hot_songs_af, 'hot_songs')
nothotconcat_df = add_audio_features(nothotsongs, not_hot_songs_af, 'not_hot_songs')

### Function to concatenate the Hot Songs DataFrame with the Not Hot Songs one

In [None]:
def concatallsongs(dfhotconcat, dfnothotconcat, csvname):
    '''
    It stacks the Hot Songs and Not Hot Songs DataFrames row-wise and stores the result in a CSV file.
    '''
    allsongsconcat_df = pd.concat([dfhotconcat, dfnothotconcat], axis=0).reset_index(drop=True)
    # Drop any remaining placeholder rows for songs that were not found.
    allsongsconcat_df = allsongsconcat_df[allsongsconcat_df['mode'] != 'Null']
    allsongsconcat_df = allsongsconcat_df.reset_index(drop=True)
    allsongsconcat_df.to_csv('data/' + csvname + '.csv', index=False)
    return allsongsconcat_df
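
Once both DataFrames are stacked, nothing in the result says which list a song came from. One option is to add a label column before concatenating. A minimal sketch with made-up data; the `is_hot` column name is a hypothetical choice and not part of the original DataFrames:

```python
import pandas as pd

hot = pd.DataFrame({"songs": ["A", "B"], "mode": [1, 0]})
not_hot = pd.DataFrame({"songs": ["C"], "mode": [1]})

# assign() returns a copy with the extra column, leaving the inputs untouched.
all_songs = pd.concat(
    [hot.assign(is_hot=True), not_hot.assign(is_hot=False)],
    axis=0,
    ignore_index=True,  # same effect as reset_index(drop=True) afterwards
)
print(all_songs)
```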

In [None]:
allsongsconcat = concatallsongs(hotconcat_df,nothotconcat_df,'allsongsconcat_df')