**Tutorial: Getting Music Data with the Last.fm API using Python**

*Practice for GOV 1021 Final Project: Calling Census Bureau API*

Link: https://www.dataquest.io/blog/last-fm-api-python/

Specifically, we’re going to learn:

How to authenticate yourself with an API key. <br>
How to use rate limiting and other techniques to work within the guidelines of an API. <br>
How to use pagination to work with large responses. <br>

**Looking at the Introduction Page in the API documentation, we can notice a few important guidelines:**

Please use an identifiable User-Agent header on all requests. This helps our logging and reduces the risk of you getting banned.

When you make a request to the last.fm API, you can identify yourself using headers. Last.fm wants us to specify a user-agent in the header so they know who we are. We’ll learn how to do that when we make our first request in a moment.

Use common sense when deciding how many calls to make. For example, if you’re making a web application, try not to hit the API on page load. Your account may be suspended if your application is continuously making several calls per second.

In order to build our data set, we’re going to need to make thousands of requests to the Last.fm API. While they don’t provide a specific limit in their documentation, they do advise that we shouldn’t be continuously making many calls per second. In this tutorial we’re going to learn a few strategies for rate limiting, or making sure we don’t hit their API too much, so we can avoid getting banned.

Before we make our first request, we need to learn how to authenticate with the Last.fm API.

**Authenticating with API Keys**
The majority of APIs require you to authenticate yourself so they know you have permission to use them. One of the most common forms of authentication is to use an API Key, which is like a password for using their API. If you don’t provide an API key when making a request, you will get an error.

The process for using an API key works like this:

You create an account with the provider of the API. <br>
You request an API key, which is usually a long string like 54686973206973206d7920415049204b6579. <br>
You record your API key somewhere safe, like a password keeper. If someone gets your API key, they can use the API pretending to be you. <br>
Every time you make a request, you provide the API key to authenticate yourself. <br>
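
One way to keep a key out of the notebook itself is to read it from an environment variable. The sketch below is only an illustration: the variable name `LASTFM_API_KEY` and the helper `load_api_key` are hypothetical and are not used elsewhere in this tutorial, which reads the key from a local `config_last_fm` module instead.

```python
import os


def load_api_key(env_var="LASTFM_API_KEY"):
    """Return the API key stored in an environment variable.

    LASTFM_API_KEY is a hypothetical name chosen for this sketch.
    """
    key = os.environ.get(env_var)
    if not key:
        # Fail early with a clear message instead of sending a request without a key
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return key
```

With the variable set in the shell, `API_KEY = load_api_key()` could stand in for the config-module import.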

**Making our first API request**
In order to create a dataset of popular artists, we’ll be working with the chart.getTopArtists endpoint.

Looking at the Last.fm API documentation, we can observe a few things:

It looks like there is only one real endpoint, and each “endpoint” is actually specified by using the method parameter. <br>
The documentation says “This service does not require authentication.” While this might seem slightly confusing at first, what it’s telling us is that we don’t need to authenticate as a specific Last.fm user. If you look above this, you can see that we do need to provide our API key. <br>
The API can return results in multiple formats. We’ll specify JSON so we can leverage what we already know about working with APIs in Python. <br>

Before we start, remember that we need to provide a user-agent header to identify ourselves when we make a request. With the Python requests library, we specify headers using the headers parameter with a dictionary of headers like so:

In [4]:
import sys

In [1]:
import config_last_fm

API_KEY = config_last_fm.api_key
USER_AGENT = "Jake_Schneider"

In [2]:
import requests

headers = {
    'user-agent': USER_AGENT
}

payload = {
    'api_key': API_KEY,
    'method': 'chart.gettopartists',
    'format': 'json'
}

r = requests.get('http://ws.audioscrobbler.com/2.0/', headers=headers, params=payload)
r.status_code


To save ourselves time, we’re going to create a function that does a lot of this work for us. We’ll provide the function with a payload dictionary, and then we’ll add extra keys to that dictionary and pass it with our other options to make the request.

In [4]:
def lastfm_get(payload):
    # define headers and URL
    headers = {'user-agent': USER_AGENT}
    url = 'http://ws.audioscrobbler.com/2.0/'

    # Add API key and format to the payload
    payload['api_key'] = API_KEY
    payload['format'] = 'json'

    response = requests.get(url, headers=headers, params=payload)
    return response

In [5]:
r = lastfm_get({
    'method': 'chart.gettopartists'
})

r.status_code

200

As we learned in our beginner Python API tutorial, most APIs return data in a JSON format, and we can use the Python json module to print the JSON data in an easier-to-understand format.

Let’s re-use the jprint() function we created in that tutorial and print our response from the API:

In [6]:
import json

def jprint(obj):
    # create a formatted string of the Python JSON object
    text = json.dumps(obj, sort_keys=True, indent=4)
    print(text)

jprint(r.json())

{
    "artists": {
        "@attr": {
            "page": "1",
            "perPage": "50",
            "total": "2980768",
            "totalPages": "59616"
        },
        "artist": [
            {
                "image": [
                    {
                        "#text": "https://lastfm.freetls.fastly.net/i/u/34s/2a96cbd8b46e442fc41c2b86b821562f.png",
                        "size": "small"
                    },
                    {
                        "#text": "https://lastfm.freetls.fastly.net/i/u/64s/2a96cbd8b46e442fc41c2b86b821562f.png",
                        "size": "medium"
                    },
                    {
                        "#text": "https://lastfm.freetls.fastly.net/i/u/174s/2a96cbd8b46e442fc41c2b86b821562f.png",
                        "size": "large"
                    },
                    {
                        "#text": "https://lastfm.freetls.fastly.net/i/u/300x300/2a96cbd8b46e442fc41c2b86b821562f.png",
                        "si

The structure of the JSON response is:

A dictionary with a single artists key, containing: <br>
an @attr key containing a number of attributes about the response. <br>
an artist key containing a list of artist objects. <br>

Let’s look at the '@attr' (attributes) key by itself:

In [7]:
jprint(r.json()['artists']['@attr'])

{
    "page": "1",
    "perPage": "50",
    "total": "2980768",
    "totalPages": "59616"
}


There are almost three million total artists in the results of this API endpoint, and we’re being shown the first 50 artists in a single ‘page’. This technique of spreading the results over multiple pages is called pagination.

**Working with Paginated Data** <br>
In order to build a dataset with many artists, we need to make an API request for each page and then put them together. We can control the pagination of our results using two optional parameters specified in the documentation:

limit: The number of results to fetch per page (defaults to 50). <br>
page: Which page of the results we want to fetch. <br>

Because the '@attr' key gives us the total number of pages, we can use a while loop and iterate over pages until we have requested the last page.

We can also use the limit parameter to fetch more results in each page — we’ll fetch 500 results per page so we only need to make ~6,000 calls instead of ~60,000.
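
As a quick check of those estimates, the page count is the total number of artists divided by the page size, rounded up. The helper name `pages_needed` is made up for this sketch; the total is the `"total"` value from the `@attr` output shown earlier.

```python
def pages_needed(total, per_page):
    """Number of pages needed to cover `total` results (ceiling division)."""
    return (total + per_page - 1) // per_page


total_artists = 2980768  # "total" from the @attr output above

print(pages_needed(total_artists, 50))   # 59616, matching "totalPages"
print(pages_needed(total_artists, 500))  # 5962
```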

In [8]:
# initialize list for results
results = []

# set initial page and a high total number
page = 1
total_pages = 99999


while page <= total_pages:
    # simplified request code for this example
    r = requests.get("endpoint_url", params={"page": page})

    # append results to list
    results.append(r.json())

    # increment page
    page += 1

As we mentioned a moment ago, we need to make almost 6,000 calls to this endpoint, which means we need to think about rate limiting to comply with the Last.fm API’s terms of service. Let’s look at a few approaches.

**Rate Limiting** <br>
Rate limiting is using code to limit the number of times per second that we hit a particular API. Rate limiting will make your code slower, but it’s better than getting banned from using an API altogether.

The easiest way to perform rate limiting is to use Python’s time.sleep() function. This function accepts a float specifying a number of seconds to wait before proceeding.

For instance, the following code will wait one quarter of a second between the two print statements:
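
A minimal version of such a snippet might look like this. The timing variables `start` and `elapsed` are only there to show how long the pause lasted.

```python
import time

start = time.monotonic()

print("one")
time.sleep(0.25)  # pause for a quarter of a second
print("two")

elapsed = time.monotonic() - start
print(f"waited about {elapsed:.2f} seconds")
```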

Because making the API call itself takes some time, we’re likely to be making two or three calls per second, not the four calls per second that sleeping for 0.25s might suggest. This should be enough to keep us under Last.fm’s threshold (if we were going to be hitting their API for a number of hours, we might choose an even slower rate).
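
To make that estimate concrete, here is a rough calculation. The 0.15 second request time is an assumed figure for illustration, not a measurement of the Last.fm API.

```python
sleep_seconds = 0.25
request_seconds = 0.15  # assumed average time per API call, not measured
calls = 6000            # roughly the number of pages estimated earlier

seconds_per_call = sleep_seconds + request_seconds
calls_per_second = 1 / seconds_per_call
total_minutes = calls * seconds_per_call / 60

print(round(calls_per_second, 2))  # 2.5
print(round(total_minutes))        # 40
```

Under these assumptions the full loop would take around 40 minutes, which is one reason caching repeated calls is worthwhile.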

Another technique that’s useful for rate limiting is using a local database to cache the results of any API call, so that if we make the same call twice, the second time it reads it from the local cache. Imagine that as you are writing your code, you discover syntax errors and your loop fails, and you have to start again. By using a local cache, you have two benefits:

You don’t make extra API calls that you don’t need to. <br>
You don’t need to wait the extra time to rate limit when reading the repeated calls from the cache. <br>

The logic that we could use to combine waiting with a cache looks like the below:
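
As a rough sketch of that logic, the function below checks a cache first and only sleeps after a real fetch. The in-memory dictionary, `cached_fetch`, and `slow_lookup` are all hypothetical stand-ins: a real cache would live in a local database, and the fetch would be an API request.

```python
import time

cache = {}  # in-memory stand-in for a local database


def cached_fetch(key, fetch, delay=0.25):
    """Return (result, from_cache); sleep only after a real fetch."""
    if key in cache:
        # Cache hit: no API call and no waiting
        return cache[key], True
    result = fetch(key)
    cache[key] = result
    time.sleep(delay)  # rate limit only when we actually hit the API
    return result, False


def slow_lookup(key):
    # Stand-in for a real API request
    return key.upper()


first, hit1 = cached_fetch("beatles", slow_lookup)
second, hit2 = cached_fetch("beatles", slow_lookup)
print(hit1, hit2)  # False True
```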

Creating logic for a local cache is a reasonably complex task, but there’s a great library called requests-cache which will do all of the work for you with only a couple of lines of code.

In [9]:
import requests_cache

requests_cache.install_cache()

In [10]:
import time
from IPython.display import clear_output

responses = []

page = 1
total_pages = 99999 # this is just a dummy number so the loop starts

while page <= total_pages:
    payload = {
        'method': 'chart.gettopartists',
        'limit': 500,
        'page': page
    }

    # print some output so we can see the status
    print("Requesting page {}/{}".format(page, total_pages))
    # clear the output to make things neater
    clear_output(wait = True)

    # make the API call
    response = lastfm_get(payload)

    # if we get an error, print the response and halt the loop
    if response.status_code != 200:
        print(response.text)
        break

    # extract pagination info
    page = int(response.json()['artists']['@attr']['page'])
    total_pages = int(response.json()['artists']['@attr']['totalPages'])

    # append response
    responses.append(response)

    # if it's not a cached result, sleep
    if not getattr(response, 'from_cache', False):
        time.sleep(0.25)

    # increment the page number
    page += 1

Requesting page 5956/5956


**Processing the Data** <br>
Let’s use pandas to look at the data from the first response in our responses list:

In [11]:
import pandas as pd

r0 = responses[0]
r0_json = r0.json()
r0_artists = r0_json['artists']['artist']
r0_df = pd.DataFrame(r0_artists)
r0_df.head()

Unnamed: 0,image,listeners,mbid,name,playcount,streamable,url
0,[{'#text': 'https://lastfm.freetls.fastly.net/...,612569,,Billie Eilish,38390520,0,https://www.last.fm/music/Billie+Eilish
1,[{'#text': 'https://lastfm.freetls.fastly.net/...,679069,,Post Malone,39640756,0,https://www.last.fm/music/Post+Malone
2,[{'#text': 'https://lastfm.freetls.fastly.net/...,1184262,f4fdbb4c-e4b7-47a0-b83b-d91bbfcfa387,Ariana Grande,128920283,0,https://www.last.fm/music/Ariana+Grande
3,[{'#text': 'https://lastfm.freetls.fastly.net/...,1974538,b7539c32-53e7-4908-bda3-81449c367da6,Lana Del Rey,238756889,0,https://www.last.fm/music/Lana+Del+Rey
4,[{'#text': 'https://lastfm.freetls.fastly.net/...,3747277,b10bbbfc-cf9e-42e0-be17-e2c3e1d2600d,The Beatles,530351668,0,https://www.last.fm/music/The+Beatles


We can use list comprehension to perform this operation on each response from responses, giving us a list of dataframes, and then use the pandas.concat() function to turn the list of dataframes into a single dataframe.

In [12]:
frames = [pd.DataFrame(r.json()['artists']['artist']) for r in responses]
artists = pd.concat(frames)
artists.head()

Unnamed: 0,image,listeners,mbid,name,playcount,streamable,url
0,[{'#text': 'https://lastfm.freetls.fastly.net/...,612569,,Billie Eilish,38390520,0,https://www.last.fm/music/Billie+Eilish
1,[{'#text': 'https://lastfm.freetls.fastly.net/...,679069,,Post Malone,39640756,0,https://www.last.fm/music/Post+Malone
2,[{'#text': 'https://lastfm.freetls.fastly.net/...,1184262,f4fdbb4c-e4b7-47a0-b83b-d91bbfcfa387,Ariana Grande,128920283,0,https://www.last.fm/music/Ariana+Grande
3,[{'#text': 'https://lastfm.freetls.fastly.net/...,1974538,b7539c32-53e7-4908-bda3-81449c367da6,Lana Del Rey,238756889,0,https://www.last.fm/music/Lana+Del+Rey
4,[{'#text': 'https://lastfm.freetls.fastly.net/...,3747277,b10bbbfc-cf9e-42e0-be17-e2c3e1d2600d,The Beatles,530351668,0,https://www.last.fm/music/The+Beatles


In [13]:
artists = artists.drop('image', axis=1)
artists.head()

Unnamed: 0,listeners,mbid,name,playcount,streamable,url
0,612569,,Billie Eilish,38390520,0,https://www.last.fm/music/Billie+Eilish
1,679069,,Post Malone,39640756,0,https://www.last.fm/music/Post+Malone
2,1184262,f4fdbb4c-e4b7-47a0-b83b-d91bbfcfa387,Ariana Grande,128920283,0,https://www.last.fm/music/Ariana+Grande
3,1974538,b7539c32-53e7-4908-bda3-81449c367da6,Lana Del Rey,238756889,0,https://www.last.fm/music/Lana+Del+Rey
4,3747277,b10bbbfc-cf9e-42e0-be17-e2c3e1d2600d,The Beatles,530351668,0,https://www.last.fm/music/The+Beatles


Now, let’s get to know the data a little using DataFrame.info() and DataFrame.describe():

In [14]:
artists.info()
artists.describe()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2400 entries, 0 to 99
Data columns (total 6 columns):
listeners     2400 non-null object
mbid          2400 non-null object
name          2400 non-null object
playcount     2400 non-null object
streamable    2400 non-null object
url           2400 non-null object
dtypes: object(6)
memory usage: 131.2+ KB


Unnamed: 0,listeners,mbid,name,playcount,streamable,url
count,2400,2400.0,2400,2400,2400,2400
unique,2395,1762.0,2400,2399,1,2400
top,27579,,CVBZ,65325,0,https://www.last.fm/music/Des%27ree
freq,2,634.0,1,2,2400,1


We were expecting about 3,000,000 artists, but we only have 2,400. The describe() output shows that all 2,400 names are unique in this run, but it is still worth checking for duplicate rows.

Let’s look at the length of the list of artists across our list of response objects to see if we can better understand what has gone wrong.

In [15]:
artist_counts = [len(r.json()['artists']['artist']) for r in responses]
pd.Series(artist_counts).value_counts()

0      5936
100      19
500       1
dtype: int64

In [16]:
print(artist_counts[:50])

[500, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]


It looks like after the first twenty responses, this API doesn’t return any data — an undocumented limitation.
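
Given that behaviour, one option is to stop requesting pages as soon as one comes back empty, rather than looping to the reported page total. This is a sketch, not what the notebook above does: `collect_until_empty` and `fake_fetch` are hypothetical names, and `fake_fetch` simulates an API that returns 100 artists for the first 20 pages and nothing afterwards.

```python
def collect_until_empty(fetch_page, max_pages=1000):
    """Collect items page by page, stopping at the first empty page.

    fetch_page is a hypothetical callable that takes a page number and
    returns the list of artists on that page.
    """
    items = []
    for page in range(1, max_pages + 1):
        batch = fetch_page(page)
        if not batch:
            break  # nothing more to collect, so stop making calls
        items.extend(batch)
    return items


requested = []  # records which pages were asked for


def fake_fetch(page):
    # Simulated API: data on pages 1-20, empty lists after that
    requested.append(page)
    if page <= 20:
        return [f"artist-{page}-{i}" for i in range(100)]
    return []


collected = collect_until_empty(fake_fetch)
print(len(collected), len(requested))  # 2000 21
```

In this simulation the loop makes 21 calls instead of several thousand.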

This is not the end of the world, since 2,400 artists is still a good amount of data. Let’s drop any duplicate rows as a safeguard.

In [17]:
artists = artists.drop_duplicates().reset_index(drop=True)
artists.describe()

Unnamed: 0,listeners,mbid,name,playcount,streamable,url
count,2400,2400.0,2400,2400,2400,2400
unique,2395,1762.0,2400,2399,1,2400
top,27579,,CVBZ,65325,0,https://www.last.fm/music/Des%27ree
freq,2,634.0,1,2,2400,1


**Augmenting the Data Using a Second Last.fm API Endpoint** 
<br>
<br>
In order to make our data more interesting, let’s use another last.fm API endpoint to add some extra data about each artist.

Last.fm allows its users to create “tags” to categorize artists. By using the artist.getTopTags endpoint we can get the top tags from an individual artist.

Let’s look at the response from that endpoint for one of our artists as an example:

In [18]:
r = lastfm_get({
    'method': 'artist.getTopTags',
    'artist':  'Lana Del Rey'
})

jprint(r.json())

{
    "toptags": {
        "@attr": {
            "artist": "Lana Del Rey"
        },
        "tag": [
            {
                "count": 100,
                "name": "female vocalists",
                "url": "https://www.last.fm/tag/female+vocalists"
            },
            {
                "count": 93,
                "name": "indie",
                "url": "https://www.last.fm/tag/indie"
            },
            {
                "count": 88,
                "name": "indie pop",
                "url": "https://www.last.fm/tag/indie+pop"
            },
            {
                "count": 80,
                "name": "pop",
                "url": "https://www.last.fm/tag/pop"
            },
            {
                "count": 67,
                "name": "alternative",
                "url": "https://www.last.fm/tag/alternative"
            },
            {
                "count": 14,
                "name": "american",
                "url": "https://www.last.fm/tag/a

We’re really only interested in the tag names, and then only the most popular tags. Let’s use list comprehension to create a list of the top three tag names:

In [19]:
tags = [t['name'] for t in r.json()['toptags']['tag'][:3]]
tags

['female vocalists', 'indie', 'indie pop']

In [20]:
', '.join(tags)

'female vocalists, indie, indie pop'

Let’s create a function that uses this logic to return a string of the most popular tags for any artist, which we’ll use later to apply to every row in our dataframe.

Remember that this function will be used a lot in close succession, so we’ll reuse our time.sleep() logic from earlier.

In [21]:
def lookup_tags(artist):
    response = lastfm_get({
        'method': 'artist.getTopTags',
        'artist':  artist
    })

    # if there's an error, just return nothing
    if response.status_code != 200:
        return None

    # extract the top three tags and turn them into a string
    tags = [t['name'] for t in response.json()['toptags']['tag'][:3]]
    tags_str = ', '.join(tags)

    # rate limiting
    if not getattr(response, 'from_cache', False):
        time.sleep(0.25)
    return tags_str

In [22]:
lookup_tags("Billie Eilish")

'pop, indie pop, indie'
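
The function above assumes every successful response contains a `toptags` key. If a response body could lack it, the extraction step could be written defensively, as in this sketch. The helper `top_tag_names` and both sample payloads are hypothetical; the first mirrors the shape of the Lana Del Rey response shown earlier, and the second only imitates the shape of an error body.

```python
def top_tag_names(payload, n=3):
    """Join the names of the first n tags, or return '' if none are present."""
    tags = payload.get('toptags', {}).get('tag', [])
    return ', '.join(t['name'] for t in tags[:n])


# Hypothetical payload shaped like the Lana Del Rey response above
sample = {
    'toptags': {
        'tag': [
            {'count': 100, 'name': 'female vocalists'},
            {'count': 93, 'name': 'indie'},
            {'count': 88, 'name': 'indie pop'},
            {'count': 80, 'name': 'pop'},
        ]
    }
}

print(top_tag_names(sample))               # female vocalists, indie, indie pop
print(repr(top_tag_names({'error': 6})))   # ''
```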

In [23]:
from tqdm import tqdm
tqdm.pandas()

artists['tags'] = artists['name'].progress_apply(lookup_tags)

100%|██████████| 2400/2400 [00:11<00:00, 210.33it/s]


In [24]:
artists.head()

Unnamed: 0,listeners,mbid,name,playcount,streamable,url,tags
0,612569,,Billie Eilish,38390520,0,https://www.last.fm/music/Billie+Eilish,"pop, indie pop, indie"
1,679069,,Post Malone,39640756,0,https://www.last.fm/music/Post+Malone,"Hip-Hop, rap, trap"
2,1184262,f4fdbb4c-e4b7-47a0-b83b-d91bbfcfa387,Ariana Grande,128920283,0,https://www.last.fm/music/Ariana+Grande,"pop, female vocalists, rnb"
3,1974538,b7539c32-53e7-4908-bda3-81449c367da6,Lana Del Rey,238756889,0,https://www.last.fm/music/Lana+Del+Rey,"female vocalists, indie, indie pop"
4,3747277,b10bbbfc-cf9e-42e0-be17-e2c3e1d2600d,The Beatles,530351668,0,https://www.last.fm/music/The+Beatles,"classic rock, rock, british"


**Finalizing and Exporting the Data**
<br>
<br>
Before we export our data, we might like to sort the data so the most popular artists are at the top. So far we’ve just been storing data as text without converting any types:

In [25]:
artists.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2400 entries, 0 to 2399
Data columns (total 7 columns):
listeners     2400 non-null object
mbid          2400 non-null object
name          2400 non-null object
playcount     2400 non-null object
streamable    2400 non-null object
url           2400 non-null object
tags          2400 non-null object
dtypes: object(7)
memory usage: 131.3+ KB


Let’s start by converting the listeners and playcount columns to numeric types.

In [26]:
artists[["playcount", "listeners"]] = artists[["playcount", "listeners"]].astype(int)

Now, let’s sort by number of listeners.

In [27]:
artists = artists.sort_values("listeners", ascending=False)
artists.head(10)

Unnamed: 0,listeners,mbid,name,playcount,streamable,url,tags
20,5441558,cc197bad-dc9c-440d-a5b5-d52ba2e14234,Coldplay,365209713,0,https://www.last.fm/music/Coldplay,"rock, alternative, britpop"
10,4790821,a74b1b7f-71a5-4011-9441-d0b5e4122711,Radiohead,509971179,0,https://www.last.fm/music/Radiohead,"alternative, alternative rock, rock"
21,4676934,8bfac288-ccc5-448d-9573-c33ea2aa5c30,Red Hot Chili Peppers,299229292,0,https://www.last.fm/music/Red+Hot+Chili+Peppers,"rock, alternative rock, alternative"
12,4633952,db36a76f-4cdf-43ac-8cd0-5e48092d2bae,Rihanna,205456052,0,https://www.last.fm/music/Rihanna,"pop, rnb, female vocalists"
34,4581844,b95ce3ff-3d05-4e87-9e01-c97b66af13d4,Eminem,205315845,0,https://www.last.fm/music/Eminem,"rap, Hip-Hop, Eminem"
45,4482665,95e1ead9-4d31-4808-a7ac-32c3614c116b,The Killers,213093577,0,https://www.last.fm/music/The+Killers,"indie, rock, indie rock"
6,4465020,164f0d73-1234-4e2c-8743-d77bf2191051,Kanye West,254807007,0,https://www.last.fm/music/Kanye+West,"Hip-Hop, rap, hip hop"
30,4332259,9282c8b4-ca0b-4c6b-b7e3-4f7762dfc4d6,Nirvana,227127825,0,https://www.last.fm/music/Nirvana,"Grunge, rock, alternative"
53,4132003,fd857293-5ab8-40de-b29e-55a69d4e4d0f,Muse,350555042,0,https://www.last.fm/music/Muse,"alternative rock, rock, alternative"
7,4098957,420ca290-76c5-41af-999e-564d7c71f1a7,Queen,199409544,0,https://www.last.fm/music/Queen,"classic rock, rock, 80s"


In [28]:
artists.to_csv('artists.csv', index=False)

**Next Steps** <br>
In this tutorial we built a dataset using the Last.fm API while learning how to use Python to:

Authenticate with an API using an API key. <br>
Use pagination to collect larger responses from an API endpoint. <br>
Use rate limiting to stay within the guidelines of an API. <br>

If you’d like to extend your learning, you might like to:

Complete our interactive Dataquest APIs and scraping course, which you can start for free. <br>
Explore the other API endpoints in the Last.fm API. <br>
Try working with some data from this list of Free Public APIs. <br>