# APIs Advanced (oDCM)

*In practice, most APIs require user authentication to get started. Every API has its own workflow for generating a kind of "password" that you need to include in your requests. It is close to impossible to design a tutorial that covers every authentication method. So, here, we focus on one specific case: the Spotify Web API.*

--- 

## Learning Objectives

Students will be able to: 
* Understand the various ways to authenticate with an API (e.g., credentials, tokens)
* Obtain authentication credentials and tokens via the API calls, check for a valid connection, and renew tokens if they expire
* Apply filters, possibly for multiple endpoints, to narrow down search requests
* Iterate over a variety of API search and result pages 
* Learn how to read API documentation independently
* Learn how to save JSON objects to new-line separated text files
* Learn how to parse JSON objects to tabular files from text input files

--- 

## Acknowledgements
This course draws on a variety of online resources that can be retrieved from the [course website](https://odcm.hannesdatta.com/docs/about/).

--- 

<div class="alert alert-block alert-info"><b>Support Needed?</b> 
    For technical issues outside of scheduled classes, please check the <a href="https://odcm.hannesdatta.com/docs/course/support" target="_blank">support section</a> on the course website.
</div>


## 1. Retrieving and saving data from an API

## 1.1 Authentication
### 1.1.1 Client Key & Client Secret 
**Importance**  
As you may remember, the `icanhazdadjoke` and `Reddit` APIs can be used right out of the box. They did not require you to create an account, log in with your credentials, or provide any information associated with you. In this tutorial, we will request data from the Spotify Web API, which takes a little more preparation. 

First, you need to [sign up](https://www.spotify.com/us/signup/) for a Spotify user account (a free account suffices) if you do not already have one. Second, you log in to the [developer portal](https://developer.spotify.com/dashboard/applications) of Spotify and create a new app (you can give it any name and description you want). Third, you take note of the `Client ID` and `Client Secret` (we'll need those later on!). 


<img src="https://raw.githubusercontent.com/hannesdatta/course-odcm/master/content/docs/tutorials/apisadvanced/images/Spotify_credentials.gif" align="left" width=70%/>

**Let's try it out!**  
Follow the steps above and assign the client key and secret you obtained to the variables below (as a string).

In [1]:
# your Spotify App credentials 
client_id = ""
client_secret = ""

Next, in the authorization guide of the [API documentation](https://developer.spotify.com/documentation/general/guides/authorization-guide/), we find that the request requires a so-called Base64 encoded string that contains the client ID and client secret in the format `Authorization: Basic *<base64 encoded client_id:client_secret>*`

Base64 is an encoding, not encryption: it turns the credentials into a string that can be transported reliably in an HTTP header, while the confidentiality of the request comes from the HTTPS connection. We start out with an f-string that concatenates the `client_id` and `client_secret` variables. Thereafter, we encode this variable into Base64 using the `b64encode` function of the `base64` module. 

__Important:__ Recall that this is the way *Spotify* wishes you to authenticate with the service. Other API providers may have a completely different way of authentication. If you want to authenticate with a different API, hence, you can't simply copy the code snippet below, but you have to search the web for some Python code, tutorials, and examples on how to authenticate. The technical details are usually also available in the documentation of the API. Watch, for example, [this](https://www.youtube.com/watch?v=iDA710TPXT0) video for a walk through for authenticating with a *different* API.

In [2]:
import base64
client_creds = f"{client_id}:{client_secret}"
print(f"f-string: {client_creds}")

client_creds_encoded = base64.b64encode(client_creds.encode())
print(f"Base64 encoded: {client_creds_encoded}")

f-string: 60f45fe73bef4bfbb7549dde2b02cab5:6855d391f816439dbbeb54b997708efe
Base64 encoded: b'NjBmNDVmZTczYmVmNGJmYmI3NTQ5ZGRlMmIwMmNhYjU6Njg1NWQzOTFmODE2NDM5ZGJiZWI1NGI5OTc3MDhlZmU='


Note that Base64 is not a cipher: anyone who gets their hands on `client_creds_encoded` can decode it and recover your `client_id` and `client_secret`. Treat the encoded string as confidentially as the credentials themselves, and only send it over HTTPS (as we do here). The API decodes the string on its side and thereby verifies your authentication credentials. 

Finally, we turn the base64 encoded string into the requested format: 

In [3]:
token_headers = {
    "Authorization": f"Basic {client_creds_encoded.decode()}"
}

token_headers

{'Authorization': 'Basic NjBmNDVmZTczYmVmNGJmYmI3NTQ5ZGRlMmIwMmNhYjU6Njg1NWQzOTFmODE2NDM5ZGJiZWI1NGI5OTc3MDhlZmU='}
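
The encoding step is fully reversible, which is how the server recovers the credentials. The sketch below illustrates this with made-up placeholder credentials (`example_id` and `example_secret` are assumptions for illustration, not real Spotify values):

```python
import base64

# Hypothetical placeholder credentials (not real Spotify values)
example_id = "my_client_id"
example_secret = "my_client_secret"

# Encode in the same way as above
encoded = base64.b64encode(f"{example_id}:{example_secret}".encode())

# Anyone holding the encoded string can reverse the step
decoded = base64.b64decode(encoded).decode()
recovered_id, recovered_secret = decoded.split(":", 1)

print(recovered_id, recovered_secret)
```

Because the round trip needs no key, the encoded string deserves the same care as the credentials themselves.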

---

### 1.1.2 Access Tokens
**Importance**  
Once you have your `client_id` and `client_secret`, you need to exchange them for an access token: a temporary key associated with your account that expires after 60 minutes (3600 seconds). In practice, this means you need to regenerate your access token once in a while. To obtain an access token, you make a POST request to the Spotify Accounts Service endpoint `https://accounts.spotify.com/api/token` and pass two additional arguments: `token_data` (the grant type) and `token_headers` (the encoded client ID and secret).

In [4]:
import requests
token_url = "https://accounts.spotify.com/api/token"

token_data = {
    "grant_type": "client_credentials"
}

r = requests.post(token_url, data=token_data, headers=token_headers)
token_response_data = r.json()

**Let's try it out!**  
Look up the `r.status_code` of your POST request. What does this tell you? Tip: have a look at the [response status codes](https://developer.spotify.com/documentation/web-api/)!

The `r.json()` method returns a dictionary that contains the `access_token` we're after: 

In [5]:
token_response_data

{'access_token': 'BQD7xYkFpZZ3-AwMgRvikXAqICGDOkIQ-rxfcyOykfqN-qeis765VadE8kyqLSE5ad-6zClZKKs3ysWe6sA',
 'token_type': 'Bearer',
 'expires_in': 3600,
 'scope': ''}

**Let's try it out!**  
Store the `access_token` value of the `token_response_data` dictionary into the `access_token` variable below. What happens to the access token once you make another POST request? 

In [7]:
access_token = ###
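
Since the token is only valid for `expires_in` seconds, one possible way to decide when to renew it is to remember when it was obtained. The following is a minimal sketch under that assumption; the function `token_expired`, the `obtained_at` timestamp, and the 60-second `margin` are hypothetical names and values, not part of the Spotify API.

```python
import time

def token_expired(obtained_at, expires_in, now=None, margin=60):
    """Return True when the token is past (or within `margin` seconds of) its lifetime."""
    if now is None:
        now = time.time()
    return now >= obtained_at + expires_in - margin

# Hypothetical timestamp (in seconds) at which the token was obtained
obtained_at = 1_000_000.0

fresh = token_expired(obtained_at, 3600, now=obtained_at + 10)
stale = token_expired(obtained_at, 3600, now=obtained_at + 3550)
print(fresh, stale)  # False True
```

In a real script, `obtained_at` would be set with `time.time()` right after the POST request, and `expires_in` would come from `token_response_data`.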

---

## 1.2 Retrieving data from endpoints

### 1.2.1 Single endpoints

**Importance**  
Spotify collects large-scale data on multiple entities: artists, albums, playlists, tracks, not to mention all individual user-level data. These collections of data can be accessed through endpoints that prescribe the required parameters and the expected output. Each request URL consists of a base URL and an endpoint. For example, the base URL of the Spotify Web API is `https://api.spotify.com/v1` and the endpoint for retrieving information about a track from the Spotify catalog is `/tracks/{id}`. Taken together, an API request to `https://api.spotify.com/v1/tracks/2EqlS6tkEnglzr7tkKAAYD` returns track-level data (e.g., duration, popularity, artist) of `Come Together - Remastered 2009` by `The Beatles`. 
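
The way a base URL and an endpoint combine into a request URL can be sketched in a few lines. The names `BASE_URL` and `build_url` are made up for this illustration:

```python
# Base URL shared by all Web API requests in this tutorial
BASE_URL = "https://api.spotify.com/v1"

def build_url(endpoint, **path_params):
    """Fill the placeholders of an endpoint template and prepend the base URL."""
    return BASE_URL + endpoint.format(**path_params)

track_url = build_url("/tracks/{id}", id="2EqlS6tkEnglzr7tkKAAYD")
print(track_url)  # https://api.spotify.com/v1/tracks/2EqlS6tkEnglzr7tkKAAYD
```

The same helper would work for other templates such as `/artists/{id}`, as long as the placeholder names match the keyword arguments.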

In a bit, we'll show you how to obtain this track-level `id`. For now, it's good to know that you can enter the identifier (`id`) into the search bar of Spotify to get to the song. For example, this is what `spotify:track:2EqlS6tkEnglzr7tkKAAYD` looks like: 

<img src="https://raw.githubusercontent.com/hannesdatta/course-odcm/master/content/docs/tutorials/apisadvanced/images/spotify_search.gif" align="left" width=60%/>

**Let's try it out!**
* Find the items associated with each of the following ids: `6oJ6le65B3SEqPwMRNXWjY` (track), `3fMbdgg4jU18AjLCKBhRSm` (artist), and `0IomjU2bXFng4LQBYn7Het` (album). Tip: depending on the type of collection, you may need to swap `track` for the respective collection you want to search for (e.g., `spotify:artist:{id}`). 

* What happens once you paste the API request URL (`https://api.spotify.com/v1/tracks/2EqlS6tkEnglzr7tkKAAYD`) into your browser? Why is that? 

Next, we create a function `renew_access_token()` that requests a new access token and returns a `headers` object built from it. This way, we never have to worry that our access token has expired. 

Then, we make a request to the API endpoint associated with the track `Come Together - Remastered 2009`:

In [8]:
def renew_access_token(token_data=token_data, token_url=token_url, headers=token_headers): 
    # request a fresh access token with the Base64 encoded client credentials
    r = requests.post(token_url, data=token_data, headers=headers)
    token_response_data = r.json()
    access_token = token_response_data["access_token"]
    # return the header format that the Web API endpoints expect
    return {"Authorization": f"Bearer {access_token}"}

r = requests.get("https://api.spotify.com/v1/tracks/2EqlS6tkEnglzr7tkKAAYD", headers=renew_access_token())
r.json()

{'album': {'album_type': 'album',
  'artists': [{'external_urls': {'spotify': 'https://open.spotify.com/artist/3WrFJ7ztbogyGnTHbHJFl2'},
    'href': 'https://api.spotify.com/v1/artists/3WrFJ7ztbogyGnTHbHJFl2',
    'id': '3WrFJ7ztbogyGnTHbHJFl2',
    'name': 'The Beatles',
    'type': 'artist',
    'uri': 'spotify:artist:3WrFJ7ztbogyGnTHbHJFl2'}],
  'available_markets': ['AD',
   'AE',
   'AG',
   'AL',
   'AM',
   'AR',
   'AT',
   'AU',
   'AZ',
   'BA',
   'BB',
   'BD',
   'BE',
   'BF',
   'BG',
   'BH',
   'BI',
   'BN',
   'BO',
   'BR',
   'BS',
   'BT',
   'BW',
   'BY',
   'BZ',
   'CA',
   'CH',
   'CL',
   'CM',
   'CO',
   'CR',
   'CV',
   'CW',
   'CY',
   'CZ',
   'DE',
   'DK',
   'DM',
   'DO',
   'DZ',
   'EC',
   'EE',
   'EG',
   'ES',
   'FI',
   'FJ',
   'FM',
   'FR',
   'GA',
   'GB',
   'GD',
   'GE',
   'GH',
   'GM',
   'GN',
   'GQ',
   'GR',
   'GT',
   'GW',
   'GY',
   'HK',
   'HN',
   'HR',
   'HT',
   'HU',
   'ID',
   'IE',
   'IL',
   'IN',
   'IS',


As you can see, it returns a wide variety of information including the album (`Abbey Road (Remastered)`), artist (`The Beatles`), release date (26th of September 1969), total number of tracks on the album (`17`), duration (`259946` ms), popularity (`77` - this number can fluctuate over time). 
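
To make this concrete, the sketch below pulls those fields out of a response. The `track` dictionary is a heavily trimmed, hand-written stand-in for `r.json()` (an assumption for illustration; the real response holds many more keys, and the date formatting is assumed), filled with the values quoted above.

```python
# Trimmed stand-in for r.json(); only the keys used below are included
track = {
    "name": "Come Together - Remastered 2009",
    "duration_ms": 259946,
    "popularity": 77,
    "artists": [{"name": "The Beatles"}],
    "album": {
        "name": "Abbey Road (Remastered)",
        "release_date": "1969-09-26",
        "total_tracks": 17,
    },
}

# Nested values are reached by chaining keys (and list indices)
artist_names = [artist["name"] for artist in track["artists"]]
album_name = track["album"]["name"]

# Convert the duration from milliseconds to minutes and seconds
minutes, seconds = divmod(track["duration_ms"] // 1000, 60)

print(album_name, artist_names)
print(f"{minutes}:{seconds:02d}")  # 4:19
```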


Similarly, you could retrieve data from any of the following endpoints: 

| Endpoint | Usage | Returns | 
| :----- | :---- | :----- | 
| `/albums/{id}` | Get an album | Album name, total tracks, all separate tracks, release date |
| `/artists/{id}` | Get an artist | Artist, popularity, followers, and primary music genres |
| `/artists/{id}/related-artists` | Get an artist's related artists | A list of artists with a similar music repertoire |
| `/artists/{id}/albums` | Get an artist's albums | A list of albums from a given artist |
| `/audio-features/{id}` | Get audio features for a track | Music characteristics (e.g., `loudness`, `energy`, `speechiness`) |

Ideally, you write the retrieved data to a JSON file so that you can access it at another point in time without having to re-run the request. The `json` library takes care of this, and the syntax looks similar to writing to a CSV file (as we did previously). The reason we save the data as a JSON file, though, is that it is nested (objects within objects and lists) and thus does not fit the column-based format of a CSV file.

In [9]:
import json

data = r.json()
with open('song_data.json', 'w') as f:
    json.dump(data, f)

You can then import this locally stored JSON file as follows: 

In [10]:
with open('song_data.json') as json_file:
    data = json.load(json_file)
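
When you collect many responses over time, an alternative is to append each JSON object as one line of a text file (a new-line separated file). The following is a rough sketch; the `records` and the file name are invented for illustration, and a temporary directory is used so that no existing file is touched.

```python
import json
import os
import tempfile

# Hypothetical records standing in for successive API responses
records = [
    {"name": "Artist A", "popularity": 80},
    {"name": "Artist B", "popularity": 65},
]

path = os.path.join(tempfile.mkdtemp(), "artists.json")

# Append mode ("a") adds to the file instead of overwriting it
with open(path, "a", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# Read the file back line by line; every non-empty line is one JSON object
with open(path, encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f if line.strip()]

print(loaded)
```

The advantage of this layout is that a crash halfway through a long collection run leaves all previously written lines intact.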

**Exercise 1**  
You are asked to conduct a market analysis of the listening behavior of The Beatles fans. Using one or more of the endpoints above, compile a list of related artists the fans frequently listen to, store the data as a JSON file, import it, and rank the artists in terms of their popularity (see [API documentation](https://developer.spotify.com/documentation/web-api/reference/artists/get-artist/)). How do The Beatles rank overall? Tip: don't mix up the artist, album, and track ids! 

In [21]:
# your answer goes here!

**Solution**  

In [20]:
from operator import itemgetter

def write_data(file_name, data):
    with open(file_name, 'w') as f:
        json.dump(data, f)
        
def import_data(file_name):
    with open(file_name) as json_file:
        data = json.load(json_file)
    return data

r = requests.get("https://api.spotify.com/v1/artists/3WrFJ7ztbogyGnTHbHJFl2/related-artists", headers=renew_access_token())
responses = r.json()

write_data('beatles.json', responses)
beatles = import_data('beatles.json')

# analyze data
artists = {}
for artist in beatles["artists"]:
    name = artist['name']
    popularity = artist['popularity']
    artists[name] = popularity
    
sorted(artists.items(), key=itemgetter(1), reverse=True)
# only The Rolling Stones, Elvis Presley, Bob Dylan, and Eric Clapton are more popular than The Beatles (February '21; these popularity scores may change over time!)

[('The Rolling Stones', 84),
 ('Elvis Presley', 82),
 ('Bob Dylan', 78),
 ('Eric Clapton', 78),
 ('The Beach Boys', 77),
 ('Simon & Garfunkel', 77),
 ('John Lennon', 76),
 ('Paul McCartney', 76),
 ('Jimi Hendrix', 76),
 ('George Harrison', 72),
 ('The Kinks', 72),
 ('Chuck Berry', 71),
 ('Roy Orbison', 70),
 ('Wings', 69),
 ('The Hollies', 67),
 ('The Byrds', 63),
 ('Buddy Holly', 63),
 ('Donovan', 62),
 ('Badfinger', 58),
 ('Ringo Starr', 55)]

**Exercise 2**   
A good friend of yours, a true Beatles fan for years, has asked you to take care of the music at his birthday party next week. In your search for tracks, you decide to consult the Spotify Web API and select the best dance numbers from the album `Abbey Road (Super Deluxe Edition)` to get the party going. Perform a comprehensive search query and argue which song should not be missed in any case. Give it a listen on Spotify, do you agree? 

In [None]:
# your answer goes here!

In [16]:
# solution
def retrieve_album_ids(artist_id):
    r = requests.get(f"https://api.spotify.com/v1/artists/{artist_id}/albums", headers = renew_access_token())
    albums = r.json()

    albums_dict = {}

    for album in albums["items"]: 
        album_id = album["id"]
        name = album["name"]
        albums_dict[name] = album_id
        
    return albums_dict


def retrieve_song_id_names(album_id):
    r = requests.get(f"https://api.spotify.com/v1/albums/{album_id}", headers = renew_access_token())
    songs_album = r.json()
    song_name_ids = {}

    for song in songs_album["tracks"]["items"]: 
        song_name_ids[song["id"]] = song["name"]
    
    return song_name_ids


def retrieve_audio_features(song_name_ids, feature = "danceability"):
    features_songs = {}
        
    for song_id, song_name in song_name_ids.items():
        r = requests.get(f"https://api.spotify.com/v1/audio-features/{song_id}", headers = renew_access_token())
        try:  # the audio features are unavailable for some of the songs (which can raise errors)
            audio_features = r.json()
            features_songs[song_name] = audio_features[feature]
        except (KeyError, TypeError, ValueError):  # skip songs without (valid) audio features
            pass
        
    return features_songs


# retrieve a list of all album ids for the Beatles    
albums_dict = retrieve_album_ids("3WrFJ7ztbogyGnTHbHJFl2")

# obtain song ids for Abbey Road (Super Deluxe Edition) album
song_name_ids = retrieve_song_id_names(albums_dict['Abbey Road (Super Deluxe Edition)'])

# obtain danceability scores for songs
danceability_songs = retrieve_audio_features(song_name_ids)

print(f"The song with the highest danceability score: {max(danceability_songs, key=danceability_songs.get)}")

The song with the highest danceability score: Maxwell's Silver Hammer - 2019 Mix


---
### 1.2.2 Multiple Query Parameters
**Importance**  
By now, you have probably experienced how time-consuming it can be to look up the `id` that belongs to a human-readable track or album name (artist id > album ids > track ids). Fortunately, the search endpoint offers a more efficient way. As we can derive from the [documentation](https://developer.spotify.com/documentation/web-api/reference/search/search/), it requires both a search query (`q`) and an item type (`type`). For example, we can easily obtain the track id of `Come Together - Remastered 2009` as follows (note that spaces are encoded as `+` or as the percent-encoded `%20`, and the `q` and `type` parameters are separated by an `&` symbol): 

In [27]:
r = requests.get(f"https://api.spotify.com/v1/search?q=Come+Together+-+Remastered+2009&type=track", headers=renew_access_token())
search_request = r.json()
search_request

{'tracks': {'href': 'https://api.spotify.com/v1/search?query=Come+Together+-+Remastered+2009&type=track&offset=0&limit=20',
  'items': [{'album': {'album_type': 'album',
     'artists': [{'external_urls': {'spotify': 'https://open.spotify.com/artist/3WrFJ7ztbogyGnTHbHJFl2'},
       'href': 'https://api.spotify.com/v1/artists/3WrFJ7ztbogyGnTHbHJFl2',
       'id': '3WrFJ7ztbogyGnTHbHJFl2',
       'name': 'The Beatles',
       'type': 'artist',
       'uri': 'spotify:artist:3WrFJ7ztbogyGnTHbHJFl2'}],
     'available_markets': ['AD',
      'AE',
      'AL',
      'AR',
      'AT',
      'AU',
      'BA',
      'BE',
      'BG',
      'BH',
      'BO',
      'BR',
      'BY',
      'CA',
      'CH',
      'CL',
      'CO',
      'CR',
      'CY',
      'CZ',
      'DE',
      'DK',
      'DO',
      'DZ',
      'EC',
      'EE',
      'EG',
      'ES',
      'FI',
      'FR',
      'GB',
      'GR',
      'GT',
      'HK',
      'HN',
      'HR',
      'HU',
      'ID',
      'IE',
      'I

**Let's try it out!**  
How many search results are there? Why is that? What's the difference between these results? 

---
The [documentation](https://developer.spotify.com/documentation/web-api/reference/search/search/) is a great resource to learn more about how to refine your search queries. In the table below, we have summarized these guidelines: 

| Technique | Example | Interpretation | 
| :---- | :------ | :------------ | 
| Quotation | `q='Come+Together+2009'` | Matches with `Come Together (2009)` but not with `Come Together Remastered (2009)` |
| Union | `q=2009+OR+Remastered+2009`| Matches with both `(2009)` and `Remastered (2009)`|
| Exclusion | `q=Come+Together+2009+NOT+Remastered` | Matches with `Come Together (2009)` but not with `Come Together Remastered (2009)` |
| Multiple queries | `q=track:Come+Together+artist:The+Beatles` | Matches with `Come Together` from `The Beatles` |
| Multiple types | `q=Come+Together&type=album,track` | Matches with both `albums` and `tracks` named `Come Together`|
| Genre | `q=Come+Together+genre:rock` | Matches with `rock` tracks named `Come Together` |
| Year | `q=Come+Together+year:2009` | Matches with tracks named `Come Together` from `2009` |
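
Typing these query strings by hand is error-prone, so a small helper can assemble them. The sketch below uses `urlencode` from the standard library; the function name `build_search_url` and its arguments are made up for illustration, and `safe=":,"` is chosen here so that the field filters and type lists stay readable in the URL.

```python
from urllib.parse import urlencode

def build_search_url(keywords="", item_type="track", **filters):
    """Combine free-text keywords and field filters (e.g., artist, year) into a search URL."""
    parts = [keywords] if keywords else []
    parts += [f"{field}:{value}" for field, value in filters.items()]
    params = {"q": " ".join(parts), "type": item_type}
    # urlencode turns spaces into `+` and joins the parameters with `&`
    return "https://api.spotify.com/v1/search?" + urlencode(params, safe=":,")

url_filters = build_search_url(track="Come Together", artist="The Beatles")
url_types = build_search_url("Come Together", item_type="album,track")
print(url_filters)
print(url_types)
```

The first call reproduces the "Multiple queries" row of the table, the second the "Multiple types" row.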


**Exercise 3**  
Suppose that you have set yourself a goal to run half a marathon by the end of this year. Define an appropriate search strategy to find a collection of `workout` tracks aimed at runners that have been released this year. Since you don't want to continuously pick up your phone while running, the `album` should have listed at least 10 tracks.  Note that a variety of solutions are possible here (don't forget to write the data to a file after you requested it!). 

In [None]:
# your answer goes here!

In [22]:
# solution
r = requests.get(f"https://api.spotify.com/v1/search?q=running+genre:workout+year:2021&type=track", headers=renew_access_token())
responses = r.json()

# write and import data
write_data('workout_tracks.json', responses)
workout_tracks = import_data('workout_tracks.json')

# analyze data
workout_albums = []
for workout_track in workout_tracks["tracks"]["items"]:
    if workout_track["album"]["total_tracks"] >= 10: 
        workout_albums.append(workout_track["album"]["name"])
        
print(workout_albums)  # can you think of a plausible reason why there are so many duplicates? 

['Running Ahead', 'Running Ahead', 'Running Ahead', '40 Best Pop Remixes 2021 For Running (Unmixed Compilation for Fitness & Workout 128 Bpm / 32 Count)', 'Running Ahead', 'Running Ahead', 'Running Ahead', 'Running Ahead', 'Running Ahead', 'Running Ahead', 'Running Ahead', 'Happy Running Hits 2021 Workout Session (60 Minutes Non-Stop Mixed Compilation for Fitness & Workout 128 Bpm)', 'Running Ahead', 'Running Ahead', 'Running Ahead', 'Running Ahead', 'Running Ahead', 'Running Ahead', 'Running Ahead', 'Running Ahead']


In [23]:
# to avoid duplicates you may want to switch to a different data structure: a `set()`
# its most important characteristic is that it only stores unique values
# to add an item to a set you use `.add()` as opposed to `.append()` for lists

workout_albums_set = set()

for workout_track in workout_tracks["tracks"]["items"]:
    if workout_track["album"]["total_tracks"] >= 10: 
        workout_albums_set.add(workout_track["album"]["name"])
        
print(workout_albums_set) 

{'40 Best Pop Remixes 2021 For Running (Unmixed Compilation for Fitness & Workout 128 Bpm / 32 Count)', 'Happy Running Hits 2021 Workout Session (60 Minutes Non-Stop Mixed Compilation for Fitness & Workout 128 Bpm)', 'Running Ahead'}


**Exercise 4**  
After listening to this running playlist for a while, you become more and more selective about the listed tracks. In particular, you find that although the rhythm of the tracks follows your ideal running pace (125+ bpm), some of them lack a bit of energy. Hence, you decide to create a playlist yourself that only contains tracks with an `energy` level of at least `.8`. Pick one of the playlists from Exercise 3 and curate the selection of tracks that match your criterion. 

In [None]:
# your answer goes here!

In [24]:
# retrieve album id
r = requests.get(f"https://api.spotify.com/v1/search?q=The+Best+of+Running+2021+&type=album", headers=renew_access_token())
album = r.json()
album_id = album["albums"]["items"][0]["id"]

# get songs for album id (see Exercise 2)
album_tracks = retrieve_song_id_names(album_id)

# get audio feature for tracks (see Exercise 2)
energy_tracks =  retrieve_audio_features(album_tracks, "energy")
    
# check whether tracks meet the energy criterion (at least .8)
selected_tracks = []
for track, energy in energy_tracks.items():
    if energy >= .8: 
        selected_tracks.append(track)

print(selected_tracks)

['Dynamite - Workout Mix Edit 133 bpm', 'Kernkraft 400', 'Miss You - Radio Edit', 'Salt - Workout Mix Edit 134 bpm', 'Break Free - Radio Edit', 'Some Say - Workout Mix Edit 134 bpm', 'What U Need - Radio Edit', 'In My Head', 'Like It Is - Workout Mix Edit 133 bpm', 'Never Go Away - Radio Edit', 'Something Just Like This - Workout Mix Edit 132 bpm', 'Rain On Me - Workout Mix Edit 132 bpm', 'Take You Dancing - Workout Mix Edit 132 bpm', "Can't Stop Me - Radio Edit", 'What A Man Gotta Do - Workout Mix Edit 132 bpm', 'Roses - Workout Mix Edit 133 bpm', 'Be Kind - Workout Mix Edit 133 bpm', 'Watermelon Sugar - Workout Mix Edit 132 bpm', 'Blinding Lights - Workout Mix Edit 135 bpm', 'Waterfall - Radio Edit', "OMG What's Happening - Workout Mix Edit 132 bpm"]


---
### 1.2.3 Iterate over Pages & Results

**Importance**  
Like the `icanhazdadjoke` API, the Spotify API only returns a subset of all search results at a time. For example, if you make a generic search request such as `q=come+together&type=track` you end up with thousands of results, including: `Come Together - Remastered 2009`, `Come Together - Live From Fox Theatre Detroit, MI/2012`, `Come Together - 2019 Mix`, and many more! 

By default, the Spotify API only returns the first 20 results. You can change this with the `limit` parameter (up to 50 results):

In [45]:
def number_results(limit):
    search_url = "https://api.spotify.com/v1/search?q=come+together&type=track"
    r = requests.get(search_url + f"&limit={limit}", headers=renew_access_token())
    search_results = r.json()
    print(f"Number of results for &limit={limit}: {len(search_results['tracks']['items'])}")

number_results(20)
number_results(50)

Number of results for &limit=20: 20
Number of results for &limit=50: 50


**Let's try it out!**  
What happens once you run `number_results(100)`? Are the first 20 results identical for `limit=20` and `limit=50`? 

At the very bottom of the search request, you find the following information: 

* `next`: The URL you need to request to get to the new batch of results. Note that it looks very similar to the current URL; only the `offset` value has been changed (i.e., it has been incremented by the value of `limit`)
* `offset`: Think of it as a starting index for the search results. For example, `offset=20` means: show result `20` up to (`20+limit`)...  
* `previous`: Similar to `next`, but here the `limit` has been subtracted from the `offset` value. For example, if `offset=20` for the current request, `previous` can be found at `offset=0`. 
* `total`: The total number of search results. Together with the `offset` value, you can determine whether you have reached the final result. 
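
The relationship between `limit`, `offset`, and `total` can be sketched with a small calculation. The function name `page_offsets` and the numbers are made up for illustration:

```python
import math

def page_offsets(total, limit):
    """Return the offset of every API call needed to cover `total` results."""
    n_calls = math.ceil(total / limit)  # the last page may be only partly filled
    return [page * limit for page in range(n_calls)]

offsets = page_offsets(total=120, limit=50)
print(len(offsets), offsets)  # 3 [0, 50, 100]
```

With 120 results and a limit of 50, the third call starts at offset 100 and returns only the remaining 20 results.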

**Let's try it out!**  
Suppose that the search API returns 7094 results and you set `limit` equal to `50`. How many times do you need to make an API call to obtain all results? What's the `offset` value of the last API call in that case? 

Below we give an example of how to use the `next` URL to keep iterating over the search results until all track names and ids are stored. First, we make our initial request to determine the total number of results (`total_results`). Second, we store the names and ids of the tracks in a list `track_names`. Third, we look up the `next` URL and check whether it exists (`None` indicates that this is the last page). Fourth, we repeat until the number of items in `track_names` equals the total number of search results. In other words, we have stored all records!

In [34]:
def search_results(search_query):
    r = requests.get(search_query, headers=renew_access_token())
    return r.json()
    
track_names = []
results = search_results("https://api.spotify.com/v1/search?q=track:come+together+year:2020&type=track")
total_results = results['tracks']['total']

while len(track_names) < total_results: 
    track_names.extend([[track["name"], track["id"]] for track in results["tracks"]["items"]])
    next_url = results['tracks']['next']
    if next_url is None:  # last page reached, so stop even if fewer than `total` items came back
        break
    results = search_results(next_url)
        
print(track_names)

[['Come Together', '7DpfOkks38EfsrVcG9Zmhw'], ['Come Together', '7n8sDrEcuMt0yezLDhIbnN'], ['Come Together', '2Vf7umz71NibHBgzU3sQav'], ['Come Together - Mixed', '6xWDBHCxuP7OhCNF2sylKu'], ['Come Together', '170DYhXuUVDyuEZsLb0MBB'], ['Come Together', '7GA49BEANCELzwyBxQVxU1'], ['Come Together - Extended Mix', '10TCB5AtmzLirlAHM0PzVi'], ['Come Together', '0aITsSU1pXt3Tt3noutwzM'], ['Come Together', '2OVNBbPqoktC11yqbCDgV3'], ['We the People (Come Together)', '1iKAD3PTIsjfcw2AinyKVp'], ['Come Together', '75Y9iaqeq3y9cP4ecwnkqY'], ['Rise/Come Together - Live', '6lv06xUOGsvdrY3CwrydCV'], ['Come Together - Live / Ultimate Mix', '6K6QJTaOBZ9BhbavY9AzB0'], ['Come Together', '3tui2rMOT8HYr05PRK4S77'], ['Comeback', '4CVZm7p4kqo7JcPF2OzSdD'], ['Come Together', '2PPzcXr4zU2XkXRquUdceG'], ['Come Together', '4WMRotJmCrjPZHP20qWnQB'], ['Come Together', '4cDMYi7G5Ht846U9oyWySM'], ['Come Together', '1flV5TmOuD2YlXUserWGxw'], ['Come Get Me', '1AmYc2VeJgVgEQC5aJGTN1'], ['Come Together', '72Y3uNSzRlaN6N
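
The same pattern can be tried out without calling the API. In the sketch below, `fake_pages` is a hypothetical stand-in for the search endpoint (two tiny pages linked through `next`), and `iterate_pages` is an illustrative helper name:

```python
def iterate_pages(first_url, fetch):
    """Yield one page at a time, following `next` until it is None."""
    url = first_url
    while url is not None:
        page = fetch(url)
        yield page
        url = page["tracks"]["next"]

# Hypothetical responses keyed by URL; the last page has no `next`
fake_pages = {
    "page-1": {"tracks": {"items": [{"name": "A"}, {"name": "B"}], "next": "page-2"}},
    "page-2": {"tracks": {"items": [{"name": "C"}], "next": None}},
}

names = [
    track["name"]
    for page in iterate_pages("page-1", fake_pages.get)
    for track in page["tracks"]["items"]
]
print(names)  # ['A', 'B', 'C']
```

Stopping on `next` being `None` rather than on a counter means the loop also ends cleanly when the API returns fewer items than `total` promised.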

**Exercise 5**  
Suppose that you'd listen to all tracks in `track_names` in one go. How long would it take you? Make sure your code still works if new tracks are added along the way. Also, store all API output in a list `all_search_results` and write it to a JSON file.

In [None]:
# your answer goes here!

In [35]:
# solution
# since our program should be future proof we cannot simply pass all track ids to `/tracks/id` 
# rather, we modify the code snippet above and store the `duration_ms` for each track in a list

track_duration = []
results = search_results("https://api.spotify.com/v1/search?q=track:come+together+year:2020&type=track")
total_results = results['tracks']['total']
all_search_results = []

while len(track_duration) < total_results: 
    all_search_results.append(results)
    track_duration.extend(track["duration_ms"] for track in results["tracks"]["items"])
    next_url = results['tracks']['next']
    if next_url is None:  # last page reached
        break
    results = search_results(next_url)

write_data("all_search_results.json", all_search_results)
print(f"The total duration is: {round(sum(track_duration)/1000/60/60,1)} hours")  # milliseconds -> seconds -> minutes -> hours

The total duration is: 44.3 hours


---
## 1.3 Wrap-Up

Good job - you've made it! We hope working with various endpoints from the Spotify API has given you the confidence to explore some other [endpoints](https://developer.spotify.com/documentation/web-api/reference/) on your own and - perhaps - has even sparked your interest in analyzing the online music streaming market. As a suggestion, you may want to look into which tracks are listed on Spotify [playlists](https://developer.spotify.com/documentation/web-api/reference/playlists/), and which playlists are in turn [featured](https://developer.spotify.com/documentation/web-api/reference/browse/) on Spotify. If you're interested in the relevance of playlist curation, have a look at [this](https://www.youtube.com/watch?v=EbmCVRkmCAc) web lecture I recorded for the Universiteit van Nederland (in Dutch).



---

## 2. Parsing JSON data

Thus far, we have repeatedly asked you to store the API output as a JSON file. Here we practice some more with parsing JSON into CSV format. If you don't know how to extract this information and *structure* it in CSV files, you'll have to rely on programmers to do that job for you. The downside is that programmers may not understand the context you're working in and may fail to extract important information. Therefore, if you want to conduct research on social media, it's essential that you acquire these skills yourself.


## 2.1 New-Line Separated Data

Instead of a single JSON file, you may encounter a multitude of JSON objects stitched together in a new-line separated UTF-8 file. This means that every JSON object sits on its own line. For example, see [this](https://gist.github.com/RoyKlaasseBos/2afee1308e19d7570cb84a49d49b9c8c) file, in which each object is followed by 4 empty lines.
<img src="https://raw.githubusercontent.com/hannesdatta/course-odcm/master/content/docs/tutorials/apisadvanced/images/new-line-separated-file.png" align="left" width=70%/>


In the example above, we have stored the data on a so-called Github Gist web page which is an easy way to share excerpts of code or data with others. Rather than downloading this JSON file manually, we can also directly request the data from the (raw) URL page:

In [2]:
import requests
r = requests.get('https://gist.githubusercontent.com/RoyKlaasseBos/2afee1308e19d7570cb84a49d49b9c8c/raw/5672442be3e183280000b6fe2f9d370724594c40/newline%2520separated%2520JSON')

However, we then can't simply use the `json` library as you did before since `r.json()` will throw an error because of the new-line separated format. You can access the raw text data though:

In [3]:
# first 100 characters of file (it's just string data - you can't call r.text['created_at'] for example)
r.text[:100]

'{"created_at":"Wed Mar 29 13:07:06 +0000 2017","id":847072627730583552,"id_str":"847072627730583552"'
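
Why the text cannot be parsed in one go can be reproduced on a tiny scale with the `json` library. The string `two_objects` is a made-up example with two objects on separate lines:

```python
import json

# Two JSON objects, each on its own line (hypothetical mini example)
two_objects = '{"id": 1}\n{"id": 2}'

# Parsing the whole text fails: a JSON document holds only one top-level value
try:
    json.loads(two_objects)
    parsed_whole = True
except json.JSONDecodeError as error:
    parsed_whole = False
    print(f"Could not parse as one document: {error.msg}")

# Parsing line by line works
parsed_lines = [json.loads(line) for line in two_objects.splitlines()]
print(parsed_lines)
```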

Next, we split this `r.text` object on new-line characters so that every line becomes a new element in the list. As you can see, the first element is a complete object enclosed within curly braces (`{}`).

In [84]:
split_elements = r.text.splitlines()
split_elements[0]

'{"created_at":"Wed Mar 29 13:07:06 +0000 2017","id":847072627730583552,"id_str":"847072627730583552","text":"#Brexit: Sogar #Trump kommt im Abschiedsbrief von PM May an #EU vor - wenn auch nicht namentlich: \\"Protektionistische Instinkte nehmen zu\\"","source":"\\u003ca href=\\"http:\\/\\/twitter.com\\" rel=\\"nofollow\\"\\u003eTwitter Web Client\\u003c\\/a\\u003e","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":2190051518,"id_str":"2190051518","name":"Kai Kuestner","screen_name":"KuestnerK","location":"Br\\u00fcssel, Belgien","url":null,"description":"WDR\\/NDR Correspondent in Brussels. From 2008-2013 ARD German Radio Bureau Chief South Asia (New Delhi)","protected":false,"verified":false,"followers_count":861,"friends_count":207,"listed_count":71,"favourites_count":146,"statuses_count":1848,"created_at":"Tue Nov 12 10:14:06 +0000 2013","utc_offset":

**Let's try it yourself!**   
Also, inspect the 2nd element of `split_elements`. What do you see? Why is that? How about the 6th element in the list? 

**Exercise 6**  
Each of these objects can then be loaded into a JSON format that we can parse. Parse the first object and explore its contents. What does it represent? Where can you find the online equivalent?

In [None]:
# your answer goes here!

In [120]:
# solution - it's a tweet by Kai Kuestner from Brussels that contains the hashtags Trump and Brexit (https://twitter.com/KuestnerK/status/847072627730583552)
json.loads(split_elements[0])

{'created_at': 'Wed Mar 29 13:07:06 +0000 2017',
 'id': 847072627730583552,
 'id_str': '847072627730583552',
 'text': '#Brexit: Sogar #Trump kommt im Abschiedsbrief von PM May an #EU vor - wenn auch nicht namentlich: "Protektionistische Instinkte nehmen zu"',
 'source': '<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>',
 'truncated': False,
 'in_reply_to_status_id': None,
 'in_reply_to_status_id_str': None,
 'in_reply_to_user_id': None,
 'in_reply_to_user_id_str': None,
 'in_reply_to_screen_name': None,
 'user': {'id': 2190051518,
  'id_str': '2190051518',
  'name': 'Kai Kuestner',
  'screen_name': 'KuestnerK',
  'location': 'Brüssel, Belgien',
  'url': None,
  'description': 'WDR/NDR Correspondent in Brussels. From 2008-2013 ARD German Radio Bureau Chief South Asia (New Delhi)',
  'protected': False,
  'verified': False,
  'followers_count': 861,
  'friends_count': 207,
  'listed_count': 71,
  'favourites_count': 146,
  'statuses_count': 1848,
  'created_at': 'Tue N
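Once a line is parsed, nested fields can be reached by chaining keys. The sketch below uses a hand-trimmed version of the first record that keeps only a few of the fields shown above; `not_a_field` is a hypothetical key used to illustrate a safe lookup with `.get()`.

```python
import json

# shortened version of the first record (most fields left out for brevity)
raw = (
    '{"created_at": "Wed Mar 29 13:07:06 +0000 2017", '
    '"id_str": "847072627730583552", '
    '"user": {"screen_name": "KuestnerK", '
    '"location": "Br\\u00fcssel, Belgien", "url": null}}'
)

tweet = json.loads(raw)

# top-level keys of the parsed dictionary
print(list(tweet.keys()))

# nested values: first select the "user" dictionary, then a key inside it
screen_name = tweet["user"]["screen_name"]
location = tweet["user"]["location"]   # the \u00fc escape is decoded to "ü"
url = tweet["user"]["url"]             # JSON null becomes Python None

# .get() returns None instead of raising a KeyError when a key is absent
missing = tweet.get("not_a_field")     # hypothetical key, not part of the data

print(screen_name, location, url, missing)
```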

**Exercise 7**  
Write a for loop that parses all dates/times in `split_elements` and stores them in a list called `date_time`. Make sure to skip the empty lines in between. What time frame does our sample cover?

In [None]:
# your answer goes here!

In [91]:
# most elegant solution (every 5th element holds a tweet; the lines in between are empty)
date_time = [json.loads(split_elements[counter])['created_at'] for counter in range(0, len(split_elements), 5)]
date_time

# The large majority of tweets are between 13:07 and 13:18 (March 29 2017). There are also a couple of tweets the next day around 9 o'clock.

['Wed Mar 29 13:07:06 +0000 2017',
 'Wed Mar 29 13:07:07 +0000 2017',
 'Wed Mar 29 13:07:07 +0000 2017',
 'Wed Mar 29 13:07:09 +0000 2017',
 'Wed Mar 29 13:07:09 +0000 2017',
 'Wed Mar 29 13:07:10 +0000 2017',
 'Wed Mar 29 13:07:11 +0000 2017',
 'Wed Mar 29 13:07:11 +0000 2017',
 'Wed Mar 29 13:07:11 +0000 2017',
 'Wed Mar 29 13:07:13 +0000 2017',
 'Wed Mar 29 13:07:12 +0000 2017',
 'Wed Mar 29 13:07:13 +0000 2017',
 'Wed Mar 29 13:07:15 +0000 2017',
 'Wed Mar 29 13:07:16 +0000 2017',
 'Wed Mar 29 13:07:16 +0000 2017',
 'Wed Mar 29 13:07:16 +0000 2017',
 'Wed Mar 29 13:07:17 +0000 2017',
 'Wed Mar 29 13:07:17 +0000 2017',
 'Wed Mar 29 13:07:17 +0000 2017',
 'Wed Mar 29 13:07:22 +0000 2017',
 'Wed Mar 29 13:07:24 +0000 2017',
 'Wed Mar 29 13:07:25 +0000 2017',
 'Wed Mar 29 13:07:27 +0000 2017',
 'Wed Mar 29 13:07:29 +0000 2017',
 'Wed Mar 29 13:07:29 +0000 2017',
 'Wed Mar 29 13:07:29 +0000 2017',
 'Wed Mar 29 13:07:31 +0000 2017',
 'Wed Mar 29 13:07:31 +0000 2017',
 'Wed Mar 29 13:07:3

In [92]:
# alternative solution I
date_time = []

for counter in range(0, len(split_elements), 5):
    obj = json.loads(split_elements[counter])
    date_time.append(obj['created_at'])
date_time

['Wed Mar 29 13:07:06 +0000 2017',
 'Wed Mar 29 13:07:07 +0000 2017',
 'Wed Mar 29 13:07:07 +0000 2017',
 'Wed Mar 29 13:07:09 +0000 2017',
 'Wed Mar 29 13:07:09 +0000 2017',
 'Wed Mar 29 13:07:10 +0000 2017',
 'Wed Mar 29 13:07:11 +0000 2017',
 'Wed Mar 29 13:07:11 +0000 2017',
 'Wed Mar 29 13:07:11 +0000 2017',
 'Wed Mar 29 13:07:13 +0000 2017',
 'Wed Mar 29 13:07:12 +0000 2017',
 'Wed Mar 29 13:07:13 +0000 2017',
 'Wed Mar 29 13:07:15 +0000 2017',
 'Wed Mar 29 13:07:16 +0000 2017',
 'Wed Mar 29 13:07:16 +0000 2017',
 'Wed Mar 29 13:07:16 +0000 2017',
 'Wed Mar 29 13:07:17 +0000 2017',
 'Wed Mar 29 13:07:17 +0000 2017',
 'Wed Mar 29 13:07:17 +0000 2017',
 'Wed Mar 29 13:07:22 +0000 2017',
 'Wed Mar 29 13:07:24 +0000 2017',
 'Wed Mar 29 13:07:25 +0000 2017',
 'Wed Mar 29 13:07:27 +0000 2017',
 'Wed Mar 29 13:07:29 +0000 2017',
 'Wed Mar 29 13:07:29 +0000 2017',
 'Wed Mar 29 13:07:29 +0000 2017',
 'Wed Mar 29 13:07:31 +0000 2017',
 'Wed Mar 29 13:07:31 +0000 2017',
 'Wed Mar 29 13:07:3

In [95]:
# alternative solution II
date_time = []

for element in split_elements:
    try:  # calling json.loads() on an empty string raises an error
        obj = json.loads(element)
        date_time.append(obj['created_at'])
    except (json.JSONDecodeError, KeyError):  # skip empty lines and records without a date
        pass
date_time

['Wed Mar 29 13:07:06 +0000 2017',
 'Wed Mar 29 13:07:07 +0000 2017',
 'Wed Mar 29 13:07:07 +0000 2017',
 'Wed Mar 29 13:07:09 +0000 2017',
 'Wed Mar 29 13:07:09 +0000 2017',
 'Wed Mar 29 13:07:10 +0000 2017',
 'Wed Mar 29 13:07:11 +0000 2017',
 'Wed Mar 29 13:07:11 +0000 2017',
 'Wed Mar 29 13:07:11 +0000 2017',
 'Wed Mar 29 13:07:13 +0000 2017',
 'Wed Mar 29 13:07:12 +0000 2017',
 'Wed Mar 29 13:07:13 +0000 2017',
 'Wed Mar 29 13:07:15 +0000 2017',
 'Wed Mar 29 13:07:16 +0000 2017',
 'Wed Mar 29 13:07:16 +0000 2017',
 'Wed Mar 29 13:07:16 +0000 2017',
 'Wed Mar 29 13:07:17 +0000 2017',
 'Wed Mar 29 13:07:17 +0000 2017',
 'Wed Mar 29 13:07:17 +0000 2017',
 'Wed Mar 29 13:07:22 +0000 2017',
 'Wed Mar 29 13:07:24 +0000 2017',
 'Wed Mar 29 13:07:25 +0000 2017',
 'Wed Mar 29 13:07:27 +0000 2017',
 'Wed Mar 29 13:07:29 +0000 2017',
 'Wed Mar 29 13:07:29 +0000 2017',
 'Wed Mar 29 13:07:29 +0000 2017',
 'Wed Mar 29 13:07:31 +0000 2017',
 'Wed Mar 29 13:07:31 +0000 2017',
 'Wed Mar 29 13:07:3
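The solutions above either rely on a fixed spacing between records or on catching errors. As a hedged alternative, the sketch below filters out blank lines explicitly and then converts the strings to `datetime` objects so the time frame can be computed rather than read off by eye. The `lines` list is a made-up stand-in for `split_elements`, and the format string is an assumption based on the `created_at` values shown above.

```python
import json
from datetime import datetime

# hypothetical stand-in for split_elements: records separated by empty lines
lines = [
    '{"created_at": "Wed Mar 29 13:07:06 +0000 2017"}',
    '',
    '{"created_at": "Wed Mar 29 13:07:31 +0000 2017"}',
    '   ',
    '{"created_at": "Wed Mar 29 13:07:07 +0000 2017"}',
]

# keep only lines that contain something other than whitespace
created = [json.loads(line)["created_at"] for line in lines if line.strip()]

# assumed layout of created_at: weekday, month, day, time, UTC offset, year
twitter_format = "%a %b %d %H:%M:%S %z %Y"
timestamps = [datetime.strptime(value, twitter_format) for value in created]

# the earliest and latest timestamps mark the time frame of the sample
first, last = min(timestamps), max(timestamps)
print(first, "to", last)
```

Filtering with `if line.strip()` keeps working when the number of blank lines between records varies, which stepping through the list in fixed increments does not.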

## 2.2 Parsing to tabular CSV files

There are several ways to transform JSON data into a CSV file. Here we show how to do it with a pandas DataFrame. We pass it a dictionary whose key is the column name and whose value is the `date_time` list we constructed before (converted to timestamps). After that, it's simply a matter of writing the DataFrame to a CSV file.

In [117]:
import pandas as pd 

date_time = pd.to_datetime(date_time)
df = pd.DataFrame({"date_time": date_time})
df.to_csv("date_time.csv", index=False)
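As one of the other ways to get from parsed records to a CSV file, the sketch below uses `csv.DictWriter` from Python's standard library instead of pandas. The `rows` list is hypothetical (the second user is invented), and the output is written to an in-memory buffer so the example runs without touching the disk; in practice you would pass a file opened with `open(..., "w", newline="", encoding="utf-8")`.

```python
import csv
import io

# hypothetical rows: one dictionary per record, one key per column
rows = [
    {"username": "KuestnerK", "location": "Brüssel, Belgien"},
    {"username": "example_user", "location": None},  # invented user without a location
]

# in-memory text buffer standing in for a real file
buffer = io.StringIO()

writer = csv.DictWriter(buffer, fieldnames=["username", "location"])
writer.writeheader()    # first line: column names
writer.writerows(rows)  # one line per dictionary

csv_text = buffer.getvalue()
print(csv_text)
```

Note that values containing a comma are quoted automatically and `None` is written as an empty field, so the columns stay aligned.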

**Exercise 8**  
Extract the username, tweet, and location of the tweets in `split_elements`, and write them to a CSV file. Tip: a "regular" tweet and a reply/retweet have a different parsing path. Both types should be captured and stored in the `tweet` column.

In [None]:
# your answer goes here!

In [150]:
# solution
tweets = []
for counter in range(0, len(split_elements), 5):
    obj = json.loads(split_elements[counter])

    if obj["user"]["description"] is None:
        tweet = obj["text"]
    else:
        tweet = obj["user"]["description"]
    tweets.append({"username": obj["user"]["screen_name"],
                   "tweet": tweet,
                   "location": obj["user"]["location"]})
        
df = pd.DataFrame(tweets)
df.to_csv("tweets.csv", index=False)