# Last.fm API with PyLast

This repository builds datasets from the Last.fm API. The functions are defined in ```create_last_database.py``` and ```generate_lastfm_users.py```.

These are the main references:

- [Official Website](https://www.last.fm/api/)

- [PyLast Repo](https://github.com/pylast)

To create the datasets, you need a Last.fm account and a registered application, which provides an ```api_key``` and an ```api_key_secret```.

**You do not need to generate all of the data for the analysis. This is a secondary data source, because retrieving data from Last.fm is slow.**

In [None]:
import pandas as pd 
import numpy as np 
import pylast

import os
import json
import sys
import requests
import time 
from IPython.display import clear_output

sys.path.append('../../scripts/') 

from create_last_database import User
from create_last_database import Track
from create_last_database import Artist
from create_last_database import Album
from create_last_database import Tag
from create_last_database import Library

## Network with the API through PyLast

The connection to the API is made through the PyLast library.

In [None]:
API_KEY = input()
API_SECRET = input()

network = pylast.LastFMNetwork(api_key=API_KEY, api_secret=API_SECRET)
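Typing the key and secret into `input()` may leave them visible in the saved notebook output. A minimal sketch of an alternative, assuming the credentials are stored in environment variables named `LASTFM_API_KEY` and `LASTFM_API_SECRET` (hypothetical names, not used anywhere in this repository), with a hidden prompt as a fallback:

```python
import os
from getpass import getpass


def read_credentials(env=os.environ):
    """Return (api_key, api_secret), preferring environment variables.

    The variable names are illustrative assumptions, not part of pylast.
    """
    api_key = env.get("LASTFM_API_KEY")
    api_secret = env.get("LASTFM_API_SECRET")
    # Fall back to a hidden prompt when a variable is missing or empty.
    if not api_key:
        api_key = getpass("Last.fm API key: ")
    if not api_secret:
        api_secret = getpass("Last.fm API secret: ")
    return api_key, api_secret
```

The returned pair can then be passed to `pylast.LastFMNetwork(api_key=..., api_secret=...)` exactly as in the cell above.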

## Get the Users

Let's get some random users from the list built by `generate_lastfm_users.py`. I use `random_state` for reproducibility. The original dataset has more than 30 thousand users. To generate that list of usernames, I visited the Last.fm pages of several artists and considered three users, chosen at random among the top listeners, from three different countries: Brazil, USA and United Kingdom. Starting from just these usernames, I generated additional Last.fm usernames using the user.getFriends method. With some loops, we can get the network (or part of it). This procedure may introduce biases that are not yet known.

In [None]:
def get_random_users(filepath: str, quantity: int = 1000, random_state: int = 200) -> pd.DataFrame:
    
    users = pd.read_csv(filepath)
    chosen_users = users.sample(n=quantity, replace=False, random_state=random_state, axis='index')
    # Renumber the sampled rows 0..n-1 so the loops below can use the index as a counter
    chosen_users = chosen_users.reset_index(drop=True)

    return chosen_users

user_path = "../../data/lastfm-api/users_lastfm.csv"

users = get_random_users(user_path)
users.sample()
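The friend-based expansion described in the text can be pictured as a breadth-first walk over the friendship graph. The sketch below is not the code in `generate_lastfm_users.py`; it assumes a hypothetical callable `get_friends` that maps a username to a list of friends (in practice it would wrap whatever call backs the `user.getFriends` method) and uses a toy graph in place of the API:

```python
from collections import deque


def crawl_usernames(seeds, get_friends, max_users=100):
    """Breadth-first expansion of seed usernames through friend lists.

    `get_friends` is a hypothetical callable: username -> iterable of usernames.
    """
    seen = set()
    ordered = []
    queue = deque()
    for seed in seeds:
        if seed not in seen:
            seen.add(seed)
            ordered.append(seed)
            queue.append(seed)
    while queue and len(ordered) < max_users:
        current = queue.popleft()
        for friend in get_friends(current):
            if friend not in seen:
                seen.add(friend)
                ordered.append(friend)
                queue.append(friend)
                if len(ordered) >= max_users:
                    break
    return ordered


# Toy friendship graph standing in for the API (made-up names).
toy_graph = {"ana": ["bob", "cat"], "bob": ["ana", "dan"], "cat": [], "dan": ["eve"], "eve": []}
found = crawl_usernames(["ana"], lambda name: toy_graph.get(name, []), max_users=4)
print(found)
```

Because every new user is reached through a friend of a seed, the sample stays close to the seeds' neighbourhood, which is one possible source of the bias mentioned above.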

## Creating the database

It takes a long, long time (really, really long). For each piece of information, I have to request four or five links to extract it. Some **MalformedResponse**, **Network** and **Connection** errors are expected. In some special cases, I just rerun the cell.

Remember: it takes a really long time (about 15 s per user on average). That's why it won't be the main dataset!
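As a rough order of magnitude, the average time per user can be turned into a total run time. The sample size of 1,000 users is an assumption taken from the default `quantity` used earlier:

```python
SECONDS_PER_USER = 15   # average reported in the text
N_USERS = 1000          # assumed sample size

total_seconds = SECONDS_PER_USER * N_USERS
total_hours = total_seconds / 3600
print(f"{total_seconds} s = {total_hours:.1f} h")
```

So a full pass over the user sample alone is on the order of four hours, before tracks, artists and tags are requested.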

We start with the users info. All the following information is saved in JSON format.

In [None]:
FOLDER_PATH = '../../data/lastfm-api/'

users_class = User(network, user_path)

file_user_name = '2k_users_info_lastfm.json'

MAX_USERS = 1000

# Create the output folder (and any missing parent folders) if needed
os.makedirs(FOLDER_PATH, exist_ok=True)

In [None]:
# Handle the file 
if not os.path.exists(os.path.join(FOLDER_PATH, file_user_name)):
    with open(os.path.join(FOLDER_PATH, file_user_name), 'w') as f:
        json.dump({}, f)
with open(os.path.join(FOLDER_PATH, file_user_name), 'r') as f:
    data = json.load(f)
    
t0 = time.time()
for i, user in users.iterrows():
    if i > MAX_USERS: 
        break
    # If the user is in the file already, continue
    if str(user.user_id) in data:
        continue
    with open(os.path.join(FOLDER_PATH, file_user_name), 'r+') as f:
        data = json.load(f)
        
        # A lot of internet problems may occur.
        while True: 
            try: 
                user_info = users_class.get_user_info(user.user_name)
            except pylast.NetworkError as e:
                print(e)
                time.sleep(5)
                continue
            except pylast.MalformedResponseError as e:
                print(e)
                time.sleep(5)
                continue
            break
        
        # We save the information in a json format (as a dictionary)
        data[user.user_id] = user_info
        f.seek(0)
        json.dump(data, f)
        if len(data) % 10 == 0:
            clear_output()
            print('{} users - DONE: {} seconds'.format(len(data), (time.time() - t0)))
            t0 = time.time()
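The `while True` retry pattern used in this cell never gives up, so a permanent failure would loop forever. A minimal sketch of a bounded alternative follows; `call_with_retry` is a hypothetical helper, not part of pylast or of the repository scripts:

```python
import time


def call_with_retry(func, *args, retry_on=(Exception,), max_attempts=5,
                    wait_seconds=5.0, sleep=time.sleep):
    """Call func(*args), retrying when one of `retry_on` is raised.

    Hypothetical helper: the last error is re-raised after max_attempts tries.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return func(*args)
        except retry_on as error:
            if attempt == max_attempts:
                raise
            print(f"attempt {attempt} failed: {error}")
            sleep(wait_seconds)
```

In this notebook it could be called as `call_with_retry(users_class.get_user_info, user.user_name, retry_on=(pylast.NetworkError, pylast.MalformedResponseError))`, which keeps the 5-second wait between attempts but stops after five tries.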

With the `tracks.csv` file, I will build the tracks dataset. It may take a long time!

In [None]:
track_class = Track(network)
MAX_TRACKS = 5000

In [None]:
# Handle the filename
if not os.path.exists(os.path.join(FOLDER_PATH, 'tracks_lastfm_info.json')):
    with open(os.path.join(FOLDER_PATH, 'tracks_lastfm_info.json'), 'w') as f:
        json.dump({}, f)
with open(os.path.join(FOLDER_PATH, 'tracks_lastfm_info.json'), 'r+') as f:
    data = json.load(f)
    

t0 = time.time()
for track_id, data_track in track_class.tracks_df.iterrows():
    if track_id > MAX_TRACKS: 
        break
    if str(track_id) in data:
        continue
    with open(os.path.join(FOLDER_PATH, 'tracks_lastfm_info.json'), 'r+') as f:
        data = json.load(f)
        while True:
            try: 
                track_info = track_class.get_track_info(data_track.track_name, data_track.artist_name)
            except pylast.NetworkError as e:
                print(e)
                time.sleep(5)
                continue
            except pylast.MalformedResponseError as e:
                print(e)
                time.sleep(5)
                continue
            break
            
        # We save the information in a json format (as a dictionary)    
        data[track_id] = track_info
        f.seek(0)
        json.dump(data, f)
        if len(data) % 10 == 0:
            clear_output()
            print('{} tracks - DONE: {} seconds'.format(len(data), (time.time() - t0)))
            t0 = time.time()
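These loops rewrite the whole JSON file after every item. If the kernel is interrupted in the middle of a write, the file may be left incomplete and fail to load on the next run. One possible safeguard, sketched here with a hypothetical helper `save_json_atomic`, is to write to a temporary file in the same folder and then rename it over the target:

```python
import json
import os
import tempfile


def save_json_atomic(data, path):
    """Write data as JSON to path via a temporary file in the same directory."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp_file:
            json.dump(data, tmp_file)
        # Replace the target in one step; the old file stays intact until then.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temporary file if anything went wrong.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

With such a helper, the body of the loop would update `data` in memory and call `save_json_atomic(data, os.path.join(FOLDER_PATH, 'tracks_lastfm_info.json'))` instead of seeking and dumping into the open file.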

Building the artist database. The principle is the same as for the previous one.

In [None]:
artist_class = Artist(network)
MAX_ARTISTS = 2000

In [None]:
if not os.path.exists(os.path.join(FOLDER_PATH, 'artists_lastfm_info.json')):
    with open(os.path.join(FOLDER_PATH, 'artists_lastfm_info.json'), 'w') as f:
        json.dump({}, f)
with open(os.path.join(FOLDER_PATH, 'artists_lastfm_info.json'), 'r+') as f:
    data = json.load(f)
    
t0 = time.time()
for artist_id, data_artist in artist_class.artists_df.iterrows():
    if artist_id > MAX_ARTISTS:
        break
    if str(artist_id) in data:
        continue
    with open(os.path.join(FOLDER_PATH, 'artists_lastfm_info.json'), 'r+') as f:
        data = json.load(f)
        while True:
            try: 
                artist_info = artist_class.get_artist_info(data_artist.artist_name)
            except pylast.NetworkError as e:
                print(e)
                time.sleep(5)
                continue
            except pylast.MalformedResponseError as e:
                print(e)
                time.sleep(5)
                continue
            break
        data[artist_id] = artist_info
        f.seek(0)
        json.dump(data, f)
        if len(data) % 10 == 0:
            clear_output()
            print('{} artists - DONE: {} seconds'.format(len(data), (time.time() - t0)))
            t0 = time.time()

Building the database for the tags. The principle is the same.

In [None]:
tag_class = Tag(network)
MAX_TAGS = 1000

In [None]:
if not os.path.exists(os.path.join(FOLDER_PATH, 'tags_lastfm_info.json')):
    with open(os.path.join(FOLDER_PATH, 'tags_lastfm_info.json'), 'w') as f:
        json.dump({}, f)
with open(os.path.join(FOLDER_PATH, 'tags_lastfm_info.json'), 'r+') as f:
    data = json.load(f)
    
t0 = time.time()
for tag_id, data_tag in tag_class.tags_df.iterrows():
    if tag_id > MAX_TAGS: 
        break
    if str(tag_id) in data:
        continue
    with open(os.path.join(FOLDER_PATH, 'tags_lastfm_info.json'), 'r+') as f:
        data = json.load(f)
        while True:
            try: 
                tag_info = tag_class.get_tag_info(data_tag.tag)
            except pylast.NetworkError as e:
                print(e)
                time.sleep(5)
                continue
            except pylast.MalformedResponseError as e:
                print(e)
                time.sleep(5)
                continue
            break
        data[tag_id] = tag_info
        f.seek(0)
        json.dump(data, f)
        if len(data) % 10 == 0:
            clear_output()
            print('{} tags - DONE: {} seconds'.format(len(data), (time.time() - t0)))
            t0 = time.time()

Converting the similar tracks stored in each track's info to indices. I separated this step from the original code because it was slow.

In [None]:
track_class = Track(network)

with open(os.path.join(FOLDER_PATH, 'tracks_lastfm_info.json'), 'r+') as f:
    data = json.load(f)
for index_ex, key in enumerate(data.keys()):
    if len(data[key]) == 0: 
        continue
    for index, info in enumerate(data[key]['similar']): 
        data[key]['similar'][index] = [track_class.get_id_by_name(info[0], info[1]), info[2]]
    if index_ex % 100 == 0: 
        clear_output()
        print("{} - DONE".format(index_ex))

with open(os.path.join(FOLDER_PATH, 'tracks_lastfm_info.json'), 'w') as f: 
    json.dump(data, f)
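The conversion above replaces each `[track_name, artist_name, value]` entry with `[track_id, value]`. The behaviour of `get_id_by_name` is defined in `create_last_database.py` and is not shown here; the sketch below only illustrates the idea with a hypothetical dictionary lookup, assuming that unseen name pairs receive the next free id:

```python
def similar_to_ids(similar, index_by_name):
    """Replace [track_name, artist_name, score] entries by [track_id, score].

    `index_by_name` is a hypothetical dict: (track_name, artist_name) -> id.
    Unknown pairs receive the next free id (an assumption of this sketch).
    """
    converted = []
    for track_name, artist_name, score in similar:
        key = (track_name, artist_name)
        if key not in index_by_name:
            index_by_name[key] = len(index_by_name)
        converted.append([index_by_name[key], score])
    return converted


# Made-up data for illustration only.
index = {("Song A", "Artist X"): 0}
similar = [["Song A", "Artist X", 0.9], ["Song B", "Artist Y", 0.5]]
converted = similar_to_ids(similar, index)
print(converted)
```

Storing ids instead of names keeps the JSON smaller and lets the similar items be joined against the CSV files by index.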

In [None]:
with open(os.path.join(FOLDER_PATH, 'artists_lastfm_info.json'), 'r+') as f:
    data = json.load(f)
for key in data.keys():
    if len(data[key]) == 0: 
        continue
    for index, info in enumerate(data[key]['similar']): 
        data[key]['similar'][index] = [artist_class.get_id_by_name(info[0]), info[1]]

with open(os.path.join(FOLDER_PATH, 'artists_lastfm_info.json'), 'w') as f: 
    json.dump(data, f)

Writing the new artists and tracks (**remember to run this cell**)

In [None]:
artist_class.write_to_csv()
track_class.write_to_csv()

## Getting a Library for 50 users

For each user, we get all the artists returned from their library. Some errors are expected when importing these data.

In [None]:
artist_library = Library(network)
limit = 50

In [None]:
if not os.path.exists(os.path.join(FOLDER_PATH, 'users50_library.json')):
    with open(os.path.join(FOLDER_PATH, 'users50_library.json'), 'w') as f:
        json.dump({}, f)
with open(os.path.join(FOLDER_PATH, 'users50_library.json'), 'r') as f:
    users50_library = json.load(f)

In [None]:
for i, user in users.iterrows():
    if i >= limit: break
    if str(user['user_id']) in users50_library:
        continue
    print(user['user_id'])
    # If printing = True, you get information about each page. 
    user_library = artist_library.get_library(user['user_name'], printing=False)
    users50_library[user['user_id']] = user_library
    clear_output()
    print('{} - DONE'.format(user['user_id']))

In [None]:
with open(os.path.join(FOLDER_PATH, 'users50_library.json'), 'w') as f:
    json.dump(users50_library, f)
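The loops in this notebook test `str(...)` of an id against the loaded dictionary before making a request. This works because JSON object keys are always strings: an integer key written by `json.dump` comes back as a string when the file is reloaded. A small round trip with made-up data shows the effect:

```python
import json

library = {42: ["Artist X", "Artist Y"]}        # integer key before saving
restored = json.loads(json.dumps(library))      # same as dumping to a file and reloading

print(list(restored))                           # keys after the round trip
print(42 in restored, "42" in restored)
```

Any later lookup in the reloaded files therefore needs string ids, or an explicit conversion of the keys back to integers.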