# Part 1: Dataset creation for new_ai_rtists Twitter Bot

### First, we use the [Last.fm API](https://www.last.fm/api/) to create our dataset

#### This was inspired by/heavily cribbed from [this great tutorial](https://www.dataquest.io/blog/last-fm-api-python/) on Dataquest.

This notebook assumes that you have either done that tutorial or at least understood most of the different elements that go into it. I'll try to elucidate every time I do something that's different from the DQ tutorial.

### Note:
This demonstration is recreating the methodology I used to create our *artist_names.txt* dataset, but since that methodology involves accessing ever-changing data from an API that tracks ever-evolving real-life events, these operations will always spit out slightly different datasets. That's part of the fun!

In [14]:
import requests
import json
import requests_cache
import time
import pandas as pd
from IPython.display import clear_output


requests_cache.install_cache()

For this notebook, we'll be storing our Last.fm API info as environment variables. There are many ways to do this - I'm using python-dotenv and a .env file.

In [10]:
import os
from dotenv import load_dotenv
load_dotenv()

USER_AGENT = os.getenv('USER_AGENT')
API_KEY = os.getenv('API_KEY')
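
Note that `os.getenv` quietly returns `None` when a variable is missing, so a typo in the .env file (which holds lines of the form `NAME=value`) only surfaces later as a failed API call. A minimal sketch of a stricter lookup follows; `require_env` is a hypothetical helper, not part of python-dotenv, and the demo value is made up so the example runs on its own.

```python
import os

def require_env(name):
    """Return the value of an environment variable, or raise if it is missing or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Missing environment variable: {name}")
    return value

# Hypothetical demo value so this runs standalone; in the notebook,
# load_dotenv() would already have populated it from the .env file.
os.environ.setdefault("USER_AGENT", "example-agent")
user_agent = require_env("USER_AGENT")
```

Failing early like this is a design choice: it trades a silent `None` for an immediate, readable error.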

In [11]:
def lastfm_get(payload):
    # define headers and URL
    headers = {'user-agent': USER_AGENT}
    url = 'http://ws.audioscrobbler.com/2.0/'

    # Add API key and format to the payload
    payload['api_key'] = API_KEY
    payload['format'] = 'json'

    response = requests.get(url, headers=headers, params=payload)
    return response

def jprint(obj):
    # create a formatted string of the Python JSON object
    text = json.dumps(obj, sort_keys=True, indent=4)
    print(text)


As in the tutorial, we'll start by calling the *chart.getTopArtists* endpoint once for every page of results (each response holds its data under an "artists" key) and end up with a massive dump of JSON.
#### NOTE: This dump is actually much smaller than it initially appears, due to an undocumented limitation of the API. 

Basically, after the first twenty pages, the API stops returning any artist data. So, instead of having a list of (500 x ~6,000) results, roughly 3 million, we have only about 10,000 (20 pages x 500 artists).
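
To make the limitation concrete, here is a small sketch that collects names from already-parsed pages and stops at the first page with no artists. The exact shape of a "no data" page is an assumption (a missing or empty `artist` list), and `collect_artist_names` and the sample pages are hypothetical stand-ins, not real API output.

```python
def collect_artist_names(pages):
    """Gather artist names from parsed pages, stopping at the first empty page."""
    names = []
    for page in pages:
        artists = page.get("artists", {}).get("artist", [])
        if not artists:
            # Assumed shape of a "no data" page: missing or empty artist list
            break
        names.extend(a["name"] for a in artists)
    return names

# Hypothetical stand-in data: two populated pages, then an empty one
fake_pages = [
    {"artists": {"artist": [{"name": "A"}, {"name": "B"}]}},
    {"artists": {"artist": [{"name": "C"}]}},
    {"artists": {"artist": []}},
    {"artists": {"artist": [{"name": "never reached"}]}},
]
names = collect_artist_names(fake_pages)
```

With the real API, the same idea would let the request loop stop after roughly twenty pages rather than walking all ~6,000.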

This is a huge part of why, in Part 2, we'll be using **GPT-2** as our network of choice. If we can figure out a way of accessing more Last.fm API results and getting a bigger dataset, it would be a very interesting followup to use that to train a 'from-scratch' text-generation network.

In [12]:
responses = []

page = 1
total_pages = 99999 # this is just a dummy number so the loop starts

while page <= total_pages:
    payload = {
        'method': 'chart.gettopartists',
        'limit': 500,
        'page': page
    }

    # print some output so we can see the status
    print("Requesting page {}/{}".format(page, total_pages))
    # clear the output to make things neater
    clear_output(wait = True)

    # make the API call
    response = lastfm_get(payload)

    # if we get an error, print the response and halt the loop
    if response.status_code != 200:
        print(response.text)
        break

    # extract pagination info
    page = int(response.json()['artists']['@attr']['page'])
    total_pages = int(response.json()['artists']['@attr']['totalPages'])

    # append response
    responses.append(response)

    # if it's not a cached result, sleep
    if not getattr(response, 'from_cache', False):
        time.sleep(0.25)

    # increment the page number
    page += 1
    

Requesting page 6205/6205


We'll then use Pandas to turn each response into a dataframe and concatenate them into one, which will make it MUCH easier to access the specific features we want. At this point, we can also drop duplicates and the features we know we aren't interested in.

In [21]:
frames = [pd.DataFrame(r.json()['artists']['artist']) for r in responses]
artists = pd.concat(frames)
artists = artists[['name']].drop_duplicates().reset_index(drop=True)

In [22]:
# Get the artist names as a Series and convert to a list; dict.fromkeys is an
# order-preserving safeguard against any duplicates that remain
artist_names = artists["name"]
names_list = list(dict.fromkeys(artist_names.tolist()))

# Converting to a newline-separated string
names_string = "\n".join(str(x) for x in names_list)

Billie Eilish
The Weeknd
Tame Impala
Kanye West
Dua Lipa
Post Malone
Lady Gaga
Ariana Grande
Lana Del Rey
Taylor Swift
The Beatles
Eminem
Kendrick Lamar
Radiohead
Halsey
Queen
Drake
Rihanna
Gorillaz
David Bowie
Ed Sheeran
Tyler, the Creator
Justin Bieber
Arctic Monkeys
Doja Cat
Coldplay
Harry Styles
Frank Ocean
Beyoncé
Selena Gomez
The Strokes
Green Day
Grimes
Red Hot Chili Peppers
Khalid
Calvin Harris
Travi$ Scott
Lorde
BROCKHAMPTON
Nirvana
Fleetwood Mac
Pink Floyd
Maroon 5
Katy Perry
Sam Smith
Childish Gambino
Mac Miller
Miley Cyrus
Future
Roddy Ricch
Daft Punk
Panic! at the Disco
The Cure
The Rolling Stones
The Killers
Imagine Dragons
Linkin Park
Sia
The 1975
Joji
Led Zeppelin
Camila Cabello
A$AP Rocky
Lizzo
MGMT
Britney Spears
Paramore
Muse
The Smiths
BTS
Twenty One Pilots
Shawn Mendes
Metallica
Adele
The Chainsmokers
Nicki Minaj
Sufjan Stevens
Madonna
Charli XCX
Elton John
Michael Jackson
Fall Out Boy
Tones And I
David Guetta
Florence + the Machine
Lil Nas X
Foo Fighters
Oasis
Cag
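
The `dict.fromkeys` trick used above relies on dictionaries keeping insertion order (guaranteed since Python 3.7), so the chart ranking survives de-duplication. A minimal sketch on made-up sample data; `dedupe_preserving_order` is a hypothetical helper name:

```python
def dedupe_preserving_order(items):
    """Drop repeated items while keeping the first occurrence of each, in order."""
    return list(dict.fromkeys(items))

# Made-up sample with repeats
sample = ["Queen", "Muse", "Queen", "Oasis", "Muse"]
unique = dedupe_preserving_order(sample)

# One name per line, the same format used for the GPT-2 dataset
joined = "\n".join(unique)
```

Using `set` would also remove duplicates but would not preserve the original ordering.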

At this point, we're ready to write our dataset to a .txt file for use with **GPT-2**. That was easy!


In [24]:
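# The with-block closes the file automatically; encoding is set explicitly
# so names with non-ASCII characters are written the same way on every platform
with open("./artist_names.txt", "w", encoding="utf-8") as file:
    file.write(names_string)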
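
As a quick sanity check, it can help to read a file like this back and confirm the names survive the round trip, especially ones with non-ASCII or unusual characters. A small sketch using a temporary directory and a made-up file name, so it does not touch the real *artist_names.txt*:

```python
import os
import tempfile

# A few names from the output above with characters worth testing
sample_names = ["Beyoncé", "Travi$ Scott", "Florence + the Machine"]

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical demo path inside the temporary directory
    path = os.path.join(tmp_dir, "artist_names_demo.txt")

    # Write one name per line, with an explicit encoding
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(sample_names))

    # Read the file back and split it into names again
    with open(path, encoding="utf-8") as f:
        read_back = f.read().splitlines()
```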

### NOTE: 

This is truly version 0 of this project, a proof of concept to see if the GPT-2 results were inspiring (spoiler: they totally are!). Future versions are going to take advantage of some other Last.fm API endpoints, like *artist.getTopTags*, to create 'genres' for our fictional artists. 

From there, we can implement some cross-functionality with the [Genius API](https://docs.genius.com/) to create large datasets of lyrics from our AI-generated genres, and I am SURE GPT-2 would be happy to help us come up with some lyrics for our fictional artists...
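
As a rough sketch of the *artist.getTopTags* idea mentioned above: the payload below could be passed to `lastfm_get`, and the top few tag names pulled out of the parsed JSON. The response shape (`toptags` -> `tag` -> `name`) is an assumption based on the Dataquest tutorial, and the helper names and sample response are hypothetical.

```python
def top_tags_payload(artist_name):
    """Build the request payload for one artist's top tags."""
    return {"method": "artist.gettoptags", "artist": artist_name}

def extract_top_tags(parsed, n=3):
    """Return up to n tag names from a parsed response (assumed shape)."""
    tags = parsed.get("toptags", {}).get("tag", [])
    return [t["name"] for t in tags[:n]]

# Hypothetical parsed response, for illustration only
fake_response = {
    "toptags": {
        "tag": [
            {"name": "rock"},
            {"name": "indie"},
            {"name": "britpop"},
            {"name": "alternative"},
        ]
    }
}
genres = extract_top_tags(fake_response)
```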