# Tour Locations of Rock Artists vs. Hip-hop Artists
Jake Gluck, Nhien Theresa Phan

## Introduction
Do you have a favorite music artist or genre? Have they performed in your city? Rock and hip-hop are two very popular genres, and top music charts reflect this. However, these charts don't show where artists of a particular genre usually tour, nor do they show how much a genre is listened to in a particular city. This tutorial looks into the geographic distribution of tour locations of rock artists versus hip-hop artists. First, we demonstrate how to scrape the top artists of specific genres from last.fm. We then use data from setlist.fm to find these artists' tour locations. We make hypotheses about the data, and by plotting the tour locations with `folium`, we can analyze the geographic distribution of the two genres, both worldwide and city by city, to see whether certain areas are dominated by one genre.

## Python dependencies
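You will need Python 3 and the following libraries. `itertools`, `json`, and `time` are part of the Python standard library; the rest are third-party packages (`bs4` is distributed on PyPI as `beautifulsoup4`).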

- `bs4`
- `folium`
- `itertools`
- `json`
- `numpy`
- `pandas`
- `requests`
- `time`

`folium` can be installed using `pip`:

In [None]:
!pip install folium

In [2]:
import bs4
import folium
import itertools
import json
import numpy as np
import pandas as pd
import requests
import time

## Scraping and cleaning data from the last.fm website
First, we must retrieve the music artists whose tour dates we want to explore. last.fm is a music website where users can share their listening data and tag artists. By scraping their tag pages, we can get a list of top artists in whatever genres we are interested in.

For this tutorial, we will be comparing rock and hip-hop. Scrape the first three pages of artist results for each genre. As each artist name is scraped, replace `/` with a space and remove `!` so that we can later look the artist up through the setlist.fm API. Each page lists 22 artists, so we will have 66 rock artists and 66 hip-hop artists. We want to map 50 artists of each genre, but some of these artists may never have toured or may not have their tour locations recorded on setlist.fm. Retrieving 66 per genre leaves room for the missing data we might encounter when searching for these artists in the setlist.fm API.

In [None]:
# pages to be scraped
hiphop_page1 = 'https://www.last.fm/tag/hip-hop/artists'
hiphop_page2 = 'https://www.last.fm/tag/hip-hop/artists?page=2'
hiphop_page3 = 'https://www.last.fm/tag/hip-hop/artists?page=3'
rock_page1 = 'https://www.last.fm/tag/rock/artists'
rock_page2 = 'https://www.last.fm/tag/rock/artists?page=2'
rock_page3 = 'https://www.last.fm/tag/rock/artists?page=3'

def scrape_page(link):
    # scrape and parse page
    soup = bs4.BeautifulSoup(requests.get(link).text, 'html.parser')
    # get artist names
    elements = soup.findAll('h3', {'class':'big-artist-list-title'})
    artists = []
    for e in elements:
        # remove special characters and add data to list
        artists.append(e.text.replace("/", " ").replace("!", ""))
    return artists

# concatenate list from each page
rock_artists = scrape_page(rock_page1) + scrape_page(rock_page2) + scrape_page(rock_page3)
hiphop_artists = scrape_page(hiphop_page1) + scrape_page(hiphop_page2) + scrape_page(hiphop_page3)

print(rock_artists)
print()
print(hiphop_artists)
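
The six page URLs above follow one pattern, so they can also be generated instead of typed out. The sketch below is one possible way to do that. `tag_page_urls` and `clean_name` are hypothetical helper names that do not appear in the original notebook, and the URL pattern is assumed to be exactly the one visible in the cell above.

```python
def tag_page_urls(tag, pages=3):
    """Build last.fm artist-list URLs for a tag, following the pattern used above."""
    base = 'https://www.last.fm/tag/' + tag + '/artists'
    # page 1 has no query string; later pages use ?page=N
    return [base if n == 1 else base + '?page=' + str(n) for n in range(1, pages + 1)]


def clean_name(name):
    """Apply the same cleanup as scrape_page: '/' becomes a space, '!' is dropped."""
    return name.replace('/', ' ').replace('!', '')


rock_pages = tag_page_urls('rock')
hiphop_pages = tag_page_urls('hip-hop')
print(rock_pages)
print(clean_name('AC/DC'))
```

With these helpers, the artist lists could be built by calling `scrape_page` on each URL in `rock_pages` and `hiphop_pages` and concatenating the results.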

## Define API request functions
In the next section, we will be searching artists in setlist.fm. setlist.fm is a music website that records setlists of artists' performances. These setlists include the location of that particular performance. Each artist on setlist.fm is identified by a unique ID from MusicBrainz, an open-source music encyclopedia. Define a function that sends the MusicBrainz API the string artist names we obtained in the last section and returns a MusicBrainz ID.

In [None]:
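def send_req_name(name):
    url = 'http://musicbrainz.org/ws/2/artist/'
    # let requests URL-encode the artist name (spaces, '&', accented characters)
    params = {'query': 'artist:' + name}
    headers = {'Accept':'application/json'}
    r = None
    
    while r is None:
        # try to search MusicBrainz for the artist
        try:
            r = requests.get(url, params=params, headers=headers)
        # if the connection fails (e.g. max retries exceeded), sleep and try again
        except requests.exceptions.RequestException:
            time.sleep(5)
            
    page = r.json()
    return page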

Define a second function to query the setlist.fm API for an artist's sets. The function takes the number of the results page to request (as a string) and the artist's MusicBrainz ID.

In [None]:
def send_req_set(num, id):
    url = 'https://api.setlist.fm/rest/1.0/artist/' + id + '/setlists/?p=' + num
    headers = {'Accept': 'application/json', 'x-api-key': 'd65d2f04-4d2a-4354-b6dd-45bedf61cde1'}
    r = None
    
    while r is None:
        # try to get the requested page of setlists
        try:
            r = requests.get(url, headers=headers)
        # if the connection fails (e.g. max retries exceeded), sleep and try again
        except requests.exceptions.RequestException:
            time.sleep(5)
            
    page = r.json()
    return page
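
Both request functions retry forever if the connection keeps failing. The sketch below shows one way to bound the number of attempts. `get_with_retries` is a hypothetical helper, not part of the original notebook; it takes any zero-argument function so that it can be tried without a network connection, and the default of five attempts is an arbitrary assumption.

```python
import time


def get_with_retries(fetch, max_tries=5, wait_seconds=5, sleep=time.sleep):
    """Call fetch() until it succeeds or max_tries attempts have failed."""
    last_error = None
    for attempt in range(1, max_tries + 1):
        try:
            return fetch()
        except Exception as error:  # in the notebook: requests.exceptions.RequestException
            last_error = error
            if attempt < max_tries:
                # wait a little longer after each failed attempt
                sleep(wait_seconds * attempt)
    raise RuntimeError('request failed after {} tries'.format(max_tries)) from last_error
```

Inside `send_req_set`, the `while` loop could then be replaced by something like `r = get_with_retries(lambda: requests.get(url, headers=headers))`.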

## Get MusicBrainz IDs for last.fm artists
Loop through the artist names that we gathered, and use the first function defined above to retrieve each artist's MusicBrainz ID. This may take a while, as the MusicBrainz API is rate limited and throttles clients that make too many requests per second.

In [None]:
# get ids for artists names
def get_ids(artists):
    ids = []
    for artist in artists:
        print(artist)
        page = send_req_name(artist)
        # use the first search result; avoid naming the variable `list`, which shadows the built-in
        matches = page['artists']
        id = matches[0]['id']
        ids.append(id)
        
        # sleep because website limits API usage
        time.sleep(1) 
    return ids

# call function to get ids for the bands we have collected for rock
rock_ids = get_ids(rock_artists)
print(rock_ids)

In [None]:
# call function to get IDs for bands we collected for hip-hop
hiphop_ids = get_ids(hiphop_artists)
print(hiphop_ids)
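
`get_ids` assumes that every search returns at least one artist, so an empty result raises an `IndexError` and stops the loop. A minimal sketch of a more forgiving lookup follows. `first_artist_id` is a hypothetical helper, and the sample dictionaries are made up for illustration; they only mirror the `'artists'` and `'id'` keys that `get_ids` already reads.

```python
def first_artist_id(page):
    """Return the ID of the first search result, or None if there is none."""
    matches = page.get('artists', [])
    if not matches:
        return None
    return matches[0]['id']


# made-up responses in the shape get_ids expects, for illustration only
sample_page = {'artists': [{'id': 'example-id-1', 'name': 'Example Band'}]}
empty_page = {'artists': []}

print(first_artist_id(sample_page))
print(first_artist_id(empty_page))
```

In `get_ids`, artists for which this returns `None` could simply be skipped, which is one reason for scraping more artists than we plan to map.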

## Getting tour location data
Now that we have the MusicBrainz ID for each artist, we can get all the pages of set information available from the setlist.fm API to create our final datasets. Create one dataframe for each genre, and only include artists with more than 50 sets on setlist.fm. This may take a while.

In [None]:
#Get tour data from list of IDs, using setlist.fm API
def get_tours(ids):
    # create empty dataframe
    d = {'venues': []}
    df = pd.DataFrame(d)
    df['cities'] = []
    df['artists'] = []

    # loop through each artist id
    for id in ids:
        print(id)
        i = 1
        keep_going = True
        count = 0
        cities = []
        venues = []
        artists2 = []
        artist_name = ""

        # keep requesting the next page until no page is returned
        while(keep_going):
            #pull artists setlist page
            page = send_req_set(str(i), id)
            
            # if page returns error message, we ran out of pages, so we stop
            if('code' in page):
                keep_going = False
            else:
                if 'setlist' in page:
                    setlist = page["setlist"]
                    
                    # for every set on the page, pull its information and add it to our lists
                    for sets in setlist:
                        #add page data to sets
                        artist2 = sets["artist"]
                        artist_name = artist2["name"]
                        venue = sets["venue"]
                        venue_name = venue["name"]
                        city = venue["city"]
                        name = city["name"]
                        count = count + 1
                        venues.append(venue_name)
                        cities.append(name)
                i = i + 1

        # add artist to dataframe
        artists2 = np.repeat(artist_name, len(cities)).reshape(len(cities),1)
        d = {'venues': venues}
        dft = pd.DataFrame(d)
        if (len(cities) > 50):
            dft['cities'] = cities
            dft['artists'] = artists2
            df = pd.concat([df,dft])          
    return df
rock_tours = get_tours(rock_ids)

In [None]:
hiphop_tours = get_tours(hiphop_ids)
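
The innermost loop of `get_tours` reads three fields from every set on a page. The sketch below isolates that step so it can be checked on its own. `rows_from_page` is a hypothetical helper, and the sample page is invented; it only reproduces the keys that `get_tours` accesses (`setlist`, `artist`, `venue`, `city`, `name`).

```python
def rows_from_page(page):
    """Turn one setlist.fm page (a dict) into (venue, city, artist) tuples."""
    rows = []
    for one_set in page.get('setlist', []):
        venue = one_set['venue']
        rows.append((venue['name'], venue['city']['name'], one_set['artist']['name']))
    return rows


# invented page with the same nesting that get_tours relies on
sample_page = {
    'setlist': [
        {
            'artist': {'name': 'Example Artist'},
            'venue': {'name': 'Example Hall', 'city': {'name': 'Example City'}},
        }
    ]
}

print(rows_from_page(sample_page))
```

A list of such tuples can be passed to `pd.DataFrame(rows, columns=['venues', 'cities', 'artists'])` to build each artist's dataframe in one step.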

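Save these data to `.csv` files, one for each genre, so that the scraping step does not have to be repeated. Then load the files back into dataframes.

In [24]:
# uncomment these two lines the first time you run the notebook to save the scraped data
#rock_tours.to_csv("rock_tours.csv")
#hiphop_tours.to_csv("hip_hop_tours.csv")
rock_data = pd.read_csv('rock_tours.csv')
hiphop_data = pd.read_csv('hip_hop_tours.csv')
# drop the first column, which is the index that to_csv wrote to the file
rock_data.drop(rock_data.columns[[0]], axis=1, inplace=True)
hiphop_data.drop(hiphop_data.columns[[0]], axis=1, inplace=True)
rock_data.head()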

| | venues | cities | artists |
|---|---|---|---|
| 0 | Bill Graham Civic Auditorium | San Francisco | Red Hot Chili Peppers |
| 1 | The Theater at Madison Square Garden | New York | Red Hot Chili Peppers |
| 2 | Gila River Arena | Glendale | Red Hot Chili Peppers |
| 3 | Pepsi Center | Denver | Red Hot Chili Peppers |
| 4 | Zilker Park | Austin | Red Hot Chili Peppers |


In [25]:
hiphop_data.head()

| | venues | cities | artists |
|---|---|---|---|
| 0 | Saturday Night Live | New York | Eminem |
| 1 | The SSE Arena, Wembley | London | Eminem |
| 2 | Bramham Park | Leeds | Eminem |
| 3 | Little John's Farm | Reading | Eminem |
| 4 | Bellahouston Park | Glasgow | Eminem |


## Determine top cities
Count the number of tour dates that have occurred in each city that appears in the data. We will use this information later to calculate percentages of rock vs. hip-hop concerts.

In [26]:
rock_count = {}
hiphop_count = {}

# for each rock tour stop
for index, row in rock_data.iterrows():
    # increment count of that city
    if row['cities'] in rock_count:
        rock_count[row['cities']] += 1
    else:
        rock_count[row['cities']] = 1

# one row per city with rock concerts; the column named 0 holds the rock count
top_cities = pd.DataFrame.from_dict(rock_count, orient='index')
# copy the index (the city name) into its own column
top_cities['city'] = top_cities.index

# for each hiphop tour stop
for index, row in hiphop_data.iterrows():
    # increment count of that city
    if row['cities'] in hiphop_count:
        hiphop_count[row['cities']] += 1
    else:
        hiphop_count[row['cities']] = 1

# total rock and hiphop values
# cities without any hip-hop concerts keep NaN in 'hip-hop' and 'total'
for index, value in rock_count.items():
    top_cities.loc[top_cities['city'] == index, 'rock'] = int(value)
for index, value in hiphop_count.items():
    top_cities.loc[top_cities['city'] == index, 'hip-hop'] = int(value)
    top_cities.loc[top_cities['city'] == index, 'total'] = int(value) + top_cities.loc[top_cities['city'] == index, 0]
top_cities.sort_values(by='total', ascending=False).head()

| | 0 | city | rock | hip-hop | total |
|---|---|---|---|---|---|
| London | 2952 | London | 2952.0 | 428.0 | 3380.0 |
| New York | 1915 | New York | 1915.0 | 560.0 | 2475.0 |
| Los Angeles | 933 | Los Angeles | 933.0 | 247.0 | 1180.0 |
| Chicago | 778 | Chicago | 778.0 | 280.0 | 1058.0 |
| Paris | 810 | Paris | 810.0 | 151.0 | 961.0 |
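
The same per-city counts, together with the share of hip-hop concerts, can be computed with `collections.Counter` from the standard library. This is a sketch under two assumptions that differ from the cell above: a genre with no concerts in a city counts as 0 rather than as a missing value, and cities that appear only in the hip-hop data are included. `genre_share` is a hypothetical helper name, and the city lists are toy input.

```python
from collections import Counter


def genre_share(rock_cities, hiphop_cities):
    """Return {city: (rock, hiphop, total, hiphop_share)} for every city seen."""
    rock = Counter(rock_cities)
    hiphop = Counter(hiphop_cities)
    result = {}
    for city in set(rock) | set(hiphop):
        # a Counter returns 0 for cities it has never seen
        total = rock[city] + hiphop[city]
        result[city] = (rock[city], hiphop[city], total, hiphop[city] / total)
    return result


# toy input; in the notebook these would be rock_data['cities'] and hiphop_data['cities']
shares = genre_share(['London', 'London', 'Paris'], ['London', 'Leeds'])
for city in sorted(shares):
    print(city, shares[city])
```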


## Getting latitude and longitude using Google Maps Geocoding API
Before we can plot the visited cities on a map, we need to get the latitude and longitude from each city name using the [Google Maps Geocoding API](https://developers.google.com/maps/documentation/geocoding/start). You will need to log into your Google account and [get an API key](https://developers.google.com/maps/documentation/geocoding/start#auth). Save this API key in a UTF-8 encoded text file. We can now use this API key to access the Google Maps Geocoding API.

In [6]:
# read the API key from the text file instead of hard-coding it in the notebook
with open('google_maps_api_key.txt', encoding='utf-8') as file:
    # remove a possible byte order mark and surrounding whitespace
    google_maps_api_key = file.read().replace('\ufeff','').strip()

Search each city name in the data to get the latitude and longitude of each city. Add this information to the dataframe.

In [None]:
url = 'https://maps.googleapis.com/maps/api/geocode/json?key=' + google_maps_api_key + '&address='
city_locs = {}

for index, row in rock_data.iterrows():
    # get city using Google Maps Geocoding API
    city = None
    while city is None:
        # try to geocode the city name
        try:
            city = requests.get(url + row['cities'])
        # if the connection fails (e.g. max retries exceeded), sleep and try again
        except requests.exceptions.RequestException:
            time.sleep(5)
            
    # get latitude and longitude
    location = json.loads(city.text)['results'][0]['geometry']['location']
    # set latitude and longitude in rock dataframe
    # (DataFrame.set_value was removed in pandas 1.0, so use .loc)
    rock_data.loc[index, 'lat'] = location['lat']
    rock_data.loc[index, 'lng'] = location['lng']
    # add data to city locations
    city_locs[row['cities']] = location
rock_data.head()

In [None]:
hiphop_data = pd.read_csv('hip_hop_tours.csv')
for index, row in hiphop_data.iterrows():
    # get city using Google Maps Geocoding API
    city = None
    while city is None:
        # try to geocode the city name
        try:
            city = requests.get(url + row['cities'])
        # if the connection fails (e.g. max retries exceeded), sleep and try again
        except requests.exceptions.RequestException:
            time.sleep(5)
            
    # get latitude and longitude
    location = json.loads(city.text)['results'][0]['geometry']['location']
    # set latitude and longitude in hip-hop dataframe
    # (DataFrame.set_value was removed in pandas 1.0, so use .loc)
    hiphop_data.loc[index, 'lat'] = location['lat']
    hiphop_data.loc[index, 'lng'] = location['lng']
    # add data to city locations
    city_locs[row['cities']] = location
hiphop_data.head()
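
The loops above send one geocoding request per tour date, even though the same city appears many times, and they fail with an `IndexError` when a city cannot be found. The sketch below geocodes each distinct city once and returns `None` for cities without results. `extract_location` and `geocode_unique` are hypothetical helpers; the `geocode` argument stands for any function that returns the parsed JSON response for a city name, so the example runs with a stand-in instead of the real API.

```python
def extract_location(payload):
    """Return {'lat': ..., 'lng': ...} from a geocoding response, or None."""
    results = payload.get('results', [])
    if not results:
        return None
    return results[0]['geometry']['location']


def geocode_unique(city_names, geocode):
    """Geocode each distinct city once; geocode(name) must return a response dict."""
    cache = {}
    for name in city_names:
        if name not in cache:
            cache[name] = extract_location(geocode(name))
    return cache


# stand-in for the real request, with made-up coordinates
def fake_geocode(name):
    if name == 'Nowhere':
        return {'results': []}
    return {'results': [{'geometry': {'location': {'lat': 1.0, 'lng': 2.0}}}]}


locations = geocode_unique(['London', 'London', 'Nowhere'], fake_geocode)
print(locations)
```

In the notebook, `geocode` could be a small function that wraps `requests.get(url + name).json()`, and the resulting dictionary plays the role of `city_locs`.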

We can also add the coordinate data to our `top_cities` dataframe.

In [None]:
for key, value in city_locs.items():
    top_cities.loc[top_cities['city'] == key, 'lat'] = value['lat']
    top_cities.loc[top_cities['city'] == key, 'lng'] = value['lng']
top_cities.sort_values(by='total', ascending=False).head()

## Mapping artists’ tour locations with `folium`
Now that we have the latitude and longitude coordinates of our artists' tour dates, we can plot the tour locations on a map using [`folium`](http://python-visualization.github.io/folium/docs-master/), a Python library built on the [`leaflet.js`](http://leafletjs.com/) mapping library. We demonstrate how to install `folium` with `pip` in the "Python dependencies" section of this tutorial; detailed installation instructions are available in the [`folium` documentation](http://python-visualization.github.io/folium/docs-master/installing.html#installation).

In [None]:
# map centered on United States
map1 = folium.Map(location=[39.5, -98.35], zoom_start=4)

for index, row in rock_data.iterrows():
    # show the venue name when a marker is clicked
    folium.Marker([row['lat'], row['lng']], popup=folium.Popup(str(row['venues']), parse_html=True), icon=folium.Icon(color='red',icon='info-sign')).add_to(map1)
map1

## Mapping individual cities
The second map we want to create plots one marker per city that appears in the data. Each marker can be clicked to reveal the percentage of rock concerts vs. hip-hop concerts that have occurred in that city.

In [None]:
# map centered on United States
map2 = folium.Map(location=[39.5, -98.35], zoom_start=4)

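# only cities with coordinates and counts for both genres can be plotted
plottable = top_cities.dropna(subset=['lat', 'lng', 'rock', 'hip-hop', 'total'])

for index, row in plottable.iterrows():
    # percentage of this city's concerts that belong to each genre
    rock_pct = 100 * row['rock'] / row['total']
    hiphop_pct = 100 * row['hip-hop'] / row['total']
    label = '{}: {:.0f}% rock, {:.0f}% hip-hop'.format(row['city'], rock_pct, hiphop_pct)
    folium.Marker([row['lat'], row['lng']], popup=folium.Popup(label, parse_html=True), icon=folium.Icon(color='red',icon='info-sign')).add_to(map2)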
map2

## Analysis

### Hip-hop vs. rock cities

We want to find which cities have more rock concerts and which cities have more hip-hop concerts.



In [47]:
top_cities = pd.read_csv('top_cities.csv')
top_cities.dropna(inplace=True)
# sort_values returns a new dataframe, so keep the sorted result
top_cities = top_cities.sort_values('total', ascending=False)

#get top 200 cities
top_200_cities = top_cities.head(200)
top_200_cities

| | Unnamed: 0 | 0 | city | hip-hop | total | lat | lng |
|---|---|---|---|---|---|---|---|
| 0 | London | 2952 | London | 428.0 | 3380.0 | 51.507351 | -0.127758 |
| 1 | New York | 1915 | New York | 560.0 | 2475.0 | 40.712775 | -74.005973 |
| 2 | Los Angeles | 933 | Los Angeles | 247.0 | 1180.0 | 34.052234 | -118.243685 |
| 3 | Chicago | 778 | Chicago | 280.0 | 1058.0 | 41.878114 | -87.629798 |
| 4 | Paris | 810 | Paris | 151.0 | 961.0 | 48.856614 | 2.352222 |
| 5 | Sydney | 687 | Sydney | 175.0 | 862.0 | -33.868820 | 151.209295 |
| 6 | Toronto | 716 | Toronto | 137.0 | 853.0 | 43.653226 | -79.383184 |
| 7 | Manchester | 609 | Manchester | 175.0 | 784.0 | 53.480759 | -2.242631 |
| 8 | Melbourne | 613 | Melbourne | 159.0 | 772.0 | -37.813628 | 144.963058 |
| 9 | Philadelphia | 586 | Philadelphia | 176.0 | 762.0 | 39.952584 | -75.165222 |


In [64]:
# map centered on United States
map2 = folium.Map(location=[39.5, -98.35], zoom_start=4)

# red-to-green gradient from the colour package
# (not used yet: the markers below use two fixed colors)
!pip install colour
from colour import Color
red = Color("red")
colors = list(red.range_to(Color("green"),100))

for index, row in top_200_cities.iterrows():
    # percentage of this city's concerts that are hip-hop
    w = (row['hip-hop']/row['total'])
    w2 = int(w * 100)
    # red marker: mostly rock; green marker: at least half hip-hop
    color = 'red' if w2 < 50 else 'green'
    folium.Marker([row['lat'], row['lng']], popup=row['city'], icon=folium.Icon(color=color,icon='info-sign')).add_to(map2)
map2
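
The rule that decides each marker's color can be written as a small function and checked against the counts shown in the table above. `marker_color` is a hypothetical helper, and returning `'red'` for a city with no concerts is an assumption made only to avoid dividing by zero.

```python
def marker_color(hiphop, total):
    """Return 'green' when at least half of a city's concerts are hip-hop, else 'red'."""
    if total <= 0:
        # assumption: a city with no recorded concerts is not treated as hip-hop-majority
        return 'red'
    return 'green' if hiphop / total >= 0.5 else 'red'


# London and New York counts taken from the table above
print(marker_color(428.0, 3380.0))
print(marker_color(560.0, 2475.0))
```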

