This notebook combines data from three sources:
1. Setlist.fm
2. Spotify artist and album data
3. [Apify Spotify scraper data](https://apify.com/augeas/spotify-monthly-listeners)

What do we want in our final analysis dataset?
1. Artist name
2. First date of the tour
3. Number of tour dates
4. Percentage of songs played on the tour that were played every or nearly every night (at least 90% of shows), which we call base setlist songs
5. Mean percentage of the songs played that are base setlist songs
6. Mean set length on the tour
7. Spotify follower count
8. Spotify popularity
9. Spotify genres
10. Album release date closest to the start of the tour
11. Some category of followers/popularity
12. Boolean for Salt Shed artist (just make sure to exclude Mk.gee)
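
The base setlist metrics in items 4 to 6 are computed upstream of this notebook, so the following is only a minimal sketch of one way they could be derived. It assumes a table with one row per show and song; the column names `show_id` and `song`, the helper `base_setlist_metrics`, and the reading of item 5 as a per-show average are all assumptions for illustration.

```python
import pandas as pd

# Toy data: one row per (show, song) pair. The column names `show_id` and
# `song` are assumptions for illustration, not the real Setlist.fm schema.
plays = pd.DataFrame({
    "show_id": [1, 1, 1, 2, 2, 2, 3, 3, 4, 4],
    "song":    ["A", "B", "C", "A", "B", "C", "A", "B", "A", "B"],
})

def base_setlist_metrics(plays, threshold=0.9):
    """Hypothetical helper: summarize one tour's setlists."""
    # count each song at most once per show
    plays = plays.drop_duplicates(["show_id", "song"])
    n_shows = plays["show_id"].nunique()
    # share of shows in which each song was played
    share_of_shows = plays.groupby("song")["show_id"].nunique() / n_shows
    base_songs = share_of_shows[share_of_shows >= threshold].index
    # share of each show's set that consists of base setlist songs
    is_base = plays["song"].isin(base_songs)
    base_share_per_show = is_base.groupby(plays["show_id"]).mean()
    return {
        "n_tour_dates": n_shows,
        "pct_base_songs": len(base_songs) / len(share_of_shows),
        "mean_base_share": base_share_per_show.mean(),
        "mean_set_length": plays.groupby("show_id").size().mean(),
    }

metrics = base_setlist_metrics(plays)
```

In this toy tour, songs A and B are played at all four shows and C at only two, so A and B are the base setlist songs.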

In [7]:
# import packages
import pandas as pd

In [8]:
# import data needed
setlist_data = pd.read_csv("../data/processed/artist_info_filtered.csv")
monthly_listeners = pd.read_csv("../data/processed/appify_info.csv")
spotify_data = pd.read_csv("../data/source/spotify/spotify_artist_info.csv", converters={"genres": lambda x: x.strip("[]").replace("'","").split(", ")})
album_data = pd.read_csv("../data/source/spotify/album_detail.csv")

In [9]:
# attach tour start dates and Spotify artist IDs to the album data
# (no `on=` is given, so each merge joins on all shared column names)
album_data_filt = pd.merge(pd.merge(setlist_data[["artist_mbid", "artist_name", "first_tour_date"]],
                                    spotify_data[["artist_mbid", "artist_id"]],
                                    how = "left"),
                           album_data, 
                           how = "left")

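# let us find the release closest to the start of the tour
# note: despite its name, "newest_release" holds the absolute number of days
# between the first tour date and the album date (format='mixed' needs pandas >= 2.0)
album_data_filt["newest_release"] = abs(pd.to_datetime(album_data_filt["first_tour_date"], format='mixed') -
                                        pd.to_datetime(album_data_filt["album_date"], format='mixed')).dt.days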
album_data_filt = (album_data_filt
                    # filter down to the release closest to the start of tour
                    [album_data_filt.groupby("artist_id")["newest_release"].transform("min")==album_data_filt["newest_release"]]
                    # drop any duplicates (just keep first)
                    .drop_duplicates("artist_id", keep="first"))
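
To make the closest-release step above easier to follow, here is a small sketch on made-up data. It uses `groupby(...).idxmin()` as an alternative to the `transform("min")` comparison plus `drop_duplicates`; the album names, dates, and the column name `days_from_tour_start` are invented for illustration, and the toy data has no missing dates.

```python
import pandas as pd

# Made-up albums for two artists; all values are illustrative only.
albums = pd.DataFrame({
    "artist_id": ["a1", "a1", "a2"],
    "album_name": ["First", "Second", "Only"],
    "album_date": ["2023-01-15", "2024-03-01", "2022-06-10"],
    "first_tour_date": ["2024-04-01", "2024-04-01", "2022-05-01"],
})

# absolute gap in days between tour start and release, so releases shortly
# before or shortly after the first show are treated the same way
albums["days_from_tour_start"] = (
    pd.to_datetime(albums["first_tour_date"]) - pd.to_datetime(albums["album_date"])
).abs().dt.days

# idxmin returns the row label of the smallest gap per artist
# (the first such row if there is a tie)
closest = albums.loc[albums.groupby("artist_id")["days_from_tour_start"].idxmin()]
```

Because the gap is absolute, an album released after the tour started can be picked, as happens for artist `a2` here.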

In [10]:
# okay let us merge all together!

# merge monthly listeners to spotify data
spotify_data = pd.merge(spotify_data,
                        monthly_listeners,
                        how="left")

## first merge spotify data
master_data = pd.merge(setlist_data,
                       spotify_data[["artist_mbid", "artist_id", "popularity", "monthly_listeners", "followers", "genres"]],
                       how = "left")

## merge in album data
master_data = pd.merge(master_data,
                       album_data_filt[["artist_mbid", "total_albums", "album_name", "album_date", "newest_release"]],
                       how = "left")

## add in monthly listeners group
master_data["monthly_listen_group"] = "Low"
master_data.loc[master_data["monthly_listeners"]>=500000, "monthly_listen_group"] = "Medium"
master_data.loc[master_data["monthly_listeners"]>=1500000, "monthly_listen_group"] = "High"
master_data.loc[master_data["monthly_listeners"]>=5000000, "monthly_listen_group"] = "Very high"
master_data["monthly_listen_group"] = pd.Categorical(master_data["monthly_listen_group"], ordered=True, categories=["Low", "Medium", "High", "Very high"])

## add in jam band flag
# genres is a list per artist; artists with no Spotify match (NaN after the left merge) get False
master_data["is_jam_band"] = master_data["genres"].apply(lambda x: isinstance(x, list) and "jam band" in x)

## add in flag for salt shed
master_data["is_salt_shed"] = True
master_data.loc[master_data["artist_name"]=="Mk.gee", "is_salt_shed"] = False
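
# The monthly listener grouping above could also be written with pd.cut.
# This is a sketch on made-up values, not a drop-in replacement: with pd.cut,
# artists with missing monthly_listeners stay NaN, whereas the .loc version
# above leaves them in "Low".

```python
import numpy as np
import pandas as pd

# Made-up listener counts, including the bin edges and a missing value.
listeners = pd.Series([100, 500_000, 1_499_999, 1_500_000, 5_000_000, np.nan])

# right=False makes each bin include its lower edge: [500000, 1500000), etc.
groups = pd.cut(
    listeners,
    bins=[-np.inf, 500_000, 1_500_000, 5_000_000, np.inf],
    labels=["Low", "Medium", "High", "Very high"],
    right=False,
)
```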

In [11]:
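# observed=False keeps the current behavior and is stated explicitly for newer pandas versions
master_data.groupby("monthly_listen_group", observed=False).size()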

monthly_listen_group
Low          23
Medium       34
High         32
Very high    27
dtype: int64

In [12]:
master_data.to_csv("../data/processed/analysis_dataset.csv", index=False)

## Methodology

We included only artists who had at least ten tour dates, to ensure enough data, and who averaged at least ten songs per setlist. Salt Shed headliners usually play a set of at least an hour, which typically works out to around 12 songs, so the ten-song minimum helped filter out openers. We also filtered out artists that Spotify lists as jam bands. These artists tend to have highly variable setlists because they create much of the music on stage. It is worth noting that Dijon does follow somewhat in this jam band tradition.
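
As a rough illustration of these inclusion rules, the filter could look like the sketch below. The artists are made up, and the column names `n_tour_dates` and `mean_set_length` are assumptions rather than the actual columns of the analysis dataset.

```python
import pandas as pd

# Hypothetical per-tour summary; names and values are for illustration only.
tours = pd.DataFrame({
    "artist_name": ["Artist A", "Artist B", "Artist C", "Artist D"],
    "n_tour_dates": [25, 8, 40, 30],
    "mean_set_length": [14.2, 16.0, 7.5, 18.3],
    "is_jam_band": [False, False, False, True],
})

keep = (
    (tours["n_tour_dates"] >= 10)        # enough shows to say something
    & (tours["mean_set_length"] >= 10)   # drops openers with short sets
    & ~tours["is_jam_band"]              # drops Spotify-listed jam bands
)
filtered = tours[keep].reset_index(drop=True)
```

Only Artist A passes all three conditions: B has too few dates, C plays short sets, and D is flagged as a jam band.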
