## Data Cleaning ##

**Imports**

In [1]:
import pandas as pd
pd.set_option('display.max_columns', None)

import numpy as np
np.random.seed(42)

import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style = "darkgrid")

import datetime as dt

**Reading in Shows' Data**

In [63]:
shows_df = pd.read_csv('../data/just_shows.csv').drop(columns = 'Unnamed: 0')

In [64]:
shows_df.head(3)

Unnamed: 0,showid,showyear,showmonth,showdate,permalink,setlist_notes,venue,city,state,country,artist_name,tourid,tour_name
0,1251168326,1983,10,1983-10-30,https://phish.net/setlists/phish-october-30-19...,Throughout most of Phish history this was unde...,Harris-Millis Cafeteria - University of Vermont,Burlington,VT,USA,Phish,61,Not Part of a Tour
1,1251253100,1983,12,1983-12-02,https://phish.net/setlists/phish-december-02-1...,"Trey, Mike, Fish, and Jeff Holdsworth recall b...",Harris-Millis Cafeteria - University of Vermont,Burlington,VT,USA,Phish,1,1983 Tour
2,1251253531,1983,12,1983-12-03,https://phish.net/setlists/phish-december-03-1...,"This show, played by Trey, Mike, Fish, and Jef...","Marsh / Austin / Tupper Dormitory, University ...",Burlington,VT,USA,Phish,1,1983 Tour


In [65]:
shows_df.tail(3)

Unnamed: 0,showid,showyear,showmonth,showdate,permalink,setlist_notes,venue,city,state,country,artist_name,tourid,tour_name
2015,1622655409,2022,2,2022-02-25,https://phish.net/setlists/phish-february-25-2...,,Moon Palace,"Cancun, Quintana Roo",,Mexico,Phish,61,Not Part of a Tour
2016,1622655457,2022,2,2022-02-26,https://phish.net/setlists/phish-february-26-2...,,Moon Palace,"Cancun, Quintana Roo",,Mexico,Phish,61,Not Part of a Tour
2017,1622655484,2022,2,2022-02-27,https://phish.net/setlists/phish-february-27-2...,,Moon Palace,"Cancun, Quintana Roo",,Mexico,Phish,61,Not Part of a Tour


In [66]:
shows_df.shape

(2018, 13)

In [67]:
shows_df.columns

Index(['showid', 'showyear', 'showmonth', 'showdate', 'permalink',
       'setlist_notes', 'venue', 'city', 'state', 'country', 'artist_name',
       'tourid', 'tour_name'],
      dtype='object')

**Reading in Setlists' Data**

In [2]:
setlists_df = pd.read_csv('../data/just_setlists.csv').drop(columns = 'Unnamed: 0')

In [3]:
setlists_df.head(3)

Unnamed: 0,showid,showdate,permalink,showyear,uniqueid,meta,setlistnotes,songid,position,transition,set,isjam,isreprise,tracktime,gap,tourid,tourname,song,is_original,venueid,venue,city,state,country,artist_name
0,1326251770,1982-12-07,https://phish.net/setlists/trey-anastasio-dece...,1982,181509,Space Antelope,"This list is likely incomplete, and the date m...",1750,1,2,1,0,0,,0,61,Not Part of a Tour,Lifespace,0,1140,The Taft School,Watertown,CT,USA,Trey Anastasio
1,1326251770,1982-12-07,https://phish.net/setlists/trey-anastasio-dece...,1982,181510,Space Antelope,"This list is likely incomplete, and the date m...",16,2,1,1,0,0,,0,61,Not Part of a Tour,All Along the Watchtower,0,1140,The Taft School,Watertown,CT,USA,Trey Anastasio
2,1326251770,1982-12-07,https://phish.net/setlists/trey-anastasio-dece...,1982,181511,Space Antelope,"This list is likely incomplete, and the date m...",1618,3,1,1,0,0,,0,61,Not Part of a Tour,Franklin's Tower,0,1140,The Taft School,Watertown,CT,USA,Trey Anastasio


In [4]:
setlists_df.tail(3)

Unnamed: 0,showid,showdate,permalink,showyear,uniqueid,meta,setlistnotes,songid,position,transition,set,isjam,isreprise,tracktime,gap,tourid,tourname,song,is_original,venueid,venue,city,state,country,artist_name
58101,1620930022,2021-10-31,https://phish.net/setlists/phish-october-31-20...,2021,462535,,"For the second set, the band's ""musical costum...",630,24,1,3,2,2,,8,61,Not Part of a Tour,Twist,1,1316,MGM Grand Garden Arena,Las Vegas,NV,USA,Phish
58102,1620930022,2021-10-31,https://phish.net/setlists/phish-october-31-20...,2021,462560,,"For the second set, the band's ""musical costum...",2829,25,5,3,2,2,,8,61,Not Part of a Tour,Drift While You're Sleeping,0,1316,MGM Grand Garden Arena,Las Vegas,NV,USA,Phish
58103,1620930022,2021-10-31,https://phish.net/setlists/phish-october-31-20...,2021,462611,,"For the second set, the band's ""musical costum...",250,26,6,e,2,2,,6,61,Not Part of a Tour,Harry Hood,1,1316,MGM Grand Garden Arena,Las Vegas,NV,USA,Phish


In [5]:
setlists_df.shape

(58104, 25)

In [6]:
setlists_df.columns

Index(['showid', 'showdate', 'permalink', 'showyear', 'uniqueid', 'meta',
       'setlistnotes', 'songid', 'position', 'transition', 'set', 'isjam',
       'isreprise', 'tracktime', 'gap', 'tourid', 'tourname', 'song',
       'is_original', 'venueid', 'venue', 'city', 'state', 'country',
       'artist_name'],
      dtype='object')

**Extracting All Phish Setlists in Shows**

In [91]:
phish_shows = shows_df['showid'].tolist()
len(phish_shows)

2018

In [101]:
phish_shows1 = setlists_df['showid'].unique().tolist()
len(phish_shows1)

1804

In [102]:
#keeping only the shows in shows_df that also appear in setlists_df
shows_df = shows_df[shows_df['showid'].isin(phish_shows1)]

In [104]:
shows_df.shape

(1804, 13)

In [109]:
#keeping only the setlist rows whose show appears in the original shows_df
setlists_df = setlists_df[setlists_df['showid'].isin(phish_shows)]

In [106]:
setlists_df['showid'].nunique()

1804

In [107]:
setlists_df.shape

(35807, 21)
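The two filters above should leave both frames covering exactly the same set of shows. A minimal sketch of how that could be checked, using small made-up frames (`toy_shows` and `toy_setlists` are hypothetical stand-ins for `shows_df` and `setlists_df`, and the IDs and song names are invented):

```python
import pandas as pd

# Hypothetical stand-ins for shows_df and setlists_df (made-up IDs and songs).
toy_shows = pd.DataFrame({"showid": [1, 2, 3, 4]})
toy_setlists = pd.DataFrame({
    "showid": [2, 2, 3, 5],
    "song": ["Song A", "Song B", "Song C", "Song D"],
})

# Show IDs that are present in both frames.
shared_ids = set(toy_shows["showid"]) & set(toy_setlists["showid"])

# Keep only rows for shared shows; .copy() gives independent frames
# so later column assignments do not target a slice.
toy_shows = toy_shows[toy_shows["showid"].isin(shared_ids)].copy()
toy_setlists = toy_setlists[toy_setlists["showid"].isin(shared_ids)].copy()

# After filtering, both frames should describe exactly the same shows.
assert set(toy_shows["showid"]) == set(toy_setlists["showid"])
```

Filtering both frames against a single shared set makes the operation symmetric, so the order in which the two filters run no longer matters.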

**Exploring the Data**

In [110]:
setlists_df.info()
#note, showdate needs to be converted to datetime dtype

<class 'pandas.core.frame.DataFrame'>
Int64Index: 35807 entries, 8 to 58103
Data columns (total 21 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   showid       35807 non-null  int64 
 1   showdate     35807 non-null  object
 2   permalink    35807 non-null  object
 3   showyear     35807 non-null  int64 
 4   uniqueid     35807 non-null  int64 
 5   songid       35807 non-null  int64 
 6   position     35807 non-null  int64 
 7   transition   35807 non-null  int64 
 8   set          35807 non-null  object
 9   isjam        35807 non-null  int64 
 10  isreprise    35807 non-null  int64 
 11  gap          35807 non-null  int64 
 12  tourid       35807 non-null  int64 
 13  tourname     35807 non-null  object
 14  song         35807 non-null  object
 15  is_original  35807 non-null  int64 
 16  venueid      35807 non-null  int64 
 17  venue        35807 non-null  object
 18  city         35807 non-null  object
 19  country      35807 non-nu

In [111]:
setlists_df.isna().sum().sort_values(ascending = False)

showid         0
gap            0
country        0
city           0
venue          0
venueid        0
is_original    0
song           0
tourname       0
tourid         0
isreprise      0
showdate       0
isjam          0
set            0
transition     0
position       0
songid         0
uniqueid       0
showyear       0
permalink      0
artist_name    0
dtype: int64
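As noted in the `info()` cell above, `showdate` is stored as strings and needs converting. A minimal sketch of that conversion with `pd.to_datetime`, assuming every value follows the `YYYY-MM-DD` pattern seen in the previews (`toy_df` is a hypothetical stand-in for `setlists_df`):

```python
import pandas as pd

# Hypothetical stand-in for setlists_df with string dates.
toy_df = pd.DataFrame({"showdate": ["1983-12-02", "2021-10-31"]})

# Parse with an explicit format so a malformed date raises an error
# instead of being silently misread.
toy_df["showdate"] = pd.to_datetime(toy_df["showdate"], format="%Y-%m-%d")

# Once the column is datetime, date parts are available through .dt
toy_df["showyear"] = toy_df["showdate"].dt.year
toy_df["showmonth"] = toy_df["showdate"].dt.month
```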

**Counting the Number of Shows and Songs**

In [112]:
#the number of unique shows since the beginning:
setlists_df['showid'].nunique()

1804

In [113]:
#the number of unique songs played:
setlists_df['songid'].nunique()

949

In [26]:
#the number of unique songs flagged as originals (is_original == 1):
setlists_df.loc[setlists_df['is_original'] == 1, 'songid'].nunique()

325

**Creating a Training Dataset (Reference Notes, Delete Later)**

With the language modeling approach in hand, I generated training data in three steps: dropping incomplete setlists, concatenating every remaining setlist chronologically into one long list, and encoding the data (a song-to-integer mapping for each of the 876 unique songs, plus all setlist identifiers). Keeping the setlist identifiers (Set 1, Set 2, Encore, etc.) in the sequence provides context and allows the model to learn that certain songs are more likely to occur as the opener, in the middle of the second set, or in the encore.
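The concatenate-and-encode step described above can be sketched in plain Python. This is only an illustration under stated assumptions: the marker tokens (`<SET1>`, `<SET2>`, `<ENCORE>`), the song names, and the variable names are all hypothetical, and the two toy shows are assumed to be in chronological order already.

```python
# Hypothetical, chronologically ordered shows; the set marker tokens
# and song names are made up for illustration.
toy_shows = [
    ["<SET1>", "Song A", "Song B", "<ENCORE>", "Song C"],
    ["<SET1>", "Song B", "<SET2>", "Song A", "<ENCORE>", "Song D"],
]

# Concatenate every setlist into one long token sequence.
corpus = [token for show in toy_shows for token in show]

# Give each distinct token (songs and set markers alike) an integer ID,
# numbered in order of first appearance.
token_to_int = {}
for token in corpus:
    if token not in token_to_int:
        token_to_int[token] = len(token_to_int)

# Reverse mapping, useful for decoding model output back into song names.
int_to_token = {index: token for token, index in token_to_int.items()}

# The encoded training sequence.
encoded = [token_to_int[token] for token in corpus]
```

Encoding the set markers with the same vocabulary as the songs is what lets a sequence model see where in a show each song falls.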

**Creating Function to Build column 'setlist'**

In [118]:
show_1251253100 = setlists_df.loc[setlists_df['showid'] == 1251253100]['song'].tolist()
show_1251253100

['Long Cool Woman in a Black Dress',
 'Proud Mary',
 'In the Midnight Hour',
 'Squeeze Box',
 'Roadhouse Blues',
 'Happy Birthday to You',
 'Scarlet Begonias',
 'Fire on the Mountain']

In [27]:
#FIGURING OUT FUNCTION FOR SETLISTS:
#I need to look at each unique show ID and get the rows for that ID
#I need to include whether each song is in set 1 or set 2
#I need to make a new column that's a list of the song names in each show

#I need to make a new dataframe with one entry per show.
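The plan in the comments above could be sketched roughly as follows. The `set` values (`'1'`, `'2'`, `'e'`) follow the previews earlier in this notebook, but the `SET_...` marker strings, the `show_tokens` helper, and the `toy_setlists` frame are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for setlists_df (made-up IDs and songs).
toy_setlists = pd.DataFrame({
    "showid": [10, 10, 10, 20, 20],
    "position": [1, 2, 3, 1, 2],
    "set": ["1", "2", "e", "1", "1"],
    "song": ["Song A", "Song B", "Song C", "Song D", "Song A"],
})

def show_tokens(show_rows):
    """Return one show's songs in order, with a marker before each new set."""
    tokens = []
    previous_set = None
    for set_label, song in zip(show_rows["set"], show_rows["song"]):
        if set_label != previous_set:
            tokens.append(f"SET_{set_label}")  # hypothetical marker format
            previous_set = set_label
        tokens.append(song)
    return tokens

# Sort so songs appear in running order within each show, then build
# one token list per show.
ordered = toy_setlists.sort_values(["showid", "position"])
setlists_by_show = {
    showid: show_tokens(rows)
    for showid, rows in ordered.groupby("showid", sort=False)
}
```

The resulting dictionary is keyed by `showid`, so it could be attached to a one-row-per-show frame with `Series.map`.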

In [82]:
#https://stackoverflow.com/questions/12293208/how-to-create-a-list-of-lists

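def make_setlist_col(df):
    #collecting each show's songs into a list, keyed by showid
    #(sort = False keeps the shows in their original order)
    songs_by_show = df.groupby('showid', sort = False)['song'].apply(list)

    #matching on showid rather than row position, so each show gets its own setlist
    #note: like the original loop, this adds the column to the global shows_df
    shows_df['setlists'] = shows_df['showid'].map(songs_by_show)

    return shows_df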

In [119]:
make_setlist_col(setlists_df)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  shows_df['setlists'] = shows


Unnamed: 0,showid,showyear,showmonth,showdate,permalink,setlist_notes,venue,city,state,country,artist_name,tourid,tour_name,setlists
1,1251253100,1983,12,1983-12-02,https://phish.net/setlists/phish-december-02-1...,"Trey, Mike, Fish, and Jeff Holdsworth recall b...",Harris-Millis Cafeteria - University of Vermont,Burlington,VT,USA,Phish,1,1983 Tour,"[Long Cool Woman in a Black Dress, Proud Mary,..."
3,1250613219,1984,10,1984-10-23,https://phish.net/setlists/phish-october-23-19...,"This show, played in the garage of a house on ...",69 Grant Street,Burlington,VT,USA,Phish,2,1984 Tour,[Makisupa Policeman]
4,1251262142,1984,11,1984-11-03,https://phish.net/setlists/phish-november-03-1...,"The setlist for this show might be incomplete,...","Slade Hall, University of Vermont",Burlington,VT,USA,Phish,2,1984 Tour,"[In the Midnight Hour, Wild Child, Jam, Bertha..."
5,1251262498,1984,12,1984-12-01,https://phish.net/setlists/phish-december-01-1...,Skippy and Fluffhead featured The Dude of Life...,Nectar's,Burlington,VT,USA,Phish,2,1984 Tour,"[Jam, Wild Child, Bertha, Can't You Hear Me Kn..."
6,1251587227,1985,2,1985-02-01,https://phish.net/setlists/phish-february-01-1...,It is unconfirmed if this setlist is correct f...,Doolin's,Burlington,VT,USA,Phish,3,1985 Tour,"[Slave to the Traffic Light, Mike's Song, Dave..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2005,1620929961,2021,10,2021-10-26,https://phish.net/setlists/phish-october-26-20...,Pebbles and Marbles and Sample in a Jar were u...,Santa Barbara Bowl,Santa Barbara,CA,USA,Phish,61,Not Part of a Tour,"[Pebbles and Marbles, Makisupa Policeman, Samp..."
2006,1620929979,2021,10,2021-10-28,https://phish.net/setlists/phish-october-28-20...,This show featured a setlist with all songs fe...,MGM Grand Garden Arena,Las Vegas,NV,USA,Phish,61,Not Part of a Tour,"[Also Sprach Zarathustra, 1999, 555, 46 Days, ..."
2007,1620929993,2021,10,2021-10-29,https://phish.net/setlists/phish-october-29-20...,Trey and Mike quoted Little Squirrel throughou...,MGM Grand Garden Arena,Las Vegas,NV,USA,Phish,61,Not Part of a Tour,"[Olivia's Pool, Axilla (Part II), Mike's Song,..."
2008,1620930007,2021,10,2021-10-30,https://phish.net/setlists/phish-october-30-20...,The songs in this show were based on animals. ...,MGM Grand Garden Arena,Las Vegas,NV,USA,Phish,61,Not Part of a Tour,"[The Dogs, Ocelot, Turtle in the Clouds, Run L..."
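The warning in the output above is pandas flagging that `shows_df` was itself created by filtering another DataFrame, so the column assignment may be targeting a copy. One common way to avoid it, sketched here on a hypothetical toy frame (`toy_all` and `toy_kept` are made-up names), is to take an explicit `.copy()` right after filtering:

```python
import pandas as pd

# Hypothetical stand-in for the unfiltered shows frame.
toy_all = pd.DataFrame({"showid": [1, 2, 3], "venue": ["A", "B", "C"]})

# An explicit copy makes toy_kept an independent DataFrame, so adding
# a column to it is unambiguous.
toy_kept = toy_all[toy_all["showid"].isin([2, 3])].copy()

# Adding a column of lists, one list per remaining show.
toy_kept["setlists"] = [["Song A"], ["Song B", "Song C"]]
```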


In [121]:
shows_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1804 entries, 1 to 2009
Data columns (total 14 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   showid         1804 non-null   int64 
 1   showyear       1804 non-null   int64 
 2   showmonth      1804 non-null   int64 
 3   showdate       1804 non-null   object
 4   permalink      1804 non-null   object
 5   setlist_notes  1767 non-null   object
 6   venue          1804 non-null   object
 7   city           1804 non-null   object
 8   state          1686 non-null   object
 9   country        1804 non-null   object
 10  artist_name    1804 non-null   object
 11  tourid         1804 non-null   int64 
 12  tour_name      1804 non-null   object
 13  setlists       1804 non-null   object
dtypes: int64(4), object(10)
memory usage: 211.4+ KB


In [122]:
shows_df.isna().sum().sort_values(ascending = False)

state            118
setlist_notes     37
showid             0
showyear           0
showmonth          0
showdate           0
permalink          0
venue              0
city               0
country            0
artist_name        0
tourid             0
tour_name          0
setlists           0
dtype: int64

In [125]:
shows_df[shows_df['state'].isna()]['country'].value_counts()

Canada            19
Germany           18
Mexico            16
Japan             12
France            11
Italy             11
Denmark            8
England            5
Netherlands        4
Spain              4
Czech Republic     3
Belgium            3
Ireland            2
USA                1
Austria            1
Name: country, dtype: int64

- The entries where 'state' is null (118 of 1,804 shows, about 6.5%) represent enough of my data that I'm not going to drop them. They are spread across too many countries for me to easily fill in the appropriate states, so I am going to set the null 'state' values to 'Not Available'.
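
The fill described above could look roughly like this; `toy_shows` is a hypothetical stand-in for `shows_df` with made-up rows.

```python
import pandas as pd

# Hypothetical stand-in for shows_df; None marks a missing state.
toy_shows = pd.DataFrame({
    "city": ["Burlington", "Cancun, Quintana Roo", "Toronto"],
    "state": ["VT", None, None],
    "country": ["USA", "Mexico", "Canada"],
})

# Replace missing states with a placeholder label instead of dropping rows.
toy_shows["state"] = toy_shows["state"].fillna("Not Available")
```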