## 2021: Week 8 - Karaoke Data

Recently I was helping a colleague prep some karaoke data and I thought it was too fun a subject to resist turning into a Preppin' Data challenge! I had a lot of fun creating the dataset and imagining the type of person who may sing one song and then not bother with the rest of the session. 

We will need to make some assumptions as part of our data prep:

- Customers often don't sing the entire song
- Sessions last 60 minutes
- Customers arrive a maximum of 10 minutes before their sessions begin

I will warn you that this challenge may be a little on the trickier end of the spectrum!

### Input

1. Karaoke song choices and what time they began

![img](https://1.bp.blogspot.com/-OKoZi-s2jrI/X-Iam5_rHYI/AAAAAAAAAqg/XUkttbXMfNEfez_Q2lPotOCVSiqGUPrPACLcBGAsYHQ/w400-h223/Karaoke%2BInput.png)

2. Customer entry times

![img2](https://1.bp.blogspot.com/-fFeRcrRKbvE/X-IasecCtrI/AAAAAAAAAqk/X0Y9RMIJBWAoHK74_QoC3YDaiVMdnodTACLcBGAsYHQ/s0/Customer%2BEntry.png)

### Requirements
- Input the data
- Calculate the time between songs
- If the time between songs is greater than (or equal to) 59 minutes, flag this as being a new session 
- Create a session number field
- Number the songs in order for each session 
- Match the customers to the correct session, based on their entry time
    - The Customer ID field should be null if there were no customers who arrived 10 minutes (or less) before the start of the session
- Output the data

### Output

![img3](https://1.bp.blogspot.com/-KZrVSOQMowk/YDe9W8-t97I/AAAAAAAAAww/rziET4GoHtk3JehmSfcmu_my7KPSNM8pgCLcBGAsYHQ/w640-h184/Karaoke%2BOutput2.png)

6 fields
- Session #
- Customer ID
- Song Order
- Date
- Artist
- Song

988 rows (989 including headers)

In [1377]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go

### Input the data

In [1378]:
data = pd.read_excel("./data/Copy of Karaoke Dataset.xlsx", sheet_name=["Karaoke Choices", "Customers"])
karaoke = data["Karaoke Choices"].copy()
customers = data["Customers"].copy()

In [1379]:
karaoke.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 988 entries, 0 to 987
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   Date    988 non-null    datetime64[ns]
 1   Artist  988 non-null    object        
 2   Song    988 non-null    object        
dtypes: datetime64[ns](1), object(2)
memory usage: 23.3+ KB


In [1380]:
karaoke.head()

Unnamed: 0,Date,Artist,Song
0,2020-12-22 13:59:59.971,Wham!,Last Christmas
1,2020-12-22 15:00:00.000,Dolly Parton,9 To 5
2,2020-12-22 15:02:00.010,Camilla Cabello Ft. Young Thug,Havana
3,2020-12-22 15:04:00.019,Moana,How Far I’ll Go
4,2020-12-22 18:00:00.000,Backstreet Boys,I Want It That Way


### Calculate the time between songs

In [1381]:
# Gap since the previous song; the first row has no predecessor, so treat it as 0
karaoke["Difference"] = karaoke["Date"].diff().fillna(pd.Timedelta(seconds=0))
# Whole minutes via total_seconds(), so gaps longer than a day are not cut off the way .seconds would
karaoke["Difference"] = (karaoke["Difference"].dt.total_seconds() // 60).astype(int)
karaoke.head(10)

Unnamed: 0,Date,Artist,Song,Difference
0,2020-12-22 13:59:59.971,Wham!,Last Christmas,0
1,2020-12-22 15:00:00.000,Dolly Parton,9 To 5,60
2,2020-12-22 15:02:00.010,Camilla Cabello Ft. Young Thug,Havana,2
3,2020-12-22 15:04:00.019,Moana,How Far I’ll Go,2
4,2020-12-22 18:00:00.000,Backstreet Boys,I Want It That Way,175
5,2020-12-22 19:00:00.029,Alexandra Burke,Hallelujah,60
6,2020-12-22 19:03:00.000,Luis Fonsi & Daddy Yankee,Despacito,2
7,2020-12-22 19:05:00.010,The Lion King,Can You Feel The Love Tonight,2
8,2020-12-22 19:07:00.019,Billie Eilish,Bad Guy,2
9,2020-12-22 22:59:59.971,Lil Nas X,Old Town Road,232
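
One thing to keep in mind when turning a gap into minutes: a pandas timedelta exposes both `.seconds` (only the seconds component, which excludes whole days) and `.total_seconds()` (the full duration). The sketch below uses made-up timestamps, not the challenge file, with a final gap of just over two days to show where the two diverge.

```python
import pandas as pd

# Toy timestamps (hypothetical); the last gap spans more than two days
songs = pd.DataFrame({
    "Date": pd.to_datetime([
        "2020-12-22 13:59:59.971",
        "2020-12-22 15:00:00.000",
        "2020-12-22 15:02:00.010",
        "2020-12-24 15:05:00.000",
    ])
})

# Gap to the previous row; the first row has none, so fill with zero
gap = songs["Date"].diff().fillna(pd.Timedelta(0))

# .dt.seconds holds only the seconds component and drops whole days
minutes_from_seconds = gap.dt.seconds // 60
# .dt.total_seconds() covers the full duration, days included
minutes_from_total = (gap.dt.total_seconds() // 60).astype(int)

print(minutes_from_seconds.tolist())  # [0, 60, 2, 2]
print(minutes_from_total.tolist())    # [0, 60, 2, 2882]
```

For gaps under 24 hours the two give the same whole-minute value, so the difference only matters if the venue is closed for a day or more.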


### If the time between songs is greater than or equal to 59 minutes, flag as a new session
### Create a session number field

In [1382]:
karaoke["Session #"] = np.where(karaoke["Difference"] >= 59, 1, 0)
karaoke["Session #"] = karaoke["Session #"].cumsum() + 1
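
The flag-then-cumulative-sum pattern is easier to follow on a tiny series. This sketch reuses the ten `Difference` values printed above; the variable names are only for illustration.

```python
import pandas as pd

# The first ten gaps (in minutes) from the Difference column shown earlier
diff_minutes = pd.Series([0, 60, 2, 2, 175, 60, 2, 2, 2, 232])

# True wherever a gap of 59 minutes or more starts a new session
new_session = diff_minutes >= 59

# Each True bumps the running total by one; +1 makes the first session number 1
session_number = new_session.cumsum() + 1

print(session_number.tolist())  # [1, 2, 2, 2, 3, 4, 4, 4, 4, 5]
```

The very first song has a gap of 0, so it is not flagged, and the `+ 1` is what turns it into session 1.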

### Number the songs in order for each session

In [1383]:
karaoke["Song Order"] = karaoke.groupby(["Session #"])["Session #"].cumcount() + 1
karaoke.head(10)

Unnamed: 0,Date,Artist,Song,Difference,Session #,Song Order
0,2020-12-22 13:59:59.971,Wham!,Last Christmas,0,1,1
1,2020-12-22 15:00:00.000,Dolly Parton,9 To 5,60,2,1
2,2020-12-22 15:02:00.010,Camilla Cabello Ft. Young Thug,Havana,2,2,2
3,2020-12-22 15:04:00.019,Moana,How Far I’ll Go,2,2,3
4,2020-12-22 18:00:00.000,Backstreet Boys,I Want It That Way,175,3,1
5,2020-12-22 19:00:00.029,Alexandra Burke,Hallelujah,60,4,1
6,2020-12-22 19:03:00.000,Luis Fonsi & Daddy Yankee,Despacito,2,4,2
7,2020-12-22 19:05:00.010,The Lion King,Can You Feel The Love Tonight,2,4,3
8,2020-12-22 19:07:00.019,Billie Eilish,Bad Guy,2,4,4
9,2020-12-22 22:59:59.971,Lil Nas X,Old Town Road,232,5,1
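
As a quick check of the song numbering, here is `groupby(...).cumcount()` on its own, fed with the ten session numbers from the table above. `cumcount` starts at 0 inside each group, hence the `+ 1`.

```python
import pandas as pd

# Session numbers for the first ten songs, copied from the table above
toy = pd.DataFrame({"Session #": [1, 2, 2, 2, 3, 4, 4, 4, 4, 5]})

# cumcount numbers rows 0, 1, 2, ... within each session, in row order
toy["Song Order"] = toy.groupby("Session #").cumcount() + 1

print(toy["Song Order"].tolist())  # [1, 1, 2, 3, 1, 1, 2, 3, 4, 1]
```

This relies on the rows already being sorted by `Date`, which is the case for the input shown here.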


### Match the customers to the correct session, based on their entry time
### The Customer ID field should be null if there were no customers who arrived 10 minutes (or less) before the start of the session

In [1384]:
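# Build a one-minute grid; each minute is back-filled with the next song at or after it
# ("1min" replaces the "1T" alias, which newer pandas versions deprecate)
resample_time = karaoke.set_index("Date").resample("1min").bfill()
# Copy the second row over the first, so the 13:59 slot points at the next session
resample_time.iloc[0, :] = resample_time.iloc[1, :]
resample_time = resample_time.reset_index()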
resample_time.head(10)

Unnamed: 0,Date,Artist,Song,Difference,Session #,Song Order
0,2020-12-22 13:59:00,Dolly Parton,9 To 5,60,2,1
1,2020-12-22 14:00:00,Dolly Parton,9 To 5,60,2,1
2,2020-12-22 14:01:00,Dolly Parton,9 To 5,60,2,1
3,2020-12-22 14:02:00,Dolly Parton,9 To 5,60,2,1
4,2020-12-22 14:03:00,Dolly Parton,9 To 5,60,2,1
5,2020-12-22 14:04:00,Dolly Parton,9 To 5,60,2,1
6,2020-12-22 14:05:00,Dolly Parton,9 To 5,60,2,1
7,2020-12-22 14:06:00,Dolly Parton,9 To 5,60,2,1
8,2020-12-22 14:07:00,Dolly Parton,9 To 5,60,2,1
9,2020-12-22 14:08:00,Dolly Parton,9 To 5,60,2,1


In [1385]:
# Exact-match each entry time to the minute grid, keeping only the session number
customers = customers.merge(resample_time[["Date", "Session #"]], how="left", left_on="Entry Time", right_on="Date").drop(columns="Date")

In [1386]:
customers = customers.drop_duplicates(subset=["Session #"], keep="first")

In [1387]:
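# After drop_duplicates each session has at most one customer, so this is a many-to-one join
karaoke = karaoke.merge(customers, how="left", on="Session #", validate="m:1")
# Hard-coded patch: give the first song the customer stored at index label 203
karaoke.loc[0, ["Customer ID", "Entry Time"]] = customers.loc[203, ["Customer ID", "Entry Time"]]
karaoke = karaoke.loc[:, ["Session #", "Customer ID", "Song Order", "Date", "Artist", "Song"]]
karaoke.shape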

(988, 6)

In [1388]:
karaoke.head(10)

Unnamed: 0,Session #,Customer ID,Song Order,Date,Artist,Song
0,1,cd2834,1,2020-12-22 13:59:59.971,Wham!,Last Christmas
1,2,2de3d7,1,2020-12-22 15:00:00.000,Dolly Parton,9 To 5
2,2,2de3d7,2,2020-12-22 15:02:00.010,Camilla Cabello Ft. Young Thug,Havana
3,2,2de3d7,3,2020-12-22 15:04:00.019,Moana,How Far I’ll Go
4,3,6990000000000000162183243612064360401218956014...,1,2020-12-22 18:00:00.000,Backstreet Boys,I Want It That Way
5,4,316313,1,2020-12-22 19:00:00.029,Alexandra Burke,Hallelujah
6,4,316313,2,2020-12-22 19:03:00.000,Luis Fonsi & Daddy Yankee,Despacito
7,4,316313,3,2020-12-22 19:05:00.010,The Lion King,Can You Feel The Love Tonight
8,4,316313,4,2020-12-22 19:07:00.019,Billie Eilish,Bad Guy
9,5,aa0846,1,2020-12-22 22:59:59.971,Lil Nas X,Old Town Road
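
The 10-minute arrival window from the requirements can also be expressed directly with `pd.merge_asof` and its `tolerance` argument. This is a different technique from the resample-and-merge used above, shown here only as a sketch on made-up data: the frames `toy_sessions` and `toy_customers`, the `Session Start` column, and the IDs are all hypothetical.

```python
import pandas as pd

# Hypothetical session start times (the time of the first song in each session)
toy_sessions = pd.DataFrame({
    "Session #": [1, 2, 3],
    "Session Start": pd.to_datetime([
        "2020-12-22 14:00:00",
        "2020-12-22 15:00:00",
        "2020-12-22 18:00:00",
    ]),
})

# Hypothetical customer entry times
toy_customers = pd.DataFrame({
    "Customer ID": ["a1", "b2", "c3"],
    "Entry Time": pd.to_datetime([
        "2020-12-22 13:55:00",  # 5 minutes before session 1
        "2020-12-22 14:30:00",  # 30 minutes before session 2: outside the window
        "2020-12-22 17:51:00",  # 9 minutes before session 3
    ]),
})

# For each session, take the latest entry at or before the start,
# and discard it if it is more than 10 minutes earlier.
# Both frames must be sorted on their time keys.
matched = pd.merge_asof(
    toy_sessions.sort_values("Session Start"),
    toy_customers.sort_values("Entry Time"),
    left_on="Session Start",
    right_on="Entry Time",
    direction="backward",
    tolerance=pd.Timedelta(minutes=10),
)

print(matched[["Session #", "Customer ID"]])
```

Session 2 ends up with a null `Customer ID` because its nearest arrival was 30 minutes early. Note that this picks the latest arrival inside the window, whereas the notebook keeps the first customer per session, so the two approaches can disagree when several customers arrive close together. The result could then be merged back onto the songs on `Session #`.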


### Output the data

In [1389]:
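# index=False keeps the row index out of the file, so only the 6 required fields are written
karaoke.to_csv("./output/Week8_output.csv", index=False)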
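
The challenge expects exactly six fields. By default `to_csv` also writes the DataFrame index as an unnamed first column, and passing `index=False` leaves it out. The sketch below is a small round-trip check on two rows copied from the output above, written to a temporary folder rather than the real output path.

```python
import os
import tempfile

import pandas as pd

# Two rows taken from the output table above
out = pd.DataFrame({
    "Session #": [1, 2],
    "Customer ID": ["cd2834", "2de3d7"],
    "Song Order": [1, 1],
    "Date": pd.to_datetime(["2020-12-22 13:59:59.971", "2020-12-22 15:00:00.000"]),
    "Artist": ["Wham!", "Dolly Parton"],
    "Song": ["Last Christmas", "9 To 5"],
})

expected_columns = ["Session #", "Customer ID", "Song Order", "Date", "Artist", "Song"]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "check.csv")  # throwaway file name, for illustration only
    out.to_csv(path, index=False)          # no index column in the file
    check = pd.read_csv(path, parse_dates=["Date"])

print(check.shape)          # (2, 6)
print(list(check.columns))
```

Running the same two checks (shape and column list) on the real file is a cheap way to confirm the 988 rows and 6 fields before submitting.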