# Analyzing My Spotify Data

Payton Burks  
CPSC 222, Fall 2020  
17 December 2020

## Introduction

### Domain

This project deals with music and my own listening statistics. I have loved listening to music since I was a kid, so I thought analyzing my listening habits would be insightful. Though I have no intention of ever creating music, it has played an important part in my development and will likely continue to do so. I hope to learn what I listen to on a particular day, which artists I like the most, and how my music taste differs from other people's.

### Dataset

The raw data is imported as JSON files with four attributes:
* endTime = date and time the song finished (formatted 'year-month-day hour:minute')
* artistName = artist
* trackName = song
* msPlayed = milliseconds played

I will be appending three more attributes:
* day = day of the week
* skipped? = whether the song was skipped (listened to for $<$ 30 s)
* top100artist? = whether the artist is among my 100 most-played artists

### Hypothesis Testing

I will be testing the following hypotheses:
* The average length of songs I listen to is shorter than the average length of hit songs worldwide.
* I listen to more music (per day, on average) on weekends than on weekdays.
* I skip songs by my top 100 artists more often than songs by artists outside my top 100.
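
Stated more formally, each of these is a one-sided test. The symbols below are my own shorthand (an assumption of notation, not something defined elsewhere in this report): $\mu$ is a mean (song length in the first line, listening time per day in the second) and $p$ is the proportion of plays that were skipped.

$$
\begin{aligned}
\text{Song length: } & H_0: \mu_{\text{mine}} \geq \mu_{\text{hits}}, & H_1: \mu_{\text{mine}} < \mu_{\text{hits}} \\
\text{Weekend listening: } & H_0: \mu_{\text{weekend}} \leq \mu_{\text{weekday}}, & H_1: \mu_{\text{weekend}} > \mu_{\text{weekday}} \\
\text{Skips: } & H_0: p_{\text{top100}} \leq p_{\text{other}}, & H_1: p_{\text{top100}} > p_{\text{other}}
\end{aligned}
$$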

### Stakeholders

Personally, I am a major stakeholder in these results; the information is fascinating to me. Additionally, because I am building a model to see what makes me skip songs, these results could be valuable to artists in the music industry. If artists can learn which of their songs are skipped most often, they may be able to make their next project more profitable (assuming these results can be reproduced with other listeners' streaming data).

### Classification

I will be using the "skipped?" attribute as my class label. I will try to classify whether or not I skipped a song based on the other attributes (not including msPlayed, since "skipped?" is computed directly from it).

## Data Analysis

### Dataset Description

As previously noted, the raw data comes as JSON files. To manipulate the data more easily, I loaded each file into a DataFrame and then concatenated them into one large DataFrame.

In [3]:
import json
import pandas as pd
import matplotlib.pyplot as plt

import utils

#load in data (StreamingHistory0.json through StreamingHistory5.json)
frames = [
    utils.load_data(f"StreamingHistoryJsonFiles/StreamingHistory{i}.json")
    for i in range(6)
]

#join all data into spot_df
spot_df = pd.concat(frames, ignore_index=True)
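
The body of `utils.load_data` is not shown in this report. As a rough sketch of what such a loader could look like, assuming each export file is a JSON list with one record per play, something like the following would work. The function body and the two sample records are made up for illustration.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd


def load_data(path):
    """Hypothetical stand-in for utils.load_data: read one JSON export into a DataFrame."""
    with open(path, encoding="utf-8") as infile:
        records = json.load(infile)  # assumed shape: a list of dicts, one per play
    return pd.DataFrame(records)


# Made-up sample records, only to show the expected attributes.
sample = [
    {"endTime": "2020-01-01 12:34", "artistName": "Artist A",
     "trackName": "Song A", "msPlayed": 181000},
    {"endTime": "2020-01-01 12:35", "artistName": "Artist B",
     "trackName": "Song B", "msPlayed": 12000},
]

with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "StreamingHistory0.json"
    sample_path.write_text(json.dumps(sample), encoding="utf-8")
    sample_df = load_data(sample_path)

print(sample_df.columns.tolist())
```

Each dictionary key becomes a column, so the four attributes arrive as DataFrame columns without any extra work.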

As a reminder, there are four initial attributes in the data (*endTime*, *artistName*, *trackName*, *msPlayed*). In-depth descriptions can be found above.

### Data Preparation

#### Data Cleaning

The only computational issue I had with the data was the '\$' character found in artist names such as Joey Bada\$\$ and A\$AP Rocky, among others. This was causing problems, so I replaced '\$' with 'S'.

Next, I had two tasks remaining: replacing "Unknown Artist" with "Playboi Carti" (Spotify was picking up my local files, which are all Playboi Carti song leaks, as "Unknown Artist") and removing the podcasts that I listened to. Finally, I could write the clean data to a CSV file.

In [12]:
#clean data   
utils.clean_spot_df(spot_df)
utils.rm_pod(spot_df)

spot_df = spot_df.reset_index(drop=True)

#data to csv
spot_df.to_csv("cleanspotifydata.csv")
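
The helpers `utils.clean_spot_df` and `utils.rm_pod` are not shown here, so the following is only a sketch of the three cleaning steps described above. The function name, the `PODCAST_NAMES` set, and the demo rows are all hypothetical; in particular, how the real code identifies podcasts is not stated in this report.

```python
import pandas as pd

# Hypothetical placeholder; the real list of podcasts is not given in this report.
PODCAST_NAMES = {"Example Podcast"}


def clean_spot_df_sketch(df):
    """Return a cleaned copy: '$' -> 'S', relabel local files, drop podcasts."""
    out = df.copy()
    # regex=False treats '$' as a literal character, not an end-of-string anchor
    out["artistName"] = out["artistName"].str.replace("$", "S", regex=False)
    # local files show up as "Unknown Artist"
    out["artistName"] = out["artistName"].replace("Unknown Artist", "Playboi Carti")
    # keep only rows whose artistName is not a known podcast
    out = out[~out["artistName"].isin(PODCAST_NAMES)]
    return out.reset_index(drop=True)


demo = pd.DataFrame({
    "artistName": ["A$AP Rocky", "Unknown Artist", "Example Podcast"],
    "msPlayed": [200000, 150000, 1800000],
})
cleaned = clean_spot_df_sketch(demo)
print(cleaned["artistName"].tolist())
```

Unlike the calls in the cell above, which appear to modify `spot_df` in place, this sketch returns a new DataFrame and leaves its input untouched.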

#### Appending other data

As part of the preparation, I appended three new attributes (*day, skipped?, top100artist?*). Again, more in-depth descriptions of these attributes are found above. To find my top 100 artists for the last of these, I first had to compute a new DataFrame with two columns, artist and hoursListened.
 
While preparing these attributes, I was also able to compute a few statistics and find out more information about the data. These are printed beneath some of the code that allowed me to create the new attributes.

##### *day*

In [6]:
#get dates from data
raw_dates = utils.get_date_list(spot_df)

#convert each date to its day of the week with findDay
day_of_week = [utils.findDay(item) for item in raw_dates]

#append day_of_week to spot_df
spot_df["day"] = day_of_week
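
For comparison, the same attribute can be derived with pandas alone. This is a minimal sketch on two made-up rows; it assumes *endTime* uses the 'year-month-day hour:minute' format described earlier, and it does not assume anything about how `utils.get_date_list` or `utils.findDay` work internally (their output format may differ from the full weekday names produced here).

```python
import pandas as pd

# Two made-up plays, only for illustration.
demo = pd.DataFrame({"endTime": ["2020-12-17 09:15", "2020-12-19 22:40"]})

# Parse the 'year-month-day hour:minute' strings, then ask for the weekday name.
end_times = pd.to_datetime(demo["endTime"], format="%Y-%m-%d %H:%M")
demo["day"] = end_times.dt.day_name()

print(demo["day"].tolist())
```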

##### *skipped?*

In [7]:
#new vars
skipped = []
numSkips = 0

#loop through data; a song counts as skipped if played for under 30 seconds
for item in spot_df["msPlayed"]:
    secListened = item/1000
    if secListened < 30:
        skipped.append('y')
        numSkips += 1
    else:
        skipped.append('n')
totSongs = len(skipped)

#append data
spot_df["skipped?"] = skipped

print("Number of songs skipped:", numSkips)
#multiply before rounding so the percentage prints with at most two decimals
print("Percent of songs skipped:", round(numSkips/totSongs*100, 2), '%')

Number of songs skipped: 11952
Percent of songs skipped: 21.02 %
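
The loop above can also be written as a vectorized comparison. This is a small sketch on made-up values, using the same under-30-seconds cutoff as the loop; the demo DataFrame is hypothetical.

```python
import pandas as pd

# Made-up play lengths in milliseconds, only for illustration.
demo = pd.DataFrame({"msPlayed": [12000, 30000, 215000, 29999]})

is_skip = demo["msPlayed"] < 30_000  # under 30 seconds, same cutoff as the loop
demo["skipped?"] = is_skip.map({True: "y", False: "n"})

skip_rate = is_skip.mean() * 100  # mean of booleans = fraction skipped
print(demo["skipped?"].tolist(), round(skip_rate, 2))
```

A play of exactly 30,000 ms counts as not skipped under this cutoff.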


##### *top100artist?*

###### *Must first create a new DataFrame: artist × hours listened*

In [9]:
#new vars
bigtotal = 0
#vars for new DF
totalHours_perArtist = []
artist_perArtist = []
#grouping by artist
group_by_artist_df = spot_df.groupby("artistName")

for artist, group_df in group_by_artist_df:
    msplayed_ser = group_df["msPlayed"].copy()
    totalms = msplayed_ser.sum()
    totalhours = totalms/1000/60/60
    bigtotal += totalhours
    
    #data for new df for hours x artist
    totalHours_perArtist.append(round(totalhours, 2))
    artist_perArtist.append(artist)
    
artist_x_hours_df = utils.create_artist_x_hours_df(artist_perArtist, totalHours_perArtist)

print("Total hours listened to:", round(bigtotal))
print("Total days:", round(bigtotal/24))

Total hours listened to: 2178
Total days: 91
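
The per-artist totals can also be computed without an explicit loop. The sketch below uses made-up data and sorts the result by hours, descending. The column names `Artist` and `hoursListened` follow the names used in this report, but whether `utils.create_artist_x_hours_df` builds or sorts its result the same way is an assumption.

```python
import pandas as pd

# Made-up plays, only for illustration.
demo = pd.DataFrame({
    "artistName": ["A", "B", "A", "C"],
    "msPlayed": [3_600_000, 1_800_000, 7_200_000, 900_000],
})

MS_PER_HOUR = 1000 * 60 * 60

hours = (
    demo.groupby("artistName")["msPlayed"].sum()  # total ms per artist
    .div(MS_PER_HOUR)                             # convert to hours
    .round(2)
    .sort_values(ascending=False)                 # most-played first
)
artist_hours = hours.rename_axis("Artist").reset_index(name="hoursListened")

print(artist_hours)
```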


##### Actually creating *top100artist?* attribute

In [10]:
#new vars
#top 100 artists: rows 0-99 (assumes artist_x_hours_df is sorted by hours, descending)
top100 = set(artist_x_hours_df.iloc[0:100]["Artist"])

#mark each play 'y' if its artist is in the top 100, else 'n'
top100YorN = ['y' if artist in top100 else 'n' for artist in spot_df["artistName"]]

#append
spot_df["top100Artist?"] = top100YorN
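
The cell above takes the first rows of `artist_x_hours_df`, which only yields the most-played artists if that DataFrame is already sorted by hours listened. As a hedged alternative that does not rely on row order, `nlargest` can pick the top artists directly. The data below is made up, `TOP_N` is set to 2 only to keep the demo readable, and the `hoursListened` column name is taken from the description above rather than from the actual `utils` code.

```python
import pandas as pd

# Made-up artist totals (deliberately unsorted) and plays, only for illustration.
artist_hours = pd.DataFrame({
    "Artist": ["A", "B", "C", "D"],
    "hoursListened": [0.5, 3.0, 0.25, 1.0],
})
plays = pd.DataFrame({"artistName": ["B", "C", "D", "A", "B"]})

TOP_N = 2  # 100 in this report; small here so the demo is readable

# nlargest sorts internally, so the input order does not matter
top_artists = artist_hours.nlargest(TOP_N, "hoursListened")["Artist"]
in_top = plays["artistName"].isin(top_artists)
plays["top100Artist?"] = in_top.map({True: "y", False: "n"})

print(plays["top100Artist?"].tolist())
```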

##### Write the new, fuller DataFrame to a new CSV

In [11]:
spot_df.to_csv('finalspotifydata.csv')

### Exploratory Data Analysis

## Classification Results

## Conclusion