# Spotify User Data Analysis

## Data Cleaning

Before analyzing my listening data, I first clean it: I remove unnecessary columns, drop very short plays, and combine the listening instances for each track.

In [23]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns

In [16]:
# Load the raw listening history (redundant columns are removed in preparedf)
df = pd.read_csv("C:/Users/nyc8p/OneDrive/Desktop/spoty-records-master/output/final.csv")

In [14]:
def preparedf(df):
    '''Takes pandas dataframe df and returns cleaned dataframe'''
    # Drop the columns that are not needed for the analysis
    df = df.drop(columns=["name", "Unnamed: 0", "uri", "type", "id", "track_href", "analysis_url", "albumID"])
    # Keep only listening instances that lasted at least 20 seconds
    df = df[df['msPlayed'] >= 20000]

    # Group by track name: sum the ms played and keep the first value of every other column
    # Note: grouping on trackName alone merges different songs that share a title
    cols_to_keep = ['artistName', 'albumName', 'danceability', 'energy', 'key', 'loudness', 'mode',
                    'speechiness', 'acousticness', 'instrumentalness', 'liveness', 'valence',
                    'tempo', 'duration_ms', 'time_signature']
    aggregation = {'msPlayed': 'sum'}
    for col in cols_to_keep:
        aggregation[col] = 'first'
    df = df.groupby('trackName').agg(aggregation).reset_index()

    # Create new columns helpful for analysis
    df['minPlayed'] = (df['msPlayed'] / 60000).round(2)
    df['timesPlayed'] = (df['msPlayed'] / df['duration_ms']).round(2)  # total play time / track length
    df['duration_min'] = (df['duration_ms'] / 60000).round(2)
    df = df.drop(columns=['msPlayed'])

    # Move the three new columns to positions 3-5 (0-based), right after albumName
    new_column_order = list(df.columns)
    for position, col in enumerate(['minPlayed', 'timesPlayed', 'duration_min'], start=3):
        new_column_order.remove(col)
        new_column_order.insert(position, col)
    df = df[new_column_order]

    # Keep tracks whose total play time covers at least one full play, most played first
    df = df[df['timesPlayed'] >= 1]
    df = df.sort_values(['timesPlayed', 'minPlayed'], ascending=False).reset_index(drop=True)
    return df

In [17]:
df = preparedf(df)
df

| | trackName | artistName | albumName | minPlayed | timesPlayed | duration_min | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | duration_ms | time_signature |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Weird Fishes/ Arpeggi | Radiohead | In Rainbows | 309.84 | 58.43 | 5.30 | 0.531 | 0.6100 | 11 | -8.025 | 0 | 0.0387 | 0.772 | 0.756000 | 0.0908 | 0.199 | 152.958 | 318187 | 4 |
| 1 | Good Will Hunting | Black Country, New Road | Ants From Up There | 238.54 | 48.04 | 4.97 | 0.587 | 0.3580 | 2 | -9.194 | 1 | 0.0276 | 0.219 | 0.002010 | 0.1920 | 0.368 | 96.685 | 297907 | 3 |
| 2 | Pro Freak (with Doechii, Fatman Scoop) | Smino | Luv 4 Rent | 202.78 | 45.29 | 4.48 | 0.502 | 0.6410 | 1 | -6.271 | 1 | 0.4670 | 0.269 | 0.000005 | 0.1970 | 0.212 | 75.008 | 268667 | 4 |
| 3 | Present Tense | Radiohead | A Moon Shaped Pool | 229.41 | 44.90 | 5.11 | 0.462 | 0.4070 | 1 | -12.428 | 0 | 0.0345 | 0.912 | 0.399000 | 0.1100 | 0.336 | 91.915 | 306581 | 4 |
| 4 | the perfect pair | beabadoobee | Beatopia | 131.33 | 44.38 | 2.96 | 0.634 | 0.6630 | 11 | -6.818 | 1 | 0.0331 | 0.433 | 0.124000 | 0.1020 | 0.600 | 146.053 | 177533 | 4 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2860 | Equally Damaged | Blonde Redhead | Melody of Certain Damaged Lemons | 0.67 | 1.00 | 0.67 | 0.528 | 0.0508 | 0 | -22.086 | 0 | 0.1220 | 0.995 | 0.856000 | 0.1480 | 0.891 | 159.697 | 40200 | 3 |
| 2861 | SCENE | BROCKHAMPTON | SATURATION II | 0.66 | 1.00 | 0.66 | 0.434 | 0.3120 | 5 | -20.861 | 1 | 0.0567 | 0.936 | 0.001690 | 0.5530 | 0.862 | 145.864 | 39520 | 4 |
| 2862 | Day & Night | Thundercat | Drunk | 0.62 | 1.00 | 0.62 | 0.722 | 0.4890 | 5 | -14.218 | 1 | 0.0424 | 0.371 | 0.683000 | 0.1120 | 0.533 | 95.072 | 37397 | 4 |
| 2863 | Skit #2 | Kanye West | Late Registration | 0.52 | 1.00 | 0.52 | 0.731 | 0.4120 | 11 | -11.590 | 0 | 0.9460 | 0.253 | 0.000000 | 0.1640 | 0.950 | 66.512 | 31360 | 4 |
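
To make the aggregation step in `preparedf` easier to follow, here is a minimal sketch on a tiny, invented listening history (the track and artist names and all values are hypothetical, not taken from my data). It applies the same idea: drop plays under 20 seconds, sum `msPlayed` per track, keep the first value of the per-track metadata, and estimate `timesPlayed` as total play time divided by track length.

```python
import pandas as pd

# Hypothetical toy listening history (values invented for illustration only)
toy = pd.DataFrame({
    "trackName":   ["Song A", "Song A", "Song B", "Song A"],
    "artistName":  ["Artist X", "Artist X", "Artist Y", "Artist X"],
    "msPlayed":    [180000, 15000, 240000, 180000],
    "duration_ms": [180000, 180000, 240000, 180000],
})

# Drop plays shorter than 20 seconds, as preparedf does
toy = toy[toy["msPlayed"] >= 20000]

# Sum the play time per track; keep the first value of the per-track metadata
summary = (
    toy.groupby("trackName")
       .agg({"msPlayed": "sum", "artistName": "first", "duration_ms": "first"})
       .reset_index()
)

# Estimated number of full plays = total time played / track length
summary["timesPlayed"] = (summary["msPlayed"] / summary["duration_ms"]).round(2)
print(summary)
```

Because `timesPlayed` is derived from total play time, it is an estimate: several partial plays of a track can add up to one "play".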


The dataset is much cleaner now, so I will export it as a new CSV file.

In [18]:
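# index=False keeps the row index out of the file, so no "Unnamed: 0" column appears on re-read
df.to_csv("cleanhistory.csv", index=False)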
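
One thing to watch when exporting: `DataFrame.to_csv` writes the row index by default, which is the usual source of a column named `Unnamed: 0` when the file is read back (a column with that name is dropped in `preparedf`). The sketch below shows the difference using an in-memory buffer and invented data, so nothing is written to disk.

```python
import io
import pandas as pd

# Hypothetical two-row frame, for illustration only
demo = pd.DataFrame({"trackName": ["Song A", "Song B"], "minPlayed": [6.0, 4.0]})

# Default behaviour: the row index is written as an unnamed first column
with_index = io.StringIO()
demo.to_csv(with_index)
with_index.seek(0)
cols_with = pd.read_csv(with_index).columns.tolist()

# index=False leaves the row index out of the file
without_index = io.StringIO()
demo.to_csv(without_index, index=False)
without_index.seek(0)
cols_without = pd.read_csv(without_index).columns.tolist()

print(cols_with)
print(cols_without)
```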

## EDA

Now we can look at the relationships between how often I played my top songs and the audio features of those songs.

Let's write a function that draws scatterplots for the 500 most played songs in my listening history. One axis is the number of times played and the other is an audio feature, which should help show which features correlate with my most played songs.

In [40]:
# df is sorted by timesPlayed, so the first 500 rows are the 500 most played songs
df = df.head(500)

In [41]:
def scatter(df, x):
    '''Take column x and show a scatterplot of x against each audio feature'''
    cols = ['danceability', 'energy', 'key', 'loudness', 'mode', 'speechiness',
            'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo', 'time_signature']
    for y in cols:
        # x may be a column name or a Series; hovering over a point shows the track name
        fig = px.scatter(df, x=x, y=y, hover_name="trackName")
        fig.show()

In [42]:
scatter(df, 'timesPlayed')
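
Scatterplots show the shape of each relationship, but a single number per feature can make them easier to compare. As a rough complement, the sketch below computes the Pearson correlation between `timesPlayed` and each audio feature. The helper name `feature_correlations` and the toy values are hypothetical. Pearson correlation only measures linear association, and for categorical features such as `key` and `mode` the coefficient is hard to interpret.

```python
import pandas as pd

AUDIO_FEATURES = ['danceability', 'energy', 'key', 'loudness', 'mode', 'speechiness',
                  'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo',
                  'time_signature']

def feature_correlations(df, target="timesPlayed", features=None):
    """Return the Pearson correlation of each feature with `target`, strongest first."""
    if features is None:
        # Use only the audio features that are actually present in df
        features = [col for col in AUDIO_FEATURES if col in df.columns]
    corr = df[features].corrwith(df[target])
    # Sort by absolute value so strong negative correlations rank high too
    return corr.sort_values(key=lambda s: s.abs(), ascending=False)

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    "timesPlayed": [1, 2, 3, 4],
    "energy":      [0.1, 0.2, 0.3, 0.4],
    "valence":     [0.9, 0.7, 0.4, 0.2],
})
result = feature_correlations(toy)
print(result)
```

On the notebook's data the call would be `feature_correlations(df)`, using the same 500-row `df` passed to `scatter`.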