# Spotify User Data Analysis

## Data Cleaning

Before analyzing my user data, I first clean it: I remove unnecessary columns and combine all the listening instances of each track into a single row.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


In [2]:
# Load the raw listening history
df = pd.read_csv("C:/Users/nyc8p/OneDrive/Desktop/spoty-records-master/output/final.csv")

In [4]:
def preparedf(df):
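    '''Take the raw listening-history DataFrame df and return a cleaned DataFrame with one row per track.'''
    # Remove redundant identifier and URL columns
    df = df.drop(columns=["name", "Unnamed: 0", "uri", "type", "id", "track_href", "analysis_url", "albumID"])
    # Keep only plays of at least 20 seconds (20,000 ms)
    df = df[df['msPlayed'] >= 20000]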

    # Build the aggregation mapping: sum the listening time across all plays
    aggregation = {
        'msPlayed': 'sum',  # total milliseconds played
    }
    cols_to_keep = ['artistName', 'albumName', 'danceability', 'energy', 'key', 'loudness', 'mode', 'speechiness', 'acousticness', 'instrumentalness',
                    'liveness', 'valence', 'tempo', 'duration_ms', 'time_signature']

    # Track-level attributes: keep the first value seen in each group
    for col in cols_to_keep:
        aggregation[col] = 'first'

    # Note: grouping on trackName alone merges different songs that share a title
    df = df.groupby('trackName').agg(aggregation).reset_index()

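    # Derive listening totals in minutes and an estimated play count
    df['minPlayed'] = df['msPlayed'] / 60000
    df['timesPlayed'] = df['msPlayed'] / df['duration_ms']  # total time listened / track length
    df['duration_min'] = df['duration_ms'] / 60000
    df = df.drop(columns=['msPlayed'])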

    # Move the three derived columns to positions 3-5 (0-based), right after albumName
    new_column_order = list(df.columns)
    for position, col in enumerate(['minPlayed', 'timesPlayed', 'duration_min'], start=3):
        new_column_order.remove(col)
        new_column_order.insert(position, col)
    df = df[new_column_order]

    # Keep tracks listened to for at least one full track length, most played first
    df = df[df['timesPlayed'] >= 1]
    df = df.sort_values(['timesPlayed', 'minPlayed'], ascending=False)
    return df

In [5]:
df = preparedf(df)
df

index,trackName,artistName,albumName,minPlayed,timesPlayed,duration_min,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature
3026,Weird Fishes/ Arpeggi,Radiohead,In Rainbows,309.843417,58.426664,5.303117,0.531,0.6100,11,-8.025,0,0.0387,0.772,0.756000,0.0908,0.199,152.958,318187,4
1043,Good Will Hunting,"Black Country, New Road",Ants From Up There,238.539017,48.042983,4.965117,0.587,0.3580,2,-9.194,1,0.0276,0.219,0.002010,0.1920,0.368,96.685,297907,3
2106,"Pro Freak (with Doechii, Fatman Scoop)",Smino,Luv 4 Rent,202.782350,45.286325,4.477783,0.502,0.6410,1,-6.271,1,0.4670,0.269,0.000005,0.1970,0.212,75.008,268667,4
2098,Present Tense,Radiohead,A Moon Shaped Pool,229.408533,44.896820,5.109683,0.462,0.4070,1,-12.428,0,0.0345,0.912,0.399000,0.1100,0.336,91.915,306581,4
3264,the perfect pair,beabadoobee,Beatopia,131.329167,44.384706,2.958883,0.634,0.6630,11,-6.818,1,0.0331,0.433,0.124000,0.1020,0.600,146.053,177533,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1336,Is This Real? (Can You Hear Yourself?),Sudan Archives,Natural Brown Prom Queen,0.704667,1.000000,0.704667,0.286,0.2090,3,-12.302,0,0.0333,0.929,0.030300,0.1150,0.145,129.899,42280,4
1328,Intro (Inside My Mind),BJ The Chicago Kid,In My Mind,0.689550,1.000000,0.689550,0.556,0.2010,7,-20.726,1,0.4220,0.130,0.000001,0.5520,0.169,117.663,41373,4
770,Equally Damaged,Blonde Redhead,Melody of Certain Damaged Lemons,0.670000,1.000000,0.670000,0.528,0.0508,0,-22.086,0,0.1220,0.995,0.856000,0.1480,0.891,159.697,40200,3
2259,SCENE,BROCKHAMPTON,SATURATION II,0.658667,1.000000,0.658667,0.434,0.3120,5,-20.861,1,0.0567,0.936,0.001690,0.5530,0.862,145.864,39520,4
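
To make the aggregation step concrete, here is a minimal sketch on a made-up history (the track names, artists, and play times are hypothetical, not taken from my data). It applies the same idea as `preparedf`: sum `msPlayed` per track, keep the first value of the other columns, then estimate the play count as total listening time divided by track length.

```python
import pandas as pd

# Hypothetical toy history: two plays of one track, one partial play of another.
toy = pd.DataFrame({
    "trackName": ["Song A", "Song A", "Song B"],
    "artistName": ["Artist X", "Artist X", "Artist Y"],
    "msPlayed": [180000, 90000, 60000],
    "duration_ms": [180000, 180000, 240000],
})

# Sum the listening time per track; keep the first value of the other columns.
agg = toy.groupby("trackName").agg(
    {"msPlayed": "sum", "artistName": "first", "duration_ms": "first"}
).reset_index()

# Estimated play count = total time listened / track length.
agg["minPlayed"] = agg["msPlayed"] / 60000
agg["timesPlayed"] = agg["msPlayed"] / agg["duration_ms"]

print(agg[["trackName", "minPlayed", "timesPlayed"]])
```

Song A comes out at 1.5 plays because a partial listen counts fractionally, while Song B (0.25 plays) would be removed by the `timesPlayed >= 1` filter.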


The dataset looks much cleaner now, so I will export it to a new CSV file.

In [38]:
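# index=False keeps the row index out of the file, so no "Unnamed: 0" column appears on reload
df.to_csv("cleanhistory.csv", index=False)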
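
One thing to watch when exporting: by default `to_csv` also writes the row index, and reading the file back turns it into an `Unnamed: 0` column like the one dropped during cleaning. A minimal sketch with made-up data, using an in-memory buffer in place of a real file:

```python
import io
import pandas as pd

# Hypothetical two-row frame standing in for the cleaned history.
toy = pd.DataFrame({"trackName": ["Song A", "Song B"], "minPlayed": [4.5, 1.0]})

# Default behaviour: the row index is written as an extra, unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False writes only the real columns.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)     # ['Unnamed: 0', 'trackName', 'minPlayed']
print(cols_without_index)  # ['trackName', 'minPlayed']
```

Passing `index=False` when writing, or `index_col=0` when reading, avoids carrying the extra column around.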

## EDA

Now we can look at how my most-played songs relate to some of their audio features.