# COGS 108 - Data Checkpoint

# Names

- Mateo Ignacio
- Samuel Piltch
- Nate del Rosario 🐐
- Lisa Hwang
- Geovaunii D. White

<a id='research_question'></a>
# Research Question

Since we have never worked with audio data or audio classification, we wanted to try working with data structured this way.
We ask the question: how do the audio features of songs, specifically Spotify tracks, compare to each other? Is there a relationship between some of these features (for example, does tempo correlate with danceability, energy, or liveness), and if so, how are they correlated? Additionally, how can we use these numeric features, which are derived from the audio tracks, to cluster songs?
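
As a rough illustration of the correlation part of this question, the sketch below computes pairwise Pearson correlations with pandas. The `toy` frame and its values are made up for illustration and are not taken from our dataset; the column names mirror the Spotify feature names.

```python
import pandas as pd

# Toy values, made up for illustration only (not rows from the dataset)
toy = pd.DataFrame({
    'tempo':        [70.0, 95.0, 120.0, 140.0, 170.0],
    'danceability': [0.40, 0.55, 0.70, 0.65, 0.50],
    'energy':       [0.30, 0.45, 0.60, 0.80, 0.90],
})

# Pearson correlation between every pair of columns
corr = toy.corr(method='pearson')

# Correlation between one specific pair of features
tempo_energy_r = corr.loc['tempo', 'energy']
print(corr.round(3))
```

On the real data, the same call on the numeric feature columns would give a first look at which pairs move together, although correlation alone would not tell us why.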

# Dataset(s)

- Dataset Name: Spotify Weekly Top 200
- Link to the dataset: [data scraped here](https://charts.spotify.com/charts/view/regional-global-weekly/latest)
- Number of observations: 2554 rows × 14 columns

The dataset consists of 2554 rows, some of which are duplicate songs (songs that chart multiple times in a week). 
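
One way the duplicate rows could be counted and dropped is sketched below. This is only a sketch on a made-up `toy` frame: it assumes a duplicate means the same `track_name` and `artist_names` pair, and whether duplicates should be dropped at all is a decision we have not made here.

```python
import pandas as pd

# Toy rows, made up for illustration; 'Song A' appears twice
toy = pd.DataFrame({
    'track_name':   ['Song A', 'Song B', 'Song A', 'Song C'],
    'artist_names': ['Artist 1', 'Artist 2', 'Artist 1', 'Artist 3'],
    'tempo':        [120.0, 95.0, 120.0, 140.0],
})

# Assumption: a song is identified by its name plus its artists
key_cols = ['track_name', 'artist_names']

# Count rows that repeat an earlier (track_name, artist_names) pair
n_duplicates = int(toy.duplicated(subset=key_cols).sum())

# Keep the first occurrence of each song and renumber the rows
deduped = toy.drop_duplicates(subset=key_cols, keep='first').reset_index(drop=True)
print(n_duplicates, len(deduped))
```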

# Setup

In [2]:
#import our libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.feature_selection import SelectKBest, chi2

In [3]:
charts = pd.read_csv('data/songs.csv')
charts

Unnamed: 0,track_name,artist_names,danceability,energy,key,mode,loudness,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration
0,'Til You Can't,Cody Johnson,0.501,0.815,1.0,1.0,-4.865,0.0436,0.05130,0.000000,0.1060,0.4600,160.087,224213.0
1,'Till I Collapse,"Eminem, Nate Dogg",0.548,0.847,1.0,1.0,-3.237,0.1860,0.06220,0.000000,0.0816,0.1000,171.447,297787.0
2,(Don't Fear) The Reaper,Blue Öyster Cult,0.333,0.927,9.0,0.0,-8.550,0.0733,0.00290,0.000208,0.2970,0.3850,141.466,308120.0
3,(Everybody's Waitin' For) The Man With The Bag...,Kay Starr,0.739,0.317,0.0,1.0,-8.668,0.0905,0.39100,0.004870,0.2430,0.8060,71.165,162373.0
4,(There's No Place Like) Home for the Holidays ...,Perry Como,0.478,0.341,5.0,1.0,-12.556,0.0511,0.89700,0.000000,0.2580,0.4740,143.736,175893.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2549,you broke me first,Tate McRae,0.667,0.373,4.0,1.0,-9.389,0.0500,0.78500,0.000000,0.0906,0.0823,124.148,169266.0
2550,¿Por Qué Me Haces Llorar?,Juan Gabriel,0.647,0.477,0.0,1.0,-8.157,0.0342,0.03740,0.000010,0.1270,0.7930,112.041,182880.0
2551,¿Quién Te Crees?,"MC Davo, Calibre 50",0.747,0.780,9.0,0.0,-5.302,0.2160,0.05830,0.000000,0.1640,0.5380,82.524,185493.0
2552,Éxtasis,"Millonario & W. Corona, Cartel De Santa",0.937,0.791,0.0,1.0,-5.242,0.0871,0.02050,0.000232,0.0433,0.9740,119.967,289013.0


# Data Cleaning

Since we are dealing with features on different scales, we will have to
- normalize the columns
- add a 'tempo_name' column by binning each 'tempo' value into an ordinal category
- convert 'duration' from milliseconds to minutes

We start by binning tempo:

In [4]:
def return_tempo(tempo):
    """Map a tempo in BPM to one of seven ordinal bins, encoded as k / 6 in [0, 1]."""
    if tempo < 60:
        return 0
    elif tempo < 90:
        return 1 / 6
    elif tempo < 110:
        return 2 / 6
    elif tempo < 120:
        return 3 / 6
    elif tempo < 160:
        return 4 / 6
    elif tempo < 180:
        return 5 / 6
    else:
        return 6 / 6

# add the binned tempo as a new column
charts = charts.assign(tempo_name=charts['tempo'].apply(return_tempo))
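
The same binning could also be expressed with `pd.cut`, which avoids the chain of `elif` branches. This is a sketch rather than a replacement for the cell above: the `edges` list copies the thresholds from `return_tempo`, and the `tempos` series is a small sample chosen to include some of the tempo values shown in the table.

```python
import numpy as np
import pandas as pd

# Bin edges copied from return_tempo; right=False makes each bin [low, high)
edges = [-np.inf, 60, 90, 110, 120, 160, 180, np.inf]

# Sample tempos in BPM (illustrative)
tempos = pd.Series([55.0, 60.0, 71.165, 112.041, 143.736, 160.087, 180.0])

# labels=False returns the integer bin index (0 to 6) for each value
codes = pd.cut(tempos, bins=edges, right=False, labels=False)

# Divide by 6 to get the same k / 6 encoding as return_tempo
tempo_name = codes / 6
print(tempo_name.round(6).tolist())
```

Because `right=False` makes each interval closed on the left, a tempo of exactly 60 falls in the second bin, which matches the strict `<` comparisons in `return_tempo`.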


In [5]:
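# define our features: every column except the first two (non-feature) columns
# note: this runs before 'duration' is replaced by 'duration_min' in the next cell
features_array = np.array(charts.columns)
features = np.delete(features_array, [0, 1])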
features

array(['danceability', 'energy', 'key', 'mode', 'loudness', 'speechiness',
       'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo',
       'duration', 'tempo_name'], dtype=object)

In [6]:
# convert milliseconds to minutes
charts = charts.assign(duration_min = charts['duration'] / 60000).drop(columns=['duration'])

# feature scaling: rescale these columns to the [0, 1] range
scale_cols = ['tempo', 'loudness', 'duration_min', 'key', 'mode']
scaler = MinMaxScaler()
charts[scale_cols] = scaler.fit_transform(charts[scale_cols])

# Scaled!
charts

Unnamed: 0,track_name,artist_names,danceability,energy,key,mode,loudness,speechiness,acousticness,instrumentalness,liveness,valence,tempo,tempo_name,duration_min
0,'Til You Can't,Cody Johnson,0.501,0.815,0.090909,1.0,0.858052,0.0436,0.05130,0.000000,0.1060,0.4600,0.710372,0.833333,0.286447
1,'Till I Collapse,"Eminem, Nate Dogg",0.548,0.847,0.090909,1.0,0.911176,0.1860,0.06220,0.000000,0.0816,0.1000,0.777751,0.833333,0.398980
2,(Don't Fear) The Reaper,Blue Öyster Cult,0.333,0.927,0.818182,0.0,0.737804,0.0733,0.00290,0.000208,0.2970,0.3850,0.599926,0.666667,0.414785
3,(Everybody's Waitin' For) The Man With The Bag...,Kay Starr,0.739,0.317,0.000000,1.0,0.733953,0.0905,0.39100,0.004870,0.2430,0.8060,0.182955,0.166667,0.191861
4,(There's No Place Like) Home for the Holidays ...,Perry Como,0.478,0.341,0.454545,1.0,0.607081,0.0511,0.89700,0.000000,0.2580,0.4740,0.613390,0.666667,0.212540
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2549,you broke me first,Tate McRae,0.667,0.373,0.363636,1.0,0.710426,0.0500,0.78500,0.000000,0.0906,0.0823,0.497209,0.666667,0.202404
2550,¿Por Qué Me Haces Llorar?,Juan Gabriel,0.647,0.477,0.000000,1.0,0.750628,0.0342,0.03740,0.000010,0.1270,0.7930,0.425400,0.500000,0.223227
2551,¿Quién Te Crees?,"MC Davo, Calibre 50",0.747,0.780,0.818182,0.0,0.843792,0.2160,0.05830,0.000000,0.1640,0.5380,0.250328,0.166667,0.227223
2552,Éxtasis,"Millonario & W. Corona, Cartel De Santa",0.937,0.791,0.000000,1.0,0.845750,0.0871,0.02050,0.000232,0.0433,0.9740,0.472411,0.500000,0.385560
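
To make the scaling step concrete, the sketch below applies the min-max formula (x - min) / (max - min) by hand and compares it with `MinMaxScaler`. The `keys` array is illustrative and assumes 'key' ranges from 0 to 11, which is consistent with a key of 1.0 becoming 0.090909 (1/11) in the table above.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Illustrative 'key' values as a single column; assumes the range is 0 to 11
keys = np.array([[0.0], [1.0], [5.0], [9.0], [11.0]])

# Min-max scaling by hand: (x - min) / (max - min), computed per column
col_min = keys.min(axis=0)
col_max = keys.max(axis=0)
manual = (keys - col_min) / (col_max - col_min)

# The same transformation using scikit-learn
scaled = MinMaxScaler().fit_transform(keys)

print(np.allclose(manual, scaled))
print(scaled.ravel().round(6))
```

Note that min-max scaling depends on the observed minimum and maximum, so a single extreme value in a column such as 'duration_min' would compress the rest of that column toward one end of the range.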


Check for Missingness 

In [8]:
for col in charts.columns:
    num_NaN = charts[col].isna().sum()
    print("Are there any NaN's?")
    if num_NaN == 0:
        print('No')
    else:
        # report the column and how many values are missing
        print(f'Yes: {num_NaN} in {col}')

Are there any NaN's?
No
Are there any NaN's?
No
Are there any NaN's?
No
Are there any NaN's?
No
Are there any NaN's?
No
Are there any NaN's?
No
Are there any NaN's?
No
Are there any NaN's?
No
Are there any NaN's?
No
Are there any NaN's?
No
Are there any NaN's?
No
Are there any NaN's?
No
Are there any NaN's?
No
Are there any NaN's?
No
Are there any NaN's?
No
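
A more compact version of this check is sketched below. It reports the number of missing values per column in one call, so each count is tied to a column name. The `toy` frame is made up for illustration and deliberately contains one NaN.

```python
import numpy as np
import pandas as pd

# Toy frame, made up for illustration; 'tempo' has one missing value
toy = pd.DataFrame({
    'tempo':  [120.0, np.nan, 95.0],
    'energy': [0.8, 0.5, 0.3],
})

# Number of missing values in each column, indexed by column name
nan_counts = toy.isna().sum()

# Single yes/no answer for the whole frame
has_missing = bool(toy.isna().any().any())

print(nan_counts)
print(has_missing)
```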
