## Spotify Top 50 Analysis

### Introduction

This project analyzes the top 50 tracks on Spotify from 2021. The goal is to identify the characteristics of the songs that people listen to most, so that songwriters have a template of characteristics they could build into their own songs to increase the likelihood of writing a successful one, based on listener data.

Keep in mind that trends change over time. This is the most up-to-date data available, and following it will not guarantee a hit song; it only offers an inside look at what listeners are listening to most. Additionally, as with all data, and as you will see from my analysis, there will be outliers. This is just a look at the average of these songs.

### A first look into the data

In [10]:
import pandas as pd
import matplotlib.pyplot as plt

In [11]:
top_50_initial = pd.read_csv('../input/spotify-top-50-songs-in-2021/spotify_top50_2021.csv')
print(top_50_initial.head(3))

### Determining Useful Data

### Columns:
- id - The index of the dataset
- artist_name - name of the artist
- track_name - name of the track
- track_id
- popularity - The higher the value, the more popular the song is
- danceability - The higher the value, the easier it is to dance to this song
- energy - The energy of a song. The higher the value, the more energetic the song
- key - the key the track is in. Integers map to pitches using standard Pitch Class notation. E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1 (range: -1; 11)
- loudness - The higher the value, the louder the song (dB)
- mode - indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0
- speechiness - The higher the value the more spoken word the song contains
- acousticness - The higher the value the more acoustic the song is
- instrumentalness - predicts whether a track contains no vocals. The closer the value is to 1.0, the more likely the song is instrumental
- liveness - The higher the value, the more likely the song is a live recording
- valence - The higher the value, the more positive mood for the song
- tempo - the overall estimated tempo of a track in beats per minute (BPM)
- duration_ms - duration of the song in ms
- time_signature - an estimated time signature. The time signature (meter) is a notational convention to specify how many beats are in each bar (or measure). The time signature ranges from 3 to 7, indicating time signatures of "3/4" to "7/4"

#### To determine useful data and which columns I want to use, here, I will outline the questions I want to answer:

1. What is the optimal track length?
2. What is the optimal danceability score?
3. What is the most common key?


#### Convert ms to minutes

For ease of use, I will also add a column for duration in minutes, as that is the most typical measurement of time for songs.

In [12]:
top_50_initial["duration_mins"] = top_50_initial.duration_ms.apply(lambda x: round(x / 60000, 2) )
print(top_50_initial.duration_mins.head())
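
The same conversion can also be written without `apply`, by dividing the whole column at once. The sketch below is only an illustration: the `sample` dataframe and its values are made up and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical durations in milliseconds, made up for illustration only.
sample = pd.DataFrame({"duration_ms": [180000, 215500, 141806]})

# Vectorized conversion: divide the whole column at once, then round to 2 decimals.
sample["duration_mins"] = (sample["duration_ms"] / 60000).round(2)

print(sample)
```

Both forms should give the same column; the vectorized one is usually shorter to read and faster on large data.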

Next I will select the columns I need, leaving out data that is unimportant for my analysis. The columns I will keep in my new dataframe are:
- artist_name
- track_name
- popularity
- key
- duration_mins
- danceability
- tempo
- speechiness
- energy

In [13]:
top_50 = top_50_initial[['artist_name', 'track_name', 'popularity', 'key', 'duration_mins', 'danceability', 'tempo', 'speechiness', 'energy']].copy()
print(top_50.head())

Here I will also sort values by popularity

In [14]:
top_50 = top_50.sort_values('popularity', ascending=False, ignore_index=True)
print(top_50.head(3))

In [15]:
print("The 10 most popular songs for 2021 were:")
for i in range(0, 10):
    print(f"{i + 1}. {top_50.track_name[i]} by {top_50.artist_name[i]} with a popularity score of {top_50.popularity[i]}")

### 1. What is the optimal track length?

In [16]:
track_duration = top_50[['track_name', 'artist_name', 'duration_mins', 'popularity']].copy()
print(track_duration.head())

In [17]:
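# Look up the row with the longest duration so the name and artist come from the same track
max_duration = track_duration.loc[track_duration.duration_mins.idxmax()]
print(f"The longest track in the dataset is {max_duration.track_name} by {max_duration.artist_name} at {max_duration.duration_mins} minutes")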

In [18]:
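# Look up the row with the shortest duration so the name and artist come from the same track
min_duration = track_duration.loc[track_duration.duration_mins.idxmin()]
print(f"The shortest track in the dataset is {min_duration.track_name} by {min_duration.artist_name} at {min_duration.duration_mins} minutes")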
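
One caveat worth illustrating when looking for the longest and shortest track: `DataFrame.max()` and `DataFrame.min()` work on each column independently, so the name, artist, and duration they return can come from three different rows. The sketch below uses a made-up `tracks` dataframe (hypothetical names and values) to compare that with a row lookup through `idxmax` and `idxmin`.

```python
import pandas as pd

# Hypothetical tracks, made up for illustration only.
tracks = pd.DataFrame({
    "track_name": ["Zebra", "Apple", "Mango"],
    "artist_name": ["Artist A", "Artist B", "Artist C"],
    "duration_mins": [2.10, 4.85, 3.20],
})

# Column-wise maximum: each value may come from a different row.
column_max = tracks.max()

# Row lookup: idxmax/idxmin return the index label of the longest/shortest track.
longest = tracks.loc[tracks["duration_mins"].idxmax()]
shortest = tracks.loc[tracks["duration_mins"].idxmin()]

print(column_max["track_name"], column_max["duration_mins"])
print(longest["track_name"], longest["duration_mins"])
print(shortest["track_name"], shortest["duration_mins"])
```

Here the column-wise maximum pairs the alphabetically last name with the longest duration, even though they belong to different tracks, while the row lookup keeps each track's fields together.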

In [19]:
average_duration = track_duration.duration_mins.mean()
print(f"The average track duration is: {average_duration:.2f} minutes")

In [20]:
plt.scatter(track_duration.duration_mins, track_duration.popularity)
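
A scatter plot can be paired with a single number that summarizes how strongly the two columns move together. As a rough sketch, assuming a Pearson correlation is an acceptable summary here, `Series.corr` computes it. The `sample` values below are made up for illustration and are not results from the dataset.

```python
import pandas as pd

# Hypothetical values, made up for illustration only.
sample = pd.DataFrame({
    "duration_mins": [2.4, 2.9, 3.1, 3.3, 4.2, 5.0],
    "popularity": [88, 95, 93, 96, 90, 84],
})

# Pearson correlation coefficient between duration and popularity (ranges from -1 to 1).
r = sample["duration_mins"].corr(sample["popularity"])
print(f"Pearson r = {r:.2f}")
```

With only 50 tracks, a coefficient like this is best read as a hint rather than a firm conclusion.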

In [21]:
print(track_duration.sort_values('popularity', ignore_index=True, ascending=False).head())

As you can see from the scatter graph, the majority of these tracks are between 2.5 and 3.5 minutes. Let's take a closer look at the values.

In [22]:
track_duration['duration_rounded'] = track_duration.duration_mins.apply(lambda x: round(x, 0))
num_duration = track_duration.groupby('duration_rounded').track_name.count().reset_index()

num_duration.rename(columns={'track_name': 'num_songs'}, inplace=True)
# Divide by the number of tracks in the dataframe rather than a hardcoded 50
num_duration['percentage'] = num_duration.num_songs.apply(lambda x: round(x / len(track_duration) * 100))
print(num_duration)

Percentage breakdown of duration rounded to the nearest minute: 
- 2 ==> 8%
- 3 ==> 58%
- 4 ==> 30%
- 5 ==> 4%
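
Rounding to the nearest minute is one way to group the tracks. Another option, sketched below, is to measure the 2.5 to 3.5 minute range directly with `Series.between`. The `durations` values are made up for illustration. Note also that Python's built-in `round` rounds halves to the nearest even number, so a track of exactly 2.5 minutes would be grouped under 2 and one of exactly 3.5 minutes under 4.

```python
import pandas as pd

# Hypothetical durations in minutes, made up for illustration only.
durations = pd.Series([2.2, 2.6, 2.9, 3.1, 3.4, 3.6, 4.1, 5.0])

# Built-in round() uses round-half-to-even, so 2.5 -> 2 and 3.5 -> 4.
print(round(2.5), round(3.5))

# Boolean mask of tracks inside the closed interval [2.5, 3.5].
in_range = durations.between(2.5, 3.5)

# The mean of a boolean mask is the fraction of True values.
share = in_range.mean() * 100
print(f"{share:.0f}% of tracks are between 2.5 and 3.5 minutes")
```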

**The optimal track length is between 2.5 and 3.5 minutes**, the range that contains 58% of the top 50 tracks.

### 2. What is the optimal danceability score?

In [23]:
track_danceability = top_50[['track_name', 'artist_name', 'danceability', 'popularity']].copy().sort_values('popularity', ignore_index=True, ascending=False)
print(track_danceability.head())

In [24]:
print(f'The range in danceability scores in the dataset is between {track_danceability.danceability.min()} and {track_danceability.danceability.max()}')
# Format to 3 decimals to avoid floating-point noise in the subtraction
print(f'This gives a range of {track_danceability.danceability.max() - track_danceability.danceability.min():.3f}')

In [25]:
print(f'The average danceability score is: {track_danceability.danceability.mean():.3f}')

In [26]:
plt.scatter(track_danceability.danceability,track_danceability.popularity)

In [27]:
danceability_popularity = top_50[['track_name', 'artist_name', 'danceability', 'popularity']].copy()
print(danceability_popularity.head())

In [28]:
danceability_popularity['danceability_rounded'] = danceability_popularity.danceability.apply(lambda x: round(x, 1))
danceability_dist = danceability_popularity.groupby('danceability_rounded').track_name.count().reset_index()
danceability_dist.rename(columns={
    'track_name': 'count'
}, inplace=True)
print(danceability_dist)
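
Rounding to one decimal groups scores around 0.6, 0.7, and so on. If the aim is to count tracks in a specific range such as 0.65 to 0.75, explicit bin edges with `pd.cut` are one alternative. This is only a sketch: the `scores` values and the chosen `edges` are assumptions made up for illustration.

```python
import pandas as pd

# Hypothetical danceability scores, made up for illustration only.
scores = pd.Series([0.52, 0.61, 0.66, 0.70, 0.74, 0.78, 0.85])

# Explicit bin edges; each bin is closed on the right, e.g. (0.65, 0.75].
edges = [0.45, 0.55, 0.65, 0.75, 0.85, 0.95]
binned = pd.cut(scores, bins=edges)

# Count tracks per bin and list the bins in ascending order.
counts = binned.value_counts().sort_index()
print(counts)
```

The bin with the highest count can then be read off with `counts.idxmax()`.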

**The optimal danceability score is between 0.65 and 0.75**

### 3. What is the most common key?

In [29]:
track_keys = top_50[['track_name', 'artist_name', 'key']].copy()
print(track_keys.head())

In [30]:
track_keys_count = track_keys.groupby('key').track_name.count().reset_index()
track_keys_count.rename(columns={
    'track_name': 'count'
}, inplace=True)
print(track_keys_count.sort_values('count', ascending=False, ignore_index=True))

#### Keys:
- 0 ==> C
- 1 ==> C#
- 2 ==> D
- 3 ==> D#
- 4 ==> E
- 5 ==> F
- 6 ==> F#
- 7 ==> G
- 8 ==> G#
- 9 ==> A
- 10 ==> A#
- 11 ==> B
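
The lookup above can also be applied in code, so the key counts are printed with note names instead of integers. The helper `key_to_name` and the `keys` values below are hypothetical and made up for illustration; only the integer-to-pitch mapping comes from the list above.

```python
import pandas as pd

# Pitch Class notation: the position in this list equals the integer key value.
PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def key_to_name(key: int) -> str:
    """Translate an integer key (0-11) to a note name; anything else (e.g. -1) is unknown."""
    if not 0 <= key <= 11:
        return "unknown"
    return PITCH_CLASSES[key]

# Hypothetical key values, made up for illustration only.
keys = pd.Series([0, 1, 0, 7, 11, 0, 1])

# Label each key, then count occurrences with the most common first.
key_counts = keys.map(key_to_name).value_counts()
print(key_counts)
```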

**The most common key is C**

### Conclusion

#### Three properties of the most popular songs are:
- Track Length between 2.5 and 3.5 minutes
- Danceability score from 0.65 to 0.75
- In the key of C