# Data Processing with NumPy and Pandas
# Top 50 Spotify Tracks 2020

https://intra.turingcollege.com/hardskills/data-processing-with-numpy-and-pandas-v3
https://www.kaggle.com/datasets/atillacolak/top-50-spotify-tracks-2020

About this Part
Congrats! You've nearly completed the sprint! Great job! In this part, all the skills you've learned will be put to the test. As the final assignment of this Sprint, you'll analyze the Top 50 Spotify Tracks 2020 dataset. To complete this task, you will have to apply everything you've learned so far about Data Analysis, Linear Algebra, NumPy, and Pandas.

P.S. We don't expect perfection - as you progress through the course you will continue to improve and there will be plenty more opportunities to practice and apply your skills in the upcoming sprints. For now just use what you've learned and try your best!

Context
Imagine you're a data analyst working for Spotify. Your team is responsible for content analysis and in this quarter you've decided to analyze Spotify's top hits to quantify what makes a hit song. Your team's product manager has many ideas and has prepared a list of questions (requirements) that she wants you to answer. After reviewing the list of over 20 questions, you are not in a good mood - it will take a couple of days to get all the answers. Luckily, a few days ago, an experienced data scientist working in your team queried the top 50 tracks for her machine learning project and agreed to share the data with you. This is a great help - your SQL skills are not too sharp yet, and you don't yet know where to find all the relevant tables in your data warehouse. With this dataset, you are confident that you'll be able to answer all of your PM's questions, plus maybe even look into some additional points of interest.

Objectives for this Part
Practice working with data from Kaggle.
Practice performing basic EDA.
Practice reading data, performing queries and filtering data using Pandas.

Evaluation Criteria
Adherence to the requirements. How well did you meet the requirements?
Code quality. Was your code well-structured? Did you use the correct levels of abstraction? Did you remove commented-out and unused code? Did you adhere to PEP 8?
Code performance. Did you use suitable algorithms and data structures to solve the problems?
Project Review
During your project review, you should present your project as if talking to a product manager and senior data analyst working in your team. You will have to find the right balance between explaining the business side and the technical aspects of your work. You can assume that both of your colleagues have a strong understanding of and are very interested in the business aspect of your project, so be sure to clearly explain what new insights you've found while analyzing the dataset and which directions look the most promising for further research. However, you should also spend time explaining the technical aspects of your work, especially the more complex or unconventional choices.

During a project review, you may get asked questions that test your understanding of covered topics.

What advantages do NumPy arrays have over Python lists?
What makes computation on NumPy arrays so fast?
What are the rules of broadcasting?
What advantages does Pandas have over NumPy?
What is a DataFrame in Pandas?

### Requirements
Download the data from Spotify Top 50 Tracks of 2020 dataset.
Load the data using Pandas.
Perform data cleaning by:
Handling missing values.
Removing duplicate samples and features.
Treating the outliers.
Perform exploratory data analysis. Your analysis should provide answers to these questions:
How many observations are there in this dataset?
How many features does this dataset have?
Which of the features are categorical?
Which of the features are numeric?
Are there any artists that have more than 1 popular track? If yes, which and how many?
Who was the most popular artist?
How many artists in total have their songs in the top 50?
Are there any albums that have more than 1 popular track? If yes, which and how many?
How many albums in total have their songs in the top 50?
Which tracks have a danceability score above 0.7?
Which tracks have a danceability score below 0.4?
Which tracks have their loudness above -5?
Which tracks have their loudness below -8?
Which track is the longest?
Which track is the shortest?
Which genre is the most popular?
Which genres have just one song on the top 50?
How many genres in total are represented in the top 50?
Which features are strongly positively correlated?
Which features are strongly negatively correlated?
Which features are not correlated?
How does the danceability score compare between Pop, Hip-Hop/Rap, Dance/Electronic, and Alternative/Indie genres?
How does the loudness score compare between Pop, Hip-Hop/Rap, Dance/Electronic, and Alternative/Indie genres?
How does the acousticness score compare between Pop, Hip-Hop/Rap, Dance/Electronic, and Alternative/Indie genres?
Provide clear explanations in your notebook. Your explanations should inform the reader what you are trying to achieve, the results you got, and what these results mean.
Provide suggestions for how your analysis could be improved.

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('spotifytoptracks.csv', index_col=0)

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 50 entries, 0 to 49
Data columns (total 16 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   artist            50 non-null     object 
 1   album             50 non-null     object 
 2   track_name        50 non-null     object 
 3   track_id          50 non-null     object 
 4   energy            50 non-null     float64
 5   danceability      50 non-null     float64
 6   key               50 non-null     int64  
 7   loudness          50 non-null     float64
 8   acousticness      50 non-null     float64
 9   speechiness       50 non-null     float64
 10  instrumentalness  50 non-null     float64
 11  liveness          50 non-null     float64
 12  valence           50 non-null     float64
 13  tempo             50 non-null     float64
 14  duration_ms       50 non-null     int64  
 15  genre             50 non-null     object 
dtypes: float64(9), int64(2), object(5)
memory usage: 6.

In [4]:
df.head()

Unnamed: 0,artist,album,track_name,track_id,energy,danceability,key,loudness,acousticness,speechiness,instrumentalness,liveness,valence,tempo,duration_ms,genre
0,The Weeknd,After Hours,Blinding Lights,0VjIjW4GlUZAMYd2vXMi3b,0.73,0.514,1,-5.934,0.00146,0.0598,9.5e-05,0.0897,0.334,171.005,200040,R&B/Soul
1,Tones And I,Dance Monkey,Dance Monkey,1rgnBhdG2JDFTbYkYRZAku,0.593,0.825,6,-6.401,0.688,0.0988,0.000161,0.17,0.54,98.078,209755,Alternative/Indie
2,Roddy Ricch,Please Excuse Me For Being Antisocial,The Box,0nbXyq5TXYPCO7pr3N8S4I,0.586,0.896,10,-6.687,0.104,0.0559,0.0,0.79,0.642,116.971,196653,Hip-Hop/Rap
3,SAINt JHN,Roses (Imanbek Remix),Roses - Imanbek Remix,2Wo6QQD1KMDWeFkkjLqwx5,0.721,0.785,8,-5.457,0.0149,0.0506,0.00432,0.285,0.894,121.962,176219,Dance/Electronic
4,Dua Lipa,Future Nostalgia,Don't Start Now,3PfIrDoz19wz7qK7tYeu62,0.793,0.793,11,-4.521,0.0123,0.083,0.0,0.0951,0.679,123.95,183290,Nu-disco


In [5]:
df['key'].info()

<class 'pandas.core.series.Series'>
Index: 50 entries, 0 to 49
Series name: key
Non-Null Count  Dtype
--------------  -----
50 non-null     int64
dtypes: int64(1)
memory usage: 800.0 bytes


### Handling missing values.

In [6]:
df[df.isna().any(axis=1)]

Unnamed: 0,artist,album,track_name,track_id,energy,danceability,key,loudness,acousticness,speechiness,instrumentalness,liveness,valence,tempo,duration_ms,genre


The filter returns no rows, so the dataset has no missing values.
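A per-column count can complement the row filter above, because it shows where gaps sit when there are any. The sketch below uses a small made-up frame (the `toy` data is hypothetical, not taken from the Spotify file) to show the pattern.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real data; values are made up.
toy = pd.DataFrame({
    "artist": ["A", "B", None],
    "energy": [0.7, np.nan, 0.5],
    "genre": ["Pop", "Pop", "Rock"],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Number of rows that contain at least one missing value.
rows_with_missing = int(toy.isna().any(axis=1).sum())
print(rows_with_missing)
```

On the real DataFrame the same two expressions would be applied to `df`; given the empty result above, every count should come out as zero.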

### Removing duplicate samples and features.

In [7]:
df[df.duplicated()]

Unnamed: 0,artist,album,track_name,track_id,energy,danceability,key,loudness,acousticness,speechiness,instrumentalness,liveness,valence,tempo,duration_ms,genre


In [8]:
df.drop_duplicates()

Unnamed: 0,artist,album,track_name,track_id,energy,danceability,key,loudness,acousticness,speechiness,instrumentalness,liveness,valence,tempo,duration_ms,genre
0,The Weeknd,After Hours,Blinding Lights,0VjIjW4GlUZAMYd2vXMi3b,0.73,0.514,1,-5.934,0.00146,0.0598,9.5e-05,0.0897,0.334,171.005,200040,R&B/Soul
1,Tones And I,Dance Monkey,Dance Monkey,1rgnBhdG2JDFTbYkYRZAku,0.593,0.825,6,-6.401,0.688,0.0988,0.000161,0.17,0.54,98.078,209755,Alternative/Indie
2,Roddy Ricch,Please Excuse Me For Being Antisocial,The Box,0nbXyq5TXYPCO7pr3N8S4I,0.586,0.896,10,-6.687,0.104,0.0559,0.0,0.79,0.642,116.971,196653,Hip-Hop/Rap
3,SAINt JHN,Roses (Imanbek Remix),Roses - Imanbek Remix,2Wo6QQD1KMDWeFkkjLqwx5,0.721,0.785,8,-5.457,0.0149,0.0506,0.00432,0.285,0.894,121.962,176219,Dance/Electronic
4,Dua Lipa,Future Nostalgia,Don't Start Now,3PfIrDoz19wz7qK7tYeu62,0.793,0.793,11,-4.521,0.0123,0.083,0.0,0.0951,0.679,123.95,183290,Nu-disco
5,DaBaby,BLAME IT ON BABY,ROCKSTAR (feat. Roddy Ricch),7ytR5pFWmSjzHJIeQkgog4,0.69,0.746,11,-7.956,0.247,0.164,0.0,0.101,0.497,89.977,181733,Hip-Hop/Rap
6,Harry Styles,Fine Line,Watermelon Sugar,6UelLqGlWMcVH1E5c4H7lY,0.816,0.548,0,-4.209,0.122,0.0465,0.0,0.335,0.557,95.39,174000,Pop
7,Powfu,death bed (coffee for your head),death bed (coffee for your head),7eJMfftS33KTjuF7lTsMCx,0.431,0.726,8,-8.765,0.731,0.135,0.0,0.696,0.348,144.026,173333,Hip-Hop/Rap
8,Trevor Daniel,Nicotine,Falling,2rRJrJEo19S2J82BDsQ3F7,0.43,0.784,10,-8.756,0.123,0.0364,0.0,0.0887,0.236,127.087,159382,R&B/Hip-Hop alternative
9,Lewis Capaldi,Divinely Uninspired To A Hellish Extent,Someone You Loved,7qEHsqek33rTcFNT9PFqLf,0.405,0.501,1,-5.679,0.751,0.0319,0.0,0.105,0.446,109.891,182161,Alternative/Indie


There are no duplicate rows, so drop_duplicates() returns the data unchanged.
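The requirement also mentions duplicate features, which the row check does not cover. One possible way to look for them, sketched here on a hypothetical toy frame with a deliberately repeated column, is to transpose the frame and reuse `duplicated()`. The extra check on `track_id` is an assumption about what would count as a duplicate sample if two rows differed only in some other field.

```python
import pandas as pd

# Hypothetical toy frame; "energy_copy" deliberately repeats "energy".
toy = pd.DataFrame({
    "track_id": ["x1", "x2", "x3"],
    "energy": [0.7, 0.6, 0.5],
    "energy_copy": [0.7, 0.6, 0.5],
})

# Columns whose values exactly repeat an earlier column.
is_duplicate_column = toy.T.duplicated().to_numpy()
duplicate_columns = toy.columns[is_duplicate_column].tolist()
print(duplicate_columns)

# Rows that share a track_id with an earlier row.
duplicate_ids = int(toy.duplicated(subset="track_id").sum())
print(duplicate_ids)
```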

### Treating the outliers.

In [9]:
df.describe()

Unnamed: 0,energy,danceability,key,loudness,acousticness,speechiness,instrumentalness,liveness,valence,tempo,duration_ms
count,50.0,50.0,50.0,50.0,50.0,50.0,50.0,50.0,50.0,50.0,50.0
mean,0.6093,0.71672,5.72,-6.2259,0.256206,0.124158,0.015962,0.196552,0.55571,119.69046,199955.36
std,0.154348,0.124975,3.709007,2.349744,0.26525,0.116836,0.094312,0.17661,0.216386,25.414778,33996.122488
min,0.225,0.351,0.0,-14.454,0.00146,0.029,0.0,0.0574,0.0605,75.801,140526.0
25%,0.494,0.6725,2.0,-7.5525,0.0528,0.048325,0.0,0.09395,0.434,99.55725,175845.5
50%,0.597,0.746,6.5,-5.9915,0.1885,0.07005,0.0,0.111,0.56,116.969,197853.5
75%,0.72975,0.7945,8.75,-4.2855,0.29875,0.1555,2e-05,0.27125,0.72625,132.317,215064.0
max,0.855,0.935,11.0,-3.28,0.934,0.487,0.657,0.792,0.925,180.067,312820.0


In [10]:
num_cols = df.describe().columns

In [11]:
abs(df[num_cols] - df[num_cols].mean()) > (3 * df[num_cols].std())

Unnamed: 0,energy,danceability,key,loudness,acousticness,speechiness,instrumentalness,liveness,valence,tempo,duration_ms
0,False,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,True,False,False,False
3,False,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,False
5,False,False,False,False,False,False,False,False,False,False,False
6,False,False,False,False,False,False,False,False,False,False,False
7,False,False,False,False,False,False,False,False,False,False,False
8,False,False,False,False,False,False,False,False,False,False,False
9,False,False,False,False,False,False,False,False,False,False,False


In [12]:
df[(abs(df[num_cols] - df[num_cols].mean()) > (3 * df[num_cols].std())).any(axis=1)]

Unnamed: 0,artist,album,track_name,track_id,energy,danceability,key,loudness,acousticness,speechiness,instrumentalness,liveness,valence,tempo,duration_ms,genre
2,Roddy Ricch,Please Excuse Me For Being Antisocial,The Box,0nbXyq5TXYPCO7pr3N8S4I,0.586,0.896,10,-6.687,0.104,0.0559,0.0,0.79,0.642,116.971,196653,Hip-Hop/Rap
19,Future,High Off Life,Life Is Good (feat. Drake),1K5KBOgreBi5fkEHvg5ap3,0.574,0.795,2,-6.903,0.067,0.487,0.0,0.15,0.537,142.053,237918,Hip-Hop/Rap
24,Billie Eilish,everything i wanted,everything i wanted,3ZCTVFBt2Brf31RLEnCkWJ,0.225,0.704,6,-14.454,0.902,0.0994,0.657,0.106,0.243,120.006,245426,Pop
41,Black Eyed Peas,Translation,RITMO (Bad Boys For Life),4NCsrTzgVfsDo8nWyP8PPc,0.704,0.723,10,-7.088,0.0259,0.0571,0.00109,0.792,0.684,105.095,214935,Pop
49,Travis Scott,ASTROWORLD,SICKO MODE,2xLMifQCjDGFmkHkpNLD9h,0.73,0.834,8,-3.714,0.00513,0.222,0.0,0.124,0.446,155.008,312820,Hip-Hop/Rap


Five tracks have at least one value more than three standard deviations from the column mean. All of these values look plausible for real tracks, so we treat the dataset as having no outliers that need to be removed.
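The table of flagged rows does not say which feature triggered the flag. A small helper along the following lines could report that. It applies the same three-sigma rule as above; the function name `outlier_columns` and the toy data are hypothetical.

```python
import pandas as pd


def outlier_columns(frame: pd.DataFrame, threshold: float = 3.0) -> dict:
    """Map each flagged row label to the numeric columns that exceed the threshold."""
    numeric = frame.select_dtypes(include="number")
    # Distance from the column mean in units of the sample standard deviation.
    z_scores = (numeric - numeric.mean()) / numeric.std()
    mask = z_scores.abs() > threshold
    return {
        row: mask.columns[flags.to_numpy()].tolist()
        for row, flags in mask.iterrows()
        if flags.any()
    }


# Hypothetical toy data: one extreme "liveness" value among 20 tracks.
toy = pd.DataFrame({
    "track_name": [f"track {i}" for i in range(20)],
    "liveness": [0.1] * 19 + [0.9],
    "tempo": [100.0 + i for i in range(20)],
})

flagged = outlier_columns(toy)
print(flagged)
```

With only 50 observations a three-sigma rule is a rough screen, so an interquartile-range rule may be worth comparing against it.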

## Perform exploratory data analysis. Your analysis should provide answers to these questions:

### How many observations are there in this dataset?

In [13]:
df.shape[0]

50

### How many features does this dataset have?

In [14]:
df.shape[1]

16

In [15]:
df.columns

Index(['artist', 'album', 'track_name', 'track_id', 'energy', 'danceability',
       'key', 'loudness', 'acousticness', 'speechiness', 'instrumentalness',
       'liveness', 'valence', 'tempo', 'duration_ms', 'genre'],
      dtype='object')

### Which of the features are categorical?

In [16]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 50 entries, 0 to 49
Data columns (total 16 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   artist            50 non-null     object 
 1   album             50 non-null     object 
 2   track_name        50 non-null     object 
 3   track_id          50 non-null     object 
 4   energy            50 non-null     float64
 5   danceability      50 non-null     float64
 6   key               50 non-null     int64  
 7   loudness          50 non-null     float64
 8   acousticness      50 non-null     float64
 9   speechiness       50 non-null     float64
 10  instrumentalness  50 non-null     float64
 11  liveness          50 non-null     float64
 12  valence           50 non-null     float64
 13  tempo             50 non-null     float64
 14  duration_ms       50 non-null     int64  
 15  genre             50 non-null     object 
dtypes: float64(9), int64(2), object(5)
memory usage: 6.

In [17]:
df['genre'].dtype

dtype('O')

In [18]:
df['genre'].apply(type)[0]

str

In [19]:
cols_categorical = [col for col in df.columns if df[col].dtype == "object"]

In [20]:
cols_categorical

['artist', 'album', 'track_name', 'track_id', 'genre']

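Categorical feature: 'genre'. The other object columns ('artist', 'album', 'track_name', 'track_id') are treated here as names and identifiers rather than categories.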

### Which of the features are numeric?

In [21]:
set(df.columns) - set(cols_categorical)

{'acousticness',
 'danceability',
 'duration_ms',
 'energy',
 'instrumentalness',
 'key',
 'liveness',
 'loudness',
 'speechiness',
 'tempo',
 'valence'}

Numeric features: 'acousticness', 'danceability', 'duration_ms', 'energy', 'instrumentalness', 'key', 'liveness', 'loudness', 'speechiness', 'tempo', 'valence'
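The split above compares each column's dtype with "object". An alternative sketch uses `select_dtypes`, which selects on numeric versus non-numeric dtypes directly and so should also work in setups where text columns use a dedicated string dtype instead of `object`. The toy frame is hypothetical.

```python
import pandas as pd

# Hypothetical toy frame with two text columns and two numeric columns.
toy = pd.DataFrame({
    "artist": ["A", "B"],
    "genre": ["Pop", "Rock"],
    "energy": [0.7, 0.5],
    "key": [1, 6],
})

# Numeric columns (integers and floats).
numeric_cols = toy.select_dtypes(include="number").columns.tolist()

# Everything else: candidates for categorical or identifier columns.
non_numeric_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(non_numeric_cols)
```

Unlike the set difference used above, this keeps the original column order.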

### Are there any artists that have more than 1 popular track? If yes, which and how many?

In [22]:
df['artist'].value_counts()

artist
Dua Lipa           3
Billie Eilish      3
Travis Scott       3
Harry Styles       2
Lewis Capaldi      2
Justin Bieber      2
Post Malone        2
The Weeknd         1
Powfu              1
DaBaby             1
Roddy Ricch        1
SAINt JHN          1
Tones And I        1
Arizona Zervas     1
Lil Mosey          1
KAROL G            1
Drake              1
Doja Cat           1
Future             1
Maroon 5           1
Jawsh 685          1
Topic              1
24kGoldn           1
Trevor Daniel      1
Shawn Mendes       1
Cardi B            1
Eminem             1
Surfaces           1
BTS                1
BENEE              1
Surf Mesa          1
Lady Gaga          1
Maluma             1
Regard             1
Black Eyed Peas    1
THE SCOTTS         1
Bad Bunny          1
Juice WRLD         1
Ariana Grande      1
JP Saxe            1
Name: count, dtype: int64

In [23]:
df[df['artist'].map(df['artist'].value_counts()) > 1]

Unnamed: 0,artist,album,track_name,track_id,energy,danceability,key,loudness,acousticness,speechiness,instrumentalness,liveness,valence,tempo,duration_ms,genre
4,Dua Lipa,Future Nostalgia,Don't Start Now,3PfIrDoz19wz7qK7tYeu62,0.793,0.793,11,-4.521,0.0123,0.083,0.0,0.0951,0.679,123.95,183290,Nu-disco
6,Harry Styles,Fine Line,Watermelon Sugar,6UelLqGlWMcVH1E5c4H7lY,0.816,0.548,0,-4.209,0.122,0.0465,0.0,0.335,0.557,95.39,174000,Pop
9,Lewis Capaldi,Divinely Uninspired To A Hellish Extent,Someone You Loved,7qEHsqek33rTcFNT9PFqLf,0.405,0.501,1,-5.679,0.751,0.0319,0.0,0.105,0.446,109.891,182161,Alternative/Indie
12,Post Malone,Hollywood's Bleeding,Circles,21jGcNKet2qwijlDFuPiPb,0.762,0.695,0,-3.497,0.192,0.0395,0.00244,0.0863,0.553,120.042,215280,Pop/Soft Rock
14,Justin Bieber,Changes,Intentions (feat. Quavo),4umIPjkehX1r7uhmGvXiSV,0.546,0.806,9,-6.637,0.3,0.0575,0.0,0.102,0.874,147.986,212867,Pop
16,Lewis Capaldi,Divinely Uninspired To A Hellish Extent (Exten...,Before You Go,2gMXnyrvIjhVBUZwvLZDMP,0.575,0.459,3,-4.858,0.604,0.0573,0.0,0.0885,0.183,111.881,215107,Alternative/Indie
21,Harry Styles,Fine Line,Adore You,3jjujdWJ72nww5eGnfs2E7,0.771,0.676,8,-3.675,0.0237,0.0483,7e-06,0.102,0.569,99.048,207133,Pop
24,Billie Eilish,everything i wanted,everything i wanted,3ZCTVFBt2Brf31RLEnCkWJ,0.225,0.704,6,-14.454,0.902,0.0994,0.657,0.106,0.243,120.006,245426,Pop
26,Billie Eilish,"WHEN WE ALL FALL ASLEEP, WHERE DO WE GO?",bad guy,2Fxmhks0bxGSBdJ92vM42m,0.425,0.701,7,-10.965,0.328,0.375,0.13,0.1,0.562,135.128,194088,Electro-pop
30,Justin Bieber,Changes,Yummy,16wAOAZ2OkqoIDN7TpChjR,0.506,0.676,9,-6.652,0.345,0.0958,0.0,0.118,0.497,145.842,208520,Pop


In [24]:
tracks_by_artists = df['artist'].value_counts()

In [26]:
tracks_by_artists[tracks_by_artists>1]

artist
Dua Lipa         3
Billie Eilish    3
Travis Scott     3
Harry Styles     2
Lewis Capaldi    2
Justin Bieber    2
Post Malone      2
Name: count, dtype: int64

In [27]:
tracks_by_artists[tracks_by_artists>1].count()

np.int64(7)

In [28]:
tracks_by_artists[tracks_by_artists>1].index

Index(['Dua Lipa', 'Billie Eilish', 'Travis Scott', 'Harry Styles',
       'Lewis Capaldi', 'Justin Bieber', 'Post Malone'],
      dtype='object', name='artist')

7 artists have more than one track in the list: 'Dua Lipa', 'Billie Eilish' and 'Travis Scott' with 3 tracks each, and 'Harry Styles', 'Lewis Capaldi', 'Justin Bieber' and 'Post Malone' with 2 tracks each.

### Who was the most popular artist?

In [29]:
tracks_by_artists.max()

np.int64(3)

In [30]:
tracks_by_artists[tracks_by_artists == tracks_by_artists.max()].index

Index(['Dua Lipa', 'Billie Eilish', 'Travis Scott'], dtype='object', name='artist')

The most popular artists (with 3 songs in the list): 'Dua Lipa', 'Billie Eilish', 'Travis Scott'

### How many artists in total have their songs in the top 50?

In [31]:
df['artist'].nunique()

40

40 artists in total have their songs in the top 50

### Are there any albums that have more than 1 popular track? If yes, which and how many?

In [32]:
album_counts = df['album'].value_counts()

In [33]:
album_counts[album_counts>1]

album
Future Nostalgia        3
Hollywood's Bleeding    2
Fine Line               2
Changes                 2
Name: count, dtype: int64

4 albums have more than 1 popular track: 'Future Nostalgia' (3 tracks), "Hollywood's Bleeding" (2), 'Fine Line' (2) and 'Changes' (2).

### How many albums in total have their songs in the top 50?

In [34]:
df['album'].nunique()

45

45 albums in total have their songs in the top 50

### Which tracks have a danceability score above 0.7?

In [35]:
df['track_name'][df['danceability']>0.7]

1                                      Dance Monkey
2                                           The Box
3                             Roses - Imanbek Remix
4                                   Don't Start Now
5                      ROCKSTAR (feat. Roddy Ricch)
7                  death bed (coffee for your head)
8                                           Falling
10                                             Tusa
13                                  Blueberry Faygo
14                         Intentions (feat. Quavo)
15                                     Toosie Slide
17                                           Say So
18                                         Memories
19                       Life Is Good (feat. Drake)
20                 Savage Love (Laxed - Siren Beat)
22                                      Breaking Me
24                              everything i wanted
25                                         Señorita
26                                          bad guy
27          

In [36]:
df['track_name'][df['danceability']>0.7].values
df['track_name'][df['danceability']>0.7].count()

np.int64(32)

32 tracks have a danceability score above 0.7: 'Dance Monkey', 'The Box', 'Roses - Imanbek Remix',
       "Don't Start Now", 'ROCKSTAR (feat. Roddy Ricch)',
       'death bed (coffee for your head)', 'Falling', 'Tusa',
       'Blueberry Faygo', 'Intentions (feat. Quavo)', 'Toosie Slide',
       'Say So', 'Memories', 'Life Is Good (feat. Drake)',
       'Savage Love (Laxed - Siren Beat)', 'Breaking Me',
       'everything i wanted', 'Señorita', 'bad guy',
       'WAP (feat. Megan Thee Stallion)', 'Sunday Best',
       'Godzilla (feat. Juice WRLD)', 'Break My Heart', 'Dynamite',
       'Supalonely (feat. Gus Dapperton)',
       'Sunflower - Spider-Man: Into the Spider-Verse', 'Hawái',
       'Ride It', 'goosebumps', 'RITMO (Bad Boys For Life)', 'THE SCOTTS',
       'SICKO MODE'

### Which tracks have a danceability score below 0.4?

In [37]:
df['track_name'][df['danceability']<0.4]

44    lovely (with Khalid)
Name: track_name, dtype: object

Only 1 track has a danceability score below 0.4: 'lovely (with Khalid)'

### Which tracks have their loudness above -5?

In [38]:
df['track_name'][df['loudness']>-5]

4                                   Don't Start Now
6                                  Watermelon Sugar
10                                             Tusa
12                                          Circles
16                                    Before You Go
17                                           Say So
21                                        Adore You
23                           Mood (feat. iann dior)
31                                   Break My Heart
32                                         Dynamite
33                 Supalonely (feat. Gus Dapperton)
35                  Rain On Me (with Ariana Grande)
37    Sunflower - Spider-Man: Into the Spider-Verse
38                                            Hawái
39                                          Ride It
40                                       goosebumps
43                                          Safaera
48                                         Physical
49                                       SICKO MODE
Name: track_

In [39]:
df['track_name'][df['loudness']>-5].values

array(["Don't Start Now", 'Watermelon Sugar', 'Tusa', 'Circles',
       'Before You Go', 'Say So', 'Adore You', 'Mood (feat. iann dior)',
       'Break My Heart', 'Dynamite', 'Supalonely (feat. Gus Dapperton)',
       'Rain On Me (with Ariana Grande)',
       'Sunflower - Spider-Man: Into the Spider-Verse', 'Hawái',
       'Ride It', 'goosebumps', 'Safaera', 'Physical', 'SICKO MODE'],
      dtype=object)

In [40]:
df['track_name'][df['loudness']>-5].count()

np.int64(19)

19 tracks have their loudness above -5: "Don't Start Now", 'Watermelon Sugar', 'Tusa', 'Circles',
       'Before You Go', 'Say So', 'Adore You', 'Mood (feat. iann dior)',
       'Break My Heart', 'Dynamite', 'Supalonely (feat. Gus Dapperton)',
       'Rain On Me (with Ariana Grande)',
       'Sunflower - Spider-Man: Into the Spider-Verse', 'Hawái',
       'Ride It', 'goosebumps', 'Safaera', 'Physical', 'SICKO MODE'

### Which tracks have their loudness below -8?

In [41]:
df['track_name'][df['loudness']<-8]

7                   death bed (coffee for your head)
8                                            Falling
15                                      Toosie Slide
20                  Savage Love (Laxed - Siren Beat)
24                               everything i wanted
26                                           bad guy
36                               HIGHEST IN THE ROOM
44                              lovely (with Khalid)
47    If the World Was Ending - feat. Julia Michaels
Name: track_name, dtype: object

In [42]:
df['track_name'][df['loudness']<-8].values

array(['death bed (coffee for your head)', 'Falling', 'Toosie Slide',
       'Savage Love (Laxed - Siren Beat)', 'everything i wanted',
       'bad guy', 'HIGHEST IN THE ROOM', 'lovely (with Khalid)',
       'If the World Was Ending - feat. Julia Michaels'], dtype=object)

9 tracks have their loudness below -8: 'death bed (coffee for your head)', 'Falling', 'Toosie Slide',
       'Savage Love (Laxed - Siren Beat)', 'everything i wanted',
       'bad guy', 'HIGHEST IN THE ROOM', 'lovely (with Khalid)',
       'If the World Was Ending - feat. Julia Michaels'
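The four threshold questions above repeat the same filter with a different column and bound. One way to avoid the repetition, sketched here with a hypothetical helper named `tracks_where` and made-up toy values, is to build the boolean mask once and select with `.loc`.

```python
import pandas as pd


def tracks_where(frame: pd.DataFrame, column: str, above=None, below=None) -> list:
    """Return track names whose `column` value is strictly above and/or below the bounds."""
    mask = pd.Series(True, index=frame.index)
    if above is not None:
        mask &= frame[column] > above
    if below is not None:
        mask &= frame[column] < below
    # Select rows and the track_name column in a single label-based lookup.
    return frame.loc[mask, "track_name"].tolist()


# Hypothetical toy data, not the real tracks.
toy = pd.DataFrame({
    "track_name": ["a", "b", "c"],
    "loudness": [-4.0, -6.0, -9.0],
})

loud = tracks_where(toy, "loudness", above=-5)
quiet = tracks_where(toy, "loudness", below=-8)
print(loud, quiet)
```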

### Which track is the longest?

In [43]:
df['duration_ms'].idxmax()

np.int64(49)

In [44]:
# idxmax() returns an index label, so look the row up by label with .loc
df.loc[df['duration_ms'].idxmax()]

artist                        Travis Scott
album                           ASTROWORLD
track_name                      SICKO MODE
track_id            2xLMifQCjDGFmkHkpNLD9h
energy                                0.73
danceability                         0.834
key                                      8
loudness                            -3.714
acousticness                       0.00513
speechiness                          0.222
instrumentalness                       0.0
liveness                             0.124
valence                              0.446
tempo                              155.008
duration_ms                         312820
genre                          Hip-Hop/Rap
Name: 49, dtype: object

'SICKO MODE' is the longest track in the list: 312820 ms, or about 313 seconds (5 min 13 s)

### Which track is the shortest?

In [45]:
# idxmin() returns an index label, so look the row up by label with .loc
df.loc[df['duration_ms'].idxmin()]

artist                            24kGoldn
album               Mood (feat. iann dior)
track_name          Mood (feat. iann dior)
track_id            3tjFYV6RSFtuktYl3ZtYcq
energy                               0.722
danceability                           0.7
key                                      7
loudness                            -3.558
acousticness                         0.221
speechiness                         0.0369
instrumentalness                       0.0
liveness                             0.272
valence                              0.756
tempo                               90.989
duration_ms                         140526
genre                              Pop rap
Name: 23, dtype: object

'Mood (feat. iann dior)' is the shortest track: 140526 ms, or about 141 seconds (2 min 21 s)
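The conversion from milliseconds to minutes and seconds can be made explicit with a small function. This is a minimal sketch that rounds to the nearest second; the name `format_duration` is hypothetical, and the two inputs are the `duration_ms` values shown in the outputs above.

```python
def format_duration(duration_ms: int) -> str:
    """Convert milliseconds to a 'M min SS s' string, rounded to the nearest second."""
    total_seconds = round(duration_ms / 1000)
    minutes, seconds = divmod(total_seconds, 60)
    return f"{minutes} min {seconds:02d} s"


longest = format_duration(312820)   # duration_ms of 'SICKO MODE'
shortest = format_duration(140526)  # duration_ms of 'Mood (feat. iann dior)'
print(longest)   # 5 min 13 s
print(shortest)  # 2 min 21 s
```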

### Which genre is the most popular?

In [46]:
genre_count = df['genre'].value_counts()

In [47]:
genre_count

genre
Pop                                   14
Hip-Hop/Rap                           13
Dance/Electronic                       5
Alternative/Indie                      4
R&B/Soul                               2
 Electro-pop                           2
R&B/Hip-Hop alternative                1
Nu-disco                               1
Pop/Soft Rock                          1
Pop rap                                1
Hip-Hop/Trap                           1
Dance-pop/Disco                        1
Disco-pop                              1
Dreampop/Hip-Hop/R&B                   1
Alternative/reggaeton/experimental     1
Chamber pop                            1
Name: count, dtype: int64

In [48]:
genre_count[genre_count == genre_count.max()]

genre
Pop    14
Name: count, dtype: int64

Pop has the greatest number of songs in the list: 14.

### Which genres have just one song on the top 50?

In [49]:
genre_count[genre_count == 1].index

Index(['R&B/Hip-Hop alternative', 'Nu-disco', 'Pop/Soft Rock', 'Pop rap',
       'Hip-Hop/Trap', 'Dance-pop/Disco', 'Disco-pop', 'Dreampop/Hip-Hop/R&B',
       'Alternative/reggaeton/experimental', 'Chamber pop'],
      dtype='object', name='genre')

In [50]:
genre_count[genre_count == 1].count()

np.int64(10)

10 genres have just one song on the top 50: 'R&B/Hip-Hop alternative', 'Nu-disco', 'Pop/Soft Rock', 'Pop rap', 'Hip-Hop/Trap', 'Dance-pop/Disco', 'Disco-pop', 'Dreampop/Hip-Hop/R&B', 'Alternative/reggaeton/experimental', 'Chamber pop'

### How many genres in total are represented in the top 50?

In [51]:
df['genre'].nunique()

16

16 genres in total are represented in the top 50
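In the value_counts output above, one label ('Electro-pop') is printed with a leading space, which suggests that some raw genre strings may carry stray whitespace. That is an assumption based on the printed output only. If it holds, labels that differ only in whitespace would be counted as separate genres. The sketch below shows the effect on a hypothetical toy Series and how `str.strip()` removes it.

```python
import pandas as pd

# Hypothetical toy labels; two pairs differ only in surrounding whitespace.
genres = pd.Series(["Pop", " Electro-pop", "Electro-pop", "Pop "], name="genre")

# Counting the raw strings treats each whitespace variant as its own genre.
raw_unique = genres.nunique()

# Stripping whitespace first merges the variants.
clean_unique = genres.str.strip().nunique()

print(raw_unique, clean_unique)
```

Applying the same `str.strip()` to the real 'genre' column before counting would show whether the total of 16 changes.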

### Which features are strongly positively correlated?

In [52]:
cols_numeric = [col for col in df.columns if df[col].dtype != "object"]

In [53]:
cor_matrix = df[cols_numeric].corr()

In [54]:
cor_matrix

Unnamed: 0,energy,danceability,key,loudness,acousticness,speechiness,instrumentalness,liveness,valence,tempo,duration_ms
energy,1.0,0.152552,0.062428,0.79164,-0.682479,0.074267,-0.385515,0.069487,0.393453,0.075191,0.081971
danceability,0.152552,1.0,0.285036,0.167147,-0.359135,0.226148,-0.017706,-0.006648,0.479953,0.168956,-0.033763
key,0.062428,0.285036,1.0,-0.009178,-0.113394,-0.094965,0.020802,0.278672,0.120007,0.080475,-0.003345
loudness,0.79164,0.167147,-0.009178,1.0,-0.498695,-0.021693,-0.553735,-0.069939,0.406772,0.102097,0.06413
acousticness,-0.682479,-0.359135,-0.113394,-0.498695,1.0,-0.135392,0.352184,-0.128384,-0.243192,-0.241119,-0.010988
speechiness,0.074267,0.226148,-0.094965,-0.021693,-0.135392,1.0,0.028948,-0.142957,0.053867,0.215504,0.366976
instrumentalness,-0.385515,-0.017706,0.020802,-0.553735,0.352184,0.028948,1.0,-0.087034,-0.203283,0.018853,0.184709
liveness,0.069487,-0.006648,0.278672,-0.069939,-0.128384,-0.142957,-0.087034,1.0,-0.033366,0.025457,-0.090188
valence,0.393453,0.479953,0.120007,0.406772,-0.243192,0.053867,-0.203283,-0.033366,1.0,0.045089,-0.039794
tempo,0.075191,0.168956,0.080475,0.102097,-0.241119,0.215504,0.018853,0.025457,0.045089,1.0,0.130328


In [55]:
cor_matrix_stacked = cor_matrix.unstack()

In [56]:
cor_matrix_stacked = cor_matrix_stacked[cor_matrix_stacked != 1]

In [57]:
cor_matrix_stacked.sort_values()

energy            acousticness       -0.682479
acousticness      energy             -0.682479
loudness          instrumentalness   -0.553735
instrumentalness  loudness           -0.553735
acousticness      loudness           -0.498695
                                        ...   
loudness          valence             0.406772
danceability      valence             0.479953
valence           danceability        0.479953
energy            loudness            0.791640
loudness          energy              0.791640
Length: 110, dtype: float64

In [58]:
correlation_dict_mirrored = cor_matrix_stacked.sort_values().to_dict()
correlation_dict_mirrored

{('energy', 'acousticness'): -0.6824785203241528,
 ('acousticness', 'energy'): -0.6824785203241528,
 ('loudness', 'instrumentalness'): -0.5537348090851054,
 ('instrumentalness', 'loudness'): -0.5537348090851054,
 ('acousticness', 'loudness'): -0.4986950326515534,
 ('loudness', 'acousticness'): -0.4986950326515534,
 ('instrumentalness', 'energy'): -0.38551503859807107,
 ('energy', 'instrumentalness'): -0.38551503859807107,
 ('danceability', 'acousticness'): -0.35913454296071834,
 ('acousticness', 'danceability'): -0.35913454296071834,
 ('acousticness', 'valence'): -0.24319225768729424,
 ('valence', 'acousticness'): -0.24319225768729424,
 ('tempo', 'acousticness'): -0.24111873337229978,
 ('acousticness', 'tempo'): -0.24111873337229978,
 ('valence', 'instrumentalness'): -0.20328292297434575,
 ('instrumentalness', 'valence'): -0.20328292297434575,
 ('speechiness', 'liveness'): -0.14295683598584816,
 ('liveness', 'speechiness'): -0.14295683598584816,
 ('speechiness', 'acousticness'): -0.135

In [59]:
correlation_dict = {}

In [60]:
for pair, value in correlation_dict_mirrored.items():
    # Keep a pair only if its mirrored twin (b, a) has not been stored yet.
    if (pair[1], pair[0]) not in correlation_dict:
        correlation_dict[pair] = value
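
The mirrored pairs could also be avoided before stacking by keeping only the part of the matrix above the diagonal. The sketch below is one possible alternative, not the approach used in this notebook; `toy_cor_matrix` is a hypothetical stand-in for `cor_matrix` with rounded values, and `upper` and `unique_pairs` are illustrative names.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for cor_matrix: a small symmetric correlation matrix.
features = ["energy", "loudness", "acousticness"]
toy_cor_matrix = pd.DataFrame(
    [[1.0, 0.79, -0.68],
     [0.79, 1.0, -0.50],
     [-0.68, -0.50, 1.0]],
    index=features,
    columns=features,
)

# Boolean mask that is True strictly above the diagonal (k=1 skips the diagonal),
# so self-correlations and mirrored duplicates are excluded by position.
upper = np.triu(np.ones(toy_cor_matrix.shape, dtype=bool), k=1)

# where() replaces everything outside the mask with NaN; after stacking,
# dropna() leaves exactly one entry per unordered feature pair.
unique_pairs = toy_cor_matrix.where(upper).stack().dropna().sort_values()
print(unique_pairs)
```

Because the diagonal is removed by position rather than by comparing values to 1, a pair of distinct features with a perfect correlation would still be kept.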

In [61]:
correlation_dict

{('energy', 'acousticness'): -0.6824785203241528,
 ('loudness', 'instrumentalness'): -0.5537348090851054,
 ('acousticness', 'loudness'): -0.4986950326515534,
 ('instrumentalness', 'energy'): -0.38551503859807107,
 ('danceability', 'acousticness'): -0.35913454296071834,
 ('acousticness', 'valence'): -0.24319225768729424,
 ('tempo', 'acousticness'): -0.24111873337229978,
 ('valence', 'instrumentalness'): -0.20328292297434575,
 ('speechiness', 'liveness'): -0.14295683598584816,
 ('speechiness', 'acousticness'): -0.13539214333859012,
 ('liveness', 'acousticness'): -0.12838374099020403,
 ('acousticness', 'key'): -0.11339351100259196,
 ('key', 'speechiness'): -0.09496505735843172,
 ('duration_ms', 'liveness'): -0.09018826695099239,
 ('liveness', 'instrumentalness'): -0.0870339124461283,
 ('loudness', 'liveness'): -0.06993949725852708,
 ('valence', 'duration_ms'): -0.03979436283824896,
 ('duration_ms', 'danceability'): -0.033763480296644874,
 ('valence', 'liveness'): -0.03336630340518706,
 ('

energy and loudness are strongly positively correlated (r ≈ 0.79).

### Which features are strongly negatively correlated?

energy and acousticness have the strongest negative correlation (r ≈ -0.68), which is close to counting as strong.

### Which features are not correlated?

In [62]:
[{item: correlation_dict[item]} for item in correlation_dict if abs(correlation_dict[item])<=0.1]

[{('key', 'speechiness'): -0.09496505735843172},
 {('duration_ms', 'liveness'): -0.09018826695099239},
 {('liveness', 'instrumentalness'): -0.0870339124461283},
 {('loudness', 'liveness'): -0.06993949725852708},
 {('valence', 'duration_ms'): -0.03979436283824896},
 {('duration_ms', 'danceability'): -0.033763480296644874},
 {('valence', 'liveness'): -0.03336630340518706},
 {('loudness', 'speechiness'): -0.021692935459647147},
 {('danceability', 'instrumentalness'): -0.01770638521729678},
 {('acousticness', 'duration_ms'): -0.010988051809892976},
 {('key', 'loudness'): -0.009178410631968104},
 {('liveness', 'danceability'): -0.006648475599485623},
 {('key', 'duration_ms'): -0.003345303142861897},
 {('tempo', 'instrumentalness'): 0.01885267572734445},
 {('instrumentalness', 'key'): 0.020802356350748005},
 {('tempo', 'liveness'): 0.025456740041450245},
 {('instrumentalness', 'speechiness'): 0.028948017426321342},
 {('valence', 'tempo'): 0.04508867269936379},
 {('speechiness', 'valence'): 0

Pairs with a correlation between -0.1 and 0.1 are considered uncorrelated here.
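
The cut-offs used in these three questions can be written down as one small helper. This is a minimal sketch: the 0.1 bound comes from the text above, while the 0.7 bound for "strong" is an assumption chosen to match the conclusions (0.79 counts as strong, -0.68 falls just short). The function name `correlation_strength` and its parameters are hypothetical.

```python
def correlation_strength(r, none_below=0.1, strong_from=0.7):
    """Label a correlation coefficient by its absolute value.

    none_below: bound from the notebook text (|r| <= 0.1 is "not correlated").
    strong_from: assumed bound for a "strong" correlation, not from the notebook.
    """
    size = abs(r)
    if size <= none_below:
        return "not correlated"
    if size >= strong_from:
        return "strong positive" if r > 0 else "strong negative"
    return "moderate or weak"


# Rounded coefficients taken from the outputs above.
examples = {
    ("energy", "loudness"): 0.7916,
    ("energy", "acousticness"): -0.6825,
    ("key", "speechiness"): -0.0950,
}
labels = {pair: correlation_strength(r) for pair, r in examples.items()}
print(labels)
```

Keeping the thresholds as parameters makes it easy to see how the answers change if a different definition of "strong" is preferred.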

### How does the danceability score compare between Pop, Hip-Hop/Rap, Dance/Electronic, and Alternative/Indie genres?

In [63]:
danceability_genre = df.groupby('genre')['danceability'].agg(['min', 'mean', 'median', 'max'])
danceability_genre

genre,min,mean,median,max
Electro-pop,0.701,0.7895,0.7895,0.878
Alternative/Indie,0.459,0.66175,0.663,0.862
Alternative/reggaeton/experimental,0.607,0.607,0.607,0.607
Chamber pop,0.351,0.351,0.351,0.351
Dance-pop/Disco,0.73,0.73,0.73,0.73
Dance/Electronic,0.647,0.755,0.785,0.88
Disco-pop,0.746,0.746,0.746,0.746
Dreampop/Hip-Hop/R&B,0.755,0.755,0.755,0.755
Hip-Hop/Rap,0.598,0.765538,0.774,0.896
Hip-Hop/Trap,0.935,0.935,0.935,0.935


In [64]:
danceability_genres_interested = ['Pop', 'Hip-Hop/Rap', 'Dance/Electronic', 'Alternative/Indie']

In [65]:
danceability_genre.loc[danceability_genres_interested]

genre,min,mean,median,max
Pop,0.464,0.677571,0.69,0.806
Hip-Hop/Rap,0.598,0.765538,0.774,0.896
Dance/Electronic,0.647,0.755,0.785,0.88
Alternative/Indie,0.459,0.66175,0.663,0.862


Danceability doesn't differ much between the Pop, Hip-Hop/Rap, Dance/Electronic, and Alternative/Indie genres: the mean scores range from 0.66 (Alternative/Indie) to 0.77 (Hip-Hop/Rap).

### How does the acousticness score compare between Pop, Hip-Hop/Rap, Dance/Electronic, and Alternative/Indie genres?

In [66]:
acousticness_genre = df.groupby('genre')['acousticness'].agg(['min', 'mean', 'median', 'max'])
acousticness_genre

genre,min,mean,median,max
Electro-pop,0.183,0.2555,0.2555,0.328
Alternative/Indie,0.291,0.5835,0.646,0.751
Alternative/reggaeton/experimental,0.0103,0.0103,0.0103,0.0103
Chamber pop,0.934,0.934,0.934,0.934
Dance-pop/Disco,0.167,0.167,0.167,0.167
Dance/Electronic,0.0137,0.09944,0.0686,0.223
Disco-pop,0.0112,0.0112,0.0112,0.0112
Dreampop/Hip-Hop/R&B,0.533,0.533,0.533,0.533
Hip-Hop/Rap,0.00513,0.188741,0.145,0.731
Hip-Hop/Trap,0.0194,0.0194,0.0194,0.0194


In [67]:
acousticness_genre.loc[danceability_genres_interested]

genre,min,mean,median,max
Pop,0.021,0.323843,0.259,0.902
Hip-Hop/Rap,0.00513,0.188741,0.145,0.731
Dance/Electronic,0.0137,0.09944,0.0686,0.223
Alternative/Indie,0.291,0.5835,0.646,0.751


Among the specified genres, Alternative/Indie has the highest average acousticness (0.5835), while Dance/Electronic has the lowest (0.0994). However, the single highest score belongs to a Pop song: 0.902.

Provide clear explanations in your notebook. Your explanations should inform the reader what you are trying to achieve, the results you got, and what these results mean.
Provide suggestions for how your analysis could be improved.

## Results

In [68]:
df.describe()

,energy,danceability,key,loudness,acousticness,speechiness,instrumentalness,liveness,valence,tempo,duration_ms
count,50.0,50.0,50.0,50.0,50.0,50.0,50.0,50.0,50.0,50.0,50.0
mean,0.6093,0.71672,5.72,-6.2259,0.256206,0.124158,0.015962,0.196552,0.55571,119.69046,199955.36
std,0.154348,0.124975,3.709007,2.349744,0.26525,0.116836,0.094312,0.17661,0.216386,25.414778,33996.122488
min,0.225,0.351,0.0,-14.454,0.00146,0.029,0.0,0.0574,0.0605,75.801,140526.0
25%,0.494,0.6725,2.0,-7.5525,0.0528,0.048325,0.0,0.09395,0.434,99.55725,175845.5
50%,0.597,0.746,6.5,-5.9915,0.1885,0.07005,0.0,0.111,0.56,116.969,197853.5
75%,0.72975,0.7945,8.75,-4.2855,0.29875,0.1555,2e-05,0.27125,0.72625,132.317,215064.0
max,0.855,0.935,11.0,-3.28,0.934,0.487,0.657,0.792,0.925,180.067,312820.0


The average song in the top 50:

In [69]:
df.describe().loc['mean']

energy                   0.609300
danceability             0.716720
key                      5.720000
loudness                -6.225900
acousticness             0.256206
speechiness              0.124158
instrumentalness         0.015962
liveness                 0.196552
valence                  0.555710
tempo                  119.690460
duration_ms         199955.360000
Name: mean, dtype: float64

The most popular genre in the list is Pop.

The dataset has 50 rows and 16 features, with no missing data and no duplicates. Some values lie more than 3 standard deviations from the mean, but they look plausible, so the dataset is treated as clean.
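
A 3-sigma check of this kind can be sketched as follows. The column below is hypothetical toy data (19 typical values and one extreme one), not the Spotify dataset, and the names `toy`, `z_scores`, `outlier_mask`, and `outlier_counts` are illustrative.

```python
import pandas as pd

# Hypothetical toy column: 19 typical values and one extreme one.
toy = pd.DataFrame({"instrumentalness": [0.0] * 19 + [0.657]})

# z-score: distance of each value from the column mean, in standard deviations.
z_scores = (toy - toy.mean()) / toy.std()

# Flag values lying more than 3 standard deviations from the mean.
outlier_mask = z_scores.abs() > 3

# Number of flagged values per column.
outlier_counts = outlier_mask.sum()
print(outlier_counts)
```

Flagged values still need a manual look: a value that is far from the mean can be perfectly valid, which is the judgment made for this dataset.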

Text-type features: 'artist', 'album', 'track_name', 'track_id', 'genre'. Only one of them is categorical: genre.

7 artists with more than one track in the list: ['Dua Lipa', 'Billie Eilish', 'Travis Scott', 'Harry Styles', 'Lewis Capaldi', 'Justin Bieber', 'Post Malone']

The most popular artists (with 3 songs in the list): 'Dua Lipa', 'Billie Eilish', 'Travis Scott'

40 artists in total have songs in the top 50.

Albums with more than one track in the list (with track counts): Future Nostalgia (3), Hollywood's Bleeding (2), Fine Line (2), Changes (2).

45 albums in total have songs in the top 50.

32 tracks have a danceability score above 0.7

only 1 track has a danceability score below 0.4

19 tracks have their loudness above -5

9 tracks have their loudness below -8

'SICKO MODE' is the longest song in the list: 312,820 ms, or about 313 s (5 min 13 s).

'Mood (feat. iann dior)' is the shortest song: 140,526 ms, or about 141 s (2 min 21 s).
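
The conversion from `duration_ms` to minutes and seconds can be sketched as below. The helper name `format_duration` is hypothetical; the two inputs are the `max` and `min` of `duration_ms` from the `df.describe()` output above, and rounding to the nearest second is an assumption about how the durations should be reported.

```python
def format_duration(duration_ms):
    """Convert milliseconds to a 'M min S s' string, rounded to the nearest second."""
    total_seconds = round(duration_ms / 1000)
    minutes, seconds = divmod(total_seconds, 60)
    return f"{minutes} min {seconds} s"


longest = format_duration(312820)   # max of duration_ms
shortest = format_duration(140526)  # min of duration_ms
print(longest)   # 5 min 13 s
print(shortest)  # 2 min 21 s
```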

Pop has the greatest number of songs in the list: 14.

Two genres lead the dataset: Pop with 14 songs and Hip-Hop/Rap with 13.
The other genres with more than one song are Dance/Electronic (5), Alternative/Indie (4), R&B/Soul (2), and Electro-pop (2).

10 genres have just one song in the top 50: 'R&B/Hip-Hop alternative', 'Nu-disco', 'Pop/Soft Rock', 'Pop rap', 'Hip-Hop/Trap', 'Dance-pop/Disco', 'Disco-pop', 'Dreampop/Hip-Hop/R&B', 'Alternative/reggaeton/experimental', 'Chamber pop'

16 genres in total are represented in the top 50.

energy and loudness are strongly positively correlated (r ≈ 0.79).

energy and acousticness have the strongest negative correlation (r ≈ -0.68), which is close to counting as strong.

Danceability doesn't differ much between the Pop, Hip-Hop/Rap, Dance/Electronic, and Alternative/Indie genres.

Alternative/Indie has the highest average acousticness (0.5835) among the leading genres. However, the single highest score belongs to a Pop song: 0.902.