In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from dateutil.parser import parse
from datetime import datetime
import sys

# Data Visualization and Exploratory Data Analysis Lab
## Visualizing and exploring data. The data mining process

In this lab, you'll get acquainted with the most streamed songs on Spotify in 2024. The dataset and its associated metadata can be found [here](https://www.kaggle.com/datasets/nelgiriyewithana/most-streamed-spotify-songs-2024). The version you'll need is provided in the `data/` folder.

You know the drill. Do what you can / want / need to answer the questions to the best of your ability. Answers do not need to be trivial, or even the same among different people.

### Problem 1. Read the dataset (1 point)
Read the file without unzipping it first. You can try a different character encoding, like `unicode_escape`. Don't worry too much about weird characters.

In [2]:
# songs_data = pd.read_csv("data/spotify_most_streamed_2024.zip", compression='zip', encoding='unicode_escape', header=0, sep=',', quotechar='"')

In [3]:
songs_data = pd.read_csv("data/spotify_most_streamed_2024.zip", encoding='unicode_escape')
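As a small illustration of reading a zipped CSV without unzipping it, here is a minimal sketch on a made-up miniature file. The file name `songs_demo.zip` and its contents are assumptions for the demo, not the real dataset. It relies on `pd.read_csv` inferring the compression from the `.zip` suffix, which works when the archive holds a single file.

```python
import os
import tempfile
import zipfile

import pandas as pd

# Hypothetical miniature of the dataset (not the real file).
csv_text = (
    "Track,Artist,Spotify Streams\n"
    'Song A,Artist X,"1,234"\n'
    'Song B,Artist Y,"5,678"\n'
)

with tempfile.TemporaryDirectory() as tmp_dir:
    zip_path = os.path.join(tmp_dir, "songs_demo.zip")
    with zipfile.ZipFile(zip_path, "w") as archive:
        archive.writestr("songs_demo.csv", csv_text)

    # compression="infer" is the default, so the ".zip" suffix is enough.
    demo = pd.read_csv(zip_path)

print(demo.shape)
print(demo.dtypes)
```

An `encoding=` argument can be passed in the same call, as in the cell above. Note that the quoted stream counts arrive as text, because they contain thousands separators.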

In [4]:
songs_data

Unnamed: 0,Track,Album Name,Artist,Release Date,ISRC,All Time Rank,Track Score,Spotify Streams,Spotify Playlist Count,Spotify Playlist Reach,...,SiriusXM Spins,Deezer Playlist Count,Deezer Playlist Reach,Amazon Playlist Count,Pandora Streams,Pandora Track Stations,Soundcloud Streams,Shazam Counts,TIDAL Popularity,Explicit Track
0,MILLION DOLLAR BABY,Million Dollar Baby - Single,Tommy Richman,4/26/2024,QM24S2402528,1,725.4,390470936,30716,196631588,...,684,62.0,17598718,114.0,18004655,22931,4818457,2669262,,0
1,Not Like Us,Not Like Us,Kendrick Lamar,5/4/2024,USUG12400910,2,545.9,323703884,28113,174597137,...,3,67.0,10422430,111.0,7780028,28444,6623075,1118279,,1
2,i like the way you kiss me,I like the way you kiss me,Artemas,3/19/2024,QZJ842400387,3,538.4,601309283,54331,211607669,...,536,136.0,36321847,172.0,5022621,5639,7208651,5285340,,0
3,Flowers,Flowers - Single,Miley Cyrus,1/12/2023,USSM12209777,4,444.9,2031280633,269802,136569078,...,2182,264.0,24684248,210.0,190260277,203384,,11822942,,0
4,Houdini,Houdini,Eminem,5/31/2024,USUG12403398,5,423.3,107034922,7223,151469874,...,1,82.0,17660624,105.0,4493884,7006,207179,457017,,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4595,For the Last Time,For the Last Time,$uicideboy$,9/5/2017,QM8DG1703420,4585,19.4,305049963,65770,5103054,...,,2.0,14217,,20104066,13184,50633006,656337,,1
4596,Dil Meri Na Sune,"Dil Meri Na Sune (From ""Genius"")",Atif Aslam,7/27/2018,INT101800122,4575,19.4,52282360,4602,1449767,...,,1.0,927,,,,,193590,,0
4597,Grace (feat. 42 Dugg),My Turn,Lil Baby,2/28/2020,USUG12000043,4571,19.4,189972685,72066,6704802,...,,1.0,74,6.0,84426740,28999,,1135998,,1
4598,Nashe Si Chadh Gayi,November Top 10 Songs,Arijit Singh,11/8/2016,INY091600067,4591,19.4,145467020,14037,7387064,...,,,,7.0,6817840,,,448292,,0


In [25]:
songs_data.describe()

Unnamed: 0,release_date,track_score,spotify_popularity,apple_music_playlist_count,deezer_playlist_count,amazon_playlist_count
count,4600,4600.0,3796.0,4039.0,3679.0,3545.0
mean,2021-01-27 07:48:18.782608896,41.844043,63.501581,54.60312,32.310954,25.348942
min,1987-07-21 00:00:00,19.4,1.0,1.0,1.0,1.0
25%,2019-07-16 18:00:00,23.3,61.0,10.0,5.0,8.0
50%,2022-06-01 00:00:00,29.9,67.0,28.0,15.0,17.0
75%,2023-08-11 00:00:00,44.425,73.0,70.0,37.0,34.0
max,2024-06-14 00:00:00,725.4,96.0,859.0,632.0,210.0
std,,38.543766,16.186438,71.61227,54.274538,25.989826


### Problem 2. Perform some cleaning (1 point)
Ensure all data has been read correctly; check the data types. Give the columns better names (e.g. `all_time_rank`, `track_score`, etc.). To do so, try to use `apply()` instead of a manual mapping between old and new names. Get rid of any unnecessary ones.

In [5]:
songs_data.dtypes

Track                          object
Album Name                     object
Artist                         object
Release Date                   object
ISRC                           object
All Time Rank                  object
Track Score                   float64
Spotify Streams                object
Spotify Playlist Count         object
Spotify Playlist Reach         object
Spotify Popularity            float64
YouTube Views                  object
YouTube Likes                  object
TikTok Posts                   object
TikTok Likes                   object
TikTok Views                   object
YouTube Playlist Reach         object
Apple Music Playlist Count    float64
AirPlay Spins                  object
SiriusXM Spins                 object
Deezer Playlist Count         float64
Deezer Playlist Reach          object
Amazon Playlist Count         float64
Pandora Streams                object
Pandora Track Stations         object
Soundcloud Streams             object
Shazam Count

### Let's make column names like album_name

In [6]:
songs_data.columns = (songs_data.columns
                        .str.lower()
                        .str.replace(' ', '_'))

In [7]:
songs_data.columns

Index(['track', 'album_name', 'artist', 'release_date', 'isrc',
       'all_time_rank', 'track_score', 'spotify_streams',
       'spotify_playlist_count', 'spotify_playlist_reach',
       'spotify_popularity', 'youtube_views', 'youtube_likes', 'tiktok_posts',
       'tiktok_likes', 'tiktok_views', 'youtube_playlist_reach',
       'apple_music_playlist_count', 'airplay_spins', 'siriusxm_spins',
       'deezer_playlist_count', 'deezer_playlist_reach',
       'amazon_playlist_count', 'pandora_streams', 'pandora_track_stations',
       'soundcloud_streams', 'shazam_counts', 'tidal_popularity',
       'explicit_track'],
      dtype='object')
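The task suggests a function-based rename rather than a manual mapping. One possible sketch, assuming a hypothetical helper called `to_snake_case`, uses `Index.map`, which plays the role of `apply()` for column labels:

```python
import pandas as pd


def to_snake_case(name):
    """Lower-case a column label and join its words with underscores."""
    return "_".join(name.strip().lower().split())


# Empty demo frame with a few of the original-style column labels.
demo = pd.DataFrame(columns=["Album Name", "All Time Rank", "TIDAL Popularity"])
demo.columns = demo.columns.map(to_snake_case)

print(list(demo.columns))
```

Splitting on whitespace also collapses repeated spaces, which a plain `replace(' ', '_')` would turn into repeated underscores.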

### The tidal_popularity column seems to contain many NaN values. Let's see which values this column actually holds.

In [8]:
songs_data.tidal_popularity.unique()

array([nan])

### Every value in the tidal_popularity column is NaN, so we don't need it and will drop it. First we'll save a backup copy of songs_data. We will always keep a backup when we transform our dataset.

In [9]:
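# .copy() makes an independent backup; plain assignment would only create a second name for the same DataFrame
songs_data_backup_1 = songs_data.copy()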
songs_data = songs_data.drop(columns=["tidal_popularity"])

In [10]:
songs_data.columns

Index(['track', 'album_name', 'artist', 'release_date', 'isrc',
       'all_time_rank', 'track_score', 'spotify_streams',
       'spotify_playlist_count', 'spotify_playlist_reach',
       'spotify_popularity', 'youtube_views', 'youtube_likes', 'tiktok_posts',
       'tiktok_likes', 'tiktok_views', 'youtube_playlist_reach',
       'apple_music_playlist_count', 'airplay_spins', 'siriusxm_spins',
       'deezer_playlist_count', 'deezer_playlist_reach',
       'amazon_playlist_count', 'pandora_streams', 'pandora_track_stations',
       'soundcloud_streams', 'shazam_counts', 'explicit_track'],
      dtype='object')

In [11]:
songs_data.spotify_streams.unique()

array(['390,470,936', '323,703,884', '601,309,283', ..., '189,972,685',
       '145,467,020', '255,740,653'], dtype=object)
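The values above are strings with thousands separators, which is why the column has dtype `object`. A minimal sketch of one way to convert such a column to numbers, shown on a small made-up Series (the values are illustrative only):

```python
import pandas as pd

raw = pd.Series(["390,470,936", "323,703,884", None, "1"], name="spotify_streams")

# Strip the thousands separators, then convert; missing values stay NaN.
streams = pd.to_numeric(raw.str.replace(",", "", regex=False), errors="coerce")

# The nullable integer dtype keeps whole numbers and still allows missing values.
streams_int = streams.astype("Int64")

print(streams_int)
```

Another option, if it suits the data, is passing `thousands=","` to `pd.read_csv` so that the conversion happens while reading.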

In [12]:
def all_columns_unique_values():
    for col in songs_data.columns:
        print(f"Unique values in column \'{col}\': \n {songs_data[col].unique()}\n\n **************** \n")

# We call the function        
all_columns_unique_values()

Unique values in column 'track': 
 ['MILLION DOLLAR BABY' 'Not Like Us' 'i like the way you kiss me' ...
 'Grace (feat. 42 Dugg)' 'Nashe Si Chadh Gayi'
 'Me Acostumbre (feat. Bad Bunny)']

 **************** 

Unique values in column 'album_name': 
 ['Million Dollar Baby - Single' 'Not Like Us' 'I like the way you kiss me'
 ... 'Dil Meri Na Sune (From "Genius")' 'November Top 10 Songs'
 'Me Acostumbre (feat. Bad Bunny)']

 **************** 

Unique values in column 'artist': 
 ['Tommy Richman' 'Kendrick Lamar' 'Artemas' ... 'Kerim Araz'
 'Jaques Raupï¿' 'BUSHIDO ZHO']

 **************** 

Unique values in column 'release_date': 
 ['4/26/2024' '5/4/2024' '3/19/2024' ... '10/31/2018' '11/8/2016'
 '4/11/2017']

 **************** 

Unique values in column 'isrc': 
 ['QM24S2402528' 'USUG12400910' 'QZJ842400387' ... 'USUG12000043'
 'INY091600067' 'USB271700107']

 **************** 

Unique values in column 'all_time_rank': 
 ['1' '2' '3' ... '4,571' '4,591' '4,593']

 **************** 

Uniqu

In [13]:
print(sys.maxsize)

9223372036854775807


### We could convert explicit_track to bool because it contains only 0 and 1.

In [14]:
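# .copy() is needed here: the next line modifies songs_data in place, and a plain assignment would change the backup too
songs_data_backup_2 = songs_data.copy()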
songs_data.explicit_track = songs_data.explicit_track.astype(bool)

In [21]:
songs_data.explicit_track.unique()

array([False,  True])
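A backup is only useful if it does not change together with the original. The following sketch, on a tiny made-up frame, shows the difference between giving a DataFrame a second name and taking a real copy with `DataFrame.copy()`:

```python
import pandas as pd

demo = pd.DataFrame({"explicit_track": [0, 1, 0]})

alias = demo            # a second name for the same object
snapshot = demo.copy()  # an independent copy

# In-place column conversion, as done for explicit_track above.
demo["explicit_track"] = demo["explicit_track"].astype(bool)

print(alias is demo)                        # the alias follows every change
print(alias["explicit_track"].tolist())
print(snapshot["explicit_track"].tolist())  # the copy keeps the old values
```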

### Let's convert release_date to datetime. We'll use parse from dateutil.parser, wrapped in a helper function string_to_date().

In [16]:
songs_data.release_date.unique()

array(['4/26/2024', '5/4/2024', '3/19/2024', ..., '10/31/2018',
       '11/8/2016', '4/11/2017'], dtype=object)

In [17]:
def string_to_date(date_string):
    return parse(date_string)

In [18]:
songs_data.release_date = songs_data.release_date.apply(string_to_date)

In [19]:
songs_data.release_date.unique()

<DatetimeArray>
['2024-04-26 00:00:00', '2024-05-04 00:00:00', '2024-03-19 00:00:00',
 '2023-01-12 00:00:00', '2024-05-31 00:00:00', '2023-11-10 00:00:00',
 '2024-01-18 00:00:00', '2024-02-02 00:00:00', '2024-06-09 00:00:00',
 '2024-05-23 00:00:00',
 ...
 '2015-09-24 00:00:00', '2019-07-18 00:00:00', '2019-06-27 00:00:00',
 '2018-09-20 00:00:00', '2016-06-24 00:00:00', '2016-04-09 00:00:00',
 '2023-06-05 00:00:00', '2018-10-31 00:00:00', '2016-11-08 00:00:00',
 '2017-04-11 00:00:00']
Length: 1562, dtype: datetime64[ns]
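As an alternative to applying a parser row by row, `pd.to_datetime` can convert the whole column at once. This is a sketch on three sample strings taken from the output above, assuming every date follows the month/day/year pattern:

```python
import pandas as pd

raw_dates = pd.Series(["4/26/2024", "5/4/2024", "11/8/2016"], name="release_date")

# An explicit format avoids any ambiguity between day and month.
dates = pd.to_datetime(raw_dates, format="%m/%d/%Y")

print(dates.dt.year.tolist())
print(dates.dt.month.tolist())
```

With an explicit format, a value that does not match raises an error instead of being silently guessed, which can help catch badly formatted rows.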

### Let's analyze nan values in every column.

In [20]:
def all_columns_nan_values():
    for col in songs_data.columns:
        print(f"All nan values in column \'{col}\': \n {songs_data[col].isna().sum()}\n\n **************** \n")

# Let's call the function
all_columns_nan_values()

All nan values in column 'track': 
 0

 **************** 

All nan values in column 'album_name': 
 0

 **************** 

All nan values in column 'artist': 
 5

 **************** 

All nan values in column 'release_date': 
 0

 **************** 

All nan values in column 'isrc': 
 0

 **************** 

All nan values in column 'all_time_rank': 
 0

 **************** 

All nan values in column 'track_score': 
 0

 **************** 

All nan values in column 'spotify_streams': 
 113

 **************** 

All nan values in column 'spotify_playlist_count': 
 70

 **************** 

All nan values in column 'spotify_playlist_reach': 
 72

 **************** 

All nan values in column 'spotify_popularity': 
 804

 **************** 

All nan values in column 'youtube_views': 
 308

 **************** 

All nan values in column 'youtube_likes': 
 315

 **************** 

All nan values in column 'tiktok_posts': 
 1173

 **************** 

All nan values in column 'tiktok_likes': 
 980

 ******

# TODO: Analyze all columns with NaN values and decide what to do with them: replace or remove?
Useful reference:
https://www.w3schools.com/python/pandas/pandas_cleaning_empty_cells.asp
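To support that decision, here is a rough sketch of summarizing missing values per column and of the usual remedies. The frame and the chosen strategies (drop empty columns, drop rows without an artist, fill with the median) are assumptions for illustration, not a recommendation for this dataset.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "artist": ["A", None, "C", "D"],
    "tiktok_posts": [10.0, np.nan, np.nan, 40.0],
    "tidal_popularity": [np.nan] * 4,
})

# One row per column: number and share of missing values.
missing = pd.DataFrame({
    "missing": demo.isna().sum(),
    "share": demo.isna().mean(),
})
print(missing)

# Remove columns that are entirely empty.
cleaned = demo.dropna(axis="columns", how="all")
# Remove rows that lack a key field.
cleaned = cleaned.dropna(subset=["artist"])
# Fill the remaining numeric gaps with the column median.
cleaned = cleaned.assign(
    tiktok_posts=cleaned["tiktok_posts"].fillna(cleaned["tiktok_posts"].median())
)
print(cleaned)
```

Whether to fill or drop depends on the question being asked; for per-platform comparisons it may be more honest to leave the gaps and let pandas skip them.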

In [30]:
songs_data.describe().T

Unnamed: 0,count,mean,min,25%,50%,75%,max,std
release_date,4600.0,2021-01-27 07:48:18.782608896,1987-07-21 00:00:00,2019-07-16 18:00:00,2022-06-01 00:00:00,2023-08-11 00:00:00,2024-06-14 00:00:00,
track_score,4600.0,41.844043,19.4,23.3,29.9,44.425,725.4,38.543766
spotify_popularity,3796.0,63.501581,1.0,61.0,67.0,73.0,96.0,16.186438
apple_music_playlist_count,4039.0,54.60312,1.0,10.0,28.0,70.0,859.0,71.61227
deezer_playlist_count,3679.0,32.310954,1.0,5.0,15.0,37.0,632.0,54.274538
amazon_playlist_count,3545.0,25.348942,1.0,8.0,17.0,34.0,210.0,25.989826


In [31]:
songs_data.columns

Index(['track', 'album_name', 'artist', 'release_date', 'isrc',
       'all_time_rank', 'track_score', 'spotify_streams',
       'spotify_playlist_count', 'spotify_playlist_reach',
       'spotify_popularity', 'youtube_views', 'youtube_likes', 'tiktok_posts',
       'tiktok_likes', 'tiktok_views', 'youtube_playlist_reach',
       'apple_music_playlist_count', 'airplay_spins', 'siriusxm_spins',
       'deezer_playlist_count', 'deezer_playlist_reach',
       'amazon_playlist_count', 'pandora_streams', 'pandora_track_stations',
       'soundcloud_streams', 'shazam_counts', 'explicit_track'],
      dtype='object')

### Problem 3. Most productive artists (1 point)
Who are the five artists with the most songs in the dataset?

Who are the five "clean-mouthed" artists (i.e., with no explicit songs)? **Note:** We're not going into details but we can start a discussion about whether a song needs swearing to be popular.

In [34]:
# Variant 1
# Size includes NaN values, count does not:
# Group by artist and count the number of tracks for each artist
artist_song_count = songs_data.groupby('artist')['track'].count()

# Sort the artists by the number of songs in descending order and get the top 5
top_5_artists = artist_song_count.sort_values(ascending=False).head(5)

print(top_5_artists)

artist
Drake           63
Taylor Swift    63
Bad Bunny       60
KAROL G         32
The Weeknd      31
Name: track, dtype: int64
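The comment about `size` and `count` can be checked on a tiny made-up frame: `size()` counts rows in each group, while `count()` counts only the non-missing values of the selected column.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "artist": ["A", "A", "B", "B", "B"],
    "spotify_streams": [1.0, np.nan, 2.0, 3.0, np.nan],
})

grouped = demo.groupby("artist")["spotify_streams"]

rows_per_artist = grouped.size()     # every row, NaN included
values_per_artist = grouped.count()  # non-missing values only

print(rows_per_artist)
print(values_per_artist)
```

For the `track` column the two agree, because it has no missing values; for columns with gaps they differ.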


In [37]:
# Variant 2
# size() counts the rows for each artist
artist_song_count_1 = songs_data.groupby('artist').size()

# Sort the artists by the number of songs in descending order and get the top 5
top_5_artists_1 = artist_song_count_1.sort_values(ascending=False).head(5)

print(top_5_artists_1)

artist
Drake           63
Taylor Swift    63
Bad Bunny       60
KAROL G         32
The Weeknd      31
dtype: int64


### Who are the five "clean-mouthed" artists (i.e., with no explicit songs)?

In [40]:
# Count the explicit tracks for each artist (explicit_track is 0/1 or bool, so sum() counts the explicit ones)
explicit_tracks_per_artist = songs_data.groupby('artist')['explicit_track'].sum()

# Keep only the artists with no explicit tracks; many qualify, so rank them by number of songs
clean_artists = explicit_tracks_per_artist[explicit_tracks_per_artist == 0].index
clean_songs = songs_data[songs_data['artist'].isin(clean_artists)]
top_5_clean_artists = clean_songs.groupby('artist').size().sort_values(ascending=False).head(5)

print(top_5_clean_artists)


### Print top 5 'dirty_mouthed' artists

In [41]:
# Sort the artists by the number of explicit tracks in descending order and get the top 5
top_5_dirty_artists = explicit_tracks_per_artist.sort_values(ascending=False).head(5)

print(top_5_dirty_artists)


### Problem 4. Most streamed artists (1 point)
And who are the top five most streamed (as measured by Spotify streams) artists?

In [42]:
songs_data.columns

Index(['track', 'album_name', 'artist', 'release_date', 'isrc',
       'all_time_rank', 'track_score', 'spotify_streams',
       'spotify_playlist_count', 'spotify_playlist_reach',
       'spotify_popularity', 'youtube_views', 'youtube_likes', 'tiktok_posts',
       'tiktok_likes', 'tiktok_views', 'youtube_playlist_reach',
       'apple_music_playlist_count', 'airplay_spins', 'siriusxm_spins',
       'deezer_playlist_count', 'deezer_playlist_reach',
       'amazon_playlist_count', 'pandora_streams', 'pandora_track_stations',
       'soundcloud_streams', 'shazam_counts', 'explicit_track'],
      dtype='object')

In [43]:
# Convert the comma-formatted stream counts (e.g. '390,470,936') to numbers; missing values stay NaN
spotify_streams_numeric = pd.to_numeric(songs_data['spotify_streams'].str.replace(',', '', regex=False), errors='coerce')

# Group by artist and sum the Spotify streams for each artist
most_streamed_artists = spotify_streams_numeric.groupby(songs_data['artist']).sum()

# Sort the artists by total streams in descending order and get the top 5
top_5_most_streamed_artists = most_streamed_artists.sort_values(ascending=False).head(5)

print(top_5_most_streamed_artists)


### Problem 5. Songs by year and month (1 point)
How many songs have been released each year? Present an appropriate plot. Can you explain the behavior of the plot for 2024?

How about months? Is / Are there (a) popular month(s) to release music?
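One possible starting point for these counts, sketched on a handful of made-up release dates: extract the year and month with the `.dt` accessor and count the occurrences of each.

```python
import pandas as pd

# Illustrative dates only, not the real dataset.
release_dates = pd.Series(
    pd.to_datetime(["2023-01-12", "2023-11-10", "2024-04-26", "2024-05-04", "2024-05-31"]),
    name="release_date",
)

# sort_index() puts the counts in calendar order rather than by frequency.
songs_per_year = release_dates.dt.year.value_counts().sort_index()
songs_per_month = release_dates.dt.month.value_counts().sort_index()

print(songs_per_year)
print(songs_per_month)
```

Either Series can then be drawn as a bar chart, for example with `songs_per_year.plot(kind="bar")`. When reading the 2024 bar, keep in mind that the latest release date in the data is 2024-06-14, so that year is incomplete.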

### Problem 6. Playlists (2 points)
Is there any connection (correlation) between users adding a song to playlists in one service, or another? Only Spotify, Apple, Deezer, and Amazon offer the ability to add a song to a playlist. Find a way to plot all these relationships at the same time, and analyze them. Experiment with different types of correlations.
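A sketch of comparing correlation types with `DataFrame.corr`, on invented playlist counts. The column names follow the cleaned dataset, but the numbers are made up so that one extreme value shows the difference between Pearson (linear) and Spearman (rank-based) correlation.

```python
import pandas as pd

playlists = pd.DataFrame({
    "spotify_playlist_count": [10, 20, 30, 40, 1000],
    "apple_music_playlist_count": [1, 2, 3, 4, 5],
    "deezer_playlist_count": [5, 3, 4, 1, 2],
})

# Each call returns a square matrix with one row and column per service.
pearson = playlists.corr(method="pearson")
spearman = playlists.corr(method="spearman")

print(pearson.round(2))
print(spearman.round(2))
```

Here the Spotify and Apple Music columns rise together, so their Spearman correlation is 1, while the outlier pulls the Pearson value below 1. In the real data, `spotify_playlist_count` is stored as text and would need converting to numbers first. The matrices can be shown all at once as a heatmap, or the raw relationships with `pd.plotting.scatter_matrix`.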

### Problem 7. YouTube views and likes (1 point)
What is the relationship between YouTube views and likes? Present an appropriate plot. 

What is the mean YouTube views-to-likes ratio? What is its distribution? Find a way to plot it and describe it.
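A minimal sketch of the ratio calculation on made-up numbers. It assumes both columns are already numeric, and it excludes rows with missing values or zero likes so that the ratio is defined.

```python
import numpy as np
import pandas as pd

youtube = pd.DataFrame({
    "youtube_views": [1000.0, 5000.0, np.nan, 200.0],
    "youtube_likes": [10.0, 100.0, 5.0, 0.0],
})

# Keep rows where both values are present and likes are positive.
valid = youtube.dropna()
valid = valid[valid["youtube_likes"] > 0]

ratio = valid["youtube_views"] / valid["youtube_likes"]

print(ratio.mean())
print(ratio.describe())
```

A histogram of `ratio` (possibly on a log scale, if the values span several orders of magnitude) is one way to look at its distribution.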

### Problem 8. TikTok stuff (2 points)
The most popular songs on TikTok released every year show... interesting behavior. Which release years have the highest peak TikTok views? Show an appropriate chart. Can you explain this behavior? For a bit of context, TikTok was created in 2016.

Now, how much more popular is the most popular song for each release year than the mean popularity? Analyze the results.

In both parts, it would be helpful to see the actual songs.
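To see the actual songs, one option is `groupby(...).idxmax()`, which returns the row label of the top song in each year. The sketch below uses invented tracks and views, and it assumes that "mean popularity" means the mean TikTok views of songs released in the same year.

```python
import pandas as pd

demo = pd.DataFrame({
    "track": ["a", "b", "c", "d"],
    "release_year": [2019, 2019, 2020, 2020],
    "tiktok_views": [100.0, 300.0, 50.0, 150.0],
})

# Row of the most viewed song for each release year.
top_rows = demo.loc[demo.groupby("release_year")["tiktok_views"].idxmax()]

# Mean views per release year, used as the baseline.
yearly_mean = demo.groupby("release_year")["tiktok_views"].mean()

# The division aligns on the release_year index.
top = top_rows.set_index("release_year").assign(
    times_the_mean=lambda d: d["tiktok_views"] / yearly_mean
)

print(top)
```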

### * Problem 9. Explore (and clean) at will
There is a lot to look for here. For example, you can easily link a song to its genres, and lyrics. You may also try to link artists and albums to more info about them. Or you can compare and contrast a song's performance across different platforms, in a similar manner to what you already did above; maybe even assign a better song ranking system (across platforms with different popularity metrics, and different requirements) than the one provided in the dataset.