In [2]:
%load_ext sql

%sql postgresql+psycopg2://localhost:5432/intro_to_sql_for_analytics

# Spotify Data Analysis - SQL & Data Exploration

## Overview
This Jupyter Notebook contains an analysis of Spotify data using SQL queries and data exploration techniques. It is designed to provide insights into music trends, artist popularity, and track features using structured queries and visualizations.

## Contents
- **SQL Queries**: Retrieving insights on artists, tracks, and genres
- **Aggregations & Grouping**: Understanding trends in streaming data
- **Joins & Subqueries**: Connecting multiple datasets for deeper insights
- **Feature Analysis**: Investigating track attributes (e.g., tempo, energy, danceability)
- **Intro to Performance Optimisation**: Improving query efficiency


## How to Use This Notebook
- Run each cell sequentially to execute SQL queries and analyses
- Modify the queries to explore different aspects of the dataset
- Leverage visualizations to identify patterns in music trends
- Use it as a reference for working with structured data in SQL



Data Columns:

`track_name`, `artist_names`, `artist_count`, `released_year`, `released_month`, `released_day`, `in_spotify_playlists`, `in_spotify_charts`, `streams`, `in_apple_playlists`, `key`, `mode`, `danceability_percent`, `energy_percent`, `acousticness_percent`, `liveness_percent`, `bpm`, `speechiness_percent`


Find out the count of tracks which are present in more than 1000 Spotify playlists

In [3]:
%%sql

SELECT COUNT(track_name) FROM spotify_tracks
WHERE in_spotify_playlists > 1000

 * postgresql+psycopg2://localhost:5432/intro_to_sql_for_analytics
1 rows affected.


count
676
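
One detail worth keeping in mind with the query above: `COUNT(track_name)` counts only rows where `track_name` is not NULL, while `COUNT(*)` counts every matching row. The sketch below illustrates the difference. It assumes an in-memory SQLite database (Python's built-in `sqlite3`) instead of the notebook's PostgreSQL connection, and the three rows are made up for illustration.

```python
import sqlite3

# Toy stand-in for spotify_tracks; these rows are invented, not real data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE spotify_tracks (track_name TEXT, in_spotify_playlists INTEGER)"
)
conn.executemany(
    "INSERT INTO spotify_tracks VALUES (?, ?)",
    [("a", 1500), (None, 2000), ("c", 900)],
)

# COUNT(track_name) skips the row whose name is NULL; COUNT(*) does not.
count_names, count_rows = conn.execute(
    """
    SELECT COUNT(track_name), COUNT(*)
    FROM spotify_tracks
    WHERE in_spotify_playlists > 1000
    """
).fetchone()

print(count_names, count_rows)  # 1 2
```

If every track has a name, both forms return the same number.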


Find out the total number of streams of the top 20 tracks (here “top” means the tracks that appear in the most Spotify charts)

In [None]:
%%sql

-- Step 1: How to find the number of streams for each track in the top 20


SELECT streams
FROM spotify_tracks
ORDER BY in_spotify_charts DESC
LIMIT 20



 * postgresql+psycopg2://localhost:5432/intro_to_sql_for_analytics
20 rows affected.


streams
141381703
2513188493
1316855716
140003974
1297026226
30546883
127408954
800840817
387570742
183706234


In [None]:
%%sql

-- Step 2: Sum the streams of the top 20 tracks

SELECT SUM(top_songs.streams) AS total_streams 
FROM (
SELECT * FROM spotify_tracks ORDER BY in_spotify_charts DESC LIMIT 20
) top_songs;

 * postgresql+psycopg2://localhost:5432/intro_to_sql_for_analytics
1 rows affected.


total_streams
17276722950
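
The subquery in Step 2 is needed because `LIMIT` is applied after aggregation: `SUM` first collapses the whole table into a single row, so a `LIMIT` on that same query level would not restrict which tracks are summed. Here is a minimal sketch of the difference, assuming an in-memory SQLite database rather than the notebook's PostgreSQL connection, with four made-up rows and a top 2 instead of a top 20.

```python
import sqlite3

# Toy stand-in for spotify_tracks; these rows are invented, not real data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE spotify_tracks "
    "(track_name TEXT, in_spotify_charts INTEGER, streams INTEGER)"
)
conn.executemany(
    "INSERT INTO spotify_tracks VALUES (?, ?, ?)",
    [("a", 50, 100), ("b", 40, 200), ("c", 10, 400), ("d", 5, 800)],
)

# Naive attempt: SUM produces one row for the whole table, so LIMIT 2
# has nothing left to cut off.
naive_total = conn.execute(
    "SELECT SUM(streams) FROM spotify_tracks LIMIT 2"
).fetchone()[0]

# Subquery: pick the top 2 rows first, then aggregate only those rows.
top_total = conn.execute(
    """
    SELECT SUM(top_songs.streams)
    FROM (
        SELECT streams
        FROM spotify_tracks
        ORDER BY in_spotify_charts DESC
        LIMIT 2
    ) AS top_songs
    """
).fetchone()[0]

print(naive_total)  # 1500
print(top_total)    # 300
```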


Find out the average BPM of all tracks

In [6]:
%%sql

SELECT ROUND(AVG(bpm), 2) as average_bpm FROM spotify_tracks


 * postgresql+psycopg2://localhost:5432/intro_to_sql_for_analytics
1 rows affected.


average_bpm
122.54


Find out the average BPM of the top 20 tracks (here “top” means the tracks that appear in the most Spotify charts)


In [7]:
%%sql

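-- DESC is required so that the 20 tracks with the most chart appearances are selected.
-- Re-run this cell after the change: the value displayed below was recorded for the query without DESC.
SELECT ROUND(AVG(bpm), 1) AS average_bpm FROM (
    SELECT * FROM spotify_tracks ORDER BY in_spotify_charts DESC LIMIT 20
) top_songs;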

 * postgresql+psycopg2://localhost:5432/intro_to_sql_for_analytics
1 rows affected.


average_bpm
123.7


Find out the average BPM per release year. Do you notice any pattern?

In [13]:
%%sql

SELECT released_year, ROUND(AVG(bpm), 2) AS average_bpm_per_year
FROM spotify_tracks
GROUP BY released_year
ORDER BY released_year DESC
LIMIT 10;

 * postgresql+psycopg2://localhost:5432/intro_to_sql_for_analytics
10 rows affected.


released_year,average_bpm_per_year
2023,124.06
2022,122.0
2021,125.83
2020,118.03
2019,118.36
2018,115.3
2017,119.0
2016,126.94
2015,127.36
2014,106.0
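
When looking for a pattern, it can help to see how many tracks each yearly average is based on, since an average over only one or two tracks says little about the year as a whole. The sketch below adds a `COUNT(*)` column to the same grouping. It assumes an in-memory SQLite database instead of the notebook's PostgreSQL connection, the rows are made up, and `track_count` is a column alias introduced here for illustration.

```python
import sqlite3

# Toy stand-in for spotify_tracks; these rows are invented, not real data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE spotify_tracks (released_year INTEGER, bpm INTEGER)")
conn.executemany(
    "INSERT INTO spotify_tracks VALUES (?, ?)",
    [(2022, 120), (2022, 124), (2022, 128), (2014, 106)],
)

# Same grouping as above, plus the number of tracks behind each average.
rows = conn.execute(
    """
    SELECT released_year,
           ROUND(AVG(bpm), 2) AS average_bpm_per_year,
           COUNT(*) AS track_count
    FROM spotify_tracks
    GROUP BY released_year
    ORDER BY released_year DESC
    """
).fetchall()

for row in rows:
    print(row)
# (2022, 124.0, 3)
# (2014, 106.0, 1)
```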


Find out the top 5 years with the highest total number of streams from “lyrical” songs only (these are tracks with a `speechiness_percent` higher than 15)

In [9]:
%%sql

SELECT released_year, SUM(streams) AS total_streams_from_lyrical_songs
FROM spotify_tracks
WHERE speechiness_percent > 15
GROUP BY released_year
ORDER BY total_streams_from_lyrical_songs DESC
LIMIT 5

 * postgresql+psycopg2://localhost:5432/intro_to_sql_for_analytics
5 rows affected.


released_year,total_streams_from_lyrical_songs
2022,21868242081
2021,17973512774
2002,4728427653
2023,4702858600
2019,3410848778


List all tracks released in the year in which the most tracks were released

In [11]:
%%sql

SELECT track_name, released_year
FROM spotify_tracks
WHERE released_year = (
    -- Get the year with the highest number of tracks
    SELECT released_year
    FROM spotify_tracks
    GROUP BY released_year
    ORDER BY COUNT(*) DESC
    LIMIT 1
)
-- LIMIT 10 only shortens the displayed output; remove it to list all tracks
LIMIT 10

 * postgresql+psycopg2://localhost:5432/intro_to_sql_for_analytics
10 rows affected.


track_name,released_year
As It Was,2022
Kill Bill,2022
Calm Down (with Selena Gomez),2022
Creepin',2022
Anti-Hero,2022
I'm Good (Blue),2022
I Ain't Worried,2022
O.O,2022
La Bachata,2022
Left and Right (Feat. Jung Kook of BTS),2022
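
The same question can also be written with a common table expression (CTE), which gives the per-year counts a name and can be easier to read than nested subqueries. This is a sketch, assuming an in-memory SQLite database instead of the notebook's PostgreSQL connection. The rows are made up, and `tracks_per_year` and `track_count` are names introduced here for illustration.

```python
import sqlite3

# Toy stand-in for spotify_tracks; these rows are invented, not real data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE spotify_tracks (track_name TEXT, released_year INTEGER)")
conn.executemany(
    "INSERT INTO spotify_tracks VALUES (?, ?)",
    [("x", 2022), ("y", 2022), ("z", 2021)],
)

rows = conn.execute(
    """
    WITH tracks_per_year AS (
        -- Number of tracks released in each year
        SELECT released_year, COUNT(*) AS track_count
        FROM spotify_tracks
        GROUP BY released_year
    )
    SELECT track_name, released_year
    FROM spotify_tracks
    WHERE released_year = (
        -- Year with the highest track count
        SELECT released_year
        FROM tracks_per_year
        ORDER BY track_count DESC
        LIMIT 1
    )
    ORDER BY track_name
    """
).fetchall()

print(rows)  # [('x', 2022), ('y', 2022)]
```

If two years tie for the highest count, `LIMIT 1` returns just one of them, and which one is not guaranteed unless a tie-breaker is added to the `ORDER BY`.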


In [None]:
%%sql

-- To inspect how a query performs, prefix it with the EXPLAIN ANALYZE keywords

EXPLAIN ANALYZE SELECT * FROM spotify_tracks WHERE released_year > 2020

 * postgresql+psycopg2://localhost:5432/intro_to_sql_for_analytics
5 rows affected.


QUERY PLAN
Seq Scan on spotify_tracks (cost=0.00..30.91 rows=696 width=115) (actual time=0.007..0.126 rows=696 loops=1)
Filter: (released_year > 2020)
Rows Removed by Filter: 257
Planning Time: 0.130 ms
Execution Time: 0.168 ms
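
The plan above reports a sequential scan: every row is read and then filtered. A common next step is to add an index on the filtered column and compare the plans. The sketch below illustrates that idea under several assumptions. It uses an in-memory SQLite database and SQLite's `EXPLAIN QUERY PLAN` rather than PostgreSQL's `EXPLAIN ANALYZE`, so the plan text differs from the output above. The rows are generated, and `idx_released_year` and `plan` are names introduced here for illustration.

```python
import sqlite3

# Toy stand-in for spotify_tracks with generated rows (not real data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE spotify_tracks (track_name TEXT, released_year INTEGER)")
conn.executemany(
    "INSERT INTO spotify_tracks VALUES (?, ?)",
    [(f"track_{i}", 2000 + i % 24) for i in range(1000)],
)

query = "SELECT * FROM spotify_tracks WHERE released_year > 2020"


def plan(sql):
    """Return SQLite's query plan for `sql` as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The last column of each plan row holds the human-readable detail.
    return " | ".join(row[-1] for row in rows)


# Without an index, SQLite has to scan the whole table.
plan_before = plan(query)

# With an index on the filtered column, it can search the index instead.
conn.execute("CREATE INDEX idx_released_year ON spotify_tracks (released_year)")
plan_after = plan(query)

print(plan_before)
print(plan_after)
```

On a table as small as this one, where most rows match the filter, PostgreSQL may keep using a sequential scan even when an index exists, because reading the whole table can be cheaper than going through the index. Comparing `EXPLAIN ANALYZE` output before and after creating the index shows what the planner actually decides.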
