# Spotify Listening History Analysis

## Project Overview
This project aims to perform descriptive analysis on personal Spotify listening history. The objective is to uncover patterns, trends, and insights into my music listening habits over time. By analyzing my streaming history and library data provided by Spotify, I hope to gain a deeper understanding of my music preferences and how they have evolved.

## Data Description
The data comes in JSON format, divided into two main categories:
- **Streaming History:** Contains detailed records of songs listened to, including artist name, track name, end time, and listening duration. We have three files representing different time periods.
- **Library Information:** Provides information about the tracks saved in the Spotify library, including artist name, album name, and track details.

## Analysis Steps
1. **Data Loading:** Load the JSON files into pandas DataFrames for easy manipulation and analysis.
2. **Data Cleaning:** Inspect the data for inconsistencies, missing values, or anomalies that could skew the analysis. Standardize the data formats if necessary.
3. **Exploratory Data Analysis (EDA):**
   - Analyze listening trends over time.
   - Identify top artists, albums, and tracks.
   - Explore listening habits (e.g., most active listening times/days).
4. **Insights and Visualization:** Use graphs and charts to visualize findings from the EDA. Highlight any interesting patterns or insights about listening preferences and habits.
5. **Conclusions:** Summarize the key takeaways and any potential recommendations for future listening based on the analysis.

## Interesting Facts and Learnings
(As we progress through the analysis, this section will be populated with any interesting findings, patterns, or insights derived from the data.)


## 1. Data Loading
A `requirements.txt` file sits in the same directory as this script. Run `pip install -r requirements.txt` to install the required packages.


In [20]:
import pandas as pd
import os

print(os.getcwd())  # To get a sense of where we are

# Load one of the streaming history files
streaming_history_path = "StreamingHistory_music_0.json"

# Load the data into a pandas dataframe
streaming_history_df = pd.read_json(streaming_history_path)

# Display the first few rows of the dataframe to understand its structure
streaming_history_df.head()

/Users/heritierkaumbu/Documents/Bcom Hons IS/Learning From Data/Spotify Account Data


| | endTime | artistName | trackName | msPlayed |
|---|---|---|---|---|
| 0 | 2023-03-10 22:58 | Hot Chelle Rae | Don't Say Goodnight | 119586 |
| 1 | 2023-03-11 05:20 | Florocka | Twale | 256998 |
| 2 | 2023-03-11 05:27 | Banky W. | My Destiny | 435472 |
| 3 | 2023-03-11 05:42 | DOE | What I'm Waiting For | 170589 |
| 4 | 2023-03-11 18:14 | DJ Khaled | All I Do Is Win (feat. T-Pain, Ludacris, Snoop... | 232506 |


The first streaming history file has been successfully loaded, and here's a glimpse into its structure:

- **endTime:** The timestamp when the play ended.
- **artistName:** The name of the artist.
- **trackName:** The name of the track.
- **msPlayed:** How long the track was actually played, in milliseconds. This is listening time, not the full length of the track.

With this structure in mind, we can proceed to load all streaming history data and the library information to perform a comprehensive analysis. Let's move on to combining all streaming history data into a single DataFrame for a more extensive exploration.

In [21]:
# Paths to all streaming history files
streaming_files = [
    "StreamingHistory_music_0.json",
    "StreamingHistory_music_1.json",
    "StreamingHistory_music_2.json",
]

# Load and concatenate all streaming history data
all_streaming_df = pd.concat(
    [pd.read_json(file) for file in streaming_files], ignore_index=True
)

# Display basic information about the combined DataFrame
all_streaming_df.info()

# Preview the first few rows of the combined DataFrame
all_streaming_df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28110 entries, 0 to 28109
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   endTime     28110 non-null  object
 1   artistName  28110 non-null  object
 2   trackName   28110 non-null  object
 3   msPlayed    28110 non-null  int64 
dtypes: int64(1), object(3)
memory usage: 878.6+ KB


| | endTime | artistName | trackName | msPlayed |
|---|---|---|---|---|
| 0 | 2023-03-10 22:58 | Hot Chelle Rae | Don't Say Goodnight | 119586 |
| 1 | 2023-03-11 05:20 | Florocka | Twale | 256998 |
| 2 | 2023-03-11 05:27 | Banky W. | My Destiny | 435472 |
| 3 | 2023-03-11 05:42 | DOE | What I'm Waiting For | 170589 |
| 4 | 2023-03-11 18:14 | DJ Khaled | All I Do Is Win (feat. T-Pain, Ludacris, Snoop... | 232506 |



The three streaming history files are now combined into a single DataFrame with 28,110 entries detailing my listening sessions. The data looks clean: each record shows when I finished listening to a track, who the artist was, the track name, and how long I listened.
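The file list above is hard-coded to three exports. As a minimal sketch, assuming the files keep the `StreamingHistory_music_<n>.json` naming seen here and sit in one folder, a glob pattern can pick up however many files the export contains. The helper name `load_streaming_history` is hypothetical and not part of the original notebook.

```python
from pathlib import Path

import pandas as pd


def load_streaming_history(directory="."):
    """Load every StreamingHistory_music_*.json file in `directory` into one DataFrame.

    Hypothetical helper: the file-name pattern is assumed from the three
    files used in this notebook.
    """
    # Lexicographic sort; the row order only matters for readability here
    files = sorted(Path(directory).glob("StreamingHistory_music_*.json"))
    if not files:
        raise FileNotFoundError(f"No streaming history files found in {directory!r}")
    frames = [pd.read_json(path) for path in files]
    return pd.concat(frames, ignore_index=True)
```

Calling `load_streaming_history(".")` from the data folder should give the same combined frame as the explicit list, provided no other files match the pattern.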

### Initial Observations:

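- No values are missing: all four columns have 28,110 non-null entries. The only cleanup needed is converting `endTime` from text (`object`) to datetime.
- Memory usage is small (`info()` reports 878.6+ KB), so the data can be handled in memory.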
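The `info()` output only shows that no values are missing. A quick sketch of a few extra sanity checks follows, run on a tiny sample rather than the real export. The first two rows come from the preview above; the third is a made-up duplicate added to show what the check catches.

```python
import pandas as pd

# Tiny sample with the same four columns as the export.
# The third row is a made-up duplicate for illustration.
sample_df = pd.DataFrame(
    {
        "endTime": ["2023-03-10 22:58", "2023-03-11 05:20", "2023-03-11 05:20"],
        "artistName": ["Hot Chelle Rae", "Florocka", "Florocka"],
        "trackName": ["Don't Say Goodnight", "Twale", "Twale"],
        "msPlayed": [119586, 256998, 256998],
    }
)

# Count missing values per column
missing_per_column = sample_df.isna().sum()

# Count rows that are exact copies of an earlier row
duplicate_rows = int(sample_df.duplicated().sum())

# Count plays that lasted 0 ms (tracks skipped immediately)
zero_length_plays = int((sample_df["msPlayed"] == 0).sum())

print(missing_per_column)
print(f"Exact duplicate rows: {duplicate_rows}")
print(f"Plays with 0 ms: {zero_length_plays}")
```

The same three checks can be run on `all_streaming_df` unchanged, since it has the same column names.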

### Next Steps:

The plan is to convert listening time to minutes, identify top artists and tracks, look for listening patterns over time, and present these insights visually.

Let's dive into the data and see what my music preferences reveal about me!



In [None]:
# Compute summary statistics for the combined streaming history (all_streaming_df)

# Convert 'msPlayed' to 'minutesPlayed' for better readability
all_streaming_df['minutesPlayed'] = all_streaming_df['msPlayed'] / 60000

# Calculate total listening time in minutes
total_listening_minutes = all_streaming_df['minutesPlayed'].sum()
print(f"Total Listening Time: {total_listening_minutes:,.2f} minutes")

# Calculate the average listening time per track in minutes
average_track_length_minutes = all_streaming_df['minutesPlayed'].mean()
print(f"Average Track Length: {average_track_length_minutes:.2f} minutes")

# Find the most common artist - the mode for 'artistName'
most_common_artist = all_streaming_df['artistName'].mode()[0]
print(f"Most Common Artist: {most_common_artist}")

# Find the most common track - the mode for 'trackName'
most_common_track = all_streaming_df['trackName'].mode()[0]
print(f"Most Common Track: {most_common_track}")

# Standard Deviation for 'minutesPlayed' - measures the amount of variation or dispersion of listening times
std_dev_listening_times = all_streaming_df['minutesPlayed'].std()
print(f"Standard Deviation of Listening Times: {std_dev_listening_times:.2f} minutes")

# Median of 'minutesPlayed' - a better measure of central tendency when data is skewed
median_listening_time = all_streaming_df['minutesPlayed'].median()
print(f"Median Listening Time per Track: {median_listening_time:.2f} minutes")


- **Total Listening Time:** 88,115.28 minutes spent on Spotify. That's about 61 days of music! It showcases the significant portion of time music occupies in my daily life.
  
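- **Average Track Length:** At 3.13 minutes per play, this is in line with a typical song length, suggesting a preference for standard, radio-friendly tracks. Note that `msPlayed` records how long I actually listened, so skipped tracks can pull this figure below the true track length.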

- **Most Common Artist:** Hillsong Instrumentals tops the list, which indicates a strong inclination towards their music. That's because I'm a Christian who codes 😂, and instrumental music helps me focus.

- **Most Common Track:** 'My Blessings (Love Me)' is the most frequently played track. I really loved that song. It resonates with me on a personal level, for both its melody and its lyrics. I have been practicing gratefulness and appreciation for my life and the positive things going on in it.

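- **Standard Deviation of Listening Times:** A standard deviation of 1.70 minutes, a bit more than half the mean, points to a fairly consistent core of standard-length plays with noticeable variety, likely from skipped tracks and the occasional long one.

- **Median Listening Time per Track:** The median of 3.22 minutes is close to the 3.13-minute mean, reinforcing that most tracks I listen to are around the standard song length. The mean sitting slightly below the median hints at short, skipped plays pulling it down.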
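The headline figures above can be cross-checked with plain arithmetic. This sketch uses only the two numbers already reported (the total of 88,115.28 minutes and the 28,110 rows from `info()`); the variable names are illustrative.

```python
total_minutes = 88_115.28  # total listening time reported above
plays = 28_110             # row count reported by info()

# Convert minutes to hours and then to days
total_hours = total_minutes / 60
total_days = total_hours / 24

# The mean play length is the total divided by the number of plays
mean_minutes_per_play = total_minutes / plays

print(f"{total_hours:,.1f} hours, {total_days:.1f} days")   # 1,468.6 hours, 61.2 days
print(f"{mean_minutes_per_play:.2f} minutes per play")      # 3.13 minutes per play
```

Both results agree with the "about 61 days" and "3.13 minutes" quoted in the bullets.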

Let’s break down these next steps:

1. **Temporal Analysis**: Identify any patterns based on the `endTime` column to see when I listen to music the most, possibly by extracting the hour of the day or the day of the week from the timestamps.
2. **Variability in Listening Sessions**: Use the standard deviation to explore how varied my listening sessions are. A higher standard deviation would indicate that the length of time I spend listening varies more significantly from session to session.
3. **Genre Exploration**: If available, categorize tracks by genre and analyze which genres I listen to most often, and see if there are shifts in genre preference over time.
4. **Engagement Over Time**: Calculate metrics like the total number of tracks listened per day or month to evaluate my engagement over the period represented by the data.
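As a rough sketch of the engagement step, plays and minutes can be totalled per calendar month with `resample`. The four rows below are made up for illustration (only the column names match the export), and `monthly` is an illustrative name.

```python
import pandas as pd

# Made-up sample: two plays in March 2023, two in April 2023
sample_df = pd.DataFrame(
    {
        "endTime": pd.to_datetime(
            ["2023-03-10 22:58", "2023-03-11 05:20", "2023-04-02 18:14", "2023-04-02 19:00"]
        ),
        "msPlayed": [119586, 256998, 232506, 60000],
    }
)
sample_df["minutesPlayed"] = sample_df["msPlayed"] / 60_000

# Bucket by month start ("MS") and compute play count and total minutes
monthly = (
    sample_df.set_index("endTime")
    .resample("MS")["minutesPlayed"]
    .agg(["count", "sum"])
    .rename(columns={"count": "plays", "sum": "minutes"})
)
print(monthly)
```

Swapping `"MS"` for `"D"` would give the per-day version of the same table.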


In [23]:


# Convert 'endTime' to datetime format and create new columns for day of the week and hour of the day
all_streaming_df['endTime'] = pd.to_datetime(all_streaming_df['endTime'])

all_streaming_df['dayOfWeek'] = all_streaming_df['endTime'].dt.day_name()
all_streaming_df['hourOfDay'] = all_streaming_df['endTime'].dt.hour

# Calculate the total listening time for each day of the week
total_listening_time_by_day = all_streaming_df.groupby('dayOfWeek')['minutesPlayed'].sum().sort_values()

# Calculate the mean listening time for each hour of the day
average_listening_time_by_hour = all_streaming_df.groupby('hourOfDay')['minutesPlayed'].mean().sort_values()

# Determine the most active listening day of the week
most_active_listening_day = total_listening_time_by_day.idxmax()

# Determine the most active listening hour of the day
most_active_listening_hour = average_listening_time_by_hour.idxmax()

# Calculate the variability in listening sessions across different hours of the day
variability_by_hour = all_streaming_df.groupby('hourOfDay')['minutesPlayed'].std()

# Find the median listening time by day of the week
median_listening_time_by_day = all_streaming_df.groupby('dayOfWeek')['minutesPlayed'].median().sort_values()

# Print out the calculated metrics
print(f"Total Listening Time by Day:\n{total_listening_time_by_day}\n")
print(f"Average Listening Time by Hour:\n{average_listening_time_by_hour}\n")
print(f"Most Active Listening Day: {most_active_listening_day}")
print(f"Most Active Listening Hour: {most_active_listening_hour} o'clock")
print(f"Variability in Listening Sessions by Hour:\n{variability_by_hour}\n")
print(f"Median Listening Time by Day:\n{median_listening_time_by_day}\n")


Total Listening Time by Day:
dayOfWeek
Sunday        9862.170800
Wednesday    10460.158783
Monday       12550.803483
Friday       12848.220717
Tuesday      13501.420617
Thursday     13524.394833
Saturday     15368.105900
Name: minutesPlayed, dtype: float64

Average Listening Time by Hour:
hourOfDay
17    2.735068
14    2.819591
19    2.858447
18    2.903232
16    2.910174
12    2.922905
15    2.945923
21    2.960521
10    2.992653
20    3.011407
13    3.013192
11    3.026278
7     3.117266
9     3.123484
6     3.160322
8     3.233287
22    3.349536
5     3.408720
23    3.468259
0     3.603213
4     3.623562
2     3.780599
1     3.791506
3     3.868385
Name: minutesPlayed, dtype: float64

Most Active Listening Day: Saturday
Most Active Listening Hour: 3 o'clock
Variability in Listening Sessions by Hour:
hourOfDay
0     1.452867
1     1.549547
2     1.497633
3     1.660310
4     1.829009
5     1.828476
6     1.903588
7     1.856959
8     1.899253
9     1.697958
10    1.599455
11    1.600
... (output truncated)


- **Total Listening Time by Day:** Saturdays, I listen to music the most. It fits with coding, sports, and chilling out.
  
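- **Average Listening Time by Hour:** Plays that end around 3 AM are the longest on average (3.87 minutes per play). This metric measures how long each play lasts, not how much music I listen to in total, so it likely reflects fewer skips while I code at night on weekends.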

- **Most Active Listening Day:** Saturday is the day I play the most music, which makes sense—it’s a busy day with lots of different activities.

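- **Most Active Listening Hour:** The reported hour is 3 AM, but because it is derived from the mean play length, it marks the hour with the longest average plays rather than the most total listening. It still lines up with when I'm usually up late working on code.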

- **Variability in Listening Sessions by Hour:** There's a big mix in how long I listen to music early in the morning: in the hours shown, the standard deviation peaks at about 1.8 to 1.9 minutes between 4 AM and 8 AM. Some days it's just for a quick workout or during a shower, and other times it's for longer while coding.

- **Median Listening Time by Day:** On Fridays, I tend to have music on for longer times, which might be when I start my weekend coding and have a good workout with music.
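The hourly figure computed above is the mean play length per record, so its maximum marks the hour with the longest average plays, which is not necessarily the hour with the most listening. A minimal sketch with made-up rows contrasts the two readings; the names `by_hour`, `longest_average_hour`, and `busiest_hour` are illustrative.

```python
import pandas as pd

# Made-up plays: a few long plays at 03:00, many shorter ones at 17:00
sample_df = pd.DataFrame(
    {
        "hourOfDay": [3, 3, 17, 17, 17, 17, 17],
        "minutesPlayed": [4.0, 3.8, 2.5, 3.0, 2.7, 2.9, 2.6],
    }
)

# Play count, total minutes, and mean minutes per play for each hour
by_hour = sample_df.groupby("hourOfDay")["minutesPlayed"].agg(
    plays="count", total_minutes="sum", mean_minutes="mean"
)

# Hour with the longest average play vs. hour with the most total listening
longest_average_hour = by_hour["mean_minutes"].idxmax()
busiest_hour = by_hour["total_minutes"].idxmax()

print(by_hour)
print(f"Longest average play: {longest_average_hour}:00")
print(f"Most total listening: {busiest_hour}:00")
```

In this sample the two answers differ (3 vs. 17). Running the same aggregation on `all_streaming_df` would show whether they also differ in the real data.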

So, music is a big part of my day, every day. It keeps me going whether I’m working out, coding, or just getting ready for the day. It’s cool to see how it fits into my daily life and studies.
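One caveat on the hour-of-day results: Spotify's help page on understanding exported data describes `endTime` as UTC. If that applies to this export, the hours above are shifted relative to local time. The sketch below shows the conversion; the zone `Africa/Johannesburg` is purely an assumed example and should be replaced with the listener's actual time zone.

```python
import pandas as pd

LOCAL_TZ = "Africa/Johannesburg"  # assumption for illustration; use your own zone

# Two made-up timestamps, treated as UTC
end_times = pd.Series(pd.to_datetime(["2023-03-11 03:20", "2023-03-11 22:58"]))

# Mark the naive timestamps as UTC, then convert to local time
local_times = end_times.dt.tz_localize("UTC").dt.tz_convert(LOCAL_TZ)
local_hours = local_times.dt.hour

print(local_hours.tolist())  # [5, 0]
```

With a two-hour offset, a play ending at 03:20 UTC lands at 05:20 local time, which would shift a "3 AM" peak to early morning.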

In [24]:
# Calculate the top 10 most listened to artists and tracks
top_artists = all_streaming_df['artistName'].value_counts().head(10)
top_tracks = all_streaming_df['trackName'].value_counts().head(10)

# Group by hour and track name to identify preferences for specific activities
patterns_by_hour_track = all_streaming_df.groupby(['hourOfDay', 'trackName']).size().reset_index(name='count')
top_track_each_hour = patterns_by_hour_track.loc[patterns_by_hour_track.groupby('hourOfDay')['count'].idxmax()]

# Display the top artists and tracks
print("Top 10 Artists:\n", top_artists)
print("\nTop 10 Tracks:\n", top_tracks)
print("\nTop Track Each Hour:\n", top_track_each_hour[['hourOfDay', 'trackName', 'count']])


Top 10 Artists:
 artistName
Hillsong Instrumentals    1781
Mali Music                 637
Lecrae                     597
Hillsong Worship           569
Kanye West                 561
Michael W. Smith           502
Planetshakers              476
Matt Maher                 353
Don Moen                   277
Paul Wilbur                227
Name: count, dtype: int64

Top 10 Tracks:
 trackName
My Blessings (Love Me)                               305
Lord, I Need You                                     158
For Your Name Is Holy                                100
Because He Lives                                      74
Nothing Is Impossible (Featuring Israel Houghton)     69
Overtaken (From"Onepiece")                            63
Mighty To Save                                        61
Turn It Up - Live                                     58
Run                                                   56
Endless Praise - Live                                 55
Name: count, dtype: int64

Top Track Each Hour:
... (output truncated)

- Favorite artists and tracks reflect my Christian faith, with Hillsong Instrumentals as my top artist.
- "My Blessings (Love Me)" by Mali Music is my most played track, significant for its motivational value.
- "Lord, I Need You" is a recurring early-hour favorite, likely part of my morning routine for focus or inspiration.
- Through the morning to early afternoon, "My Blessings (Love Me)" dominates, indicating its role in positive mornings.
- Evenings show more variety, with tracks like "Bound 2" by Kanye West, a mix for winding down or energizing my nights.
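Counting by `trackName` alone merges different songs that share a title (a short title like "Run" could belong to several artists). As a sketch, grouping by artist and title together avoids that. All names in the sample are placeholders, not data from the export.

```python
import pandas as pd

# Placeholder plays: "Run" appears under three different artists
sample_df = pd.DataFrame(
    {
        "artistName": ["Artist A", "Artist A", "Artist A", "Artist B", "Artist C"],
        "trackName": ["Song X", "Song X", "Run", "Run", "Run"],
    }
)

# Counting by title only: "Run" looks like the top track (3 plays)
by_title = sample_df["trackName"].value_counts()

# Counting by (artist, title): "Song X" by Artist A is the top track (2 plays)
by_artist_and_title = (
    sample_df.groupby(["artistName", "trackName"]).size().sort_values(ascending=False)
)

print(by_title.idxmax())
print(by_artist_and_title.idxmax())
```

The same `groupby(["artistName", "trackName"])` could replace the `value_counts()` call on `all_streaming_df` if any of the top titles turn out to be shared between artists.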