# Initial Streaming Analysis
   
---
   
This initial analysis uses the basic `StreamingHistory_music_#.json` files that are included in the Spotify Account Data export. According to Spotify's "Understanding your data" page, these files hold roughly the past year of music streaming history, with the following fields:

*  `endTime`    : Date and time of when the stream ended in Coordinated Universal Time format (UTC).
*  `artistName` : Name of "creator" for each stream (e.g. the artist name of a music track).
*  `trackName`  : Name of items listened to or watched (e.g. title of music track or name of video).
*  `msPlayed`   : How many milliseconds the track was listened to.
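
To make the field definitions concrete, here is a minimal sketch that reads a single record with the standard library only. The record is the first row shown in this notebook's output; the variable names (`ended_at`, `seconds_played`, `minutes_played`) are illustrative and not part of the Spotify export.

```python
from datetime import datetime, timezone

# One record, shaped like the entries in StreamingHistory_music_#.json
record = {
    "endTime": "2025-01-17 19:39",
    "artistName": "jisokuryClub",
    "trackName": "Then Tonight",
    "msPlayed": 35141,
}

# endTime is documented as UTC, so attach the UTC timezone explicitly
ended_at = datetime.strptime(record["endTime"], "%Y-%m-%d %H:%M").replace(
    tzinfo=timezone.utc
)

# msPlayed is in milliseconds; convert to friendlier units
seconds_played = record["msPlayed"] / 1000
minutes_played = record["msPlayed"] / 60_000

print(ended_at.isoformat(), f"{seconds_played:.1f} s", f"{minutes_played:.2f} min")
```

Keeping the timestamp timezone-aware from the start avoids ambiguity later when hours of the day are compared.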
   
---
  
This dataset can be used for:
*  Behavioural pattern analysis: listening trends
  
---
  

## Table of Contents

- [1. Load, parse, and merge data files](#load-parse-merge)
- [2. Exploratory Data Analysis](#eda)
    - [2.1. Dataset overview](#dataset-overview)
    - [2.2. Univariate analysis](#univariate)
        - [2.2.1. Data Prep](#uni-data-prep)
        - [2.2.2. Numerical Data Analysis](#uni-numerical)
        - [2.2.3. Categorical Data](#uni-categorical)
        - [2.2.4. Text Data Analysis](#uni-text)
    - [2.3. Bivariate analysis](#bivariate)
        - [2.3.1. Data Prep](#bi-prep)
    - [2.4. Multivariate analysis](#multivariate)
- [3. Key insights and research questions](#key-research)
    - [3.1. Key findings](#key-findings)
    - [3.2. Machine Learning research questions and justification](#ml-questions)
- [References](#references)


<a name="load-parse-merge"></a>
# 1. Load, parse, and merge data files

In [None]:
# pip install numpy pandas matplotlib seaborn

In [10]:
# Import Library
import os

## Universal Data Processing
import numpy as np
import pandas as pd

## Regular Expression for Text Data
import re

## JSON Files Manipulation
import json

In [29]:
# Get Current Directory Address
base_dir = os.getcwd()
dataset_dir = os.path.join(base_dir, "Dataset")

# List the paths of the included dataset files
paths = [
    os.path.join(dataset_dir, "StreamingHistory_music_0.json"),
    os.path.join(dataset_dir, "StreamingHistory_music_1.json"),
    os.path.join(dataset_dir, "StreamingHistory_music_2.json"),
    os.path.join(dataset_dir, "StreamingHistory_music_3.json"),
    os.path.join(dataset_dir, "StreamingHistory_music_4.json"),
]

# Load each dataset file
all_data = []

for idx, path in enumerate(paths):              # use `for count, item in enumerate(items, start=1)` to customize the indexing
    print(f"Loading file {idx}: {path}")
    with open(path, "r", encoding="utf-8") as json_file:
        data_idx = json.load(json_file)
        all_data.append(data_idx)

print("Files loaded", len(all_data))


Loading file 0: c:\03. Other\Spotify_Unwrapped\Dataset\StreamingHistory_music_0.json
Loading file 1: c:\03. Other\Spotify_Unwrapped\Dataset\StreamingHistory_music_1.json
Loading file 2: c:\03. Other\Spotify_Unwrapped\Dataset\StreamingHistory_music_2.json
Loading file 3: c:\03. Other\Spotify_Unwrapped\Dataset\StreamingHistory_music_3.json
Loading file 4: c:\03. Other\Spotify_Unwrapped\Dataset\StreamingHistory_music_4.json
Files loaded 5
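
The number of `StreamingHistory_music_#.json` files can differ between exports, so hard-coding five paths may not carry over to another account. The sketch below is one possible alternative, not the method used in this notebook: it discovers the files by pattern and loads them in numeric order. The function name `load_streaming_history` is hypothetical.

```python
import json
import re
from pathlib import Path


def load_streaming_history(dataset_dir):
    """Load every StreamingHistory_music_<n>.json in dataset_dir, in numeric order."""
    pattern = re.compile(r"StreamingHistory_music_(\d+)\.json$")

    # Pair each matching file with its numeric suffix so that 10 sorts after 2
    matched = []
    for path in Path(dataset_dir).glob("StreamingHistory_music_*.json"):
        match = pattern.search(path.name)
        if match:
            matched.append((int(match.group(1)), path))

    # Each file holds a list of records; merge them into one flat list
    records = []
    for _, path in sorted(matched):
        with open(path, "r", encoding="utf-8") as json_file:
            records.extend(json.load(json_file))
    return records
```

Calling `load_streaming_history(dataset_dir)` would return a flat list of records that can be passed straight to `pd.json_normalize`.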


In [30]:
# Convert loaded data into suitable objects for ease of manipulation
flat_data = []

for file_data in all_data:
    flat_data.extend(file_data)

# Normalize the records into a flat table, expanding any nested dictionaries into columns
df_json = pd.json_normalize(flat_data)
df_json.head()

Unnamed: 0,endTime,artistName,trackName,msPlayed
0,2025-01-17 19:39,jisokuryClub,Then Tonight,35141
1,2025-01-18 22:02,Fujii Kaze,Matsuri,171093
2,2025-01-19 04:29,Sincere,rain,176817
3,2025-01-19 04:32,Sincere,bed,196449
4,2025-01-19 04:35,Sincere,Good Girl,202122


<a name="eda"></a>
# 2. Exploratory Data Analysis

The primary goal of Exploratory Data Analysis (EDA) is to gain an in-depth understanding of the dataset to inform subsequent decisions, such as data preprocessing, model design, or hypothesis generation.

In [21]:
# Import Library
import matplotlib.pyplot as plt
import seaborn as sns

*Missing Values*

In [22]:
print(f"Null values before cleaning:\n{df_json.isnull().sum()}")

Null values before cleaning:
endTime       0
artistName    0
trackName     0
msPlayed      0
dtype: int64
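
`isnull()` only counts true nulls, so blank strings and zero-length plays would not appear in the result above. The following is a minimal sketch of an additional check, run on a small made-up sample (the blanks and the zero are injected for illustration and are not taken from the real data). On the real data, the same expressions could be applied to `df_json`.

```python
import pandas as pd

# Hypothetical sample with two blank text values and one zero-length play
sample = pd.DataFrame({
    "endTime": ["2025-01-17 19:39", "2025-01-18 22:02", "2025-01-19 04:29"],
    "artistName": ["jisokuryClub", "", "Sincere"],
    "trackName": ["Then Tonight", "Matsuri", "  "],
    "msPlayed": [35141, 0, 176817],
})

# Count values that are empty or whitespace-only in each text column
text_cols = ["artistName", "trackName"]
blank_counts = sample[text_cols].apply(lambda col: col.str.strip().eq("").sum())

# Count streams that were recorded with no play time at all
zero_plays = int((sample["msPlayed"] == 0).sum())

print(blank_counts)
print("Zero-length plays:", zero_plays)
```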


<a name="dataset-overview"></a>
## 2.1 Dataset overview

In [31]:
df_json.info()

<class 'pandas.DataFrame'>
RangeIndex: 41802 entries, 0 to 41801
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   endTime     41802 non-null  str  
 1   artistName  41802 non-null  str  
 2   trackName   41802 non-null  str  
 3   msPlayed    41802 non-null  int64
dtypes: int64(1), str(3)
memory usage: 1.3 MB


Most columns are strings; only `msPlayed` is an integer, which records how long each track was streamed, in milliseconds.

In [32]:
df_json.shape

(41802, 4)

The dataset has 41,802 rows and 4 columns.

<a name="univariate"></a>
## 2.2 Univariate analysis

<a name="uni-data-prep"></a>
### 2.2.1 Data Prep

In [37]:
# Work on a copy so the original DataFrame stays unchanged
json_uni = df_json.copy()

In [38]:
# Convert `endTime` to datetime
json_uni['endTime'] = pd.to_datetime(json_uni['endTime'], errors='coerce')

# Extract detailed time-related data
json_uni['Year']          = json_uni['endTime'].dt.year
json_uni['Month']         = json_uni['endTime'].dt.month
json_uni['Day']           = json_uni['endTime'].dt.day
json_uni['Hour']          = json_uni['endTime'].dt.hour
json_uni['Minute']        = json_uni['endTime'].dt.minute
json_uni['HourDecimal']   = (json_uni['Hour'] + json_uni['Minute'] / 60)

In [39]:
json_uni.head()

Unnamed: 0,endTime,artistName,trackName,msPlayed,Year,Month,Day,Hour,Minute,HourDecimal
0,2025-01-17 19:39:00,jisokuryClub,Then Tonight,35141,2025,1,17,19,39,19.65
1,2025-01-18 22:02:00,Fujii Kaze,Matsuri,171093,2025,1,18,22,2,22.033333
2,2025-01-19 04:29:00,Sincere,rain,176817,2025,1,19,4,29,4.483333
3,2025-01-19 04:32:00,Sincere,bed,196449,2025,1,19,4,32,4.533333
4,2025-01-19 04:35:00,Sincere,Good Girl,202122,2025,1,19,4,35,4.583333
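
Because `endTime` is recorded in UTC, the `Hour` values above are UTC hours rather than local listening hours. The sketch below shows one way to derive local-time features. It assumes a hypothetical fixed offset of UTC+7 (`LOCAL_TZ`), since the listener's actual time zone is not stated in the data, and the column names `HourUTC`, `HourLocal`, and `HourDecimalLocal` are illustrative.

```python
from datetime import timedelta, timezone

import pandas as pd

# Hypothetical local offset (UTC+7); replace with the listener's real time zone
LOCAL_TZ = timezone(timedelta(hours=7))

sample = pd.DataFrame({"endTime": ["2025-01-17 19:39", "2025-01-19 04:29"]})

# Parse as timezone-aware UTC, then shift to the assumed local zone
end_utc = pd.to_datetime(sample["endTime"], errors="coerce", utc=True)
end_local = end_utc.dt.tz_convert(LOCAL_TZ)

sample["HourUTC"] = end_utc.dt.hour
sample["HourLocal"] = end_local.dt.hour
sample["HourDecimalLocal"] = end_local.dt.hour + end_local.dt.minute / 60

print(sample)
```

With this assumed offset, a stream that ended at 19:39 UTC falls in the early morning of the following local day, which would change how any hour-of-day pattern is interpreted.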


<a name="uni-numerical"></a>
### 2.2.2 Numerical Data Analysis

In [41]:
json_uni.describe()

Unnamed: 0,endTime,msPlayed,Year,Month,Day,Hour,Minute,HourDecimal
count,41802,41802.0,41802.0,41802.0,41802.0,41802.0,41802.0,41802.0
mean,2025-07-16 04:47:16.781972,175753.8,2025.10765,5.705875,15.006603,11.43151,29.475767,11.922773
min,2025-01-17 19:39:00,0.0,2025.0,1.0,1.0,0.0,0.0,0.0
25%,2025-04-02 12:53:15,150426.0,2025.0,3.0,7.0,6.0,14.0,6.033333
50%,2025-07-27 21:48:30,183791.0,2025.0,5.0,15.0,11.0,29.0,11.95
75%,2025-10-22 07:16:15,216860.0,2025.0,9.0,22.0,17.0,44.0,17.833333
max,2026-01-19 23:58:00,1579289.0,2026.0,12.0,31.0,23.0,59.0,23.983333
std,,73773.35,0.309942,3.587571,8.745664,6.930869,17.361094,6.935139


***Findings***:

*  The data ranges from `17 January 2025` to `19 January 2026`, roughly one year of streaming history.

*  At first glance, the means of `Month`, `Day`, and `Hour` sit close to the midpoint of their ranges, and the quartiles are spaced fairly evenly. This suggests that streams are spread roughly evenly over time rather than concentrated in a particular period, although per-value counts are needed to confirm it.
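
The two findings can be checked with simple arithmetic. The sketch below uses only values copied from the `describe()` output above: it computes the time span covered by the data and how far each mean lies from the midpoint of its range, as a fraction of that range. A small offset is consistent with an even spread but does not prove it. The names `span`, `summary`, and `offsets` are illustrative.

```python
from datetime import datetime

# First and last endTime, copied from the describe() output
first_stream = datetime(2025, 1, 17, 19, 39)
last_stream = datetime(2026, 1, 19, 23, 58)
span = last_stream - first_stream
print("Days covered:", span.days)

# (mean, min, max) for each time feature, copied from the describe() output
summary = {
    "Month": (5.705875, 1, 12),
    "Day": (15.006603, 1, 31),
    "Hour": (11.43151, 0, 23),
}

# Signed distance of the mean from the range midpoint, relative to the range width
offsets = {}
for name, (mean, low, high) in summary.items():
    midpoint = (low + high) / 2
    offsets[name] = (mean - midpoint) / (high - low)
    print(f"{name}: mean={mean:.2f}, midpoint={midpoint:.1f}, offset={offsets[name]:+.3f}")
```

A follow-up on the real data, such as `json_uni['Hour'].value_counts(normalize=True).sort_index()`, would show whether individual hours actually receive similar shares of streams.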