# ETL Pipeline - Automated Execution

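This notebook runs the Spotify ETL pipeline by importing the `SpotifyETL` class from `src/etl_pipeline.py`. The input and output paths are set in a parameters cell so the run can be automated with papermill.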

In [1]:
# Parameters cell - can be overridden by papermill
input_path = "data/raw/dataset.csv"
output_path = "data/processed/cleaned_spotify_data.csv"
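
Because the cell above is meant to be overridden by papermill, a parameterized run could be launched roughly as sketched here. The notebook filenames `etl_pipeline.ipynb` and `etl_pipeline_out.ipynb` are placeholders (this notebook's real filename is not given), the helper `build_papermill_command` is made up for illustration, and the sketch assumes the cell carries the `parameters` tag that papermill looks for.

```python
import shlex


def build_papermill_command(notebook_in, notebook_out, params):
    """Build a papermill CLI call that overrides the notebook's parameters."""
    cmd = ["papermill", notebook_in, notebook_out]
    for name, value in params.items():
        # Each `-p NAME VALUE` pair overrides one variable in the parameters cell.
        cmd += ["-p", name, str(value)]
    return cmd


cmd = build_papermill_command(
    "etl_pipeline.ipynb",      # hypothetical input notebook name
    "etl_pipeline_out.ipynb",  # hypothetical executed-copy name
    {
        "input_path": "data/raw/dataset.csv",
        "output_path": "data/processed/cleaned_spotify_data.csv",
    },
)

# Print the command as a single shell-safe string.
print(shlex.join(cmd))
```

With papermill installed, the list can be handed to `subprocess.run(cmd, check=True)`; building it as a list avoids shell-quoting problems if a path contains spaces.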

In [2]:
import sys
sys.path.append('..')

from src.etl_pipeline import SpotifyETL
import pandas as pd
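
`sys.path.append('..')` is resolved against the current working directory, so the import above depends on where the notebook is launched from. A sketch of a less location-dependent alternative, assuming the project root is the nearest ancestor directory that contains a `src` folder (the helper names `find_project_root` and `add_project_root_to_path` are made up for illustration):

```python
import sys
from pathlib import Path


def find_project_root(start=".", marker="src"):
    """Return the nearest directory at or above `start` that contains `marker`."""
    start = Path(start).resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No '{marker}' directory found at or above {start}")


def add_project_root_to_path(start="."):
    """Put the project root on sys.path (once) and return it."""
    root = find_project_root(start)
    if str(root) not in sys.path:
        sys.path.insert(0, str(root))
    return root


# Intended use in the notebook (not executed here):
# add_project_root_to_path()
# from src.etl_pipeline import SpotifyETL
```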

In [3]:
# Run ETL pipeline
print("Starting ETL Pipeline...")
etl = SpotifyETL(raw_data_path=input_path)
df_clean, csv_path, parquet_path = etl.run(output_path=output_path)

print("\n✅ ETL Complete!")
print(f"Records: {len(df_clean):,}")
print(f"Features: {len(df_clean.columns)}")
print(f"CSV: {csv_path}")
print(f"Parquet: {parquet_path}")



2025-11-12 01:41:41,325 - INFO - STARTING ETL PIPELINE
2025-11-12 01:41:41,326 - INFO - Extracting data from data/raw/dataset.csv
2025-11-12 01:41:41,504 - INFO - Successfully loaded 114,000 records with 21 columns
2025-11-12 01:41:41,505 - INFO - Generating data quality report...

Starting ETL Pipeline...

2025-11-12 01:41:41,661 - INFO - Found 0 duplicate rows
2025-11-12 01:41:41,661 - INFO - Total missing values: 3
2025-11-12 01:41:41,662 - INFO - Starting data transformation...
2025-11-12 01:41:41,701 - INFO - Removed 0 duplicate rows

The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.

  self.df[col].fillna(self.df[col].mode()[0], inplace=True)

2025-11-12 01:41:41,765 - INFO - Filled 3 missing values
2025-11-12 01:41:41,765 - INFO - Engineering new features...
2025-11-12 01:41:41,766 - INFO - Created 'duration_min' feature
2025-11-12 01:41:42,201 - INFO - Created 'mood_energy' classification
2025-11-12 01:41:42,205 - INFO - Created 'energy_category' feature
2025-11-12 01:41:42,207 - INFO - Created 'popularity_category' feature
2025-11-12 01:41:42,209 - INFO - Created 'tempo_category' feature
2025-11-12 01:41:42,210 - INFO - Validating transformed data...
2025-11-12 01:41:42,223 - INFO - Data validation passed
2025-11-12 01:41:42,224 - INFO - Data transformation complete
2025-11-12 01:41:42,224 - INFO - Saving cleaned dataset to data/processed/cleaned_spotify_data.csv
2025-11-12 01:41:42,954 - INFO - Successfully saved CSV: 114,000 records
2025-11-12 01:41:42,955 - INFO - Saving Parquet format to data/processed/cleaned_spotify_data.parquet
2025-11-12 01:41:43,088 - INFO - Successfully saved Parquet: 114,000 records
2025-11-12 01:41:43,088 - INFO - Generating data quality report...
2025-11-12 01:41:43,268 - INFO - Found 0 duplicate rows
2025-11-12 01:41:43,268 - INFO - Total missing values: 0
2025-11-12 01:41:43,269 - INFO - Data quality report saved to data/processed/data_quality_report.txt
2025-11-12 01:41:43,269 - INFO - ETL PIPELINE COMPLETE





✅ ETL Complete!
Records: 114,000
Features: 26
CSV: data/processed/cleaned_spotify_data.csv
Parquet: data/processed/cleaned_spotify_data.parquet
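
The pandas warning in the log above points at the line `self.df[col].fillna(self.df[col].mode()[0], inplace=True)`. The source of `SpotifyETL` is not shown in this notebook, so what follows is only a sketch of the assign-back form that the warning itself recommends, run on a small made-up frame; `fill_missing_with_mode` is a hypothetical name, not a method of the pipeline.

```python
import pandas as pd


def fill_missing_with_mode(df):
    """Return a copy of `df` with NaNs replaced by each column's mode,
    plus the number of values that were filled."""
    df = df.copy()
    filled = 0
    for col in df.columns:
        n_missing = int(df[col].isna().sum())
        if n_missing == 0:
            continue
        modes = df[col].mode()  # mode() ignores NaN by default
        if modes.empty:
            continue  # column is entirely NaN; nothing to fill with
        # Assign the result back instead of calling fillna(..., inplace=True)
        # on df[col], which is the pattern the warning flags.
        df[col] = df[col].fillna(modes[0])
        filled += n_missing
    return df, filled


# Made-up rows for illustration only.
sample = pd.DataFrame({
    "artists": ["A", "A", None, "B"],
    "track_name": ["x", None, "x", "y"],
    "popularity": [10, 20, 30, 40],
})
cleaned, n_filled = fill_missing_with_mode(sample)
print(f"Filled {n_filled} missing values")
print(cleaned)
```

Assigning back to `df[col]` updates the frame itself, so the fill no longer relies on chained in-place modification.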


In [4]:
# Preview cleaned data
print("\nData Preview:")
df_clean.head()


Data Preview:


|  | Unnamed: 0 | track_id | artists | album_name | track_name | popularity | duration_ms | explicit | danceability | energy | ... | liveness | valence | tempo | time_signature | track_genre | duration_min | mood_energy | energy_category | popularity_category | tempo_category |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 5SuOikwiRyPMVoIQDJUgSV | Gen Hoshino | Comedy | Comedy | 73 | 230666 | False | 0.676 | 0.461 | ... | 0.358 | 0.715 | 87.917 | 4 | acoustic | 3.844433 | Chill/Happy | Medium Energy | High Popularity | Slow |
| 1 | 1 | 4qPNDBW1i3p13qLCt0Ki3A | Ben Woodward | Ghost (Acoustic) | Ghost - Acoustic | 55 | 149610 | False | 0.42 | 0.166 | ... | 0.101 | 0.267 | 77.489 | 4 | acoustic | 2.4935 | Sad/Low Energy | Low Energy | Medium Popularity | Slow |
| 2 | 2 | 1iJBSr7s7jYXzM8EGcbK5b | Ingrid Michaelson;ZAYN | To Begin Again | To Begin Again | 57 | 210826 | False | 0.438 | 0.359 | ... | 0.117 | 0.12 | 76.332 | 4 | acoustic | 3.513767 | Sad/Low Energy | Medium Energy | Medium Popularity | Slow |
| 3 | 3 | 6lfxq3CG4xtTiEg7opyCyx | Kina Grannis | Crazy Rich Asians (Original Motion Picture Sou... | Can't Help Falling In Love | 71 | 201933 | False | 0.266 | 0.0596 | ... | 0.132 | 0.143 | 181.74 | 3 | acoustic | 3.36555 | Sad/Low Energy | Low Energy | High Popularity | Very Fast |
| 4 | 4 | 5vjLSffimiIP26QG5WcN2K | Chord Overstreet | Hold On | Hold On | 82 | 198853 | False | 0.618 | 0.443 | ... | 0.0829 | 0.167 | 119.949 | 4 | acoustic | 3.314217 | Sad/Low Energy | Medium Energy | High Popularity | Moderate |
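
The preview includes an `Unnamed: 0` column that just repeats the row index. pandas assigns that name when a CSV header has an empty first field, which usually means the file was written with its index included. A sketch of two ways to keep it out of the feature set, assuming that is the cause here (the raw file is not shown, so the inline CSV is a stand-in):

```python
import io

import pandas as pd

# Stand-in for a CSV that was saved with its index (note the empty first header field).
raw_csv = io.StringIO(
    ",track_id,popularity\n"
    "0,5SuOikwiRyPMVoIQDJUgSV,73\n"
    "1,4qPNDBW1i3p13qLCt0Ki3A,55\n"
)

# Default read: the unnamed first column becomes a regular "Unnamed: 0" column.
df_default = pd.read_csv(raw_csv)

# Option 1: treat the first column as the index while reading.
raw_csv.seek(0)
df_indexed = pd.read_csv(raw_csv, index_col=0)

# Option 2: drop the column after the fact.
df_dropped = df_default.drop(columns=["Unnamed: 0"])

print(list(df_default.columns))
print(list(df_indexed.columns))
print(list(df_dropped.columns))
```

Writing outputs with `to_csv(..., index=False)` keeps the same column from reappearing when the cleaned file is read back.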


In [5]:
# Summary statistics
print("\nSummary Statistics:")
df_clean.describe()


Summary Statistics:


|  | Unnamed: 0 | popularity | duration_ms | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | time_signature | duration_min |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 114000.0 | 114000.0 | 114000.0 | 114000.0 | 114000.0 | 114000.0 | 114000.0 | 114000.0 | 114000.0 | 114000.0 | 114000.0 | 114000.0 | 114000.0 | 114000.0 | 114000.0 | 114000.0 |
| mean | 56999.5 | 33.238535 | 228029.2 | 0.5668 | 0.641383 | 5.30914 | -8.25896 | 0.637553 | 0.084652 | 0.31491 | 0.15605 | 0.213553 | 0.474068 | 122.147837 | 3.904035 | 3.800486 |
| std | 32909.109681 | 22.305078 | 107297.7 | 0.173542 | 0.251529 | 3.559987 | 5.029337 | 0.480709 | 0.105732 | 0.332523 | 0.309555 | 0.190378 | 0.259261 | 29.978197 | 0.432621 | 1.788295 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -49.531 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 28499.75 | 17.0 | 174066.0 | 0.456 | 0.472 | 2.0 | -10.013 | 0.0 | 0.0359 | 0.0169 | 0.0 | 0.098 | 0.26 | 99.21875 | 4.0 | 2.9011 |
| 50% | 56999.5 | 35.0 | 212906.0 | 0.58 | 0.685 | 5.0 | -7.004 | 1.0 | 0.0489 | 0.169 | 4.2e-05 | 0.132 | 0.464 | 122.017 | 4.0 | 3.548433 |
| 75% | 85499.25 | 50.0 | 261506.0 | 0.695 | 0.854 | 8.0 | -5.003 | 1.0 | 0.0845 | 0.598 | 0.049 | 0.273 | 0.683 | 140.071 | 4.0 | 4.358433 |
| max | 113999.0 | 100.0 | 5237295.0 | 0.985 | 1.0 | 11.0 | 4.532 | 1.0 | 0.965 | 0.996 | 1.0 | 1.0 | 0.995 | 243.372 | 5.0 | 87.28825 |
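
The summary shows a minimum of 0 for `duration_ms`, `tempo`, and `time_signature`, and a maximum `duration_min` of about 87 minutes. These values may deserve a look even though the pipeline's own validation passed. A sketch of how such rows could be flagged; the rules and the 30-minute cutoff are illustrative assumptions, not checks taken from `SpotifyETL`, and the sample rows are made up apart from reusing a few values from the outputs above.

```python
import pandas as pd


def flag_suspicious_rows(df, max_minutes=30.0):
    """Return one boolean column per rule, aligned with the rows of `df`."""
    return pd.DataFrame(
        {
            "zero_duration": df["duration_ms"] <= 0,
            "zero_tempo": df["tempo"] <= 0,
            "zero_time_signature": df["time_signature"] <= 0,
            "very_long": df["duration_min"] > max_minutes,  # assumed cutoff
        },
        index=df.index,
    )


# Toy rows: one ordinary track, one all-zero track, one very long track.
sample = pd.DataFrame({
    "duration_ms": [230666, 0, 5237295],
    "tempo": [87.917, 0.0, 120.0],
    "time_signature": [4, 0, 4],
    "duration_min": [230666 / 60000, 0.0, 5237295 / 60000],
})

flags = flag_suspicious_rows(sample)
print(flags.sum())                    # number of rows hit by each rule
print(sample[flags.any(axis=1)])      # rows hit by at least one rule
```

Run against `df_clean`, `flags.sum()` would give a per-rule count that can be compared with the 114,000-row total before deciding whether to filter or keep those tracks.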
