# Exploratory Data Analysis - Tracked with Weights and Biases

This notebook carries out a basic Exploratory Data Analysis (EDA) of the [Dataset of songs in Spotify](https://www.kaggle.com/datasets/mrmorj/dataset-of-songs-in-spotify); its **goal is to showcase how to track notebook executions with Weights and Biases**. It assumes:

- The dataset has been uploaded to W&B using the CLI; see the `README.md` file.
- You have started this notebook via `mlflow run .` so that the required conda environment specified in `conda.yaml` is created correctly; see the `README.md` file.

All in all, the following tasks are carried out here:

- Start a new run: `run = wandb.init(project="music_genre_classification", save_code=True)`; `save_code=True` makes it possible to track the code execution.
- Download the dataset artifact and explore it briefly.
- Perform a simple EDA:
  - Run `pandas_profiling.ProfileReport()`.
  - Drop duplicates.
  - Impute missing song name and title values with `''`.
  - Create a new text field that concatenates the title and the song name.
- Finish the run: `run.finish()`.

For the data modeling, see the other notebooks, which are not tracked!

In [1]:
import wandb

In [2]:
import pandas_profiling
import pandas as pd
import seaborn as sns

In [3]:
# Create a run for our project; name automatically generated
run = wandb.init(
  project="music_genre_classification",
  save_code=True
)

[34m[1mwandb[0m: Currently logged in as: [33mdatamix-ai[0m (use `wandb login --relogin` to force relogin)
[34m[1mwandb[0m: wandb version 0.13.4 is available!  To upgrade, please run:
[34m[1mwandb[0m:  $ pip install wandb --upgrade


In [4]:
# Open the artifact: the name is not the filename,
# but the name we used when registering it
# To download the file we need to call .file()
artifact = run.use_artifact("music_genre_classification/genres_mod.parquet:latest")
df = pd.read_parquet(artifact.file())

In [5]:
df.head()

Unnamed: 0,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,type,duration_ms,time_signature,genre,song_name,title
0,0.831,0.814,2,-7.364,1,0.42,0.0598,0.0134,0.0556,0.389,156.985,audio_features,124539,4,Dark Trap,Mercury: Retrograde,
1,0.719,0.493,8,-7.23,1,0.0794,0.401,0.0,0.118,0.124,115.08,audio_features,224427,4,Dark Trap,Pathology,
2,0.85,0.893,5,,1,0.0623,0.0138,4e-06,0.372,0.0391,218.05,audio_features,98821,4,Dark Trap,Symbiote,
3,0.476,0.781,0,-4.71,1,0.103,0.0237,0.0,0.114,0.175,186.948,audio_features,123661,3,Dark Trap,ProductOfDrugs (Prod. The Virus and Antidote),
4,0.798,0.624,2,-7.668,1,0.293,0.217,0.0,0.166,0.591,147.988,audio_features,123298,4,Dark Trap,Venom,


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42896 entries, 0 to 42895
Data columns (total 17 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   danceability      42896 non-null  float64
 1   energy            42896 non-null  float64
 2   key               42896 non-null  int64  
 3   loudness          33726 non-null  float64
 4   mode              42896 non-null  int64  
 5   speechiness       42896 non-null  float64
 6   acousticness      42896 non-null  float64
 7   instrumentalness  42896 non-null  float64
 8   liveness          42896 non-null  float64
 9   valence           42896 non-null  float64
 10  tempo             42896 non-null  float64
 11  type              42896 non-null  object 
 12  duration_ms       42896 non-null  int64  
 13  time_signature    42896 non-null  int64  
 14  genre             42896 non-null  object 
 15  song_name         21811 non-null  object 
 16  title             21079 non-null  object
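
According to the `df.info()` output above, only `loudness`, `song_name`, and `title` contain missing values. As a quick worked calculation, the sketch below converts the non-null counts printed above into missing-value percentages. The dictionary restates those counts, and the variable names are illustrative, not part of the original notebook.

```python
# Non-null counts copied from the df.info() output above.
total_rows = 42896
non_null = {"loudness": 33726, "song_name": 21811, "title": 21079}

# Share of missing values per column, in percent (one decimal place).
missing_pct = {
    column: round(100 * (total_rows - count) / total_rows, 1)
    for column, count in non_null.items()
}

for column, pct in missing_pct.items():
    print(f"{column}: {pct}% missing")
```

Roughly a fifth of the `loudness` values and about half of the `song_name` and `title` values are missing. Inside the notebook, `df.isna().mean()` gives the same shares (as fractions) directly from the DataFrame.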

In [7]:
# Note: JupyterLab sometimes has issues with this; use Jupyter Notebook if you run into them
profile = pandas_profiling.ProfileReport(df, title="Pandas Profiling Report", explorative=True)
profile.to_widgets()


In [8]:
# Drop duplicates
df = df.drop_duplicates().reset_index(drop=True)
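
To make the effect of this cell concrete, here is a minimal sketch on a small made-up DataFrame (the values are hypothetical and not taken from the dataset). It counts exact duplicate rows first, then drops them and rebuilds a contiguous index, as the cell above does.

```python
import pandas as pd

# Toy data (hypothetical): rows 0 and 2 are exact duplicates.
toy = pd.DataFrame({
    "genre": ["Dark Trap", "Other", "Dark Trap"],
    "tempo": [156.985, 115.08, 156.985],
})

# duplicated() marks every repeat of an earlier row.
n_duplicates = int(toy.duplicated().sum())

# Drop the repeats and renumber the remaining rows from 0.
deduped = toy.drop_duplicates().reset_index(drop=True)

print(n_duplicates, len(deduped), list(deduped.index))
```

Without `reset_index(drop=True)` the surviving rows would keep their original labels, leaving gaps in the index.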

In [9]:
# New feature
# This feature will have to go to the feature store.
# If you do not have a feature store,
# then you should not compute it here as part of the preprocessing step.
# Instead, you should compute it within the inference pipeline.
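# Assign the result back instead of using inplace=True on a column selection,
# which is not reliable in recent pandas versions.
df['title'] = df['title'].fillna('')
df['song_name'] = df['song_name'].fillna('')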
df['text_feature'] = df['title'] + ' ' + df['song_name']
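
Since the comments above recommend computing this feature inside the inference pipeline when no feature store is available, one option is to wrap the logic in a small function that both training and inference code can call. The following is only a sketch: the function name `add_text_feature` and the toy data are assumptions made for illustration and do not appear in the original project.

```python
import pandas as pd


def add_text_feature(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with title/song_name imputed and text_feature added."""
    out = df.copy()  # leave the caller's DataFrame untouched
    out["title"] = out["title"].fillna("")
    out["song_name"] = out["song_name"].fillna("")
    out["text_feature"] = out["title"] + " " + out["song_name"]
    return out


# Toy data (hypothetical): each row has only one of the two text fields.
toy = pd.DataFrame({
    "title": [None, "Some Playlist"],
    "song_name": ["Venom", None],
})
result = add_text_feature(toy)
print(result["text_feature"].tolist())
```

Note that the separating space is kept even when one of the two fields is empty, which matches the behavior of the cell above.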

In [10]:
# Finish run to upload the results.
# Close the notebook and stop the jupyter server
# by clicking on Quit in the main Jupyter page (upper right).
# NOTE: DO NOT use Ctrl+C to shut down Jupyter.
# That would also kill the mlflow job.
run.finish()


In [11]:
# Go to W&B web interface: select run.
# You will see an option {} in the left panel.
# Click on it to see the uploaded Jupyter notebook.