
Reproducible ML Pipelines: Exploratory Data Analysis (EDA) and Modeling

This sub-folder is a stand-alone sub-project in which the Exploratory Data Analysis (EDA) and basic modeling are carried out before the code is transferred to the production environment project.

There are 3 notebooks:

  • EDA_Tracking.ipynb: a basic EDA is performed and, in the process, the notebook execution is tracked with W&B.
  • Modeling.ipynb: basic data modeling, which is later transferred to the higher-level components of the production environment.
  • UnsupervisedLearning.ipynb: basic unsupervised learning for genre clustering.

To run the notebooks, we first need to upload the dataset to W&B as an artifact.

# Folder where the dataset is located
cd ../dataset/

# Upload the dataset
# Project name: music_genre_classification
wandb artifact put \
      --name music_genre_classification/genres_mod.parquet \
      --type raw_data \
      --description "A modified version of the songs dataset" \
      genres_mod.parquet

# Folder where the data analysis is carried out
cd ../data_analysis

# We can download the dataset to the artifacts/ folder
# using the CLI; however, the notebook EDA_Tracking.ipynb
# shows how to do it with the Python API.
# Note the syntax: project_name/dataset_name:version
wandb artifact get music_genre_classification/genres_mod.parquet:latest

Note that in the production components the dataset is downloaded from a URL and added as an artifact, too, so these CLI steps are skipped there.
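The CLI steps above have Python API equivalents, which is also the approach used inside the notebooks. The following is a minimal sketch of both directions; the function names and the lazy wandb imports are our own, and it assumes wandb is installed and `wandb login` has already been run:

```python
def artifact_ref(project: str, name: str, version: str = "latest") -> str:
    """Build a W&B artifact reference with the project_name/dataset_name:version syntax."""
    return f"{project}/{name}:{version}"


def upload_dataset(path: str = "genres_mod.parquet",
                   project: str = "music_genre_classification") -> None:
    """Upload the dataset as a raw_data artifact (mirrors `wandb artifact put`)."""
    import wandb  # imported lazily so artifact_ref stays usable without wandb installed

    with wandb.init(project=project) as run:
        artifact = wandb.Artifact(
            name="genres_mod.parquet",
            type="raw_data",
            description="A modified version of the songs dataset",
        )
        artifact.add_file(path)
        run.log_artifact(artifact)


def download_dataset(project: str = "music_genre_classification",
                     name: str = "genres_mod.parquet") -> str:
    """Download the dataset artifact (mirrors `wandb artifact get`)."""
    import wandb

    with wandb.init(project=project) as run:
        artifact = run.use_artifact(artifact_ref(project, name))
        return artifact.download()  # returns the local folder containing the files
```

Using `wandb.init` as a context manager finishes the run automatically on exit, so no explicit `run.finish()` is needed here.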

Second, we run MLflow, which launches the project defined in MLproject; this project basically installs the conda environment defined in conda.yaml and starts the Jupyter notebook server. Within that Jupyter instance, we can create all the notebooks we want.

mlflow run .

# The first time, the conda environment is installed, which takes a while.
# A Jupyter notebook server should open in the default browser.
# Subsequent runs are much faster.
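For orientation, an MLproject file for this kind of workflow might look roughly as follows; this is a hypothetical sketch of the format, not the actual file in this folder:

```yaml
# Hypothetical MLproject sketch; names and commands are illustrative
name: data_analysis_eda
conda_env: conda.yaml     # environment installed on the first `mlflow run .`
entry_points:
  main:
    command: "jupyter-notebook"
```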

The notebook that shows how to track code runs (in notebooks) and artifacts using W&B is EDA_Tracking.ipynb; the steps carried out in that notebook are:

  • Start a new run: run = wandb.init(project="music_genre_classification", save_code=True); save_code=True makes it possible to track the executed code.
  • Download the dataset artifact and explore it briefly.
  • Perform a simple EDA:
    • Run pandas_profiling.ProfileReport().
    • Drop duplicates.
    • Impute missing song and title values with ''.
    • Create a new text field which is the concatenation of the title and the song name.
  • Finish the run: run.finish().
  • Note: do not stop the server with Ctrl+C; instead, close the notebook via File > Close and Halt.
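The preprocessing steps above can be sketched as follows; this is a minimal outline that assumes the columns are named title and song (the actual column names in the dataset may differ):

```python
import pandas as pd


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Deduplicate, impute missing title/song values, and build a text feature."""
    df = df.drop_duplicates().reset_index(drop=True)
    # Impute missing title and song values with ''
    df["title"] = df["title"].fillna("")
    df["song"] = df["song"].fillna("")
    # New text field: concatenation of the title and the song name
    df["text_feature"] = df["title"] + " " + df["song"]
    return df
```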

We can check in the W&B web interface that the artifacts and the run(s) are registered and tracked.

The other two notebooks perform operations that are partly transferred to the production components, but they are not tracked; tracking is shown only in the notebook EDA_Tracking.ipynb.

Links