
*Note:* You can run this from your computer (Jupyter or terminal), or use one of the
hosted options:
[![binder-logo](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/ploomber/binder-env/main?urlpath=git-pull%3Frepo%3Dhttps%253A%252F%252Fgithub.com%252Fploomber%252Fprojects%26urlpath%3Dlab%252Ftree%252Fprojects%252Fetl%252FREADME.ipynb%26branch%3Dmaster)
[![deepnote-logo](https://deepnote.com/buttons/launch-in-deepnote-small.svg)](https://deepnote.com/launch?template=deepnote&url=https://github.com/ploomber/projects/blob/master/etl/README.ipynb)


# ETL example

This example shows a non-trivial pipeline that resembles a typical
data analysis scenario. It demonstrates how
[Ploomber](https://github.com/ploomber/ploomber) lets you
develop data pipelines without worrying about plumbing code (managing
database connections, orchestrating execution, etc.).

Most notably, this project contains minimal configuration code (just a small
`db.py` file to establish a connection with the database); the rest are scripts
that perform the actual analysis. The `pipeline.yaml` file tells Ploomber how
to run the pipeline and lets everyone on the analysis team see how all the
parts fit together.
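
As a rough idea of what such a connection module might look like, here is a minimal sketch. It assumes a local SQLite database and uses only the standard library's `sqlite3`. The function name `get_client` and the `output/data.db` path are hypothetical, and the project's actual `db.py` may differ (for instance, by returning one of Ploomber's client objects).

```python
import sqlite3
from pathlib import Path


def get_client(path="output/data.db"):
    """Return a connection to a local SQLite database (hypothetical helper)."""
    db_path = Path(path)
    # Create the parent folder so the first run does not fail
    db_path.parent.mkdir(parents=True, exist_ok=True)
    return sqlite3.connect(str(db_path))
```

Keeping the connection logic in one small module means the analysis scripts never need to know where the database lives.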

This pipeline uses a subset of the [Stack Exchange dataset](https://archive.org/details/stackexchange). It gets the data from the original source, converts it from XML to CSV, uploads it to a database, aggregates it, dumps it and generates a few plots. See the diagram below (generated using `ploomber plot`):

![pipeline](pipeline.png)
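
To illustrate the XML-to-CSV step, here is a minimal sketch using only the Python standard library. It assumes each record is a `<row>` element whose attributes hold the column values. The function name `xml_rows_to_csv`, the column names and the sample data are made up for illustration; the project's own conversion script may work differently.

```python
import csv
import io
import xml.etree.ElementTree as ET


def xml_rows_to_csv(xml_source, csv_target, columns):
    """Write the attributes of each <row> element as one CSV line."""
    writer = csv.DictWriter(csv_target, fieldnames=columns)
    writer.writeheader()
    # iterparse reads the input incrementally instead of loading it all at once
    for _, element in ET.iterparse(xml_source, events=("end",)):
        if element.tag == "row":
            # Missing attributes become empty cells
            writer.writerow({col: element.get(col, "") for col in columns})
            element.clear()  # drop the element's data once it is written


# Illustrative input only, not taken from the real dataset
sample = io.BytesIO(
    b"<users>"
    b'<row Id="1" DisplayName="Ada" UpVotes="3" />'
    b'<row Id="2" DisplayName="Bob" />'
    b"</users>"
)
out = io.StringIO()
xml_rows_to_csv(sample, out, columns=["Id", "DisplayName", "UpVotes"])
print(out.getvalue())
```

Passing the column list explicitly keeps the CSV header stable even when some rows omit an attribute.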

The `pipeline.yaml` file contains a few comments to help you understand what's going on at each step.

This project also has non-trivial dependencies: a package to uncompress `.7z` files, a few Python packages, R and the R kernel for Jupyter. Everything is installed via a conda environment. See the `environment.yml` file for details.

On each push, the pipeline is tested, ensuring it works at all times. See `.github/workflows/ci.yml` for details (`etl` job).

## Setup

(Note: only required if you are running this example on your computer; skip it
if you are using Binder or Deepnote.)

~~~bash
# make sure you are in the etl folder.
conda env create --file environment.yml
conda activate etl
~~~

## Pipeline summary

~~~bash
ploomber status
~~~

~~~
name         Last run    Outdated?    Product      Doc (short)    Location
-----------  ----------  -----------  -----------  -------------  -----------
download     Has not     Source code  {'nb': File                 preprocess/
             been run                 (/Users/Edu                 download.py
                                      /dev/projec
                                      ts-ploomber
                                      /etl/output
                                      /download.i
                                      pynb),
                                      'zipped': F
                                      ile(/Users/
                                      Edu/dev/pro
                                      jects-ploom
                                      ber/etl/out
                                      put/data.7z
                                      ), 'extract
                                      ed': File(/
                                      Users/Ed
~~~

~~~
100%|██████████| 14/14 [00:00<00:00, 5631.56it/s]
* upload-users
The following upstream dependencies in task "upload-users" were not used {'convert2csv'}

--------------------------------------------------------------------------------
--------------------------------------------------------------------------------

* upload-comments
The following upstream dependencies in task "upload-comments" were not used {'convert2csv'}

--------------------------------------------------------------------------------
--------------------------------------------------------------------------------

* upload-posts
The following upstream dependencies in task "upload-posts" were not used {'convert2csv'}
~~~


## Executing the pipeline from the command line (shell)

~~~bash
ploomber build
~~~

~~~
name                 Ran?      Elapsed (s)    Percentage
-------------------  ------  -------------  ------------
download             True       104.234      76.675
convert2csv          True        15.6406     11.5053
upload-users         True         1.96886     1.4483
upvotes-by-location  True         0.036233    0.0266531
upvotes-dump         True         0.004018    0.00295565
plot-upvotes         True         2.65313     1.95165
upload-comments      True         1.95488     1.43801
comments-by-post     True         0.060915    0.0448092
comments-dump        True         0.043736    0.0321723
comments-plot        True         3.12358     2.29771
upload-posts         True         2.69441     1.98201
posts-by-length      True         0.546323    0.401876
posts-dump           True         0.071959    0.0529332
posts-plot           True         2.91        2.1406
~~~


~~~
100%|██████████| 14/14 [00:00<00:00, 5335.29it/s]
* upload-users
The following upstream dependencies in task "upload-users" were not used {'convert2csv'}

--------------------------------------------------------------------------------
--------------------------------------------------------------------------------

* upload-comments
The following upstream dependencies in task "upload-comments" were not used {'convert2csv'}

--------------------------------------------------------------------------------
--------------------------------------------------------------------------------

* upload-posts
The following upstream dependencies in task "upload-posts" were not used {'convert2csv'}
~~~

Output is generated in the ``output/`` directory.