# Integration of the awesome-pipeline data source

This notebook loads the `README.md` from [awesome-pipeline](https://github.com/pditommaso/awesome-pipeline) and extracts every listed tool together with its URL and category.
The category is taken from the section header under which the tool appears in the original `README.md`.

## Imports

In [None]:
import requests
import re
import pandas as pd
from bs4 import BeautifulSoup
import markdown
import numpy as np

## Raw Stage - Load raw data source from the GitHub repository

The URL is pinned to a specific commit hash so that the notebook always reads the same version of the README, which keeps the results reproducible.

In [47]:
RAW_DATA_SOURCE_URL = "https://raw.githubusercontent.com/pditommaso/awesome-pipeline/01f0248ad50234204c0ef1be31e3a754e0e2ae96/README.md"
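
The pinned URL follows the raw-content pattern `raw.githubusercontent.com/<owner>/<repo>/<commit>/<path>`. A minimal sketch of assembling it from its parts, so that the commit hash can be changed in one place; the helper name `build_raw_url` is hypothetical and not part of the original notebook:

```python
def build_raw_url(owner: str, repo: str, commit: str, path: str) -> str:
    """Assemble a raw.githubusercontent.com URL pinned to one commit."""
    return f"https://raw.githubusercontent.com/{owner}/{repo}/{commit}/{path}"


# Same values as in RAW_DATA_SOURCE_URL.
PINNED_URL = build_raw_url(
    owner="pditommaso",
    repo="awesome-pipeline",
    commit="01f0248ad50234204c0ef1be31e3a754e0e2ae96",
    path="README.md",
)
```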

In [48]:
response = requests.get(RAW_DATA_SOURCE_URL, timeout=30)
# Fail loudly on HTTP errors instead of silently leaving readme_text undefined.
response.raise_for_status()
readme_text = response.text

### Extract relevant tool data from raw source and convert into a pandas DataFrame

In [50]:
soup = BeautifulSoup(markdown.markdown(readme_text), 'html.parser')

In [51]:
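h2_elements = soup.find_all("h2")
ul_elements = soup.find_all("ul")
all_data = []
# Pair the n-th section header with the n-th list; this assumes each section
# has exactly one <ul> and that the lists appear in the same order as the headers.
for h2_element, ul_element in zip(h2_elements, ul_elements):
    category = h2_element.text
    for li_element in ul_element.find_all("li"):
        # Every link inside a list item is recorded as one tool entry.
        for a_element in li_element.find_all("a"):
            all_data.append((category, a_element.text, a_element.get("href")))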
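
The extraction loop pairs the n-th `h2` with the n-th `ul`, which relies on every section having exactly one list and no nested lists. As an alternative sketch (not the notebook's method), the Markdown can be scanned line by line while remembering the most recent `## ` header, so each link inherits the header it actually appears under. The function name `extract_tools`, the regular expressions, and the sample text are illustrative assumptions; only the standard library is used:

```python
import re

HEADER_RE = re.compile(r"^##\s+(.*\S)\s*$")          # level-2 Markdown header
ITEM_RE = re.compile(r"^\s*[*+-]\s+")                # bullet list item
LINK_RE = re.compile(r"\[([^\]]+)\]\(([^)\s]+)\)")   # [name](url)


def extract_tools(markdown_text):
    """Return (category, name, url) tuples for links found in list items."""
    rows = []
    category = None
    for line in markdown_text.splitlines():
        header = HEADER_RE.match(line)
        if header:
            category = header.group(1)
            continue
        # Only list items below a known header contribute tool entries.
        if category is not None and ITEM_RE.match(line):
            for name, url in LINK_RE.findall(line):
                rows.append((category, name, url))
    return rows


# Hypothetical sample, not taken from the real README.
sample = """\
## Workflow platforms
* [ToolA](https://example.org/a) - hypothetical entry.

## Pipeline frameworks & libraries
* [ToolB](https://github.com/example/b) - hypothetical entry.
* [ToolC](https://example.org/c) - hypothetical entry.
"""
rows = extract_tools(sample)
```

The resulting tuples have the same `(category, name, url)` shape as `all_data`, so they could be passed to `pd.DataFrame` in the same way.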


In [52]:
df = pd.DataFrame(all_data,columns=["category","name","url"])

In [53]:
df.shape

(205, 3)

### Save result of raw stage

In [None]:
df.to_csv("data/01_raw/awesome_pipelines.csv", index=False)

## Intermediate Stage - URL mapping, column selection, ID creation


### Map tool URL to homepage_url or repo_url

The README provides only one URL per tool and does not state whether it points to a code repository or to a tool homepage.
As a heuristic, a URL that contains `github.com` is treated as a repository URL; every other URL is treated as a homepage URL.

In [57]:
# True where the URL points to GitHub (literal substring match, not a regex).
is_github = df["url"].str.contains("github.com", regex=False)
df["homepage_url"] = df["url"].where(~is_github)
df["repo_url"] = df["url"].where(is_github)
# Check that every original URL was mapped to exactly one of the two columns.
assert (df["homepage_url"].notna() ^ df["repo_url"].notna()).all()
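
A substring test also matches URLs that merely mention `github.com` somewhere in the path or query string. A slightly stricter sketch compares the parsed hostname instead. The helper name `url_kind` and the example URLs are assumptions for illustration, and GitHub Pages sites (`*.github.io`) are deliberately treated as homepages here:

```python
from urllib.parse import urlparse


def url_kind(url: str) -> str:
    """Return 'repo_url' for github.com links, otherwise 'homepage_url'."""
    host = (urlparse(url).hostname or "").lower()
    # Accept github.com itself and its subdomains (e.g. gist.github.com).
    if host == "github.com" or host.endswith(".github.com"):
        return "repo_url"
    return "homepage_url"


# Hypothetical URLs, not taken from the real README.
examples = [
    "https://github.com/example/tool",
    "https://example.github.io/tool",
    "https://example.org/docs?mirror=github.com",
]
kinds = [url_kind(u) for u in examples]
```

With pandas, the same function could be applied column-wise via `df["url"].map(url_kind)`.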

### Create IDs 

In [58]:
# Build the ID from the lowercased name with all whitespace removed.
df["id"] = df["name"].apply(lambda x: re.sub(r"\s+", "", x.lower()))
df = df[["id", "name", "homepage_url", "repo_url", "category"]]
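
Because the ID is only the lowercased name with whitespace removed, two differently spelled names can collapse to the same ID. A small sketch of the same normalization plus a duplicate check; `make_id`, `duplicate_ids`, and the sample names are hypothetical:

```python
import re
from collections import Counter


def make_id(name: str) -> str:
    """Lowercase the name and remove all whitespace."""
    return re.sub(r"\s+", "", name.lower())


def duplicate_ids(names):
    """Return the IDs produced by more than one name, in sorted order."""
    counts = Counter(make_id(n) for n in names)
    return sorted(i for i, c in counts.items() if c > 1)


# Hypothetical names, not taken from the real README.
names = ["Tool A", "toola", "Other Tool"]
ids = [make_id(n) for n in names]
dups = duplicate_ids(names)
```

An empty result from `duplicate_ids(df["name"])` would indicate that the generated IDs are unique.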

### Save result of intermediate stage

In [None]:
df.to_csv("data/02_intermediate/awesome_pipelines.csv",index=False)

In [60]:
df.shape

(205, 5)

In [69]:
df.groupby("id").count().sum()

name            205
homepage_url     89
repo_url        116
category        205
dtype: int64

## Processed Stage - only keep relevant tools

### Filter by the category assigned to each tool

In [63]:
categories_to_keep = [
    "Pipeline frameworks & libraries",
    "Workflow platforms",
    "ETL & Data orchestration",
    "Extract, transform, load (ETL)",
]

In [64]:
df_filtered = df.query("category in @categories_to_keep").reset_index(drop=True)

In [65]:
df_filtered.groupby(by="category").count()

                                  id  name  homepage_url  repo_url
category
ETL & Data orchestration           4     4             2         2
Extract, transform, load (ETL)     6     6             3         3
Pipeline frameworks & libraries  124   124            39        85
Workflow platforms                31    31            21        10


### Save result of processed stage

In [66]:
df_filtered.to_csv("data/03_processed/awesome_pipelines.csv", index=False)

In [67]:
df_filtered.shape

(165, 5)