# Integration of the awesome data engineering data source

This notebook loads the "README.md" from [awesome data engineering](https://github.com/igorbarinov/awesome-data-engineering) and extracts every listed tool together with its description, URL, category, and subcategory.
Each tool's category and subcategories are derived from the section hierarchy, using the corresponding headings in the original README.md.
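
To make the idea of deriving categories from the section hierarchy concrete, here is a minimal sketch that uses only regular expressions. It is not the approach this notebook takes (the notebook parses the rendered HTML with BeautifulSoup), and the sample text, tool names, URLs, and the function name `extract_tools` are all hypothetical.

```python
import re

# Hypothetical README excerpt; tool names and URLs are placeholders.
SAMPLE = """\
## Databases
- Relational
  - [ToolA](https://example.com/a) - First placeholder tool.
- [ToolB](https://example.com/b) - Second placeholder tool.

## Workflow
- [ToolC](https://example.com/c) - Third placeholder tool.
"""

HEADING = re.compile(r"^#{1,6}\s+(.*)$")
ITEM = re.compile(r"^(\s*)[-*]\s+(.*)$")
LINK = re.compile(r"\[([^\]]+)\]\(([^)]+)\)")


def extract_tools(text):
    """Return (category, subcategory, name, url) tuples for linked bullets."""
    rows = []
    category = None
    parents = {}  # indentation width -> label of an unlinked parent bullet
    for line in text.splitlines():
        heading = HEADING.match(line)
        if heading:
            # A new heading starts a new category and resets the nesting.
            category = heading.group(1).strip()
            parents = {}
            continue
        item = ITEM.match(line)
        if not item:
            continue
        indent, body = len(item.group(1)), item.group(2)
        # Forget parent bullets that are not shallower than this item.
        parents = {k: v for k, v in parents.items() if k < indent}
        link = LINK.search(body)
        if link:
            subcategory = parents[max(parents)] if parents else None
            rows.append((category, subcategory, link.group(1), link.group(2)))
        else:
            parents[indent] = body.strip()
    return rows


for row in extract_tools(SAMPLE):
    print(row)
```

The sketch only handles one level of unlinked parent bullets cleanly; the notebook's HTML-based approach also covers tools that have their own nested sub-lists.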

## Imports

In [None]:
import requests
import pandas as pd
from bs4 import BeautifulSoup
import markdown
import re
import numpy as np

## Raw Stage - Load raw data source from the GitHub repository

The URL is pinned to a specific commit hash to ensure reproducibility.

In [55]:
RAW_DATA_SOURCE_URL = "https://raw.githubusercontent.com/igorbarinov/awesome-data-engineering/6785280e6ec2a63fc7673b9c8c3cb07676f93da8/README.md"
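
As a small illustration of how the pinned URL is composed, the following sketch builds it from its parts. The helper name `pinned_raw_url` is hypothetical; the owner, repository, commit hash, and path are taken from the URL above.

```python
def pinned_raw_url(owner, repo, commit, path):
    """Build a raw.githubusercontent.com URL that points at one fixed commit."""
    return f"https://raw.githubusercontent.com/{owner}/{repo}/{commit}/{path}"


url = pinned_raw_url(
    "igorbarinov",
    "awesome-data-engineering",
    "6785280e6ec2a63fc7673b9c8c3cb07676f93da8",
    "README.md",
)
print(url)
```

Because the commit hash replaces a branch name in the URL, later changes to the upstream README do not affect the downloaded content.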

In [56]:
response = requests.get(RAW_DATA_SOURCE_URL, timeout=30)
# Fail early on HTTP errors instead of leaving readme_text undefined.
response.raise_for_status()
readme_text = response.text

### Extract relevant tool data from raw source and convert into a pandas DataFrame

In [58]:
soup = BeautifulSoup(markdown.markdown(readme_text), 'html.parser')

In [59]:
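def dictify(ul):
    """Recursively convert a <ul> element into a nested dict.

    Leaf entries map a tool name to a (description, url) tuple.
    """
    result = {}
    for li in ul.find_all("li", recursive=False):
        key = next(li.stripped_strings)
        nested_ul = li.find("ul")
        if nested_ul:
            result[key] = dictify(nested_ul)
            # The parent item may itself be a linked tool; keep it under its own key.
            first_child = li.find(True)
            missed_ref = first_child.get("href") if first_child else None
            if missed_ref:
                # The first line of the item text is the parent's own description.
                descr_text = li.text.split("\n")[0]
                result[key][key] = descr_text, missed_ref
        else:
            link = li.find("a")
            result[key] = li.text, (link.get("href", "") if link else "")
    return result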

In [60]:
all_uls = soup.find_all("ul", recursive=False)
l_all = []
for ul_element in all_uls:
    dicti = dictify(ul_element)
    # Flatten the nested dict; "#" joins the key path down to the tool name.
    df = pd.json_normalize(dicti, max_level=2, sep="#").T.reset_index(names="index_like")
    # Use the text of the element preceding the list (the section heading) as category.
    df["category"] = ul_element.find_previous().text
    l_all.append(df)
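
To show what the flattening step produces, here is a minimal pure-Python sketch. It only mimics the key-joining behaviour of `pd.json_normalize(..., max_level=2, sep="#")` and is not pandas' implementation. The function name `flatten` and the example data are hypothetical, with the data shaped like the output of `dictify`.

```python
def flatten(nested, sep="#", max_level=2, _level=0, _prefix=""):
    """Flatten nested dicts into one dict whose keys are sep-joined paths."""
    flat = {}
    for key, value in nested.items():
        path = f"{_prefix}{sep}{key}" if _prefix else key
        if isinstance(value, dict) and _level < max_level:
            # Descend one level and extend the key path.
            flat.update(flatten(value, sep, max_level, _level + 1, path))
        else:
            flat[path] = value
    return flat


# Hypothetical data: leaves are (description, url) tuples.
example = {
    "Tool A": ("Tool A - a description.", "https://example.com/a"),
    "Subcategory X": {
        "Tool B": ("Tool B - another description.", "https://example.com/b"),
        "Tool C": {
            "Tool C": ("Tool C - parent entry.", "https://example.com/c"),
            "Plugin D": ("Plugin D - child entry.", "https://example.com/d"),
        },
    },
}

flat = flatten(example)
for path in flat:
    print(path)
```

Each flattened key has one to three parts, which is why the later split on `#` yields at most three columns.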


In [61]:
df=pd.concat(l_all).reset_index(drop=True)

In [62]:
df_names=df["index_like"].str.split("#",expand=True)

In [63]:
not_nan_column_count = df_names.notna().sum(axis=1)

In [64]:
df_names.loc[not_nan_column_count==2,2]= df_names.loc[not_nan_column_count==2,1]
df_names.loc[not_nan_column_count==2,1]=None

In [65]:
df_names.loc[not_nan_column_count==1,2]= df_names.loc[not_nan_column_count==1,0]
df_names.loc[not_nan_column_count==1,0]=None

In [66]:
df_names.rename(columns={0:"subcategory",1:"tool_subcategory",2:"name"},inplace=True)
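
The column shifting in the cells above can be summarised for a single key path as follows. This is a sketch of the same rule in plain Python, assuming paths of one to three parts; the function name `assign_columns` and the sample paths are hypothetical.

```python
def assign_columns(index_like, sep="#"):
    """Map a flattened key path to subcategory, tool_subcategory, and name."""
    parts = index_like.split(sep)
    if len(parts) == 1:
        # Only a tool name: no subcategory at all.
        return {"subcategory": None, "tool_subcategory": None, "name": parts[0]}
    if len(parts) == 2:
        # Subcategory and tool name: the middle level stays empty.
        return {"subcategory": parts[0], "tool_subcategory": None, "name": parts[1]}
    return {"subcategory": parts[0], "tool_subcategory": parts[1], "name": parts[2]}


print(assign_columns("Tool A"))
print(assign_columns("Subcategory X#Tool B"))
print(assign_columns("Subcategory X#Tool C#Plugin D"))
```

In every case the last part of the path ends up in `name`, so the tool name is always found in the same column.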

In [67]:
result = pd.concat([df[["category"]],df_names,pd.DataFrame(df[0].to_list(), columns=['description', 'tool_url'])],axis=1)

In [68]:
result.shape

(185, 6)

### Save result of raw stage

In [69]:
result.to_csv("data/01_raw/awesome_data_engineering.csv", index=False)

### Sanity check

In [70]:
# Collect every link in the parsed lists and report those missing from the result table.
all_urls = [a.get("href") for ul_element in all_uls for a in ul_element.find_all("a")]
table_url_set = set(result.tool_url.to_list())
all_urls_set = set(all_urls)
all_urls_set - table_url_set

{'https://chairnerd.seatgeek.com/building-out-the-seatgeek-data-pipeline/',
 'https://en.wikipedia.org/wiki/Shared-nothing_architecture'}

## Intermediate Stage - URL mapping, column mapping, ID creation

### Map tool URL to homepage_url or repo_url

Each tool comes with a single URL, and the source does not say whether it points to a code repository or to the tool's homepage.
A URL is therefore treated as a repository URL if it contains `github.com`, and as a homepage URL otherwise.

In [71]:
# Literal substring match (regex=False), so the "." in "github.com" is not a wildcard.
is_repo = result["tool_url"].str.contains("github.com", regex=False)
result["homepage_url"] = result["tool_url"].where(~is_repo)
result["repo_url"] = result["tool_url"].where(is_repo)
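
A stricter variant of this check could compare the parsed hostname instead of searching the whole URL. The following is a minimal sketch using only the standard library; it is not what the notebook does, the function name `split_tool_url` is hypothetical, and the last sample URL is a placeholder.

```python
from urllib.parse import urlparse


def split_tool_url(tool_url):
    """Return (homepage_url, repo_url); exactly one of the two is set."""
    host = (urlparse(tool_url).hostname or "").lower()
    is_repo = host == "github.com" or host.endswith(".github.com")
    return (None, tool_url) if is_repo else (tool_url, None)


print(split_tool_url("https://github.com/apache/spark"))
print(split_tool_url("https://spark.apache.org/"))
# A URL that merely mentions github.com in its query string is not a repository.
print(split_tool_url("https://example.com/?ref=github.com"))
```

Unlike the substring check, this variant does not classify a URL as a repository just because `github.com` appears somewhere in its path or query string.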

### Create IDs 

In [72]:
# Lowercase the name and strip all whitespace (raw string avoids an invalid-escape warning).
result["id"] = result["name"].apply(lambda x: re.sub(r"\s+", "", x.lower()))
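
Because the ID is derived from the name alone, two differently written names can collapse to the same ID. The sketch below applies the same normalisation to a few hypothetical names and lists the IDs that occur more than once; the helper name `make_id` and the sample names are assumptions for illustration.

```python
import re
from collections import Counter


def make_id(name):
    """Lowercase the name and remove all whitespace."""
    return re.sub(r"\s+", "", name.lower())


names = ["Apache Spark", "apache  spark", "Luigi"]  # hypothetical sample names
ids = [make_id(n) for n in names]
duplicates = [i for i, count in Counter(ids).items() if count > 1]
print(ids)
print(duplicates)
```

A check like this can be run before the IDs are used as keys in later stages.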

### Select required columns

In [73]:
result = result[["id","name","homepage_url","repo_url","category","subcategory","tool_subcategory"]]

### Save result of intermediate stage

In [74]:
result.to_csv("data/02_intermediate/awesome_data_engineering.csv",index=False)

In [75]:
result.shape

(185, 7)

In [76]:
result.groupby(by="id").count().sum()

name                185
homepage_url        106
repo_url             79
category            185
subcategory          95
tool_subcategory     14
dtype: int64

## Processed Stage - Keep only relevant tools

### Filter by the category assigned to each tool

In [None]:
category_to_keep=["Workflow","Batch Processing","Stream Processing"]

In [None]:
result = result.query("category in @category_to_keep")

In [None]:
result.shape

(45, 7)

### Save result of processed stage

In [None]:
result.to_csv("data/03_processed/awesome_data_engineering.csv",index=False)