# Integration of the existing workflow system revisited data source

This notebook loads the CSV file of the existing workflow system revisited data source (EWSR). EWSR was created as part of this thesis by revising and complementing all tools and tool URL data found in the original source, [existing workflow systems](https://github.com/common-workflow-language/common-workflow-language/wiki/Existing-Workflow-systems). The revised data was collected in a CSV file, which is used here.

## Imports

In [None]:
import pandas as pd
import re 

## Raw Stage - Load raw data source from the EWSR csv

In [None]:
df = pd.read_csv("data/01_raw/tools_existing_workflow_systems - Sheet1.csv")

## Intermediate Stage - e.g. URL mapping, column mapping, create id

### Rename columns

In [None]:
df.rename(columns={"domain": "subcategory", "tool_origin": "category"}, inplace=True)

### Create IDs 

In [None]:
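# Build the id from the tool name: lowercase it and strip all whitespace.
# The pattern is a raw string so that "\s" is not treated as an invalid escape.
df["id"] = df["name"].apply(lambda x: re.sub(r"\s+", "", x.lower()))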
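
The ID rule above (lowercase the name, then remove all whitespace) can map differently written names to the same ID. The following is a minimal sketch of that rule on hypothetical tool names, not on the EWSR data; the helper `make_id` and the collision check are illustrative assumptions and not part of the notebook.

```python
import re
from collections import Counter

# Hypothetical tool names for illustration only (not taken from the EWSR CSV)
names = ["Apache Airflow", "apache  airflow", "Snakemake"]


def make_id(name: str) -> str:
    """Lowercase the name and remove every run of whitespace."""
    return re.sub(r"\s+", "", name.lower())


ids = [make_id(n) for n in names]

# IDs that occur more than once would collide if used as a key
collisions = sorted(i for i, count in Counter(ids).items() if count > 1)

print(ids)
print(collisions)
```

If such collisions matter for later joins, checking for them directly after the ID column is created is probably the cheapest place to catch them.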

### Select relevant columns

In [None]:
df = df[["id","name","homepage_url","repo_url","publication_url","category","subcategory"]]

### Save result of intermediate stage

In [None]:

df.to_csv("data/02_intermediate/existing_workflow_systems.csv",index=False)

In [None]:
# Count the non-null entries per column for each id, then total them per column
df.groupby(by="id").count().sum()

## Processed Stage - only keep relevant tools

Remove all tools without a repository entry, i.e. rows whose `repo_url` is missing.

### Save result of processed stage

In [None]:
df.loc[df["repo_url"].notna()].to_csv("data/03_processed/existing_workflow_systems.csv", index=False)
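
The `notna()` filter used above can be illustrated on a small toy frame. This is only a sketch: the rows and the `example.org` URLs are hypothetical placeholders, not EWSR entries.

```python
import pandas as pd

# Hypothetical rows; the URLs are placeholders, not EWSR entries
toy = pd.DataFrame(
    {
        "id": ["toola", "toolb", "toolc"],
        "repo_url": ["https://example.org/a", None, "https://example.org/c"],
    }
)

# Boolean mask: True where repo_url is present, False where it is missing
has_repo = toy["repo_url"].notna()

# Keep only the rows with a repository entry
processed = toy.loc[has_repo]

print(processed["id"].tolist())
print(int((~has_repo).sum()), "row(s) dropped")
```

Note that `notna()` only treats missing values (`NaN`/`None`) as absent; an empty string would count as present. Since `pd.read_csv` parses empty fields as `NaN` by default, this distinction should rarely matter for data loaded straight from the CSV.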