# Consolidated Stage

The **consolidated stage** combines the results of the **processed stage** for the four integrated data sources `awesome-pipeline`, `awesome-data-engineering`, `existing_workflows_systems_revisited`, and `lfai_landscape`. In addition, duplicates are removed and missing GitHub repository data is filled in from a manually curated mapping (a manual search was performed for tools that had no repository data).


## Imports

In [None]:
import pandas as pd
from pathlib import Path

## Load the integrated data from the **processed stage**

In [7]:
concat_list = []
for filepath in Path('data/03_processed').glob("*.csv"):
    source_name = filepath.stem  # file name without the .csv extension
    print(source_name)
    df = pd.read_csv(filepath)
    print(df.shape)
    df["source"] = source_name
    concat_list.append(df)

df = pd.concat(concat_list).reset_index(drop=True)

awesome_data_engineering
(45, 7)
awesome_pipelines
(165, 5)
existing_workflow_systems
(263, 6)
lfai_landscape
(57, 6)


## Remove duplicates

A tool can be present in more than one data source, so duplicates need to be removed when the data sources are combined.

In [8]:
df.shape

(530, 8)

In [9]:
# Helper column that (subjectively) ranks the quality of each data source.
# The frame is later sorted ascending on this column and drop_duplicates keeps the
# first row, so the row with the lowest ranking value is the one that is kept.
df["source_ranking"] = df["source"].map({"awesome_data_engineering": 4,
                                         "awesome_pipelines": 3,
                                         "existing_workflow_systems": 1,
                                         "lfai_landscape": 2})

In [10]:
# Naming is inconsistent, for example "apache airflow" vs. "airflow"; simply remove "apache" from the id
df["id_consolidated"] = df["id"].str.replace("apache", "", regex=False)

In [11]:
df.sort_values(by=["id_consolidated","source_ranking"]).drop_duplicates(subset="id").shape

(421, 10)

In [12]:
# Drop duplicates using the id_consolidated column (the id with the "apache" substring removed).
# Rows are sorted by source_ranking within each id_consolidated, so the first (kept) row
# is the one from the preferred source.
df = df.sort_values(by=["id_consolidated", "source_ranking"]).drop_duplicates(subset="id_consolidated")

In [13]:
df.shape

(418, 10)
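
The ranked deduplication above can be illustrated on a small toy frame. This is only a sketch: the rows below are made up for illustration and are not taken from the data, and only the ranking values are copied from the notebook. It shows that, because the sort is ascending and `drop_duplicates` keeps the first occurrence, the row with the lowest `source_ranking` value survives.

```python
import pandas as pd

# Hypothetical toy rows: the same tool appears in two sources under different ids.
toy = pd.DataFrame(
    {
        "id": ["apacheairflow", "airflow", "luigi"],
        "source": ["awesome_pipelines", "existing_workflow_systems", "awesome_pipelines"],
    }
)

# Ranking values as used in the notebook cell above.
ranking = {
    "awesome_data_engineering": 4,
    "awesome_pipelines": 3,
    "existing_workflow_systems": 1,
    "lfai_landscape": 2,
}
toy["source_ranking"] = toy["source"].map(ranking)

# Strip the literal substring "apache" so both spellings share one key.
toy["id_consolidated"] = toy["id"].str.replace("apache", "", regex=False)

# Ascending sort puts the lowest ranking value first; drop_duplicates keeps that row.
deduped = (
    toy.sort_values(by=["id_consolidated", "source_ranking"])
    .drop_duplicates(subset="id_consolidated", keep="first")
    .reset_index(drop=True)
)
print(deduped[["id", "source"]])
```

In this toy case the `airflow` row from `existing_workflow_systems` (ranking 1) is kept and the `apacheairflow` row from `awesome_pipelines` (ranking 3) is dropped.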

### Fill missing data entries for tools

In some cases a code repository (e.g. a GitHub URL) was not present in the raw data although a repository exists.
For the combined data, a manual search was therefore performed at this point for every tool without a repository URL.
If a repository URL was found for a tool, it was added to the mapping below.

In [14]:
# mapping with missing information for tools; for example, a repo_url entry was added for each tool where one was found
mapping = {"actionchain": {"tool_subcategory":"stackstorm","repo_url":"https://github.com/StackStorm/st2"},
           "activepapers":{"repo_url":"https://github.com/activepapers/activepapers-python"},
           "arteria": {"repo_url":"https://github.com/arteria-project/arteria-packs"},
           "bds":{"repo_url":"https://github.com/pcingola/bds","homepage_url":"https://pcingola.github.io/bds/"},
           "biokepler":{"status":"retired","status_comment":"url_taken_over"},
           "bonobo":{"repo_url":"https://github.com/python-bonobo/bonobo"},
           "cascading":{"repo_url":"https://github.com/cwensel/cascading"},
           "census":{"tool_type":"commercial"},
           "clubber":{"repo_url":"https://bitbucket.org/bromberglab/clubber"},
            "compss":{"repo_url":"https://github.com/bsc-wdc/compss"},
            "datalad":{"repo_url":"https://github.com/datalad/datalad"},
            "dbt":{"repo_url":"https://github.com/dbt-labs/dbt-core"},
            "doit":{"repo_url":"https://github.com/pydoit/doit"},
            "drill":{"repo_url":"https://github.com/apache/drill"},
            "apacheflink":{"repo_url":"https://github.com/apache/flink"},
            "flowhub":{"tool_type":"commercial"},
            "giraph":{"repo_url":"https://github.com/apache/giraph"},
            "graphlabcreate":{"repo_url":"https://github.com/apple/turicreate/"},
            "guixworkflowlanguage":{"alternative_tool_name":"gwl","repo_url":"https://git.savannah.gnu.org/cgit/gwl.git"},
            "h2o":{"repo_url":"https://github.com/h2oai/h2o-3"},
            "hadoopmapreduce":{"repo_url":"https://github.com/apache/hadoop"},
            "apachehudi":{"repo_url":"https://github.com/apache/hudi"},
            "apacheiravata":{"repo_url":"https://github.com/apache/airavata"},
            "joblib":{"repo_url":"https://github.com/joblib/joblib"},
            "kepler":{"status":"retired"},
            "kibaetl":{"repo_url":"https://github.com/thbar/kiba", "tool_type":"commercial_w_oss"},
            "knimeanalyticsplatform":{"alternative_tool_name":"knime","repo_url":"https://github.com/knime/knime-core"},
            "linkedpipesetl":{"repo_url":"https://github.com/linkedpipes/etl"},
            "livy":{"repo_url":"https://github.com/apache/incubator-livy"},
            "longbow":{"repo_url":"https://github.com/hecbiosim/longbow"},
            "mahout":{"repo_url":"https://github.com/apache/mahout"},
            "make":{},
            "makeflow":{"repo_url":"https://github.com/cooperative-computing-lab/cctools"},
            "nextflowworkbench":{"status":"url_not_found"},
            "pentahokettle":{"repo_url":"https://github.com/pentaho/pentaho-kettle"},
            "prefectcore":{"repo_url":"https://github.com/PrefectHQ/prefect"},
            "presto":{"repo_url":"https://github.com/prestodb/presto"},
            "qdo":{"repo_url":"https://bitbucket.org/berkeleylab/qdo"},
            "rmake":{},
            "ruffus":{"alternative_tool_name":"cgat-ruffus","repo_url":"https://github.com/cgat-developers/ruffus"},
            "sake":{"repo_url":"https://github.com/tonyfischetti/sake"},
            "apachesamza":{"repo_url":"https://github.com/apache/samza"},
            "spark":{"repo_url":"https://github.com/apache/spark"},
            "sparkgraphx":{"repo_url":"https://github.com/apache/spark"},
            "sparkmllib":{"repo_url":"https://github.com/apache/spark"},
            "sparkpackages":{},
            "sparkrddapiexamples":{},
            "sparkstreaming":{"repo_url":"https://github.com/apache/spark"},
            "springclouddataflow":{"repo_url":"https://github.com/spring-cloud/spring-cloud-dataflow"},
            "apachestorm":{"repo_url":"https://github.com/apache/storm"},
            "stpipe":{"status":"not_found"},
            "streampipes":{"repo_url":"https://github.com/apache/streampipes"},
            "swift":{"repo_url":"https://github.com/swift-lang/swift-k"},
            "taverna":{"status":"retired"},
            "tez":{"repo_url":"https://github.com/apache/tez"},
            "voltdb":{"repo_url":"https://github.com/VoltDB/voltdb"},
            "wallaroo":{"status":"not_found"},
            "worldmake":{"status":"not_found"},
            "yap":{"status":"not_found"},
            "zenaton":{"tool_type":"commercial"},
            "zenml":{"repo_url":"https://github.com/zenml-io/zenml"}
           }

In [15]:
# extract only the repo_url entries from the mapping; tools without a repo_url are skipped
flat_mapping = {name: info["repo_url"] for name, info in mapping.items() if info.get("repo_url")}

In [16]:
# fill in repo_url from the mapping (looked up by id); tools without a mapping entry keep their existing value
df["repo_url"] = df.apply(lambda x: flat_mapping.get(x["id"], x["repo_url"]), axis=1)

In [17]:
df.shape

(418, 10)
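
The row-wise `apply` above can also be expressed as a vectorized lookup. The following is a minimal sketch on hypothetical toy data, assuming the same semantics as the notebook cell: a mapped URL replaces whatever is in `repo_url`, and rows without a mapping entry keep their existing value.

```python
import pandas as pd

# Hypothetical toy data; the real frame has more columns.
toy = pd.DataFrame(
    {
        "id": ["bonobo", "make", "luigi"],
        "repo_url": [None, None, "https://github.com/spotify/luigi"],
    }
)

# Small subset of the manual mapping; entries without a repo_url are skipped.
mapping = {
    "bonobo": {"repo_url": "https://github.com/python-bonobo/bonobo"},
    "make": {},
}
flat_mapping = {
    name: info["repo_url"] for name, info in mapping.items() if info.get("repo_url")
}

# Look up every id at once; ids without a mapping yield NaN and fall back
# to the value that was already in repo_url.
toy["repo_url"] = toy["id"].map(flat_mapping).fillna(toy["repo_url"])
print(toy)
```

Here `bonobo` receives the mapped URL, `make` stays empty because its mapping entry has no `repo_url`, and `luigi` keeps the URL it already had.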

## Remove tools with missing repo_url entry

### Remove rows where repo_url is NaN

In [18]:
df=df[df["repo_url"].notna()]
df.shape

(401, 10)

### Only keep rows/tools with a GitHub code repository

In [19]:
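# regex=False: match the literal string "github.com" (as a regex, "." would match any character)
df = df[df["repo_url"].str.contains("github.com", regex=False)]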
df.shape

(380, 10)
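
A substring check on `github.com` also accepts URLs that merely contain that text somewhere else, for example in the path. A stricter variant parses the URL and compares the hostname. The sketch below uses only the standard library; the helper name `is_github_url` and the third example URL are hypothetical and not part of the notebook.

```python
from urllib.parse import urlparse


def is_github_url(url: str) -> bool:
    """Return True if the URL's host is github.com (optionally with a www. prefix)."""
    host = urlparse(url).hostname or ""
    return host.lower() in {"github.com", "www.github.com"}


urls = [
    "https://github.com/apache/spark",
    "https://bitbucket.org/berkeleylab/qdo",
    "https://example.org/mirror/github.com/tool",  # a substring match would keep this one
]
results = [is_github_url(u) for u in urls]
print(results)
```

Applied to the table, this could be used as `df[df["repo_url"].map(is_github_url)]`, assuming all remaining `repo_url` values are strings, which holds after the NaN rows have been removed.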

In [20]:
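# several tools can point to the same repository (e.g. the Spark components all map to apache/spark); keep one row per repo_url
df = df.drop_duplicates(subset="repo_url")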
df.shape

(364, 10)

In [21]:
df=df[['id', 'name', 'homepage_url', 'repo_url', 'category', 'subcategory',
       'tool_subcategory', 'source']]

## Save consolidated `tools` table

In [22]:
df.to_csv("data/04_consolidated/tools.csv",index=False)

In [24]:
df.groupby(by="source").count()

| source | id | name | homepage_url | repo_url | category | subcategory | tool_subcategory |
|---|---|---|---|---|---|---|---|
| awesome_data_engineering | 26 | 26 | 18 | 26 | 26 | 12 | 2 |
| awesome_pipelines | 59 | 59 | 16 | 59 | 59 | 0 | 0 |
| existing_workflow_systems | 238 | 238 | 166 | 238 | 227 | 224 | 0 |
| lfai_landscape | 41 | 41 | 41 | 41 | 41 | 41 | 0 |
