# CWL Workflow Documentation

This documentation describes a continuous integration setup that uses GitHub Actions to run a Common Workflow Language (CWL) workflow whenever a push event occurs on the main branch of the repository. It covers the motivation for the setup as well as the technical details.

Several considerations motivated running the CWL workflow through GitHub Actions:

- Automation and Continuous Integration: By setting up this workflow, we aim to automate the execution of CWL workflows whenever changes are pushed to the main branch of the repository. This ensures that the workflow is consistently tested and executed in a controlled environment, promoting reliability and efficiency in our data analysis pipeline.

- Version Control and Collaboration: GitHub provides version control and collaboration features. Integrating the CWL workflow with GitHub Actions lets us manage changes, track the execution history of the workflow alongside the commits that triggered it, and develop the CWL files collaboratively.

- Reproducibility and Portability: CWL is widely adopted as a standard for defining bioinformatics workflows. Describing the analysis in CWL and running it through GitHub Actions makes the pipeline easier for other researchers to rerun and validate, which supports scientific transparency and collaboration.

- Efficiency and Scalability: GitHub Actions executes workflows on hosted, cloud-based virtual machines, so no dedicated CI server has to be maintained. Jobs can also run in parallel to reduce overall execution time, although the workflow described here uses a single job.

- Ease of Setup and Configuration: GitHub Actions workflows are defined in declarative YAML, which makes it straightforward to specify steps, dependencies, and triggers. This keeps the CWL integration quick to adopt and easy to maintain.

## Workflow Configuration

To set up the CWL workflow, follow these steps:

- Create the .github/workflows/ directory in the main repository.

- Inside the .github/workflows/ directory, create the main.yml file to define the workflow.

The contents of the main.yml file are as follows:

```yaml
name: CWL Workflow

on:
  push:
    branches: [main]

jobs:
  run-cwl:
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v2

      - name: Set up Python
        uses: actions/setup-python@v2
        with:
          python-version: 3.8

      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
          pip install cwlref-runner

      - name: Set execute permission for run_workflow.py
        run: chmod +x run_workflow.py

      - name: List directory contents
        run: ls -l

      - name: Run CWL workflow
        run: |
          cwltool $GITHUB_WORKSPACE/run_workflow.cwl $GITHUB_WORKSPACE/input.yml
```


The workflow is triggered whenever a push event occurs on the main branch of the repository. It runs on an ubuntu-latest virtual machine. The steps in the workflow include:

1. Checking out the repository code using actions/checkout@v2.
2. Setting up Python 3.8 using actions/setup-python@v2.
3. Installing the required dependencies, including cwlref-runner, using pip.
4. Setting the execute permission for the run_workflow.py script.
5. Listing the directory contents to verify the files.
6. Running the CWL workflow using cwltool with the run_workflow.cwl file and input.yml file as inputs.
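
The final step can also be reproduced outside of GitHub Actions. The following is a minimal sketch, assuming cwltool is installed locally and that the repository root contains run_workflow.cwl and input.yml. The helper name `build_cwltool_command` and the fallback to the current directory are illustrative assumptions, not part of the repository.

```python
import os
import shlex
from pathlib import Path


def build_cwltool_command(workspace=None):
    """Return the argument list used by the "Run CWL workflow" step."""
    # GITHUB_WORKSPACE is set by GitHub Actions. Falling back to the current
    # directory for local runs is an assumption of this sketch.
    root = Path(workspace or os.environ.get("GITHUB_WORKSPACE", "."))
    return ["cwltool", str(root / "run_workflow.cwl"), str(root / "input.yml")]


if __name__ == "__main__":
    # Print the command rather than running it, so the sketch also works
    # on machines where cwltool is not installed.
    print(shlex.join(build_cwltool_command()))
```

To actually run the tool, the returned list can be passed to `subprocess.run(..., check=True)`.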

## CWL Workflow Definition

The CWL workflow is defined in the run_workflow.cwl file located in the main repository. The contents of the run_workflow.cwl file are as follows:

```yaml
cwlVersion: v1.0
class: CommandLineTool

baseCommand: ["python"]

inputs:
  script:
    type: File
  oc_meta:
    type: Directory
  erih_plus:
    type: File
  doaj:
    type: File

arguments:
  - valueFrom: $(inputs.script.path)

outputs:
  result:
    type: stdout
  OCMeta_DOAJ_ErihPlus_merged:
    type: File
    outputBinding:
      glob: "OCMeta_DOAJ_ErihPlus_merged.csv"

stdout: result.txt
```


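Although it is referred to here as a workflow, run_workflow.cwl is defined as a CWL CommandLineTool with python as the base command. It takes four inputs: three files (script, erih_plus, and doaj) and one directory (oc_meta). The script input is the main script file (run_workflow.py); the other three are the data sources used by the script. Only script is placed on the command line, through the arguments section. The data inputs declare no inputBinding, so they are not passed to the script as arguments.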
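
To make the resulting command line concrete, here is a toy model of how baseCommand and arguments combine for this tool. It is a simplified sketch, not how cwltool is implemented: it handles only the single expression form used in run_workflow.cwl, and the `/work/...` paths and the function name `build_command_line` are hypothetical.

```python
def build_command_line(tool, job):
    """Simplified model: baseCommand followed by the resolved arguments."""
    argv = list(tool["baseCommand"])
    for arg in tool.get("arguments", []):
        value = arg["valueFrom"]
        # Resolve only the expression form $(inputs.<name>.path).
        if value.startswith("$(inputs.") and value.endswith(".path)"):
            name = value[len("$(inputs."):-len(".path)")]
            value = job[name]["path"]
        argv.append(value)
    return argv


# The relevant parts of run_workflow.cwl, written as a Python dict.
tool = {
    "baseCommand": ["python"],
    "arguments": [{"valueFrom": "$(inputs.script.path)"}],
}

# Hypothetical staged paths; a real runner chooses these itself.
job = {
    "script": {"class": "File", "path": "/work/run_workflow.py"},
    "oc_meta": {"class": "Directory", "path": "/work/csv_dump"},
    "erih_plus": {"class": "File", "path": "/work/ERIHPLUSapprovedJournals.csv"},
    "doaj": {"class": "File", "path": "/work/journalcsv__doaj.csv"},
}

command = build_command_line(tool, job)
print(command)  # ['python', '/work/run_workflow.py']
```

The data inputs do not appear in the result because nothing in the tool definition binds them to the command line.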

The tool executes the file given as the script input. It produces two outputs: result and OCMeta_DOAJ_ErihPlus_merged. The result output captures the script's standard output, which is written to result.txt. The OCMeta_DOAJ_ErihPlus_merged output is the file matched by the glob pattern "OCMeta_DOAJ_ErihPlus_merged.csv" in the output directory.
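
After a run, it can be useful to confirm that both outputs were produced. The sketch below assumes the outputs were collected into a known directory (cwltool writes them to the current directory unless `--outdir` is given). The function name `missing_outputs` is illustrative.

```python
from pathlib import Path

# File names taken from the `stdout` and `glob` fields of run_workflow.cwl.
EXPECTED_OUTPUTS = ("result.txt", "OCMeta_DOAJ_ErihPlus_merged.csv")


def missing_outputs(outdir="."):
    """Return the expected output files that are absent from outdir."""
    base = Path(outdir)
    return [name for name in EXPECTED_OUTPUTS if not (base / name).is_file()]


if __name__ == "__main__":
    missing = missing_outputs()
    if missing:
        print("Missing outputs:", ", ".join(missing))
    else:
        print("All expected outputs are present.")
```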

### Explanation of input.yml file

The input.yml file defines the inputs for the CWL workflow. Each input is specified with its name, class, and location. Relative locations are resolved against the directory that contains input.yml.



```yaml
script:
  class: File
  location: run_workflow.py
oc_meta:
  class: Directory
  location: csv_dump/
erih_plus:
  class: File
  location: ERIHPLUSapprovedJournals.csv
doaj:
  class: File
  location: journalcsv__doaj.csv
```


These inputs tell the CWL workflow which files and directories to make available when run_workflow.py is executed.

By customizing the values in the input.yml file, you can provide different input files and directories to the CWL workflow, depending on your specific requirements and data sources.

- script: The main script file. Its class is File and its location points to run_workflow.py.

- oc_meta: A directory input. Its class is Directory and its location points to the csv_dump/ directory.

- erih_plus: A file input. Its class is File and its location points to ERIHPLUSapprovedJournals.csv.

- doaj: A file input. Its class is File and its location points to journalcsv__doaj.csv.
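
A missing or misnamed input is easier to catch before cwltool starts. The following sketch checks that every location exists and matches its declared class. To stay free of third-party dependencies, the contents of input.yml are mirrored as a Python dict instead of being parsed from the file. The function name `check_job_inputs` and the `base_dir` parameter (the directory containing input.yml) are assumptions made for illustration.

```python
from pathlib import Path

# Mirror of input.yml as a plain dict.
JOB = {
    "script": {"class": "File", "location": "run_workflow.py"},
    "oc_meta": {"class": "Directory", "location": "csv_dump/"},
    "erih_plus": {"class": "File", "location": "ERIHPLUSapprovedJournals.csv"},
    "doaj": {"class": "File", "location": "journalcsv__doaj.csv"},
}


def check_job_inputs(job, base_dir="."):
    """Return a list of problems; an empty list means every input was found."""
    problems = []
    for name, entry in job.items():
        # Relative locations are interpreted relative to base_dir.
        path = Path(base_dir) / entry["location"]
        if entry["class"] == "File" and not path.is_file():
            problems.append(f"{name}: file not found at {path}")
        elif entry["class"] == "Directory" and not path.is_dir():
            problems.append(f"{name}: directory not found at {path}")
    return problems


if __name__ == "__main__":
    for problem in check_job_inputs(JOB):
        print(problem)
```

Running it from the repository root prints one line per input that cannot be found and prints nothing when all four inputs are in place.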