## Intro To IPFS

IPFS is a distributed file system that addresses files by their content rather than by their location.  It does this by running the content of a file through a hashing algorithm, such as SHA-256, and using the resulting hash as the file's identifier.  You will need to install and initialize IPFS for the following commands to work.  A link to the website is below the image.  
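
To make the idea of content addressing concrete, here is a minimal Python sketch.  It is an illustration only and not how IPFS is implemented: the `put` and `get` helpers and the in-memory dictionary are hypothetical, and a real IPFS hash is a multihash computed over encoded blocks rather than a bare SHA-256 hex digest of the file bytes.

```python
import hashlib

# Hypothetical in-memory store: maps a content hash to the bytes it identifies.
store = {}

def put(content: bytes) -> str:
    """Store content under the SHA-256 hex digest of its bytes and return that digest."""
    key = hashlib.sha256(content).hexdigest()
    store[key] = content
    return key

def get(key: str) -> bytes:
    """Look content up by its hash rather than by a file name or location."""
    return store[key]

# The same bytes always map to the same key, whatever the file was called.
key_a = put(b"workflow preprocess_depth {}\n")
key_b = put(b"workflow preprocess_depth {}\n")
print(key_a == key_b)                                    # True
print(get(key_a) == b"workflow preprocess_depth {}\n")   # True
```

Because the key is derived from the bytes alone, two people who add the same file independently end up with the same identifier.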

### IPFS Website

<img src="https://ipfs.io/images/video-still-demo.png"></img>
<a href="http://ipfs.io" target="_blank">IPFS</a>

### Overview

The following walks through an example use of IPFS and assumes you already have IPFS installed.  

If you do not want to install IPFS, a web app is available instead.  The web app can be downloaded from GitHub.  Following the README for <a href="https://github.com/kdgosik/WDL_Tasks" target="_blank">this repository</a> will get the app up and running with a few commands (assuming you already have npm installed).  

## Example

We start with a simple bash command to look at the contents of a WDL file.  This file is then added to IPFS, which displays the hash of its content.  We can then use that hash to retrieve the content of the WDL file again.  

In [27]:
%%bash

cat /Users/kgosik/Documents/Projects/WebApps/WDL_Tasks/ParseWDLs/ValidatedWDLs/00_depth_preprocessing.wdl

workflow preprocess_depth {
  Array[File] beds
  String batch

  call concat_batch as preprocess_DELs {
    input:
      beds=beds,
      batch=batch,
      svtype="DEL"
  }

  call concat_batch as preprocess_DUPs {
    input:
      beds=beds,
      batch=batch,
      svtype="DUP"
  }

  output {
    File del_bed = preprocess_DELs.bed
    File dup_bed = preprocess_DUPs.bed
    File del_bed_idx = preprocess_DELs.bed_idx
    File dup_bed_idx = preprocess_DUPs.bed_idx
  }
}

task concat_batch {
  Array[File] beds
  String svtype
  String batch

  command <<<
    zcat ${sep=' ' beds} \
      | sed -e '/^#chr/d' -e 's/cn.MOPS/cnmops/g' \
      | awk -v svtype=${svtype} '($6==svtype)' \
      | sort -k1,1V -k2,2n \
      | awk -v OFS="\t" -v svtype=${svtype} -v batch=${batch} '{$4=batch"_"svtype"_"NR; print}' \
      | cat <(echo -e "#chr\tstart\tend\tname\tsample\tsvtype\tsources") - \
      | bgzip -c \
      > ${batch}.${svtype}.bed.gz;
  tabix -p bed ${batch}.${svtype}.bed.gz
  >>>

  ou

### Adding the file to IPFS

Make sure IPFS is installed and initialized before running the following commands.  The first command adds the file to IPFS.  It prints a message containing the hash of the WDL file's content.  

In [12]:
%%bash

ipfs add ParseWDLs/ValidatedWDLs/00_depth_preprocessing.wdl

added QmddfhXwGhxEWVfze5GUXBy4BaxeRpR7gBAUQkRV6TmZu6 00_depth_preprocessing.wdl


 1.12 KB / 1.12 KB  100.00% 0s
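
When scripting around `ipfs add`, the hash usually has to be pulled out of the `added <hash> <name>` line shown above.  The following is a small sketch of one way to do that in Python.  The function name `parse_add_line` is hypothetical, and the code assumes the output keeps the three-field layout seen in this notebook.

```python
from typing import Tuple

def parse_add_line(line: str) -> Tuple[str, str]:
    """Split an 'added <hash> <name>' line into (hash, name).

    Assumes the layout printed by the `ipfs add` call above. The name may
    contain spaces, so the line is split on the first two separators only.
    """
    prefix, content_hash, name = line.strip().split(" ", 2)
    if prefix != "added":
        raise ValueError(f"unexpected line: {line!r}")
    return content_hash, name

content_hash, name = parse_add_line(
    "added QmddfhXwGhxEWVfze5GUXBy4BaxeRpR7gBAUQkRV6TmZu6 00_depth_preprocessing.wdl"
)
print(content_hash)  # QmddfhXwGhxEWVfze5GUXBy4BaxeRpR7gBAUQkRV6TmZu6
print(name)          # 00_depth_preprocessing.wdl
```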

### Retrieving the file content by hash

The hash above corresponds to the content of the WDL file.  If you add the same content again, you will get the same hash, even if the file has a different name or is added from another location.  To retrieve the content, run the following command. 

In [13]:
%%bash

ipfs cat QmddfhXwGhxEWVfze5GUXBy4BaxeRpR7gBAUQkRV6TmZu6

workflow preprocess_depth {
  Array[File] beds
  String batch

  call concat_batch as preprocess_DELs {
    input:
      beds=beds,
      batch=batch,
      svtype="DEL"
  }

  call concat_batch as preprocess_DUPs {
    input:
      beds=beds,
      batch=batch,
      svtype="DUP"
  }

  output {
    File del_bed = preprocess_DELs.bed
    File dup_bed = preprocess_DUPs.bed
    File del_bed_idx = preprocess_DELs.bed_idx
    File dup_bed_idx = preprocess_DUPs.bed_idx
  }
}

task concat_batch {
  Array[File] beds
  String svtype
  String batch

  command <<<
    zcat ${sep=' ' beds} \
      | sed -e '/^#chr/d' -e 's/cn.MOPS/cnmops/g' \
      | awk -v svtype=${svtype} '($6==svtype)' \
      | sort -k1,1V -k2,2n \
      | awk -v OFS="\t" -v svtype=${svtype} -v batch=${batch} '{$4=batch"_"svtype"_"NR; print}' \
      | cat <(echo -e "#chr\tstart\tend\tname\tsample\tsvtype\tsources") - \
      | bgzip -c \
      > ${batch}.${svtype}.bed.gz;
  tabix -p bed ${batch}.${svtype}.bed.gz
  >>>

  ou
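
The hash used above is not an opaque label.  Hashes that start with `Qm` are base58-encoded multihashes: a one-byte code for the hash function, a one-byte digest length, and then the digest.  The sketch below decodes the hash from this example to show that structure.  The `b58decode` helper is written here for illustration and is not part of the IPFS tooling.

```python
# Base58 alphabet (Bitcoin variant), which omits 0, O, I and l.
ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"

def b58decode(text: str) -> bytes:
    """Decode a base58 string into bytes."""
    number = 0
    for char in text:
        number = number * 58 + ALPHABET.index(char)
    body = number.to_bytes((number.bit_length() + 7) // 8, "big")
    # Each leading '1' character stands for one leading zero byte.
    leading_zeros = len(text) - len(text.lstrip("1"))
    return b"\x00" * leading_zeros + body

raw = b58decode("QmddfhXwGhxEWVfze5GUXBy4BaxeRpR7gBAUQkRV6TmZu6")
function_code, digest_length, digest = raw[0], raw[1], raw[2:]

print(hex(function_code))  # 0x12, the multihash code for SHA2-256
print(digest_length)       # 32, the number of digest bytes that follow
print(digest.hex())        # the digest itself
```

Note that the digest is computed over the encoded block that IPFS stores, so it will generally differ from what `sha256sum` reports for the raw file.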

### Linking Files

It is also possible to link two files together.  This results in a unique hash that represents the two linked files.  First, we need to add a second file to link to the first one.  

In [14]:
%%bash

ipfs add ParseWDLs/ValidatedWDLs/00_pesr_processing_single_algorithm.wdl

added QmYjBMLVeZTJ2RYHkVnVarkQ1qSv5xnqSzTsCbVZuzqL6w 00_pesr_processing_single_algorithm.wdl


 1.17 KB / 1.17 KB  100.00% 0s

In [15]:
%%bash 

ipfs cat QmYjBMLVeZTJ2RYHkVnVarkQ1qSv5xnqSzTsCbVZuzqL6w

workflow preprocess_algorithm {
  File vcf
  File contigs
  String sample
  String algorithm
  Int min_svsize

  call standardize_vcf {
    input: 
      raw_vcf=vcf,
      algorithm=algorithm,
      group=sample,
      contigs=contigs,
      min_svsize=min_svsize
  }

  call sort_vcf {
    input: 
      unsorted_vcf=standardize_vcf.std_vcf,
      algorithm=algorithm,
      group=sample
  }

  output {
    File std_vcf = sort_vcf.sorted_vcf
  }
}

task standardize_vcf {
  File raw_vcf
  File contigs
  Int min_svsize
  String algorithm
  String group

  command {
    svtk standardize --prefix ${algorithm}_${group} --contigs ${contigs} --min-size ${min_svsize} ${raw_vcf} ${algorithm}.${group}.vcf ${algorithm}
  }

  output { 
    File std_vcf="${algorithm}.${group}.vcf"
    String group_="${group}"
  }
  
  runtime {
    docker: "msto/sv-pipeline"
  }
}

task sort_vcf {
  File unsorted_vcf
  String algorithm
  String group
 
  command {
    vcf-sort -c ${unsorted_vcf} | bgzip -c > ${algo

Once both files have been added, you can take their hashes and link them together into an object.  This produces a single hash, from which you can retrieve the linked content of one or both of the files.  The following command links the first WDL to the second and returns a hash.  The link between the files can be given a name; in this case we will call it "linked-wdls". 

In [16]:
%%bash

ipfs object patch add-link QmddfhXwGhxEWVfze5GUXBy4BaxeRpR7gBAUQkRV6TmZu6 linked-wdls QmYjBMLVeZTJ2RYHkVnVarkQ1qSv5xnqSzTsCbVZuzqL6w

QmTBCvac8akrk1nWBZxQcc1paRJ9iKGPYnjWD2XXfGbcuW
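
Conceptually, `add-link` produces a new object whose hash covers both the original data and the named link.  The sketch below mimics that idea in Python under loud assumptions: the `node_hash` and `add_link` helpers are hypothetical, and the JSON serialization stands in for the encoding and multihash format that IPFS actually uses, so the hashes it computes will not match real IPFS hashes.

```python
import hashlib
import json

def node_hash(node: dict) -> str:
    """Hash a node's data together with its links (simplified stand-in for IPFS)."""
    encoded = json.dumps(node, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()

def add_link(node: dict, name: str, target_hash: str) -> dict:
    """Return a copy of node with one more named link; the input is left unchanged."""
    links = node["Links"] + [{"Name": name, "Hash": target_hash}]
    return {"Data": node["Data"], "Links": links}

first = {"Data": "workflow preprocess_depth {}", "Links": []}
second = {"Data": "workflow preprocess_algorithm {}", "Links": []}

linked = add_link(first, "linked-wdls", node_hash(second))

# The linked node gets a new hash, while the original node keeps its own.
print(node_hash(linked) != node_hash(first))      # True

# Changing the second file changes its hash, and therefore the linked hash too.
changed = {"Data": "workflow preprocess_algorithm { }", "Links": []}
relinked = add_link(first, "linked-wdls", node_hash(changed))
print(node_hash(relinked) != node_hash(linked))   # True
```

This is why the single hash returned by the command acts as a fingerprint of both files: it cannot stay the same if either one changes.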


Now that we have the two wdl files linked together we can explore the hash to see what content we can get from it.  The first thing we can do is run the <code>ipfs cat</code> command to see the entire file content.  This will just be the two files' content concatenated into one.

In [17]:
%%bash 

ipfs cat QmTBCvac8akrk1nWBZxQcc1paRJ9iKGPYnjWD2XXfGbcuW

workflow preprocess_depth {
  Array[File] beds
  String batch

  call concat_batch as preprocess_DELs {
    input:
      beds=beds,
      batch=batch,
      svtype="DEL"
  }

  call concat_batch as preprocess_DUPs {
    input:
      beds=beds,
      batch=batch,
      svtype="DUP"
  }

  output {
    File del_bed = preprocess_DELs.bed
    File dup_bed = preprocess_DUPs.bed
    File del_bed_idx = preprocess_DELs.bed_idx
    File dup_bed_idx = preprocess_DUPs.bed_idx
  }
}

task concat_batch {
  Array[File] beds
  String svtype
  String batch

  command <<<
    zcat ${sep=' ' beds} \
      | sed -e '/^#chr/d' -e 's/cn.MOPS/cnmops/g' \
      | awk -v svtype=${svtype} '($6==svtype)' \
      | sort -k1,1V -k2,2n \
      | awk -v OFS="\t" -v svtype=${svtype} -v batch=${batch} '{$4=batch"_"svtype"_"NR; print}' \
      | cat <(echo -e "#chr\tstart\tend\tname\tsample\tsvtype\tsources") - \
      | bgzip -c \
      > ${batch}.${svtype}.bed.gz;
  tabix -p bed ${batch}.${svtype}.bed.gz
  >>>

  ou

We can also explore the created object piece by piece using the <code>ipfs object</code> set of commands.  We can look at the object itself, the data stored in the object, or any links that were established in the object.  

In [18]:
%%bash

echo messy version:
## messy version
ipfs object get QmTBCvac8akrk1nWBZxQcc1paRJ9iKGPYnjWD2XXfGbcuW
echo 
echo
echo pretty print:

## pretty print with python
ipfs object get QmTBCvac8akrk1nWBZxQcc1paRJ9iKGPYnjWD2XXfGbcuW | python -m json.tool

messy version:
{"Links":[{"Name":"linked-wdls","Hash":"QmYjBMLVeZTJ2RYHkVnVarkQ1qSv5xnqSzTsCbVZuzqL6w","Size":1205}],"Data":"\u0008\u0002\u0012\ufffd\tworkflow preprocess_depth {\n  Array[File] beds\n  String batch\n\n  call concat_batch as preprocess_DELs {\n    input:\n      beds=beds,\n      batch=batch,\n      svtype=\"DEL\"\n  }\n\n  call concat_batch as preprocess_DUPs {\n    input:\n      beds=beds,\n      batch=batch,\n      svtype=\"DUP\"\n  }\n\n  output {\n    File del_bed = preprocess_DELs.bed\n    File dup_bed = preprocess_DUPs.bed\n    File del_bed_idx = preprocess_DELs.bed_idx\n    File dup_bed_idx = preprocess_DUPs.bed_idx\n  }\n}\n\ntask concat_batch {\n  Array[File] beds\n  String svtype\n  String batch\n\n  command \u003c\u003c\u003c\n    zcat ${sep=' ' beds} \\\n      | sed -e '/^#chr/d' -e 's/cn.MOPS/cnmops/g' \\\n      | awk -v svtype=${svtype} '($6==svtype)' \\\n      | sort -k1,1V -k2,2n \\\n      | awk -v OFS=\"\\t\" -v svtype=${svtype} -v batch=${batch} '{$4=b
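
The JSON printed by `ipfs object get` can also be inspected programmatically instead of piping it through `python -m json.tool`.  The sketch below is one way to do so.  The `summarize_object` helper is hypothetical, and the sample string reproduces only the `Links` entry from the output above, with the `Data` field shortened to a placeholder because the full value is cut off in this notebook.

```python
import json

def summarize_object(raw_json: str) -> list:
    """Return (name, hash, size) for every link in an `ipfs object get` result."""
    node = json.loads(raw_json)
    return [(link["Name"], link["Hash"], link["Size"]) for link in node["Links"]]

# Links copied from the output above; Data is a shortened placeholder.
sample = (
    '{"Links":[{"Name":"linked-wdls",'
    '"Hash":"QmYjBMLVeZTJ2RYHkVnVarkQ1qSv5xnqSzTsCbVZuzqL6w","Size":1205}],'
    '"Data":"workflow preprocess_depth { ... }"}'
)

for name, link_hash, size in summarize_object(sample):
    print(f"{name} -> {link_hash} (size {size})")
```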

In [19]:
%%bash

ipfs object data QmTBCvac8akrk1nWBZxQcc1paRJ9iKGPYnjWD2XXfGbcuW

�	workflow preprocess_depth {
  Array[File] beds
  String batch

  call concat_batch as preprocess_DELs {
    input:
      beds=beds,
      batch=batch,
      svtype="DEL"
  }

  call concat_batch as preprocess_DUPs {
    input:
      beds=beds,
      batch=batch,
      svtype="DUP"
  }

  output {
    File del_bed = preprocess_DELs.bed
    File dup_bed = preprocess_DUPs.bed
    File del_bed_idx = preprocess_DELs.bed_idx
    File dup_bed_idx = preprocess_DUPs.bed_idx
  }
}

task concat_batch {
  Array[File] beds
  String svtype
  String batch

  command <<<
    zcat ${sep=' ' beds} \
      | sed -e '/^#chr/d' -e 's/cn.MOPS/cnmops/g' \
      | awk -v svtype=${svtype} '($6==svtype)' \
      | sort -k1,1V -k2,2n \
      | awk -v OFS="\t" -v svtype=${svtype} -v batch=${batch} '{$4=batch"_"svtype"_"NR; print}' \
      | cat <(echo -e "#chr\tstart\tend\tname\tsample\tsvtype\tsources") - \
      | bgzip -c \
      > ${batch}.${svtype}.bed.gz;
  tabix -p bed ${batch}.${svtype}.bed.gz
  >>>


In [20]:
%%bash

ipfs object links QmTBCvac8akrk1nWBZxQcc1paRJ9iKGPYnjWD2XXfGbcuW

QmYjBMLVeZTJ2RYHkVnVarkQ1qSv5xnqSzTsCbVZuzqL6w 1205 linked-wdls 
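
Because the link has a name, the second file should be reachable through the linked object by path, in the form `<hash>/<link name>`.  The sketch below only builds that command and does not run it.  The `cat_linked_command` helper is hypothetical, and running the result assumes a local IPFS install that resolves named links on this kind of object.

```python
import shlex

def cat_linked_command(root_hash: str, link_name: str) -> list:
    """Build an `ipfs cat` command that follows a named link from a root object."""
    return ["ipfs", "cat", f"{root_hash}/{link_name}"]

command = cat_linked_command(
    "QmTBCvac8akrk1nWBZxQcc1paRJ9iKGPYnjWD2XXfGbcuW", "linked-wdls"
)
print(shlex.join(command))
# To execute it: import subprocess; subprocess.run(command, check=True)
```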


## Web App

A web app was created to accomplish a similar task to the one just shown.  It may take a few seconds for the drop-down menus to fill.  They are populated with WDL ids from the GA4GH Tool registry (<code>GET ga4gh/v1/tools</code>) of this <a href="https://api.firecloud.org/" target="_blank">api</a>.  You can also skip selecting a tool from the API and type any text you want into the text area.  That content will be hashed instead and can be retrieved later.  
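
As a rough sketch of how the drop-down entries could be derived, the Python below builds the registry URL from the two pieces named above and pulls the ids out of a response body.  Everything beyond those two pieces is an assumption: the `tools_url` and `tool_ids` helpers are hypothetical, the response is assumed to be a JSON list of objects with an `id` key, and the example body is made up for illustration rather than taken from the API.

```python
import json
from urllib.parse import urljoin

# Base URL and route as named in the text above.
API_BASE = "https://api.firecloud.org/"
TOOLS_ROUTE = "ga4gh/v1/tools"

def tools_url(base: str = API_BASE) -> str:
    """Join the base URL and the tools route into one request URL."""
    return urljoin(base, TOOLS_ROUTE)

def tool_ids(response_body: str) -> list:
    """Collect the 'id' of each tool (assumes a JSON list of objects with an 'id' key)."""
    return [tool["id"] for tool in json.loads(response_body)]

# Hypothetical response body, invented for illustration only.
example_body = '[{"id": "example-namespace:example-wdl"}]'

print(tools_url())            # https://api.firecloud.org/ga4gh/v1/tools
print(tool_ids(example_body))
```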

<img src="AppScreen.png"></img>

Instead of adding the file with the <code>ipfs add</code> command, you select it from a drop-down menu.  Once you have selected the WDL you would like to add, click the submit button.  You can check the content of the file in the text area.  Once you are satisfied, use the 'add to ipfs' button to add the content, and the app will display the hash for you.  The blue box works the same way as the yellow box: the yellow box is for the first WDL and the blue box is for the second WDL that it will be linked to.  Make sure you have added both selected WDLs to IPFS before moving on to the green box labeled 'Linked Data'.  Once you see the hash of each file in both the yellow and blue boxes, move down to the green box and push the 'Link Data' button.  This will link the two WDL files together and output the resulting hash.  This hash is a signature of the two linked files and can be used to retrieve the two WDLs in the future.

<img src="AppBottomScreen.png"></img>

Scrolling down, you will see a box where a hash can be entered.  This could be any hash, but it is intended for linked WDLs like the one above.  Copy and paste the new hash created in the 'Linked Data' section into the text box, then select the 'Submit' button.  This should retrieve the same content as the two WDLs above: WDL 1 comes from the yellow box and WDL 2 from the blue box.  