# Creating biom files from Kraken results files

This notebook walks through the workflow for creating biom files
that we can use for data analysis in R, specifically with the phyloseq and
microViz packages.

Step 1: Create biom files from Kraken/Bracken results for your cohort


## Getting Started

You will need to rerun this section each time you come back to this notebook to reset all directories and variables.

In [None]:
# Set the variable for your netid
netid = "YOUR_NETID"

In [None]:
# Go into the working directory
work_dir = "/xdisk/bhurwitz/bh_class/" + netid + "/assignments/12_phyloseq"
%cd $work_dir

In [None]:
# Check which set you have
!ls $work_dir/data

In [None]:
# Set the variable for your set id based on the info above
# This should be something like setid = "set21"
setid = "YOUR_SET"

## Creating a config file
Let's create a config file with all of the variables we will need in the scripts below. Then when we want to use these variables in the script, we will "source" the config file to set the variables.

In [None]:
# create a config file with all of the variables you need
!echo "export NETID=$netid" > config.sh
!echo "export SETID=$setid" >> config.sh
!echo "export WORK_DIR=/xdisk/bhurwitz/bh_class/$netid/assignments/12_phyloseq" >> config.sh
!echo "export DATA_DIR=/xdisk/bhurwitz/bh_class/$netid/assignments/12_phyloseq/data" >> config.sh
!echo "export KRAKENBIOM=/contrib/singularity/shared/bhurwitz/kraken-biom:1.2.0--pyh5e36f6f_0.sif" >> config.sh

In [None]:
# Check the config file to be sure it is correct
# Are your netid and setid correct? Do you have the right directories?
!cat config.sh
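
As a complement to reading the file by eye, the check above can be sketched in Python. This is a minimal sketch, assuming every line of config.sh has the form `export NAME=value` as written by the cell above; the helper name `parse_config` and the example values are hypothetical.

```python
def parse_config(text):
    """Return {NAME: value} for every 'export NAME=value' line in text."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("export "):
            continue  # skip blank lines, comments, and anything else
        name, sep, value = line[len("export "):].partition("=")
        if sep:
            settings[name.strip()] = value.strip().strip('"')
    return settings

# Example text in the same shape as config.sh (placeholder values, not real ones)
example = "export NETID=YOUR_NETID\nexport SETID=set21\n"
config = parse_config(example)

# Report which of the variables used by the job script are present
for name in ("NETID", "SETID", "WORK_DIR", "DATA_DIR", "KRAKENBIOM"):
    print(name, "->", config.get(name, "MISSING"))
```

In the notebook you would pass the real file instead, for example `parse_config(open("config.sh").read())`, and confirm that nothing prints as MISSING or still holds a placeholder such as YOUR_NETID.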

## Step 1: Creating biom files from Kraken results files

In this step, we will convert Kraken results files into biom files that we can use in R with the phyloseq package. Note that we will also add the metadata for your samples to the biom file.


In [None]:
# Create a script to create biom files from kraken output
# (the raw string keeps the backslash line continuations in the written script)
my_code = r'''#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --nodes=1             
#SBATCH --time=01:00:00   
#SBATCH --partition=standard
#SBATCH --account=bhurwitz                       
#SBATCH --output=Job-biom.out
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=1G  

pwd; hostname; date

source $SLURM_SUBMIT_DIR/config.sh

cd ${WORK_DIR}/data
REPORTS=${SETID}_reports
METADATA=${SETID}.samples.meta.txt

apptainer run ${KRAKENBIOM} kraken-biom \
-k $REPORTS \
--fmt json \
-m $METADATA \
-o ${WORK_DIR}/${SETID}.biom

echo "Finished `date`"

'''

with open('run_kraken_biom.sh', mode='w') as file:
    file.write(my_code)

In [None]:
# Check the code and make sure your script above was created.
!cat run_kraken_biom.sh

In [None]:
# You should be in your working directory when you run this script
# Do you see your config.sh file and the run_kraken_biom.sh script?
!pwd
!ls
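
Before submitting the job, it can also help to confirm that the inputs the script expects are in place. The following is a minimal sketch, assuming the naming used in the script above (`<setid>_reports` for the reports folder and `<setid>.samples.meta.txt` for the metadata file, both inside the data directory); the helper name `missing_inputs` is hypothetical and the demo runs against a temporary folder rather than your real data.

```python
import os
import tempfile

def missing_inputs(data_dir, setid):
    """Return the expected input paths that do not exist under data_dir."""
    reports = os.path.join(data_dir, f"{setid}_reports")
    metadata = os.path.join(data_dir, f"{setid}.samples.meta.txt")
    missing = []
    if not os.path.isdir(reports):
        missing.append(reports)
    if not os.path.isfile(metadata):
        missing.append(metadata)
    return missing

# Demo in a throwaway directory: the reports folder exists, the metadata file does not
with tempfile.TemporaryDirectory() as demo_dir:
    os.mkdir(os.path.join(demo_dir, "set21_reports"))
    demo_missing = missing_inputs(demo_dir, "set21")

print("Missing inputs:", demo_missing)
```

In the notebook, a call such as `missing_inputs(work_dir + "/data", setid)` should return an empty list before you run sbatch.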

In [None]:
# Let's run sbatch to run kraken-biom
!sbatch run_kraken_biom.sh

In [None]:
# Welcome back
# You can check if it is running using the squeue command
# Check for all jobs under your netid
!squeue --user=$netid

In [None]:
# You can check for errors by looking at the job output file (Job-biom.out)
!cat Job-biom.out
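
If the output file is long, a quick keyword scan can point to the lines worth reading. This is a minimal sketch: the helper name `find_problem_lines` and the keyword list are assumptions, not an exhaustive list of what the job can print, and the sample log text is invented for illustration.

```python
def find_problem_lines(log_text, keywords=("error", "traceback", "no such file", "command not found")):
    """Return (line_number, line) pairs whose text contains any keyword (case-insensitive)."""
    hits = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        lowered = line.lower()
        if any(keyword in lowered for keyword in keywords):
            hits.append((number, line))
    return hits

# Invented sample text, only to show the shape of the result
sample_log = "line one\nError: something went wrong\nFinished\n"
for number, line in find_problem_lines(sample_log):
    print(f"line {number}: {line}")
```

In the notebook you would scan the real file with `find_problem_lines(open("Job-biom.out").read())`. An empty result does not prove the job succeeded, so still confirm that the .biom file was written.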

In [None]:
# check to make sure you have a .biom file
!ls -l *.biom
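
Because the script requests `--fmt json`, the resulting biom file is plain JSON and can be inspected with Python's standard library. The sketch below assumes the BIOM 1.0 JSON layout, in which `rows` lists the observations (taxa), `columns` lists the samples, and `shape` holds the table dimensions; the helper name `summarize_biom` and the tiny demo table are hypothetical.

```python
import json
import os
import tempfile

def summarize_biom(path):
    """Return basic counts from a JSON-formatted biom file."""
    with open(path) as handle:
        table = json.load(handle)
    return {
        "n_taxa": len(table["rows"]),
        "n_samples": len(table["columns"]),
        "sample_ids": [column["id"] for column in table["columns"]],
        "shape": table["shape"],
    }

# Tiny made-up table with only the keys used above, written to a temporary file
tiny_table = {
    "rows": [{"id": "taxon_1", "metadata": None}, {"id": "taxon_2", "metadata": None}],
    "columns": [{"id": "sample_A", "metadata": None}],
    "matrix_type": "sparse",
    "shape": [2, 1],
    "data": [[0, 0, 5], [1, 0, 3]],
}
with tempfile.TemporaryDirectory() as demo_dir:
    demo_path = os.path.join(demo_dir, "demo.biom")
    with open(demo_path, "w") as handle:
        json.dump(tiny_table, handle)
    summary = summarize_biom(demo_path)

print(summary)
```

In the notebook, `summarize_biom(setid + ".biom")` should report one sample per Kraken report in your set; a mismatch suggests a report was skipped.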

## Final Step
Copy your notebook to your working directory.

In [None]:
!cp ~/hw12_phyloseq.ipynb $work_dir