# Data Filtering

This notebook walks through creating the files we need for Microbiome Analyst. From here on, we will work with the complete dataset for your project, which is composed of 56 samples from the class.

-----------

Sections:

1. Create biom files from Kraken/Bracken results for your project.
2. Upload these files to Microbiome Analyst and filter the datasets.
-----------


## Getting Started

You will need to rerun this section each time you come back to this notebook to reset all directories and variables.

In [None]:
# Set the variable for your netid
netid = "YOUR_NETID"

In [None]:
# Go into the working directory
work_dir = "/xdisk/bhurwitz/bh_class/" + netid + "/assignments/11_data_filter"
%cd $work_dir

In [None]:
# Notice that I have created a few files and directories for you with the complete project.
!ls 

In [None]:
# Set the variable for your project id. This is the project you chose at the start of class.
# This should be something like project_id = "project1"
project_id = "YOUR_PROJECT"

## Creating a config file
Let's create a config file with all of the variables we will need in the scripts below. Whenever a script needs these variables, it will "source" the config file to set them.

In [None]:
# create a config file with all of the variables you need
!echo "export NETID=$netid" > config.sh
!echo "export PROJECT_ID=$project_id" >> config.sh
!echo "export WORK_DIR=/xdisk/bhurwitz/bh_class/$netid/assignments/11_data_filter" >> config.sh
!echo "export KRAKENBIOM=/contrib/singularity/shared/bhurwitz/kraken-biom:1.2.0--pyh5e36f6f_0.sif" >> config.sh
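
Before submitting any jobs, it can help to confirm that `config.sh` contains real values rather than the `YOUR_NETID` or `YOUR_PROJECT` placeholders. The following is a minimal sketch, assuming the file only holds lines of the form `export NAME=value` as written above; the helper names `read_config` and `unset_placeholders` are hypothetical and not part of the assignment.

```python
from pathlib import Path


def read_config(path):
    """Parse lines of the form 'export NAME=value' into a dict."""
    values = {}
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line.startswith("export ") or "=" not in line:
            continue  # skip anything that is not an export line
        name, value = line[len("export "):].split("=", 1)
        values[name] = value
    return values


def unset_placeholders(values):
    """Return the names whose value is empty or still contains 'YOUR_'."""
    return [name for name, value in values.items()
            if not value or "YOUR_" in value]


# Only run the check if config.sh exists in the current directory.
if Path("config.sh").exists():
    print("Variables that still need a value:",
          unset_placeholders(read_config("config.sh")))
```

An empty list means every variable was filled in. If a name is listed, rerun the Getting Started cells and then recreate the config file.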

## Step 1: Creating biom files from Kraken results files

In this step, we will convert the Kraken/Bracken results files into a biom file that we can use with Microbiome Analyst. Note that we will also add the metadata for your samples from the 04_metadata homework.


In [None]:
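# Create a script that builds a biom file from the Kraken/Bracken reports.
# The raw string (r''') keeps the trailing backslashes, so the line
# continuations are written to the script exactly as shown.
my_code = r'''#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --nodes=1
#SBATCH --time=01:00:00
#SBATCH --partition=standard
#SBATCH --account=bhurwitz
#SBATCH --output=11A-biom.out
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=1G

pwd; hostname; date

# Load NETID, PROJECT_ID, WORK_DIR, and KRAKENBIOM
source $SLURM_SUBMIT_DIR/config.sh

cd ${WORK_DIR}
REPORT_DIR=${PROJECT_ID}_reports
METADATA=${PROJECT_ID}_metadata.txt

# -k: directory of Kraken reports, -m: sample metadata file,
# --fmt json: write the biom file as JSON, -o: output path
apptainer run ${KRAKENBIOM} kraken-biom \
-k $REPORT_DIR \
--fmt json \
-m $METADATA \
-o ${WORK_DIR}/${PROJECT_ID}.biom

echo "Finished $(date)"

'''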

with open('11A_kraken_biom.sh', mode='w') as file:
    file.write(my_code)

In [None]:
# Let's run sbatch to run kraken-biom and create the biom file
!sbatch 11A_kraken_biom.sh

In [None]:
# You can check if it is running using the squeue command
# Check for all jobs under your netid
# This should just take a minute to run once the job is picked up
!squeue --user=$netid
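
Once the job no longer appears in `squeue`, the Slurm log is another place to look. The sketch below assumes the log is named `11A-biom.out` (from the `--output` line in the script) and that the script's last line prints a message starting with `Finished`; the function name `job_log_status` is hypothetical. Note that the `Finished` line only shows the script reached its end. It does not prove that kraken-biom succeeded, so still read the log for error messages.

```python
from pathlib import Path


def job_log_status(log_path="11A-biom.out"):
    """Give a rough status based on the last non-empty line of the Slurm log."""
    log = Path(log_path)
    if not log.exists():
        return "no log yet"
    lines = [line for line in log.read_text().splitlines() if line.strip()]
    if lines and lines[-1].startswith("Finished"):
        return "finished"
    return "running or stopped early"


print(job_log_status())
```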

In [None]:
# check to make sure you have a .biom file
!ls -l
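
Beyond checking that the file exists, a quick look inside can confirm that it holds the expected number of samples and that the metadata was attached. This is a minimal sketch, assuming the file is a BIOM 1.0 JSON table (the job above requests `--fmt json`) with the standard `shape` and `columns` fields; the function name `summarize_biom` is hypothetical.

```python
import glob
import json


def summarize_biom(path):
    """Summarize a BIOM 1.0 JSON table: counts plus samples lacking metadata."""
    with open(path) as handle:
        table = json.load(handle)
    # In BIOM 1.0, shape is [number of rows, number of columns],
    # where rows are observations (taxa) and columns are samples.
    n_observations, n_samples = table["shape"]
    missing = [col["id"] for col in table["columns"] if not col.get("metadata")]
    return {
        "observations": n_observations,
        "samples": n_samples,
        "samples_without_metadata": missing,
    }


# Summarize every .biom file in the current directory, if any.
for biom_path in sorted(glob.glob("*.biom")):
    print(biom_path, summarize_biom(biom_path))
```

For the complete class dataset, the sample count should match the number of samples in your project, and `samples_without_metadata` should ideally be empty.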

In [None]:
%%bash
# Copy the biom file to your home directory so you can download it
cp *.biom ~/be487-fall-2024/assignments/11_data_filter/

## Final Step
Copy your notebook to the current working directory.

In [None]:
!cp ~/be487-fall-2024/assignments/11_data_filter/hw11_data_filter.ipynb $work_dir