# How to work with cram files in genomic buckets?

**Introduction**

This notebook demonstrates how to work directly with CRAM files stored in genomic buckets, without downloading them locally, by using recent versions of tools such as samtools, gatk, and bcftools.

In [None]:
import os
my_bucket=os.getenv("WORKSPACE_BUCKET")
my_bucket

# Where are the CRAM files?

In [None]:
!gsutil -u $GOOGLE_PROJECT cat gs://fc-aou-datasets-controlled/v8/wgs/cram/manifest.csv | wc -l

The manifest has 414,831 lines including the header, so v8 contains 414,831 - 1 = 414,830 CRAM samples.

In [None]:
!gsutil -u $GOOGLE_PROJECT cat gs://fc-aou-datasets-controlled/v8/wgs/cram/manifest.csv | head -5

The v8 CRAM set includes files carried over from v7 and v6. The command below counts manifest entries by storage folder.

In [None]:
!gsutil -u $GOOGLE_PROJECT cat gs://fc-aou-datasets-controlled/v8/wgs/cram/manifest.csv | \
awk -F, 'NR>1 { split($2, path, "/"); version=path[7]; count[version]++ } END { for (v in count) print v, count[v] }'
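The same tally can be sketched in Python. This is a minimal sketch, assuming (as the awk command does) that the second CSV column holds the CRAM path and that the version folder is the seventh `/`-separated field. The sample rows and the column names `person_id` and `cram_uri` are made up for illustration.

```python
import csv
import io
from collections import Counter


def count_by_version(manifest_text):
    """Count manifest rows per version folder (e.g. v8_delta)."""
    reader = csv.reader(io.StringIO(manifest_text))
    next(reader, None)  # skip the header row, like NR>1 in awk
    counts = Counter()
    for row in reader:
        if len(row) < 2:
            continue  # ignore blank or malformed rows
        # awk arrays are 1-based, so path[7] in awk is index 6 here
        parts = row[1].split("/")
        if len(parts) > 6:
            counts[parts[6]] += 1
    return counts


# Hypothetical rows; the real manifest is read with gsutil as in the cell above.
sample_manifest = """person_id,cram_uri
1,gs://fc-aou-datasets-controlled/pooled/wgs/cram/v8_delta/wgs_1.cram
2,gs://fc-aou-datasets-controlled/pooled/wgs/cram/v6_base/wgs_2.cram
3,gs://fc-aou-datasets-controlled/pooled/wgs/cram/v8_delta/wgs_3.cram
"""
print(count_by_version(sample_manifest))
```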

In [None]:
172120+97460+145250

Take a quick look at the CRAM files in v8_delta

In [None]:
!gsutil -u $GOOGLE_PROJECT ls gs://fc-aou-datasets-controlled/pooled/wgs/cram/v8_delta/** | wc -l

Check the data size

In [None]:
!gsutil -u $GOOGLE_PROJECT du -hs gs://fc-aou-datasets-controlled/pooled/wgs/cram/v8_delta/

Check the CRAM files in the v7 and v6 folders

Not all files in the v6 and v7 folders are included in v8: 1,832 (1.2%) of those in v7 and 1,130 (1.1%) of those in v6 are no longer in v8.
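One way such a comparison could be made is to diff a folder listing against the manifest paths. The sketch below is an illustration under stated assumptions, not the method used to obtain the numbers above: the helper name `missing_from_manifest` and the toy paths are hypothetical, and the listing is filtered to `.cram` because `gsutil ls` on a folder also returns index files.

```python
def missing_from_manifest(folder_listing, manifest_paths):
    """Return CRAM paths in a folder listing that are absent from the manifest,
    plus their share of all CRAM files in the folder (in percent)."""
    manifest = set(manifest_paths)
    crams = [p for p in folder_listing if p.endswith(".cram")]  # drop .crai etc.
    missing = sorted(p for p in crams if p not in manifest)
    pct = 100.0 * len(missing) / len(crams) if crams else 0.0
    return missing, pct


# Toy data (hypothetical file names).
listing = ["gs://b/v7_delta/a.cram", "gs://b/v7_delta/a.cram.crai",
           "gs://b/v7_delta/b.cram", "gs://b/v7_delta/b.cram.crai"]
manifest = ["gs://b/v7_delta/a.cram"]
missing, pct = missing_from_manifest(listing, manifest)
print(missing, f"{pct:.1f}%")

# Consistency check on the quoted figures, assuming each percentage is
# missing / (files still in v8 + missing) for that folder.
pct_v7 = round(100 * 1832 / (145250 + 1832), 1)
pct_v6 = round(100 * 1130 / (97460 + 1130), 1)
print(pct_v7, pct_v6)
```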

In [None]:
!gsutil -u $GOOGLE_PROJECT ls gs://fc-aou-datasets-controlled/pooled/wgs/cram/v7_delta/** | wc -l

In [None]:
!gsutil -u $GOOGLE_PROJECT du -hs gs://fc-aou-datasets-controlled/pooled/wgs/cram/v7_delta/

In [None]:
!gsutil -u $GOOGLE_PROJECT ls gs://fc-aou-datasets-controlled/pooled/wgs/cram/v6_base/** | wc -l

In [None]:
!gsutil -u $GOOGLE_PROJECT du -hs gs://fc-aou-datasets-controlled/pooled/wgs/cram/v6_base/

**A brief summary**

In [None]:
import pandas as pd

# Create the data
data = {
    "Version": ["v8_delta", "v7_delta", "v6_base", "Total"],
    "File Count": [172120, 145250, 97460, 414830],
    "Size (PiB)": [3.32, 2.62, 1.77, 7.71]
}

# Create DataFrame
df = pd.DataFrame(data)
df

In [None]:
total_files = 414830
total_size_pib = 7.71

# Convert PiB to bytes
bytes_per_pib = 1024 ** 5
total_size_bytes = total_size_pib * bytes_per_pib

# Calculate average
avg_bytes_per_file = total_size_bytes / total_files

# Convert to GiB/GB
avg_gib = avg_bytes_per_file / (1024 ** 3)  # 1 GiB = 1024³ bytes
avg_gb = avg_bytes_per_file / (1000 ** 3)    # 1 GB  = 1000³ bytes

print(f"Average size per file: {avg_gib:.1f} GiB (binary) or {avg_gb:.2f} GB (SI)")
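The same arithmetic can be broken down per version folder. This is a rough sketch using the counts and sizes from the summary table; note that the file counts come from the manifest while the sizes come from `du` on whole folders (which also hold index files and files no longer in v8), so the per-file averages are only approximate.

```python
# Figures copied from the summary table: (file count, size in PiB).
summary = {
    "v8_delta": (172120, 3.32),
    "v7_delta": (145250, 2.62),
    "v6_base": (97460, 1.77),
}

GIB_PER_PIB = 1024 ** 2  # 1 PiB = 1024^2 GiB


def avg_gib_per_file(file_count, size_pib):
    """Average size in GiB for file_count files totalling size_pib PiB."""
    return size_pib * GIB_PER_PIB / file_count


per_version = {v: avg_gib_per_file(n, s) for v, (n, s) in summary.items()}
for version, avg in per_version.items():
    print(f"{version}: {avg:.1f} GiB per file")
```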

In [None]:
!gsutil -u $GOOGLE_PROJECT ls -l gs://fc-aou-datasets-controlled/pooled/wgs/cram/v8_delta/wgs_1000000.*

In [None]:
!gsutil -u $GOOGLE_PROJECT ls -l gs://fc-aou-datasets-controlled/pooled/wgs/cram/v6_base/wgs_1000004.*

In [None]:
!gsutil -u $GOOGLE_PROJECT ls -l gs://fc-aou-datasets-controlled/pooled/wgs/cram/v7_delta/wgs_1000039.*

# How to use samtools to check cram files

Check the pre-installed samtools version

In [None]:
!samtools --version

The pre-installed samtools v1.18 cannot access files in a bucket directly.

In [None]:
%%bash

export GCS_OAUTH_TOKEN=`gcloud auth print-access-token`
export GCS_REQUESTER_PAYS_PROJECT=`echo $GOOGLE_PROJECT`
export ref_fasta='gs://genomics-public-data/resources/broad/hg38/v0/Homo_sapiens_assembly38.fasta'
export file_path='gs://fc-aou-datasets-controlled/pooled/wgs/cram/v8_delta/wgs_1000000.cram'

samtools view -T $ref_fasta ${file_path} | head -n 5

**Install a newer version of samtools in order to directly read cram files**

In [None]:
!wget https://github.com/samtools/samtools/releases/download/1.22/samtools-1.22.tar.bz2

In [None]:
!bzip2 -d samtools-1.22.tar.bz2
!tar -xf samtools-1.22.tar

In [None]:
!mkdir -p /home/jupyter/samtools

In [None]:
%%bash
cd samtools-1.22 # and similarly for bcftools and htslib
# --prefix must be an absolute path
./configure --prefix=/home/jupyter/samtools
make
make install

Check the version

In [None]:
!ls /home/jupyter/samtools/bin

In [None]:
%%bash
cd /home/jupyter/samtools/bin
./samtools --version

In [None]:
%%bash

export GCS_OAUTH_TOKEN=`gcloud auth print-access-token`
export GCS_REQUESTER_PAYS_PROJECT=`echo $GOOGLE_PROJECT`
export ref_fasta='gs://genomics-public-data/resources/broad/hg38/v0/Homo_sapiens_assembly38.fasta'
export file_path='gs://fc-aou-datasets-controlled/pooled/wgs/cram/v8_delta/wgs_1000000.cram'

cd /home/jupyter/samtools/bin
./samtools view -T $ref_fasta ${file_path} | head -n 5

## Subset a region to a bam file

In [None]:
!echo $PWD

In [None]:
%%bash

export GCS_OAUTH_TOKEN=`gcloud auth print-access-token`
export GCS_REQUESTER_PAYS_PROJECT=`echo $GOOGLE_PROJECT`
export ref_fasta="gs://genomics-public-data/resources/broad/hg38/v0/Homo_sapiens_assembly38.fasta"
export out_path="${PWD}"
export file_path="gs://fc-aou-datasets-controlled/pooled/wgs/cram/v8_delta/wgs_1000000.cram"

cd /home/jupyter/samtools/bin
./samtools view -b -T $ref_fasta -o ${out_path}/test_subset22.bam ${file_path} chr6:1000000-1005000
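When the same subsetting has to be repeated for several samples or regions, it may be convenient to assemble the command in Python. The sketch below only builds the argument list used in the cell above and shows how the two GCS environment variables could be passed to the subprocess; the helper names `samtools_view_region_cmd` and `run_with_gcs_env` are hypothetical, and nothing is executed until `run_with_gcs_env` is called.

```python
import os
import subprocess


def samtools_view_region_cmd(samtools, cram, ref_fasta, region, out_bam):
    """Build the argument list for `samtools view -b -T REF -o OUT CRAM REGION`."""
    return [samtools, "view", "-b", "-T", ref_fasta, "-o", out_bam, cram, region]


def run_with_gcs_env(cmd, token, project):
    """Run cmd with the two GCS variables exported in the bash cell above."""
    env = dict(os.environ, GCS_OAUTH_TOKEN=token, GCS_REQUESTER_PAYS_PROJECT=project)
    return subprocess.run(cmd, env=env, check=True)


cmd = samtools_view_region_cmd(
    "/home/jupyter/samtools/bin/samtools",
    "gs://fc-aou-datasets-controlled/pooled/wgs/cram/v8_delta/wgs_1000000.cram",
    "gs://genomics-public-data/resources/broad/hg38/v0/Homo_sapiens_assembly38.fasta",
    "chr6:1000000-1005000",
    "test_subset22.bam",
)
print(" ".join(cmd))
```

Passing the arguments as a list avoids shell quoting issues with paths and region strings.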

In [None]:
!ls *.bam

We can also write output files directly to the bucket.

In [None]:
%%bash

export GCS_OAUTH_TOKEN=`gcloud auth print-access-token`
export GCS_REQUESTER_PAYS_PROJECT=`echo $GOOGLE_PROJECT`
export ref_fasta="gs://genomics-public-data/resources/broad/hg38/v0/Homo_sapiens_assembly38.fasta"
export out_path="${WORKSPACE_BUCKET}/data/test_subset24.bam"
export file_path="gs://fc-aou-datasets-controlled/pooled/wgs/cram/v7_delta/wgs_1000059.cram"

cd /home/jupyter/samtools/bin
./samtools view -b -T $ref_fasta -o ${out_path} ${file_path} chr6:1000000-1005000

In [None]:
!gsutil ls -l ${WORKSPACE_BUCKET}/data/test_subset24.bam

## Check cram file sorting 

In [None]:
%%bash

export GCS_OAUTH_TOKEN=`gcloud auth print-access-token`
export GCS_REQUESTER_PAYS_PROJECT=`echo $GOOGLE_PROJECT`
export file_path="gs://fc-aou-datasets-controlled/pooled/wgs/cram/v8_delta/wgs_1000000.cram"

cd /home/jupyter/samtools/bin
./samtools view -H ${file_path} | grep '^@HD' -A1

Alternatively, we can use gsutil with the pre-installed samtools to stream the header.

In [None]:
!gsutil -u $GOOGLE_PROJECT cat gs://fc-aou-datasets-controlled/pooled/wgs/cram/v8_delta/wgs_1000000.cram | samtools view -H - | grep '^@HD' -A1
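In the SAM header, the sort order is recorded in the `SO` tag of the `@HD` line (for example `SO:coordinate`). A minimal sketch of extracting it from header text in Python follows; the function name `parse_hd_line` and the example header are illustrative, not output from the files above.

```python
def parse_hd_line(header_text):
    """Return the tags on the @HD line as a dict, or {} if there is no @HD line."""
    for line in header_text.splitlines():
        if line.startswith("@HD"):
            fields = line.split("\t")[1:]  # skip the "@HD" record type
            return dict(f.split(":", 1) for f in fields if ":" in f)
    return {}


# Illustrative header text (tab-separated, as in real SAM headers).
example_header = "@HD\tVN:1.6\tSO:coordinate\n@SQ\tSN:chr1\tLN:248956422\n"
tags = parse_hd_line(example_header)
print(tags.get("SO", "unknown"))
```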

# How to use gatk

In [None]:
!gatk PrintReads  \
        -I gs://fc-aou-datasets-controlled/pooled/wgs/cram/v7_delta/wgs_1000039.cram \
        -L "chr2:88750000-90235368" \
        -R gs://genomics-public-data/resources/broad/hg38/v0/Homo_sapiens_assembly38.fasta \
        -O igk.cram \
        --gcs-project-for-requester-pays ${GOOGLE_PROJECT} 

In [None]:
!ls igk.*

**Example 2: using gatk on vcf files**

In [None]:
!gsutil -u $GOOGLE_PROJECT ls -l gs://fc-aou-datasets-controlled/v8/wgs/short_read/snpindel/clinvar/vcf/0000000001.* |head -5

In [None]:
%%bash
export ref_fasta="gs://genomics-public-data/resources/broad/hg38/v0/Homo_sapiens_assembly38.fasta"
export out_path="${WORKSPACE_BUCKET}/data/region_chr1.vcf.bgz"
export file_path="gs://fc-aou-datasets-controlled/v8/wgs/short_read/snpindel/clinvar/vcf/0000000001.vcf.bgz"

gatk SelectVariants \
  -R ${ref_fasta} \
  -V ${file_path} \
  -O ${out_path} \
  -L chr1:1000000-1005000 \
  --gcs-project-for-requester-pays ${GOOGLE_PROJECT} 

In [None]:
!gsutil ls -l ${WORKSPACE_BUCKET}/data/region_chr1.vcf.bgz

# How to use samtools via dsub?

In [None]:
import os
USER_NAME = os.getenv('OWNER_EMAIL').split('@')[0].replace('.','-')

# Save this Python variable as an environment variable so that it's easier to use within %%bash cells.
%env USER_NAME={USER_NAME}

In [None]:
%%bash --out test_ID

source ~/aou_dsub.bash # This file was created via notebook 01_dsub_setup.ipynb.

ref_path="gs://genomics-public-data/resources/broad/hg38/v0/Homo_sapiens_assembly38.fasta"
file_path="gs://fc-aou-datasets-controlled/pooled/wgs/cram/v7_delta/wgs_1000059.cram"

aou_dsub \
  --name "${JOB_NAME}" \
  --logging "${WORKSPACE_BUCKET}/data/logging" \
  --image conormesser/splash:v2.6.1 \
  --boot-disk-size 20 \
  --disk-size 200 \
  --env GOOGLE_PROJECT="${GOOGLE_PROJECT}" \
  --env ref_fasta="${ref_path}" \
  --env cram_file="${file_path}" \
  --output-recursive out_path="${WORKSPACE_BUCKET}/data/cram_result/" \
  --command 'export GCS_OAUTH_TOKEN=`gcloud auth print-access-token` && \
             export GCS_REQUESTER_PAYS_PROJECT=`echo $GOOGLE_PROJECT` && \
             echo $GCS_OAUTH_TOKEN > ${out_path}/GCS_OAUTH_TOKEN.txt && \
             echo $GCS_REQUESTER_PAYS_PROJECT > ${out_path}/GCS_REQUESTER_PAYS_PROJECT.txt && \
             echo $GOOGLE_PROJECT > ${out_path}/GOOGLE_PROJECT2.txt && \
             echo ${WORKSPACE_BUCKET} > ${out_path}/WORKSPACE_BUCKET.txt && \
             samtools view -b -T ${ref_fasta} -F 1036 -e "sclen <= 30" -o ${out_path}/test_subset3.bam ${cram_file} chr6:1000000-1005000'
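The `-F 1036` option in the command above excludes reads that have any of the listed flag bits set. The sketch below decodes such a value using the flag bits defined in the SAM specification; the short names in the table are my own labels. The `-e "sclen <= 30"` expression is, as far as I understand the samtools filter syntax, a limit on the number of soft-clipped bases per read.

```python
# Flag bits from the SAM specification; the names are informal labels.
SAM_FLAGS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse",
    0x20: "mate_reverse",
    0x40: "read1",
    0x80: "read2",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(value):
    """List the names of all flag bits set in value."""
    return [name for bit, name in SAM_FLAGS.items() if value & bit]


excluded = decode_flag(1036)  # 1036 = 1024 + 8 + 4
print(excluded)
```

So `-F 1036` drops unmapped reads, reads whose mate is unmapped, and reads marked as duplicates.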

In [None]:
# Save this Python variable value as an environment variable so that it's easier to use within %%bash cells.
%env JOB_ID={test_ID}

In [None]:
%%bash

dstat \
    --provider google-batch \
    --project "${GOOGLE_PROJECT}" \
    --location us-central1 \
    --jobs "${JOB_ID}" \
    --users "${USER_NAME}" \
    --status '*'

In [None]:
!gsutil ls -l ${WORKSPACE_BUCKET}/data/cram_result/test_subset3.bam

# How to use bcftools?

Install a new version of bcftools

In [None]:
!wget https://github.com/samtools/bcftools/releases/download/1.22/bcftools-1.22.tar.bz2

In [None]:
!bzip2 -d bcftools-1.22.tar.bz2
!tar -xf bcftools-1.22.tar

In [None]:
!mkdir -p /home/jupyter/bcftools/

In [None]:
%%bash
cd bcftools-1.22  # and similarly for samtools and htslib
./configure --prefix=/home/jupyter/bcftools/
make
make install

In [None]:
%%bash
cd /home/jupyter/bcftools/bin
./bcftools --version

**Check vcf files using bcftools**

In [None]:
%%bash

export GCS_OAUTH_TOKEN=`gcloud auth print-access-token`
export GCS_REQUESTER_PAYS_PROJECT=`echo $GOOGLE_PROJECT`
export ref_fasta='gs://genomics-public-data/resources/broad/hg38/v0/Homo_sapiens_assembly38.fasta'
export out_path="${PWD}"
export file_path='gs://fc-aou-datasets-controlled/v8/wgs/short_read/snpindel/clinvar/vcf'

cd /home/jupyter/bcftools/bin
./bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\n' ${file_path}/0000000001.vcf.bgz | head -5
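The format string above produces one tab-separated line per record with the columns CHROM, POS, REF, and ALT. If that output is captured for further processing in Python, it could be parsed along these lines; the function name `parse_query_line` and the sample line are hypothetical, and ALT is split on commas because VCF lists multiple alternate alleles that way.

```python
def parse_query_line(line):
    """Parse one line of `bcftools query -f '%CHROM\\t%POS\\t%REF\\t%ALT\\n'` output."""
    chrom, pos, ref, alt = line.rstrip("\n").split("\t")
    return {"chrom": chrom, "pos": int(pos), "ref": ref, "alt": alt.split(",")}


# Made-up record for illustration only.
record = parse_query_line("chr1\t1000123\tA\tG,T\n")
print(record)
```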

**Call variants from a CRAM file using bcftools**

In [None]:
%%bash

export GCS_OAUTH_TOKEN=`gcloud auth print-access-token`
export GCS_REQUESTER_PAYS_PROJECT=`echo $GOOGLE_PROJECT`
export ref_fasta='gs://genomics-public-data/resources/broad/hg38/v0/Homo_sapiens_assembly38.fasta'
export out_path="${PWD}"
export file_path='gs://fc-aou-datasets-controlled/pooled/wgs/cram/v7_delta'

cd /home/jupyter/bcftools/bin
./bcftools mpileup -f $ref_fasta -r chr6:1000000-1005000 ${file_path}/wgs_1000039.cram \
| ./bcftools call -mv -Oz -o ${out_path}/chr6.vcf.gz

In [None]:
!ls -l chr6.vcf.gz