# Pre-Alignment

---


> ⚠ This notebook runs a BASH kernel.

## Objectives

1. Produce a FASTA alignment of GISAID sequences.
1. Produce a VCF of the alignment.
1. Produce a TSV file of sample metadata and alignment statistics.
1. Filter out sequences according to metadata, genome quality, and molecular clock deviations.

**Input**

- GISAID metadata should be downloaded to: `../data/metadata.tsv`.
- GISAID sequences should be downloaded to: `../data/sequences.fasta`.
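
Before running anything else, it can help to confirm that both inputs are in place. The sketch below is only an illustration: `check_input` is a hypothetical helper, not part of the original notebook, and the paths are the ones listed above.

```shell
# Hypothetical helper: succeed only if the file exists and is non-empty.
check_input() {
  local path="$1"
  if [[ -s "$path" ]]; then
    echo "OK: $path"
  else
    echo "MISSING or empty: $path" >&2
    return 1
  fi
}

# Paths as listed above; adjust if your data live elsewhere.
for f in ../data/metadata.tsv ../data/sequences.fasta; do
  check_input "$f" || true
done
```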


## Setup

---


In [None]:
# System
partition="" # Leave blank to use your default sbatch partition; otherwise set the full flag, e.g. "-p NAME"
# We need a conda env that has R, R package data.table, snp-sites, vcftools
conda_env="ENV_NAME"

# Directory paths relative to this notebook
data_dir="../data"
bin_dir="../bin"
log_dir="../logs"
results_dir="../results"

# Input/Output Paths
input_metadata="${data_dir}/metadata.tsv"
input_sequences="${data_dir}/sequences.fasta"
input_reference="${data_dir}/reference.fasta"

# (Optional) Reuse existing alignment and statistics
# Ignore on first time running
nextclade_alignment="${results_dir}/nextclade.aligned.fasta"
nextclade_tsv="${results_dir}/nextclade.tsv"

# Program Versions (to download)
nextclade_tag="2022-07-26T12:00:00Z"
nextclade_ver="2.3.0"

# Strain Names
reference_strain="Wuhan/Hu-1/2019" # The reference name in the GISAID metadata and in the input reference
reference_strain_nextclade="MN908947 (Wuhan-Hu-1/2019)" # The reference name in Nextclade data
reference_genbank_accession="MN908947.3"

# Metadata Filters
min_date="2020-01-01"
max_date="2022-06-30"
bad_quality_cols="qc.missingData.status,qc.mixedSites.status,qc.frameShifts.status,qc.stopCodons.status" # Exclude strain if 'bad' for any of these

# Final Output
metadata_cols="strain,date,country,gisaid_epi_isl,host,date_submitted"
nextclade_cols="seqName,clade,Nextclade_pango,qc.missingData.status,qc.mixedSites.status,qc.frameShifts.status,qc.stopCodons.status,totalSubstitutions"

In [None]:
mkdir -p ${bin_dir}
mkdir -p ${results_dir}
mkdir -p ${log_dir}
# data_dir already exists in repo

### Download Dependencies

#### Nextclade

In [None]:
wget -q -O ${bin_dir}/nextclade https://github.com/nextstrain/nextclade/releases/download/${nextclade_ver}/nextclade-x86_64-unknown-linux-gnu
# The downloaded binary is not executable by default
chmod +x ${bin_dir}/nextclade

In [None]:
${bin_dir}/nextclade dataset get --name sars-cov-2 --tag "${nextclade_tag}" --output-dir ${data_dir}/sars-cov-2_${nextclade_tag}

#### csvtk

In [None]:
wget -q https://github.com/shenwei356/csvtk/releases/download/v0.24.0/csvtk_linux_386.tar.gz
tar -xvf csvtk_linux_386.tar.gz
mv csvtk ${bin_dir}
rm -f csvtk_linux_386.tar.gz

#### seqkit

In [None]:
wget -q https://github.com/shenwei356/seqkit/releases/download/v2.2.0/seqkit_linux_386.tar.gz
tar -xvf seqkit_linux_386.tar.gz
mv seqkit ${bin_dir}
rm -f seqkit_linux_386.tar.gz

#### faToVcf

In [None]:
wget -q -O ${bin_dir}/faToVcf http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/faToVcf
# The downloaded binary is not executable by default
chmod +x ${bin_dir}/faToVcf

In [None]:
wget -q -O ${data_dir}/problematic_sites.vcf https://github.com/W-L/ProblematicSites_SARS-CoV2/raw/master/problematic_sites_sarsCov2.vcf

## Alignment

---

- Start by aligning every sequence. This takes roughly 1-2 hours with 64 cores.
- Aligning once up front means filter combinations can be changed afterwards without realigning.

### Nextclade

In [None]:
# Check if the alignment already exists (so we don't have to realign)
if [[ (-e $nextclade_alignment) && (-e $nextclade_tsv) ]]; then 

    echo "Alignment already exists, skipping nextclade."
    dependency_alignment=""
  
# Otherwise, we need to align the sequences
else
    wrap="${bin_dir}/nextclade run
      --input-dataset ${data_dir}/sars-cov-2_${nextclade_tag} 
      --output-all ${results_dir}
      --output-selection 'tsv,fasta'  
      ${input_sequences}"

    cmd="sbatch
      --parsable
      ${partition}
      -c 64 
      --mem 64G
      -J recomb-align
      -o ${log_dir}/%x_$(date +"%Y-%m-%d")_%j.log
      --wrap=\"$wrap 2>&1\""
      
    echo $cmd
    align_id=$(eval $cmd)   
  
    # Setup the SLURM dependency string for jobs that will depend on this output
    dependency_alignment="--dependency=aftercorr:$align_id"      
fi
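
The dependency string has to be empty when no upstream job was submitted, otherwise `sbatch` receives a dangling `--dependency=aftercorr:`. A minimal sketch of a helper that handles this is shown below; `dependency_flag` is a hypothetical name, not something the notebook defines.

```shell
# Hypothetical helper: build a SLURM dependency flag from zero or more job IDs.
# Empty IDs are skipped; with no usable IDs nothing is printed,
# so the result can be dropped straight into an sbatch command.
dependency_flag() {
  local ids=()
  local id
  for id in "$@"; do
    if [[ -n "$id" ]]; then
      ids+=("$id")
    fi
  done
  if (( ${#ids[@]} == 0 )); then
    return 0
  fi
  # Join the IDs with ":" as sbatch expects
  local IFS=":"
  echo "--dependency=aftercorr:${ids[*]}"
}

dependency_flag ""          # prints nothing
dependency_flag 1234        # --dependency=aftercorr:1234
dependency_flag 1234 5678   # --dependency=aftercorr:1234:5678
```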

### Metadata: Extract Minimal Columns

In [None]:
wrap="${bin_dir}/csvtk cut -t -f $metadata_cols $input_metadata > ${results_dir}/metadata.minimal.tsv"
cmd="sbatch --parsable ${partition} -c 1 --mem 16G -J recomb-metadata-minimal -o ${log_dir}/%x_$(date +"%Y-%m-%d")_%j.log --wrap=\"$wrap 2>&1\""
echo $cmd
metadata_minimal_id=$(eval $cmd)

### Alignment: Extract Minimal Columns

In [None]:
wrap="${bin_dir}/csvtk cut -t -f $nextclade_cols ${results_dir}/nextclade.tsv | ${bin_dir}/csvtk rename -t -f seqName -n strain > ${results_dir}/nextclade.minimal.tsv "
cmd="sbatch --parsable ${partition} ${dependency_alignment} -c 1 --mem 16G -J recomb-nextclade-minimal -o ${log_dir}/%x_$(date +"%Y-%m-%d")_%j.log --wrap=\"$wrap 2>&1\""
echo $cmd
nextclade_minimal_id=$(eval $cmd)

### Merge: Metadata and Alignment Minimal Columns

This step runs in R, which handles these large files without the memory problems that csvtk runs into.

In [None]:
# Wait for both minimal-column jobs before merging
dependency_minimal="--dependency=aftercorr:${metadata_minimal_id}:${nextclade_minimal_id}"

# This writes the file results/minimal.tsv
wrap="source activate $conda_env && Rscript merge.R ${results_dir}"
cmd="sbatch --parsable ${partition} ${dependency_minimal} -c 1 --mem 16G -J recomb-merge-minimal -o ${log_dir}/%x_$(date +"%Y-%m-%d")_%j.log --wrap=\"$wrap 2>&1\""

echo $cmd
minimal_id=$(eval $cmd)
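
`merge.R` itself is not shown in this notebook. Purely as an illustration of the kind of join involved, the sketch below joins two toy tables on `strain` with awk. Treating the merge as an inner join on the first column is an assumption, and the toy rows are made up.

```shell
# Toy inputs standing in for metadata.minimal.tsv and nextclade.minimal.tsv.
workdir=$(mktemp -d)
printf 'strain\tdate\tcountry\nA\t2021-01-05\tCanada\nB\t2021-02-10\tKenya\n' > "$workdir/metadata.tsv"
printf 'strain\tclade\nB\t20I\nC\t21J\n' > "$workdir/nextclade.tsv"

# Inner join on the first column (strain); the header row joins like any other row.
merged=$(awk 'BEGIN { FS = OFS = "\t" }
  # First file: remember everything after the strain column
  NR == FNR { nc[$1] = substr($0, length($1) + 2); next }
  # Second file: print rows whose strain was seen in the first file
  ($1 in nc) { print $0, nc[$1] }' "$workdir/nextclade.tsv" "$workdir/metadata.tsv")

echo "$merged"
rm -rf "$workdir"
```

Only strain `B` appears in both toy tables, so the result is the header plus one row.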

## Filters

---


### Metadata

Once the jobs above have finished, apply the metadata-only filters. These cells run directly in the notebook and read `results/metadata.minimal.tsv`.

Identify strains to exclude based on:
1. Date ambiguity
2. Date range
3. Host
4. Reference strain

Note: There are no records with an ambiguous country in GISAID, so a country filter is unnecessary.

#### 1. Date Ambiguity

In [None]:
${bin_dir}/csvtk grep -t -f "date" -r -p "[0-9]{4}-[0-9]{2}-[0-9]{2}" -v ${results_dir}/metadata.minimal.tsv \
 | ${bin_dir}/csvtk cut -t -f "strain" \
 | tail -n+2 \
 > ${results_dir}/exclude.date_ambiguity.txt
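
To see which values the date pattern rejects, the same regular expression can be tried on a few toy dates with plain `grep`. This is a sketch on made-up values, not GISAID data.

```shell
# Toy dates: only fully specified YYYY-MM-DD values should survive.
dates=$(printf '%s\n' "2021-03-14" "2021-03" "2021" "2021-XX-XX" "unknown")

# -v inverts the match, so this lists the dates that would be excluded.
ambiguous=$(echo "$dates" | grep -E -v '[0-9]{4}-[0-9]{2}-[0-9]{2}')
echo "$ambiguous"
```

Everything except `2021-03-14` is reported as ambiguous.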

#### 2. Date Range

In [None]:
echo "Min Date:" $min_date 
# this is deliberately reversed, because we're constructing an exclusion list
filter_min="\$date<\"$min_date\""
echo "Exclude: $filter_min"

${bin_dir}/csvtk filter2 -t -f ${filter_min} ${results_dir}/metadata.minimal.tsv 2> /dev/null \
  | ${bin_dir}/csvtk cut -t -f "strain" \
  | tail -n+2 \
  > ${results_dir}/exclude.early.txt  

In [None]:
echo "Max Date:" $max_date 
# this is deliberately reversed, because we're constructing an exclusion list
filter_max="\$date>\"$max_date\""
echo "Exclude: $filter_max"

${bin_dir}/csvtk filter2 -t -f ${filter_max} ${results_dir}/metadata.minimal.tsv 2> /dev/null \
  | ${bin_dir}/csvtk cut -t -f "strain" \
  | tail -n+2 \
  > ${results_dir}/exclude.late.txt
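
Both date filters rely on complete `YYYY-MM-DD` dates sorting chronologically when compared as strings. A minimal sketch of the same comparison in plain Bash follows; `out_of_range` is a hypothetical helper and the test dates are made up.

```shell
min_date="2020-01-01"
max_date="2022-06-30"

# Hypothetical helper: succeed when a complete YYYY-MM-DD date
# falls outside the inclusive range [min_date, max_date].
out_of_range() {
  local d="$1"
  [[ "$d" < "$min_date" || "$d" > "$max_date" ]]
}

for d in 2019-12-31 2020-01-01 2021-06-15 2022-07-01; do
  if out_of_range "$d"; then
    echo "$d exclude"
  else
    echo "$d keep"
  fi
done
```

The boundary dates themselves are kept, which matches the strict `<` and `>` used in the exclusion filters above.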

#### 3. Host

In [None]:
# Exclude any sample whose host is not human
${bin_dir}/csvtk grep -t -f "host" -p "Human" -v ${results_dir}/metadata.minimal.tsv \
  | ${bin_dir}/csvtk cut -t -f "strain" \
  | tail -n+2 \
  > ${results_dir}/exclude.host.txt

In [None]:
# Exclude environmental samples
${bin_dir}/csvtk grep -t -f "host" -p "Environment" ${results_dir}/metadata.minimal.tsv \
  | ${bin_dir}/csvtk cut -t -f "strain" \
  | tail -n+2 \
  > ${results_dir}/exclude.environment.txt

In [None]:
# Exclude samples that are neither human nor environmental
${bin_dir}/csvtk grep -t -f "host" -p "Human" -v ${results_dir}/metadata.minimal.tsv \
  | ${bin_dir}/csvtk grep -t -f "host" -v -p "Environment" \
  | ${bin_dir}/csvtk cut -t -f "strain" \
  | tail -n+2 \
  > ${results_dir}/exclude.non-human.txt

#### 4. Reference Strain

In [None]:
echo $reference_strain > ${results_dir}/exclude.reference.txt

### Genome Quality

When the alignment is finished, we can filter on the following quality metrics:

1. Missing Data, N (`qc.missingData.status`)
1. Ambiguous Nucleotides (`qc.mixedSites.status`)
1. Frameshifts (`qc.frameShifts.status`)
1. Stop Codons (`qc.stopCodons.status`)

In [None]:
wrap="${bin_dir}/csvtk grep -t -f ${bad_quality_cols} -p bad ${results_dir}/nextclade.tsv \
  | ${bin_dir}/csvtk cut -t -f seqName \
  | tail -n+2 \
  > ${results_dir}/exclude.quality.txt"
  
cmd="sbatch
  --parsable
  ${partition}
  ${dependency_alignment}
  -c 1 
  --mem 16G
  -J recomb-filter-quality
  -o ${log_dir}/%x_$(date +"%Y-%m-%d")_%j.log
  --wrap=\"$wrap 2>&1\""

echo $cmd
filter_quality_id=$(eval $cmd)
# Setup the SLURM dependency string for jobs that will depend on this output
dependency_filter_quality="--dependency=aftercorr:${filter_quality_id}"
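
The rule applied here is "exclude a strain if any of the listed QC columns is `bad`". The sketch below reproduces that rule with awk on a made-up table with two QC columns; it is an illustration of the logic, not a replacement for the csvtk job.

```shell
# Toy Nextclade-style table (values are made up).
tsv=$(printf 'seqName\tqc.missingData.status\tqc.mixedSites.status\nA\tgood\tgood\nB\tbad\tgood\nC\tmediocre\tbad\n')
cols="qc.missingData.status,qc.mixedSites.status"

bad_strains=$(echo "$tsv" | awk -v cols="$cols" 'BEGIN { FS = "\t"; n = split(cols, want, ",") }
  # Header: record which column positions are QC columns of interest
  NR == 1 { for (i = 1; i <= NF; i++) for (j = 1; j <= n; j++) if ($i == want[j]) idx[i] = 1; next }
  # Data rows: print the strain once if any selected column is exactly "bad"
  { for (i in idx) if ($i == "bad") { print $1; break } }')

echo "$bad_strains"
```

Strain `C` is excluded even though only one of its two columns is `bad`.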

### Clock Filter

- Remove sequences where the collection date (`date`) is before the lower 95% CI of the MRCA date
- ⚠ Requires external run of the `clock-filter.ipynb` notebook (python kernel).

### Combine All Exclusion Filters

In [None]:
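# Combine all exclusion lists into one.
# Run this only after every exclude.*.txt file has been written
# (exclude.quality.txt is produced by a SLURM job).
cat ${results_dir}/exclude.*.txt \
  | sort \
  | uniq \
  > ${results_dir}/exclude.txt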
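
Before filtering, it can be useful to see how many strains each list contributes and how large the de-duplicated union is. The sketch below does this on two made-up lists in a temporary directory; the file names only mimic the `exclude.*.txt` pattern.

```shell
workdir=$(mktemp -d)
printf 'A\nB\n' > "$workdir/exclude.host.txt"
printf 'B\nC\n' > "$workdir/exclude.quality.txt"

# Per-filter counts
for f in "$workdir"/exclude.*.txt; do
  printf '%s\t%s\n' "$(basename "$f")" "$(wc -l < "$f" | tr -d ' ')"
done

# Size of the de-duplicated union (strain B is counted once)
union_count=$(cat "$workdir"/exclude.*.txt | sort -u | wc -l | tr -d ' ')
echo "union: $union_count"
rm -rf "$workdir"
```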

### Filter Metadata


In [None]:
wrap="${bin_dir}/csvtk grep -t -f 'strain' -P ${results_dir}/exclude.txt -v ${results_dir}/minimal.tsv > ${results_dir}/minimal.filtered.tsv"
cmd="sbatch --parsable ${partition} -c 1 --mem 16G -J recomb-filter-metadata -o ${log_dir}/%x_$(date +"%Y-%m-%d")_%j.log --wrap=\"$wrap\""
echo $cmd
filter_metadata_id=$(eval $cmd)

## Partition

### Metadata

- ⚠ Requires external run of the `partition-month.ipynb` notebook (python kernel).

### Alignment

In [None]:
for year_month in $(ls ${results_dir}/partition); 
do 
  echo $year_month;
  strains="${results_dir}/partition/${year_month}/strains.txt"
  out_align="${results_dir}/partition/${year_month}/alignment.fasta"
  wrap="cat ${data_dir}/reference.fasta > ${out_align} && ${bin_dir}/seqkit grep --threads 8 -f $strains ${nextclade_alignment} >> ${out_align}"
  cmd="sbatch --parsable ${partition} -c 8 --mem 8G -J alignment-${year_month} -o ${log_dir}/%x_$(date +"%Y-%m-%d")_%j.log --wrap=\"$wrap\""  
  id=$(eval $cmd)
done
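
Once the jobs finish, each partition's alignment should contain one record per requested strain plus the reference. A minimal sketch of that check on toy files is shown below; the file contents are made up, and strains missing from the Nextclade alignment would show up as `incomplete`.

```shell
workdir=$(mktemp -d)
printf 'A\nB\n' > "$workdir/strains.txt"
printf '>Wuhan/Hu-1/2019\nACGT\n>A\nACGT\n>B\nACGA\n' > "$workdir/alignment.fasta"

# Expect one record per requested strain plus the reference.
expected=$(( $(wc -l < "$workdir/strains.txt") + 1 ))
observed=$(grep -c '^>' "$workdir/alignment.fasta")

if [[ "$observed" -eq "$expected" ]]; then
  status="complete"
else
  status="incomplete"
fi
echo "$status ($observed of $expected records)"
rm -rf "$workdir"
```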

The per-month alignments are collated into the `partition-alignment` directory in the next step.

### Genomic VCF (gVCF)

- Replace IUPAC bases with N
- Rename the chromosome to reference name
- Remove reference genome from samples

In [None]:
mkdir -p ${results_dir}/partition-gVCF
mkdir -p ${results_dir}/partition-alignment

for year_month in $(ls ${results_dir}/partition); 
do 
  year=$(echo $year_month | cut -d "-" -f 1)
  echo $year_month;
  in_align="${results_dir}/partition/${year_month}/alignment.fasta"
  out_align="${results_dir}/partition-alignment/${year_month}-alignment.fasta"
  out_vcf="${results_dir}/partition-gVCF/${year_month}-alignment.vcf"
  
  wrap="source activate ${conda_env} \
    && ${bin_dir}/seqkit replace -s -i -p 'B|D|E|F|H|I|J|K|L|M|N|O|P|Q|R|S|U|V|W|X|Y|Z' -r 'N' ${in_align} > ${out_align} \
    && snp-sites -b -v ${out_align} > ${out_vcf}.tmp \
    && grep '#' ${out_vcf}.tmp > ${out_vcf}.chr_rename.tmp \
    && ${bin_dir}/csvtk replace -t -H -f 1 -p '1' -r '${reference_strain}' ${out_vcf}.tmp >> ${out_vcf}.chr_rename.tmp \
    && vcftools --remove-indv '${reference_strain}' --vcf ${out_vcf}.chr_rename.tmp --recode --stdout | gzip -c > ${out_vcf}.gz \
    && gzip $out_align \
    && rm -f ${out_vcf}*.tmp;"
  cmd="sbatch --parsable ${partition} -c 1 --mem 8G -J vcf-${year_month} -o ${log_dir}/%x_$(date +"%Y-%m-%d")_%j.log --wrap=\"$wrap\""  
  id=$(eval $cmd)
done
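
The first step in the job above masks every base that is not A, C, G, or T with `N`, leaving gaps and header lines untouched. The sketch below shows the same masking on a toy FASTA using awk as a stand-in for `seqkit replace`; the sequences are made up.

```shell
fasta=$(printf '>sample1\nACGTRYKM-acgtn\n>sample2\nNNACGTSW\n')

# Header lines pass through unchanged; on sequence lines, any letter other than
# A, C, G, T (in either case) becomes N. Gap characters are not letters, so they stay.
masked=$(echo "$fasta" | awk '
  /^>/ { print; next }
  { gsub(/[BDEFHIJKLMNOPQRSUVWXYZbdefhijklmnopqrsuvwxyz]/, "N"); print }')

echo "$masked"
```

Here `R`, `Y`, `K`, `M`, `S`, `W`, and lowercase `n` are all rewritten to `N`, while lowercase `acgt` is left as is.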

## Output for Downstream

---

- Metadata: `../results/minimal.filtered.tsv`
- Alignment: `../results/partition-alignment/*.fasta.gz`
- VCF: `../results/partition-gVCF/*.vcf.gz`