# SARS-2 and Human ACE2 Variant Analysis

## Purpose
The intent of this pipeline is to identify SNVs in publically available, GISAID submitted SARS-2 genomes that differ from the original Wuhan strain. Identified amino acid variants in the S protein were then modeled for their 3D structure and interactions were predicted agains the Human ACE2 receptor and its reported variants.

### 1. What are the mutations within the SARS-2 genome and how prevalent are they?
In order to identify the mutations that are in circulation we can use the SARS-2 genomes from GISAID and compare them to the established Wuhan strain (the only RefSeq in NCBI). 

A caveat: I do not know how these genomes are generated but they are presumably MinION data so performing traidional QC is not possible.

From there I can use the MUMMER4 package to identify mutations and their subsequent effect on translation.

### 2. Do any of these mutations have an effect on the binding affinity to ACE2?
Hin Hark is using these mutations in his modeling. I will ask him for a summary of his methods.

### 3. What are the mutations within the ACE2 receptor and do they have any effect on the binding affinity to the SARS-2 spike (s) protein?
Data for the mutations were obtained from dbSNP and is using the data from the sources mentioned below.

## SNP Analysis

To start we are looking only at the spike protein. Using the SARS-2 reference sequence from NCBI I aligned all of the GISAID CDS in protein-space to the reference sequence. I then filtered the alignments to only the first reading frame and removed the ambiguous bases in the resulting SNPs.

In [None]:
%%bash
# Format sequence identifier to play nice with Promer
perl -pe 's/^.+EPI_ISL_(\d+).+/>$1/g'  gisaid_hcov-19_2020_05_04_17.fasta | perl -pe 's/-/n/g' > gisaid_hcov-19_2020_05_04_17.refactored.fasta

promer -p SARS-2.refseq.cds.s_prot.promer SARS-2.refseq.cds.s_prot.fasta gisaid_cov2020_sequences.refactored.fasta
show-coords -clT SARS-2.refseq.cds.s_prot.promer.delta > SARS-2.refseq.cds.s_prot.promer.coords
show-snps -STH gisaid_hcov-19_2020_05_04_17.s_prot.promer.delta < <(awk '$14 == "1" && $15 == "1" { print $0 }' gisaid_hcov-19_2020_05_04_17.s_prot.promer.coords) | cut -f1-3 | sort | uniq -c | sort -nrb | perl -pe 's/ +/\t/g' | perl -pe 's/^\s+//g' > gisaid_hcov-19_2020_05_04_17.s_prot.promer.snps-with-nonsense-mutations.counts.tsv
# Removes nonsense mutations
# show-snps -ST SARS-2.refseq.cds.s_prot.promer.delta < <(awk '$14 == "1" && $15 == "1" { print $0 }' SARS-2.refseq.cds.s_prot.promer.coords) | grep -v 'X' > SARS-2.refseq.cds.s_prot.promer.snps

Parsed the dbSNP into a TSV file with the following command:

In [None]:
# Added this into the show-snps one liner above
# perl -pe 's/\n$/\t/g' snp_result.txt| perl -pe 's/--\t/\n/g' | perl -pe 's/\d+. rs/\nrs/g' | perl -pe 's/\r//g' > snp_result.tsv

From there I formatted the data into an extended TSV by adding the frequencies of the mutation per data source and extracted the other relevant info (discarding much of it data provided by dbSNP that I deeded irrelevent at this stage in the analysis).

In [54]:
snps = []

import re

def check_snp_src(l, src):
    index = [i for i, s in enumerate(l) if src in s if i is not ""]
    if index:
        m = re.search(r'0\.\d+',l[index[0]])
#         print(l[index[0]])
#         print (m.group(0))
#         print(m)
        if m:
            return m.group(0)
        else:
            return 'NA'
    else:
        return 'NA'

with open('snp_result.tsv', 'r') as f:
    for line in f.readlines():
        if line == '\n':
            continue
        snp = {
                'id':'',
                'snv':'',
                'position':'',
                '1000Genomes':'',
                'TWINSUK':'',
                'GnomAD':'',
                'ALSPAC':'',
                'TOPMED':''
              }
        l = line.split('\t')
        snp['id'] = l[0]
        snp['snv'] = l[1]
        snp['position'] = l[2]
        snp['1000Genomes'] = check_snp_src(l,'1000Genomes')
        snp['TWINSUK'] = check_snp_src(l,'TWINSUK')
        snp['GnomAD'] = check_snp_src(l,'GnomAD')
        snp['ALSPAC'] = check_snp_src(l,'ALSPAC')
        snp['TOPMED'] = check_snp_src(l,'TOPMED')
        
#         print(snp)
        
        snps.append(snp)
    
# with open('snp_result.cleaned.tsv', 'wb', )
df = pandas.DataFrame(snps)
df.to_csv('snp_result.cleaned.tsv', sep='\t', )

# print (snps)
    
#         for i in l:
#             if i == l[0]:
#                 snp['id'] = l[0]
#                 continue
#             elif i == l[1]:
#                 snp['snv'] = l[1]
#                 continue
#             elif i == l[2]:
#                 snp['position'] = l[2]
#                 continue
#             elif
            

### SNP time course analysis
To look at the occurence of mutations throughout the course of the pandemic we need to filter out the sequences that by the month. To do so, we need to link the metatdata with the sequences and the filter per month and run the SNP analysis pipeline for each subsequent month.

Also, to make results more reliable, I will implement some QC filtering methods as mentioned in these two papers:
https://virological.org/t/issues-with-sars-cov-2-sequencing-data/473
https://www.biorxiv.org/content/10.1101/2020.04.26.062422v2.full.pdf

In [6]:
import pandas

metadata = pandas.read_csv("metadata_2020-10-01_08-36.tsv", delimiter='\t', header=0)

In [7]:
#Remove non-human isolates and those with incomplete collection date info.
# Fix entries with no collection day in datetime field
metadata['date'] = metadata['date'].str.replace('XX','01')
metadata.filtered = metadata.loc[(metadata['host'] == 'Human') & (metadata['date'] != '2020')]

  after removing the cwd from sys.path.


In [8]:
# The earliest is 2019-12 and the latest is 2020-09
metadata.filtered.sort_values('date')

Unnamed: 0,strain,virus,gisaid_epi_isl,genbank_accession,date,region,country,division,location,region_exposure,...,sex,pangolin_lineage,GISAID_clade,originating_lab,submitting_lab,authors,url,title,paper_url,date_submitted
130280,Wuhan/IPBCAMS-WH-01/2019,ncov,EPI_ISL_402123,MT019529,2019-12-24,Asia,China,Hubei,Wuhan,Asia,...,Male,B,L,"Institute of Pathogen Biology, Chinese Academy...","Institute of Pathogen Biology, Chinese Academy...",Lili Ren et al,https://www.gisaid.org,?,?,2020-01-11
130290,Wuhan/WH01/2019,ncov,EPI_ISL_406798,LR757998,2019-12-26,Asia,China,Hubei,Wuhan,Asia,...,Male,B,L,General Hospital of Central Theater Command of...,"BGI & Institute of Microbiology, Chinese Acade...",Weijun Chen et al,https://www.gisaid.org,Genomic characterisation and epidemiology of 2...,https://dx.doi.org/10.1016/S0140-6736(20)30251-8,2020-01-30
130274,Wuhan/Hu-1/2019,ncov,EPI_ISL_402125,MN908947,2019-12-26,Asia,China,Hubei,Wuhan,Asia,...,?,B,L,National Institute for Communicable Disease Co...,National Institute for Communicable Disease Co...,Zhang et al,https://www.gisaid.org,A new coronavirus associated with human respir...,https://dx.doi.org/10.1038/s41586-020-2008-3,2020-01-12
130311,Wuhan/WIV05/2019,ncov,EPI_ISL_402128,MN996529,2019-12-30,Asia,China,Hubei,Wuhan,Asia,...,Female,B,L,Wuhan Jinyintan Hospital,"Wuhan Institute of Virology, Chinese Academy o...",Peng Zhou et al,https://www.gisaid.org,?,?,2020-01-18
130312,Wuhan/WIV06/2019,ncov,EPI_ISL_402129,MN996530,2019-12-30,Asia,China,Hubei,Wuhan,Asia,...,Male,B,L,Wuhan Jinyintan Hospital,"Wuhan Institute of Virology, Chinese Academy o...",Peng Zhou et al,https://www.gisaid.org,?,?,2020-01-18
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
51194,England/NOTT-113555/2020,ncov,EPI_ISL_549512,?,2020-09-20,Europe,United Kingdom,England,,Europe,...,?,B.1,G,"Queens Medical Centre, Clinical Microbiology D...",COVID-19 Genomics UK (COG-UK) Consortium,Gemma Clark et al,https://www.gisaid.org,?,?,2020-09-28
51195,England/NOTT-113564/2020,ncov,EPI_ISL_549513,?,2020-09-20,Europe,United Kingdom,England,,Europe,...,?,B.1.79,G,"Queens Medical Centre, Clinical Microbiology D...",COVID-19 Genomics UK (COG-UK) Consortium,Gemma Clark et al,https://www.gisaid.org,?,?,2020-09-28
51196,England/NOTT-113573/2020,ncov,EPI_ISL_549514,?,2020-09-20,Europe,United Kingdom,England,,Europe,...,?,B.1,G,"Queens Medical Centre, Clinical Microbiology D...",COVID-19 Genomics UK (COG-UK) Consortium,Gemma Clark et al,https://www.gisaid.org,?,?,2020-09-28
51197,England/NOTT-113582/2020,ncov,EPI_ISL_549515,?,2020-09-20,Europe,United Kingdom,England,,Europe,...,?,B.1,G,"Queens Medical Centre, Clinical Microbiology D...",COVID-19 Genomics UK (COG-UK) Consortium,Gemma Clark et al,https://www.gisaid.org,?,?,2020-09-28


In [10]:
# Making sure dates are in the proper format


# from dateutil.parser import parse
# print(metadata.filtered.loc[0,].date)
# print(parse(metadata.filtered.loc[0,].date))

# print(metadata.filtered.sort_values('date').date.unique())

metadata.filtered['date'] = pandas.to_datetime(metadata.filtered['date'])
metadata.filtered['month_year'] = metadata.filtered['date'].dt.strftime('%Y-%m')
dates = metadata.filtered.sort_values('month_year')['month_year'].unique()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  # Remove the CWD from sys.path while we load stuff.
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  # This is added back by InteractiveShellApp.init_path()


In [25]:
print("\n".join(metadata.filtered['strain'].loc[metadata.filtered['month_year'] <= date].tolist()))

Wuhan/HBCDC-HB-01/2019
Wuhan/HBCDC-HB-02/2019
Wuhan/HBCDC-HB-03/2019
Wuhan/HBCDC-HB-04/2019
Wuhan/Hu-1/2019
Wuhan/IME-WH01/2020
Wuhan/IME-WH02/2020
Wuhan/IME-WH03/2020
Wuhan/IME-WH04/2020
Wuhan/IME-WH05/2020
Wuhan/IPBCAMS-WH-01/2019
Wuhan/IPBCAMS-WH-02/2019
Wuhan/IPBCAMS-WH-03/2019
Wuhan/IPBCAMS-WH-04/2019
Wuhan/IVDC-HB-01/2019
Wuhan/IVDC-HB-05/2019
Wuhan/IVDC-HB-GX02/2019
Wuhan/WH01/2019
Wuhan/WH02/2019
Wuhan/WIV02/2019
Wuhan/WIV04/2019
Wuhan/WIV05/2019
Wuhan/WIV06/2019
Wuhan/WIV07/2019


In [27]:
for date in dates:
#     print(date)
#     print(len(metadata.filtered['strain'].loc[metadata.filtered['month_year'] <= date]))
    with open('covid_'+date+'_seqs.txt', 'w') as f:
        f.write('\n'.join(metadata.filtered['strain'].loc[metadata.filtered['month_year'] <= date].tolist()))

In [28]:
# Ran on system
# Grabbing all sequences for each timepoint
for i in $(ls covid*.txt | perl -pe 's/\.txt//g'); do 
    seqtk subseq sequences_2020-10-01_07-14.fasta $i.txt >! $i.fasta ; 
    
    promer -p $i.promer.s_prot SARS-2.refseq.cds.s_prot.fasta $i.fasta
    show-coords -clT $i.promer.s_prot.delta > $i.promer.s_prot.coords
    show-snps -STH $i.promer.s_prot.delta < <(awk '$14 == "1" && $15 == "1" { print $0 }' $i.promer.s_prot.coords) | cut -f1-3 | sort | uniq -c | sort -nrb | perl -pe 's/ +/\t/g' | perl -pe 's/^\s+//g' > $i.promer.s_prot.snps-with-nonsense-mutations.counts.tsv
    # Removes nonsense mutations
    # show-snps -ST SARS-2.refseq.cds.promer.delta < <(awk '$14 == "1" && $15 == "1" { print $0 }' SARS-2.refseq.cds.s_prot.promer.coords) | grep -v 'X' > SARS-2.refseq.cds.s_prot.promer.snps
done