covid19

Software requirements

R and R packages

We ran R scripts with R v3.6.2. We used the following packages installable from CRAN:

dplyr (v0.85)
rvest (v0.3.5)
rworldmap (v1.3-6)
ggplot2 (v3.3.0)

Python and python packages:

We ran python scripts with Python 3.7, installed using Anaconda. See docstrings of *.py files for further information on reproducing results. Packages were installed with pip3 and include:

matplotlib
mmh3

Data

SARS-CoV2 and other coronavirus protein sequences

We manually downloaded the full polyprotein 1ab (ORF1ab), spike (S) protein, membrane (M) protein, envelope (E) protein, and nucleocapsid (N) protein sequence FASTA files for the following 34 coronavirus species from the National Center of Biotechnology Information (NCBI) Reference Sequence Database. Note that for SARS-CoV-2 and SARS-CoV we downloaded FASTA files for all additional annotated protein sequences comprising their full viral proteomes:

Name	ORF1ab	Spike	Envelope	Membrane	Nucleocapsid
SARS-CoV-2*	YP_009724389.1	YP_009724390.1	YP_009724392.1	YP_009724393.1	YP_009724397.2
SARS-CoV*	NP_828849.2	NP_828851.1	NP_828854.1	NP_828855.1	NP_828858.1
OC43	YP_009555238.1	YP_009555241.1	YP_009555243.1	YP_009555244.1	YP_009555245.1
Bovine-CoV	NP_150073.2	NP_150077.1	NP_150081.1	NP_150082.1	NP_150083.1
HKU24	YP_009113022.1	YP_009113025.1	YP_009113028.1	YP_009113029.1	YP_009113031.1
HKU1	YP_173236.1	YP_173238.1	YP_173240.1	YP_173241.1	YP_173242.1
MHV	AAU06353.1	AAU06356.1	AAU06359.1	AAU06360.1	NP_045302.1
Rat-CoV	YP_003029844.1	YP_003029848.1	YP_003029850.1	YP_003029851.1	YP_003029852.1
Bat-BCoV	YP_009072438.1	YP_009072440.1	YP_009072442.1	YP_009072443.1	YP_009072446.1
Hedgehog-CoV	YP_009513008.1	YP_009513010.1	YP_009513016.1	YP_009513017.1	YP_009513018.1
MERS-CoV	YP_009047202.1	YP_009047204.1	YP_009047209.1	YP_009047210.1	YP_009047211.1
HKU4	YP_001039952.1	YP_001039953.1	YP_001039958.1	YP_001039959.1	YP_001039960.1
HKU5	YP_001039961.1	YP_001039962.1	YP_001039967.1	YP_001039968.1	YP_001039969.1
GCCDC1	YP_009273004.1	YP_009273005.1	YP_009273007.1	YP_009273008.1	YP_009273009.1
HKU9	YP_001039970.1	YP_001039971.1	YP_001039973.1	YP_001039974.1	YP_001039975.1
HKU14	YP_005454239.1	YP_005454245.1	YP_005454247.1	YP_005454248.1	YP_005454249.1
CDPHE15	YP_008439200.1	YP_008439202.1	YP_008439204.1	YP_008439205.1	YP_008439206.1
HKU10	YP_006908641.2	YP_006908642.1	YP_006908644.1	YP_006908645.1	YP_006908646.1
BtRf-AlphaCoV	YP_009199789.1	YP_009199790.1	YP_009199792.1	YP_009199793.1	YP_009199794.1
229E	ARU07599.1	ARU07601.1	ARU07603.1	ARU07604.1	ARU07605.1
LuchengRn-CoV	YP_009336483.1	YP_009336484.1	YP_009336485.1	YP_009336486.1	YP_009336487.1
Ferret-CoV	YP_009256195.1	YP_009256197.1	YP_009256199.1	YP_009256200.1	YP_009256201.1
Mink-CoV	YP_009019180.1	YP_009019182.1	YP_009019184.1	YP_009019185.1	YP_009019186.1
Bat-CoV-1A	YP_001718603.1	YP_001718605.1	YP_001718607.1	YP_001718608.1	YP_001718609.1
HKU8	YP_001718610.1	YP_001718612.1	YP_001718614.1	YP_001718615.1	YP_001718616.1
BtMr-AlphaCoV	YP_009199608.1	YP_009199609.1	YP_009199611.1	YP_009199612.1	YP_009199613.1
BtNv-AlphaCoV	YP_009201729.1	YP_009201730.1	YP_009201732.1	YP_009201733.1	YP_009201734.1
Porcine-EDV	NP_598309.2	NP_598310.1	NP_598312.1	NP_598313.1	NP_598314.1
BtCoV512	YP_001351683.1	YP_001351684.1	YP_001351686.1	YP_001351687.1	YP_001351688.1
HKU2	YP_001552234.1	YP_001552236.1	YP_001552238.1	YP_001552239.1	YP_001552240.1
NL63	YP_003766.2	YP_003767.1	YP_003769.1	YP_003770.1	YP_003771.1
NL63-related	YP_009328933.1	YP_009328935.1	YP_009328937.1	YP_009328938.1	YP_009328939.1
FCoV	YP_004070193.2	YP_004070194.1	YP_004070197.1	YP_004070198.1	YP_004070199.1
TGE	NP_058422.1	NP_058424.1	NP_058426.1	NP_058427.2	NP_058428.1

HLA allele frequency data

We designed and executed a custom R script (HLA_frequencies.R) to scrape all population and human leukocyte antigen (HLA) allele frequency from the Allele Frequency Net Database.

Data analysis

Protein sequence alignments

For each protein class (i.e. ORF1ab, S, M, E, N), all 34 coronavirus sequences were aligned using the web-based Clustal Omega multisequence aligner tool employing default parameters: sequence type [Protein], output alignment format [clustal_num], dealign [false], mBed-like clustering guide-tree [true], mBed-like clustering iteration [true], number of combined iterations [0], maximum guide tree iterations [-1], and maximum HMM iterations [-1].

Binding affinity

Using the above FASTA sequences, we used netchop v3.0 "C-term" model with a cleavage threshold of 0.1 to filter peptides that were not predicted to undergo canonical MHC class I antigen processing via proteasomal cleavage.

To assess binding affinity, we kmerized the protein FASTA into peptides of length 8-12.

sed 's/>.*/|/g' protein_sequence.fasta > protein_sequence_with_pipe.fasta

perl -pe 's/\s+//g' protein_sequence_with_pipe.fasta > protein_no_space.fasta

From the netchop v3.0 output, we further processed the file to get rid of NA lines and added row numbers as a column.

sed '/NA/d' netchop_out.txt > netchop_out_parsed.txt

awk -F'\t' 'NR>1{$0=$0"\t"NR-1} 1' netchop_out_parsed.txt > netchop_numbered.txt

Using these 2 files, we ran a custom python script (faster_kmer.py) to generate a list of peptides.

python3 faster_kmer.py > output.pep

sed '/|/d' output.pep > output_filt.pep

Binding affinity of these peptides were calculated using netMHCpan v4.0 (73) with the ‘-BA’ option to include binding affinity predictions and the ‘-l’ option to specify peptides of lengths 8-12.

Name		Name	Last commit message	Last commit date
Latest commit History 119 Commits
alignments		alignments
cov		cov
fasta		fasta
out		out
src		src
supporting_data		supporting_data
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

alignments

alignments

cov

cov

fasta

fasta

out

out

src

src

supporting_data

supporting_data

LICENSE

LICENSE

README.md

README.md

Repository files navigation

covid19

Software requirements

Data

Data analysis

About

Releases

Packages

Contributors 4

Languages

License

pdxgx/covid19

Folders and files

Latest commit

History

Repository files navigation

covid19

Software requirements

Data

Data analysis

About

Resources

License

Stars

Watchers

Forks

Languages