Repository containing all source code/scripts and project files pertaining to the SARS-CoV-2 RNA secondary structure analysis project for Professor Itsik Pe'er's Spring 2020 Computational Genomics course. By Lawrence Chillrud.

lawrence-chillrud/SARS-CoV-2-Seq-Analysis

SARS-CoV-2-Seq-Analysis

Contents

  1. Chillrud-Final-Report.pdf: pdf file containing the final project report, formatted as an Oxford University Press Bioinformatics Applications Note.
  2. Interspecies_Dataset: Folder containing the following four subfolders:
    1. clustal_output: Folder containing five .txt files detailing the output of running Clustal Omega on the interspecies data found in the Interspecies_Dataset/Sequences/ folder.
    2. conserved: Folder containing custom-written script to extract conserved secondary structures from the dataset, along with sample input and output files.
    3. RNAfold_output: Folder containing the output generated when running RNAfold on the interspecies data found in the Interspecies_Dataset/Sequences/ folder.
    4. Sequences: Folder containing the sequences for the interspecies dataset in FASTA format.
  3. Intraspecies_Dataset: Folder containing the following four subfolders:
    1. clustal_output: Folder containing five .txt files detailing the output of running Clustal Omega on the intraspecies data found in the Intraspecies_Dataset/scraped/ folder.
    2. conserved: Folder containing custom-written script to extract conserved secondary structures from the dataset, along with sample input and output files.
    3. RNAfold_output: Folder containing the output generated when running RNAfold on the intraspecies data found in the Intraspecies_Dataset/scraped/ folder.
    4. scraped: Folder containing the scraped intraspecies isolates generated by scraper.py.
  4. Presentations: Folder containing three pdf files corresponding to the three project presentations.
    1. Project-Outline-Presentation.pdf
    2. Midpoint-Presentation.pdf
    3. Final-Presentation.pdf

Source code

All Python files were written for, and should be executed with, Python 3.7.0. They are not yet portable: many of them have directory and file dependencies hardcoded, so they will not run without my exact directory structure. Due to time constraints and the project deadline, cleaning the files up so that they run easily on someone else's system remains future work. That said, the .py files of note are as follows:

  1. scraper.py: Found in the Intraspecies_Dataset folder, this script downloads the intraspecies dataset from NCBI in GenBank format and saves the records to the scraped folder. It requires the selenium package, along with more common packages that usually come with Python, and a webdriver for Google Chrome.
  2. reduce.py: Found in the Intraspecies_Dataset/scraped folder, this script reduces the intraspecies dataset from the original 414 isolates to 65, using the percent identity matrix that Clustal Omega generated for the intraspecies dataset.
  3. stack.py: Found in a few subfolders within both the Interspecies_Dataset and Intraspecies_Dataset folders, this is the file that identifies potentially conserved secondary structures.
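
To illustrate the kind of reduction that reduce.py performs, here is a minimal sketch of a greedy filter over a percent identity matrix. It is not the code in reduce.py: the function name, the 99.9 threshold, the isolate names, and the in-memory matrix are all hypothetical, and the actual selection rule used to reach 65 isolates may differ.

```python
def reduce_by_identity(names, pim, threshold=99.9):
    """Greedily keep isolates that are not near-duplicates of any kept isolate.

    names     -- list of isolate identifiers (hypothetical example data below)
    pim       -- square percent identity matrix, pim[i][j] in [0, 100]
    threshold -- assumed cutoff; an isolate is dropped if its identity to any
                 already-kept isolate is at or above this value
    """
    kept = []  # indices of isolates retained so far
    for i in range(len(names)):
        if all(pim[i][j] < threshold for j in kept):
            kept.append(i)
    return [names[i] for i in kept]


# Toy 4x4 matrix standing in for the Clustal Omega output (made-up values).
names = ["isolate_A", "isolate_B", "isolate_C", "isolate_D"]
pim = [
    [100.0, 99.95, 99.2, 99.1],
    [99.95, 100.0, 99.3, 99.0],
    [99.2, 99.3, 100.0, 99.97],
    [99.1, 99.0, 99.97, 100.0],
]

print(reduce_by_identity(names, pim, threshold=99.9))
```

In this toy example, isolate_B and isolate_D are dropped because each is at least 99.9% identical to an isolate that was already kept. A greedy pass like this depends on the input order, so a different ordering of the isolates can yield a different reduced set.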

Clustal Omega and ViennaRNA

Please visit the following page to download and use Clustal Omega: http://www.clustal.org/omega/

Please visit the following page to download and use ViennaRNA: https://www.tbi.univie.ac.at/RNA/
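
RNAfold reports each predicted secondary structure in dot-bracket notation, where matching parentheses mark paired bases and dots mark unpaired ones. The sketch below shows one way to turn such strings into base pairs and to find the pairs shared by several structures. It is an illustration only, not the logic of stack.py: the function names are hypothetical, and it assumes the structures are already aligned to the same coordinates and contain no pseudoknot brackets.

```python
def base_pairs(structure):
    """Return the set of (i, j) index pairs encoded by a dot-bracket string."""
    open_positions = []  # stack of indices of unmatched "("
    pairs = set()
    for pos, symbol in enumerate(structure):
        if symbol == "(":
            open_positions.append(pos)
        elif symbol == ")":
            if not open_positions:
                raise ValueError("unbalanced structure: extra ')' at %d" % pos)
            pairs.add((open_positions.pop(), pos))
    if open_positions:
        raise ValueError("unbalanced structure: unmatched '('")
    return pairs


def shared_pairs(structures):
    """Return the sorted base pairs present in every given structure.

    Assumes all structures use the same coordinate system (hypothetical
    simplification; real sequences would first need to be aligned).
    """
    if not structures:
        return []
    pair_sets = [base_pairs(s) for s in structures]
    return sorted(set.intersection(*pair_sets))


# Two toy structures of equal length (made-up examples).
print(shared_pairs(["((...))", "(.....)"]))
```

Here the only pair common to both toy structures is the outermost one, between positions 0 and 6 (0-indexed).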
