# Homework 3: Functional file parsing

---
## Topic areas
* Functions
* I/O operations
* String operations
* Data structures

---
## Background

[ClinVar][1] is a freely accessible, public archive of reports of the relationships among human variations and phenotypes, with supporting evidence.

For this assignment, you will be working with a Variant Call Format (VCF) file. Below are the necessary details regarding this assignment, but consider looking [here][2] for a more detailed description of the file format. The purpose of the VCF format is to store gene sequence variations in a plain-text form.

The data you will be working with (`clinvar_20190923_short.vcf`) contains several allele frequencies from different databases. The one used in this assignment comes from the ExAC database. More information about the database can be found [here][3].


### The file format
The beginning of every VCF file contains various sets of information:
* Meta-information (details about the experiment or configuration) lines start with **`##`**
    * These lines are helpful in understanding specialized keys found in the `INFO` column. It is in these sections that one can find:
        * The description of the key
        * The data type of the values
        * The number of values the key holds
* Header lines (column names) start with **`#`**
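
The prefix rules above can be sketched in a few lines of Python. This is only an illustration: the helper name `classify_line` and the sample lines are made up and are not part of the assignment.

```python
def classify_line(line):
    """Label a VCF line as 'meta', 'header', or 'data' (hypothetical helper)."""
    # Test for '##' first, because every meta-information line also starts with '#'.
    if line.startswith("##"):
        return "meta"
    if line.startswith("#"):
        return "header"
    return "data"


# Made-up lines for illustration; they are not taken from the ClinVar file.
sample_lines = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t12345\t.\tA\tG\t.\t.\tKEY_A=1.5",
]
labels = [classify_line(line) for line in sample_lines]
print(labels)
```

The order of the two checks matters: testing for `#` first would label meta-information lines as header lines.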

From there on, each line is made up of tab (`\t`) separated values that make up eight (8) columns. Those columns are:
1. CHROM (chromosome)
2. POS (base pair position of the variant)
3. ID (identifier if applicable; `.` if not applicable/missing)
4. REF (reference base)
5. ALT (alternate base(s): comma (`,`) separated if applicable)
6. QUAL (Phred-scaled quality score; `.` if not applicable/missing)
7. FILTER (filter status; `.` if not applicable/missing)
8. INFO (any additional information about the variant)
    * Semi-colon (`;`) separated key-value pairs
    * Key-value pairs are equal sign (`=`) separated (key on the left, value on the right)
    * If a key has multiple values, the values are comma (`,`) separated
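
As a rough illustration of the column layout above, the sketch below splits one made-up data line on tabs and then breaks its INFO field into key-value pairs. The helper name `split_info`, the keys `KEY_A` and `KEY_B`, and the sample line are all hypothetical. It applies the general comma rule to every value, which may not suit every key.

```python
def split_info(info_field):
    """Turn an INFO string into a dict of key -> list of values (hypothetical helper)."""
    info = {}
    for pair in info_field.split(";"):
        # partition() also copes with a key that has no '=' and therefore no value.
        key, _, value = pair.partition("=")
        info[key] = value.split(",") if value else []
    return info


# Made-up data line for illustration; it is not taken from the ClinVar file.
line = "1\t12345\t.\tA\tG,T\t.\t.\tKEY_A=1.5;KEY_B=x,y\n"

# Strip the trailing newline, then unpack the eight tab-separated columns.
columns = line.rstrip("\n").split("\t")
chrom, pos, var_id, ref, alt, qual, filt, info_field = columns

alt_alleles = alt.split(",")
info = split_info(info_field)
print(alt_alleles)
print(info)
```

Every value comes back as text, so numeric fields still need an explicit conversion before they can be compared.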

#### Homework specific information
The given data (`clinvar_20190923_short.vcf`) is a specialized form of the VCF file. As such, there are some additional details to consider when parsing for this assignment. You will be expected to consider two (2) special types of keys:
1. The `AF_EXAC` key that describes the allele frequencies from the ExAC database
    > `##INFO=<ID=AF_EXAC,Number=1,Type=Float,Description="allele frequencies from ExAC">`
    * The data included are `float`ing point numbers
2. The `CLNDN` key that gives the disease names associated with the given variant
    > `##INFO=<ID=CLNDN,Number=.,Type=String,Description="ClinVar's preferred disease name for the concept specified by disease identifiers in CLNDISDB">`
    * The data are `str`ings. **However**, if there are multiple diseases associated with a given variant, the diseases are pipe (`|`) separated (there are 178 instances of this case)
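
The two value types can be handled roughly as follows. This is a sketch under stated assumptions: the `info` dictionary and its values are made up, and using `dict.get` assumes that a line might lack the `AF_EXAC` key.

```python
# Made-up INFO values for illustration; they are not taken from the ClinVar file.
info = {"AF_EXAC": "0.00002", "CLNDN": "Disease_A|Disease_B"}

# AF_EXAC is declared as Type=Float, so convert the text before comparing it.
# Assumption: the key may be absent, in which case get() returns None.
raw_frequency = info.get("AF_EXAC")
allele_frequency = float(raw_frequency) if raw_frequency is not None else None

# CLNDN may hold several disease names separated by pipes.
# Splitting a value that has no pipe still yields a one-element list.
diseases = info["CLNDN"].split("|")

print(allele_frequency)
print(diseases)
```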

---
[1]: https://www.ncbi.nlm.nih.gov/clinvar/intro/
[2]: https://samtools.github.io/hts-specs/VCFv4.3.pdf
[3]: http://exac.broadinstitute.org

## Instructions

It is safe to assume that this homework will require a considerable number of string operations to complete. However, this skill is _incredibly_ powerful in bioinformatics. Many dense, plain-text files exist in the bioinformatics domain, and mastering the ability to parse them is integral to many people's research. While the format we see here has a very clinical use case, you will likely encounter other formats as well: CSV, TSV, SAM, GFF3, etc.

Therefore, we <u>***STRONGLY ENCOURAGE***</u> you to:
* Come to office hours 
* Schedule one-on-one meetings
* Post to GitHub
* Ask a friend 

Ensure you _truly_ understand the concepts therein. The concepts here are not esoteric, but very practical. Also, **ask early, ask often**.

That said, on to the instructions for the assignment.

### Expectations
You are expected to:
1. Move the `clinvar_20190923_short.vcf` file to the same folder as this notebook
2. Write a function called `parse_line` that:
    1. Takes a `str`ing as an argument
    2. Extracts the `AF_EXAC` data to determine the rarity of the variant
        1. If the variant is rare:
            * `return`s a `list` of the associated diseases
        2. If the variant is not rare:
            * `return`s an empty `list`
3. Write another function called `read_file` that:
    1. Takes a `str`ing as an argument representing the file to be opened
    2. Opens the file
    3. Reads the file _line by line_
        * **Note**: You are expected to do this one line at a time. The reasoning is that if the file is sufficiently large, you may not have enough memory available to hold all of it at once. So, **do not** use `readlines()`!
           * If you do, your grade will be reduced
    4. Passes each line to `parse_line`
    5. Uses a dictionary to keep a running tally (or count) of the number of times each disease is returned by `parse_line`
    6. `return`s that dictionary
4. `print` the results from `read_file` when it is complete
5. Each function must have its own cell
6. The code to run all of your functions must have its own cell
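
The line-by-line reading and dictionary tally described above follow a general pattern, sketched here on a throwaway text file. The names `tally_names` and `names_on_line` are hypothetical stand-ins, and the demo file is not a VCF file; this is not the expected solution.

```python
import os
import tempfile


def tally_names(path, extract_names):
    """Count the names that extract_names() returns for each line (hypothetical helper)."""
    counts = {}
    with open(path) as handle:
        # Iterating over the file object yields one line at a time,
        # so the whole file is never held in memory at once.
        for line in handle:
            for name in extract_names(line):
                counts[name] = counts.get(name, 0) + 1
    return counts


def names_on_line(line):
    """Stand-in extractor: treat each line as pipe-separated names."""
    stripped = line.strip()
    return stripped.split("|") if stripped else []


# Write a small made-up file so the sketch runs on its own.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("A|B\nB\n\nA|C\n")
    demo_path = tmp.name

demo_counts = tally_names(demo_path, names_on_line)
os.remove(demo_path)
print(demo_counts)
```

Because the extractor returns an empty list for lines with nothing to count, the inner loop simply does nothing for them and no special case is needed.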

---
## Academic Honor Code
In accordance with Rackham's Academic Misconduct Policy, upon submission of your assignment, you (the student) are indicating acceptance of the following statement:

> “I pledge that this submission is solely my own work.”

As such, the instructors reserve the right to process any and all source code contained within the submitted notebooks with source code plagiarism detection software.

Any violations of this agreement will result in swift, sure, and significant punishment.

---
## Due date
This assignment is due **October 7th, 2019 by Noon (12 PM)**

---
## Submission
Name your notebook using the following pattern:

> `<uniqname>_hw3.ipynb`

### Example
> `mdsherm_hw3.ipynb`

We will *only* grade the most recent submission of your assignment.

---
## Late Policy
Each submission will receive a **10%** penalty per day (up to three days) that the assignment is late.

After that, you will receive a **0** for the homework.

---
## Good luck and code responsibly!

---

In [None]:
# Define your parse_line function here


In [None]:
# Define your read_file function here


---

In [None]:
# DO NOT MODIFY THIS CELL!
# If your code works as expected, this cell should print the results
from pprint import pprint
pprint(read_file('clinvar_20190923_short.vcf'))