# Module 3 Problem Set
### YOUR NAME HERE
_**Due: Friday Sept. 29th, 11:30AM**_
This problemset

## Question 1:
Here, you have performed RT-PCR for an unidentified gene and obtained a cDNA sequence.
You have cloned the cDNA into a plasmid and sequenced several clones in their entirety.

Millions of reads were generated, each representing a small portion of the entire plasmid.

Your coworker performed the sequencing and was kind enough to provide you with the aligned reads, with the plasmid sequence removed, in SAM format.
They also provided you with a GTF file of the region of the genome they thought the gene was located in.

Each read has a unique ID and has been aligned to a position on the chromosome.

Your goal is to use these files to confirm the presence of your gene of interest and identify the gene.

Remember that bioinformatics is as much about understanding biological concepts as it is about coding and data analysis.
Use your insights to guide your analysis. Also, make sure to document your code well.
Your peers should be able to understand your logic just by reading your code and comments.

Don't forget to have fun with your data!

First, let's make sure we are ready.
Start by reading the SAM/BAM and GTF format specifications or this week's lecture materials.

**What is the purpose of a SAM file and a GTF file?**

Start by loading the provided SAM (`data/mysterious.sam`) and GTF (`data/chr19.gtf`) files into pandas DataFrames.
Both files are tab-delimited, so you can use the `pd.read_csv()` function to load them.
Make sure to assign appropriate column names as per the SAM and GTF format specification.

**You only need to name the first 8 columns of the SAM file.**

**Write two separate functions to do this and print out the first few rows of the DataFrames.**
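
As a rough illustration of loading a headerless, tab-delimited file with explicit column names, the sketch below parses a tiny made-up GTF-style table from memory. The records, the `load_toy_table` helper, and the `comment="#"` and `quoting=csv.QUOTE_NONE` options are assumptions chosen for this toy input; the provided files may need different settings.

```python
import csv
import io

import pandas as pd

# The nine column names commonly used for the GTF format.
GTF_COLUMNS = [
    "seqname", "source", "feature", "start", "end",
    "score", "strand", "frame", "attribute",
]

# Made-up records standing in for a real annotation file.
toy_gtf = (
    "#!a header comment that should be skipped\n"
    'chrT\ttoy\tgene\t100\t900\t.\t+\t.\tgene_id "G1"; gene_name "ToyA";\n'
    'chrT\ttoy\texon\t100\t300\t.\t+\t.\tgene_id "G1"; gene_name "ToyA";\n'
)


def load_toy_table(handle: io.StringIO, column_names: list[str]) -> pd.DataFrame:
    """Read a tab-delimited table that has no header row (hypothetical helper)."""
    return pd.read_csv(
        handle,
        sep="\t",
        header=None,             # the file carries no column names of its own
        names=column_names,      # so we supply them explicitly
        comment="#",             # assumption: '#' only appears in header lines
        quoting=csv.QUOTE_NONE,  # keep the quotes inside the attribute field
    )


toy_df = load_toy_table(io.StringIO(toy_gtf), GTF_COLUMNS)
print(toy_df.head())
```

Passing `names` together with `header=None` is what keeps the first data row from being consumed as a header.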

**What does a flag of 16 correspond to in the SAM file?**
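
The SAM FLAG is a bitwise field, so any value can be broken into the individual bits that are set. The sketch below shows one way to do that; the `set_bits` helper is a hypothetical name, and the example value 99 is arbitrary. Each resulting bit can then be looked up in the SAM specification.

```python
def set_bits(flag: int) -> list[int]:
    """Return the power-of-two bits that are set in a SAM FLAG value.

    The SAM specification defines 12 flag bits (0x1 through 0x800).
    """
    return [1 << i for i in range(12) if flag & (1 << i)]


# 99 = 1 + 2 + 32 + 64
example_bits = set_bits(99)
print(example_bits)  # [1, 2, 32, 64]
```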


Now let's explore the data.

A SAM file contains the mapping information for each read.
Let's find out where in the genome the reads are mapping to.
**Plot the distribution of the mapping position of all alignments from the SAM file as a histogram.**

**Use the default settings for this plot.**


**Is the plot helpful for answering your question? If not, why not?**

**Copy your code from above into the following cell and update it so that the plot is informative.**

Perform any data filtering and plot adjustments as needed.
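
To illustrate how filtering and binning change a histogram, the sketch below bins a handful of made-up positions with NumPy. It assumes, purely for illustration, that some records have a position of 0 (the value SAM uses for reads without a mapping position); whether the provided file contains such records is something to check. The same `bins` and `range` ideas apply to `plt.hist`.

```python
import numpy as np

# Made-up positions: three records at 0 plus one tight cluster far away.
positions = np.array([0, 0, 0, 50_000, 50_010, 50_020, 50_400, 50_410])

# Default binning: 10 equal-width bins spanning min to max.
default_counts, default_edges = np.histogram(positions)

# Keep only positions above 0, then bin the remaining, much narrower range.
mapped = positions[positions > 0]
zoomed_counts, zoomed_edges = np.histogram(mapped, bins=5)

print(default_counts)  # all of the cluster collapses into the last bin
print(zoomed_counts)   # structure inside the cluster becomes visible
```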

**What is the first (smallest integer) position in the genome to which the reads are aligned? What about the last?** (think about the last one carefully)
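
In SAM, `POS` is the 1-based leftmost reference position of an alignment, so where an alignment ends depends on its CIGAR string. The sketch below is one possible way to compute that end coordinate; `alignment_end` is a hypothetical helper and the inputs are made up.

```python
import re

# CIGAR operations that consume reference bases, per the SAM specification.
REFERENCE_CONSUMING_OPS = set("MDN=X")
CIGAR_TOKEN = re.compile(r"(\d+)([MIDNSHP=X])")


def alignment_end(pos: int, cigar: str) -> int:
    """Return the 1-based, inclusive last reference position of an alignment."""
    reference_length = sum(
        int(length)
        for length, op in CIGAR_TOKEN.findall(cigar)
        if op in REFERENCE_CONSUMING_OPS
    )
    return pos + reference_length - 1


end_simple = alignment_end(100, "10M")        # 10 matched bases
end_spliced = alignment_end(100, "5M100N5M")  # a 100-base skipped region
end_clipped = alignment_end(100, "2S8M")      # soft clips do not consume reference
print(end_simple, end_spliced, end_clipped)   # 109 209 107
```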


You should see a gap in the mapping.
**Why is there a gap, in the context of this experiment?**

**Approximately, what is the average coverage in the mapped region?**
You can estimate this information from the plot or calculate it from the data.
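
If you calculate coverage from the data, one simple approach is to count how many alignments cover each base and average over the covered positions. The sketch below does this for three made-up intervals; `per_base_depth` is a hypothetical helper, and inclusive start and end coordinates are an assumption.

```python
from collections import Counter


def per_base_depth(intervals: list[tuple[int, int]]) -> Counter:
    """Count how many intervals cover each position (inclusive coordinates)."""
    depth: Counter = Counter()
    for start, end in intervals:
        for position in range(start, end + 1):
            depth[position] += 1
    return depth


toy_alignments = [(1, 4), (3, 6), (5, 6)]
depth = per_base_depth(toy_alignments)

# Mean depth over the positions covered by at least one alignment.
mean_depth = sum(depth.values()) / len(depth)
print(round(mean_depth, 2))  # 1.67
```

This loop is fine for small inputs; for millions of reads a vectorized or interval-based approach would be faster.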

Finally, using the GTF file, let's find out what gene we have been looking at so far.

**Perform an analysis to determine which gene is represented.**
**Your analysis should be convincing to your friend.**
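
One way to relate a mapped region to an annotation is to keep the features whose coordinates overlap it. The sketch below uses a made-up annotation table; the region bounds, the `gene_name` attribute key, and all values are assumptions for illustration and may not match the provided GTF.

```python
import pandas as pd

# Made-up annotation rows in the spirit of a GTF table.
toy_annotation = pd.DataFrame(
    {
        "feature": ["gene", "gene", "gene"],
        "start": [100, 1000, 5000],
        "end": [900, 2000, 6000],
        "attribute": [
            'gene_id "G1"; gene_name "ToyA";',
            'gene_id "G2"; gene_name "ToyB";',
            'gene_id "G3"; gene_name "ToyC";',
        ],
    }
)

# Hypothetical region covered by the reads.
region_start, region_end = 1200, 1800

# Two intervals overlap when each one starts before the other ends.
overlapping = toy_annotation[
    (toy_annotation["feature"] == "gene")
    & (toy_annotation["start"] <= region_end)
    & (toy_annotation["end"] >= region_start)
]

# Pull the gene name out of the free-text attribute column.
gene_names = overlapping["attribute"].str.extract(
    r'gene_name "([^"]+)"', expand=False
)
print(gene_names.tolist())  # ['ToyB']
```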

Run [BLAST](https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastn&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome) on the sequence from read 42 to further confirm or disprove your results.

**Describe the first two hits from BLAST and how they relate to your results. Which species does this sequence come from?**

Bonus: There are irregularities in the mapping that are not expected from a cloned gene. Specifically, there is a significant difference in coverage across different aligned regions. What could be the cause of that, considering this experiment?

## Question 2:

a. Design a Python class named `SeqRead` that takes the following attributes:
- `read_id` (a unique identifier for the read, e.g. 'R1','R2')
- `chromosome` (the chromosome from which the read originates, e.g. 'chr1','chr2')
- `start_position` (the starting base pair position of the read on the chromosome, an integer)
- `end_position` (the ending base pair position of the read on the chromosome, an integer)

This class should have a method that returns the `length` of the read, and a `__str__` method that returns the read formatted as follows:
```
Read [read_id] originates from [chromosome] and spans from position [start_position] to [end_position]
```

**You will also be scored on proper type annotations for your function arguments.**
 (e.g. `def hello(name: str) -> str:`)
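
As a small illustration of type annotations on a class, here is a sketch using an unrelated, made-up `Primer` class. The class and its attributes are hypothetical and are not part of the assignment.

```python
class Primer:
    """A toy class showing annotated arguments and return types."""

    def __init__(self, name: str, sequence: str) -> None:
        self.name = name
        self.sequence = sequence

    def length(self) -> int:
        """Return the number of bases in the primer."""
        return len(self.sequence)

    def __str__(self) -> str:
        # __str__ returns the text; print() and str() call it for you.
        return f"Primer {self.name} has sequence {self.sequence}"


primer = Primer("P1", "ACGT")
print(primer)           # Primer P1 has sequence ACGT
print(primer.length())  # 4
```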


```python
class SeqRead:
    ...
```

Create a list of four `SeqRead` objects with the following information:
- 'R101', 'chr1', 1000, 1100
- 'R202', 'chr1', 1050, 1150
- 'R303', 'chr2', 2000, 2250
- 'R404', 'chr2', 2100, 2400

After creating the list, create a for-loop that prints each read in the list using `str(YOUR_OBJECT)`.

BONUS: Create a function that takes two `SeqRead` objects and calculates and returns the number of overlapping base pairs between the two reads.
The function should return the number of overlapping bases if the reads are on the same chromosome,
and it should return `-1` if they are on different chromosomes.

Demonstrate the use of this new function by calculating the overlap between all pairs of the above reads.
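
For the "all pairs" part, `itertools.combinations` yields each unordered pair exactly once. The sketch below applies it to plain, made-up integer intervals rather than `SeqRead` objects; `interval_overlap` is a hypothetical helper, and treating start and end as inclusive is an assumption you should state in your own solution.

```python
from itertools import combinations


def interval_overlap(a: tuple[int, int], b: tuple[int, int]) -> int:
    """Bases shared by two intervals, assuming inclusive start and end."""
    shared = min(a[1], b[1]) - max(a[0], b[0]) + 1
    return max(0, shared)


toy_intervals = {"A": (10, 20), "B": (15, 30), "C": (40, 50)}

# combinations() over a dict iterates its keys, giving each pair of names once.
pairwise = {
    (name_a, name_b): interval_overlap(toy_intervals[name_a], toy_intervals[name_b])
    for name_a, name_b in combinations(toy_intervals, 2)
}
print(pairwise)  # {('A', 'B'): 6, ('A', 'C'): 0, ('B', 'C'): 0}
```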

```python
def overlaps_with(a: SeqRead, b: SeqRead) -> int:
    """Return the number of bases that overlap between the two reads."""
    return -1
```

## Question 3:
This question uses the Hipposeq bulk RNA-seq data from Modules 2 and 3:

- [GSE74985](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE74985)
- Cembrowski MS, Wang L, Sugino K, Shields BC et al. Hipposeq: a comprehensive RNA-seq database of gene expression in hippocampal principal neurons. Elife 2016 Apr 26;5:e14997. PMID: [27113915](https://www.ncbi.nlm.nih.gov/pubmed/27113915)

Create an `AnnData` object from these data.

Start by reading in the three files that we need to create the `AnnData` object:
 - The gene expression matrix `X` (`data/GSE74985_data.csv`) 
 - The sample information `obs` (`data/GSE74985_sample_info.csv`)
 - The gene information `var` (`data/GSE74985_gene_info.csv`)

Reminder, each of these files can be read in as a `pandas` `DataFrame`.
For the gene expression matrix, you will need to transpose (`.T`) the imported data from a 'gene X sample' matrix to a 'sample X gene' matrix.

_Hint, you may need `X.index.name = "sample"`_
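
The transpose and index-naming steps can be sketched on a tiny made-up matrix. The gene and sample labels below are invented for illustration.

```python
import pandas as pd

# Made-up 'gene X sample' matrix: rows are genes, columns are samples.
gene_by_sample = pd.DataFrame(
    {"s1": [1.0, 0.0, 5.0], "s2": [2.0, 3.0, 4.0]},
    index=["geneA", "geneB", "geneC"],
)

# Transpose so that rows are samples and columns are genes.
sample_by_gene = gene_by_sample.T
sample_by_gene.index.name = "sample"

print(sample_by_gene.shape)  # (2, 3)
```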

Once you've imported the data, create the `AnnData` object. (This should be done exactly as indicated in the Module 3 notebook.)

If you do not have `anndata`, install it with `mamba install anndata`.

```python
import pandas as pd
import anndata as ad
```


Subset this data to only include samples from CA3 tissue and save the subsetted data as a new anndata object.

In this new subsetted anndata object, create a new column in the `.var` dataframe called `gene_mean` that stores the mean expression of each gene.

_Remember the expression data is stored in `adata.X`_


How many genes have a mean expression > 3 in the CA3 tissue samples? _Remember, the `.var` attribute of an anndata object is just a pandas dataframe._

What are the top 10 protein-coding genes (biotype is stored in 'gene_biotype' of the `.var` dataframe) with the highest mean gene expression in the CA3 samples? Hint: `DataFrame.sort_values()` might be useful.
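
Counting rows above a threshold and ranking a filtered table can be sketched with plain pandas. The `kind` column and all values below are made up; substitute the real `.var` dataframe and its columns.

```python
import pandas as pd

# Made-up stand-in for a .var table.
toy_var = pd.DataFrame(
    {"kind": ["x", "y", "x", "x"], "gene_mean": [2.0, 9.0, 6.0, 4.0]},
    index=["geneA", "geneB", "geneC", "geneD"],
)

# A boolean Series sums to the number of True entries.
n_above = int((toy_var["gene_mean"] > 3).sum())

# Filter first, then sort descending and keep the top rows.
top_two = (
    toy_var[toy_var["kind"] == "x"]
    .sort_values("gene_mean", ascending=False)
    .head(2)
)
print(n_above)                  # 3
print(top_two.index.tolist())   # ['geneC', 'geneD']
```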