# A2 Conditionally Expressed
In this assignment, you'll apply what you learned about lists, conditionals, and for loops to interact with a microarray dataset from the Allen Brain Institute. We've already processed the raw data so that it is normalized and organized into a file arranged by gene names and brain areas (brainarea_vs_genes_exp_w_reannotations.tsv). Before you can begin this assignment, you need to download this dataset from datahub and upload it to the same folder as this assignment. We'll review this in class.

This assignment is worth 50 points (5 points or 5% of your grade for the class).

**PLEASE DO NOT CHANGE THE NAME OF THIS FILE.**

**PLEASE DO NOT COPY & PASTE OR DELETE CELLS INCLUDED IN THE ASSIGNMENT.**


## How to complete assignments

Whenever you see:

```
# YOUR CODE HERE
raise NotImplementedError()
```

You need to **replace (meaning, delete) these lines of code with code that answers the questions** and meets the specified criteria. Make sure you remove the 'raise' line when you do this (or your notebook will raise an error, regardless of any other code, and thus fail the grading tests).

You should write the answer to the questions in those cells (the ones with `# YOUR CODE HERE`), but you can also add extra cells to explore / investigate things if you need / want to. 

Any cell with `assert` statements in it is a test cell. You should not try to change or delete these cells. Note that there might be more than one assert that tests a particular question. 

If a test does fail, reading the error that is printed out should let you know which test failed, which may be useful for fixing it.

Note that some cells, including the test cells, may be read only, which means they won't let you edit them. If you cannot edit a cell - that is normal, and you shouldn't need to edit that cell.


## Tips & Tricks

The following are a couple of tips and tricks that may help if you get stuck on anything.

#### Printing Variables
You can (and should) print and check variables as you go. This allows you to check what values they hold, and fix things if anything unexpected happens.
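
As a minimal sketch of this habit (the variable `expression_value` is made up for illustration and is not part of the assignment), printing both a value and its type can reveal surprises, such as a number that is actually stored as a string:

```python
# Hypothetical variable, used only to illustrate the idea
expression_value = '0.25'

# print() shows the value; type() shows what kind of object it is
print(expression_value)
print(type(expression_value))
```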

#### Restarting the Kernel
- If you run cells out of order, you can end up overwriting things in your namespace. 
- If things seem to go weird, a good first step is to restart the kernel, which you can do from the kernel menu above.
- Even if everything seems to be working, it's a nice check to 'Restart & Run All', to make sure everything runs properly in order.

### Loading in the data
First, we'll take a few steps to load up the dataset. After you have uploaded the 'brainarea_vs_genes_exp_w_reannotations.tsv' file to your directory, simply run the code below -- you don't need to change anything.

In [2]:
# Import necessary packages
from csv import reader

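# Open the tab-delimited file and read every row into a list of lists.
# The `with` block closes the file automatically once the rows are read.
with open('brainarea_vs_genes_exp_w_reannotations.tsv', newline='') as opened_file:
    read_file = reader(opened_file, delimiter = '\t')
    gene_data = list(read_file)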
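
To see what `reader` produces without needing the real file, here is a small sketch that reads from an in-memory stand-in. The contents of `toy_file` are invented for illustration and do not come from the dataset.

```python
from csv import reader
from io import StringIO

# Made-up stand-in for the .tsv file (two tab-separated rows)
toy_file = StringIO('gene_symbol\tregion A\tregion B\nGENE1\t0.5\t-1.2\n')

# Each row of the file becomes one inner list
toy_data = list(reader(toy_file, delimiter='\t'))
print(toy_data)
```

Note that every entry is read in as a string, including the ones that look like numbers.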

## Q1

Above, the variable `gene_data` is a list of lists. The first list is a list of headers for the array, containing a first item 'gene_symbol', followed by a list of brain regions.

In the cell below, assign the first list of `gene_data` to a variable called `brain_regions`. The first entry of the list, 'gene_symbol' isn't a brain region, but that's okay for this exercise. Leave it in the list.
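
If indexing into a list of lists is unfamiliar, the following toy sketch may help. The names `toy_table`, `toy_header`, and `toy_first_gene`, and all of the values, are made up for illustration.

```python
# Toy list of lists with invented values
toy_table = [
    ['gene_symbol', 'region A', 'region B'],
    ['GENE1', '0.5', '-1.2'],
    ['GENE2', '1.7', '0.3'],
]

toy_header = toy_table[0]      # the first inner list (the headers)
toy_first_gene = toy_table[1]  # the first row of gene data

print(toy_header)
print(toy_first_gene[0])       # the label at the start of that row
```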

In [3]:
# YOUR CODE HERE
brain_regions = gene_data[0]
print(brain_regions)

['gene_symbol', 'CA1 field', 'CA2 field', 'CA3 field', 'CA4 field', 'Crus I, lateral hemisphere', 'Crus I, paravermis', 'Crus II, lateral hemisphere', 'Crus II, paravermis', 'Edinger-Westphal nucleus', "Heschl's gyrus", 'I-II', 'III', 'III, lateral hemisphere', 'III, paravermis', 'IV', 'IV, lateral hemisphere', 'IV, paravermis', 'IX', 'IX, lateral hemisphere', 'IX, paravermis', 'V', 'V, lateral hemisphere', 'V, paravermis', 'VI', 'VI, lateral hemisphere', 'VI, paravermis', 'VIIAf', 'VIIAt', 'VIIB', 'VIIB, lateral hemisphere', 'VIIB, paravermis', 'VIIIA', 'VIIIA, lateral hemisphere', 'VIIIA, paravermis', 'VIIIB', 'VIIIB, lateral hemisphere', 'VIIIB, paravermis', 'X', 'X, lateral hemisphere', 'X, paravermis', 'abducens nucleus', 'amygdalohippocampal transition zone', 'angular gyrus, inferior bank of gyrus', 'angular gyrus, superior bank of gyrus', 'anterior group of nuclei', 'anterior hypothalamic area', 'anterior orbital gyrus', 'arcuate nucleus of medulla', 'arcuate nucleus of the hypo

In [4]:
# Tests for Q1, worth 2.5 points total.
assert isinstance(brain_regions,list)


## Q2

For our study, we're interested in seeing if the superior colliculus and visual cortex have different gene expression. First, we need to know if they're in our list of brain regions.

Write two statements to check if 'superior colliculus' and 'visual cortex' are in your list of brain regions (`brain_regions`). Save the boolean outputs of these membership checks as `SC_bool` and `VC_bool`, respectively. Print the values of `SC_bool` and `VC_bool` so that you can see them.
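
As a reminder of how membership checks behave, here is a small sketch on a made-up list (`toy_regions`, `has_a`, and `has_c` are hypothetical names). On a list, `in` compares against whole items and evaluates to a boolean.

```python
toy_regions = ['gene_symbol', 'region A', 'region B']

# `in` evaluates to True or False, so the result can be stored in a variable
has_a = 'region A' in toy_regions
has_c = 'region C' in toy_regions

print(has_a)
print(has_c)

# On a list, only whole items match: 'region' alone is not an item
print('region' in toy_regions)
```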

In [5]:
# YOUR CODE HERE
# `in` already evaluates to True or False, so assign the result directly
SC_bool = 'superior colliculus' in brain_regions
VC_bool = 'visual cortex' in brain_regions

print(SC_bool)
print(VC_bool)

True
False


In [6]:
# Tests for Q2
assert isinstance(SC_bool,bool)
assert isinstance(VC_bool,bool)


## Q3
Hmm, looks like the data has superior colliculus but not visual cortex. In humans, visual cortex is often called "striate cortex" because of the appearance of a dense layer of myelinated fibers that runs through it, called the line of Gennari (details <a href="https://webvision.med.utah.edu/book/part-ix-brain-visual-areas/the-primary-visual-cortex/">here</a>, if you're curious). It's also a part of the occipital lobe, and the gyri and sulci there are named accordingly.

To get a sense of what possible visual regions are in our list, we can look for _striate_ and _occipital_ in the strings for each brain region. 

1. Write a `for` loop that loops through the list of brain regions and looks for *either* "striate" or "occipital" within the string for each of the brain regions in your list. Save all of the possible matches to a list called `possible_regions`.
2. Create a counter (called `counter`) that shows you how many brain regions you have at the end. Use the value of this counter to build a string called `regions_message` that says "There are X possible visual regions", where "X" is the value of your counter.
3. At the end, print your list of possible regions so that you can see what it includes.
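
The pattern in these steps can be sketched on a toy list first. All names and values below (`toy_regions`, `matches`, `n_matches`, `message`) are made up for illustration. On a string, unlike on a list, `in` checks for a substring.

```python
toy_regions = ['frontal pole', 'temporal pole', 'cuneus', 'frontal operculum']

matches = []
n_matches = 0
for name in toy_regions:
    # Keep the name if either search term appears anywhere inside it
    if 'frontal' in name or 'cuneus' in name:
        matches.append(name)
        n_matches += 1

# str() converts the integer so it can be joined to the other strings
message = 'There are ' + str(n_matches) + ' matching regions'
print(message)
print(matches)
```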

In [7]:
# YOUR CODE HERE
possible_regions = []
counter = 0
for region in brain_regions:
    if 'striate' in region or 'occipital' in region:
        possible_regions.append(region)
        counter+=1
    
regions_message = 'There are ' + str(counter) + ' possible visual regions'
print(regions_message)
print(possible_regions)

There are 11 possible visual regions
['cuneus, peristriate', 'cuneus, striate', 'inferior occipital gyrus, inferior bank of gyrus', 'inferior occipital gyrus, superior bank of gyrus', 'lingual gyrus, peristriate', 'lingual gyrus, striate', 'occipital pole, inferior aspect', 'occipital pole, lateral aspect', 'occipital pole, superior aspect', 'superior occipital gyrus, inferior bank of gyrus', 'superior occipital gyrus, superior bank of gyrus']


In [8]:
# Tests for Q3, worth 5 points.
assert isinstance(possible_regions,list)
assert isinstance(counter,int)
assert isinstance(regions_message,str)

In [9]:
# Hidden Tests for Q3, worth 2.5 points.

In [10]:
# Hidden Tests for Q3, worth 2.5 points.

In [11]:
# Hidden Tests for Q3, worth 2.5 points.

## Q4

![](https://resource.loni.usc.edu/wp-content/uploads/2012/06/LINGUAL01.jpg)

Let's go with '_lingual gyrus, striate_' -- that's a nice chunk of brain that encompasses visual cortex in humans (see the pink area above, details <a href="https://resource.loni.usc.edu/resources/downloads/research-protocols/masking-regions/lingual-gyrus/">here</a>).

Now that we know that 'lingual gyrus, striate' and 'superior colliculus' are both in our list, we need to know their indices so that we can look up their corresponding values in the lists for each gene. For that, we can use the `index` method on our list (see the help for `index`, or <a href="https://www.programiz.com/python-programming/methods/list/index">this tutorial</a>).

Find the index of the 'lingual gyrus, striate' and 'superior colliculus' and save them as `LG_index` and `SC_index`, respectively.
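
Here is a short sketch of why the index is useful, using invented data (`toy_regions`, `toy_row`, and `b_index` are hypothetical names). Because the header and each gene row line up position by position, an index found in the header also locates the matching value in a row.

```python
toy_regions = ['gene_symbol', 'region A', 'region B']
toy_row = ['GENE1', '0.5', '-1.2']

# index() returns the position of the first matching item
b_index = toy_regions.index('region B')
print(b_index)

# The same position in the gene row holds that region's value
print(toy_row[b_index])
```

If the item is not in the list, `index` raises a `ValueError`, which is one reason to check membership first.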

In [12]:
# YOUR CODE HERE
LG_index = brain_regions.index('lingual gyrus, striate')
SC_index = brain_regions.index('superior colliculus')
print(LG_index)
print(SC_index)

125
206


In [13]:
# Tests for Q4, worth 2.5 points.
assert isinstance(LG_index,int)
assert isinstance(SC_index,int)

In [14]:
# Hidden Tests for Q4, worth 2.5 points.

## Q5

Searching for our gene in this dataset is a little tricky, since each row is a different list, but we can do it with a for loop. Let's say we're interested in **DISC1**, <a href="https://www.nature.com/articles/tp2016282">a gene that is associated with schizophrenia</a>.

Write a `for` loop that loops through each row (list) of our data, and checks if the first entry in that list is DISC1. When it finds DISC1, assign the entire list of values (including the DISC1 label) to `DISC1_data`.

In [15]:
#alternative code from section
#for i in range(len(gene_data)):
    #if gene_data[i][0] == 'DISC1':
        #DISC1_data = gene_data[i]

In [16]:
#with pandas:
#import pandas as pd
#gene_df = pd.DataFrame(gene_data[1:], columns=gene_data[0])
#print(gene_df[gene_df.gene_symbol == 'DISC1'])
#(gene_df)

In [17]:
# YOUR CODE HERE
for row in gene_data:
    if row[0] == 'DISC1':
        DISC1_data = row
        

print(DISC1_data)

['DISC1', '0.10234691591552625', '-0.03514271503436392', '-0.1401600298036149', '0.3775630586767744', '-1.2882414643804967', '-1.3094793280338128', '-1.321081067381541', '-1.2227949526118158', '0.4297003329547636', '-0.7233433273426383', '-0.9999244303785609', '-1.2947545996841554', '-1.4535047620087718', '-1.250810252656762', '-1.4284068797498208', '-1.0874345022171286', '-1.4658415260005528', '-1.3412531621006796', '-1.1369964333437053', '-1.3363596064048082', '-1.3379675195527359', '-1.2845296167176528', '-1.3796592898878244', '-1.309981840450498', '-1.3213225935388897', '-1.3450631641372934', '-1.1854714436434552', '-1.2727282312004522', '-1.2574989948682755', '-1.1682984533338725', '-1.398199788895754', '-1.4349175925478355', '-1.3990678197360165', '-1.4609798761032304', '-1.2541422533740274', '-1.4941270352249414', '-1.2128098289423688', '-1.327699344682325', '-1.4042060649366492', '-1.0550488573934775', '0.48354135981845064', '0.7282340091841847', '-0.6605225891587009', '-0.6386

In [18]:
# Tests for Q5, worth 5 points.
assert isinstance(DISC1_data,list)

In [19]:
# Hidden Tests for Q5, worth 5 points.

## Q6
Using the indices we saved above, now we can look to see whether expression of DISC1 is higher in the superior colliculus or in the occipital lobe.

1. Save the gene expression values for the superior colliculus and the occipital lobe (lingual gyrus) as `SC_DISC1` and `LG_DISC1`, respectively, by using the indices you saved in Q4.
2. Check the type of each value. If it is not a float, convert it into a float (still assigned to `SC_DISC1` and `LG_DISC1`).
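
A minimal sketch of the check-then-convert step, using a made-up value (`raw_value` is a hypothetical name):

```python
raw_value = '-1.2'
print(type(raw_value))

# Convert only if the value is not already a float
if not isinstance(raw_value, float):
    raw_value = float(raw_value)

print(type(raw_value))
print(raw_value)
```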

In [20]:
# YOUR CODE HERE
SC_DISC1 = float(DISC1_data[SC_index])
LG_DISC1 = float(DISC1_data[LG_index])
print(SC_DISC1)
print(LG_DISC1)
# These values are z-scores: each says how many standard deviations the expression is above (positive) or below (negative) the mean

1.0669682977646908
-1.2263984422075282


In [21]:
# Tests for Q6, worth 5 points (note: includes hidden tests).

assert isinstance(SC_DISC1,float)
assert isinstance(LG_DISC1,float)


## Q7

Given the data points that we have here in `SC_DISC1` and `LG_DISC1`, what could we reasonably claim?

**Note:** Remember that you can indicate your response on a multiple choice by assigning a string with your one letter response to `answer`.

* `A` : superior colliculus has greater expression of DISC1 than other genes
* `B` : superior colliculus has less expression of DISC1 than other genes
* `C` : superior colliculus has greater expression of DISC1 than the lingual gyrus
* `D` : superior colliculus has less expression of DISC1 than the lingual gyrus

In [22]:
# YOUR CODE HERE
answer = 'C'

In [23]:
# Tests for Q7, worth 2.5 points (note: includes hidden tests).

assert answer in ['A','B','C','D']


## Q8

We could also decide to guide our interest in brain regions based on higher expression of DISC1. For all of the values of DISC1, look for expression values that are greater than **1.5**, and save these as a list called `high_DISC1`. In the end, `high_DISC1` should contain a list of brain areas with expression values higher than 1.5.

**Note**: Remember that the first value in each list is the name of the gene; you might need to skip it.
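
One possible way to keep values and region names aligned while skipping the label is `enumerate` with `start=1`. This is a sketch on invented data, not the only valid approach; `toy_header`, `toy_row`, `threshold`, and `high_regions` are hypothetical names.

```python
toy_header = ['gene_symbol', 'region A', 'region B', 'region C']
toy_row = ['GENE1', '0.5', '1.7', '2.4']
threshold = 1.5

high_regions = []
# start=1 makes i match each value's position in the full row
for i, value in enumerate(toy_row[1:], start=1):
    # Convert to float so the comparison is numeric
    if float(value) > threshold:
        high_regions.append(toy_header[i])

print(high_regions)

# Strings compare character by character, which can disagree with the numbers
print('9.0' > '10.0')
print(9.0 > 10.0)
```

The last two lines show why the conversion matters: as strings, `'9.0'` sorts after `'10.0'`, even though 9.0 is the smaller number.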

In [24]:
#high_DISC1 = []

#for i in range(1, len(DISC1_data)):
    #if float(DISC1_data[i]) > 1.5:
        #high_DISC1.append(brain_regions[i])

In [25]:
# YOUR CODE HERE
high_DISC1 = []

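# Start at index 1 to skip the gene label; i stays aligned with brain_regions
for i, value in enumerate(DISC1_data[1:], start=1):
    # Convert to float so the comparison is numeric, not character by character
    if float(value) > 1.5:
        high_DISC1.append(brain_regions[i])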
        
print(high_DISC1)

['cingulum bundle', 'corpus callosum', 'emboliform nucleus', 'fastigial nucleus', 'globose nucleus', 'globus pallidus, external segment', 'globus pallidus, internal segment', 'lateral habenular nucleus', 'lateral parabrachial nucleus', 'medial habenular nucleus', 'red nucleus', 'reticular nucleus of thalamus', 'substantia nigra, pars reticulata', 'zona incerta']


In [27]:
DISC1_data.index(value)

232

In [None]:
# Tests for Q8, worth 10 points (note: includes hidden tests).

assert isinstance(high_DISC1,list)
