# Outline

Why this tutorial will be useful:
- This tutorial assumes no programming background.
- You will learn most of the computational operations that you will use during the course.
- <font color='green'>You will only need to change the code at positions marked with ```# <- adapt code here```</font>

Main purpose of tutorial:
- Run the code to verify that your computer is set up for the class.
- Provide a reference of commands that you will require during the class.

Additional purposes of tutorial:
- Recapitulate concepts around gene expression by relating them to numbers
- Realize that different assumptions (which can all be good and not intrinsically wrong) can shape the results of a computational analysis
- Provide templates that you can copy and modify for your own research (outside and independent of this class)

Stages:
- Stage 0: Getting the data
- Stage 1: How many genes are transcribed in the brain of young adults with an average of at least one transcript molecule per cell? **loading tables**, **filtering tables**, **functions**, **visualizing distributions**
- Stage 2: How many <font color="green">protein-coding genes</font> are transcribed in the brain of young adults with an average of at least one transcript molecule per cell? **extracting patterns from text**, **clean tables**, **combining tables**, **aggregate statistics**
- Stage 3: Are genes of the chaperome weakly or strongly transcribed genes, and are some categories of chaperones more transcribed than others? **importing Excel**, **boxplots**

Note:

<font color='red'>Cannot execute the code</font>, despite having downloaded it as described in 01_overview and following this tutorial through Anaconda rather than GitHub? We will be available for about 15 minutes after the lecture on Oct 3rd.

# Stage 0: Getting access to the data

<font color="red">You can access data through Box.</font> If you have no access, email thomas.stoeger@northwestern.edu.

Box is a file-sharing service of Northwestern, which you may have already used in the past. Depending on the setup of your computer, you will either see a web link from which you can download the data, or already see the data as a subfolder within the Box folder on your computer. It will likely be called "chaperome_course_2019".

# Stage 1: How many genes are transcribed in the brain of young adults with an average of at least one transcript molecule per cell?

## 1.1. Loading tables

To load pandas you will need to execute the following line of code. You can do this by clicking on the code and then pressing the Control key together with the Enter key, or by clicking on "Cell" in the top navigation bar and then clicking on "Run Cells". "Cells" are the grey boxes used to write and execute code.

In [None]:
import pandas as pd

Congratulations. If you executed the above code, the square brackets on the left should contain a number.



<font color="red">You will need to adjust the following path according to the name you gave to the folder on your computer, and its location. Tip: right click one of the files in the folder and inspect properties to see path.</font>

In [None]:
human_transcription = pd.read_csv(
    filepath_or_buffer='~/Desktop/homework/Human.RPKM.txt',     # <- adapt code here
    sep=' '
)

To peek at the first rows of <code>human_transcription</code>, append <code>.head()</code>.

In [None]:
human_transcription.head()

<font color="green">What did we just see?</font>
- <code>import pandas as pd</code> loads the pandas library and gives it the (arbitrary) name <code>pd</code>.
- The content of the file Human.RPKM.txt is now stored in <code>human_transcription</code>.
- More generally, <code>x_left_of_equal_sign = whatever_command_right_of_equal_sign</code> will first execute the code to the right of the equal sign and then hand the result over to the thing on the left-hand side.
- Here the thing on the left-hand side of the equal sign is a variable. Although you could give it an arbitrary name, it is useful to give it a readable name, such as <code>human_transcription</code>.
- Functions perform a specific task on the inputs placed within their parentheses.
- <code>pd.read_csv()</code> is a function. It reads the text file containing expression values and makes it available to Python (which essentially is the glue that sticks different commands together). <code>sep=' '</code> tells <code>pd.read_csv()</code> that within this specific text file the values of different columns are separated by spaces.
- <code>.head()</code> is another function and shows the values of the first few rows. Note that, in contrast to <code>pd.read_csv()</code>, it is directly applied (by the <code>.</code>) to <code>human_transcription</code>. Useful tip: if you want to see all functions available to <code>human_transcription</code>, remove the <code>head()</code> and press the Tab key while the cursor is blinking right after the <code>.</code>; pressing Shift and Tab instead shows a short description of the item at the cursor.
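
If you want to try <code>pd.read_csv()</code> without a file, here is a minimal sketch. The miniature "file" below is made up for illustration (hypothetical gene and sample names, invented values), and it assumes a layout in which the header row has one field fewer than the data rows, in which case pandas uses the first field of every row as the row name.

```python
import io
import pandas as pd

# Made-up miniature of a space-separated expression file (not real data).
# The header names two samples; every data row starts with a gene identifier.
toy_text = io.StringIO(
    "Brain.a Brain.b\n"
    "ENSG01 0.0 1.5\n"
    "ENSG02 3.2 4.1\n"
)

# sep=' ' tells pandas that the columns are separated by single spaces
toy_table = pd.read_csv(filepath_or_buffer=toy_text, sep=' ')

print(toy_table.head())   # show the first rows
print(toy_table.shape)    # (number of rows, number of columns)
```

If your own file is separated by tabulators or commas, only the value passed to <code>sep</code> needs to change.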

## 1.2. Filtering tables

To know how many genes are transcribed in the brains of young adults, we first need to select the samples corresponding to their brains.

The first step for filtering is to define a <code>list</code> of the names of the columns. One way to create a <code>list</code> is by using the opening <code>[</code> and closing <code>]</code> brackets. Note that the names of the columns are placed within <code>'</code> signs. The <code>'</code> signs tell Python to interpret the characters between them as text (and not as a variable).

In [None]:
columns_of_young_adult_brain = [
    'Brain.youngAdult.47',
    'Brain.youngAdult.48',
    'Brain.youngAdult.49'
]

The second step for filtering is to do the actual filtering. The safest way for doing so is to use <code>.loc</code>.
- In contrast to functions, <code>.loc</code> uses square brackets (unimportant technical explanation: it does not compute anything itself but provides an alternative view/window onto the table)
- the first position (before the <code>,</code>) refers to rows of the table
- the second position (after the <code>,</code>) refers to columns of the table
- <code>:</code> means that all should be considered

In [None]:
young_brain = human_transcription.loc[:, columns_of_young_adult_brain]

In [None]:
# Now we can inspect the first rows again; Btw, # marks a comment and will not be read by Python
young_brain.head()
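
To get a feeling for the two positions of <code>.loc</code>, the following sketch uses a small made-up table (the names <code>toy</code>, <code>gene1</code>, <code>a</code> and so on are hypothetical and not part of the course data).

```python
import pandas as pd

# Small made-up table: three rows (genes) and three columns (samples)
toy = pd.DataFrame(
    {'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]},
    index=['gene1', 'gene2', 'gene3'])

# all rows (:), but only the columns named in the list
all_rows_two_columns = toy.loc[:, ['a', 'c']]

# only the rows named in the list, but all columns (:)
one_row_all_columns = toy.loc[['gene2'], :]

# one row name and one column name return a single value
single_value = toy.loc['gene3', 'b']

print(all_rows_two_columns)
print(one_row_all_columns)
print(single_value)
```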

Now we need to make some decisions. There are three different samples. Should we combine them, and if so, how? These questions are difficult, and the answers might depend on your scientific question.

<font color="red">**HOW DO WE APPROXIMATE THE NUMBER OF GENES TRANSCRIBED TO AT LEAST ONE MOLECULE PER CELL?**</font> 
    
Should we consider any value that is above 0? How many genes are usually expressed, anyway?

To approach these questions, we find a good reference in https://www.embopress.org/doi/pdf/10.1038/msb.2011.28 (Hebenstreit et al.).

Briefly, there are two classes of genes: lowly expressed genes and highly expressed genes. When counting the number of genes at a given expression value, one should thus see two groups of genes: some have a low expression, others a high one. Curiously, and for reasons which are not fully understood yet (even though other researchers have since reproduced this observation in other cell types), the **lowly transcribed genes tend to have less than one transcript molecule per cell** (suggesting that no protein is actively produced in those cells at a given time point) and **highly transcribed genes tend to have more than one transcript molecule per cell**.

While there are different formats to measure transcript abundance, this tutorial uses measurements in the RPKM format (reads per kilobase of transcript per million mapped reads), which is the same format that Hebenstreit et al. used. Details won't matter, but, in essence, this format measures the abundance of a transcript by normalizing it to the length of the transcript and to the total amount of transcripts (of all genes).

Additionally, Hebenstreit et al. transformed the data. Let us transform the data in a way that seems similar to what the authors of the above paper did, and create a single log2-transformed value per gene for a given type of sample. To combine samples, this tutorial uses medians, but other options would also be possible. (Can you think of advantages and disadvantages compared to the mean?)
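
As a starting point for thinking about median versus mean, here is a minimal sketch with three made-up replicate values (not taken from the course data), one of which is much larger than the others.

```python
import numpy as np

# Three made-up measurements of the same gene; the last one is an outlier
replicates = np.array([1.0, 2.0, 30.0])

mean_value = np.mean(replicates)      # pulled upwards by the outlier
median_value = np.median(replicates)  # the middle value, unaffected by the outlier

print('mean:', mean_value)
print('median:', median_value)
print('log2 of median:', np.log2(median_value))
```

With only three samples, the median simply picks the middle one, so it ignores how far the other two are away. Whether that is an advantage depends on whether you consider extreme values to be noise or signal.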

In [None]:
# First, let us import numpy, a library for mathematical operations in python
import numpy as np

In [None]:
# Let us create our first own function:
def custom_normlization_function(input_values):
    intermediate = np.median(input_values)
    output_value = np.log2(intermediate)
    return output_value

In [None]:
# apply our function along the columns of the table 
# For the sake of this tutorial: ignore any warning which you will receive.
# The warning tells that we are trying to apply a logarithm to 0 (which mathematically
# is undefined in essentially all cases)
young_brain = young_brain.apply(
    custom_normlization_function, 
    axis='columns')

In [None]:
young_brain.head()

Note how the output of the above <code>.head()</code> looks different from the earlier ones? The reason is that our normalization led to a one-dimensional result: a single value per gene rather than a table with several columns.
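
The same idea can be tried on a small made-up table. This is only a sketch: the names <code>toy</code> and <code>median_then_log2</code> and all values are hypothetical, chosen so that the result is easy to check by hand.

```python
import numpy as np
import pandas as pd

# Made-up table: two genes (rows) measured in three samples (columns)
toy = pd.DataFrame(
    {'sample_1': [1.0, 8.0], 'sample_2': [2.0, 4.0], 'sample_3': [4.0, 16.0]},
    index=['gene1', 'gene2'])

def median_then_log2(values):
    # combine the samples of one gene, then log2-transform the combined value
    return np.log2(np.median(values))

# axis='columns' hands the function one row at a time (all columns of that row)
per_gene = toy.apply(median_then_log2, axis='columns')

print(per_gene)         # one value per gene
print(type(per_gene))   # a one-dimensional pandas Series, no longer a table
```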

## 1.3. Visualizing

In [None]:
# Load seaborn for visualization
import seaborn as sns

We could now try to visualize <code>young_brain</code>, but would run into a problem. Some of our data was 0 (no signal detected), and log2 of 0 is undefined (NumPy reports it as negative infinity). As such we first need to filter the data. Similar to <code>.loc</code>, we need to use square brackets again.

In [None]:
measured_young_brain = young_brain[young_brain>-np.inf]
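
The following sketch shows on three made-up values what this kind of filtering does. The names <code>rpkm</code>, <code>logged</code> and <code>finite_only</code> are hypothetical; <code>np.errstate</code> is only used to silence the warning about taking the logarithm of 0.

```python
import numpy as np
import pandas as pd

# Made-up expression values; gene1 was not detected at all
rpkm = pd.Series([0.0, 1.0, 4.0], index=['gene1', 'gene2', 'gene3'])

# log2(0) yields negative infinity; suppress the accompanying warning
with np.errstate(divide='ignore'):
    logged = np.log2(rpkm)

# the comparison creates True/False for every gene;
# the square brackets keep only the genes where it is True
finite_only = logged[logged > -np.inf]

print(logged)
print(finite_only)
```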

In [None]:
# Executing %matplotlib inline will likely not be necessary.
# However, it also won't hurt. It is a special command of notebooks
# that tells them to show the output of a visualization
# within the notebook. It will guarantee that you will see the
# figures that you create. (Depending on the setup of your
# computer the default way of showing visualizations might differ,
# though this is highly unlikely.)
%matplotlib inline      

Seaborn provides many different visualizations. We will use a distribution plot. During the course itself, you will find further visualizations at: https://seaborn.pydata.org/examples/index.html. P.S.: You can type <code>sns.</code> and press Tab to see all functions (and hence visualizations) provided by Seaborn, and press Shift + Tab with the cursor inside the parentheses of a function to see its options.

In [None]:
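# sns.distplot is deprecated since seaborn 0.11; sns.histplot with kde=True draws
# a comparable histogram with a smoothed density curve
# (on older seaborn versions, use sns.distplot(measured_young_brain) instead)
sns.histplot(measured_young_brain, kde=True, stat='density')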

<font color="red">**DOES THIS MAKE SENSE?**</font> 
    
Visit https://www.embopress.org/doi/pdf/10.1038/msb.2011.28, and compare.

Answer: It kind of looks similar, but the effect is not as pronounced as in the paper. Perhaps you already have a suspicion....

Now let us return to our question of **How many genes are transcribed in the brains of young adults with at least one transcript per cell on average.** 

If we stick to the assumption that the separation between the two peaks corresponds to the separation of lowly and highly expressed genes, and further assume that also here this boundary corresponds to one molecule per cell, we can define the following threshold (a value of 0 after our log2 transformation corresponds to 1 RPKM before it):

In [None]:
assumed_expression_level_corresponding_to_one_molecule = 0

In [None]:
is_above = young_brain > assumed_expression_level_corresponding_to_one_molecule

In [None]:
young_brain.head()

In [None]:
is_above.head()

In [None]:
print(
    'Approximately', 
    np.sum(is_above), 
    'out of', 
    len(is_above), 
    'genes are expressed with an average of at least one transcript molecule per cell.')
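
Counting works because Python treats <code>True</code> as 1 and <code>False</code> as 0. The sketch below uses four made-up log2 values (hypothetical names and numbers) and also illustrates one assumption hidden in the comparison: <code>&gt;</code> excludes genes that lie exactly on the threshold, whereas <code>&gt;=</code> would include them.

```python
import pandas as pd

# Made-up log2-transformed expression values of four genes
levels = pd.Series([-2.0, 0.0, 0.5, 3.0])
threshold = 0   # log2 of 1 is 0

is_above_toy = levels > threshold

# summing True/False values counts the number of True values
n_above = int(is_above_toy.sum())

# the mean of True/False values is the fraction of True values
fraction_above = is_above_toy.mean()

# a gene exactly at the threshold is only counted with >=
n_at_least = int((levels >= threshold).sum())

print(n_above, 'of', len(levels), 'genes are above the threshold')
print(n_at_least, 'of', len(levels), 'genes are at or above the threshold')
```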

# Stage 2 How many <font color="green">protein-coding genes</font> are transcribed in the brain of young adults with an average of at least one transcript molecule per cell?

## 2.1. Extracting patterns from text

First we need to obtain a list of protein-coding genes. Although there exist many databases and bioinformatic tools that could provide this information, we will here directly use the current definition of human genes. This definition is maintained and curated by the National Library of Medicine (for this tutorial we already downloaded this main reference of genes from https://ftp.ncbi.nlm.nih.gov/gene/DATA/GENE_INFO/). This is the main American reference that exists for human genes!

In [None]:
gene_info = pd.read_csv(
    '/Users/tstoeger/Desktop/homework/Homo_sapiens.gene_info.gz',      # <- adapt code here
    sep='\t'   # tabulator is used to separate columns in this file
)

In [None]:
gene_info.head()

If you scroll the above table, you will see a lot of information about genes and their fundamental properties. One of the columns is called <code>type_of_gene</code>. It contains the type of the genes.

In [None]:
# Interesting side-query: Which types of genes are defined by the National Library of Medicine?
# And: How common are these types?
gene_info.loc[:, 'type_of_gene'].value_counts()

<font color="red">We have a (manageable) problem</font>
- <code>gene_info</code> contains many different identifiers for genes
- <code>GeneID</code> looks very different from the identifiers in the gene expression tables, including <code>young_brain</code>. If you scroll back up, you will see that those start with ENSG followed by a large number, whereas <code>GeneID</code> does not start with ENSG and the numbers are lower
- The gene identifiers used in the expression data are Ensembl Gene IDs, whereas the National Library of Medicine uses Entrez Gene IDs
- Luckily for us, the National Library of Medicine provides a cross-reference to Ensembl Gene IDs within <code>dbXrefs</code>
- Regretfully, the Ensembl Gene ID is stored as part of a larger text. Hence we need to isolate it

In [None]:
gene_info.loc[:, 'gene_ensembl'] = gene_info.loc[:, 'dbXrefs'].str.extract(
    pat='(ENSG[0123456789]+)',
    expand=False)

In [None]:
# Scroll to the last column: What do you see?
gene_info.head()

<font color="green">What did we just see?</font>
- <code>.str.extract()</code> will extract text that fits a certain pattern.
- The pattern, <code>pat</code>, is defined as <code>'(ENSG[0123456789]+)'</code>
- The pattern is interpreted as a regular expression, a very powerful technique to define patterns, whose details are beyond this course.
- <code>'(ENSG[0123456789]+)'</code> translates into: extract the text matched between <code>()</code>, which starts with the text <code>ENSG</code> and is followed by one or more (<code>+</code>) of the characters listed within <code>[]</code>, which in this case are the digits <code>0123456789</code>
- Since there is no column <code>'gene_ensembl'</code> in <code>gene_info</code>, the command <code>gene_info.loc[:, 'gene_ensembl'] </code> will append a new column called <code>'gene_ensembl'</code> at the end
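
To see the pattern at work in isolation, here is a minimal sketch on three hand-written cross-reference texts. They are only meant to resemble the general look of <code>dbXrefs</code> and are an assumption for illustration, not copied from the file.

```python
import pandas as pd

# Hand-written examples: one text with an Ensembl identifier, two without
xrefs = pd.Series([
    'MIM:138670|HGNC:HGNC:5|Ensembl:ENSG00000121410',
    'HGNC:HGNC:37133',
    '-'])

# expand=False returns a one-dimensional result instead of a table
ensembl = xrefs.str.extract(
    pat='(ENSG[0123456789]+)',
    expand=False)

print(ensembl)
print('texts without a match:', ensembl.isnull().sum())
```

Texts that do not contain the pattern yield a missing value (NaN), which is why some genes end up without an Ensembl identifier.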

## 2.2. Clean tables

Now we
- have gene identifiers which match the expression data
- know where to find whether genes are protein-coding

We will now:
- Create a table with unambiguous mapping between Entrez Gene IDs and Ensembl gene IDs
- keep information on the gene type


In [None]:
gene_info = gene_info.loc[:, ['GeneID', 'gene_ensembl', 'type_of_gene']]

In [None]:
gene_info.head()

<font color="green"><b>Attention: Calling one column GeneID could be misleading, as the other column, gene_ensembl, also carries gene identifiers.</b></font>

We will hence make the names consistent. We will use the <code>.rename()</code> function, which uses the <code>{}</code> brackets to convey the former and the new name. For now it is best to think of those brackets as part of the language used for pandas (indeed they are part of the Python language, and define a dictionary).

In [None]:
# calling one column GeneID could be misleading as the other column also carries gene identifiers. Hence rename.
gene_info=gene_info.rename(columns={'GeneID': 'gene_entrez'})

In [None]:
gene_info.head()

In [None]:
print(
    'For', 
    gene_info['gene_ensembl'].isnull().sum(),
    'genes of the National Library of Medicine there is no cross-reference to an Ensembl gene.')

We will try to be conservative, and only keep genes which are listed by the National Library of Medicine, also have support in the European counterpart, Ensembl, and further map in a 1:1 relationship.

In [None]:
gene_info = gene_info.dropna(subset=['gene_ensembl'])

In [None]:
gene_info['gene_entrez'].value_counts().max()

In [None]:
gene_info['gene_ensembl'].value_counts().max()   # some Ensembl genes are listed for multiple Entrez genes

In [None]:
gene_info = gene_info.drop_duplicates(subset=['gene_ensembl'], keep=False)

In [None]:
print(
    'Only',
    gene_info.shape[0], 
    'genes map unambiguously between the American and European gene references.',
    'Did we not encounter many more genes before?'
)
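
The option <code>keep=False</code> is what makes the mapping unambiguous: it removes every row whose identifier occurs more than once, rather than keeping one of them. A minimal sketch on a made-up table (hypothetical identifiers):

```python
import pandas as pd

# Made-up mapping in which ENSG02 is listed for two different Entrez genes
toy = pd.DataFrame({
    'gene_entrez': [1, 2, 3, 4],
    'gene_ensembl': ['ENSG01', 'ENSG02', 'ENSG02', 'ENSG03']})

# default: keep the first occurrence of every duplicated identifier
keep_first = toy.drop_duplicates(subset=['gene_ensembl'])

# keep=False: drop all rows that share an identifier with another row
keep_none = toy.drop_duplicates(subset=['gene_ensembl'], keep=False)

print(keep_first)
print(keep_none)
```

Keeping the first occurrence would retain more genes, but which of the ambiguous genes survives would then depend on the order of rows in the file.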

In [None]:
gene_info.head()

In [None]:
# Interesting side-query - part 2: Now, after cleaning: For which types of genes
# is there a good agreement in identifiers between America and Europe?
gene_info.loc[:, 'type_of_gene'].value_counts()  # Compare this to the above

## 2.3. Combining tables



To answer our question of how many protein-coding genes are transcribed in the brains of young adults, we need to combine two sources of information:
- <code>gene_info</code>
- <code>young_brain</code>

In [None]:
young_brain.head()

As noticed before, <code>young_brain</code> is not a table (it is one-dimensional). However, we can force it to become a table.

In [None]:
young_brain = young_brain.to_frame('expression_level')

In [None]:
young_brain.head()

In [None]:
# give the "index" (bold) on the left, which contains Ensembl gene identifiers,
# a proper name for traceability (it currently has no name)
young_brain = young_brain.rename_axis(index='gene_ensembl')

In [None]:
young_brain.head()

There are two different ways to combine data in pandas, <code>pd.concat()</code> and <code>pd.merge()</code>. They operate slightly differently. In short, <code>pd.concat()</code> uses "indices" (names of rows), whereas <code>pd.merge()</code> uses values of columns. The former is a bit faster, whereas the latter allows a higher level of control, and is generally fast enough for any biological application. Hence we will do the latter, and will need to change <code>young_brain</code> in such a way that the Ensembl gene identifiers are no longer row names, but a separate column.
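
The difference between the two approaches can be sketched on two small made-up pieces of data (hypothetical identifiers and values). Note that the rows are deliberately listed in a different order in both pieces.

```python
import pandas as pd

# Two made-up one-dimensional pieces of data, both using gene identifiers as row names
expression = pd.Series(
    [1.0, 2.0], index=['ENSG01', 'ENSG02'], name='expression_level')
gene_type = pd.Series(
    ['protein-coding', 'ncRNA'], index=['ENSG02', 'ENSG01'], name='type_of_gene')

# pd.concat places them side by side and matches rows by their row names
by_index = pd.concat([expression, gene_type], axis=1)

# pd.merge needs the identifiers as a regular column in both tables
left = expression.rename_axis('gene_ensembl').reset_index()
right = gene_type.rename_axis('gene_ensembl').reset_index()
by_column = pd.merge(left, right, on='gene_ensembl')

print(by_index)
print(by_column)
```

Both results pair every gene with its own type, even though the order of the rows differed.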

In [None]:
young_brain = young_brain.reset_index()

In [None]:
young_brain.head()

In [None]:
gene_info.head()

In [None]:
young_brain = pd.merge(
    young_brain,
    gene_info,
    on='gene_ensembl'
)

In [None]:
young_brain.head()

## 2.4. Aggregating

To count the number of protein-coding genes, we will first add a new column and then use aggregate statistics of the table. An aggregate statistic is a statistic (e.g.: sum) for all records (here: genes) within one group (e.g.: protein-coding).

In [None]:
young_brain.loc[:, 'above_one_molecule'] = young_brain.loc[:, 'expression_level'] > assumed_expression_level_corresponding_to_one_molecule

In [None]:
young_brain.head()

In [None]:
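# sum only the True/False column; summing identifiers or text would not be meaningful
young_brain.groupby('type_of_gene')[['above_one_molecule']].sum()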

In [None]:
print(
    'There are approximately',
    young_brain.groupby('type_of_gene')['above_one_molecule'].sum().loc['protein-coding'],
    'protein-coding genes expressed above one molecule.'
)

In [None]:
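# Alternatively, the mean gives the fraction of genes above the threshold
# (restricted to one column, as recent pandas versions cannot average text columns)
young_brain.groupby('type_of_gene')[['above_one_molecule']].mean()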
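
Aggregate statistics are easiest to verify on a table that is small enough to count by hand. The following sketch uses made-up records (not the course data).

```python
import pandas as pd

# Made-up table: two protein-coding genes and three ncRNA genes
toy = pd.DataFrame({
    'type_of_gene': ['protein-coding', 'protein-coding', 'ncRNA', 'ncRNA', 'ncRNA'],
    'above_one_molecule': [True, False, True, False, False]})

# one sum per group: the number of genes above the threshold
counts = toy.groupby('type_of_gene')['above_one_molecule'].sum()

# one mean per group: the fraction of genes above the threshold
fractions = toy.groupby('type_of_gene')['above_one_molecule'].mean()

print(counts)
print(fractions)
```

Both groups contain one gene above the threshold, but the fractions differ because the groups differ in size.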

In [None]:
# Also, note that the plot, analogous to stage 1, now looks different
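# sns.distplot is deprecated since seaborn 0.11; sns.histplot with kde=True draws
# a comparable histogram with a smoothed density curve
sns.histplot(
    young_brain.loc[
        young_brain['expression_level']>-np.inf,
        'expression_level'
    ],
    kde=True,
    stat='density'
)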

# The reason is that the merging of the tables only kept data
# present in both. You may wonder what this discrepancy tells us about gene
# annotation, and whether there are biases in our knowledge
# about different transcripts depending on their expression levels.
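
One way to check which records a merge would drop is sketched below on two made-up tables (hypothetical identifiers). It relies on the <code>how</code> and <code>indicator</code> options of <code>pd.merge()</code>; the rest of this tutorial does not use them.

```python
import pandas as pd

# Made-up tables that only partially overlap in their gene identifiers
expr = pd.DataFrame({
    'gene_ensembl': ['ENSG01', 'ENSG02', 'ENSG03'],
    'expression_level': [-1.0, 2.0, 0.5]})
info = pd.DataFrame({
    'gene_ensembl': ['ENSG02', 'ENSG03', 'ENSG04'],
    'type_of_gene': ['protein-coding', 'ncRNA', 'protein-coding']})

# default behaviour: only identifiers present in both tables are kept
inner = pd.merge(expr, info, on='gene_ensembl')

# how='outer' keeps everything; indicator=True adds a column '_merge'
# that states whether a row came from the left table, the right table, or both
audit = pd.merge(expr, info, on='gene_ensembl', how='outer', indicator=True)

print(inner)
print(audit['_merge'].value_counts())
```

Comparing the expression levels of the rows marked <code>left_only</code> with those marked <code>both</code> is one possible way to explore the bias mentioned above.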

# Stage 3 Are genes of the chaperome weakly or strongly transcribed genes, and are some categories of chaperones more transcribed than others?

This is the last stage.
- It will combine and repeat concepts from above.
- Comments are minimal. Can you read the code and infer the meaning of specific commands?
- You will see code for accessing and working with the chaperome


The chaperome is a curated list of proteins involved in protein folding. The Excel file that you will be using corresponds to one of the supplemental tables of https://www.cell.com/cell-reports/fulltext/S2211-1247(14)00825-0 (Brehme et al. 2014: A Chaperome Subnetwork Safeguards Proteostasis in Aging and Neurodegenerative Disease).

## 3.1. Loading excel

In [None]:
brehme = pd.read_excel(
    '/Users/tstoeger/Desktop/homework/1-s2.0-S2211124714008250-mmc3.xls',   # <- adapt code here
    sheet_name='BREHME_TABLE S2A',
    skiprows=8
)

In [None]:
brehme.head()

## 3.2. Boxplots

In [None]:
# make clearer labels of columns
brehme = brehme.loc[:, ['Entrez-ID', 'Functional category']].rename(columns={
    'Entrez-ID': 'gene_entrez',    # for consistency use same spelling as in above tables
    'Functional category': 'function'   # shorter label for chaperones (note: 'function' is not a Python function)
})

In [None]:
gene_info.head()

In [None]:
brehme = pd.merge(
    brehme,
    gene_info.loc[:, ['gene_entrez', 'gene_ensembl']]
)

In [None]:
young_brain.head()

In [None]:
annotated_brain = pd.merge(young_brain, brehme, how='left')

In [None]:
annotated_brain.loc[:, 'function'] = annotated_brain.loc[:, 'function'].fillna('not chaperome')

In [None]:
annotated_brain = annotated_brain[annotated_brain['type_of_gene']=='protein-coding']

In [None]:
annotated_brain.head()

In [None]:
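# restrict to the expression column, as recent pandas versions cannot take medians of text columns
medians = annotated_brain.groupby('function')[['expression_level']].median().sort_values('expression_level')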

In [None]:
medians

In [None]:
sns.boxplot(
    x='function', 
    y='expression_level', 
    data=annotated_brain, 
    notch=True,
    order=medians.index,
    color='lightgrey'
)

Within a boxplot, the notches approximate the 95% confidence interval of the median. If the notches of two boxes do not overlap, their medians (the center line of every box) likely differ; if they overlap, the difference is generally not statistically significant.
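
For the curious, the position of the notches can be approximated by hand. The sketch below assumes the commonly used formula median ± 1.57 × IQR / √n (where IQR is the height of the box), which to our knowledge is also what the plotting library uses by default; treat the formula and the made-up values as assumptions rather than as a description of the exact implementation.

```python
import numpy as np

# Made-up values: the numbers 1 to 100
values = np.arange(1, 101, dtype=float)

median = np.median(values)
q1, q3 = np.percentile(values, [25, 75])   # lower and upper edge of the box
iqr = q3 - q1                              # interquartile range (height of the box)

# assumed notch formula: median +/- 1.57 * IQR / sqrt(number of values)
half_width = 1.57 * iqr / np.sqrt(len(values))
notch_low = median - half_width
notch_high = median + half_width

print('median:', median)
print('notch from', notch_low, 'to', notch_high)
```

Because of the division by √n, groups with few genes get wide notches, which is worth keeping in mind when comparing small chaperome categories.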

- **Are genes of the chaperome weakly or strongly transcribed genes?**
- **Are some categories of chaperones more transcribed than others?**

# Further study

NUIT has compiled an excellent list of resources, https://sites.northwestern.edu/summerworkshops/resources/other-options/ , which also contains a link to Safari Books online (a service through which Northwestern students can access many books about programming for free) and a link to detailed tutorials on Python.

If you are very excited about working with data you could join a student group https://sites.northwestern.edu/acids/ or visit the monthly http://data-science-nights.org/


# Sandbox

To write your own code, you can use an empty "Cell". "Cells" are the names of the grey boxes in this Jupyter notebook.

To make new cells you can click on the menu on the top, and select Insert > Insert Cell Below.

A shortcut to make "Cells" is to press the Escape key to enter command mode (for navigating between cells), and then press the B key.

To see all shortcuts, press the Escape key to enter command mode, then press the H key (H is Help).