# Outline

Why this tutorial will be useful:
- This tutorial provides a basic introduction to working with tables
- <font color='green'>You only need to change the code at one position, marked with ```# <- adapt code here```</font>; however, you are encouraged to experiment further

Main purpose of tutorial:
- Ensure that software works

Within this tutorial you will:
- Load gene expression data
- Filter the gene expression data for human brains
- Create a time-resolved clustermap - a table where similar genes will be grouped together, and the gene expression represented by colors

# Loading file
The Python programming language essentially acts as a powerful glue that can stitch different tools together. One tool for working with tables is <code>pandas</code>. To use <code>pandas</code>, we first need to <code>import</code> it so that it becomes available to Python. By convention, we give it the name <code>pd</code>.

To execute the code contained in the "cell" (grey box) below, press SHIFT and RETURN.

In [None]:
import pandas as pd

<code>pd</code> comes with several functions, each of which performs a specific task. To see all of them, click on the <code>pd</code> below and press SHIFT and TAB.

A useful function for importing tables is <code>.read_csv</code>. Functions take function-specific arguments. To see the required arguments, click on the function name (e.g., <code>.read_csv</code>) and press SHIFT and TAB, then TAB again.

On the next line you will need to adjust the code so that it points to the location on your computer that contains "chaperome_expression_rpkm_1_1_orthologs.csv", a file that you can download from <b>canvas (Class-3-Thursday, Oct3)</b>.

In [None]:
my_table = pd.read_csv(
    filepath_or_buffer='/Users/tstoeger/Box/chaperome_student_course/2020/material/chaperome_developmental_expression_rpkm_1_1_orthologs.csv' # <- adapt code here
)

What just happened?
- You called pandas, through its name <code>pd</code>, to read the table with expression data.
- The equal sign defines the order in which the code is executed. First the right side is executed, and the result of this execution is then stored in a "variable" on the left-hand side. You could give this variable any name, as long as you always use the same name when referring to it within your code. In this case, we give the variable the name <code>my_table</code>
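
The same two steps can be tried without any file on disk. The sketch below is only an illustration: the CSV text and the name <code>demo_table</code> are made up for this example, and <code>io.StringIO</code> simply lets <code>read_csv</code> treat a piece of text as if it were a file.

```python
import io
import pandas as pd

# A tiny, made-up CSV standing in for the real file (hypothetical values)
csv_text = """gene,tissue,median_RPKM
CCT4,Brain,12.5
HSPA8,Liver,80.0
"""

# The right side runs first and builds a table;
# the result is then stored under the name demo_table
demo_table = pd.read_csv(filepath_or_buffer=io.StringIO(csv_text))

print(demo_table.shape)          # (number of rows, number of columns)
print(list(demo_table.columns))  # the column names taken from the first line
```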

In [None]:
# Let us look at content, using the .head function
# Btw, # marks "comments", which will not be executed
my_table.head()

Well done! But something appears funny. There is a column called "Unnamed: 0" that appears to duplicate the "index" (the leftmost values, which are likely displayed in bold on your computer). Let us <code>drop</code> "Unnamed: 0".

In [None]:
my_table = my_table.drop(
    labels='Unnamed: 0', 
    axis='columns')

Have a look at the data by displaying the first few rows with <code>.head()</code>

In [None]:
my_table.head()

What just happened?
- <code>drop</code> removes rows or columns
- <code>labels</code> tells <code>drop</code> to look for <code>'Unnamed: 0'</code>
- <code>axis</code> tells <code>drop</code> to look for <code>'Unnamed: 0'</code> within columns (and not the index)
- The equal sign again operates as above: first the right side is executed, and the result is then stored in the variable on the left-hand side. Since the variable has the same name, <code>my_table</code>, we overwrite its former content (only in the working memory, RAM, of your computer; the original file stays untouched)
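
To see that <code>drop</code> hands back a new table rather than changing the existing one, here is a small sketch on a made-up table. The names <code>demo</code> and <code>without_column</code> are hypothetical and exist only for this illustration.

```python
import pandas as pd

# Made-up table with a superfluous column
demo = pd.DataFrame({
    'Unnamed: 0': [0, 1],
    'gene': ['CCT4', 'HSPA8'],
})

# drop returns a new table; demo itself is left as it was
without_column = demo.drop(labels='Unnamed: 0', axis='columns')

print(list(demo.columns))            # still contains 'Unnamed: 0'
print(list(without_column.columns))  # only 'gene' remains
```

This is why the result has to be stored in a variable: without the assignment, the shortened table would be displayed once and then lost.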

# Filtering for human brain

In [None]:
# Let us first inspect which organisms are present
my_table['organism'].unique()

In [None]:
# Which tissues are present?
my_table['tissue'].unique()

In [None]:
# Let us define variables that state which organism and tissue
# we would currently like to investigate (this will later make it
# easier to look at other tissues or organisms)
tissue_of_interest = 'Brain'
organism_of_interest = 'Human'

Our aim is to consider only one tissue and one organism in the further analysis, so we need to **filter** our data. You can think of **filters** as logical masks. <code>==</code> makes a logical comparison that requires the values on both sides to be the same. For instance, <code>'cat' == 'cat'</code> yields <code>True</code>, whereas <code>'cat' == 'dog'</code> yields <code>False</code>.
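
When one side of <code>==</code> is a whole column, the comparison is made once per row and the result is a column of <code>True</code>/<code>False</code> values. A minimal sketch with made-up values (the names <code>tissues</code> and <code>is_brain</code> are hypothetical):

```python
import pandas as pd

# A made-up column of tissue names
tissues = pd.Series(['Brain', 'Liver', 'Brain'])

# The comparison is applied to every entry of the column
is_brain = tissues == 'Brain'

print(is_brain.tolist())  # [True, False, True]
print(is_brain.sum())     # True counts as 1, so this is the number of matches: 2
```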

In [None]:
is_tissue_of_interest = my_table['tissue'] == tissue_of_interest

In [None]:
is_tissue_of_interest.head()

In [None]:
is_organism_of_interest = my_table['organism'] == organism_of_interest

In [None]:
is_organism_of_interest.head()

In [None]:
# Take a short break. When working with data, you should be paranoid. 
# Do is_tissue_of_interest and is_organism_of_interest contain
# the values you would anticipate?

In [None]:
my_table.head()

In [None]:
# Answer: yes they seem to contain the correct values. The first rows are "Brain" but not "Human"
# Let us continue, and filter

Filters can be combined. To require both values to be True, you can use the AND operator (<code>&</code>). (Question: Why would <code>==</code> not work?)



In [None]:
use_these_records = is_tissue_of_interest & is_organism_of_interest

In [None]:
use_these_records.head()

To obtain parts of a table, you can use <code>.loc</code>, which provides a view/window onto a part of the underlying data (which helps to remember why it is window-like square brackets that you will need). Whatever stands left of the <code>,</code> within <code>.loc</code> refers to rows. Whatever stands right of the <code>,</code> refers to columns. <code>:</code> means ALL.
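
As a small illustration of rows left of the comma and columns right of it, the sketch below uses a made-up three-row table. The names <code>demo</code>, <code>keep</code>, <code>rows_only</code> and <code>rows_and_columns</code> are hypothetical.

```python
import pandas as pd

# Made-up table with three records
demo = pd.DataFrame({
    'organism': ['Human', 'Mouse', 'Human'],
    'tissue': ['Brain', 'Brain', 'Liver'],
    'median_RPKM': [12.5, 9.0, 80.0],
})

# One mask per condition, combined with & (both must be True)
keep = (demo['organism'] == 'Human') & (demo['tissue'] == 'Brain')

# Left of the comma: which rows. Right of the comma: which columns.
rows_only = demo.loc[keep, :]                                # all columns
rows_and_columns = demo.loc[keep, ['tissue', 'median_RPKM']]  # two columns

print(rows_only.shape)         # (1, 3)
print(rows_and_columns.shape)  # (1, 2)
```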

In [None]:
my_table.shape

In [None]:
filtered_table = my_table.loc[use_these_records, :]

In [None]:
filtered_table.shape

In [None]:
filtered_table.head()

# Arrange for plotting 

Our goal is to make a table where genes are in rows, and different timepoints are in columns. But this is a problem since the table is presently formatted very differently. 

- Data that we need are in three columns: 'gene', 'median_RPKM' (a measure of transcript abundance), and 'age'
- We somehow have to filter for those columns, and place 'gene' in rows, and 'age' in columns, and 'median_RPKM' as the value within the table
- The operation for doing the above is called <code>pivot</code>
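
What <code>pivot</code> does is easiest to see on a very small table. The values below are made up, and <code>long_table</code> and <code>wide_table</code> are hypothetical names used only in this sketch.

```python
import pandas as pd

# Made-up "long" table: one row per gene and age
long_table = pd.DataFrame({
    'gene': ['A', 'A', 'B', 'B'],
    'age': ['4wpc', 'newborn', '4wpc', 'newborn'],
    'median_RPKM': [1.0, 2.0, 3.0, 4.0],
})

# "Wide" table: one row per gene, one column per age
wide_table = long_table.pivot(
    index='gene',
    columns='age',
    values='median_RPKM'
)

print(wide_table)
```

Note that <code>pivot</code> expects every combination of gene and age to occur at most once; if a combination is duplicated, pandas raises an error instead of guessing which value to keep.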

In [None]:
filtered_table

In [None]:
pivotted_table = filtered_table.pivot(
    index='gene',
    columns='age',
    values='median_RPKM'
)

In [None]:
pivotted_table.head()

This already looks good. But if you scroll the table to the right, you will notice something funny. The columns are not ordered in a meaningful way. It would look nicer if the columns were ordered by timepoint, going from the earliest time to the latest. Let us first see which ages we have.

In [None]:
pivotted_table.columns

Luckily the above is not too long. We could manually reorder them in a meaningful way.

In [None]:
preferred_order = [
    '4wpc', 
    '5wpc', 
    '7wpc', 
    '8wpc', 
    '9wpc',
    '10wpc', 
    '11wpc', 
    '12wpc', 
    '13wpc', 
    '16wpc', 
    '18wpc', 
    '19wpc', 
    '20wpc',
    'newborn',
    'toddler',
    'infant', 
    'school',
    'teenager',
    'youngAdult',
    'youngMidAge',
    'olderMidAge', 
    'senior']

In [None]:
pivotted_table = pivotted_table.reindex(columns=preferred_order)

In [None]:
pivotted_table.head()
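
One detail of <code>reindex</code> is worth knowing when typing such a list by hand: a label that does not exist in the table does not cause an error, but yields a column filled with missing values (NaN). The sketch below shows this on a made-up one-row table; <code>demo</code>, <code>reordered</code> and <code>with_typo</code> are hypothetical names, and <code>'newbron'</code> is a deliberate typo.

```python
import pandas as pd

# Made-up table whose columns are in alphabetical (not chronological) order
demo = pd.DataFrame(
    [[1.0, 2.0, 3.0]],
    index=['geneA'],
    columns=['10wpc', '4wpc', 'newborn'],
)

# Correct labels: columns are simply rearranged
reordered = demo.reindex(columns=['4wpc', '10wpc', 'newborn'])

# Misspelled label: no error, but the new column is empty (NaN)
with_typo = demo.reindex(columns=['4wpc', '10wpc', 'newbron'])

print(list(reordered.columns))
print(with_typo['newbron'].isna().all())  # True
```

Checking the table with <code>.head()</code> after reordering, as done above, is therefore a good habit.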

# Visualize

In [None]:
# note that the following line is specific to notebooks
# and will not be required on most computers. It serves
# as a safety mechanism to ensure that the figures will
# be shown in this notebook rather than elsewhere on
# your computer
%matplotlib inline

In [None]:
# Seaborn is a powerful tool for visualization

import seaborn as sns

Let us visualize the data as a <code>clustermap</code>, which pairs similar samples. This is only one of many options of seaborn. For more ideas visit: https://seaborn.pydata.org/examples/index.html

In [None]:
sns.clustermap(
    data=pivotted_table)

# Visualize nicer

The above visualization has a few problems:
- Genes are grouped by their absolute expression level (RPKM) rather than by their change over time
- Columns lose information on time

To make genes with different expression levels comparable, we need to normalize. One way of normalizing is z-scoring (https://en.wikipedia.org/wiki/Standard_score), which rescales each sample by subtracting the mean and dividing by the standard deviation. Let us build a custom function that can z-score our data. This function will call <code>numpy</code>, a toolbox for mathematical operations.

In [None]:
import numpy as np

In [None]:
def zscore(input_values):
    # subtract the mean, then divide by the standard deviation
    z_scored = (input_values - np.mean(input_values)) / np.std(input_values)
    return z_scored

In [None]:
# axis='columns' hands each row (one gene across all ages) to zscore
z_scored_table = pivotted_table.apply(
    func=zscore,
    axis='columns'
)
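
To see what row-wise z-scoring achieves, consider two made-up genes that follow the same pattern over time but differ 100-fold in expression level. The sketch below repeats the <code>zscore</code> function so that it runs on its own; <code>demo</code> and <code>z_demo</code> are hypothetical names.

```python
import numpy as np
import pandas as pd

def zscore(input_values):
    # subtract the mean, then divide by the standard deviation
    return (input_values - np.mean(input_values)) / np.std(input_values)

# Two made-up genes: same shape over time, very different levels
demo = pd.DataFrame(
    [[1.0, 2.0, 3.0], [100.0, 200.0, 300.0]],
    index=['low_gene', 'high_gene'],
    columns=['early', 'middle', 'late'],
)

# Apply zscore once per row (one gene across all timepoints)
z_demo = demo.apply(func=zscore, axis='columns')

print(z_demo.round(2))
```

After z-scoring, both rows contain the same values and each row has a mean of 0, so a clustering of the rows reflects the shape of the time course rather than the expression level.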

In [None]:
sns.clustermap(
    data=z_scored_table)

This already looks better: we see patterns emerging. However, the colors are not ideal. It would be nicest to use a diverging colormap in which values close to 0 (those close to the mean of a gene) are white.

In [None]:
sns.clustermap(
    data=z_scored_table,
    cmap='bwr',   # blue white red colormap
    vmin=-3,      # fix the lowest value that will be covered by the colormap,
    vmax=3,       # fix the highest value that will be covered by the colormap
)

Wow! That looks even better. Now we see clearly which time points have a reduced (blue) or elevated (red) expression when compared to all other timepoints of the same gene.

Yet there is more that we can do. For instance, we can suppress the clustering of the columns to keep our original order (which corresponds to the age).

In [None]:
sns.clustermap(
    data=z_scored_table,
    cmap='bwr',   # blue white red colormap
    vmin=-3,      # fix the lowest value that will be covered by the colormap,
    vmax=3,       # fix the highest value that will be covered by the colormap
    col_cluster=False   # Avoid clustering columns
)


# Export as pdf
# import matplotlib.pyplot as plt
# import matplotlib as mpl
# mpl.rcParams['pdf.fonttype'] = 42
# mpl.rcParams['font.family'] = 'Arial'
# plt.savefig('/Users/tstoeger/Desktop/test.pdf', bbox_inches='tight')

# Conclusion

Hopefully this tutorial has served as a brief introduction to working computationally with tables and visualizing the expression of genes. It looks as if some chaperones become downregulated after birth, whereas others become upregulated. Changing the visualization and data normalization can reveal distinct patterns in your data. You are among the first people to know about birth separating two types of regulation of chaperones, and you just discovered something new!

In [None]:
my_table.head()

# Add time course visualization

In this case we want to create a line plot where the x-axis is chronological time and the y-axis is the expression level (RPKM).

In [None]:
x_axis = chronologic_ages = [
    -40*7+7*4, #'4wpc', 
    -40*7+7*5, #'5wpc', 
    -40*7+7*7, #'7wpc', 
    -40*7+7*8, #'8wpc', 
    -40*7+7*9, #'9wpc',
    -40*7+7*10, #'10wpc', 
    -40*7+7*11, #'11wpc', 
    -40*7+7*12, #'12wpc', 
    -40*7+7*13, #'13wpc', 
    -40*7+7*16, #'16wpc', 
    -40*7+7*18, #'18wpc', 
    -40*7+7*19, #'19wpc', 
    -40*7+7*20, #'20wpc',
    0, #'newborn',
    365*2, #'toddler',
    365*5, #'infant', 
    365*10, #'school',
    365*16, #'teenager',
    365*25, #'youngAdult',
    365*35, #'youngMidAge',
    365*50, #'olderMidAge', 
    365*70, #'senior'
]
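
The prenatal entries in the list above all follow one rule: days relative to birth, assuming birth at 40 weeks, with "wpc" read as weeks post conception. As a sketch, the rule can be written as a small helper. The function name <code>wpc_to_days</code> and its <code>gestation_weeks</code> parameter are hypothetical and not part of the original tutorial.

```python
def wpc_to_days(label, gestation_weeks=40):
    # '12wpc' -> 12 (weeks), taken from the text before 'wpc'
    weeks = int(label.replace('wpc', ''))
    # days relative to birth; negative values lie before birth
    return -gestation_weeks * 7 + 7 * weeks

print(wpc_to_days('4wpc'))   # -252
print(wpc_to_days('5wpc'))   # -245
print(wpc_to_days('20wpc'))  # -140
```

Computing the values from the labels avoids copy-and-paste slips when many timepoints have to be typed in.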

In [None]:
gene_of_interest = 'CCT4'

In [None]:
pivotted_table.head()

In [None]:
y_axis =  pivotted_table.loc[gene_of_interest, :]
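
Here <code>.loc</code> is used with a row label instead of a logical mask. A minimal sketch on a made-up table (the names <code>wide</code> and <code>row</code> and all values are hypothetical) shows what comes back:

```python
import pandas as pd

# Made-up table with genes in rows and ages in columns
wide = pd.DataFrame(
    [[1.0, 2.0], [3.0, 4.0]],
    index=['CCT4', 'HSPA8'],
    columns=['4wpc', 'newborn'],
)

# A single row label returns that row as a one-dimensional Series,
# labelled by the former column names
row = wide.loc['CCT4', :]

print(row.tolist())      # [1.0, 2.0]
print(list(row.index))   # ['4wpc', 'newborn']
```

The values keep the column order of the table, which is why the x-axis list must follow the same order.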

In [None]:
# Import another library for plotting
# This library - in contrast to seaborn - provides access to
# elemental parts of drawing images
import matplotlib.pyplot as plt

In [None]:
plt.plot(x_axis, y_axis)
plt.xlabel('Age (days)', fontsize=20)
plt.ylabel('RPKM of '+ gene_of_interest, fontsize=20)