## Part 0: Introduction

### What is your name?

### Welcome to Week 2 of the bioinformatics tutorial of STEMREM 201B!

The main goal for this week is to get you some practice working in Python, and to make sure that you're comfortable with importing, manipulating, and exporting data.  If you haven't done any type of coding before, parts of this might be challenging - I would much rather you reach out to me if you get stuck than spend an hour getting frustrated.

I'm going to try to walk you through some initial exercises, which are primarily motivated by wanting to get you some practice in Python.  Then we're going to look at a <b>GTF</b> file, which is a commonly used format for storing genomic feature annotations (e.g., transcripts, exons, genes), and prepare a few convenient files linking transcript annotations to gene names and functions.  We'll also look at a file that contains information on the files that are available through **ENCODE**, as practice in importing and filtering a large dataset.

I apologize if there are any typos or parts that are particularly confusing, boring, or otherwise unhelpful - please let me know what parts were helpful, and which parts you think would be good to change!  And though I think I can generally figure out a way to answer most bioinformatics questions, I am by no means an expert at coding, so I'm sure that there is often an alternative or more elegant solution to certain problems.  Absolutely feel free to do things slightly differently, as long as it gets you to the same answer - I'll try to walk you through how I generally do things, point out parts that I have found useful, and do my best to explain how and why certain bits of code work.

Parts of this tutorial were inspired by a class on RNA-seq, SCRB 152, that was taught and developed by Adrian Veres and Eran Hodis at Harvard.

## Part 1: Basics of Python

### 1.1 Numbers

In Python (as with most programming languages), you can store information in variables.  Often, bioinformatics data will be a number.

**Create a variable named **the_year** and set it equal to **2018**.**

**What type of data is stored in the variable **the_year**?**

Note - there is a function type() that will return the type of a variable.  For instance, if you have a variable *x*, type(x) will return the type of data in x.  In Python, functions are used with parentheses (), and between the parentheses, you put any arguments (things passed in to the function).
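As a small sketch of how type() is called (the variables x and y here are made up for illustration and aren't part of the exercises):

```python
# x and y are hypothetical example variables
x = 42
y = 'forty-two'

# the argument (x or y) goes between the parentheses
x_type = type(x)
y_type = type(y)

print(x_type)
print(y_type)
```

The first print shows an integer type and the second a string type (strings are covered in section 1.2).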

**Make another variable called **another_year** which is equal to **the_year** divided by 3.  What type of data is stored in the variable **another_year**?**

(https://www.digitalocean.com/community/tutorials/how-to-do-math-in-python-3-with-operators)

**What is the difference between an integer and a float?**

**What is the result of adding these two variables together?** 

(Note how Python will deal with different types of data, and return a float)

### 1.2 Strings

What if, instead of wanting to store a number, we wanted to store a word, or sequence of characters (or DNA bases)?  Note that by default, a set of characters is interpreted as being a variable or function - so we need some way to tell Python that we want what we are typing to be interpreted as text, instead of a variable.

In [1]:
# this is an example of how we can make a string

my_string = 'hello world'
print(type(my_string))
print(my_string)

<class 'str'>
hello world


**Create another string, and add it to **my_string** (using the + operator).  What happens?**

What if we want to access just a particular part of the string? i.e., a single character? You can use the [] operator to access just part of a string.

Note how Python is *0-indexed*.  This is important to remember - some languages (e.g. Python) are 0-indexed, meaning "0" indicates the start of things, while other languages (e.g. R) are 1-indexed, meaning that 1 indicates the start of things.  Regardless of whether you're looking at a string, list, etc. **0 indicates the 'first' thing.**

What if we wanted to access a substring, rather than just a single character? We can use a colon (:) to indicate a range.
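Here is a minimal sketch of both ideas, single-character indexing and ranges, using a made-up string called dna (not part of the exercises):

```python
# dna is a hypothetical example string
dna = 'ATGCGT'

first_base = dna[0]      # index 0 is the first character
third_base = dna[2]      # index 2 is the third character
first_three = dna[0:3]   # start at index 0, stop *before* index 3
from_fourth = dna[3:]    # leaving out the end means "to the end of the string"

print(first_base, third_base, first_three, from_fourth)
```

Note that the end of a range is not included: dna[0:3] gives you the characters at positions 0, 1, and 2.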

**It's important to remember that a number can be stored as an integer/float, or a string.**

In [14]:
# note how we are distinguishing between whether we store 1 as a number or a string

a = 1
b = '1'

print(a, type(a))
print(b, type(b))

1 <class 'int'>
1 <class 'str'>


### 1.3 Booleans

One of the most common things you'll want to do involves  boolean logic: are things the same, different, greater or less than each other, etc.

To distinguish from setting a variable equal to something (a = 1), you can use double equal signs (==) to ask whether something is equal.  Alternatively, you can use != to indicate not equal, > and < for greater and less than, and >= and <= for greater/less than or equal to.
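As a quick sketch of these operators, using two made-up numbers (expression and threshold are just example names):

```python
# expression and threshold are hypothetical example values
expression = 7.5
threshold = 5

print(expression == threshold)   # False
print(expression != threshold)   # True
print(expression > threshold)    # True
print(expression <= threshold)   # False

# the result of a comparison can itself be stored in a variable
is_high = expression >= threshold
print(is_high, type(is_high))
```

Each comparison returns either True or False, which is its own data type (a boolean).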

**In the above example: are variables a and b the same?  What operator could you use to ask if they are not equal?**

Importantly, when dealing with numbers that are stored as strings, Python has a handy way to automatically convert them to an integer or float:

In [5]:
# we can use the int() or float() functions to convert things to integers or floats, respectively
# it is important to remember, though, that a string will not equal an integer (or float)

a = '1'
c = int(a)
c == a

False

Note how this works when the variable contains a string corresponding to a number.  What happens if you try to do this with a variable containing a string that is composed of letters?

In [15]:
# you can just run this cell as is (it's supposed to error)

d = 'my string'
int(d)

ValueError: invalid literal for int() with base 10: 'my string'

This provides a good example to look into the error messages that Python provides: no matter how much time you spend with bioinformatics, you're going to get error messages, and you need to be able to figure out what they are telling you about why things are going wrong!

A good place to start is the end: that's usually the part that is telling you what is actually going wrong.  In this case, it's saying that you have an invalid value to convert to an integer - 'my string' cannot be read as a base-10 number.

The other important place to look is in the middle, at the line that has the arrow: that is the actual line that is erroring.  So when you have a larger piece of code, and you get an error, you might not know exactly where to look - but the arrow points you to the exact line that is erroring.
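If you want to see the two pieces of that last line separately, one option (just a sketch, not something you need for the exercises) is to catch the error and print its type and its message.  The variable bad_value is made up for illustration:

```python
# bad_value is a hypothetical example string that can't be converted
bad_value = 'chrX'

try:
    int(bad_value)
except ValueError as err:
    # the last line of an error message is "<type of error>: <message>"
    error_type = type(err).__name__
    error_message = str(err)
    print(error_type)
    print(error_message)
```

The type of the error (here, a ValueError) is often a good thing to include when you search for the problem online.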

### 1.4 Lists

Often, we don't just have a single piece of information - we have a whole bunch of them that we'd like to keep together.  For example, a bunch of genes, or their expression values.  We can use lists for this.

In [16]:
# this is an example list

my_list = ['a','b','c']
print(my_list[1])

b


**How many items are in this list?**

There is a built-in function that will return the length of a list (note that you can obviously just manually count the number of elements in this list, since it's pretty short - however, with a lot of genomics data, you may not know how many genes are in a given dataset for instance, and you're not about to go through and count them!)  So your answer to this should include Python telling you how many elements are in this list.

What if you want to add something to this list? You can say list1 = list1 + list2.  Or, you can say list1 += list2.
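A minimal sketch of both ways of writing this, using two made-up lists (genes_a and genes_b are just example names):

```python
# genes_a and genes_b are hypothetical example lists
genes_a = ['geneA', 'geneB']
genes_b = ['geneC', 'geneD']

# + makes a new list and leaves the originals alone
combined = genes_a + genes_b
print(combined)

# += adds the items of genes_b onto the end of genes_a
genes_a += genes_b
print(genes_a)
print(genes_b)
```

Either way, the items of the second list end up after the items of the first list, in the same order.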

**Create another list containing the numbers 1, 2, 3, and add it to the previous list.**

Note how in the previous example, there are different types of data in the same list: strings and integers.  There's nothing wrong with this, but it's often a good idea to try to keep everything consistent with the same type of data.  A common place where this comes up is chromosomes: if you're trying to compare a list of genes, for instance, you want to make sure that your chromosome values are either all integers, or all strings, otherwise you're going to run into problems! (because '1' isn't equal to 1)

One of the most common things you'll want to do is access particular elements of a list: we can use the same notation as with strings.

**Make a new variable, equal to the first element of my_list; make a second variable equal to the last element of my_list.  Are they equal?**

When you are indexing lists, you can begin at 0 and count upward to go through the list; or begin at -1 to start at the end of the list. (make sure that you know how to do this by accessing the last element of the list below)
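Here's a short sketch of counting from either end, using a made-up list called values:

```python
# values is a hypothetical example list
values = [10, 20, 30, 40]

from_start = values[0]     # first item
from_end = values[-1]      # last item
second_last = values[-2]   # second to last item
middle = values[1:3]       # ranges work the same way as with strings

print(from_start, from_end, second_last, middle)
```

Negative indices are handy when you don't know (or don't want to look up) how long the list is.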

### 1.5 Dictionaries

The last data type we are going to cover is a dictionary (there are also other data types that we aren't going to get into right now).  A dictionary is essentially a pairing of a 'key' and a 'value'.  It's convenient when you want to be able to look up a key, and have a value returned.

The syntax is as follows:

In [17]:
# this is an example of a dictionary

my_dict = {'a':'alligator', 'b':'bonobo', 'c':'cheetah', 'd':'donkey'}
print(my_dict)

{'a': 'alligator', 'b': 'bonobo', 'c': 'cheetah', 'd': 'donkey'}


With a dictionary, you look things up with the [] operator:

In [18]:
# this is an example of looking up the 'a' key

print(my_dict['a'])

alligator


**What happens if you try to look up a key that is not stored in the dictionary?**  Try looking up 'e' in the dictionary.

To add a value to a dictionary, you can simply say:

In [19]:
# this is an example of how you can add a key/value pair to a dictionary

my_dict['e'] = 'elephant'
print(my_dict['e'])

elephant


Note that with a dictionary, the value can be anything - a string, integer, list, etc.  When you have a ton of pieces of data, that have some kind of organization, a dictionary can be a very efficient way to organize that information!
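As a sketch of what that can look like, here's a made-up dictionary where the values are a mix of types (the keys and values are invented for illustration):

```python
# gene_info is a hypothetical example dictionary
gene_info = {
    'name': 'geneA',            # a string
    'chr': '1',                 # a string (not the integer 1!)
    'length': 1500,             # an integer
    'samples': ['s1', 's2'],    # a list
}

print(type(gene_info['name']))
print(type(gene_info['length']))
print(type(gene_info['samples']))

# if the value is a list, you can index into it right after looking it up
second_sample = gene_info['samples'][1]
print(second_sample)
```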

## Part 2: Loops and Conditionals

### 2.1 Loops

It's pretty often that you want to do the same thing many, many times - for example, for every gene, or item in a list, or a range of iterations.  You can use a loop to do this:

In [20]:
# The syntax is:

# for *variable* in *iterable object*:
#     *do something*

my_list = ['a','b','c','d','e']
for i in my_list:
    print(i, my_dict[i])

a alligator
b bonobo
c cheetah
d donkey
e elephant


Note that the for line ends in a colon, and the *do something* is indented.  **This is important!**  Python expects the code inside loops or conditionals to be indented - everything that is indented one level under the for line will be considered a part of that loop.
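To make the role of indentation concrete, here's a small sketch (the names letters and seen are made up) where two lines are inside the loop and one is not:

```python
# letters and seen are hypothetical example variables
letters = ['a', 'b', 'c']
seen = []

for letter in letters:
    # these two lines are indented, so they run once for every item
    seen.append(letter)
    print('inside the loop:', letter)

# this line is not indented, so it runs only once, after the loop is done
print('after the loop:', seen)
```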

What if you want to iterate through a number of iterations, or a range of numbers?

In [21]:
# this is the syntax for a for loop

for i in range(10):
    print(i)

0
1
2
3
4
5
6
7
8
9


**Let's put a few concepts together:**

First, initialize a new empty dictionary.  Make two lists of equal length: the first containing five letters of the alphabet; the second containing five words beginning with those letters.

Now, use a for loop to iterate through a range of numbers equal to the length of those lists, and create a dictionary where the *keys* are the letters, and the *values* are the words beginning with those letters.  Print the entire dictionary.

**One aside about variable names - certain names are 'taken' - i.e., the word** "list" **has *meaning* in Python - it refers to the data type of a list.  Same goes for things like int, str, dict, etc.  So don't try to create variables with those names! In many programs, such as Jupyter notebook, words that have 'meaning' will change color, so if you try to create a variable and see that the word you just wrote changed color, it's an indication that you shouldn't make a variable with that name.**

Below is another way of making a dictionary, that takes advantage of the **zip** function in Python.  Often, this is an easier way of making a dictionary than using a for loop, but there are instances when the former method (using a for loop) will be easier.

In [22]:
# It's worth mentioning that there are many 'shortcuts' in Python
# (or, as people say, 'Pythonic' ways of doing something)
# this is an alternative (and usually faster) way of making a dictionary from two lists

list1 = ['a','b','c','d','e']
list2 = ['apple','boy','cot','dog','egypt']
new_dict2 = dict(zip(list1, list2))
print(new_dict2)

{'a': 'apple', 'b': 'boy', 'c': 'cot', 'd': 'dog', 'e': 'egypt'}


**Is this dictionary equal to the one you made before?**

### 2.2 Conditionals

It's often the case that we don't want to just blindly do something - we only want to do it if a certain condition is satisfied.  We can use an 'if' statement for this.  The syntax is similar to that of for loops.

In [23]:
# this is an example

a = 1
if a == 1:
    print('a is equal to 1')
if a == 2:
    print('a is equal to 2')

a is equal to 1


In [24]:
# this is another example

a = 1
if a == 2:
    print('equals 2')
else:
    print('not equal to 2')

not equal to 2


**Use a for loop to iterate through the test_list.  If the value is greater than 5, append it to new_list.  Otherwise, print the value instead. How long is the resulting new_list?**

In [6]:
# test_list
test_list = [1,2,3,4,5,6,7,8,9,10]

# your code below


**Make a new list, and for each value of the string_list, convert it to an int.**

**There is another way you can do this using what is called a 'list comprehension'.  Essentially, you're writing the entire for loop in a single line.  *There is never a situation where you have to use a list comprehension - you could always just use a normal for loop.*  However, list comprehensions are nice because they are shorter (and more Pythonic), and are good to know so you don't have to type as much!**

In [27]:
# this is an example of a list comprehension
# note that you'll need to have the variable string_list for this to work

int_list2 = [int(i) for i in string_list]
print(int_list2)

You can also get fancy with list comprehensions:

In [26]:
# these are other things you can do with list comprehensions

new_list = [i for i in range(10) if i < 5]
print(new_list)

new_list2 = ['a' if i < 5 else 'b' for i in range(10)]
print(new_list2)

[0, 1, 2, 3, 4]
['a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'b']


There's no need to worry too much about this now if this doesn't make sense, just know that it exists, and once you get the hang of Python, it's something to try to get used to writing!

## Part 3: Functions

In general, you don't need to be 'good at coding' to do bioinformatics - you just need to learn the syntax of how to do matrix operations, since most of what you end up doing is just doing things to rows and columns of matrices.

However, it's good to know what a function is, how to define one, and how they work.  Below is the syntax for a simple function that takes an input, checks if it is an integer, and if it is, multiplies it by two.  Note that once a function hits a 'return' statement, it will immediately exit and return (output) that value.

In [28]:
# here, we're making a simple function called 'my_function'

def my_function(x):
    if type(x) == int:
        return x * 2
    else:
        return 'not an integer!'

In [29]:
# this is how we can use that function

a = 2
b = my_function(a)
print(b)

c = my_function('abc')
print(c)

4
not an integer!


**Write a function that returns the square root of a number.  Iterate through the first ten integers, and use this function to add them to a new list.  You can do this either with a for loop, or list comprehensions.**

Functions can also take multiple inputs:

In [30]:
# this is an example of a function that takes multiple inputs

def func3(x, y, z):
    return x + y + z

func3(1,2,3)

6

**Write a function called **max_value** that takes three integers as inputs and returns the largest one.  In the case of a tie, return a single value (the maximum of the three numbers).**

#### Prove to yourself that it works by running the following test cases:

In [13]:
test_cases = [[1,2,3],
              [1,3,2],
              [2,1,3],
              [2,3,1],
              [3,2,1],
              [3,1,2],
              [1,1,1],
              [2,2,1],
              [1,2,2],
              [2,1,2]]

# you can run the following code (you need to uncomment it) to test your function

# for i in test_cases:
#     x, y, z = i
#     print(i, max_value(x, y, z))

(Note that there is already a built in Python function that will do this for you - most things will already have been done, and either exist as a built in function, or someone will have solved it on the internet and you'll probably be able to find a nicely written function you can copy and paste off StackOverflow)

## Part 4: Looking at genome annotations

### 4.1 Packages - Pandas and Numpy

Pandas and Numpy are two very powerful and commonly used packages in bioinformatics, data science, and a variety of other fields.  They facilitate doing array math and matrix operations, and make it very easy to do things that would otherwise be relatively painful to do with just lists.

We need to tell Python to import these packages.

In [95]:
# this is us importing the two packages pandas and numpy

import pandas as pd
import numpy as np

To save ourselves from having to type 'pandas' or 'numpy' every single time we want to refer to something in the packages (which will be a lot), we import them under the abbreviations 'pd' and 'np'.

One other thing to note - to access the functions in these packages, we are going to use the notation:

**pd.function()**

**np.function()**

or

**pd.subpackage.function()**

etc.

**We'll talk about numpy first - numpy introduces a data type called an 'array'.  In many ways, arrays are just like lists.  However, they are optimized for math.**

Make a new list, containing the numbers 1 to 10.  Make an array from this list.

Now try adding one to each element of the array.  What is the result?  What happens if you try to do this with a list?

You can do a lot of math operations in this way - addition, subtraction, multiplication, etc.  There's also a lot of other built in math functions within numpy, which we'll use later on.  In general, if you're doing 'scientific' operations, it's usually a good idea to work with arrays and pandas dataframes (which we'll go into below).
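As a sketch of the difference, here is multiplication (rather than addition) on a made-up list and the array built from it.  The names counts_list and counts_array are just for illustration:

```python
import numpy as np

# counts_list and counts_array are hypothetical example variables
counts_list = [2, 4, 6]
counts_array = np.array(counts_list)

# with an array, the math is done on every element
doubled_array = counts_array * 2
print(doubled_array)

# with a list, * means "repeat the list", which is usually not what you want
doubled_list = counts_list * 2
print(doubled_list)

# numpy also has built-in math functions that work on whole arrays
mean_count = np.mean(counts_array)
print(mean_count)
```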

### 4.2 Importing a genome annotation file

Here, we're going to look at a file that I've downloaded from the ENSEMBL website that contains annotation information for various genes in the genome.  This file was originally downloaded with <i>transcript-based</i> annotations, which I converted to be <i>gene-based</i>.  When you're doing RNA-seq analysis, you can either perform analyses at the transcript level (meaning considering different isoforms of the same gene differently) or at the gene level (aggregating different isoforms of the same gene); we're going to focus on gene level analysis for now.
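To give a rough idea of what 'aggregating isoforms' can mean, here is a toy sketch that sums made-up transcript counts for each gene.  The table, its column names, and the choice of summing are all assumptions for illustration; this is not how the annotation file for this tutorial was made:

```python
import pandas as pd

# a hypothetical table with one row per transcript (isoform)
transcripts = pd.DataFrame({
    'transcript': ['tx1', 'tx2', 'tx3'],
    'gene': ['geneA', 'geneA', 'geneB'],
    'count': [10, 5, 7],
})

# gene level: add up the counts of all transcripts that belong to the same gene
gene_counts = transcripts.groupby('gene')['count'].sum()
print(gene_counts)
```

Here geneA has two isoforms (tx1 and tx2), so its gene-level value is the sum of both.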

<b>First, we need to import the annotation file.</b>  I typically like to define paths and file names at the start, just to keep things organized.

1. Create a variable called 'path' which contains the directory listing to wherever you downloaded the files.
2. Create a variable called 'fn_anno' which is the name of the file.

In [96]:
# commented things are not run
# as a handy shortcut, typing "command + forward slash" (like "command + c" for copy) will comment a line
# path = '/path/to/the/directory/containing/the/file/'
# fn_anno = 'name_of_the_file.extension'



**Using pd.read_csv(), import the txt file (comma delimited) containing the annotations into a dataframe called 'anno', and set the index to be the 'transcript' column.  Use .head() to show the first 5 rows of the resulting dataframe.**

See https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html

and https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.set_index.html

and https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.head.html

for reference.  These are part of the pandas documentation.  I've provided these here just so you can get started, but in the future, I'll provide some hints/direction as to how to go about something, but it will be up to you to look up how to actually use the functions in the pandas (or other) documentation.  In real life, you'll have to look things up yourself, and (at least for me) I think having to look things up will help you learn and remember it.
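If it helps to see these three functions in action before trying them on the real file, here is a sketch on a tiny made-up table.  The data and column names are invented, and io.StringIO is only used so the example doesn't need a file on disk:

```python
import io
import pandas as pd

# a hypothetical comma-delimited "file" held in memory
toy_csv = io.StringIO(
    'sample,tissue,reads\n'
    's1,liver,100\n'
    's2,heart,250\n'
    's3,lung,175\n'
)

# read the table in, then make the 'sample' column the index
toy = pd.read_csv(toy_csv, sep=',')
toy = toy.set_index('sample')

# show only the first 2 rows
print(toy.head(2))
```

pd.read_csv() also has an index_col argument that does the same thing as the separate set_index step.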

### 4.3 Working with pandas dataframes

**Print the information for the gene *ENSG00000181449.3*.  You should familiarize yourself with the .loc and .iloc commands.**
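As a sketch of the difference between the two, using a tiny made-up dataframe (not the annotation file): .loc looks things up by label, while .iloc looks things up by position.

```python
import pandas as pd

# toy is a hypothetical example dataframe
toy = pd.DataFrame(
    {'tissue': ['liver', 'heart', 'lung'], 'reads': [100, 250, 175]},
    index=['s1', 's2', 's3'],
)

by_label = toy.loc['s2']              # the row whose index label is 's2'
by_position = toy.iloc[1]             # the second row (remember: 0-indexed)
one_value = toy.loc['s3', 'reads']    # a single row and column

print(by_label)
print(one_value)
```

In this toy example, toy.loc['s2'] and toy.iloc[1] return the same row.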

#### One of the most important things about working with genomics data is double checking that the files you are working with have the data you expect them to have.

For instance: what values are present in the 'chr' column of our annotation dataframe?  How many chromosome values are in this column?

What chromosomes would you expect to be there?  Are there any other chromosomes present, and if so, what are they?

As a hint, you're looking for unique values in that column of the dataframe.

#### Note that some of the values in the chromosome column are numbers (e.g., 1, 2, etc.) and others are strings (e.g., 'X', 'Y').  When Python imported the dataframe (pd.read_csv()), did it import the numerical chromosomes as integers or strings?

#### Just to be sure - go through and explicitly convert everything in the column 'chr' of the dataframe to be a string.

We're going to consider three ways of doing this (these types of row/column manipulations are things you're going to be doing a lot of). For two of them, you're going to want the command str() to convert something to a string.

Save the column 'chr' of anno as a new variable called column_orig.  Using either list comprehensions or a for loop, explicitly convert every value to a string (using str()), and save the results as a new variable called column_new.  Set the 'chr' column of anno to be equal to column_new.

#### As it turns out, there's also a convenient function that is built into dataframes called .apply().  Look up the syntax for this, and convert the values in the 'chr' column to strings with .apply().

(You should be able to do this in a single line of code.  Remember that you need to explicitly tell Python to both 1) perform .apply() <i>and</i> save the result in the 'chr' column of anno)

**There's actually an even more convenient way to do this - look up the .astype() function, and use it below to convert the values in the 'chr' column to strings.**

#### Let's say that we want to subset this annotation to get a list of only those genes that are on the 'normal' chromosomes: autosomes, sex chromosomes, and the mitochondrial genome.

Make a list that looks like this:

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 'X', 'Y', 'MT']

Do it <u>without</u> explicitly writing out the numbers 1 to 22.  Feel free to use either lists (built-in to Python) or numpy arrays (np.array()).  Be sure to save the numerical chromosomes as the correct data type (integer or string) to match the data type of the values in anno['chr'].

#### Subset the anno dataframe to include only those genes whose chromosome annotations are in this list of chromosomes.  Save this as a new dataframe called anno_filt.

How big is this new annotation/how many things did we filter out? Print the first five rows of the dataframe with .head().

(You should end up with 57106 rows remaining in anno_filt)
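One way to do this kind of subsetting (there are others) is the .isin() function, which checks each value of a column against a list.  Here's a sketch on a made-up dataframe; the gene and chromosome names are invented:

```python
import pandas as pd

# toy is a hypothetical example dataframe
toy = pd.DataFrame({
    'gene': ['geneA', 'geneB', 'geneC', 'geneD'],
    'chr': ['1', '2', 'scaffold_1', 'X'],
})

# the values we want to keep (strings, to match the column)
keep = ['1', 'X']

# mask is True for rows whose 'chr' value is in the list
mask = toy['chr'].isin(keep)
toy_filt = toy[mask]

print(toy_filt)
print(toy.shape, toy_filt.shape)
```

Comparing .shape before and after filtering is a quick way to see how many rows were removed.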

#### How many genes are on each chromosome?  There are a few ways you could do this, but one is to use a Counter.

1. from the package 'collections' import Counter. https://docs.python.org/3.6/library/collections.html#collections.Counter
2. create a new variable called chr_count that is a Counter.
3. loop through all values in the 'chr' column of anno_filt and update the Counter.
    - You can either use a for loop, or, use .apply() and a lambda function.
    - Lambda functions are incredibly useful: they let you define a new short function in a single line.  The syntax is:

In [103]:
# information on lambda functions

func_lambda = lambda x: x + 2
func_lambda2 = lambda x: x ** 2 + 100
func_lambda3 = lambda x: str(x)

print(func_lambda(10))
print(func_lambda2(10))
print(func_lambda3(10))

# these are the same as:

def func_long(x):
    return x + 2
def func_long2(x):
    return x ** 2 + 100
def func_long3(x):
    return str(x)

# you can use a lambda function with dataframes
my_data_frame = pd.DataFrame()
my_data_frame['my_column'] = np.arange(10)
my_data_frame['other_column'] = np.arange(10,20)
print(my_data_frame, '\n')

print(my_data_frame.apply(lambda x: np.log10(x['my_column'] + 10**-15), axis=1), '\n')
print(my_data_frame['my_column'].apply(lambda x: np.log10(x + 10**-15)), '\n')
print(my_data_frame.apply(lambda x: [(i**2)/100 for i in x]), '\n')

12
200
10
   my_column  other_column
0          0            10
1          1            11
2          2            12
3          3            13
4          4            14
5          5            15
6          6            16
7          7            17
8          8            18
9          9            19 

0   -1.500000e+01
1    4.821637e-16
2    3.010300e-01
3    4.771213e-01
4    6.020600e-01
5    6.989700e-01
6    7.781513e-01
7    8.450980e-01
8    9.030900e-01
9    9.542425e-01
dtype: float64 

0   -1.500000e+01
1    4.821637e-16
2    3.010300e-01
3    4.771213e-01
4    6.020600e-01
5    6.989700e-01
6    7.781513e-01
7    8.450980e-01
8    9.030900e-01
9    9.542425e-01
Name: my_column, dtype: float64 

   my_column  other_column
0       0.00          1.00
1       0.01          1.21
2       0.04          1.44
3       0.09          1.69
4       0.16          1.96
5       0.25          2.25
6       0.36          2.56
7       0.49          2.89
8       0.64          3.24
9       0.81          3.61
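Going back to the Counter mentioned in the steps above: here's a sketch of how one works, using a made-up list of letters rather than the chromosome column.  The names letters and letter_count are just for illustration:

```python
from collections import Counter

# letters is a hypothetical example list
letters = ['a', 'b', 'a', 'c', 'a', 'b']

# start with an empty Counter, then add one for each item we see
letter_count = Counter()
for letter in letters:
    letter_count[letter] += 1

print(letter_count)

# .most_common(n) returns the n most frequent items and their counts
top_one = letter_count.most_common(1)
print(top_one)
```

A Counter behaves a lot like a dictionary, except that keys it hasn't seen yet start at a count of 0.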

## Part 5: ENCODE data

**Import the file 'all_ENCODE_metadata.tsv.gz' into a dataframe called encode.  Set the index column to be the file accession number, and print the first rows with .head()**

#### How big is this dataframe?  What type of information is present in the rows? Columns?

#### Create a new dataframe called encode_filt that includes only samples that:
 - are from human (homo sapiens)
 - do not have audit errors.  Specifically, only include rows where encode['Audit ERROR'].isnull() is True.
 
For the first criterion, you may need to look at what columns are present in the dataframe to choose the appropriate ones to filter on.  Your dataframe should have 223543 rows.

#### What types of RNA-seq data are available?  Create a dataframe called rna that only has rows that satisfy all of the following criteria:
 - They come from RNA-seq experiments.
 - Their libraries are made from RNA
 - They are depleted in rRNA
 - They are fastq files
 
You will need to look at both the column listings, as well as the unique values in these columns, to be able to know what values to filter on.  You will want to look at four columns, create a boolean mask for each of them (an array/series containing either True or False for each value), and then make a final mask that contains only values where all four sub-masks were True.

Your final 'rna' dataframe should have 1017 rows.

In [105]:
# example on how to merge multiple masks

a = np.array([True, True, False])
b = np.array([False, True, False])
c = np.array([True, True, True])

d = a & b & c

d

# note that you can't do this with lists
# (try it yourself and see what happens)
# arrays make our lives easier!

array([False,  True, False])
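The same idea works with columns of a dataframe.  Here's a sketch on a tiny made-up table; the column names 'assay' and 'file_format' are invented for illustration and are not the actual column names in the ENCODE file:

```python
import pandas as pd

# toy is a hypothetical example dataframe
toy = pd.DataFrame(
    {
        'assay': ['RNA-seq', 'RNA-seq', 'ChIP-seq', 'RNA-seq'],
        'file_format': ['fastq', 'bam', 'fastq', 'fastq'],
    },
    index=['f1', 'f2', 'f3', 'f4'],
)

# one mask per column: True where the row satisfies that criterion
is_rna = toy['assay'] == 'RNA-seq'
is_fastq = toy['file_format'] == 'fastq'

# the final mask is True only where both sub-masks are True
both = is_rna & is_fastq

# use the final mask to keep only the matching rows
toy_filt = toy[both]
print(toy_filt)
```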

#### Get a list of the unique biosample term names in the rna dataframe.  In other words, a list of biosample term names for which there exists RNA-seq data that satisfied our above criteria.

#### What types of ChIP-seq data are available?  Create a dataframe called chip that only has rows that satisfy all of the following criteria:
 - They come from ChIP-seq experiments
 - The ChIP-seq target is H3K27ac-human
 - The file format is bed narrowPeak
 - The output type is replicated peaks
 - The bed files were aligned to the GRCh38 assembly.
 
Your final dataframe should have 80 rows.

#### Get a list of the unique biosample term names in the chip dataframe.

#### Now, get a list of the biosample term names which are shared between the two lists.  In other words, find the intersection of biosample term names with RNA and ChIP data satisfying our various criteria.  How many samples are there in this list?
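One possible approach (a sketch, not the only way) is to convert both lists to sets, which have a built-in intersection.  The two lists here are made up for illustration:

```python
# rna_names and chip_names are hypothetical example lists
rna_names = ['liver', 'heart', 'lung']
chip_names = ['heart', 'lung', 'spleen']

# & between two sets keeps only the items found in both
shared = sorted(set(rna_names) & set(chip_names))

print(shared)
print(len(shared))
```

sorted() is only there to turn the result back into a list with a predictable order, since sets don't keep track of order.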

#### The sample 'gastrocnemius medialis' should be in your list.  Print the data in the rna and chip dataframes that are from this sample.

#### The sample 'fibroblast of arm' should also be in your list.  Print the data in the rna and chip dataframes that are from this sample.

## The end

#### Congratulations on finishing this - I know that it's a lot!

How much time did this take you?  Do you have any comments/advice on things to improve, add more of, remove, or otherwise change?