## Welcome to Week 2: Dataframes!

### In weeks 2 and 3, we're going to focus on two things, which are essentially the basics of all downstream bioinformatics that you'll do.

<u>First</u>, learning to work with dataframes: we're going to use the package <b>pandas</b>, which is one of the most commonly used packages for data science. https://pandas.pydata.org/docs/getting_started/10min.html#min has a brief introduction, if you are curious.  The core idea of dataframes - the primary datatype associated with pandas - is that you have a two-dimensional table of data (i.e., rows and columns, like an Excel spreadsheet), and can associate a *label* with each row and column.  For example, with scRNA-seq data, you have a 2D matrix of gene expression counts, where each row is a gene and each column is a cell. If you wanted to look up the expression of a particular gene in a particular cell, rather than having to know the XY "coordinates" of that datapoint (i.e., gene row # 1827 and cell column # 2937), you can just pass in the names of the gene and cell.  If you wanted to sort the dataframe by the expression of a particular gene, you'd want to make sure that the pairings of gene names, cell names, and datapoints stay correct through the sorting process; pandas dataframes take care of this to keep everything organized and correct.  Don't worry if this doesn't make too much sense now - it'll make more sense when we start playing with actual examples.
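
To make the idea of labels a bit more concrete, here is a minimal sketch using a tiny made-up expression matrix. The gene names, cell names, and counts are all hypothetical, chosen only for illustration:

```python
import pandas as pd

# Toy expression matrix: rows are genes, columns are cells.
# All gene/cell names and counts here are made up for illustration.
counts = pd.DataFrame(
    [[5, 0, 12],
     [1, 7, 3],
     [0, 2, 9]],
    index=["GENE_A", "GENE_B", "GENE_C"],
    columns=["cell_1", "cell_2", "cell_3"],
)

# Look up a value by its labels rather than by its row/column position.
value = counts.loc["GENE_B", "cell_2"]
print(value)

# Sort the cells (columns) by the expression of GENE_A, highest first.
# The labels travel with the data, so every value stays paired
# with the right gene and cell.
sorted_counts = counts.sort_values(by="GENE_A", axis=1, ascending=False)
print(sorted_counts)
```

After sorting, looking up `sorted_counts.loc["GENE_B", "cell_2"]` still returns the same value as before, even though the column has moved.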

In addition to pandas, we're going to use the package <b>numpy</b>, which is the core "math" package ("scientific computing", as they describe it - https://docs.scipy.org/doc/numpy/user/quickstart.html and https://docs.scipy.org/doc/numpy/user/basics.html). Oftentimes when working with large datasets, you want to perform a simple operation (for example, log transform or depth normalize) on many pieces of data.  Numpy implements a lot of tricks under the hood to perform vectorized math operations very efficiently - doing the same operation to many pieces of data.  Numpy is built around *arrays*, which in the simplest case are a 1D datatype: essentially a list, but with a lot of added tricks (arrays can also have two or more dimensions). Say you have a bunch of datapoints - gene counts, for example - and want to multiply each one by 2.  Using a list, you would need to do this one-by-one for each element: iterate through the entire list with a for loop (or list comprehension) and multiply each value by two. However, using a numpy array, you can simply multiply the entire array by 2, and numpy will return the element-wise product (each element multiplied by 2). Again, this will make a little more sense once you've played around with it a little.
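
As a rough sketch of the two operations mentioned above (log transform and depth normalization), here is how they might look on a handful of made-up counts. The `+ 1` pseudocount and the counts-per-million scaling are assumptions for illustration, not a prescribed recipe:

```python
import math
import numpy as np

# Made-up gene counts for one cell (illustrative values only).
counts_list = [0, 9, 99, 999]
counts_array = np.array(counts_list)

# With a list, each element is transformed one at a time.
log_list = [math.log10(c + 1) for c in counts_list]

# With an array, the same transform is applied to every element at once.
log_array = np.log10(counts_array + 1)

# Depth normalization: divide each count by the total, then rescale
# (here to counts per million, an assumed convention).
per_million = counts_array / counts_array.sum() * 1_000_000

print(log_list)
print(log_array)
print(per_million)
```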

**I would recommend skimming through the introductions for pandas and numpy, since you'll want to become familiar with them both for this lesson and going forward. It's not as crucial that you memorize each function and every feature, but good to just have a sense of what is possible, so that you can remember that there should be a way to do something easily, then google for it later on and re-figure out how to do it.**

* https://pandas.pydata.org/docs/getting_started/10min.html#min
* https://docs.scipy.org/doc/numpy/user/quickstart.html

The <u>second</u> thing that we're going to focus on is plotting. **Matplotlib** is the core plotting package in Python. It is built around two concepts: the figure, which is the "overall" image - think about it like a piece of paper or figure panel - and axes, which are the specific XY axes where you plot things.  The simplest example is a figure with one axis - say a simple scatter plot. This is what you'll do 90% of the time.  Sometimes, though, you might want to group together multiple plots at the same time - say you have four scatter plots you want to make together. In this case, the figure might have four axes (a 2-by-2 grid of scatter plots).  The important thing to remember is that when you're plotting, you 1) create a figure, 2) create an axis, 3) plot things on that axis, [4) create & plot on any additional axes if applicable], and 5) save the figure (which contains the axis/axes you've plotted things on).
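
The five steps above can be sketched roughly as follows for the 2-by-2 case. The data points and the output file name are made up, and `plt.subplots()` is just one convenient way to create the figure and its axes in a single call:

```python
from matplotlib import pyplot as plt

# 1) + 2) create a figure holding a 2-by-2 grid of axes
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(6, 6))

# made-up data points for illustration
x = [1, 2, 3, 4]
y = [1, 4, 9, 16]

# 3) + 4) plot on each axis in turn
for ax in axes.flat:
    ax.scatter(x, y)
    ax.set_xlabel("x")
    ax.set_ylabel("y")

# 5) save the figure (hypothetical output file name)
fig.tight_layout()
fig.savefig("example_figure.png")
```

For the common one-axis case, `plt.subplots()` with no arguments returns a figure and a single axis, and the rest of the steps are the same.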

Two useful matplotlib links with some tutorials and example plots:

* https://matplotlib.org/tutorials/index.html
* https://matplotlib.org/gallery/index.html

Three other packages that we aren't going to use here, but that you will encounter down the road: <b>scipy</b>, which has a lot of more specialized functions for things like statistics (and many others - https://docs.scipy.org/doc/scipy/reference/, https://docs.scipy.org/doc/scipy/reference/tutorial/index.html); **scikit-learn**, which is the core machine learning package (https://scikit-learn.org/stable/getting_started.html); and **seaborn**, which is another data visualization package (https://seaborn.pydata.org/introduction.html) built on matplotlib.

## Import Statements

#### First, let's import the packages that we are going to use this and next week: pandas, numpy, and matplotlib.

We're going to abbreviate their names as follows:

    import pandas as pd
    import numpy as np
    import matplotlib as mpl
    
Then, when we want to do things with numpy, for example, such as the log10() function, rather than say: numpy.log10(my_data), we can say np.log10(my_data). Note that if we wanted to just import numpy (and not rename it - so saying numpy.log10(my_data)), we would just say:

    import numpy
    
We can also import a particular function from numpy, rather than everything:

    from numpy import log10
    
If we ran that, we would be importing just the log10() function from numpy, rather than the package as a whole.  We would then access this function by saying log10(my_data), rather than np.log10(my_data).

You can also put these things together and say:

    from matplotlib import pyplot as plt
    
Here, we're importing pyplot from the matplotlib package, and renaming it plt to save us some typing.
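
As a quick check that these import styles are interchangeable, here is a small sketch (the values in `my_data` are made up). All three names refer to the same underlying function:

```python
import numpy
import numpy as np
from numpy import log10

my_data = [1, 10, 100]  # made-up values for illustration

# All three names point at the same function, so the results match.
print(numpy.log10(my_data))
print(np.log10(my_data))
print(log10(my_data))
```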

In [1]:
import pandas as pd
import numpy as np
import matplotlib as mpl
from matplotlib import pyplot as plt

## 1. Lists, loops, and arrays.

First, we're going to do a quick overview of lists vs. arrays, and also list comprehensions.

### 1.1 Lists, loops, and list comprehensions.

Here, I've created a list, where each element is a string.  Let's say I want to convert each element to be an integer.  There are two ways to do this.

In the first way, we're creating a new empty list, iterating through each element of string_list, converting it to an integer, and adding it to our new empty list.

In the second way, we're using a list comprehension to do this all in one step.

In [2]:
string_list = ['1','2','3','4','5','6','7','8','9','10']

# first way
int_list = []
for i in string_list:
    int_list += [int(i)]
print(int_list)

# second way
int_list2 = [int(i) for i in string_list]
print(int_list2)

# checking that they are equal
print(int_list == int_list2)

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
True


List comprehensions are your friend - they can make it easier to do simple operations to an entire list.  The basic syntax is:

    [function(variable) for variable in thing_to_iterate_over]

https://treyhunner.com/2015/12/python-list-comprehensions-now-in-color/ has a good short tutorial that is worth reading through.

You can also make a list comprehension include a conditional:

    [function(variable) for variable in thing_to_iterate_over if condition]
    
I'm going to provide a few examples below of the same thing done either with a loop or list comprehension, and then ask you to convert a few loops to comprehensions and vice versa.

### 1.1 Examples

In [3]:
# make a list containing the integers from 0 to 9

# here, we are using the range() function, which will automatically start at 0
# and then iterate up to (but not including) the number you provide

# with a loop
list_1 = []
for i in range(10):
    list_1 += [i]

# list comprehension
list_2 = [i for i in range(10)]

print(list_1)
print(list_2)

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]


In [4]:
# make a list containing the integers from 10 to 19, but as strings.

# if you provide two inputs to the range() function, it will start at the first one,
# and stop just before the second one

# with a loop
list_1 = []
for i in range(10, 20):
    list_1 += [str(i)]

# list comprehension
list_2 = [str(i) for i in range(10, 20)]

print(list_1)
print(list_2)

['10', '11', '12', '13', '14', '15', '16', '17', '18', '19']
['10', '11', '12', '13', '14', '15', '16', '17', '18', '19']


In [5]:
# make a list of the first ten integers squared

# note that you can say either i*i or i**2 to square a number
# to cube it, you could say i*i*i or i**3, and so on

# with a loop
list_1 = []
for i in range(10):
    list_1 += [i * i]

# list comprehension
list_2 = [i*i for i in range(10)]

print(list_1)
print(list_2)

[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]


In [6]:
# iterate through input_list
# if the integer is less than or equal to 10, then square it
# otherwise, don't include it

input_list = [10, 4, 28, 3, 1, 930, 3928, 6, 2, 8, 2038]

# with a loop
list_1 = []
for i in input_list:
    if i <= 10:
        list_1 += [i**2]

# list comprehension
list_2 = [i*i for i in input_list if i <= 10]

print(list_1)
print(list_2)

[100, 16, 9, 1, 36, 4, 64]
[100, 16, 9, 1, 36, 4, 64]


The following two examples are cases where you can write things with a list comprehension, but it starts to get a little hard to follow - you might be better off writing a normal loop, because the list comprehension becomes a little unreadable.

In [7]:
# iterate through input list
# if it is less than or equal to 10, return the integer squared
# otherwise, return the integer raised to the fourth power

# note that when you have an if...else that the location gets moved around

input_list = [10, 4, 28, 3, 1, 930, 3928, 6, 2, 8, 2038]

# with a loop
list_1 = []
for i in input_list:
    if i <= 10:
        list_1 += [i**2]
    else:
        list_1 += [i**4]

# list comprehension
list_2 = [i**2 if i <= 10 else i ** 4 for i in input_list]

print(list_1)
print(list_2)

[100, 16, 614656, 9, 1, 748052010000, 238059718905856, 36, 4, 64, 17251097061136]
[100, 16, 614656, 9, 1, 748052010000, 238059718905856, 36, 4, 64, 17251097061136]


In [8]:
# iterate through integers from 0 to 9
# if it is less than or equal to 5, return 'black'
# otherwise, if it is less than 8, return 'red'
# otherwise, return 'blue'

# note that when you have an if...else that the location gets moved around

# with a loop
list_1 = []
for i in range(10):
    if i <= 5:
        list_1 += ['black']
    elif i < 8:
        list_1 += ['red']
    else:
        list_1 += ['blue']

# list comprehension
list_2 = ['black' if i <= 5 else 'red' if i < 8 else 'blue' for i in range(10)]

print(list_1)
print(list_2)

['black', 'black', 'black', 'black', 'black', 'black', 'red', 'red', 'blue', 'blue']
['black', 'black', 'black', 'black', 'black', 'black', 'red', 'red', 'blue', 'blue']


### 1.1 Problems

Convert the loop to a list comprehension, and the list comprehensions to loops.  Check that the results are equal.

In [9]:
list_1 = []
for i in range(20):
    list_1 += [4 * i - 2]
    
print(list_1)

# write answer below
list_2 = [4 * i - 2 for i in range(20)]
print(list_2)
print(list_1 == list_2)

[-2, 2, 6, 10, 14, 18, 22, 26, 30, 34, 38, 42, 46, 50, 54, 58, 62, 66, 70, 74]
[-2, 2, 6, 10, 14, 18, 22, 26, 30, 34, 38, 42, 46, 50, 54, 58, 62, 66, 70, 74]
True


In [10]:
input_list = ['black','black','orange','black','red','black','red','black','red','red','black','green','blue','purple']

list_1 = []
for i in input_list:
    if i == 'black':
        list_1 += [1]
    else:
        list_1 += [5]
print(list_1)

# write answer below
list_2 = [1 if i == 'black' else 5 for i in input_list]
print(list_2)
print(list_1 == list_2)

[1, 1, 5, 1, 5, 1, 5, 1, 5, 5, 1, 5, 5, 5]
[1, 1, 5, 1, 5, 1, 5, 1, 5, 5, 1, 5, 5, 5]
True


In [11]:
list_1 = [str(i / 2) for i in range(15)]
print(list_1)

# write answer below
list_2 = []
for i in range(15):
    list_2 += [str(i / 2)]
print(list_2)
print(list_1 == list_2)

['0.0', '0.5', '1.0', '1.5', '2.0', '2.5', '3.0', '3.5', '4.0', '4.5', '5.0', '5.5', '6.0', '6.5', '7.0']
['0.0', '0.5', '1.0', '1.5', '2.0', '2.5', '3.0', '3.5', '4.0', '4.5', '5.0', '5.5', '6.0', '6.5', '7.0']
True


In [12]:
input_list = [1,4,8,2,40,2038,233,23,1,5,3,882]

list_1 = [i for i in input_list if i % 2 == 0]
print(list_1)

# write your answer below
list_2 = []
for i in input_list:
    if i % 2 == 0:
        list_2 += [i]
print(list_2)
print(list_1 == list_2)

[4, 8, 2, 40, 2038, 882]
[4, 8, 2, 40, 2038, 882]
True


### 1.2 Numpy arrays

To create an array from a list, you say:

    new_array = np.array(old_list)
    
We're going to try doing the same things to list and arrays to see what happens in each case.

**Before running the cells below, try to guess what the output will be in each case (for the list versus array), and pay attention to the differences between how lists and arrays behave.**

In [13]:
test_list = [i for i in range(10)]
test_array = np.array(test_list)

print(test_list)
print(test_array)
print()

# what happens if we multiply by two?
print(test_list * 2)
print(test_array * 2)

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
[0 1 2 3 4 5 6 7 8 9]

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
[ 0  2  4  6  8 10 12 14 16 18]


In [14]:
# what happens if we try to add one to each one?
# note that this will only work for the arrays: it will throw an error for the list

In [15]:
print(test_list + 1)

TypeError: can only concatenate list (not "int") to list

In [16]:
print(test_array + 1)

[ 1  2  3  4  5  6  7  8  9 10]


In [17]:
# what happens if we try to add two lists or two arrays together?

print(test_list + test_list)
print(test_array + test_array)

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
[ 0  2  4  6  8 10 12 14 16 18]


**This is all that we are going to go over for now - the main takeaway here is that when you're dealing with arrays, you're performing the same operation on all elements of the array.**

## 2. Importing a genome annotation file in pandas

Here, we're going to look at a file that I've downloaded from the ENSEMBL website that contains annotation information for various genes in the genome.  This file was originally downloaded with <i>transcript-based</i> annotations, which I converted to be <i>gene-based</i>.  When you're doing RNA-seq analysis, you can either perform analyses at the transcript level (meaning considering different isoforms of the same gene differently) or at the gene level (aggregating different isoforms of the same gene); we're going to focus on gene level analysis for now.

<b>First, we need to import the annotation file.</b>  I typically like to define paths and file names at the start, just to keep things organized.

1. Create a variable called 'path' which contains the path to the directory where you downloaded the files.
2. Create a variable called 'fn' which is the name of the file.

As a reminder, both of these should be strings, and the variable 'path' should end with a '/'.

In [18]:
# you will need to change this based on where you saved the files on your computer, as you did last week

# path = '/path/to/the/directory/containing/the/file/'
# fn = 'name_of_the_file.extension'

**Using pd.read_csv(), import the txt file (comma delimited) containing the annotations into a dataframe called 'anno', and set the index to be the 'gene' column.  Use .head() to show the first 5 rows of the resulting dataframe.**

See https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html

and https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.set_index.html

and https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.head.html

for reference.  These are part of the pandas documentation.  I've provided these here just so you can get started, but in the future, I'll provide some hints/direction as to how to go about something, and it will be up to you to look up how to actually use the functions in the pandas (or other) documentation.  In real life, you'll have to look things up yourself, and I'm constantly looking up things that I've forgotten, don't know how to do, or don't want to figure out and would rather copy something somebody else already figured out and helpfully posted online.

In [19]:
path = '/Users/kevin/changlab/github/Bioinformatics-Tutorials/wk2_dataframes/data/'
fn = 'Homo_sapiens.GRCh38.gene_annotations.txt.gz'  # no leading '/', since path already ends with one

anno = pd.read_csv(path + fn, sep=',')
anno = anno.set_index('gene')
anno.head()

| gene | start | end | strand | length | chr | gene_symbol | gene_type | source |
|---|---|---|---|---|---|---|---|---|
| ENSG00000000003.14 | 100630765 | 100637538 | -1 | 6773 | X | TSPAN6 | protein_coding | cdna |
| ENSG00000000005.5 | 100589213 | 100598708 | 1 | 9495 | X | TNMD | protein_coding | cdna |
| ENSG00000000419.12 | 50935098 | 50956428 | -1 | 21329 | 20 | DPM1 | protein_coding | cdna |
| ENSG00000000457.13 | 169853881 | 169893003 | -1 | 39121 | 1 | SCYL3 | protein_coding | cdna |
| ENSG00000000460.16 | 169780373 | 169834072 | 1 | 53699 | 1 | C1orf112 | protein_coding | cdna |
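
As an aside, the read-then-set-index pattern can also be written in one step with the `index_col` argument of `pd.read_csv()`. Here is a small sketch on a made-up, in-memory table; the rows are hypothetical and `io.StringIO` just stands in for a file on disk:

```python
import io
import pandas as pd

# A tiny comma-delimited table standing in for a real annotation file
# (made-up rows, for illustration only).
csv_text = """gene,start,end,chr
GENE_A,100,200,1
GENE_B,300,450,X
"""

# Option 1: read, then set the index.
df1 = pd.read_csv(io.StringIO(csv_text), sep=",").set_index("gene")

# Option 2: ask read_csv to use the 'gene' column as the index directly.
df2 = pd.read_csv(io.StringIO(csv_text), sep=",", index_col="gene")

print(df1.equals(df2))
```

Note also that `pd.read_csv()` infers gzip compression from a `.gz` file extension by default, which is why a compressed file can be read without unzipping it first.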


**Print the information for the gene** *'ENSG00000181449.3'* **.  You should familiarize yourself with the .loc and .iloc commands.**

In [20]:
anno.loc['ENSG00000181449.3']

start               181711924
end                 181714436
strand                      1
length                   2512
chr                         3
gene_symbol              SOX2
gene_type      protein_coding
source                   cdna
Name: ENSG00000181449.3, dtype: object
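
To illustrate the difference between the two: `.loc` selects by label, while `.iloc` selects by integer position. A minimal sketch on a made-up table (gene names and coordinates are hypothetical):

```python
import pandas as pd

# Toy annotation table (made-up genes and coordinates).
toy = pd.DataFrame(
    {"start": [100, 300, 500], "chr": ["1", "X", "MT"]},
    index=["GENE_A", "GENE_B", "GENE_C"],
)

# .loc selects by label: the row named 'GENE_B'.
by_label = toy.loc["GENE_B"]

# .iloc selects by integer position: the second row (positions start at 0).
by_position = toy.iloc[1]

print(by_label)
print(by_label.equals(by_position))

# Both accept a row selector and a column selector.
print(toy.loc["GENE_C", "chr"])
print(toy.iloc[2, 1])
```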

**Save the information in the** *'start'* **column of the anno dataframe in a new variable, called** *start_column* **. Print start_column.**

In [21]:
start_column = anno['start']
print(start_column)

gene
ENSG00000000003.14    100630765
ENSG00000000005.5     100589213
ENSG00000000419.12     50935098
ENSG00000000457.13    169853881
ENSG00000000460.16    169780373
                        ...    
ENSG00000284596.1     102471469
ENSG00000284597.1       7931256
ENSG00000284598.1       7420360
ENSG00000284599.1      16979511
ENSG00000284600.1       1795567
Name: start, Length: 62803, dtype: int64


#### One of the most important things about working with genomics data is double checking that the files you are working with have the data you expect them to have.

For instance: what values are present in the 'chr' column of our annotation dataframe?  How many chromosome values are in this column?

What chromosomes would you expect to be there?  Are there any other chromosomes present, and if so, what are they?

As a hint, you're looking for unique values in that column of the dataframe (and then also the length of the result).

In [22]:
print(len(anno['chr'].unique()))
print(anno['chr'].unique())

380
['X' '20' '1' '6' '3' '7' '12' '11' '4' '17' '2' '16' '8' '19' '9' '13'
 '14' '5' '22' '10' 'Y' '18' '15' 'CHR_HSCHR6_MHC_MCF_CTG1'
 'CHR_HSCHR6_MHC_QBL_CTG1' 'CHR_HSCHR6_MHC_DBB_CTG1'
 'CHR_HSCHR6_MHC_SSTO_CTG1' 'CHR_HSCHR6_MHC_COX_CTG1' '21'
 'CHR_HSCHR6_MHC_MANN_CTG1' 'CHR_HSCHR4_6_CTG12' 'MT' 'CHR_HSCHR1_5_CTG3'
 'CHR_HSCHR6_MHC_APD_CTG1' 'CHR_HG1362_PATCH' 'CHR_HSCHR15_3_CTG8'
 'CHR_HSCHR19_4_CTG3_1' 'CHR_HSCHR16_1_CTG1' 'CHR_HSCHR1_2_CTG3'
 'CHR_HG2128_PATCH' 'CHR_HSCHR13_1_CTG3' 'CHR_HSCHR16_2_CTG3_1'
 'CHR_HSCHR3_1_CTG2_1' 'CHR_HSCHR21_2_CTG1_1' 'CHR_HSCHR17_4_CTG4'
 'CHR_HSCHR12_2_CTG2' 'CHR_HSCHR12_3_CTG2_1' 'CHR_HSCHR1_2_CTG31'
 'CHR_HG142_HG150_NOVEL_TEST' 'CHR_HG151_NOVEL_TEST'
 'CHR_HSCHR16_1_CTG3_1' 'CHR_HSCHR17_1_CTG4' 'CHR_HSCHR1_1_CTG31'
 'CHR_HSCHR7_1_CTG6' 'CHR_HSCHR12_1_CTG1' 'CHR_HSCHR22_1_CTG1'
 'CHR_HSCHR12_2_CTG2_1' 'CHR_HSCHR12_1_CTG2_1' 'CHR_HSCHR18_2_CTG2'
 'CHR_HSCHR18_1_CTG2_1' 'CHR_HSCHR19_1_CTG3_1' 'CHR_HSCHR18_1_CTG1_1'
 'CHR_HSCHR21_4_CTG1_1' 'CHR_

#### Note that some of the values in the chromosome column are numbers (e.g., 1, 2, etc.) and others are strings (e.g., 'X', 'Y').  When Python imported the dataframe (pd.read_csv()), did it import the numerical chromosomes as integers or strings?

In [23]:
np.unique([str(type(i)) for i in anno['chr']])

array(["<class 'str'>"], dtype='<U13')
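
Here is a small sketch of why this happens, using made-up in-memory tables. When a column contains anything that can't be parsed as a number, pandas keeps the whole column as strings; the `dtype` argument lets you force this explicitly:

```python
import io
import pandas as pd

# Made-up table whose 'chr' column mixes numbers and letters.
csv_text = """gene,chr
GENE_A,1
GENE_B,2
GENE_C,X
"""

mixed = pd.read_csv(io.StringIO(csv_text))
print({type(v) for v in mixed["chr"]})

# If the column held only numbers, pandas would infer integers instead.
numeric_only = pd.read_csv(io.StringIO("gene,chr\nGENE_A,1\nGENE_B,2\n"))
print(numeric_only["chr"].dtype)

# Passing dtype forces strings regardless of the contents.
forced = pd.read_csv(
    io.StringIO("gene,chr\nGENE_A,1\nGENE_B,2\n"),
    dtype={"chr": str},
)
print({type(v) for v in forced["chr"]})
```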

#### Let's say that we want to subset this annotation to get a list of only those genes that are on the 'normal' chromosomes: autosomes, sex chromosomes, and the mitochondrial genome.

Make a list that looks like this:

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 'X', 'Y', 'MT']

Do it <u>without</u> explicitly writing out the numbers 1 to 22.  Feel free to use either lists (built-in to Python) or numpy arrays (np.array()).  Be sure to save the numerical chromosomes as the correct data type (integer or string) to match the data type of the values in anno['chr'].

In [24]:
my_chrs = [str(i) for i in range(1,23)] + ['X', 'Y','MT']
print(my_chrs)

['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', 'X', 'Y', 'MT']


#### Subset the anno dataframe to include only those genes whose chromosome annotations are in this list of chromosomes.  Save this as a new dataframe called anno_filt.

How big is this new annotation/how many things did we filter out? Print the first five rows of the dataframe with .head().

You should use the .isin() function, and can also use .shape to get the size of a dataframe.

(You should end up with 57106 rows remaining in anno_filt)

In [25]:
# print(anno.shape) to get the size of the current dataframe

anno_filt = anno[anno['chr'].isin(my_chrs)]
print(anno.shape)
print(anno_filt.shape)
anno_filt.head()

(62803, 8)
(57106, 8)


| gene | start | end | strand | length | chr | gene_symbol | gene_type | source |
|---|---|---|---|---|---|---|---|---|
| ENSG00000000003.14 | 100630765 | 100637538 | -1 | 6773 | X | TSPAN6 | protein_coding | cdna |
| ENSG00000000005.5 | 100589213 | 100598708 | 1 | 9495 | X | TNMD | protein_coding | cdna |
| ENSG00000000419.12 | 50935098 | 50956428 | -1 | 21329 | 20 | DPM1 | protein_coding | cdna |
| ENSG00000000457.13 | 169853881 | 169893003 | -1 | 39121 | 1 | SCYL3 | protein_coding | cdna |
| ENSG00000000460.16 | 169780373 | 169834072 | 1 | 53699 | 1 | C1orf112 | protein_coding | cdna |
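
To see why the data type of the chromosome list matters for `.isin()`, here is a small sketch on a made-up column. The label `'CHR_PATCH'` is a hypothetical stand-in for the non-standard chromosome names:

```python
import pandas as pd

# Made-up chromosome column, stored as strings.
chr_col = pd.Series(["1", "2", "X", "CHR_PATCH"])

as_ints = list(range(1, 23)) + ["X", "Y", "MT"]
as_strs = [str(i) for i in range(1, 23)] + ["X", "Y", "MT"]

# The string '1' is not equal to the integer 1,
# so the numeric chromosomes are missed here.
print(chr_col.isin(as_ints).tolist())

# With matching types, the numeric chromosomes are kept.
print(chr_col.isin(as_strs).tolist())
```

A mismatch like this doesn't raise an error - it just silently drops rows, which is one more reason to check `.shape` before and after filtering.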


#### How many genes are on each chromosome?  There are a few ways you could do this, but one is to use a Counter.

1. from the package 'collections' import Counter. https://docs.python.org/3.6/library/collections.html#collections.Counter
2. create a new variable called chr_count that is a Counter, and pass in the chr column of your dataframe to your counter to get the counts of how many times each chromosome is found
3. print the results

In [26]:
from collections import Counter

chr_count = Counter(anno_filt['chr'])

for i in my_chrs:
    print(i, chr_count[i])

1 5191
2 3919
3 2992
4 2466
5 2796
6 2823
7 2846
8 2335
9 2226
10 2177
11 3172
12 2852
13 1283
14 2194
15 2105
16 2373
17 2918
18 1123
19 2877
20 1381
21 821
22 1329
X 2351
Y 519
MT 37
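
Another of the "few ways" is the pandas method `.value_counts()`, which gives the same counts as a Counter. A minimal sketch on made-up chromosome labels:

```python
from collections import Counter
import pandas as pd

# Made-up chromosome labels for illustration.
chr_col = pd.Series(["1", "1", "2", "X", "1", "X"])

# Counter returns a dictionary-like object of label -> count.
counter_result = Counter(chr_col)

# value_counts() returns a pandas Series of counts, sorted from most to least common.
value_counts_result = chr_col.value_counts()

print(counter_result)
print(value_counts_result)
```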


## 3. ENCODE data

**Import the file 'all_ENCODE_metadata.tsv.gz' into a dataframe called encode.  Set the index column to be the file accession number, and print the first rows with .head()**

In [27]:
fn = 'all_ENCODE_metadata.tsv.gz'
encode = pd.read_csv(path + fn, sep='\t')
encode = encode.set_index('File accession')

encode.head()


| File accession | File format | Output type | Experiment accession | Assay | Biosample term id | Biosample term name | Biosample type | Biosample life stage | Biosample sex | Biosample Age | ... | dbxrefs | File download URL | Assembly | Platform | Controlled by | File Status | Audit WARNING | Audit INTERNAL_ACTION | Audit NOT_COMPLIANT | Audit ERROR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ENCFF467JYT | fastq | reads | ENCSR146BGM | ChIP-seq | UBERON:0004550 | gastroesophageal sphincter | tissue | adult | female | 51 year | ... |  | https://www.encodeproject.org/files/ENCFF467JY... |  | HiSeq 2000 | /files/ENCFF231JUU/ | released | antibody characterized with exemption, mild to... | mismatched file status, experiment not submitt... | severe bottlenecking, poor library complexity |  |
| ENCFF351KDO | fastq | reads | ENCSR146BGM | ChIP-seq | UBERON:0004550 | gastroesophageal sphincter | tissue | adult | female | 51 year | ... |  | https://www.encodeproject.org/files/ENCFF351KD... |  | HiSeq 2000 | /files/ENCFF432BFB/ | released | antibody characterized with exemption, mild to... | mismatched file status, experiment not submitt... | severe bottlenecking, poor library complexity |  |
| ENCFF961OUB | bam | alignments | ENCSR146BGM | ChIP-seq | UBERON:0004550 | gastroesophageal sphincter | tissue | adult | female | 51 year | ... |  | https://www.encodeproject.org/files/ENCFF961OU... | hg19 |  |  | archived | antibody characterized with exemption, mild to... | mismatched file status, experiment not submitt... | severe bottlenecking, poor library complexity |  |
| ENCFF014FJF | fastq | reads | ENCSR146BGM | ChIP-seq | UBERON:0004550 | gastroesophageal sphincter | tissue | adult | female | 51 year | ... |  | https://www.encodeproject.org/files/ENCFF014FJ... |  | HiSeq 2000 | /files/ENCFF432BFB/ | released | antibody characterized with exemption, mild to... | mismatched file status, experiment not submitt... | severe bottlenecking, poor library complexity |  |
| ENCFF655EPI | fastq | reads | ENCSR146BGM | ChIP-seq | UBERON:0004550 | gastroesophageal sphincter | tissue | adult | female | 51 year | ... |  | https://www.encodeproject.org/files/ENCFF655EP... |  | HiSeq 2000 | /files/ENCFF231JUU/ | released | antibody characterized with exemption, mild to... | mismatched file status, experiment not submitt... | severe bottlenecking, poor library complexity |  |


#### How big is this dataframe?  What type of information is present in the rows? Columns?

In [28]:
print(encode.shape)
print(encode.columns)

(301366, 49)
Index(['File format', 'Output type', 'Experiment accession', 'Assay',
       'Biosample term id', 'Biosample term name', 'Biosample type',
       'Biosample life stage', 'Biosample sex', 'Biosample Age',
       'Biosample organism', 'Biosample treatments',
       'Biosample subcellular fraction term name', 'Biosample phase',
       'Biosample synchronization stage', 'Experiment target',
       'Antibody accession', 'Library made from', 'Library depleted in',
       'Library extraction method', 'Library lysis method',
       'Library crosslinking method', 'Library strand specific',
       'Experiment date released', 'Project', 'RBNS protein concentration',
       'Library fragmentation method', 'Library size range',
       'Biological replicate(s)', 'Technical replicate', 'Read length',
       'Mapped read length', 'Run type', 'Paired end', 'Paired with',
       'Derived from', 'Size', 'Lab', 'md5sum', 'dbxrefs', 'File download URL',
       'Audit INTERNAL_ACTION', 'Audit N

#### Create a new dataframe called encode_filt that includes only samples that:
 - are from human (Homo sapiens)
 - do not have audit errors.  Specifically, only include rows where encode['Audit ERROR'].isnull() is True.
 
For the first criterion, you may need to look at what columns are present in the dataframe to choose the appropriate one to filter on.  Your dataframe should have 223543 rows.

In [29]:
# I'm providing the answer here, so that you can see how to do this

# first, we're creating a variable m1 which asks "is the value equal to 'Homo sapiens'?"
# for each value in the 'Biosample organism' column
m1 = encode['Biosample organism'] == 'Homo sapiens'

# second, we're creating a variable m2 which asks "is there no value, i.e., no error included?"
# for each value in the 'Audit ERROR' column
m2 = encode['Audit ERROR'].isnull()

# now, we're asking, for each element (note that these correspond to rows of the dataframe):
# are both criteria True?
mask = m1 & m2

# now, we're actually filtering the dataframe
encode_filt = encode.loc[mask]
print(encode_filt.shape)

(223543, 49)


#### Breaking briefly from the ENCODE data, to try to illustrate what is going on here:

Here, we've created three arrays, with three values each.  We're performing an element-wise "and" operation - meaning that, at each position, if all three arrays are True, the result is True; otherwise, it is False.

In [30]:
# example on how to merge multiple masks

a = np.array([True, True, False])
b = np.array([False, True, False])
c = np.array([True, True, True])

d = a & b & c

print(d)

# note that you can't do this with lists
# (try it yourself and see what happens)
# arrays make our lives easier

[False  True False]
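
For reference, here is a small sketch of the related mask operators, along with what happens if you try `&` on plain lists. The numbers in `values` are made up:

```python
import numpy as np

a = np.array([True, True, False])
b = np.array([False, True, False])

# & is element-wise "and", | is element-wise "or", ~ flips each value.
print(a & b)
print(a | b)
print(~a)

# The same operator on plain lists raises a TypeError.
try:
    [True, True, False] & [False, True, False]
except TypeError as err:
    print("lists do not support &:", err)

# When building masks inline, wrap each comparison in parentheses,
# because & binds more tightly than comparisons such as > and <.
values = np.array([1, 4, 8, 12])  # made-up numbers
in_range = (values > 2) & (values < 10)
print(in_range)
```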


#### What types of RNA-seq data are available?  Create a dataframe called rna that only has rows that satisfy all of the following criteria:
 - They come from RNA-seq experiments.
 - Their libraries are made from RNA
 - They are depleted in rRNA
 - They are fastq files
 
You will need to look at both the column listings and the unique values in these columns to know what values to filter on.  You will want to look at four columns, create a boolean mask for each of them (an array/Series containing either True or False for each value), and then make a final mask that contains only values where all four sub-masks were True.

Your final 'rna' dataframe should have 1017 rows.

In [31]:
m1 = encode_filt['Assay'] == 'RNA-seq'
m2 = encode_filt['Library made from'] == 'RNA'
m3 = encode_filt['Library depleted in'] == 'rRNA'
m4 = encode_filt['File format'] == 'fastq'
mask = m1 & m2 & m3 & m4

rna = encode_filt[mask]

print(rna.shape)

(1017, 49)


#### Get a list of the unique biosample term names in the rna dataframe.  In other words, a list of biosample term names for which there exists RNA-seq data that satisfied our above criteria.

In [32]:
in_rna = rna['Biosample term name'].unique()
print(in_rna)

['pericardium fibroblast' 'pulmonary artery endothelial cell' 'K562'
 'gastrocnemius medialis' 'metanephros' 'thyroid gland' 'body of pancreas'
 'airway epithelial cell' 'urinary bladder' 'tongue' 'IMR-90'
 'skeletal muscle tissue' 'esophagus muscularis mucosa' 'HeLa-S3'
 'dermis microvascular lymphatic vessel endothelial cell' 'heart'
 'smooth muscle cell of bladder' 'spinal cord' 'MCF-7' 'sigmoid colon'
 'uterus' 'lung' 'skin of body' 'SJSA1' 'A549' 'stomach' 'M059J'
 'suprapubic skin' 'fibroblast of lung' 'SK-N-SH' 'RPMI-7951'
 'hair follicular keratinocyte' 'omental fat pad' 'adrenal gland'
 'vein endothelial cell' 'keratinocyte' 'skeletal muscle satellite cell'
 'HepG2' 'subcutaneous preadipocyte' 'subcutaneous adipose tissue'
 'fibroblast of villous mesenchyme' 'spleen' 'cerebellum' 'temporal lobe'
 'endothelial cell of coronary artery' 'HT1080'
 'smooth muscle cell of the umbilical artery' 'right lobe of liver'
 'transverse colon' 'umbilical cord' 'occipital lobe'
 'glomerular e

#### What types of ChIP-seq data are available?  Create a dataframe called chip that only has rows that satisfy all of the following criteria:
 - They come from ChIP-seq experiments
 - The ChIP-seq target is H3K27ac-human
 - The file format is bed narrowPeak
 - The output type is replicated peaks
 - The bed files were aligned to the GRCh38 assembly.
 
Your final dataframe should have 80 rows.

In [33]:
m1 = encode_filt['Assay'] == 'ChIP-seq'
m2 = encode_filt['Experiment target'] == 'H3K27ac-human'
m3 = encode_filt['File format'] == 'bed narrowPeak'
m4 = encode_filt['Output type'] == 'replicated peaks'
m5 = encode_filt['Assembly'] == 'GRCh38'
mask = m1 & m2 & m3 & m4 & m5
chip = encode_filt.loc[mask]
print(chip.shape)

(80, 49)


#### Get a list of the unique biosample term names in the chip dataframe.

In [34]:
in_chip = chip['Biosample term name'].unique()
print(in_chip)

['ACC112' 'neural cell' 'RWPE1' '22Rv1' 'endodermal cell'
 'gastrocnemius medialis' 'GM12878' 'SUDHL6' 'MCF-7' 'KMS-11'
 'thoracic aorta' 'KOPT-K1' 'RWPE2' 'neutrophil' 'DND-41' 'Loucy' 'VCaP'
 'DOHH2' 'OCI-LY1' 'OCI-LY3' 'A549' 'keratinocyte' 'SK-N-SH' 'C4-2B'
 'epithelial cell of prostate' 'HCT116' 'smooth muscle cell'
 'neuroepithelial stem cell' 'mid-neurogenesis radial glial cells'
 'fibroblast of dermis' 'H9' 'neural progenitor cell' 'osteoblast'
 'radial glial cell' 'induced pluripotent stem cell' 'Karpas-422'
 'astrocyte' 'mammary epithelial cell' 'CD14-positive monocyte' 'Panc1'
 'thyroid gland' 'HeLa-S3' 'myotube' 'IMR-90' 'H1-hESC' 'A673'
 'mesenchymal stem cell' 'MM.1S' 'hepatocyte' 'iPS DF 19.11' 'B cell'
 'mesendoderm' 'endothelial cell of umbilical vein' 'cardiac muscle cell'
 'body of pancreas' 'PC-3' 'skeletal muscle myoblast' 'trophoblast cell'
 'SK-N-MC' 'bipolar neuron' 'adrenal gland' 'right lobe of liver' 'PC-9'
 'neural stem progenitor cell' 'fibroblast of lung' 

#### Now, get a list of the biosample term names which are shared between the two lists.  In other words, find the intersection of biosample term names with RNA and ChIP data satisfying our various criteria.  How many samples are there in this list?

I've provided one way to do this below using list comprehensions - there are many other ways to do this, such as converting the lists to sets, and then finding the intersection of those sets.

In [35]:
list_1 = ['a','b','c','d','e','f','g']
list_2 = ['d','e','f','g','h','i','j']

list_3 = [i for i in list_1 if i in list_2]
print(list_3)

['d', 'e', 'f', 'g']
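
The set-based approach mentioned above might look something like this, on the same two toy lists:

```python
list_1 = ['a', 'b', 'c', 'd', 'e', 'f', 'g']
list_2 = ['d', 'e', 'f', 'g', 'h', 'i', 'j']

# Sets are unordered, so sort the result to get a stable order.
shared = sorted(set(list_1) & set(list_2))
print(shared)

# Equivalent spelling using the method form.
shared_method = sorted(set(list_1).intersection(list_2))
print(shared == shared_method)
```

One difference to keep in mind: converting to sets drops duplicates and the original order, whereas the list comprehension keeps the order (and any repeats) of the first list.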


In [36]:
in_both = [i for i in in_rna if i in in_chip]
print(len(in_both))
print(sorted(in_both))

36
['A549', 'B cell', 'CD14-positive monocyte', 'GM12878', 'H1-hESC', 'HeLa-S3', 'IMR-90', 'Karpas-422', 'MCF-7', 'OCI-LY7', 'PC-3', 'SK-N-SH', 'adrenal gland', 'astrocyte', 'bipolar neuron', 'body of pancreas', 'cardiac muscle cell', 'endothelial cell of umbilical vein', 'fibroblast of arm', 'fibroblast of dermis', 'fibroblast of lung', 'gastrocnemius medialis', 'hepatocyte', 'induced pluripotent stem cell', 'keratinocyte', 'mammary epithelial cell', 'myotube', 'neural cell', 'neural progenitor cell', 'osteoblast', 'right lobe of liver', 'skeletal muscle myoblast', 'smooth muscle cell', 'thoracic aorta', 'thyroid gland', 'vagina']


#### The sample 'gastrocnemius medialis' should be in your list.  Print the data in the rna and chip dataframes that are from this sample.

In [37]:
print(rna[rna['Biosample term name'] == 'gastrocnemius medialis'])
print('\n')
print(chip[chip['Biosample term name'] == 'gastrocnemius medialis'])

               File format Output type Experiment accession    Assay  \
File accession                                                         
ENCFF173AFN          fastq       reads          ENCSR609NZM  RNA-seq   
ENCFF054KNM          fastq       reads          ENCSR609NZM  RNA-seq   
ENCFF387IXO          fastq       reads          ENCSR853BNH  RNA-seq   
ENCFF307UCJ          fastq       reads          ENCSR853BNH  RNA-seq   
ENCFF004CNM          fastq       reads          ENCSR853BNH  RNA-seq   
ENCFF086LCO          fastq       reads          ENCSR853BNH  RNA-seq   
ENCFF660SLV          fastq       reads          ENCSR678TMV  RNA-seq   
ENCFF751QEC          fastq       reads          ENCSR678TMV  RNA-seq   
ENCFF139SHO          fastq       reads          ENCSR967JPI  RNA-seq   
ENCFF825PCO          fastq       reads          ENCSR967JPI  RNA-seq   

               Biosample term id     Biosample term name Biosample type  \
File accession                                              

## The end of Part 1

#### Congratulations on finishing this - I know that it's a lot!