### Data 88: Statistical Genomics - 02/10/20

# Lab 1 - Comparative Genomics through EDA

### by Jonathan Fischer and Shishi Luo

In [None]:
# Import the necessary modules
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as sp
import pandas as pd
from client.api.notebook import Notebook
plt.style.use('fivethirtyeight')

## Let's explore genome sizes for some commonly studied organisms

In this course we will primarily use pandas and numpy to store and manipulate data rather than the datascience module designed for Data 8. These modules are more powerful and are common tools in data analysis in Python. The guided examples below should help you get comfortable with some of the different syntax requirements of these modules. You'll often need to be aware of whether you're working with a pandas data frame or a numpy array, as they are indexed and sliced in different ways. Below are a series of (guided) tasks related to the exploration of some data about genomes of various organisms. You should fill in any ?s with the correct code to complete the task. When in doubt, read the hints in/above each cell or refer to previous cells--hints for all the answers should be somewhere in this document!

Load the table of model organisms and name it model_species.

In [None]:
# In general, we can read in csv files using pandas as follows:
# table_name = pd.read_csv('filename')
# filename = 'https://raw.githubusercontent.com/jrfischer/Data88-Genetics_and_Genomics-SP20/master/Lab1/model_species.csv'

model_species = pd.read_csv('https://raw.githubusercontent.com/jrfischer/Data88-Genetics_and_Genomics-SP20/master/Lab1/model_species.csv')
model_species

We can check the size of our table by using the syntax df.shape. Try it below with our table model_species. The first value gives the number of rows and the second the number of columns. To get only the number of rows, we can use model_species.shape[0].

In [None]:
# Print the shape of model_species
?

Let's rank organisms by their genome size. To sort the table, use table_name.sort_values('Column_name'). To sort in descending order, use the additional option like so: table_name.sort_values('Column_name', ascending = False).
Try both ways out.

In [None]:
model_species.sort_values(?)

Can we extract just the organisms with more than 60,000 genes? This is done with table_name.loc[table_name['Column_name'] ~ condition,], where ~ is a placeholder for a comparison operator such as >, <, or == (it is not meant literally). In this case, use > for greater than.

In [None]:
model_species.loc[?,]

Alternatively, say we simply want the first three rows of our table. We can access them using the .iloc attribute, e.g. table_name.iloc[0:3,]. Try it with model_species.

In [None]:
model_species.iloc[?,]

Here we've demonstrated how to refer to specific columns and rows in pandas. Use df.loc to select rows by label or by a boolean condition, df.iloc to select them by integer position, and df['col_name'] to extract the column named 'col_name'. Also remember that Python is zero-indexed and that slices are half-open: 0:3 covers positions 0, 1, and 2.
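
As a quick reference, the selection patterns above can be tried on a small made-up table. This is only a sketch: the table `toy`, its organism labels, and its values are hypothetical and are not taken from model_species. The explicit `:` after the comma just means "all columns".

```python
import pandas as pd

# Hypothetical toy table; names and values are invented for illustration only.
toy = pd.DataFrame({
    'Organism': ['A', 'B', 'C', 'D'],
    'Size': [12.0, 3100.0, 100.0, 4.6],
    'Genes': [6000, 20000, 19000, 4300],
})

n_rows, n_cols = toy.shape                          # number of rows, number of columns
by_size = toy.sort_values('Size', ascending=False)  # largest Size first
many_genes = toy.loc[toy['Genes'] > 10000, :]       # rows where the condition is True
first_two = toy.iloc[0:2, :]                        # positions 0 and 1; position 2 is excluded
sizes = toy['Size']                                 # a single column

print(n_rows, n_cols)
print(by_size)
print(many_genes)
print(first_two)
```

Note that sort_values returns a new sorted table and leaves `toy` itself unchanged.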

## Let's compare the genome sizes of some pathogens

Now we're going to examine how the genomes of some pathogens compare. Load the provided pathogen information and name the resulting table pathogens.

In [None]:
# filename is 'https://raw.githubusercontent.com/jrfischer/Data88-Genetics_and_Genomics-SP20/master/Lab1/pathogens.csv'

pathogens = pd.read_csv(?)

Take a look at the table. How many organisms does it have? (Don't forget we only want the first entry of df.shape here).

In [None]:
num_pathogens = ?
num_pathogens

With so many organisms, it can be hard to interpret the table. Histograms are a great way to visualize the distribution of a quantity of interest. Let's make a few to investigate our data.

In [None]:
# Make a histogram of genome sizes for all the genomes in our table.
# Hint: plt.hist(table_name['column_name'], bins = b, density = n)
# b gives the number of bins in the histogram
# n is True or False: with True, bar heights are rescaled so the total area of the bars is 1; with False, they are raw counts
# Choose 20 bins and density = False

plt.hist(?)
plt.title('Pathogen genome sizes')
plt.xlabel('Genome size')
plt.ylabel('Count')
plt.show()
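
To see what the density option changes, the sketch below uses np.histogram, which computes the same bin heights that plt.hist draws. The data are randomly generated stand-ins, not the pathogen genome sizes, and the variable names are chosen only for this illustration.

```python
import numpy as np

# Made-up data standing in for a column of genome sizes.
rng = np.random.default_rng(0)
values = rng.exponential(scale=5.0, size=200)

# Raw counts per bin (what density = False shows).
counts, edges = np.histogram(values, bins=20)

# Normalized heights (what density = True shows).
density, _ = np.histogram(values, bins=20, density=True)

widths = np.diff(edges)                   # width of each bin
total_count = counts.sum()                # equals the number of observations
total_area = (density * widths).sum()     # equals 1, up to floating-point rounding

print(total_count, total_area)
```

Because the normalized bars always cover a total area of 1, histograms of groups with very different numbers of observations can be drawn on the same axes and compared by shape.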

In [None]:
# Now let's make a histogram of genome sizes split by subgroup.
# Here we have Bacteria, Fungus, Parasite, and Virus as our category names.
# Use density = True because each group may have a different number of observations

bins = np.linspace(0, 1500, 100)
plt.hist(pathogens['Size'][pathogens['Subgroup'] == 'Bacteria'], bins, density = True, color = 'Blue', alpha = 0.5)
plt.hist(pathogens['Size'][pathogens['Subgroup'] == ?], bins, density = True, color = 'Red', alpha = 0.5)
plt.hist(pathogens['Size'][pathogens['Subgroup'] == ?], bins, density = True, color = 'Yellow', alpha = 0.5)
plt.hist(pathogens['Size'][pathogens['Subgroup'] == ?], bins, density = True, color = 'Black', alpha = 0.5)

plt.title('Pathogen genome sizes')
plt.xlabel('Genome size (MB)')
plt.ylabel('Density')
plt.legend(['Bacteria', 'Fungus', 'Parasite', 'Virus'])
plt.show()
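
The expression pathogens['Size'][pathogens['Subgroup'] == 'Bacteria'] used above is a boolean mask applied to one column. Here is a minimal sketch of the same idea on an invented table (the rows and values in `toy` are hypothetical), together with a count of observations per group, which is the reason for normalizing each histogram separately.

```python
import pandas as pd

# Hypothetical toy table; values are invented for illustration only.
toy = pd.DataFrame({
    'Subgroup': ['Bacteria', 'Virus', 'Bacteria', 'Fungus', 'Virus', 'Bacteria'],
    'Size': [4.6, 0.01, 5.2, 12.1, 0.03, 2.9],
})

mask = toy['Subgroup'] == 'Bacteria'       # True/False for every row
bacteria_sizes = toy['Size'][mask]         # same pattern as in the cell above
same_result = toy.loc[mask, 'Size']        # equivalent selection written with .loc

group_counts = toy['Subgroup'].value_counts()   # number of rows in each subgroup

print(bacteria_sizes)
print(group_counts)
```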

Now that we've seen histograms, let's look at some scatterplots. 

In [None]:
# Make scatterplot of the number of genes vs the genome size in pathogens
# Genome size on X axis (Size), Number of genes on Y axis (Genes)
# plt.scatter(table['X_column'], table['Y_column'])

plt.scatter(pathogens[?], pathogens[?])
plt.title('Number of genes vs genome size in pathogens')
plt.xlabel('Genome size (MB)')
plt.ylabel('Number of genes')
plt.show()

Now let's consider two different measurements of correlation.

In [None]:
# Also print Pearson AND Spearman correlations. 
# Command is print(sp.pearsonr(table['Variable1'], table['Variable2'])[0]) for Pearson
# and print(sp.spearmanr(table['Variable1'], table['Variable2'])[0]) for Spearman

print(sp.pearsonr(pathogens['Size'], pathogens['Genes'])[0], sp.spearmanr(pathogens['Size'], pathogens['Genes'])[0])
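
The two measures can disagree. Pearson correlation measures how close the points are to a straight line, while Spearman correlation is computed on ranks and so only measures whether one variable consistently increases with the other. A small sketch with invented data (x and y below are not from the pathogens table) makes the difference visible:

```python
import numpy as np
import scipy.stats as sp

# Invented data: y always increases with x, but far from linearly.
x = np.arange(1, 11, dtype=float)
y = np.exp(x)

pearson = sp.pearsonr(x, y)[0]     # sensitive to the curvature
spearman = sp.spearmanr(x, y)[0]   # depends only on the ordering of the values

print(pearson, spearman)
```

Here the ranks of x and y agree perfectly, so the Spearman correlation is 1, while the Pearson correlation is noticeably lower. A large gap between the two in real data can hint at a nonlinear relationship or at a few extreme observations.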

In [None]:
# Make scatterplot of the number of proteins vs number of genes in pathogens
# Number of proteins on Y axis, number of genes on X axis
# print both the pearson and spearman correlations between these quantities

plt.scatter(?, ?)
plt.title('Number of proteins vs number of genes in pathogens')
plt.xlabel('Number of genes')
plt.ylabel('Number of proteins')
plt.show()

print(sp.pearsonr(?)[0], sp.spearmanr(?)[0])

Now let's see how to create a dataframe of our own from scratch. Here we'll fill it with some summary statistics about the centrality and spread of our data.

In [None]:
# Compute the mean, median, standard deviation, and interquartile range for 
# the genome sizes, number of genes, and number of proteins. Store them in a df with 
# Mean, Median, SD, and IQR as the columns

# mean: np.mean, median: np.median, standard deviation: np.std, IQR: sp.iqr
# Put these into a table named pathogen_summary with row indices Size, Genes, and Proteins

# Example of how to construct a pandas df
# df = pd.DataFrame(data = [ [row1], [row2], ... , [last_row] ],
# columns = ['first_col_name', ... , 'last_col_name'], index = ['first_index', ... , 'last_index'])


pathogen_summary = pd.DataFrame(data = [ [np.mean(pathogens['Size']), np.median(pathogens['Size']), np.std(pathogens['Size']), sp.iqr(pathogens['Size'])],
    [np.mean(pathogens['Genes']), np.median(pathogens['Genes']), np.std(pathogens['Genes']), sp.iqr(pathogens['Genes'])],
    [np.mean(pathogens['Proteins']), np.median(pathogens['Proteins']), np.std(pathogens['Proteins']), sp.iqr(pathogens['Proteins'])]],
    columns = ['Mean', 'Median', 'SD', 'IQR'], index = ['Size', 'Genes', 'Proteins'])

pathogen_summary
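
The same construction can be written more compactly with a small helper. The sketch below is one possible way to do it, not the required solution: the table `toy`, its values, and the helper `summarize` are all hypothetical. It also illustrates why both the mean and the median are worth reporting when a column contains an extreme value.

```python
import numpy as np
import pandas as pd
import scipy.stats as sp

# Hypothetical toy table; the Size column contains one extreme value.
toy = pd.DataFrame({
    'Size': [1.0, 2.0, 3.0, 4.0, 100.0],
    'Genes': [10.0, 20.0, 30.0, 40.0, 50.0],
})

def summarize(values):
    """Return [mean, median, SD, IQR] for a 1-D numpy array (hypothetical helper)."""
    return [np.mean(values), np.median(values), np.std(values), sp.iqr(values)]

toy_summary = pd.DataFrame(
    data=[summarize(toy[name].to_numpy()) for name in toy.columns],
    columns=['Mean', 'Median', 'SD', 'IQR'],
    index=list(toy.columns),
)

print(toy_summary)
```

In the Size row the single large value pulls the mean far above the median, while the IQR is barely affected. One detail to keep in mind: np.std divides by n by default (ddof = 0), whereas the pandas method df['col'].std() divides by n - 1 (ddof = 1), so the two give slightly different numbers.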

## Let's repeat these steps but for animals

In [None]:
# Load animal information. Name the table animals
# filename is 'https://raw.githubusercontent.com/jrfischer/Data88-Genetics_and_Genomics-SP20/master/Lab1/animals.csv'

animals = ?
animals

In [None]:
# How many different animal species are in the table?
?

In [None]:
# Histogram of animal genome sizes (in megabases)

plt.hist(?, bins = 20, density = False)
plt.title('Animal genome sizes')
plt.xlabel('Genome size')
plt.ylabel('Count')
plt.show()

In [None]:
# Print the pearson and spearman correlations between genome sizes and numbers of genes
print(sp.pearsonr(?, ?)[0], sp.spearmanr(?, ?)[0])

In [None]:
# Histograms of genome sizes split by Subgroup (aka pivot histograms)
# Here we have 'Birds', 'Fishes', 'Flatworms', 'Insects', 'Mammals', 'Reptiles' as our category names.
# Use density = True because each group may have a different number of observations
# Fill in the ?'s with the correct syntax to make the plot

bins = np.linspace(0, 3500, 50)
plt.hist(?, bins, density = True, color = 'Blue', alpha = 0.5)
plt.hist(?, bins, density = True, color = 'Red', alpha = 0.5)
plt.hist(?, bins, density = True, color = 'Yellow', alpha = 0.5)
plt.hist(?, bins, density = True, color = 'Black', alpha = 0.5)
plt.hist(?, bins, density = True, color = 'Green', alpha = 0.5)
plt.hist(?, bins, density = True, color = 'Orange', alpha = 0.5)


plt.title('Animal genome sizes')
plt.xlabel('Genome size (MB)')
plt.ylabel('Density')
plt.legend(['Birds', 'Fishes', 'Flatworms', 'Insects', 'Mammals', 'Reptiles'])
plt.show()

In [None]:
# Make a scatter plot of the number of genes (Y-axis) vs genome size (X-axis) in animals
plt.scatter(?, ?)
plt.title('Number of genes vs genome size in animals')
plt.xlabel('Genome size (MB)')
plt.ylabel('Number of genes')
plt.show()

In [None]:
# Print correlations
print(?, ?)

In [None]:
# Make scatterplot of the number of proteins vs number of genes in animals
# Print correlations

plt.scatter(?, ?)
plt.title('Number of proteins vs number of genes in animals')
plt.xlabel('Number of genes')
plt.ylabel('Number of proteins')
plt.show()

print(?, ?)

In [None]:
# Table of summary statistics (name it animal_summary)

animal_summary = ?

animal_summary

## To submit

In [None]:
ok = Notebook('Lab01_EDA.ok')
_ = ok.auth(inline = True)

In [None]:
# Submit the assignment.
_ = ok.submit()