# NumPy - Part 2

NumPy has a [long list of functions](https://docs.scipy.org/doc/numpy-1.15.1/genindex.html), too many to go over here, so please refer to its [manual](https://docs.scipy.org/doc/numpy-1.15.1/reference/) for the full reference. You can also use tab completion to see the available functions.

In this notebook we'll cover more features of NumPy.

> Most of the code in this notebook has been taken or adapted from the [Bootcamp for Biology notebook](http://nbviewer.jupyter.org/url/atwallab.cshl.edu/teaching/QBbootcamp3.ipynb) prepared by Mickey Atwal, Cold Spring Harbor Laboratory. ([Lab website](http://atwallab.cshl.edu))
> A gentle introduction to some elements of scientific programming in Python.
>
> PS: the website no longer exists; here is the Internet Archive [link](https://web.archive.org/web/20190913002839/http://atwallab.cshl.edu:80/GA_teaching.html).


### 1. Loading data

We will be working with a text file containing the nucleotide counts of the *E. coli* DNA binding sites of the transcription factor CRP (cAMP receptor protein), also known as CAP (catabolite gene activator protein). You can find a copy at http://atwallab.cshl.edu/links/crp_counts_matrix.txt.

In [None]:
# web URL address of the file 
# http://atwallab.cshl.edu/links/crp_counts_matrix.txt

# name of the file to be saved into
filename="data/crp_counts_matrix.txt"

This is a small tab-delimited text file in which the count data at each of the 42 nucleotide positions is stored as a series of strings. Let's take a look using the Unix command `head`.

In [None]:
!head data/crp_counts_matrix.txt

We need to convert this to a numerical array on which we can perform computations. The [genfromtxt](http://docs.scipy.org/doc/numpy/reference/generated/numpy.genfromtxt.html) function in NumPy automatically generates a NumPy array from a text file.

In [None]:
# load data from the text file and store it in an integer NumPy array called 'counts'
import numpy as np
counts=np.genfromtxt(filename,dtype=int)

> There's an alternative function to read txt files: `np.loadtxt()`
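
As a minimal sketch of how the two readers compare, the example below parses the same tab-delimited text with both functions. The in-memory `text` string is a hypothetical stand-in for a real file, used only so that the example runs without any data on disk. For clean numeric input like this, the two calls should produce the same array; `np.genfromtxt` additionally offers options for handling missing values.

```python
import io
import numpy as np

# hypothetical tab-delimited text standing in for a small counts file
text = "1\t2\t3\t4\n5\t6\t7\t8\n"

# both functions accept a file name or a file-like object
from_genfromtxt = np.genfromtxt(io.StringIO(text), dtype=int)
from_loadtxt = np.loadtxt(io.StringIO(text), dtype=int)

print(from_genfromtxt.shape, from_loadtxt.shape)
print(np.array_equal(from_genfromtxt, from_loadtxt))
```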

### 2. Working with numerical arrays

In [None]:
counts.ndim # what is the dimensionality of the array?

In [None]:
counts.shape # what is the size of the array? (rows, columns)

In [None]:
counts.dtype # what is the data type of the array?

#### Array indexing

Let's practise some array indexing to recall how it works in Python.

In [None]:
# let's have a look at the first five rows
counts[:5]

In [None]:
# the first row
counts[0]

In [None]:
# rows 2 to 3, i.e. the second and third rows
counts[1:3]

In [None]:
# the second column
counts[:,1]

In [None]:
# the last two rows
counts[-2:]

In [None]:
# every third row beginning with the first
counts[::3]

In [None]:
# rows 3 to 4, and columns 2 to 4 
counts[2:4,1:4]

Here's a summary of indexing and slicing. (*[Image source](https://scipy-lectures.org/intro/numpy/array_object.html#indexing-and-slicing)*)

![](images/numpy_indexing.png)
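
The same slicing patterns can be checked on an array small enough to verify by eye. This is only an illustrative sketch: `toy` is a made-up 4x3 array, not part of the CRP data.

```python
import numpy as np

# made-up 4x3 array: rows are [0 1 2], [3 4 5], [6 7 8], [9 10 11]
toy = np.arange(12).reshape(4, 3)

first_row = toy[0]        # a single row
second_col = toy[:, 1]    # all rows, column index 1
last_two = toy[-2:]       # negative indices count from the end
every_other = toy[::2]    # step of 2: rows 0 and 2
block = toy[1:3, 0:2]     # rows 1-2 and columns 0-1 (stop index is excluded)

print(first_row, second_col)
print(block)
```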

#### Reshaping

In [None]:
counts.reshape(21,8)

In [None]:
counts.reshape(4,2,21)

In [None]:
counts.T
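
Reshaping only works when the new shape holds exactly the same number of elements as the original. The sketch below illustrates this on a made-up 12-element array (`flat` is a hypothetical name), including the use of `-1` to let NumPy infer one dimension.

```python
import numpy as np

flat = np.arange(12)          # 12 elements

grid = flat.reshape(3, 4)     # 3 * 4 = 12
auto = flat.reshape(-1, 6)    # -1 asks NumPy to infer this dimension (here 2)
cube = flat.reshape(2, 3, 2)  # 2 * 3 * 2 = 12

# a shape with the wrong number of elements raises a ValueError
try:
    flat.reshape(5, 3)
    mismatch_raised = False
except ValueError:
    mismatch_raised = True

print(grid.shape, auto.shape, cube.shape, grid.T.shape, mismatch_raised)
```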

#### Computations on arrays

In [None]:
# the minimum and maximum element of the array
np.min(counts), np.max(counts)

In [None]:
# select the elements greater than 200
counts[counts>200]

In [None]:
counts >200

In [None]:
# what are the indices of the elements greater than 200?
# The NumPy function "where" returns the indices in separate arrays of rows and columns.
np.where(counts>200)

In [None]:
# select elements that are greater than 200 and also divisible by 3, i.e. counts mod 3 = 0
counts[(counts>200) & (counts%3==0)]
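
To see how the boolean mask, `np.where` and the combined condition relate to each other, here is a small sketch on a made-up 2x2 array (`toy` and its values are hypothetical, chosen so the result is easy to check by hand).

```python
import numpy as np

toy = np.array([[150, 210],
                [301,  90]])

mask = toy > 200                 # boolean array with the same shape as toy
selected = toy[mask]             # 1-D array of the elements where mask is True
rows, cols = np.where(mask)      # row indices and column indices of those elements

# indexing with the two index arrays recovers the same elements
same_elements = toy[rows, cols]

# combine conditions element-wise with & (each condition needs its own parentheses)
both = toy[(toy > 200) & (toy % 3 == 0)]

print(mask)
print(selected, rows, cols, both)
```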

Dot product. When performing operations on arrays we frequently have to take the dot product of two lists of numbers, or two vectors, e.g. $\vec{x}=\{x_1,x_2,x_3\}$ and $\vec{y}=\{y_1,y_2,y_3\}$. The dot product $\vec{x} \cdot \vec{y}$ is defined as
$$
\vec{x} \cdot \vec{y} = \sum_{i=1}^{3} x_i y_i
$$
NumPy provides an efficient way of doing this without explicitly writing a `for` loop.

In [None]:
# show the first five rows; we will take the dot product between rows 3 and 4
counts[:5]


In [None]:
np.dot(counts[2],counts[3])

In [None]:
111*89 + 70*68 + 70*83 + 91*102
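
For comparison, the sketch below computes a dot product three ways: with an explicit loop that follows the definition of the sum, with `np.dot`, and with the `@` operator. The vectors `x_vec` and `y_vec` are made-up values for illustration.

```python
import numpy as np

x_vec = np.array([1, 2, 3])
y_vec = np.array([4, 5, 6])

# explicit loop following the definition: sum over i of x_i * y_i
loop_total = 0
for xi, yi in zip(x_vec, y_vec):
    loop_total += xi * yi

dot_total = np.dot(x_vec, y_vec)   # NumPy function
at_total = x_vec @ y_vec           # matrix-multiplication operator, same result for 1-D arrays

print(loop_total, dot_total, at_total)
```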

In [None]:
# sum each column of the array, i.e. sum along the rows, the dimension indexed as 0
np.sum(counts,0) 

In [None]:
# sum each row of the array, i.e. sum along the columns, the dimension indexed as 1
np.sum(counts,1) 

In [None]:
# mean, median and standard deviation of each column
np.mean(counts,0), np.median(counts,0), np.std(counts,0) 
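
The axis argument is easy to mix up, so here is a small sketch on a made-up 2x3 array (`toy` is hypothetical). The axis you pass is the one that gets collapsed: `axis=0` leaves one value per column, `axis=1` leaves one value per row.

```python
import numpy as np

toy = np.array([[1, 2, 3],
                [4, 5, 6]])

col_sums = np.sum(toy, axis=0)    # collapse the rows: one sum per column
row_sums = np.sum(toy, axis=1)    # collapse the columns: one sum per row
col_means = np.mean(toy, axis=0)  # one mean per column

print(col_sums, row_sums, col_means)
```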

We can add pseudocounts to each element. This is usually a good idea if your count data is undersampled.

In [None]:
# add 1 to EVERY element of the counts matrix to form a new matrix 'new_counts'
new_counts=counts+1

Let's calculate the probabilities of each nucleotide at each position, e.g. the probability of seeing an A at position *i* is

$$
p_i(A) = \frac{\mathrm{counts}_i(A)}{\mathrm{counts}_i(A)+\mathrm{counts}_i(T)+\mathrm{counts}_i(G)+\mathrm{counts}_i(C)}
$$

The total count is the same for all positions, so we can evaluate it from the first position alone.

In [None]:
total_counts=sum(new_counts[0])
total_counts

In [None]:
prob=new_counts/total_counts
print(prob)

In [None]:
new_counts[0]

In [None]:
total_counts
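
If the totals differed between positions, dividing by a single number would no longer give rows that sum to 1. A minimal sketch of a more general normalisation, on a made-up array (`toy_counts` is hypothetical), divides each row by its own total using `keepdims=True` so that the shapes broadcast correctly.

```python
import numpy as np

# made-up counts for two positions with different totals (10 and 20)
toy_counts = np.array([[1, 2, 3, 4],
                       [5, 5, 5, 5]])

# keepdims=True keeps the result as a column of shape (2, 1) instead of (2,)
row_totals = toy_counts.sum(axis=1, keepdims=True)

# each row is divided by its own total
toy_prob = toy_counts / row_totals

print(toy_prob)
print(toy_prob.sum(axis=1))
```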

It's often a good idea to represent the data graphically to glean what's going on.

In [None]:
%matplotlib inline

import matplotlib.pyplot as plt

# set the size of the figure
plt.figure(figsize=[15,2])

# show the array flipped (transposed) and with no colour interpolation smoothing
plt.imshow(prob.T,interpolation='nearest')

# set the ticks
plt.xticks(range(0,42),range(1,43))
plt.yticks(range(4),['A','C','G','T'])

# set the colorbar
plt.clim([0,1])
plt.colorbar(ticks=np.arange(0,1,0.2))

# title
plt.title('base frequency matrix of CRP binding sites',fontsize=15)

# show the figure
plt.show()

In [None]:
import matplotlib.pyplot as plt

# the same matrix without the transpose: 42 rows (positions) by 4 columns (nucleotides)
plt.imshow(prob,interpolation='nearest')

In [None]:
rand_points = np.random.random(256).reshape(16,16)
rand_points 

In [None]:
plt.imshow(rand_points)
plt.clim([0,1])
plt.colorbar(ticks=np.arange(0,1,0.2))

### 3. Random sampling, correlations and statistical tests

#### Gaussian distribution

A frequent task in scientific programming is to analyze random samples from a known distribution. A commonly used distribution is the Gaussian (normal) distribution

$$
p(x)=\frac{1}{\sqrt{2 \pi \sigma^2}} \exp \left( -\frac{(x-\mu)^2}{2 \sigma^2} \right)
$$

where $\sigma$ is the standard deviation and $\mu$ is the mean.

In [None]:
# number of samples
N_gauss = 5000

# standard deviation
s = 2

# mean
u = 5

# draw random samples from a Gaussian distribution
samples=np.random.normal(u,s,N_gauss)

In [None]:
samples
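
A quick sanity check on random draws is to compare their sample mean and standard deviation with the parameters used to generate them. The sketch below uses NumPy's `default_rng` generator with a fixed seed so the result is reproducible; the seed and the names `rng` and `draws` are assumptions for illustration, not part of the original notebook.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed (an assumption) for reproducibility

mu, sigma = 5, 2                 # same mean and standard deviation as above
draws = rng.normal(mu, sigma, 5000)

sample_mean = np.mean(draws)
sample_std = np.std(draws)

# with 5000 draws, both should land close to the true parameters
print(sample_mean, sample_std)
```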

We can plot the data in two different ways:

In [None]:
plt.figure(figsize=(12,5))

# plot histogram
plt.subplot(1,2,1)
plt.hist(samples,bins=20)
plt.title('histogram')

# plot boxplot
plt.subplot(1,2,2)
# the default is to produce a vertical boxplot. Setting vert=False gives a horizontal plot
plt.boxplot(samples,vert=False) 
plt.title('boxplot')

plt.show()
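
The histogram can also be compared numerically with the Gaussian formula given earlier. As a rough sketch, the code below builds a normalised histogram with `np.histogram(..., density=True)` and evaluates $p(x)$ at the bin centres. The seed and variable names are assumptions, and the agreement is only approximate because of sampling noise and the finite bin width.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed (an assumption) for reproducibility
mu, sigma = 5, 2
draws = rng.normal(mu, sigma, 5000)

# density=True rescales the bin counts so that the histogram integrates to 1
density, edges = np.histogram(draws, bins=20, density=True)
centers = (edges[:-1] + edges[1:]) / 2

# Gaussian probability density evaluated at the bin centres
pdf = np.exp(-(centers - mu) ** 2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

area = np.sum(density * np.diff(edges))     # total area under the histogram
max_gap = np.max(np.abs(density - pdf))     # largest difference from the formula

print(area, max_gap)
```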

### Curve Fitting

A frequent task in science is to fit a curve to data and guess the underlying generative model. Let's make up some fake noisy data of the form $y=x^3 + \eta$, where $\eta$ is noise drawn from a Gaussian (normal) distribution.

In [None]:
import numpy as np

# number of data points
N=20

# N equally spaced x values, from -20 to 20
x=np.linspace(-20,20,N)

# noise term: N samples from a gaussian distribution with mean 0 and standard deviation 1000
noise=np.random.normal(0,1000,N)

# y values
y=x**3+noise

Let's plot our fake noisy data.

In [None]:
import matplotlib.pyplot as plt 
plt.plot(x,y,'ro') # 'r' indicates red, 'o' indicates a small circle
plt.title('noisy data')
plt.xlabel('x')
plt.ylabel('y')
plt.show()

Now we will try to fit polynomials of various orders using the NumPy `polyfit` function.

In [None]:
# straight line fit. The "fit1" array consists of the coefficients of the best linear fit
fit1=np.polyfit(x,y,1)

# 3rd order polynomial fit
fit3=np.polyfit(x,y,3)

# 19th order polynomial fit (NumPy may warn that such a high-order fit is poorly conditioned)
fit19=np.polyfit(x,y,19)

# create functions from the fits
y_1=np.poly1d(fit1)
y_3=np.poly1d(fit3)
y_19=np.poly1d(fit19)

print('linear fit: y_1=(%.2f)x+(%.2f)' % (fit1[0],fit1[1]))
print('3rd order fit: y_3=(%.2f)x^3 + (%.2f)x^2 + (%.2f)x + (%.2f)' % (fit3[0],fit3[1],fit3[2],fit3[3]))

In [None]:
plt.plot(x,y,'ro')
plt.plot(x,y_1(x))
plt.plot(x,y_3(x))
plt.plot(x,y_19(x))

# add a legend to the right of the plot
legend_text=('data','linear fit: $y=mx+c$ ','3rd order polynomial','19th order polynomial')
plt.legend(legend_text, loc='center left', bbox_to_anchor=(1,0.5))

plt.show()

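The high order (19th) fit clearly tracks the data more closely: with 20 data points, a 19th order polynomial has 20 coefficients, enough to pass through every point. However, this means it fits the noise as well as the underlying curve, so it overfits the data and will perform poorly on new data sampled from the same noisy curve.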
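
One way to check the overfitting claim is to score each fit on points it was not trained on. The sketch below is an illustration under stated assumptions rather than the notebook's own method: it uses `numpy.polynomial.Polynomial.fit` instead of `np.polyfit` (it rescales `x` internally, which tends to be numerically better behaved for high orders), a fixed seed, and held-out points placed at the midpoints between the training `x` values. The helper `mse` and all variable names are hypothetical.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)   # fixed seed (an assumption) for reproducibility

# training data of the same form as above: y = x^3 + Gaussian noise
n_points = 20
x_train = np.linspace(-20, 20, n_points)
y_train = x_train**3 + rng.normal(0, 1000, n_points)

# held-out data: midpoints between the training x values, with fresh noise
x_test = (x_train[:-1] + x_train[1:]) / 2
y_test = x_test**3 + rng.normal(0, 1000, x_test.size)

def mse(model, x, y):
    """Mean squared error of a fitted polynomial on the points (x, y)."""
    return np.mean((model(x) - y) ** 2)

cubic = Polynomial.fit(x_train, y_train, 3)
high_order = Polynomial.fit(x_train, y_train, 19)

for name, model in [("3rd order", cubic), ("19th order", high_order)]:
    print(name,
          "training MSE:", mse(model, x_train, y_train),
          "held-out MSE:", mse(model, x_test, y_test))
```

The 19th order fit should show the smaller training error but the much larger held-out error, which is the signature of overfitting.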