## Hands-on usage of numpy arrays

Before starting this notebook you should have completed your assigned reading, chapters 2-3 of the [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/02.00-introduction-to-numpy.html), which includes a tutorial on using the package `numpy`. We'll spend a lot of time talking about and using `numpy` because it is by far the most widely used package for scientific computing in Python, and it is incredibly powerful. Follow along with your reading and execute code in a notebook to try out the functions and concepts it introduces. Here, I provide a number of additional exercises that have biological significance and may be more interesting.

### Required software

In [1]:
# conda install numpy 

In [2]:
import numpy as np


### Numpy arrays
Numpy arrays are very efficient for storing and operating on collections of values that all share the same type. This makes the data type (`dtype`) of an array an important concept, which we will explore further below.
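
As a small illustration of the point about types, the sketch below builds a few arrays and inspects their `dtype`. The variable names are made up for this example, and the default integer dtype depends on your platform, so it is printed but not assumed.

```python
import numpy as np

# integers only: numpy picks a platform-dependent integer dtype
ints = np.array([1, 2, 3])
print(ints.dtype)

# mixing ints and floats upcasts every element to float64
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)

# single-character strings are stored with a fixed-width unicode dtype
bases = np.array(list("ACGT"))
print(bases.dtype)

# the dtype can also be set explicitly, here one byte per value
small = np.array([0, 1, 1, 0], dtype=np.int8)
print(small.dtype, small.itemsize)
```

Because every element has the same dtype, numpy knows exactly how many bytes each value occupies, which is a large part of why array operations are fast.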

### Modifying arrays (a copy versus a view)
Although arrays seem similar to lists, they are in fact very different, and you will likely run into errors early on because of this confusion. Arrays can be indexed and sliced like lists, and they are mutable, so you can change the values within an array. The difference lies in what a slice returns. An array keeps a single copy of its data in memory unless you explicitly ask for another one with the `.copy()` method. When you slice an array, what you get back is called a view: a window onto part of the original array's data. If you modify the view, you modify the original array as well.

Slicing a list, on the other hand, returns a new list, so the original is unchanged if you reassign elements of the slice. This is demonstrated below.

In [3]:
# mutate a list element
ll = ['a', 'b', 'c']
ll[0] = 'd'
print(ll)

['d', 'b', 'c']


In [4]:
# mutate an array element
arr = np.array(['a', 'b', 'c'])
arr[0] = 'd'
print(arr)

['d' 'b' 'c']


In [5]:
# mutate a copy and both exist as separate instances
lc = ll.copy()
lc[0] = 'e'
print(lc, ll)

['e', 'b', 'c'] ['d', 'b', 'c']


In [6]:
# same with arrays
carr = arr.copy()
carr[0] = 'e'
print(carr, arr)

['e' 'b' 'c'] ['d' 'b' 'c']


In [7]:
# slicing behaves differently: a list slice returns a copy
lsub = ll[:2]
lsub[0] = 'x'
print(lsub, ll)

['x', 'b'] ['d', 'b', 'c']


In [8]:
# but an array slice returns a view (it shares data with the original)
asub = arr[:2]
asub[0] = 'x'
print(asub, arr)

['x' 'b'] ['x' 'b' 'c']


In [9]:
# to edit only a copy of it you must explicitly use copy()
asub = arr[:2].copy()
asub[0] = 'y'
print(asub, arr)

['y' 'b'] ['x' 'b' 'c']
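
If you are ever unsure whether you are holding a view or a copy, you can ask numpy directly. The sketch below uses `np.shares_memory` and the `.base` attribute; the names `sub_view` and `sub_copy` are just for illustration.

```python
import numpy as np

arr = np.array(['a', 'b', 'c'])

sub_view = arr[:2]          # basic slice: a view onto arr's data
sub_copy = arr[:2].copy()   # explicit copy: independent data

# a view shares memory with the original, a copy does not
print(np.shares_memory(arr, sub_view))
print(np.shares_memory(arr, sub_copy))

# .base points at the array that owns the data (None if the array owns it)
print(sub_view.base is arr)
print(sub_copy.base is None)
```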


## Genomic sequence data as an array
The characters A, C, G, and T can be sampled into an array to represent a sequence of DNA. Here we use the `random` module of numpy (`np.random`), which is similar to the `random` module from the standard library but much more powerful: it returns arrays and offers many more methods for sampling from random distributions, as we'll see. The array of sequence data in this case has 6 rows and 12 columns; in other words, we have data for 6 haploid individuals at 12 sites of DNA.

In [10]:
np.random.seed(12345)                                        # init a random seed
seq = np.random.choice(list("ACGT"), size=12, replace=True)  # make array that is 12 bases long
seqs = np.array([seq]*6)                                     # stack 6 copies of seq as rows

In [11]:
print(seqs)

[['G' 'C' 'C' 'C' 'A' 'C' 'G' 'G' 'C' 'G' 'C' 'C']
 ['G' 'C' 'C' 'C' 'A' 'C' 'G' 'G' 'C' 'G' 'C' 'C']
 ['G' 'C' 'C' 'C' 'A' 'C' 'G' 'G' 'C' 'G' 'C' 'C']
 ['G' 'C' 'C' 'C' 'A' 'C' 'G' 'G' 'C' 'G' 'C' 'C']
 ['G' 'C' 'C' 'C' 'A' 'C' 'G' 'G' 'C' 'G' 'C' 'C']
 ['G' 'C' 'C' 'C' 'A' 'C' 'G' 'G' 'C' 'G' 'C' 'C']]


### Slicing and boolean masks

In [12]:
# select the first four rows
seqs[:4, :]

array([['G', 'C', 'C', 'C', 'A', 'C', 'G', 'G', 'C', 'G', 'C', 'C'],
       ['G', 'C', 'C', 'C', 'A', 'C', 'G', 'G', 'C', 'G', 'C', 'C'],
       ['G', 'C', 'C', 'C', 'A', 'C', 'G', 'G', 'C', 'G', 'C', 'C'],
       ['G', 'C', 'C', 'C', 'A', 'C', 'G', 'G', 'C', 'G', 'C', 'C']],
      dtype='<U1')

In [13]:
# select the last six columns
seqs[:, -6:]

array([['G', 'G', 'C', 'G', 'C', 'C'],
       ['G', 'G', 'C', 'G', 'C', 'C'],
       ['G', 'G', 'C', 'G', 'C', 'C'],
       ['G', 'G', 'C', 'G', 'C', 'C'],
       ['G', 'G', 'C', 'G', 'C', 'C'],
       ['G', 'G', 'C', 'G', 'C', 'C']], dtype='<U1')

In [14]:
# select first two rows and first four columns
seqs[:2, :4]

array([['G', 'C', 'C', 'C'],
       ['G', 'C', 'C', 'C']], dtype='<U1')

In [15]:
# create boolean mask of whether element is "G"
seqs == "G"

array([[ True, False, False, False, False, False,  True,  True, False,
         True, False, False],
       [ True, False, False, False, False, False,  True,  True, False,
         True, False, False],
       [ True, False, False, False, False, False,  True,  True, False,
         True, False, False],
       [ True, False, False, False, False, False,  True,  True, False,
         True, False, False],
       [ True, False, False, False, False, False,  True,  True, False,
         True, False, False],
       [ True, False, False, False, False, False,  True,  True, False,
         True, False, False]])

In [16]:
# convert the mask to int8 values (easier to read than True/False)
np.int8(seqs == "G")

array([[1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0],
       [1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0],
       [1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0],
       [1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0],
       [1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0],
       [1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0]], dtype=int8)

In [17]:
# create boolean mask for whether any sites in a column are "G"
np.any(seqs == "G", axis=0)

array([ True, False, False, False, False, False,  True,  True, False,
        True, False, False])
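
A boolean mask becomes most useful once it is used as an index. The sketch below, on a small hand-made alignment (illustrative data, not the `seqs` array above), selects columns with a mask and with a list of integer positions. Note that, unlike a slice, both of these forms return a copy rather than a view.

```python
import numpy as np

# a small hand-made alignment: 3 sequences x 5 sites (illustrative data)
seqs = np.array([
    list("GCCAT"),
    list("GCTAT"),
    list("ACCAG"),
])

# boolean mask over columns: does any row have a "G" at this site?
has_g = np.any(seqs == "G", axis=0)
print(has_g)

# boolean indexing: keep only the columns where the mask is True
print(seqs[:, has_g])

# integer-array indexing: pick columns 0 and 3, in that order
print(seqs[:, [0, 3]])

# indexing with a mask or an index list returns a copy, not a view
sub = seqs[:, has_g]
sub[0, 0] = "N"
print(seqs[0, 0])   # the original is unchanged
```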

## Generate variable sequence data

Don't worry too much about this function for now; we'll dive into it in detail in the next notebook. Here we'll just use it to generate variable sequence data.

In [18]:
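def seqdata(ninds, nsites):
    # sample one random sequence and repeat it as a row for each individual
    oseq = np.random.choice(list("ACGT"), size=nsites)
    arr = np.array([oseq for i in range(ninds)])

    # introduce some mutations: each cell is flagged with probability 0.1
    muts = np.random.binomial(1, 0.1, (ninds, nsites))
    for col in range(nsites):
        # pick one alternative base for this site, different from the original
        newbase = np.random.choice(list(set("ACTG") - set(arr[0, col])))
        # assign it to the individuals flagged as mutated at this site
        mask = muts[:, col].astype(bool)
        arr[:, col][mask] = newbase
    return arr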

In [19]:
# generate an array of variable sequence data
np.random.seed(123)
arr = seqdata(8, 10)
print(arr)

[['G' 'T' 'G' 'G' 'A' 'G' 'G' 'C' 'T' 'G']
 ['G' 'C' 'G' 'G' 'A' 'G' 'G' 'C' 'T' 'G']
 ['G' 'C' 'G' 'G' 'A' 'G' 'G' 'C' 'T' 'G']
 ['G' 'C' 'G' 'C' 'A' 'G' 'G' 'C' 'T' 'G']
 ['G' 'C' 'A' 'G' 'A' 'G' 'G' 'C' 'T' 'G']
 ['G' 'C' 'G' 'G' 'A' 'G' 'G' 'C' 'T' 'G']
 ['G' 'C' 'G' 'G' 'A' 'G' 'G' 'C' 'T' 'G']
 ['G' 'C' 'G' 'G' 'A' 'G' 'G' 'C' 'T' 'A']]


### Find variable sites
Here we can use broadcasting to compare sequences and find out whether there is any variation among them. We could examine each column individually and count the number of distinct elements in it, but a much easier way is to perform an operation over an `axis` of the array that returns True or False depending on whether there is variation. One way is to compare every row to the first row. Broadcasting makes this work: across all rows, each value in each column is compared to the first-row element of that column.

In [20]:
# ask which elements differ from the first row
print(arr != arr[0])

[[False False False False False False False False False False]
 [False  True False False False False False False False False]
 [False  True False False False False False False False False]
 [False  True False  True False False False False False False]
 [False  True  True False False False False False False False]
 [False  True False False False False False False False False]
 [False  True False False False False False False False False]
 [False  True False False False False False False False  True]]


In [21]:
# reduce over rows with any() to get columns (sites) that are variable
np.any(arr != arr[0], axis=0)

array([False,  True,  True,  True, False, False, False, False, False,
        True])
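
To make the broadcasting step more concrete, the sketch below looks at the shapes involved on a tiny hand-made array (illustrative data) and checks that the broadcast comparison matches one against an explicitly repeated first row built with `np.tile`.

```python
import numpy as np

# 3 sequences x 4 sites (illustrative data)
arr = np.array([
    list("GTGA"),
    list("GCGA"),
    list("GCAA"),
])

first = arr[0]
print(arr.shape, first.shape)   # a (3, 4) array compared with a (4,) row

# broadcasting stretches the single row across all 3 rows
diff = arr != first
print(diff.shape)

# same result as comparing against the first row repeated 3 times
explicit = arr != np.tile(first, (arr.shape[0], 1))
print(np.array_equal(diff, explicit))

# reducing over axis=0 collapses the rows, leaving one value per column
variable = np.any(diff, axis=0)
print(variable)
```

Broadcasting avoids building the repeated array in memory, which matters once the alignment has many rows.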

## Filtering and calculating population genetic statistics

Often we are interested in filtering sequence data based on some criterion before we calculate statistics on it. Examples include removing sites with missing data (often coded in DNA by the character `N`), or removing sites with rare alleles (if an allele is found in very few individuals it may just be an error). The latter is usually done by setting a minimum minor allele frequency (MAF). Let's practice calculating the minor allele frequency and filtering based on it.

In [22]:
# generate a larger array of variable sequence data
np.random.seed(12345)
arr = seqdata(16, 10)
print(arr)

[['G' 'T' 'C' 'C' 'A' 'C' 'T' 'G' 'C' 'G']
 ['G' 'C' 'C' 'A' 'A' 'C' 'G' 'G' 'C' 'G']
 ['G' 'T' 'C' 'C' 'A' 'C' 'G' 'C' 'C' 'G']
 ['G' 'C' 'C' 'C' 'A' 'C' 'G' 'G' 'C' 'G']
 ['G' 'C' 'C' 'C' 'A' 'C' 'G' 'G' 'A' 'G']
 ['G' 'C' 'C' 'C' 'G' 'C' 'G' 'G' 'C' 'G']
 ['G' 'C' 'C' 'A' 'A' 'C' 'G' 'G' 'C' 'G']
 ['G' 'C' 'C' 'C' 'G' 'A' 'G' 'G' 'C' 'G']
 ['G' 'C' 'C' 'C' 'A' 'C' 'G' 'G' 'A' 'G']
 ['T' 'C' 'C' 'C' 'A' 'C' 'G' 'G' 'C' 'G']
 ['G' 'C' 'C' 'C' 'A' 'C' 'G' 'C' 'C' 'G']
 ['T' 'C' 'C' 'C' 'A' 'A' 'G' 'G' 'C' 'G']
 ['T' 'C' 'C' 'C' 'A' 'C' 'G' 'G' 'C' 'G']
 ['G' 'C' 'G' 'C' 'A' 'C' 'G' 'G' 'C' 'G']
 ['G' 'C' 'C' 'C' 'A' 'C' 'G' 'G' 'C' 'G']
 ['T' 'C' 'G' 'C' 'A' 'C' 'G' 'G' 'C' 'G']]


### Calculate the frequency of the rare allele in each column
Let's think about how to do this. First, we need to find which elements in each column differ from a reference (here, the first row), then we need to count them, and then divide by the length of the column to get the value as a frequency. All of this information is already present in the operations we performed above to find the variable sites, so let's use that same framework here.

#### 1. View which sites are variable

In [26]:
# view boolean mask as int8s
np.int8(arr != arr[0])

# make a matrix of True/False for whether each element differs from the first row; int8 converts False/True to 0/1

array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 1, 0, 1, 0, 0, 1, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 1, 1, 0, 0],
       [0, 1, 0, 0, 0, 0, 1, 0, 0, 0],
       [0, 1, 0, 0, 0, 0, 1, 0, 1, 0],
       [0, 1, 0, 0, 1, 0, 1, 0, 0, 0],
       [0, 1, 0, 1, 0, 0, 1, 0, 0, 0],
       [0, 1, 0, 0, 1, 1, 1, 0, 0, 0],
       [0, 1, 0, 0, 0, 0, 1, 0, 1, 0],
       [1, 1, 0, 0, 0, 0, 1, 0, 0, 0],
       [0, 1, 0, 0, 0, 0, 1, 1, 0, 0],
       [1, 1, 0, 0, 0, 1, 1, 0, 0, 0],
       [1, 1, 0, 0, 0, 0, 1, 0, 0, 0],
       [0, 1, 1, 0, 0, 0, 1, 0, 0, 0],
       [0, 1, 0, 0, 0, 0, 1, 0, 0, 0],
       [1, 1, 1, 0, 0, 0, 1, 0, 0, 0]], dtype=int8)

In [30]:
arr[0]

array(['G', 'T', 'C', 'C', 'A', 'C', 'T', 'G', 'C', 'G'], dtype='<U1')

#### 2. Get each column as a frequency

In [31]:
# sum True values and divide by len of columns (shape[0]) to get frequency
np.sum(arr != arr[0], axis=0) / arr.shape[0]

# arr.shape[0] is the number of rows (individuals)
# the sum counts the True values down each column

array([0.25  , 0.875 , 0.125 , 0.125 , 0.125 , 0.125 , 0.9375, 0.125 ,
       0.125 , 0.    ])

In [35]:
# only the last expression in a cell is displayed, so this cell shows arr
np.sum(arr != arr[0], axis=0)
arr.shape[0]
arr

array([['G', 'T', 'C', 'C', 'A', 'C', 'T', 'G', 'C', 'G'],
       ['G', 'C', 'C', 'A', 'A', 'C', 'G', 'G', 'C', 'G'],
       ['G', 'T', 'C', 'C', 'A', 'C', 'G', 'C', 'C', 'G'],
       ['G', 'C', 'C', 'C', 'A', 'C', 'G', 'G', 'C', 'G'],
       ['G', 'C', 'C', 'C', 'A', 'C', 'G', 'G', 'A', 'G'],
       ['G', 'C', 'C', 'C', 'G', 'C', 'G', 'G', 'C', 'G'],
       ['G', 'C', 'C', 'A', 'A', 'C', 'G', 'G', 'C', 'G'],
       ['G', 'C', 'C', 'C', 'G', 'A', 'G', 'G', 'C', 'G'],
       ['G', 'C', 'C', 'C', 'A', 'C', 'G', 'G', 'A', 'G'],
       ['T', 'C', 'C', 'C', 'A', 'C', 'G', 'G', 'C', 'G'],
       ['G', 'C', 'C', 'C', 'A', 'C', 'G', 'C', 'C', 'G'],
       ['T', 'C', 'C', 'C', 'A', 'A', 'G', 'G', 'C', 'G'],
       ['T', 'C', 'C', 'C', 'A', 'C', 'G', 'G', 'C', 'G'],
       ['G', 'C', 'G', 'C', 'A', 'C', 'G', 'G', 'C', 'G'],
       ['G', 'C', 'C', 'C', 'A', 'C', 'G', 'G', 'C', 'G'],
       ['T', 'C', 'G', 'C', 'A', 'C', 'G', 'G', 'C', 'G']], dtype='<U1')

#### 3. To get the minor allele frequency, take 1 - freq if freq > 0.5

In [36]:
# store the frequencies computed in the cell above
freqs = np.sum(arr != arr[0], axis=0) / arr.shape[0]

# work on a copy so we do not modify the original array 'freqs'
maf = freqs.copy()

# select sites where the frequency is > 0.5 and replace it with 1 - value
maf[maf > 0.5] = 1 - maf[maf > 0.5]

# print minor allele frequencies
print(maf)

[0.25   0.125  0.125  0.125  0.125  0.125  0.0625 0.125  0.125  0.    ]
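
The same result can be written without masked assignment. The sketch below assumes each site has at most two alleles (as in the data generated by `seqdata`), so the minor allele frequency is simply the smaller of `f` and `1 - f`. The `freqs` values are typed in from the output shown earlier, and `maf_min` is a name made up for this example.

```python
import numpy as np

# frequencies of "differs from the first row", copied from the output above
freqs = np.array([0.25, 0.875, 0.125, 0.125, 0.125,
                  0.125, 0.9375, 0.125, 0.125, 0.0])

# masked-assignment version, as in the cell above
maf = freqs.copy()
maf[maf > 0.5] = 1 - maf[maf > 0.5]

# element-wise alternative: take the smaller of f and 1 - f at each site
maf_min = np.minimum(freqs, 1 - freqs)

print(maf_min)
print(np.array_equal(maf, maf_min))
```

`np.minimum` compares two arrays element by element, which is different from `np.min`, which reduces a single array to its smallest value.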


#### 4. Filter columns of the array by MAF
For our analyses we might only want to analyze sites with a MAF > 0.1. This excludes two sites from the original array, one that was not variable and one that was variable at only a single haplotype. 

In [37]:
print(arr[:, maf > 0.1])

[['G' 'T' 'C' 'C' 'A' 'C' 'G' 'C']
 ['G' 'C' 'C' 'A' 'A' 'C' 'G' 'C']
 ['G' 'T' 'C' 'C' 'A' 'C' 'C' 'C']
 ['G' 'C' 'C' 'C' 'A' 'C' 'G' 'C']
 ['G' 'C' 'C' 'C' 'A' 'C' 'G' 'A']
 ['G' 'C' 'C' 'C' 'G' 'C' 'G' 'C']
 ['G' 'C' 'C' 'A' 'A' 'C' 'G' 'C']
 ['G' 'C' 'C' 'C' 'G' 'A' 'G' 'C']
 ['G' 'C' 'C' 'C' 'A' 'C' 'G' 'A']
 ['T' 'C' 'C' 'C' 'A' 'C' 'G' 'C']
 ['G' 'C' 'C' 'C' 'A' 'C' 'C' 'C']
 ['T' 'C' 'C' 'C' 'A' 'A' 'G' 'C']
 ['T' 'C' 'C' 'C' 'A' 'C' 'G' 'C']
 ['G' 'C' 'G' 'C' 'A' 'C' 'G' 'C']
 ['G' 'C' 'C' 'C' 'A' 'C' 'G' 'C']
 ['T' 'C' 'G' 'C' 'A' 'C' 'G' 'C']]
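
The same column-mask pattern applies to the other filter mentioned at the start of this section: removing sites with missing data coded as `N`. The sketch below uses a small hand-made array (illustrative data), and the 25% threshold is an arbitrary choice for the example.

```python
import numpy as np

# illustrative alignment with missing calls coded as "N"
arr = np.array([
    list("GTCNA"),
    list("GCCNA"),
    list("NTCAA"),
    list("GCCAA"),
])

# True for every column that contains at least one "N"
has_missing = np.any(arr == "N", axis=0)
print(has_missing)

# strict filter: keep only the columns with no missing data
complete = arr[:, ~has_missing]
print(complete)

# softer filter: allow up to 25% missing calls per site
frac_missing = np.mean(arr == "N", axis=0)
kept = arr[:, frac_missing <= 0.25]
print(kept.shape)
```

Taking the mean of a boolean array gives the fraction of True values, so `frac_missing` is the per-site proportion of missing calls.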
