# Assignment

-------------

**1) Complete this notebook and make a pull request:** 

Answer questions (Q) in the space provided (A) in this notebook. When finished, copy your notebook to the `Assignment/` directory and name it `nb-6.5-<Github-username>.ipynb`. Then make a pull request to the upstream repo. Your answers should be plain Markdown text in which you interpret and describe a block of code to better understand what it is doing. Much of this code you will have seen already. 


**2) Write an importable Python package, save as a repo, and test it here.**

The package should be written as we did in our last lesson (`.py` files in a directory with a `setup.py` file so it can be installed with `pip`). Follow the instructions at the end of this notebook for how to write your package. Test it here by importing the package and executing the code at the end. It should run and give correct answers; if not, continue working on it. When it is complete, save your package as a new GitHub repo named `seqlib`.

### The `seqlib` package

Together we are going to write several functions here that will make up your new package called `seqlib`. It will be your job to copy these functions, organize them into a Class, save the code into a `.py` file (you can use SublimeText for much of this if you're comfortable with it, or any text editor, including the one in Jupyter), package the files so they can be imported as a library, and test the package so that it accomplishes the tasks defined at the end of this notebook. First things first, though: let's write the functions. 

In [1]:
import numpy as np
import pandas as pd

### Q.  Describe what the `mutate()` function below does:


A. The `mutate()` function returns a random base (A, T, C, or G) which is not the same as the base inputted to the function.

In [2]:
def mutate(base):
    diff = set("ACTG") - set(base)
    return np.random.choice(list(diff))

In [3]:
# test it
mutate("A")

'C'
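
As a quick sanity check on the description above, the sketch below calls `mutate()` many times for each base and confirms that the result is always one of the other three bases. The helper name `check_mutate` is hypothetical and not part of the assignment.

```python
import numpy as np

def mutate(base):
    # same definition as in the cell above
    diff = set("ACTG") - set(base)
    return np.random.choice(list(diff))

def check_mutate(ntests=200):
    # hypothetical helper: True if mutate() never returns its input
    # and always returns a valid base
    for base in "ACGT":
        for _ in range(ntests):
            new = mutate(base)
            if new == base or new not in "ACGT":
                return False
    return True

print(check_mutate())
```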

### Q. Describe how the `simulate()` function below works:
Annotate the code by inserting lines with comments as you read through the function to make sense of it. What is being created at each step and how is it used?


A. The function `simulate()` returns an array with dimensions ninds x nsites. Every row starts as a copy of the same random sequence; some bases are then randomly mutated, and some are randomly replaced with "N" to denote missing data.

In [4]:
def simulate(ninds, nsites):
    # draw a random starting sequence of bases of length nsites
    oseq = np.random.choice(list("ACGT"), size=nsites)
    # stack ninds copies of that sequence into a 2-D array (ninds x nsites)
    arr = np.array([oseq for i in range(ninds)])
    # draw a 0/1 value for every cell (1 trial, 0.1 probability of success); 1 marks a cell to mutate
    muts = np.random.binomial(1, 0.1, (ninds, nsites))
    # iterate over all the columns (nsites of them)
    for col in range(nsites):
        # pick one alternative base for this column using the mutate() function from before
        newbase = mutate(arr[0, col])
        # boolean mask of the individuals marked for mutation in this column
        mask = muts[:, col].astype(bool)
        # assign the new base to those individuals only
        arr[:, col][mask] = newbase
    # draw a second 0/1 array the same way; 1 marks a cell as missing
    missing = np.random.binomial(1, 0.1, (ninds, nsites))
    # replace every cell marked as missing with "N"
    arr[missing.astype(bool)] = "N"
    # return the final array with mutations and missing bases
    return arr

In [5]:
seqs = simulate(6, 15)
print(seqs)

[['C' 'C' 'C' 'T' 'G' 'G' 'T' 'A' 'T' 'N' 'A' 'C' 'T' 'T' 'C']
 ['N' 'C' 'C' 'T' 'N' 'G' 'T' 'A' 'T' 'C' 'G' 'T' 'T' 'C' 'G']
 ['C' 'C' 'C' 'T' 'T' 'G' 'T' 'A' 'T' 'C' 'A' 'T' 'T' 'T' 'C']
 ['G' 'C' 'C' 'T' 'T' 'G' 'T' 'A' 'T' 'C' 'A' 'C' 'T' 'T' 'C']
 ['C' 'C' 'C' 'T' 'T' 'G' 'T' 'N' 'A' 'C' 'A' 'T' 'T' 'T' 'C']
 ['C' 'C' 'C' 'T' 'T' 'G' 'N' 'A' 'A' 'C' 'A' 'T' 'T' 'C' 'G']]
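
The masking step inside the loop is the least obvious part of `simulate()`, so here is a small deterministic sketch of it. The 0/1 array is written by hand instead of drawn from a binomial, and the function name `mask_demo` is made up for illustration.

```python
import numpy as np

def mask_demo():
    # 3 individuals x 4 sites, all "A" to start
    arr = np.array([list("AAAA") for _ in range(3)])
    # hand-written 0/1 array standing in for the binomial draws
    muts = np.array([[0, 1, 0, 0],
                     [1, 0, 0, 0],
                     [0, 1, 0, 1]])
    col = 1
    # rows 0 and 2 are marked in column 1
    mask = muts[:, col].astype(bool)
    # arr[:, col] is a view, so this assignment changes arr itself
    arr[:, col][mask] = "T"
    return arr

print(mask_demo())
```

Only the marked individuals in the chosen column change; every other cell keeps its original base.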


### **Q: Describe how the `filter_missing` function works:**
Annotate the code by inserting lines with comments as you read through the function to make sense of it. How does it find columns with missing (N) values in them? How might you improve it?

A. The `filter_missing()` function builds a boolean array marking missing values, calculates the frequency of missing values at each site (column) across all individuals, and returns only those sites whose missing frequency is less than or equal to the max frequency. 

In [8]:
def filter_missing(arr, maxfreq):
    # mark missing (N) values with a boolean array
    # count the missing values in each column (axis=0 sums down the rows) and divide by the number of rows (individuals)
    freqmissing = np.sum(arr == "N", axis=0) / arr.shape[0]
    # return only those columns from the array which have a missing frequency less than or equal to the max frequency
    return arr[:, freqmissing <= maxfreq]

In [9]:
filter_missing(seqs, 0.1)

array([['C', 'C', 'T', 'G', 'T', 'A', 'C', 'T', 'T', 'C'],
       ['C', 'C', 'T', 'G', 'T', 'G', 'T', 'T', 'C', 'G'],
       ['C', 'C', 'T', 'G', 'T', 'A', 'T', 'T', 'T', 'C'],
       ['C', 'C', 'T', 'G', 'T', 'A', 'C', 'T', 'T', 'C'],
       ['C', 'C', 'T', 'G', 'A', 'A', 'T', 'T', 'T', 'C'],
       ['C', 'C', 'T', 'G', 'A', 'A', 'T', 'T', 'C', 'G']], dtype='<U1')
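
To see the per-column frequencies directly, here is a small worked example on a hand-built array. It also uses `np.mean` on the boolean array, which is one possible simplification of the sum-and-divide step. The helper name `missing_freq` is hypothetical.

```python
import numpy as np

def missing_freq(arr):
    # fraction of "N" values in each column; the mean of a boolean
    # array equals its sum divided by the number of rows
    return np.mean(arr == "N", axis=0)

arr = np.array([
    list("ANC"),
    list("ANN"),
    list("ATC"),
    list("ATC"),
])
freqs = missing_freq(arr)       # per-column missing frequency
kept = arr[:, freqs <= 0.25]    # keep columns at or below the cutoff
print(freqs)
print(kept)
```

The middle column has half of its values missing and is dropped, while the last column sits exactly at the cutoff and is kept because the comparison is `<=`.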

### **Q: Describe how the `filter_maf` function works:**
Annotate the code by inserting lines with comments as you read through the function to make sense of it. How does it calculate minor allele frequencies? Why does it use copy?

A. The function `filter_maf()` estimates the minor allele frequency at each site by comparing every individual to the first row and computing the fraction that differ. Any frequency greater than 0.5 is replaced with 1 minus that frequency, so every maf value is at most 0.5. The function returns only those columns whose maf is greater than minfreq. It uses copy so that folding the values in `maf` does not modify `freqs`; the sequence array `arr` itself is never changed by this function.

In [13]:
def filter_maf(arr, minfreq):
    # compare every row to the first row to mark the cells that differ from it
    # count the differing cells in each column (True counts as 1)
    # divide each count by the number of rows (individuals) to get a frequency
    freqs = np.sum(arr != arr[0], axis=0) / arr.shape[0]
    # make a copy named maf so the folding step below leaves freqs unchanged
    maf = freqs.copy()
    # index maf to replace all values greater than 0.5 with 1-value
    maf[maf > 0.5] = 1 - maf[maf > 0.5]
    # return the original array with just those columns for which the maf is greater than the inputted minfreq
    return arr[:, maf > minfreq]

In [17]:
filter_maf(seqs, 0.1)

array([['C', 'G', 'T', 'A', 'T', 'N', 'A', 'C', 'T', 'C'],
       ['N', 'N', 'T', 'A', 'T', 'C', 'G', 'T', 'C', 'G'],
       ['C', 'T', 'T', 'A', 'T', 'C', 'A', 'T', 'T', 'C'],
       ['G', 'T', 'T', 'A', 'T', 'C', 'A', 'C', 'T', 'C'],
       ['C', 'T', 'T', 'N', 'A', 'C', 'A', 'T', 'T', 'C'],
       ['C', 'T', 'N', 'A', 'A', 'C', 'A', 'T', 'C', 'G']], dtype='<U1')
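
The folding step can be checked on a few hand-picked frequencies. The sketch below compares the copy-and-index approach from `filter_maf()` with an equivalent one-liner using `np.minimum`. The helper name `fold` is an assumption made for this example.

```python
import numpy as np

def fold(freqs):
    # hypothetical helper: map each frequency f to min(f, 1 - f)
    freqs = np.asarray(freqs, dtype=float)
    return np.minimum(freqs, 1 - freqs)

freqs = np.array([0.0, 0.2, 0.5, 0.8, 1.0])

# the copy-and-index approach used in filter_maf()
maf = freqs.copy()
maf[maf > 0.5] = 1 - maf[maf > 0.5]

print(fold(freqs))
print(np.allclose(maf, fold(freqs)))
print(freqs)  # unchanged, because maf was a copy
```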

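### Q: In what order should these functions be applied? Does it matter?

A. It doesn't matter what order these functions are applied in; both orders give the same result. Each filter decides whether to keep a column using only the values in that column, so removing other columns first does not change the decision.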

In [18]:
filter_missing(filter_maf(seqs, 0.1), 0.1)

array([['T', 'A', 'C', 'T', 'C'],
       ['T', 'G', 'T', 'C', 'G'],
       ['T', 'A', 'T', 'T', 'C'],
       ['T', 'A', 'C', 'T', 'C'],
       ['A', 'A', 'T', 'T', 'C'],
       ['A', 'A', 'T', 'C', 'G']], dtype='<U1')

In [19]:
filter_maf(filter_missing(seqs, 0.1), 0.1)

array([['T', 'A', 'C', 'T', 'C'],
       ['T', 'G', 'T', 'C', 'G'],
       ['T', 'A', 'T', 'T', 'C'],
       ['T', 'A', 'C', 'T', 'C'],
       ['A', 'A', 'T', 'T', 'C'],
       ['A', 'A', 'T', 'C', 'G']], dtype='<U1')
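
One example does not show that the order never matters, so here is a small sketch that tries both orders on many random arrays. The helper name `orders_agree`, the seed, and the number of trials are arbitrary choices for illustration.

```python
import numpy as np

def filter_missing(arr, maxfreq):
    # same definition as above
    freqmissing = np.sum(arr == "N", axis=0) / arr.shape[0]
    return arr[:, freqmissing <= maxfreq]

def filter_maf(arr, minfreq):
    # same definition as above
    freqs = np.sum(arr != arr[0], axis=0) / arr.shape[0]
    maf = freqs.copy()
    maf[maf > 0.5] = 1 - maf[maf > 0.5]
    return arr[:, maf > minfreq]

def orders_agree(arr, minfreq, maxfreq):
    # hypothetical helper: apply the two filters in both orders and compare
    a = filter_missing(filter_maf(arr, minfreq), maxfreq)
    b = filter_maf(filter_missing(arr, maxfreq), minfreq)
    return a.shape == b.shape and bool(np.array_equal(a, b))

# random 6 x 15 arrays drawn from the four bases plus "N"
rng = np.random.default_rng(0)
trials = [rng.choice(list("ACGTN"), size=(6, 15)) for _ in range(100)]
print(all(orders_agree(arr, 0.1, 0.1) for arr in trials))
```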

### Q: Describe how `calculate_statistics()` works


A. `calculate_statistics()` returns a Series with four values: the mean nucleotide diversity, which is the mean across columns of the variance of the match-to-first-row indicator; the mean minor allele frequency, which is the mean of the same per-column frequency that `filter_maf()` computes before folding values above 0.5; the number of variable sites, which is the number of columns in which at least one individual differs from the first row; and the number of invariant sites, which is the number of remaining columns.

In [20]:
def calculate_statistics(arr):
    # mean nucleotide diversity
    nd = np.var(arr == arr[0], axis=0).mean()
    # mean maf
    mf = np.mean(np.sum(arr != arr[0], axis=0) / arr.shape[0])
    # variable sites: columns where any row differs from the first row
    var = np.any(arr != arr[0], axis=0).sum()
    # invariant sites: the remaining columns
    inv = arr.shape[1] - var
    return pd.Series(
        {"mean nucleotide diversity": nd,
         "mean minor allele frequency": mf,
         "invariant sites": inv,
         "variable sites": var,
        })

In [21]:
calculate_statistics(seqs)

invariant sites                 5.000000
mean minor allele frequency     0.277778
mean nucleotide diversity       0.120370
variable sites                 10.000000
dtype: float64
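
The expression `np.var(arr == arr[0], axis=0)` takes the variance of a column of True/False values. For a column where a fraction p of the rows match the first row, that variance works out to p * (1 - p). The sketch below checks this on a tiny array; the function name `indicator_variance` is made up for this example.

```python
import numpy as np

def indicator_variance(arr):
    # True where a cell matches the first row
    same = arr == arr[0]
    # fraction of matching rows in each column
    p = same.mean(axis=0)
    # return both the direct variance and the closed form p * (1 - p)
    return np.var(same, axis=0), p * (1 - p)

arr = np.array([
    list("ACG"),
    list("ACT"),
    list("ATT"),
    list("ATT"),
])
direct, closed = indicator_variance(arr)
print(direct)
print(closed)
```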

### Instructions: Write a `seqlib` Class object

I started writing the bare bones of it below. You should write it so that it can be executed as described below to perform all of the functions we defined above, and so that its attributes can be accessed. Save this class object in a `.py` file and make it into an importable package called `seqlib`. You can write and test your object in this notebook if you like, but it must be saved separately in a `.py` file and be imported. You cannot execute the code at the end using your object defined here in the notebook. When finished save your package to GitHub as a repo just like we did with the `helloworld` package. You do not need to write a CLI script like we did for the `helloworld` package, we will only be using the Python API here. See the examples below for **how you should write your Class object**. It should be able to run in the way written below, so look at that code and think about how you would write a Class object that can do that. 

While you can mostly copy the functions from above, you will need to modify them slightly to access information about the Class object using *self*. For example, the `simulate()` function below takes self as a first argument and can access `self.ninds` and `self.nsites` from that, so we do not need to provide those as arguments to the `simulate` function like we did above. 

In [None]:
class Seqlib:
    def __init__(self, ninds, nsites):
        self.ninds = ninds
        self.nsites = nsites
        self.seqs = self.simulate()
        # ...
        
    def simulate(self):
        pass
        # ...
        
    # continue writing this full object..
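
To illustrate the point about *self*, here is a minimal sketch of how `simulate()` might read `ninds` and `nsites` from the object instead of taking them as arguments. The class name `SeqlibSketch` and the private helper `_mutate` are assumptions for illustration only; this is not the full object the assignment asks for.

```python
import numpy as np

class SeqlibSketch:
    def __init__(self, ninds, nsites):
        self.ninds = ninds
        self.nsites = nsites
        # simulate() needs no arguments because it reads from self
        self.seqs = self.simulate()

    @staticmethod
    def _mutate(base):
        # hypothetical helper wrapping the mutate() logic from above
        diff = set("ACTG") - set(base)
        return np.random.choice(list(diff))

    def simulate(self):
        # same steps as the standalone simulate(), using self.* values
        oseq = np.random.choice(list("ACGT"), size=self.nsites)
        arr = np.array([oseq for _ in range(self.ninds)])
        muts = np.random.binomial(1, 0.1, (self.ninds, self.nsites))
        for col in range(self.nsites):
            newbase = self._mutate(arr[0, col])
            mask = muts[:, col].astype(bool)
            arr[:, col][mask] = newbase
        missing = np.random.binomial(1, 0.1, (self.ninds, self.nsites))
        arr[missing.astype(bool)] = "N"
        return arr

sketch = SeqlibSketch(ninds=4, nsites=7)
print(sketch.ninds, sketch.nsites)
print(sketch.seqs.shape)
```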
    

## Test your package
The package should be globally importable (you ran `pip install .` or `pip install -e .` to install it), and it should be able to execute the following code without error. 

In [30]:
import seqlib

In [31]:
# init a Seqlib Class object
seqs = seqlib.Seqlib(ninds=10, nsites=50)

In [32]:
# access attributes from the object
print(seqs.ninds, seqs.nsites)

10 50


In [33]:
# returns the MAF of the array as an array of floats
seqs.maf

array([0.8, 0.1, 0.3, 0.9, 0.9, 0.1, 0.2, 0.4, 0.1, 0.8, 0. , 0.7, 0.1,
       0.4, 0.2, 0.2, 0.3, 0.8, 0.1, 0.4, 0.1, 0.2, 0.1, 0.8, 0.1, 0.8,
       0. , 0. , 0.1, 0.9, 0. , 0.1, 0.9, 0.1, 0.2, 0.2, 0.8, 0.2, 0.1,
       0.1, 0.1, 0.1, 0. , 0.1, 0.6, 0.2, 0.1, 0.9, 0.6, 0.1])

In [34]:
# return a view of the filtered sequence array by applying a new function 
# called `filter()` that applies both the maf and missing filter functions
seqs.filter(minmaf=0.1, maxmissing=0.0)

array([['C', 'T', 'T', 'A', 'T'],
       ['C', 'C', 'T', 'T', 'A'],
       ['C', 'T', 'T', 'T', 'A'],
       ['C', 'C', 'T', 'T', 'T'],
       ['C', 'T', 'C', 'T', 'A'],
       ['C', 'C', 'T', 'T', 'A'],
       ['G', 'C', 'C', 'A', 'T'],
       ['C', 'C', 'T', 'T', 'A'],
       ['C', 'C', 'T', 'T', 'A'],
       ['G', 'C', 'T', 'T', 'T']], dtype='<U1')

In [35]:
# calculate statistics for an array with the results returned as a pandas Series
seqs.calculate_statistics()

invariant sites                5.000
mean minor allele frequency    0.500
mean nucleotide diversity      0.186
variable sites                 0.000
dtype: float64

In [36]:
type(seqs)

seqlib.Seqlib

In [40]:
# calculate statistics for an array after filtering it
seqs.filter(minmaf=0.1, maxmissing=0.0).calculate_statistics()

TypeError: filter() got an unexpected keyword argument 'dtype'
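
The chained call above only works if `filter()` returns an object that itself has a `calculate_statistics()` method; a plain NumPy array does not have one. The sketch below shows one possible design in which `filter()` returns a new wrapper object around the filtered array. The class name `ChainableSeqs` and the reduced set of statistics are assumptions for illustration, not the design the assignment requires, and this is not a diagnosis of the error shown above.

```python
import numpy as np
import pandas as pd

class ChainableSeqs:
    """Hypothetical wrapper that keeps filter() results chainable."""

    def __init__(self, arr):
        self.seqs = np.asarray(arr)

    def filter(self, minmaf, maxmissing):
        arr = self.seqs
        # per-column missing frequency and folded difference frequency
        freqmissing = np.mean(arr == "N", axis=0)
        freqs = np.mean(arr != arr[0], axis=0)
        maf = np.minimum(freqs, 1 - freqs)
        keep = (freqmissing <= maxmissing) & (maf > minmaf)
        # return a new object rather than a bare array
        return ChainableSeqs(arr[:, keep])

    def calculate_statistics(self):
        arr = self.seqs
        variable = int(np.any(arr != arr[0], axis=0).sum())
        return pd.Series({
            "invariant sites": arr.shape[1] - variable,
            "variable sites": variable,
        })

demo = np.array([list("ACGT"), list("ATGT"), list("ACGN"), list("ATGT")])
stats = ChainableSeqs(demo).filter(minmaf=0.1, maxmissing=0.0).calculate_statistics()
print(stats)
```

With this design the filtered array is still available as the `.seqs` attribute of the returned object.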