# Assignment

-------------

**1) Complete this notebook and make a pull request:** 

Answer questions (Q) in the space provided (A) in this notebook. When finished, copy your notebook to the `Assignment/` directory and name it `nb-6.5-<Github-username>.ipynb`. Then make a pull request to the upstream repo. Your answers will be plain Markdown text in which you interpret and describe a block of code to better understand what it is doing. Much of this code you will have seen already. 


**2) Write an importable Python package, save as a repo, and test it here.**

The package should be written as we did in our last lesson (`.py` files in a directory with a `setup.py` file so it can be installed with `pip`). Follow the instructions at the end of this notebook for how to write your package. Test it here by importing the package and executing the code at the end. It should run and give correct answers; if it does not, continue working on it. When you have completed it, save your package as a new GitHub repo named `seqlib`.

### The `seqlib` package

Together we are going to write several functions here that will make up your new package called `seqlib`. It will be your job to copy these functions, organize them into a Class, save the code into a `.py` file (you can use SublimeText if you're comfortable with it, or any text editor including the one in Jupyter), package the files so they can be imported as a library, and test the package so that it accomplishes the tasks defined at the end of this notebook. First things first, though: let's write the functions. 

In [3]:
import numpy as np
import pandas as pd

### Q.  Describe what the `mutate()` function below does:


A. This function takes one base as an argument and returns a random base that is NOT the input base.

In [9]:
def mutate(base):
    diff = set("ACTG") - set(base) #Create a set of bases without the input base
    return np.random.choice(list(diff)) #Choose a random base from the subset "diff".

In [13]:
# test it
mutate("A")

'C'
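A quick way to check this description is to call `mutate()` many times and confirm that the input base never comes back. The sketch below repeats the definition so that it runs on its own; the fixed seed and the `draws` name are assumptions added only for this illustration.

```python
import numpy as np

def mutate(base):
    # bases that differ from the input base
    diff = set("ACTG") - set(base)
    # pick one of the remaining bases at random
    return np.random.choice(list(diff))

# assumption: seeded only so the check is repeatable
np.random.seed(0)

# collect the distinct results of 200 calls with the same input
draws = {str(mutate("A")) for _ in range(200)}

# "A" never appears; only the other three bases can
print(sorted(draws))
```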

### Q. Describe how the `simulate()` function below works:
Annotate the code by inserting lines with comments as you read through the function to make sense of it. What is being created at each step and how is it used?


A. This function returns a `ninds` x `nsites` array of simulated sequences, with mutations and missing bases each added at a 10% rate. First it draws one random sequence of length `nsites` and stacks `ninds` copies of it into an array. Next it creates a mutation mask in which each cell is selected with probability 0.10. Then, for each site, it draws a single replacement base that differs from the original and assigns it to every sequence selected by the mask at that site. Finally, a second, independent 10% mask marks the cells that are replaced with the missing-base symbol (N). 

In [23]:
def simulate(ninds, nsites):
    oseq = np.random.choice(list("ACGT"), size=nsites) #Creates a random array of As, Ts, Cs and Gs of length nsites 
    arr = np.array([oseq for i in range(ninds)]) #Creates ninds number of arrays of length nsites using the code above. 
    muts = np.random.binomial(1, 0.1, (ninds, nsites)) #Creates a mask with a 10% success rate with dimensions of ninds x nsites 
    for col in range(nsites):
        newbase = mutate(arr[0, col]) #Selecting a random new base other than the base present for each position in the sequences.
        mask = muts[:, col].astype(bool) #creates a mask of bools for each position. 
        arr[:, col][mask] = newbase #mutate the base with the newbase if it passes the mask filter for mutations.
    missing = np.random.binomial(1, 0.1, (ninds, nsites)) #creates another mask with a 10% success rate
    arr[missing.astype(bool)] = "N" #replaces the base with an N if it passes the mask filter for missing bases.
    return arr

In [42]:
seqs = simulate(6, 15)
print(seqs)

[['T' 'G' 'T' 'G' 'G' 'C' 'C' 'N' 'A' 'N' 'N' 'C' 'T' 'T' 'C']
 ['T' 'G' 'T' 'G' 'G' 'C' 'C' 'T' 'A' 'C' 'A' 'N' 'T' 'T' 'G']
 ['T' 'G' 'T' 'G' 'A' 'C' 'C' 'T' 'C' 'C' 'A' 'C' 'T' 'T' 'C']
 ['N' 'A' 'T' 'G' 'G' 'C' 'A' 'T' 'A' 'C' 'A' 'C' 'T' 'C' 'N']
 ['N' 'G' 'T' 'G' 'G' 'T' 'C' 'T' 'A' 'C' 'A' 'C' 'N' 'T' 'C']
 ['T' 'G' 'T' 'G' 'G' 'C' 'C' 'T' 'A' 'C' 'A' 'G' 'T' 'C' 'G']]
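One consequence of drawing a single replacement base per site is that every column holds at most two different called bases (the original and the replacement), plus possibly N. The sketch below repeats `mutate()` and `simulate()` so that it runs on its own and counts the distinct non-N bases per column; the seed, the array size, and the `nalleles` name are assumptions made for the illustration.

```python
import numpy as np

def mutate(base):
    diff = set("ACTG") - set(base)
    return np.random.choice(list(diff))

def simulate(ninds, nsites):
    oseq = np.random.choice(list("ACGT"), size=nsites)
    arr = np.array([oseq for i in range(ninds)])
    muts = np.random.binomial(1, 0.1, (ninds, nsites))
    for col in range(nsites):
        # one replacement base is drawn per column ...
        newbase = mutate(arr[0, col])
        mask = muts[:, col].astype(bool)
        # ... and shared by every masked row in that column
        arr[:, col][mask] = newbase
    missing = np.random.binomial(1, 0.1, (ninds, nsites))
    arr[missing.astype(bool)] = "N"
    return arr

# assumption: seeded only so the check is repeatable
np.random.seed(1)
seqs = simulate(20, 30)

# number of distinct called (non-N) bases in each column
nalleles = [len(set(seqs[:, col]) - {"N"}) for col in range(seqs.shape[1])]

# no column ever carries more than two called bases
print(max(nalleles) <= 2)
```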


### **Q: Describe how the `filter_missing` function works:**
Annotate the code by inserting lines with comments as you read through the function to make sense of it. How does it find columns with missing (N) values in them? How might you improve it?

A. `filter_missing` removes whole sites (columns) that have too much missing data. It counts how many times N appears in each column and divides by the number of sequences to get the frequency of missing data at that site. It then keeps only the columns whose missing frequency is less than or equal to `maxfreq` and returns the array with the other columns dropped from all sequences. With six sequences and `maxfreq=0.1`, as in the call below, a single N is enough to drop a site, since 1/6 is greater than 0.1. It could possibly be simplified, for example by taking the column mean of `arr == "N"` instead of dividing a sum by the number of rows. 

In [53]:
def filter_missing(arr, maxfreq):
    freqmissing = np.sum(arr == "N", axis=0) / arr.shape[0] #Sums # of times N is found in a column and divides it by the number of rows.
    return arr[:, freqmissing <= maxfreq] #selects those columns that have N frequencies less than or equal to the given maxfreq.

In [54]:
filter_missing(seqs, 0.1)

array([['G', 'T', 'G', 'G', 'C', 'C', 'A', 'T'],
       ['G', 'T', 'G', 'G', 'C', 'C', 'A', 'T'],
       ['G', 'T', 'G', 'A', 'C', 'C', 'C', 'T'],
       ['A', 'T', 'G', 'G', 'C', 'A', 'A', 'C'],
       ['G', 'T', 'G', 'G', 'T', 'C', 'A', 'T'],
       ['G', 'T', 'G', 'G', 'C', 'C', 'A', 'C']], dtype='<U1')
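To see the intermediate values, the same steps can be run on a small hand-written array. This is a minimal sketch: the 4 x 5 array, the 0.25 threshold, and the names `keep` and `filtered` are made up for the illustration, and `np.mean` is used as an equivalent of dividing the sum by the number of rows.

```python
import numpy as np

# hypothetical 4-individual, 5-site alignment used only for this illustration
arr = np.array([
    list("ACGTN"),
    list("ACNTN"),
    list("ACGTA"),
    list("NCGTA"),
])

# arr == "N" is a boolean array; its column mean is the missing frequency
freqmissing = np.mean(arr == "N", axis=0)
print(freqmissing)

# boolean mask over columns: True means the column is kept
keep = freqmissing <= 0.25
print(keep)

# indexing with the mask drops the last column from every row
filtered = arr[:, keep]
print(filtered.shape)
```

Only the last column exceeds the threshold here, because two of its four rows are N.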

### **Q: Describe how the `filter_maf` function works:**
Annotate the code by inserting lines with comments as you read through the function to make sense of it. How does it calculate minor allele frequencies? Why does it use copy?

A. `filter_maf` calculates minor allele frequencies by counting, at each site, how many sequences differ from the first sequence and dividing by the number of sequences. Any frequency above 0.5 means the first sequence carries the rarer base, so it is replaced by 1 minus that value. The function works on a copy so that folding these values does not modify the original `freqs` array. Note that N is counted like any other base in this comparison. 

In [65]:
def filter_maf(arr, minfreq):
    freqs = np.sum(arr != arr[0], axis=0) / arr.shape[0] #Sums the number of times a base is different from the first seq and divides it by length of columns.
    maf = freqs.copy() #makes a copy of this array of minor allele freqs
    maf[maf > 0.5] = 1 - maf[maf > 0.5] #subselect sites with major freq (>0.5) and modify these to be 1-the value
    return arr[:, maf > minfreq] #Return columns that have minor allele freqs above the minimum freq. 

In [66]:
filter_maf(seqs, 0.1)

array([['T', 'G', 'G', 'C', 'C', 'N', 'A', 'N', 'N', 'C', 'T', 'T', 'C'],
       ['T', 'G', 'G', 'C', 'C', 'T', 'A', 'C', 'A', 'N', 'T', 'T', 'G'],
       ['T', 'G', 'A', 'C', 'C', 'T', 'C', 'C', 'A', 'C', 'T', 'T', 'C'],
       ['N', 'A', 'G', 'C', 'A', 'T', 'A', 'C', 'A', 'C', 'T', 'C', 'N'],
       ['N', 'G', 'G', 'T', 'C', 'T', 'A', 'C', 'A', 'C', 'N', 'T', 'C'],
       ['T', 'G', 'G', 'C', 'C', 'T', 'A', 'C', 'A', 'G', 'T', 'C', 'G']],
      dtype='<U1')
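The folding step is easier to follow on a small hand-written array. The sketch below uses a made-up 5 x 4 array and repeats the three lines from `filter_maf`. Its last column also shows a side effect worth knowing about: an N in the first row makes every other row count as different, so a site with no real variation still gets a nonzero frequency.

```python
import numpy as np

# hypothetical 5-individual, 4-site alignment used only for this illustration
arr = np.array([
    list("AAAN"),
    list("ACAA"),
    list("ACAA"),
    list("ACTA"),
    list("ACTA"),
])

# fraction of rows that differ from the first row, per column
freqs = np.sum(arr != arr[0], axis=0) / arr.shape[0]

# fold values above 0.5 so each entry refers to the rarer state;
# working on a copy leaves freqs itself unchanged
maf = freqs.copy()
maf[maf > 0.5] = 1 - maf[maf > 0.5]

print(np.round(freqs, 2))
print(np.round(maf, 2))
```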

### Q: In what order should these functions be applied? Does it matter?

A. It does not matter in what order these functions are applied. Each filter decides whether to keep a column using only the values in that column, and both only drop whole columns, so the columns that survive both filters are the same either way. 

In [67]:
filter_missing(filter_maf(seqs, 0.1), 0.1)

array([['G', 'G', 'C', 'C', 'A', 'T'],
       ['G', 'G', 'C', 'C', 'A', 'T'],
       ['G', 'A', 'C', 'C', 'C', 'T'],
       ['A', 'G', 'C', 'A', 'A', 'C'],
       ['G', 'G', 'T', 'C', 'A', 'T'],
       ['G', 'G', 'C', 'C', 'A', 'C']], dtype='<U1')

In [68]:
filter_maf(filter_missing(seqs, 0.1), 0.1)

array([['G', 'G', 'C', 'C', 'A', 'T'],
       ['G', 'G', 'C', 'C', 'A', 'T'],
       ['G', 'A', 'C', 'C', 'C', 'T'],
       ['A', 'G', 'C', 'A', 'A', 'C'],
       ['G', 'G', 'T', 'C', 'A', 'T'],
       ['G', 'G', 'C', 'C', 'A', 'C']], dtype='<U1')
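The same comparison can be repeated on a fresh random array. Because each filter looks at one column at a time and only removes whole columns, applying them in either order keeps the intersection of the two column masks. The sketch below repeats both functions so that it runs on its own; the random array, the seed, and the base probabilities are assumptions chosen for the illustration.

```python
import numpy as np

def filter_missing(arr, maxfreq):
    freqmissing = np.sum(arr == "N", axis=0) / arr.shape[0]
    return arr[:, freqmissing <= maxfreq]

def filter_maf(arr, minfreq):
    freqs = np.sum(arr != arr[0], axis=0) / arr.shape[0]
    maf = freqs.copy()
    maf[maf > 0.5] = 1 - maf[maf > 0.5]
    return arr[:, maf > minfreq]

# assumption: an arbitrary random alignment with some N entries
np.random.seed(2)
arr = np.random.choice(list("ACGTN"), size=(8, 40), p=[0.4, 0.3, 0.1, 0.1, 0.1])

# apply the two filters in both orders
a = filter_missing(filter_maf(arr, 0.1), 0.1)
b = filter_maf(filter_missing(arr, 0.1), 0.1)

# the two results have the same shape and the same contents
same = a.shape == b.shape and bool(np.all(a == b))
print(same)
```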

### Q: Describe how `calculate_statistics()` works


A. `calculate_statistics` takes the array created with `simulate` above and calculates the mean nucleotide diversity, the mean minor allele frequency, the number of invariant sites, and the number of variable sites. It does this by comparing every sequence to the first sequence at each site. 

In [74]:
def calculate_statistics(arr):
    nd = np.var(arr == arr[0], axis=0).mean() #Variance of matches to the first seq at each site, averaged over sites -> mean nucleotide diversity
    mf = np.mean(np.sum(arr != arr[0], axis=0) / arr.shape[0]) #Average frequency of bases that differ from the first seq -> reported as the minor allele frequency
    var = np.any(arr != arr[0], axis=0).sum() #Counts sites where at least one seq differs from the first seq -> variable sites
    inv = arr.shape[1] - var #Subtracts the number of variable sites from the number of sites -> invariant sites
    return pd.Series(
        {"mean nucleotide diversity": nd,
         "mean minor allele frequency": mf,
         "invariant sites": inv,
         "variable sites": var,
        })

In [75]:
calculate_statistics(seqs)

invariant sites                 2.000000
mean minor allele frequency     0.333333
mean nucleotide diversity       0.144444
variable sites                 13.000000
dtype: float64
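The site counts rest on one expression: `np.any(arr != arr[0], axis=0)` is True for every column in which at least one row differs from the first row, which makes it a flag for variable sites. The invariant count is whatever is left over. A minimal sketch on a made-up 3 x 5 array (the names `differs`, `nvariable`, and `ninvariant` are assumptions for the illustration):

```python
import numpy as np

# hypothetical 3-individual, 5-site alignment used only for this illustration
arr = np.array([
    list("ACGTA"),
    list("ACGAA"),
    list("ATGAA"),
])

# True wherever at least one row differs from the first row
differs = np.any(arr != arr[0], axis=0)
print(differs)

# variable sites are the flagged columns; invariant sites are the rest
nvariable = int(differs.sum())
ninvariant = arr.shape[1] - nvariable
print(nvariable, ninvariant)
```

Here the second and fourth columns vary, so the array has two variable and three invariant sites.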

### Instructions: Write a `seqlib` Class object

I started writing the bare bones of it below. You should write it so that it can be executed as described below to perform all of the functions we defined above, and so that its attributes can be accessed. Save this class object in a `.py` file and make it into an importable package called `seqlib`. You can write and test your object in this notebook if you like, but it must be saved separately in a `.py` file and be imported. You cannot execute the code at the end using your object defined here in the notebook. When finished save your package to GitHub as a repo just like we did with the `helloworld` package. You do not need to write a CLI script like we did for the `helloworld` package, we will only be using the Python API here. See the examples below for **how you should write your Class object**. It should be able to run in the way written below, so look at that code and think about how you would write a Class object that can do that. 

While you can mostly copy the functions from above, you will need to modify them slightly to access information about the Class object using *self*. For example, the `simulate()` function below takes `self` as its first argument and can read `self.ninds` and `self.nsites` from it, so we do not need to pass those as arguments to `simulate()` like we did above. 
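As a small illustration of that pattern, the sketch below uses a hypothetical `Demo` class, which is not the assignment solution. Values stored on `self` in `__init__` are read back inside another method, and a value that should look like an attribute but is computed on demand can be exposed with `@property`.

```python
import numpy as np

class Demo:
    # hypothetical class for illustration only
    def __init__(self, ninds, nsites):
        # store the sizes first so that other methods can read them
        self.ninds = ninds
        self.nsites = nsites
        self.arr = self.simulate()

    def simulate(self):
        # no size arguments needed: both values come from self
        return np.random.choice(list("ACGT"), size=(self.ninds, self.nsites))

    @property
    def shape(self):
        # accessed as demo.shape, without parentheses
        return self.arr.shape

demo = Demo(ninds=3, nsites=5)
print(demo.ninds, demo.nsites, demo.shape)
```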

In [109]:
class Seqlib:
    def __init__(self, ninds, nsites):
        self.ninds = ninds
        self.nsites = nsites
        self.arr = self.simulate() #Simulate once, after ninds and nsites are stored on self

    def mutate(self, base):
        diff = set("ACTG") - set(base) #Create a set of bases without the input base
        return np.random.choice(list(diff)) #Choose a random base from the subset "diff"

    def simulate(self):
        oseq = np.random.choice(list("ACGT"), size=self.nsites) #Creates a random array of As, Ts, Cs and Gs of length nsites 
        arr = np.array([oseq for i in range(self.ninds)]) #Stacks ninds copies of oseq in a local array (self.arr does not exist yet)
        muts = np.random.binomial(1, 0.1, (self.ninds, self.nsites)) #Creates a mask with a 10% success rate with dimensions of ninds x nsites 
        for col in range(self.nsites):
            newbase = self.mutate(arr[0, col]) #Selects one random new base, other than the base present, for this position.
            mask = muts[:, col].astype(bool) #creates a mask of bools for each position. 
            arr[:, col][mask] = newbase #mutate the base with the newbase if it passes the mask filter for mutations.
        missing = np.random.binomial(1, 0.1, (self.ninds, self.nsites)) #creates another mask with a 10% success rate
        arr[missing.astype(bool)] = "N" #replaces the base with an N if it passes the mask filter for missing bases.
        return arr

    def filter_missing(self, maxfreq=0.5):
        freqmissing = np.sum(self.arr == "N", axis=0) / self.arr.shape[0] #Sums # of times N is found in a column and divides it by the number of rows.
        return self.arr[:, freqmissing <= maxfreq] #selects those columns that have N frequencies less than or equal to the given maxfreq.

    def filter_maf(self, minfreq=0.5):
        freqs = np.sum(self.arr != self.arr[0], axis=0) / self.arr.shape[0] #Sums the number of times a base is different from the first seq and divides it by the number of rows.
        maf = freqs.copy() #makes a copy of this array of allele freqs
        maf[maf > 0.5] = 1 - maf[maf > 0.5] #subselect sites with major freq (>0.5) and modify these to be 1-the value
        return self.arr[:, maf > minfreq] #Return columns with minor allele freqs above minfreq. With the default of 0.5 no column can pass, since a folded freq never exceeds 0.5.

    def calculate_statistics(self):
        nd = np.var(self.arr == self.arr[0], axis=0).mean() #Variance of matches to the first seq at each site, averaged over sites -> mean nucleotide diversity
        mf = np.mean(np.sum(self.arr != self.arr[0], axis=0) / self.arr.shape[0]) #Average frequency of bases that differ from the first seq -> reported as the minor allele frequency
        var = np.any(self.arr != self.arr[0], axis=0).sum() #Counts sites where at least one seq differs from the first seq -> variable sites
        inv = self.arr.shape[1] - var #Subtracts the number of variable sites from the number of sites -> invariant sites
        return pd.Series(
            {"mean nucleotide diversity": nd,
             "mean minor allele frequency": mf,
             "invariant sites": inv,
             "variable sites": var,
            })

In [110]:
seqs = Seqlib(ninds=10, nsites=50)

In [103]:
print(seqs.ninds, seqs.nsites)

10 50


In [106]:
seqs.maf

## Test your package
The package should be globally importable (you ran `pip install .` or `pip install -e .` to install it), and it should be able to execute the following code without error. 

In [None]:
import seqlib

In [None]:
# init a Seqlib Class object
seqs = seqlib.Seqlib(ninds=10, nsites=50)

In [None]:
# access attributes from the object
print(seqs.ninds, seqs.nsites)

In [None]:
# returns the MAF of the array as an array of floats
seqs.maf

In [None]:
# return a view of the filtered sequence array by applying a new function 
# called `filter()` that applies both the maf and missing filter functions
seqs.filter(minmaf=0.1, maxmissing=0.0)

In [None]:
# calculate statistics for an array with the results returned as a DataFrame
seqs.calculate_statistics()

In [None]:
# calculate statistics for an array after filtering it
seqs.filter(minmaf=0.1, maxmissing=0.0).calculate_statistics()
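The chained call above only works if `filter()` returns another object that has a `calculate_statistics()` method, rather than a bare NumPy array. The sketch below shows one possible way to arrange that, and should be read as an illustration under stated assumptions rather than the required solution. The random array in `__init__` is a stand-in for the simulation step, the use of `copy.copy` to build the returned object is a design choice made here, and the statistics are returned as a `pd.Series` to mirror the function earlier in this notebook.

```python
import copy
import numpy as np
import pandas as pd

class Seqlib:
    """Sketch of a class whose filter() result can be chained."""

    def __init__(self, ninds, nsites):
        self.ninds = ninds
        self.nsites = nsites
        # assumption: random bases with a few Ns stand in for simulate()
        self.arr = np.random.choice(
            list("ACGTN"), size=(ninds, nsites),
            p=[0.24, 0.24, 0.24, 0.24, 0.04])

    @property
    def maf(self):
        # folded frequency of differences from the first row, per column
        freqs = np.sum(self.arr != self.arr[0], axis=0) / self.arr.shape[0]
        return np.where(freqs > 0.5, 1 - freqs, freqs)

    def filter(self, minmaf, maxmissing):
        missing = np.sum(self.arr == "N", axis=0) / self.arr.shape[0]
        # a column is kept only if it passes both filters
        keep = (self.maf > minmaf) & (missing <= maxmissing)
        # return a new object so the original array stays untouched
        new = copy.copy(self)
        new.arr = self.arr[:, keep]
        new.nsites = new.arr.shape[1]
        return new

    def calculate_statistics(self):
        # mirrors the notebook's function: everything is compared to row 0
        differs = self.arr != self.arr[0]
        nvar = int(np.any(differs, axis=0).sum())
        return pd.Series({
            "mean nucleotide diversity": np.var(~differs, axis=0).mean(),
            "mean minor allele frequency": np.mean(differs.sum(axis=0) / self.arr.shape[0]),
            "invariant sites": self.arr.shape[1] - nvar,
            "variable sites": nvar,
        })

seqs = Seqlib(ninds=10, nsites=50)
stats = seqs.filter(minmaf=0.1, maxmissing=0.0).calculate_statistics()
print(stats)
```

Because `filter()` builds a shallow copy and then rebinds `arr` on the copy, `seqs.arr` keeps all 50 sites after the call, so different filter settings can be tried on the same object.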