# Assignment

-------------

**1) Complete this notebook and make a pull request:** 

Answer the questions (Q) in the space provided (A) in this notebook. When finished, copy your notebook to the `Assignment/` directory and name it `nb-6.5-<Github-username>.ipynb`. Then make a pull request to the upstream repo. Your answers will be plain Markdown text in which you interpret and describe a block of code to better understand what it is doing. You will have seen much of this code already.


**2) Write an importable Python package, save as a repo, and test it here.**

The package should be written as we did in our last lesson (`.py` files in a directory with a `setup.py` file so it can be installed with `pip`). Follow the instructions at the end of this notebook for how to write your package. Test it here by importing the package and executing the code at the end. It should run and give correct answers; if it does not, continue working on it. When it is complete, save your package as a new GitHub repo named `seqlib`.

### The `seqlib` package

Together we are going to write several functions here that will make up your new package called `seqlib`. It will be your job to copy these functions, organize them into a Class, save the code into a `.py` file (for much of this you can use SublimeText if you're comfortable with it, or any text editor including the one in Jupyter), package the files so they can be imported as a library, and test the package so that it accomplishes the tasks defined at the end of this notebook. First things first, though: let's write the functions.

In [1]:
import numpy as np
import pandas as pd

### Q.  Describe what the `mutate()` function below does:


A. Remove the input base from the set of "ACTG" and then return a random base from the resulting set. This will return any base other than the one that was input.

In [2]:
def mutate(base):
    diff = set("ACTG") - set(base)
    return np.random.choice(list(diff))

In [3]:
# test it
mutate("A")

'C'
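As a quick check of that description, the sketch below repeats the `mutate()` definition so that it runs on its own, then calls it many times for each base. The repeat count of 200 and the names `results` and `never_same` are arbitrary choices for illustration.

```python
import numpy as np

def mutate(base):
    # same logic as above: drop the input base, then pick from what is left
    diff = set("ACTG") - set(base)
    return np.random.choice(list(diff))

# call mutate() repeatedly for every base and collect the distinct results
results = {base: {str(mutate(base)) for _ in range(200)} for base in "ACGT"}

# the input base should never appear among the returned bases
never_same = all(base not in returned for base, returned in results.items())
print(never_same)  # True
```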

### Q. Describe how the `simulate()` function below works:
Annotate the code by inserting lines with comments as you read through the function to make sense of it. What is being created at each step and how is it used?


A. See my lines with `#` below.

In [4]:
# simulate a (ninds x nsites) array of DNA sequences with mutations and missing data
def simulate(ninds, nsites):
    
    # draw one random starting sequence of length nsites
    oseq = np.random.choice(list("ACGT"), size=nsites)
    
    # stack ninds copies of oseq as rows, giving an array of shape (ninds, nsites)
    arr = np.array([oseq for i in range(ninds)])
    
    # draw a 0/1 mutation indicator for every cell; each is 1 with probability 0.1
    muts = np.random.binomial(1, 0.1, (ninds, nsites))
    
    # iterate over each column (site)
    for col in range(nsites):
        
        # pick one alternative base for this site, different from the original base
        newbase = mutate(arr[0, col])
        
        # boolean mask of the rows (individuals) flagged to mutate at this site
        mask = muts[:, col].astype(bool)
        
        # give every flagged individual the same new base at this site
        arr[:, col][mask] = newbase
        
    # draw a 0/1 missing-data indicator for every cell; each is 1 with probability 0.1
    missing = np.random.binomial(1, 0.1, (ninds, nsites))
    
    # overwrite the flagged cells with "N" to mark missing data
    arr[missing.astype(bool)] = "N"
    
    # return the simulated array
    return arr

In [5]:
seqs = simulate(6, 15)
print(seqs)

[['C' 'C' 'A' 'G' 'A' 'N' 'N' 'C' 'A' 'C' 'T' 'C' 'G' 'T' 'A']
 ['C' 'C' 'A' 'G' 'A' 'A' 'T' 'C' 'A' 'C' 'G' 'C' 'G' 'T' 'A']
 ['C' 'A' 'A' 'G' 'A' 'C' 'T' 'N' 'A' 'C' 'G' 'C' 'A' 'T' 'A']
 ['C' 'C' 'A' 'G' 'A' 'C' 'T' 'C' 'A' 'C' 'G' 'C' 'A' 'T' 'A']
 ['C' 'C' 'A' 'G' 'A' 'C' 'T' 'G' 'A' 'C' 'N' 'C' 'A' 'T' 'A']
 ['C' 'C' 'A' 'G' 'A' 'C' 'T' 'C' 'A' 'C' 'G' 'C' 'A' 'T' 'A']]
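The masked assignment inside the loop of `simulate()` is the least obvious step, so here is a minimal sketch of it on a tiny array. The 4 x 3 array and the hand-written indicator values are made up for illustration; in `simulate()` they come from random draws.

```python
import numpy as np

# a tiny 4 x 3 array in which every row is the same sequence
arr = np.array([list("ACG") for _ in range(4)])

# hypothetical mutation indicators for column 1: rows 1 and 3 mutate
mask = np.array([0, 1, 0, 1]).astype(bool)

# arr[:, 1] is a view of column 1, so assigning through the mask edits arr itself
arr[:, 1][mask] = "T"
print(arr[:, 1])  # ['C' 'T' 'C' 'T']
```

Only the flagged rows change, and they all receive the same new base, which is why mutated individuals share a base at a given site.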


### **Q: Describe how the `filter_missing` function works:**
Annotate the code by inserting lines with comments as you read through the function to make sense of it. How does it find columns with missing (N) values in them? How might you improve it?

A. See my lines with `#` below.

In [6]:
# keep only the columns whose proportion of missing ("N") values is at most maxfreq
def filter_missing(arr, maxfreq):
    
    # per column, count the "N" entries and divide by the number of rows (individuals)
    freqmissing = np.sum(arr == "N", axis=0) / arr.shape[0]
    
    # boolean-index the columns whose missing proportion is less than or equal to maxfreq;
    # with maxfreq=0 this keeps only the columns with no missing data
    return arr[:, freqmissing <= maxfreq]

In [7]:
filter_missing(seqs, 0.1)

array([['C', 'C', 'A', 'G', 'A', 'A', 'C', 'C', 'G', 'T', 'A'],
       ['C', 'C', 'A', 'G', 'A', 'A', 'C', 'C', 'G', 'T', 'A'],
       ['C', 'A', 'A', 'G', 'A', 'A', 'C', 'C', 'A', 'T', 'A'],
       ['C', 'C', 'A', 'G', 'A', 'A', 'C', 'C', 'A', 'T', 'A'],
       ['C', 'C', 'A', 'G', 'A', 'A', 'C', 'C', 'A', 'T', 'A'],
       ['C', 'C', 'A', 'G', 'A', 'A', 'C', 'C', 'A', 'T', 'A']],
      dtype='<U1')
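To make the missing-data calculation concrete, here is a small worked sketch on a hand-written array (the data and the 0.25 threshold are made up for illustration). It uses `np.mean` on the boolean array, which gives the same proportions as summing and dividing by the number of rows, and is one possible simplification rather than the required answer.

```python
import numpy as np

# toy array: 4 individuals (rows) x 3 sites (columns)
arr = np.array([
    list("ANC"),
    list("ANC"),
    list("AGN"),
    list("AGC"),
])

# arr == "N" is a boolean array; its column mean is the proportion of Ns per site
freqmissing = np.mean(arr == "N", axis=0)
print(freqmissing)

# keep only the sites whose missing proportion is at most 0.25
kept = arr[:, freqmissing <= 0.25]
print(kept.shape)  # (4, 2)
```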

### **Q: Describe how the `filter_maf` function works:**
Annotate the code by inserting lines with comments as you read through the function to make sense of it. How does it calculate minor allele frequencies? Why does it use copy?

A. See my lines with `#` below.

In [10]:
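# keep only the columns whose minor allele frequency is above minfreq
def filter_maf(arr, minfreq):
    
    # per column, the proportion of rows whose base differs from the first row
    # (an "N" also counts as a difference here)
    freqs = np.sum(arr != arr[0], axis=0) / arr.shape[0]
    
    # work on a copy so that freqs itself is left unchanged
    maf = freqs.copy()
    
    # if more than half the rows differ, the first row's base is the less common one,
    # so the minor allele frequency is 1 - freq
    maf[maf > 0.5] = 1 - maf[maf > 0.5]
    
    # boolean-index the columns whose maf exceeds minfreq
    return arr[:, maf > minfreq]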

In [11]:
filter_maf(seqs, 0.1)

array([['C', 'N', 'N', 'C', 'T', 'G'],
       ['C', 'A', 'T', 'C', 'G', 'G'],
       ['A', 'C', 'T', 'N', 'G', 'A'],
       ['C', 'C', 'T', 'C', 'G', 'A'],
       ['C', 'C', 'T', 'G', 'N', 'A'],
       ['C', 'C', 'T', 'C', 'G', 'A']], dtype='<U1')
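The folding step and the use of `.copy()` can be seen in isolation with a few made-up frequencies. This is only a sketch of those two lines of `filter_maf()`; the values in `freqs` are hypothetical.

```python
import numpy as np

# hypothetical per-site proportions of rows that differ from the first row
freqs = np.array([0.0, 0.2, 0.5, 0.8, 1.0])

# copy so that folding does not overwrite the original frequencies
maf = freqs.copy()

# a proportion above 0.5 means the first row's base is the rarer one, so fold it
maf[maf > 0.5] = 1 - maf[maf > 0.5]

print(freqs)
print(maf)
```

After folding, no value in `maf` exceeds 0.5, while `freqs` still holds the unfolded values. Without the copy, `maf` and `freqs` would be the same array and the original frequencies would be lost.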

### Q: In what order should these functions be applied? Does it matter?

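A. No, the order does not matter. Each filter decides whether to keep a column using only the values in that column, so applying the two filters in either order keeps the same set of columns, as the identical outputs below show.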

In [12]:
filter_missing(filter_maf(seqs, 0.1), 0.1)

array([['C', 'G'],
       ['C', 'G'],
       ['A', 'A'],
       ['C', 'A'],
       ['C', 'A'],
       ['C', 'A']], dtype='<U1')

In [13]:
filter_maf(filter_missing(seqs, 0.1), 0.1)

array([['C', 'G'],
       ['C', 'G'],
       ['A', 'A'],
       ['C', 'A'],
       ['C', 'A'],
       ['C', 'A']], dtype='<U1')
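The same comparison can be run on a fixed, hand-written array so that the result does not depend on a random draw. The sketch below repeats the two function definitions so it runs on its own; the array contents are made up for illustration.

```python
import numpy as np

def filter_missing(arr, maxfreq):
    freqmissing = np.sum(arr == "N", axis=0) / arr.shape[0]
    return arr[:, freqmissing <= maxfreq]

def filter_maf(arr, minfreq):
    freqs = np.sum(arr != arr[0], axis=0) / arr.shape[0]
    maf = freqs.copy()
    maf[maf > 0.5] = 1 - maf[maf > 0.5]
    return arr[:, maf > minfreq]

# hypothetical data: 4 individuals x 4 sites
arr = np.array([
    list("ACGT"),
    list("ACNT"),
    list("AGGT"),
    list("AGGA"),
])

# apply the two filters in both orders and compare the results
a = filter_missing(filter_maf(arr, 0.1), 0.1)
b = filter_maf(filter_missing(arr, 0.1), 0.1)
print(np.array_equal(a, b))  # True
```

Site 0 is dropped by the maf filter, site 2 by the missing filter, and sites 1 and 3 pass both, whichever filter runs first.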

### Q: Describe how `calculate_statistics()` works


A. `calculate_statistics()` uses numpy to compute four summary statistics by comparing every row of the array to the first row, and then uses pandas to return them as a labeled Series.

In [14]:
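def calculate_statistics(arr):
    # mean over sites of the variance of the "matches the first row" indicator
    nd = np.var(arr == arr[0], axis=0).mean()
    # mean over sites of the proportion of rows that differ from the first row
    mf = np.mean(np.sum(arr != arr[0], axis=0) / arr.shape[0])
    # note: np.any(arr != arr[0]) flags sites that vary, so `inv` actually counts
    # variable sites and `var` invariant ones (the labels come out swapped)
    inv = np.any(arr != arr[0], axis=0).sum()
    var = arr.shape[1] - inv
    return pd.Series(
        {"mean nucleotide diversity": nd,
         "mean minor allele frequency": mf,
         "invariant sites": inv,
         "variable sites": var,
        })

In [15]:
calculate_statistics(seqs)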

invariant sites                6.000000
mean minor allele frequency    0.244444
mean nucleotide diversity      0.066667
variable sites                 9.000000
dtype: float64
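The counts of variable and invariant sites depend on whether `np.any` or `np.all` is used in the comparison with the first row. Here is a minimal sketch of the two on a hand-written array; the data are made up for illustration.

```python
import numpy as np

# toy array: 3 individuals x 5 sites
arr = np.array([
    list("ACGTA"),
    list("ACGAA"),
    list("ATGTA"),
])

# a site is variable if any row differs from the first row
variable = np.any(arr != arr[0], axis=0).sum()

# a site is invariant if all rows match the first row
invariant = np.all(arr == arr[0], axis=0).sum()

print(variable, invariant)  # 2 3
```

The two counts always add up to the number of columns, so either one can be derived from the other.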

### Instructions: Write a `seqlib` Class object

I started writing the bare bones of it below. Write it so that it can be executed as described below to perform all of the functions we defined above, and so that its attributes can be accessed. Save this class object in a `.py` file and make it into an importable package called `seqlib`. You can write and test your object in this notebook if you like, but it must be saved separately in a `.py` file and imported; you cannot execute the code at the end using an object defined here in the notebook. When finished, save your package to GitHub as a repo, just like we did with the `helloworld` package. You do not need to write a CLI script like we did for the `helloworld` package; we will only be using the Python API here. See the examples below for **how you should write your Class object**. It should run in the way written below, so look at that code and think about how you would write a Class object that can do that.

While you can mostly copy the functions from above, you will need to modify them slightly to access information about the Class object using *self*. For example, the `simulate()` function below takes self as its first argument and can access `self.ninds` and `self.nsites` from it, so we do not need to provide those as arguments to the `simulate` function like we did above.
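The pattern of storing values on *self* in `__init__` and reading them back inside a method can be sketched with a toy class. The `Counter` class below is hypothetical and unrelated to `seqlib`; it only shows the mechanism.

```python
class Counter:
    """A hypothetical toy class showing how methods read settings from self."""

    def __init__(self, start, step):
        # values stored on self are available to every method
        self.start = start
        self.step = step
        self.values = self.build()

    def build(self):
        # no extra arguments needed: the settings are read from self
        return [self.start + i * self.step for i in range(3)]

counter = Counter(start=10, step=5)
print(counter.values)  # [10, 15, 20]
```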

In [16]:
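# imports needed when this class is saved in its own `.py` file
import copy
import numpy as np
import pandas as pd


class Seqlib: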
    """
    A seqlib object for class.
    """
    def __init__(self, ninds, nsites):
        
        # generate full sequence array
        self.ninds = ninds
        self.nsites = nsites
        self.seqs = self.simulate()
        
        # store maf of the full seq array
        self.maf = self._get_maf()
    
    # private functions used only during init
    def _mutate(self, base):
        "converts a base to another base"
        diff = set("ACTG") - set(base)
        return np.random.choice(list(diff))
        
    def simulate(self):
        "create a random array of DNA bases with mutations"    
        oseq = np.random.choice(list("ACGT"), size=self.nsites)
        arr = np.array([oseq for i in range(self.ninds)])
        muts = np.random.binomial(1, 0.1, (self.ninds, self.nsites))
        for col in range(self.nsites):
            newbase = self._mutate(arr[0, col])
            mask = muts[:, col].astype(bool)
            arr[:, col][mask] = newbase
        missing = np.random.binomial(1, 0.1, (self.ninds, self.nsites))
        arr[missing.astype(bool)] = "N"
        return arr
    
    def _get_maf(self):
        "returns the maf of the full seqarray while not counting Ns"
        # init an array to fill and iterate over columns
        maf = np.zeros(self.nsites)
        for col in range(self.nsites):
            # select this column of bases
            thiscol = self.seqs[:, col]

            # mask "N" bases and get new length
            nmask = thiscol != "N"
            no_n_len = np.sum(nmask)

            # mask "N" bases and get the first base
            first_non_n_base = thiscol[nmask][0]

            # calculate maf of "N" masked bases
            freq = np.sum(thiscol[nmask] != first_non_n_base) / no_n_len
            if freq > 0.5:
                maf[col] = 1 - freq
            else:
                maf[col] = freq
        return maf
        
    # private functions that are called within other functions
    def _filter_missing(self, maxmissing):
        "returns a boolean filter True for columns with Ns > maxmissing"
        freqmissing = np.sum(self.seqs == "N", axis=0) / self.seqs.shape[0]
        return freqmissing > maxmissing

    def _filter_maf(self, minmaf):
        "returns a boolean filter True for columns with maf < minmaf"
        return self.maf < minmaf

    # public functions
    def filter(self, minmaf, maxmissing):
        """
        Applies maf and missing filters to the array
        Parameters
        ----------
        minmaf: float
            The minimum minor allele frequency. Filter columns below this.
        maxmissing: float
            The maximum prop. missing data. Filter columns with prop Ns > this.
        """
        filter1 = self._filter_maf(minmaf)
        filter2 = self._filter_missing(maxmissing)
        # a column is dropped if either filter flags it
        fullfilter = filter1 | filter2
        return self.seqs[:, ~fullfilter]

    def filter_seqlib(self, minmaf, maxmissing):
        """
        Applies maf and missing filters to the array and returns a copy
        of the seqlib object where the .seqs array has been filtered
        Parameters
        ----------
        minmaf: float
            The minimum minor allele frequency. Filter columns below this.
        maxmissing: float
            The maximum prop. missing data. Filter columns with prop Ns > this.
        """
        # apply filters to get new array size
        newseqs = self.filter(minmaf, maxmissing)

        # make a new copy of the seqlib object
        newself = copy.deepcopy(self)  
        newself.__init__(newseqs.shape[0], newseqs.shape[1])

        # store the array (overwrite it)
        newself.seqs = newseqs

        # recompute maf and store it so that it matches the filtered array
        newself.maf = newself._get_maf()
        return newself

    def calculate_statistics(self):
        """
        Returns a pandas Series of statistics on the seqs array. The earlier
        example from the notebook had a bug where var and inv were switched.
        """
        if self.seqs.size:
            nd = np.var(self.seqs == self.seqs[0], axis=0).mean()
            mf = np.mean(
                np.sum(self.seqs != self.seqs[0], axis=0) / self.seqs.shape[0])
            inv = np.all(self.seqs == self.seqs[0], axis=0).sum()
            var = self.seqs.shape[1] - inv
            return pd.Series(
                {"mean nucleotide diversity": nd,
                 "mean minor allele frequency": mf,
                 "invariant sites": inv,
                 "variable sites": var,
                 })
        else:
            print("seqs array is empty")


## Test your package
The package should be globally importable (you ran `pip install .` or `pip install -e .` to install it), and it should be able to execute the following code without error. 

In [17]:
import seqlib

In [18]:
# init a Seqlib Class object
seqs = seqlib.Seqlib(ninds=10, nsites=50)

In [19]:
# access attributes from the object
print(seqs.ninds, seqs.nsites)

10 50


In [20]:
# returns the MAF of the array as an array of floats
seqs.maf

array([0.1       , 0.        , 0.        , 0.11111111, 0.        ,
       0.11111111, 0.125     , 0.11111111, 0.1       , 0.2       ,
       0.        , 0.2       , 0.1       , 0.11111111, 0.11111111,
       0.125     , 0.        , 0.        , 0.1       , 0.22222222,
       0.11111111, 0.1       , 0.14285714, 0.125     , 0.1       ,
       0.125     , 0.        , 0.        , 0.        , 0.25      ,
       0.        , 0.        , 0.22222222, 0.11111111, 0.11111111,
       0.        , 0.        , 0.        , 0.28571429, 0.2       ,
       0.        , 0.1       , 0.375     , 0.        , 0.11111111,
       0.1       , 0.125     , 0.11111111, 0.22222222, 0.2       ])

In [21]:
# return a filtered copy of the sequence array by applying a new function 
# called `filter()` that applies both the maf and missing filter functions
seqs.filter(minmaf=0.1, maxmissing=0.0)

array([['G', 'T', 'G', 'G', 'T', 'A', 'C', 'C', 'C', 'T', 'A'],
       ['G', 'T', 'G', 'G', 'T', 'A', 'C', 'C', 'C', 'T', 'C'],
       ['G', 'T', 'T', 'G', 'T', 'A', 'C', 'C', 'C', 'T', 'A'],
       ['G', 'T', 'G', 'A', 'T', 'A', 'A', 'T', 'A', 'T', 'A'],
       ['G', 'T', 'G', 'A', 'T', 'A', 'C', 'C', 'C', 'T', 'A'],
       ['C', 'T', 'G', 'G', 'T', 'C', 'C', 'C', 'C', 'G', 'A'],
       ['G', 'T', 'G', 'G', 'T', 'A', 'C', 'C', 'C', 'T', 'C'],
       ['G', 'T', 'T', 'G', 'T', 'A', 'C', 'C', 'C', 'T', 'A'],
       ['G', 'T', 'G', 'G', 'C', 'A', 'C', 'T', 'C', 'T', 'A'],
       ['G', 'C', 'G', 'G', 'T', 'A', 'C', 'C', 'C', 'T', 'A']],
      dtype='<U1')

In [22]:
# calculate statistics for an array with the results returned as a pandas Series
seqs.calculate_statistics()

invariant sites                 3.0000
mean minor allele frequency     0.2800
mean nucleotide diversity       0.1336
variable sites                 47.0000
dtype: float64

In [27]:
# calculate statistics for an array after filtering it
seqs.filter_seqlib(minmaf=0.1, maxmissing=0.0).calculate_statistics()

invariant sites                 0.000000
mean minor allele frequency     0.136364
mean nucleotide diversity       0.115455
variable sites                 11.000000
dtype: float64