# Assignment

-------------

**1) Complete this notebook and make a pull request:** 

Answer the questions (Q) in the spaces provided (A) in this notebook. When finished, copy your notebook to the `Assignment/` directory and name it `nb-6.5-<Github-username>.ipynb`. Then make a pull request to the upstream repo. Your answers should be plain Markdown text in which you interpret and describe a block of code to better understand what it is doing. Much of this code you will have seen already. 


**2) Write an importable Python package, save as a repo, and test it here.**

The package should be written as we did in our last lesson (`.py` files in a directory with a `setup.py` file so it can be installed with `pip`). Follow the instructions at the end of this notebook for how to write your package. Test it here by importing the package and executing the code at the end. It should work and give correct answers; if not, continue working on it. When you have completed it, save your package as a new GitHub repo named `seqlib`.

### The `seqlib` package

Together we are going to write several functions here that will make up your new package called `seqlib`. It will be your job to copy these functions, organize them into a Class, and save the code into a `.py` file (you can use SublimeText if you're comfortable with it, or any text editor, including the one in Jupyter). You will then package the files so they can be imported as a library, and test the package so that it accomplishes the tasks defined at the end of this notebook. First things first, though: let's write the functions. 

In [1]:
import numpy as np
import pandas as pd

### Q.  Describe what the `mutate()` function below does:


A. The function builds the set of four bases, removes the base passed in by the user, and returns one of the three remaining bases chosen at random. The returned base is therefore always different from the input base.

In [2]:
def mutate(base):
    diff = set("ACTG") - set(base) # provides a set of 4 bases and subtracts an individual base from the original set 
    return np.random.choice(list(diff)) # returns one of the three remaining bases after the base is subtracted

In [3]:
# test it
mutate("A")

'T'
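
As a quick check of the answer above, the sketch below calls `mutate("A")` many times and collects the distinct results. The function body is copied from the cell above; the variable name `results` and the number of draws (200) are arbitrary choices made for this illustration.

```python
import numpy as np

def mutate(base):
    # same logic as the function defined above
    diff = set("ACTG") - set(base)
    return np.random.choice(list(diff))

# draw many mutations of "A" and keep the distinct bases that come back
results = {str(mutate("A")) for _ in range(200)}
print(sorted(results))
```

The input base should never appear in the printed list, because it was removed from the set before sampling.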

### Q. Describe how the `simulate()` function below works:
Annotate the code by inserting lines with comments as you read through the function to make sense of it. What is being created at each step and how is it used?


A. 

In [7]:
def simulate(ninds, nsites):  # ninds: number of individuals (rows), nsites: number of sites (columns)
    oseq = np.random.choice(list("ACGT"), size=nsites) # draw a random original sequence of nsites bases
    arr = np.array([oseq for i in range(ninds)]) # copy the original sequence once per individual -> (ninds, nsites) array
    muts = np.random.binomial(1, 0.1, (ninds, nsites)) # 0/1 matrix marking which cells mutate, each with probability 0.1
    
    for col in range(nsites):
        newbase = mutate(arr[0, col]) # pick one alternative base for this site, different from the original base
        mask = muts[:, col].astype(bool) # boolean mask of the individuals that carry the mutation at this site
        arr[:, col][mask] = newbase # assign the new base to the masked individuals in this column
    missing = np.random.binomial(1, 0.1, (ninds, nsites)) # 0/1 matrix marking which cells are missing, each with probability 0.1
    arr[missing.astype(bool)] = "N" # overwrite the missing cells with "N"
    return arr # return the simulated (ninds, nsites) array

In [6]:
seqs = simulate(6, 15) # 6 individuals and 15 bases 
print(seqs)

[['G' 'G' 'C' 'T' 'G' 'T' 'A' 'G' 'C' 'A' 'A' 'T' 'T' 'N' 'T']
 ['A' 'G' 'C' 'T' 'A' 'T' 'C' 'G' 'A' 'G' 'A' 'T' 'N' 'A' 'T']
 ['G' 'N' 'C' 'T' 'A' 'T' 'A' 'G' 'C' 'A' 'A' 'A' 'N' 'A' 'T']
 ['G' 'G' 'C' 'T' 'A' 'T' 'A' 'G' 'A' 'A' 'A' 'T' 'T' 'N' 'A']
 ['G' 'G' 'C' 'T' 'A' 'T' 'A' 'G' 'A' 'A' 'A' 'T' 'T' 'A' 'T']
 ['G' 'G' 'C' 'T' 'A' 'T' 'A' 'G' 'A' 'A' 'A' 'N' 'T' 'C' 'T']]
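
The masked assignment inside the loop (`arr[:, col][mask] = newbase`) is the least obvious step, so here is a minimal sketch of it on a toy array. The array contents and the mask are made up for illustration; only the indexing pattern comes from the function above.

```python
import numpy as np

arr = np.array([list("AAAA")] * 3)      # 3 individuals, 4 identical sites
mask = np.array([True, False, True])    # individuals 0 and 2 carry the mutation
arr[:, 1][mask] = "G"                   # arr[:, 1] is a view, so arr itself is modified
print(arr)
```

Only rows 0 and 2 of column 1 change to `"G"`; every other cell keeps its original base.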


### **Q: Describe how the `filter_missing` function works:**
Annotate the code by inserting lines with comments as you read through the function to make sense of it. How does it find columns with missing (N) values in them? How might you improve it?

A. 

In [8]:
def filter_missing(arr, maxfreq):
    freqmissing = np.sum(arr == "N", axis=0) / arr.shape[0] # count the "N" values in each column, divide by the number of rows
    return arr[:, freqmissing <= maxfreq] # keep all rows, but only the columns whose proportion of "N" is at most maxfreq

In [16]:
filter_missing(seqs, 0.1) # keep only the sites where at most 10% of individuals are missing

array([['G', 'C', 'T', 'G', 'T', 'A', 'G', 'C', 'A', 'A', 'T'],
       ['A', 'C', 'T', 'A', 'T', 'C', 'G', 'A', 'G', 'A', 'T'],
       ['G', 'C', 'T', 'A', 'T', 'A', 'G', 'C', 'A', 'A', 'T'],
       ['G', 'C', 'T', 'A', 'T', 'A', 'G', 'A', 'A', 'A', 'A'],
       ['G', 'C', 'T', 'A', 'T', 'A', 'G', 'A', 'A', 'A', 'T'],
       ['G', 'C', 'T', 'A', 'T', 'A', 'G', 'A', 'A', 'A', 'T']],
      dtype='<U1')
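
To see how the per-column missing frequency is computed, here is a small worked sketch. The array `toy` and the 0.25 threshold are invented for this illustration; the two expressions are the same ones used in `filter_missing`.

```python
import numpy as np

toy = np.array([
    list("ANT"),
    list("ANT"),
    list("AGN"),
    list("AGT"),
])

# fraction of "N" in each column: 0 of 4, 2 of 4, 1 of 4
freqmissing = np.sum(toy == "N", axis=0) / toy.shape[0]
print(freqmissing)

# keep the columns with at most 25% missing (columns 0 and 2)
kept = toy[:, freqmissing <= 0.25]
print(kept.shape)
```

The comparison `toy == "N"` gives a boolean array, and summing along `axis=0` counts the `True` values in each column.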

### **Q: Describe how the `filter_maf` function works:**
Annotate the code by inserting lines with comments as you read through the function to make sense of it. How does it calculate minor allele frequencies? Why does it use copy?

A. 

In [10]:
def filter_maf(arr, minfreq):
    freqs = np.sum(arr != arr[0], axis=0) / arr.shape[0] # per column: fraction of rows that differ from the first row
    maf = freqs.copy() # work on a copy so the original frequency array is left unchanged
    maf[maf > 0.5] = 1 - maf[maf > 0.5] # where the differing allele is the major one (>0.5), fold to 1 - freq to get the minor allele frequency
    return arr[:, maf > minfreq] # keep all rows, but only the columns whose minor allele frequency exceeds minfreq

In [17]:
filter_maf(seqs, 0.1)

array([['G', 'G', 'G', 'A', 'C', 'A', 'T', 'T', 'N', 'T'],
       ['A', 'G', 'A', 'C', 'A', 'G', 'T', 'N', 'A', 'T'],
       ['G', 'N', 'A', 'A', 'C', 'A', 'A', 'N', 'A', 'T'],
       ['G', 'G', 'A', 'A', 'A', 'A', 'T', 'T', 'N', 'A'],
       ['G', 'G', 'A', 'A', 'A', 'A', 'T', 'T', 'A', 'T'],
       ['G', 'G', 'A', 'A', 'A', 'A', 'N', 'T', 'C', 'T']], dtype='<U1')
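
The folding step and the reason for the copy can be checked on a small example. The array `toy` below is made up for illustration; the three expressions are the ones from `filter_maf`.

```python
import numpy as np

toy = np.array([
    list("AAC"),
    list("ATG"),
    list("ATG"),
    list("ATG"),
])

# fraction of rows that differ from the first row, per column
freqs = np.sum(toy != toy[0], axis=0) / toy.shape[0]

# fold frequencies above 0.5 so they describe the rarer allele
maf = freqs.copy()
maf[maf > 0.5] = 1 - maf[maf > 0.5]

print(freqs)  # unchanged by the folding, because maf is a copy
print(maf)
```

In columns 1 and 2 the first row happens to carry the rare allele, so the raw frequency is 0.75 and the folded value is 0.25. Without `.copy()`, `maf` and `freqs` would be the same array and the raw frequencies would be overwritten.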

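### Q: In what order should these functions be applied? Does it matter?

A. Both filters score each column independently and then drop columns, so applying them in either order returns the same sites, as the two calls below confirm. Running `filter_missing` first is still the more natural order, because it removes the sites with too many "N" values before allele frequencies are computed. Note that `filter_maf` counts any remaining "N" as a difference from the first row, whichever order is used. 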

In [19]:
filter_missing(filter_maf(seqs, 0.1), 0.1)

array([['G', 'G', 'A', 'C', 'A', 'T'],
       ['A', 'A', 'C', 'A', 'G', 'T'],
       ['G', 'A', 'A', 'C', 'A', 'T'],
       ['G', 'A', 'A', 'A', 'A', 'A'],
       ['G', 'A', 'A', 'A', 'A', 'T'],
       ['G', 'A', 'A', 'A', 'A', 'T']], dtype='<U1')

In [20]:
filter_maf(filter_missing(seqs, 0.1), 0.1)

array([['G', 'G', 'A', 'C', 'A', 'T'],
       ['A', 'A', 'C', 'A', 'G', 'T'],
       ['G', 'A', 'A', 'C', 'A', 'T'],
       ['G', 'A', 'A', 'A', 'A', 'A'],
       ['G', 'A', 'A', 'A', 'A', 'T'],
       ['G', 'A', 'A', 'A', 'A', 'T']], dtype='<U1')
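
The two results above are identical for this dataset. A rough way to probe whether that holds more generally is to compare both orders on many random arrays, as sketched below. The function bodies are copied from above; the seed, array size, thresholds, and the inclusion of "N" as a fifth symbol are arbitrary assumptions for this test.

```python
import numpy as np

def filter_missing(arr, maxfreq):
    freqmissing = np.sum(arr == "N", axis=0) / arr.shape[0]
    return arr[:, freqmissing <= maxfreq]

def filter_maf(arr, minfreq):
    freqs = np.sum(arr != arr[0], axis=0) / arr.shape[0]
    maf = freqs.copy()
    maf[maf > 0.5] = 1 - maf[maf > 0.5]
    return arr[:, maf > minfreq]

rng = np.random.default_rng(0)
same = True
for _ in range(100):
    # random 6 x 15 array of bases, with "N" mixed in
    toy = rng.choice(list("ACGTN"), size=(6, 15))
    a = filter_missing(filter_maf(toy, 0.1), 0.2)
    b = filter_maf(filter_missing(toy, 0.2), 0.1)
    same = same and np.array_equal(a, b)

print(same)
```

Each filter decides whether to keep a column using only the values in that column, and neither removes rows, so the decision for one column does not depend on which other columns were already dropped.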

### Q: Describe how `calculate_statistics()` works


A. 

In [21]:
def calculate_statistics(arr):
    nd = np.var(arr == arr[0], axis=0).mean() # per-site variance of the "matches the first row" indicator, averaged over sites
    mf = np.mean(np.sum(arr != arr[0], axis=0) / arr.shape[0]) # per-site fraction of individuals that differ from the first row, averaged over sites
    var = np.any(arr != arr[0], axis=0).sum() # variable sites: at least one individual differs from the first row ("N" counts as a difference)
    inv = arr.shape[1] - var # invariant sites: all remaining sites
    return pd.Series(
        {"mean nucleotide diversity": nd, # the dictionary keys become the index labels of the Series
         "mean minor allele frequency": mf,
         "invariant sites": inv,
         "variable sites": var,
        })

In [22]:
calculate_statistics(seqs)

invariant sites                 5.000000
mean minor allele frequency     0.244444
mean nucleotide diversity       0.114815
variable sites                 10.000000
dtype: float64
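
Counting variable and invariant sites is easy to verify by hand on a small array. The sketch below uses a made-up array `toy` and compares every row to the first row, in the same spirit as the function above: `np.any` flags sites where some row differs, and `np.all` flags sites where every row matches.

```python
import numpy as np

toy = np.array([
    list("ACG"),
    list("ACT"),
    list("ACG"),
])

variable = np.any(toy != toy[0], axis=0).sum()    # sites where some row differs from row 0
invariant = np.all(toy == toy[0], axis=0).sum()   # sites where every row matches row 0
print(variable, invariant)
```

Only the last column contains a mismatch, so there is one variable site and two invariant sites, and the two counts add up to the number of columns.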

### Instructions: Write a `seqlib` Class object

I started writing the bare bones of it below. You should write it so that it can be executed as described below to perform all of the functions we defined above, and so that its attributes can be accessed. Save this class object in a `.py` file and make it into an importable package called `seqlib`. You can write and test your object in this notebook if you like, but it must be saved separately in a `.py` file and be imported. You cannot execute the code at the end using your object defined here in the notebook. When finished save your package to GitHub as a repo just like we did with the `helloworld` package. You do not need to write a CLI script like we did for the `helloworld` package, we will only be using the Python API here. See the examples below for **how you should write your Class object**. It should be able to run in the way written below, so look at that code and think about how you would write a Class object that can do that. 

While you can mostly copy the functions from above, you will need to modify them slightly to access information about the Class object using *self*. For example, the `simulate()` function below takes `self` as its first argument and can access `self.ninds` and `self.nsites` from it, so we do not need to provide those as arguments to the `simulate` function like we did above. 
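
A minimal sketch of this pattern, using a hypothetical class named `Demo` with a hypothetical method `blank()` (neither is part of the assignment): values stored on `self` in `__init__` are available inside every other method, so they do not need to be passed again.

```python
import numpy as np

class Demo:
    def __init__(self, ninds, nsites):
        # store the arguments on the object
        self.ninds = ninds
        self.nsites = nsites
        # methods can be called on self as well
        self.arr = self.blank()

    def blank(self):
        # no ninds/nsites arguments needed: read them from self
        return np.full((self.ninds, self.nsites), "N")

demo = Demo(ninds=2, nsites=3)
print(demo.arr.shape)
```

Note that one method calls another through `self` (here `self.blank()`); calling a bare `blank()` would look for a module-level function instead.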

In [148]:

import numpy as np
import pandas as pd


class Seqlib:
    def __init__(self, ninds, nsites):
        
        self.ninds = ninds # number of individuals (rows)
        self.nsites = nsites # number of sites (columns)
        self.arr = self.simulate() # simulated (ninds, nsites) sequence array
     

    def mutate(self, base):
        diff = set("ACTG") - set(base) # the three bases other than the one provided
        return np.random.choice(list(diff)) # return one of the three remaining bases at random
        
    def simulate(self):
        oseq = np.random.choice(list("ACGT"), size=self.nsites) # draw a random original sequence of nsites bases
        self.arr = np.array([oseq for i in range(self.ninds)]) # copy the original sequence once per individual
        muts = np.random.binomial(1, 0.1, (self.ninds, self.nsites)) # 0/1 matrix marking which cells mutate, each with probability 0.1
    
        for col in range(self.nsites):
            newbase = self.mutate(self.arr[0, col]) # call the method through self; pick an alternative base for this site
            mask = muts[:, col].astype(bool) # boolean mask of the individuals that carry the mutation at this site
            self.arr[:, col][mask] = newbase # assign the new base to the masked individuals in this column
        missing = np.random.binomial(1, 0.1, (self.ninds, self.nsites)) # 0/1 matrix marking which cells are missing
        self.arr[missing.astype(bool)] = "N" # overwrite the missing cells with "N"
        return self.arr # return the simulated array

    def filter_missing(self, maxfreq):
        freqmissing = np.sum(self.arr == "N", axis=0) / self.arr.shape[0] # count the "N" values in each column, divide by the number of rows
        return self.arr[:, freqmissing <= maxfreq] # keep only the columns whose proportion of "N" is at most maxfreq

    def filter_maf(self, minfreq):
        freqs = np.sum(self.arr != self.arr[0], axis=0) / self.arr.shape[0] # per column: fraction of rows that differ from the first row
        maf = freqs.copy() # work on a copy so the original frequency array is left unchanged
        maf[maf > 0.5] = 1 - maf[maf > 0.5] # fold frequencies above 0.5 to get the minor allele frequency
        return self.arr[:, maf > minfreq] # keep only the columns whose minor allele frequency exceeds minfreq
    
    def descriptive(self, minmaf, maxmissing):
        job1 = self.filter_missing(maxfreq = maxmissing) # first drop the sites with too much missing data
        freqs = np.sum(job1 != job1[0], axis=0) / job1.shape[0] # then compute allele frequencies on the remaining sites
        maf = freqs.copy() 
        maf[maf > 0.5] = 1 - maf[maf > 0.5] 
        return job1[:, maf > minmaf] # keep the sites that also pass the maf filter
    
    def calculate_statistics(self):
        nd = np.var(self.arr == self.arr[0], axis=0).mean() # per-site variance of the "matches the first row" indicator, averaged over sites
        mf = np.mean(np.sum(self.arr != self.arr[0], axis=0) / self.arr.shape[0]) # per-site fraction of individuals that differ from the first row, averaged over sites
        var = np.any(self.arr != self.arr[0], axis=0).sum() # variable sites: at least one individual differs from the first row
        inv = self.arr.shape[1] - var # invariant sites: all remaining sites
        return pd.Series(
        {"mean nucleotide diversity": nd,
         "mean minor allele frequency": mf,
         "invariant sites": inv,
         "variable sites": var,
        })
    
    

## Test your package
The package should be globally importable (you ran `pip install .` or `pip install -e .` to install it), and it should be able to execute the following code without error. 
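
One way to check whether a package is visible to the interpreter before importing it is `importlib.util.find_spec` from the standard library. The helper name `is_importable` below is hypothetical and used only for this sketch.

```python
import importlib.util

def is_importable(name):
    """Return True if a top-level module or package called `name` can be found (hypothetical helper)."""
    return importlib.util.find_spec(name) is not None

# prints True only if `seqlib` has been installed in the current environment
print(is_importable("seqlib"))
```

If this prints `False`, the install step has probably not been run in the environment that the notebook kernel is using.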

In [159]:
#import Seqlib

In [160]:
# init a Seqlib Class object
seqs = Seqlib(ninds=10, nsites=50)
seqs

<__main__.Seqlib at 0x111909860>

In [161]:
# access attributes from the object
print(seqs.ninds, seqs.nsites)

10 50


In [162]:
# filter the array, keeping the sites with a minor allele frequency above 0.1
seqs.filter_maf(0.1)

array([['G', 'C', 'T', 'C', 'C', 'G', 'G', 'T', 'T', 'N', 'A', 'C', 'C',
        'C', 'C', 'C', 'N', 'T', 'G', 'C', 'N', 'C', 'A', 'G', 'C', 'G',
        'T', 'N', 'A', 'G', 'A'],
       ['T', 'C', 'T', 'C', 'C', 'G', 'G', 'T', 'T', 'T', 'N', 'C', 'C',
        'C', 'N', 'N', 'G', 'T', 'G', 'C', 'A', 'C', 'A', 'G', 'C', 'C',
        'A', 'N', 'A', 'G', 'A'],
       ['T', 'C', 'T', 'G', 'C', 'G', 'G', 'T', 'N', 'T', 'A', 'C', 'N',
        'N', 'C', 'C', 'G', 'T', 'G', 'C', 'A', 'C', 'N', 'G', 'C', 'N',
        'T', 'T', 'A', 'G', 'A'],
       ['T', 'C', 'T', 'C', 'N', 'G', 'G', 'N', 'N', 'T', 'A', 'N', 'C',
        'N', 'C', 'C', 'N', 'N', 'N', 'N', 'A', 'C', 'A', 'C', 'C', 'G',
        'T', 'T', 'N', 'N', 'A'],
       ['T', 'C', 'T', 'G', 'N', 'G', 'N', 'T', 'N', 'N', 'T', 'C', 'C',
        'C', 'C', 'N', 'G', 'T', 'T', 'C', 'A', 'C', 'A', 'N', 'C', 'G',
        'T', 'A', 'A', 'G', 'N'],
       ['N', 'C', 'T', 'G', 'C', 'A', 'T', 'N', 'T', 'G', 'A', 'C', 'C',
        'C', 'C', 'C', 'G',

In [163]:
# return the filtered sequence array by applying a new method (named 
# `descriptive()` here) that applies both the missing and maf filters
seqs.descriptive(minmaf=0.1, maxmissing=0.0)

array([['T'],
       ['T'],
       ['T'],
       ['T'],
       ['T'],
       ['T'],
       ['T'],
       ['G'],
       ['G'],
       ['T']], dtype='<U1')

In [164]:
# calculate statistics for an array with the results returned as a pandas Series
seqs.calculate_statistics()


In [None]:
# calculate statistics for an array after filtering it
seqs.filter(minmaf=0.1, maxmissing=0.0).calculate_statistics()
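
For a chained call like the one above to work, the filtering method has to return an object that itself has a statistics method, rather than a bare NumPy array. The sketch below shows one possible design under that assumption, using a hypothetical class `ArrayBox` with a simplified `filter()` (missing-data filter only) and a hypothetical `nsites()` method standing in for the statistics step.

```python
import numpy as np

class ArrayBox:
    """Hypothetical stand-in for a Seqlib-like class that wraps an array."""

    def __init__(self, arr):
        self.arr = arr

    def filter(self, maxmissing):
        # drop the sites with too many "N" and wrap the result in a new object
        freqmissing = np.sum(self.arr == "N", axis=0) / self.arr.shape[0]
        return ArrayBox(self.arr[:, freqmissing <= maxmissing])

    def nsites(self):
        # any method can now be chained after filter()
        return self.arr.shape[1]

box = ArrayBox(np.array([list("ANT"), list("AGT")]))
print(box.filter(maxmissing=0.0).nsites())
```

Returning a new object leaves the original array untouched, so the unfiltered data remains available on `box` after the chained call.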