# Assignment

-------------

**1) Complete this notebook and make a pull request:** 

Answer questions (Q) in the space provided (A) in this notebook. When finished, copy your notebook to the `Assignment/` directory and name it `nb-6.5-<Github-username>.ipynb`. Then make a pull request to the upstream repo. Your answers will simply be Markdown text in which you interpret and describe a block of code to better understand what it is doing. Much of this code you will have seen already. 


**2) Write an importable Python package, save as a repo, and test it here.**

The package should be written as we did in our last lesson (`.py` files in a directory with a `setup.py` file so that it can be installed with `pip`). Follow the instructions at the end of this notebook for how to write your package. Test it here by importing the package and executing the code at the end. It should run and give correct answers; if not, continue working on it. When it is complete, save your package as a new GitHub repo named `seqlib`.

### The `seqlib` package

Together we are going to write several functions here that will make up your new package called `seqlib`. It will be your job to copy these functions, organize them into a Class, save the code into a `.py` file (you can use SublimeText if you're comfortable with it for much of this, or any text editor including the one in jupyter), package the files so they can be imported as a library, and test the package so that it accomplishes the tasks which are defined at the end of this notebook. First things first, though, let's write the functions. 

In [3]:
import numpy as np
import pandas as pd

### Q.  Describe what the `mutate()` function below does:


A. Here, the function asks for a base to be provided such as A, C, G or T. That base is then removed from the set containing ('ACGT') and a random base from the remaining three bases is returned. 

In [4]:
def mutate(base):
    diff = set("ACTG") - set(base)
    return np.random.choice(list(diff))

In [5]:
# test it
mutate("A")

'G'
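As a quick check of the description above, the sketch below re-declares `mutate()` and draws from it repeatedly. The number of draws (200) is an arbitrary choice for illustration; the point is only that the input base should never come back.

```python
import numpy as np

def mutate(base):
    # same logic as the notebook's mutate(): choose among the three other bases
    diff = set("ACGT") - set(base)
    return np.random.choice(list(diff))

# draw many replacements for "A" and collect the distinct results
draws = {str(mutate("A")) for _ in range(200)}
print(sorted(draws))
```

The set of results can only contain `C`, `G`, and `T`, because `A` is removed from the candidates before the random choice is made.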

### Q. Describe how the `simulate()` function below works:
Annotate the code by inserting lines with comments as you read through the function to make sense of it. What is being created at each step and how is it used?


A. 

In [6]:
def simulate(ninds, nsites): # Takes two arguments: the number of individuals (rows) and the number of sites (columns)
    oseq = np.random.choice(list("ACGT"), size=nsites) # A random sequence of nsites bases is drawn
    arr = np.array([oseq for i in range(ninds)]) # The sequence is repeated for each individual, giving an (ninds, nsites) array
    muts = np.random.binomial(1, 0.1, (ninds, nsites)) # An (ninds, nsites) array of 0s and 1s is drawn, where each cell is 1 with probability 0.1; a 1 marks a cell to mutate
    for col in range(nsites): # Loop over every site (column)
        newbase = mutate(arr[0, col]) # A replacement base is drawn with mutate(), based on the base in the first row of this column
        mask = muts[:, col].astype(bool) # A boolean mask that is True for each row marked for mutation in this column
        arr[:, col][mask] = newbase # Every marked cell in this column is set to the new base
    missing = np.random.binomial(1, 0.1, (ninds, nsites)) # A second array of 0s and 1s marks cells as missing with probability 0.1
    arr[missing.astype(bool)] = "N" # Cells marked as missing are set to "N"
    return arr # The finished array is returned

In [7]:
seqs = simulate(6, 15)
print(seqs)

[['G' 'A' 'C' 'C' 'T' 'C' 'A' 'T' 'N' 'T' 'C' 'A' 'G' 'T' 'A']
 ['C' 'A' 'G' 'C' 'T' 'C' 'A' 'T' 'N' 'T' 'C' 'A' 'G' 'T' 'A']
 ['C' 'A' 'N' 'N' 'T' 'C' 'A' 'T' 'G' 'T' 'C' 'A' 'G' 'T' 'A']
 ['C' 'A' 'N' 'C' 'T' 'T' 'A' 'T' 'G' 'T' 'C' 'A' 'N' 'T' 'A']
 ['C' 'N' 'G' 'C' 'T' 'C' 'N' 'T' 'G' 'T' 'C' 'A' 'G' 'T' 'A']
 ['G' 'A' 'G' 'C' 'T' 'C' 'C' 'T' 'G' 'G' 'C' 'N' 'G' 'N' 'N']]
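
The line `arr[:, col][mask] = newbase` inside `simulate()` relies on `arr[:, col]` being a view into `arr`, so the masked assignment changes the original array. A minimal sketch with a made-up array and a hand-written mask (both are assumptions for illustration, not values from the notebook):

```python
import numpy as np

# a small fixed array: 4 individuals (rows) x 3 sites (columns)
arr = np.array([list("ACG")] * 4)

# hypothetical mutation indicators for column 1: rows 1 and 3 are mutated
mask = np.array([0, 1, 0, 1]).astype(bool)

# arr[:, 1] is a view into arr, so the masked assignment edits arr in place
arr[:, 1][mask] = "T"
print(arr)
```

Only the rows where the mask is `True` change in column 1; the other columns are left as they were.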


### **Q: Describe how the `filter_missing` function works:**
Annotate the code by inserting lines with comments as you read through the function to make sense of it. How does it find columns with missing (N) values in them? How might you improve it?

A. 

In [8]:
def filter_missing(arr, maxfreq): # Returns only the columns whose frequency of missing bases is at most maxfreq
    freqmissing = np.sum(arr == "N", axis=0) / arr.shape[0] # The frequency of N in each column: the count of N down each column divided by the number of rows
    return arr[:, freqmissing <= maxfreq] # Keeps the columns whose missing frequency is less than or equal to maxfreq

In [9]:
filter_missing(seqs, 0.1)

array([['G', 'T', 'C', 'T', 'T', 'C'],
       ['C', 'T', 'C', 'T', 'T', 'C'],
       ['C', 'T', 'C', 'T', 'T', 'C'],
       ['C', 'T', 'T', 'T', 'T', 'C'],
       ['C', 'T', 'C', 'T', 'T', 'C'],
       ['G', 'T', 'C', 'T', 'G', 'C']], dtype='<U1')
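
To make the two steps of `filter_missing()` concrete, here is a small worked example on a made-up 4 x 4 array (the array and the 0.25 threshold are assumptions chosen so the arithmetic is easy to follow):

```python
import numpy as np

arr = np.array([
    list("ANGT"),
    list("ANGN"),
    list("ACGT"),
    list("ACGT"),
])

# arr == "N" is a boolean array; summing down the rows counts Ns per column
freqmissing = np.sum(arr == "N", axis=0) / arr.shape[0]
print(freqmissing)

# keep the columns whose missing fraction is at most 0.25
kept = arr[:, freqmissing <= 0.25]
print(kept)
```

One possible simplification is `np.mean(arr == "N", axis=0)`, which gives the same per-column fraction in a single call.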

### **Q: Describe how the `filter_maf` function works:**
Annotate the code by inserting lines with comments as you read through the function to make sense of it. How does it calculate minor allele frequencies? Why does it use copy?

A. 

In [10]:
def filter_maf(arr, minfreq): # Calculates the minor allele frequency of each column and returns only the columns where it is greater than minfreq
    freqs = np.sum(arr != arr[0], axis=0) / arr.shape[0] # For each column, the number of rows that differ from the first row is divided by the number of rows
    maf = freqs.copy() # A copy is made so that freqs itself is not modified
    maf[maf > 0.5] = 1 - maf[maf > 0.5] # Any frequency above 0.5 is replaced by 1 minus that frequency, so that it describes the minor allele
    return arr[:, maf > minfreq] # Columns whose minor allele frequency is greater than minfreq are returned

In [11]:
filter_maf(seqs, 0.1)

array([['G', 'A', 'C', 'C', 'C', 'A', 'N', 'T', 'A', 'G', 'T', 'A'],
       ['C', 'A', 'G', 'C', 'C', 'A', 'N', 'T', 'A', 'G', 'T', 'A'],
       ['C', 'A', 'N', 'N', 'C', 'A', 'G', 'T', 'A', 'G', 'T', 'A'],
       ['C', 'A', 'N', 'C', 'T', 'A', 'G', 'T', 'A', 'N', 'T', 'A'],
       ['C', 'N', 'G', 'C', 'C', 'N', 'G', 'T', 'A', 'G', 'T', 'A'],
       ['G', 'A', 'G', 'C', 'C', 'C', 'G', 'G', 'N', 'G', 'N', 'N']],
      dtype='<U1')
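
The folding step and the reason for `.copy()` can be seen on a handful of made-up frequencies (the values below are assumptions for illustration, not taken from `seqs`):

```python
import numpy as np

# hypothetical per-site frequencies of bases differing from the first row
freqs = np.array([0.0, 0.2, 0.5, 0.7, 0.9])

maf = freqs.copy()                      # work on a copy so freqs is untouched
maf[maf > 0.5] = 1 - maf[maf > 0.5]     # fold values above 0.5 back below it
print(freqs)
print(maf)
```

Without the copy, `maf` and `freqs` would be the same array and the in-place assignment would overwrite the original frequencies.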

### Q: In what order should these functions be applied? Does it matter?

A. Ideally, sites with too many missing (N) bases should be removed first, and the minor allele frequency filter applied afterwards. However, in the example below the order does not appear to matter, possibly because each filter evaluates every column on its own, so either order keeps the same set of columns.

In [20]:
filter_missing(filter_maf(seqs, 0.1), 0.1)

array([['G', 'C', 'T'],
       ['C', 'C', 'T'],
       ['C', 'C', 'T'],
       ['C', 'T', 'T'],
       ['C', 'C', 'T'],
       ['G', 'C', 'G']], dtype='<U1')

In [21]:
filter_maf(filter_missing(seqs, 0.1), 0.1)

array([['G', 'C', 'T'],
       ['C', 'C', 'T'],
       ['C', 'C', 'T'],
       ['C', 'T', 'T'],
       ['C', 'C', 'T'],
       ['G', 'C', 'G']], dtype='<U1')
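
One way to check whether the order matters beyond the single array above is to apply both orders to a larger random array. The sketch below re-declares the two filters as written in this notebook; the seed, the array size, and the uniform draw over `ACGTN` are assumptions made only for illustration.

```python
import numpy as np

def filter_missing(arr, maxfreq):
    freqmissing = np.sum(arr == "N", axis=0) / arr.shape[0]
    return arr[:, freqmissing <= maxfreq]

def filter_maf(arr, minfreq):
    freqs = np.sum(arr != arr[0], axis=0) / arr.shape[0]
    maf = freqs.copy()
    maf[maf > 0.5] = 1 - maf[maf > 0.5]
    return arr[:, maf > minfreq]

# random test array: 8 individuals x 40 sites, with N drawn like any other symbol
rng = np.random.default_rng(0)
arr = rng.choice(list("ACGTN"), size=(8, 40))

a = filter_missing(filter_maf(arr, 0.1), 0.1)
b = filter_maf(filter_missing(arr, 0.1), 0.1)
print(a.shape, b.shape, np.array_equal(a, b))
```

Both functions, as written, decide column by column without looking at any other column, so either order keeps exactly the columns that pass both tests.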

### Q: Describe how `calculate_statistics()` works


A. 

In [25]:
def calculate_statistics(arr):
    nd = np.var(arr == arr[0], axis=0).mean() # For each column, the variance of the True/False match with the first row is calculated, and the mean is then taken across columns
    mf = np.mean(np.sum(arr != arr[0], axis=0) / arr.shape[0]) # The mean across columns of the frequency of rows that differ from the first row (these frequencies are not folded at 0.5)
    inv = np.any(arr != arr[0], axis=0).sum() # Counts the columns in which at least one row differs from the first row, i.e. the sites that vary
    var = arr.shape[1] - inv # The number of columns minus that count, i.e. the sites where every row matches the first
    return pd.Series(
        {"mean nucleotide diversity": nd,
         "mean minor allele frequency": mf,
         "invariant sites": inv,
         "variable sites": var,
        })

In [26]:
calculate_statistics(seqs)

invariant sites                12.000000
mean minor allele frequency     0.255556
mean nucleotide diversity       0.127778
variable sites                  3.000000
dtype: float64
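
Because `np.any(arr != arr[0], axis=0)` is `True` for columns in which some row differs from the first, its sum counts the sites that vary. The sketch below, on a made-up 3 x 5 array, is one way to check which count belongs under which label; the names `n_variable` and `n_invariant` are hypothetical and do not appear in the function above.

```python
import numpy as np

arr = np.array([
    list("ACGTA"),
    list("ACGAA"),
    list("ATGTA"),
])

# True for each column in which at least one row differs from the first row
differs = np.any(arr != arr[0], axis=0)

n_variable = int(differs.sum())            # columns 1 and 3 vary
n_invariant = arr.shape[1] - n_variable    # columns 0, 2 and 4 do not
print(n_variable, n_invariant)
```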

### Instructions: Write a `seqlib` Class object

I started writing the bare bones of it below. You should write it so that it can be executed as described below to perform all of the functions we defined above, and so that its attributes can be accessed. Save this class object in a `.py` file and make it into an importable package called `seqlib`. You can write and test your object in this notebook if you like, but it must be saved separately in a `.py` file and be imported. You cannot execute the code at the end using your object defined here in the notebook. When finished save your package to GitHub as a repo just like we did with the `helloworld` package. You do not need to write a CLI script like we did for the `helloworld` package, we will only be using the Python API here. See the examples below for **how you should write your Class object**. It should be able to run in the way written below, so look at that code and think about how you would write a Class object that can do that. 

While you can mostly copy the functions from above, you will need to modify them slightly to access information about the Class object using *self*. For example, the `simulate()` function below takes `self` as its first argument and can access `self.ninds` and `self.nsites` from it, so we do not need to provide those as arguments to the `simulate` function like we did above. 

In [257]:
import numpy as np
import pandas as pd

class Seqlib:
    
    def __init__(self, ninds, nsites):
        self.ninds = ninds
        self.nsites = nsites
        self.arr = self.simulate()              
        
    # This function is used later on within the simulate function
    def mutate(self, base):
        diff = set("ACTG") - set(base)
        return np.random.choice(list(diff))
    
    def simulate(self): # Uses self.ninds (number of individuals) and self.nsites (number of sites) instead of taking them as arguments
        oseq = np.random.choice(list("ACGT"), size=self.nsites) # A random sequence of nsites bases is drawn
        self.arr = np.array([oseq for i in range(self.ninds)]) # The sequence is repeated for each individual, giving an (ninds, nsites) array
        muts = np.random.binomial(1, 0.1, (self.ninds, self.nsites)) # An (ninds, nsites) array of 0s and 1s is drawn, where each cell is 1 with probability 0.1; a 1 marks a cell to mutate
        for col in range(self.nsites): # Loop over every site (column)
            newbase = self.mutate(self.arr[0, col]) # A replacement base is drawn with mutate(), based on the base in the first row of this column
            mask = muts[:, col].astype(bool) # A boolean mask that is True for each row marked for mutation in this column
            self.arr[:, col][mask] = newbase # Every marked cell in this column is set to the new base
        missing = np.random.binomial(1, 0.1, (self.ninds, self.nsites)) # A second array of 0s and 1s marks cells as missing with probability 0.1
        self.arr[missing.astype(bool)] = "N" # Cells marked as missing are set to "N"
        return self.arr
    
    def filter_missing(self, maxfreq): # Returns only the columns whose frequency of missing bases is at most maxfreq
        freqmissing = np.sum(self.arr == "N", axis=0) / self.arr.shape[0] # The frequency of N in each column: the count of N down each column divided by the number of rows
        return self.arr[:, freqmissing <= maxfreq] # Keeps the columns whose missing frequency is less than or equal to maxfreq
        
    def filter_maf(self, minfreq): # Calculates the minor allele frequency of each column and returns only the columns where it is greater than minfreq
        freqs = np.sum(self.arr != self.arr[0], axis=0) / self.arr.shape[0] # For each column, the number of rows that differ from the first row is divided by the number of rows
        maf = freqs.copy() # A copy is made so that freqs itself is not modified
        maf[maf > 0.5] = 1 - maf[maf > 0.5] # Any frequency above 0.5 is replaced by 1 minus that frequency, so that it describes the minor allele
        return self.arr[:, maf > minfreq] # Columns whose minor allele frequency is greater than minfreq are returned
    
    def maf(self): # Intended to return the MAF of each site as floats (see Notebook 6.1)
        freqs = np.sum(self.arr != self.arr[0], axis=0) / self.arr.shape[0] # For each column, the frequency of rows that differ from the first row
        maf = freqs.copy() # A copy is made so that freqs itself is not modified
        maf[maf > 0.5] = 1 - maf[maf > 0.5] # Frequencies above 0.5 are replaced with 1 minus that value
        return freqs # Note: freqs (the unfolded frequencies) is returned here, not the folded maf
        
    def calculate_statistics(self):
        nd = np.var(self.arr == self.arr[0], axis=0).mean() # For each column, the variance of the True/False match with the first row is calculated, and the mean is then taken across columns
        mf = np.mean(np.sum(self.arr != self.arr[0], axis=0) / self.arr.shape[0]) # The mean across columns of the frequency of rows that differ from the first row (these frequencies are not folded at 0.5)
        inv = np.any(self.arr != self.arr[0], axis=0).sum() # Counts the columns in which at least one row differs from the first row, i.e. the sites that vary
        var = self.arr.shape[1] - inv # The number of columns minus that count, i.e. the sites where every row matches the first
        return pd.Series(
            {"mean nucleotide diversity": nd,
             "mean minor allele frequency": mf,
             "invariant sites": inv,
             "variable sites": var,
            })
    
    def filter_maf_missing(self,minfreq,maxmissing):
        obj1 = self.filter_missing(maxfreq=maxmissing)
        freqs = np.sum(obj1 != obj1[0], axis=0) / obj1.shape[0] # For each remaining column, the number of rows that differ from the first row is divided by the number of rows
        maf = freqs.copy() # A copy is made so that freqs itself is not modified
        maf[maf > 0.5] = 1 - maf[maf > 0.5] # Any frequency above 0.5 is replaced by 1 minus that frequency
        return obj1[:, maf > minfreq] 
    
    
    

## Test your package
The package should be globally importable (you ran `pip install .` or `pip install -e .` to install it), and it should be able to execute the following code without error. 

In [1]:
import seqlib

In [4]:
# init a Seqlib Class object
seqs = seqlib.Seqlib(ninds=10, nsites=50)

In [5]:
# access attributes from the object
print(seqs.ninds, seqs.nsites)

10 50


In [6]:
seqs.simulate()

array([['G', 'C', 'G', 'A', 'A', 'T', 'G', 'N', 'C', 'N', 'C', 'T', 'G',
        'A', 'A', 'G', 'T', 'T', 'C', 'A', 'T', 'G', 'G', 'C', 'G', 'C',
        'G', 'G', 'A', 'G', 'C', 'T', 'T', 'C', 'G', 'T', 'N', 'C', 'G',
        'A', 'C', 'A', 'G', 'A', 'C', 'C', 'C', 'C', 'T', 'G'],
       ['N', 'C', 'G', 'C', 'A', 'N', 'G', 'N', 'C', 'C', 'C', 'T', 'G',
        'A', 'T', 'G', 'C', 'N', 'C', 'A', 'T', 'G', 'G', 'A', 'G', 'C',
        'G', 'N', 'C', 'G', 'C', 'G', 'T', 'C', 'G', 'T', 'G', 'C', 'G',
        'A', 'C', 'A', 'G', 'A', 'C', 'N', 'C', 'C', 'C', 'G'],
       ['G', 'C', 'G', 'C', 'A', 'T', 'G', 'N', 'N', 'C', 'T', 'C', 'G',
        'A', 'A', 'G', 'N', 'T', 'C', 'A', 'T', 'G', 'G', 'A', 'G', 'T',
        'N', 'G', 'A', 'G', 'C', 'T', 'N', 'C', 'G', 'T', 'G', 'C', 'G',
        'A', 'C', 'A', 'G', 'A', 'C', 'C', 'G', 'N', 'C', 'G'],
       ['G', 'C', 'G', 'A', 'A', 'T', 'G', 'G', 'T', 'G', 'C', 'T', 'G',
        'A', 'T', 'N', 'C', 'T', 'C', 'A', 'T', 'G', 'G', 'C', 'N', 'T',
     

In [7]:
# returns the array with only the sites whose MAF is greater than 0.1
seqs.filter_maf(0.1)

array([['C', 'A', 'A', 'T', 'G', 'N', 'C', 'N', 'C', 'T', 'G', 'A', 'A',
        'G', 'A', 'T', 'G', 'C', 'C', 'G', 'G', 'A', 'C', 'N', 'G', 'C',
        'A', 'A', 'C', 'C', 'C', 'T', 'G'],
       ['C', 'C', 'A', 'N', 'G', 'N', 'C', 'C', 'C', 'T', 'G', 'A', 'T',
        'G', 'A', 'T', 'G', 'A', 'C', 'G', 'N', 'C', 'C', 'G', 'G', 'C',
        'A', 'A', 'C', 'N', 'C', 'C', 'G'],
       ['C', 'C', 'A', 'T', 'G', 'N', 'N', 'C', 'T', 'C', 'G', 'A', 'A',
        'G', 'A', 'T', 'G', 'A', 'T', 'N', 'G', 'A', 'C', 'G', 'G', 'C',
        'A', 'A', 'C', 'C', 'N', 'C', 'G'],
       ['C', 'A', 'A', 'T', 'G', 'G', 'T', 'G', 'C', 'T', 'G', 'A', 'T',
        'N', 'A', 'T', 'G', 'C', 'T', 'G', 'G', 'A', 'C', 'G', 'G', 'C',
        'G', 'A', 'C', 'C', 'C', 'C', 'G'],
       ['C', 'A', 'A', 'A', 'G', 'G', 'N', 'C', 'C', 'T', 'G', 'A', 'T',
        'G', 'T', 'C', 'G', 'C', 'C', 'G', 'G', 'N', 'C', 'G', 'N', 'C',
        'A', 'N', 'T', 'C', 'C', 'N', 'G'],
       ['C', 'N', 'A', 'T', 'T', 'G', 'T', 'C', 'C

In [8]:
seqs.filter_missing(0.0)

array([['C', 'G', 'G', 'G', 'C', 'C', 'T', 'G', 'A', 'A', 'C', 'C'],
       ['C', 'G', 'G', 'G', 'C', 'C', 'G', 'G', 'A', 'A', 'C', 'C'],
       ['C', 'G', 'G', 'G', 'T', 'C', 'T', 'G', 'A', 'A', 'C', 'G'],
       ['C', 'G', 'G', 'G', 'T', 'C', 'T', 'G', 'A', 'G', 'C', 'C'],
       ['C', 'G', 'G', 'G', 'C', 'C', 'T', 'T', 'A', 'A', 'T', 'C'],
       ['C', 'G', 'G', 'G', 'C', 'C', 'T', 'G', 'A', 'A', 'C', 'C'],
       ['T', 'G', 'G', 'G', 'C', 'C', 'T', 'G', 'A', 'A', 'C', 'C'],
       ['C', 'G', 'G', 'G', 'C', 'C', 'T', 'G', 'A', 'A', 'C', 'C'],
       ['C', 'G', 'C', 'G', 'C', 'C', 'T', 'G', 'A', 'A', 'C', 'C'],
       ['T', 'G', 'C', 'T', 'T', 'C', 'T', 'G', 'A', 'G', 'T', 'C']],
      dtype='<U1')

In [9]:
seqs.maf()

array([0.1, 0.2, 0. , 0.3, 0.2, 0.3, 0.2, 0.6, 0.4, 0.7, 0.3, 0.2, 0.2,
       0.3, 0.8, 0.2, 0.9, 0.1, 0.1, 0.2, 0.2, 0.2, 0.1, 0.7, 0.1, 0.3,
       0.2, 0.2, 0.3, 0.1, 0. , 0.1, 0.1, 0.2, 0.1, 0.1, 0.7, 0.1, 0.3,
       0. , 0.2, 0.2, 0.1, 0.2, 0.2, 0.2, 0.1, 0.2, 0.6, 0.2])
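
A folded minor allele frequency cannot exceed 0.5, so values such as 0.6 to 0.9 in the output above are unfolded frequencies. As a sketch of an alternative way to fold, `np.minimum` can be applied directly; the input below is simply the first ten values printed above.

```python
import numpy as np

# the first ten values from the output above
freqs = np.array([0.1, 0.2, 0.0, 0.3, 0.2, 0.3, 0.2, 0.6, 0.4, 0.7])

# fold each frequency onto [0, 0.5] by taking the smaller of f and 1 - f
folded = np.minimum(freqs, 1 - freqs)
print(folded)
```

This gives the same result as the copy-and-assign approach used in `filter_maf()`, without needing an explicit copy.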

In [10]:
# return the filtered sequence array by applying `filter_maf_missing()`,
# which applies both the maf and missing filter functions
seqs.filter_maf_missing(0.1, 0.0)

array([['C', 'G', 'C', 'A', 'C'],
       ['C', 'G', 'C', 'A', 'C'],
       ['C', 'G', 'T', 'A', 'C'],
       ['C', 'G', 'T', 'G', 'C'],
       ['C', 'G', 'C', 'A', 'T'],
       ['C', 'G', 'C', 'A', 'C'],
       ['T', 'G', 'C', 'A', 'C'],
       ['C', 'G', 'C', 'A', 'C'],
       ['C', 'C', 'C', 'A', 'C'],
       ['T', 'C', 'T', 'G', 'T']], dtype='<U1')

In [11]:
# calculate statistics for an array with the results returned as a pandas Series
seqs.calculate_statistics()

invariant sites                47.0000
mean minor allele frequency     0.2520
mean nucleotide diversity       0.1456
variable sites                  3.0000
dtype: float64

In [12]:
# calculate statistics for an array after filtering it
seqs.filter_maf_missing(0.1, 0.0).calculate_statistics()

AttributeError: 'numpy.ndarray' object has no attribute 'calculate_statistics'
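
The error arises because the filter method returns a plain NumPy array, which has no `calculate_statistics` method. One possible approach, sketched here with a hypothetical `SeqArray` class (not the `Seqlib` class above, and with a simplified `count_variable()` standing in for the full statistics), is to have the filter wrap its result in a new object so that calls can be chained.

```python
import numpy as np

class SeqArray:
    """Hypothetical wrapper for illustration; not the notebook's Seqlib class."""

    def __init__(self, arr):
        self.arr = np.asarray(arr)

    def filter_missing(self, maxfreq):
        freqmissing = np.sum(self.arr == "N", axis=0) / self.arr.shape[0]
        # wrap the filtered array in a new object instead of returning it bare
        return SeqArray(self.arr[:, freqmissing <= maxfreq])

    def count_variable(self):
        # number of columns in which some row differs from the first row
        return int(np.any(self.arr != self.arr[0], axis=0).sum())

seqs = SeqArray([list("ANGT"), list("ACGA"), list("ACGT")])
print(seqs.filter_missing(0.0).count_variable())
```

Returning a new object also leaves the original array untouched, so the unfiltered data stays available on `seqs`.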