# Assignment

-------------

**1) Complete this notebook and make a pull request:** 

Answer the questions (Q) in the spaces provided (A) in this notebook. When finished, copy your notebook to the `Assignment/` directory and name it `nb-6.5-<Github-username>.ipynb`. Then make a pull request to the upstream repo. Your answers will be plain Markdown text in which you interpret and describe a block of code to better understand what it is doing. You will have seen much of this code already. 


**2) Write an importable Python package, save as a repo, and test it here.**

The package should be written as we did in our last lesson (`.py` files in a directory with a `setup.py` file so it can be installed with `pip`). Follow the instructions at the end of this notebook for how to write your package. Test it here by importing the package and executing the code at the end. It should work and give correct answers; if not, continue working on it. When it is complete, save your package as a new GitHub repo named `seqlib`.

### The `seqlib` package

Together we are going to write several functions here that will make up your new package called `seqlib`. It will be your job to copy these functions, organize them into a Class, save the code into a `.py` file (you can use SublimeText if you're comfortable with it for much of this, or any text editor including the one in jupyter), package the files so they can be imported as a library, and test the package so that it accomplishes the tasks which are defined at the end of this notebook. First things first, though, let's write the functions. 

In [1]:
import numpy as np
import pandas as pd

### Q.  Describe what the `mutate()` function below does:


A. The function takes a base and returns one of the three other bases. Specifically, it removes the input base from the four-base set {A, C, T, G} and then randomly chooses one base from the remaining set.

In [2]:
def mutate(base):
    diff = set("ACTG") - set(base)
    return np.random.choice(list(diff))

In [3]:
# test it
mutate("A")

'C'
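
A quick way to check this description is to call the function many times and collect the distinct results. The sketch below repeats the definition of `mutate()` so that it runs on its own; the number of draws (200) is an arbitrary choice.

```python
import numpy as np

def mutate(base):
    # same logic as above: choose uniformly among the three other bases
    diff = set("ACTG") - set(base)
    return np.random.choice(list(diff))

# mutate "A" repeatedly and collect the distinct bases that come back
results = {str(mutate("A")) for _ in range(200)}
print(sorted(results))
```

The input base never appears in the results, because it is removed from the set before the random draw.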

### Q. Describe how the `simulate()` function below works:
Annotate the code by inserting lines with comments as you read through the function to make sense of it. What is being created at each step and how is it used?


A. 

In [5]:
def simulate(ninds, nsites):
    
    # 'oseq': a random ancestral sequence of length 'nsites'
    # 'arr': an array with 'ninds' rows, each a copy of 'oseq' (one row per individual)
    # 'muts': a (ninds, nsites) matrix of 0/1 Bernoulli draws (n=1, p=0.1); 1 marks a mutation
    oseq = np.random.choice(list("ACGT"), size=nsites)
    arr = np.array([oseq for i in range(ninds)])
    muts = np.random.binomial(1, 0.1, (ninds, nsites))
    
    # Loop over every column (site):
    # 'newbase' is a base different from the original base at this site, drawn with mutate()
    # 'mask' is True for each individual that has a mutation at this site
    # individuals where 'mask' is True have their base replaced with 'newbase'
    for col in range(nsites):
        newbase = mutate(arr[0, col])
        mask = muts[:, col].astype(bool)
        arr[:, col][mask] = newbase
    
    # 'missing': a second (ninds, nsites) matrix of 0/1 Bernoulli draws (n=1, p=0.1)
    # every position marked 1 is replaced with "N" to represent a missing base
    missing = np.random.binomial(1, 0.1, (ninds, nsites))
    arr[missing.astype(bool)] = "N"
    
    return arr

In [6]:
seqs = simulate(6, 15)
print(seqs)

[['T' 'G' 'A' 'T' 'C' 'G' 'C' 'C' 'C' 'A' 'A' 'T' 'A' 'G' 'G']
 ['G' 'G' 'A' 'T' 'C' 'G' 'C' 'A' 'C' 'A' 'A' 'T' 'N' 'G' 'C']
 ['N' 'G' 'A' 'N' 'C' 'G' 'C' 'N' 'T' 'A' 'A' 'T' 'A' 'A' 'N']
 ['G' 'G' 'A' 'T' 'C' 'N' 'C' 'A' 'C' 'A' 'A' 'T' 'A' 'G' 'G']
 ['G' 'G' 'A' 'N' 'N' 'G' 'C' 'A' 'C' 'A' 'A' 'T' 'G' 'G' 'C']
 ['G' 'G' 'A' 'T' 'C' 'G' 'C' 'A' 'C' 'A' 'A' 'T' 'A' 'G' 'C']]
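
The masked assignment inside the loop is the least obvious step, so here is a small sketch of it in isolation. The mutation matrix below is hand-written for illustration (in `simulate()` it comes from `np.random.binomial`), and the replacement base `"T"` is likewise an arbitrary choice.

```python
import numpy as np

# three individuals, four sites, all starting from the same sequence
arr = np.array([list("ACGT") for _ in range(3)])

# hypothetical mutation matrix: 1 marks an individual/site that mutates
muts = np.array([
    [0, 0, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 0],
])

col = 0
mask = muts[:, col].astype(bool)  # True only for the second individual
arr[:, col][mask] = "T"           # arr[:, col] is a view, so arr itself changes
print(arr[:, col])
```

Only the rows where `mask` is True receive the new base; the other columns are untouched.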


### **Q: Describe how the `filter_missing` function works:**
Annotate the code by inserting lines with comments as you read through the function to make sense of it. How does it find columns with missing (N) values in them? How might you improve it?

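A. The function returns the array restricted to the columns (sites) whose frequency of missing bases (N) is at most the specified maximum frequency. It finds them by comparing the array to "N", which gives a boolean array, and summing down each column.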

In [7]:
def filter_missing(arr, maxfreq):
    # Frequency of missing data per column: count "N" down each column (axis=0)
    # and divide by the number of rows (individuals)
    freqmissing = np.sum(arr == "N", axis=0) / arr.shape[0]
    # Keep only the columns whose missing frequency is at most 'maxfreq'
    return arr[:, freqmissing <= maxfreq]

In [8]:
filter_missing(seqs, 0.1)

array([['G', 'A', 'C', 'C', 'A', 'A', 'T', 'G'],
       ['G', 'A', 'C', 'C', 'A', 'A', 'T', 'G'],
       ['G', 'A', 'C', 'T', 'A', 'A', 'T', 'A'],
       ['G', 'A', 'C', 'C', 'A', 'A', 'T', 'G'],
       ['G', 'A', 'C', 'C', 'A', 'A', 'T', 'G'],
       ['G', 'A', 'C', 'C', 'A', 'A', 'T', 'G']], dtype='<U1')
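
To see the per-column frequencies that drive the filter, the sketch below applies the same function to a small hand-made array. The array `toy` and the threshold 0.25 are made up for illustration.

```python
import numpy as np

def filter_missing(arr, maxfreq):
    # same logic as above
    freqmissing = np.sum(arr == "N", axis=0) / arr.shape[0]
    return arr[:, freqmissing <= maxfreq]

# four individuals, four sites; the second column is half missing
toy = np.array([
    list("ANGT"),
    list("ANGN"),
    list("ACGT"),
    list("ACGT"),
])

# proportion of "N" in each column
freq = np.sum(toy == "N", axis=0) / toy.shape[0]
print(freq)

# columns with more than 25% missing data are dropped
kept = filter_missing(toy, 0.25)
print(kept.shape)  # → (4, 3)
```

Because the comparison is `<=`, the last column (exactly 25% missing) is kept and only the second column is removed.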

### **Q: Describe how the `filter_maf` function works:**
Annotate the code by inserting lines with comments as you read through the function to make sense of it. How does it calculate minor allele frequencies? Why does it use copy?

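A. The function returns the array restricted to the columns (sites) whose minor allele frequency is greater than the specified minimum. For each column it first computes the frequency of bases that differ from the first row; any frequency greater than 0.5 is then replaced by 1 minus that frequency, so the result is always the frequency of the rarer state (at most 0.5). The copy keeps the original frequency array `freqs` unchanged, because the assignment to `maf` modifies the array in place.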

In [11]:
def filter_maf(arr, minfreq): 
    # Per-column frequency of bases that differ from the first row:
    # count the mismatches down each column and divide by the number of rows
    freqs = np.sum(arr != arr[0], axis=0) / arr.shape[0]
    
    # Copy the frequency array so 'freqs' itself is not modified
    # Replace every value > 0.5 with 1 - value so the minor allele frequency is recorded
    maf = freqs.copy()
    maf[maf > 0.5] = 1 - maf[maf > 0.5]
    
    # Return only the columns whose minor allele frequency is greater than 'minfreq'
    return arr[:, maf > minfreq]

In [12]:
filter_maf(seqs, 0.1)

array([['T', 'T', 'C', 'G', 'C', 'C', 'A', 'G', 'G'],
       ['G', 'T', 'C', 'G', 'A', 'C', 'N', 'G', 'C'],
       ['N', 'N', 'C', 'G', 'N', 'T', 'A', 'A', 'N'],
       ['G', 'T', 'C', 'N', 'A', 'C', 'A', 'G', 'G'],
       ['G', 'N', 'N', 'G', 'A', 'C', 'G', 'G', 'C'],
       ['G', 'T', 'C', 'G', 'A', 'C', 'A', 'G', 'C']], dtype='<U1')
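
The folding step and the role of `copy()` can be checked on a few hand-picked frequencies. This is a small sketch with made-up values; the `np.minimum` line is an alternative way to write the same fold, not something used in the function above.

```python
import numpy as np

# hypothetical per-column frequencies of "differs from the first row"
freqs = np.array([0.0, 1 / 6, 0.5, 5 / 6])

# fold values above 0.5, working on a copy so 'freqs' is left alone
maf = freqs.copy()
maf[maf > 0.5] = 1 - maf[maf > 0.5]
print(maf)
print(freqs)  # unchanged, still ends in 5/6

# an equivalent fold without in-place assignment
maf2 = np.minimum(freqs, 1 - freqs)
print(np.allclose(maf, maf2))
```

Without the copy, `maf` and `freqs` would be the same array, and the in-place assignment would overwrite the original frequencies.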

### Q: In what order should these functions be applied? Does it matter?

A. First filter out the missing bases with `filter_missing` to remove the unreliable sites, then apply `filter_maf` to filter on minor allele frequency. That order makes more logical sense, but with these two functions the opposite order gives the same result: each one keeps or drops a column based only on that column's own values, so neither filter changes what the other computes for the remaining columns. The examples below show this.

In [13]:
filter_missing(filter_maf(seqs, 0.1), 0.1)

array([['C', 'G'],
       ['C', 'G'],
       ['T', 'A'],
       ['C', 'G'],
       ['C', 'G'],
       ['C', 'G']], dtype='<U1')

In [14]:
filter_maf(filter_missing(seqs, 0.1), 0.1)

array([['C', 'G'],
       ['C', 'G'],
       ['T', 'A'],
       ['C', 'G'],
       ['C', 'G'],
       ['C', 'G']], dtype='<U1')
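
Another way to see why the two orders agree is to build each filter's column mask separately and combine them. The sketch below rebuilds the `seqs` array printed earlier and repeats both function definitions so that it runs on its own; the names `missing_ok`, `maf_ok` and `both` are introduced here for illustration.

```python
import numpy as np

def filter_missing(arr, maxfreq):
    freqmissing = np.sum(arr == "N", axis=0) / arr.shape[0]
    return arr[:, freqmissing <= maxfreq]

def filter_maf(arr, minfreq):
    freqs = np.sum(arr != arr[0], axis=0) / arr.shape[0]
    maf = freqs.copy()
    maf[maf > 0.5] = 1 - maf[maf > 0.5]
    return arr[:, maf > minfreq]

# the simulated array printed earlier in this notebook
seqs = np.array([list(s) for s in [
    "TGATCGCCCAATAGG",
    "GGATCGCACAATNGC",
    "NGANCGCNTAATAAN",
    "GGATCNCACAATAGG",
    "GGANNGCACAATGGC",
    "GGATCGCACAATAGC",
]])

# each filter is just a boolean mask over columns
missing_ok = np.sum(seqs == "N", axis=0) / seqs.shape[0] <= 0.1
freqs = np.sum(seqs != seqs[0], axis=0) / seqs.shape[0]
maf_ok = np.minimum(freqs, 1 - freqs) > 0.1

# keeping the columns that pass both masks matches either order
both = seqs[:, missing_ok & maf_ok]
a = filter_missing(filter_maf(seqs, 0.1), 0.1)
b = filter_maf(filter_missing(seqs, 0.1), 0.1)
print(np.array_equal(a, b), np.array_equal(a, both))
```

Applying the filters in sequence amounts to intersecting the two masks, and an intersection does not depend on order.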

### Q: Describe how `calculate_statistics()` works


A. The function calculates the mean nucleotide diversity, the mean minor allele frequency, and the numbers of invariant and variable sites for a given sequence array.

In [15]:
def calculate_statistics(arr):
    
    # Per column, take the variance of the boolean "matches the first row" array,
    # then average across columns; reported as "mean nucleotide diversity"
    nd = np.var(arr == arr[0], axis=0).mean()
    
    # Per column, the frequency of bases that differ from the first row,
    # averaged across columns; reported as "mean minor allele frequency"
    mf = np.mean(np.sum(arr != arr[0], axis=0) / arr.shape[0])
    
    # np.any(...) is True for a column if any row differs from the first row,
    # so this sum counts the columns that vary, although it is stored as 'inv';
    # 'var' is the number of remaining columns
    inv = np.any(arr != arr[0], axis=0).sum()
    var = arr.shape[1] - inv
    
    # return all values as a pandas Series with the specified names
    return pd.Series(
        {"mean nucleotide diversity": nd,
         "mean minor allele frequency": mf,
         "invariant sites": inv,
         "variable sites": var,
        })

In [16]:
calculate_statistics(seqs)

invariant sites                9.000000
mean minor allele frequency    0.244444
mean nucleotide diversity      0.100000
variable sites                 6.000000
dtype: float64
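
In the function above, `np.any(arr != arr[0], axis=0)` is True for columns in which at least one row differs from the first row, so its sum is the number of columns that vary. The sketch below shows one way to label the two counts accordingly. The helper name `site_counts` is hypothetical, the `seqs` array is rebuilt from the output printed earlier, and missing bases ("N") are counted as differences, as in the original code.

```python
import numpy as np
import pandas as pd

def site_counts(arr):
    # hypothetical helper: count sites that vary and sites that do not
    variable = int(np.any(arr != arr[0], axis=0).sum())
    invariant = arr.shape[1] - variable
    return pd.Series({"invariant sites": invariant, "variable sites": variable})

# the simulated array printed earlier in this notebook
seqs = np.array([list(s) for s in [
    "TGATCGCCCAATAGG",
    "GGATCGCACAATNGC",
    "NGANCGCNTAATAAN",
    "GGATCNCACAATAGG",
    "GGANNGCACAATGGC",
    "GGATCGCACAATAGC",
]])

counts = site_counts(seqs)
print(counts)
```

For this array, nine of the fifteen columns contain at least one base that differs from the first row, and six columns are identical in every row.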

### Instructions: Write a `seqlib` Class object

I started writing the bare bones of it below. You should write it so that it can be executed as described below to perform all of the functions we defined above, and so that its attributes can be accessed. Save this class object in a `.py` file and make it into an importable package called `seqlib`. You can write and test your object in this notebook if you like, but it must be saved separately in a `.py` file and be imported. You cannot execute the code at the end using your object defined here in the notebook. When finished save your package to GitHub as a repo just like we did with the `helloworld` package. You do not need to write a CLI script like we did for the `helloworld` package, we will only be using the Python API here. See the examples below for **how you should write your Class object**. It should be able to run in the way written below, so look at that code and think about how you would write a Class object that can do that. 

While you can mostly copy the functions from above, you will need to modify them slightly to access information about the Class object using *self*. For example, the `simulate()` function below takes self as a first argument and can access `self.ninds` and `self.nsites` from that, so we do not need to provide those as arguments to the `simulate` function like we did above. 

In [4]:
import numpy as np
import pandas as pd

class Seqlib:
    
    def __init__(self, ninds, nsites):
        self.ninds = ninds
        self.nsites = nsites
        self.seqs = self.simulate()
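        # compute the MAF once and store it; on the instance this array
        # attribute shadows the maf() method, so use `seqs.maf` without calling it
        self.maf = self.maf()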
        
    
    # Make mutated base, later used in function simulate
    def mutate(self, base):
        diff = set("ACTG") - set(base)
        return np.random.choice(list(diff))
    
    # Simulate a random sequence as arrays of multiple individuals
    def simulate(self):
        oseq = np.random.choice(list("ACGT"), size=self.nsites)
        arr = np.array([oseq for i in range(self.ninds)])
        muts = np.random.binomial(1, 0.1, (self.ninds, self.nsites))
    
        for col in range(self.nsites):
            newbase = self.mutate(arr[0, col])
            mask = muts[:, col].astype(bool)
            arr[:, col][mask] = newbase
    
        missing = np.random.binomial(1, 0.1, (self.ninds, self.nsites))
        arr[missing.astype(bool)] = "N"    
        return arr
    
    # Return MAF as floats
    def maf(self):
        freqs = np.sum(self.seqs != self.seqs[0], axis=0) / self.seqs.shape[0]
        maf = freqs.copy()
        maf[maf > 0.5] = 1 - maf[maf > 0.5]
        return maf
    
    # Keep only the columns whose missing frequency is at most maxmissing
    def filter_missing(self, maxmissing):
        freqmissing = np.sum(self.seqs == "N", axis=0) / self.seqs.shape[0]
        return self.seqs[:, freqmissing <= maxmissing]
    
    # Keep only the columns whose minor allele frequency is greater than minmaf
    def filter_maf(self, minmaf):
        freqs = np.sum(self.seqs != self.seqs[0], axis=0) / self.seqs.shape[0]
        maf = freqs.copy()
        maf[maf > 0.5] = 1 - maf[maf > 0.5]
        return self.seqs[:, maf > minmaf]
    
    # Apply the missing-data filter, then the MAF filter, and return the filtered array
    def filter(self, maxmissing, minmaf):
        freqmissing = np.sum(self.seqs == "N", axis=0) / self.seqs.shape[0]
        arr = self.seqs[:, freqmissing <= maxmissing]
        freqs = np.sum(arr != arr[0], axis=0) / arr.shape[0]
        maf = freqs.copy()
        maf[maf > 0.5] = 1 - maf[maf > 0.5]
        return arr[:, maf > minmaf]
    
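    def calculate_statistics(self):
        # same calculations as the standalone function above, applied to self.seqs;
        # note that `inv` counts the columns where any row differs from the first row
        nd = np.var(self.seqs == self.seqs[0], axis=0).mean()
        mf = np.mean(np.sum(self.seqs != self.seqs[0], axis=0) / self.seqs.shape[0])
        inv = np.any(self.seqs != self.seqs[0], axis=0).sum()
        var = self.seqs.shape[1] - inv
    
        # return all values as a pandas Series with the specified names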
        return pd.Series(
            {"mean nucleotide diversity": nd,
             "mean minor allele frequency": mf,
             "invariant sites": inv,
             "variable sites": var,
            })

## Test your package
The package should be globally importable (you ran `pip install .` or `pip install -e .` to install it), and it should be able to execute the following code without error. 

In [1]:
import seqlib

In [2]:
# init a Seqlib Class object
seqs = seqlib.Seqlib(ninds=10, nsites=50)

In [3]:
# access attributes from the object
print(seqs.ninds, seqs.nsites)
print(seqs.seqs)

10 50
[['A' 'T' 'C' 'C' 'T' 'T' 'C' 'C' 'A' 'C' 'C' 'A' 'G' 'G' 'N' 'A' 'T' 'C'
  'C' 'G' 'T' 'T' 'A' 'C' 'G' 'A' 'T' 'G' 'T' 'A' 'G' 'A' 'T' 'T' 'C' 'G'
  'G' 'N' 'G' 'C' 'G' 'T' 'G' 'T' 'N' 'A' 'T' 'T' 'G' 'G']
 ['A' 'T' 'C' 'C' 'T' 'T' 'C' 'C' 'A' 'C' 'C' 'A' 'N' 'G' 'G' 'A' 'T' 'C'
  'G' 'G' 'C' 'T' 'A' 'C' 'T' 'A' 'T' 'G' 'T' 'T' 'N' 'A' 'T' 'T' 'C' 'G'
  'T' 'G' 'G' 'C' 'G' 'T' 'G' 'T' 'A' 'N' 'T' 'N' 'C' 'G']
 ['A' 'T' 'N' 'N' 'T' 'T' 'N' 'C' 'A' 'C' 'G' 'A' 'G' 'G' 'C' 'A' 'T' 'C'
  'G' 'G' 'T' 'T' 'A' 'C' 'G' 'A' 'T' 'G' 'T' 'T' 'G' 'A' 'T' 'T' 'C' 'G'
  'T' 'G' 'G' 'C' 'G' 'T' 'N' 'T' 'A' 'A' 'N' 'T' 'G' 'G']
 ['A' 'A' 'C' 'N' 'T' 'T' 'C' 'C' 'A' 'C' 'C' 'A' 'G' 'G' 'N' 'N' 'T' 'C'
  'G' 'A' 'T' 'T' 'A' 'C' 'G' 'A' 'T' 'G' 'T' 'T' 'G' 'A' 'T' 'T' 'C' 'G'
  'T' 'G' 'G' 'C' 'G' 'T' 'C' 'T' 'A' 'G' 'T' 'N' 'G' 'N']
 ['A' 'T' 'N' 'C' 'T' 'T' 'C' 'C' 'A' 'C' 'N' 'A' 'G' 'G' 'G' 'N' 'T' 'C'
  'G' 'G' 'N' 'T' 'A' 'C' 'G' 'A' 'T' 'G' 'T' 'A' 'G' 'A' 'T' 'T' 'C' 'G'
  'N' 'N' 'C' 'C' 

In [4]:
# returns the MAF of the array as an array of floats
seqs.maf

array([0.1, 0.2, 0.3, 0.2, 0.3, 0.1, 0.2, 0.1, 0. , 0. , 0.2, 0.3, 0.2,
       0.1, 0.2, 0.3, 0.1, 0.1, 0.2, 0.2, 0.4, 0. , 0. , 0.2, 0.3, 0.1,
       0. , 0.2, 0. , 0.2, 0.3, 0. , 0.2, 0.1, 0.1, 0. , 0.1, 0.3, 0.2,
       0.3, 0. , 0.2, 0.3, 0. , 0.2, 0.2, 0.3, 0.3, 0.2, 0.3])

In [5]:
# return the filtered sequence array by applying a new function 
# called `filter()` that applies both the maf and missing filter functions
seqs.filter(minmaf=0.1, maxmissing=0.0)

array([['G', 'A', 'G', 'C', 'G'],
       ['G', 'T', 'G', 'C', 'C'],
       ['G', 'T', 'G', 'C', 'G'],
       ['A', 'T', 'G', 'C', 'G'],
       ['G', 'A', 'C', 'C', 'C'],
       ['G', 'T', 'G', 'T', 'G'],
       ['G', 'T', 'G', 'C', 'G'],
       ['G', 'T', 'G', 'T', 'G'],
       ['G', 'T', 'G', 'T', 'G'],
       ['A', 'T', 'C', 'C', 'G']], dtype='<U1')

In [6]:
# calculate statistics for an array with the results returned as a pandas Series
seqs.calculate_statistics()

invariant sites                40.0000
mean minor allele frequency     0.2480
mean nucleotide diversity       0.1276
variable sites                 10.0000
dtype: float64

In [7]:
# calculate statistics for an array after filtering it
seqs.filter(minmaf=0.1, maxmissing=0.0).calculate_statistics()

AttributeError: 'numpy.ndarray' object has no attribute 'calculate_statistics'
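
The error arises because `filter()` returns a plain NumPy array, and a NumPy array has no `calculate_statistics` method. One possible way to allow the chained call is for `filter()` to wrap its result in a new object of the same class. The sketch below illustrates the idea with a hypothetical minimal class, `SeqArray`, that only stores an array; it is not the `Seqlib` class itself, and `count_variable()` is a stand-in for `calculate_statistics()`.

```python
import numpy as np

class SeqArray:
    """Hypothetical minimal stand-in for Seqlib that only holds an array."""

    def __init__(self, seqs):
        self.seqs = seqs

    def filter(self, maxmissing, minmaf):
        # same two filters as in the class above
        freqmissing = np.sum(self.seqs == "N", axis=0) / self.seqs.shape[0]
        arr = self.seqs[:, freqmissing <= maxmissing]
        freqs = np.sum(arr != arr[0], axis=0) / arr.shape[0]
        maf = np.minimum(freqs, 1 - freqs)
        # wrap the filtered array in a new object instead of returning it bare
        return SeqArray(arr[:, maf > minmaf])

    def count_variable(self):
        # stand-in for calculate_statistics(): number of columns that vary
        return int(np.any(self.seqs != self.seqs[0], axis=0).sum())

# the six-individual array printed earlier in this notebook
obj = SeqArray(np.array([list(s) for s in [
    "TGATCGCCCAATAGG",
    "GGATCGCACAATNGC",
    "NGANCGCNTAATAAN",
    "GGATCNCACAATAGG",
    "GGANNGCACAATGGC",
    "GGATCGCACAATAGC",
]]))

filtered = obj.filter(maxmissing=0.1, minmaf=0.1)
n_variable = filtered.count_variable()  # chaining works: filtered is a SeqArray
print(filtered.seqs.shape, n_variable)
```

The trade-off is that callers who want the raw array must then read it from the `.seqs` attribute of the returned object.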