# Assignment

-------------

**1) Complete this notebook and make a pull request:** 

Answer questions (Q) in the space provided (A) in this notebook. When finished, copy your notebook to the `Assignment/` directory and name it `nb-6.5-<Github-username>.ipynb`. Then make a pull request to the upstream repo. Your answers will be plain Markdown text in which you interpret and describe a block of code to better understand what it does. Much of this code you will have seen already.


**2) Write an importable Python package, save as a repo, and test it here.**

The package should be written as we did in our last lesson (`.py` files in a directory with a `setup.py` file so it can be installed with `pip`). Follow the instructions at the end of this notebook for how to write your package. Test it here by importing the package and executing the code at the end. It should work and give correct answers; if not, keep working on it. When you have completed it, save your package as a new GitHub repo named `seqlib`.

### The `seqlib` package

Together we are going to write several functions here that will make up your new package called `seqlib`. It will be your job to copy these functions, organize them into a Class, save the code into a `.py` file (you can use SublimeText if you're comfortable with it for much of this, or any text editor including the one in jupyter), package the files so they can be imported as a library, and test the package so that it accomplishes the tasks which are defined at the end of this notebook. First things first, though, let's write the functions. 

In [5]:
import numpy as np
import pandas as pd

### Q.  Describe what the `mutate()` function below does:


A. The `mutate()` function first builds a set, `diff`, containing the bases in "ACTG" minus the base passed to the function. It then returns one of the three remaining bases, chosen at random. So `mutate("T")` will never return 'T', because 'T' is not in `diff`.

In [6]:
def mutate(base):
    diff = set("ACTG") - set(base)
    return np.random.choice(list(diff))


In [7]:
# test it
mutate("T")

'A'
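As a quick check of the description above, the sketch below calls `mutate()` many times for each base and collects the distinct results. The draw count of 200 and the `results` dictionary are illustrative choices, not part of the assignment.

```python
import numpy as np

def mutate(base):
    # same definition as in the notebook
    diff = set("ACTG") - set(base)
    return np.random.choice(list(diff))

# draw many mutations for each base and collect the distinct results
results = {base: {str(mutate(base)) for _ in range(200)} for base in "ACGT"}

for base, seen in results.items():
    # the input base should never appear among its own mutations
    print(base, "->", sorted(seen))
```

Whatever the random draws are, the input base itself should be absent from each set.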

### Q. Describe how the `simulate()` function below works:
Annotate the code by inserting lines with comments as you read through the function to make sense of it. What is being created at each step and how is it used?


A. This function simulates a matrix of sequence data with `ninds` rows (individuals) and `nsites` columns (sites). It first draws a random original sequence, `oseq`, of length `nsites`, and copies it `ninds` times into `arr`. It then introduces mutations: `muts` marks each cell as mutated with probability 0.1, and in each column the marked individuals all receive the same new base from `mutate()`. Finally, each cell is independently replaced with 'N' with probability 0.1 to represent missing data.

In [8]:
def simulate(ninds, nsites):       # requires 2 arguments
    oseq = np.random.choice(list("ACGT"), size=nsites)  # random original sequence of length nsites
    arr = np.array([oseq for i in range(ninds)])        # stack ninds copies: shape (ninds, nsites)
    muts = np.random.binomial(1, 0.1, (ninds, nsites))  # 1 marks a mutated cell (probability 0.1)
    for col in range(nsites):                           # visit each site (column)
        newbase = mutate(arr[0, col])                   # a base different from the original at this site
        mask = muts[:, col].astype(bool)                # individuals that mutate at this site
        arr[:, col][mask] = newbase                     # give those individuals the new base
    missing = np.random.binomial(1, 0.1, (ninds, nsites))  # 1 marks a missing cell (probability 0.1)
    arr[missing.astype(bool)] = "N"                     # overwrite missing cells with "N"
    return arr

In [9]:
simulate(6,15)

array([['C', 'T', 'A', 'A', 'N', 'G', 'G', 'C', 'A', 'N', 'C', 'G', 'G',
        'T', 'C'],
       ['C', 'T', 'A', 'A', 'A', 'G', 'G', 'C', 'A', 'A', 'C', 'G', 'T',
        'C', 'C'],
       ['C', 'T', 'A', 'A', 'N', 'C', 'G', 'G', 'A', 'A', 'C', 'C', 'T',
        'C', 'G'],
       ['T', 'T', 'A', 'A', 'T', 'C', 'G', 'G', 'A', 'A', 'N', 'G', 'T',
        'C', 'C'],
       ['T', 'T', 'A', 'N', 'N', 'N', 'N', 'C', 'A', 'A', 'C', 'G', 'G',
        'C', 'C'],
       ['C', 'T', 'A', 'N', 'A', 'C', 'T', 'C', 'A', 'T', 'A', 'G', 'G',
        'C', 'C']], dtype='<U1')

In [10]:
seqs = simulate(6, 15)  #matrix size 6 by 15
print(seqs)

[['G' 'G' 'C' 'A' 'G' 'C' 'A' 'T' 'C' 'G' 'G' 'N' 'G' 'T' 'T']
 ['N' 'G' 'A' 'A' 'G' 'N' 'A' 'T' 'C' 'G' 'G' 'T' 'G' 'T' 'T']
 ['G' 'G' 'C' 'A' 'G' 'C' 'A' 'T' 'A' 'A' 'G' 'T' 'G' 'T' 'T']
 ['G' 'G' 'C' 'A' 'N' 'C' 'A' 'N' 'C' 'G' 'G' 'T' 'G' 'N' 'T']
 ['G' 'G' 'C' 'A' 'G' 'C' 'N' 'T' 'C' 'G' 'G' 'T' 'G' 'T' 'T']
 ['G' 'G' 'N' 'T' 'G' 'C' 'A' 'A' 'C' 'G' 'G' 'G' 'G' 'T' 'T']]
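The step in `simulate()` that is easiest to misread is `arr[:, col][mask] = newbase`. The sketch below isolates it on a small array with hand-written mutation draws; the `muts` values and the replacement base "T" are made up for illustration and are not random as in the real function.

```python
import numpy as np

# four individuals, three sites, all starting from the same sequence
arr = np.array([list("ACG") for _ in range(4)])

# hypothetical mutation draws: 1 marks an individual that mutates at that site
muts = np.array([
    [0, 0, 0],
    [1, 0, 0],
    [0, 0, 1],
    [1, 0, 0],
])

col = 0
mask = muts[:, col].astype(bool)   # [False, True, False, True]
arr[:, col][mask] = "T"            # arr[:, col] is a view, so arr itself changes

print(arr)
```

Only rows 1 and 3 change in column 0; the other columns are untouched because the loop handles one column at a time.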


### **Q: Describe how the `filter_missing` function works:**
Annotate the code by inserting lines with comments as you read through the function to make sense of it. How does it find columns with missing (N) values in them? How might you improve it?

A. It filters out sites with too much missing data. For each column it counts the "N" entries and divides by the number of rows (`arr.shape[0]`) to get the frequency of missing data. It then returns only the columns whose missing frequency is less than or equal to `maxfreq`.

In [11]:
def filter_missing(arr, maxfreq):   # arr: (ninds, nsites) array; maxfreq: max allowed fraction of N's
    freqmissing = np.sum(arr == "N", axis=0) / arr.shape[0]  # fraction of rows that are "N" in each column
    return arr[:, freqmissing <= maxfreq]  # keep columns at or below the threshold

In [12]:
filter_missing(seqs, 0.1)

array([['G', 'A', 'C', 'G', 'G', 'G', 'T'],
       ['G', 'A', 'C', 'G', 'G', 'G', 'T'],
       ['G', 'A', 'A', 'A', 'G', 'G', 'T'],
       ['G', 'A', 'C', 'G', 'G', 'G', 'T'],
       ['G', 'A', 'C', 'G', 'G', 'G', 'T'],
       ['G', 'T', 'C', 'G', 'G', 'G', 'T']], dtype='<U1')
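To see the missing-data frequency calculation on numbers that are easy to verify by hand, here is a small sketch with a hand-written array (the array and the 0.25 threshold are illustrative, not from the assignment).

```python
import numpy as np

# four individuals, four sites; "N" marks missing data
arr = np.array([
    list("ANGT"),
    list("ACGN"),
    list("ANGT"),
    list("ACGT"),
])

# count N's down each column and divide by the number of rows
freqmissing = np.sum(arr == "N", axis=0) / arr.shape[0]
print(freqmissing.tolist())   # [0.0, 0.5, 0.0, 0.25]

# keep columns whose missing frequency is at or below the threshold
kept = arr[:, freqmissing <= 0.25]
print(kept.shape)             # (4, 3)
```

The second column is dropped because half of its entries are missing; the last column is kept because the comparison is `<=`, not `<`.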

### **Q: Describe how the `filter_maf` function works:**
Annotate the code by inserting lines with comments as you read through the function to make sense of it. How does it calculate minor allele frequencies? Why does it use copy?

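A. It filters sites by minor allele frequency (MAF). For each column it computes the fraction of rows whose base differs from the first row (an "N" counts as a difference). Frequencies above 0.5 are replaced with 1 - frequency, so the value always describes the rarer of the two states. It uses `copy()` so that `maf` can be modified in place without changing the original `freqs` array. Finally it returns only the columns whose MAF is greater than `minfreq`.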

In [13]:
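def filter_maf(arr, minfreq):  # filter for minor allele frequency
    freqs = np.sum(arr != arr[0], axis=0) / arr.shape[0]  # per column: fraction of rows differing from the first row
    maf = freqs.copy()  # copy so that editing maf does not change freqs
    maf[maf > 0.5] = 1 - maf[maf > 0.5]  # fold frequencies above 0.5 to get the minor frequency
    return arr[:, maf > minfreq]  # keep columns whose MAF exceeds the threshold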

In [14]:
filter_maf(seqs, 0.1)

array([['G', 'C', 'A', 'G', 'C', 'A', 'T', 'C', 'G', 'N', 'T'],
       ['N', 'A', 'A', 'G', 'N', 'A', 'T', 'C', 'G', 'T', 'T'],
       ['G', 'C', 'A', 'G', 'C', 'A', 'T', 'A', 'A', 'T', 'T'],
       ['G', 'C', 'A', 'N', 'C', 'A', 'N', 'C', 'G', 'T', 'N'],
       ['G', 'C', 'A', 'G', 'C', 'N', 'T', 'C', 'G', 'T', 'T'],
       ['G', 'N', 'T', 'G', 'C', 'A', 'A', 'C', 'G', 'G', 'T']],
      dtype='<U1')
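The folding step and the reason for `copy()` can be shown on a few hand-picked frequencies. This is a small sketch; the `freqs` values and the `alias` name are hypothetical and chosen only to make the arithmetic obvious.

```python
import numpy as np

# hypothetical frequencies of "differs from the first row" at four sites
freqs = np.array([0.0, 0.25, 0.75, 1.0])

maf = freqs.copy()                     # work on a copy so freqs is left intact
maf[maf > 0.5] = 1 - maf[maf > 0.5]    # fold frequencies above 0.5

print(freqs.tolist())   # [0.0, 0.25, 0.75, 1.0]
print(maf.tolist())     # [0.0, 0.25, 0.25, 0.0]

# without copy(), both names would refer to the same array
alias = freqs
alias[alias > 0.5] = 1 - alias[alias > 0.5]
print(freqs.tolist())   # [0.0, 0.25, 0.25, 0.0]
```

In `filter_maf()` itself `freqs` is not used again after folding, so the copy is a safeguard rather than a strict requirement.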

### Q: What order should these functions be applied, does it matter?

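A. The order should not change the result here. Both filters decide whether to keep a column from that column's own contents, so applying them in either order keeps the same set of columns, and both calls below return the same array. It still reads more naturally the second way: remove the sites with too much missing data first, then filter by MAF.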

In [15]:
filter_missing(filter_maf(seqs, 0.1), 0.1)

array([['A', 'C', 'G'],
       ['A', 'C', 'G'],
       ['A', 'A', 'A'],
       ['A', 'C', 'G'],
       ['A', 'C', 'G'],
       ['T', 'C', 'G']], dtype='<U1')

In [16]:
filter_maf(filter_missing(seqs, 0.1), 0.1)

array([['A', 'C', 'G'],
       ['A', 'C', 'G'],
       ['A', 'A', 'A'],
       ['A', 'C', 'G'],
       ['A', 'C', 'G'],
       ['T', 'C', 'G']], dtype='<U1')
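One way to test whether the order matters beyond a single example is to compare both orders on many random matrices. The sketch below does that under an assumption: the test matrices are drawn uniformly from "ACGTN" with a seeded generator, which is not the same model as `simulate()`.

```python
import numpy as np

def filter_missing(arr, maxfreq):
    freqmissing = np.sum(arr == "N", axis=0) / arr.shape[0]
    return arr[:, freqmissing <= maxfreq]

def filter_maf(arr, minfreq):
    freqs = np.sum(arr != arr[0], axis=0) / arr.shape[0]
    maf = freqs.copy()
    maf[maf > 0.5] = 1 - maf[maf > 0.5]
    return arr[:, maf > minfreq]

rng = np.random.default_rng(0)
same = []
for _ in range(100):
    # random test matrix of bases and N's (not the simulate() model)
    arr = rng.choice(list("ACGTN"), size=(6, 15))
    a = filter_missing(filter_maf(arr, 0.1), 0.1)
    b = filter_maf(filter_missing(arr, 0.1), 0.1)
    same.append(np.array_equal(a, b))

print(all(same))   # True
```

Both filters only drop columns and never rows, so each column's frequencies are unchanged by the other filter, which is why the two orders agree.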

### Q: Describe how `calculate_statistics()` works


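A. It computes four summary values from the array and returns them as a pandas Series, which displays each label next to its value. `nd` is the mean across sites of the variance of the "matches the first row" indicator. `mf` is the mean across sites of the fraction of rows that differ from the first row. `inv` is the number of sites where any row differs from the first row, and `var` is the total number of sites minus `inv`. As written, `inv` therefore counts sites that vary, so the two site-count labels appear to be swapped.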

In [12]:
def calculate_statistics(arr):
    nd = np.var(arr == arr[0], axis=0).mean()  # per-site variance of "matches first row", averaged over sites
    mf = np.mean(np.sum(arr != arr[0], axis=0) / arr.shape[0])  # mean fraction of rows differing from the first row
    inv = np.any(arr != arr[0], axis=0).sum()  # number of sites where any row differs from the first row
    var = arr.shape[1] - inv  # remaining sites, where every row matches the first row
    return pd.Series(
        {"mean nucleotide diversity": nd,
         "mean minor allele frequency": mf,
         "invariant sites": inv,
         "variable sites": var,
        })

In [13]:
calculate_statistics(seqs)

invariant sites                12.000000
mean minor allele frequency     0.255556
mean nucleotide diversity       0.150000
variable sites                  3.000000
dtype: float64
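The site counts reported above are easier to interpret on a small hand-written array. The sketch below reproduces only the `np.any(...)` step from the function; the array is made up for illustration.

```python
import numpy as np

# three individuals, five sites; sites 1 and 3 contain a difference
arr = np.array([
    list("ACGTA"),
    list("ACGAA"),
    list("ATGTA"),
])

differs = np.any(arr != arr[0], axis=0)   # True where some row differs from row 0
print(differs.tolist())                   # [False, True, False, True, False]
print(int(differs.sum()))                 # 2 sites with at least one difference
print(int(arr.shape[1] - differs.sum()))  # 3 sites where every row matches row 0
```

On this array the `np.any(...).sum()` expression gives the number of sites that vary, which suggests that the value stored under "invariant sites" in the function is really the count of variable sites, and vice versa.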

### Instructions: Write a `seqlib` Class object

I started writing the bare bones of it below. You should write it so that it can be executed as described below to perform all of the functions we defined above, and so that its attributes can be accessed. Save this class object in a `.py` file and make it into an importable package called `seqlib`. You can write and test your object in this notebook if you like, but it must be saved separately in a `.py` file and be imported. You cannot execute the code at the end using your object defined here in the notebook. When finished save your package to GitHub as a repo just like we did with the `helloworld` package. You do not need to write a CLI script like we did for the `helloworld` package, we will only be using the Python API here. See the examples below for **how you should write your Class object**. It should be able to run in the way written below, so look at that code and think about how you would write a Class object that can do that. 

While you can mostly copy the functions from above, you will need to modify them slightly to access information about the Class object using *self*. For example, the `simulate()` function below takes self as a first argument and can access `self.ninds` and `self.nsites` from that, so we do not need to provide those as arguments to the `simulate` function like we did above.

In [54]:
import numpy as np
import pandas as pd

class seqlib:
    def __init__(self, ninds, nsites):
        self.ninds = ninds
        self.nsites = nsites
        self.seqs = self.simulate()

    def mutate(self, base):
        # return a random base other than the one passed in
        diff = set("ACTG") - set(base)
        return np.random.choice(list(diff))

    def simulate(self):
        ninds = self.ninds
        nsites = self.nsites
        oseq = np.random.choice(list("ACGT"), size=nsites)
        arr = np.array([oseq for i in range(ninds)])
        muts = np.random.binomial(1, 0.1, (ninds, nsites))
        for col in range(nsites):
            newbase = self.mutate(arr[0, col])  # methods must be called through self
            mask = muts[:, col].astype(bool)
            arr[:, col][mask] = newbase
        missing = np.random.binomial(1, 0.1, (ninds, nsites))
        arr[missing.astype(bool)] = "N"
        return arr

    def filter_missing(self, maxfreq):
        arr = self.seqs
        freqmissing = np.sum(arr == "N", axis=0) / arr.shape[0]
        return arr[:, freqmissing <= maxfreq]

    def filter(self, minfreq, maxfreq):
        # drop high-missing columns first, then filter what is left by MAF
        return self.filter_maf(minfreq, arr=self.filter_missing(maxfreq))

    def filter_maf(self, minmaf, arr=None):
        # arr is optional so that filter() can pass an already-filtered array
        arr = self.seqs if arr is None else arr
        freqs = np.sum(arr != arr[0], axis=0) / arr.shape[0]
        maf = freqs.copy()
        maf[maf > 0.5] = 1 - maf[maf > 0.5]
        return arr[:, maf > minmaf]

    def maf(self):
        arr = self.seqs
        freqs = np.sum(arr != arr[0], axis=0) / arr.shape[0]
        maf = freqs.copy()
        maf[maf > 0.5] = 1 - maf[maf > 0.5]
        return maf  # return the folded frequencies, not the raw ones

    def calculate_statistics(self):
        arr = self.seqs
        nd = np.var(arr == arr[0], axis=0).mean()  # per-site variance of "matches first row", averaged over sites
        mf = np.mean(np.sum(arr != arr[0], axis=0) / arr.shape[0])  # mean fraction of rows differing from the first row
        inv = np.any(arr != arr[0], axis=0).sum()  # number of sites where any row differs from the first row
        var = arr.shape[1] - inv  # remaining sites, where every row matches the first row
        return pd.Series(
            {"mean nucleotide diversity": nd,
             "mean minor allele frequency": mf,
             "invariant sites": inv,
             "variable sites": var,
            })


## Test your package
The package should be globally importable (you ran `pip install .` or `pip install -e .` to install it), and it should be able to execute the following code without error. 

In [51]:
import os
import sys

setup = """
from setuptools import setup
setup(
    name="mypackage",
    version="0.1",
    packages=["seqlib"],
)
"""


init = """
from .seqlib import seqlib
"""

seqlib = """
def seqlib():
    print("hello world")
"""

## let's define some names that we'll use for paths
prjname = "seqlib"
pkgname = "seqlib"
storeloc = os.path.expanduser("~/PDSB/")

## now let's create some joint paths with the os module
prjpath = os.path.join(storeloc, prjname)
pkgpath = os.path.join(storeloc, prjname, pkgname)

## check out paths
print(prjpath)
print(pkgpath)

## make the directories (exist_ok allows for it to already exist)
os.makedirs(pkgpath, exist_ok=True)


## write setup.py file
with open(os.path.join(prjpath, "setup.py"), 'w') as out:
    out.write(setup)
    
## write init file
with open(os.path.join(pkgpath, "__init__.py"), 'w') as out:
    out.write(init)
    
## write script to file
with open(os.path.join(pkgpath, "seqlib.py"), 'w') as out:
    out.write(seqlib)

/Users/chloehacker/PDSB/seqlib
/Users/chloehacker/PDSB/seqlib/seqlib


In [57]:
import seqlib

ModuleNotFoundError: No module named 'seqlib'
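The `ModuleNotFoundError` above is consistent with the package files having been written to disk without the package being installed, since Python only imports from locations it knows about. The sketch below illustrates that idea with a throwaway package in a temporary directory; the name `demo_pkg` is hypothetical, and adding the directory to `sys.path` is only a stand-in for what running `pip install -e .` in the project directory is meant to achieve.

```python
import importlib
import os
import sys
import tempfile

# write a throwaway package (hypothetical name `demo_pkg`) into a temp directory
root = tempfile.mkdtemp()
pkgdir = os.path.join(root, "demo_pkg")
os.makedirs(pkgdir, exist_ok=True)
with open(os.path.join(pkgdir, "__init__.py"), "w") as out:
    out.write("def hello():\n    return 'hello world'\n")

# the files exist, but Python cannot find them until their parent is on sys.path
try:
    importlib.import_module("demo_pkg")
    found_before = True
except ModuleNotFoundError:
    found_before = False

# make the parent directory visible to the import system, then import
sys.path.insert(0, root)
importlib.invalidate_caches()
demo_pkg = importlib.import_module("demo_pkg")

print(found_before)       # False, assuming no other demo_pkg is installed
print(demo_pkg.hello())   # hello world
```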