# Assignment

-------------

**1) Complete this notebook and make a pull request:** 

Answer questions (Q) in the space provided (A) in this notebook. When finished, copy your notebook to the `Assignment/` directory and name it `nb-6.5-<Github-username>.ipynb`. Then make a pull request to the upstream repo. Your answers will simply be Markdown text in which you interpret and describe a block of code to better understand what it is doing. Much of this code you will have seen already.


**2) Write an importable Python package, save as a repo, and test it here.**

The package should be written as we did in our last lesson (`.py` files in a directory with a `setup.py` file so it can be installed with `pip`). Follow the instructions at the end of this notebook for how to write your package. Test it here by importing the package and executing the code at the end. It should run and give correct answers; if it does not, keep working on it. When it is complete, save your package as a new GitHub repo named `seqlib`.

### The `seqlib` package

Together we are going to write several functions here that will make up your new package called `seqlib`. It will be your job to copy these functions, organize them into a Class, save the code into a `.py` file (you can use SublimeText if you're comfortable with it for much of this, or any text editor including the one in Jupyter), package the files so they can be imported as a library, and test the package so that it accomplishes the tasks defined at the end of this notebook. First things first, though, let's write the functions.

In [13]:
import numpy as np
import pandas as pd

### Q.  Describe what the `mutate()` function below does:


A. 

In [14]:
def mutate(base):
    # Subtract the given base from the set of four bases, leaving the three bases it could mutate to.
    diff = set("ACTG") - set(base)
    # Return one base chosen at random from the set that doesn't include the given base.
    return np.random.choice(list(diff))

In [15]:
# test it
mutate("A")

'T'
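
As a quick check of the behavior described above, the sketch below calls `mutate()` many times for each base and confirms that it never returns the base it was given. The fixed seed, the 200 repetitions, and the use of `sorted()` (so the candidate order is the same on every run) are assumptions added only to make the check repeatable.

```python
import numpy as np


def mutate(base):
    # remove the input base from the four nucleotides, then pick one of the rest
    diff = set("ACTG") - set(base)
    # sorted() is an assumption here: it fixes the order of the candidates
    return np.random.choice(sorted(diff))


np.random.seed(0)  # assumption: fixed seed so the run is repeatable

# collect every base returned for each input base over 200 draws
results = {base: {str(mutate(base)) for _ in range(200)} for base in "ACGT"}
for base, returned in results.items():
    print(base, "->", sorted(returned))
```

Each printed set should contain only bases other than the one on the left.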

### Q. Describe how the `simulate()` function below works:
Annotate the code by inserting lines with comments as you read through the function to make sense of it. What is being created at each step and how is it used?


A. 

In [16]:
def simulate(ninds, nsites):
    # First, draw a random sequence of bases with one base per site being sampled
    oseq = np.random.choice(list("ACGT"), size=nsites)
    # Stack ninds copies of oseq into an array with shape (ninds, nsites)
    arr = np.array([oseq for i in range(ninds)])
    # Draw a 0/1 matrix marking which cells will mutate (probability 0.1 per cell)
    muts = np.random.binomial(1, 0.1, (ninds, nsites))
    # For each column, pick one new base and assign it to the rows that muts flags in that column
    for col in range(nsites):
        newbase = mutate(arr[0, col])
        mask = muts[:, col].astype(bool)
        arr[:, col][mask] = newbase
    # Draw a second 0/1 matrix marking which cells are missing (probability 0.1 per cell)
    missing = np.random.binomial(1, 0.1, (ninds, nsites))
    # Apply that mask to the array and label the missing cells as "N"
    arr[missing.astype(bool)] = "N"
    return arr

In [17]:
seqs = simulate(10, 10)
print(seqs)

[['N' 'C' 'N' 'C' 'C' 'N' 'A' 'T' 'G' 'G']
 ['A' 'C' 'G' 'T' 'A' 'C' 'N' 'T' 'G' 'G']
 ['A' 'C' 'G' 'T' 'A' 'C' 'A' 'T' 'N' 'G']
 ['C' 'N' 'C' 'T' 'A' 'C' 'A' 'T' 'G' 'N']
 ['A' 'C' 'G' 'T' 'A' 'C' 'N' 'T' 'G' 'G']
 ['A' 'C' 'C' 'N' 'A' 'C' 'A' 'T' 'G' 'A']
 ['A' 'A' 'G' 'T' 'N' 'C' 'A' 'G' 'G' 'N']
 ['A' 'N' 'G' 'T' 'A' 'C' 'C' 'T' 'G' 'G']
 ['N' 'C' 'C' 'T' 'A' 'C' 'A' 'G' 'A' 'G']
 ['A' 'A' 'G' 'T' 'N' 'C' 'A' 'T' 'G' 'A']]
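
One consequence of the loop in `simulate()` is that each column receives a single new base, so a column can hold at most two real bases plus `N`. The sketch below checks that on a seeded run; the seed value and the `sorted()` call inside `mutate()` are assumptions added for repeatability.

```python
import numpy as np


def mutate(base):
    diff = set("ACTG") - set(base)
    return np.random.choice(sorted(diff))  # sorted() added for repeatability


def simulate(ninds, nsites):
    oseq = np.random.choice(list("ACGT"), size=nsites)
    arr = np.array([oseq for i in range(ninds)])
    muts = np.random.binomial(1, 0.1, (ninds, nsites))
    for col in range(nsites):
        newbase = mutate(arr[0, col])
        mask = muts[:, col].astype(bool)
        arr[:, col][mask] = newbase
    missing = np.random.binomial(1, 0.1, (ninds, nsites))
    arr[missing.astype(bool)] = "N"
    return arr


np.random.seed(123)  # assumption: fixed seed so the run is repeatable
seqs = simulate(10, 10)
print(seqs.shape)

# count the distinct real bases (ignoring "N") found in each column
nbases = [len(set(seqs[:, col]) - {"N"}) for col in range(seqs.shape[1])]
print(nbases)
```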


### **Q: Describe how the `filter_missing` function works:**
Annotate the code by inserting lines with comments as you read through the function to make sense of it. How does it find columns with missing (N) values in them? How might you improve it?

A. 

In [18]:
# This function removes columns whose fraction of missing data (N) is greater than maxfreq
def filter_missing(arr, maxfreq):
    # For each column, compute the fraction of rows that are "N"
    freqmissing = np.sum(arr == "N", axis=0) / arr.shape[0]
    # Return only the columns whose missing fraction is at most maxfreq
    return arr[:, freqmissing <= maxfreq]
# How can this be improved: the function could make an explicit copy so it is obvious that the original array isn't altered (the boolean-mask indexing used here already returns a new array).

In [19]:
filter_missing(seqs, 0.1)

array([['N', 'C', 'N', 'T', 'G'],
       ['G', 'T', 'C', 'T', 'G'],
       ['G', 'T', 'C', 'T', 'N'],
       ['C', 'T', 'C', 'T', 'G'],
       ['G', 'T', 'C', 'T', 'G'],
       ['C', 'N', 'C', 'T', 'G'],
       ['G', 'T', 'C', 'G', 'G'],
       ['G', 'T', 'C', 'T', 'G'],
       ['C', 'T', 'C', 'G', 'A'],
       ['G', 'T', 'C', 'T', 'G']], dtype='<U1')
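
To see how the missing-data frequencies are computed, and whether the original array is affected, here is a small sketch on a hand-built array. The 4 x 3 array is a hypothetical example, not data from this notebook.

```python
import numpy as np


def filter_missing(arr, maxfreq):
    freqmissing = np.sum(arr == "N", axis=0) / arr.shape[0]
    return arr[:, freqmissing <= maxfreq]


# hypothetical array: column 0 has no N, column 1 has one N, column 2 has two
arr = np.array([
    ["A", "N", "N"],
    ["A", "C", "N"],
    ["A", "C", "G"],
    ["A", "C", "G"],
])

# fraction of "N" per column
freqmissing = np.sum(arr == "N", axis=0) / arr.shape[0]
print(freqmissing.tolist())  # [0.0, 0.25, 0.5]

# columns 0 and 1 pass a 0.25 threshold; column 2 is dropped
filtered = filter_missing(arr, 0.25)
print(filtered.shape)  # (4, 2)

# boolean-mask indexing returns a new array, not a view of the original
print(np.shares_memory(arr, filtered))  # False
```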

### **Q: Describe how the `filter_maf` function works:**
Annotate the code by inserting lines with comments as you read through the function to make sense of it. How does it calculate minor allele frequencies? Why does it use copy?

A. 

In [23]:
# This function removes columns whose minor allele frequency is not greater than minfreq
def filter_maf(arr, minfreq):
    # For each column, compute the fraction of rows that differ from the first row
    freqs = np.sum(arr != arr[0], axis=0) / arr.shape[0]
    # Copy freqs so that folding the values below does not modify freqs itself
    maf = freqs.copy()
    # Fold frequencies above 0.5 to 1 - freq, so each value is the frequency of the rarer state
    maf[maf > 0.5] = 1 - maf[maf > 0.5]
    # Return only the columns whose minor allele frequency is greater than minfreq
    return arr[:, maf > minfreq]
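
The folding step and the reason for `.copy()` are easier to see on a tiny example. The sketch below repeats the steps of `filter_maf()` on a hypothetical 4 x 3 array and shows that `freqs` keeps its original values after `maf` is folded.

```python
import numpy as np

# hypothetical array: column 0 never differs from row 0,
# column 1 differs in 3 of 4 rows, column 2 differs in 1 of 4 rows
arr = np.array([
    ["A", "C", "G"],
    ["A", "T", "G"],
    ["A", "T", "G"],
    ["A", "T", "A"],
])

# fraction of rows in each column that differ from the first row
freqs = np.sum(arr != arr[0], axis=0) / arr.shape[0]

# work on a copy so freqs itself is left untouched
maf = freqs.copy()
maf[maf > 0.5] = 1 - maf[maf > 0.5]

print(freqs.tolist())  # [0.0, 0.75, 0.25]
print(maf.tolist())    # [0.0, 0.25, 0.25]

# keep the columns whose folded frequency exceeds 0.1
kept = arr[:, maf > 0.1]
print(kept.shape)  # (4, 2)
```

Without the copy, `maf` and `freqs` would be the same array, and the in-place fold would overwrite the unfolded frequencies.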

### Q: In what order should these functions be applied? Does it matter?

A. If only one of these functions made a copy, the order would matter: the function that makes the copy should run first and the other second. If both make copies, it shouldn't matter which one runs first. Both orders below return the same array.

In [24]:
filter_missing(filter_maf(seqs, 0.1), 0.1)

array([['T', 'G'],
       ['T', 'G'],
       ['T', 'N'],
       ['T', 'G'],
       ['T', 'G'],
       ['T', 'G'],
       ['G', 'G'],
       ['T', 'G'],
       ['G', 'A'],
       ['T', 'G']], dtype='<U1')

In [22]:
filter_maf(filter_missing(seqs, 0.1), 0.1)

array([['T', 'G'],
       ['T', 'G'],
       ['T', 'N'],
       ['T', 'G'],
       ['T', 'G'],
       ['T', 'G'],
       ['G', 'G'],
       ['T', 'G'],
       ['G', 'A'],
       ['T', 'G']], dtype='<U1')
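
Both filters decide whether to keep a column using only that column's own values, which suggests the order should not change the result. The sketch below checks this on a random array. Drawing the array directly with `np.random.choice`, the probabilities, and the seed are assumptions made to keep the example short; they stand in for `simulate()`.

```python
import numpy as np


def filter_missing(arr, maxfreq):
    freqmissing = np.sum(arr == "N", axis=0) / arr.shape[0]
    return arr[:, freqmissing <= maxfreq]


def filter_maf(arr, minfreq):
    freqs = np.sum(arr != arr[0], axis=0) / arr.shape[0]
    maf = freqs.copy()
    maf[maf > 0.5] = 1 - maf[maf > 0.5]
    return arr[:, maf > minfreq]


np.random.seed(42)  # assumption: fixed seed so the run is repeatable

# assumption: a random 10 x 40 array with about 10% "N", drawn directly
arr = np.random.choice(
    list("ACGTN"), size=(10, 40), p=[0.7, 0.1, 0.05, 0.05, 0.1]
)

a = filter_missing(filter_maf(arr, 0.1), 0.1)
b = filter_maf(filter_missing(arr, 0.1), 0.1)

# the two orders should keep exactly the same columns
print(a.shape, b.shape)
print(np.array_equal(a, b))
```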

### Q: Describe how `calculate_statistics()` works


A. 

In [12]:
# This function calculates summary statistics from an array like the one created with the previous functions
def calculate_statistics(arr):
    # nd: for each column, the variance of the True/False "matches the first row" values, averaged over columns
    nd = np.var(arr == arr[0], axis=0).mean()
    # mf: for each column, the fraction of rows that differ from the first row, averaged over columns
    mf = np.mean(np.sum(arr != arr[0], axis=0) / arr.shape[0])
    # inv: counts the columns in which at least one row differs from the first row.
    # Note that these are the columns that vary, even though the value is labeled "invariant sites" below
    inv = np.any(arr != arr[0], axis=0).sum()
    # var: the remaining columns, in which every row matches the first row (labeled "variable sites" below)
    var = arr.shape[1] - inv
    # Return a pandas Series holding the four values calculated above
    return pd.Series(
        {"mean nucleotide diversity": nd,
         "mean minor allele frequency": mf,
         "invariant sites": inv,
         "variable sites": var,
        })

In [13]:
calculate_statistics(seqs)

invariant sites                12.000000
mean minor allele frequency     0.255556
mean nucleotide diversity       0.150000
variable sites                  3.000000
dtype: float64
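
To check what each expression in the function actually counts, it helps to run the same expressions on a small array where the answer is known in advance. The 4 x 3 array and the variable names `n_with_difference` and `n_all_match` are hypothetical and chosen only for this sketch.

```python
import numpy as np

# hypothetical array: only the middle column varies (one "T" among "C"s)
arr = np.array([
    ["A", "C", "G"],
    ["A", "T", "G"],
    ["A", "C", "G"],
    ["A", "C", "G"],
])

differs = arr != arr[0]

# columns where at least one row differs from the first row
n_with_difference = int(np.any(differs, axis=0).sum())
# columns where every row matches the first row
n_all_match = arr.shape[1] - n_with_difference
print(n_with_difference, n_all_match)  # 1 2

# per-column fraction of differing rows is [0, 0.25, 0], then averaged
mf = np.mean(np.sum(differs, axis=0) / arr.shape[0])

# per-column variance of the match indicator is [0, 0.1875, 0], then averaged
nd = np.var(arr == arr[0], axis=0).mean()
print(mf, nd)
```

Here the `np.any(...).sum()` expression gives 1, the number of columns that vary, so it is worth checking which label that value is stored under.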

### Instructions: Write a `seqlib` Class object

I started writing the bare bones of it below. You should write it so that it can be executed as described below to perform all of the functions we defined above, and so that its attributes can be accessed. Save this class object in a `.py` file and make it into an importable package called `seqlib`. You can write and test your object in this notebook if you like, but it must be saved separately in a `.py` file and be imported. You cannot execute the code at the end using your object defined here in the notebook. When finished save your package to GitHub as a repo just like we did with the `helloworld` package. You do not need to write a CLI script like we did for the `helloworld` package, we will only be using the Python API here. See the examples below for **how you should write your Class object**. It should be able to run in the way written below, so look at that code and think about how you would write a Class object that can do that. 

While you can mostly copy the functions from above, you will need to modify them slightly to access information about the Class object using *self*. For example, the `simulate()` function below takes `self` as its first argument and can access `self.ninds` and `self.nsites` from it, so we do not need to provide those as arguments to the `simulate` function like we did above.

In [121]:
import numpy as np
import pandas as pd


class Seqlib:
    def __init__(self, ninds, nsites):
        # store attributes
        self.ninds = ninds
        self.nsites = nsites
        # simulate the sequence array once and store it
        self.seqs = self.simulate()
        # shorter aliases for the methods defined below
        self.maf = self.filter_maf
        self.missing = self.filter_missing
        self.filter = self.filter_array
        self.statistic = self.calculate_statistics
                        
    def simulate(self):
        oseq = np.random.choice(list("ACGT"), size=self.nsites)
        arr = np.array([oseq for i in range(self.ninds)])
        muts = np.random.binomial(1, 0.1, (self.ninds, self.nsites))
        for col in range(self.nsites):
            newbase = np.random.choice(list(set("ACTG") - set(arr[0,col])))
            mask = muts[:, col].astype(bool)
            arr[:, col][mask] = newbase
        missing = np.random.binomial(1, 0.1, (self.ninds, self.nsites))
        arr[missing.astype(bool)] = "N"
        return arr

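    def filter_missing(self, maxfreq):
        # fraction of "N" values in each column
        freqmissing = np.sum(self.seqs == "N", axis=0) / self.seqs.shape[0]
        # keep the columns whose missing fraction is at most maxfreq
        return self.seqs[:, freqmissing <= maxfreq]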
        
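    def filter_maf(self):
        # fraction of rows in each column that differ from the first row
        freqz = np.sum(self.seqs != self.seqs[0], axis=0) / self.seqs.shape[0]
        maf = freqz.copy()
        # fold frequencies above 0.5 to get the minor allele frequency
        maf[maf > 0.5] = 1 - maf[maf > 0.5]
        # note: this prints the frequencies; it does not return them or filter the array
        print(maf)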
        
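    def filter_array(self, minmaf, maxmissing):
        # fraction of "N" values in each column
        freqmissing = np.sum(self.seqs == "N", axis=0) / self.seqs.shape[0]
        # fraction of rows in each column that differ from the first row
        freqs = np.sum(self.seqs != self.seqs[0], axis=0) / self.seqs.shape[0]
        maf = freqs.copy()
        maf[maf > 0.5] = 1 - maf[maf > 0.5]
        # keep only the columns that pass both the maf and the missing filter
        keep = (maf > minmaf) & (freqmissing <= maxmissing)
        return self.seqs[:, keep]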
     
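    def calculate_statistics(self):
        # per-column variance of the "matches the first row" values, averaged over columns
        nd = np.var(self.seqs == self.seqs[0], axis=0).mean()
        # per-column fraction of rows that differ from the first row, averaged over columns
        mf = np.mean(np.sum(self.seqs != self.seqs[0], axis=0) / self.seqs.shape[0])
        # number of columns in which at least one row differs from the first row
        inv = np.any(self.seqs != self.seqs[0], axis=0).sum()
        # number of remaining columns
        var = self.seqs.shape[1] - inv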
        return pd.Series(
            {"mean nucleotide diversity" : nd,
             "mean minor allele frequency": mf,
             "invariant sites": inv,
             "variable sites": var,
            })
    

## Test your package
The package should be globally importable (you ran `pip install .` or `pip install -e .` to install it), and it should be able to execute the following code without error. 

In [1]:
import seqlib


In [3]:
# init a Seqlib Class object
seqs = seqlib.Seqlib(10, 50)
seqs.simulate()

array([['C', 'T', 'T', 'C', 'C', 'G', 'A', 'N', 'C', 'C', 'C', 'N', 'C',
        'C', 'N', 'C', 'C', 'C', 'G', 'A', 'G', 'N', 'A', 'C', 'T', 'C',
        'T', 'C', 'T', 'N', 'C', 'G', 'T', 'T', 'C', 'T', 'A', 'C', 'T',
        'C', 'C', 'C', 'T', 'A', 'G', 'A', 'T', 'T', 'T', 'N'],
       ['T', 'T', 'N', 'C', 'C', 'G', 'A', 'C', 'C', 'C', 'C', 'N', 'C',
        'C', 'C', 'C', 'A', 'C', 'G', 'G', 'G', 'T', 'A', 'C', 'G', 'C',
        'T', 'C', 'C', 'G', 'C', 'N', 'N', 'T', 'N', 'T', 'A', 'C', 'T',
        'C', 'N', 'C', 'T', 'A', 'N', 'N', 'T', 'T', 'T', 'T'],
       ['N', 'N', 'T', 'C', 'C', 'N', 'A', 'C', 'A', 'T', 'C', 'C', 'C',
        'C', 'C', 'C', 'A', 'C', 'G', 'G', 'N', 'T', 'A', 'N', 'T', 'C',
        'T', 'C', 'C', 'G', 'N', 'G', 'T', 'T', 'N', 'T', 'A', 'C', 'T',
        'C', 'C', 'C', 'T', 'A', 'G', 'C', 'T', 'T', 'T', 'T'],
       ['C', 'T', 'T', 'C', 'C', 'G', 'A', 'C', 'C', 'C', 'N', 'G', 'C',
        'C', 'C', 'C', 'A', 'C', 'G', 'G', 'A', 'T', 'A', 'C', 'T', 'C',
     

In [4]:
# access attributes from the object
print(seqs.ninds, seqs.nsites)

10 50


In [5]:
# returns the MAF of the array as an array of floats
seqs.maf()

[0.1 0.3 0.2 0.2 0.  0.1 0.1 0.1 0.4 0.1 0.  0.1 0.  0.2 0.4 0.1 0.1 0.1
 0.1 0.1 0.  0.2 0.2 0.2 0.1 0.3 0.1 0.1 0.2 0.3 0.1 0.3 0.2 0.  0.1 0.2
 0.  0.1 0.3 0.2 0.1 0.  0.  0.2 0.1 0.1 0.1 0.1 0.3 0.1]


In [6]:
# return a view of the filtered sequence array by applying a new function 
# called `filter()` that applies both the maf and missing filter functions
seqs.filter(minmaf=0.1, maxmissing=0.0)

array([['C', 'C', 'C', 'C', 'A', 'C', 'G', 'A', 'G', 'G', 'C', 'G', 'C',
        'A', 'C', 'T', 'C', 'C', 'G', 'T'],
       ['C', 'C', 'C', 'C', 'A', 'C', 'G', 'A', 'G', 'G', 'C', 'A', 'C',
        'A', 'C', 'T', 'C', 'C', 'T', 'T'],
       ['C', 'C', 'C', 'C', 'A', 'C', 'G', 'A', 'G', 'G', 'C', 'G', 'C',
        'A', 'C', 'T', 'C', 'C', 'G', 'T'],
       ['T', 'C', 'C', 'C', 'A', 'C', 'G', 'A', 'G', 'T', 'C', 'A', 'C',
        'A', 'C', 'T', 'C', 'C', 'G', 'T'],
       ['C', 'C', 'C', 'C', 'A', 'C', 'G', 'A', 'G', 'G', 'C', 'A', 'C',
        'A', 'C', 'T', 'C', 'C', 'G', 'G'],
       ['C', 'C', 'C', 'C', 'A', 'C', 'G', 'A', 'G', 'G', 'C', 'A', 'C',
        'A', 'C', 'T', 'C', 'C', 'G', 'T'],
       ['C', 'C', 'C', 'C', 'A', 'C', 'G', 'A', 'G', 'G', 'C', 'A', 'C',
        'A', 'C', 'T', 'C', 'C', 'G', 'T'],
       ['C', 'C', 'C', 'C', 'G', 'C', 'G', 'T', 'G', 'G', 'C', 'A', 'C',
        'A', 'A', 'T', 'C', 'G', 'G', 'T'],
       ['C', 'C', 'C', 'C', 'A', 'C', 'G', 'A', 'G', 'G', 'T', '

In [7]:
# calculate statistics for an array with the results returned as a pandas Series
seqs.calculate_statistics()

invariant sites                42.0000
mean minor allele frequency     0.2500
mean nucleotide diversity       0.1114
variable sites                  8.0000
dtype: float64

In [None]:
# calculate statistics for an array after filtering it
seqs.filter(minmaf=0.1, maxmissing=0.0).statistic()
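
The chained call above only works if `filter()` returns something that itself has a `statistic()` method, and a plain NumPy array does not. One possible design, sketched below, is for `filter()` to return a new object of the same class that wraps the filtered array. The class name `SeqlibSketch`, the optional `seqs` argument, the simplified simulation, and the seed are all assumptions made for illustration; this is not the assignment's reference solution.

```python
import numpy as np
import pandas as pd


class SeqlibSketch:
    """Hypothetical minimal class used only to illustrate method chaining."""

    def __init__(self, ninds, nsites, seqs=None):
        self.ninds = ninds
        self.nsites = nsites
        # assumption: an existing array can be passed in instead of simulating
        self.seqs = self._simulate() if seqs is None else seqs

    def _simulate(self):
        # simplified stand-in for simulate(): random bases with about 10% "N"
        return np.random.choice(
            list("ACGTN"),
            size=(self.ninds, self.nsites),
            p=[0.6, 0.1, 0.1, 0.1, 0.1],
        )

    def maf(self):
        # folded per-column frequency of rows that differ from the first row
        freqs = np.sum(self.seqs != self.seqs[0], axis=0) / self.seqs.shape[0]
        maf = freqs.copy()
        maf[maf > 0.5] = 1 - maf[maf > 0.5]
        return maf

    def filter(self, minmaf, maxmissing):
        freqmissing = np.sum(self.seqs == "N", axis=0) / self.seqs.shape[0]
        keep = (self.maf() > minmaf) & (freqmissing <= maxmissing)
        filtered = self.seqs[:, keep]
        # wrap the result in a new object so that methods can be chained
        return SeqlibSketch(filtered.shape[0], filtered.shape[1], seqs=filtered)

    def statistic(self):
        differs = self.seqs != self.seqs[0]
        variable = int(np.any(differs, axis=0).sum())
        return pd.Series({
            "mean minor allele frequency": self.maf().mean(),
            "variable sites": variable,
            "invariant sites": self.seqs.shape[1] - variable,
        })


np.random.seed(7)  # assumption: fixed seed so the run is repeatable
seqs = SeqlibSketch(10, 50)
filtered = seqs.filter(minmaf=0.1, maxmissing=0.0)
stats = seqs.filter(minmaf=0.1, maxmissing=0.0).statistic()
print(filtered.seqs.shape)
print(stats)
```

Returning a new object leaves the original `seqs.seqs` array unchanged, so the same data can be filtered again with different thresholds.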