# Assignment

-------------

**1) Complete this notebook and make a pull request:** 

Answer questions (Q) in the space provided (A) in this notebook. When finished, copy your notebook to the `Assignment/` directory and name it `nb-6.5-<Github-username>.ipynb`. Then make a pull request to the upstream repo. Your answers in this notebook will simply be Markdown text in which you interpret and describe a block of code to better understand what it is doing. Much of this code you will have seen already.


**2) Write an importable Python package, save as a repo, and test it here.**

The package should be written as we did in our last lesson (`.py` files in a directory with a `setup.py` file so it can be installed with `pip`). Follow the instructions at the end of this notebook for how to write your package. Test it here by importing the package and executing the code at the end. It should work and give correct answers; if not, continue working on it. When you have completed it, save your package as a new GitHub repo named `seqlib`.

### The `seqlib` package

Together we are going to write several functions here that will make up your new package called `seqlib`. It will be your job to copy these functions, organize them into a Class, save the code into a `.py` file (you can use SublimeText if you're comfortable with it for much of this, or any text editor including the one in jupyter), package the files so they can be imported as a library, and test the package so that it accomplishes the tasks which are defined at the end of this notebook. First things first, though, let's write the functions. 

In [1]:
import numpy as np
import pandas as pd

### Q.  Describe what the `mutate()` function below does:


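A. `mutate()` takes a single base and returns a different base chosen at random: it builds the set of bases {A, C, T, G}, removes the input base from it, and draws one of the three remaining bases with `np.random.choice`.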

In [2]:
def mutate(base):
    diff = set("ACTG") - set(base)
    return np.random.choice(list(diff))

In [3]:
# test it
mutate("A")

'T'
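
As a quick check of the behavior described above, the sketch below calls `mutate()` many times for each base and collects the results. The seed and the number of draws (200) are arbitrary choices for illustration, not part of the assignment.

```python
import numpy as np

def mutate(base):
    # same logic as the notebook's mutate(): choose among the other three bases
    diff = set("ACTG") - set(base)
    return np.random.choice(list(diff))

np.random.seed(0)  # arbitrary seed, only so repeated runs behave similarly

# for every input base, collect the set of bases that mutate() returned
draws = {base: {str(mutate(base)) for _ in range(200)} for base in "ACGT"}
for base, seen in draws.items():
    # the input base should never appear among its own mutated results
    print(base, "->", sorted(seen))
```

If the function works as described, no base should ever map to itself.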

### Q. Describe how the `simulate()` function below works:
Annotate the code by inserting lines with comments as you read through the function to make sense of it. What is being created at each step and how is it used?


A. The function takes a number of individuals (`ninds`) and a number of sites (`nsites`), generates one random sequence, copies it once per individual, and then introduces random mutations and missing data. Counting only the code lines (not the comments): Line 1 defines the function. Line 2 creates the initial random sequence by drawing `nsites` bases. Line 3 makes `ninds` copies of the sequence generated in line 2, one per row. Line 4 draws a binomial array of 1s and 0s, with a 0.1 chance of a 1, to mark the positions where mutations should occur. NumPy stores this as an array of shape `(ninds, nsites)`. Lines 5-8 are a for loop that iterates over the sites: for each column it picks a new base that differs from the original, converts that column of the 0/1 array into a False/True mask, and replaces the base with the new one wherever the mask is True. Line 9 draws a second binomial array that matches the dimensions of the array of sequences. Line 10 assigns the value "N" wherever the array from line 9 is 1. Line 11 returns the array.

What is being created? Line 1 (function), line 2 (sequence), line 3 (array), line 4 (array), line 5 (initializing for loop), line 6 (single character value), line 7 (array), line 8 (array), line 9 (array), line 10 (array)

In [44]:
# define simulate function to simulate sequences and mutations
def simulate(ninds, nsites):
    # create initial random sequence of length nsites
    oseq = np.random.choice(list("ACGT"), size=nsites)
    # define an array copying the sequences ninds times (each as a new row)
    arr = np.array([oseq for i in range(ninds)])
    # define whether any given site has a mutation using a binomial distribution with probability of true 0.1
    muts = np.random.binomial(1, 0.1, (ninds, nsites))
    # iterate through sites and insert newbase (randomly chosen other than original value) in sites identified in muts
    for col in range(nsites):
        newbase = mutate(arr[0, col])
        mask = muts[:, col].astype(bool)
        arr[:, col][mask] = newbase
    # make array from binomial distribution for sites to be missing
    missing = np.random.binomial(1, 0.1, (ninds, nsites))
    # replace value with N in sites to be missing
    arr[missing.astype(bool)] = "N"
    return arr

In [45]:
seqs = simulate(6, 15)
print(seqs)

[['A' 'T' 'C' 'C' 'C' 'G' 'A' 'C' 'A' 'C' 'N' 'A' 'C' 'A' 'G']
 ['A' 'T' 'T' 'C' 'C' 'G' 'T' 'C' 'G' 'C' 'C' 'G' 'N' 'A' 'A']
 ['A' 'T' 'T' 'C' 'C' 'N' 'T' 'C' 'A' 'C' 'C' 'A' 'C' 'A' 'N']
 ['A' 'T' 'T' 'C' 'A' 'G' 'T' 'C' 'A' 'C' 'C' 'A' 'C' 'N' 'G']
 ['A' 'T' 'T' 'C' 'C' 'G' 'T' 'A' 'A' 'C' 'C' 'A' 'C' 'A' 'G']
 ['A' 'T' 'T' 'C' 'A' 'G' 'A' 'A' 'N' 'C' 'C' 'A' 'C' 'A' 'G']]
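
The step `arr[:, col][mask] = newbase` can look surprising, so here is a minimal sketch of it on a hand-made array. The array contents and the mask are made up for illustration. The point is that `arr[:, col]` is a view of one column, so assigning through the boolean mask changes `arr` itself.

```python
import numpy as np

# 3 individuals x 4 sites, all "A" to make the change easy to see
arr = np.array([list("AAAA") for _ in range(3)])

# hypothetical mutation mask for site 1: individuals 0 and 2 mutate
mask = np.array([True, False, True])

# arr[:, 1] is a view of column 1, so masked assignment writes into arr
arr[:, 1][mask] = "G"
print(arr)
```

Only rows 0 and 2 of column 1 should change; every other entry stays "A".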


### **Q: Describe how the `filter_missing` function works:**
Annotate the code by inserting lines with comments as you read through the function to make sense of it. How does it find columns with missing (N) values in them? How might you improve it?

A. Summary of function: `filter_missing` removes the columns whose fraction of missing bases exceeds a threshold. It does this by summing the occurrences of "N" down each column (`axis=0`) and dividing by `arr.shape[0]`, which is the number of rows (sequences) in the array. This gives the fraction of sequences that have a missing base at that site. The function then returns only the columns where that fraction is less than or equal to the specified threshold (`maxfreq`).

It finds columns with missing values by comparing every element to "N" and summing the matches down each column.

Another option is to use `np.unique` with `return_counts=True` on each column: if "N" is among the unique values, its count divided by the column length gives the frequency of missing bases. (I'm not sure if this is an improvement, but it's another option.)

In [46]:
# define function to remove columns whose fraction of missing values exceeds maxfreq
def filter_missing(arr, maxfreq):
    # calculate the fraction of rows with a missing base at that site
    freqmissing = np.sum(arr == "N", axis=0) / arr.shape[0]
    # return the array excluding the columns that had more missing than the defined threshold
    return arr[:, freqmissing <= maxfreq]

In [47]:
filter_missing(seqs, 0.1)

array([['A', 'T', 'C', 'C', 'C', 'A', 'C', 'C', 'A'],
       ['A', 'T', 'T', 'C', 'C', 'T', 'C', 'C', 'G'],
       ['A', 'T', 'T', 'C', 'C', 'T', 'C', 'C', 'A'],
       ['A', 'T', 'T', 'C', 'A', 'T', 'C', 'C', 'A'],
       ['A', 'T', 'T', 'C', 'C', 'T', 'A', 'C', 'A'],
       ['A', 'T', 'T', 'C', 'A', 'A', 'A', 'C', 'A']], dtype='<U1')
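
To make the frequency calculation concrete, here is a small sketch on a hand-made array (the `toy` data is invented for illustration). It uses `np.mean` on the boolean array, which is equivalent to summing and dividing by the number of rows.

```python
import numpy as np

def filter_missing(arr, maxfreq):
    # fraction of rows that are "N" in each column
    freqmissing = np.sum(arr == "N", axis=0) / arr.shape[0]
    # keep the columns at or below the threshold
    return arr[:, freqmissing <= maxfreq]

# 4 individuals x 4 sites; site 1 has two Ns, site 2 has one N
toy = np.array([
    list("ANAC"),
    list("ANGC"),
    list("ATNC"),
    list("ATGC"),
])

# the mean of a boolean column is the fraction of True values in it
freqmissing = np.mean(toy == "N", axis=0)
print(freqmissing)

# with maxfreq=0.25 only site 1 (half missing) is dropped
print(filter_missing(toy, 0.25))
```

Setting `maxfreq=0.0` instead would also drop site 2, leaving only the two sites with no missing data.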

### **Q: Describe how the `filter_maf` function works:**
Annotate the code by inserting lines with comments as you read through the function to make sense of it. How does it calculate minor allele frequencies? Why does it use copy?

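A. It first calculates, for each column, the fraction of rows that differ from the first row. It then converts these to minor allele frequencies (maf) by taking the frequencies that are greater than 0.5 and replacing them with 1-freq. It uses a copy so that the original array of frequencies is not altered.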

In [48]:
# define function that takes an array of sequences and keeps only the sites whose minor allele frequency exceeds minfreq
def filter_maf(arr, minfreq):
    # calculates mutation freq (number of rows in a column that are not equal to the first row over column length)
    freqs = np.sum(arr != arr[0], axis=0) / arr.shape[0]
    # make a copy of the frequency sequence
    maf = freqs.copy()
    # select sites where the freq is over .5 and replace with 1-that value
    maf[maf > 0.5] = 1 - maf[maf > 0.5]
    # return the array with all rows and columns where the maf value is greater than the minfreq defined by user
    return arr[:, maf > minfreq]
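
Here is a short worked example of the folding step on a hand-made array (the `toy` data is invented for illustration). It also shows why the copy matters: after folding, `freqs` still holds the unfolded values.

```python
import numpy as np

# 5 individuals x 4 sites; row 0 is the reference for comparisons
toy = np.array([
    list("AAAA"),
    list("ATAG"),
    list("ATAG"),
    list("ATCG"),
    list("AACG"),
])

# fraction of rows that differ from row 0, per column
freqs = np.sum(toy != toy[0], axis=0) / toy.shape[0]

# fold frequencies above 0.5 so each value is the frequency of the rarer state
maf = freqs.copy()
maf[maf > 0.5] = 1 - maf[maf > 0.5]

print(freqs)  # unfolded frequencies, untouched because maf is a copy
print(maf)    # folded frequencies, never above 0.5
```

Without `.copy()`, `maf` and `freqs` would be two names for the same array, and the folding line would overwrite the original frequencies.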

In [49]:
arr

array([['G', 'T', 'T', 'T', 'G', 'G', 'T', 'A', 'G', 'A'],
       ['G', 'T', 'T', 'T', 'G', 'G', 'A', 'A', 'G', 'A'],
       ['G', 'G', 'T', 'T', 'C', 'G', 'T', 'A', 'G', 'G'],
       ['G', 'G', 'C', 'T', 'G', 'G', 'T', 'G', 'G', 'A']], dtype='<U1')

In [84]:
filter_maf(seqs, 0.1)

TypeError: 'Seqlib' object does not support indexing

### Q: In what order should these functions be applied? Does it matter?

A. Although the outputs are the same for the sequences tested, it makes more logical sense to do filter_maf(filter_missing(seqs, 0.1), 0.1) because this removes any missing data first before calculating the minor allele frequencies which depend on the number of differences in the column. 

In [51]:
filter_missing(filter_maf(seqs, 0.1), 0.1)

array([['C', 'C', 'A', 'C', 'A'],
       ['T', 'C', 'T', 'C', 'G'],
       ['T', 'C', 'T', 'C', 'A'],
       ['T', 'A', 'T', 'C', 'A'],
       ['T', 'C', 'T', 'A', 'A'],
       ['T', 'A', 'A', 'A', 'A']], dtype='<U1')

In [52]:
filter_maf(filter_missing(seqs, 0.1), 0.1)

array([['C', 'C', 'A', 'C', 'A'],
       ['T', 'C', 'T', 'C', 'G'],
       ['T', 'C', 'T', 'C', 'A'],
       ['T', 'A', 'T', 'C', 'A'],
       ['T', 'C', 'T', 'A', 'A'],
       ['T', 'A', 'A', 'A', 'A']], dtype='<U1')
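
One way to see why the two orders agree: both filters decide column by column, using only the values in that column, and neither removes rows. The sketch below checks this on random data. The generator seed and array size are arbitrary assumptions, and the functions are copied from the cells above.

```python
import numpy as np

def filter_missing(arr, maxfreq):
    freqmissing = np.sum(arr == "N", axis=0) / arr.shape[0]
    return arr[:, freqmissing <= maxfreq]

def filter_maf(arr, minfreq):
    freqs = np.sum(arr != arr[0], axis=0) / arr.shape[0]
    maf = freqs.copy()
    maf[maf > 0.5] = 1 - maf[maf > 0.5]
    return arr[:, maf > minfreq]

# random 8 x 40 array over the bases plus "N" (arbitrary seed and shape)
rng = np.random.default_rng(123)
toy = rng.choice(list("ACGTN"), size=(8, 40))

a = filter_maf(filter_missing(toy, 0.1), 0.1)
b = filter_missing(filter_maf(toy, 0.1), 0.1)

# each column is kept or dropped based only on its own values,
# so the surviving columns are the same in either order
print(a.shape, b.shape, np.array_equal(a, b))
```

The order would start to matter only if one of the filters used information from other columns or removed rows.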

### Q: Describe how `calculate_statistics()` works


A. The function compares every row to the first row and summarizes the differences per column. `nd` takes the variance, down each column, of the True/False values for "same as the first row" and then averages that variance over all columns. `mf` computes, for each column, the fraction of rows that differ from the first row, and then takes the mean over all columns. These frequencies are not folded at 0.5, so this is the mean frequency of differences rather than a true minor allele frequency. `np.any(arr != arr[0], axis=0)` is True for each column in which at least one row differs from the first row, so summing it counts the columns that contain a difference; this count is stored in `inv`. `var` is the total number of columns minus that count, that is, the number of columns in which every row matches the first row. As written, then, `inv` holds the number of variable sites and `var` the number of invariant sites, so the two labels appear to be swapped. Finally, the function returns these four values, with descriptive labels, as a pandas Series.

In [59]:
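def calculate_statistics(arr):
    # mean over sites of the variance of "matches row 0" (True/False) down each column
    nd = np.var(arr == arr[0], axis=0).mean()
    # mean over sites of the fraction of rows that differ from row 0
    mf = np.mean(np.sum(arr != arr[0], axis=0) / arr.shape[0])
    # number of sites where at least one row differs from row 0
    inv = np.any(arr != arr[0], axis=0).sum()
    # the remaining sites, where every row matches row 0
    var = arr.shape[1] - inv
    return pd.Series(
        {"mean nucleotide diversity": nd,
         "mean minor allele frequency": mf,
         "invariant sites": inv,
         "variable sites": var,
        })

In [60]:
calculate_statistics(seqs)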

invariant sites                11.000000
mean minor allele frequency     0.288889
mean nucleotide diversity       0.129630
variable sites                  4.000000
dtype: float64
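
To check what the `np.any(...).sum()` expression counts, here is a minimal sketch on a hand-made array (the `toy` data and the variable names `n_variable` and `n_invariant` are invented for illustration). Sites 1 and 3 contain a difference from row 0; sites 0, 2, and 4 do not.

```python
import numpy as np

# 3 individuals x 5 sites; only sites 1 and 3 vary
toy = np.array([
    list("ACGTA"),
    list("ACGAA"),
    list("ATGAA"),
])

# True wherever a row differs from row 0
differs = toy != toy[0]

# columns with at least one difference are the variable sites
n_variable = np.any(differs, axis=0).sum()
# the rest are invariant
n_invariant = toy.shape[1] - n_variable

print(n_variable, n_invariant)
```

On this array the expression counts the two variable sites, which suggests that the quantity stored under the name `inv` in the function above is the number of variable sites.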

### Instructions: Write a `seqlib` Class object

I started writing the bare bones of it below. You should write it so that it can be executed as described below to perform all of the functions we defined above, and so that its attributes can be accessed. Save this class object in a `.py` file and make it into an importable package called `seqlib`. You can write and test your object in this notebook if you like, but it must be saved separately in a `.py` file and be imported. You cannot execute the code at the end using your object defined here in the notebook. When finished save your package to GitHub as a repo just like we did with the `helloworld` package. You do not need to write a CLI script like we did for the `helloworld` package, we will only be using the Python API here. See the examples below for **how you should write your Class object**. It should be able to run in the way written below, so look at that code and think about how you would write a Class object that can do that. 

While you can mostly copy the functions from above, you will need to modify them slightly to access information about the Class object using *self*. For example, the `simulate()` function below takes self as a first argument and can access `self.ninds` and `self.nsites` from that, so we do not need to provide those as arguments to the `simulate` function like we did above.

In [31]:
import numpy as np
import pandas as pd

class Seqlib:
    def __init__(self, ninds, nsites):
        self.ninds = ninds
        self.nsites = nsites
        self.seqs = self.simulate()
        # ...
    
    def mutate(self, base):
        diff = set("ACTG") - set(base)
        return np.random.choice(list(diff))
    
    # define simulate function to simulate sequences and mutations
    def simulate(self):
        ninds = self.ninds
        nsites = self.nsites
        # create initial random sequence of length nsites
        oseq = np.random.choice(list("ACGT"), size=nsites)
        # define an array copying the sequences ninds times (each as a new row)
        arr = np.array([oseq for i in range(ninds)])
        # define whether any given site has a mutation using a binomial distribution with probability of true 0.1
        muts = np.random.binomial(1, 0.1, (ninds, nsites))
        # iterate through sites and insert newbase (randomly chosen other than original value) in sites identified in muts
        for col in range(nsites):
            newbase = self.mutate(arr[0, col])
            mask = muts[:, col].astype(bool)
            arr[:, col][mask] = newbase
        # make array from binomial distribution for sites to be missing
        missing = np.random.binomial(1, 0.1, (ninds, nsites))
        # replace value with N in sites to be missing
        arr[missing.astype(bool)] = "N"
        return arr

    # return the columns whose fraction of missing bases is at most maxmissing
    def filter_missing(self, maxmissing):
        arr = self.seqs
        # calculate the fraction of rows with a missing base at that site
        freqmissing = np.sum(arr == "N", axis=0) / arr.shape[0]
        # return the array excluding the columns that had more missing than the defined threshold
        return arr[:, freqmissing <= maxmissing]

    # return the columns whose minor allele frequency is greater than minmaf
    def filter_maf(self, minmaf):
        arr = self.seqs
        # reuse maf() so the frequency calculation lives in one place
        return arr[:, self.maf() > minmaf]

    # return maf as floats
    def maf(self):
        arr = self.seqs
        # calculates mutation freq (number of rows in a column that are not equal to the first row over column length)
        freqs = np.sum(arr != arr[0], axis=0) / arr.shape[0]
        # make a copy of the frequency sequence
        maf = freqs.copy()
        # select sites where the freq is over .5 and replace with 1-that value
        maf[maf > 0.5] = 1 - maf[maf > 0.5]
        return maf

    # return filtered by missing and minfreq
    def filter(self, minmaf, maxmissing):
        arr = self.seqs
        # both statistics are computed per column, so the two masks can be combined
        freqmissing = np.sum(arr == "N", axis=0) / arr.shape[0]
        keep = (freqmissing <= maxmissing) & (self.maf() > minmaf)
        return arr[:, keep]

    def calculate_statistics(self):
        arr = self.seqs
        nd = np.var(arr == arr[0], axis=0).mean()
        mf = np.mean(np.sum(arr != arr[0], axis=0) / arr.shape[0])
        inv = np.any(arr != arr[0], axis=0).sum()
        var = arr.shape[1] - inv
        return pd.Series(
            {"mean nucleotide diversity": nd,
             "mean minor allele frequency": mf,
             "invariant sites": inv,
             "variable sites": var,
            })


In [32]:
seqs = Seqlib(ninds=10, nsites=50)
print(seqs.ninds, seqs.nsites)
seqs.filter_missing(maxmissing=0.1)
filtered = seqs.filter(minmaf=0.1, maxmissing=0.0)

10 50

## Test your package
The package should be globally importable (you ran `pip install .` or `pip install -e .` to install it), and it should be able to execute the following code without error. 

In [None]:
import seqlib

In [None]:
# init a Seqlib Class object
seqs = seqlib.Seqlib(ninds=10, nsites=50)

In [None]:
# access attributes from the object
print(seqs.ninds, seqs.nsites)

In [45]:
# returns the MAF of the array as an array of floats
seqs.maf

<bound method Seqlib.maf of <__main__.Seqlib object at 0x00000206478AEA90>>
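
The output above is a bound method rather than an array, which is what Python shows when a method is referenced without parentheses. One possible way to make `seqs.maf` return the values directly is to declare it with `@property`. The sketch below is only an illustration under that assumption: `SeqlibSketch` is a hypothetical stand-in that wraps a fixed array instead of simulating one, and is not the assignment's `Seqlib` class.

```python
import numpy as np

class SeqlibSketch:
    """Hypothetical stand-in for Seqlib that wraps a fixed array."""

    def __init__(self, seqs):
        self.seqs = np.asarray(seqs)

    @property
    def maf(self):
        # accessed as `obj.maf` (no parentheses) because of @property
        arr = self.seqs
        freqs = np.sum(arr != arr[0], axis=0) / arr.shape[0]
        maf = freqs.copy()
        maf[maf > 0.5] = 1 - maf[maf > 0.5]
        return maf

# 4 individuals x 3 sites, made up for illustration
demo = SeqlibSketch([list("AAT"), list("AGT"), list("AGT"), list("AGC")])
print(demo.maf)
```

The alternative is to keep `maf` as an ordinary method and call it as `seqs.maf()`; which one to use depends on how the test code is expected to access it.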

In [46]:
# return a view of the filtered sequence array by applying a new function 
# called `filter()` that applies both the maf and missing filter functions
seqs.filter(minmaf=0.1, maxmissing=0.0)

array([['A', 'G', 'A', 'A'],
       ['A', 'G', 'A', 'A'],
       ['A', 'G', 'A', 'A'],
       ['A', 'G', 'A', 'A'],
       ['A', 'G', 'A', 'A'],
       ['G', 'G', 'A', 'A'],
       ['G', 'G', 'A', 'A'],
       ['A', 'G', 'T', 'T'],
       ['G', 'A', 'T', 'A'],
       ['G', 'A', 'A', 'T']], dtype='<U1')

In [47]:
# calculate statistics for an array with the results returned as a pandas Series
seqs.calculate_statistics()

invariant sites                39.0000
mean minor allele frequency     0.2380
mean nucleotide diversity       0.1078
variable sites                 11.0000
dtype: float64

In [40]:
# calculate statistics for an array after filtering it
seqs.filter(minmaf=0.1, maxmissing=0.0).calculate_statistics()

TypeError: filter() got an unexpected keyword argument 'minmaf'
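
For a chained call like `seqs.filter(...).calculate_statistics()` to work, `filter()` has to return something that itself has a `calculate_statistics()` method; a plain NumPy array does not. One possible design, sketched below as an assumption rather than the required solution, is for `filter()` to wrap the filtered array in a new object of the same class. `ChainSketch` is a hypothetical class, and its statistics are deliberately abbreviated to two counts.

```python
import numpy as np
import pandas as pd

class ChainSketch:
    """Hypothetical stand-in for Seqlib that wraps an existing array."""

    def __init__(self, seqs):
        self.seqs = np.asarray(seqs)

    def filter(self, minmaf, maxmissing):
        arr = self.seqs
        freqmissing = np.sum(arr == "N", axis=0) / arr.shape[0]
        freqs = np.sum(arr != arr[0], axis=0) / arr.shape[0]
        maf = freqs.copy()
        maf[maf > 0.5] = 1 - maf[maf > 0.5]
        keep = (freqmissing <= maxmissing) & (maf > minmaf)
        # wrap the filtered array in a new object so methods can be chained
        return ChainSketch(arr[:, keep])

    def calculate_statistics(self):
        # abbreviated on purpose: only two illustrative counts are reported
        arr = self.seqs
        differs = np.any(arr != arr[0], axis=0)
        return pd.Series({
            "sites": arr.shape[1],
            "sites with a difference from row 0": int(differs.sum()),
        })

# 4 individuals x 4 sites, made up for illustration; only site 2 passes both filters
demo = ChainSketch([list("ANAT"), list("AGAT"), list("AGCT"), list("ATCT")])
result = demo.filter(minmaf=0.1, maxmissing=0.0)
stats = result.calculate_statistics()
print(stats)
```

The trade-off is that `filter()` no longer returns a bare array; with this design the filtered array would be read from the `.seqs` attribute of the returned object.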

In [54]:
import os
import sys

## let's define some names that we'll use for paths
prjname = "seqlib"
pkgname = "seqlib"
storeloc = os.path.expanduser("~/PDSB/6-scientific-python/Notebooks/")

## now let's create some joint paths with the os module
prjpath = os.path.join(storeloc, prjname)
pkgpath = os.path.join(storeloc, prjname, pkgname)

## check out paths
print(prjpath)
print(pkgpath)

## make the directories (exist_ok allows for it to already exist)
os.makedirs(pkgpath, exist_ok=True)

## write setup.py file
with open(os.path.join(prjpath, "setup.py"), 'w') as out:
    out.write(setup)

## write init file
with open(os.path.join(pkgpath, "__init__.py"), 'w') as out:
    out.write(init)
    
## write script to file
with open(os.path.join(pkgpath, "seqlib.py"), 'w') as out:
    out.write(seqlib)

C:\Users\Anika/PDSB/6-scientific-python/Notebooks/seqlib
C:\Users\Anika/PDSB/6-scientific-python/Notebooks/seqlib\seqlib


NameError: name 'setup' is not defined
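
The `NameError` arises because the cell writes the variables `setup`, `init`, and `seqlib` to files, but no earlier cell defines them as strings. Below is a minimal sketch of what defining them might look like. The file contents are illustrative assumptions (the version number and dependency list are made up), and the files are written to a temporary directory so nothing in a real project is overwritten.

```python
import os
import tempfile

# Hypothetical file contents: the notebook never defines these strings,
# so the values here are assumptions for illustration only.
setup_text = '''\
from setuptools import setup

setup(
    name="seqlib",
    version="0.0.1",
    packages=["seqlib"],
    install_requires=["numpy", "pandas"],
)
'''
init_text = "from .seqlib import Seqlib\n"

# build <tmp>/seqlib/seqlib, mirroring the project/package layout
prjpath = os.path.join(tempfile.mkdtemp(), "seqlib")
pkgpath = os.path.join(prjpath, "seqlib")
os.makedirs(pkgpath, exist_ok=True)

# write setup.py at the project level and __init__.py inside the package
with open(os.path.join(prjpath, "setup.py"), "w") as out:
    out.write(setup_text)
with open(os.path.join(pkgpath, "__init__.py"), "w") as out:
    out.write(init_text)

print(sorted(os.listdir(prjpath)))
```

The module file `seqlib.py` would need a third string holding the class source. Using a variable name other than `seqlib` for that string avoids shadowing the imported `seqlib` module in the same notebook.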