# Assignment

-------------

**1) Complete this notebook and make a pull request:** 

Answer the questions (Q) in the space provided (A) in this notebook. When finished, copy your notebook to the `Assignment/` directory and name it `nb-6.5-<Github-username>.ipynb`. Then make a pull request to the upstream repo. Your answers will be plain Markdown text in which you interpret and describe a block of code to better understand what it is doing. You will have seen much of this code already. 


**2) Write an importable Python package, save as a repo, and test it here.**

The package should be written as we did in our last lesson (`.py` files in a directory with a `setup.py` file so it can be installed with `pip`). Follow the instructions at the end of this notebook for how to write your package. Test it here by importing the package and executing the code at the end. It should run and give correct answers; if it does not, keep working on it. When you have completed it, save your package as a new GitHub repo named `seqlib`.
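
As a rough illustration of the layout described above, the sketch below builds an empty package skeleton in a temporary directory. The file names (`seqlib/seqlib.py`, `seqlib/__init__.py`), the version string, and the `install_requires` list are assumptions made for this example, not requirements of the assignment.

```python
import tempfile
from pathlib import Path

# hypothetical layout: an outer project directory holding setup.py
# and an inner package directory holding the code
root = Path(tempfile.mkdtemp()) / "seqlib"
pkg = root / "seqlib"
pkg.mkdir(parents=True)

# setup.py tells pip how to install the package
(root / "setup.py").write_text(
    "from setuptools import setup\n"
    "setup(\n"
    "    name='seqlib',\n"
    "    version='0.0.1',\n"
    "    packages=['seqlib'],\n"
    "    install_requires=['numpy', 'pandas'],\n"
    ")\n"
)

# __init__.py exposes the class so that seqlib.Seqlib works after import
(pkg / "__init__.py").write_text("from .seqlib import Seqlib\n")

# placeholder module that would hold the real class definition
(pkg / "seqlib.py").write_text("class Seqlib:\n    pass\n")

# list the files that were created, relative to the project directory
for path in sorted(root.rglob("*.py")):
    print(path.relative_to(root))
```

With a layout like this, `pip install -e .` would be run from the outer directory, the one that holds `setup.py`.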

### The `seqlib` package

Together we are going to write several functions here that will make up your new package called `seqlib`. It will be your job to copy these functions, organize them into a Class, save the code into a `.py` file (you can use SublimeText if you're comfortable with it for much of this, or any text editor including the one in jupyter), package the files so they can be imported as a library, and test the package so that it accomplishes the tasks which are defined at the end of this notebook. First things first, though, let's write the functions. 

In [None]:
import numpy as np
import pandas as pd

### Q.  Describe what the `mutate()` function below does:

A. The `mutate` function creates a random mutation. The `base` argument (e.g. "A") is subtracted from the set "ACTG", and the remaining bases (e.g. 'G', 'C', 'T') are stored in `diff`. A random base is then chosen from `diff` and returned, so the result always differs from the input. 

In [63]:
def mutate(base):
    diff = set("ACTG") - set(base)
    return np.random.choice(list(diff))

In [54]:
# test it
mutate("A")

'C'
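
One way to check this behaviour is to call the function many times and confirm that the input base never comes back. The sketch below restates `mutate()` so that it runs on its own; the loop count of 1000 is an arbitrary choice.

```python
import numpy as np

def mutate(base):
    # the three bases that differ from the input
    diff = set("ACTG") - set(base)
    return np.random.choice(list(diff))

# draw many mutations of "A" and collect the distinct results
results = {str(mutate("A")) for _ in range(1000)}
print(results)
```

The printed set should only ever contain bases other than "A".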

### Q. Describe how the `simulate()` function below works:
Annotate the code by inserting lines with comments as you read through the function to make sense of it. What is being created at each step and how is it used?


A. The `simulate` function generates variable sequence data by introducing mutations and sites with missing data. With the arguments used here (ninds = 6, nsites = 15) it builds six nucleotide sequences, each 15 bases long.
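
The mutation and missing-data steps both rely on a grid of 0/1 draws. The sketch below shows one such grid on its own, under the same shape and probability used in this notebook; the seed value is an arbitrary assumption added so the example is repeatable.

```python
import numpy as np

np.random.seed(123)  # arbitrary seed, only here for repeatability

# one 0/1 draw per cell of a 6 x 15 grid; each cell is 1 with probability 0.1
muts = np.random.binomial(1, 0.1, (6, 15))

print(muts.shape)   # (6, 15)
print(muts.mean())  # fraction of cells flagged, close to 0.1 on average

# the draws for one column, converted to a boolean mask over the rows
mask = muts[:, 0].astype(bool)
print(mask)
```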

In [67]:
def simulate(ninds, nsites):  # ninds is the number of rows (individuals); nsites is the number of columns (sites)
    
    # draw one random base from "ACGT" for each site (15 here)
    oseq = np.random.choice(list("ACGT"), size=nsites) 
    
    # build an array by stacking ninds (6 here) copies of oseq as rows
    arr = np.array([oseq for i in range(ninds)]) 
    
    # binomial sampling over the whole array: each cell is 1 with probability 0.1, otherwise 0
    muts = np.random.binomial(1, 0.1, (ninds, nsites)) 
    
    for col in range(nsites):  # loop over the columns (indices 0 to 14 here)
        newbase = mutate(arr[0, col])     # pick one mutant base for this column, different from the original base
        mask = muts[:, col].astype(bool)  # convert this column's 0/1 draws to a boolean mask with astype()
        arr[:, col][mask] = newbase       # arr[:, col] selects the full column; [mask] selects the flagged rows, which are overwritten with newbase
    
    # flag missing data with a second binomial draw (probability 0.1 per cell)
    missing = np.random.binomial(1, 0.1, (ninds, nsites))
    
    # mark the flagged cells as missing data ("N")
    arr[missing.astype(bool)] = "N"  
    return arr

In [68]:
seqs = simulate(6, 15)
print(seqs)

[['G' 'T' 'C' 'C' 'G' 'A' 'C' 'N' 'C' 'G' 'G' 'T' 'N' 'G' 'G']
 ['G' 'T' 'C' 'C' 'G' 'T' 'C' 'G' 'C' 'G' 'N' 'G' 'T' 'G' 'T']
 ['N' 'T' 'C' 'C' 'G' 'A' 'C' 'A' 'C' 'G' 'G' 'N' 'T' 'G' 'T']
 ['G' 'T' 'C' 'C' 'G' 'A' 'C' 'G' 'N' 'G' 'G' 'G' 'T' 'C' 'T']
 ['G' 'T' 'C' 'C' 'G' 'T' 'A' 'G' 'C' 'G' 'G' 'G' 'T' 'N' 'T']
 ['G' 'T' 'C' 'C' 'G' 'A' 'N' 'G' 'C' 'G' 'G' 'G' 'A' 'G' 'T']]
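
The line `arr[:, col][mask] = newbase` works because a plain column slice is a view of the original array, so assigning through it changes `arr` itself. A minimal sketch with a small hand-made array (the data and mask are made up for illustration):

```python
import numpy as np

arr = np.array([list("ACGT") for _ in range(4)])  # four identical rows
mask = np.array([True, False, True, False])       # rows to overwrite

# arr[:, 1] is a view of column 1, so assigning through it edits arr
arr[:, 1][mask] = "T"
print(arr[:, 1])   # ['T' 'C' 'T' 'C']

# equivalent single-step form: index rows and column together
arr[mask, 2] = "A"
print(arr[:, 2])   # ['A' 'G' 'A' 'G']
```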


#### Test code of the function simulate


In [66]:
def simulate(nsites):  # nsites is the number of columns (sites)
    oseq = np.random.choice(list("ACGT"), size=nsites)
    
    return oseq
seqs = simulate(15)
print(seqs)

['A' 'T' 'T' 'C' 'G' 'G' 'A' 'G' 'T' 'C' 'A' 'G' 'A' 'C' 'T']


In [None]:
def simulate(ninds, nsites):  #ninds is a number of rows and nsites is a number of columns
    oseq = np.random.choice(list("ACGT"), size=nsites)
    arr = np.array([oseq for i in range(ninds)])
    return arr
seqs = simulate(6, 15)
print(seqs)

In [None]:
def simulate(ninds, nsites):  #ninds is a number of rows and nsites is a number of columns
    oseq = np.random.choice(list("ACGT"), size=nsites)
    arr = np.array([oseq for i in range(ninds)])
    muts = np.random.binomial(1, 0.1, (ninds, nsites))
    return muts
seqs = simulate(6, 15)
print(seqs)

### **Q: Describe how the `filter_missing` function works:**
Annotate the code by inserting lines with comments as you read through the function to make sense of it. How does it find columns with missing (N) values in them? How might you improve it?

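A. The `filter_missing` function removes sites (columns) in which the frequency of missing data (`N`) exceeds `maxfreq`. It finds them by counting the `N` values in each column and dividing by the number of rows. 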

In [69]:

def filter_missing(arr, maxfreq):  # arr is the array of sequences returned by simulate()
    
    # count the Ns in each column (axis=0) and divide by the number of rows (shape[0])
    freqmissing = np.sum(arr == "N", axis=0) / arr.shape[0]    
    
    # return all rows, keeping only the columns whose missing frequency is at most maxfreq
    return arr[:, freqmissing <= maxfreq]  

In [70]:
filter_missing(seqs, 0.1)

array([['T', 'C', 'C', 'G', 'A', 'G', 'G'],
       ['T', 'C', 'C', 'G', 'T', 'G', 'T'],
       ['T', 'C', 'C', 'G', 'A', 'G', 'T'],
       ['T', 'C', 'C', 'G', 'A', 'G', 'T'],
       ['T', 'C', 'C', 'G', 'T', 'G', 'T'],
       ['T', 'C', 'C', 'G', 'A', 'G', 'T']], dtype='<U1')
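
The per-column calculation can be followed by hand on a small array. The sketch below uses made-up data and takes the mean of the boolean comparison, which gives the same frequency as dividing the sum by the number of rows; treating that as an improvement is a matter of taste.

```python
import numpy as np

# hypothetical 4 x 4 array with a few missing bases
arr = np.array([
    list("ANGT"),
    list("ACGN"),
    list("ANGT"),
    list("ACGT"),
])

# fraction of rows that are "N" in each column
freqmissing = np.mean(arr == "N", axis=0)
print(freqmissing)   # column frequencies: 0, 0.5, 0 and 0.25

# keep only the columns at or below the threshold
kept = arr[:, freqmissing <= 0.25]
print(kept.shape)    # (4, 3)
```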

### **Q: Describe how the `filter_maf` function works:**
Annotate the code by inserting lines with comments as you read through the function to make sense of it. How does it calculate minor allele frequencies? Why does it use copy?

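A. The `filter_maf` function removes sites (columns) whose minor allele frequency is not greater than `minfreq`. It estimates the frequency as the fraction of rows that differ from the first row, then replaces any value above 0.5 with one minus that value. It uses `copy` so that folding the values in `maf` does not alter `freqs`. 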

In [71]:
def filter_maf(arr, minfreq):
    # for each column, count the rows that differ from the first row and divide by the number of rows (shape[0])
    freqs = np.sum(arr != arr[0], axis=0) / arr.shape[0] 
    
    # work on a copy so that freqs itself is left unchanged
    maf = freqs.copy()
    
    # where the frequency is above 0.5, replace it with 1 - value (e.g. 0.875 becomes 0.125)
    maf[maf > 0.5] = 1 - maf[maf > 0.5]
    
    # return all rows, keeping only the columns whose minor allele frequency is greater than minfreq
    return arr[:, maf > minfreq]

In [72]:
filter_maf(seqs, 0.1)

array([['G', 'A', 'C', 'N', 'C', 'G', 'T', 'N', 'G', 'G'],
       ['G', 'T', 'C', 'G', 'C', 'N', 'G', 'T', 'G', 'T'],
       ['N', 'A', 'C', 'A', 'C', 'G', 'N', 'T', 'G', 'T'],
       ['G', 'A', 'C', 'G', 'N', 'G', 'G', 'T', 'C', 'T'],
       ['G', 'T', 'A', 'G', 'C', 'G', 'G', 'T', 'N', 'T'],
       ['G', 'A', 'N', 'G', 'C', 'G', 'G', 'A', 'G', 'T']], dtype='<U1')
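
The folding step can be seen on a small made-up array. The sketch below uses `np.minimum`, which should give the same values as the copy-and-mask approach in `filter_maf`; it is shown as an alternative, not as the notebook's method.

```python
import numpy as np

# hypothetical data: only the middle column varies
arr = np.array([
    list("AAG"),
    list("ATG"),
    list("ATG"),
    list("ATG"),
])

# fraction of rows that differ from the first row, per column
freqs = np.sum(arr != arr[0], axis=0) / arr.shape[0]

# fold values above 0.5 so each number describes the rarer allele
maf = np.minimum(freqs, 1 - freqs)

print(freqs)  # middle column is 0.75: three of four rows differ from the first
print(maf)    # middle column folds to 0.25
```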

### Q: What order should these functions be applied, does it matter?

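A. The order does not change the result. Both functions test each column independently against the same rows, so a column is kept only if it passes both tests, whichever runs first. `filter_maf` does count an `N` as a mismatch with the first row, but that count is the same in either order. The two outputs below differ because the cells were run at different points in the session (In [73] and In [30]); applied to the `seqs` array printed above, either order returns the same two columns.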

In [73]:
filter_missing(filter_maf(seqs, 0.1), 0.1)

array([['A', 'G'],
       ['T', 'T'],
       ['A', 'T'],
       ['A', 'T'],
       ['T', 'T'],
       ['A', 'T']], dtype='<U1')

In [30]:
filter_maf(filter_missing(seqs, 0.1), 0.1)

array([['T', 'C', 'T', 'C'],
       ['T', 'C', 'T', 'C'],
       ['T', 'C', 'T', 'G'],
       ['T', 'T', 'T', 'G'],
       ['G', 'C', 'T', 'G'],
       ['T', 'C', 'C', 'G']], dtype='<U1')
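
To compare the two orders fairly, both must see the same input. The sketch below restates the two filter functions and applies them to a small fixed array (made-up data) in both orders.

```python
import numpy as np

def filter_missing(arr, maxfreq):
    freqmissing = np.sum(arr == "N", axis=0) / arr.shape[0]
    return arr[:, freqmissing <= maxfreq]

def filter_maf(arr, minfreq):
    freqs = np.sum(arr != arr[0], axis=0) / arr.shape[0]
    maf = freqs.copy()
    maf[maf > 0.5] = 1 - maf[maf > 0.5]
    return arr[:, maf > minfreq]

# hypothetical fixed array so both orders start from identical data
seqs = np.array([
    list("GTCAN"),
    list("GTCTG"),
    list("NTCAG"),
    list("GTCAG"),
    list("GACTG"),
    list("GTCAG"),
])

a = filter_missing(filter_maf(seqs, 0.1), 0.1)
b = filter_maf(filter_missing(seqs, 0.1), 0.1)
print(np.array_equal(a, b))   # True
print(a.shape)                # (6, 2)
```

Each function decides column by column, and column selection never changes the rows, so the surviving columns are the ones that pass both tests.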

### Q: Describe how `calculate_statistics()` works


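A. This function computes the mean nucleotide diversity, the mean minor allele frequency, and the numbers of invariant and variable sites, and returns them together as a pandas Series.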

In [74]:
def calculcate_statistics(arr):
    
    # mean nucleotide diversity: arr == arr[0] compares every row with the first row. The variance of the resulting True/False values is taken within each column, and the column variances are averaged. 
    nd = np.var(arr == arr[0], axis=0).mean()
    
    # mean minor allele frequency: arr != arr[0] marks the bases that differ from the first row. The differences are counted in each column and divided by the number of rows; the column frequencies are then averaged.
    mf = np.mean(np.sum(arr != arr[0], axis=0) / arr.shape[0])
   
    # variable sites: np.any builds a boolean mask that is True for each column in which any base differs from the first row; summing it counts the variable columns 
    var = np.any(arr != arr[0], axis=0).sum()
    
    # invariant sites: subtract the number of variable sites from the total number of columns (15)
    inv = arr.shape[1] - var
    
    # return all four values as a pandas Series
    return pd.Series(
        {"mean nucleotide diversity": nd,
         "mean minor allele frequency": mf,
         "invariant sites": inv,
         "variable sites": var,
        })

In [75]:
calculcate_statistics(seqs)

invariant sites                 5.000000
mean minor allele frequency     0.322222
mean nucleotide diversity       0.109259
variable sites                 10.000000
dtype: float64
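
The per-column value behind "mean nucleotide diversity" is the variance of a True/False vector, which for a proportion p of True values equals p(1 - p). A small check on one made-up column:

```python
import numpy as np

# hypothetical column in which 2 of 6 bases differ from the first base
col = np.array(list("AATAAT"))
match = col == col[0]      # True where the base equals the first base

p = match.mean()           # proportion of matches, 4/6 here
print(np.var(match))       # variance of the True/False values
print(p * (1 - p))         # the same quantity written as p(1 - p)
```

Both lines should print the same number, 2/9 or about 0.222.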

### Instructions: Write a `seqlib` Class object

I started writing the bare bones of it below. You should write it so that it can be executed as described below to perform all of the functions we defined above, and so that its attributes can be accessed. Save this class object in a `.py` file and make it into an importable package called `seqlib`. You can write and test your object in this notebook if you like, but it must be saved separately in a `.py` file and be imported. You cannot execute the code at the end using your object defined here in the notebook. When finished save your package to GitHub as a repo just like we did with the `helloworld` package. You do not need to write a CLI script like we did for the `helloworld` package, we will only be using the Python API here. See the examples below for **how you should write your Class object**. It should be able to run in the way written below, so look at that code and think about how you would write a Class object that can do that. 

While you can mostly copy the functions from above, you will need to modify them slightly to access information about the Class object using *self*. For example, the `simulate()` function below takes `self` as its first argument and can access `self.ninds` and `self.nsites` from it, so we do not need to provide those as arguments to the `simulate` function like we did above. 
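
The pattern can be sketched with a stripped-down class. `Demo` is a hypothetical stand-in used only for this illustration; it omits mutations, missing data, and everything else the real class needs.

```python
import numpy as np

class Demo:
    "hypothetical stand-in for the real class"

    def __init__(self, ninds, nsites):
        # store the arguments on the instance
        self.ninds = ninds
        self.nsites = nsites
        # methods called from here can already read the attributes set above
        self.seqs = self.simulate()

    def simulate(self):
        # no ninds/nsites arguments needed: both are read from self
        oseq = np.random.choice(list("ACGT"), size=self.nsites)
        return np.array([oseq for _ in range(self.ninds)])

d = Demo(4, 10)
print(d.ninds, d.nsites)   # 4 10
print(d.seqs.shape)        # (4, 10)
```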

In [91]:
#!/usr/bin/env python

"""
Seqlib class project for 6-scientific-python
"""

import numpy as np
import pandas as pd
    
class Seqlib:

    def __init__(self, ninds, nsites, maxfreq, minfreq):
        self.ninds = ninds
        self.nsites = nsites
        # store the thresholds so the filter methods can read them from self
        self.maxfreq = maxfreq
        self.minfreq = minfreq
        self.seqs = self.simulate()
        self.filter = self.filter_missing(), self.filter_maf()
    
    def simulate(self):  # ninds is the number of rows and nsites is the number of columns
        """
        Generate variable sequence data by introducing mutations and sites with missing data. 
        """
        
        # draw one random base from "ACGT" for each of the nsites sites
        oseq = np.random.choice(list("ACGT"), size=self.nsites) 
    
        # build an array by stacking ninds copies of oseq as rows
        arr = np.array([oseq for i in range(self.ninds)]) 
    
        # binomial sampling over the whole array: each cell is 1 with probability 0.1, otherwise 0
        muts = np.random.binomial(1, 0.1, (self.ninds, self.nsites)) 
    
        for col in range(self.nsites):      
            # pick one mutant base for this column, different from the original base
            # (same logic as mutate() above, inlined so the module does not depend on it)
            diff = set("ACGT") - set(arr[0, col])
            newbase = np.random.choice(list(diff))
            # convert this column's 0/1 draws to a boolean mask with astype()
            mask = muts[:, col].astype(bool)
            # arr[:, col] selects the full column; [mask] selects the flagged rows,
            # which are overwritten with newbase
            arr[:, col][mask] = newbase
    
        # flag missing data with a second binomial draw (probability 0.1 per cell)
        missing = np.random.binomial(1, 0.1, (self.ninds, self.nsites)) 
        # mark the flagged cells as missing data ("N")
        arr[missing.astype(bool)] = "N"  
        return arr
        
    def filter_missing(self):  # operates on self.seqs, the array returned by simulate()
        
        """
        Remove sites (columns) in which the frequency of missing data exceeds maxfreq.
        """  
        # count the Ns in each column (axis=0) and divide by the number of rows (shape[0])
        freqmissing = np.sum(self.seqs == "N", axis=0) / self.seqs.shape[0]    
    
        # return all rows, keeping only the columns whose missing frequency is at most maxfreq
        return self.seqs[:, freqmissing <= self.maxfreq]  
    
    def filter_maf(self):
    
        """
        Remove sites (columns) whose minor allele frequency is not greater than minfreq. 
        A copy is used so that folding the values in maf does not alter freqs.
        """  
    
        # for each column, count the rows that differ from the first row and divide 
        # by the number of rows (shape[0])
        freqs = np.sum(self.seqs != self.seqs[0], axis=0) / self.seqs.shape[0] 
    
        # work on a copy so that freqs itself is left unchanged
        maf = freqs.copy()
    
        # where the frequency is above 0.5, replace it with 1 - value (e.g. 0.875 becomes 0.125)
        maf[maf > 0.5] = 1 - maf[maf > 0.5]
    
        # return all rows, keeping only the columns whose minor allele frequency is greater than minfreq
        return self.seqs[:, maf > self.minfreq]
  
    
    def calculcate_statistics(self):
        """
        Compute the mean nucleotide diversity, the mean minor allele frequency, and the numbers of 
        invariant and variable sites.
        """     
        # mean nucleotide diversity: seqs == seqs[0] compares every row with the first row. The variance of the 
        # resulting True/False values is taken within each column, and the column variances are averaged. 
        nd = np.var(self.seqs == self.seqs[0], axis=0).mean()
    
        # mean minor allele frequency: seqs != seqs[0] marks the bases that differ from the first row. The 
        # differences are counted in each column and divided by the number of rows; the column frequencies 
        # are then averaged.
        mf = np.mean(np.sum(self.seqs != self.seqs[0], axis=0) / self.seqs.shape[0])
   
        # variable sites: np.any builds a boolean mask that is True for each column in which any base differs 
        # from the first row; summing it counts the variable columns 
        var = np.any(self.seqs != self.seqs[0], axis=0).sum()
    
        # invariant sites: subtract the number of variable sites from the total number of columns
        inv = self.seqs.shape[1] - var
    
        # return all four values as a pandas Series
        return pd.Series(
            {"mean nucleotide diversity": nd,
             "mean minor allele frequency": mf,
             "invariant sites": inv,
             "variable sites": var,
            })
    

In [None]:
### my updated version
#!/usr/bin/env python

"""
Seqlib class project for 6-scientific-python
"""

import numpy as np
import pandas as pd
    
class Seqlib:

    def __init__(self, ninds, nsites, arr, maxfreq, minfreq):
        self.ninds = ninds
        self.nsites = nsites
        self.seqs = self.simulate()
        # arr is the sequence array that the filter and statistics methods operate on
        self.arr = arr
        # store the thresholds so the filter methods can read them from self
        self.maxfreq = maxfreq
        self.minfreq = minfreq
        self.filter = self.filter_missing(), self.filter_maf()
    
    def simulate(self):  # ninds is the number of rows and nsites is the number of columns 
        """
        Generate variable sequence data by introducing mutations and 
        sites with missing data. 
        With ninds = 6 and nsites = 15 this builds six nucleotide sequences, 
        each 15 bases long.
        """
        
        # draw one random base from "ACGT" for each site (15 in the example above)
        oseq = np.random.choice(list("ACGT"), size=self.nsites) 
    
        # build an array by stacking ninds (6 in the example above) copies of oseq as rows
        arr = np.array([oseq for i in range(self.ninds)]) 
    
        # binomial sampling over the whole array: each cell is 1 with probability 0.1, otherwise 0
        muts = np.random.binomial(1, 0.1, (self.ninds, self.nsites)) 
    
        for col in range(self.nsites):  # loop over the columns
            # pick one mutant base for this column, different from the original base
            # (same logic as mutate() above, inlined so the module does not depend on it)
            diff = set("ACGT") - set(arr[0, col])
            newbase = np.random.choice(list(diff))
            # convert this column's 0/1 draws to a boolean mask with astype()
            mask = muts[:, col].astype(bool)
            # arr[:, col] selects the full column; [mask] selects the flagged rows,
            # which are overwritten with newbase
            arr[:, col][mask] = newbase
    
        # flag missing data with a second binomial draw (probability 0.1 per cell)
        missing = np.random.binomial(1, 0.1, (self.ninds, self.nsites)) 
        # mark the flagged cells as missing data ("N")
        arr[missing.astype(bool)] = "N"  
        return arr
        
    def filter_missing(self):  # operates on self.arr, the array passed in at init
        
        """
        Remove sites (columns) in which the frequency of missing data exceeds maxfreq.
        """  
        # count the Ns in each column (axis=0) and divide by the number of rows (shape[0])
        freqmissing = np.sum(self.arr == "N", axis=0) / self.arr.shape[0]    
    
        # return all rows, keeping only the columns whose missing frequency is at most maxfreq
        return self.arr[:, freqmissing <= self.maxfreq]  
    
    def filter_maf(self):
    
        """
        Remove sites (columns) whose minor allele frequency is not greater than minfreq. 
        A copy is used so that folding the values in maf does not alter freqs.
        """  
    
        # for each column, count the rows that differ from the first row and divide 
        # by the number of rows (shape[0])
        freqs = np.sum(self.arr != self.arr[0], axis=0) / self.arr.shape[0] 
    
        # work on a copy so that freqs itself is left unchanged
        maf = freqs.copy()
    
        # where the frequency is above 0.5, replace it with 1 - value (e.g. 0.875 becomes 0.125)
        maf[maf > 0.5] = 1 - maf[maf > 0.5]
    
        # return all rows, keeping only the columns whose minor allele frequency is greater than minfreq
        return self.arr[:, maf > self.minfreq]
  
    
    def calculcate_statistics(self):
        """
        Compute the mean nucleotide diversity, the mean minor allele frequency, and the numbers of 
        invariant and variable sites.
        """     
        # mean nucleotide diversity: arr == arr[0] compares every row with the first row. The variance of the 
        # resulting True/False values is taken within each column, and the column variances are averaged. 
        nd = np.var(self.arr == self.arr[0], axis=0).mean()
    
        # mean minor allele frequency: arr != arr[0] marks the bases that differ from the first row. The 
        # differences are counted in each column and divided by the number of rows; the column frequencies 
        # are then averaged.
        mf = np.mean(np.sum(self.arr != self.arr[0], axis=0) / self.arr.shape[0])
   
        # variable sites: np.any builds a boolean mask that is True for each column in which any base differs 
        # from the first row; summing it counts the variable columns 
        var = np.any(self.arr != self.arr[0], axis=0).sum()
    
        # invariant sites: subtract the number of variable sites from the total number of columns (15)
        inv = self.arr.shape[1] - var
    
        # return all four values as a pandas Series
        return pd.Series(
            {"mean nucleotide diversity": nd,
             "mean minor allele frequency": mf,
             "invariant sites": inv,
             "variable sites": var,
            })
    

## Deren's correct version

In [7]:
#!/usr/bin/env python

"""
seqlib library for class assignment
"""

import copy
import numpy as np
import pandas as pd


class Seqlib:
    """
    A seqlib object for class.
    """
    def __init__(self, ninds, nsites):

        ## generate the full sequence array
        self.ninds = ninds
        self.nsites = nsites
        self.seqs = self._simulate()

        ## store maf of the full seq array
        self.maf = self._get_maf()


    ## private functions used only during init -----
    def _mutate(self, base):
        "converts a base to another base"
        diff = set("ACTG") - set(base)
        return np.random.choice(list(diff))


    def _simulate(self):
        "returns a random array of DNA bases with mutations"
        oseq = np.random.choice(list("ACGT"), size=self.nsites)
        arr = np.array([oseq for i in range(self.ninds)])
        muts = np.random.binomial(1, 0.1, (self.ninds, self.nsites))
        for col in range(self.nsites):
            newbase = self._mutate(arr[0, col])
            mask = muts[:, col].astype(bool)
            arr[:, col][mask] = newbase
        missing = np.random.binomial(1, 0.1, (self.ninds, self.nsites))
        arr[missing.astype(bool)] = "N"
        return arr


    def _get_maf(self):
        "returns the maf of the full seqarray while not counting Ns"
        ## init an array to fill and iterate over columns
        maf = np.zeros(self.nsites)
        for col in range(self.nsites):
            ## select this column of bases
            thiscol = self.seqs[:, col]

            ## mask "N" bases and get new length
            nmask = thiscol != "N"
            no_n_len = np.sum(nmask)

            ## mask "N" bases and get the first base
            first_non_n_base = thiscol[nmask][0]

            ## calculate maf of "N" masked bases
            freq = np.sum(thiscol[nmask] != first_non_n_base) / no_n_len
            if freq > 0.5:
                maf[col] = 1 - freq
            else:
                maf[col] = freq
        return maf

        
    ## private functions that are called within other functions
    def _filter_missing(self, maxmissing):
        "returns a boolean filter True for columns with Ns > maxmissing"
        freqmissing = np.sum(self.seqs == "N", axis=0) / self.seqs.shape[0]
        return freqmissing > maxmissing


    def _filter_maf(self, minmaf):
        "returns a boolean filter True for columns with maf < minmaf"
        return self.maf < minmaf


    ## public functions -----
    def filter(self, minmaf, maxmissing):
        """
        Applies maf and missing filters to the array 
        Parameters
        ----------
        minmaf: float
            The minimum minor allele frequency. Filter columns below this.
        maxmissing: float
            The maximum prop. missing data. Filter columns with prop Ns > this.
        """
        filter1 = self._filter_maf(minmaf)
        filter2 = self._filter_missing(maxmissing)
        fullfilter = filter1 + filter2
        return self.seqs[:, np.invert(fullfilter)]


    def filter_seqlib(self, minmaf, maxmissing):
        """
        Applies maf and missing filters to the array and returns a copy 
        of the seqlib object where the .seqs array has been filtered
        Parameters
        ----------
        minmaf: float
            The minimum minor allele frequency. Filter columns below this.
        maxmissing: float
            The maximum prop. missing data. Filter columns with prop Ns > this.
        """
        ## apply filters to get new array size
        newseqs = self.filter(minmaf, maxmissing)

        ## make a new copy of the seqlib object
        newself = copy.deepcopy(self)       
        newself.__init__(newseqs.shape[0], newseqs.shape[1]) 

        ## store the array (overwrite it)
        newself.seqs = newseqs

        ## recompute the maf so it matches the new array
        newself.maf = newself._get_maf()
        return newself


    def calculate_statistics(self):
        """ 
        Returns a pandas Series of statistics on the seqs array. The earlier 
        example from the notebook had a bug where var and inv were switched.
        """
        if self.seqs.size:
            nd = np.var(self.seqs == self.seqs[0], axis=0).mean()
            mf = np.mean(
                np.sum(self.seqs != self.seqs[0], axis=0) / self.seqs.shape[0])
            inv = np.all(self.seqs == self.seqs[0], axis=0).sum()
            var = self.seqs.shape[1] - inv
            return pd.Series(
                {"mean nucleotide diversity": nd,
                 "mean minor allele frequency": mf,
                 "invariant sites": inv,
                 "variable sites": var,
                })
        else:
            print("seqs array is empty")
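
The `filter()` method above combines two boolean column filters with `+` and then inverts the result. The sketch below shows that step on its own with two made-up filters; for boolean arrays `+` behaves as a logical OR, and `|` and `~` are the more explicit spellings.

```python
import numpy as np

# two hypothetical per-column filters; True marks a column to drop
low_maf = np.array([True, False, False, True])
too_many_n = np.array([False, False, True, True])

# for boolean arrays "+" acts as a logical OR, the same as "|"
drop_plus = low_maf + too_many_n
drop_or = low_maf | too_many_n
print(np.array_equal(drop_plus, drop_or))   # True

# "~" inverts the mask, equivalent to np.invert
keep = ~drop_or
print(keep)   # [False  True False False]
```

Only the second column passes both filters, so it is the only one kept.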


## Test your package
The package should be globally importable (you ran `pip install .` or `pip install -e .` to install it), and it should be able to execute the following code without error. 

Note: `cd` into the `seqlib` directory that contains `setup.py` before running `pip install -e .`

In [None]:
## import the seqlib library
import seqlib

## generate a seqlib class object with a sequence array of shape (n, m) 
s = seqlib.Seqlib(10, 100)

## return sequence array
print(s.seqs)

## return maf of sequence array
print(s.maf)

## return the sequence array filtered by maf
## and missing (N) sites
print(s.filter(minmaf=0.1, maxmissing=0.0))

## return a new copy of seqlib object with modified seqarray 
n = s.filter_seqlib(minmaf=0.1, maxmissing=0.0)

## view stats on the full seqarray
s.calculate_statistics()

## view stats on the modified seqarray
n.calculate_statistics()

## or do the same in one shot
s.filter_seqlib(minmaf=0.1, maxmissing=0.0).calculate_statistics()


In [34]:
import seqlib
s = seqlib.Seqlib(10, 100)

TypeError: __init__() missing 3 required positional arguments: 'arr', 'maxfreq', and 'minfreq'

In [30]:
## import the seqlib library
import seqlib

## generate a seqlib class object with a sequence array of shape (n, m) 
s = seqlib.Seqlib(10, 100)

## return sequence array
print(s.seqs)

## return maf of sequence array
print(s.maf)

## return a filtered view of the seqarray filter based on maf
## and missing (N) sites
print(s.filter(minmaf=0.1, maxmissing=0.0)

SyntaxError: unexpected EOF while parsing (<ipython-input-30-4e6157800aca>, line 15)

In [22]:
import seqlib

In [None]:
# init a Seqlib Class object
seqs = seqlib.Seqlib(ninds=10, nsites=50)

In [None]:
# access attributes from the object
print(seqs.ninds, seqs.nsites)

In [None]:
# returns the MAF of the array as an array of floats
seqs.maf

In [None]:
# return a view of the filtered sequence array by applying a new function 
# called `filter()` that applies both the maf and missing filter functions
seqs.filter(minmaf=0.1, maxmissing=0.0)

In [None]:
# calculate statistics for an array with the results returned as a pandas Series
seqs.calculate_statistics()

In [None]:
# calculate statistics for an array after filtering it; filter_seqlib() returns
# a new Seqlib object, whereas filter() returns a plain array
seqs.filter_seqlib(minmaf=0.1, maxmissing=0.0).calculate_statistics()