# Calculating kappa Z-scores
###### Last updated 2022-07-05

## About

This notebook requires the `localcider` package. It implements a simplified version of the approach defined in Cohan and Shinn et al. *JMB* (2022) and is similar to the approach used in Martin et al. *Science* (2020).

If this code is used **please cite Cohan & Shinn et al (2022)** (reference below).

For an excruciating dive into $\kappa$, $\delta$, and some of the underlying theory there, I recommend both the original Das & Pappu 2013 PNAS paper and Holehouse et al. 2017 Biophys. J. (specifically the supplementary information - make sure you have coffee and/or a stiff drink at your side...).

## Usage

This notebook defines three functions (`get_zscore`, `build_delta_null`, and `get_kappa_zscore`). To compute the Z-score for your sequence of interest, simply run:


    seq = "..."  # replace ... with a valid amino acid sequence
    
    get_kappa_zscore(seq)
    
And the Z-score will be returned. That's it. No magic.
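
For intuition about what happens under the hood, here is a minimal, standard-library-only sketch of the shuffle-and-score idea. The `toy_patterning` metric, the `permutation_zscore` helper, the `seed` argument, and the two example sequences are hypothetical stand-ins made up for this illustration. The notebook itself uses `localcider`'s $\delta$, not this toy metric.

```python
import random
import statistics


def toy_patterning(seq):
    """Hypothetical stand-in metric (NOT localcider's delta).

    Returns the fraction of adjacent residue pairs in which both
    residues are charged and carry the same sign. Higher values mean
    the positive and negative residues are more segregated (blockier).
    """
    charge = {"K": 1, "R": 1, "D": -1, "E": -1}
    signs = [charge.get(res, 0) for res in seq]
    pairs = list(zip(signs, signs[1:]))
    same = sum(1 for a, b in pairs if a != 0 and a == b)
    return same / len(pairs)


def permutation_zscore(seq, metric, count=1000, seed=None):
    """Z-score of metric(seq) against shuffles with the same composition."""
    rng = random.Random(seed)  # local RNG so a fixed seed is repeatable
    target = metric(seq)
    null = []
    for _ in range(count):
        residues = list(seq)
        rng.shuffle(residues)
        null.append(metric("".join(residues)))
    # Population standard deviation, the same convention as np.std's default
    return (target - statistics.mean(null)) / statistics.pstdev(null)


blocky = "KKKKKKKKEEEEEEEE"  # charges fully segregated
mixed = "KEKEKEKEKEKEKEKE"   # charges strictly alternating
z_blocky = permutation_zscore(blocky, toy_patterning, count=500, seed=1)
z_mixed = permutation_zscore(mixed, toy_patterning, count=500, seed=1)
print(z_blocky, z_mixed)
```

With this toy metric the segregated sequence should score well above the shuffled mean and the alternating sequence well below it. Because the null distribution is built from random shuffles, the exact Z-score changes from run to run unless the random number generator is seeded.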

## Installation
To run this notebook, install the `localcider` Python package using `pip`:

    pip install localcider
    
Then run this notebook using Jupyter. We strongly recommend doing this within a controlled conda environment.

> NB: If you're not familiar with conda and its role in Python environment configuration, [we recommend reading up on this first before continuing](https://carpentries-incubator.github.io/introduction-to-conda-for-data-scientists/).

## Help with this code
This code was written by Alex Holehouse as a generic implementation for calculating kappa-based Z-scores using the $\delta$ distribution instead of the $\kappa$ distribution (where $\kappa = \delta / \delta_{max}$, so using $\delta$ instead of $\kappa$ gives you the same value in terms of a Z-score).
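
To spell out why the two Z-scores agree, here is a short derivation. It assumes that $\delta_{max}$ depends only on the sequence composition, so it is the same constant for the sequence of interest and for every shuffle. The symbols $\mu$ and $\sigma$ (the mean and standard deviation of the null distribution) are notation introduced here for illustration.

$$
Z_\kappa = \frac{\kappa - \mu_\kappa}{\sigma_\kappa}
         = \frac{\delta/\delta_{max} - \mu_\delta/\delta_{max}}{\sigma_\delta/\delta_{max}}
         = \frac{\delta - \mu_\delta}{\sigma_\delta}
         = Z_\delta
$$

Dividing every value by the same positive constant rescales the mean and the standard deviation by that constant, so it cancels out of the Z-score.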

Conceptually, this is exactly the same approach as NARDINI uses - here we just expose the inner workings! As mentioned above, if this code is used please cite the NARDINI paper (Cohan & Shinn et al. 2022).

If you have any questions please don't hesitate to [reach out to me](https://www.holehouselab.com/).



## References
Cohan, M. C., Shinn, M. K., Lalmansingh, J. M., & Pappu, R. V. (2022). Uncovering Non-random Binary Patterns Within Sequences of Intrinsically Disordered Proteins. Journal of Molecular Biology, 434(2), 167373.

Das, R. K., & Pappu, R. V. (2013). Conformations of intrinsically disordered proteins are influenced by linear sequence distributions of oppositely charged residues. Proceedings of the National Academy of Sciences of the United States of America, 110(33), 13392–13397.

Holehouse, A. S., Das, R. K., Ahad, J. N., Richardson, M. O. G., & Pappu, R. V. (2017). CIDER: Resources to Analyze Sequence-Ensemble Relationships of Intrinsically Disordered Proteins. Biophysical Journal, 112(1), 16–21.

In [20]:
## THERE SHOULD BE NO NEED TO CHANGE THE CODE IN THIS CELL.
## Just make sure you run it so the functions are available elsewhere!
##
##

from localcider.sequenceParameters import SequenceParameters
import numpy as np  # needed by get_zscore (np.mean, np.std)
import random

# ....................................................................
#
def get_zscore(scores, target):
    """
    Function that, given a list of scores and a target score returns
    the Z-score associated with the target using the list as a null 
    distribution.
    
    Parameters
    ----------------
    scores : list
        A list of floats
        
    target : float
        The actual value of interest
        
    Returns
    ------------
    float
        A Z-score that reflects how far from the mean the target
        sequence is compared to the underlying distribution. Note
        that this does NOT check for reasonable statistical properties
        so if you're worried maybe consider bootstrapping and/or asking
        if a Z-score is a reasonable metric for the underlying data...
    
    """
    return (target - np.mean(scores)) / np.std(scores)


# ....................................................................
#
def build_delta_null(seq, count=1000):
    """
    Function that builds the null distribution of delta values for a 
    sequence of interest.
    
    Algorithmically this works simply by
    
    1. Take the sequence (seq)
    2. Shuffle
    3. Calculate delta
    4. Repeat
    
    Parameters
    --------------
    seq : str
        A valid amino acid sequence
        
    count : int
        Number of random permutations to calculate
        
    Returns
    -------------
    list 
        A list of delta values for all the random permutations. 
    
    
    """
    deltas = []
    for i in range(count):
        s = list(seq)
        random.shuffle(s)
        s = "".join(s)
        deltas.append(SequenceParameters(s).get_delta())
    return deltas

    
    
def get_kappa_zscore(seq, count=1000):
    """
    Function that returns a Z-score associated with the charge 
    patterning for your sequence, using kappa as the metric for 
    calculating that Z-score.
    
    Positive values reflect the number of standard deviations the
    sequence lies above the mean - i.e. values greater than 1 
    mean the sequence is more blocky than expected by random chance.
    
    Negative values reflect the number of standard deviations the
    sequence lies below the mean - i.e. values less than -1 
    mean the sequence is more well-mixed than expected by random chance.
    
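    NOTE this calculates Z-scores using np.mean and np.std with their
    default settings, so the standard deviation is the population
    value (ddof=0), not the unbiased sample estimate (ddof=1).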
        
    Parameters
    --------------
    seq : str
        A valid amino acid sequence
        
    count : int
        Number of random permutations to calculate
        
    Returns
    -------------
    float
        A Z-score that reflects how far from the mean the target
        sequence is compared to the underlying distribution. Note
        that this does NOT check for reasonable statistical properties
        so if you're worried maybe consider bootstrapping and/or asking
        if a Z-score is a reasonable metric for the underlying data...

    """
    
    # compute the target delta
    target = SequenceParameters(seq).get_delta()
    
    # compute the null distribution of deltas for this sequence
    # composition
    null_dist = build_delta_null(seq, count)
    
    return get_zscore(null_dist, target)
