# Data in Python

## Stats 141B

## Lecture 5

## Prof. Sharpnack

## Lecture slides at http://anson.ucdavis.edu/~jsharpna/141Blectures/

## Lecture repository at https://github.com/jsharpna/141Blectures/

In [1]:
# The following data is from https://www.pombase.org/downloads/protein-datasets
# It contains the amino acid sequences for proteins in fission yeast

! head ../data/peptide.fa

>SPAC1002.01:pep
MLPPTIRISGLAKTLHIPSRSPLQALKGSFILLNKRKFHYSPFILQEKVQSSNHTIRSDT
KLWKRLLKITGKQAHQFKDKPFSHIFAFLFLHELSAILPLPIFFFIFHSLDWTPTGLPGE
YLQKGSHVAASIFAKLGYNLPLEKVSKTLLDGAAAYAVVKVSYFVENNMVSSTRPFVSN*
>SPAC1002.02:pep
MASTFSQSVFARSLYEDSAENKVDSSKNTEANFPITLPKVLPTDPKASSLHKPQEQQPNI
IPSKEEDKKPVINSMKLPSIPAPGTDNINESHIPRGYWKHPAVDKIAKRLHDQAPSDRTW
SRMVSNLFAFISIQFLNRYLPNTTAVKVVSWILQALLLFNLLESVWQFVRPQPTFDDLQL
TPLQRKLMGLPEGGSTSGKHLTPPRYRPNFSPSRKAENVKSPVRSTTWA*
>SPAC1002.03c:pep


In [2]:
def pep_reader(filename='../data/peptide.fa'):
    with open(filename,'r') as pepfile:
        pepname = False # start of file
        for line in pepfile: 
            if line[0] == '>': # check for prot id line
                if pepname:
                    yield (pepname,pepseq) # if not first output protein
                pepname = line.split(':')[0][1:] # get the id
                pepseq = "" # init seq
            else:
                pepseq += line.strip() # append to seq

In [3]:
pep = pep_reader() # init the iterator
pepdict = {k:v for k,v in pep} # make dictionary with gen expression

In [4]:
[k for i,k in enumerate(pepdict.keys()) if i < 10] # first 10 keys

['SPAC1002.01',
 'SPAC1002.02',
 'SPAC1002.03c',
 'SPAC1002.04c',
 'SPAC1002.05c',
 'SPAC1002.06c',
 'SPAC1002.07c',
 'SPAC1002.08c',
 'SPAC1002.09c',
 'SPAC1002.10c']

## Dictionaries form hash tables

- a hash function is used to give the keys integer ids (probably unique but maybe not)
- a hash table maps these ids to values

![](https://upload.wikimedia.org/wikipedia/commons/thumb/7/7d/Hash_table_3_1_1_0_1_0_0_SP.svg/315px-Hash_table_3_1_1_0_1_0_0_SP.svg.png)
*image from wikipedia

In [5]:
pepdict['SPAC1002.01'] # select element

'MLPPTIRISGLAKTLHIPSRSPLQALKGSFILLNKRKFHYSPFILQEKVQSSNHTIRSDTKLWKRLLKITGKQAHQFKDKPFSHIFAFLFLHELSAILPLPIFFFIFHSLDWTPTGLPGEYLQKGSHVAASIFAKLGYNLPLEKVSKTLLDGAAAYAVVKVSYFVENNMVSSTRPFVSN*'

In [6]:
hash('SPAC1002.01') # the hash value

-6987572177271965481

In [26]:
lastid = prot_ids[-1] # select last id

%time prot_seqs[prot_ids.index(lastid)] # time selecting using list.index

CPU times: user 109 µs, sys: 2 µs, total: 111 µs
Wall time: 115 µs


'MSAEDLFTIQILCDQIELKLASIVINSNIKLQLKRKKKTQQL*'

In [27]:
%time pepdict[lastid] # time select using dict

CPU times: user 4 µs, sys: 0 ns, total: 4 µs
Wall time: 5.96 µs


'MSAEDLFTIQILCDQIELKLASIVINSNIKLQLKRKKKTQQL*'

In [9]:
pep = pep_reader() # init again
prot_ids, prot_seqs = zip(*pep) # make 2 lists
prot_ids = list(prot_ids)
prot_seqs = list(prot_seqs)

In [10]:
def seq_to_set(seq):
    """Protein sequence to a set"""
    return set(p for p in seq)

### Sets in Python

- good at testing containment
- uses a hash table: values are if the element is in the set or not
- set operations:
 - `a in s` tests if element a is in set s
 - `s1 & s2` intersection
 - `s1 | s2` union
 - `s1 - s2` s1 not in s2

In [11]:
s = seq_to_set(prot_seqs[0]) # save the set to test with
%timeit '*' in s # how long does containment test take here?

24.7 ns ± 0.456 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)


In [12]:
s = prot_seqs[0] # test with list
%timeit '*' in s # how long does containment test take now?

29 ns ± 0.367 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)


In [29]:
s = set(i for i in range(1000)) # test with a larger set
%timeit 999 in s

39.7 ns ± 1.39 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)


In [30]:
s = list(i for i in range(1000))
%timeit 999 in s

9.78 µs ± 243 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


In [13]:
prot_sets = [seq_to_set(seq) for seq in prot_seqs] # create sets database

In [14]:
## Creating the alphabet of all amino acids in dataset

alphabet = set()
for seq in prot_sets:
    alphabet |= seq
    
print(','.join(alphabet))

E,V,D,P,Y,S,M,A,Q,*,W,N,I,L,K,C,H,T,R,G,F


In [31]:
## which proteins have all amino acids (including *)?
pids_full_aa = [pid for pid, seq in zip(prot_ids,prot_sets) if seq == alphabet]

len(pids_full_aa) / len(prot_ids)

0.8177570093457944

### Similarity between proteins

- how similar are the amino acid distribution?
- define the AA proportions (string s, amino acid a):
$$\hat p_s(a) = \frac{\textrm{ # a in s}}{\textrm{ length of s }}$$
- define total variation distance between seqs:
$$ d(s_0, s_1) := \sum_{a} |\hat p_{s_0}(a) - \hat p_{s_1}(a)|$$

In [17]:
prot_lens = [len(seq) for seq in prot_seqs] # protein lengths

In [20]:
from collections import Counter

Counter(prot_seqs[0]) # Counter counts the number of unique elements in iterable

Counter({'M': 2,
         'L': 24,
         'P': 12,
         'T': 9,
         'I': 12,
         'R': 6,
         'S': 18,
         'G': 8,
         'A': 12,
         'K': 16,
         'H': 8,
         'Q': 6,
         'F': 15,
         'N': 6,
         'Y': 5,
         'E': 5,
         'V': 9,
         'D': 4,
         'W': 2,
         '*': 1})

In [21]:
# total count of each AA in data
total_count = sum((Counter(val) for _,val in pepdict.items()), Counter())

# total number of AAs in data
total_aas = sum(val for _,val in total_count.items())

# total proportion of AAs in data
total_prop = {aa:cnt/total_aas for aa,cnt in total_count.items()}
total_prop

{'M': 0.020619039358262394,
 'L': 0.09850329500826625,
 'P': 0.04703679164637549,
 'T': 0.05490719007798082,
 'I': 0.06143952045246013,
 'R': 0.048582195459932156,
 'S': 0.09403679791661006,
 'G': 0.049142336414805275,
 'A': 0.06209120683204759,
 'K': 0.0642406432424637,
 'H': 0.022572426434472912,
 'Q': 0.038115501900926115,
 'F': 0.045995514692204635,
 'N': 0.05199780123774431,
 'Y': 0.034146025402810316,
 'E': 0.06512223822293192,
 'V': 0.06020261217972164,
 'D': 0.0533003379656433,
 'W': 0.011164361658769655,
 '*': 0.0021469283165883235,
 'C': 0.01463723557898301}

In [22]:
def comp_prots(i,j):
    """Compare two proteins by TV-dist (this should be in a class)
    Requires global variables prot_sets,prot_lens,prot_seqs,prot_ids"""
    pid1, pid2 = prot_ids[i],prot_ids[j]
    pset1, pset2 = prot_sets[i],prot_sets[j]
    plen1, plen2 = prot_lens[i],prot_lens[j]
    pcnt1, pcnt2 = (Counter(s) for s in [prot_seqs[i],prot_seqs[j]]) # create 2 counters
    norm = 0
    for aa in pset1 & pset2:
        norm += abs(pcnt1[aa] / plen1 - pcnt2[aa] / plen2) # TV dist for overlap
    for aa in pset1 - pset2:
        norm += pcnt1[aa] / plen1  # TV dist for aa in s1 not s2
    for aa in pset2 - pset1:
        norm += pcnt2[aa] / plen2  # TV dist for aa in s2 not s1   
    return (pid1,pid2), norm

In [23]:
## distance between first and second
comp_prots(0,1)

(('SPAC1002.01', 'SPAC1002.02'), 0.34830917874396133)

In [24]:
## protein 'closest' to first
n = len(prot_ids)
min((comp_prots(0,i) for i in range(1,n)),key=lambda x: x[1])

(('SPAC1002.01', 'SPBC16H5.14c'), 0.19907084785133566)

- key: tells how to compare for min
- lambda is ad-hoc function
 - `lambda x: expr(x)`

In [25]:
## Briefly about scope

i = 5 # i in global namespace
def pseq():
    for i in range(10):
        print(i) # i in local namespace
        
pseq()
print(i) # i in global namespace again

0
1
2
3
4
5
6
7
8
9
5
