# Effect of mutations on conformations

Evaluate whether some conformations are specific to a given sequence, and whether the Pu->Pu and Py->Py mutations induce unrealistic conformations.

In [1]:
import numpy as np
import json, sys
from math import factorial as fact

In [2]:
ff = json.load(open("fragments_clust.json"))

In [23]:
#example of fragment
ff["AAA"]["10"]

{'chain': 'B',
 'clust0.2': 9372,
 'clust0.2_center': True,
 'clust1.0': 392,
 'clust1.0_center': True,
 'clust3.0': 58,
 'clust3.0_center': True,
 'indices': [9, 10, 11],
 'missing_atoms': [0, 0, 0],
 'model': 3,
 'resid': ['209', '210', '211'],
 'seq': 'GGA',
 'structure': '1A1T'}

### Remove redundant fragments in clusters of 1 Å conformations
Fragments are considered redundant if they are:
- of the same original sequence

AND one of the following:

- from RNA belonging to the same Rfam family
- bound to proteins with >90% sequence identity

In [3]:
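def l2d(mapfile):
    # Read a two-column, whitespace-separated file
    # into a dict {first column: set of second-column values}
    d = {}
    with open(mapfile) as fh:
        for line in fh:
            i, j = line.split()
            if i not in d:
                d[i] = set()
            d[i].add(j)
    return d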
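
To make the shape of the returned mapping concrete, here is a minimal sketch that builds the same kind of dictionary from a small temporary file. The two-column layout (PDB code, then identifier) is an assumption inferred from the `for i,j in ll` unpacking; the function name `l2d_sketch`, the PDB codes and the family names are all made up for illustration.

```python
import os
import tempfile

def l2d_sketch(mapfile):
    """Map each first-column key to the set of its second-column values."""
    d = {}
    with open(mapfile) as fh:
        for line in fh:
            key, value = line.split()
            d.setdefault(key, set()).add(value)
    return d

# Hypothetical content: one "PDB_code identifier" pair per line.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("1ABC famX\n1ABC famY\n2XYZ famX\n")
    path = tmp.name

demo_map = l2d_sketch(path)
os.remove(path)
print(demo_map)
```

A structure listed on several lines ends up with several identifiers in its set, which is why the redundancy check later loops over `rfam[pdb]` and `unip[pdb]`.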

In [4]:
# mapping of each PDB structure to Rfam and Uniprot
unip = l2d("pdbUnip")
rfam = l2d("pdbRfam")

In [5]:
# clusters of PDB chains with seqid > 90%
# https://cdn.rcsb.org/resources/sequence/clusters/bc-90.out
ll = [l.split() for l in open("bc-90.out").readlines()]
redun = {}
for nl,l in enumerate(ll):
    for w in l:
        redun[w[:4]] = nl
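
The cell above can be illustrated on in-memory data. This is only a sketch: it assumes that each line of `bc-90.out` is one cluster and that each entry starts with a 4-character PDB code (which is what `w[:4]` relies on). The entries below are invented.

```python
# Hypothetical cluster lines; each line is one >90% identity cluster.
lines = [
    "1ABC_A 2XYZ_B",
    "3DEF_A",
]

redun_demo = {}
for cluster_index, line in enumerate(lines):
    for entry in line.split():
        # Keep only the 4-character PDB code; the chain suffix is dropped.
        redun_demo[entry[:4]] = cluster_index

# Two structures are treated as redundant when they share a cluster index.
same_cluster = redun_demo["1ABC"] == redun_demo["2XYZ"]
print(redun_demo, same_cluster)
```

Because the key is the PDB code alone, a structure whose chains fall in several clusters keeps only the index of the last cluster in which it appears.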

In [7]:
#define which sequences correspond to each Pu/Py motif
pu = ["A", "G"]
py = ["C","U"]
mu = ["A", "C"]
motifs=[a+b+c for a in mu for b in mu for c in mu ]
seq = {m:set([ ff[m][k]["seq"] for k in ff[m].keys()]) for m in motifs}

In [10]:
seq

{'AAA': {'AAA', 'AAG', 'AGA', 'AGG', 'GAA', 'GAG', 'GGA', 'GGG'},
 'AAC': {'AAC', 'AAU', 'AGC', 'AGU', 'GAC', 'GAU', 'GGC', 'GGU'},
 'ACA': {'ACA', 'ACG', 'AUA', 'AUG', 'GCA', 'GCG', 'GUA', 'GUG'},
 'ACC': {'ACC', 'ACU', 'AUC', 'AUU', 'GCC', 'GCU', 'GUC', 'GUU'},
 'CAA': {'CAA', 'CAG', 'CGA', 'CGG', 'UAA', 'UAG', 'UGA', 'UGG'},
 'CAC': {'CAC', 'CAU', 'CGC', 'CGU', 'UAC', 'UAU', 'UGC', 'UGU'},
 'CCA': {'CCA', 'CCG', 'CUA', 'CUG', 'UCA', 'UCG', 'UUA', 'UUG'},
 'CCC': {'CCC', 'CCU', 'CUC', 'CUU', 'UCC', 'UCU', 'UUC', 'UUU'}}
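
The grouping shown above amounts to replacing each purine by `A` and each pyrimidine by `C`. A minimal sketch of that mapping follows; the names `PU_PY` and `to_motif` are hypothetical helpers, not part of the notebook.

```python
# Purines (A, G) collapse to "A"; pyrimidines (C, U) collapse to "C".
PU_PY = {"A": "A", "G": "A", "C": "C", "U": "C"}

def to_motif(sequence):
    """Return the Pu/Py motif key for a string of A/C/G/U nucleotides."""
    return "".join(PU_PY[base] for base in sequence)

print(to_motif("GGA"))
print(to_motif("GUC"))
```

With this helper, the original sequence `GGA` of the example fragment maps to the `AAA` motif under which it is stored in `ff`.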

In [60]:
#to make results more readable, we hide some entries in the fragments description
keep = {'p':'structure',
        'r':'resid', 
        's':'seq',
        'c':'chain'
        #'m':'model'
       }
f2 = {m:{k:{ x:ff[m][k][keep[x]] for x in keep} for k in ff[m].keys()} for m in motifs}

In [61]:
f2["AAA"]["10"]

{'p': '1A1T', 'r': ['209', '210', '211'], 's': 'GGA', 'c': 'B'}

In [90]:
# For a given motif, in each cluster (1 Å),
# sort non-redundant fragments by their original sequence.
# Non-redundant =
#    from different PDB structures
#    from different RNA families (according to Rfam)
#    bound to different proteins (according to Uniprot)

d = {m:{} for m in motifs}
for m in motifs:
    for k in ff[m].keys():
        f = ff[m][k]
        s = f['seq']
        c = f['clust1.0']
        pdb = f['structure'][:4]
        if c not in d[m]:
            d[m][c] = {}
            for ss in seq[m]:
                d[m][c][ss] = {"seqid":set(),
                              "rfam":set(),
                              "pdb":set(),
                              "frag":[]}
        new = 1

        # is it bound to a new protein (new >90% sequence-identity cluster)?
        if pdb in redun.keys():
            if redun[pdb] in d[m][c][s]["seqid"]:
                new = 0
            else:
                d[m][c][s]["seqid"].add(redun[pdb])
        
        # is it from a new Rfam?
        if pdb in rfam:
            for r in rfam[pdb]: 
                if r in d[m][c][s]["rfam"]:
                    new = 0
                else:
                    d[m][c][s]["rfam"].add(r)
                    
        # is it bound to a new protein (Uniprot/code)?
        if pdb in unip:
            for p in unip[pdb]:
                if p in d[m][c][s]["pdb"]:
                    new = 0
                else:
                    d[m][c][s]["pdb"].add(p)
        else:
            if pdb in d[m][c][s]["pdb"]:
                new = 0
            else:
                d[m][c][s]["pdb"].add(pdb)

        if new:
            d[m][c][s]["frag"].append(f2[m][k])


In [91]:
clust = {m:{ c: { s:len(d[m][c][s]["frag"]) for s in seq[m]} for c in d[m].keys() } for m in motifs}

In [92]:
# Number of fragments with each of the 8 PuPuPu original sequences
# in the first "AAA" cluster
clust["AAA"][1]

{'GGA': 57,
 'AAA': 43,
 'AGG': 57,
 'GAA': 44,
 'GGG': 56,
 'AAG': 43,
 'AGA': 49,
 'GAG': 59}

In [93]:
nclust = {m:{} for m in motifs}
for m in motifs:
    for k in clust[m]:
        n = sum([clust[m][k][s] for s in clust[m][k].keys()])
        if n not in nclust[m]:
            nclust[m][n] = 1
        else:
            nclust[m][n] += 1

kk = [k for k in nclust["AAA"].keys()]
kk.sort() 
for k in kk:
    n = nclust["AAA"][k]
    print("There are %i \t \"AAA\" clusters of size %i"%(n,k))

There are 2338 	 "AAA" clusters of size 1
There are 130 	 "AAA" clusters of size 2
There are 37 	 "AAA" clusters of size 3
There are 22 	 "AAA" clusters of size 4
There are 11 	 "AAA" clusters of size 5
There are 11 	 "AAA" clusters of size 6
There are 6 	 "AAA" clusters of size 7
There are 6 	 "AAA" clusters of size 8
There are 4 	 "AAA" clusters of size 9
There are 3 	 "AAA" clusters of size 11
There are 3 	 "AAA" clusters of size 12
There are 3 	 "AAA" clusters of size 13
There are 1 	 "AAA" clusters of size 14
There are 2 	 "AAA" clusters of size 15
There are 1 	 "AAA" clusters of size 16
There are 2 	 "AAA" clusters of size 17
There are 1 	 "AAA" clusters of size 18
There are 1 	 "AAA" clusters of size 22
There are 1 	 "AAA" clusters of size 27
There are 1 	 "AAA" clusters of size 29
There are 1 	 "AAA" clusters of size 30
There are 1 	 "AAA" clusters of size 31
There are 1 	 "AAA" clusters of size 54
There are 1 	 "AAA" clusters of size 186
There are 1 	 "AAA" clusters of size 22
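
The same kind of size histogram can be obtained more compactly with `collections.Counter`. The sketch below uses a toy per-cluster count table with the same layout as `clust[m]` (cluster index to per-sequence counts); the values are invented.

```python
from collections import Counter

# Toy stand-in for clust["AAA"]: {cluster index: {sequence: fragment count}}
toy_clust = {
    1: {"AAA": 2, "GGA": 1},
    2: {"AAA": 1, "GGA": 0},
    3: {"AAA": 0, "GGA": 1},
}

# Cluster size = total number of fragments over all sequences.
size_hist = Counter(sum(per_seq.values()) for per_seq in toy_clust.values())

for size in sorted(size_hist):
    print("There are %i clusters of size %i" % (size_hist[size], size))
```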

### Are some clusters made of only one sequence?
In other words, is a conformation (at 1 Å precision) specific to one sequence?

Our null hypothesis is that the n fragments of each cluster are distributed at random, with equal probability, among the k=8 sequences.\
For a cluster containing n fragments (n>=1), the probability p(n) that they all have the same sequence is:

p(n) = A(n) / B(n)\
with:\
A(n) = number of distributions into only one sequence = k\
B(n) = total number of distributions = k**n

In [94]:
for n in range(1, 6):
    p = 8/8**n
    print(n,p)

1 1.0
2 0.125
3 0.015625
4 0.001953125
5 0.000244140625


The smallest value of n such that p(n) < 0.05 is n = 3 (p = 0.015625). \
Therefore we will consider only clusters with at least 3 fragments.
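
The threshold can also be derived programmatically rather than read off the table. This is a minimal sketch under the same null hypothesis (k equally likely sequences); the function names are hypothetical.

```python
def p_single(n, k=8):
    """Probability that n fragments all share one of k equally likely sequences."""
    return k / k**n

def min_cluster_size(alpha=0.05, k=8):
    """Smallest n for which p_single(n, k) drops below alpha."""
    n = 1
    while p_single(n, k) >= alpha:
        n += 1
    return n

print(min_cluster_size(0.05))
```

A stricter significance level simply raises the minimum size: with `alpha=0.01` the loop stops at n = 4.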

In [98]:
single = {m:{} for m in motifs} #clusters with n >= nf made of one sequence
nclust = {m:{} for m in motifs} #nb clusters with n >= nf
E = 0 #expected total number of single-sequence clusters under the null hypothesis
nf = 3
for m in motifs:
    ncl = 0
    for k in d[m].keys():
        n = sum([clust[m][k][s] for s in seq[m]])
        if n < nf : continue
        ncl += 1
        E += 8/8**n
        i = len([s for s in seq[m] if clust[m][k][s]> 0])
        if i==1:
            ss = [s for s in seq[m] if clust[m][k][s] > 0][0] 
            single[m][k] = (ss,clust[m][k][ss])
    nclust[m] = ncl

In [99]:
for m in motifs:
    print(m, len(single[m]), nclust[m])
    
#a = sum([len(clust[m].keys()) for m in motifs])
x1 = sum([len(single[m]) for m in motifs])
x2 = sum([nclust[m] for m in motifs])
x3 = 100*x1/x2
print("\n %i out of %i (%.2f %%) clusters with at least %i fragments are of single sequence, \
while ~%i were expected"%(x1, x2, x3, nf, E))

AAA 4 122
AAC 4 100
ACA 4 75
ACC 0 74
CAA 6 106
CAC 4 62
CCA 2 77
CCC 8 99

 32 out of 715 (4.48 %) clusters with at least 3 fragments are of single sequence, while ~4 were expected
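
The expected count quoted above is the sum of p(n) over all retained clusters. The sketch below reproduces that sum on a toy list of cluster sizes (invented values, not the notebook's data); `expected_single` is a hypothetical helper.

```python
def expected_single(sizes, k=8, min_size=3):
    """Expected number of single-sequence clusters among those with n >= min_size."""
    return sum(k / k**n for n in sizes if n >= min_size)

# Toy cluster sizes; sizes below min_size are ignored.
toy_sizes = [1, 2, 3, 3, 4, 5]
expected = expected_single(toy_sizes)
print(expected)
```

Since p(n) shrinks by a factor of 8 with each additional fragment, the expected count is dominated by the smallest retained clusters.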


In [68]:
print("Single-seq clusters of the PuPuPu motif:")
print("{index of cluster: (original sequence, number of fragments)}")
single["AAA"]

Single-seq clusters of the PuPuPu motif:
{index of cluster: (original sequence, number of fragments)}


{235: ('GGA', 4), 657: ('AGG', 3), 236: ('GGA', 3), 262: ('GGG', 3)}

In [70]:
print("Fragments in each single-seq PuPuPu cluster\n")
print("{p: pdbcode, r: resid, s: original sequence, c: chain }")
for k in single["AAA"]:
    s = single["AAA"][k][0]
    print("\ncluster %i"%k)
    for frag in d["AAA"][k][s]["frag"]:
        print(frag)

Fragments in each single-seq PuPuPu cluster

{p: pdbcode, r: resid, s: original sequence, c: chain }

cluster 235
{'p': '4QVI', 'r': ['2156', '2157', '2158'], 's': 'GGA', 'c': 'B'}
{'p': '5FJ4', 'r': ['4', '5', '6'], 's': 'GGA', 'c': 'D'}
{'p': '6YAL', 'r': ['1600', '1601', '1602'], 's': 'GGA', 'c': '2'}
{'p': '2ANN', 'r': ['6', '7', '8'], 's': 'GGA', 'c': 'B'}

cluster 657
{'p': '6KWR', 'r': ['599', '600', '601'], 's': 'AGG', 'c': 'B'}
{'p': '4K4V', 'r': ['599', '600', '601'], 's': 'AGG', 'c': 'B'}
{'p': '4K4Y', 'r': ['599', '600', '601'], 's': 'AGG', 'c': 'B'}

cluster 236
{'p': '6QULA', 'r': ['188', '189', '190'], 's': 'GGA', 'c': 'A'}
{'p': '1JBR', 'r': ['18', '19', '20'], 's': 'GGA', 'c': 'F'}
{'p': '2HGH', 'r': ['40', '41', '42'], 's': 'GGA', 'c': 'B'}

cluster 262
{'p': '6QULA', 'r': ['1695', '1696', '1697'], 's': 'GGG', 'c': 'A'}
{'p': '1AUD', 'r': ['34', '35', '36'], 's': 'GGG', 'c': 'B'}
{'p': '1EKZ', 'r': ['17', '18', '19'], 's': 'GGG', 'c': 'B'}


**Cluster 235**\
4QVI: Mutant ribosomal protein M218L TthL1 in complex with 80nt 23S RNA from Thermus thermophilus \
5FJ4: Structure of the standard kink turn HmKt-7 as stem loop bound with U1A and L7Ae proteins \
6YAL: Mammalian 48S late-stage initiation complex with beta-globin mRNA \
2ANN: Crystal structure (I) of Nova-1 KH1/KH2 domain tandem with 25 nt RNA hairpin

**Cluster 657**\
4K4V: Poliovirus polymerase elongation complex (r5+1_form) \
4K4Y: Coxsackievirus B3 polymerase elongation complex (r2+1_form) \
6KWR: Enterovirus 71 polymerase elongation complex (ddCTP form) \
=> ***redundant***

**Cluster 236**\
6QUL: Bacterial 50S ribosomal subunit in complex with cadazolid \
1JBR: Ribotoxin Restrictocin and a 31-mer SRD RNA Inhibitor \
2HGH: Transcription Factor IIIA zinc fingers 4-6 bound to 5S rRNA 55mer

**Cluster 262**\
6QUL: Bacterial 50S ribosomal subunit in complex with cadazolid \
1AUD: U1A-UTRRNA \
1EKZ: THIRD DSRBD FROM DROSOPHILA STAUFEN AND A RNA HAIRPIN

