# Gene comparison

Task description: $\mathbf{a} = [a_1,a_2,a_3,\dots, a_n]$ and $\mathbf{b} = [b_1,b_2,b_3,\dots, b_m]$ are two lists of genes. The $a_i$'s are distinct genes, and so are the $b_j$'s.
The Pearson correlation is computed for every pair $(a_i, b_j)$. Note that the $a_i$'s and $b_j$'s are not necessarily sorted.
The data is stored in the file `gene_pair` in the format
\begin{align}
&a_1 ~~~ b_1 ~~~ 0.2\\
&a_1 ~~~ b_2 ~~~ 0.3\\
&a_1 ~~~ b_3 ~~~ 0.6\\
&\vdots ~~~~~ \vdots ~~~~~ \vdots
\end{align}
`gene_pair` has $nm$ rows and 3 columns. We also have a list of feature pairs, `feature_pair`, in the format
\begin{align}
& c_1 ~~~ d_1 \\
& c_2 ~~~ d_2 \\
& c_3 ~~~ d_3 \\
&\vdots ~~~~~ \vdots 
\end{align}
`feature_pair` has $l$ rows and 2 columns. 

We want to compare the gene pairs $(a_i, b_j)$ with the feature pairs $(c_k, d_k)$. If a gene pair equals a feature pair, we label it with $1$; otherwise we label it with $0$. That is, we want to find all the pairs that appear in both lists.
Before comparing, we sort both lists so that the comparison itself is a single linear pass, $O(nm + l)$. Sorting a list of $N$ rows costs $O(N \log N)$. The sorting code is
```
from operator import itemgetter
sorted(gene_pair, key=itemgetter(0,1))  # sort by column 0, then by column 1
```
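
To make the two-key sort concrete, here is a minimal sketch on a toy list. The gene names `g1`, `g2`, `h0`, ... and the correlation values are made up for illustration; only `sorted` and `itemgetter` come from the notebook.

```python
from operator import itemgetter

# Hypothetical toy rows in the same [gene_a, gene_b, correlation] layout
toy_gene_pair = [
    ["g2", "h1", 0.3],
    ["g1", "h2", 0.2],
    ["g2", "h0", 0.6],
    ["g1", "h1", 0.4],
]

# itemgetter(0, 1) builds the key (row[0], row[1]); keys are compared
# lexicographically, so column 1 only matters when column 0 ties.
toy_sorted = sorted(toy_gene_pair, key=itemgetter(0, 1))
for row in toy_sorted:
    print(row)
```

Rows sharing the same first gene end up adjacent, ordered by the second gene, which is the layout the comparison step relies on.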



Algorithm: 

1. Sort `gene_pair` by $a_i$ first. If the sorted $a'_i$ contain ties, $a'_j=a'_{j+1}=\dots$, sort the tied rows by $b_i$. For example, suppose the $a_i$'s are ages and the $b_i$'s are heights. Then the order is determined first by age, and rows with the same age are ordered by height. 
The sorted file `gene_pair_sorted` has the format,
\begin{align}
&a'_1  ~~~ b'_1 ~~~ 0.4\\
&a'_1  ~~~ b'_2 ~~~ 0.2\\
&a'_2  ~~~ b'_3 ~~~ 0.3\\
&a'_2  ~~~ b'_4 ~~~ 0.6\\
&\vdots ~~~~~~~~ \vdots ~~~~~~~ \vdots
\end{align}
where the $a'_i$ are sorted, and $b'_i<b'_j$ for $i<j$ whenever $b'_i$ and
$b'_j$ belong to the same $a'$ value.  
2. Do the same as above on `feature_pair` to create `feature_pair_sorted`,
\begin{align}
&c'_1   ~~~ d'_1 \\
&c'_2   ~~~ d'_2 \\
&c'_3   ~~~ d'_3 \\
&\vdots ~~~~~~~~ \vdots 
\end{align}
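3. Walk through both sorted lists with two indices $i$ and $j$. Compare $a'_i$ with $c'_j$. If they are equal, compare $b'_i$ with $d'_j$. If those are also equal, label the pair $a'_i b'_i$ with 1 and advance both indices. Otherwise, advance the index that points to the smaller pair; a gene pair that is passed over without a match is labelled 0.   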
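
Step 3 compares pairs in lexicographic order, which is how Python compares tuples, so the nested comparison can be written as a single tuple comparison. The following is a minimal sketch under that observation, not the function used later in this notebook; `label_matches` and the toy data are hypothetical, and it assumes no pair is repeated within either list.

```python
def label_matches(sorted_pairs, sorted_features):
    """Return one 0/1 flag per entry of sorted_pairs (hypothetical helper)."""
    flags = [0] * len(sorted_pairs)
    i = j = 0
    while i < len(sorted_pairs) and j < len(sorted_features):
        left = tuple(sorted_pairs[i][:2])
        right = tuple(sorted_features[j][:2])
        if left == right:      # same pair in both lists
            flags[i] = 1
            i += 1
            j += 1
        elif left < right:     # gene pair is smaller: it has no match
            i += 1
        else:                  # feature pair is smaller: skip it
            j += 1
    return flags

# Toy data, made up for illustration
pairs = sorted([("g1", "h1"), ("g1", "h2"), ("g2", "h0"), ("g2", "h1")])
features = sorted([("g2", "h1"), ("g0", "h9"), ("g1", "h2")])
flags = label_matches(pairs, features)
print(flags)
```

Each iteration advances at least one index, so the loop runs at most `len(sorted_pairs) + len(sorted_features)` times.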

In [210]:
import random
import string
import time
import numpy as np
# Function to generate a random lowercase string of 2 to 'max_length' characters
def generate_random_string(max_length=6):
    length = random.randint(2, max_length)
    return ''.join(random.choices(string.ascii_lowercase, k=length))

# Function to generate a list of 'n' random strings
def generate_random_string_list(n, max_length=6):
    return [generate_random_string(max_length) for _ in range(n)]

In [211]:
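n = 1000  # number of genes in a
m = 1000  # number of genes in b
l = 100   # number of feature pairs
# max_length=2 makes every name exactly two letters (26*26 = 676 possibilities),
# so a and b necessarily contain repeats, unlike the distinct genes of the task description
a = generate_random_string_list(n,2)
b = generate_random_string_list(m,2)
c = generate_random_string_list(l,2)
d = generate_random_string_list(l,2)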

gene_pair = []
for i in range(n):
    for j in range(m):
        gene_pair.append([a[i],b[j],np.random.rand()]) # using list
        
feature_pair = []
for i in range(l):
    feature_pair.append([c[i],d[i]])
    

In [212]:
from operator import itemgetter
gene_pair_sorted = sorted(gene_pair, key=itemgetter(0,1)) # sort by column 0, then by column 1
feature_pair_sorted = sorted(feature_pair, key=itemgetter(0,1)) # sort by column 0, then by column 1

In [213]:
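# list1 is gene_pair_sorted, list2 is feature_pair_sorted.
# Returns two 0/1 lists, one flag per row of each input.
# The flag is also appended in place to each row of list1.
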
def compare_sorted_tuple_lists(list1, list2): 
    i, j = 0, 0
    list1_TF = []
    list2_TF = []

    while i < len(list1) and j < len(list2):
        if list1[i][0] == list2[j][0]:
            if list1[i][1] == list2[j][1]: ## inner if is for comparing b_i and d_j when a_i = c_j
                list1_TF.append(1) 
                list2_TF.append(1)
                list1[i].append(1)              
                i += 1
                j += 1
            elif list1[i][1] < list2[j][1]:
                list1_TF.append(0)                 
                list1[i].append(0)
                i += 1
            else:   
                list2_TF.append(0)
                j += 1           
        elif list1[i][0] < list2[j][0]:
            list1_TF.append(0)
            list1[i].append(0)
            i += 1
        else:
            list2_TF.append(0)
            j += 1

    # Once one list is exhausted, mark the remaining elements of the other as not found
    while i < len(list1):
        list1_TF.append(0)
        list1[i].append(0)
        i += 1
    while j < len(list2):
        list2_TF.append(0)
        j += 1

    return list1_TF, list2_TF

def count_ones(lst):
    # number of entries equal to 1
    return lst.count(1)

In [214]:
l_1, l_2 = compare_sorted_tuple_lists(gene_pair_sorted, feature_pair_sorted)
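n_ones = count_ones(l_1)
# approximate expected number of matches:
# P(d_k appears in b) * P(c_k appears in a) * l, with 26*26 possible two-letter names
expe_n_ones = (1-(1-1/26/26)**m) * (1-(1-1/26/26)**n) * l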

In [215]:
print(n_ones,end='\n')
print(expe_n_ones,end='\n')

65
59.66789623942363
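
The expected count `expe_n_ones` can be motivated as follows. This is a sketch that assumes every name is an independent, uniformly random two-letter string (so there are $26^2 = 676$ possible names) and that ignores repeated pairs. For a fixed feature pair $(c_k, d_k)$,
\begin{align}
P(c_k \in \mathbf{a}) &= 1 - \left(1 - \frac{1}{26^2}\right)^{n}, \\
P(d_k \in \mathbf{b}) &= 1 - \left(1 - \frac{1}{26^2}\right)^{m}.
\end{align}
Because `gene_pair` contains every combination $(a_i, b_j)$, the feature pair appears in it exactly when both events occur, and the two events are independent. Summing over the $l$ feature pairs gives
\begin{align}
E[\text{matches}] = l \left[1 - \left(1 - \frac{1}{26^2}\right)^{n}\right] \left[1 - \left(1 - \frac{1}{26^2}\right)^{m}\right] \approx 59.67
\end{align}
for $n = m = 1000$ and $l = 100$.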


In [216]:
indices_1 = [i for i, x in enumerate(l_1) if x == 1]
indices_2 = [i for i, x in enumerate(l_2) if x == 1]

In [217]:
for i in indices_1:
    print(i, gene_pair_sorted[i],end='\n')

27740 ['ar', 'rm', 0.17750833555058165, 1]
30656 ['au', 'il', 0.940533482688357, 1]
32240 ['aw', 'bz', 0.5051867843067446, 1]
36277 ['ay', 'hg', 0.1019962452480393, 1]
39518 ['ba', 'nj', 0.8602744266703226, 1]
53926 ['bp', 'yb', 0.27468266612356407, 1]
67219 ['cc', 'ft', 0.7196614389442924, 1]
73146 ['ci', 'ds', 0.5149956796878523, 1]
74004 ['cj', 'ab', 0.3948132043651701, 1]
96403 ['ct', 'uj', 0.5680612796645947, 1]
98860 ['cv', 'ye', 0.2531338783978351, 1]
105162 ['db', 'eb', 0.7150589158815855, 1]
120664 ['dl', 'qy', 0.4586994930380257, 1]
146298 ['eb', 'tn', 0.02358874289175672, 1]
168096 ['eq', 'aq', 0.7345215666718655, 1]
185088 ['ex', 'nl', 0.2748130396759977, 1]
197175 ['fk', 'en', 0.46417684855640406, 1]
198225 ['fm', 'fx', 0.7268219194354554, 1]
206216 ['ft', 'bo', 0.18502238921627046, 1]
210030 ['fv', 'ac', 0.7665557667492505, 1]
219538 ['fy', 'nw', 0.7538295418608685, 1]
228241 ['gd', 'gl', 0.804706452693114, 1]
233198 ['gf', 'fb', 0.48233153351505, 1]
289053 ['hj', 'be', 0

In [218]:
for i in indices_2:
    print(i, feature_pair_sorted[i],end='\n')

0 ['ar', 'rm']
2 ['au', 'il']
3 ['aw', 'bz']
4 ['ay', 'hg']
5 ['ba', 'nj']
8 ['bp', 'yb']
10 ['cc', 'ft']
11 ['ci', 'ds']
13 ['cj', 'ab']
14 ['ct', 'uj']
15 ['cv', 'ye']
16 ['db', 'eb']
18 ['dl', 'qy']
19 ['eb', 'tn']
20 ['eq', 'aq']
21 ['ex', 'nl']
24 ['fk', 'en']
25 ['fm', 'fx']
26 ['ft', 'bo']
27 ['fv', 'ac']
29 ['fy', 'nw']
31 ['gd', 'gl']
33 ['gf', 'fb']
34 ['hj', 'be']
36 ['jc', 'gr']
37 ['jd', 'yy']
39 ['jr', 'ci']
40 ['jx', 'ts']
41 ['ke', 'nc']
42 ['kf', 'xg']
43 ['ki', 'si']
46 ['kw', 'xn']
47 ['lh', 'ag']
48 ['lq', 'rz']
50 ['mo', 'pc']
51 ['mq', 'qh']
53 ['mw', 'pv']
56 ['oa', 'eq']
57 ['om', 'dz']
58 ['oq', 'ai']
59 ['os', 'ia']
61 ['pl', 'qw']
65 ['qf', 'oi']
66 ['qi', 'ur']
67 ['qm', 'md']
68 ['qt', 'oj']
69 ['qt', 'si']
70 ['qu', 'za']
71 ['rq', 'iq']
72 ['se', 'ge']
78 ['ug', 'lu']
79 ['up', 'ah']
82 ['vt', 'nl']
83 ['vy', 'qk']
86 ['wz', 'ib']
87 ['xf', 'jb']
88 ['xg', 'cm']
89 ['xg', 'po']
91 ['xx', 'gz']
92 ['xz', 'mj']
93 ['yj', 'xj']
95 ['yq', 'do']
96 ['yw', 'wg'

In [219]:
print(len(indices_1))
print(len(indices_2))

65
65
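
As an independent check on the merge-based labelling, the common pairs can also be found with a hash-based set intersection. This swaps in a different technique from the sorted merge above; `matching_pairs` and the toy data are hypothetical names introduced for illustration.

```python
def matching_pairs(pairs, features):
    """Pairs (first two columns) present in both lists, via set intersection."""
    pair_keys = {(row[0], row[1]) for row in pairs}
    feature_keys = {(row[0], row[1]) for row in features}
    return pair_keys & feature_keys

# Toy data, made up for illustration
toy_pairs = [["g1", "h1", 0.4], ["g1", "h2", 0.2], ["g2", "h0", 0.6], ["g2", "h1", 0.3]]
toy_features = [["g2", "h1"], ["g0", "h9"], ["g1", "h2"]]

common = matching_pairs(toy_pairs, toy_features)
print(sorted(common))
```

On the notebook data, `len(matching_pairs(gene_pair_sorted, feature_pair_sorted))` should agree with `n_ones` when neither list contains repeated pairs. A set keeps each distinct pair once, so the two counts can differ when pairs are repeated.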
