# Gene comparison

Task description: $\mathbf{a} = [a_1,a_2,a_3,\dots, a_n]$ and $\mathbf{b} = [b_1,b_2,b_3,\dots, b_m]$ are two lists of genes. The $a_i$ are distinct genes, and so are the $b_j$.
The Pearson correlation is computed for every pair $a_i b_j$. Note that the $a_i$ and $b_j$ are not necessarily sorted.
The data is stored in the file `gene_pair` in the format
\begin{align}
&a_1 ~~~ b_1 ~~~ 0.2\\
&a_1 ~~~ b_2 ~~~ 0.3\\
&a_1 ~~~ b_3 ~~~ 0.6\\
&\vdots ~~~~~ \vdots ~~~~~ \vdots
\end{align}
`gene_pair` has $nm$ rows and 3 columns. Now, we have a feature pair list `feature_pair` in the format
\begin{align}
& c_1 ~~~ d_1 \\
& c_2 ~~~ d_2 \\
& c_3 ~~~ d_3 \\
&\vdots ~~~~~ \vdots 
\end{align}
`feature_pair` has $l$ rows and 2 columns. 

We want to compare the gene pairs $(a_i, b_j)$ with the feature pairs $(c_k, d_k)$ and label a gene pair with $1$ if it also appears in `feature_pair`, and with $0$ otherwise. That is, we want to find all pairs that appear in both lists.
Before comparing, we sort both lists so that the comparison itself is a single linear pass, $O(nm + l)$; the sorting step costs $O(N \log N)$ for a list of $N$ rows. The sort is
```
from operator import itemgetter
gene_pair_sorted = sorted(gene_pair, key=itemgetter(0, 1))  # sort by column 0, then column 1
```
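
To make the two-key ordering concrete, here is a minimal sketch on made-up rows (the names `rows` and `rows_sorted` and the values are illustrative only, not taken from the data). It shows that `itemgetter(0, 1)` orders by the first column and breaks ties with the second, and that `sorted` returns a new list and leaves the input unchanged.

```python
from operator import itemgetter

# Hypothetical toy rows in the same [gene_a, gene_b, correlation] layout
rows = [["b", "y", 0.3], ["a", "z", 0.6], ["b", "x", 0.2], ["a", "w", 0.1]]

# itemgetter(0, 1) builds the key (row[0], row[1]); tuples compare left to right
rows_sorted = sorted(rows, key=itemgetter(0, 1))

for row in rows_sorted:
    print(row)
# ['a', 'w', 0.1]
# ['a', 'z', 0.6]
# ['b', 'x', 0.2]
# ['b', 'y', 0.3]
```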



Algorithm: 

1. Sort `gene_pair` by the first column $a_i$. Where the sorted values $a'_i$ contain ties, $a'_j=a'_{j+1}=\dots$, order the tied rows by the second column $b_i$. For example, if the $a_i$ are ages and the $b_i$ are heights, the order is determined first by age, and rows with the same age are ordered by height.
The sorted file `gene_pair_sorted` has the format,
\begin{align}
&a'_1  ~~~ b'_1 ~~~ 0.4\\
&a'_1  ~~~ b'_2 ~~~ 0.2\\
&a'_2  ~~~ b'_3 ~~~ 0.3\\
&a'_2  ~~~ b'_4 ~~~ 0.6\\
&\vdots ~~~~~~~~ \vdots ~~~~~~~ \vdots
\end{align}
where the $a'_i$ are in sorted order, and $b'_i<b'_j$ for $i<j$ whenever rows $i$ and $j$ share the same $a'$ value.  
2. Do the same as above on `feature_pair` to create `feature_pair_sorted`,
\begin{align}
&c'_1   ~~~ d'_1 \\
&c'_2   ~~~ d'_2 \\
&c'_3   ~~~ d'_3 \\
&\vdots ~~~~~~~~ \vdots 
\end{align}
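3. Walk through both sorted lists together. Compare $a'_i$ with $c'_j$; if they are equal, compare $b'_i$ with $d'_j$. If both are equal, label $a'_i b'_i$ with 1 and advance in both lists; otherwise label the row that sorts first with 0 and advance only in its list.   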
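
The three steps can be sketched on a tiny example as follows. This is a simplified illustration under the assumption that rows are unique within each list; the function name `label_common_pairs` and the toy gene names are hypothetical. It compares rows as `(first, second)` tuples, which is equivalent to comparing the first column and then the second.

```python
def label_common_pairs(pairs, features):
    """Return a 0/1 label per row of `pairs`; both inputs must be sorted."""
    labels = [0] * len(pairs)
    i = j = 0
    while i < len(pairs) and j < len(features):
        key_p = (pairs[i][0], pairs[i][1])
        key_f = (features[j][0], features[j][1])
        if key_p == key_f:      # pair found in both lists
            labels[i] = 1
            i += 1
            j += 1
        elif key_p < key_f:     # gene pair sorts first: not in features
            i += 1
        else:                   # feature pair sorts first: skip it
            j += 1
    return labels

# Hypothetical toy data, sorted by (column 0, column 1) before merging
pairs = sorted(
    [["g2", "h1", 0.3], ["g1", "h2", 0.2], ["g1", "h1", 0.4], ["g2", "h2", 0.6]],
    key=lambda r: (r[0], r[1]),
)
features = sorted([["g2", "h1"], ["g1", "h2"], ["g3", "h1"]])

labels = label_common_pairs(pairs, features)
print(labels)  # [0, 1, 1, 0]
```

Each pointer only moves forward, so the loop runs at most `len(pairs) + len(features)` times.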

In [210]:
import random
import string
import time
import numpy as np
# Function to generate a random lowercase string of 2 to 'max_length' characters
def generate_random_string(max_length=6):
    length = random.randint(2, max_length)
    return ''.join(random.choices(string.ascii_lowercase, k=length))

# Function to generate a list of 'n' random strings
def generate_random_string_list(n, max_length=6):
    return [generate_random_string(max_length) for _ in range(n)]

In [232]:
n = 1000
m = 1000
l = 100
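# max_length=2 makes every name exactly 2 letters, so only 26*26 = 676 names exist
# and repeats are unavoidable within a and b (1000 names each), unlike distinct genes
a = generate_random_string_list(n, 2)
b = generate_random_string_list(m, 2)
c = generate_random_string_list(l, 2)
d = generate_random_string_list(l, 2)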

gene_pair = []
for i in range(n):
    for j in range(m):
        gene_pair.append([a[i],b[j],np.random.rand()]) # using list
        
feature_pair = []
for i in range(l):
    feature_pair.append([c[i],d[i]])
    
###

    

In [233]:
from operator import itemgetter
gene_pair_sorted = sorted(gene_pair, key=itemgetter(0,1)) # sort by column 0, then column 1
feature_pair_sorted = sorted(feature_pair, key=itemgetter(0,1)) # sort by column 0, then column 1

In [234]:
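# list1 is gene_pair_sorted, list2 is feature_pair_sorted.
# Both must be sorted by (column 0, column 1). A 0/1 label is appended in place
# to each row of list1; on a match both pointers advance, so rows are assumed unique.

def compare_sorted_tuple_lists(list1, list2): 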
    i, j = 0, 0
    list1_TF = []
    list2_TF = []

    while i < len(list1) and j < len(list2):
        if list1[i][0] == list2[j][0]:
            if list1[i][1] == list2[j][1]: ## inner if is for comparing b_i and d_j when a_i = c_j
                list1_TF.append(1) 
                list2_TF.append(1)
                list1[i].append(1)              
                i += 1
                j += 1
            elif list1[i][1] < list2[j][1]:
                list1_TF.append(0)                 
                list1[i].append(0)
                i += 1
            else:   
                list2_TF.append(0)
                j += 1           
        elif list1[i][0] < list2[j][0]:
            list1_TF.append(0)
            list1[i].append(0)
            i += 1
        else:
            list2_TF.append(0)
            j += 1

    # Once one list is exhausted, mark the remaining elements of the other as not found
    while i < len(list1):
        list1_TF.append(0)
        list1[i].append(0)
        i += 1
    while j < len(list2):
        list2_TF.append(0)
        j += 1

    return list1_TF, list2_TF

def count_ones(lst):
    # Number of entries equal to 1
    return sum(1 for item in lst if item == 1)
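
As a sanity check on the merge-based labels, a hash-set lookup is an alternative technique (it is not the sorted merge used in `compare_sorted_tuple_lists`). The sketch below uses a hypothetical helper `label_with_set` and made-up toy rows. It needs no sorting and returns the labels instead of appending them to the rows.

```python
def label_with_set(pairs, features):
    """Label each row of `pairs` with 1 if its first two columns occur in `features`."""
    # Tuples are hashable, so each (first, second) pair can be a set member
    feature_set = {(row[0], row[1]) for row in features}
    return [1 if (row[0], row[1]) in feature_set else 0 for row in pairs]

# Hypothetical toy data; neither list needs to be sorted
toy_pairs = [["g1", "h1", 0.4], ["g1", "h2", 0.2], ["g2", "h1", 0.3]]
toy_features = [["g2", "h1"], ["g1", "h2"], ["g9", "h9"]]

toy_labels = label_with_set(toy_pairs, toy_features)
print(toy_labels)  # [0, 1, 1]
```

The two approaches agree when rows are unique. With repeated rows they can differ: the set lookup labels every copy with 1, whereas the merge advances both pointers on a match and so pairs each feature row with at most one gene row.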

In [235]:
l_1, l_2 = compare_sorted_tuple_lists(gene_pair_sorted, feature_pair_sorted)
n_ones = count_ones(l_1)
# Expected number of feature pairs (c, d) with c in a and d in b, for uniform 2-letter names
expe_n_ones = (1-(1-1/26/26)**m) * (1-(1-1/26/26)**n) * l

In [236]:
print(n_ones,end='\n')
print(expe_n_ones,end='\n')

55
59.66789623942363
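
The value of `expe_n_ones` can be motivated as follows, assuming every name is drawn independently and uniformly from the $26^2 = 676$ two-letter strings (which is what `max_length=2` produces). Because `gene_pair` contains every combination of a name in $\mathbf{a}$ with a name in $\mathbf{b}$, a feature pair $(c, d)$ is found exactly when $c$ occurs in $\mathbf{a}$ and $d$ occurs in $\mathbf{b}$:
\begin{align}
&P(c \in \mathbf{a}) = 1 - \left(1 - \frac{1}{26^2}\right)^{n}, \qquad P(d \in \mathbf{b}) = 1 - \left(1 - \frac{1}{26^2}\right)^{m} \\
&E[\text{matches}] = l \, P(c \in \mathbf{a}) \, P(d \in \mathbf{b}) \approx 100 \times 0.77245^2 \approx 59.67
\end{align}
for $n = m = 1000$ and $l = 100$. The observed count is a single random draw, so it need not equal this expectation.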


In [237]:
indices_1 = [i for i, x in enumerate(l_1) if x == 1]
indices_2 = [i for i, x in enumerate(l_2) if x == 1]

In [238]:
for i in indices_1:
    print(i, gene_pair_sorted[i],end='\n')

43888 ['ba', 'yn', 0.8034929133686216, 1]
55372 ['bi', 'jz', 0.5820611193292777, 1]
60702 ['bo', 'sj', 0.22524971816313188, 1]
83569 ['cb', 'nk', 0.21860718684720315, 1]
100341 ['cn', 'jh', 0.4607315216625282, 1]
114418 ['cv', 'fy', 0.2422922541178404, 1]
118202 ['da', 'cn', 0.3530769571803287, 1]
120882 ['dc', 'wz', 0.4938022253714286, 1]
123486 ['de', 'tk', 0.03604268877500183, 1]
135846 ['dm', 'vt', 0.45111740483701557, 1]
146348 ['dw', 'ev', 0.003959325434139682, 1]
178492 ['en', 'wm', 0.5382827087481779, 1]
239924 ['fz', 'lx', 0.42314734672759624, 1]
271455 ['gs', 'lt', 0.3218175173212986, 1]
285864 ['hg', 'lb', 0.031451288010275635, 1]
287258 ['hh', 'hl', 0.654535258717868, 1]
308826 ['hv', 'kr', 0.2422963654143896, 1]
330615 ['if', 'fu', 0.26605749967110937, 1]
348732 ['io', 'tc', 0.9893208156818367, 1]
369648 ['ja', 'gf', 0.1314484614896012, 1]
403126 ['jw', 'oi', 0.2103027094513178, 1]
422076 ['kj', 'ca', 0.12313688829257852, 1]
437304 ['kx', 'os', 0.33168055393404305, 1]
4490

In [239]:
for i in indices_2:
    print(i, feature_pair_sorted[i],end='\n')

0 ['ba', 'yn']
1 ['bi', 'jz']
2 ['bo', 'sj']
4 ['cb', 'nk']
7 ['cn', 'jh']
8 ['cv', 'fy']
12 ['da', 'cn']
13 ['dc', 'wz']
14 ['de', 'tk']
15 ['dm', 'vt']
18 ['dw', 'ev']
20 ['en', 'wm']
24 ['fz', 'lx']
26 ['gs', 'lt']
27 ['hg', 'lb']
28 ['hh', 'hl']
29 ['hv', 'kr']
31 ['if', 'fu']
32 ['io', 'tc']
34 ['ja', 'gf']
38 ['jw', 'oi']
39 ['kj', 'ca']
42 ['kx', 'os']
44 ['lk', 'ap']
47 ['mc', 'ju']
48 ['me', 'do']
49 ['me', 'xz']
50 ['mg', 'ra']
52 ['mu', 'uc']
53 ['nj', 'nb']
55 ['od', 'ea']
56 ['of', 'ag']
57 ['or', 'kf']
58 ['pb', 'hs']
60 ['pe', 'cs']
63 ['pv', 'ab']
64 ['qb', 'dx']
66 ['rb', 'pj']
68 ['rq', 'ba']
69 ['rv', 'of']
72 ['so', 'ji']
73 ['sy', 'ir']
74 ['tm', 'va']
76 ['tu', 'fj']
78 ['uf', 'cl']
79 ['un', 'ix']
83 ['we', 'fi']
88 ['yg', 'to']
90 ['yl', 'ht']
91 ['ym', 'oh']
92 ['yw', 'dl']
94 ['yz', 'uq']
95 ['zc', 'bp']
97 ['zq', 'of']
98 ['zr', 'ta']


In [240]:
print(len(indices_1))
print(len(indices_2))

55
55
