# Gene comparison

Task description: $\mathbf{a} = [a_1,a_2,a_3,..., a_n]$ and $\mathbf{b} = [b_1,b_2,b_3,..., b_m]$ are two lists of genes. The $a_i$ are distinct genes, as are the $b_j$.
The Pearson correlation is computed for every pair $a_i b_j$. The results are stored in the file `gene_pair` in the format
\begin{align}
&a_1 ~~~ b_1 ~~~ 0.2\\
&a_1 ~~~ b_2 ~~~ 0.3\\
&a_1 ~~~ b_3 ~~~ 0.6\\
&\vdots ~~~~~ \vdots ~~~~~ \vdots
\end{align}
`gene_pair` has $nm$ rows and 3 columns. We also have a list of feature pairs, `feature_pair`, in the format
\begin{align}
& c_1 ~~~ d_1 \\
& c_2 ~~~ d_2 \\
& c_3 ~~~ d_3 \\
&\vdots ~~~~~ \vdots 
\end{align}
`feature_pair` has $l$ rows and 2 columns.
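
For concreteness, the two tables can be held in memory as lists of rows. The sketch below assumes whitespace-separated text, which the description above does not specify. The gene names and the helper `parse_rows` are hypothetical.

```python
import io


def parse_rows(handle, has_value):
    """Parse whitespace-separated rows into lists.

    With has_value=True the third field is converted to float.
    """
    rows = []
    for line in handle:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if has_value:
            rows.append([fields[0], fields[1], float(fields[2])])
        else:
            rows.append([fields[0], fields[1]])
    return rows


# Stand-ins for the files `gene_pair` and `feature_pair` (made-up content).
gene_pair_text = io.StringIO("g1 h1 0.2\ng1 h2 0.3\ng1 h3 0.6\n")
feature_pair_text = io.StringIO("g1 h2\ng7 h9\n")

gene_pair_rows = parse_rows(gene_pair_text, has_value=True)
feature_pair_rows = parse_rows(feature_pair_text, has_value=False)
print(gene_pair_rows[0])       # ['g1', 'h1', 0.2]
print(len(feature_pair_rows))  # 2
```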

Algorithm: 

1. Sort `gene_pair` by the concatenated key $a_i b_j$ (and therefore by $a_i$ first) and create the file `gene_pair_sorted` in the format
\begin{align}
&a'_1 b'_1  ~~~ a'_1 ~~~ 0.4\\
&a'_2 b'_2  ~~~ a'_2 ~~~ 0.2\\
&a'_3 b'_3  ~~~ a'_3 ~~~ 0.3\\
&\vdots ~~~~~~~~ \vdots ~~~~~~~ \vdots
\end{align}
where the $a'_i$ are in sorted order and $b'_i$ is the gene originally paired with $a'_i$. In this form, values of $a'_i$ and $b'_i$ can repeat across rows.
2. Do the same for `feature_pair` to create `feature_pair_sorted`,
\begin{align}
&c'_1 d'_1  ~~~ c'_1 \\
&c'_2 d'_2  ~~~ c'_2 \\
&c'_3 d'_3  ~~~ c'_3 \\
&\vdots ~~~~~~~~ \vdots 
\end{align}
3. Compare $a'_i b'_i$ with $c'_j d'_j$. If they are equal, also compare $a'_i$ with $c'_j$. If those are equal as well, label $a'_i b'_i$ with 1; otherwise label it 0.
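
The three steps can be sketched on a toy input as follows. This is a minimal sketch under stated assumptions, not a definitive implementation: the function name `label_pairs` and the toy rows are hypothetical, rows are assumed to be `[key, first_gene, ...]` lists, and keys are assumed to be unique within each list.

```python
def label_pairs(gene_rows, feature_rows):
    """Return one 0/1 label per gene row using a single merge pass.

    Assumes both lists are sorted by row[0] (the concatenated key),
    row[1] holds the first gene of the pair, and keys are unique.
    """
    labels = [0] * len(gene_rows)
    i = j = 0
    while i < len(gene_rows) and j < len(feature_rows):
        g_key, f_key = gene_rows[i][0], feature_rows[j][0]
        if g_key == f_key:
            # Step 3: the keys agree, so also compare the first genes.
            if gene_rows[i][1] == feature_rows[j][1]:
                labels[i] = 1
            i += 1
            j += 1
        elif g_key < f_key:
            i += 1  # gene key has no partner; its label stays 0
        else:
            j += 1  # feature key is absent from the gene list
    return labels


# Toy data (hypothetical). "ab"+"c" and "a"+"bc" share the key "abc".
gene_rows = sorted([["abc", "ab", 0.2], ["abd", "ab", 0.3], ["xyz", "xy", 0.6]])
feature_rows = sorted([["abc", "a"], ["abd", "ab"], ["qq", "q"]])
print(label_pairs(gene_rows, feature_rows))  # [0, 1, 0]
```

The first toy row shows why the second comparison in step 3 can matter: when gene names vary in length, two different pairs can concatenate to the same key.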

In [59]:
import random
import string
import time
import numpy as np
# Function to generate a random lowercase string of between 2 and 'max_length' characters
def generate_random_string(max_length=6):
    length = random.randint(2, max_length)
    return ''.join(random.choices(string.ascii_lowercase, k=length))

# Function to generate a list of 'n' random strings
def generate_random_string_list(n, max_length=6):
    return [generate_random_string(max_length) for _ in range(n)]

In [60]:
n = 1000
m = 1000
l = 100
# max_length=2 makes every name exactly two letters long; names may repeat
a = generate_random_string_list(n, 2)
b = generate_random_string_list(m, 2)
c = generate_random_string_list(l, 2)
d = generate_random_string_list(l, 2)

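gene_pair = []
for i in range(n):
    for j in range(m):
        # row: [pair key a_i+b_j, first gene a_i, random stand-in for the correlation]
        gene_pair.append([a[i]+b[j], a[i], np.random.rand()])

feature_pair = []
for i in range(l):
    # row: [pair key c_i+d_i, first gene c_i]
    feature_pair.append([c[i]+d[i], c[i]])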
    

In [61]:
from operator import itemgetter
gene_pair_sorted = sorted(gene_pair, key=itemgetter(0))  # sort by column 0, the concatenated pair key
feature_pair_sorted = sorted(feature_pair, key=itemgetter(0))  # sort by column 0, the concatenated pair key

In [62]:
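# list1 is gene_pair_sorted, list2 is feature_pair_sorted; both must be sorted by column 0.
# A 0/1 label is appended to each row of list1 in place, and the labels are also returned.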

def compare_sorted_tuple_lists(list1, list2): 
    i, j = 0, 0
    result = []

    while i < len(list1) and j < len(list2):
        if list1[i][0] == list2[j][0]:
            result.append(1)  # String exists in both lists
            list1[i].append(1)
            i += 1
            j += 1
        elif list1[i][0] < list2[j][0]:
            result.append(0)
            list1[i].append(0)
            i += 1
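        else:
            # The feature key is smaller: advance j only. Row i is not decided yet,
            # so no label is appended here (otherwise the row would get extra labels).
            j += 1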

    # If list2 is exhausted first, mark the remaining rows of list1 as not found
    while i < len(list1):
        result.append(0)
        list1[i].append(0)
        i += 1

    return result

def count_ones(lst):
    # Count how many labels are equal to 1
    return lst.count(1)
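
A set lookup offers an independent cross-check of the merge labels. Note that this swaps in a different technique (hashing rather than a merge over sorted lists), and it labels every gene row whose key occurs in the feature list, so the two approaches can disagree when keys repeat. The name `label_by_set` and the toy rows are hypothetical.

```python
def label_by_set(gene_rows, feature_rows):
    """Label each gene row 1 if its key (row[0]) occurs among the feature keys."""
    feature_keys = {row[0] for row in feature_rows}
    return [1 if row[0] in feature_keys else 0 for row in gene_rows]


# Toy rows in the same [key, first_gene, value] layout (made-up values).
gene_rows = [["aaab", "aa", 0.4], ["aaac", "aa", 0.2], ["bbcc", "bb", 0.3]]
feature_rows = [["aaac", "aa"], ["zzzz", "zz"]]
print(label_by_set(gene_rows, feature_rows))  # [0, 1, 0]
```

Unlike the merge, this version needs neither list to be sorted, at the cost of holding all feature keys in memory.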

In [63]:
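test = compare_sorted_tuple_lists(gene_pair_sorted, feature_pair_sorted)
n_ones = count_ones(test)
# Expected number of matches: a feature pair matches when its first gene occurs in a
# and its second gene occurs in b; every name is uniform over the 26*26 two-letter strings
expe_n_ones = (1-(1-1/26/26)**n) * (1-(1-1/26/26)**m) * l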

In [64]:
print(n_ones)
print(expe_n_ones)

61
59.66789623942363
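
The expected count follows from linearity of expectation, assuming every gene name is an independent, uniform draw from the $26^2 = 676$ two-letter strings, which is how the test data here is generated:
\begin{align}
P(c_i \in \mathbf{a}) &= 1 - \left(1 - \tfrac{1}{676}\right)^{n}, \\
P(d_i \in \mathbf{b}) &= 1 - \left(1 - \tfrac{1}{676}\right)^{m}, \\
E[\text{matches}] &\approx l \, P(c_i \in \mathbf{a}) \, P(d_i \in \mathbf{b}) \approx 59.67 \quad (n = m = 1000,\ l = 100).
\end{align}
The result is approximate because it ignores repeated keys within `feature_pair`, which the merge may count only once.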


In [73]:
gene_pair_sorted[1:5]

[['aaac', 'aa', 0.1780737247960692, 0],
 ['aaad', 'aa', 0.7824382525403691, 0],
 ['aaae', 'aa', 0.3417914304198547, 0],
 ['aaae', 'aa', 0.0069615199309635, 0]]

In [79]:
yes_no_column = [row[3] for row in gene_pair_sorted]

In [81]:
yes_no_column.index(1)

6379

In [91]:
indices = [i for i, x in enumerate(yes_no_column) if x == 1]

In [100]:
for i in indices:
    print(i, gene_pair_sorted[i])

6379 ['acuq', 'ac', 0.33571295626967756, 1]
36236 ['axdc', 'ax', 0.5275113411826904, 1]
54821 ['bmvl', 'bm', 0.5970314064647441, 1]
95760 ['cujz', 'cu', 0.9500523223386794, 1]
108570 ['ddfa', 'dd', 0.8321777197755661, 1]
115714 ['djwi', 'dj', 0.7872436141402266, 1]
115922 ['djza', 'dj', 0.6560156869712689, 1]
128576 ['dwfc', 'dw', 0.31582011955977063, 1]
168952 ['evyu', 'ev', 0.10694892966815928, 1]
203268 ['fxgz', 'fx', 0.3546345698965675, 1]
214456 ['gggb', 'gg', 0.8981131249478187, 1]
222808 ['glxj', 'gl', 0.8190836818762561, 1]
226614 ['goqc', 'go', 0.4637200945782244, 1]
228383 ['gpmj', 'gp', 0.5530056088564907, 1]
264709 ['hqsl', 'hq', 0.15081292077728303, 1]
269306 ['hvef', 'hv', 0.3520941195616737, 1]
275033 ['hyat', 'hy', 0.3808912341969214, 1]
284456 ['igmf', 'ig', 0.03747600368330506, 1]
290683 ['ikrs', 'ik', 0.91631444582021, 1]
300869 ['ivwp', 'iv', 0.1221202138016173, 1]
301522 ['iwes', 'iw', 0.5820804053924716, 1]
303679 ['iwxe', 'iw', 0.7644747266980029, 1]
340344 ['jrp