# Annoy Analysis:
Annoy (Approximate Nearest Neighbors Oh Yeah) is a library for fast approximate nearest-neighbor search over large collections of vectors, as used in machine learning and information retrieval.

## How Annoy Works:

Indexing: Each item is added to the index as a numerical vector of fixed dimension.

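Tree Division: Annoy builds multiple search trees. Each tree recursively partitions the space with random hyperplanes: at every node, two points are sampled from the current subset and the subset is split by the hyperplane equidistant from them. Partitioning the space into regions is what allows faster neighbor searches.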
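A single tree split can be pictured as follows. This is a toy sketch in plain Python, not Annoy's actual implementation: it assumes a Euclidean-style split, and the function name `split_by_random_hyperplane` is hypothetical. It picks two points at random and separates the rest by which side of the equidistant hyperplane they fall on.

```python
import random

def split_by_random_hyperplane(points, rng):
    """Split points by the hyperplane equidistant from two randomly chosen anchors."""
    a, b = rng.sample(points, 2)
    # The hyperplane's normal is a - b and it passes through the midpoint of a and b.
    normal = [x - y for x, y in zip(a, b)]
    midpoint = [(x + y) / 2 for x, y in zip(a, b)]
    left, right = [], []
    for p in points:
        # The sign of the projection tells which side of the hyperplane p lies on.
        side = sum(n * (x - m) for n, x, m in zip(normal, p, midpoint))
        (left if side >= 0 else right).append(p)
    return left, right

rng = random.Random(0)
points = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(20)]  # made-up data
left, right = split_by_random_hyperplane(points, rng)
print(len(left), len(right))
```

Applying the same split recursively to each side until the groups are small gives one tree. Repeating the process with different random choices gives a forest.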

Search Algorithm:
To answer a query, Annoy traverses the constructed trees, collects candidate items from the regions the query falls into, and ranks those candidates by distance.
The search is approximate: it trades exactness for speed, so it may miss some true nearest neighbors but usually returns very close ones. Building more trees generally improves accuracy at the cost of a larger index.
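To make "nearest" concrete, the sketch below computes the distance that Annoy's documentation gives for the `'angular'` metric, sqrt(2 * (1 - cos(u, v))), and uses it for an exact brute-force search. The helper names and the four example vectors are illustrative assumptions; a brute-force search like this is only a baseline for checking an approximate index on small data.

```python
import math

def angular_distance(u, v):
    """sqrt(2 * (1 - cos(u, v))): Euclidean distance between the normalized vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    cos = dot / (norm_u * norm_v)
    # Clamp at zero to guard against tiny floating-point overshoot of cos above 1.
    return math.sqrt(max(0.0, 2.0 * (1.0 - cos)))

def exact_neighbors(vectors, query_id, k):
    """Ids of the k vectors closest to vectors[query_id], excluding the query itself."""
    others = [i for i in range(len(vectors)) if i != query_id]
    others.sort(key=lambda i: angular_distance(vectors[query_id], vectors[i]))
    return others[:k]

vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
print(exact_neighbors(vectors, 0, 2))  # [1, 2]
```

Comparing the ids returned by an approximate index against such an exact list is one simple way to estimate how many true neighbors the approximation recovers.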

In [6]:
!pip install annoy

Collecting annoy
  Downloading annoy-1.17.3.tar.gz (647 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: annoy
  Building wheel for annoy (setup.py) ... done
  Created wheel for annoy: filename=annoy-1.17.3-cp310-cp310-linux_x86_64.whl size=552448 sha256=ab261789336281aba0524ada12a8054b4ef5fc099dc0a10d71c39cc51f9c7eee
  Stored in directory: /root/.cache/pip/wheels/64/8a/da/f714bcf46c5efdcfcac0559e63370c21abe961c48e3992465a
Successfully built annoy
Installing collected packages: annoy
Successfully installed annoy-1.17.3


In [1]:
# Import Libs
import pandas as pd
from annoy import AnnoyIndex

In [2]:
df = pd.read_excel('/content/rice_word2vec_3mer_dataset.xlsx')

In [3]:
df.head(2)

Unnamed: 0,circName,stress,tissue,chr,start,end,strand,start_anno,wc_3mer_1,wc_3mer_2,...,wc_3mer_55,wc_3mer_56,wc_3mer_57,wc_3mer_58,wc_3mer_59,wc_3mer_60,wc_3mer_61,wc_3mer_62,wc_3mer_63,wc_3mer_64
0,osa-circ1-OS01T0723400,-,multipleTissue,1,30167620,30167771,+,"exon,CDS",-3.738741,8.954664,...,-25.19884,2.249939,-4.701558,-3.633969,5.212902,-6.579618,11.032136,-8.533007,-8.32791,18.141064
1,osa-circ2-OS03T0223400,-,multipleTissue,3,6461672,6462146,-,"exon,CDS",31.042567,37.664923,...,-27.264821,51.35945,-73.188428,-1.662928,56.56528,28.610637,80.878607,38.090248,-30.54351,28.859375


# Cold stress dataset

In [4]:
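# Keep control ('-') and cold rows; relabel '-' as 'control' in the stress column only,
# so that '-' values in other columns (e.g. strand) are left untouched.
df_cold = df.query('stress == "-" or stress == "cold"').replace({'stress': {'-': 'control'}})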

In [5]:
df_vec = df_cold.drop(['tissue','chr','start','end','strand','start_anno'], axis=1)
df_vec

Unnamed: 0,circName,stress,wc_3mer_1,wc_3mer_2,wc_3mer_3,wc_3mer_4,wc_3mer_5,wc_3mer_6,wc_3mer_7,wc_3mer_8,...,wc_3mer_55,wc_3mer_56,wc_3mer_57,wc_3mer_58,wc_3mer_59,wc_3mer_60,wc_3mer_61,wc_3mer_62,wc_3mer_63,wc_3mer_64
0,osa-circ1-OS01T0723400,control,-3.738741,8.954664,0.703204,-7.427550,9.996643,-0.181532,18.598464,-34.975266,...,-25.198840,2.249939,-4.701558,-3.633969,5.212902,-6.579618,11.032136,-8.533007,-8.327910,18.141064
1,osa-circ2-OS03T0223400,control,31.042567,37.664923,95.821671,-13.835744,-44.356329,15.314592,45.651881,-14.309955,...,-27.264821,51.359450,-73.188428,-1.662928,56.565280,28.610637,80.878607,38.090248,-30.543510,28.859375
2,osa-circ3-OS11T0210300,control,-31.307338,14.101101,6.012345,-38.559886,47.126226,-4.402491,-2.144210,-110.080545,...,-69.366830,1.113416,-42.689580,-6.150814,23.203639,-7.541412,-11.218056,-40.615067,16.391329,1.324206
3,osa-circ4-OS02T0200900,control,15.773287,55.951050,71.105993,11.420289,-15.831468,-30.626971,37.164693,-23.572454,...,42.391724,50.721551,-67.643923,-24.403869,41.171276,45.067768,9.704687,9.653121,7.829608,-11.254961
4,osa-circ5-OS05T0494800,control,-44.379635,-33.514299,9.594330,-28.682953,23.765476,24.619851,-26.153289,-72.517970,...,-75.243917,-36.308267,7.583992,-27.560256,-24.934966,-15.739278,9.237141,-16.380438,-18.462203,7.718794
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
63043,osa-circ63044--,control,1.187543,33.638602,46.761491,33.606569,-28.457826,-18.402121,19.510404,26.983153,...,39.581378,5.624058,-18.996478,-32.586105,-8.206610,38.324260,2.731934,28.181426,3.800943,-12.217631
63044,osa-circ63045--,control,-80.622757,6.318129,-2.366743,16.879751,-30.253832,-43.023873,-19.773850,-1.661586,...,22.650734,-26.579853,-16.502095,-137.357917,-76.090281,117.878950,-128.438327,-9.823361,4.179330,-43.925242
63045,osa-circ63046--,control,-74.383266,-15.111880,-5.831784,-13.084491,-20.186291,-32.085257,-11.033929,-44.055941,...,-16.439233,-7.612099,-29.730788,-98.997580,-40.800622,88.914954,-93.826379,-28.685970,-15.167838,-35.795096
63046,osa-circ63047--,control,15.836808,18.807946,3.763431,5.834760,-12.692529,13.718565,11.533184,-19.348844,...,23.810030,14.158860,-25.809598,-1.792872,-4.828570,6.402574,10.700880,3.991035,-6.085688,4.940037


In [6]:
df_vec['stress'].value_counts()

stress,count
control,50187
cold,5724


In [7]:
dimension = 64

In [9]:
cold_sequences = df_vec[df_vec['stress'] == 'cold']
control_sequences = df_vec[df_vec['stress'] == 'control']

# creating sample
control_sample = control_sequences.sample(n=5724, random_state=42)

combined_sequences = pd.concat([cold_sequences, control_sample])

t = AnnoyIndex(dimension, 'angular')
for i in range(len(combined_sequences)):
    t.add_item(i, combined_sequences.iloc[i, 2:2 + dimension].values)
t.build(10)

similarities = {}
n_vizinhos = 5  # number of neighbors

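# Analyzing only 'cold' sequences
# NOTE: estresse_index is an index label inherited from df, while len(df_vec) is a
# row count, so this guard skips every 'cold' row whose label is >= len(df_vec).
for estresse_index in cold_sequences.index:
    if estresse_index < len(df_vec):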
        adjusted_index = combined_sequences.index.get_loc(estresse_index)

        #Search neighbors
        vizinhos = t.get_nns_by_item(adjusted_index, n_vizinhos + 1)
        valid_vizinhos = []

        for i in vizinhos:
            if combined_sequences.iloc[i]['circName'] != combined_sequences.iloc[adjusted_index]['circName']:
                valid_vizinhos.append(i)
            if len(valid_vizinhos) == n_vizinhos:
                break

        similarities[estresse_index] = valid_vizinhos

# Results
cold_count = 0
control_count = 0
control_neighbors = set()

with open('neighbors_results.txt', 'w') as f:
  for estresse_index, vizinhos in similarities.items():
      cold_sequence = combined_sequences.iloc[combined_sequences.index.get_loc(estresse_index)]['circName']
      similar_sequences = [combined_sequences.iloc[i]['circName'] for i in vizinhos]
      similar_conditions = [combined_sequences.iloc[i]['stress'] for i in vizinhos]

      for i in vizinhos:
          if combined_sequences.iloc[i]['stress'] == 'control':
              control_neighbors.add(i)

      cold_count += similar_conditions.count('cold')

      print(f'Seq of "cold" group: {cold_sequence}')
      f.write(f'Seq of "cold" group: {cold_sequence}\n')

      for seq, cond in zip(similar_sequences, similar_conditions):
          print(f'Nearest neighbors: {seq}, Condition: {cond}')
          f.write(f'Nearest neighbors: {seq}, Condition: {cond}\n')

# control_count is the number of distinct "control" sequences seen as neighbors,
# whereas cold_count counts every neighbor occurrence.
control_count = len(control_neighbors)

# Similarity analysis
total_vizinhos = len(similarities) * n_vizinhos
num_cold_sequences = len(cold_sequences)

print(f'Number of iterated cold stress sequences: {num_cold_sequences}')
print(f'Total number of neighbors in the "cold" group: {cold_count}')
print(f'Total number of neighbors in the "control" group: {control_count}')
print(f'Percentage of neighbors in the "cold" group: {cold_count / total_vizinhos * 100:.2f}%')
print(f'Percentage of neighbors in the "control" group: {control_count / total_vizinhos * 100:.2f}%')

Streaming output truncated to the last 5000 lines.
Nearest neighbors: osa-circ58565-OS01T0655500, Condition: cold
Nearest neighbors: osa-circ43075-OS03T0171900, Condition: cold
Nearest neighbors: osa-circ58134-OS02T0121000, Condition: cold
Seq of "cold" group: osa-circ43625-OS09T0280300
Nearest neighbors: osa-circ41297-OS01T0775300, Condition: cold
Nearest neighbors: osa-circ58905-OS10T0566300, Condition: cold
Nearest neighbors: osa-circ8532-OS02T0664100, Condition: control
Nearest neighbors: osa-circ18570-OS01T0323000, Condition: control
Nearest neighbors: osa-circ59208-OS06T0495700, Condition: cold
Seq of "cold" group: osa-circ43626-OS09T0327400
Nearest neighbors: osa-circ42002-OS01T0695300, Condition: cold
Nearest neighbors: osa-circ41650-OS01T0841600, Condition: cold
Nearest neighbors: osa-circ10595-OS03T0646900, Condition: control
Nearest neighbors: osa-circ41912-OS04T0252200, Condition: cold
Nearest neighbors: osa-circ42992-OS08T0474700, Condition: cold
Seq 

### Analysis of Results

- Number of reference sequences in the "cold" group: 5724

- Total neighbors from the "cold" group: 7788

Across all queried "cold" sequences, 7788 of the returned neighbors also belong to the "cold" group. This figure counts every occurrence, so a sequence that is a neighbor of several queries is counted once per query.

- Total neighbors from the "control" group: 3963

This figure is computed differently: `control_neighbors` is a set, so 3963 is the number of distinct "control" sequences that appear as a neighbor at least once, not the number of neighbor occurrences. Of the 5724 sampled "control" sequences, 3963 are therefore among the nearest neighbors of at least one queried "cold" sequence.

- Percentage of neighbors in the "control" group: 23.46%

- Percentage of neighbors in the "cold" group: 46.10%

Both percentages are taken over `len(similarities) * n_vizinhos` neighbor slots. The two values are consistent with about 16895 slots, that is, roughly 3379 queried sequences rather than 5724, which matches the `estresse_index < len(df_vec)` guard skipping "cold" rows with large index labels. Because one percentage counts occurrences and the other counts distinct sequences, they do not sum to 100% and should not be compared directly.

Since the index holds equal numbers of "cold" and "control" sequences, about half of the neighbors of a "cold" sequence would be "cold" by chance. A share of 46.10% is therefore not, on its own, evidence that "cold" sequences are closer to each other than to "control" sequences. A like-for-like count of "control" occurrences is needed before drawing that conclusion.
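The arithmetic behind a like-for-like comparison can be sketched as follows. This is a minimal illustration with made-up labels, not a rerun of the analysis. The helper names `neighbor_label_shares` and `chance_share` are hypothetical: the first counts every neighbor occurrence for every label, and the second gives the share expected if neighbors were drawn at random from the index.

```python
from collections import Counter

def neighbor_label_shares(neighbor_labels):
    """Share of neighbor slots per label, counting every occurrence."""
    counts = Counter(label for labels in neighbor_labels.values() for label in labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

def chance_share(n_same, n_other):
    """Expected same-group share if neighbors were picked at random (query excluded)."""
    return (n_same - 1) / (n_same + n_other - 1)

# Hypothetical toy input: query name -> stress labels of its neighbors.
toy = {
    "q1": ["cold", "cold", "control"],
    "q2": ["control", "control", "cold"],
}
shares = neighbor_label_shares(toy)
print(shares)                    # shares sum to 1 across labels
print(chance_share(5724, 5724))  # baseline for a balanced index
```

Counted this way, the shares for the two groups add up to 100%, and each can be compared against the chance baseline for the index composition.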

# Drought stress dataset

In [10]:
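# Keep control ('-') and drought rows; relabel '-' as 'control' in the stress column only.
df_drought = df.query('stress == "-" or stress == "drought"').replace({'stress': {'-': 'control'}})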

In [12]:
df_vec = df_drought.drop(['tissue','chr','start','end','strand','start_anno'], axis=1)
df_vec

Unnamed: 0,circName,stress,wc_3mer_1,wc_3mer_2,wc_3mer_3,wc_3mer_4,wc_3mer_5,wc_3mer_6,wc_3mer_7,wc_3mer_8,...,wc_3mer_55,wc_3mer_56,wc_3mer_57,wc_3mer_58,wc_3mer_59,wc_3mer_60,wc_3mer_61,wc_3mer_62,wc_3mer_63,wc_3mer_64
0,osa-circ1-OS01T0723400,control,-3.738741,8.954664,0.703204,-7.427550,9.996643,-0.181532,18.598464,-34.975266,...,-25.198840,2.249939,-4.701558,-3.633969,5.212902,-6.579618,11.032136,-8.533007,-8.327910,18.141064
1,osa-circ2-OS03T0223400,control,31.042567,37.664923,95.821671,-13.835744,-44.356329,15.314592,45.651881,-14.309955,...,-27.264821,51.359450,-73.188428,-1.662928,56.565280,28.610637,80.878607,38.090248,-30.543510,28.859375
2,osa-circ3-OS11T0210300,control,-31.307338,14.101101,6.012345,-38.559886,47.126226,-4.402491,-2.144210,-110.080545,...,-69.366830,1.113416,-42.689580,-6.150814,23.203639,-7.541412,-11.218056,-40.615067,16.391329,1.324206
3,osa-circ4-OS02T0200900,control,15.773287,55.951050,71.105993,11.420289,-15.831468,-30.626971,37.164693,-23.572454,...,42.391724,50.721551,-67.643923,-24.403869,41.171276,45.067768,9.704687,9.653121,7.829608,-11.254961
4,osa-circ5-OS05T0494800,control,-44.379635,-33.514299,9.594330,-28.682953,23.765476,24.619851,-26.153289,-72.517970,...,-75.243917,-36.308267,7.583992,-27.560256,-24.934966,-15.739278,9.237141,-16.380438,-18.462203,7.718794
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
63043,osa-circ63044--,control,1.187543,33.638602,46.761491,33.606569,-28.457826,-18.402121,19.510404,26.983153,...,39.581378,5.624058,-18.996478,-32.586105,-8.206610,38.324260,2.731934,28.181426,3.800943,-12.217631
63044,osa-circ63045--,control,-80.622757,6.318129,-2.366743,16.879751,-30.253832,-43.023873,-19.773850,-1.661586,...,22.650734,-26.579853,-16.502095,-137.357917,-76.090281,117.878950,-128.438327,-9.823361,4.179330,-43.925242
63045,osa-circ63046--,control,-74.383266,-15.111880,-5.831784,-13.084491,-20.186291,-32.085257,-11.033929,-44.055941,...,-16.439233,-7.612099,-29.730788,-98.997580,-40.800622,88.914954,-93.826379,-28.685970,-15.167838,-35.795096
63046,osa-circ63047--,control,15.836808,18.807946,3.763431,5.834760,-12.692529,13.718565,11.533184,-19.348844,...,23.810030,14.158860,-25.809598,-1.792872,-4.828570,6.402574,10.700880,3.991035,-6.085688,4.940037


In [13]:
df_vec['stress'].value_counts()

stress,count
control,50187
drought,824


In [19]:
drought_sequences = df_vec[df_vec['stress'] == 'drought']
control_sequences = df_vec[df_vec['stress'] == 'control']

# creating sample
control_sample = control_sequences.sample(n=824, random_state=42)

combined_sequences = pd.concat([drought_sequences, control_sample])

t = AnnoyIndex(dimension, 'angular')
for i in range(len(combined_sequences)):
    t.add_item(i, combined_sequences.iloc[i, 2:2 + dimension].values)
t.build(10)

similarities = {}
n_vizinhos = 5  # number of neighbors

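# Analyzing only 'drought' sequences
# NOTE: estresse_index is an index label inherited from df, while len(df_vec) is a
# row count, so this guard skips any 'drought' row whose label is >= len(df_vec).
for estresse_index in drought_sequences.index:
    if estresse_index < len(df_vec):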
        adjusted_index = combined_sequences.index.get_loc(estresse_index)

        #Search neighbors
        vizinhos = t.get_nns_by_item(adjusted_index, n_vizinhos + 1)
        valid_vizinhos = []

        for i in vizinhos:
            if combined_sequences.iloc[i]['circName'] != combined_sequences.iloc[adjusted_index]['circName']:
                valid_vizinhos.append(i)
            if len(valid_vizinhos) == n_vizinhos:
                break

        similarities[estresse_index] = valid_vizinhos

# Results
drought_count = 0
control_count = 0
control_neighbors = set()

with open('neighbors_results_drought.txt', 'w') as f:
  for estresse_index, vizinhos in similarities.items():
      drought_sequence = combined_sequences.iloc[combined_sequences.index.get_loc(estresse_index)]['circName']
      similar_sequences = [combined_sequences.iloc[i]['circName'] for i in vizinhos]
      similar_conditions = [combined_sequences.iloc[i]['stress'] for i in vizinhos]

      for i in vizinhos:
          if combined_sequences.iloc[i]['stress'] == 'control':
              control_neighbors.add(i)

      drought_count += similar_conditions.count('drought')

      print(f'Seq of "drought" group: {drought_sequence}')
      f.write(f'Seq of "drought" group: {drought_sequence}\n')

      for seq, cond in zip(similar_sequences, similar_conditions):
          print(f'Nearest neighbors: {seq}, Condition: {cond}')
          f.write(f'Nearest neighbors: {seq}, Condition: {cond}\n')

# control_count is the number of distinct "control" sequences seen as neighbors,
# whereas drought_count counts every neighbor occurrence.
control_count = len(control_neighbors)

# Similarity analysis
total_vizinhos = len(similarities) * n_vizinhos
num_drought_sequences = len(drought_sequences)

print(f'Number of iterated drought stress sequences: {num_drought_sequences}')
print(f'Total number of neighbors in the "drought" group: {drought_count}')
print(f'Total number of neighbors in the "control" group: {control_count}')
print(f'Percentage of neighbors in the "drought" group: {drought_count / total_vizinhos * 100:.2f}%')
print(f'Percentage of neighbors in the "control" group: {control_count / total_vizinhos * 100:.2f}%')

Seq of "drought" group: osa-circ5574-OS05T0563400
Nearest neighbors: osa-circ12645-OS02T0735200, Condition: drought
Nearest neighbors: osa-circ12543-OS02T0507400, Condition: drought
Nearest neighbors: osa-circ13459-OS05T0455600, Condition: drought
Nearest neighbors: osa-circ53481-OS02T0704600, Condition: control
Nearest neighbors: osa-circ42170-OS03T0278200, Condition: control
Seq of "drought" group: osa-circ5588-OS01T0730500
Nearest neighbors: osa-circ12729-OS01T0495900, Condition: drought
Nearest neighbors: osa-circ29614-OS06T0660800, Condition: control
Nearest neighbors: osa-circ27442-OS05T0551000, Condition: control
Nearest neighbors: osa-circ2196-OS01T0924900, Condition: control
Nearest neighbors: osa-circ36664-OS02T0623500, Condition: control
Seq of "drought" group: osa-circ5598-OS03T0117100
Nearest neighbors: osa-circ13495-OS09T0509200, Condition: drought
Nearest neighbors: osa-circ13448-OS02T0169800, Condition: drought
Nearest neighbors: osa-circ32196-OS03T0616700, Condition: c