# Annoy Analysis:
Annoy (Approximate Nearest Neighbors Oh Yeah) is an efficient technique for finding nearest neighbors in large datasets, particularly in applications involving vectors, such as machine learning and information retrieval.

## How Annoy Works:

Indexing: Data (typically vectors) are added to an index. Each vector is a numerical representation of an item.

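Tree Division: Annoy builds multiple search trees (a forest). Each tree recursively splits the space with random hyperplanes: at every node, two points are sampled at random and the hyperplane equidistant from them divides the data in two. The idea is to partition the space into regions, allowing for faster neighbor searches; because every tree uses different random splits, the trees complement one another.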
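To make the splitting step concrete, here is a minimal pure-Python sketch of a single random-hyperplane split. It is a simplified illustration, not Annoy's actual implementation: the function name `split_by_random_hyperplane` and the toy data are assumptions made for this example, and a real tree would repeat the split recursively on each side.

```python
import random

def split_by_random_hyperplane(points, rng):
    """Partition points by the hyperplane equidistant from two randomly chosen points."""
    a, b = rng.sample(points, 2)
    # The normal vector points from b to a; the plane passes through their midpoint.
    normal = [ai - bi for ai, bi in zip(a, b)]
    midpoint = [(ai + bi) / 2 for ai, bi in zip(a, b)]
    offset = sum(n * m for n, m in zip(normal, midpoint))

    left, right = [], []
    for p in points:
        # Positive side: closer to a. Non-positive side: closer to b.
        side = sum(n * pi for n, pi in zip(normal, p)) - offset
        (left if side > 0 else right).append(p)
    return (a, b), left, right

# Toy data (hypothetical): 20 random 4-dimensional vectors.
rng = random.Random(0)
points = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(20)]
(anchor_a, anchor_b), left, right = split_by_random_hyperplane(points, rng)
print(len(left), len(right))
```

Each point ends up on exactly one side, and the two sampled points always land on opposite sides, which is what lets repeated splits carve the space into progressively smaller regions.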

Search Algorithm:
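When querying a vector, Annoy traverses the constructed trees to collect candidate points from the regions closest to the query, then ranks those candidates by their actual distance to the query.
The algorithm trades exactness for speed: it might not always find the exact nearest neighbors, but it usually finds very close ones at a fraction of the cost of an exhaustive search. Building more trees generally improves accuracy at the cost of a larger index.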
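One way to see what "approximate" means in practice is to compare Annoy's answer against an exhaustive search on a small subset. The sketch below is pure Python and uses the formula Annoy documents for its `'angular'` metric, sqrt(2 * (1 - cos(u, v))). The helper names, the toy vectors, and the `hypothetical_approx` list are assumptions for illustration only; in this notebook the approximate ids would come from `get_nns_by_item`.

```python
import math

def angular_distance(u, v):
    """sqrt(2 * (1 - cos(u, v))), the distance Annoy documents for metric='angular'."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cosine = dot / (norm_u * norm_v)
    # Clamp to guard against tiny floating-point overshoot outside [-1, 1].
    cosine = max(-1.0, min(1.0, cosine))
    return math.sqrt(2.0 * (1.0 - cosine))

def exact_neighbors(vectors, query_id, k):
    """Brute-force the k nearest item ids to query_id, excluding the query itself."""
    others = [i for i in range(len(vectors)) if i != query_id]
    others.sort(key=lambda i: angular_distance(vectors[query_id], vectors[i]))
    return others[:k]

def recall(approx_ids, exact_ids):
    """Fraction of the exact neighbors that the approximate search also returned."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)

# Toy vectors (hypothetical data, not taken from the maize dataset).
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0], [0.7, 0.7]]
exact = exact_neighbors(vectors, query_id=0, k=2)
hypothetical_approx = [1, 2]  # stand-in for an approximate result
score = recall(hypothetical_approx, exact)
print(exact, score)
```

A recall well below 1.0 on a sample of queries would suggest building more trees or passing a larger `search_k` to the query, both of which trade speed for accuracy.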

In [6]:
!pip install annoy

Collecting annoy
  Downloading annoy-1.17.3.tar.gz (647 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: annoy
  Building wheel for annoy (setup.py) ... done
  Created wheel for annoy: filename=annoy-1.17.3-cp310-cp310-linux_x86_64.whl size=552448 sha256=ab261789336281aba0524ada12a8054b4ef5fc099dc0a10d71c39cc51f9c7eee
  Stored in directory: /root/.cache/pip/wheels/64/8a/da/f714bcf46c5efdcfcac0559e63370c21abe961c48e3992465a
Successfully built annoy
Installing collected packages: annoy
Successfully installed annoy-1.17.3


In [1]:
# Import Libs
import pandas as pd
from annoy import AnnoyIndex

In [2]:
df = pd.read_excel('/content/maize_word2vec_3mer_dataset.xlsx')

In [3]:
df.head(2)

Unnamed: 0,circName,stress,tissue,chr,start,end,strand,start_anno,wc_3mer_1,wc_3mer_2,...,wc_3mer_55,wc_3mer_56,wc_3mer_57,wc_3mer_58,wc_3mer_59,wc_3mer_60,wc_3mer_61,wc_3mer_62,wc_3mer_63,wc_3mer_64
0,zma-circ1-Zm00001d002325,-,multipleTissue,2,10317309,10317467,-,"exon,CDS",-2.819397,-26.071322,...,-1.649599,31.57955,8.262636,-8.476553,-21.353326,-5.830368,5.679497,-0.869754,-16.990953,-28.061553
1,zma-circ2-Zm00001d038675,-,multipleTissue,6,162376852,162378246,+,"exon,CDS",38.139316,-33.984278,...,-276.418821,-19.796359,-14.883323,-210.64013,-108.069105,-196.095334,-8.178139,196.293086,-111.923786,-176.260473


In [4]:
dimension = 64

# Drought stress dataset

In [5]:
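df_drought = df.query('stress == "-" or stress == "drought"').copy()
# Relabel only the stress column: a frame-wide replace would also rewrite '-' in the strand column
df_drought['stress'] = df_drought['stress'].replace('-', 'control')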

In [6]:
df_vec = df_drought.drop(['tissue','chr','start','end','strand','start_anno'], axis=1)
df_vec

Unnamed: 0,circName,stress,wc_3mer_1,wc_3mer_2,wc_3mer_3,wc_3mer_4,wc_3mer_5,wc_3mer_6,wc_3mer_7,wc_3mer_8,...,wc_3mer_55,wc_3mer_56,wc_3mer_57,wc_3mer_58,wc_3mer_59,wc_3mer_60,wc_3mer_61,wc_3mer_62,wc_3mer_63,wc_3mer_64
0,zma-circ1-Zm00001d002325,control,-2.819397,-26.071322,-2.050199,11.781961,-9.510820,20.928187,10.537300,8.309442,...,-1.649599,31.579550,8.262636,-8.476553,-21.353326,-5.830368,5.679497,-0.869754,-16.990953,-28.061553
1,zma-circ2-Zm00001d038675,control,38.139316,-33.984278,-402.364554,-30.165603,47.523170,189.462963,-281.399404,95.374081,...,-276.418821,-19.796359,-14.883323,-210.640130,-108.069105,-196.095334,-8.178139,196.293086,-111.923786,-176.260473
2,zma-circ3-Zm00001d038163,control,-13.009970,-12.990245,-26.831700,57.567522,-63.762736,-13.593141,39.764057,-25.189784,...,41.822434,8.608161,-6.166661,31.397092,23.119942,14.479912,7.400303,-15.383993,25.829718,-35.978360
3,zma-circ4-Zm00001d049552,control,-7.712050,-3.377060,10.117356,10.020552,2.411303,19.037348,-2.969932,3.283452,...,-28.654313,9.646898,16.798625,4.431155,-22.220375,-12.460058,8.829987,17.305939,-19.056328,-6.713643
4,zma-circ5-Zm00001d032567,control,5.437028,-13.467944,-87.025836,3.275219,17.536336,70.588946,-70.014532,31.610944,...,-27.083839,-10.496368,-3.498599,-47.558427,34.946736,-30.822783,-12.506350,11.305700,-11.058980,-78.504665
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
38780,zma-circ38781--,control,-13.972886,6.107950,-25.147395,73.045038,-87.992530,-31.104236,68.620450,-49.629747,...,98.429620,7.452953,-15.737801,35.920804,61.204687,48.885356,-12.690994,-36.205666,44.758745,-17.722664
38781,zma-circ38782--,control,7.036017,-10.090989,-36.701810,24.329014,-40.458631,17.339536,-3.631819,-21.419614,...,24.790249,6.276686,-12.864078,-1.256257,42.178863,-4.007750,-10.729215,-5.812331,7.679475,-18.434670
38782,zma-circ38783--,control,-18.672884,43.651235,-33.115869,118.402164,-140.489068,-27.546500,68.986703,-92.037108,...,172.084031,18.400802,-26.153603,36.820269,153.651124,101.715267,-52.398818,-47.916956,38.571516,-17.400118
38783,zma-circ38784--,control,7.627427,-54.741109,-242.209928,153.955496,-204.570579,-41.848951,36.354824,-20.722182,...,87.106894,30.368255,-34.900771,53.611954,44.205673,-65.699111,30.057831,104.595740,81.229813,-104.068594


In [7]:
df_vec['stress'].value_counts()

stress,count
control,20809
drought,11206


In [8]:
drought_sequences = df_vec[df_vec['stress'] == 'drought']
control_sequences = df_vec[df_vec['stress'] == 'control']

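# Downsample the control group to the size of the drought group (11206) so both classes are balanced
control_sample = control_sequences.sample(n=len(drought_sequences), random_state=42)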

combined_sequences = pd.concat([drought_sequences, control_sample])

t = AnnoyIndex(dimension, 'angular')
for i in range(len(combined_sequences)):
    t.add_item(i, combined_sequences.iloc[i, 2:2 + dimension].values)
t.build(10)

similarities = {}
n_vizinhos = 5  # number of neighbors

# Analyzing only 'drought' sequences
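for estresse_index in drought_sequences.index:
    # Map the DataFrame label to its row position, which is the Annoy item id.
    # (The previous check `estresse_index < len(df_vec)` compared a label with a
    # row count and silently skipped drought rows with large labels.)
    adjusted_index = combined_sequences.index.get_loc(estresse_index)

    # Search neighbors: ask for one extra, because the query item itself
    # is usually among the results
    vizinhos = t.get_nns_by_item(adjusted_index, n_vizinhos + 1)
    valid_vizinhos = []

    for i in vizinhos:
        if combined_sequences.iloc[i]['circName'] != combined_sequences.iloc[adjusted_index]['circName']:
            valid_vizinhos.append(i)
        if len(valid_vizinhos) == n_vizinhos:
            break

    similarities[estresse_index] = valid_vizinhos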

# Results
drought_count = 0
control_count = 0
control_neighbors = set()

with open('neighbors_results_drought_maize.txt', 'w') as f:
  for estresse_index, vizinhos in similarities.items():
      drought_sequence = combined_sequences.iloc[combined_sequences.index.get_loc(estresse_index)]['circName']
      similar_sequences = [combined_sequences.iloc[i]['circName'] for i in vizinhos]
      similar_conditions = [combined_sequences.iloc[i]['stress'] for i in vizinhos]

      for i in vizinhos:
          if combined_sequences.iloc[i]['stress'] == 'control':
              control_neighbors.add(i)

      drought_count += similar_conditions.count('drought')

      print(f'Seq of "drought" group: {drought_sequence}')
      f.write(f'Seq of "drought" group: {drought_sequence}\n')

      for seq, cond in zip(similar_sequences, similar_conditions):
          print(f'Nearest neighbors: {seq}, Condition: {cond}')
          f.write(f'Nearest neighbors: {seq}, Condition: {cond}\n')

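# Count every control neighbor occurrence so the figure is comparable to
# drought_count; len(control_neighbors) is the number of distinct control neighbors
control_count = sum(
    combined_sequences.iloc[i]['stress'] == 'control'
    for vizinhos in similarities.values() for i in vizinhos
)

# Similarity analysis
total_vizinhos = sum(len(vizinhos) for vizinhos in similarities.values())
num_drought_sequences = len(similarities)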

print(f'Number of iterated drought stress sequences: {num_drought_sequences}')
print(f'Total number of neighbors in the "drought" group: {drought_count}')
print(f'Total number of neighbors in the "control" group: {control_count}')
print(f'Percentage of neighbors in the "drought" group: {drought_count / total_vizinhos * 100:.2f}%')
print(f'Percentage of neighbors in the "control" group: {control_count / total_vizinhos * 100:.2f}%')

Streaming output was truncated to the last 5000 lines.
Nearest neighbors: zma-circ34102-Zm00001d039341, Condition: drought
Nearest neighbors: zma-circ21388-Zm00001d028010, Condition: control
Nearest neighbors: zma-circ35918-Zm00001d042569, Condition: control
Seq of "drought" group: zma-circ20113-Zm00001d026000
Nearest neighbors: zma-circ12963-Zm00001d004159, Condition: drought
Nearest neighbors: zma-circ32455-Zm00001d018481, Condition: drought
Nearest neighbors: zma-circ15184-Zm00001d040340, Condition: drought
Nearest neighbors: zma-circ35266-Zm00001d017997, Condition: drought
Nearest neighbors: zma-circ15171-Zm00001d039534, Condition: drought
Seq of "drought" group: zma-circ20114-Zm00001d026031
Nearest neighbors: zma-circ33171-Zm00001d017072, Condition: drought
Nearest neighbors: zma-circ35550-Zm00001d007599, Condition: control
Nearest neighbors: zma-circ33434-Zm00001d017872, Condition: control
Nearest neighbors: zma-circ37795-Zm00001d018642, Condition: drought
Neare