# Long Question Coreference Adaptation Exploration
A framework for performing coreference resolution over long contexts.

[Paper](https://arxiv.org/abs/2410.01671)

## Basic Setup

In [1]:
# Automatic reloading
%load_ext autoreload
%autoreload 2

In [2]:
import sys
import os

# os.path.abspath('') resolves to the current working directory
# (e.g., the 'notebooks' directory); dirname() then returns its parent
current_dir = os.path.dirname(os.path.abspath(''))

# Navigate one more level up to reach the project directory
project_dir = os.path.abspath(os.path.join(current_dir, '..'))

# Add the directory to sys.path
sys.path.append(project_dir)
os.chdir(project_dir)
os.getcwd()

'/Users/kangjunong/rag'

In [3]:
from src.components.coreference_models import MaverickCoreferenceModel
from src.components.coreference_framework import LongQuestionCoreferenceAdaptation

model = MaverickCoreferenceModel(device="cpu")
lqca = LongQuestionCoreferenceAdaptation(
    coreference_model=model,
    # max_partition_token_size=300,
    # sliding_window_token_interval=50,
)

sapienzanlp/maverick-mes-ontonotes loading




### Issue when working with the Maverick model
The input text must not contain \n. Newlines cause the character indices to be miscounted once the text is split into sub-documents. Because unique mentions are identified by their character indices, the same mention can then be counted twice. The temporary fix is to pass input_text as a single-line string.

In more detail, Python's string splitting counts each \n as a single character, but the model does not treat \n as a token at all, so every \n shifts the character indices forward.
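The temporary fix described above can be sketched as a small preprocessing step. The helper name `to_single_line` is hypothetical and not part of the framework; it only illustrates one way to collapse newlines before the text reaches the model.

```python
import re


def to_single_line(text: str) -> str:
    """Collapse every newline (and the whitespace around it) into one space.

    Hypothetical helper, not part of the framework. Character offsets
    computed afterwards refer to the flattened text, not the original.
    """
    return re.sub(r"\s*\n\s*", " ", text).strip()


raw_text = "Alice went to the park.\nShe arrived early.\n"
single_line_text = to_single_line(raw_text)
print(single_line_text)
```

Any mention indices produced downstream then index into the flattened string, so the flattened string (not the raw one) should be the text that is annotated later.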

In [4]:
input_text = """Alice went to the park to meet her friend Bob. She arrived early and decided to sit on a bench near the fountain. Bob, who was running late, sent her a message saying he'd be there in ten minutes. While waiting, Alice noticed a dog chasing its tail. The playful animal entertained her until Bob finally arrived. Once he reached the park, Bob apologized for the delay and handed Alice a book. The novel, which he had borrowed from the library, was one Alice had wanted to read for weeks. She thanked him and immediately began flipping through its pages. As they chatted, the dog from earlier ran up to them, wagging its tail enthusiastically. Bob remarked how energetic the dog was, and Alice agreed, saying, "I wonder if it belongs to anyone here." After spending an hour at the park, Alice and Bob decided to grab some coffee. They left the park, leaving the dog behind, and walked to the nearest café. Inside, Bob ordered a cappuccino while Alice chose a latte. The barista, noticing their cheerful conversation, smiled and handed them their drinks with a friendly comment, "Enjoy your day!" As they sipped their coffee, Alice mentioned a trip she was planning. She told Bob about the beautiful beaches she wanted to visit and how her brother had recommended the destination. Bob expressed interest in joining her if she didn’t mind. Alice laughed and said, "Let me check with my brother. He’s the one organizing everything." """

sub_documents = lqca.get_sub_documents(full_document_text=input_text)
sub_documents

[SubDocument(content="Alice went to the park to meet her friend Bob. She arrived early and decided to sit on a bench near the fountain. Bob, who was running late, sent her a message saying he'd be there in ten minutes. While waiting, Alice noticed a dog chasing its tail. The playful animal entertained her until Bob finally arrived. Once he reached the park, Bob apologized for the delay and handed Alice a book. The novel, which he had borrowed from the library, was one Alice had wanted", start_char_idx=0, end_char_idx=466),
 SubDocument(content=' noticed a dog chasing its tail. The playful animal entertained her until Bob finally arrived. Once he reached the park, Bob apologized for the delay and handed Alice a book. The novel, which he had borrowed from the library, was one Alice had wanted to read for weeks. She thanked him and immediately began flipping through its pages. As they chatted, the dog from earlier ran up to them, wagging its tail enthusiastically. Bob remarked how energet
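The overlapping sub-documents above suggest a sliding window over tokens. The following is a minimal sketch of that idea, not the framework's implementation: it assumes whitespace tokenization, an exclusive end offset, and that the interval acts as the stride between window starts. The function name and its parameters are illustrative only.

```python
import re
from typing import List, Tuple


def partition_with_sliding_window(
    text: str, max_tokens: int = 300, stride: int = 50
) -> List[Tuple[str, int, int]]:
    """Return (content, start_char, end_char) windows over whitespace tokens.

    Assumptions: tokens are whitespace-separated, end_char is exclusive,
    and consecutive windows start `stride` tokens apart.
    """
    token_spans = [m.span() for m in re.finditer(r"\S+", text)]
    windows = []
    first = 0
    while first < len(token_spans):
        last = min(first + max_tokens, len(token_spans)) - 1
        start_char = token_spans[first][0]
        end_char = token_spans[last][1]
        windows.append((text[start_char:end_char], start_char, end_char))
        if last == len(token_spans) - 1:
            break  # the final token is covered; stop sliding
        first += stride
    return windows


demo_windows = partition_with_sliding_window("a b c d e f g", max_tokens=4, stride=2)
for content, start, end in demo_windows:
    print(repr(content), start, end)
```

Keeping the character offsets alongside each window is what allows mentions found in different windows to be matched back to the same position in the full document.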

In [316]:
clusters_per_subdocument = lqca.get_coreference_clusters_per_sub_document(sub_documents)

In [317]:
mentions = lqca.get_unique_mentions(clusters_per_sub_document=clusters_per_subdocument)
mentions

{Mention(char_idx=(0, 4), content='Alice'),
 Mention(char_idx=(14, 21), content='the park'),
 Mention(char_idx=(31, 33), content='her'),
 Mention(char_idx=(31, 44), content='her friend Bob'),
 Mention(char_idx=(47, 49), content='She'),
 Mention(char_idx=(114, 138), content='Bob , who was running late'),
 Mention(char_idx=(146, 148), content='her'),
 Mention(char_idx=(167, 168), content='he'),
 Mention(char_idx=(212, 216), content='Alice'),
 Mention(char_idx=(226, 230), content='a dog'),
 Mention(char_idx=(240, 242), content='its'),
 Mention(char_idx=(250, 267), content='The playful animal'),
 Mention(char_idx=(281, 283), content='her'),
 Mention(char_idx=(291, 293), content='Bob'),
 Mention(char_idx=(317, 318), content='he'),
 Mention(char_idx=(328, 335), content='the park'),
 Mention(char_idx=(338, 340), content='Bob'),
 Mention(char_idx=(378, 382), content='Alice'),
 Mention(char_idx=(384, 389), content='a book'),
 Mention(char_idx=(392, 440), content='The novel , which he had borrow

In [318]:
unique_mentions_per_subdocument = lqca.get_unique_mentions_per_subdocument(sub_documents=sub_documents, unique_mentions=mentions)
unique_mentions_per_subdocument

[[Mention(char_idx=(384, 389), content='a book'),
  Mention(char_idx=(409, 410), content='he'),
  Mention(char_idx=(14, 21), content='the park'),
  Mention(char_idx=(392, 440), content='The novel , which he had borrowed from the library'),
  Mention(char_idx=(167, 168), content='he'),
  Mention(char_idx=(487, 489), content='She'),
  Mention(char_idx=(0, 4), content='Alice'),
  Mention(char_idx=(338, 340), content='Bob'),
  Mention(char_idx=(240, 242), content='its'),
  Mention(char_idx=(378, 382), content='Alice'),
  Mention(char_idx=(451, 455), content='Alice'),
  Mention(char_idx=(31, 44), content='her friend Bob'),
  Mention(char_idx=(146, 148), content='her'),
  Mention(char_idx=(499, 499), content='h'),
  Mention(char_idx=(212, 216), content='Alice'),
  Mention(char_idx=(250, 267), content='The playful animal'),
  Mention(char_idx=(291, 293), content='Bob'),
  Mention(char_idx=(281, 283), content='her'),
  Mention(char_idx=(328, 335), content='the park'),
  Mention(char_idx=(226, 

In [319]:
lookup = lqca.build_weight_function_lookup(sub_documents=sub_documents, coreference_clusters_from_sub_documents=clusters_per_subdocument, unique_mentions=mentions)
lookup

{(Mention(char_idx=(384, 389), content='a book'),
  Mention(char_idx=(409, 410), content='he')): {'s': 0, 't': 2, 'd': 0.0},
 (Mention(char_idx=(14, 21), content='the park'),
  Mention(char_idx=(384, 389), content='a book')): {'s': 0, 't': 1, 'd': 0.0},
 (Mention(char_idx=(384, 389), content='a book'),
  Mention(char_idx=(392, 440), content='The novel , which he had borrowed from the library')): {'s': 2,
  't': 0,
  'd': 1.0},
 (Mention(char_idx=(167, 168), content='he'),
  Mention(char_idx=(384, 389), content='a book')): {'s': 0, 't': 1, 'd': 0.0},
 (Mention(char_idx=(384, 389), content='a book'),
  Mention(char_idx=(487, 489), content='She')): {'s': 0, 't': 2, 'd': 0.0},
 (Mention(char_idx=(0, 4), content='Alice'),
  Mention(char_idx=(384, 389), content='a book')): {'s': 0, 't': 1, 'd': 0.0},
 (Mention(char_idx=(338, 340), content='Bob'),
  Mention(char_idx=(384, 389), content='a book')): {'s': 0, 't': 2, 'd': 0.0},
 (Mention(char_idx=(240, 242), content='its'),
  Mention(char_idx=(3
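The entries visible above are consistent with d = s / (s + t): the pair with s = 2 and t = 0 has d = 1.0, and the pairs with s = 0 have d = 0.0. As a sketch under that assumption, the weight for one pair could be computed as below. Reading s as "sub-documents where both mentions share a cluster" and t as "sub-documents where both appear in different clusters" is an interpretation, not something the output confirms, and `pair_weight` is a hypothetical name.

```python
def pair_weight(s: int, t: int) -> dict:
    """Build one lookup entry, assuming d = s / (s + t).

    Assumption: s and t are non-negative counts; when both are zero,
    d falls back to 0.0 to avoid dividing by zero.
    """
    total = s + t
    d = s / total if total else 0.0
    return {"s": s, "t": t, "d": d}


print(pair_weight(2, 0))
print(pair_weight(0, 2))
```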

In [321]:
distance_graph, clusters = lqca.calculate_distance_graph(unique_mentions=mentions, L=3, lookup=lookup)

In [342]:
distance_graph

{(Mention(char_idx=(384, 389), content='a book'),
  Mention(char_idx=(384, 389), content='a book')): 1,
 (Mention(char_idx=(384, 389), content='a book'),
  Mention(char_idx=(1205, 1207), content='she')): 0,
 (Mention(char_idx=(384, 389), content='a book'),
  Mention(char_idx=(1336, 1340), content='Alice')): 0,
 (Mention(char_idx=(384, 389), content='a book'),
  Mention(char_idx=(1173, 1175), content='Bob')): 0,
 (Mention(char_idx=(384, 389), content='a book'),
  Mention(char_idx=(409, 410), content='he')): 0,
 (Mention(char_idx=(384, 389), content='a book'),
  Mention(char_idx=(1233, 1243), content='her brother')): 0,
 (Mention(char_idx=(384, 389), content='a book'),
  Mention(char_idx=(669, 675), content='the dog')): 0,
 (Mention(char_idx=(384, 389), content='a book'),
  Mention(char_idx=(986, 990), content='their')): 0,
 (Mention(char_idx=(384, 389), content='a book'),
  Mention(char_idx=(1033, 1036), content='them')): 0,
 (Mention(char_idx=(384, 389), content='a book'),
  Mention(ch

In [322]:
unique_mentions_not_pronouns = lqca.get_non_pronoun_unique_mentions(unique_mentions=mentions, full_document_text=input_text)
unique_mentions_not_pronouns

[Mention(char_idx=(0, 4), content='Alice'),
 Mention(char_idx=(14, 21), content='the park'),
 Mention(char_idx=(212, 216), content='Alice'),
 Mention(char_idx=(226, 230), content='a dog'),
 Mention(char_idx=(250, 267), content='The playful animal'),
 Mention(char_idx=(291, 293), content='Bob'),
 Mention(char_idx=(328, 335), content='the park'),
 Mention(char_idx=(338, 340), content='Bob'),
 Mention(char_idx=(378, 382), content='Alice'),
 Mention(char_idx=(384, 389), content='a book'),
 Mention(char_idx=(451, 455), content='Alice'),
 Mention(char_idx=(570, 589), content='the dog from earlier'),
 Mention(char_idx=(642, 644), content='Bob'),
 Mention(char_idx=(669, 675), content='the dog'),
 Mention(char_idx=(686, 690), content='Alice'),
 Mention(char_idx=(775, 782), content='the park'),
 Mention(char_idx=(785, 789), content='Alice'),
 Mention(char_idx=(785, 797), content='Alice and Bob'),
 Mention(char_idx=(795, 797), content='Bob'),
 Mention(char_idx=(838, 845), content='the park'),
 Me

### Negative Examples
Mention(char_idx=(392, 440), content='The novel , which he had borrowed from the library') is classified as a pronoun because the mention contains the word "he". As a result, it is absent from the non-pronoun mentions listed above.
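The false positive can be reproduced with a simplified check. The sketch below is illustrative only: the word list and both function names are assumptions, and the framework's actual pronoun detection may work differently (for example, via part-of-speech tags). It contrasts a "contains a pronoun" test with a stricter "is exactly a pronoun" test.

```python
# Illustrative, incomplete list of English pronouns (an assumption).
PRONOUNS = {
    "i", "me", "my", "we", "us", "our", "you", "your",
    "he", "him", "his", "she", "her", "hers",
    "it", "its", "they", "them", "their", "theirs",
}


def contains_pronoun(mention_text: str) -> bool:
    """True if any whitespace-separated token is a pronoun."""
    return any(token.lower() in PRONOUNS for token in mention_text.split())


def is_pronoun(mention_text: str) -> bool:
    """True only if the whole mention is a single pronoun."""
    return mention_text.strip().lower() in PRONOUNS


novel = "The novel , which he had borrowed from the library"
print(contains_pronoun(novel), is_pronoun(novel))
print(contains_pronoun("She"), is_pronoun("She"))
```

Under the stricter test, the noun phrase would remain eligible as a non-pronoun mention, while single-word mentions such as "She" would still be filtered out.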

In [323]:
cluster_representatives = lqca.get_cluster_representatives(clusters=clusters, unique_mentions_not_pronouns=unique_mentions_not_pronouns)
cluster_representatives

['Alice and Bob',
 'a book',
 'Alice',
 'Bob',
 'the park',
 'the dog',
 'The playful animal',
 None]

In [329]:
mentions_to_representatives = lqca.assign_mentions_to_cluster_representatives(clusters=clusters, cluster_representatives=cluster_representatives)
mentions_to_representatives

{Mention(char_idx=(1391, 1392), content='He'): None,
 Mention(char_idx=(1379, 1388), content='my brother'): None,
 Mention(char_idx=(1379, 1380), content='my'): 'Alice',
 Mention(char_idx=(1365, 1366), content='me'): 'Alice',
 Mention(char_idx=(1336, 1340), content='Alice'): 'Alice',
 Mention(char_idx=(1319, 1321), content='she'): 'Alice',
 Mention(char_idx=(1312, 1314), content='her'): 'Alice',
 Mention(char_idx=(1278, 1280), content='Bob'): 'Bob',
 Mention(char_idx=(1233, 1243), content='her brother'): None,
 Mention(char_idx=(1233, 1235), content='her'): 'Alice',
 Mention(char_idx=(1205, 1207), content='she'): 'Alice',
 Mention(char_idx=(1173, 1175), content='Bob'): 'Bob',
 Mention(char_idx=(1164, 1166), content='She'): 'Alice',
 Mention(char_idx=(1146, 1148), content='she'): 'Alice',
 Mention(char_idx=(1123, 1127), content='Alice'): 'Alice',
 Mention(char_idx=(1109, 1113), content='their'): 'Alice and Bob',
 Mention(char_idx=(1097, 1100), content='they'): 'Alice and Bob',
 Mention(

In [330]:
output_text = lqca.add_representatives_to_mentions(input_text=input_text, mentions_to_representatives=mentions_to_representatives)
output_text

'Alice went to the park to meet her (Alice) friend Bob. She (Alice) arrived early and decided to sit on a bench near the fountain. Bob, who was running late, sent her (Alice) a message saying he (Bob)\'d be there in ten minutes. While waiting, Alice noticed a dog (The playful animal) chasing its (The playful animal) tail. The playful animal entertained her (Alice) until Bob finally arrived. Once he (Bob) reached the park, Bob apologized for the delay and handed Alice a book. The novel , which he had borrowed from the library (a book)ibrary, was one Alice had wanted to read for weeks. She (Alice) thanked h (Bob)im (Bob) and immediately began flipping through its (a book) pages. As they (Alice and Bob) chatted, the dog from earlier ran up to them (Alice and Bob), wagging its (the dog) tail enthusiastically. Bob remarked how energetic the dog was, and Alice agreed, saying, "I (Alice) wonder if it (the dog) belongs to anyone here." After spending an hour at the park, Alice and Bob decided 
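The annotation format in the output, where a representative is appended in parentheses after a mention, can be sketched as follows. This is not the framework's code: it assumes inclusive end indices (as in `char_idx=(0, 4)` for 'Alice'), skips mentions without a representative or whose text already equals it, and uses plain tuples in place of `Mention` objects. The name `annotate` is hypothetical.

```python
from typing import Dict, Optional, Tuple


def annotate(text: str, representatives: Dict[Tuple[int, int], Optional[str]]) -> str:
    """Append ' (representative)' after each mention span.

    Spans are (start, end) with an inclusive end (an assumption). They are
    processed from the largest end index down, so each insertion leaves
    the offsets of the spans still to be processed unchanged.
    """
    ordered = sorted(representatives.items(), key=lambda item: item[0][1], reverse=True)
    for (start, end), representative in ordered:
        surface = text[start:end + 1]
        if representative is None or surface == representative:
            continue  # nothing useful to add
        text = text[:end + 1] + f" ({representative})" + text[end + 1:]
    return text


demo_text = "Alice met Bob. She waved."
demo_reps = {(0, 4): "Alice", (10, 12): "Bob", (15, 17): "Alice"}
print(annotate(demo_text, demo_reps))
```

Working from the end of the text backwards is the design choice that matters here: inserting left to right would require adjusting every later offset after each insertion.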

## Example of Running the Component

In [336]:
from haystack import Document

input_doc = Document(content=input_text)
output = lqca.run([input_doc])

In [341]:
output['documents'][0].content

'Alice went to the park to meet her (Alice) friend Bob. She (Alice) arrived early and decided to sit on a bench near the fountain. Bob, who was running late, sent her (Alice) a message saying he (Bob)\'d be there in ten minutes. While waiting, Alice noticed a dog (The playful animal) chasing its (The playful animal) tail. The playful animal entertained her (Alice) until Bob finally arrived. Once he (Bob) reached the park, Bob apologized for the delay and handed Alice a book. The novel , which he had borrowed from the library (a book)ibrary, was one Alice had wanted to read for weeks. She (Alice) thanked h (Bob)im (Bob) and immediately began flipping through its (a book) pages. As they (Alice and Bob) chatted, the dog from earlier ran up to them (Alice and Bob), wagging its (the dog) tail enthusiastically. Bob remarked how energetic the dog was, and Alice agreed, saying, "I (Alice) wonder if it (the dog) belongs to anyone here." After spending an hour at the park, Alice and Bob decided 