# Maverick Coref Exploration
- [Paper](https://aclanthology.org/2024.acl-long.722/)
- [Repo](https://github.com/SapienzaNLP/maverick-coref) 

## Basic Setup

In [1]:
# Automatic reloading
%load_ext autoreload
%autoreload 2

In [2]:
import sys
import os

# os.path.abspath('') resolves to the current working directory (the notebook's
# folder when the kernel starts there); dirname() returns the parent of that folder
current_dir = os.path.dirname(os.path.abspath(''))

# Navigate one more level up to reach the project directory
project_dir = os.path.abspath(os.path.join(current_dir, '..'))

# Add the directory to sys.path
sys.path.append(project_dir)
os.chdir(project_dir)
os.getcwd()

'c:\\Users\\Ryan Lee\\Desktop\\AISG Internship\\rag'

Windows users may hit an installation error when running `pip install maverick-coref`. A workaround is to install from source after patching the setup file:

- `git clone https://github.com/SapienzaNLP/maverick-coref.git`
- `cd maverick-coref`
- In `setup.py`, read the README with an explicit `utf-8` encoding (otherwise the system-default encoding is used): `long_description=open("README.md", encoding="utf-8").read()`
- `pip install -e .`
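
The snippet below is a minimal sketch of why the explicit encoding matters. It assumes the failing machine defaults to `cp1252` (common on Windows) and that the README contains a character whose UTF-8 bytes have no `cp1252` mapping. The sample string is made up for illustration and is not taken from the actual README.

```python
import tempfile
from pathlib import Path

# Hypothetical README content; the right curly quote (U+201D) encodes to
# bytes E2 80 9D in UTF-8, and byte 0x9D is undefined in cp1252.
sample = "Maverick \u201ccoref\u201d README"

with tempfile.TemporaryDirectory() as tmp:
    readme = Path(tmp) / "README.md"
    readme.write_bytes(sample.encode("utf-8"))

    # Simulate a Windows default of cp1252 (assumption): decoding fails.
    try:
        readme.read_text(encoding="cp1252")
        legacy_ok = True
    except UnicodeDecodeError:
        legacy_ok = False

    # Reading with an explicit utf-8 encoding succeeds on any platform.
    decoded = readme.read_text(encoding="utf-8")

print("cp1252 read succeeded:", legacy_ok)
print("utf-8 read matches:", decoded == sample)
```

Passing `encoding="utf-8"` makes the read independent of the machine's locale, which is why the one-line change to `setup.py` is enough.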


In [3]:
# pip install maverick-coref
from maverick import Maverick

model = Maverick(
    hf_name_or_path = "sapienzanlp/maverick-mes-ontonotes",
    device = "cpu"
)

sapienzanlp/maverick-mes-ontonotes loading




In [4]:
text = "Barack Obama is traveling to Rome. The city is sunny and the president plans to visit its most important attractions."
results = model.predict(text)
results

{'tokens': ['Barack',
  'Obama',
  'is',
  'traveling',
  'to',
  'Rome',
  '.',
  'The',
  'city',
  'is',
  'sunny',
  'and',
  'the',
  'president',
  'plans',
  'to',
  'visit',
  'its',
  'most',
  'important',
  'attractions',
  '.'],
 'clusters_token_offsets': [((5, 5), (7, 8), (17, 17)), ((0, 1), (12, 13))],
 'clusters_char_offsets': [[(29, 32), (35, 42), (86, 88)],
  [(0, 11), (57, 69)]],
 'clusters_token_text': [['Rome', 'The city', 'its'],
  ['Barack Obama', 'the president']],
 'clusters_char_text': None}
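
The token offsets in the output above appear to be inclusive `(start, end)` pairs indexing into `tokens`. As a quick check, the sketch below rebuilds the mention strings from the offsets and compares them with `clusters_token_text`. The helper name `spans_to_text` is hypothetical and not part of the `maverick` package; the data is copied from the output above.

```python
def spans_to_text(tokens, clusters_token_offsets):
    """Rebuild mention strings from inclusive (start, end) token spans."""
    return [
        [" ".join(tokens[start:end + 1]) for start, end in cluster]
        for cluster in clusters_token_offsets
    ]

# Values copied from the prediction output above.
tokens = [
    'Barack', 'Obama', 'is', 'traveling', 'to', 'Rome', '.',
    'The', 'city', 'is', 'sunny', 'and', 'the', 'president',
    'plans', 'to', 'visit', 'its', 'most', 'important', 'attractions', '.',
]
clusters_token_offsets = [((5, 5), (7, 8), (17, 17)), ((0, 1), (12, 13))]

mentions = spans_to_text(tokens, clusters_token_offsets)
print(mentions)
```

Because the end index is inclusive, slicing needs `end + 1`; forgetting this drops the last token of every multi-token mention.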

In [5]:
texts = [
    'We are AISG. We are so happy to see you using the coref package. This package is very fast!',
    'Alice goes down the rabbit hole. Where she would discover a new reality beyond her expectations.',
    'Mary saw Susan at the park. She was playing with a frisbee. They then conversed.',
    'Alice went to the library because she wanted to borrow a book. She found a novel by Kenrick and decided to check it out. As Alice walked home, she bumped into her friend Clara, who asked her what she had borrowed. Alice showed it to Clara, and they talked about the author for a while.'
]

for text in texts:
    results = model.predict(text)
    print(text)
    print(results['clusters_token_text'])
    print(results['clusters_token_offsets'])
    print("="*len(text))

We are AISG. We are so happy to see you using the coref package. This package is very fast!
[['We', 'We'], ['the coref package', 'This package']]
[((0, 0), (4, 4)), ((12, 14), (16, 17))]
Alice goes down the rabbit hole. Where she would discover a new reality beyond her expectations.
[['Alice', 'she', 'her']]
[((0, 0), (8, 8), (15, 15))]
Mary saw Susan at the park. She was playing with a frisbee. They then conversed.
[['Susan', 'She']]
[((2, 2), (7, 7))]
Alice went to the library because she wanted to borrow a book. She found a novel by Kenrick and decided to check it out. As Alice walked home, she bumped into her friend Clara, who asked her what she had borrowed. Alice showed it to Clara, and they talked about the author for a while.
[['Alice', 'she', 'She', 'Alice', 'she', 'her', 'her', 'she', 'Alice'], ['a novel by Kenrick', 'it', 'it'], ['her friend Clara , who asked her what she had borrowed', 'Clara'], ['Kenrick', 'the author']]
[((0, 0), (6, 6), (13, 13), (27, 27), (31, 31), (34,

## Remarks
- The results are similar to those from fastcoref.
- All Maverick models use a DeBERTa-v3 encoder (base or large), which can model very long input texts: DeBERTa-large can handle sequences of up to 24,528 tokens, which makes it well suited to long contexts. This is computationally expensive, however, because its attention mechanism has quadratic complexity in the sequence length, O(n²). Longformer, by contrast, uses sliding-window attention, which has linear complexity O(nw), where w is the window size.
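
To make the cost gap concrete, here is a rough back-of-the-envelope comparison of how many query-key pairs each attention pattern scores per layer and head. The window size `w = 512` is an assumed, illustrative value, and the count ignores constant factors and any global-attention tokens.

```python
n = 24_528   # maximum sequence length quoted above
w = 512      # sliding-window size (assumption, for illustration only)

full_pairs = n * n       # full self-attention: every token attends to every token
window_pairs = n * w     # sliding window: every token attends to about w neighbours
ratio = full_pairs / window_pairs  # simplifies to n / w, about 47.9 here

print(f"full attention:   {full_pairs:,} pairs")
print(f"sliding window:   {window_pairs:,} pairs")
print(f"full / windowed:  {ratio:.1f}x")
```

The ratio grows linearly with `n`, so the gap widens as documents get longer.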

In [6]:
from src.components.coreference_models import MaverickCoreferenceModel

my_model = MaverickCoreferenceModel(device="cpu")
my_model.predict(texts[0])

sapienzanlp/maverick-mes-ontonotes loading


[Cluster(mentions=[Mention(char_idx=(0, 1), content='We'), Mention(char_idx=(13, 14), content='We')]),
 Cluster(mentions=[Mention(char_idx=(46, 62), content='the coref package'), Mention(char_idx=(65, 76), content='This package')])]
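
The implementation of `MaverickCoreferenceModel` is not shown in this notebook. The sketch below is one way a wrapper could turn character offsets into the `Cluster` and `Mention` objects printed above. The class fields mirror the printed repr, while `to_clusters` is a hypothetical helper. It also assumes the character offsets are inclusive, as they are in the outputs above.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Mention:
    char_idx: Tuple[int, int]  # inclusive (start, end) character offsets
    content: str


@dataclass
class Cluster:
    mentions: List[Mention]


def to_clusters(
    text: str,
    clusters_char_offsets: Sequence[Sequence[Tuple[int, int]]],
) -> List[Cluster]:
    """Hypothetical helper: slice the text with inclusive offsets."""
    return [
        Cluster([Mention((start, end), text[start:end + 1]) for start, end in cluster])
        for cluster in clusters_char_offsets
    ]


text = ("We are AISG. We are so happy to see you using the coref package. "
        "This package is very fast!")
# Offsets copied from the wrapper output above.
char_offsets = [[(0, 1), (13, 14)], [(46, 62), (65, 76)]]

clusters = to_clusters(text, char_offsets)
for cluster in clusters:
    print([m.content for m in cluster.mentions])
```

Slicing the original text by character offsets, rather than joining tokens, keeps the mention strings identical to the source text, including its original spacing and punctuation.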