# Refactors

In the _NewsEdits_ paper (https://aclanthology.org/2022.naacl-main.10.pdf), we present an algorithm for identifying sentences that have been `refactor`ed; it is described in Appendix F.1. However, we don't come close to fully exploring this rich topic.

A refactored sentence is a sentence that has been purposefully moved in the document.

For a visual walkthrough of our heuristic algorithm for identifying refactors, including tricky edge cases, please refer to the slideshow in this directory: https://github.com/isi-nlp/NewsEdits/blob/main/demo_notebooks/Refactoring%20Examples%20and%20Edge%20Cases.pdf.

In [60]:
import sqlite3
import pandas as pd
import sys
sys.path.insert(0, '../util')
import util_refactorings as ur
import util_data_fetching_for_app as uda

In [10]:
! gunzip ../../../data/diffengine-diffs/spark-output/nyt-matched-sentences.db.gz

In [11]:
conn = sqlite3.connect('../data/nyt-matched-sentences.db')

In [61]:
## please run the bottom cell first. Just shown here for brevity
matched_sentences, split_sentences = uda.get_data_from_sqlite_by_sent_cutoffs(
    source='nyt', conn=conn, high_sent_count=30, low_sent_count=3
)

In [79]:
keys = matched_sentences[['entry_id', 'version_x', 'version_y']].drop_duplicates()

In [100]:
e, v_x, v_y = keys.iloc[11]

df = (
    matched_sentences
    .loc[lambda df: df['entry_id'] == e]
    .loc[lambda df: df['version_x'] == v_x]
    .loc[lambda df: df['version_y'] == v_y]
)

refactors = ur.find_refactors_for_doc(
    df[['entry_id', 'version_x', 'version_y', 'sent_idx_x', 'sent_idx_y']].dropna().astype(int)
)

In [101]:
refactors

[(5, 15), (12, 6), (13, 7), (15, 8)]

Each pair is an edge whose two endpoints demarcate a refactored sentence node. The pairs are joinable to `matched_sentences` on the sentence indices.

In other words, the sentences at `sent_idx_x=[5, 12, ...]` have all been refactored.
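As a minimal sketch of that join, the snippet below flags refactored rows by merging the edge list back onto a table of matched sentences. It assumes each tuple is ordered `(sent_idx_x, sent_idx_y)`, which is consistent with the output above but is not confirmed here. The toy `df` and the `is_refactor` column are illustrative stand-ins, not part of the NewsEdits utilities.

```python
import pandas as pd

# Edge list returned for one (entry_id, version_x, version_y) triple, as shown above.
refactors = [(5, 15), (12, 6), (13, 7), (15, 8)]

# Toy stand-in for the matched-sentence rows of one document pair (values are illustrative).
df = pd.DataFrame({
    "sent_idx_x": [4, 5, 6, 12, 13, 15],
    "sent_idx_y": [4, 15, 5, 6, 7, 8],
})

# Assumption: each tuple is ordered (sent_idx_x, sent_idx_y).
refactor_df = pd.DataFrame(refactors, columns=["sent_idx_x", "sent_idx_y"])

# A left merge with indicator=True marks which matched rows also appear in the edge list.
labeled = df.merge(
    refactor_df, on=["sent_idx_x", "sent_idx_y"], how="left", indicator=True
)
labeled["is_refactor"] = labeled["_merge"] == "both"
labeled = labeled.drop(columns="_merge")

# Indices (in version_x) of the sentences flagged as refactored.
refactored_idxs = labeled.loc[labeled["is_refactor"], "sent_idx_x"].tolist()
print(refactored_idxs)
```

Merging on both indices, rather than on `sent_idx_x` alone, keeps the flag tied to the specific edge in case a sentence takes part in more than one match.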

Because refactors are relatively rare, we caution users to reduce false positives wherever possible. We encourage users to visualize the data using the techniques described in other sections and to devise additional filtering rules (e.g., filter out all sentences that are `< k characters`) to avoid spurious refactors caused by sentence-parsing errors.
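One possible form of the length filter mentioned above is sketched here. The helper `drop_short_sentence_refactors`, the `sentence_x` text column, and the default `k=20` are all hypothetical choices for illustration; substitute whichever column holds the sentence text in your data and tune `k` after inspecting examples.

```python
import pandas as pd

def drop_short_sentence_refactors(refactors, sentences, k=20):
    """Keep only refactor pairs whose source sentence has at least k characters.

    `sentences` is assumed to have a `sent_idx_x` column and a hypothetical
    `sentence_x` column containing the sentence text.
    """
    lengths = (
        sentences
        .drop_duplicates("sent_idx_x")   # one row per source sentence
        .set_index("sent_idx_x")["sentence_x"]
        .str.len()
    )
    # Sentences with no text on record are treated as length 0 and dropped.
    return [(x, y) for x, y in refactors if lengths.get(x, 0) >= k]

# Toy data: very short "sentences" are often sentence-splitting artifacts.
sentences = pd.DataFrame({
    "sent_idx_x": [5, 12, 13],
    "sentence_x": [
        "Mr.",
        "The council voted on Tuesday to delay the measure.",
        "He said.",
    ],
})
refactors = [(5, 15), (12, 6), (13, 7)]

kept = drop_short_sentence_refactors(refactors, sentences, k=20)
print(kept)
```

This only checks the sentence on the `version_x` side; applying the same test to the matched `version_y` sentence would be a natural extension.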