<a href="https://www.kaggle.com/code/nghtctrl/modeling-revision-classification?scriptVersionId=174974139" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Modeling Revision Classification

Daniel Kim\*, Jason G. Chew\*, Jiho Kim\*

*Equal Contribution

# Introduction
In the writing process, effective textual revisions typically result in expansive alterations to the semantic content of a text, as opposed to prescriptive alterations like proofreading; however, novice writers often tend to favor the latter ([Flower et al. 1986](https://doi.org/10.2307/357381)). Therefore, the identification of major semantic alterations in the revisions of student writers could be used to benchmark their progress towards more effective revision.

To this end, the following report analyzes various applications of language models towards revision evaluation, which is in this case framed as a binary classification task for identifying revisions as “content” (substantive) or “surface” (superficial) revisions. 

This report considers two approaches to the binary classification task.

1. Completion prompting: when given a “fill-in-the-blank” classification prompt like “...the revision is _________”, a language model can implicitly make predictions thanks to the logprobs, or “likelihoods,” it computes for each of the two possible classification terms (“substantive” and “superficial”). Once the model computes these likelihoods, they may be compared with one another to make the classification.

2. Similarity scores: a language model can compute abstract representations (embeddings) of an original and revised text based on their semantic content, and the similarity of those semantic embeddings can be used as a measure of how little a revision changed the “content” of a sentence; these similarities can then be used to predict whether a revision alters a sentence’s semantic meaning significantly enough to be considered a “content” revision.

Within approach 1, performance on the task improved slightly when contextual information, such as example classifications and definitions of the classification terms, was prepended to the prompt.

# Description of the Dataset
We use a dataset called “[ArgRewriteV2](https://argrewrite.cs.pitt.edu/)”, which contains essays written by students in response to a single prompt about the implications of self-driving cars. Each essay has three versions: the original draft, a revision, and a second revision. The second revisions were not made under experimentally constant circumstances, so we will only use the first revisions for our evaluation. 

The dataset contains essay-level, sentence-level, and subsentence-level data. This report only uses the sentence-level data. Our goal is to classify each revision into one of two categories: “superficial” and “substantive.” The “superficial” category corresponds to the revision categories “Word Usage” and “Conventions/Grammar/Spelling” in the dataset. The “substantive” category corresponds to the categories "Claim/Ideas", "Organization", "Warrant/Reasoning/Backing", "Rebuttal/Reservation", "Precision", "General Content", and "Grammar" in the dataset. (Unlike the original dataset’s authors, we placed "Organization" in the “substantive” category, as it is arguably a significant change to a sentence.)
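
The grouping described above can be sketched as a simple lookup. This is only an illustration: `to_binary_label` is a hypothetical helper that does not appear in the notebook, the category strings are copied from the paragraph above, and returning "neither" for anything unlisted is an assumption about how unrevised sentences might be handled.

```python
# Category names as listed in the text above. Placing "Organization" with the
# substantive categories follows this report's choice, not the dataset authors'.
SUPERFICIAL = {"Word Usage", "Conventions/Grammar/Spelling"}
SUBSTANTIVE = {
    "Claim/Ideas", "Organization", "Warrant/Reasoning/Backing",
    "Rebuttal/Reservation", "Precision", "General Content", "Grammar",
}

def to_binary_label(category):
    """Hypothetical helper: map a fine-grained category to a binary label."""
    if category in SUBSTANTIVE:
        return "content"
    if category in SUPERFICIAL:
        return "surface"
    # Assumption: anything unlisted is treated as "neither".
    return "neither"

print(to_binary_label("Organization"))
print(to_binary_label("Word Usage"))
```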

---

## Setup

In [None]:
%pip install sentence_transformers

##### Import Necessary Modules

In [None]:
from sentence_transformers import SentenceTransformer, util
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from sklearn.metrics import roc_auc_score, confusion_matrix, roc_curve, ConfusionMatrixDisplay
import json
import plotly.express as px
import pandas as pd
import torch

torch_device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {torch_device}")

torch.manual_seed(0);

## Load Data

In [None]:
data = pd.read_csv("/kaggle/input/argrewrite-v-2-corpus-sentence-pairs/sentence_pairs.csv")

In [None]:
actual_rev_types = []

for i in range(len(data)):
    revision_type = data.loc[i, "revision_type"]
    if revision_type != "neither":
        actual_rev_types.append(revision_type)

## Functions

##### Get the logprobs for a completion given a prefix

In [None]:
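def get_completion_logprobs(prefix, completion):
    """Return the logprob of each completion token, conditioned on the prefix.

    Relies on the globally loaded `tokenizer` and `model`.
    """
    with torch.no_grad():
        # Count the completion's tokens without special tokens (e.g. a BOS token),
        # which would otherwise inflate the count for tokenizers that add them.
        completion_ids = tokenizer.encode(completion, add_special_tokens=False, return_tensors="pt")
        completion_len = completion_ids.shape[1]

        whole_phrase = prefix + completion
        whole_phrase_ids = tokenizer.encode(whole_phrase, return_tensors="pt").to(torch_device)
        whole_phrase_logits = model(whole_phrase_ids).logits
        whole_phrase_logprobs = torch.log_softmax(whole_phrase_logits[0], dim=-1)

        completion_logprobs = []
        # The logits at position i predict the token at position i + 1, so each
        # completion token is scored with the distribution one position earlier.
        for i in range(-completion_len-1, -1):
            token_id = whole_phrase_ids[0][i+1]
            logprob = whole_phrase_logprobs[i][token_id]
            completion_logprobs.append(logprob)

    return completion_logprobs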
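
The indexing in the loop above relies on the fact that a causal language model's output at position `i` is its prediction for the token at position `i + 1`. The toy below illustrates only that offset, using a hand-written table of probabilities in place of a real model; the three-token vocabulary and all numbers are invented for illustration.

```python
import math

# Invented example: the whole phrase is prefix [0, 2] followed by completion [1, 2].
token_ids = [0, 2, 1, 2]
completion_len = 2

# Fake "model" output: row i holds the log-probabilities assigned to each
# vocabulary entry as the token at position i + 1.
logprobs = [
    [math.log(0.2), math.log(0.3), math.log(0.5)],  # predicts position 1
    [math.log(0.1), math.log(0.6), math.log(0.3)],  # predicts position 2
    [math.log(0.3), math.log(0.3), math.log(0.4)],  # predicts position 3
    [math.log(0.4), math.log(0.4), math.log(0.2)],  # predicts a token past the end
]

completion_logprobs = []
for i in range(-completion_len - 1, -1):
    next_token = token_ids[i + 1]                        # token being scored
    completion_logprobs.append(logprobs[i][next_token])  # prediction made one step earlier

# Summing token logprobs gives the logprob of the whole completion.
total = sum(completion_logprobs)
print(math.exp(total))
```

Here the completion's probability is the product of the probabilities assigned to token `1` at position 2 and token `2` at position 3, which is what the negative indices select.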

##### Function for Plotting ROC Curve

In [None]:
def plot_roc(actual_rev_types, scores, metric_label):
    fpr, tpr, thresholds = roc_curve(actual_rev_types, scores, pos_label="content")
    # Plot code generated by ChatGPT:
    # https://chat.openai.com/share/2cb2a8d8-7d8e-46bf-b9b3-560db72f3f49
    roc_df = pd.DataFrame({"fpr": fpr, "tpr": tpr, "threshold": thresholds})
    fig = px.line(roc_df, x="fpr", y="tpr",
                  title=f"ROC Curve for {metric_label}",
                  labels={
                    "fpr": "False Positive Rate",
                    "tpr": "True Positive Rate",
                    "threshold": "Threshold",
                  },
                  hover_data=["threshold"])

    # Add a diagonal line (random classifier baseline)
    fig.add_scatter(x=[0, 1], y=[0, 1], mode='lines', line=dict(color='gray', dash='dash'), name='Random Classifier')

    # Show the plot
    fig.show()
    
    # Calculate area under the ROC curve
    binary_rev_labels = [1 if label == "content" else 0 for label in actual_rev_types]
    auc = roc_auc_score(binary_rev_labels, scores)
    print("Area under the ROC Curve:", auc)
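
The area under the ROC curve has a convenient interpretation: it is the probability that a randomly chosen “content” revision receives a higher score than a randomly chosen “surface” revision, with ties counted as half. The sketch below computes it directly from that definition. `pairwise_auc` is a hypothetical helper that is not part of the notebook, and the labels and scores are invented.

```python
def pairwise_auc(labels, scores, pos_label="content"):
    """AUC as the fraction of (positive, negative) pairs that are ranked correctly."""
    pos = [s for l, s in zip(labels, scores) if l == pos_label]
    neg = [s for l, s in zip(labels, scores) if l != pos_label]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))

# Invented labels and scores: three of the four (content, surface) pairs are ordered correctly.
labels = ["content", "surface", "content", "surface"]
scores = [0.9, 0.4, 0.3, 0.1]
print(pairwise_auc(labels, scores))
```

This quadratic-time version is only meant to make the metric concrete; `roc_auc_score` remains the practical choice for the full dataset.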

---

# Completion Model 1: GPT-2 (Baseline)
We will use GPT-2 as our baseline model for the completion prompting approach. GPT-2 is an older model from 2019 which, at the time of its publication, significantly furthered the possibility of “competent generalists” for NLP tasks beyond “narrow expert” systems ([Radford et al. 2019](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)). We use the smallest version of this model (about 124 million parameters) to reduce the compute power and time required to make a prediction, as we will be making many predictions over the course of this report. As one of the first language models that marked significant progress towards “competent generalist” performance, GPT-2 is a fitting model to use as a baseline to determine how such language models might perform on our specific classification task.

### Load GPT-2

In [None]:
model_name = "openai-community/gpt2"

tokenizer = AutoTokenizer.from_pretrained(model_name, add_prefix_space=True)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map=torch_device)

# Add the EOS token as PAD token
if model.generation_config.pad_token_id is None:
    model.generation_config.pad_token_id = model.generation_config.eos_token_id

tokenizer.decode([tokenizer.eos_token_id]);

In [None]:
prompts = []

for i in range(len(data)):
    revision_type = data.loc[i, "revision_type"]
    if revision_type != "neither":
        old_sentence = data.loc[i, "original_sentence"]
        new_sentence = data.loc[i, "revised_sentence"]
        prompt = f"The following revision from: \n{old_sentence}\nto:\n{new_sentence}\n "
        prompts.append(prompt)

#### Logprob threshold
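We use a threshold of 0 for the purpose of experimentation: a revision is predicted to be “content” whenever the “substantive” completion is more likely than the “superficial” one (a positive logprob difference), and “surface” otherwise.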
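
One way to read this threshold: if the two completions are treated as the only possible outcomes, a logprob difference `d` corresponds to a probability of `sigmoid(d)` for “substantive,” so `d > 0` is equivalent to that probability exceeding 0.5. The sketch below shows this reading. Normalizing over just the two completions is an assumption made here for illustration, not a step the notebook performs, and `content_probability` is a hypothetical helper.

```python
import math

def content_probability(logprob_diff):
    """P("substantive") if only the two candidate completions were possible."""
    # p_c / (p_c + p_s) = 1 / (1 + exp(log p_s - log p_c)) = sigmoid(logprob_diff)
    return 1.0 / (1.0 + math.exp(-logprob_diff))

for d in (-2.0, 0.0, 2.0):
    print(d, round(content_probability(d), 3))
```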

In [None]:
shortening_factor = 1

gpt2_preds = []
gpt2_logprob_diffs = []

logprob_threshold = 0

for i in range(len(prompts)//shortening_factor):
    prompt = prompts[i]

    content_logprobs = torch.stack(get_completion_logprobs(prefix=prompt, completion="is substantive")).to(torch_device)
    surface_logprobs = torch.stack(get_completion_logprobs(prefix=prompt, completion="is superficial")).to(torch_device)

    logprob_diff = (torch.sum(content_logprobs) - torch.sum(surface_logprobs)).item()
    gpt2_logprob_diffs.append(logprob_diff)

    if logprob_diff > logprob_threshold:
        gpt2_preds.append("content")
    else:
        gpt2_preds.append("surface")

##### ROC Curve for GPT-2 Baseline

In [None]:
plot_roc(actual_rev_types[:len(prompts)//shortening_factor], gpt2_logprob_diffs, metric_label="Logprob Diff (GPT-2 Baseline)")

##### Confusion Matrix for GPT-2 Baseline

In [None]:
cm = confusion_matrix(actual_rev_types[:len(prompts)//shortening_factor], gpt2_preds, labels=["content", "surface"])
disp = ConfusionMatrixDisplay(
    confusion_matrix=cm, display_labels=["content", "surface"]
)
disp.plot();

# Prompt Engineering Approach

## Adding Reasoning for the Classification Terms to the Completion
We originally used a longer completion that explained why a revision might or might not be “substantive.” However, this created many complicated token interdependencies that were difficult to separate from one another in our analysis, so we elected to calculate logprobs on completions which only varied the classification terms themselves.

## Including Reasoning in the Prefix
We added reasoning in the prefix by defining “substantive” and “superficial” revisions before computing the completion logprobs.
The prompts were defined as follows:
- Defining “substantive” revision: "Substantive revisions change the meaning significantly, so the following revision from '{old_sentence}' to '{new_sentence}' "
- Defining “superficial” revision: "Superficial revisions only change words without affecting the overall meaning, so the following revision from '{old_sentence}' to '{new_sentence}' "

##### Descriptions for keywords are added (prepended) to the prefix

In [None]:
descriptive_prompts = []

for i in range(len(data)):
    revision_type = data.loc[i, "revision_type"]
    if revision_type != "neither":
        old_sentence = data.loc[i, "original_sentence"]
        new_sentence = data.loc[i, "revised_sentence"]
        content_stmt = f"Substantive revisions change the meaning significantly, so the following revision from '{old_sentence}' to '{new_sentence}' "
        surface_stmt = f"Superficial revisions only change words without affecting the overall meaning, so the following revision from '{old_sentence}' to '{new_sentence}' "
        descriptive_prompts.append(
            {
                "content_stmt": content_stmt,
                "surface_stmt": surface_stmt,
            }
        )

##### Classify depending on the difference in logprobs 

In [None]:
shortening_factor = 1

descriptive_preds = []
descriptive_logprob_diffs = []

logprob_threshold = 0

for i in range(len(descriptive_prompts)//shortening_factor):
    content_prompt = descriptive_prompts[i]["content_stmt"]
    surface_prompt = descriptive_prompts[i]["surface_stmt"]

    content_logprobs = torch.stack(get_completion_logprobs(prefix=content_prompt, completion="is substantive")).to(torch_device)
    surface_logprobs = torch.stack(get_completion_logprobs(prefix=surface_prompt, completion="is superficial")).to(torch_device)

    logprob_diff = (torch.sum(content_logprobs) - torch.sum(surface_logprobs)).item()
    descriptive_logprob_diffs.append(logprob_diff)

    if logprob_diff > logprob_threshold:
        descriptive_preds.append("content")
    else:
        descriptive_preds.append("surface")

##### ROC Curve for GPT-2, Prepended Descriptions

In [None]:
plot_roc(actual_rev_types[:len(descriptive_prompts)//shortening_factor], descriptive_logprob_diffs, metric_label="Logprob Diff (GPT-2 Classification Description)")

##### Confusion Matrix for GPT-2, Prepended Descriptions

In [None]:
cm = confusion_matrix(actual_rev_types[:len(descriptive_prompts)//shortening_factor], descriptive_preds, labels=["content", "surface"])
disp = ConfusionMatrixDisplay(
    confusion_matrix=cm, display_labels=["content", "surface"]
)
disp.plot();

### Analysis
Prepending descriptions of the keywords for content-level and surface-level revisions improved the model's performance over the baseline, but only marginally: the area under the ROC curve rose slightly, and the improvement was not significant.

## Few-Shot Prompting by Examples in the Prefix
To facilitate few-shot learning, the prompt prefix includes relevant examples of randomly chosen revision pairs and their classifications, in the format:
- The following revision from: `<original1>` to `<revision1>` is substantive.
- The following revision from: `<original2>` to `<revision2>` is superficial.
- The following revision from: `<original3>` to `<revision3>` is substantive.
- The following revision from: `<original4>` to `<revision4>` is superficial.
- The following revision from: `<original5>` to `<revision5>` is
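
The format above can also be assembled programmatically from a list of labeled examples. The following is a minimal sketch rather than the notebook's own code: `build_few_shot_prompt` is a hypothetical helper, the example sentences are invented, and the line layout (`to:` and `is` on their own lines) is borrowed from the prompt template used in the notebook.

```python
def build_few_shot_prompt(examples, old_sentence, new_sentence):
    """Hypothetical helper: join labeled examples and append the unlabeled query."""
    blocks = [
        f"The following revision from: {old}\nto: {new}\nis {label}."
        for old, new, label in examples
    ]
    # The query ends with "is " so the classification term can be scored as the completion.
    blocks.append(f"The following revision from: {old_sentence}\nto: {new_sentence}\nis ")
    return "\n\n".join(blocks)

# Invented example pairs, not taken from the dataset.
examples = [
    ("Cars is fast.", "Cars are fast.", "superficial"),
    ("Cars are fast.", "Cars are fast, which raises safety concerns.", "substantive"),
]
prompt = build_few_shot_prompt(examples, "I like cars.", "I like self-driving cars.")
print(prompt)
```

Ending the prompt exactly at `is ` (with no trailing newline) keeps the scored completion adjacent to the text it completes.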

##### In-context Examples relevant to revision cases

In [None]:
few_shot_prompt = """
The following revision from: Having these types of vehicles is also not worth taking away people’s jobs and putting their do not have the technology to operate at a high level of safety in certain weather conditions.
to: Having these types of vehicles is also not worth putting people's lives at risk, especially for those who live in areas where it snows and rains a lot, because these vehicles do not have the technology to operate at a high level of safety in those weather conditions.
is substantive.

The following revision from: In light of recent events with the death of an Arizona woman at the hands of a self-driving Uber, many are unsure of what stance to take on the matter.
to: In light of recent events with the death of an Arizona woman at the hands of a self-driving Uber, many are conflicted on what stance to take on the matter.
is superficial.

The following revision from: On the other hand, the car companies, your lawyers and some other groups will love this idea to death.
to: On the other hand, the self- driving car companies, your lawyers and Google (they provide GPS) will love this idea to death."
is substantive.

The following revision from: There are many variables to consider when thinking about individuals using self-driving cars: the weather, other traditional cars and their drivers, and the possibility of inappropriate - or developmentally inappropriate person - like children, mistakenly getting behind the wheel.
to: There are many confounding variables to consider when thinking about individuals using self-driving cars: the weather, other traditional cars and their drivers, and the possibility of inappropriate - or developmentally-inappropriate persons - like children, mistakenly climbing behind the wheel.
is superficial.

The following revision from: {old_sentence}
to: {new_sentence} is 
"""

##### Concatenate prompts

In [None]:
few_shot_prompts = []
for i in range(len(data)):
    revision_type = data.loc[i, "revision_type"]
    if revision_type != "neither":
        old_sentence = data.loc[i, "original_sentence"]
        new_sentence = data.loc[i, "revised_sentence"]
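        # Strip the template's trailing newline so the prefix ends with "is " and the
        # classification term directly follows it.
        composite_stmts = few_shot_prompt.format(old_sentence=old_sentence, new_sentence=new_sentence).rstrip("\n")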
        few_shot_prompts.append(composite_stmts)

##### Classify depending on the logprob difference

In [None]:
shortening_factor = 1

few_shot_preds = []
few_shot_logprob_diffs = []

logprob_threshold = 0

for i in range(len(few_shot_prompts)//shortening_factor):
    # Use a separate name so the `few_shot_prompt` template is not overwritten.
    prompt = few_shot_prompts[i]

    content_logprobs = torch.stack(get_completion_logprobs(prefix=prompt, completion="substantive")).to(torch_device)
    surface_logprobs = torch.stack(get_completion_logprobs(prefix=prompt, completion="superficial")).to(torch_device)

    logprob_diff = (torch.sum(content_logprobs) - torch.sum(surface_logprobs)).item()
    few_shot_logprob_diffs.append(logprob_diff)

    if logprob_diff > logprob_threshold:
        few_shot_preds.append("content")
    else:
        few_shot_preds.append("surface")

##### ROC Curve for GPT-2, Few-shot Learning

In [None]:
plot_roc(actual_rev_types[:len(few_shot_prompts)//shortening_factor], few_shot_logprob_diffs, metric_label="Logprob Diff (GPT-2 Few-Shot)")

##### Confusion Matrix for GPT-2, Few-shot Learning

In [None]:
cm = confusion_matrix(actual_rev_types[:len(few_shot_prompts)//shortening_factor], few_shot_preds, labels=["content", "surface"])
disp = ConfusionMatrixDisplay(
    confusion_matrix=cm, display_labels=["content", "surface"]
)
disp.plot();

### Analysis
Few-shot prompting was even more effective than simply including reasoning in the prefix. The model classified the revision types with higher accuracy when it was given in-context examples. Since the examples give the model a chance to pick up on the difference between the two revision types from the prompt itself, few-shot prompting seems to be more effective at improving the model's performance than merely prepending descriptions.

---

# Completion Model 2: Gemma 2B
To get an idea of how a different model might change performance on a prompt, we will use Gemma 2B for comparison. Gemma is a much newer model with many more parameters (albeit quantized, in our case), so we might expect it to perform even better as a “competent generalist” than GPT-2. This difference should help us get an idea of whether and how performance on a prompt scales with more “advanced” models.

## Recall: GPT-2 Baseline Classification

In [None]:
plot_roc(actual_rev_types[:len(prompts)//shortening_factor], gpt2_logprob_diffs, metric_label="Logprob Diff")

In [None]:
cm = confusion_matrix(actual_rev_types[:len(prompts)//shortening_factor], gpt2_preds, labels=["content", "surface"])
disp = ConfusionMatrixDisplay(
    confusion_matrix=cm, display_labels=["content", "surface"]
)
disp.plot();

## Gemma Baseline Classification

In [None]:
%pip install -U bitsandbytes
%pip install accelerate
# %pip install -i https://pypi.org/simple/ bitsandbytes

##### Get Gemma 2B (without instruction tuning) Model

In [None]:
model_name = "/kaggle/input/gemma/transformers/2b/2"
    
quantization_config = BitsAndBytesConfig(load_in_8bit=True)
    
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map=torch_device, quantization_config=quantization_config)

tokenizer.decode([tokenizer.eos_token_id]);

##### Classify depending on the logprob difference

In [None]:
shortening_factor = 1

gemma_preds = []
gemma_logprob_diffs = []

logprob_threshold = 0

for i in range(len(prompts)//shortening_factor):
    gemma_prompt = prompts[i]

    content_logprobs = torch.stack(get_completion_logprobs(prefix=gemma_prompt, completion="is substantive")).to(torch_device)
    surface_logprobs = torch.stack(get_completion_logprobs(prefix=gemma_prompt, completion="is superficial")).to(torch_device)

    logprob_diff = (torch.sum(content_logprobs) - torch.sum(surface_logprobs)).item()
    gemma_logprob_diffs.append(logprob_diff)

    if logprob_diff > logprob_threshold:
        gemma_preds.append("content")
    else:
        gemma_preds.append("surface")

##### Load precomputed classification results from a JSON file, since running Gemma takes a long time

In [None]:
with open("/kaggle/input/gemma-data/gemma_data.json", "r") as file:
    gemma_data = json.load(file)

##### ROC Curve for Gemma Baseline

In [None]:
plot_roc(actual_rev_types[:len(prompts)], gemma_data["logprob_diffs"], metric_label="Logprob Diff (GEMMA-2B Baseline)")

##### Confusion Matrix for Gemma Baseline

In [None]:
cm = confusion_matrix(actual_rev_types[:len(prompts)], gemma_data["predictions"], labels=["content", "surface"])
disp = ConfusionMatrixDisplay(
    confusion_matrix=cm, display_labels=["content", "surface"]
)
disp.plot();

---

# SBERT: Similarity Score Approach

## all-mpnet-base-v2
We will use all-mpnet-base-v2 as our model for the similarity score approach. all-mpnet-base-v2 is an SBERT (Sentence-BERT) model built on MPNet, a BERT-style transformer encoder. SBERT models compute sentence-level embeddings by pooling token-level embeddings, allowing them to capture the semantic meaning of sentences beyond that of individual tokens. SBERT models are trained with a “Siamese” network setup, a name that alludes to “Siamese,” or “conjoined,” twins: the same encoder processes both sentences of a “conjoined” pair, and a loss function quantifies the model’s “surprisal” on the type of relationship between those sentences. Thus, SBERT models combine pooled sentence-level embeddings with a “conjoined” training approach to learn to output high similarity scores for semantically similar sentences ([Briggs](https://www.pinecone.io/learn/series/nlp/sentence-embeddings/)). all-mpnet-base-v2 is a strong general-purpose model described as “provid[ing] the best quality” on the [SBERT website](https://www.sbert.net/docs/pretrained_models.html), so we use it here.

Our classification algorithm roughly adopts the following approach:
* Compute the similarity score of the original sentence and the revision.
* Invert the similarity score with a negative sign to measure “difference.”
* If this “difference” is greater than a certain threshold, predict “content”; otherwise, predict “surface.”
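
The three steps above can be sketched without any model by using hand-written vectors in place of sentence embeddings. The vectors and the threshold of `-0.7` below are invented for illustration, and `cosine_similarity` and `classify_by_difference` are hypothetical helpers rather than the notebook's functions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def classify_by_difference(original_embed, revision_embed, threshold):
    # Negate the similarity so that larger values mean a bigger semantic change.
    diff_score = -cosine_similarity(original_embed, revision_embed)
    return "content" if diff_score > threshold else "surface"

# Toy 3-dimensional "embeddings": the first pair is nearly parallel, the second is not.
print(classify_by_difference([1.0, 0.0, 0.0], [0.9, 0.1, 0.0], threshold=-0.7))
print(classify_by_difference([1.0, 0.0, 0.0], [0.2, 0.9, 0.3], threshold=-0.7))
```

Negating the similarity keeps the decision rule in the same "higher score means content" direction as the logprob difference, so the same ROC plotting code can be reused.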

In [None]:
old_sentences = []
new_sentences = []
actual_rev_types = []

for i in range(len(data)):
    revision_type = data.loc[i, "revision_type"]
    if revision_type != "neither":
        old_sentence = str(data.loc[i, "original_sentence"])
        old_sentences.append(old_sentence)
        new_sentence = str(data.loc[i, "revised_sentence"])
        new_sentences.append(new_sentence)
        actual_rev_types.append(revision_type)

##### Load Sentence Transformer Model

In [None]:
model = SentenceTransformer("all-mpnet-base-v2").to(torch_device)

##### Classification depending on the cosine similarity

In [None]:
shortening_factor = 1
sbert_preds = []
diff_scores = []

diff_threshold = -0.661

for i in range(len(old_sentences)//shortening_factor):
    
    # Cosine-similarity code adapted from: https://www.sbert.net/docs/usage/semantic_textual_similarity.html
    with torch.no_grad():
        # Compute embeddings
        original_embed = model.encode(old_sentences[i], convert_to_tensor=True, show_progress_bar=False).to(torch_device)
        revision_embed = model.encode(new_sentences[i], convert_to_tensor=True, show_progress_bar=False).to(torch_device)

        # Compute cosine-similarities
        cos_similarity = util.cos_sim(original_embed, revision_embed)
        diff_score = -cos_similarity[0].item()
        diff_scores.append(diff_score)

        if diff_score > diff_threshold:
            sbert_preds.append("content")
        else:
            sbert_preds.append("surface")

##### ROC Curve for Sentence Transformer - Cosine Similarity

In [None]:
plot_roc(actual_rev_types[:len(old_sentences)//shortening_factor], diff_scores, metric_label="Semantic Diff (Sentence Transformer)")

##### Confusion Matrix for Sentence Transformer - Cosine Similarity

In [None]:
cm = confusion_matrix(actual_rev_types[:len(old_sentences)//shortening_factor], sbert_preds, labels=["content", "surface"])
disp = ConfusionMatrixDisplay(
    confusion_matrix=cm, display_labels=["content", "surface"]
)
disp.plot();

### Analysis
{TBA}

---

# Conclusion
Providing descriptions of revision types and utilizing few-shot prompting improved the model’s classification capabilities. Prepending descriptions and supplying in-context examples both give the model a chance to pick up on the difference between the two revision types.

We cannot assume that identical prompting works equally well across language models. Contrary to our expectation, Gemma 2B did not perform as well as GPT-2 in classifying revision types, even though it is a larger model. We will need to run more experiments to see whether the prompt generalizes across different models.

Classification using cosine similarity is more reliable and interpretable than using the difference of logprobs. While the difference of logprobs is a common approach, it takes fairly complicated steps to calculate and interpret. Cosine similarity is more straightforward and easier to understand, and it performed significantly better in our experiments.

# Limitations
### Dataset Limitations
The ArgRewrite corpus only contains essays written for a single argumentative prompt concerning self-driving cars. Thus, the results discussed in this notebook may not fully represent the models’ general classification abilities for essays of other subjects.

### Lack of True Chain-of-Thought Prompting and Computational Limitations
Successful chain-of-thought prompting requires language models to coherently generate intermediate reasoning of some sort to allow them to explicitly condition their final answer on that intermediate generation. More advanced instruction-tuned models are generally required to generate this kind of reasoning with any reliability. To keep this report self-contained, these models were not employed, as their usage would require access to processing power outside that which is freely available on Kaggle. Instead, we prompted GPT-2 with “prepackaged reasoning,” which provided definitional explanations for the completion terms “substantive” and “superficial.” This approach is far less adaptive than chain-of-thought prompting, as it only adds a small amount of boilerplate reasoning to the prompt without adapting to the particulars of the sentences under evaluation. Hence, this report does not provide any workflows to test more advanced models’ chain-of-thought capabilities.

### Using Automation Responsibly
This report provides a possible workflow for automatically evaluating students’ revisions, which is not desirable in every situation. The binary “content”/“surface” categorization lacks nuance and does not provide students with the individual feedback they might need to improve their revision, regardless of writing level. While the quantitative measures used to make the binary classification provide a slightly more nuanced continuous measure, they nonetheless reduce the complex process of revision to a single number that offers little direction for improvement. More complex, personalized feedback is required for students to grow as writers and revisers. Hence, our evaluation methodology is not intended to be used in educational contexts. It is intended to grant insights into broad patterns of revisions, rather than how to personally improve one’s own writing.

# Future Improvements
### Essay-Level Evaluation
This report only analyzes sentence-level revision classification, but could potentially be expanded to generate and analyze essay-level revision scores.

### Prompting Modern LLMs
Modern LLMs are much more “competent generalists” on NLP tasks than any model running in a self-contained Kaggle notebook, so further studies into the effectiveness of the “completion approach” might use the APIs of LLMs like GPT-4 or Llama 3 to get a better idea of how the very best models could perform on the classification task. These models more reliably generate intermediate reasoning as well, which would allow for additional experimentation with “chain-of-thought” prompting, possibly by prompting models to identify parts of a revision that would lead them to conclude a revision is “surface-level” or “content-level.”

### Evaluating Performance on Revision Subcategories
The dataset of revisions used in this report included finer-grained revision categories not considered in our analysis. Our evaluation could benefit from additional, finer-grained analysis of the revision subclasses our models fail at most. With this information in mind, the models could be fine-tuned in an attempt to improve their performance on those specific subclasses. This report does not use fine-tuning due to time constraints; however, the evaluations in this report provide a possible workflow which can be expanded to measure a fine-tuned model’s performance on revision subclasses. Thus, the process of fine-tuning and evaluating performance on revision subclasses will be left to future work.

# Appendix

## Data Wrangling

In [None]:
import requests
import zipfile
import io
from pathlib import Path
import re
import warnings

### Download Dataset

The [ArgRewrite dataset](https://argrewrite.cs.pitt.edu/#corpus) was shared by the original authors under GNU GPLv3. Therefore, we have also released our wrangled version under the same license, which is accessible on Kaggle [here](https://www.kaggle.com/datasets/nghtctrl/argrewrite-v-2-corpus).

In [None]:
corpus_url = "https://argrewrite.cs.pitt.edu/corpus/ArgRewrite.V2.zip"
corpus_path = "argrewrite-v-2-corpus"
response = requests.get(corpus_url)
response.raise_for_status()  # Stop early if the download failed

with zipfile.ZipFile(io.BytesIO(response.content)) as z:
    z.extractall(corpus_path)

### Open annotated essay files

In [None]:
annotations_path = Path(corpus_path) / "annotations"
xlsx_files = list(annotations_path.glob("**/12/*.xlsx"))

print(f"There are {len(xlsx_files)} xlsx files in the corpus.")

### Extract relevant data

In [None]:
%pip install -U openpyxl

In [None]:
def get_ids(prefix, filename):
    """Retrieve the writer ID that follows `prefix` in a file name."""
    regex = prefix + r"(\d+)"
    return re.search(regex, filename).group(1)


old_draft_dfs = []
new_draft_dfs = []

for xlsx_file in xlsx_files:
    try:
        # Ignore all of the openpyxl incompatibility warnings.
        with warnings.catch_warnings():
            warnings.filterwarnings(
                "ignore",
                category=UserWarning,
                module=re.escape("openpyxl.styles.stylesheet"),
            )
            with open(xlsx_file, "rb") as f:
                # Open each of the "Old Draft" and "New Draft" sheets
                old_draft_sheet = pd.read_excel(f, sheet_name="Old Draft")
                new_draft_sheet = pd.read_excel(f, sheet_name="New Draft")
                
                # Extract the writer ID from the file path
                writer_id = get_ids("Annotation_2018argrewrite_", str(xlsx_file))
                
                old_draft_df = pd.DataFrame(
                    {
                        "writer_id": writer_id,
                        "original_sentence_index": old_draft_sheet["Sentence Index"].astype(str),
                        "revised_sentence_index": old_draft_sheet["Aligned Index"].astype(str),
                        "original_sentence": old_draft_sheet["Sentence Content"],
                        "revision_purpose": old_draft_sheet["Revision Purpose Level 0"],
                        "revision_operation": old_draft_sheet["Revision Operation Level 0"],
                    }
                )
                old_draft_dfs.append(old_draft_df)

                new_draft_df = pd.DataFrame(
                    {
                        "writer_id": writer_id,
                        "original_sentence_index": new_draft_sheet["Aligned Index"].astype(str),
                        "revised_sentence_index": new_draft_sheet["Sentence Index"].astype(str),
                        "revised_sentence": new_draft_sheet["Sentence Content"],
                        "revision_purpose": new_draft_sheet["Revision Purpose Level 0"],
                        "revision_operation": new_draft_sheet["Revision Operation Level 0"],
                    }
                )
                new_draft_dfs.append(new_draft_df)
    except ValueError as e:
        # Catch invalid files
        print(f"Error in {xlsx_file}: {e}")

old_draft_df = pd.concat(old_draft_dfs, ignore_index=True)
new_draft_df = pd.concat(new_draft_dfs, ignore_index=True)
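
The `get_ids` helper used above pulls the writer ID out of a file path with a regular expression. Below is a standalone sketch of a slightly more defensive variant that escapes the prefix and returns `None` when nothing matches. The file path is made up, and its shape is an assumption about the corpus layout rather than something confirmed by the dataset.

```python
import re

def get_ids(prefix, filename):
    """Return the first run of digits that follows `prefix` in `filename`, or None."""
    match = re.search(re.escape(prefix) + r"(\d+)", filename)
    return match.group(1) if match else None

# Hypothetical path; real corpus file names may differ.
example_path = "annotations/12/Annotation_2018argrewrite_42.xlsx"
print(get_ids("Annotation_2018argrewrite_", example_path))
```

The ID is returned as a string, which matches how the sentence indices are kept as strings in the dataframes above.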

In [None]:
print(f"old_draft_df has {old_draft_df.shape[0]} rows.")
print(f"new_draft_df has {new_draft_df.shape[0]} rows.")
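
The column mapping above swaps the roles of the two index columns between the sheets. A minimal sketch on toy data may make that easier to see; the sheet contents and the writer ID `"w01"` below are hypothetical stand-ins for the real annotation files, not values from the dataset.

```python
import pandas as pd

# Hypothetical toy sheets standing in for the "Old Draft" and "New Draft" tabs
old_sheet = pd.DataFrame(
    {
        "Sentence Index": [1, 2, 3],
        "Aligned Index": ["1", "2", "2"],
        "Sentence Content": ["A.", "B.", "C."],
    }
)
new_sheet = pd.DataFrame(
    {
        "Sentence Index": [1, 2],
        "Aligned Index": ["1", "2,3"],
        "Sentence Content": ["A!", "B and C."],
    }
)

# Old sheet: "Sentence Index" is the original position, "Aligned Index" the revised one
old_toy = pd.DataFrame(
    {
        "writer_id": "w01",
        "original_sentence_index": old_sheet["Sentence Index"].astype(str),
        "revised_sentence_index": old_sheet["Aligned Index"].astype(str),
        "original_sentence": old_sheet["Sentence Content"],
    }
)

# New sheet: the two index columns trade roles
new_toy = pd.DataFrame(
    {
        "writer_id": "w01",
        "original_sentence_index": new_sheet["Aligned Index"].astype(str),
        "revised_sentence_index": new_sheet["Sentence Index"].astype(str),
        "revised_sentence": new_sheet["Sentence Content"],
    }
)

print(old_toy)
print(new_toy)
```

Both frames end up keyed by `writer_id` and `original_sentence_index`, which is what the later join relies on.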

### Retrieve and combine revised sentences

In [None]:
# Find rows where revised_sentence was derived from 2+ original sentences, marked by a ',' separator in the index
multiple_original_indices = new_draft_df["original_sentence_index"].str.contains(
    ",", na=False
)

# Keep the rows of new_draft_df whose sentences were revised from 2+ original sentences
combined_sentences_df = new_draft_df.loc[multiple_original_indices]

# Each item in the list will be a string of sentences that were combined in the new_draft_df
combined_sentence_strings = []

# For each case where sentences were combined:
for _, row in combined_sentences_df.iterrows():
    writer_id = row["writer_id"]  # ID of the writer who combined sentences
    # Indices of the writer's combined sentences
    original_index_group = row["original_sentence_index"].split(",")
    source_sentences = []  # Sentences that were combined

    # For each of the indices of the writer's combined sentences:
    for original_index in original_index_group:
        # Locate the source sentence by writer ID and original sentence index
        to_combine = (old_draft_df["writer_id"] == writer_id) & (
            old_draft_df["original_sentence_index"] == original_index
        )
        # Collect the matching sentence
        source_sentences.append(
            old_draft_df.loc[to_combine, "original_sentence"].values[0]
        )

    # Join with single spaces, which avoids a stray leading space
    combined_sentence_strings.append(" ".join(source_sentences))

# For rows with multiple values in original_sentence_index, keep only the first index listed
new_draft_df.loc[multiple_original_indices, "original_sentence_index"] = (
    new_draft_df.loc[multiple_original_indices, "original_sentence_index"]
    .str.split(",")
    .str[0]
)

print(f"old_draft_df has {old_draft_df.shape[0]} rows.")
print(f"new_draft_df has {new_draft_df.shape[0]} rows.")
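
The detection and truncation steps above can be sketched on a toy frame. The rows below are hypothetical, and the `None` entry is an assumption used only to show what `na=False` does.

```python
import pandas as pd

# Hypothetical rows: one plain index, one combined index, one missing index
toy_new = pd.DataFrame(
    {
        "writer_id": ["w01", "w01", "w01"],
        "original_sentence_index": ["1", "2,3", None],
        "revised_sentence": ["A!", "B and C.", "Brand new."],
    }
)

# na=False treats a missing index as "not combined" instead of propagating NaN
is_combined = toy_new["original_sentence_index"].str.contains(",", na=False)

# Keep only the first listed index so each row carries a single join key
toy_new.loc[is_combined, "original_sentence_index"] = (
    toy_new.loc[is_combined, "original_sentence_index"].str.split(",").str[0]
)

print(is_combined.tolist())
print(toy_new)
```

Only the second row is flagged, and its index shrinks from `"2,3"` to `"2"`; the other rows are left untouched.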

### Join the original and revised sentences

In [None]:
# Do a full outer join on the writer ID and the original sentence index
sentence_pair_df = new_draft_df.merge(
    old_draft_df[["writer_id", "original_sentence", "original_sentence_index"]],
    how="outer",
    on=["writer_id", "original_sentence_index"],
)

print(f"sentence_pair_df has {sentence_pair_df.shape[0]} rows.")
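
To see what the outer join keeps, here is a minimal sketch on hypothetical data. The `"nan"` key is an assumed placeholder for a sentence with no aligned original, and `indicator=True` is added here purely for inspection; it is not part of the notebook's merge.

```python
import pandas as pd

toy_new = pd.DataFrame(
    {
        "writer_id": ["w01", "w01"],
        # "nan" is an assumed placeholder: this sentence has no aligned original
        "original_sentence_index": ["1", "nan"],
        "revised_sentence": ["A!", "Brand new."],
    }
)
toy_old = pd.DataFrame(
    {
        "writer_id": ["w01", "w01"],
        # Sentence 2 has no counterpart in the new draft (it was deleted)
        "original_sentence_index": ["1", "2"],
        "original_sentence": ["A.", "B."],
    }
)

toy_pairs = toy_new.merge(
    toy_old,
    how="outer",
    on=["writer_id", "original_sentence_index"],
    indicator=True,  # adds a "_merge" column: both / left_only / right_only
)

print(toy_pairs[["original_sentence", "revised_sentence", "_merge"]])
```

Rows marked `right_only` are deleted sentences (no `revised_sentence`), and rows marked `left_only` are added sentences (no `original_sentence`); an inner join would silently drop both kinds.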

In [None]:
# Do the following for every combined sentence:
for i in range(len(combined_sentences_df)):
    # ID of the writer who combined sentences
    writer_id = combined_sentences_df["writer_id"].iloc[i]
    # Index of the revised (combined) sentence
    revised_sentence_id = combined_sentences_df["revised_sentence_index"].iloc[i]

    # Select the row of the sentence that was revised by combining old sentences
    revised_row = (sentence_pair_df["writer_id"] == writer_id) & (
        sentence_pair_df["revised_sentence_index"] == revised_sentence_id
    )
    # Retrieve the string of old sentences that were combined
    source_sentences = combined_sentence_strings[i]

    # Replace original_sentence in that row with the combined source sentences
    sentence_pair_df.loc[revised_row, "original_sentence"] = source_sentences

    # Debug message printing each modified row. The revised sentence should share content with the source sentences.
    print(
        "Source sentences:",
        sentence_pair_df.loc[revised_row]["original_sentence"].values[0],
        "\nRevised sentence:",
        sentence_pair_df.loc[revised_row]["revised_sentence"].values[0],
    )
    print()

In [None]:
# Count the occurrences of each revision subtype
sentence_pair_df["revision_purpose"].value_counts()

In [None]:
def classify_purpose(x):
    """Classify a revision subtype as 'content', 'surface', or 'neither'."""
    if isinstance(x, str):
        # Simple hack: match short substrings of the subtype names; there might be a better way to do this
        if any(
            word in x
            for word in ["Clai", "Warr", "Evid", "Rebu", "Prec", "Cont", "Orga"]
        ):
            return "content"
        else:
            return "surface"
    else:
        return "neither"


# Insert the newly converted revision types into the sentence pairs dataset
revision_type = sentence_pair_df["revision_purpose"].apply(classify_purpose)
assert len(revision_type) == sentence_pair_df.shape[0]
sentence_pair_df["revision_type"] = revision_type

sentence_pair_df["revision_type"].value_counts()
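
The three branches of the classification can be exercised with a small self-contained sketch. `classify_purpose_sketch` and `CONTENT_MARKERS` are hypothetical names that restate the same substring rule, and the sample labels are illustrative strings rather than values confirmed from the dataset.

```python
import pandas as pd

# Same substrings as the notebook's rule; matching is case-sensitive
CONTENT_MARKERS = ("Clai", "Warr", "Evid", "Rebu", "Prec", "Cont", "Orga")


def classify_purpose_sketch(label):
    """Map a purpose label to 'content', 'surface', or 'neither' (non-string input)."""
    if not isinstance(label, str):
        return "neither"
    return "content" if any(m in label for m in CONTENT_MARKERS) else "surface"


# Hypothetical label strings, chosen only to exercise each branch
samples = pd.Series(["Claims/Ideas", "Evidence", "Word-Usage/Clarity", float("nan")])
print(samples.apply(classify_purpose_sketch).tolist())
# → ['content', 'content', 'surface', 'neither']
```

Because the rule matches substrings anywhere in the label, any string without one of the markers falls through to "surface", so it is worth checking the `value_counts()` output for unexpected labels.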

### Save as CSV file

In [None]:
sentence_pair_df.to_csv("sentence_pairs.csv", index=False)
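
When the saved file is read back, pandas infers column types by default, so identifier columns stored as strings may come back as numbers. The sketch below illustrates this with an in-memory buffer; the zero-padded writer IDs are hypothetical and may not match the real ID format.

```python
import io

import pandas as pd

# Hypothetical sentence pairs with string-typed identifiers
toy_pairs = pd.DataFrame(
    {
        "writer_id": ["01", "02"],
        "original_sentence_index": ["1", "2"],
        "revised_sentence": ["A!", "B!"],
    }
)

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy_pairs.to_csv(buffer, index=False)

# Default parsing infers integer columns, so "01" is read as 1
buffer.seek(0)
inferred = pd.read_csv(buffer)

# Passing dtype=str for the key columns keeps them as strings
buffer.seek(0)
preserved = pd.read_csv(
    buffer, dtype={"writer_id": str, "original_sentence_index": str}
)

print(inferred["writer_id"].tolist())
print(preserved["writer_id"].tolist())
```

If later cells compare these columns against string keys, reading with an explicit `dtype` keeps the comparisons consistent with the frames built earlier.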