In [1]:
import pandas as pd
import glob

pd.set_option("display.max_columns", None)
pd.set_option("display.expand_frame_repr", False)
pd.set_option("display.max_colwidth", 800)

In [2]:
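# Read annotation files from JSON (sorted so that the file order is reproducible)
# Rows where the summary sentence could not be matched are dropped later, when the binary dataset is built
annotation_files = sorted(glob.glob("../data/annotation/*.json"))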
annotations_raw = [pd.read_json(f) for f in annotation_files]
papers_df = pd.read_pickle("../data/papers.pkl")

In [3]:
# Creating pairwise comparison lists, grouped by summary
data = []
for file in annotations_raw:
    summary_id = file["summary_id"][0]
    paper_id = file["paper_id"][0]
    summary_doc = file["source_text"].to_list()
    paper_doc = list(papers_df["paper_text"][papers_df["paper_id"] == paper_id].item().values())
    # Shift target sids by one so that the title gets sid 1
    sids = [(source_sid, target_sid + 1) for source_sid, target_sid in zip(file["source_sid"], file["target_sid"])]

    example = {
        "summary_id": summary_id,
        "source_text": summary_doc,
        "target_text": paper_doc,
        "sids": sids,

    }

    data.append(example)

annotations_docs = pd.DataFrame(data)
annotations_docs.to_pickle("../data/docs-dataset.pkl")
annotations_docs.to_json("../data/docs-dataset.jsonl", orient="records", lines=True)  # lines=True writes one record per line (JSON Lines)
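
The `.jsonl` extension conventionally denotes JSON Lines, one record per line, which `to_json` writes only when `lines=True` is passed together with `orient="records"`. A minimal sketch of the round trip on toy data (the values below are hypothetical, not taken from the dataset); note that the `(source_sid, target_sid)` tuples are serialized as JSON arrays and come back as lists:

```python
import io

import pandas as pd

# Toy stand-in for annotations_docs (hypothetical values)
toy_docs = pd.DataFrame(
    [
        {"summary_id": "A00-0001", "sids": [(1, 2), (2, 3)]},
        {"summary_id": "A00-0002", "sids": [(1, 5)]},
    ]
)

# With no path given, to_json returns the serialized string
jsonl_text = toy_docs.to_json(orient="records", lines=True)

# Read it back; lines=True tells pandas to parse one record per line
restored = pd.read_json(io.StringIO(jsonl_text), orient="records", lines=True)

# JSON has no tuple type, so rebuild the pairs explicitly if tuples are needed
restored["sids"] = restored["sids"].apply(lambda pairs: [tuple(pair) for pair in pairs])
```

The pickle file keeps the tuples as they are, so the conversion step is only relevant when loading from the JSON Lines file.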

In [4]:
annotations_docs.head(5)

Unnamed: 0,summary_id,source_text,target_text,sids
0,C00-2123,"[The authors in this paper describe a search procedure for statistical machine translation (MT) based on dynamic programming (DP)., From a DP-based solution to the traveling salesman problem, they present a novel technique to restrict the possible word reordering between source and target language in order to achieve an efficient search algorithm., A beam search concept is applied as in speech recognition., There is no global pruning., An extended lexicon model is defined, and its likelihood is compared to a baseline lexicon model, which takes only single-word dependencies into account., In order to handle the necessary word reordering as an optimization problem within the dynamic programming approach, they describe a solution to the traveling salesman problem (TSP) which is based on dy...","[Word Re-ordering and DP-based Search in Statistical Machine Translation, In this paper, we describe a search procedure for statistical machine translation (MT) based on dynamic programming (DP)., Starting from a DP-based solution to the traveling salesman problem, we present a novel technique to restrict the possible word reordering between source and target language in order to achieve an efficient search algorithm., A search restriction especially useful for the translation direction from German to English is presented., The experimental tests are carried out on the Verbmobil task (GermanEnglish, 8000-word vocabulary), which is a limited-domain spoken-language task., The goal of machine translation is the translation of a text given in some source language into a target language., We...","[(1, 2), (2, 3), (3, 166), (4, 167), (5, 36), (6, 40), (7, 194), (8, 195), (9, 5)]"
1,C02-1025,"[This paper presents a maximum entropy-based named entity recognizer (NER)., NER is useful in many NLP applications such as information extraction, question answering, etc .Chieu and Ng have shown that the maximum entropy framework is able to use global information directly from various sources., They believe that global context is useful in most languages, as it is a natural tendency for authors to use abbreviations on entities already mentioned previously., They have made use of local and global features to deal with the instances of same token in a document., They have made use of local and global features to deal with the instances of same token in a document., Their results show that their high performance NER use less training data than other systems., The use of global features ...","[Named Entity Recognition: A Maximum Entropy Approach Using Global Information, This paper presents a maximum entropy-based named entity recognizer (NER)., It differs from previous machine learning-based NERs in that it uses information from the whole document to classify each word, with just one classifier., Previous work that involves the gathering of information from the whole document often uses a secondary classifier, which corrects the mistakes of a primary sentence- based classifier., In this paper, we show that the maximum entropy framework is able to make use of global information directly, and achieves performance that is comparable to the best previous machine learning-based NERs on MUC6 and MUC7 test data., Considerable amount of work has been done in recent years on the na...","[(1, 2), (2, 7), (3, 205), (4, 63), (4, 64), (5, 15), (6, 11), (7, 199)]"
2,C10-1045,"[This paper offers a broad insight into of Arabic constituency parsing by analyzing the interplay of linguistic phenomena, annotation choices, and model design., It is probably the first analysis of Arabic parsing of this kind., It is well-known that English constituency parsing models do not generalize to other languages and treebanks., Explanations for this phenomenon are relative informativeness of lexicalization, insensitivity to morphology and the effect of variable word order and these factors lead to syntactic disambiguation., The authors use linguistic and annotation insights to develop a manually annotated grammar and evaluate it and finally provide a realistic evaluation in which segmentation is performed in a pipeline jointly with parsing., The authors use linguistic and ann...","[Better Arabic Parsing: Baselines, Evaluations, and Analysis, In this paper, we offer broad insight into the underperformance of Arabic constituency parsing by analyzing the interplay of linguistic phenomena, annotation choices, and model design., First, we identify sources of syntactic ambiguity understudied in the existing parsing literature., Second, we show that although the Penn Arabic Treebank is similar to other tree- banks in gross statistical terms, annotation consistency remains problematic., Third, we develop a human interpretable grammar that is competitive with a latent variable PCFG., Fourth, we show how to build better models for three different parsers., Finally, we show that in application settings, the absence of gold segmentation lowers parsing performance by 2–5% ...","[(1, 2), (2, 27), (3, 8), (4, 8), (5, 23), (5, 25), (6, 22)]"
3,D10-1044_swastika,"[Foster et all describe a new approach to SMT adaptation that weights out-of-domain phrase pairs according to their relevance to the target domain, determined by both how similar to it they appear to be, and whether they belong to general language or not., They extended previous work on discriminative weighting by using a finer granularity, focusing on the properties of instances rather than corpus components, and used simpler training procedure., They incorporated instance-weighting into a mixture-model framework, and found that it yielded consistent improvements over a wide range of baselines., In this paper, the authors proposed an approach for instance-weighting phrase pairs in an out-of-domain corpus in order to improve in-domain performance., Each out-of-domain phrase pair was ch...","[Discriminative Instance Weighting for Domain Adaptation in Statistical Machine Translation, We describe a new approach to SMT adaptation that weights out-of-domain phrase pairs according to their relevance to the target domain, determined by both how similar to it they appear to be, and whether they belong to general language or not., This extends previous work on discriminative weighting by using a finer granularity, focusing on the properties of instances rather than corpus components, and using a simpler training procedure., We incorporate instance weighting into a mixture-model framework, and find that it yields consistent improvements over a wide range of baselines., Domain adaptation is a common concern when optimizing empirical NLP applications., Even when there is training dat...","[(1, 2), (2, 3), (3, 4), (4, 145), (5, 146), (6, 147), (7, 151), (8, 152)]"
4,D10-1083,"[In this paper, the authors are of the opinion that the sequence models-based approaches usually treat token-level tag assignment as the primary latent variable., However, these approaches are ill-equipped to directly represent type-based constraints such as sparsity., In this work, they take a more direct approach and treat a word type and its allowed POS tags as a primary element of the model., Their work is closely related to recent approaches that incorporate the sparsity constraint into the POS induction process., There are clustering approaches that assign a single POS tag to each word type., These clusters are computed using an SVD variant without relying on transitional structure., The departure from the traditional token-based tagging approach allow them to explicitly capture ...","[Simple Type-Level Unsupervised POS Tagging, Part-of-speech (POS) tag distributions are known to exhibit sparsity – a word is likely to take a single predominant tag in a corpus., Recent research has demonstrated that incorporating this sparsity constraint improves tagging accuracy., However, in existing systems, this expansion come with a steep increase in model complexity., This paper proposes a simple and effective tagging method that directly models tag sparsity and other distributional properties of valid POS tag assignments., In addition, this formulation results in a dramatic reduction in the number of model parameters thereby, enabling unusually rapid training., Our experiments consistently demonstrate that this model architecture yields substantial performance gains over mor...","[(1, 15), (2, 17), (3, 20), (4, 35), (5, 38), (6, 39), (7, 238), (8, 239), (9, 240), (10, 242)]"
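
The `sids` column above pairs each summary sentence id with a paper sentence id that has been shifted by one, so that the title (raw `target_sid` 0 in the annotation files) becomes sentence 1. A small sketch of that shift, using a hypothetical annotation frame whose raw values are chosen for illustration:

```python
import pandas as pd

# Hypothetical raw annotation rows for one summary; a raw target_sid of 0 would be the title
toy_file = pd.DataFrame(
    {
        "source_sid": [1, 2, 3],
        "target_sid": [1, 2, 165],
    }
)

# Pair each summary sentence with its shifted paper sentence id
toy_sids = [
    (int(source_sid), int(target_sid) + 1)
    for source_sid, target_sid in zip(toy_file["source_sid"], toy_file["target_sid"])
]
print(toy_sids)  # [(1, 2), (2, 3), (3, 166)]
```

The explicit `int(...)` casts are an addition for the sketch: they keep plain Python integers in the tuples instead of NumPy scalars.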


In [5]:
# Creating sentence pairs with binary labels (1 = matching, 0 = non-matching)
def create_labels(row):
    # One label per paper sentence: 1 only for the sentence whose id equals the annotated target_sid
    sid = row["target_sid"]
    sentences = list(row["paper_text"].keys())
    labels = [1 if sid == sentence else 0 for sentence in sentences]
    return labels

summaries_df = pd.concat(annotations_raw).dropna(subset="target_sid")
summaries_df["source_sid"] = summaries_df["source_sid"].astype("int32").astype("string")
summaries_df["target_sid"] = summaries_df["target_sid"].astype("int32").astype("string")
summaries_df["strategy"] = summaries_df["strategy"].astype("category")

# Merge dataframes and get target sentences by id
merged_df = summaries_df.merge(papers_df, on="paper_id", how="left").dropna(subset="target_sid")

annotations_df = merged_df[["summary_id", "paper_id", "source_sid", "target_sid", "source_text"]].copy()
annotations_df["label"] = merged_df.apply(create_labels, axis=1)
annotations_df["target_text"] = merged_df.apply(lambda row: list(row["paper_text"].values()), axis=1)

annotations_df["target_sids"] = merged_df.apply(lambda row: list(row["paper_text"].keys()), axis=1)
annotations_binary = annotations_df.explode(["label", "target_text", "target_sids"]).reset_index(drop=True)
annotations_binary = annotations_binary.drop(columns=["target_sid"]).rename(columns={"target_sids": "target_sid"})

annotations_binary.to_pickle("../data/binary-dataset.pkl")

annotations_binary.info()  # info() prints its report itself and returns None, so it is not wrapped in print()
display(annotations_binary.head(10))

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29667 entries, 0 to 29666
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   summary_id   29667 non-null  object
 1   paper_id     29667 non-null  object
 2   source_sid   29667 non-null  string
 3   source_text  29667 non-null  object
 4   label        29667 non-null  object
 5   target_text  29667 non-null  object
 6   target_sid   29667 non-null  object
dtypes: object(6), string(1)
memory usage: 1.6+ MB


Unnamed: 0,summary_id,paper_id,source_sid,source_text,label,target_text,target_sid
0,C00-2123,C00-2123,1,The authors in this paper describe a search procedure for statistical machine translation (MT) based on dynamic programming (DP).,0,Word Re-ordering and DP-based Search in Statistical Machine Translation,0
1,C00-2123,C00-2123,1,The authors in this paper describe a search procedure for statistical machine translation (MT) based on dynamic programming (DP).,1,"In this paper, we describe a search procedure for statistical machine translation (MT) based on dynamic programming (DP).",1
2,C00-2123,C00-2123,1,The authors in this paper describe a search procedure for statistical machine translation (MT) based on dynamic programming (DP).,0,"Starting from a DP-based solution to the traveling salesman problem, we present a novel technique to restrict the possible word reordering between source and target language in order to achieve an efficient search algorithm.",2
3,C00-2123,C00-2123,1,The authors in this paper describe a search procedure for statistical machine translation (MT) based on dynamic programming (DP).,0,A search restriction especially useful for the translation direction from German to English is presented.,3
4,C00-2123,C00-2123,1,The authors in this paper describe a search procedure for statistical machine translation (MT) based on dynamic programming (DP).,0,"The experimental tests are carried out on the Verbmobil task (GermanEnglish, 8000-word vocabulary), which is a limited-domain spoken-language task.",4
5,C00-2123,C00-2123,1,The authors in this paper describe a search procedure for statistical machine translation (MT) based on dynamic programming (DP).,0,The goal of machine translation is the translation of a text given in some source language into a target language.,5
6,C00-2123,C00-2123,1,The authors in this paper describe a search procedure for statistical machine translation (MT) based on dynamic programming (DP).,0,"We are given a source string fJ 1 = f1:::fj :::fJ of length J, which is to be translated into a target string eI 1 = e1:::ei:::eI of length I. Among all possible target strings, we will choose the string with the highest probability: ^eI 1 = arg max eI 1 fPr(eI 1jfJ 1 )g = arg max eI 1 fPr(eI 1) Pr(fJ 1 jeI 1)g : (1) The argmax operation denotes the search problem, i.e. the generation of the output sentence in the target language.",6
7,C00-2123,C00-2123,1,The authors in this paper describe a search procedure for statistical machine translation (MT) based on dynamic programming (DP).,0,"Pr(eI 1) is the language model of the target language, whereas Pr(fJ 1 jeI1) is the transla tion model.",7
8,C00-2123,C00-2123,1,The authors in this paper describe a search procedure for statistical machine translation (MT) based on dynamic programming (DP).,0,Our approach uses word-to-word dependencies between source and target words.,8
9,C00-2123,C00-2123,1,The authors in this paper describe a search procedure for statistical machine translation (MT) based on dynamic programming (DP).,0,"The model is often further restricted so that each source word is assigned to exactly one target word (Brown et al., 1993; Ney et al., 2000).",9
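
The ten rows above all belong to the first summary sentence because `explode` turns the per-row lists of labels, sentences, and sentence ids into one row per (summary sentence, paper sentence) pair. Exploding several columns at once requires pandas 1.3 or later and list columns of equal length within each row. A toy sketch of the same mechanics, assuming a hypothetical three-sentence paper:

```python
import pandas as pd

# Hypothetical paper: sentence id -> sentence text, with the title at id "0"
toy_paper = {"0": "Title", "1": "First sentence.", "2": "Second sentence."}

# Two annotated summary sentences pointing at paper sentences "1" and "0"
toy = pd.DataFrame(
    {
        "source_sid": ["1", "2"],
        "target_sid": ["1", "0"],
        "paper_text": [toy_paper, toy_paper],
    }
)

# One list entry per paper sentence, mirroring create_labels
toy["label"] = [
    [1 if target == sid else 0 for sid in paper]
    for target, paper in zip(toy["target_sid"], toy["paper_text"])
]
toy["target_text"] = toy["paper_text"].apply(lambda paper: list(paper.values()))
toy["target_sids"] = toy["paper_text"].apply(lambda paper: list(paper.keys()))

# Expand the three aligned list columns into one row per sentence pair
toy_pairs = toy.explode(["label", "target_text", "target_sids"]).reset_index(drop=True)

# explode leaves the labels as dtype object; cast if a numeric dtype is needed
toy_pairs["label"] = toy_pairs["label"].astype("int8")
print(toy_pairs[["source_sid", "target_sids", "label"]])
```

In this toy frame exactly one pair per summary sentence is positive. The final cast is optional and is not part of the cell above, where `label` stays `object`, as the `info()` report shows.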
