In [50]:
import pandas as pd
import glob

pd.set_option("display.max_columns", None)
pd.set_option("display.expand_frame_repr", False)
pd.set_option("display.max_colwidth", 800)

In [51]:
# Read the annotation files (JSON); sort the paths so the row order is reproducible
# Drop rows whose summary sentence could not be matched to a target sentence
annotation_files = sorted(glob.glob("../data/annotation/*.json"))
summaries_df = pd.concat([pd.read_json(f) for f in annotation_files], ignore_index=True).dropna(subset=["target_sid"])

In [52]:
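# Ids are cast to int first so that a float-typed id such as 7.0 becomes "7",
# the string form that is later looked up in the paper_text dict
summaries_df["source_sid"] = summaries_df["source_sid"].astype("int32").astype("string")
summaries_df["target_sid"] = summaries_df["target_sid"].astype("int32").astype("string")
summaries_df["strategy"] = summaries_df["strategy"].astype("category")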

# Merge dataframes and get target sentences by id
papers_df = pd.read_pickle("../data/papers.pkl")
annotations_df = summaries_df.merge(papers_df, on="paper_id", how="left")
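
Because the merge uses how="left", an annotation whose paper_id is absent from papers_df is kept with NaN in paper_text, and the later row["paper_text"].get(...) call would then fail. A minimal sketch of a check for such rows, using the `indicator` parameter of `DataFrame.merge`. The toy frames, ids, and sentence texts below are invented for illustration and are not taken from the dataset:

```python
import pandas as pd

# Toy stand-ins for summaries_df and papers_df (illustrative values only)
toy_summaries = pd.DataFrame({
    "paper_id": ["A00-0001", "A00-0001", "X00-0000"],
    "target_sid": ["1", "2", "5"],
})
toy_papers = pd.DataFrame({
    "paper_id": ["A00-0001"],
    "paper_text": [{"0": "Title", "1": "First sentence.", "2": "Second sentence."}],
})

# indicator=True adds a "_merge" column: "both", "left_only" or "right_only"
merged = toy_summaries.merge(toy_papers, on="paper_id", how="left", indicator=True)

# Rows flagged "left_only" found no matching paper
unmatched_ids = merged.loc[merged["_merge"] == "left_only", "paper_id"].unique().tolist()
print(unmatched_ids)
```

If the list is non-empty, those rows could be dropped or inspected before the per-row lookups run.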

In [53]:
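def create_binary_labels(row):
    """Return one label per sentence in the paper: 1 for the target sentence, 0 otherwise."""
    sid = row["target_sid"]
    # Iterating over the dict yields its keys (sentence ids) in insertion order
    return [1 if sid == sentence_id else 0 for sentence_id in row["paper_text"]]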

annotations_df["target_text"] = annotations_df.apply(lambda row: row["paper_text"].get(row["target_sid"]), axis=1)
annotations_df["target_doc"] = annotations_df.apply(lambda row: list(row["paper_text"].values()), axis=1)
annotations_df["labels"] = annotations_df.apply(create_binary_labels, axis=1)

# Column ordering
annotations_df = annotations_df[
    [
        "summary_id", "paper_id", "source_sid", "target_sid",
        "source_text", "target_text", "labels", "target_doc", "strategy",
    ]
]

annotations_df.to_pickle("../data/annotations.pkl")
annotations_df.to_csv("../data/annotations.csv", index=False)

display(annotations_df.sample(5))

Unnamed: 0,summary_id,paper_id,source_sid,target_sid,source_text,target_text,labels,target_doc,strategy
116,D10-1044_swastika,D10-1044,2,2,"They extended previous work on discriminative weighting by using a finer granularity, focusing on the properties of instances rather than corpus components, and used simpler training procedure.","This extends previous work on discriminative weighting by using a finer granularity, focusing on the properties of instances rather than corpus components, and using a simpler training procedure.","[0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...]","[Discriminative Instance Weighting for Domain Adaptation in Statistical Machine Translation, We describe a new approach to SMT adaptation that weights out-of-domain phrase pairs according to their relevance to the target domain, determined by both how similar to it they appear to be, and whether they belong to general language or not., This extends previous work on discriminative weighting by using a finer granularity, focusing on the properties of instances rather than corpus components, and using a simpler training procedure., We incorporate instance weighting into a mixture-model framework, and find that it yields consistent improvements over a wide range of baselines., Domain adaptation is a common concern when optimizing empirical NLP applications., Even when there is training dat...",extractive
57,C02-1025,C02-1025,4,63,They have made use of local and global features to deal with the instances of same token in a document.,Global features are extracted from other occurrences of the same token in the whole document.,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...]","[Named Entity Recognition: A Maximum Entropy Approach Using Global Information, This paper presents a maximum entropy-based named entity recognizer (NER)., It differs from previous machine learning-based NERs in that it uses information from the whole document to classify each word, with just one classifier., Previous work that involves the gathering of information from the whole document often uses a secondary classifier, which corrects the mistakes of a primary sentence- based classifier., In this paper, we show that the maximum entropy framework is able to make use of global information directly, and achieves performance that is comparable to the best previous machine learning-based NERs on MUC6 and MUC7 test data., Considerable amount of work has been done in recent years on the na...",abstractive
49,C10-1045,C10-1045,4,7,"Explanations for this phenomenon are relative informativeness of lexicalization, insensitivity to morphology and the effect of variable word order and these factors lead to syntactic disambiguation.","It is well-known that constituency parsing models designed for English often do not generalize easily to other languages and treebanks.1 Explanations for this phenomenon have included the relative informativeness of lexicalization (Dubey and Keller, 2003; Arun and Keller, 2005), insensitivity to morphology (Cowan and Collins, 2005; Tsarfaty and Sima’an, 2008), and the effect of variable word order (Collins et al., 1999).","[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...]","[Better Arabic Parsing: Baselines, Evaluations, and Analysis, In this paper, we offer broad insight into the underperformance of Arabic constituency parsing by analyzing the interplay of linguistic phenomena, annotation choices, and model design., First, we identify sources of syntactic ambiguity understudied in the existing parsing literature., Second, we show that although the Penn Arabic Treebank is similar to other tree- banks in gross statistical terms, annotation consistency remains problematic., Third, we develop a human interpretable grammar that is competitive with a latent variable PCFG., Fourth, we show how to build better models for three different parsers., Finally, we show that in application settings, the absence of gold segmentation lowers parsing performance by 2–5% ...",abstractive
48,C10-1045,C10-1045,3,7,It is well-known that English constituency parsing models do not generalize to other languages and treebanks.,"It is well-known that constituency parsing models designed for English often do not generalize easily to other languages and treebanks.1 Explanations for this phenomenon have included the relative informativeness of lexicalization (Dubey and Keller, 2003; Arun and Keller, 2005), insensitivity to morphology (Cowan and Collins, 2005; Tsarfaty and Sima’an, 2008), and the effect of variable word order (Collins et al., 1999).","[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...]","[Better Arabic Parsing: Baselines, Evaluations, and Analysis, In this paper, we offer broad insight into the underperformance of Arabic constituency parsing by analyzing the interplay of linguistic phenomena, annotation choices, and model design., First, we identify sources of syntactic ambiguity understudied in the existing parsing literature., Second, we show that although the Penn Arabic Treebank is similar to other tree- banks in gross statistical terms, annotation consistency remains problematic., Third, we develop a human interpretable grammar that is competitive with a latent variable PCFG., Fourth, we show how to build better models for three different parsers., Finally, we show that in application settings, the absence of gold segmentation lowers parsing performance by 2–5% ...",abstractive
132,W11-2123_vardha,W11-2123,7,279,"The code is open source, has minimal dependencies, and offers both C++ and Java interfaces for integration.","The code is opensource, has minimal dependencies, and offers both C++ and Java interfaces for integration.","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...]","[KenLM: Faster and Smaller Language Model Queries, We present KenLM, a library that implements two data structures for efficient language model queries, reducing both time and costs., The structure uses linear probing hash tables and is designed for speed., Compared with the widely- SRILM, our is 2.4 times as fast while using 57% of the mem- The structure is a trie with bit-level packing, sorted records, interpolation search, and optional quantization aimed lower memory consumption. simultaneously uses less memory than the smallest lossless baseline and less CPU than the baseline., Our code is thread-safe, and integrated into the Moses, cdec, and Joshua translation systems., This paper describes the several performance techniques used and presents benchmarks against alternative impleme...",extractive
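
In the sampled rows above, each labels vector holds a single 1 at the position of target_sid among the sentence ids of the paper. A small self-contained sketch of that invariant follows. The helper name `one_hot_labels` and the toy document are assumptions made for illustration; the helper mirrors `create_binary_labels` but takes plain arguments instead of a DataFrame row:

```python
def one_hot_labels(target_sid, paper_text):
    """Return a 0/1 list over the sentence ids of paper_text (hypothetical helper)."""
    return [int(target_sid == sid) for sid in paper_text]

# Invented example document: sentence id -> sentence text
paper_text = {"0": "Title", "1": "Abstract sentence.", "2": "Method sentence."}

labels = one_hot_labels("2", paper_text)
print(labels)  # [0, 0, 1]

# An id of the wrong type (int instead of str) never matches any key,
# so the resulting vector is all zeros
mismatch_total = sum(one_hot_labels(2, paper_text))
print(mismatch_total)  # 0
```

Checking that every labels vector sums to 1 is a cheap way to catch such type mismatches before the data is written out.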


In [54]:
from sklearn.model_selection import train_test_split

train_df, test_df = train_test_split(annotations_df, test_size=0.2, shuffle=True, random_state=42)

train_df.to_csv("../data/train.csv", index=False)
train_df.to_pickle("../data/train.pkl")

test_df.to_csv("../data/test.csv", index=False)
test_df.to_pickle("../data/test.pkl")
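
The split above shuffles individual annotation rows, so rows that share a paper_id (such as the two C10-1045 rows in the sample) can end up on both sides. If that overlap matters for the evaluation, one alternative is to split by paper instead. The sketch below is not part of the original pipeline: it uses only the standard library, and the function name `split_by_group` and the toy rows are assumptions. scikit-learn's `GroupShuffleSplit` offers the same idea for array-like data.

```python
import random

def split_by_group(rows, group_key, test_size=0.2, seed=42):
    """Split rows so that all rows sharing group_key land on the same side."""
    # Sort the distinct group ids so the shuffle is reproducible for a given seed
    groups = sorted({row[group_key] for row in rows})
    random.Random(seed).shuffle(groups)
    # test_size is a fraction of groups, not of rows; keep at least one test group
    n_test = max(1, round(len(groups) * test_size))
    test_groups = set(groups[:n_test])
    train = [row for row in rows if row[group_key] not in test_groups]
    test = [row for row in rows if row[group_key] in test_groups]
    return train, test

# Toy data: five papers with two annotation rows each (illustrative only)
rows = [{"paper_id": f"P{i // 2}", "summary_id": i} for i in range(10)]
train_rows, test_rows = split_by_group(rows, "paper_id")

train_papers = {row["paper_id"] for row in train_rows}
test_papers = {row["paper_id"] for row in test_rows}
print(len(train_rows), len(test_rows), train_papers & test_papers)
```

Note that the resulting row proportions only approximate test_size when papers contribute different numbers of annotations.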