# ReqEval Task: Detecting and Resolving Referential Ambiguity

by Group #2

Güray Baydur, Çağlar Fırat, Merve Gül Kantarcı, Baturalp Yörük

# Preparing & Preprocessing Dataset

This is the preprocessing step. It reshapes the dataset into a plain-text format that the GPT-2 model can be fine-tuned on directly.
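To make the target format concrete, here is a minimal sketch of how one tagged requirement and its resolution could be turned into a single training line. The helper name `format_training_line` is hypothetical and not part of the notebook, and the input is assumed to carry `<referential>` tags without surrounding spaces; the token layout follows the preprocessing cell below.

```python
# Special tokens, as used by the preprocessing cell of this notebook.
START, END = "<|startoftext|> ", " <|endoftext|>\n"


def format_training_line(requirement, resolutions):
    """Build one GPT-2 training line (hypothetical helper, for illustration)."""
    # Pad the tags with spaces so the pronoun stands apart from the tags.
    text = requirement.replace("<referential>", "<referential> ")
    text = text.replace("</referential>", " </referential>")
    # Several answers for one requirement are joined with the <next> token.
    answer = " <next> ".join(resolutions)
    return START + text + " " + answer + END


line = format_training_line(
    "The Clarus system shall process data as <referential>they</referential> are received.",
    ["data"],
)
print(line, end="")
```

The requirement is the prefix and the resolution is what the model learns to generate after it, which is why the answer is appended just before the end token.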

In [None]:
from google.colab import drive
drive.mount('/content/drive/')

Mounted at /content/drive/


In [None]:
%cd /content/drive/My Drive/RE-T2

/content/drive/My Drive/RE-T2


In [None]:
"""
Please enter path to data and disambiguation_answers.
Please use the ones in csv format. 
"""
data_path = "training_set.csv"
disambiguation_path = "disambiguation_answers_file.csv"
""" Train/test ratio, enter 1.0 to use the whole dataset for training """
train_ratio = 1.0

In [None]:
import random
import pandas
import re

colnames_data = ['sent_id', 'sent']
colnames_resolution = ['sent_id', 'resolution']
data = pandas.read_csv(data_path, names=colnames_data, quotechar='"', delimiter=',', skipinitialspace=True)
solutions = pandas.read_csv(disambiguation_path, names=colnames_resolution, quotechar='"', delimiter=',', skipinitialspace=True)

dataset = []
testset = []
unambigious_idxs = [] # use only unambigious ones to find references
ready_trainset_path = "ReqEvalTrainset.txt"

requirements_ids = data.sent_id.tolist()
requirements = data.sent.tolist()
resolutions = solutions.resolution.tolist()
resolution_ids = solutions.sent_id.tolist()

# special tokens for GPT2
r, e = "<|startoftext|> ", " <|endoftext|>\n" 

for row_idx in range(1, len(requirements_ids)):
    line = ""
    # find multiple references in one requirement
    refcodes = re.findall('<referential id=.*?>', requirements[row_idx])
    if len(refcodes) > 0:
        creq = requirements[row_idx]
        subsls = []
        for refcode in refcodes:
            code = refcode.replace('"', '').split("=")[1][:-1]
            req_idx = requirements_ids[row_idx] + "-" + code
            if req_idx in resolution_ids:
                creq = creq.replace(refcode, "<referential> ")
                subsls.append(resolutions[resolution_ids.index(req_idx)])
            else:
                creq = creq[creq.find(refcode) + len(refcode):].replace("</referential>", "").replace("<referential>", "")
        if len(subsls) > 0: # if at least one sub reference is unambigious
          line += r + creq.replace("</referential>", " </referential>").strip('"')
          for sl in subsls:
            # there is more than one reference to solve in the requirement
            line += " " + sl + " <next> "
          line = line[:-1*len(" <next> ")]
          line += e
          dataset.append(line)
          unambigious_idxs.append(row_idx)
    else: # only one referential tag in requirement
        if requirements_ids[row_idx] in resolution_ids:
            # make sure to leave whitespace between <referential> </referential> tags, needed for GPT2  
            line += r + requirements[row_idx].replace("<referential>", "<referential> ").replace("</referential>", " </referential>").strip('"') + " " + resolutions[resolution_ids.index(requirements_ids[row_idx])]
            line += e
            dataset.append(line)
            unambigious_idxs.append(row_idx)

# randomly splits for test/train
train_ints = random.sample(range(len(unambigious_idxs)), int(round(len(unambigious_idxs)*train_ratio)))
trainset = [dataset[trainidx] for trainidx in train_ints]

# write the training file for GPT-2
with open(ready_trainset_path, "w") as f:
    f.writelines(trainset)

# store the split so that the evaluation cells can rebuild it later
with open("ReqEvalSplitIdx.txt", "w") as f2:
    f2.write(str(unambigious_idxs) + "\n")
    f2.write(str(train_ints) + "\n")

print("===============Processed Dataset===============")
print("Train - Test Ratio: " + str(train_ratio))
print("Total number of requirements: " + str(len(dataset)))
print("-----------------------------------------------")
print("Size of train set: " + str(len(trainset)))
if train_ratio != 1.0:
  print("Size of test set: " + str(len(unambigious_idxs) - len(trainset)))

Train - Test Ratio: 1.0
Total number of requirements: 69
-----------------------------------------------
Size of train set: 69
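
For requirements with several pronouns, the preprocessing cell above looks up each answer under an ID of the form `<sent_id>-<code>` (for example `library#12-a`). The following is a small sketch of that ID derivation, assuming the tags look like `<referential id="a">`. The helper name `resolution_keys` and the sample sentence are made up for illustration.

```python
import re

# Matches tags such as <referential id="a"> and captures the code ("a").
TAG_WITH_ID = re.compile(r'<referential id="?([^">]+)"?>')


def resolution_keys(sent_id, requirement):
    """Return the answer-file keys for one requirement (hypothetical helper)."""
    codes = TAG_WITH_ID.findall(requirement)
    if not codes:
        # A single tag without an id is looked up by the sentence ID itself.
        return [sent_id]
    return [sent_id + "-" + code for code in codes]


# Made-up sentence, only to show the tag layout.
sample = ('The user selects <referential id="a">it</referential> and '
          '<referential id="b">they</referential> are stored.')
print(resolution_keys("library#12", sample))
```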


# Downloading GPT-2

We first download the base GPT-2 model so that we can fine-tune it later.

GPT-2 is available in the following sizes:

* `124M` (default): the "small" model
* `355M`: the "medium" model
* `774M`: the "large" model
* `1558M`: the "extra large", 'real' model

We tried only the first two because of hardware limitations.
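
A size label can be mapped to the model name that the download step expects. This is a minimal sketch using the four names listed above; the dictionary and the helper `resolve_model_name` are illustrative, and the notebook itself only switches between the first two.

```python
# Size labels mapped to the model names listed above (illustrative mapping).
MODEL_NAMES = {
    "small": "124M",
    "medium": "355M",
    "large": "774M",
    "extra large": "1558M",
}


def resolve_model_name(model_size):
    """Translate a size label into a model name (hypothetical helper)."""
    try:
        return MODEL_NAMES[model_size]
    except KeyError:
        valid = ", ".join(sorted(MODEL_NAMES))
        raise ValueError("unknown model size %r, expected one of: %s" % (model_size, valid))


print(resolve_model_name("medium"))
```

Raising on an unknown label avoids silently falling back to a larger model than the hardware can handle.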

In [None]:
""" Below we import the required library """
%tensorflow_version 1.x
!pip install -q gpt-2-simple
import gpt_2_simple as gpt2

print("\nLibrary installation successful!")

TensorFlow 1.x selected.
  Building wheel for gpt-2-simple (setup.py) ... done
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.


Library installation successful!


In [None]:
""" 
PLEASE VERIFY THAT YOU ARE ACTUALLY USING GPU! 
OTHERWISE THE WHOLE PROCESS TAKES MUCH LONGER TIME...
"""
!nvidia-smi

Mon Jan 11 20:33:10 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.27.04    Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   43C    P0    28W /  70W |    111MiB / 15079MiB |      0%      Default |
|                               |                      |                 ERR! |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [None]:
run_name = 'Full2' # unique run name to train a model
model_size = "medium" # small or medium

model_name = "124M" if model_size == "small" else "355M"

print("\nDownloading GPT-2 base model...")
gpt2.download_gpt2(model_name=model_name)
print("\nDownload successful!")


Downloading GPT-2 base model...


Fetching checkpoint: 1.05Mit [00:00, 258Mit/s]                                                      
Fetching encoder.json: 1.05Mit [00:00, 92.2Mit/s]                                                   
Fetching hparams.json: 1.05Mit [00:00, 230Mit/s]                                                    
Fetching model.ckpt.data-00000-of-00001: 1.42Git [00:30, 46.2Mit/s]                                 
Fetching model.ckpt.index: 1.05Mit [00:00, 311Mit/s]                                                
Fetching model.ckpt.meta: 1.05Mit [00:00, 130Mit/s]                                                 
Fetching vocab.bpe: 1.05Mit [00:00, 128Mit/s]                                                       


Download successful!





# Finetuning GPT-2

Trained models are saved to `checkpoint/<run_name>` inside the current working directory. As long as you keep the same run name, the later cells find the checkpoint automatically.
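
To check which training step a run has reached, the checkpoint folder can be inspected directly. The sketch below assumes that TensorFlow stores each checkpoint with an index file named `model-<step>.index`, which matches the `model-1000` and `model-2000` names in the logs of this notebook. The helper `latest_checkpoint_step` is hypothetical and not part of gpt-2-simple.

```python
import os
import re


def latest_checkpoint_step(run_name, checkpoint_dir="checkpoint"):
    """Return the highest saved step of a run, or None (hypothetical helper)."""
    run_dir = os.path.join(checkpoint_dir, run_name)
    if not os.path.isdir(run_dir):
        return None
    steps = []
    for file_name in os.listdir(run_dir):
        # Assumed naming scheme: model-<step>.index
        match = re.fullmatch(r"model-(\d+)\.index", file_name)
        if match:
            steps.append(int(match.group(1)))
    return max(steps) if steps else None


# Prints None when no checkpoint exists for this run name yet.
print(latest_checkpoint_step("Full2"))
```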


In [None]:
# reset if there is a session already going on
try:
  gpt2.reset_session(sess)
except NameError:
  pass
# start tf session
sess = gpt2.start_tf_sess()
#Finetune!
gpt2.finetune(sess,
              dataset=ready_trainset_path,
              model_name=model_name,
              steps=1000,
              restore_from='latest',
              run_name=run_name,
              print_every=1,
              sample_every=500,
              save_every=500
              )

Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
Instructions for updating:
Please use tensorflow.python.ops.op_selector.get_backward_walk_ops.
Loading checkpoint checkpoint/08BigModel1k/model-1000
INFO:tensorflow:Restoring parameters from checkpoint/08BigModel1k/model-1000


100%|██████████| 1/1 [00:00<00:00, 222.71it/s]

Loading dataset...
dataset has 2932 tokens
Training...
Saving checkpoint/08BigModel1k/model-1000





> such that local open channel communications are still possible. to be used <|endoftext|>
<|startoftext|> The shunting leader can then choose the moment when <referential> he </referential> takes or rejects the call. the shunting leader <|endoftext|>
<|startoftext|> Redundancy Management picks up where <referential> it </referential> left off in these attempts. redundancy management <|endoftext|>
<|startoftext|> The ETCS trainborne equipment shall transmit <referential> its </referential> own train identification to the RBC. ETCS trainbome equipment <|endoftext|>
<|startoftext|> The Clarus system shall process data as <referential> they </referential> are received. data <|endoftext|>
<|startoftext|> Non leading drivers shall be able to indicate to the system <referential> their </referential> location in the train during the registration procedure (2nd driver, 3rd driver etc). non leading drivers <|endoftext|>
<|startoftext|> It shall be possible for the user to find and display store

# Evaluating The Model

Here we check how well the model learned the task, and what exactly it learned.
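
One simple way to put a number on this is exact-match accuracy between the generated answers and the reference resolutions. The sketch below is an assumption and not the official ReqEval scoring: it lowercases both strings and collapses whitespace before comparing. The function names and the toy data are hypothetical.

```python
def normalize(answer):
    """Lowercase and collapse whitespace so formatting differences are ignored."""
    return " ".join(answer.lower().split())


def exact_match_accuracy(predictions, references):
    """Share of reference answers reproduced exactly (hypothetical helper).

    Both arguments map a sentence ID to an answer string.
    """
    if not references:
        return 0.0
    hits = sum(
        1
        for sent_id, gold in references.items()
        if normalize(predictions.get(sent_id, "")) == normalize(gold)
    )
    return hits / len(references)


# Toy data, made up for illustration only.
gold = {"req#1": "material", "req#2": "the QEDC"}
pred = {"req#1": " Material ", "req#2": "observations"}
print(exact_match_accuracy(pred, gold))
```

Exact matching is strict: an answer that is correct but worded differently counts as a miss, so the score is best read as a lower bound.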

In [None]:
""" 
If you did not use a split (e.g. train_ratio=1.0), you can point data_path below to separate test data.
It should be in the same format as training_set.csv
"""
import ast

# loads the random split info back
with open("ReqEvalSplitIdx.txt", "r") as split_file:
    split_lists = split_file.readlines()
# ast.literal_eval parses the stored list text whether or not the line ends with a newline
unambigious_idxs = ast.literal_eval(split_lists[0].strip())
train_ints = ast.literal_eval(split_lists[1].strip())

# re-define these if the variables were lost (e.g. after a runtime restart)
run_name = 'Full2' # unique run name
data_path = "training_set.csv"
disambiguation_path = "disambiguation_answers_file.csv"

## Train Accuracy

The cell below builds a dataset from the training-set requirements to measure how well the model performs on the data it was trained on.

In [None]:
import pandas
import re

colnames_data = ['sent_id', 'sent']
colnames_resolution = ['sent_id', 'resolution']
data = pandas.read_csv(data_path, names=colnames_data, quotechar='"', delimiter=',', skipinitialspace=True)
solutions = pandas.read_csv(disambiguation_path, names=colnames_resolution, quotechar='"', delimiter=',', skipinitialspace=True)

requirements_ids = data.sent_id.tolist()
requirements = data.sent.tolist()
resolutions = solutions.resolution.tolist()
resolution_ids = solutions.sent_id.tolist()

train_requirements_ids = [requirements_ids[i] for i in unambigious_idxs if unambigious_idxs.index(i) in train_ints]
train_requirements = [requirements[i] for i in unambigious_idxs if unambigious_idxs.index(i) in train_ints]

train_dataset = []
train_solution_ids = []

r, e = "<|startoftext|> ", " <|endoftext|>\n"

# make train dataset for gpt2
for row_idx in range(len(train_requirements_ids)):
    refcodes = re.findall('<referential id=.*?>', train_requirements[row_idx])
    if len(refcodes) > 0:
        creq = train_requirements[row_idx]
        subsls = []
        for refcode in refcodes:
            code = refcode.replace('"', '').split("=")[1][:-1]
            req_idx = train_requirements_ids[row_idx] + "-" + code
            if req_idx in resolution_ids:
                creq = creq.replace(refcode, "<referential> ")
                subsls.append(resolutions[resolution_ids.index(req_idx)])
                train_solution_ids.append(req_idx)
            else:
                creq = creq[creq.find(refcode) + len(refcode):].replace("</referential>", "").replace("<referential>", "")
        if len(subsls) > 0:
          train_dataset.append(r + creq.replace("</referential>", " </referential>").strip('"'))
    else:
        if train_requirements_ids[row_idx] in resolution_ids:
            train_dataset.append(r + train_requirements[row_idx].replace("<referential>", "<referential> ").replace("</referential>", " </referential>").strip('"'))
            train_solution_ids.append(train_requirements_ids[row_idx])

print(len(train_solution_ids))
print(train_solution_ids)
print("Size of train set: " + str(len(train_dataset)))

70
['library#02', 'library#03', 'library#04', 'library#05', 'library#07', 'library#11', 'library#12-a', 'library#12-b', 'library#18', 'library#19', 'library#21', 'weather#01', 'weather#02', 'weather#03', 'weather#05', 'weather#07', 'weather#08', 'weather#12', 'weather#13', 'weather#14', 'weather#15', 'weather#17', 'weather#18', 'weather#19', 'weather#20', 'railway#01', 'railway#02', 'railway#05', 'railway#08', 'railway#09', 'railway#10', 'railway#11-a', 'railway#12', 'railway#13', 'railway#14', 'railway#15', 'railway#16', 'railway#18', 'railway#19', 'railway#20', 'railway#22', 'railway#23', 'railway#24', 'railway#27', 'railway#29', 'railway#30', 'railway#32', 'railway#34-a', 'railway#35', 'railway#36', 'railway#39', 'railway#41', 'railway#43-b', 'railway#44', 'railway#45', 'railway#48', 'railway#51', 'railway#52', 'railway#54', 'railway#59', 'railway#61', 'railway#62', 'home#03', 'space#01', 'space#02-a', 'space#02-b', 'space#10', 'space#12', 'space#13', 'space#20']
Size of train set: 

In [None]:
# reset if there is a session corrupted due to Colab time limit
try:
  gpt2.reset_session(sess)
except NameError:
  pass
sess = gpt2.start_tf_sess()
gpt2.load_gpt2(sess, run_name=run_name)

Loading checkpoint checkpoint/Full2/model-2000
INFO:tensorflow:Restoring parameters from checkpoint/Full2/model-2000


In [None]:
out_file_name = run_name + "-train-output.csv"
gen_file = open(out_file_name, "w", encoding="utf-8")
gen_file.write("sent_id, decision")

outlist = []
req_id_idx = 0
for train_idx in range(len(train_dataset)):
  print("Requirement:\n" + train_dataset[train_idx])
  outlist.extend(gpt2.generate(sess, 
    length=80,
    temperature=0.4,
    prefix=train_dataset[train_idx],
    nsamples=1,
    batch_size=1,
    truncate="<|endoftext|>",
    include_prefix=False,
    run_name=run_name,
    return_as_list=True
    ))
  # when there is more than one reference to find
  if train_dataset[train_idx].count("<referential>") > 1:
    # separate with <next> token
    mul_ans = outlist[-1][1:-1].split("<next>")
    for subidx in range(len(mul_ans)):
      gen_file.write('\n"' + train_solution_ids[req_id_idx] + '"' + ', "' + mul_ans[subidx].strip() + '"')
      req_id_idx += 1
  else:
    gen_file.write('\n"' + train_solution_ids[req_id_idx] + '"' + ', "' + outlist[-1][1:-1] + '"')
    req_id_idx += 1
  if train_idx % 5 == 0:
     gpt2.reset_session(sess)
     gen_file.close()
     gen_file = open(out_file_name, "a", encoding="utf-8")
     sess = gpt2.start_tf_sess()
     gpt2.load_gpt2(sess, run_name=run_name)

  print("Solution:\n" + outlist[-1])

gen_file.close()

Requirement:
<|startoftext|> The library may want to accept important digital materials in non-standard formats in case we are able to migrate <referential> them </referential> to a more usable format in the future.
Loading checkpoint checkpoint/Full2/model-2000
INFO:tensorflow:Restoring parameters from checkpoint/Full2/model-2000
Solution:
 digital materials 
Requirement:
<|startoftext|> Once material has arrived, <referential> it </referential> must undergo several reviews, including virus checking, format compliance and anticipated content and file type.
Solution:
 material 
Requirement:
<|startoftext|> Allows resources to be reviewed before a decision is made whether <referential> they </referential> should be retained.
Solution:
 resources 
Requirement:
<|startoftext|> Allows metadata to be stored in a database in a manner that conforms to repository reformatting and linked to <referential> their </referential> corresponding objects via an identifier.
Solution:
 metadata 
Requirem
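
The generation loop in this section splits a multi-reference answer on the `<next>` token. The model may produce more or fewer answers than there are tags, so a small post-processing step can keep the output rows aligned with the IDs. This is only a sketch; the helper `split_answers` is hypothetical and the notebook does not use it.

```python
def split_answers(generated, n_refs):
    """Split generated text into exactly n_refs answers (hypothetical helper)."""
    parts = [part.strip() for part in generated.split("<next>")]
    # Drop surplus answers, or pad with empty strings when some are missing.
    return parts[:n_refs] + [""] * (n_refs - len(parts))


print(split_answers(" the user <next> the documents ", 2))
print(split_answers(" data ", 2))
```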

## Test Accuracy

The cell below builds a dataset from the test-set requirements to measure how well the model performs on held-out data.

Please run this only if you used a train_ratio other than 1.0.
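
The test set is simply the complement of the stored training positions. The sketch below shows that selection on its own; `held_out_positions` is a hypothetical helper name, and the positions refer to entries of the list of resolvable requirements.

```python
def held_out_positions(n_items, train_positions):
    """Return the positions that were not sampled for training (hypothetical helper)."""
    train = set(train_positions)  # set lookup keeps the filter linear in n_items
    return [pos for pos in range(n_items) if pos not in train]


print(held_out_positions(5, [0, 3]))
```

With train_ratio set to 1.0, every position is in the training sample, so the result is an empty list and there is nothing to evaluate.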

In [None]:
import pandas
import re

colnames_data = ['sent_id', 'sent']
colnames_resolution = ['sent_id', 'resolution']
data = pandas.read_csv(data_path, names=colnames_data, quotechar='"', delimiter=',', skipinitialspace=True)
solutions = pandas.read_csv(disambiguation_path, names=colnames_resolution, quotechar='"', delimiter=',', skipinitialspace=True)

requirements_ids = data.sent_id.tolist()
requirements = data.sent.tolist()
resolutions = solutions.resolution.tolist()
resolution_ids = solutions.sent_id.tolist()

test_requirements_ids = [requirements_ids[i] for i in unambigious_idxs if unambigious_idxs.index(i) not in train_ints]
test_requirements = [requirements[i] for i in unambigious_idxs if unambigious_idxs.index(i) not in train_ints]

test_dataset = []
test_solution_ids = []

r, e = "<|startoftext|> ", " <|endoftext|>\n"

# make test dataset for gpt2
for row_idx in range(len(test_requirements_ids)):
    refcodes = re.findall('<referential id=.*?>', test_requirements[row_idx])
    if len(refcodes) > 0:
        creq = test_requirements[row_idx]
        subsls = []
        for refcode in refcodes:
            code = refcode.replace('"', '').split("=")[1][:-1]
            req_idx = test_requirements_ids[row_idx] + "-" + code
            if req_idx in resolution_ids:
                creq = creq.replace(refcode, "<referential> ")
                subsls.append(resolutions[resolution_ids.index(req_idx)])
                test_solution_ids.append(req_idx)
            else:
                creq = creq[creq.find(refcode) + len(refcode):].replace("</referential>", "").replace("<referential>", "")
        if len(subsls) > 0:
          test_dataset.append(r + creq.replace("</referential>", " </referential>").strip('"'))
    else:
        if test_requirements_ids[row_idx] in resolution_ids:
            test_dataset.append(r + test_requirements[row_idx].replace("<referential>", "<referential> ").replace("</referential>", " </referential>").strip('"'))
            test_solution_ids.append(test_requirements_ids[row_idx])

print("Size of test set: " + str(len(test_dataset)))

Size of test set: 15


In [None]:
# reset if there is a session corrupted due to Colab time limit
try:
  gpt2.reset_session(sess)
except NameError:
  pass
sess = gpt2.start_tf_sess()
gpt2.load_gpt2(sess, run_name=run_name)

Loading checkpoint checkpoint/08BigModel1k/model-2000
INFO:tensorflow:Restoring parameters from checkpoint/08BigModel1k/model-2000


In [None]:
out_file_name = run_name + "-test-output.csv"
gen_file = open(out_file_name, "w", encoding="utf-8")
gen_file.write("sent_id, decision")

outlist = []
req_id_idx = 0
for test_idx in range(len(test_dataset)):
  print("Requirement:\n" + test_dataset[test_idx])
  outlist.extend(gpt2.generate(sess, 
    length=80,
    temperature=0.4,
    prefix=test_dataset[test_idx],
    nsamples=1,
    batch_size=1,
    truncate="<|endoftext|>",
    include_prefix=False,
    run_name=run_name,
    return_as_list=True
    ))
  # when there is more than one reference to find
  n_refs = test_dataset[test_idx].count("<referential>")
  if n_refs > 1:
    # separate with <next> token
    mul_ans = outlist[-1][1:-1].split("<next>")
    for subidx in range(n_refs):
      # leave the decision empty if the model produced fewer answers than tags
      answer = mul_ans[subidx].strip() if subidx < len(mul_ans) else ""
      gen_file.write('\n"' + test_solution_ids[req_id_idx] + '"' + ', "' + answer + '"')
      req_id_idx += 1
  else:
    gen_file.write('\n"' + test_solution_ids[req_id_idx] + '"' + ', "' + outlist[-1][1:-1] + '"')
    req_id_idx += 1
  if test_idx % 5 == 0:
     gpt2.reset_session(sess)
     gen_file.close()
     gen_file = open(out_file_name, "a", encoding="utf-8")
     sess = gpt2.start_tf_sess()
     gpt2.load_gpt2(sess, run_name=run_name)

  print("Solution:\n" + outlist[-1])

gen_file.close()

Requirement:
<|startoftext|> Allows resources to be reviewed before a decision is made whether <referential> they </referential> should be retained.
Loading checkpoint checkpoint/08BigModel1k/model-2000
INFO:tensorflow:Restoring parameters from checkpoint/08BigModel1k/model-2000
Solution:
 resources 
Requirement:
<|startoftext|> The CAS shall be configurable to allow new observation types to be implemented as <referential> they </referential> become available.
Solution:
 new observation types 
Requirement:
<|startoftext|> The QEDC shall maintain observations and <referential> their </referential> associated quality flags for seven days.
Solution:
 observations 
Requirement:
<|startoftext|> The QEDC shall support queries for <referential> its </referential> observations.
Solution:
 the QEDC 
Requirement:
<|startoftext|> The Clarus program shall inform contributors of the acceptance and use of <referential> their </referential> data and information.
Solution:
 contributors 
Requirement:


# LICENSE

MIT License

Copyright (c) 2019 Max Woolf

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.