> 
> **Univerisity of Pisa - M.Sc. Computer Science, Artificial Intelligence**  
> **Human Language Technologies - a.a. 2021/22**
>
> *September, 2022*
>
>**Authors** 
- Irene Pisani *i.pisani1@studenti.unipi.it* (560104)
- Alice Bergonzini *a.bergonzini1@studenti.unipi.it* (560680)
>

---
\\

###### **FINAL PROJECT on KEY POINT ANALYSIS**
# ***Track 1: Key Point Generation***

---


## **Settings**

- Define Colab GPU to use
- Download TR, VL e TS set from offial IBM reporsitory
- Install required tools
- Import useful libraries
- Load from check point Google mT5 tokenizer and Google mT5 seq2seq model

In [1]:
!nvidia-smi

Fri Aug 26 14:23:47 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   38C    P0    26W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [2]:
! git clone "https://github.com/IBM/KPA_2021_shared_task"

Cloning into 'KPA_2021_shared_task'...
remote: Enumerating objects: 44, done.[K
remote: Counting objects: 100% (44/44), done.[K
remote: Compressing objects: 100% (40/40), done.[K
remote: Total 44 (delta 14), reused 26 (delta 4), pack-reused 0[K
Unpacking objects: 100% (44/44), done.


In [3]:

!pip install datasets -q
!pip install transformers -q
!pip install sentencepiece -q
!pip install rouge_score -q


[K     |████████████████████████████████| 365 kB 8.9 MB/s 
[K     |████████████████████████████████| 212 kB 64.3 MB/s 
[K     |████████████████████████████████| 120 kB 54.2 MB/s 
[K     |████████████████████████████████| 115 kB 76.6 MB/s 
[K     |████████████████████████████████| 127 kB 73.1 MB/s 
[K     |████████████████████████████████| 4.7 MB 6.9 MB/s 
[K     |████████████████████████████████| 6.6 MB 54.3 MB/s 
[K     |████████████████████████████████| 1.3 MB 7.1 MB/s 
[?25h  Building wheel for rouge-score (setup.py) ... [?25l[?25hdone


In [4]:
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd

import os

import nltk
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, default_data_collator, Seq2SeqTrainer, Seq2SeqTrainingArguments, DataCollatorForSeq2Seq
from datasets import load_dataset, load_metric

In [5]:
nltk.download('punkt')
metric = load_metric('rouge')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Downloading builder script:   0%|          | 0.00/2.16k [00:00<?, ?B/s]

In [6]:
# define device to use: GPU - cuda
device = 'cuda' 

# load from checkpoint model and tokenizer 
tokenizer = AutoTokenizer.from_pretrained('google/mt5-small')       # import pre-trained MT5 tokenizer 
model = AutoModelForSeq2SeqLM.from_pretrained('google/mt5-small')   # import pre-trained MT5 model

Downloading tokenizer_config.json:   0%|          | 0.00/82.0 [00:00<?, ?B/s]

Downloading config.json:   0%|          | 0.00/553 [00:00<?, ?B/s]

Downloading spiece.model:   0%|          | 0.00/4.11M [00:00<?, ?B/s]

Downloading special_tokens_map.json:   0%|          | 0.00/99.0 [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/1.12G [00:00<?, ?B/s]

## **Dataset**

- Read TR, VL, TS set
- Prepare datasets for passing them to the model

In [7]:


def load_kpa_data(gold_data_dir, subset):

    # load arguments set, keypoint set and label set from given directory
    arguments_file = os.path.join(gold_data_dir, f"arguments_{subset}.csv")
    key_points_file = os.path.join(gold_data_dir, f"key_points_{subset}.csv")
    labels_file = os.path.join(gold_data_dir, f"labels_{subset}.csv")

    # read arguments set, keypoint set and label set in csv format as pandas dataframes
    arguments_df = pd.read_csv(arguments_file)
    key_points_df = pd.read_csv(key_points_file)
    labels_file_df = pd.read_csv(labels_file)
    
    # label set will be ignored
    # argument set anf keypoint set will be combined as follow

    # define required list to store <arguments, keypoint> pairs informations
    argument, keypoint, argument_id, keypoint_id, stance, topic = [], [], [],[], [], []

    # for each argument and for each key point under the same topic and stance 
    # ---->  create a pair <arguments, keypoint> with all relative info (such as topic, stance, id, text, etc... )
    for arg, arg_id,topic_arg,stance_arg  in zip(arguments_df['argument'],arguments_df['arg_id'],arguments_df['topic'],arguments_df['stance']):
      for kp,kp_id,topic_kp,stance_kp in zip(key_points_df['key_point'],key_points_df['key_point_id'],key_points_df['topic'],key_points_df['stance']):
        if (topic_arg == topic_kp and stance_arg == stance_kp):
          
          argument.append(arg)
          argument_id.append(arg_id)
          keypoint.append(kp)
          keypoint_id.append(kp_id)
          topic.append(topic_arg)
          stance.append(stance_arg)

    # use all the generated pair to create a final dataset (a dataframe)
    dataset_df = pd.DataFrame({'arg_id':argument_id,
                               'key_point_id':keypoint_id,
                               'argument':argument,
                               'keypoint':keypoint,
                               'topic' : topic,
                               'stance': stance})
    # add a supplemntar column to store the concatenation of argument and topic
    dataset_df["arg_topic"] = dataset_df["topic"] + " " + dataset_df["argument"]

    # return final dataset
    return dataset_df, arguments_df, key_points_df, labels_file_df

dataset_directory = "/content/KPA_2021_shared_task/kpm_data"  # directory for dataset used for training and validation set
testset_directory = "/content/KPA_2021_shared_task/test_data" # directory for dataset used for testing set 

# get dataset for training, evaluation and test
tr_data, _, _, _ = load_kpa_data(gold_data_dir = dataset_directory, subset = "train")
vl_data, _, _, _ = load_kpa_data(gold_data_dir = dataset_directory, subset = "dev")
ts_data, _, _, _ = load_kpa_data(gold_data_dir = testset_directory, subset="test")

# show TR, VL, TS set 
print("\nTRAINING SET: "+str(tr_data.shape))
display(tr_data.head())
print("\nVALIDATION SET: "+str(vl_data.shape))
display(vl_data.head())
print("\nTEST SET: "+str(ts_data.shape))
display(ts_data.head())

# save TR, VL, TS set in csv format
dataset_path = "/content/"
tr_data.to_csv(dataset_path + 'KPAgen_tr.csv')
vl_data.to_csv(dataset_path + 'KPAgen_vl.csv')
ts_data.to_csv(dataset_path + 'KPAgen_ts.csv')



TRAINING SET: (24454, 7)


Unnamed: 0,arg_id,key_point_id,argument,keypoint,topic,stance,arg_topic
0,arg_0_0,kp_0_0,`people reach their limit when it comes to the...,Assisted suicide gives dignity to the person t...,Assisted suicide should be a criminal offence,-1,Assisted suicide should be a criminal offence ...
1,arg_0_0,kp_0_1,`people reach their limit when it comes to the...,Assisted suicide reduces suffering,Assisted suicide should be a criminal offence,-1,Assisted suicide should be a criminal offence ...
2,arg_0_0,kp_0_2,`people reach their limit when it comes to the...,People should have the freedom to choose to en...,Assisted suicide should be a criminal offence,-1,Assisted suicide should be a criminal offence ...
3,arg_0_0,kp_0_3,`people reach their limit when it comes to the...,The terminally ill would benefit from assisted...,Assisted suicide should be a criminal offence,-1,Assisted suicide should be a criminal offence ...
4,arg_0_1,kp_0_0,A patient should be able to decide when they h...,Assisted suicide gives dignity to the person t...,Assisted suicide should be a criminal offence,-1,Assisted suicide should be a criminal offence ...



VALIDATION SET: (4211, 7)


Unnamed: 0,arg_id,key_point_id,argument,keypoint,topic,stance,arg_topic
0,arg_4_0,kp_4_0,having a school uniform can reduce bullying as...,Children can still express themselves using ot...,We should abandon the use of school uniform,-1,We should abandon the use of school uniform ha...
1,arg_4_0,kp_4_1,having a school uniform can reduce bullying as...,School uniform reduces bullying,We should abandon the use of school uniform,-1,We should abandon the use of school uniform ha...
2,arg_4_0,kp_4_2,having a school uniform can reduce bullying as...,School uniforms encourage discipline or focus ...,We should abandon the use of school uniform,-1,We should abandon the use of school uniform ha...
3,arg_4_0,kp_4_3,having a school uniform can reduce bullying as...,School uniforms saves costs,We should abandon the use of school uniform,-1,We should abandon the use of school uniform ha...
4,arg_4_0,kp_4_4,having a school uniform can reduce bullying as...,School uniforms create a sense of equality/unity,We should abandon the use of school uniform,-1,We should abandon the use of school uniform ha...



TEST SET: (3923, 7)


Unnamed: 0,arg_id,key_point_id,argument,keypoint,topic,stance,arg_topic
0,arg_0_0,kp_0_0,Routine child vaccinations isn't mandatory sin...,"Routine child vaccinations, or their side effe...",Routine child vaccinations should be mandatory,-1,Routine child vaccinations should be mandatory...
1,arg_0_0,kp_0_1,Routine child vaccinations isn't mandatory sin...,Mandatory vaccination contradicts basic rights,Routine child vaccinations should be mandatory,-1,Routine child vaccinations should be mandatory...
2,arg_0_0,kp_0_2,Routine child vaccinations isn't mandatory sin...,The parents and not the state should decide,Routine child vaccinations should be mandatory,-1,Routine child vaccinations should be mandatory...
3,arg_0_0,kp_0_3,Routine child vaccinations isn't mandatory sin...,Routine child vaccinations are not necessary t...,Routine child vaccinations should be mandatory,-1,Routine child vaccinations should be mandatory...
4,arg_0_1,kp_0_0,Routine child vaccinations should not be manda...,"Routine child vaccinations, or their side effe...",Routine child vaccinations should be mandatory,-1,Routine child vaccinations should be mandatory...


In [8]:
# get the column names for input/target
dataset_columns = ('argument', 'keypoint')
argument_column = dataset_columns[0]
key_point_column = dataset_columns[1]

# load the dataset previously saved
dataset_path = "/content/"
tr_dataset = load_dataset('csv', data_files = dataset_path + '/KPAgen_tr.csv')['train']
vl_dataset = load_dataset('csv', data_files = dataset_path + '/KPAgen_vl.csv')['train']
ts_dataset = load_dataset('csv', data_files = dataset_path + '/KPAgen_ts.csv')['train']

# create a dictionary to wrap together TR, VL, TS set 
data = {'train': tr_dataset, 'validation': vl_dataset, 'test': ts_dataset}

# show dataset columns (just for info)
tr_column_names = data["train"].column_names
vl_column_names = data["train"].column_names
ts_column_names = data["test"].column_names




Downloading and preparing dataset csv/default to /root/.cache/huggingface/datasets/csv/default-7bd66832e1d321c0/0.0.0/652c3096f041ee27b04d2232d41f10547a8fecda3e284a79a0ec4053c916ef7a...


Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

0 tables [00:00, ? tables/s]

Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/csv/default-7bd66832e1d321c0/0.0.0/652c3096f041ee27b04d2232d41f10547a8fecda3e284a79a0ec4053c916ef7a. Subsequent calls will reuse this data.


  0%|          | 0/1 [00:00<?, ?it/s]



Downloading and preparing dataset csv/default to /root/.cache/huggingface/datasets/csv/default-809d9fe91dad27aa/0.0.0/652c3096f041ee27b04d2232d41f10547a8fecda3e284a79a0ec4053c916ef7a...


Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

0 tables [00:00, ? tables/s]

Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/csv/default-809d9fe91dad27aa/0.0.0/652c3096f041ee27b04d2232d41f10547a8fecda3e284a79a0ec4053c916ef7a. Subsequent calls will reuse this data.


  0%|          | 0/1 [00:00<?, ?it/s]



Downloading and preparing dataset csv/default to /root/.cache/huggingface/datasets/csv/default-032ed7944b5bbacd/0.0.0/652c3096f041ee27b04d2232d41f10547a8fecda3e284a79a0ec4053c916ef7a...


Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

0 tables [00:00, ? tables/s]

Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/csv/default-032ed7944b5bbacd/0.0.0/652c3096f041ee27b04d2232d41f10547a8fecda3e284a79a0ec4053c916ef7a. Subsequent calls will reuse this data.


  0%|          | 0/1 [00:00<?, ?it/s]

## **Utilities functions**

- preprocessing: preprocess dataset with mT5 tokenizer
- postprocessing: process data to compute required evaluation metric
- compute metrics: compute rouge score to evaluate generated summaries

In [9]:
max_input_length = 512
max_target_length = 60
padding = "max_length" 

def preprocess_function(data_set):

    inputs = data_set[argument_column]   # get input column
    targets = data_set[key_point_column]  # get target column
    
    # add useful prefix to input, to tell the model which task has to perform
    prefix = "summarize: "
    inputs = [prefix + inp for inp in inputs]

    # execute input tokenization
    model_inputs = tokenizer(inputs, 
                             max_length = max_input_length,
                             truncation = True)

    # execute target tokenizatiion
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(targets, 
                           max_length = max_target_length,
                           truncation =True)
    
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs


def postprocess_text(preds, labels):
    
    # get predictions and labels and split them in different sentence
    preds = [pred.strip() for pred in preds]
    labels = [label.strip() for label in labels]
    preds = ["\n".join(nltk.sent_tokenize(pred)) for pred in preds]
    labels = ["\n".join(nltk.sent_tokenize(label)) for label in labels]

    return preds, labels

def compute_metrics(eval_preds):

    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    # Decode generated summaries into text
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    
    # Replace -100 in the labels as we can't decode them
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    
    # Decode reference summaries into text
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # post-processing: ROUGE expects a newline after each sentence
    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)
    
    # Compute ROUGE scores
    result = metric.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)
    result = {key: value.mid.fmeasure * 100 for key, value in result.items()}
    
    # Extract the median scores
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
    result["gen_len"] = np.mean(prediction_lens)
    result = {k: round(v, 4) for k, v in result.items()}
    return result



## **MT5 fine-tuning, evaluation, prediction**

- model finetuning on TR set
- model evaluation on VL set
- get model prediction (generated summaries/keypoint) on TS set

In [26]:
# apply preprocessing procedure on TR, VL e TS set
train_dataset = tr_dataset.map(preprocess_function, batched=True)
eval_dataset = vl_dataset.map(preprocess_function, batched=True)
test_dataset = ts_dataset.map(preprocess_function,batched=True)

# define datacollators objects to use for creating batches
data_collator = DataCollatorForSeq2Seq(tokenizer, model = model)

# define training arguments
training_args = Seq2SeqTrainingArguments(
    output_dir    = '/content/',
    learning_rate = 5.6e-5,
    evaluation_strategy = "epoch",
    num_train_epochs    = 3,
    per_device_train_batch_size = 16,
    per_device_eval_batch_size  = 16,
    warmup_steps = 500,
    weight_decay = 0.01,
    logging_dir  = '/content',
    save_steps   = 4518,
    predict_with_generate = True
)

# initialize Trainer object
trainer = Seq2SeqTrainer(
    model = model,
    args  = training_args,
    train_dataset = train_dataset,
    eval_dataset  = eval_dataset,
    tokenizer     = tokenizer,
    data_collator = data_collator,
    compute_metrics = compute_metrics
)

model.cuda()      # pass model to GPU
checkpoint = ''   # define checkpoint 

# runs fine-tuning and save fine-tuned model
train_result = trainer.train(resume_from_checkpoint = None) 
trainer.save_model()

# use model to predict new summary on test set 
test_results = trainer.predict(
      test_dataset,
      metric_key_prefix = "test",
      max_length = max_target_length,
      num_beams = 6)
print(test_results)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
The following columns in the training set don't have a corresponding argument in `MT5ForConditionalGeneration.forward` and have been ignored: key_point_id, stance, argument, keypoint, arg_topic, arg_id, Unnamed: 0, topic. If key_point_id, stance, argument, keypoint, arg_topic, arg_id, Unnamed: 0, topic are not expected by `MT5ForConditionalGeneration.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 24454
  Num Epochs = 3
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 4587


Epoch,Training Loss,Validation Loss,Rouge1,Rouge2,Rougel,Rougelsum,Gen Len
1,0.3165,9.68394,6.8019,0.9916,6.6164,6.6238,13.7526


The following columns in the evaluation set don't have a corresponding argument in `MT5ForConditionalGeneration.forward` and have been ignored: key_point_id, stance, argument, keypoint, arg_topic, arg_id, Unnamed: 0, topic. If key_point_id, stance, argument, keypoint, arg_topic, arg_id, Unnamed: 0, topic are not expected by `MT5ForConditionalGeneration.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 4211
  Batch size = 16


Epoch,Training Loss,Validation Loss,Rouge1,Rouge2,Rougel,Rougelsum,Gen Len
1,0.3165,9.68394,6.8019,0.9916,6.6164,6.6238,13.7526
2,0.3065,9.714754,5.2309,0.7149,5.1504,5.1455,14.805
3,0.332,8.867827,6.2447,0.9793,6.1703,6.1636,14.4728


The following columns in the evaluation set don't have a corresponding argument in `MT5ForConditionalGeneration.forward` and have been ignored: key_point_id, stance, argument, keypoint, arg_topic, arg_id, Unnamed: 0, topic. If key_point_id, stance, argument, keypoint, arg_topic, arg_id, Unnamed: 0, topic are not expected by `MT5ForConditionalGeneration.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 4211
  Batch size = 16
Saving model checkpoint to /content/checkpoint-4518
Configuration saved in /content/checkpoint-4518/config.json
Model weights saved in /content/checkpoint-4518/pytorch_model.bin
tokenizer config file saved in /content/checkpoint-4518/tokenizer_config.json
Special tokens file saved in /content/checkpoint-4518/special_tokens_map.json
Copy vocab file to /content/checkpoint-4518/spiece.model
The following columns in the evaluation set don't have a corresponding argument in `MT5ForConditionalGeneration.forward` and have been i

PredictionOutput(predictions=array([[     0,  37333,  17312, ...,      0,      0,      0],
       [     0,  37333,  17312, ...,      0,      0,      0],
       [     0,  37333,  17312, ...,      0,      0,      0],
       ...,
       [     0,    259, 226535, ...,      0,      0,      0],
       [     0,    259, 226535, ...,      0,      0,      0],
       [     0,    259, 226535, ...,      0,      0,      0]]), label_ids=array([[   259, 226116,    265, ...,      0,      0,      0],
       [ 49175,  59997,    259, ...,      0,      0,      0],
       [   486,  22552,    305, ...,      0,      0,      0],
       ...,
       [   486,   2279,   1070, ...,      0,      0,      0],
       [   486,   2279,   1070, ...,      0,      0,      0],
       [   486,   2279,   1070, ...,      0,      0,      0]]), metrics={'test_loss': 9.499456405639648, 'test_rouge1': 11.195, 'test_rouge2': 0.7526, 'test_rougeL': 10.3996, 'test_rougeLsum': 10.4044, 'test_gen_len': 18.9913, 'test_runtime': 189.976, '

In [27]:
if trainer.is_world_process_zero():
  if training_args.predict_with_generate:
    
    # decode model's predictions
    test_preds = tokenizer.batch_decode(
        test_results.predictions, skip_special_tokens=True, clean_up_tokenization_spaces=True
        )
    # split model's predictions
    test_preds = [pred.strip() for pred in test_preds]
    
    # save model's predictions in a txt file
    preds_file = os.path.join('/content/', "test_set_generations.txt")
    with open(preds_file, "w") as file_out:
          file_out.write("\n".join(test_preds))

## **Keypoint Ranking**
- Read all the generated keypoints 
- Assign to each keypoint a rouge-score with its corresponding <argument+topic>
- Rank keypoint based on their score
- Select top 5 keypoint for each topic and stance

In [28]:
# open file in which predictions are stored
f = open('/content/test_set_generations.txt','r')
preds = f.read().splitlines()

# open test set as dataframe and add a column with all the given predictions
df_test = pd.read_csv(dataset_path + 'KPAgen_ts.csv')
df_test["generated_keypoint"] = preds

# store in a file this dataframe with all the generated keypoint and save it
final_kps = df_test
final_kps.to_csv("/content/final.csv")

In [29]:
from rouge_score import rouge_scorer

# define a scorer object to assignn rouge 1 score
scorer = rouge_scorer.RougeScorer(['rouge1'], use_stemmer=True)

def compute_rouge1_metrics(kp,arg_topic):

  # compute rouge-1 score between keypoint and its correspondinng arguments concatened with the topic
  scores = []
  for i in range(len(kp)):
    score = scorer.score(kp[i], arg_topic[i])
    score = float(score["rouge1"][0])
    scores.append(score)

  return scores

def get_best_score(kps, topic, stance):

  final_kps = []
  final_score = []
  final_topic = []
  final_stance = []
  
  # get list of unique keypoint
  gen_kps_unique = kps["generated_keypoint"].unique()
  
  # for each unique keypoint
  for gen_kp in gen_kps_unique:
    
    # consider all keypoint with same text
    key_point = kps[(kps["generated_keypoint"]==gen_kp)]
    # assign to this keypoint the mean of their score
    best_score = key_point["rouge_score"].mean()
    # return keypoint info: text, score, topic, stance
    final_kps.append(gen_kp)
    final_score.append(best_score)
    final_topic.append(topic)
    final_stance.append(stance)
  
  # create a keypoints dataframe
  keypoint_dataframe = pd.DataFrame({"generated_keypoint":final_kps, "score":final_score,"topic":final_topic, "stance":final_stance})
  # order keypoints based on their score
  keypoint_dataframe = keypoint_dataframe.sort_values(by='score', ascending=False)
  # save only the top 5 key point 
  keypoint_dataframe = keypoint_dataframe.iloc[0:5, :]

  return keypoint_dataframe

def keypoint_ranking(kps, test_set):
  
  # initialize an empty dataframe for final keypoints
  final_kps = pd.DataFrame({
      "generated_keypoint": [np.nan],
      "score": [np.nan],
      "topic": [np.nan],
      "stance": [np.nan]
      },
      index=[0]
      )

  # get list of all topic inside TS set
  all_topic = test_set["topic"].unique()

  # for each topic
  for topic in all_topic:
    
    # create a temporaneous test set with only row concerning that topic
    temp_test_set = test_set[(test_set['topic']==topic)]
    # create a temporaneous test set with only row concerning that topic and having positive stance 
    pos_test_set = temp_test_set[(temp_test_set["stance"]==1)]
    # create a temporaneous test set with only row concerning that topic and having negative stance 
    neg_test_set = temp_test_set[(temp_test_set["stance"]==-1)]
   
    # create a temporaneous keypoint set with only row concerning that topic
    temp_kps = kps[(kps['topic']==topic)]
    # create a temporaneous keypoint set with only row concerning that topic and having positive stance 
    pos_kps = temp_kps[(temp_kps["stance"]==1)]
     # create a temporaneous keypoinnt set with only row concerning that topic and having negative stance 
    neg_kps = temp_kps[(temp_kps["stance"]==-1)]

    # given a temporaneus test set and a temporaneus keypoint set
    # ---> compute rouge score between keypoint (from temp. kp set) and corresponding argument+topic (from temp. test set)
    # repeat the procedure both for positive and negative set  
    pos_kps["rouge_score"] = compute_rouge1_metrics(pos_kps["keypoint"].tolist(), pos_test_set["arg_topic"].tolist())
    neg_kps["rouge_score"] = compute_rouge1_metrics(neg_kps["keypoint"].tolist(), neg_test_set["arg_topic"].tolist())
    

    # both for postive stance and negative stance set of keypoint keep only top 5 keypoints
    pos_kps_5 = get_best_score(pos_kps, topic, 1)
    neg_kps_5 = get_best_score(neg_kps, topic, -1)
    
    # concatanate pos and neg keypoint to obtain top 10 keypoint under this topic
    final_kps = pd.concat([final_kps, pos_kps_5])
    final_kps = pd.concat([final_kps, neg_kps_5])

  # return final keypoint set   
  final_kps = final_kps.iloc[1: , :]
  final_kps = final_kps[["generated_keypoint", "topic", "stance"]]
  final_kps["stance"] = final_kps["stance"].astype(int)
  final_kps = final_kps.reset_index(drop=True)

  display(final_kps)
                                
  return final_kps


final_key_points = keypoint_ranking( final_kps, df_test)

Unnamed: 0,generated_keypoint,topic,stance
0,People should be allowed to do whatever they w...,Routine child vaccinations should be mandatory,1
1,The Guantanamo bay detention camp is a symbol ...,Routine child vaccinations should be mandatory,1
2,Child performers should not be banned as long ...,Routine child vaccinations should be mandatory,1
3,Parents will have more ability to pay-attentio...,Routine child vaccinations should be mandatory,1
4,Cannabis is a gateway-drug/addictive,Routine child vaccinations should be mandatory,1
5,Parents will have more ability to pay-attentio...,Routine child vaccinations should be mandatory,-1
6,The Guantanamo bay detention camp harms the US...,Routine child vaccinations should be mandatory,-1
7,The terminally ill would benefit from assisted...,Routine child vaccinations should be mandatory,-1
8,Private military companies are less ethical an...,Routine child vaccinations should be mandatory,-1
9,A mandatory retirement reduces the quality of ...,Routine child vaccinations should be mandatory,-1


## Final Keypoints

> Final keypoints are saved in file `generated_keypoints.csv`

In [30]:
final_key_points.to_csv('/content/generated_keypoints.csv')
print("Final keypoint has been generated and saved in file: generated_keypoints.csv")

Final keypoint has been generated and saved in file: generated_keypoints.csv
