# Create new prompted sentence representations for the link identification task

## Libraries

In [1]:
!pip install pandas==1.3.4
!pip install transformers==4.12.5
!pip install datasets==1.15.1

Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Collecting pandas==1.3.4
  Downloading pandas-1.3.4-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (11.5 MB)
Installing collected packages: pandas
Successfully installed pandas-1.3.4
Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Collecting transformers==4.12.5
  Downloading transformers-4.12.5-py3-none-any.whl (3.1 MB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.9.1-py3-none-any.whl (120 kB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.3-cp38-cp38-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)

In [2]:
import os
import pickle

import pandas as pd

import numpy as np
import torch

import transformers
from transformers import BertTokenizer

import datasets
from datasets import ClassLabel, Dataset, DatasetDict, concatenate_datasets

In [3]:
print('pandas:\t\t', pd.__version__)
print('transformers:\t', transformers.__version__)
print('datasets:\t', datasets.__version__)

pandas:		 1.3.4
transformers:	 4.12.5
datasets:	 1.15.1


## Load Data

In [4]:
dataset_df = pd.read_pickle("/notebooks/Prompting/dataset/pe_dataset_w_essay_position_pos_tags_pickle")

In [5]:
dataset_df

Unnamed: 0,essay_nr,component_id,label_and_comp_idxs,text,label_x,label_ComponentType,relation_SupportAttack,label_RelationType,label_LinkedNotLinked,split_y,...,is_last_in_para,nr_preceeding_comps_in_para,nr_following_comps_in_para,structural_fts_as_text,structural_fts_as_text_combined,para_ratio,first_or_last,strct_fts_w_position_in_essay,component_pos_tags,strct_fts_essay_position_pos_tags
0,essay001,T1,MajorClaim 503 575,we should attach more importance to cooperatio...,MajorClaim,MajorClaim,[],,Linked,TRAIN,...,1,0,0,Topic: Should students be taught to compete or...,Topic: Should students be taught to compete or...,0.25,1,Topic: Should students be taught to compete or...,"Part Of Speech tags: PRON, VERB, VERB, ADJ, NO...",Topic: Should students be taught to compete or...
1,essay001,T2,MajorClaim 2154 2231,a more cooperative attitudes towards life is m...,MajorClaim,MajorClaim,[],,Linked,TRAIN,...,1,0,0,Topic: Should students be taught to compete or...,Topic: Should students be taught to compete or...,1.00,1,Topic: Should students be taught to compete or...,"Part Of Speech tags: DET, ADV, ADJ, NOUN, ADP,...",Topic: Should students be taught to compete or...
2,essay001,T3,Claim 591 714,"through cooperation, children can learn about ...",Claim,Claim,[],Support,Linked,TRAIN,...,0,0,3,Topic: Should students be taught to compete or...,Topic: Should students be taught to compete or...,0.50,0,Topic: Should students be taught to compete or...,"Part Of Speech tags: ADP, NOUN, PUNCT, NOUN, V...",Topic: Should students be taught to compete or...
3,essay001,T4,Premise 716 851,What we acquired from team work is not only ho...,Premise,Premise,[],Support,NotLinked,TRAIN,...,0,1,2,Topic: Should students be taught to compete or...,Topic: Should students be taught to compete or...,0.50,0,Topic: Should students be taught to compete or...,"Part Of Speech tags: PRON, PRON, VERB, ADP, NO...",Topic: Should students be taught to compete or...
4,essay001,T5,Premise 853 1086,"During the process of cooperation, children ca...",Premise,Premise,[],Support,NotLinked,TRAIN,...,0,2,1,Topic: Should students be taught to compete or...,Topic: Should students be taught to compete or...,0.50,0,Topic: Should students be taught to compete or...,"Part Of Speech tags: ADP, DET, NOUN, ADP, NOUN...",Topic: Should students be taught to compete or...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5968,essay402,T11,Premise 1275 1339,indirectly they will learn how to socialize ea...,Premise,Premise,[],Support,NotLinked,TRAIN,...,0,4,3,Topic: Children should studying hard or playin...,Topic: Children should studying hard or playin...,0.75,0,Topic: Children should studying hard or playin...,"Part Of Speech tags: ADV, PRON, VERB, VERB, AD...",Topic: Children should studying hard or playin...
5969,essay402,T12,Premise 1341 1388,That will make children getting lots of friends,Premise,Premise,[],Support,NotLinked,TRAIN,...,0,5,2,Topic: Children should studying hard or playin...,Topic: Children should studying hard or playin...,0.75,0,Topic: Children should studying hard or playin...,"Part Of Speech tags: DET, VERB, VERB, NOUN, VE...",Topic: Children should studying hard or playin...
5970,essay402,T13,Premise 1393 1436,they can contribute positively to community,Premise,Premise,[],Support,Linked,TRAIN,...,0,6,1,Topic: Children should studying hard or playin...,Topic: Children should studying hard or playin...,0.75,0,Topic: Children should studying hard or playin...,"Part Of Speech tags: PRON, VERB, VERB, ADV, AD...",Topic: Children should studying hard or playin...
5971,essay402,T14,Premise 1448 1525,playing sport makes children getting healthy a...,Premise,Premise,[],Support,NotLinked,TRAIN,...,1,7,0,Topic: Children should studying hard or playin...,Topic: Children should studying hard or playin...,0.75,0,Topic: Children should studying hard or playin...,"Part Of Speech tags: VERB, NOUN, VERB, NOUN, V...",Topic: Children should studying hard or playin...


In [6]:
dataset_df.columns

Index(['essay_nr', 'component_id', 'label_and_comp_idxs', 'text', 'label_x',
       'label_ComponentType', 'relation_SupportAttack', 'label_RelationType',
       'label_LinkedNotLinked', 'split_y', 'essay', 'argument_bound_1',
       'argument_bound_2', 'argument_id', 'sentence', 'paragraph', 'para_nr',
       'total_paras', 'token_count', 'token_count_covering_para',
       'tokens_count_covering_sentence', 'preceeding_tokens_in_sentence_count',
       'succeeding_tokens_in_sentence_count', 'token_ratio',
       'relative_position_in_para_char', 'is_in_intro',
       'relative_position_in_para_token', 'is_in_conclusion',
       'is_first_in_para', 'is_last_in_para', 'nr_preceeding_comps_in_para',
       'nr_following_comps_in_para', 'structural_fts_as_text',
       'structural_fts_as_text_combined', 'para_ratio', 'first_or_last',
       'strct_fts_w_position_in_essay', 'component_pos_tags',
       'strct_fts_essay_position_pos_tags'],
      dtype='object')

In [7]:
# sanity check
print(len(dataset_df))
dataset_df = dataset_df.dropna()
print(len(dataset_df))

5973
5973
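
The sanity check above only compares row counts. As a minimal sketch of a slightly more informative check, the snippet below counts missing values per column before dropping rows. It runs on a made-up toy frame (`toy_df` and its rows are assumptions for illustration, not the notebook's data); only the column names are borrowed from `dataset_df`.

```python
import pandas as pd

# Toy stand-in for dataset_df; the rows are invented for illustration.
toy_df = pd.DataFrame(
    {
        "text": ["a claim", "a premise", None],
        "label_LinkedNotLinked": ["Linked", "NotLinked", "Linked"],
    }
)

# Count missing values per column before dropping anything.
missing_per_column = toy_df.isna().sum()
print(missing_per_column[missing_per_column > 0])

# Drop incomplete rows and report how many were removed.
cleaned_df = toy_df.dropna().reset_index(drop=True)
rows_dropped = len(toy_df) - len(cleaned_df)
print(f"dropped {rows_dropped} of {len(toy_df)} rows")
```

On the real data the two lengths printed above are equal, so a check like this would report zero dropped rows.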


In [8]:
dataset_df = dataset_df.reset_index(drop=True)

In [9]:
dataset_df

Unnamed: 0,essay_nr,component_id,label_and_comp_idxs,text,label_x,label_ComponentType,relation_SupportAttack,label_RelationType,label_LinkedNotLinked,split_y,...,is_last_in_para,nr_preceeding_comps_in_para,nr_following_comps_in_para,structural_fts_as_text,structural_fts_as_text_combined,para_ratio,first_or_last,strct_fts_w_position_in_essay,component_pos_tags,strct_fts_essay_position_pos_tags
0,essay001,T1,MajorClaim 503 575,we should attach more importance to cooperatio...,MajorClaim,MajorClaim,[],,Linked,TRAIN,...,1,0,0,Topic: Should students be taught to compete or...,Topic: Should students be taught to compete or...,0.25,1,Topic: Should students be taught to compete or...,"Part Of Speech tags: PRON, VERB, VERB, ADJ, NO...",Topic: Should students be taught to compete or...
1,essay001,T2,MajorClaim 2154 2231,a more cooperative attitudes towards life is m...,MajorClaim,MajorClaim,[],,Linked,TRAIN,...,1,0,0,Topic: Should students be taught to compete or...,Topic: Should students be taught to compete or...,1.00,1,Topic: Should students be taught to compete or...,"Part Of Speech tags: DET, ADV, ADJ, NOUN, ADP,...",Topic: Should students be taught to compete or...
2,essay001,T3,Claim 591 714,"through cooperation, children can learn about ...",Claim,Claim,[],Support,Linked,TRAIN,...,0,0,3,Topic: Should students be taught to compete or...,Topic: Should students be taught to compete or...,0.50,0,Topic: Should students be taught to compete or...,"Part Of Speech tags: ADP, NOUN, PUNCT, NOUN, V...",Topic: Should students be taught to compete or...
3,essay001,T4,Premise 716 851,What we acquired from team work is not only ho...,Premise,Premise,[],Support,NotLinked,TRAIN,...,0,1,2,Topic: Should students be taught to compete or...,Topic: Should students be taught to compete or...,0.50,0,Topic: Should students be taught to compete or...,"Part Of Speech tags: PRON, PRON, VERB, ADP, NO...",Topic: Should students be taught to compete or...
4,essay001,T5,Premise 853 1086,"During the process of cooperation, children ca...",Premise,Premise,[],Support,NotLinked,TRAIN,...,0,2,1,Topic: Should students be taught to compete or...,Topic: Should students be taught to compete or...,0.50,0,Topic: Should students be taught to compete or...,"Part Of Speech tags: ADP, DET, NOUN, ADP, NOUN...",Topic: Should students be taught to compete or...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5968,essay402,T11,Premise 1275 1339,indirectly they will learn how to socialize ea...,Premise,Premise,[],Support,NotLinked,TRAIN,...,0,4,3,Topic: Children should studying hard or playin...,Topic: Children should studying hard or playin...,0.75,0,Topic: Children should studying hard or playin...,"Part Of Speech tags: ADV, PRON, VERB, VERB, AD...",Topic: Children should studying hard or playin...
5969,essay402,T12,Premise 1341 1388,That will make children getting lots of friends,Premise,Premise,[],Support,NotLinked,TRAIN,...,0,5,2,Topic: Children should studying hard or playin...,Topic: Children should studying hard or playin...,0.75,0,Topic: Children should studying hard or playin...,"Part Of Speech tags: DET, VERB, VERB, NOUN, VE...",Topic: Children should studying hard or playin...
5970,essay402,T13,Premise 1393 1436,they can contribute positively to community,Premise,Premise,[],Support,Linked,TRAIN,...,0,6,1,Topic: Children should studying hard or playin...,Topic: Children should studying hard or playin...,0.75,0,Topic: Children should studying hard or playin...,"Part Of Speech tags: PRON, VERB, VERB, ADV, AD...",Topic: Children should studying hard or playin...
5971,essay402,T14,Premise 1448 1525,playing sport makes children getting healthy a...,Premise,Premise,[],Support,NotLinked,TRAIN,...,1,7,0,Topic: Children should studying hard or playin...,Topic: Children should studying hard or playin...,0.75,0,Topic: Children should studying hard or playin...,"Part Of Speech tags: VERB, NOUN, VERB, NOUN, V...",Topic: Children should studying hard or playin...


In [10]:
dataset_df = dataset_df.rename(columns={"split_y": "split"})

In [11]:
dataset_df.columns

Index(['essay_nr', 'component_id', 'label_and_comp_idxs', 'text', 'label_x',
       'label_ComponentType', 'relation_SupportAttack', 'label_RelationType',
       'label_LinkedNotLinked', 'split', 'essay', 'argument_bound_1',
       'argument_bound_2', 'argument_id', 'sentence', 'paragraph', 'para_nr',
       'total_paras', 'token_count', 'token_count_covering_para',
       'tokens_count_covering_sentence', 'preceeding_tokens_in_sentence_count',
       'succeeding_tokens_in_sentence_count', 'token_ratio',
       'relative_position_in_para_char', 'is_in_intro',
       'relative_position_in_para_token', 'is_in_conclusion',
       'is_first_in_para', 'is_last_in_para', 'nr_preceeding_comps_in_para',
       'nr_following_comps_in_para', 'structural_fts_as_text',
       'structural_fts_as_text_combined', 'para_ratio', 'first_or_last',
       'strct_fts_w_position_in_essay', 'component_pos_tags',
       'strct_fts_essay_position_pos_tags'],
      dtype='object')

## Add prompts

In [12]:
# Following the LIFT paper: we should add symbols that separate and mark
# the component and the answer, among several other things.
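
The note above mentions separator symbols but the notebook does not apply them. A minimal sketch of the idea follows, assuming `###` as the marker between prompt and answer and `@@@` as the marker that closes the answer. These symbols and the helper name `add_separators` are illustrative assumptions, not something defined in this notebook.

```python
PROMPT_SEPARATOR = "###"  # assumed marker between the prompt and the answer
ANSWER_END = "@@@"        # assumed marker that closes the answer


def add_separators(component_text, label):
    """Return (prompt, completion) with explicit separator symbols."""
    prompt = f"Component: {component_text} {PROMPT_SEPARATOR}"
    completion = f" {label} {ANSWER_END}"
    return prompt, completion


prompt, completion = add_separators(
    "we should attach more importance to cooperation during primary education",
    "Linked",
)
print(prompt + completion)
```

Keeping the prompt and the completion as two strings makes it easy to feed only the prompt to a model at prediction time.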

In [13]:
# Function to add the prompt to the dataset

In [14]:
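def get_prompted_representation(x, prompt_template):
    """Build the prompted text representation for one row of dataset_df."""
    sentence_representation = x.strct_fts_essay_position_pos_tags
    label = x.label_LinkedNotLinked

    # NOTE: templates 1 and 3 are worded for component-type classification,
    # yet every template appends the Linked/NotLinked label.
    # Only template 2 is used in this notebook.
    if prompt_template == 1:
        prompted_representation = 'Component: ' + sentence_representation + ". " + 'Is this component a premise, a claim or a major claim? This component is a ' + label + "."

    elif prompt_template == 2:
        # NOTE: the prompt offers "Separate", while the label column stores "NotLinked".
        prompted_representation = 'Which of these choices best describes the following component? "Linked" or "Separate".' + " Component: " + sentence_representation + ". " + label + "."

    elif prompt_template == 3:
        prompted_representation = 'How is the component best described?: "MajorClaim", "Claim" or "Premise".' + " Component: " + sentence_representation + ". " + label + "."

    else:
        # Fail loudly instead of hitting an unbound variable at the return.
        raise ValueError(f"Unknown prompt_template: {prompt_template}")

    return prompted_representation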
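
The function above always appends the gold label to the prompt. As a hedged sketch of an alternative layout, the snippet below keeps the template wording in a lookup table and makes the label optional, which could be useful when building inputs for prediction. `PROMPT_TEMPLATES`, `build_prompt` and the `include_label` flag are hypothetical names introduced here; only the wording of template 2 is copied from the function.

```python
# Hypothetical lookup table; the wording of template 2 is copied from the function above.
PROMPT_TEMPLATES = {
    2: 'Which of these choices best describes the following component? '
       '"Linked" or "Separate". Component: {representation}. ',
}


def build_prompt(representation, label, template_id, include_label=True):
    """Fill the chosen template; optionally leave the gold label out."""
    if template_id not in PROMPT_TEMPLATES:
        raise ValueError(f"Unknown template id: {template_id}")
    prompt = PROMPT_TEMPLATES[template_id].format(representation=representation)
    if include_label:
        prompt += label + "."
    else:
        prompt = prompt.rstrip()  # no trailing space when the answer is omitted
    return prompt


example = build_prompt("Topic: T, Sentence: S", "Linked", 2)
print(example)
print(build_prompt("Topic: T, Sentence: S", "Linked", 2, include_label=False))
```

With `include_label=True` the string has the same shape as the output of `get_prompted_representation(x, 2)`.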

In [15]:
dataset_df['text'][0], dataset_df['strct_fts_essay_position_pos_tags'][0]

('we should attach more importance to cooperation during primary education',
 'Topic: Should students be taught to compete or to cooperate?, Sentence: From this point of view, I firmly believe that we should attach more importance to cooperation during primary education., First or last in essay: Yes, First in paragraph: Yes, Last in paragraph: Yes, In in introduction: Yes, Is in conclusion: No. Part Of Speech tags: PRON, VERB, VERB, ADJ, NOUN, ADP, NOUN, ADP, ADJ, NOUN')

In [16]:
get_prompted_representation(dataset_df.iloc[0], 2)

'Which of these choices best describes the following component? "Linked" or "Separate". Component: Topic: Should students be taught to compete or to cooperate?, Sentence: From this point of view, I firmly believe that we should attach more importance to cooperation during primary education., First or last in essay: Yes, First in paragraph: Yes, Last in paragraph: Yes, In in introduction: Yes, Is in conclusion: No. Part Of Speech tags: PRON, VERB, VERB, ADJ, NOUN, ADP, NOUN, ADP, ADJ, NOUN. Linked.'

In [17]:
dataset_df['prompted_representation_2'] = dataset_df.apply(lambda x: get_prompted_representation(x, 2), axis=1)

In [18]:
dataset_df['prompted_representation_2'][0]

'Which of these choices best describes the following component? "Linked" or "Separate". Component: Topic: Should students be taught to compete or to cooperate?, Sentence: From this point of view, I firmly believe that we should attach more importance to cooperation during primary education., First or last in essay: Yes, First in paragraph: Yes, Last in paragraph: Yes, In in introduction: Yes, Is in conclusion: No. Part Of Speech tags: PRON, VERB, VERB, ADJ, NOUN, ADP, NOUN, ADP, ADJ, NOUN. Linked.'

In [19]:
dataset_df.to_pickle("pe_dataset_w_prompts_2_linkID_pickle")