# Semantic Role Labelling

Semantic Role Labelling (SRL) can be described as follows:
- It is the task of assigning roles to spans in sentences.
- It is the task of automatically finding the semantic role of each argument of each predicate in a sentence.

The dataset uses the following labels:

- O
- B-Object
- I-Object
- B-Aspect
- I-Aspect
- B-Predicate
- I-Predicate

Thus, the main task is to assign one of these labels to each word in the test set.
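To make the labelling scheme concrete, here is a minimal sketch of how `B-`/`I-` tags group words into role spans. The sentence, the tags, and the helper name `tags_to_spans` are hypothetical and chosen for illustration; they are not taken from the dataset or the notebook.

```python
# Hypothetical example sentence and tags (not from the dataset).
tokens = ["python", "is", "easier", "to", "learn", "than", "c"]
tags = ["B-Object", "O", "B-Predicate", "O", "B-Aspect", "O", "B-Object"]


def tags_to_spans(tokens, tags):
    """Group BIO-tagged tokens into (role, text) spans."""
    spans = []
    current_role, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag == "O":
            prefix, role = None, None
        else:
            prefix, role = tag.split("-", 1)
        # A span ends on "O", on a "B-" tag, or when the role changes.
        starts_new = role is None or prefix == "B" or role != current_role
        if starts_new and current_role is not None:
            spans.append((current_role, " ".join(current_tokens)))
            current_role, current_tokens = None, []
        if role is not None:
            if current_role is None:
                current_role = role
            current_tokens.append(token)
    if current_role is not None:
        spans.append((current_role, " ".join(current_tokens)))
    return spans


spans = tags_to_spans(tokens, tags)
print(spans)
```

A `B-` tag opens a span and the following `I-` tags with the same role extend it, so a multi-word object such as `["B-Object", "I-Object"]` comes out as a single span.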

In our experiments we considered a popular solution for semantic role labelling, which consists of a NER algorithm and a BERT model with an additional layer.
These steps are described in detail in the following sub-sections.

### Pre-processing

We pre-processed the dataset for two reasons: punctuation is separated incorrectly inside the BERT model, and input values are pruned.

We therefore performed the following steps:

- Separate the data into sentences at empty lines (NaN).
- Collapse punctuation into a single dot.
- Reduce the labels to [ O, Object, Aspect, Predicate ].
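The three steps can be sketched on a toy token list as follows. This is a simplified illustration, not the notebook's own code: the helper names are hypothetical, and the regular expression treats every non-alphanumeric character as punctuation, whereas the notebook uses an explicit character list.

```python
import re

SENTINEL = "_None_"  # placeholder the notebook uses for empty lines


def normalise_token(token):
    """Replace punctuation with a dot and collapse runs of dots (simplified rule)."""
    token = re.sub(r"[^\w\s]|_", ".", token)
    return re.sub(r"\.+", ".", token)


def split_sentences(tokens):
    """Join tokens into space-separated sentences, splitting at the sentinel."""
    sentences, current = [], []
    for token in tokens:
        if token == SENTINEL:
            sentences.append(" ".join(current))
            current = []
        else:
            current.append(normalise_token(token))
    return sentences


def collapse_label(tag):
    """Drop the B-/I- prefix so that only the role name remains."""
    return tag if tag == "O" else tag.split("-", 1)[1]


tokens = ["hello", ",", "world", "!?", SENTINEL, "it", "'s", "fine", SENTINEL]
sentences = split_sentences(tokens)
print(sentences)
print([collapse_label(t) for t in ["O", "B-Object", "I-Aspect"]])
```

Note that, as in this sketch, a sentence is only emitted when a sentinel follows it, so the input is assumed to end with an empty line.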

## Requirements

In [2]:
# Connect to Google Drive and upload a folder
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [3]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [4]:
import pandas as pd
import numpy as np
import torch
import torch.optim as optim

import re # Regular expression

from tqdm import tqdm

from transformers import BertTokenizerFast
from transformers import BertForTokenClassification

from torch.utils.data import DataLoader

## Download the data

In [5]:
!git clone https://github.com/s-nlp/semantic-role-labelling.git

fatal: destination path 'semantic-role-labelling' already exists and is not an empty directory.


### Dataset loading

**df_train** - Train dataset

**df_val** - Validation dataset

**df_test** - Test dataset

    ['sentences', 'labels']

In [6]:
dataset = 'train'
dataset_dev = 'dev' # Validation dataset
dataset_test = 'test_no_answers'

dataset_dev_no = 'dev_no_answers' # Validation dataset

In [7]:
path = '/content/semantic-role-labelling/' + dataset + '.tsv'
path_dev = '/content/' + dataset_dev + '.tsv'
path_test = '/content/semantic-role-labelling/' + dataset_test + '.tsv'

path_dev_no = '/content/semantic-role-labelling/' + dataset_dev_no + '.tsv'


df = pd.read_csv(path, sep='\t', header= None, names=['data', 'label'], quoting=3, skip_blank_lines=False).fillna('_None_') # Train dataset
df_dev = pd.read_csv(path_dev, sep='\t', header= None, names=['data', 'label'], quoting=3, skip_blank_lines=False).fillna('_None_') # Dev dataset
df_test = pd.read_csv(path_test, sep='\t', header= None, names=['data'], quoting=3, skip_blank_lines=False).fillna('_None_') # Test dataset

df_dev_no = pd.read_csv(path_dev_no, sep='\t', header= None, names=['data'], quoting=3, skip_blank_lines=False).fillna('_None_') # Dev dataset


### Checking

In [None]:
df.head(22)

In [None]:
df_dev.head(22)

In [None]:
df_test.head()

In [None]:
print(len(df_dev))
print(len(df_dev_no))

In [None]:
df_dev_no.head()

In [None]:
df.data[592]

'"'

In [None]:
df_test.head(24)

In [None]:
df_dev_no.data[460:500]

## Preprocessing 

#### Corpus preprocessing:
- Separate the data into sentences at empty lines (NaN).
- Collapse punctuation into a single dot.
- Reduce the labels to [ O, Object, Aspect, Predicate ].

In [8]:
def count_words(df):

    count_words_list = []
    temp_count = 0

    for id_w, word in enumerate(df.data):

        if word == '_None_':
          if id_w == 0:
            temp_count = 0
          else:
            count_words_list.append(temp_count)
            temp_count = 0

        else:
            temp_count += 1
    
    return count_words_list

In [9]:
# Join the tokens of each sentence into one space-separated string

def separate_text(df):
    # Split the token column into sentences at empty lines (read as NaN and filled with '_None_')

    sentence_list = []
    temp_list = ''

    for word in df.data:

        if word == '_None_':
            sentence_list.append(temp_list)
            temp_list = ''
        else:
            word = re.sub(r"[\"\—\#\$\%\&\'\(\)\*\+\,\–\-\/\:\;\<\=\>\?\@\[\\\]\^\?\!\_\`\{\|\}\~\«\»ѣ\№]", ".", word)
            word = re.sub(r"[.]+", ".", word)
            
            if temp_list == '':
                temp_list += word
            else:
                temp_list += ' ' + word
    
    print(sentence_list[:5])

    return sentence_list

In [10]:
# ex = pd.DataFrame(['mhvk', '.', ').', '".','khgk', '_None_', '.', '_None'], columns=['data'])

# sentence_list = separate_text(ex)

# print(str(sentence_list[0]))

['mhvk . . . khgk']
mhvk . . . khgk


In [11]:
def clean_labels(df):
    # Collapse BIO tags into role names: O, Object, Aspect, Predicate (mapped to ids 0-3 later via labels_to_ids).
        
    label_list = []
    # temp_list = [(-100)]
    temp_list = []

    for i in df.label:

        if i == '_None_':
            # temp_list.append(-100)
            label_list.append(temp_list)
            # temp_list = [(-100)]
            temp_list = []
        else:
            if i == 'O':
                # label = 0
                label = 'O'
            elif i == 'B-Object' or i == 'I-Object':
                # label = 1
                label = 'Object'
            elif i == 'B-Aspect' or i == 'I-Aspect':
                # label = 2
                label = 'Aspect'
            elif i == 'B-Predicate' or i == 'I-Predicate':
                # label = 3
                label = 'Predicate'
            temp_list.append(label)

    return label_list

In [13]:
labels_to_ids = {k: v for v, k in enumerate(['O', 'Object', 'Aspect', 'Predicate'])}
ids_to_labels = {v: k for v, k in enumerate(['O', 'Object', 'Aspect', 'Predicate'])}


# Applying cleaning to df (train)
train = {'sentences':separate_text(df), 'labels':clean_labels(df)}

# Applying cleaning to df_dev (dev)
val = {'sentences':separate_text(df_dev), 'labels':clean_labels(df_dev)}
val_no = {'sentences':separate_text(df_dev_no)}

test = {'sentences':separate_text(df_test)}


df_train = pd.DataFrame(data=train)
df_val = pd.DataFrame(data=val)
df_test = pd.DataFrame(data=test)

df_val_no = pd.DataFrame(data=val_no)

['also . i have recently discovered advil liquigels work much better and faster for a headache than regular ibuprofen .', 'i have always heard that motrin is better than advil for fevers . and that advil works better for body aches and pains .', 'when i was a figure skater i injuried my ankles all the time and the quickest way back on your feet is regular doses of motrin . faster acting than advil and you can take it more often . ice and keeping it above your hip .', 'in a way . halloween is even better than thanksgiving . my absolute favorite holiday . because thanksgiving divides us into separate households . while halloween unites us into bands of candy bandits .', 'i think halloween is actually safer than christmas and thanksgiving .']
['meanwhile . though windows 8 is significantly at greater risk . 1 . 73 percent . compared to windows 8 . 1 . according to redmond . s report . it . s still significantly safer than windows 7 . windows xp . or windows vista .', 'windows 7 is still g

In [None]:
print(labels_to_ids)

{'O': 0, 'Object': 1, 'Aspect': 2, 'Predicate': 3}


In [None]:
print(ids_to_labels)

{0: 'O', 1: 'Object', 2: 'Aspect', 3: 'Predicate'}


In [None]:
df_train.head()

In [None]:
df_val.head()

In [None]:
df_val_no.head()

In [None]:
df_test.head()

## Text processing

In [14]:
tokenizer = BertTokenizerFast.from_pretrained('bert-base-cased')

In [15]:
label_all_tokens = False


def align_label(texts, labels):
    tokenized_inputs = tokenizer(texts, padding='max_length', max_length=512, truncation=True)

    word_ids = tokenized_inputs.word_ids()

    previous_word_idx = None
    label_ids = []

    for word_idx in word_ids:
        if word_idx is None:
            label_ids.append(-100)

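        elif word_idx != previous_word_idx:
            # First sub-token of a word: use the word's label, or ignore it if the word has no label.
            try:
                label_ids.append(labels_to_ids[labels[word_idx]])
            except (IndexError, KeyError):
                label_ids.append(-100)
        else:
            # Remaining sub-tokens are ignored unless label_all_tokens is set.
            try:
                label_ids.append(labels_to_ids[labels[word_idx]] if label_all_tokens else -100)
            except (IndexError, KeyError):
                label_ids.append(-100)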
        previous_word_idx = word_idx

    return label_ids

In [16]:
print(align_label(df_train.sentences[0], df_train.labels[0]))

[-100, 0, 0, 0, 0, 0, 0, 1, -100, 0, -100, -100, -100, -100, 0, 0, 3, 0, 3, 0, 0, 2, 0, 0, 1, -100, -100, -100, -100, 0, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -10
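The output above is easier to read with a small stand-alone example. The sketch below mirrors the alignment logic without loading a tokenizer: the `word_ids` list is a hypothetical stand-in for what `tokenized_inputs.word_ids()` might return, and `align` is an illustrative helper, not part of the notebook.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def align(word_ids, word_labels, label_all_tokens=False):
    """Spread word-level label ids over sub-tokens, masking everything else."""
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            # Special tokens and padding carry no label.
            aligned.append(IGNORE_INDEX)
        elif word_id != previous:
            # First sub-token of a word carries the word's label.
            aligned.append(word_labels[word_id])
        else:
            # Later sub-tokens are masked unless label_all_tokens is set.
            aligned.append(word_labels[word_id] if label_all_tokens else IGNORE_INDEX)
        previous = word_id
    return aligned


# Hypothetical tokenisation: [CLS] advil liqui ##gel ##s work [SEP] [PAD]
word_ids = [None, 0, 1, 1, 1, 2, None, None]
word_labels = [1, 1, 0]  # Object, Object, O

aligned = align(word_ids, word_labels)
print(aligned)  # [-100, 1, 1, -100, -100, 0, -100, -100]
```

With `label_all_tokens=False` only one position per word contributes to the loss, which is why most entries in the printed list are `-100`.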

In [17]:
class DataSequence(torch.utils.data.Dataset):

    def __init__(self, df, use_labels=False):
        

        self.use_labels = use_labels

        if self.use_labels: 
          lb = df.labels
        txt = df.sentences
        self.texts = [tokenizer(str(i),
                               padding='max_length', max_length = 512, truncation=True, return_tensors="pt") for i in txt]
        if self.use_labels: 
          self.labels = [align_label(i,j) for i,j in zip(txt, lb)]

    def __len__(self):
        
        if self.use_labels:
          return len(self.labels)
        else:
          return len(self.texts)
          
    def get_batch_data(self, idx):

        return self.texts[idx]

    def get_batch_labels(self, idx):

        return torch.LongTensor(self.labels[idx])

    def __getitem__(self, idx):

        if self.use_labels:
          batch_data = self.get_batch_data(idx)
          batch_labels = self.get_batch_labels(idx)
          
          return batch_data, batch_labels
          
        else:
          batch_data = self.get_batch_data(idx)
          
          return batch_data

## Word embedding

### BERT

#### Create a model

In [18]:
unique_labels = [0, 1, 2, 3]

In [19]:
class BertModel(torch.nn.Module):
    def __init__(self):
        super(BertModel, self).__init__()
        self.bert = BertForTokenClassification.from_pretrained('bert-base-cased', num_labels=len(unique_labels))

    def forward(self, input_id, mask, label):
        output = self.bert(input_ids=input_id, attention_mask=mask, labels=label, return_dict=False)
        return output

###### Train:

In [23]:
def train_loop(model, df_train, df_val):

    train_dataset = DataSequence(df_train, use_labels=True)
    val_dataset = DataSequence(df_val, use_labels=True)

    train_dataloader = DataLoader(train_dataset, num_workers=4, batch_size=TRAIN_BATCH_SIZE, shuffle=True)
    val_dataloader = DataLoader(val_dataset, num_workers=4, batch_size=TEST_BATCH_SIZE)

    use_cuda = torch.cuda.is_available()
    device = torch.device("cuda" if use_cuda else "cpu")


    LEARNING_RATE =  5e-3 #0.01 # 0.005, 0.001 # 0.0005

    scheduler_check = False # True

    optimizer = optim.SGD(model.parameters(), lr=LEARNING_RATE)


    if scheduler_check: 
      milestones =  [4, 7] #[6, 11, 14, 15] #[5, 9, 12, 14, 15] #[3, 6, 9, 12, 15]  # # # [5, 10, 13, 15, 16]
      scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones, gamma=0.1)


    if use_cuda:
        model = model.cuda()

    best_acc = 0
    best_loss = 1000

    for epoch_num in range(EPOCHS):

        total_acc_train = 0
        total_loss_train = 0

        model.train()
            
        # print("lr:", scheduler.get_last_lr())

        for train_data, train_label in tqdm(train_dataloader):

            # print(f'total_loss_train: {total_loss_train}')

            train_label = train_label.to(device)
            mask = train_data['attention_mask'].squeeze(1).to(device)
            input_id = train_data['input_ids'].squeeze(1).to(device)

            optimizer.zero_grad()
            loss, logits = model(input_id, mask, train_label)
            
            for i in range(logits.shape[0]):
            
                logits_clean = logits[i][train_label[i] != -100]
                label_clean = train_label[i][train_label[i] != -100]


                #print(f'logits_clean: {logits_clean.size(0)} | label_clean: {label_clean.size(0)}')
                assert logits_clean.size(0) == label_clean.size(0)


                predictions = logits_clean.argmax(dim=1)
                acc = (predictions == label_clean).float().mean()
                total_acc_train += acc
                total_loss_train += loss.item()

            loss.backward()
            optimizer.step()

        if scheduler_check: scheduler.step()


        model.eval()

        total_acc_val = 0
        total_loss_val = 0

        # Gradients are not needed for validation, so disable them to save memory.
        with torch.no_grad():
            for val_data, val_label in val_dataloader:

                val_label = val_label.to(device)
                mask = val_data['attention_mask'].squeeze(1).to(device)
                input_id = val_data['input_ids'].squeeze(1).to(device)

                loss, logits = model(input_id, mask, val_label)

                for i in range(logits.shape[0]):

                    # Keep only positions that carry a real label.
                    logits_clean = logits[i][val_label[i] != -100]
                    label_clean = val_label[i][val_label[i] != -100]

                    assert logits_clean.size(0) == label_clean.size(0)

                    predictions = logits_clean.argmax(dim=1)
                    acc = (predictions == label_clean).float().mean()
                    total_acc_val += acc
                    total_loss_val += loss.item()


        train_accuracy = total_acc_train / len(df_train)
        train_loss = total_loss_train / len(df_train)

        val_accuracy = total_acc_val / len(df_val)
        val_loss = total_loss_val / len(df_val)


        print(
            f'Epochs: {epoch_num + 1} | Train_Loss: {train_loss: .3f} | Train_Accuracy: {train_accuracy: .3f} | Val_Loss: {val_loss: .3f} | Val_Accuracy: {val_accuracy: .3f}')


        if best_acc < val_accuracy:
          best_acc = val_accuracy

        if best_loss > val_loss:
          best_loss = val_loss

        print(
            f'Best Val_Loss: {best_loss: .3f} | Best Val_Accuracy: {best_acc: .3f}')

In [24]:
cls_only = False

if cls_only:
  EPOCHS = 15
  TRAIN_BATCH_SIZE = 16 #16, 64, 32 8 
  TEST_BATCH_SIZE = 16 #16, 4
else:
  EPOCHS = 5 #7
  TRAIN_BATCH_SIZE = 2 #8
  TEST_BATCH_SIZE = 2 #4


model = BertModel()


if cls_only:
  for name, param in model.named_parameters():
    if param.requires_grad and not 'classifier' in name:
      #print(name)
      param.requires_grad = False

  for name, param in model.named_parameters():
    if param.requires_grad: print(name)

else:
  for param in model.parameters():
    param.requires_grad = True

  for name, param in model.named_parameters():
    if not param.requires_grad: print(name)  

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForTokenClassification: ['cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-cas

In [25]:
train_loop(model, df_train, df_val)

100%|██████████| 1167/1167 [03:33<00:00,  5.48it/s]


Epochs: 1 | Train_Loss:  0.263 | Train_Accuracy:  0.907 | Val_Loss:  0.352 | Val_Accuracy:  0.878
Best Val_Loss:  0.352 | Best Val_Accuracy:  0.878


100%|██████████| 1167/1167 [03:38<00:00,  5.35it/s]


Epochs: 2 | Train_Loss:  0.165 | Train_Accuracy:  0.941 | Val_Loss:  0.292 | Val_Accuracy:  0.893
Best Val_Loss:  0.292 | Best Val_Accuracy:  0.893


100%|██████████| 1167/1167 [03:39<00:00,  5.33it/s]


Epochs: 3 | Train_Loss:  0.138 | Train_Accuracy:  0.949 | Val_Loss:  0.369 | Val_Accuracy:  0.878
Best Val_Loss:  0.292 | Best Val_Accuracy:  0.893


100%|██████████| 1167/1167 [03:39<00:00,  5.32it/s]


Epochs: 4 | Train_Loss:  0.119 | Train_Accuracy:  0.956 | Val_Loss:  0.321 | Val_Accuracy:  0.892
Best Val_Loss:  0.292 | Best Val_Accuracy:  0.893


100%|██████████| 1167/1167 [03:39<00:00,  5.33it/s]


Epochs: 5 | Train_Loss:  0.104 | Train_Accuracy:  0.963 | Val_Loss:  0.372 | Val_Accuracy:  0.886
Best Val_Loss:  0.292 | Best Val_Accuracy:  0.893
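The accuracies reported above are computed per sentence over the positions whose label is not `-100`. A minimal plain-Python sketch of that masked accuracy is shown below; the scores and labels are made-up values, and `masked_accuracy` is a hypothetical helper rather than the tensor code used in `train_loop`.

```python
def masked_accuracy(logits, labels, ignore_index=-100):
    """Fraction of correctly predicted positions, skipping ignored labels."""
    correct = total = 0
    for scores, label in zip(logits, labels):
        if label == ignore_index:
            continue
        # argmax over the class scores of this position
        predicted = max(range(len(scores)), key=scores.__getitem__)
        correct += int(predicted == label)
        total += 1
    return correct / total if total else 0.0


# Made-up scores for four positions and four classes (O, Object, Aspect, Predicate).
logits = [
    [0.1, 0.9, 0.0, 0.0],  # ignored position ([CLS], padding or a later sub-token)
    [2.0, 0.1, 0.0, 0.0],  # predicts 0, label 0: correct
    [0.0, 0.0, 3.0, 0.0],  # predicts 2, label 1: wrong
    [1.0, 0.0, 0.0, 0.0],  # predicts 0, label 0: correct
]
labels = [-100, 0, 1, 0]

accuracy = masked_accuracy(logits, labels)
print(round(accuracy, 3))
```

Because most words are labelled `O`, token accuracy of this kind can look high even when few role spans are found, so it is worth reading alongside a span-level metric.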


### Checking

In [26]:
# df_devo_no = pd.read_csv(path_dev_no, sep='\t', header= None, names=['data'], quoting=3, skip_blank_lines=False) # Dev dataset
df_devo_no = pd.read_csv(path_dev_no, sep='\t', header= None, names=['data'], quoting=3) # Dev dataset
df_le_testo = pd.read_csv(path_test, sep='\t', header= None, names=['data'], quoting=3) # Test dataset

df_devo_no
df_le_testo

Unnamed: 0,data
0,plus
1,","
2,android
3,is
4,developing
...,...
9439,and
9440,steal
9441,its
9442,thunder


In [31]:
def align_word_ids(texts):
  
    tokenized_inputs = tokenizer(texts, padding='max_length', max_length=512, truncation=True)

    word_ids = tokenized_inputs.word_ids()

    previous_word_idx = None
    label_ids = []

    for word_idx in word_ids:

        if word_idx is None:
            label_ids.append(-100)

        elif word_idx != previous_word_idx:
            # First sub-token of a word: keep its prediction.
            label_ids.append(1)
        else:
            # Remaining sub-tokens are kept only if label_all_tokens is set.
            label_ids.append(1 if label_all_tokens else -100)
        previous_word_idx = word_idx

    return label_ids


def convert_labels(prediction_label):
    
    labels = []
    prev_i = 'O'

    for i in prediction_label:
        l = 'B-'
        if i != 'O':
            if prev_i == i:
               l = 'I-'
            l += i
            labels.append(l)
        else:
            labels.append('O')
        prev_i = i

    return labels


indexes_nan = []


def evaluate(model, df_test):

    test_dataset = DataSequence(df_test, use_labels=False)
    print(len(test_dataset))

    test_dataloader = DataLoader(test_dataset, num_workers=4, batch_size=1) # BATCH_SIZE

    use_cuda = torch.cuda.is_available()
    device = torch.device("cuda" if use_cuda else "cpu")

    if use_cuda:
        model = model.cuda()

    model.eval()

    labels_list = []
    indexes_nan.clear()  # drop sentence boundaries left over from a previous call

    for id, test_data in enumerate(test_dataloader):

        mask = test_data['attention_mask'].squeeze(1).to(device)
        input_id = test_data['input_ids'].squeeze(1).to(device)

        sentence = df_test["sentences"][id]

        label_ids = torch.Tensor(align_word_ids(sentence)).unsqueeze(0).to(device)

        logits = model(input_id, mask, label=None)

        logits_clean = logits[0][label_ids != -100]

        predictions = logits_clean.argmax(dim=1).tolist()

        prediction_label = []
        for i in predictions:
            prediction_label.append(ids_to_labels[i])

        labels_list += convert_labels(prediction_label)

        indexes_nan.append(len(labels_list))
        # labels_list.append('_NaN_') # Add empty lines

    # df_devo_no['label'] = [labels_list[i] for i in range(df_devo_no.shape[0])]
    df_le_testo['label'] = [labels_list[i] for i in range(df_le_testo.shape[0])]


    # for i in range(df_devo_no.shape[0]):
    #     if i != '_NaN_':
    #         df_devo_no['label'].iloc[i] = labels_list[i]


    # to CSV
    # df_devo_no.to_csv('/content/semantic-role-labelling/df_devo_no_my.tsv', header=None, index=False, quoting=3, sep='\t', encoding='utf-8')
    df_le_testo.to_csv('/content/semantic-role-labelling/df_le_testo_my.tsv', header=None, index=False, quoting=3, sep='\t', encoding='utf-8')

    # return df_devo_no
    return df_le_testo
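Since the model predicts only the four collapsed roles, the `B-`/`I-` prefixes have to be rebuilt before the answers are written. The stand-alone sketch below follows the same rule as `convert_labels`, applied to a hypothetical prediction sequence; the function name `roles_to_bio` is illustrative.

```python
def roles_to_bio(roles):
    """Rebuild BIO tags from a flat sequence of role names."""
    tags = []
    previous = "O"
    for role in roles:
        if role == "O":
            tags.append("O")
        elif role == previous:
            # Same role as the previous word: continue the span.
            tags.append("I-" + role)
        else:
            # A different role (or a preceding "O"): start a new span.
            tags.append("B-" + role)
        previous = role
    return tags


# Hypothetical predictions for six words.
roles = ["Object", "Object", "O", "Predicate", "Aspect", "Aspect"]
bio = roles_to_bio(roles)
print(bio)
# ['B-Object', 'I-Object', 'O', 'B-Predicate', 'B-Aspect', 'I-Aspect']
```

One consequence of this rule is that two adjacent spans with the same role are always merged into one, because the information that separated them was dropped when the labels were collapsed.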

In [28]:
# model = BertModel()

d = evaluate(model, df_val)

d

283




Unnamed: 0,data,label
0,meanwhile,O
1,",",O
2,though,O
3,windows,O
4,8,O
...,...,...
8358,wallet,O
8359,",",O
8360,or,O
8361,purse,O


In [29]:
print(indexes_nan)

[47, 90, 104, 127, 170, 206, 223, 249, 284, 320, 370, 417, 428, 439, 460, 478, 496, 529, 551, 585, 604, 633, 654, 681, 711, 746, 786, 807, 821, 837, 857, 907, 948, 1035, 1062, 1108, 1153, 1177, 1213, 1239, 1272, 1284, 1306, 1321, 1348, 1390, 1403, 1430, 1456, 1479, 1509, 1560, 1588, 1611, 1650, 1675, 1711, 1731, 1753, 1784, 1796, 1817, 1833, 1852, 1879, 1906, 1937, 1969, 2009, 2045, 2064, 2114, 2122, 2155, 2180, 2211, 2222, 2246, 2280, 2322, 2340, 2379, 2402, 2425, 2453, 2485, 2531, 2557, 2579, 2604, 2641, 2681, 2692, 2731, 2770, 2780, 2800, 2833, 2869, 2901, 2937, 2962, 2979, 2996, 3018, 3043, 3062, 3077, 3109, 3147, 3171, 3186, 3219, 3256, 3292, 3320, 3368, 3396, 3442, 3463, 3494, 3522, 3555, 3588, 3606, 3629, 3659, 3696, 3762, 3804, 3841, 3877, 3927, 3945, 3971, 4013, 4025, 4056, 4093, 4144, 4188, 4239, 4265, 4304, 4341, 4371, 4409, 4466, 4514, 4533, 4568, 4621, 4662, 4692, 4725, 4733, 4762, 4772, 4826, 4838, 4852, 4868, 4889, 4915, 4938, 4984, 5004, 5017, 5042, 5069, 5092, 5111, 51

In [None]:
d.to_csv('/content/semantic-role-labelling/d.tsv', header=None, index=False, quoting=3, sep='\t', encoding='utf-8')

with open("/content/semantic-role-labelling/d.tsv") as input:
    lines = [line for line in input if line.strip()]
with open("/content/semantic-role-labelling/d_new.tsv", "w") as output:
    i = 0
    for line in lines:
        output.write(line)
        if i+1 in (indexes_nan):
            output.write("\n")
        i += 1
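The cell above restores the empty lines between sentences by checking a running line count against `indexes_nan`. The same idea can be sketched on in-memory lines, as below; the sample lines and the helper `insert_sentence_breaks` are hypothetical, and a set is used for the boundary lookup, which is an optional change from the list used in the notebook.

```python
def insert_sentence_breaks(lines, boundaries):
    """Return the lines with an empty line added after each boundary count."""
    boundary_set = set(boundaries)  # constant-time membership checks
    out = []
    for count, line in enumerate(lines, start=1):
        out.append(line)
        if count in boundary_set:
            out.append("\n")
    return out


# Hypothetical tab-separated predictions for two sentences (3 and 2 words).
lines = ["plus\tO\n", "android\tB-Object\n", "wins\tO\n", "ios\tB-Object\n", "too\tO\n"]
result = insert_sentence_breaks(lines, boundaries=[3, 5])
print("".join(result))
```

The boundaries are cumulative word counts, so they only line up with the file when they come from the same `evaluate` call that produced its labels.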
    


In [32]:
d_2 = evaluate(model, df_test)

d_2

360




Unnamed: 0,data,label
0,plus,O
1,",",O
2,android,B-Object
3,is,O
4,developing,B-Aspect
...,...,...
9439,and,O
9440,steal,O
9441,its,O
9442,thunder,O


In [None]:
d_2.to_csv('/content/semantic-role-labelling/d_2.tsv', header=None, index=False, quoting=3, sep='\t', encoding='utf-8')

with open("/content/semantic-role-labelling/d_2.tsv") as input:
    lines = [line for line in input if line.strip()]
with open("/content/semantic-role-labelling/d_test_new.tsv", "w") as output:
    i = 0
    for line in lines:
        output.write(line)
        if i+1 in (indexes_nan):
            output.write("\n")
        i += 1


### Save answers to .tsv file:

In [None]:
path_my = '/content/semantic-role-labelling/' + dataset_test + '_my.tsv'
df_my = pd.read_csv(path_my, sep='\t', header= None,  names=['data', 'label'], quoting=3, skip_blank_lines=False)

print(df_my.shape)

(9804, 2)


In [None]:
df_testo

Unnamed: 0,data,labels
0,plus,O
1,",",O
2,android,B-Object
3,is,O
4,developing,B-Aspect
...,...,...
9799,steal,O
9800,its,
9801,thunder,O
9802,.,B-Object


In [None]:
df_my

Unnamed: 0,data,label
0,plus,O
1,",",O
2,android,B-Object
3,is,O
4,developing,B-Aspect
...,...,...
9799,steal,O
9800,its,
9801,thunder,O
9802,.,B-Object
