## Lacuna Masakhane Parts of Speech Classification Challenge
* Part-of-speech (POS) tagging is a crucial step in natural language processing (NLP) because it exposes the grammatical structure of a text to downstream algorithms. This is especially important when creating the building blocks that prepare low-resource African languages for NLP tasks. The MasakhaPOS dataset, which covers 20 typologically diverse African languages and includes benchmarks, was created with the help of Lacuna Fund to address this problem.

* The objective of this challenge is to create a machine learning solution that correctly classifies 14 parts of speech for the unrelated Luo and Setswana languages. You must build a single solution that handles both languages, not a separate solution for each.

* A single solution matters because it is a step towards one model that can be applied to many different languages, instead of training a new model for each language.



### Installing Necessary Dependencies

In [1]:
!git clone https://github.com/NtemKenyor/masakhane-pos

fatal: destination path 'masakhane-pos' already exists and is not an empty directory.


In [2]:
!pip install datasets transformers seqeval
!pip install -U accelerate


### Importing Necessary Dependencies

In [3]:
import transformers
import torch
import pandas as pd
import numpy as np
from datasets import Dataset, DatasetDict, load_dataset, load_metric
import ast
import wandb
import random
import os
import warnings
import glob
from sklearn.metrics import accuracy_score
warnings.filterwarnings("ignore")
from transformers import (
    AutoModel, AutoTokenizer, AdamW, get_linear_schedule_with_warmup,
    AutoModelForMaskedLM, Trainer, TrainingArguments, DataCollatorForLanguageModeling,
    EarlyStoppingCallback, AutoModelForTokenClassification,
)
print(transformers.__version__)



4.31.0


### Config

In [4]:
class CFG :
    path = '/kaggle/input/masakhane-ner-update/'
    project_name = 'Ner_Masakhane_POS_Classification'
    model_nm = 'Davlan/afro-xlmr-large-75L'
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    seed = 42
    test_path = "/kaggle/working/masakhane-pos/data/Test.csv"
    train_path = "/kaggle/working/masakhane-pos/data/africa_lan.csv"
    label_mappings = {0: 'ADJ', 1: 'ADP', 2: 'ADV', 3: 'AUX', 4: 'CCONJ', 5: 'DET', 6: 'INTJ', 7: 'NOUN', 8: 'NUM', 9: 'PART', 10: 'PRON', 11: 'PROPN', 12: 'PUNCT', 13: 'SCONJ', 14: 'SYM', 15: 'VERB', 16: 'X'}
    batch_size = 16
    max_length = 135
    num_classes = 17
    valid_languages = ['wol' ,'sna']
    dropout = 0.0
    num_epoch = 30
    early_stopping_patience = 30
    working_path = '/kaggle/working/'
    lr = 1e-06
    warmup_steps = 100
    gradient_accumulation_steps = 1
    data_dir = "/kaggle/working/masakhane-pos/data"


### WANDB - Track your trial runs

In [5]:
wandb.login()  
wandb.init(project=CFG.project_name)

[34m[1mwandb[0m: Currently logged in as: [33mkoleshjr[0m ([33mteam-kolesh[0m). Use [1m`wandb login --relogin`[0m to force relogin


In [6]:
# log every trained model
%env WANDB_LOG_MODEL=true

env: WANDB_LOG_MODEL=true


### Reproducibility

In [7]:
def set_random_seed(random_seed):
    random.seed(random_seed)
    np.random.seed(random_seed)
    os.environ["PYTHONHASHSEED"] = str(random_seed)

    torch.manual_seed(random_seed)
    torch.cuda.manual_seed(random_seed)
    torch.cuda.manual_seed_all(random_seed)

    torch.backends.cudnn.deterministic = True
set_random_seed(CFG.seed)
transformers.set_seed(CFG.seed)

* Send telemetry so that the Hugging Face maintainers can see which examples and software versions are being used.

In [8]:
from transformers.utils import send_example_telemetry
send_example_telemetry("token_classification_notebook", framework="pytorch")

### Loading The Data

In [9]:
train = pd.read_csv(CFG.train_path)
train = train.dropna()
test = pd.read_csv(CFG.test_path)

display(train.head(), test.head())

Unnamed: 0,word,tag,lang
0,Do,VERB,pcm
1,senator,NOUN,pcm
2,tok,VERB,pcm
3,dis,DET,pcm
4,one,NUM,pcm


Unnamed: 0,Id,Word,Language,Pos
0,Id00qog2f11n_0,Ne,luo,
1,Id00qog2f11n_1,otim,luo,
2,Id00qog2f11n_2,penj,luo,
3,Id00qog2f11n_3,e,luo,
4,Id00qog2f11n_4,kind,luo,


In [10]:
test['Pos'] = 'X'  # placeholder tag; the real tags are predicted at inference time
test['sentence_Id'] = test['Id'].apply(lambda x: x.split('_')[0])
test = test.groupby('sentence_Id').agg(list).reset_index()
test.head(2)


Unnamed: 0,sentence_Id,Id,Word,Language,Pos
0,Id00qog2f11n,"[Id00qog2f11n_0, Id00qog2f11n_1, Id00qog2f11n_...","[Ne, otim, penj, e, kind, Februar, tarik, 9, g...","[luo, luo, luo, luo, luo, luo, luo, luo, luo, ...","[X, X, X, X, X, X, X, X, X, X, X, X, X, X, X, ..."
1,Id01lywjj7oz,"[Id01lywjj7oz_0, Id01lywjj7oz_1, Id01lywjj7oz_...","[Sifuna, ne, ojiwo, jonyuol, kod, joma, moko, ...","[luo, luo, luo, luo, luo, luo, luo, luo, luo, ...","[X, X, X, X, X, X, X, X, X, X, X, X, X, X, X, ..."


### Data Preprocessing

* Inside `read_examples_from_file`, two empty lists are initialized: example_words and example_labels. They collect the words and the corresponding labels of every example (sentence) in the file.

* It opens the file specified by file_path for reading, assuming it is encoded in UTF-8.

* The code then enters a loop to read the file line by line. For each line:

   * a. It strips any leading or trailing whitespace using line.strip().

   * b. It checks whether the stripped line is shorter than 2 characters. Because the line has already been stripped, the additional comparison with "\n" can never be true; in practice a blank line is what marks the end of an example.

   * c. If an example has been completed (i.e., words is not empty), it appends the collected words to the example_words list and the corresponding labels to the example_labels list.

   * d. It resets the words and labels lists to prepare for the next example.

* Otherwise (the line belongs to an ongoing example), it splits the line on single spaces and appends the first field to the words list.

* If the split yields more than one field (i.e., len(splits) > 1), the last field is the label and is appended to the labels list. If the line has no label, as can happen in test files, the placeholder label "O" is used.

* After processing all lines in the file, the code checks if there are any remaining words in the words list. If so, it appends them as the last example in the example_words and example_labels lists.

* Finally, the function returns two lists: example_words, which contains lists of words for each example, and example_labels, which contains lists of corresponding labels.

In [11]:
def read_examples_from_file(file_path):
    example_words = []
    example_labels = []
    with open(file_path, encoding="utf-8") as f:
        words = []
        labels = []
        for line in f:
            line = line.strip()
            if len(line) < 2  or line == "\n":
                if words:
                    example_words.append(words)
                    example_labels.append(labels)

                    words = []
                    labels = []
            else:
                splits = line.split(" ")
                words.append(splits[0])
                if len(splits) > 1:
                    labels.append(splits[-1].replace("\n", ""))
                else:
                    # Examples could have no label for mode = "test"
                    labels.append("O")
        if words:
            example_words.append(words)
            example_labels.append(labels)
    return example_words, example_labels
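
To make the expected input concrete, here is a minimal sketch of the file format the reader handles: one `word label` pair per line, with a blank line between sentences. The helper `parse_lines` is a hypothetical in-memory stand-in for `read_examples_from_file` (it takes a list of lines instead of a path), and the sample tokens are borrowed from the training preview shown earlier.

```python
def parse_lines(lines):
    """Hypothetical in-memory stand-in for read_examples_from_file."""
    examples, words, labels = [], [], []
    for raw in lines:
        line = raw.strip()
        if len(line) < 2:
            # A blank (or one-character) line closes the current sentence.
            if words:
                examples.append((words, labels))
                words, labels = [], []
            continue
        fields = line.split(" ")
        words.append(fields[0])
        # The last field is the tag; unlabeled lines get the placeholder "O".
        labels.append(fields[-1] if len(fields) > 1 else "O")
    if words:
        # Flush the final sentence when the input has no trailing blank line.
        examples.append((words, labels))
    return examples


sample = "Do VERB\nsenator NOUN\ntok VERB\n\ndis DET\none NUM\n"
parsed = parse_lines(sample.splitlines())
print(parsed)
```

Because of the `len(line) < 2` check, an unlabeled single-character line is treated as a sentence boundary as well, which is worth keeping in mind for unlabeled test files.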

In [13]:
train_words = []
train_labels = []

valid_words = []
valid_labels = []

for file_path in glob.glob(f"{CFG.data_dir}/**/*.txt"):
    words_list, labels_list = read_examples_from_file(file_path)

    language = file_path.split(os.sep)[-2]
    if language in CFG.valid_languages:
        valid_words.extend(words_list)
        valid_labels.extend(labels_list)
    else:
        train_words.extend(words_list)
        train_labels.extend(labels_list)

In [14]:
len(train_labels), len(train_words), len(valid_labels), len(valid_words)

(24418, 24418, 3055, 3055)

In [15]:
train = pd.DataFrame()
train['Word'] = train_words
train['Pos'] = train_labels
train.head(2)

Unnamed: 0,Word,Pos
0,"[Ndiyabazi, abalandeli, bethu, balambele, nton...","[VERB, NOUN, PRON, VERB, NOUN, PUNCT]"
1,"[Ndiyayazi, nendlela, abaziva, ngayo, ngento, ...","[VERB, NOUN, VERB, PRON, NOUN, VERB, PUNCT]"


In [16]:
valid = pd.DataFrame()
valid['Word'] = valid_words
valid['Pos'] = valid_labels
valid.head(2)

Unnamed: 0,Word,Pos
0,"[Mapurisa, e, Zimbabwe, Republic, Police, akap...","[NOUN, ADP, PROPN, ADJ, NOUN, VERB, NOUN, ADJ,..."
1,"[Mapurisa, akaparura, chirongwa, ichi, ne, Chi...","[NOUN, VERB, NOUN, DET, ADP, NOUN, VERB, ADP, ..."


In [17]:
print(train.shape, valid.shape, test.shape)

labels = ["X", "ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ", "NOUN", "NUM", "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB"]
CFG.num_classes = len(labels)

(24418, 2) (3055, 2) (1208, 5)


### Tokenization
* The code starts by importing the AutoTokenizer class from the "transformers" library. This library is commonly used for working with pre-trained models for natural language processing.

* It initializes a tokenizer object using the AutoTokenizer.from_pretrained method. The specific model used for tokenization is determined by the value of CFG.model_nm.

* The convert_to_feature function takes several arguments, including row, which is expected to be a row of data containing words and their corresponding labels. It also takes various configuration parameters, such as max_seq_length, cls_token_at_end, and token-related settings.

* Inside the function, a label_map is created, which is a dictionary that maps labels to their integer representations. This mapping is based on the label_list provided as an argument.

* The function processes the words and labels in the input row. It tokenizes each word using the tokenizer and appends the resulting sub-tokens to the tokens list. Only the first sub-token of a word receives the word's label ID; any remaining sub-tokens receive pad_token_label_id (-100), so the loss and the metric ignore them.

* The code accounts for the maximum sequence length by truncating or padding the tokens and label_ids lists. If the length of tokens exceeds max_seq_length minus some special tokens count (which depends on the tokenizer and settings), it truncates the lists.

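* It adds special tokens like [SEP] (separator token) and [CLS] (classification token) to the tokens and label_ids lists as needed. The specific position of these tokens depends on the cls_token_at_end setting. Note that the default strings "[CLS]" and "[SEP]" are BERT-style; the XLM-R tokenizer loaded here uses `<s>` and `</s>`, so the defaults appear to be mapped to the unknown-token ID (the leading 3 in the input_ids shown in the outputs below). Passing tokenizer.cls_token and tokenizer.sep_token would insert the model's own special tokens.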

* Segment IDs are created to distinguish between different segments in the input. For most models, sequence_a_segment_id is assigned to all tokens in the input.

* The input_ids are generated by converting the tokens into their corresponding token IDs using the tokenizer.

* An input_mask is created to distinguish between real tokens and padding tokens. Real tokens have a value of 1, while padding tokens have a value of 0.

* The code ensures that the input data is zero-padded to the max_seq_length. The specific padding strategy depends on the pad_on_left setting.

* Several assertions are used to verify the lengths of various lists to ensure they match the max_seq_length.

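* Finally, the function returns a dictionary containing the following keys: input_ids, input_mask, segment_ids, and label_ids, which represent the processed input features for a natural language processing model. Trainer by default drops dataset columns that the model's forward method does not accept, so the mask would need to be named attention_mask to reach the model.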
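
The label alignment performed by `convert_to_feature` can be illustrated with a minimal sketch. It assumes a toy tokenizer (`toy_tokenize`, hypothetical, splitting words into fixed-size character chunks) in place of the real subword tokenizer, and uses the label IDs that follow from the `labels` list defined earlier (NOUN = 8, VERB = 16).

```python
IGNORE_ID = -100  # ignored by PyTorch's cross-entropy loss by default


def toy_tokenize(word, piece_len=4):
    """Hypothetical stand-in for a subword tokenizer: fixed-size chunks."""
    return [word[i:i + piece_len] for i in range(0, len(word), piece_len)]


def align_labels(words, tags, label_to_id):
    tokens, label_ids = [], []
    for word, tag in zip(words, tags):
        pieces = toy_tokenize(word)
        tokens.extend(pieces)
        # Real label on the first piece, IGNORE_ID on every following piece.
        label_ids.extend([label_to_id[tag]] + [IGNORE_ID] * (len(pieces) - 1))
    return tokens, label_ids


label_to_id = {"NOUN": 8, "VERB": 16}
tokens, label_ids = align_labels(
    ["Ndiyabazi", "abalandeli"], ["VERB", "NOUN"], label_to_id
)
print(tokens)     # ['Ndiy', 'abaz', 'i', 'abal', 'ande', 'li']
print(label_ids)  # [16, -100, -100, 8, -100, -100]
```

Each word contributes exactly one scored position, which is what later allows predictions to be mapped back to words one-to-one.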

In [18]:
from transformers import AutoTokenizer
    
tokenizer = AutoTokenizer.from_pretrained(CFG.model_nm, use_fast=True)
def convert_to_feature(
    row,
    tokenizer=tokenizer,
    label_list=labels,
    max_seq_length=CFG.max_length,
    cls_token_at_end=False,
    cls_token="[CLS]",
    cls_token_segment_id=1,
    sep_token="[SEP]",
    sep_token_extra=False,
    pad_on_left=False,
    pad_token=0,
    pad_token_segment_id=0,
    pad_token_label_id=-100,
    sequence_a_segment_id=0,
    mask_padding_with_zero=True,
):
    """ Converts one row (a list of words and a list of POS tags) into padded model input features.
        `cls_token_at_end` define the location of the CLS token:
            - False (Default, BERT/XLM pattern): [CLS] + A + [SEP] + B + [SEP]
            - True (XLNet/GPT pattern): A + [SEP] + B + [SEP] + [CLS]
        `cls_token_segment_id` define the segment id associated to the CLS token (0 for BERT, 2 for XLNet)
    """

    label_map = {label: i for i, label in enumerate(label_list)}

    tokens = []
    label_ids = []
    for word, label in zip(row['Word'], row['Pos']):
        word_tokens = tokenizer.tokenize(word)
        tokens.extend(word_tokens)
        # Use the real label id for the first token of the word, and padding ids for the remaining tokens
        label_ids.extend([label_map[label]] + [pad_token_label_id] * (len(word_tokens) - 1))

    # Account for [CLS] and [SEP] with "- 2" and with "- 3" for RoBERTa.
    special_tokens_count = 3 if sep_token_extra else 2
    if len(tokens) > max_seq_length - special_tokens_count:
        tokens = tokens[: (max_seq_length - special_tokens_count)]
        label_ids = label_ids[: (max_seq_length - special_tokens_count)]

    tokens += [sep_token]
    label_ids += [pad_token_label_id]
    if sep_token_extra:
        # roberta uses an extra separator b/w pairs of sentences
        tokens += [sep_token]
        label_ids += [pad_token_label_id]
    segment_ids = [sequence_a_segment_id] * len(tokens)

    if cls_token_at_end:
        tokens += [cls_token]
        label_ids += [pad_token_label_id]
        segment_ids += [cls_token_segment_id]
    else:
        tokens = [cls_token] + tokens
        label_ids = [pad_token_label_id] + label_ids
        segment_ids = [cls_token_segment_id] + segment_ids

    input_ids = tokenizer.convert_tokens_to_ids(tokens)

    # The mask has 1 for real tokens and 0 for padding tokens.
    # Only real tokens are attended to.
    input_mask = [1 if mask_padding_with_zero else 0] * len(input_ids)

    # Zero-pad up to the sequence length.
    padding_length = max_seq_length - len(input_ids)
    if pad_on_left:
        input_ids = ([pad_token] * padding_length) + input_ids
        input_mask = ([0 if mask_padding_with_zero else 1] * padding_length) + input_mask
        segment_ids = ([pad_token_segment_id] * padding_length) + segment_ids
        label_ids = ([pad_token_label_id] * padding_length) + label_ids
    else:
        input_ids += [pad_token] * padding_length
        input_mask += [0 if mask_padding_with_zero else 1] * padding_length
        segment_ids += [pad_token_segment_id] * padding_length
        label_ids += [pad_token_label_id] * padding_length

    assert len(input_ids) == max_seq_length
    assert len(input_mask) == max_seq_length
    assert len(segment_ids) == max_seq_length

    return dict(input_ids=input_ids, input_mask=input_mask, segment_ids=segment_ids, label_ids=label_ids)

In [19]:
%%time
# train features
train_features = pd.DataFrame(train.apply(lambda row: convert_to_feature(row), axis=1).tolist())
train_features.head(2)

CPU times: user 32.8 s, sys: 96.6 ms, total: 32.9 s
Wall time: 32.9 s


Unnamed: 0,input_ids,input_mask,segment_ids,label_ids
0,"[3, 541, 49778, 159236, 10, 87518, 34642, 186,...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[-100, 16, -100, -100, 8, -100, -100, 11, -100..."
1,"[3, 99501, 53187, 708, 108, 71065, 10, 159236,...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[-100, 16, -100, -100, 8, -100, 16, -100, -100..."


In [20]:
%%time
# valid features
valid_features = pd.DataFrame(valid.apply(lambda row: convert_to_feature(row), axis=1).tolist())
valid_features.head(2)

CPU times: user 4.7 s, sys: 12 ms, total: 4.71 s
Wall time: 4.71 s


Unnamed: 0,input_ids,input_mask,segment_ids,label_ids
0,"[3, 911, 51166, 433, 28, 147326, 47806, 72110,...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[-100, 8, -100, -100, 2, 12, 1, 8, 16, -100, -..."
1,"[3, 911, 51166, 433, 15623, 2500, 3168, 1658, ...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[-100, 8, -100, -100, 16, -100, -100, 8, -100,..."


In [21]:
%%time
# test features
test_features = pd.DataFrame(test.apply(lambda row: convert_to_feature(row), axis=1).tolist())
test_features.head(2)

CPU times: user 1.83 s, sys: 8.94 ms, total: 1.84 s
Wall time: 1.83 s


Unnamed: 0,input_ids,input_mask,segment_ids,label_ids
0,"[3, 799, 36, 5083, 5551, 170, 28, 8562, 61783,...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[-100, 0, 0, -100, 0, -100, 0, 0, 0, 0, 0, 0, ..."
1,"[3, 602, 95635, 108, 36, 658, 3613, 741, 299, ...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[-100, 0, -100, 0, 0, -100, -100, 0, -100, -10..."


In [22]:
masakhane = DatasetDict({
    "train": Dataset.from_pandas(train_features),
    "valid": Dataset.from_pandas(valid_features),
    "test": Dataset.from_pandas(test_features),
})

masakhane

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'input_mask', 'segment_ids', 'label_ids'],
        num_rows: 24418
    })
    valid: Dataset({
        features: ['input_ids', 'input_mask', 'segment_ids', 'label_ids'],
        num_rows: 3055
    })
    test: Dataset({
        features: ['input_ids', 'input_mask', 'segment_ids', 'label_ids'],
        num_rows: 1208
    })
})

In [23]:
label_map = {i: label for i, label in enumerate(labels)}

### Metrics

In [24]:
def compute_metrics(
    eval_pred,
    pad_token_label_id=-100,
):
    predictions, labels = eval_pred
    preds = np.argmax(predictions, axis = -1)

    out_label_list = []
    preds_list = []
    for i in range(labels.shape[0]):
        for j in range(labels.shape[1]):
            if labels[i, j] != pad_token_label_id:
                out_label_list.append(label_map[labels[i][j]])
                preds_list.append(label_map[preds[i][j]])

    accuracy = accuracy_score(out_label_list, preds_list)
    return {"accuracy": accuracy}
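
As a worked illustration of what `compute_metrics` measures, the following minimal sketch computes accuracy only over positions whose gold label is not -100. It works on plain Python lists of already-argmaxed IDs; the function name `masked_accuracy` and the toy values are assumptions for illustration.

```python
IGNORE_ID = -100


def masked_accuracy(pred_ids, gold_ids, ignore_id=IGNORE_ID):
    """Accuracy over positions whose gold label is not ignore_id."""
    pairs = [
        (p, g)
        for pred_row, gold_row in zip(pred_ids, gold_ids)
        for p, g in zip(pred_row, gold_row)
        if g != ignore_id
    ]
    correct = sum(p == g for p, g in pairs)
    return correct / len(pairs)


# Two toy sentences; -100 marks special tokens, trailing sub-tokens and padding.
gold = [[-100, 16, -100, 8, -100], [-100, 8, 13, -100, -100]]
pred = [[3, 16, 16, 8, 0], [0, 16, 13, 5, 5]]
acc = masked_accuracy(pred, gold)
print(acc)  # 0.75: three of the four scored positions match
```

Predictions at ignored positions (for example the 3 and 0 at the start of each row) have no effect on the score.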

### Build The Model

In [25]:
from transformers import AutoModelForTokenClassification, TrainingArguments, Trainer
model = AutoModelForTokenClassification.from_pretrained(
    CFG.model_nm,
    num_labels = CFG.num_classes,
    id2label={str(i): label for i, label in enumerate(labels)},
    label2id={label: i for i, label in enumerate(labels)},
)

Some weights of XLMRobertaForTokenClassification were not initialized from the model checkpoint at Davlan/afro-xlmr-large-75L and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [26]:
epoch_steps = int(np.ceil(len(masakhane['train']) / CFG.batch_size))
display(epoch_steps)

1527
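
The arithmetic behind this value, and behind the `epoch_steps//2` interval used in the training arguments, can be checked with a short sketch. It assumes a single device and no gradient accumulation; the variable names are illustrative.

```python
import math

num_train_examples = 24418  # rows in the training split shown above
batch_size = 16             # CFG.batch_size
num_epochs = 30             # CFG.num_epoch

steps_per_epoch = math.ceil(num_train_examples / batch_size)
eval_every = steps_per_epoch // 2          # evaluate roughly twice per epoch
total_steps = steps_per_epoch * num_epochs

print(steps_per_epoch, eval_every, total_steps)  # 1527 763 45810
```

With 100 warmup steps, the learning rate therefore ramps up over only a small fraction of the first epoch.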

### Training and Evaluation

In [27]:
training_args = TrainingArguments(
    output_dir='./masakhane-pos',
    evaluation_strategy = "steps",
    save_strategy='steps',
    eval_steps = epoch_steps//2,
    save_steps = epoch_steps//2,
    logging_strategy="steps",
    logging_steps= epoch_steps//2,
    learning_rate=CFG.lr,
    save_total_limit=1,
    num_train_epochs=CFG.num_epoch,
    fp16=True,
    report_to='none',
    metric_for_best_model = "accuracy",
    greater_is_better=True,
    load_best_model_at_end = True,
    per_device_train_batch_size=CFG.batch_size,
    per_device_eval_batch_size=CFG.batch_size,
    warmup_steps = CFG.warmup_steps,
)

In [28]:
trainer = Trainer(
    model = model,
    args = training_args,
    compute_metrics = compute_metrics,
    train_dataset = masakhane['train'],
    eval_dataset = masakhane['valid'],
    tokenizer = tokenizer,
    callbacks = [EarlyStoppingCallback(5)],
)

In [None]:
%%time
trainer.train()

You're using a XLMRobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Step,Training Loss,Validation Loss,Accuracy
763,1.7774,1.046259,0.706392
1526,0.8786,0.845609,0.761507
2289,0.6775,0.804041,0.774852
3052,0.5814,0.781334,0.7793
3815,0.5262,0.77948,0.777291
4578,0.4882,0.777088,0.776538
5341,0.4564,0.758541,0.778332
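
To see how the patience of 5 passed to `EarlyStoppingCallback` interacts with the log above, here is a simplified sketch of patience-based early stopping. It is not the library's implementation: `early_stop_trace` is a hypothetical helper that only counts consecutive evaluations without a new best score.

```python
def early_stop_trace(scores, patience, threshold=0.0):
    """Return (best_score, evals_without_improvement, should_stop)."""
    best, stale = None, 0
    for score in scores:
        if best is None or score > best + threshold:
            best, stale = score, 0   # new best: reset the counter
        else:
            stale += 1               # no improvement at this evaluation
        if stale >= patience:
            return best, stale, True
    return best, stale, False


# Validation accuracies from the training log above.
accuracies = [0.706392, 0.761507, 0.774852, 0.7793, 0.777291, 0.776538, 0.778332]
result = early_stop_trace(accuracies, patience=5)
print(result)  # (0.7793, 3, False)
```

Under this simplified rule the best accuracy was reached at step 3052, and two more evaluations without improvement would trigger the stop.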


In [None]:
trainer.evaluate()

### Save Your Model To The Hub

In [None]:
from huggingface_hub import notebook_login
notebook_login()

In [None]:
import os
import shutil

folder_path = "/kaggle/working/masakhane-pos"  # Replace with the path to your folder

# Remove existing files (if you don't need them)
shutil.rmtree(folder_path, ignore_errors=True)




In [None]:
trainer.push_to_hub()

In [None]:
tokenizer.push_to_hub("Koleshjr/masakhane-pos")

### Inference

In [None]:
loaded_model = AutoModelForTokenClassification.from_pretrained(
    "Koleshjr/masakhane-pos",
    num_labels = CFG.num_classes,

)

loaded_tokenizer = AutoTokenizer.from_pretrained(
    'Koleshjr/masakhane-pos'
)

In [None]:
del model, trainer

In [None]:
pad_token_label_id=-100

test_args = TrainingArguments(
    output_dir= '/content/',
    do_train =False,
    do_predict = True,
    dataloader_drop_last = False
)

trainer = Trainer(
    model = loaded_model,
    args = test_args,
)

test_results = trainer.predict(masakhane['test'])

In [None]:
result = test_results.predictions.argmax(axis = -1)

preds_list = [[] for _ in range(test_results.label_ids.shape[0])]
for i in range(test_results.label_ids.shape[0]):
    for j in range(test_results.label_ids.shape[1]):
        if test_results.label_ids[i, j] != pad_token_label_id:
            preds_list[i].append(label_map[result[i][j]])

test['Pos'] = preds_list
test.head(2)

In [None]:
test.shape

In [None]:
test.isnull().sum()

In [None]:
test.head()

In [None]:
def print_rows_with_mismatched_lengths(df):
    for index, row in df.iterrows():
        if len(row['Id']) != len(row['Pos']):
            print(f"Row {index}: 'Id' length = {len(row['Id'])}, 'Pos' length = {len(row['Pos'])} sentence_id: {row['sentence_Id']}")

# Call the function to print rows with mismatched lengths
print_rows_with_mismatched_lengths(test)
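
One way such a mismatch can arise is truncation: `convert_to_feature` cuts each sentence to `max_seq_length` minus the special tokens, so words whose first sub-token falls beyond the cut never receive a prediction. The sketch below illustrates this under assumed sub-token counts; `surviving_labels` and the example sentence are hypothetical.

```python
IGNORE_ID = -100


def surviving_labels(pieces_per_word, max_seq_length, special_tokens=2):
    """Count words whose first sub-token survives truncation."""
    label_ids = []
    for n_pieces in pieces_per_word:
        # Placeholder label 0 on the first piece, IGNORE_ID on the rest.
        label_ids.extend([0] + [IGNORE_ID] * (n_pieces - 1))
    kept = label_ids[: max_seq_length - special_tokens]
    return sum(1 for label in kept if label != IGNORE_ID)


# Hypothetical long sentence: 50 words, 3 sub-tokens each (150 sub-tokens).
pieces = [3] * 50
n_predicted = surviving_labels(pieces, max_seq_length=135)
print(len(pieces), n_predicted)  # 50 45
```

In this example five words would be left without a tag, which is exactly the kind of row the check above reports.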


In [None]:
# def add_nouns_to_match_lengths(df):
#     for index, row in df.iterrows():
#         id_length = len(row['Id'])
#         pos_length = len(row['Pos'])
#         if id_length > pos_length:
#             # Calculate the difference in lengths
#             diff = id_length - pos_length
#             # Add 'NOUN' to 'Pos' to match lengths
#             df.at[index, 'Pos'].extend(['NOUN'] * diff)

# # Call the function to add 'NOUN' labels to match lengths
# add_nouns_to_match_lengths(test)


In [None]:
submission = test[['Id', 'Pos']].explode(column=['Id', 'Pos'], ignore_index=True)
submission.head()

In [None]:
submission = submission.reset_index(drop=True)
submission.shape

In [None]:
submission.head()

In [None]:
submission.to_csv("experiment_1.csv", index = False)
submission['Pos'].value_counts()