Using the FLAN-T5 model with few-shot prompting, determine for each sample in 
the offensive text detection dataset whether it contains offensive text or 
not. Experiment with different numbers of examples (n = 1, 2, 3, 5, 10).
Evaluate the resulting predictions with the following metrics: accuracy
(accuracy_score), precision (precision_score), recall (recall_score) and F1 
score (f1_score). Run the evaluation separately for each subset (training, 
validation and test).

In [37]:
# !pip install sentence_transformers

In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from sentence_transformers import SentenceTransformer, util

In [3]:
test_en_path = "C:/Users/Mia/Desktop/FINKI/NLP/nlp/data/offensive text detection/test_en.txt"
train_en_path = "C:/Users/Mia/Desktop/FINKI/NLP/nlp/data/offensive text detection/train_en.txt"
val_en_path = "C:/Users/Mia/Desktop/FINKI/NLP/nlp/data/offensive text detection/val_en.txt"

In [4]:
train_en = pd.read_table(train_en_path).dropna()
test_en = pd.read_table(test_en_path).dropna()
val_en = pd.read_table(val_en_path).dropna()

In [5]:
dataset = pd.concat([train_en, pd.concat([test_en, val_en])])

In [6]:
train_samples = train_en['Sentence'].values.tolist()
train_labels = train_en['Label'].values.tolist()
test_samples = test_en['Sentence'].values.tolist()
test_labels = test_en['Label'].values.tolist()
val_samples = val_en['Sentence'].values.tolist()
val_labels = val_en['Label'].values.tolist()

In [7]:
model = AutoModelForSeq2SeqLM.from_pretrained('google/flan-t5-base')
tokenizer = AutoTokenizer.from_pretrained('google/flan-t5-base')

Evaluate Method (Accuracy Score, Precision Score, Recall Score, F1 Score metrics)

In [8]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

In [9]:
def evaluate(y_test, y_pred, prompt_type, train_test_or_val):
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)

    print(train_test_or_val + ": " + prompt_type)
    print("Accuracy:", accuracy)
    print("Precision:", precision)
    print("Recall:", recall)
    print("F1 Score:", f1)

Main method: build the prompt, include the requested number of examples in it, and generate a prediction for every sample.

In [22]:
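def prompt_with_number_of_examples(samples, labels, no_of_examples):
    pred_labels = []

    # The examples are the first `no_of_examples` rows of the list being
    # classified, so they are the same for every sample (and the first
    # samples also appear as their own examples). Build them once.
    example = []
    for i in range(no_of_examples):
        example_text = samples[i]
        example_label = 'offensive' if labels[i] == 1 else 'non-offensive'
        result_example = f'Text: {example_text}\nCategory: {example_label}'
        example.append(result_example)

    for sample in samples:
        # Note: interpolating the list inserts its Python repr (brackets,
        # quotes, escaped newlines) into the prompt.
        prompt = f'{example}\nBased on the above example, classify the text into offensive or non-offensive: {sample}'

        # print(prompt)

        input_data = tokenizer(prompt, return_tensors='pt')
        input_ids = input_data.input_ids

        # Generate the answer and decode it; special tokens such as <pad>
        # and </s> are stripped later in clean_prediction.
        output = model.generate(input_ids)
        pred_label = tokenizer.decode(output[0])

        pred_labels.append(pred_label)

    return pred_labels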
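
As a minimal sketch of the prompt construction described above, the examples can be joined into plain lines instead of interpolating the Python list, and drawn from a separate pool so that the text being classified is never one of its own examples. The helper name `build_few_shot_prompt`, the `example_pool` argument and the sample texts are assumptions for illustration, not part of the notebook.

```python
def build_few_shot_prompt(example_pool, sample, no_of_examples):
    """Build a 'Text / Category' few-shot prompt for one sample.

    example_pool: list of (text, label) pairs, where label 1 means offensive.
    """
    blocks = []
    for text, label in example_pool[:no_of_examples]:
        category = 'offensive' if label == 1 else 'non-offensive'
        blocks.append(f'Text: {text}\nCategory: {category}')

    # Join the examples as plain text so no list brackets or quotes leak in.
    examples_block = '\n\n'.join(blocks)
    return (
        f'{examples_block}\n\n'
        'Based on the above examples, classify the text into '
        f'offensive or non-offensive: {sample}'
    )


# Made-up examples, used only to show the resulting prompt.
pool = [('You are an idiot.', 1), ('Have a nice day.', 0)]
prompt = build_few_shot_prompt(pool, 'Thanks for the help!', 2)
print(prompt)
```

Using a pool that contains both classes also gives the model at least one example of each label, which the first rows of a label-sorted file may not provide.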

In [95]:
import re

def clean_prediction(pred_label):
    pattern = re.compile('<.*?>')
    pred_list = []

    for pred in pred_label:
        pred = re.sub(pattern, '', pred)
        pred = pred.strip()
        pred = pred.lower()
        # print(pred)

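        # Anything other than an exact "non-offensive" (including unexpected
        # generations) is counted as offensive.
        if pred == "non-offensive":
            pred = 0
        else:
            pred = 1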

        pred_list.append(pred)

    return pred_list
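
clean_prediction treats every output other than "non-offensive" as offensive, so an unexpected generation silently counts as a positive. A possible alternative, sketched below under the assumption that unrecognised outputs should be tracked separately, returns None for them. The name `parse_label` is hypothetical.

```python
import re

SPECIAL_TOKEN = re.compile(r'<.*?>')  # matches <pad>, </s> and similar tokens


def parse_label(raw_output):
    """Map a decoded generation to 0, 1, or None if it is not a known label."""
    text = SPECIAL_TOKEN.sub('', raw_output).strip().lower()
    text = text.rstrip('.')  # tolerate a trailing full stop
    if text == 'non-offensive':
        return 0
    if text == 'offensive':
        return 1
    return None  # the model answered with something else


outputs = ['<pad> Non-offensive</s>', '<pad> offensive</s>', '<pad> yes</s>']
parsed = [parse_label(o) for o in outputs]
print(parsed)  # [0, 1, None]
```

Samples that come back as None would need to be dropped or counted as errors before the predictions are passed to the metric functions.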

In [18]:
def predict_with_few_shot_prompting(samples, labels, train_test_or_val):

    # Note: n = 3 from the task statement is not run in this cell.
    pred_labels_n_1 = prompt_with_number_of_examples(samples, labels, 1)
    pred_labels_n_2 = prompt_with_number_of_examples(samples, labels, 2)
    pred_labels_n_5 = prompt_with_number_of_examples(samples, labels, 5)
    pred_labels_n_10 = prompt_with_number_of_examples(samples, labels, 10)

    evaluate(labels, clean_prediction(pred_labels_n_1), "N 1", train_test_or_val)
    evaluate(labels, clean_prediction(pred_labels_n_2), "N 2", train_test_or_val)
    evaluate(labels, clean_prediction(pred_labels_n_5), "N 5", train_test_or_val)
    evaluate(labels, clean_prediction(pred_labels_n_10), "N 10", train_test_or_val)
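
The four near-identical calls can also be written as a loop over the shot counts, which makes it easy to include n = 3 from the task statement. This is only a sketch: `run_shot_counts` is a hypothetical helper, it reports accuracy alone, and the two stub functions are stand-ins for prompt_with_number_of_examples and clean_prediction so that the snippet runs without the model.

```python
def run_shot_counts(predict_fn, clean_fn, samples, labels,
                    shot_counts=(1, 2, 3, 5, 10)):
    """Return {n: accuracy} for each number of few-shot examples."""
    results = {}
    for n in shot_counts:
        preds = clean_fn(predict_fn(samples, labels, n))
        correct = sum(p == y for p, y in zip(preds, labels))
        results[n] = correct / len(labels)
    return results


# Stand-in for prompt_with_number_of_examples (ignores n, no model involved).
def fake_predict(samples, labels, n):
    return ['offensive' if 'bad' in s else 'non-offensive' for s in samples]


# Stand-in for clean_prediction.
def fake_clean(raw):
    return [0 if r == 'non-offensive' else 1 for r in raw]


scores = run_shot_counts(fake_predict, fake_clean,
                         ['bad word', 'nice day', 'bad again'], [1, 0, 0])
print(scores)
```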

In [23]:
pred_labels_train = predict_with_few_shot_prompting(train_samples[:100], train_labels[:100], "Train")

Train: N 1
Accuracy: 0.53
Precision: 1.0
Recall: 0.53
F1 Score: 0.6928104575163399
Train: N 2
Accuracy: 0.51
Precision: 1.0
Recall: 0.51
F1 Score: 0.6754966887417219
Train: N 5
Accuracy: 0.55
Precision: 1.0
Recall: 0.55
F1 Score: 0.7096774193548387
Train: N 10
Accuracy: 0.45
Precision: 1.0
Recall: 0.45
F1 Score: 0.6206896551724138


    Best performance: N 5

In [24]:
pred_labels_test = predict_with_few_shot_prompting(test_samples[:100], test_labels[:100], "Test")



Test: N 1
Accuracy: 0.65
Precision: 1.0
Recall: 0.65
F1 Score: 0.787878787878788
Test: N 2
Accuracy: 0.61
Precision: 1.0
Recall: 0.61
F1 Score: 0.7577639751552795
Test: N 5
Accuracy: 0.41
Precision: 1.0
Recall: 0.41
F1 Score: 0.5815602836879432
Test: N 10
Accuracy: 0.37
Precision: 1.0
Recall: 0.37
F1 Score: 0.5401459854014599


    Best performance: N 1

In [25]:
pred_labels_val = predict_with_few_shot_prompting(val_samples[:100], val_labels[:100], "Val")

Token indices sequence length is longer than the specified maximum sequence length for this model (519 > 512). Running this sequence through the model will result in indexing errors


Val: N 1
Accuracy: 0.47
Precision: 1.0
Recall: 0.47
F1 Score: 0.6394557823129251
Val: N 2
Accuracy: 0.44
Precision: 1.0
Recall: 0.44
F1 Score: 0.6111111111111112
Val: N 5
Accuracy: 0.4
Precision: 1.0
Recall: 0.4
F1 Score: 0.5714285714285715
Val: N 10
Accuracy: 0.33
Precision: 1.0
Recall: 0.33
F1 Score: 0.49624060150375937


    Best performance: N 1
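
The tokenizer warning in the validation run (519 > 512) shows that at least one prompt exceeded the model's 512-token limit, most likely one of the longer multi-example prompts. Passing `truncation=True, max_length=512` to the tokenizer would avoid the overflow, but by default truncation cuts the end of the prompt, which is where the text to classify sits. One alternative, sketched here as an assumption rather than what the notebook does, is to drop examples until the prompt fits. `fit_prompt_to_budget` and the whitespace counter are hypothetical stand-ins; with the notebook's tokenizer the counter could be `lambda t: len(tokenizer(t).input_ids)`.

```python
def fit_prompt_to_budget(examples, query, count_tokens, max_tokens=512):
    """Keep as many leading examples as fit, together with the query."""
    kept = list(examples)
    while kept:
        prompt = '\n'.join(kept + [query])
        if count_tokens(prompt) <= max_tokens:
            return prompt
        kept.pop()  # drop the last example and try again
    return query  # nothing fits: fall back to the query alone


def whitespace_count(text):
    """Crude stand-in for a real tokenizer: counts whitespace-separated words."""
    return len(text.split())


fitted = fit_prompt_to_budget(['a b c', 'd e f', 'g h i'], 'q r',
                              whitespace_count, max_tokens=8)
print(repr(fitted))  # 'a b c\nd e f\nq r'
```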

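With this prompt format (Text: text Category: category) we get the best results in this notebook, with an F1 score of up to 0.79 on the test subset, but on the test and validation subsets the results get worse as more examples are added. Given these results, a middle ground of 2 examples looks like a reasonable compromise. Note that precision is 1.0 and recall equals accuracy in every run, which happens when all evaluated samples carry the offensive label, so these scores should be read as recall on the offensive class only.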
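
The pattern of precision 1.0 with recall equal to accuracy, seen in every run above, can be reproduced with a few lines of plain Python. The sketch below uses made-up predictions and a hand-written `binary_metrics` helper (an assumption, not the scikit-learn functions used in the notebook) to show what happens when every true label is 1.

```python
def binary_metrics(y_true, y_pred):
    """Return (accuracy, precision, recall) for binary labels, positive = 1."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall


# Every true label is 1, and the made-up model gets 6 out of 10 right.
y_true = [1] * 10
y_pred = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
accuracy, precision, recall = binary_metrics(y_true, y_pred)
print(accuracy, precision, recall)  # 0.6 1.0 0.6
```

With no negative samples there can be no false positives and no true negatives, so precision is 1.0 and accuracy reduces to recall. Checking `set(train_labels[:100])` would confirm whether the slices used here contain only one class; a shuffled or stratified sample would give scores that cover both classes.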

In the second lab assignment I obtained the following evaluation results:

* Accuracy: 0.5138539042821159
* Precision: 0.510556621880998
* Recall: 0.6700251889168766
* F1 Score: 0.579520697167756 

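From this we can conclude that FLAN-T5 with few-shot prompting is faster to apply (it needs no training) and reaches a higher F1 score than the sequential neural network with LSTM, Embedding and Dense layers (0.79 at best versus 0.58). The comparison is only indicative, since the results here come from the first 100 samples of each subset.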

Try the following prompts:
1. "Here is a text: {text}, which is {label}. Classify the following text: {sample} into {label1} or {label2}."
2. "Here is a text: {text}, which is not {label}. Classify the following text: {sample} into {label1} or {label2}."
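
To make the two templates concrete, the sketch below fills them in with `str.format` for the single-example case. The constant names, the `render` and `opposite` helpers and the sample texts are assumptions for illustration. For the second template the label is inverted, so a non-offensive example is described as "not offensive".

```python
TEMPLATE_1 = ('Here is a text: {text}, which is {label}. '
              'Classify the following text: {sample} into {label1} or {label2}.')
TEMPLATE_2 = ('Here is a text: {text}, which is not {label}. '
              'Classify the following text: {sample} into {label1} or {label2}.')


def render(template, text, label, sample,
           label1='offensive', label2='non-offensive'):
    """Fill one of the templates with an example, its label and the query."""
    return template.format(text=text, label=label, sample=sample,
                           label1=label1, label2=label2)


def opposite(label):
    """Return the other class name, used for the negated template."""
    return 'non-offensive' if label == 'offensive' else 'offensive'


# Made-up example text and query.
p1 = render(TEMPLATE_1, 'Have a nice day', 'non-offensive', 'Thanks!')
p2 = render(TEMPLATE_2, 'Have a nice day', opposite('non-offensive'), 'Thanks!')
print(p1)
print(p2)
```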

In [90]:
def prompt_with_number_of_examples_prompt_type_1(samples, labels, no_of_examples):
    pred_labels = []

    for sample, label in zip(samples, labels):
        example = []
       
        for i in range(no_of_examples):
            example_text = samples[i]
            example_label = 'offensive' if labels[i] == 1 else 'non-offensive'
            result_example = f'Here is a text: {example_text}, which is {example_label}'
            example.append(result_example)

        label1 = 'offensive'
        label2 = 'non-offensive'
        prompt = f'{example}\nClassify the following text: {sample}, into {label1} or {label2}.'
        
        # print(prompt)

        input_data = tokenizer(prompt, return_tensors='pt')
        input_ids = input_data.input_ids
        
        output = model.generate(input_ids)
        pred_label = tokenizer.decode(output[0])

        pred_labels.append(pred_label)
        # print(pred_label)
    
    return pred_labels

In [69]:
def prompt_with_number_of_examples_prompt_type_2(samples, labels, no_of_examples):
    pred_labels = []

    for sample, label in zip(samples, labels):
        example = []
       
        for i in range(no_of_examples):
            example_text = samples[i]
            # Labels are inverted on purpose: the template says "which is not {label}".
            example_label = 'offensive' if labels[i] == 0 else 'non-offensive'
            result_example = f'Here is a text: {example_text}, which is not {example_label}'
            example.append(result_example)

        label1 = 'offensive'
        label2 = 'non-offensive'
        prompt = f'{example}\nClassify the following text: {sample} into {label1} or {label2}'
        
        # print(prompt)

        input_data = tokenizer(prompt, return_tensors='pt')
        input_ids = input_data.input_ids
        
        output = model.generate(input_ids)
        pred_label = tokenizer.decode(output[0])

        pred_labels.append(pred_label)
    
    return pred_labels

In [77]:
def predict_with_few_shot_prompting_prompt_type_1(samples, labels, train_test_or_val):

    pred_labels_n_1 = prompt_with_number_of_examples_prompt_type_1(samples, labels, 1) 
    pred_labels_n_2 = prompt_with_number_of_examples_prompt_type_1(samples, labels, 2)
    pred_labels_n_5 = prompt_with_number_of_examples_prompt_type_1(samples, labels, 5)
    pred_labels_n_10 = prompt_with_number_of_examples_prompt_type_1(samples, labels, 10)

    evaluate(labels, clean_prediction(pred_labels_n_1), "N 1", train_test_or_val)
    evaluate(labels, clean_prediction(pred_labels_n_2), "N 2", train_test_or_val)
    evaluate(labels, clean_prediction(pred_labels_n_5), "N 5", train_test_or_val)
    evaluate(labels, clean_prediction(pred_labels_n_10), "N 10", train_test_or_val)

In [99]:
def predict_with_few_shot_prompting_prompt_type_2(samples, labels, train_test_or_val):

    pred_labels_n_1 = prompt_with_number_of_examples_prompt_type_2(samples, labels, 1) 
    pred_labels_n_2 = prompt_with_number_of_examples_prompt_type_2(samples, labels, 2)
    pred_labels_n_5 = prompt_with_number_of_examples_prompt_type_2(samples, labels, 5)
    pred_labels_n_10 = prompt_with_number_of_examples_prompt_type_2(samples, labels, 10)
    
    # print(clean_prediction(pred_labels_n_1))
    evaluate(labels, clean_prediction(pred_labels_n_1), "N 1", train_test_or_val)
    evaluate(labels, clean_prediction(pred_labels_n_2), "N 2", train_test_or_val)
    evaluate(labels, clean_prediction(pred_labels_n_5), "N 5", train_test_or_val)
    evaluate(labels, clean_prediction(pred_labels_n_10), "N 10", train_test_or_val)

In [76]:
# predict_with_few_shot_prompting_prompt_type_1(train_samples[:2], train_labels[:2], "Train")

['Here is a text: Pussy nobody asked for your input., which is offensive']
Classify the following text: Pussy nobody asked for your input. into offensive or non-offensive
<pad> Non-offensive</s>
['Here is a text: Pussy nobody asked for your input., which is offensive']
Classify the following text: Ok, this makes no sense. This will create vigilantes. And I hope they put this savage animal down before he hurts anyone else. into offensive or non-offensive
<pad> Non-offensive</s>
[1, 1]


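I notice that the model struggles to predict correctly with a single example: in the debug output above it answers Non-offensive even for the very text that was given as the offensive example.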

In [96]:
predict_with_few_shot_prompting_prompt_type_1(train_samples[:100], train_labels[:100], "Train")



Train: N 1
Accuracy: 0.27
Precision: 1.0
Recall: 0.27
F1 Score: 0.4251968503937008
Train: N 2
Accuracy: 0.22
Precision: 1.0
Recall: 0.22
F1 Score: 0.36065573770491804
Train: N 5
Accuracy: 0.39
Precision: 1.0
Recall: 0.39
F1 Score: 0.5611510791366906
Train: N 10
Accuracy: 0.41
Precision: 1.0
Recall: 0.41
F1 Score: 0.5815602836879432


    Best performance: N 10

In [100]:
predict_with_few_shot_prompting_prompt_type_2(train_samples[:100], train_labels[:100], "Train")



Train: N 1
Accuracy: 0.1
Precision: 1.0
Recall: 0.1
F1 Score: 0.18181818181818182
Train: N 2
Accuracy: 0.05
Precision: 1.0
Recall: 0.05
F1 Score: 0.09523809523809523
Train: N 5
Accuracy: 0.06
Precision: 1.0
Recall: 0.06
F1 Score: 0.11320754716981131
Train: N 10
Accuracy: 0.05
Precision: 1.0
Recall: 0.05
F1 Score: 0.09523809523809523


    Best performance: N 1

1. The first prompt type, "Here is a text: {text}, which is {label}. Classify the following text: {sample} into {label1} or {label2}.", gives better results than the second.
- With a best F1 score of 0.58, this prompt is considerably weaker than the one we tried first (best F1 of 0.71 on the same training samples and 0.79 on the test samples).
- Unlike the first prompt, this one broadly improves as more examples are added (F1 rises from 0.43 at n = 1 to 0.58 at n = 10, with a dip at n = 2).

2. "Here is a text: {text}, which is not {label}. Classify the following text: {sample} into {label1} or {label2}."
- This prompt type gives the worst results in the whole experiment. Adding more examples does not help (F1 falls from 0.18 at n = 1 to roughly 0.10 to 0.11 at n = 2, 5 and 10), so it cannot compete with the previous two.

Conclusions from this experiment:
- Prompts that are precise and unambiguous give the best results.

Example:
Prompt type 1 scored better than type 2, most likely because type 2 uses negation, "which is not {label}", instead of "which is {label}".
- Prompts that state the task, the classes and the instructions more simply and clearly are more successful.

Example: Here is a text: {text}, which is {label} can be simplified to the form Text: {Text} Category: {Category}