<center>
Natural Language Processing

Lemmatization

Lucija Dumančić


# MODEL EVALUATION

In [2]:
import pandas as pd
import numpy as np
from tensorflow.keras.models import load_model
from sklearn.metrics import f1_score
from tqdm import tqdm

In [3]:
%cd /content/drive/MyDrive/OPJ/Projekt

/content/drive/MyDrive/OPJ/Projekt


## Load model and test data

In [4]:
model = load_model("lemmatization_model.keras")  # load the trained lemmatization model

In [5]:
# Load test arrays from files
vectorized_tokens_test = np.load('vectorized_tokens_test.npy')
vectorized_lemmas_test = np.load('vectorized_lemmas_test.npy')

In [7]:
test = pd.read_csv('hrWaC2.1.01_test_dataset.csv')

In [8]:
test.head()

Unnamed: 0,Token,Lemma
0,rujan,rujan
1,da,da
2,znači,značiti
3,su,biti
4,bio,biti


In [None]:
true_lemmas = test['Lemma'].astype(str)

In [None]:
lemma_vector_dict = {string: vector for string, vector in zip(true_lemmas, vectorized_lemmas_test)}

The dictionary maps each lemma string to its corresponding vector representation.
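One property of this mapping is worth keeping in mind: a Python dict holds one entry per distinct key, so a lemma that occurs several times in the test set ends up paired with the vector from its last occurrence. If the vectorization is deterministic, repeated lemmas carry identical vectors and nothing is lost. A minimal sketch with invented toy lemmas and vectors (not the project's data):

```python
# Toy data, invented for illustration only.
toy_lemmas = ["biti", "grad", "biti"]
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

# Same construction as lemma_vector_dict above.
toy_dict = {lemma: vector for lemma, vector in zip(toy_lemmas, toy_vectors)}

print(len(toy_dict))     # number of distinct lemmas
print(toy_dict["biti"])  # vector from the last occurrence of "biti"
```

In practice this means the dictionary size equals the number of distinct lemmas in the test set, not the number of test rows.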

## Evaluate model

In [None]:
def nearest_lemmas(predicted_vectors):
    """Map each predicted vector to the lemma whose vector is most similar to it."""
    predicted = []

    # Progress bar over the predicted vectors
    with tqdm(total=len(predicted_vectors), desc="Calculating F1", unit="instance") as pbar:
        for pred_vector in predicted_vectors:
            max_similarity = -1
            nearest_lemma = None
            for lemma, vector in lemma_vector_dict.items():
                # Cosine similarity between the predicted vector and a candidate lemma vector
                similarity = np.dot(pred_vector, vector) / (np.linalg.norm(pred_vector) * np.linalg.norm(vector))
                if similarity > max_similarity:
                    max_similarity = similarity
                    nearest_lemma = lemma
            predicted.append(nearest_lemma)
            # Advance the progress bar once per predicted vector
            pbar.update(1)

    return predicted

This function takes the predicted vectors and returns the corresponding predicted lemmas as strings. For each predicted vector, it computes the cosine similarity to every lemma vector in `lemma_vector_dict` and keeps the lemma with the highest similarity. The search is a nested Python loop over all predictions and all candidate lemmas, which is why the run below takes about three and a half hours.
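The same nearest-neighbour search can be expressed with matrix operations, which is usually much faster than a nested Python loop. The sketch below is an alternative, not the code that produced the results in this notebook. The function name `nearest_lemmas_vectorized`, the `batch_size` parameter, and the toy data are hypothetical. It assumes every vector is one-dimensional and all vectors have the same length.

```python
import numpy as np

def nearest_lemmas_vectorized(predicted_vectors, lemmas, lemma_matrix, batch_size=1024):
    """Return, for each predicted vector, the lemma with the highest cosine similarity."""
    predicted_vectors = np.asarray(predicted_vectors, dtype=np.float64)
    lemma_matrix = np.asarray(lemma_matrix, dtype=np.float64)

    # Normalize the candidate vectors once; the floor avoids division by zero.
    lemma_norms = np.linalg.norm(lemma_matrix, axis=1, keepdims=True)
    lemma_unit = lemma_matrix / np.maximum(lemma_norms, 1e-12)

    result = []
    # Process predictions in batches to bound the size of the similarity matrix.
    for start in range(0, len(predicted_vectors), batch_size):
        batch = predicted_vectors[start:start + batch_size]
        batch_norms = np.linalg.norm(batch, axis=1, keepdims=True)
        batch_unit = batch / np.maximum(batch_norms, 1e-12)
        # (batch, dim) @ (dim, n_lemmas) gives all cosine similarities at once.
        similarities = batch_unit @ lemma_unit.T
        best = similarities.argmax(axis=1)
        result.extend(lemmas[i] for i in best)
    return result

# Toy usage with invented vectors.
toy_lemmas = ["biti", "grad", "godina"]
toy_matrix = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
toy_predictions = np.array([[0.9, 0.1, 0.0], [0.0, 0.2, 0.7]])
print(nearest_lemmas_vectorized(toy_predictions, toy_lemmas, toy_matrix))
```

With the notebook's data, `lemmas` could be built as `list(lemma_vector_dict.keys())` and `lemma_matrix` as `np.stack(list(lemma_vector_dict.values()))`, so that row order matches key order.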

In [None]:
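# Note: the input passed here is the lemma array; vectorized_tokens_test (loaded above)
# is the array a token-to-lemma model would normally be expected to receive.
vectorized_lemmas_predicted = model.predict(vectorized_lemmas_test)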



In [None]:
predicted_lemmas = nearest_lemmas(vectorized_lemmas_predicted)

Calculating F1: 100%|██████████| 54944/54944 [3:27:16<00:00,  4.42instance/s]


In [None]:
f1 = f1_score(predicted_lemmas, true_lemmas, average='weighted')

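The weighted average accounts for class imbalance: each class's F1 score is weighted by its support, the number of true instances of that class. Note that scikit-learn's `f1_score` expects its arguments in the order `(y_true, y_pred)`, so in the call above the class weights are counted from `predicted_lemmas` rather than from `true_lemmas`.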
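To make the weighting concrete, the sketch below computes a support-weighted F1 by hand on invented toy labels. The helper `weighted_f1` is hypothetical and written only for illustration; it should agree with scikit-learn's `f1_score(y_true, y_pred, average='weighted')`.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores, computed by hand."""
    support = Counter(y_true)    # true instances per class
    predicted = Counter(y_pred)  # predicted instances per class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    total = 0.0
    for label, n_true in support.items():
        tp = correct[label]
        precision = tp / predicted[label] if predicted[label] else 0.0
        recall = tp / n_true
        f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
        total += n_true * f1  # weight each class by its support
    return total / len(y_true)

# Toy labels, invented for illustration only.
toy_true = ["biti", "biti", "biti", "grad"]
toy_pred = ["biti", "biti", "grad", "grad"]

print(round(weighted_f1(toy_true, toy_pred), 4))  # weights from the true labels
print(round(weighted_f1(toy_pred, toy_true), 4))  # weights from the predicted labels
```

The per-class F1 values are the same in both calls (0.8 for "biti" and about 0.667 for "grad"). Only the weights change, so the two results differ: about 0.7667 versus 0.7333.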

In [None]:
print(f"F1 score: {f1 * 100:.2f}%")

F1 score: 91.62%


Our feedforward neural network lemmatization model reaches a weighted F1 score of 91.62% on the test set.

In [None]:
# Print test lemmas 50-99 in rows of ten
for start in range(50, 100, 10):
    print(np.array(true_lemmas[start:start + 10]))

['grad' 'godina' 'reći' 'stoljeće' 'bolnica' 'mentalan' 'žrtva' 'igra'
 'medved' 'prije']
['drago' 'barača' 'ožiljak' 'martina' 'godina' 'načelnik' 'osigurati'
 'studij' 'tportal' 'njegov']
['kod' 'izgledati' 'čaj' 'pisac' 'moći' 'verona' 'proširiti' 'sve' 'ran'
 'plav']
['postavljanje' 'vijeće' 'za' 'marketing' 'htjeti' 'pokretati' 'vrijeme'
 'poznat' 'zbog' 'mogućnost']
['izvještavati' 'projekt' 'čiji' 'značajan' 'neki' 'liječnik'
 'sudjelovati' 'koji' 'egzorcizam' 'po']


These are 50 lemmas from the test set (positions 50 to 99).

In [None]:
# Print predicted lemmas 50-99 in rows of ten
for start in range(50, 100, 10):
    print(predicted_lemmas[start:start + 10])

['grad', 'godina', 'žaliti', 'stoljeće', 'bolnica', 'mentalan', 'žrtva', 'odigravati', 'medved', 'prijevremen']
['sretan', 'konjak', 'ožiljak', 'martina', 'godina', 'načelnik', 'osigurati', 'studij', 'tportal', 'njegov']
['premještanje', 'izgledati', 'čaj', 'pisac', 'sposobnost', 'verona', 'proširiti', 'jednostavno', 'ispostaviti', 'plav']
['postavljanje', 'predsjedništvo', 'osiguravateljski', 'marketing', 'gnjaviti', 'pokretati', 'odrastanje', 'poznati', 'pretjerivanje', 'mogućnost']
['izvještavati', 'projekt', 'čiji', 'značajan', 'običan', 'liječnik', 'sudjelovati', 'koji', 'egzorcizam', 'primamljiv']


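These are the model's predictions at the same 50 positions. 33 of the 50 match the true lemma; the misses include 'reći' predicted as 'žaliti' and 'prije' predicted as 'prijevremen'.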
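A side-by-side comparison like this can also be done programmatically. The sketch below uses only the first ten lemmas from each of the two outputs above (positions 50 to 59); the variable names are hypothetical.

```python
# First ten true and predicted lemmas, copied from the outputs above.
true_sample = ['grad', 'godina', 'reći', 'stoljeće', 'bolnica',
               'mentalan', 'žrtva', 'igra', 'medved', 'prije']
predicted_sample = ['grad', 'godina', 'žaliti', 'stoljeće', 'bolnica',
                    'mentalan', 'žrtva', 'odigravati', 'medved', 'prijevremen']

# Collect (true, predicted) pairs where the prediction is wrong.
mismatches = [(t, p) for t, p in zip(true_sample, predicted_sample) if t != p]

print(f"Exact matches: {len(true_sample) - len(mismatches)}/{len(true_sample)}")
for t, p in mismatches:
    print(f"{t} -> {p}")
```

Applied to the full `true_lemmas` and `predicted_lemmas`, the same pattern would give the exact-match accuracy and a list of errors to inspect.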