# Evaluating the Accuracy of Lemmatizers/Stemmers

In this notebook, I evaluate the accuracy of various Nepali lemmatizers and stemmers against a manually annotated gold dataset [(link)](https://github.com/dpakpdl/NepaliLemmatizer/blob/master/Lemmatization/data/manually_annotated_corpus/gold_data.txt).

The accuracy metric is computed simply as the number of matches between the algorithm's results and the gold data, divided by the total number of items in the dataset.
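Written out, with $N$ gold entries, gold lemma $g_i$ and predicted lemma $p_i$ for the $i$-th word (symbols introduced here only for illustration), the metric is:

$$
\text{accuracy} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\left[\, p_i = g_i \,\right]
$$

A match means exact string equality, so a prediction that still carries any part of the suffix counts as an error.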

In [2]:
!pip install pandas
import pandas as pd
from collections import namedtuple

Collecting pandas
  Downloading pandas-2.0.3-cp39-cp39-macosx_10_9_x86_64.whl (11.8 MB)
Collecting pytz>=2020.1 (from pandas)
  Using cached pytz-2023.3-py2.py3-none-any.whl (502 kB)
Collecting tzdata>=2022.1 (from pandas)
  Using cached tzdata-2023.3-py2.py3-none-any.whl (341 kB)
Collecting numpy>=1.20.3 (from pandas)
  Downloading numpy-1.25.1-cp39-cp39-macosx_10_9_x86_64.whl (20.1 MB)
Installing collected packages: pytz, tzdata, numpy, pandas
Successfully installed numpy-1.25.1 pandas-2.0.3 pytz-2023.3 tzdata-2023.3


In [3]:
# Read the gold data for lemmatization/stemming and extract each word's
# lemma and suffix into a Python dictionary keyed by the word.

# A namedtuple holding a word's lemma, its suffix, and the line index of its
# entry in the gold file
WordInfo = namedtuple('WordInfo', ['lemma', 'suffix', 'line'])


filename = "data/gold_data.txt"

# The gold file contains Devanagari text, so set the encoding explicitly
with open(filename, "r", encoding="utf-8") as infile:
    all_lines = [x.strip() for x in infile]

print(f'Number of entries in gold dataset: {len(all_lines)}')

lemma_dictionary = {}
# Each line has the form: word,lemma,('SFX', rule_number, '0', suffix, '.')
for i, line in enumerate(all_lines):
    parts = line.split(',')
    word = parts[0].strip()
    lem = parts[1].strip()
    suffix = parts[-2].strip()[1:-1]
    info = WordInfo(lem, suffix, i)
    if word not in lemma_dictionary:
        lemma_dictionary[word]= info
        #print(f'Stored entry for word {word}: {info}')
    else:
        print(f'Duplicate entry for word {word}: {info}')
        print(f'Stored entry for word {word}: {lemma_dictionary[word]}')
        print(f'Line: {lemma_dictionary[word].line}  {all_lines[lemma_dictionary[word].line]}')
        print(f'Line: {info.line}  {all_lines[info.line]}')
        print()


Number of entries in gold dataset: 3401
Duplicate entry for word कार्यरथीसहित: WordInfo(lemma='कार्यरथी', suffix='सहित', line=23)
Stored entry for word कार्यरथीसहित: WordInfo(lemma='कार्यरथी', suffix='सहित', line=21)
Line: 21  कार्यरथीसहित,कार्यरथी,('SFX', '22', '0', 'सहित', '.')
Line: 23  कार्यरथीसहित,कार्यरथी,('SFX', '15', '0', 'सहित', '.')

Duplicate entry for word नास्पातीसहित: WordInfo(lemma='नास्पाती', suffix='सहित', line=1739)
Stored entry for word नास्पातीसहित: WordInfo(lemma='नास्पाती', suffix='सहित', line=1737)
Line: 1737  नास्पातीसहित,नास्पाती,('SFX', '15', '0', 'सहित', '.')
Line: 1739  नास्पातीसहित,नास्पाती,('SFX', '22', '0', 'सहित', '.')

Duplicate entry for word आगलागीसहित: WordInfo(lemma='आगलागी', suffix='सहित', line=2796)
Stored entry for word आगलागीसहित: WordInfo(lemma='आगलागी', suffix='सहित', line=2794)
Line: 2794  आगलागीसहित,आगलागी,('SFX', '15', '0', 'सहित', '.')
Line: 2796  आगलागीसहित,आगलागी,('SFX', '22', '0', 'सहित', '.')



In [4]:
len(lemma_dictionary)

3398

Exploring the gold dataset, we found 3 duplicated words. Each duplicate differs from its counterpart only in the rule number, and since we use only the word and lemma (not the rule number), we can safely drop the duplicates. This reduces the gold data from 3401 to 3398 entries.
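The duplicate check can also be sketched on a few sample lines. This is only an illustration, not the code used in this notebook: `GoldEntry` and `parse_gold_line` are hypothetical names, and the sketch assumes the third field of every gold line is a valid Python tuple literal, so that `ast.literal_eval` can read the rule number and suffix directly.

```python
import ast
from collections import namedtuple

# Hypothetical variant of the notebook's WordInfo that also keeps the rule number
GoldEntry = namedtuple('GoldEntry', ['lemma', 'rule', 'suffix', 'line'])

def parse_gold_line(line, line_no):
    """Split one gold line into the word and a GoldEntry."""
    # Split on the first two commas only; the remainder is the rule tuple
    word, lemma, rule_text = line.split(',', 2)
    # Assumption: the third field is written as a Python tuple literal
    rule = ast.literal_eval(rule_text.strip())
    return word.strip(), GoldEntry(lemma.strip(), rule[1], rule[3], line_no)

# Sample lines copied from the gold data shown in this notebook
sample_lines = [
    "कार्यरथीसहित,कार्यरथी,('SFX', '22', '0', 'सहित', '.')",
    "खोलानालामात्र,खोलानाला,('SFX', '22', '0', 'मात्र', '.')",
    "कार्यरथीसहित,कार्यरथी,('SFX', '15', '0', 'सहित', '.')",
]

entries = {}
duplicates = []
for i, line in enumerate(sample_lines):
    word, entry = parse_gold_line(line, i)
    if word in entries:
        # Keep the first occurrence and record the clash for inspection
        duplicates.append((word, entries[word], entry))
    else:
        entries[word] = entry

print(len(entries), len(duplicates))
for word, first, second in duplicates:
    print(word, 'rules:', first.rule, 'vs', second.rule)
```

Keeping the rule number in the tuple makes it easy to confirm that a duplicate differs from the stored entry only in that field.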

## Prepare the Gold Data

In [5]:
# Prepare the gold data as 3 lists - word, lemma, suffix
gold_words = []
gold_lemmas = []
gold_suffixes = []

for k, v in lemma_dictionary.items():
    gold_words.append(k)
    gold_lemmas.append(v.lemma)
    gold_suffixes.append(v.suffix)
    
len(gold_words)

3398

### Inspect a random entry to check the 3 lists are aligned

In [6]:
gold_words[10], gold_lemmas[10], gold_suffixes[10]

('खोलानालामात्र', 'खोलानाला', 'मात्र')

In [7]:
all_lines[ lemma_dictionary[gold_words[10]].line ]

"खोलानालामात्र,खोलानाला,('SFX', '22', '0', 'मात्र', '.')"

In [8]:
# Prepare a string of text using the gold words
gold_text = ' '.join(gold_words)
gold_text

'द्रब्यशाहमध्ये द्रब्यशाहबाहेक द्रब्यशाहकी द्रब्यशाहसम्मको द्रब्यशाहबाट कोलाहलका कोलाहललगायतका कोलाहलमध्ये कोलाहलजस्ता कोलाहलकै खोलानालामात्र खोलानालाहरूसँगको खोलानालालगायतका खोलानालामाथिबाट खोलानालाहरूमा भित्तेलेखनसित भित्तेलेखनको भित्तेलेखनबाहेक भित्तेलेखनजस्ता भित्तेलेखनकै कार्यरथीजस्ता कार्यरथीसहित कार्यरथीसम्मको कार्यरथीसँगको अनौठी अनौठा मानसरोवरबिना मानसरोवरपछिका मानसरोवरका मानसरोवरद्वारा मानसरोवरमात्र पानीसरोबाहेक पानीसरोअनुसार पानीसरोअघि पानीसरोबारे पानीसरोसहित दिगन्तमै दिगन्तसँगका दिगन्तबाहेक दिगन्तबाट दिगन्तभन्दापर इस्टकोटद्वारा इस्टकोटहरूप्रति इस्टकोटसहितका इस्टकोटले इस्टकोटहरूसँगको शृगालजस्तै शृगालबिना शृगालतिर शृगालकै शृगालसँग धनुर्धरहरूमाथि धनुर्धरसँगै धनुर्धरपछिकी धनुर्धरअघि धनुर्धरमाथिका पदवियोगमाथिको पदवियोगप्रति पदवियोगभन्दा पदवियोगसहितको पदवियोगमाथिका शङ्खमूलपछिका शङ्खमूलसम्मका शङ्खमूलसँगै शङ्खमूलले शङ्खमूलमात्र पुरुषोत्तमकी पुरुषोत्तमतिर पुरुषोत्तममै पुरुषोत्तमअघि पुरुषोत्तमपट्टि वणिक्भित्रै वणिक्सम्मका वणिक्का वणिक्सम्ममा वणिक्अनुसार डाइनीसँगै डाइनीजस्ता डाइनीका डा

## Exploring [Nepali_nlp](https://github.com/sushil79g/Nepali_nlp/tree/master)

In [9]:
# Installing the library
!pip install git+https://github.com/sushil79g/Nepali_nlp.git

Collecting git+https://github.com/sushil79g/Nepali_nlp.git
  Cloning https://github.com/sushil79g/Nepali_nlp.git to /private/var/folders/q6/4vk5yrts387cp_3qs430rhqc0000gn/T/pip-req-build-klx2ivnt
  Running command git clone --filter=blob:none --quiet https://github.com/sushil79g/Nepali_nlp.git /private/var/folders/q6/4vk5yrts387cp_3qs430rhqc0000gn/T/pip-req-build-klx2ivnt
  Resolved https://github.com/sushil79g/Nepali_nlp.git to commit 9feccc8331ce5f744a2b3157c8d1ea7c8231b2a4
  Preparing metadata (setup.py) ... done
Collecting gensim==3.7.3 (from Nepali-nlp==0.0.0)
  Using cached gensim-3.7.3.tar.gz (23.4 MB)
  Preparing metadata (setup.py) ... done
Collecting requests==2.22.0 (from Nepali-nlp==0.0.0)
  Using cached requests-2.22.0-py2.py3-none-any.whl (57 kB)
Collecting wget==3.2 (from Nepali-nlp==0.0.0)
  Using cached wget-3.2.zip (10 kB)
  Preparing metadata (setup.py) ... done
Collecting news-please (from Nepali-nlp==0.0.0)
  Using cached news_please-1



In [11]:
!pip install scikit-learn

Collecting scikit-learn
  Downloading scikit_learn-1.3.0-cp39-cp39-macosx_10_9_x86_64.whl (10.2 MB)
Collecting joblib>=1.1.1 (from scikit-learn)
  Using cached joblib-1.3.1-py3-none-any.whl (301 kB)
Collecting threadpoolctl>=2.0.0 (from scikit-learn)
  Using cached threadpoolctl-3.2.0-py3-none-any.whl (15 kB)
Installing collected packages: threadpoolctl, joblib, scikit-learn
Successfully installed joblib-1.3.1 scikit-learn-1.3.0 threadpoolctl-3.2.0


In [15]:
import Nepali_nlp.Nepali_nlp as Nepali_nlp

#nepali_nlp_results = Stem().rootify(gold_text)

2023-07-21 04:05:44.974873: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


ModuleNotFoundError: No module named 'Nepali_nlp.Nepali_nlp'

In [16]:
from Nepali_nlp import Stem

nepali_nlp_results = Stem().rootify(gold_text)

In [18]:
len(nepali_nlp_results)

3403

In [19]:
len(gold_words)

3398

In [20]:
nepali_nlp_results

['द्रब्यशाहमध्ये',
 'द्रब्यशाहबाहेक',
 'द्रब्यशाह',
 'द्रब्यशाहसम्म',
 'द्रब्यशाहबाट',
 'कोलाहल',
 'कोलाहललगायत',
 'कोलाहलमध्ये',
 'कोलाहलजस्ता',
 'कोलाहल',
 'खोलानालामात्र',
 'खोलानालाहरूसँग',
 'खोलानालालगायत',
 'खोलानालामाथिबाट',
 'खोलानाला',
 'भित्तेलेखनसित',
 'भित्तेलेखन',
 'भित्तेलेखनबाहेक',
 'भित्तेलेखनजस्ता',
 'भित्तेलेखन',
 'कार्यरथीजस्ता',
 'कार्यरथीसहित',
 'कार्यरथीसम्म',
 'कार्यरथीसँग',
 'अनौठी',
 'अनौठा',
 'मानसरोवरबिना',
 'मानसरोवरपछि',
 'मानसरोवर',
 'मानसरोवर',
 'मानसरोवरमात्र',
 'पानीसरोबाहेक',
 'पानीसरोअनुसार',
 'पानीसरोअघि',
 'पानीसरोबारे',
 'पानीसरोसहित',
 'दिगन्त',
 'दिगन्तसँग',
 'दिगन्तबाहेक',
 'दिगन्तबाट',
 'दिगन्तभन्दापर',
 'इस्टकोट',
 'इस्टकोटहरूप्रति',
 'इस्टकोटसहित',
 'इस्टकोट',
 'इस्टकोटहरूसँग',
 'शृगालजस्तै',
 'शृगालबिना',
 'शृगालतिर',
 'शृगाल',
 'शृगाल',
 'धनुर्धर',
 'धनुर्धर',
 'धनुर्धरपछि',
 'धनुर्धरअघि',
 'धनुर्धर',
 'पदवियोग',
 'पदवियोगप्रति',
 'पदवियोगभन्',
 'पदवियोगसहित',
 'पदवियोग',
 'शङ्खमूलपछि',
 'शङ्खमूलसम्म',
 'शङ्खमूल',
 'शङ्खमूल',
 'शङ्खमूलमात्र

In [24]:
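# Running rootify() on the joined text returned 3403 items for 3398 gold words,
# so that output cannot be aligned one-to-one with the gold lemmas.
# Stem each gold word separately and report any word that yields more than one item.
stemmer = Stem()  # create the stemmer once and reuse it
nepali_nlp_results_byword = []

for w in gold_words:
    res = stemmer.rootify(w)
    if len(res) > 1:
        line_num = lemma_dictionary[w].line
        print(f'Word: {w} Stem res: {res}')
        print(f'Line {line_num} in data file: {all_lines[line_num]}')
        print()
    nepali_nlp_results_byword.append(res)

len(nepali_nlp_results_byword)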

Word: विश्वस्त सूत्रपछिको Stem res: ['विश्वस्त', 'सूत्रपछि']
Line 3165 in data file: विश्वस्त सूत्रपछिको,विश्वस्त सूत्र,('SFX', '22', '0', 'पछिको', '.')

Word: विश्वस्त सूत्रहरूबाट Stem res: ['विश्वस्त', 'सूत्रहरूबाट']
Line 3166 in data file: विश्वस्त सूत्रहरूबाट,विश्वस्त सूत्र,('SFX', '18', '0', 'हरूबाट', '.')

Word: विश्वस्त सूत्रसँगका Stem res: ['विश्वस्त', 'सूत्रसँग']
Line 3167 in data file: विश्वस्त सूत्रसँगका,विश्वस्त सूत्र,('SFX', '15', '0', 'सँगका', '.')

Word: विश्वस्त सूत्रहरूद्वारा Stem res: ['विश्वस्त', 'सूत्र']
Line 3168 in data file: विश्वस्त सूत्रहरूद्वारा,विश्वस्त सूत्र,('SFX', '18', '0', 'हरूद्वारा', '.')

Word: विश्वस्त सूत्रबाहेक Stem res: ['विश्वस्त', 'सूत्रबाहेक']
Line 3169 in data file: विश्वस्त सूत्रबाहेक,विश्वस्त सूत्र,('SFX', '22', '0', 'बाहेक', '.')



3398
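The gap between 3403 results and 3398 gold words is consistent with the five gold entries above that contain a space: if `rootify` tokenizes its input on whitespace (an assumption, since the library's internals were not inspected here), each of those entries contributes two tokens, and 3398 + 5 = 3403. A minimal sketch of the effect, using plain `str.split` as a stand-in for the stemmer's tokenizer:

```python
# Toy subset of gold words; the last entry contains a space
toy_gold_words = ['द्रब्यशाहबाट', 'कोलाहलका', 'विश्वस्त सूत्रपछिको']

# Joining and re-splitting on whitespace loses the entry boundaries
joined = ' '.join(toy_gold_words)
tokens = joined.split()

# Entries that break into more than one token
multi_token = [w for w in toy_gold_words if len(w.split()) > 1]

# The surplus is the number of extra tokens contributed by multi-token entries
surplus = len(tokens) - len(toy_gold_words)
print(len(toy_gold_words), len(tokens), len(multi_token), surplus)
```

This is why the per-word loop is used for scoring: it keeps exactly one result per gold entry.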

In [25]:
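# Build one result string per gold word; multi-token results are re-joined
# with a space so they can be compared against the gold lemma.
nepali_nlp_results = []
stemmer = Stem()  # create the stemmer once and reuse it

for w in gold_words:
    res = stemmer.rootify(w)
    if len(res) > 1:
        nepali_nlp_results.append(' '.join(res))
    else:
        nepali_nlp_results.append(res[0])

len(nepali_nlp_results)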

3398

In [26]:
# Score a list of predictions against the gold lemmas.
# Returns the accuracy as a fraction between 0 and 1 (not a percentage),
# together with a per-item checklist (1 = match, 0 = mismatch) for error analysis.
def accuracy_score(gold, predicted):
    if len(gold) != len(predicted):
        raise ValueError('gold and predicted must have the same length')
    checks = [1 if g == p else 0 for g, p in zip(gold, predicted)]
    accuracy = sum(checks) / len(gold)
    return accuracy, checks


In [27]:
nepali_nlp_acc, nepali_nlp_checklist = accuracy_score(gold_lemmas, nepali_nlp_results)

In [28]:
nepali_nlp_acc

0.2692760447321954

In [29]:
len(gold_lemmas)

3398

In [30]:
nepali_nlp_checklist[:10]

[0, 0, 1, 0, 0, 1, 0, 0, 0, 1]

In [31]:
print(gold_lemmas[:10])
print(nepali_nlp_results[:10])

['द्रब्यशाह', 'द्रब्यशाह', 'द्रब्यशाह', 'द्रब्यशाह', 'द्रब्यशाह', 'कोलाहल', 'कोलाहल', 'कोलाहल', 'कोलाहल', 'कोलाहल']
['द्रब्यशाहमध्ये', 'द्रब्यशाहबाहेक', 'द्रब्यशाह', 'द्रब्यशाहसम्म', 'द्रब्यशाहबाट', 'कोलाहल', 'कोलाहललगायत', 'कोलाहलमध्ये', 'कोलाहलजस्ता', 'कोलाहल']


## Exploring [nepali-stemmer](https://github.com/oya163/nepali-stemmer)

In [32]:
!pip install nepali-stemmer

Collecting nepali-stemmer
  Downloading nepali_stemmer-0.0.2-py3-none-any.whl (149 kB)
Installing collected packages: nepali-stemmer
Successfully installed nepali-stemmer-0.0.2


In [35]:
!pip install importlib-resources==1.4.0

Collecting importlib-resources==1.4.0
  Downloading importlib_resources-1.4.0-py2.py3-none-any.whl (20 kB)
Installing collected packages: importlib-resources
Successfully installed importlib-resources-1.4.0


In [36]:
!pip install nepali-stemmer==0.0.2



In [37]:
from nepali_stemmer.stemmer import NepStemmer

In [38]:
# download the dictionary and suffix files
!wget https://raw.githubusercontent.com/oya163/nepali-stemmer/master/nepali_stemmer/files/dictionary.txt
!wget https://raw.githubusercontent.com/oya163/nepali-stemmer/master/nepali_stemmer/files/suffix.txt

--2023-07-21 05:04:28--  https://raw.githubusercontent.com/oya163/nepali-stemmer/master/nepali_stemmer/files/dictionary.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 2606:50c0:8002::154, 2606:50c0:8000::154, 2606:50c0:8003::154, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|2606:50c0:8002::154|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 722832 (706K) [text/plain]
Saving to: ‘dictionary.txt’


2023-07-21 05:04:31 (390 KB/s) - ‘dictionary.txt’ saved [722832/722832]

--2023-07-21 05:04:31--  https://raw.githubusercontent.com/oya163/nepali-stemmer/master/nepali_stemmer/files/suffix.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 2606:50c0:8002::154, 2606:50c0:8000::154, 2606:50c0:8003::154, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|2606:50c0:8002::154|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5144 (5.0K) [text/plain]
Saving to

In [40]:
nepstem = NepStemmer(dict_path='dictionary.txt', suffix_path='suffix.txt')


nepstem.stem(gold_words[0]), gold_words[0], gold_lemmas[0], gold_suffixes[0]

('द्रब्यशाह मध्ये', 'द्रब्यशाहमध्ये', 'द्रब्यशाह', 'मध्ये')

In [41]:
# The result from nepali-stemmer shows that it returns the stem and the suffix
# in a single string, separated by a space.

# Below I check what it returns for a gold word that has 2 tokens.
nepstem.stem('विश्वस्त सूत्रपछिको')

'विश्वस्त सूत्र पछिको'

In [44]:
# The gold answer for the above word is:
lemma_dictionary['विश्वस्त सूत्रपछिको'].lemma, lemma_dictionary['विश्वस्त सूत्रपछिको'].suffix

('विश्वस्त सूत्र', 'पछिको')

In [48]:
# explore the answers for the gold_words
for w in gold_words:
    res = nepstem.stem(w)
    res_parts = res.split()
    if len(res_parts)<1 or len(res_parts)>=3:
        print(f'Word: {w} Stemmer_Res: {res} Gold_lemma: {lemma_dictionary[w].lemma} Gold_suffix: {lemma_dictionary[w].suffix}')
        print()
    

Word: विश्वस्त सूत्रपछिको Stemmer_Res: विश्वस्त सूत्र पछिको Gold_lemma: विश्वस्त सूत्र Gold_suffix: पछिको

Word: विश्वस्त सूत्रहरूबाट Stemmer_Res: विश्वस्त सूत्र हरूबाट Gold_lemma: विश्वस्त सूत्र Gold_suffix: हरूबाट

Word: विश्वस्त सूत्रसँगका Stemmer_Res: विश्वस्त सूत्रसँग का Gold_lemma: विश्वस्त सूत्र Gold_suffix: सँगका

Word: विश्वस्त सूत्रहरूद्वारा Stemmer_Res: विश्वस्त सूत्रहरू द्वारा Gold_lemma: विश्वस्त सूत्र Gold_suffix: हरूद्वारा

Word: विश्वस्त सूत्रबाहेक Stemmer_Res: विश्वस्त सूत्र बाहेक Gold_lemma: विश्वस्त सूत्र Gold_suffix: बाहेक



In [50]:
# From inspecting the above results, the last 'word' in the string returned by
# nepali-stemmer is treated as the suffix whenever the string has 3 or more 'words':
# it is dropped and the remaining 'words' are re-joined to give the stem/lemma.
# If the string has 2 'words', the first is the stem and the second is the suffix.
# If it has only 1 'word', the stemmer did not find a suffix.

nepali_stemmer_results=[]

for w in gold_words:
    res = nepstem.stem(w)
    res_parts = [ p.strip() for p in res.split() ]
    if len(res_parts) > 2:
        res_stem = ' '.join(res_parts[:-1])
        print(res_stem, res_parts)
        nepali_stemmer_results.append(res_stem)
    else:
        nepali_stemmer_results.append(res_parts[0])

len(nepali_stemmer_results)

विश्वस्त सूत्र ['विश्वस्त', 'सूत्र', 'पछिको']
विश्वस्त सूत्र ['विश्वस्त', 'सूत्र', 'हरूबाट']
विश्वस्त सूत्रसँग ['विश्वस्त', 'सूत्रसँग', 'का']
विश्वस्त सूत्रहरू ['विश्वस्त', 'सूत्रहरू', 'द्वारा']
विश्वस्त सूत्र ['विश्वस्त', 'सूत्र', 'बाहेक']


3398
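The post-processing rule used above can be isolated into a small helper, which makes it easier to test on individual strings. This is a sketch of the same heuristic, not part of nepali-stemmer; `stem_only` is a hypothetical name.

```python
def stem_only(stemmer_output):
    """Return the stem part of a 'stem suffix' string (heuristic from this notebook)."""
    parts = stemmer_output.split()
    if not parts:
        return ''                    # nothing to extract from an empty string
    if len(parts) > 2:
        return ' '.join(parts[:-1])  # multi-token word: drop the last token as the suffix
    return parts[0]                  # 1 token (no suffix found) or 2 tokens (stem + suffix)

# Strings copied from the stemmer outputs shown above
examples = ['द्रब्यशाह मध्ये', 'विश्वस्त सूत्र पछिको', 'विश्वस्त सूत्रसँग का']
stems = [stem_only(s) for s in examples]
for s, stem in zip(examples, stems):
    print(f'{s!r} -> {stem!r}')
```

The third example still differs from the gold lemma `विश्वस्त सूत्र`, so it is scored as an error. The heuristic also has a blind spot: a two-token word for which the stemmer finds no suffix would be cut down to its first token.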

In [51]:
nepali_stemmer_acc, nepali_stemmer_checklist = accuracy_score(gold_lemmas, nepali_stemmer_results)

In [52]:
nepali_stemmer_acc

0.7230723955267805

In [55]:
print(nepali_stemmer_checklist[90:100])
print(gold_lemmas[90:100])
print(nepali_stemmer_results[90:100])

[1, 1, 1, 1, 1, 0, 1, 1, 0, 1]
['साथ', 'साथ', 'साथ', 'साथ', 'साथ', 'शृङ्ग', 'शृङ्ग', 'शृङ्ग', 'शृङ्ग', 'शृङ्ग']
['साथ', 'साथ', 'साथ', 'साथ', 'साथ', 'शृङ्गमाथि', 'शृङ्ग', 'शृङ्ग', 'शृङ्गपछि', 'शृङ्ग']


## Exploring [NepaliLemmatizer](https://github.com/dpakpdl/NepaliLemmatizer/tree/master)

In [64]:
import NepaliLemmatizer.Lemmatization as Lemmatization

In [68]:
!ls -al

total 21744
drwxr-xr-x   13 pllee  staff       416 Jul 21 07:16 .
drwxr-xr-x+ 287 pllee  staff      9184 Jul 21 06:17 ..
-rw-r--r--@   1 pllee  staff      6148 Jul 21 06:17 .DS_Store
drwxr-xr-x    3 pllee  staff        96 Jul 20 07:07 .ipynb_checkpoints
-rw-r--r--@   1 pllee  staff     73169 Jul 20 03:16 NLTK __ nltk.metrics.paice.pdf
drwxrwxr-x@   8 pllee  staff       256 Sep 24  2020 NepaliLemmatizer
-rw-r--r--@   1 pllee  staff  10073228 Jul 21 06:16 NepaliLemmatizer-master.zip
drwxr-xr-x    3 pllee  staff        96 Jul 20 06:49 data
-rw-r--r--    1 pllee  staff    722832 Jul 21 05:04 dictionary.txt
-rw-r--r--    1 pllee  staff    235910 Jul 21 07:16 evaluate.ipynb
drwxr-xr-x    5 pllee  staff       160 Jul  8 21:34 papers
-rw-r--r--    1 pllee  staff      5144 Jul 21 05:04 suffix.txt
-rw-r--r--    1 pllee  staff        32 Jul 10 22:17 try.txt


In [69]:
import os
os.chdir("NepaliLemmatizer/Lemmatization")
!pwd

/Users/pllee/nepali-challenge/NepaliLemmatizer/Lemmatization


In [73]:
os.chdir('..')
!pwd

/Users/pllee/nepali-challenge/NepaliLemmatizer


In [75]:
!pwd

/Users/pllee/nepali-challenge/NepaliLemmatizer


In [76]:
!python lemmatizer.py -m trie -t 'खाएको'

+--------+---------+
| word   | lemma   |
| खाएको  | खा      |
+--------+---------+


In [77]:
import lemmatizer

In [78]:
lemmatizer_nep = lemmatizer.NepaliLemmatizer()

In [83]:
#trie method
print(type(lemmatizer_nep.trie_based_method(gold_words[0])[0][0]) is dict)
lemmatizer_nep.trie_based_method(gold_words[0])[0][0]   # a dictionary is returned if lemmatization is done

True


{'word': 'द्रब्यशाहमध्ये', 'lemma': 'द्रब्यशाह'}

In [85]:
print(type(lemmatizer_nep.trie_based_method('द्रब्यशाह')[0][0]) is tuple)
lemmatizer_nep.trie_based_method('द्रब्यशाह')  # a tuple is returned if no stemming is done

True


[[('द्रब्यशाह', 'द्रब्यशाह')]]

In [86]:
lemmatizer_nep.trie_based_method('विश्वस्त सूत्र') 

[[{'word': 'विश्वस्त', 'lemma': 'विश्व'}, ('सूत्र', 'सूत्र')]]

In [91]:
# hybrid method

print(type(lemmatizer_nep.hybrid_method(gold_words[0])[0][0]) is dict)
print(lemmatizer_nep.hybrid_method(gold_words[0])[0][0]['lemma'])
lemmatizer_nep.hybrid_method(gold_words[0])[0][0]

True
द्रब्यशाह


{'word': 'द्रब्यशाहमध्ये', 'lemma': 'द्रब्यशाह'}

In [89]:
print(type(lemmatizer_nep.hybrid_method('द्रब्यशाह')[0][0]) is tuple)
lemmatizer_nep.hybrid_method('द्रब्यशाह')  # unlike the trie method, the hybrid method returns a dictionary even if no stemming is done

False


[[{'word': 'द्रब्यशाह', 'lemma': 'द्रब्यशाह'}]]

In [90]:
lemmatizer_nep.hybrid_method('विश्वस्त सूत्र') 

[[{'word': 'विश्वस्त', 'lemma': 'विश्वस्त'},
  {'word': 'सूत्र', 'lemma': 'सूत्र'}]]

In [95]:
# get results for NepaliLemmatizer trie method

nepali_lemmatizer_trie_results = []

for w in gold_words:
    res = lemmatizer_nep.trie_based_method(w)[0]
    res_extract = []
    for r in res:
        if type(r) is dict:
            res_extract.append(r['lemma'])
        elif type(r) is tuple:
            res_extract.append(r[1])
        else:
            print(f'Processing word: {w} lemmatizer_result: {res}')
            print(f"Result is of type {type(r)}")
    # join the component results into a string 
    if len(res_extract)==2:
        nepali_lemmatizer_trie_results.append(' '.join(res_extract))
    elif len(res_extract)==1:
        nepali_lemmatizer_trie_results.append(res_extract[0])
    else:
        print(f'Processing word: {w} lemmatizer_result: {res}')
        print(f'Result has {len(res_extract)} components.')

len(nepali_lemmatizer_trie_results)


3398

In [96]:
# compute accuracy of NepaliLemmatizer trie method

nepali_lemmatizer_trie_acc, nepali_lemmatizer_trie_checklist = accuracy_score(gold_lemmas, nepali_lemmatizer_trie_results)

In [97]:
nepali_lemmatizer_trie_acc

0.983225426721601

In [98]:
print(nepali_lemmatizer_trie_checklist[90:100])
print(gold_lemmas[90:100])
print(nepali_lemmatizer_trie_results[90:100])

[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
['साथ', 'साथ', 'साथ', 'साथ', 'साथ', 'शृङ्ग', 'शृङ्ग', 'शृङ्ग', 'शृङ्ग', 'शृङ्ग']
['साथ', 'साथ', 'साथ', 'साथ', 'साथ', 'शृङ्ग', 'शृङ्ग', 'शृङ्ग', 'शृङ्ग', 'शृङ्ग']


In [99]:
nepali_lemmatizer_trie_checklist.index(0)

198

In [100]:
print(nepali_lemmatizer_trie_checklist[198])
print(gold_lemmas[198])
print(nepali_lemmatizer_trie_results[198])

0
च्याउ
च्याइँ


In [118]:
len(nepali_lemmatizer_trie_checklist) - sum(nepali_lemmatizer_trie_checklist)

57

In [106]:
nepali_lemmatizer_trie_error_index = [ i for i, n in enumerate(nepali_lemmatizer_trie_checklist) if n == 0 ]
len(nepali_lemmatizer_trie_error_index)

57

In [113]:
# Quick inspection of the first 10 errors
for e in nepali_lemmatizer_trie_error_index[:10]:
    print(nepali_lemmatizer_trie_checklist[e],
          gold_lemmas[e],
          nepali_lemmatizer_trie_results[e])

0 च्याउ च्याइँ
0 व्याकुल व्याकुलता
0 पूर्ण पूर्णता
0 इट इटहरी
0 काट् काट्ने
0 काट् काटे
0 काट् काट्ने
0 काट् काटे
0 औपचारिक औपचारिकता
0 छेक् छेक


In [114]:
# get results for NepaliLemmatizer hybrid method

nepali_lemmatizer_hybrid_results = []

for w in gold_words:
    res = lemmatizer_nep.hybrid_method(w)[0]
    res_extract = []
    for r in res:
        if type(r) is dict:
            res_extract.append(r['lemma'])
        elif type(r) is tuple:
            res_extract.append(r[1])
        else:
            print(f'Processing word: {w} lemmatizer_result: {res}')
            print(f"Result is of type {type(r)}")
    # join the component results into a string 
    if len(res_extract)==2:
        nepali_lemmatizer_hybrid_results.append(' '.join(res_extract))
    elif len(res_extract)==1:
        nepali_lemmatizer_hybrid_results.append(res_extract[0])
    else:
        print(f'Processing word: {w} lemmatizer_result: {res}')
        print(f'Result has {len(res_extract)} components.')

len(nepali_lemmatizer_hybrid_results)

3398
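The trie and hybrid loops share the same extraction logic, so it could be factored into one helper. The sketch below is not part of NepaliLemmatizer: `join_lemmas` is a hypothetical name, and it relies only on the two result shapes observed in the outputs above (a dict with a `'lemma'` key, or a `(word, lemma)` tuple). Unlike the loops above, it raises on an unexpected type instead of printing, and it joins any number of components.

```python
def join_lemmas(tokens):
    """Join the per-token lemmas of one lemmatizer result into a single string."""
    lemmas = []
    for item in tokens:
        if isinstance(item, dict):
            lemmas.append(item['lemma'])   # shape seen when a lemma entry is returned
        elif isinstance(item, tuple):
            lemmas.append(item[1])         # shape seen as (word, lemma)
        else:
            raise TypeError(f'Unexpected result type: {type(item)}')
    return ' '.join(lemmas)

# Result shapes copied from the outputs shown earlier in this notebook
trie_result = [[{'word': 'विश्वस्त', 'lemma': 'विश्व'}, ('सूत्र', 'सूत्र')]]
hybrid_result = [[{'word': 'द्रब्यशाह', 'lemma': 'द्रब्यशाह'}]]

print(join_lemmas(trie_result[0]))
print(join_lemmas(hybrid_result[0]))
```

With such a helper, each method's result list could be built with a single comprehension over `gold_words`, indexing `[0]` into the returned list as the loops above do.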

In [115]:
# compute accuracy of NepaliLemmatizer hybrid method

nepali_lemmatizer_hybrid_acc, nepali_lemmatizer_hybrid_checklist = accuracy_score(gold_lemmas, nepali_lemmatizer_hybrid_results)

In [116]:
nepali_lemmatizer_hybrid_acc

0.996174220129488

In [117]:
print(nepali_lemmatizer_hybrid_checklist[90:100])
print(gold_lemmas[90:100])
print(nepali_lemmatizer_hybrid_results[90:100])

[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
['साथ', 'साथ', 'साथ', 'साथ', 'साथ', 'शृङ्ग', 'शृङ्ग', 'शृङ्ग', 'शृङ्ग', 'शृङ्ग']
['साथ', 'साथ', 'साथ', 'साथ', 'साथ', 'शृङ्ग', 'शृङ्ग', 'शृङ्ग', 'शृङ्ग', 'शृङ्ग']


In [119]:
len(nepali_lemmatizer_hybrid_checklist) - sum(nepali_lemmatizer_hybrid_checklist)

13

In [120]:
nepali_lemmatizer_hybrid_error_index = [ i for i, n in enumerate(nepali_lemmatizer_hybrid_checklist) if n == 0 ]
len(nepali_lemmatizer_hybrid_error_index)

13

In [121]:
# Quick inspection of all errors
for e in nepali_lemmatizer_hybrid_error_index:
    print(nepali_lemmatizer_hybrid_checklist[e],
          gold_lemmas[e],
          nepali_lemmatizer_hybrid_results[e])

0 व्याकुल व्याकुलता
0 पूर्ण पूर्णता
0 औपचारिक औपचारिकता
0 अर्धो अर्धा
0 अनुकरण अनुकरणीय
0 पवित्र पवित्रता
0 कयर कयरता
0 देखाउ देखाइ
0 उघ्रँदो उघ्र
0 उघ्रँदो उघ्र
0 अनिवार्य अनिवार्यता
0 सभ्य सभ्यता
0 कायर कायरता
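As a possible first step towards an error analysis, the errors can be grouped by what the prediction kept beyond the gold lemma. The sketch below uses a handful of the (gold, predicted) pairs listed above; `leftover` is a hypothetical helper, and the grouping rule (the prediction starts with the gold lemma) is just one simple choice.

```python
from collections import Counter

# (gold lemma, predicted lemma) pairs copied from the hybrid-method errors above
error_pairs = [
    ('व्याकुल', 'व्याकुलता'),
    ('पूर्ण', 'पूर्णता'),
    ('औपचारिक', 'औपचारिकता'),
    ('अनुकरण', 'अनुकरणीय'),
    ('देखाउ', 'देखाइ'),
    ('उघ्रँदो', 'उघ्र'),
]

def leftover(gold, predicted):
    """Return what the prediction kept beyond the gold lemma, or None if it does not extend it."""
    if predicted != gold and predicted.startswith(gold):
        return predicted[len(gold):]
    return None

# Count how often each leftover suffix occurs; None collects all other error types
leftover_counts = Counter(leftover(g, p) for g, p in error_pairs)
for suffix, count in leftover_counts.most_common():
    print(repr(suffix), count)
```

In the full list of 13 hybrid-method errors above, eight predictions are the gold lemma followed by `ता`, which suggests that a single unstripped suffix accounts for most of the remaining errors.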


# Conclusion

In this notebook, I have computed the accuracy of 4 stemmers/lemmatizers against the gold dataset from [here](https://github.com/dpakpdl/NepaliLemmatizer/blob/master/Lemmatization/data/manually_annotated_corpus/gold_data.txt).

The results are as follows:
* Nepali_nlp stemmer 26.93%
* nepali-stemmer 72.31%
* NepaliLemmatizer trie-method 98.32%
* NepaliLemmatizer hybrid-method 99.62%
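The percentages above are the accuracy fractions from the cell outputs, rounded to two decimals. A small sketch that reproduces them and recovers the error counts, with the values hard-coded from this notebook's outputs (`N_GOLD` and `summary` are names introduced here for illustration):

```python
# Accuracy fractions copied from the cell outputs in this notebook
N_GOLD = 3398
accuracies = {
    'Nepali_nlp stemmer': 0.2692760447321954,
    'nepali-stemmer': 0.7230723955267805,
    'NepaliLemmatizer trie-method': 0.983225426721601,
    'NepaliLemmatizer hybrid-method': 0.996174220129488,
}

summary = []
for name, acc in accuracies.items():
    # Recover the number of mismatches from the fraction and the dataset size
    n_errors = N_GOLD - round(acc * N_GOLD)
    summary.append((name, f'{acc:.2%}', n_errors))

for name, pct, n_errors in summary:
    print(f'{name:32s} {pct:>7s} {n_errors:5d} errors')
```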

On this dataset, the NepaliLemmatizer methods are far more accurate than the 2 stemmers. In addition, the hybrid method is more accurate than the trie method of NepaliLemmatizer (13 errors versus 57 out of 3398 entries).

Since the list of stemmers/lemmatizers explored in this notebook is certainly not exhaustive, we will search for and explore more algorithms. In addition, we would like to do an error analysis of the NepaliLemmatizer methods to see whether the algorithms can be improved.
