# Morpheme Lemmatizer

The morpheme lemmatizer is an additional model that builds on a morpheme-segmenter model: given the sequence of candidate morpheme strings for one word, it regularizes each string.

Pipeline:
1. Input: a word $w$.
2. Segment $w$ into a morpheme list: $m_0, m_1, \dots$
3. For each morpheme $m_i$, encode its characters as integers.
4. Feed the encoded characters into the FFN.
5. Read off the regularized morpheme $f(m_i)$.
6. Repeat for every morpheme in the list to obtain the regularized morphemes $f(m_0), f(m_1), \dots$

where $f(m_i) \in M$ and $M$ is the morpheme lexicon.
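
The steps above can be sketched end to end in a few lines. This is only an illustration under stated assumptions: `encode_morpheme`, `closest_lexicon_entry`, `lemmatize` and `toy_lexicon` are hypothetical names, and the FFN is replaced by a plain string-similarity lookup (`difflib.get_close_matches`) so that the example runs without a trained model. The stand-in works on raw strings; a real FFN would consume the integer encoding from step 3.

```python
import difflib
import string

# Assumed alphabet: 0 is the padding symbol, 1..26 are a..z.
LETTER_TO_INT = {l: i for i, l in enumerate('_' + string.ascii_lowercase)}

def encode_morpheme(morpheme):
    """Step 3: integer representation of the morpheme's characters."""
    return [LETTER_TO_INT[c] for c in morpheme if c in LETTER_TO_INT]

def closest_lexicon_entry(morpheme, lexicon):
    """Hypothetical stand-in for the FFN (steps 4-5): the most similar lexicon string."""
    matches = difflib.get_close_matches(morpheme, sorted(lexicon), n=1, cutoff=0.0)
    return matches[0]

def lemmatize(morphemes, lexicon):
    """Step 6: regularize every morpheme of one word."""
    return [closest_lexicon_entry(m, lexicon) for m in morphemes]

toy_lexicon = {'un', 'happy', 'ness'}  # made-up lexicon M for illustration
print(encode_morpheme('happi'))
print(lemmatize(['un', 'happi', 'ness'], toy_lexicon))
```

Whatever replaces the stand-in, the contract stays the same: every output must be a member of the lexicon $M$.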

For the sake of trying out different models, the lemmatizer is kept as an external layer to the morpheme segmenter: an FFN whose input is a substring of up to $m$ characters, the index of the morpheme in the word, and the length of the word. It could just as easily be built on top of the morpheme segmenter model as another layer.
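
One possible layout of that FFN input is sketched below. The function name `build_features`, the default `max_len` (standing in for $m$), and the choice to append the two scalars after the padded characters are all assumptions made for illustration, not the layout used by the actual model.

```python
import string

# Assumed alphabet: 0 is the padding symbol, 1..26 are a..z.
LETTER_TO_INT = {l: i for i, l in enumerate('_' + string.ascii_lowercase)}

def build_features(morpheme, morpheme_index, word_len, max_len=8):
    """Return [c_0, ..., c_{max_len-1}, morpheme_index, word_len] as a flat list of ints."""
    # Encode the characters, dropping anything outside the alphabet and truncating to max_len.
    chars = [LETTER_TO_INT[c] for c in morpheme if c in LETTER_TO_INT][:max_len]
    # Right-pad with the padding id so every vector has the same length.
    padded = chars + [0] * (max_len - len(chars))
    return padded + [morpheme_index, word_len]

# 'happi' is the morpheme at index 1 of the 11-letter word 'unhappiness'.
print(build_features('happi', 1, 11, max_len=6))
```

A fixed-length vector like this can be fed directly to a dense layer; an embedding layer over the character positions would be a natural alternative.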

Because of this separate structure, we need one extra hyperparameter, the **cutoff-probability**: the segmenter must assign a split at least this probability before the substring is split at that position.
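
To make the effect of the cutoff concrete, here is a minimal sketch that turns per-position split probabilities into substrings. It assumes that `cut_probs[i]` is the probability of a boundary just before character `i`; the segmenter's real index convention is not stated here, so treat that as a guess. `split_at_cuts` is a hypothetical helper, and the strict `>` comparison mirrors `decode_pred` in the code further down.

```python
def split_at_cuts(word, cut_probs, cutoff=0.1):
    """Split word at every position whose boundary probability exceeds cutoff."""
    # Assumption: cut_probs[i] scores a boundary between word[i-1] and word[i].
    # Position 0 is skipped because a cut there would produce an empty morpheme.
    cuts = [i for i, p in enumerate(cut_probs[:len(word)]) if i > 0 and p > cutoff]
    bounds = [0] + cuts + [len(word)]
    return [word[a:b] for a, b in zip(bounds, bounds[1:])]

word = 'unhappiness'
probs = [0.0] * len(word)
probs[2] = 0.9  # boundary before 'h'
probs[7] = 0.4  # boundary before the second 'n'

print(split_at_cuts(word, probs, cutoff=0.1))
print(split_at_cuts(word, probs, cutoff=0.5))
```

Raising the cutoff removes the less certain boundary, so the same prediction yields fewer and longer substrings.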

In [1]:
import csv
import re
import gc
import string
import nltk
from nltk.corpus import wordnet, stopwords
from nltk.tokenize import word_tokenize
import gensim
import gensim.downloader
import math
import matplotlib.pyplot as plot
from collections import defaultdict
from morphemes import Morphemes
from sklearn.decomposition import PCA
import numpy as np
import pandas as pd
import tensorflow as tf
tf.config.set_visible_devices([], 'GPU') # hide all GPUs so TensorFlow runs on the CPU; needed on M1 (https://stackoverflow.com/q/72441453)
import keras
from keras.utils import Sequence, pad_sequences
from keras.models import Sequential, Model
from keras.datasets import imdb
from keras.optimizers import Adam
from keras.layers import SimpleRNN, Dense, Activation, Input, LSTM, ReLU, Layer, LayerNormalization, MultiHeadAttention, Dropout, Embedding, GlobalAveragePooling1D

In [3]:
# getting morpheme lexicon
morpheme_lexicon = set()
morphemes_in = {} # word -> list of morphemes
with open('./morphemes_files/morphemes_with_inflection.csv', newline='') as file:
    reader = csv.reader(file)
    for word, ssv_morphemes in reader:
        morpheme_list = ssv_morphemes.split()
        morphemes_in[word] = morpheme_list
        morpheme_lexicon.update(morpheme_list)


# sort so the integer ids are stable across runs (set iteration order is not)
int_to_morpheme = sorted(morpheme_lexicon)
morpheme_to_int = {m: i for i, m in enumerate(int_to_morpheme)}

NULL_LETTER = '_' # padding symbol, always index 0
int_to_letter = [NULL_LETTER] + list(string.ascii_lowercase)
letter_to_int = {l: i for i, l in enumerate(int_to_letter)}

In [None]:
# load morpheme segmenter
segmenter_model = keras.models.load_model('./models/MorphemeSegmenter-Transformer')

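def embed_word(w, MAX_WORD_LEN = 50): # note MAX_WORD_LEN is a hyperparam of the segmenter_model
    # map letters to ints (characters outside the alphabet are skipped),
    # then truncate and right-pad so the result is always MAX_WORD_LEN long
    filled = [letter_to_int[l] for l in w if l in letter_to_int][:MAX_WORD_LEN]
    full = filled + [letter_to_int[NULL_LETTER]]*(MAX_WORD_LEN-len(filled))
    return np.array(full)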

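def decode_pred(pred_vec, min_p = .1):
    # positions ordered from most to least probable cut
    ordered_cuts = np.flip(np.argsort(pred_vec))
    # keep only the positions whose probability exceeds min_p
    likely_cuts = [i for i in ordered_cuts if pred_vec[i] > min_p]
    return likely_cuts

def segment_morphemes(word, min_p = .1):
    # predict on a batch of one word and decode its cut positions
    return decode_pred(segmenter_model.predict(np.array([embed_word(word)]))[0], min_p)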