# Advanced Language Models & Applications in Morphologically Rich Languages: Modern Hebrew

CP193 - Individualized deliverable by Oren Dar, submitted to Professor Diamond.

## Technical overview & state-of-the-art results
In this deliverable, I explain the technical motivation for this work and the techniques it uses, and present the state-of-the-art results I achieved on the Hebrew neural sentiment analysis dataset, which was collected and described in depth by Amram et al. (2018). My model reduces the error rate by about 45\% relative to the best result in that paper and reaches 94\% accuracy on a 3-class classification problem with a significant number of ambiguous samples.

### Technical motivation - transfer learning
Over the past decade, deep neural networks (DNNs) have achieved state-of-the-art results on many visual and language-based tasks, to the point where DNNs are now the dominant method in the domains of Computer Vision (CV) and Natural Language Processing (NLP). Just as striking as their success when large labeled datasets and significant amounts of compute and time are available is their ability to transfer that success to datasets and tasks they were never exposed to. Even after they are trained on the new task, pretrained networks still outperform models trained on that task without pretraining (Yosinski et al., 2014). This is thanks to their ability to gradually build up higher-level representations of the features of the input, which have been shown to perform extremely well even when taken out of the DNN (Sharif Razavian et al., 2014).

Following this discovery, and thanks to the availability of a variety of models pretrained on ImageNet (a huge image classification dataset), transfer learning has been applied to almost every Computer Vision problem with great success in both academia and industry. Models in this category are usually Convolutional Neural Networks (CNNs), where the earlier layers tend to be convolutional and act as feature extractors (this part is sometimes referred to as the "backbone" of the model), whereas the later layers tend to be linear and act as classifiers (this part is often referred to as the "head" of the model).

The convolutional layers learn features from the input domain. The first layers learn low-level, general features (for example, edges and lines in images) and are therefore almost always useful when dealing with image data. The layers following them learn progressively more specific, higher-level features (for example, particular textures and shapes in images), which are usually also useful when dealing with image data but may need to be retrained when the data for the task at hand is extremely different from the data the model was pretrained on (Figure 2, Zeiler & Fergus, 2014).

Therefore, the general transfer learning procedure is the following: we take any DNN pretrained on ImageNet, replace its pretrained classification layer with a randomly-initialized layer which suits our CV task, and "freeze" all the layers that come before it, meaning we fix their weights so that no gradients are calculated for them and the optimizer does not update them. Since we only need to train the new layer, and since this layer already receives very good features from the pretrained feature extractor, the training process is very quick and usually gives much better results than training a DNN on our task from scratch. In addition, if we have sufficient data or our data is significantly different from ImageNet, we may then "unfreeze" the earlier layers and train the entire model some more so that all the layers adapt to our task - this is often referred to as "fine-tuning" the model. This approach is extremely common and powerful, and is a main reason why many standard Computer Vision tasks are considered to have been largely solved at this point in time.

However, transfer learning for NLP was not common - and was considered by some to be impossible - until Howard & Ruder (2018) published their general-purpose transfer learning approach for NLP. Their approach features large-scale language model pretraining and several strong techniques for utilizing it for transfer learning (more on that later), and was largely motivated by how powerful transfer learning was for CV. A language model is a model which predicts the next token (usually a word) given the tokens that precede it. For example, when given the prompt "the dog could not cross the street because there was a lot of", ideally the model would produce the word "traffic" as the next word. We can therefore measure the model using the standard accuracy metric: if the model has to choose between 50,000 words and it chooses correctly 20\% of the time (on unseen data), then it must have captured a good deal of the semantics of the language and of the context relationships within the sentence. One key advantage of this approach is that, unlike CV pretraining, no labels are necessary. Training a language model is a self-supervised task: we can take any regular sentence and ask the model to predict each next word starting from the first one (hiding the words the model has not reached yet), and then compare the model's predictions with the actual words just as in a standard supervised setting.
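To make the self-supervised setup concrete, here is a minimal sketch of next-word prediction and the accuracy metric described above. It uses a toy count-based bigram model rather than a neural network, and the sentences and function names (`train_bigram_lm`, `predict_next`, `next_token_accuracy`) are made up for illustration; the point is only that the "labels" come for free from the raw text.

```python
from collections import Counter, defaultdict

def train_bigram_lm(sentences):
    """Count, for every word, which words follow it in the training text."""
    follow_counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        # Each (previous word, next word) pair is a free training example:
        # the "label" is simply the next word in the raw text.
        for prev, nxt in zip(tokens, tokens[1:]):
            follow_counts[prev][nxt] += 1
    return follow_counts

def predict_next(follow_counts, prev):
    """Return the most frequent follower of `prev`, or None if `prev` is unseen."""
    if prev not in follow_counts:
        return None
    return follow_counts[prev].most_common(1)[0][0]

def next_token_accuracy(follow_counts, sentences):
    """Fraction of positions where the predicted next word matches the real one."""
    correct = total = 0
    for sentence in sentences:
        tokens = sentence.split()
        for prev, nxt in zip(tokens, tokens[1:]):
            correct += predict_next(follow_counts, prev) == nxt
            total += 1
    return correct / total if total else 0.0

# Made-up toy corpus (hypothetical data, not from the Hebrew datasets).
train_sentences = [
    "there was a lot of traffic",
    "the street had a lot of traffic",
    "the dog made a lot of noise",
]
lm = train_bigram_lm(train_sentences)
print(predict_next(lm, "of"))
print(next_token_accuracy(lm, ["a lot of noise"]))
```

A real language model conditions on the whole preceding context rather than on a single word, but it is scored in the same way.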

The core idea of the approach is therefore to train a language model on a large text corpus - usually Wikipedia or Common Crawl's database of web pages - and then take that pretrained model and fine-tune it on the desired task, similar to the procedure for CV.

### Technical description of the overall procedure used here
As can be seen in this notebook, our approach is based on Howard & Ruder (2018) in both technique and implementation. We first train a general-purpose Hebrew language model on the Hebrew Wikipedia, which takes about 20 hours on my mid-range GPU. We then fine-tune the language model on the sentences from the sentiment analysis task (only as text, without sentiment labels). Finally, we prepare the model for classification by adding a classification layer and training on the sentiment analysis training set using the sentiment labels. We can then predict labels for the test set, calculate the model's accuracy, and compare it to existing results.

The code is based on fast.ai's NLP codebase (https://github.com/fastai/course-nlp) and has been adapted to work in Hebrew and on the Hebrew sentiment analysis dataset.

### Advanced techniques utilized
* Gradual unfreezing - when attempting to use transfer learning using a pretrained model, we start with a frozen model (weights are fixed) except for the last layer, and then gradually unfreeze layers from the end to the beginning as the model continues to learn. This method works extremely well and the theoretical motivation is similar to CV - earlier layers learn more general patterns, which should still be applicable to a different task within the same domain, whereas later layers learn task-specific patterns or are replaced when training on a new task and so need to be retrained. However, a layer cannot be fine-tuned effectively if the layer following it has been randomly initialized and has not been trained yet, and therefore as the last layers get trained we can gradually train more and more layers until we reach the point where we can fine-tune the entire model.


* Discriminative / layer-wise learning rate - when fine-tuning a pretrained model, each layer should be trained using a different learning rate. The motivation is similar to the above - since the earlier layers learn more general (and therefore more robust) patterns, they require less adjustment in order to perform well on the task, and adjusting them too much can degrade performance. On the other hand, the later layers were initialized from scratch and are task-specific, and so they require significant adjustment in order to perform well on the task. The exact ratios (each layer's learning rate is 2.6 times smaller than that of the layer after it) were found empirically by Howard & Ruder (2018) and have been shown to work in a variety of scenarios; they were not hand-crafted for this task or this dataset.


* Subword tokenization - the process of "atomizing" a text corpus, converting it into a sequence of units, is called "tokenization", and these units are called "tokens". The basic approach to tokenization is word-based, meaning every word gets converted to a token (as do some special characters, such as punctuation and apostrophes). This approach works fairly well in English since the same word can be used in a variety of situations to mean similar things - "he is a guard", "she is a guard", "they guard" all include the word "guard", which is used in a similar way. In Hebrew, these would actually be 4 different words - there is a word for a male guard, there is a word for a female guard, and there are 2 words for the plural (male and female forms). In addition, English makes far less use of attached prefixes and suffixes than Hebrew does - for example, the English phrase "as per your request" is represented using a single word in Hebrew (with a prefix and a suffix). These examples showcase Hebrew's rich morphology, and also highlight why word-based tokenization might perform poorly in Hebrew - the model would not understand that prefixes, suffixes, and conjugation are elements of the language and would instead treat them as features of individual words. Therefore some method of subword tokenization is required. One option is to use a parser, which applies a set of hand-crafted heuristics in order to tokenize (for example, it can look for specific prefixes and suffixes and create separate tokens for them). However, SentencePiece - which is described by Kudo & Richardson (2018) - is an unsupervised, language-independent subword tokenizer that learns its subword vocabulary directly from raw text (using byte-pair encoding or a unigram language model) and was designed for neural text processing. Since I tend to trust learned transformations more than heuristics-based ones - especially in language, where exceptions and edge cases are common - and since the software is readily available, I used SentencePiece to tokenize all data used here, both the Wikipedia dataset and the sentiment analysis dataset.
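As an illustration of how a learned subword vocabulary can pick up shared stems, here is a minimal sketch of byte-pair encoding (BPE) on the four Hebrew forms of "guard" mentioned above. This is a toy reimplementation for intuition only, not SentencePiece's actual code, and the helper names (`merge_pair`, `learn_bpe`, `segment`) are hypothetical.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)

def learn_bpe(words, num_merges):
    """Repeatedly merge the most frequent adjacent symbol pair in the corpus."""
    corpus = Counter(tuple(word) for word in words)  # start from single characters
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_pair(symbols, best)] += freq
        corpus = new_corpus
    return merges

def segment(word, merges):
    """Tokenize a word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)

# Male, female, plural-male and plural-female forms of "guard".
words = ["שומר", "שומרת", "שומרים", "שומרות"]
merges = learn_bpe(words, num_merges=3)
for word in words:
    print(segment(word, merges))
```

After three merges the shared stem becomes a single token and the gender and number endings are left as separate pieces, which is the kind of structure a word-level vocabulary cannot express.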


### Dataset
We use two datasets here. The first is Wikipedia in Hebrew, on which we pretrain our language model. The second is the Hebrew sentiment analysis dataset, which was collected by Amram et al. (2018) - the full details of their collection process can be found there, including a qualitative analysis of difficult examples. The dataset is composed of 12,804 Facebook comments made by various users on President Reuven Rivlin's official Facebook page, which were manually labeled into 3 categories according to their overall sentiment - positive (8,512 samples), neutral (370 samples), and negative (3,922 samples). The dataset is provided in two forms: in one, comments are split into words as is standard in text files, and in the other, comments are split into morphemes using a Hebrew parser.
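The class counts above imply a fairly imbalanced dataset, which is worth keeping in mind when reading accuracy figures. The short sketch below computes the class shares and the accuracy of always predicting the most common class. It uses only the counts quoted above, computed over the whole dataset; the split into train and test sets is not taken into account here.

```python
# Class counts as reported in the dataset description above.
class_counts = {"positive": 8512, "neutral": 370, "negative": 3922}

total = sum(class_counts.values())
shares = {label: count / total for label, count in class_counts.items()}

# Accuracy of a trivial classifier that always predicts the largest class.
majority_label = max(class_counts, key=class_counts.get)
majority_baseline = class_counts[majority_label] / total

for label, share in shares.items():
    print(f"{label:>8}: {share:.1%}")
print(f"always predicting '{majority_label}': {majority_baseline:.1%} accuracy")
```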

### Results
In Amram et al. (2018), many models are trained and evaluated on the sentiment analysis dataset using both word-based and morpheme-based representations, including Convolutional Neural Networks (CNNs) and two kinds of Long Short-Term Memory (LSTM) networks. The best accuracy achieved in the paper was 89.2\% - equivalent to an error rate of 10.8\% - using a CNN trained on the word-based representation. As can be seen at the bottom of this notebook, I have achieved 94\% accuracy - equivalent to an error rate of 6\%, a relative error reduction of about 45\% - in a few days of experimentation and without tweaking the procedure or any of the parameters to fit this specific task or dataset. This is the current state-of-the-art result on this dataset.

### Preliminary conclusions
I tried both SentencePiece-based subword tokenization and standard word-based tokenization for both datasets, and also trained and tested on the sentiment analysis task using both the word-based and the morpheme-based data representations. In my experience, SentencePiece subword tokenization and the morpheme-based representation both helped the model perform better, which was not the case in the paper. However, I do not find this surprising, since they both help the network understand the morphology and conjugation of the language and the relationships between different parts of the sentence.

### Future work
There are several avenues for future extensions of this work (as I work towards the full Capstone project), and I will do my best to explore them all and include them in future products and assignments:
* Bi-directional model - the current language model was trained in the forward (regular) direction. A common approach to easily achieve better results is to also train a backwards language model (which starts from the last word in the sentence and keeps trying to predict the word before it) and then average the predictions of the two models, as they seem to complement each other and so their average produces better results. This is especially useful since in Hebrew (as in many other languages) adjectives follow nouns, unlike in English. For example, "the small child" would appear as "the child small" in Hebrew, and so sometimes it is easier to guess the preceding word and sometimes it is easier to guess the following word.

* More advanced models - currently using AWD-LSTM, which is essentially a regular LSTM with a lot of regularization, but Quasi-Recurrent Neural Networks (QRNN) or Transformer models may produce better results. Will go into more depth at a future point.


* Ablation study table - I would like to show the peak results of several models, both with and without subword tokenization, evaluated against both the word-based and the morpheme-based representation of the sentiment analysis dataset, in order to lend further support to the preliminary conclusions above.


* More datasets and tasks - I would like to further show the general-purpose strength of this technique by achieving state-of-the-art results on more datasets and ideally also more tasks. Two possibilities I will explore (but no promises) are named entity recognition and Bible text analysis.
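The bi-directional idea in the first bullet above amounts to averaging the class probabilities of two classifiers before picking a label. Here is a minimal sketch under that assumption; the probabilities are made up for illustration and the function names (`average_predictions`, `predict_label`) are hypothetical, not part of the fast.ai codebase.

```python
def average_predictions(forward_probs, backward_probs):
    """Average the class probabilities of the forward and backward classifiers."""
    if len(forward_probs) != len(backward_probs):
        raise ValueError("both models must predict over the same classes")
    return [(f + b) / 2 for f, b in zip(forward_probs, backward_probs)]

def predict_label(probs, labels):
    """Return the label with the highest probability."""
    best_index = max(range(len(probs)), key=probs.__getitem__)
    return labels[best_index]

labels = ["negative", "neutral", "positive"]

# Made-up probabilities for a single comment (hypothetical values).
forward_probs = [0.48, 0.07, 0.45]   # forward model narrowly prefers "negative"
backward_probs = [0.20, 0.10, 0.70]  # backward model clearly prefers "positive"

ensemble_probs = average_predictions(forward_probs, backward_probs)
print(predict_label(forward_probs, labels))
print(predict_label(ensemble_probs, labels))
```

In this made-up case the forward model alone is narrowly wrong-footed, while the averaged prediction follows the more confident backward model.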

In [1]:
#imports and GPU setup
%reload_ext autoreload
%autoreload 2
%matplotlib inline

from fastai import *
from fastai.text import *
torch.cuda.set_device(0)

In [6]:
#setting up paths and constants
data_path = Config.data_path()
lang = 'he'
name = f'{lang}wiki'
tok = 'SP'
path = data_path/name
path.mkdir(exist_ok=True, parents=True)
lm_fns = [f'{lang}_wt_{tok}', f'{lang}_wt_vocab_{tok}']

## Hebrew wikipedia model

### Download data from Wikipedia

In [7]:
from nlputils import split_wiki,get_wiki

In [8]:
get_wiki(path,lang)

/home/oren/.fastai/data/hewiki/hewiki already exists; not downloading


In [10]:
!head -n4 {path}/{name}

<doc id="7" url="https://he.wikipedia.org/wiki?curid=7" title="מתמטיקה">
מתמטיקה

מָתֵמָטִיקָה היא תחום דעת העוסק במושגים כגון כמות, מבנה, מרחב ושינוי. המתמטיקאים מחפשים דפוסים ותבניות משותפות במספרים, במרחב, במדע ובהפשטות דמיוניות.


This function splits the single wikipedia file into a separate file per article. This is often easier to work with.

In [11]:
dest = split_wiki(path,lang)

/home/oren/.fastai/data/hewiki/docs already exists; not splitting


In [12]:
dest.ls()[:5]

[PosixPath('/home/oren/.fastai/data/hewiki/docs/ארנסט רייר.txt'),
 PosixPath('/home/oren/.fastai/data/hewiki/docs/הקרב במעבר גלורייטה.txt'),
 PosixPath('/home/oren/.fastai/data/hewiki/docs/כרמל שאמה הכהן.txt'),
 PosixPath('/home/oren/.fastai/data/hewiki/docs/נייתן אדריאן.txt'),
 PosixPath('/home/oren/.fastai/data/hewiki/docs/הר לא זז.txt')]
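For readers unfamiliar with the dump format, the splitting step can be sketched roughly as follows. This is a simplified, in-memory illustration and not the actual `split_wiki` implementation: it assumes each article starts with a `<doc ... title="...">` line like the one shown earlier and ends with a `</doc>` line, and the sample text and the name `split_articles` are made up.

```python
import re

# Matches the opening line of an article and captures its title.
DOC_START = re.compile(r'<doc id="\d+" url="[^"]*" title="([^"]+)">')

def split_articles(dump_text):
    """Return a {title: article text} dict from an extracted Wikipedia dump."""
    articles = {}
    title, body = None, []
    for line in dump_text.splitlines():
        match = DOC_START.match(line)
        if match:
            # A new article begins: remember its title and reset the body.
            title, body = match.group(1), []
        elif line.startswith("</doc>"):
            # Assumed end-of-article marker: store what was collected.
            if title is not None:
                articles[title] = "\n".join(body).strip()
            title = None
        elif title is not None:
            body.append(line)
    return articles

# Made-up sample in the assumed format (second article is hypothetical).
sample_dump = '''<doc id="7" url="https://he.wikipedia.org/wiki?curid=7" title="מתמטיקה">
מתמטיקה

מתמטיקה היא תחום דעת.
</doc>
<doc id="8" url="https://he.wikipedia.org/wiki?curid=8" title="Example">
Example

Second article body.
</doc>'''

articles = split_articles(sample_dump)
print(list(articles))
```

Writing each entry to its own file would additionally require sanitizing titles that contain characters such as `/`.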

### Create language model & train on Wikipedia

In [None]:
#set up Wikipedia dataset, prepare for training
bs=64
data = (TextList.from_folder(dest, processor=[OpenFileProcessor(),
                                              SPProcessor()])
        .split_by_rand_pct(0.1, seed=42)
        .label_for_lm()
        .databunch(bs=bs, num_workers=1))

data.save(path/f'{lang}_databunch_{tok}')
len(data.vocab.itos),len(data.train_ds)

In [None]:
data = load_data(path, f'{lang}_databunch_{tok}', bs=bs)

In [None]:
#build the language model
learn = language_model_learner(data, AWD_LSTM, drop_mult=0.1, wd=0.1,
                               pretrained=False).to_fp16()

In [None]:
#set up learning rate, scale by batch size
lr = 3e-3
lr *= bs/48

In [None]:
#train the model
learn.unfreeze()
learn.fit_one_cycle(10, lr, moms=(0.8,0.7))

Save the pretrained model and vocab:

In [None]:
mdl_path = path/'models'
mdl_path.mkdir(exist_ok=True)
learn.to_fp32().save(mdl_path/lm_fns[0], with_opt=False)
learn.data.vocab.save(mdl_path/(lm_fns[1] + '.pkl'))

## Hebrew sentiment analysis

### Language model

In [None]:
#load train set & see some examples
train_df = pd.read_csv('../morph_train.tsv', sep='\t',
                       header=None, names=['comment', 'label'])
train_df.head()

In [None]:
#load test set & see some examples
test_df = pd.read_csv('../morph_test.tsv', sep='\t',
                      header=None, names=['comment', 'label'])
test_df.head()

In [None]:
#combine train & test set for language model fine-tuning
df = pd.concat([train_df,test_df], sort=False)

In [None]:
#prepare the combined dataset
data_lm = (TextList.from_df(df, path, cols='comment',
                            processor=SPProcessor.load(dest))
    .split_by_rand_pct(0.1, seed=42)
    .label_for_lm()           
    .databunch(bs=bs, num_workers=1))

data_lm.save(path/f'{lang}_clas_databunch_{tok}_morph')

In [None]:
data_lm = load_data(path, f'{lang}_clas_databunch_{tok}_morph', bs=bs)

In [None]:
#load the pretrained language model
learn_lm = language_model_learner(data_lm, AWD_LSTM,
                                  pretrained_fnames=lm_fns, drop_mult=1.0, wd=0.1)

In [None]:
#set up learning rate, scale by batch size
lr = 1e-3
lr *= bs/48

In [None]:
#train only the last layer - rest are frozen by default
learn_lm.fit_one_cycle(1, lr*10, moms=(0.8,0.7))

In [None]:
#now train the entire model for longer
learn_lm.unfreeze()
learn_lm.fit_one_cycle(5, slice(lr/10,lr*10), moms=(0.8,0.7))

In [None]:
#save the final language model
learn_lm.save(f'{lang}fine_tuned_{tok}_morph')
learn_lm.save_encoder(f'{lang}fine_tuned_enc_{tok}_morph')

### Classifier - only using the training set (and splitting it into a validation set)

In [None]:
#prepare data for training - notice we only use the training data here
bs = 48
data_clas = (TextList.from_df(train_df, path,
                              vocab=data_lm.vocab, cols='comment', 
                              processor=SPProcessor.load(dest))
    .split_by_rand_pct(0.1, seed=42)
    .label_from_df(cols='label')
    .databunch(bs=bs, num_workers=1))

data_clas.save(f'{lang}_textlist_class_{tok}_morph')

In [None]:
data_clas = load_data(path, f'{lang}_textlist_class_{tok}_morph',
                      bs=bs, num_workers=1)

In [None]:
#prepare the classification model
drop = 0.5
learn_c = text_classifier_learner(data_clas, AWD_LSTM,
                                  drop_mult=drop, metrics=[accuracy]).to_fp16()
learn_c.load_encoder(f'{lang}fine_tuned_enc_{tok}_morph')
learn_c.freeze()

In [None]:
#setup learning rate, scale by batch size
lr=2e-2
lr *= bs/48

In [None]:
#train only last layer
learn_c.fit_one_cycle(2, lr, moms=(0.8,0.7))

In [None]:
#train only last layer some more
learn_c.fit_one_cycle(2, lr, moms=(0.8,0.7))

In [None]:
#train only last 2 layers
learn_c.freeze_to(-2)
learn_c.fit_one_cycle(2, slice(lr/(2.6**4),lr), moms=(0.8,0.7))

In [None]:
#train only last 3 layers
learn_c.freeze_to(-3)
learn_c.fit_one_cycle(2, slice(lr/2/(2.6**4),lr/2), moms=(0.8,0.7))

In [None]:
#train entire model
learn_c.unfreeze()
learn_c.fit_one_cycle(4, slice(lr/10/(2.6**4),lr/10), moms=(0.8,0.7))
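The four training stages above combine gradual unfreezing with discriminative learning rates. The sketch below spells out which layer groups are trainable at each stage and what learning rate each group would receive. It is a plain-Python illustration under two assumptions that should be checked against the installed fastai version: that the classifier is split into 5 layer groups, and that `slice(lo, hi)` is spread geometrically across the groups. The helper names `even_mults` and `trainable_groups` are written here for illustration.

```python
def even_mults(start, stop, n):
    """n learning rates spaced geometrically from start to stop (inclusive)."""
    step = (stop / start) ** (1 / (n - 1))
    return [start * step ** i for i in range(n)]

def trainable_groups(n_groups, freeze_to):
    """Groups before index `freeze_to` are frozen; negative values count from the end."""
    cut = freeze_to % n_groups if freeze_to < 0 else freeze_to
    return [i >= cut for i in range(n_groups)]

N_GROUPS = 5   # assumption: embedding + 3 recurrent layers + classifier head
lr = 2e-2      # base learning rate used in the cells above (bs = 48)

# (description, freeze_to argument, lowest rate, highest rate)
stages = [
    ("last group",    -1, lr,                 lr),
    ("last 2 groups", -2, lr / 2.6**4,        lr),
    ("last 3 groups", -3, lr / 2 / 2.6**4,    lr / 2),
    ("all groups",     0, lr / 10 / 2.6**4,   lr / 10),
]

for name, freeze_to, lo, hi in stages:
    rates = even_mults(lo, hi, N_GROUPS)
    mask = trainable_groups(N_GROUPS, freeze_to)
    shown = [f"{r:.1e}" if t else "frozen" for r, t in zip(rates, mask)]
    print(f"{name:>13}: {shown}")
```

With 5 groups, dividing the top rate by `2.6**4` makes each group's rate 2.6 times smaller than that of the group after it, which matches the ratio described in the techniques section.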

In [None]:
#save classification model
learn_c.save(f'{lang}clas_{tok}_morph')

### Fine-tuning & classification using full train set

In [13]:
#load & combine train and test set, set up is_valid column to separate train and test later
train_df = pd.read_csv('../morph_train.tsv', sep='\t',
                       header=None, names=['comment', 'label'])
train_df['is_valid'] = False
test_df = pd.read_csv('../morph_test.tsv', sep='\t',
                      header=None, names=['comment', 'label'])
test_df['is_valid'] = True

df = pd.concat([train_df,test_df], sort=False)

In [14]:
#prepare data
data_lm = (TextList.from_df(df, path, cols='comment',
                            processor=SPProcessor.load(dest))
    .split_none()
    .label_for_lm()           
    .databunch(bs=bs, num_workers=1))

data_lm.save(path/f'{lang}_clas_databunch_{tok}_morph_traintest')

In [15]:
data_lm = load_data(path, f'{lang}_clas_databunch_{tok}_morph_traintest',
                    bs=bs)

In [16]:
#prepare language model for fine-tuning
learn_lm = language_model_learner(data_lm, AWD_LSTM,
                                  pretrained_fnames=lm_fns, drop_mult=1.0, wd=0.1)

In [17]:
#setup learning rate, scale by batch size
lr = 1e-3
lr *= bs/48

In [None]:
#train last layer
learn_lm.fit_one_cycle(2, lr*10, moms=(0.8,0.7))

In [None]:
#train entire model
learn_lm.unfreeze()
learn_lm.fit_one_cycle(8, lr, moms=(0.8,0.7))

In [None]:
#save the fine-tuned language model
learn_lm.save(f'{lang}fine_tuned_{tok}_morph_traintest')
learn_lm.save_encoder(f'{lang}fine_tuned_enc_{tok}_morph_traintest')

In [None]:
#prepare the full training set for training,
#and the full test set for evaluating results
data_clas = (TextList.from_df(df, path, vocab=data_lm.vocab,
                              cols='comment', processor=SPProcessor.load(dest))
    .split_from_df()
    .label_from_df(cols='label')
    .databunch(bs=bs, num_workers=1))

data_clas.save(f'{lang}_textlist_class_{tok}_morph_traintest')

In [18]:
data_clas = load_data(path, f'{lang}_textlist_class_{tok}_morph_traintest',
                      bs=bs, num_workers=1)

In [33]:
#set up classification model
drop = 0.7
wd = 0.1
learn_c = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=drop,
                                  wd=wd, metrics=[accuracy]).to_fp16()
learn_c.load_encoder(f'{lang}fine_tuned_enc_{tok}_morph_traintest')
learn_c.freeze()

In [34]:
#setup learning rate, scale by batch size
lr = 2e-2
lr *= bs/48

In [35]:
#train last layer
learn_c.fit_one_cycle(2, lr, moms=(0.8,0.7))

epoch,train_loss,valid_loss,accuracy,time
0,0.491691,0.362868,0.858317,00:12
1,0.446336,0.332952,0.869667,00:11


In [36]:
#train last layer some more
learn_c.fit_one_cycle(2, lr, moms=(0.8,0.7))

epoch,train_loss,valid_loss,accuracy,time
0,0.475503,0.348391,0.863014,00:10
1,0.459252,0.341473,0.87045,00:10


In [37]:
#train last two layers
learn_c.freeze_to(-2)
learn_c.fit_one_cycle(2, slice(lr/(2.6**4),lr), moms=(0.8,0.7))

epoch,train_loss,valid_loss,accuracy,time
0,0.505037,0.349256,0.854403,00:13
1,0.377914,0.300704,0.882583,00:12


In [38]:
#train last three layers
learn_c.freeze_to(-3)
learn_c.fit_one_cycle(2, slice(lr/2/(2.6**4),lr/2), moms=(0.8,0.7))

epoch,train_loss,valid_loss,accuracy,time
0,0.384048,0.329825,0.878669,00:24
1,0.263334,0.279749,0.900587,00:22


In [39]:
#train full model - achieving 94% accuracy on the test set
learn_c.unfreeze()
learn_c.fit_one_cycle(6, slice(lr/10/(2.6**4),lr/10), moms=(0.8,0.7))

epoch,train_loss,valid_loss,accuracy,time
0,0.217789,0.271436,0.903718,00:33
1,0.182591,0.241869,0.921331,00:30
2,0.177956,0.236046,0.93229,00:31
3,0.121006,0.235444,0.933855,00:34
4,0.093702,0.232601,0.937378,00:32
5,0.082043,0.235795,0.940509,00:34
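The relative error reduction quoted in the Results section can be checked directly from the final-epoch accuracy above and the 89.2% best accuracy reported by Amram et al. (2018). The variable names below are chosen for illustration.

```python
baseline_accuracy = 0.892    # best accuracy reported by Amram et al. (2018)
final_accuracy = 0.940509    # accuracy after the last epoch above

baseline_error = 1 - baseline_accuracy
final_error = 1 - final_accuracy

# Share of the baseline's errors that the new model eliminates.
relative_reduction = (baseline_error - final_error) / baseline_error

print(f"baseline error: {baseline_error:.1%}")
print(f"final error:    {final_error:.1%}")
print(f"relative error reduction: {relative_reduction:.1%}")
```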


In [None]:
#save results - anyone can now use this model!
learn_c.save(f'{lang}clas_{tok}_morph_traintest')

## Bibliography

* Amram, A., Ben-David, A., & Tsarfaty, R. (2018). Representations and architectures in neural sentiment analysis for morphologically rich languages: A case study from Modern Hebrew. In Proceedings of the 27th International Conference on Computational Linguistics (COLING 2018), Santa Fe, NM (pp. 2242-2252).
* Howard, J., & Ruder, S. (2018). Universal language model fine-tuning for text classification. arXiv preprint arXiv:1801.06146.
* Sharif Razavian, A., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). CNN features off-the-shelf: an astounding baseline for recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops (pp. 806-813).
* Kudo, T., & Richardson, J. (2018). Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. arXiv preprint arXiv:1808.06226.
* Yosinski, J., Clune, J., Bengio, Y., & Lipson, H. (2014). How transferable are features in deep neural networks? In Advances in Neural Information Processing Systems 27 (NIPS 2014).
* Zeiler, M. D., & Fergus, R. (2014, September). Visualizing and understanding convolutional networks. In European conference on computer vision (pp. 818-833). Springer, Cham.
