## **1 Data Preparation (12 points)**

We will use an English corpus that you already know from the assignments (*Alice in Wonderland*), and a Bengali corpus that is decidedly different in both context and language structure. You can find the corpora in `data/bengali_corpus.txt` and `data/alice_in_wonderland.txt`.

1. Preprocess both corpora such that they can serve as the input to sentencepiece. (8 points)

2. Split the preprocessed corpus into a train and a test set. The test set should comprise 20% of the corpus. Write the two sets to files `train.txt` and `test.txt`. (4 points)
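A minimal sketch of both steps, assuming that one line of the raw file is treated as one unit and that whitespace normalisation is all the preprocessing needed. The helper names `preprocess`, `train_test_split` and `write_lines` are illustrative, not part of any library, and the shuffled split is only one possible choice; your own preprocessing will likely need more (for example sentence splitting or removing headers).

```python
import random
import re
from pathlib import Path


def preprocess(text):
    """Return the non-empty lines of `text` with runs of whitespace collapsed."""
    lines = []
    for raw in text.splitlines():
        line = re.sub(r"\s+", " ", raw).strip()
        if line:
            lines.append(line)
    return lines


def train_test_split(lines, test_ratio=0.2, seed=1):
    """Shuffle a copy of `lines` and hold out `test_ratio` of them as the test set."""
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_test = round(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]


def write_lines(path, lines):
    """Write one item per line, UTF-8 encoded (needed for the Bengali corpus)."""
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")
```

To use it, read a corpus with `Path(...).read_text(encoding="utf-8")`, pass the text through `preprocess`, split the result with `train_test_split`, and write both parts with `write_lines("train.txt", train)` and `write_lines("test.txt", test)`.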

In [None]:
!pwd
!ls 

/home/snlp-project-21
Desktop    Downloads   Music	 Public  sentencepiece	Videos
Documents  miniconda3  Pictures  rnnlm	 Templates


You can also `cd` into another directory:

In [None]:
%cd project

Alternatively, you can also check the Python API for sentencepiece model training and segmentation and utilise these in your code files.

### 2.3 Task

You are asked to create data for a language model based on different subword granularity, namely:

1. Characters. This can be done manually but also by running BPE with the output vocabulary size being the same as the input alphabet size. (4 points)
2. Subword Units: smaller vocabulary, closer to characters. The vocabulary size is usually in the range of 100 to 800 for English. (6 points)
3. Subword Units: larger vocabulary, closer to words. The vocabulary size is usually in the range 1500 to 3000 for English. (6 points)

In 2 and 3, try to experiment with multiple values and pick one to get the best performance. 

You should run this on both languages (the train part of the given data), resulting in files: `en_s1.txt`, `en_s2.txt`, `en_s3.txt`, `bn_s1.txt`, `bn_s2.txt` and `bn_s3.txt`.
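The three granularities can be sketched as follows. `to_characters` does the character-level segmentation manually and marks word boundaries with `▁` (U+2581), the same convention SentencePiece uses; `to_words` inverts it. `train_and_segment` is a hypothetical helper around the SentencePiece Python API and assumes the `sentencepiece` package is installed; the vocabulary size you pass in is yours to choose. Note that the manual version splits on Unicode code points, so Bengali combining signs end up as separate symbols.

```python
def to_characters(line, space_symbol="\u2581"):
    """Segment a line into space-separated characters, marking word boundaries."""
    marked = space_symbol + line.strip().replace(" ", space_symbol)
    return " ".join(marked)


def to_words(pieces_line, space_symbol="\u2581"):
    """Invert a segmentation: join the pieces, then turn boundary markers into spaces."""
    return pieces_line.replace(" ", "").replace(space_symbol, " ").strip()


def train_and_segment(train_path, out_path, model_prefix, vocab_size):
    """Train a BPE model on `train_path` and write the segmented text to `out_path`."""
    import sentencepiece as spm  # imported lazily; assumes `pip install sentencepiece`

    spm.SentencePieceTrainer.train(
        input=train_path,
        model_prefix=model_prefix,
        vocab_size=vocab_size,
        model_type="bpe",
        character_coverage=1.0,  # keep every character, also for Bengali
    )
    sp = spm.SentencePieceProcessor(model_file=model_prefix + ".model")
    with open(train_path, encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            pieces = sp.encode(line.strip(), out_type=str)
            dst.write(" ".join(pieces) + "\n")
```

For example, `to_characters("the cat")` gives `▁ t h e ▁ c a t`, and `to_words` maps any `▁`-marked segmentation back to plain words, which is also what you need later to read generated text at word level.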

Take a look at these files and comment briefly on what you observe in terms of word segmentation.

## **3 LM Training (20 points)**

1. Now, train 3 language models based on the corpora you created in 2.3. We will do this using the RNNLM toolkit. The RNN model is trained on the subword units you have created using SentencePiece. As with all neural models, performance and computation time depend on hyperparameters such as the size of the hidden layer (`-hidden`) and the number of backpropagation-through-time steps (`-bptt`). The class size (`-class`) is used to implement a class-based language model. <br/>  (8 points)

  You can use rnnlm to train a language model with the following command:
  ```shell
  /home/snlp-project-21/rnnlm/rnnlm \
    -train /path/to/train.txt \
    -valid /path/to/test.txt \
    -rnnlm model \
    -hidden 40 \
    -rand-seed 1 \
    -debug 2 \
    -bptt 3 \
    -class 9999
  ```
  Remember that you have to prefix it with `!` to run it in the Notebook. 

  The `model` files will be stored in the same directory as the script. Consider creating a directory to store the models:
  ```shell
  !rm -rf models/rnnlm \
    && mkdir -p models/rnnlm \
    && cd models/rnnlm \
    && # call of rnnlm goes here
    ...
  ```

2. After training, the rnnlm toolkit outputs the perplexity of the trained model. Play around with the hyperparameters of rnnlm and report a perplexity that is below the baseline from **3.1**. Use these hyperparameters to train the models you will use in **4.** and **5.**  (12 points)
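As a reminder of what the reported number means, here is a small sketch of the generic perplexity definition. It is not rnnlm's own code, and the function names are illustrative. The second function normalises by the number of words instead of the number of subword tokens, which is one way (an assumption on our side, not a requirement of the task) to compare models trained on different segmentations of the same text.

```python
import math


def perplexity(log_probs):
    """Perplexity from the natural-log probability of each token: exp(-mean log p)."""
    return math.exp(-sum(log_probs) / len(log_probs))


def per_word_perplexity(log_probs, n_words):
    """Same total log-probability, but normalised by the number of words."""
    return math.exp(-sum(log_probs) / n_words)
```

A model that assigns probability 0.25 to every token has a perplexity of 4, no matter how long the text is.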

## **4 Text Generation (16 points)**

After training our language models, we are now ready to create some artificial data! Take a look at the [basic examples](http://www.fit.vutbr.cz/~imikolov/rnnlm/simple-examples.tgz) on the rnnlm site, in particular the script `~/simple-examples/4-data-generation/test.sh`, to find out how to do that (**hint:** it's only a tiny flag!).

1. For every language model trained in 3, use rnnlm to generate $k = 10^1, 10^2, 10^3, 10^4, 10^5, 10^6, 10^7$ output tokens. This means that you have to run rnnlm 7 times and change the parameter of the flag in every run. (8 points)

  You can use a shell loop to achieve this:
  ```shell
  for i in 1 2 3 4 5 6 7; do
    echo "$i"
    # do something
  done
  ```
  See [here](https://linuxize.com/post/bash-for-loop/) for a detailed introduction. Alternatively, you can implement this in the Python file.

2. Save the generated data into text files. You may name them `10.txt`, `100.txt` and so on. Make sure they are saved to different directories for every model. (2 points)
  You can redirect the output of a shell command with `>`:
  ```shell
  echo 'I am a sample text' > test.txt
  ```

Note that `>` replaces the content of a file. If you want to append to the end of a file instead, use `>>`.

3. Inspect `100.txt` for every model. Do you see a difference in the quality of the generated data? Why could that be? <br/>
(Note: Your generated data will be in the form of subwords. You have to decode this back to word level to compare) (6 points)
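The loop over $k$ and the output redirection from items 1 and 2 can also be written in Python. The sketch below assumes that `base_cmd` is a list holding your complete rnnlm call and ending with the generation flag from the examples, so that `k` becomes the argument of that flag. `generate_corpora` is an illustrative name, not part of rnnlm.

```python
import subprocess
from pathlib import Path


def generate_corpora(base_cmd, out_dir, exponents=range(1, 8)):
    """Run `base_cmd + [str(k)]` for k = 10^e and save stdout to <out_dir>/<k>.txt."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for e in exponents:
        k = 10 ** e
        path = out_dir / f"{k}.txt"
        # Opening with "w" overwrites an existing file, like `>` in the shell.
        with open(path, "w", encoding="utf-8") as f:
            subprocess.run(base_cmd + [str(k)], stdout=f, check=True)
        paths.append(path)
    return paths
```

Calling it once per model with a different `out_dir` keeps the generated files of the models apart.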

## **5 OOV comparison (16 points)**

1. Using the original corpora generated in 1., find the train and test vocabulary and determine the OOV rate at word level. Do this by decoding the RNN output, adding all the generated words to your vocabulary and measuring the OOV rate. (2 points)

2. Use the generated corpora from **4.** to augment the train vocabulary. Do this $k$ times, i.e. for generated corpora of size $10^1, 10^2, \ldots, 10^7$. For each model and each $k$, calculate the OOV rate of the augmented train set against the test set. (8 points)

3. For each model, plot OOV rates. What do you observe? Which of the models would you use in a practical application? (6 points)
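The OOV computation can be sketched as below, assuming words are separated by whitespace and that the rate is measured over test tokens (counting test types instead is an equally valid convention, so state which one you use). The function names are illustrative.

```python
def vocabulary(lines):
    """Set of whitespace-separated word types in an iterable of lines."""
    return {word for line in lines for word in line.split()}


def oov_rate(train_vocab, test_lines):
    """Fraction of test word tokens that do not occur in the train vocabulary."""
    tokens = [word for line in test_lines for word in line.split()]
    if not tokens:
        return 0.0
    return sum(word not in train_vocab for word in tokens) / len(tokens)
```

Augmenting the train vocabulary is then a set union, for example `vocabulary(train_lines) | vocabulary(generated_lines)`, where the generated lines have first been decoded back to word level.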

## **6 Analysis (20 points)**

Write a succinct summary of your observations for all the tasks, what you aimed to achieve, and whether your expectations were fulfilled. What are your takeaways from this project? How do your results differ for English and Bengali? What hyperparameters do you use to optimise the OOV rates? Are there any ways you could improve your results?

For this section, we will also consider the overall style and how well-written the report is. You should write the summary in the same final Notebook you submit. It should be roughly 500-800 words.

# 4) Grading

The project comprises 25% of the final grade. The grading for this project is distributed as follows:

- Data preparation (12 points)
- Subword units (16 points)
- LM training (20 points)
- Text generation (16 points)
- OOV comparison (16 points)
- Analysis (20 points)