# LT2222 V21 Assignment 3 -- Predict Swedish Vowels

## Introduction

The written Swedish language has the following inventory of lowercase vowel characters: a, e, i, o, u, y, å, ä, ö, é.  The Evil Vowel Fairy is threatening to magically steal the vowels from Swedish texts, replacing them with blank symbols.  Before the Fairy does that, your mission is to create a system that automatically puts the vowels back, rendering the evil plan fruitless.  The most important texts that the Evil Vowel Fairy is targeting are a group of newspaper articles written in the 19th century and currently hosted by Språkbanken, because the Evil Vowel Fairy has some deranged plan involving the 19th century.

(These newspaper articles have a bunch of other, now-archaic or foreign vowels that the Fairy is not interested in and you will ignore.)

There is a secret agent who has helped by writing some scripts to train a vowel prediction model.  The agent has written the scripts a little cryptically to make things hard for the Fairy, who doesn't really understand computers, but it never hurts to make sure.

Every part of this assignment that involves Python scripting needs to be done on the bash command line on mltgpu or eduserv.  Include your name in README.md.

This assignment is due Monday, March 29, 2021 at 9:00.  There are 31 points on this assignment, plus opportunity for 22 bonus points.

## Preparation

Fork and clone the GitHub repository: https://github.com/asayeed/lt2222-v21-a3

There will be three files, train.py, model.py, and README.md.  You will write your responses to whatever needs text responses plus other comments and instructions in README.md.

The texts are available at

* /home/xsayas@GU.GU.SE/scratch/lt2222-v21-resources/svtrain.lower.txt -- training
* /home/xsayas@GU.GU.SE/scratch/lt2222-v21-resources/svtest.lower.txt -- test/evaluation

## Part 1: Figure out train.py (8 points)

train.py is already complete, and you will not modify it. Instead, in README.md, you will explain what the functions a, b, and g do, as well as the meaning of the command-line arguments that are being processed via the argparse module.

You will then run train.py on the training file.  train.py will save a model.


## Part 2: Write eval.py (15 points)

Write eval.py and add it to the repository.  What eval.py will do from the command line:

* Load a model produced by train.py. (Take a look at model.py.)
* Load the test data.
* Create evaluation instances compatible with the training instances.  (A simplifying assumption for the purposes of the assignment: assume that the neighbouring vowels are known, as though the Fairy hadn't stolen them.)
* Use the model to predict instances.
* Write the text with the predicted (as opposed to the real) vowels back into an output file.
* Print the accuracy of the model to the terminal.
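
The last two steps above can be sketched as follows. This is a minimal illustration, not the required implementation: the function names `restore_vowels` and `accuracy` are hypothetical, and the sketch assumes the model's predictions are already available as one index into the sorted vowel list per vowel position, in text order.

```python
import numpy as np

# Same vowel inventory as in the introduction, sorted so indices are stable.
VOWELS = sorted(['y', 'é', 'ö', 'a', 'i', 'å', 'u', 'ä', 'e', 'o'])


def restore_vowels(chars, predicted):
    """Return the text with every vowel replaced by the model's prediction.

    chars: list of single characters from the test text (no <s>/<e> padding).
    predicted: one index into VOWELS per vowel position, in text order.
    """
    preds = iter(predicted)
    out = [VOWELS[next(preds)] if c in VOWELS else c for c in chars]
    return "".join(out)


def accuracy(predicted, gold):
    """Fraction of vowel positions where the prediction equals the gold index."""
    predicted = np.asarray(predicted)
    gold = np.asarray(gold)
    return float((predicted == gold).mean())


# Toy example: the text contains the vowels "e" and "å".
chars = list("hej då")
gold = [VOWELS.index(c) for c in chars if c in VOWELS]
predicted = [VOWELS.index("e"), VOWELS.index("a")]  # second vowel is wrong
restored = restore_vowels(chars, predicted)
acc = accuracy(predicted, gold)
```

Here `restored` is the string to write to the output file and `acc` is the value to print to the terminal.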

import os

os.environ["CUDA_VISIBLE_DEVICES"]=""
os.environ["USE_CPU"]="1"

import sys
import argparse
import numpy as np
import pandas as pd
from model import train
import torch

vowels = sorted(['y', 'é', 'ö', 'a', 'i', 'å', 'u', 'ä', 'e', 'o'])

# Takes a file path f and builds a list mm in which each item is one character of the text,
# padded with two start tags <s> at the front and two end tags <e> at the back.
# Returns a tuple of mm and a list of the unique items in mm
# (built from a set, so the order of that list can differ between runs).
def a(f):
    mm = []
    with open(f, "r") as q:
        for l in q:
            mm += [c for c in l]

    mm = ["<s>", "<s>"] + mm + ["<e>", "<e>"]
    return mm, list(set(mm))

# Takes a character x and a list of possible characters p.
# Returns a one-hot vector of length len(p) with a 1 at the index of x in p.
def g(x, p):
    z = np.zeros(len(p))
    z[p.index(x)] = 1
    return z

# Runs on the output of function a().
# u is the list of all characters in the text; p is the list of unique characters.
# For every vowel in u, appends the vowel's index in vowels to the list gt,
# then one-hot encodes the two characters/tags on either side of the vowel with g()
# and appends the concatenated encoding to the list gr.
def b(u, p):
    gt = []
    gr = []
    for v in range(len(u) - 4): 
        if u[v+2] not in vowels:
            continue
        
        h2 = vowels.index(u[v+2])
        gt.append(h2)
        r = np.concatenate([g(x, p) for x in [u[v], u[v+1], u[v+3], u[v+4]]])
        gr.append(r)

    # Returns NumPy arrays: gr holds the one-hot encoded context of each vowel,
    # gt holds each vowel as an index into vowels.
    return np.array(gr), np.array(gt)
        

path = './outfile'
model = torch.load(path)
model.eval()


## Part 3: Analysis (8 points)
                                                                      
Describe what you do in README.md.  Train and evaluate the following models:

* Five different variations of the --k option, holding the --r option at its default.
* Five different variations of the --r option, holding the --k option at its default.
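
The two sweeps above could be enumerated with a small helper like the one below. This is only a sketch: `flag_sets` is a hypothetical name, and the values in `K_VALUES` and `R_VALUES` are placeholders to be replaced with values that make sense once you know what `--k` and `--r` mean in train.py. Leaving a flag out lets train.py fall back to its own default, so the defaults never need to be hard-coded.

```python
def flag_sets(k_values, r_values):
    """Return one list of command-line flags per training run.

    The first len(k_values) runs vary --k and omit --r (default --r);
    the remaining runs vary --r and omit --k (default --k).
    """
    runs = [["--k", str(k)] for k in k_values]
    runs += [["--r", str(r)] for r in r_values]
    return runs


# Placeholder values, purely illustrative.
K_VALUES = [1, 2, 3, 4, 5]
R_VALUES = [1, 2, 3, 4, 5]

runs = flag_sets(K_VALUES, R_VALUES)
for flags in runs:
    print(" ".join(flags))
```

Each printed line is the part of the command that changes between runs; the remaining arguments to train.py stay the same throughout.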

Include the best model and its output text in your repository, along with its parameters.  Describe any patterns you see, if there are any.  Look at the output texts and make qualitative comments on the performance of the models.

It is very likely that in this very simple model, for this amount of data, nothing will work very well.  Nevertheless, do your best to draw whatever tentative conclusions you can.


## Bonus Part A: Perplexity (4 points)
                                                                      
Add an option to the eval.py script to compute the perplexity of the model.  Document it in README.md and include perplexity values for the experiments in Part 3.
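
As a reference point, perplexity over $N$ evaluation instances is commonly defined as $\exp\left(-\frac{1}{N}\sum_{i=1}^{N} \log p(y_i \mid x_i)\right)$, where $y_i$ is the correct vowel for instance $i$. The sketch below computes this with NumPy. It assumes the model's output has already been converted to natural-log probabilities, one row per instance and one column per vowel; if the model in model.py returns something else, normalise it first. The function name `perplexity` is hypothetical.

```python
import numpy as np


def perplexity(log_probs, gold):
    """exp of the mean negative log-probability assigned to the gold vowel.

    log_probs: array of shape (n_instances, n_vowels), natural-log probabilities.
    gold: array of shape (n_instances,), index of the correct vowel per instance.
    """
    log_probs = np.asarray(log_probs, dtype=float)
    gold = np.asarray(gold)
    # Pick out the log-probability of the correct vowel in each row.
    picked = log_probs[np.arange(len(gold)), gold]
    return float(np.exp(-picked.mean()))


# Sanity check: a model that spreads probability evenly over 10 vowels
# is exactly as uncertain as a fair 10-way choice.
uniform = np.log(np.full((4, 10), 0.1))
ppl = perplexity(uniform, [0, 3, 5, 9])
```

A uniform model over the ten vowels therefore has a perplexity of about 10, which is a useful baseline when reading the values from your experiments.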


## Bonus Part B: Sequence (15 points)
                                                                      
Include new versions of train.py, model.py, and eval.py that do not include the assumption that neighbouring vowels are known, but rather work in sequence so that the model-predicted previous vowels are known, but future vowels are not.  Systematically evaluate accuracy and describe in README.md.
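
One possible shape for the sequential decoding loop is sketched below. It is an illustration under stated assumptions, not a prescribed design: `decode_left_to_right`, the `predict` callback, and the `BLANK` placeholder are all hypothetical names, and `predict` stands in for whatever wraps the trained model.

```python
VOWELS = sorted(['y', 'é', 'ö', 'a', 'i', 'å', 'u', 'ä', 'e', 'o'])
BLANK = "_"  # hypothetical placeholder for a vowel that is not yet known


def decode_left_to_right(chars, predict):
    """Fill in vowels one at a time, from left to right.

    chars: list of characters padded with two <s> and two <e> tags. The true
           vowels are used only to locate vowel positions; their identity is
           never passed to predict.
    predict: callable taking the four context symbols and returning a vowel.
    """
    # Hide every vowel first, as the Fairy would.
    seen = [BLANK if c in VOWELS else c for c in chars]
    for i in range(2, len(chars) - 2):
        if chars[i] not in VOWELS:
            continue
        context = [seen[i - 2], seen[i - 1], seen[i + 1], seen[i + 2]]
        seen[i] = predict(context)  # later positions now see this prediction
    return seen


# Toy run with a stand-in predictor that records what it is shown.
calls = []


def always_e(context):
    calls.append(context)
    return "e"


chars = ["<s>", "<s>"] + list("oas") + ["<e>", "<e>"]
decoded = decode_left_to_right(chars, always_e)
```

In the toy run, the first vowel sees a blank to its right, while the second vowel sees the model's own earlier prediction to its left. For this to work with one-hot encoding, the blank symbol would need to be part of the character vocabulary, and the training instances would need to be built with the same masking.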


## Bonus Part C: Dropout (3 points)
                                                                      
Make a new version of model.py (and corresponding train.py and eval.py as necessary) that includes dropout in the model.  Systematically evaluate accuracy and describe in README.md.
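
To make the idea concrete, the sketch below shows what dropout does to a layer's activations, written in plain NumPy so it runs without the model. It is not how model.py is structured; the function name `dropout` and its parameters are hypothetical. In PyTorch the equivalent layer is `torch.nn.Dropout`, which is active during training and switched off by `model.eval()`.

```python
import numpy as np


def dropout(x, p, rng, training=True):
    """Inverted dropout: zero each unit with probability p, rescale the rest.

    Rescaling by 1 / (1 - p) keeps the expected activation unchanged, so
    nothing needs to be adjusted at evaluation time.
    """
    if not training or p == 0.0:
        return x
    mask = rng.random(x.shape) >= p  # True for the units that are kept
    return x * mask / (1.0 - p)


rng = np.random.default_rng(0)
hidden = np.ones((2, 1000))           # stand-in for hidden-layer activations
dropped = dropout(hidden, 0.5, rng)   # training: about half the units zeroed
kept_at_eval = dropout(hidden, 0.5, rng, training=False)  # evaluation: no-op
```

When evaluating accuracy, make sure dropout is disabled, otherwise predictions will vary from run to run.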