# Notes
1. Use binary classification
2. Merge training and dev data; use cross-validation

3. Use char 6-grams, but also test 7+ if memory and time permit
9. Word n-grams (1-3)
10. Use term weighting

4. NRC uses an SVM with a linear kernel
5. Also consider XGB or LightGBM
8. Logistic regression with L2 reg and C=1

6. Blacklists/whitelists
7. Dimensionality reduction

11. Create confusion matrices
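
The notes above could translate into a scikit-learn pipeline roughly like the one below. This is a minimal sketch rather than the final setup: the helper name `build_pipeline`, the step names `char`, `word`, `features` and `clf`, and the toy sentences are all made up for illustration and are not part of the SUBTIEL data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import LinearSVC


def build_pipeline(char_n=6, word_range=(1, 3), C=1.0):
    """TF-IDF weighted char and word n-grams feeding a linear SVM."""
    features = FeatureUnion([
        # Note 3: character 6-grams (raise char_n to test 7+).
        ("char", TfidfVectorizer(analyzer="char", ngram_range=(char_n, char_n))),
        # Note 9: word n-grams of order 1 to 3.
        ("word", TfidfVectorizer(analyzer="word", ngram_range=word_range)),
    ])
    # Note 4: linear-kernel SVM for the binary VL/NL decision.
    return Pipeline([("features", features), ("clf", LinearSVC(C=C))])


# Toy placeholder sentences, NOT taken from the corpus.
sentences = [
    "amai dat is hier plezant",
    "ge moet dat eens proberen",
    "wat leuk dat je er bent",
    "je moet dat eens proberen",
]
labels = ["VL", "VL", "NL", "NL"]

model = build_pipeline()
model.fit(sentences, labels)
predictions = model.predict(sentences)
print(predictions)
```

Swapping `LinearSVC(C=C)` for `LogisticRegression(C=1.0)` (L2 regularisation is its default penalty) would cover note 8 without touching the feature extraction.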

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn import preprocessing

import warnings
warnings.filterwarnings('ignore')

# Read provisional training material
The official data will be released at the end of March. It will probably be the BTI data, so to avoid any form of contamination we use a different set. Our data consists of SUBTIEL data, with both Flemish and Netherlandic Dutch subtitles. The files need some preprocessing to convert them from PAC and STL to SRT. We run this conversion offline, because it includes some manual steps. Our data is probably different from the official data.

All files are converted to plain text, so we remove all information pertaining to time, colour of the text, and font styles. For the conversion of pac files we run this mess of a grep:

```
for L in VL NL; do find ./*/${L}/ -iname "*.pac" -exec ./unpac {} \; | grep -v "\"\| ' *$\|\\$\|\&\|)\|;\|%\|^ .[[:space:]]*$\|^ ..[[:space:]]*$\|^ . .[[:space:]]$\|^[[:space:]]*$" | sed 's/<\|>//g' | grep -v "^[[:space:]]*.$\|^[[:space:]]*..$\|^[[:space:]]*.[[:space:]].[[:space:]]*$\|^[[:space:]]*..[[:space:]]..[[:space:]]*$\|^[[:space:]]*..[[:space:]].[[:space:]]*$\|^[[:space:]]*.[[:space:]]..[[:space:]]*$\^[[:space:]]*.[[:space:]].[[:space:]]\+.[[:space:]]*$" | grep -v "^[[:space:]]*.$\|^[[:space:]]*..$\|^[[:space:]]*.[[:space:]].[[:space:]]*$\|^[[:space:]]*..[[:space:]]..[[:space:]]*$\|^[[:space:]]*..[[:space:]].[[:space:]]*$\|^[[:space:]]*.[[:space:]]..[[:space:]]*$\^[[:space:]]*.[[:space:]].[[:space:]]\+.[[:space:]]*$\|BTI\|Broadcast\|Title:\|title:\|Story:\|story:\|Story:\|TITLE:\|CONFIG:\|Config:\|config:" > ${L}.unpac; done
```

In [None]:
The stl files are cleaner; we extract the info with:

```
for L in VL NL; do find ./*/${L}/ -iname "*.stl" -printf '%P\n' -execdir python2 ~/Programming/stl2srt/to_srt.py {} ~/Programming/lama-dsl/data/${L}srt/{} \; ;  for f in ~/Programming/lama-dsl/data/${L}srt/*.stl; do grep -v "\-\->\|^[[:space:]]*[[:digit:]]\+$\|^[[:space:]]*$" $f | tail -n +4 ; done > ${L}.unstl; done

```
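
For reference, the filtering part of that one-liner can be approximated in Python. The sketch below drops the same three kinds of lines as the `grep -v` pattern (timing lines containing `-->`, lines holding only a cue number, and blank lines) and then skips the first three remaining lines like `tail -n +4`. The function name `srt_to_text` and the sample input are hypothetical, and the assumption that those first three lines are header material is ours.

```python
import re

# Approximates the three alternatives of the grep -v pattern:
# a timing arrow, a line with only digits, or an empty/whitespace line.
DROP = re.compile(r"-->|^\s*[0-9]+$|^\s*$")


def srt_to_text(lines, skip=3):
    """Return subtitle text lines, minus the first `skip` kept lines."""
    kept = []
    for line in lines:
        line = line.rstrip("\n")
        if not DROP.search(line):
            kept.append(line)
    return kept[skip:]


# Made-up sample in SRT layout, preceded by three assumed header lines.
sample = [
    "header one\n", "header two\n", "header three\n",
    "1\n",
    "00:00:01,000 --> 00:00:03,000\n",
    "Hallo, hoe gaat het?\n",
    "\n",
    "2\n",
    "00:00:04,000 --> 00:00:06,000\n",
    "Goed, en met u?\n",
]
text = srt_to_text(sample)
print(text)
```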

Turns out that the Flemish data does not have any STL files. Ah well.
The first stats:
```
wc ?L.all   
  384631  2770783 15103475 NL.all
  296689  2641861 14050991 VL.all
```
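
A quick back-of-the-envelope check on these counts (`wc` reports lines, words, and bytes, in that order) is sketched below. The dictionary layout and variable names are our own. Note that a line here is a subtitle line, not a tokenised sentence.

```python
# Line, word and byte counts copied from the wc output above.
wc_counts = {
    "NL": {"lines": 384631, "words": 2770783, "bytes": 15103475},
    "VL": {"lines": 296689, "words": 2641861, "bytes": 14050991},
}

total_lines = sum(c["lines"] for c in wc_counts.values())
for label, c in wc_counts.items():
    words_per_line = c["words"] / c["lines"]
    share = c["lines"] / total_lines
    print(f"{label}: {words_per_line:.2f} words/line, {share:.1%} of all lines")
```

The Netherlandic set has more lines but shorter ones (roughly 7.2 against 8.9 whitespace-separated words per line), and it makes up a bit over half of all lines, which is worth keeping in mind as the majority-class baseline for the binary classifier.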
The next step is to run the files through ucto.

In [3]:
import ucto

ModuleNotFoundError: No module named 'ucto'

In [None]:
ucto_config = "tokconfig-nld"


def tokenise_file(path, config=ucto_config):
    """Tokenise a file with ucto; return one space-joined string per sentence."""
    tokeniser = ucto.Tokenizer(config)
    with open(path, 'r') as f:
        for line in f:
            tokeniser.process(line)
    sentences = []
    current_line = []
    for token in tokeniser:
        current_line.append(str(token))
        # ucto flags the last token of each sentence it detects
        if token.isendofsentence():
            sentences.append(" ".join(current_line))
            current_line = []
    return sentences


vl_text = tokenise_file('data/VL.all')
print("--")
nl_text = tokenise_file('data/NL.all')

In [None]:
print("Mean length Flemish sentence: ", sum([len(x.split()) for x in vl_text])/len(vl_text))
print("Mean length Dutch sentence:   ", sum([len(x.split()) for x in nl_text])/len(nl_text))

2.0

We don't have any test data, but we will extensively use cross-validation to see our progress (if any).
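
One way to set this up is stratified k-fold cross-validation with out-of-fold predictions, which also gives us the confusion matrix from note 11. The sketch below is illustrative only: the toy sentences are invented, the character n-gram order is lowered to 3 because the toy sentences are so short, and 4 folds is an arbitrary choice.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Invented toy sentences, NOT taken from the corpus.
vl_toy = ["amai dat is plezant", "ge moet dat proberen",
          "dat is ne schone wagen", "ik zie u graag"]
nl_toy = ["wat is dat leuk", "je moet dat proberen",
          "dat is een mooie auto", "ik hou van je"]
X = vl_toy + nl_toy
y = ["VL"] * len(vl_toy) + ["NL"] * len(nl_toy)

model = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(3, 3)),
    LinearSVC(C=1.0),
)

# Stratification keeps the VL/NL ratio the same in every fold.
folds = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)

# Each sentence is predicted by a model that never saw it during training.
predicted = cross_val_predict(model, X, y, cv=folds)

# Rows are true labels, columns are predicted labels, in the order VL, NL.
matrix = confusion_matrix(y, predicted, labels=["VL", "NL"])
print(matrix)
```

Because the vectoriser sits inside the pipeline, it is refitted on the training part of each fold, so no n-gram statistics leak from the held-out sentences.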

# Features

## Character $n$-grams

In [None]:
from collections import Counter
from nltk.util import ngrams

# Character trigrams of a placeholder string; iterating a str yields characters
vl_cngrams = Counter()
for ngram in ngrams("hallo dit is een test", 3):
    vl_cngrams["".join(ngram)] += 1

nl_cngrams = Counter()
for ngram in ngrams("hallo dit is een test", 3):
    nl_cngrams["".join(ngram)] += 1
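
The same counting can be run over a whole list of sentences without NLTK by slicing each string directly. The sketch below is one possible way to do it. The function name `char_ngram_counts` and the toy sentences are placeholders, and in the notebook the lists `vl_text` and `nl_text` would be passed in instead.

```python
from collections import Counter


def char_ngram_counts(sentences, n=6):
    """Count character n-grams of order n, never crossing sentence boundaries."""
    counts = Counter()
    for sentence in sentences:
        # A sentence shorter than n characters contributes nothing.
        for start in range(len(sentence) - n + 1):
            counts[sentence[start:start + n]] += 1
    return counts


toy = ["hallo dit is een test", "dit is nog een test"]
counts = char_ngram_counts(toy, n=6)
print(counts.most_common(3))
```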

## Word $n$-grams

In [None]:
# Word bigrams of a placeholder string; join with a space to keep word boundaries
vl_wngrams = Counter()
for ngram in ngrams("hallo dit is een test".split(), 2):
    vl_wngrams[" ".join(ngram)] += 1

nl_wngrams = Counter()
for ngram in ngrams("hallo dit is een test".split(), 2):
    nl_wngrams[" ".join(ngram)] += 1