# Notes
1. Use binary classification
2. Merge training and dev data; use cross-validation

3. Use char 6-grams, but also test 7+ if memory and time permit
9. Word n-grams (1-3)
10. Use term weighting

4. NRC uses an SVM with a linear kernel
5. Also consider XGB or LightGBM
8. Logistic regression with L2 reg and C=1

6. Blacklists/whitelists
7. Dimensionality reduction

11. Create confusion matrices
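
Several of the notes above (char n-grams, word 1-3 grams, term weighting, logistic regression with L2 and C=1) can be combined into a single pipeline. The following is only a sketch under assumptions: TF-IDF is assumed as the term weighting, `build_baseline` is a hypothetical helper, and the toy sentences and their labels are made up for illustration and are not taken from the data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline


def build_baseline(char_n=6, C=1.0):
    """Hypothetical helper: char n-grams plus word 1-3 grams, TF-IDF weighted."""
    features = FeatureUnion([
        ('char', TfidfVectorizer(analyzer='char', ngram_range=(char_n, char_n))),
        ('words', TfidfVectorizer(analyzer='word', ngram_range=(1, 3),
                                  token_pattern=r"(?u)\b\w+\b")),
    ])
    # LogisticRegression applies an L2 penalty by default; C=1 follows note 8
    return Pipeline([('features', features), ('clf', LogisticRegression(C=C))])


# Made-up toy sentences with arbitrary labels, only to show the calls
toy_text = ["dit is een korte zin", "hier staat nog een zin",
            "waar gaan we naartoe", "dat weet ik niet zeker"]
toy_labels = ['vl', 'nl', 'vl', 'nl']

baseline = build_baseline()
baseline.fit(toy_text, toy_labels)
toy_prediction = baseline.predict(["dit is nog een zin"])
```

Calling `build_baseline(char_n=7)` would test the longer n-grams from note 3, memory permitting.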

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn import preprocessing

import warnings
warnings.filterwarnings('ignore')

# Read provisional training material
The official data will be released at the end of March. It will probably be the BTI data, so to avoid any contamination we use a different set: SUBTIEL data, with both Flemish and Netherlandic Dutch subtitles. This set is probably different from the official data. The files first have to be converted from PAC and STL to SRT. We run this conversion offline, because it involves some manual steps.

All files are converted to plain text, which removes all information about timing, text colour, and font styles. For the conversion of the PAC files we run this mess of a grep:

```
for L in VL NL; do find ./*/${L}/ -iname "*.pac" -exec ./unpac {} \; | grep -v "\"\| ' *$\|\\$\|\&\|)\|;\|%\|^ .[[:space:]]*$\|^ ..[[:space:]]*$\|^ . .[[:space:]]$\|^[[:space:]]*$" | sed 's/<\|>//g' | grep -v "^[[:space:]]*.$\|^[[:space:]]*..$\|^[[:space:]]*.[[:space:]].[[:space:]]*$\|^[[:space:]]*..[[:space:]]..[[:space:]]*$\|^[[:space:]]*..[[:space:]].[[:space:]]*$\|^[[:space:]]*.[[:space:]]..[[:space:]]*$\|^[[:space:]]*.[[:space:]].[[:space:]]\+.[[:space:]]*$" | grep -v "^[[:space:]]*.$\|^[[:space:]]*..$\|^[[:space:]]*.[[:space:]].[[:space:]]*$\|^[[:space:]]*..[[:space:]]..[[:space:]]*$\|^[[:space:]]*..[[:space:]].[[:space:]]*$\|^[[:space:]]*.[[:space:]]..[[:space:]]*$\|^[[:space:]]*.[[:space:]].[[:space:]]\+.[[:space:]]*$\|BTI\|Broadcast\|Title:\|title:\|Story:\|story:\|Story:\|TITLE:\|CONFIG:\|Config:\|config:" > ${L}.unpac; done
```

The STL files are cleaner; we extract the text with:

```
for L in VL NL; do find ./*/${L}/ -iname "*.stl" -printf '%P\n' -execdir python2 ~/Programming/stl2srt/to_srt.py {} ~/Programming/lama-dsl/data/${L}srt/{} \; ;  for f in ~/Programming/lama-dsl/data/${L}srt/*.stl; do grep -v "\-\->\|^[[:space:]]*[[:digit:]]\+$\|^[[:space:]]*$" $f | tail -n +4 ; done > ${L}.unstl; done

```
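
The filtering in the second half of that command (drop timing lines, cue numbers, and blank lines) can be approximated in Python. This is a sketch, not the script that was actually run: `srt_text_lines` is a hypothetical name, the cue numbers and timestamps in the sample are invented, and the `tail -n +4` step that skips the first three remaining lines of each file is left out.

```python
import re

# A line that holds nothing but a cue number, optionally indented
CUE_NUMBER = re.compile(r"^\s*\d+$")


def srt_text_lines(lines):
    """Keep subtitle text only: drop timing lines, cue numbers and blank lines."""
    kept = []
    for line in lines:
        line = line.rstrip("\n")
        if "-->" in line:            # timing line, e.g. 00:00:01,000 --> 00:00:02,500
            continue
        if CUE_NUMBER.match(line):   # cue number
            continue
        if not line.strip():         # blank separator
            continue
        kept.append(line)
    return kept


# Invented sample in SRT layout
sample = [
    "1\n",
    "00:00:01,000 --> 00:00:02,500\n",
    "Waar gaan we vandaag lunchen ?\n",
    "\n",
    "2\n",
    "00:00:03,000 --> 00:00:04,500\n",
    "Zal ik naar de galerie komen ?\n",
]
text_only = srt_text_lines(sample)
print(text_only)
```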

Turns out that the Flemish data does not have any STL files. Ah well.
The first stats:
```
wc ?L.all   
  384631  2770783 15103475 NL.all
  296689  2641861 14050991 VL.all
```
The next step is to run the files through ucto.

In [2]:
import ucto
import pickle

In [3]:
ucto_config = "tokconfig-nld"


def load_sentences(code, variety):
    """Return the tokenised sentences of data/<code>.all, cached in a pickle."""
    pickle_path = 'data/{}.all.pickle'.format(code)
    try:
        # Reuse the sentences from an earlier run if the pickle exists
        with open(pickle_path, 'rb') as f:
            return pickle.load(f)
    except IOError:
        pass

    tokeniser = ucto.Tokenizer(ucto_config)
    with open('data/{}.all'.format(code), 'r') as f:
        for line in f:
            tokeniser.process(line)
    print("All {} data has been tokenised.".format(variety))

    # Rebuild sentences from the token stream, one space-separated string each
    sentences = []
    current_line = []
    for token in tokeniser:
        current_line.append(str(token))
        if token.isendofsentence():
            sentences.append(" ".join(current_line))
            current_line = []
    print("All {} data has been converted to sentences.".format(variety))

    with open(pickle_path, 'wb') as f:
        pickle.dump(sentences, f, pickle.HIGHEST_PROTOCOL)
    print("All {} sentences have been written to a pickle.".format(variety))
    return sentences


vl_text = load_sentences('VL', 'Flemish')
nl_text = load_sentences('NL', 'Netherlandic')
print("PICKLE RICK!!!")

PICKLE RICK!!!


In [4]:
xl_text = vl_text + nl_text
xl_labels = ['vl'] * len(vl_text) + ['nl'] * len(nl_text)

In [5]:
print("Mean length Flemish sentence: ", sum([len(x.split()) for x in vl_text])/len(vl_text))
print("Mean length Dutch sentence:   ", sum([len(x.split()) for x in nl_text])/len(nl_text))
print("Mean length all sentences:    ", sum([len(x.split()) for x in xl_text])/len(xl_text))

Mean length Flemish sentence:  7.059401465519209
Mean length Dutch sentence:    6.848851963185248
Mean length all sentences:     6.948807228693701


We don't have any test data, but we will extensively use cross-validation to see our progress (if any).

# Features


In [6]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import FeatureUnion
from sklearn.svm import SVC

In [7]:
steps = [('char', CountVectorizer(analyzer='char', ngram_range=(3,3))),
         ('words', CountVectorizer(analyzer='word', ngram_range=(3,3),token_pattern=u"(?u)\\b\\w+\\b"))]

union = FeatureUnion(steps)

pipeline = Pipeline([
    ('union', union),
    ('svc', SVC(kernel='linear')),
])

['Waar gaan we vandaag lunchen ?', 'Zal ik naar de galerie komen ?']

In [16]:
small_text = vl_text[0:20] + nl_text[0:20]
small_labels = ['vl'] * 20 + ['nl'] * 20
print(len(small_text))
print(len(small_labels))

small_test = vl_text[-5:] + nl_text[-5:]
small_testl = ['vl'] * 5 + ['nl'] * 5
print(len(small_test))
print(len(small_testl))

40
40
10
10


In [17]:
print("Fitting model on data")
pipeline.fit(small_text, small_labels)
print("Predict the labels on test data")
prediction = pipeline.predict(small_test)
print("Score")
pipeline.score(small_test, small_testl)

Fitting model on data
Predict the labels on test data
Score


0.5
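
An accuracy of 0.5 on ten sentences split evenly over two classes is also what a classifier that always outputs the same label would get, so a confusion matrix says more than the score alone. The sketch below uses only the standard library. `confusion_matrix_counts` is a hypothetical helper and the predictions are invented for illustration; they are not the output of the pipeline above.

```python
from collections import Counter


def confusion_matrix_counts(gold, predicted, labels):
    """Rows are gold labels, columns are predicted labels, in the order of `labels`."""
    counts = Counter(zip(gold, predicted))
    return [[counts[(g, p)] for p in labels] for g in labels]


gold = ['vl'] * 5 + ['nl'] * 5
# Invented predictions: a degenerate classifier that always answers 'nl'
predicted = ['nl'] * 10

matrix = confusion_matrix_counts(gold, predicted, ['vl', 'nl'])
accuracy = sum(g == p for g, p in zip(gold, predicted)) / len(gold)

for label, row in zip(['vl', 'nl'], matrix):
    print(label, row)
print("accuracy:", accuracy)
```

In this invented case all mass sits in the `nl` column, which the single number 0.5 would hide.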

In [34]:
# get_feature_names() was removed in scikit-learn 1.2; newer versions need get_feature_names_out()
for x,y in enumerate(union.get_feature_names()):
    print(x, y)

0 char__ a 
1 char__ an
2 char__ in
3 char__ is
4 char__ lo
5 char__ or
6 char__ se
7 char__ so
8 char__ te
9 char__ th
10 char__ wh
11 char__a t
12 char__and
13 char__ano
14 char__ats
15 char__d t
16 char__e w
17 char__ee 
18 char__enc
19 char__ent
20 char__er 
21 char__ere
22 char__est
23 char__eth
24 char__ets
25 char__ger
26 char__hat
27 char__her
28 char__hin
29 char__his
30 char__in 
31 char__ing
32 char__is 
33 char__let
34 char__lon
35 char__met
36 char__n t
37 char__nce
38 char__nd 
39 char__nge
40 char__not
41 char__nte
42 char__ome
43 char__ong
44 char__or 
45 char__oth
46 char__r l
47 char__r s
48 char__s a
49 char__s i
50 char__s s
51 char__see
52 char__sen
53 char__som
54 char__st 
55 char__t o
56 char__ten
57 char__tes
58 char__the
59 char__thi
60 char__ts 
61 char__wha
62 words__a test or
63 words__and this is
64 words__another longer sentence
65 words__is a test
66 words__is another longer
67 words__lets see whats
68 words__see whats in
69 words__test or something
70 w
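
The `char__` features in this listing are overlapping character windows that include the spaces between words. A rough standard-library approximation of what `CountVectorizer(analyzer='char', ngram_range=(3,3))` extracts is sketched below. It assumes the default lowercasing and ignores details such as accent stripping; `char_ngrams` is a hypothetical name.

```python
import re


def char_ngrams(text, n):
    """Slide a window of n characters over the lowercased text, spaces included."""
    text = re.sub(r"\s\s+", " ", text.lower())  # collapse runs of whitespace
    return [text[i:i + n] for i in range(len(text) - n + 1)]


trigrams = char_ngrams("is a test", 3)
print(trigrams)
```

The number of distinct n-grams grows quickly with `n`, which is why longer n-grams cost more memory and time.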

In [15]:
[1,2,3,4,5,6][-5:]

[2, 3, 4, 5, 6]

In [16]:
k_fold = KFold(n_splits=3)

for train_indices, test_indices in k_fold.split(xl_text):
    print('Train: %s | test: %s' % (train_indices, test_indices))

    # Parameter names must match the pipeline step names, here 'svc'
    param_grid = dict(svc__C=[0.1, 1, 10])
    #grid_search = GridSearchCV(pipeline, param_grid=param_grid, verbose=10)
    #grid_search.fit(alltrainingmaterial[train], alllabels[train]).score(alltrainingmaterial[test], alllabels[test])
    #    for train, test in k_fold.split(alltrainingmaterial)]

Train: [307589 307590 307591 ... 922764 922765 922766] | test: [     0      1      2 ... 307586 307587 307588]
Train: [     0      1      2 ... 922764 922765 922766] | test: [307589 307590 307591 ... 615175 615176 615177]
Train: [     0      1      2 ... 615175 615176 615177] | test: [615178 615179 615180 ... 922764 922765 922766]
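
`KFold` without shuffling cuts the data into contiguous blocks, as the indices above show. Because `xl_text` holds all Flemish sentences followed by all Netherlandic ones, the first and last test folds are probably dominated by a single label. One possible remedy, sketched here on placeholder data, is scikit-learn's `StratifiedKFold` with `shuffle=True`, which keeps the label proportions roughly equal in every fold. The toy sentences and the `random_state` value are assumptions for illustration.

```python
from collections import Counter

from sklearn.model_selection import StratifiedKFold

# Placeholder data in the same layout as xl_text: one class first, then the other
toy_text = ["vl sentence %d" % i for i in range(6)] + \
           ["nl sentence %d" % i for i in range(6)]
toy_labels = ['vl'] * 6 + ['nl'] * 6

skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

fold_counts = []
for train_indices, test_indices in skf.split(toy_text, toy_labels):
    # Count how many sentences of each label end up in this test fold
    counts = Counter(toy_labels[i] for i in test_indices)
    fold_counts.append(counts)
    print(dict(counts))
```

The same splitter can be passed as the `cv` argument of `cross_val_score`.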
