# Notes
1. Use binary classification
2. Merge training and dev data; use cross-validation

3. Use char 6-grams, but also test 7+ if memory and time permit
9. Word n-grams (1-3)
10. Use term weighting

4. NRC uses an SVM with a linear kernel
5. Also consider XGB or LightGBM
8. Logistic regression with L2 reg and C=1

6. Blacklists/whitelists
7. Dimensionality reduction

11. Create confusion matrices
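
As a sketch of item 11: with two classes the confusion matrix is a 2×2 table, which `sklearn.metrics.confusion_matrix` can produce directly. The gold and predicted labels below are invented for illustration; in practice they would come from the held-out fold of a cross-validation run.

```python
from sklearn.metrics import confusion_matrix

# Invented labels, for illustration only
gold = ['vl', 'vl', 'vl', 'nl', 'nl', 'nl']
predicted = ['vl', 'vl', 'nl', 'nl', 'nl', 'nl']

# Rows are gold labels, columns are predicted labels, both in the order given by `labels`
cm = confusion_matrix(gold, predicted, labels=['vl', 'nl'])
print(cm)
```

Fixing the `labels` order keeps the rows and columns comparable between folds.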

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn import preprocessing

import warnings
warnings.filterwarnings('ignore')

# Read provisional training material
The official data will be released at the end of March. It will probably be the BTI data, so to avoid any form of contamination we use a different set: SUBTIEL data, with both Flemish and Netherlandic Dutch subtitles. This set is probably different from the official data. The files need some preprocessing to convert them from PAC and STL to SRT. We run this conversion offline, as it also involves some manual steps.

All files are converted to plain text, so we remove all information pertaining to timing, text colour, and font styles. For the conversion of the PAC files we run this mess of a grep:

```
for L in VL NL; do find ./*/${L}/ -iname "*.pac" -exec ./unpac {} \; | grep -v "\"\| ' *$\|\\$\|\&\|)\|;\|%\|^ .[[:space:]]*$\|^ ..[[:space:]]*$\|^ . .[[:space:]]$\|^[[:space:]]*$" | sed 's/<\|>//g' | grep -v "^[[:space:]]*.$\|^[[:space:]]*..$\|^[[:space:]]*.[[:space:]].[[:space:]]*$\|^[[:space:]]*..[[:space:]]..[[:space:]]*$\|^[[:space:]]*..[[:space:]].[[:space:]]*$\|^[[:space:]]*.[[:space:]]..[[:space:]]*$\^[[:space:]]*.[[:space:]].[[:space:]]\+.[[:space:]]*$" | grep -v "^[[:space:]]*.$\|^[[:space:]]*..$\|^[[:space:]]*.[[:space:]].[[:space:]]*$\|^[[:space:]]*..[[:space:]]..[[:space:]]*$\|^[[:space:]]*..[[:space:]].[[:space:]]*$\|^[[:space:]]*.[[:space:]]..[[:space:]]*$\^[[:space:]]*.[[:space:]].[[:space:]]\+.[[:space:]]*$\|BTI\|Broadcast\|Title:\|title:\|Story:\|story:\|Story:\|TITLE:\|CONFIG:\|Config:\|config:" > ${L}.unpac; done
```

The STL files are cleaner. We first convert them to SRT and then extract the text with:

```
for L in VL NL; do find ./*/${L}/ -iname "*.stl" -printf '%P\n' -execdir python2 ~/Programming/stl2srt/to_srt.py {} ~/Programming/lama-dsl/data/${L}srt/{} \; ;  for f in ~/Programming/lama-dsl/data/${L}srt/*.stl; do grep -v "\-\->\|^[[:space:]]*[[:digit:]]\+$\|^[[:space:]]*$" $f | tail -n +4 ; done > ${L}.unstl; done

```
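
For readers who prefer not to parse the `grep` expression, the filter on the SRT files can be sketched in Python. This only illustrates the three patterns that are dropped (timing lines containing `-->`, bare cue numbers, and blank lines). The helper name `strip_srt_markup` and the sample cues are made up, and the `tail -n +4` step is left out.

```python
import re

# Lines that the grep -v call throws away: timing lines, bare cue numbers, blank lines
SRT_NOISE = re.compile(r"-->|^\s*\d+$|^\s*$")

def strip_srt_markup(lines):
    """Keep only the subtitle text lines of an SRT file (hypothetical helper)."""
    stripped = (line.rstrip("\n") for line in lines)
    return [line for line in stripped if not SRT_NOISE.search(line)]

# Made-up sample with two cues
sample = [
    "1\n",
    "00:00:01,000 --> 00:00:03,000\n",
    "Goedemorgen allemaal.\n",
    "\n",
    "2\n",
    "00:00:03,500 --> 00:00:05,000\n",
    "Hoe gaat het?\n",
]
text_lines = strip_srt_markup(sample)
print(text_lines)
```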

Turns out that the Flemish data does not have any STL files. Ah well.
The first stats:
```
wc ?L.all   
  384631  2770783 15103475 NL.all
  296689  2641861 14050991 VL.all
```
The next step is to run the files through ucto.

In [2]:
import ucto
import pickle

In [3]:
ucto_config = "tokconfig-nld"

def load_sentences(path, name):
    """Load tokenised sentences from the pickle cache, or tokenise `path` with ucto."""
    try:
        with open(path + '.pickle', 'rb') as f:
            return pickle.load(f)
    except IOError:
        # No cache yet: tokenise from scratch below
        pass

    tokeniser = ucto.Tokenizer(ucto_config)
    with open(path, 'r') as f:
        for line in f:
            tokeniser.process(line)
    print("All " + name + " data has been tokenised.")

    # Join the tokens back into one space-separated string per sentence
    sentences = []
    current_line = []
    for token in tokeniser:
        current_line.append(str(token))
        if token.isendofsentence():
            sentences.append(" ".join(current_line))
            current_line = []
    print("All " + name + " data has been converted to sentences.")

    with open(path + '.pickle', 'wb') as f:
        pickle.dump(sentences, f, pickle.HIGHEST_PROTOCOL)
    print("All " + name + " sentences have been written to a pickle.")
    return sentences

vl_text = load_sentences('data/VL.all', 'Flemish')
nl_text = load_sentences('data/NL.all', 'Netherlandic')
print("PICKLE RICK!!!")

PICKLE RICK!!!


In [4]:
import random
xl_text = vl_text + nl_text
xl_labels = ['vl'] * len(vl_text) + ['nl'] * len(nl_text)

combined = list(zip(xl_text, xl_labels))
random.shuffle(combined)
xl_text[:], xl_labels[:] = zip(*combined)
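
The shuffle above is unseeded, so every run gives a different order and therefore different folds. A minimal sketch of a reproducible variant, assuming a fixed seed is acceptable here. The helper name `shuffle_together`, the seed value and the toy sentences are arbitrary choices for illustration.

```python
import random

def shuffle_together(texts, labels, seed=42):
    """Shuffle texts and labels in the same order, reproducibly (hypothetical helper)."""
    pairs = list(zip(texts, labels))
    # A private Random instance leaves the global random state untouched
    random.Random(seed).shuffle(pairs)
    shuffled_texts, shuffled_labels = zip(*pairs)
    return list(shuffled_texts), list(shuffled_labels)

toy_text = ["a vl sentence", "another vl sentence", "an nl sentence", "another nl sentence"]
toy_labels = ["vl", "vl", "nl", "nl"]
shuffled_text, shuffled_labels = shuffle_together(toy_text, toy_labels)
```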

In [6]:
print("There are " + str(len(vl_text)) + " Flemish texts and " + str(len(nl_text)) + " Netherlandic texts. Total: " + str(len(xl_text)))
print("Mean length Flemish sentence: ", sum([len(x.split()) for x in vl_text])/len(vl_text))
print("Mean length Dutch sentence:   ", sum([len(x.split()) for x in nl_text])/len(nl_text))
print("Mean length all sentences:    ", sum([len(x.split()) for x in xl_text])/len(xl_text))

There are 438070 Flemish texts and 484697 Netherlandic texts. Total: 922767
Mean length Flemish sentence:  7.059401465519209
Mean length Dutch sentence:    6.848851963185248
Mean length all sentences:     6.948807228693701


We don't have any test data, but we will extensively use cross-validation to see our progress (if any).
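
A minimal sketch of what such a cross-validation run could look like with `cross_val_score`. The toy sentences, the `LinearSVC` classifier, the number of folds and the character n-gram range are assumptions made for illustration, not the configuration used in this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Made-up examples, four per class
toy_text = ["goesting om te gaan", "zin om te gaan",
            "amai dat is schoon", "jeetje dat is mooi",
            "ge zijt welkom", "je bent welkom",
            "een tas koffie", "een kop koffie"]
toy_labels = ["vl", "nl", "vl", "nl", "vl", "nl", "vl", "nl"]

toy_pipeline = make_pipeline(
    CountVectorizer(analyzer="char", ngram_range=(1, 3)),
    LinearSVC(),
)

# Stratified folds keep the vl/nl ratio the same in every split
cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)
scores = cross_val_score(toy_pipeline, toy_text, toy_labels, cv=cv)
print("Accuracy per fold:", scores, "mean:", scores.mean())
```

The pipeline is refitted on each training split, so the vectoriser never sees the vocabulary of the held-out fold.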

# Pipeline 1: Traditional classification


## Support Vector Classification with counts
Here we use raw counts of character trigrams and word trigrams as features, and an SVM with a linear kernel and otherwise default parameters.

In [7]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import FeatureUnion
from sklearn.svm import SVC

In [8]:
steps = [('char', CountVectorizer(analyzer='char', ngram_range=(3,3))),
         ('words', CountVectorizer(analyzer='word', ngram_range=(3,3),token_pattern=u"(?u)\\b\\w+\\b"))]

union = FeatureUnion(steps)

pipeline = Pipeline([
    ('union', union),
    ('svc', SVC(kernel='linear')),
])

In [None]:

# KFold yields integer index arrays, which plain Python lists do not accept
X = np.array(xl_text, dtype=object)
y = np.array(xl_labels)

k_fold = KFold(n_splits=10)

# Grid for a later GridSearchCV run; the keys must match the pipeline step names.
# features__univ_select__k=[1, 2] would need a feature-selection step, which the pipeline does not have yet.
param_grid = dict(svc__C=[0.1, 1, 10])

for train_indices, test_indices in k_fold.split(X):
    print('Train: %s | test: %s' % (train_indices, test_indices))

    #grid_search = GridSearchCV(pipeline, param_grid=param_grid, verbose=10)
    #grid_search.fit(X[train_indices], y[train_indices]).score(X[test_indices], y[test_indices])

    print("Fitting model on data")
    pipeline.fit(X[train_indices], y[train_indices])
    print("Predict the labels on test data")
    prediction = pipeline.predict(X[test_indices])
    print("Score")
    print(pipeline.score(X[test_indices], y[test_indices]))


## Support Vector Classification with tf-idf
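
This section is still empty. One way it might be filled in, as a sketch only: swap `CountVectorizer` for `TfidfVectorizer` inside the same kind of `FeatureUnion`. The n-gram ranges follow the notes at the top (character 6-grams, word 1-3 grams). `LinearSVC` stands in for `SVC(kernel='linear')` on the assumption that it scales better to this many sentences; that has not been tested here. The toy data is made up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import LinearSVC

# Same structure as the count-based union, but with tf-idf weighting
tfidf_union = FeatureUnion([
    ('char', TfidfVectorizer(analyzer='char', ngram_range=(6, 6), sublinear_tf=True)),
    ('words', TfidfVectorizer(analyzer='word', ngram_range=(1, 3),
                              token_pattern=r"(?u)\b\w+\b")),
])

tfidf_pipeline = Pipeline([
    ('union', tfidf_union),
    ('svc', LinearSVC(C=1.0)),
])

# Made-up examples, for illustration only
toy_text = ["goesting om te gaan", "zin om te gaan",
            "amai dat is schoon", "jeetje dat is mooi",
            "ge zijt welkom", "je bent welkom"]
toy_labels = ["vl", "nl", "vl", "nl", "vl", "nl"]

tfidf_pipeline.fit(toy_text, toy_labels)
toy_prediction = tfidf_pipeline.predict(["ge zijt welkom"])
print(toy_prediction)
```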

# Fasttext classification

In [10]:
import fasttext



with open('bla.train.txt', 'w') as f:
    for line, label in zip(xl_text, xl_labels):
        f.write(line + " __language__" + label + "\n")

ModuleNotFoundError: No module named 'fasttext'

In [None]:
    

ft_classifier = fasttext.supervised('bla.train.txt', 'model', 
                                    min_count=1, 
                                    word_ngrams=3, 
                                    minn=7, 
                                    maxn=7, 
                                    thread=2, 
                                    label_prefix='__language__')
ft_predictions = ft_classifier.predict(xl_text[-200:])

In [None]:
#import gensim
#from gensim.models.fasttext import FastText as FT_gensim
#
#ftg_model = FT_gensim(sentences=[x.split() for x in xl_text[0:9000]], size=250, min_count=1, min_n=7, max_n=7, word_ngrams=1)
#