## Cleaning

First, let's load and mop up the data a little bit. We want to drop anything where the headline is `NaN`. Fortunately, there aren't too many of those.

Secondly, a lot of the headline categories have many sub-categories, which leaves us with a ridiculously long list of category labels. Each sub-category likely has only a small number of headlines, and that level of detail is more than I'm concerned with here anyhow. I'll simplify the labels to just the top-level categories.
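
To make the simplification concrete, here is a minimal plain-Python sketch of what "top-level category" means. The sample labels are made up for illustration, and it assumes sub-categories are separated by dots, which is what the split on `'.'` in the notebook implies. The notebook itself does the same thing with the pandas string accessor.

```python
# Hypothetical category labels; dots are assumed to separate sub-categories.
sample_categories = ['news', 'news.world.europe', 'sport.soccer', 'business']

def top_level(category):
    """Return everything before the first dot."""
    return category.split('.')[0]

simple = [top_level(c) for c in sample_categories]
print(simple)
```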

In [14]:
import pandas as pd

df = pd.read_csv('data/irishtimes-date-text.csv')
df.head()

| Unnamed: 0 | publish_date | headline_category | headline_text |
|---|---|---|---|
| 0 | 19960102 | business | Smurfit's share price in retreat despite recor... |
| 1 | 19960102 | business | Jamont plans £5m investment to update plant |
| 2 | 19960102 | business | Management is blamed for most company failures |
| 3 | 19960102 | business | Forte expected to announce a special dividend ... |
| 4 | 19960102 | business | Accountancy firm adopts name change |


In [15]:
# Drop rows with a missing value in any column, which removes the blank headlines
df.dropna(axis='index', how='any', inplace=True)

# Simplify the category names
df['simple_category'] = df.headline_category.str.split('.').str.get(0)

## Prepping the Model

We'll create a small pipeline here with a bag-of-words model that we can categorize with.
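
As a rough illustration of the bag-of-words idea (with n-grams up to length 3, matching the `ngram_size` used below), here is a small plain-Python sketch. It only shows the concept of counting word sequences without regard to their position. It is not how spaCy builds its features internally, and the helper name `bag_of_ngrams` is made up for this example.

```python
from collections import Counter

def bag_of_ngrams(text, max_n=3):
    """Count every word n-gram of length 1..max_n in a lowercased headline."""
    words = text.lower().split()
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            counts[' '.join(words[i:i + n])] += 1
    return counts

features = bag_of_ngrams('Accountancy firm adopts name change')
print(features.most_common(5))
```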

We'll get a list of the unique category names and then add them as labels. Then we'll create a variable with the headline text for predicting and a `dict` of labels, only one of which will be `True` for a given headline.
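
For a single headline, that label `dict` might look like the output of the sketch below. The helper `make_cats` and the name `example_cat_names` are hypothetical and exist only for this illustration. The point is the shape: a top-level `'cats'` key holding one boolean per category.

```python
def make_cats(label, names):
    """Build a 'cats' annotation with exactly one category set to True."""
    return {'cats': {name: name == label for name in names}}

# Hypothetical list of top-level categories for the example.
example_cat_names = ['business', 'culture', 'news', 'opinion', 'sport', 'lifestyle']
example = make_cats('sport', example_cat_names)
print(example)
```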

In [16]:
labels = [{'cats': {'business': label == 'business',
                    'culture': label == 'culture',
                    'news': label == 'news',
                    'opinion': label == 'opinion',
                    'sport': label == 'sport',
                    'lifestyle': label == 'lifestyle'}}
          for label in df['simple_category']]

In [17]:
import spacy

# Create the model
nlp = spacy.blank('en')

categorizer = nlp.create_pipe('textcat', config={'exclusive_classes': True,
                                                 'architecture': 'ensemble',
                                                 'ngram_size': 3})
nlp.add_pipe(categorizer)

# Add the labels
cat_names = df['simple_category'].unique()

for cat in cat_names:
    categorizer.add_label(cat)

optimizer = nlp.begin_training()
# nlp = nlp.from_disk('models')

In [18]:
# from multiprocessing import Pool
# from nltk.corpus import stopwords

# Tokenize and remove stopwords
# stops = stopwords.words('english')

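from spacy.tokens import Doc

def clean_headlines(doc):
    # A pipeline component has to return a Doc, so rebuild one from the
    # lowercased tokens instead of returning a joined string.
    words = [x.text.lower() for x in doc]
    #        if x.text.lower() not in stops and not x.is_punct]
    return Doc(doc.vocab, words=words)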

# Add some data cleaning before categorization.
nlp.add_pipe(clean_headlines, before='textcat')

# Parallelize...go!
# with Pool(16) as pool:
#     headlines = pool.map(clean_headlines, df['headline_text'].values)

In [21]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df['headline_text'], 
                                                    df['simple_category'],
                                                    test_size=0.2, train_size=0.8,
                                                    random_state=777, 
                                                    stratify=df['simple_category'])
train = list(zip(X_train, y_train))

ValueError: Found input variables with inconsistent numbers of samples: [1422222, 1]

## Training the Model

Now we can actually train the thing!

#### Note
* Consider different batch sizes and experimenting with the optimizer.
* Also look into separating a test set. Right now, it appears to be training on everything.
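
To get a feel for the batch-size question, the sketch below mimics a multiplicative schedule with a cap, which is my understanding of what `compounding(1, 64, 1.001)` produces. It is a stand-in written for this note, not spaCy's own code, and it only handles the case where `start` is below `stop`. It counts how many batches go by before the size reaches the cap for two different growth rates.

```python
from itertools import islice

def compounding_sizes(start, stop, compound):
    """Yield start, start*compound, start*compound**2, ... capped at stop."""
    current = float(start)
    while True:
        yield min(current, stop)
        current *= compound

def batches_until_cap(start, stop, compound):
    """Count how many batches are drawn before the size first reaches stop."""
    for step, size in enumerate(compounding_sizes(start, stop, compound)):
        if size >= stop:
            return step

first_sizes = list(islice(compounding_sizes(1, 64, 1.001), 3))
steps_slow = batches_until_cap(1, 64, 1.001)
steps_fast = batches_until_cap(1, 64, 1.05)
print(first_sizes, steps_slow, steps_fast)
```

With a growth factor of 1.001 the schedule spends a few thousand batches at small sizes before it hits 64, so a larger factor is one cheap thing to try.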

In [20]:
import random
from spacy.util import minibatch, fix_random_seed, decaying, compounding

# Make it reproducible, yo!
fix_random_seed(777)

losses = {}
dropout = decaying(0.6, 0.2, 1e-4)
batch_sizes = compounding(1, 64, 1.001)

for epoch in range(3):
    random.shuffle(train)
    batches = minibatch(train, size=batch_sizes)

    for batch in batches:
        texts, cats = zip(*batch)
        # nlp.update needs one {'cats': {...}} dict per text. Passing the bare
        # label strings raised ValueError [E151], which listed the characters of
        # 'news' (['n', 'e', 'w', 's']) as the keys it got.
        annotations = [{'cats': {name: name == cat for name in cat_names}}
                       for cat in cats]
        nlp.update(texts, annotations, drop=next(dropout), sgd=optimizer, losses=losses)
    print(losses)

In [None]:
# Let's save the model so that we can run it again later when we have more memory to spare.
nlp.to_disk('/home/sean/Code/irish_times/ensemble_2')

In [None]:
def compare_label(match):
    # match is a (predicted_label, true_label) pair of category names
    predicted, actual = match
    return 1 if predicted == actual else 0

### Drumroll, please!

In [None]:
test_test_headlines = [nlp.tokenizer(txt) for txt in X_test]
test_categorizer = nlp.get_pipe('textcat')
test_scores, _ = test_categorizer.predict(test_test_headlines)
test_pred_labels = test_scores.argmax(axis=1)
test_predictions = [test_categorizer.labels[label] for label in test_pred_labels]
compare_test_labels = list(zip(test_predictions, y_test))

# ...aaaaaannnnd the scores!
test_correct = sum([compare_label(m) for m in compare_test_labels])
print(f'The model is {(test_correct / len(y_test)) * 100:.2f}% accurate on the test data.')
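
The scoring step boils down to two things: take the highest-scoring label for each headline, then divide the number of correct predictions by the number of test headlines. Here is a toy plain-Python sketch of that logic. The scores, labels and helper names are all made up for illustration.

```python
def argmax(row):
    """Index of the largest score in one row."""
    return max(range(len(row)), key=row.__getitem__)

def accuracy(scores, label_names, true_labels):
    """Fraction of rows whose top-scoring label matches the true label."""
    predicted = [label_names[argmax(row)] for row in scores]
    correct = sum(p == t for p, t in zip(predicted, true_labels))
    return correct / len(true_labels)

# Made-up scores for three headlines over three hypothetical labels.
toy_labels = ['business', 'news', 'sport']
toy_scores = [[0.7, 0.2, 0.1],
              [0.1, 0.3, 0.6],
              [0.2, 0.5, 0.3]]
toy_truth = ['business', 'sport', 'sport']
print(f'{accuracy(toy_scores, toy_labels, toy_truth) * 100:.2f}%')
```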

### Additional Notes
It's looking like there are some repeatable things here that would work well in a Hy macro or three.
Consider comparing `minibatch` against other batching methods using `sklearn.model_selection.cross_val_predict`, and training them all at once with the cool multiprocessing thing that I learned.
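
For reference, the idea behind cross-validated prediction is that every sample gets predicted exactly once, by a model that never saw it during training. The sketch below shows only the index bookkeeping for that, in plain Python with contiguous folds and no shuffling or stratification. The helper `k_fold_indices` is hypothetical, and wiring a spaCy model into scikit-learn's own tooling is a separate question this does not answer.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    # Spread any remainder over the first few folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

folds = list(k_fold_indices(10, 3))
for train_idx, test_idx in folds:
    print(len(train_idx), test_idx)
```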