## Training with fastText

Train a fastText model that predicts the emoji for a sentence.
Create the `output` directory beforehand.
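
The directory can also be created from Python before any cell writes to it. The following is a minimal sketch, assuming the directory name `output` used in this notebook; `ensure_dir` is a hypothetical helper, not part of the notebook.

```python
import os


def ensure_dir(path):
    """Create `path` if it does not exist yet and return it (hypothetical helper)."""
    os.makedirs(path, exist_ok=True)
    return path


# Assumption: "output" matches the directory name used by this notebook.
ensure_dir("output")
```

Passing `exist_ok=True` makes the call safe to repeat when the notebook is re-run.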

Notes:
- fastText does not shuffle the training data internally, so shuffle it beforehand: https://github.com/facebookresearch/fastText/issues/74
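
Because fastText does not shuffle its input, the shuffle has to happen when the training file is written. The following is a minimal sketch using only the standard library, assuming the data fits in memory. `shuffle_lines` and the seed value are illustrative; the notebook itself shuffles with `pandas.DataFrame.sample`.

```python
import random


def shuffle_lines(lines, seed=0):
    """Return a shuffled copy of `lines`; a fixed seed keeps the order reproducible."""
    shuffled = list(lines)
    # A private Random instance avoids touching the global random state.
    random.Random(seed).shuffle(shuffled)
    return shuffled


# Invented example lines in fastText's supervised format.
lines = ["__label__a x", "__label__a y", "__label__b z", "__label__b w"]
print(shuffle_lines(lines, seed=0))
```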

## Parameters

Declare the parameters set by `papermill`.

In [None]:
output_dir = "output"

## Install dependent libraries

Build and install the libraries.

In [None]:
# Build fastText
! cd $output_dir && wget https://github.com/facebookresearch/fastText/archive/v0.9.1.zip && unzip v0.9.1.zip && cd fastText-0.9.1 && make
# Install python packages
! pip install $output_dir/fastText-0.9.1/ janome scikit-learn seaborn matplotlib pandas pytest

In [None]:
# import matplotlib to draw graph
import matplotlib
%matplotlib inline

## Test library

Test all the libraries used in this notebook.

## Get dataset

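Copy ../../data/output/{train,valid,test}.tsv into the data directory. The copy cannot be run from a container started in this directory, so the command is not included as a notebook cell. If it were, it would look like this:

$ cp ../../data/output/{train,valid,test}.tsv data/

Directory structure: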

```
.
├── README.md
└── data/
    ├── train.tsv
    ├── valid.tsv
    └── test.tsv
```

## Define the text preprocessing function

In [None]:
from janome.tokenizer import Tokenizer


class PreprocessingTokenizer:
    def __init__(self):
        self._tokenizer = Tokenizer()

    def tokenize(self, text):
        tokens = self._tokenizer.tokenize(text, wakati=True)
        return " ".join(tokens)

## Create the training data

In [None]:
import os
import pandas


def format_data(tokenizer, in_fd, out_fd, random_state=0):
    """Shuffle the "label<TAB>text" lines of in_fd and write them in fastText format."""
    df = pandas.DataFrame({"text": [text for text in in_fd]})
    # fastText does not shuffle its input, so shuffle here with a fixed seed.
    df = df.sample(frac=1.0, random_state=random_state)
    for i, line in enumerate(df["text"].values):
        try:
            label, text = line.strip("\n").split("\t")
        except ValueError:
            # Report and skip lines that are not exactly "label<TAB>text".
            print(line)
            continue
        label_fasttext = "__label__{}".format(label)
        text_fasttext = tokenizer.tokenize(text)
        print("{} {}".format(label_fasttext, text_fasttext), file=out_fd)
        if (i + 1) % 1000 == 0:
            print(i + 1, "processed")


def make_data(train_file, valid_file, test_file, train_out, valid_out, test_out):
    tokenizer = PreprocessingTokenizer()
    pairs = [(train_file, train_out), (valid_file, valid_out), (test_file, test_out)]
    for in_path, out_path in pairs:
        # Context managers close (and flush) both files after each split.
        with open(in_path, encoding="utf-8") as in_fd, open(out_path, "w", encoding="utf-8") as out_fd:
            format_data(tokenizer, in_fd, out_fd)


make_data(
    "data/train.tsv", "data/valid.tsv", "data/test.tsv",
    os.path.join(output_dir, "train.txt"),
    os.path.join(output_dir, "valid.txt"),
    os.path.join(output_dir, "test.txt"),
)
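
Each line written by `format_data` follows fastText's supervised input format: a `__label__`-prefixed label, then the space-separated tokens. The following is a small sketch of that conversion for a single TSV line. `to_fasttext_line` is a hypothetical helper, and `str.split` stands in for the janome tokenizer so that the snippet runs without extra dependencies.

```python
def to_fasttext_line(tsv_line, tokenize=str.split):
    """Convert one "label<TAB>text" line to "__label__<label> <tokens>"."""
    label, text = tsv_line.rstrip("\n").split("\t")
    # `tokenize` is a stand-in; the notebook uses janome's wakati tokenization.
    tokens = tokenize(text)
    return "__label__{} {}".format(label, " ".join(tokens))


# Invented example line; real labels in this notebook are emoji.
print(to_fasttext_line("smile\tgood morning\n"))
# prints: __label__smile good morning
```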

## Training

In [None]:
import fasttext


def train(in_file, out_file):
    model = fasttext.train_supervised(input=in_file)
    model.save_model(out_file)
    return model
    
print("training ...")
model = train(in_file=output_dir + "/train.txt", out_file=output_dir + "/model.bin")
print("training ... done")

In [None]:
# model.test returns (number of samples, P@1, R@1)
model.test(output_dir + "/valid.txt", k=1)
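
For intuition about the two ratios: precision at k divides the number of correct predictions by the number of predictions made, and recall at k divides it by the number of gold labels. The sketch below computes both from plain Python lists. `precision_recall_at_k` is a hypothetical helper written for illustration, not fastText's own implementation, and the labels are invented.

```python
def precision_recall_at_k(gold, pred):
    """gold: list of sets of gold labels; pred: list of lists of predicted labels."""
    n_correct = sum(len(set(p) & g) for g, p in zip(gold, pred))
    n_predicted = sum(len(p) for p in pred)
    n_gold = sum(len(g) for g in gold)
    return n_correct / n_predicted, n_correct / n_gold


# One gold label per sample and one prediction per sample (k=1).
gold = [{"__label__a"}, {"__label__b"}, {"__label__a"}, {"__label__c"}]
pred = [["__label__a"], ["__label__a"], ["__label__a"], ["__label__c"]]
precision, recall = precision_recall_at_k(gold, pred)
print(precision, recall)  # 0.75 0.75
```

With exactly one gold label per sentence and k=1, both denominators equal the number of samples, so P@1 and R@1 coincide with the accuracy.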

## Evaluation for validation set

In [None]:
def load_valid_dataset(fd):
    res = []
    for line in fd:
        line = line.strip("\n")
        tokens = line.split(" ")
        label = tokens[0]
        text = " ".join(tokens[1:])
        res.append({"label": label, "text": text})
    return res

def predict(tokenizer, model, texts):
    result = []
    for text in texts:
        text_preprocessed = tokenizer.tokenize(text)
        res = model.predict(text_preprocessed)
        label = res[0][0]
        score = res[1][0]
        result.append({"label": label, "score": score})
    return result


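class IdentityTokenizer:
    """Pass-through tokenizer: valid.txt is already tokenized, so predict() must not re-tokenize it."""

    def tokenize(self, text):
        return text

# Close the file once the dataset has been read.
with open(output_dir + "/valid.txt", encoding="utf-8") as fd:
    val_dataset = load_valid_dataset(fd)
val_pred = predict(IdentityTokenizer(), model, texts=[item["text"] for item in val_dataset])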

Calculate accuracy

In [None]:
from sklearn.metrics import accuracy_score

accuracy = accuracy_score([x["label"] for x in val_dataset], [x["label"] for x in val_pred])
print("accuracy: {}".format(accuracy))

Calculate top-K accuracy

In [None]:
def top_k_accuracy(gold, pred, k):
    total = 0
    correct = 0
    for i in range(len(gold)):
        assert len(pred[i]) == k
        total += 1
        if gold[i] in pred[i]:
            correct += 1
    return correct / total

top_5_accuracy = top_k_accuracy(
    [x["label"] for x in val_dataset],
    [model.predict(item["text"], k=5)[0] for item in val_dataset],
    k=5,
)
print("Top-5 accuracy:", top_5_accuracy)
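
A toy run makes the metric concrete. The sketch below repeats the same counting logic on hand-written predictions, so it runs without a trained model. The labels and predictions are invented for illustration.

```python
def top_k_accuracy(gold, pred, k):
    """Fraction of samples whose gold label is among the k predicted labels."""
    correct = 0
    for gold_label, pred_labels in zip(gold, pred):
        assert len(pred_labels) == k
        if gold_label in pred_labels:
            correct += 1
    return correct / len(gold)


# Invented labels and predictions, k=2.
gold = ["__label__a", "__label__b", "__label__c", "__label__a"]
pred = [
    ("__label__a", "__label__b"),  # hit at rank 1
    ("__label__a", "__label__b"),  # hit at rank 2
    ("__label__a", "__label__b"),  # miss
    ("__label__b", "__label__c"),  # miss
]
print(top_k_accuracy(gold, pred, k=2))  # 0.5
```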

Show evaluation result

In [None]:
import pandas
print("Accuracy")
pandas.DataFrame({"Top 1": accuracy, "Top 5": top_5_accuracy}, index=["fastText"])

Show the confusion matrix

In [None]:
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import seaborn

# Rows are gold labels and columns are predicted labels, both in sorted label order.
labels = sorted(set(x["label"] for x in val_dataset))
conf_matrix = confusion_matrix([x["label"] for x in val_dataset], [x["label"] for x in val_pred], labels=labels)
plt.figure(figsize=(20, 20))
seaborn.heatmap(conf_matrix)
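
To read the heatmap, it helps to recall how the matrix is built: row i, column j counts the samples whose gold label is `labels[i]` and whose predicted label is `labels[j]`, so correct predictions sit on the diagonal. The following is a pure-Python sketch of that counting on invented labels. `count_confusions` is a hypothetical helper, shown only to illustrate the layout that `sklearn.metrics.confusion_matrix` returns.

```python
def count_confusions(gold, pred, labels):
    """Return a len(labels) x len(labels) matrix: rows are gold, columns are predicted."""
    # Assumes every gold and predicted label appears in `labels`.
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for g, p in zip(gold, pred):
        matrix[index[g]][index[p]] += 1
    return matrix


gold = ["a", "a", "b", "b", "c"]
pred = ["a", "b", "b", "b", "a"]
print(count_confusions(gold, pred, labels=["a", "b", "c"]))
# [[1, 1, 0], [0, 2, 0], [1, 0, 0]]
```

In this toy case the off-diagonal 1 in the last row shows the single `c` sample being predicted as `a`.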