### Underthesea Tutorial: Train a Word Segmentation Model

By Vu Anh - [UndertheseaNLP](https://github.com/undertheseanlp/underthesea)

This tutorial will demonstrate how to employ machine learning techniques to train a Vietnamese word segmentation model. The resulting model can be saved and used through the underthesea API.

By the end of this tutorial, you will know how to preprocess the data, select and train a machine learning model, and evaluate its performance. You will also see how to tailor the model to Vietnamese, which improves segmentation accuracy on Vietnamese text.

### Set Up the Environment

In [None]:
#@title Installing required libraries
%%capture
! pip install underthesea seqeval datasets

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting underthesea
  Downloading underthesea-6.1.3-py3-none-any.whl (11.1 MB)
Collecting seqeval
  Downloading seqeval-1.2.2.tar.gz (43 kB)
  Preparing metadata (setup.py) ... done
Collecting datasets
  Downloading datasets-2.10.0-py3-none-any.whl (469 kB)
Collecting python-crfsuite>=0.9.6
  Downloading python_crfsuite-0.9.9-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.0 MB)

In [None]:
#@title Preparing training and testing data

from datasets import load_dataset
from underthesea.utils.preprocess_dataset import preprocess_word_tokenize_dataset

# @markdown Choose dataset
name = "undertheseanlp/UTS_WTK" #@param ["undertheseanlp/UTS_WTK"]
subset = "base" #@param ["small", "base", "large"]
dataset = load_dataset(name, subset)
corpus = preprocess_word_tokenize_dataset(dataset)

train_dataset = corpus["train"]
test_dataset = corpus["test"]
print("Train dataset", len(train_dataset))
print("Test dataset", len(test_dataset))

Downloading and preparing dataset uts_wtk/base to /root/.cache/huggingface/datasets/undertheseanlp___uts_wtk/base/1.0.0/356c535c138f7daf22bc8e3d40a88f3df2e0f6b5f9cfabe8431a4285c8294789...

Dataset uts_wtk downloaded and prepared to /root/.cache/huggingface/datasets/undertheseanlp___uts_wtk/base/1.0.0/356c535c138f7daf22bc8e3d40a88f3df2e0f6b5f9cfabe8431a4285c8294789. Subsequent calls will reuse this data.

Train dataset 8000
Test dataset 1000
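
The model trained in this tutorial labels each syllable as `B-W` (first syllable of a word) or `I-W` (continuation of the current word), as the prediction output at the end of the notebook shows. The following is a minimal sketch of that encoding, not the code used by `preprocess_word_tokenize_dataset`. The helper name `words_to_tags` is hypothetical, and it assumes that the syllables of a multi-syllable word are joined by an underscore, which this notebook does not confirm for UTS_WTK.

```python
def words_to_tags(words):
    """Convert segmented words into (syllable, tag) pairs.

    Hypothetical helper for illustration; it is not part of underthesea.
    Assumption: syllables of a multi-syllable word are joined by "_".
    """
    pairs = []
    for word in words:
        syllables = word.split("_")
        for position, syllable in enumerate(syllables):
            # The first syllable opens a word; the rest continue it.
            tag = "B-W" if position == 0 else "I-W"
            pairs.append((syllable, tag))
    return pairs


example_words = ["tiết_lộ", "với", "báo"]
example_pairs = words_to_tags(example_words)
for syllable, tag in example_pairs:
    print(syllable, "\t", tag)
```

With this encoding, word segmentation becomes a sequence labeling problem: the model only has to decide, for each syllable, whether it starts a new word.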


### Training and Prediction

In [None]:
#@title Training
from os.path import join
from underthesea.trainers.crf_trainer import CRFTrainer
from underthesea.transformer.tagged_feature import lower_words as dictionary
from underthesea.models.fast_crf_sequence_tagger import FastCRFSequenceTagger

features = [
    # word unigram and bigram and trigram
    "T[-2]", "T[-1]", "T[0]", "T[1]", "T[2]",
    "T[-2,-1]", "T[-1,0]", "T[0,1]", "T[1,2]",
    "T[-2,0]", "T[-1,1]", "T[0,2]",
    "T[-2].lower", "T[-1].lower", "T[0].lower", "T[1].lower", "T[2].lower",
    "T[-2,-1].lower", "T[-1,0].lower",
    "T[0,1].lower", "T[1,2].lower",
    "T[-1].isdigit", "T[0].isdigit", "T[1].isdigit", "T[-2].istitle",
    "T[-1].istitle", "T[0].istitle", "T[1].istitle", "T[2].istitle",
    "T[0,1].istitle", "T[0,2].istitle",
    "T[-2].is_in_dict", "T[-1].is_in_dict", "T[0].is_in_dict",
    "T[1].is_in_dict", "T[2].is_in_dict",
    "T[-2,-1].is_in_dict", "T[-1,0].is_in_dict", "T[0,1].is_in_dict",
    "T[1,2].is_in_dict", "T[-2,0].is_in_dict",
    "T[-1,1].is_in_dict", "T[0,2].is_in_dict",
]
model = FastCRFSequenceTagger(features, dictionary)

pwd = "."
output_dir = join(pwd, "tmp/ws")
training_params = {
    "output_dir": output_dir,
    "params": {
        "c1": 1.0,  # coefficient for L1 penalty
        "c2": 1e-3,  # coefficient for L2 penalty
        "max_iterations": 1000,  # upper bound on the number of training iterations
        # include transitions that are possible, but not observed
        "feature.possible_transitions": True,
        "feature.possible_states": True,
    },
}

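# Cap the training set at 50,000 sentences to stay within Google Colab's
# memory limit (this has no effect on the 8,000-sentence "base" subset)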
train_dataset = train_dataset[:50000]

trainer = CRFTrainer(model, training_params, train_dataset, test_dataset)

trainer.train()

['T[-2]', 'T[-1]', 'T[0]', 'T[1]', 'T[2]', 'T[-2,-1]', 'T[-1,0]', 'T[0,1]', 'T[1,2]', 'T[-2,0]', 'T[-1,1]', 'T[0,2]', 'T[-2].lower', 'T[-1].lower', 'T[0].lower', 'T[1].lower', 'T[2].lower', 'T[-2,-1].lower', 'T[-1,0].lower', 'T[0,1].lower', 'T[1,2].lower', 'T[-1].isdigit', 'T[0].isdigit', 'T[1].isdigit', 'T[-2].istitle', 'T[-1].istitle', 'T[0].istitle', 'T[1].istitle', 'T[2].istitle', 'T[0,1].istitle', 'T[0,2].istitle', 'T[-2].is_in_dict', 'T[-1].is_in_dict', 'T[0].is_in_dict', 'T[1].is_in_dict', 'T[2].is_in_dict', 'T[-2,-1].is_in_dict', 'T[-1,0].is_in_dict', 'T[0,1].is_in_dict', 'T[1,2].is_in_dict', 'T[-2,0].is_in_dict', 'T[-1,1].is_in_dict', 'T[0,2].is_in_dict']
2023-02-26 01:50:11,962 Start feature extraction
2023-02-26 01:50:24,391 Finish feature extraction
2023-02-26 01:50:24,393 Start train
Feature generation
type: CRF1d
feature.minfreq: 0.000000
feature.possible_states: 1
feature.possible_transitions: 1
0....1....2....3....4....5....6....7....8....9....10
Number of features: 244
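
The feature templates passed to the model follow a compact notation: `T[i]` refers to the token at offset `i` from the current position, `T[i,j]` to the span from offset `i` through offset `j`, and an optional suffix such as `.lower` or `.istitle` applies a function to that text. The sketch below illustrates one reading of this notation. It is not underthesea's implementation: the function `expand_template`, the `BOS`/`EOS` boundary markers, and the choice to join spans with a space are all assumptions, and the `is_in_dict` suffix is left out because it needs a dictionary.

```python
import re

# Matches templates such as "T[0]", "T[-1,0]" or "T[0,1].lower".
TEMPLATE = re.compile(r"T\[(-?\d+)(?:,(-?\d+))?\](?:\.(\w+))?$")

# Suffixes covered by this sketch; None means "use the raw text".
FUNCTIONS = {
    None: lambda text: text,
    "lower": str.lower,
    "isdigit": str.isdigit,
    "istitle": str.istitle,
}


def expand_template(template, syllables, i):
    """Evaluate one feature template at position i (illustrative only)."""
    match = TEMPLATE.match(template)
    if match is None or match.group(3) not in FUNCTIONS:
        raise ValueError(f"unsupported template: {template}")
    start = i + int(match.group(1))
    # A single offset describes a span of length one.
    end = i + int(match.group(2)) if match.group(2) is not None else start
    if start < 0:
        return "BOS"  # span begins before the sentence
    if end >= len(syllables):
        return "EOS"  # span runs past the end of the sentence
    text = " ".join(syllables[start:end + 1])
    return FUNCTIONS[match.group(3)](text)


syllables = ["báo", "Bồ", "Đào", "Nha"]
for template in ["T[0]", "T[-1,0].lower", "T[0,1].istitle", "T[-2]"]:
    print(template, "->", expand_template(template, syllables, 1))
```

Each syllable is described by many such values at once, which is how the CRF gets context from neighboring syllables when deciding between `B-W` and `I-W`.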

In [None]:
#@title Predict

from underthesea.models.fast_crf_sequence_tagger import FastCRFSequenceTagger

sentence = "Quỳnh Như tiết lộ với báo Bồ Đào Nha về hành trình làm nên lịch sử" #@param {type: "string"}
tokens = sentence.split()
tokens = [[token] for token in tokens]

model = FastCRFSequenceTagger()
model.load(output_dir)
y = model.predict(tokens)
for token, x in zip(tokens, y):
    print(token, "\t", x)

./tmp/ws
['Quỳnh'] 	 B-W
['Như'] 	 B-W
['tiết'] 	 B-W
['lộ'] 	 I-W
['với'] 	 B-W
['báo'] 	 B-W
['Bồ'] 	 B-W
['Đào'] 	 I-W
['Nha'] 	 B-W
['về'] 	 B-W
['hành'] 	 B-W
['trình'] 	 I-W
['làm'] 	 B-W
['nên'] 	 I-W
['lịch'] 	 B-W
['sử'] 	 I-W
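
The tags above are easier to read once they are merged back into words. The following is a minimal sketch of that decoding step: the helper `tags_to_words` is hypothetical and not part of the underthesea API, and the tag list is copied from the output above rather than produced by the model.

```python
def tags_to_words(syllables, tags):
    """Merge syllables into words according to B-W / I-W tags.

    Hypothetical helper for illustration; not part of underthesea.
    """
    words = []
    for syllable, tag in zip(syllables, tags):
        if tag == "I-W" and words:
            # Continue the word opened by the previous B-W tag.
            words[-1].append(syllable)
        else:
            # B-W (or a stray I-W at the very start) opens a new word.
            words.append([syllable])
    return [" ".join(word) for word in words]


sentence = "Quỳnh Như tiết lộ với báo Bồ Đào Nha về hành trình làm nên lịch sử"
syllables = sentence.split()
# Tags copied from the prediction output above.
tags = [
    "B-W", "B-W", "B-W", "I-W", "B-W", "B-W", "B-W", "I-W",
    "B-W", "B-W", "B-W", "I-W", "B-W", "I-W", "B-W", "I-W",
]
words = tags_to_words(syllables, tags)
print(" | ".join(words))
```

Reading the decoded words also makes segmentation errors visible. In this run, for example, the country name "Bồ Đào Nha" (Portugal) is split into "Bồ Đào" and "Nha".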
