### Install required packages

In [7]:
!pip install -qq -U spacy
!pip install -qq spacy-transformers

### Write the training `base config` file to the Kaggle working directory. To generate a different configuration, see [spaCy training config](https://spacy.io/usage/training)

In [8]:
%%writefile base_config.cfg
# This is an auto-generated partial config. To use it with 'spacy train'
# you can run spacy init fill-config to auto-fill all default settings:
# python -m spacy init fill-config ./base_config.cfg ./config.cfg
[paths]
train = null
dev = null

[system]
gpu_allocator = "pytorch"

[nlp]
lang = "bn"
pipeline = ["transformer","ner"]
batch_size = 128

[components]

[components.transformer]
factory = "transformer"

[components.transformer.model]
@architectures = "spacy-transformers.TransformerModel.v1"
name = "bert-base-multilingual-uncased"
tokenizer_config = {"use_fast": true}

[components.transformer.model.get_spans]
@span_getters = "spacy-transformers.strided_spans.v1"
window = 128
stride = 96

[components.ner]
factory = "ner"

[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "ner"
extra_state_tokens = false
hidden_width = 64
maxout_pieces = 2
use_upper = false
nO = null

[components.ner.model.tok2vec]
@architectures = "spacy-transformers.TransformerListener.v1"
grad_factor = 1.0

[components.ner.model.tok2vec.pooling]
@layers = "reduce_mean.v1"

[corpora]

[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
max_length = 0

[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
max_length = 0

[training]
accumulate_gradient = 3
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"

[training.optimizer]
@optimizers = "Adam.v1"

[training.optimizer.learn_rate]
@schedules = "warmup_linear.v1"
warmup_steps = 250
total_steps = 20000
initial_rate = 5e-5

[training.batcher]
@batchers = "spacy.batch_by_padded.v1"
discard_oversize = true
size = 2000
buffer = 256

[initialize]
vectors = ${paths.vectors}

Overwriting base_config.cfg
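
The `window = 128` and `stride = 96` settings under `[components.transformer.model.get_spans]` mean that long documents are cut into overlapping windows, each sharing 32 tokens with the next. The sketch below only illustrates that arithmetic in plain Python. The function name `strided_windows` is hypothetical, and this is not the `spacy-transformers` implementation, whose handling of the final window may differ.

```python
def strided_windows(n_tokens, window=128, stride=96):
    """Return (start, end) token offsets of overlapping windows covering n_tokens.

    Illustrative only: a simplified model of strided span selection,
    not the code used by spacy-transformers.
    """
    spans = []
    start = 0
    while start < n_tokens:
        end = min(start + window, n_tokens)
        spans.append((start, end))
        if end == n_tokens:
            break  # the last window reached the end of the document
        start += stride
    return spans


# A 300-token document is covered by three windows that overlap by 32 tokens.
print(strided_windows(300))  # [(0, 128), (96, 224), (192, 300)]
```

A smaller stride increases the overlap (more context per token) at the cost of more transformer forward passes per document.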

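The `[training.optimizer.learn_rate]` block configures a linear warmup followed by a linear decay. As a rough sketch of what those three numbers imply, the function below computes the rate at a given step. It follows my understanding of Thinc's `warmup_linear.v1` schedule, so treat it as an approximation; the name `warmup_linear_rate` is hypothetical.

```python
def warmup_linear_rate(step, initial_rate=5e-5, warmup_steps=250, total_steps=20000):
    """Approximate learning rate at `step` for a warmup-then-linear-decay schedule."""
    if step < warmup_steps:
        # Ramp up linearly from 0 to initial_rate over the warmup steps.
        factor = step / max(1, warmup_steps)
    else:
        # Decay linearly from initial_rate to 0 at total_steps, never below 0.
        factor = max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
    return factor * initial_rate


for step in (0, 125, 250, 10125, 20000):
    print(step, warmup_linear_rate(step))
```

Under these assumptions the rate peaks at `initial_rate` on step 250 and reaches zero at step 20000.
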

### Fill the `base config` file to produce the full `config` file

In [9]:
!python -m spacy init fill-config ./base_config.cfg ./config.cfg

2021-08-29 07:28:35.451647: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
✔ Auto-filled config with all values
✔ Saved config
config.cfg
You can now add your data and train your pipeline:
python -m spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy
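
Before starting a long training run, it can help to confirm that every component named in `[nlp] pipeline` has a matching `[components.<name>]` section. The sketch below does this with the standard-library `configparser`, which can read the section layout of a spaCy config when interpolation is disabled. The helper `missing_components` and the inline `SAMPLE_CFG` are hypothetical and only illustrate the idea; spaCy performs its own validation when it loads the config.

```python
import configparser
import json

# Trimmed-down stand-in for the config written above (illustration only).
SAMPLE_CFG = """
[nlp]
lang = "bn"
pipeline = ["transformer","ner"]

[components]

[components.transformer]
factory = "transformer"

[components.ner]
factory = "ner"
"""


def missing_components(cfg_text):
    """Return pipeline names that have no [components.<name>] section."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cfg_text)
    # The pipeline value is written as a JSON-style list of strings.
    pipeline = json.loads(parser["nlp"]["pipeline"])
    return [name for name in pipeline if not parser.has_section(f"components.{name}")]


print(missing_components(SAMPLE_CFG))  # []
```

To check the real file, the same function could be called with the text of `config.cfg`, for example `missing_components(open("config.cfg").read())`.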


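### Training
Pass the following to `spacy train`:
- The `config.cfg` file
- The GPU ID (`--gpu-id`)
- The directory where the model is saved (`--output`)
- The training data path (`--paths.train`)
- The validation data path (`--paths.dev`)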


In [10]:
!python -m spacy train ./config.cfg \
    --gpu-id 0 \
    --output models_multilingual_bert \
    --paths.train ../input/ner-data-9k/train_all.spacy \
    --paths.dev ../input/ner-data-9k/val_all.spacy

2021-08-29 07:28:43.423139: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
ℹ Saving to output directory: models_multilingual_bert
ℹ Using GPU: 0
[2021-08-29 07:28:46,866] [INFO] Set up nlp object from config
[2021-08-29 07:28:46,879] [INFO] Pipeline: ['transformer', 'ner']
[2021-08-29 07:28:46,884] [INFO] Created vocabulary
[2021-08-29 07:28:46,885] [INFO] Finished initializing nlp object
Some weights of the model checkpoint at bert-base-multilingual-uncased were not used when initializing BertModel: ['cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on anothe
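
The training command above can also be assembled in Python, which makes it easier to change the paths or GPU ID between runs. This is a minimal sketch: the helper `build_train_command` is hypothetical, and it only builds and prints the argument list using the same flags as the cell above. Nothing is launched unless the list is passed to `subprocess.run`.

```python
import shlex


def build_train_command(config, output_dir, train_path, dev_path, gpu_id=0):
    """Build the argument list for `python -m spacy train` with the flags used above."""
    return [
        "python", "-m", "spacy", "train", config,
        "--gpu-id", str(gpu_id),
        "--output", output_dir,
        "--paths.train", train_path,
        "--paths.dev", dev_path,
    ]


cmd = build_train_command(
    "./config.cfg",
    "models_multilingual_bert",
    "../input/ner-data-9k/train_all.spacy",
    "../input/ner-data-9k/val_all.spacy",
)
print(shlex.join(cmd))
# To start training from Python: subprocess.run(cmd, check=True)
```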

### Zip the trained model for easy download from Kaggle

In [11]:
!zip -r ./models_multilingual_bert.zip ./models_multilingual_bert/model-best

  adding: models_multilingual_bert/model-best/ (stored 0%)
  adding: models_multilingual_bert/model-best/transformer/ (stored 0%)
  adding: models_multilingual_bert/model-best/transformer/cfg (stored 0%)
  adding: models_multilingual_bert/model-best/transformer/model/ (stored 0%)
  adding: models_multilingual_bert/model-best/transformer/model/config.json (deflated 52%)
  adding: models_multilingual_bert/model-best/transformer/model/special_tokens_map.json (deflated 40%)
  adding: models_multilingual_bert/model-best/transformer/model/tokenizer.json (deflated 57%)
  adding: models_multilingual_bert/model-best/transformer/model/pytorch_model.bin (deflated 7%)
  adding: models_multilingual_bert/model-best/transformer/model/tokenizer_config.json (deflated 37%)
  adding: models_multilingual_bert/model-best/transformer/model/vocab.txt (deflated 48%)
  adding: models_multilingual_bert/model-best/meta.json (deflated 58%)
  adding: models_multilingual_bert/model-best/tokenizer (deflated 82%)
  a

In [12]:
# Check the size of the zipped model
!du -sh ./models_multilingual_bert.zip

594M	./models_multilingual_bert.zip
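
If the `zip` command is unavailable, the same archive can be produced from Python with the standard-library `shutil.make_archive`. The sketch below is one way to do it; the helper `zip_model` is hypothetical, and the demo runs on a throwaway directory rather than the real `model-best` folder.

```python
import shutil
import tempfile
from pathlib import Path


def zip_model(model_dir, archive_base):
    """Zip model_dir, keeping its folder name inside the archive; return the .zip path."""
    model_dir = Path(model_dir)
    return shutil.make_archive(
        str(archive_base),
        "zip",
        root_dir=str(model_dir.parent),
        base_dir=model_dir.name,
    )


# Demo on a temporary directory so the example runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    demo_model = Path(tmp) / "model-best"
    demo_model.mkdir()
    (demo_model / "meta.json").write_text("{}")
    archive_path = zip_model(demo_model, Path(tmp) / "demo_model")
    print(Path(archive_path).name)
```

For the model trained above, the call would be `zip_model("./models_multilingual_bert/model-best", "./models_multilingual_bert")`. Note that, unlike the `zip -r` cell, this stores paths starting at `model-best/` rather than `models_multilingual_bert/model-best/`.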


### Test the trained model on sample Bangla sentences

In [13]:
import spacy

nlp = spacy.load("./models_multilingual_bert/model-best")

text_list = [
    "আব্দুর রহিম নামের কাস্টমারকে একশ টাকা বাকি দিলাম",
    "১০০ টাকা জমা দিয়েছেন কবির",
    "ডিপিডিসির স্পেশাল টাস্কফোর্সের প্রধান মুনীর চৌধুরী জানান",
    "অগ্রণী ব্যাংকের জ্যেষ্ঠ কর্মকর্তা পদে নিয়োগ পরীক্ষার প্রশ্নপত্র ফাঁসের অভিযোগ উঠেছে।",
    "সে আজকে ঢাকা যাবে",
]
for text in text_list:
    doc = nlp(text)

    print(f"Input: {text}")
    for entity in doc.ents:
        print(f"Entity: {entity.text}, Label: {entity.label_}")
    print("---")


Input: আব্দুর রহিম নামের কাস্টমারকে একশ টাকা বাকি দিলাম
Entity: আব্দুর রহিম, Label: PER
---
Input: ১০০ টাকা জমা দিয়েছেন কবির
Entity: কবির, Label: PER
---
Input: ডিপিডিসির স্পেশাল টাস্কফোর্সের প্রধান মুনীর চৌধুরী জানান
Entity: মুনীর চৌধুরী, Label: PER
---
Input: অগ্রণী ব্যাংকের জ্যেষ্ঠ কর্মকর্তা পদে নিয়োগ পরীক্ষার প্রশ্নপত্র ফাঁসের অভিযোগ উঠেছে।
---
Input: সে আজকে ঢাকা যাবে
---
