<a href="https://colab.research.google.com/github/hduongck/AI-ML-Learning/blob/master/Fastai%20NLP%20course/7_Vietnamese_ULMFIT_from_scratch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

[Video 10](https://youtu.be/MDX_x6rKXAs?t=1793)

The language you're working with may not have pretrained Wikipedia model in Fastai. Today we will learn how to do a pretrained model yourself for other languages from scratch- Vietnamese.

In [0]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

from fastai import *
from fastai.text import *

In [0]:
# bs=48
bs=24
#bs=128

In [0]:
data_path = Config.data_path()

This will create a **viwiki** folder, containing a **viwiki** text file with the wikipedia contents. (For other languages, replace **vi** with the appropriate code from the list of wikipedias.).

Vietnamese code in Wikipedia is **vi**

In [0]:
lang = 'vi'
# lang = 'zh'

In [0]:
name =f'{lang}wiki'
path = data_path/name
path.mkdir(exist_ok=True,parents=True)
lm_fns = [f'{lang}_wt', f'{lang}_wt_vocab']

f'{lang}_wt' : language model wikitext
f'{lang}_wt_vocab' : vocab 

# Vietnamese wikipedia model (forward)


## Download data

In [0]:
#@title from nlputils import get_wiki, split_wiki
from fastai.basics import *
import re


def get_wiki(path,lang):
    name = f'{lang}wiki'
    if (path/name).exists():
        print(f"{path/name} already exists; not downloading")
        return

    xml_fn = f"{lang}wiki-latest-pages-articles.xml"
    zip_fn = f"{xml_fn}.bz2"

    if not (path/xml_fn).exists():
        print("downloading...")
        download_url(f'https://dumps.wikimedia.org/{name}/latest/{zip_fn}', path/zip_fn)
        print("unzipping...")
        bunzip(path/zip_fn)

    with working_directory(path):
        if not (path/'wikiextractor').exists(): os.system('git clone https://github.com/attardi/wikiextractor.git')
        print("extracting...")
        os.system("python wikiextractor/WikiExtractor.py --processes 4 --no_templates " +
            f"--min_text_length 1800 --filter_disambig_pages --log_file log -b 100G -q {xml_fn}")
    shutil.move(str(path/'text/AA/wiki_00'), str(path/name))
    shutil.rmtree(path/'text')


def split_wiki(path,lang):
    dest = path/'docs'
    name = f'{lang}wiki'
    if dest.exists():
        print(f"{dest} already exists; not splitting")
        return dest

    dest.mkdir(exist_ok=True, parents=True)
    title_re = re.compile(rf'<doc id="\d+" url="https://{lang}.wikipedia.org/wiki\?curid=\d+" title="([^"]+)">')
    lines = (path/name).open()
    f=None

    for i,l in enumerate(lines):
        if i%100000 == 0: print(i)
        if l.startswith('<doc id="'):
            title = title_re.findall(l)[0].replace('/','_')
            if len(title)>150: continue
            if f: f.close()
            f = (dest/f'{title}.txt').open('w')
        else: f.write(l)
    f.close()
    return dest

In [9]:
get_wiki(path,lang)

/root/.fastai/data/viwiki/viwiki already exists; not downloading


In [10]:
path.ls()

[PosixPath('/root/.fastai/data/viwiki/log'),
 PosixPath('/root/.fastai/data/viwiki/wikiextractor'),
 PosixPath('/root/.fastai/data/viwiki/viwiki'),
 PosixPath('/root/.fastai/data/viwiki/docs'),
 PosixPath('/root/.fastai/data/viwiki/viwiki-latest-pages-articles.xml'),
 PosixPath('/root/.fastai/data/viwiki/viwiki-latest-pages-articles.xml.bz2')]

Look at the first 4 lines of the file with **!head -n4**

In [11]:
!head -n4 {path}/{name}

<doc id="13" url="https://vi.wikipedia.org/wiki?curid=13" title="Tiếng Việt">
Tiếng Việt

Tiếng Việt, còn gọi tiếng Việt Nam, tiếng Kinh hay Việt ngữ, là ngôn ngữ của người Việt (dân tộc Kinh) và là ngôn ngữ chính thức tại Việt Nam. Đây là tiếng mẹ đẻ của khoảng 85% dân cư Việt Nam, cùng với hơn 4 triệu Việt kiều. Tiếng Việt còn là ngôn ngữ thứ hai của các dân tộc thiểu số tại Việt Nam. Mặc dù tiếng Việt có một số từ vựng vay mượn từ tiếng Hán và trước đây dùng chữ Nôm – một hệ chữ viết dựa trên chữ Hán – để viết nhưng tiếng Việt được coi là một trong số các ngôn ngữ thuộc ngữ hệ Nam Á có số người nói nhiều nhất (nhiều hơn một số lần so với các ngôn ngữ khác cùng hệ cộng lại). Ngày nay, tiếng Việt dùng bảng chữ cái Latinh, gọi là chữ Quốc ngữ, cùng các dấu thanh để viết.


because this giant downloaded file is hundreds of gigabytes, a bit difficult to work with. What I want to do is to split this file into multiple text files, one for each Wikipedia article. They always start with this pattern , a head on top of the text.

![alt text](https://github.com/hduongck/AI-ML-Learning/blob/master/Pic/split_wiki_re.png?raw=true)

Here we use regular expression:

`<doc id="13" url="https://vi.wikipedia.org/wiki?curid=13" title="Tiếng Việt">`

`rf'<doc id="\d+" url="https://{lang}.wikipedia.org/wiki\?curid=\d+" title="([^"]+)">'`
    
- replace the number id="13" with id=""\d+"" -> means one or more digits
- replace vi with {lang} -> this was replaced with my languague variable 
- replace curid="13" with curid=\d+ -> one or more digits
- replace title="Tieng Viet" with title="([^"]+)" -> something followed by anything.

`title = title_re.findall(l)[0].replace('/','_')` -> return the things in parentheses

**re.compile** -> we'll spending some time turning that string into a faster internal representation and things we are going over millions of lines it

                                                                                
                                                                                

In [12]:
dest = split_wiki(path,lang)

/root/.fastai/data/viwiki/docs already exists; not splitting


after using split_wiki, we end up with directory which contains millions of files. One for each Vietnamese Wikipedia article 

In [13]:
dest.ls()[:5]

[PosixPath('/root/.fastai/data/viwiki/docs/CyanogenMod.txt'),
 PosixPath('/root/.fastai/data/viwiki/docs/Thám tử lừng danh Conan: Sát thủ bắn tỉa không tưởng.txt'),
 PosixPath('/root/.fastai/data/viwiki/docs/Hệ Mặt Trời.txt'),
 PosixPath('/root/.fastai/data/viwiki/docs/Đảng Công nhân Kurd.txt'),
 PosixPath('/root/.fastai/data/viwiki/docs/Người Duy Ngô Nhĩ.txt')]

In [0]:
# Use this to convert Chinese traditional to simplified characters
# ls *.txt | parallel -I% opencc -i % -o ../zhsdocs/% -c t2s.json

## Create pretrained model

we create a language model as same as creating for IMDB dataset.

In [15]:
data = (TextList.from_folder(dest)
                .split_by_rand_pct(0.1,seed=42)
                .label_for_lm()
                .databunch(bs=bs,num_workers=1))

data.save(f'{lang}_databunch')
len(data.vocab.itos),len(data.train_ds)

(60000, 62631)

The difference is that we set `pretrained=False`. It will not try to download the English Wikipedia model and fine-tune it because we are not doing English. So this way it will start with random weights. 

Since it doesn't start with random weights, we will go straight to learn.unfreeze() step

In [0]:
learn = language_model_learner(data,AWD_LSTM,drop_mult=0.5,pretrained=False).to_fp16()

In [0]:
lr = 1e-2
lr *= bs/48 #scale learning rate by batch size

In [0]:
learn.unfreeze()
learn.fit_one_cycle(1,lr,moms=(0.8,0.7))


we can have accuracy above 40%

Now we are going to save the two parts of the language model :
- the 1st thing is the actual language model . In this case, I trained the language model with fp16, but most people mostly are going to want to actually get fp32, single precision language model, so I convert back to fp32 then save.

```
learn.to_fp32().save(mdl_path/lm_fns[0], with_opt=False)
```

- the 2nd thing to save is vocab. A langugage model starts with a bunch of word embeddings and each row of word embeddings represents a word. So vocab is list of unique words that we're training on.

Note: if you create a language model in Fastai, if you say pretrained = True which is default, it will download from Fastai the pretrained Wikipedia model and vocab that it was used. That's why we to save both.



In [0]:
mdl_path = path/'models'
mdl_path.mkdir(exist_ok=True)
learn.to_fp32().save(mdl_path/lm_fns[0], with_opt=False)
learn.data.vocab.save(mdl_path/(lm_fns[1] + '.pkl'))

In [57]:
mdl_path.ls()

[PosixPath('/root/.fastai/data/viwiki/models/vi_wt.pth'),
 PosixPath('/root/.fastai/data/viwiki/models/vi_wt_vocab.pkl'),
 PosixPath('/root/.fastai/data/viwiki/models/vifine_tuned_enc.pth'),
 PosixPath('/root/.fastai/data/viwiki/models/vifine_tuned.pth'),
 PosixPath('/root/.fastai/data/viwiki/models/viclas.pth')]

In [56]:
shutil.move('/content/vi_wt.pth','/root/.fastai/data/viwiki/models/')

'/root/.fastai/data/viwiki/models/vi_wt.pth'

## Vietnamese sentiment analysis

We need to have a baseline. We need two things : a dataset in vietnamese and example of somebody who try to use that sentiment analysis dataset to predict sentiment.



### Language model:

- [Data](https://github.com/ngxbac/aivivn_phanloaisacthaibinhluan/tree/master/data)
- [Competition details](https://www.aivivn.com/contests/1)
- Top 3 f1 scores: 0.900, 0.897, 0.897

The dataset is different to IDMB dataset. Because the reviews of IMDB dataset were generally about 1500 to 2000 words. These Vietnamese reviews are much shorter. So RNNs are particular effective for longer texts. Shorter texts are often not difficult to handle , RNN can do well. But they are not to be as exceptional

In [20]:
!git clone https://github.com/ngxbac/aivivn_phanloaisacthaibinhluan.git

fatal: destination path 'aivivn_phanloaisacthaibinhluan' already exists and is not an empty directory.


In [23]:
train_df = pd.read_csv('/content/aivivn_phanloaisacthaibinhluan/data/train.csv')
train_df.loc[pd.isna(train_df.comment),'comment']='NA'
train_df.head()

Unnamed: 0,id,comment,label
0,train_000000,Dung dc sp tot cam on \nshop Đóng gói sản phẩm...,0
1,train_000001,Chất lượng sản phẩm tuyệt vời . Son mịn nhưng...,0
2,train_000002,Chất lượng sản phẩm tuyệt vời nhưng k có hộp ...,0
3,train_000003,:(( Mình hơi thất vọng 1 chút vì mình đã kỳ vọ...,1
4,train_000004,Lần trước mình mua áo gió màu hồng rất ok mà đ...,1


In [25]:
test_df = pd.read_csv('/content/aivivn_phanloaisacthaibinhluan/data/test.csv')
test_df.loc[pd.isna(test_df.comment),'comment']='NA'
test_df.head()

Unnamed: 0,id,comment
0,test_000000,Chưa dùng thử nên chưa biết
1,test_000001,Không đáng tiềnVì ngay đợt sale nên mới mua n...
2,test_000002,Cám ơn shop. Đóng gói sản phẩm rất đẹp và chắc...
3,test_000003,Vải đẹp.phom oki luôn.quá ưng
4,test_000004,Chuẩn hàng đóng gói đẹp


Combinde train and test set.

In [0]:
df = pd.concat([train_df,test_df],sort=False)

In [0]:
data_lm = (TextList.from_df(df,path,cols='comment')
                    .split_by_rand_pct(0.1,seed=42)
                    .label_for_lm()
                    .databunch(bs=bs,num_workers=1))

In [0]:
learn_lm = language_model_learner(data_lm,AWD_LSTM,pretrained_fnames=lm_fns,drop_mult=1.0)

**pretrained_fnames=lm_fns** : we pass in an array of model file name and vocab file name . That is **`lm_fns = [f'{lang}_wt', f'{lang}_wt_vocab']`**

That how we use a pretrained Vietnamese model to create a fine-tuned language model for Vietnamese sentiment analysis. Then we fit the codes identical to IDMB.

In [0]:
lr = 1e-3
lr *= bs/48

In [30]:
learn_lm.fit_one_cycle(2,lr*10,moms=(0.8,0.7))

epoch,train_loss,valid_loss,accuracy,time
0,4.558015,4.233903,0.317331,00:33
1,4.274643,4.110116,0.324308,00:33


In [31]:
learn_lm.unfreeze()
learn_lm.fit_one_cycle(8,lr,moms=(0.8,0.7))

epoch,train_loss,valid_loss,accuracy,time
0,4.183505,4.037032,0.332821,00:43
1,4.040416,3.908898,0.344075,00:44
2,3.952565,3.814539,0.352035,00:44
3,3.83878,3.752706,0.356838,00:44
4,3.773993,3.717153,0.360341,00:44
5,3.763338,3.696043,0.362154,00:44
6,3.726562,3.688317,0.362486,00:44
7,3.702669,3.687172,0.362985,00:44


We are getting accuracy of 37%

In [0]:
learn_lm.save(f'{lang}fine_tuned')
learn_lm.save_encoder(f'{lang}fine_tuned_enc')

### Classifier

In [0]:
data_clas = (TextList.from_df(train_df,path,vocab=data_lm.vocab,cols='comment')
                     .split_by_rand_pct(0.1,seed=42)
                     .label_from_df(cols='label')
                     .databunch(bs=bs,num_workers=1))

data_clas.save(f'{lang}_textlist_class')
                

In [0]:
data_clas = load_data(path,f'{lang}_textlist_class',bs=bs,num_workers=1)

The competition use the f1 score which is the average of precision and recall. There isn't a binary f1 built into Fastai but there is 1 built into sklearn lib. 

In sklearn lib, the func is assumed it getting numpy arrays not torch tensors. So we can create a function which simply called **f1_score** version and add **@np_func** decorator from fastai, it will convert `def f1(inp,targ): return f1_score(targ,np.argmax(inp,axis=-1))` to work with tensors instead of arrays. -> this is nice little trick to use any sklearn metric as a pytorch metric.

In learner, we just pass in **metric=[accuracy,f1]**


In [0]:
from sklearn.metrics import f1_score

@np_func
def f1(inp,targ): return f1_score(targ,np.argmax(inp,axis=-1))

In [0]:
learn_c = text_classifier_learner(data_clas,AWD_LSTM,drop_mult=0.5,metrics=[accuracy,f1]).to_fp16()
learn_c.load_encoder(f'{lang}fine_tuned_enc')
learn_c.freeze()

In [0]:
lr=2e-2
lr *= bs/48

In [39]:
learn_c.fit_one_cycle(2,lr,moms=(0.8,0.7))


epoch,train_loss,valid_loss,accuracy,f1,time
0,0.365764,0.325883,0.850124,0.824823,00:10
1,0.351074,0.299649,0.864428,0.835478,00:10


In [40]:
learn_c.fit_one_cycle(2,lr,moms=(0.8,0.7))

epoch,train_loss,valid_loss,accuracy,f1,time
0,0.351994,0.305872,0.86505,0.831658,00:10
1,0.337251,0.28179,0.884328,0.850461,00:11


In [41]:
learn_c.freeze_to(-2)
learn_c.fit_one_cycle(2,slice(lr/(2.6**4),lr/2),moms=(0.8,0.7))

epoch,train_loss,valid_loss,accuracy,f1,time
0,0.305439,0.257075,0.894901,0.861184,00:13
1,0.30813,0.244332,0.902985,0.876422,00:12


In [42]:
learn_c.freeze_to(-3)
learn_c.fit_one_cycle(2, slice(lr/2/(2.6**4),lr/2), moms=(0.8,0.7))

epoch,train_loss,valid_loss,accuracy,f1,time
0,0.276878,0.245119,0.902363,0.873948,00:19
1,0.224222,0.239502,0.902985,0.872602,00:18


In [43]:
learn_c.unfreeze()
learn_c.fit_one_cycle(1, slice(lr/10/(2.6**4),lr/10), moms=(0.8,0.7))

epoch,train_loss,valid_loss,accuracy,f1,time
0,0.219501,0.243153,0.899254,0.869283,00:24


In [0]:
learn_c.save(f'{lang}clas')

# Vietnamese ULMFiT from scratch (backwards)

we end up around 89% to 90% while Competition top 3 f1 scores: 0.90, 0.89, 0.89. Winner used an ensemble of 4 models: TextCNN, VDCNN, HARNN, and SARNN.

There is a trick that you can improve any model by creating a backward model. A backward langugage model is something which you dont feed wikipedia article , but you feed wikipedia article **in reverse**. 

```
data_clas = load_data(path, f'{lang}_textlist_class_bwd', bs=bs, num_workers=1, backwards=True)
```



That means a language model trained on that will learn to predict the previous word of a sentence given the last words. It's quite a weird thing to do but if you think about it: in language, the next words very often tells you context about what the previous word might have been or how to interpret it. So predicting a previous word of a sentence is likely to be just as useful as predicting the next word of the sentence .

**What we did, we re-train the whole model above in backward way.**

In [0]:
lm_fns_bwd = [f'{lang}_wt_bwd', f'{lang}_wt_vocab_bwd']

### pretrained model

In [0]:
data_bwd = (TextList.from_folder(dest)
                    .split_by_rand_pct(0.1,seed=42)
                    .label_for_lm()
                    .databunch(bs=bs,num_workers=1,backwards=True))

data_bwd.save(f'{lang}_databunch_bwd')

In [0]:
data_bwd =load_data(dest,f'{lang}_databunch_bwd',bs=bs,backwards=True)

In [0]:
learn_bwd = language_model_learner(data_bwd,AWD_LSTM,drop_mult=0.5,pretrained=False).to_fp16()

In [0]:
lr = 3e-3
lr *= bs/48 #scale learning rate by batch_size

In [63]:
learn_bwd.unfreeze()
learn_bwd.fit_one_cycle(1,lr,moms=(0.8,0.7)) # number of epochs shoule be 10

epoch,train_loss,valid_loss,accuracy,time


Buffered data was truncated after reaching the output size limit.

In [0]:
learn_bwd.to_fp32().save(mdl_path/lm_fns_bwd[0],with_opt=False)
learn_bwd.data.vocab.save(mdl_path/(lm_fns_bwd[1]+'.pkl'))

In [82]:
mdl_path.ls()

[PosixPath('/root/.fastai/data/viwiki/models/vi_wt.pth'),
 PosixPath('/root/.fastai/data/viwiki/models/vi_wt_vocab.pkl'),
 PosixPath('/root/.fastai/data/viwiki/models/vifine_tuned_enc_bwd.pth'),
 PosixPath('/root/.fastai/data/viwiki/models/vi_wt_vocab_bwd.pkl'),
 PosixPath('/root/.fastai/data/viwiki/models/vifine_tuned_enc.pth'),
 PosixPath('/root/.fastai/data/viwiki/models/vifine_tuned.pth'),
 PosixPath('/root/.fastai/data/viwiki/models/vi_wt_bwd.pth'),
 PosixPath('/root/.fastai/data/viwiki/models/viclas.pth'),
 PosixPath('/root/.fastai/data/viwiki/models/vifine_tuned_bwd.pth')]

In [84]:
shutil.copy('/root/.fastai/data/viwiki/models/vi_wt_vocab_bwd.pkl','/content/')

'/content/vi_wt_vocab_bwd.pkl'

### Language model

In [0]:
train_df = pd.read_csv('/content/aivivn_phanloaisacthaibinhluan/data/train.csv')
train_df.loc[pd.isna(train_df.comment),'comment']='NA'

test_df = pd.read_csv('/content/aivivn_phanloaisacthaibinhluan/data/train.csv')
test_df.loc[pd.isna(test_df.comment),'comment']='NA'
test_df['label'] = 0

df = pd.concat([train_df,test_df])

In [0]:
data_lm_bwd = (TextList.from_df(df,path,cols='comment')
                       .split_by_rand_pct(0.1,seed=42)
                       .label_for_lm()
                       .databunch(bs=bs,num_workers=1,backwards=True))
learn_lm_bwd = language_model_learner(data_lm_bwd,AWD_LSTM,config={**awd_lstm_lm_config,'n_hid':1152},
                                      pretrained_fnames=lm_fns_bwd,drop_mult=1.0)

In [0]:
lr = 1e-3
lr *=bs/48

In [88]:
learn_lm_bwd.fit_one_cycle(2,lr*10,moms=(0.8,0.7))

epoch,train_loss,valid_loss,accuracy,time
0,4.302429,3.994674,0.336155,00:41
1,4.095187,3.872484,0.345929,00:38


In [89]:
learn_lm_bwd.unfreeze()
learn_lm_bwd.fit_one_cycle(2,lr,moms=(0.8,0.7))

epoch,train_loss,valid_loss,accuracy,time
0,3.825342,3.64623,0.371119,00:53
1,3.710225,3.584261,0.377988,00:53


In [0]:
learn_lm_bwd.save(f'{lang}fine_tuned_bwd')
learn_lm_bwd.save_encoder(f'{lang}fine_tuned_enc_bwd')

### Classfier

In [0]:
data_clas_bwd = (TextList.from_df(train_df, path, vocab=data_lm_bwd.vocab, cols='comment')
    .split_by_rand_pct(0.1, seed=42)
    .label_from_df(cols='label')
    .databunch(bs=bs, num_workers=1, backwards=True))

data_clas_bwd.save(f'{lang}_textlist_class_bwd')

In [0]:
data_clas_bwd = load_data(path, f'{lang}_textlist_class_bwd', bs=bs, num_workers=1, backwards=True)

In [0]:
from sklearn.metrics import f1_score

@np_func
def f1(inp,targ): return f1_score(targ, np.argmax(inp, axis=-1))

In [0]:
learn_c_bwd = text_classifier_learner(data_clas_bwd, AWD_LSTM, drop_mult=0.5, metrics=[accuracy,f1]).to_fp16()
learn_c_bwd.load_encoder(f'{lang}fine_tuned_enc_bwd')
learn_c_bwd.freeze()

In [0]:
lr=2e-2
lr *= bs/48

In [96]:
learn_c_bwd.fit_one_cycle(2, lr, moms=(0.8,0.7))

epoch,train_loss,valid_loss,accuracy,f1,time
0,0.396237,0.336419,0.843905,0.800749,00:10
1,0.377389,0.313789,0.863806,0.827192,00:11


In [97]:
learn_c_bwd.freeze_to(-2)
learn_c_bwd.fit_one_cycle(2, slice(lr/(2.6**4),lr), moms=(0.8,0.7))

epoch,train_loss,valid_loss,accuracy,f1,time
0,0.323246,0.275257,0.883085,0.846457,00:13
1,0.27487,0.243117,0.901119,0.871046,00:14


In [98]:
learn_c_bwd.freeze_to(-3)
learn_c_bwd.fit_one_cycle(2, slice(lr/2/(2.6**4),lr/2), moms=(0.8,0.7))

epoch,train_loss,valid_loss,accuracy,f1,time
0,0.262463,0.232588,0.904229,0.879229,00:20
1,0.247088,0.223597,0.914179,0.888134,00:18


In [99]:
learn_c_bwd.unfreeze()
learn_c_bwd.fit_one_cycle(1, slice(lr/10/(2.6**4),lr/10), moms=(0.8,0.7))

epoch,train_loss,valid_loss,accuracy,f1,time
0,0.199528,0.222531,0.906095,0.875279,00:26


In [0]:

learn_c_bwd.save(f'{lang}clas_bwd')

# Ensemble 

[Video 10](https://youtu.be/MDX_x6rKXAs?t=3378)

Now we have two models : forward and backward . What we simple do, we ensemble them together. In Ensemble, it simply means you combine the prediction of two models, often by taking the average . 

Loading up forward classfier 

In [0]:
data_clas = load_data(path,f'{lang}_textlist_class',bs=bs,num_workers=1)
learn_c = text_classifier_learner(data_clas,AWD_LSTM,drop_mult=0.5,metrics=[accuracy,f1]).to_fp16()
learn_c.load(f'{lang}clas',purge=False);

get the prediction of forward classifier

In [102]:
preds,targs=learn_c.get_preds(ordered=True)
accuracy(preds,targs),f1(preds,targs)

(tensor(0.8993), tensor(0.8818))

Loading up the backward classifer

In [0]:
data_clas_bwd = load_data(path,f'{lang}_textlist_class_bwd',bs=bs,num_workers=1,backwards=True)
learn_c_bwd = text_classifier_learner(data_clas_bwd,AWD_LSTM,drop_mult=0.5,metrics=[accuracy,f1]).to_fp16()
learn_c_bwd.load(f'{lang}clas_bwd',purge=False);

get the prediction of backward classifier

In [104]:
preds_b,targs_b=learn_c_bwd.get_preds(ordered=True)
accuracy(preds_b,targs_b),f1(preds_b,targs_b)

(tensor(0.9061), tensor(0.8891))

take the average of two predictions and then calculate the accuracy and f1 score , 0.9154 and 0.9016 respectively.

The work here included downloading the Vietnamese wikipedia, and finding a Vietnamese sentiment analysis, we have won the popular competition. The trick of ensemble in the forwards and backwards can often give you a significant lift, particularly in the situation you don't have many data. 

**The appoach of creating your own pretrained model, it's not only useful for different language, but it's extremely useful if you're dealing with a extremely different domain , particular one would be medicine if you have a huge vocabulary of medical terms . If it appears on Wikipedia, you would create a pretrained model.**