We create a domain-specific dictionary of 2000+ Indonesian-to-English words. These are the Indonesian words present in the 'title' field of the training data provided by Shopee. We explore various versions of BERT, BART and Google Translate to make sense of this data.

In [1]:
import numpy as np
import pandas as pd

In [2]:
pd.set_option('display.max_colwidth', None)

train = pd.read_csv('../input/shopee-product-matching/train.csv')
train['titleUcase'] = train['title'].str.upper()

Let us try to get the number of unique words across all records. The 'set' is just about the right data structure

In [3]:
import re

unique_words = set()
all_words = train['titleUcase'].str.findall(r'\w+')  # raw string avoids an invalid-escape warning

for x in all_words:
    words = []
    for item in x:
        if item.isalpha() and len(item)>2:
            words.append(item)
    unique_words.update(words)

In [4]:
len(unique_words)

18343
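The same filtering can be written as a single set comprehension. A minimal sketch on a toy list of titles; the sample titles and the helper name `unique_alpha_words` are made up for illustration and are not part of the notebook:

```python
import re

def unique_alpha_words(titles, min_len=3):
    """Collect distinct, purely alphabetic words of at least min_len characters."""
    return {
        word
        for title in titles
        for word in re.findall(r"\w+", title.upper())
        if word.isalpha() and len(word) >= min_len
    }

# Hypothetical titles, only to show which tokens survive the filter.
sample_titles = ["Kaos wanita 2 pcs", "KAOS pria XL murah"]
print(sorted(unique_alpha_words(sample_titles)))
```

Numbers and two-letter tokens such as "XL" are dropped, and the repeated word is counted once.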

In [5]:
from nltk.corpus import words
setofwords = set(words.words())

Unfortunately this does not always work for plurals, though it seems to work for everything else.

In [6]:
print("run","run" in setofwords)
print("ran","ran" in setofwords)
print("running","running" in setofwords)
print("ren","ren" in setofwords)
print("dog","dog" in setofwords)
print("dogs","dogs" in setofwords)
print("horse","horse" in setofwords)
print("horses","horses" in setofwords)
print("sony","sony" in setofwords)
print("samsung","samsung" in setofwords)

run True
ran True
running True
ren False
dog True
dogs True
horse True
horses False
sony False
samsung False


You would expect it to be consistent (if not thorough). So this puts a question mark on other words as well. Note that this would also not include joint words (which BERT nicely breaks into constituents). Anyway we want a ballpark, so we can go on

In [7]:
unique_english = []

for item in unique_words:
    if item.lower() in setofwords:
        unique_english.append(item)
    else:
        if (item[-1] == 'S') and (item[:-1].lower() in setofwords) and (len(item)>3):
            unique_english.append(item)
        
print(len(unique_english),'\n')
print(unique_english[:100])

5140 

['SAVVY', 'BELLY', 'RENOVATOR', 'LOTION', 'TAWAS', 'SYRUP', 'COCA', 'FOLDABLE', 'BUBBLE', 'TINCTURE', 'PRIMING', 'TAPER', 'CAFFEINE', 'DARKENING', 'POCKET', 'CHEF', 'KIWI', 'ORY', 'SPRAYER', 'TRANSFORMERS', 'BATHROOM', 'HOBBY', 'NAME', 'TIGER', 'USE', 'ELLIPSE', 'TANGA', 'ETHYL', 'THEFT', 'MACAN', 'MANIS', 'CONTACT', 'CROCS', 'TRACK', 'DURIAN', 'SODIUM', 'DIALOGUE', 'DULLNESS', 'WISTERIA', 'HOST', 'BOTTLE', 'PANDAN', 'INNOVATION', 'THRASHER', 'KNOW', 'BEAR', 'LITHIUM', 'SNORKEL', 'SCARLET', 'PUBLIC', 'GLOBAL', 'REPEATER', 'SKATE', 'FOLD', 'BOOKS', 'EAGLE', 'TIDE', 'JUMBO', 'INTERLOCK', 'REFUND', 'SPRINKLES', 'SHIELD', 'AWAKE', 'ANCHOR', 'TOYS', 'FROTHER', 'MAGNETIC', 'CURVE', 'BLAZE', 'REEVE', 'MAUVE', 'KLOP', 'SIMMER', 'SHIMMER', 'AESTHETIC', 'DICER', 'PROSES', 'ESCAPE', 'QUANTUM', 'RAYON', 'PANORAMA', 'WALNUT', 'PADDOCK', 'NOTIFICATION', 'YOUR', 'DODGERS', 'CREAMER', 'SPY', 'CHEMICAL', 'SINUSITIS', 'SEPT', 'TABLEWARE', 'OVERNIGHT', 'GIRAFFE', 'SIP', 'GARDEN', 'POOR', 'BUN', 'D

I didn't know that Baud, Doff, Wen etc. were English words, but apparently they are, as per the dictionary. The challenge is that these could also be Indonesian words with a different meaning. Anyway, let us move on.

In [8]:
unique_nonenglish = unique_words - set(unique_english)
print(len(unique_nonenglish), '\n')
print(list(unique_nonenglish)[:100])

13203 

['SUPRACOLOR', 'EXFERET', 'CELUP', 'MORENO', 'JAMSUI', 'HVS', 'JOGJA', 'SARAN', 'BERGARANSI', 'KETOMBE', 'YESBABY', 'NAPOLITAN', 'PREWALKER', 'ARMOURED', 'KURNIAWAN', 'CYNCYN', 'TERPAL', 'BELIBIS', 'MINICART', 'LYCHEE', 'ASAH', 'ERITSA', 'KAOS', 'PAKAIANWANITA', 'MAYONAIS', 'COLEK', 'SUNQUICK', 'MURAHPOP', 'CASA', 'SUMPAH', 'ICECREAM', 'LBS', 'ARTELLY', 'HPL', 'SMOKEY', 'RESLETING', 'MENSPAD', 'MMI', 'KENCORO', 'SUBSIDI', 'YONGKIDZ', 'JAHITAN', 'AKUARIUM', 'NOTOARTO', 'SYAWAL', 'PUSING', 'UNICOREN', 'NOUVO', 'SARANA', 'DORENG', 'FIO', 'RIVIERA', 'DLUSIA', 'BROWNIS', 'ESGOTADO', 'SHALWA', 'BERASA', 'LOACKER', 'KUMAN', 'JELLI', 'SOLED', 'MAGRA', 'OMEGAKIDS', 'FRESHY', 'LUWAK', 'PAKAI', 'SARINGAN', 'SITRONELLA', 'ENMAC', 'DOBEL', 'BAREFOOD', 'KACA', 'ZWITZAL', 'GISELLA', 'JASTIP', 'FIORE', 'BIRCAP', 'CMS', 'TOSHIBA', 'MAURA', 'GOSWIM', 'MADILOG', 'PELIPAT', 'KARBU', 'ORICAT', 'GLADIS', 'PANTENE', 'NAFAS', 'TBSD', 'ANGGARAN', 'JEANSWASH', 'AKTIVASI', 'SUSPENSI', 'CREAMATTE', 'ONDA'

But a GOOD chunk of it will include brand name, location, org names etc.

Let us see if we can get a rough idea of number of non-English words which are not names, brands, orgs etc. We will use Google translate API

In [9]:
!pip install google_trans_new

Collecting google_trans_new
  Downloading google_trans_new-1.1.9-py3-none-any.whl (9.2 kB)
Installing collected packages: google-trans-new
Successfully installed google-trans-new-1.1.9


In [10]:
from google_trans_new import google_translator  
translator = google_translator()  
translate_text = translator.translate('wanita',lang_tgt='en')  
print(translate_text)

woman 


Google Translate severely limits the number of concurrent requests. It blocks the IP if it is called in a loop. So we will have to discard it.

Let us use the dictionary provided below. 


https://raw.githubusercontent.com/sastrawi/sastrawi/master/data/kata-dasar.original.txt

This is under MIT license

Copyright (c) 2015 Andy Librian

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

In [11]:
with open ('../input/indoneseanwords/kata-dasar.original.txt', "r") as myfile:
    indo_dict=set(myfile.read().splitlines())

In [12]:
train_indo_words = []

for item in unique_nonenglish:
    if item.lower() in indo_dict:
        train_indo_words.append(item)

In [13]:
unique_nonenglish_nonindo = unique_nonenglish - set(train_indo_words)
print(len(train_indo_words), '\n')
print(train_indo_words[:100])

2727 

['CELUP', 'SARAN', 'KETOMBE', 'TERPAL', 'BELIBIS', 'ASAH', 'KAOS', 'COLEK', 'SUMPAH', 'SUBSIDI', 'AKUARIUM', 'SYAWAL', 'PUSING', 'SARANA', 'KUMAN', 'PAKAI', 'DOBEL', 'KACA', 'SUSPENSI', 'JILID', 'TANGKI', 'TULIS', 'CANGKOK', 'PROMOSI', 'KARDUS', 'BINTANG', 'LOKI', 'VOAL', 'LANSIA', 'KUALITAS', 'HIDUP', 'NARSIS', 'TERMOS', 'RISALAH', 'KAPORIT', 'ROTASI', 'OPPO', 'SOKET', 'RESEP', 'PANAS', 'KRONIS', 'SINERGI', 'CEMBUNG', 'MESJID', 'DEWASA', 'SABLON', 'PUTRA', 'ANTING', 'NYAI', 'HOKI', 'KENCING', 'SERU', 'URAT', 'BILAS', 'TOMAT', 'AKHIR', 'UMBUL', 'SUPLEMEN', 'BUNUH', 'TASEL', 'IBU', 'KEDAUNG', 'SEPEDA', 'MADANI', 'MOHON', 'SERANGGA', 'LAKU', 'GRADASI', 'PENSIL', 'PADAT', 'BEKU', 'KONTROL', 'PERTAMA', 'BROSUR', 'UMRAH', 'PRODUSEN', 'KETAN', 'SPONTAN', 'GEPENG', 'BEDAK', 'ANTIBODI', 'IBUN', 'MAYUNG', 'HAJAR', 'SAMUDRA', 'REKAT', 'IMITASI', 'PUTING', 'SIRIP', 'LATIH', 'MALAM', 'REKAYASA', 'RAMANDA', 'ORIGAMI', 'LENTERA', 'KAUS', 'ALAT', 'SIMPEL', 'SESAT', 'BUKU']


In [14]:
print(list(unique_nonenglish)[:200])

['SUPRACOLOR', 'EXFERET', 'CELUP', 'MORENO', 'JAMSUI', 'HVS', 'JOGJA', 'SARAN', 'BERGARANSI', 'KETOMBE', 'YESBABY', 'NAPOLITAN', 'PREWALKER', 'ARMOURED', 'KURNIAWAN', 'CYNCYN', 'TERPAL', 'BELIBIS', 'MINICART', 'LYCHEE', 'ASAH', 'ERITSA', 'KAOS', 'PAKAIANWANITA', 'MAYONAIS', 'COLEK', 'SUNQUICK', 'MURAHPOP', 'CASA', 'SUMPAH', 'ICECREAM', 'LBS', 'ARTELLY', 'HPL', 'SMOKEY', 'RESLETING', 'MENSPAD', 'MMI', 'KENCORO', 'SUBSIDI', 'YONGKIDZ', 'JAHITAN', 'AKUARIUM', 'NOTOARTO', 'SYAWAL', 'PUSING', 'UNICOREN', 'NOUVO', 'SARANA', 'DORENG', 'FIO', 'RIVIERA', 'DLUSIA', 'BROWNIS', 'ESGOTADO', 'SHALWA', 'BERASA', 'LOACKER', 'KUMAN', 'JELLI', 'SOLED', 'MAGRA', 'OMEGAKIDS', 'FRESHY', 'LUWAK', 'PAKAI', 'SARINGAN', 'SITRONELLA', 'ENMAC', 'DOBEL', 'BAREFOOD', 'KACA', 'ZWITZAL', 'GISELLA', 'JASTIP', 'FIORE', 'BIRCAP', 'CMS', 'TOSHIBA', 'MAURA', 'GOSWIM', 'MADILOG', 'PELIPAT', 'KARBU', 'ORICAT', 'GLADIS', 'PANTENE', 'NAFAS', 'TBSD', 'ANGGARAN', 'JEANSWASH', 'AKTIVASI', 'SUSPENSI', 'CREAMATTE', 'ONDA', 'ADONI

Let us check out the BERT version for the Indonesian language. Huggingface provides the nice AutoTokenizer and AutoModelWithLMHead options that help auto-choose models and tokenizers without us having to specify things in great detail. Let us try it out.

They have several translation models from the University of Helsinki in their transformer model zoo. We use opus-mt-id-en. Basically, for any supported language pair you can download a model with the name Helsinki-NLP/opus-mt-{lang}-{target_lang}, where lang is the language code for the source language and target_lang is the language code for the language we want to translate to.
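The naming pattern can be wrapped in a tiny helper. This is only a sketch: the function name `opus_mt_model_name` is made up here, and it just builds the string. It does not check that a model for that pair has actually been published on the hub.

```python
def opus_mt_model_name(source_lang, target_lang):
    """Build a model id following the Helsinki-NLP/opus-mt-{src}-{tgt} pattern."""
    return f"Helsinki-NLP/opus-mt-{source_lang}-{target_lang}"

# Indonesian to English, the pair used in this notebook.
print(opus_mt_model_name("id", "en"))
```

The resulting string is what gets passed to `from_pretrained` in the next cell.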

In [15]:
from transformers import AutoTokenizer, AutoModelWithLMHead

model_helinski = AutoModelWithLMHead.from_pretrained('Helsinki-NLP/opus-mt-id-en')
tok_helinski = AutoTokenizer.from_pretrained('Helsinki-NLP/opus-mt-id-en')




We will use a pipeline that abstracts most of the code and provides a nice API for most NLP tasks. In our case, we are interested in translation.

It may seem like there is no need for pipelines or auto-tokenizers etc., but believe me, when you start exploring the thousands of models in the zoo these features really come in handy.

In [16]:
from transformers import pipeline

translation = pipeline('translation_id_to_en', model=model_helinski, tokenizer=tok_helinski)

Ok. Let us see how good the translator is

In [17]:
for item in train_indo_words[:25]:
    translated_text = translation(item)[0]['translation_text']
    if item != translated_text:  # show only the words the model actually changed
        print(item, translated_text)

CELUP Please, come on, come on, come on, come on, come on, come on, come on, come on, come on, come on, come on, come on, come on, come on, come on, come on, come on, come on, come on, come on, come on, come on, come on, come on, come on, come on, come on, come on, come on, come on.
SARAN SUGGESTION
TERPAL DISAPPOINTED
ASAH U.S.A.H.
SUMPAH ANIMALS
AKUARIUM I'M SEARCHING
SYAWAL SYAWL
SUSPENSI SUSPENSION
TULIS WRITE
PROMOSI PROMOTION
KARDUS CARDUS


Unfortunately about half the translations are wrong. We can't use this. The SOTA in translation is BART by Facebook, in particular mBART-50, which came out in 2020. Let us check it out.

For seq2seq models, we can't use AutoModelWithLMHead for the model. We have to replace it with AutoModelForSeq2SeqLM. The rest remains the same.

In [18]:
from transformers import AutoModelForSeq2SeqLM

model_mbart50 = AutoModelForSeq2SeqLM.from_pretrained('facebook/mbart-large-50-many-to-many-mmt')
tok_mbart50 = AutoTokenizer.from_pretrained('facebook/mbart-large-50-many-to-many-mmt')
translation = pipeline("translation_id_to_en", model=model_mbart50, tokenizer=tok_mbart50)


In 3 lines we were able to set up a translation pipeline. Notice how the API understands that the tokenizer corresponding to mbart-large-50-many-to-many-mmt is MBart50Tokenizer. 

Unfortunately there were multiple issues that I had to work through to get this working:
- Initially it gave an error saying it was "unable to find the MBart50Tokenizer" path. Since this is a SOTA model, I assumed Huggingface had yet to incorporate it in a formal release. Their support site suggests downloading and using their source code directly until that is done.
- Linking directly with the Huggingface transformers source code didn't work.
- Some debugging of their source code shows that the "sentencepiece" library is mandatory. I installed that as well, but it still didn't work.
- The transformers version was showing as an older one. Though I upgraded transformers (!pip install transformers), it kept showing the older version.
- Finally I got to the root of the problem. Kaggle has a working environment setting (on the right-hand side, click on the Env-preferences link). We have to override that to get the latest environment. This finally made the above statements work (note that it is recommended to use a stable version rather than the latest version unless there is a compelling reason).

In [19]:
!pip freeze | grep transformers

transformers==4.4.2


In [20]:
for item in train_indo_words[:25]:
    translated_text = translation(item)[0]['translation_text']
    if item != translated_text:  # show only the words the model actually changed
        print(item, translated_text)

SARAN सारण
KETOMBE Country name (optional, but should be translated)
TERPAL टेर्पाल
BELIBIS बिलिबिस
COLEK कोलेकCity name (optional, probably does not need a translation)
SUMPAH KoopCity in Jeonnam Korea
AKUARIUM AquariumConstellation name (optional)
SYAWAL City in Gyeongnam Korea
SARANA सारानाCity name (optional, probably does not need a translation)
KUMAN KumanName
PAKAI PakistanCity in Pakistan
DOBEL डोबेल
SUSPENSI StensilsStencils
TANGKI TangiCity in Gyeongnam Korea
TULIS टलिसConstellation name (optional)
CANGKOK Iinketho zegama leqela leenkwenkwezi (iyodwa) GRUSConstellation name (optional)
KARDUS कार्डसConstellation name (optional)


Well, it looks like the model wants to show off every language in the universe that it knows. This just does not work for us. Maybe we have to ditch the pipeline API and try the regular route of creating the tokens and tensors ourselves.

In [21]:
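# Note: "ja_XX" is the mBART-50 code for Japanese; the code for Indonesian is "id_ID".
# The outputs recorded below were produced with this setting.
tok_mbart50.src_lang = "ja_XX"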
encoded = tok_mbart50('wanita', return_tensors="pt")
generated_tokens = model_mbart50.generate(**encoded, forced_bos_token_id=tok_mbart50.lang_code_to_id["en_XX"])
tok_mbart50.batch_decode(generated_tokens, skip_special_tokens=True)

['Women']

Phew!!!

In [22]:
for item in train_indo_words[:25]:
    encoded = tok_mbart50(item.lower(), return_tensors="pt")
    generated_tokens = model_mbart50.generate(**encoded, forced_bos_token_id=tok_mbart50.lang_code_to_id["en_XX"])
    translated_text = tok_mbart50.batch_decode(generated_tokens, skip_special_tokens=True)
    print(item,translated_text)

CELUP ['cellop']
SARAN ['saran']
KETOMBE ['ketombe']
TERPAL ['terpal']
BELIBIS ['belibis']
ASAH ['asah']
KAOS ['chaos']
COLEK ['colek']
SUMPAH ['sumpah']
SUBSIDI ['subsidies']
AKUARIUM ['Akuarium']
SYAWAL ['syawal']
PUSING ['pusing']
SARANA ['mean']
KUMAN ['kuman']
PAKAI ['use it.']
DOBEL ['dobel']
KACA ['Glass']
SUSPENSI ['suspensi']
JILID ['jilid']
TANGKI ['tangki']
TULIS ['tulis']
CANGKOK ['cangkok']
PROMOSI ['Promotion']
KARDUS ['kardus']


So finally we get the desired outputs. You see how the 'pipeline' API ditched us and we had to abandon it for an alternate way. Well, since this is pretty new tech, maybe these issues have not been reported to Huggingface yet. There are a couple of other issues I faced and spent a lot of time debugging. For example, I was originally trying this in TF, but the generate command does not work there. There is an issue, and though I spent a lot of time debugging and going through the original code as well, I couldn't solve it. I had to change the tensors to PyTorch to get it to work.

The results are much better than those of the Helsinki translator but still not as good as Google Translate. Note, however, that in this sample the model mostly returns a word unchanged rather than mistranslating it. We have managed to bypass the load limitations of the Google API and create a neat translator of our own that can translate text between any of the 50 languages covered by mBART-50. There is hardly any online help for debugging the issues that arose on the way, so there is a small sense of satisfaction :)

BART has both an encoder (like BERT) and a decoder (like GPT), essentially getting the best of both worlds. The encoder uses a denoising objective similar to BERT while the decoder attempts to reproduce the original sequence (autoencoder), token by token, using the previous (uncorrupted) tokens and the output from the encoder.

Lastly, coming to the mystery of why the 'transformers' version was not getting upgraded even though we were doing pip installs. I found one kernel that talks about this solution (though I didn't get time to explore it in detail and confirm that it really works): Huggingface uses the module "pkg_resources" to get the version. But pkg_resources is loaded right after the Kaggle notebook starts, so it cannot see the correct version of transformers after we upgrade (somewhere in the middle of the kernel) and keeps pointing to the old version. The trick is to just reload pkg_resources. Now hopefully we can change the environment settings back to the old value and not worry about ever-changing versions of libraries :) Note: I haven't tested this but intend to test it in future if I get time.
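For reference, the reload trick described above might look like the sketch below. It is as untested against the Kaggle problem as the idea itself. The helper names are made up here, `pkg_resources` may be missing in newer environments (hence the guard), and `importlib.metadata` is used only to display the installed version.

```python
import importlib
from importlib import metadata

def installed_version(package_name):
    """Read the installed version from package metadata, or None if absent."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None

def reload_pkg_resources():
    """Re-import pkg_resources so that it rescans the installed packages.

    Returns False when pkg_resources is not available in this environment.
    """
    try:
        import pkg_resources
    except ImportError:
        return False
    importlib.reload(pkg_resources)
    return True

# Assumed usage after `!pip install --upgrade transformers` in a running kernel:
# reload_pkg_resources()
print(installed_version("transformers"))
```

Whether this is enough for transformers to pick up the new version inside an already running kernel remains an open question; restarting the kernel is the safer route.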

But human greed sometimes has no limits, and it often challenges us to do a little more. I was thinking of some way to leverage Google Translate's awesome service and then remembered the 5000-character limit on a single translation. Basically we just take
https://raw.githubusercontent.com/sastrawi/sastrawi/master/data/kata-dasar.original.txt and break it into 5000-character chunks and feed them to Google Translate. There are a couple of caveats here. Firstly, we need to put a full stop after each word, else Google will try to translate the list as a sentence and may shuffle the words. For example, try translating "wanita PERANGKAP sapi PERANGKAT APEL wanita". Secondly, let us not translate the entire Indonesian dictionary but only those 2K-odd words that are part of the title field. At 500 words per file, this takes 6 iterations.

In [23]:
for i in range(1,7):
    with open('translate' + str(i) + '.txt', 'w') as file_t:
        file_t.write('.\n'.join(str(item) for item in train_indo_words[(i-1)*500:i*500]))

Important: there is a manual step here, where I run Google Translate on these 6 files and append the outputs into the 'translated.txt' file, which has been uploaded.
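The cell above splits by word count (500 words per file). If one wanted to enforce the 5000-character limit directly instead, a sketch could look like this; the function `chunk_words` and the tiny demo limit are illustrative assumptions, not what the notebook ran:

```python
def chunk_words(words, max_chars=5000, separator=".\n"):
    """Group words into separator-joined chunks of at most max_chars characters.

    A single word longer than max_chars still ends up in a chunk of its own.
    """
    chunks, current, current_len = [], [], 0
    for word in words:
        # Characters this word adds: the word plus a separator if it is not first.
        added = len(word) + (len(separator) if current else 0)
        if current and current_len + added > max_chars:
            chunks.append(separator.join(current))
            current, current_len = [], 0
            added = len(word)
        current.append(word)
        current_len += added
    if current:
        chunks.append(separator.join(current))
    return chunks

# Tiny limit so that the split is visible; real use would keep max_chars=5000.
print(chunk_words(["WANITA", "SAPI", "APEL"], max_chars=12))
```

Each chunk can then be written to its own file, exactly as the loop above does with its 500-word slices.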

In [24]:
with open ('../input/googletranslated/translated.txt', "r") as myfile:
    translated=(myfile.read().splitlines())

In [25]:
translated = [s.strip('.') for s in translated]
print(translated[:25])
print('\n', len(translated))

['KOK', 'ELBOW', 'MAP', 'WINDOW', 'BRAND', 'FAILED', 'THYME', 'TEETH', 'PORCUPINE', 'JEMBER', 'CASE', 'DUCK', 'Patch', 'BEE', 'VILLAGE', 'REMPAH', 'MAGPIE', 'MISS', 'MACHINE', 'ETALAGE', 'THINK', 'SAHARA', 'TINGTING', 'TRIPE', 'MISTAR']

 2727


Perfect, we just have to zip them up to create a dict. Note that I read train_indo_words back from the 'translate.txt' file, which is nothing but the concatenation of the 6 files we wrote out for translation. Each time the kernel is run, train_indo_words comes out in a different order because it is derived from a set. I realized this only after translating the 6 files manually and have no intention of going back to sort the set and redo the whole thing.
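For anyone repeating the exercise, the ordering problem can be avoided by sorting before writing the files. Python randomizes string hashes per process, so the iteration order of a set of strings can change between runs, whereas a sorted list is stable. A minimal sketch (the helper name is hypothetical):

```python
def stable_word_list(words):
    """Return the distinct words in alphabetical order, independent of set order."""
    return sorted(set(words))

# Words taken from the output above; the result is the same on every run.
print(stable_word_list({"KAOS", "CELUP", "SARAN"}))
```

With a stable order, the translated lines can be zipped against the word list directly, without reading the words back from a saved file.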

In [26]:
with open ('../input/translate/translate.txt', "r") as myfile:
    train_indo_words=(myfile.read().splitlines())
train_indo_words = [s.strip('.') for s in train_indo_words]
    
indo_en_dict = dict(zip(train_indo_words, translated))

Let us remove the un-translated words. Hopefully there would not be too many of them

In [27]:
for k, v in dict(indo_en_dict).items():
    if k==v:
        del indo_en_dict[k]

In [28]:
print(len(indo_en_dict), '\n\n')
print(list(indo_en_dict.items())[:100])

with open('indonesean_english_dict.txt', 'w') as file_t:
    file_t.write('\n'.join(str(item) for item in list(indo_en_dict.items())))

2144 


[('SIKUT', 'ELBOW'), ('PETAK', 'MAP'), ('JENDELA', 'WINDOW'), ('MEREK', 'BRAND'), ('GAGAS', 'FAILED'), ('TIMI', 'THYME'), ('GERIGI', 'TEETH'), ('LANDAK', 'PORCUPINE'), ('KASUS', 'CASE'), ('ITIK', 'DUCK'), ('TAMBAL', 'Patch'), ('LEBAH', 'BEE'), ('DESA', 'VILLAGE'), ('KACER', 'MAGPIE'), ('MBAK', 'MISS'), ('MESIN', 'MACHINE'), ('ETALASE', 'ETALAGE'), ('PIKIR', 'THINK'), ('BABAT', 'TRIPE'), ('BOLU', 'SPONGE'), ('TABUNG', 'TUBE'), ('EKONOMIS', 'ECONOMICAL'), ('LARAS', 'BARREL'), ('PERABOT', 'FURNITURE'), ('BINCANG', 'BEAM'), ('MASIH', 'STILL'), ('BAGAN', 'CHART'), ('DINAS', 'SERVICE'), ('WILAYAH', 'REGION'), ('RODA', 'WHEEL'), ('ASAL', 'ORIGIN'), ('MANIKUR', 'MANICURE'), ('MATI', 'DIE'), ('API', 'FIRE'), ('SPESIAL', 'SPECIAL'), ('AMPLOP', 'ENVELOPE'), ('TEROMPAH', 'SANDALS'), ('KURAS', 'DRAIN'), ('METODE', 'METHOD'), ('SEKOLAH', 'SCHOOL'), ('PESONA', 'CHARM'), ('KELINCI', 'RABBIT'), ('TUNGKU', 'WAIT'), ('SUSAH', 'HARD'), ('USAP', 'WIPE'), ('GARPU', 'FORK'), ('PANCI', 'PAN'), ('LELAK
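The text file written above stores Python tuple representations, which are awkward to parse back. As an alternative, here is a sketch that stores the dictionary as JSON and reloads it. The helper names and the temporary file name are assumptions for illustration, and the three sample entries are only a toy subset:

```python
import json
import os
import tempfile

def save_dictionary(mapping, path):
    """Write the word mapping as JSON so it can be reloaded without parsing tuples."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(mapping, handle, ensure_ascii=False, sort_keys=True)

def load_dictionary(path):
    """Read a mapping previously written by save_dictionary."""
    with open(path, "r", encoding="utf-8") as handle:
        return json.load(handle)

sample = {"JENDELA": "WINDOW", "MESIN": "MACHINE", "JEMBER": "JEMBER"}
# Drop entries that came back untranslated, as in the cell above.
sample = {key: value for key, value in sample.items() if key != value}

path = os.path.join(tempfile.gettempdir(), "indo_en_dict_demo.json")
save_dictionary(sample, path)
print(load_dictionary(path) == sample)
```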

Yay! We get a nice dictionary that is domain-specific to Shopee! Now a few questions beg to be answered at this point. Do we really need a dict? Ragnar showed good scores with a regular English BERT model. There is also an Indonesian BERT model that is doing the rounds. Will that not suffice? Obviously whoever is using the English version of BERT will benefit from a translation beforehand. But how about the Indonesian model? Will that benefit? We have seen that there are quite a few English words in the 'title' field. How will the Indonesian model interpret these words? It is completely Indonesian, right? Let us check the Indonesian BERT - https://huggingface.co/cahya/bert-base-indonesian-522M

In [29]:
from transformers import BertTokenizer, TFBertModel

model_name='cahya/bert-base-indonesian-522M'
tokenizer = BertTokenizer.from_pretrained(model_name)
bert_layer = TFBertModel.from_pretrained(model_name)


Some layers from the model checkpoint at cahya/bert-base-indonesian-522M were not used when initializing TFBertModel: ['mlm___cls']
- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertModel were initialized from the model checkpoint at cahya/bert-base-indonesian-522M.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.


Firstly, no more 'pipeline's for me for a while... Secondly, let us choose TensorFlow. There is hardly any documentation for TF on Huggingface models, so maybe we will learn a few new things while trying to figure out the APIs.

In [30]:
bert_layer.config

BertConfig {
  "_name_or_path": "cahya/bert-base-indonesian-522M",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.4.2",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 32000
}

That was easy. The vocab size is 32K, close to that of the English equivalent (about 30K). In particular, I wanted to inspect the token vocabulary to find out if there are English words in it. There is no documentation on the net for this, but the Huggingface APIs are nice and standard.

In [31]:
import random

rndtokens=[]
for item in tokenizer.vocab:
    # keep roughly 2% of the tokens, up to 320 in total
    if len(rndtokens) < 320 and random.randint(0, 100) < 2:
        rndtokens.append(item)
    
str(rndtokens)

"[';', 'θ', 'ש', 'ร', 'ส', '₱', '南', '空', '義', '聖', '##y', '##æ', '##к', '##ː', '##い', '##ら', '##হ', '##ล', '##ф', '##ほ', '##ვ', '##ꦧ', '##త', '##せ', '##ur', '##and', 'sem', 'kayu', 'har', 'en', '##ingan', '##uku', 're', 'sum', 'menggunakan', 'komp', 'kl', 'berl', 'terp', '##hi', 'kerus', '93', '##ambil', '##ender', '##urkan', 'penggun', 'maret', '2012', 'bun', 'tercatat', 'aj', '2017', '##takan', '##tin', 'lam', '##asikan', 'walaupun', 'diadakan', 'stand', 'admin', '##akup', '##ambahan', 'banj', 'sif', 'ars', 'ay', '##ungkinan', 'berusaha', 'profesi', 'pela', 'melanjutkan', 'cy', 'mask', 'turun', '##ontakan', 'inj', 'anj', 'tanjung', '##isipasi', 'rump', 'kursi', 'karyanya', '36', '##omi', 'dewi', '##esh', 'bomb', 'setengah', 'bersamaan', '##asitas', 'cit', 'akses', '##imb', 'sunda', 'hidupnya', '##alib', 'induk', '##eni', 'berpart', 'colle', '##kand', 'fos', '##elan', 'pasir', 'pp', 'setidaknya', 'amor', '##urang', '##eman', 'gelap', 'wid', '##ekor', 'dilaporkan', 'nyata', '##usat', 

We can see a bunch of characters from different languages, including Chinese, Hindi and possibly Japanese Katakana. We can see a few English words here and there, but not lots. Mainly it is Indonesian. Let us check for the English words common in the retail domain.

In [32]:
[item for (item) in tokenizer.vocab if item in ['organic','best','product','sale','woman', 'shirt','jeans','Original','original', 'gloves', 'hat']]

['hat', 'best', 'original', 'woman', 'product']

Interestingly, there are some English words, but many are missing.

In [33]:
eng_in_indobert = []

for item in tokenizer.vocab:
    if len(item)>3 and item.lower() in setofwords:
        eng_in_indobert.append(item)
        
print(len(eng_in_indobert),'\n')
print(eng_in_indobert[:100])

3185 

['yang', 'dari', 'meng', 'pert', 'meny', 'orang', 'lain', 'pend', 'mend', 'film', 'tang', 'para', 'kali', 'genus', 'aster', 'sang', 'samp', 'asteroid', 'baru', 'ting', 'send', 'king', 'serta', 'diter', 'bang', 'kingdom', 'masa', 'pern', 'inter', 'teng', 'perm', 'member', 'raja', 'band', 'berm', 'album', 'intern', 'paling', 'terp', 'mana', 'perk', 'larva', 'plan', 'belanda', 'planet', 'organ', 'acara', 'program', 'agama', 'nebula', 'lama', 'ayah', 'bola', 'sing', 'pang', 'agar', 'orbit', 'peri', 'bahan', 'prof', 'raya', 'surat', 'bend', 'zaman', 'rata', 'unit', 'prim', 'dewan', 'pand', 'media', 'kend', 'form', 'video', 'saya', 'hind', 'badan', 'modern', 'kitab', 'serial', 'model', 'situs', 'masjid', 'data', 'gamb', 'award', 'drama', 'sultan', 'sisi', 'meter', 'episode', 'tamp', 'berg', 'enam', 'guru', 'anti', 'primordial', 'hong', 'stand', 'sumatra', 'bank']


So there is quite a bit of English already built into Indo BERT, and maybe that is the reason it works well. Bear in mind that the count is an upper bound: many matches in the list (for example 'yang' and 'dari') are Indonesian words or word pieces that merely coincide with entries in the English word list. Secondly, keep in mind that the token vocab file is not meant to be something like a dictionary. BERT works by splitting quite a few words into tokens. So there could be many words that are not present in the token file, and this is perfectly fine. We just wanted to get an idea of whether Indo BERT has any English tokens at all, hence the peek.
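To make the splitting concrete, here is a simplified sketch of the greedy longest-match-first procedure that WordPiece tokenizers apply to a single word. The toy vocabulary is made up for illustration and is not the real vocabulary of cahya/bert-base-indonesian-522M; the real tokenizer also performs normalization that is omitted here.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into WordPiece tokens."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no piece fits, so the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens

# Hypothetical vocabulary, chosen so that a joined word from the titles splits up.
toy_vocab = {"pakai", "##an", "wanita", "##wan", "##ita"}
print(wordpiece_tokenize("pakaianwanita", toy_vocab))
```

A word that is absent from the vocabulary as a whole can therefore still be represented, as long as it can be covered by pieces that are present.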

Now that we have established that Indo BERT understands English to a reasonable extent, is there any benefit in this dictionary? I believe so. It may work better to translate the English words to Indonesian using the dictionary above and then use the Indo BERT.

But why would someone want to use an Indo BERT or an English BERT when we have mBERT from Google and XLM-RoBERTa from Facebook, which are more comprehensive? These are trained on 100+ languages and can be expected to cope better with mixed-language titles. XLM-RoBERTa can be loaded using the same 2-3 lines of Huggingface APIs, and if you print the length of its token vocabulary it comes to a whopping 250000, or about 8 times that of the English or the Indo BERT. To me, that seems like a good choice to start experimenting with.

But is this dictionary we created a waste? Absolutely not. We learnt many things and had fun (and some frustrations) while creating it. I didn't see any comparably good dictionary of 2000+ Indonesian-to-English words on the web, so this could be a useful resource for NLP in general outside this competition. We created a neat real-time translator using mBART-50, which has no limit on the number of translations that can be done. Now, coming to this competition: if for some reason XLM-RoBERTa is eating memory and one needs a lighter BERT, one can use the above translations. More importantly, these can be used to augment the data in the sparse label groups.
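Applying the dictionary to a title can be as simple as a word-by-word lookup. The sketch below assumes upper-case keys, as in the dictionary built above; the function name and the sample title are made up, and words without an entry are left untouched:

```python
import re

def translate_title(title, mapping):
    """Replace every word that has an entry in mapping; keep all other text as is."""
    def swap(match):
        word = match.group(0)
        return mapping.get(word.upper(), word)
    return re.sub(r"[A-Za-z]+", swap, title)

# A few entries copied from the dictionary printed above.
toy_dict = {"JENDELA": "WINDOW", "MESIN": "MACHINE", "SEKOLAH": "SCHOOL"}
print(translate_title("Mesin jahit mini untuk sekolah", toy_dict))
```

The translated title could then be used either as a replacement for the original or as an extra augmented sample for the same label group.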

One can also experiment with https://pypi.org/project/trankit/. I discovered this gem a bit late and couldn't cover it, but it uses the XLM-RoBERTa base, so it should be at least as good as mBART-50, though I doubt whether it will reach the Google Translate standard.

In [34]:
##Please ignore
from shutil import copyfile
copyfile(src = "../input/tokenization-mbart50-fastpy/tokenization_mbart50_fast.py", dst = "../working/tokenization_mbart50_fast.py")

'../working/tokenization_mbart50_fast.py'