You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
import json
from transformers import Wav2Vec2CTCTokenizer
phones = 'ɔː,(en),oɜ,inɡɜ,o4,b,f,enɡ1,oi2,aa7,eɪ,eː,au7,aaiɜ,onɡ4,oe6,uiɜ,ɒ,iə,c,aa2,oenɡ1,ei7,oenɡ6,au1,ŋ5,iu5,aɪə,ou4,d,ai7,k,i2,eoi5,aai2,j,oenɡɜ,u1,ŋ4,i,m,oi6,unɡɜ,ou2,au2,p,yu1,a,yu4,onɡ1,ɛ,e5,əʊ,ou6,yu5,aɜ,oi1,onɡ5,ai5,aau5,inɡ5,ai1,eɜ,ei5,uɜ,o2,i5,nɡ6,enɡ4,ɐ,l,o1,iu4,enɡ6,ou5,onɡ7,anɡ1,tʃ,aau2,eo6,aa6,iː,enɡ7,oenɡ5,ŋ,aau1,u5,eo5,yu7,oi7,aaɜ,oiɜ,yu2,aa5,ɑː,oe1,n,eoi2,ui2,oenɡ2,inɡ1,anɡ4,t,au4,ei4,u2,aanɡ2,ui4,dʒ,[PAD],a1,e,oenɡ7,aau4,onɡɜ,eoi6,unɡ5,ɹ,e6,yu6,ɪ,ʃ,ei2,aauɜ,enɡɜ,unɡ1,aɪ,i6,eiɜ,aanɡ1,inɡ6,iu1,o5,ui1,inɡ2,unɡ4,eoi4,eo4,uː,ei1,oenɡ4,aa4,aanɡ7,a2,e4,enɡ2,a5,auɜ,iɜ,əl,ai6,iu2,a4,e2,ouɜ,eoi1,anɡ2,[UNK],h,onɡ6,aau6,nɡ5,nɡ4,enɡ5,oeɜ,inɡ4,a6,eoiɜ,e1,ʊ,i1,o7,z,au6,ai4,anɡ6,aai1,oi5,aʊ,v,iu6,unɡ7,au5,eoɜ,aanɡ6,ou1,aanɡ5,(zhy),anɡɜ,oi4,onɡ2,a7,w,ui5,ui6,oe5,unɡ6,aanɡ4,ɔɪ,inɡ7,ɡ,s,o6,aa1,u6,aai4,ʌ,ou7,yuɜ,ɜː,ei6,aiɜ,ə,anɡ7,ai2,u4,iu7,iuɜ,eo1,aai6,eo2,i4,i7,aai5,unɡ2'.split(',')
phones_dict = {x:i for i, x in enumerate(phones)}
with open('test.json', 'w') as f:
json.dump(phones_dict, f, indent=4, ensure_ascii=False)
tokenizer = Wav2Vec2CTCTokenizer('test.json', unk_token='[UNK]', pad_token='[PAD]')
text = 'ɡ ei2 j a4 n ɡ eɜ p anɡ4 j au5 t aaɜ n z o2 j a1 t h au2 h eiɜ'
print(tokenizer(text))
print(tokenizer.decode(tokenizer(text)['input_ids'], spaces_between_special_tokens=True))
The output:
{'input_ids': [200, 122, 35, 152, 96, 157, 200, 62, 45, 101, 35, 182, 102, 90, 96, 157, 172, 65, 35, 110, 102, 157, 158, 44, 158, 128], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
ɡ ei2 j a4 n [UNK] ɡ eɜ p anɡ4 j au5 t aaɜ n [UNK] z o2 j a1 t [UNK] h au2 h eiɜ
As you can see, the 'unknown' token ([UNK]) is added at random locations even though all the words/phones were defined in phones_dict.
What is happening?
I'm using transformers version 4.29.2.
The text was updated successfully, but these errors were encountered:
Here's a script to reproduce what I've observed:
The output:
As you can see, the 'unknown' token ([UNK]) is added at random locations even though all the words/phones were defined in
phones_dict
.What is happening?
I'm using transformers version 4.29.2.
The text was updated successfully, but these errors were encountered: