<a href="https://colab.research.google.com/github/hduongck/AI-ML-Learning/blob/master/Fastai%20NLP%20course/7_1_Turkish_ULMFiT_from_scratch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Turkish ULMFiT from scratch

[Video 10](https://youtu.be/MDX_x6rKXAs?t=4036)

In [0]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

from fastai import *
from fastai.text import *

In [0]:
bs=48
#torch.cuda.set_device(2)
data_path = Config.data_path()

lang = 'tr'
name = f'{lang}wiki'
path = data_path/name
path.mkdir(exist_ok=True, parents=True)

In [0]:
mdl_path =path/'models'
mdl_path.mkdir(exist_ok=True)
lm_fns =[mdl_path/f'{lang}_wt',mdl_path/f'{lang}_wt_vocab']


Some text, like the Chinese `而获得诺贝尔文学奖`, has no spaces between words, so we can't rely on splitting on spaces to tokenize our corpus in a language like Chinese, and in an agglutinative language like Turkish a single space-delimited word can pack in a whole phrase. This is a big, really tricky problem, and it's not as easy as you might imagine. Chinese is such a grammatically interesting language that it's sometimes not even clear where words start and end. For example, in Chinese you can insert a character in the middle of a word to give it a different meaning. Or you can add a character at the end of a word, something called a complement of result, which creates a new word that tells you about the outcome of the original word. So it's not a simple case of saying let's just find the words.

In recent times, a really cool solution was developed called **SentencePiece**. It is actually based on something that goes back even further, called **byte-pair encoding**. What **SentencePiece** and **byte-pair encoding** do is segment text into what are called **subword units**. A **subword unit** is a sequence of letters that appears frequently in a corpus, so frequently that it is worth treating as a token of its own.

For example, in Turkish, after we use **subword encoding** we end up with something like this:


```
▁ele ▁geçirerek ▁politik ▁bir ▁özerklik ▁verdiğini ▁belirtmek le ▁birlikte ▁xxmaj ▁tibet ' in ▁yalnızca ▁1913 - 1950 ▁yılları ▁arasında ▁xxmaj ▁çin ' in ▁politik ▁nüfuz undan ▁çıktığını , ▁bölgenin ▁tarihi ▁olarak ▁xxmaj ▁çin ' e ▁ait ▁olduğunu ▁düşünmektedir . ▁xxmaj ▁tibet ' in ▁kendi ▁kültür ▁ve ▁zenginlik lerinin ▁" kültürel ▁bir ▁soykırım " a ▁tabi ▁tutulduğu ▁da ▁iddialar ▁arasındadır . ▁xxmaj ▁çin ▁hükümeti ▁ise ▁bu ▁" kültürel ▁soykırım " ▁iddia
```

The `▁` characters (they look like big underscores) represent spaces in the original text, and the spaces shown here represent token boundaries. Most of the time each token is a single whole word, such as `▁politik`, but sometimes we get things like `▁zenginlik lerinin`, where one word has been turned into two tokens, split in the middle.
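
To make the `▁` convention concrete, here is a minimal sketch of how a list of subword tokens maps back to plain text. The helper name `detokenize` is made up for this illustration (it is not a fastai or SentencePiece function), and it ignores fastai's special tokens such as `xxmaj`.

```python
def detokenize(pieces):
    """Glue subword pieces together, then turn each '▁' marker back into a space."""
    return "".join(pieces).replace("▁", " ").strip()

# Four tokens taken from the sample above; the last two belong to one word.
pieces = "▁kültür ▁ve ▁zenginlik lerinin".split(" ")
text = detokenize(pieces)
print(len(pieces), "tokens ->", text)
```

Because the word boundary is stored inside the token itself, no information about the original spacing is lost when a word is split into several pieces.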

The question is: how would you start with something like this

    `而获得诺贝尔文学奖` - And won the Nobel Prize in Literature
    or `evlerinizdenmiş` - `(he/she/it) was (apparently/said to be) from your houses`
    
and turn it into something like `▁zenginlik lerinin`? At a high level, the answer (using **byte-pair encoding**) is that you start by looking through your corpus and finding which two characters appear next to each other most frequently. You take that pair, pull it out, and that's a token.

So for example, if you're doing English, you may find that T and H occur next to each other a lot, so you say TH is now a token. Then you repeat: you look again for the two characters that most often appear next to each other, except that now you treat TH as a single character. Down the track you'll find that TH and E often appear next to each other, so that's now a token. You keep doing that until eventually you have some set number of unique tokens. That's called your vocab.
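
The merge loop described above can be sketched in a few lines of plain Python. This is a toy version written for these notes, not the code that SentencePiece or fastai runs; the function names and the tiny corpus are invented for illustration.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs over all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(words, pair):
    """Rewrite every word so that each occurrence of `pair` becomes one symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_bpe(corpus, n_merges):
    """Start from single characters and apply `n_merges` greedy merges."""
    words = dict(Counter(tuple(w) for w in corpus.split()))
    merges = []
    for _ in range(n_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges, words

corpus = "the then they there other that this"
merges, words = learn_bpe(corpus, n_merges=2)
print(merges)  # [('t', 'h'), ('th', 'e')]
```

On this corpus the first merge is T+H and the second is TH+E, exactly the sequence in the English example. A real implementation stops once the vocab reaches the requested size rather than after a fixed number of merges.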

So with **SentencePiece**, you can actually tell it how big a vocab you want. Here it defaults to a vocab of size 30,000, so the sample above has been tokenized into Turkish subword units using a vocab of 30,000. SentencePiece actually goes a bit further than byte-pair encoding: by default it fits a language model over candidate subword pieces (a unigram language model rather than a neural network) and keeps the pieces that make the corpus most likely under that model, but it's the same basic idea.
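
To give a feel for what "most likely under a language model" means, here is a toy sketch that picks the most probable split of one word under a unigram model. It is not SentencePiece's implementation, and the piece probabilities are invented purely for illustration.

```python
import math

def best_segmentation(word, logp):
    """Viterbi search for the most probable split of `word` into known pieces."""
    n = len(word)
    # best[i] = (best log-probability of word[:i], start index of its last piece)
    best = [(-math.inf, None)] * (n + 1)
    best[0] = (0.0, None)
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in logp and best[start][0] > -math.inf:
                score = best[start][0] + logp[piece]
                if score > best[end][0]:
                    best[end] = (score, start)
    if best[n][0] == -math.inf:
        return None  # the word cannot be built from the known pieces
    pieces, end = [], n
    while end > 0:
        start = best[end][1]
        pieces.append(word[start:end])
        end = start
    return pieces[::-1]

# Hypothetical piece probabilities, made up for this example.
probs = {"zenginlik": 0.02, "lerinin": 0.03, "zengin": 0.04,
         "lik": 0.05, "ler": 0.06, "inin": 0.02}
logp = {piece: math.log(p) for piece, p in probs.items()}
segmentation = best_segmentation("zenginliklerinin", logp)
print(segmentation)  # ['zenginlik', 'lerinin']
```

With these made-up numbers, the two-piece split wins because multiplying fewer, reasonably likely pieces gives a higher probability than multiplying many short ones.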

So this idea of using subword units is not very well studied, but it's absolutely necessary and super powerful, not only for Turkish, Chinese and Japanese but also for things like medical texts. Those big long chemical names in the scientific and medical literature contain well-defined subword units that are frequently reused, so you don't need a separate vocab entry for every one. The problem with having a separate vocab entry for every one is that none of those individual words appears very often, so you really want to capture the meaning of the underlying subword units.
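
A small sketch of the frequency argument: each whole chemical name below appears once, but the subword units they are built from each appear twice. The subword list is hypothetical (a real vocab would be learned from a corpus), and the greedy longest-match splitter is a stand-in for a trained tokenizer.

```python
from collections import Counter

def greedy_split(word, units):
    """Split `word` left to right, always taking the longest unit that matches."""
    pieces, i = [], 0
    while i < len(word):
        candidates = [u for u in units if word.startswith(u, i)]
        # Fall back to a single character when no unit matches at this position.
        match = max(candidates, key=len, default=word[i])
        pieces.append(match)
        i += len(match)
    return pieces

names = ["methylbenzene", "ethylbenzene", "methylamine", "ethylamine"]
units = ["methyl", "ethyl", "benzene", "amine"]  # hypothetical subword vocab

word_counts = Counter(names)
unit_counts = Counter(p for name in names for p in greedy_split(name, units))
print(word_counts)
print(unit_counts)
```

Even in this four-word corpus the model sees every subword unit more often than any whole word, and the gap grows quickly as more compound names are added.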

So now let's train a Turkish model from scratch.

## Turkish wikipedia model

In [0]:
#@title from nlputils import get_wiki, split_wiki
from fastai.basics import *
import re


def get_wiki(path,lang):
    name = f'{lang}wiki'
    if (path/name).exists():
        print(f"{path/name} already exists; not downloading")
        return

    xml_fn = f"{lang}wiki-latest-pages-articles.xml"
    zip_fn = f"{xml_fn}.bz2"

    if not (path/xml_fn).exists():
        print("downloading...")
        download_url(f'https://dumps.wikimedia.org/{name}/latest/{zip_fn}', path/zip_fn)
        print("unzipping...")
        bunzip(path/zip_fn)

    with working_directory(path):
        if not (path/'wikiextractor').exists(): os.system('git clone https://github.com/attardi/wikiextractor.git')
        print("extracting...")
        os.system("python wikiextractor/WikiExtractor.py --processes 4 --no_templates " +
            f"--min_text_length 1800 --filter_disambig_pages --log_file log -b 100G -q {xml_fn}")
    shutil.move(str(path/'text/AA/wiki_00'), str(path/name))
    shutil.rmtree(path/'text')


def split_wiki(path,lang):
    dest = path/'docs'
    name = f'{lang}wiki'
    if dest.exists():
        print(f"{dest} already exists; not splitting")
        return dest

    dest.mkdir(exist_ok=True, parents=True)
    title_re = re.compile(rf'<doc id="\d+" url="https://{lang}.wikipedia.org/wiki\?curid=\d+" title="([^"]+)">')
    lines = (path/name).open()
    f=None

    for i,l in enumerate(lines):
        if i%100000 == 0: print(i)
        if l.startswith('<doc id="'):
            title = title_re.findall(l)[0].replace('/','_')
            if len(title)>150: continue
            if f: f.close()
            f = (dest/f'{title}.txt').open('w')
        else: f.write(l)
    f.close()
    return dest

In [0]:
get_wiki(path,lang)
!head -n4 {path}/{name}

/root/.fastai/data/trwiki/trwiki already exists; not downloading
<doc id="10" url="https://tr.wikipedia.org/wiki?curid=10" title="Cengiz Han">
Cengiz Han

Cengiz Han ("Cenghis Khan", "Çinggis Haan" ya da doğum adıyla Temuçin (anlamı: demirci), Moğolca: "Чингис Хаан" ya da "Tengiz" (anlamı: deniz), ; d. 1162 – ö. 18 Ağustos 1227), Moğol komutan, hükümdar ve Moğol İmparatorluğu'nun kurucusudur. Cengiz Han, 13. Yüzyılın başında Orta Asya'daki tüm göçebe bozkır kavimlerini birleştirerek bir ulus haline getirdi ve o ulusu "Moğol" siyasi kimliği çatısı altında topladı. Dünya tarihinin en büyük askeri dehalarından biri olarak kabul edilen Cengiz Han, hükümdarlığı döneminde 1206-1227 arasında Kuzey Çin'deki Batı Xia ve Jin Hanedanı, Türkistan'daki Kara Hıtay, Maveraünnehir, Harezm, Horasan ve İran'daki Harzemşahlar ile Kafkasya'da Gürcüler, Deşt-i Kıpçak'taki Rus Knezlikleri ve Kıpçaklar ile İdil Bulgarları üzerine gerçekleştirilen seferler sonucunda Pasifik Okyanusu'ndan Hazar Denizi’ne ve Ka

In [0]:
dest = split_wiki(path,lang)

/root/.fastai/data/trwiki/docs already exists; not splitting


Turkish is an [agglutinative language](https://en.wikipedia.org/wiki/Agglutinative_language), so it needs special care!

![alt text](https://github.com/fastai/course-nlp/raw/85e505295efeed88ce61dc0ff5e424bde9741a15/images/turkish.jpg?raw=true)



Here we have one extra step: we add a **processor**. By default, fastai uses the spaCy tokenizer to tokenize the text. If you want to use something else, in our case **SentencePiece**, you can replace the processor with **SPProcessor**, the SentencePiece processor.

Doing this says: don't tokenize on space boundaries, but instead learn a tokenization model from the text. So now, when you call **data.show_batch()**, you can see the SentencePiece-tokenized text.

In [0]:
!pip install sentencepiece



In [0]:
data = (TextList.from_folder(dest,processor=[OpenFileProcessor(),SPProcessor()])
                .split_by_rand_pct(0.1,seed=42)
                .label_for_lm()
                .databunch(bs=bs,num_workers=1))

data.save(f'{lang}_databunch')
len(data.vocab.itos),len(data.train_ds)

(30000, 45219)

In [0]:
data.show_batch()

idx,text
0,"di , ▁xxmaj ▁v yat ka ▁xxmaj ▁gu ber ni ya sı , ▁xxmaj ▁rus ▁xxmaj ▁i ̇ mparatorluğu ▁ - ▁19 ▁xxmaj ▁mart ▁1955 ; ▁xxmaj ▁moskova , ▁xxup ▁sscb ), ▁xxmaj ▁kızıl ▁xxmaj ▁ordu ' nun ▁komutanlarından ▁biri , ▁xxmaj ▁sovyetler ▁xxmaj ▁birliği ▁xxmaj ▁mareşal i , ▁xxmaj ▁sovyetler ▁xxmaj ▁birliği ▁xxmaj ▁komünist ▁xxmaj ▁partisi ▁xxmaj ▁merkez ▁xxmaj ▁komitesi ▁aday ▁üyesi , ▁xxmaj ▁yüksek ▁xxmaj ▁sovyet ▁üyesi ▁ve"
1,"devlet in ▁tüm ▁bu ▁ayaklanma lardan ▁edindiği ▁ders , ▁ayaklanan ▁vatan ▁hain lerinin , ▁tüm ▁soyu ▁so pu yla ▁yok ed ilmesi ▁gerektiği , ▁geriye ▁"" de de m ▁dede m "" ▁diye cek ▁tek ▁bir ▁hain ▁döl ünün ▁bile ▁sağ ▁bırakılma ması ▁gerektiğini "" ▁savunmaktadır . ▁xxmaj ▁fırat , ▁xxmaj ▁kürt leri ▁"" ata sı ▁belli ▁olmayan "", ▁xxmaj ▁za gro s lar ' ın ▁"" kültürel ▁çukur unda"
2,▁vi . ▁xxmaj ▁ philippe ▁ te fe cilik ten ▁vazgeçerek ▁kendi ▁bölgesindeki ▁xxmaj ▁floransa lı ▁tüccarların ▁ve ▁bank er lerin ▁yüksek ▁miktarda ki ▁para larını ▁ şan ta j ▁yaparak ▁almıştır . ▁14. ▁yüzyılın ▁40 ' lı ▁yıllarındaki ▁xxmaj ▁ortaçağ ’ da ▁ve ba ▁ve ▁açlık ▁çok ▁büyük ▁zarar lara ▁yol ▁açmıştır ▁ve ▁kısmen ▁başarı ▁sağ lanamayan ▁son ▁savaşlar ▁ve ▁ekonomi deki ▁tatmin ▁edici ▁olmayan ▁durum ▁nedeniyle ▁xxmaj ▁grand
3,"▁uzayda ▁yapılmıştır . i ̇ mparatorluk ▁xxmaj ▁ordusu ▁her ▁savaşta ▁yenilgi ▁alıp ▁yıkılmıştır . bu ▁savaşın ▁kahramanları ▁olan ▁xxmaj ▁luke ▁skywalker , le ia ▁organ a , han ▁solo , land o ▁xxmaj ▁cal ris si an ▁ve ▁son ▁anda ▁taraf ▁değiştiren ▁xxmaj ▁darth ▁xxmaj ▁vader ▁yani ▁xxmaj ▁anakin ▁skywalker ' dır . en dor ▁xxmaj ▁savaşı ' nda ▁enerji ▁kalkanı ▁yok ▁edilince ▁xxmaj ▁i ̇ mparatorluk ▁ilk ▁darbesi ni"
4,". ▁xxmaj ▁koleksiyoncu , ▁taran ması ▁gereken ▁ uri ' lerin ▁listesini ▁alır ▁ve ▁ uri ' yi ▁barındıran ▁mağaza ya ▁erişmek ▁için ▁uygun ▁protokol ▁iş leyici sini ▁ve ▁sonra ▁da ▁( me ta ▁verileri ▁ ayık lamak ▁için ) ▁uygun ▁özellik ▁iş leyici sini ▁ve ▁belge ▁metnin i ▁ ayık lamak ▁için ▁i fil ter ' ı ▁çağırır . ▁xxmaj ▁farklı ▁indeks ler ▁farklı ▁koşu lar ▁sırasında ▁oluşturulur ;"


In [0]:
learn = language_model_learner(data,AWD_LSTM,drop_mult=0.1,wd=0.1,pretrained=False).to_fp16()

In [0]:
lr = 3e-3
lr *= bs/48  # Scale learning rate by batch size

In [0]:
learn.unfreeze()
learn.fit_one_cycle(1, lr, moms=(0.8,0.7)) # use n of epochs 10 instead

epoch,train_loss,valid_loss,accuracy,time
0,4.221147,4.158035,0.330393,33:15


After this single epoch we have about 33% accuracy at predicting the next subword in Turkish Wikipedia (the comment above suggests training for 10 epochs instead).

In [0]:
learn.to_fp32().save(lm_fns[0], with_opt=False)
learn.data.vocab.save(lm_fns[1].with_suffix('.pkl'))


## Turkish sentiment analysis

https://www.win.tue.nl/~mpechen/projects/smm/

[1:17](https://youtu.be/MDX_x6rKXAs?t=4639)


### Language model

In the dataset, we have two files: `tr_polarity.pos` (positive reviews) and `tr_polarity.neg` (negative reviews).

If we use `pos = (path_clas/'tr_polarity.pos').open().readlines()`, we get an error saying the file could not be read as utf-8. This is something you're going to come across all the time as an NLP practitioner. Until recently, the way letters are stored on disk was ill-defined. We used a character set called ASCII, which had a unique mapping between the numbers 0 to 127 and the letters of the alphabet, punctuation and so forth. The A in ASCII stands for American; it was designed just for Americans. When folks in France decided they wanted to put things on computers as well, they changed some of the numbers to represent different characters. So if you opened exactly the same file on a French computer versus an American computer, you'd see something different.
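
A quick way to see the problem is to decode the same bytes with two legacy 8-bit encodings. The sketch below uses the Western European and Turkish ISO-8859 tables rather than a French one, simply because they differ in exactly the letters that matter for this dataset.

```python
# Three byte values above the 7-bit ASCII range.
raw = bytes([0xFD, 0xFE, 0xF0])

western = raw.decode("iso-8859-1")  # Western European table
turkish = raw.decode("iso-8859-9")  # Turkish table

print("iso-8859-1:", western)  # ýþð
print("iso-8859-9:", turkish)  # ışğ
```

The bytes on disk are identical; only the table used to interpret them changes, which is why a file with no declared encoding cannot be read with certainty.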

This is particularly crazy in Japan, for example, because Japanese has thousands of characters, which obviously can't fit in 128 numbers. So they invented their own encodings using more numbers. Then there is something called Unicode, which has room for over a million code points, including emojis, and new emojis keep being added every year.

The problem is that the dataset I was looking at, and this happens quite frequently, was from 2013, when Unicode was around but not everybody was using it. It was not saved in Unicode. There's actually no guaranteed-correct way to open a file that isn't saved in Unicode: you have to guess. So I googled for Turkish language encoding and found a page that said that in Turkey people quite often use the **iso-8859-9** encoding, except when they don't.
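
The guessing step can be sketched as a loop over candidate encodings. The helper name `decode_with_fallback` and the candidate list are assumptions made for this example, and the "legacy file" is simulated in memory rather than read from the real dataset.

```python
def decode_with_fallback(raw, encodings=("utf-8", "iso-8859-9")):
    """Return (text, encoding) for the first candidate that decodes without error."""
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings worked")

# Simulate a review that was saved with the legacy Turkish encoding.
original = "gerçekten muhteşem bir film"
raw = original.encode("iso-8859-9")

text, used = decode_with_fallback(raw)
print(used, "->", text)
```

Keep in mind that a single-byte encoding such as iso-8859-9 accepts almost any byte sequence, so decoding without an error does not prove the guess was right; it is still worth eyeballing a few lines, as the `head()` calls in the cells below do.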

In [0]:
!wget https://www.win.tue.nl/~mpechen/projects/smm/Turkish_Movie_Sentiment.zip


--2019-08-11 05:51:33--  https://www.win.tue.nl/~mpechen/projects/smm/Turkish_Movie_Sentiment.zip
Resolving www.win.tue.nl (www.win.tue.nl)... 131.155.11.13
Connecting to www.win.tue.nl (www.win.tue.nl)|131.155.11.13|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 533335 (521K) [application/zip]
Saving to: ‘Turkish_Movie_Sentiment.zip.1’


2019-08-11 05:51:35 (371 KB/s) - ‘Turkish_Movie_Sentiment.zip.1’ saved [533335/533335]



In [0]:
path.ls()

[PosixPath('/root/.fastai/data/trwiki/log'),
 PosixPath('/root/.fastai/data/trwiki/wikiextractor'),
 PosixPath('/root/.fastai/data/trwiki/models'),
 PosixPath('/root/.fastai/data/trwiki/docs'),
 PosixPath('/root/.fastai/data/trwiki/trwiki'),
 PosixPath('/root/.fastai/data/trwiki/trwiki-latest-pages-articles.xml'),
 PosixPath('/root/.fastai/data/trwiki/movies'),
 PosixPath('/root/.fastai/data/trwiki/trwiki-latest-pages-articles.xml.bz2')]

In [0]:
path_clas = path/'movies'
path_clas.mkdir(exist_ok=True)
!unzip /content/Turkish_Movie_Sentiment.zip -d {path_clas}

Archive:  /content/Turkish_Movie_Sentiment.zip
replace /root/.fastai/data/trwiki/movies/tr_polarity.neg? [y]es, [n]o, [A]ll, [N]one, [r]ename: 

KeyboardInterrupt: ignored

In [0]:
path_clas.ls()

[PosixPath('/root/.fastai/data/trwiki/movies/tr_polarity.neg'),
 PosixPath('/root/.fastai/data/trwiki/movies/models'),
 PosixPath('/root/.fastai/data/trwiki/movies/tr_clas_databunch'),
 PosixPath('/root/.fastai/data/trwiki/movies/tr_polarity.pos')]

In [0]:
pos = (path_clas/'tr_polarity.pos').open(encoding='iso-8859-9').readlines()
pos_df = pd.DataFrame({'text':pos})
pos_df['pos']=1
pos_df.head()

Unnamed: 0,text,pos
0,gerçekten harika bir yapim birçok kez izledim ...,1
1,her izledigimde hayranlik duydugum gerçek klas...,1
2,gerçekten tarihi savas filmleri arasinda tarti...,1
3,aldigi ödülleri sonuna dek hak eden muhtesem b...,1
4,özgürlük denilince aklima gelen ilk film.bir b...,1


In [0]:
neg = (path_clas/'tr_polarity.neg').open(encoding='iso-8859-9').readlines()
neg_df = pd.DataFrame({'text':neg})
neg_df['pos']=0
neg_df.head()

Unnamed: 0,text,pos
0,giseye oynayan bir film.mel gibson'in oyunculu...,0
1,bircok yonden sahip olduklari zayifliklari pop...,0
2,"1995 ten bu yana bu tür filmler artti , o zama...",0
3,mel gibson tam bir ingiliz düsmani her filmind...,0
4,milliyetçi bir film tavsiye etmiyorum.... \n,0


In [0]:
df = pd.concat([pos_df,neg_df],sort=False)

Here we need to make sure that we use the same SentencePiece vocabulary and model that we used for the Turkish Wikipedia model, otherwise the text will be tokenized in a different way and it won't make any sense. **SPProcessor.load(dest)** takes the path to the data from your pretrained Wikipedia model and loads up the SentencePiece vocab from there. When we called **SPProcessor()** above, it automatically saved the vocab for us.

With this approach, we can make sure our data is tokenized in the same way as before.
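
A toy sketch of why the vocab has to match: the pretrained model's embedding rows are looked up by token id, so the same tokens must map to the same ids. The `build_vocab` helper below is a made-up stand-in for a learned vocab, not the fastai or SentencePiece API.

```python
def build_vocab(tokens):
    """Assign ids in order of first appearance (a stand-in for a learned vocab)."""
    unique_tokens = list(dict.fromkeys(tokens))
    return {tok: i for i, tok in enumerate(unique_tokens)}

# Two vocabs built independently from different text end up with different ids.
wiki_vocab = build_vocab("▁bir ▁film ▁çok ▁iyi".split())
review_vocab = build_vocab("▁çok ▁iyi ▁bir ▁film".split())

sentence = "▁çok ▁iyi ▁bir ▁film".split()
ids_expected_by_model = [wiki_vocab[t] for t in sentence]
ids_from_new_vocab = [review_vocab[t] for t in sentence]
print(ids_expected_by_model)  # [2, 3, 0, 1]
print(ids_from_new_vocab)     # [0, 1, 2, 3]
```

Feeding the second list of ids to a model trained with the first vocab would look up the wrong embedding for every token, which is the "won't make any sense" failure described above.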

NLP has changed dramatically in the last couple of years, and almost nobody is familiar with these techniques yet, so everything is low-hanging fruit: pretty much anything you apply them to, you'll probably get much better results than anybody got before. There are lots of opportunities to create products that didn't exist before. Any product that currently uses language can probably be made better, and it's really all about taking advantage of transfer learning in this way.

In [0]:
data_lm = (TextList.from_df(df,path_clas,cols='text',processor=SPProcessor.load(dest))
                   .split_by_rand_pct(0.1,seed=42)
                   .label_for_lm()
                   .databunch(bs=bs,num_workers=1))

data_lm.save(f'{lang}_clas_databunch')

In [0]:
data_lm = load_data(path_clas,f'{lang}_clas_databunch',bs=bs)


In [0]:
data_lm.show_batch()

idx,text
0,"sa ▁bile ▁sinema ▁olgusu nun ▁en ▁üst ▁noktalar in dan . . ▁xxbos ▁gerçekten ▁tarihi ▁sava s ▁filmleri ▁ara si nda ▁tar tis ma siz ▁en ▁iyi si ▁ , ▁12 ▁ yi l ▁boyunca ▁ a caba ▁ikincisi ▁çek ir imi ▁diye ▁bekledi gi m ▁bir ▁film ▁ , bel ki ▁william ▁wallace ▁baba sinin ▁ölümünden ▁sonra ▁amca si ▁yani na ▁al mis ti ▁onu ▁ ye tis tir"
1,ni ▁öldür se ler ▁o ▁kadar ▁ üzülme m özgürlük ▁xxrep ▁4 ▁! ▁ . ▁xxbos ▁bi ▁çok ▁aç i dan ▁kusur suz ▁bir ▁film ▁olsa da ▁benim ▁beklentileri mi ▁tam ▁kars i lama di . . . ama ▁kötü ▁film mi ▁kesinlikle ▁hay ir . . . özellikle ▁son ▁20 ▁da k ki kas in da ▁etkilenme k ▁elde ▁de gil . . . ama ▁film ▁bana ▁istedi gi
2,", bi ▁gün i çinde ▁ na sil ▁fark li ▁yön lere ▁ çekil e bile ce ini ▁muh te sem ▁anlatan ▁bi ▁film ▁xxrep ▁4 ▁ . ▁xxbos ▁filmde ▁o ▁kadar ▁çok ▁ve ▁de ▁ hos ▁ay rin ti ▁var ▁ki ▁tekrar ▁tekrar ▁izlenebil cek ▁ender ▁filmlerden . ba mba ska ▁bir ▁film ▁konuyu ▁sunar ken ▁seçti gi ▁yöntem ▁itibar iyle . . . ▁xxbos ▁tam am ▁bi ▁sey"
3,ligi m ▁bunun ▁üstesinde n ▁geldi . ben ce ▁yine ▁de ▁izlenme ye ▁de ger ▁olan ▁bir ▁film . ay ri ca ▁filmin ▁yap ti gi ▁gönderme ler ▁var . bu nu ▁yak layan lar ▁zaten ▁biliyordu r . . ▁xxbos ▁ ' benim ▁ a dim ▁robert ▁neville . ▁new ▁york ▁ se hri nde ▁hayatta ▁kalan ▁biri yim . ▁sesi mi ▁duyan ▁biri ▁varsa ▁herhangi ▁biri . ▁
4,"▁for rest , ▁run ▁xxrep ▁4 ▁ . ▁xxbos ▁gerçekten ▁bu da ▁bi ▁tom ▁han ks ▁ kla si gi , . ▁xxbos ▁ya ▁bu ▁filme ▁9 ▁puan ▁veren ▁adam ▁utan ma li ▁ya w ▁film ▁süper . ▁xxbos ▁be gen me gen ler ▁ya ▁sonuna ▁kadar ▁izleme mis ▁ya ▁da ▁geyik ▁ortam in da ▁geyik ▁yaparak ▁izle mistir . ▁hala ▁da ▁be gen meyen ▁varsa ▁hakikat en ▁ya ▁kendinden"


In [0]:
learn_lm = language_model_learner(data_lm,AWD_LSTM,pretrained_fnames=lm_fns,drop_mult=1.0,wd=0.1)

In [0]:
lr = 1e-3
lr *= bs/48 

In [0]:
learn_lm.fit_one_cycle(5,lr*10,moms=(0.8,0.7))

epoch,train_loss,valid_loss,accuracy,time
0,5.087432,4.409014,0.28212,00:28
1,4.415067,4.180242,0.308997,00:28
2,4.154958,4.093359,0.31962,00:27
3,3.963784,4.041077,0.326145,00:27
4,3.842597,4.02941,0.327106,00:28


In [0]:
learn_lm.unfreeze()
learn_lm.fit_one_cycle(5,slice(lr/10,lr*10),moms=(0.8,0.7))

epoch,train_loss,valid_loss,accuracy,time
0,3.847214,4.004792,0.329625,00:34
1,3.83503,3.966111,0.33544,00:34
2,3.650264,3.893359,0.346062,00:34
3,3.427696,3.865988,0.352953,00:34
4,3.264168,3.871879,0.353846,00:34


In [0]:
learn_lm.save(f'{lang}fine_tuned')
learn_lm.save_encoder(f'{lang}fine_tuned_enc')

### Classifier

In [0]:
path_clas.ls()

[PosixPath('/root/.fastai/data/trwiki/movies/tr_polarity.neg'),
 PosixPath('/root/.fastai/data/trwiki/movies/models'),
 PosixPath('/root/.fastai/data/trwiki/movies/tr_clas_databunch'),
 PosixPath('/root/.fastai/data/trwiki/movies/tr_polarity.pos')]

In [0]:
data_clas = (TextList.from_df(df,path_clas,cols='text',processor=SPProcessor.load(dest))
                     .split_by_rand_pct(0.1,seed=42)
                     .label_from_df(cols='pos')
                     .databunch(bs=bs,num_workers=1))

In [0]:
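# The first argument must be the DataBunch (data_clas), not the path
learn_c = text_classifier_learner(data_clas,AWD_LSTM,drop_mult=0.5,pretrained=False,wd=0.1).to_fp16()
# Use the same name that was passed to save_encoder above
learn_c.load_encoder(f'{lang}fine_tuned_enc')
learn_c.freeze()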

In [0]:
lr=2e-2
lr *= bs/48

In [0]:
learn_c.fit_one_cycle(2,lr,moms=(0.8,0.7))

In [0]:
learn_c.fit_one_cycle(2,lr,moms=(0.8,0.7))

In [0]:
learn_c.freeze_to(-2)
learn_c.fit_one_cycle(2,slice(lr/(2.6**4),lr),moms=(0.8,0.7))

In [0]:
learn_c.freeze_to(-3)
learn_c.fit_one_cycle(2, slice(lr/2/(2.6**4),lr/2), moms=(0.8,0.7))

In [0]:
learn_c.unfreeze()
learn_c.fit_one_cycle(4, slice(lr/10/(2.6**4),lr/10), moms=(0.8,0.7))
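
The `slice(lr/(2.6**4), lr)` arguments give different learning rates to different layer groups. The sketch below assumes that fastai v1 spreads a slice geometrically across the layer groups and that this classifier has 5 of them; both are assumptions on my part, and `geometric_lrs` is a plain-Python illustration rather than a fastai function.

```python
def geometric_lrs(start, stop, n_groups):
    """Return n_groups rates from start to stop with a constant multiplicative step."""
    step = (stop / start) ** (1 / (n_groups - 1))
    return [start * step ** i for i in range(n_groups)]

lr = 2e-2
n_groups = 5  # assumed number of layer groups in the classifier

group_lrs = geometric_lrs(lr / (2.6 ** 4), lr, n_groups)
for i, group_lr in enumerate(group_lrs):
    print(f"layer group {i}: lr = {group_lr:.2e}")
```

Under those assumptions, each layer group trains 2.6 times faster than the one before it, so the earliest layers (closest to the pretrained embeddings) change the least and the classifier head changes the most.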

The accuracy reported in Gezici (2018), Sentiment Analysis in Turkish, is 75.16%. With our result, we beat that model.

In [0]:
learn_c.save(f'{lang}clas')