<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#How-to-train-a-new-language-model-from-scratch-using-Transformers-and-Tokenizers" data-toc-modified-id="How-to-train-a-new-language-model-from-scratch-using-Transformers-and-Tokenizers-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>How to train a new language model from scratch using Transformers and Tokenizers</a></span><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#Notebook-edition-(link-to-blogpost-link).-Last-update-May-15,-2020" data-toc-modified-id="Notebook-edition-(link-to-blogpost-link).-Last-update-May-15,-2020-1.0.1"><span class="toc-item-num">1.0.1&nbsp;&nbsp;</span>Notebook edition (link to blogpost <a href="https://huggingface.co/blog/how-to-train" target="_blank">link</a>). Last update May 15, 2020</a></span></li></ul></li><li><span><a href="#1.-Find-a-dataset" data-toc-modified-id="1.-Find-a-dataset-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>1. Find a dataset</a></span></li><li><span><a href="#2.-Train-a-tokenizer" data-toc-modified-id="2.-Train-a-tokenizer-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>2. Train a tokenizer</a></span></li><li><span><a href="#3.-Train-a-language-model-from-scratch" data-toc-modified-id="3.-Train-a-language-model-from-scratch-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>3. Train a language model from scratch</a></span><ul class="toc-item"><li><span><a href="#We'll-define-the-following-config-for-the-model" data-toc-modified-id="We'll-define-the-following-config-for-the-model-1.3.1"><span class="toc-item-num">1.3.1&nbsp;&nbsp;</span>We'll define the following config for the model</a></span></li><li><span><a href="#Now-let's-build-our-training-Dataset" data-toc-modified-id="Now-let's-build-our-training-Dataset-1.3.2"><span class="toc-item-num">1.3.2&nbsp;&nbsp;</span>Now let's build our training Dataset</a></span></li><li><span><a href="#Finally,-we-are-all-set-to-initialize-our-Trainer" data-toc-modified-id="Finally,-we-are-all-set-to-initialize-our-Trainer-1.3.3"><span class="toc-item-num">1.3.3&nbsp;&nbsp;</span>Finally, we are all set to initialize our Trainer</a></span></li><li><span><a href="#Start-training" data-toc-modified-id="Start-training-1.3.4"><span class="toc-item-num">1.3.4&nbsp;&nbsp;</span>Start training</a></span><ul class="toc-item"><li><span><a href="#üéâ-Save-final-model-(+-tokenizer-+-config)-to-disk" data-toc-modified-id="üéâ-Save-final-model-(+-tokenizer-+-config)-to-disk-1.3.4.1"><span class="toc-item-num">1.3.4.1&nbsp;&nbsp;</span>üéâ Save final model (+ tokenizer + config) to disk</a></span></li></ul></li></ul></li><li><span><a href="#4.-Check-that-the-LM-actually-trained" data-toc-modified-id="4.-Check-that-the-LM-actually-trained-1.4"><span class="toc-item-num">1.4&nbsp;&nbsp;</span>4. Check that the LM actually trained</a></span></li></ul></li></ul></div>

# How to train a new language model from scratch using Transformers and Tokenizers

### Notebook edition (link to blogpost [link](https://huggingface.co/blog/how-to-train)). Last update May 15, 2020


Over the past few months, we made several improvements to our [`transformers`](https://github.com/huggingface/transformers) and [`tokenizers`](https://github.com/huggingface/tokenizers) libraries, with the goal of making it easier than ever to **train a new language model from scratch**.

In this post we‚Äôll demo how to train a ‚Äúsmall‚Äù model (84 M parameters = 6 layers, 768 hidden size, 12 attention heads) ‚Äì that‚Äôs the same number of layers & heads as DistilBERT ‚Äì on **Esperanto**. We‚Äôll then fine-tune the model on a downstream task of part-of-speech tagging.


## 1. Find a dataset

First, let us find a corpus of text in Esperanto. Here we‚Äôll use the Esperanto portion of the [OSCAR corpus](https://traces1.inria.fr/oscar/) from INRIA.
OSCAR is a huge multilingual corpus obtained by language classification and filtering of [Common Crawl](https://commoncrawl.org/) dumps of the Web.

<img src="https://huggingface.co/blog/assets/01_how-to-train/oscar.png" style="margin: auto; display: block; width: 260px;">

The Esperanto portion of the dataset is only 299M, so we‚Äôll concatenate with the Esperanto sub-corpus of the [Leipzig Corpora Collection](https://wortschatz.uni-leipzig.de/en/download), which is comprised of text from diverse sources like news, literature, and wikipedia.

The final training corpus has a size of 3 GB, which is still small ‚Äì for your model, you will get better results the more data you can get to pretrain on. 



## 2. Train a tokenizer

We choose to train a byte-level Byte-pair encoding tokenizer (the same as GPT-2), with the same special tokens as RoBERTa. Let‚Äôs arbitrarily pick its size to be 52,000.

We recommend training a byte-level BPE (rather than let‚Äôs say, a WordPiece tokenizer like BERT) because it will start building its vocabulary from an alphabet of single bytes, so all words will be decomposable into tokens (no more `<unk>` tokens!).


In [1]:
import os
import getpass

#For a kaggle username & key, just go to your kaggle account and generate key
#The JSON file so downloaded contains both of them
if("examine-the-examiner.zip" not in os.listdir()):
  print("Copy these two values from the JSON file so generated")
  os.environ['KAGGLE_USERNAME'] = "iamdenay"
  os.environ['KAGGLE_KEY'] =  "3601827abd9232de7a43a10964bde3c0"
  !kaggle datasets download -d zhaseke/kazakhoscarcorpus
  !unzip /content/examine-the-examiner.zip

Copy these two values from the JSON file so generated
Downloading kazakhoscarcorpus.zip to /home/ubuntu/ATABA/src/kazBert
100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 636M/636M [01:01<00:00, 11.3MB/s]
100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 636M/636M [01:01<00:00, 10.8MB/s]
unzip:  cannot find or open /content/examine-the-examiner.zip, /content/examine-the-examiner.zip.zip or /content/examine-the-examiner.zip.ZIP.


In [71]:
from bs4 import BeautifulSoup
import re

Error in callback <function _WandbInit._resume_backend at 0x7f3c80679a60> (for pre_run_cell):


Exception: The wandb backend process has shutdown

Error in callback <function _WandbInit._pause_backend at 0x7f3c80679510> (for post_run_cell):


Exception: The wandb backend process has shutdown

In [2]:
import os
import pandas as pd

df = pd.DataFrame(columns=['observation'])

path = 'text/'

for directory in os.listdir(path):
    directory = os.path.join(path, directory)
    if os.path.isdir(directory):
        for filename in os.listdir(directory):
            print(filename)
            with open(os.path.join(directory, filename)) as f:
                soup = BeautifulSoup(f, "html.parser")
                text = soup.get_text()
                with open(os.path.join(directory, filename), "w") as f:
                    f.write(text)

wiki_84
wiki_92
wiki_25
wiki_17
wiki_51
wiki_97
wiki_18
wiki_45
wiki_26
wiki_21
wiki_34
wiki_80
wiki_59
wiki_62
wiki_70
wiki_69
wiki_49
wiki_91
wiki_60
wiki_03
wiki_02
wiki_66
wiki_46
wiki_28
wiki_40
wiki_65
wiki_73
wiki_85
wiki_36
wiki_23
wiki_42
wiki_68
wiki_54
wiki_52
wiki_83
wiki_08
wiki_11
wiki_82
wiki_95
wiki_44
wiki_61
wiki_48
wiki_43
wiki_31
wiki_19
wiki_78
wiki_29
wiki_37
wiki_74
wiki_79
wiki_39
wiki_13
wiki_98
wiki_47
wiki_12
wiki_32
wiki_71
wiki_63
wiki_20
wiki_81
wiki_86
wiki_76
wiki_58
wiki_88
wiki_27
wiki_94
wiki_96
wiki_56
wiki_05
wiki_53
wiki_64
wiki_35
wiki_41
wiki_55
wiki_38
wiki_89
wiki_04
wiki_30
wiki_33
wiki_75
wiki_07
wiki_22
wiki_00
wiki_77
wiki_15
wiki_16
wiki_93
wiki_50
wiki_72
wiki_24
wiki_01
wiki_87
wiki_57
wiki_06
wiki_90
wiki_09
wiki_10
wiki_99
wiki_14
wiki_67
wiki_84
wiki_92
wiki_25
wiki_17
wiki_51
wiki_97
wiki_18
wiki_45
wiki_26
wiki_21
wiki_34
wiki_80
wiki_59
wiki_62
wiki_70
wiki_69
wiki_49
wiki_91
wiki_60
wiki_03
wiki_02
wiki_66
wiki_46
wiki_28
wiki_40


In [1]:
%%time 
from pathlib import Path
import glob
from tokenizers import ByteLevelBPETokenizer

paths = ['azoscar.txt']

# Initialize a tokenizer
tokenizer = ByteLevelBPETokenizer()

# Customize training
tokenizer.train(files=paths, vocab_size=52_000, min_frequency=2, special_tokens=[
    "<s>",
    "<pad>",
    "</s>",
    "<unk>",
    "<mask>",
])

CPU times: user 31min 49s, sys: 4min 52s, total: 36min 41s
Wall time: 1min 31s


In [None]:
%%time 
from pathlib import Path
import glob
from tokenizers import ByteLevelBPETokenizer

paths = [str(x) for x in glob.glob(r'text/**/*')]

# Initialize a tokenizer
tokenizer = ByteLevelBPETokenizer()

# Customize training
tokenizer.train(files=paths, vocab_size=52_000, min_frequency=2, special_tokens=[
    "<s>",
    "<pad>",
    "</s>",
    "<unk>",
    "<mask>",
])

Now let's save files to disk

In [2]:
!mkdir azBERT
tokenizer.save_model("azBERT")

mkdir: cannot create directory ‚ÄòazBERT‚Äô: File exists


['azBERT/vocab.json', 'azBERT/merges.txt']

In [2]:
!mkdir kazBERT
tokenizer.save_model("kazBERT")

['kazBERT/vocab.json', 'kazBERT/merges.txt']

In [3]:
%%time 
from pathlib import Path
import glob
from tokenizers import ByteLevelBPETokenizer

CPU times: user 44 ¬µs, sys: 6 ¬µs, total: 50 ¬µs
Wall time: 61.3 ¬µs


In [4]:
from tokenizers.implementations import ByteLevelBPETokenizer
from tokenizers.processors import BertProcessing


tokenizer = ByteLevelBPETokenizer(
    "./azBERT/vocab.json",
    "./azBERT/merges.txt",
)

In [5]:
tokenizer._tokenizer.post_processor = BertProcessing(
    ("</s>", tokenizer.token_to_id("</s>")),
    ("<s>", tokenizer.token_to_id("<s>")),
)
tokenizer.enable_truncation(max_length=512)

In [6]:
tokenizer.encode("az…ôrtac x…ôb…ôr verir ki.")

Encoding(num_tokens=9, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])

In [7]:
tokenizer.encode("az…ôrtac x…ôb…ôr verir ki.").tokens

['<s>', 'az', '√âƒªr', 'tac', 'ƒ†x√âƒªb√âƒªr', 'ƒ†verir', 'ƒ†ki', '.', '</s>']

In [8]:
from tokenizers.decoders import ByteLevel
decoder = ByteLevel()
decoder.decode([ 'ƒ†x√âƒªb√âƒªr' ])


' x…ôb…ôr'

## 3. Train a language model from scratch

**Update:** This section follows along the [`run_language_modeling.py`](https://github.com/huggingface/transformers/blob/master/examples/legacy/run_language_modeling.py) script, using our new [`Trainer`](https://github.com/huggingface/transformers/blob/master/src/transformers/trainer.py) directly. Feel free to pick the approach you like best.

> We‚Äôll train a RoBERTa-like model, which is a BERT-like with a couple of changes (check the [documentation](https://huggingface.co/transformers/model_doc/roberta.html) for more details).

As the model is BERT-like, we‚Äôll train it on a task of *Masked language modeling*, i.e. the predict how to fill arbitrary tokens that we randomly mask in the dataset. This is taken care of by the example script.


In [3]:
# Check that PyTorch sees it
import torch
torch.cuda.is_available()

True

### We'll define the following config for the model

In [10]:
from transformers import RobertaConfig

config = RobertaConfig(
    vocab_size=52_000,
    max_position_embeddings=514,
    num_attention_heads=12,
    num_hidden_layers=6,
    type_vocab_size=1,
)

Now let's re-create our tokenizer in transformers

In [25]:
from transformers import RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained("./azBERT", max_len=512)

Finally let's initialize our model.

**Important:**

As we are training from scratch, we only initialize from a config, not from an existing pretrained model or checkpoint.

In [12]:
from transformers import RobertaForMaskedLM

model = RobertaForMaskedLM(config)

In [13]:
model.num_parameters()
# => 84 million parameters

83504416

### Now let's build our training Dataset

We'll build our dataset by applying our tokenizer to our text file.

Here, as we only have one text file, we don't even need to customize our `Dataset`. We'll just use the `LineByLineDataset` out-of-the-box.

In [3]:
from datasets import load_dataset

In [None]:
'azoscar.txt'

In [16]:
import glob
paths = ['azoscar.txt']
dataset = load_dataset('text', data_files=paths)

Using custom data configuration default-b1af337a9aedfc0e


Downloading and preparing dataset text/default (download: Unknown size, generated: Unknown size, post-processed: Unknown size, total: Unknown size) to /home/denay/.cache/huggingface/datasets/text/default-b1af337a9aedfc0e/0.0.0/e16f44aa1b321ece1f87b07977cc5d70be93d69b20486d6dacd62e12cf25c9a5...


0 tables [00:00, ? tables/s]

Dataset text downloaded and prepared to /home/denay/.cache/huggingface/datasets/text/default-b1af337a9aedfc0e/0.0.0/e16f44aa1b321ece1f87b07977cc5d70be93d69b20486d6dacd62e12cf25c9a5. Subsequent calls will reuse this data.


In [3]:
import glob
paths = [str(x) for x in glob.glob(r'text/**/*')]
dataset = load_dataset('text', data_files=paths)

Using custom data configuration default-0ac4a4a107fff5df


Downloading and preparing dataset text/default (download: Unknown size, generated: Unknown size, post-processed: Unknown size, total: Unknown size) to /home/denay/.cache/huggingface/datasets/text/default-0ac4a4a107fff5df/0.0.0/e16f44aa1b321ece1f87b07977cc5d70be93d69b20486d6dacd62e12cf25c9a5...


0 tables [00:00, ? tables/s]

Dataset text downloaded and prepared to /home/denay/.cache/huggingface/datasets/text/default-0ac4a4a107fff5df/0.0.0/e16f44aa1b321ece1f87b07977cc5d70be93d69b20486d6dacd62e12cf25c9a5. Subsequent calls will reuse this data.


In [21]:
dataset['train'][15]

{'text': 'Psixologiya elmi q…ôdim tarix…ô malikdir. Psixoloji anlayƒ±≈ülar sistem ≈ü…ôklind…ô ilk d…ôf…ô olaraq Aristotelin (eramƒ±zdan …ôvv…ôl IV …ôsr) ""Ruh haqqƒ±nda"" traktatƒ±nda ≈ü…ôrh olunmu≈üdur. Traktat psixologiya yox, ""ruh haqqƒ±nda"" adlanƒ±r. Bu da t…ôsad√ºfi deyildir. Uzun m√ºdd…ôt (XIX …ôsrin sonlarƒ±na q…ôd…ôr) psixologiya elmi f…ôls…ôf…ôy…ô aid f…ônn hesab olunub. Avropa …ôd…ôbiyyatƒ±nda mental (latƒ±n s√∂z√º olub ""psixi olan"" dem…ôkdir) f…ôls…ôf…ô, ruhiyyat, pnevmatalogiya (pnevma ‚Äì yunan s√∂z√º olub n…ôf…ôs, ruh dem…ôkdir) adlandƒ±rƒ±lmƒ±≈ülar.'}

In [18]:
dataset['train'][2]

{'text': '06.10.22 20:27 …ôrzind…ô olan d…ôyi≈üiklikl…ôr Yazan Azer 06.10.22 20:14'}

In [14]:
%%time
from transformers import LineByLineTextDataset

CPU times: user 414 ms, sys: 74.9 ms, total: 489 ms
Wall time: 1.67 s


In [16]:


dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="azoscar.txt",
    block_size=128,
)

Like in the [`run_language_modeling.py`](https://github.com/huggingface/transformers/blob/master/examples/language-modeling/run_language_modeling.py) script, we need to define a data_collator.

This is just a small helper that will help us batch different samples of the dataset together into an object that PyTorch knows how to perform backprop on.

In [17]:
dataset

<transformers.data.datasets.language_modeling.LineByLineTextDataset at 0x7f3b6ecf3e80>

In [18]:
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

### Finally, we are all set to initialize our Trainer

In [38]:


from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./azBERT",
    overwrite_output_dir=True,
    num_train_epochs=3,
    per_gpu_train_batch_size=32,
    save_steps=10_000,
    save_total_limit=2,
    prediction_loss_only=True,
    report_to='none'
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
    
)

PyTorch: setting up devices


### Start training

In [39]:
%%time
trainer.train()

Using deprecated `--per_gpu_train_batch_size` argument which will be removed in a future version. Using `--per_device_train_batch_size` is preferred.
Using deprecated `--per_gpu_train_batch_size` argument which will be removed in a future version. Using `--per_device_train_batch_size` is preferred.
***** Running training *****
  Num examples = 580217
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 64
  Gradient Accumulation steps = 1
  Total optimization steps = 27198


Step,Training Loss
500,6.5598
1000,6.3014
1500,6.1104
2000,5.9148
2500,5.7507
3000,5.5926
3500,5.4184
4000,5.2699
4500,5.1143
5000,4.9799


Saving model checkpoint to ./azBERT/checkpoint-10000
Configuration saved in ./azBERT/checkpoint-10000/config.json
Model weights saved in ./azBERT/checkpoint-10000/pytorch_model.bin
IOPub message rate exceeded.
The Jupyter server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--ServerApp.iopub_msg_rate_limit`.

Current values:
ServerApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
ServerApp.rate_limit_window=3.0 (secs)



Training completed. Do not forget to share your model on huggingface.co/models =)




CPU times: user 4h 39min 36s, sys: 2min 4s, total: 4h 41min 40s
Wall time: 2h 41min 16s


TrainOutput(global_step=27198, training_loss=4.05967112216575, metrics={'train_runtime': 9676.447, 'train_samples_per_second': 179.885, 'train_steps_per_second': 2.811, 'total_flos': 5.771438944918733e+16, 'train_loss': 4.05967112216575, 'epoch': 3.0})

#### üéâ Save final model (+ tokenizer + config) to disk

In [40]:
trainer.save_model("./azBERT")

Saving model checkpoint to ./azBERT
Configuration saved in ./azBERT/config.json
Model weights saved in ./azBERT/pytorch_model.bin


## 4. Check that the LM actually trained

Aside from looking at the training and eval losses going down, the easiest way to check whether our language model is learning anything interesting is via the `FillMaskPipeline`.

Pipelines are simple wrappers around tokenizers and models, and the 'fill-mask' one will let you input a sequence containing a masked token (here, `<mask>`) and return a list of the most probable filled sequences, with their probabilities.



In [41]:
from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model="./azBERT",
    tokenizer="./azBERT"
)

loading configuration file ./azBERT/config.json
Model config RobertaConfig {
  "architectures": [
    "RobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 6,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "torch_dtype": "float32",
  "transformers_version": "4.10.0",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 52000
}

loading configuration file ./azBERT/config.json
Model config RobertaConfig {
  "architectures": [
    "RobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "gradient_ch

In [42]:
fill_mask("az…ôrtac x…ôb…ôr <mask> ki")

[{'sequence': 'az…ôrtac x…ôb…ôr verir ki',
  'score': 0.9791690707206726,
  'token': 1053,
  'token_str': ' verir'},
 {'sequence': 'az…ôrtac x…ôb…ôr verib ki',
  'score': 0.004408467561006546,
  'token': 2313,
  'token_str': ' verib'},
 {'sequence': 'az…ôrtac x…ôb…ôr yayƒ±b ki',
  'score': 0.00216124439612031,
  'token': 6580,
  'token_str': ' yayƒ±b'},
 {'sequence': 'az…ôrtac x…ôb…ôr agentliyi ki',
  'score': 0.0014381826622411609,
  'token': 14711,
  'token_str': ' agentliyi'},
 {'sequence': 'az…ôrtac x…ôb…ôraz ki',
  'score': 0.0012858203845098615,
  'token': 320,
  'token_str': 'az'}]

In [43]:
fill_mask("M…ôn…ô o yum≈üaq fransƒ±z bulkalarƒ±ndan <mask> √ßox ver")

[{'sequence': 'M…ôn…ô o yum≈üaq fransƒ±z bulkalarƒ±ndan daha √ßox ver',
  'score': 0.5982716083526611,
  'token': 716,
  'token_str': ' daha'},
 {'sequence': 'M…ôn…ô o yum≈üaq fransƒ±z bulkalarƒ±ndan bir √ßox ver',
  'score': 0.1061108186841011,
  'token': 374,
  'token_str': ' bir'},
 {'sequence': 'M…ôn…ô o yum≈üaq fransƒ±z bulkalarƒ±ndan biri √ßox ver',
  'score': 0.05577299743890762,
  'token': 1331,
  'token_str': ' biri'},
 {'sequence': 'M…ôn…ô o yum≈üaq fransƒ±z bulkalarƒ±ndan …ôn √ßox ver',
  'score': 0.029407601803541183,
  'token': 745,
  'token_str': ' …ôn'},
 {'sequence': 'M…ôn…ô o yum≈üaq fransƒ±z bulkalarƒ±ndan √ßox √ßox ver',
  'score': 0.011952652595937252,
  'token': 524,
  'token_str': ' √ßox'}]

In [44]:
fill_mask("Az…ôrbaycan Ordusunun m√∂vqel…ôri artilleriya <mask> m…ôruz qalƒ±r")

[{'sequence': 'Az…ôrbaycan Ordusunun m√∂vqel…ôri artilleriya at…ô≈üin…ô m…ôruz qalƒ±r',
  'score': 0.23799513280391693,
  'token': 29362,
  'token_str': ' at…ô≈üin…ô'},
 {'sequence': 'Az…ôrbaycan Ordusunun m√∂vqel…ôri artilleriya h√ºcumuna m…ôruz qalƒ±r',
  'score': 0.059697140008211136,
  'token': 35291,
  'token_str': ' h√ºcumuna'},
 {'sequence': 'Az…ôrbaycan Ordusunun m√∂vqel…ôri artilleriya at…ô≈ü…ô m…ôruz qalƒ±r',
  'score': 0.023507563397288322,
  'token': 5703,
  'token_str': ' at…ô≈ü…ô'},
 {'sequence': 'Az…ôrbaycan Ordusunun m√∂vqel…ôri artilleriya silahlardan m…ôruz qalƒ±r',
  'score': 0.015011416748166084,
  'token': 32201,
  'token_str': ' silahlardan'},
 {'sequence': 'Az…ôrbaycan Ordusunun m√∂vqel…ôri artilleriya q√ºvv…ôl…ôrinin m…ôruz qalƒ±r',
  'score': 0.014030301943421364,
  'token': 7886,
  'token_str': ' q√ºvv…ôl…ôrinin'}]

In [28]:
import pickle

In [29]:
dataset1 = pickle.load(open( "./dataset1.p", "rb" ))

In [30]:
metrics=trainer.evaluate(dataset1)
print(metrics)

***** Running Evaluation *****
  Num examples = 144074
  Batch size = 16


Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"
Using deprecated `--per_gpu_train_batch_size` argument which will be removed in a future version. Using `--per_device_train_batch_size` is preferred.
Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
[34m[1mwandb[0m: Currently logged in as: [33mdeviloper[0m (use `wandb login --relogin` to force relogin)
[34m[1mwandb[0m: wandb version 0.12.1 is available!  To upgrade, please run:
[34m[1mwandb[0m:  $ pip install wandb --upgrade


{'eval_loss': 11.30222225189209, 'eval_runtime': 262.9688, 'eval_samples_per_second': 547.875, 'eval_steps_per_second': 34.244}


In [101]:
import pandas as pd
df = pd.read_csv('/home/denay/azer-bert/azertag_sentences_with_categories.csv')


ERROR:root:Internal Python error in the inspect module.
Below is the traceback from this internal error.



Traceback (most recent call last):
  File "/home/ubuntu/.virtualenvs/ml/lib/python3.6/site-packages/pandas/io/parsers.py", line 2157, in read
    data = self._reader.read(nrows)
  File "pandas/_libs/parsers.pyx", line 847, in pandas._libs.parsers.TextReader.read
  File "pandas/_libs/parsers.pyx", line 862, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas/_libs/parsers.pyx", line 941, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 1073, in pandas._libs.parsers.TextReader._convert_column_data
  File "pandas/_libs/parsers.pyx", line 1119, in pandas._libs.parsers.TextReader._convert_tokens
  File "pandas/_libs/parsers.pyx", line 1162, in pandas._libs.parsers.TextReader._convert_with_dtype
  File "/home/ubuntu/.virtualenvs/ml/lib/python3.6/site-packages/pandas/core/dtypes/common.py", line 530, in is_categorical_dtype
    def is_categorical_dtype(arr_or_dtype) -> bool:
KeyboardInterrupt

During handling of the above exception, another 

TypeError: object of type 'NoneType' has no len()

In [None]:
df

In [None]:
pd.value_counts(df['category'])

In [1]:
from datasets import load_dataset, load_from_disk


In [36]:
import datasets as ds

In [2]:
dataset = load_dataset("csv", data_files='/home/denay/azer-bert/azertag_sentences_with_categories.csv')


Using custom data configuration default-00fbab3f4ac620d9
Reusing dataset csv (/home/denay/.cache/huggingface/datasets/csv/default-00fbab3f4ac620d9/0.0.0/652c3096f041ee27b04d2232d41f10547a8fecda3e284a79a0ec4053c916ef7a)


  0%|          | 0/1 [00:00<?, ?it/s]

In [103]:
dataset = load_dataset("pandas", data_files="/home/denay/azer-bert/merged_neural.pkl")
# dataset = dataset['train']
# dataset = dataset.rename_column("content", "text")

KeyboardInterrupt: 

In [12]:
dataset

DatasetDict({
    train: Dataset({
        features: ['labels', 'text'],
        num_rows: 3492453
    })
})

In [6]:
dataset = dataset.remove_columns(["url", "news_id",'order_number','sub_category'])

In [4]:
dataset = dataset.remove_columns(["url", "news_id",'order_number','sub_category','text_az', 'text_en', '__index_level_0__'])

ValueError: Column name text_az not in the dataset. Current columns in the dataset: ['url', 'category', 'sub_category', 'news_id', 'order_number', 'text_content']

In [None]:
dataset = dataset.remove_columns(['text_content'])

In [10]:
dataset = dataset.rename_column("category", "labels")

In [20]:
dataset = dataset.rename_column("text_az_lower", "text")

In [21]:
dataset['train']

Dataset({
    features: ['labels', 'text'],
    num_rows: 5480028
})

In [11]:
label2id = {"C∆èMƒ∞YY∆èT": 0, "R∆èSMƒ∞ XRONƒ∞KA": 1, "ƒ∞QTƒ∞SADƒ∞YYAT": 2, "R∆èSMƒ∞ S∆èN∆èDL∆èR": 3, "ELM V∆è T∆èHSƒ∞L": 4, "REGƒ∞ONLAR": 5, "Sƒ∞YAS∆èT": 6, "M∆èD∆èNƒ∞YY∆èT": 7, "QAN YADDA≈ûI": 8, "D√úNYA": 9, "ƒ∞DMAN": 10, "C√úMHURƒ∞YY∆èT - 100": 11, "M√úSAHƒ∞B∆è":12, "BA≈û X∆èB∆èRL∆èR":13, "≈ûU≈ûA ƒ∞Lƒ∞":14}

In [15]:
from transformers import RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained("./azBERT", max_len=512)

In [16]:
def tokenize(batch):
    tokenized_batch = tokenizer(batch['text'], padding='max_length', truncation=True, max_length=128)
    tokenized_batch["labels"] = [label2id[label] for label in batch["labels"]]
    return tokenized_batch

In [17]:
datasets = dataset.map(tokenize, batched=True)

  0%|          | 0/3493 [00:00<?, ?ba/s]

In [18]:
datasets.save_to_disk('datasets-original-tokenized')

In [28]:
datasets.save_to_disk('datasets-tokenized-max-length')

In [19]:
datasets

DatasetDict({
    train: Dataset({
        features: ['labels', 'text', 'input_ids', 'attention_mask'],
        num_rows: 3492453
    })
})

In [104]:
datasets = load_from_disk('datasets')

In [2]:
datasets = load_from_disk('datasets-tokenized')

In [2]:
datasets = load_from_disk('datasets-stratified')

In [16]:
from datasets import Dataset, ClassLabel

In [20]:
datasets = datasets.class_encode_column('labels', )

Stringifying the column:   0%|          | 0/3493 [00:00<?, ?ba/s]

Casting to class labels:   0%|          | 0/3493 [00:00<?, ?ba/s]

Casting the dataset:   0%|          | 0/350 [00:00<?, ?ba/s]

In [17]:
datasets = datasets.cast_column("labels", ClassLabel(num_classes=15))

Casting the dataset:   0%|          | 0/280 [00:00<?, ?ba/s]

Casting the dataset:   0%|          | 0/70 [00:00<?, ?ba/s]

In [18]:
datasets = datasets['train'].train_test_split(test_size=0.2, stratify_by_column='labels')

In [23]:
datasets.save_to_disk('datasets-original-stratified')

Flattening the indices:   0%|          | 0/2794 [00:00<?, ?ba/s]

Flattening the indices:   0%|          | 0/699 [00:00<?, ?ba/s]

In [22]:
import pandas as pd
import numpy as np

In [23]:
pd.value_counts(np.array(datasets['test']['labels']))

0     171922
1     115095
7      69454
8      54111
9      43322
10     25692
11     25599
12     21761
13     19127
14      7380
2       4435
3        835
4         37
5         17
6          6
dtype: int64

In [None]:
datasets.save_to_disk('datasets-stratified')

In [None]:
datasets_old = load_from_disk('datasets')

In [31]:
datasets_old['test'][0]['text']

'iÃáyunun 22-d…ô az…ôrbaycan respublikasƒ±nƒ±n prezidenti iÃálham …ôliyev v…ô √∂zb…ôkistan respublikasƒ±nƒ±n prezidenti ≈üavkat mirziyoyev xiv…ô ≈ü…ôh…ôrind…ô ‚Äúnurullaboy‚Äù saray kompleksi il…ô tanƒ±≈ü olublar.'

In [32]:
datasets['test'][0]['text']

'f…ôaliyy…ôt qurulu≈üunun bir par√ßasƒ± olaraq d√∂vl…ôt binasƒ±nƒ±n √∂n√ºnd…ôki yerd…ô sipressl…ôrin qoyuldu'

In [24]:
datasets

DatasetDict({
    train: Dataset({
        features: ['labels', 'text', 'input_ids', 'attention_mask'],
        num_rows: 2235169
    })
    test: Dataset({
        features: ['labels', 'text', 'input_ids', 'attention_mask'],
        num_rows: 558793
    })
})

In [27]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

model = AutoModelForSequenceClassification.from_pretrained(
    "./azBERT", num_labels=15
)

Some weights of the model checkpoint at ./azBERT were not used when initializing RobertaForSequenceClassification: ['lm_head.layer_norm.weight', 'lm_head.dense.bias', 'lm_head.dense.weight', 'lm_head.bias', 'lm_head.layer_norm.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at ./azBERT and are newly initialized: ['classifier.out_proj.bias', 'classifier.out_proj.weight', 'classifier.dense.bias', 'classifier.dense.weight']
You should probably T

In [56]:
import os
os.environ["WANDB_WATCH"] = "false"

In [28]:
from transformers import RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained("./azBERT", max_len=512)

In [3]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

In [31]:
1

1

In [None]:
training_args = TrainingArguments(
    output_dir="classifier-original-strat-no-collator",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=datasets["train"],
    eval_dataset=datasets["test"],
    tokenizer=tokenizer,
    # data_collator=data_collator,
    # compute_metrics=compute_metrics,
)

trainer.train(resume_from_checkpoint=False)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
The following columns in the training set  don't have a corresponding argument in `RobertaForSequenceClassification.forward` and have been ignored: text.
***** Running training *****
  Num examples = 2793962
  Num Epochs = 2
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 1
  Total optimization steps = 174624
Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"
Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
[34m[1mwandb[0m: Currently logged in as: [33miamdena



Epoch,Training Loss,Validation Loss


In [32]:
1

1

In [61]:
trainer.evaluate()

The following columns in the evaluation set  don't have a corresponding argument in `RobertaForSequenceClassification.forward` and have been ignored: text.
***** Running Evaluation *****
  Num examples = 1096006
  Batch size = 32


The history saving thread hit an unexpected error (OperationalError('database or disk is full',)).History will not be written to the database.




KeyboardInterrupt: 

In [32]:
trainer.save_model('classifier-original-strat-no-collator')

Saving model checkpoint to classifier-original-strat-no-collator
Configuration saved in classifier-original-strat-no-collator/config.json
Model weights saved in classifier-original-strat-no-collator/pytorch_model.bin
tokenizer config file saved in classifier-original-strat-no-collator/tokenizer_config.json
Special tokens file saved in classifier-original-strat-no-collator/special_tokens_map.json


In [33]:
from transformers import pipeline


In [36]:
classifier = pipeline(
    "text-classification",
    model="./classifier-original-strat-no-collator",
    tokenizer= "./azBERT", max_length=512, truncation=True  #"./azBERT"
)

loading configuration file ./classifier-original-strat-no-collator/config.json
Model config RobertaConfig {
  "_name_or_path": "./azBERT",
  "architectures": [
    "RobertaForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2",
    "3": "LABEL_3",
    "4": "LABEL_4",
    "5": "LABEL_5",
    "6": "LABEL_6",
    "7": "LABEL_7",
    "8": "LABEL_8",
    "9": "LABEL_9",
    "10": "LABEL_10",
    "11": "LABEL_11",
    "12": "LABEL_12",
    "13": "LABEL_13",
    "14": "LABEL_14"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_10": 10,
    "LABEL_11": 11,
    "LABEL_12": 12,
    "LABEL_13": 13,
    "LABEL_14": 14,
    "LABEL_2": 2,
    "LABEL_3": 3

TypeError: __init__() got an unexpected keyword argument 'max_length'

In [25]:
data_collator(datasets['test'][0])

ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length.

In [75]:
datasets['test'][0]

{'labels': 13,
 'text': 'm…ôrasimd…ô s√∂z alanlar milli lider heydar reylyevin tragediyanƒ±n g√ºnahkarlarƒ±nƒ± a√ßƒ±qlamasƒ±na, bu qanalarƒ±n katliamlarƒ±nƒ± siyasi v…ô hukuki a√ßƒ±dan deƒüerlendirm…ôy…ô v…ô ≈üehitl…ôrin adlarƒ±nƒ± davam ettirm…ôy…ô xidm…ôtin…ô b√∂y√ºk t…ô≈ü…ôkk√ºr ettiler.',
 'input_ids': [0,
  2655,
  7939,
  270,
  1004,
  21314,
  1815,
  3522,
  1398,
  583,
  17263,
  80,
  1821,
  264,
  283,
  14627,
  437,
  1182,
  16117,
  1069,
  5402,
  1553,
  16,
  449,
  1081,
  2771,
  2282,
  298,
  312,
  1069,
  2063,
  304,
  301,
  2657,
  14413,
  1173,
  383,
  1373,
  295,
  303,
  51176,
  1026,
  304,
  357,
  708,
  368,
  536,
  17875,
  1181,
  379,
  821,
  1026,
  17075,
  832,
  4919,
  379,
  88,
  15172,
  18,
  2,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,


In [10]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

model = AutoModelForSequenceClassification.from_pretrained(
    "./classifier-original-strat-no-collator", num_labels=15
)

In [32]:
datasets['test'][0]['labels']

2

In [6]:
import torch

In [40]:
model(torch.tensor([datasets['test'][0]['input_ids'], datasets['test'][0]['attention_mask']]))

SequenceClassifierOutput(loss=None, logits=tensor([[  2.2476,   2.8794,  -3.2040,  -4.7832,  -6.0020,  -7.9419, -10.1485,
           3.8386,   6.0702,  -0.2575,  -1.5031,  -0.4337,  -0.7097,  -2.5050,
          -3.0035],
        [  1.3702,   3.9130,  -3.5998,  -4.8467,  -6.5749,  -7.5954,  -9.5198,
           1.2158,   5.7206,   0.0210,  -0.7157,   0.0716,   0.1507,  -2.2955,
          -1.3261]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)

In [41]:
x = model(torch.tensor([datasets['test'][0]['input_ids'], datasets['test'][0]['attention_mask']])).logits

In [42]:
y = torch.add(x[0], x[1])

In [2]:
datasets = load_from_disk('datasets-original-stratified')

In [3]:
ds = datasets.remove_columns(['text'])

In [4]:
ds_test = ds['test'].with_format('torch')

In [7]:
dataloader = torch.utils.data.DataLoader(ds_test, batch_size=32)

In [8]:
for x in dataloader:
    print(x)  # model(torch.tensor())
    break

{'labels': tensor([ 8, 10,  9,  1,  1, 10,  1,  8,  8,  1,  7, 13,  0,  1,  0,  1,  7,  0,
         1,  0,  0,  0,  7,  7, 13,  0, 11,  1,  0,  0,  0,  7]), 'input_ids': tensor([[    0,   681,   990,  ...,     1,     1,     1],
        [    0,  3084,   522,  ...,     1,     1,     1],
        [    0,    17,   271,  ...,     1,     1,     1],
        ...,
        [    0,  5348,   451,  ...,     1,     1,     1],
        [    0,  5348,  8220,  ...,     1,     1,     1],
        [    0, 36879,  2254,  ...,     1,     1,     1]]), 'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        ...,
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0]])}


In [11]:
import tqdm
preds = []
labels = []
for batch in tqdm.tqdm(dataloader):
    p = model(batch['input_ids'], batch['attention_mask']).logits
    preds.extend(torch.argmax(p, dim=1))
    labels.extend(batch['labels'])

100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 21828/21828 [1:30:57<00:00,  4.00it/s]


In [None]:
1

In [22]:
errors

214

In [None]:
import pandas as pd

In [None]:
pd.value_counts(ds_test)

In [96]:
from sklearn.metrics import classification_report

In [97]:
print(classification_report(preds, labels))

              precision    recall  f1-score   support

           0       0.90      0.81      0.85    193116
           1       0.79      0.89      0.84    101645
           2       0.87      0.91      0.89     46975
           3       0.65      0.79      0.71      7654
           4       0.41      0.68      0.51       248
           5       0.04      1.00      0.07         7
           6       0.24      0.89      0.38        19
           7       0.85      0.82      0.83     72168
           8       0.94      0.92      0.93     55236
           9       0.75      0.77      0.76     42378
          10       0.80      0.88      0.84     23430
          11       0.58      0.66      0.62     22234
          12       0.76      0.77      0.76     21400
          13       0.97      0.90      0.93    228229
          14       0.74      0.97      0.84     62066

    accuracy                           0.86    876805
   macro avg       0.69      0.84      0.72    876805
weighted avg       0.87   

In [None]:
classifier(
    "Az…ôrbaycan Ordusunun m√∂vqel…ôri artilleriya at…ô≈üin…ô m…ôruz qalƒ±r"
)

In [None]:
preds = classifier(
    datasets['test']['text']
)

In [20]:
datasets['test']['labels'][:10]

[3, 5, 1, 0, 5, 2, 2, 0, 0, 1]

In [32]:
datasets['test']['labels']

[3,
 5,
 1,
 0,
 5,
 2,
 2,
 0,
 0,
 1,
 1,
 2,
 3,
 0,
 1,
 1,
 1,
 3,
 6,
 2,
 3,
 0,
 0,
 5,
 2,
 5,
 4,
 5,
 2,
 4,
 0,
 4,
 3,
 1,
 3,
 3,
 4,
 2,
 0,
 1,
 10,
 1,
 7,
 0,
 5,
 0,
 10,
 1,
 1,
 2,
 3,
 1,
 6,
 0,
 0,
 7,
 4,
 5,
 5,
 4,
 0,
 0,
 4,
 6,
 4,
 2,
 4,
 3,
 8,
 8,
 2,
 0,
 7,
 4,
 0,
 1,
 5,
 3,
 4,
 4,
 8,
 3,
 1,
 3,
 6,
 0,
 1,
 3,
 8,
 8,
 6,
 1,
 0,
 0,
 8,
 1,
 5,
 3,
 1,
 2,
 1,
 1,
 4,
 4,
 1,
 0,
 2,
 0,
 2,
 3,
 4,
 5,
 0,
 1,
 0,
 2,
 1,
 3,
 0,
 9,
 1,
 6,
 6,
 1,
 0,
 0,
 7,
 0,
 3,
 0,
 0,
 1,
 1,
 3,
 1,
 1,
 0,
 0,
 0,
 1,
 0,
 2,
 0,
 4,
 2,
 7,
 1,
 1,
 1,
 8,
 1,
 0,
 2,
 0,
 0,
 0,
 2,
 2,
 6,
 3,
 2,
 0,
 1,
 6,
 3,
 2,
 1,
 6,
 0,
 8,
 6,
 0,
 1,
 4,
 0,
 0,
 0,
 1,
 0,
 6,
 7,
 2,
 0,
 8,
 1,
 5,
 0,
 1,
 6,
 2,
 0,
 0,
 2,
 0,
 1,
 3,
 0,
 1,
 1,
 4,
 9,
 1,
 2,
 8,
 0,
 0,
 5,
 2,
 0,
 0,
 0,
 1,
 0,
 2,
 0,
 2,
 0,
 0,
 2,
 3,
 0,
 5,
 1,
 0,
 2,
 0,
 1,
 1,
 0,
 4,
 0,
 2,
 1,
 2,
 2,
 0,
 6,
 0,
 0,
 10,
 0,
 1,
 3,
 3,
 2,
 4,
 3,
 2,
 1,
 

In [25]:
from textaugment import EDA
import nltk

In [26]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /home/denay/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [23]:
t = EDA(random_state=42)

In [52]:
output = t.synonym_replacement('An official lunch was given in honor of Pakistani Prime Minister on behalf of the President of Azerbaijan'.lower())
print(output)

an official lunch was given in pureness of pakistani prime minister on behalf of the president of azerbaijan


In [None]:
label2id = {"C∆èMƒ∞YY∆èT": 0, "R∆èSMƒ∞ XRONƒ∞KA": 1, "ƒ∞QTƒ∞SADƒ∞YYAT": 2, "R∆èSMƒ∞ S∆èN∆èDL∆èR": 3, "ELM V∆è T∆èHSƒ∞L": 4, "REGƒ∞ONLAR": 5, "Sƒ∞YAS∆èT": 6, "M∆èD∆èNƒ∞YY∆èT": 7, "QAN YADDA≈ûI": 8, "D√úNYA": 9, "ƒ∞DMAN": 10, "C√úMHURƒ∞YY∆èT - 100": 11, "M√úSAHƒ∞B∆è":12, "BA≈û X∆èB∆èRL∆èR":13, "≈ûU≈ûA ƒ∞Lƒ∞":14}

In [None]:
id2label = dict((v, k) for k, v in label2id.items())

In [None]:
labels_for_aug = [id2label[x] for x in [8,9,10,11,12,13, 14]]

In [10]:
import pandas as pd
df = pd.read_csv('/home/denay/azer-bert/azertag_sentences_with_categories.csv')


df_for_augmentation = df[df['category'].str.contains('|'.join(labels_for_aug))]

In [146]:
from tqdm import tqdm as tqdm

In [147]:
tqdm.pandas()

In [194]:
df_for_augmentation

Unnamed: 0,url,category,sub_category,news_id,order_number,text_content
1208432,https://azertag.az/xeber/Tarixine_sahib_chixan...,C√úMHURƒ∞YY∆èT - 100,cumhuriyyet,63610,1,≈û…ômkird…ôki Heyd…ôr ∆èliyev M…ôrk…ôzind…ô Az…ôrbaycan...
1208433,https://azertag.az/xeber/Tarixine_sahib_chixan...,C√úMHURƒ∞YY∆èT - 100,cumhuriyyet,63610,2,Konfransda ≈û…ômkir Rayon ƒ∞cra Hakimiyy…ôtinin ba...
1208434,https://azertag.az/xeber/Tarixine_sahib_chixan...,C√úMHURƒ∞YY∆èT - 100,cumhuriyyet,63610,3,"AZ∆èRTAC-ƒ±n b√∂lg…ô m√ºxbiri x…ôb…ôr verir ki, konfr..."
1208435,https://azertag.az/xeber/Tarixine_sahib_chixan...,C√úMHURƒ∞YY∆èT - 100,cumhuriyyet,63610,4,Sonra Heyd…ôr ∆èliyev M…ôrk…ôzinin foyesind…ô AMEA-...
1208436,https://azertag.az/xeber/Tarixine_sahib_chixan...,C√úMHURƒ∞YY∆èT - 100,cumhuriyyet,63610,5,T…ôdbir i≈ütirak√ßƒ±larƒ± s…ôrgid…ô tariximizi …ôks et...
...,...,...,...,...,...,...
3492422,https://azertag.az/xeber/31_mart___azerbaycanl...,QAN YADDA≈ûI,bloody_memory_31_march,317210,73,Bu is…ô √∂zl√ºy√ºnd…ô XX …ôsr tariximizd…ô yeni-yeni ...
3492423,https://azertag.az/xeber/31_mart___azerbaycanl...,QAN YADDA≈ûI,bloody_memory_31_march,317210,74,1998-ci ild…ôn b…ôri respublikamƒ±zda 31 mart h…ôr...
3492424,https://azertag.az/xeber/31_mart___azerbaycanl...,QAN YADDA≈ûI,bloody_memory_31_march,317210,75,"Bu hadis…ôl…ôri, onlarƒ±n d…ôrsl…ôrini unutmaƒüa haq..."
3492425,https://azertag.az/xeber/31_mart___azerbaycanl...,QAN YADDA≈ûI,bloody_memory_31_march,317210,76,"Tarix he√ß zaman unudulmur, h…ôm d…ô yazƒ±lƒ±r."


In [2]:
from checkpoints import checkpoints
checkpoints.enable()

In [227]:
df_for_augmentation['text_en'] = df_for_augmentation['text_content'].safe_map(translate)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


In [230]:
df_for_augmentation.to_pickle('en_for_aug.pkl')

In [10]:
checkpoints.results

In [215]:
df_check = df_for_augmentation.copy()

In [216]:
df_check['text_en'] = checkpoints.results

In [218]:
df_check.to_pickle('df_check_165922.pkl')

In [259]:
df_for_augmentation['text_en'].reset_index()['text_en'][34523]

'It is known that dinosaurs were destroyed by this meteorite impact and mammals started to develop.'

In [239]:
df_for_augmentation[['category','text_en']].to_csv('for_eda.tsv',sep='\t', index=False)

In [265]:
df_for_augmentation['text_en'].str.findall(r'[^a-zA-Z0-9 ]').str.len().sort_values(ascending=False)

3448904    86
3439067    85
3420454    80
3389276    74
3298084    73
           ..
3363924     0
3307782     0
3391407     0
1210064     0
3359672     0
Name: text_en, Length: 198990, dtype: int64

In [323]:
test_df = df_for_augmentation.copy()

In [324]:
test_df['non-alpha'] = test_df['text_en'].str.findall(r'[^a-zA-Z ]').str.len().sort_values(ascending=False)

In [325]:
test_df['len'] = test_df['text_en'].str.len().sort_values(ascending=False)

In [326]:
test_df[test_df['len']==test_df['non-alpha']]

Unnamed: 0,url,category,sub_category,news_id,order_number,text_content,text_en,non-alpha,len
3322582,https://azertag.az/xeber/Bacariqsiz_dahiler-82...,D√úNYA,oddly_enough,303753,17,10.,10.,3,3
3334980,https://azertag.az/xeber/Dekabr_ayi_uchun_Fran...,D√úNYA,oddly_enough,304646,49,....,....,4,4
3356560,https://azertag.az/xeber/Masallinin_Sirebil_ke...,D√úNYA,oddly_enough,306324,10,######,######,6,6
3399018,https://azertag.az/xeber/IRALI_Ictimai_Birliyi...,QAN YADDA≈ûI,bloody_memory_20_january,309886,4,######,######,6,6
3399056,https://azertag.az/xeber/Meksikanin_iki_en_nuf...,QAN YADDA≈ûI,bloody_memory_20_january,309889,8,######,######,6,6
...,...,...,...,...,...,...,...,...,...
3490865,https://azertag.az/xeber/YASAMAL_RAYONUNDA_MAR...,QAN YADDA≈ûI,bloody_memory_31_march,317095,9,######,######,6,6
3491473,https://azertag.az/xeber/Susa_teatri_Sumqayit_...,QAN YADDA≈ûI,bloody_memory_31_march,317138,16,######,######,6,6
3491551,https://azertag.az/xeber/Zaqatalada_Azerbaycan...,QAN YADDA≈ûI,bloody_memory_31_march,317146,13,######,######,6,6
3491986,https://azertag.az/xeber/Ankara_Il_qezeti_bir_...,QAN YADDA≈ûI,bloody_memory_31_march,317184,5,######,######,6,6


In [327]:
test_df = test_df[~(test_df['len']==test_df['non-alpha'])]

In [328]:
test_df

Unnamed: 0,url,category,sub_category,news_id,order_number,text_content,text_en,non-alpha,len
1208432,https://azertag.az/xeber/Tarixine_sahib_chixan...,C√úMHURƒ∞YY∆èT - 100,cumhuriyyet,63610,1,≈û…ômkird…ôki Heyd…ôr ∆èliyev M…ôrk…ôzind…ô Az…ôrbaycan...,A scientific-practical conference dedicated to...,5,155
1208433,https://azertag.az/xeber/Tarixine_sahib_chixan...,C√úMHURƒ∞YY∆èT - 100,cumhuriyyet,63610,2,Konfransda ≈û…ômkir Rayon ƒ∞cra Hakimiyy…ôtinin ba...,"At the conference, Alimpasha Mammadov, Head of...",21,741
1208434,https://azertag.az/xeber/Tarixine_sahib_chixan...,C√úMHURƒ∞YY∆èT - 100,cumhuriyyet,63610,3,"AZ∆èRTAC-ƒ±n b√∂lg…ô m√ºxbiri x…ôb…ôr verir ki, konfr...",The regional correspondent of AZERTAC reports ...,1,223
1208435,https://azertag.az/xeber/Tarixine_sahib_chixan...,C√úMHURƒ∞YY∆èT - 100,cumhuriyyet,63610,4,Sonra Heyd…ôr ∆èliyev M…ôrk…ôzinin foyesind…ô AMEA-...,Then a book exhibition of the researches of th...,1,129
1208436,https://azertag.az/xeber/Tarixine_sahib_chixan...,C√úMHURƒ∞YY∆èT - 100,cumhuriyyet,63610,5,T…ôdbir i≈ütirak√ßƒ±larƒ± s…ôrgid…ô tariximizi …ôks et...,The participants of the event got acquainted w...,1,108
...,...,...,...,...,...,...,...,...,...
3492422,https://azertag.az/xeber/31_mart___azerbaycanl...,QAN YADDA≈ûI,bloody_memory_31_march,317210,73,Bu is…ô √∂zl√ºy√ºnd…ô XX …ôsr tariximizd…ô yeni-yeni ...,This in itself has led to the opening of new v...,3,85
3492423,https://azertag.az/xeber/31_mart___azerbaycanl...,QAN YADDA≈ûI,bloody_memory_31_march,317210,74,1998-ci ild…ôn b…ôri respublikamƒ±zda 31 mart h…ôr...,"Since 1998, March 31 has been celebrated annua...",8,121
3492424,https://azertag.az/xeber/31_mart___azerbaycanl...,QAN YADDA≈ûI,bloody_memory_31_march,317210,75,"Bu hadis…ôl…ôri, onlarƒ±n d…ôrsl…ôrini unutmaƒüa haq...",We have no right to forget these events and th...,1,58
3492425,https://azertag.az/xeber/31_mart___azerbaycanl...,QAN YADDA≈ûI,bloody_memory_31_march,317210,76,"Tarix he√ß zaman unudulmur, h…ôm d…ô yazƒ±lƒ±r.","History is never forgotten, but also written.",2,45


In [330]:
test_df[['category','text_en_lower']].to_csv('for_eda.tsv',sep='\t', index=False)

In [309]:
test_df.drop(test_df[test_df['len']==test_df['non-alpha']], axis=1)

1208432
1208433
1208434
1208435
1208436
...
3492422
3492423
3492424
3492425
3492426


In [302]:
test_df[test_df['len']==test_df['non-alpha']]

Unnamed: 0,url,category,sub_category,news_id,order_number,text_content,text_en,non-alpha,len
3334980,https://azertag.az/xeber/Dekabr_ayi_uchun_Fran...,D√úNYA,oddly_enough,304646,49,....,....,4,4
3356560,https://azertag.az/xeber/Masallinin_Sirebil_ke...,D√úNYA,oddly_enough,306324,10,######,######,6,6
3399018,https://azertag.az/xeber/IRALI_Ictimai_Birliyi...,QAN YADDA≈ûI,bloody_memory_20_january,309886,4,######,######,6,6
3399056,https://azertag.az/xeber/Meksikanin_iki_en_nuf...,QAN YADDA≈ûI,bloody_memory_20_january,309889,8,######,######,6,6
3399194,https://azertag.az/xeber/20_Yanvar_xalqimizin_...,QAN YADDA≈ûI,bloody_memory_20_january,309902,12,######,######,6,6
...,...,...,...,...,...,...,...,...,...
3490865,https://azertag.az/xeber/YASAMAL_RAYONUNDA_MAR...,QAN YADDA≈ûI,bloody_memory_31_march,317095,9,######,######,6,6
3491473,https://azertag.az/xeber/Susa_teatri_Sumqayit_...,QAN YADDA≈ûI,bloody_memory_31_march,317138,16,######,######,6,6
3491551,https://azertag.az/xeber/Zaqatalada_Azerbaycan...,QAN YADDA≈ûI,bloody_memory_31_march,317146,13,######,######,6,6
3491986,https://azertag.az/xeber/Ankara_Il_qezeti_bir_...,QAN YADDA≈ûI,bloody_memory_31_march,317184,5,######,######,6,6


In [329]:
test_df['text_en_lower'] = test_df['text_en'].apply(lambda x: x.lower())

In [322]:
test_df['text_en_lower']

1208432    a scientific-practical conference dedicated to...
1208433    at the conference, alimpasha mammadov, head of...
1208434    the regional correspondent of azertac reports ...
1208435    then a book exhibition of the researches of th...
1208436    the participants of the event got acquainted w...
                                 ...                        
3492422    this in itself has led to the opening of new v...
3492423    since 1998, march 31 has been celebrated annua...
3492424    we have no right to forget these events and th...
3492425        history is never forgotten, but also written.
3492426    although the armenian atrocities are a page wr...
Name: text_en_lower, Length: 198770, dtype: object

In [335]:
eda_df = pd.read_csv('eda_for_eda.tsv', sep='\t').iloc[9:]

In [40]:
eda_df.to_pickle('translated_augmented_2kk.pkl')

KeyboardInterrupt: 

In [41]:
import pandas as pd

In [42]:
eda_df = pd.read_pickle('translated_augmented_2kk.pkl')

In [43]:
for row in eda_df['text en']:
    print(row)
    res = translator.translate(row, dest='az', src='en').text
    print(res)

practical conference dedicated to the anniversary of the azerbaijan democratic republic was held the heydar aliyev center in shamkir
≈û…ômkird…ô Heyd…ôr ∆èliyev M…ôrk…ôzi Az…ôrbaycan Xalq C√ºmhuriyy…ôtinin ild√∂n√ºm√ºn…ô h…ôsr olunmu≈ü praktik konfrans ke√ßirildi
a scientific practical conference dedicated to the 100th anniversary of the democratic republic was held at the heydar aliyev center in
Heyd…ôr ∆èliyev M…ôrk…ôzind…ô Demokratik Respublikanƒ±n 100 illik yubileyin…ô h…ôsr olunmu≈ü elmi praktik konfrans ke√ßirildi
a scientific practical conference dedicated to the 100th anniversary of the azerbaijan democratic republic azerbajdzhan republic was held at the heydar aliyev day of remembrance center in shamkir
Az…ôrbaycan Xalq C√ºmhuriyy…ôtinin 100 illik yubileyin…ô h…ôsr olunmu≈ü elmi praktik konfrans ≈û…ômkird…ô Heyd…ôr ∆èliyevin Heyd…ôr ∆èliyev G√ºn√ºind…ô ke√ßirildi
a scientific practical to the 100th anniversary of the azerbaijan democratic republic was held at the heydar aliyev

KeyboardInterrupt: 

In [None]:
eda_df['text_az'] = eda_df['text en'].safe_map(retranslate)

In [None]:
checkpoints.results

In [346]:
eda_df.values

array([['C√úMHURƒ∞YY∆èT - 100',
        'practical conference dedicated to the anniversary of the azerbaijan democratic republic was held the heydar aliyev center in shamkir'],
       ['C√úMHURƒ∞YY∆èT - 100',
        'a scientific practical conference dedicated to the 100th anniversary of the democratic republic was held at the heydar aliyev center in'],
       ['C√úMHURƒ∞YY∆èT - 100',
        'a scientific practical conference dedicated to the 100th anniversary of the azerbaijan democratic republic azerbajdzhan republic was held at the heydar aliyev day of remembrance center in shamkir'],
       ...,
       ['QAN YADDA≈ûI',
        'although the armenian atrocities are a page all in written in the history of azerbaijan strong azerbaijan is always able to successfully overcome blood problems it having a voice and influence in the region where by is located'],
       ['QAN YADDA≈ûI',
        'although the armenian atrocities are a page written in rake in the history of azerbaijan strong

In [17]:
eda_df

Unnamed: 0,category,text en
9,C√úMHURƒ∞YY∆èT - 100,practical conference dedicated to the annivers...
10,C√úMHURƒ∞YY∆èT - 100,a scientific practical conference dedicated to...
11,C√úMHURƒ∞YY∆èT - 100,a scientific practical conference dedicated to...
12,C√úMHURƒ∞YY∆èT - 100,a scientific practical to the 100th anniversar...
13,C√úMHURƒ∞YY∆èT - 100,a scientific practical conference at to the 10...
...,...,...
1987579,QAN YADDA≈ûI,although the armenian a page written in blood ...
1987580,QAN YADDA≈ûI,although the armenian atrocities are a page in...
1987581,QAN YADDA≈ûI,although the armenian atrocities are a page al...
1987582,QAN YADDA≈ûI,although the armenian atrocities are a page wr...


In [20]:
eda_df['text_az'] = eda_df['text en'].safe_map(retranslate)



TypeError: the JSON object must be str, bytes or bytearray, not 'NoneType'

In [1]:
import pandas as pd

In [8]:
dfs = []
for i in range(8):
    path = "/home/greamdesu/GF/azer_translated_pkl/df_neaural_{}_final.pkl".format(i)
    dfs.append(pd.read_pickle(path))

In [1]:
df['text_az'] = df['text_content']

NameError: name 'df' is not defined

In [12]:
df_aug = pd.concat(dfs)

In [13]:
df_aug

Unnamed: 0,category,text_en,text_az
0,C√úMHURƒ∞YY∆èT - 100,practical conference dedicated to the annivers...,Azerbaycan Demokrat Cumhuriyetinin kurulu≈ü yƒ±l...
1,C√úMHURƒ∞YY∆èT - 100,a scientific practical conference dedicated to...,Demokrat republicin 100. yƒ±ld√∂n√ºm√º m√ºnasebetiy...
2,C√úMHURƒ∞YY∆èT - 100,a scientific practical conference dedicated to...,Azerbaycan Demokrat Cumhuriyetinin 100-ci il…ô ...
3,C√úMHURƒ∞YY∆èT - 100,a scientific practical to the 100th anniversar...,Azerbaijan Demokrat Cumhuriyetinin 100-ci il…ô ...
4,C√úMHURƒ∞YY∆èT - 100,a scientific practical conference at to the 10...,Azerbaijan Demokrat Cumhuriyetinin 100-ci il…ô ...
...,...,...,...
248441,QAN YADDA≈ûI,although the armenian a page written in blood ...,…ôg…ôr armeniyalƒ±lar azerbaycanlƒ±q tarixinin i√ßi...
248442,QAN YADDA≈ûI,although the armenian atrocities are a page in...,…ôg…ôr Ermenil…ôrin i≈ülediƒüi zul√ºm…ôl…ôr …ôs…ôbd…ô bir...
248443,QAN YADDA≈ûI,although the armenian atrocities are a page al...,…ôg…ôr armen okrutlarƒ± b√ºt√ºn Azerbaijan tarixini...
248444,QAN YADDA≈ûI,although the armenian atrocities are a page wr...,…ôg…ôr Ermeni okrutlar azerbaycanlƒ±q tarixinin b...


In [24]:
big_df = pd.concat([df, df_aug])

In [27]:
big_df['text_az_lower'] = big_df['text_az'].apply(lambda x: x.lower())

In [28]:
big_df.to_pickle('merged_neural.pkl')

In [37]:
df = pd.read_csv('azertag_sentences_with_categories.csv')

In [35]:
big_df = pd.read_pickle('merged_neural.pkl')

In [39]:
df

Unnamed: 0,url,category,sub_category,news_id,order_number,text_content
0,https://azertag.az/xeber/Dovlet_baschilari_Xiv...,R∆èSMƒ∞ XRONƒ∞KA,R∆èSMƒ∞ XRONƒ∞KA,1,1,ƒ∞yunun 22-d…ô Az…ôrbaycan Respublikasƒ±nƒ±n Prezid...
1,https://azertag.az/xeber/Dovlet_baschilari_Xiv...,R∆èSMƒ∞ XRONƒ∞KA,R∆èSMƒ∞ XRONƒ∞KA,1,2,"AZ∆èRTAC x…ôb…ôr verir ki, saray kompleksi bar…ôd…ô..."
2,https://azertag.az/xeber/Dovlet_baschilari_Xiv...,R∆èSMƒ∞ XRONƒ∞KA,R∆èSMƒ∞ XRONƒ∞KA,1,3,"Bildirilib ki, …ôz…ôm…ôti v…ô g√∂z…ôlliyi il…ô m…ô≈ühur..."
3,https://azertag.az/xeber/Dovlet_baschilari_Xiv...,R∆èSMƒ∞ XRONƒ∞KA,R∆èSMƒ∞ XRONƒ∞KA,1,4,D√∂rd hiss…ôd…ôn ibar…ôt saray kompleksin…ô √ºmumili...
4,https://azertag.az/xeber/Dovlet_baschilari_Xiv...,R∆èSMƒ∞ XRONƒ∞KA,R∆èSMƒ∞ XRONƒ∞KA,1,5,Saray kompleksind…ôki b√ºt√ºn tikilil…ôr uzunluƒüu ...
...,...,...,...,...,...,...
3492448,https://azertag.az/xeber/Susa_medeniyyetinin_i...,M∆èD∆èNƒ∞YY∆èT,shusha_year,317214,3,K√ºrd√º (sƒ±rƒ±nmƒ±≈ü qolsuz k√∂yn…ôk) ‚Äì qadƒ±n √ºst gey...
3492449,https://azertag.az/xeber/Susa_medeniyyetinin_i...,M∆èD∆èNƒ∞YY∆èT,shusha_year,317214,4,Bu geyim tirm…ôd…ôn v…ô m…ôxm…ôrd…ôn tikilirdi.
3492450,https://azertag.az/xeber/Susa_medeniyyetinin_i...,M∆èD∆èNƒ∞YY∆èT,shusha_year,317214,5,"X…ôzl…ô b…ôz…ôk, sƒ±x naxƒ±≈ü vurulurdu."
3492451,https://azertag.az/xeber/Susa_medeniyyetinin_i...,M∆èD∆èNƒ∞YY∆èT,shusha_year,317214,6,"K√ºrd√ºn√º ≈ûu≈üada, h…ôm√ßinin Az…ôrbaycanƒ±n h…ôr yeri..."


In [48]:
big_df

Unnamed: 0,url,category,sub_category,news_id,order_number,text_content,text_az,text_en,text_az_lower
0,https://azertag.az/xeber/Dovlet_baschilari_Xiv...,R∆èSMƒ∞ XRONƒ∞KA,R∆èSMƒ∞ XRONƒ∞KA,1.0,1.0,ƒ∞yunun 22-d…ô Az…ôrbaycan Respublikasƒ±nƒ±n Prezid...,ƒ∞yunun 22-d…ô Az…ôrbaycan Respublikasƒ±nƒ±n Prezid...,,iÃáyunun 22-d…ô az…ôrbaycan respublikasƒ±nƒ±n prezi...
1,https://azertag.az/xeber/Dovlet_baschilari_Xiv...,R∆èSMƒ∞ XRONƒ∞KA,R∆èSMƒ∞ XRONƒ∞KA,1.0,2.0,"AZ∆èRTAC x…ôb…ôr verir ki, saray kompleksi bar…ôd…ô...","AZ∆èRTAC x…ôb…ôr verir ki, saray kompleksi bar…ôd…ô...",,"az…ôrtac x…ôb…ôr verir ki, saray kompleksi bar…ôd…ô..."
2,https://azertag.az/xeber/Dovlet_baschilari_Xiv...,R∆èSMƒ∞ XRONƒ∞KA,R∆èSMƒ∞ XRONƒ∞KA,1.0,3.0,"Bildirilib ki, …ôz…ôm…ôti v…ô g√∂z…ôlliyi il…ô m…ô≈ühur...","Bildirilib ki, …ôz…ôm…ôti v…ô g√∂z…ôlliyi il…ô m…ô≈ühur...",,"bildirilib ki, …ôz…ôm…ôti v…ô g√∂z…ôlliyi il…ô m…ô≈ühur..."
3,https://azertag.az/xeber/Dovlet_baschilari_Xiv...,R∆èSMƒ∞ XRONƒ∞KA,R∆èSMƒ∞ XRONƒ∞KA,1.0,4.0,D√∂rd hiss…ôd…ôn ibar…ôt saray kompleksin…ô √ºmumili...,D√∂rd hiss…ôd…ôn ibar…ôt saray kompleksin…ô √ºmumili...,,d√∂rd hiss…ôd…ôn ibar…ôt saray kompleksin…ô √ºmumili...
4,https://azertag.az/xeber/Dovlet_baschilari_Xiv...,R∆èSMƒ∞ XRONƒ∞KA,R∆èSMƒ∞ XRONƒ∞KA,1.0,5.0,Saray kompleksind…ôki b√ºt√ºn tikilil…ôr uzunluƒüu ...,Saray kompleksind…ôki b√ºt√ºn tikilil…ôr uzunluƒüu ...,,saray kompleksind…ôki b√ºt√ºn tikilil…ôr uzunluƒüu ...
...,...,...,...,...,...,...,...,...,...
248441,,QAN YADDA≈ûI,,,,,…ôg…ôr armeniyalƒ±lar azerbaycanlƒ±q tarixinin i√ßi...,although the armenian a page written in blood ...,…ôg…ôr armeniyalƒ±lar azerbaycanlƒ±q tarixinin i√ßi...
248442,,QAN YADDA≈ûI,,,,,…ôg…ôr Ermenil…ôrin i≈ülediƒüi zul√ºm…ôl…ôr …ôs…ôbd…ô bir...,although the armenian atrocities are a page in...,…ôg…ôr ermenil…ôrin i≈ülediƒüi zul√ºm…ôl…ôr …ôs…ôbd…ô bir...
248443,,QAN YADDA≈ûI,,,,,…ôg…ôr armen okrutlarƒ± b√ºt√ºn Azerbaijan tarixini...,although the armenian atrocities are a page al...,…ôg…ôr armen okrutlarƒ± b√ºt√ºn azerbaijan tarixini...
248444,,QAN YADDA≈ûI,,,,,…ôg…ôr Ermeni okrutlar azerbaycanlƒ±q tarixinin b...,although the armenian atrocities are a page wr...,…ôg…ôr ermeni okrutlar azerbaycanlƒ±q tarixinin b...


In [47]:
big_df['text_az_lower'].isna().sum()

0

In [38]:
pd.value_counts(df['category'])

C∆èMƒ∞YY∆èT             1074514
R∆èSMƒ∞ XRONƒ∞KA         719341
ƒ∞QTƒ∞SADƒ∞YYAT          434084
R∆èSMƒ∞ S∆èN∆èDL∆èR        338191
ELM V∆è T∆èHSƒ∞L         270760
REGƒ∞ONLAR             160572
Sƒ∞YAS∆èT               159996
M∆èD∆èNƒ∞YY∆èT            136005
QAN YADDA≈ûI           119544
D√úNYA                  46127
ƒ∞DMAN                  27721
C√úMHURƒ∞YY∆èT - 100       5220
M√úSAHƒ∞B∆è                 233
BA≈û X∆èB∆èRL∆èR             105
≈ûU≈ûA ƒ∞Lƒ∞                  40
Name: category, dtype: int64

In [36]:
pd.value_counts(big_df['category'])

QAN YADDA≈ûI          1312690
C∆èMƒ∞YY∆èT             1074514
R∆èSMƒ∞ XRONƒ∞KA         719341
D√úNYA                 507366
ƒ∞QTƒ∞SADƒ∞YYAT          434084
R∆èSMƒ∞ S∆èN∆èDL∆èR        338191
ƒ∞DMAN                 304931
ELM V∆è T∆èHSƒ∞L         270760
REGƒ∞ONLAR             160572
Sƒ∞YAS∆èT               159996
M∆èD∆èNƒ∞YY∆èT            136005
C√úMHURƒ∞YY∆èT - 100      57420
M√úSAHƒ∞B∆è                2563
BA≈û X∆èB∆èRL∆èR            1155
≈ûU≈ûA ƒ∞Lƒ∞                 440
Name: category, dtype: int64

In [2]:
import pandas as pd
df = pd.read_csv('/home/denay/azer-bert/azertag_sentences_with_categories.csv')

In [5]:
df

Unnamed: 0,url,category,sub_category,news_id,order_number,text_content,count
0,https://azertag.az/xeber/Dovlet_baschilari_Xiv...,R∆èSMƒ∞ XRONƒ∞KA,R∆èSMƒ∞ XRONƒ∞KA,1,1,ƒ∞yunun 22-d…ô Az…ôrbaycan Respublikasƒ±nƒ±n Prezid...,21
1,https://azertag.az/xeber/Dovlet_baschilari_Xiv...,R∆èSMƒ∞ XRONƒ∞KA,R∆èSMƒ∞ XRONƒ∞KA,1,2,"AZ∆èRTAC x…ôb…ôr verir ki, saray kompleksi bar…ôd…ô...",11
2,https://azertag.az/xeber/Dovlet_baschilari_Xiv...,R∆èSMƒ∞ XRONƒ∞KA,R∆èSMƒ∞ XRONƒ∞KA,1,3,"Bildirilib ki, …ôz…ôm…ôti v…ô g√∂z…ôlliyi il…ô m…ô≈ühur...",20
3,https://azertag.az/xeber/Dovlet_baschilari_Xiv...,R∆èSMƒ∞ XRONƒ∞KA,R∆èSMƒ∞ XRONƒ∞KA,1,4,D√∂rd hiss…ôd…ôn ibar…ôt saray kompleksin…ô √ºmumili...,22
4,https://azertag.az/xeber/Dovlet_baschilari_Xiv...,R∆èSMƒ∞ XRONƒ∞KA,R∆èSMƒ∞ XRONƒ∞KA,1,5,Saray kompleksind…ôki b√ºt√ºn tikilil…ôr uzunluƒüu ...,13
...,...,...,...,...,...,...,...
3492448,https://azertag.az/xeber/Susa_medeniyyetinin_i...,M∆èD∆èNƒ∞YY∆èT,shusha_year,317214,3,K√ºrd√º (sƒ±rƒ±nmƒ±≈ü qolsuz k√∂yn…ôk) ‚Äì qadƒ±n √ºst gey...,8
3492449,https://azertag.az/xeber/Susa_medeniyyetinin_i...,M∆èD∆èNƒ∞YY∆èT,shusha_year,317214,4,Bu geyim tirm…ôd…ôn v…ô m…ôxm…ôrd…ôn tikilirdi.,6
3492450,https://azertag.az/xeber/Susa_medeniyyetinin_i...,M∆èD∆èNƒ∞YY∆èT,shusha_year,317214,5,"X…ôzl…ô b…ôz…ôk, sƒ±x naxƒ±≈ü vurulurdu.",5
3492451,https://azertag.az/xeber/Susa_medeniyyetinin_i...,M∆èD∆èNƒ∞YY∆èT,shusha_year,317214,6,"K√ºrd√ºn√º ≈ûu≈üada, h…ôm√ßinin Az…ôrbaycanƒ±n h…ôr yeri...",7


In [4]:
df['count'] = df['text_content'].str.split().str.len()

In [9]:
counted = df[(df['count']<15) & (df['count']>10)]

In [19]:
counted = counted[~counted['category'].isin(['≈ûU≈ûA ƒ∞Lƒ∞', 'BA≈û X∆èB∆èRL∆èR'])]

In [20]:
counted

Unnamed: 0,url,category,sub_category,news_id,order_number,text_content,count
1,https://azertag.az/xeber/Dovlet_baschilari_Xiv...,R∆èSMƒ∞ XRONƒ∞KA,R∆èSMƒ∞ XRONƒ∞KA,1,2,"AZ∆èRTAC x…ôb…ôr verir ki, saray kompleksi bar…ôd…ô...",11
4,https://azertag.az/xeber/Dovlet_baschilari_Xiv...,R∆èSMƒ∞ XRONƒ∞KA,R∆èSMƒ∞ XRONƒ∞KA,1,5,Saray kompleksind…ôki b√ºt√ºn tikilil…ôr uzunluƒüu ...,13
7,https://azertag.az/xeber/Dovlet_baschilari_Xiv...,R∆èSMƒ∞ XRONƒ∞KA,R∆èSMƒ∞ XRONƒ∞KA,1,8,‚ÄúNurullaboy‚Äù sarayƒ± milli v…ô Avropa memarlƒ±q √º...,11
8,https://azertag.az/xeber/Dovlet_baschilari_Xiv...,R∆èSMƒ∞ XRONƒ∞KA,R∆èSMƒ∞ XRONƒ∞KA,1,9,Bu da h…ômin d√∂vrd…ô ≈ü…ôh…ôrsalma sah…ôsind…ô √ºst√ºnl...,11
9,https://azertag.az/xeber/Dovlet_baschilari_Xiv...,R∆èSMƒ∞ XRONƒ∞KA,R∆èSMƒ∞ XRONƒ∞KA,1,10,Qƒ±zƒ±lƒ± v…ô r…ôngli r…ôsml…ôrl…ô √∂rt√ºlm√º≈ü m…ô≈ühur Xiv...,13
...,...,...,...,...,...,...,...
3492421,https://azertag.az/xeber/31_mart___azerbaycanl...,QAN YADDA≈ûI,bloody_memory_31_march,317210,72,Mart hadis…ôl…ôrinin saxtala≈üdƒ±rƒ±lmasƒ± xalqƒ±mƒ±za...,13
3492422,https://azertag.az/xeber/31_mart___azerbaycanl...,QAN YADDA≈ûI,bloody_memory_31_march,317210,73,Bu is…ô √∂zl√ºy√ºnd…ô XX …ôsr tariximizd…ô yeni-yeni ...,12
3492431,https://azertag.az/xeber/Diaspor_Genclerinin_I...,REGƒ∞ONLAR,shusha_year,317212,1,Diaspor G…ôncl…ôrinin III Yay D√º≈ü…ôrg…ôsinin i≈ütir...,11
3492438,https://azertag.az/xeber/Diaspor_Genclerinin_I...,REGƒ∞ONLAR,shusha_year,317212,8,"Bundan ba≈üqa, …ôyl…ônc…ôli proqramlarƒ±n v…ô intell...",11


In [21]:
counted.value_counts('category')

category
C∆èMƒ∞YY∆èT             174775
R∆èSMƒ∞ XRONƒ∞KA        148667
ƒ∞QTƒ∞SADƒ∞YYAT          79927
R∆èSMƒ∞ S∆èN∆èDL∆èR        54040
ELM V∆è T∆èHSƒ∞L         43483
REGƒ∞ONLAR             32050
M∆èD∆èNƒ∞YY∆èT            28304
Sƒ∞YAS∆èT               25020
QAN YADDA≈ûI           18624
D√úNYA                 11078
ƒ∞DMAN                  6834
C√úMHURƒ∞YY∆èT - 100       804
M√úSAHƒ∞B∆è                 52
dtype: int64

In [24]:
xxx = counted.groupby(['category']).apply(lambda grp: grp.sample(n=30))

In [30]:
sss.to_excel('for_survey.xlsx')

In [29]:
sss = xxx.drop(columns=['url', 'sub_category', 'news_id', 'order_number', 'count'])