In [None]:
#@title Install Required Packages

# On a local machine, uncomment the following lines
# !pip install -qU torch
# !pip install -qU numpy
# !pip install -qU pandas

!pip install -qU transformers


In [None]:
#@title Load Packages

from transformers import AutoModelForMaskedLM, AutoTokenizer
from transformers import pipeline

from pprint import pprint
from IPython import display

# ParsBERT


<br/>
<p align="center">
    <img src="https://hooshvare.s3.ir-thr-at1.arvanstorage.com/parsbert.png" height="400" />
    <br/>
    <em>Figure 1: ParsBERT: Transformer-based Model for Persian Language Understanding</em><br/>
    <em><a href="https://github.com/hooshvare/parsbert">https://github.com/hooshvare/parsbert</a></em>
</p>
<br/>



**ParsBERT** is a monolingual language model based on Google’s BERT architecture. This model is pre-trained on large Persian corpora with various writing styles from numerous subjects (e.g., scientific, novels, news).

Over the past year, ParsBERT has evolved from the preliminary version `1.0` to `2.0` and is currently at version `3.0`.

- [v1.0:](https://huggingface.co/HooshvareLab/bert-base-parsbert-uncased) Trained for about 1,920,000 steps on 3.9M documents, 73M sentences, and 1.3B words drawn from public online resources (Persian Wikidumps and MirasText) and six other manually crawled text sources from various types of websites (BigBang Page: scientific; Chetor: lifestyle; Eligasht: itinerary; Digikala: digital magazine; Ted Talks: general conversational; Books: novels, storybooks, and short stories from old to contemporary eras).
- [v2.0:](https://huggingface.co/HooshvareLab/bert-fa-base-uncased) Same as v1.0, but trained for 2,575,000 + 300,000 steps, with an updated vocabulary and optional tokens added for other downstream tasks.
- [v3.0:](https://huggingface.co/HooshvareLab/bert-fa-zwnj-base) Trained for half a million steps on various public online resources such as Wikipedia, Digikala, Chetor, more than seven news stations, entertainment and fun channels, tweets, more than five technology hubs, and Virgool. It handles the zero-width non-joiner (ZWNJ) character used in Persian writing.
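
To make the zero-width non-joiner concrete, the following minimal sketch uses only the Python standard library. It assumes the ZWNJ is the Unicode code point U+200C and uses the verb می‌کند as an illustrative word (the variable names are hypothetical, not part of ParsBERT). It shows that the character is invisible but real, and that stripping it merges the two halves of the word.

```python
import unicodedata

# Assumption: the zero-width non-joiner is Unicode code point U+200C.
ZWNJ = "\u200c"

# Illustrative word with a ZWNJ between its prefix and stem.
word = "می" + ZWNJ + "کند"

# The character has an official Unicode name and the "format" category.
zwnj_name = unicodedata.name(ZWNJ)
zwnj_category = unicodedata.category(ZWNJ)
print(zwnj_name, zwnj_category)

# Removing the ZWNJ shortens the string by one code point and
# fuses the two parts into a single run of letters.
stripped = word.replace(ZWNJ, "")
print(len(word), len(stripped))
print(word.split(ZWNJ))
```

A tokenizer that discards this character cannot distinguish the two spellings, which is why a dedicated ZWNJ token can be useful for Persian text.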

# Evaluation

ParsBERT is evaluated on three downstream NLP tasks: Sentiment Analysis (SA), Text Classification (TC), and Named Entity Recognition (NER). Because existing resources were insufficient, two large datasets for SA and two for text classification were manually composed; they are available for public use and benchmarking. ParsBERT outperformed all other language models, including multilingual BERT and other hybrid deep learning models, on all tasks, improving the state-of-the-art performance in Persian language modeling.

## Results

The following tables summarize the F1 scores obtained by ParsBERT compared with other models and architectures.


### Sentiment Analysis (SA) task

|          Dataset         | ParsBERT v2 | ParsBERT v1 | mBERT | DeepSentiPers |
|:------------------------:|:-----------:|:-----------:|:-----:|:-------------:|
|  Digikala User Comments  |    81.72    |    81.74*   | 80.74 |       -       |
|  SnappFood User Comments |    87.98    |    88.12*   | 87.87 |       -       |
|  SentiPers (Multi Class) |    71.31*   |    71.11    |   -   |     69.33     |
| SentiPers (Binary Class) |    92.42*   |    92.13    |   -   |     91.98     |



### Text Classification (TC) task

|      Dataset      | ParsBERT v2 | ParsBERT v1 | mBERT |
|:-----------------:|:-----------:|:-----------:|:-----:|
| Digikala Magazine |    93.65*   |    93.59    | 90.72 |
|    Persian News   |    97.44*   |    97.19    | 95.79 |


### Named Entity Recognition (NER) task

| Dataset | ParsBERT v2 | ParsBERT v1 | mBERT | MorphoBERT | Beheshti-NER | LSTM-CRF | Rule-Based CRF | BiLSTM-CRF |
|:-------:|:-----------:|:-----------:|:-----:|:----------:|:------------:|:--------:|:--------------:|:----------:|
|  PEYMA  |    93.40*   |    93.10    | 86.64 |      -     |     90.59    |     -    |      84.00     |      -     |
|  ARMAN  |    99.84*   |    98.79    | 95.89 |    89.9    |     84.03    |   86.55  |        -       |    77.45   |
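
As a rough way to read the tables, the sketch below computes the F1 margin of ParsBERT v2 over mBERT for every dataset where both scores are reported. The numbers are copied from the tables above; the dictionary layout and variable names are illustrative assumptions, not part of the original evaluation code.

```python
# F1 scores copied from the tables above: (ParsBERT v2, mBERT).
# Datasets without an mBERT score (the SentiPers rows) are left out.
scores = {
    "Digikala User Comments": (81.72, 80.74),
    "SnappFood User Comments": (87.98, 87.87),
    "Digikala Magazine": (93.65, 90.72),
    "Persian News": (97.44, 95.79),
    "PEYMA": (93.40, 86.64),
    "ARMAN": (99.84, 95.89),
}

# Margin in F1 points, rounded to two decimals to hide float noise.
margins = {name: round(v2 - mbert, 2) for name, (v2, mbert) in scores.items()}

# Print the datasets from the largest margin to the smallest.
for name, margin in sorted(margins.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<24} +{margin:.2f}")
```

On these figures the gap is widest on the NER datasets and narrowest on SnappFood User Comments.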

In [None]:
versions = {
    "v1.0": "HooshvareLab/bert-base-parsbert-uncased",
    "v2.0": "HooshvareLab/bert-fa-base-uncased",
    "v3.0": "HooshvareLab/bert-fa-zwnj-base",
}

# Comparisons

## Comparison of Tokenizers

In [None]:
# Sample sentence, roughly: "Correcting characters and using the half-space makes processing easier"
text = "اصلاح نویسه‌ها و استفاده از نیم‌فاصله پردازش را آسان می‌کند"

In [None]:
tokenizer = AutoTokenizer.from_pretrained(versions["v1.0"])
print("\n".join(tokenizer.tokenize(text)))

اصلاح
نویسه
##ها
و
استفاده
از
نیم
##فاصله
پردازش
را
اسان
میکند


In [None]:
tokenizer = AutoTokenizer.from_pretrained(versions["v2.0"])
print("\n".join(tokenizer.tokenize(text)))

اصلاح
نویسهها
و
استفاده
از
نیم
##فاصله
پردازش
را
اسان
میکند


In [None]:
tokenizer = AutoTokenizer.from_pretrained(versions["v3.0"])
print("\n".join(tokenizer.tokenize(text)))

اصلاح
نویسه
[ZWNJ]
ها
و
استفاده
از
نیم
[ZWNJ]
فاصله
پردازش
را
آ
##سان
می
[ZWNJ]
کند
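
The v3.0 token list above keeps enough information to rebuild the original sentence, which the v1.0 and v2.0 outputs do not. The following is a minimal sketch of that reconstruction, assuming `[ZWNJ]` stands for U+200C and `##` marks a WordPiece continuation; `detokenize` is a hypothetical helper written for illustration, not a function of the `transformers` library.

```python
ZWNJ = "\u200c"  # assumption: the [ZWNJ] token stands for U+200C

# Tokens copied from the v3.0 tokenizer output above.
tokens = [
    "اصلاح", "نویسه", "[ZWNJ]", "ها", "و", "استفاده", "از",
    "نیم", "[ZWNJ]", "فاصله", "پردازش", "را", "آ", "##سان",
    "می", "[ZWNJ]", "کند",
]


def detokenize(tokens):
    """Join tokens, gluing '##' pieces and the neighbours of '[ZWNJ]'."""
    text = ""
    glue = True  # True when the next token attaches without a space
    for token in tokens:
        if token == "[ZWNJ]":
            text += ZWNJ
            glue = True
        elif token.startswith("##"):
            text += token[2:]
            glue = False
        else:
            text += token if glue else " " + token
            glue = False
    return text


restored = detokenize(tokens)
print(restored)
print(restored.count(ZWNJ))
```

The helper ignores punctuation spacing and special tokens such as `[CLS]`, so it is only meant to illustrate why keeping the ZWNJ as its own token is lossless for this sentence.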


## Comparison of Models

In [None]:
# text1, roughly: "Where are these book__s?" (mask placed before the plural suffix)
# text2, roughly: "Displays are the biggest energy consumer in __ phones."
text1 = "این کتاب__‌ها کجا هستند؟"
text2 = "نمایشگرها بزرگترین مصرف‌کننده‌ی انرژی در گوشی‌های __ هستند."


text1 = text1.replace("__", "[MASK]")
text2 = text2.replace("__", "[MASK]")

In [None]:
nlp_fill = pipeline("fill-mask", model=versions["v1.0"])

output = nlp_fill(text1)
pprint(output)
print('-' * 90)
output = nlp_fill(text2)
pprint(output)

Some weights of the model checkpoint at HooshvareLab/bert-base-parsbert-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[{'score': 0.08914531022310257,
  'sequence': 'این کتاب پرفروش ها کجا هستند ؟',
  'token': 14989,
  'token_str': 'پرفروش'},
 {'score': 0.05617669224739075,
  'sequence': 'این کتاب بچه ها کجا هستند ؟',
  'token': 5049,
  'token_str': 'بچه'},
 {'score': 0.03774799406528473,
  'sequence': 'این کتاب خواندنی ها کجا هستند ؟',
  'token': 20156,
  'token_str': 'خواندنی'},
 {'score': 0.014398455619812012,
  'sequence': 'این کتاب داستان ها کجا هستند ؟',
  'token': 4556,
  'token_str': 'داستان'},
 {'score': 0.010210617445409298,
  'sequence': 'این کتاب امریکایی ها کجا هستند ؟',
  'token': 3511,
  'token_str': 'امریکایی'}]
------------------------------------------------------------------------------------------
[{'score': 0.9409950971603394,
  'sequence': 'نمایشگرها بزرگترین مصرفکنندهی انرژی در گوشیهای هوشمند هستند.',
  'token': 3831,
  'token_str': 'هوشمند'},
 {'score': 0.039490777999162674,
  'sequence': 'نمایشگرها بزرگترین مصرفکنندهی انرژی در گوشیهای همراه هستند.',
  'token': 2653,
  'token_st

In [None]:
nlp_fill = pipeline("fill-mask", model=versions["v2.0"])

output = nlp_fill(text1)
pprint(output)
print('-' * 90)
output = nlp_fill(text2)
pprint(output)

Some weights of the model checkpoint at HooshvareLab/bert-fa-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[{'score': 0.06316491216421127,
  'sequence': 'این کتاب پرفروش ها کجا هستند ؟',
  'token': 13077,
  'token_str': 'پرفروش'},
 {'score': 0.056880172342061996,
  'sequence': 'این کتاب کتاب ها کجا هستند ؟',
  'token': 3250,
  'token_str': 'کتاب'},
 {'score': 0.03647409379482269,
  'sequence': 'این کتاب سال ها کجا هستند ؟',
  'token': 2844,
  'token_str': 'سال'},
 {'score': 0.025033805519342422,
  'sequence': 'این کتاب - ها کجا هستند ؟',
  'token': 1011,
  'token_str': '-'},
 {'score': 0.01697702333331108,
  'sequence': 'این کتاب فروشی ها کجا هستند ؟',
  'token': 8786,
  'token_str': 'فروشی'}]
------------------------------------------------------------------------------------------
[{'score': 0.9764866232872009,
  'sequence': 'نمایشگرها بزرگترین مصرفکنندهی انرژی در گوشیهای هوشمند هستند.',
  'token': 4980,
  'token_str': 'هوشمند'},
 {'score': 0.0075789508409798145,
  'sequence': 'نمایشگرها بزرگترین مصرفکنندهی انرژی در گوشیهای همراه هستند.',
  'token': 3287,
  'token_str': 'همراه'},
 {'score

In [None]:
nlp_fill = pipeline("fill-mask", model=versions["v3.0"])

output = nlp_fill(text1)
pprint(output)
print('-' * 90)
output = nlp_fill(text2)
pprint(output)

Some weights of BertModel were not initialized from the model checkpoint at HooshvareLab/bert-fa-zwnj-base and are newly initialized: ['bert.pooler.dense.weight', 'bert.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


[{'score': 0.3467977046966553,
  'sequence': 'این کتاب ها کجا هستند ؟',
  'token': 9,
  'token_str': '[ZWNJ]'},
 {'score': 0.052707135677337646,
  'sequence': 'این کتاب خوان ها کجا هستند ؟',
  'token': 3463,
  'token_str': 'خوان'},
 {'score': 0.0424254909157753,
  'sequence': 'این کتاب فروشی ها کجا هستند ؟',
  'token': 6316,
  'token_str': 'فروشی'},
 {'score': 0.03970666974782944,
  'sequence': 'این کتاب بچه ها کجا هستند ؟',
  'token': 4321,
  'token_str': 'بچه'},
 {'score': 0.03866000846028328,
  'sequence': 'این کتاب نوشته ها کجا هستند ؟',
  'token': 3283,
  'token_str': 'نوشته'}]
------------------------------------------------------------------------------------------
[{'score': 0.811574399471283,
  'sequence': 'نمایشگرها بزرگترین مصرف کننده ی انرژی در گوشی های هوشمند هستند.',
  'token': 2617,
  'token_str': 'هوشمند'},
 {'score': 0.12329542636871338,
  'sequence': 'نمایشگرها بزرگترین مصرف کننده ی انرژی در گوشی های موبایل هستند.',
  'token': 2984,
  'token_str': 'موبایل'},
 {'score'