<font>
<div dir=ltr align=center>
<img src='https://cdn.freebiesupply.com/logos/large/2x/sharif-logo-png-transparent.png' width=150 height=150> <br>
<font color=0F5298 size=6>
Natural Language Processing<br>
<font color=2565AE size=4>
Computer Engineering Department<br>
Spring 2025<br>
<font color=3C99D size=4>
Workshop 1 - NLP Frameworks - 🤗 Hugging Face<br>
<font color=696880 size=3>
<a href='https://language.ml'>https://language.ml</a><br>
info [AT] language [dot] ml

# 📖 Part 1 – Introduction

Hugging Face provides two core Python libraries:

| Library             | Purpose                                                                              |
|---------------------|--------------------------------------------------------------------------------------|
| **🤗 Datasets**     | Convenient access to hundreds of public NLP datasets (+ on-disk caching, streaming). |
| **🤗 Transformers** | Thousands of pre-trained models (BERT, T5, GPT-2, etc.) with a unified API.          |

You will learn just enough to **load a dataset** and **run a pre-trained model**; fine-tuning and configuration details are left for later study.


## ⚙️ Installation & Setup

In [54]:
!pip install transformers datasets torch



# 📚 Part 2 – Loading a Dataset

This part shows how to load a dataset using 🤗 Datasets, inspect its basic structure, and derive a new column with `map`.

In [55]:
from datasets import load_dataset

# Load only the first 1,000 examples of the 'train' split of the Divar real estate ads dataset
dataset = load_dataset("divaroffical/real_estate_ads", split="train[:1000]")
# Alternative: concatenate that slice with the 'test' split
# dataset = load_dataset("divaroffical/real_estate_ads", split="train[:1000]+test")

In [56]:
print("Number of examples:", len(dataset))

Number of examples: 1000


In [57]:
print("Column names:", dataset.column_names)

Column names: ['cat2_slug', 'cat3_slug', 'city_slug', 'neighborhood_slug', 'created_at_month', 'user_type', 'description', 'title', 'rent_mode', 'rent_value', 'rent_to_single', 'rent_type', 'price_mode', 'price_value', 'credit_mode', 'credit_value', 'rent_credit_transform', 'transformable_price', 'transformable_credit', 'transformed_credit', 'transformable_rent', 'transformed_rent', 'land_size', 'building_size', 'deed_type', 'has_business_deed', 'floor', 'rooms_count', 'total_floors_count', 'unit_per_floor', 'has_balcony', 'has_elevator', 'has_warehouse', 'has_parking', 'construction_year', 'is_rebuilt', 'has_water', 'has_warm_water_provider', 'has_electricity', 'has_gas', 'has_heating_system', 'has_cooling_system', 'has_restroom', 'has_security_guard', 'has_barbecue', 'building_direction', 'has_pool', 'has_jacuzzi', 'has_sauna', 'floor_material', 'property_type', 'regular_person_capacity', 'extra_person_capacity', 'cost_per_extra_person', 'rent_price_on_regular_days', 'rent_price_on_s

In [58]:
print("Features:", dataset.features)

Features: {'cat2_slug': Value(dtype='string', id=None), 'cat3_slug': Value(dtype='string', id=None), 'city_slug': Value(dtype='string', id=None), 'neighborhood_slug': Value(dtype='string', id=None), 'created_at_month': Value(dtype='string', id=None), 'user_type': Value(dtype='string', id=None), 'description': Value(dtype='string', id=None), 'title': Value(dtype='string', id=None), 'rent_mode': Value(dtype='string', id=None), 'rent_value': Value(dtype='float64', id=None), 'rent_to_single': Value(dtype='float64', id=None), 'rent_type': Value(dtype='string', id=None), 'price_mode': Value(dtype='string', id=None), 'price_value': Value(dtype='float64', id=None), 'credit_mode': Value(dtype='string', id=None), 'credit_value': Value(dtype='float64', id=None), 'rent_credit_transform': Value(dtype='bool', id=None), 'transformable_price': Value(dtype='bool', id=None), 'transformable_credit': Value(dtype='float64', id=None), 'transformed_credit': Value(dtype='float64', id=None), 'transformable_ren

In [59]:
dataset[0]

{'cat2_slug': 'temporary-rent',
 'cat3_slug': 'villa',
 'city_slug': 'karaj',
 'neighborhood_slug': 'mehrshahr',
 'created_at_month': '2024-08-01 00:00:00',
 'user_type': 'مشاور املاک',
 'description': '۵۰۰متر\n۲۰۰متر بنا دوبلکس\n۳خواب\nاستخر آبگرم داخل\nسیستم صوتی حرفه ای\nسرگرمی ایرهاکی\nبرای اطلاعات بیشتر تماس حاصل فرماید',
 'title': 'باغ ویلا اجاره روزانه استخر داخل لشکرآباد سهیلیه',
 'rent_mode': None,
 'rent_value': None,
 'rent_to_single': None,
 'rent_type': None,
 'price_mode': None,
 'price_value': None,
 'credit_mode': None,
 'credit_value': None,
 'rent_credit_transform': None,
 'transformable_price': None,
 'transformable_credit': None,
 'transformed_credit': None,
 'transformable_rent': None,
 'transformed_rent': None,
 'land_size': None,
 'building_size': 500.0,
 'deed_type': None,
 'has_business_deed': None,
 'floor': None,
 'rooms_count': 'سه',
 'total_floors_count': None,
 'unit_per_floor': None,
 'has_balcony': None,
 'has_elevator': None,
 'has_warehouse': None,
 'h

In [60]:
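# Scale the weekend rent by 10 into a new "_rial" column (None stays None) and drop the original column.
new_dataset = dataset.map(
    lambda rent_price_at_weekends: {
        "rent_price_at_weekends_rial": None if rent_price_at_weekends is None else rent_price_at_weekends * 10
    },
    input_columns=["rent_price_at_weekends"], remove_columns=["rent_price_at_weekends"],
    num_proc=5,  # run the mapping in 5 worker processes
)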

In [61]:
new_dataset_columns = set(new_dataset.column_names)
dataset_columns = set(dataset.column_names)

print("Added columns:", new_dataset_columns - dataset_columns)
print("Removed columns:", dataset_columns - new_dataset_columns)

Added columns: {'rent_price_at_weekends_rial'}
Removed columns: {'rent_price_at_weekends'}
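
The lambda above can also be written as a named function, which is easier to check in isolation before handing it to `map`. The sketch below is plain Python and assumes the source column is priced in toman (1 toman = 10 rial), which the factor of 10 and the `_rial` suffix suggest but the output above does not confirm. `TOMAN_TO_RIAL` and `weekend_rent_to_rial` are illustrative names, not part of 🤗 Datasets.

```python
from typing import Dict, Optional

# Assumption: the original column is in toman; 1 toman = 10 rial.
TOMAN_TO_RIAL = 10


def weekend_rent_to_rial(rent_price_at_weekends: Optional[float]) -> Dict[str, Optional[float]]:
    """Return the new column as a dict, leaving missing prices as None."""
    if rent_price_at_weekends is None:
        return {"rent_price_at_weekends_rial": None}
    return {"rent_price_at_weekends_rial": rent_price_at_weekends * TOMAN_TO_RIAL}


# Quick check on plain values before mapping the whole dataset.
print(weekend_rent_to_rial(None))
print(weekend_rent_to_rial(2_500_000.0))
```

Passing `weekend_rent_to_rial` in place of the lambda, with the same `input_columns` and `remove_columns` arguments, should produce the same added and removed columns as shown above.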


# 🏷️ Part 3 – Loading a Pre-trained Model & Tokenizer

This part loads the tokenizer of a pretrained Persian BERT checkpoint from 🤗 Transformers; the task-specific models are loaded one by one in Part 4.

In [62]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Choose a Persian BERT-based checkpoint
model_name = "HooshvareLab/bert-base-parsbert-uncased"

# Load the tokenizer (the task-specific models are loaded in Part 4)
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [63]:
tokenizer

BertTokenizerFast(name_or_path='HooshvareLab/bert-base-parsbert-uncased', vocab_size=100000, model_max_length=1000000000000000019884624838656, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	1: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	2: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	3: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	4: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}
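
The `model_max_length` printed above is not a real limit: the value equals `int(1e30)`, the placeholder 🤗 Transformers uses when a tokenizer's configuration does not specify a maximum length. The model itself has 512 position embeddings (see `max_position_embeddings` in the config printed in Part 4), so an explicit cap is needed when truncating. A minimal sketch of that reasoning in plain Python follows; `effective_max_length` is a hypothetical helper, not a library function.

```python
# Value shown in the tokenizer repr above.
REPORTED_MODEL_MAX_LENGTH = 1000000000000000019884624838656
# From the model config: "max_position_embeddings": 512.
MAX_POSITION_EMBEDDINGS = 512


def effective_max_length(tokenizer_max: int, position_embeddings: int) -> int:
    """Hypothetical helper: the usable sequence length is the smaller of the two limits."""
    return min(tokenizer_max, position_embeddings)


# The reported value is just int(1e30), i.e. "no limit configured".
print(REPORTED_MODEL_MAX_LENGTH == int(1e30))
print(effective_max_length(REPORTED_MODEL_MAX_LENGTH, MAX_POSITION_EMBEDDINGS))
```

With this tokenizer, that translates to calls such as `tokenizer(text, truncation=True, max_length=512)` for long inputs.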

# 🚀 Part 4 – Zero-Shot Tasks with ParsBERT-Base

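`HooshvareLab/bert-base-parsbert-uncased` can be plugged into three common pipelines without any fine-tuning. Masked-token prediction and feature extraction use the pre-trained weights directly. The sentiment-analysis pipeline also runs, but its classification head is randomly initialized, so its labels are not meaningful until the model is fine-tuned: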

| Sub-part | Task                                | Pipeline keyword       | Output                    |
|----------|-------------------------------------|------------------------|---------------------------|
| 4.1      | **Sentiment analysis**              | `"sentiment-analysis"` | single label + confidence |
| 4.2      | **Masked-token prediction** (cloze) | `"fill-mask"`          | top-k token completions   |
| 4.3      | **Feature extraction**              | `"feature-extraction"` | per-token embeddings      |

In [64]:
from transformers import pipeline
import torch

# Prefer Apple-silicon MPS, then CUDA, and fall back to CPU
device = "mps" if torch.backends.mps.is_available() else "cuda" if torch.cuda.is_available() else "cpu"
device

'mps'

## 🔹 4.1 Sentiment Analysis

In [65]:
from transformers import AutoModelForSequenceClassification

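mdl_name = "HooshvareLab/bert-base-parsbert-uncased"
# The checkpoint has no classification head, so `classifier.weight` and `classifier.bias` are randomly initialized here.
clf_model = AutoModelForSequenceClassification.from_pretrained(mdl_name)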

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at HooshvareLab/bert-base-parsbert-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [66]:
clf_model

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(100000, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1

In [67]:
clf_model.config

BertConfig {
  "_name_or_path": "HooshvareLab/bert-base-parsbert-uncased",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.43.4",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 100000
}

In [68]:
sentiment = pipeline("sentiment-analysis", model=clf_model, tokenizer=tokenizer, device=device)

sample = dataset[0]["description"]  # description of the first ad
print("Input snippet:", sample[:80], "…")

Input snippet: ۵۰۰متر
۲۰۰متر بنا دوبلکس
۳خواب
استخر آبگرم داخل
سیستم صوتی حرفه ای
سرگرمی ایرهاک …


In [69]:
out = sentiment(sample)
print("Output:", out)
print("Output[0]:", out[0])

Output: [{'label': 'LABEL_0', 'score': 0.5169467329978943}]
Output[0]: {'label': 'LABEL_0', 'score': 0.5169467329978943}
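
A score of about 0.52 for `LABEL_0` is what a two-class head with random weights tends to produce: the load-time warning above says `classifier.weight` and `classifier.bias` are newly initialized, so the two logits are nearly equal and the softmax lands close to 0.5. The sketch below reproduces that effect with made-up logits; the values are illustrative and are not taken from the model.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits from an untrained two-class head: small and nearly equal.
untrained_logits = [0.04, -0.03]
# Hypothetical logits from a head that has learned a clear preference.
trained_logits = [3.0, -2.0]

print("untrained:", [round(p, 3) for p in softmax(untrained_logits)])
print("trained:  ", [round(p, 3) for p in softmax(trained_logits)])
```

Meaningful labels require fine-tuning the classifier on labeled data, as the warning itself suggests.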


## 🔹 4.2 Masked-Token Prediction (Fill-Mask)

In [70]:
from transformers import AutoModelForMaskedLM

mlm_model = AutoModelForMaskedLM.from_pretrained(mdl_name)

Some weights of the model checkpoint at HooshvareLab/bert-base-parsbert-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [71]:
mlm_model

BertForMaskedLM(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(100000, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementw

In [72]:
fill_mask = pipeline("fill-mask", model=mlm_model, tokenizer=tokenizer, device=device)

sentence = "از تهران به [MASK] رفتم."
for prediction in fill_mask(sentence):
    print(f"{prediction['token_str']:<8} → {prediction['score']:.4f}")

اصفهان   → 0.0801
کرج      → 0.0477
شیراز    → 0.0370
مشهد     → 0.0332
انجا     → 0.0329


## 🔹 4.3 Feature Extraction (Embeddings)

In [73]:
from transformers import AutoModel

fe_model = AutoModel.from_pretrained(mdl_name)

In [74]:
fe_model

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(100000, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-11): 12 x BertLayer(
        (attention): BertAttention(
          (self): BertSdpaSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=Fals

In [75]:
fe = pipeline("feature-extraction", model=fe_model, tokenizer=tokenizer, device=device)

vecs = fe("سلام دنیا")[0]  # shape: [num_tokens, 768]
print("Tokens:", len(vecs), "| Hidden size:", len(vecs[0]))

Tokens: 4 | Hidden size: 768


In [76]:
tokens = tokenizer.tokenize("سلام دنیا", add_special_tokens=True)
tokens

['[CLS]', 'سلام', 'دنیا', '[SEP]']
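
The four vectors correspond to the four tokens listed above, including `[CLS]` and `[SEP]`. One common way to collapse per-token embeddings into a single sentence vector is mean pooling; this is an added illustration, not a step the workshop itself performs. The sketch below uses tiny made-up vectors in place of the 768-dimensional `vecs`, and `mean_pool` is a hypothetical helper.

```python
from typing import List, Sequence


def mean_pool(token_vectors: Sequence[Sequence[float]]) -> List[float]:
    """Average the token vectors dimension by dimension."""
    if not token_vectors:
        raise ValueError("expected at least one token vector")
    count = len(token_vectors)
    return [sum(column) / count for column in zip(*token_vectors)]


# Toy stand-in for `vecs`: 4 tokens ([CLS], two words, [SEP]) x 3 dimensions.
toy_vecs = [
    [1.0, 0.0, 2.0],  # [CLS]
    [3.0, 4.0, 2.0],  # first word
    [5.0, 0.0, 2.0],  # second word
    [3.0, 0.0, 2.0],  # [SEP]
]

sentence_vec = mean_pool(toy_vecs)  # average over all tokens
words_only_vec = mean_pool(toy_vecs[1:-1])  # skip [CLS] and [SEP]
print(sentence_vec, words_only_vec)
```

Applied to the real output, `mean_pool(vecs)` would return a list of 768 floats; whether to include the special tokens in the average is a design choice.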