Based on the work of Anis Koubaa et al. (ArabianGPT: Native Arabic GPT-based Large Language Model) at:
- https://arxiv.org/pdf/2402.15313
- https://pypi.org/project/aranizer/

In [2]:
%pip install aranizer

Collecting aranizer
  Downloading aranizer-0.2.5-py3-none-any.whl.metadata (5.7 kB)
Collecting sentence-transformers>=0.4.0 (from aranizer)
  Downloading sentence_transformers-3.0.1-py3-none-any.whl.metadata (10 kB)
Collecting scikit-learn (from sentence-transformers>=0.4.0->aranizer)
  Downloading scikit_learn-1.5.1-cp311-cp311-win_amd64.whl.metadata (12 kB)
Collecting joblib>=1.2.0 (from scikit-learn->sentence-transformers>=0.4.0->aranizer)
  Using cached joblib-1.4.2-py3-none-any.whl.metadata (5.4 kB)
Collecting threadpoolctl>=3.1.0 (from scikit-learn->sentence-transformers>=0.4.0->aranizer)
  Using cached threadpoolctl-3.5.0-py3-none-any.whl.metadata (13 kB)
Downloading aranizer-0.2.5-py3-none-any.whl (12.0 MB)

### Importing Tokenizers
Import your desired tokenizer from AraNizer.

Available tokenizers include:
- BPE variants: get_bpe with keys bpe32, bpe50, bpe64, bpe86, bpe32T, bpe50T, bpe64T, bpe86T
- SentencePiece variants: get_sp with keys sp32, sp50, sp64, sp86, sp32T, sp50T, sp64T, sp86T

In [3]:
from aranizer import get_bpe, get_sp  # Import functions to retrieve tokenizers

# Example for importing a BPE tokenizer
bpe_tokenizer = get_bpe("bpe32")  # Replace with your chosen tokenizer key

# Example for importing a SentencePiece tokenizer
sp_tokenizer = get_sp("sp32")  # Replace with your chosen tokenizer key


### Tokenizing Text
Tokenize Arabic text using the selected tokenizer:

In [4]:
text = "مثال على النص العربي"  # Example Arabic text

# Using BPE tokenizer
bpe_tokens = bpe_tokenizer.tokenize(text)
print(bpe_tokens)

# Using SentencePiece tokenizer
sp_tokens = sp_tokenizer.tokenize(text)
print(sp_tokens)

['Ùħ', 'Ø«Ø§ÙĦ', 'ĠØ¹ÙĦÙī', 'ĠØ§ÙĦÙĨØµ', 'ĠØ§ÙĦØ¹Ø±Ø¨ÙĬ']
['▁مثال', '▁على', '▁النص', '▁العربي']
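
The first printed list is not corrupted output. Its shape (a `Ġ` in front of each word that follows a space, and two symbols per Arabic letter) matches the GPT-2-style byte-level alphabet, in which every UTF-8 byte is displayed as one printable character. Assuming that is the scheme the BPE tokenizer uses (the source does not state it), the following minimal sketch maps such tokens back to readable text using only the standard library. The helper names `bytes_to_unicode`, `text_to_byte_level`, and `byte_level_to_text` are illustrative and are not part of aranizer.

```python
def bytes_to_unicode():
    """Build a GPT-2-style map from byte value (0-255) to one printable character."""
    # Bytes that are already printable keep their own code point.
    printable = (
        list(range(0x21, 0x7F))     # '!' .. '~'
        + list(range(0xA1, 0xAD))   # U+00A1 .. U+00AC
        + list(range(0xAE, 0x100))  # U+00AE .. U+00FF
    )
    mapping = {b: chr(b) for b in printable}
    # Every remaining byte (controls, space, 0x7F-0xA0, 0xAD) is assigned
    # the next free code point starting at 256, in ascending byte order.
    offset = 0
    for b in range(256):
        if b not in mapping:
            mapping[b] = chr(256 + offset)
            offset += 1
    return mapping


BYTE_TO_CHAR = bytes_to_unicode()
CHAR_TO_BYTE = {c: b for b, c in BYTE_TO_CHAR.items()}


def text_to_byte_level(text):
    """Show text the way a byte-level tokenizer displays it."""
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))


def byte_level_to_text(tokens):
    """Join byte-level tokens and decode the underlying UTF-8 bytes."""
    raw = bytes(CHAR_TO_BYTE[c] for token in tokens for c in token)
    return raw.decode("utf-8")


# Tokens copied from the BPE output above.
tokens = ['Ùħ', 'Ø«Ø§ÙĦ', 'ĠØ¹ÙĦÙī', 'ĠØ§ÙĦÙĨØµ', 'ĠØ§ÙĦØ¹Ø±Ø¨ÙĬ']
print(byte_level_to_text(tokens))
```

Under this assumption, `Ġ` stands for the space byte, so the BPE tokenizer split the sentence into five pieces where SentencePiece produced four: the first word is divided into two tokens.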


### Encoding and Decoding
Encode text into token IDs and decode them back to text.

Encoding: To encode text, use the encode method.

In [6]:
text = "مثال على النص العربي"  # Example Arabic text

# Using BPE tokenizer
encoded_bpe_output = bpe_tokenizer.encode(text, add_special_tokens=True)
print(encoded_bpe_output)

# Using SentencePiece tokenizer
encoded_sp_output = sp_tokenizer.encode(text, add_special_tokens=True)
print(encoded_sp_output)

[268, 3408, 363, 3836, 1951]
[8947, 220, 2674, 1267]


Decoding: To convert token ids back to text, use the decode method:

In [7]:
# Using BPE tokenizer
decoded_bpe_text = bpe_tokenizer.decode(encoded_bpe_output)
print(decoded_bpe_text)

# Using SentencePiece tokenizer
decoded_sp_text = sp_tokenizer.decode(encoded_sp_output)
print(decoded_sp_text)

مثال على النص العربي
مثال على النص العربي
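
Both decoded lines match the input, so encoding followed by decoding round-trips for this sentence. The check can be wrapped in a small function, sketched here with only the two calls used above (`encode(text, add_special_tokens=True)` and `decode(ids)`). The names `round_trips` and `CodepointTokenizer` are hypothetical. The stand-in class exists only so the block runs without downloading a model; in the notebook, pass `bpe_tokenizer` or `sp_tokenizer` instead.

```python
class CodepointTokenizer:
    """Hypothetical stand-in: one ID per Unicode code point (not an AraNizer class)."""

    def encode(self, text, add_special_tokens=True):
        # add_special_tokens is accepted only to mirror the call used above;
        # this stand-in has no special tokens and ignores it.
        return [ord(ch) for ch in text]

    def decode(self, ids):
        return "".join(chr(i) for i in ids)


def round_trips(tokenizer, text):
    """Return True if decode(encode(text)) reproduces text exactly."""
    ids = tokenizer.encode(text, add_special_tokens=True)
    return tokenizer.decode(ids) == text


sample = "مثال على النص العربي"
print(round_trips(CodepointTokenizer(), sample))
```

A `False` result is not necessarily a bug: a tokenizer that inserts special tokens or normalizes its input would also fail an exact comparison, so it is worth inspecting the decoded string before drawing conclusions.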


### Available Tokenizers
- get_bpe("bpe32"): BPE tokenizer, 32k vocabulary
- get_bpe("bpe50"): BPE tokenizer, 50k vocabulary
- get_bpe("bpe64"): BPE tokenizer, 64k vocabulary
- get_bpe("bpe86"): BPE tokenizer, 86k vocabulary
- get_bpe("bpe32T"): BPE tokenizer, 32k vocabulary, with Tashkeel (diacritics)
- get_bpe("bpe50T"): BPE tokenizer, 50k vocabulary, with Tashkeel (diacritics)
- get_bpe("bpe64T"): BPE tokenizer, 64k vocabulary, with Tashkeel (diacritics)
- get_bpe("bpe86T"): BPE tokenizer, 86k vocabulary, with Tashkeel (diacritics)
- get_sp("sp32"): SentencePiece tokenizer, 32k vocabulary
- get_sp("sp50"): SentencePiece tokenizer, 50k vocabulary
- get_sp("sp64"): SentencePiece tokenizer, 64k vocabulary
- get_sp("sp86"): SentencePiece tokenizer, 86k vocabulary
- get_sp("sp32T"): SentencePiece tokenizer, 32k vocabulary, with Tashkeel (diacritics)
- get_sp("sp50T"): SentencePiece tokenizer, 50k vocabulary, with Tashkeel (diacritics)
- get_sp("sp64T"): SentencePiece tokenizer, 64k vocabulary, with Tashkeel (diacritics)
- get_sp("sp86T"): SentencePiece tokenizer, 86k vocabulary, with Tashkeel (diacritics)
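
The keys in this list follow a regular pattern: a family prefix (bpe or sp), the vocabulary size in thousands, and an optional trailing T for the Tashkeel variants. The sketch below parses a key into those three parts and enumerates all sixteen combinations, which may help when looping over tokenizers for a comparison. `TokenizerKey`, `parse_tokenizer_key`, and `ALL_KEYS` are hypothetical helpers written for this example; they are not provided by aranizer.

```python
import re
from typing import NamedTuple


class TokenizerKey(NamedTuple):
    family: str        # "bpe" or "sp"
    vocab_size_k: int  # vocabulary size in thousands
    tashkeel: bool     # True for the "T" (diacritics) variants


# Pattern derived from the keys listed above.
_KEY_PATTERN = re.compile(r"(bpe|sp)(32|50|64|86)(T?)")


def parse_tokenizer_key(key: str) -> TokenizerKey:
    """Split a key such as "bpe32T" into family, size, and Tashkeel flag."""
    match = _KEY_PATTERN.fullmatch(key)
    if match is None:
        raise ValueError(f"unknown tokenizer key: {key!r}")
    family, size, flag = match.groups()
    return TokenizerKey(family, int(size), flag == "T")


# All sixteen keys, in the same order as the list above.
ALL_KEYS = [
    f"{family}{size}{suffix}"
    for family in ("bpe", "sp")
    for suffix in ("", "T")
    for size in (32, 50, 64, 86)
]

print(parse_tokenizer_key("bpe32T"))
print(len(ALL_KEYS))
```

In the notebook, a parsed key could be used to pick the loader, for example calling get_bpe(key) when the family is "bpe" and get_sp(key) otherwise.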