## checkout this papers:

[mMARCO: A Multilingual Version of the MS MARCO Passage Ranking Dataset](https://arxiv.org/abs/2108.13897)

[A cost-benefit analysis of cross-lingual transfer methods](https://arxiv.org/abs/2105.06813)


In [1]:
#load the mMARCO a multilingual version of the MS MARCO passage ranking dataset 
#from huggingface https://huggingface.co/datasets/unicamp-dl/mmarco
from datasets import load_dataset
dataset = load_dataset('unicamp-dl/mmarco', 'arabic')
dataset

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.


Downloading builder script:   0%|          | 0.00/9.96k [00:00<?, ?B/s]

Downloading readme:   0%|          | 0.00/3.66k [00:00<?, ?B/s]

Repo card metadata block was not found. Setting CardData to empty.


Downloading data:   0%|          | 0.00/4.66G [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/51.9M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/905M [00:00<?, ?B/s]

Generating train split: 0 examples [00:00, ? examples/s]

Loading dataset shards:   0%|          | 0/91 [00:00<?, ?it/s]

DatasetDict({
    train: Dataset({
        features: ['query', 'positive', 'negative'],
        num_rows: 39780811
    })
})

In [2]:
# https://huggingface.co/aubmindlab/araelectra-base-discriminator
# A preprocessing is recommended by the authors of AraELECTRA and AraBERT before training or testing on any dataset. 
!pip install arabert -q
from arabert.preprocess import ArabertPreprocessor

model_name="araelectra-base"
arabert_prep = ArabertPreprocessor(model_name=model_name)

#text = "و لن نبالغ إذا قلنا إن الهاتف أ و كمبيوتر  المكتب في زمننا هذا ضروري"
#arabert_prep.preprocess(text)

In [3]:
# select 5M sample out of 39M sample
dataset_eval = dataset['train'].select(range(5000000, 5005000))
dataset_train = dataset['train'].select(range(0, 5000000))

In [4]:
# The dataset is in the form (query, positive passage, negative passage).
# We split it into the forms (query, positive passage, label=1) and (query, negative passage, label=0).
def split_examples(batch):
    queries = []
    passages = []
    labels = []
    for label in ["positive", "negative"]:
        for (query, passage) in zip(batch["query"], batch[label]):
            queries.append(arabert_prep.preprocess(query))
            passages.append(arabert_prep.preprocess(passage))
            labels.append(int(label == "positive"))
    return {"query": queries, "passage": passages, "label": labels}

dataset_train = dataset_train.map(split_examples, batched=True, remove_columns=["positive", "negative"])
dataset_eval = dataset_eval.map(split_examples, batched=True, remove_columns=["positive", "negative"])


Map:   0%|          | 0/5000000 [00:00<?, ? examples/s]

Map:   0%|          | 0/5000 [00:00<?, ? examples/s]

In [5]:
# we apply tokenization 
import torch
from datasets import load_dataset
from transformers import AutoTokenizer
args_model="aubmindlab/araelectra-base-discriminator"
tokenizer = AutoTokenizer.from_pretrained(args_model)

def tokenize(batch):
    tokenized = tokenizer(
        batch["query"],
        batch["passage"],
        padding=True,
        truncation="only_second",
        max_length=512,
        )
    tokenized["labels"] = [[float(label)] for label in batch["label"]]
    return tokenized

tokenizer_config.json:   0%|          | 0.00/392 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/503 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/825k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.64M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

In [6]:
dataset_train = dataset_train.map(tokenize, batched=True, remove_columns=["query", "passage", "label"])
dataset_train.set_format("torch")
dataset_eval = dataset_eval.map(tokenize, batched=True, remove_columns=["query", "passage", "label"])
dataset_eval.set_format("torch")

Map:   0%|          | 0/10000000 [00:00<?, ? examples/s]

Map:   0%|          | 0/10000 [00:00<?, ? examples/s]

In [7]:
# Save the dataset locally
dataset_train.save_to_disk("mmarco_train10M_preprossesd_for_AraELECTRA")
dataset_eval.save_to_disk("mmarco_eval10k_preprossesd_for_AraELECTRA")

Saving the dataset (0/28 shards):   0%|          | 0/10000000 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/10000 [00:00<?, ? examples/s]

In [8]:
import huggingface_hub 
hf = huggingface_hub.HfFolder()
access_token = "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" 
organization_dataset_id="hatemestinbejaia/RARAELECTRAandRARABERTusedDATASET"
#To push the dataset to your own Huggingface repository, change the organization_dataset_id and access_token
hf.save_token(access_token)
dataset_train.push_to_hub(organization_dataset_id, "mmarco_train10M_preprossesd_for_AraELECTRA")
dataset_eval.push_to_hub(organization_dataset_id, "mmarco_eval10k_preprossesd_for_AraELECTRA")

Uploading the dataset shards:   0%|          | 0/28 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/358 [00:00<?, ?ba/s]

Creating parquet from Arrow format:   0%|          | 0/358 [00:00<?, ?ba/s]

Creating parquet from Arrow format:   0%|          | 0/358 [00:00<?, ?ba/s]

Creating parquet from Arrow format:   0%|          | 0/358 [00:00<?, ?ba/s]

Creating parquet from Arrow format:   0%|          | 0/358 [00:00<?, ?ba/s]

Creating parquet from Arrow format:   0%|          | 0/358 [00:00<?, ?ba/s]

Creating parquet from Arrow format:   0%|          | 0/358 [00:00<?, ?ba/s]

Creating parquet from Arrow format:   0%|          | 0/358 [00:00<?, ?ba/s]

Creating parquet from Arrow format:   0%|          | 0/358 [00:00<?, ?ba/s]

Creating parquet from Arrow format:   0%|          | 0/358 [00:00<?, ?ba/s]

Creating parquet from Arrow format:   0%|          | 0/358 [00:00<?, ?ba/s]

Creating parquet from Arrow format:   0%|          | 0/358 [00:00<?, ?ba/s]

Creating parquet from Arrow format:   0%|          | 0/358 [00:00<?, ?ba/s]

Creating parquet from Arrow format:   0%|          | 0/358 [00:00<?, ?ba/s]

Creating parquet from Arrow format:   0%|          | 0/358 [00:00<?, ?ba/s]

Creating parquet from Arrow format:   0%|          | 0/358 [00:00<?, ?ba/s]

Creating parquet from Arrow format:   0%|          | 0/358 [00:00<?, ?ba/s]

Creating parquet from Arrow format:   0%|          | 0/358 [00:00<?, ?ba/s]

Creating parquet from Arrow format:   0%|          | 0/358 [00:00<?, ?ba/s]

Creating parquet from Arrow format:   0%|          | 0/358 [00:00<?, ?ba/s]

Creating parquet from Arrow format:   0%|          | 0/358 [00:00<?, ?ba/s]

Creating parquet from Arrow format:   0%|          | 0/358 [00:00<?, ?ba/s]

Creating parquet from Arrow format:   0%|          | 0/358 [00:00<?, ?ba/s]

Creating parquet from Arrow format:   0%|          | 0/358 [00:00<?, ?ba/s]

Creating parquet from Arrow format:   0%|          | 0/358 [00:00<?, ?ba/s]

Creating parquet from Arrow format:   0%|          | 0/358 [00:00<?, ?ba/s]

Creating parquet from Arrow format:   0%|          | 0/358 [00:00<?, ?ba/s]

Creating parquet from Arrow format:   0%|          | 0/358 [00:00<?, ?ba/s]

README.md:   0%|          | 0.00/1.04k [00:00<?, ?B/s]

Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/10 [00:00<?, ?ba/s]

README.md:   0%|          | 0.00/1.57k [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/datasets/hatemestinbejaia/RARAELECTRAandRARABERTusedDATASET/commit/09a2e0efa0b5a072f40899a74f3e531bd89d85bd', commit_message='Upload dataset', commit_description='', oid='09a2e0efa0b5a072f40899a74f3e531bd89d85bd', pr_url=None, pr_revision=None, pr_num=None)

In [9]:
#You can use the processed dataset directly from our repository to fine-tune your owen version-based AraELECTRA
#using the below code 
from datasets import load_dataset
dataset_train = load_dataset(organization_dataset_id, 'mmarco_train10M_preprossesd_for_AraELECTRA')
dataset_eval = load_dataset(organization_dataset_id, 'mmarco_eval10k_preprossesd_for_AraELECTRA')

Downloading readme:   0%|          | 0.00/2.08k [00:00<?, ?B/s]

Resolving data files:   0%|          | 0/28 [00:00<?, ?it/s]

Downloading data: 100%|██████████| 62.9M/62.9M [00:03<00:00, 19.4MB/s]
Downloading data: 100%|██████████| 62.1M/62.1M [00:02<00:00, 23.2MB/s]
Downloading data: 100%|██████████| 63.0M/63.0M [00:01<00:00, 36.5MB/s]
Downloading data: 100%|██████████| 63.0M/63.0M [00:01<00:00, 40.0MB/s]
Downloading data: 100%|██████████| 62.8M/62.8M [00:02<00:00, 31.2MB/s]
Downloading data: 100%|██████████| 61.5M/61.5M [00:01<00:00, 37.1MB/s]
Downloading data: 100%|██████████| 62.7M/62.7M [00:01<00:00, 32.0MB/s]
Downloading data: 100%|██████████| 62.0M/62.0M [00:02<00:00, 30.6MB/s]
Downloading data: 100%|██████████| 62.2M/62.2M [00:01<00:00, 38.0MB/s]
Downloading data: 100%|██████████| 62.8M/62.8M [00:01<00:00, 32.0MB/s]
Downloading data: 100%|██████████| 62.7M/62.7M [00:01<00:00, 37.3MB/s]
Downloading data: 100%|██████████| 61.8M/61.8M [00:01<00:00, 36.3MB/s]
Downloading data: 100%|██████████| 61.8M/61.8M [00:02<00:00, 27.4MB/s]
Downloading data: 100%|██████████| 62.5M/62.5M [00:01<00:00, 32.4MB/s]
Downlo

Generating train split:   0%|          | 0/10000000 [00:00<?, ? examples/s]

Loading dataset shards:   0%|          | 0/29 [00:00<?, ?it/s]

Downloading data: 100%|██████████| 1.79M/1.79M [00:00<00:00, 3.08MB/s]


Generating train split:   0%|          | 0/10000 [00:00<?, ? examples/s]