If you're opening this notebook on Colab, you will probably need to install 🤗 Transformers and 🤗 Datasets, along with their dependencies. Run the following cell to do so.

In [1]:
!pip install datasets transformers 



## Loading the dataset

In [None]:
import pandas as pd

# Keep the first 20,000 rows and write them to a smaller CSV file
df = pd.read_csv("/kaggle/working/our_data.csv")
df1 = df[0:20000]
df1.to_csv("ourr_data.csv")

In [None]:
from datasets import load_dataset

raw_datasets = load_dataset("csv", data_files="/kaggle/working/our_data2.csv")

In [3]:
raw_datasets=raw_datasets.remove_columns(['Unnamed: 0.1','Unnamed: 0'])

The `raw_datasets` object is a [`DatasetDict`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasetdict). Loading a single CSV file this way yields only a `train` split, so we hold out 10% of it as a test set:

In [4]:
raw_datasets = raw_datasets["train"].train_test_split(test_size=.1)

In [None]:
raw_datasets
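
As a quick sanity check on the split sizes, the arithmetic below assumes that `train_test_split` rounds the test share up to a whole example and gives the remainder to the training set. The total of 52,522 rows is inferred from the 47,269 and 5,253 examples reported by the progress bars further down; it is not stated explicitly in the notebook.

```python
import math

# Assumption: total row count inferred from the two split sizes in the progress bars.
n_total = 47269 + 5253
test_size = 0.1

# Assumed rounding rule: test share rounded up, remainder goes to training.
n_test = math.ceil(test_size * n_total)
n_train = n_total - n_test

print(n_train, n_test)
```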

In [5]:
import re
chars_to_ignore_regex = '[\,\?\!\-\;\:\\\%\\�\+\؟\[\]\،\\*\\&\\ufeff\\ـ\'ّ\$]'


def remove_special_characters(batch):
    batch["txt"] = re.sub(chars_to_ignore_regex, '', batch["txt"]).lower()
    batch["txt"] = re.sub('[a-z]','',batch["txt"])  
    batch["syllables"] = re.sub(chars_to_ignore_regex, '', batch["syllables"]).lower()
    batch["syllables"] = re.sub('[a-z]','',batch["syllables"])   
    return batch
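
To see what this kind of cleanup does, here is a minimal sketch on a made-up string. The pattern `PUNCT_RE` is a simplified, hypothetical subset of `chars_to_ignore_regex` (it omits the shadda and the other special characters), and `clean_text` is an illustrative helper, not part of the notebook.

```python
import re

# Simplified, hypothetical subset of the notebook's ignore list.
PUNCT_RE = re.compile(r"[,?!;:%+*&$؟،]")
LATIN_RE = re.compile(r"[a-z]")

def clean_text(text: str) -> str:
    """Drop the listed punctuation, then lowercase and drop Latin letters."""
    text = PUNCT_RE.sub("", text).lower()
    return LATIN_RE.sub("", text)

word = "كتاب"
sample = f"Test: {word}, ok؟"
cleaned = clean_text(sample)
print(repr(cleaned))
```

Because the text is lowercased before the `[a-z]` pass, uppercase Latin letters are removed as well.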

In [6]:
raw_datasets['train']=raw_datasets['train'].map(remove_special_characters)
raw_datasets['test']=raw_datasets['test'].map(remove_special_characters)

#valid=valid.map(remove_special_characters)

  0%|          | 0/47269 [00:00<?, ?ex/s]

  0%|          | 0/5253 [00:00<?, ?ex/s]

In [None]:
raw_datasets

To get a sense of what the data looks like, the following function shows a few examples picked at random from the dataset.

In [7]:
import datasets
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=5):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)

    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))
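
The retry loop above draws indices until it has `num_examples` distinct ones. A minimal equivalent sketch uses `random.sample`, which returns unique picks in a single call; `pick_unique_indices` is a hypothetical helper name.

```python
import random

def pick_unique_indices(dataset_len: int, num_examples: int = 5) -> list[int]:
    """Return `num_examples` distinct indices in [0, dataset_len)."""
    if num_examples > dataset_len:
        raise ValueError("Can't pick more elements than there are in the dataset.")
    return random.sample(range(dataset_len), num_examples)

picks = pick_unique_indices(100, 5)
print(picks)
```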

In [8]:
show_random_elements(raw_datasets['train'])

Unnamed: 0,syllables,txt
0,|وَ|نَ|فَاْ|دُ|قُوْ|وَ|تِلْ|مَرْ|ءَ|تِ|بِلْ|حَمْ|لِ|وَ|لِتْ|تَرْ|بِ|يَهْ.,وَنَفَادُ|قُوَةِ|الْمَرْأَةِ|بِالْحَمْلِ|وَالتَرْبِيَةِ.
1,|ءَ|خَ|ذَطْ|طَاْ|لِ|بُ|كُ|تُ|بَهْ.,أَخَذَ|الطَالِبُ|كُتُبَهُ.
2,|فِلْ|وَقْ|تِلْ|لَ|ذِيْ|كَاْ|نَ|فِيْ|ھِ|مُعْ|ظَ|مُلْ|مُ|شَاْ|رِ|كِيْ|نَ|فِيْ|حَمْ|لَ|تِنْ|نَ|ظَاْ|فَ|تِ|يَحْ|تَ|فِ|ظُوْ|نَ|فِيْ|ءَيْ|دِيْ|ھِمْ|بِ|عُلْ|بَ|ھِنْ.,فِي|الْوَقْتِ|الَذِي|كَانَ|فِيهِ|مُعْظَمُ|الْمُشَارِكِينَ|فِي|حَمْلَةِ|النَظَافَةِ|يَحْتَفِظُونَ|فِي|أَيْدِيهِمْ|بِعُلْبَةٍ.
3,|كُلْ|لُ|مُ|وَاْ|طِ|نِمْ|مِ|نَلْ|حُ|صُوْ|لِ|عَ|لَى|حَقْ|قِ|ھِلْ|مَشْ|رُوْ|عِ|فِ|يَصْ|صِحْ|حَ|تِ|وَلْ|حَ|يَاْهْ.,كُلُ|مُوَاطِنٍ|مِنْ|الْحُصُولِ|عَلَى|حَقِهِ|الْمَشْرُوعِ|فِي|الصِحَةِ|وَالْحَيَاةِ.
4,|فِلْ|وَقْ|تِ|نَفْ|سِ|ھِ|مَ|عَ|طَ|لَاْ|ءِ|عِنْ|نُ|وَى|فَ|فَكْ|كَ|رُوْ|فِيْ|ءِمْ|كَاْ|نِيْ|يَ|تِ|ءَيْ|يَسْ|تَخْ|دِ|مُ|وُتْ|تِ|قَ|نِيْ|يَ|تَ|لِ|نَقْلْ.,فِي|الْوَقْتِ|نَفْسِهِ|مَعَ|طَلَائِعِ|النُوَى|فَفَكَرُوا|فِي|إِمْكَانِيَةِ|أَنْ|يَسْتَخْدِمُوا|التِقَنِيَةَ|لِنَقْلِ.


In [None]:
raw_datasets

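No evaluation metric is loaded in this notebook, so training is monitored through the validation loss alone. If you add one, it will be an instance of [`datasets.Metric`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Metric), and you can call its `compute` method with your predictions and labels, which need to be lists of decoded strings.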

## Preprocessing the data

Before we can feed those texts to our model, we need to preprocess them. This is done by a 🤗 Transformers `Tokenizer`, which (as the name indicates) tokenizes the inputs, converts the tokens to their corresponding IDs in the pretrained vocabulary, puts them in the format the model expects, and generates the other inputs that the model requires.

To do all of this, we instantiate our tokenizer with the `AutoTokenizer.from_pretrained` method, which will ensure:

- we get a tokenizer that corresponds to the model architecture we want to use,
- we download the vocabulary used when pretraining this specific checkpoint.

That vocabulary will be cached, so it's not downloaded again the next time we run the cell.
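
The token-to-ID step can be pictured with a toy character-level vocabulary. This is only an illustration: `toy_vocab`, `encode`, and `decode` are hypothetical names and do not reflect how 🤗 Transformers tokenizers are implemented; the real vocabulary comes from the checkpoint.

```python
# Hypothetical toy vocabulary: each character maps to an integer ID.
toy_vocab = {"<pad>": 0, "<unk>": 1, "|": 2, "a": 3, "b": 4}
id_to_token = {i: t for t, i in toy_vocab.items()}

def encode(text: str) -> list[int]:
    """Map each character to its ID, falling back to <unk>."""
    return [toy_vocab.get(ch, toy_vocab["<unk>"]) for ch in text]

def decode(ids: list[int]) -> str:
    """Map IDs back to characters, skipping padding."""
    return "".join(id_to_token[i] for i in ids if i != toy_vocab["<pad>"])

ids = encode("|ab|")
print(ids)
print(decode(ids))
```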

In [10]:
from transformers import MT5ForConditionalGeneration, AutoTokenizer

model = MT5ForConditionalGeneration.from_pretrained("google/mt5-base")

# This tokenizer contains all Arabic characters
tokenizer = AutoTokenizer.from_pretrained('IbrahimSalah/wav_chars_.155')




In [11]:
max_input_length = 8129
max_target_length = 8129

def preprocess_function(examples):
    model_inputs = tokenizer(examples["syllables"], max_length=max_input_length, truncation=True)

    # Setup the tokenizer for targets
    labels = tokenizer(text_target=examples["txt"], max_length=max_target_length, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

To apply this function to every example, we use the `map` method on each split of `raw_datasets`. Passing `batched=True` makes `map` call the function on batches of examples rather than one example at a time, which lets the tokenizer process many texts per call.

In [12]:
tokenized_datasets = raw_datasets['train'].map(preprocess_function, batched=True)
tokenized_datasets2 =raw_datasets['test'].map(preprocess_function, batched=True)

#valid = valid.map(preprocess_function, batched=True)

  0%|          | 0/48 [00:00<?, ?ba/s]

  0%|          | 0/6 [00:00<?, ?ba/s]
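
The progress bars above count batches (`ba`) rather than examples. Assuming `map` uses its default `batch_size` of 1000 when `batched=True`, the batch counts follow from the split sizes:

```python
import math

batch_size = 1000  # assumed default batch size of `Dataset.map` with batched=True
split_sizes = {"train": 47269, "test": 5253}

# The last batch of each split is allowed to be smaller, hence the ceiling.
n_batches = {name: math.ceil(n / batch_size) for name, n in split_sizes.items()}
print(n_batches)
```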

In [13]:
from transformers import Seq2SeqTrainingArguments

# Checkpoints and logs are written to the "model" directory
args = Seq2SeqTrainingArguments(
    "model",
    evaluation_strategy="steps",
    learning_rate=1e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    weight_decay=0.01,
    eval_steps=500,
    save_steps=1000,
    save_total_limit=1,
    num_train_epochs=20,
    logging_steps=100,
    fp16=False,
)
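
For a sense of scale, the number of optimizer steps implied by these arguments can be estimated as below. The estimate assumes a single device and no gradient accumulation, neither of which is stated in the notebook; with more devices the step count per epoch shrinks accordingly.

```python
import math

n_train = 47269        # training examples, from the preprocessing progress bar
per_device_batch = 4   # per_device_train_batch_size
num_devices = 1        # assumption: a single GPU
num_epochs = 20        # num_train_epochs

steps_per_epoch = math.ceil(n_train / (per_device_batch * num_devices))
total_steps = steps_per_epoch * num_epochs
n_evals = total_steps // 500  # one evaluation every eval_steps=500

print(steps_per_epoch, total_steps, n_evals)
```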

In [14]:
from transformers import DataCollatorForSeq2Seq
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)
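
`DataCollatorForSeq2Seq` pads each batch to its longest sequence and, by default, pads labels with `-100` so that padded positions are ignored by the loss. The sketch below imitates that padding in plain Python; `pad_batch` is a hypothetical helper, not the library's implementation.

```python
def pad_batch(sequences: list[list[int]], pad_value: int) -> list[list[int]]:
    """Right-pad every sequence to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_value] * (max_len - len(seq)) for seq in sequences]

# 0 stands in for the tokenizer's pad token ID in this toy example.
input_ids = pad_batch([[5, 6, 7], [8]], pad_value=0)
# Labels use -100 so padded positions do not contribute to the loss.
labels = pad_batch([[9, 10], [11, 12, 13]], pad_value=-100)

print(input_ids)
print(labels)
```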

Then we just need to pass all of this along with our datasets to the `Seq2SeqTrainer`:

In [15]:
from transformers import Seq2SeqTrainer
trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_datasets,
    eval_dataset=tokenized_datasets2,
    data_collator=data_collator,
    tokenizer=tokenizer,
  
)

We can now fine-tune our model by calling the `train` method. First, we disable Weights & Biases logging:

In [16]:
import os
os.environ["WANDB_MODE"] = "offline"
os.environ["WANDB_DISABLED"] = "true"

* I stopped the training process once it reached the desired loss.

In [17]:
trainer.train(resume_from_checkpoint='/kaggle/working/model/checkpoint-14000')



Step,Training Loss,Validation Loss
14500,0.0811,0.063459
15000,0.0681,0.058899
15500,0.0685,0.05588
16000,0.0702,0.053484
16500,0.0653,0.050569
17000,0.0583,0.049411
17500,0.0598,0.048941
18000,0.0627,0.048167
18500,0.0573,0.048274
19000,0.0577,0.047828
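
One simple way to read this log is to find the step with the lowest validation loss and check how much the last evaluation improved on the one before it. The sketch below hard-codes the rows from the table above; `log` is just an illustrative variable.

```python
# (step, validation loss) pairs copied from the table above.
log = [
    (14500, 0.063459), (15000, 0.058899), (15500, 0.05588),
    (16000, 0.053484), (16500, 0.050569), (17000, 0.049411),
    (17500, 0.048941), (18000, 0.048167), (18500, 0.048274),
    (19000, 0.047828),
]

best_step, best_loss = min(log, key=lambda row: row[1])
last_gain = log[-2][1] - log[-1][1]  # improvement over the final 500 steps

print(best_step, best_loss)
print(round(last_gain, 6))
```

A gain this small over 500 steps suggests the validation loss has largely flattened out.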


# Check the output model with an example

In [21]:
# Define candidate input texts (only `g` is used below)
inp="|زِ|يَاْ|دَ|تَنْ|فِلْ|مَ|بِيْ|عَاْ|تِ|وَلْ|ءَرْ|بَ|اِحْ|لِلْ|عَاْ|مِثْ|ثَاْ|مِ|نِ|عَ|لَىتْ|تَ|وَاْ|لِيْ."
t='|فِيْ|مِنْ|طَ|قَ|تِشْ|شَرْ|قِلْ|ءَوْ|سَطْ|وَ|تُرْ|كِ|يَاْ|وَ|شَ|مَ|اِلْ|ءِفْ|رِيْ|قِ|يَاْ.'
g='|وَ|اِثْ|نَيْ|نِ|وَ|سِتْ|تِيْ|نَ|فِلْ|مِ|ءَهْ.'
# Tokenize the input text and move it to the same device as the model
input_ids = tokenizer.encode(g, return_tensors="pt").to(model.device)

# Generate the output
output_ids = model.generate(
    input_ids,
    max_length=100,
    early_stopping=True,
    pad_token_id=tokenizer.pad_token_id,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
)

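# Decode the output, skipping the first generated token (the decoder start token)
output_text = tokenizer.decode(output_ids[0][1:], skip_special_tokens=True)
# Keep only the text before the first period
print(output_text.split(".")[0])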

وَاثْنَيْنِ وَسِتِينَ فِي الْمِئَةِ
