## `DESCRIPTION ABOUT THE PROBLEM`
# Overview

**Bengali Text to IPA (International Phonetic Alphabet) Transcription** has seen relatively little development compared with other languages, even though Bengali is one of the world's most widely spoken native languages. There is a growing need for automated systems that accurately convert Bengali text into IPA notation, given the large audience and the applications in linguistics, language learning, and phonetic research. With this in mind, we invite you to take part in **DataVerse**, a part of **ITVerse 2023** organized by the IIT Software Engineers' Community (IITSEC) in partnership with **Bengali.AI**, to advance research in Bengali text-to-IPA transcription.

# [Description](http://www.kaggle.com/competitions/dataverse_2023/overview/description)

## Goal of the Competition

The goal of this competition is to model IPA transcription of Bengali text. (Remember the Greek-looking characters in the dictionary that helped you work out the accurate pronunciation of words? That was International Phonetic Alphabet (IPA) transcription!) You will build a model trained on a linguist-validated dataset containing Bengali text from different domains. The test set contains numbers, loanwords, and domain-specific words to add to the challenge.

Your efforts could improve Bengali computational linguistics and NLP research using the first Bengali sentence-level IPA transcription dataset from Bengali.AI. In addition, your submission will be among the first open-source IPA transcription methods for Bengali.


# Evaluation

Submissions are evaluated by mean **Word Error Rate** (WER), computed as follows:

1. The WER is computed for each instance in the test set.
2. The WERs are averaged within each domain, weighted by the number of words in each sentence.
3. The (unweighted) mean of the domain averages is the final score.
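
The three steps above can be sketched in plain Python. This is a minimal illustration, not the official scorer: the function names, the `(domain, reference, prediction)` row layout, and whitespace tokenization are all assumptions made for the example. Weighting each sentence's WER by its word count within a domain is equivalent to dividing the domain's total word errors by its total reference words, which is what the sketch does.

```python
from collections import defaultdict


def word_errors(reference, hypothesis):
    """Word-level edit distance (substitutions + insertions + deletions)."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def competition_score(rows):
    """rows: iterable of (domain, reference_ipa, predicted_ipa) tuples.

    Assumes every reference contains at least one word.
    Returns (final_score, per_domain_wer).
    """
    errors = defaultdict(int)
    words = defaultdict(int)
    for domain, ref, hyp in rows:
        errors[domain] += word_errors(ref, hyp)
        words[domain] += len(ref.split())
    # Word-weighted mean of sentence WERs within each domain.
    domain_wer = {d: errors[d] / words[d] for d in words}
    # Unweighted mean across domains.
    return sum(domain_wer.values()) / len(domain_wer), domain_wer
```

For example, with one domain holding sentences of 3 and 2 words and a single word error between them (WER 0.2), and a second domain holding one 2-word sentence with one error (WER 0.5), the final score is (0.2 + 0.5) / 2 = 0.35.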

# [Rules for the Competition](https://www.kaggle.com/competitions/dataverse_2023/rules)

* Registration is mandatory to participate in all the phases of the contest.
* Only one account per participant in Kaggle will be allowed.
* Participants are allowed to use external data, but the data has to be disclosed publicly before the final round.
* The use of the internet is allowed for research and reference, but **API calls are not permitted.**
* All teams must submit working Kaggle notebooks for training and inference (and the models too), including appropriate instructions and documentation for reproducibility. The inference notebook must run on Kaggle Notebooks.
* Teams are required to make all scripts, data, and reports publicly available.
* No private code sharing outside teams.
* Teams must prepare a project report of at least 2 pages. The paper has to be in IEEE/ACM (2 column) format.


## `NOW SOLUTION`

# 1. Predicting the IPA of `Bangla` words using machine learning.

In this notebook, we walk through an example machine learning project whose goal is to predict the IPA transcription of `Bangla` words.



## Import Necessary Libraries

In [1]:
# Regular EDA(exploratory data analysis) and plotting libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
%matplotlib inline

from tqdm import tqdm
import string
import statistics
import re
import joblib
import random



# 2. Data

### Submission Previously Saved

In [2]:
# Import the training set
df = pd.read_csv('/kaggle/input/dataverse-ipa/dataverse_2023/trainIPAdb_u.csv')
df.head(2)

Unnamed: 0,text,ipa
0,এরপরও তারা বকেয়া পরিশোধ করেনি।,eɾpɔɾo t̪ɐɾɐ bɔkeʲɐ poɾɪʃod̪ʱ kɔɾenɪ।
1,আগে সুইস ব্যাংকে জমা টাকার কোনো প্রতিবেদন প্রক...,ɐge suɪ̯s bɛŋke ɟɔmɐ tɐkɐɾ kono pɾot̪ɪbed̪ɔn p...


In [3]:
# See a random sample from the training set 
rand_index = np.random.randint(0, len(df))
print(f'Text = {df.text[rand_index]}\nIPA = {df.ipa[rand_index]}')

Text = দেশের ক্রিমিনোলজি বিভাগগুলোতে মলিকুলার বায়োলজি, জেনেটিক্সের মতো সাবজেক্টগুলোর প্রায়োগিক দিক সম্পর্কে নজর দেয়া উচিত।
IPA = d̪eʃeɾ kɾɪmɪnoloɟɪ bɪbʱɐggulot̪e molɪkulɐɾ bɐʲolɔɟɪ, ɟenetɪkseɾ mɔt̪o sɐbɟektguloɾ pɾɐʲogɪk d̪ɪk ʃɔmpɔɾke noɟoɾ d̪eʲɐ ucɪt̪। 


In [4]:
df.shape

(21999, 2)

In [5]:
ano_dict = {
    'সে' : 'ʃe', 'তার': 't̪ɐɾ', 'তিনি': 't̪ɪnɪ', 'ইমন': 'ɪmon',
    'ফিরে': 'pʰɪɾe', 'সরিয়ে': 'ʃoɾɪʲe', 'বলেন': 'bɔlen', 'করা': 'kɔɾɐ', 'হাজির': 'hɐɟɪɾ', 'স্বীকার': 'ʃɪkɐɾ', 
    'চালাচ্ছে': 'cɐlɐccʰe', 
    'আসে': 'ɐʃe', 'বাংলাদেশ': 'bɐŋlɐd̪eʃ', 'আদালতের' : 'ɐd̪ɐlɔt̪eɾ', 'উপস্থিত' : 'upost̪ʰɪt̪', 'কমিশন': 'kɔmɪʃon',
    'রাখে': 'ɾɐkʰe', 'মালিক': 'mɐlɪk', 'সংবাদ': 'ʃɔŋbɐd̪'
}

In [6]:
new_df = pd.read_csv('/kaggle/input/dataverse-ipa/new_train.csv')
new_df.head(3)

Unnamed: 0,text,ipa
0,আমি বলেন সাত আদালতের।,ɐmɪ bɔlen ʃɐt̪ ɐd̪ɐlɔt̪eɾ।
1,তুমি করা চুরানব্বই রাখে।,t̪umɪ kɔɾɐ cuɾɐnɔbboɪ ɾɐkʰe।
2,আমরা স্বীকার আটান্ন উপস্থিত।,ɐmɾɐ ʃɪkɐɾ ɐtɐnno upost̪ʰɪt̪।


In [7]:
ex_df = pd.read_csv('/kaggle/input/dataverse-ipa/Exception - Sheet1.csv')
print(ex_df.shape)
ex_df.tail(5)

(17, 2)


Unnamed: 0,text,ipa
12,৭৭/৩ ৭৭/৩ ৭৭/৩।,ʃɐt̪ɐt̪t̪oɾ/t̪ɪn ʃɐt̪ɐt̪t̪oɾ/t̪ɪn ʃɐt̪ɐt̪t̪oɾ/...
13,সিক্স (২+চার)।,sɪks (duɪ+cɐɾ)।
14,১+ চল্ চল্ চল্।,ɛk+ cɔl cɔl cɔl।
15,২- cɔl cɔl cɔl।,duɪ- cɔl cɔl cɔl।
16,"৮,৫৭৪ ৮,৫৭৪ চল্ চল্ চল্।",ɐt hɐɟɐɾ pɐncʃo cuɐt̪t̪oɾ ɐt hɐɟɐɾ pɐncʃo cuɐt...


In [8]:
test_df = pd.read_csv('/kaggle/input/dataverse-ipa/dataverse_2023/testData.csv')
test_df.head(3)

Unnamed: 0,row_id_column_name,text
0,0,বিশেষ অতিথি এফএম ইকবাল বিন আনোয়ার (ডন) অ্যাডিশ...
1,1,এ নিয়ে বিবাদে ২০১৫ সালের ২ জুন রাত সাড়ে ১১টায় ...
2,2,আজ থেকে ১৪ বছর আগে তিনি চলে গেছেন না ফেরার দেশে।


In [9]:
train = pd.concat([df, new_df], axis = 0)
train.shape

(22999, 2)

In [10]:
train = train.reset_index(drop=True)

In [11]:
filt_cond = ["১-৯", 'A-Za-z0-9']
# Filtering text samples that contain English alphanumeric values
filtered_train = train[lambda x: x["text"].str.contains("[A-Za-z0-9]")]

print(f'Length of the Df = {filtered_train.shape}')

Length of the Df = (19, 2)


In [12]:
with pd.option_context('display.max_colwidth', 0):
    display(filtered_train.head(n=4))

Unnamed: 0,text,ipa
69,আজ তোমাদের জন্য ইংরেজি প্রথমপত্রের Part-I-Gi seen passage এবং Part-II-Gi Writing Test অংশের সাজেশন দেয়া হল।,ɐɟ t̪omɐd̪eɾ ɟonno ɪŋɾeɟɪ pɾot̪ʰompɔt̪ɾeɾ -- seen pɐssɐge eboŋ -- Wɾɪtɪng Test ɔŋʃeɾ ʃɐɟeʃon d̪eʲɐ hɔl।
1479,হেপাটাইটিস B ভাইরাসের প্রতিরোধে রয়েছে টিকা।,hepɐtɐɪ̯tɪs B bʱɐɪ̯ɾɐʃeɾ pɾot̪ɪɾod̪ʱe ɾoʲecʰe tɪkɐ।
2801,কর্মশালায় অংশগ্রহণে আগ্রহীরা বাংলাদেশ মিডিয়া ফোরামের ওয়েবসাইটের মাধ্যমে www.bdmediaforum.com/register রেজিস্ট্রেশন করতে পারবেন।,kɔɾmoʃɐlɐe̯ ɔŋʃogɾohone ɐgɾohɪɾɐ bɐŋlɐd̪eʃ mɪdɪʲɐ pʰoɾɐmeɾ oʷebʃɐɪ̯teɾ mɐd̪d̪ʱome ../ ɾeɟɪstɾeʃon koɾt̪e pɐɾben।
3781,এ সংক্রান্ত বিস্তারিত তথ্য বিশ্ববিদ্যালয়ের ওয়েবসাইট www.admissions. nu.edu.bd অথবা nu.edu.bd/ admissions থেকে পাওয়া যাবে।,e ʃɔŋkɾɐnt̪o bɪst̪ɐɾɪt̪o t̪ot̪t̪ʰo bɪʃʃobɪd̪d̪ɐlɔʲeɾ oʷebʃɐɪ̯t .. .. ɔt̪ʰobɐ ../ ɐdmɪssɪons t̪ʰeke pɐo̯ʷɐ ɟɐbe।


In [13]:
filt_cond = ["১-৯", 'A-Za-z0-9']
# Filtering test samples that contain Bengali digits (note: the range ১-৯ does not include ০)
fltr_test = test_df[lambda x: x["text"].str.contains("[১-৯]")]

print(f'Length of the Df = {fltr_test.shape}')

Length of the Df = (9601, 2)
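
A small, self-contained sketch of how Bengali digit ranges behave in a regular expression. Bengali digits occupy the contiguous code points U+09E6 (০) to U+09EF (৯), so a class starting at ১ skips zero. The helper name `to_ascii_digits` is hypothetical and not part of the notebook; it only illustrates one possible way to normalize digits before further processing.

```python
import re

# Bengali digits ০..৯ are the code points U+09E6..U+09EF.
BENGALI_DIGITS = "".join(chr(0x09E6 + i) for i in range(10))

ALL_DIGITS = re.compile("[\u09E6-\u09EF]")      # equivalent to [০-৯]
NONZERO_DIGITS = re.compile("[\u09E7-\u09EF]")  # equivalent to [১-৯]

# Translation table from Bengali digits to ASCII digits.
TO_ASCII = str.maketrans(BENGALI_DIGITS, "0123456789")


def to_ascii_digits(text):
    """Hypothetical helper: replace Bengali digits with ASCII digits."""
    return text.translate(TO_ASCII)


only_zeros = "\u09E6\u09E6"  # the string "০০"
print(NONZERO_DIGITS.search(only_zeros) is None)   # True: [১-৯] misses zeros
print(ALL_DIGITS.search(only_zeros) is not None)   # True: [০-৯] catches them
print(to_ascii_digits("\u09E8\u09E6\u09E7\u09EB"))  # 2015
```

A sentence whose only digits are zeros would therefore not be counted by a `[১-৯]` filter, so the count reported above may slightly understate the number of test rows that contain digits.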


In [14]:
with pd.option_context('display.max_colwidth', 0):
    display(fltr_test.head(n=5))

Unnamed: 0,row_id_column_name,text
1,1,এ নিয়ে বিবাদে ২০১৫ সালের ২ জুন রাত সাড়ে ১১টায় চাচা সুশীল দাসকে কুপিয়ে জখম করে সে।
2,2,আজ থেকে ১৪ বছর আগে তিনি চলে গেছেন না ফেরার দেশে।
3,3,নিহত ব্যক্তি কুতপালং টালের ই-২ ব্লকের আবুল বাছেদ (৪০)।
4,4,"সংক্ষিপ্ত স্কোরশ্রীলংকা প্রথম ইনিংস ৪৮২ (করুনারত্নে ১৯৬, চান্দিমাল ৬২, ডিকভেলা ৫২, পেরেরা ৫৮।"
5,5,"এগুলোর মধ্যে সাজ্জাদ হোসেনের ‘নন স্টপ’, আশুতোষ সুজনের ‘এই শহরে’ এবং রুলিন রহমানের ‘রোড নাম্বার-৭, বাসা নাম্বার-১৩’।"


In [15]:
# Remove English alphanumeric characters and Bengali digits
alpha_pat = "[a-zA-Z0-9০-৯]"

train["text"] = train["text"].str.replace(alpha_pat, "", regex=True)
# test_df["text"] = test_df["text"].str.replace(alpha_pat, "", regex=True)
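
When writing a character class like the one in this cell, note that the range `A-z` (lowercase `z`) is not the same as `A-Z`: it spans every ASCII code point from `A` (U+0041) to `z` (U+007A), which also includes the six punctuation characters between `Z` and `a`. The short sketch below, using only the standard `re` module, shows the difference; the sample string is made up for illustration.

```python
import re

wide = re.compile("[a-zA-z0-9]")    # A-z covers U+0041..U+007A
strict = re.compile("[a-zA-Z0-9]")  # letters and digits only

# The six ASCII characters that sit between 'Z' and 'a'.
between = "[\\]^_`"
extras = [ch for ch in between if wide.fullmatch(ch)]
print(extras)

sample = "a_b [x]"
print(repr(wide.sub("", sample)))    # ' '
print(repr(strict.sub("", sample)))  # '_ []'
```

With the wider range, brackets, backslashes, carets, underscores, and backticks are silently stripped along with the letters.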

In [16]:
filt_cond = ["১-৯", 'A-Za-z0-9']
bengali_pat = "[\u0980-\u09FF]"
# Filtering samples whose IPA column still contains Bengali characters
filt = train[lambda x: x["ipa"].str.contains(bengali_pat)]

print(f'Length of the Df = {filt.shape}')

Length of the Df = (433, 2)
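
To see which words were left untranscribed in a flagged row, the same Unicode range can be applied token by token. The sketch below is illustrative only: `untranscribed_tokens` is a hypothetical helper and the sample string is made up. It also assumes that the sentence-final danda in the data is U+0964, which lies in the Devanagari block and therefore outside U+0980 to U+09FF; that would explain why rows ending in `।` are not flagged by this filter.

```python
import re

# Same range as bengali_pat: the Bengali Unicode block.
BENGALI = re.compile("[\u0980-\u09FF]")


def untranscribed_tokens(ipa):
    """Hypothetical helper: return tokens that still contain Bengali script."""
    return [tok for tok in ipa.split() if BENGALI.search(tok)]


# Made-up example: one Bengali word left inside an IPA string.
sample_ipa = "\u0250m\u026a \u0985\u099a b\u0254len\u0964"
print(untranscribed_tokens(sample_ipa))

# The danda (U+0964) alone does not match the Bengali block.
print(BENGALI.search("\u0964") is None)  # True
```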


In [17]:
filt.head(3)

Unnamed: 0,text,ipa
87,বুধবার সকালে সৌভিক বকশীর সমর্থক দশম শ্রেণীর ছা...,bud̪ʱbɐɾ ʃɔkɐle ʃo͡u̯bʱɪk bokʃɪɾ ʃɔmɔɾt̪ʰok d̪...
324,নতুনভাবে বাড়ি নির্মাণ বাবদ স্থানীয় অচিন্ত সাহা...,not̪unbʱɐbe bɐɽɪ nɪɾmɐn bɐbod̪ st̪ʰɐnɪʲo অচিন্...
336,আর বিরোধী দল রিপাবলিকান প্রার্থীদের মধ্যে এগিয়...,ɐɾ bɪɾod̪ʱɪ d̪ɔl ɾɪpɐblɪkɐn pɾɐɾt̪ʰɪd̪eɾ mod̪d...


In [18]:
indx_to_drop = filt.index
indx_to_drop[:4], len(indx_to_drop)

(Index([87, 324, 336, 355], dtype='int64'), 433)

In [19]:
train.shape

(22999, 2)

In [20]:
ftrain = train.drop(indx_to_drop)

In [21]:
print(ftrain.shape)
ftrain.tail(2)

(22566, 2)


Unnamed: 0,text,ipa
22997,ইমন হাজির আশি হাজার পাঁচশ চুয়াল্লিশ রাখে।,ɪmon hɐɟɪɾ ɐʃɪ hɐɟɐɾ pɐncʃo cuɐllɪʃ ɾɐkʰe।
22998,তিনি স্বীকার সাত লক্ষ দুইশ দুই আদালতের।,t̪ɪnɪ ʃɪkɐɾ ʃɐt̪ lɔkkʰo duɪʃo duɪ ɐd̪ɐlɔt̪eɾ।


In [22]:
filt_cond = ["১-৯", 'A-Za-z0-9']
bengali_pat = "[\u0980-\u09FF]"
# Checking that no remaining sample has Bengali characters in its IPA column
filt = ftrain[lambda x: x["ipa"].str.contains(bengali_pat)]

print(f'Length of the Df = {filt.shape}')

Length of the Df = (0, 2)


In [23]:
ftrain.tail(7)

Unnamed: 0,text,ipa
22992,ইমন সরিয়ে এক লক্ষ সত্তর হাজার ছিয়াশি বাংলাদেশ।,ɪmon ʃoɾɪʲe ɛk lɔkkʰoʃot̪t̪oɾ hɐɟɐɾ cʰɪɐʃɪ bɐŋ...
22993,তিনি করা সাত লক্ষ তিয়াত্তর হাজার ছয়শ ঊনসত্তর...,t̪ɪnɪ kɔɾɐ ʃɐt̪ lɔkkʰo t̪ɪɐt̪t̪oɾ hɐɟɐɾ cʰoe̯ʃ...
22994,তার করা দুই লক্ষ চুরানব্বই হাজার একশ আটষট্টি আ...,t̪ɐɾ kɔɾɐ duɪ lɔkkʰo cuɾɐnɔbboɪ hɐɟɐɾ ɛkʃo ɐtʃ...
22995,সে স্বীকার তিন লক্ষ ঊনষাট হাজার বিরানব্বই সংবাদ।,ʃe ʃɪkɐɾ tɪn lɔkkʰo unoʃɐt hɐɟɐɾ bɪɾɐnɔbboɪ ʃɔ...
22996,তুমি আসে দুই লক্ষ তিন হাজার দুইশ সাতাশি মালিক।,t̪umɪ ɐʃe duɪ lɔkkʰo tɪn hɐɟɐɾ duɪʃo ʃɐt̪ɐʃɪ m...
22997,ইমন হাজির আশি হাজার পাঁচশ চুয়াল্লিশ রাখে।,ɪmon hɐɟɪɾ ɐʃɪ hɐɟɐɾ pɐncʃo cuɐllɪʃ ɾɐkʰe।
22998,তিনি স্বীকার সাত লক্ষ দুইশ দুই আদালতের।,t̪ɪnɪ ʃɪkɐɾ ʃɐt̪ lɔkkʰo duɪʃo duɪ ɐd̪ɐlɔt̪eɾ।


In [24]:
train_ = pd.concat([ftrain, ex_df], axis = 0)
train_.shape

(22583, 2)

In [25]:
train_.tail(3)

Unnamed: 0,text,ipa
14,১+ চল্ চল্ চল্।,ɛk+ cɔl cɔl cɔl।
15,২- cɔl cɔl cɔl।,duɪ- cɔl cɔl cɔl।
16,"৮,৫৭৪ ৮,৫৭৪ চল্ চল্ চল্।",ɐt hɐɟɐɾ pɐncʃo cuɐt̪t̪oɾ ɐt hɐɟɐɾ pɐncʃo cuɐt...


In [26]:
train_ = train_.reset_index(drop=True)
train_.tail(3)

Unnamed: 0,text,ipa
22580,১+ চল্ চল্ চল্।,ɛk+ cɔl cɔl cɔl।
22581,২- cɔl cɔl cɔl।,duɪ- cɔl cɔl cɔl।
22582,"৮,৫৭৪ ৮,৫৭৪ চল্ চল্ চল্।",ɐt hɐɟɐɾ pɐncʃo cuɐt̪t̪oɾ ɐt hɐɟɐɾ pɐncʃo cuɐt...


# Modeling

In [27]:
from sklearn.model_selection import train_test_split

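# Note: the split is taken from `ftrain`, so the 17 rows from `ex_df` that were added to `train_` are not included.
train_df, val_df = train_test_split(ftrain, test_size=0.1, shuffle=True, random_state=42)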
train_df = train_df.reset_index(drop=True)
val_df = val_df.reset_index(drop=True)
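
The row counts that follow can be checked by hand. To my understanding, scikit-learn rounds the held-out size up when `test_size` is a fraction; the sketch below applies that rule with plain arithmetic (it does not call scikit-learn, and `split_sizes` is a hypothetical helper). The result for 22566 rows matches the 20309 training rows and 2257 validation rows reported in the following cells.

```python
import math


def split_sizes(n_rows, test_size=0.1):
    """Return (n_train, n_val), rounding the held-out part up."""
    n_val = math.ceil(n_rows * test_size)
    return n_rows - n_val, n_val


print(split_sizes(22566))  # rows in ftrain -> (20309, 2257)
print(split_sizes(22583))  # rows in train_ -> (20324, 2259)
```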

In [28]:
# !pip install datasets
# !pip install transformers

In [29]:
from datasets import Dataset

train_tf = Dataset.from_pandas(train_df)
valid_tf = Dataset.from_pandas(val_df)

In [30]:
train_tf

Dataset({
    features: ['text', 'ipa'],
    num_rows: 20309
})

In [31]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq

model_id = "google/mt5-small"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
data_collator = DataCollatorForSeq2Seq(tokenizer)


You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. If you see this, DO NOT PANIC! This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565



In [32]:
def prepare_dataset(sample):
    output = tokenizer(sample["text"])
    output["labels"] = tokenizer(sample["ipa"])['input_ids']
    output["length"] = len(output["labels"])
    return output


train_tf = train_tf.map(prepare_dataset, remove_columns=train_tf.column_names)
valid_tf = valid_tf.map(prepare_dataset, remove_columns=valid_tf.column_names)

  0%|          | 0/20309 [00:00<?, ?ex/s]

  0%|          | 0/2257 [00:00<?, ?ex/s]

In [33]:
# train_tf['input_ids']

In [34]:
pip install jiwer

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Collecting jiwer
  Downloading jiwer-3.0.3-py3-none-any.whl (21 kB)
Installing collected packages: jiwer
Successfully installed jiwer-3.0.3
Note: you may need to restart the kernel to use updated packages.


In [35]:
import numpy as np
from datasets import load_metric

wer_metric = load_metric("wer")


def compute_metrics(eval_preds):
    preds, labels = eval_preds
    
    if isinstance(preds, tuple):
        preds = preds[0]
    
    preds = np.where(preds != -100, preds, tokenizer.pad_token_id)
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    result = wer_metric.compute(predictions=decoded_preds, references=decoded_labels)
    return {"wer": result}


In [36]:
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

model_id = "mt5-bangla-text-to-ipa"

training_args = Seq2SeqTrainingArguments(
    output_dir=model_id,
    group_by_length=True,
    length_column_name="length",
    per_device_train_batch_size=4,
    per_device_eval_batch_size=16,
    evaluation_strategy="steps",
    metric_for_best_model="wer",
    greater_is_better=False,
    load_best_model_at_end=True,
    num_train_epochs=10,
    save_steps=4000,
    eval_steps=4000,
    logging_steps=4000,
    learning_rate=1e-4,
    weight_decay=1e-2,
    warmup_steps=2000,
    save_total_limit=2,
    predict_with_generate=True,
    generation_max_length=128,
    push_to_hub=False,
    report_to="none",
)

In [37]:
trainer = Seq2SeqTrainer(
    model=model,
    tokenizer=tokenizer,
    args=training_args,
    train_dataset=train_tf,
    eval_dataset=valid_tf,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)


In [38]:
trainer.train()

You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Step,Training Loss,Validation Loss,Wer
4000,4.516,0.796586,0.628152
8000,0.8326,0.244618,0.235033
12000,0.4242,0.1479,0.144032
16000,0.2954,0.11327,0.114328
20000,0.2368,0.098492,0.097546
24000,0.2076,0.092841,0.092721




TrainOutput(global_step=25390, training_loss=1.0369958818217064, metrics={'train_runtime': 14399.1526, 'train_samples_per_second': 14.104, 'train_steps_per_second': 1.763, 'total_flos': 7037800900177920.0, 'train_loss': 1.0369958818217064, 'epoch': 10.0})
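
The reported `global_step` can be reconciled with the training arguments by simple arithmetic. The notebook does not state the hardware, so the device count below is an assumption; gradient accumulation is assumed to be at its default of 1. With a per-device batch size of 4, the reported 25390 steps are consistent with an effective batch size of 8 (for example, two devices), not 4.

```python
import math

num_rows = 20309        # training rows reported by the Dataset above
per_device_batch = 4    # per_device_train_batch_size
epochs = 10             # num_train_epochs
reported_steps = 25390  # global_step from TrainOutput


def total_steps(n_devices):
    """Optimizer steps for the whole run, assuming no gradient accumulation."""
    steps_per_epoch = math.ceil(num_rows / (per_device_batch * n_devices))
    return steps_per_epoch * epochs


for n_devices in (1, 2):
    steps = total_steps(n_devices)
    print(n_devices, steps, steps == reported_steps)
# 1 50780 False
# 2 25390 True
```

This also means evaluation every 4000 steps corresponds to roughly every 1.6 epochs under the same assumption.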

In [39]:
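# Save the model to a file (pickling with joblib works here; `model.save_pretrained(...)` is the more common Hugging Face route)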
joblib.dump(model, 'model.pkl')
print("Model saved successfully.")

Model saved successfully.


In [40]:
# Later, to load the model back
loaded_model = joblib.load('/kaggle/working/model.pkl')
loaded_model

MT5ForConditionalGeneration(
  (shared): Embedding(250112, 512)
  (encoder): MT5Stack(
    (embed_tokens): Embedding(250112, 512)
    (block): ModuleList(
      (0): MT5Block(
        (layer): ModuleList(
          (0): MT5LayerSelfAttention(
            (SelfAttention): MT5Attention(
              (q): Linear(in_features=512, out_features=384, bias=False)
              (k): Linear(in_features=512, out_features=384, bias=False)
              (v): Linear(in_features=512, out_features=384, bias=False)
              (o): Linear(in_features=384, out_features=512, bias=False)
              (relative_attention_bias): Embedding(32, 6)
            )
            (layer_norm): MT5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (1): MT5LayerFF(
            (DenseReluDense): MT5DenseGatedActDense(
              (wi_0): Linear(in_features=512, out_features=1024, bias=False)
              (wi_1): Linear(in_features=512, out_features=1024, bias=False)
          