## `PROBLEM DESCRIPTION`
# Overview

**Bengali Text to IPA (International Phonetic Alphabet) Transcription** is an area that has seen relatively limited development compared to other languages, despite Bengali being one of the world's most widely spoken native languages. There is a growing need for automated systems that can accurately convert Bengali text into IPA notation, given the large audience and the range of applications in linguistics, language learning, and phonetic research. With this in mind, we invite participants to join **DataVerse**, a part of **ITVerse 2023** organized by the IIT Software Engineers' Community (IITSEC) in partnership with **Bengali.AI**, to advance research on Bengali text-to-IPA transcription.

# [Description](http://www.kaggle.com/competitions/dataverse_2023/overview/description)

## Goal of the Competition

The goal of this competition is to model IPA transcription of Bengali text. (Remember the Greek characters in the dictionary that helped you find the accurate pronunciation of words? That was International Phonetic Alphabet (IPA) transcription!) You will build a model trained on a linguist-validated dataset containing Bengali text from different domains. The test set contains numbers, loanwords, and domain-specific words to add to the challenge.

Your efforts could improve Bengali computational linguistics and NLP research using the first Bengali sentence-level IPA transcription dataset from Bengali.AI. In addition, your submission will be among the first open-source IPA transcription methods for Bengali.


# Evaluation

Submissions are evaluated by a mean **Word Error Rate**, proceeding as follows:

1. The WER is computed for each instance in the test set.
2. The WERs are averaged within domains, weighted by the number of words in the sentence.
3. The (unweighted) mean of the domain averages is the final score.
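
The three steps above can be sketched in plain Python. This is a minimal illustration, not the official scorer: the `wer` and `score` helpers, the `(domain, reference, hypothesis)` row layout, and the domain labels in the usage example are all hypothetical, and the sketch assumes that "weighted by the number of words" refers to the word count of the reference sentence.

```python
from collections import defaultdict


def wer(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] is the edit distance between the reference prefix seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return prev[-1] / max(len(ref), 1)


def score(rows):
    """Mean of per-domain, word-weighted WER averages.

    rows: iterable of (domain, reference, hypothesis) triples (hypothetical layout).
    Assumes every domain contains at least one reference word.
    """
    weighted_sum = defaultdict(float)
    total_words = defaultdict(int)
    for domain, reference, hypothesis in rows:
        n_words = len(reference.split())
        weighted_sum[domain] += wer(reference, hypothesis) * n_words  # steps 1 and 2
        total_words[domain] += n_words
    domain_means = [weighted_sum[d] / total_words[d] for d in weighted_sum]
    return sum(domain_means) / len(domain_means)                      # step 3


rows = [
    ("news", "a b c d", "a b c x"),   # WER 1/4, weight 4
    ("news", "e f", "e f"),           # WER 0, weight 2
    ("sports", "g h", "g z"),         # WER 1/2, weight 2
]
print(score(rows))
```

In this toy example the `news` domain averages to 1/6 and the `sports` domain to 1/2, so the final score is their unweighted mean, 1/3.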

# [Rules for the Competition](https://www.kaggle.com/competitions/dataverse_2023/rules)

* Registration is mandatory to participate in all the phases of the contest.
* Only one account per participant in Kaggle will be allowed.
* Participants are allowed to use external data, but the data has to be disclosed publicly before the final round.
* The use of the internet is allowed for research and reference, but **API calls are not permitted.**
* All teams must submit working Kaggle notebooks for training and inference (and the models too), including appropriate instructions and documentation for reproducibility. The inference notebook must run on Kaggle Notebooks.
* Teams are required to make all scripts, data, and reports publicly available.
* No private code sharing outside teams.
* Teams must prepare a project report of at least 2 pages. The paper has to be in IEEE/ACM (2 column) format.


## `SOLUTION`

# 1. Predicting the IPA of `Bangla` words using machine learning.

In this notebook, we're going to go through an example machine learning project with the goal of predicting the IPA of `Bangla` words.



## Import Necessary Libraries

In [1]:
# Regular EDA (exploratory data analysis) and plotting libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# Train/validation splitting
from sklearn.model_selection import train_test_split

# Utilities
from tqdm import tqdm
import string
import statistics
import re



# 2. Data

In [2]:
# Import the training set
df = pd.read_csv('/kaggle/input/dataverse-ipa/dataverse_2023/trainIPAdb_u.csv')
df.head(2)

Unnamed: 0,text,ipa
0,এরপরও তারা বকেয়া পরিশোধ করেনি।,eɾpɔɾo t̪ɐɾɐ bɔkeʲɐ poɾɪʃod̪ʱ kɔɾenɪ।
1,আগে সুইস ব্যাংকে জমা টাকার কোনো প্রতিবেদন প্রক...,ɐge suɪ̯s bɛŋke ɟɔmɐ tɐkɐɾ kono pɾot̪ɪbed̪ɔn p...


In [3]:
# See a random sample from the training set 
rand_index = np.random.randint(0, len(df))
print(f'Text = {df.text[rand_index]}\nIPA = {df.ipa[rand_index]}')

Text = অনুষ্ঠানটির মিডিয়া পার্টনার ছিল সিবিএনএ, দেশদিগন্ত, নাগরিক টিভি ও আরটিভি।
IPA = onuʃtʰɐntɪɾ mɪdɪʲɐ pɐɾtnɐɾ cʰɪlo ʃɪbɪe̯ne, d̪eʃd̪ɪgɔnt̪o, nɐgoɾɪk tɪbʱɪ o ɐɾɔtɪbʱɪ। 


In [4]:
df.shape

(21999, 2)

In [5]:
ano_dict = {
    'সে' : 'ʃe', 'তার': 't̪ɐɾ', 'তিনি': 't̪ɪnɪ', 'ইমন': 'ɪmon',
    'ফিরে': 'pʰɪɾe', 'সরিয়ে': 'ʃoɾɪʲe', 'বলেন': 'bɔlen', 'করা': 'kɔɾɐ', 'হাজির': 'hɐɟɪɾ', 'স্বীকার': 'ʃɪkɐɾ', 
    'চালাচ্ছে': 'cɐlɐccʰe', 
    'আসে': 'ɐʃe', 'বাংলাদেশ': 'bɐŋlɐd̪eʃ', 'আদালতের' : 'ɐd̪ɐlɔt̪eɾ', 'উপস্থিত' : 'upost̪ʰɪt̪', 'কমিশন': 'kɔmɪʃon',
    'রাখে': 'ɾɐkʰe', 'মালিক': 'mɐlɪk', 'সংবাদ': 'ʃɔŋbɐd̪'
}

In [6]:
new_df = pd.read_csv('/kaggle/input/dataverse-ipa/new_train.csv')
new_df.head(3)

Unnamed: 0,text,ipa
0,আমি বলেন সাত আদালতের।,ɐmɪ bɔlen ʃɐt̪ ɐd̪ɐlɔt̪eɾ।
1,তুমি করা চুরানব্বই রাখে।,t̪umɪ kɔɾɐ cuɾɐnɔbboɪ ɾɐkʰe।
2,আমরা স্বীকার আটান্ন উপস্থিত।,ɐmɾɐ ʃɪkɐɾ ɐtɐnno upost̪ʰɪt̪।


In [7]:
test_df = pd.read_csv('/kaggle/input/dataverse-ipa/dataverse_2023/testData.csv')
test_df.head(3)

Unnamed: 0,row_id_column_name,text
0,0,বিশেষ অতিথি এফএম ইকবাল বিন আনোয়ার (ডন) অ্যাডিশ...
1,1,এ নিয়ে বিবাদে ২০১৫ সালের ২ জুন রাত সাড়ে ১১টায় ...
2,2,আজ থেকে ১৪ বছর আগে তিনি চলে গেছেন না ফেরার দেশে।


In [8]:
train = pd.concat([df, new_df], axis = 0)
train.shape

(22999, 2)

In [9]:
filt_cond = ["১-৯", 'A-Za-z0-9']
# Filtering text samples that contain English alphanumeric values
filtered_train = train[lambda x: x["text"].str.contains("[A-Za-z0-9]")]

print(f'Length of the Df = {filtered_train.shape}')

Length of the Df = (19, 2)


In [10]:
with pd.option_context('display.max_colwidth', 0):
    display(filtered_train.head(n=10))

Unnamed: 0,text,ipa
69,আজ তোমাদের জন্য ইংরেজি প্রথমপত্রের Part-I-Gi seen passage এবং Part-II-Gi Writing Test অংশের সাজেশন দেয়া হল।,ɐɟ t̪omɐd̪eɾ ɟonno ɪŋɾeɟɪ pɾot̪ʰompɔt̪ɾeɾ -- seen pɐssɐge eboŋ -- Wɾɪtɪng Test ɔŋʃeɾ ʃɐɟeʃon d̪eʲɐ hɔl।
1479,হেপাটাইটিস B ভাইরাসের প্রতিরোধে রয়েছে টিকা।,hepɐtɐɪ̯tɪs B bʱɐɪ̯ɾɐʃeɾ pɾot̪ɪɾod̪ʱe ɾoʲecʰe tɪkɐ।
2801,কর্মশালায় অংশগ্রহণে আগ্রহীরা বাংলাদেশ মিডিয়া ফোরামের ওয়েবসাইটের মাধ্যমে www.bdmediaforum.com/register রেজিস্ট্রেশন করতে পারবেন।,kɔɾmoʃɐlɐe̯ ɔŋʃogɾohone ɐgɾohɪɾɐ bɐŋlɐd̪eʃ mɪdɪʲɐ pʰoɾɐmeɾ oʷebʃɐɪ̯teɾ mɐd̪d̪ʱome ../ ɾeɟɪstɾeʃon koɾt̪e pɐɾben।
3781,এ সংক্রান্ত বিস্তারিত তথ্য বিশ্ববিদ্যালয়ের ওয়েবসাইট www.admissions. nu.edu.bd অথবা nu.edu.bd/ admissions থেকে পাওয়া যাবে।,e ʃɔŋkɾɐnt̪o bɪst̪ɐɾɪt̪o t̪ot̪t̪ʰo bɪʃʃobɪd̪d̪ɐlɔʲeɾ oʷebʃɐɪ̯t .. .. ɔt̪ʰobɐ ../ ɐdmɪssɪons t̪ʰeke pɐo̯ʷɐ ɟɐbe।
5604,ভর্তি পরীক্ষার ফলাফল স্ব-স্ব প্রতিষ্ঠান এবং বিশ্ববিদ্যালয়ের ওয়েবসাইট www.iau.edu.bd থেকে জানা যাবে।,bʱɔɾt̪ɪ poɾɪkkʰɐɾ pʰɔlɐpʰɔl ʃɔ-ʃʃo pɾot̪ɪʃtʱɐn eboŋ bɪʃʃobɪd̪d̪ɐlɔʲeɾ oʷebʃɐɪ̯t ... t̪ʰeke ɟɐnɐ ɟɐbe।
7344,আবেদনপত্র পাঠানো যাবে এসএমই ফাউন্ডেশনের ঢাকা অফিসের ঠিকানায় অথবা ই-মেইল করা যাবে [email protected]এ ঠিকানায়।,ɐbed̪ɔnpɔt̪ɾo pɐtʱɐno ɟɐbe eʃemɪ pʰɐu̯ndeʃoneɾ dʱɐkɐ ɔpʰɪʃeɾ tʰɪkɐnɐe̯ ɔt̪ʰobɐ ɪ-meɪ̯l kɔɾɐ ɟɐbe []e tʰɪkɐnɐe̯।
9319,সম্মেলনে ভিশনারি বক্তব্য দেন FAO-এর বাংলাদেশ প্রতিনিধি মাইক রবসন।,ʃɔmmelone bʱɪʃonɐɾɪ bokt̪obbo d̪ɛn -eɾ bɐŋlɐd̪eʃ pɾot̪ɪnɪd̪ʱɪ mɐɪ̯k ɾɔbʃon।
9905,টিকিট কাটতে http://ashajawa.com ওয়েব ঠিকানায় গিয়ে ই-মেইল ও ফোন নম্বর দিয়ে টিকিট অর্ডার করা যাবে।,tɪkɪt kɐtt̪e ://. oʲeb tʰɪkɐnɐe̯ gɪʲe ɪ-meɪ̯l o pʰon nɔmbɔɾ d̪ɪʲe tɪkɪt ɔɾdɐɾ kɔɾɐ ɟɐbe।
11010,"কমলাকে ইংরেজিতে 'Mandarin orange', 'Mandarin' এবং 'Mandarine' বলা হয়।","kɔmlɐke ɪŋɾeɟɪt̪e ' ', '' eboŋ '' bɔlɐ hɔe̯।"
12638,এরপর ডাউনলোড লিংক জেনারেট হলে click here to download বাটন দেখাবে।,eɾpɔɾ dɐu̯nlod lɪŋk ɟenɐɾet hole bɐton d̪ekʰɐbe।


In [11]:
# Filtering test samples that contain Bengali digits (the class ১-৯ does not match ০)
fltr_test = test_df[lambda x: x["text"].str.contains("[১-৯]")]

print(f'Length of the Df = {fltr_test.shape}')

Length of the Df = (9601, 2)


In [12]:
with pd.option_context('display.max_colwidth', 0):
    display(fltr_test.head(n=5))

Unnamed: 0,row_id_column_name,text
1,1,এ নিয়ে বিবাদে ২০১৫ সালের ২ জুন রাত সাড়ে ১১টায় চাচা সুশীল দাসকে কুপিয়ে জখম করে সে।
2,2,আজ থেকে ১৪ বছর আগে তিনি চলে গেছেন না ফেরার দেশে।
3,3,নিহত ব্যক্তি কুতপালং টালের ই-২ ব্লকের আবুল বাছেদ (৪০)।
4,4,"সংক্ষিপ্ত স্কোরশ্রীলংকা প্রথম ইনিংস ৪৮২ (করুনারত্নে ১৯৬, চান্দিমাল ৬২, ডিকভেলা ৫২, পেরেরা ৫৮।"
5,5,"এগুলোর মধ্যে সাজ্জাদ হোসেনের ‘নন স্টপ’, আশুতোষ সুজনের ‘এই শহরে’ এবং রুলিন রহমানের ‘রোড নাম্বার-৭, বাসা নাম্বার-১৩’।"


In [13]:
# Remove Latin letters, ASCII digits, Bengali digits, and parentheses
alpha_pat = "[a-zA-Z0-9০-৯()]"

train["text"] = train["text"].str.replace(alpha_pat, "", regex=True)
test_df["text"] = test_df["text"].str.replace(alpha_pat, "", regex=True)
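
To make the effect of this cleaning step concrete, the sketch below applies two character classes to a made-up sentence using only the standard `re` module. It also illustrates a general regex pitfall: the range `A-z` spans the ASCII characters that sit between `Z` and `a` (`[`, `\`, `]`, `^`, `_` and the backtick) in addition to letters, whereas `A-Z` covers uppercase letters only. The sample string and the names `wide` and `strict` are hypothetical.

```python
import re

wide = re.compile(r"[a-zA-z0-9০-৯()]")    # A-z also matches [ \ ] ^ _ and the backtick
strict = re.compile(r"[a-zA-Z0-9০-৯()]")  # Latin letters, ASCII digits, Bengali digits, parentheses

sample = "রোড নাম্বার-৭ [ক] (Road_7)"  # hypothetical example text

print(wide.sub("", sample))    # brackets and underscore are stripped as well
print(strict.sub("", sample))  # brackets and underscore survive
```

Either way, Bengali digits such as `৭` are removed from the text, so the model never sees numerals in its input.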

In [14]:
filt_cond = ["১-৯", 'A-Za-z0-9']
# Filtering text samples that contain English alphanumeric values
filtered_train = train[lambda x: x["text"].str.contains("[A-Za-z0-9()]")]

print(f'Length of the Df = {filtered_train.shape}')

Length of the Df = (0, 2)


In [15]:
from sklearn.model_selection import train_test_split

train_df, val_df = train_test_split(train, test_size=0.1, shuffle=True, random_state=42)
train_df = train_df.reset_index(drop=True)
val_df = val_df.reset_index(drop=True)

In [16]:
!pip install datasets
!pip install transformers



In [17]:
from datasets import Dataset

train_tf = Dataset.from_pandas(train_df)
valid_tf = Dataset.from_pandas(val_df)

In [18]:
train_tf

Dataset({
    features: ['text', 'ipa'],
    num_rows: 20699
})

In [19]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq

model_id = "google/mt5-small"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
data_collator = DataCollatorForSeq2Seq(tokenizer)

In [20]:
def prepare_dataset(sample):
    output = tokenizer(sample["text"])
    output["labels"] = tokenizer(sample["ipa"])['input_ids']
    output["length"] = len(output["labels"])
    return output


train_tf = train_tf.map(prepare_dataset, remove_columns=train_tf.column_names)
valid_tf = valid_tf.map(prepare_dataset, remove_columns=valid_tf.column_names)

In [21]:
# train_tf['input_ids']

In [22]:
pip install jiwer


In [23]:
import numpy as np
from datasets import load_metric

wer_metric = load_metric("wer")


def compute_metrics(eval_preds):
    preds, labels = eval_preds
    
    if isinstance(preds, tuple):
        preds = preds[0]
    
    preds = np.where(preds != -100, preds, tokenizer.pad_token_id)
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    result = wer_metric.compute(predictions=decoded_preds, references=decoded_labels)
    return {"wer": result}
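
The two `np.where` calls above are needed because padded label positions carry the sentinel `-100` (the default `label_pad_token_id` of `DataCollatorForSeq2Seq`), which is not a real token id and cannot be decoded. A minimal pure-Python sketch of the same replacement follows; the pad id of 0, the helper name `restore_padding`, and the token ids are assumptions made for illustration.

```python
IGNORE_INDEX = -100   # sentinel used for padded label positions
PAD_TOKEN_ID = 0      # assumption for this sketch; the notebook uses tokenizer.pad_token_id


def restore_padding(batch):
    """Replace every ignore-index entry with the pad id so the ids can be decoded."""
    return [[PAD_TOKEN_ID if tok == IGNORE_INDEX else tok for tok in row] for row in batch]


labels = [[259, 1042, 1, -100, -100],
          [873, 1, -100, -100, -100]]   # made-up token ids
print(restore_padding(labels))
# [[259, 1042, 1, 0, 0], [873, 1, 0, 0, 0]]
```

Once the sentinel is replaced, `batch_decode(..., skip_special_tokens=True)` drops the padding and returns only the transcription text.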


In [24]:
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

model_id = "mt5-bangla-text-to-ipa"

training_args = Seq2SeqTrainingArguments(
    output_dir=model_id,
    group_by_length=True,
    length_column_name="length",
    per_device_train_batch_size=4,
    per_device_eval_batch_size=16,
    evaluation_strategy="steps",
    metric_for_best_model="wer",
    greater_is_better=False,
    load_best_model_at_end=True,
    num_train_epochs=10,
    save_steps=4000,
    eval_steps=4000,
    logging_steps=4000,
    learning_rate=3e-4,
    weight_decay=1e-2,
    warmup_steps=2000,
    save_total_limit=2,
    predict_with_generate=True,
    generation_max_length=128,
    push_to_hub=False,
    report_to="none",
)

In [25]:
trainer = Seq2SeqTrainer(
    model=model,
    tokenizer=tokenizer,
    args=training_args,
    train_dataset=train_tf,
    eval_dataset=valid_tf,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)


In [26]:
trainer.train()

You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Step,Training Loss,Validation Loss,Wer
4000,3.2715,0.351481,0.325892
8000,0.4261,0.128999,0.125299
12000,0.2182,0.09113,0.087075
16000,0.1485,0.072675,0.068375
20000,0.1115,0.064662,0.059272
24000,0.0891,0.058637,0.052187
28000,0.074,0.056292,0.049922
32000,0.064,0.052706,0.04535
36000,0.0553,0.050009,0.042425
40000,0.0473,0.048564,0.041231


TrainOutput(global_step=51750, training_loss=0.35694515062415083, metrics={'train_runtime': 10731.8876, 'train_samples_per_second': 19.287, 'train_steps_per_second': 4.822, 'total_flos': 6934454088038400.0, 'train_loss': 0.35694515062415083, 'epoch': 10.0})

In [27]:
import joblib

# Save the model to a file
joblib.dump(model, 'model.pkl')
print("Model saved successfully.")

Model saved successfully.


In [28]:
# Later, to load the model back
loaded_model = joblib.load('/kaggle/working/model.pkl')
loaded_model

MT5ForConditionalGeneration(
  (shared): Embedding(250112, 512)
  (encoder): MT5Stack(
    (embed_tokens): Embedding(250112, 512)
    (block): ModuleList(
      (0): MT5Block(
        (layer): ModuleList(
          (0): MT5LayerSelfAttention(
            (SelfAttention): MT5Attention(
              (q): Linear(in_features=512, out_features=384, bias=False)
              (k): Linear(in_features=512, out_features=384, bias=False)
              (v): Linear(in_features=512, out_features=384, bias=False)
              (o): Linear(in_features=384, out_features=512, bias=False)
              (relative_attention_bias): Embedding(32, 6)
            )
            (layer_norm): MT5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (1): MT5LayerFF(
            (DenseReluDense): MT5DenseGatedActDense(
              (wi_0): Linear(in_features=512, out_features=1024, bias=False)
              (wi_1): Linear(in_features=512, out_features=1024, bias=False)
          

In [29]:
from transformers import pipeline

pipe = pipeline("text2text-generation", model=loaded_model,tokenizer=tokenizer, device=0)

In [30]:
# Alternative inputs to try:
# texts = "সুন্দর সুন্দরমনা আমি তোমাকে ভালবাসি।"
# texts = "বিশেষ অতিথি এফএম ইকবাল বিন আনোয়ার (ডন) অ্যাডিশনাল ডাইরেক্টরগেমস অ্যান্ড স্পোর্টস ডিপার্টমেন্ট ওয়ালটন।"
texts = 'আমি তোমাকে ভালোবাসি কেমন।'
ipas = pipe(texts, max_length=128, batch_size=16)
print(ipas)

[{'generated_text': 'ɐmɪ t̪omɐke bɦɐlobɐʃɪ ken।'}]


In [31]:
def calculate_ipa(txt):
    '''
    Return the IPA transcription that the pipeline generates for a single text.
    '''
    ipas = pipe(txt, max_length=128, batch_size=16)
    return ipas[0]['generated_text']


## Test the model

In [32]:
len(val_df)

2300

In [33]:
# pd.set_option('display.max_colwidth', None)
val_df.head(3)

Unnamed: 0,text,ipa
0,"তিনি বলেন, ‘আমরা সবাই মানসিকভাবে ভেঙে পড়েছি।","t̪ɪnɪ bɔlen, ‘ɐmɾɐ ʃɔbɐɪ mɐnoʃɪkbʱɐbe bʱeŋe po..."
1,এর উচিত জবাব জনগণ দেবেই।,eɾ ucɪt̪ ɟɔbɐb ɟɔngɔn d̪ebe͡ɪ̯।
2,"সমাবেশ থেকে ঘোষণা করা হয়েছে, আগামীকাল সোমবার জ...","ʃɔmɐbeʃ t̪ʰeke gʱoʃonɐ kɔɾɐ hoʲecʰe, ɐgɐmɪkɐl ..."


In [34]:
val_df.head(2)

Unnamed: 0,text,ipa
0,"তিনি বলেন, ‘আমরা সবাই মানসিকভাবে ভেঙে পড়েছি।","t̪ɪnɪ bɔlen, ‘ɐmɾɐ ʃɔbɐɪ mɐnoʃɪkbʱɐbe bʱeŋe po..."
1,এর উচিত জবাব জনগণ দেবেই।,eɾ ucɪt̪ ɟɔbɐb ɟɔngɔn d̪ebe͡ɪ̯।


In [35]:
tests = val_df.text.to_list()
trues = val_df.ipa.to_list()
tests[:4]

['তিনি বলেন, ‘আমরা সবাই মানসিকভাবে ভেঙে পড়েছি।',
 'এর উচিত জবাব জনগণ দেবেই।',
 'সমাবেশ থেকে ঘোষণা করা হয়েছে, আগামীকাল সোমবার জেলা শিক্ষক সমিতির আয়োজনে বিক্ষোভ ও সমাবেশ অনুষ্ঠিত হবে।',
 'টুর্নামেন্টের সেরা খেলোয়াড় নির্বাচিত হন পাবনা জেলার রেহানা খাতুন।']

In [36]:
index = 0
print('Prediction: ', calculate_ipa(tests[index]))
print('True: ', trues[index])
print('Text: ', tests[index])

Prediction:  t̪ɪnɪ bɔlen, ‘ɐmɾɐ ʃɔbɐɪ mɐnoʃɪkbɦɐbe bɦeŋe poɽechɪ। 
True:  t̪ɪnɪ bɔlen, ‘ɐmɾɐ ʃɔbɐɪ mɐnoʃɪkbʱɐbe bʱeŋe poɽecʰɪ। 
Text:  তিনি বলেন, ‘আমরা সবাই মানসিকভাবে ভেঙে পড়েছি।
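
The prediction above differs from the reference only in `bɦ` versus `bʱ` and `ch` versus `cʰ`. One possible explanation, offered as an assumption rather than something verified in this notebook, is Unicode compatibility normalization in the tokenizer: under NFKC, the modifier letters `ʰ` (U+02B0) and `ʱ` (U+02B1) fold to the plain letters `h` and `ɦ`. The sketch below uses only the standard `unicodedata` module to show that folding; the helper `differing_words` is hypothetical.

```python
import unicodedata


def differing_words(reference: str, prediction: str):
    """Return (reference_word, predicted_word) pairs that differ, position by position."""
    return [(r, p) for r, p in zip(reference.split(), prediction.split()) if r != p]


aspirated = "c\u02b0"   # c + MODIFIER LETTER SMALL H
breathy = "b\u02b1"     # b + MODIFIER LETTER SMALL H WITH HOOK

print(unicodedata.normalize("NFKC", aspirated))  # ch
print(unicodedata.normalize("NFKC", breathy))    # bɦ

reference = "b\u02b1e\u014be po\u027dec\u02b0\u026a"   # bʱeŋe poɽecʰɪ
prediction = "b\u0266e\u014be po\u027dech\u026a"       # bɦeŋe poɽechɪ

print(differing_words(reference, prediction))
# After normalizing the reference the same way, the mismatches disappear.
print(differing_words(unicodedata.normalize("NFKC", reference), prediction))  # []
```

If this is the cause, scoring against the raw references would count such words as errors, so it may be worth comparing predictions and references under the same normalization when inspecting validation results.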


In [37]:
len(test_df)

27228

In [38]:
test_df.head(2)

Unnamed: 0,row_id_column_name,text
0,0,বিশেষ অতিথি এফএম ইকবাল বিন আনোয়ার ডন অ্যাডিশনা...
1,1,এ নিয়ে বিবাদে সালের জুন রাত সাড়ে টায় চাচা সু...


In [39]:

# Check that no English alphanumerics or Bengali digits remain in the test set
filtered_test = test_df[lambda x: x["text"].str.contains("[A-Za-z0-9০-৯]")]

print(f'Length of the Df = {filtered_test.shape}')

Length of the Df = (0, 2)


In [40]:
texts = test_df["text"].tolist()
ipas = []
for text in tqdm(texts[:4]):
    ipa_output = pipe(text, max_length=128, batch_size=16)
    ipas.append(ipa_output)

100%|██████████| 4/4 [00:03<00:00,  1.21it/s]


In [41]:
ipas

[[{'generated_text': 'bɪʃeʃ ot̪ɪt̪hɪ ephem ɪkbɐl bɪn ɐnowɐɾ dɔn ɛdɪʃonɐl dɐɪ̯ɾektɔɾgemɔʃ ɛnd spoɾts dɪpɐɾtment dɪpɐɾtment owɐltɔn।'}],
 [{'generated_text': 'e nɪje bɪbɐd̪e ʃɐleɾ ɟun ɾɐt̪ ʃɐɽe tɐe̯ cɐcɐ ʃuʃɪl d̪ɐʃke kupɪje ɟɔkhom koɾe ʃe।'}],
 [{'generated_text': 'ɐɟ t̪heke bɔchoɾ ɐge t̪ɪnɪ cɔle gɛchen nɐ pheɾɐɾ d̪eʃe।'}],
 [{'generated_text': 'nɪhɔt̪o bɛkt̪ɪ kut̪pɐloŋ tɐleɾ ɪ- blekeɾ ɐbul bɐched̪ ।'}]]

In [42]:
%%time
texts = test_df["text"].tolist()
# The pipeline processes the whole list in batches. Wrapping its finished
# result in tqdm would only iterate over outputs that are already computed.
ipas = pipe(texts, max_length=128, batch_size=16)
ipas = [ipa["generated_text"] for ipa in ipas]

test_df["ipa"] = ipas
test_df.head()

CPU times: user 47min 6s, sys: 2.34 s, total: 47min 8s
Wall time: 47min 12s





Unnamed: 0,row_id_column_name,text,ipa
0,0,বিশেষ অতিথি এফএম ইকবাল বিন আনোয়ার ডন অ্যাডিশনা...,bɪʃeʃ ot̪ɪt̪hɪ bɔe̯m ɪkbɐl bɪn ɐnowɐɾ dɔn ɛdɪʃ...
1,1,এ নিয়ে বিবাদে সালের জুন রাত সাড়ে টায় চাচা সু...,e nɪje bɪbɐd̪e sɐleɾ ɟun ɾɐt̪ ʃɐɽe tɐe̯ cɐcɐ s...
2,2,আজ থেকে বছর আগে তিনি চলে গেছেন না ফেরার দেশে।,ɐɟ t̪heke bɔchoɾ ɐge t̪ɪnɪ cɔle gɛchen nɐ pheɾ...
3,3,নিহত ব্যক্তি কুতপালং টালের ই- ব্লকের আবুল বাছেদ ।,nɪhɔt̪o bɛkt̪ɪ kut̪pɐlɔŋ tɐleɾ ɪ- skeɾ ɐbul bɐ...
4,4,সংক্ষিপ্ত স্কোরশ্রীলংকা প্রথম ইনিংস করুনারত্ন...,ʃɔŋkkhɪpt̪o skoɾsɾɪlɔŋkɐ pɾot̪hom ɪnɪŋʃo kɔɾun...


In [43]:
test_df.head(3)

Unnamed: 0,row_id_column_name,text,ipa
0,0,বিশেষ অতিথি এফএম ইকবাল বিন আনোয়ার ডন অ্যাডিশনা...,bɪʃeʃ ot̪ɪt̪hɪ bɔe̯m ɪkbɐl bɪn ɐnowɐɾ dɔn ɛdɪʃ...
1,1,এ নিয়ে বিবাদে সালের জুন রাত সাড়ে টায় চাচা সু...,e nɪje bɪbɐd̪e sɐleɾ ɟun ɾɐt̪ ʃɐɽe tɐe̯ cɐcɐ s...
2,2,আজ থেকে বছর আগে তিনি চলে গেছেন না ফেরার দেশে।,ɐɟ t̪heke bɔchoɾ ɐge t̪ɪnɪ cɔle gɛchen nɐ pheɾ...


In [44]:
print(test_df['ipa'][0])
test_df['text'][0]

bɪʃeʃ ot̪ɪt̪hɪ bɔe̯m ɪkbɐl bɪn ɐnowɐɾ dɔn ɛdɪʃonɐl dɐɪ̯ɾektɔɾgems ɛnd stɔɾtɔs dɪpɐɾtment dɔjɐlɔton।


'বিশেষ অতিথি এফএম ইকবাল বিন আনোয়ার ডন অ্যাডিশনাল ডাইরেক্টরগেমস অ্যান্ড স্পোর্টস ডিপার্টমেন্ট ওয়ালটন।'

In [45]:
lenli = []
for i in range(len(test_df)):
    l1 = len(test_df['text'][i].split())
    l2 = len(test_df['ipa'][i].split())
    lenli.append([l1, l2])

In [46]:
lenlisrt = sorted(lenli, key= lambda x:x[1])

In [47]:
# lenlisrt

In [48]:
for i in test_df['ipa']:
    if len(i)<2:
        print(i)





## Create a Submission File

In [49]:
sub_df = test_df.copy()
# sub_df = sub_df.drop(['text'], axis=1)
sub_df.head(2)

Unnamed: 0,row_id_column_name,text,ipa
0,0,বিশেষ অতিথি এফএম ইকবাল বিন আনোয়ার ডন অ্যাডিশনা...,bɪʃeʃ ot̪ɪt̪hɪ bɔe̯m ɪkbɐl bɪn ɐnowɐɾ dɔn ɛdɪʃ...
1,1,এ নিয়ে বিবাদে সালের জুন রাত সাড়ে টায় চাচা সু...,e nɪje bɪbɐd̪e sɐleɾ ɟun ɾɐt̪ ʃɐɽe tɐe̯ cɐcɐ s...


In [50]:
sub_df.to_csv("submission3.csv", index=False)