# Get Marathi text in Devanagari from Latin script

This notebook explains how to convert names written in Latin script into Devanagari script. Technically, this process is known as transliteration. We will use the data generated by this process to develop character-based prediction applications for the Marathi language.

We use the indic_transliteration package for this conversion. It supports most Indian scripts. The package is available at the following link:
https://pypi.org/project/indic-transliteration/

For our application, we import the sanscript module and its transliterate function as follows.

In [1]:
from indic_transliteration import sanscript

In [2]:
from indic_transliteration.sanscript import transliterate

The following code checks that the indic_transliteration library works. If the Latin-script word does not end with a vowel, we need to append an extra 'a' to the end of the word; otherwise the output is a grammatically incomplete Marathi/Sanskrit word in Devanagari.

In [3]:
input_text = "pritama"

# Transliterate the ITRANS (Latin) input into Devanagari
output_text = transliterate(input_text, sanscript.ITRANS, sanscript.DEVANAGARI)
print(output_text)

प्रितम
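
As a rough illustration of what "incomplete" means here: a Devanagari word whose final consonant carries no vowel is normally written with a trailing virama (U+094D). The sketch below uses only the standard library to check for that sign. The helper name `ends_with_virama` is hypothetical, and the assumption that the library emits a virama for a bare final consonant should be checked against its actual output.

```python
VIRAMA = "\u094d"  # DEVANAGARI SIGN VIRAMA (halant)


def ends_with_virama(word: str) -> bool:
    """Return True if the word ends with a virama (hypothetical helper)."""
    return word.endswith(VIRAMA)


complete = "प्रितम"              # output shown above for "pritama"
incomplete = complete + VIRAMA   # same word with the final vowel suppressed

print(ends_with_virama(complete))
print(ends_with_virama(incomplete))
```

A check like this could be run over the transliterated list to flag any word that still ends in a bare consonant.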


Now we load the English names into pandas DataFrames for further processing.
These Indian names in Latin script are taken from the following [dataset](https://www.kaggle.com/datasets/ananysharma/indian-names-dataset).

In [4]:
import pandas as pd
names_df = pd.read_csv('./Indian_Names.csv')
male_names_df = pd.read_csv('./Indian-Male-Names.csv')
female_names_df = pd.read_csv('./Indian-Female-Names.csv')

In [6]:
names_df

Unnamed: 0.1,Unnamed: 0,Name
0,0,aabid
1,1,aabida
2,2,aachal
3,3,aadesh
4,4,aadil
...,...,...
6481,6481,zishan
6482,6482,ziyabul
6483,6483,zoya
6484,6484,zuhaib


In [9]:
# Stack the name columns of the three files into a single Series
all_names_df = pd.concat([names_df['Name'], male_names_df['name'], female_names_df['name']], axis=0, ignore_index=True)
all_names_df

0                            aabid
1                           aabida
2                           aachal
3                           aadesh
4                            aadil
                   ...            
36708                   saroj devi
36709                naina @ geeta
36710    manju d/0 baboo lal jatav
36711                      shivani
36712                        nayna
Length: 36713, dtype: object
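
For reference, `pd.concat` applied to Series returns a Series, which carries a single `name` rather than column labels. Below is a minimal sketch with made-up sample values, showing one way to give the merged Series a name. The name `names` is only illustrative.

```python
import pandas as pd

# Small stand-ins for the 'Name' / 'name' columns of the input files
first = pd.Series(["aabid", "aabida"], name="Name")
second = pd.Series(["saroj devi"], name="name")

# ignore_index=True renumbers the rows 0..n-1 in the merged result
merged = pd.concat([first, second], axis=0, ignore_index=True)

# A Series has a name, not columns, so it is labelled with .rename()
merged = merged.rename("names")
print(merged.name, len(merged))
```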

We now remove special characters and digits from the text, since most Indian names do not contain them.

In [42]:
import re

# Replace every non-letter with a space, then split each entry into single-word names
all_single_name_list = [re.sub(r'[^a-zA-Z]', ' ', str(name)).strip().split() for name in all_names_df]
all_single_name_list[-10:]

[['anjum'],
 ['miss', 'reena'],
 ['pooja'],
 ['rakhi'],
 ['musarrat'],
 ['saroj', 'devi'],
 ['naina', 'geeta'],
 ['manju', 'd', 'baboo', 'lal', 'jatav'],
 ['shivani'],
 ['nayna']]
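
The cleaning step can also be written as a small function, which makes it easy to try on individual entries. This is a sketch of the same regular expression as above; the function name `split_name` is hypothetical.

```python
import re


def split_name(raw) -> list:
    """Replace every non-letter with a space and split into words."""
    return re.sub(r"[^a-zA-Z]", " ", str(raw)).strip().split()


print(split_name("naina @ geeta"))
print(split_name("manju d/0 baboo lal jatav"))
```

Two side effects are worth keeping in mind: single letters such as 'd' survive as separate tokens, and a missing value (NaN) would be turned into the word 'nan' by `str()`.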

The step above produces a list of name lists, so we flatten it into a single collection of unique names with the help of Python's itertools library.

In [55]:
import itertools

# Flatten the nested lists and keep only unique names
final_names_set = set(itertools.chain.from_iterable(all_single_name_list))
sorted(final_names_set)

['a',
 'aabid',
 'aabida',
 'aachal',
 'aadesh',
 'aadil',
 'aadish',
 'aaditya',
 'aaenab',
 'aafreen',
 'aafrin',
 'aaftaab',
 'aaftab',
 'aagad',
 'aagand',
 'aahim',
 'aahuja',
 'aajad',
 'aajam',
 'aajiv',
 'aakanksha',
 'aakar',
 'aakas',
 'aakash',
 'aakhon',
 'aakib',
 'aakil',
 'aalabndi',
 'aalam',
 'aale',
 'aalina',
 'aaliya',
 'aamil',
 'aamin',
 'aamina',
 'aamir',
 'aamliya',
 'aamna',
 'aamod',
 'aamosh',
 'aamrin',
 'aanad',
 'aanamika',
 'aanand',
 'aanchal',
 'aanik',
 'aanil',
 'aansi',
 'aansu',
 'aanu',
 'aanya',
 'aapa',
 'aapu',
 'aaqif',
 'aara',
 'aaradhana',
 'aarati',
 'aarav',
 'aardhna',
 'aarif',
 'aarifa',
 'aarifun',
 'aariv',
 'aarju',
 'aarsi',
 'aarti',
 'aarushi',
 'aary',
 'aarya',
 'aas',
 'aasa',
 'aasan',
 'aash',
 'aasha',
 'aashi',
 'aashia',
 'aashif',
 'aashik',
 'aashiq',
 'aashis',
 'aashish',
 'aashiya',
 'aashkim',
 'aashma',
 'aashre',
 'aashu',
 'aasif',
 'aasim',
 'aasish',
 'aasma',
 'aasmin',
 'aastha',
 'aasto',
 'aasu',
 'aatam',
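
The flattening and de-duplication can be seen on a tiny example. The sample lists below are made up for illustration.

```python
import itertools

nested = [["saroj", "devi"], ["naina", "geeta"], ["saroj"]]

# chain.from_iterable yields the items of each inner list in turn
flat = list(itertools.chain.from_iterable(nested))

# A set drops duplicates; sorted() returns them as an ordered list
unique_sorted = sorted(set(flat))

print(flat)
print(unique_sorted)
```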


In [58]:
processed_name_df = pd.DataFrame({'name': sorted(final_names_set)})
processed_name_df

Unnamed: 0,name
0,a
1,aabid
2,aabida
3,aachal
4,aadesh
...,...
8885,zoya
8886,zu
8887,zuber
8888,zuhaib


In [59]:
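# Append 'a' unless the name already ends in 'i', 'e', 'o', or 'u'
# (names that end in 'a' also receive an extra 'a')
int_names_df = processed_name_df['name'].apply(lambda x: str(x).strip() + 'a' if str(x).strip()[-1] not in ['i', 'e', 'o', 'u'] else str(x).strip())
int_names_df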

0            aa
1        aabida
2       aabidaa
3       aachala
4       aadesha
         ...   
8885      zoyaa
8886         zu
8887     zubera
8888    zuhaiba
8889     zuveba
Name: name, Length: 8890, dtype: object
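
The same rule may be easier to read and test as a named function. This is a sketch only: the name `add_final_a` is hypothetical, and unlike the lambda above it returns an empty string unchanged instead of raising an error.

```python
def add_final_a(name: str) -> str:
    """Append 'a' unless the name ends in 'i', 'e', 'o', or 'u'."""
    name = name.strip()
    if name and name[-1] not in ("i", "e", "o", "u"):
        return name + "a"
    return name


for sample in ["aabid", "zoya", "zu"]:
    print(sample, "->", add_final_a(sample))
```

It could be applied to the column with `processed_name_df['name'].apply(add_final_a)`.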

In [64]:
# Drop the first entry ('aa', produced from the single letter 'a')
final_name_list = int_names_df.tolist()[1:]
final_name_list[-10:]

['ziarula',
 'zile',
 'zinaa',
 'zishana',
 'ziyabula',
 'zoyaa',
 'zu',
 'zubera',
 'zuhaiba',
 'zuveba']

In [66]:
final_name_list[:10]

['aabida',
 'aabidaa',
 'aachala',
 'aadesha',
 'aadila',
 'aadisha',
 'aadityaa',
 'aaenaba',
 'aafreena',
 'aafrina']

Once all the Latin-script names are processed as desired, we use the transliterate function to get all the names in Devanagari.

In [67]:
marathi_names = [transliterate(name, sanscript.ITRANS, sanscript.DEVANAGARI) for name in final_name_list]

In [70]:
len(marathi_names)

8889

Finally, we store all the names in a file.

In [69]:
# Write one name per line; UTF-8 is set explicitly so Devanagari text is saved correctly on every platform
with open('more_names.txt', 'w', encoding='utf-8') as file:
    for item in marathi_names:
        file.write(item + '\n')
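
A quick way to confirm that the Devanagari text survives the write is to read the file back and compare. The sketch below assumes the file is written and read as UTF-8, and it uses a temporary directory with a hypothetical file name so that it does not touch `more_names.txt`.

```python
import os
import tempfile

names = ["प्रितम"]  # sample value taken from the output earlier in the notebook

# A temporary directory keeps the sketch from touching real files
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "names_check.txt")  # hypothetical file name
    with open(path, "w", encoding="utf-8") as file:
        for item in names:
            file.write(item + "\n")
    with open(path, encoding="utf-8") as file:
        loaded = [line.rstrip("\n") for line in file]

print(loaded == names)
```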