https://www.analyticsvidhya.com/blog/2020/01/3-important-nlp-libraries-indian-languages-python/?utm_source=feedburner&utm_medium=email&utm_campaign=Feed%3A+AnalyticsVidhya+%28Analytics+Vidhya%29

1) This article explores 3 Python libraries for NLP on Indian languages

2) We will implement a range of NLP tasks in Python using these 3 libraries, working with Indian-language text

*    **iNLTK** - Hindi, Punjabi, Sanskrit, Gujarati, Kannada, Malayalam, Nepali, Odia, Marathi, Bengali, Tamil, Urdu

      It performs:
      * Tokenization
      * Word embedding
      * Text completion
      * Sentence similarity, etc.
*    **Indic NLP Library** - Assamese, Sindhi, Sinhala, Sanskrit, Konkani, Kannada, Telugu,

     It performs:
     * Normalization
     * Transliteration
     * Phonetic analysis
     * Syllabification, etc.
    

*    **StanfordNLP** - Many of the above languages

     It performs:
     * Lemmatization
     * Part-of-speech (POS) tagging
     * Named Entity Recognition (NER)
     * Dependency parsing, etc.

In [0]:
!pip install inltk

**Setting the language**

iNLTK has language models trained for different languages and in order to use one, we have to download its files first. We will be working with Hindi text, so let’s set **“Hindi”** as our language:

In [2]:
from inltk.inltk import setup
setup('hi') # This will download all the necessary files to make inferences for Hindi.

Downloading Model. This might take time, depending on your internet connection. Please be patient.
We'll only do this for the first time.
Downloading Model. This might take time, depending on your internet connection. Please be patient.
We'll only do this for the first time.
Done!


**Tokenization**

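The first step in most NLP tasks is to break the text down into its smallest units, or tokens. Note that iNLTK's tokenizer can split a word into subword pieces: the '▁' prefix marks a piece that begins a new word, so '▁प्रा' followed by 'जा' in the output below is a single word split in two.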

In [3]:
from inltk.inltk import tokenize

hindi_text = """प्राचीन काल में विक्रमादित्य नाम के एक आदर्श राजा हुआ करते थे।
अपने साहस, पराक्रम और शौर्य के लिए  राजा विक्रम मशहूर थे। 
ऐसा भी कहा जाता है कि राजा विक्रम अपनी प्राजा के जीवन के दुख दर्द जानने के लिए रात्री के पहर में भेष बदल कर नगर में घूमते थे।"""

# tokenize(input text, language code)
tokenize(hindi_text, "hi")

['▁प्राचीन',
 '▁काल',
 '▁में',
 '▁विक्रमादित्य',
 '▁नाम',
 '▁के',
 '▁एक',
 '▁आदर्श',
 '▁राजा',
 '▁हुआ',
 '▁करते',
 '▁थे',
 '।',
 '▁अपने',
 '▁साहस',
 ',',
 '▁पराक्रम',
 '▁और',
 '▁शौर्य',
 '▁के',
 '▁लिए',
 '▁राजा',
 '▁विक्रम',
 '▁मशहूर',
 '▁थे',
 '।',
 '▁ऐसा',
 '▁भी',
 '▁कहा',
 '▁जाता',
 '▁है',
 '▁कि',
 '▁राजा',
 '▁विक्रम',
 '▁अपनी',
 '▁प्रा',
 'जा',
 '▁के',
 '▁जीवन',
 '▁के',
 '▁दुख',
 '▁दर्द',
 '▁जानने',
 '▁के',
 '▁लिए',
 '▁रात्री',
 '▁के',
 '▁पहर',
 '▁में',
 '▁भेष',
 '▁बदल',
 '▁कर',
 '▁नगर',
 '▁में',
 '▁घूमते',
 '▁थे',
 '।']
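
Since the tokenizer returns subword pieces, it can help to stitch them back into whole words for inspection. The sketch below assumes the SentencePiece-style convention visible in the output above, where a piece starting with '▁' (U+2581) opens a new word; `merge_subword_tokens` is a hypothetical helper, not part of iNLTK.

```python
MARKER = "\u2581"  # the '▁' prefix seen in the tokenizer output


def merge_subword_tokens(tokens, marker=MARKER):
    """Rejoin subword pieces into whole words.

    A piece that starts with the marker begins a new word; any other
    piece (including punctuation) is glued onto the word before it.
    """
    words = []
    for piece in tokens:
        if piece.startswith(marker) or not words:
            words.append(piece.lstrip(marker))
        else:
            words[-1] += piece
    return words


# A slice of the output above: the word split into two pieces is rejoined.
pieces = [MARKER + "अपनी", MARKER + "प्रा", "जा", MARKER + "के", MARKER + "जीवन"]
print(merge_subword_tokens(pieces))
```

Punctuation such as '।' carries no marker, so this sketch attaches it to the preceding word, which matches how it appears in the original text.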

**Generate similar sentences from a given text input**

This feature of iNLTK is **very useful for text data augmentation**: we can enlarge our training data by adding sentences that have a similar meaning to the ones we already have. The second argument is the number of similar sentences to generate.

In [4]:
from inltk.inltk import get_similar_sentences

# get similar sentences to the one given in hindi
output = get_similar_sentences('मैं आज बहुत खुश हूं', 5, 'hi')

print(output)

['मैं आज काफ़ी खुश हूं', 'मैं आज काफी खुश हूं', 'मैं आज अत्यधिक खुश हूं', 'मै आज बहुत खुश हूं', 'मैं आज बहुत खुश हूँ']
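
To make the augmentation idea concrete, here is a minimal sketch of expanding a labeled dataset with paraphrases. The `augment` helper and the `fake_paraphrase` stand-in are hypothetical names introduced for illustration; with iNLTK installed, one could presumably pass something like `lambda text, n: get_similar_sentences(text, n, 'hi')` instead of the stand-in.

```python
def augment(dataset, paraphrase, n_variants=2):
    """Return the original (text, label) pairs plus paraphrased copies.

    `paraphrase(text, n)` is any callable returning up to n rewrites of
    `text`; each rewrite inherits the label of the original sentence.
    """
    augmented = []
    for text, label in dataset:
        augmented.append((text, label))
        for variant in paraphrase(text, n_variants):
            if variant != text:  # skip rewrites identical to the input
                augmented.append((variant, label))
    return augmented


def fake_paraphrase(text, n):
    # Hypothetical stand-in so the sketch runs without iNLTK installed.
    return [f"{text} (variant {i})" for i in range(1, n + 1)]


train = [("मैं आज बहुत खुश हूं", "positive")]
for text, label in augment(train, fake_paraphrase):
    print(label, text)
```

Copying the label onto each variant assumes the paraphrase preserves the meaning that determines the label, which is worth spot-checking on real data.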


**Identify the language of a text**

 This is useful when working with multilingual data.

In [5]:
from inltk.inltk import identify_language

identify_language('मैं आज बहुत खुश हूं')

Downloading Model. This might take time, depending on your internet connection. Please be patient.
We'll only do this for the first time.
Downloading Model. This might take time, depending on your internet connection. Please be patient.
We'll only do this for the first time.




'hindi'
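
One way to use language identification on multilingual data is to bucket texts by language before further processing. The sketch below takes the detector as a parameter; `group_by_language` and `script_based_detect` are hypothetical names, and the stand-in detector only looks at Unicode script ranges, so it is far cruder than a real identifier (Devanagari, for instance, is shared by Hindi, Marathi, Nepali and Sanskrit).

```python
from collections import defaultdict


def group_by_language(texts, detect):
    """Bucket texts by the label that `detect(text)` returns."""
    groups = defaultdict(list)
    for text in texts:
        groups[detect(text)].append(text)
    return dict(groups)


def script_based_detect(text):
    # Hypothetical stand-in: guesses from the first recognizable script.
    for ch in text:
        if "\u0900" <= ch <= "\u097F":  # Devanagari block
            return "hindi"
        if "\u0980" <= ch <= "\u09FF":  # Bengali block
            return "bengali"
    return "unknown"


corpus = ["मैं आज बहुत खुश हूं", "আবহাওয়া চমৎকার", "hello"]
print(group_by_language(corpus, script_based_detect))
```

With iNLTK available, `identify_language` from the cell above could be passed as `detect` in place of the stand-in.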

**Extract embedding vectors**

These embedding vectors **capture the semantic information of the text input** and are easier for models to work with (as they expect numerical input).

We get one embedding vector per token, not per word: the two-word input below yields three vectors because the tokenizer splits it into three subword tokens.

**Notice** that each token is represented by an embedding of 400 dimensions.

In [6]:
from inltk.inltk import get_embedding_vectors

# get embeddings for the input text (one vector per token)
vectors = get_embedding_vectors("विश्लेषिकी विद्या", "hi")

print(vectors)
# print the shape of the first token's vector
print("shape:", vectors[0].shape)



[array([-0.432755, -0.138092,  0.318305, -0.635152, ...,  0.137299, -0.00537 ,  0.549906,  0.068798], dtype=float32), array([ 0.617097,  0.112811, -0.406291, -0.263062, ...,  0.551395,  0.138665,  0.592104,  0.091295], dtype=float32), array([ 0.086235,  0.357199, -0.080211, -0.884763, ...,  0.060092, -0.440086,  0.522778, -0.156389], dtype=float32)]
shape: (400,)
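
Because the result is a list of per-token vectors, a common next step is to collapse them into one fixed-size vector for the whole input. The sketch below uses simple mean pooling, which is one option among several and not something iNLTK prescribes; `mean_pool` is a hypothetical helper, shown on tiny 3-dimensional vectors rather than real 400-dimensional ones.

```python
def mean_pool(vectors):
    """Average a list of equal-length vectors element by element."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    totals = [0.0] * dim
    for vec in vectors:
        if len(vec) != dim:
            raise ValueError("all vectors must have the same length")
        for i, value in enumerate(vec):
            totals[i] += float(value)
    return [total / len(vectors) for total in totals]


# Toy stand-ins for three token embeddings.
token_vectors = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0], [5.0, 6.0, 7.0]]
print(mean_pool(token_vectors))  # [3.0, 4.0, 5.0]
```

The helper only relies on `len` and iteration, so a list of NumPy arrays like the one printed above should work as input too.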


**Text completion**

Text completion is one of the most exciting aspects of language modeling. We can easily use it to **auto-complete the input text.**

 Here we take a Bengali sentence that says “The weather is nice today”. The fourth parameter adjusts the “randomness” of the model, so that it produces different generations.

 We can often use the **text generation abilities** of a language model to **augment** the text dataset, and since we usually have small datasets for vernacular languages, this feature of iNLTK comes in handy.

In [7]:
from inltk.inltk import setup
from inltk.inltk import predict_next_words

# download models for Bengali
setup('bn')
# predict the next words of the sentence "The weather is nice today"
# arguments: input text, number of words to predict, language code, randomness
predict_next_words("আবহাওয়া চমৎকার", 10, "bn", 0.7)



Downloading Model. This might take time, depending on your internet connection. Please be patient.
We'll only do this for the first time.
Downloading Model. This might take time, depending on your internet connection. Please be patient.
We'll only do this for the first time.
Done!




'আবহাওয়া চমৎকার প্রাকৃতিক বন নিয়ে আলোচনা করে। এসকল প্রাকৃতিক পরিবেশের উপর'
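
To build some intuition for the randomness parameter, the sketch below shows how a sampling temperature reshapes a next-word distribution. This assumes the parameter behaves like a softmax temperature, which is a guess about iNLTK's internals rather than something the source states; the scores are made-up numbers and `softmax_with_temperature` is a hypothetical helper.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities, scaled by a temperature.

    Lower temperatures sharpen the distribution (less random sampling);
    higher temperatures flatten it (more random sampling).
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [score / temperature for score in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up scores for three candidate next words.
logits = [2.0, 1.0, 0.0]
for t in (0.3, 0.7, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Under this assumption, a value such as 0.7 keeps the most likely words favored while still leaving room for variation between runs.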

**Finding similarity between two sentences**
 
 This finds the semantic similarity between two pieces of text.

 We can use the similarity score for **feature engineering** and even for building **sentiment analysis systems**.

 The model returns a **cosine similarity of about 0.68**, which means that the sentences are fairly close in meaning, and that’s correct.

 For more features: https://inltk.readthedocs.io/en/latest/api_docs.html#api

In [8]:
from inltk.inltk import get_sentence_similarity

# The first sentence roughly translates to “I like food”, while the second means “I appreciate food that tastes good”.
# Similarity of the encodings is computed by the cmp function, which defaults to cosine similarity.
get_sentence_similarity('मुझे भोजन पसंद है।', 'मैं ऐसे भोजन की सराहना करता हूं जिसका स्वाद अच्छा हो।', 'hi')





0.6777094006538391
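
For reference, cosine similarity between two vectors is their dot product divided by the product of their lengths. The sketch below is a plain-Python version on toy vectors, meant only to show what the score measures; it is not iNLTK's own implementation, and `cosine_similarity` is a name chosen for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0 (same direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal)
```

Scores near 1 mean the two encodings point in nearly the same direction, which is why a value around 0.68 reads as "fairly close".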