# Introduction to Indic NLP Library

We will now get some hands-on experience with the Indic NLP Library. While the libraries discussed in the previous notebook work for English, Indian languages sometimes require additional handling for tasks such as tokenization and sentence splitting.

More details can be found here: https://github.com/anoopkunchukuttan/indic_nlp_library

## Set-up

### Download the Indic NLP Library resources

In [None]:
!git clone https://github.com/anoopkunchukuttan/indic_nlp_resources.git

Set up the path to the Indic NLP Resources folder

In [None]:
INDIC_NLP_RESOURCES=r"indic_nlp_resources/"

## Initialize the Indic NLP library

In [None]:
import sys
from indicnlp import common
common.set_resources_path(INDIC_NLP_RESOURCES)

In [None]:
from indicnlp import loader
loader.load()

We will now try out some of the APIs provided by the library

**NOTE:** Many APIs require us to specify the language we are working with. The language is given as a 2-letter ISO 639-1 code. More details and the exact 2-letter codes can be found here: https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes

Some languages do not have assigned 2-letter codes. The library uses the following two-letter codes for such languages:

 - Konkani : kK
 - Manipuri : mP
 - Bodo : bD
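
The library-specific codes are mixed-case, so they should be passed exactly as written. A minimal sketch of a lookup table is shown below; `LANG_CODES` and `lang_code` are hypothetical helpers for this notebook, not part of the library, and the table only covers a few languages.

```python
# Hypothetical lookup table for this notebook (not an indicnlp API).
# The first group uses standard ISO 639-1 codes; the last three are the
# library-specific codes listed above.
LANG_CODES = {
    "hindi": "hi",
    "bengali": "bn",
    "tamil": "ta",
    "telugu": "te",
    "kannada": "kn",
    "malayalam": "ml",
    "konkani": "kK",
    "manipuri": "mP",
    "bodo": "bD",
}

def lang_code(name):
    """Return the code for a language name; raises KeyError if unknown."""
    return LANG_CODES[name.strip().lower()]

print(lang_code("Konkani"))
```

Note that lower-casing a code such as `kK` would give `kk`, which ISO 639-1 assigns to a different language (Kazakh).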

## Text Normalization

Standardize text written in Indic scripts. Some of the issues handled are:
 - Non-spacing characters
 - Multiple representations of nukta-based characters
 - Multiple representations of two-part dependent vowel signs
 - Typing inconsistencies: e.g. use of pipe (|) for poorna virama

In [None]:
def normalize_text(input_text, normalizer):
    output_text=normalizer.normalize(input_text)

    print(input_text)
    print()

    print('Before normalization')
    print(' '.join([ hex(ord(c)) for c in input_text ] ))
    print('Length: {}'.format(len(input_text)))
    print()    
    print('After normalization')
    print(' '.join([ hex(ord(c)) for c in output_text ] ))
    print('Length: {}'.format(len(output_text)))    

In [None]:
from indicnlp.normalize.indic_normalize import IndicNormalizerFactory

factory=IndicNormalizerFactory()
normalizer=factory.get_normalizer("hi")
# Reference for Hindi https://unicode.org/charts/PDF/U0900.pdf

input_text="\u0958 \u0915\u093c"

normalize_text( input_text, normalizer )

input_text="\u0959 \u0916\u093c"

normalize_text( input_text, normalizer )
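
The two examples above each contain the same letter written in two ways: as a single precomposed code point and as a base letter followed by the nukta sign. The sketch below reproduces that idea with only the Python standard library, so it can be run without the Indic NLP resources. It is an illustrative stand-in under stated assumptions, not the library's normalizer: `sketch_normalize` is a hypothetical name, and the choice of which characters to strip is an assumption.

```python
import unicodedata

DANDA = "\u0964"  # DEVANAGARI DANDA (poorna virama)

def sketch_normalize(text):
    """Illustrative stand-in for a normalizer; NOT the indicnlp implementation."""
    # Unicode NFC maps precomposed nukta letters such as U+0958 to the
    # sequence base letter + nukta (U+0915 U+093C), giving one representation.
    text = unicodedata.normalize("NFC", text)
    # Assumption: drop zero-width joiner / non-joiner characters.
    for ch in ("\u200c", "\u200d"):
        text = text.replace(ch, "")
    # Typing inconsistency: a pipe typed in place of the danda.
    return text.replace("|", DANDA)

for sample in ("\u0958 \u0915\u093c", "\u0959 \u0916\u093c"):
    out = sketch_normalize(sample)
    print(" ".join(hex(ord(c)) for c in out), "Length:", len(out))
```

After this step both spellings of each letter compare equal, which is the property downstream string matching relies on.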

#### To-Do: Find more instances of such issues in Hindi or your native language and print them here

### Sentence Splitter

Rule-based system to split text into sentences

In [None]:
from indicnlp.tokenize import sentence_tokenize

indic_string="बहुत समय पहले की बात है. एक घने जंगल में एक तोता अपने दो बच्चों के साथ रहता है. उनका जीवन ख़ुशी-ख़ुशी बीत रहा था."
sentences=sentence_tokenize.sentence_split(indic_string, lang='hi')
for t in sentences:
    print(t)
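
To get a feel for what a rule-based splitter does, and where it can go wrong, here is a deliberately naive sketch using a single regular expression. The rule (split on white-space that follows `.`, `?`, `!`, danda or double danda) and the name `sketch_sentence_split` are assumptions made for illustration; the library's splitter applies more rules than this.

```python
import re

# Hypothetical rule: a sentence ends at . ? ! danda (U+0964) or
# double danda (U+0965) when followed by white-space.
SENTENCE_END = re.compile(r"(?<=[.?!\u0964\u0965])\s+")

def sketch_sentence_split(text):
    """Naive splitter for illustration; NOT the indicnlp implementation."""
    return [s for s in SENTENCE_END.split(text.strip()) if s]

story = "बहुत समय पहले की बात है. एक घने जंगल में एक तोता अपने दो बच्चों के साथ रहता है. उनका जीवन ख़ुशी-ख़ुशी बीत रहा था."
print(len(sketch_sentence_split(story)))   # 3

# An abbreviation followed by a space is wrongly treated as a sentence end.
abbrev = "डॉ. शर्मा आए।"
print(len(sketch_sentence_split(abbrev)))  # 2, although this is one sentence
```

Abbreviations, initials and numbers with decimal points are typical inputs worth trying against any rule-based splitter.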


#### To-Do: Experiment with sentence-splitting for your native language. Find instances where the splitter could fail

### Tokenization

Tokenize text at punctuation boundaries

In [None]:
from indicnlp.tokenize import indic_tokenize  

indic_string='उनका जीवन ख़ुशी-ख़ुशी बीत रहा था.'

print('Input String: {}'.format(indic_string))
print('Tokens: ')
for t in indic_tokenize.trivial_tokenize(indic_string): 
    print(t)
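
Splitting at punctuation boundaries can be sketched in a few lines: surround every punctuation mark with spaces, then split on white-space. The punctuation set used here (ASCII punctuation plus danda and double danda) and the name `sketch_tokenize` are assumptions for illustration, not a description of the library's internals.

```python
import re
import string

# Assumption: ASCII punctuation plus danda (U+0964) and double danda (U+0965).
PUNCT = string.punctuation + "\u0964\u0965"
PUNCT_PAT = re.compile("([" + re.escape(PUNCT) + "])")

def sketch_tokenize(text):
    """Pad each punctuation mark with spaces, then split on white-space."""
    return PUNCT_PAT.sub(r" \1 ", text).split()

for token in sketch_tokenize("उनका जीवन ख़ुशी-ख़ुशी बीत रहा था."):
    print(token)
```

With this rule the hyphen inside a compound becomes a token of its own, which may or may not be what a downstream task wants.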

#### Detokenization

Written text normally has no white-space between a word and the punctuation attached to it, but tokens joined with spaces do. The detokenizer removes that extra white-space

In [None]:
from indicnlp.tokenize import indic_detokenize  
indic_string = ' '.join( indic_tokenize.trivial_tokenize(indic_string) )

print('Input String: {}'.format(indic_string))
print('Detokenized String: {}'.format(indic_detokenize.trivial_detokenize(indic_string,lang='hi')))
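
The reverse step can be sketched with two substitutions. Both rules below are assumptions chosen for illustration (attach closing punctuation to the word on its left, and close up hyphens on both sides); `sketch_detokenize` is a hypothetical helper and may not match the library's output on every input.

```python
import re

def sketch_detokenize(text):
    """Undo space-padding around punctuation; NOT the indicnlp implementation."""
    # Rule 1 (assumption): no white-space before sentence or clause punctuation.
    text = re.sub(r"\s+([.,;:?!\u0964\u0965])", r"\1", text)
    # Rule 2 (assumption): a hyphen joins the words on both sides.
    text = re.sub(r"\s*-\s*", "-", text)
    return text

tokens = ["उनका", "जीवन", "ख़ुशी", "-", "ख़ुशी", "बीत", "रहा", "था", "."]
print(sketch_detokenize(" ".join(tokens)))
```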


#### To-Do: Experiment with tokenization for your native language.

### Script Conversion

Convert from one Indic script to another using a rule-based system

The following scripts are supported:

Devanagari (Hindi, Marathi, Sanskrit, Konkani, Sindhi, Nepali), Assamese, Bengali, Oriya, Gujarati, Gurumukhi (Punjabi), Sindhi, Tamil, Telugu, Kannada, Malayalam

In [None]:
from indicnlp.transliterate.unicode_transliterate import UnicodeIndicTransliterator
input_text = 'राजस्थान'
print(UnicodeIndicTransliterator.transliterate(input_text,"hi","kn"))
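
Rule-based conversion between these scripts is feasible because Unicode lays out the major Indic script blocks in parallel: each block is 128 code points wide, and corresponding letters sit at the same offset within their blocks. The sketch below uses only that property. It is a simplified illustration, not the library's implementation: `sketch_convert_script` is a hypothetical name, and the sketch ignores offsets that are unassigned in the target script (for example, Tamil lacks several consonants that Devanagari has), which a real converter has to handle.

```python
# Start of the Unicode block for the script of each language code.
SCRIPT_BLOCK_START = {
    "hi": 0x0900,  # Devanagari
    "bn": 0x0980,  # Bengali
    "pa": 0x0A00,  # Gurmukhi
    "gu": 0x0A80,  # Gujarati
    "or": 0x0B00,  # Oriya
    "ta": 0x0B80,  # Tamil
    "te": 0x0C00,  # Telugu
    "kn": 0x0C80,  # Kannada
    "ml": 0x0D00,  # Malayalam
}
BLOCK_SIZE = 0x80  # each of these blocks spans 128 code points

def sketch_convert_script(text, src_lang, tgt_lang):
    """Shift each character by the distance between the two script blocks."""
    src = SCRIPT_BLOCK_START[src_lang]
    tgt = SCRIPT_BLOCK_START[tgt_lang]
    out = []
    for ch in text:
        offset = ord(ch) - src
        # Characters outside the source block (spaces, digits, ...) pass through.
        out.append(chr(tgt + offset) if 0 <= offset < BLOCK_SIZE else ch)
    return "".join(out)

print(sketch_convert_script("राजस्थान", "hi", "kn"))
```

Because the mapping is a fixed offset, converting back with the language codes swapped returns the original string.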

#### To-Do: Experiment with Script conversion between various language pairs

### Romanization

Convert Indic-script text to Roman text in ITRANS notation

In [None]:
from indicnlp.transliterate.unicode_transliterate import ItransTransliterator

input_text = 'ರುದ್ರ ಮೂರ್ತಿ'
lang='kn'

itrans_text = ItransTransliterator.to_itrans(input_text,lang)

print(itrans_text)


#### To-Do: Experiment with Romanization of text in your native language

### Indicization (ITRANS to Indic Script)

Convert Roman text in ITRANS notation to Indic script

In [None]:
from indicnlp.transliterate.unicode_transliterate import ItransTransliterator

lang='kn'

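# itrans_text is the output of the Romanization cell above
indic_text = ItransTransliterator.from_itrans(itrans_text,lang)
print(indic_text)
# Print the code point of each character in hex
for ch in indic_text:
    print('{:x}'.format(ord(ch)))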


### Word Segmentation

Perform unsupervised word segmentation using Morfessor

The following languages are supported:

Hindi, Punjabi, Marathi, Konkani, Gujarati, Bengali, Kannada, Tamil, Telugu, Malayalam

In [None]:
from indicnlp.morph import unsupervised_morph 
from indicnlp import common

analyzer=unsupervised_morph.UnsupervisedMorphAnalyzer('kn')

In [None]:
indic_string='ರೈತನೊಬ್ಬನ ತೋಟದಲ್ಲಿದ್ದ ಸೇಬು ಮರದಲ್ಲಿಒಂದೇ ಒಂದು ಹಣ್ಣು ಬೆಳೆಯಿತು .'

# Segment each white-space separated token into morphemes
analyzed_tokens=analyzer.morph_analyze_document(indic_string.split(' '))

for w in analyzed_tokens: 
    print(w)

#### To-Do: Experiment with Word Segmentation for your native language