In [1]:
import nltk
from nltk.util import ngrams
from collections import defaultdict
nltk.download('punkt')
nltk.download('indian')

hindi_sentences = [
    "मेरा नाम है।",
    "मेरा खाना तैयार है।",
    "मेरा दोस्त आ रहा है।",
    "तुम कैसे हो?",
    "वह आया।",
    "हम जाएँगे।",
    "मैं समझता हूँ।",
    "यह बहुत अच्छा है।",
    "कृपया ध्यान दें।",
    "कौन हो तुम?",
    "आपका स्वागत है।",
    "धन्यवाद।"
]



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package indian to /root/nltk_data...
[nltk_data]   Unzipping corpora/indian.zip.


In [2]:
def tokenize_text(text):
    sentences = nltk.sent_tokenize(text)
    words = nltk.word_tokenize(text)
    return sentences, words

def generate_trigrams(input_list):
    return list(ngrams(input_list, 3))

def predict_next_words(input_phrase, hindi_sentences):
    """Estimate P(next word | last two words of input_phrase) from trigram counts."""
    input_words = nltk.word_tokenize(input_phrase)
    if len(input_words) < 2:
        # A trigram model needs two words of context.
        return {}
    context = tuple(input_words[-2:])
    next_words_count = defaultdict(int)
    for sentence in hindi_sentences:
        # Build trigrams per sentence so that none spans a sentence boundary.
        _, words = tokenize_text(sentence)
        for first, second, third in generate_trigrams(words):
            if (first, second) == context:
                next_words_count[third] += 1
    total_count = sum(next_words_count.values())
    # Relative frequency of each candidate; empty if the context was never seen.
    return {word: count / total_count for word, count in next_words_count.items()}

input_phrase = "मेरा नाम"
predictions = predict_next_words(input_phrase, hindi_sentences)
print(predictions)

{'है।': 1.0}


Theory:

Statistical language models are, in essence, models that assign probabilities to sequences of words. Here we look at the simplest model that assigns probabilities to sentences and word sequences: the n-gram.
An N-gram is a sequence of N words. By that definition, a 2-gram (or bigram) is a two-word sequence such as "please turn", "turn your", or "your homework", and a 3-gram (or trigram) is a three-word sequence such as "please turn your" or "turn your homework".
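
As a small illustration of the definition, the sketch below slides a window of size N over a token list. The helper name `make_ngrams` is made up for this example; it is meant to behave like `nltk.util.ngrams` used in the cell above, but uses only plain Python.

```python
def make_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "please turn your homework".split()
bigrams = make_ngrams(tokens, 2)
trigrams = make_ngrams(tokens, 3)

print(bigrams)   # three two-word sequences
print(trigrams)  # two three-word sequences
```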


Algorithm:

An N-gram model is a statistical language model that estimates the likelihood of a word from its context, namely the N-1 words that precede it. The steps below outline how to build such a model in Python. For simplicity, they use a bigram model (N=2), where each word is predicted from the single word before it.


**Bigram (2-gram) Model:**

Step 1: **Tokenization**

- Tokenize your input text into a list of words or tokens.
- Count the number of unique words in the text.
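
A minimal sketch of this step, assuming whitespace splitting is good enough for the sample sentence (the notebook's own code uses `nltk.word_tokenize` instead). The sample text is made up for illustration.

```python
# Illustrative sample text, not taken from the Hindi corpus above.
text = "please turn your homework in please turn your page"

# Lowercase and split on whitespace as a simple stand-in for a real tokenizer.
tokens = text.lower().split()

# The set of distinct tokens is the vocabulary.
vocabulary = set(tokens)

print(len(tokens), "tokens,", len(vocabulary), "unique words")
```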


Step 2: **Create N-grams**

- Generate bigrams (2-grams) by pairing each word with the word that immediately follows it.
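
One way to sketch this step in plain Python is to zip the token list with itself shifted by one position. The sample tokens are an assumption for illustration.

```python
tokens = "please turn your homework in please turn your page".split()

# Pair each token with its successor; the last token has no successor.
bigrams = list(zip(tokens, tokens[1:]))

for pair in bigrams:
    print(pair)
```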



Step 3: **Build a Frequency Distribution**

- Count how often each bigram occurs, for example with NLTK's `nltk.FreqDist`.
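
The sketch below counts bigrams with `collections.Counter` from the standard library, used here as a stand-in so the example runs without NLTK; `nltk.FreqDist` is itself a `Counter` subclass and can be used the same way. The sample tokens are made up for illustration.

```python
from collections import Counter

tokens = "please turn your homework in please turn your page".split()

# Count how many times each (previous word, next word) pair occurs.
bigram_freq = Counter(zip(tokens, tokens[1:]))

for bigram, count in bigram_freq.most_common():
    print(bigram, count)
```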


Step 4: **Calculate Conditional Probabilities**

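- Calculate the conditional probability of each bigram, that is, the likelihood of a word given the previous word. It is estimated as the bigram's count divided by the number of times the previous word occurs as the first word of any bigram.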
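
A possible sketch of this estimate, using relative frequencies with no smoothing. The names `context_freq` and `cond_prob` and the sample tokens are assumptions made for this example.

```python
from collections import Counter, defaultdict

tokens = "please turn your homework in please turn your page".split()
bigram_freq = Counter(zip(tokens, tokens[1:]))

# How often each word appears as the first word of a bigram.
context_freq = Counter(tokens[:-1])

# cond_prob[prev][nxt] = count(prev, nxt) / count(prev as context)
cond_prob = defaultdict(dict)
for (prev, nxt), count in bigram_freq.items():
    cond_prob[prev][nxt] = count / context_freq[prev]

print(cond_prob["please"])
print(cond_prob["your"])
```

Dividing by the context count rather than the raw word count keeps the probabilities for each previous word summing to 1, even for the last word of the text.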



Step 5: **Generate Text**

- You can generate text using the trained bigram model. Start with an initial word and predict the next word based on the conditional probabilities.
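
The following sketch generates text by repeatedly sampling the next word in proportion to how often it followed the current word, which is equivalent to sampling from the conditional probabilities. The function name `generate`, its parameters, and the sample tokens are hypothetical; the fixed seed is only there to make runs repeatable.

```python
import random
from collections import Counter, defaultdict

tokens = "please turn your homework in please turn your page".split()

# successors[prev] counts how often each word followed prev.
successors = defaultdict(Counter)
for prev, nxt in zip(tokens, tokens[1:]):
    successors[prev][nxt] += 1

def generate(start, max_words=8, seed=0):
    """Sample a word sequence from the bigram counts, starting at `start`."""
    rng = random.Random(seed)
    words = [start]
    while len(words) < max_words:
        options = successors.get(words[-1])
        if not options:
            # Dead end: the current word was never followed by anything.
            break
        candidates = list(options)
        weights = [options[w] for w in candidates]
        words.append(rng.choices(candidates, weights=weights)[0])
    return words

generated = generate("please")
print(" ".join(generated))
```

Generation stops early when it reaches a word with no recorded successor, such as the final word of the training text.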



This algorithm builds a simple bigram model and generates text from its conditional probabilities. To capture more context, extend the approach to higher values of N, such as trigrams (3-grams), and train on a larger corpus, since longer N-grams occur less often and need more data for reliable estimates.
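
As a rough sketch of that extension, the function below trains a model for any N by using the preceding N-1 words as the context. The name `train_ngram_model` and the sample tokens are made up for this example, and no smoothing is applied.

```python
from collections import Counter, defaultdict

def train_ngram_model(tokens, n):
    """Map each (n-1)-word context to a dict of next-word probabilities."""
    counts = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        counts[context][tokens[i + n - 1]] += 1
    return {
        context: {word: c / sum(nxt.values()) for word, c in nxt.items()}
        for context, nxt in counts.items()
    }

tokens = "please turn your homework in please turn your page".split()
trigram_model = train_ngram_model(tokens, 3)

print(trigram_model[("please", "turn")])
print(trigram_model[("turn", "your")])
```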

The exact steps may vary with the implementation and the complexity of the model, but these five provide a general guideline for building and using an N-gram model.