<a href="https://colab.research.google.com/github/morteza80mr/Farsi-Predictive-Typing-App/blob/main/Farsi_Predictive_Typing_App_Using_Trie_Based_Autocomplete_System.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [17]:
!pip install --upgrade --no-cache-dir gdown
!gdown 1MnFrHCY_b-2-OzovvqBt3xl18YbVyM_l
!gdown 1BCq9MTIDzqIH5v9BPNE72rW3UHdzIzcX

Downloading...
From: https://drive.google.com/uc?id=1MnFrHCY_b-2-OzovvqBt3xl18YbVyM_l
To: /content/pos_tagger.model
100% 19.2M/19.2M [00:00<00:00, 70.7MB/s]
Downloading...
From: https://drive.google.com/uc?id=1BCq9MTIDzqIH5v9BPNE72rW3UHdzIzcX
To: /content/farsi_word_frequency.db
100% 4.03M/4.03M [00:00<00:00, 45.4MB/s]


In [18]:
!pip install pandas sqlalchemy



In [19]:
import pandas as pd
from sqlalchemy import create_engine

# Create a database engine
db_file = '/content/farsi_word_frequency.db'
engine = create_engine(f'sqlite:///{db_file}')

# Read a table into a DataFrame
table_name = 'word'
df = pd.read_sql_table(table_name, con=engine)

# Display the DataFrame
df

index,id,word,count
0,1,انوشه,919
1,2,آسیه,1974
2,3,اتمامِ‌حجت,1
3,4,ده‌سالگی,489
4,5,به‌هم‌پیوستگی,819
...,...,...,...
141248,141249,هم‌نامی,32
141249,141250,زیرپائی,16
141250,141251,ارّه‌موئی,0
141251,141252,ادّعائی,0
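
If the table name is not known in advance, the database can be inspected before calling `read_sql_table`. The following is a minimal sketch using the standard library's `sqlite3` module. It runs against an in-memory database with a made-up `word` table so that it is self-contained; the `list_tables` helper is hypothetical, and pointing it at `/content/farsi_word_frequency.db` is an assumption about how you would use it in this notebook.

```python
import sqlite3

def list_tables(conn):
    """Return the names of all tables in an SQLite database, sorted by name."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
    return [name for (name,) in rows]

# Illustrative in-memory database that mimics the notebook's table layout.
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE word (id INTEGER PRIMARY KEY, word TEXT, "count" INTEGER)')
conn.execute('INSERT INTO word (word, "count") VALUES (?, ?)', ("خروج", 119686))

print(list_tables(conn))
print(conn.execute('SELECT word, "count" FROM word').fetchall())
```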


In [20]:
!pip uninstall -y hazm

Found existing installation: hazm 0.10.0
Uninstalling hazm-0.10.0:
  Successfully uninstalled hazm-0.10.0


In [21]:
# Install Hazm
!pip install hazm

Collecting hazm
  Using cached hazm-0.10.0-py3-none-any.whl.metadata (11 kB)
Using cached hazm-0.10.0-py3-none-any.whl (892 kB)
Installing collected packages: hazm
Successfully installed hazm-0.10.0


In [22]:
from hazm import Normalizer, word_tokenize, Stemmer, Lemmatizer, POSTagger

# Initialize Hazm components
normalizer = Normalizer()
stemmer = Stemmer()
lemmatizer = Lemmatizer()
tagger = POSTagger(model='/content/pos_tagger.model')

In [23]:
# Sample Farsi text
text = 'این یک متن نمونه است که شامل حروف فارسی می‌باشد.'

# Normalize the text
normalized_text = normalizer.normalize(text)

# Tokenize the text
tokens = word_tokenize(normalized_text)

# Perform POS tagging
pos_tags = tagger.tag(tokens)

# Display the results
print('Normalized Text:', normalized_text)
print('Tokens:', tokens)
print('POS Tags:', pos_tags)

Normalized Text: این یک متن نمونه است که شامل حروف فارسی می‌باشد.
Tokens: ['این', 'یک', 'متن', 'نمونه', 'است', 'که', 'شامل', 'حروف', 'فارسی', 'می\u200cباشد', '.']
POS Tags: [('این', 'PRON'), ('یک', 'NUM'), ('متن', 'NOUN,EZ'), ('نمونه', 'ADJ'), ('است', 'VERB'), ('که', 'SCONJ'), ('شامل', 'ADJ,EZ'), ('حروف', 'NOUN,EZ'), ('فارسی', 'NOUN'), ('می\u200cباشد', 'VERB'), ('.', 'PUNCT')]


In [24]:
# Normalize the words
df['word_normalized'] = df['word'].apply(normalizer.normalize)

**Build the Trie Data Structure:**
Implement a trie where each node represents a character, and the path from the root to a node represents a prefix or complete word.

In [25]:
class TrieNode:
    def __init__(self):
        self.children = {}
        self.is_end_of_word = False
        self.frequency = 0  # Store word frequency at the end node

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word, frequency):
        node = self.root
        for char in word:
            # If the character is not already a child, add it
            if char not in node.children:
                node.children[char] = TrieNode()
            node = node.children[char]
        # Mark the end of the word and store frequency
        node.is_end_of_word = True
        node.frequency = frequency

    def search(self, prefix):
        node = self.root
        for char in prefix:
            if char not in node.children:
                return None  # No words with this prefix
            node = node.children[char]
        return node

    def autocomplete(self, prefix):
        node = self.search(prefix)
        if not node:
            return []

        suggestions = []
        self._dfs(node, prefix, suggestions)
        # Sort suggestions by frequency in descending order
        suggestions.sort(key=lambda x: x[1], reverse=True)
        return suggestions

    def _dfs(self, node, prefix, suggestions):
        if node.is_end_of_word:
            suggestions.append((prefix, node.frequency))
        for char, child_node in node.children.items():
            self._dfs(child_node, prefix + char, suggestions)
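
To make the relationship between nodes and prefixes concrete, here is a minimal standalone sketch that uses plain nested dictionaries instead of the `TrieNode` class above. The `END` marker, the `build` and `has_prefix` helpers, and the toy word list with its frequencies are illustrative assumptions, not part of the notebook.

```python
END = None  # hypothetical end-of-word key; it cannot clash with a character key

def build(words):
    """Build a nested-dict trie from (word, frequency) pairs."""
    root = {}
    for word, freq in words:
        node = root
        for ch in word:
            # Each character maps to the sub-trie of words sharing this prefix.
            node = node.setdefault(ch, {})
        node[END] = freq  # mark a complete word and keep its frequency
    return root

def has_prefix(root, prefix):
    """Return True if at least one stored word starts with prefix."""
    node = root
    for ch in prefix:
        if ch not in node:
            return False
        node = node[ch]
    return True

toy = build([("خروج", 3), ("خروس", 2), ("خرید", 1)])
print(has_prefix(toy, "خرو"))  # a shared prefix of two words
print(has_prefix(toy, "خرس"))  # no stored word starts with this
```

The two words that share the first three characters reuse the same three nodes and only branch at the fourth character, which is what keeps prefix lookups proportional to the prefix length.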

**Populate the Trie with Data:**
Insert all normalized words from your data frame into the trie.

In [26]:
trie = Trie()

# Iterate over the two columns directly; this avoids building a Series per row
for word, frequency in zip(df['word_normalized'], df['count']):
    trie.insert(word, frequency)
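
One thing to watch: `insert` overwrites `frequency`, so if two raw spellings normalize to the same string, only the count of the row inserted last survives. Whether this happens in the dataset is not shown here. The following is a minimal sketch of summing counts per normalized form before insertion; `toy_normalize` is a hypothetical stand-in for `normalizer.normalize`, `merge_counts` is an illustrative helper, and the rows are made up.

```python
from collections import defaultdict

def toy_normalize(word):
    # Hypothetical stand-in for normalizer.normalize:
    # maps Arabic yeh (U+064A) and kaf (U+0643) to their Persian forms.
    return word.replace("\u064a", "\u06cc").replace("\u0643", "\u06a9")

def merge_counts(rows, normalize):
    """Sum the counts of all rows that share the same normalized word."""
    totals = defaultdict(int)
    for word, count in rows:
        totals[normalize(word)] += count
    return dict(totals)

rows = [
    ("\u0639\u0644\u064a", 5),        # spelled with Arabic yeh
    ("\u0639\u0644\u06cc", 7),        # same word spelled with Persian yeh
    ("\u06a9\u062a\u0627\u0628", 2),  # unrelated word
]
merged = merge_counts(rows, toy_normalize)
print(merged)
```

The merged dictionary can then be fed to `trie.insert` one item at a time, so each normalized word carries its combined frequency.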

**Implement the Autocomplete Functionality:**
Create a function that takes user input and returns the top suggestions.

In [27]:
def get_suggestions(trie, user_input, top_n=10):
    normalized_input = normalizer.normalize(user_input)
    suggestions = trie.autocomplete(normalized_input)
    return suggestions[:top_n]

# Example usage:
user_input = 'خرو'
suggestions = get_suggestions(trie, user_input)

for word, freq in suggestions:
    print(f'Suggested word: {word}, Frequency: {freq}')

Suggested word: خروجی, Frequency: 246342
Suggested word: خروج, Frequency: 119686
Suggested word: خروشنده, Frequency: 12137
Suggested word: خروش, Frequency: 6707
Suggested word: خروس, Frequency: 3900
Suggested word: خروشان, Frequency: 2412
Suggested word: خروشچف, Frequency: 1445
Suggested word: خروار, Frequency: 878
Suggested word: خروشی, Frequency: 424
Suggested word: خروشانی, Frequency: 252
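
`autocomplete` sorts every match before `get_suggestions` slices off the first `top_n`. For short prefixes with many matches, selecting only the top entries may be cheaper than a full sort. The following is a minimal sketch using the standard library's `heapq.nlargest`; the `top_suggestions` helper is hypothetical, and the sample pairs are taken from the output above.

```python
import heapq

def top_suggestions(pairs, top_n=10):
    """Return the top_n (word, frequency) pairs, highest frequency first."""
    return heapq.nlargest(top_n, pairs, key=lambda pair: pair[1])

# Unsorted (word, frequency) pairs, as a depth-first traversal might collect them.
pairs = [("خروس", 3900), ("خروجی", 246342), ("خروش", 6707), ("خروج", 119686)]

for word, freq in top_suggestions(pairs, top_n=2):
    print(f"Suggested word: {word}, Frequency: {freq}")
```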


In [28]:
!pip install transformers



**Import Libraries**

In [29]:
from transformers import AutoTokenizer, TFAutoModelForCausalLM

**Load the GPT2FA Model and Tokenizer**

In [30]:
tokenizer = AutoTokenizer.from_pretrained("HooshvareLab/gpt2-fa")
model = TFAutoModelForCausalLM.from_pretrained("HooshvareLab/gpt2-fa")

All model checkpoint layers were used when initializing TFGPT2LMHeadModel.

All the layers of TFGPT2LMHeadModel were initialized from the model checkpoint at HooshvareLab/gpt2-fa.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2LMHeadModel for predictions without further training.


**Encode the Input Text**

In [31]:
# Encode the input text as TensorFlow tensors, since the model was loaded with TFAutoModelForCausalLM
input_ids = tokenizer.encode(normalized_text, return_tensors='tf')

**Generate Predictions**

In [37]:
# Generate output
# TensorFlow does not require `eval` or `torch.no_grad`
# You can directly use the model's `generate` method

# Generate tokens after the input text
outputs = model.generate(
    input_ids,
    max_length=input_ids.shape[1] + 5,  # Generate up to 5 additional tokens
    do_sample=True,
    top_k=50,
    top_p=0.95,
    num_return_sequences=1
)

# Decode the generated tokens
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

# Extract the predicted next word(s)
predicted_text = generated_text[len(normalized_text):].strip()
predicted_words = predicted_text.split()
next_word = predicted_words[0] if predicted_words else ''
print(f"Predicted next word: {next_word}")

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:5 for open-end generation.


Predicted next word: نمونه‌هایی


**Function for Predicting Next Word**

In [39]:
def predict_next_word(input_text, model, tokenizer, normalizer, max_new_tokens=5):
    # Normalize the input text
    normalized_text = normalizer.normalize(input_text)
    # Encode the input text (TensorFlow expects `return_tensors='tf'`)
    input_ids = tokenizer.encode(normalized_text, return_tensors='tf')
    # Generate predictions
    outputs = model.generate(
        input_ids,
        max_length=input_ids.shape[1] + max_new_tokens,
        do_sample=True,
        top_k=50,
        top_p=0.95,
        num_return_sequences=1
    )
    # Decode the generated tokens
    generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    # Extract the predicted next word
    predicted_text = generated_text[len(normalized_text):].strip()
    predicted_words = predicted_text.split()
    next_word = predicted_words[0] if predicted_words else ''
    return next_word

# Example usage
input_text = "دیروز به"
next_word = predict_next_word(input_text, model, tokenizer, normalizer)
print(f"Input text: {input_text}")
print(f"Predicted next word: {next_word}")

Input text: دیروز به
Predicted next word: واسطه
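
The slice `generated_text[len(normalized_text):]` assumes that the decoded output starts with exactly the normalized prompt. The following is a minimal sketch of a more defensive extraction step that works on plain strings; the helper name `first_new_word` and its choice to return an empty string when the prompt is not reproduced verbatim are assumptions, not part of the notebook.

```python
def first_new_word(prompt, generated):
    """Return the first whitespace-delimited token that follows prompt in generated.

    Returns '' if generated does not start with prompt or adds nothing after it.
    """
    if not generated.startswith(prompt):
        # Assumed fallback: the decoder changed the prompt text, so a slice
        # at len(prompt) would cut at the wrong offset.
        return ""
    words = generated[len(prompt):].split()
    return words[0] if words else ""

print(first_new_word("دیروز به", "دیروز به واسطه این"))
```

Note that if the prompt ends in the middle of a word, the returned token is the remainder of that word rather than a new word.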


**Re-running the Prediction**

In [41]:
# predict_next_word is the same function defined in the previous cell.
# Because it samples (do_sample=True), re-running it on the same input can return a different word.

# Example usage
input_text = "دیروز به"
next_word = predict_next_word(input_text, model, tokenizer, normalizer)
print(f"Input text: {input_text}")
print(f"Predicted next word: {next_word}")

Input text: دیروز به
Predicted next word: اطلاعتان


**Generating Multiple Suggestions**

In [42]:
def predict_next_words(input_text, model, tokenizer, normalizer, num_suggestions=5):
    # Normalize the input text
    normalized_text = normalizer.normalize(input_text)
    # Encode the input text (TensorFlow expects `return_tensors='tf'`)
    input_ids = tokenizer.encode(normalized_text, return_tensors='tf')
    # Generate predictions
    outputs = model.generate(
        input_ids,
        max_length=input_ids.shape[1] + 5,
        do_sample=True,
        top_k=50,
        top_p=0.95,
        num_return_sequences=num_suggestions
    )
    suggestions = []
    for i in range(num_suggestions):
        generated_text = tokenizer.decode(outputs[i], skip_special_tokens=True)
        predicted_text = generated_text[len(normalized_text):].strip()
        predicted_words = predicted_text.split()
        next_word = predicted_words[0] if predicted_words else ''
        suggestions.append(next_word)
    # Remove duplicates and empty strings
    suggestions = list(set(filter(None, suggestions)))
    return suggestions

# Example usage
input_text = "من با اطمینان"
suggestions = predict_next_words(input_text, model, tokenizer, normalizer, num_suggestions=5)
print(f"Input text: {input_text}")
print("Predicted next words:", suggestions)

Input text: من با اطمینان
Predicted next words: ['می\u200cگویم', 'زیادی', 'می\u200cگوید:', 'می\u200cگویند', 'خاطر']
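
`list(set(...))` removes duplicates but does not keep the order in which the suggestions were generated, and tokens such as `می‌گوید:` keep their trailing punctuation. The following is a minimal sketch of an order-preserving clean-up step; the `clean_suggestions` helper and its default punctuation set are assumptions for illustration.

```python
def clean_suggestions(words, strip_chars=".,:;!?،؛؟"):
    """Strip surrounding punctuation, drop empty strings, and
    remove duplicates while keeping first-seen order."""
    stripped = (word.strip(strip_chars) for word in words)
    # dict.fromkeys keeps insertion order, unlike set().
    return list(dict.fromkeys(word for word in stripped if word))

raw = ["می\u200cگویم", "زیادی", "می\u200cگوید:", "", "زیادی", "خاطر"]
print(clean_suggestions(raw))
```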
