# 1. Text Corrector

The objective of this notebook is build a function that recibe as input any words and find the most similar world of an specific vocabulary.
In this case the vocabulary are the words in a pdf file called `Data Science from Scratch- First Principles with Python`.

We use the Levenshtein distance to find the closest word.

## Importing Libraries

In [1]:
import PyPDF2
import nltk
from nltk.corpus import stopwords
from gensim.models import Word2Vec
from sklearn.neighbors import NearestNeighbors
from collections import defaultdict,Counter
from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import word_tokenize
import Levenshtein


In [2]:
nltk.download("punkt")
path_pdf="Data Science from Scratch- First Principles with Python.pdf"

[nltk_data] Downloading package punkt to /home/dalopeza/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

## Function to extract text from pdf file

In [3]:
def extract_text_from_pdf(pdf_path):
    text = ""
    with open(pdf_path, "rb") as pdf_file:
        pdf_reader = PyPDF2.PdfReader(pdf_file)
        for page_num in range(17,len(pdf_reader.pages)):  #ignoring first 10 pages
            text += pdf_reader.pages[page_num].extract_text()


    return text

## Tokenazing text of pdf file to create the Vocabulary

In [4]:
stop_words_nltk = set(stopwords.words("english"))
def create_vocabulary(text):
    tokens = word_tokenize(text.lower())
    tokens = [w for w in tokens if w.isalpha()]
    tokens = [w for w in tokens if w not in stop_words_nltk]
    vocabulary = set(tokens)
    return vocabulary

## Creating function to find the closest word in the vocabulary based on Levenshtein distance

In [5]:
def auto_correct(input_string, vocabulary):
    input_words = input_string.split()
    corrected_words = []

    for word in input_words:
        # Find the closest word in the vocabulary based on Levenshtein distance
        closest_word = min(vocabulary, key=lambda x: Levenshtein.distance(word, x))
        corrected_words.append(closest_word)

    corrected_string = ' '.join(corrected_words)
    return corrected_string

## Execute the function

In [12]:


vocabulary=create_vocabulary(
                 extract_text_from_pdf(path_pdf)
             )

print("Welcome to the words corrector program (type 'esc' or 'exit' to close the program)")

while True:
    print("\nEnter a word: ")
    input_string=input("Enter a word: ")
    if input_string=="exit" or input_string=="esc":
        break
    else:
        closest_word=auto_correct(input_string,vocabulary)
        print(f'The closest word/sentence to "{input_string}" is "{closest_word}"')
                   
        


Welcome to the words corrector program (type 'esc' or 'exit' to close the program)

Enter a word: 
The closest word to helo is hero

Enter a word: 
The closest word to helo is hero

Enter a word: 
The closest word to macine is machine

Enter a word: 
The closest word to macine lernin is machine learning

Enter a word: 
The closest word to aprendizaje is predicate

Enter a word: 
The closest word to artificial is artificial

Enter a word: 
The closest word to artifical inteligencia is artificial intelligence

Enter a word: 


# 2.Text Suggestion 

Based on a few characters find the closest completed word.

In [17]:
# Step 3: Build a Trie data structure for vocabulary
class TrieNode:
    def __init__(self):
        self.children = defaultdict(TrieNode)
        self.is_end_of_word = False

def build_trie(vocabulary):
    root = TrieNode()
    for word in vocabulary:
        node = root
        for char in word:
            node = node.children[char]
        node.is_end_of_word = True
    return root

In [20]:
# Step 5: Define a function to suggest words based on partial input using Trie
def suggest_word(input_prefix, root):
    input_prefix = input_prefix.lower()
    
    # Check if the input prefix is in the vocabulary
    if input_prefix in vocabulary:
        return input_prefix

    node = root
    for char in input_prefix:
        if char in node.children:
            node = node.children[char]
        else:
            break
    
    suggestions = []

    def dfs(node, prefix):
        if node.is_end_of_word:
            suggestions.append(prefix)
        for char, child_node in node.children.items():
            dfs(child_node, prefix + char)

    dfs(node, input_prefix)

    if not suggestions:
        return "No matching words found"
    
    # Count word frequencies and find the suggestion with the highest frequency
    word_frequencies = Counter(vocabulary)
    suggestion = max(suggestions, key=lambda x: word_frequencies[x])
    return suggestion



In [22]:
# Build the Trie for the vocabulary
trie_root = build_trie(vocabulary)

print("Welcome to the text suggestion program (type 'esc' or 'exit' to close the program)")

while True:
    print("\nEnter some characters: ")
    input_prefix=input("Enter some characters: ")
    if input_prefix=="exit" or input_prefix=="esc":
        break
    else:
        suggestion = suggest_word(input_prefix, trie_root)
        print(f"The suggestion for '{input_prefix}' is '{suggestion}'")
                   

Welcome to the text suggestion program (type 'esc' or 'exit' to close the program)

Enter some characters: 
The suggestion for 'sci' is 'science'

Enter some characters: 
The suggestion for 'dat' is 'data'

Enter some characters: 
The suggestion for 'dat' is 'data'

Enter some characters: 
The suggestion for 'mach' is 'machine'

Enter some characters: 
The suggestion for 'machine l' is 'machine l'

Enter some characters: 
The suggestion for 'lear' is 'learn'

Enter some characters: 
The suggestion for 'inte' is 'interpret'

Enter some characters: 
The suggestion for 'expo' is 'exposition'

Enter some characters: 
The suggestion for 'clas' is 'class'

Enter some characters: 
The suggestion for 'reg' is 'regression'

Enter some characters: 
The suggestion for 're' is 'regression'

Enter some characters: 
The suggestion for 'log' is 'log'

Enter some characters: 
The suggestion for 'logi' is 'logic'

Enter some characters: 
The suggestion for 'logis' is 'logistic'

Enter some characters: 