# <font color='#5581A5'> Natural Language Processing

<font color=#5581A5> **The objective of this project is twofold: to delve into the study of natural language processing (NLP) while advancing further in the exploration of Jupyter Notebook and Python.**

Several practices will be employed in this study:

1- A short summary of what certain commands do will appear below them.

2- All text will be written in English.

3- The data has been extracted from exercises on the Alura platform.

# About

We will build a Portuguese spell checker using Python and applying NLP techniques.

We will also discuss human-machine communication and see that this interaction is not direct: it goes through an intermediary, Natural Language Processing (NLP).

NLP has a wide range of applications, including personal assistants like Google Assistant, Apple's Siri, Alexa, and others.

We will also perform sentiment analysis, which classifies a person's opinion about a movie as positive or negative. Translation tools and search engines such as Google rely heavily on the same techniques, as will the spell checker we build.

In [1]:
# Imports

import nltk

In [2]:
# Opening file

with open("Dados/NLP/Corretor/corretor-master/artigos.txt", 'r', encoding='utf-8') as f:
    articles = f.read()

print(articles[:500])




imagem 

Temos a seguinte classe que representa um usuário no nosso sistema:

java

Para salvar um novo usuário, várias validações são feitas, como por exemplo: Ver se o nome só contém letras, [**o CPF só números**] e ver se o usuário possui no mínimo 18 anos. Veja o método que faz essa validação:

java 

Suponha agora que eu tenha outra classe, a classe `Produto`, que contém um atributo nome e eu quero fazer a mesma validação que fiz para o nome do usuário: Ver se só contém letras. E aí? Vou


In [3]:
# Tokenize

# nltk.download('punkt')

separate_words_text = nltk.tokenize.word_tokenize(articles[:100])
print(separate_words_text)

['imagem', 'Temos', 'a', 'seguinte', 'classe', 'que', 'representa', 'um', 'usuário', 'no', 'nosso', 'sistema', ':', 'java', 'Para', 'salvar', 'u']


In [4]:
# Creating function to separate words from punctuation


def separate_words(list_tokens):
    list_words = []
    for token in list_tokens:
        if token.isalpha():
            list_words.append(token)
    return list_words

# testing

separate_words(separate_words_text)

['imagem',
 'Temos',
 'a',
 'seguinte',
 'classe',
 'que',
 'representa',
 'um',
 'usuário',
 'no',
 'nosso',
 'sistema',
 'java',
 'Para',
 'salvar',
 'u']
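
The filter above relies on `str.isalpha()`, which is true only when every character in the token is a letter. Accented Portuguese letters count as letters, while punctuation and tokens containing digits do not. A minimal sketch of the same filter written as a list comprehension, using a made-up token list (`sample_tokens` is hypothetical and not taken from the corpus):

```python
# Hypothetical tokens, chosen only to illustrate the filter.
sample_tokens = ['usuário', ':', 'CPF', '18', 'não', 'v2', ',']

# Same logic as separate_words: keep tokens made only of letters.
only_words = [token for token in sample_tokens if token.isalpha()]

print(only_words)  # ['usuário', 'CPF', 'não']
```

Note that a token such as `v2` is dropped entirely rather than trimmed, which may or may not be what a given corpus needs.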

In [5]:
# Applying it to our corpus

list_tokens = nltk.tokenize.word_tokenize(articles)
list_words = separate_words(list_tokens)
len(list_words)

403104

In [6]:
# Applying normalize

def normalize_text(list_words):
    list_normalized = []
    for word in list_words:
        list_normalized.append(word.lower())
    return list_normalized

# Testing

normalize_text(separate_words_text)

['imagem',
 'temos',
 'a',
 'seguinte',
 'classe',
 'que',
 'representa',
 'um',
 'usuário',
 'no',
 'nosso',
 'sistema',
 ':',
 'java',
 'para',
 'salvar',
 'u']

In [7]:
# Applying normalization on our article

list_normalized = normalize_text(list_words)

# Removing duplicate words

list_words_treated = set(list_normalized)

# Checking result

len(list_words_treated)

18465
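
Lowercasing before deduplicating matters because `'Temos'` and `'temos'` would otherwise count as two different vocabulary entries. A small sketch on a made-up list (`sample_words` is hypothetical) shows the effect:

```python
# Hypothetical words, not taken from the corpus.
sample_words = ['Temos', 'temos', 'Java', 'java', 'classe']

# Normalize first, then deduplicate.
normalized = [word.lower() for word in sample_words]
vocabulary = set(normalized)

print(len(sample_words), len(vocabulary))  # 5 3
```

A `set` discards both order and counts, so the normalized list is still needed wherever word frequencies are computed.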

In [8]:
# Creating variables to store frequency and the number of words in our list

frequency = nltk.FreqDist(list_normalized)
total_words = len(list_normalized)

# Creating function to generate words

def generate_words(word):
    slices = []
    for i in range(len(word) + 1):
        slices.append((word[:i], word[i:]))
    
    generated_words = insert_words(slices)
    return generated_words

# Creating function to insert a letter at every position of the word

def insert_words(slices):
    new_words = []
    letters = 'abcdefghijklmnopqrstuvwxyzàáâãèéêìíîòóôõùúûç'

    for L, R in slices:  
        for letter in letters:
            new_words.append(L + letter + R)
    
    return new_words

# Creating spell checker function

def spell_checker(word):
    generated_words = generate_words(word)
    right_word = max(generated_words, key=probability)
    
    return right_word

# Function to estimate the probability that a word is the right one (its relative frequency in the corpus)

def probability(word):
    return frequency[word] / total_words

# Example usage
tested_word = 'lgica'
print(spell_checker(tested_word))

lógica
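
To make the mechanics of the cell above concrete, here is a minimal sketch that runs without NLTK. It uses `collections.Counter` as a stand-in for `nltk.FreqDist`, and the counts in `toy_frequency` and `toy_total` are invented for illustration only.

```python
from collections import Counter

letters = 'abcdefghijklmnopqrstuvwxyzàáâãèéêìíîòóôõùúûç'
word = 'lgica'

# Every way to cut the word in two: ('', 'lgica'), ('l', 'gica'), ...
slices = [(word[:i], word[i:]) for i in range(len(word) + 1)]

# One candidate per (slice, letter) pair: the letter goes between L and R.
candidates = [L + letter + R for L, R in slices for letter in letters]

# Hypothetical counts standing in for the real corpus frequencies.
toy_frequency = Counter({'lógica': 96})
toy_total = 1000

# Pick the candidate with the highest relative frequency.
best = max(candidates, key=lambda w: toy_frequency[w] / toy_total)

print(len(slices), len(candidates), best)  # 6 264 lógica
```

A word of length n yields n + 1 slices, and each slice produces one candidate per letter in the 44-letter alphabet, which gives 6 × 44 = 264 candidates here.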


In [9]:
# Creating testing scenario to evaluate how accurate the model is

def data_test_creation(file_name):
    list_test_words = []
    with open(file_name, 'r', encoding='utf-8') as f:
        for line in f:
            parts = line.split()
            if len(parts) >= 2:
                right_word = parts[0]
                wrong_word = parts[1]
                list_test_words.append((right_word, wrong_word))
    return list_test_words

# Example usage
test_list = data_test_creation('Dados/NLP/Corretor/corretor-master/palavras.txt')
print(test_list)

[('podemos', 'pyodemos'), ('esse', 'esje'), ('já', 'jrá'), ('nosso', 'nossov'), ('são', 'sãêo'), ('dos', 'dosa'), ('muito', 'muifo'), ('imagem', 'iômagem'), ('sua', 'ósua'), ('também', 'tambéùm'), ('ele', 'eme'), ('fazer', 'èazer'), ('temos', 'temfs'), ('essa', 'eàssa'), ('quando', 'quaôdo'), ('vamos', 'vamvos'), ('sobre', 'hsobre'), ('java', 'sjava'), ('das', 'daõs'), ('agora', 'agorah'), ('está', 'eòtá'), ('cada', 'céda'), ('mesmo', 'zmesmo'), ('nos', 'noâ'), ('forma', 'fobma'), ('seja', 'sejéa'), ('então', 'enêão'), ('criar', 'èriar'), ('código', 'cóeigo'), ('caso', 'casío'), ('exemplo', 'áexemplo'), ('tem', 'tĩem'), ('usuário', 'usuárôio'), ('dados', 'dfados'), ('python', 'pgthon'), ('nossa', 'nossah'), ('além', 'alémè'), ('assim', 'asõim'), ('ter', 'teb'), ('até', 'atĩ'), ('bem', 'âem'), ('design', 'desigen'), ('trabalho', 'trabalàho'), ('foi', 'foo'), ('apenas', 'apenaũ'), ('empresa', 'empresà'), ('valor', 'valíor'), ('será', 'serr'), ('entre', 'entke'), ('método', 'méqodo'), ('p

In [10]:
# Creating evaluator function

def evaluate(test):
    total_words = len(test)
    hit = 0
    for right, wrong in test:
        corrected_word = spell_checker(wrong)
        if corrected_word == right:
            hit += 1
    hit_ratio = hit/total_words
    print(f'{round((hit_ratio * 100),2)} % of {total_words} words')

evaluate(test_list)

1.08 % of 186 words
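
A likely reason for the low score, judging from the test pairs printed earlier: most wrong words contain an extra or substituted letter, and inserting a letter can never repair those. When no candidate is in the vocabulary, every candidate has probability zero and `max` returns the first one generated. The sketch below illustrates this with an invented count for `'podemos'` (`toy_frequency` and `insertions` are hypothetical names):

```python
from collections import Counter

letters = 'abcdefghijklmnopqrstuvwxyzàáâãèéêìíîòóôõùúûç'
toy_frequency = Counter({'podemos': 50})  # hypothetical count

def insertions(word):
    """All strings obtained by inserting one letter into word."""
    return [word[:i] + letter + word[i:]
            for i in range(len(word) + 1)
            for letter in letters]

candidates = insertions('pyodemos')

# All candidates score 0, so max falls back to the first candidate.
best = max(candidates, key=lambda w: toy_frequency[w])

print('podemos' in candidates, best)  # False apyodemos
```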


In [14]:
# Creating function to delete a character, for when someone types an extra letter

def delete_character(slices):
    new_words = []

    for L, R in slices:  
        new_words.append(L + R[1:])
    
    return new_words

# Concatenating the lists of letter deletion and insertion cases

def generate_words(word):
    slices = []
    for i in range(len(word) + 1):
        slices.append((word[:i], word[i:]))
    
    generated_words = insert_words(slices)
    generated_words += delete_character(slices)
    
    return generated_words

evaluate(test_list)

41.4 % of 186 words
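
The deletion step can be traced on one of the test pairs. This sketch repeats the slicing logic on `'pyodemos'` and checks that removing one character recovers `'podemos'`:

```python
word = 'pyodemos'

# Same slicing as generate_words.
slices = [(word[:i], word[i:]) for i in range(len(word) + 1)]

# Drop the first character of the right-hand part of each slice.
deletions = [L + R[1:] for L, R in slices]

print(len(deletions), 'podemos' in deletions, deletions[-1])  # 9 True pyodemos
```

The last slice has an empty right-hand part, so it yields the original word unchanged. That is harmless here, and a variant could skip it with an `if R` condition.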


In [16]:
# Creating function for when someone types a wrong letter (a substitution, not an extra or missing letter)

def change_character(slices):
    new_words = []
    letters = 'abcdefghijklmnopqrstuvwxyzàáâãèéêìíîòóôõùúûç'

    for L, R in slices:  
        for letter in letters:
            new_words.append(L + letter + R[1:])
    
    return new_words

# Changing generate words to include change letter function

def generate_words(word):
    slices = []
    for i in range(len(word) + 1):
        slices.append((word[:i], word[i:]))
    
    generated_words = insert_words(slices)
    generated_words += delete_character(slices)
    generated_words += change_character(slices)
    
    return generated_words

evaluate(test_list)

76.34 % of 186 words
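
Substitution can be traced the same way. The sketch below applies the logic of `change_character` to the test pair `('esse', 'esje')`:

```python
letters = 'abcdefghijklmnopqrstuvwxyzàáâãèéêìíîòóôõùúûç'
word = 'esje'

# Same slicing as generate_words.
slices = [(word[:i], word[i:]) for i in range(len(word) + 1)]

# Replace the first character of the right-hand part with each letter.
substitutions = [L + letter + R[1:] for L, R in slices for letter in letters]

print(len(substitutions), 'esse' in substitutions)  # 220 True
```

For the final slice the right-hand part is empty, so those 44 candidates simply append a letter and duplicate what the insertion step already produces.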


In [19]:
# Creating function for when someone swaps two adjacent letters

def invert_character(slices):
    new_words = []

    for L, R in slices:  
        if len(R) > 1:
            new_words.append(L + R[1] + R[0] + R[2:])
    
    return new_words

# Changing generate words to include invert character function

def generate_words(word):
    slices = []
    for i in range(len(word) + 1):
        slices.append((word[:i], word[i:]))
    
    generated_words = insert_words(slices)
    generated_words += delete_character(slices)
    generated_words += change_character(slices)
    generated_words += invert_character(slices)
    
    return generated_words

evaluate(test_list)

76.34 % of 186 words
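
A quick trace of the swap step, using a made-up misspelling (`'lgóica'` is hypothetical and does not come from the test file):

```python
word = 'lgóica'  # hypothetical misspelling of 'lógica'

# Same slicing as generate_words.
slices = [(word[:i], word[i:]) for i in range(len(word) + 1)]

# Swap the first two characters of the right-hand part, when it has two.
transpositions = [L + R[1] + R[0] + R[2:] for L, R in slices if len(R) > 1]

print(len(transpositions), 'lógica' in transpositions)  # 5 True
```

A word of length n has n - 1 adjacent pairs, so this step adds few candidates. It only changes the score if the test set contains swapped-letter errors.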


In [28]:
# Our spell checker still has a problem: it can only return words that exist in the corpus vocabulary, so some words remain unknown to it. Let's measure how often the returned correction is not in the vocabulary

def evaluate(test, vocabulary):
    total_words = len(test)
    hit = 0
    unknown = 0
    for right, wrong in test:
        corrected_word = spell_checker(wrong)
        unknown += (corrected_word not in vocabulary)
        if corrected_word == right:
            hit += 1
    hit_ratio = hit/total_words
    unknown_ratio = unknown/total_words
    
    print(f'{round((hit_ratio * 100),2)} % of {total_words} words - unknown {round((unknown_ratio * 100),2)} % ')

evaluate(test_list, list_words_treated)

76.34 % of 186 words - unknown 16.13 % 
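
The `unknown` counter above tests the word returned by `spell_checker`. A related measure, sketched here on the assumption that one also wants to know how many *expected* words never occur in the corpus, tests `right` instead. The vocabulary and test pairs below are invented for illustration:

```python
# Hypothetical vocabulary and test pairs (right_word, wrong_word).
vocabulary = {'podemos', 'esse', 'lógica'}
test_pairs = [('podemos', 'pyodemos'), ('esse', 'esje'), ('kotlin', 'kotlim')]

# Count expected words that the corpus has never seen.
unknown = sum(right not in vocabulary for right, _ in test_pairs)

print(f'{unknown / len(test_pairs):.2%}')  # 33.33%
```

Words counted this way can never be corrected by a frequency-based checker, whatever edit operations it generates.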


In [27]:
# Another problem is when the person makes more than one typing error in the same word. Let's fix it

def generated_words_better(g_words):
    new_words = []
    for word in g_words:
        new_words += generate_words(word)
    return new_words

# Creating new spell checker

def spell_checker_new(word):
    generated_words = generate_words(word)
    better_words = generated_words_better(generated_words)
    all_words = set(generated_words + better_words)
    candidates = [word]
    for candidate in all_words:
        if candidate in list_words_treated:
            candidates.append(candidate)
    right_word = max(candidates, key=probability)
    return right_word

tested_word = 'lóiigica'
spell_checker_new(tested_word)

'lógica'
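
The two-edit search can also be written with sets, which removes duplicate candidates as they are generated. This is a self-contained variant for illustration, not the notebook's implementation: `edits1`, `correct`, and the counts in `toy_frequency` are hypothetical, and `collections.Counter` stands in for `nltk.FreqDist`.

```python
from collections import Counter

letters = 'abcdefghijklmnopqrstuvwxyzàáâãèéêìíîòóôõùúûç'
toy_frequency = Counter({'lógica': 96, 'logo': 40})  # hypothetical counts

def edits1(word):
    """All strings one insertion, deletion, substitution or swap away."""
    slices = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    inserts = {L + c + R for L, R in slices for c in letters}
    deletes = {L + R[1:] for L, R in slices if R}
    replaces = {L + c + R[1:] for L, R in slices if R for c in letters}
    swaps = {L + R[1] + R[0] + R[2:] for L, R in slices if len(R) > 1}
    return inserts | deletes | replaces | swaps

def correct(word):
    """Most frequent known word within two edits, else the input itself."""
    one = edits1(word)
    two = {w2 for w1 in one for w2 in edits1(w1)}
    known = [w for w in one | two if w in toy_frequency]
    return max(known, key=toy_frequency.__getitem__, default=word)

print(correct('lóiigica'))  # lógica
```

The second pass applies every edit to every first-pass candidate, so the candidate count grows roughly quadratically. Filtering against the vocabulary before calling `max` keeps the final comparison small.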