<a href="https://colab.research.google.com/github/guilhermelaviola/NaturalLanguageProcessing/blob/main/In.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Analysis and Interpretation of Natural Language**

In [2]:
!pip install python-Levenshtein

Collecting python-Levenshtein
  Downloading python_Levenshtein-0.26.1-py3-none-any.whl.metadata (3.7 kB)
Collecting Levenshtein==0.26.1 (from python-Levenshtein)
  Downloading levenshtein-0.26.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.2 kB)
Collecting rapidfuzz<4.0.0,>=3.9.0 (from Levenshtein==0.26.1->python-Levenshtein)
  Downloading rapidfuzz-3.12.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (11 kB)
Downloading python_Levenshtein-0.26.1-py3-none-any.whl (9.4 kB)
Downloading levenshtein-0.26.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (162 kB)
Downloading rapidfuzz-3.12.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.1 MB)
Installing collected packages:

In [3]:
# Importing all the necessary libraries:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import numpy as np
from collections import Counter
from Levenshtein import distance as levenshtein_distance

In [9]:
# Downloading necessary NLTK resources:
nltk.download('punkt_tab')
nltk.download('stopwords')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [10]:
# Example text and query
# Opening and displaying the .txt file (UTF-8 is assumed, since the text contains accented Portuguese characters):
with open('texto.txt', encoding='utf-8') as f:
    text = f.read()

print(text)

query = 'importância do processamento de linguagem natural'

A crescente quantidade de dados textuais gerada diariamente tem impulsionado a importância do processamento de linguagem natural (PLN) e das técnicas de recuperação de informação (RI). Com o objetivo de extrair informações relevantes de grandes volumes de dados, a RI utiliza métodos avançados para encontrar e classificar documentos que atendam a consultas específicas dos usuários. A distância de Levenshtein e o modelo bag-of-words são duas abordagens distintas, mas eficazes, para medir a similaridade entre textos e consultas. A distância de Levenshtein calcula o número mínimo de operações necessárias para transformar uma string em outra, enquanto o modelo bag-of-words representa os textos como vetores de frequência de palavras, permitindo a comparação baseada na similaridade vetorial. Essas técnicas são fundamentais para melhorar a precisão dos sistemas de busca e a relevância dos resultados apresentados aos usuários.


In [11]:
# Defining the stopwords in Portuguese and tokenizing the words:
def preprocess_text(text):
    stop_words = set(stopwords.words('portuguese'))
    tokens = word_tokenize(text.lower())
    tokens = [word for word in tokens if word.isalnum() and word not in stop_words]
    return tokens
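
To see what this filtering step does without downloading any NLTK data, here is a minimal stand-in sketch. It assumes a simple regex tokenizer and a tiny hand-written stopword set (`TOY_STOPWORDS` and `toy_preprocess` are hypothetical names, and the set is far smaller than NLTK's Portuguese list), so its output can differ from `preprocess_text`.

```python
import re

# Hypothetical, hand-picked subset of Portuguese stopwords (NOT NLTK's list).
TOY_STOPWORDS = {"a", "o", "de", "do", "da", "e", "para"}

def toy_preprocess(text):
    """Lowercase, split into word tokens, and drop the toy stopwords."""
    # \w+ matches runs of letters, digits, and underscores, including accented letters.
    tokens = re.findall(r"\w+", text.lower())
    return [tok for tok in tokens if tok not in TOY_STOPWORDS]

print(toy_preprocess("Importância do processamento de linguagem natural"))
# ['importância', 'processamento', 'linguagem', 'natural']
```

One difference worth keeping in mind: the regex splits a hyphenated term into its parts, whereas the `isalnum()` filter in `preprocess_text` discards any token that still contains a hyphen.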

In [12]:
# Preprocessing the text and query:
tokens_text = preprocess_text(text)
tokens_query = preprocess_text(query)

In [13]:
# Implementing the ri_levenshtein function, which prints the character-level Levenshtein distance between the raw query and the raw document text:
def ri_levenshtein(query, text):
    distance = levenshtein_distance(query, text)
    print(f'Levenshtein Distance: {distance}')

ri_levenshtein(query, text)

Levenshtein Distance: 882
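
A raw distance is hard to interpret when the two strings differ greatly in length, because the Levenshtein distance can never be smaller than the difference between their lengths. The sketch below shows the dynamic-programming recurrence in pure Python and one possible normalization. The names `edit_distance` and `normalized_similarity` are hypothetical and are not part of the `Levenshtein` package, and dividing by the longer length is just one assumed way to scale the result.

```python
def edit_distance(a, b):
    """Levenshtein distance via the classic two-row dynamic program."""
    previous = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]

def normalized_similarity(a, b):
    """Scale the distance to [0, 1]; 1.0 means identical strings (assumed scaling)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest

print(edit_distance("kitten", "sitting"))               # 3
print(edit_distance("natural", "a linguagem natural"))  # 12
print(round(normalized_similarity("kitten", "sitting"), 4))
```

In the second call the shorter string occurs verbatim inside the longer one, so the distance is exactly the length difference (19 - 7 = 12). For a short query against a long document, comparing the query with individual sentences or windows of the text tends to give a more informative score than one whole-document distance.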


In [14]:
# Implementing information retrieval using bag-of-words:
def ri_bag_of_words(query_tokens, text_tokens):
    query_counter = Counter(query_tokens)
    text_counter = Counter(text_tokens)
    all_words = set(query_counter) | set(text_counter)

    # Create vector representations
    query_vector = np.array([query_counter[word] for word in all_words])
    text_vector = np.array([text_counter[word] for word in all_words])

    # Compute cosine similarity (defined as 0.0 when either vector has zero length)
    dot_product = np.dot(query_vector, text_vector)
    norm_query = np.linalg.norm(query_vector)
    norm_text = np.linalg.norm(text_vector)
    similarity = dot_product / (norm_query * norm_text) if norm_query and norm_text else 0.0

    print(f'Bag-of-Words Similarity: {similarity:.4f}')

ri_bag_of_words(tokens_query, tokens_text)

Bag-of-Words Similarity: 0.2031
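
A single score says little on its own; in a retrieval setting the same query is usually scored against several documents and the results are ranked. The following is a minimal sketch of that idea using only the standard library. The helpers `cosine_similarity` and `rank_documents` and the toy documents are assumptions made for illustration and do not come from the notebook or from `texto.txt`.

```python
import math
from collections import Counter

def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between two token lists using raw term counts."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    # Only words present in both lists contribute to the dot product.
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def rank_documents(query_tokens, documents):
    """Return (doc_id, score) pairs sorted from most to least similar."""
    scores = {doc_id: cosine_similarity(query_tokens, toks)
              for doc_id, toks in documents.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Toy, already-tokenized documents (illustrative data only).
docs = {
    "d1": ["processamento", "linguagem", "natural"],
    "d2": ["recuperação", "informação", "linguagem"],
    "d3": ["distância", "levenshtein"],
}
query_tokens = ["linguagem", "natural"]
for doc_id, score in rank_documents(query_tokens, docs):
    print(doc_id, round(score, 4))
# d1 0.8165
# d2 0.4082
# d3 0.0
```

Returning the score instead of printing it is what makes the ranking step possible; the printing can then happen wherever the results are consumed.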
