### Corpora

A corpus is a dataset of text, here exposed as a collection of words.

In [1]:
# Example of loading a popular dataset using NLTK

# Run once to download the Brown corpus:
# import nltk
# nltk.download('brown')

from nltk.corpus import brown
print(brown.words())

['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]


### Tokenization

Splitting text into individual units called tokens.

In [6]:
# Example of word tokenization using NLTK

# Run once to download the tokenizer data
# (newer NLTK releases may need 'punkt_tab' instead):
# import nltk
# nltk.download('punkt')

from nltk.tokenize import word_tokenize

text = "Natural Language Processing with Python is super awesome"
tokens = word_tokenize(text)
print(tokens)

['Natural', 'Language', 'Processing', 'with', 'Python', 'is', 'super', 'awesome']


### Embeddings
The process of mapping words to vectors so that distances between vectors reflect relationships between words. An embedding is a vector representation of a word, learned by a neural model and typically used for similarity search.

In [2]:
# Example of generating word embeddings using gensim's Word2Vec
from gensim.models import Word2Vec

# Define training data
sentences = [['this', 'is', 'the', 'first', 'sentence', 'for', 'word2vec'],
              ['this', 'is', 'the', 'second', 'sentence'],
              ['yet', 'another', 'sentence'],
              ['one', 'more', 'sentence'],
              ['and', 'the', 'final', 'sentence']]

# Train a Word2Vec model
model = Word2Vec(sentences, min_count=1)

# Get an embedding for a word
word_embedding = model.wv['sentence']
print(word_embedding)

[-5.3622725e-04  2.3643136e-04  5.1033497e-03  9.0092728e-03
 -9.3029495e-03 -7.1168090e-03  6.4588725e-03  8.9729885e-03
 -5.0154282e-03 -3.7633716e-03  7.3805046e-03 -1.5334714e-03
 -4.5366134e-03  6.5540518e-03 -4.8601604e-03 -1.8160177e-03
  2.8765798e-03  9.9187379e-04 -8.2852151e-03 -9.4488179e-03
  7.3117660e-03  5.0702621e-03  6.7576934e-03  7.6286553e-04
  6.3508903e-03 -3.4053659e-03 -9.4640139e-04  5.7685734e-03
 -7.5216377e-03 -3.9361035e-03 -7.5115822e-03 -9.3004224e-04
  9.5381187e-03 -7.3191668e-03 -2.3337686e-03 -1.9377411e-03
  8.0774371e-03 -5.9308959e-03  4.5162440e-05 -4.7537340e-03
 -9.6035507e-03  5.0072931e-03 -8.7595852e-03 -4.3918253e-03
 -3.5099984e-05 -2.9618145e-04 -7.6612402e-03  9.6147433e-03
  4.9820580e-03  9.2331432e-03 -8.1579173e-03  4.4957981e-03
 -4.1370760e-03  8.2453608e-04  8.4986202e-03 -4.4621765e-03
  4.5175003e-03 -6.7869602e-03 -3.5484887e-03  9.3985079e-03
 -1.5776526e-03  3.2137157e-04 -4.1406299e-03 -7.6826881e-03
 -1.5080082e-03  2.46979
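
Since embeddings are mainly used for similarity search, it may help to see how two vectors are compared. The following is a minimal sketch in plain Python, assuming cosine similarity as the metric. The three-dimensional vectors and the helper names `cosine_similarity` and `most_similar` are hypothetical, invented for illustration, and are not taken from the trained model above.

```python
import math

def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional embeddings, made up for this example only.
toy_embeddings = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.85, 0.9, 0.15],
    "banana": [0.1, 0.05, 0.95],
}

def most_similar(word, embeddings):
    """Rank every other word by cosine similarity to `word`, highest first."""
    target = embeddings[word]
    scores = [(other, cosine_similarity(target, vec))
              for other, vec in embeddings.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

ranking = most_similar("king", toy_embeddings)
print(ranking)  # "queen" ranks above "banana"
```

With the gensim model trained above, `model.wv.most_similar('sentence')` performs the same kind of ranking, although with such a tiny training set the scores are unlikely to be meaningful.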

### Bag of Words (BoW)

A matrix of word frequencies: each row is a document and each column counts how often a vocabulary word occurs in it. Effective for document classification and spam filtering. Unsuitable for capturing linguistic nuances such as syntax and semantics, because it ignores word order. Can lead to high dimensionality when the vocabulary is large.

In [1]:
# Example of creating a Bag of Words model using scikit-learn's CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# Sample documents
documents = [
    "Hello, how are you?",
    "Winning is not everything, it's the only thing.",
    "Today is a beautiful day."
]

# Create a CountVectorizer object
vectorizer = CountVectorizer()

# Fit the model and transform the documents
bow_matrix = vectorizer.fit_transform(documents)

# Get the feature names
feature_names = vectorizer.get_feature_names_out()

# Create a DataFrame for better visualization
import pandas as pd
df = pd.DataFrame(bow_matrix.toarray(), columns=feature_names)
print(df)

   are  beautiful  day  everything  hello  how  is  it  not  only  the  thing  \
0    1          0    0           0      1    1   0   0    0     0    0      0   
1    0          0    0           1      0    0   1   1    1     1    1      1   
2    0          1    1           0      0    0   1   0    0     0    0      0   

   today  winning  you  
0      0        0    1  
1      0        1    0  
2      1        0    0  
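
To make the word-order limitation concrete, here is a small sketch that builds a bag of words by hand with `collections.Counter`. The helper `bag_of_words` and the two sentences are hypothetical. The regular expression is meant to mirror `CountVectorizer`'s default token pattern (lowercased tokens of two or more word characters), which is also why the single-letter word "a" does not appear as a column in the table above.

```python
import re
from collections import Counter

def bag_of_words(document):
    """Count lowercase word tokens of two or more characters."""
    return Counter(re.findall(r"\b\w\w+\b", document.lower()))

# Hypothetical sentences that differ only in word order.
doc_a = "The dog bites the man"
doc_b = "The man bites the dog"

bow_a = bag_of_words(doc_a)
bow_b = bag_of_words(doc_b)

print(bow_a)
print(bow_a == bow_b)  # True: both sentences map to the same counts
```

The two sentences mean opposite things, yet their bag-of-words representations are identical, so any model that sees only these counts cannot tell them apart.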


### TF-IDF

A matrix of importance weights: TF-IDF (term frequency-inverse document frequency) down-weights generic words that appear in many documents. Effective for keyword extraction, topic modeling, and many kinds of text classification, such as spam email filtering.

In [4]:
# Example of calculating TF-IDF using scikit-learn's TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample documents
documents = [
    "The sky is blue.",
    "The sun is bright today.",
    "The sun in the sky is bright.",
    "We can see the shining sun, the bright sun."
]

# Create a TfidfVectorizer object
vectorizer = TfidfVectorizer()

# Fit the model and transform the documents
tfidf_matrix = vectorizer.fit_transform(documents)

# Get the feature names
feature_names = vectorizer.get_feature_names_out()

# Create a DataFrame for better visualization
import pandas as pd
df = pd.DataFrame(tfidf_matrix.toarray(), columns=feature_names)
print(df)

       blue    bright       can        in        is       see   shining  \
0  0.659191  0.000000  0.000000  0.000000  0.420753  0.000000  0.000000   
1  0.000000  0.404129  0.000000  0.000000  0.404129  0.000000  0.000000   
2  0.000000  0.321846  0.000000  0.504235  0.321846  0.000000  0.000000   
3  0.000000  0.239102  0.374599  0.000000  0.000000  0.374599  0.374599   

        sky       sun       the     today        we  
0  0.519714  0.000000  0.343993  0.000000  0.000000  
1  0.000000  0.404129  0.330402  0.633146  0.000000  
2  0.397544  0.321846  0.526261  0.000000  0.000000  
3  0.000000  0.478204  0.390963  0.000000  0.374599  
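
The weights in the table can be recomputed by hand. The sketch below assumes `TfidfVectorizer`'s default settings: raw term counts, a smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The helper names `tokenize` and `tfidf_row` are hypothetical, and this is an illustration of the arithmetic rather than a copy of the library's implementation.

```python
import math
import re
from collections import Counter

documents = [
    "The sky is blue.",
    "The sun is bright today.",
    "The sun in the sky is bright.",
    "We can see the shining sun, the bright sun.",
]

def tokenize(text):
    """Lowercase the text and keep tokens of two or more word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())

tokenized = [tokenize(doc) for doc in documents]
n_docs = len(tokenized)

# Document frequency: the number of documents that contain each term.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf_row(tokens):
    """Return L2-normalized TF-IDF weights for one tokenized document."""
    counts = Counter(tokens)
    weights = {
        term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, count in counts.items()
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

row0 = tfidf_row(tokenized[0])
for term in sorted(row0):
    print(f"{term}: {row0[term]:.6f}")
```

Under these assumptions the printed values should match row 0 of the table above up to rounding. "the" appears in every document, so it gets the lowest weight in that row, while "blue" appears in only one document and gets the highest.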


### N-grams

Sequences of n items from a given sample of text or speech. Used to model the probability of each item in a sequence based on the occurrence of the preceding items. A tool for capturing context.

- Unigrams (1-grams): each item is considered on its own (e.g. "the", "cat").
- Bigrams (2-grams): sequences of two items (e.g. "the cat", "cat sat").
- Trigrams (3-grams): sequences of three items (e.g. "the cat sat").

In [5]:
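# Example of generating bigrams using NLTK
# (word_tokenize needs the tokenizer data downloaded in the Tokenization section)
import nltk
from nltk.util import bigrams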

text = "I need to write a sentence with some words"
tokens = nltk.word_tokenize(text)
bigrams_list = list(bigrams(tokens))

print(bigrams_list)

[('I', 'need'), ('need', 'to'), ('to', 'write'), ('write', 'a'), ('a', 'sentence'), ('sentence', 'with'), ('with', 'some'), ('some', 'words')]
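
The idea of modeling the probability of an item from the preceding one can be sketched with simple counts. The example below is a minimal illustration in plain Python. The toy corpus and the helpers `ngrams` and `bigram_probability` are hypothetical, and the estimate is a plain maximum-likelihood ratio with no smoothing, so it raises `ZeroDivisionError` for a previous word that never occurs in the corpus.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-item sequences in `tokens` as tuples."""
    return list(zip(*(tokens[i:] for i in range(n))))

# Hypothetical toy corpus, already split into tokens.
tokens = "the cat sat on the mat the cat slept".split()

unigram_counts = Counter(ngrams(tokens, 1))
bigram_counts = Counter(ngrams(tokens, 2))

def bigram_probability(previous, word):
    """Estimate P(word | previous) as count(previous, word) / count(previous)."""
    return bigram_counts[(previous, word)] / unigram_counts[(previous,)]

print(ngrams(tokens, 3)[:2])              # first two trigrams
print(bigram_probability("the", "cat"))   # "the" occurs 3 times, twice before "cat"
```

Here "the" is followed by "cat" in two of its three occurrences, so the estimate is 2/3. Larger values of n capture more context, but require far more text to estimate reliably.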
