## Hidden Markov Models (HMMs)

Statistical models that generate a sequence of symbols or quantities. They are best known for their application to temporal pattern recognition, such as speech, handwriting, gesture recognition, part-of-speech tagging, and bioinformatics.

In [1]:
from hmmlearn import hmm
import numpy as np

# Example: Modeling a simple weather system
states = ["Sunny", "Rainy"]
n_states = len(states)

observations = ["walk", "shop", "clean"]
n_observations = len(observations)

# Start probability
start_probability = np.array([0.6, 0.4])

# Transition probability
transition_probability = np.array([
  [0.7, 0.3],
  [0.4, 0.6]
])

# Emission probability
emission_probability = np.array([
  [0.3, 0.4, 0.3],
  [0.1, 0.3, 0.6]
])

# Create HMM
# n_trials=1: each observation row below is a single one-hot draw, so its counts sum to 1
model = hmm.MultinomialHMM(n_components=n_states, n_trials=1)
model.startprob_ = start_probability
model.transmat_ = transition_probability
model.emissionprob_ = emission_probability

# Observation sequence
# Each observation is a one-hot encoded vector representing "walk", "shop", "clean"
obs_seq = np.array([
    [1, 0, 0],  # walk
    [0, 0, 1],  # clean
    [0, 1, 0],  # shop
    [0, 1, 0],  # shop
    [0, 0, 1],  # clean
    [1, 0, 0]   # walk
])

# Predict the hidden states of the given observation sequence
logprob, states_seq = model.decode(obs_seq, algorithm="viterbi")

# Map the state indices to state names
state_names = [states[state_idx] for state_idx in states_seq]

print("The states are:", ", ".join(state_names))

MultinomialHMM has undergone major changes. The previous version was implementing a CategoricalHMM (a special case of MultinomialHMM). This new implementation follows the standard definition for a Multinomial distribution (e.g. as in https://en.wikipedia.org/wiki/Multinomial_distribution). See these issues for details:
https://github.com/hmmlearn/hmmlearn/issues/335
https://github.com/hmmlearn/hmmlearn/issues/340


The states are: Sunny, Sunny, Sunny, Sunny, Sunny, Sunny
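
To make the decoding step less of a black box, here is a minimal pure-Python sketch of the Viterbi algorithm applied to the same weather parameters. The function name `viterbi` and the integer coding of the observations (0 = walk, 1 = shop, 2 = clean) are assumptions made for this illustration; this is not hmmlearn's internal implementation.

```python
import math

# Parameters copied from the weather example above
states = ["Sunny", "Rainy"]
start = [0.6, 0.4]
trans = [[0.7, 0.3], [0.4, 0.6]]
emit = [[0.3, 0.4, 0.3], [0.1, 0.3, 0.6]]  # columns: walk, shop, clean


def viterbi(obs, start, trans, emit):
    """Return (log-probability, most likely state path) for integer-coded observations."""
    n = len(start)
    # Log-score of the best path ending in each state after the first observation
    score = [math.log(start[s]) + math.log(emit[s][obs[0]]) for s in range(n)]
    backpointers = []
    for o in obs[1:]:
        new_score, pointers = [], []
        for s in range(n):
            # Best predecessor state for landing in state s
            prev = max(range(n), key=lambda p: score[p] + math.log(trans[p][s]))
            pointers.append(prev)
            new_score.append(score[prev] + math.log(trans[prev][s]) + math.log(emit[s][o]))
        score = new_score
        backpointers.append(pointers)
    # Trace the best path backwards from the best final state
    last = max(range(n), key=lambda s: score[s])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return score[last], path


obs = [0, 2, 1, 1, 2, 0]  # walk, clean, shop, shop, clean, walk
logprob, path = viterbi(obs, start, trans, emit)
print("The states are:", ", ".join(states[s] for s in path))
```

With these parameters the Sunny state is sticky enough (0.7 self-transition) that staying in it explains even the "clean" observations better than switching to Rainy, which is why every decoded state is Sunny.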


## Latent Dirichlet Allocation (LDA)

A generative statistical model designed to uncover hidden thematic structure in large collections of text. By identifying unobserved groupings, known as topics, LDA exposes the themes that run through a corpus, which makes it useful for tasks such as document classification, information retrieval, and content summarization.

In [1]:
from gensim import corpora, models
import gensim

# Sample documents
doc_a = "The cat sat on the hat"
doc_b = "The dog ate the cat and the hat"
# Compile documents
doc_set = [doc_a, doc_b]

# Tokenize documents
texts = [doc.split() for doc in doc_set]

# Create a dictionary from the tokens
dictionary = corpora.Dictionary(texts)

# Convert to bag-of-words format
corpus = [dictionary.doc2bow(text) for text in texts]

# Apply LDA
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=20)

# Print topics
for topic in ldamodel.print_topics(num_topics=2, num_words=3):
    print(topic)

(0, '0.143*"The" + 0.143*"hat" + 0.143*"cat"')
(1, '0.203*"the" + 0.120*"cat" + 0.120*"hat"')
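
The topic listing contains both "The" and "the" because `doc.split()` keeps the original casing, so the two spellings end up as separate vocabulary entries. The sketch below shows the effect using only the standard library. The helper `bag_of_words` is hypothetical and only stands in for the tokenization step; it is not part of gensim.

```python
from collections import Counter

doc_set = ["The cat sat on the hat", "The dog ate the cat and the hat"]


def bag_of_words(doc, lowercase):
    """Count whitespace-separated tokens, optionally lowercasing first."""
    text = doc.lower() if lowercase else doc
    return Counter(text.split())


raw_vocab = set()
norm_vocab = set()
for doc in doc_set:
    raw_vocab |= set(bag_of_words(doc, lowercase=False))
    norm_vocab |= set(bag_of_words(doc, lowercase=True))

# Tokens that exist only because of casing differences
print(sorted(raw_vocab - norm_vocab))
print(len(raw_vocab), len(norm_vocab))
```

Lowercasing before tokenizing (for example `doc.lower().split()`) merges such duplicates, which usually gives the topic model cleaner counts to work with.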


## Word2Vec & Doc2Vec

Word2Vec (also known as neural word embeddings) and Doc2Vec are algorithms for producing embeddings, that is, vector representations of words and documents. These models capture semantic relationships between words and can be used for a variety of NLP tasks.

In [2]:
from gensim.models import Word2Vec, Doc2Vec
from gensim.models.doc2vec import TaggedDocument
from nltk.tokenize import word_tokenize
import nltk

# Sample text data (for illustration purposes, replace with your dataset)
texts = [
    "Word2Vec is a technique to compute vector representations of words",
    "Doc2Vec is an extension of Word2Vec to compute vector representations of documents",
    "This is a simple example of Word2Vec and Doc2Vec models"
]

# Tokenize the text data
# (word_tokenize needs NLTK tokenizer data: nltk.download("punkt"), or "punkt_tab" on newer NLTK releases)
tokenized_texts = [nltk.word_tokenize(text.lower()) for text in texts]

# Word2Vec Training
# Passing sentences= to the constructor already builds the vocabulary and trains,
# so no separate train() call is needed
word2vec_model = Word2Vec(sentences=tokenized_texts, vector_size=50, window=5, min_count=1, workers=2, epochs=10)

# Word2Vec Inference
print("Word2Vec Inference:")
word_vector = word2vec_model.wv['word2vec']  # Get vector for 'word2vec'
print(f"Vector for 'word2vec':{word_vector}")

# Prepare data for Doc2Vec
tagged_data = [TaggedDocument(words=word_tokenize(_d.lower()), tags=[str(i)]) for i, _d in enumerate(texts)]

# Doc2Vec Training
doc2vec_model = Doc2Vec(vector_size=50, window=2, min_count=1, workers=2, epochs=10)
doc2vec_model.build_vocab(tagged_data)
doc2vec_model.train(tagged_data, total_examples=doc2vec_model.corpus_count, epochs=doc2vec_model.epochs)

# Doc2Vec Inference
print("Doc2Vec Inference:")
doc_vector = doc2vec_model.infer_vector(tokenized_texts[0])  # Infer vector for the first document
print(f"Vector for the first document:{doc_vector}")

Word2Vec Inference:
Vector for 'word2vec':[-0.01632791  0.00899188 -0.00827343  0.00165943  0.01698695 -0.00892656
  0.00904765 -0.01356449 -0.00712328  0.01878826 -0.00314963  0.0006321
 -0.00827048 -0.0153644  -0.0030088   0.00494147 -0.00176688  0.01107519
 -0.00551609  0.00450278  0.01090864  0.01669597 -0.0028803  -0.01841042
  0.00874657  0.00113789  0.01489176 -0.00161231 -0.0052919  -0.01750233
 -0.00170897  0.00563109  0.01079674  0.01408598 -0.01140993  0.00371207
  0.01219733 -0.00959853 -0.00621573  0.01358914  0.00326036  0.00038111
  0.00692987  0.00043471  0.01925062  0.01012234 -0.01782598 -0.01408062
  0.00180371  0.01278775]
Doc2Vec Inference:
Vector for the first document:[-0.00337606  0.005096   -0.00744679 -0.00469759 -0.00609954 -0.00254945
 -0.00614331  0.00334415  0.00171494  0.00324126 -0.00612217 -0.00508196
  0.00493752 -0.00348908 -0.00076603  0.005182    0.00626754 -0.00412648
  0.00528952  0.00979717 -0.00860548  0.00278074 -0.00857871 -0.00922783
 -0.0068
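
Semantic relationships between embeddings are usually measured with cosine similarity, which is also what gensim's `wv.similarity` computes. The sketch below is a minimal standard-library version. The two-dimensional vectors are made up for illustration and are not model output.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy vectors (not taken from the trained models above)
east = [1.0, 0.0]
north = [0.0, 1.0]
north_east = [1.0, 1.0]

print(cosine_similarity(east, north_east))  # partially aligned directions
print(cosine_similarity(east, north))       # orthogonal directions
```

Values near 1 mean the vectors point the same way, values near 0 mean they are unrelated. With a corpus as small as the three sentences above, the learned vectors stay close to their random initialization, so similarities computed from them should not be read as meaningful.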

## GloVe

Unsupervised learning algorithm for generating vector representations of words.

In [1]:
import os
import zipfile

import requests
from gensim.scripts.glove2word2vec import glove2word2vec
from gensim.models import KeyedVectors

# URL of the GloVe file to download
glove_url = "http://nlp.stanford.edu/data/glove.6B.zip"
glove_zip_file = "glove.6B.zip"
glove_file = "glove.6B.100d.txt"
word2vec_output_file = "glove.6B.100d.word2vec"

# Download GloVe vectors, streaming to disk instead of holding the whole archive in memory
if not os.path.exists(glove_zip_file):
    print("Downloading GloVe vectors...")
    with requests.get(glove_url, stream=True) as response:
        response.raise_for_status()
        with open(glove_zip_file, "wb") as f:
            for chunk in response.iter_content(chunk_size=1 << 20):
                f.write(chunk)

# Extract the 100-dimensional vectors from the archive
if not os.path.exists(glove_file):
    with zipfile.ZipFile(glove_zip_file, "r") as zip_ref:
        zip_ref.extract(glove_file)

# Convert the GloVe file format to Word2Vec
glove2word2vec(glove_file, word2vec_output_file)

# Load the converted vectors
model = KeyedVectors.load_word2vec_format(word2vec_output_file, binary=False)

# Explore the model
print(model['computer'])  # Output the vector for 'computer'
print(model.most_similar('computer'))  # Find similar words

Downloading GloVe vectors...


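
The conversion step exists because the two text formats differ only by a header. A GloVe file has one word per line followed by its vector components, while the word2vec text format starts with a line giving the vocabulary size and the vector dimensionality. The sketch below illustrates that difference on an in-memory sample. The function `glove_to_word2vec_text` is a hypothetical stand-in written for this example, not gensim's implementation, and the two-word sample is made up.

```python
import io


def glove_to_word2vec_text(glove_lines):
    """Prepend the '<vocab_size> <dimensions>' header used by the word2vec text format."""
    rows = [line.rstrip("\n") for line in glove_lines if line.strip()]
    # Every row is a word followed by its vector components
    dims = len(rows[0].split(" ")) - 1
    return "\n".join([f"{len(rows)} {dims}"] + rows) + "\n"


# Made-up two-word sample in GloVe's layout, for illustration only
sample = io.StringIO("cat 0.1 0.2 0.3\ndog 0.4 0.5 0.6\n")
converted = glove_to_word2vec_text(sample)
print(converted)
```

The first line of the converted text is the header that `KeyedVectors.load_word2vec_format` expects. The vector rows themselves are unchanged.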