# Gensim FastText

Gensim provides a convenient implementation of FastText, which can be used to train word vectors on a custom corpus or to use pre-trained models for various tasks such as finding similar words and computing similarity scores between words.

### Installing Gensim

> pip install gensim

### Using Gensim FastText

Here are the key steps to use Gensim FastText:

1. Loading a Pre-trained Model
2. Training a FastText Model on a Custom Corpus
3. Finding Similar Words
4. Computing Similarity Scores Between Words

### Links

[Migrating-from-Gensim-3.x-to-4](https://github.com/piskvorky/gensim/wiki/Migrating-from-Gensim-3.x-to-4)

[cub-200-2011_paper](https://paperswithcode.com/dataset/cub-200-2011)

[cub-200-2011_dataset](https://www.kaggle.com/datasets/wenewone/cub2002011?resource=download)

[Microsoft C++ Build Tools](https://visualstudio.microsoft.com/visual-cpp-build-tools/)

### 1. Loading a Pre-trained Model

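Gensim can load pre-trained FastText vectors through `gensim.downloader`. For example, the following loads the `fasttext-wiki-news-subwords-300` vectors published by Facebook. The result is a `KeyedVectors` object rather than a full trainable model: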

In [4]:
import gensim.downloader as api

# Load pre-trained FastText model
model = api.load('fasttext-wiki-news-subwords-300')

print(model)

KeyedVectors<vector_size=300, 999999 keys>


### 2. Training a FastText Model on a Custom Corpus

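You can also train a FastText model on your own corpus. In the example below each "sentence" is a single animal name, so the model has little or no surrounding context to learn from, and the resulting similarities mostly reflect shared character n-grams: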

In [12]:
from gensim.models import FastText
from gensim.utils import simple_preprocess

# Example sentences
# sentences = [
#     "Cats and dogs are both popular household pets, yet cats are more independent and often prefer solitude. They share some hunting instincts with their larger feline cousins like lions and tigers.",
#     "Dogs and cats are common pets, but dogs are known for their loyalty and tendency to form strong bonds with humans. Unlike solitary big cats, dogs are social animals that thrive in packs.",
#     "Horses, like elephants, have been domesticated to assist humans in various tasks. However, horses are known for their speed and agility, whereas elephants are prized for their strength and intelligence.",
#     "Lions and tigers are both apex predators, but lions are social animals living in prides. In contrast, tigers are solitary creatures, only coming together during mating or to raise cubs.",
#     "Tigers share their powerful physique and hunting prowess with lions. Unlike the social lions, tigers are mostly solitary, showcasing a stark behavioral difference between the two big cats.",
#     "Elephants, similar to horses, have been used by humans for labor due to their strength. Elephants, however, are highly intelligent with complex social structures, unlike the more individually task-oriented horses.",
# ]
sentences = [
    'african_buffalo', 'alligator', 'amphibian', 'amur_leopard', 
    'ants', 'bear', 'bird', 'blue_whale', 'bobcat', 'cat', 'chimp', 
    'chimpanzee', 'cow', 'dog', 'dolphin', 'domestic_water_buffalo', 
    'eagle', 'elephant', 'fish', 'frog', 'giant', 'giant_panda', 'goat', 
    'gorilla', 'hen', 'horse', 'killer_whale', 'lion', 'lizard', 'monkey', 
    'mouse', 'orangutan', 'ostrich', 'ox', 'panda', 'polar_bear', 'rabbit', 
    'rat', 'rhino', 'rhinoceros', 'rhinoceroses', 'seal', 'sealskin', 
    'siamese_cat', 'skunk', 'spider_monkey', 'squirrel', 'tiger', 'turtle', 
    'walrus', 'whale', 'bird', 'fish', 'lion', 'tiger', 'bull'
]

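# Preprocess sentences. Note: by default simple_preprocess drops tokens longer than
# 15 characters, so 'domestic_water_buffalo' becomes an empty list.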
sentences = [simple_preprocess(sentence) for sentence in sentences]

print(f"sentences : {sentences} \n")

# Train FastText model
model = FastText(sentences, vector_size=300, window=5, min_count=1, epochs=10000)

print(f"model : {model} \n")

sentences : [['african_buffalo'], ['alligator'], ['amphibian'], ['amur_leopard'], ['ants'], ['bear'], ['bird'], ['blue_whale'], ['bobcat'], ['cat'], ['chimp'], ['chimpanzee'], ['cow'], ['dog'], ['dolphin'], [], ['eagle'], ['elephant'], ['fish'], ['frog'], ['giant'], ['giant_panda'], ['goat'], ['gorilla'], ['hen'], ['horse'], ['killer_whale'], ['lion'], ['lizard'], ['monkey'], ['mouse'], ['orangutan'], ['ostrich'], ['ox'], ['panda'], ['polar_bear'], ['rabbit'], ['rat'], ['rhino'], ['rhinoceros'], ['rhinoceroses'], ['seal'], ['sealskin'], ['siamese_cat'], ['skunk'], ['spider_monkey'], ['squirrel'], ['tiger'], ['turtle'], ['walrus'], ['whale'], ['bird'], ['fish'], ['lion'], ['tiger'], ['bull']] 

model : FastText<vocab=51, vector_size=300, alpha=0.025> 
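
The empty list in the preprocessed output above comes from the token-length filter in `simple_preprocess`. The following is a rough stand-in written with the standard library only, assuming the Gensim defaults `min_len=2` and `max_len=15`; the name `approx_simple_preprocess` is hypothetical and the real function handles more cases (for example accent removal).

```python
import re

# Assumption: tokens are runs of word characters that are not digits,
# which keeps underscores inside a token.
TOKEN_PATTERN = re.compile(r"(?:(?!\d)\w)+")

def approx_simple_preprocess(text, min_len=2, max_len=15):
    """Lowercase, tokenize, and drop tokens outside the length limits."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    return [
        token for token in tokens
        if min_len <= len(token) <= max_len and not token.startswith("_")
    ]

for name in ["african_buffalo", "domestic_water_buffalo", "ox"]:
    print(name, len(name), approx_simple_preprocess(name))
```

If the long class names matter, one option is to skip `simple_preprocess` for this word list and pass `[[name] for name in sentences]` directly.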



In [24]:
model.wv, len(model.wv), len(model.wv[0]) # model.wv.key_to_index

(<gensim.models.fasttext.FastTextKeyedVectors at 0x17e562e5600>, 51, 300)

### 3. Finding Similar Words

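Once you have a trained or pre-trained model, you can find similar words using the `most_similar` method. On a trained model it is called through `model.wv`; on a `KeyedVectors` object such as the one returned by `api.load`, it is called on the object directly: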

In [13]:
# Find similar words to 'cat'
similar_words = model.wv.most_similar('cat', topn=10)

print(similar_words)

[('bobcat', 0.2737165093421936), ('goat', 0.15198037028312683), ('bird', 0.12732714414596558), ('ostrich', 0.09741493314504623), ('seal', 0.09225074201822281), ('siamese_cat', 0.08397927135229111), ('african_buffalo', 0.08353123813867569), ('ants', 0.06892219930887222), ('walrus', 0.05629545822739601), ('rat', 0.055904362350702286)]


### 4. Computing Similarity Scores Between Words

You can compute similarity scores between two words using the `similarity` method:

In [14]:
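# Compute similarity scores between 'cat' and several other words.
# 'siamese cat' (with a space) and 'ferret' are not in the training vocabulary;
# FastText still returns a score by composing vectors from character n-grams.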
similarity_score = model.wv.similarity('cat', 'bobcat')
print(similarity_score)

similarity_score = model.wv.similarity('cat', 'siamese_cat')
print(similarity_score, ' *')

similarity_score = model.wv.similarity('cat', 'siamese cat')
print(similarity_score, ' *')

similarity_score = model.wv.similarity('cat', 'dog')
print(similarity_score)

similarity_score = model.wv.similarity('cat', 'rat')
print(similarity_score)

similarity_score = model.wv.similarity('cat', 'ferret')
print(similarity_score)

0.2737165
0.08397927  *
0.1394797  *
0.012166994
0.055904355
0.007804998
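
FastText can score out-of-vocabulary inputs such as 'ferret' because it represents a word through its character n-grams. The sketch below shows the kind of n-grams involved, assuming the Gensim defaults `min_n=3` and `max_n=6` and the usual `<`/`>` boundary markers. `char_ngrams` is a hypothetical helper; the library additionally hashes n-grams into buckets, which is omitted here.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of `word`, wrapped in '<' and '>' markers."""
    wrapped = f"<{word}>"
    ngrams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams

# n-grams that 'cat' and 'bobcat' have in common
shared = sorted(set(char_ngrams("cat")) & set(char_ngrams("bobcat")))
print(shared)

# 3-grams available for a word that was never seen in training
print(char_ngrams("ferret", 3, 3))
```

Words that share many n-grams start from overlapping vectors, which is consistent with 'bobcat' ranking closest to 'cat' in the output above.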


# Gensim Word2Vec

### Using Gensim Word2Vec

Here are the key steps to use Gensim Word2Vec:

1. Training a Word2Vec Model on a Custom Corpus
2. Finding Similar Words
3. Computing Similarity Scores Between Words

### 1. Training a Word2Vec Model on a Custom Corpus

You can train a Word2Vec model on your custom corpus. Here is an example:

In [31]:
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess

# Example sentences
sentences = [
    'african_buffalo', 'alligator', 'amphibian', 'amur_leopard', 
    'ants', 'bear', 'bird', 'blue_whale', 'bobcat', 'cat', 'chimp', 
    'chimpanzee', 'cow', 'dog', 'dolphin', 'domestic_water_buffalo', 
    'eagle', 'elephant', 'fish', 'frog', 'giant', 'giant_panda', 'goat', 
    'gorilla', 'hen', 'horse', 'killer_whale', 'lion', 'lizard', 'monkey', 
    'mouse', 'orangutan', 'ostrich', 'ox', 'panda', 'polar_bear', 'rabbit', 
    'rat', 'rhino', 'rhinoceros', 'rhinoceroses', 'seal', 'sealskin', 
    'siamese_cat', 'skunk', 'spider_monkey', 'squirrel', 'tiger', 'turtle', 
    'walrus', 'whale', 'bird', 'fish', 'lion', 'tiger', 'bull'
]

# Preprocess sentences
sentences = [simple_preprocess(sentence) for sentence in sentences]
print(f"sentences : {sentences} \n")

# Train Word2Vec model
model = Word2Vec(sentences, vector_size=300, window=5, min_count=1, epochs=10000)
print(f"model : {model} \n")

sentences : [['african_buffalo'], ['alligator'], ['amphibian'], ['amur_leopard'], ['ants'], ['bear'], ['bird'], ['blue_whale'], ['bobcat'], ['cat'], ['chimp'], ['chimpanzee'], ['cow'], ['dog'], ['dolphin'], [], ['eagle'], ['elephant'], ['fish'], ['frog'], ['giant'], ['giant_panda'], ['goat'], ['gorilla'], ['hen'], ['horse'], ['killer_whale'], ['lion'], ['lizard'], ['monkey'], ['mouse'], ['orangutan'], ['ostrich'], ['ox'], ['panda'], ['polar_bear'], ['rabbit'], ['rat'], ['rhino'], ['rhinoceros'], ['rhinoceroses'], ['seal'], ['sealskin'], ['siamese_cat'], ['skunk'], ['spider_monkey'], ['squirrel'], ['tiger'], ['turtle'], ['walrus'], ['whale'], ['bird'], ['fish'], ['lion'], ['tiger'], ['bull']] 

model : Word2Vec<vocab=51, vector_size=300, alpha=0.025> 



### 2. Finding Similar Words

Once you have a trained model, you can find similar words using the `most_similar` method:

In [32]:
# Find similar words to 'cat'
similar_words = model.wv.most_similar('cat', topn=10)
print("Most similar words to 'cat':")
for word, score in similar_words:
    print(f"{word}: {score}")

Most similar words to 'cat':
ants: 0.11463356018066406
goat: 0.10705526173114777
rhinoceros: 0.09309180825948715
monkey: 0.09122835099697113
rhino: 0.08179710805416107
rabbit: 0.07725443691015244
orangutan: 0.07632383704185486
bear: 0.07548326253890991
rhinoceroses: 0.06613056361675262
chimpanzee: 0.042745448648929596
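
The scores above are cosine similarities between word vectors. As a minimal sketch of the ranking idea, the following ranks a few made-up vectors by cosine similarity to a query word. The vectors and the function names are hypothetical and are not taken from the trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(query, vectors, topn=2):
    """Rank every other word by cosine similarity to the query word."""
    scores = [
        (word, cosine_similarity(vectors[query], vec))
        for word, vec in vectors.items()
        if word != query
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Hypothetical 3-d vectors, made up for illustration only
toy_vectors = {
    "cat": [1.0, 0.0, 0.0],
    "bobcat": [0.8, 0.6, 0.0],
    "goat": [0.0, 1.0, 0.0],
    "ants": [-1.0, 0.0, 0.0],
}
ranking = most_similar("cat", toy_vectors)
print(ranking)
```

Scores near zero, as in the Word2Vec output above, indicate nearly orthogonal vectors.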


### 3. Computing Similarity Scores Between Words

You can compute similarity scores between two words using the `similarity` method:

In [34]:
# Compute similarity score between 'cat' and 'bobcat'
similarity_score = model.wv.similarity('cat', 'bobcat')
print(f"Similarity score between 'cat' and 'bobcat': {similarity_score}")

Similarity score between 'cat' and 'bobcat': -0.01531929150223732


# Gensim Doc2Vec

### Using Gensim Doc2Vec

Here are the key steps to use Gensim Doc2Vec:

1. Prepare the Data
2. Train the Doc2Vec Model
3. Finding Similar Documents
4. Computing Similarity Scores Between Documents


[Practical Guide To Doc2Vec](https://spotintelligence.com/2023/09/06/doc2vec/)

### 1. Prepare the Data

You need to preprocess your documents and tag them appropriately. Gensim's `TaggedDocument` is used for this purpose.

In [36]:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim.utils import simple_preprocess

# Example documents
documents = [
    'african_buffalo', 'alligator', 'amphibian', 'amur_leopard', 
    'ants', 'bear', 'bird', 'blue_whale', 'bobcat', 'cat', 'chimp', 
    'chimpanzee', 'cow', 'dog', 'dolphin', 'domestic_water_buffalo', 
    'eagle', 'elephant', 'fish', 'frog', 'giant', 'giant_panda', 'goat', 
    'gorilla', 'hen', 'horse', 'killer_whale', 'lion', 'lizard', 'monkey', 
    'mouse', 'orangutan', 'ostrich', 'ox', 'panda', 'polar_bear', 'rabbit', 
    'rat', 'rhino', 'rhinoceros', 'rhinoceroses', 'seal', 'sealskin', 
    'siamese_cat', 'skunk', 'spider_monkey', 'squirrel', 'tiger', 'turtle', 
    'walrus', 'whale', 'bird', 'fish', 'lion', 'tiger', 'bull'
]

# Preprocess and tag documents
tagged_documents = [TaggedDocument(simple_preprocess(doc), [i]) for i, doc in enumerate(documents)]
print(f"tagged_documents : {tagged_documents}")

tagged_documents : [TaggedDocument(words=['african_buffalo'], tags=[0]), TaggedDocument(words=['alligator'], tags=[1]), TaggedDocument(words=['amphibian'], tags=[2]), TaggedDocument(words=['amur_leopard'], tags=[3]), TaggedDocument(words=['ants'], tags=[4]), TaggedDocument(words=['bear'], tags=[5]), TaggedDocument(words=['bird'], tags=[6]), TaggedDocument(words=['blue_whale'], tags=[7]), TaggedDocument(words=['bobcat'], tags=[8]), TaggedDocument(words=['cat'], tags=[9]), TaggedDocument(words=['chimp'], tags=[10]), TaggedDocument(words=['chimpanzee'], tags=[11]), TaggedDocument(words=['cow'], tags=[12]), TaggedDocument(words=['dog'], tags=[13]), TaggedDocument(words=['dolphin'], tags=[14]), TaggedDocument(words=[], tags=[15]), TaggedDocument(words=['eagle'], tags=[16]), TaggedDocument(words=['elephant'], tags=[17]), TaggedDocument(words=['fish'], tags=[18]), TaggedDocument(words=['frog'], tags=[19]), TaggedDocument(words=['giant'], tags=[20]), TaggedDocument(words=['giant_panda'], tags=

### 2. Train the Doc2Vec Model

Now, you can train a Doc2Vec model using the preprocessed and tagged documents.

In [37]:
# Train Doc2Vec model
model = Doc2Vec(tagged_documents, vector_size=300, window=5, min_count=1, epochs=10000)
print(f"model : {model}")

model : Doc2Vec<dm/m,d300,n5,w5,s0.001,t3>


### 3. Finding Similar Documents

You can find documents that are similar to a given document by using the `dv.most_similar` method.

In [39]:
# Find documents similar to the document tagged 9 ('cat')
similar_docs = model.dv.most_similar(9, topn=5)
print("Most similar documents to document 9 ('cat'):")
for doc_id, score in similar_docs:
    print(f"Document {doc_id}: {score}")

Most similar documents to document 9 ('cat'):
Document 34: 0.40884047746658325
Document 6: 0.3998793661594391
Document 35: 0.39414697885513306
Document 53: 0.3926096558570862
Document 27: 0.390058308839798
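
The integer tags in the output above are positions in the `documents` list, so they can be mapped back to the original names. The sketch below does this with plain Python, using the (tag, score) pairs from the output with scores rounded to four decimals.

```python
documents = [
    'african_buffalo', 'alligator', 'amphibian', 'amur_leopard',
    'ants', 'bear', 'bird', 'blue_whale', 'bobcat', 'cat', 'chimp',
    'chimpanzee', 'cow', 'dog', 'dolphin', 'domestic_water_buffalo',
    'eagle', 'elephant', 'fish', 'frog', 'giant', 'giant_panda', 'goat',
    'gorilla', 'hen', 'horse', 'killer_whale', 'lion', 'lizard', 'monkey',
    'mouse', 'orangutan', 'ostrich', 'ox', 'panda', 'polar_bear', 'rabbit',
    'rat', 'rhino', 'rhinoceros', 'rhinoceroses', 'seal', 'sealskin',
    'siamese_cat', 'skunk', 'spider_monkey', 'squirrel', 'tiger', 'turtle',
    'walrus', 'whale', 'bird', 'fish', 'lion', 'tiger', 'bull'
]

# (tag, score) pairs copied from the output above, scores rounded
similar_docs = [(34, 0.4088), (6, 0.3999), (35, 0.3941), (53, 0.3926), (27, 0.3901)]

labels = [documents[doc_id] for doc_id, _ in similar_docs]
for (doc_id, score), label in zip(similar_docs, labels):
    print(f"{label} (tag {doc_id}): {score:.4f}")
```

Tags 53 and 27 both resolve to 'lion' because the list contains that name twice.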


### 4. Computing Similarity Scores Between Documents

To compute the similarity score between two documents, you can use the `dv.similarity` method.

In [40]:
# Compute similarity score between documents 9 ('cat') and 8 ('bobcat')
similarity_score = model.dv.similarity(9, 8)
print(f"Similarity score between documents 9 and 8: {similarity_score}")

Similarity score between documents 9 and 8: 0.3201076090335846


# BERT

Finding similar words using BERT involves extracting embeddings for words and then calculating similarity between these embeddings. BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained transformer model that generates context-aware word embeddings. Here's how you can find similar words using BERT:

### Steps to Find Similar Words Using BERT

1. Install Necessary Libraries
2. Load Pre-trained BERT Model and Tokenizer
3. Get Embeddings for Words
4. Compute Similarity Scores
5. Find Most Similar Words

### 1. Install Necessary Libraries

First, ensure you have the necessary libraries installed:

In [None]:
# !pip install transformers torch scipy

### 2. Load Pre-trained BERT Model and Tokenizer

Load a pre-trained BERT model and its tokenizer from the Hugging Face library:

In [42]:
import torch
from transformers import BertTokenizer, BertModel

# Load pre-trained BERT model and tokenizer
model_name = 'bert-base-uncased'
tokenizer = BertTokenizer.from_pretrained(model_name)
print(f"tokenizer : {tokenizer} \n")

model = BertModel.from_pretrained(model_name)
print(f"model : {model} \n")

tokenizer : BertTokenizer(name_or_path='bert-base-uncased', vocab_size=30522, model_max_length=512, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
} 

model : BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (

### 3. Get Embeddings for Words

Define a function to get the embeddings for a word:

In [43]:
def get_word_embedding(word, tokenizer, model):
    inputs = tokenizer(word, return_tensors='pt')
    with torch.no_grad():
        outputs = model(**inputs)
    # Use the output from the last hidden state
    embeddings = outputs.last_hidden_state
    # Get the mean of the token embeddings
    word_embedding = embeddings.mean(dim=1)
    return word_embedding.squeeze()
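
Because the tokenizer adds `[CLS]` and `[SEP]` around the word, the mean in `get_word_embedding` averages those special-token vectors together with the word pieces. The following sketch illustrates the difference using made-up 2-d vectors in place of BERT's 768-d hidden states. `mean_pool` is a hypothetical helper, and whether dropping the special tokens improves results is not established here.

```python
def mean_pool(token_vectors):
    """Average a list of equal-length token vectors component-wise."""
    count = len(token_vectors)
    return [sum(components) / count for components in zip(*token_vectors)]

# Hypothetical hidden states for the sequence: [CLS] piece1 piece2 [SEP]
hidden_states = [
    [8.0, 0.0],  # [CLS]
    [1.0, 2.0],  # word piece 1
    [3.0, 4.0],  # word piece 2
    [0.0, 6.0],  # [SEP]
]

pooled_all = mean_pool(hidden_states)              # mirrors embeddings.mean(dim=1)
pooled_word_only = mean_pool(hidden_states[1:-1])  # leaves out [CLS] and [SEP]
print(pooled_all, pooled_word_only)
```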

### 4. Compute Similarity Scores

Compute cosine similarity between word embeddings:

In [44]:
from scipy.spatial.distance import cosine

def cosine_similarity(vec1, vec2):
    return 1 - cosine(vec1, vec2)

### 5. Find Most Similar Words

Find the most similar words in a given list of words:

In [45]:
def find_similar_words(target_word, word_list, tokenizer, model, top_n=5):
    target_embedding = get_word_embedding(target_word, tokenizer, model)
    similarities = []
    for word in word_list:
        word_embedding = get_word_embedding(word, tokenizer, model)
        similarity = cosine_similarity(target_embedding, word_embedding)
        similarities.append((word, similarity))
    similarities.sort(key=lambda x: x[1], reverse=True)
    return similarities[:top_n]

# Example word list
word_list = [
    'african_buffalo', 'alligator', 'amphibian', 'amur_leopard', 
    'ants', 'bear', 'bird', 'blue_whale', 'bobcat', 'cat', 'chimp', 
    'chimpanzee', 'cow', 'dog', 'dolphin', 'domestic_water_buffalo', 
    'eagle', 'elephant', 'fish', 'frog', 'giant', 'giant_panda', 'goat', 
    'gorilla', 'hen', 'horse', 'killer_whale', 'lion', 'lizard', 'monkey', 
    'mouse', 'orangutan', 'ostrich', 'ox', 'panda', 'polar_bear', 'rabbit', 
    'rat', 'rhino', 'rhinoceros', 'rhinoceroses', 'seal', 'sealskin', 
    'siamese_cat', 'skunk', 'spider_monkey', 'squirrel', 'tiger', 'turtle', 
    'walrus', 'whale', 'bird', 'fish', 'lion', 'tiger', 'bull'
]

# Find words similar to 'cat'
similar_words = find_similar_words("cat", word_list, tokenizer, model)
print("Most similar words to 'cat':")
for word, score in similar_words:
    print(f"{word}: {score}")


Most similar words to 'cat':
cat: 1
squirrel: 0.9277651906013489
rabbit: 0.9275433421134949
monkey: 0.9177643060684204
tiger: 0.912547767162323


# Autoencoder

To find similar words using an autoencoder, you can follow these steps:

1. Prepare the Data: Create a dataset of word embeddings.
2. Build an Autoencoder: Design an autoencoder model.
3. Train the Autoencoder: Train the model on your dataset.
4. Encode Words: Use the trained encoder to get compressed representations (latent vectors) of words.
5. Find Similar Words: Calculate similarity between these latent vectors to find similar words.

### 1. Prepare the Data

First, you need a dataset of word embeddings. You can use pre-trained word vectors such as GloVe or Word2Vec. For simplicity, let's use GloVe embeddings.

In [48]:
import numpy as np
import urllib.request
from gensim.utils import simple_preprocess

# Download GloVe embeddings
glove_url = "http://nlp.stanford.edu/data/glove.6B.zip"
urllib.request.urlretrieve(glove_url, "glove.6B.zip")

import zipfile
with zipfile.ZipFile("glove.6B.zip", "r") as zip_ref:
    zip_ref.extractall("glove.6B")


In [53]:
# Example custom dataset
custom_sentences = [
    'african_buffalo', 'alligator', 'amphibian', 'amur_leopard', 
    'ants', 'bear', 'bird', 'blue_whale', 'bobcat', 'cat', 'chimp', 
    'chimpanzee', 'cow', 'dog', 'dolphin', 'domestic_water_buffalo', 
    'eagle', 'elephant', 'fish', 'frog', 'giant', 'giant_panda', 'goat', 
    'gorilla', 'hen', 'horse', 'killer_whale', 'lion', 'lizard', 'monkey', 
    'mouse', 'orangutan', 'ostrich', 'ox', 'panda', 'polar_bear', 'rabbit', 
    'rat', 'rhino', 'rhinoceros', 'rhinoceroses', 'seal', 'sealskin', 
    'siamese_cat', 'skunk', 'spider_monkey', 'squirrel', 'tiger', 'turtle', 
    'walrus', 'whale', 'bird', 'fish', 'lion', 'tiger', 'bull'
]

# Load GloVe embeddings
def load_glove_embeddings(file_path):
    embeddings_index = {}
    with open(file_path, 'r', encoding='utf8') as f:
        for line in f:
            values = line.split()
            word = values[0]
            coefs = np.asarray(values[1:], dtype='float32')
            embeddings_index[word] = coefs
    return embeddings_index

embeddings_index = load_glove_embeddings("glove.6B/glove.6B.50d.txt")  # Using 50d GloVe embeddings
print(f"embeddings_index : {type(embeddings_index)} \n")

# Convert sentences to embeddings
def sentence_to_embedding(sentence, embeddings_index):
    words = simple_preprocess(sentence)
    valid_words = [embeddings_index[word] for word in words if word in embeddings_index]
    if valid_words:
        return np.mean(valid_words, axis=0)
    else:
        return np.zeros(50)

sentence_embeddings = np.array([sentence_to_embedding(sentence, embeddings_index) for sentence in custom_sentences])
print(f"sentence_embeddings : {type(sentence_embeddings)} \n")

embeddings_index : <class 'dict'> 

sentence_embeddings : <class 'numpy.ndarray'> 



### 2. Build the Autoencoder Model

Design the autoencoder model using Keras:

In [54]:
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Dense

# Define the size of the input and latent space
input_dim = 50  # Dimension of GloVe embeddings
latent_dim = 16  # Dimension of latent space

# Input layer
input_layer = Input(shape=(input_dim,))

# Encoder layers
encoded = Dense(32, activation='relu')(input_layer)
encoded = Dense(latent_dim, activation='relu')(encoded)

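# Decoder layers
# Note: a sigmoid output is limited to (0, 1), while GloVe components can be negative,
# so a linear output activation may reconstruct the inputs more faithfully.
decoded = Dense(32, activation='relu')(encoded)
decoded = Dense(input_dim, activation='sigmoid')(decoded)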

# Autoencoder model
autoencoder = Model(input_layer, decoded)

# Encoder model
encoder = Model(input_layer, encoded)

# Compile the autoencoder
autoencoder.compile(optimizer='adam', loss='mse')







### 3. Train the Autoencoder

Train the autoencoder on your custom dataset:

In [55]:
# Train the autoencoder
autoencoder.fit(sentence_embeddings, sentence_embeddings, epochs=50, batch_size=2, shuffle=True, validation_split=0.1)

Epoch 1/50
...
Epoch 50/50


<keras.src.callbacks.History at 0x17de41fcb50>

### 4. Encode Words

Use the trained encoder to get latent representations of sentences:

In [57]:
# Encode sentences to get their latent representations
sentence_latents = encoder.predict(sentence_embeddings)
# print(f"sentence_latents : {sentence_latents}")

# Create a dictionary to map sentences to their latent representations
sentence_to_latent = {i: sentence_latents[i] for i in range(len(custom_sentences))}
# print(f"sentence_to_latent : {sentence_to_latent}")



### 5. Find Similar Sentences

Calculate similarity between latent vectors to find similar sentences:

In [61]:
from scipy.spatial.distance import cosine

def find_similar_sentences(target_sentence_index, sentence_to_latent, top_n=5):
    target_latent = sentence_to_latent[target_sentence_index]
    similarities = []
    for index, latent in sentence_to_latent.items():
        if index != target_sentence_index:
            similarity = 1 - cosine(target_latent, latent)
            similarities.append((index, similarity))
    similarities.sort(key=lambda x: x[1], reverse=True)
    return similarities[:top_n]

# Find sentences similar to the entry at index 8 ('bobcat')
similar_sentences = find_similar_sentences(8, sentence_to_latent)
print("Most similar sentences to 'bobcat':")
for index, score in similar_sentences:
    print(f"Sentence: {custom_sentences[index]} - Similarity: {score}")


Most similar sentences to 'bobcat':
Sentence: skunk - Similarity: 0.9701589345932007
Sentence: walrus - Similarity: 0.9549508690834045
Sentence: rhinoceros - Similarity: 0.9532212615013123
Sentence: squirrel - Similarity: 0.9429846405982971
Sentence: rhinoceroses - Similarity: 0.9255936145782471


# ZSL

Zero-shot learning (ZSL) is a machine learning paradigm in which a model recognizes classes or performs tasks that it never saw during training. Instead of relying solely on labeled examples for every possible category, ZSL leverages auxiliary information (such as semantic attributes, descriptions, or relationships) to make predictions about unseen classes.

### Code for Zero-Shot Learning

In [1]:
import numpy as np
import tensorflow as tf
from tensorflow.keras.applications import ResNet50
from sklearn.preprocessing import normalize
from scipy.spatial.distance import cdist
import fasttext  # fastText Python bindings, installed separately from Gensim (pip install fasttext)

# Load a pre-trained ResNet50 model
model = ResNet50(weights='imagenet', include_top=False, pooling='avg')

# Function to extract visual features
def extract_features(image_path):
    image = tf.keras.preprocessing.image.load_img(image_path, target_size=(224, 224))
    image = tf.keras.preprocessing.image.img_to_array(image)
    image = np.expand_dims(image, axis=0)
    image = tf.keras.applications.resnet50.preprocess_input(image)
    features = model.predict(image)
    return features

# Load FastText word vectors
fasttext_model = fasttext.load_model('cc.en.300.bin')
# from gensim.models.keyedvectors import KeyedVectors
# import gensim.downloader as api
# fast_text_vectors = api.load("fasttext-wiki-news-subwords-300")
# fast_text_vectors.save('fstwk_1.d2v')
# fast_text_vectors = KeyedVectors.load("fstwk_1.d2v")

# Example seen and unseen classes
seen_classes = ['cat', 'dog', 'horse']
unseen_classes = ['lion', 'tiger', 'elephant']

# Get word vectors for classes
def get_class_vectors(classes):
    return np.array([fasttext_model.get_word_vector(cls) for cls in classes])

# Normalize the word vectors
seen_vectors = normalize(get_class_vectors(seen_classes))
unseen_vectors = normalize(get_class_vectors(unseen_classes))

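# Function to perform zero-shot classification
# Note: ResNet50 pooled features are 2048-dimensional while these FastText vectors are
# 300-dimensional, so cdist fails here unless the image features are first projected
# into the word-vector space (e.g. with a mapping learned on the seen classes).
def zero_shot_classify(image_path):
    features = extract_features(image_path)
    features = normalize(features)
    distances = cdist(features, unseen_vectors, metric='cosine')
    return unseen_classes[np.argmin(distances)]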

# Example usage
image_path = 'Sample_Images/cat1.jpg'
predicted_class = zero_shot_classify(image_path)
print(f'Predicted class: {predicted_class}')





ModuleNotFoundError: No module named 'fasttext'
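
Apart from the missing module, the classification step above compares image features and word vectors that live in spaces of different sizes. A common remedy is to project the visual feature into the semantic space before taking the nearest class. The sketch below shows only that final step with toy numbers: the 4-d "visual" feature, the 3-d "word" vectors, and the `projection` matrix are all hypothetical. In practice the projection would be learned from images of the seen classes.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def zero_shot_predict(visual_feature, projection, class_vectors):
    """Project the feature into the semantic space and pick the closest class."""
    semantic = matvec(projection, visual_feature)
    return max(
        class_vectors,
        key=lambda name: cosine_similarity(semantic, class_vectors[name]),
    )

# Hypothetical projection from a 4-d visual space to a 3-d semantic space
projection = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
]
# Hypothetical class vectors standing in for the word vectors of unseen classes
unseen_class_vectors = {
    "lion": [1.0, 0.2, 0.0],
    "tiger": [0.1, 1.0, 0.1],
    "elephant": [0.0, 0.1, 1.0],
}
feature = [0.2, 0.9, 0.1, 0.0]
predicted = zero_shot_predict(feature, projection, unseen_class_vectors)
print(f"Predicted class: {predicted}")
```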

Conditional Autoencoders (CAEs) are a variant of autoencoders where additional information is used to condition the encoding and decoding processes. This conditioning can help the autoencoder learn more structured and relevant representations based on the context provided by the additional information.

### Autoencoders Recap

Autoencoders are neural networks designed to learn efficient representations (encodings) of input data, typically for the purpose of dimensionality reduction or data denoising.

Components:
- Encoder: Compresses the input data into a latent-space representation.
- Decoder: Reconstructs the input data from the latent representation.

### Conditional Autoencoders
In a Conditional Autoencoder, the input data is conditioned on some additional information (conditions). This information can be labels, attributes, or any other relevant context that influences the encoding and decoding processes.
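
One common way to apply the condition is to concatenate it with the data before the encoder and again with the latent code before the decoder. The sketch below builds those two inputs with plain Python lists. The helper names, the example vectors, and the choice of a one-hot label as the condition are assumptions for illustration; no network is trained here.

```python
def one_hot(index, num_classes):
    """Encode a class index as a one-hot list."""
    return [1.0 if i == index else 0.0 for i in range(num_classes)]

def condition_input(vector, class_index, num_classes):
    """Concatenate a data vector with a one-hot condition vector."""
    return list(vector) + one_hot(class_index, num_classes)

x = [0.5, -0.2, 0.1]  # hypothetical data vector
z = [0.7, 0.3]        # hypothetical latent code an encoder might produce

encoder_input = condition_input(x, class_index=1, num_classes=3)
decoder_input = condition_input(z, class_index=1, num_classes=3)
print(encoder_input)
print(decoder_input)
```

With this layout the encoder and decoder both see the label, so the latent code is free to capture what the label does not already explain.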