# Avg Word2Vec

Average Word2Vec (Avg Word2Vec) is an approach for generating a document embedding by averaging the Word2Vec vectors of the individual words in a text document. The resulting vector summarizes the overall semantic content of the document.

### How Avg Word2Vec Works:

Word Vector Generation: First, Word2Vec is used to learn a vector for each word in the corpus. These are dense vectors, typically a few hundred dimensions (300 in the pre-trained Google News model used below), that represent the meaning of each word based on the contexts in which it appears.

Average Calculation: For a given text document, the vectors of its words are averaged element-wise to produce a single vector with the same dimensionality as the word vectors. Words that are missing from the Word2Vec vocabulary are typically skipped. This average vector summarizes the overall semantic information of the document.
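The two steps above can be sketched with a toy lookup table. The three-dimensional vectors below are made-up values for illustration only, not real Word2Vec output, and `average_vector` is a hypothetical helper name.

```python
import numpy as np

# Hypothetical 3-dimensional word vectors (made-up values; real Word2Vec
# vectors typically have a few hundred dimensions).
toy_vectors = {
    "dogs": np.array([1.0, 0.0, 2.0]),
    "love": np.array([0.0, 2.0, 1.0]),
    "cats": np.array([2.0, 1.0, 0.0]),
}

def average_vector(tokens, vectors, dim=3):
    """Average the vectors of the tokens found in `vectors`.

    Tokens missing from the lookup are skipped. If no token is found,
    a zero vector of length `dim` is returned.
    """
    found = [vectors[t] for t in tokens if t in vectors]
    if not found:
        return np.zeros(dim)
    return np.mean(found, axis=0)

# "unknown" is not in the lookup table, so it is ignored.
doc_embedding = average_vector(["love", "dogs", "cats", "unknown"], toy_vectors)
print(doc_embedding)  # [1. 1. 1.]
```

Whatever the number of tokens, the result has the same length as a single word vector.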

### Example:

Consider two text documents:

"I love dogs and cats."

"Dogs are loyal companions."

For each document, we generate Word2Vec vectors for individual words and then average them to get the document embeddings. The resulting embeddings can be used for various tasks such as document similarity computation or classification.

### Where Avg Word2Vec is Used:

Document Similarity: It is used to measure the similarity between documents by comparing their average Word2Vec embeddings.

Text Classification: In text classification tasks, Avg Word2Vec can be used to represent documents and classify them into predefined categories.

Information Retrieval: It helps in retrieving relevant documents based on their semantic similarity to a query document or search query.
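As a rough sketch of the retrieval use case, documents can be ranked by the cosine similarity between their averaged embedding and the averaged embedding of the query. The document names and vectors below are hypothetical placeholders, not output from a real model.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical averaged document embeddings (made-up values).
doc_embeddings = {
    "doc_pets": np.array([0.9, 0.1, 0.0]),
    "doc_finance": np.array([0.0, 0.2, 0.9]),
    "doc_animals": np.array([0.7, 0.3, 0.1]),
}

# Hypothetical averaged embedding of the search query.
query_embedding = np.array([1.0, 0.0, 0.0])

# Sort document names from most to least similar to the query.
ranked = sorted(
    doc_embeddings,
    key=lambda name: cosine(query_embedding, doc_embeddings[name]),
    reverse=True,
)
print(ranked)  # ['doc_pets', 'doc_animals', 'doc_finance']
```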

### Advantages:

Semantic Representation: It captures the semantic meaning of the entire document by considering the meanings of individual words.

Dimensionality Reduction: Representing a document of n words by its individual word vectors requires n vectors of d dimensions each. Averaging collapses these into a single d-dimensional vector, which is more compact and easier to feed into downstream models.

Robustness: It is robust to the varying lengths of documents since it produces a fixed-length vector representation for each document.

### Disadvantages:

Loss of Information: Averaging the word vectors may lead to loss of important information, especially if the document contains a large number of words with diverse meanings.

Equal Weighting: It treats all words in the document equally, regardless of their importance or frequency. This may not accurately represent the significance of each word in the document.
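Both disadvantages can be illustrated with toy vectors. One concrete form of information loss is word order: two sentences with the same words in a different order get the same average. The second part contrasts the plain mean with a weighted mean, which is not part of the approach described here and is shown only for comparison. All vectors and weights are made-up values.

```python
import numpy as np

# Hypothetical 2-dimensional word vectors (made-up values).
toy_vectors = {
    "dog": np.array([1.0, 0.0]),
    "bites": np.array([0.0, 1.0]),
    "man": np.array([1.0, 1.0]),
}

def average_vector(tokens):
    """Plain (equally weighted) mean of the token vectors."""
    return np.mean([toy_vectors[t] for t in tokens], axis=0)

# Loss of information: word order does not affect the average.
a = average_vector(["dog", "bites", "man"])
b = average_vector(["man", "bites", "dog"])
same = bool(np.allclose(a, b))
print(same)  # True

# Equal weighting: every word contributes 1/3 to the plain mean.
# For contrast, a weighted mean with hypothetical per-word weights
# lets "bites" count three times as much as the other words.
tokens = ["dog", "bites", "man"]
weights = [1.0, 3.0, 1.0]
weighted = np.average([toy_vectors[t] for t in tokens], axis=0, weights=weights)
print(weighted)  # [0.4 0.8]
```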

In [None]:
import gensim.downloader as api

In [None]:
# Load the pre-trained Word2Vec model
word2vec_model = api.load('word2vec-google-news-300')



In [None]:
documents = [
    "I love dogs and cats.",
    "Dogs are loyal companions."
]

In [None]:
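# Tokenize the documents by lowercasing and splitting on whitespace.
# Note: punctuation stays attached to words (e.g. 'cats.'), so such tokens
# are unlikely to match the model vocabulary and are skipped when averaging.
tokenized_documents = [doc.lower().split() for doc in documents]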
tokenized_documents

[['i', 'love', 'dogs', 'and', 'cats.'],
 ['dogs', 'are', 'loyal', 'companions.']]

In [None]:
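import numpy as np

# Calculate the average Word2Vec vector for each document
avg_word2vec_vectors = []
for doc_tokens in tokenized_documents:
    # Keep only the vectors of tokens that are in the model's vocabulary
    doc_vector = [word2vec_model[token] for token in doc_tokens if token in word2vec_model]
    if doc_vector:
        avg_doc_vector = sum(doc_vector) / len(doc_vector)
        avg_word2vec_vectors.append(avg_doc_vector)
    else:
        # No known tokens: fall back to a zero vector so that every document
        # still gets a fixed-length representation
        avg_word2vec_vectors.append(np.zeros(word2vec_model.vector_size))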

In [None]:
# Print the average Word2Vec vectors for each document
for i, vector in enumerate(avg_word2vec_vectors):
    print(f"Average Word2Vec vector for Document {i+1}:")
    print(vector)
    print()

Average Word2Vec vector for Document 1:
[-0.0476888  -0.06144206 -0.00374349  0.20670573 -0.11165365  0.0620931
  0.09285482 -0.08919271 -0.07666016  0.08528646  0.05952962 -0.29231772
 -0.08768717 -0.08959961 -0.07556152  0.10270182 -0.00219727  0.14632161
 -0.05493164 -0.0008138  -0.02685547  0.0164388   0.19335938 -0.0851237
 -0.03255208  0.07991537 -0.19677734  0.07519531  0.05989583 -0.10026041
 -0.05875651  0.08924866 -0.02050781  0.01318359 -0.01251221  0.09480795
 -0.07910156  0.08625793 -0.02856445  0.2010905   0.09368896 -0.14208984
  0.09554037  0.05053711 -0.00764974 -0.06298828  0.18131511 -0.05440267
 -0.04048665  0.01668294  0.00061035  0.09667969  0.10139974  0.1428833
  0.04736328  0.10099284  0.03393555 -0.05110677  0.0290273  -0.02998861
  0.11507162  0.07448324 -0.02620443  0.08585612 -0.00203451 -0.14029948
 -0.07210287  0.0258433  -0.14542644  0.06233724  0.20914714  0.01139323
 -0.10559082  0.07613119 -0.18562825 -0.12516277  0.0016276  -0.05969238
  0.1241862   

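Cosine similarity ranges from -1 to 1. A score close to 1 indicates that the two document vectors point in nearly the same direction (high similarity), a score close to 0 indicates little to no similarity, and a score close to -1 indicates that the vectors point in opposite directions (high dissimilarity).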
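To make these three cases concrete, here is a minimal sketch that computes cosine similarity directly with NumPy on small made-up vectors. The helper name `cosine` is hypothetical; the notebook itself uses scikit-learn's `cosine_similarity`.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

a = np.array([1.0, 2.0])

same_direction = cosine(a, 2 * a)                              # close to 1
orthogonal = cosine(np.array([1.0, 0.0]), np.array([0.0, 1.0]))  # 0
opposite = cosine(a, -a)                                       # close to -1

# Round to hide floating-point noise in the last digits.
print(round(same_direction, 6), round(orthogonal, 6), round(opposite, 6))
```

Because the score depends only on direction, scaling a vector (as with `2 * a`) does not change its similarity to other vectors.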

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

# Calculate cosine similarity between the vectors of the two documents
similarity_score = cosine_similarity([avg_word2vec_vectors[0]], [avg_word2vec_vectors[1]])

# Print the similarity score
print("Similarity Score:", similarity_score[0][0])


Similarity Score: 0.56154925
