# **Vector Space Model (VSM) Evaluation Notebook**

## **Objective**
This notebook is a test bed for comparing **Vector Space Model (VSM) approaches** built on different word embeddings.

## **Approaches Compared**
1. **Word2Vec Pretrained**  
2. **Word2Vec Fine-tuned (Pretrained + Own Data)**  
3. **Word2Vec Trained from Scratch on Own Data**  
4. **GloVe**  
5. **FastText**  

## **Evaluation Methods**
Each method will be evaluated using the following criteria:

### **A. Word Similarity Evaluation**
Measure the similarity between words using cosine similarity.

```python
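# word2vec_model and glove_model are assumed to be gensim KeyedVectors,
# which expose similarity() directly
print(word2vec_model.similarity("car", "vehicle"))
print(glove_model.similarity("car", "vehicle"))
# custom_word2vec is a full Word2Vec model, so its vectors live under .wv
print(custom_word2vec.wv.similarity("car", "vehicle"))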
```
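
To make the metric concrete, the sketch below computes cosine similarity by hand with NumPy. The helper name `cosine_similarity` and the three toy vectors are illustrative assumptions, not real embeddings; returning 0.0 for a zero vector is also just a convention chosen here.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between u and v (hypothetical helper for illustration)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    if denom == 0.0:
        # Convention chosen here: similarity with a zero vector is 0.0
        return 0.0
    return float(np.dot(u, v) / denom)

# Toy vectors, made up for the example
car = np.array([1.0, 2.0, 0.0])
vehicle = np.array([2.0, 4.0, 0.0])   # same direction as `car`
bank = np.array([0.0, 0.0, 3.0])      # orthogonal to `car`

print(cosine_similarity(car, vehicle))
print(cosine_similarity(car, bank))
```

Because the measure depends only on direction, `car` and `vehicle` score as identical even though their lengths differ, while orthogonal vectors score 0.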

### **B. Document Vector Representation**
Compute document vectors by averaging the word embeddings of the words they contain.

```python
import numpy as np
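from nltk.tokenize import word_tokenize  # needs NLTK's Punkt tokenizer data to be downloaded

def get_document_vector(model, text):
    """Average the embeddings of all in-vocabulary words in `text`."""
    # Full gensim models keep their vectors under `.wv`; pretrained vectors
    # loaded as KeyedVectors have no `.wv`, so fall back to the object itself.
    kv = getattr(model, "wv", model)
    words = word_tokenize(text.lower())
    vectors = [kv[word] for word in words if word in kv]
    # Return a zero vector when none of the words are in the vocabulary
    return np.mean(vectors, axis=0) if vectors else np.zeros(kv.vector_size)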

doc_vector = get_document_vector(word2vec_model, "This is a test document.")
```

### **C. Clustering (K-Means)**
Cluster documents with K-Means, using their document vectors as features.

```python
from sklearn.cluster import KMeans

# `documents` is assumed to be a list of raw text strings defined earlier in the notebook
doc_vectors = [get_document_vector(word2vec_model, doc) for doc in documents]

# Apply K-Means clustering
kmeans = KMeans(n_clusters=2, random_state=42)
clusters = kmeans.fit_predict(doc_vectors)

print("Cluster assignments:", clusters)
```
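
Cluster assignments alone do not say which embedding approach clusters better. One possible way to put a number on it is the silhouette score from scikit-learn; this metric is not named in the plan above, so treat it as a suggestion. The sketch uses synthetic vectors (two well-separated Gaussian blobs standing in for document vectors) so that it runs without any embedding model.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-ins for document vectors: two well-separated groups
rng = np.random.default_rng(42)
group_a = rng.normal(loc=0.0, scale=0.1, size=(10, 5))
group_b = rng.normal(loc=3.0, scale=0.1, size=(10, 5))
toy_doc_vectors = np.vstack([group_a, group_b])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = kmeans.fit_predict(toy_doc_vectors)

# Silhouette score lies in [-1, 1]; higher means tighter, better-separated clusters
score = silhouette_score(toy_doc_vectors, labels)
print("Silhouette score:", score)
```

Running the same scoring on the document vectors produced by each embedding approach would give one comparable number per approach.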

### **D. Text Classification**
Use the document embeddings as input features for a classification model and evaluate its accuracy.

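
A minimal sketch of this step, assuming scikit-learn and a logistic regression classifier (the plan above does not specify a classifier, so this choice is an assumption). The feature matrix `X` and labels `y` are synthetic placeholders; in the notebook they would be the stacked document vectors and the real class labels.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic placeholders for document vectors and their class labels
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=-1.0, scale=0.3, size=(50, 20)),  # class 0
    rng.normal(loc=1.0, scale=0.3, size=(50, 20)),   # class 1
])
y = np.array([0] * 50 + [1] * 50)

# Hold out 20% of the documents for evaluation, keeping classes balanced
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

accuracy = accuracy_score(y_test, clf.predict(X_test))
print("Accuracy:", accuracy)
```

Keeping the classifier and the train/test split fixed across runs means that differences in accuracy can be attributed to the embeddings rather than to the evaluation setup.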

