<a href="https://colab.research.google.com/github/longphi1103/text-classification-embedidng-vector-database/blob/main/Text_Classification_Embedidng_Vector_Database.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install -qq faiss-cpu
!pip install -qq transformers
!pip install -qq pandas
!pip install -qq numpy
!pip install -qq scikit-learn
!pip install -qq tqdm

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m31.4/31.4 MB[0m [31m27.5 MB/s[0m eta [36m0:00:00[0m
[?25h

In [None]:
# https://drive.google.com/file/d/1N7rk-kfnDFIGMeX0ROVTjKh71gcgx-7R/view?usp=sharing
!gdown 1N7rk-kfnDFIGMeX0ROVTjKh71gcgx-7R

Downloading...
From: https://drive.google.com/uc?id=1N7rk-kfnDFIGMeX0ROVTjKh71gcgx-7R
To: /content/2cls_spam_text_cls.csv
  0% 0.00/486k [00:00<?, ?B/s]100% 486k/486k [00:00<00:00, 41.1MB/s]


## **1. Import các thư viện cần thiết**

In [None]:
import pandas as pd
import numpy as np
import torch
import torch.nn.functional as F
import faiss
from transformers import AutoTokenizer, AutoModel
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from tqdm import tqdm

## **2. Đọc bộ dữ liệu**

In [None]:
DATASET_PATH = '/content/2cls_spam_text_cls.csv'
df = pd.read_csv(DATASET_PATH)
df

Unnamed: 0,Category,Message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
...,...,...
5567,spam,This is the 2nd time we have tried 2 contact u...
5568,ham,Will ü b going to esplanade fr home?
5569,ham,"Pity, * was in mood for that. So...any other s..."
5570,ham,The guy did some bitching but I acted like i'd...


In [None]:
messages = df['Message'].values.tolist()
labels = df['Category'].values.tolist()

## **3. Chuẩn bị embedding model và dữ liệu**

### **3.1. Load embedding model**

In [None]:
# Load embedding model
MODEL_NAME = 'intfloat/multilingual-e5-base'
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)

# Set device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)
model.eval()

print(f'Using device: {device}')
print(f'Model loaded: {MODEL_NAME}')

def average_pool(last_hidden_states, attention_mask):
    last_hidden = last_hidden_states.masked_fill(
        ~attention_mask[..., None].bool(), 0.0
    )
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]

def encode_text(batch_texts, model, tokenizer, device):
    batch_dict = tokenizer(
        batch_texts,
        max_length=512,
        padding=True,
        truncation=True,
        return_tensors='pt'
    )

    # Move to device
    batch_dict = {k: v.to(device) for k, v in batch_dict.items()}

    # Generate embeddings
    with torch.no_grad():
        outputs = model(**batch_dict)
        batch_embeddings = average_pool(
            outputs.last_hidden_state, batch_dict['attention_mask'])
        # Normalize embeddings
        batch_embeddings = F.normalize(batch_embeddings, p=2, dim=1)

    return batch_embeddings.cpu().numpy()

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/418 [00:00<?, ?B/s]

sentencepiece.bpe.model:   0%|          | 0.00/5.07M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.1M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/280 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/694 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/1.11G [00:00<?, ?B/s]

Using device: cuda
Model loaded: intfloat/multilingual-e5-base


### **3.2. Tạo sentence embeddings**

In [None]:
def get_embeddings(texts, model, tokenizer, device, batch_size=32):
    """Generate embeddings for a list of texts"""
    embeddings = []

    for i in tqdm(range(0,len(texts),batch_size),desc="Generating embeddings"):
        batch_texts = texts[i:i+batch_size]

        # Add passage prefix for better retrieval performance
        batch_texts_with_prefix = [f"passage: {text}" for text in batch_texts]

        batch_embeddings = encode_text(batch_texts, model, tokenizer, device)
        embeddings.append(batch_embeddings)

    return np.vstack(embeddings)

In [None]:
# Prepare labels
le = LabelEncoder()
y = le.fit_transform(labels)
print(f'Classes: {le.classes_}')

# Generate embeddings for all messages
print(f"Generating embeddings for {len(messages)} messages...")
X_embeddings = get_embeddings(messages, model, tokenizer, device)
print(f"Embeddings shape: {X_embeddings.shape}")

# Create metadata for each document
metadata = []
for i, (message, label) in enumerate(zip(messages, labels)):
    metadata.append({
        'index': i,
        'message': message,
        'label': label,
        'label_encoded': y[i]
    })

print(f"Created metadata for {len(metadata)} documents")

Classes: ['ham' 'spam']
Generating embeddings for 5572 messages...


Generating embeddings: 100%|██████████| 175/175 [00:20<00:00,  8.38it/s]

Embeddings shape: (5572, 768)
Created metadata for 5572 documents





### **3.3. Tạo FAISS index và chia dữ liệu**

In [None]:
# Split data into train and test (90% train, 10% test)
TEST_SIZE = 0.1
SEED = 42

train_indices, test_indices = train_test_split(
    range(len(messages)),
    test_size=TEST_SIZE,
    stratify=y,
    random_state=SEED
)

# Split embeddings and metadata
X_train_emb = X_embeddings[train_indices]
X_test_emb = X_embeddings[test_indices]
y_train = y[train_indices]
y_test = y[test_indices]

train_metadata = [metadata[i] for i in train_indices]
test_metadata = [metadata[i] for i in test_indices]

print(f"Train size: {len(X_train_emb)}")
print(f"Test size: {len(X_test_emb)}")
print(f"Train label distribution: {np.bincount(y_train)}")
print(f"Test label distribution: {np.bincount(y_test)}")

Train size: 5014
Test size: 558
Train label distribution: [4342  672]
Test label distribution: [483  75]


In [None]:
# Create FAISS index
embedding_dim = X_train_emb.shape[1]
index = faiss.IndexFlatIP(embedding_dim)  # Inner product for cosine similarity
index.add(X_train_emb.astype('float32'))

print(f"FAISS index created with {index.ntotal} vectors")

FAISS index created with 5014 vectors


## **4. Implement classification với embedding similarity**

In [None]:
def classify_with_knn(query_text, model, tokenizer, device, index, train_metadata, k=1):
    """Classify text using k-nearest neighbors with embeddings"""

    # Get query embedding
    query_with_prefix = f"query: {query_text}"
    query_embedding = encode_text([query_with_prefix], model, tokenizer, device)

    # Search in FAISS index
    scores, indices = index.search(query_embedding, k)

    # Get predictions from top-k neighbors
    predictions = []
    neighbor_info = []

    for i in range(k):
        neighbor_idx = indices[0][i]
        neighbor_score = scores[0][i]
        neighbor_label = train_metadata[neighbor_idx]['label']
        neighbor_message = train_metadata[neighbor_idx]['message']

        predictions.append(neighbor_label)
        neighbor_info.append({
            'score': float(neighbor_score),
            'label': neighbor_label,
            'message': neighbor_message[:100] + "..." if len(neighbor_message) > 100 else neighbor_message
        })

    # Majority vote for final prediction
    unique_labels, counts = np.unique(predictions, return_counts=True)
    final_prediction = unique_labels[np.argmax(counts)]

    return final_prediction, neighbor_info

def evaluate_knn_accuracy(test_embeddings, test_labels, test_metadata, index, train_metadata, k_values=[1, 3, 5]):
    """Evaluate accuracy for different k values using precomputed embeddings"""
    results = {}
    all_errors = {}

    for k in k_values:
        correct = 0
        total = len(test_embeddings)
        errors = []

        for i in tqdm(range(total), desc=f"Evaluating k={k}"):
            query_embedding = test_embeddings[i:i+1].astype('float32')
            true_label = test_metadata[i]['label']
            true_message = test_metadata[i]['message']

            # Search in FAISS index
            scores, indices = index.search(query_embedding, k)

            # Get predictions from top-k neighbors
            predictions = []
            neighbor_details = []
            for j in range(k):
                neighbor_idx = indices[0][j]
                neighbor_label = train_metadata[neighbor_idx]['label']
                neighbor_message = train_metadata[neighbor_idx]['message']
                neighbor_score = float(scores[0][j])

                predictions.append(neighbor_label)
                neighbor_details.append({
                    'label': neighbor_label,
                    'message': neighbor_message,
                    'score': neighbor_score
                })

            # Majority vote
            unique_labels, counts = np.unique(predictions, return_counts=True)
            predicted_label = unique_labels[np.argmax(counts)]

            if predicted_label == true_label:
                correct += 1
            else:
                # Collect error information
                error_info = {
                    'index': i,
                    'original_index': test_metadata[i]['index'],
                    'message': true_message,
                    'true_label': true_label,
                    'predicted_label': predicted_label,
                    'neighbors': neighbor_details,
                    'label_distribution': {label: int(count) for label, count in zip(unique_labels, counts)}
                }
                errors.append(error_info)

        accuracy = correct / total
        error_count = total - correct

        results[k] = accuracy
        all_errors[k] = errors

        print(f"Accuracy with k={k}: {accuracy:.4f}")
        print(f"Number of errors with k={k}: {error_count}/{total} ({(error_count/total)*100:.2f}%)")

    return results, all_errors

## **5. Đánh giá accuracy trên test set**

In [None]:
%%time
# Evaluate accuracy for different k values
print("Evaluating accuracy on test set...")
accuracy_results, error_results = evaluate_knn_accuracy(
    X_test_emb,
    y_test,
    test_metadata,
    index,
    train_metadata,
    k_values=[1, 3, 5]
)

# Display results
print("\n" + "="*50)
print("ACCURACY RESULTS")
print("="*50)
for k, accuracy in accuracy_results.items():
    print(f"Top-{k} accuracy: {accuracy:.4f} ({accuracy*100:.2f}%)")
print("="*50)

# Save error analysis to JSON file
import json
from datetime import datetime

error_analysis = {
    'timestamp': datetime.now().isoformat(),
    'model': MODEL_NAME,
    'test_size': len(X_test_emb),
    'accuracy_results': accuracy_results,
    'errors_by_k': {}
}

for k, errors in error_results.items():
    error_analysis['errors_by_k'][f'k_{k}'] = {
        'total_errors': len(errors),
        'error_rate': len(errors) / len(X_test_emb),
        'errors': errors
    }

# Save to JSON file
output_file = 'error_analysis.json'
with open(output_file, 'w', encoding='utf-8') as f:
    json.dump(error_analysis, f, ensure_ascii=False, indent=2)

print(f"\n***Error analysis saved to: {output_file}***")
print()
print(f"***Summary:")
for k, errors in error_results.items():
    print(f"   k={k}: {len(errors)} errors out of {len(X_test_emb)} samples")


Evaluating accuracy on test set...


Evaluating k=1: 100%|██████████| 558/558 [00:00<00:00, 824.42it/s]


Accuracy with k=1: 0.9928
Number of errors with k=1: 4/558 (0.72%)


Evaluating k=3: 100%|██████████| 558/558 [00:00<00:00, 824.42it/s]


Accuracy with k=3: 0.9910
Number of errors with k=3: 5/558 (0.90%)


Evaluating k=5: 100%|██████████| 558/558 [00:00<00:00, 825.56it/s]

Accuracy with k=5: 0.9892
Number of errors with k=5: 6/558 (1.08%)

ACCURACY RESULTS
Top-1 accuracy: 0.9928 (99.28%)
Top-3 accuracy: 0.9910 (99.10%)
Top-5 accuracy: 0.9892 (98.92%)

***Error analysis saved to: error_analysis.json***

***Summary:
   k=1: 4 errors out of 558 samples
   k=3: 5 errors out of 558 samples
   k=5: 6 errors out of 558 samples
CPU times: user 1.56 s, sys: 1.57 ms, total: 1.56 s
Wall time: 2.06 s





## **6. Pipeline classification cho user input**

In [None]:
def spam_classifier_pipeline(user_input, k=3):
    """
    Complete pipeline for spam classification

    Args:
        user_input (str): Text to classify
        k (int): Number of nearest neighbors to consider

    Returns:
        dict: Classification results with details
    """

    print()
    print(f"***Classifying: '{user_input}'")
    print()
    print(f"***Using top-{k} nearest neighbors")
    print()

    # Get prediction and neighbors
    prediction, neighbors = classify_with_knn(
        user_input, model, tokenizer, device, index, train_metadata, k=k
    )

    # Display results
    print(f"***Prediction: {prediction.upper()}")
    print()

    print("***Top neighbors:")
    for i, neighbor in enumerate(neighbors, 1):
        print(f"{i}. Label: {neighbor['label']} | Score: {neighbor['score']:.4f}")
        print(f"   Message: {neighbor['message']}")
        print()

    # Count label distribution
    labels = [n['label'] for n in neighbors]
    label_counts = {label: labels.count(label) for label in set(labels)}

    return {
        'prediction': prediction,
        'neighbors': neighbors,
        'label_distribution': label_counts
    }

## **7. Test pipeline với các ví dụ**

In [None]:
# Test với các ví dụ khác nhau
test_examples = [
    "I am actually thinking a way of doing something useful",
    "FREE!! Click here to win $1000 NOW! Limited time offer!",
    # "Hey, can you pick me up at 5pm today?",
    # "URGENT: Your account will be suspended unless you verify your details NOW",
    # "Thanks for the meeting today, let's schedule the next one for next week",
    # "Congratulations! You've won a prize! Call this number to claim it"
]

print("Testing pipeline with different examples:")
print()

for i, example in enumerate(test_examples, 1):
    print(f"\n***Example {i}:")
    result = spam_classifier_pipeline(example, k=3)
    print()

Testing pipeline with different examples:


***Example 1:

***Classifying: 'I am actually thinking a way of doing something useful'

***Using top-3 nearest neighbors

***Prediction: HAM

***Top neighbors:
1. Label: ham | Score: 0.8378
   Message: K, I'll work something out

2. Label: ham | Score: 0.8376
   Message: I have gone into get info bt dont know what to do

3. Label: ham | Score: 0.8327
   Message: Same. Wana plan a trip sometme then



***Example 2:

***Classifying: 'FREE!! Click here to win $1000 NOW! Limited time offer!'

***Using top-3 nearest neighbors

***Prediction: SPAM

***Top neighbors:
1. Label: spam | Score: 0.8572
   Message: URGENT! You have won a 1 week FREE membership in our £100,000 Prize Jackpot! Txt the word: CLAIM to ...

2. Label: spam | Score: 0.8556
   Message: Win a £1000 cash prize or a prize worth £5000

3. Label: spam | Score: 0.8545
   Message: For your chance to WIN a FREE Bluetooth Headset then simply reply back with "ADP"




In [None]:
# Interactive testing - user có thể thay đổi text và k value
print("***Interactive Testing")
print()

# Người dùng có thể thay đổi các giá trị này để test với các ví dụ khác nhau
user_text = "Win a free iPhone! Click here now!"
k_value = 5

print(f"***Testing with k={k_value}")
result = spam_classifier_pipeline(user_text, k=k_value)

print("***To test with different inputs:")
print("1. Change 'user_text' variable above")
print("2. Change 'k_value' for different number of neighbors")
print("3. Re-run this cell")

***Interactive Testing

***Testing with k=5

***Classifying: 'Win a free iPhone! Click here now!'

***Using top-5 nearest neighbors

***Prediction: SPAM

***Top neighbors:
1. Label: spam | Score: 0.8663
   Message: FREE entry into our £250 weekly competition just text the word WIN to 80086 NOW. 18 T&C www.txttowin...

2. Label: spam | Score: 0.8591
   Message: TheMob>Yo yo yo-Here comes a new selection of hot downloads for our members to get for FREE! Just cl...

3. Label: spam | Score: 0.8580
   Message: U have won a nokia 6230 plus a free digital camera. This is what u get when u win our FREE auction. ...

4. Label: spam | Score: 0.8573
   Message: Call FREEPHONE 0800 542 0578 now!

5. Label: spam | Score: 0.8572
   Message: important information 4 orange user . today is your lucky day!2find out why log onto http://www.uraw...

***To test with different inputs:
1. Change 'user_text' variable above
2. Change 'k_value' for different number of neighbors
3. Re-run this cell
