### 📦 Step 1: Importing Libraries

**What I'm doing here:**  
I’m importing all the essential Python libraries I’ll use in this notebook:
- `pandas` and `numpy` for handling data,
- `sentence-transformers` for semantic embeddings,
- `cosine_similarity` from scikit-learn to measure similarity between texts,
- `TfidfVectorizer` to extract features from text using TF-IDF,
- `matplotlib.pyplot` for visualization,
- and other helpers like `time`, `re`, and Hugging Face’s `AutoTokenizer`.


In [None]:
import pandas as pd
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer
import matplotlib.pyplot as plt
import time
import re
from transformers import AutoTokenizer

### 📂 Step 2: Loading the Dataset

**What I'm doing here:**  
Now I'm loading the dataset (`DataNeuron_Text_Similarity.csv`) into a DataFrame using pandas.  
I'll preview the first few rows to understand the structure of the data I'm working with.


In [None]:
# 2. Load Dataset
file_path = "DataNeuron_Text_Similarity.csv"

df = pd.read_csv(file_path)

df.head()

# Dataset Understanding & Preprocessing

### 📖 Step 3: Measuring Readability

**What I'm doing here:**  
I'm using the `textstat` library to measure how easy it is to read each sentence using the **Flesch Reading Ease** score.  
This helps me understand the complexity of the text in both `text1` and `text2`.
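
For reference, the Flesch Reading Ease score is usually defined by the formula below, where higher values mean easier text. This is the standard textbook form; `textstat` may count sentences and syllables slightly differently, so treat it as an approximation of what the library computes.

\[
\text{FRE} = 206.835 - 1.015 \cdot \frac{\text{total words}}{\text{total sentences}} - 84.6 \cdot \frac{\text{total syllables}}{\text{total words}}
\]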


### Readability Analysis

In [None]:
import textstat

read_scores1 = df['text1'].apply(textstat.flesch_reading_ease)
print("Avg Flesch Reading Score of text1:", read_scores1.mean())
read_scores2 = df['text2'].apply(textstat.flesch_reading_ease)
print("Avg Flesch Reading Score of text2:", read_scores2.mean())

### 🚫 (Skipped) Optional Text Cleaning

**What I planned here (but skipped):**  
I had written a function to clean the text by converting it to lowercase, removing special characters, and stripping extra spaces.  
But I’ve commented it out because I'm working with already clean data.


### Clean & Normalize Text

In [None]:
# import re

# def clean_text(text):
#     text = str(text).lower()
#     text = re.sub(r'[^a-z0-9]', ' ', text)
#     text = re.sub(r'\s+', ' ', text).strip()
#     return text

# df['text1_clean'] = df['text1'].apply(clean_text)
# df['text2_clean'] = df['text2'].apply(clean_text)

# Will not be used here as we are using the pre-cleaned data


### 🔢 Step 4: Analyzing Text Length

**What I'm doing here:**  
I'm calculating the number of words in `text1` and `text2`. This helps me understand how long the texts are on average.


### Length Analysis (Word Count)

In [None]:
df['text1_len'] = df['text1'].str.split().str.len()
df['text2_len'] = df['text2'].str.split().str.len()

print("Average length of text1:", df['text1_len'].mean())
print("Average length of text2:", df['text2_len'].mean())

### 🧮 Step 5: TF-IDF Baseline Similarity

**What I'm doing here:**  
Now I’m building a **baseline model** using TF-IDF vectorization to represent the texts as vectors.  
Then I compute **cosine similarity** between the vectors of `text1` and `text2`.  
This gives me a basic similarity score using traditional NLP techniques.
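
To make the limitation of this baseline concrete, here is a minimal sketch on two made-up sentence pairs (they are not from the dataset). It assumes scikit-learn's default `TfidfVectorizer` settings. The first pair shares most of its words, while the second is a paraphrase with no shared vocabulary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy sentences, invented for illustration only
pairs = [
    ("the cat sat on the mat", "the cat sat on the rug"),  # heavy word overlap
    ("the film was great", "a wonderful movie"),           # paraphrase, no shared words
]

# Fit one vocabulary over every sentence so both sides share a feature space
vectorizer = TfidfVectorizer()
vectorizer.fit([sentence for pair in pairs for sentence in pair])

scores = []
for a, b in pairs:
    vec_a = vectorizer.transform([a])
    vec_b = vectorizer.transform([b])
    scores.append(float(cosine_similarity(vec_a, vec_b)[0][0]))

print(scores)  # the paraphrase pair scores zero because no token is shared
```

Because TF-IDF only sees shared tokens, the paraphrase pair gets a similarity of zero even though the meaning is close. That gap is what the embedding models later in the notebook are meant to close.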


# Baseline & Feature-Based Similarity
Before embeddings, test traditional vector similarity.

In [None]:
# 5. TF-IDF Baseline
# Baseline using vectorization and cosine similarity (traditional method)
tfidf = TfidfVectorizer(max_features=5000)
# Fit one shared vocabulary on both columns so the vectors are comparable
tfidf.fit(pd.concat([df['text1'], df['text2']]))

t1_vecs = tfidf.transform(df['text1'])
t2_vecs = tfidf.transform(df['text2'])

# Cosine similarity between row i of text1 and row i of text2
df['similarity_tfidf'] = [cosine_similarity(t1_vecs[i], t2_vecs[i])[0][0] for i in range(len(df))]
print("🧮 Baseline TF-IDF Avg Similarity:", round(df['similarity_tfidf'].mean(), 3))
# Keep the preview as the last expression so the notebook displays it
df[['text1', 'text2', 'similarity_tfidf']].head()


# Choosing a Model


Semantic Similarity Evaluation using Sentence Transformers
-----------------------------------------------------------

**Assumptions**
- **Unlabeled dataset:** No ground-truth similarity scores are provided.
- **Options:** I can use pretrained models, embedding distances, or generate pseudo-labels for supervised fine-tuning.
- **Focus:** How semantically similar two paragraphs are, not just how many words they share.

**Evaluation Metrics**
- **Cosine Similarity:** The cosine of the angle between two embedding vectors.
- **Average Cosine Similarity:** The mean cosine similarity across all text pairs.
- **Runtime Performance:** Wall-clock time to encode the sample and score it.

This section evaluates different sentence embedding models on how they measure semantic similarity between pairs of texts. For each model it reports the average cosine similarity, the runtime, and the average token count per text.
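
As a minimal sketch of the two similarity metrics, the NumPy functions below compute the cosine of the angle between two vectors and the average over row-aligned pairs. The function names are my own, and the sketch assumes no vector is all zeros. The evaluation code itself uses scikit-learn's `cosine_similarity`.

```python
import numpy as np

def cosine(u, v):
    """Cosine of the angle between two 1-D vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def average_pairwise_cosine(A, B):
    """Mean cosine similarity between row i of A and row i of B."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    # Scale every row to unit length; the row-wise dot product is then the cosine
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    return float(np.mean(np.sum(A * B, axis=1)))

print(cosine([1, 2], [2, 4]))   # parallel vectors
print(cosine([1, 0], [0, 1]))   # orthogonal vectors
print(average_pairwise_cosine([[1, 0], [1, 0]], [[1, 0], [0, 1]]))
```

Cosine similarity ignores vector length, so only direction matters. That is why the embeddings can be L2-normalized before scoring without changing the result.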

In [None]:
# -----------------------------
# 6. Model Caching
# -----------------------------
_cached_models = {}
_cached_tokenizers = {}

def load_model_and_tokenizer(name, path):
    if name not in _cached_models:
        print(f"\n🔍 Loading model and tokenizer for {name}...")
        _cached_models[name] = SentenceTransformer(path)
        _cached_tokenizers[name] = AutoTokenizer.from_pretrained(path)
    return _cached_models[name], _cached_tokenizers[name]
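
The dictionary cache above could also be written with `functools.lru_cache`. The sketch below shows the idea with a hypothetical stand-in loader (`load_resource`) in place of `SentenceTransformer(path)`, so it runs without downloading anything. It is an alternative pattern, not what the notebook uses.

```python
from functools import lru_cache

load_calls = []  # records each real load so the caching effect is visible

@lru_cache(maxsize=None)
def load_resource(path):
    """Hypothetical stand-in for an expensive load such as SentenceTransformer(path)."""
    load_calls.append(path)
    return {"path": path}  # placeholder object instead of a real model

first = load_resource("sentence-transformers/all-MiniLM-L12-v2")
second = load_resource("sentence-transformers/all-MiniLM-L12-v2")

# The second call is served from the cache: same object, loader ran once
print(first is second, len(load_calls))
```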


In [None]:
# -----------------------------
# 1. Token Count Estimation
# -----------------------------
def get_token_count(tokenizer, texts, max_length=None):
    """
    Estimate token count per text using the model tokenizer.
    """
    counts = []
    for text in texts:
        tokens = tokenizer.encode(text, truncation=False)
        if max_length:
            tokens = tokens[:max_length]
        counts.append(len(tokens))
    return counts

# -----------------------------
# 2. Text Truncation Helper
# -----------------------------
def truncate_texts(texts, tokenizer, max_length):
    """
    Truncate each text to max_length tokens.
    """
    truncated_texts = []
    for text in texts:
        tokens = tokenizer.encode(text, truncation=False)
        tokens = tokens[:max_length]
        truncated = tokenizer.decode(tokens, clean_up_tokenization_spaces=True)
        truncated_texts.append(truncated)
    return truncated_texts

# -----------------------------
# 3. Model Evaluation Function
# -----------------------------
def evaluate_model(name, model, tokenizer, texts1, texts2, max_length=None,batch_size=32):
    """
    Run model evaluation: encode, compute similarity, runtime, and token stats.
    """
    # Truncate input texts
    if max_length:
        texts1 = truncate_texts(texts1, tokenizer, max_length)
        texts2 = truncate_texts(texts2, tokenizer, max_length)

    start = time.time()

    # Encode with L2 normalization. Inputs longer than the model's max_seq_length
    # are truncated by SentenceTransformer itself, so no truncation flag is passed.
    encode_kwargs = dict(convert_to_numpy=True, normalize_embeddings=True,
                         show_progress_bar=True, batch_size=batch_size)
    emb1 = model.encode(texts1, **encode_kwargs)
    emb2 = model.encode(texts2, **encode_kwargs)

    # Compute cosine similarities
    similarities = [cosine_similarity([emb1[i]], [emb2[i]])[0][0] for i in range(len(emb1))]
    avg_sim = np.mean(similarities)
    elapsed = time.time() - start

    # Token statistics
    token_counts1 = get_token_count(tokenizer, texts1, max_length)
    token_counts2 = get_token_count(tokenizer, texts2, max_length)

    return {
        "model": name,
        "avg_similarity": round(avg_sim, 4),
        "runtime_sec": round(elapsed, 2),
        "avg_tokens_text1": round(np.mean(token_counts1), 2),
        "avg_tokens_text2": round(np.mean(token_counts2), 2)
    }

# -----------------------------
# 4. Model Configuration
# -----------------------------
model_configs = {
    "MiniLM-L12": "sentence-transformers/all-MiniLM-L12-v2",
    "MPNet": "sentence-transformers/all-mpnet-base-v2",
    # Note: the "MultiQA" label points at the paraphrase-mpnet-base-v2 checkpoint
    "MultiQA": "sentence-transformers/paraphrase-mpnet-base-v2"
}

# -----------------------------
# 7. Evaluation
# -----------------------------
def run_evaluation(df, sample_size=50, max_length=512):
    sample = df.sample(sample_size, random_state=42).reset_index(drop=True)
    results = []

    for name, model_path in model_configs.items():
        model, tokenizer = load_model_and_tokenizer(name, model_path)
        print(f"Evaluating {name}...")
        res = evaluate_model(
            name,
            model,
            tokenizer,
            sample['text1'].tolist(),
            sample['text2'].tolist(),
            max_length=max_length,
        )
        results.append(res)

    results_df = pd.DataFrame(results).sort_values(by="avg_similarity", ascending=False)
    return results_df

In [None]:
print("Running model comparisons on a random sample of 50 text pairs...")
summary = run_evaluation(df)  # samples 50 rows internally (random_state=42)

print("\n Final Evaluation Summary:")
print(summary.to_string(index=False))


After running the semantic similarity evaluation, I observed that MultiQA (the `paraphrase-mpnet-base-v2` checkpoint) achieved the highest average similarity (0.1913), which I take as a sign of better semantic understanding of longer texts, though it also took the most time (~119s). MPNet offered a good balance between similarity score and speed, while MiniLM-L12 was by far the fastest (~5s) but produced a lower similarity score. Because the dataset is unlabeled, a higher average similarity is only a rough proxy for semantic quality, not a measured accuracy.

Based on this, I chose MultiQA, since for my use case capturing deeper meaning matters more than runtime speed.

In [None]:
from sklearn.preprocessing import MinMaxScaler

def compute_and_normalize_similarity(df, text_col1='text1', text_col2='text2', max_length=512):
    """
    Compute semantic similarity using MultiQA on full dataset and normalize results [0,1].

    Args:
        df (pd.DataFrame): DataFrame with two text columns.
        text_col1 (str): Name of first text column.
        text_col2 (str): Name of second text column.
        max_length (int): Max token length to truncate texts before encoding.

    Returns:
        pd.DataFrame: DataFrame with new 'normalized_similarity' column.
    """
    # Load MultiQA model and tokenizer (cached)
    model, tokenizer = load_model_and_tokenizer("MultiQA", model_configs["MultiQA"])

    # Optionally truncate texts
    texts1 = truncate_texts(df[text_col1].tolist(), tokenizer, max_length)
    texts2 = truncate_texts(df[text_col2].tolist(), tokenizer, max_length)

    # Encode texts
    emb1 = model.encode(texts1, convert_to_numpy=True, show_progress_bar=True)
    emb2 = model.encode(texts2, convert_to_numpy=True, show_progress_bar=True)

    # Compute cosine similarity diagonal for each pair
    similarities = np.diag(cosine_similarity(emb1, emb2))

    # Normalize similarities to range [0,1]
    scaler = MinMaxScaler()
    similarities_norm = scaler.fit_transform(similarities.reshape(-1, 1)).flatten()

    # Add normalized similarity column
    df_Next = df.copy()
    df_Next['normalized_similarity'] = similarities_norm

    return df_Next


**Analysis:**
The function above encodes both text columns with the chosen model, takes the cosine similarity of each pair, and rescales the scores to [0, 1] with `MinMaxScaler`. In the next cell, I run it on the full dataset and preview the first rows.
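
To show what the min-max step does to the raw cosine scores, here is a minimal NumPy sketch of the same rescaling. The helper name and the sample values are made up for illustration.

```python
import numpy as np

def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    values = np.asarray(values, dtype=float)
    lo, hi = values.min(), values.max()
    if hi == lo:
        # A constant input has no spread; this sketch maps it to zeros
        return np.zeros_like(values)
    return (values - lo) / (hi - lo)

raw = np.array([-0.1, 0.2, 0.5])   # made-up cosine scores
scaled = min_max_scale(raw)
print(scaled)
```

One consequence worth keeping in mind: the scaled score is relative to the dataset. The least similar pair always gets 0 and the most similar always gets 1, so the numbers are not directly comparable across datasets.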

In [None]:
# Assuming df is your full dataset with columns 'text1' and 'text2'
df_with_sim = compute_and_normalize_similarity(df, max_length=512)
print(df_with_sim.head())


**Analysis:**
Next, I sort the pairs by `normalized_similarity` in descending order and look at the five most similar pairs as a sanity check on the scores.

In [None]:
# 6. Analyze Results
print("\nTop 5 Most Similar Pairs:")
df_with_sim.sort_values('normalized_similarity', ascending=False).head(5)[['text1', 'text2', 'normalized_similarity']]



**Analysis:**
I repeat the check at the other end of the scale and look at the five pairs with the lowest `normalized_similarity`.

In [None]:
print("\nTop 5 Least Similar Pairs:")
df_with_sim.sort_values('normalized_similarity', ascending=True).head(5)[['text1', 'text2', 'normalized_similarity']]


**Analysis:**
To see what happens inside the embedding step, I reproduce it directly with Hugging Face `transformers`: tokenize the sentences, run the model, and mean-pool the token embeddings.

In [None]:
from transformers import AutoTokenizer, AutoModel
import torch


#Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0] #First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


# Sentences we want sentence embeddings for
sentences = ['This is an example sentence', 'Each sentence is converted']

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/paraphrase-mpnet-base-v2')
model = AutoModel.from_pretrained('sentence-transformers/paraphrase-mpnet-base-v2')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling. In this case, mean pooling.
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

print("Sentence embeddings:")
print(sentence_embeddings)


# 🧠 Sentence Embedding using Transformers: Analysis and Reflection

## Overview

In this exercise, I worked with the HuggingFace `transformers` library to generate **sentence embeddings** using the pre-trained model `sentence-transformers/paraphrase-mpnet-base-v2`. Sentence embeddings are numerical representations of text that capture semantic information and can be used for tasks like semantic search, clustering, or sentence similarity.

The code performs the following steps:

1. **Tokenization**  
   I used `AutoTokenizer` to tokenize a list of sentences. Tokenization converts text into token IDs while ensuring proper padding and truncation for batch processing.

2. **Model Loading**  
   I loaded the pre-trained `paraphrase-mpnet-base-v2` model from HuggingFace, which is fine-tuned for generating semantically meaningful sentence embeddings.

3. **Model Inference**  
   The tokenized input is passed through the transformer model using `torch.no_grad()` to prevent gradient computations, which speeds up inference and reduces memory usage.

4. **Mean Pooling**  
   After obtaining token-level embeddings from the model, I applied **mean pooling**, which averages the embeddings across all tokens in the sentence, taking the attention mask into account. This provides a fixed-size vector representation per sentence.

5. **Output**  
   Finally, I printed the sentence embeddings, which are `768`-dimensional vectors capturing the semantic meaning of the input sentences.

---

## 🧮 Pooling Strategy: Mean Pooling

The `mean_pooling` function is crucial here. It ensures that only the actual tokens (not padding) contribute to the final sentence embedding by applying the attention mask. This gives a more accurate representation of the sentence meaning, as padding tokens are excluded from the average.
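
A small NumPy sketch makes the effect of the mask visible. The shapes mirror the PyTorch version above (`batch × tokens × dim`), but the numbers are made up and the third token plays the role of padding.

```python
import numpy as np

def masked_mean(token_embeddings, attention_mask):
    """Average token vectors per sentence, counting only positions where the mask is 1."""
    emb = np.asarray(token_embeddings, dtype=float)               # (batch, seq, dim)
    mask = np.asarray(attention_mask, dtype=float)[..., None]     # (batch, seq, 1)
    summed = (emb * mask).sum(axis=1)                             # padded positions contribute 0
    counts = np.clip(mask.sum(axis=1), 1e-9, None)                # avoid division by zero
    return summed / counts

emb = np.array([[[1.0, 1.0], [3.0, 3.0], [100.0, 100.0]]])  # third token is padding
mask = np.array([[1, 1, 0]])

pooled = masked_mean(emb, mask)   # averages only the two real tokens
naive = emb.mean(axis=1)          # plain mean lets the padding vector leak in
print(pooled, naive)
```

The masked version averages only the two real tokens, while the plain mean is dragged toward the padding vector.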

---

## 🛠️ If I Had to Build This From Scratch

If I were to build a sentence embedding model from scratch (without using a pre-trained transformer), I would approach it in the following way:

1. **Data Collection**  
   I would first gather a large dataset of labeled sentence pairs, such as SNLI (entailment labels), STS-B (graded similarity scores), or Quora Question Pairs (duplicate-question labels).

2. **Model Architecture**  
   I would implement a transformer-based architecture similar to BERT or MPNet using PyTorch or TensorFlow. This would involve:
   - Token and positional embeddings  
   - Multi-head self-attention layers  
   - Feed-forward layers  
   - Layer normalization and residual connections

3. **Training Objective**  
   I would train the model using a **contrastive learning objective**, such as **triplet loss** or **cosine similarity loss**, to ensure that semantically similar sentences are closer in vector space than dissimilar ones.

4. **Pooling Layer**  
   After obtaining the final hidden states, I would implement a pooling strategy — mean pooling, max pooling, or using the `[CLS]` token embedding — to generate fixed-size sentence vectors.

5. **Evaluation**  
   Finally, I would evaluate the model on downstream tasks like sentence similarity, clustering, or classification using standard benchmarks (e.g., STS Benchmark).
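
To illustrate the training objective in step 3, here is a minimal NumPy sketch of a triplet loss using cosine distance. The margin of `0.5` and the toy vectors are arbitrary choices for illustration. A real training loop would compute this on model outputs inside an autograd framework.

```python
import numpy as np

def cosine_distance(u, v):
    """1 - cosine similarity, so identical directions have distance 0."""
    return 1.0 - float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def triplet_loss(anchor, positive, negative, margin=0.5):
    """max(0, d(anchor, positive) - d(anchor, negative) + margin)."""
    gap = cosine_distance(anchor, positive) - cosine_distance(anchor, negative)
    return max(0.0, gap + margin)

anchor = np.array([1.0, 0.0])
positive = np.array([0.9, 0.1])   # points almost the same way as the anchor
negative = np.array([0.0, 1.0])   # orthogonal to the anchor

easy = triplet_loss(anchor, positive, negative)   # already well separated
hard = triplet_loss(anchor, negative, positive)   # roles swapped on purpose
print(easy, hard)
```

The loss is zero once the positive is closer than the negative by at least the margin, so training effort concentrates on triplets that are still ordered wrongly or too tightly.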

---

## ✍️ Reflection

Using pre-trained models like `paraphrase-mpnet-base-v2` significantly accelerates NLP experimentation and deployment by providing high-quality semantic representations out of the box. However, understanding the internals — like attention, pooling, and training objectives — gives me the confidence to customize or build models for specialized applications when needed.


# 🚀 Part B: Deployment of Sentence Similarity Model as a Server API Endpoint

## Core Approach

In Part B, my goal was to **deploy the sentence similarity algorithm developed in Part A as a RESTful API** on a cloud service provider. This API allows external clients to send two input sentences and receive a similarity score in response.

### Steps Taken:

1. **API Development with FastAPI**  
   I wrapped the sentence embedding and similarity computation logic from Part A inside a lightweight web framework:
   - The API exposes a POST endpoint `/similarity` that accepts a JSON body with keys `"text1"` and `"text2"`.
   - Upon receiving the request, the API:
     - Tokenizes and embeds both input sentences using the pre-trained transformer model.
     - Computes the cosine similarity between the two embeddings.
     - Returns the similarity score in JSON format with the key `"similarity score"`.

2. **Model Loading and Caching**
   - The `SentenceTransformer` model (`paraphrase-mpnet-base-v2`) is either loaded from disk using `joblib` or downloaded and cached on first run.
   - This drastically improves performance by preventing the model from reloading with every request.
   - The model is used in CPU mode for broad compatibility across environments.


3. **Hosting**
   - The API was deployed on a **self-hosted Linux server** with public IP access.
   - It runs on port `8005` and can be accessed globally.
   - Deployment was managed using `uvicorn` as the ASGI server.

4. **Request-Response Format**  
   The API strictly follows the prescribed request-response format:

   - **Request JSON**:  
     ```json
     {
       "text1": "nuclear body seeks new tech",
       "text2": "terror suspects face arrest"
     }
     ```
   
   - **Response JSON**:  
     ```json
     {
       "similarity score": 0.2
     }
     ```
5. **Similarity Computation Logic**
   - Both input sentences are encoded into dense vector embeddings using the loaded transformer.
   - Cosine similarity is calculated using `sklearn.metrics.pairwise.cosine_similarity`.
   - The score, initially in the range `[-1, 1]`, is normalized to `[0, 1]` using the formula:  
     \[
     \text{normalized\_score} = \frac{\text{cosine\_score} + 1}{2}
     \]
   - The output is rounded to 4 decimal places for consistency.

6. **Error Handling**
   - If either `"text1"` or `"text2"` is missing or empty, a `400 Bad Request` is returned.
   - This prevents unnecessary computation and ensures input quality.

7. **Testing**  
   I tested the API using HTTP clients such as `curl` or Postman to ensure the endpoint correctly processes inputs and returns the expected output.
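
The request handling described in the steps above can be sketched independently of the web framework. In the sketch below, `similarity_response`, `toy_embed`, and the `"detail"` error key are hypothetical names chosen for illustration. In the deployed service the embedding role is played by the `SentenceTransformer` model and the routing by FastAPI.

```python
import numpy as np

def similarity_response(payload, embed):
    """Return (status_code, body) for one request.

    `embed` is a hypothetical callable mapping a string to a 1-D vector.
    """
    text1 = str(payload.get("text1") or "").strip()
    text2 = str(payload.get("text2") or "").strip()
    if not text1 or not text2:
        # Reject missing or empty inputs before doing any model work
        return 400, {"detail": "Both 'text1' and 'text2' must be non-empty."}

    u = np.asarray(embed(text1), dtype=float)
    v = np.asarray(embed(text2), dtype=float)
    cosine = float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
    normalized = (cosine + 1.0) / 2.0   # map [-1, 1] onto [0, 1]
    return 200, {"similarity score": round(normalized, 4)}

def toy_embed(text):
    """Toy embedding for demonstration only: letter counts plus a constant."""
    return [text.count("a"), text.count("b"), 1.0]

ok = similarity_response({"text1": "aab", "text2": "aab"}, toy_embed)
bad = similarity_response({"text1": "aab"}, toy_embed)
print(ok, bad)
```

Keeping this logic in a plain function makes it easy to unit-test without starting the server. The endpoint then only needs to parse the JSON body and translate the status code.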

---

## Submission Contents

- **Live API Endpoint:**  
  The deployed API is accessible at:  
  `http://207.148.78.17:8005/similarity`  

- **Complete Code:**  
  Both Part A (model and embedding code) and Part B (API and deployment scripts) are provided as `.py` files.

- **Report:**  
  This report explains the main approach for both parts in a concise manner.

---

## Reflection

Deploying this sentence similarity model as a cloud-accessible API made it immediately usable for real-world applications. Key strengths of this deployment include:

- **Performance**: Using cached model loading and CPU-friendly inference ensured fast response times.
- **Scalability**: FastAPI’s non-blocking async design and the `uvicorn` server allow handling multiple requests concurrently.
- **Portability**: The deployment is cloud-agnostic and can easily be ported to AWS, Azure, or Docker environments.

However, there are also some limitations to consider:
- **Limited Model Flexibility**: Using a pre-trained model limits the model’s ability to adapt to new data.
- **Single Endpoint**: The API only supports a single similarity computation endpoint.
- **No Authentication**: The API is open to the public without any authentication or authorization.
- **Time Constraints**: The API is not optimized for high-frequency or low-latency use cases, and a complete model couldn't be built within the given two-day timeframe.

In summary, this deployment showcases the power of FastAPI and Hugging Face Transformers in building performant, scalable, and portable cloud services.

---

