# Exercise 2 - Vector Embeddings & Semantic Search

## 2.1 Load and Preview the Dataset

In [None]:
import pandas as pd

# Load CSV assuming it's in the same directory
df = pd.read_csv("ex_2_data.csv")

# Preview it
df.head()


## 2.2 Generate Embeddings from Text

In [None]:
!pip install sentence-transformers chromadb


In [None]:
from sentence_transformers import SentenceTransformer

# Load a compact and effective embedding model
model = SentenceTransformer("all-MiniLM-L6-v2")

# Collect the paragraph text to embed
texts = df["paragraph"].tolist()

# Generate embeddings with progress bar
embeddings = model.encode(texts, show_progress_bar=True)
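
Each paragraph is now a fixed-length numeric vector, and semantic closeness is usually measured by comparing those vectors. The following is a minimal sketch of cosine similarity in plain Python. The three toy vectors are hypothetical stand-ins for real embeddings, and `cosine_similarity` is an illustrative helper, not part of `sentence-transformers`.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen for this sketch when a vector is all zeros
    return dot / (norm_a * norm_b)

# Toy 3-dimensional vectors standing in for real embeddings (hypothetical values)
toy_a = [1.0, 0.0, 1.0]
toy_b = [1.0, 0.0, 1.0]
toy_c = [0.0, 1.0, 0.0]

same = cosine_similarity(toy_a, toy_b)        # vectors pointing the same way
orthogonal = cosine_similarity(toy_a, toy_c)  # vectors with nothing in common
```

Values close to 1 mean the two texts point in nearly the same direction in embedding space; values near 0 mean they are unrelated.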


## 2.3 Create a Vector DB with Chroma and Index the Embeddings

In [None]:
import chromadb
from chromadb.config import Settings

# Set up an in-memory ChromaDB client with telemetry disabled
client = chromadb.Client(Settings(anonymized_telemetry=False))

# Create the collection (similar to a table), or reuse it if this cell is re-run
collection = client.get_or_create_collection(name="paragraphs")

# Insert the embeddings together with their text and metadata
collection.add(
    documents=texts,  # original text
    embeddings=embeddings.tolist(),  # NumPy array converted to plain lists
    ids=[str(i) for i in df["id"].tolist()],  # use the 'id' column
    metadatas=df[["source", "category"]].to_dict(orient="records"),
)


## 2.4 Semantic Search: Retrieve Top-5 Similar Passages

In [None]:
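# Define the user query
query = "What is reinforcement learning?"

# Embed the query with the same model that produced the document embeddings
query_embedding = model.encode([query]).tolist()

# Search for the top 5 most similar paragraphs
results = collection.query(
    query_embeddings=query_embedding,
    n_results=5
)

# Display the results with their metadata (fewer than 5 may come back)
matches = zip(results["documents"][0], results["metadatas"][0])
for rank, (paragraph, metadata) in enumerate(matches, start=1):
    print(f"\n🔹 Match {rank}")
    print("Paragraph:", paragraph)
    print("Source:", metadata["source"])
    print("Category:", metadata["category"])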
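
To make the retrieval step less of a black box, here is a rough sketch of what a top-k search does conceptually: score every stored vector against the query vector and keep the best few. It is an exact, brute-force version in plain Python. A vector database typically uses an approximate index and its own configured distance metric, so this should not be read as ChromaDB's actual implementation. The names `top_k` and `toy_index` and the 2-dimensional vectors are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, doc_vecs, k=5):
    """Return up to k (doc_id, score) pairs, best match first."""
    scored = [(cosine_similarity(query_vec, vec), doc_id)
              for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(doc_id, score) for score, doc_id in scored[:k]]

# Hypothetical 2-dimensional "embeddings" keyed by document id
toy_index = {
    "a": [1.0, 0.0],
    "b": [0.7, 0.7],
    "c": [0.0, 1.0],
}

hits = top_k([1.0, 0.1], toy_index, k=2)
for doc_id, score in hits:
    print(doc_id, round(score, 3))
```

Brute force is fine for a few thousand paragraphs; the cost grows linearly with the number of stored vectors, which is the reason dedicated indexes exist.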


## Business Insights & Client-Facing Recommendations

### 🔍 Key Business Insights:
- We successfully built a **semantic search pipeline** using sentence embeddings and ChromaDB.
- The system can **identify semantically similar content**, even when wording differs — enabling smarter search, FAQ automation, or knowledge base enhancement.
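- The same approach can be applied to **internal document retrieval**, **customer service automation**, or **R&D content indexing**, although larger corpora would need a persistent or managed vector store rather than the in-memory client used here.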

### 🧠 Example Use Cases:
- **Customer Support**: Retrieve relevant troubleshooting articles or past tickets based on customer queries.
- **Enterprise Knowledge Base**: Let employees find policies or procedures using natural questions.
- **R&D and Legal Teams**: Quickly surface patents, papers, or legal precedents semantically related to the topic of interest.

---

### 🧰 Mapping to IBM Tools & Services:
- **IBM watsonx.ai**: Can host and fine-tune foundation models, including MiniLM-style transformers, for vector representation.
- **IBM watsonx.data**: Combine structured + unstructured semantic search over hybrid data lakes.
- **IBM Cloud Databases for Elasticsearch**: A scalable vector DB alternative to ChromaDB, allowing enterprise-grade search.
- **IBM Cloud Pak for Data**: Full AI lifecycle platform where embedding models and search pipelines can be deployed with governance.

---

✅ This solution shows how foundational AI components (like embedding models + vector search) can power real-world business tools. It's a modular and explainable approach that can evolve with client needs.


# Exercise 3 - LLM-based Evaluation

## 3.1 Load the Summaries and Compare with BLEU Score

In [None]:
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Load summaries from the uploaded files
with open("summary-1-flan-ul2--article1.txt", "r") as f:
    reference_summary = f.read()

with open("summary-2-flan-ul2--article1.txt", "r") as f:
    candidate_summary = f.read()

# Tokenize the summaries
reference_tokens = [reference_summary.split()]
candidate_tokens = candidate_summary.split()

# Apply smoothing (for shorter summaries)
smoothie = SmoothingFunction().method4

# Compute BLEU score
bleu_score = sentence_bleu(reference_tokens, candidate_tokens, smoothing_function=smoothie)
print(f"BLEU Score: {bleu_score:.4f}")


## BLEU Score Interpretation
A BLEU score of 0.0139 (on a scale from 0 to 1) is very low.
It indicates very little n-gram overlap between the candidate and the reference summary.
This points to different wording or structure, not necessarily poor content: in summarization, two good summaries often phrase the same facts differently.
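
To see why rewording drives the score down, it helps to look at the clipped (modified) n-gram precision that BLEU is built on. The sketch below is a simplified illustration, not NLTK's implementation: it omits the brevity penalty, the geometric mean over n-gram orders, and smoothing. The two example sentences are made up for the demonstration.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(reference, candidate, n):
    """Share of candidate n-grams that also occur in the reference (clipped)."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it appears in the reference
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / total

# Hypothetical sentences: same meaning, different wording
reference = "the cat sat on the mat".split()
paraphrase = "a cat was sitting on a rug".split()

p1 = modified_precision(reference, paraphrase, 1)  # unigram precision
p2 = modified_precision(reference, paraphrase, 2)  # bigram precision
print(f"unigram: {p1:.3f}, bigram: {p2:.3f}")
```

Only two of the seven candidate words match and no bigram matches at all, even though a human would call the sentences near-equivalent. Higher-order n-grams collapse quickly under paraphrase, which is what pulls the combined score toward zero.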


## 💡 Business Insights & IBM Solutions

- The BLEU score shows a significant divergence between the two summaries, likely due to different phrasing or structure.
- This underlines a common business challenge: **automated text evaluation needs context-aware metrics**, not just word overlap.
- Organizations building summarization or translation tools should adopt **multi-metric evaluation** strategies that balance lexical precision with semantic relevance.
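
As one possible ingredient of a multi-metric report, the sketch below computes unigram precision, recall, and F1 between a reference and a candidate. It is a simplified, ROUGE-1-like calculation written for illustration (whitespace tokenization, lowercasing, no stemming) and does not reproduce any official ROUGE package. A semantic score, such as cosine similarity between sentence embeddings, would sit alongside it in a fuller report. The function name and example strings are hypothetical.

```python
from collections import Counter

def unigram_overlap_scores(reference, candidate):
    """Precision, recall, and F1 over lowercased whitespace tokens."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Counter intersection keeps the minimum count of each shared token
    overlap = sum((ref_counts & cand_counts).values())
    precision = overlap / max(sum(cand_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}

# Hypothetical example pair
scores = unigram_overlap_scores(
    "the model learns from rewards",
    "the agent learns from reward signals",
)
print(scores)
```

Reporting recall next to precision shows whether a candidate misses reference content or merely adds extra wording, a distinction a single BLEU number hides.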

### ✅ IBM Tools & Capabilities
- Use **IBM watsonx.ai** to compare candidate summaries using semantic similarity metrics (e.g., cosine similarity on embeddings).
- Integrate **IBM Watson Natural Language Understanding (NLU)** to assess sentiment, key concepts, and named entities in summaries for deeper insights.
- For scalable and compliant NLP pipelines, deploy **IBM Cloud Pak for Data** with integrated governance and automated model lifecycle tools.

# Exercise 1 - Model Prompting

## 1.1 Load and Preview the Data

In [1]:
import pandas as pd

# Load the job postings data
df = pd.read_csv("job_postings.csv")
df.head()


| Unnamed: 0 | posting_date | company_name | job_description |
|---|---|---|---|
| 0 | 2024-09-01 | Nexarion Inc. | We're on the hunt for a talented software engi... |
| 1 | 2024-08-25 | Eonix Solutions | About Us: We're a team of innovators and probl... |
| 2 | 2024-09-10 | Kaidon Technologies | Job Title: Product Manager\nLocation: Redmond,... |
| 3 | 2024-08-30 | Lumina Creative | Our company is a dynamic and innovative startu... |
| 4 | 2024-09-05 | Voltara Inc. | Job Summary: We're seeking an electrical engin... |


## 1.2 Prepare Prompt and Hugging Face API Call

In [23]:
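import os
import requests

# Hugging Face Inference API endpoint for the chosen model
API_URL = "https://api-inference.huggingface.co/models/meta-llama/Llama-2-7b-chat-hf"

# Read the access token from the environment instead of hard-coding it.
# HF_API_TOKEN is an assumed variable name; set it before running this cell.
headers = {
    "Authorization": f"Bearer {os.environ['HF_API_TOKEN']}"
}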



## 1.3 Define a Prompt Function

In [24]:
def extract_job_info(description):
    """Ask the hosted model to extract title, location, and salary as JSON."""
    prompt = f"""You are an AI assistant. Extract the following information from this job description:
- Job Title
- Location
- Salary Range (if mentioned)

Format the output as JSON with keys: "title", "location", "salary".

Job Description:
\"\"\"
{description}
\"\"\"
"""
    # A timeout keeps the cell from hanging if the endpoint does not respond
    response = requests.post(API_URL, headers=headers, json={"inputs": prompt}, timeout=60)

    # Debug output: show the HTTP status and the raw response body
    print("Status code:", response.status_code)
    print("Raw response text:", response.text)

    # response.json() raises a ValueError subclass when the body is not valid JSON
    try:
        return response.json()
    except ValueError as e:
        return {"error": "Invalid JSON", "details": str(e), "raw": response.text}



In [25]:
# Test with one job description
extract_job_info(df.loc[0, "job_description"])


Status code: 404
Raw response text: Not Found


{'error': 'Invalid JSON',
 'details': 'Expecting value: line 1 column 1 (char 0)',
 'raw': 'Not Found'}
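
The run above stops at a 404, so no model output was ever parsed. Once an endpoint does return generated text, that text still has to be turned into the `title` / `location` / `salary` dictionary the prompt asks for, and chat models often wrap the JSON in extra prose. The following is a minimal sketch of one way to pull the first JSON object out of free text using only the standard library. `parse_first_json_object` is a hypothetical helper, and `sample_output` is an invented reply, not something the API returned.

```python
import json

def parse_first_json_object(text):
    """Return the first JSON object embedded in text, or None if there is none."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at index `start`
            obj, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict):
            return obj
        # Not valid JSON from this brace; try the next one
        start = text.find("{", start + 1)
    return None

# Invented model reply with prose around the JSON payload
sample_output = (
    "Sure! Here is the extracted information:\n"
    '{"title": "Software Engineer", "location": "Remote", "salary": null}\n'
    "Let me know if you need anything else."
)

parsed = parse_first_json_object(sample_output)
print(parsed)
print(parse_first_json_object("Not Found"))
```

Returning `None` instead of raising lets the caller decide how to handle rows where the model did not follow the requested format.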