# Week 3 | Assignment 2
## Assignment 2: Personalized Course Recommendation Engine
### 1. Background & Context
Online learning platforms host thousands of courses across domains, and learners often feel overwhelmed by the choices. A personalized recommender that understands both course content and individual learner profiles can boost engagement and completion rates by suggesting the most relevant next steps.

### 2. Problem Statement
“Design and implement a Course Recommendation Engine that, given a user query (completed courses + a short interests blurb), returns the top-5 most relevant courses from a catalog of course offerings, using embedding models and a vector database for semantic matching.”

### 3. Objectives & Learning Outcomes
- Embeddings & Semantic Search: Convert course descriptions into high-dimensional vectors.
- Vector Database: Index and query vectors for fast similarity retrieval.
- Recommendation Logic: Rank courses by cosine similarity to the user query vector.
- Basic UI/CLI (optional): Demonstrate the engine end-to-end with sample queries.
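
The ranking step in the objectives above can be illustrated with a minimal sketch. It uses only the standard library and toy 3-dimensional vectors; the helper names `cosine` and `rank_by_cosine` and the course IDs `A`, `B`, `C` are illustrative assumptions, not part of the assignment.

```python
import math
from typing import Dict, List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_cosine(
    query_vec: List[float],
    course_vecs: Dict[str, List[float]],
    top_k: int = 5,
) -> List[Tuple[str, float]]:
    """Return the top_k (course_id, score) pairs, highest score first."""
    scored = [(cid, cosine(query_vec, vec)) for cid, vec in course_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy vectors (assumption); real embeddings have far more dimensions.
toy_vectors = {
    "A": [1.0, 0.0, 0.0],
    "B": [0.7, 0.7, 0.0],
    "C": [0.0, 0.0, 1.0],
}
ranking = rank_by_cosine([1.0, 0.1, 0.0], toy_vectors, top_k=2)
print(ranking)
```

Here the query points almost entirely along the first axis, so `A` ranks ahead of `B`, and `C` (orthogonal to the query) is cut off by `top_k=2`.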

### 4. Technical Requirements
- Data Ingestion & Indexing: Read the course catalog (title + description), compute embeddings, and upsert into a vector DB.
- Recommendation Logic & API:

```python
from typing import List, Tuple

def recommend_courses(profile: str, completed_ids: List[str]) -> List[Tuple[str, float]]:
    """
    Returns a list of (course_id, similarity_score) for the top-5
    recommendations.
    """
    ...
```
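
One way to read this signature is sketched below: embed the profile, score every course the learner has not completed, and return the best matches. The three-course `CATALOG`, the word-count `toy_embed` function, and the extra `top_k` parameter are hypothetical stand-ins so the example runs without an embedding service; a real solution would call an embedding model and a vector database instead.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical three-course catalog, for illustration only.
CATALOG: Dict[str, str] = {
    "X1": "Intro to Python programming and data analysis",
    "X2": "Data visualization with charts and dashboards",
    "X3": "Container orchestration with Kubernetes",
}


def toy_embed(text: str) -> Counter:
    """Stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recommend_courses(
    profile: str, completed_ids: List[str], top_k: int = 5
) -> List[Tuple[str, float]]:
    """Rank catalog courses against the profile, skipping completed ones."""
    query_vec = toy_embed(profile)
    done = set(completed_ids)
    scored = [
        (cid, cosine(query_vec, toy_embed(text)))
        for cid, text in CATALOG.items()
        if cid not in done
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


result = recommend_courses("I enjoy data visualization", completed_ids=["X1"])
print(result)
```

Filtering completed courses before ranking keeps the returned list at full length whenever enough uncompleted courses exist.
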
### 5. Deliverables
- Code: Jupyter notebook file in PDF format OR .py files in a zip (include requirements.txt if needed)
- Evaluation Report: The Jupyter notebook should include test results for each of the 5 test profiles; list the recommendations and comment on their relevance.

### 6. Dataset

Dataset file: assignment2dataset.csv

https://raw.githubusercontent.com/Bluedata-Consulting/GAAPB01-training-code-base/refs/heads/main/Assignments/assignment2dataset.csv

### 7. Sample Input Queries
- “I’ve completed the ‘Python Programming for Data Science’ course and enjoy data visualization. What should I take next?”
- “I know Azure basics and want to manage containers and build CI/CD pipelines. Recommend courses.”
- “My background is in ML fundamentals; I’d like to specialize in neural networks and production workflows.”
- “I want to learn to build and deploy microservices with Kubernetes—what courses fit best?”
- “I’m interested in blockchain and smart contracts but have no prior experience. Which courses do you suggest?”


In [1]:
# !python -m pip install langchain-text-splitters ragas --quiet

In [2]:
import os
import numpy as np
import pandas as pd
from typing import List, Optional, Tuple, Dict
from langchain_openai import AzureChatOpenAI, AzureOpenAIEmbeddings
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate
from langchain_core.output_parsers import PydanticOutputParser
from pydantic import BaseModel, Field
from sklearn.metrics.pairwise import cosine_similarity
from langchain_community.vectorstores import FAISS
from langchain.schema import Document
from ragas import evaluate, SingleTurnSample, EvaluationDataset
from ragas.metrics import AnswerRelevancy, Faithfulness, ContextPrecision
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain.prompts import ChatPromptTemplate
import json

from datasets import Dataset
from tqdm import tqdm



In [3]:
from dotenv import load_dotenv
load_dotenv()

True

### Initializing LLM & Embedding Models

In [4]:
client = AzureChatOpenAI(
    deployment_name=os.environ['AZURE_OPENAI_DEPLOYMENT'],  # Your deployment name
    model_name="gpt-4o",
    temperature=0.1
)

In [5]:
# Quick LLM invocation test to verify model deployment

client.invoke('What is the capital of France?').content 

'The capital of France is Paris.'

In [6]:

embedding_model = AzureOpenAIEmbeddings(
    model="text-embedding-ada-002",
)

In [7]:
# Quick Embedding model test to verify model deployment

print(embedding_model.embed_query("The quick brown fox jumps over the lazy dog")[:10]) 

[-0.0035696455743163824, 0.008301939815282822, -0.014215736649930477, -0.004543756600469351, -0.01546008512377739, 0.01862751692533493, -0.02047518640756607, -0.010218738578259945, -0.012883404269814491, -0.028129812330007553]


In [8]:
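# Quick Similarity Test
# reshape(1, -1) turns each embedding into a single row vector, which is the
# (n_samples, n_features) layout cosine_similarity expects. reshape(-1, 1)
# would treat every component as its own one-feature sample, so each entry
# of the result could only be 1.0 or -1.0.

print("Cosine Similarity:", cosine_similarity(
    np.array(embedding_model.embed_query("It's a lovely day outside")).reshape(1, -1),
    np.array(embedding_model.embed_query("The weather today is beautiful")).reshape(1, -1)
)[0][0])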
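
A pairwise cosine function compares rows, so the shape of the input decides what gets compared. The sketch below reproduces that behaviour in plain Python with short made-up vectors; `pairwise_cosine` is a hypothetical helper that only mimics the row-by-row convention and is not the scikit-learn implementation.

```python
import math
from typing import List

Matrix = List[List[float]]


def pairwise_cosine(X: Matrix, Y: Matrix) -> Matrix:
    """Cosine similarity between every row of X and every row of Y."""
    def cos(a: List[float], b: List[float]) -> float:
        dot = sum(p * q for p, q in zip(a, b))
        norm_a = math.sqrt(sum(p * p for p in a))
        norm_b = math.sqrt(sum(q * q for q in b))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
    return [[cos(a, b) for b in Y] for a in X]


# Made-up 3-dimensional "embeddings" (assumption).
u = [0.2, -0.5, 0.8]
v = [0.1, -0.4, 0.9]

# Shape (1, n): one row per embedding, so the result is a single 1x1 score.
as_rows = pairwise_cosine([u], [v])

# Shape (n, 1): every component becomes its own one-element vector,
# so each entry collapses to the sign of a product: +1.0 or -1.0.
as_columns = pairwise_cosine([[x] for x in u], [[y] for y in v])

print(as_rows[0][0])
print(as_columns[0][0])
```

With the row layout the score reflects how closely the two vectors align; with the column layout it carries no information about the sentences at all.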


### Reading Given Dataset

In [9]:
df = pd.read_csv(
    "https://raw.githubusercontent.com/Bluedata-Consulting/GAAPB01-training-code-base/refs/heads/main/Assignments/assignment2dataset.csv"
)

df

| | course_id | title | description |
|---|---|---|---|
| 0 | C001 | Foundations of Machine Learning | Understand foundational machine learning algor... |
| 1 | C002 | Deep Learning with TensorFlow and Keras | Explore neural network architectures using Ten... |
| 2 | C003 | Natural Language Processing Fundamentals | Dive into NLP techniques for processing and un... |
| 3 | C004 | Computer Vision and Image Processing | Learn the principles of computer vision and im... |
| 4 | C005 | Reinforcement Learning Basics | Get introduced to reinforcement learning parad... |
| 5 | C006 | Data Engineering on AWS | Build scalable data pipelines using AWS servic... |
| 6 | C007 | Cloud Computing with Azure | Master Microsoft Azure’s core services: virtua... |
| 7 | C008 | DevOps Practices and CI/CD | Adopt DevOps methodologies to accelerate softw... |
| 8 | C009 | Containerization with Docker and Kubernetes | Learn container fundamentals with Docker: imag... |
| 9 | C010 | APIs and Microservices Architecture | Design and implement RESTful and GraphQL APIs ... |


### Creating Vector Store

In [10]:
documents = []
for _, row in df.iterrows():
    # Combine title and description for richer semantic representation
    content = f"Title: {row['title']}\nDescription: {row['description']}"
    doc = Document(
        page_content=content,
        metadata={
            'course_id': row['course_id'],
            'title': row['title'],
            'description': row['description']
        }
    )
    documents.append(doc)

print(f"Creating vector store with {len(documents)} courses...")
vector_store = FAISS.from_documents(documents, embedding_model)

Creating vector store with 25 courses...


In [11]:
vector_store

<langchain_community.vectorstores.faiss.FAISS at 0x726cca8952b0>
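
Conceptually, a vector store maps IDs to vectors and answers nearest-neighbour queries. The sketch below is a brute-force, in-memory stand-in that shows the upsert and search operations; the class `TinyVectorIndex`, its method names, and the 2-dimensional vectors are hypothetical and do not reflect how FAISS is implemented internally.

```python
from typing import Dict, List, Tuple


class TinyVectorIndex:
    """Hypothetical brute-force index: stores vectors by ID, searches by squared L2 distance."""

    def __init__(self) -> None:
        self._vectors: Dict[str, List[float]] = {}

    def upsert(self, item_id: str, vector: List[float]) -> None:
        # Insert a new vector or overwrite the one already stored under item_id.
        self._vectors[item_id] = list(vector)

    def search(self, query: List[float], k: int = 5) -> List[Tuple[str, float]]:
        # Score every stored vector; a smaller distance means a closer match.
        scored = [
            (item_id, sum((q - x) ** 2 for q, x in zip(query, vec)))
            for item_id, vec in self._vectors.items()
        ]
        scored.sort(key=lambda pair: pair[1])
        return scored[:k]


index = TinyVectorIndex()
index.upsert("C-a", [0.0, 1.0])
index.upsert("C-b", [1.0, 0.0])
index.upsert("C-a", [0.9, 0.1])  # same ID: replaces the earlier vector
hits = index.search([1.0, 0.0], k=2)
print(hits)
```

A dedicated vector database adds persistence and faster (often approximate) search on top of this basic contract, which matters once the catalog grows beyond a few thousand items.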

In [12]:
test_queries = [
    """I’ve completed the ‘Python Programming for Data Science’ course and enjoy data visualization. What should I take next?""",
    """I know Azure basics and want to manage containers and build CI/CD pipelines. Recommend courses.""",
    """My background is in ML fundamentals; I’d like to specialize in neural networks and production workflows.""",
    """I want to learn to build and deploy microservices with Kubernetes—what courses fit best?""",
    """I’m interested in blockchain and smart contracts but have no prior experience. Which courses do you suggest?"""
]

### Parsing User Query to understand their profile

In [13]:
def parse_user_query(query: str, df: pd.DataFrame, vector_store: FAISS, llm: AzureChatOpenAI) -> Dict:
    """
    Use LLM + Vector DB to extract completed courses and user interests from natural language query.
    Uses semantic search to identify courses mentioned in the query.
    """
    # Step 1: Use LLM to identify course mentions in the query
    prompt = ChatPromptTemplate.from_messages([
        ("system", """You are a course recommendation assistant. Analyze the user's query and extract:
1. Course mentions: Extract any course names or topics the user mentions as "completed" or "taken"
2. User interests: What does the user want to learn next?

Return your response in JSON format:
{{
    "completed_course_mentions": ["Python Programming for Data Science", "Azure basics"],
    "user_interests": "Brief summary of what user wants to learn next",
    "search_query": "Optimized search query for finding new relevant courses"
}}

If no courses are mentioned as completed, return empty list.
The search_query should focus on what the user wants to learn NEXT, not what they've completed."""),
        ("user", "{query}")
    ])
    
    chain = prompt | llm
    response = chain.invoke({"query": query})
    
    # Parse JSON response; fall back to the raw query if the reply is not valid JSON
    try:
        llm_parsed = json.loads(response.content)
    except (json.JSONDecodeError, TypeError):
        llm_parsed = {
            "completed_course_mentions": [],
            "user_interests": query,
            "search_query": query
        }
    
    # Step 2: Use vector DB to match mentioned courses to actual course IDs
    completed_course_ids = []
    completed_course_names = []
    
    for mention in llm_parsed.get("completed_course_mentions", []):
        # Use semantic search to find matching courses
        matches = vector_store.similarity_search(mention, k=1)
        if matches:
            matched_course_id = matches[0].metadata['course_id']
            matched_course_name = matches[0].metadata['title']
            completed_course_ids.append(matched_course_id)
            completed_course_names.append(matched_course_name)
    
    return {
        "completed_course_ids": completed_course_ids,
        "completed_course_names": completed_course_names,
        "user_interests": llm_parsed.get("user_interests", query),
        "search_query": llm_parsed.get("search_query", query)
    }

In [14]:
parse_user_query(
    query = test_queries[0], 
    df = df, 
    vector_store=vector_store, 
    llm = client
    
)

{'completed_course_ids': ['C016'],
 'completed_course_names': ['Python Programming for Data Science'],
 'user_interests': 'The user wants to learn more about data visualization.',
 'search_query': 'Data visualization courses for Python'}
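
Chat models sometimes wrap a JSON reply in a Markdown code fence or add a sentence around it, in which case `json.loads` on the raw text fails. A more tolerant parsing step might look like the sketch below; `extract_json_object` is a hypothetical helper, and the sample reply is made up for illustration.

```python
import json
import re
from typing import Any, Dict, Optional


def extract_json_object(text: str) -> Optional[Dict[str, Any]]:
    """Return the first JSON object found in text, or None if parsing fails."""
    # Take everything from the first "{" to the last "}" so surrounding
    # Markdown fences or prose do not break json.loads.
    match = re.search(r"\{.*\}", text, flags=re.DOTALL)
    if match is None:
        return None
    try:
        parsed = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None


# Made-up model reply wrapped in a Markdown code fence (assumption).
fence = "`" * 3
fenced = fence + 'json\n{"completed_course_mentions": [], "search_query": "docker"}\n' + fence

print(extract_json_object(fenced))
print(extract_json_object("no JSON here"))
```

Returning `None` instead of raising lets the caller decide on the fallback, such as searching with the unmodified user query.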

### Retriever Logic

In [15]:
def recommend_courses(
    profile: str,
    vector_store: FAISS,
    df: pd.DataFrame,
    llm: AzureChatOpenAI,
    top_k: int = 5
) -> Tuple[List[Dict], Dict]:
    """
    Returns top-K course recommendations using LLM for query understanding
    and vector search for semantic matching.

    Completed courses are not passed in; they are extracted from the
    query by parse_user_query.

    Args:
        profile: User's natural language query
        vector_store: FAISS vector database
        df: Course catalog DataFrame
        llm: Language model for query parsing
        top_k: Number of recommendations to return

    Returns:
        A (recommendations, parsed) tuple: recommendations is a list of dicts
        with course_id, course_title, course_description and similarity_score;
        parsed is the dict returned by parse_user_query
    """
    # Use LLM to parse query and understand user intent
    parsed = parse_user_query(profile, df, vector_store, llm)
    
    print(f"LLM Analysis:")
    print(f"  Completed Courses: {parsed['completed_course_names']}")
    print(f"  Course IDs: {parsed['completed_course_ids']}")
    print(f"  User Interests: {parsed['user_interests']}")
    print(f"  Search Query: {parsed['search_query']}\n")
    
    # Deduplicate the completed course IDs extracted from the query
    all_completed = list(set(parsed['completed_course_ids']))
    
    # Use optimized search query from LLM
    search_query = parsed['search_query']
    
    # Retrieve more candidates to account for filtering
    retrieval_k = top_k + len(all_completed) + 10
    
    # Perform semantic search
    results = vector_store.similarity_search_with_score(search_query, k=retrieval_k)
    
    # Filter and rank
    recommendations = []
    for doc, score in results:
        course_id = doc.metadata['course_id'] 
        description = doc.metadata['description']
        
        if course_id in all_completed:
            continue
        
        # Convert FAISS L2 distance to similarity score
        similarity_score = 1 / (1 + score)
        recommendations.append({
            "course_id":course_id,
            "course_title":doc.metadata['title'],
            "course_description":description,
            "similarity_score":round(similarity_score.item(), 3)
        })
        
        if len(recommendations) >= top_k:
            break
    
    return recommendations, parsed
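
The `1 / (1 + score)` mapping turns a distance into a value in (0, 1], but it is not a cosine similarity. For unit-length vectors the two are related by d² = 2 − 2·cos, so a cosine value can be recovered from the distance. The sketch below checks this on toy vectors. It assumes the embeddings are normalized and that the reported score is a squared L2 distance, which is how FAISS flat L2 indexes are generally documented to behave; if the score were an unsquared distance, it would need squaring first.

```python
import math
from typing import List


def normalize(v: List[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def squared_l2(a: List[float], b: List[float]) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine_from_squared_l2(d2: float) -> float:
    """For unit vectors: d2 = 2 - 2*cos, hence cos = 1 - d2 / 2."""
    return 1.0 - d2 / 2.0


# Toy vectors (assumption), normalized to unit length.
a = normalize([3.0, 4.0])
b = normalize([4.0, 3.0])

d2 = squared_l2(a, b)
cos_direct = sum(x * y for x, y in zip(a, b))

print(cosine_from_squared_l2(d2), cos_direct)  # the two values agree
print(1.0 / (1.0 + d2))                        # the notebook's mapping, a different scale
```

Both mappings decrease as the distance grows, so they produce the same ranking; only the reported score differs.
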

### Output

The output contains the top_k=5 course recommendations for a user query. Courses the user has already completed (if any) are excluded, and the remaining courses are ranked by a similarity score derived from the FAISS distance between the LLM-optimized search query and each course embedding.

In [26]:
recommendations, parsed_obj = recommend_courses(
    profile=test_queries[0],
    vector_store=vector_store,
    df=df,
    llm=client
)
print(recommendations)

LLM Analysis:
  Completed Courses: ['Python Programming for Data Science']
  Course IDs: ['C016']
  User Interests: The user wants to learn more about data visualization.
  Search Query: Data visualization courses for Python

[{'course_id': 'C011', 'course_title': 'Big Data Analytics with Spark', 'course_description': 'Process and analyze large datasets using Apache Spark and PySpark. The course covers RDDs, DataFrames, Spark SQL, and MLlib for machine learning at scale. You’ll learn cluster deployment on YARN or Kubernetes, performance tuning, and structured streaming for real-time analytics. Hands-on projects include building ETL pipelines and interactive dashboards, unlocking insights from big data.', 'similarity_score': 0.747}, {'course_id': 'C014', 'course_title': 'Data Visualization with Tableau', 'course_description': 'Transform raw data into compelling visual stories using Tableau. Learn to connect to diverse data sources, create interactive dashboards, and apply best practices