# Course Recommendation and Review System

This notebook implements a system for recommending courses based on a user's profile (resume, skills, career goals) and for submitting and auditing course reviews.

The system uses a language model running locally through `llama-cpp-python` to summarize course descriptions into user-friendly recommendations and to audit submitted course reviews against predefined rules.

Key features include:
- **Course Matching:** Recommends courses by comparing user profile text with course descriptions and names using TF-IDF and cosine similarity.
- **Career Goal Expansion:** Expands user-provided skills based on common keywords associated with their career goals.
- **Course Summarization:** Generates concise, career-goal-oriented summaries of course descriptions using the LLM.
- **Review Auditing:** Automatically checks submitted course reviews for inappropriate content and minimum length using the LLM.
- **Gradio Interface:** Provides a user-friendly web interface to interact with the recommendation and review functionalities.

The course data is loaded from a JSON file (`cmu_courses_merged.json`). The system uses the `smolagents` library to interface with the llama-cpp LLM.

The core functionalities of this system are implemented in the following functions:

-   **`course_match`**: This function takes a user's profile and the list of courses as input and identifies the most relevant courses based on the user's background, skills, and career goals. It uses TF-IDF and cosine similarity to calculate a matching percentage for each course.
-   **`course_summarize`**: This function takes a course description and the user's profile (optional) and uses the language model to generate a concise, single-sentence recommendation summary for the course, often highlighting its relevance to the user's career goals.
-   **`audit_review_llm`**: This function takes a user-submitted course review and uses the language model to audit it based on predefined rules (e.g., checking for inappropriate language and minimum length). It returns an audit status (Pass/Fail) and a reason if it fails.
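
To make the matching percentage concrete, here is a minimal pure-Python sketch of TF-IDF weighting followed by cosine similarity. It is an illustration only: the notebook itself relies on scikit-learn's `TfidfVectorizer` and `cosine_similarity`, and the helper names `tfidf_vectors` and `cosine`, the toy texts, and the smoothed IDF formula are assumptions made for this example.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed IDF (an assumption for this sketch): ln((1 + n) / (1 + df)) + 1
        vec = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors

def cosine(a, b):
    """Cosine similarity of two L2-normalized sparse vectors (a plain dot product)."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())

profile = "python machine learning product management"
courses = [
    "introduction to machine learning with python",
    "copyright and trademark law for arts managers",
]
# The profile and the courses share one vocabulary, so they are vectorized together.
user_vec, *course_vecs = tfidf_vectors([profile] + courses)
scores = [round(cosine(user_vec, v) * 100, 2) for v in course_vecs]
print(scores)
```

The first course shares three terms with the profile and scores above zero; the second shares none and scores exactly zero, which is why unrelated courses sink to the bottom of the ranking.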

# Setup

In [None]:
# Install llama-cpp-python first, as this can take a while
!CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python --quiet

In [None]:
!pip install scikit-learn nltk --quiet

In [None]:
# Install smolagents and the requests library.
!pip -q install smolagents[toolkit]


In [None]:
!pip install -U smolagents


In [None]:
import json  # For handling JSON

import IPython.display  # For formatting outputs
import smolagents  # For agents
import llama_cpp # For calling an LLM

In [None]:
# Define the model that we want to work with.
MODEL_REPO = "bartowski/Qwen_Qwen3-4B-Instruct-2507-GGUF"
MODEL_FILE = "Qwen_Qwen3-4B-Instruct-2507-Q4_K_M.gguf"

llm_real = llama_cpp.Llama.from_pretrained(
    repo_id=MODEL_REPO,
    filename=MODEL_FILE,
    n_gpu_layers=-1,  # offload all layers to the GPU
    n_ctx=4096,
    n_threads=8,
    verbose=False
)


llama_context: n_ctx_per_seq (4096) < n_ctx_train (262144) -- the full capacity of the model will not be utilized


In [None]:

class LlamaCppModel(smolagents.Model):
    """
    Thin wrapper for a llama.cpp OpenAI-compatible client.
    Pass in an object exposing `create_chat_completion(...)` (e.g., from llama_cpp).
    """

    def __init__(self, llm, model_id: str = "llama", **gen_defaults):
        super().__init__()
        self.llm = llm
        self.model_id = model_id
        self.gen_defaults = {"max_tokens": 1024, "temperature": 0.2}
        self.gen_defaults.update(gen_defaults)

    # ---- helpers -------------------------------------------------------------
    @staticmethod
    def _content_to_str(content) -> str:
        """
        smolagents.ChatMessage.content may be:
        - str
        - list of parts (e.g., [{"type":"text","text":"..."}, ...])
        - dict (rare) or other
        Normalize to a plain string for llama.cpp's chat formatter.
        """
        if content is None:
            return ""
        if isinstance(content, str):
            return content

        if isinstance(content, list):
            parts = []
            for p in content:
                if isinstance(p, str):
                    parts.append(p)
                elif isinstance(p, dict):
                    # common structured part: {"type":"text","text":"..."}
                    if p.get("type") == "text" and "text" in p:
                        parts.append(str(p["text"]))
                    elif "text" in p:
                        parts.append(str(p["text"]))
                    else:
                        parts.append(json.dumps(p, ensure_ascii=False))
                else:
                    parts.append(str(p))
            return "\n".join([s for s in parts if s])

        if isinstance(content, dict):
            if content.get("type") == "text" and "text" in content:
                return str(content["text"])
            if "content" in content and isinstance(content["content"], str):
                return content["content"]
            return json.dumps(content, ensure_ascii=False)

        # fallback
        return str(content)

    @staticmethod
    def _safe_get(obj, *keys, default=None):
        if isinstance(obj, dict):
            for k in keys:
                if k in obj:
                    return obj[k]
            return default
        # pydantic/attr objects (llama_cpp sometimes returns these)
        for k in keys:
            if hasattr(obj, k):
                return getattr(obj, k)
        return default

    def _to_openai_messages(self, messages: list[smolagents.ChatMessage]) -> list[dict]:
        oa = []
        for m in messages:
            # support both attr and dict access
            role = getattr(m, "role", None) or (m.get("role") if isinstance(m, dict) else None) or "user"
            content = getattr(m, "content", None) or (m.get("content") if isinstance(m, dict) else None)
            text = self._content_to_str(content)

            # optionally note images (llama.cpp chat format is text-only)
            images = getattr(m, "images", None) or (m.get("images") if isinstance(m, dict) else None)
            if images:
                text = (text + f"\n[Note: {len(images)} image(s) omitted]").strip()

            oa.append({"role": role, "content": text})
        return oa

    def _from_openai_message(self, msg) -> smolagents.ChatMessage:
        role = self._safe_get(msg, "role", default="assistant")
        content = self._safe_get(msg, "content", default="")
        # map tool_calls here later if you enable function/tool calling in llama.cpp
        return smolagents.ChatMessage(role=role, content=content)

    # ---- Model.generate ------------------------------------------------------
    def generate(
        self,
        messages: list[smolagents.ChatMessage],
        stop_sequences: list[str] | None = None,
        response_format: dict[str, str] | None = None,
        tools_to_call_from: list[smolagents.Tool] | None = None,
        **kwargs,
    ) -> smolagents.ChatMessage:
        # 1) normalize messages
        oa_msgs = self._to_openai_messages(messages)

        # 2) merge params
        params = dict(self.gen_defaults)
        params.update(kwargs)
        if stop_sequences:
            params["stop"] = stop_sequences
        if response_format:
            # llama.cpp supports OpenAI-like JSON mode on some chat formats
            params["response_format"] = response_format

        # 3) call llama.cpp
        resp = self.llm.create_chat_completion(
            model=self.model_id,
            messages=oa_msgs,
            **params,
        )

        # 4) extract first message robustly
        choices = self._safe_get(resp, "choices", default=[])
        if not choices:
            # fallback: wrap raw text if `resp` is a plain string
            text = self._safe_get(resp, "content", default=str(resp))
            return smolagents.ChatMessage(role="assistant", content=text)

        first = choices[0]
        message = self._safe_get(first, "message", default={})
        return self._from_openai_message(message)

In [None]:
model_llama_real = LlamaCppModel(llm_real)
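
The wrapper's `generate` method depends on the OpenAI-style response shape returned by `create_chat_completion`: a `choices` list whose entries hold a `message` with `role` and `content`. The sketch below uses a hypothetical `FakeLlama` stub (not part of llama-cpp-python) and a hypothetical `first_message_text` helper to show that shape and the extraction path, so the logic can be exercised without downloading a model.

```python
class FakeLlama:
    """Hypothetical stand-in that mimics the dict returned by create_chat_completion."""

    def create_chat_completion(self, messages, **params):
        last_user_text = messages[-1]["content"]
        return {
            "choices": [
                {"message": {"role": "assistant", "content": f"echo: {last_user_text}"}}
            ]
        }

def first_message_text(resp) -> str:
    """Pull the first choice's text, falling back to an empty string."""
    choices = resp.get("choices") or []
    if not choices:
        return ""
    return choices[0].get("message", {}).get("content", "") or ""

resp = FakeLlama().create_chat_completion(
    messages=[{"role": "user", "content": "hello"}],
    max_tokens=16,
    temperature=0.2,
)
print(first_message_text(resp))
```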


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
def format_markdown(markdown_string: str):
    """This function prints the provided string as formatted markdown."""
    IPython.display.display(IPython.display.Markdown(markdown_string))

# Load JSON file (course info)

In [None]:
# 1. Describe user input format
print("Expected User Input:")
print("- Resume (text format, including work experience, education, etc.)")
print("- Specific skills (list of skills the user possesses)")
print("- Career goals (text description of desired career path)")
print("\n")

# 2. Examine the course JSON file structure
print("Examining cmu_courses_merged.json structure...")
# Assuming the file exists in the specified path and has a list of course dictionaries.
# The exact fields will be determined after loading.

# 3. Load and parse the course JSON data
import json

course_file_path = '/content/drive/MyDrive/CourseSummaries/cmu_courses_merged.json'

try:
    with open(course_file_path, 'r') as f:
        cmu_courses = json.load(f)
    print("cmu_courses_merged.json loaded successfully.")
except FileNotFoundError:
    print(f"Error: {course_file_path} not found.")
    cmu_courses = []
except json.JSONDecodeError:
    print(f"Error: Could not decode JSON from {course_file_path}.")
    cmu_courses = []

# 4. Print a sample of the loaded course data
if cmu_courses:
    print("\nSample of loaded course data:")
    for i, course in enumerate(cmu_courses[:5]): # Print first 5 courses
        print(f"Course {i+1}:")
        for key, value in course.items():
            print(f"  {key}: {value}")
        print("-" * 20)
else:
    print("\nNo course data loaded.")

Expected User Input:
- Resume (text format, including work experience, education, etc.)
- Specific skills (list of skills the user possesses)
- Career goals (text description of desired career path)


Examining cmu_courses_merged.json structure...
cmu_courses_merged.json loaded successfully.

Sample of loaded course data:
Course 1:
  course_id: 93721
  course_name: Arts Management & Intellectual Property
  description: Spring 2026 dates: Dates are January 24 and February 7th From 10:00 -3:00pm.  Introduction to Intellectual Property for Arts Managers will introduce important concepts in trademark and copyright law. A significant portion of the class will focus on copyright, with the goal of developing an understanding of the scope of rights, exceptions, and fair use. The class will also explore Creative Commons licenses, considerations related to traditional cultural expressions, and touch on the evolving discourse on intellectual property and artificial intelligence.
  prerequisites: 
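
The sample above suggests that each record carries `course_id`, `course_name`, `description`, and `prerequisites`, and that `prerequisites` can be empty. A small normalization pass, sketched below, can make later string concatenation safe. The `normalize_course` helper and the second sample record are hypothetical, and the sketch assumes that missing or `null` fields should become empty strings and that IDs can be treated as text.

```python
EXPECTED_FIELDS = ("course_id", "course_name", "description", "prerequisites")

def normalize_course(record: dict) -> dict:
    """Return a copy with every expected field present as a stripped string."""
    clean = {}
    for field in EXPECTED_FIELDS:
        value = record.get(field)
        clean[field] = "" if value is None else str(value).strip()
    return clean

raw_courses = [
    {"course_id": 93721, "course_name": "Arts Management & Intellectual Property",
     "description": " Introduction to Intellectual Property for Arts Managers. ",
     "prerequisites": None},
    {"course_id": "00000", "course_name": "Hypothetical Course"},  # made-up record
]
courses = [normalize_course(c) for c in raw_courses]
print(courses[0]["prerequisites"] == "", courses[1]["description"] == "")
```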

# Define course_match function

In [None]:
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords', quiet=True)

# Build the stop-word set once instead of on every call
STOP_WORDS = set(stopwords.words('english'))

def preprocess_text(text):
    """Lowercase the text, strip punctuation, and drop English stop words."""
    if not text:
        return ""
    text = text.lower()
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    tokens = text.split()  # simple whitespace tokenization instead of nltk.word_tokenize
    tokens = [word for word in tokens if word not in STOP_WORDS]
    return " ".join(tokens)


def course_match(user_profile: dict, courses: list) -> list:
    """
    Matches user profile (resume, skills, career goals) with courses.

    Args:
        user_profile: A dictionary containing 'resume' (str), 'skills' (list of str), 'career_goals' (str).
        courses: A list of course dictionaries, each with 'course_name', 'description', 'prerequisites', etc.

    Returns:
        A list of dictionaries, each containing course details and a 'matching_percentage',
        sorted by 'matching_percentage' in descending order.
    """
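    # Join the profile fields with spaces so that adjacent words are not fused together
    user_text = preprocess_text(" ".join([
        user_profile.get('resume', '') or '',
        " ".join(user_profile.get('skills', []) or []),
        user_profile.get('career_goals', '') or '',
    ]))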

    if not user_text:
        print("Warning: User profile text is empty. Cannot perform matching.")
        return []

    # `or ''` guards against fields that are present but null in the JSON
    course_texts = [
        preprocess_text(" ".join([
            course.get('course_name') or '',
            course.get('description') or '',
            course.get('prerequisites') or '',
        ]))
        for course in courses
    ]

    if not course_texts:
        print("Warning: No course texts available for matching.")
        return []

    # Use TF-IDF and Cosine Similarity for matching
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform([user_text] + course_texts)

    # Calculate cosine similarity between user profile and each course
    cosine_sim = cosine_similarity(tfidf_matrix[0:1], tfidf_matrix[1:]).flatten()

    matched_courses = []
    for i, course in enumerate(courses):
        matching_percentage = cosine_sim[i] * 100  # Convert similarity to percentage
        matched_courses.append({
            'course_id': course.get('course_id'),
            'course_name': course.get('course_name'),
            'description': course.get('description'), # Keep full description for summarize tool later
            'prerequisites': course.get('prerequisites'),
            'matching_percentage': round(matching_percentage, 2)
        })

    # Sort by matching percentage
    matched_courses = sorted(matched_courses, key=lambda x: x['matching_percentage'], reverse=True)

    return matched_courses

# Example Usage (for testing the function)
# Make sure cmu_courses is loaded from the previous step
# user_example = {
#     'resume': 'Experienced in Python programming and machine learning projects.',
#     'skills': ['Python', 'Machine Learning', 'NLP'],
#     'career_goals': 'Become an AI Product Manager'
# }
#
# if 'cmu_courses' in locals() and cmu_courses:
#      matched_courses_example = course_match(user_example, cmu_courses)
#      print("\nExample Matching Results:")
#      for course in matched_courses_example[:10]: # Print top 10 matches
#          print(f"Course: {course['course_name']} (Match: {course['matching_percentage']}%)")
# else:
#     print("\nCourse data not loaded. Cannot run example matching.")

In [None]:
# Place before all model/flow code
job_keywords_map = {
    "AI product manager": [
        "product management", "python", "machine learning", "deep learning",
        "data analysis", "artificial intelligence", "user experience", "business analysis",
        "prompt engineering", "algorithms", "app development"
    ],
    "data scientist": [
        "python", "machine learning", "statistics", "data mining", "data visualization",
        "deep learning", "SQL", "big data"
    ],
    # Add more professions as needed
}

In [None]:
import re

def normalize_job(s):
    return re.sub(r'[^a-z0-9 ]', '', s.lower()).strip()

def expand_profile(user_profile):
    goal = normalize_job(user_profile.get('career_goals', ''))
    extra_skills = set()
    for k, kws in job_keywords_map.items():
        k_norm = normalize_job(k)
        # Trigger when every word of the job title starts a word in the goal.
        # Anchoring at word boundaries avoids hits inside other words (e.g. "ai" in "retail").
        if all(re.search(r'\b' + re.escape(word), goal) for word in k_norm.split()):
            extra_skills.update(kws)
    skills = set([s.lower() for s in user_profile.get('skills', [])])
    total_skills = list(skills | extra_skills)
    new_profile = dict(user_profile)
    new_profile['skills'] = total_skills
    return new_profile
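
As a worked example of the expansion step, the sketch below runs a shortened, hypothetical keyword map against two career goals. The names `KEYWORDS`, `normalize`, `goal_mentions`, and `expand_skills` are made up for this illustration, and matching each job-title word at the start of a word in the goal is one possible matching rule, assumed here for the sketch.

```python
import re

# Hypothetical, shortened keyword map used only for this example.
KEYWORDS = {
    "data scientist": ["python", "statistics", "sql"],
    "AI product manager": ["product management", "machine learning"],
}

def normalize(text: str) -> str:
    """Lowercase and keep only letters, digits, and spaces."""
    return re.sub(r"[^a-z0-9 ]", "", text.lower()).strip()

def goal_mentions(job_title: str, goal: str) -> bool:
    """True when every word of the job title starts some word of the normalized goal."""
    return all(re.search(r"\b" + re.escape(w), goal) for w in normalize(job_title).split())

def expand_skills(skills, career_goal, keyword_map=KEYWORDS):
    """Union the user's skills with keywords for every job title named in the goal."""
    goal = normalize(career_goal)
    expanded = {s.lower() for s in skills}
    for job, keywords in keyword_map.items():
        if goal_mentions(job, goal):
            expanded.update(keywords)
    return sorted(expanded)

print(expand_skills(["Python", "NLP"], "I want to become a Data Scientist!"))
# "ai" appears only inside "retail", so the AI product manager keywords are not added.
print(expand_skills(["Excel"], "Retail product manager"))
```

The first call adds the data scientist keywords to the user's own skills; the second returns only the original skill because no job title is fully mentioned.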

# Define course_summarize (recommendation text for the user)

In [None]:
def course_summarize(course_description: str, llm, user_profile: dict = None) -> str:
    """
    Generates a concise, user-friendly recommendation summary for a course using the LLM.
    Can optionally condition on user_profile (e.g., career goals).
    """
    if not course_description:
        return "No description available."

    # Include user information in the prompt if available
    if user_profile and "career_goals" in user_profile:
        prompt = f"""Based on the user's future career goal of "{user_profile['career_goals']}", summarize the following course description into a single, appealing sentence recommendation that highlights its relevance to that goal:

Course Description:
{course_description}

Recommendation:
"""
    else:
        prompt = f"""Summarize the following course description into a concise and appealing single-sentence recommendation that highlights the course's value:

Course Description:
{course_description}

Recommendation:
"""

    try:
        messages = [smolagents.ChatMessage(role="user", content=prompt)]
        response = llm.generate(messages)
        summary = response.content.strip()
        return summary
    except Exception as e:
        print(f"Error during summarization: {e}")
        return "Could not generate summary."

# Define audit_review_llm function (reject inappropriate reviews)

In [None]:
def audit_review_llm(text, llm):
    prompt = f"""
    Please perform an automatic AI audit of the given course review content based on the following rules:
    1. If it contains rude, vulgar, or foul language (e.g., "idiot", "go to hell"), it fails the audit.
    2. If the number of characters is less than 15, it fails the audit with the reason "Content is too brief".
    3. Otherwise, it passes the audit, regardless of whether it expresses criticism or praise.
    Return only: {{"Audit Status": "Pass"/"Fail", "Reason": "..."}}. Review text: {text}
    """
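    messages = [smolagents.ChatMessage(role="user", content=prompt)]
    response = llm.generate(messages)
    # Parse the reply as a JSON object; anything else counts as a failed audit
    try:
        result = json.loads(response.content)
        if not isinstance(result, dict):
            raise ValueError("audit reply is not a JSON object")
    except (ValueError, TypeError):
        result = {"Audit Status": "Fail", "Reason": "The audit response could not be parsed"}
    return result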
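
Small local models do not always return bare JSON, and they are unreliable at counting characters. The sketch below shows one way to harden the audit under those assumptions: a hypothetical `parse_audit_response` helper that pulls the first JSON object out of a noisy reply, and a hypothetical `precheck_length` helper that applies rule 2 deterministically before the LLM is called. Neither helper is part of the notebook.

```python
import json
import re

MIN_REVIEW_CHARS = 15  # mirrors rule 2 of the audit prompt

def parse_audit_response(raw: str) -> dict:
    """Extract a JSON object from an LLM reply, tolerating surrounding text."""
    match = re.search(r"\{.*\}", raw or "", flags=re.DOTALL)
    if match:
        try:
            parsed = json.loads(match.group(0))
            if isinstance(parsed, dict) and "Audit Status" in parsed:
                return parsed
        except json.JSONDecodeError:
            pass
    return {"Audit Status": "Fail", "Reason": "The audit response could not be parsed"}

def precheck_length(text: str):
    """Return a Fail result for too-short reviews, or None if the LLM should decide."""
    if len((text or "").strip()) < MIN_REVIEW_CHARS:
        return {"Audit Status": "Fail", "Reason": "Content is too brief"}
    return None

noisy_reply = 'Sure, here is the result: {"Audit Status": "Pass", "Reason": ""} Hope this helps.'
print(parse_audit_response(noisy_reply))
print(precheck_length("Too short"))
```

Running the length check in plain Python keeps the LLM focused on the judgment it is better suited for, namely detecting rude or vulgar language.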

# Build the agent (3 tools)

In [None]:
class CourseRecommenderAgent:
    def __init__(self, courses, llm):
        self.courses = courses
        self.llm = llm

    def course_match(self, user_profile):
        return course_match(user_profile, self.courses)

    def course_summarize(self, course_description, user_profile=None):
        return course_summarize(course_description, self.llm, user_profile)

    def course_review_audit(self, review_text):
        return audit_review_llm(review_text, self.llm)

    # Generic dispatch interface (optional)
    def call(self, tool_name, **kwargs):
        if tool_name == "course_match":
            return self.course_match(kwargs["user_profile"])
        elif tool_name == "course_summarize":
            return self.course_summarize(kwargs["course_description"], kwargs.get("user_profile"))
        elif tool_name == "course_review_audit":
            return self.course_review_audit(kwargs["review_text"])
        else:
            raise ValueError("Unknown tool_name")

In [None]:
agent = CourseRecommenderAgent(cmu_courses, model_llama_real)


In [None]:
print(agent.llm)


<__main__.LlamaCppModel object at 0x7d4388b779e0>


# Try it in Gradio

In [None]:
import gradio as gr
import re

# Assuming cmu_courses and agent are already initialized above, e.g.:
# agent = CourseRecommenderAgent(cmu_courses, model_llama_real)

course_choices = [f"{c.get('course_id','Unknown')} {c.get('course_name','')}" for c in cmu_courses]

def recommend_block(resume, skills_str, career_goals):
    try:
        skills = [s.strip() for s in re.split('[,，；;、 ]', skills_str) if s.strip()]
        user_profile = {"resume": resume, "skills": skills, "career_goals": career_goals}
        expanded_profile = expand_profile(user_profile)
        matches = agent.course_match(expanded_profile)
        out = ""
        for i, c in enumerate(matches[:15]):  # Display top 15 matches
            # Combine output: Index + Course ID + Course Name + Match Percentage + Summarized Recommendation
            out += (
                f"{i+1}. [{c['course_id']}] {c['course_name']} (Match: {c['matching_percentage']}%)\n"
                f"Recommendation: {agent.course_summarize(c['description'], user_profile)}\n\n"
            )
        return out or "No matching courses found."
    except Exception:
        import traceback; print(traceback.format_exc())
        return "An error occurred (please check input and look at Colab logs)."

def review_block(selected_course, workload, total_score, interest_score, useful_score, roi_score, review_text):
    try:
        review_struct = {
            "course_id": (selected_course or 'Unknown').split()[0],
            "workload": workload,
            "total_score": total_score,
            "interest_score": interest_score,
            "useful_score": useful_score,
            "roi_score": roi_score,
            "review_text": review_text,
        }
        audit = agent.course_review_audit(review_text)
        if audit.get("Audit Status") == "Pass":
            result = "Review passed audit. Thank you for your submission!"
        else:
            result = f"Audit failed: {audit.get('Reason','No reason provided')}"
        return review_struct, result
    except Exception:
        import traceback; print(traceback.format_exc())
        return {}, "An error occurred (please check input and look at Colab logs)."


with gr.Blocks() as demo:
    gr.Markdown("## Course Recommendation & Review System Demo")
    with gr.Tab("Course Recommendation"):
        r_resume = gr.Textbox(label="Your Resume/Background (Required)")
        r_skills = gr.Textbox(label="Skills (Comma or space separated)")
        r_goal = gr.Textbox(label="Career Goal")
        r_output = gr.Textbox(label="Recommendation Results", lines=15)
        r_btn = gr.Button("Generate Recommendation")
        r_btn.click(fn=recommend_block, inputs=[r_resume, r_skills, r_goal], outputs=r_output)
    with gr.Tab("Contribute Course Review"):
        v_course = gr.Dropdown(choices=course_choices, label="Course ID - Course Name")
        v_workload = gr.Textbox(label="Weekly Workload (Hours)")
        v_total = gr.Slider(1,5,step=1,label="Total Score")
        v_interest = gr.Slider(1,5,step=1,label="Interest Score")
        v_useful = gr.Slider(1,5,step=1,label="Usefulness Score")
        v_roi = gr.Slider(1,5,step=1,label="ROI Score")
        v_text = gr.Textbox(label="Freeform Review (Minimum 15 characters)")
        v_out_struct = gr.JSON(label="Your Submitted Data")
        v_out_msg = gr.Textbox(label="Audit & Feedback")
        v_btn = gr.Button("Submit Review")
        v_btn.click(fn=review_block, inputs=[v_course, v_workload, v_total, v_interest, v_useful, v_roi, v_text],
                    outputs=[v_out_struct, v_out_msg])

demo.launch(debug=True)

It looks like you are running Gradio on a hosted Jupyter notebook, which requires `share=True`. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. This cell will run indefinitely so that you can see errors and logs. To turn off, set debug=False in launch().
* Running on public URL: https://f616790f632f7b7029.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)


Keyboard interruption in main thread... closing server.
Killing tunnel 127.0.0.1:7865 <> https://f616790f632f7b7029.gradio.live
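
The skills field in the recommendation tab is split with a single regular expression, and its behavior is easy to check in isolation. The sketch below wraps the same character class in a hypothetical `parse_skills` helper (the name is made up for this example) to show how mixed ASCII and full-width separators are handled.

```python
import re

def parse_skills(skills_str: str) -> list[str]:
    """Split on ASCII and full-width commas, semicolons, the ideographic comma, and spaces."""
    return [s.strip() for s in re.split(r"[,，；;、 ]", skills_str) if s.strip()]

print(parse_skills("Python, SQL；NLP、Excel"))
print(parse_skills("machine learning, python"))
```

Because a space is one of the separators, a multi-word skill such as "machine learning" is split into two tokens. That is harmless for TF-IDF matching, which works on individual words anyway, but it matters if the skill list is ever displayed back to the user.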


