# 🌐 Multi-Subject Knowledge Graph QA System

This notebook implements a Knowledge Graph-based Question Answering system with **multi-subject support**.

## ✨ Key Features

### Multi-Subject Understanding
- **Supports multiple departments**: Computer Science (CSCI), Data Science (DATS), and more
- **Cross-subject relationships**: Discovers prerequisites across departments (e.g., "DS 2001 requires CSCI 1012")
- **Unified knowledge graph**: All subjects in one graph for comprehensive reasoning

### Advanced Capabilities
- **Multi-hop reasoning**: Answer complex questions requiring multiple inference steps
- **Graph-based retrieval**: Uses Graph Attention Networks (GAT) for intelligent context selection
- **Prerequisite chains**: Track course dependencies across subjects
- **Topic-based search**: Find courses by topics spanning multiple departments

## 📊 Data Sources

The system automatically detects and processes all subjects in:
- `data/spring_2026_courses.csv` - Schedule data with course offerings
- `data/bulletin_courses.csv` - Course descriptions and prerequisites

Each course includes a `subject` field (e.g., "CSCI", "DATS") for subject tracking.

## 🎯 Example Queries

**Single-subject:**
- "Who teaches Machine Learning (CSCI 6364)?"
- "What are the prerequisites for CSCI 6212?"

**Cross-subject:**
- "Which Computer Science courses are required for Data Science courses?"
- "What's the prerequisite path from CSCI 1112 to DS 6450?"
- "Which professors teach both CS and DS courses?"
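A question such as the prerequisite-path query above reduces to a path search over prerequisite edges. The following is a minimal sketch on a toy graph, assuming edges point from a course to its prerequisite (the convention `KnowledgeGraph.build_from_data` uses later in this notebook). The course codes and the `prerequisite_path` helper are hypothetical and not taken from the dataset.

```python
import networkx as nx

# Toy graph with hypothetical course codes; edges point course -> prerequisite.
g = nx.DiGraph()
g.add_edge("DATS 3000", "DATS 2000", edge_type="prerequisite")
g.add_edge("DATS 2000", "CSCI 1000", edge_type="prerequisite")
g.add_edge("DATS 3000", "STAT 1000", edge_type="prerequisite")

def prerequisite_path(graph, start, target):
    """Return the shortest chain of courses from `start` up to `target`, or None."""
    try:
        # Walk from the advanced course back to the foundation, then reverse
        # so the result reads in the order a student would take the courses.
        path = nx.shortest_path(graph, source=target, target=start)
    except (nx.NodeNotFound, nx.NetworkXNoPath):
        return None
    return list(reversed(path))

print(prerequisite_path(g, "CSCI 1000", "DATS 3000"))
```

Returning `None` for unknown courses or unconnected pairs keeps the "no path" case distinct from a one-course path.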

---



## 1. Setup Environment


In [1]:
%%capture
import os, re
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth
else:
    import torch; v = re.match(r"[0-9\.]{3,}", str(torch.__version__)).group(0)
    xformers = "xformers==" + ("0.0.32.post2" if v == "2.8.0" else "0.0.29.post3")
    !pip install --no-deps bitsandbytes accelerate {xformers} peft trl triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf "datasets>=3.4.1,<4.0.0" "huggingface_hub>=0.34.0" hf_transfer
    !pip install --no-deps unsloth

# Let unsloth manage transformers version - removed version pin
!pip install --no-deps trl==0.22.2
!pip install evaluate
!pip install networkx pandas numpy scikit-learn
!pip install nltk spacy
!python -m spacy download en_core_web_sm

In [2]:
import pandas as pd
import numpy as np
import json
import re
import networkx as nx
from collections import defaultdict, Counter
from typing import Dict, List, Tuple, Set, Optional
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import spacy
from datasets import load_dataset, Dataset
import warnings
warnings.filterwarnings('ignore')

# Download NLTK data
try:
    nltk.download('punkt', quiet=True)
    nltk.download('stopwords', quiet=True)
except Exception:
    # Best effort: continue if the download fails (e.g., no network access)
    pass

# Load spaCy model
try:
    nlp = spacy.load("en_core_web_sm")
except OSError:
    # spaCy raises OSError when the model package is not installed
    print("Warning: spaCy model not loaded. Topic extraction may be limited.")
    nlp = None

print("✅ Environment setup complete")

✅ Environment setup complete


## 2. Load and Prepare Data


In [5]:
# Create data directory if it doesn't exist
import os
os.makedirs("data", exist_ok=True)

# Load course data - with error handling
try:
    courses_df = pd.read_csv("data/spring_2026_courses.csv")
    bulletin_df = pd.read_csv("data/bulletin_courses.csv")
    print(f"Loaded {len(courses_df)} course sections")
    print(f"Loaded {len(bulletin_df)} course descriptions")
except FileNotFoundError as e:
    print(f"❌ Error: {e}")
    print("Please upload the required CSV files:")
    print("  - data/spring_2026_courses.csv")
    print("  - data/bulletin_courses.csv")
    raise

# Create course code to description mapping
course_descriptions = {}
for _, row in bulletin_df.iterrows():
    code = str(row['course_code']).strip()
    desc = str(row.get('description', '')).strip()
    if desc and desc != 'nan':
        course_descriptions[code] = desc

print(f"✅ Data loaded: {len(course_descriptions)} course descriptions available")

# Track subjects in dataset (drop missing values so sorting does not mix NaN with strings)
if 'subject' in courses_df.columns:
    subjects = sorted(courses_df['subject'].dropna().unique())
    print(f"📚 Subjects in dataset: {subjects}")
if 'subject' in bulletin_df.columns:
    bulletin_subjects = sorted(bulletin_df['subject'].dropna().unique())
    print(f"📖 Subjects in bulletin: {bulletin_subjects}")

Loaded 586 course sections
Loaded 187 course descriptions
✅ Data loaded: 183 course descriptions available
📚 Subjects in dataset: ['CSCI', 'DATS']
📖 Subjects in bulletin: ['CSCI', 'DATS']
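The later cells read the `course_code`, `instructor`, and `subject` columns from the schedule, and `course_code` and `description` from the bulletin. A small check right after loading can surface a missing column early. This is a sketch on toy frames with made-up values; `missing_columns` is a hypothetical helper, not part of the notebook.

```python
from typing import List

import pandas as pd

def missing_columns(df: pd.DataFrame, required: List[str]) -> List[str]:
    """Return the required column names that are absent from `df`, in order."""
    return [col for col in required if col not in df.columns]

# Toy stand-ins for the two CSV files (values are made up).
schedule = pd.DataFrame({"course_code": ["CSCI 1000"], "instructor": ["Doe, J"]})
bulletin = pd.DataFrame({"course_code": ["CSCI 1000"], "description": ["Intro."]})

print(missing_columns(schedule, ["course_code", "instructor", "subject"]))
print(missing_columns(bulletin, ["course_code", "description"]))
```

An empty list means every column the later cells rely on is present.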


## 3. Prerequisites and Topic Extraction


In [6]:
def extract_prerequisites(description: str) -> List[str]:
    """Extract prerequisite course codes from description."""
    if not description or description == 'nan':
        return []

    # Replace non-breaking spaces ('\xa0' and '\u00a0' are the same character)
    description = description.replace('\xa0', ' ')

    prerequisites = []
    # Enhanced patterns for better prerequisite extraction
    patterns = [
        r'[Pp]rerequisite[s]?[:\s]+([A-Z]{2,}\s+\d{4}[A-Z]?)',
        r'([A-Z]{2,}\s+\d{4}[A-Z]?)\s+with\s+a\s+minimum\s+grade',
        r'([A-Z]{2,}\s+\d{4}[A-Z]?)\s+or\s+([A-Z]{2,}\s+\d{4}[A-Z]?)',
        r'([A-Z]{2,}\s+\d{4}[A-Z]?)\s+and\s+([A-Z]{2,}\s+\d{4}[A-Z]?)',
        r'completion\s+of\s+([A-Z]{2,}\s+\d{4}[A-Z]?)',
        r'[Cc]orequisite[s]?[:\s]+([A-Z]{2,}\s+\d{4}[A-Z]?)',
        r'([A-Z]{2,}\s+\d{4}[A-Z]?)\s+is\s+required',
        r'must\s+complete\s+([A-Z]{2,}\s+\d{4}[A-Z]?)',
    ]

    for pattern in patterns:
        matches = re.findall(pattern, description)
        if matches:
            if isinstance(matches[0], tuple):
                # Flatten the list of tuples into a single list of strings
                for sub_match in matches:
                    prerequisites.extend([item for item in sub_match if item])
            else:
                prerequisites.extend([m for m in matches if m])

    # Clean and normalize
    cleaned_prerequisites = []
    for prereq in prerequisites:
        # Collapse any run of whitespace (including non-breaking spaces) to a
        # single space, e.g. "CSCI  1112" -> "CSCI 1112"
        cleaned = re.sub(r'\s+', ' ', prereq.strip())

        if cleaned and cleaned not in cleaned_prerequisites:
            cleaned_prerequisites.append(cleaned)

    return cleaned_prerequisites

def extract_topics(description: str, course_code: str) -> List[str]:
    """Extract topics from course description using NLP."""
    if not description or description == 'nan':
        return []

    topics = []

    # Expanded CS topics/keywords
    cs_topics = [
        'machine learning', 'deep learning', 'neural networks', 'computer vision',
        'natural language processing', 'nlp', 'data structures', 'algorithms',
        'database', 'software engineering', 'operating systems', 'networks',
        'distributed systems', 'security', 'cryptography', 'web development',
        'mobile development', 'artificial intelligence', 'ai', 'robotics',
        'graphics', 'game development', 'cloud computing', 'parallel computing',
        'compilers', 'programming languages', 'theory', 'optimization',
        'data mining', 'big data', 'data science', 'reinforcement learning',
        'computer architecture', 'embedded systems', 'internet of things', 'iot',
        'blockchain', 'cybersecurity', 'information retrieval', 'human computer interaction',
        'hci', 'visualization', 'bioinformatics', 'quantum computing',
        'object oriented programming', 'oop', 'functional programming',
        'agile', 'devops', 'version control', 'testing', 'debugging',
        'concurrency', 'synchronization', 'microservices', 'api design',
        'data analytics', 'statistics', 'probability', 'linear algebra',
        'calculus', 'discrete math', 'graph theory', 'complexity theory',
        'automata theory', 'formal methods', 'logic', 'reasoning',
        'knowledge representation', 'expert systems', 'semantic web',
        'information security', 'network security', 'malware analysis',
        'penetration testing', 'ethical hacking', 'digital forensics',
        'user experience', 'ux', 'interface design', 'usability',
        'image processing', 'signal processing', 'pattern recognition',
        'speech recognition', 'information systems', 'database management',
        'data warehousing', 'business intelligence', 'cloud architecture',
        'virtualization', 'containerization', 'docker', 'kubernetes'
    ]

    description_lower = description.lower()
    for topic in cs_topics:
        if topic in description_lower:
            topics.append(topic)

    # Use spaCy noun chunks to pick up additional technical phrases, if available
    if nlp:
        doc = nlp(description)
        english_stopwords = set(stopwords.words('english'))  # build once, not per chunk
        # Extract technical terms (nouns and noun phrases)
        for chunk in doc.noun_chunks:
            text = chunk.text.lower()
            if len(text) > 3 and text not in english_stopwords:
                # Filter for technical terms
                if any(keyword in text for keyword in ['algorithm', 'system', 'structure', 'model', 'framework',
                                                       'analysis', 'design', 'development', 'implementation',
                                                       'network', 'database', 'software', 'hardware', 'data']):
                    topics.append(text)

    return list(set(topics))

# Extract prerequisites and topics for all courses
course_prerequisites = {}
course_topics = {}

for course_code, description in course_descriptions.items():
    course_prerequisites[course_code] = extract_prerequisites(description)
    course_topics[course_code] = extract_topics(description, course_code)

# Print statistics
total_with_prereqs = sum(1 for v in course_prerequisites.values() if v)
total_with_topics = sum(1 for v in course_topics.values() if v)

print(f"✅ Extracted prerequisites for {total_with_prereqs} courses")
print(f"✅ Extracted topics for {total_with_topics} courses")
print(f"\nSample prerequisites: {dict(list(course_prerequisites.items())[:5])}")
print(f"\nSample topics: {dict(list(course_topics.items())[:5])}")

✅ Extracted prerequisites for 120 courses
✅ Extracted topics for 170 courses

Sample prerequisites: {'CSCI 1010': [], 'CSCI 1011': [], 'CSCI 1012': [], 'CSCI 1013': ['CSCI 1012'], 'CSCI 1020': []}

Sample topics: {'CSCI 1010': [], 'CSCI 1011': ['control structures', 'ai'], 'CSCI 1012': ['oop'], 'CSCI 1013': ['visualization', 'data structures', 'interdisciplinary computing and data science applications', 'data types', 'data science'], 'CSCI 1020': ['database', 'microcomputer hardware', 'database management', 'software']}
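Two details of the pattern matching above are worth illustrating. Plain substring tests can match a short keyword inside a longer word (for example `'oop'` inside "loops"), and the prerequisite patterns capture only the course code that directly follows the keyword. The sketch below shows one possible alternative, assuming whole-word matching and a "scan the whole prerequisite sentence" rule are acceptable. `match_topics` and `codes_in_prerequisite_clause` are hypothetical helpers, and the sample description is made up.

```python
import re
from typing import List

COURSE_CODE = re.compile(r"[A-Z]{2,}\s+\d{4}[A-Z]?")

def match_topics(description: str, keywords: List[str]) -> List[str]:
    """Return keywords that occur as whole words or phrases in the description."""
    text = description.lower()
    return [kw for kw in keywords
            if re.search(rf"\b{re.escape(kw)}\b", text)]

def codes_in_prerequisite_clause(description: str) -> List[str]:
    """Return every course code in the sentence that starts at 'Prerequisite(s)'."""
    # Assumption: the clause ends at the first period after the keyword.
    clause = re.search(r"[Pp]rerequisites?\b[^.]*", description)
    if clause is None:
        return []
    codes = []
    for code in COURSE_CODE.findall(clause.group(0)):
        code = re.sub(r"\s+", " ", code)  # collapse non-breaking or repeated spaces
        if code not in codes:
            codes.append(code)
    return codes

sample = ("Loops and arrays for maintaining data. "
          "Prerequisites: CSCI 1000, MATH 1000 and STAT\xa01000.")
print(match_topics(sample, ["oop", "ai", "data"]))
print(codes_in_prerequisite_clause(sample))
```

Stopping the clause at the first period is a simplification; descriptions with abbreviations inside the prerequisite sentence would need a more careful sentence splitter.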


In [7]:
class KnowledgeGraph:
    """Knowledge Graph for GW Courses with nodes and edges."""

    def __init__(self):
        self.graph = nx.DiGraph()  # Directed graph
        self.course_nodes = {}  # course_code -> node_id
        self.professor_nodes = {}  # professor_name -> node_id
        self.topic_nodes = {}  # topic -> node_id
        self.node_features = {}  # node_id -> features
        self.edge_types = {}  # (source, target) -> edge_type
        self.node_id_counter = 0

    def add_node(self, node_type: str, node_id: str, features: Optional[Dict] = None):
        """Add a node to the graph; return True only if it was newly added."""
        features = features or {}
        if node_id not in self.graph:
            self.graph.add_node(node_id, node_type=node_type, **features)
            self.node_features[node_id] = features
            return True
        return False

    def add_edge(self, source: str, target: str, edge_type: str, weight: float = 1.0):
        """Add an edge to the graph."""
        if source in self.graph and target in self.graph:
            self.graph.add_edge(source, target, edge_type=edge_type, weight=weight)
            self.edge_types[(source, target)] = edge_type
            return True
        return False

    def build_from_data(self, courses_df: pd.DataFrame, course_descriptions: Dict,
                       course_prerequisites: Dict, course_topics: Dict):
        """Build knowledge graph from course data."""
        print("Building knowledge graph...")

        # Add course nodes
        unique_courses = courses_df['course_code'].unique()
        for course_code in unique_courses:
            course_code = str(course_code).strip()
            if course_code and course_code != 'nan':
                node_id = f"course_{course_code}"
                description = course_descriptions.get(course_code, "")
                features = {
                    'code': course_code,
                    'description': description,
                    'has_prerequisites': len(course_prerequisites.get(course_code, [])) > 0,
                    'topics': course_topics.get(course_code, [])
                }
                self.add_node('course', node_id, features)
                self.course_nodes[course_code] = node_id

        # Add professor nodes
        unique_professors = courses_df['instructor'].dropna().unique()
        for prof in unique_professors:
            prof = str(prof).strip()
            if prof and prof != 'nan':
                node_id = f"prof_{prof.replace(' ', '_').replace(',', '')}"
                if self.add_node('professor', node_id, {'name': prof}):
                    self.professor_nodes[prof] = node_id

        # Add topic nodes
        all_topics = set()
        for topics in course_topics.values():
            all_topics.update(topics)

        for topic in all_topics:
            node_id = f"topic_{topic.replace(' ', '_')}"
            if self.add_node('topic', node_id, {'name': topic}):
                self.topic_nodes[topic] = node_id

        # Add edges: taught_by
        for _, row in courses_df.iterrows():
            course_code = str(row['course_code']).strip()
            prof = str(row.get('instructor', '')).strip()

            if course_code in self.course_nodes and prof in self.professor_nodes:
                course_node = self.course_nodes[course_code]
                prof_node = self.professor_nodes[prof]
                self.add_edge(course_node, prof_node, 'taught_by')

        # Add edges: prerequisite
        for course_code, prereqs in course_prerequisites.items():
            if course_code in self.course_nodes:
                course_node = self.course_nodes[course_code]
                for prereq_code in prereqs:
                    prereq_node_id = f"course_{prereq_code}"
                    if prereq_node_id in self.graph:
                        self.add_edge(course_node, prereq_node_id, 'prerequisite')

        # Add edges: covers_topic
        for course_code, topics in course_topics.items():
            if course_code in self.course_nodes:
                course_node = self.course_nodes[course_code]
                for topic in topics:
                    topic_node_id = f"topic_{topic.replace(' ', '_')}"
                    if topic_node_id in self.graph:
                        self.add_edge(course_node, topic_node_id, 'covers_topic')

        print(f"✅ Graph built: {self.graph.number_of_nodes()} nodes, {self.graph.number_of_edges()} edges")
        print(f"   - Courses: {len(self.course_nodes)}")
        print(f"   - Professors: {len(self.professor_nodes)}")
        print(f"   - Topics: {len(self.topic_nodes)}")

    def get_subgraph(self, start_nodes: List[str], max_hops: int = 2) -> nx.DiGraph:
        """Get subgraph starting from given nodes with max_hops depth."""
        subgraph_nodes = set(start_nodes)
        frontier = set(start_nodes)

        for _ in range(max_hops):
            next_frontier = set()
            for node in frontier:
                # Get neighbors (both incoming and outgoing)
                next_frontier.update(self.graph.successors(node))
                next_frontier.update(self.graph.predecessors(node))
            # Only nodes seen for the first time need expanding on the next hop
            frontier = next_frontier - subgraph_nodes
            subgraph_nodes.update(frontier)

        return self.graph.subgraph(subgraph_nodes)

    def find_paths(self, source: str, target: str, max_length: int = 3) -> List[List[str]]:
        """Find all simple paths from source to target with at most max_length edges."""
        try:
            return list(nx.all_simple_paths(self.graph, source, target, cutoff=max_length))
        except (nx.NodeNotFound, nx.NetworkXError):
            # Raised when source or target is not in the graph
            return []

# Build knowledge graph
kg = KnowledgeGraph()
kg.build_from_data(courses_df, course_descriptions, course_prerequisites, course_topics)

Building knowledge graph...
✅ Graph built: 489 nodes, 566 edges
   - Courses: 85
   - Professors: 76
   - Topics: 328


In [8]:
!pip install pyvis

Collecting pyvis
  Downloading pyvis-0.3.2-py3-none-any.whl.metadata (1.7 kB)
Collecting jedi>=0.16 (from ipython>=5.3.0->pyvis)
  Downloading jedi-0.19.2-py2.py3-none-any.whl.metadata (22 kB)
Downloading pyvis-0.3.2-py3-none-any.whl (756 kB)
Downloading jedi-0.19.2-py2.py3-none-any.whl (1.6 MB)
Installing collected packages: jedi, pyvis
Successfully installed jedi-0.19.2 pyvis-0.3.2


In [9]:
## 3.5. Graph Visualization and Exploration

import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
from typing import Optional, List, Tuple

# Import pyvis if available (needed only for the interactive visualization)
try:
    from pyvis.network import Network
    from IPython.display import HTML, display
    PYVIS_AVAILABLE = True
except ImportError:
    PYVIS_AVAILABLE = False
    print("💡 Install pyvis for interactive visualizations: !pip install pyvis")

def visualize_graph(kg: KnowledgeGraph,
                    max_nodes: int = 50,
                    node_type_filter: Optional[List[str]] = None,
                    layout: str = 'spring',
                    figsize: Tuple[int, int] = (20, 15)):
    """
    Visualize the knowledge graph with color-coded node types.

    Args:
        kg: KnowledgeGraph instance
        max_nodes: Maximum number of nodes to display (for readability)
        node_type_filter: Filter by node types ['course', 'professor', 'topic'] or None for all
        layout: Layout algorithm ('spring', 'circular', 'kamada_kawai')
        figsize: Figure size
    """
    # Get subgraph if filtering
    if node_type_filter:
        nodes_to_include = []
        for node_id, data in kg.graph.nodes(data=True):
            if data.get('node_type') in node_type_filter:
                nodes_to_include.append(node_id)
        G = kg.graph.subgraph(nodes_to_include)
    else:
        G = kg.graph

    # Limit nodes if too many
    total_nodes = G.number_of_nodes()  # count after any node-type filtering
    if total_nodes > max_nodes:
        # Get nodes with highest degree (most connected)
        degrees = dict(G.degree())
        top_nodes = sorted(degrees.items(), key=lambda x: x[1], reverse=True)[:max_nodes]
        G = G.subgraph([node for node, _ in top_nodes])
        print(f"⚠️ Showing top {max_nodes} most connected nodes (out of {total_nodes} total)")

    # Create figure
    plt.figure(figsize=figsize)

    # Choose layout
    if layout == 'spring':
        pos = nx.spring_layout(G, k=1, iterations=50, seed=42)
    elif layout == 'circular':
        pos = nx.circular_layout(G)
    elif layout == 'kamada_kawai':
        pos = nx.kamada_kawai_layout(G)
    else:
        pos = nx.spring_layout(G, seed=42)

    # Color nodes by type
    node_colors = []
    node_sizes = []
    for node in G.nodes():
        node_type = G.nodes[node].get('node_type', 'unknown')
        if node_type == 'course':
            node_colors.append('#FF6B6B')  # Red
            node_sizes.append(500)
        elif node_type == 'professor':
            node_colors.append('#4ECDC4')  # Teal
            node_sizes.append(400)
        elif node_type == 'topic':
            node_colors.append('#95E1D3')  # Light teal
            node_sizes.append(300)
        else:
            node_colors.append('#C7CEEA')  # Light purple
            node_sizes.append(200)

    # Draw nodes
    nx.draw_networkx_nodes(G, pos, node_color=node_colors,
                           node_size=node_sizes, alpha=0.7)

    # Color edges by type
    edge_colors = []
    for u, v in G.edges():
        edge_data = G.get_edge_data(u, v, {})
        edge_type = edge_data.get('edge_type', 'unknown')
        if edge_type == 'prerequisite':
            edge_colors.append('#FF6B6B')  # Red
        elif edge_type == 'taught_by':
            edge_colors.append('#4ECDC4')  # Teal
        elif edge_type == 'covers_topic':
            edge_colors.append('#95E1D3')  # Light teal
        else:
            edge_colors.append('#C7CEEA')  # Light purple

    # Draw edges
    nx.draw_networkx_edges(G, pos, edge_color=edge_colors,
                          alpha=0.3, arrows=True, arrowsize=20,
                          arrowstyle='->', width=1.5)

    # Add labels for important nodes (courses and professors)
    labels = {}
    for node in G.nodes():
        node_data = G.nodes[node]
        node_type = node_data.get('node_type', '')
        if node_type == 'course':
            labels[node] = node_data.get('code', node.split('_')[-1])
        elif node_type == 'professor':
            name = node_data.get('name', node.split('_')[-1])
            labels[node] = name.split(',')[0] if ',' in name else name[:15]

    nx.draw_networkx_labels(G, pos, labels, font_size=8, font_weight='bold')

    # Add legend
    legend_elements = [
        mpatches.Patch(color='#FF6B6B', label='Courses'),
        mpatches.Patch(color='#4ECDC4', label='Professors'),
        mpatches.Patch(color='#95E1D3', label='Topics'),
    ]
    plt.legend(handles=legend_elements, loc='upper right')

    plt.title(f'Knowledge Graph Visualization\n({G.number_of_nodes()} nodes, {G.number_of_edges()} edges)',
              fontsize=16, fontweight='bold')
    plt.axis('off')
    plt.tight_layout()
    plt.show()

def visualize_graph_interactive(kg: KnowledgeGraph,
                                course_code: Optional[str] = None,
                                max_hops: int = 2,
                                output_file: str = "knowledge_graph.html"):
    """
    Create an interactive HTML visualization of the graph.

    Args:
        kg: KnowledgeGraph instance
        course_code: If provided, show subgraph around this course
        max_hops: Number of hops from starting node (if course_code provided)
        output_file: Output HTML file name
    """
    if not PYVIS_AVAILABLE:
        print("❌ pyvis not installed. Run: !pip install pyvis")
        return

    # Get subgraph if course specified
    if course_code:
        if course_code not in kg.course_nodes:
            print(f"❌ Course {course_code} not found in graph")
            return
        start_node = kg.course_nodes[course_code]
        G = kg.get_subgraph([start_node], max_hops=max_hops)
        print(f"📊 Showing subgraph around {course_code} ({G.number_of_nodes()} nodes)")
    else:
        G = kg.graph
        print(f"📊 Showing full graph ({G.number_of_nodes()} nodes)")

    # Create network
    net = Network(height="800px", width="100%", bgcolor="#222222", font_color="white", directed=True)
    net.set_options("""
    {
      "physics": {
        "enabled": true,
        "barnesHut": {
          "gravitationalConstant": -2000,
          "centralGravity": 0.1,
          "springLength": 200,
          "springConstant": 0.04,
          "damping": 0.09
        }
      }
    }
    """)

    # Add nodes
    for node_id in G.nodes():
        node_data = G.nodes[node_id]
        node_type = node_data.get('node_type', 'unknown')

        # Set node properties based on type
        if node_type == 'course':
            color = '#FF6B6B'
            size = 30
            label = node_data.get('code', node_id.split('_')[-1])
            title = f"Course: {label}\nDescription: {node_data.get('description', 'N/A')[:100]}..."
        elif node_type == 'professor':
            color = '#4ECDC4'
            size = 25
            label = node_data.get('name', node_id.split('_')[-1])
            title = f"Professor: {label}"
        elif node_type == 'topic':
            color = '#95E1D3'
            size = 20
            label = node_data.get('name', node_id.split('_')[-1])
            title = f"Topic: {label}"
        else:
            color = '#C7CEEA'
            size = 15
            label = node_id
            title = label

        net.add_node(node_id, label=label, color=color, size=size, title=title)

    # Add edges
    for u, v in G.edges():
        edge_data = G.get_edge_data(u, v, {})
        edge_type = edge_data.get('edge_type', 'unknown')

        # Color edges by type
        if edge_type == 'prerequisite':
            color = '#FF6B6B'
            title = 'Prerequisite'
        elif edge_type == 'taught_by':
            color = '#4ECDC4'
            title = 'Taught By'
        elif edge_type == 'covers_topic':
            color = '#95E1D3'
            title = 'Covers Topic'
        else:
            color = '#C7CEEA'
            title = edge_type

        net.add_edge(u, v, color=color, title=title, width=2)

    # Save and show
    net.save_graph(output_file)
    print(f"✅ Interactive graph saved to {output_file}")
    print("   Open it in your browser to explore!")

    # Try to open in notebook (best effort; the saved file remains available either way)
    try:
        display(HTML(f'<iframe src="{output_file}" width="100%" height="800px"></iframe>'))
    except Exception:
        pass

def explore_graph(kg: KnowledgeGraph, course_code: Optional[str] = None,
                  professor_name: Optional[str] = None, topic: Optional[str] = None):
    """Explore the graph by showing connections for a specific node."""

    if course_code:
        if course_code not in kg.course_nodes:
            print(f"❌ Course {course_code} not found")
            return

        node_id = kg.course_nodes[course_code]
        print(f"\n📚 Exploring Course: {course_code}")
        print("=" * 60)

        # Get node data
        node_data = kg.graph.nodes[node_id]
        print(f"Description: {node_data.get('description', 'N/A')[:200]}...")
        topics_list = node_data.get('topics', [])
        if topics_list:
            print(f"Topics: {', '.join(topics_list[:10])}")

        # Get prerequisites. Prerequisite edges point course -> prerequisite
        # (see build_from_data), so they are outgoing edges of this node.
        print(f"\n🔗 Prerequisites:")
        prereqs_found = False
        for succ in kg.graph.successors(node_id):
            succ_data = kg.graph.nodes[succ]
            if succ_data.get('node_type') == 'course':
                edge_data = kg.graph.get_edge_data(node_id, succ)
                if edge_data and edge_data.get('edge_type') == 'prerequisite':
                    print(f"  - {succ_data.get('code', succ)}")
                    prereqs_found = True
        if not prereqs_found:
            print("  (None found)")

        # Get courses this is a prerequisite for (incoming prerequisite edges)
        print(f"\n📖 Required for:")
        required_found = False
        for pred in kg.graph.predecessors(node_id):
            pred_data = kg.graph.nodes[pred]
            if pred_data.get('node_type') == 'course':
                edge_data = kg.graph.get_edge_data(pred, node_id)
                if edge_data and edge_data.get('edge_type') == 'prerequisite':
                    print(f"  - {pred_data.get('code', pred)}")
                    required_found = True
        if not required_found:
            print("  (None found)")

        # Get professors
        print(f"\n👨‍🏫 Taught by:")
        profs_found = False
        for succ in kg.graph.successors(node_id):
            succ_data = kg.graph.nodes[succ]
            if succ_data.get('node_type') == 'professor':
                print(f"  - {succ_data.get('name', succ)}")
                profs_found = True
        if not profs_found:
            print("  (None found)")

        # Get topics
        print(f"\n🏷️  Covers topics:")
        topics_found = False
        for succ in kg.graph.successors(node_id):
            succ_data = kg.graph.nodes[succ]
            if succ_data.get('node_type') == 'topic':
                print(f"  - {succ_data.get('name', succ)}")
                topics_found = True
        if not topics_found:
            print("  (None found)")

    elif professor_name:
        # Similar exploration for professors
        matching_profs = {k: v for k, v in kg.professor_nodes.items()
                         if professor_name.lower() in k.lower()}
        if not matching_profs:
            print(f"❌ Professor '{professor_name}' not found")
            return

        for prof_name, node_id in matching_profs.items():
            print(f"\n👨‍🏫 Exploring Professor: {prof_name}")
            print("=" * 60)

            courses_taught = []
            for pred in kg.graph.predecessors(node_id):
                pred_data = kg.graph.nodes[pred]
                if pred_data.get('node_type') == 'course':
                    courses_taught.append(pred_data.get('code', pred))

            if courses_taught:
                print(f"Courses taught ({len(courses_taught)}):")
                for course in courses_taught[:20]:
                    print(f"  - {course}")
            else:
                print("  (No courses found)")

    elif topic:
        # Explore topic
        matching_topics = {k: v for k, v in kg.topic_nodes.items()
                          if topic.lower() in k.lower()}
        if not matching_topics:
            print(f"❌ Topic '{topic}' not found")
            return

        for topic_name, node_id in matching_topics.items():
            print(f"\n🏷️  Exploring Topic: {topic_name}")
            print("=" * 60)

            courses_covering = []
            for pred in kg.graph.predecessors(node_id):
                pred_data = kg.graph.nodes[pred]
                if pred_data.get('node_type') == 'course':
                    courses_covering.append(pred_data.get('code', pred))

            if courses_covering:
                print(f"Courses covering this topic ({len(courses_covering)}):")
                for course in courses_covering[:20]:
                    print(f"  - {course}")
            else:
                print("  (No courses found)")

def graph_statistics(kg: KnowledgeGraph):
    """Print detailed statistics about the graph."""
    print("📊 Knowledge Graph Statistics")
    print("=" * 60)
    print(f"Total Nodes: {kg.graph.number_of_nodes()}")
    print(f"Total Edges: {kg.graph.number_of_edges()}")
    print(f"\nNode Types:")
    node_types = {}
    for node_id, data in kg.graph.nodes(data=True):
        node_type = data.get('node_type', 'unknown')
        node_types[node_type] = node_types.get(node_type, 0) + 1
    for node_type, count in sorted(node_types.items()):
        print(f"  - {node_type}: {count}")

    print(f"\nEdge Types:")
    edge_types = {}
    for u, v, data in kg.graph.edges(data=True):
        edge_type = data.get('edge_type', 'unknown')
        edge_types[edge_type] = edge_types.get(edge_type, 0) + 1
    for edge_type, count in sorted(edge_types.items()):
        print(f"  - {edge_type}: {count}")

    print(f"\nMost Connected Courses:")
    course_degrees = {}
    for course_code, node_id in kg.course_nodes.items():
        degree = kg.graph.degree(node_id)
        course_degrees[course_code] = degree

    for course, degree in sorted(course_degrees.items(), key=lambda x: x[1], reverse=True)[:10]:
        print(f"  - {course}: {degree} connections")

    print(f"\nGraph Density: {nx.density(kg.graph):.4f}")
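    try:
        print(f"Average Clustering: {nx.average_clustering(kg.graph.to_undirected()):.4f}")
    except (ZeroDivisionError, nx.NetworkXError):
        # average_clustering divides by the node count, so an empty graph fails
        print("Average Clustering: N/A (graph may be empty)")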

# Print statistics first
graph_statistics(kg)

print("\n" + "="*60)
print("💡 Usage Examples:")
print("="*60)
print("1. Static visualization (shows top 50 nodes):")
print("   visualize_graph(kg, max_nodes=50)")
print("\n2. Interactive HTML visualization (around a course):")
print("   visualize_graph_interactive(kg, course_code='CSCI 6364', max_hops=2)")
print("\n3. Explore a specific course:")
print("   explore_graph(kg, course_code='CSCI 6364')")
print("\n4. Explore a professor:")
print("   explore_graph(kg, professor_name='Smith')")
print("\n5. Explore a topic:")
print("   explore_graph(kg, topic='machine learning')")

📊 Knowledge Graph Statistics
Total Nodes: 489
Total Edges: 566

Node Types:
  - course: 85
  - professor: 76
  - topic: 328

Edge Types:
  - covers_topic: 325
  - prerequisite: 40
  - taught_by: 201

Most Connected Courses:
  - CSCI 6908: 23 connections
  - CSCI 6999: 21 connections
  - CSCI 6998: 20 connections
  - CSCI 2113: 19 connections
  - CSCI 3908: 19 connections
  - CSCI 8999: 19 connections
  - CSCI 6212: 12 connections
  - CSCI 6231: 12 connections
  - DATS 6303: 12 connections
  - CSCI 3212: 11 connections

Graph Density: 0.0024
Average Clustering: 0.0047

💡 Usage Examples:
1. Static visualization (shows top 50 nodes):
   visualize_graph(kg, max_nodes=50)

2. Interactive HTML visualization (around a course):
   visualize_graph_interactive(kg, course_code='CSCI 6364', max_hops=2)

3. Explore a specific course:
   explore_graph(kg, course_code='CSCI 6364')

4. Explore a professor:
   explore_graph(kg, professor_name='Smith')

5. Explore a topic:
   explore_graph(kg, top

### Generate Interactive Graph HTML

In [10]:
visualize_graph_interactive(kg, course_code='CSCI 6212', max_hops=2, output_file='knowledge_graph.html')
print("Check the file browser on the left for 'knowledge_graph.html' to download and view interactively.")

📊 Showing subgraph around CSCI 6212 (60 nodes)
✅ Interactive graph saved to knowledge_graph.html
   Open it in your browser to explore!


Check the file browser on the left for 'knowledge_graph.html' to download and view interactively.


### 🔍 Cross-Subject Visualization Examples

Visualize relationships between courses from different departments.
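
Before opening a visualization, it can help to make "cross-subject" concrete. The sketch below is a minimal, standalone illustration: it takes `(course, prerequisite)` pairs and keeps those whose subject prefixes differ. The helper names and the toy pairs are hypothetical. In the notebook the pairs would come from graph edges whose `edge_type` is `prerequisite`, and which endpoint is the dependent course depends on how those edges were added, so check that before reusing this.

```python
from collections import Counter

def subject_of(course_code):
    """Return the subject prefix of a code, e.g. 'CSCI 6212' -> 'CSCI'."""
    return course_code.split()[0]

def cross_subject_prereqs(prereq_pairs):
    """Keep only (course, prerequisite) pairs whose subjects differ."""
    return [(c, p) for c, p in prereq_pairs if subject_of(c) != subject_of(p)]

# Toy (course, prerequisite) pairs. Hypothetical data, not read from the graph.
toy_pairs = [
    ("DS 2001", "CSCI 1012"),
    ("CSCI 6212", "CSCI 1112"),
    ("CSCI 3410", "MATH 1221"),
]

crossing = cross_subject_prereqs(toy_pairs)
by_subject_pair = Counter((subject_of(c), subject_of(p)) for c, p in crossing)

print(crossing)
print(by_subject_pair)
```

Counting by subject pair gives a quick sense of which departments depend on each other before drilling into a single course's neighborhood.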


In [12]:
# Example: visualize a Data Science course and its neighborhood, including any CS prerequisites
ds_courses = [c for c in kg.course_nodes if c.startswith(('DS', 'DATS'))]
if ds_courses:
    visualize_graph_interactive(kg, course_code=ds_courses[0], max_hops=2)

print("💡 Tip: Use the visualization above to explore cross-subject prerequisite chains!")


📊 Showing subgraph around DATS 1001 (23 nodes)
✅ Interactive graph saved to knowledge_graph.html
   Open it in your browser to explore!


💡 Tip: Use the visualization above to explore cross-subject prerequisite chains!


In [13]:
class GraphRetriever:
    """Retriever that finds relevant graph subgraphs for queries."""

    def __init__(self, knowledge_graph: KnowledgeGraph):
        self.kg = knowledge_graph

    def retrieve_subgraph(self, query: str, query_entities: List[str], max_hops: int = 2) -> nx.DiGraph:
        """Retrieve relevant subgraph for a query."""
        # Find starting nodes from query entities
        start_nodes = []
        for entity in query_entities:
            if not entity.strip():
                # An empty string is a substring of every name, so skip it
                continue
            entity_lower = entity.lower()
            # Try to match course codes (exact match)
            if entity in self.kg.course_nodes:
                start_nodes.append(self.kg.course_nodes[entity])
            # Try to match professors (substring match in either direction)
            for prof_name, node_id in self.kg.professor_nodes.items():
                prof_lower = prof_name.lower()
                if entity_lower in prof_lower or prof_lower in entity_lower:
                    start_nodes.append(node_id)
            # Try to match topics (entity is a substring of the topic name)
            for topic, node_id in self.kg.topic_nodes.items():
                if entity_lower in topic.lower():
                    start_nodes.append(node_id)

        if not start_nodes:
            # If no entities found, return empty subgraph
            return nx.DiGraph()

        # Drop duplicate node IDs (order preserved) before expanding the subgraph
        start_nodes = list(dict.fromkeys(start_nodes))
        return self.kg.get_subgraph(start_nodes, max_hops=max_hops)

    def format_subgraph_context(self, subgraph: nx.DiGraph) -> str:
        """Format subgraph as text context for LLM."""
        if subgraph.number_of_nodes() == 0:
            return "No relevant graph information found."

        context_parts = []

        # Group by edge type
        edges_by_type = defaultdict(list)
        for u, v, data in subgraph.edges(data=True):
            edge_type = data.get('edge_type', 'unknown')
            edges_by_type[edge_type].append((u, v))

        # Format prerequisite relationships
        if 'prerequisite' in edges_by_type:
            prereqs = []
            for u, v in edges_by_type['prerequisite']:
                course_u = subgraph.nodes[u].get('code', u)
                course_v = subgraph.nodes[v].get('code', v)
                prereqs.append(f"{course_v} is a prerequisite for {course_u}")
            if prereqs:
                context_parts.append("Prerequisites: " + "; ".join(prereqs[:10]))

        # Format taught_by relationships
        if 'taught_by' in edges_by_type:
            taught_by = []
            for u, v in edges_by_type['taught_by']:
                course = subgraph.nodes[u].get('code', u)
                prof = subgraph.nodes[v].get('name', v)
                taught_by.append(f"{course} is taught by {prof}")
            if taught_by:
                context_parts.append("Instructors: " + "; ".join(taught_by[:10]))

        # Format covers_topic relationships
        if 'covers_topic' in edges_by_type:
            topics = []
            for u, v in edges_by_type['covers_topic']:
                course = subgraph.nodes[u].get('code', u)
                topic = subgraph.nodes[v].get('name', v)
                topics.append(f"{course} covers {topic}")
            if topics:
                context_parts.append("Topics: " + "; ".join(topics[:10]))

        return "\n".join(context_parts) if context_parts else "Graph context available."

# Initialize retriever
retriever = GraphRetriever(kg)
print("✅ Graph Retriever initialized")

✅ Graph Retriever initialized
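
The retriever delegates the actual expansion to `kg.get_subgraph(start_nodes, max_hops=...)`, which is defined earlier in the notebook. As a rough illustration of what a hop-limited expansion does, here is a standalone breadth-first sketch over a plain adjacency dictionary. It is an assumption about the general technique, not a copy of the notebook's method: the function name, the toy graph, and the choice to treat edges as undirected (each edge listed in both directions) are all hypothetical.

```python
from collections import deque

def k_hop_neighborhood(adjacency, start_nodes, max_hops=2):
    """Return {node: hop distance} for every node within max_hops of a start node."""
    distances = {node: 0 for node in start_nodes}
    queue = deque(distances)  # de-duplicated start nodes
    while queue:
        node = queue.popleft()
        if distances[node] == max_hops:
            continue  # do not expand past the hop limit
        for neighbor in adjacency.get(node, ()):
            if neighbor not in distances:
                distances[neighbor] = distances[node] + 1
                queue.append(neighbor)
    return distances

# Toy graph (hypothetical): every edge is listed in both directions.
toy_adjacency = {
    "CSCI 6364": ["Prof. A", "machine learning", "CSCI 6212"],
    "CSCI 6212": ["CSCI 6364", "CSCI 1112"],
    "CSCI 1112": ["CSCI 6212", "CSCI 1111"],
    "CSCI 1111": ["CSCI 1112"],
    "Prof. A": ["CSCI 6364"],
    "machine learning": ["CSCI 6364"],
}

reached = k_hop_neighborhood(toy_adjacency, ["CSCI 6364"], max_hops=2)
print(reached)
```

With `max_hops=2`, `CSCI 1111` is three hops away and is left out, which is why larger `max_hops` values produce noticeably larger contexts.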


In [14]:
def extract_entities_from_query(query: str) -> List[str]:
    """Extract potential entities (course codes and known topic names) from a query.

    Professor names are not extracted. Topic matching reads the global `kg`.
    """
    entities = []

    # Extract course codes (e.g., "CSCI 6212")
    course_pattern = r'([A-Z]{2,}\s+\d{4}[A-Z]?)'
    course_matches = re.findall(course_pattern, query)
    entities.extend(course_matches)

    # Extract topic mentions
    for topic in kg.topic_nodes.keys():
        if topic.lower() in query.lower():
            entities.append(topic)

    return sorted(set(entities))  # sorted so the result is deterministic across runs

def generate_multi_hop_questions(kg: KnowledgeGraph, num_examples: int = 200) -> List[Dict]:
    """Generate multi-hop reasoning questions from the knowledge graph."""
    questions = []

    # Type 1: Prerequisite chain questions
    courses_with_prereqs = [code for code, prereqs in course_prerequisites.items() if prereqs]

    # Generate more prerequisite chain questions with variations
    num_prereq_questions = min(num_examples // 3, 80)
    for _ in range(num_prereq_questions):
        if not courses_with_prereqs:
            break
        target_course = np.random.choice(courses_with_prereqs)
        prereqs = course_prerequisites[target_course]
        if prereqs:
            completed_course = np.random.choice(prereqs) if prereqs else None

            if completed_course:
                # Find all prerequisites in the chain
                all_prereqs = []
                def get_all_prereqs(course):
                    if course in course_prerequisites:
                        for p in course_prerequisites[course]:
                            if p not in all_prereqs:
                                all_prereqs.append(p)
                                get_all_prereqs(p)

                get_all_prereqs(target_course)

                # Generate varied question formats
                question_formats = [
                    f"Which courses should I take to prepare for {target_course} if I've completed {completed_course}?",
                    f"What courses do I need after {completed_course} to enroll in {target_course}?",
                    f"After completing {completed_course}, what's the path to {target_course}?",
                    f"If I finished {completed_course}, which courses remain before {target_course}?",
                ]
                query = np.random.choice(question_formats)

                # Build answer
                remaining_prereqs = [p for p in all_prereqs if p != completed_course]
                if remaining_prereqs:
                    answer = f"To prepare for {target_course}, you should also take: {', '.join(remaining_prereqs[:5])}."
                else:
                    answer = f"After completing {completed_course}, you are ready to take {target_course}."

                # Get graph context
                entities = extract_entities_from_query(query)
                subgraph = retriever.retrieve_subgraph(query, entities, max_hops=3)
                graph_context = retriever.format_subgraph_context(subgraph)

                questions.append({
                    'query': query,
                    'answer': answer,
                    'graph_context': graph_context,
                    'reasoning_path': f"{completed_course} -> {target_course}",
                    'type': 'prerequisite_chain'
                })

    # Type 2: Professor intersection questions (enhanced)
    courses_list = list(kg.course_nodes.keys())[:60]
    num_prof_questions = min(num_examples // 3, 60)

    for _ in range(num_prof_questions):
        if len(courses_list) < 2:
            break
        course1, course2 = np.random.choice(courses_list, 2, replace=False)

        # Find professors teaching prerequisites of both
        prereqs1 = set(course_prerequisites.get(course1, []))
        prereqs2 = set(course_prerequisites.get(course2, []))

        # Also check for professors teaching the courses directly
        common_prereqs = prereqs1.intersection(prereqs2) if prereqs1 and prereqs2 else set()

        if common_prereqs or (prereqs1 or prereqs2):
            question_formats = [
                f"Which professors teach courses that are prerequisites for both {course1} and {course2}?",
                f"Who teaches prerequisite courses needed for {course1} and {course2}?",
                f"What instructors teach courses required before taking {course1} and {course2}?",
            ]
            query = np.random.choice(question_formats)

            # Find professors teaching common prerequisites
            profs = []
            check_courses = common_prereqs if common_prereqs else (prereqs1 | prereqs2)
            for _, row in courses_df.iterrows():
                if str(row['course_code']).strip() in check_courses:
                    prof = str(row.get('instructor', '')).strip()
                    if prof and prof != 'nan':
                        profs.append(prof)

            if profs:
                answer = f"Professors teaching prerequisites include: {', '.join(list(set(profs))[:5])}."
            else:
                answer = "Prerequisites exist for these courses; check the course catalog for current instructors."

            entities = extract_entities_from_query(query)
            subgraph = retriever.retrieve_subgraph(query, entities, max_hops=3)
            graph_context = retriever.format_subgraph_context(subgraph)

            questions.append({
                'query': query,
                'answer': answer,
                'graph_context': graph_context,
                'reasoning_path': f"{course1} \u2229 {course2} prerequisites",
                'type': 'professor_intersection'
            })

    # Type 3: Topic-based questions (enhanced)
    topics_list = list(kg.topic_nodes.keys())[:30]
    num_topic_questions = min(num_examples // 4, 50)

    for _ in range(num_topic_questions):
        if len(topics_list) < 1:
            break
        topic1 = np.random.choice(topics_list)

        # Find courses covering topic1
        courses_topic1 = [code for code, topics in course_topics.items() if topic1 in topics]

        if courses_topic1:
            question_formats = [
                f"What courses cover {topic1}?",
                f"Which courses teach {topic1}?",
                f"Where can I learn about {topic1}?",
                f"What classes focus on {topic1}?",
            ]
            query = np.random.choice(question_formats)
            answer = f"Courses covering {topic1} include: {', '.join(courses_topic1[:5])}."

            entities = extract_entities_from_query(query)
            subgraph = retriever.retrieve_subgraph(query, entities, max_hops=2)
            graph_context = retriever.format_subgraph_context(subgraph)

            questions.append({
                'query': query,
                'answer': answer,
                'graph_context': graph_context,
                'reasoning_path': f"topic:{topic1}",
                'type': 'topic_based'
            })

    # Type 4: Multi-hop path questions (enhanced)
    num_path_questions = min(num_examples // 4, 50)
    attempt_counter = 0
    max_attempts = num_path_questions * 5  # Allow more attempts to find valid paths

    while len([q for q in questions if q['type'] == 'multi_hop_path']) < num_path_questions and attempt_counter < max_attempts:
        attempt_counter += 1
        if len(courses_list) < 2:
            break
        source, target = np.random.choice(courses_list, 2, replace=False)

        source_node = kg.course_nodes.get(source)
        target_node = kg.course_nodes.get(target)

        if source_node is not None and target_node is not None:
            paths = kg.find_paths(source_node, target_node, max_length=4)
            if paths:
                path = paths[0]  # Take first path
                path_courses = [kg.graph.nodes[n].get('code', n) for n in path if kg.graph.nodes[n].get('node_type') == 'course']

                if len(path_courses) > 1:
                    question_formats = [
                        f"What is the prerequisite path from {source} to {target}?",
                        f"How do I get from {source} to {target} through prerequisites?",
                        f"What's the course sequence from {source} to {target}?",
                    ]
                    query = np.random.choice(question_formats)
                    answer = f"The path from {source} to {target} is: {' -> '.join(path_courses)}."

                    entities = extract_entities_from_query(query)
                    subgraph = retriever.retrieve_subgraph(query, entities, max_hops=4)
                    graph_context = retriever.format_subgraph_context(subgraph)

                    questions.append({
                        'query': query,
                        'answer': answer,
                        'graph_context': graph_context,
                        'reasoning_path': ' -> '.join(path_courses),
                        'type': 'multi_hop_path'
                    })

    # Type 5: Course description questions (new type)
    num_desc_questions = min(num_examples // 5, 30)
    for _ in range(num_desc_questions):
        if not courses_list:
            break  # np.random.choice raises on an empty list
        course = np.random.choice(courses_list)
        if course in course_descriptions:
            question_formats = [
                f"What is {course} about?",
                f"Tell me about {course}.",
                f"What topics does {course} cover?",
                f"Describe {course}.",
            ]
            query = np.random.choice(question_formats)

            desc = course_descriptions[course][:200] + "..." if len(course_descriptions[course]) > 200 else course_descriptions[course]
            answer = f"{course}: {desc}"

            entities = extract_entities_from_query(query)
            subgraph = retriever.retrieve_subgraph(query, entities, max_hops=2)
            graph_context = retriever.format_subgraph_context(subgraph)

            questions.append({
                'query': query,
                'answer': answer,
                'graph_context': graph_context,
                'reasoning_path': f"course:{course}",
                'type': 'course_description'
            })

    return questions

# Generate multi-hop questions
print("Generating multi-hop reasoning questions...")
multi_hop_questions = generate_multi_hop_questions(kg, num_examples=200)
print(f"\u2705 Generated {len(multi_hop_questions)} multi-hop questions")
print(f"\nQuestion type breakdown:")
type_counts = {}
for q in multi_hop_questions:
    qtype = q['type']
    type_counts[qtype] = type_counts.get(qtype, 0) + 1
for qtype, count in sorted(type_counts.items()):
    print(f"  - {qtype}: {count}")
print(f"\nSample questions:")
for i, q in enumerate(multi_hop_questions[:3]):
    print(f"\n{i+1}. {q['query']}")
    print(f"   Answer: {q['answer'][:100]}...")
    print(f"   Type: {q['type']}")

Generating multi-hop reasoning questions...
✅ Generated 200 multi-hop questions

Question type breakdown:
  - course_description: 28
  - multi_hop_path: 2
  - prerequisite_chain: 66
  - professor_intersection: 54
  - topic_based: 50

Sample questions:

1. If I finished CSCI 6531, which courses remain before CSCI 8531?
   Answer: After completing CSCI 6531, you are ready to take CSCI 8531....
   Type: prerequisite_chain

2. Which courses should I take to prepare for CSCI 8901 if I've completed APSC 3115?
   Answer: After completing APSC 3115, you are ready to take CSCI 8901....
   Type: prerequisite_chain

3. What courses do I need after CSCI 2461 to enroll in CSCI 3410?
   Answer: To prepare for CSCI 3410, you should also take: CSCI 1112, CSCI 1111, CSCI 2113, MATH 1221....
   Type: prerequisite_chain
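
The entity extraction above depends on the course-code pattern `[A-Z]{2,}\s+\d{4}[A-Z]?`, so it is worth seeing what that pattern accepts and rejects. The sketch below reuses the same pattern and adds a relaxed variant for comparison. The relaxed pattern and both helper names are hypothetical additions, not part of the notebook, and the relaxed form can produce false positives such as "Spring 2026".

```python
import re

# Same pattern as extract_entities_from_query: uppercase subject, whitespace, 4 digits.
COURSE_PATTERN = re.compile(r'([A-Z]{2,}\s+\d{4}[A-Z]?)')

# Hypothetical relaxed variant: any letter case, optional whitespace.
RELAXED_PATTERN = re.compile(r'\b([A-Za-z]{2,})\s*(\d{4}[A-Za-z]?)\b')

def extract_course_codes(query):
    """Return course codes exactly as the strict pattern finds them."""
    return COURSE_PATTERN.findall(query)

def extract_course_codes_relaxed(query):
    """Return normalized codes ('SUBJ 1234') found by the relaxed pattern."""
    return [f"{subj.upper()} {num.upper()}" for subj, num in RELAXED_PATTERN.findall(query)]

strict_hits = extract_course_codes("What's the prerequisite path from CSCI 1112 to DS 6450?")
strict_miss = extract_course_codes("prereqs for csci6212 and ds 6450")
relaxed_hits = extract_course_codes_relaxed("prereqs for csci6212 and ds 6450")

print(strict_hits)
print(strict_miss)
print(relaxed_hits)
```

The strict pattern finds nothing in lowercase or unspaced queries, in which case the retriever falls back to topic matches only, or to an empty subgraph.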


In [15]:
def create_rag_training_example(query: str, answer: str, graph_context: str,
                                reasoning_path: str = "", include_structure: bool = True) -> Dict:
    """Create a RAG training example with graph context."""

    # Build system message with graph context
    system_content = """You are a helpful assistant providing information about GWU Computer Science courses for Spring 2026.
You have access to a knowledge graph with course relationships, prerequisites, instructors, and topics.
Use the provided graph context to answer questions accurately."""

    # Build user message with graph context
    if graph_context and graph_context != "No relevant graph information found.":
        user_content = f"""Graph Context:
{graph_context}

Question: {query}"""
    else:
        user_content = f"Question: {query}"

    # Build assistant response with structured format
    if include_structure and reasoning_path:
        assistant_content = f"""Reasoning Path: {reasoning_path}

Answer: {answer}"""
    else:
        assistant_content = answer

    return {
        "messages": [
            {"role": "system", "content": system_content},
            {"role": "user", "content": user_content},
            {"role": "assistant", "content": assistant_content}
        ],
        "graph_context": graph_context,
        "reasoning_path": reasoning_path,
        "query_type": "multi_hop" if reasoning_path else "simple"
    }

# Load existing simple Q&A data (optional - gracefully handles if missing)
existing_dataset = []
try:
    if os.path.exists("data/course_finetune.jsonl"):
        with open("data/course_finetune.jsonl", 'r') as f:
            for line in f:
                if line.strip():
                    existing_dataset.append(json.loads(line))
        print(f"Loaded {len(existing_dataset)} existing examples")
    else:
        print("No existing dataset found (data/course_finetune.jsonl), using only multi-hop questions")
except Exception as e:
    print(f"Warning: Could not load existing dataset: {e}")
    print("Continuing with only multi-hop questions")

# Convert existing examples to RAG format (graph context is attached only when entities are found)
rag_dataset = []
for example in existing_dataset[:500]:  # Limit to avoid too much data
    messages = example.get('messages', [])
    if len(messages) >= 3:
        user_msg = messages[1].get('content', '')
        assistant_msg = messages[2].get('content', '')

        # Extract entities and get graph context
        entities = extract_entities_from_query(user_msg)
        if entities:
            subgraph = retriever.retrieve_subgraph(user_msg, entities, max_hops=2)
            graph_context = retriever.format_subgraph_context(subgraph)
        else:
            graph_context = ""

        rag_example = create_rag_training_example(
            user_msg, assistant_msg, graph_context,
            reasoning_path="", include_structure=False
        )
        rag_dataset.append(rag_example)

# Add multi-hop questions
for q in multi_hop_questions:
    rag_example = create_rag_training_example(
        q['query'], q['answer'], q['graph_context'],
        reasoning_path=q.get('reasoning_path', ''), include_structure=True
    )
    rag_dataset.append(rag_example)

print(f"✅ Created RAG dataset with {len(rag_dataset)} examples")
print(f"   - Simple Q&A: {len([x for x in rag_dataset if x['query_type'] == 'simple'])}")
print(f"   - Multi-hop: {len([x for x in rag_dataset if x['query_type'] == 'multi_hop'])}")

# Ensure data directory exists before saving
os.makedirs("data", exist_ok=True)

# Save dataset
output_file = "data/course_finetune_kg_rag.jsonl"
with open(output_file, 'w') as f:
    for example in rag_dataset:
        f.write(json.dumps(example) + "\n")
print(f"✅ Saved dataset to {output_file}")

No existing dataset found (data/course_finetune.jsonl), using only multi-hop questions
✅ Created RAG dataset with 200 examples
   - Simple Q&A: 0
   - Multi-hop: 200
✅ Saved dataset to data/course_finetune_kg_rag.jsonl
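
Since the saved JSONL file feeds directly into fine-tuning, a quick structural check can catch malformed records early. The following is a minimal sketch, assuming the record layout produced by `create_rag_training_example` (a `messages` list with system, user, and assistant turns, plus a `query_type`). The validator functions and the sample lines are hypothetical and are not part of the notebook.

```python
import json

REQUIRED_ROLES = ["system", "user", "assistant"]

def validate_record(record):
    """Return a list of problems found in one training record (empty if none)."""
    messages = record.get("messages")
    if not isinstance(messages, list):
        return ["'messages' is missing or not a list"]
    problems = []
    roles = [m.get("role") for m in messages]
    if roles != REQUIRED_ROLES:
        problems.append(f"unexpected role order: {roles}")
    for m in messages:
        if not str(m.get("content", "")).strip():
            problems.append(f"empty content for role {m.get('role')!r}")
    if record.get("query_type") not in {"simple", "multi_hop"}:
        problems.append(f"unknown query_type: {record.get('query_type')!r}")
    return problems

def validate_jsonl_lines(lines):
    """Map 1-based line numbers to the problems found on that line."""
    report = {}
    for line_no, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # skip blank lines, as the notebook's loader does
        problems = validate_record(json.loads(line))
        if problems:
            report[line_no] = problems
    return report

# Two hypothetical records: the first is well formed, the second is not.
sample_lines = [
    json.dumps({"messages": [{"role": "system", "content": "s"},
                             {"role": "user", "content": "Question: q"},
                             {"role": "assistant", "content": "a"}],
                "query_type": "simple"}),
    json.dumps({"messages": [{"role": "user", "content": "q"},
                             {"role": "assistant", "content": ""}],
                "query_type": "multi_hop"}),
]

report = validate_jsonl_lines(sample_lines)
print(report)
```

To check the real file, the same function can be called on an open file handle, for example `validate_jsonl_lines(open(output_file))`.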


In [16]:
from unsloth import FastLanguageModel
from unsloth.chat_templates import get_chat_template

# Model configuration
max_seq_length = 2048
dtype = None  # Auto-detect
load_in_4bit = True

print("Loading model...")
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B",
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
)

# Setup chat template
tokenizer = get_chat_template(
    tokenizer,
    chat_template="llama-3.1",
)

print("✅ Model loaded")


🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


    PyTorch 2.6.0+cu124 with CUDA 1204 (you have 2.9.0+cu126)
    Python  3.12.9 (you have 3.12.12)
  Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers)
  Memory-efficient attention, SwiGLU, sparse and more won't be available.
  Set XFORMERS_MORE_DETAILS=1 for more details


Switching to PyTorch attention since your Xformers is broken.

Unsloth: Xformers was not installed correctly.
Please install xformers separately first.
Then confirm if it's correctly installed by running:
python -m xformers.info

Longer error message:
xFormers can't load C++/CUDA extensions. xFormers was built for:
    PyTorch 2.6.0+cu124 with CUDA 1204 (you have 2.9.0+cu126)
    Python  3.12.9 (you have 3.12.12)
  Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers)
  Memory-efficient attention, SwiGLU, sparse and more won't be available.
🦥 Unsloth Zoo will now patch everything to make training faster!
Unsloth: Could not import trl.trainer.alignprop_trainer: Failed to import trl.trainer.alignprop_trainer because of the following error (look up to see its traceback):
Failed to import trl.models.modeling_sd_base because of the following error (look up to see its traceback):
Failed to import diffusers.pipelines.stable_diffusion.pipeline_stab

✅ Model loaded


In [17]:
# Configure LoRA
model = FastLanguageModel.get_peft_model(
    model,
    r=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=32,
    lora_dropout=0.05,  # Unsloth's fast patching expects 0; a nonzero value triggers the slower path warned about below
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=3407,
    use_rslora=False,
    loftq_config=None,
)

print("✅ LoRA configured")


Unsloth: Dropout = 0 is supported for fast patching. You are using dropout = 0.05.
Unsloth will patch all other layers, except LoRA matrices, causing a performance hit.
Unsloth 2025.11.6 patched 32 layers with 0 QKV layers, 0 O layers and 0 MLP layers.


✅ LoRA configured
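
The LoRA settings above determine how many parameters are actually trained. As a worked check, the sketch below counts adapter parameters for rank 32 across the seven targeted projections. The layer dimensions are the published Llama 3.1 8B sizes, hard-coded here as an assumption rather than read from `model.config`, and the helper name is hypothetical.

```python
# Llama 3.1 8B dimensions, hard-coded for illustration.
HIDDEN = 4096          # model hidden size
INTERMEDIATE = 14336   # MLP intermediate size
KV_DIM = 1024          # 8 key/value heads x 128 head dim (grouped-query attention)
NUM_LAYERS = 32
RANK = 32              # matches r=32 in the LoRA configuration

# (in_features, out_features) for each targeted projection in one decoder layer
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_params(in_features, out_features, rank):
    """LoRA adds A (rank x in_features) and B (out_features x rank) per layer."""
    return rank * (in_features + out_features)

per_layer = sum(lora_params(i, o, RANK) for i, o in PROJECTIONS.values())
total = per_layer * NUM_LAYERS

print(f"Per layer: {per_layer:,}")
print(f"Total trainable: {total:,}")
```

The total works out to 83,886,080, which agrees with the trainable-parameter count that the trainer banner reports in section 11. With `lora_alpha=32` and `use_rslora=False`, the adapter scaling factor `lora_alpha / r` is 1.0.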


## 9. Prepare Dataset for Training


In [18]:
from datasets import load_dataset

# Verify the dataset file exists
if not os.path.exists("data/course_finetune_kg_rag.jsonl"):
    raise FileNotFoundError("❌ Dataset file 'data/course_finetune_kg_rag.jsonl' not found. Please run previous cells to generate it.")

# Load dataset
dataset = load_dataset("json", data_files="data/course_finetune_kg_rag.jsonl", split="train")

def formatting_prompts_func(examples):
    """Format dataset with chat template."""
    convos = examples["messages"]
    texts = [tokenizer.apply_chat_template(convo, tokenize=False, add_generation_prompt=False)
             for convo in convos]
    return {"text": texts}

# Format dataset
dataset = dataset.map(formatting_prompts_func, batched=True)

# Train/validation split (80/20)
dataset = dataset.train_test_split(test_size=0.2, seed=3407)
train_dataset = dataset["train"]
eval_dataset = dataset["test"]

print(f"✅ Dataset prepared:")
print(f"   Train examples: {len(train_dataset)}")
print(f"   Validation examples: {len(eval_dataset)}")
print(f"\nSample training example:")
print(train_dataset[0]["text"][:500] + "...")

✅ Dataset prepared:
   Train examples: 160
   Validation examples: 40

Sample training example:
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 July 2024

You are a helpful assistant providing information about GWU Computer Science courses for Spring 2026.
You have access to a knowledge graph with course relationships, prerequisites, instructors, and topics.
Use the provided graph context to answer questions accurately.<|eot_id|><|start_header_id|>user<|end_header_id|>

Graph Context:
Prerequisites: CSCI 6221 is a prerequisi...


## 10. Training Configuration with Evaluation Metrics


In [19]:
%pip install rouge_score

Collecting rouge_score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: rouge_score
  Building wheel for rouge_score (setup.py) ... done
  Created wheel for rouge_score: filename=rouge_score-0.1.2-py3-none-any.whl size=24934 sha256=75a1d965cbac73de8468664f5c06fe2a227548d117c56f56267c5ebe5f1d8140
  Stored in directory: /root/.cache/pip/wheels/85/9d/af/01feefbe7d55ef5468796f0c68225b6788e85d9d0a281e7a70
Successfully built rouge_score
Installing collected packages: rouge_score
Successfully installed rouge_score-0.1.2


In [20]:
from trl import SFTConfig, SFTTrainer
from transformers import EarlyStoppingCallback
import evaluate

# Load evaluation metrics
bleu_metric = evaluate.load("bleu")
rouge_metric = evaluate.load("rouge")

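# NOTE: compute_metrics is left disabled (wrapped in a string literal) because decoding text during
# evaluation slows training significantly; this run relies on validation loss instead.
# Before enabling it, note that the Trainer passes logits (not token IDs) as predictions and masks
# label positions with -100, so both need converting before they can be passed to batch_decode.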
"""
def compute_metrics(eval_pred):
    #Compute evaluation metrics.
    predictions, labels = eval_pred

    # Decode predictions and labels
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Compute BLEU
    bleu_results = bleu_metric.compute(
        predictions=decoded_preds,
        references=[[ref] for ref in decoded_labels]
    )

    # Compute ROUGE
    rouge_results = rouge_metric.compute(
        predictions=decoded_preds,
        references=decoded_labels
    )

    return {
        "bleu": bleu_results["bleu"],
        "rouge1": rouge_results["rouge1"],
        "rouge2": rouge_results["rouge2"],
        "rougeL": rouge_results["rougeL"],
    }
"""

# Training configuration
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    packing=False,
    # compute_metrics=compute_metrics,  # Uncomment if you defined the function above
    args=SFTConfig(
        # Batch size: 2 per device x 4 accumulation steps = effective batch size of 8
        per_device_train_batch_size=2,
        per_device_eval_batch_size=2,
        gradient_accumulation_steps=4,

        # Learning rate
        learning_rate=1e-4,
        lr_scheduler_type="cosine",
        warmup_ratio=0.1,

        # Training duration
        num_train_epochs=5,
        max_steps=-1,

        # Optimization
        optim="adamw_8bit",
        weight_decay=0.01,
        adam_beta1=0.9,
        adam_beta2=0.999,

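        # Evaluation and logging
        # In the run recorded here there are 100 optimizer steps in total, so evaluation
        # fires once (at step 100) and early stopping with patience 3 cannot trigger.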
        eval_strategy="steps",
        eval_steps=100,
        save_strategy="steps",
        save_steps=200,
        logging_steps=10,
        report_to="none",

        # Output
        output_dir="outputs_kg_qa",
        seed=3407,
        fp16=False,
        bf16=True,  # bfloat16 needs an Ampere-or-newer GPU (the run below uses an A100)

        # Early stopping
        load_best_model_at_end=True,
        metric_for_best_model="eval_loss",
        greater_is_better=False,
    ),
)

# Add early stopping
early_stopping = EarlyStoppingCallback(
    early_stopping_patience=3,
    early_stopping_threshold=0.001,
)
trainer.add_callback(early_stopping)

print("✅ Training configuration complete")
print("Note: Using validation loss as primary metric for efficiency")
print("Batch size reduced to 2 per device with gradient accumulation 4 for better memory usage")

✅ Training configuration complete
Note: Using validation loss as primary metric for efficiency
Batch size reduced to 2 per device with gradient accumulation 4 for better memory usage
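
The configuration pairs `early_stopping_patience=3` with `early_stopping_threshold=0.001`. To make those two numbers concrete, here is a simplified, standalone model of patience-based early stopping on a sequence of validation losses. It is a sketch of the general idea, not the `transformers` callback itself, which differs in details such as how the best metric is tracked. The function name and the loss values are hypothetical.

```python
def early_stop_step(eval_losses, patience=3, threshold=0.001):
    """Return the index of the evaluation at which training would stop, or None.

    An evaluation counts as an improvement only if it beats the best loss so far
    by more than `threshold`. Training stops after `patience` consecutive
    evaluations without such an improvement.
    """
    best = None
    bad_evals = 0
    for i, loss in enumerate(eval_losses):
        if best is None or best - loss > threshold:
            best = loss
            bad_evals = 0
        else:
            bad_evals += 1
            if bad_evals >= patience:
                return i
    return None

# Hypothetical loss curve: two real improvements, then three stalled evaluations.
stalled = early_stop_step([0.50, 0.40, 0.3995, 0.3999, 0.41])
# A run with a single evaluation can never accumulate enough stalled evaluations.
single = early_stop_step([0.292937])

print(stalled, single)
```

Early stopping only helps when there are more evaluations than the patience value, so `eval_steps` should be small relative to the total number of training steps.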


## 11. Train Model


In [21]:
import torch

# Check GPU memory
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

# Train
print("\nStarting training...")
trainer_stats = trainer.train()

# Training statistics
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)

print(f"\n✅ Training completed!")
print(f"Runtime: {trainer_stats.metrics['train_runtime']:.2f} seconds ({trainer_stats.metrics['train_runtime']/60:.2f} minutes)")
print(f"Peak reserved memory: {used_memory} GB ({used_percentage}%)")
print(f"Training memory: {used_memory_for_lora} GB")
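# Note: metrics['train_loss'] is the training loss averaged over the whole run,
# so it can be higher than the loss logged at the last step
print(f"Final training loss: {trainer_stats.metrics.get('train_loss', 'N/A')}")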


The model is already on multiple devices. Skipping the move to device specified in `args`.


GPU = NVIDIA A100-SXM4-40GB. Max memory = 39.557 GB.
7.135 GB of memory reserved.

Starting training...


==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 160 | Num Epochs = 5 | Total steps = 100
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 83,886,080 of 8,114,147,328 (1.03% trained)


Unsloth: Will smartly offload gradients to save VRAM!


| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 100 | 0.2996 | 0.292937 |


Unsloth: Not an error, but LlamaForCausalLM does not accept `num_items_in_batch`.
Using gradient accumulation will be very slightly less accurate.
Read more on gradient accumulation issues here: https://unsloth.ai/blog/gradient



✅ Training completed!
Runtime: 261.51 seconds (4.36 minutes)
Peak reserved memory: 7.65 GB (19.339%)
Training memory: 0.515 GB
Final training loss: 0.6341041231155395
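The figures in the training log can be checked by hand. The sketch below recomputes the effective batch size, the step count, and the trainable-parameter share from the values printed in the banner. It assumes the step count per epoch is the number of examples divided by the effective batch size, rounded up; the variable names are illustrative.

```python
import math

# Values copied from the training banner above
num_examples = 160
per_device_batch = 2
grad_accum = 4
num_gpus = 1
epochs = 5

effective_batch = per_device_batch * grad_accum * num_gpus    # examples per optimizer step
steps_per_epoch = math.ceil(num_examples / effective_batch)   # optimizer steps per epoch
total_steps = steps_per_epoch * epochs

trainable_params = 83_886_080
total_params = 8_114_147_328
trainable_pct = 100 * trainable_params / total_params

print(effective_batch, steps_per_epoch, total_steps, round(trainable_pct, 2))
```

This reproduces the banner: a total batch size of 8, 100 total steps, and about 1.03% of parameters trained.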


## 12. Evaluation Framework


In [22]:
def exact_match(prediction: str, reference: str) -> bool:
    """Check if prediction exactly matches reference."""
    return prediction.strip().lower() == reference.strip().lower()

def f1_score(prediction: str, reference: str) -> float:
    """Compute token-level F1 over the sets of unique lowercase tokens (repeated tokens count once)."""
    pred_tokens = set(prediction.lower().split())
    ref_tokens = set(reference.lower().split())

    if len(pred_tokens) == 0 or len(ref_tokens) == 0:
        return 0.0

    intersection = pred_tokens.intersection(ref_tokens)
    if len(intersection) == 0:
        return 0.0

    precision = len(intersection) / len(pred_tokens)
    recall = len(intersection) / len(ref_tokens)

    if precision + recall == 0:
        return 0.0

    return 2 * (precision * recall) / (precision + recall)

def evaluate_qa_predictions(predictions: List[str], references: List[str]) -> Dict:
    """Evaluate QA predictions with multiple metrics."""
    em_scores = [exact_match(p, r) for p, r in zip(predictions, references)]
    f1_scores = [f1_score(p, r) for p, r in zip(predictions, references)]

    # Compute BLEU and ROUGE
    bleu_results = bleu_metric.compute(
        predictions=predictions,
        references=[[ref] for ref in references]
    )
    rouge_results = rouge_metric.compute(
        predictions=predictions,
        references=references
    )

    return {
        "exact_match": np.mean(em_scores),
        "f1": np.mean(f1_scores),
        "bleu": bleu_results["bleu"],
        "rouge1": rouge_results["rouge1"],
        "rouge2": rouge_results["rouge2"],
        "rougeL": rouge_results["rougeL"],
    }

# Create test set with complex queries
test_queries = [
    {
        "query": "Which courses should I take to prepare for CSCI 6364 if I've completed CSCI 1112?",
        "expected_entities": ["CSCI 6364", "CSCI 1112"],
        "type": "prerequisite_chain"
    },
    {
        "query": "Who teaches Machine Learning?",
        "expected_entities": ["machine learning"],
        "type": "simple"
    },
    {
        "query": "What courses cover computer vision and are prerequisites for deep learning courses?",
        "expected_entities": ["computer vision", "deep learning"],
        "type": "topic_based"
    },
    {
        "query": "Tell me about CSCI 1012.",
        "expected_entities": ["CSCI 1012"],
        "type": "simple"
    },
    {
        "query": "Which professors teach courses that are prerequisites for both CSCI 6364 and CSCI 6444?",
        "expected_entities": ["CSCI 6364", "CSCI 6444"],
        "type": "professor_intersection"
    }
]

print(f"✅ Evaluation framework ready with {len(test_queries)} test queries")


✅ Evaluation framework ready with 5 test queries
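The `f1_score` function above compares sets of tokens, so a token repeated in the prediction is counted once. A common alternative, used in SQuAD-style evaluation, counts repeated tokens with a multiset overlap. The sketch below shows that variant; the name `f1_score_multiset` is hypothetical and not part of the notebook.

```python
from collections import Counter


def f1_score_multiset(prediction: str, reference: str) -> float:
    """Token-level F1 where repeated tokens are counted (multiset overlap)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0

    # Counter & Counter keeps the minimum count of each shared token
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(f1_score_multiset("CSCI 6212 is taught by Youssef", "Youssef teaches CSCI 6212"))
```

In this example three of six predicted tokens match three of four reference tokens, giving precision 0.5, recall 0.75, and an F1 of about 0.6. The two variants diverge on repetition: for the prediction "a a a" against the reference "a", the set-based version scores 1.0 while the multiset version scores 0.5.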


## 13. Inference Pipeline with Graph Retrieval


In [29]:
FastLanguageModel.for_inference(model)

def clean_model_output(text: str) -> str:
    """Clean model output to remove chat artifacts and return clean answer."""
    # Normalize non-breaking spaces ('\xa0' and '\u00a0' are the same character)
    text = text.replace('\u00a0', ' ')
    # Remove specific non-English characters that might appear from tokenization issues
    text = re.sub(r'[\u0400-\u04FF]+', '', text) # Cyrillic
    text = re.sub(r'[\u00C0-\u00FF]+', '', text) # Latin-1 Supplement (á, é, etc.)

    # Remove common chat artifacts and trailing prompt elements
    artifacts_to_remove = [
        'assistant', 'user', 'system',
        '<|assistant|>', '<|user|>', '<|system|>',
        '<|eot_id|>', '<|start_header_id|>', '<|end_header_id|>',
        '・━・━user', '.**************',
        'Cutting Knowledge Date:', 'Today Date:', # Specific patterns from user's observation
        'Question:', # If question gets repeated
        'Rationale:' # Sometimes model generates this
    ]

    for artifact in artifacts_to_remove:
        text = text.replace(artifact, '')

    # If a "Reasoning Path:" section is present, isolate the answer
    if "Reasoning Path:" in text:
        # Prefer everything after the first "Answer:" label
        parts = text.split("Answer:", 1) # Split only once
        if len(parts) > 1:
            text = parts[1] # Take everything after "Answer:"
        else:
            # No "Answer:" label: keep only the text before "Reasoning Path:",
            # since what follows it is unlikely to be the actual answer
            text = text.split("Reasoning Path:", 1)[0]

    # Further cleanup for unwanted trailing information if it slipped through
    # Each pattern is anchored at the end of the string ($ without re.MULTILINE)
    stop_patterns = [
        r'\s*([\w\s]*)\);$', # e.g., user);
        r'\s*\.\s*\*+\s*$', # e.g., .**************
        r'\s*Cutting Knowledge Date.*$',
        r'\s*Today Date.*$',
        r'\s*Question:.*$'
    ]
    # \w is Unicode-aware by default for str patterns in Python 3, so re.UNICODE is not needed
    for pattern in stop_patterns:
        text = re.sub(pattern, '', text, flags=re.IGNORECASE | re.DOTALL)

    # Clean up multiple newlines and whitespace
    text = re.sub(r'\n+', ' ', text).strip()
    text = re.sub(r'\s+', ' ', text).strip()

    # Get the first meaningful sentence or main answer
    sentences = [s.strip() for s in text.split('.') if s.strip()]
    if sentences:
        # Return first meaningful sentence
        for sentence in sentences:
            # Skip very short sentences or fragments or those starting with keywords that indicate non-answer content
            if len(sentence) > 10 and not sentence.lower().startswith(('reasoning', 'note', 'other', 'graph context', 'prerequisites', 'instructors', 'topics')):
                return sentence.strip() + '.'
        # If no good sentence found, return first sentence
        return sentences[0].strip() + '.'

    return text.strip()

def answer_with_graph_retrieval(query: str, max_new_tokens: int = 256) -> Dict:
    """Answer query using graph retrieval + LLM generation."""

    # Step 1: Extract entities from query
    entities = extract_entities_from_query(query)

    # Step 2: Retrieve relevant subgraph
    subgraph = retriever.retrieve_subgraph(query, entities, max_hops=3)
    graph_context = retriever.format_subgraph_context(subgraph)

    # Step 3: Build prompt with graph context
    system_content = """You are a helpful assistant providing information about GWU Computer Science and Data Science courses for Spring 2026.
You have access to a knowledge graph with course relationships, prerequisites, instructors, and topics.
Use the provided graph context to answer questions accurately. Provide concise, direct answers."""

    if graph_context and graph_context != "No relevant graph information found.":
        user_content = f"""Graph Context:
{graph_context}

Question: {query}"""
    else:
        user_content = f"Question: {query}"

    messages = [
        {"role": "system", "content": system_content},
        {"role": "user", "content": user_content},
    ]

    # Step 4: Generate answer
    inputs = tokenizer.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        return_tensors="pt",
    ).to("cuda")

    attention_mask = torch.ones_like(inputs)

    outputs = model.generate(
        input_ids=inputs,
        attention_mask=attention_mask,
        max_new_tokens=max_new_tokens,
        temperature=0.1,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,
        eos_token_id=tokenizer.eos_token_id,
        repetition_penalty=1.2,
        top_p=0.9,
    )

    # Decode answer
    output_text = tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=False) # Keep special tokens initially for cleaning

    # Clean the output
    cleaned_answer = clean_model_output(output_text)

    return {
        "query": query,
        "answer": cleaned_answer,
        "raw_answer": output_text.strip(),  # Keep raw output for debugging
        "graph_context": graph_context,
        "entities_found": entities,
        "subgraph_size": subgraph.number_of_nodes() if subgraph else 0
    }

# Test inference
print("Testing inference pipeline with cleaned outputs...\n")
for i, test_query in enumerate(test_queries[:3], 1):
    print(f"{'='*60}")
    print(f"Test {i}: {test_query['query']}")
    print(f"{'='*60}")

    result = answer_with_graph_retrieval(test_query['query'])

    print(f"Entities found: {result['entities_found']}")
    print(f"Subgraph nodes: {result['subgraph_size']}")

    # Print full graph context for the first test query to debug
    if i == 1:
        print(f"\nFull Graph Context (for Test 1):\n{result['graph_context']}")
    else:
        print(f"\nGraph Context (truncated for other tests):\n{result['graph_context'][:200]}...")

    print(f"\nCleaned Answer:\n{result['answer']}")
    print(f"\nRaw Answer (first 200 chars):\n{result['raw_answer'][:200]}...")
    print()

Testing inference pipeline with cleaned outputs...

Test 1: Which courses should I take to prepare for CSCI 6364 if I've completed CSCI 1112?
Entities found: ['CSCI 6364', 'CSCI 1112']
Subgraph nodes: 161

Full Graph Context (for Test 1):
Prerequisites: CSCI 2113 is a prerequisite for CSCI 2441W; CSCI 1311 is a prerequisite for CSCI 3212; CSCI 1111 is a prerequisite for CSCI 1112; CSCI 2113 is a prerequisite for CSCI 6444; CSCI 3212 is a prerequisite for CSCI 4511; DATS 6202 is a prerequisite for DATS 6312; CSCI 2113 is a prerequisite for CSCI 4531; CSCI 3212 is a prerequisite for CSCI 4364; CSCI 6364 is a prerequisite for CSCI 6365; CSCI 6212 is a prerequisite for CSCI 6511
Instructors: CSCI 3212 is taught by Simha, R; CSCI 6212 is taught by Youssef, A; CSCI 6998 is taught by Pless, R; CSCI 6998 is taught by Vora, P; CSCI 6998 is taught by Youssef, A; CSCI 6998 is taught by Qu, X; CSCI 6998 is taught by Huang, H; CSCI 6998 is taught by Simha, R; CSCI 6998 is taught by Wood, T; CSCI 69

### Test Custom Queries

Use the cell below to try out your own questions. You can modify the `custom_query` variable to experiment with different prompts.

In [30]:
# Enter your custom query here
custom_query = "Which professor teaches CSCI 6212 and who teaches the most DATS Courses?"

print(f"\n{'='*60}")
print(f"Custom Query: {custom_query}")
print(f"{'='*60}")

# Get the answer from the model with graph retrieval
custom_result = answer_with_graph_retrieval(custom_query)

print(f"Entities found: {custom_result['entities_found']}")
print(f"Subgraph nodes: {custom_result['subgraph_size']}")
print(f"\nGraph Context (truncated):\n{custom_result['graph_context'][:500]}...")
print(f"\nCleaned Answer:\n{custom_result['answer']}")
print(f"\nRaw Answer (first 500 chars):\n{custom_result['raw_answer'][:500]}...")



Custom Query: Which professor teaches CSCI 6212 and who teaches the most DATS Courses?
Entities found: ['CSCI 6212']
Subgraph nodes: 151

Graph Context (truncated):
Prerequisites: CSCI 6221 is a prerequisite for CSCI 6231; CSCI 1311 is a prerequisite for CSCI 3212; CSCI 1111 is a prerequisite for CSCI 1112; CSCI 2113 is a prerequisite for CSCI 6444; CSCI 3212 is a prerequisite for CSCI 4511; DATS 6202 is a prerequisite for DATS 6312; CSCI 3212 is a prerequisite for CSCI 4364; CSCI 6364 is a prerequisite for CSCI 6365; CSCI 6212 is a prerequisite for CSCI 6511; CSCI 1311 is a prerequisite for CSCI 2541
Instructors: DATS 6001 is taught by Majhi, S; CSCI 6907 ...

Cleaned Answer:
Professors teaching CSCI 6212 include: Youssef, A.

Raw Answer (first 500 chars):
Reasoning Path: course:CSCI 6212 -> instructor

Answer: Professors teaching CSCI 6212 include: Youssef, A.„Éª‚îÅ„Éª‚îÅ.**************
ifneedederusformuser);
„Éª‚îÅ„Éª‚îÅ„Éª‚îÅ„Éª‚îÅuser>
.AbsoluteConstraints √ßer√ßevuser—ü—ü—ü—ü—ü—
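The raw answer above continues past the end of the actual answer. One complementary cleanup step is to cut the decoded text at the first end-of-turn marker before any other processing. The sketch below is a standalone illustration: the helper name `truncate_at_first_marker` and the sample string are hypothetical, and the marker list is borrowed from the `artifacts_to_remove` list in `clean_model_output`.

```python
from typing import Sequence


def truncate_at_first_marker(text: str, markers: Sequence[str]) -> str:
    """Return text up to (not including) the earliest marker; unchanged if none occurs."""
    cut = len(text)
    for marker in markers:
        idx = text.find(marker)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut].rstrip()


# Markers taken from the cleanup list; the raw string is a made-up example
markers = ["<|eot_id|>", "<|start_header_id|>", "Cutting Knowledge Date:"]
raw = "Answer: Professors teaching CSCI 6212 include: Youssef, A.<|eot_id|><|start_header_id|>user"
print(truncate_at_first_marker(raw, markers))
```

Cutting at the earliest marker, rather than deleting each marker wherever it appears, also discards whatever the model generated after the end of its turn.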

## 14. Performance Monitoring and Metrics


In [24]:
# Evaluate on test set
print("Evaluating on test queries...\n")

predictions = []
references = []
graph_retrieval_stats = {
    "total_queries": 0,
    "queries_with_graph_context": 0,
    "avg_subgraph_size": [],
    "entities_extracted": 0
}

for test_query in test_queries:
    result = answer_with_graph_retrieval(test_query['query'])
    predictions.append(result['answer'])

    # For evaluation, we'd need ground truth answers
    # For now, we'll use a placeholder
    references.append("")  # Would be actual ground truth

    # Track graph retrieval stats
    graph_retrieval_stats["total_queries"] += 1
    if result['graph_context'] and result['graph_context'] != "No relevant graph information found.":
        graph_retrieval_stats["queries_with_graph_context"] += 1
    graph_retrieval_stats["avg_subgraph_size"].append(result['subgraph_size'])
    graph_retrieval_stats["entities_extracted"] += len(result['entities_found'])

# Print statistics
print("Graph Retrieval Statistics:")
print(f"  Total queries: {graph_retrieval_stats['total_queries']}")
print(f"  Queries with graph context: {graph_retrieval_stats['queries_with_graph_context']}")
print(f"  Average subgraph size: {np.mean(graph_retrieval_stats['avg_subgraph_size']):.2f}")
print(f"  Total entities extracted: {graph_retrieval_stats['entities_extracted']}")
print(f"  Average entities per query: {graph_retrieval_stats['entities_extracted'] / graph_retrieval_stats['total_queries']:.2f}")

# Note: Full evaluation with ground truth would require labeled test set
print("\n✅ Performance monitoring complete")


Evaluating on test queries...

Graph Retrieval Statistics:
  Total queries: 5
  Queries with graph context: 5
  Average subgraph size: 120.80
  Total entities extracted: 8
  Average entities per query: 1.60

✅ Performance monitoring complete
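Because the references are placeholders, the answer metrics cannot be computed yet, but the `expected_entities` field of each test query already allows a retrieval-side check. The sketch below computes the fraction of expected entities that the extractor found. The function name `entity_recall` is hypothetical, and matching is assumed to be case-insensitive exact matching.

```python
from typing import Iterable


def entity_recall(expected: Iterable[str], found: Iterable[str]) -> float:
    """Fraction of expected entities present in the extracted entities."""
    expected_norm = {e.strip().lower() for e in expected}
    found_norm = {f.strip().lower() for f in found}
    if not expected_norm:
        return 1.0  # nothing was expected, so nothing was missed
    return len(expected_norm & found_norm) / len(expected_norm)


# Test 1 from the inference cell: both expected course codes were extracted
print(entity_recall(["CSCI 6364", "CSCI 1112"], ["CSCI 6364", "CSCI 1112"]))
# A hypothetical miss: a topic entity that the extractor did not return
print(entity_recall(["machine learning"], []))
```

Averaging this value over `test_queries` would give one number that can be tracked alongside the subgraph statistics.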


## 15. Data Augmentation for Synthetic Questions


In [25]:
def augment_question(query: str, answer: str) -> List[Dict]:
    """Generate variations of a question for data augmentation."""
    variations = []

    # Variation 1: Paraphrase
    # Simple paraphrasing (in production, use a paraphrase model)
    if "which courses" in query.lower():
        variations.append({
            "query": query.replace("Which courses", "What courses"),
            "answer": answer,
            "type": "paraphrase"
        })

    # Variation 2: Question type change
    if "who teaches" in query.lower():
        variations.append({
            "query": query.replace("Who teaches", "Which professor teaches"),
            "answer": answer,
            "type": "question_type"
        })

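    # Variation 3: Rephrase the completion clause
    # (the literal must be lowercase because it is compared against query.lower())
    if "if i've completed" in query.lower():
        variations.append({
            "query": re.sub(r"if I've completed", "assuming I have completed", query, flags=re.IGNORECASE),
            "answer": answer,
            "type": "context_variation"
        })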

    return variations

def generate_synthetic_questions_from_graph(kg: KnowledgeGraph, num_synthetic: int = 50) -> List[Dict]:
    """Generate synthetic questions by exploring the graph structure."""
    synthetic = []

    # Generate questions by following graph paths
    courses_list = list(kg.course_nodes.keys())[:30]

    for _ in range(num_synthetic):
        # Random walk on graph
        start_course = np.random.choice(courses_list)
        start_node = kg.course_nodes[start_course]

        # Get neighbors
        neighbors = list(kg.graph.successors(start_node))[:3]
        if neighbors:
            target_node = np.random.choice(neighbors)
            edge_data = kg.graph.get_edge_data(start_node, target_node)

            if edge_data:
                edge_type = edge_data.get('edge_type', '')

                if edge_type == 'prerequisite':
                    target_course = kg.graph.nodes[target_node].get('code', '')
                    query = f"What is a prerequisite for {start_course}?"
                    answer = f"{target_course} is a prerequisite for {start_course}."

                    entities = extract_entities_from_query(query)
                    subgraph = retriever.retrieve_subgraph(query, entities, max_hops=2)
                    graph_context = retriever.format_subgraph_context(subgraph)

                    synthetic.append({
                        'query': query,
                        'answer': answer,
                        'graph_context': graph_context,
                        'reasoning_path': f"{target_course} -> {start_course}",
                        'type': 'synthetic_prerequisite'
                    })

    return synthetic

# Generate synthetic questions
print("Generating synthetic questions...")
synthetic_questions = generate_synthetic_questions_from_graph(kg, num_synthetic=50)
print(f"✅ Generated {len(synthetic_questions)} synthetic questions")

# Augment existing questions
augmented = []
for q in multi_hop_questions[:20]:
    variations = augment_question(q['query'], q['answer'])
    for var in variations:
        var['graph_context'] = q.get('graph_context', '')
        var['reasoning_path'] = q.get('reasoning_path', '')
        augmented.append(var)

print(f"✅ Generated {len(augmented)} augmented question variations")


Generating synthetic questions...
✅ Generated 9 synthetic questions
✅ Generated 11 augmented question variations
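The random walk keeps a draw only when the sampled edge is a prerequisite edge, which is why 50 draws yielded 9 questions here, and it can also sample the same edge twice. A deterministic alternative is to enumerate prerequisite edges once and skip duplicates. The sketch below uses plain tuples as a stand-in for the graph. The `(course, neighbor, edge_type)` layout and the `taught_by` label are assumptions for illustration, and the edge direction mirrors how the cell above phrases its questions.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str, str]  # (course, neighbor, edge_type): assumed stand-in for graph edges


def prerequisite_questions(edges: List[Edge]) -> List[Dict[str, str]]:
    """Build one question per unique prerequisite edge."""
    seen = set()
    questions = []
    for course, neighbor, edge_type in edges:
        if edge_type != "prerequisite" or (course, neighbor) in seen:
            continue  # skip other edge types and repeated edges
        seen.add((course, neighbor))
        questions.append({
            "query": f"What is a prerequisite for {course}?",
            "answer": f"{neighbor} is a prerequisite for {course}.",
            "reasoning_path": f"{neighbor} -> {course}",
            "type": "synthetic_prerequisite",
        })
    return questions


toy_edges = [
    ("CSCI 1112", "CSCI 1111", "prerequisite"),
    ("CSCI 1112", "CSCI 1111", "prerequisite"),  # duplicate, skipped
    ("CSCI 6212", "Youssef, A", "taught_by"),    # not a prerequisite edge, skipped
]
for question in prerequisite_questions(toy_edges):
    print(question["query"], "->", question["answer"])
```

Enumeration makes the number of generated questions depend on the graph rather than on the random seed.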


In [26]:
# Save LoRA adapters
model.save_pretrained("lora_model_kg_qa")
tokenizer.save_pretrained("lora_model_kg_qa")
print("✅ LoRA adapters saved to 'lora_model_kg_qa'")

# Save knowledge graph
import pickle
with open("kg_graph.pkl", "wb") as f:
    pickle.dump(kg, f)
print("✅ Knowledge graph saved to 'kg_graph.pkl'")

# Save retriever (without GAT model for now)
with open("graph_retriever.pkl", "wb") as f:
    pickle.dump(retriever, f)
print("✅ Graph retriever saved to 'graph_retriever.pkl'")

# Optional: Export merged model
model.save_pretrained_merged("merged_model_kg_qa", tokenizer, save_method="merged_16bit")
# print("✅ Merged model saved to 'merged_model_kg_qa'")


✅ LoRA adapters saved to 'lora_model_kg_qa'
✅ Knowledge graph saved to 'kg_graph.pkl'
✅ Graph retriever saved to 'graph_retriever.pkl'


Found HuggingFace hub cache directory: /root/.cache/huggingface/hub


Checking cache directory for required files...
Cache check failed: model-00001-of-00004.safetensors not found in local cache.
Not all required files found in cache. Will proceed with downloading.
Checking cache directory for required files...
Cache check failed: tokenizer.model not found in local cache.
Not all required files found in cache. Will proceed with downloading.


Note: tokenizer.model not found (this is OK for non-SentencePiece models)


Unsloth: Merge process complete. Saved to `/content/merged_model_kg_qa`
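The pickled graph and retriever can be restored with `pickle.load`. The sketch below shows the round trip on a small stand-in dictionary, since the real `KnowledgeGraph` object is not available outside the notebook; the `toy_kg` contents are illustrative. Two caveats apply to the real files: unpickling needs the original class definitions to be importable in the loading session, and pickle files should only be loaded from sources you trust.

```python
import os
import pickle
import tempfile

# Stand-in for the knowledge graph object (the real one is a KnowledgeGraph instance)
toy_kg = {"course_nodes": {"CSCI 6212": "course:CSCI 6212"}}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "kg_graph.pkl")

    # Save, mirroring the cell above
    with open(path, "wb") as f:
        pickle.dump(toy_kg, f)

    # Load it back
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == toy_kg)
```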


In [40]:
from huggingface_hub import HfApi
from google.colab import userdata

HF_TOKEN = userdata.get('HF_TOKEN') # Retrieve the token and assign it to HF_TOKEN

repo_id = "itsmepraks/gwcourses_RAG"

# Create repository if it doesn't exist
api = HfApi(token=HF_TOKEN)
print(f"Ensuring repository {repo_id} exists...")
api.create_repo(repo_id=repo_id, repo_type="model", exist_ok=True)

# Push LoRA adapters to Hugging Face
model.push_to_hub(repo_id)
print(f"✅ LoRA adapters pushed to {repo_id}")

# Push tokenizer to Hugging Face
tokenizer.push_to_hub(repo_id)
print(f"✅ Tokenizer pushed to {repo_id}")

# Upload the knowledge graph (kg_graph.pkl)
api.upload_file(
    path_or_fileobj="kg_graph.pkl",
    path_in_repo="kg_graph.pkl",
    repo_id=repo_id,
    repo_type="model",
)
print(f"✅ Knowledge graph (kg_graph.pkl) uploaded to {repo_id}")

# Upload the graph retriever (graph_retriever.pkl)
api.upload_file(
    path_or_fileobj="graph_retriever.pkl",
    path_in_repo="graph_retriever.pkl",
    repo_id=repo_id,
    repo_type="model",
)
print(f"✅ Graph retriever (graph_retriever.pkl) uploaded to {repo_id}")

Ensuring repository itsmepraks/gwcourses_RAG exists...


Saved model to https://huggingface.co/itsmepraks/gwcourses_RAG
✅ LoRA adapters pushed to itsmepraks/gwcourses_RAG


✅ Tokenizer pushed to itsmepraks/gwcourses_RAG


✅ Knowledge graph (kg_graph.pkl) uploaded to itsmepraks/gwcourses_RAG


✅ Graph retriever (graph_retriever.pkl) uploaded to itsmepraks/gwcourses_RAG


After pushing, the LoRA adapters, tokenizer, and the two pickle files (`kg_graph.pkl`, `graph_retriever.pkl`) are available on the Hugging Face Hub at `https://huggingface.co/itsmepraks/gwcourses_RAG`.