# Topic Extraction Model Training

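This notebook builds a topic extractor for parsing academic syllabi. It matches common heading patterns (Unit, Chapter, Week, Module, Lecture) with regular expressions and falls back to clustering sentence embeddings when no headings are found.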

**Goal**: Create `topic_extractor.pkl` for use in the AutoYT-Playlist system.

## Steps:
1. Load sample syllabus text
2. Extract topics with regex heading patterns, falling back to embedding-based sentence clustering
3. Test the extractor on the sample syllabus
4. Save the extractor as a `.pkl` file

**Upload this notebook to Kaggle and run it there!**

## 1. Install Dependencies

In [None]:
!pip install -q transformers sentence-transformers spacy scikit-learn pandas numpy
!python -m spacy download en_core_web_sm

## 2. Import Libraries

In [None]:
import re
import os
import pickle
import pandas as pd
import numpy as np
from typing import List, Dict, Tuple

import spacy
from sentence_transformers import SentenceTransformer
from sklearn.cluster import DBSCAN
from sklearn.feature_extraction.text import TfidfVectorizer

print("✅ Libraries imported successfully!")

## 3. Load Sample Syllabus Data

**Note**: Upload your own syllabus files as a Kaggle dataset, or use the sample below.

In [None]:
# Sample syllabus text (replace with your actual data)
sample_syllabus = """
Machine Learning Course Syllabus

Unit 1: Introduction to Machine Learning
- What is Machine Learning?
- Types of Machine Learning: Supervised, Unsupervised, Reinforcement
- Applications of ML

Unit 2: Linear Regression
- Simple Linear Regression
- Multiple Linear Regression
- Gradient Descent
- Cost Function

Unit 3: Classification Algorithms
- Logistic Regression
- Decision Trees
- Random Forests
- Support Vector Machines

Unit 4: Neural Networks
- Perceptron
- Backpropagation
- Deep Learning Basics
- Convolutional Neural Networks
"""

print("Sample syllabus loaded.")
print(f"Length: {len(sample_syllabus)} characters")
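
To swap the hard-coded sample for your own files, the snippet below is one possible sketch. It assumes plain-text `.txt` syllabi encoded as UTF-8; the helper name `load_syllabi` and the `/kaggle/input/my-syllabi` path are placeholders invented for illustration, not part of this notebook.

```python
from pathlib import Path
from typing import Dict


def load_syllabi(folder: str) -> Dict[str, str]:
    """Read every .txt file in `folder` into a {filename: text} dict.

    Hypothetical helper: assumes UTF-8 plain-text syllabi.
    Returns an empty dict if the folder does not exist.
    """
    root = Path(folder)
    if not root.is_dir():
        return {}
    return {
        path.name: path.read_text(encoding="utf-8")
        for path in sorted(root.glob("*.txt"))
    }


# Placeholder path: replace with the dataset folder you attached on Kaggle.
syllabi = load_syllabi("/kaggle/input/my-syllabi")
print(f"Loaded {len(syllabi)} syllabus file(s)")
```

Each value in the returned dict can then be passed to `extract_topics` in place of `sample_syllabus`.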

## 4. Topic Extraction Class

In [None]:
class TopicExtractor:
    """Extract topics from academic syllabus using NLP."""
    
    def __init__(self):
        print("Loading models...")
        self.nlp = spacy.load("en_core_web_sm")
        self.embedder = SentenceTransformer('all-MiniLM-L6-v2')
        print("✅ Models loaded!")
    
    def extract_topics(self, text: str) -> List[Dict]:
        """Extract topics from syllabus text."""
        topics = []
        
        # Pattern matching for common syllabus structures
        patterns = [
            r'Unit\s+(\d+):\s*(.+?)(?=\n|$)',
            r'Chapter\s+(\d+):\s*(.+?)(?=\n|$)',
            r'Week\s+(\d+):\s*(.+?)(?=\n|$)',
            r'Module\s+(\d+):\s*(.+?)(?=\n|$)',
            r'Lecture\s+(\d+):\s*(.+?)(?=\n|$)',
        ]
        
        for pattern in patterns:
            matches = re.finditer(pattern, text, re.IGNORECASE | re.MULTILINE)
            for match in matches:
                topic_num = match.group(1)
                topic_name = match.group(2).strip()
                
                # Extract subtopics (bullet points after the main topic)
                subtopics = self._extract_subtopics(text, match.end())
                
                topics.append({
                    'number': topic_num,
                    'name': topic_name,
                    'subtopics': subtopics,
                    'type': pattern.split('\\')[0]  # keyword before the first backslash, e.g. 'Unit'
                })
        
        # If no structured topics found, use sentence clustering
        if not topics:
            topics = self._cluster_topics(text)
        
        return topics
    
    def _extract_subtopics(self, text: str, start_pos: int, max_lines: int = 10) -> List[str]:
        """Extract bullet points/subtopics after a main topic."""
        subtopics = []
        lines = text[start_pos:].split('\n')[:max_lines]
        
        for line in lines:
            line = line.strip()
            # Stop at next major topic
            if re.match(r'(Unit|Chapter|Week|Module|Lecture)\s+\d+', line, re.IGNORECASE):
                break
            # Extract bullet points
            if line.startswith(('-', '•', '*', '–')) or re.match(r'^\d+\.', line):
                subtopic = re.sub(r'^[-•*–\d.]+\s*', '', line)
                if subtopic:
                    subtopics.append(subtopic)
        
        return subtopics
    
    def _cluster_topics(self, text: str) -> List[Dict]:
        """Cluster sentences into topics using embeddings."""
        doc = self.nlp(text)
        sentences = [sent.text.strip() for sent in doc.sents if len(sent.text.strip()) > 20]
        
        if not sentences:
            return []
        
        # Generate embeddings
        embeddings = self.embedder.encode(sentences)
        
        # Cluster using DBSCAN
        clustering = DBSCAN(eps=0.5, min_samples=2, metric='cosine').fit(embeddings)
        
        # Group sentences by cluster
        topics = []
        for cluster_id in sorted(set(clustering.labels_)):  # sorted for a stable topic order
            if cluster_id == -1:  # Skip noise
                continue
            
            cluster_sentences = [sentences[i] for i, label in enumerate(clustering.labels_) if label == cluster_id]
            
            topics.append({
                'number': str(cluster_id + 1),
                'name': cluster_sentences[0][:100],  # Use first sentence as topic name
                'subtopics': cluster_sentences[1:],
                'type': 'clustered'
            })
        
        return topics

print("✅ TopicExtractor class defined!")
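
Because `extract_topics` loops over the patterns one at a time, a syllabus that mixes heading styles (say `Week` and `Lecture`) comes back grouped by pattern rather than in document order. One possible alternative, shown here as a standalone sketch and not as the class's actual behaviour, is a single combined pattern that also captures the heading keyword directly. `HEADING_RE` and `find_headings` are names invented for this illustration.

```python
import re
from typing import Dict, List

# One pattern for all heading keywords; group 1 is the keyword itself.
HEADING_RE = re.compile(
    r'^\s*(Unit|Chapter|Week|Module|Lecture)\s+(\d+):\s*(.+?)\s*$',
    re.IGNORECASE | re.MULTILINE,
)


def find_headings(text: str) -> List[Dict[str, str]]:
    """Return headings in the order they appear in the text."""
    return [
        {
            'type': m.group(1).capitalize(),
            'number': m.group(2),
            'name': m.group(3),
        }
        for m in HEADING_RE.finditer(text)
    ]


# Made-up input that mixes three heading styles.
mixed = """
Lecture 1: Course Overview
Week 2: Python Basics
unit 3: Data Structures
"""

for heading in find_headings(mixed):
    print(heading)
```

The trade-off is that a single pattern is harder to extend with keyword-specific rules, so whether it fits depends on how varied the real syllabi are.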

## 5. Train and Test the Model

In [None]:
# Initialize extractor
extractor = TopicExtractor()

# Extract topics from sample syllabus
topics = extractor.extract_topics(sample_syllabus)

# Display results
print(f"\n📚 Extracted {len(topics)} topics:\n")
for topic in topics:
    print(f"\n{topic['type'].upper()} {topic['number']}: {topic['name']}")
    if topic['subtopics']:
        for subtopic in topic['subtopics'][:5]:  # Show first 5
            print(f"  - {subtopic}")
        if len(topic['subtopics']) > 5:
            print(f"  ... and {len(topic['subtopics']) - 5} more")
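
The sample syllabus matches the `Unit` pattern, so the DBSCAN fallback in `_cluster_topics` never runs in this cell. For intuition about `eps=0.5` with `metric='cosine'`: cosine distance is 1 minus cosine similarity, and DBSCAN counts two points as neighbours when their distance is at most `eps`. The sketch below computes that distance by hand on made-up 3-dimensional vectors (real `all-MiniLM-L6-v2` embeddings have 384 dimensions); `cosine_distance` is an illustrative helper, not a library call.

```python
import math
from typing import Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity; 0 means same direction, 1 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


EPS = 0.5  # same threshold the notebook passes to DBSCAN

# Toy "embeddings", made up for illustration only.
regression_a = [1.0, 0.2, 0.0]
regression_b = [0.9, 0.3, 0.1]
neural_nets = [0.0, 0.1, 1.0]

for name, vec in [("regression_b", regression_b), ("neural_nets", neural_nets)]:
    dist = cosine_distance(regression_a, vec)
    verdict = "neighbour" if dist <= EPS else "not a neighbour"
    print(f"regression_a vs {name}: distance={dist:.3f} -> {verdict}")
```

With `min_samples=2`, a sentence needs at least one other sentence within `eps` to start a cluster; sentences with no such neighbour are labelled `-1` and skipped as noise.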

## 6. Save the Model as .pkl File

In [None]:
# Save the trained extractor
output_path = 'topic_extractor.pkl'

with open(output_path, 'wb') as f:
    pickle.dump(extractor, f)

print(f"\n✅ Model saved to: {output_path}")
print(f"File size: {os.path.getsize(output_path) / (1024*1024):.2f} MB")

# Test loading
with open(output_path, 'rb') as f:
    loaded_extractor = pickle.load(f)
    
print("\n✅ Model loaded successfully!")
print("\n📥 Download this file and place it in: ml_models/nlp/topic_extractor.pkl")
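
One caveat worth checking before relying on the `.pkl` elsewhere: `pickle` serializes an instance's attributes plus a reference to its class (module name and class name), not the class's source code. Since `TopicExtractor` is defined in this notebook, the program that loads the file will likely need a `TopicExtractor` class reachable under the same module name. The sketch below shows the stored reference using a lightweight stand-in class (`TinyExtractor` is hypothetical and holds no models).

```python
import pickle


class TinyExtractor:
    """Hypothetical stand-in for TopicExtractor, without the heavy models."""

    def __init__(self):
        self.keywords = ["Unit", "Chapter", "Week", "Module", "Lecture"]


payload = pickle.dumps(TinyExtractor())

# The class is recorded by name; its methods are not stored in the payload.
module_name = TinyExtractor.__module__
class_name = TinyExtractor.__qualname__
print(f"Class reference stored in the pickle: {module_name}.{class_name}")
print("Class name found in payload:", class_name.encode() in payload)

# Loading succeeds here only because TinyExtractor is defined in this module.
restored = pickle.loads(payload)
print(restored.keywords)
```

If the loading code lives in a package, one option is to define the class in a shared module there and import it in the notebook before pickling, so both sides resolve the same module path.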

## 7. Validation Test

In [None]:
# Test with loaded model
test_text = """
Week 1: Python Basics
- Variables and Data Types
- Control Flow
- Functions

Week 2: Object-Oriented Programming
- Classes and Objects
- Inheritance
- Polymorphism
"""

test_topics = loaded_extractor.extract_topics(test_text)
print(f"\n🧪 Test extraction: Found {len(test_topics)} topics")
for topic in test_topics:
    print(f"  - {topic['name']}")

## Next Steps

1. ✅ Download `topic_extractor.pkl` from Kaggle
2. 📁 Place it in: `c:\Users\Acer\Documents\GitHub\AutoYT-Playlist\ml_models\nlp\topic_extractor.pkl`
3. 🚀 The backend will automatically use this model!

---

**Optional Improvements:**
- Train on more diverse syllabi
- Fine-tune the clustering parameters
- Add domain-specific keyword extraction