# RAN Domain-Specific Model Fine-tuning

This notebook walks through fine-tuning a language model for queries against a RAN (Radio Access Network) knowledge graph.

## Overview
- **Knowledge Graph**: 273 tables, 5,584 columns, about 5.4M relationships
- **Goal**: Create a domain-specific model that understands RAN terminology
- **Process**: Data generation → Model training → Evaluation → Deployment

## Prerequisites
- Neo4j database with RAN knowledge graph
- Required Python packages (see requirements.txt)
- GPU recommended for faster training

## 1. Setup and Dependencies

In [1]:
# Install required packages if not already installed
!pip install transformers datasets torch accelerate
!pip install neo4j sentence-transformers scikit-learn psutil

Defaulting to user installation because normal site-packages is not writeable
Defaulting to user installation because normal site-packages is not writeable


In [2]:
# Import required libraries
import sys
import os
import json
import torch
import numpy as np
import pandas as pd
from datetime import datetime
import logging
import warnings
warnings.filterwarnings('ignore')

# Set up logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

print(f"Python version: {sys.version}")
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA device: {torch.cuda.get_device_name()}")

Python version: 3.10.18 (main, Jul  1 2025, 05:26:40) [GCC 12.2.0]
PyTorch version: 2.7.1+cu126
CUDA available: False


In [3]:
# Import our custom modules
sys.path.append('..')
from ran_finetuning import RANDomainModelTrainer, RANEntityRecognitionTrainer
from knowledge_graph_module.kg_builder import RANNeo4jIntegrator

print("✅ Modules imported successfully")

Transformers version: 4.54.1
✅ Modules imported successfully


## 2. Database Connection Setup

In [4]:
# Configure your Neo4j connection
NEO4J_URI = "bolt://localhost:7687"  # Update with your Neo4j URI
NEO4J_USERNAME = "neo4j"              # Update with your username
NEO4J_PASSWORD = "ranqarag#1"           # Update with your password

# Initialize Neo4j connection
try:
    neo4j_integrator = RANNeo4jIntegrator(NEO4J_URI, NEO4J_USERNAME, NEO4J_PASSWORD)
    print("✅ Neo4j connection established")
    
    # Test connection with a simple query
    with neo4j_integrator.driver.session() as session:
        result = session.run("MATCH (n) RETURN count(n) as total_nodes")
        total_nodes = result.single()['total_nodes']
        print(f"📊 Total nodes in knowledge graph: {total_nodes:,}")
        
except Exception as e:
    print(f"❌ Failed to connect to Neo4j: {e}")
    print("Please check your connection settings and ensure Neo4j is running")
    neo4j_integrator = None

INFO:sentence_transformers.SentenceTransformer:Use pytorch device_name: cpu
INFO:sentence_transformers.SentenceTransformer:Load pretrained SentenceTransformer: sentence-transformers/all-MiniLM-L6-v2


✅ Neo4j connection established
📊 Total nodes in knowledge graph: 5,857
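
The connection settings in the cell above are hard-coded. A minimal sketch of reading them from environment variables instead, assuming the variable names `NEO4J_URI`, `NEO4J_USERNAME`, and `NEO4J_PASSWORD`; these names and the `load_neo4j_settings` helper are illustrative and not part of the notebook's modules:

```python
import os


def load_neo4j_settings(env=None):
    """Return Neo4j connection settings from environment variables.

    The variable names are an assumption for this sketch. The URI and
    username fall back to the notebook's defaults; the password does not.
    """
    env = os.environ if env is None else env
    password = env.get("NEO4J_PASSWORD")
    if not password:
        raise RuntimeError("Set NEO4J_PASSWORD before connecting")
    return {
        "uri": env.get("NEO4J_URI", "bolt://localhost:7687"),
        "username": env.get("NEO4J_USERNAME", "neo4j"),
        "password": password,
    }


# Demo with an explicit mapping so the sketch runs without real credentials
settings = load_neo4j_settings({"NEO4J_PASSWORD": "example-only"})
print(settings["uri"], settings["username"])
```

The returned values could then be passed to `RANNeo4jIntegrator(...)` in place of the constants.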


## 3. Explore Knowledge Graph Structure

In [5]:
# Get overview of the knowledge graph
if neo4j_integrator:
    with neo4j_integrator.driver.session() as session:
        # Count different node types
        tables_count = session.run("MATCH (t:Table) RETURN count(t) as count").single()['count']
        columns_count = session.run("MATCH (c:Column) RETURN count(c) as count").single()['count']
        
        # Count different relationship types
        rel_types = session.run("""
            MATCH ()-[r]->()
            RETURN type(r) as rel_type, count(r) as count
            ORDER BY count DESC
        """)
        
        print("📊 Knowledge Graph Overview:")
        print(f"   Tables: {tables_count:,}")
        print(f"   Columns: {columns_count:,}")
        print("\n🔗 Relationship Types:")
        
        rel_data = []
        for record in rel_types:
            rel_type = record['rel_type']
            count = record['count']
            print(f"   {rel_type}: {count:,}")
            rel_data.append({'Relationship Type': rel_type, 'Count': count})
        
        # Create a DataFrame for better visualization
        rel_df = pd.DataFrame(rel_data)
        print("\n📈 Relationship Distribution:")
        print(rel_df.to_string(index=False))

📊 Knowledge Graph Overview:
   Tables: 273
   Columns: 5,584

🔗 Relationship Types:
   CONCEPTUAL_GROUP: 3,870,783
   PATTERN_MATCH: 850,776
   NAME_SIMILARITY: 452,809
   VALUE_OVERLAP: 130,360
   REFERENCES: 101,712
   HAS_COLUMN: 5,584

📈 Relationship Distribution:
Relationship Type   Count
 CONCEPTUAL_GROUP 3870783
    PATTERN_MATCH  850776
  NAME_SIMILARITY  452809
    VALUE_OVERLAP  130360
       REFERENCES  101712
       HAS_COLUMN    5584
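
The counts are heavily skewed toward one relationship type. A short sketch, using only the numbers printed above, that totals them and expresses each type as a share of the whole:

```python
# Relationship counts copied from the output above
relationship_counts = {
    "CONCEPTUAL_GROUP": 3_870_783,
    "PATTERN_MATCH": 850_776,
    "NAME_SIMILARITY": 452_809,
    "VALUE_OVERLAP": 130_360,
    "REFERENCES": 101_712,
    "HAS_COLUMN": 5_584,
}

total_relationships = sum(relationship_counts.values())

# Share of each relationship type, in percent
shares = {
    rel_type: count / total_relationships * 100
    for rel_type, count in relationship_counts.items()
}

print(f"Total relationships: {total_relationships:,}")
for rel_type, share in shares.items():
    print(f"   {rel_type}: {share:.1f}%")
```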


In [6]:
# Sample some table and column names to understand the data
if neo4j_integrator:
    with neo4j_integrator.driver.session() as session:
        # Sample table names
        sample_tables = session.run("""
            MATCH (t:Table)
            RETURN t.name as table_name, t.row_count as rows, t.column_count as cols
            ORDER BY t.row_count DESC
            LIMIT 100
        """)
        
        print("🏆 Top 100 Tables by Row Count:")
        table_data = []
        for record in sample_tables:
            table_name = record['table_name']
            rows = record['rows'] or 0
            cols = record['cols'] or 0
            print(f"   {table_name}: {rows:,} rows, {cols} columns")
            table_data.append({
                'Table Name': table_name,
                'Rows': rows,
                'Columns': cols
            })
        
        tables_df = pd.DataFrame(table_data)
        print("\n📊 Tables Summary:")
        print(tables_df.to_string(index=False))

🏆 Top 100 Tables by Row Count:
   FeatureState: 945 rows, 15 columns
   FeatureKey: 708 rows, 16 columns
   SctpAssociation: 409 rows, 10 columns
   ExternalENodeBFunction: 390 rows, 35 columns
   Schema: 374 rows, 14 columns
   NRCellRelation: 182 rows, 20 columns
   SwItem: 151 rows, 13 columns
   TermPointToSGW: 108 rows, 10 columns
   HcRule: 94 rows, 18 columns
   ExternalGNodeBFunction: 83 rows, 16 columns
   ExternalGeranCell: 79 rows, 36 columns
   EUtranFreqRelation: 72 rows, 89 columns
   ExternalGNBCUCPFunction: 65 rows, 14 columns
   CapacityState: 63 rows, 17 columns
   X2ULink: 56 rows, 11 columns
   HwItem: 52 rows, 23 columns
   S1ULink: 52 rows, 11 columns
   PmFlexCounterFilter: 48 rows, 45 columns
   RlfProfile: 46 rows, 14 columns
   EUtranFrequency: 43 rows, 26 columns
   Log: 40 rows, 11 columns
   DrxProfile: 40 rows, 18 columns
   PmUlInterferenceReport: 39 rows, 10 columns
   GeranFrequency: 38 rows, 14 columns
   SciProfile: 38 rows, 26 columns
   RfBranch:

In [7]:
# Sample some conceptual groups to understand semantic categories
if neo4j_integrator:
    with neo4j_integrator.driver.session() as session:
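        # Note: the undirected pattern below matches each relationship once per
        # direction, so every count is twice the number of stored relationships.
        # Use -[r:CONCEPTUAL_GROUP]-> to count each relationship once.
        conceptual_groups = session.run("""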
            MATCH (c1:Column)-[r:CONCEPTUAL_GROUP]-(c2:Column)
            WHERE r.semantic_category IS NOT NULL
            WITH r.semantic_category as category, count(r) as relationship_count
            RETURN category, relationship_count
            ORDER BY relationship_count DESC
            LIMIT 100
        """)
        
        print("🎯 Top Semantic Categories:")
        category_data = []
        for record in conceptual_groups:
            category = record['category']
            count = record['relationship_count']
            print(f"   {category}: {count:,} relationships")
            category_data.append({
                'Semantic Category': category,
                'Relationships': count
            })
        
        if category_data:
            categories_df = pd.DataFrame(category_data)
            print("\n📈 Semantic Categories Distribution:")
            print(categories_df.to_string(index=False))
        else:
            print("   No semantic categories found with the current query")

🎯 Top Semantic Categories:
   network_topology: 2,791,422 relationships
   quality: 1,612,858 relationships
   frequency: 884,566 relationships
   general: 472,906 relationships
   configuration_parameters: 437,738 relationships
   topology: 434,332 relationships
   traffic_analysis: 349,212 relationships
   frequency_spectrum: 208,452 relationships
   timing_synchronization: 198,936 relationships
   traffic: 147,424 relationships
   quality_metrics: 104,442 relationships
   configuration: 75,344 relationships
   performance_metrics: 13,904 relationships
   power_management: 9,824 relationships
   mobility_management: 130 relationships
   security_features: 76 relationships

📈 Semantic Categories Distribution:
       Semantic Category  Relationships
        network_topology        2791422
                 quality        1612858
               frequency         884566
                 general         472906
configuration_parameters         437738
                topology         4
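
The category counts above come from an undirected pattern, `(c1)-[r:CONCEPTUAL_GROUP]-(c2)`, which in Cypher matches each stored relationship once per direction. A quick arithmetic check using the counts printed in this notebook supports this reading: the sixteen category totals add up to exactly twice the `CONCEPTUAL_GROUP` count reported in the overview.

```python
# Per-category counts copied from the "Top Semantic Categories" output
category_counts = {
    "network_topology": 2_791_422,
    "quality": 1_612_858,
    "frequency": 884_566,
    "general": 472_906,
    "configuration_parameters": 437_738,
    "topology": 434_332,
    "traffic_analysis": 349_212,
    "frequency_spectrum": 208_452,
    "timing_synchronization": 198_936,
    "traffic": 147_424,
    "quality_metrics": 104_442,
    "configuration": 75_344,
    "performance_metrics": 13_904,
    "power_management": 9_824,
    "mobility_management": 130,
    "security_features": 76,
}
conceptual_group_total = 3_870_783  # CONCEPTUAL_GROUP count from the overview cell

undirected_sum = sum(category_counts.values())
print(f"Sum over categories: {undirected_sum:,}")
print(f"Ratio to stored relationships: {undirected_sum / conceptual_group_total:.1f}")

# Halving each count gives the number of stored relationships per category
stored_per_category = {name: count // 2 for name, count in category_counts.items()}
```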

## 4. Initialize Fine-tuning Components

In [8]:
# Initialize the RAN domain model trainer
if neo4j_integrator:
    ran_trainer = RANDomainModelTrainer(neo4j_integrator)
    print("✅ RAN Domain Model Trainer initialized")
    
    # Display the RAN intents that will be trained
    print("\n🎯 RAN Domain Intents:")
    for i, (intent, description) in enumerate(ran_trainer.ran_intents.items(), 1):
        print(f"   {i:2d}. {intent}: {description}")
    
    print(f"\n📝 Total intents: {len(ran_trainer.ran_intents)}")
else:
    print("❌ Cannot initialize trainer without Neo4j connection")
    ran_trainer = None

✅ RAN Domain Model Trainer initialized

🎯 RAN Domain Intents:
    1. performance_analysis: Analyze network performance metrics and KPIs
    2. power_optimization: Optimize power consumption and efficiency
    3. spectrum_management: Manage frequency spectrum allocation and bandwidth
    4. cell_configuration: Configure cell parameters and settings
    5. quality_assessment: Assess signal quality and coverage metrics
    6. traffic_analysis: Analyze network traffic patterns and load
    7. fault_detection: Detect and diagnose network faults and issues
    8. capacity_planning: Plan network capacity and resource allocation
    9. interference_analysis: Analyze and mitigate interference sources
   10. handover_optimization: Optimize handover procedures and mobility

📝 Total intents: 10


In [9]:
# Initialize the NER trainer
if neo4j_integrator:
    ner_trainer = RANEntityRecognitionTrainer(neo4j_integrator)
    print("✅ RAN Entity Recognition Trainer initialized")
    
    print("\n🏷️ Entity Types for NER:")
    for entity_type, label in ner_trainer.entity_types.items():
        print(f"   {entity_type}: {label}")
else:
    print("❌ Cannot initialize NER trainer without Neo4j connection")
    ner_trainer = None

✅ RAN Entity Recognition Trainer initialized

🏷️ Entity Types for NER:
   TABLE_NAME: B-TAB
   COLUMN_NAME: B-COL
   CELL_ID: B-CELL
   FREQUENCY: B-FREQ
   POWER_VALUE: B-PWR
   METRIC_NAME: B-MET
   TIME_VALUE: B-TIME


In [10]:
# Final Test: Generate Training Data with More Semantic Categories
print("🎯 Final Test: Enhanced Training Data Generation")
print("=" * 50)

if ran_trainer:
    # Reload the module so recent changes to ran_finetuning take effect
    import importlib
    import ran_finetuning
    importlib.reload(ran_finetuning)
    from ran_finetuning import RANDomainModelTrainer
    ran_trainer = RANDomainModelTrainer(neo4j_integrator)
    
    print("🔄 Generating training data with expanded coverage...")
    enhanced_training_data = ran_trainer.generate_training_data()
    
    print(f"✅ Generated {len(enhanced_training_data)} enhanced training samples")
    
    # Analyze entity coverage
    samples_with_entities = sum(1 for item in enhanced_training_data if item.get('entities'))
    categories_found = set()
    entity_fields_found = set()
    
    for item in enhanced_training_data:
        entities = item.get('entities', {})
        if entities.get('semantic_category'):
            categories_found.add(entities['semantic_category'])
        for field in entities.keys():
            entity_fields_found.add(field)
    
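    # Guard against an empty dataset before computing the percentage
    total_samples = len(enhanced_training_data)
    entity_share = (samples_with_entities / total_samples * 100) if total_samples else 0.0
    print(f"\n📊 Enhanced Analysis:")
    print(f"   Total samples: {total_samples}")
    print(f"   Samples with entities: {samples_with_entities} ({entity_share:.1f}%)")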
    print(f"   Unique semantic categories: {len(categories_found)}")
    print(f"   Unique entity field types: {len(entity_fields_found)}")
    
    print(f"\n🏷️ Semantic Categories Found:")
    for i, category in enumerate(sorted(categories_found), 1):
        print(f"   {i:2d}. {category}")
    
    print(f"\n🔧 Entity Field Types Available:")
    for i, field in enumerate(sorted(entity_fields_found), 1):
        print(f"   {i:2d}. {field}")
    
    # Show a few diverse samples
    print(f"\n🎯 Sample Enhanced Training Data:")
    categories_shown = set()
    for item in enhanced_training_data:
        if len(categories_shown) >= 3:  # Show 3 different categories
            break
        
        category = item.get('entities', {}).get('semantic_category', 'unknown')
        if category not in categories_shown and category != 'unknown':
            categories_shown.add(category)
            print(f"\n📝 Category: {category}")
            print(f"   Text: '{item['text'][:80]}...'")
            print(f"   Intent: {item['intent']}")
            
            entities = item.get('entities', {})
            key_entities = ['table_name', 'column_name', 'semantic_category', 'domain_type', 'entity_confidence']
            print(f"   Key Entities:")
            for key in key_entities:
                if key in entities:
                    print(f"      • {key}: {entities[key]}")
    
    print(f"\n✅ Training data is now properly enhanced with rich entity information!")
    print(f"🚀 Ready for model training with real KG data representation!")

else:
    print("❌ No trainer available")

🎯 Final Test: Enhanced Training Data Generation
🔄 Generating training data with expanded coverage...
🔍 Extracting real KG data for training...
📊 Extracted 10888 real entities from 16 categories
✅ Generated 1180 training samples with entities
✅ Generated 1180 enhanced training samples

📊 Enhanced Analysis:
   Total samples: 1180
   Samples with entities: 1180 (100.0%)
   Unique semantic categories: 16
   Unique entity field types: 19

🏷️ Semantic Categories Found:
    1. configuration
    2. configuration_parameters
    3. frequency
    4. frequency_spectrum
    5. general
    6. mobility_management
    7. network_topology
    8. performance_metrics
    9. power_management
   10. quality
   11. quality_metrics
   12. security_features
   13. timing_synchronization
   14. topology
   15. traffic
   16. traffic_analysis

🔧 Entity Field Types Available:
    1. analysis_type
    2. column_name
    3. column_type
    4. comparison_type
    5. config_type
    6. dom
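
The analysis cell above relies on each sample being a dictionary with `text`, `intent`, and `entities` keys. A small sketch of the same coverage check written as a reusable function; `summarize_samples` is a hypothetical helper, and the two records below are made up for illustration and only mimic that shape:

```python
from collections import Counter


def summarize_samples(samples):
    """Count intents and semantic categories across training samples."""
    intents = Counter(item["intent"] for item in samples)
    categories = Counter(
        item["entities"]["semantic_category"]
        for item in samples
        if item.get("entities", {}).get("semantic_category")
    )
    with_entities = sum(1 for item in samples if item.get("entities"))
    # Guard against an empty dataset instead of dividing by zero
    entity_share = with_entities / len(samples) * 100 if samples else 0.0
    return intents, categories, entity_share


# Made-up records that only mimic the shape used in the cell above
demo_samples = [
    {
        "text": "Show power settings",
        "intent": "power_optimization",
        "entities": {"semantic_category": "power_management"},
    },
    {"text": "List all tables", "intent": "cell_configuration", "entities": {}},
]

intents, categories, entity_share = summarize_samples(demo_samples)
print(intents, categories, f"{entity_share:.1f}%")
```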

In [11]:
# Export training data for manual review or external use
if ran_trainer and enhanced_training_data:
    export_path = "./ran_training_data.json"
    
    try:
        ran_trainer.export_training_data(export_path)
        print(f"✅ Training data exported to {export_path}")
        
        # Show file size
        file_size = os.path.getsize(export_path) / (1024 * 1024)  # MB
        print(f"📁 File size: {file_size:.2f} MB")
        
    except Exception as e:
        print(f"❌ Error exporting training data: {e}")

🔍 Extracting real KG data for training...
📊 Extracted 10888 real entities from 16 categories
✅ Generated 1180 training samples with entities
Training data exported to ./ran_training_data.json
Total samples: 1180

Intent distribution:
  cell_configuration: 490 samples
  fault_detection: 60 samples
  handover_optimization: 30 samples
  performance_analysis: 180 samples
  power_optimization: 60 samples
  quality_assessment: 120 samples
  spectrum_management: 120 samples
  traffic_analysis: 120 samples
✅ Training data exported to ./ran_training_data.json
📁 File size: 0.83 MB
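
The intent distribution printed above is uneven, and only eight of the ten intents listed by the trainer earlier appear in it. A sketch that finds the missing intents and computes inverse-frequency class weights from the printed counts. The `n_samples / (n_classes * count)` formula is one common balancing heuristic, and nothing in this notebook indicates that `train_ran_model` applies it:

```python
# Sample counts copied from the "Intent distribution" output
intent_counts = {
    "cell_configuration": 490,
    "fault_detection": 60,
    "handover_optimization": 30,
    "performance_analysis": 180,
    "power_optimization": 60,
    "quality_assessment": 120,
    "spectrum_management": 120,
    "traffic_analysis": 120,
}

# The ten intents printed when the trainer was initialized
all_intents = {
    "performance_analysis", "power_optimization", "spectrum_management",
    "cell_configuration", "quality_assessment", "traffic_analysis",
    "fault_detection", "capacity_planning", "interference_analysis",
    "handover_optimization",
}

missing_intents = sorted(all_intents - set(intent_counts))
total_samples = sum(intent_counts.values())
n_classes = len(intent_counts)

# Rare intents get weights above 1, frequent ones below 1
class_weights = {
    intent: total_samples / (n_classes * count)
    for intent, count in intent_counts.items()
}

print(f"Intents without samples: {missing_intents}")
for intent, weight in sorted(class_weights.items(), key=lambda kv: kv[1]):
    print(f"   {intent}: {weight:.2f}")
```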

## 7. Model Training Configuration

In [13]:
# Configure training parameters - Memory Optimized
TRAINING_CONFIG = {
    'model_name': 'distilbert-base-uncased',  # Base model for fine-tuning
    'output_dir': './ran_domain_model',       # Where to save the trained model
    'num_epochs': 2,                          # Reduced epochs to prevent crashes
    'batch_size': 4,                          # Small batch size for memory safety
    'learning_rate': 2e-5,                    # Learning rate
    'warmup_steps': 100,                      # Reduced warmup steps
    'weight_decay': 0.01,                     # Weight decay for regularization
    'logging_steps': 50,                      # More frequent logging
    'save_steps': 500,                        # More frequent saves
    'gradient_accumulation_steps': 2,         # Accumulate gradients to simulate larger batch
}

print("⚙️ Memory-Optimized Training Configuration:")
for key, value in TRAINING_CONFIG.items():
    print(f"   {key}: {value}")

# Auto-adjust based on system capabilities
import psutil
import gc

# Check system memory
system_memory_gb = psutil.virtual_memory().total / (1024**3)
print(f"\n🖥️ System Memory: {system_memory_gb:.1f} GB")

# Check GPU memory if available
if torch.cuda.is_available():
    gpu_memory_gb = torch.cuda.get_device_properties(0).total_memory / (1024**3)
    print(f"🎮 GPU Memory: {gpu_memory_gb:.1f} GB")
    
    # Adjust based on GPU memory
    if gpu_memory_gb < 4:
        TRAINING_CONFIG['batch_size'] = 2
        TRAINING_CONFIG['gradient_accumulation_steps'] = 4
        print("   ⚠️ Low GPU memory - using batch_size=2, gradient_accumulation=4")
    elif gpu_memory_gb < 8:
        TRAINING_CONFIG['batch_size'] = 4
        TRAINING_CONFIG['gradient_accumulation_steps'] = 2
        print("   ✅ Medium GPU memory - using batch_size=4, gradient_accumulation=2")
    else:
        TRAINING_CONFIG['batch_size'] = 8
        TRAINING_CONFIG['gradient_accumulation_steps'] = 1
        print("   🚀 High GPU memory - using batch_size=8, gradient_accumulation=1")
else:
    # CPU-only configuration
    TRAINING_CONFIG['batch_size'] = 2
    TRAINING_CONFIG['gradient_accumulation_steps'] = 4
    TRAINING_CONFIG['num_epochs'] = 2  # Keep 2 epochs on CPU (same as the default above)
    print("   💻 CPU-only mode - minimal settings for stability")

# Memory cleanup before training
gc.collect()
if torch.cuda.is_available():
    torch.cuda.empty_cache()
    print("\n🧹 Memory cleared and ready for training")

⚙️ Memory-Optimized Training Configuration:
   model_name: distilbert-base-uncased
   output_dir: ./ran_domain_model
   num_epochs: 2
   batch_size: 4
   learning_rate: 2e-05
   warmup_steps: 100
   weight_decay: 0.01
   logging_steps: 50
   save_steps: 500
   gradient_accumulation_steps: 2

🖥️ System Memory: 7.8 GB
   💻 CPU-only mode - minimal settings for stability
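
With gradient accumulation, the optimizer updates once per `batch_size * gradient_accumulation_steps` samples. A rough sketch of the resulting step count for the CPU-only settings chosen above, assuming all 1,180 generated samples are used for training (any held-out split inside `train_ran_model` would lower these numbers) and that a trailing partial accumulation group still triggers an update. `optimizer_steps` is a hypothetical helper:

```python
import math


def optimizer_steps(num_samples, batch_size, grad_accum_steps, num_epochs):
    """Estimate the number of optimizer updates with gradient accumulation."""
    batches_per_epoch = math.ceil(num_samples / batch_size)
    updates_per_epoch = math.ceil(batches_per_epoch / grad_accum_steps)
    return updates_per_epoch * num_epochs


# CPU-only values from the cell above: batch_size=2, gradient_accumulation_steps=4
effective_batch = 2 * 4
total_steps = optimizer_steps(
    num_samples=1180, batch_size=2, grad_accum_steps=4, num_epochs=2
)

# warmup_steps is 100 in TRAINING_CONFIG
warmup_fraction = 100 / total_steps

print(f"Effective batch size: {effective_batch}")
print(f"Estimated optimizer steps: {total_steps}")
print(f"Warmup share of training: {warmup_fraction:.0%}")
```

Under these assumptions the 100 warmup steps cover roughly a third of the run, which may be worth revisiting for a dataset this small.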


## 8. Train the RAN Domain Model

In [14]:
# Train the intent classification model with memory optimization
if ran_trainer and enhanced_training_data:
    print("🚀 Starting Memory-Optimized RAN Domain Model Training")
    print(f"⏱️ Started at: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
    print(f"📊 Training on {len(enhanced_training_data)} samples")
    print("\n⚠️ Note: This uses memory-optimized settings to prevent kernel crashes")
    print("Training may take 15-45 minutes depending on your hardware.")
    print("=" * 60)
    
    # Pre-training memory cleanup
    import gc
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        print("🧹 GPU memory cleared before training")
    
    try:
        # Train the model with optimized settings
        print(f"🎯 Training Configuration:")
        print(f"   Model: {TRAINING_CONFIG['model_name']}")
        print(f"   Epochs: {TRAINING_CONFIG['num_epochs']}")
        print(f"   Batch Size: {TRAINING_CONFIG['batch_size']}")
        print(f"   Gradient Accumulation: {TRAINING_CONFIG['gradient_accumulation_steps']}")
        print(f"   Output: {TRAINING_CONFIG['output_dir']}")
        print("")
        
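        # Note: only output_dir is passed here; the other TRAINING_CONFIG values
        # printed above are not forwarded by this call.
        model, tokenizer = ran_trainer.train_ran_model(
            output_dir=TRAINING_CONFIG['output_dir']
        )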
        
        print("\n" + "=" * 60)
        print("✅ Training completed successfully!")
        print(f"⏱️ Finished at: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
        print(f"💾 Model saved to: {TRAINING_CONFIG['output_dir']}")
        
        # Check model files
        model_files = os.listdir(TRAINING_CONFIG['output_dir'])
        print(f"\n📁 Model files created: {len(model_files)}")
        for file in sorted(model_files):
            file_path = os.path.join(TRAINING_CONFIG['output_dir'], file)
            if os.path.isfile(file_path):
                size_mb = os.path.getsize(file_path) / (1024 * 1024)
                print(f"   {file}: {size_mb:.2f} MB")
        
        # Calculate total model size
        total_size_mb = sum(os.path.getsize(os.path.join(TRAINING_CONFIG['output_dir'], f)) 
                           for f in model_files if os.path.isfile(os.path.join(TRAINING_CONFIG['output_dir'], f))) / (1024 * 1024)
        print(f"\n📦 Total model size: {total_size_mb:.2f} MB")
        
        training_successful = True
        
        # Post-training cleanup
        gc.collect()
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
            print("🧹 GPU memory cleared after training")
        
    except Exception as e:
        print(f"\n❌ Training failed: {e}")
        print(f"   Error type: {type(e).__name__}")
        print("\n🛠️ Troubleshooting suggestions:")
        print("   1. Restart the kernel and try again")
        print("   2. Reduce batch_size to 1 in TRAINING_CONFIG")
        print("   3. Set num_epochs to 1 for a quick test")
        print("   4. Close other applications to free memory")
        print("   5. Try CPU-only training by setting CUDA_VISIBLE_DEVICES=''")
        
        training_successful = False
        model, tokenizer = None, None
        
        # Emergency cleanup
        gc.collect()
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
            
else:
    print("❌ Cannot train model without training data")
    print("Please run the training data generation cells first.")
    training_successful = False
    model, tokenizer = None, None

🚀 Starting Memory-Optimized RAN Domain Model Training
⏱️ Started at: 2025-08-08 07:10:02
📊 Training on 1180 samples

⚠️ Note: This uses memory-optimized settings to prevent kernel crashes
Training may take 15-45 minutes depending on your hardware.
🎯 Training Configuration:
   Model: distilbert-base-uncased
   Epochs: 2
   Batch Size: 2
   Gradient Accumulation: 4
   Output: ./ran_domain_model

Compatibility check: Transformers 4.54.1 - fully compatible
Generating training data...
🔍 Extracting real KG data for training...
📊 Extracted 10888 real entities from 16 categories
✅ Generated 1180 training samples with entities
Generated 1180 training samples
Loading base model...


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Tokenizing dataset...


Map: 100%|██████████| 1180/1180 [00:00<00:00, 14242.98 examples/s]



Starting training...
🚀 Starting training with memory optimization...


Step  Training Loss
  50         2.0556
 100         0.8491
 150         0.0917
 200         0.0152
 250         0.0108


Saving model to ./ran_domain_model
Training completed successfully!

✅ Training completed successfully!
⏱️ Finished at: 2025-08-08 07:19:55
💾 Model saved to: ./ran_domain_model

📁 Model files created: 10
   config.json: 0.00 MB
   intent_labels.json: 0.00 MB
   model.safetensors: 255.45 MB
   special_tokens_map.json: 0.00 MB
   tokenizer.json: 0.68 MB
   tokenizer_config.json: 0.00 MB
   training_args.bin: 0.01 MB
   training_info.json: 0.00 MB
   vocab.txt: 0.22 MB

📦 Total model size: 256.36 MB
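
The start and finish timestamps in the log give the wall-clock training time. A small sketch that parses the two values printed above:

```python
from datetime import datetime

TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"

# Timestamps copied from the "Started at" and "Finished at" log lines
started = datetime.strptime("2025-08-08 07:10:02", TIMESTAMP_FORMAT)
finished = datetime.strptime("2025-08-08 07:19:55", TIMESTAMP_FORMAT)

elapsed = finished - started
minutes, seconds = divmod(int(elapsed.total_seconds()), 60)
print(f"Training took {minutes} min {seconds} s")
```

That is just under ten minutes on this CPU-only machine, shorter than the 15-45 minute estimate printed at the start of the cell.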


## 14. Summary and Next Steps

In [None]:
# Training Process Summary
print("📋 RAN Domain Model Training Summary")
print("=" * 40)

# Check what was completed
completed_steps = []
if neo4j_integrator:
    completed_steps.append("✅ Neo4j connection established")
else:
    completed_steps.append("❌ Neo4j connection failed")

if 'enhanced_training_data' in locals() and enhanced_training_data:
    completed_steps.append(f"✅ Generated {len(enhanced_training_data)} training samples with entities")
else:
    completed_steps.append("❌ Training data generation failed")

if 'training_successful' in locals() and training_successful:
    completed_steps.append("✅ Model training completed successfully")
    completed_steps.append(f"✅ Model saved to {TRAINING_CONFIG['output_dir']}")
else:
    completed_steps.append("❌ Model training failed")

for step in completed_steps:
    print(step)

print("\n🎯 Next Steps:")
if 'training_successful' in locals() and training_successful:
    print("1. ✅ RAN domain model training completed!")
    print("2. Run model evaluation using ran_model_evaluation.ipynb")
    print("3. 🚀 Deploy the model in your chatbot application")
    print("4. 📊 Monitor performance in production")
    print("5. 🔄 Retrain periodically with new data")
    
    print(f"\n📁 Model saved at: {TRAINING_CONFIG['output_dir']}")
    print(f"📊 Training samples: {len(enhanced_training_data) if 'enhanced_training_data' in locals() else 'N/A'}")
    
    print("\nFor comprehensive evaluation:")
    print("   👉 Open and run: ran_model_evaluation.ipynb")
    print("   This will test the model performance, accuracy, and integration")
    
    print("\n💡 Usage in your application:")
    print('```python')
    print('from chatbot import EnhancedRANChatbot')
    print('chatbot = EnhancedRANChatbot(neo4j_integrator, use_domain_model=True)')
    print('result = chatbot.enhanced_process_query("your query here")')
    print('```')
else:
    print("1. ❓ Review any errors above")
    print("2. 🔧 Check memory configuration if kernel crashed")
    print("3. 🔄 Try reducing batch_size in TRAINING_CONFIG")
    print("4. 💻 Consider CPU-only training for memory-constrained systems")
    print("5. Re-run the training process")

# This cell also runs when training failed, so report the summary time instead
print(f"\n⏰ Summary generated at: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")

# Memory cleanup after training
import gc
gc.collect()
if torch.cuda.is_available():
    torch.cuda.empty_cache()
    print("🧹 GPU memory cleaned up")

## 15. Additional Resources and Tips

In [None]:
# Additional tips and resources
print("💡 Tips for Production Use")
print("=" * 30)
print()
print("🚀 Performance Optimization:")
print("   • Use GPU for faster inference")
print("   • Implement model caching for repeated queries")
print("   • Consider model quantization for smaller size")
print()
print("📊 Model Monitoring:")
print("   • Track prediction confidence scores")
print("   • Monitor query types and patterns")
print("   • Collect user feedback for retraining")
print()
print("🔄 Continuous Improvement:")
print("   • Regularly retrain with new data")
print("   • Expand training data with real user queries")
print("   • Fine-tune hyperparameters based on performance")
print()
print("🛠️ Troubleshooting:")
print("   • Low confidence predictions: Add more training data")
print("   • Memory issues: Reduce batch size or model size")
print("   • Slow training: Use GPU or reduce data size")
print()
print("📚 Documentation:")
print("   • Transformers: https://huggingface.co/docs/transformers")
print("   • Datasets: https://huggingface.co/docs/datasets")
print("   • Neo4j Python: https://neo4j.com/docs/python-manual")
print()
print("🎯 For questions or support:")
print("   • Check the README.md in the chatbot_module")
print("   • Review the example code in chatbot_example.py")
print("   • Consult the enhanced chatbot documentation")
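
The monitoring tip about confidence scores can be made concrete. A sketch of turning classifier logits into a confidence value with a softmax and flagging low-confidence predictions. The logits, the three-label subset, the 0.7 threshold, and the `predict_with_confidence` helper are illustrative, not outputs or APIs of the trained model:

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def predict_with_confidence(logits, labels, threshold=0.7):
    """Return (label, confidence, is_confident) for one prediction."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best], probs[best] >= threshold


# Illustrative values only
labels = ["performance_analysis", "power_optimization", "spectrum_management"]
label, confidence, is_confident = predict_with_confidence([2.0, 0.5, 0.1], labels)
print(label, f"{confidence:.2f}", is_confident)
```

Predictions that fall below the threshold could be logged for review and later added to the training data.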