# Movie NER Model Demo

This notebook demonstrates how to train and use a Named Entity Recognition (NER) model for extracting movie-related entities from user queries.

## Entity Types
- **DIRECTOR**: Movie directors (e.g., "Christopher Nolan")
- **CAST**: Actors and actresses (e.g., "Leonardo DiCaprio")
- **GENRE**: Movie genres (e.g., "action", "comedy")

## 1. Setup and Imports

In [2]:
import os
import pandas as pd
from ner_model import MovieNERModel, MovieNERDataGenerator, train_movie_ner_model

# Set up logging to see training progress
import logging
logging.basicConfig(level=logging.INFO)

print("✓ Imports successful")

✓ Imports successful


## 2. Generate Training Data

First, let's see how training data is generated for the NER model.

In [6]:
# Create data generator
generator = MovieNERDataGenerator()

# Generate sample training data
samples = generator.generate_training_data(num_samples=10)

print(f"Generated {len(samples)} training samples:")
print()

for i, (text, annotations) in enumerate(samples[:5], 1):
    print(f"Sample {i}:")
    print(f"  Text: {text}")
    print(f"  Entities: {annotations['entities']}")
    
    # Show entity text
    for start, end, label in annotations['entities']:
        entity_text = text[start:end]
        print(f"    {label}: '{entity_text}'")
    print()

Generated 10 training samples:

Sample 1:
  Text: Show me films directed by Chloe Zhao
  Entities: [(26, 36, 'DIRECTOR')]
    DIRECTOR: 'Chloe Zhao'

Sample 2:
  Text: Find movies with Margot Robbie and Lupita Nyong'o
  Entities: [(17, 30, 'CAST'), (35, 49, 'CAST')]
    CAST: 'Margot Robbie'
    CAST: 'Lupita Nyong'o'

Sample 3:
  Text: Show me horror films
  Entities: [(8, 14, 'GENRE')]
    GENRE: 'horror'

Sample 4:
  Text: Find action and western movies
  Entities: [(5, 11, 'GENRE'), (16, 23, 'GENRE')]
    GENRE: 'action'
    GENRE: 'western'

Sample 5:
  Text: Find movies by Quentin Tarantino
  Entities: [(15, 32, 'DIRECTOR')]
    DIRECTOR: 'Quentin Tarantino'
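
The `(start, end, label)` tuples above are character offsets into the text, with `end` exclusive. The internals of `MovieNERDataGenerator` are not shown in this notebook, so the following is only a minimal sketch of how a template-based generator could compute such offsets. The `fill_template` helper and the `{LABEL}` slot syntax are hypothetical.

```python
import re


def fill_template(template, values):
    """Fill {LABEL} slots in a template and record character offsets.

    Hypothetical helper: `values` maps a label to a list of strings that are
    consumed left to right, one per matching slot.
    """
    text = ""
    entities = []
    cursor = 0
    remaining = {label: list(items) for label, items in values.items()}
    for match in re.finditer(r"\{(\w+)\}", template):
        label = match.group(1)
        # Copy the literal text that precedes this slot.
        text += template[cursor:match.start()]
        value = remaining[label].pop(0)
        start = len(text)
        text += value
        entities.append((start, len(text), label))
        cursor = match.end()
    # Copy whatever literal text follows the last slot.
    text += template[cursor:]
    return text, {"entities": entities}


text, annotations = fill_template(
    "Find {GENRE} and {GENRE} movies", {"GENRE": ["action", "western"]}
)
for start, end, label in annotations["entities"]:
    print(label, repr(text[start:end]))
```

Computing offsets while the string is being built, rather than searching for the value afterwards, keeps the spans correct even when the same value appears twice in one sentence.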



## 3. Before Training - Test Untrained Model

Let's see what happens when we try to extract entities with an untrained model.

In [11]:
# Create untrained model
untrained_model = MovieNERModel()

# Test query
test_query = "I want action movies directed by Christopher Nolan"

# Extract entities (should be empty)
untrained_entities = untrained_model.extract_entities(test_query)

print("UNTRAINED MODEL RESULTS:")
print(f"Query: {test_query}")
print(f"Entities: {untrained_entities}")
print()
print("As expected, the untrained model finds no entities.")

INFO:ner_model:Loaded base model: en_core_web_sm


UNTRAINED MODEL RESULTS:
Query: I want action movies directed by Christopher Nolan
Entities: {'DIRECTOR': [], 'CAST': [], 'GENRE': []}

As expected, the untrained model finds no entities.
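
The result is a dictionary with one list per entity type, so callers can rely on every key being present even when nothing is found. How `extract_entities` builds it is not shown here. One plausible shape, assuming the model yields `(text, label)` pairs, is sketched below; `group_by_label` and `ENTITY_LABELS` are illustrative names.

```python
ENTITY_LABELS = ("DIRECTOR", "CAST", "GENRE")


def group_by_label(spans, labels=ENTITY_LABELS):
    """Group (text, label) pairs into a dict with a list for every known label."""
    grouped = {label: [] for label in labels}
    for text, label in spans:
        # Ignore labels outside the movie schema (assumption: the base model
        # may emit its own entity types, which are not wanted here).
        if label in grouped:
            grouped[label].append(text)
    return grouped


print(group_by_label([]))
print(group_by_label([("action", "GENRE"), ("Christopher Nolan", "DIRECTOR")]))
```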


## 4. Train the Model

Now let's train a model and see the difference.

In [3]:
# Initialize model
model = MovieNERModel()

# Prepare training data
print("Preparing training data...")
model.prepare_training_data(num_samples=300)  # Smaller number for notebook demo

print(f"Training samples: {len(model.training_data)}")
print(f"Validation samples: {len(model.validation_data)}")

INFO:ner_model:Loaded base model: en_core_web_sm
INFO:ner_model:Generating 300 training samples...
INFO:ner_model:Prepared 240 training and 60 validation samples


Preparing training data...
Training samples: 240
Validation samples: 60
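
The log shows 300 samples divided into 240 for training and 60 for validation, which is an 80/20 split. The splitting code lives inside `prepare_training_data` and is not shown. A minimal sketch of such a split, with a hypothetical `split_samples` helper and a fixed seed for reproducibility, might look like this:

```python
import random


def split_samples(samples, train_fraction=0.8, seed=42):
    """Shuffle a copy of the samples and split it into train and validation lists."""
    shuffled = list(samples)
    # A dedicated Random instance avoids touching the global random state.
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


train, val = split_samples(range(300))
print(len(train), len(val))
```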


In [4]:
# Train the model
print("Training the model...")
print("This may take a few minutes...")

metrics = model.train(n_iter=50)  # Fewer iterations for notebook demo

print("\n✓ Training completed!")
print(f"Final F1 score: {metrics['final_score'].get('ents_f', 0):.4f}")

INFO:ner_model:Training model for 50 iterations...
[2025-05-23 15:30:38,244] [INFO] Added vocab lookups: lexeme_norm
INFO:spacy:Added vocab lookups: lexeme_norm
[2025-05-23 15:30:38,245] [INFO] Created vocabulary
INFO:spacy:Created vocabulary
[2025-05-23 15:30:38,249] [INFO] Finished initializing nlp object
INFO:spacy:Finished initializing nlp object


Training the model...
This may take a few minutes...


INFO:ner_model:Iteration 0: Loss=748.6536, Val F1={'ents_f': 0.0, 'ents_p': 0.0, 'ents_r': 0.0}
INFO:ner_model:Iteration 5: Loss=49.2942, Val F1={'ents_f': 0.0, 'ents_p': 0.0, 'ents_r': 0.0}
INFO:ner_model:Iteration 10: Loss=7.1466, Val F1={'ents_f': 0.0, 'ents_p': 0.0, 'ents_r': 0.0}
INFO:ner_model:Iteration 15: Loss=2.4199, Val F1={'ents_f': 0.0, 'ents_p': 0.0, 'ents_r': 0.0}
INFO:ner_model:Iteration 20: Loss=5.9272, Val F1={'ents_f': 0.0, 'ents_p': 0.0, 'ents_r': 0.0}
INFO:ner_model:Iteration 25: Loss=0.0004, Val F1={'ents_f': 0.0, 'ents_p': 0.0, 'ents_r': 0.0}
INFO:ner_model:Iteration 30: Loss=3.9748, Val F1={'ents_f': 0.0, 'ents_p': 0.0, 'ents_r': 0.0}
INFO:ner_model:Iteration 35: Loss=1.9842, Val F1={'ents_f': 0.0, 'ents_p': 0.0, 'ents_r': 0.0}
INFO:ner_model:Iteration 40: Loss=1.5255, Val F1={'ents_f': 0.0, 'ents_p': 0.0, 'ents_r': 0.0}
INFO:ner_model:Iteration 45: Loss=0.0002, Val F1={'ents_f': 0.0, 'ents_p': 0.0, 'ents_r': 0.0}
INFO:ner_model:Training completed. Final F1 score


✓ Training completed!
Final F1 score: 0.0000
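
The `ents_p`, `ents_r`, and `ents_f` keys in the log are entity-level precision, recall, and F1. In spaCy these are computed on exact span matches, so a predicted span counts only if its start, end, and label all agree with the gold annotation. The sketch below is a simplified, standalone version of that idea, not the code this notebook runs; `span_prf` is a hypothetical helper.

```python
def span_prf(gold, predicted):
    """Exact-match precision, recall and F1 over (start, end, label) spans."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return {"ents_p": precision, "ents_r": recall, "ents_f": 0.0}
    f1 = 2 * precision * recall / (precision + recall)
    return {"ents_p": precision, "ents_r": recall, "ents_f": f1}


gold = [(5, 11, "GENRE"), (16, 23, "GENRE")]
# The second prediction is off by one character, so it does not count as a match.
predicted = [(5, 11, "GENRE"), (16, 22, "GENRE")]
print(span_prf(gold, predicted))
```

A score of exactly 0.0 on every iteration, alongside the correct spot-check predictions in the following sections, suggests it is worth checking how the validation examples are built and scored before trusting the metric.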


## 5. Test the Trained Model

Now let's test the trained model with the same query.

In [12]:
# Test the same query with trained model
test_query = "I want action movies directed by Christopher Nolan"
trained_entities = model.extract_entities(test_query)

print("TRAINED MODEL RESULTS:")
print(f"Query: {test_query}")
print(f"Entities: {trained_entities}")
print()

# Compare results
print("COMPARISON:")
print(f"Untrained: {untrained_entities}")
print(f"Trained:   {trained_entities}")
print()

# Check improvement
untrained_total = sum(len(entities) for entities in untrained_entities.values())
trained_total = sum(len(entities) for entities in trained_entities.values())

if trained_total > untrained_total:
    print("✓ Training improved entity extraction!")
else:
    print("⚠ Training may need more data or iterations")

TRAINED MODEL RESULTS:
Query: I want action movies directed by Christopher Nolan
Entities: {'DIRECTOR': ['Christopher Nolan'], 'CAST': [], 'GENRE': ['action']}

COMPARISON:
Untrained: {'DIRECTOR': [], 'CAST': [], 'GENRE': []}
Trained:   {'DIRECTOR': ['Christopher Nolan'], 'CAST': [], 'GENRE': ['action']}

✓ Training improved entity extraction!


## 6. Test Multiple Queries

Let's test the trained model with various types of queries.

In [13]:
# Test queries
test_queries = [
    "I want action movies directed by Christopher Nolan",
    "Show me comedy films with Will Smith",
    "Find horror movies starring Lupita Nyong'o",
    "I love animated movies",
    "Show me thriller films",
    "Find movies with Tom Hanks",
    "I want films directed by Quentin Tarantino",
    "Show me sci-fi movies with Leonardo DiCaprio"
]

print("ENTITY EXTRACTION RESULTS:")
print("=" * 60)

for i, query in enumerate(test_queries, 1):
    entities = model.extract_entities(query)
    
    print(f"\n{i}. Query: {query}")
    
    found_any = False
    for entity_type, entity_list in entities.items():
        if entity_list:
            print(f"   {entity_type}: {entity_list}")
            found_any = True
    
    if not found_any:
        print("   No entities found")

ENTITY EXTRACTION RESULTS:

1. Query: I want action movies directed by Christopher Nolan
   DIRECTOR: ['Christopher Nolan']
   GENRE: ['action']

2. Query: Show me comedy films with Will Smith
   CAST: ['Will Smith']
   GENRE: ['comedy']

3. Query: Find horror movies starring Lupita Nyong'o
   CAST: ["Lupita Nyong'o"]
   GENRE: ['horror']

4. Query: I love animated movies
   No entities found

5. Query: Show me thriller films
   GENRE: ['thriller']

6. Query: Find movies with Tom Hanks
   CAST: ['Tom Hanks']

7. Query: I want films directed by Quentin Tarantino
   DIRECTOR: ['Quentin Tarantino']

8. Query: Show me sci-fi movies with Leonardo DiCaprio
   CAST: ['Leonardo DiCaprio']
   GENRE: ['sci-fi']
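
Query 4 returns nothing even though "animated" is arguably a genre, so it can help to list the misses explicitly when scanning a larger batch. A small sketch, assuming results are collected as a `{query: entities}` dictionary in the same shape `extract_entities` returns (`queries_without_entities` is an illustrative name):

```python
def queries_without_entities(results):
    """Return the queries whose entity lists are all empty."""
    return [
        query
        for query, entities in results.items()
        if not any(entities.values())
    ]


# Two results in the shape shown above, taken from queries 4 and 5.
results = {
    "I love animated movies": {"DIRECTOR": [], "CAST": [], "GENRE": []},
    "Show me thriller films": {"DIRECTOR": [], "CAST": [], "GENRE": ["thriller"]},
}
print(queries_without_entities(results))
```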


## 7. Save and Load Model

Let's save the trained model and demonstrate loading it.

In [14]:
# Save the model
model_path = "saved_models/notebook_ner_model"
model.save_model(model_path)

print(f"✓ Model saved to: {model_path}")

INFO:ner_model:Model saved to saved_models/notebook_ner_model


✓ Model saved to: saved_models/notebook_ner_model
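
Whether `save_model` creates missing parent directories is not shown in this notebook. If it does not, creating the directory first avoids a failure on a fresh checkout. A minimal sketch using only the standard library (`ensure_dir` is a hypothetical helper, and the temporary directory is there only to keep the demo self-contained):

```python
import tempfile
from pathlib import Path


def ensure_dir(path):
    """Create the directory (and any missing parents) if it does not exist yet."""
    directory = Path(path)
    directory.mkdir(parents=True, exist_ok=True)
    return directory


with tempfile.TemporaryDirectory() as tmp:
    model_dir = ensure_dir(Path(tmp) / "saved_models" / "notebook_ner_model")
    print(model_dir.is_dir())
```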


In [15]:
# Load the model in a new instance
loaded_model = MovieNERModel()
loaded_model.load_model(model_path)

# Test the loaded model
test_query = "Find drama movies with Meryl Streep"
entities = loaded_model.extract_entities(test_query)

print("LOADED MODEL TEST:")
print(f"Query: {test_query}")
print(f"Entities: {entities}")
print("\n✓ Model loaded and working correctly!")

INFO:ner_model:Loaded base model: en_core_web_sm
INFO:ner_model:Model loaded from saved_models/notebook_ner_model


LOADED MODEL TEST:
Query: Find drama movies with Meryl Streep
Entities: {'DIRECTOR': [], 'CAST': ['Meryl Streep'], 'GENRE': ['drama']}

✓ Model loaded and working correctly!


## 8. Complete Training Function

For convenience, `train_movie_ner_model` wraps data preparation, training, and saving in a single call and returns the path of the saved model.

In [16]:
# Check if we have movie data available
movie_data_files = [
    'wiki_movie_plots_deduped.csv',
    'wiki_movie_plots_deduped_cleaned.csv'
]

movie_data_path = None
for file_path in movie_data_files:
    if os.path.exists(file_path):
        movie_data_path = file_path
        print(f"Found movie data: {file_path}")
        break

if not movie_data_path:
    print("No movie data found. Using fallback data.")

# Train a complete model
print("\nTraining complete model with more data...")
complete_model_path = train_movie_ner_model(
    movie_data_path=movie_data_path,
    num_samples=500,
    n_iter=25,
    model_save_path="saved_models/complete_ner_model"
)

print(f"\n✓ Complete model saved to: {complete_model_path}")

Found movie data: wiki_movie_plots_deduped.csv

Training complete model with more data...


INFO:ner_model:Loaded base model: en_core_web_sm
INFO:ner_model:Generating 500 training samples...
INFO:ner_model:Loaded 12592 directors and 30365 cast members
INFO:ner_model:Prepared 400 training and 100 validation samples
INFO:ner_model:Training model for 25 iterations...
[2025-05-23 15:43:02,874] [INFO] Added vocab lookups: lexeme_norm
INFO:spacy:Added vocab lookups: lexeme_norm
[2025-05-23 15:43:02,874] [INFO] Created vocabulary
INFO:spacy:Created vocabulary
[2025-05-23 15:43:02,874] [INFO] Finished initializing nlp object
INFO:spacy:Finished initializing nlp object
INFO:ner_model:Iteration 0: Loss=1079.6726, Val F1={'ents_f': 0.0, 'ents_p': 0.0, 'ents_r': 0.0}
INFO:ner_model:Iteration 5: Loss=178.3941, Val F1={'ents_f': 0.0, 'ents_p': 0.0, 'ents_r': 0.0}
INFO:ner_model:Iteration 10: Loss=38.9710, Val F1={'ents_f': 0.0, 'ents_p': 0.0, 'ents_r': 0.0}
INFO:ner_model:Iteration 15: Loss=13.2484, Val F1={'ents_f': 0.0, 'ents_p': 0.0, 'ents_r': 0.0}
INFO:ner_model:Iteration 20: Loss=8.94


✓ Complete model saved to: saved_models/complete_ner_model_20250523_154434
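
The returned path carries a `_YYYYMMDD_HHMMSS` suffix that was not part of `model_save_path`, so `train_movie_ner_model` evidently appends a timestamp, and callers should use the returned value rather than the argument. How it does so is not shown. One way to produce the same format is sketched here with a hypothetical `timestamped_path` helper:

```python
from datetime import datetime


def timestamped_path(base_path, now=None):
    """Append a _YYYYMMDD_HHMMSS suffix to a path string."""
    now = now or datetime.now()
    return f"{base_path}_{now.strftime('%Y%m%d_%H%M%S')}"


print(timestamped_path("saved_models/complete_ner_model", datetime(2025, 5, 23, 15, 44, 34)))
```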


## 9. Test the Complete Model

Let's test the complete model with more challenging queries that contain several entities at once.

In [17]:
# Load the complete model
complete_model = MovieNERModel()
complete_model.load_model(complete_model_path)

# More challenging test queries
challenging_queries = [
    "I want action movies directed by Christopher Nolan with Leonardo DiCaprio",
    "Show me comedy and drama films",
    "Find sci-fi movies starring Tom Hanks and directed by Steven Spielberg",
    "I love romantic comedies with Emma Stone",
    "Show me horror and thriller movies"
]

print("COMPLETE MODEL - CHALLENGING QUERIES:")
print("=" * 60)

for i, query in enumerate(challenging_queries, 1):
    entities = complete_model.extract_entities(query)
    
    print(f"\n{i}. Query: {query}")
    
    for entity_type, entity_list in entities.items():
        if entity_list:
            print(f"   {entity_type}: {entity_list}")

INFO:ner_model:Loaded base model: en_core_web_sm
INFO:ner_model:Model loaded from saved_models/complete_ner_model_20250523_154434


COMPLETE MODEL - CHALLENGING QUERIES:

1. Query: I want action movies directed by Christopher Nolan with Leonardo DiCaprio
   DIRECTOR: ['Christopher Nolan']
   CAST: ['Leonardo DiCaprio']
   GENRE: ['action']

2. Query: Show me comedy and drama films
   GENRE: ['comedy', 'drama']

3. Query: Find sci-fi movies starring Tom Hanks and directed by Steven Spielberg
   DIRECTOR: ['Steven Spielberg']
   CAST: ['Tom Hanks']
   GENRE: ['sci-fi']

4. Query: I love romantic comedies with Emma Stone
   CAST: ['Emma Stone']
   GENRE: ['romantic']

5. Query: Show me horror and thriller movies
   GENRE: ['horror', 'thriller']


## 10. Summary

This notebook demonstrated:

1. **Training Data Generation**: How synthetic training data is created
2. **Model Training**: Training a spaCy NER model for movie entities
3. **Before/After Comparison**: Showing the improvement from training
4. **Entity Extraction**: Using the trained model to extract entities
5. **Model Persistence**: Saving and loading trained models

### Key Takeaways:
- NER models need training to extract entities effectively
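- More training data and iterations may improve performance, but the validation F1 logged here stayed at 0.0 in both runs, so check the evaluation setup before drawing conclusions from it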
- The model can extract multiple entity types from one query
- Trained models can be saved and reused

### Next Steps:
- Use the trained model in your movie recommendation system
- Integrate with genre prediction and vector search components
- Experiment with different training data sizes and iterations
- Add more entity types if needed
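
For the experiment with data sizes and iterations suggested above, a small grid loop keeps runs comparable. The sketch below takes the training routine as a callable so it stays independent of `ner_model`. Both `run_sweep` and `train_fn` are hypothetical, and `train_fn` is assumed to return a validation F1 as a float.

```python
from itertools import product


def run_sweep(train_fn, sample_sizes, iteration_counts):
    """Call train_fn for every (num_samples, n_iter) pair and sort by score."""
    results = []
    for num_samples, n_iter in product(sample_sizes, iteration_counts):
        score = train_fn(num_samples=num_samples, n_iter=n_iter)
        results.append({"num_samples": num_samples, "n_iter": n_iter, "f1": score})
    # Best configuration first.
    return sorted(results, key=lambda row: row["f1"], reverse=True)


def fake_train(num_samples, n_iter):
    """Stand-in for a real training run; the score is made up for the demo."""
    return round(min(1.0, num_samples / 1000 + n_iter / 100), 3)


for row in run_sweep(fake_train, [300, 500], [25, 50]):
    print(row)
```

With the real model, `train_fn` could wrap `prepare_training_data` and `train` on a fresh `MovieNERModel` and return `metrics['final_score'].get('ents_f', 0)`.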