# Applying the Pre-trained spaCy NER Model for Book Title Extraction

In this notebook, we apply spaCy's pre-trained Dutch pipeline, `nl_core_news_lg`. Its Named Entity Recognition (NER) component can tag `WORK_OF_ART` entities, a label that covers book titles and other artistic works. Unlike the previous notebooks, we do not fine-tune the model on our own data. Instead, we evaluate its performance as-is.

The main steps involved in this notebook are:
1. **Model Application:** Utilize the `nl_core_news_lg` model to extract book titles from historical newspaper texts.
2. **Data Selection:** Apply the model to articles from our dataset, specifically from the Leeuwarder Courant, Trouw, and Het Parool.
3. **Performance Evaluation:** Measure token-level precision, recall, and F1 for book-title identification without any additional fine-tuning.

By the end of this notebook, we aim to understand how well the pre-trained spaCy NER model performs on our dataset in its default state.

Let's get started!


In [1]:
# !python -m spacy download nl_core_news_lg

In [1]:
import pandas as pd
import numpy as np
import string
import re

from datasets import Dataset
from transformers import DataCollatorForTokenClassification, pipeline, AutoTokenizer, AutoModelForTokenClassification, Trainer, TrainingArguments

import torch

import spacy
from spacy import displacy
from spacy.tokens import Doc


import os
from datetime import datetime
import json

from tqdm import tqdm
from sklearn.metrics import f1_score, precision_score, recall_score



In [2]:
# Get the current notebook directory
current_dir = os.path.abspath('')

# Set the main directory (modify as needed to point to your main project directory)
main_dir = os.path.abspath(os.path.join(current_dir, '../'))

# Change the working directory to the main directory
os.chdir(main_dir)

# Verify that the working directory has been set correctly
print(f"Current working directory: {os.getcwd()}")

Current working directory: C:\Users\niels\PycharmProjects\BookReviewsThesis


In [3]:
from scripts.dataset_preparation import remove_punctuation, find_sentence_in_text, create_mask_for_sentence, process_text, \
                                        create_data_set, trouw_parool_create_dataset, save_dataset, load_dataset, split_samples, remove_extra_spaces

In [4]:
pd.set_option('display.max_columns', None)

## Load Datasets

In [5]:
# Load data from Excel and CSV files into DataFrames

# Load the Leeuwarder Courant Excel file into a DataFrame
df_lc = pd.read_excel('data/raw/manullay_check_partially_matched_titles.xlsx', engine='openpyxl')

# Load Trouw and Het Parool annotated book review file into a DataFrame
df_trouw_parool = pd.read_csv('data/raw/trouw_and_parool_annotated_book_titles.csv')

In [7]:
# Apply the remove_extra_spaces function to relevant columns in df_lc
df_lc['content'] = df_lc['content'].apply(remove_extra_spaces)
df_lc['title1'] = df_lc['title1'].apply(remove_extra_spaces)
df_lc['title4'] = df_lc['title4'].apply(remove_extra_spaces)

In [8]:
# Collect the unique 'content' values of rows flagged with 'manually_removed' == 1
content_removed = df_lc[df_lc['manually_removed'] == 1]['content'].unique()

# Drop every row whose 'content' was flagged, keeping only the clean articles
df_lc_clean = df_lc[~df_lc['content'].isin(content_removed)]

In [9]:
nlp = spacy.load("nl_core_news_lg")

In [10]:
remove_punc = False
force_lower_case = False

In [11]:
# Define file paths for saving/loading datasets
lc_train_filename = 'C:/Users/niels/PycharmProjects/BookReviewsThesis/data/processed/lc_train_dataset.pkl'
lc_val_filename = 'C:/Users/niels/PycharmProjects/BookReviewsThesis/data/processed/lc_val_dataset.pkl'
lc_test_filename = 'C:/Users/niels/PycharmProjects/BookReviewsThesis/data/processed/lc_test_dataset.pkl'

trouw_test_filename = 'C:/Users/niels/PycharmProjects/BookReviewsThesis/data/processed/trouw_test_dataset.pkl'
parool_test_filename = 'C:/Users/niels/PycharmProjects/BookReviewsThesis/data/processed/parool_test_dataset.pkl'

In [12]:
# Split the samples into training, validation, and test sets

# Set the random seed for reproducibility
np.random.seed(42)

# Get unique content samples
samples = df_lc_clean['content'].unique()

# Split the samples into training, validation, and test sets
lc_train_samples, lc_val_samples, lc_test_samples = split_samples(samples=samples, train_ratio=0.7, val_ratio=0.15, test_ratio=0.15)

# Check if datasets already exist, otherwise create them
if os.path.exists(lc_train_filename) and os.path.exists(lc_val_filename) and os.path.exists(lc_test_filename):
    print("Loading training, validation, and test datasets....")
    lc_train_dataset = load_dataset(lc_train_filename)
    lc_val_dataset = load_dataset(lc_val_filename)
    lc_test_dataset = load_dataset(lc_test_filename)
else:
    print("Creating training, validation, and test datasets....")
    # Create dataset
    lc_train_dataset = Dataset.from_list(create_data_set(samples=lc_train_samples, df=df_lc_clean, nlp=nlp, remove_punc=remove_punc, force_lower_case=force_lower_case))
    lc_val_dataset = Dataset.from_list(create_data_set(samples=lc_val_samples, df=df_lc_clean, nlp=nlp, remove_punc=remove_punc, force_lower_case=force_lower_case))
    lc_test_dataset = Dataset.from_list(create_data_set(samples=lc_test_samples, df=df_lc_clean, nlp=nlp, remove_punc=remove_punc, force_lower_case=force_lower_case))

    # Save the datasets so we don't have to recreate them every time
    save_dataset(lc_train_dataset, lc_train_filename)
    save_dataset(lc_val_dataset, lc_val_filename)
    save_dataset(lc_test_dataset, lc_test_filename)

print("Done...")

Loading training, validation, and test datasets....
Done...
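
`split_samples` is imported from `scripts.dataset_preparation`, and its source is not shown in this notebook. As a rough illustration only, a ratio-based split over unique articles could look like the sketch below. The name `split_samples_sketch`, the `seed` parameter, and the shuffle-then-slice strategy are all assumptions, not the project's actual implementation.

```python
import random


def split_samples_sketch(samples, train_ratio=0.7, val_ratio=0.15, test_ratio=0.15, seed=42):
    """Hypothetical stand-in for split_samples: shuffle once, then slice by ratio."""
    if abs(train_ratio + val_ratio + test_ratio - 1.0) > 1e-9:
        raise ValueError("train, validation, and test ratios must sum to 1")

    # Copy before shuffling so the caller's sequence is left untouched
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)

    n_train = round(len(shuffled) * train_ratio)
    n_val = round(len(shuffled) * val_ratio)

    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # the remainder goes to the test set
    return train, val, test


articles = [f"article {i}" for i in range(100)]
train, val, test = split_samples_sketch(articles)
print(len(train), len(val), len(test))
```

Splitting on unique `content` values, as the cell above does, keeps every row of an article in a single partition, so no article appears in both training and evaluation data.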


In [33]:
# Create or load the Parool and Trouw test datasets

if os.path.exists(parool_test_filename):
    print("Loading parool test dataset....")
    parool_test_dataset = load_dataset(parool_test_filename)
else:
    print("Creating parool test dataset....")
    parool_test_dataset = Dataset.from_list(trouw_parool_create_dataset(df=df_trouw_parool[df_trouw_parool['newspaper'] == 'Parool'], nlp=nlp, remove_punc=remove_punc, force_lower_case=force_lower_case))    
    
    # Save the dataset so we don't have to recreate it every time
    save_dataset(parool_test_dataset, parool_test_filename)


if os.path.exists(trouw_test_filename):
    print("Loading trouw test dataset....")
    trouw_test_dataset = load_dataset(trouw_test_filename)
else:
    print("Creating trouw test dataset....")
    trouw_test_dataset = Dataset.from_list(trouw_parool_create_dataset(df=df_trouw_parool[df_trouw_parool['newspaper'] == 'Trouw'], nlp=nlp, remove_punc=remove_punc, force_lower_case=force_lower_case))
    
    # Save the dataset so we don't have to recreate it every time
    save_dataset(trouw_test_dataset, trouw_test_filename)


Loading parool test dataset....
Loading trouw test dataset....


## Predict and Test on Leeuwarder Courant validation dataset

In [14]:
# Initialize lists to hold all predictions and truths
all_preds_lc_val = []
all_truths_lc_val = []

# Process each sample
for i, text in enumerate(lc_val_samples):
    doc = nlp(text)
    
    # Generate predictions
    pred = [1 if token.ent_type_ == "WORK_OF_ART" else 0 for token in doc]
    all_preds_lc_val.extend(pred)  
    
    all_truths_lc_val.extend(lc_val_dataset[i]['ner_tags'])
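
The loop above pairs spaCy's tokenization of the raw text with the stored `ner_tags` by position, which is only meaningful if both sequences have the same length for every sample. A minimal guard, assuming a hypothetical helper named `extend_aligned` (not part of the project's scripts), could surface misaligned samples early instead of silently shifting labels:

```python
def extend_aligned(all_preds, all_truths, pred, truth, sample_id=None):
    """Append one sample's labels only if predictions and gold tags align token for token."""
    if len(pred) != len(truth):
        raise ValueError(
            f"sample {sample_id}: {len(pred)} predicted tokens vs {len(truth)} gold tags"
        )
    all_preds.extend(pred)
    all_truths.extend(truth)


# Toy data standing in for one article's predictions and gold tags
preds, truths = [], []
extend_aligned(preds, truths, pred=[0, 1, 1], truth=[0, 1, 0], sample_id=0)
print(preds, truths)
```

In the evaluation loop, the two `extend` calls could be replaced by one call to this helper, passing `i` as `sample_id`.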

In [15]:
# Calculate precision
lc_val_precision = precision_score(y_true=all_truths_lc_val, y_pred=all_preds_lc_val, average='binary')

# Calculate recall
lc_val_recall = recall_score(y_true=all_truths_lc_val, y_pred=all_preds_lc_val, average='binary')

# Calculate the F1 score
lc_val_f1 = f1_score(y_true=all_truths_lc_val, y_pred=all_preds_lc_val, average='binary')

print("Precision on LC validation set:", lc_val_precision)
print("Recall on LC validation set:", lc_val_recall)
print("F1 Score on LC validation set:", lc_val_f1)

Precision on LC validation set: 0.12816292943838717
Recall on LC validation set: 0.0765215255174108
F1 Score on LC validation set: 0.09582772543741588
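
With `average='binary'`, scikit-learn treats label `1` (a token inside a title) as the positive class, so the scores above are token-level rather than entity-level. The sketch below reproduces that computation in plain Python on made-up labels; `binary_prf` is an illustrative name, and returning `0.0` for empty denominators is a simplifying assumption.

```python
def binary_prf(y_true, y_pred):
    """Token-level precision, recall, and F1 with label 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # title token, tagged as title
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # non-title token, tagged as title
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # title token that was missed

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1


# Toy example: three gold title tokens, of which the model finds one, plus one false alarm
y_true = [0, 1, 1, 1, 0, 0]
y_pred = [0, 1, 0, 0, 1, 0]
precision, recall, f1 = binary_prf(y_true, y_pred)
print(precision, recall, f1)
```

Because each token counts separately, a title that is only partially tagged still contributes some true positives, which an entity-level (exact span match) evaluation would not credit.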


## Predict and Test on Leeuwarder Courant test dataset

In [16]:
# Initialize lists to hold all predictions and truths
all_preds_lc_test = []
all_truths_lc_test = []

# Process each sample
for i, text in enumerate(lc_test_samples):
    doc = nlp(text)
    
    # Generate predictions
    pred = [1 if token.ent_type_ == "WORK_OF_ART" else 0 for token in doc]
    all_preds_lc_test.extend(pred)  
    
    all_truths_lc_test.extend(lc_test_dataset[i]['ner_tags'])

In [17]:
# Calculate precision
lc_test_precision = precision_score(y_true=all_truths_lc_test, y_pred=all_preds_lc_test, average='binary')

# Calculate recall
lc_test_recall = recall_score(y_true=all_truths_lc_test, y_pred=all_preds_lc_test, average='binary')

# Calculate the F1 score
lc_test_f1 = f1_score(y_true=all_truths_lc_test, y_pred=all_preds_lc_test, average='binary')

print("Precision on LC test set:", lc_test_precision)
print("Recall on LC test set:", lc_test_recall)
print("F1 Score on LC test set:", lc_test_f1)

Precision on LC test set: 0.12482528760348349
Recall on LC test set: 0.07149455015702938
F1 Score on LC test set: 0.09091620986687549
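
As a quick consistency check, the reported F1 should equal the harmonic mean of the reported precision and recall. The snippet below copies the three LC test-set values printed above and compares them; the tolerance is an arbitrary choice.

```python
import math

# Values copied from the LC test-set output above
precision = 0.12482528760348349
recall = 0.07149455015702938
reported_f1 = 0.09091620986687549

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
print(math.isclose(f1, reported_f1, rel_tol=1e-6))
```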


## Predict and Test on Trouw test dataset

In [36]:
# Initialize lists to hold all predictions and truths
all_preds_trouw_test = []
all_truths_trouw_test = []

# Process each sample
for sample in trouw_test_dataset:
    doc = Doc(nlp.vocab, words=sample['tokens'])
    doc = nlp.get_pipe('ner')(doc)
    
    # Generate predictions
    pred = [1 if token.ent_type_ == "WORK_OF_ART" else 0 for token in doc]
    all_preds_trouw_test.extend(pred)  
    
    all_truths_trouw_test.extend(sample['ner_tags'])

In [37]:
# Calculate precision
trouw_test_precision = precision_score(y_true=all_truths_trouw_test, y_pred=all_preds_trouw_test, average='binary')

# Calculate recall
trouw_test_recall = recall_score(y_true=all_truths_trouw_test, y_pred=all_preds_trouw_test, average='binary')

# Calculate the F1 score
trouw_test_f1 = f1_score(y_true=all_truths_trouw_test, y_pred=all_preds_trouw_test, average='binary')

print("Precision on Trouw test set:", trouw_test_precision)
print("Recall on Trouw test set:", trouw_test_recall)
print("F1 Score on Trouw test set:", trouw_test_f1)

Precision on Trouw test set: 0.17122040072859745
Recall on Trouw test set: 0.07051762940735183
F1 Score on Trouw test set: 0.09989373007438895


## Predict and Test on Parool test dataset

In [38]:
# Initialize lists to hold all predictions and truths
all_preds_parool_test = []
all_truths_parool_test = []

# Process each sample
for sample in parool_test_dataset:
    doc = Doc(nlp.vocab, words=sample['tokens'])
    doc = nlp.get_pipe('ner')(doc)
    
    # Generate predictions
    pred = [1 if token.ent_type_ == "WORK_OF_ART" else 0 for token in doc]
    all_preds_parool_test.extend(pred)  
    
    all_truths_parool_test.extend(sample['ner_tags'])

In [39]:
# Calculate precision
parool_test_precision = precision_score(y_true=all_truths_parool_test, y_pred=all_preds_parool_test, average='binary')

# Calculate recall
parool_test_recall = recall_score(y_true=all_truths_parool_test, y_pred=all_preds_parool_test, average='binary')

# Calculate the F1 score
parool_test_f1 = f1_score(y_true=all_truths_parool_test, y_pred=all_preds_parool_test, average='binary')

print("Precision on Parool test set:", parool_test_precision)
print("Recall on Parool test set:", parool_test_recall)
print("F1 Score on Parool test set:", parool_test_f1)

Precision on Parool test set: 0.36580766813324955
Recall on Parool test set: 0.1985670419651996
F1 Score on Parool test set: 0.2574082264484741


## Visualize prediction example

In [22]:
visualize_index = -1

In [23]:
# Note: despite its name, validation_df holds the rows of the LC *test* samples
validation_df = df_lc_clean[df_lc_clean['content'].isin(lc_test_samples)]

In [24]:
doc = nlp(validation_df.iloc[visualize_index]["content"])

# Filter entities to only include WORK_OF_ART
doc.ents = [ent for ent in doc.ents if ent.label_ == "WORK_OF_ART"]

In [25]:
validation_df[validation_df['content'] == validation_df.iloc[visualize_index]["content"]].title4

25932    Vallende ster
Name: title4, dtype: object

In [26]:
displacy.render(doc, style="ent", jupyter=True)