# Locating the Provided NBT Book Titles in Newspaper Reviews

In this notebook, we aim to determine the location of given book titles from the Nederlandse Bibliografie Totaal (NBT) database within newspaper book review articles. The process involves multiple steps to ensure accurate matching:

1. **Direct Matching:** Initially, we attempt to directly match the given NBT titles with the text in the newspaper book reviews.
2. **Main Title Extraction:** If a direct match is not found, we truncate the NBT title to retain only the main title.
3. **Fuzzy Matching:** If the main title still does not appear directly in the text, we apply fuzzy matching on the main title to locate it within the review.
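
The three steps above can be sketched as a single cascade. This is a minimal illustration rather than the notebook's actual implementation: it uses only the standard library, swapping `fuzzywuzzy` for `difflib.SequenceMatcher`, and the helper names `main_title`, `best_window`, and `locate_title`, the 0.8 threshold, and the sample strings are all assumptions made for the example.

```python
import re
from difflib import SequenceMatcher


def main_title(nbt_title):
    """Reduce an NBT title to its main title (hypothetical helper)."""
    title = nbt_title.split("/")[0]        # drop everything after the first "/"
    title = re.sub(r"#.*?#", "", title)    # drop #...# markup
    title = title.replace("...", "")
    for separator in (":", ";"):           # drop subtitles and further parts
        title = title.split(separator)[0]
    return title.strip()


def best_window(query, text):
    """Return the window of text tokens most similar to query, scored 0 to 1."""
    query_tokens, text_tokens = query.split(), text.split()
    n = len(query_tokens)
    best, best_score = "", 0.0
    for i in range(len(text_tokens) - n + 1):
        segment = " ".join(text_tokens[i:i + n])
        score = SequenceMatcher(None, query, segment).ratio()
        if score > best_score:
            best, best_score = segment, score
    return best, best_score


def locate_title(nbt_title, review, threshold=0.8):
    """Try direct, main-title, then fuzzy matching; return (match, method)."""
    review_lc = review.lower()
    full = nbt_title.strip().lower()
    if full and full in review_lc:
        return full, "direct"
    main = main_title(nbt_title).lower()
    if main and main in review_lc:
        return main, "main_title"
    segment, score = best_window(main, review_lc)
    if score >= threshold:
        return segment, "fuzzy"
    return None, "none"


# Made-up review snippets; the second contains an OCR-damaged title.
clean_review = "Nieuw verschenen: Het weerlicht op de kimmen, een roman."
ocr_review = "in het weerhcht op de kimmen; schrijft de auteur"
print(locate_title("Het weerlicht op de kimmen : roman / A. Auteur", clean_review))
print(locate_title("Het weerlicht op de kimmen : roman / A. Auteur", ocr_review))
```

Returning the method alongside the match makes it easy to review the fuzzy hits separately, which is the role the `partially_matched` flag plays later in the notebook.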

By the end of this notebook, each record will carry the title as it appears in the review text, which turns the provided dataset into a named-entity recognition (NER) dataset.

Let's get started!


In [53]:
import pandas as pd
import numpy as np
import string
import re

import nltk
from nltk.tokenize import word_tokenize

import os
from datetime import datetime
import json
import difflib
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

from tqdm import tqdm

# Ensure you have the necessary NLTK tokenizer models downloaded
# nltk.download('punkt')

In [54]:
pd.set_option('display.max_columns', None)

In [55]:
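def remove_extra_spaces(text):
    # Non-string cells (e.g. NaN from empty spreadsheet cells) are returned as NaN
    if not isinstance(text, str):
        return np.nan
    # Collapse every run of whitespace (spaces, tabs, newlines) into a single space
    cleaned_text = re.sub(r'\s+', ' ', text)
    return cleaned_text.strip()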

In [56]:
df = pd.read_excel('to_clean3.xlsx', engine='openpyxl')

In [57]:
# title3 was manually cleaned only up to row 11286; discard it for all later rows
df.loc[11286 + 1:, 'title3'] = np.nan

In [58]:
df['content'] = df['content'].apply(remove_extra_spaces)
df['title1'] = df['title1'].apply(remove_extra_spaces)
df['title2'] = df['title2'].apply(remove_extra_spaces)
df['title3'] = df['title3'].apply(remove_extra_spaces)

In [59]:
def find_best_match(large_text, query_sentence):
    # Initialize the SequenceMatcher
    s = difflib.SequenceMatcher(None, large_text, query_sentence)
    
    # Find the best match in the large text
    match = s.find_longest_match(0, len(large_text), 0, len(query_sentence))
    
    # Calculate similarity ratio
    match_similarity = s.ratio()  # This considers the overall similarity, not just the best match
    
    # Adjust the start position to the beginning of a word
    start = match.a
    while start > 0 and large_text[start - 1] != ' ':
        start -= 1

    # Re-calculate the end position based on the new start position
    end = start + len(query_sentence)

    # Extract a query-length slice of the text, starting at the adjusted position
    best_match_text = large_text[start:end]

    return start, end, best_match_text, match_similarity

def fuzzy_extract_matching_substring(query, text):
    # Split the query and text into tokens
    query_tokens = query.split()
    text_tokens = text.split()
    
    # Length of the query in terms of tokens
    query_length = len(query_tokens)
    
    # Initialize best match variables
    best_match = ""
    highest_score = 0
    
    # Loop through the text tokens to form segments with the same number of tokens as the query
    for i in range(len(text_tokens) - query_length + 1):
        # Create a segment from the text with the same number of tokens as the query
        segment = " ".join(text_tokens[i:i + query_length])
        
        # Calculate fuzzy match score between the query and the segment
        score = fuzz.token_set_ratio(query, segment)
        if score > highest_score:
            highest_score = score
            best_match = segment

    return best_match, highest_score
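
A note on the similarity value returned by `find_best_match`: `SequenceMatcher.ratio()` is computed over the whole review and the query together, so it shrinks as the review gets longer, which is presumably why the quality check further down uses a cutoff as small as 0.005. The sketch below, with made-up review text, compares that whole-text ratio against the ratio of a query-sized window. Aligning the window with `block.a - block.b` is an assumption of this example, not what the function above does.

```python
from difflib import SequenceMatcher

query = "het weerlicht op de kimmen"
# Made-up review text containing an OCR-damaged version of the title.
review = ("in deze bespreking komt onder meer het weerhcht op de kimmen; "
          "aan bod, naast enkele andere onlangs verschenen romans")

matcher = SequenceMatcher(None, review, query)

# ratio() is 2*M / (len(review) + len(query)), where M counts matched characters.
# M can never exceed len(query), so long reviews push the ratio toward zero.
whole_ratio = matcher.ratio()
upper_bound = 2 * len(query) / (len(review) + len(query))

# Scoring only a query-sized window around the longest match gives a number
# that reflects how well the title itself was recovered.
block = matcher.find_longest_match(0, len(review), 0, len(query))
start = max(block.a - block.b, 0)  # shift back so the window lines up with the query start
window = review[start:start + len(query)]
window_ratio = SequenceMatcher(None, window, query).ratio()

print(f"whole-text ratio: {whole_ratio:.3f} (upper bound {upper_bound:.3f})")
print(f"window ratio:     {window_ratio:.3f} for {window!r}")
```

By contrast, `fuzz.token_set_ratio` in `fuzzy_extract_matching_substring` returns an integer between 0 and 100, so thresholds for the two functions are not interchangeable.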

### Extract main title from title1 (that is present in the review)

In [69]:
yes = 0
almost_match = 0

titles = []

partially_matched = [0] * len(df)
partially_matched_original_title = ["NONE"] * len(df)

for index, row in tqdm(df.iterrows()):
    if (type(row['title3']) == str): # if not nan
        title = row['title3'].lower()
        if (title in row['content'].lower()):
            yes += 1
            titles.append(title)
            continue

    title = row.title1.split("/")[0]
    title = re.sub(r'#.*?#', '', title) # Remove #..#
    title = title.replace("...", "")
    
    if (title.lower() in row.content.lower()):
        yes += 1
        titles.append(title)
    elif (title.split(":")[0].lower().strip() in row.content.lower()):
        yes += 1
        titles.append(title.split(":")[0].strip())
    elif (title.split(":")[0].split(';')[0].lower().strip() in row.content.lower()):
        yes += 1
        titles.append(title.split(":")[0].split(';')[0].strip())
    elif (title.split(":")[0].split(';')[0].split('=')[0].lower().strip() in row.content.lower()):
        yes += 1
        titles.append(title.split(":")[0].split(';')[0].split('=')[0].strip())
    elif (title.split(":")[0].split(';')[0].split(',')[0].lower().strip() in row.content.lower()):
        yes += 1
        titles.append(title.split(":")[0].split(';')[0].split(',')[0].strip())
    else:
        best_match_text, match_similarity = fuzzy_extract_matching_substring(text=row.content.lower(), query=title.split("/")[0].split(":")[0].split(';')[0].lower())
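        # Note: fuzz.token_set_ratio scores range from 0 to 100, so a cutoff of 0.1
        # accepts any segment with a nonzero score
        if (match_similarity > 0.1):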
            yes += 1
            almost_match += 1
            partially_matched[index] = 1
            partially_matched_original_title[index] = title.split("/")[0].split(":")[0].split(';')[0].lower()
            
            titles.append(best_match_text)
        else:
            titles.append(np.nan)

25940it [01:24, 306.29it/s]


In [70]:
yes / len(df)
# 0.7492675404780262

1.0

In [71]:
almost_match

6562

In [72]:
df['partially_matched'] = partially_matched
df['title4'] = titles
df['partially_matched_original_title'] = partially_matched_original_title
df['manually_removed'] = [0] * len(df)

In [73]:
f"The number of titles we found: {len(df['title4']) - df['title4'].isnull().sum()}" 

'The number of titles we found: 25940'

In [74]:
f"The number of unique reviews where we found the title: {len(df[df['title4'].notnull()].content.unique())}" 

'The number of unique reviews where we found the title: 13129'

### Export cleaned df

In [75]:
# Keep only the records where the title was found in the review
df_clean = df[df['title4'].notnull()]

In [76]:
# Export df to excel
df_clean.to_excel('xxxxxxx.xlsx', index=False, engine='openpyxl')

In [77]:
# Double-check that every title actually occurs in its review
for index, row in df_clean.iterrows():
    if not (row['title4'].lower() in row['content'].lower()):
        print(row['title4'])
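
Once the check above prints nothing, every `title4` is known to occur in its review, and the character span of the title can be derived, which is the kind of annotation a NER dataset needs. The following is a sketch under assumptions: the helper names `title_span` and `bio_tags`, the `TITLE` label, whitespace tokenization, and the sample sentence are all hypothetical, and the notebook itself does not prescribe an output format.

```python
def title_span(content, title):
    """Return (start, end) offsets of the first case-insensitive match, or None."""
    # Caveat: lower() can change the length of a few Unicode characters,
    # in which case offsets into the original string would shift.
    start = content.lower().find(title.lower())
    if start == -1:
        return None
    return start, start + len(title)


def bio_tags(content, span, label="TITLE"):
    """Tag whitespace tokens that overlap span with B-/I- labels, others with O."""
    tags, position, inside = [], 0, False
    for token in content.split():
        token_start = content.index(token, position)
        token_end = token_start + len(token)
        position = token_end
        overlaps = (span is not None
                    and token_start < span[1] and token_end > span[0])
        if overlaps:
            tags.append((token, ("I-" if inside else "B-") + label))
        else:
            tags.append((token, "O"))
        inside = overlaps
    return tags


# Made-up sentence for illustration.
content = "Gisteren verscheen De ogen van de mol bij de uitgever."
span = title_span(content, "de ogen van de mol")
print(span, content[span[0]:span[1]])
print(bio_tags(content, span))
```

Keeping character offsets as the primary annotation and deriving token tags from them leaves the choice of tokenizer open for whichever NER model is trained later.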

### Check partial matching quality

In [None]:
count = 0

for index, row in df[df['title4'].isnull()].iterrows():

    title = row.title1.split("/")[0]
    title = re.sub(r'#.*?#', '', title) # Remove #..#
    title = title.replace("...", "")
    
    start, end, best_match_text, match_similarity = find_best_match(large_text=row.content.lower(), query_sentence=title.split("/")[0].split(":")[0].split(';')[0].lower())
    if (match_similarity > 0.005):
        count += 1
        
        print(best_match_text)
        print(title.split("/")[0].split(":")[0].split(';')[0].lower())
        print("\n\n")

In [None]:
count

In [37]:
count = 0

for index, row in df[df['title4'].isnull()].iterrows():

    title = row.title1.split("/")[0]
    title = re.sub(r'#.*?#', '', title) # Remove #..#
    title = title.replace("...", "")
    
    best_match_text, match_similarity = fuzzy_extract_matching_substring(text=row.content.lower(), query=title.split("/")[0].split(":")[0].split(';')[0].lower())
    if (match_similarity > 80):
        count += 1
        
        print(best_match_text)
        print(title.split("/")[0].split(":")[0].split(';')[0].lower())
        print("\n\n")

het weerhcht op de kimmen;
het weerlicht op de kimmen 



de modemeester en het
panda en de modemeester 



modemeester en het geheimzinnige glas water;
panda en het geheimzinnige glas water 



bij het geheim van de oude prentbriefkaarten en
kappie en het geheim van de oude prentbriefkaarten 



en bij de slaapslaven.
kappie en de slaapslaven 



en om de manege. de
in en om de manege 



mortho-anna jarings.
martha-anna jarings 



- „de ogen van de
de ogen van de mol 



„fantasie für got
fantasie fur gott 



laßrava,
labrava 



„zitten, zat gezeten".
zitten zat gezeten 



- zeven eeuwen mijnen en mijnwerkers in
zeven eeuwen mijnen en mijnwerkers in limburg 



help, ik heb een camera gekregen
help! ik heb een camera gekregen



midden van koningen rust hij in vrede.
temidden van koningen rust hij in vrede 



de ring van moebius 2,
de ring van möbius 2 



'hoe bevalt de keizersnede?'
hoe bevalt een keizersnede? 



met wome wimpers. de
rose, met vrome wimpers 



place dient op