<a href="https://colab.research.google.com/github/hadiwyne/names_extractor/blob/main/names_extractor.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Extracting the names of authors mentioned in a book.

The inspiration for this project came to me while reading The Space of Literature by Maurice Blanchot. Maurice mentions many influential authors in his work and explores the themes these authors were struggling to express in their own works. By the time I finished reading the book, I realized that I was now interested in the authors Maurice had mentioned in his work, but I forgot to jot down their names.

If only there was a way to go through the text again and just extract the names of all the authors mentioned there ... Well, why not make something that does just that myself?

I also aim to calculate just how many times an author was mentioned in the text; this will allow me to see which author was most important for Maurice.

The first step of any project involving a Jupyter Notebook is importing the required libraries and the data itself. For the sake of simplicity, I have renamed the book of my choice to `sample_book.epub`. However, it is Maurice Blanchot's The Space of Literature.

In [1]:
import os
import re
import spacy
import pandas as pd
from collections import Counter
from tika import parser
from spacy.lang.en.stop_words import STOP_WORDS

#Load stopwords and NLP model

In [2]:
stop_words = set(STOP_WORDS)
nlp = spacy.load('en_core_web_md')

# Load and clean the text

In [3]:
def load_text(file_path):
    ext = os.path.splitext(file_path)[1].lower()
    if ext in [".pdf", ".epub"]:
        return parser.from_file(file_path)['content']
    elif ext == ".txt":
        with open(file_path, "r", encoding="utf-8") as f:
            return f.read()
    else:
        raise ValueError("Unsupported file format. Only .pdf, .epub, .txt are allowed.")

def clean_text(text):
    text = re.sub(r'[^a-zA-Z\s]', ' ', text)
    text = re.sub(r'\s+', ' ', text).strip()
    return text

# Extract authors

In [4]:
def extract_authors(text):
    doc = nlp(text)
    authors = []

    # 1. Extract PERSON entities with additional checks
    for ent in doc.ents:
        if ent.label_ == 'PERSON' and ent.text.istitle():
            # Filter out single-word common terms
            if len(ent.text.split()) == 1:
                if ent.text.lower() not in stop_words:
                    authors.append(ent.text)
            else:
                authors.append(ent.text)

    # 2. Extract names using contextual patterns
    patterns = [
        r'(?:according to|by|writes|stated by|argued by|noted by|in)\s+([A-Z][a-z]+(?:\s+[A-Z][a-z]+)+)',
        r'([A-Z][a-z]+(?:\s+[A-Z][a-z]+)+\'s (?:work|book|essay|theory))'
    ]
    for pattern in patterns:
        authors += re.findall(pattern, text)

    return authors

# Main Processing

In [5]:
file_path = "sample_book_2.epub"
raw_text = load_text(file_path)
cleaned_text = clean_text(raw_text)

# Extract and count authors

In [6]:
author_list = extract_authors(cleaned_text)
author_counts = Counter(author_list)

# Filter and save results

In [7]:
if author_counts:
    df = pd.DataFrame(author_counts.items(), columns=['Author', 'Mentions'])
    df = df[df['Mentions'] > 1]
    df.sort_values('Mentions', ascending=False, inplace=True)

    # Filter out common non-author terms
    common_non_authors = {"The", "This", "But", "And", "What", "For", "That", "When"}
    df = df[~df['Author'].isin(common_non_authors)]

    df.to_excel('authors.xlsx', index=False)
    print(f"Found {len(df)} authors. Results saved to authors.xlsx")
else:
    print("No authors found")

Found 129 authors. Results saved to authors.xlsx
