# Building a 19th c. Notes & Queries Full Text Search Engine

Having got various pieces in place, we're now in a position to attempt to create a comprehensive full text search engine over the 19th century issues of *Notes & Queries*. As we use the database, there may well be "optimisations" we can make, for example in trying to tidy up the content a little. But for now, let's just put the pieces we've already assembled together and see how it looks.

Start off by loading some essential packages, as well as the package files we created for ourselves previously:

In [1]:
from pathlib import Path
from sqlite_utils import Database

from ia_utils.create_db_table_metadata import create_db_table_metadata
from ia_utils.open_metadata_records import open_metadata_records
from ia_utils.add_patched_metadata_records_to_db import add_patched_metadata_records_to_db
from ia_utils.create_db_table_issues import create_db_table_issues
from ia_utils.create_db_table_pages_metadata import create_db_table_pages_metadata

It takes several hours to download the data files, so if we already have a full database file, we may want to put in some guards so we don't overwrite it and lose all that previously downloaded data:

In [2]:
RECREATE_FULL_DB = False
full_db_name = "full_nq.db"

# Check for the file named above, rather than repeating the filename
db_exists = Path(full_db_name).is_file()

dirname = "ia-downloads"
p = Path(dirname)

# Extra cautious...
if RECREATE_FULL_DB and not db_exists:
    db_full = Database(full_db_name, recreate=True)

    create_db_table_metadata(db_full)
    data_records = open_metadata_records()
    add_patched_metadata_records_to_db(db_full, data_records)

    create_db_table_issues(db_full)

    create_db_table_pages_metadata(db_full)
    db_full["pages_metadata"].add_column("page_text", str)
    db_full["pages_metadata_fts"].drop(ignore=True)
    db_full["pages_metadata"].enable_fts(["id", "page_idx", "page_text"],
                                         create_triggers=True, tokenize="porter")
else:
    db_full = Database(full_db_name)
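
The guard above only rebuilds when the flag is set *and* no database file is present, so an existing file is never overwritten. A minimal sketch of that decision as a standalone helper may make the behaviour easier to check; the function name `should_build` and the temporary files are assumptions made purely for illustration:

```python
from pathlib import Path
import tempfile


def should_build(db_path, recreate_requested):
    """Return True only if a rebuild was requested and no file exists at db_path."""
    return bool(recreate_requested) and not Path(db_path).is_file()


# Try all four combinations of flag and file presence in a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    missing = Path(tmp) / "missing.db"
    existing = Path(tmp) / "existing.db"
    existing.touch()

    decisions = {
        ("flag on", "no file"): should_build(missing, True),
        ("flag on", "file exists"): should_build(existing, True),
        ("flag off", "no file"): should_build(missing, False),
        ("flag off", "file exists"): should_build(existing, False),
    }
```

Only the "flag on, no file" case triggers a build, which means that to rebuild an existing database under this guard we would first have to move or delete the file by hand.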

In [3]:
from pandas import read_sql

# Get the records for a particular year

q = """
SELECT id, title, date, is_index
FROM metadata
WHERE is_index = 0
    AND strftime('%Y', datetime) = '{year}';
"""

results_19th_cent = read_sql(q.format(year=1849), db_full.conn)
results_19th_cent

Unnamed: 0,id,title,date,is_index
0,sim_notes-and-queries_1849-11-03_1_1,Notes and Queries 1849-11-03: Vol 1 Iss 1,1849-11-03,0
1,sim_notes-and-queries_1849-11-10_1_2,Notes and Queries 1849-11-10: Vol 1 Iss 2,1849-11-10,0
2,sim_notes-and-queries_1849-11-17_1_3,Notes and Queries 1849-11-17: Vol 1 Iss 3,1849-11-17,0
3,sim_notes-and-queries_1849-11-24_1_4,Notes and Queries 1849-11-24: Vol 1 Iss 4,1849-11-24,0
4,sim_notes-and-queries_1849-12-01_1_5,Notes and Queries 1849-12-01: Vol 1 Iss 5,1849-12-01,0
5,sim_notes-and-queries_1849-12-08_1_6,Notes and Queries 1849-12-08: Vol 1 Iss 6,1849-12-08,0
6,sim_notes-and-queries_1849-12-15_1_7,Notes and Queries 1849-12-15: Vol 1 Iss 7,1849-12-15,0
7,sim_notes-and-queries_1849-12-22_1_8,Notes and Queries 1849-12-22: Vol 1 Iss 8,1849-12-22,0
8,sim_notes-and-queries_1849-12-29_1_9,Notes and Queries 1849-12-29: Vol 1 Iss 9,1849-12-29,0


We now need to:
    
- iterate through the records;
- download the issue;
- carve it into various parts;
- add the parts to the database.

We have all the pieces we need, so let's do it:

In [4]:
# Import the tqdm progress bar tools
from tqdm.notebook import tqdm
# And enable the pandas extensions
tqdm.pandas()

from ia_utils.download_and_extract_text import download_and_extract_text
from ia_utils.download_ia_records_by_format import download_ia_records_by_format
from ia_utils.add_page_metadata_to_db import add_page_metadata_to_db
from ia_utils.chunk_page_text import chunk_page_text

# Extra cautious
if RECREATE_FULL_DB and not db_exists:
    # Iterate through all the years we want to search over
    for year in tqdm(range(1849, 1900)):
        # Get issues by year
        results_by_year = read_sql(q.format(year=str(year)), db_full.conn)

        # Download issue content by year
        results_by_year['content'] = results_by_year["id"].apply(download_and_extract_text,
                                                                 verbose=False)

        # Add issues content to database
        results_by_year[["id", "content"]].to_sql("issues",
                                                  db_full.conn,
                                                  index=False, if_exists="append")

        # For each issue, we need to grab the metadata and store it in the database
        download_ia_records_by_format(results_by_year.to_dict(orient="records"), p)
        add_page_metadata_to_db(db_full, results_by_year.to_dict(orient="records"),
                                verbose=False)

        for record_id_val in results_by_year['id'].to_list():
            updated_pages = chunk_page_text(db_full, record_id_val)
            db_full["pages_metadata"].upsert_all(updated_pages, pk=("id", "page_idx"))
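
Because the download takes several hours, an interrupted run is a real possibility. One way of making the loop resumable might be to skip any issue that already has a row in the `issues` table. The following is a minimal sketch using the standard library `sqlite3` package and an in-memory database; the helper name `ids_still_to_fetch` is hypothetical, and the table layout (`id`, `content`) is assumed from the columns written to `issues` above:

```python
import sqlite3


def ids_still_to_fetch(conn, candidate_ids):
    """Return the candidate ids that do not yet have a row in the issues table."""
    done = {row[0] for row in conn.execute("SELECT id FROM issues")}
    return [record_id for record_id in candidate_ids if record_id not in done]


# Toy database standing in for the full one
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE issues (id TEXT, content TEXT)")

# Pretend the first issue was downloaded before the run was interrupted
conn.execute(
    "INSERT INTO issues VALUES (?, ?)",
    ("sim_notes-and-queries_1849-11-03_1_1", "placeholder text"),
)

wanted = [
    "sim_notes-and-queries_1849-11-03_1_1",
    "sim_notes-and-queries_1849-11-10_1_2",
]
pending = ids_still_to_fetch(conn, wanted)
```

In the loop, we could then filter each year's records down to the `pending` ids before downloading anything. Note that this only checks the `issues` table; an issue whose pages were only partially added would still need handling separately.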

In [5]:
q = """
SELECT COUNT(*)
FROM pages_metadata;
"""

read_sql(q, db_full.conn)

Unnamed: 0,COUNT(*)
0,58406


In [6]:
q = """
SELECT COUNT(*) FROM issues;
"""

read_sql(q, db_full.conn)

Unnamed: 0,COUNT(*)
0,2565
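
Raw counts tell us the tables are populated, but not whether every issue actually has pages associated with it. A quick sanity check might look for issues with no rows in `pages_metadata`, and count the pages per issue. This is a sketch against a toy in-memory database using the standard library `sqlite3` package; the table layouts and the `toy_issue_*` ids are assumptions for illustration, with `id` in `pages_metadata` taken to be the issue id:

```python
import sqlite3

# Toy tables standing in for the real ones
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE issues (id TEXT, content TEXT)")
conn.execute(
    "CREATE TABLE pages_metadata "
    "(id TEXT, page_idx INTEGER, page_text TEXT, PRIMARY KEY (id, page_idx))"
)
conn.executemany(
    "INSERT INTO issues VALUES (?, ?)",
    [("toy_issue_1", "placeholder"), ("toy_issue_2", "placeholder")],
)
# Only the first issue has any pages
conn.executemany(
    "INSERT INTO pages_metadata VALUES (?, ?, ?)",
    [("toy_issue_1", 0, "placeholder"), ("toy_issue_1", 1, "placeholder")],
)

# Issues that have no page records at all
q_missing = """
SELECT issues.id
FROM issues
LEFT JOIN pages_metadata ON pages_metadata.id = issues.id
WHERE pages_metadata.id IS NULL;
"""
issues_without_pages = [row[0] for row in conn.execute(q_missing)]

# Number of page records for each issue that has some
q_per_issue = """
SELECT id, COUNT(*) AS n_pages
FROM pages_metadata
GROUP BY id
ORDER BY id;
"""
pages_per_issue = conn.execute(q_per_issue).fetchall()
```

The same two queries could be run against the full database with `read_sql(q_missing, db_full.conn)` and `read_sql(q_per_issue, db_full.conn)`.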


In [8]:
search_term = "sin eater"

#q = f"""
#SELECT * FROM pages_metadata_fts
#WHERE pages_metadata_fts MATCH {db_full.quote(search_term)};
#"""

q = """
SELECT id, snippet(pages_metadata_fts, -1, '__', '__', '...', 10) AS clip
FROM pages_metadata_fts WHERE pages_metadata_fts MATCH {query} ;
"""

read_sql(q.format(query=db_full.quote(search_term)), db_full.conn)

Unnamed: 0,id,clip
0,sim_notes-and-queries_1851-09-20_4_99,...Minor Queries :— Mazer Wood __eaters__ —“ A...
1,sim_notes-and-queries_1851-09-20_4_99,...Mazer Wood and __Sin__-__eaters__ (Vol. iii...
2,sim_notes-and-queries_1851-12-27_4_113,"...on mazer-wood and __sin__-__eaters__, 211. ..."
3,sim_notes-and-queries_1851-12-27_4_113,"...__Sin__-__eaters__, notices respecting, 211..."
4,sim_notes-and-queries_1852-10-23_6_156,...Coffins - -\nMinor Queries ANSwereD : — __S...
5,sim_notes-and-queries_1852-10-23_6_156,...__Sin__-__eater__.— Can any of your readers...
6,sim_notes-and-queries_1852-12-04_6_162,"...furnish its quota The __Sin__-__eater__, by..."
7,sim_notes-and-queries_1852-12-04_6_162,"...regoi — irst line, THE __SIN__-__EATER__. I..."
8,sim_notes-and-queries_1852-12-25_6_165,"...B. “CE. ) on the __sin__-__eater__, 541.\nB..."
9,sim_notes-and-queries_1852-12-25_6_165,"...Leeper ( Alex.) on the __sin__-__eater__, 5..."
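
In an FTS5 query, bare terms separated by a space are combined with an implicit AND, so a search for `sin eater` matches any page containing both words, not necessarily next to each other. Wrapping the terms in double quotes inside the query string turns them into a phrase search. The following sketch shows the difference on a toy in-memory index, assuming the local SQLite build includes FTS5; the table name `pages_fts`, the helper `search_pages`, and the two made-up pages are all assumptions for illustration:

```python
import sqlite3

# Toy FTS5 index using the same porter tokenizer as the real table
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE VIRTUAL TABLE pages_fts "
    "USING fts5(id, page_idx, page_text, tokenize='porter')"
)
conn.executemany(
    "INSERT INTO pages_fts VALUES (?, ?, ?)",
    [
        # Made-up page text, not taken from Notes & Queries
        ("toy_issue_1", 0, "Can any of your readers tell me about the Sin-eater?"),
        ("toy_issue_2", 0, "He thought it a sin, though he was a hearty eater."),
    ],
)


def search_pages(conn, term, phrase=False):
    """Search the toy index, optionally treating the term as an exact phrase."""
    if phrase:
        # Double quotes mark a phrase in FTS5; embedded quotes are doubled up
        fts_query = '"{}"'.format(term.replace('"', '""'))
    else:
        fts_query = term
    sql = """
    SELECT id, snippet(pages_fts, -1, '__', '__', '...', 10) AS clip
    FROM pages_fts
    WHERE pages_fts MATCH ?;
    """
    return conn.execute(sql, (fts_query,)).fetchall()


all_terms = search_pages(conn, "sin eater")
exact_phrase = search_pages(conn, "sin eater", phrase=True)
```

Here `all_terms` contains both toy pages, whereas `exact_phrase` contains only the first. The hyphen in "Sin-eater" is treated as a separator by the tokenizer, so the hyphenated form still counts as the two adjacent tokens that the phrase search requires.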
