# World of Tales Online Story Collection

The [World of Tales](https://www.worldoftales.com/) website, published by Viktor Andonov, contains a wide range of tales extracted from a large number of out-of-copyright story collections.

The majority of the content is in the public domain.

A data download of the stories does not appear to be available: so let's make one...

## Identify Books Available

The book collections are organised geographically, with geographical regions listed in a sidebar on the website homepage. However, a full list of books is also available on a single page, with a regular structure describing each book.

![](images/world_of_tales_books.png)

In [1]:
# These packages make it easy to download web pages so that we can work with them
import requests
# "Caching" pages means grabbing a local copy of the page so we only need to download it once
import requests_cache
from datetime import timedelta

requests_cache.install_cache('web_cache',
                             backend='sqlite',
                             expire_after=timedelta(days=1000))

We'll be adding things to a database...

In [2]:
from sqlite_utils import Database

db_name = "word_of_tales.db"

# Uncomment the following lines to connect to a pre-existing database
#db = Database(db_name)

In [3]:
books_index_url = "https://www.worldoftales.com/all_books.html"

# And then grab the page
html = requests.get(books_index_url)

Each book description is contained in an HTML `div` element with class `box2 GM books`:

In [4]:
# The BeautifulSoup package provides a range of tools
# that help us work with the downloaded web page,
# such as extracting particular elements from it
from bs4 import BeautifulSoup

# The "soup" is a parsed and structured form of the page we downloaded
soup = BeautifulSoup(html.content, "html.parser")

# We can use browser developer tools to find a class
# that identifies the book records
class_ = "box2"
# Find the div elements containing the book records
# The first item is just a header
items_ = soup.find_all(class_=class_)[1:]

len(items_)

73

The structure inside each record appears to be consistent, although some records have more fields than others.

Let's see what a sample record looks like:

In [5]:
items_[0]

<div align="justify" class="box2 GM books" style="margin-top:50px;">
<h2 class="GL"><a href="Chinese_wonder_book.html">A Chinese Wonder Book</a></h2>
<span class="pic"><a href="Chinese_wonder_book.html"><img alt="Chinese folktales" border="0" height="165" src="media/Chinese_folktales.jpg" width="112"/></a></span>
<p><strong>Notes</strong>: Read 15 Chinese folktales </p>
<p><strong>Author</strong>: Norman Hinsdale Pitman <br/>
<strong>Editor</strong>: Andrew Lang <br/>
<strong>Published</strong>: 1919<br/>
<strong>Publisher</strong>: E. P. Dutton &amp; Co., 681 Fifth Avenue, New York</p>
<br/>
</div>

Okay, so there is a mixture of paragraph and `<br/>` separated elements. Let's treat things as lines. One of the easiest ways to do that is to view the record in a simple text form by casting the HTML to markdown.

In [6]:
from markdownify import markdownify

# We can convert the soup element to html text
# by casting it as a string
record_txt = markdownify(str(items_[0])).strip()
record_txt

'[A Chinese Wonder Book](Chinese_wonder_book.html)\n-------------------------------------------------\n\n\n[![Chinese folktales](media/Chinese_folktales.jpg)](Chinese_wonder_book.html)\n**Notes**: Read 15 Chinese folktales \n\n\n**Author**: Norman Hinsdale Pitman   \n\n**Editor**: Andrew Lang   \n\n**Published**: 1919  \n\n**Publisher**: E. P. Dutton & Co., 681 Fifth Avenue, New York'

This gives us a link in the heading, a book cover image, and then a series of metadata fields, each on a new line, with an emboldened header.

Let's create a parser for that structure.

In [7]:
import re

# We could split on three or more "-" characters
#parts = [p.strip() for p in re.split("---+", record_txt)]
# However, not all the records have a header defined that way
# (at least one is defined with a leading ###)
# Instead, remove any heading chars then break when we find the linked image
parts = [p.strip() for p in re.split(r"[-\n\s]+\[", record_txt.lstrip("#").strip())]

parts

['[A Chinese Wonder Book](Chinese_wonder_book.html)',
 '![Chinese folktales](media/Chinese_folktales.jpg)](Chinese_wonder_book.html)\n**Notes**: Read 15 Chinese folktales \n\n\n**Author**: Norman Hinsdale Pitman   \n\n**Editor**: Andrew Lang   \n\n**Published**: 1919  \n\n**Publisher**: E. P. Dutton & Co., 681 Fifth Avenue, New York']

In [8]:
from parse import parse

title_parts = parse("[{title}]({path})", parts[0])

# This gives us the book title and the path to it
(title_parts["title"], title_parts["path"])

('A Chinese Wonder Book', 'Chinese_wonder_book.html')
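If you would rather not depend on the `parse` package for this step, the same split can be sketched with a regular expression from the standard library. This is only an illustration: `LINK_RE` and `split_markdown_link` are hypothetical names, and the pattern assumes a single, simple `[title](path)` link with no nested brackets.

```python
import re

# Named groups mirror the {title} and {path} fields in the parse() template
LINK_RE = re.compile(r"\[(?P<title>.+?)\]\((?P<path>[^)]+)\)")


def split_markdown_link(text):
    """Return (title, path) for a '[title](path)' string, or None if no match."""
    m = LINK_RE.fullmatch(text.strip())
    return (m["title"], m["path"]) if m else None


print(split_markdown_link("[A Chinese Wonder Book](Chinese_wonder_book.html)"))
```

Returning `None` for a non-matching string makes it easy to spot records that do not follow the expected structure.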

The next thing we need to do is extract out the image alt text, the image path, the book path, and the metadata lines:

In [9]:
#Remember, we lost the opening [ in the split
body_parts = parse("![{alt}]({img_path})]({path}){body}", parts[1])

(body_parts["alt"], body_parts["img_path"], body_parts["path"], body_parts["body"])

('Chinese folktales',
 'media/Chinese_folktales.jpg',
 'Chinese_wonder_book.html',
 '\n**Notes**: Read 15 Chinese folktales \n\n\n**Author**: Norman Hinsdale Pitman   \n\n**Editor**: Andrew Lang   \n\n**Published**: 1919  \n\n**Publisher**: E. P. Dutton & Co., 681 Fifth Avenue, New York')

We could parse the metadata by splitting on the lines (`\n+`) and then extracting the metadata field and text, but this may be a little brittle if metadata fields are split over several lines. So instead, split on a new header:

In [10]:
metadata_ = [l.strip() for l in re.split(r"\n\s*\*", body_parts["body"].strip())]
metadata_

['**Notes**: Read 15 Chinese folktales',
 '*Author**: Norman Hinsdale Pitman',
 '*Editor**: Andrew Lang',
 '*Published**: 1919',
 '*Publisher**: E. P. Dutton & Co., 681 Fifth Avenue, New York']

And make a dictionary from that:

In [11]:
metadata = {}
for m in metadata_:
    m_parts = m.split("**:")
    metadata[m_parts[0].strip("*")] = m_parts[1].strip()

metadata

{'Notes': 'Read 15 Chinese folktales',
 'Author': 'Norman Hinsdale Pitman',
 'Editor': 'Andrew Lang',
 'Published': '1919',
 'Publisher': 'E. P. Dutton & Co., 681 Fifth Avenue, New York'}
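As an alternative to splitting and then repairing the pieces, the key/value pairs could be captured in one pass with a regular expression. The following is a minimal sketch rather than the approach used in this notebook; `FIELD_RE` and `parse_fields` are hypothetical names, and the sample `body` string is copied from the record shown above.

```python
import re

# Sample body text copied from the record shown above
body = (
    "\n**Notes**: Read 15 Chinese folktales \n\n\n"
    "**Author**: Norman Hinsdale Pitman   \n\n"
    "**Editor**: Andrew Lang   \n\n"
    "**Published**: 1919  \n\n"
    "**Publisher**: E. P. Dutton & Co., 681 Fifth Avenue, New York"
)

# Capture "**Key**: value" pairs; a value runs up to the next bold key or the end
FIELD_RE = re.compile(
    r"\*\*(?P<key>[^*]+)\*\*:\s*(?P<value>.*?)(?=\n\s*\*\*|\Z)",
    re.DOTALL,
)


def parse_fields(text):
    """Build a dict of metadata fields, collapsing whitespace inside values."""
    return {
        m["key"].strip(): " ".join(m["value"].split())
        for m in FIELD_RE.finditer(text)
    }


fields = parse_fields(body)
print(fields)
```

Because the value pattern is allowed to span newlines, a field whose text wraps over several lines should still end up in a single dictionary entry.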

From a quick skim of the book page, at least the following fields appear (there may be others... I'm sure we'll find them!): *Notes*, *Author*, *Editor*, *Translator*, *Translators*, *Published*, *Publisher*.

One simplification we might make when generating the metadata keys is to stem the key by removing any trailing letter *s* (for example, using `.rstrip("s")`).
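One thing to keep in mind is that `str.rstrip("s")` removes *every* trailing `s`, so a key ending in a double *s* would lose both letters. For the keys seen so far that makes no difference, but a slightly more defensive helper might look like the sketch below (`singular_key` is a hypothetical name, and *Address* is a made-up key used only to show the edge case).

```python
def singular_key(key):
    """Drop one trailing 's' from a metadata key, leaving other keys untouched."""
    key = key.strip()
    # Leave keys ending in "ss" alone; otherwise remove at most one trailing "s"
    if key.endswith("s") and not key.endswith("ss"):
        return key[:-1]
    return key


for k in ["Notes", "Translators", "Translator", "Published", "Address"]:
    print(k, "->", singular_key(k))
```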

Let's get all the book metadata:

In [12]:
def get_metadata(book_item):
    """Extract the metadata from each book record."""
    record_txt = markdownify(str(book_item)).strip()
    # Remove any heading chars, then break when we find the linked image
    parts = [p.strip() for p in re.split(r"[-\n\s]+\[",
                                         record_txt.lstrip("#").strip())]

    # Splitting when we find the image
    # HACK: in a couple of cases, we have text after the link
    # For now, dump it and repair the break token...
    parts[0] = f'{parts[0].split(")")[0]})'
    # In at least one case, we need to strip embolden tags
    title_parts = parse("[{title}]({path})", parts[0].strip("*"))
    # Remember, we have lost the opening [ in the split...
    body_parts = parse("![{alt}]({img_path})]({path}){body}", parts[1])

    metadata_ = [l.strip() for l in re.split(r"\n\s*\*", body_parts["body"].strip())]
    metadata = {}
    for m in metadata_:
        m_parts = m.split("**:")
        # Simplify metadata keys by removing plurals
        metadata[m_parts[0].strip("*").rstrip("s")] = m_parts[1].strip()
    # Clean the title 
    title = re.sub(r'[\s\n]+', ' ',  title_parts["title"].strip("*"))
    record = {"title": title,
              "path": title_parts["path"],
              "img_alt": body_parts["alt"],
              "img_path": body_parts["img_path"],
    }
    # Use the py3.9 dict merge operator
    record = record | metadata

    return record

Extract the metadata for all the books:

In [13]:
book_records = []
for item in items_:
    book_records.append(get_metadata(item))

book_records[:3]

[{'title': 'A Chinese Wonder Book',
  'path': 'Chinese_wonder_book.html',
  'img_alt': 'Chinese folktales',
  'img_path': 'media/Chinese_folktales.jpg',
  'Note': 'Read 15 Chinese folktales',
  'Author': 'Norman Hinsdale Pitman',
  'Editor': 'Andrew Lang',
  'Published': '1919',
  'Publisher': 'E. P. Dutton & Co., 681 Fifth Avenue, New York'},
 {'title': 'A Hundred fables of La Fontaine',
  'path': 'fables/LaFontaine_fables.html',
  'img_alt': 'La Fontaine book cover ',
  'img_path': 'media/La_Fontaine_book.jpg',
  'Note': '"A Hundred fables of La Fontaine" offers a selection of some of Jean de La Fontaine\'s best known fables. These are all in verse.',
  'Author': 'Jean de La Fontaine',
  'Published': '1900',
  'Publisher': 'John Lane Co., London; New York'},
 {'title': 'A Treasury of Eskimo Tales',
  'path': 'Treasury_Eskimo_Tales.html',
  'img_alt': 'A Treasury of Eskimo Tales',
  'img_path': 'media/Eskimo_Tales.jpg',
  'Note': 'Contains 31 folktales gathered from the Eskimo living 

Get a superset list of all the metadata keys so we can define an appropriate metadata table in the database:

In [14]:
metadata_keys = {k for r in book_records for k in r.keys()}
metadata_keys

{'Author',
 'Compiler',
 'Editor',
 'Note',
 'Published',
 'Publisher',
 'Translator',
 'img_alt',
 'img_path',
 'path',
 'title'}
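Rather than typing the column definitions out by hand, the column-to-type mapping could be derived from the key set. This is a sketch under the assumption that every field is stored as text; `fixed` and `columns` are hypothetical names, and the key set is copied from the output above.

```python
# Key set copied from the output above
metadata_keys = {"Author", "Compiler", "Editor", "Note", "Published",
                 "Publisher", "Translator", "img_alt", "img_path",
                 "path", "title"}

# Put the structural fields first, then the remaining fields alphabetically
fixed = ["title", "path", "img_path", "img_alt"]
columns = {k: str for k in fixed + sorted(metadata_keys - set(fixed))}

print(list(columns))
```

A dictionary of this shape (column name mapped to a Python type) is the kind of structure passed to the table's `.create()` call when defining the table.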

Let's create a database table to store the metadata:

In [15]:
# Do not run this cell if your database already exists!

# While developing the script, recreate database each time...
db = Database(db_name, recreate=True)

# This schema has been evolved iteratively as I have identified structure
# that can be usefully mined...

db["books_metadata"].create({
    "title": str,
    "path": str,
    "img_path": str,
    "img_alt": str,
    "Author": str,
    "Compiler": str,
    "Editor": str,
    "Note": str,
    "Published": str,
    "Publisher": str,   
    "Translator":str,
    
}, pk=("title"))

# Enable full text search
# This creates an extra virtual table (books_metadata_fts) to support the full text search
db["books_metadata"].enable_fts(["title", "Author", "Translator",
                                 "Compiler", "Editor", "Note", "Publisher"], create_triggers=True)

<Table books_metadata (title, path, img_path, img_alt, Author, Compiler, Editor, Note, Published, Publisher, Translator)>
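To get a feel for what a full text search index provides, here is a much simplified, standalone sketch using only the standard library `sqlite3` module. It is not what `sqlite_utils` does internally (which also manages triggers to keep the index in sync); the table names `books` and `books_fts` are hypothetical, and the example assumes the local SQLite build includes the FTS5 extension.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (title TEXT PRIMARY KEY, Publisher TEXT)")
# A standalone FTS5 index over the searchable columns (assumes FTS5 is available)
conn.execute("CREATE VIRTUAL TABLE books_fts USING fts5(title, Publisher)")

rows = [
    ("The Green Fairy Book", "Langmans, Green, and Co., London; New York"),
    ("A Chinese Wonder Book", "E. P. Dutton & Co., 681 Fifth Avenue, New York"),
]
# Without triggers, we have to populate the index ourselves
conn.executemany("INSERT INTO books VALUES (?, ?)", rows)
conn.executemany("INSERT INTO books_fts VALUES (?, ?)", rows)

# MATCH searches the tokenised text of every indexed column
hits = [r[0] for r in conn.execute(
    "SELECT title FROM books_fts WHERE books_fts MATCH ?", ("Langmans",))]
print(hits)
```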

Now we can add our data to it:

In [16]:
db["books_metadata"].upsert_all(book_records, pk=("title"))

<Table books_metadata (title, path, img_path, img_alt, Author, Compiler, Editor, Note, Published, Publisher, Translator)>
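Because `title` is the primary key, any records that share a title will be merged into a single row by the upsert. The scrape logs later in this notebook list *English Fairy Tales* twice, so it may be worth checking for repeats first. A minimal sketch, using made-up sample records (the paths are placeholders, not real site paths) and a hypothetical `duplicate_titles` helper:

```python
from collections import Counter

# Hypothetical sample records; the paths are placeholders, not real site paths
sample_records = [
    {"title": "English Fairy Tales", "path": "placeholder_a.html"},
    {"title": "English Fairy Tales", "path": "placeholder_b.html"},
    {"title": "Celtic Fairy Tales", "path": "placeholder_c.html"},
]


def duplicate_titles(records):
    """Return titles that occur more than once, with their counts."""
    counts = Counter(r["title"] for r in records)
    return {title: n for title, n in counts.items() if n > 1}


print(duplicate_titles(sample_records))
```

Running the same helper over `book_records` would show whether a compound key (for example, title plus path) might be a safer choice.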

Try a query:

In [17]:
for row in db.query("SELECT * FROM books_metadata LIMIT 3"):
    print(row)

{'title': 'A Chinese Wonder Book', 'path': 'Chinese_wonder_book.html', 'img_path': 'media/Chinese_folktales.jpg', 'img_alt': 'Chinese folktales', 'Author': 'Norman Hinsdale Pitman', 'Compiler': None, 'Editor': 'Andrew Lang', 'Note': 'Read 15 Chinese folktales', 'Published': '1919', 'Publisher': 'E. P. Dutton & Co., 681 Fifth Avenue, New York', 'Translator': None}
{'title': 'A Hundred fables of La Fontaine', 'path': 'fables/LaFontaine_fables.html', 'img_path': 'media/La_Fontaine_book.jpg', 'img_alt': 'La Fontaine book cover ', 'Author': 'Jean de La Fontaine', 'Compiler': None, 'Editor': None, 'Note': '"A Hundred fables of La Fontaine" offers a selection of some of Jean de La Fontaine\'s best known fables. These are all in verse.', 'Published': '1900', 'Publisher': 'John Lane Co., London; New York', 'Translator': None}
{'title': 'A Treasury of Eskimo Tales', 'path': 'Treasury_Eskimo_Tales.html', 'img_path': 'media/Eskimo_Tales.jpg', 'img_alt': 'A Treasury of Eskimo Tales', 'Author': 'Cla

We can also search for books using free text search over the metadata:

In [18]:
q = "Langmans" # This is a publisher

print(f"Search on: {q}\n")

for story in db["books_metadata"].search(db.quote_fts(q)):
    print(story)

Search on: Langmans

{'rowid': 60, 'title': 'The Green Fairy Book', 'path': 'fairy_tales/Andrew_Lang_green_fairy_tale_book.html', 'img_path': 'media/green_fairy_book.jpg', 'img_alt': 'Green fairy book Andrew Lang cover', 'Author': 'Various', 'Compiler': None, 'Editor': 'Andrew Lang', 'Note': "The third book from Andrew Lang's collection was first published in 1892 and contains 42 fairy tales.", 'Published': '1892', 'Publisher': 'Langmans, Green, and Co., London; New York', 'Translator': None}
{'rowid': 66, 'title': 'The Red Fairy Book', 'path': 'fairy_tales/Andrew_Lang_red_fairy_tale_book.html', 'img_path': 'media/red_fairy_book.jpg', 'img_alt': 'Red fairy book Andrew Lang cover', 'Author': 'Various', 'Compiler': None, 'Editor': 'Andrew Lang', 'Note': "The second book from Andrew Lang's collection was first published in 1890 and contains 37 fairy tales.", 'Published': '1890', 'Publisher': 'Langmans, Green, and Co., London; New York', 'Translator': None}
{'rowid': 58, 'title': 'The Blue

## Scraping Book Index Pages

The book index pages have links to separate pages for each story.

We need to grab the links to the story pages so we can then scrape each story.

From a quick glance, it looks like the links can be found as `a` elements directly inside `p` elements with class `GM`, which we can pick out with the CSS selector `p.GM > a`:

In [19]:
# No trailing slash: the request URLs below add their own "/" separator
BASE_URL = "https://www.worldoftales.com"

In [20]:
# Get an example page
html = requests.get(f'{BASE_URL}/{book_records[0]["path"]}')
soup = BeautifulSoup(html.content, "html.parser")

items_ = [(a.text, a.get('href')) for a in soup.select("p.GM > a")]
items_

[('1.The Golden Beetle or Why the Dog Hates the Cat',
  'Asian_folktales/Chinese_Folktale_1.html'),
 ('2.The Great Bell', 'Asian_folktales/Chinese_Folktale_2.html'),
 ('3.The Strange Tale of Doctor Dog',
  'Asian_folktales/Chinese_Folktale_3.html'),
 ('4.How Footbinding Started', 'Asian_folktales/Chinese_Folktale_4.html'),
 ('5.The Talking Fish', 'Asian_folktales/Chinese_Folktale_5.html'),
 ('6.Bamboo and the Turtle', 'Asian_folktales/Chinese_Folktale_6.html'),
 ('7.The Mad Goose and the Tiger Forest',
  'Asian_folktales/Chinese_Folktale_7.html'),
 ('8.The Nodding Tiger', 'Asian_folktales/Chinese_Folktale_8.html'),
 ('9.The Princess Kwan-Yin', 'Asian_folktales/Chinese_Folktale_9.html'),
 ('10.The Two Jugglers', 'Asian_folktales/Chinese_Folktale_10.html'),
 ('11.The Phantom Vessel', 'Asian_folktales/Chinese_Folktale_11.html'),
 ('12.The Wooden Tablet', 'Asian_folktales/Chinese_Folktale_12.html'),
 ('13.The Golden Nugget', 'Asian_folktales/Chinese_Folktale_13.html'),
 ('14.The Man Who Wo

Let's see if we can do that for everything...

In [21]:
for book in book_records:
    html = requests.get(f'{BASE_URL}/{book["path"]}')
    soup = BeautifulSoup(html.content, "html.parser")

    book["storylinks"] = [(a.text, a.get('href')) for a in soup.select("p.GM > a")]
    print(f'{len(book["storylinks"])} found for {book["title"]}')

15 found for A Chinese Wonder Book
0 found for A Hundred fables of La Fontaine
0 found for A Treasury of Eskimo Tales
0 found for Aesop's fables
0 found for Andersen's fairy tales
32 found for Australian Legendary Tales
0 found for Canadian fairy tales
27 found for Celtic Fairy Tales
8 found for Child-Life in Japan and Japanese Child Stories
11 found for Chinese Folk-lore Tales
0 found for Christmas stories
28 found for Cossack Fairy Tales and Folk Tales
16 found for Czechoslovak Fairy Tales
0 found for Dutch Fairy Tales for Young Folks
0 found for East of the Sun and West of the Moon
0 found for English Fairy Tales
0 found for English Fairy Tales
12 found for Fairies and Folk of Ireland
18 found for Fairy tales from Brazil
0 found for Fairy Tales from the German Forests
21 found for Fairy Tales of the Slav Peasants and Herdsmen
0 found for Folk-lore and Legends: Germany
0 found for Folk-Lore and Legends: Oriental
0 found for Folk-lore and Legends: Scandinavia
34 found for Folk-Lore an

Okay, so that strategy doesn't work for quite a lot of the index pages...

What if we just grab *all* the links, and then use the heuristic that the chapter link text appears to start with a numerical indicator, followed by a `.`, and then the title (for example, `3.`, `IX.`), at least for the pages I looked at...
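The heuristic could also be expressed as a single regular expression. The sketch below is a slightly stricter variant than a simple split on `.`, since it insists that the index is followed by a dot and some title text; `INDEX_RE` and `split_index` are hypothetical names, and the roman numeral letters are limited to `i`, `v`, `x` and `l`.

```python
import re

# An index is either digits or a run of roman numeral letters, followed by "."
INDEX_RE = re.compile(
    r"^\s*(?P<index>\d+|[ivxl]+)\s*\.\s*(?P<title>.+?)\s*$",
    re.IGNORECASE | re.DOTALL,
)


def split_index(link_text):
    """Return (index, title) if the link text looks like a chapter link, else None."""
    m = INDEX_RE.match(link_text)
    return (m["index"], m["title"]) if m else None


for text in ["3.The Strange Tale of Doctor Dog", "IX. Some Title", "Home"]:
    print(repr(text), "->", split_index(text))
```

Anything that returns `None` (navigation links, for example) can simply be skipped.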

In [22]:
"/".join(["1"])

'1'

In [23]:
for book in book_records:
    # We've cached the pages, so requesting them again is fine...
    html = requests.get(f'{BASE_URL}/{book["path"]}')
    soup = BeautifulSoup(html.content, "html.parser")
    # It seems we have a number, then a . at the start of many pages
    story_links = []
    for a in soup.select("a"):
        if a.text.strip():
            # Extract out a number
            a_ = a.text.split(".")[0].strip()
            # Does it look like a numerical index value, howsoever defined
            if a_.isdigit() or not [x for x in a_ if x.lower() not in "ivxl"]:
                href = a.get('href')
                # NOTE: the paths are relative, so we need to fix any
                # relative offsets
                _path = book["path"].split("/")
                if len(_path)>1:
                    href = f"{'/'.join(_path[:-1])}/{href}"
                story_links.append((a_,
                                    ".".join(a.text.split(".")[1:]).strip(),
                                    href))

    book["storylinks"] = story_links
    print(f'{len(book["storylinks"])} found for {book["title"]}')

15 found for A Chinese Wonder Book
20 found for A Hundred fables of La Fontaine
32 found for A Treasury of Eskimo Tales
16 found for Aesop's fables
18 found for Andersen's fairy tales
32 found for Australian Legendary Tales
27 found for Canadian fairy tales
27 found for Celtic Fairy Tales
8 found for Child-Life in Japan and Japanese Child Stories
11 found for Chinese Folk-lore Tales
16 found for Christmas stories
28 found for Cossack Fairy Tales and Folk Tales
16 found for Czechoslovak Fairy Tales
21 found for Dutch Fairy Tales for Young Folks
16 found for East of the Sun and West of the Moon
44 found for English Fairy Tales
41 found for English Fairy Tales
12 found for Fairies and Folk of Ireland
18 found for Fairy tales from Brazil
10 found for Fairy Tales from the German Forests
21 found for Fairy Tales of the Slav Peasants and Herdsmen
0 found for Folk-lore and Legends: Germany
25 found for Folk-Lore and Legends: Oriental
28 found for Folk-lore and Legends: Scandinavia
34 found for

That looks much more interesting, with just a couple of exceptions... *Momotaro* contains just a single story on its index page, and *Folk-lore and Legends: Germany* actually points to a list of books that we already have.
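The relative path handling could also be delegated to the standard library: `urllib.parse.urljoin` resolves a link relative to the page it was found on, including `../` style offsets. A minimal sketch, where `SITE_ROOT` and `story_url` are hypothetical names and the second example `href` is inferred from the stored La Fontaine paths rather than copied from the page:

```python
from urllib.parse import urljoin

SITE_ROOT = "https://www.worldoftales.com/"


def story_url(book_path, href):
    """Resolve a story href relative to the book index page it was found on."""
    book_url = urljoin(SITE_ROOT, book_path)
    return urljoin(book_url, href)


print(story_url("Chinese_wonder_book.html",
                "Asian_folktales/Chinese_Folktale_1.html"))
print(story_url("fables/LaFontaine_fables.html",
                "LaFontaine_fables/LaFontaine_Fables_1.html"))
```

This returns full URLs rather than site-relative paths, so it would replace both the manual prefixing and the later `BASE_URL` concatenation.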

Our book records should now be annotated with links to their stories:

In [24]:
book_records[0]

{'title': 'A Chinese Wonder Book',
 'path': 'Chinese_wonder_book.html',
 'img_alt': 'Chinese folktales',
 'img_path': 'media/Chinese_folktales.jpg',
 'Note': 'Read 15 Chinese folktales',
 'Author': 'Norman Hinsdale Pitman',
 'Editor': 'Andrew Lang',
 'Published': '1919',
 'Publisher': 'E. P. Dutton & Co., 681 Fifth Avenue, New York',
 'storylinks': [('1',
   'The Golden Beetle or Why the Dog Hates the Cat',
   'Asian_folktales/Chinese_Folktale_1.html'),
  ('2', 'The Great Bell', 'Asian_folktales/Chinese_Folktale_2.html'),
  ('3',
   'The Strange Tale of Doctor Dog',
   'Asian_folktales/Chinese_Folktale_3.html'),
  ('4', 'How Footbinding Started', 'Asian_folktales/Chinese_Folktale_4.html'),
  ('5', 'The Talking Fish', 'Asian_folktales/Chinese_Folktale_5.html'),
  ('6', 'Bamboo and the Turtle', 'Asian_folktales/Chinese_Folktale_6.html'),
  ('7',
   'The Mad Goose and the Tiger Forest',
   'Asian_folktales/Chinese_Folktale_7.html'),
  ('8', 'The Nodding Tiger', 'Asian_folktales/Chinese_Fo

We could create a simple database table at this point to just hold the titles, but let's wait until we have the story text and then create a more complete table.

## Scraping the Story Pages

The next step is to scrape a story page. Once again, let's assume there is a regular structure, build a scraper for a single story, selected more or less at random, and then see how far we get...

From a skim of some pages, it looks like we might be in luck, with the text appearing in a `div` tag with `id=text`:

In [25]:
# Get an example page
html = requests.get(f'{BASE_URL}/{book_records[0]["storylinks"][0][2]}')
soup = BeautifulSoup(html.content, "html.parser")

markdownify(str(soup.select("#text")))[:100]

'[\nWhat we shall eat to-morrow, I haven\'t the slightest idea!" said Widow Wang to her eldest son, as '
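Note the stray `[` at the start of the text. This is most likely because `soup.select()` returns a list of elements, and casting a list to a string wraps it in square brackets; `soup.select_one("#text")` would return a single element instead. If we keep the code as it is, the text can be tidied afterwards. The following is a minimal sketch: `clean_story_text` is a hypothetical helper, and `sample` is a made-up string loosely modelled on the output above.

```python
import re


def clean_story_text(txt):
    """Remove list-bracket residue and collapse runs of blank lines."""
    txt = txt.strip()
    # Drop the brackets left over from converting a list of elements to a string
    if txt.startswith("[") and txt.endswith("]"):
        txt = txt[1:-1]
    # Collapse any whitespace run containing a blank line to one paragraph break
    txt = re.sub(r"\s*\n\s*\n\s*", "\n\n", txt)
    return txt.strip()


sample = '[\nWhat we shall eat to-morrow? \n\n\n "Oh, the gods will provide." \n]'
print(repr(clean_story_text(sample)))
```

An empty result list (`"[]"`) cleans down to an empty string, which is easy to test for.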

So... are we lucky?

In [26]:
import time

def get_story(link, nice=1):
    """Get story text."""
    # Be nice in the scrape
    if nice:
        time.sleep(nice)

    html = requests.get(f'{BASE_URL}/{link[2]}')
    soup = BeautifulSoup(html.content, "html.parser")
    txt = markdownify(str(soup.select("#text")))
    
    return link[0], link[1], txt

We have a cache set up, so for now it doesn't matter if we keep requesting pages (the cache will handle repeated requests.) We've also added a delay into the page request to try to be nice to the hosting server.

Let's try to track down books where we don't appear to be getting stories back:

In [27]:
from tqdm.notebook import tqdm

for book in tqdm(book_records):
    # Once the requests cache has been built, we don't need to be nice
    # and can set the delay to 0
    delay = 0
    if book["storylinks"] and len(get_story(book["storylinks"][0], delay)[2]) < 5: # i.e. not many characters
        print(f'No story text for: {book["title"]} on {book["path"]} at {book["storylinks"][0]}')

  0%|          | 0/73 [00:00<?, ?it/s]

No story text for: A Hundred fables of La Fontaine on fables/LaFontaine_fables.html at ('I', '1.The Grasshopper and the Ant; 2.The Thieves and the Ass; 3.The Wolf Accusing the Fox; 4.The Lion and the Ass Hunting; 5.The Wolf turned Shepherd', 'fables/LaFontaine_fables/LaFontaine_Fables_1.html')


Let's add that to a database.

First, create a story table:

In [28]:
db["tales"].create({
    "title": str,
    "path": str,
    "book": str,
    "text": str,
    "chapter": str,
}, pk=("book", "title"))

# Enable full text search
# This creates an extra virtual table (tales_fts) to support the full text search
db["tales"].enable_fts(["title", "text"], create_triggers=True)

<Table tales (title, path, book, text, chapter)>

And add all the tales to it...

In [29]:
book_records[1]

{'title': 'A Hundred fables of La Fontaine',
 'path': 'fables/LaFontaine_fables.html',
 'img_alt': 'La Fontaine book cover ',
 'img_path': 'media/La_Fontaine_book.jpg',
 'Note': '"A Hundred fables of La Fontaine" offers a selection of some of Jean de La Fontaine\'s best known fables. These are all in verse.',
 'Author': 'Jean de La Fontaine',
 'Published': '1900',
 'Publisher': 'John Lane Co., London; New York',
 'storylinks': [('I',
   '1.The Grasshopper and the Ant; 2.The Thieves and the Ass; 3.The Wolf Accusing the Fox; 4.The Lion and the Ass Hunting; 5.The Wolf turned Shepherd',
   'fables/LaFontaine_fables/LaFontaine_Fables_1.html'),
  ('II',
   '1.The Swan and the Cook; 2.The Weasel in the Granary; 3.The Shepherd and the Sea; 4.The Ass and the Little Dog; 5.The Man and the Wooden God',
   'fables/LaFontaine_fables/LaFontaine_Fables_2.html'),
  ('III',
   '1.The Ears of the Hare; 2.The Old Woman and Her Servants; 3.The Ass Carrying Relics; 4.The Hare and the Partridge; 5.The Lion 

In [30]:
for book in tqdm(book_records):
    tales = []
    for link in book["storylinks"]:
        # Once the pages are cached, pass nice=0 as a second argument to skip the delay
        (num, title, txt) = get_story(link)
        if len(txt)>5:
            tales.append({"book": book["title"],
                          "title": title,
                          # Store just the story path, not the whole link tuple
                          "path": link[2],
                          "text": txt,
                          "chapter": num}
                        )

    db["tales"].upsert_all(tales, pk=("book", "title"))

  0%|          | 0/73 [00:00<?, ?it/s]

Let's check we have some tales...

In [31]:
for row in db.query("SELECT * FROM tales LIMIT 1"):
    print(row)

{'title': 'The Golden Beetle or Why the Dog Hates the Cat', 'path': 'Asian_folktales/Chinese_Folktale_1.html', 'book': 'A Chinese Wonder Book', 'text': '[\nWhat we shall eat to-morrow, I haven\'t the slightest idea!" said Widow Wang to her eldest son, as he started out one morning in search of work. \n\n\n "Oh, the gods will provide. I\'ll find a few coppers somewhere," replied the boy, trying to speak cheerfully, although in his heart he also had not the slightest idea in which direction to turn. \n\n\n The winter had been a hard one: extreme cold, deep snow, and violent winds. The Wang house had suffered greatly. The roof had fallen in, weighed down by heavy snow. Then a hurricane had blown a wall over, and Ming-li, the son, up all night and exposed to a bitter cold wind, had caught pneumonia. Long days of illness followed, with the spending of extra money for medicine. All their scant savings had soon melted away, and at the

And a free text search:

In [36]:
q = 'king "three princesses"' # A keyword plus a quoted phrase

print(f"Search on: {q}\n")

for story in db["tales"].search(db.quote_fts(q)):
    print(story["title"])

Search on: king "three princesses"

The three dogs (Sweden)
The Three Men of PowerEvening, Midnight, and Sunrise
The princess of the brazen mountain
The three princesses of Whiteland
The Queen Bee
The Three Dogs
Silverwhite and Lillwacker
The Gnome
Queen Crane
The Adventures of a Fishermans Son
Story of the Golden Mountain
Prince Bayaya: The Story of a Magic Horse
The Enchanted Pig
The Wonderful Sheep
The Three Princesses of Whiteland
The three princesses in the blue mountain


In [42]:
q = """
SELECT title, snippet(tales_fts, -1, '__', '__', '...', 30) as clip
FROM tales_fts
WHERE tales_fts MATCH 'king "three princesses"' LIMIT 5;
"""

for row in db.query(q):
    print(row)

{'title': 'Prince Bayaya: The Story of a Magic Horse', 'clip': '...the __king__ was ready to set forth.\n\n\nHe handed over the affairs of the castle to Bayaya and also intrusted to him the safety of the __three princesses__. Bayaya did...'}
{'title': 'The three princesses of Whiteland', 'clip': '[\nOnce on a time there was a fisherman who lived close by a palace, and fished for the *__King__’s* table. One day when he was out fishing he just...'}
{'title': 'The three princesses in the blue mountain', 'clip': '[\nThere were once upon a time a *__King__* and *Queen* who had no children, and they took it so much to heart that they hardly ever had a happy moment...'}
{'title': 'The three dogs (Sweden)', 'clip': '...The __king__ was much distressed, for he loved his children more than anything else in the world. So he gave strict orders that the __three princesses__ should be always kept...'}
{'title': 'The Queen Bee', 'clip': "[\n\nTwo __king__'s sons once started to seek adventures, and f