# Gibbs Tiny Tales Collection

Laura Gibbs has a large and still-growing collection of hundred-word tales, published as a series of openly licensed books.

The tales are all 100 words long, so the collection provides a concise set of texts that might be interesting to analyse for structure.

In [1]:
# These packages make it easy to download web pages so that we can work with them
import requests
# "Caching" pages means grabbing a local copy of the page so we only need to download it once
import requests_cache
from datetime import timedelta

requests_cache.install_cache('web_cache',
                             backend='sqlite',
                             expire_after=timedelta(days=1000))

In [2]:
url = "https://microfables.blogspot.com/"

# And then grab the page
html = requests.get(url)

In [3]:
# The BeautifulSoup package provides a range of tools
# that help us work with the downloaded web page,
# such as extracting particular elements from it
from bs4 import BeautifulSoup

# The "soup" is a parsed and structured form of the page we downloaded
soup = BeautifulSoup(html.content, "html.parser")

# We can use browser developer tools to grab a CSS selector
# for the books list, then extend it to pull just the links
selector = "#Label2 > div > ul > li > a"
# Find the anchor (<a>) elements that carry the links
items_ = soup.select(selector)

book_index_urls = {}
for item in items_:
    if item.text not in ["Audiobooks", "Book Indexes", "Book: Teaching Guide"]:
        book = item.text.split(":")[-1].strip()
        book_index_urls[book] = item.get("href")

book_index_urls

{'Aesop': 'https://microfables.blogspot.com/search/label/Book%3A%20Aesop',
 'Africa 1': 'https://microfables.blogspot.com/search/label/Book%3A%20Africa%201',
 'Africa 2': 'https://microfables.blogspot.com/search/label/Book%3A%20Africa%202',
 'Anansi': 'https://microfables.blogspot.com/search/label/Book%3A%20Anansi',
 'India': 'https://microfables.blogspot.com/search/label/Book%3A%20India',
 'Mahabharata': 'https://microfables.blogspot.com/search/label/Book%3A%20Mahabharata',
 'Nasruddin': 'https://microfables.blogspot.com/search/label/Book%3A%20Nasruddin',
 'Ramayana': 'https://microfables.blogspot.com/search/label/Book%3A%20Ramayana',
 'Sufis': 'https://microfables.blogspot.com/search/label/Book%3A%20Sufis'}

On each book page, there are links to different formats of the book. Some have a text version; all have an HTML version. If we make the (dangerous!) assumption that the structure of the HTML pages is the same, we could just scrape those.

But the text pages are simpler, so as a warm up exercise, let's scrape those.

## Scraping the Text Versions 

Let's start by seeing which books have text versions. The book pages have a set of well-labelled links that we can easily find by their link text:

![](images/Tiny_Tales_links.png)

In [4]:
text_urls = {}

for book in book_index_urls:
    html = requests.get(book_index_urls[book])
    soup = BeautifulSoup(html.content, "html.parser")
    # Match on the link text (recent BeautifulSoup versions prefer `string=` over the older `text=`)
    book_url = soup.find('a', string="TXT")
    if book_url:
        text_urls[book] = book_url.get('href')

text_urls

{'Aesop': 'http://aesop.lauragibbs.net/Aesop.txt',
 'Anansi': 'http://Anansi.lauragibbs.net/Anansi.txt',
 'India': 'http://india.lauragibbs.net/India.txt',
 'Mahabharata': 'https://mahabharata.lauragibbs.net/mahabharata.txt',
 'Nasruddin': 'https://nasruddin.lauragibbs.net/Nasruddin.txt',
 'Ramayana': 'https://ramayana.lauragibbs.net/ramayana.txt',
 'Sufis': 'http://Sufis.lauragibbs.net/Sufis.txt'}

All but the two collections of African tales appear to have simple text versions.
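That observation can be checked directly by comparing the keys of the two dictionaries. A minimal sketch, with the book names copied from the two outputs above as literal sets rather than scraped live:

```python
# Book names copied from the two outputs above (not fetched live)
all_books = {"Aesop", "Africa 1", "Africa 2", "Anansi", "India",
             "Mahabharata", "Nasruddin", "Ramayana", "Sufis"}
books_with_text = {"Aesop", "Anansi", "India", "Mahabharata",
                   "Nasruddin", "Ramayana", "Sufis"}

# Set difference gives the books that have no TXT link
missing_text = sorted(all_books - books_with_text)
print(missing_text)
```

In the notebook itself, the equivalent comparison would be `set(book_index_urls) - set(text_urls)`.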

Let's look at the content of one of those text versions:

In [5]:
text_doc = requests.get(text_urls["Aesop"]).text
text_doc[:2500]

'TINY TALES FROM AESOP: \nA Book of Two Hundred 100-Word Stories\nby Laura Gibbs\n\nCopyright 2020 by Laura Gibbs. This work is released under the terms of the Attribution - NonCommercial - ShareAlike 4.0 International (CC BY-NC-SA 4.0).\nVersion date: July 15, 2020\n\nABOUT THIS BOOK\n\nAesop was a legendary storyteller of ancient Greece, and the stories called "Aesop\'s fables" have been going strong for three thousand years. This book contains a selection of classical, medieval, Renaissance, and modern Aesop\'s fables, ranging from the ancient Roman poet Phaedrus to the 18th-century neo-Latin poet Desbillons. You will find famous fables here such as "The Lion\'s Share" and "The Boy Who Cried Wolf," plus many not-so-famous fables about animals, about people, and about the gods and goddesses too. The fables included here represent only a small fraction of the Aesopic fable tradition. For more Aesop\'s fables, visit:\nAesop.LauraGibbs.net\n\nThe paragraph you just read about this book 

Let's also check the end of the book to see if there is structure there:

In [6]:
text_doc[-7000:]



It looks like there is an index, separated by `STORY TITLE INDEX`, so we can clean that out.

From the start of the document, it looks like the titles are wrapped in `~` characters, so we should be able to split the stories, and story titles, quite straightforwardly:

In [7]:
# Get rid of the index
text_doc_ = text_doc.split("STORY TITLE INDEX")[0]

if len(text_doc_) != len(text_doc):
    print("Seems like we removed the index...")

# Split the separate stories (omit the header section)
stories_ = text_doc_.split("\n~")[1:]

# Now parse into the separate stories
stories = []

for s in stories_:
    # Separate out the title and body
    parts = s.split("~")
    stories.append( (parts[0].strip(), parts[1].strip()) )

# Did we get any?
stories[:3]

Seems like we removed the index...


[("1. The Lion's Share",
  'A lion, a cow, a goat, and a sheep were working together as partners.\nThey managed to kill a stag, and the lion divided their prize into four equal parts.\n"The first part is mine," he said, "because I am the lion. The second part goes to me because I am the strongest. Next, I will take the third part for myself on account of my exceedingly hard work. Finally, if anyone so much as touches the fourth part, they will know my wrath!"\nThat is the lion\'s share: he pretends to share, but he takes it all for himself.'),
 ('2. The Angry Lion',
  'There was once an enraged lion, filled with anger and hatred, hoping to find another lion he could fight with and kill.\nThen, as he was looking down into a well, there it was: a lion had fallen in there. \nIt was just his own reflection in the water, of course, but he saw what he wanted to see.\nThe angry lion, convinced he had found the enemy he was hoping to find, sprang and jumped into the well, and he drowned.\nSo i
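The `split`-based approach assumes every chunk contains a closing `~`; a chunk without one would raise an `IndexError` at `parts[1]`. As a hedged alternative, the sketch below uses a regular expression to pull out the `~ title ~` lines and the text that follows each one. The sample text and the `parse_tales` name are made up for illustration, and the pattern assumes each title sits on a line of its own, which has not been checked against every book.

```python
import re

# Hypothetical sample in the assumed "~ title ~" layout
sample = (
    "ABOUT THIS BOOK\n\nSome front matter.\n\n"
    "~ 1. First Tale ~\n\nBody of the first tale.\n\n"
    "~ 2. Second Tale ~\n\nBody of the second tale.\n"
)

# A title is a whole line that starts and ends with "~"
TITLE_RE = re.compile(r"^~(.+?)~[ \t]*$", flags=re.MULTILINE)

def parse_tales(text):
    """Return a list of (title, body) tuples from a Tiny Tales style text."""
    matches = list(TITLE_RE.finditer(text))
    tales = []
    for i, m in enumerate(matches):
        # The body runs from the end of this title to the start of the next
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        tales.append((m.group(1).strip(), text[m.end():end].strip()))
    return tales

tales = parse_tales(sample)
print(tales)
```

Anything before the first title line (the header section) is skipped automatically, because only text that follows a matched title is collected.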

## Scraping all the Text Flavoured Books

Let's see how consistent the text format is by trying to scrape the other books using the same script:

In [8]:
def tiny_text_scrape(book, url):
    """Scrape stories from text version of a Tiny Tales book."""
    text_doc = requests.get(url).text
    
    # Keep track of chunk cleaning progress...
    prev_len = len(text_doc)
    
    # Get rid of the index
    text_doc_ = text_doc.split("STORY TITLE INDEX")[0]
    if len(text_doc_) != prev_len:
        print(f"Seems like we removed the index for {book}...")

    # Get rid of "LIST OF" section
    prev_len = len(text_doc_)
    text_doc_ = text_doc_.split("LIST OF ")[0]
    if len(text_doc_) != prev_len:
        print(f"Seems like we removed a list of something for {book}...")


    # Split the separate stories (omit the header section)
    stories_ = text_doc_.split("\n~")[1:]

    # Now parse into the separate stories
    stories = []

    for s in stories_:
        # Separate out the title and body
        parts = s.split("~")
        stories.append( (parts[0].strip(), parts[1].strip()) )
        
    if stories:
        print(f"Found {len(stories)} tales in {book}")
    else:
        print(f"No stories found for {book}")
    
    return stories

for book in text_urls:
    _ = tiny_text_scrape(book, text_urls[book])

Seems like we removed the index for Aesop...
Found 200 tales in Aesop
Seems like we removed the index for Anansi...
Found 200 tales in Anansi
Seems like we removed the index for India...
Found 200 tales in India
Seems like we removed a list of something for Mahabharata...
Found 200 tales in Mahabharata
Seems like we removed the index for Nasruddin...
Found 200 tales in Nasruddin
Seems like we removed the index for Ramayana...
Found 200 tales in Ramayana
Seems like we removed the index for Sufis...
Found 200 tales in Sufis


Checking the *Mahabharata*, we see that rather than a `STORY TITLE INDEX` at the end of the book, the end matter begins with a `LIST OF CHARACTERS AND GLOSSARY`, which we could retrospectively add to our earlier script.
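One way to fold both cases into a single step is to keep a list of end-matter markers and cut the text at whichever appears first. This is a sketch only: `END_MARKERS` and `trim_end_matter` are names introduced here, the sample document is made up, and the two markers are simply the ones observed in the books scraped so far.

```python
# Markers seen so far that introduce end matter (may need extending)
END_MARKERS = ["STORY TITLE INDEX", "LIST OF CHARACTERS AND GLOSSARY"]

def trim_end_matter(text, markers=END_MARKERS):
    """Cut the text at the earliest marker found; return it unchanged if none."""
    positions = [text.find(m) for m in markers if m in text]
    return text[:min(positions)] if positions else text

# Made-up document for illustration
doc = "~ 1. A Tale ~\n\nOnce...\n\nLIST OF CHARACTERS AND GLOSSARY\n\nName: description"
trimmed = trim_end_matter(doc)
print(repr(trimmed))
```

Using the full heading rather than the shorter `LIST OF ` prefix is a judgment call: it is less likely to clip a tale by accident, but it will miss any book whose end matter uses a different "LIST OF" heading.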

## Creating a Simple Database

Let's create a simple database for the stories. Obvious columns include:

- the book;
- the chapter number;
- the title;
- the story.

As a unique key, we can use the book and the chapter number, or the book and the title.
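To see what a composite key means in practice, here is a small sketch using Python's built-in `sqlite3` module rather than `sqlite_utils`. The table name `demo` and its rows are invented for illustration; the point is only that the same title may appear in two different books, but not twice in the same book.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE demo (
        book TEXT,
        title TEXT,
        chapter_order INTEGER,
        PRIMARY KEY (book, title)
    )
""")

# Same title in two different books: allowed
conn.execute("INSERT INTO demo VALUES ('Book A', 'The Fox', 1)")
conn.execute("INSERT INTO demo VALUES ('Book B', 'The Fox', 1)")

# Same title again in the same book: rejected by the composite key
try:
    conn.execute("INSERT INTO demo VALUES ('Book A', 'The Fox', 2)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM demo").fetchone()[0]
print(duplicate_rejected, row_count)
```

Keying on (book, title) assumes no book repeats a title. If one did, an upsert would silently overwrite the earlier tale, in which case (book, chapter number) would be the safer choice.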

In [9]:
from sqlite_utils import Database

db_name = "gibbs_tiny_tales.db"

# Uncomment the following lines to connect to a pre-existing database
#db = Database(db_name)

In [10]:
# Do not run this cell if your database already exists!

# While developing the script, recreate database each time...
db = Database(db_name, recreate=True)

# This schema has been evolved iteratively as I have identified structure
# that can be usefully mined...

db["tiny_tales"].create({
    "book": str,
    "title": str,
    "text": str,
    "chapter_order": int, # Sort order of stories in book
}, pk=("book", "title"))

# Enable full text search
# This creates an extra virtual table (tiny_tales_fts) to support the full text search
db["tiny_tales"].enable_fts(["title", "text"], create_triggers=True)

<Table tiny_tales (book, title, text, chapter_order)>

Let's now add the tales for each book to the database.

The stories are returned as a list of 2-tuples, each of the form (chapter number and title, text), but we will expand each one into a four-field database row:

In [11]:
for book in text_urls:
    tales = tiny_text_scrape(book, text_urls[book])
    rows = []
    for (title_, txt) in tales:
        # Split "N. Title" into the chapter number and a tidied title
        parts = title_.split(".")
        chapter_num = int(parts[0])
        title = ".".join(parts[1:]).strip(".").strip()
        rows.append( {"book": book,
                      "title": title,
                      "text": txt,
                      "chapter_order": chapter_num } )
    
    db["tiny_tales"].upsert_all(rows, pk=("book", "title" ))

Seems like we removed the index for Aesop...
Found 200 tales in Aesop
Seems like we removed the index for Anansi...
Found 200 tales in Anansi
Seems like we removed the index for India...
Found 200 tales in India
Seems like we removed a list of something for Mahabharata...
Found 200 tales in Mahabharata
Seems like we removed the index for Nasruddin...
Found 200 tales in Nasruddin
Seems like we removed the index for Ramayana...
Found 200 tales in Ramayana
Seems like we removed the index for Sufis...
Found 200 tales in Sufis
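The title tidying in the loop above splits on every `.` and rejoins the pieces, which works but also strips a trailing full stop from any title that ends with one. A hedged alternative is to split only at the first `.`. The helper name `split_numbered_title` and the second sample title below are invented for illustration.

```python
def split_numbered_title(raw):
    """Split '12. Some Title' into (12, 'Some Title')."""
    # partition() splits at the first full stop only,
    # so any later full stops stay in the title untouched
    number, _, title = raw.partition(".")
    return int(number), title.strip()

examples = ["1. The Lion's Share", "42. Mr. Fox and Mrs. Crow"]
parsed = [split_numbered_title(e) for e in examples]
print(parsed)
```

Like the original loop, this raises a `ValueError` if a title does not start with a number, which is a useful early warning that a book departs from the expected format.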


Let's see how many tales we have...

In [12]:
for row in db.query("SELECT COUNT(*) AS number_of_tales FROM tiny_tales"):
    print(row)

{'number_of_tales': 1400}
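With the tales loaded, the claim that each one is 100 words long can be spot-checked. The sketch below counts whitespace-separated tokens in the first Aesop tale, copied from the output shown earlier. The helper name `word_count` is introduced here, and other definitions of a "word" (for example, treating hyphenated words differently) could give different totals.

```python
def word_count(text):
    """Count whitespace-separated tokens."""
    return len(text.split())

# Text of "The Lion's Share", copied from the earlier output
lions_share = (
    'A lion, a cow, a goat, and a sheep were working together as partners.\n'
    'They managed to kill a stag, and the lion divided their prize into four equal parts.\n'
    '"The first part is mine," he said, "because I am the lion. '
    'The second part goes to me because I am the strongest. '
    'Next, I will take the third part for myself on account of my exceedingly hard work. '
    'Finally, if anyone so much as touches the fourth part, they will know my wrath!"\n'
    "That is the lion's share: he pretends to share, but he takes it all for himself."
)

print(word_count(lions_share))
```

Applied across the database, the same function could flag any tale whose count differs from 100, for example by looping over `db["tiny_tales"].rows` and comparing `word_count(row["text"])` against 100.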


Which stories feature both a *dog* and a *cat*?

In [13]:
q = "dog cat"

print(f"Search on: {q}\n")

for story in db["tiny_tales"].search(db.quote_fts(q),
                                columns=["title", "book", "chapter_order"]):
    print(story)

Search on: dog cat

{'title': 'Anansi Takes Pig Home', 'book': 'Anansi', 'chapter_order': 199}
{'title': "Anansi and the Cats' Wedding", 'book': 'Anansi', 'chapter_order': 74}
{'title': 'King Tiger and Anansi', 'book': 'Anansi', 'chapter_order': 28}


In [14]:
q = """
SELECT title, text FROM tiny_tales
WHERE book='Anansi' AND chapter_order=199
"""

for story in db.query(q):
    print(story)

{'title': 'Anansi Takes Pig Home', 'text': 'Anansi was taking Pig home, but Pig wouldn\'t cross the stream.\n"I refuse!" said Pig.\n"Dog, bite Pig!" Dog refused.\n"Stick, beat Dog!" Stick refused.\n"Fire, burn Stick!" Fire refused.\n"Water, douse Fire!" Water refused.\n"Cow, drink Water!" Cow refused.\n"Butcher, kill Cow!" Butcher refused.\n"Rope, hang Butcher!" Rope refused.\n"Rat, gnaw Rope!" Rat refused.\n"Cat, eat Rat!"\n"Gladly!" said Cat, and Cat scared Rat who scared Rope who scared Butcher who scared Cow who scared Water who scared Fire who scared Stick who scared Dog who bit Pig, who jumped the stream.\nAnansi didn\'t pay anybody for helping either!'}


Find the database here: https://github.com/psychemedia/storynotes/blob/main/gibbs_tiny_tales.db?raw=true