# A Full Text Searchable Database of Lang's Fairy Books

In the late 19th and early 20th centuries, Andrew Lang published a series of fairy tale collections, starting with *The Blue Fairy Book* (1889) and then progressing through various other colours to *The Lilac Fairy Book* (1910).

This notebook represents a playful aside in trying to build various searchable contexts over the stories.

To begin, let's ingest the stories into a database and build a full text search over them.

## Obtain Source Texts

We can download the raw text for each of Lang's coloured Fairy Books from the Sacred Texts website. The books are listed on a single index page:

![](images/Sacred-Texts__Lang_Fairy_Books.png)

Let's start by importing some packages that can help us download pages from the Sacred Texts website in an efficient and straightforward way: 

In [1]:
# These packages make it easy to download web pages so that we can work with them
import requests
# "Caching" pages means keeping a local copy of each page so we only need to download it once
import requests_cache
from datetime import timedelta

requests_cache.install_cache('web_cache', backend='sqlite', expire_after=timedelta(days=100))

Given the index page URL, we can easily download the index page:

In [2]:
# Specify the URL of the page we want to download
url = "https://www.sacred-texts.com/neu/lfb/index.htm"

# And then grab the page
html = requests.get(url)

# Preview some of the raw web page / HTML text in the page we just downloaded
html.text[:1000]

'<HTML>\r\n<HEAD>\r\n<link rel="stylesheet" href="../../css/ista.css"><META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=UTF-8">\r\n<link rel="alternate" type="application/rss+xml" title="RSS" href="http://sacred-texts.com/rss/new.xml">\r\n\r\n<META name="description"\r\ncontent="Sacred Texts: Lang Fairy Books">\r\n<META name="keywords"\r\ncontent="Colored Fairy Books Fairy Tales Tale Folklore Folk lore Children Literature">\r\n<TITLE>Sacred-Texts: Lang Fairy Books</TITLE></HEAD>\r\n<BODY>\r\n<table width="800" border="0" align="center" cellpadding="0" cellspacing="0"><tr> \r\n<td height="131" width="200" align="left" valign="top"> \r\n<div align="left"><a href="../../cdshop/index.htm"><img src="../../img/cdad.gif" width="206" height="136" border="0"></a></div>\r\n</td>\r\n<td height="131" width="600" colspan="3"><div align="left"><img src="../../img/menu.jpg" width="600" height="134" usemap="#Map" border="0"><map name="Map"><area shape="rect" coords="33,5,552,78" href="../../

Inspecting the HTML, we see that the books are listed in a `span` tag with an `ista-content` class. Digging further, we notice that the links are in `span` elements with a `c_t` class. We can extract them using BeautifulSoup:

In [3]:
# The BeautifulSoup package provides a range of tools
# that help us work with the downloaded web page,
# such as extracting particular elements from it
from bs4 import BeautifulSoup

# The "soup" is a parsed and structured form of the page we downloaded
soup = BeautifulSoup(html.content, "html.parser")

# Find the span elements containing the links
items_ = soup.find("span", class_="ista-content").find_all("span", class_="c_t")

# Preview the first few extracted <span> elements
items_[:3]

[<span class="c_t"><a href="bl/index.htm">The Blue Fairy Book</a></span>,
 <span class="c_t"><a href="br/index.htm">The Brown Fairy Book</a></span>,
 <span class="c_t"><a href="cr/index.htm">The Crimson Fairy Book</a></span>]

Let's grab just the anchor tags from there:

In [4]:
# The following construction is known as a "list comprehension"
# It generates a list of items (items contained in square brackets, [])
# from another list of items

items_ = [item.find("a") for item in items_]
items_

[<a href="bl/index.htm">The Blue Fairy Book</a>,
 <a href="br/index.htm">The Brown Fairy Book</a>,
 <a href="cr/index.htm">The Crimson Fairy Book</a>,
 <a href="gn/index.htm">The Green Fairy Book</a>,
 <a href="gy/index.htm">The Grey Fairy Book</a>,
 <a href="li/index.htm">The Lilac Fairy Book</a>,
 <a href="ol/index.htm">The Olive Fairy Book</a>,
 <a href="or/index.htm">The Orange Fairy Book</a>,
 <a href="pi/index.htm">The Pink Fairy Book</a>,
 <a href="re/index.htm">The Red Fairy Book</a>,
 <a href="vi/index.htm">The Violet Fairy Book</a>,
 <a href="ye/index.htm">The Yellow Fairy Book</a>]

`````{admonition} List Comprehensions
List comprehensions provide a concise form for defining one list structure based on the contents of another (or more generally, any iterable).

In an expanded form, we might create one list from another using a loop of the form:

```python
new_list = []
for item in items:
    new_list.append( process(item) )
```

In a list comprehension, we might write:

```python
new_list = [process(item) for item in items]
```

`````

The links are *relative* links, which means we need to resolve them relative to the path of the current page.

Obtain the path to the current page:

In [5]:
# Strip the "index.htm" element from the URL to give a "base" URL
base_url = url.replace("index.htm", "")
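
As an alternative to string replacement, the standard library's `urllib.parse.urljoin` resolves a relative link against the URL of the page it appears on. The following is a minimal sketch rather than the approach used in this notebook; `index_url` and `resolve_link` are illustrative names.

```python
from urllib.parse import urljoin

# The index page URL used earlier in the notebook
index_url = "https://www.sacred-texts.com/neu/lfb/index.htm"

def resolve_link(page_url, href):
    """Resolve a relative href against the URL of the page that contains it."""
    return urljoin(page_url, href)

blue_url = resolve_link(index_url, "bl/index.htm")
print(blue_url)

# Parent directory references ("..") are handled too
print(resolve_link(index_url, "../../css/ista.css"))
```

Unlike the `replace()` approach, this does not depend on the page being called `index.htm`.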

Extract the link text (`link.text`) and relative links (`link.get('href')`) from the `<a>` tags and use a Python f-string to generate full links for each book page (`f"{base_url}{link.get('href')}"`):

In [6]:
links = [(link.text, f"{base_url}{link.get('href')}") for link in items_]

# Display some annotated output to see what's going on
print(f"Base URL: {base_url}\nExample links: {links[:3]}")

Base URL: https://www.sacred-texts.com/neu/lfb/
Example links: [('The Blue Fairy Book', 'https://www.sacred-texts.com/neu/lfb/bl/index.htm'), ('The Brown Fairy Book', 'https://www.sacred-texts.com/neu/lfb/br/index.htm'), ('The Crimson Fairy Book', 'https://www.sacred-texts.com/neu/lfb/cr/index.htm')]


```{admonition} Python f-strings

Python's f-strings (*formatted string literals*, [PEP 498](https://docs.python.org/3/whatsnew/3.6.html#whatsnew36-pep498)) are strings prefixed with an `f` character. The strings contain "replacement fields" of code contained within curly braces. The contents of the curly braces are evaluated and included in the returned string.
```
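
As a small illustration of replacement fields (the variable names here are made up for the example), any expression inside the braces is evaluated when the string is created:

```python
base = "https://www.sacred-texts.com/neu/lfb/"
href = "bl/index.htm"

# Simple variable substitution
full_link = f"{base}{href}"

# Replacement fields can hold arbitrary expressions, such as method calls
shout = f"{'blue'.upper()} has {len('blue')} letters"

print(full_link)
print(shout)
```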

We can also grab the publication year for each work:

In [7]:
years_ = soup.find("span", class_="ista-content").find_all("span", class_="c_d")
years = [year.text for year in years_]

And merge those into a metadata record collection:

In [8]:
sacred_metadata = list(zip(links, years))
sacred_metadata[:3]

[(('The Blue Fairy Book', 'https://www.sacred-texts.com/neu/lfb/bl/index.htm'),
  '1889'),
 (('The Brown Fairy Book',
   'https://www.sacred-texts.com/neu/lfb/br/index.htm'),
  '1904'),
 (('The Crimson Fairy Book',
   'https://www.sacred-texts.com/neu/lfb/cr/index.htm'),
  '1903')]

We could now load each of those pages and scrape the download link from it. However, the download links follow a regular pattern (for example, `https://www.sacred-texts.com/neu/lfb/bl/blfb.txt.gz`) that we can derive from the book page URLs:

In [9]:
download_links = []

for (_title, _url) in links:
    # We need to get the "short" colour name of the book
    # which can be found in the URL path...
    book_path = _url.split("/")[-2]
    zip_fn =  f"{book_path}fb.txt.gz"
    zip_url = _url.replace("index.htm", zip_fn)
    
    download_links.append((_title, zip_url))

download_links[:3]

[('The Blue Fairy Book',
  'https://www.sacred-texts.com/neu/lfb/bl/blfb.txt.gz'),
 ('The Brown Fairy Book',
  'https://www.sacred-texts.com/neu/lfb/br/brfb.txt.gz'),
 ('The Crimson Fairy Book',
  'https://www.sacred-texts.com/neu/lfb/cr/crfb.txt.gz')]
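
The same derivation could be wrapped up as a small reusable function. This is just a sketch: `download_url_for` is a hypothetical helper name, and it assumes that every book follows the `<short>/<short>fb.txt.gz` naming pattern seen above.

```python
def download_url_for(book_index_url):
    """Derive the gzipped text URL from a book's index page URL.

    Assumes the <short>/<short>fb.txt.gz naming pattern holds for the book.
    """
    # The short colour name is the last directory in the URL path
    short_name = book_index_url.split("/")[-2]
    return book_index_url.replace("index.htm", f"{short_name}fb.txt.gz")

green_gz = download_url_for("https://www.sacred-texts.com/neu/lfb/gn/index.htm")
print(green_gz)
```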

Now we can download and unzip the files...

In [10]:
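# Import the submodule explicitly: a bare "import urllib" does not guarantee
# that urllib.request is available
import urllib.request

for (_, zip_url) in download_links:
    # Save each file locally under the file name at the end of its URL
    zip_file = zip_url.split("/")[-1]
    urllib.request.urlretrieve(zip_url, zip_file)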

In [11]:
!ls

Ashliman_folk_texts_scraper.ipynb
DBPedia_Aarne-Thompson-Uther_Search.ipynb
Jacobs' Fairy Tales.ipynb
LICENSE
Lang_Doc2Vec.ipynb
MFTD-Multilingual_Folk_Tale_Database.ipynb
Missouri_Tale_Types.ipynb
README.md
Story db search examples.ipynb
Thompson_Motif_Index.ipynb
_build
_config.yml
_toc.yml
ashliman_demo.db
blfb.txt.gz
brfb.txt.gz
crfb.txt.gz
demo.db
docs
gnfb.txt.gz
gyfb.txt.gz
how-to-read-this-book.ipynb
images
lang-fairy-books-db.ipynb
lang-fairy-books-db_PART 1.ipynb
lang-fairy-books-db_PART 2.ipynb
lang-fairy-books-db_PART 3.ipynb
lang-fairy-books-db_PART 4.ipynb
lang_fairy_tale.db
lang_model.gensim
lifb.txt.gz
motifs_demo.db
mtdf_demo.db
olfb.txt.gz
orfb.txt.gz
pifb.txt.gz
preface.md
refb.txt.gz
requirements.txt
tale_types_demo.db
thompson_motif_index.csv
vifb.txt.gz
web_cache.sqlite
yefb.txt.gz


The following function will read in the contents of a local gzip file:

In [12]:
import gzip

def gzip_txt(fn):
    """Open gzip file and extract text."""
    with gzip.open(fn,'rb') as f:
        txt = f.read().decode('UTF-8').replace("\r", "")

    return txt

Let's see how it works:

In [13]:
gzip_txt('gnfb.txt.gz')[:1000]

'\nThe Green Fairy Book, by Andrew Lang, [1892], at sacred-texts.com\n\nTHE GREEN FAIRY BOOK\n\nBy Various\n\nEdited by Andrew Lang\n\nLondon, New York: Longmans, Green, and Co.\n\n[1892]\n\nTo\n\nStella Margaret Alleyne\n\nthe\n\nGreen Fairy Book\n\nis dedicated\n\nThe Green Fairy Book, by Andrew Lang, [1892], at sacred-texts.com\n\nContents\n\n[*To the Friendly Reader]\n\n[*The Blue Bird]\n\n[*The Half-Chick]\n\n[*The Story of Caliph Stork]\n\n[*The Enchanted Watch]\n\n[*Rosanella]\n\n[*Sylvain and Jocosa]\n\n[*Fairy Gifts]\n\n[*Prince Narcissus and the Princess Potentilla]\n\n[*Prince Featherhead and the Princess Celandine]\n\n[*The Three Little Pigs]\n\n[*Heart of Ice]\n\n[*The Enchanted Ring]\n\n[*The Snuff-box]\n\n[*The Golden Blackbird]\n\n[*The Little Soldier]\n\n[*The Magic Swan]\n\n[*The Dirty Shepherdess]\n\n[*The Enchanted Snake]\n\n[*The Biter Bit]\n\n[*King Kojata]\n\n[*Prince Fickle and Fair Helena]\n\n[*Puddocky]\n\n[*The Story of Hok Lee and the Dwarfs]\n\n[*The Story 
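
To check the behaviour without relying on a downloaded book, here is a minimal round-trip sketch: it writes a tiny, made-up gzip file with Windows-style line endings to a temporary directory and reads it back using the same logic as `gzip_txt()`. The sample text and file name are invented for the example.

```python
import gzip
import os
import tempfile

def gzip_txt(fn):
    """Open gzip file and extract text (same logic as the function above)."""
    with gzip.open(fn, "rb") as f:
        return f.read().decode("UTF-8").replace("\r", "")

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_fn = os.path.join(tmp_dir, "sample.txt.gz")

    # Write some text with \r\n line endings into a gzip file
    with gzip.open(sample_fn, "wb") as f:
        f.write("A TITLE\r\n\r\nSome story text.\r\n".encode("UTF-8"))

    sample_txt = gzip_txt(sample_fn)

# The carriage returns have been stripped, leaving plain \n line endings
print(repr(sample_txt))
```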


Select one of the books and read in the book text:

In [15]:
txt = gzip_txt('blfb.txt.gz')

# Preview the first 1500 characters
txt[:1500]

'\nThe Blue Fairy Book, by Andrew Lang, [1889], at sacred-texts.com\n\nTHE BLUE FAIRY BOOK\n\nby Andrew Lang\n\nLondon, New York: Longmans, Green\n\n[1889]\n\nThe Blue Fairy Book, by Andrew Lang, [1889], at sacred-texts.com\n\nCONTENTS\n\n[*THE BRONZE RING]\n\n[*PRINCE HYACINTH AND THE DEAR LITTLE PRINCESS]\n\n[*EAST OF THE SUN AND WEST OF THE MOON]\n\n[*THE YELLOW DWARF]\n\n[*LITTLE RED RIDING-HOOD]\n\n[*THE SLEEPING BEAUTY IN THE WOOD]\n\n[*CINDERELLA; OR, THE LITTLE GLASS SLIPPER]\n\n[*ALADDIN AND THE WONDERFUL LAMP]\n\n[*THE TALE OF A YOUTH WHO SET OUT TO LEARN WHAT FEAR WAS]\n\n[*RUMPELSTILTZKIN]\n\n[*BEAUTY AND THE BEAST]\n\n[*THE MASTER-MAID]\n\n[*WHY THE SEA IS SALT]\n\n[*THE MASTER CAT; OR, PUSS IN BOOTS]\n\n[*FELICIA AND THE POT OF PINKS]\n\n[*THE WHITE CAT]\n\n[*THE WATER-LILY. THE GOLD-SPINNERS]\n\n[*THE TERRIBLE HEAD]\n\n[*THE STORY OF PRETTY GOLDILOCKS]\n\n[*THE HISTORY OF WHITTINGTON]\n\n[*THE WONDERFUL SHEEP]\n\n[*LITTLE THUMB]\n\n[*THE FORTY THIEVES]\n\n[*HANSEL AND GR

## Extract Stories

Having got the contents, let's now extract all the stories.

Within each book, the stories are delimited by a pattern `[fNN]` (for digits `N`). We can use this pattern to split out the stories.

To do this, we'll use the `re` regular expression package:

In [16]:
import re

We can now define a pattern against which we can split each file into separate chunks:

In [17]:
# Split the file into separate chunks delimited by the pattern: [fNN]
# (use a raw string so the backslashes are passed to the regex engine unchanged)
stories = re.split(r"\[f\d{2}\]", txt)

# Strip whitespace at start and end
stories = [s.strip("\n") for s in stories]
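
To see what the split produces, here is a minimal sketch on a short, made-up text that uses the same `[fNN]` delimiters; `sample` and `chunks` are illustrative names.

```python
import re

sample = (
    "Front matter\n\n"
    "[f01]\nFIRST STORY\n\nText one.\n\n"
    "[f02]\nSECOND STORY\n\nText two.\n"
)

# Split on the [fNN] markers; the markers themselves are discarded
chunks = re.split(r"\[f\d{2}\]", sample)

# Strip the newlines left over at the start and end of each chunk
chunks = [c.strip("\n") for c in chunks]

for chunk in chunks:
    print(repr(chunk))
```

The first chunk holds whatever precedes the first marker, which is why the front matter and contents list end up at index `0`.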

## Extract the Contents

The contents appear in the first "story chunk" (index `0`) in the text:

In [18]:
stories[0]

'The Blue Fairy Book, by Andrew Lang, [1889], at sacred-texts.com\n\nTHE BLUE FAIRY BOOK\n\nby Andrew Lang\n\nLondon, New York: Longmans, Green\n\n[1889]\n\nThe Blue Fairy Book, by Andrew Lang, [1889], at sacred-texts.com\n\nCONTENTS\n\n[*THE BRONZE RING]\n\n[*PRINCE HYACINTH AND THE DEAR LITTLE PRINCESS]\n\n[*EAST OF THE SUN AND WEST OF THE MOON]\n\n[*THE YELLOW DWARF]\n\n[*LITTLE RED RIDING-HOOD]\n\n[*THE SLEEPING BEAUTY IN THE WOOD]\n\n[*CINDERELLA; OR, THE LITTLE GLASS SLIPPER]\n\n[*ALADDIN AND THE WONDERFUL LAMP]\n\n[*THE TALE OF A YOUTH WHO SET OUT TO LEARN WHAT FEAR WAS]\n\n[*RUMPELSTILTZKIN]\n\n[*BEAUTY AND THE BEAST]\n\n[*THE MASTER-MAID]\n\n[*WHY THE SEA IS SALT]\n\n[*THE MASTER CAT; OR, PUSS IN BOOTS]\n\n[*FELICIA AND THE POT OF PINKS]\n\n[*THE WHITE CAT]\n\n[*THE WATER-LILY. THE GOLD-SPINNERS]\n\n[*THE TERRIBLE HEAD]\n\n[*THE STORY OF PRETTY GOLDILOCKS]\n\n[*THE HISTORY OF WHITTINGTON]\n\n[*THE WONDERFUL SHEEP]\n\n[*LITTLE THUMB]\n\n[*THE FORTY THIEVES]\n\n[*HANSEL AND GRET

Let's pull out the book name:

In [19]:
# The name appears before the first comma
book = stories[0].split(",")[0]
book

'The Blue Fairy Book'

The Python [`parse`](https://github.com/r1chardj0n3s/parse) package provides a simple way of *matching* patterns using syntax that resembles a string formatting template that could be used to create the strings being matched against.

In [20]:
import parse

Alternatively, we can use this package to extract the title by matching against a template style pattern:

In [21]:
#The Blue Fairy Book, by Andrew Lang, [1889], at sacred-texts.com
metadata = parse.parse("{title}, by Andrew Lang, [{year}]{}, at sacred-texts.com", stories[0])

metadata["title"], metadata["year"]

('The Blue Fairy Book', '1889')
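
For comparison, a rough equivalent using named groups in the standard library `re` module might look like the following. This is a sketch, not what the notebook uses: the `header_pattern` name and the pattern itself are assumptions, and it only examines the start of a header line rather than matching all of `stories[0]`.

```python
import re

header = "The Blue Fairy Book, by Andrew Lang, [1889], at sacred-texts.com"

# Named groups play the role of the {title} and {year} template fields
header_pattern = re.compile(r"(?P<title>.+?), by Andrew Lang, \[(?P<year>\d{4})\]")

match = header_pattern.match(header)
info = match.groupdict() if match else {}
print(info)
```

The `parse` template is arguably easier to read; the regular expression gives tighter control, for example by insisting that the year is exactly four digits.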

There are plenty of cribs to help us pull out the contents, although with the early content items it may not be obvious whether they are stories or not...

In [22]:
# There is a Contents header, but its casing varies between books...
# So split on either the title case or the upper case form
boilerplate = re.split('(Contents|CONTENTS)', stories[0])
boilerplate

['The Blue Fairy Book, by Andrew Lang, [1889], at sacred-texts.com\n\nTHE BLUE FAIRY BOOK\n\nby Andrew Lang\n\nLondon, New York: Longmans, Green\n\n[1889]\n\nThe Blue Fairy Book, by Andrew Lang, [1889], at sacred-texts.com\n\n',
 'CONTENTS',
 '\n\n[*THE BRONZE RING]\n\n[*PRINCE HYACINTH AND THE DEAR LITTLE PRINCESS]\n\n[*EAST OF THE SUN AND WEST OF THE MOON]\n\n[*THE YELLOW DWARF]\n\n[*LITTLE RED RIDING-HOOD]\n\n[*THE SLEEPING BEAUTY IN THE WOOD]\n\n[*CINDERELLA; OR, THE LITTLE GLASS SLIPPER]\n\n[*ALADDIN AND THE WONDERFUL LAMP]\n\n[*THE TALE OF A YOUTH WHO SET OUT TO LEARN WHAT FEAR WAS]\n\n[*RUMPELSTILTZKIN]\n\n[*BEAUTY AND THE BEAST]\n\n[*THE MASTER-MAID]\n\n[*WHY THE SEA IS SALT]\n\n[*THE MASTER CAT; OR, PUSS IN BOOTS]\n\n[*FELICIA AND THE POT OF PINKS]\n\n[*THE WHITE CAT]\n\n[*THE WATER-LILY. THE GOLD-SPINNERS]\n\n[*THE TERRIBLE HEAD]\n\n[*THE STORY OF PRETTY GOLDILOCKS]\n\n[*THE HISTORY OF WHITTINGTON]\n\n[*THE WONDERFUL SHEEP]\n\n[*LITTLE THUMB]\n\n[*THE FORTY THIEVES]\n\n[*HANS

In [23]:
# The name of the book repeats at the end of the content block
# So snip it out... 
contents_ = boilerplate[-1].split(book)[0].strip("\n")
contents_

'[*THE BRONZE RING]\n\n[*PRINCE HYACINTH AND THE DEAR LITTLE PRINCESS]\n\n[*EAST OF THE SUN AND WEST OF THE MOON]\n\n[*THE YELLOW DWARF]\n\n[*LITTLE RED RIDING-HOOD]\n\n[*THE SLEEPING BEAUTY IN THE WOOD]\n\n[*CINDERELLA; OR, THE LITTLE GLASS SLIPPER]\n\n[*ALADDIN AND THE WONDERFUL LAMP]\n\n[*THE TALE OF A YOUTH WHO SET OUT TO LEARN WHAT FEAR WAS]\n\n[*RUMPELSTILTZKIN]\n\n[*BEAUTY AND THE BEAST]\n\n[*THE MASTER-MAID]\n\n[*WHY THE SEA IS SALT]\n\n[*THE MASTER CAT; OR, PUSS IN BOOTS]\n\n[*FELICIA AND THE POT OF PINKS]\n\n[*THE WHITE CAT]\n\n[*THE WATER-LILY. THE GOLD-SPINNERS]\n\n[*THE TERRIBLE HEAD]\n\n[*THE STORY OF PRETTY GOLDILOCKS]\n\n[*THE HISTORY OF WHITTINGTON]\n\n[*THE WONDERFUL SHEEP]\n\n[*LITTLE THUMB]\n\n[*THE FORTY THIEVES]\n\n[*HANSEL AND GRETTEL]\n\n[*SNOW-WHITE AND ROSE-RED]\n\n[*THE GOOSE-GIRL]\n\n[*TOADS AND DIAMONDS]\n\n[*PRINCE DARLING]\n\n[*BLUE BEARD]\n\n[*TRUSTY JOHN]\n\n[*THE BRAVE LITTLE TAILOR]\n\n[*A VOYAGE TO LILLIPUT]\n\n[*THE PRINCESS ON THE GLASS HILL]\n\n[*

We note that `contents_` contains a string with repeated end of line elements (`\n\n`) separating the titles in the form `[*STORY TITLE]` (for example, `[*LITTLE RED RIDING-HOOD]`).

We can parse out titles from the contents list based on the pattern delimiter `[*EXTRACT THIS PATTERN]`:

In [24]:
# Match against [* and ] and extract everything in between
contents = parse.findall("[*{}]", contents_)

# The title text available as item.fixed[0]
# Also convert the title to title case
titles = [item.fixed[0].title() for item in contents]
titles

['The Bronze Ring',
 'Prince Hyacinth And The Dear Little Princess',
 'East Of The Sun And West Of The Moon',
 'The Yellow Dwarf',
 'Little Red Riding-Hood',
 'The Sleeping Beauty In The Wood',
 'Cinderella; Or, The Little Glass Slipper',
 'Aladdin And The Wonderful Lamp',
 'The Tale Of A Youth Who Set Out To Learn What Fear Was',
 'Rumpelstiltzkin',
 'Beauty And The Beast',
 'The Master-Maid',
 'Why The Sea Is Salt',
 'The Master Cat; Or, Puss In Boots',
 'Felicia And The Pot Of Pinks',
 'The White Cat',
 'The Water-Lily. The Gold-Spinners',
 'The Terrible Head',
 'The Story Of Pretty Goldilocks',
 'The History Of Whittington',
 'The Wonderful Sheep',
 'Little Thumb',
 'The Forty Thieves',
 'Hansel And Grettel',
 'Snow-White And Rose-Red',
 'The Goose-Girl',
 'Toads And Diamonds',
 'Prince Darling',
 'Blue Beard',
 'Trusty John',
 'The Brave Little Tailor',
 'A Voyage To Lilliput',
 'The Princess On The Glass Hill',
 'The Story Of Prince Ahmed And The Fairy Paribanou',
 'The History O
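
One thing to watch for with `.title()` is that it capitalises the letter after an apostrophe. The sketch below uses a made-up contents fragment and a standard library regular expression in place of `parse.findall()` to show the effect, along with a simple string replacement that repairs possessives.

```python
import re

# A made-up contents fragment in the same [*TITLE] format
contents_sample = "[*THE BRONZE RING]\n\n[*THE KING'S SON]"

# Capture everything between "[*" and the next "]"
raw_titles = re.findall(r"\[\*(.+?)\]", contents_sample)

# Title casing treats the apostrophe as a word boundary...
titles_sample = [t.title() for t in raw_titles]
print(titles_sample)

# ...so possessives need patching up afterwards
fixed_titles = [t.replace("'S", "'s") for t in titles_sample]
print(fixed_titles)
```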

## Coping With Page Numbers

There seems to be work in progress adding page numbers to books using a pattern of the form `[p. ix]`, `[p. 1]`, `[p. 11]` and so on.

For now, let's create a regular expression substitution to remove those...
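
As a minimal sketch of the idea on short, made-up fragments (the `PAGE_NUMBER` and `strip_page_numbers` names are illustrative), note that the text on either side of a removed marker is joined directly. A marker that falls mid-sentence therefore glues two words together unless something like a space is substituted instead.

```python
import re

# Optional leading newlines, a [p. N] or [P. N] marker, then a blank line
PAGE_NUMBER = re.compile(r'\n*\[[pP]\. [^\]\s]*\]\n\n')

def strip_page_numbers(text, replacement=''):
    """Remove page number markers, optionally substituting a replacement."""
    return PAGE_NUMBER.sub(replacement, text)

mid_sentence = "stretched\n\n[p. 22]\n\nherself out in the sun."

joined = strip_page_numbers(mid_sentence)
spaced = strip_page_numbers(mid_sentence, ' ')
print(joined)
print(spaced)

# Roman numeral page numbers are matched too
print(strip_page_numbers("[p. ix]\n\nTHE TITLE"))
```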

In [25]:
example = """[f01]
[p. ix]

THE YELLOW FAIRY BOOK

THE CAT AND THE MOUSE IN PARTNERSHIP

A cat had made acquaintance with a mouse, and had spoken so much of the great love and friendship she felt for her, that at last the Mouse consented to live in the same house with her, and to go shares in the housekeeping.  'But we must provide for the winter or else we shall suffer hunger,' said the Cat.  'You, little Mouse, cannot venture everywhere in case you run at last into a trap.'  This good counsel was followed, and a little pot of fat was bought.  But they did not know where to put it.  At length, after long consultation, the Cat said, 'I know of no place where it could be better put than in the church.  No one will trouble to take it away from there.  We will hide it in a corner, and we won't touch it till we are in want.'  So the little pot was placed in safety; but it was not long before the Cat had a great longing for it, and said to the Mouse, 'I wanted to tell you, little Mouse, that my cousin has a little son, white with brown spots, and she wants me to be godmother to it.  Let me go out to-day, and do you take care of the house alone.'

[p. 1]

'Yes, go certainly,' replied the Mouse, 'and when you eat anything good, think of me; I should very much like a drop of the red christening wine.'

But it was all untrue.  The Cat had no cousin, and had not been asked to be godmother.  She went straight to the church, slunk to the little pot of fat, began to lick it, and licked the top off.  Then she took a walk on the roofs of the town, looked at the view, stretched

[P. 22]

herself out in the sun, and licked her lips whenever she thought of the little pot of fat.  As soon as it was evening she went home again.
"""

# Example of regex to remove page numbers
re.sub(r'\n*\[[pP]\. [^\]\s]*\]\n\n', '', example)

"[f01]THE YELLOW FAIRY BOOK\n\nTHE CAT AND THE MOUSE IN PARTNERSHIP\n\nA cat had made acquaintance with a mouse, and had spoken so much of the great love and friendship she felt for her, that at last the Mouse consented to live in the same house with her, and to go shares in the housekeeping.  'But we must provide for the winter or else we shall suffer hunger,' said the Cat.  'You, little Mouse, cannot venture everywhere in case you run at last into a trap.'  This good counsel was followed, and a little pot of fat was bought.  But they did not know where to put it.  At length, after long consultation, the Cat said, 'I know of no place where it could be better put than in the church.  No one will trouble to take it away from there.  We will hide it in a corner, and we won't touch it till we are in want.'  So the little pot was placed in safety; but it was not long before the Cat had a great longing for it, and said to the Mouse, 'I wanted to tell you, little Mouse, that my cousin has a 

## Pulling the Parser Together

Let's create a function to parse the book for us by pulling together all the previous fragments:

In [26]:
def parse_book(txt):
    """Parse book from text."""
    
    # Get story chunks
    stories = re.split(r"\[f\d{2}\]", txt)
    stories = [s.strip("\n") for s in stories]
    
    # Get book name
    book = stories[0].split(",")[0]
    
    # Process contents
    boilerplate = re.split('(Contents|CONTENTS)', stories[0])

    # The name of the book repeats at the end of the content block
    # So snip it out... 
    contents_ = boilerplate[-1].split(book)[0].strip("\n")
    
    # Match against [* and ] and extract everything in between
    contents = parse.findall("[*{}]", contents_)

    # Get titles from contents
    titles = [item.fixed[0].title() for item in contents]
    
    # Get metadata
    metadata = parse.parse("{title}, by Andrew Lang, [{year}]{}, at sacred-texts.com", stories[0]).named

    return book, stories, titles, metadata

## Create Simple Database Structure

Let's create a simple database structure and configure it for full text search.

We'll use SQLite3 for the database. One of the easiest ways of working with SQLite3 databases is via the [`sqlite_utils`](https://sqlite-utils.datasette.io/en/stable/) package.

In [27]:
from sqlite_utils import Database

Specify the database filename (and optionally connect to the database if it already exists):

In [28]:
db_name = "demo.db"

# Uncomment the following lines to connect to a pre-existing database
#db = Database(db_name)

The following will create a new database (or overwrite a pre-existing one of the same name) and define the database tables we require.

Note that we also enable full text search on the `books` table, which creates an extra virtual table (`books_fts`) to support the search.

In [29]:
# Do not run this cell if your database already exists!

# While developing the script, recreate database each time...
db = Database(db_name, recreate=True)

# This schema has been evolved iteratively as I have identified structure
# that can be usefully mined...

db["books"].create({
    "book": str,
    "title": str,
    "text": str,
    "last_para": str, # sometimes contains provenance
    "first_line": str, # maybe we want to review the openings, or create an index...
    "provenance": str, # attempt at provenance
    "chapter_order": int, # Sort order of stories in book
}, pk=("book", "title"))

db["books_metadata"].create({
    "title": str,
    "year": int
})

# Enable full text search
# This creates an extra virtual table (books_fts) to support the full text search
db["books"].enable_fts(["title", "text"], create_triggers=True)

<Table books (book, title, text, last_para, first_line, provenance, chapter_order)>

## Build Database

Let's now create a function that can populate our database based on the contents of one of the books:

In [30]:
def extract_book_stories(db_tbl, book, stories, titles=None, quiet=False):
    """Extract each story from a parsed book and upsert it into a database table."""
    book_items = []

    # The titles are from the contents list
    # We will actually grab titles from the story
    # but the titles grabbed from the contents can be passed in
    # if we want to write a check against them.
    # Note: there may be punctuation differences in the title in the contents
    # and the actual title in the text
    for i, story in enumerate(stories[1:]):
        # Remove the page numbers for now...
        story = re.sub(r'\n*\[[pP]\. [^\]\s]*\]\n\n', '', story).strip("\n")
        # Other cleaning
        story = re.sub(r'\[\*\d+\s*\]', '', story)
        
        # Get the title from the start of the story text
        story_ = story.split("\n\n")
        title_ = story_[0].strip()

        # Force the title case variant of the title
        title = title_.title().replace("'S", "'s")
    
        # Optionally display the titles and the book
        if not quiet:
            print(f"{title} :: {book}")

        # Reassemble the story
        text = "\n\n".join(story_[1:])
        
        # Clean out the name of the book if it is in the text
        #The Green Fairy Book, by Andrew Lang, [1892], at sacred-texts.com
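        # Raw f-string keeps the regex backslashes intact; escape the book name in case it contains regex characters
        name_ignorecase = re.compile(rf"{re.escape(book)}, by Andrew Lang, \[\d*\], at sacred-texts.com", re.IGNORECASE)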
        text = name_ignorecase.sub('', text).strip()
        
        # Extract the first line then add the full stop back in.
        first_line = text.split("\n")[0].split(".")[0] + "."
        
        last_para = text.split("\n")[-1]
        
        provenance_1 = parse.parse('[{}] {provenance}', last_para)
        provenance_2 = parse.parse('[{provenance}]', last_para)
        provenance_3 = parse.parse('({provenance})', last_para)
        provenance_4 = {"provenance":last_para} if len(last_para.split())<7 else {} # Heuristic
        provenance_ = provenance_1 or provenance_2 or provenance_3 or provenance_4
        
        provenance = provenance_["provenance"] if provenance_ else ""
        book_items.append({"book": book,
                           "title": title,
                           "text": text,
                           "last_para": last_para,
                           "first_line": first_line,
                           "provenance": provenance,
                           "chapter_order":i})
    
    # The upsert means "add or replace"
    db_tbl.upsert_all(book_items, pk=("book", "title" ))
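
The first line and last paragraph heuristics used in the function can be tried out in isolation. The following sketch uses a made-up story text; `summarise_story` is a hypothetical helper, and the bracket check is a simplified standard library stand-in for the `[{provenance}]` template matched with `parse` above.

```python
def summarise_story(text):
    """Apply the first line / last paragraph heuristics to a story text."""
    # Everything up to the first full stop on the first line
    first_line = text.split("\n")[0].split(".")[0] + "."
    # The final line of the text, which sometimes holds the provenance
    last_para = text.split("\n")[-1]

    # Simplified stand-in for the '[{provenance}]' parse template
    if last_para.startswith("[") and last_para.endswith("]"):
        provenance = last_para[1:-1]
    else:
        provenance = ""

    return {"first_line": first_line, "last_para": last_para, "provenance": provenance}

sample_story = (
    "Once there lived a king. He had three sons.\n\n"
    "They set out together.\n\n"
    "[From a made-up source.]"
)
summary = summarise_story(sample_story)
print(summary)

# Splitting on the first full stop is fooled by abbreviations
print(summarise_story("Mr. Fox went out.")["first_line"])
```

The second call shows one limitation of the heuristic: an opening such as "Mr. Fox went out." is cut short at the abbreviation.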

We can add the data for a particular book by passing in the titles and stories:

In [31]:
book, stories, titles, metadata = parse_book(txt)

extract_book_stories(db["books"], book, stories)

The Bronze Ring :: The Blue Fairy Book
Prince Hyacinth And The Dear Little Princess :: The Blue Fairy Book
East Of The Sun And West Of The Moon :: The Blue Fairy Book
The Yellow Dwarf :: The Blue Fairy Book
Little Red Riding Hood :: The Blue Fairy Book
The Sleeping Beauty In The Wood :: The Blue Fairy Book
Cinderella, Or The Little Glass Slipper :: The Blue Fairy Book
Aladdin And The Wonderful Lamp :: The Blue Fairy Book
The Tale Of A Youth Who Set Out To Learn What Fear Was :: The Blue Fairy Book
Rumpelstiltzkin :: The Blue Fairy Book
Beauty And The Beast :: The Blue Fairy Book
The Master-Maid :: The Blue Fairy Book
Why The Sea Is Salt :: The Blue Fairy Book
The Master Cat; Or, Puss In Boots :: The Blue Fairy Book
Felicia And The Pot Of Pinks :: The Blue Fairy Book
The White Cat :: The Blue Fairy Book
The Water-Lily. The Gold-Spinners :: The Blue Fairy Book
The Terrible Head :: The Blue Fairy Book
The Story Of Pretty Goldilocks :: The Blue Fairy Book
The History Of Whittington :: The 

We can now run a full text search over the stories. For example, if we are looking for a story with a king and three sons:

In [32]:
q = 'king "three sons"'

# The `.search()` method knows how to find the full text search table
# given the original table name
for story in db["books"].search(db.quote_fts(q), columns=["title", "book"]):
    print(story)

{'title': 'The Master Cat; Or, Puss In Boots', 'book': 'The Blue Fairy Book'}
{'title': 'The White Cat', 'book': 'The Blue Fairy Book'}
{'title': 'The Story Of Prince Ahmed And The Fairy Paribanou', 'book': 'The Blue Fairy Book'}


We can also construct a full text search query over the full text search virtual table explicitly:

In [33]:
q2 = 'king "three sons" goose'

_q = f'SELECT title FROM books_fts WHERE books_fts MATCH {db.quote(q2)} ;'

for row in db.query(_q):
    print(row["title"])

The full text search also allows us to select snippets around the search term:

In [34]:
q3 = '"three sons"'

_q = f"""
SELECT title, snippet(books_fts, -1, "__", "__", "...", 30) as clip
FROM books_fts WHERE books_fts MATCH {db.quote(q3)} LIMIT 2 ;
"""

for row in db.query(_q):
    print(row["clip"]+'\n---\n')

There was a miller who left no more estate to the __three sons__ he had than his mill, his ass, and his cat. The partition was soon made. Neither scrivener...
---

Once upon a time there was a king who had __three sons__, who were all so clever and brave that he began to be afraid that they would want to...
---
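
To see the `MATCH` behaviour in isolation, here is a minimal sketch using the standard library `sqlite3` module and an in-memory database. The table name, sample rows and `fts_titles` helper are made up for the example, and it assumes that the SQLite build bundled with your Python includes the FTS5 extension.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Assumption: the underlying SQLite library was compiled with FTS5 support
conn.execute("CREATE VIRTUAL TABLE stories_fts USING fts5(title, text)")
conn.executemany(
    "INSERT INTO stories_fts (title, text) VALUES (?, ?)",
    [
        ("Story A", "A king had three sons and a golden goose."),
        ("Story B", "A miller had three sons and a cat."),
        ("Story C", "A king had three daughters."),
    ],
)

def fts_titles(query):
    """Return the titles of rows matching a full text search query."""
    rows = conn.execute(
        "SELECT title FROM stories_fts WHERE stories_fts MATCH ? ORDER BY title",
        (query,),
    )
    return [title for (title,) in rows]

# Separate terms must all be present; a quoted phrase must appear as written
print(fts_titles('king "three sons"'))
print(fts_titles('"three sons"'))
```

Passing the query as a bound parameter (`?`) avoids having to quote it into the SQL string by hand.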



We can now create a complete database of Lang's collected fairy stories by churning through all the books and adding them to the database:

In [35]:
import os

for fn in [fn for fn in os.listdir() if fn.endswith(".gz")]:
    # Read in book from gzip file
    txt = gzip_txt(fn)
    # Parse book
    book, stories, titles, metadata = parse_book(txt)
    
    #Populate metadata table
    db["books_metadata"].upsert(metadata, pk=("title", "year"))
    
    # Extract stories and add them to the database
    # The records are upserted (added or replaced) so we won't get duplicate records
    # for the book we have already loaded into the database
    extract_book_stories(db["books"], book, stories, quiet=True)

How many books are there?

In [36]:
for row in db.query('SELECT * FROM books_metadata ORDER BY year ASC'):
    print(row)

{'title': 'The Pink Fairy Book', 'year': 1889}
{'title': 'The Blue Fairy Book', 'year': 1889}
{'title': 'The Yellow Fairy Book', 'year': 1889}
{'title': 'The Red Fairy Book', 'year': 1890}
{'title': 'The Green Fairy Book', 'year': 1892}
{'title': 'The Grey Fairy Book', 'year': 1900}
{'title': 'The Violet Fairy Book', 'year': 1901}
{'title': 'The Crimson Fairy Book', 'year': 1903}
{'title': 'The Brown Fairy Book', 'year': 1904}
{'title': 'The Orange Fairy Book', 'year': 1906}
{'title': 'The Olive Fairy Book', 'year': 1907}
{'title': 'The Lilac Fairy Book', 'year': 1910}


Okay - the titles are fine, but the years look a bit shonky to me: the Pink and Yellow books are both listed as 1889, for example.

The dates are okay if we use the ones from the Sacred Texts listing page that we previously grabbed into `sacred_metadata`:

In [37]:
new_metadata = []
for m in sacred_metadata:
    new_metadata.append({"title": m[0][0], "year": m[1]})
    
new_metadata

[{'title': 'The Blue Fairy Book', 'year': '1889'},
 {'title': 'The Brown Fairy Book', 'year': '1904'},
 {'title': 'The Crimson Fairy Book', 'year': '1903'},
 {'title': 'The Green Fairy Book', 'year': '1892'},
 {'title': 'The Grey Fairy Book', 'year': '1900'},
 {'title': 'The Lilac Fairy Book', 'year': '1910'},
 {'title': 'The Olive Fairy Book', 'year': '1910'},
 {'title': 'The Orange Fairy Book', 'year': '1906'},
 {'title': 'The Pink Fairy Book', 'year': '1897'},
 {'title': 'The Red Fairy Book', 'year': '1890'},
 {'title': 'The Violet Fairy Book', 'year': '1901'},
 {'title': 'The Yellow Fairy Book', 'year': '1894'}]

Replace the `books_metadata` table:

In [38]:
# The truncate=True clears the records from the original table
db["books_metadata"].insert_all(new_metadata, pk=("title", "year"), truncate=True)

<Table books_metadata (title, year)>

In [39]:
for row in db.query('SELECT * FROM books_metadata ORDER BY year ASC'):
    print(row)

{'title': 'The Blue Fairy Book', 'year': 1889}
{'title': 'The Red Fairy Book', 'year': 1890}
{'title': 'The Green Fairy Book', 'year': 1892}
{'title': 'The Yellow Fairy Book', 'year': 1894}
{'title': 'The Pink Fairy Book', 'year': 1897}
{'title': 'The Grey Fairy Book', 'year': 1900}
{'title': 'The Violet Fairy Book', 'year': 1901}
{'title': 'The Crimson Fairy Book', 'year': 1903}
{'title': 'The Brown Fairy Book', 'year': 1904}
{'title': 'The Orange Fairy Book', 'year': 1906}
{'title': 'The Lilac Fairy Book', 'year': 1910}
{'title': 'The Olive Fairy Book', 'year': 1910}


That looks a bit better.

How many stories do we now have with a king and three sons?

In [40]:
print(f"Search on: {q}\n")

for story in db["books"].search(db.quote_fts(q), columns=["title", "book"]):
    print(story)

Search on: king "three sons"

{'title': 'The Three Brothers', 'book': 'The Pink Fairy Book'}
{'title': 'The Princess Who Was Hidden Underground', 'book': 'The Violet Fairy Book'}
{'title': 'The Black Thief And Knight Of The Glen.', 'book': 'The Red Fairy Book'}
{'title': 'Blockhead-Hans', 'book': 'The Yellow Fairy Book'}
{'title': 'The Golden Lion', 'book': 'The Pink Fairy Book'}
{'title': 'The Golden Goose', 'book': 'The Red Fairy Book'}
{'title': 'The Master Cat; Or, Puss In Boots', 'book': 'The Blue Fairy Book'}
{'title': 'The Enchanted Watch', 'book': 'The Green Fairy Book'}
{'title': 'The Story Of The Fair Circassians', 'book': 'The Grey Fairy Book'}
{'title': 'The Norka', 'book': 'The Red Fairy Book'}
{'title': 'Tritill, Litill, And The Birds', 'book': 'The Crimson Fairy Book'}
{'title': 'The Seven Foals', 'book': 'The Red Fairy Book'}
{'title': 'The Witch And Her Servants', 'book': 'The Yellow Fairy Book'}
{'title': 'The Flying Ship', 'book': 'The Yellow Fairy Book'}
{'title': "

How about Jack stories?

In [41]:
for story in db["books"].search("Jack", columns=["title", "book"]):
    print(story)

{'title': 'The History Of Jack The Giant-Killer', 'book': 'The Blue Fairy Book'}
{'title': 'Jack And The Beanstalk', 'book': 'The Red Fairy Book'}
{'title': 'Jack My Hedgehog', 'book': 'The Green Fairy Book'}
{'title': 'The Three Treasures Of The Giants', 'book': 'The Orange Fairy Book'}
{'title': 'Farmer Weatherbeard', 'book': 'The Red Fairy Book'}
{'title': 'The Shirt-Collar', 'book': 'The Pink Fairy Book'}
{'title': 'To The Friendly Reader', 'book': 'The Green Fairy Book'}
{'title': 'Preface', 'book': 'The Red Fairy Book'}
{'title': 'Preface', 'book': 'The Orange Fairy Book'}
{'title': 'The Princess Mayblossom', 'book': 'The Red Fairy Book'}
{'title': 'Tale Of A Tortoise And Of A Mischievous Monkey', 'book': 'The Brown Fairy Book'}


Ah... so maybe *Preface* is something we could also catch and exclude... And perhaps *To The Friendly Reader* as a special exception.
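One way of catching those might be to filter the search results against a small list of front matter titles. The following is only a sketch: the `FRONT_MATTER_TITLES` set and the helper function names are made up for illustration, and the sample rows are copied from the search output above.

```python
# Hypothetical set of non-story titles to exclude (compared in lower case);
# extend it as other front matter turns up.
FRONT_MATTER_TITLES = {"preface", "to the friendly reader"}


def is_front_matter(title):
    """Return True if the title looks like front matter rather than a story."""
    return title.strip().lower() in FRONT_MATTER_TITLES


def drop_front_matter(rows):
    """Keep only the rows (dicts with a 'title' key) that look like stories."""
    return [row for row in rows if not is_front_matter(row["title"])]


# A few rows copied from the "Jack" search results
results = [
    {"title": "Jack And The Beanstalk", "book": "The Red Fairy Book"},
    {"title": "To The Friendly Reader", "book": "The Green Fairy Book"},
    {"title": "Preface", "book": "The Red Fairy Book"},
    {"title": "The Princess Mayblossom", "book": "The Red Fairy Book"},
]

stories_only = drop_front_matter(results)
for story in stories_only:
    print(story)
```

Because the search method yields a `dict` per row, the same filter could probably be applied directly to the search results, although it might be tidier to exclude the front matter when the books are first parsed.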

Or Hans?

In [42]:
for story in db["books"].search("Hans", columns=["title", "book"]):
    print(story)

{'title': "Hans, The Mermaid's Son", 'book': 'The Pink Fairy Book'}
{'title': 'The Headless Dwarfs', 'book': 'The Violet Fairy Book'}
{'title': 'Blockhead-Hans', 'book': 'The Yellow Fairy Book'}
{'title': 'The Underground Workers', 'book': 'The Violet Fairy Book'}
{'title': 'The Magic Book', 'book': 'The Orange Fairy Book'}
{'title': 'Preface', 'book': 'The Yellow Fairy Book'}
{'title': 'The Shirt-Collar', 'book': 'The Pink Fairy Book'}
{'title': 'The Goblin And The Grocer', 'book': 'The Pink Fairy Book'}
{'title': 'The Flying Trunk', 'book': 'The Pink Fairy Book'}
{'title': 'The Snow-Man', 'book': 'The Pink Fairy Book'}
{'title': 'The Fir-Tree', 'book': 'The Pink Fairy Book'}
{'title': 'The Daughter Of Buk Ettemsuch', 'book': 'The Grey Fairy Book'}
{'title': 'The Story Of Halfman', 'book': 'The Violet Fairy Book'}
{'title': 'Udea And Her Seven Brothers', 'book': 'The Grey Fairy Book'}
{'title': 'The Ugly Duckling', 'book': 'The Orange Fairy Book'}
{'title': 'The Snow-Queen', 'book': '

In [43]:
for story in db["books"].search("donkey", columns=["title", "book"]):
    print(story)

{'title': 'The Street Musicians', 'book': 'The Grey Fairy Book'}
{'title': 'The Heart Of A Monkey', 'book': 'The Lilac Fairy Book'}
{'title': 'The Ogre', 'book': 'The Grey Fairy Book'}
{'title': 'The Cunning Shoemaker', 'book': 'The Pink Fairy Book'}
{'title': 'Donkey Skin', 'book': 'The Grey Fairy Book'}
{'title': 'The Biter Bit', 'book': 'The Green Fairy Book'}
{'title': 'The Donkey Cabbage', 'book': 'The Yellow Fairy Book'}
{'title': 'The Stones Of Plouhinec', 'book': 'The Lilac Fairy Book'}
{'title': 'The Colony Of Cats', 'book': 'The Crimson Fairy Book'}
{'title': 'The Prince And The Three Fates', 'book': 'The Brown Fairy Book'}
{'title': 'Story Of The King Who Would Be Stronger Than Fate', 'book': 'The Brown Fairy Book'}
{'title': 'The Story Of Hassebu', 'book': 'The Violet Fairy Book'}
{'title': 'The Nunda, Eater Of People', 'book': 'The Violet Fairy Book'}
{'title': 'Preface', 'book': 'The Yellow Fairy Book'}
{'title': 'A French Puck', 'book': 'The Lilac Fairy Book'}
{'title': 

We can also run explicit SQL queries over the database. For example, how do some of the stories start?

In [44]:
for row in db.query('SELECT first_line FROM books LIMIT 5'):
    print(row["first_line"])

Once upon a time in a certain country there lived a king whose palace was surrounded by a spacious garden.
Once upon a time there lived a king who was deeply in love with a princess, but she could not marry anyone, because she was under an enchantment.
Once upon a time there was a poor husbandman who had many children and little to give them in the way either of food or clothing.
Once upon a time there lived a queen who had been the mother of a great many children, and of them all only one daughter was left.
Once upon a time there lived in a certain village a little country girl, the prettiest creature was ever seen.


I seem to recall there may have been some sources at the end of some texts? A quick test for that is to see if there is any mention of `Grimm`:

In [45]:
for story in db["books"].search("Grimm", columns=["title", "book"]):
    print(story)

{'title': 'Preface', 'book': 'The Crimson Fairy Book'}
{'title': 'The Three Brothers', 'book': 'The Pink Fairy Book'}
{'title': 'To The Friendly Reader', 'book': 'The Green Fairy Book'}
{'title': 'Jorinde And Joringel', 'book': 'The Green Fairy Book'}
{'title': 'Spindle, Shuttle, And Needle', 'book': 'The Green Fairy Book'}
{'title': 'The Marvellous Musician', 'book': 'The Red Fairy Book'}
{'title': 'Rumpelstiltzkin', 'book': 'The Blue Fairy Book'}
{'title': 'The Twelve Huntsmen', 'book': 'The Green Fairy Book'}
{'title': 'The Riddle', 'book': 'The Green Fairy Book'}
{'title': 'Mother Holle', 'book': 'The Red Fairy Book'}
{'title': 'The Story Of A Clever Tailor', 'book': 'The Green Fairy Book'}
{'title': 'The Three Snake-Leaves', 'book': 'The Green Fairy Book'}
{'title': 'The War Of The Wolf And The Fox', 'book': 'The Green Fairy Book'}
{'title': 'Rapunzel', 'book': 'The Red Fairy Book'}
{'title': 'The White Snake', 'book': 'The Green Fairy Book'}
{'title': 'The House In The Wood', 'bo

Okay, so let's check the end of one of those:

In [46]:
for row in db.query("SELECT last_para FROM books WHERE text LIKE '%Grimm%'"):
    print(row["last_para"][-200:])

[1] Grimm.
[1] Grimm.
[1] Grimm.
[1] Grimm.
[1] Grimm.
[1] Grimm.
ut them up in the cellar, but in the morning they shall be led forth into the forest and shall serve a charcoal burner until they have improved, and will never again suffer poor animals to go hungry.'
 ill and died the two others were so deeply grieved that they were also taken ill and died too. And so, because they had all been so clever, and so fond of each other, they were all laid in one grave.
 and Mrs. Skovgaard-Pedersen has done 'The Green Knight' from the Danish. I must especially thank Monsieur Macler for permitting us to use some of his Contes Armeniens (Paris: Ernest Leroux, Editeur).
^32:1 Grimm.
ffer in colour; language, religion, and almost everything else; but they all love a nursery tale. The stories have mainly been adapted or translated by Mrs. Lang, a few by Miss Lang and Miss Blackley.
ill not be dull. So good-bye, and when you have read a fairy book, lend it to other children who have none, or tell t

How about some stories that don't reference Grimm?

In [47]:
# This query was used to help iterate the regular expressions used to extract the provenance
for row in db.query("SELECT last_para, provenance FROM books WHERE text NOT LIKE '%Grimm%' LIMIT 10"):
    print(row["provenance"],"::", row["last_para"][-200:])

Traditions Populaires de l'Asie Mineure. Carnoy et Nicolaides. Paris: Maisonneuve, 1889. :: [1] Traditions Populaires de l'Asie Mineure. Carnoy et Nicolaides. Paris: Maisonneuve, 1889.
Le Prince Desir et la Princesse Mignonne. Par Madame Leprince de Beaumont. :: [1] Le Prince Desir et la Princesse Mignonne. Par Madame Leprince de Beaumont.
Asbjornsen and Moe. :: [1] Asbjornsen and Moe.
Madame d'Aulnoy. :: [1] Madame d'Aulnoy.
 :: And, saying these words, this wicked wolf fell upon Little Red Riding-Hood, and ate her all up.
 ::  creatures she had ordered to be thrown into it for others. The King could not but be very sorry, for she was his mother; but he soon comforted himself with his beautiful wife and his pretty children.
Charles Perrault. :: [1] Charles Perrault.
Arabian Nights. :: [1] Arabian Nights.
La Belle et la Bete. Par Madame de Villeneuve. :: [1] La Belle et la Bete. Par Madame de Villeneuve.
Asbjornsen and Moe. :: [1] Asbjornsen and Moe.


In [48]:
for row in db.query('SELECT DISTINCT provenance, COUNT(*) AS num FROM books GROUP BY provenance ORDER BY num DESC LIMIT 10'):
    print(row["num"], row["provenance"])

126 
31 Grimm.
5 Lapplandische Mahrchen.
5 Japanische Marchen.
5 From 'West Highland Tales.'
5 Ehstnische Marchen.
5 Charles Perrault.
4 Volksmarchen der Serben.
4 Madame d'Aulnoy.
4 From Ungarische Mahrchen.


Hmm.. it seemed like there were more mentions of Grimm than that?
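One possible explanation is that the full text search matches `Grimm` anywhere in a text (prefaces included), whereas the grouped count only sees rows whose extracted `provenance` is exactly `Grimm.`. Footnotes in a different style, such as the `^32:1 Grimm.` ending seen above, may also have slipped past the provenance extraction. Here's a minimal sketch of how we might compare the two counts, using a throwaway in-memory database with made-up rows rather than the real `books` table:

```python
import sqlite3

# Throwaway in-memory database with invented rows, for illustration only
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE books (title TEXT, text TEXT, provenance TEXT)")
demo.executemany(
    "INSERT INTO books VALUES (?, ?, ?)",
    [
        ("Story A", "Once upon a time... [1] Grimm.", "Grimm."),
        ("Story B", "There was once... ^32:1 Grimm.", ""),
        ("Preface", "Some of the tales are from Grimm...", ""),
        ("Story C", "A king had... [1] Charles Perrault.", "Charles Perrault."),
    ],
)


def count_where(where):
    """Count the rows in the demo table matching a WHERE clause."""
    return demo.execute(f"SELECT COUNT(*) FROM books WHERE {where}").fetchone()[0]


# Rows credited to Grimm in the provenance column
exact_provenance = count_where("provenance = 'Grimm.'")
# Rows that mention Grimm anywhere in the text
text_mentions = count_where("text LIKE '%Grimm%'")

# Rows that mention Grimm but were not credited to Grimm
missed = [
    row[0]
    for row in demo.execute(
        "SELECT title FROM books "
        "WHERE text LIKE '%Grimm%' AND provenance NOT LIKE '%Grimm%' "
        "ORDER BY title"
    )
]

print(exact_provenance, text_mentions, missed)
```

Running the last query against the real table should list the stories worth checking by hand.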

## Making *pandas* based Database Queries

For convenience, let's set up a database connection so we can easily run *pandas* mediated queries:

In [49]:
import pandas as pd
import sqlite3

conn = sqlite3.connect(db_name)

In [50]:
#--SPLITHERE--

## Entity Extraction...

So what entities can we find in the stories...?!

Let's load in the `spacy` natural language processing toolkit:

In [51]:
#%pip install --upgrade spacy
import spacy
nlp = spacy.load("en_core_web_sm")

Couldn't import dot_parser, loading of dot files will not be possible.


Get a dataframe of data from the database:

In [52]:
q = "SELECT * FROM books"
df = pd.read_sql(q, conn)

df.head()

Unnamed: 0,book,title,text,last_para,first_line,provenance,chapter_order
0,The Blue Fairy Book,The Bronze Ring,Once upon a time in a certain country there li...,[1] Traditions Populaires de l'Asie Mineure. C...,Once upon a time in a certain country there li...,Traditions Populaires de l'Asie Mineure. Carno...,0
1,The Blue Fairy Book,Prince Hyacinth And The Dear Little Princess,Once upon a time there lived a king who was de...,[1] Le Prince Desir et la Princesse Mignonne. ...,Once upon a time there lived a king who was de...,Le Prince Desir et la Princesse Mignonne. Par ...,1
2,The Blue Fairy Book,East Of The Sun And West Of The Moon,Once upon a time there was a poor husbandman w...,[1] Asbjornsen and Moe.,Once upon a time there was a poor husbandman w...,Asbjornsen and Moe.,2
3,The Blue Fairy Book,The Yellow Dwarf,Once upon a time there lived a queen who had b...,[1] Madame d'Aulnoy.,Once upon a time there lived a queen who had b...,Madame d'Aulnoy.,3
4,The Blue Fairy Book,Little Red Riding Hood,Once upon a time there lived in a certain vill...,"And, saying these words, this wicked wolf fell...",Once upon a time there lived in a certain vill...,,4


Now let's have a go at extracting some entities (this may take some time!):

In [53]:
# Extract a set of entities, rather than a list...
get_entities = lambda desc: {f"{entity.label_} :: {entity.text}" for entity in nlp(desc).ents}

# The full run takes some time....
df['entities'] = df["text"].apply(get_entities)

df.head(10)

Unnamed: 0,book,title,text,last_para,first_line,provenance,chapter_order,entities
0,The Blue Fairy Book,The Bronze Ring,Once upon a time in a certain country there li...,[1] Traditions Populaires de l'Asie Mineure. C...,Once upon a time in a certain country there li...,Traditions Populaires de l'Asie Mineure. Carno...,0,"{DATE :: this very day, TIME :: night, CARDINA..."
1,The Blue Fairy Book,Prince Hyacinth And The Dear Little Princess,Once upon a time there lived a king who was de...,[1] Le Prince Desir et la Princesse Mignonne. ...,Once upon a time there lived a king who was de...,Le Prince Desir et la Princesse Mignonne. Par ...,1,"{PERSON :: Fairy, DATE :: Hyacinth, WORK_OF_AR..."
2,The Blue Fairy Book,East Of The Sun And West Of The Moon,Once upon a time there was a poor husbandman w...,[1] Asbjornsen and Moe.,Once upon a time there was a poor husbandman w...,Asbjornsen and Moe.,2,"{DATE :: Thursday, FAC :: the White Bear, TIME..."
3,The Blue Fairy Book,The Yellow Dwarf,Once upon a time there lived a queen who had b...,[1] Madame d'Aulnoy.,Once upon a time there lived a queen who had b...,Madame d'Aulnoy.,3,"{ORG :: Bellissima, ORG :: Courts, CARDINAL ::..."
4,The Blue Fairy Book,Little Red Riding Hood,Once upon a time there lived in a certain vill...,"And, saying these words, this wicked wolf fell...",Once upon a time there lived in a certain vill...,,4,"{ORG :: Grandmamma, DATE :: three days, PERSON..."
5,The Blue Fairy Book,The Sleeping Beauty In The Wood,"There were formerly a king and a queen, who we...","No one dared to tell him, when the Ogress, all...","There were formerly a king and a queen, who we...",,5,"{PERSON :: Fairy, PERSON :: Ogresses, ORG :: F..."
6,The Blue Fairy Book,"Cinderella, Or The Little Glass Slipper","Once there was a gentleman who married, for hi...",[1] Charles Perrault.,"Once there was a gentleman who married, for hi...",Charles Perrault.,6,"{PERSON :: Charlotte, PERSON :: Fairy, PERSON ..."
7,The Blue Fairy Book,Aladdin And The Wonderful Lamp,"There once lived a poor tailor, who had a son ...",[1] Arabian Nights.,"There once lived a poor tailor, who had a son ...",Arabian Nights.,7,"{PRODUCT :: The Grand Vizier, DATE :: forty ye..."
8,The Blue Fairy Book,The Tale Of A Youth Who Set Out To Learn What ...,"A father had two sons, of whom the eldest was ...",[1] Grimm.,"A father had two sons, of whom the eldest was ...",Grimm.,8,"{TIME :: the night, TIME :: night, ORDINAL :: ..."
9,The Blue Fairy Book,Rumpelstiltzkin,There was once upon a time a poor miller who h...,[1] Grimm.,There was once upon a time a poor miller who h...,Grimm.,9,"{PERSON :: Miller, GPE :: Melchior, TIME :: Go..."


*We should probably just do this once and add an appropriate table of entities to the database...*
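As a rough sketch of what that might look like, we could write the entities out in a long format, one row per story and entity, keyed so that re-running the extraction replaces rows rather than duplicating them. The `entities` table name, its columns and the in-memory database are assumptions for illustration; the sample entity strings are in the `LABEL :: text` format used above.

```python
import sqlite3

# Sample per-story entity sets in the "LABEL :: text" format used above
story_entities = {
    ("The Blue Fairy Book", "The Bronze Ring"): {
        "DATE :: this very day",
        "TIME :: night",
        "CARDINAL :: twelve",
    },
    ("The Blue Fairy Book", "Rumpelstiltzkin"): {"PERSON :: Miller"},
}

# In-memory database for the sketch; the real notebook would reuse its own file
entity_db = sqlite3.connect(":memory:")
entity_db.execute(
    """CREATE TABLE entities (
           book TEXT, title TEXT, entity_type TEXT, entity_value TEXT,
           PRIMARY KEY (book, title, entity_type, entity_value)
       )"""
)

# Flatten the sets into one record per (story, entity)
records = []
for (book, title), entities in story_entities.items():
    for entity in sorted(entities):
        entity_type, entity_value = entity.split(" :: ", 1)
        records.append((book, title, entity_type, entity_value))

# INSERT OR REPLACE means a second run will not create duplicate rows
entity_db.executemany("INSERT OR REPLACE INTO entities VALUES (?, ?, ?, ?)", records)
entity_db.executemany("INSERT OR REPLACE INTO entities VALUES (?, ?, ?, ?)", records)
entity_db.commit()

type_counts = dict(
    entity_db.execute("SELECT entity_type, COUNT(*) FROM entities GROUP BY entity_type")
)
print(type_counts)
```

With the entities in a table of their own, the counts that follow could be run as SQL `GROUP BY` queries without having to rerun the (slow) entity extraction.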

We can explode these out into a long format dataframe:

In [54]:
from pandas import Series

# Explode the entities one per row...
df_long = df.explode('entities')
df_long.rename(columns={"entities":"entity"}, inplace=True)

# And then separate out entity type and value
df_long[["entity_typ", "entity_value"]] = df_long["entity"].str.split(" :: ").apply(Series)
df_long.head()

Unnamed: 0,book,title,text,last_para,first_line,provenance,chapter_order,entity,entity_typ,entity_value
0,The Blue Fairy Book,The Bronze Ring,Once upon a time in a certain country there li...,[1] Traditions Populaires de l'Asie Mineure. C...,Once upon a time in a certain country there li...,Traditions Populaires de l'Asie Mineure. Carno...,0,DATE :: this very day,DATE,this very day
0,The Blue Fairy Book,The Bronze Ring,Once upon a time in a certain country there li...,[1] Traditions Populaires de l'Asie Mineure. C...,Once upon a time in a certain country there li...,Traditions Populaires de l'Asie Mineure. Carno...,0,TIME :: night,TIME,night
0,The Blue Fairy Book,The Bronze Ring,Once upon a time in a certain country there li...,[1] Traditions Populaires de l'Asie Mineure. C...,Once upon a time in a certain country there li...,Traditions Populaires de l'Asie Mineure. Carno...,0,CARDINAL :: twelve,CARDINAL,twelve
0,The Blue Fairy Book,The Bronze Ring,Once upon a time in a certain country there li...,[1] Traditions Populaires de l'Asie Mineure. C...,Once upon a time in a certain country there li...,Traditions Populaires de l'Asie Mineure. Carno...,0,GPE :: Nicolaides,GPE,Nicolaides
0,The Blue Fairy Book,The Bronze Ring,Once upon a time in a certain country there li...,[1] Traditions Populaires de l'Asie Mineure. C...,Once upon a time in a certain country there li...,Traditions Populaires de l'Asie Mineure. Carno...,0,CARDINAL :: three,CARDINAL,three


And explore...

In [55]:
df_long["entity_typ"].value_counts()

DATE           3108
CARDINAL       2536
TIME           1928
ORG            1564
PERSON         1499
ORDINAL         879
GPE             564
WORK_OF_ART     358
NORP            343
QUANTITY        209
LOC             165
PRODUCT         144
FAC             126
MONEY            41
LAW              34
EVENT            28
LANGUAGE         12
Name: entity_typ, dtype: int64

What sort of money has been identified in the stories?

In [56]:
df_long[df_long["entity_typ"]=="MONEY"]["entity_value"].value_counts().head(10)

a penny                     6
a hundred dollars           5
three hundred dollars       4
a few pence                 4
two hundred dollars         3
the hundred dollars         2
ten dollars                 2
fifty dollars               1
Five hundred dollars        1
only a few hundred yards    1
Name: entity_value, dtype: int64

Dollars? Really??? What about gold coins?! Do I need to train a new classifier?! Or was the original text really like that... Or has the text been got at? *(Maybe I should do my own digitisation project to extract the text from copies of the original books on the Internet Archive? Hmmm.. that could be interesting for when we go on strike...)*

What about other quantities?

In [57]:
df_long[df_long["entity_typ"]=="QUANTITY"]["entity_value"].value_counts().head(10)

one foot           15
a mile             12
a few miles         7
twenty miles        6
three miles         5
a hundred miles     5
two barrels         3
a few yards         3
several miles       3
two miles           3
Name: entity_value, dtype: int64

What people have been identified?

In [58]:
df_long[df_long["entity_typ"]=="PERSON"]["entity_value"].value_counts().head(10)

Fairy       31
Prince      29
Majesty     29
Queen       27
bush        12
wolf        11
Lang        11
Madam       11
Campbell    11
Jack        10
Name: entity_value, dtype: int64

How about geo-political entities (GPEs)?

In [59]:
df_long[df_long["entity_typ"]=="GPE"]["entity_value"].value_counts().head(10)

thou       11
Paris      10
France      9
Japan       8
Greece      7
Ireland     6
Denmark     6
Egypt       5
Persia      5
Thou        5
Name: entity_value, dtype: int64

When did things happen?

In [60]:
df_long[df_long["entity_typ"]=="DATE"]["entity_value"].value_counts().head(10)

one day         188
One day         154
the day          86
three days       79
next day         62
the next day     61
a few days       50
all day          44
many years       41
that day         40
Name: entity_value, dtype: int64

And how about time considerations?

In [61]:
df_long[df_long["entity_typ"]=="TIME"]["entity_value"].value_counts().head(10)

night            117
evening          115
a few minutes     77
morning           69
next morning      63
one morning       58
Next morning      53
the night         47
midnight          44
one night         40
Name: entity_value, dtype: int64

How were things organised?

In [62]:
df_long[df_long["entity_typ"]=="ORG"]["entity_value"].value_counts().head(10)

King        102
Princess     71
Court        48
Prince       42
Grimm        28
King's       23
Quick        18
eagle        12
Majesty      11
I.           10
Name: entity_value, dtype: int64

What's a `NORP`? (Ah... *Nationalities Or Religious or Political groups*.)

In [63]:
df_long[df_long["entity_typ"]=="NORP"]["entity_value"].value_counts().head(10)

German        24
Danish        18
French        14
Russian       11
Indian        10
Christian      9
Italian        8
Portuguese     7
Spanish        7
Persian        6
Name: entity_value, dtype: int64

In [64]:
#--SPLITHERE--

## Add Wikipedia Links

The Wikipedia page [`Lang's_Fairy_Books`](https://en.wikipedia.org/wiki/Lang's_Fairy_Books) lists the contents of Lang's coloured fairy books (as well as several other books), along with links to the Wikipedia page associated with each tale, if available.

This means we can have a go at annotating our database with Wikipedia links for each story. From those pages in turn, or associated *DBpedia* pages, we might also be able to extract Aarne-Thompson classification codes for the corresponding stories.

In [65]:
url = "https://en.wikipedia.org/wiki/Lang's_Fairy_Books"

html = requests.get(url)

wp_soup = BeautifulSoup(html.content, "html.parser")

In [66]:
# Find the span for a particular book
wp_book_loc =  wp_soup.find("span", id="The_Blue_Fairy_Book_(1889)")

# Then navigate relative to this to get the (linked) story list
wp_book_stories = wp_book_loc.find_parent().find_next("ul").find_all('li')
wp_book_stories[:3]

[<li>"<a href="/wiki/The_Bronze_Ring" title="The Bronze Ring">The Bronze Ring</a>"</li>,
 <li>"<a href="/wiki/Prince_Hyacinth_and_the_Dear_Little_Princess" title="Prince Hyacinth and the Dear Little Princess">Prince Hyacinth and the Dear Little Princess</a>"</li>,
 <li>"<a href="/wiki/East_of_the_Sun_and_West_of_the_Moon" title="East of the Sun and West of the Moon">East of the Sun and West of the Moon</a>"</li>]

Get the Wikipedia path for stories with a Wikipedia page:

In [67]:
wp_book_paths = [(li.find("a").get("title"), li.find("a").get("href")) for li in wp_book_stories]

wp_book_paths[:3]

[('The Bronze Ring', '/wiki/The_Bronze_Ring'),
 ('Prince Hyacinth and the Dear Little Princess',
  '/wiki/Prince_Hyacinth_and_the_Dear_Little_Princess'),
 ('East of the Sun and West of the Moon',
  '/wiki/East_of_the_Sun_and_West_of_the_Moon')]

This might be more useful as a list of `dict`s or as a *pandas* `DataFrame`:

In [68]:
import pandas as pd

wp_book_paths_wide = []

for item in wp_book_paths:
    wp_book_paths_wide.append( {"title":item[0].strip(), "path":item[1]} )
    
wp_book_df = pd.DataFrame(wp_book_paths_wide)
wp_book_df

Unnamed: 0,title,path
0,The Bronze Ring,/wiki/The_Bronze_Ring
1,Prince Hyacinth and the Dear Little Princess,/wiki/Prince_Hyacinth_and_the_Dear_Little_Prin...
2,East of the Sun and West of the Moon,/wiki/East_of_the_Sun_and_West_of_the_Moon
3,The Yellow Dwarf,/wiki/The_Yellow_Dwarf
4,Little Red Riding Hood,/wiki/Little_Red_Riding_Hood
5,Sleeping Beauty,/wiki/Sleeping_Beauty
6,Cinderella,/wiki/Cinderella
7,Aladdin,/wiki/Aladdin
8,The Story of the Youth Who Went Forth to Learn...,/wiki/The_Story_of_the_Youth_Who_Went_Forth_to...
9,Rumpelstiltskin,/wiki/Rumpelstiltskin


Let's see if we can cross-reference these with the stories in the database:

In [69]:
q = "SELECT book, title, chapter_order FROM books WHERE book='The Blue Fairy Book' ORDER BY chapter_order ASC"
df_blue = pd.read_sql(q, conn)

df_blue.head()

Unnamed: 0,book,title,chapter_order
0,The Blue Fairy Book,The Bronze Ring,0
1,The Blue Fairy Book,Prince Hyacinth And The Dear Little Princess,1
2,The Blue Fairy Book,East Of The Sun And West Of The Moon,2
3,The Blue Fairy Book,The Yellow Dwarf,3
4,The Blue Fairy Book,Little Red Riding Hood,4


Let's see if the chapters align in terms of order as presented:

In [70]:
pd.DataFrame({"book":df_blue["title"], "wp":wp_book_df["title"], "wp_path":wp_book_df["path"]})

Unnamed: 0,book,wp,wp_path
0,The Bronze Ring,The Bronze Ring,/wiki/The_Bronze_Ring
1,Prince Hyacinth And The Dear Little Princess,Prince Hyacinth and the Dear Little Princess,/wiki/Prince_Hyacinth_and_the_Dear_Little_Prin...
2,East Of The Sun And West Of The Moon,East of the Sun and West of the Moon,/wiki/East_of_the_Sun_and_West_of_the_Moon
3,The Yellow Dwarf,The Yellow Dwarf,/wiki/The_Yellow_Dwarf
4,Little Red Riding Hood,Little Red Riding Hood,/wiki/Little_Red_Riding_Hood
5,The Sleeping Beauty In The Wood,Sleeping Beauty,/wiki/Sleeping_Beauty
6,"Cinderella, Or The Little Glass Slipper",Cinderella,/wiki/Cinderella
7,Aladdin And The Wonderful Lamp,Aladdin,/wiki/Aladdin
8,The Tale Of A Youth Who Set Out To Learn What ...,The Story of the Youth Who Went Forth to Learn...,/wiki/The_Story_of_the_Youth_Who_Went_Forth_to...
9,Rumpelstiltzkin,Rumpelstiltskin,/wiki/Rumpelstiltskin


Yes, they do, so we can use that as the basis of a merge. That said, in the general case it would probably also be useful to generate a fuzzy match score between matched titles, with a report on any low-scoring matches, just in case the alignment has gone awry.
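A minimal sketch of such an alignment check, using `difflib` from the Python standard library rather than a dedicated fuzzy matching package. The function names and the `0.7` threshold are arbitrary choices for illustration, and the sample titles are taken from the tables above.

```python
from difflib import SequenceMatcher


def title_similarity(a, b):
    """Case-insensitive similarity score between two titles, from 0 to 1."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def alignment_report(left_titles, right_titles, threshold=0.7):
    """Pair titles by position and report the pairs scoring below the threshold."""
    report = []
    for position, (left, right) in enumerate(zip(left_titles, right_titles)):
        score = title_similarity(left, right)
        if score < threshold:
            report.append((position, left, right, round(score, 2)))
    return report


# Titles as they appear in the database and on Wikipedia
book_titles = ["The Bronze Ring", "The Sleeping Beauty In The Wood", "Rumpelstiltzkin"]
wp_titles = ["The Bronze Ring", "Sleeping Beauty", "Rumpelstiltskin"]

# Pairs that may need checking by hand
low_scoring = alignment_report(book_titles, wp_titles)
for item in low_scoring:
    print(item)
```

A low score doesn't necessarily mean the alignment is wrong (the *Sleeping Beauty* pairing is fine, for example), but a run of low scores would suggest the two lists have slipped out of step.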

In [71]:
# TO DO  - wp table for links, story and story order?
# TO DO fuzzy match score test just to check ingest and allow user to check poor matches

In passing, what if we wanted to try to match on the titles themselves?

If we use decased, but otherwise exact, matching, we see it's a bit flaky....

In [72]:
pd.merge(df_blue["title"], wp_book_df,
         left_on=df_blue["title"].str.lower(),
         right_on=wp_book_df["title"].str.lower(),
         how ="left" )

Unnamed: 0,key_0,title_x,title_y,path
0,the bronze ring,The Bronze Ring,The Bronze Ring,/wiki/The_Bronze_Ring
1,prince hyacinth and the dear little princess,Prince Hyacinth And The Dear Little Princess,Prince Hyacinth and the Dear Little Princess,/wiki/Prince_Hyacinth_and_the_Dear_Little_Prin...
2,east of the sun and west of the moon,East Of The Sun And West Of The Moon,East of the Sun and West of the Moon,/wiki/East_of_the_Sun_and_West_of_the_Moon
3,the yellow dwarf,The Yellow Dwarf,The Yellow Dwarf,/wiki/The_Yellow_Dwarf
4,little red riding hood,Little Red Riding Hood,Little Red Riding Hood,/wiki/Little_Red_Riding_Hood
5,the sleeping beauty in the wood,The Sleeping Beauty In The Wood,,
6,"cinderella, or the little glass slipper","Cinderella, Or The Little Glass Slipper",,
7,aladdin and the wonderful lamp,Aladdin And The Wonderful Lamp,,
8,the tale of a youth who set out to learn what ...,The Tale Of A Youth Who Set Out To Learn What ...,,
9,rumpelstiltzkin,Rumpelstiltzkin,,


A fuzzy match might be able to improve things...

In [73]:
# Reused from https://stackoverflow.com/a/56315491/454773
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

def fuzzy_merge(df_1, df_2, key1, key2, threshold=90, limit=2):
    """
    :param df_1: the left table to join
    :param df_2: the right table to join
    :param key1: key column of the left table
    :param key2: key column of the right table
    :param threshold: how close the matches should be to return a match, based on Levenshtein distance
    :param limit: the number of matches that will get returned, these are sorted high to low
    :return: dataframe with both keys and matches
    """
    s = df_2[key2].tolist()
    
    m = df_1[key1].apply(lambda x: process.extract(x, s, limit=limit))  
    df_1['matches'] = m
    
    m2 = df_1['matches'].apply(lambda x: ', '.join([i[0] for i in x if i[1] >= threshold]))
    df_1['matches'] = m2
    return df_1

In [74]:
fuzzy_merge(df_blue, wp_book_df, "title", "title", 88, limit=1)[["title", "matches"]]

Unnamed: 0,title,matches
0,The Bronze Ring,The Bronze Ring
1,Prince Hyacinth And The Dear Little Princess,Prince Hyacinth and the Dear Little Princess
2,East Of The Sun And West Of The Moon,East of the Sun and West of the Moon
3,The Yellow Dwarf,The Yellow Dwarf
4,Little Red Riding Hood,Little Red Riding Hood
5,The Sleeping Beauty In The Wood,Sleeping Beauty
6,"Cinderella, Or The Little Glass Slipper",Cinderella
7,Aladdin And The Wonderful Lamp,Aladdin
8,The Tale Of A Youth Who Set Out To Learn What ...,
9,Rumpelstiltzkin,Rumpelstiltskin


In [75]:
#https://github.com/jsoma/fuzzy_pandas/

# This is probably overkill...
#%pip install fuzzy_pandas
import fuzzy_pandas as fpd

fpd.fuzzy_merge(df_blue[["title"]], wp_book_df,
            left_on='title',
            right_on='title',
            ignore_case=True,
            ignore_nonalpha=True,
            method='jaro', #bilenko, levenshtein, metaphone, jaro
            threshold=0.86, # If we move to 0.86 we get a false positive...
            keep_left='all',
            keep_right="all"
               )


Unnamed: 0,title,title.1,path
0,The Bronze Ring,The Bronze Ring,/wiki/The_Bronze_Ring
1,Prince Hyacinth And The Dear Little Princess,Prince Hyacinth and the Dear Little Princess,/wiki/Prince_Hyacinth_and_the_Dear_Little_Prin...
2,East Of The Sun And West Of The Moon,East of the Sun and West of the Moon,/wiki/East_of_the_Sun_and_West_of_the_Moon
3,The Yellow Dwarf,The Yellow Dwarf,/wiki/The_Yellow_Dwarf
4,Little Red Riding Hood,Little Red Riding Hood,/wiki/Little_Red_Riding_Hood
5,"Cinderella, Or The Little Glass Slipper",Cinderella,/wiki/Cinderella
6,Rumpelstiltzkin,Rumpelstiltskin,/wiki/Rumpelstiltskin
7,Beauty And The Beast,Beauty and the Beast,/wiki/Beauty_and_the_Beast
8,The Master-Maid,The Master Maid,/wiki/The_Master_Maid
9,Why The Sea Is Salt,Why the Sea Is Salt,/wiki/Why_the_Sea_Is_Salt


In [76]:
fpd.fuzzy_merge(df_blue[["title"]], wp_book_df,
            left_on='title',
            right_on='title',
            ignore_case=True,
            ignore_nonalpha=True,
            method='metaphone', #levenshtein, metaphone, jaro, bilenko
            threshold=0.86,
            keep_left='all',
            keep_right="all"
               )

Unnamed: 0,title,title.1,path
0,The Bronze Ring,The Bronze Ring,/wiki/The_Bronze_Ring
1,Prince Hyacinth And The Dear Little Princess,Prince Hyacinth and the Dear Little Princess,/wiki/Prince_Hyacinth_and_the_Dear_Little_Prin...
2,East Of The Sun And West Of The Moon,East of the Sun and West of the Moon,/wiki/East_of_the_Sun_and_West_of_the_Moon
3,The Yellow Dwarf,The Yellow Dwarf,/wiki/The_Yellow_Dwarf
4,Little Red Riding Hood,Little Red Riding Hood,/wiki/Little_Red_Riding_Hood
5,Rumpelstiltzkin,Rumpelstiltskin,/wiki/Rumpelstiltskin
6,Beauty And The Beast,Beauty and the Beast,/wiki/Beauty_and_the_Beast
7,The Master-Maid,The Master Maid,/wiki/The_Master_Maid
8,Why The Sea Is Salt,Why the Sea Is Salt,/wiki/Why_the_Sea_Is_Salt
9,Felicia And The Pot Of Pinks,Felicia and the Pot of Pinks,/wiki/Felicia_and_the_Pot_of_Pinks


## Other Things to Link In

Have other people generated data sets that can be linked in?

- http://www.mythfolklore.net/andrewlang/indexbib.htm /via @OnlineCrsLady

In [77]:
#--SPLITHERE--

## Common Refrains / Repeating Phrases

Many stories incorporate a repeating phrase or refrain, but you may need to read quite a long way into a story before you can identify it. So are there any tools we might be able to use to help find such phrases?

In [78]:
#db = Database(db_name)
              
q2 = '"pretty hen"'

_q = f'SELECT * FROM books_fts WHERE books_fts MATCH {db.quote(q2)} ;'

for row in db.query(_q):
    print(row["title"])

The House In The Wood


In [79]:
import nltk
from nltk.util import ngrams as nltk_ngrams

tokens = nltk.word_tokenize(row["text"])

size = 5
#for i in nltk_ngrams(tokens, size):
#    print(' '.join(i))

We could then look for repeating phrases:

In [80]:
import pandas as pd

df = pd.DataFrame({'phrase':[' '.join(i) for i in nltk_ngrams(tokens, size)]})
df['phrase'].value_counts()

, pretty brindled cow ,        4
And you , pretty brindled      4
you , pretty brindled cow      4
pretty brindled cow , What     4
brindled cow , What do         4
                              ..
leaving him all day without    1
for leaving him all day        1
wife for leaving him all       1
his wife for leaving him       1
to go hungry . '               1
Name: phrase, Length: 1787, dtype: int64

Really, we need to scan down from a large n-gram size until we find a phrase that repeats (that is, the longest repeated phrase).

But for now, let's see what repeating elements we get from one of those search phrases:

In [81]:
import re

_q = 'pretty brindled cow'

for m in re.finditer(_q, row["text"]):
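    # Display the start and end offsets of each match, followed by the
    # matched phrase and the 50 characters immediately preceding and following it
    # (note that the label printed is the original search phrase, q2)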
    print(f'===\n{q2}: ', m.start(), m.end(), row["text"][max(0, m.start()-50):m.end()+50])

===
"pretty hen":  1566 1585 
The man said:

Pretty cock, Pretty hen, And you, pretty brindled cow, What do you say now?

'Duks,' answered the beast
===
"pretty hen":  3505 3524 ed the beasts:

Pretty cock, Pretty hen, And you, pretty brindled cow, What do you say now?

The beasts answered, 'Duks
===
"pretty hen":  4932 4951  beasts again:

Pretty cock, Pretty hen, And you, pretty brindled cow, What do you say now?

'Duks,' they said. Then th
===
"pretty hen":  6119 6138  to rest now?'

Pretty cock, Pretty hen, And you, pretty brindled cow, What do you say now?

The animals said, 'Duks:




Make a function for that:

In [82]:
def find_contexts(text, phrase, width=50):
    """Find the context(s) of the phrase."""
    contexts = []
    for m in re.finditer(phrase, text):
        # Collect the matched phrase along with the `width` characters
        # immediately preceding and following it
        contexts.append(text[max(0, m.start()-width):m.end()+width])
    return contexts

for i in find_contexts(row['text'], 'pretty brindled cow'):
    print(i,"\n==")


The man said:

Pretty cock, Pretty hen, And you, pretty brindled cow, What do you say now?

'Duks,' answered the beast 
==
ed the beasts:

Pretty cock, Pretty hen, And you, pretty brindled cow, What do you say now?

The beasts answered, 'Duks 
==
 beasts again:

Pretty cock, Pretty hen, And you, pretty brindled cow, What do you say now?

'Duks,' they said. Then th 
==
 to rest now?'

Pretty cock, Pretty hen, And you, pretty brindled cow, What do you say now?

The animals said, 'Duks:

 
==
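One thing to watch for: `re.finditer()` treats the phrase as a regular expression, so a phrase that contains characters such as `?` or `.` will not be matched literally. The following is a minimal sketch of a literal-matching variant; the function name `find_contexts_literal` and the sample text are made up for illustration.

```python
import re

def find_contexts_literal(text, phrase, width=50):
    """Return the text surrounding each literal occurrence of `phrase`."""
    # re.escape() neutralises characters such as "?" or "." that would
    # otherwise be interpreted as regular expression operators
    pattern = re.escape(phrase)
    return [text[max(0, m.start() - width):m.end() + width]
            for m in re.finditer(pattern, text)]

# Hypothetical sample text, for illustration only
sample = "He asked: What do you say now? They said: What do you say now?"

# Unescaped, the trailing "?" makes the "w" optional and is never matched itself
unescaped = re.findall("say now?", sample)

# Escaped, the question mark is matched as part of the phrase
escaped = find_contexts_literal(sample, "say now?", width=0)

print(unescaped)
print(escaped)
```

Whether to escape depends on intent: leaving the phrase unescaped lets us search with regular expression patterns, whereas escaping is safer when the phrase comes straight from a story.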


In [83]:
find_contexts(row['text'], 'pretty brindled cow')

["\nThe man said:\n\nPretty cock, Pretty hen, And you, pretty brindled cow, What do you say now?\n\n'Duks,' answered the beast",
 "ed the beasts:\n\nPretty cock, Pretty hen, And you, pretty brindled cow, What do you say now?\n\nThe beasts answered, 'Duks",
 " beasts again:\n\nPretty cock, Pretty hen, And you, pretty brindled cow, What do you say now?\n\n'Duks,' they said. Then th",
 " to rest now?'\n\nPretty cock, Pretty hen, And you, pretty brindled cow, What do you say now?\n\nThe animals said, 'Duks:\n\n"]

We can also make this a SQLite lookup function:

In [84]:
from vtfunc import TableFunction

def concordances(text, phrase, width=50):
    """Find the concordances of a phrase in a text."""
    contexts = []
    for m in re.finditer(phrase, text):
        # Collect the matched phrase along with the `width` characters
        # immediately preceding and following it
        context = text[max(0, m.start()-width):m.end()+width]
        contexts.append( (context, m.start(), m.end()) )
    return contexts


class Concordances(TableFunction):
    params = ['phrase', 'text']
    columns = ['match', 'start', 'end']
    name = 'concordance'

    def initialize(self, phrase=None, text=None):
        self._iter = iter(concordances(text, phrase))

    def iterate(self, idx):
        (context, start, end) = next(self._iter)
        return (context, start, end,)

Concordances.register(db.conn)

In [85]:
concordances(row['text'], 'pretty brindled cow')

[("\nThe man said:\n\nPretty cock, Pretty hen, And you, pretty brindled cow, What do you say now?\n\n'Duks,' answered the beast",
  1566,
  1585),
 ("ed the beasts:\n\nPretty cock, Pretty hen, And you, pretty brindled cow, What do you say now?\n\nThe beasts answered, 'Duks",
  3505,
  3524),
 (" beasts again:\n\nPretty cock, Pretty hen, And you, pretty brindled cow, What do you say now?\n\n'Duks,' they said. Then th",
  4932,
  4951),
 (" to rest now?'\n\nPretty cock, Pretty hen, And you, pretty brindled cow, What do you say now?\n\nThe animals said, 'Duks:\n\n",
  6119,
  6138)]

In [86]:
q = """
SELECT matched.*
  FROM books, concordance("pretty brindled cow", books.text) AS matched
  WHERE title="The House In The Wood";
"""
for i in db.execute(q):
    print(i)

("\nThe man said:\n\nPretty cock, Pretty hen, And you, pretty brindled cow, What do you say now?\n\n'Duks,' answered the beast", 1566, 1585)
("ed the beasts:\n\nPretty cock, Pretty hen, And you, pretty brindled cow, What do you say now?\n\nThe beasts answered, 'Duks", 3505, 3524)
(" beasts again:\n\nPretty cock, Pretty hen, And you, pretty brindled cow, What do you say now?\n\n'Duks,' they said. Then th", 4932, 4951)
(" to rest now?'\n\nPretty cock, Pretty hen, And you, pretty brindled cow, What do you say now?\n\nThe animals said, 'Duks:\n\n", 6119, 6138)


In [87]:
# Allow for different tokenisers later on
from nltk.tokenize import RegexpTokenizer

def scanner(text, minlen=4, startlen=50, min_repeats = 3, autostop=True):
    """Search a text for repeated phrases above a minimum length."""
    # Tokenise the text into words and punctuation marks
    tokens = nltk.word_tokenize(text)
    
    # nltk_ngrams returns nothing if we ask for an ngram longer than the text,
    # so set the (long) start length to the lesser of the provided
    # start length and the number of tokens in the text
    startlen = min(startlen, len(tokens))
    
    # Start with a long sequence then iterate down to a minimum length sequence
    for size in range(startlen, minlen-1, -1):
        # Generate a dataframe containing all the ngrams, one row per ngram
        df = pd.DataFrame({'phrase':[' '.join(i) for i in nltk_ngrams(tokens, size)]})
        
        # Find the occurrence counts of each phrase
        value_counts_series = df['phrase'].value_counts()

        # If we have at least the specified number of occurrences
        # don't bother searching for any more
        if autostop and max(value_counts_series) >= min_repeats:
            break

    # Return a pandas series (an indexed list, essentially)
    # containing the longest phrase (or phrases) we found
    return value_counts_series[(value_counts_series>=min_repeats) & (value_counts_series==max(value_counts_series))]

In [88]:
scanner( row["text"] )

: Pretty cock , Pretty hen , And you , pretty brindled cow , What do you say now ?    3
Name: phrase, dtype: int64

In [89]:
# Display the first (0'th indexed) item
# (In this case there is only one item that repeats this number of times anyway.)
scanner( row["text"] ).index[0], scanner( row["text"] ).values[0]

(': Pretty cock , Pretty hen , And you , pretty brindled cow , What do you say now ?',
 3)

If we constrain this function to return a single item, we can create a simple SQLite function that will search through records and return the longest phrase above a certain minimum length (or the first longest phrase, if several long phrases of the same length are found):

In [90]:
def find_repeating_phrase(text):
    """Return the longest repeating phrase found in a text.
       If there is more than one of the same length, return the first.
    """
    phrase = scanner(text)
    
    #If there is at least one response, take the first
    if not phrase.empty:
        return phrase.index[0]

In [91]:
find_repeating_phrase(row['text'])

': Pretty cock , Pretty hen , And you , pretty brindled cow , What do you say now ?'

In [92]:
# The `db` object is a sqlite_utils database object
# Pass in:
# - the name of the function we want to use in the database
# - the number of arguments it takes
# - the function we want to invoke
db.conn.create_function('find_repeating_phrase', 1,
                        find_repeating_phrase)
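To see the registration pattern in isolation, here is a minimal, self-contained sketch that uses the standard library `sqlite3` module with an in-memory database. The `demo_books` table and the `count_words` helper are made up for illustration; they are not part of the fairy book database.

```python
import sqlite3

def count_words(text):
    """Hypothetical helper: count the whitespace separated words in a text."""
    return len(text.split()) if text else 0

# Create a throwaway in-memory database with a single demo record
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo_books (title TEXT, text TEXT)")
conn.execute("INSERT INTO demo_books VALUES (?, ?)",
             ("A Tiny Tale", "Once upon a time"))

# Register the Python function: SQL name, number of arguments, callable
conn.create_function("count_words", 1, count_words)

# The function can now be called from inside a SQL query
rows = conn.execute("SELECT title, count_words(text) FROM demo_books").fetchall()
print(rows)
```

The registration only lasts for the lifetime of the connection, so the function needs to be registered again each time the database is reopened.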

In [93]:
_q = """
SELECT book, title, find_repeating_phrase(text) AS phrase 
FROM books WHERE title="The House In The Wood" ;
"""

for row2 in db.query(_q):
    print(row2)

{'book': 'The Pink Fairy Book', 'title': 'The House In The Wood', 'phrase': ': Pretty cock , Pretty hen , And you , pretty brindled cow , What do you say now ?'}


In [94]:
_q = """
SELECT title, find_repeating_phrase(text) AS phrase
FROM books WHERE book="The Pink Fairy Book" ;
"""

for row3 in db.query(_q):
    if row3['phrase'] is not None:
        print(row3)

{'title': 'Catherine And Her Destiny', 'phrase': 'the court , and'}
{'title': 'Esben And The Witch', 'phrase': "? ' 'Ye -- e -- s ! ' 'Are you coming back again ? ' 'That may be , ' said Esben . 'Then you 'll catch it , '"}
{'title': "Hans, The Mermaid's Son", 'phrase': ", ' said Hans ; ' I"}
{'title': 'How The Dragon Was Tricked', 'phrase': ", ' said the dragon"}
{'title': "How The Hermit Helped To Win The King's Daughter", 'phrase': "'Ask him if he will come with us"}
{'title': 'I Know What I Have Learned', 'phrase': 'and asked his wife whether the cow had calved'}
{'title': 'King Lindorm', 'phrase': 'rode out into the forest'}
{'title': 'Maiden Bright-Eye', 'phrase': ". 'Good evening , ' it said . 'Thanks , Maiden Bright-eye , ' said the dog . 'Where is my brother ? ' 'He is in the serpent-pit . ' 'Where is my wicked sister ? ' 'She is with the noble king . ' 'Alas ! alas !"}
{'title': 'Master And Pupil', 'phrase': ", ' said the boy ."}
{'title': 'Peter Bull', 'phrase': "'Oh , yes ,

The punctuation gets in the way somewhat, so it might be useful if we removed the punctuation and tried again:

In [95]:
# Allow the tokeniser to be set via a parameter, and optionally de-punctuate

def scanner2(text, minlen=4, startlen=50, min_repeats = 4, autostop=True, tokeniser='word'):
    """Search a text for repeated phrases above a minimum length."""
    # Tokenise the text
    if tokeniser == 'depunc_word':
        # Keep runs of word characters only, dropping the punctuation
        tokenizer = RegexpTokenizer(r'\w+')
        tokens = tokenizer.tokenize(text)
    elif tokeniser == 'sent':
        # Assumption: treat each sentence as a single token
        # (this branch was previously an unimplemented placeholder)
        tokens = nltk.sent_tokenize(text)
    else:
        # eg for default: tokeniser='word'
        tokens = nltk.word_tokenize(text)
    
    # nltk_ngrams returns nothing if we ask for an ngram longer than the text,
    # so set the (long) start length to the lesser of the provided
    # start length and the number of tokens in the text
    startlen = min(startlen, len(tokens))
    
    # Start with a long sequence then iterate down to a minimum length sequence
    for size in range(startlen, minlen-1, -1):
        
        # Generate a dataframe containing all the ngrams, one row per ngram
        df = pd.DataFrame({'phrase':[' '.join(i) for i in nltk_ngrams(tokens,size)]})
        
        # Find the occurrence counts of each phrase
        value_counts_series = df['phrase'].value_counts()
        
        # If we have at least the specified number of occurrences
        # don't bother searching for any more
        if autostop and max(value_counts_series) >= min_repeats:
            break

    # Return a pandas series (an indexed list, essentially)
    # containing the long phrase (or phrases) we found
    return value_counts_series[(value_counts_series>=min_repeats) & (value_counts_series==max(value_counts_series))]
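To get a feel for what the `depunc_word` option does to a text, here is a small sketch using only the standard library. It assumes that `RegexpTokenizer(r'\w+').tokenize()` behaves much like `re.findall(r'\w+', ...)`, which should hold for this simple pattern. The `extra` sample string is made up for illustration.

```python
import re

# The refrain from The House In The Wood
sample = "Pretty cock, Pretty hen, And you, pretty brindled cow, What do you say now?"

# Keep only runs of word characters, discarding punctuation and whitespace
tokens = re.findall(r"\w+", sample)
depunctuated = " ".join(tokens)
print(depunctuated)

# Hyphenated words and contractions are split into separate tokens too
extra = re.findall(r"\w+", "Maiden Bright-eye, you'll catch it")
print(extra)
```

One side effect of this approach is that contractions and hyphenated names are broken up, so the phrases returned may not match the original text character for character.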

In [96]:
def find_repeating_phrase_depunc(text, minlen):
    """Return the longest repeating phrase found in a text.
       If there is more than one of the same length, return the first.
    """
    
    # Accepts a specified minimum phrase length (minlen)
    # and reduces the required number of repeats to 3
    phrase = scanner2(text, minlen=minlen, min_repeats = 3,
                      tokeniser='depunc_word')
    
    #If there is at least one response, take the first
    if not phrase.empty:
        return phrase.index[0]

In [97]:
find_repeating_phrase_depunc(row['text'], 5)

'Pretty cock Pretty hen And you pretty brindled cow What do you say now'

Register the function:

In [98]:
# Note we need to update the number of arguments (max. 2)
db.conn.create_function('find_repeating_phrase_depunc', 2,
                        find_repeating_phrase_depunc)

Try again:

In [99]:
_q = """
SELECT book, title, find_repeating_phrase_depunc(text, 7) AS phrase
FROM books WHERE book="The Pink Fairy Book" ;
"""

for row5 in db.query(_q):
    if row5['phrase'] is not None:
        print(row5)

{'book': 'The Pink Fairy Book', 'title': 'Esben And The Witch', 'phrase': 'Ye e s Are you coming back again That may be said Esben Then you ll catch it'}
{'book': 'The Pink Fairy Book', 'title': "How The Hermit Helped To Win The King's Daughter", 'phrase': 'Ask him if he will come with us'}
{'book': 'The Pink Fairy Book', 'title': 'I Know What I Have Learned', 'phrase': 'and asked his wife whether the cow had calved'}
{'book': 'The Pink Fairy Book', 'title': 'Maiden Bright-Eye', 'phrase': 'Good evening it said Thanks Maiden Bright eye said the dog Where is my brother He is in the serpent pit Where is my wicked sister She is with the noble king Alas alas'}
{'book': 'The Pink Fairy Book', 'title': "The Bird 'Grip'", 'phrase': 'the horse with the golden shoes and'}
{'book': 'The Pink Fairy Book', 'title': 'The House In The Wood', 'phrase': 'Pretty cock Pretty hen And you pretty brindled cow What do you say now'}
{'book': 'The Pink Fairy Book', 'title': 'The Princess In The Chest', 'phrase

Check the context:

In [100]:
_q = """
SELECT text, find_repeating_phrase(text) AS phrase
FROM books WHERE title="Maiden Bright-Eye" ;
"""

for row6 in db.query(_q):
    for c in find_contexts(row6['text'], "Where is my wicked ", 100):
        print(c,"\n===")
    #print(row6['phrase'])




'Thanks, Maiden Bright-eye,' said the dog.

'Where is my brother?'

'He is in the serpent-pit.'

'Where is my wicked sister?'

'She is with the noble king.'

'Alas! alas! I am here this evening, and shall be for two e 
===


'Thanks, Maiden Bright-eye,' said the dog.

'Where is my brother?'

'He is in the serpent-pit.'

'Where is my wicked sister?'

'She is with the noble king.'

'Alas! alas! I am here this evening, and shall be for one e 
===


'Thanks, Maiden Bright-eye,' said the dog.

'Where is my brother?'

'He is in the serpent-pit.'

'Where is my wicked sister?'

'She is with the noble king.'

'Alas! alas! now I shall never come again.' With this it sl 
===


In [101]:
for row6 in db.query(_q):
    for c in find_contexts(row6['text'], "the king's palace", 100):
        print(c,"\n===")

 something about the stepson. He had gone out into the world to look about him, and took service in the king's palace. About this time he got permission to go home and see his sister, and when he saw how lovely and be 
===
 he saw how lovely and beautiful she was, he was so pleased and delighted that when he came back to the king's palace everyone there wanted to know what he was always so happy about. He told them that it was because h 
===
r life, and she was at once transformed into a duck. The duck swam away after the ship, and came to the king's palace on the next evening. There it waddled up the drain, and so into the kitchen, where her little dog l 
===
it.

At this time the brother in the serpent-pit dreamed that his right sister had come swimming to the king's palace in the shape of a duck, and that she could not regain her own form until her beak was cut off. He g 
===


We also need to be able to find shorter repeated phrases, down to the minimum length, that are not already contained within a longer repeated phrase:

In [102]:
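def scanner_all(text, minlen=4, startlen=50,
                min_repeats = 4, autostop=True):
    """Find repeated phrases that are not contained in a longer repeated phrase."""
    # Note: autostop is currently unused; every ngram size is always scanned
    long_phrases = {}
    tokens = nltk.word_tokenize(text)
    # Scan down from long ngrams to short ones
    for size in range(startlen, minlen-1, -1):
        df = pd.DataFrame({'phrase':[' '.join(i) for i in nltk_ngrams(tokens, min(size, len(tokens)))]})
        value_counts_series = df['phrase'].value_counts()
        
        if max(value_counts_series) >= min_repeats:
            # Consider only the most frequently repeated phrases of this length
            test_phrases = value_counts_series[value_counts_series==max(value_counts_series)]
            # Series.iteritems() was removed in pandas 2.0; use items() instead
            for (test_phrase, val) in test_phrases.items():
                # Keep the phrase unless it is part of a longer phrase we already have
                if (test_phrase not in long_phrases) and not any(test_phrase in long_phrase for long_phrase in long_phrases):
                    long_phrases[test_phrase] = val
            
    return long_phrases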

In [103]:
txt_reps ="""
Nota that There once was a thing that and 5 There once was a thing that and 4 There once was a thing that and 3
There once was a thing that and 1 There once was a thing that and  6 There once was a thing that and 7
there was another that 1 and there was another that 2 and there was another that 3 and there was another that and
there was another that and there was another that 5 and there was another that 9 and there was another that
"""

In [104]:
scanner( txt_reps )

There once was a thing that and    6
Name: phrase, dtype: int64

In [105]:
scanner_all(txt_reps)

{'There once was a thing that and': 6, 'and there was another that': 7}

In [106]:
scanner_all( row["text"])

{'Pretty cock , Pretty hen , And you , pretty brindled cow , What do you say now ?': 4}
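The scan-down idea used by `scanner()` can also be sketched with just the standard library, which may be handy where `nltk` and `pandas` are not available. This is a simplified sketch and not a drop-in replacement: the function name `longest_repeated_phrase` is made up, tokens are taken to be runs of word characters, and only the single most common phrase at the first successful length is returned.

```python
from collections import Counter
import re

def longest_repeated_phrase(text, minlen=4, startlen=50, min_repeats=3):
    """Return (phrase, count) for the longest n-gram seen at least min_repeats times."""
    # Simple de-punctuating tokeniser: runs of word characters
    tokens = re.findall(r"\w+", text)

    # Scan down from long n-grams to short ones, stopping at the first hit
    for size in range(min(startlen, len(tokens)), minlen - 1, -1):
        ngrams = (" ".join(tokens[i:i + size])
                  for i in range(len(tokens) - size + 1))
        counts = Counter(ngrams)
        phrase, count = counts.most_common(1)[0]
        if count >= min_repeats:
            return phrase, count

    # No phrase of at least minlen tokens repeats often enough
    return None

# Hypothetical sample text, for illustration only
demo = "the cat sat on the mat and the cat sat on the mat and the cat sat on the mat"
result = longest_repeated_phrase(demo)
print(result)
```

Rebuilding the n-gram counts at every size is wasteful for long texts, but for stories of a few thousand words it should be quick enough to experiment with.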

## Longest Common Substring

Could we use `difflib.SequenceMatcher.find_longest_match()` on the first and second halves of a document, or on various samples from a document, to try to find common refrains?

Or chunk into paragraphs and compare every paragraph with every other paragraph?

Here's how to call the `SequenceMatcher().find_longest_match()` function:

In [107]:
from difflib import SequenceMatcher

m = SequenceMatcher(None, txt_reps.split('\n')[1],
                    txt_reps.split('\n')[2]).find_longest_match()
m, txt_reps.split('\n')[1][m.a: m.a + m.size]

(Match(a=9, b=33, size=33), ' There once was a thing that and ')
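The paragraph-by-paragraph comparison suggested above might look something like the following sketch. It assumes that paragraphs are separated by blank lines; the function name `shared_passages`, the `min_chars` threshold and the demo text are all made up for illustration. The search bounds are passed explicitly to `find_longest_match()` because the no-argument form is only available in more recent Python versions.

```python
from difflib import SequenceMatcher
from itertools import combinations

def shared_passages(text, min_chars=20):
    """Find the longest shared substring for each pair of paragraphs."""
    # Assumption: paragraphs are separated by blank lines
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    found = []
    for a, b in combinations(paragraphs, 2):
        # autojunk=False stops difflib from ignoring very common characters
        matcher = SequenceMatcher(None, a, b, autojunk=False)
        m = matcher.find_longest_match(0, len(a), 0, len(b))
        # Only keep matches that are long enough to be interesting
        if m.size >= min_chars:
            found.append(a[m.a:m.a + m.size])
    return found

# Hypothetical sample text, for illustration only
demo = ("The man said: What do you say now? and waited.\n\n"
        "She asked again: What do you say now? but no one spoke.\n\n"
        "Then they all went to bed.")
passages = shared_passages(demo)
print(passages)
```

Comparing every paragraph with every other paragraph grows quadratically with the number of paragraphs, so this is likely to be slow on long stories.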

## Doc2Vec Search Engine

To explore: a simple `Doc2Vec` powered search engine based on https://www.kaggle.com/hgilles06/a-doc2vec-search-engine-cord19-new-version .