# Annotating Lang's Fairy Tales With Wikipedia Links

The Wikipedia page [`Lang's_Fairy_Books`](https://en.wikipedia.org/wiki/Lang's_Fairy_Books) lists the contents of Lang's coloured fairy books (as well as several other books), along with links to the Wikipedia page associated with each tale, if available.

This means we can have a go at annotating our database with Wikipedia links for each story. From those pages in turn, or associated *DBpedia* pages, we might also be able to extract Aarne-Thompson classification codes for the corresponding stories.
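
As a rough sketch of that last step: assuming a story's page text states its classification in a phrase along the lines of "Aarne-Thompson type 425A", a regular expression can pull out candidate codes. The phrasing, the sample sentence and the names `AT_PATTERN` and `extract_at_codes` are all assumptions made for illustration; real pages may word things differently.

```python
import re

# Assumed phrasing: "Aarne-Thompson" (hyphen or en dash), optionally "-Uther",
# optionally "tale" and/or "type", then a code such as "425A".
AT_PATTERN = re.compile(
    r"Aarne[-\u2013]Thompson(?:[-\u2013]Uther)?\s+(?:tale\s+)?(?:type\s+)?(\d{1,4}[A-Z]?\*?)",
    re.IGNORECASE,
)


def extract_at_codes(text):
    """Return candidate classification codes found in a block of text."""
    return [m.group(1).upper() for m in AT_PATTERN.finditer(text)]


# Made-up sample sentence, not quoted from any page
sample = "It is Aarne-Thompson type 425A, the search for the lost husband."
print(extract_at_codes(sample))
```

Anything found this way would still need checking by eye, since a page may mention the codes of related tales as well as its own.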

In [1]:
from sqlite_utils import Database

db_name = "lang_fairy_tale.db"
db = Database(db_name)
conn = db.conn

# Load in the sql magic
%load_ext sql
%sql sqlite:///$db_name

Load in the Wikipedia page that lists Lang's Fairy Book collections and links to the Wikipedia pages for the stories they contain.

In [2]:
import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/Lang's_Fairy_Books"

html = requests.get(url)

We now make some lovely soup from the page that we can then start to fish entrails out of:

In [3]:
wp_soup = BeautifulSoup(html.content, "html.parser")

In [4]:
# Find the span for a particular book
wp_book_loc = wp_soup.find("span", id="The_Blue_Fairy_Book_(1889)")

# Then navigate relative to this to get the (linked) story list
wp_book_stories = wp_book_loc.find_parent().find_next("ul").find_all('li')
wp_book_stories[:3]

[<li>"<a href="/wiki/The_Bronze_Ring" title="The Bronze Ring">The Bronze Ring</a>"</li>,
 <li>"<a href="/wiki/Prince_Hyacinth_and_the_Dear_Little_Princess" title="Prince Hyacinth and the Dear Little Princess">Prince Hyacinth and the Dear Little Princess</a>"</li>,
 <li>"<a href="/wiki/East_of_the_Sun_and_West_of_the_Moon" title="East of the Sun and West of the Moon">East of the Sun and West of the Moon</a>"</li>]

Get the Wikipedia path for stories with a Wikipedia page:

In [5]:
wp_book_paths = [(li.find("a").get("title"), li.find("a").get("href")) for li in wp_book_stories]

wp_book_paths[:3]

[('The Bronze Ring', '/wiki/The_Bronze_Ring'),
 ('Prince Hyacinth and the Dear Little Princess',
  '/wiki/Prince_Hyacinth_and_the_Dear_Little_Princess'),
 ('East of the Sun and West of the Moon',
  '/wiki/East_of_the_Sun_and_West_of_the_Moon')]
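
The paths above are site-relative. If absolute links are wanted, they can be resolved against the Wikipedia base URL. A minimal sketch, in which the helper name `to_full_url` and the constant `WIKIPEDIA_BASE` are illustrative rather than part of the notebook:

```python
from urllib.parse import urljoin

# Base URL of the site the relative paths were scraped from
WIKIPEDIA_BASE = "https://en.wikipedia.org"


def to_full_url(path, base=WIKIPEDIA_BASE):
    """Resolve a site-relative path such as '/wiki/Cinderella' against the base URL.

    Returns None when there is no path (for example, a story without a page).
    """
    return urljoin(base, path) if path else None


paths = [
    ("The Bronze Ring", "/wiki/The_Bronze_Ring"),
    ("Cinderella", "/wiki/Cinderella"),
]

full_urls = [(title, to_full_url(path)) for title, path in paths]
print(full_urls)
```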

The title and path pairs might be more useful as a list of `dict`s or a *pandas* `DataFrame`:

In [6]:
import pandas as pd

wp_book_paths_wide = []

for item in wp_book_paths:
    wp_book_paths_wide.append( {"title":item[0].strip(), "path":item[1]} )
    
wp_book_df = pd.DataFrame(wp_book_paths_wide)
wp_book_df

|   | title | path |
| --- | --- | --- |
| 0 | The Bronze Ring | /wiki/The_Bronze_Ring |
| 1 | Prince Hyacinth and the Dear Little Princess | /wiki/The_Prince_Hyacinth_and_the_Dear_Little_Prin... |
| 2 | East of the Sun and West of the Moon | /wiki/East_of_the_Sun_and_West_of_the_Moon |
| 3 | The Yellow Dwarf | /wiki/The_Yellow_Dwarf |
| 4 | Little Red Riding Hood | /wiki/Little_Red_Riding_Hood |
| 5 | Sleeping Beauty | /wiki/Sleeping_Beauty |
| 6 | Cinderella | /wiki/Cinderella |
| 7 | Aladdin | /wiki/Aladdin |
| 8 | The Story of the Youth Who Went Forth to Learn... | /wiki/The_Story_of_the_Youth_Who_Went_Forth_to... |
| 9 | Rumpelstiltskin | /wiki/Rumpelstiltskin |


Can we then cross-reference these with the stories in the database?

In [7]:
q = "SELECT book, title, chapter_order FROM books WHERE book='The Blue Fairy Book' ORDER BY chapter_order ASC"
df_blue = pd.read_sql(q, conn)

df_blue.head()

|   | book | title | chapter_order |
| --- | --- | --- | --- |
| 0 | The Blue Fairy Book | The Bronze Ring | 0 |
| 1 | The Blue Fairy Book | Prince Hyacinth And The Dear Little Princess | 1 |
| 2 | The Blue Fairy Book | East Of The Sun And West Of The Moon | 2 |
| 3 | The Blue Fairy Book | The Yellow Dwarf | 3 |
| 4 | The Blue Fairy Book | Little Red Riding Hood | 4 |


Let's see if the chapters align in terms of order as presented:

In [8]:
pd.DataFrame({"book":df_blue["title"], "wp":wp_book_df["title"], "wp_path":wp_book_df["path"]})

|   | book | wp | wp_path |
| --- | --- | --- | --- |
| 0 | The Bronze Ring | The Bronze Ring | /wiki/The_Bronze_Ring |
| 1 | Prince Hyacinth And The Dear Little Princess | Prince Hyacinth and the Dear Little Princess | /wiki/Prince_Hyacinth_and_the_Dear_Little_Prin... |
| 2 | East Of The Sun And West Of The Moon | East of the Sun and West of the Moon | /wiki/East_of_the_Sun_and_West_of_the_Moon |
| 3 | The Yellow Dwarf | The Yellow Dwarf | /wiki/The_Yellow_Dwarf |
| 4 | Little Red Riding Hood | Little Red Riding Hood | /wiki/Little_Red_Riding_Hood |
| 5 | The Sleeping Beauty In The Wood | Sleeping Beauty | /wiki/Sleeping_Beauty |
| 6 | Cinderella, Or The Little Glass Slipper | Cinderella | /wiki/Cinderella |
| 7 | Aladdin And The Wonderful Lamp | Aladdin | /wiki/Aladdin |
| 8 | The Tale Of A Youth Who Set Out To Learn What ... | The Story of the Youth Who Went Forth to Learn... | /wiki/The_Story_of_the_Youth_Who_Went_Forth_to... |
| 9 | Rumpelstiltzkin | Rumpelstiltskin | /wiki/Rumpelstiltskin |


Yes, they do, so we can use that as the basis of a merge. That said, in the general case it would probably also be useful to generate a fuzzy match score between matched titles, with a report on any low-scoring matches, just in case the alignment has gone awry.
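
A minimal sketch of such a check, using the standard library's `difflib.SequenceMatcher` rather than a dedicated fuzzy matching package. The function names, the short sample lists and the `0.6` threshold are arbitrary choices for illustration:

```python
from difflib import SequenceMatcher


def title_similarity(a, b):
    """Case-insensitive similarity ratio between two titles, from 0.0 to 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def low_scoring_pairs(book_titles, wp_titles, threshold=0.6):
    """Pair titles by position and report the pairs scoring below the threshold."""
    report = []
    for position, (book, wp) in enumerate(zip(book_titles, wp_titles)):
        score = title_similarity(book, wp)
        if score < threshold:
            report.append((position, book, wp, round(score, 2)))
    return report


# A few positionally aligned titles, as in the comparison table
book_titles = ["The Bronze Ring", "Aladdin And The Wonderful Lamp", "Rumpelstiltzkin"]
wp_titles = ["The Bronze Ring", "Aladdin", "Rumpelstiltskin"]

for row in low_scoring_pairs(book_titles, wp_titles):
    print(row)
```

With these sample titles only the Aladdin pair is flagged, even though it is a correct match: a low score marks a pair for human review rather than proving a misalignment.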

In [9]:
# TO DO  - wp table for links, story and story order?
# TO DO fuzzy match score test just to check ingest and allow user to check poor matches
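
One way the first TO DO might look, sketched with the standard library's `sqlite3` module and an in-memory database so that it runs on its own. The table name `wp_links` and its columns are hypothetical, not part of the existing database schema:

```python
import sqlite3

# In-memory database for illustration; a connection to the real database
# file could be used in the same way.
demo_conn = sqlite3.connect(":memory:")

# Hypothetical table: one row per story position in a book
demo_conn.execute(
    """
    CREATE TABLE IF NOT EXISTS wp_links (
        book TEXT NOT NULL,
        chapter_order INTEGER NOT NULL,
        wp_title TEXT,
        wp_path TEXT,
        PRIMARY KEY (book, chapter_order)
    )
    """
)

book = "The Blue Fairy Book"
wp_rows = [
    ("The Bronze Ring", "/wiki/The_Bronze_Ring"),
    ("Prince Hyacinth and the Dear Little Princess",
     "/wiki/Prince_Hyacinth_and_the_Dear_Little_Princess"),
]

# Story order is taken from list position, mirroring chapter_order in the books table
demo_conn.executemany(
    "INSERT OR REPLACE INTO wp_links VALUES (?, ?, ?, ?)",
    [(book, i, title, path) for i, (title, path) in enumerate(wp_rows)],
)
demo_conn.commit()

stored = demo_conn.execute(
    "SELECT chapter_order, wp_title, wp_path FROM wp_links ORDER BY chapter_order"
).fetchall()
print(stored)
```

Keying on `(book, chapter_order)` would let the links be joined back to the `books` table by position, which is the alignment relied on above.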

In passing, what if we wanted to try to match on the titles themselves?

If we use decased, but otherwise exact, matching, we see it's a bit flaky...

In [10]:
pd.merge(df_blue["title"], wp_book_df,
         left_on=df_blue["title"].str.lower(),
         right_on=wp_book_df["title"].str.lower(),
         how ="left" )

|   | key_0 | title_x | title_y | path |
| --- | --- | --- | --- | --- |
| 0 | the bronze ring | The Bronze Ring | The Bronze Ring | /wiki/The_Bronze_Ring |
| 1 | prince hyacinth and the dear little princess | Prince Hyacinth And The Dear Little Princess | Prince Hyacinth and the Dear Little Princess | /wiki/Prince_Hyacinth_and_the_Dear_Little_Prin... |
| 2 | east of the sun and west of the moon | East Of The Sun And West Of The Moon | East of the Sun and West of the Moon | /wiki/East_of_the_Sun_and_West_of_the_Moon |
| 3 | the yellow dwarf | The Yellow Dwarf | The Yellow Dwarf | /wiki/The_Yellow_Dwarf |
| 4 | little red riding hood | Little Red Riding Hood | Little Red Riding Hood | /wiki/Little_Red_Riding_Hood |
| 5 | the sleeping beauty in the wood | The Sleeping Beauty In The Wood |  |  |
| 6 | cinderella, or the little glass slipper | Cinderella, Or The Little Glass Slipper |  |  |
| 7 | aladdin and the wonderful lamp | Aladdin And The Wonderful Lamp |  |  |
| 8 | the tale of a youth who set out to learn what ... | The Tale Of A Youth Who Set Out To Learn What ... |  |  |
| 9 | rumpelstiltzkin | Rumpelstiltzkin |  |  |


A fuzzy match might be able to improve things...

In [11]:
# Reused from https://stackoverflow.com/a/56315491/454773
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

def fuzzy_merge(df_1, df_2, key1, key2, threshold=90, limit=2):
    """
    :param df_1: the left table to join
    :param df_2: the right table to join
    :param key1: key column of the left table
    :param key2: key column of the right table
    :param threshold: minimum fuzzywuzzy match score (0 to 100) for a candidate to be kept
    :param limit: the maximum number of candidate matches per row, sorted high to low
    :return: df_1, modified in place, with an added 'matches' column of comma-joined matches
    """
    s = df_2[key2].tolist()
    
    m = df_1[key1].apply(lambda x: process.extract(x, s, limit=limit))  
    df_1['matches'] = m
    
    m2 = df_1['matches'].apply(lambda x: ', '.join([i[0] for i in x if i[1] >= threshold]))
    df_1['matches'] = m2
    return df_1

In [12]:
fuzzy_merge(df_blue, wp_book_df, "title", "title", 88, limit=1)[["title", "matches"]]

|   | title | matches |
| --- | --- | --- |
| 0 | The Bronze Ring | The Bronze Ring |
| 1 | Prince Hyacinth And The Dear Little Princess | Prince Hyacinth and the Dear Little Princess |
| 2 | East Of The Sun And West Of The Moon | East of the Sun and West of the Moon |
| 3 | The Yellow Dwarf | The Yellow Dwarf |
| 4 | Little Red Riding Hood | Little Red Riding Hood |
| 5 | The Sleeping Beauty In The Wood | Sleeping Beauty |
| 6 | Cinderella, Or The Little Glass Slipper | Cinderella |
| 7 | Aladdin And The Wonderful Lamp | Aladdin |
| 8 | The Tale Of A Youth Who Set Out To Learn What ... |  |
| 9 | Rumpelstiltzkin | Rumpelstiltskin |


In [13]:
#https://github.com/jsoma/fuzzy_pandas/

# This is probably overkill...
#%pip install fuzzy_pandas
import fuzzy_pandas as fpd

fpd.fuzzy_merge(df_blue[["title"]], wp_book_df,
            left_on='title',
            right_on='title',
            ignore_case=True,
            ignore_nonalpha=True,
            method='jaro', #bilenko, levenshtein, metaphone, jaro
            threshold=0.86, # If we move to 0.86 we get a false positive...
            keep_left='all',
            keep_right="all"
               )


|   | title | title.1 | path |
| --- | --- | --- | --- |
| 0 | The Bronze Ring | The Bronze Ring | /wiki/The_Bronze_Ring |
| 1 | Prince Hyacinth And The Dear Little Princess | Prince Hyacinth and the Dear Little Princess | /wiki/Prince_Hyacinth_and_the_Dear_Little_Prin... |
| 2 | East Of The Sun And West Of The Moon | East of the Sun and West of the Moon | /wiki/East_of_the_Sun_and_West_of_the_Moon |
| 3 | The Yellow Dwarf | The Yellow Dwarf | /wiki/The_Yellow_Dwarf |
| 4 | Little Red Riding Hood | Little Red Riding Hood | /wiki/Little_Red_Riding_Hood |
| 5 | Cinderella, Or The Little Glass Slipper | Cinderella | /wiki/Cinderella |
| 6 | Rumpelstiltzkin | Rumpelstiltskin | /wiki/Rumpelstiltskin |
| 7 | Beauty And The Beast | Beauty and the Beast | /wiki/Beauty_and_the_Beast |
| 8 | The Master-Maid | The Master Maid | /wiki/The_Master_Maid |
| 9 | Why The Sea Is Salt | Why the Sea Is Salt | /wiki/Why_the_Sea_Is_Salt |


In [14]:
fpd.fuzzy_merge(df_blue[["title"]], wp_book_df,
            left_on='title',
            right_on='title',
            ignore_case=True,
            ignore_nonalpha=True,
            method='metaphone', #levenshtein, metaphone, jaro, bilenko
            threshold=0.86,
            keep_left='all',
            keep_right="all"
               )

|   | title | title.1 | path |
| --- | --- | --- | --- |
| 0 | The Bronze Ring | The Bronze Ring | /wiki/The_Bronze_Ring |
| 1 | Prince Hyacinth And The Dear Little Princess | Prince Hyacinth and the Dear Little Princess | /wiki/Prince_Hyacinth_and_the_Dear_Little_Prin... |
| 2 | East Of The Sun And West Of The Moon | East of the Sun and West of the Moon | /wiki/East_of_the_Sun_and_West_of_the_Moon |
| 3 | The Yellow Dwarf | The Yellow Dwarf | /wiki/The_Yellow_Dwarf |
| 4 | Little Red Riding Hood | Little Red Riding Hood | /wiki/Little_Red_Riding_Hood |
| 5 | Rumpelstiltzkin | Rumpelstiltskin | /wiki/Rumpelstiltskin |
| 6 | Beauty And The Beast | Beauty and the Beast | /wiki/Beauty_and_the_Beast |
| 7 | The Master-Maid | The Master Maid | /wiki/The_Master_Maid |
| 8 | Why The Sea Is Salt | Why the Sea Is Salt | /wiki/Why_the_Sea_Is_Salt |
| 9 | Felicia And The Pot Of Pinks | Felicia and the Pot of Pinks | /wiki/Felicia_and_the_Pot_of_Pinks |


## Other Things to Link In

Have other people generated data sets that can be linked in?

- http://www.mythfolklore.net/andrewlang/indexbib.htm /via @OnlineCrsLady