# Jacobs' Fairy Tales

This recipe shows how to scrape Jacobs' fairy tale collections from the OCR-generated full text documents available from the Internet Archive.

The works include:

- [*English Fairy Tales*](https://archive.org/details/englishfairytal00jacogoog/)
- [*More English Fairy Tales*](https://archive.org/details/moreenglishfairy00jaco2/)
- [*Celtic Fairy Tales*](https://archive.org/details/celticfairytale00conggoog)
- [*More Celtic Fairy Tales*](https://archive.org/details/moreenglishfairy00jaco2/)
- [*Indian Fairy Tales*](https://archive.org/details/indianfairytales00jaco)
- [*European Folk and Fairy Tales*](https://archive.org/details/europeanfolkfair00jaco/)
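
As a rough sketch of how the full text for one of these items might be retrieved, the following helper builds a download URL from the item identifier (the last part of each link above). The `<identifier>_djvu.txt` file name is an assumption based on how the Internet Archive commonly names its OCR text files; the function names are illustrative only, and the file listing for each item should be checked before relying on the pattern.

```python
from urllib.request import urlopen


def full_text_url(identifier):
    """Build a candidate URL for the OCR full text of an Internet Archive item."""
    # Assumption: the OCR text file is named <identifier>_djvu.txt
    return f"https://archive.org/download/{identifier}/{identifier}_djvu.txt"


def fetch_full_text(identifier, timeout=30):
    """Download the full text document and return it as a string."""
    with urlopen(full_text_url(identifier), timeout=timeout) as response:
        # OCR output may contain odd bytes, so replace anything undecodable
        return response.read().decode("utf-8", errors="replace")


print(full_text_url("indianfairytales00jaco"))
```

Calling `fetch_full_text("indianfairytales00jaco")` would then return the text as a single string, provided the file exists under that name.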

The approach explores how we can "chunk" the original text into separate stories, and suggests that a combined human + machine strategy may be more realistic than a purely automated one.
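
One way the human + machine split might work is for a person to supply the list of story titles (for example, copied from the contents page) and for the code to do the cutting. The sketch below assumes each title appears on a line of its own in the body text; the function name and the sample text are made up for illustration.

```python
import re


def chunk_by_titles(text, titles):
    """Split text into {title: body} using a human-curated list of titles."""
    positions = []
    for title in titles:
        # Look for the title alone on a line, ignoring case and padding
        pattern = re.compile(
            rf"^[ \t]*{re.escape(title)}[ \t]*$",
            re.IGNORECASE | re.MULTILINE,
        )
        match = pattern.search(text)
        if match:
            positions.append((match.start(), match.end(), title))

    # Order the titles by where they were found in the text
    positions.sort()

    stories = {}
    for i, (start, end, title) in enumerate(positions):
        # Each story runs from the end of its title to the start of the next
        next_start = positions[i + 1][0] if i + 1 < len(positions) else len(text)
        stories[title] = text[end:next_start].strip()
    return stories


# Made-up sample text standing in for a scraped document
sample = (
    "TOM TIT TOT\n\nOnce upon a time there was a woman.\n\n"
    "THE THREE SILLIES\n\nOnce upon a time there was a farmer.\n"
)
stories = chunk_by_titles(sample, ["Tom Tit Tot", "The Three Sillies"])
for title, body in stories.items():
    print(f"{title}: {body}")
```

Only the first match for each title is used, so a contents page that lists titles on lines of their own would need to be trimmed off first. Titles that the OCR process has mangled will simply be missed, which is where a manual check earns its keep.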

```{warning}
For each of the works, several different scanned versions of the text may be available. A quick look at the full text document for each version will give a feel for how effective the OCR process was. Ideally, we're looking for full text that was recognised cleanly and is not full of typographical errors.
```
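
To compare scanned versions a little more systematically, a crude noise score can supplement the quick look. The sketch below counts the fraction of tokens containing characters outside letters, digits and common punctuation. The choice of "expected" characters is an assumption and the score is only a rough indicator, not a measure of OCR accuracy.

```python
import re

# Any character outside letters, digits and common punctuation is treated as suspect
SUSPECT = re.compile(r"[^A-Za-z0-9.,;:!?'\"()\-]")


def suspect_token_ratio(text):
    """Return the fraction of whitespace-separated tokens that look like OCR noise."""
    tokens = text.split()
    if not tokens:
        return 0.0
    return sum(1 for token in tokens if SUSPECT.search(token)) / len(tokens)


# Made-up samples: one clean line and one with typical OCR damage
clean = "Once upon a time, there was a king."
noisy = "*' Once up0n a tirne, th^re w*s a k|ng."

print(f"clean: {suspect_token_ratio(clean):.2f}")
print(f"noisy: {suspect_token_ratio(noisy):.2f}")
```

Misreadings that still look like ordinary words (such as `tirne` for `time`) are not caught, so a lower score suggests a cleaner scan but does not guarantee one.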

## Tidying the Text

Cursory inspection of the texts suggests certain common forms of error, particularly in the parsing of quotation marks. The following rules correct some of them:

```python
import re

tests = ["'' Text ", "*' Text", "'* Text",
         "text ''", "text *'", "text ''",
         "*Text", "text*", "text * text2",
         '\n" Text']

def quote_fix(txt):
    """Various broken quotation mark fixes."""
    # Replace any two consecutive ' and * characters with a double quote.
    # Note that this gives a false positive for a long run of * characters.
    txt_fix = re.sub(r"[*']{2}", '"', txt)
    # Replace a * adjacent to an alphabetic character, assuming it to be a "
    txt_fix = re.sub(r"\*([a-zA-Z])", r'"\1', txt_fix)
    txt_fix = re.sub(r"([a-zA-Z])\*", r'\1"', txt_fix)
    # Replace a * (and the spaces around it) between two alphabetic characters
    txt_fix = re.sub(r"([a-zA-Z]) \* ([a-zA-Z])", r'\1"\2', txt_fix)
    # If we get a quote at the start of a line followed by a space,
    # replace it with a double quote and remove the space character
    txt_fix = re.sub('\n["\'] ', '\n"', txt_fix)

    return txt_fix

for test in tests:
    test_fix = quote_fix(test)
    print(f"Original: --{test}-- replaced by --{test_fix}--")
```

```text
Original: --'' Text -- replaced by --" Text --
Original: --*' Text-- replaced by --" Text--
Original: --'* Text-- replaced by --" Text--
Original: --text ''-- replaced by --text "--
Original: --text *'-- replaced by --text "--
Original: --text ''-- replaced by --text "--
Original: --*Text-- replaced by --"Text--
Original: --text*-- replaced by --text"--
Original: --text * text2-- replaced by --text"text2--
Original: --
" Text-- replaced by --
"Text--
```

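
The comment in `quote_fix` notes that the two-character rule misfires on a long run of `*` characters, such as a row of asterisks used as a section divider. One possible guard, sketched here as a separate hypothetical helper rather than a change to the function above, is to use lookarounds so that a pair is only replaced when it is not part of a longer run.

```python
import re


def pair_fix(txt):
    """Replace exactly two consecutive ' or * characters with a double quote."""
    # The lookbehind and lookahead reject pairs inside a longer run of ' or *
    return re.sub(r"(?<![*'])[*']{2}(?![*'])", '"', txt)


for sample in ["'' Text", "text *'", "*****"]:
    print(f"Original: --{sample}-- replaced by --{pair_fix(sample)}--")
```

The trade-off is that a genuine quotation mark rendered as three or more characters would now be left untouched, so a check against the actual texts is still needed.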