# Introduction
This is a very rough first draft of importing and cleansing the data. The solution is heavily inspired by (okay... completely ripped off from) https://gist.github.com/mbforbes/cee3fd5bb3a797b059524fe8c8ccdc2b


## Getting the content
Start by downloading the repository of (English) books. This is done in bash. It has only been tested on Ubuntu, but macOS should work the same.

```
wget -m -H -nd "http://www.gutenberg.org/robot/harvest?filetypes[]=txt&langs[]=en"
```
This takes a few hours to run, and the result is stored in a folder called rawContent.
That folder is then copied to another folder, where we can start to clean up the mess.

First we delete some duplicates of the same books:
```
ls | grep "\-8.zip" | xargs rm
ls | grep "\-0.zip" | xargs rm
```
We can then unzip the files and remove the zip files:
```
unzip "*zip"
rm *.zip
```

Next we flatten the nested folders by moving every text file up to the top level:
```
mv */*.txt ./
```
And finally, we remove all rubbish that isn't a real book:

```
ls | grep -v "\.txt" | xargs rm -rf
```
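
For anyone who prefers to stay in Python, the shell steps above can be sketched roughly as follows. This is only a sketch under assumptions: the function name `unpack_books` and its two directory arguments are made up for illustration, and it assumes that archives ending in `-8.zip` or `-0.zip` are the duplicates removed above. It skips those archives, extracts only `.txt` members, and drops any nested folder names.

```python
import zipfile
from pathlib import Path


def unpack_books(raw_dir, out_dir):
    """Extract the .txt files from the zips in raw_dir into a flat out_dir.

    Hypothetical helper: mirrors the shell steps, it is not part of the
    original gist.
    """
    raw_dir, out_dir = Path(raw_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for archive in sorted(raw_dir.glob("*.zip")):
        # Assumption: "-8" and "-0" archives duplicate the plain one
        if archive.stem.endswith(("-8", "-0")):
            continue
        with zipfile.ZipFile(archive) as zf:
            for member in zf.namelist():
                if not member.endswith(".txt"):
                    continue  # ignore anything that is not a book text
                # Flatten nested folders: keep only the file name
                target = out_dir / Path(member).name
                target.write_bytes(zf.read(member))
                written.append(target.name)
    return written
```

Unlike the shell version, this leaves the downloaded zips untouched, so a failed run can simply be repeated.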


# Start the data cleansing

## Start with all imports at one place

In [2]:
from __future__ import absolute_import
from builtins import str
import os
from six import u

from os import listdir
from os.path import isfile, join

## Next we define some constants
Many more are probably needed. This has only been tested on a few books at a time.

In [3]:
file_path = "data/processedData"

TEXT_START_MARKERS = frozenset((u(_) for _ in (
    "*END*THE SMALL PRINT",
    "*** START OF THE PROJECT GUTENBERG",
    "*** START OF THIS PROJECT GUTENBERG",
    "This etext was prepared by",
    "E-text prepared by",
    "Produced by",
    "Distributed Proofreading Team",
    "Proofreading Team at http://www.pgdp.net",
    "http://gallica.bnf.fr)",
    "      http://archive.org/details/",
    "http://www.pgdp.net",
    "by The Internet Archive)",
    "by The Internet Archive/Canadian Libraries",
    "by The Internet Archive/American Libraries",
    "public domain material from the Internet Archive",
    "Internet Archive)",
    "Internet Archive/Canadian Libraries",
    "Internet Archive/American Libraries",
    "material from the Google Print project",
    "*END THE SMALL PRINT",
    "***START OF THE PROJECT GUTENBERG",
    "This etext was produced by",
    "*** START OF THE COPYRIGHTED",
    "The Project Gutenberg",
    "http://gutenberg.spiegel.de/ erreichbar.",
    "Project Runeberg publishes",
    "Beginning of this Project Gutenberg",
    "Project Gutenberg Online Distributed",
    "Gutenberg Online Distributed",
    "the Project Gutenberg Online Distributed",
    "Project Gutenberg TEI",
    "This eBook was prepared by",
    "http://gutenberg2000.de erreichbar.",
    "This Etext was prepared by",
    "This Project Gutenberg Etext was prepared by",
    "Gutenberg Distributed Proofreaders",
    "Project Gutenberg Distributed Proofreaders",
    "the Project Gutenberg Online Distributed Proofreading Team",
    "**The Project Gutenberg",
    "*SMALL PRINT!",
    "More information about this book is at the top of this file.",
    "tells you about restrictions in how the file may be used.",
    "l'authorization à les utilizer pour preparer ce texte.",
    "of the etext through OCR.",
    "*****These eBooks Were Prepared By Thousands of Volunteers!*****",
    "We need your donations more than ever!",
    " *** START OF THIS PROJECT GUTENBERG",
    "****     SMALL PRINT!",
    '["Small Print" V.',
    '      (http://www.ibiblio.org/gutenberg/',
    'and the Project Gutenberg Online Distributed Proofreading Team',
    'Mary Meehan, and the Project Gutenberg Online Distributed Proofreading',
    '                this Project Gutenberg edition.',
)))


TEXT_END_MARKERS = frozenset((u(_) for _ in (
    "*** END OF THE PROJECT GUTENBERG",
    "*** END OF THIS PROJECT GUTENBERG",
    "***END OF THE PROJECT GUTENBERG",
    "End of the Project Gutenberg",
    "End of The Project Gutenberg",
    "Ende dieses Project Gutenberg",
    "by Project Gutenberg",
    "End of Project Gutenberg",
    "End of this Project Gutenberg",
    "Ende dieses Projekt Gutenberg",
    "        ***END OF THE PROJECT GUTENBERG",
    "*** END OF THE COPYRIGHTED",
    "End of this is COPYRIGHTED",
    "Ende dieses Etextes ",
    "Ende dieses Project Gutenber",
    "Ende diese Project Gutenberg",
    "**This is a COPYRIGHTED Project Gutenberg Etext, Details Above**",
    "Fin de Project Gutenberg",
    "The Project Gutenberg Etext of ",
    "Ce document fut presente en lecture",
    "Ce document fut présenté en lecture",
    "More information about this book is at the top of this file.",
    "We need your donations more than ever!",
    "END OF PROJECT GUTENBERG",
    " End of the Project Gutenberg",
    " *** END OF THIS PROJECT GUTENBERG",
)))


LEGALESE_START_MARKERS = frozenset((u(_) for _ in (
    "<<THIS ELECTRONIC VERSION OF",
)))


LEGALESE_END_MARKERS = frozenset((u(_) for _ in (
    "SERVICE THAT CHARGES FOR DOWNLOAD",
)))

TITLE_MARKERS = frozenset((u(_) for _ in (
    "Title:",
)))

AUTHOR_MARKERS = frozenset((u(_) for _ in (
    "Author:",
)))
DATE_MARKERS = frozenset((u(_) for _ in (
    "Release Date:",
)))
LANGUAGE_MARKERS = frozenset((u(_) for _ in (
    "Language:",
)))
ENCODING_MARKERS = frozenset((u(_) for _ in (
    "Character set encoding:",
)))
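
All of these marker sets are used the same way: a line counts as a hit when it starts with any of the strings in the set. A minimal sketch of that check, assuming a hypothetical helper called `starts_with_any` and a cut-down stand-in for the real marker set, relies on the fact that `str.startswith` accepts a tuple of prefixes:

```python
def starts_with_any(line, markers):
    """Return True if line starts with any of the marker strings."""
    # str.startswith accepts a tuple of prefixes, so no explicit loop is needed
    return line.startswith(tuple(markers))


# Cut-down stand-in for TEXT_START_MARKERS, for illustration only
start_markers = frozenset((
    "*** START OF THIS PROJECT GUTENBERG",
    "Produced by",
))

print(starts_with_any(
    "*** START OF THIS PROJECT GUTENBERG EBOOK ANNE OF AVONLEA ***",
    start_markers,
))
print(starts_with_any("ANNE OF AVONLEA", start_markers))
```

The matching is case-sensitive and anchored at the first character, which is why several markers appear more than once with different capitalisation or leading spaces.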


## Read and cleanse the data
Much more is still needed here
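
The metadata part of the cell in this section boils down to "the first line that starts with a known prefix wins". Taken in isolation, that idea can be sketched as below. The function name `extract_metadata` and the field names in the dictionary are assumptions made for the example, and unlike the real loop it does not stop looking after the first few hundred lines.

```python
def extract_metadata(lines, fields=None):
    """Return {field: value} using the first matching header line per field."""
    if fields is None:
        # Hypothetical field names mapped to the header prefixes defined above
        fields = {
            "title": "Title:",
            "author": "Author:",
            "date": "Release Date:",
            "language": "Language:",
            "encoding": "Character set encoding:",
        }
    found = {}
    for line in lines:
        for name, prefix in fields.items():
            # Only the first occurrence of each field is kept
            if name not in found and line.startswith(prefix):
                found[name] = line[len(prefix):].strip()
    return found


# Header lines as they appear at the top of a Project Gutenberg text file
header = [
    "Title: Anne Of Avonlea",
    "",
    "Author: Lucy Maud Montgomery",
    "Release Date: March 7, 2006 [EBook #47]",
    "Language: English",
    "Character set encoding: ASCII",
]
print(extract_metadata(header))
```

Returning one dictionary per book also avoids having to keep five separate `*_found` flags in sync.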

In [4]:
# Initialize placeholders for results
## If I were a clever man, this would be a pandas DataFrame instead
## But I am not, so this is not
file_names = []
titles = []
authors = []
dates = []
languages = []
encodings = []
contents = []

# Get all filenames
files = [f for f in listdir(file_path) if isfile(join(file_path, f))]
#print(files)

# I'm too lazy to do a proper limit, but I don't want to parse the entire list every time I debug
dummyCounter = -1

# Go through each file
for file_name in files:
    dummyCounter = dummyCounter + 1
    #print(file_name)
    
    # See? I told you I was lazy. If I at least had done it the other way around, we wouldn't have to deal with this level of indentation
    if dummyCounter <3: # Normally I would like a space, but come on! It's a heart!
        # Read the file into lines
        # Use a context manager so the file handle is closed again after reading
        with open(join(file_path, file_name)) as file:
            file_content = file.read()

        lines = file_content.splitlines()
        sep = str(os.linesep)

        # Initialize results for single book
        out = []
        i = 0
        footer_found = False
        ignore_section = False

        title = ""
        author = ""
        date = ""
        language = ""
        encoding = ""
        
        # Reset flags for each book
        title_found = False
        author_found = False
        date_found = False
        language_found = False
        encoding_found = False

        for line in lines:
                reset = False

                print(line)
                if i <= 600:
                    # Shamelessly stolen
                    if any(line.startswith(token) for token in TEXT_START_MARKERS):
                        reset = True

                    # Extract Metadata
                    if title_found == False:
                        if any(line.startswith(token) for token in TITLE_MARKERS):
                            title_found = True
                            title = line
                    if author_found == False:
                        if any(line.startswith(token) for token in AUTHOR_MARKERS):
                            author_found = True
                            author = line
                    if date_found == False:
                        if any(line.startswith(token) for token in DATE_MARKERS):
                            date_found = True
                            date = line
                    if language_found == False:
                        if any(line.startswith(token) for token in LANGUAGE_MARKERS):
                            language_found = True
                            language = line
                    if encoding_found == False:
                        if any(line.startswith(token) for token in ENCODING_MARKERS):
                            encoding_found = True
                            encoding = line

                    # More theft from above
                    if reset:
                        out = []
                        continue
                        
                # I feel like a criminal by now. Guess what? Also stolen
                if i >= 100:
                    if any(line.startswith(token) for token in TEXT_END_MARKERS):
                        footer_found = True

                    if footer_found:
                        break

                if any(line.startswith(token) for token in LEGALESE_START_MARKERS):
                    ignore_section = True
                    continue
                elif any(line.startswith(token) for token in LEGALESE_END_MARKERS):
                    ignore_section = False
                    continue

                if not ignore_section:
                    if line != "": # Screw the blank lines
                        out.append(line.rstrip(sep))
                    i += 1


        # Do more cleaning: strip the marker prefix and surrounding whitespace.
        # Each append sits outside its loop so every book adds exactly one
        # entry per list, even if a marker set grows beyond a single token.
        for token in TITLE_MARKERS:
            title = title.replace(token, '').strip()
        titles.append(title)
        for token in AUTHOR_MARKERS:
            author = author.replace(token, '').strip()
        authors.append(author)
        for token in LANGUAGE_MARKERS:
            language = language.replace(token, '').strip()
        languages.append(language)
        for token in DATE_MARKERS:
            date = date.replace(token, '').strip()
        dates.append(date)
        for token in ENCODING_MARKERS:
            encoding = encoding.replace(token, '').strip()
        encodings.append(encoding)
        contents.append(out)
        file_names.append(file_name)

The Project Gutenberg EBook of Anne Of Avonlea, by Lucy Maud Montgomery

This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever.  You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included
with this eBook or online at www.gutenberg.org


Title: Anne Of Avonlea

Author: Lucy Maud Montgomery

Release Date: March 7, 2006 [EBook #47]

Language: English

Character set encoding: ASCII

*** START OF THIS PROJECT GUTENBERG EBOOK ANNE OF AVONLEA ***




Produced by An Anonymous Volunteer and David Widger





ANNE OF AVONLEA

by Lucy Maud Montgomery



To

my former teacher
HATTIE GORDON SMITH
in grateful remembrance of her
sympathy and encouragement.


     Flowers spring to blossom where she walks
     The careful ways of duty,
     Our hard, stiff lines of life with her
     Are flowing curves of beauty.
     --WHITTIER



     I         An Irate Neighbor
     II        Selling in Haste and Repenting at Lei

"Well, never mind. This day's done and there's a new one coming
tomorrow, with no mistakes in it yet, as you used to say yourself. Just
come downstairs and have your supper. You'll see if a good cup of tea
and those plum puffs I made today won't hearten you up."

"Plum puffs won't minister to a mind diseased," said Anne
disconsolately; but Marilla thought it a good sign that she had
recovered sufficiently to adapt a quotation.

The cheerful supper table, with the twins' bright faces, and Marilla's
matchless plum puffs . . . of which Davy ate four . . .  did "hearten her
up" considerably after all. She had a good sleep that night and
awakened in the morning to find herself and the world transformed. It
had snowed softly and thickly all through the hours of darkness and the
beautiful whiteness, glittering in the frosty sunshine, looked like a
mantle of charity cast over all the mistakes and humiliations of the
past.

     "Every morn is a fresh beginning,
     Every morn is the world mad

The house was a low-eaved structure built of undressed blocks of red
Island sandstone, with a little peaked roof out of which peered two
dormer windows, with quaint wooden hoods over them, and two great
chimneys. The whole house was covered with a luxuriant growth of ivy,
finding easy foothold on the rough stonework and turned by autumn frosts
to most beautiful bronze and wine-red tints.

Before the house was an oblong garden into which the lane gate where
the girls were standing opened. The house bounded it on one side; on
the three others it was enclosed by an old stone dyke, so overgrown with
moss and grass and ferns that it looked like a high, green bank. On the
right and left the tall, dark spruces spread their palm-like branches
over it; but below it was a little meadow, green with clover aftermath,
sloping down to the blue loop of the Grafton River. No other house or
clearing was in sight . . . nothing but hills and valleys covered with
feathery young firs.

"I wonder what sort 

porridge taking effect at last. Perhaps it is. Goodness knows . . ." Paul
sighed deeply . . . "I've eaten enough to make anyone grow. I do hope,
now that I've begun, I'll keep on till I'm as tall as father. He is six
feet, you know, Miss Lavendar."

Yes, Miss Lavendar did know; the flush on her pretty cheeks deepened
a little; she took Paul's hand on one side and Anne's on the other and
walked to the house in silence.

"Is it a good day for the echoes, Miss Lavendar?" queried Paul
anxiously. The day of his first visit had been too windy for echoes and
Paul had been much disappointed.

"Yes, just the best kind of a day," answered Miss Lavendar, rousing
herself from her reverie. "But first we are all going to have something
to eat. I know you two folks didn't walk all the way back here through
those beechwoods without getting hungry, and Charlotta the Fourth and
I can eat any hour of the day . . . we have such obliging appetites. So
we'll just make a raid on the pantry. Fortunately it's 

A vagrant shepherd journeying for hire?

MESSENGER
True, but thy savior in that hour, my son.

OEDIPUS
My savior? from what harm? what ailed me then?

MESSENGER
Those ankle joints are evidence enow.

OEDIPUS
Ah, why remind me of that ancient sore?

MESSENGER
I loosed the pin that riveted thy feet.

OEDIPUS
Yes, from my cradle that dread brand I bore.

MESSENGER
Whence thou deriv'st the name that still is thine.

OEDIPUS
Who did it?  I adjure thee, tell me who
Say, was it father, mother?

MESSENGER
                              I know not.
The man from whom I had thee may know more.

OEDIPUS
What, did another find me, not thyself?

MESSENGER
Not I; another shepherd gave thee me.

OEDIPUS
Who was he?  Would'st thou know again the man?

MESSENGER
He passed indeed for one of Laius' house.

OEDIPUS
The king who ruled the country long ago?

MESSENGER
The same:  he was a herdsman of the king.

OEDIPUS
And is he living still for me to see him?

MESSENGER
His fellow-countrymen should best know 

A murdered sire--shall I be held to blame.
Come, answer me one question, if thou canst:
If one should presently attempt thy life,
Would'st thou, O man of justice, first inquire
If the assassin was perchance thy sire,
Or turn upon him?  As thou lov'st thy life,
On thy aggressor thou would'st turn, no stay
Debating, if the law would bear thee out.
Such was my case, and such the pass whereto
The gods reduced me; and methinks my sire,
Could he come back to life, would not dissent.
Yet thou, for just thou art not, but a man
Who sticks at nothing, if it serve his plea,
Reproachest me with this before these men.
It serves thy turn to laud great Theseus' name,
And Athens as a wisely governed State;
Yet in thy flatteries one thing is to seek:
If any land knows how to pay the gods
Their proper rites, 'tis Athens most of all.
This is the land whence thou wast fain to steal
Their aged suppliant and hast carried off
My daughters.  Therefore to yon goddesses,
I turn, adjure them and invoke their aid

(Ant. 2)
Hope flits about never-wearying wings;
Profit to some, to some light loves she brings,
But no man knoweth how her gifts may turn,
Till 'neath his feet the treacherous ashes burn.
Sure 'twas a sage inspired that spake this word;
          _If evil good appear_
          _To any, Fate is near_;
And brief the respite from her flaming sword.

          Hither comes in angry mood
          Haemon, latest of thy brood;
          Is it for his bride he's grieved,
          Or her marriage-bed deceived,
          Doth he make his mourn for thee,
          Maid forlorn, Antigone?
[Enter HAEMON]

CREON
Soon shall we know, better than seer can tell.
Learning may fixed decree anent thy bride,
Thou mean'st not, son, to rave against thy sire?
Know'st not whate'er we do is done in love?

HAEMON
O father, I am thine, and I will take
Thy wisdom as the helm to steer withal.
Therefore no wedlock shall by me be held
More precious than thy loving goverance.

CREON
Well spoken:  so right-minded son

Spake with naked hearts together,
Pondering much and much contriving
How the tribes of men might prosper.

Most beloved by Hiawatha
Was the gentle Chibiabos,
He the best of all musicians,
He the sweetest of all singers.
Beautiful and childlike was he,
Brave as man is, soft as woman,
Pliant as a wand of willow,
Stately as a deer with antlers.

When he sang, the village listened;
All the warriors gathered round him,
All the women came to hear him;
Now he stirred their souls to passion,
Now he melted them to pity.

From the hollow reeds he fashioned
Flutes so musical and mellow,
That the brook, the Sebowisha,
Ceased to murmur in the woodland,
That the wood-birds ceased from singing,
And the squirrel, Adjidaumo,
Ceased his chatter in the oak-tree,
And the rabbit, the Wabasso,
Sat upright to look and listen.

Yes, the brook, the Sebowisha,
Pausing, said, "O Chibiabos,
Teach my waves to flow in music,
Softly as your words in singing!"

Yes, the bluebird, the Owaissa,
Envious, said, "O Chibia

In the land of Sleep and Silence,
Still the voice of love would reach you!"

And the last of all the figures
Was a heart within a circle,
Drawn within a magic circle;
And the image had this meaning:
"Naked lies your heart before me,
To your naked heart I whisper!"

Thus it was that Hiawatha,
In his wisdom, taught the people
All the mysteries of painting,
All the art of Picture-Writing,
On the smooth bark of the birch-tree,
On the white skin of the reindeer,
On the grave-posts of the village.



            XV

    Hiawatha's Lamentation

In those days the Evil Spirits,
All the Manitos of mischief,
Fearing Hiawatha's wisdom,
And his love for Chibiabos,
Jealous of their faithful friendship,
And their noble words and actions,
Made at length a league against them,
To molest them and destroy them.

Hiawatha, wise and wary,
Often said to Chibiabos,
"O my brother! do not leave me,
Lest the Evil Spirits harm you!"
Chibiabos, young and heedless,
Laughing shook his coal-black tresses,
Answered e

    the dragon fly
Esa, shame upon you
Ewa-yea', lullaby
Gitche Gu'mee, The Big-Sea-Water,
    Lake Superior
Gitche Man'ito, the Great Spirit,
    the Master of Life
Gushkewau', the darkness
Hiawa'tha, the Prophet, the Teacher,
    son of Mudjekeewis, the West-Wind and Wenonah,
    daughter of Nokomis
Ia'goo, a great boaster and story-teller
Inin'ewug, men, or pawns in the Game of the Bowl
Ishkoodah', fire, a comet
Jee'bi, a ghost, a spirit
Joss'akeed, a prophet
Kabibonok'ka, the North-Wind
Ka'go, do not
Kahgahgee', the raven
Kaw, no
Kaween', no indeed
Kayoshk', the sea-gull
Kee'go, a fish
Keeway'din, the Northwest wind, the Home-wind
Kena'beek, a serpent
Keneu', the great war-eagle
Keno'zha, the pickerel
Ko'ko-ko'ho, the owl
Kuntasoo', the Game of Plumstones
Kwa'sind, the Strong Man
Kwo-ne'-she, or Dush-kwo-ne'-she, the dragon-fly
Mahnahbe'zee, the swan
Mahng, the loon
Mahnomo'nee, wild rice
Ma'ma, the woodpecker
Me'da, a medicine-man
Meenah'ga, the blueberry
Megissog'won, the great P

# Let's see the results so far

In [5]:
print(file_names)
print(titles)
print(authors)
print(languages)
print(dates)
print(encodings)
print(contents)

['47.txt', '31.txt', '19.txt']
['Anne Of Avonlea', 'The Oedipus Trilogy', 'The Song Of Hiawatha']
['Lucy Maud Montgomery', 'Sophocles', 'Henry W. Longfellow']
['English', 'English', 'English']
['March 7, 2006 [EBook #47]', 'March 7, 2006 [EBook #31]', 'May 27, 2007 [EBook #19]']
['ASCII', 'ASCII', 'ASCII']
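
The release dates above still carry the ebook number, and the results live in parallel lists. One possible next step, sketched here with the standard library only, is to zip the lists into one record per book and split the date string. The pattern `RELEASE_RE`, the helper `parse_release` and the record keys are assumptions for this example. The date format is inferred from the three values shown, and `%B` expects English month names, so it depends on the active locale.

```python
import re
from datetime import datetime

# Sample values copied from the output above
file_names = ['47.txt', '31.txt', '19.txt']
titles = ['Anne Of Avonlea', 'The Oedipus Trilogy', 'The Song Of Hiawatha']
dates = ['March 7, 2006 [EBook #47]', 'March 7, 2006 [EBook #31]',
         'May 27, 2007 [EBook #19]']

# Assumed format: "<Month> <day>, <year> [EBook #<number>]"
RELEASE_RE = re.compile(
    r"^(?P<date>\w+ \d{1,2}, \d{4}) \[EBook #(?P<number>\d+)\]$"
)


def parse_release(raw):
    """Split a release string into an ISO date and an ebook number."""
    match = RELEASE_RE.match(raw)
    if match is None:
        return None, None  # leave odd values for manual inspection
    released = datetime.strptime(match.group("date"), "%B %d, %Y").date()
    return released.isoformat(), int(match.group("number"))


# One dictionary per book instead of several parallel lists
records = []
for file_name, title, raw_date in zip(file_names, titles, dates):
    released, ebook_number = parse_release(raw_date)
    records.append({
        "file_name": file_name,
        "title": title,
        "released": released,
        "ebook_number": ebook_number,
    })

for record in records:
    print(record)
```

A list of dictionaries like `records` can later be handed to a table library unchanged, which keeps the door open for the pandas version mentioned earlier.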
