<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Introduction" data-toc-modified-id="Introduction-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Introduction</a></span><ul class="toc-item"><li><span><a href="#Getting-the-content" data-toc-modified-id="Getting-the-content-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Getting the content</a></span></li></ul></li><li><span><a href="#Start-the-data-cleansing" data-toc-modified-id="Start-the-data-cleansing-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Start the data cleansing</a></span><ul class="toc-item"><li><span><a href="#Start-with-all-imports-at-one-place" data-toc-modified-id="Start-with-all-imports-at-one-place-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Start with all imports at one place</a></span></li><li><span><a href="#Next-we-define-some-constants" data-toc-modified-id="Next-we-define-some-constants-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Next we define some constants</a></span></li><li><span><a href="#Read-and-cleanse-the-data" data-toc-modified-id="Read-and-cleanse-the-data-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>Read and cleanse the data</a></span></li></ul></li><li><span><a href="#Let's-see-the-results-so-far" data-toc-modified-id="Let's-see-the-results-so-far-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Let's see the results so far</a></span></li></ul></div>

# Introduction
This is a very rough first draft of importing and cleansing the data. The solution is heavily inspired by (okay... completely ripped off from) https://gist.github.com/mbforbes/cee3fd5bb3a797b059524fe8c8ccdc2b


## Getting the content
Start by downloading the repository of (English) books. This is done in bash. It has only been tested on Ubuntu, but macOS should work the same.

```
wget -m -H -nd "http://www.gutenberg.org/robot/harvest?filetypes[]=txt&langs[]=en"
```
The download takes a few hours to run, and the result is stored in a folder called rawContent.
That folder is then copied to another folder, where we can start to clean up the mess.

First we delete some duplicates of the same books:
```
ls | grep "\-8.zip" | xargs rm
ls | grep "\-0.zip" | xargs rm
```
We can then unzip the files and remove the zip archives:
```
unzip "*zip"
rm *.zip
```

Next we take care of some nested folders:
```
mv */*.txt ./
```
And finally, we remove all rubbish that isn't a real book:

```
ls | grep -v "\.txt" | xargs rm -rf
```
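
For anyone who would rather stay in Python, the shell steps above can be approximated with the standard library. This is only a rough sketch under a few assumptions: the function name `prepare_books` is made up for illustration, it should be pointed at the copied folder (never at rawContent itself), and it keeps files whose extension is exactly `.txt`, which is slightly stricter than the `grep -v "\.txt"` filter. A corrupt archive would raise `zipfile.BadZipFile`, which is not handled here.

```python
import shutil
import zipfile
from pathlib import Path


def prepare_books(folder):
    """Drop duplicate archives, unzip, flatten one level, keep only .txt files."""
    folder = Path(folder)

    # 1. Delete the "-8" and "-0" variants, which duplicate other archives.
    for pattern in ("*-8.zip", "*-0.zip"):
        for archive in list(folder.glob(pattern)):
            archive.unlink()

    # 2. Extract every remaining archive, then delete the archive itself.
    for archive in list(folder.glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(folder)
        archive.unlink()

    # 3. Move .txt files out of first-level subfolders into the main folder.
    for nested in list(folder.glob("*/*.txt")):
        shutil.move(str(nested), str(folder / nested.name))

    # 4. Remove everything that is not a .txt file (leftover folders included).
    for entry in list(folder.iterdir()):
        if entry.is_dir():
            shutil.rmtree(entry)
        elif entry.suffix != ".txt":
            entry.unlink()

    # Return the names of the books that survived the cleanup.
    return sorted(p.name for p in folder.glob("*.txt"))
```

Calling `prepare_books("data/processedData")` would then leave only the text files in that folder, assuming that is where the copy of the raw download lives.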


# Start the data cleansing

## Start with all imports at one place

In [6]:
from __future__ import absolute_import
from builtins import str
import os
from six import u

from os import listdir
from os.path import isfile, join

## Next we define some constants
Many more are probably needed; this has only been tested on a few books at a time.

In [7]:
file_path = "data/processedData"

TEXT_START_MARKERS = frozenset((u(_) for _ in (
    "*END*THE SMALL PRINT",
    "*** START OF THE PROJECT GUTENBERG",
    "*** START OF THIS PROJECT GUTENBERG",
    "This etext was prepared by",
    "E-text prepared by",
    "Produced by",
    "Distributed Proofreading Team",
    "Proofreading Team at http://www.pgdp.net",
    "http://gallica.bnf.fr)",
    "      http://archive.org/details/",
    "http://www.pgdp.net",
    "by The Internet Archive)",
    "by The Internet Archive/Canadian Libraries",
    "by The Internet Archive/American Libraries",
    "public domain material from the Internet Archive",
    "Internet Archive)",
    "Internet Archive/Canadian Libraries",
    "Internet Archive/American Libraries",
    "material from the Google Print project",
    "*END THE SMALL PRINT",
    "***START OF THE PROJECT GUTENBERG",
    "This etext was produced by",
    "*** START OF THE COPYRIGHTED",
    "The Project Gutenberg",
    "http://gutenberg.spiegel.de/ erreichbar.",
    "Project Runeberg publishes",
    "Beginning of this Project Gutenberg",
    "Project Gutenberg Online Distributed",
    "Gutenberg Online Distributed",
    "the Project Gutenberg Online Distributed",
    "Project Gutenberg TEI",
    "This eBook was prepared by",
    "http://gutenberg2000.de erreichbar.",
    "This Etext was prepared by",
    "This Project Gutenberg Etext was prepared by",
    "Gutenberg Distributed Proofreaders",
    "Project Gutenberg Distributed Proofreaders",
    "the Project Gutenberg Online Distributed Proofreading Team",
    "**The Project Gutenberg",
    "*SMALL PRINT!",
    "More information about this book is at the top of this file.",
    "tells you about restrictions in how the file may be used.",
    "l'authorization à les utilizer pour preparer ce texte.",
    "of the etext through OCR.",
    "*****These eBooks Were Prepared By Thousands of Volunteers!*****",
    "We need your donations more than ever!",
    " *** START OF THIS PROJECT GUTENBERG",
    "****     SMALL PRINT!",
    '["Small Print" V.',
    '      (http://www.ibiblio.org/gutenberg/',
    'and the Project Gutenberg Online Distributed Proofreading Team',
    'Mary Meehan, and the Project Gutenberg Online Distributed Proofreading',
    '                this Project Gutenberg edition.',
)))


TEXT_END_MARKERS = frozenset((u(_) for _ in (
    "*** END OF THE PROJECT GUTENBERG",
    "*** END OF THIS PROJECT GUTENBERG",
    "***END OF THE PROJECT GUTENBERG",
    "End of the Project Gutenberg",
    "End of The Project Gutenberg",
    "Ende dieses Project Gutenberg",
    "by Project Gutenberg",
    "End of Project Gutenberg",
    "End of this Project Gutenberg",
    "Ende dieses Projekt Gutenberg",
    "        ***END OF THE PROJECT GUTENBERG",
    "*** END OF THE COPYRIGHTED",
    "End of this is COPYRIGHTED",
    "Ende dieses Etextes ",
    "Ende dieses Project Gutenber",
    "Ende diese Project Gutenberg",
    "**This is a COPYRIGHTED Project Gutenberg Etext, Details Above**",
    "Fin de Project Gutenberg",
    "The Project Gutenberg Etext of ",
    "Ce document fut presente en lecture",
    "Ce document fut présenté en lecture",
    "More information about this book is at the top of this file.",
    "We need your donations more than ever!",
    "END OF PROJECT GUTENBERG",
    " End of the Project Gutenberg",
    " *** END OF THIS PROJECT GUTENBERG",
)))


LEGALESE_START_MARKERS = frozenset((u(_) for _ in (
    "<<THIS ELECTRONIC VERSION OF",
)))


LEGALESE_END_MARKERS = frozenset((u(_) for _ in (
    "SERVICE THAT CHARGES FOR DOWNLOAD",
)))

TITLE_MARKERS = frozenset((u(_) for _ in (
    "Title:",
)))

AUTHOR_MARKERS = frozenset((u(_) for _ in (
    "Author:",
)))
DATE_MARKERS = frozenset((u(_) for _ in (
    "Release Date:",
)))
LANGUAGE_MARKERS = frozenset((u(_) for _ in (
    "Language:",
)))
ENCODING_MARKERS = frozenset((u(_) for _ in (
    "Character set encoding:",
)))
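
All of these marker sets are used the same way: a line counts as a hit when it starts with one of the strings. The sketch below shows that check in isolation. The helper name `starts_with_any` and the two-entry `EXAMPLE_START_MARKERS` set are made up for illustration and are not part of the notebook.

```python
def starts_with_any(line, markers):
    """Return True if `line` begins with at least one of the marker strings."""
    # str.startswith accepts a tuple of prefixes, so no explicit loop is needed.
    return line.startswith(tuple(markers))


# Small stand-in for the much longer marker sets defined above (illustrative only).
EXAMPLE_START_MARKERS = frozenset((
    "*** START OF THE PROJECT GUTENBERG",
    "Produced by",
))

header_line = "*** START OF THE PROJECT GUTENBERG EBOOK ANNE OF AVONLEA ***"
indented_line = " " + header_line

matches_plain = starts_with_any(header_line, EXAMPLE_START_MARKERS)
matches_indented = starts_with_any(indented_line, EXAMPLE_START_MARKERS)
```

The match is case-sensitive and anchored at the first character, so the indented line does not match. That is presumably why the real sets carry several near-identical entries that differ only in leading spaces or capitalisation.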


## Read and cleanse the data
Much more is still needed here

In [8]:
# Initialize placeholders for results
## If I were a clever man, this would be a pandas DataFrame instead
## But I am not, so this is not
file_names = []
titles = []
authors = []
dates = []
languages = []
encodings = []
contents = []

# Get all filenames
files = [f for f in listdir(file_path) if isfile(join(file_path, f))]
#print(files)

# I'm too lazy to do a proper limit, but I don't want to parse the entire list every time I debug
dummyCounter = -1

# Go through each file
for file_name in files:
    dummyCounter = dummyCounter + 1
    #print(file_name)
    
    # See? I told you I was lazy. If I at least had done it the other way around, we wouldn't have to deal with this level of indentation
    if dummyCounter <3: # Normally I would like a space, but come on! It's a heart!
        # Read the file into lines
        # The context manager closes the file handle even if reading fails
        with open(join(file_path, file_name)) as file:
            file_content = file.read()

        lines = file_content.splitlines()
        sep = str(os.linesep)

        # Initialize results for single book
        out = []
        i = 0
        footer_found = False
        ignore_section = False

        title = ""
        author = ""
        date = ""
        language = ""
        encoding = ""
        
        # Reset flags for each book
        title_found = False
        author_found = False
        date_found = False
        language_found = False
        encoding_found = False

        for line in lines:
                reset = False

                #print(line)
                if i <= 600:
                    # Shamelessly stolen
                    if any(line.startswith(token) for token in TEXT_START_MARKERS):
                        reset = True

                    # Extract Metadata
                    # Only the first matching line per field is kept
                    if not title_found and any(line.startswith(token) for token in TITLE_MARKERS):
                        title_found = True
                        title = line
                    if not author_found and any(line.startswith(token) for token in AUTHOR_MARKERS):
                        author_found = True
                        author = line
                    if not date_found and any(line.startswith(token) for token in DATE_MARKERS):
                        date_found = True
                        date = line
                    if not language_found and any(line.startswith(token) for token in LANGUAGE_MARKERS):
                        language_found = True
                        language = line
                    if not encoding_found and any(line.startswith(token) for token in ENCODING_MARKERS):
                        encoding_found = True
                        encoding = line

                    # More theft from above
                    if reset:
                        out = []
                        continue
                        
                # I feel like a criminal by now. Guess what? Also stolen
                if i >= 100:
                    if any(line.startswith(token) for token in TEXT_END_MARKERS):
                        footer_found = True

                    if footer_found:
                        break

                if any(line.startswith(token) for token in LEGALESE_START_MARKERS):
                    ignore_section = True
                    continue
                elif any(line.startswith(token) for token in LEGALESE_END_MARKERS):
                    ignore_section = False
                    continue

                if not ignore_section:
                    if line != "": # Screw the blank lines
                        out.append(line.rstrip(sep))
                    i += 1


        # Do more cleaning: strip the marker prefix and surrounding whitespace
        for token in TITLE_MARKERS:
            title = title.replace(token, '').strip()
        for token in AUTHOR_MARKERS:
            author = author.replace(token, '').strip()
        for token in LANGUAGE_MARKERS:
            language = language.replace(token, '').strip()
        for token in DATE_MARKERS:
            date = date.replace(token, '').strip()
        for token in ENCODING_MARKERS:
            encoding = encoding.replace(token, '').strip()

        # Append exactly once per book, outside the marker loops,
        # so the result lists stay aligned even if a marker set grows
        titles.append(title)
        authors.append(author)
        languages.append(language)
        dates.append(date)
        encodings.append(encoding)
        contents.append(out)
        file_names.append(file_name)
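
The metadata part of the loop above can also be pulled out into a small function that returns one record per book instead of filling five parallel lists. This is a simplified sketch, not a drop-in replacement: `extract_metadata` and `METADATA_PREFIXES` are hypothetical names, each field gets a single prefix, and the header window is taken as the first 600 raw lines rather than the counter `i` used in the cell.

```python
# One prefix per field, copied from the marker sets defined earlier.
METADATA_PREFIXES = {
    "title": "Title:",
    "author": "Author:",
    "date": "Release Date:",
    "language": "Language:",
    "encoding": "Character set encoding:",
}


def extract_metadata(lines, max_lines=600):
    """Return the first value found for each metadata field in the header."""
    record = {field: "" for field in METADATA_PREFIXES}
    for line in lines[:max_lines]:
        for field, prefix in METADATA_PREFIXES.items():
            # Keep only the first hit per field, as the flags in the loop do.
            if not record[field] and line.startswith(prefix):
                record[field] = line[len(prefix):].strip()
    return record


sample_header = [
    "Title: Anne Of Avonlea",
    "",
    "Author: Lucy Maud Montgomery",
    "Release Date: March 7, 2006 [EBook #47]",
    "Language: English",
    "Character set encoding: ASCII",
]
record = extract_metadata(sample_header)
```

A list of such records is also a convenient shape to hand to a table library later, if the parallel lists ever become a nuisance.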

# Let's see the results so far

In [9]:
print(file_names)
print(titles)
print(authors)
print(languages)
print(dates)
print(encodings)
print(contents)

['47.txt', '31.txt', '19.txt']
['Anne Of Avonlea', 'The Oedipus Trilogy', 'The Song Of Hiawatha']
['Lucy Maud Montgomery', 'Sophocles', 'Henry W. Longfellow']
['English', 'English', 'English']
['March 7, 2006 [EBook #47]', 'March 7, 2006 [EBook #31]', 'May 27, 2007 [EBook #19]']
['ASCII', 'ASCII', 'ASCII']
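
The dates above still carry the `[EBook #NN]` tag. One possible next cleaning step is to split that field into a real date and the book number. The sketch below assumes every value looks like the three shown here (an English month name, a day, a year, then the tag). The names `RELEASE_PATTERN` and `split_release_field` are made up for illustration, and `%B` depends on the active locale providing English month names.

```python
import re
from datetime import datetime

# "<date text> [EBook #<number>]", e.g. "March 7, 2006 [EBook #47]"
RELEASE_PATTERN = re.compile(r"^(?P<date>.+?)\s*\[EBook #(?P<number>\d+)\]$")


def split_release_field(value):
    """Split a release field into (date, ebook_number); parts may be None."""
    match = RELEASE_PATTERN.match(value.strip())
    if match is None:
        # The field does not follow the assumed layout at all.
        return None, None
    try:
        released = datetime.strptime(match.group("date"), "%B %d, %Y").date()
    except ValueError:
        # The tag was found, but the date text is in some other format.
        released = None
    return released, int(match.group("number"))


released, ebook_number = split_release_field("March 7, 2006 [EBook #47]")
```

Values that do not fit the assumed layout come back as `None` rather than raising, so odd headers can be collected and inspected afterwards.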
