# Gutenberg Exploration

This notebook is dedicated to cleaning and exploring the Project Gutenberg dataset, which seems to be rather difficult to deal with.

There are a few things that I want to explore and save about each book:

- Gutenberg index and file used for the analysis
- Character set encoding: UTF-8
- Title
- Author
- Language
- Total book length (characters, words, sentences, paragraphs, lines)
- Paragraph length (characters, words, sentences)
- Sentence length (characters, words)
- Every word (as per a pre-defined tokenizer) and its count
- Every character used in the book
- Min, mean, median, max and std of the length of words, sentences and paragraphs (this can be done after a pre-processing step)

Save all of that to a zip file for each processed file, named "original_filename"+analysis.zip.

Save size stats somewhere else to keep track and see if I can share all of it.

Graph the general stats by language here in this notebook.

SpaCy is used for the tokenization.

In [256]:
#general imports for file manipulation
import os
import sys
import zipfile  # to read the zipped gutenberg text files
import rdflib  # to read the rdf directory of the gutenberg files
from pathlib import Path  # to deal with file paths, naming and other things in a platform independent manner
import numpy as np

In [151]:
# NLP imports
import spacy

In [152]:
# language imports
from pycountry import languages  # to deal with language naming conventions

In [153]:
# multiprocessing imports
from multiprocessing import Pool

In [154]:
# local tools imports
from utils import get_all_files_recurse, path_leaf
# gutenberg project metadata parsing
from metainfo import * # slightly modified script so it works for what I need


In [155]:
# gutenberg specific imports
import gutenberg_cleaner  #to clean headers and footers from gutenberg files, they are NOISE
# redefine the naming .. just because -> TODO take this out
gcleaner = gutenberg_cleaner.simple_cleaner

In [6]:
gcleaner?

Signature: gcleaner(book: str) -> str
Docstring:
Just removes lines that are part of the Project Gutenberg header or footer.
Doesnt go deeply in the text to remove other things like titles or footnotes or etc.
:rtype: str
:param book: str of a gutenberg's book
:return: str of the book without the lines that are part of the Project Gutenberg header and footer.
File:      ~/venv3/lib/python3.8/site-packages/gutenberg_cleaner-0.1.6-py3.8.egg/_cleaning_options/cleaner.py
Type:      function


In [81]:
# listing of all the files, from https://www.gutenberg.org/dirs/GUTINDEX.zip
GUTINDEX = "/media/nfs/Datasets/text/Gutenberg/GUTINDEX.ALL"
GUTINDEX_ZIP = "/media/nfs/Datasets/text/Gutenberg/GUTINDEX.zip"
with open(GUTINDEX, "r") as f:
    gutindex = f.read()


In [86]:
begin_idx,end_idx = gutindex.find('<==LISTINGS==>'), gutindex.find('<==End of GUTINDEX.ALL==>')


In [89]:
gutindex_db = gutindex[begin_idx:end_idx]

In [90]:
gutindex_db[:100]

'<==LISTINGS==>\n\n**** A C Following a Project Gutenberg eBook Number Indicates Copyright ****\n\n   ***'

In [91]:
gutindex_lines = gutindex_db.split('\n\n')

In [92]:
len(gutindex_lines)

55328

Is this good enough? Are there missing files, or is it just that the zip dump contains many duplicate files?
Are all the files present in this listing?
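
One way to answer these questions would be to compare the eBook numbers in the index with the IDs in the folder names of the zip list. This is only a minimal sketch: it assumes that the first line of each index entry ends with the eBook number, optionally followed by a `C` copyright flag (as the listing header suggests). The helper names `index_ids` and `file_ids` are hypothetical and the sample entries are made up for illustration.

```python
import re
from pathlib import PurePosixPath

# Assumption: the first line of an entry ends with the eBook number, optionally followed by "C"
_ENTRY_ID = re.compile(r"\s(\d+)C?\s*$")


def index_ids(entries):
    """Collect the eBook numbers found on the first line of each index entry."""
    ids = set()
    for entry in entries:
        lines = entry.strip().splitlines()
        if not lines:
            continue
        match = _ENTRY_ID.search(lines[0])
        if match:
            ids.add(int(match.group(1)))
    return ids


def file_ids(paths):
    """Collect the IDs from the parent folder name of each zip path."""
    ids = set()
    for p in paths:
        parent = PurePosixPath(p).parent.name
        if parent.isdigit():
            ids.add(int(parent))
    return ids


# made-up sample data, only to show the comparison
sample_entries = [
    "Some Title, by Some Author                                  10673",
    "Another Title, by Another Author                            10674C\n [Subtitle: continued]",
    "**** not an entry ****",
]
sample_paths = [
    "aleph.gutenberg.org/1/0/6/7/10673/10673.zip",
    "aleph.gutenberg.org/1/0/6/7/10675/10675-8.zip",
]

in_index = index_ids(sample_entries)
on_disk = file_ids(sample_paths)
missing_on_disk = in_index - on_disk    # listed in the index but no zip found
missing_in_index = on_disk - in_index   # zip found but not listed in the index
```

With the real data, `index_ids(gutindex_lines)` and `file_ids(gutfiles)` would be the inputs. Entries whose first line ends with something other than the number (a year in the title, for example) would need a stricter rule.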

Trying to get the RDF/XML database instead, but it seems to be a pain to work with.

In [9]:
# where the Gutenberg project file dump is
BASE_DIR = "/media/nfs/Datasets/text/Gutenberg/aleph.gutenberg.org"


In [104]:
RDF_TAR_FILE = "/media/nfs/Datasets/text/Gutenberg/rdf-files.tar.bz2"
BASE_RDF_DIR = "/media/nfs/Datasets/text/Gutenberg/rdf_db/cache/epub"

In [26]:
# get all files
#gutfiles = get_all_files_recurse(BASE_DIR)
# gutfiles = [ f for f in gutfiles if f.endswith('.zip')]
# I already had a complete list done in a txt file
with open("/media/nfs/Datasets/text/Gutenberg/zip_list.txt", "r") as f:
    gutfiles = f.readlines()

In [27]:
len(gutfiles)

91793

In [28]:
gutfiles[0].split(" ")

['[', '45K]', '', 'aleph.gutenberg.org/0/1/1-0.zip\n']

In [29]:
gutfiles = [ i.split("  ")[1].replace('\n', '') for i in gutfiles]

In [31]:
gutfiles[1000:1010]

['aleph.gutenberg.org/1/0/6/7/10673/10673.zip',
 'aleph.gutenberg.org/1/0/6/7/10674/10674-8.zip',
 'aleph.gutenberg.org/1/0/6/7/10674/10674.zip',
 'aleph.gutenberg.org/1/0/6/7/10675/10675-8.zip',
 'aleph.gutenberg.org/1/0/6/7/10675/10675.zip',
 'aleph.gutenberg.org/1/0/6/7/10676/10676-8.zip',
 'aleph.gutenberg.org/1/0/6/7/10676/10676.zip',
 'aleph.gutenberg.org/1/0/6/7/10677/10677-8.zip',
 'aleph.gutenberg.org/1/0/6/7/10677/10677.zip',
 'aleph.gutenberg.org/1/0/6/7/10678/10678-8.zip']
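
The listing shows that the same book can appear several times (`10674.zip` and `10674-8.zip`), which probably explains part of the difference between the number of zip files and the number of index entries. A possible way to keep one file per book is sketched below. The preference order is an assumption based on the metadata entries further down, where `-0` files are tagged as UTF-8, `-8` files as ISO-8859-1 and files without suffix as US-ASCII. `pick_one_per_id` is a hypothetical helper, not part of any library.

```python
from collections import defaultdict
from pathlib import PurePosixPath

# Assumed preference: "-0" (UTF-8) first, then "-8" (8-bit), then no suffix (plain ASCII)
SUFFIX_RANK = {"-0": 0, "-8": 1, "": 2}


def pick_one_per_id(paths):
    """Group zip paths by Gutenberg ID (parent folder) and keep the preferred variant."""
    by_id = defaultdict(list)
    for p in paths:
        pth = PurePosixPath(p)
        book_id = pth.parent.name
        if not pth.stem.startswith(book_id):
            continue  # file name does not follow the "<id><suffix>.zip" pattern
        suffix = pth.stem[len(book_id):]  # "", "-0", "-8" or something else such as "-h"
        if suffix in SUFFIX_RANK:
            by_id[book_id].append((SUFFIX_RANK[suffix], p))
    # the lowest rank wins for each ID
    return {book_id: min(variants)[1] for book_id, variants in by_id.items()}


sample = [
    "aleph.gutenberg.org/1/0/6/7/10673/10673.zip",
    "aleph.gutenberg.org/1/0/6/7/10674/10674-8.zip",
    "aleph.gutenberg.org/1/0/6/7/10674/10674.zip",
]
chosen = pick_one_per_id(sample)
```

Variants with other suffixes (HTML or image archives, for instance) are ignored here, so a book that only exists in those formats would drop out of the result.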

In [108]:
# %%time
# rdf_files = [f for f in get_all_files_recurse(BASE_RDF_DIR) if f.endswith(".rdf")]

# no need for this, the metadata script used later reads the RDF archive directly

In [99]:
testrdf = BASE_RDF_DIR+"/1800/pg1800.rdf"


In [100]:
rdfg = rdflib.Graph()
rdfg.parse(testrdf)

<Graph identifier=Nea02597699484e969ee9844a071f953e (<class 'rdflib.graph.Graph'>)>

In [102]:
for s,p,o in rdfg:
    print(' | '.join([o,p,s]))

http://purl.org/dc/terms/IMT | http://purl.org/dc/dcam/memberOf | N2ee9868a66b14bbf8d893a733da94d88
http://www.gutenberg.org/ebooks/1800 | http://purl.org/dc/terms/isFormatOf | http://www.gutenberg.org/files/1800/1800.zip
http://www.gutenberg.org/ebooks/1800.rdf | http://purl.org/dc/terms/hasFormat | http://www.gutenberg.org/ebooks/1800
3451 | http://purl.org/dc/terms/extent | http://www.gutenberg.org/cache/epub/1800/pg1800.cover.small.jpg
Copyrighted. Read the copyright notice inside this book for details. | http://purl.org/dc/terms/rights | http://www.gutenberg.org/ebooks/1800
http://www.gutenberg.org/2009/pgterms/file | http://www.w3.org/1999/02/22-rdf-syntax-ns#type | http://www.gutenberg.org/ebooks/1800.rdf
http://www.gutenberg.org/2009/pgterms/file | http://www.w3.org/1999/02/22-rdf-syntax-ns#type | http://www.gutenberg.org/cache/epub/1800/pg1800.cover.small.jpg
214089 | http://purl.org/dc/terms/extent | http://www.gutenberg.org/ebooks/1800.html.images
application/zip | http://ww

Below are a few example triples that show where the data I need can be found.

Each triple is printed as: Object | Predicate | Subject

    text/plain; charset=us-ascii | http://www.w3.org/1999/02/22-rdf-syntax-ns#value | N0ddd72365b03404794f17691a9d033dd
    Shakspeare, William | http://www.gutenberg.org/2009/pgterms/alias | http://www.gutenberg.org/2009/agents/65
    Tragicomedy | http://www.w3.org/1999/02/22-rdf-syntax-ns#value | N48ba9daae4da4315a71f97515de1ac3d
    en | http://www.w3.org/1999/02/22-rdf-syntax-ns#value | Ne86faea53a7a46c3aa2126391e768c5d
    Married people -- Drama | http://www.w3.org/1999/02/22-rdf-syntax-ns#value | Na02a77874275488c823e1cf9c1af1e5d
    Shakespeare, William | http://www.gutenberg.org/2009/pgterms/name | http://www.gutenberg.org/2009/agents/65
    text/plain; charset=us-ascii | http://www.w3.org/1999/02/22-rdf-syntax-ns#value | Neab2b458666341749abaad5cc1b867d7
    http://www.gutenberg.org/ebooks/1800.epub.noimages | http://purl.org/dc/terms/hasFormat | http://www.gutenberg.org/ebooks/1800
    Castaways -- Drama | http://www.w3.org/1999/02/22-rdf-syntax-ns#value | N22b4e5001dbc4efb9196fc7601eb19b4
    http://www.gutenberg.org/ebooks/1800 | http://purl.org/dc/terms/isFormatOf | http://www.gutenberg.org/ebooks/1800.rdf
    The Winter's Tale | http://purl.org/dc/terms/title | http://www.gutenberg.org/ebooks/1800
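
Picking single fields out of such triples could look like the sketch below. It works on plain `(subject, predicate, object)` tuples so it stays self-contained. Iterating over an `rdflib.Graph` yields triples in the same order, so passing `rdfg` should work too, although that is untested here. `objects_for` is a hypothetical helper name.

```python
TITLE = "http://purl.org/dc/terms/title"
NAME = "http://www.gutenberg.org/2009/pgterms/name"
ALIAS = "http://www.gutenberg.org/2009/pgterms/alias"


def objects_for(triples, predicate, subject=None):
    """Return the objects of all triples with the given predicate (and subject, if given)."""
    return [str(o) for s, p, o in triples
            if str(p) == predicate and (subject is None or str(s) == subject)]


# a few triples copied from the listing, written as (subject, predicate, object)
triples = [
    ("http://www.gutenberg.org/ebooks/1800", TITLE, "The Winter's Tale"),
    ("http://www.gutenberg.org/2009/agents/65", NAME, "Shakespeare, William"),
    ("http://www.gutenberg.org/2009/agents/65", ALIAS, "Shakspeare, William"),
]

title = objects_for(triples, TITLE, subject="http://www.gutenberg.org/ebooks/1800")
names = objects_for(triples, NAME)
```

Language and subject values hang off blank nodes through the `rdf-syntax-ns#value` predicate, so they would need one more hop from the book to the blank node.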

There seems to be a nice [Gist here](https://gist.github.com/andreasvc/b3b4189120d84dec8857) to deal with RDF data from Gutenberg,
which I've cloned [here](https://gist.github.com/leomrocha/23ac4a4b4f4d365502c9e32bee46b797).


In [103]:
# gutenberg project metadata parsing
from metainfo import *

# slightly modified script so it works for what I need


In [105]:
%%time
gut_metadata = readmetadata(RDF_TAR_FILE)

CPU times: user 1min 5s, sys: 0 ns, total: 1min 5s
Wall time: 1min 5s


In [106]:
gut_metadata[1800]

{'id': 1800,
 'author': 'Shakespeare, William',
 'title': "The Winter's Tale",
 'downloads': 8,
 'formats': {'application/epub+zip': 'http://www.gutenberg.org/ebooks/1800.epub.images',
  'application/x-mobipocket-ebook': 'http://www.gutenberg.org/ebooks/1800.kindle.images',
  'application/rdf+xml': 'http://www.gutenberg.org/ebooks/1800.rdf',
  'text/plain': 'http://www.gutenberg.org/ebooks/1800.txt.utf-8',
  'image/jpeg': 'http://www.gutenberg.org/cache/epub/1800/pg1800.cover.small.jpg',
  'text/plain; charset=us-ascii': 'http://www.gutenberg.org/files/1800/1800.txt',
  'text/html': 'http://www.gutenberg.org/ebooks/1800.html.noimages',
  'application/zip': 'http://www.gutenberg.org/files/1800/1800.zip'},
 'type': 'Text',
 'LCC': {'PR'},
 'subjects': {'Castaways -- Drama',
  'Fathers and daughters -- Drama',
  'Married people -- Drama',
  'Sicily (Italy) -- Kings and rulers -- Drama',
  'Tragicomedy'},
 'authoryearofbirth': 1564,
 'authoryearofdeath': 1616,
 'language': ['en']}

In [107]:
gut_metadata[18092]

{'id': 18092,
 'author': 'About, Edmond',
 'title': 'Germaine',
 'downloads': 95,
 'formats': {'text/plain; charset=iso-8859-1': 'http://www.gutenberg.org/files/18092/18092-8.zip',
  'application/epub+zip': 'http://www.gutenberg.org/ebooks/18092.epub.noimages',
  'text/html': 'http://www.gutenberg.org/ebooks/18092.html.noimages',
  'image/jpeg': 'http://www.gutenberg.org/cache/epub/18092/pg18092.cover.medium.jpg',
  'application/rdf+xml': 'http://www.gutenberg.org/ebooks/18092.rdf',
  'application/x-mobipocket-ebook': 'http://www.gutenberg.org/ebooks/18092.kindle.images',
  'text/plain': 'http://www.gutenberg.org/ebooks/18092.txt.utf-8'},
 'type': 'Text',
 'LCC': {'PQ'},
 'subjects': {'French fiction -- 19th century'},
 'authoryearofbirth': 1828,
 'authoryearofdeath': 1885,
 'language': ['fr']}

In [110]:
gut_metadata[20775]

{'id': 20775,
 'author': 'Arana Xajilá, Francisco Hernández',
 'title': 'The Annals of the Cakchiquels',
 'downloads': 249,
 'formats': {'application/rdf+xml': 'http://www.gutenberg.org/ebooks/20775.rdf',
  'text/plain; charset=iso-8859-1': 'http://www.gutenberg.org/files/20775/20775-8.zip',
  'text/html; charset=iso-8859-1': 'http://www.gutenberg.org/files/20775/20775-h/20775-h.htm',
  'application/x-mobipocket-ebook': 'http://www.gutenberg.org/ebooks/20775.kindle.images',
  'image/jpeg': 'http://www.gutenberg.org/cache/epub/20775/pg20775.cover.medium.jpg',
  'application/epub+zip': 'http://www.gutenberg.org/ebooks/20775.epub.noimages',
  'text/plain; charset=utf-8': 'http://www.gutenberg.org/files/20775/20775-0.txt',
  'application/zip': 'http://www.gutenberg.org/files/20775/20775-0.zip',
  'text/plain; charset=us-ascii': 'http://www.gutenberg.org/files/20775/20775.txt',
  'application/octet-stream': 'http://www.gutenberg.org/files/20775/20775-page-images.zip'},
 'type': 'Text',
 'LC

In [109]:
gut_metadata[9000]

{'id': 9000,
 'author': None,
 'title': 'Sri Vishnu Sahasranaamam',
 'downloads': 163,
 'formats': {'text/plain': 'http://www.gutenberg.org/ebooks/9000.txt.utf-8',
  'application/epub+zip': 'http://www.gutenberg.org/ebooks/9000.epub.images',
  'application/rdf+xml': 'http://www.gutenberg.org/ebooks/9000.rdf',
  'application/x-mobipocket-ebook': 'http://www.gutenberg.org/ebooks/9000.kindle.noimages',
  'text/plain; charset=us-ascii': 'http://www.gutenberg.org/files/9000/9000.txt',
  'text/html; charset=iso-8859-1': 'http://www.gutenberg.org/files/9000/9000-h/9000-h.htm',
  'image/jpeg': 'http://www.gutenberg.org/cache/epub/9000/pg9000.cover.medium.jpg'},
 'type': 'Text',
 'LCC': {'BL'},
 'subjects': {'Vishnu (Hindu deity)'},
 'authoryearofbirth': None,
 'authoryearofdeath': None,
 'language': ['sa']}

In [111]:
gut_metadata[30774]

{'id': 30774,
 'author': 'Apostol, P. N. (Pavel Natanovich)',
 'title': 'Московия в представлении иностранцев XVI-XVII в.',
 'downloads': 124,
 'formats': {'text/plain; charset=utf-8': 'http://www.gutenberg.org/files/30774/30774-0.txt',
  'application/rdf+xml': 'http://www.gutenberg.org/ebooks/30774.rdf',
  'application/epub+zip': 'http://www.gutenberg.org/ebooks/30774.epub.images',
  'application/zip': 'http://www.gutenberg.org/files/30774/30774-h.zip',
  'image/jpeg': 'http://www.gutenberg.org/cache/epub/30774/pg30774.cover.medium.jpg',
  'application/x-mobipocket-ebook': 'http://www.gutenberg.org/ebooks/30774.kindle.noimages',
  'text/html; charset=utf-8': 'http://www.gutenberg.org/files/30774/30774-h/30774-h.htm'},
 'type': 'Text',
 'LCC': {'DK'},
 'subjects': {'Russia -- Description and travel'},
 'authoryearofbirth': 1872,
 'authoryearofdeath': 1942,
 'language': ['ru']}

In [112]:
gut_metadata[24225]

{'id': 24225,
 'author': 'Aiyuezhuren',
 'title': '戲中戲',
 'downloads': 41,
 'formats': {'application/zip': 'http://www.gutenberg.org/files/24225/24225-0.zip',
  'text/plain; charset=utf-8': 'http://www.gutenberg.org/files/24225/24225-0.txt',
  'text/html': 'http://www.gutenberg.org/ebooks/24225.html.noimages',
  'application/epub+zip': 'http://www.gutenberg.org/ebooks/24225.epub.noimages',
  'image/jpeg': 'http://www.gutenberg.org/cache/epub/24225/pg24225.cover.medium.jpg',
  'application/x-mobipocket-ebook': 'http://www.gutenberg.org/ebooks/24225.kindle.images',
  'application/rdf+xml': 'http://www.gutenberg.org/ebooks/24225.rdf'},
 'type': 'Text',
 'LCC': {'PL'},
 'subjects': set(),
 'authoryearofbirth': None,
 'authoryearofdeath': None,
 'language': ['zh']}

In [113]:
gut_metadata[1982]

{'id': 1982,
 'author': 'Akutagawa, Ryunosuke',
 'title': '羅生門',
 'downloads': 578,
 'formats': {'application/epub+zip': 'http://www.gutenberg.org/ebooks/1982.epub.noimages',
  'text/plain; charset=utf-8': 'http://www.gutenberg.org/files/1982/1982-0.txt',
  'text/html': 'http://www.gutenberg.org/ebooks/1982.html.images',
  'image/jpeg': 'http://www.gutenberg.org/cache/epub/1982/pg1982.cover.small.jpg',
  'application/rdf+xml': 'http://www.gutenberg.org/ebooks/1982.rdf',
  'application/x-mobipocket-ebook': 'http://www.gutenberg.org/ebooks/1982.kindle.images',
  'application/zip': 'http://www.gutenberg.org/files/1982/1982-0.zip'},
 'type': 'Text',
 'LCC': {'PL'},
 'subjects': {'Japan -- Social life and customs -- Fiction', 'Short stories'},
 'authoryearofbirth': 1892,
 'authoryearofdeath': 1927,
 'language': ['ja']}

In [114]:
gut_metadata[10257]

{'id': 10257,
 'author': "I. J. Hochman's Yiddisher Orchester",
 'title': 'Mazel Tov',
 'downloads': 47,
 'formats': {'audio/mpeg': 'http://www.gutenberg.org/files/10257/10257-m/10257-m-001.mp3',
  'text/plain; charset=us-ascii': 'http://www.gutenberg.org/files/10257/10257-m/10257-m-readme.txt',
  'application/rdf+xml': 'http://www.gutenberg.org/ebooks/10257.rdf'},
 'type': 'Sound',
 'LCC': {'M'},
 'subjects': {'Jews -- Music', 'Klezmer music'},
 'authoryearofbirth': None,
 'authoryearofdeath': None,
 'language': ['yi']}

In [115]:
gut_metadata[50430]

{'id': 50430,
 'author': 'Acosta, José de',
 'title': 'Historia natural y moral de las Indias (vol 2 of 2)',
 'downloads': 160,
 'formats': {'image/jpeg': 'http://www.gutenberg.org/cache/epub/50430/pg50430.cover.medium.jpg',
  'application/zip': 'http://www.gutenberg.org/files/50430/50430-0.zip',
  'application/x-mobipocket-ebook': 'http://www.gutenberg.org/ebooks/50430.kindle.noimages',
  'application/epub+zip': 'http://www.gutenberg.org/ebooks/50430.epub.images',
  'text/html; charset=iso-8859-1': 'http://www.gutenberg.org/files/50430/50430-h.zip',
  'text/plain; charset=utf-8': 'http://www.gutenberg.org/files/50430/50430-0.txt',
  'application/rdf+xml': 'http://www.gutenberg.org/ebooks/50430.rdf'},
 'type': 'Text',
 'LCC': {'E011'},
 'subjects': {'Acosta, José de, 1540-1600 -- Travel -- America',
  'America -- Description and travel',
  'America -- Early accounts to 1600',
  'Indians of Mexico -- Early works to 1800',
  'Indians of South America -- Early works to 1800',
  'Natural h

It seems that the script as is, given the path to the RDF archive, works well and extracts more information than needed. This is good enough for the metadata extraction.
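
The `formats` keys also carry the declared character set of the plain text files, which could be used instead of assuming one encoding for everything. A minimal sketch, assuming the keys always follow the `type/subtype; charset=...` shape seen in the entries shown here; `plain_text_sources` is a hypothetical helper.

```python
def plain_text_sources(formats):
    """Return (charset, url) pairs for every 'text/plain' entry of a metadata 'formats' dict."""
    found = []
    for mime, url in formats.items():
        parts = [part.strip() for part in mime.split(";")]
        if parts[0] != "text/plain":
            continue
        charset = None  # stays None when the key declares no charset
        for param in parts[1:]:
            key, _, value = param.partition("=")
            if key.strip().lower() == "charset":
                charset = value.strip().lower()
        found.append((charset, url))
    return found


# subset of the entry for book 18092
formats = {
    'text/plain; charset=iso-8859-1': 'http://www.gutenberg.org/files/18092/18092-8.zip',
    'text/html': 'http://www.gutenberg.org/ebooks/18092.html.noimages',
    'text/plain': 'http://www.gutenberg.org/ebooks/18092.txt.utf-8',
}
sources = plain_text_sources(formats)
```

The charset strings seen so far (`utf-8`, `iso-8859-1`, `us-ascii`) are valid Python codec names, so they could be passed straight to `bytes.decode`. This matters for the `-8` files: ISO-8859-1 bytes for accented characters are not valid UTF-8.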



### Pipeline description:

1. Get the next file.
2. Get the file ID and the base name of the file.
3. Unzip it in memory and get the ENCODING (if available) from the file. This might require reloading the file; for the moment I treat everything as UTF-8 compatible, as the files I sampled in several languages (including Tagalog, Chinese and Japanese) seem to be UTF-8 compatible.
4. Get the author (name, alias, dates ...), title, language, publishing date and editor (as far as the data is available).
5. Clean the Gutenberg text.
6. Count the total length of the text.
7. Separate paragraphs, splitting on at least one `\n\n` sequence. This might not work for every language; I do care but can't fix it myself, so somebody who knows the language should correct those cases.
8. Count the number of paragraphs.
9. Separate sentences (again language dependent) ~~by `.` characters and~~ with SpaCy models.
10. Count the number of sentences per paragraph.
11. Separate words. Lemmatization might be needed here to see some things correctly, but it will have issues in unsupported languages.
12. Count the number of ~~words~~ tokens per sentence. Tokens are already available in SpaCy, while words are harder to define in a way that works coherently.
13. Sum the number of ~~words~~ tokens per paragraph.
14. Separate chars.
15. Count chars per word, sentence and paragraph.
16. Aggregate all words and count the number of occurrences.
17. Aggregate all the characters and count the number of occurrences.
18. Aggregate and sort the results, then save a zip file with them.
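
The counting steps of the list above can be sketched without any NLP library. This is not the SpaCy pipeline used later in this notebook: the regex `\w+` is only a stand-in tokenizer and sentences are left out. It just shows the shape of the paragraph, word and character statistics. `length_stats` and `quick_text_stats` are hypothetical names.

```python
import re
import statistics
from collections import Counter


def length_stats(values):
    """total, min, max, mean, median and population std of a non-empty list of lengths."""
    return {
        "total_count": sum(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "std": statistics.pstdev(values),
    }


def quick_text_stats(text):
    """Paragraph, word and character counts using naive splitting rules."""
    # a paragraph break is a blank line (two newlines with optional whitespace in between)
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    words = re.findall(r"\w+", text)  # stand-in tokenizer, not SpaCy
    return {
        "char_count": len(text),
        "paragraph_count": len(paragraphs),
        "paragraph_char_stats": length_stats([len(p) for p in paragraphs]),
        "word_char_stats": length_stats([len(w) for w in words]),
        "word_counts": Counter(w.lower() for w in words),
        "char_counts": Counter(text),
    }


sample = "THE WINTER'S TALE\n\nby William Shakespeare\n\n\n\nDramatis Personae\n"
result = quick_text_stats(sample)
```

`statistics.pstdev` is the population standard deviation, which matches the default of `np.std`. An empty text would raise in `length_stats`, so real files would need a guard.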
    

#### Notes:

There are many issues with the current data analysis, but for the global analysis the numbers should be OK enough.
These issues are caused by noise in the data, for example captions and references to images that are not in the current text (and can not be taken into account here)

There are many languages that are not supported by SpaCy, and so the tokenization might not be extremely good.

Different writing styles are present.

The goal of the current analysis is to understand some basic statistics on text from different languages and authors, to start giving a more thorough insight into the variations that come into play when deciding implementation specifics for NLP pipelines.

The contributions of the current study are:
- An Open Source library to analyze the entire Gutenberg project statistics.
- A first introspection on some 
- A (text) database output containing many of the elements needed for the current analysis, which in future works can be reused as a base for improved insights.

As an extra element the code separation is done in a way to allow for some code reusability in other text datasets.

SpaCy has different models, but not one for every language, so for the moment I'll use the specific model where one is available and the general multi-language model otherwise.

For this I first need to install all the available SpaCy models. If a medium (md) model is available it is preferred over the small one (sm); the large ones (lg) are ignored as they are too big for the current processing needs (which are quite basic in this study).
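
The selection rule described above (md before sm, lg ignored, multi-language as fallback) could be written down as follows. This is a sketch that relies on the SpaCy model naming pattern `<lang>_<type>_<genre>_<size>`; `choose_models` and `model_for` are hypothetical helpers and the list of available models is only an example.

```python
def choose_models(available):
    """Map language code -> model name, preferring 'md' over 'sm' and skipping other sizes."""
    preference = {"md": 0, "sm": 1}
    chosen = {}
    for name in available:
        parts = name.split("_")
        lang, size = parts[0], parts[-1]
        if size not in preference:
            continue  # e.g. 'lg' models are ignored
        current = chosen.get(lang)
        if current is None or preference[size] < preference[current.split("_")[-1]]:
            chosen[lang] = name
    return chosen


def model_for(lang_code, chosen, fallback="xx_ent_wiki_sm"):
    """Return the model for a language code, or the multi-language model if there is none."""
    return chosen.get(lang_code, fallback)


available = ["en_core_web_sm", "en_core_web_md", "en_core_web_lg",
             "it_core_news_sm", "xx_ent_wiki_sm"]
chosen = choose_models(available)
english = model_for("en", chosen)
sanskrit = model_for("sa", chosen)  # no specific model, falls back to multi-language
```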

In [161]:
spacy_models_2 = {
    'de': 'de_core_news_md', # German
    'el': 'el_core_news_sm', # Greek
    'en': 'en_core_web_md', # English
    'es': 'es_core_news_md', # Spanish
    'fr': 'fr_core_news_md', # French
    'it': 'it_core_news_sm', # Italian
    'lt': 'lt_core_news_sm', # Lithuanian
    'nb': 'nb_core_news_sm', # Norwegian Bokmål
    'nl': 'nl_core_news_sm', # Dutch
    'pt': 'pt_core_news_sm', # Portuguese
    'xx': 'xx_ent_wiki_sm', # Multi-Lang
    }
        
spacy_models = {
    'german': 'de_core_news_md', # German
    'greek': 'el_core_news_sm', # Greek
    'english': 'en_core_web_md', # English
    'spanish': 'es_core_news_md', # Spanish
    'french': 'fr_core_news_md', # French
    'italian': 'it_core_news_sm', # Italian
    'lithuanian': 'lt_core_news_sm', # Lithuanian
    'norwegian': 'nb_core_news_sm', # Norwegian Bokmål
    'dutch': 'nl_core_news_sm', # Dutch
    'portuguese': 'pt_core_news_sm', # Portuguese
    'multi-lang': 'xx_ent_wiki_sm', # Multi-Lang
    }

In [168]:
# %%time
# for model in spacy_models.values():
#     # install the models:
#     cmd = "python -m spacy download {}".format(model)
#     print(cmd)
#     os.system(cmd)

In [44]:
spacy.info()


spaCy version    2.2.4                         
Location         /home/leo/venv3/lib/python3.8/site-packages/spacy
Platform         Linux-5.4.0-29-generic-x86_64-with-glibc2.29
Python version   3.8.2                         
Models                                         



{'spaCy version': '2.2.4',
 'Location': '/home/leo/venv3/lib/python3.8/site-packages/spacy',
 'Platform': 'Linux-5.4.0-29-generic-x86_64-with-glibc2.29',
 'Python version': '3.8.2',
 'Models': ''}

In [58]:
ftest = '/media/nfs/Datasets/text/Gutenberg/aleph.gutenberg.org/1/8/0/1800/1800.zip'
ftxt = path_leaf(ftest).replace(".zip", ".txt")
f = zipfile.ZipFile(ftest)


In [125]:
import ntpath  # not among the imports above; needed for this path experiment
ntpath.split(ftest)

('/media/nfs/Datasets/text/Gutenberg/aleph.gutenberg.org/1/8/0/1800',
 '1800.zip')

In [130]:
ntpath.dirname(ftest)

'/media/nfs/Datasets/text/Gutenberg/aleph.gutenberg.org/1/8/0/1800'

In [132]:
os.path.split(ftest), os.path.split('/media/nfs/Datasets/text/Gutenberg/aleph.gutenberg.org/1/8/0/1800')

(('/media/nfs/Datasets/text/Gutenberg/aleph.gutenberg.org/1/8/0/1800',
  '1800.zip'),
 ('/media/nfs/Datasets/text/Gutenberg/aleph.gutenberg.org/1/8/0', '1800'))

In [134]:
os.path.basename(ftest)

'1800.zip'

In [138]:
pth = Path(ftest)

In [143]:
pth.parent.name

'1800'

In [144]:
def get_file_id(fname):
    """Returns the Gutenberg File ID"""
    pth = Path(fname)
    # as per file structure the filename has some variations but the parent folder is always the ID
    return pth.parent.name

In [148]:
a = []
isinstance(a,list)

True

In [149]:
# def get_meta(gut_id, rfd_meta):
#     """ Loads the most adapted spacy nlp resource from the available ones
#     :param gut_id: Numeric gutenberg id of the file
#     :param rfd_meta:  rfd metainfo database
#     :return:  dict_metainfo
#     """
#     metainfo = rfd_meta[gut_id]
    
def get_nlp_resource(metainfo):
    """Loads the most adapted spacy nlp resource from the available ones
    :param metainfo: metadata dict of the book, the 'language' entry is used
    :return: the loaded spacy nlp object, multi-language ('xx') if no language is found
    """
    lang = 'xx'
    try:
        lng = metainfo['language']
        if isinstance(lng, (list, tuple)) and len(lng) > 0:
            lang = lng[0]
        elif isinstance(lng, str):
            lang = lng
        # anything else is unexpected, the default value stays multi-language
    except (KeyError, TypeError):
        # just to avoid issues if there is no language tag, in that case go back to default
        pass
    # loading with shortcut ... maybe will need to use the spacy models dict that I've created earlier, we'll see
    nlp = spacy.load(lang)
    return nlp

In [347]:

def _get_stats(arr):
    stats = {'total_count': np.sum(arr),
             'min': np.min(arr),
             'max': np.max(arr),
             'mean': np.mean(arr),
             'median': np.median(arr),
             'std': np.std(arr)
             }
    return stats


# this function already works but there is a lot to cleanup and improve
def process_file(fname, rfd_meta):
    # Gutenberg file id
    gut_id = int(get_file_id(fname))
    # meta information extracted from the Gutenberg RDF database -> warning: the DB is around 1GB
    metainfo = rfd_meta[gut_id]
    # TODO extract format from metadata (if exists)
    encoding = 'utf-8'  # assume all is compatible with utf-8, for the moment haven't found one that is not
    # spacy
    # nlp = get_nlp_resource(metainfo)
    # TODO FIXME, this is an issue with current spacy install not loading correctly ??
    import en_core_web_md
    nlp = en_core_web_md.load()
    # load and clean the file, assume zip as the compression format
    pth = Path(fname)
    ftxt = pth.name.replace(".zip", ".txt")  # inside gutenberg zip there should be a .txt file with the same name
    # the context manager closes the archive once the text is read
    with zipfile.ZipFile(fname) as f:
        txt = f.read(ftxt).decode(encoding)
    txt = gutenberg_cleaner.simple_cleaner(txt)
    # Start analysis
    doc = nlp(txt)  # SpaCy tokenization

    stats_data = {'total_char_count': len(txt)}

    ocnt = doc.count_by(ORTH)
    tokens = token_count = {doc.vocab.strings[k]: v for k, v in
                            reversed(sorted(ocnt.items(), key=lambda item: item[1]))}
    token_lens = np.array([len(k) for k in token_count.keys()])

    token_stats = _get_stats(token_lens)

    sen_charcount = []  # sentence length in characters
    sen_tok_count = []  # sentence length in tokens
    sen_stats = {
        'sentence_count': len(list(doc.sents))  # number of sentences in the document
    }

    for s in doc.sents:
        # clen = len(s.string)
        clen = len(s.text)
        sen_charcount.append(clen)
        tlen = len(s)
        sen_tok_count.append(tlen)

    sen_stats['char_count'] = sen_charcount
    sen_stats['token_count'] = sen_tok_count

    sen_charcount = np.array(sen_charcount)
    sen_tok_count = np.array(sen_tok_count)
    # np.min(), np.max(), np.mean(), np.median(), np.std()
    sen_stats['char_stats'] = _get_stats(sen_charcount)
    sen_stats['token_stats'] = _get_stats(sen_tok_count)

    stats_data['sentences'] = sen_stats
    stats_data['token_lengths'] = token_lens

    # the per-paragraph counts slow the processing a lot, as every paragraph is parsed again
    # much of this is already covered by the previous part, so this stat may be dropped later
    # (even though I do want it); some estimation might be possible instead
    # count number of paragraphs .. might not work on many languages
    paragraphs = [l.strip() for l in txt.split('\n\n') if len(l.strip()) > 0]

    para_char_lens = []
    para_tok_lens = []
    para_sen_lens = []
    for p in paragraphs:
        d = nlp(p)
        para_char_lens.append(len(p))
        para_tok_lens.append(len(list(d)))
        para_sen_lens.append(len(list(d.sents)))
    #
    stats_data['paragraphs'] = {
        'paragraph_count': len(paragraphs),
        'char_count': para_char_lens,
        'token_count': para_tok_lens,
        'sentence_count': para_sen_lens,
    }

    para_char_lens = np.array(para_char_lens)
    para_tok_lens = np.array(para_tok_lens)
    para_sen_lens = np.array(para_sen_lens)

    main_para_char_stats = _get_stats(para_char_lens)
    main_para_tok_stats = _get_stats(para_tok_lens)
    main_para_sen_stats = _get_stats(para_sen_lens)

    para_stats = {'char_stats': main_para_char_stats,
                  'token_stats': main_para_tok_stats,
                  'sentence_stats': main_para_sen_stats}

    stats = {'char_count': len(txt),
             'tokens': {'total_token_count': sum(token_count.values()),
                        'different_token_count': len(list(token_count.keys())),
                        'token_length_stats': token_stats
                        },
             'sentences': {'sentence_count': len(list(doc.sents)),
                           'char_stats': sen_stats['char_stats'],
                           'token_stats': sen_stats['token_stats']
                           },
             'paragraphs': para_stats,
             }
    # stats: aggregated statistics
    # stats_data: all data count
    # tokens: tokens, count of each token and token set
    return stats, stats_data, tokens
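
Saving the three results to one zip file per book, as planned at the top of this notebook, could be done along these lines. This is a sketch under assumptions: JSON as the storage format and the `_analysis.zip` suffix are my choices here, and `save_analysis` is a hypothetical helper. NumPy scalars and arrays are converted through their `tolist()` method, since `json` cannot serialize them directly.

```python
import json
import tempfile
import zipfile
from pathlib import Path


def _to_builtin(obj):
    """json 'default' hook: convert anything with .tolist() (NumPy scalars, arrays) to plain Python."""
    if hasattr(obj, "tolist"):
        return obj.tolist()
    raise TypeError(f"Cannot serialize {type(obj).__name__}")


def save_analysis(fname, stats, stats_data, tokens, out_dir="."):
    """Write the three results as JSON files inside '<original name>_analysis.zip'."""
    out_path = Path(out_dir) / (Path(fname).stem + "_analysis.zip")
    payloads = (("stats", stats), ("stats_data", stats_data), ("tokens", tokens))
    with zipfile.ZipFile(out_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for name, payload in payloads:
            text = json.dumps(payload, default=_to_builtin, ensure_ascii=False)
            zf.writestr(name + ".json", text)
    return out_path


# small round trip with made-up values, written to a temporary folder
with tempfile.TemporaryDirectory() as tmp:
    out = save_analysis("1800.zip", {"char_count": 3}, {"token_lengths": [1, 2]}, {"the": 2}, out_dir=tmp)
    with zipfile.ZipFile(out) as zf:
        names = sorted(zf.namelist())
        tokens_back = json.loads(zf.read("tokens.json"))
```

With the function above, the call would be `save_analysis(fname, *process_file(fname, gut_metadata))`.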


In [64]:
txt = f.read(ftxt).decode('utf-8')

In [65]:
len(txt)

172007

In [66]:
type(txt)

str

In [67]:
ctxt = gcleaner(txt)

In [71]:
len(ctxt), type(ctxt)

(155798, str)

In [76]:
END_OF_HEADER = "START OF THIS PROJECT GUTENBERG EBOOK"

In [77]:
txt.find(END_OF_HEADER)

-1

In [73]:
txt

'\r\n*******************************************************************\r\nTHIS EBOOK WAS ONE OF PROJECT GUTENBERG\'S EARLY FILES PRODUCED AT A\r\nTIME WHEN PROOFING METHODS AND TOOLS WERE NOT WELL DEVELOPED. THERE\r\nIS AN IMPROVED EDITION OF THIS TITLE WHICH MAY BE VIEWED AS EBOOK\r\n(#1539) at https://www.gutenberg.org/ebooks/1539\r\n*******************************************************************\r\n\r\n\r\nThis Etext file is presented by Project Gutenberg, in\r\ncooperation with World Library, Inc., from their Library of the\r\nFuture and Shakespeare CDROMS.  Project Gutenberg often releases\r\nEtexts that are NOT placed in the Public Domain!!\r\n\r\n*This Etext has certain copyright implications you should read!*\r\n\r\n<<THIS ELECTRONIC VERSION OF THE COMPLETE WORKS OF WILLIAM\r\nSHAKESPEARE IS COPYRIGHT 1990-1993 BY WORLD LIBRARY, INC., AND IS\r\nPROVIDED BY PROJECT GUTENBERG WITH PERMISSION.  ELECTRONIC AND\r\nMACHINE READABLE COPIES MAY BE DISTRIBUTED SO LONG AS SUCH COPI

In [72]:
ctxt

'\n\n\n\n\n1611\n\nTHE WINTER\'S TALE\n\nby William Shakespeare\n\n\n\nDramatis Personae\n\n  LEONTES, King of Sicilia\n  MAMILLIUS, his son, the young Prince of Sicilia\n  CAMILLO,    lord of Sicilia\n  ANTIGONUS,    "   "     "\n  CLEOMENES,    "   "     "\n  DION,         "   "     "\n  POLIXENES, King of Bohemia\n  FLORIZEL, his son, Prince of Bohemia\n  ARCHIDAMUS, a lord of Bohemia\n  OLD SHEPHERD, reputed father of Perdita\n  CLOWN, his son\n  AUTOLYCUS, a rogue\n  A MARINER\n  A GAOLER\n  TIME, as Chorus\n\n  HERMIONE, Queen to Leontes\n  PERDITA, daughter to Leontes and Hermione\n  PAULINA, wife to Antigonus\n  EMILIA, a lady attending on the Queen\n  MOPSA,   shepherdess\n  DORCAS,        "\n\n  Other Lords, Gentlemen, Ladies, Officers, Servants, Shepherds,\n    Shepherdesses\n\n                              SCENE:\n                       Sicilia and Bohemia\n\n\n\n\n\n\n\nACT I. SCENE I.\nSicilia. The palace of LEONTES\n\nEnter CAMILLO and ARCHIDAMUS\n\n  ARCHIDAMUS. If you 

In [116]:
paragraphs =[l.strip() for l in ctxt.split('\n\n') if len(l.strip()) > 0]

In [117]:
len(paragraphs)

126

In [121]:
paragraphs[:15]

['1611',
 "THE WINTER'S TALE",
 'by William Shakespeare',
 'Dramatis Personae',
 'LEONTES, King of Sicilia\n  MAMILLIUS, his son, the young Prince of Sicilia\n  CAMILLO,    lord of Sicilia\n  ANTIGONUS,    "   "     "\n  CLEOMENES,    "   "     "\n  DION,         "   "     "\n  POLIXENES, King of Bohemia\n  FLORIZEL, his son, Prince of Bohemia\n  ARCHIDAMUS, a lord of Bohemia\n  OLD SHEPHERD, reputed father of Perdita\n  CLOWN, his son\n  AUTOLYCUS, a rogue\n  A MARINER\n  A GAOLER\n  TIME, as Chorus',
 'HERMIONE, Queen to Leontes\n  PERDITA, daughter to Leontes and Hermione\n  PAULINA, wife to Antigonus\n  EMILIA, a lady attending on the Queen\n  MOPSA,   shepherdess\n  DORCAS,        "',
 'Other Lords, Gentlemen, Ladies, Officers, Servants, Shepherds,\n    Shepherdesses',
 'SCENE:\n                       Sicilia and Bohemia',
 'ACT I. SCENE I.\nSicilia. The palace of LEONTES',
 'Enter CAMILLO and ARCHIDAMUS',
 "ARCHIDAMUS. If you shall chance, Camillo, to visit Bohemia, on the\n    l

In [164]:
nlp = spacy.load(spacy_models_2['en'])

OSError: [E050] Can't find model 'en_core_web_md'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.

In [166]:
# Well there is an issue to fix with SpaCy now
# spacy.load('en_core_web_md')
import en_core_web_md

In [167]:
nlp = en_core_web_md.load()

In [169]:
%%time
doc = nlp(ctxt)

CPU times: user 2.82 s, sys: 409 ms, total: 3.23 s
Wall time: 3.23 s


In [318]:
%%time
from collections import defaultdict, Counter
from spacy.attrs import ORTH,NORM,TAG


CPU times: user 20 µs, sys: 1 µs, total: 21 µs
Wall time: 26.2 µs


In [306]:
pos_counts = Counter()
all_tokens = []

for token in doc:
    pos_counts[token.string] += 1
    all_tokens.append(token)
#     if token.string != token.text:
#         print(token.string, token.text)

In [309]:
type(token.text),type(token.string)

(str, str)

In [300]:
t1014 = all_tokens[1014]

In [301]:
t1014.orth

15684468530116715331

In [319]:
ocnt = doc.count_by(ORTH)
ncnt = doc.count_by(NORM)
tcnt = doc.count_by(TAG)

In [322]:
ocnt

{689596287893600409: 11,
 13717132615059711303: 1,
 908432558851201422: 18,
 6398231955146299758: 2,
 15625463239345040277: 1,
 4819537671249353417: 3,
 5068865973908226183: 1,
 16764210730586636600: 107,
 15682359169525196490: 2,
 17705840240265698784: 2,
 17401526048160334743: 1,
 17226268230246754930: 1,
 4615314666607501289: 1,
 6379588897221739654: 39,
 2565451499649297794: 139,
 2593208677638477497: 2381,
 14826469074451677028: 50,
 886050111519832510: 434,
 4483006468253402507: 27,
 11295366195010100045: 727,
 2985680790829681700: 18,
 2661093235354845946: 160,
 2555790282192608669: 37,
 7425985699627899538: 636,
 5946595046901399946: 14,
 4527521648030784477: 20,
 9112760883834046208: 83,
 6718839663412986256: 8,
 6316802600213942060: 70,
 9409918951138552818: 25,
 15884554869126768810: 10,
 12127836336259238092: 5,
 3518396686757337811: 6,
 14135569176005561457: 13,
 15800633500902066432: 9,
 6198647145348743278: 2,
 17396450413703794145: 68,
 5258365634227731740: 34,
 1693991

In [326]:
token.orth

962983613142996970

In [330]:

# Map the hash ids back to their strings, most frequent token first
token_count = {doc.vocab.strings[k]: v for k, v in sorted(ocnt.items(), key=lambda item: item[1], reverse=True)}
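The counts returned by `doc.count_by` are keyed by 64-bit hash ids, so they have to be decoded through the string store before they are readable. The decode-and-rank step can be sketched with the standard library alone. In this sketch `rank_counts` is a hypothetical helper, the `strings` dict is a stand-in for `doc.vocab.strings`, and the ids and counts are made up for illustration.

```python
from collections import Counter


def rank_counts(id_counts, strings):
    """Decode hash-id keys into strings and return (string, count) pairs,
    most frequent first.

    `strings` only needs to support strings[id] -> str, which is how
    doc.vocab.strings is used in the cell above.
    """
    decoded = Counter({strings[k]: v for k, v in id_counts.items()})
    return decoded.most_common()


# Made-up ids and counts, purely for illustration
fake_strings = {101: "the", 202: ",", 303: "Bohemia"}
fake_counts = {101: 7, 202: 12, 303: 2}
ranked = rank_counts(fake_counts, fake_strings)
```

`Counter.most_common()` returns a list of pairs rather than a dict, which avoids relying on dict insertion order to carry the ranking.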


In [335]:
type(list(token_count.keys())[0])

str

In [302]:
pos_counts.values()

dict_values([11, 1, 18, 2, 1, 3, 1, 94, 2, 2, 1, 1, 1, 39, 139, 1771, 23, 414, 23, 727, 17, 148, 27, 606, 11, 12, 82, 8, 6, 24, 6, 5, 6, 4, 11, 9, 2, 66, 24, 54, 9, 350, 1, 49, 1, 18, 8, 71, 71, 3, 59, 5, 10, 3, 159, 1, 43, 10, 480, 6, 32, 13, 5, 493, 12, 67, 9, 5, 6, 6, 1, 57, 24, 16, 2, 13, 2, 1, 1, 1, 1, 1, 1, 1, 610, 2409, 1, 1, 1, 21, 1, 4, 5, 5, 2, 15, 5, 19, 1275, 148, 7, 32, 1, 61, 272, 86, 3, 27, 3, 30, 48, 1, 1, 344, 5, 87, 35, 2, 17, 589, 196, 7, 24, 2, 4, 58, 10, 227, 726, 34, 131, 2, 2, 6, 27, 7, 3, 75, 117, 1, 1, 61, 2, 1, 6, 27, 62, 78, 33, 1, 181, 4, 469, 168, 1, 9, 1, 3, 23, 154, 1, 9, 25, 69, 11, 32, 249, 158, 31, 1, 109, 1, 36, 53, 18, 31, 25, 1, 1, 223, 3, 1, 1, 8, 16, 57, 58, 2, 24, 12, 2, 61, 2, 30, 6, 9, 211, 3, 2, 3, 99, 2, 1, 23, 5, 5, 1, 5, 11, 1, 15, 4, 1, 2, 41, 1, 34, 1, 1, 35, 44, 44, 3, 3, 135, 1, 46, 4, 54, 1, 1, 12, 1, 5, 1, 2, 1, 1, 31, 2, 1, 1, 1, 2, 1, 1, 3, 195, 7, 2, 1, 8, 6, 1, 1, 68, 1, 1, 1, 9, 2, 118, 185, 6, 3, 1, 4, 7, 1, 86, 2, 5, 3, 3, 8, 1

In [170]:
sentences = list(doc.sents)  

In [171]:
len(sentences)

3726

In [177]:
type(sentences[100]), len(sentences[100]), sentences[100]

(spacy.tokens.span.Span,
 15,
 Say this to him,
     He's beat from his best ward.
   )

In [312]:
s100.string, s100.text

("Say this to him,\n    He's beat from his best ward.\n  ",
 "Say this to him,\n    He's beat from his best ward.\n  ")

In [194]:
s100 = sentences[100]

In [198]:
len(s100.string)

53

In [217]:
t0,t1 = s100[:2]

In [205]:
t0.string, len(t0.string)

('Say ', 4)

In [219]:
t0 == t0, t0 == t1

(True, False)

In [220]:
ldoc = list(doc)

In [222]:
type(ldoc[0]), len(ldoc)

(spacy.tokens.token.Token, 36047)

In [189]:
vocab_nlp_list = list(nlp.vocab.strings)

In [183]:
vocab_list = list(doc.vocab.strings)

In [185]:
len(vocab_list), len(set(vocab_list))

(1476693, 1476693)

In [191]:
set(vocab_list).difference(vocab_nlp_list), set(vocab_nlp_list).difference(vocab_list)

(set(), set())

In [188]:
vocab_list[100:1000]

['While',
 'Since',
 'Like',
 'So',
 'Than',
 'Whether',
 'Although',
 'Though',
 'Unless',
 'Once',
 'Cause',
 'Upon',
 'Till',
 'Whereas',
 'Whilst',
 'Except',
 'Despite',
 'Wether',
 'But',
 'Becuse',
 'Whie',
 'It',
 'W/Out',
 'Albeit',
 'Save',
 'Besides',
 'Becouse',
 'Coz',
 'Til',
 'Ask',
 "I'D",
 'Out',
 'Near',
 'Seince',
 'Tho',
 'Sice',
 'Will',
 'something',
 'anyone',
 'anything',
 'nothing',
 'someone',
 'everything',
 'everyone',
 'everybody',
 'nobody',
 'somebody',
 'anybody',
 'any1',
 'Something',
 'Anyone',
 'Anything',
 'Nothing',
 'Someone',
 'Everything',
 'Everyone',
 'Everybody',
 'Nobody',
 'Somebody',
 'Anybody',
 'Any1',
 '-PRON-',
 'I',
 'me',
 'you',
 'he',
 'him',
 'she',
 'her',
 'we',
 'us',
 'they',
 'them',
 'mine',
 'his',
 'hers',
 'its',
 'ours',
 'yours',
 'theirs',
 'myself',
 'yourself',
 'himself',
 'herself',
 'itself',
 'themself',
 'ourselves',
 'yourselves',
 'themselves',
 'Me',
 'You',
 'He',
 'Him',
 'She',
 'Her',
 'We',
 'Us',
 'They

In [234]:
%%time
stats =  process_file(fname=ftest, rfd_meta=gut_metadata)


CPU times: user 16.2 s, sys: 777 ms, total: 17 s
Wall time: 17 s


In [235]:
stats.keys()

dict_keys(['total_char_len', 'total_token_len', 'sentences', 'tokens', 'paragraphs'])

In [238]:
# rough total runtime in days: 17 s per file, 60000 files, split 8 ways
17*60000/60/60/24/8


1.4756944444444444
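The back-of-the-envelope figure above is easier to re-run for other timings as a small function. This is a sketch under assumptions: `estimate_days` is a hypothetical helper, and it reads the 60000 as the number of files to process and the 8 as the number of parallel workers, which the notebook does not state explicitly. It also assumes perfect scaling across workers.

```python
def estimate_days(seconds_per_file, n_files, n_workers=1):
    """Estimated wall-clock days to process n_files.

    Assumes every file takes about seconds_per_file and that work splits
    evenly across n_workers (an optimistic assumption).
    """
    total_seconds = seconds_per_file * n_files
    seconds_per_day = 60 * 60 * 24
    return total_seconds / seconds_per_day / n_workers


days_at_17s = estimate_days(17, 60000, n_workers=8)  # about 1.48 days
```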

In [240]:
%%time
stats, stats_data, tokens =  process_file(fname=ftest, rfd_meta=gut_metadata)


CPU times: user 16 s, sys: 721 ms, total: 16.7 s
Wall time: 16.7 s


In [247]:
%%time
stats, stats_data, tokens = process_file(fname=ftest, rfd_meta=gut_metadata)
# still too slow!

CPU times: user 13 s, sys: 609 ms, total: 13.6 s
Wall time: 13.6 s
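Since a single file takes several seconds, one option is to spread files over processes with the `Pool` imported at the top of the notebook. The following is only a sketch of that pattern: `char_count` is a hypothetical stand-in for the real per-file work (the body of `process_file` is not shown here), and `process_all` and its `workers=8` default are assumptions.

```python
from multiprocessing import Pool
from pathlib import Path


def char_count(path):
    """Stand-in worker: return (file name, number of characters).

    The real worker would run the full analysis on one file; it must be a
    top-level function so that it can be pickled and sent to worker processes.
    """
    text = Path(path).read_text(encoding="utf-8", errors="replace")
    return Path(path).name, len(text)


def process_all(paths, workers=8):
    """Run the worker over all paths in parallel and collect results by file name."""
    with Pool(processes=workers) as pool:
        # imap_unordered yields results as they finish, in no particular order
        return dict(pool.imap_unordered(char_count, paths, chunksize=16))
```

With spaCy, each worker process would need to load its own model once, not once per file, otherwise the load time dominates.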


In [348]:
%%time
stats, stats_data, tokens =  process_file(fname=ftest, rfd_meta=gut_metadata)

CPU times: user 15.7 s, sys: 646 ms, total: 16.3 s
Wall time: 16.3 s


In [349]:
stats

{'char_count': 155798,
 'tokens': {'total_token_count': 36047,
  'different_token_count': 4572,
  'token_length_stats': {'total_count': 29133,
   'min': 1,
   'max': 59,
   'mean': 6.372047244094488,
   'median': 6.0,
   'std': 3.3489185401923574}},
 'sentences': {'sentence_count': 3726,
  'char_stats': {'total_count': 154034,
   'min': 1,
   'max': 410,
   'mean': 41.340311325818575,
   'median': 29.0,
   'std': 39.812356725660244},
  'token_stats': {'total_count': 36047,
   'min': 1,
   'max': 88,
   'mean': 9.674449812130971,
   'median': 7.0,
   'std': 8.842179625633522}},
 'paragraphs': {'char_stats': {'total_count': 154684,
   'min': 4,
   'max': 13190,
   'mean': 1227.6507936507937,
   'median': 62.0,
   'std': 2359.187764846401},
  'token_stats': {'total_count': 35920,
   'min': 1,
   'max': 3056,
   'mean': 285.07936507936506,
   'median': 13.5,
   'std': 552.2620237781459},
  'sentence_stats': {'total_count': 3783,
   'min': 1,
   'max': 356,
   'mean': 30.023809523809526,
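Each `*_stats` block in the output above has the same six fields, and `total_count` appears to be the sum of the lengths, not the number of items (for sentences, 3726 × 41.34 ≈ 154034). Such a block could be computed as follows. This is a sketch, not the actual `process_file` code: `length_stats` is a hypothetical helper, and the population standard deviation is an assumption chosen because it matches NumPy's default.

```python
import statistics


def length_stats(lengths):
    """Summarise a non-empty sequence of lengths (characters, tokens, ...)."""
    lengths = list(lengths)
    return {
        "total_count": sum(lengths),            # sum of lengths, not number of items
        "min": min(lengths),
        "max": max(lengths),
        "mean": statistics.fmean(lengths),
        "median": float(statistics.median(lengths)),
        "std": statistics.pstdev(lengths),      # population std (assumption)
    }


example = length_stats([2, 4, 4, 4, 5, 5, 7, 9])
```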
  

In [271]:
tc

{
 
 
 
 : 1,
 1611: 1,
 
 : 1,
 THE: 1,
 WINTER: 1,
 'S: 1,
 TALE: 1,
 
 : 1,
 by: 1,
 William: 1,
 Shakespeare: 1,
 
 
 
 : 1,
 Dramatis: 1,
 Personae: 1,
 
 
   : 1,
 LEONTES: 1,
 ,: 1,
 King: 1,
 of: 1,
 Sicilia: 1,
 
   : 1,
 MAMILLIUS: 1,
 ,: 1,
 his: 1,
 son: 1,
 ,: 1,
 the: 1,
 young: 1,
 Prince: 1,
 of: 1,
 Sicilia: 1,
 
   : 1,
 CAMILLO: 1,
 ,: 1,
    : 1,
 lord: 1,
 of: 1,
 Sicilia: 1,
 
   : 1,
 ANTIGONUS: 1,
 ,: 1,
    : 1,
 ": 1,
   : 1,
 ": 1,
     : 1,
 ": 1,
 
   : 1,
 CLEOMENES: 1,
 ,: 1,
    : 1,
 ": 1,
   : 1,
 ": 1,
     : 1,
 ": 1,
 
   : 1,
 DION: 1,
 ,: 1,
         : 1,
 ": 1,
   : 1,
 ": 1,
     : 1,
 ": 1,
 
   : 1,
 POLIXENES: 1,
 ,: 1,
 King: 1,
 of: 1,
 Bohemia: 1,
 
   : 1,
 FLORIZEL: 1,
 ,: 1,
 his: 1,
 son: 1,
 ,: 1,
 Prince: 1,
 of: 1,
 Bohemia: 1,
 
   : 1,
 ARCHIDAMUS: 1,
 ,: 1,
 a: 1,
 lord: 1,
 of: 1,
 Bohemia: 1,
 
   : 1,
 OLD: 1,
 SHEPHERD: 1,
 ,: 1,
 reputed: 1,
 father: 1,
 of: 1,
 Perdita: 1,
 
   : 1,
 CLOWN: 1,
 ,: 1,
 his: 1,
 son: 1,
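In the `tc` output above every key has a count of 1 and keys such as `of` repeat, which is what happens when the counter is keyed on token objects instead of their text: each object is a distinct key. The effect and the fix can be shown without spaCy. In this sketch `Tok` is a hypothetical stand-in for a spaCy `Token`.

```python
from collections import Counter


class Tok:
    """Hypothetical stand-in for a spaCy Token: distinct objects that may share text."""

    def __init__(self, text):
        self.text = text

    def __repr__(self):
        return self.text


tokens = [Tok(w) for w in ["King", "of", "Sicilia", ",", "Prince", "of", "Sicilia"]]

by_object = Counter(tokens)                # one key per object, every count is 1
by_text = Counter(t.text for t in tokens)  # identical strings are merged
```

Keying on `t.text` (or on a normalised form such as the lowercased text) gives real frequencies.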
 
   

In [248]:
# same rough estimate in days, at 14 s per file
14*60000/60/60/24/8


1.215277777777778

In [250]:
stats.keys()

dict_keys(['total_char_len', 'sentences', 'tokens', 'total_token_len'])

In [253]:
stats['total_char_count'],stats['total_token_count'], stats['sentences']['sentence_count']

(155798, 36047, 3726)

In [254]:
sentence_stats =stats['sentences']
token_stats = stats['tokens']

In [255]:
sentence_stats.keys(), token_stats.keys()

(dict_keys(['sentence_count', 'char_count', 'token_count']),
 dict_keys(['token_count', 'token_set']))