# nlptextdoc library source code

## 1. Prepare the Python environment

### 1.1 Install Anaconda and create a virtual environment

Download and install Anaconda for your platform : [Anaconda - Python 3.7](https://www.anaconda.com/distribution/#download-section).

Launch Anaconda Prompt.

> conda create --name nlptextenv

> conda activate nlptextenv

### 1.2 Install pandas with pyarrow.feather file format support

> conda install pandas

> conda install pyarrow

Make sure your version of pandas is > 0.24 and pyarrow is installed :

In [None]:
import pandas as pd
pd.show_versions()

### 1.3 Install spaCy with French language support

> conda install -c conda-forge spacy

> python -m spacy download fr

Make sure your version of spacy is > 2.1 and fr model is installed :

In [8]:
!python -m spacy info



spaCy version    2.1.3                         
Location         C:\Users\laure\Anaconda3\envs\spacy\lib\site-packages\spacy
Platform         Windows-10-10.0.18362-SP0     
Python version   3.7.3                         
Models           fr                            



Install a spaCy language detector extension :

> pip install spacy-langdetect

Check if the language detection works :

In [67]:
import spacy
from spacy_langdetect import LanguageDetector

nlp = spacy.load("fr_core_news_sm",disable=["tagger","ner"])
nlp.add_pipe(LanguageDetector(), name="language_detector", last=True)
doc = nlp("Est-ce que le détecteur fonctionne ?")
%time doc._.language["language"]

Wall time: 3.7 ms


'fr'

### 1.4 How to run a Jupyter notebook in the context of a conda environment

> conda activate nlptextenv

> conda install ipykernel

> python -m ipykernel install --user --name nlptextenv --display-name "Python (nlptextenv)"

> jupyter notebook

=> menu Kernel / Change kernel / Python (nlptextenv)

Check : locate the Jupyter config directories, kernels are configured in the 'kernels' subdirectory, in 'kernel.json' files

In [10]:
from jupyter_core.paths import jupyter_data_dir
print(jupyter_data_dir())

C:\Users\laure\AppData\Roaming\jupyter
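
The check above can be taken one step further by listing the kernels registered in that directory. This is a minimal sketch, assuming each kernel has its own folder under `kernels` with a `kernel.json` file that contains a `display_name` field; the helper name `list_kernel_display_names` is hypothetical and not part of Jupyter.

```python
import json
from pathlib import Path

def list_kernel_display_names(data_dir):
    """Return {kernel folder name: display name} for every kernel.json found."""
    kernels_dir = Path(data_dir) / "kernels"
    names = {}
    # Assumed layout : <data_dir>/kernels/<kernel name>/kernel.json
    for spec_file in sorted(kernels_dir.glob("*/kernel.json")):
        with spec_file.open(encoding="utf-8") as f:
            spec = json.load(f)
        names[spec_file.parent.name] = spec.get("display_name")
    return names
```

Calling `list_kernel_display_names(jupyter_data_dir())` should then include an entry for the kernel installed above, if the installation succeeded.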


Check : locate the python environment in use

In [66]:
import sys
print(sys.executable)

C:\Users\laure\Anaconda3\envs\spacy\python.exe


### 1.5 Define technical utility functions

In [71]:
import os
import sys

def _memory_size(obj, seen=None):
    size = sys.getsizeof(obj)
    if seen is None:
        seen = set()
    obj_id = id(obj)
    if obj_id in seen:
        return 0
    seen.add(obj_id)
    if isinstance(obj, dict):
        size += sum([_memory_size(v, seen) for v in obj.values()])
        size += sum([_memory_size(k, seen) for k in obj.keys()])
    elif hasattr(obj, '__dict__'):
        size += _memory_size(obj.__dict__, seen)
    elif hasattr(obj, '__iter__') and not isinstance(obj, (str, bytes, bytearray)):
        size += sum([_memory_size(i, seen) for i in obj])
    return size

# OTHER OPTION specific to pandas dataframes
# https://www.dataquest.io/blog/pandas-big-data/
# df.info(memory_usage="deep")

def _file_size(filepath):
    statinfo = os.stat(filepath)
    return statinfo.st_size

def _format_size_mb(size):
    return int(size / 1024.0 / 102.4) / 10.0

## 2. Prepare the .NET environment

### 2.1 Install Visual Studio 2019 community

Download and install [Microsoft Visual Studio 2019](https://visualstudio.microsoft.com/fr/downloads/) community edition.

Install only the following workload : .NET Core cross-platform development.

### 2.2 Clone and compile nlptextdoc

Launch Visual Studio 2019.

Clone code
- Repository URL : https://github.com/laurentprudhon/nlptextdoc.git
- Choose a local directory for the solution

Double click on the solution file : nlptextdoc.sln

Select the "Release" configuration in the top toolbar.

In the Solution Explorer :
- right-click on the solution root => Build Solution
- right-click on the project "nlptextdoc.cli" => Open Folder in File Explorer

Navigate to the bin\Release\netcoreapp2.1 subdirectory :
- this directory should contain 7 .dll files, including : nlptextdoc.cli.dll
- copy the full path of this directory in the variable below

In [24]:
nlptextdocExecPath = r"C:\Users\laure\source\repos\nlptextdoctemp\nlptextdoc.cli\bin\Release\netcoreapp2.1"

Test the command line client and learn its syntax :

In [29]:
!dotnet "{nlptextdocExecPath}/nlptextdoc.cli.dll"

nlptexdoc extractor v1.0

Crawls all the Html pages of a website and converts them to .nlp.txt structured text documents.
All the extracted text documents are stored under a single directory named like the website.
The .nlp.txt file format is described here : https://www.cognitivefactory.org/nlptextdocument/

Features an advanced Html to text conversion algorithm :
- tries to recover the logical structure of the document from the Html layout
- interprets Css properties of the Html nodes to make this operation more reliable
- preserves document / section / list / table grouping and nesting information

Usage : nlptextdoc [rootUrl] [storageDirectory] [maxPagesCount=0] [minCrawlDelay=0]
 - rootUrl          : root Url of the website (or subfolder of a website) you want to crawl
 - storageDirectory : path to the disk directory where the website folder
 - maxPagesCount    : maximum number of pages extracted from the website (optional, default:100 000)
 - minCrawlDelay    : delay in milliseco

## 3. Extract nlp text documents from websites

### 3.1 Identify popular websites to build your specific language model

List the public and open websites you would like to read to build a language model.

PLEASE MAKE SURE THIS IS LEGAL in your country.

For example in Europe : https://ec.europa.eu/digital-single-market/en/modernisation-eu-copyright-rules. 

> "The mandatory exceptions that the proposed directive announced are related to: ... Text and data mining ..."

In [2]:
websites = ["http://bourse.latribune.fr/",
            "http://cercledelepargne.com/",
            "http://finance.lelynx.fr/banques/",
            "http://labourseauquotidien.fr/",
            "http://lafourmiz.fr/",
            "http://www.assurances.com/",
            "http://www.banque.org/",
            "http://www.banque-info.com/",
            "http://www.bourse.fr/",
            "http://www.boursedirect.fr/",
            "http://www.capitaine-epargne.com/",
            "http://www.cnp.fr/",
            "http://www.cofinoga.fr/",
            "http://www.comparabanques.fr/",
            "http://www.comparalivrets.fr/",
            "http://www.fbf.fr/",
            "http://www.financo.fr/",
            "http://www.generali.fr/",
            "http://www.guide-epargne.com/",
            "http://www.lemonde.fr/epargne/",
            "http://www.leparisien.fr/actus/banque",
            "http://www.lesaffaires.com/bourse",
            "http://www.lesclesdelabanque.com",
            "http://www.msn.com/fr-fr/finance",
            "http://www.retraiteepargne.fr/",
            "http://www.revue-banque.fr/",
            "http://www.strategie-bourse.com/",
            "http://www.zonebourse.com/",
            "https://acpr.banque-france.fr/",
            "https://banque.meilleurtaux.com/",
            "https://bourse.lefigaro.fr/",
            "https://compte-nickel.fr/",
            "https://eko-by-ca.fr/",
            "https://epargne.ooreka.fr/",
            "https://ffa-assurance.fr/",
            "https://fr.finance.yahoo.com/",
            "https://humanis.com/",
            "https://mabanque.bnpparibas/",
            "https://mes-placements.fr/",
            "https://n26.com/fr-fr/",
            "https://particulier.apicil.com/",
            "https://www.10meilleuresbanques.fr/",
            "https://www.abcbourse.com/",
            "https://www.afer.fr/",
            "https://www.ag2rlamondiale.fr/",
            "https://www.agpm.fr/",
            "https://www.allianz.fr/",
            "https://www.allianzbanque.fr/",
            "https://www.amaguiz.com/",
            "https://www.ameli.fr/",
            "https://www.amundi.fr/fr_part",
            "https://www.arkea.com/",
            "https://www.assurland.com/",
            "https://www.aviva.fr/",
            "https://www.axa.fr/",
            "https://www.banque.fr/",
            "https://www.banque-casino.fr/",
            "https://www.banque-edel.fr/",
            "https://www.banque-france.fr/",
            "https://www.banquepopulaire.fr/",
            "https://www.banquesenligne.org/",
            "https://www.bforbank.com/",
            "https://www.boursedeparis.fr/",
            "https://www.boursier.com/",
            "https://www.boursorama.com/",
            "https://www.boursorama-banque.com/",
            "https://www.bred.fr/",
            "https://www.ca-alsace-vosges.fr/",
            "https://www.caisse-epargne.fr/",
            "https://www.carrefour-banque.fr/",
            "https://www.cbanque.com/",
            "https://www.cetelem.fr/",
            "https://www.challenges.fr/tag_theme/banque_876/",
            "https://www.cic.fr/",
            "https://www.cofidis.fr/",
            "https://www.credit-cooperatif.coop/",
            "https://www.credit-du-nord.fr/",
            "https://www.credit-et-banque.com/",
            "https://www.creditfoncier.fr/",
            "https://www.creditmutuel.fr/",
            "https://www.culturebanque.com/",
            "https://www.diac.fr/",
            "https://www.direct-assurance.fr/",
            "https://www.economie.gouv.fr/",
            "https://www.empruntis.com/epargne/",
            "https://www.en-bourse.fr/",
            "https://www.eurofil.com/",
            "https://www.fortuneo.fr/",
            "https://www.francetransactions.com/",
            "https://www.gan.fr/",
            "https://www.groupama.fr/",
            "https://www.hellobank.fr/",
            "https://www.home.saxo/fr-fr/",
            "https://www.hsbc.fr/",
            "https://www.impots.gouv.fr/portail/",
            "https://www.ing.fr/banque-en-ligne/",
            "https://www.labanquepostale.fr/",
            "https://www.lcl.fr/",
            "https://www.lerevenu.com/",
            "https://www.lesechos.fr/finance-marches/",
            "https://www.lesfurets.com/",
            "https://www.lolivier.fr/",
            "https://www.macif.fr/assurance/particuliers",
            "https://www.mae.fr/",
            "https://www.maif.fr/",
            "https://www.matmut.fr/",
            "https://www.mma.fr/",
            "https://www.monabanq.com/fr/index.html",
            "https://www.mon-epargne.com/",
            "https://www.montepaschi-banque.fr/fr/",
            "https://www.natixis.com/",
            "https://www.oney.fr/",
            "https://www.orangebank.fr/",
            "https://www.ouest-france.fr/economie/banques-finance/",
            "https://www.palatine.fr/",
            "https://www.panorabanques.com/",
            "https://www.probtp.com/",
            "https://www.psabanque.fr/",
            "https://www.quechoisir.org/thematique-banque-credit-t111/",
            "https://www.revolut.com/fr-FR/",
            "https://www.service-public.fr/particuliers/vosdroits/N19803",
            "https://www.smc.fr/",
            "https://www.societegenerale.fr/",
            "https://www.sofinco.fr/",
            "https://www.toutsurmesfinances.com/",
            "https://www.tradingsat.com/",
            "https://www.usine-digitale.fr/banque/",
            "https://www.younited-credit.com/"]

len(websites)

128

### 3.2 Extract raw text from these websites in a local directory

Create a local directory to store the extracted nlp text documents : be careful, this directory may contain several gigabytes of data at the end of the process.

IMPORTANT : **the "magic" \\\\?\ prefix in the root path is mandatory on Windows** to enable long file names support in Python.

In [85]:
from pathlib import Path

rootdir = Path(r"\\?\C:\Users\laure\Desktop\nlptextdoc-data-201907")
rootdir.mkdir(exist_ok=True)
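
If the notebook has to run on several platforms, the prefix can be added only when needed. This is a minimal sketch with a hypothetical helper `with_long_path_prefix`; it assumes an absolute drive-letter path written with backslashes (for example `C:\...`), and does not handle network (UNC) paths.

```python
import os

def with_long_path_prefix(path_str, is_windows=(os.name == "nt")):
    r"""Prepend the \\?\ prefix to an absolute drive-letter path on Windows."""
    prefix = "\\\\?\\"
    # Leave the path untouched on other platforms, or if it is already prefixed
    if not is_windows or path_str.startswith(prefix):
        return path_str
    return prefix + path_str
```

The result can be passed to `Path(...)` exactly like the hard-coded string above.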

Start by extracting only a few documents (for example 100) from each website, to test if they are accessible and if everything works as expected :

In [None]:
maxPagesCount = 100

for websiteUrl in websites:
    !dotnet "{nlptextdocExecPath}/nlptextdoc.cli.dll" {websiteUrl} {str(rootdir)} {maxPagesCount}

In the local root directory, the extraction program creates one subdirectory per website.

Each website subdirectory contains :
- one log file called **httprequests.log.csv**
- subdirectories reproducing the website tree structure
- one **nlp.txt text document** for each extracted html page in this tree structure

See the following page for a **description of the nlptextdoc format** : https://github.com/laurentprudhon/nlptextdoc/blob/master/README.md
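
Before looking at the logs in detail, a quick sanity check is to list the websites for which no log file was written at all. This is a minimal sketch, assuming each subdirectory is named after the host part of the root URL; the helper name `find_websites_without_logs` is hypothetical.

```python
from pathlib import Path
from urllib.parse import urlparse

def find_websites_without_logs(rootdir, websites):
    """Return the root URLs whose subdirectory has no httprequests.log.csv file."""
    missing = []
    for websiteurl in websites:
        # Assumption : the subdirectory is named after the host part of the URL
        websitedir = Path(rootdir) / urlparse(websiteurl).netloc
        if not (websitedir / "httprequests.log.csv").is_file():
            missing.append(websiteurl)
    return missing
```

An empty list from `find_websites_without_logs(rootdir, websites)` means every website produced at least a log file.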

Check if all the websites were correctly extracted :

In [49]:
import pandas as pd
from urllib.parse import urlparse

def getWebsiteName(websiteurl):
    url = urlparse(websiteurl)
    websitename = url.netloc
    return websitename

def getWebsiteDir(rootdir, websitename):
    websitedir = rootdir / websitename
    return websitedir

def loadExtractionLogs(websitedir):
    return pd.read_csv(websitedir / "httprequests.log.csv",delimiter=";")

def getExtractionStats(websites):
    websiteNames = []
    requestsCount = []
    statusCounts = []    
    errorTypes = ["OK","NotFound","Redirect","NoContent","Forbidden","BadRequest","Moved"]
    for errorType in errorTypes:
        statusCounts.append([])
    for websiteurl in websites:
        website = getWebsiteName(websiteurl)
        print(f"Checking extraction logs for website {website} ...")
        websitedir = getWebsiteDir(rootdir, website)
        logsdf = loadExtractionLogs(websitedir)
        logsstatus = logsdf["Status code"].value_counts()
        websiteNames.append(website)
        requestsCount.append(len(logsdf))
        for idx,errorType in enumerate(errorTypes):
            statusCounts[idx].append(logsstatus[errorType] if errorType in logsstatus else 0)
    dictResult = {}
    dictResult["Website"] = websiteNames
    dictResult["Requests"] = requestsCount
    for idx,errorType in enumerate(errorTypes):
        dictResult[errorType] = statusCounts[idx]
    return pd.DataFrame(dictResult)    

In [None]:
extractionStats = getExtractionStats(websites)

In [None]:
extractionStats[extractionStats["Requests"] != extractionStats["OK"]]

For each website with http error codes, open the log file **httprequests.log.csv** and see if something needs to be fixed.
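
To get a first overview of a log file without opening it by hand, the status codes can be counted with the standard library only. This is a minimal sketch, assuming the log is a ';'-delimited file with a "Status code" column (the column name used in the code above); the helper name `count_status_codes` is hypothetical.

```python
import csv
from collections import Counter

def count_status_codes(logfile):
    """Count the values of the "Status code" column in a ';'-delimited log."""
    reader = csv.DictReader(logfile, delimiter=";")
    return Counter(row["Status code"] for row in reader)

# Usage sketch (the file encoding is an assumption, adjust it if needed) :
# with (websitedir / "httprequests.log.csv").open(encoding="utf-8-sig", newline="") as f:
#     print(count_status_codes(f).most_common())
```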

Use the code below to test if the errors :
- were temporary, a consequence of the high request rate of the extraction program : then relaunch the extraction of the website with a bigger minCrawlDelay
- are a real problem in the source website : just ignore them and continue

In [52]:
from urllib.request import urlopen
from urllib.error import HTTPError, URLError

def checkExtractionLogsByErrorType(logsdf):
    errorTypes = ["NotFound","Redirect","NoContent","Forbidden","BadRequest","Moved"]
    urls = []
    extractionStatus = []
    checkedStatus = []
    for errorType in errorTypes:
        urlsWithError = logsdf[logsdf["Status code"] == errorType]["Url"]
        print(f"Testing {len(urlsWithError)} URLs with error type {errorType} ...")
        for url in urlsWithError:
            urls.append(url)
            extractionStatus.append(errorType)
            try:
                resp = urlopen(url)
                checkedStatus.append(resp.getcode())
            except HTTPError as he:
                # The server answered with an HTTP error code
                checkedStatus.append(he.code)
            except URLError:
                # The server could not be reached at all (DNS failure, refused connection ...)
                checkedStatus.append(None)
    checksdf = pd.DataFrame({"Urls" : urls, "ExtractionStatus" : extractionStatus, "CheckedStatus" : checkedStatus})
    return checksdf

In [53]:
websiteIndex = 9
websitename = getWebsiteName(websites[websiteIndex])
print(websitename)

websitedir = getWebsiteDir(rootdir, websitename)
logsdf = loadExtractionLogs(websitedir)
checkExtractionLogsByErrorType(logsdf)

www.boursedirect.fr
Testing 0 URLs with error type NotFound ...
Testing 1 URLs with error type Redirect ...
Testing 0 URLs with error type NoContent ...
Testing 2 URLs with error type Forbidden ...
Testing 0 URLs with error type BadRequest ...
Testing 0 URLs with error type Moved ...


| | Urls | ExtractionStatus | CheckedStatus |
|---|---|---|---|
| 0 | http://www.boursedirect.fr/priv/logoutPriv.php | Redirect | 200 |
| 1 | http://www.boursedirect.fr/fr/profil | Forbidden | 403 |
| 2 | http://www.boursedirect.fr/fr/messagerie | Forbidden | 403 |


When everything seems OK, relaunch the extraction code above with a much bigger maxPagesCount (for example 100 000).
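
For the full extraction, or to relaunch a single website with a bigger minCrawlDelay, the command line can also be built explicitly. This is a minimal sketch with a hypothetical helper `build_extraction_command`; it follows the argument order of the usage text shown earlier (rootUrl, storageDirectory, maxPagesCount, minCrawlDelay). The default values are assumptions, and because the usage text is cut off, the unit of minCrawlDelay (apparently milliseconds) should be checked against the client's own help output.

```python
def build_extraction_command(dll_dir, website_url, storage_dir,
                             max_pages_count=100000, min_crawl_delay=0):
    """Build the argument list for one run of the nlptextdoc command line client."""
    return [
        "dotnet",
        f"{dll_dir}/nlptextdoc.cli.dll",
        website_url,
        str(storage_dir),
        str(max_pages_count),   # third positional argument : maxPagesCount
        str(min_crawl_delay),   # fourth positional argument : minCrawlDelay
    ]
```

The resulting list can be passed to `subprocess.run(command, check=True)`, which avoids quoting problems when a path contains spaces.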

### 3.3 Download publicly available French dictionaries

Create a local subdirectory to store the French dictionaries :

In [88]:
dictdir = rootdir / "_dictionaries"
dictdir.mkdir(exist_ok=True)

**Dictionary 1 : Dicollecte** - Open Source French dictionary for LibreOffice/OpenOffice

Website : https://grammalecte.net/home.php?prj=fr.

Licence : MPL : Mozilla Public License version 2.0  -  http://www.mozilla.org/MPL/2.0/.

Download the latest "Lexique" on the [Grammalecte downloads page](https://grammalecte.net/download.php?prj=fr) :
- open the zip file
- copy only the "lexique-dicollecte-fr-v*.txt" file (for example : lexique-dicollecte-fr-v6.4.1.txt) in the local subdirectory

Open the file in a text editor to see its self-descriptive format and contents.

Store the file name in the variable below :

In [90]:
dicollectefile = dictdir / "lexique-dicollecte-fr-v6.4.1.txt"
dicollectefile.exists()

True

In [106]:
def buildDicollecteTags(dicollectefile):
    # Skip the 15 lines that precede the column names at the top of the file
    dictionarydf = pd.read_csv(dicollectefile, sep="\t", skiprows=15)
    dictionarytags = {}
    for index, row in dictionarydf.iterrows():
        token = row["Flexion"]
        tag = _convertDicollecteTagsToUnivDepTags(row["Étiquettes"])
        if(not (token in dictionarytags)):
            dictionarytags[token] = tag
        elif(not (tag in dictionarytags[token])):
            dictionarytags[token] = dictionarytags[token] + "|" + tag
    return dictionarytags

def _convertDicollecteTagsToUnivDepTags(text):
    if(("adj" in text) or ("loc.adj" in text)):
        return "ADJ"
    elif("prep" in text):
        return "ADP"
    elif(("adv" in text) or ("loc.adv" in text)):
        return "ADV"
    elif(("v0a" in text) or ("v0e" in text) or ("ppas" in text)):
        return "AUX"
    elif("cjco" in text):
        return "CCONJ"
    elif("det" in text):
        return "DET"
    elif("interj" in text):
        return "INTJ"
    elif("nom" in text):
        return "NOUN"
    elif(("nb" in text) or ("ord" in text)):
        return "NUM"
    elif("pro" in text):
        return "PRON"
    elif(("prn" in text) or ("patr" in text) or ("npr" in text)):
        return "PROPN"
    elif("cjsub" in text):
        return "SCONJ"
    elif("symb" in text):
        return "SYM"
    elif(("v1" in text) or ("v2" in text) or ("v3" in text) or ("loc.verb" in text)):
        return "VERB"
    else:
        return text

In [None]:
dicollecteTags = buildDicollecteTags(dicollectefile)
dicollecteTags
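
The dictionary built above maps each token to its tags joined with "|", and it detects an already known tag with a substring test on that string. The sketch below shows the same accumulation on a few hand-written pairs, in a variant that compares whole tags instead of substrings; the helper name `merge_tags` and the sample pairs are hypothetical.

```python
def merge_tags(token_tag_pairs):
    """Map each token to its distinct tags, joined with "|" in first-seen order."""
    tags_by_token = {}
    for token, tag in token_tag_pairs:
        tags = tags_by_token.setdefault(token, [])
        if tag not in tags:  # exact match on the whole tag, not a substring test
            tags.append(tag)
    return {token: "|".join(tags) for token, tags in tags_by_token.items()}

pairs = [("ferme", "NOUN"), ("ferme", "ADJ"), ("ferme", "NOUN"), ("le", "DET")]
print(merge_tags(pairs))
```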

**Dictionary 2 : UDLexicons Lefff** - Research resource from INRIA for the [Universal Dependencies](https://universaldependencies.org/) project

Citation : Benoît Sagot. A multilingual collection of CoNLL-U-compatible morphological lexicons. Eleventh International Conference on Language Resources and Evaluation (LREC 2018), May 2018, Miyazaki, Japan. hal-01798798v2

Paper : https://hal.inria.fr/hal-01798798v2/document

Download the latest "UDLexicons" on [Benoît Sagot's resources page](http://alpage.inria.fr/~sagot/) :
- open the zip file
- copy only the "UDLex_French-Lefff.conllul" in the local directory
- add a .txt extension to the file name

Open the file in a text editor to see its self-descriptive format and contents.

Store the file name in the variable below :

In [92]:
leffffile = dictdir / "UDLex_French-Lefff.conllul.txt"
leffffile.exists()

True

In [108]:
def buildLefffTags(leffffile):
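    # The file apparently has no header line : pandas then names the columns after the values
    # of its first entry, which explains the "!" and "PUNCT" column names used below
    lexicondf = pd.read_csv(leffffile, sep="\t")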
    lexicontags = {}
    for index, row in lexicondf.iterrows():
        token = row["!"]
        tag = row["PUNCT"]
        if(not (token in lexicontags)):
            lexicontags[token] = tag
        elif(not (tag in lexicontags[token])):
            lexicontags[token] = lexicontags[token] + "|" + tag
    return lexicontags

In [None]:
lefffTags = buildLefffTags(leffffile)
lefffTags

## 4. Generate text dataset, statistics, dictionaries from websites extraction directory

### 4.1 Load the extracted text files in an efficient DataFrame for each website

The following class can be used to **parse and load the .nlp.txt text files extracted from a website in a DataFrame**.

See the following page for a **description of the nlptextdoc format** : https://github.com/laurentprudhon/nlptextdoc/blob/master/README.md

In [81]:
import numpy as np
import re

class NLPTextDocumentReader:
    """Read output files of a website extraction in pandas DataFrames.
    
    Sample usage :
    
    textreader = NLPTextDocumentReader(websitedir)
    textdf = textreader.load_dataframe()
    """    
    def __init__(self, websitedir):
        self.websitedir = websitedir
        
        self.documentCount = 0 
        self.nestingLevel = 1
        self.listType = []
        self.listCmd = []
        self.listLevel = []
        self.listText = []
                
        self.DOCUMENT_ELEMENT_LINE_MARKER = "##"
        self.DOCUMENT_ELEMENT_START = "Start"
        self.DOCUMENT_ELEMENT_END = "End"
        self.DOCUMENT_ELEMENT_ITEMS = "Items"
        self.DOCUMENT_ELEMENT_ITEMS_START = ">>"
        self.DOCUMENT_ELEMENT_ITEMS_SEPARATOR = "||"
        
        self.TEXT_DOCUMENT_PROPERTY_PREFIX = self.DOCUMENT_ELEMENT_LINE_MARKER + " NLPTextDocument "
        self.TEXT_DOCUMENT_TITLE = "Title"
        self.TEXT_DOCUMENT_URI = "Uri"
        
        self.DOCUMENT_ELEMENT_LINE_REGEX = re.compile(
            self.DOCUMENT_ELEMENT_LINE_MARKER + " "
            + "(?P<NestingLevel>[0-9]+)" + " "
            + "(?P<ElementName>[A-Za-z]+)" + " "
            + "(?P<Command>" + self.DOCUMENT_ELEMENT_START + "|" + self.DOCUMENT_ELEMENT_END + "|" + self.DOCUMENT_ELEMENT_ITEMS + ")" + " ?")
        
    def load_dataframe(self):
        textdffile = self.websitedir / "nlptextdocs.dataframe.feather"
        if(textdffile.exists()):
            return pd.read_feather(textdffile)
        else:
            for textfile in self.websitedir.glob("**/*.nlp.txt"):
                with textfile.open(mode="r", encoding="utf-8-sig") as f:   
                    self.textfile = textfile
                    self.documentCount = self.documentCount+1
                    self.onDocumentStart(str(self.documentCount))
                    self.isreadingproperties = True
                    for lineidx,line in enumerate(f):
                        line = line.strip()
                        if(not line): continue
                        self.lineidx = lineidx
                        self.readline(line)
                    self.onDocumentEnd(str(self.documentCount))
            textdf = pd.DataFrame({"DocEltType": self.listType, "DocEltCmd" : self.listCmd, "NestingLevel": self.listLevel, "Text":self.listText})
            textdf = textdf.astype({"DocEltType": "category", "DocEltCmd": "category", "NestingLevel": np.uint8},copy=False)
            self.__init__(self.websitedir)
            textdf.to_feather(textdffile)
            return textdf

    def readline(self,line):
        if (self.isreadingproperties):
            if (line.startswith(self.TEXT_DOCUMENT_PROPERTY_PREFIX)):
                self.readproperty(line[len(self.TEXT_DOCUMENT_PROPERTY_PREFIX):])
            else:
                self.isreadingproperties = False
        if (not self.isreadingproperties):
            self.readelement(line)
                
    def readproperty(self,propstr):
        firstspaceindex = propstr.find(" ")
        if (firstspaceindex > 0):
            propertyname = propstr[:firstspaceindex]            
            propertyvalue = propstr[firstspaceindex + 1:].strip()
            if(propertyname == self.TEXT_DOCUMENT_TITLE):
                self.onDocumentTitle(propertyvalue)
            elif(propertyname == self.TEXT_DOCUMENT_URI):
                self.onDocumentUri(propertyvalue)       
    
    def readelement(self,line):
        if (line.startswith(self.DOCUMENT_ELEMENT_LINE_MARKER)):
            self.readcommand(line)
        else:
            self.onTextBlock(line)
    
    def readcommand(self,line):
        match = self.DOCUMENT_ELEMENT_LINE_REGEX.match(line)
        if(match): 
            self.nestingLevel = int(match.group("NestingLevel"))
            elementName = match.group("ElementName")
            command = match.group("Command")
            if (command == self.DOCUMENT_ELEMENT_START):
                title = line[match.end():].strip()
                if (len(title) == 0): title = None
                if(elementName == "Section"):
                    self.onSectionStart(title)
                elif(elementName == "NavigationList"):
                    self.onNavigationListStart(title)
                elif(elementName == "List"):
                    self.onListStart(title)
                elif(elementName == "ListItem"):
                    self.onListItemStart()
                elif(elementName == "Table"):
                    self.onTableStart(title)
                elif(elementName == "TableHeader"):
                    self.onTableHeaderStart()           
                elif(elementName == "TableCell"):
                    self.onTableCellStart()
            elif (command == self.DOCUMENT_ELEMENT_END):
                if(elementName == "Section"):
                    self.onSectionEnd()
                elif(elementName == "NavigationList"):
                    self.onNavigationListEnd()
                elif(elementName == "List"):
                    self.onListEnd()
                elif(elementName == "ListItem"):
                    self.onListItemEnd()
                elif(elementName == "Table"):
                    self.onTableEnd()
                elif(elementName == "TableHeader"):
                    self.onTableHeaderEnd()                 
                elif(elementName == "TableCell"):
                    self.onTableCellEnd()
            elif (command == self.DOCUMENT_ELEMENT_ITEMS):
                startOfItems = line.find(self.DOCUMENT_ELEMENT_ITEMS_START)
                title = line[match.end():startOfItems].strip()
                if (len(title) == 0): title = None
                if (elementName == "NavigationList"):
                    self.onNavigationListStart(title)
                elif (elementName == "List"):
                    self.onListStart(title)             
                self.nestingLevel = self.nestingLevel+1
                items = line[startOfItems+len(self.DOCUMENT_ELEMENT_ITEMS_START):].split(self.DOCUMENT_ELEMENT_ITEMS_SEPARATOR)
                for item in items:
                    item = item.strip()
                    if (len(item) > 0):
                        self.onInlineListItem(item)
                self.nestingLevel = self.nestingLevel-1
                if (elementName == "NavigationList"):
                    self.onNavigationListEnd()
                elif (elementName == "List"):
                    self.onListEnd()
            else:
                raise Exception(f"File format error in file {self.textfile} on line {self.lineidx} : {line[:50]}")
        else:
            raise Exception(f"File format error in file {self.textfile} on line {self.lineidx} : {line[:50]}")
    
    def onDocumentStart(self,docId):
        self.appendrow("Document","Start",docId)
    
    def onDocumentTitle(self,title):
        self.appendrow("Document","Title",title)
            
    def onDocumentUri(self,uri):
        self.appendrow("Document","Uri",uri)
    
    def onDocumentEnd(self,docId):
        self.appendrow("Document","End",docId)
    
    def onTextBlock(self,text):
        self.appendrow("TextBlock","Text",text)
            
    def onSectionStart(self,title):
        self.appendrow("Section","Start",title)
        
    def onSectionEnd(self): 
        self.appendrow("Section","End")
        
    def onNavigationListStart(self,title):
        self.appendrow("NavigationList","Start",title)
        
    def onNavigationListEnd(self):
        self.appendrow("NavigationList","End")
        
    def onListStart(self,title):
        self.appendrow("List","Start",title)
        
    def onListEnd(self):
        self.appendrow("List","End")
        
    def onInlineListItem(self,item):
        self.appendrow("ListItem","Text",item)
            
    def onListItemStart(self):
        self.appendrow("ListItem","Start")
        
    def onListItemEnd(self):
        self.appendrow("ListItem","End")
        
    def onTableStart(self,title):
        self.appendrow("Table","Start",title)
    
    def onTableEnd(self):
        self.appendrow("Table","End")
        
    def onTableHeaderStart(self):
        self.appendrow("TableHeader","Start")
        
    def onTableHeaderEnd(self): 
        self.appendrow("TableHeader","End")
        
    def onTableCellStart(self):
        self.appendrow("TableCell","Start")
        
    def onTableCellEnd(self): 
        self.appendrow("TableCell","End")
            
    def appendrow(self,docEltType,docEltCmd,text=None):
        self.listType.append(docEltType)
        self.listCmd.append(docEltCmd)
        self.listLevel.append(self.nestingLevel)
        if(text is not None):
            text = text.replace("\\n","\n")
        self.listText.append(text)
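
To make the line format handled by `readcommand` more concrete, here is a minimal standalone sketch of the same parsing step. It uses an equivalent regular expression and the ">>" and "||" markers defined in the class; the helper name `parse_element_line` and the sample lines in the usage note are hypothetical and were written by hand, not taken from a real extraction.

```python
import re

ELEMENT_LINE = re.compile(
    r"## (?P<NestingLevel>[0-9]+) (?P<ElementName>[A-Za-z]+) (?P<Command>Start|End|Items) ?")

def parse_element_line(line):
    """Split one '##' command line into (level, element, command, title, items)."""
    match = ELEMENT_LINE.match(line)
    if match is None:
        raise ValueError(f"Not a document element line : {line[:50]}")
    rest = line[match.end():]
    items = []
    if match.group("Command") == "Items":
        # Inline lists put their items after ">>", separated by "||"
        title, _, itemstext = rest.partition(">>")
        items = [item.strip() for item in itemstext.split("||") if item.strip()]
    else:
        title = rest
    title = title.strip() or None
    return (int(match.group("NestingLevel")), match.group("ElementName"),
            match.group("Command"), title, items)
```

For example, `parse_element_line("## 3 List Items Comptes >> Compte courant || Livret A")` yields the nesting level 3, the element name "List", the command "Items", the title "Comptes" and the two inline items.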

Use the function below to prepare DataFrames for all the extracted websites at once :

In [86]:
def prepareDataFramesForWebsites(rootdir, websites):
    """Loads all individual text blocks extracted from the pages of each website in a dataframe, and save them efficiently on disk.

    Parameters:
    rootdir - Path to the directory where the websites were extracted
    websites - List of strings with the websites root URLs
    """
    for websiteurl in websites:
        website = getWebsiteName(websiteurl)
        print(f"Preparing dataframe for website {website} ...")        
        websitedir = getWebsiteDir(rootdir,website)
        reader = NLPTextDocumentReader(websitedir)
        textdf = reader.load_dataframe()
        docsCount = len(textdf[(textdf["DocEltType"]=="Document") & (textdf["DocEltCmd"]=="Start")])
        logsdf = loadExtractionLogs(websitedir)
        print(f"- {len(logsdf)} website extraction logs")
        print(f"- {docsCount} documents")
        print(f"- {len(textdf)} document elements")
        print(f"- dataframe size in memory : {_format_size_mb(_memory_size(textdf))} MB")
        websitefile = websitedir / "nlptextdocs.dataframe.feather"
        print(f"- dataframe size on disk : {_format_size_mb(_file_size(websitefile))} MB")

In [None]:
prepareDataFramesForWebsites(rootdir, websites)

If you encounter a parsing error in any of the text files : just delete the corrupted file and relaunch the function above.

It will run very efficiently for all the websites already processed.
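
The reruns are fast because `load_dataframe` returns the saved `nlptextdocs.dataframe.feather` file when it already exists, and only parses the text files otherwise. The sketch below shows this "load the cache or build it" pattern in isolation, with a JSON file standing in for the Feather file; the helper name `load_or_build` is hypothetical.

```python
import json
from pathlib import Path

def load_or_build(cachefile, build):
    """Return the cached result if cachefile exists, else build, save and return it."""
    cachefile = Path(cachefile)
    if cachefile.exists():
        with cachefile.open(encoding="utf-8") as f:
            return json.load(f)
    result = build()
    with cachefile.open("w", encoding="utf-8") as f:
        json.dump(result, f)
    return result
```

With this pattern, a stale result is never refreshed automatically : to force a website to be parsed again, delete its `nlptextdocs.dataframe.feather` file first.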

### 4.2 Filter and aggregate all interesting text blocks in a single DataFrame

While we filter and aggregate all the interesting text blocks in a single DataFrame, we also generate the following summaries of the text data for later use :

1. Information about the character set used in the extracted dataset :

In [None]:
from collections import defaultdict

def saveCharset(rootdir, vocabdf):
    print("Saving the character set ...")
    # Character code => number of occurrences, weighted by the frequency of each word
    charset = defaultdict(lambda:0)
    for idx,row in vocabdf.iterrows():
        token = row["Words"]
        count = row["Counts"]
        for char in token:
            charcode = ord(char)
            charset[charcode] = charset[charcode] + count
    charsetdf = pd.DataFrame({"Code" : [*charset.keys()], "Count" : [*charset.values()]})
    charsetdf.sort_values("Count", ascending=False, inplace=True)
    # Feather files require a default index
    charsetdf.reset_index(drop=True, inplace=True)
    charsetdf["Char"] = charsetdf["Code"].map(lambda x:chr(x))
    charsetdf["isAlpha"] = charsetdf["Char"].map(lambda x:x.isalpha())
    charsetdf["isDigit"] = charsetdf["Char"].map(lambda x:x.isdigit())
    charsetdf["isSpace"] = charsetdf["Char"].map(lambda x:x.isspace())
    # Cumulative percentage of all character occurrences, most frequent characters first
    charsetdf["Percent"] = 100*charsetdf["Count"].cumsum()/charsetdf["Count"].sum()
    charsetfile = rootdir / "charset.dataframe.feather"
    charsetdf.to_feather(charsetfile)
    charsetdf.to_csv(rootdir / "charset.csv",sep=";")
    print(f"- {len(charset)} distinct characters")
    return charsetdf
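
The counting step at the heart of this function can be tried on a tiny vocabulary without pandas. This is a minimal sketch, assuming the vocabulary is given as a plain dictionary of word counts; the helper name `count_characters` and the sample words are hypothetical.

```python
from collections import Counter

def count_characters(word_counts):
    """Count characters over a vocabulary, weighting each word by its frequency."""
    charcounts = Counter()
    for word, count in word_counts.items():
        for char in word:
            charcounts[char] += count
    return charcounts

counts = count_characters({"été": 2, "te": 3})
print(counts.most_common())
```

Each character of a word is counted once per occurrence of that word, so a character that appears twice in a word seen 2 times contributes 4 to the total.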

2. Information about the vocabulary (distinct words) used in the extracted dataset :