# Extract textual data from HTML

In [None]:
!pip install beautifulsoup4 # Install the library Beautiful Soup



In [None]:
from bs4 import BeautifulSoup as BS
import re, os

Create a BeautifulSoup object: a parsed representation of the document passed in, which in our case is the HTML file.

In [None]:
# the file location requires modification depending on where you save the BOE.html file
os.chdir('/content/drive/MyDrive/CivilCode_Spanish/')
with open('BOE.html','r') as file:
    soup = BS(file,'html.parser')

After examining the file, we found that all the text we need is stored in three kinds of tags: "h4", "h5", and "p".

So we use the .find_all() method of the BeautifulSoup object to collect all of these tags into a new object.


In [None]:
# match exactly h4, h5, or p (a bare '^p' would also match names such as "pre")
command = re.compile(r'^(h4|h5|p)$')

In [None]:
needed_tags = soup.find_all(command)
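
When .find_all() receives a compiled pattern, Beautiful Soup tests it against each tag name with search(), so a pattern that is not anchored at both ends can also accept longer names. The sketch below uses only the re module and an illustrative list of tag names (an assumption, not taken from BOE.html) to compare a start-anchored pattern with a fully anchored one:

```python
import re

# Tag names used only for illustration; they are not taken from BOE.html.
tag_names = ['h4', 'h5', 'p', 'pre', 'param', 'h3', 'span']

loose = re.compile('(h(4|5))|^p')     # anchors only the start of "p"
strict = re.compile(r'^(h4|h5|p)$')   # anchors both ends of every alternative

# Apply search() to each name, as Beautiful Soup does for regex filters.
loose_hits = [name for name in tag_names if loose.search(name)]
strict_hits = [name for name in tag_names if strict.search(name)]

print(loose_hits)   # ['h4', 'h5', 'p', 'pre', 'param']
print(strict_hits)  # ['h4', 'h5', 'p']
```

Passing a list of names, as in soup.find_all(['h4', 'h5', 'p']), is another way to request exact tag names.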

In [None]:
# A preview of filtered tags
needed_tags[35:45]

[<p class="parrafo">2. La equidad habrá de ponderarse en la  aplicación de las normas, si bien las resoluciones de los Tribunales  sólo podrán descansar de manera exclusiva en ella cuando la ley  expresamente lo permita.</p>,
 <h5 class="articulo">Artículo 4.</h5>,
 <p class="parrafo">1. Procederá la aplicación analógica de las normas  cuando éstas no contemplen un supuesto específico, pero regulen otro  semejante entre los que se aprecie identidad de razón.</p>,
 <p class="parrafo">2. Las leyes penales, las excepcionales y las de  ámbito temporal no se aplicarán a supuestos ni en momentos distintos de  los comprendidos expresamente en ellas.</p>,
 <p class="parrafo">3. Las disposiciones de este Código se aplicarán  como supletorias en las materias regidas por otras leyes.</p>,
 <h5 class="articulo">Artículo 5.</h5>,
 <p class="parrafo">1. Siempre que no se establezca otra cosa, en los  plazos señalados por días, a contar de uno determinado, quedará éste  excluido del cómputo, el cual 

In [None]:
'''
The next step is to extract only the text from all these tags.

Note that we use the .get_text() method instead of .string, which is the more native and intuitive option in BeautifulSoup.

The reason is that the latter does not handle superscript tags within the text well.

For example, if a tag object in BeautifulSoup is

<p class="parrafo_2">1.<sup>a</sup> Será ley personal la  determinada por la vecindad civil.</p>

then .string returns None, because the tag has more than one child,

while .get_text() returns "1.a Será ley personal la  determinada por la vecindad civil.", in which we only need to format the "a" later.
'''
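
The difference described above can be reproduced on the single example tag. This is a minimal sketch that parses only that one string rather than BOE.html; the variable names are illustrative:

```python
from bs4 import BeautifulSoup

# The example tag quoted in the note above, parsed on its own.
html = '<p class="parrafo_2">1.<sup>a</sup> Será ley personal la  determinada por la vecindad civil.</p>'
tag = BeautifulSoup(html, 'html.parser').p

string_value = tag.string    # None, because the <p> has several children
text_value = tag.get_text()  # text of all descendants, concatenated

print(string_value)
print(text_value)
```

The superscript letter ends up directly after the dot in the extracted text, which is why a later cleaning step is needed to turn it into "ª".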

In [None]:
# write the text of every tag on its own line; the file is closed automatically
with open('spanish_raw.txt', 'w') as spanish_raw:
    for n in needed_tags:
        spanish_raw.write(n.get_text() + '\n')

# Clean the raw text

In [None]:
with open('spanish_raw.txt', 'r') as f:
    raw_text = f.read()

In [None]:
# collapse runs of two or more spaces into a single space
whitespace = re.sub(' {2,}', ' ', raw_text)

In [None]:
# replace all forms of "suprimido" with "derogado" to keep the terminology consistent
Derogar = re.sub('Suprimid','Derogad',whitespace)
derogar = re.sub('suprimid','derogad',Derogar)

In [None]:
# change .a to .ª (the dot is escaped so that it matches only a literal ".")
superscript = re.sub(r'(\d)\.a', r'\g<1>.ª', derogar)
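
In a regular expression an unescaped `.` matches any character, so this step needs care to touch only ordinals such as "1.a". The sketch below compares both forms of the pattern on a made-up sample sentence (an assumption, not a quote from the Civil Code):

```python
import re

# Made-up sample: one real ordinal ("1.a") and one phrase that must stay intact ("1 año").
sample = '1.a Será ley personal. Plazo de 1 año.'

# Unescaped dot: "1 a" in "1 año" also matches, corrupting the word.
loose = re.sub(r'(\d).a', r'\g<1>.ª', sample)

# Escaped dot: only a digit followed by a literal ".a" is rewritten.
strict = re.sub(r'(\d)\.a', r'\g<1>.ª', sample)

print(loose)   # 1.ª Será ley personal. Plazo de 1.ªño.
print(strict)  # 1.ª Será ley personal. Plazo de 1 año.
```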

In [None]:
# the with statement closes the file, so no explicit close() is needed
with open('spanish_cleaned.txt','w') as Writer:
    Writer.write(superscript)

# Structure cleaned text into separate files and folders

In [None]:
with open('spanish_cleaned.txt', 'r') as f:
    cleaned_text = f.read()

In [None]:
# Split all text by 'LIBRO' into a list.
# The first item holds everything before the first LIBRO (including the 'Título Preliminar');
# each later item holds the content of one LIBRO.
libro = cleaned_text.split('LIBRO')
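
Because str.split() drops the separator, each piece has to get its marker back before it is saved. A minimal sketch on a toy string (the text and variable names are illustrative assumptions that only mimic the layout of the cleaned file):

```python
# Toy text that mimics the marker layout; not real content from the Civil Code.
sample = 'intro TÍTULO PRELIMINAR a LIBRO PRIMERO TÍTULO I b TÍTULO II c LIBRO SEGUNDO TÍTULO I d'

parts = sample.split('LIBRO')

# parts[0] is everything before the first 'LIBRO'; keep only the part
# that follows the first 'TÍTULO' and restore the marker.
head = 'TÍTULO' + parts[0].split('TÍTULO')[1]

# Inside one LIBRO, skip the text before the first 'TÍTULO' and restore
# the marker on every remaining piece.
first_titulos = ['TÍTULO' + t for t in parts[1].split('TÍTULO')[1:]]

print(len(parts))      # 3
print(head)            # TÍTULO PRELIMINAR a 
print(first_titulos)   # ['TÍTULO I b ', 'TÍTULO II c ']
```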

In [None]:
dict0 = libro[0].split('TÍTULO') # There is unwanted information before the needed text
titulopreliminar = 'TÍTULO' + dict0[1]

In [None]:
dict4 = libro[4].split('DISPOSICIÓN FINAL')
disposicionfinal = 'DISPOSICIÓN FINAL' + dict4[1]

In [None]:
with open('Título preliminar.txt','w') as preliminar:
    preliminar.write(titulopreliminar)

In [None]:
with open('Disposiciones final y adicionales.txt','w') as final:
    final.write(disposicionfinal)

In [None]:
Rest = [libro[1],libro[2],libro[3],dict4[0]]

In [None]:
# automatically split the rest of the text into different "Libros" and "Títulos"
# and save them in the corresponding folders and files
for y, libro_text in enumerate(Rest, start=1):  # does not overwrite the `libro` list
    namefolder = 'Libro_' + str(y)
    os.makedirs(namefolder, exist_ok=True)  # do not fail if the folder already exists
    titulos = libro_text.split('TÍTULO')
    # titulos[0] is the text before the first 'TÍTULO' and is skipped
    for x, titulo in enumerate(titulos[1:], start=1):
        name = 'Título_' + str(x)
        path = os.path.join(namefolder, '%s.txt' % name)
        with open(path, 'w') as f:
            f.write('TÍTULO' + titulo)
print('Done')

Done
