# Step 02: Data import from the "Congreso de los Diputados" website

This step extracts the political debates held in the Congress and tags every intervention with the name of the politician who spoke. The main objective is to build a corpus of political profiles that can be used to train an ML Transformer.

It relies on the previous step, which downloaded a set of links to all the pages for each term, to fetch the HTML pages that contain all the debates held in the Spanish Congress.

After this process, the debates of every term are extracted from the web pages into the `./data/debates` folder in `csv` format, with one entry per speech. A speech is detected when an ALL CAPS name appears: this is the format the Congress curators use to indicate that a person (identified by name or by title) is starting to speak. The process also strips some noise, such as page numbers, internal scripts, and special characters that are not part of the text itself.
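As a rough illustration of the ALL CAPS detection, the sketch below applies the same pattern used later in `get_speeches` to an invented sample text. The constant name `SPEAKER_RE` and the sample sentences are assumptions made for this example only.

```python
import re

# Same pattern as the one used in get_speeches further down in this notebook.
SPEAKER_RE = r'(?:(?:[A-ZÀ-Ü,])(?:-|\s)?)+(?:\s*\([^\)]*\))? *:'

# Invented sample text that mimics the layout of a transcript.
sample = (
    "Tiene la palabra el señor Rajoy. "
    "El señor RAJOY BREY: Muchas gracias, señora presidenta. "
    "La señora PRESIDENTA: Gracias."
)

# Each match ends with the colon that introduces the speech, so drop it.
names = [m.group(0)[:-1] for m in re.finditer(SPEAKER_RE, sample)]
print(names)
```

Mixed-case words such as "Rajoy" or "Muchas" are not matched, because an uppercase run must be followed directly by the colon (optionally after a parenthesis).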

Any page that cannot be processed is stored in the `./data/errors` folder for debugging purposes.

In [1]:
# installers, we use the alive_progress bar to have a nice progress view during the import
%pip install alive_progress

Note: you may need to restart the kernel to use updated packages.


In [2]:
# Common Imports
from html.parser import HTMLParser
import re
import pandas as pd


In [35]:
# functions

"""
A simple parser to extract all the text of a publication from the body.
It removes any internal script and removes special characters with the str.strip function.
It also gets rid of pagination (Página nnn)
"""
class PublicationParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        #Initializing lists
        self.lsStartTags = list()
        self.lsEndTags = list()
        self.lsStartEndTags = list()
        self.lsComments = list()
        self.lsData=list()
        # Indicates when we are inside the body tag
        self.inBody=False
        # Marker for scripts
        self.inScript=False

    #HTML Parser Methods
    def handle_starttag(self, startTag, attrs):
        self.lsStartTags.append(startTag)
        if(startTag=="body"):
            self.inBody=True
        if(startTag=="script"):
            self.inScript=True

    def handle_endtag(self, endTag):
        self.lsEndTags.append(endTag)
        if(endTag=="body"):
            self.inBody=False
        elif(endTag=="script"):
            self.inScript=False

    def handle_startendtag(self, startendTag, attrs):
        self.lsStartEndTags.append(startendTag)

    def handle_comment(self, data):
        self.lsComments.append(data)

    def handle_data(self, data):
        if(self.inBody and not self.inScript and data!=''):
            if(not (data.startswith('Página ') or data.startswith('(Página') )):
                self.lsData.append(data.strip())

           


## Main body extractor

Gets the body and finds the start of the debate by looking for the first appearance of the word PRESIDENT*. When someone speaks, their name appears in capital letters, and for the president of the chamber the words PRESIDENTE or PRESIDENTA are used instead. This first appearance usually marks the start of the interventions.
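To make the trimming step concrete, here is a minimal sketch on a toy input. The helper name `find_start` and the sample lines are hypothetical; the logic mirrors the search described above, and everything before the first match is discarded.

```python
def find_start(lines):
    """Return the index of the first line containing 'PRESIDENT', or -1 if none does."""
    for index, line in enumerate(lines):
        if "PRESIDENT" in line:
            return index
    return -1

# Hypothetical parsed page: header lines first, then the debate itself.
lines = [
    "DIARIO DE SESIONES",
    "Se abre la sesión a las nueve de la mañana.",
    "El señor PRESIDENTE: Se abre la sesión.",
    "El señor RAJOY BREY: Gracias.",
]

start = find_start(lines)
# Keep only the text from the first speaker onwards, joined with spaces.
debate_text = " ".join(lines[start:]) if start >= 0 else ""
```

Because this is a substring test, words such as VICEPRESIDENTE also count as a match.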

In [36]:
def findStart(ls:list)->int:
    index=0
    for line in ls:
        if "PRESIDENT" in line:
            return index
        index+=1
    return -1

def getPublicationText(url:str,term:str,date:str)->str:
    import os  # needed by os.makedirs in the error branch below
    import urllib3
    # variables
    http = urllib3.PoolManager()
    # Get the publication
    response = http.request('GET', url)
    # Parse the publication
    parser = PublicationParser()
    parser.feed(response.data.decode('utf-8'))

    index=findStart(parser.lsData)
    text=''
    if(index>=0):
        # lsData is a list of strings, so we join them all with a space
        text=' '.join(parser.lsData[index:])        
    else:
        os.makedirs(f'data/errors/{term}', exist_ok=True)
        with open(f'data/errors/{term}/{date}.html', mode='w') as f:
            f.write(response.data.decode('utf-8'))
        print("Error: PRESIDENT* not found")
    return text

def get_speeches(text:str)->pd.DataFrame:
    """
    Extracts the speeches from the text of a publication.
    """

    # This regex finds the name of the politician that is speaking.
    # The name is usually in the form of:
    # [text]... ALL CAPS SURNAME (the title if president or candidate):
    # So we use this simple regex to find the next ALL CAPS that may be 
    # followed by a parenthesis and ends with a colon.
    regexfinder = r'(?:(?:[A-ZÀ-Ü,])(?:-|\s)?)+(?:\s*\([^\)]*\))? *:'
    # TODO: there's still some special cases like 'DEL DIPUTADO DON ALBERTO GARZÓN ESPINOSA, DEL GRUPO PARLAMENTARIO DE IU, ICV-EUiA, CHA:' that are detected as 'A, CHA'
    # Look file data/pagecache/X/20150121.txt

    indexes=[(m.start(0),m.end(0)) for m in re.finditer(regexfinder,text, re.U|re.M)]
    # print(indexes)

    sentences=pd.DataFrame(columns=['Name','Text'])
    last=len(indexes)-1
    for i in range(len(indexes)):
        name=text[indexes[i][0]:indexes[i][1]-1]   

        firstIdx=indexes[i][1]+1
        if(i<last):
            # try to find the end of the last sentence
            lastIdx=indexes[i+1][0]
            while(text[lastIdx]!='.' and text[lastIdx]!=')'):
                lastIdx=lastIdx-1
                if(lastIdx<=firstIdx):
                    # end of the last sentence not found, take it full.
                    lastIdx=indexes[i+1][0]
                    break
        else:
            lastIdx=len(text)
        # print(f"{firstIdx}-{lastIdx}")
        sentence=text[firstIdx:lastIdx].strip()
        sentences.loc[len(sentences)]=[name,sentence]
    return sentences


The `parse_speeches` function downloads all the pages of a term and parses them with the functions above to extract the texts of the full debate. It does not yet do a full preparation, but once this process has finished we have a full dataset for each term with the texts of every day on which a debate occurred.
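The code cell for `parse_speeches` reads a page from `data/pagecache` when it is already there and downloads it only otherwise. That cache-or-fetch pattern can be sketched in isolation as follows; `load_or_fetch` and `fake_fetch` are hypothetical names, and the sketch writes to a temporary directory rather than the real cache folder.

```python
import os
import tempfile

def load_or_fetch(cache_path, fetch):
    """Return the cached text if present; otherwise call fetch() and cache its result."""
    if os.path.isfile(cache_path):
        with open(cache_path, "r", encoding="utf-8") as f:
            return f.read()
    text = fetch()
    os.makedirs(os.path.dirname(cache_path), exist_ok=True)
    with open(cache_path, "w", encoding="utf-8") as f:
        f.write(text)
    return text

calls = []

def fake_fetch():
    # Stand-in for the real download; records how often it is invoked.
    calls.append(1)
    return "El señor PRESIDENTE: Se abre la sesión."

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "pagecache", "X", "20150121.txt")
    first = load_or_fetch(path, fake_fetch)   # downloads and writes the cache
    second = load_or_fetch(path, fake_fetch)  # served from the cache file
```

With this layout, re-running the import after an interruption only downloads the pages that are still missing.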

In [37]:
def parse_speeches(term:pd.DataFrame)->pd.DataFrame:
    from alive_progress import alive_bar
    import os

    speeches_ds = None

    with alive_bar(len(term),title=f'importing Term {term.loc[0]["term"]}',force_tty=True) as bar:
        bar.text='open'
        for index,r in term.iterrows():
            if os.path.isfile(f'data/pagecache/{r["term"]}/{r["fecha"]}.txt'):
                bar.text=f'c-{r["term"]}-{r["fecha"]}'
                with open(f'data/pagecache/{r["term"]}/{r["fecha"]}.txt','r') as f:
                    text=f.read()
            else:
                bar.text=f'u-{r["term"]}-{r["fecha"]}'
                text=getPublicationText(r["url"],f'{r["term"]}',f'{r["fecha"]}')
                os.makedirs(f'data/pagecache/{r["term"]}',exist_ok=True)
                with open(f'data/pagecache/{r["term"]}/{r["fecha"]}.txt',mode='w') as f:
                    f.write(text)

            bar.text=f'Parse {r["term"]}-{r["fecha"]}'
            speeches=get_speeches(text)
            bar.text=f'Append data {r["term"]}-{r["fecha"]}'
            speeches["Date"]=r["fecha"]
            speeches["Term"]=r["term"]
            bar.text=f'Merge {r["term"]}-{r["fecha"]}'
            if speeches_ds is None:
                speeches_ds=speeches
            else:
                speeches_ds=pd.concat([speeches_ds,speeches])
            bar()
    return speeches_ds


In [38]:
import glob
import os

os.makedirs('data/debates', exist_ok=True)

for term_file in glob.glob('data/terms/*.csv'):
    base_name = os.path.basename(term_file)
    current_term=pd.read_csv(term_file)
    term_id=current_term.loc[0]['term']
    ds=parse_speeches(current_term)
    ds.to_csv(f'data/debates/speeches_term_{term_id}.csv',index=False)



importing Term X |████████████████████████████████████████| 315/315 [100%] in 2:12.1 (2.39/s)                           
importing Term XI |████████████████████████████████████████| 15/15 [100%] in 3.6s (4.20/s)                              
importing Term XII |████████████████████████████████████████| 185/185 [100%] in 1:17.1 (2.40/s)                         
importing Term XIII |████████████████████████████████████████| 15/15 [100%] in 4.8s (3.12/s)                            
importing Term XIV |████████████████████████████████████████| 221/221 [100%] in 1:40.7 (2.19/s)                         
importing Term V |████████████████████████████████████████| 197/197 [100%] in 1:21.9 (2.40/s)                           
importing Term VI |████████████████████████████████████████| 286/286 [100%] in 2:05.5 (2.28/s)                          
importing Term VII |████████████████████████████████████████| 310/310 [100%] in 2:07.4 (2.43/s)                         
importing Term VIII |███████████