# Use this notebook to run all the notebooks as a pipeline

Building this dataset takes multiple steps:

1. First, we gather the list of pages used to build the dataset. A search engine returns a `json` file with the list of pages.
2. We import all the HTML and do an initial parse to extract the debates and the name of the politician giving each speech.
3. Then we build one large dataset from all the imported data and enhance it by extracting the politician's title (if any), extracting sentiment, and cleaning up the data.
4. Once we have the full dataset, we prepare it for publishing on Hugging Face and make it compatible with the Transformers library.

Each of these steps corresponds to a notebook. You can run them all with this pipeline.
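
The ordering of the steps can be sketched in plain Python. This is only an illustrative sketch, not part of the pipeline: the notebook file names come from the `%run` cells in this notebook, while `ordered_steps` and `missing_steps` are hypothetical helpers that sort the notebooks by their numeric prefix and report any that are absent before a run.

```python
from pathlib import Path

# Notebook names as they appear in the %run cells of this pipeline.
NOTEBOOKS = [
    "01-CDLD-Get-full-term-in-office.ipynb",
    "02-CDLD-import.ipynb",
    "03-Prepare-Data.ipynb",
    "04-Prepare-dataset.ipynb",
]


def ordered_steps(names):
    """Sort notebook names by their numeric prefix (hypothetical helper)."""
    return sorted(names, key=lambda name: int(name.split("-", 1)[0]))


def missing_steps(names, folder="."):
    """Return the notebooks that do not exist in `folder` (hypothetical helper)."""
    return [name for name in names if not (Path(folder) / name).is_file()]


if __name__ == "__main__":
    for name in ordered_steps(NOTEBOOKS):
        print("step:", name)
    print("missing:", missing_steps(NOTEBOOKS))
```

Checking for missing notebooks first avoids starting a long run that would fail halfway through.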

## Step 1: Run the search

The result is a set of json files pointing to the real HTML files, stored in the ./data/terms folder. We also store partial csv files and a full one.

In [7]:
%run ./01-CDLD-Get-full-term-in-office.ipynb

Getting term 5...1.2.3.4.5.6.7.8.9.10Done. 197 documents.
Getting term 6...1.2.3.4.5.6.7.8.9.10.11.12.13.14.15Done. 483 documents.
Getting term 7...1.2.3.4.5.6.7.8.9.10.11.12.13.14.15.16Done. 793 documents.
Getting term 8...1.2.3.4.5.6.7.8.9.10.11.12.13.14.15.16Done. 1108 documents.
Getting term 9...1.2.3.4.5.6.7.8.9.10.11.12.13.14.15Done. 1390 documents.
Getting term 10...1.2.3.4.5.6.7.8.9.10.11.12.13.14.15.16Done. 1705 documents.
Getting term 11...1Done. 1720 documents.
Getting term 12...1.2.3.4.5.6.7.8.9.10Done. 1905 documents.
Getting term 13...1Done. 1920 documents.
Getting term 14...1.2.3.4.5.6.7.8.9.10.11.12Done. 2150 documents.


## Step 2: Import all the debates from HTML

The result is stored as one csv file per debate in ./data/debates. Those files contain all the speeches, each with the name of the politician giving it.
We also store the HTML pages in ./data/pagecache to avoid downloading all the content each time.
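
The page cache idea can be illustrated with a minimal sketch. It is an assumption about how such a cache might work, not the code used in the import notebook: `cache_path` and `fetch_cached` are hypothetical names, and keying files by a SHA-256 hash of the URL is a choice made for this example only.

```python
import hashlib
from pathlib import Path
from urllib.request import urlopen


def cache_path(url, cache_dir="./data/pagecache"):
    """Map a URL to a file inside the cache folder (hypothetical naming scheme)."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return Path(cache_dir) / f"{digest}.html"


def fetch_cached(url, cache_dir="./data/pagecache", download=None):
    """Return the HTML for `url`, downloading it only on the first request."""
    path = cache_path(url, cache_dir)
    if path.is_file():
        # Cache hit: reuse the copy stored on disk.
        return path.read_text(encoding="utf-8")
    if download is None:
        # Default downloader; assumes the page is UTF-8 encoded.
        download = lambda u: urlopen(u).read().decode("utf-8")
    html = download(url)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(html, encoding="utf-8")
    return html
```

Passing a custom `download` callable keeps the sketch testable without network access and makes it easy to swap in another HTTP client.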

In [8]:
%run ./02-CDLD-import.ipynb

Note: you may need to restart the kernel to use updated packages.
importing Term X |████████████████████████████████████████| 315/315 [100%] in 2:10.6 (2.41/s)                           
importing Term XI |████████████████████████████████████████| 15/15 [100%] in 3.4s (4.44/s)                              
importing Term XII |████████████████████████████████████████| 185/185 [100%] in 1:16.1 (2.43/s)                         
importing Term XIII |████████████████████████████████████████| 15/15 [100%] in 4.7s (3.18/s)                            
on 221: Error: PRESIDENT* not found                                                                                     
on 222: Error: PRESIDENT* not found                                                                                     
on 223: Error: PRESIDENT* not found                                                                                     
on 224: Error: PRESIDENT* not found                                                    

## Step 3: Prepare and clean the dataset

The data is enhanced by removing parenthesized text and extracting information from it, such as the politician's title, related decrees, or audience reactions used to compute a sentiment score.
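
As a rough illustration of the parenthesis handling, the sketch below separates a speech into its cleaned text and the list of parenthesized notes. The function name `split_parenthesized` and the regular expression are assumptions made for this example; the preparation notebook may use different rules, and this version does not handle nested parentheses.

```python
import re

# Matches one pair of parentheses with no nested parentheses inside.
PAREN = re.compile(r"\(([^()]*)\)")


def split_parenthesized(text):
    """Return (text without parenthesized notes, list of the notes)."""
    notes = [note.strip() for note in PAREN.findall(text)]
    cleaned = PAREN.sub(" ", text)
    # Collapse the whitespace left behind by the removed notes.
    cleaned = re.sub(r"\s+", " ", cleaned).strip()
    return cleaned, notes


if __name__ == "__main__":
    speech = "Gracias, señor presidente. (Aplausos.) Continuamos."
    cleaned, notes = split_parenthesized(speech)
    print(cleaned)  # Gracias, señor presidente. Continuamos.
    print(notes)    # ['Aplausos.']
```

The extracted notes could then be inspected for titles, decree references, or reactions, which is where a sentiment score would be derived.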

In [11]:
%run ./03-Prepare-Data.ipynb

             TextLen       Positive       Negative       Surprise
count  357970.000000  358067.000000  358067.000000  358067.000000
mean     1468.588885       0.296344       0.125094       0.002058
std      3119.074758       1.061502       0.617428       0.052462
min         0.000000       0.000000       0.000000       0.000000
25%        76.000000       0.000000       0.000000       0.000000
50%       200.000000       0.000000       0.000000       0.000000
75%      1355.000000       0.000000       0.000000       0.000000
max    101263.000000     105.000000      46.000000       7.000000
debates written to parquet file.
Exported Zapater & Rajoy
            TextLen     Positive     Negative     Surprise
count   2123.000000  2123.000000  2123.000000  2123.000000
mean    3701.005652     2.113990     0.740462     0.026849
std     7683.510591     4.321462     1.791697     0.188588
min        2.000000     0.000000     0.000000     0.000000
25%      517.000000     0.000000     0.000000     0.0

## Step 4: NLP dataset preparation

In [12]:
%run ./04-Prepare-dataset.ipynb

Downloading and preparing dataset json/. to /home/azureuser/.cache/huggingface/datasets/json/.-ec277ea29f9ac72a/0.0.0/da492aad5680612e4028e7f6ddc04b1dfcec4b64db470ed7cc5f2bb265b9b6b5...
Dataset json downloaded and prepared to /home/azureuser/.cache/huggingface/datasets/json/.-ec277ea29f9ac72a/0.0.0/da492aad5680612e4028e7f6ddc04b1dfcec4b64db470ed7cc5f2bb265b9b6b5. Subsequent calls will reuse this data.
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1821 entries, 1 to 2122
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Name              1821 non-null   object 
 1   Text              1821 non-null   object 
 2   Date              1821 non-null   object 
 3   Term              1821 non-null   object 
 4   Title             1195 non-null   object 
 5   infos             1576 non-null   object 
 6   TextLen           1821 non-null   float64
 7   Data              11 non-null     object 
 8   InterventionT

Creating json from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

Using custom data configuration .-7d843941c9c1b74a


Downloading data files:   0%|          | 0/2 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/2 [00:00<?, ?it/s]

0 tables [00:00, ? tables/s]

  0%|          | 0/2 [00:00<?, ?it/s]