# Use this notebook to run all the notebooks as a pipeline

Building this dataset takes multiple steps:

1. First, we gather the list of pages the dataset will be built from. A search engine returns a `json` file with the list of pages.
2. We import all the HTML and do an initial parse to extract the debates and the name of the politician for each speech.
3. Then we build one large dataset from all the imported data and enhance it by extracting the politician's title (if any), extracting sentiment, and cleaning up the data.
4. Once we have the full dataset, we prepare it for publishing on Hugging Face and make it compatible with the Transformers library.

Each of these steps corresponds to a notebook. You can run them all with this pipeline.
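The cells below chain the notebooks with the `%run` magic, which only works inside a Jupyter kernel. As a rough sketch of the same ordering outside a notebook, the steps could also be executed through the `jupyter nbconvert` command line. The helper names `build_commands` and `run_pipeline` are hypothetical and not part of this project; the notebook file names are the ones used in the cells of this pipeline.

```python
import subprocess
from pathlib import Path

# Notebook file names as they appear in the cells of this pipeline.
NOTEBOOKS = [
    "01-CDLD-Get-full-term-in-office.ipynb",
    "02-CDLD-import.ipynb",
    "03-Prepare-Data.ipynb",
    "04-Prepare-dataset.ipynb",
]


def build_commands(notebooks, base_dir="."):
    """Return one `jupyter nbconvert --execute` command per notebook, in order."""
    return [
        ["jupyter", "nbconvert", "--to", "notebook", "--execute", "--inplace",
         str(Path(base_dir) / name)]
        for name in notebooks
    ]


def run_pipeline(notebooks=NOTEBOOKS, base_dir=".", skip=()):
    """Execute each notebook in order and stop at the first failure."""
    selected = [name for name in notebooks if name not in skip]
    for command in build_commands(selected, base_dir):
        # check=True raises CalledProcessError if a notebook fails to execute.
        subprocess.run(command, check=True)
```

Calling `run_pipeline(skip=("01-CDLD-Get-full-term-in-office.ipynb",))` would skip the search step. Unlike `%run`, this approach executes each notebook in its own kernel, so no variables are shared between steps and each notebook has to read its inputs from disk.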

## Step 1: Run the search

The result is a set of JSON files pointing to the real HTML files, stored in the ./data/terms folder. We also store partial CSV files and a full one.
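Before moving on to the import step, it can help to check what the search step left in ./data/terms. The sketch below only assumes that the folder holds `.json` files; their internal structure is not documented here, so it just counts top-level items. The function names are hypothetical.

```python
import json
from pathlib import Path


def list_term_files(terms_dir="./data/terms"):
    """Return the JSON files found in the terms folder, sorted by name."""
    return sorted(Path(terms_dir).glob("*.json"))


def load_term_file(path):
    """Parse one JSON file and return its content."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def summarize_terms(terms_dir="./data/terms"):
    """Map each JSON file name to the number of top-level items it holds."""
    summary = {}
    for path in list_term_files(terms_dir):
        data = load_term_file(path)
        # Lists and dicts have a length; any other JSON value counts as one item.
        summary[path.name] = len(data) if isinstance(data, (list, dict)) else 1
    return summary
```

An empty result from `summarize_terms()` suggests the search step has not been run yet, or that it wrote its files somewhere else.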

In [3]:
# %run ./01-CDLD-Get-full-term-in-office.ipynb

## Step 2: Import all the debates from HTML

The result is stored as one CSV file per debate in ./data/debates. Each file contains all the speeches of that debate, along with the name of the politician giving each speech.
We also store the HTML pages in ./data/pagecache to avoid downloading all the content again on each run.
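The page cache idea can be sketched as follows. This is not the code of the import notebook, only a minimal illustration of caching by URL: the file naming scheme (a SHA-256 hash of the URL) and the function names `cache_path`, `download`, and `fetch_cached` are assumptions made for this example.

```python
import hashlib
from pathlib import Path
from urllib.request import urlopen


def cache_path(url, cache_dir="./data/pagecache"):
    """Build the cache file path for a URL (hypothetical naming scheme)."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return Path(cache_dir) / f"{digest}.html"


def download(url):
    """Download the raw bytes of a page."""
    with urlopen(url) as response:
        return response.read()


def fetch_cached(url, cache_dir="./data/pagecache", fetch=download):
    """Return the page content, downloading it only if it is not cached yet."""
    path = cache_path(url, cache_dir)
    if path.exists():
        return path.read_bytes()
    content = fetch(url)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(content)
    return content
```

Passing the downloader as the `fetch` argument keeps the cache logic testable without network access. Deleting a file from the cache folder forces that page to be downloaded again.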

In [4]:
%run ./02-CDLD-import.ipynb

Note: you may need to restart the kernel to use updated packages.
importing Term X |████████████████████████████████████████| 315/315 [100%] in 2:13.2 (2.37/s)                           
importing Term XI |████████████████████████████████████████| 15/15 [100%] in 3.6s (4.16/s)                              
importing Term XII |████████████████████████████████████████| 185/185 [100%] in 1:17.7 (2.38/s)                         
importing Term XIII |████████████████████████████████████████| 15/15 [100%] in 4.8s (3.10/s)                            
importing Term XIV |████████████████████████████████████████| 221/221 [100%] in 1:42.7 (2.15/s)                         
importing Term V |████████████████████████████████████████| 197/197 [100%] in 1:24.0 (2.34/s)                           
importing Term VI |████████████████████████████████████████| 286/286 [100%] in 2:06.8 (2.26/s)                          
importing Term VII |████████████████████████████████████████| 310/310 [100%] in 2:08.1 

## Step 3: Prepare and clean the dataset.

This step enhances the data by removing parenthesized text and extracting information from it, such as the politician's title, related decrees, or audience reactions used to compute a sentiment score.
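As an illustration of the parenthesized-text handling, the sketch below separates a speech into its cleaned text and the list of parenthesized notes. It is a simplified assumption of how this could be done with a regular expression, not the logic of `03-Prepare-Data.ipynb`. The function name `split_parenthesized` is hypothetical, and nested parentheses are not handled.

```python
import re

# Matches one non-nested parenthesized group and captures its content.
PAREN = re.compile(r"\(([^()]*)\)")


def split_parenthesized(text):
    """Return (cleaned_text, notes) where notes are the parenthesized parts."""
    notes = [match.strip() for match in PAREN.findall(text)]
    cleaned = PAREN.sub(" ", text)
    # Collapse the whitespace left behind by the removed groups.
    cleaned = re.sub(r"\s+", " ", cleaned).strip()
    # Remove any space that now sits right before a punctuation mark.
    cleaned = re.sub(r"\s+([.,;:!?])", r"\1", cleaned)
    return cleaned, notes
```

The extracted notes could then be classified separately, for example as titles, decrees, or reactions, while the cleaned text is kept as the speech itself.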

In [5]:
%run ./03-Prepare-Data.ipynb

## Step 4: NLP dataset preparation

In [6]:
%run ./04-Prepare-dataset.ipynb

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1821 entries, 1 to 2122
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Name              1821 non-null   object 
 1   Text              1821 non-null   object 
 2   Date              1821 non-null   object 
 3   Term              1821 non-null   object 
 4   Title             1195 non-null   object 
 5   infos             1576 non-null   object 
 6   TextLen           1821 non-null   float64
 7   Data              11 non-null     object 
 8   InterventionType  11 non-null     object 
 9   Positive          1821 non-null   int64  
 10  Negative          1821 non-null   int64  
 11  Surprise          1821 non-null   int64  
dtypes: float64(1), int64(3), object(8)
memory usage: 184.9+ KB


Creating json from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

Creating json from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

Downloading and preparing dataset json/. to /home/azureuser/.cache/huggingface/datasets/json/.-0452fd0be2cf681c/0.0.0/da492aad5680612e4028e7f6ddc04b1dfcec4b64db470ed7cc5f2bb265b9b6b5...


Downloading data files:   0%|          | 0/2 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/2 [00:00<?, ?it/s]

0 tables [00:00, ? tables/s]

0 tables [00:00, ? tables/s]

Dataset json downloaded and prepared to /home/azureuser/.cache/huggingface/datasets/json/.-0452fd0be2cf681c/0.0.0/da492aad5680612e4028e7f6ddc04b1dfcec4b64db470ed7cc5f2bb265b9b6b5. Subsequent calls will reuse this data.


  0%|          | 0/2 [00:00<?, ?it/s]

Using custom data configuration .-0452fd0be2cf681c
