# ETL for Questions Au Gouvernement (QAG) Intervention Dataset

This Google Colab notebook extracts, transforms, and loads (ETL) the "Questions Au Gouvernement" (QAG) dataset. QAG is a session of the French National Assembly in which members of the French government answer questions from deputies. The notebook preprocesses JSON files obtained from the French National Assembly's Open Data portal (https://data.assemblee-nationale.fr/) and builds an intervention dataset for future Natural Language Processing (NLP) projects.

## Overview

The purpose of this ETL process is to convert raw QAG JSON files into a structured and accessible format, which can be used to analyze the interventions of various political figures during the QAG meetings. This includes extracting relevant information such as the speaker's name, intervention content, and the official date of the intervention.

To achieve this, the notebook will perform the following steps:

1. Download the raw QAG JSON files from the French National Assembly's Open Data portal.
2. Unzip the downloaded files to access the individual JSON files.
3. Define necessary classes and functions for the extraction and transformation of the data.
4. Parse the JSON files to extract relevant information, including the date, speaker's name, and intervention content.
5. Concatenate the extracted information from each JSON file to create a global Pandas DataFrame, which represents the complete intervention dataset.
6. Save the dataset as a CSV file for easy access and future NLP projects.

## Key Features

- Uses BeautifulSoup and two small custom classes (`QAG` and `Intervention`) to extract and structure the data.
- Includes a usage example after each helper.
- Processes every JSON file in the archive in a single loop and reports the files that fail to parse.
- Saves the output dataset as a CSV file, a format most NLP tooling can read directly.

The resulting dataset records who spoke during each QAG session, what they said, and on which official date, which makes it a starting point for NLP projects on French political discourse.


# Download .zip containing all QAG .json files from source

Source: https://data.assemblee-nationale.fr/archives-xve/questions-au-gouvernement  

In [28]:
!wget https://data.assemblee-nationale.fr/static/openData/repository/15/questions/questions_gouvernement/Questions_gouvernement_XV.json.zip

--2023-03-21 14:05:58--  https://data.assemblee-nationale.fr/static/openData/repository/15/questions/questions_gouvernement/Questions_gouvernement_XV.json.zip
Resolving data.assemblee-nationale.fr (data.assemblee-nationale.fr)... 46.105.202.26
Connecting to data.assemblee-nationale.fr (data.assemblee-nationale.fr)|46.105.202.26|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 14635993 (14M) [application/zip]
Saving to: ‘Questions_gouvernement_XV.json.zip.1’


2023-03-21 14:06:26 (519 KB/s) - ‘Questions_gouvernement_XV.json.zip.1’ saved [14635993/14635993]



## Unzip raw QAG .json files

In [29]:
!unzip /content/Questions_gouvernement_XV.json.zip

Archive:  /content/Questions_gouvernement_XV.json.zip
replace json/QANR5L15QG3917.json? [y]es, [n]o, [A]ll, [N]one, [r]ename: 
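
The `.zip.1` suffix in the download log and the `replace ...?` prompt above both suggest that the archive had already been downloaded and extracted earlier in this session. An interactive prompt blocks a notebook cell, so a re-runnable extraction step can help. The following is a minimal sketch using Python's standard `zipfile` module; `extract_archive` and its `overwrite` parameter are hypothetical names introduced here, not part of the original notebook.

```python
import zipfile
from pathlib import Path

def extract_archive(zip_path, dest_dir, overwrite=False):
    """Extract zip_path into dest_dir and return the list of files written.

    With overwrite=False, members that already exist on disk are skipped,
    so re-running the cell never waits on a confirmation prompt.
    """
    dest = Path(dest_dir)
    written = []
    with zipfile.ZipFile(zip_path) as archive:
        for member in archive.infolist():
            target = dest / member.filename
            # Skip directory entries and, unless asked otherwise, existing files.
            if member.is_dir() or (target.exists() and not overwrite):
                continue
            archive.extract(member, dest)
            written.append(target)
    return written
```

In this notebook the call would presumably be `extract_archive("/content/Questions_gouvernement_XV.json.zip", "/content")`. On the shell side, `unzip -n` (never overwrite) or `unzip -o` (overwrite without asking) avoid the prompt as well.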

# Define classes and functions here - Extract Interventions

In [30]:
import json
import re
from bs4 import BeautifulSoup
from IPython.display import HTML, display

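class Intervention:
    """One speaker turn: the speaker's name and the text they spoke."""
    def __init__(self, speaker_name, sentence_list):
        self.speaker_name = speaker_name
        # Despite its name, sentence_list holds a single concatenated string.
        self.sentence_list = sentence_list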

class QAG:
    def __init__(self, data=None):
        if data:
            self.question = data["question"]
            self.interventions = self._parse_interventions()
        else:
            self.question = {}
            self.interventions = []

    @classmethod
    def from_json_file(cls, filepath):
        with open(filepath, "r", encoding="utf-8") as file:
            data = json.load(file)
        return cls(data)

    def to_json_file(self, filepath):
        with open(filepath, "w", encoding="utf-8") as file:
            json.dump({"question": self.question}, file, ensure_ascii=False, indent=2)

    def get_texte(self):
        return self.question["textesReponse"]["texteReponse"]["texte"]

    def get_dateJO(self):
        return self.question["textesReponse"]["texteReponse"]["infoJO"]["dateJO"]

    def get_qag_number(self):
        return int(self.question['identifiant']['numero'])

    def get_legislature_number(self):
        return int(self.question['identifiant']['legislature'])

    def _parse_interventions(self):
        html_string = self.get_texte()
        soup = BeautifulSoup(html_string, 'html.parser')
        return extract_interventions(soup)

def extract_interventions(soup):
    """Split the transcript HTML into one Intervention per <strong> speaker tag."""
    interventions = []
    for tag in soup.find_all('strong'):
        speaker_name = tag.getText()
        sentences = ""
        # Walk the siblings that follow the speaker tag until the next speaker.
        node = tag.next_sibling
        while node and node.name != "strong":
            # Ignore <i> elements and remove any parenthesised text.
            if node.name != "i":
                sentences += re.sub(r'\([^)]*\)', '', node.getText())
            node = node.next_sibling
        interventions.append(Intervention(speaker_name, sentences))
    return interventions

# Example usage:

# Load QAG object from JSON file
qag = QAG.from_json_file("/content/json/QANR5L15QG1.json")


# Extract 'dateJO' attribute
print("\n\n\n")
print('-------------------')
dateJO = qag.get_dateJO()
print("Date JO:", dateJO)
print('-------------------')


# Extract 'texte' attribute
texte = qag.get_texte()
display(HTML(texte))





-------------------
Date JO: 06/07/2017
-------------------
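
To make the text cleaning in `extract_interventions` concrete, the sketch below isolates the regular expression it uses to drop parenthesised spans. The sample strings and the helper names `strip_asides` and `collapse_spaces` are invented for illustration. The second example shows a limitation: the pattern stops at the first closing parenthesis, so nested parentheses are only partly removed.

```python
import re

# Same pattern as in extract_interventions: "(" + anything but ")" + ")".
PAREN_ASIDE = re.compile(r'\([^)]*\)')

def strip_asides(text):
    """Remove parenthesised spans from text."""
    return PAREN_ASIDE.sub('', text)

def collapse_spaces(text):
    """Optional follow-up (not in the notebook): squeeze repeated whitespace."""
    return " ".join(text.split())

simple = strip_asides("Merci. (Applaudissements.) Bravo !")
nested = strip_asides("a (b (c) d) e")

print(repr(simple))                   # 'Merci.  Bravo !' (note the double space)
print(repr(collapse_spaces(simple)))  # 'Merci. Bravo !'
print(repr(nested))                   # 'a  d) e' (nested parentheses are not handled)
```

Whether the leftover double spaces matter depends on the downstream tokenizer; the notebook itself leaves them in place.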


In [31]:
import pandas as pd

def get_interventions_df(qag):
    """Return one row per intervention of a single QAG as a DataFrame."""
    data = []

    for i, intervention in enumerate(qag.interventions):
        data.append({
            'legislature_number': qag.get_legislature_number(),
            'official_date': qag.get_dateJO(),
            'qag_number': qag.get_qag_number(),
            'intervention_number': i,
            'speaker_name': intervention.speaker_name,
            'intervention_sentences': intervention.sentence_list
        })

    return pd.DataFrame(data)

# example usage
get_interventions_df(qag)

| | legislature_number | official_date | qag_number | intervention_number | speaker_name | intervention_sentences |
|---|---|---|---|---|---|---|
| 0 | 15 | 06/07/2017 | 1 | 0 | M. le président. | La parole est à M. Damien Abad, pour le groupe... |
| 1 | 15 | 06/07/2017 | 1 | 1 | M. Damien Abad. | Monsieur le président, avant de poser ma quest... |
| 2 | 15 | 06/07/2017 | 1 | 2 | M. Claude Goasguen. | Très bien ! |
| 3 | 15 | 06/07/2017 | 1 | 3 | M. Damien Abad. | Or en augmentant la CSG dès 2018 et en repouss... |
| 4 | 15 | 06/07/2017 | 1 | 4 | M. Pierre Cordier. | Bravo ! |
| 5 | 15 | 06/07/2017 | 1 | 5 | M. Damien Abad. | Je ne voudrais pas que vous soyez, dès aujourd... |
| 6 | 15 | 06/07/2017 | 1 | 6 | M. Claude Goasguen. | Très bien ! |
| 7 | 15 | 06/07/2017 | 1 | 7 | M. le président. | La parole est à M. le Premier ministre. |
| 8 | 15 | 06/07/2017 | 1 | 8 | M. Edouard Philippe, | Monsieur le président, mesdames et messieurs l... |
| 9 | 15 | 06/07/2017 | 1 | 9 | M. Sébastien Jumel. | Oui, nous ! |


# Extract `speaker_name` and `intervention_sentences` from HTML

This step also builds `global_df`, the concatenation of the DataFrames produced from each individual QAG JSON file.  

In [32]:
import glob
from tqdm.auto import tqdm

# Set the path to the directory containing the JSON files
path = '/content/json/'
path_out = '/content/'

# Use glob to find all JSON files in the directory
json_files = glob.glob(path + '*.json')

# Collect one DataFrame per file and concatenate once at the end.
# Unlike a counter-based initialisation, this still works when the
# first file fails to parse, and it avoids re-copying global_df on
# every iteration.
frames = []
for json_path in tqdm(json_files):
    try:
        qag = QAG.from_json_file(json_path)
        frames.append(get_interventions_df(qag))
    except Exception as e:
        print(f"Error processing file {json_path}: {e}")

global_df = pd.concat(frames)

# save dataset
global_df.to_csv(path_out + 'complete_qag_15.csv')

  0%|          | 0/4851 [00:00<?, ?it/s]

Error processing file /content/json/QANR5L15QG3307.json: 'NoneType' object is not subscriptable
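
The error above means that some step of a chained lookup such as `self.question["textesReponse"]["texteReponse"]["texte"]` hit `None`. The notebook does not say which key was null in `QANR5L15QG3307.json`; one plausible cause is a question with no recorded answer. The sketch below shows a defensive accessor for such cases. `get_nested` and the two sample dictionaries are hypothetical and are not taken from the real data.

```python
def get_nested(mapping, *keys, default=None):
    """Follow keys through nested dicts; return default if any level is missing or None."""
    current = mapping
    for key in keys:
        if not isinstance(current, dict) or current.get(key) is None:
            return default
        current = current[key]
    return current

# Made-up stand-ins for the "question" object of two files.
answered = {"textesReponse": {"texteReponse": {"texte": "<strong>M. X.</strong> Bonjour."}}}
unanswered = {"textesReponse": None}

print(get_nested(answered, "textesReponse", "texteReponse", "texte"))
print(get_nested(unanswered, "textesReponse", "texteReponse", "texte", default=""))
```

Returning an empty string for the unanswered case would let such a file yield zero interventions instead of raising, at the cost of silently skipping it. Whether that is preferable to the explicit error message is a design choice.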


In [33]:
# load dataset
df = pd.read_csv(path_out + 'complete_qag_15.csv', index_col=0)
df

| | legislature_number | official_date | qag_number | intervention_number | speaker_name | intervention_sentences |
|---|---|---|---|---|---|---|
| 0 | 15 | 11/04/2018 | 806 | 0 | M. le président. | La parole est à Mme Sophie Auconie, pour le gr... |
| 1 | 15 | 11/04/2018 | 806 | 1 | Mme Sophie Auconie. | Madame la ministre des solidarités et de la sa... |
| 2 | 15 | 11/04/2018 | 806 | 2 | Mme Danièle Obono. | Très bien ! |
| 3 | 15 | 11/04/2018 | 806 | 3 | Mme Sophie Auconie. | Il est temps de nous préoccuper du mal-être de... |
| 4 | 15 | 11/04/2018 | 806 | 4 | M. Éric Coquerel et Mme Laurence Dumont . | Très bien ! |
| ... | ... | ... | ... | ... | ... | ... |
| 6 | 15 | 21/06/2018 | 1026 | 6 | M. Raphaël Schellenberger. | Ce n'est pas agir : c'est subir ! |
| 7 | 15 | 21/06/2018 | 1026 | 7 | M. Fabien Di Filippo. | C'est l'immigration subie ! |
| 8 | 15 | 21/06/2018 | 1026 | 8 | Mme Élise Fajgeles. | La France peut également s'honorer de développ... |
| 9 | 15 | 21/06/2018 | 1026 | 9 | M. le président. | La parole est à Mme la ministre auprès du mini... |




# ⚠ NOTE: LIMITED VALIDATION

The extraction discards the additional descriptions (applause, shouting, titles such as "Prime Minister", and so on).  
Only light verification was performed, so use the text at your own risk.  
The `df.info()` output below shows that no column contains null values.

In [34]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 57680 entries, 0 to 10
Data columns (total 6 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   legislature_number      57680 non-null  int64 
 1   official_date           57680 non-null  object
 2   qag_number              57680 non-null  int64 
 3   intervention_number     57680 non-null  int64 
 4   speaker_name            57680 non-null  object
 5   intervention_sentences  57680 non-null  object
dtypes: int64(3), object(3)
memory usage: 3.1+ MB
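
Two details in the `df.info()` output are worth noting before analysis: the index runs "0 to 10" across 57680 rows, so index labels repeat from one QAG to the next, and `official_date` is stored as `object` (plain strings). The sketch below shows one way to obtain a unique index and real dates. The two small frames are made-up stand-ins for per-file results, and the day-first format is inferred from sample values such as `21/06/2018`.

```python
import pandas as pd

# Two tiny frames standing in for the results of two JSON files (illustrative values).
first = pd.DataFrame({"qag_number": [1, 1],
                      "intervention_number": [0, 1],
                      "official_date": ["06/07/2017", "06/07/2017"]})
second = pd.DataFrame({"qag_number": [2],
                       "intervention_number": [0],
                       "official_date": ["11/04/2018"]})

# Plain concatenation keeps each frame's own labels, so they repeat.
combined = pd.concat([first, second])
print(combined.index.tolist())  # [0, 1, 0]

# reset_index(drop=True) replaces them with a unique 0..n-1 range.
tidy = combined.reset_index(drop=True)

# Parse day/month/year strings into datetime64 values.
tidy["official_date"] = pd.to_datetime(tidy["official_date"], format="%d/%m/%Y")
print(tidy.dtypes)
```

Even without resetting the index, the pair (`qag_number`, `intervention_number`) identifies a row within a legislature, so it can serve as a key when joining with other tables.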
