Scripts for processing the pdfs of Parlament de Catalunya

gullabi/parlament-scrape

parlament-scrape

Scripts to scrape and post-process the pdfs of Parlament de Catalunya

Overview

This repository consists of a set of tools for processing the plenary sessions of the Parlament de Catalunya.

The scripts can:

  • Facilitate metadata retrieval from the website of the Parlament de Catalunya
  • Convert the PDFs of the parliamentary sessions into XML and structure the result into dictionaries
  • Match the session metadata from the audiovisual recordings with the structured data extracted from the PDF text

The output is a set of JSON files, each containing the matched session, speaker, text and media URL information.
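
As a rough illustration of the matching step and the kind of JSON record described above, here is a minimal Python sketch. It is not taken from the repository's scripts: the function name `match_interventions`, the field names (`session`, `speaker`, `text`, `media_url`), the matching rule (first unused media segment by the same speaker) and the example URLs are all assumptions made for this example.

```python
import json


def match_interventions(session_id, interventions, media_segments):
    """Pair each (speaker, text) intervention with a media segment.

    Hypothetical rule: each intervention takes the first not-yet-used
    media segment whose speaker name is identical. Unmatched
    interventions get a media_url of None.
    """
    used = set()
    records = []
    for speaker, text in interventions:
        url = None
        for i, segment in enumerate(media_segments):
            if i not in used and segment["speaker"] == speaker:
                used.add(i)
                url = segment["url"]
                break
        records.append(
            {
                "session": session_id,
                "speaker": speaker,
                "text": text,
                "media_url": url,
            }
        )
    return records


if __name__ == "__main__":
    # Invented sample data; the real structures may differ.
    interventions = [
        ("Speaker A", "First intervention."),
        ("Speaker B", "A reply."),
        ("Speaker A", "Second intervention."),
    ]
    media_segments = [
        {"speaker": "Speaker A", "url": "https://example.org/media/1"},
        {"speaker": "Speaker B", "url": "https://example.org/media/2"},
    ]
    records = match_interventions("session-001", interventions, media_segments)
    # ensure_ascii=False keeps accented Catalan characters readable.
    print(json.dumps(records, ensure_ascii=False, indent=2))
```

In this sketch an intervention with no remaining media segment keeps `media_url` set to `None` instead of being dropped, so gaps in the audiovisual metadata stay visible in the output.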

Use

Detailed documentation for using the scripts has not yet been prepared.


These scripts were developed to create the ParlamentParla corpus, which was made possible by the support of the Department of Culture of the Catalan autonomous government.
