*Peter has some nice papers with previous research
-
https://opendata.swiss/de/dataset/abstimmungsarchiv-des-kantons-zurich
-
The Kantonal-level (cantonal) CSV contains URLs to machine-readable PDFs with voting information
-
The Gemeinde-level (municipality) CSV contains per-Gemeinde historical voting records
-
The two CSVs are joined on the unique vote ID (STAT_VORLAGE_ID)
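A minimal sketch of that join, written for Python 3 with only the standard library. STAT_VORLAGE_ID and ABSTIMMUNGSTAG come from the data described above; the GEMEINDE column and all sample values are made up for illustration, and the real files have many more columns.

```python
import csv
import io


def join_on_vote_id(kanton_rows, gemeinde_rows, key="STAT_VORLAGE_ID"):
    """Attach the Kantonal-level fields to every Gemeinde-level row with the same vote ID."""
    kanton_by_id = {row[key]: row for row in kanton_rows}
    joined = []
    for row in gemeinde_rows:
        match = kanton_by_id.get(row[key])
        if match is None:
            continue  # Gemeinde row without a Kantonal counterpart is skipped
        merged = dict(match)
        merged.update(row)  # Gemeinde values win if a column exists in both files
        joined.append(merged)
    return joined


# In-memory stand-ins for the two CSV files (hypothetical values).
kanton_csv = io.StringIO(
    "STAT_VORLAGE_ID,ABSTIMMUNGSTAG\n1,2000-01-01\n2,2000-06-01\n"
)
gemeinde_csv = io.StringIO(
    "STAT_VORLAGE_ID,GEMEINDE\n1,A\n1,B\n3,C\n"
)

joined = join_on_vote_id(
    list(csv.DictReader(kanton_csv)),
    list(csv.DictReader(gemeinde_csv)),
)
```

Dropping unmatched Gemeinde rows (an inner join) is a choice made for this sketch; keeping them with empty Kantonal fields would work just as well.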
-
PDFs are converted to TXT via pdftotext and can be joined to the CSV files on the field ABSTIMMUNGSTAG
(Scripts are mostly Python 2.7 or bash.)
-
get_pdfs.py scrapes the PDF URLs from the Kantonal CSV file and saves the PDFs locally. (In practice we got the PDFs from the organizers on a USB stick, because the scraper was getting IP-blocked.) Note that the Bundesamt.pdf files are not linked by URL in the CSV files.
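What such a scraper could look like is sketched below in Python 3; it is not the actual get_pdfs.py. The column name PDF_URL, the example.org URLs and the delay between requests are all assumptions, and nothing here shows that a delay would have avoided the IP block.

```python
import csv
import io
import os
import time
import urllib.request


def pdf_urls(rows, url_column="PDF_URL"):
    """Collect the distinct, non-empty PDF URLs in order of first appearance.

    PDF_URL is a hypothetical column name; check the header of the real CSV.
    """
    seen = []
    for row in rows:
        url = (row.get(url_column) or "").strip()
        if url and url not in seen:
            seen.append(url)
    return seen


def download_all(urls, target_dir, delay_seconds=5.0):
    """Save each URL into target_dir, pausing between requests."""
    os.makedirs(target_dir, exist_ok=True)
    for url in urls:
        filename = os.path.join(target_dir, url.rsplit("/", 1)[-1])
        if os.path.exists(filename):
            continue  # already downloaded, so a rerun does not fetch it again
        urllib.request.urlretrieve(url, filename)
        time.sleep(delay_seconds)  # assumed politeness delay


# Hypothetical sample rows; only the URL extraction is exercised here.
sample = io.StringIO(
    "STAT_VORLAGE_ID,PDF_URL\n"
    "1,https://example.org/a.pdf\n"
    "2,\n"
    "3,https://example.org/a.pdf\n"
    "4,https://example.org/b.pdf\n"
)
urls = pdf_urls(csv.DictReader(sample))
```

Several ballot measures voted on the same day may point to the same PDF, which is why the sketch removes duplicate URLs before downloading.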
-
convert_pdf_to_txt.sh loops over the PDFs and converts them to TXT with pdftotext.
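The original is a bash loop; an equivalent loop is sketched here in Python 3 to stay in one language with the other examples. It assumes pdftotext is on the PATH and writes each TXT file next to its PDF; the directory layout is hypothetical.

```python
import os
import subprocess


def txt_path_for(pdf_path):
    """Return the TXT path that sits next to the given PDF."""
    root, _ = os.path.splitext(pdf_path)
    return root + ".txt"


def pdftotext_command(pdf_path):
    """Build the command line: pdftotext <input.pdf> <output.txt>."""
    return ["pdftotext", pdf_path, txt_path_for(pdf_path)]


def convert_directory(pdf_dir):
    """Run pdftotext on every PDF in pdf_dir; raises if a conversion fails."""
    for name in sorted(os.listdir(pdf_dir)):
        if name.lower().endswith(".pdf"):
            subprocess.check_call(pdftotext_command(os.path.join(pdf_dir, name)))


command = pdftotext_command("pdfs/example.pdf")
```

Calling convert_directory("pdfs") would then convert the whole folder.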
-
read_txt.py reads the individual TXT files, cleans up the text a bit, and writes a CSV file with some keys for joining later: full_text.csv (zipped).
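The notes do not spell out the cleanup rules, so the following Python 3 sketch only shows the kind of cleanup that pdftotext output typically needs. The rules themselves, the FILENAME and TEXT column names, and the way the date is supplied are assumptions; ABSTIMMUNGSTAG is the join key named above.

```python
import csv
import io
import re


def clean_text(raw):
    """Rejoin words hyphenated across line breaks, then collapse whitespace."""
    text = raw.replace("\f", " ")  # pdftotext separates pages with form feeds
    text = re.sub(r"-\n(?=[a-zäöüß])", "", text)  # "Abstim-\nmung" -> "Abstimmung"
    text = re.sub(r"\s+", " ", text)
    return text.strip()


def write_full_text(records, handle):
    """Write (ABSTIMMUNGSTAG, filename, raw text) records as CSV rows."""
    writer = csv.DictWriter(handle, fieldnames=["ABSTIMMUNGSTAG", "FILENAME", "TEXT"])
    writer.writeheader()
    for abstimmungstag, filename, raw in records:
        writer.writerow(
            {
                "ABSTIMMUNGSTAG": abstimmungstag,
                "FILENAME": filename,
                "TEXT": clean_text(raw),
            }
        )


# Hypothetical record standing in for one converted TXT file.
buffer = io.StringIO()
write_full_text([("2000-01-01", "example.txt", "Abstim-\nmung  vom\n\fKanton")], buffer)
rows = list(csv.DictReader(io.StringIO(buffer.getvalue())))
```

Collapsing all whitespace puts each document on a single line, which keeps the resulting CSV easy to inspect.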
-
vote_mapping.py (experimental) reads the combined text from full_text.csv, and also the metadata from the Kantonal CSV file. It attempts to split each TXT file into multiple elements, one for each ballot measure, using some file-specific keywords. The code then maps each element to a ballot measure based on its rank (position) in this split array. Output file is full_text_mapped.csv.
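The split-then-map-by-rank idea can be illustrated roughly as follows (Python 3). The keyword pattern, the sample text and the vote IDs are invented, and returning an empty mapping when the counts disagree is an assumption of this sketch, not necessarily what vote_mapping.py does.

```python
import re


def split_measures(text, keyword_pattern):
    """Split text into chunks that each start at a keyword match.

    Anything before the first match (for example an introduction) is dropped.
    """
    starts = [match.start() for match in re.finditer(keyword_pattern, text)]
    bounds = starts + [len(text)]
    return [text[a:b].strip() for a, b in zip(bounds, bounds[1:])]


def map_by_rank(chunks, vote_ids):
    """Pair the i-th chunk with the i-th vote ID of the same voting day.

    Returns an empty dict when the counts disagree, since a positional
    mapping cannot be trusted in that case.
    """
    if len(chunks) != len(vote_ids):
        return {}
    return dict(zip(vote_ids, chunks))


# Hypothetical text and keyword; real files need file-specific keywords.
text = "Einleitung. Vorlage 1 Erster Text. Vorlage 2 Zweiter Text."
chunks = split_measures(text, r"Vorlage \d+")
mapping = map_by_rank(chunks, ["101", "102"])
```

Because the mapping is purely positional, it only works if the vote IDs are listed in the same order as the measures appear in the text.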
-
sentiment.py reads full_text_mapped.csv and calculates the polarity (-1 to 1) and the subjectivity (0 to 1) with textblob_de, as well as a readability score. Output file is full_text_mapped_sentiment.csv, with the three scores added as the last 3 columns.
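A hedged sketch of the three scores in Python 3. The notes do not say which readability measure is used, so this example assumes Amstad's German adaptation of the Flesch Reading Ease (180 - ASL - 58.5 * ASW) with a crude vowel-group syllable count; the real script may use something else. The polarity and subjectivity function assumes the textblob_de package is installed and is not called here.

```python
import re


def count_syllables(word):
    """Rough syllable estimate: number of vowel groups, at least one."""
    return max(1, len(re.findall(r"[aeiouyäöü]+", word.lower())))


def amstad_readability(text):
    """Flesch Reading Ease in Amstad's German form; None for empty text.

    ASL is the average sentence length in words, ASW the average number
    of syllables per word. Higher scores mean easier text.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-zÄÖÜäöüß]+", text)
    if not sentences or not words:
        return None
    asl = len(words) / len(sentences)
    asw = sum(count_syllables(w) for w in words) / len(words)
    return 180.0 - asl - 58.5 * asw


def polarity_and_subjectivity(text):
    """Return (polarity, subjectivity) from textblob_de (third-party package)."""
    from textblob_de import TextBlobDE  # imported lazily so the rest runs without it

    sentiment = TextBlobDE(text).sentiment
    return sentiment.polarity, sentiment.subjectivity


score = amstad_readability("Das ist gut. Wir stimmen ab.")
```

The sample has 2 sentences, 6 words and 7 estimated syllables, so ASL is 3 and ASW is 7/6. For long official texts the syllable heuristic is only approximate, which mainly matters when comparing scores across documents of very different style.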