GitHub - philshem/zuerich_speaks: TWIST2018 Project

Text mining 100+ years of Kanton Zürich's referenda and initiatives

*Peter has some nice papers with previous research

https://opendata.swiss/de/dataset/abstimmungsarchiv-des-kantons-zurich
The canton-level (Kantonal) CSV contains URLs to machine-readable PDFs with voting information
The municipality-level (Gemeinde) CSV contains historical voting records per municipality
The two CSVs are joined by the unique vote ID (STAT_VORLAGE_ID)
The PDFs are converted to TXT with pdftotext and can be joined to the CSV files by the field ABSTIMMUNGSTAG
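
The join on the vote ID can be sketched as follows. This is a minimal illustration in Python 3 using only the standard library, not the project's own code. Only STAT_VORLAGE_ID and ABSTIMMUNGSTAG are taken from the description above; the other column names (TITLE, GEMEINDE, YES_PERCENT) and all sample values are made up for the example.

```python
import csv
import io

# Tiny in-memory stand-ins for the two CSV files. Only STAT_VORLAGE_ID and
# ABSTIMMUNGSTAG come from the README; every other column name and all
# values are hypothetical.
KANTONAL_CSV = """STAT_VORLAGE_ID,ABSTIMMUNGSTAG,TITLE
1,1990-04-01,Example measure A
2,1990-04-01,Example measure B
"""

GEMEINDE_CSV = """STAT_VORLAGE_ID,GEMEINDE,YES_PERCENT
1,Zurich,55.0
1,Winterthur,48.5
2,Zurich,61.2
"""


def read_rows(text):
    """Parse CSV text into a list of dicts, one per row."""
    return list(csv.DictReader(io.StringIO(text)))


def join_on_vote_id(kantonal_rows, gemeinde_rows):
    """Attach the canton-level metadata to every municipality-level row."""
    by_id = {row["STAT_VORLAGE_ID"]: row for row in kantonal_rows}
    joined = []
    for row in gemeinde_rows:
        meta = by_id.get(row["STAT_VORLAGE_ID"])
        if meta is None:
            continue  # no canton-level record for this vote ID
        merged = dict(meta)
        merged.update(row)
        joined.append(merged)
    return joined


joined = join_on_vote_id(read_rows(KANTONAL_CSV), read_rows(GEMEINDE_CSV))
for row in joined:
    print(row["ABSTIMMUNGSTAG"], row["GEMEINDE"], row["TITLE"], row["YES_PERCENT"])
```

Because several measures can share one ABSTIMMUNGSTAG, a join on the date alone pairs a text file with every measure voted on that day, so the vote ID remains the key that identifies a single measure.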

(mostly Python 2.7 or bash)

get_pdfs.py scrapes the URLs from the Kantonal CSV file and saves the linked PDFs locally. (Actually we got the PDFs from the organizers on a USB stick, because the scraper was getting IP blocked.) Note that the Bundesamt.pdf files are not linked by URL in the CSV files.
convert_pdf_to_txt.sh loops over the PDFs and converts them to TXT with pdftotext.
read_txt.py reads the individual TXT files, cleans up the text a bit, and writes a CSV file with some keys for joining later: full_text.csv (zipped).
vote_mapping.py (experimental) reads the combined text from full_text.csv and the metadata from the Kantonal CSV file. It attempts to split each TXT file into multiple elements, one for each ballot measure, using some file-specific keywords. The code then maps each element to a ballot measure based on its rank in this split array. The output file is full_text_mapped.csv.
sentiment.py reads full_text_mapped.csv and calculates the polarity (from -1 to 1) and the subjectivity (from 0 to 1) with textblob_de, as well as the readability. The output file is full_text_mapped_sentiment.csv, with the three scores added as the last 3 columns.
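
The split-and-rank idea behind vote_mapping.py can be sketched roughly as below. This is an illustration in Python 3 under stated assumptions, not the script itself: the README does not list the file-specific keywords, so the "Vorlage N" heading pattern, the sample text, the vote IDs, and the function names are all hypothetical. Returning an empty mapping when the counts disagree is also an assumption made for the example.

```python
import re

# Hypothetical heading pattern: a line consisting of "Vorlage <number>".
# The real script uses file-specific keywords that are not given in the README.
MEASURE_HEADING = re.compile(r"^Vorlage \d+$", re.MULTILINE)


def split_measures(full_text, heading=MEASURE_HEADING):
    """Split one document into chunks, one per ballot measure."""
    parts = heading.split(full_text)
    # parts[0] is whatever precedes the first heading (title page etc.).
    return [part.strip() for part in parts[1:]]


def map_by_rank(chunks, vote_ids):
    """Pair the i-th chunk with the i-th vote ID of the same voting day."""
    if len(chunks) != len(vote_ids):
        # The split did not find one chunk per measure, so a rank-based
        # mapping would be unreliable; give up on this document.
        return {}
    return dict(zip(vote_ids, chunks))


sample_text = (
    "Abstimmungszeitung\n"
    "Vorlage 1\n"
    "Text about the first measure.\n"
    "Vorlage 2\n"
    "Text about the second measure.\n"
)
chunks = split_measures(sample_text)
mapping = map_by_rank(chunks, ["101", "102"])  # made-up vote IDs
for vote_id, text in mapping.items():
    print(vote_id, text)
```

A rank-based mapping like this only works when the measures appear in the text in the same order as in the metadata, which is one reason to treat the step as experimental.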

Files in the repository:
txt
LICENSE
README.md
convert_pdf_to_txt.sh
full_text.csv.zip
full_text_mapped.csv
full_text_mapped.csv.zip
full_text_mapped_sentiment.csv
get_pdfs.py
read_txt.py
sentiment.csv
sentiment.py
text_complexity.md
vote_mapping.py