# Analysis of voting by individual parties

The data are downloaded from the website of the Chamber of Deputies (PS) and cover the 2017-2021 electoral term.

https://www.psp.cz/sqw/hlasovani.sqw?o=8

<hr/>

### Library imports

In [1]:
import os

import pandas as pd

## Configuration

Local paths:

In [2]:
PAGES_DIR_PATH = './pages/'
DATA_DIR_PATH = './csv_data/'

In [3]:
from utils import mkdir_safe

mkdir_safe(PAGES_DIR_PATH)
mkdir_safe(DATA_DIR_PATH)
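
The `utils` module is not shown in this notebook. A minimal sketch of what a helper such as `mkdir_safe` might do, assuming it only needs to create the directory when it is missing (the body below is an assumption, not the actual implementation):

```python
import os


def mkdir_safe(path):
    """Create the directory at `path` (including parents) if it does not exist yet.

    Assumed behaviour; the real `utils.mkdir_safe` is not shown here.
    """
    # exist_ok=True makes repeated calls harmless
    os.makedirs(path, exist_ok=True)
```

With this behaviour the configuration cell can be re-run safely, because calling the helper on an existing directory does nothing.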

Other:

In [4]:
# update (already downloaded) HTML pages?
REDOWNLOAD_HTML = False
REDOWNLOAD_CSV = False

# sleep time between 2 requests
DELAY = 0.1
# data base url definition
URL_PREFIX = 'https://www.psp.cz/sqw/hlasy.sqw?g='
URL_SUFIX = '&l=cz'

# voting IDs
ID_FIRST = 67018
ID_LAST = 77296

DELIM = ';'

In [5]:
NUM_PAGES = ID_LAST - ID_FIRST + 1

## Helper functions

File handling:

## Downloading HTML pages

Download all pages containing voting results from the past electoral term to local files.

Helper functions:

In [6]:
def generate_urls():
    """
    Generates all URLs with parliament voting results
    """
    for page_id in range(ID_FIRST, ID_LAST + 1):
        yield f'{URL_PREFIX}{page_id}{URL_SUFIX}'

Download:

In [7]:
from download import HtmlPagesDownloader

hpd = HtmlPagesDownloader(generate_urls, PAGES_DIR_PATH, redownload=REDOWNLOAD_HTML, verbose=True, n_pages=NUM_PAGES)
hpd.download_all()

[PROGRESS] 1000/10279 ~ 9.73 % complete. ID: 999
[PROGRESS] 2000/10279 ~ 19.46 % complete. ID: 1999
[PROGRESS] 3000/10279 ~ 29.19 % complete. ID: 2999
[PROGRESS] 4000/10279 ~ 38.91 % complete. ID: 3999
[PROGRESS] 5000/10279 ~ 48.64 % complete. ID: 4999
[PROGRESS] 6000/10279 ~ 58.37 % complete. ID: 5999
[PROGRESS] 7000/10279 ~ 68.1 % complete. ID: 6999
[PROGRESS] 8000/10279 ~ 77.83 % complete. ID: 7999
[PROGRESS] 9000/10279 ~ 87.56 % complete. ID: 8999
[PROGRESS] 10000/10279 ~ 97.29 % complete. ID: 9999
---------------------------------
Download complete.
	pages processed: 10279
	pages failed: 	 0 --> []

Total size of the directory at "./pages/": 367.97 MB
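
The `download` module is not included here, so the internals of `HtmlPagesDownloader` are unknown. As a rough sketch of the general technique (skip pages already on disk, pause between requests, collect failures), a standard-library version might look like the following. The names `fetch_url` and `download_pages` are hypothetical and do not come from the original module:

```python
import os
import time
import urllib.request


def fetch_url(url):
    """Return the raw bytes served at `url` (standard-library HTTP client)."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def download_pages(urls, out_dir, fetch=fetch_url, delay=0.1, redownload=False):
    """Save each URL as `<index>.html` in `out_dir`; return the indices that failed."""
    os.makedirs(out_dir, exist_ok=True)
    failed = []
    for index, url in enumerate(urls):
        path = os.path.join(out_dir, f'{index}.html')
        # keep pages downloaded earlier unless a refresh was requested
        if os.path.exists(path) and not redownload:
            continue
        try:
            data = fetch(url)
        except OSError:
            # urllib.error.URLError is a subclass of OSError
            failed.append(index)
        else:
            with open(path, 'wb') as f:
                f.write(data)
        # pause between requests so the server is not flooded
        time.sleep(delay)
    return failed
```

Under these assumptions it could be called as `download_pages(generate_urls(), PAGES_DIR_PATH, delay=DELAY, redownload=REDOWNLOAD_HTML)`. Passing `fetch` as a parameter keeps the loop testable without network access.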


## Scraping voting data

Extracting data from the individual HTML pages

In [8]:
from scraping import PartiesDataScrapper, PoliticiansDataScrapper

Column definitions:

In [9]:
COLUMN_NAMES_PARTIES = ['Club', 'Total', 'Yes', 'No', 'Not-logged-in', 'Excused', 'Refrained']
COLUMN_NAMES_INDIVIDUALS = ['TODO']
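
To illustrate how these column names and the `DELIM` separator fit together, here is a small sketch that parses one per-vote CSV with the standard `csv` module. The file layout (a header row followed by one row per club) is an assumption, the helper `read_party_rows` is hypothetical, and the sample rows are made up for two placeholder clubs; they are not real voting data:

```python
import csv
import io

# repeated from the configuration cells so the sketch is self-contained
DELIM = ';'
COLUMN_NAMES_PARTIES = ['Club', 'Total', 'Yes', 'No', 'Not-logged-in', 'Excused', 'Refrained']

# made-up rows for two placeholder clubs; not real voting data
SAMPLE_CSV = (
    'Club;Total;Yes;No;Not-logged-in;Excused;Refrained\n'
    'A;10;6;2;1;1;0\n'
    'B;5;0;4;0;0;1\n'
)


def read_party_rows(handle):
    """Parse one vote's CSV into dicts, converting the count columns to int."""
    rows = []
    for row in csv.DictReader(handle, delimiter=DELIM):
        parsed = {'Club': row['Club']}
        for name in COLUMN_NAMES_PARTIES[1:]:
            parsed[name] = int(row[name])
        rows.append(parsed)
    return rows


rows = read_party_rows(io.StringIO(SAMPLE_CSV))
# number of "Yes" votes per club in this single vote
yes_by_club = {row['Club']: row['Yes'] for row in rows}
```

For a real file, the `io.StringIO` wrapper would be replaced by an opened file handle.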

Helper functions:

In [10]:
def generate_html_files():
    """
    Generates paths of all locally stored HTML pages (0.html, 1.html, ...)
    """
    for i in range(NUM_PAGES):
        yield f'{PAGES_DIR_PATH}{i}.html'

### Data on individual parties

In [12]:
parties_scrapper = PartiesDataScrapper(generate_html_files, 
                                       COLUMN_NAMES_PARTIES, 
                                       download=True,
                                       download_dir_path=DATA_DIR_PATH + 'parties/',
                                       verbose=True,
                                       n_files=NUM_PAGES)
for _ in parties_scrapper.generate_all():
    pass

[PROGRESS] 1000/10279 ~ 9.73 % complete. ID: 999
[PROGRESS] 2000/10279 ~ 19.46 % complete. ID: 1999
[PROGRESS] 3000/10279 ~ 29.19 % complete. ID: 2999
[PROGRESS] 4000/10279 ~ 38.91 % complete. ID: 3999
[PROGRESS] 5000/10279 ~ 48.64 % complete. ID: 4999
[PROGRESS] 6000/10279 ~ 58.37 % complete. ID: 5999
[PROGRESS] 7000/10279 ~ 68.1 % complete. ID: 6999
[PROGRESS] 8000/10279 ~ 77.83 % complete. ID: 7999
[PROGRESS] 9000/10279 ~ 87.56 % complete. ID: 8999
[PROGRESS] 10000/10279 ~ 97.29 % complete. ID: 9999
---------------------------------
Download complete.
	pages processed: 10265
	pages failed: 	 14 --> [494, 1529, 9169, 9181, 9182, 9183, 9184, 9185, 9186, 9187, 9188, 9189, 9190, 10050]

Total size of the directory at "./csv_data/parties/": 3.03 MB
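
To inspect the pages that failed, it can help to turn the reported numbers back into source URLs. The sketch below assumes that the reported values are local file indices and that file `<i>.html` corresponds to voting ID `ID_FIRST + i` (consistent with `69.html` being the page with ID 67087). The helper `index_to_url` is hypothetical:

```python
# repeated from the configuration cells so the sketch is self-contained
ID_FIRST = 67018
URL_PREFIX = 'https://www.psp.cz/sqw/hlasy.sqw?g='
URL_SUFIX = '&l=cz'

# indices copied from the "pages failed" line of the log above
FAILED_INDICES = [494, 1529, 9169, 9181, 9182, 9183, 9184, 9185, 9186,
                  9187, 9188, 9189, 9190, 10050]


def index_to_url(index):
    """Rebuild the source URL of the page stored as `<index>.html` (assumed mapping)."""
    return f'{URL_PREFIX}{ID_FIRST + index}{URL_SUFIX}'


failed_urls = [index_to_url(i) for i in FAILED_INDICES]
```

Opening a few of these URLs in a browser should show whether the pages have a different layout or are missing the per-club table.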


### Data on individual deputies

# TODO: 

Note: already started in the playground section at the bottom

# Playground

Trial scraping of data on individual deputies:

In [None]:
from bs4 import BeautifulSoup

page_path = f'{PAGES_DIR_PATH}69.html'  # page with ID: 67087
# a context manager closes the file once the page has been parsed
with open(page_path, 'r') as content:
    soup = BeautifulSoup(content.read(), 'html.parser')

In [None]:
parlamentarians = []

for x in soup.select('li'):
    children = list(x.children)
    # only keep <li> elements with exactly 3 child nodes
    if len(children) != 3:
        continue
    # the last child holds the deputy's name; normalize non-breaking spaces
    name = children[2].string.replace('\xa0', ' ')
    parlamentarians.append(name)

In [None]:
parlamentarians