This notebook was used to scrape the votation data from the BFS website, as described in the README file.

In [None]:
import requests
import pandas as pd
from bs4 import BeautifulSoup
from tqdm import tqdm_notebook

In [None]:
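def find_idx_for_key(body, key):
    """Return the index of the first (key, value) pair in body whose key matches."""
    for idx, item in enumerate(body):
        if item[0] == key:
            return idx
    # Fail loudly instead of silently returning None
    raise KeyError(key)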

# Votes

We saved the body of a POST request made from URL (see below). This request is particular: all the values are selected in the first selector, **only one** in the second, and all the values in the third. We are forced to make multiple POST requests, one per value of the second selector, because otherwise the resulting table exceeds the size limit.

Below we open the saved POST request and process it into a valid body to use later with _requests_.

In [None]:
#data = body of post request with all values of first and third picker + 1st value of 2nd picker (use postman)
with open('post_request_votes.txt', "rb") as f:
    #split on the first ':' only, so that a value containing ':' stays in one piece
    body = [tuple(x.decode().strip().split(':', 1)) for x in f.readlines()]
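
As an illustration of the parsing step, here is a minimal sketch that runs on an in-memory sample instead of the saved file. The field names and values are made up, and the sketch assumes each line has the form `key:value`. It splits on the first colon only, which is a choice of this sketch to keep values that contain a colon intact.

```python
# Hypothetical sample standing in for the saved POST body (made-up fields).
sample = b"__VIEWSTATE:abc123\nfield$one:1\nfield$two:a:b\n"

def parse_body(raw):
    """Turn 'key:value' lines into a list of (key, value) tuples."""
    pairs = []
    for line in raw.decode().splitlines():
        line = line.strip()
        if line:  # skip empty lines
            # maxsplit=1 keeps any ':' inside the value
            pairs.append(tuple(line.split(":", 1)))
    return pairs

parsed = parse_body(sample)
print(parsed)
```

A list of 2-tuples is a form that _requests_ accepts for its `data` argument, which is why the body is kept in this shape.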

Here we find the index of the only variable that needs to change from one POST request to the next. It corresponds to the value selected in the second selector.

In [None]:
#find index of the variable in the body
key = 'ctl00$ContentPlaceHolderMain$VariableSelector1$VariableSelector1$VariableSelectorValueSelectRepeater$ctl02$VariableValueSelect$VariableValueSelect$ValuesListBox'
idx_key = find_idx_for_key(body, key)
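
The idea of locating one pair and swapping its value can be sketched on a toy body. The keys below are hypothetical placeholders, far shorter than the real ASP.NET field names, and `index_of_key` is a stand-in helper written for this sketch.

```python
# Hypothetical body; the real one holds the long ASP.NET field names.
demo_body = [("first", "a"), ("second", "0"), ("third", "z")]

def index_of_key(pairs, wanted):
    """Return the position of the first pair whose key equals `wanted`."""
    for position, (name, _value) in enumerate(pairs):
        if name == wanted:
            return position
    raise KeyError(wanted)

position = index_of_key(demo_body, "second")
for option in ["0", "1", "2"]:
    # Tuples are immutable, so the whole pair is replaced in the list.
    demo_body[position] = ("second", option)
    print(demo_body)
```

Looking the index up once and reusing it avoids scanning the body again for every request.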

In order to change the value of the variable isolated just above, we need to know all the valid options it can take.

In [None]:
#find all options for second picker
URL = "https://www.pxweb.bfs.admin.ch/Selection.aspx?px_language=fr&px_db=px-x-1703030000_101&px_tableid=px-x-1703030000_101/px-x-1703030000_101.px&px_type=PX"
soup = BeautifulSoup(requests.get(URL).text, 'html.parser')
options = [o.attrs['value'] for o in soup.select('#ctl00_ContentPlaceHolderMain_VariableSelector1_VariableSelector1_VariableSelectorValueSelectRepeater_ctl02_VariableValueSelect_VariableValueSelect_ValuesListBox option')]
len(options)
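
To show what the selector does without calling the website, here is a small sketch on an inline HTML fragment. The element id and option values are invented for the example; only the pattern (an id selector followed by `option`) mirrors the cell above.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment imitating a multi-select list box.
html = """
<select id="demo_ValuesListBox" multiple>
  <option value="0">Votation A</option>
  <option value="1">Votation B</option>
  <option value="2">Votation C</option>
</select>
"""

soup_demo = BeautifulSoup(html, "html.parser")
# "#id option" matches every <option> nested under the element with that id.
demo_options = [o["value"] for o in soup_demo.select("#demo_ValuesListBox option")]
print(demo_options)
```

The `value` attribute, not the visible label, is what the form submits, so that is what gets collected.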

Three variables in the body of the POST request need to be updated (only once). They hold session values, and since we cold-stored our POST request, the session has expired. Fresh values for the three variables can be found in the HTML returned by the GET.

In [None]:
# GET, params saving
r = requests.get(URL).text
soup = BeautifulSoup(r, 'html.parser')

for elem in ['__EVENTVALIDATION', '__VIEWSTATE', '__VIEWSTATEGENERATOR']:
    body[find_idx_for_key(body, elem)] = (elem, soup.find(id=elem).attrs['value'])
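
The refresh of the three hidden fields can be tried offline with a sketch like the following. The HTML and all values are made up, and `refresh_fields` is a hypothetical helper that returns a new list instead of modifying the body in place.

```python
from bs4 import BeautifulSoup

# Hypothetical page carrying fresh values in hidden inputs.
page_html = """
<form>
  <input type="hidden" id="__VIEWSTATE" value="fresh-viewstate" />
  <input type="hidden" id="__VIEWSTATEGENERATOR" value="ABCD1234" />
  <input type="hidden" id="__EVENTVALIDATION" value="fresh-validation" />
</form>
"""

# Hypothetical stale body, as if loaded from a file saved earlier.
stale_body = [
    ("__VIEWSTATE", "old"),
    ("some$other$field", "x"),
    ("__VIEWSTATEGENERATOR", "old"),
    ("__EVENTVALIDATION", "old"),
]

def refresh_fields(body, html, names):
    """Return a copy of body with the named fields replaced by values from html."""
    soup = BeautifulSoup(html, "html.parser")
    fresh = {name: soup.find(id=name)["value"] for name in names}
    # Keep every other pair untouched and preserve the original order.
    return [(key, fresh.get(key, value)) for key, value in body]

names = ["__EVENTVALIDATION", "__VIEWSTATE", "__VIEWSTATEGENERATOR"]
fresh_body = refresh_fields(stale_body, page_html, names)
print(fresh_body)
```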

Now we can simply iterate over the values of the second selector and perform the POST request with the updated variable in the body at each iteration. The part of the response that interests us is the data, which is contained in an HTML table, so we can use *pd.read_html* to extract it. All the resulting dataframes are concatenated together before we set the index to the _commune_ and the _votation_.

In [None]:
#Note URL_post != URL
URL_post = 'https://www.pxweb.bfs.admin.ch/Selection.aspx?px_language=fr&px_db=px-x-1703030000_101&px_tableid=px-x-1703030000_101%2fpx-x-1703030000_101.px&px_type=PX'

df = pd.DataFrame()

for o in tqdm_notebook(options):
    
    #Change the value of the variable in the body of the post request
    body[idx_key] = (key, o)
    
    #Make the post request
    r = requests.post(URL_post, data=body)
    
    #Build the temporary dataframe corresponding to one votation
    df_tmp = pd.read_html(r.text, thousands=' ', decimal=',', na_values=['...'])[0]
    df_tmp.columns = df_tmp.columns.droplevel()
    df_tmp.columns = ['Commune', 'Votation'] + df_tmp.columns[1:-1].tolist()
    
    #Concat the temporary dataframe with the dataframe containing all the votations
    df = pd.concat([df, df_tmp])
    
#Set the index first, then sort on it (sorting before would only reorder the repeated row numbers)
df = df.set_index(['Commune', 'Votation'])
df = df.sort_index()

df
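
The concatenation and indexing step can be sketched on toy data. The commune names, votation labels, the column `Oui en %` and all numbers below are invented for illustration. The sketch also collects the per-votation frames in a list and concatenates once at the end, an alternative to growing a dataframe inside the loop.

```python
import pandas as pd

# Hypothetical per-votation rows standing in for the parsed HTML tables.
toy_results = {
    "Votation B": [("Lausanne", 55.0), ("Bern", 48.5)],
    "Votation A": [("Lausanne", 61.2), ("Bern", 52.3)],
}

frames = []
for votation, rows in toy_results.items():
    frame = pd.DataFrame(rows, columns=["Commune", "Oui en %"])
    frame.insert(1, "Votation", votation)  # same label on every row
    frames.append(frame)

# One concat at the end, then a sorted (Commune, Votation) MultiIndex.
combined = (
    pd.concat(frames, ignore_index=True)
    .set_index(["Commune", "Votation"])
    .sort_index()
)
print(combined)

# A sorted MultiIndex allows direct lookups by (commune, votation).
value = combined.loc[("Bern", "Votation A"), "Oui en %"]
```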

In [None]:
df.to_pickle("data/votations.pkl")