# Scraping of http://www.parlament.ch

First, we need to scrape some information from the website http://parlament.ch. This notebook scrapes several tables and stores them in the folder *data*. If you just cloned the repo and need the data, run this Python notebook to scrape all of it.

For the scraping, we use the library `requests`. The metadata of the website is published and can be browsed with XOData. So we get the URLs using XOData, fetch the XML using `requests`, and transform the XML into JSON-like Python dictionaries using the library `xmltodict`.

URL of the metadata: https://ws.parlament.ch/odata.svc/$metadata
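
The XML-to-records step described above can be sketched with the standard library alone. This is a minimal sketch, not the code in `scraper.py`: it swaps `xml.etree.ElementTree` in for `xmltodict`, it assumes the service answers with an OData Atom feed whose fields sit under `m:properties`, and the helper name `entries_to_records` and the sample XML are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Namespaces used by OData Atom feeds (assumed response format of the service).
NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "m": "http://schemas.microsoft.com/ado/2007/08/dataservices/metadata",
    "d": "http://schemas.microsoft.com/ado/2007/08/dataservices",
}

def entries_to_records(xml_text):
    """Turn an OData Atom feed into a list of {property: text} dicts."""
    root = ET.fromstring(xml_text)
    records = []
    for props in root.iterfind(".//m:properties", NS):
        # Strip the '{namespace}' prefix that ElementTree puts on each tag.
        records.append({child.tag.split("}", 1)[1]: child.text for child in props})
    return records

# Hand-written sample shaped like one Party entry (not real service output).
SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:d="http://schemas.microsoft.com/ado/2007/08/dataservices"
      xmlns:m="http://schemas.microsoft.com/ado/2007/08/dataservices/metadata">
  <entry><content type="application/xml"><m:properties>
    <d:ID>12</d:ID><d:Language>FR</d:Language>
    <d:PartyAbbreviation>PSS</d:PartyAbbreviation>
  </m:properties></content></entry>
</feed>"""

records = entries_to_records(SAMPLE)
print(records)
```

In a live run, `xml_text` would come from something like `requests.get(url).text`, and the resulting list of dicts can be passed straight to `pd.DataFrame(records)`.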

In [2]:
# Import some useful libraries
%matplotlib inline
import pandas as pd
import urllib
import xml.etree.ElementTree as ET
from scraper import *
import numpy as np
%load_ext autoreload
%autoreload 2

# display all pandas columns
pd.set_option('display.max_columns', 100)

# Examples of how to use Scraper class

## Scrape


Tables: Party, Person, MemberCouncil

In [5]:
scrap = Scraper()
df_party = scrap.get('Party') # get the table and write it to a CSV file
df_person = scrap.get('Person')
df_member_council = scrap.get('MemberCouncil')


GET: https://ws.parlament.ch/odata.svc/Party?$filter=Language%20eq%20'FR'
[OK] table Party correctly scraped, df.shape =  79 as expected
GET: https://ws.parlament.ch/odata.svc/Person?$top=1000&$filter=Language%20eq%20'FR'&$skip=0
GET: https://ws.parlament.ch/odata.svc/Person?$top=1000&$filter=Language%20eq%20'FR'&$skip=1000
GET: https://ws.parlament.ch/odata.svc/Person?$top=1000&$filter=Language%20eq%20'FR'&$skip=2000
GET: https://ws.parlament.ch/odata.svc/Person?$top=1000&$filter=Language%20eq%20'FR'&$skip=3000
GET: https://ws.parlament.ch/odata.svc/Person?$top=1000&$filter=Language%20eq%20'FR'&$skip=4000
[OK] table Person correctly scraped, df.shape =  3525 as expected
GET: https://ws.parlament.ch/odata.svc/MemberCouncil?$top=1000&$filter=Language%20eq%20'FR'&$skip=0
GET: https://ws.parlament.ch/odata.svc/MemberCouncil?$top=1000&$filter=Language%20eq%20'FR'&$skip=1000
GET: https://ws.parlament.ch/odata.svc/MemberCouncil?$top=1000&$filter=Language%20eq%20'FR'&$skip=2000
GET: https://w
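
The `$top`/`$skip` pattern visible in the GET lines above can be sketched as follows. This is a hedged illustration, not the actual `Scraper` code: the names `page_url`, `fetch_all` and `fake_fetch` are hypothetical, the network is replaced by a fake in-memory table, and the stop-on-empty-page rule is only inferred from the log (five requests for 3525 `Person` rows).

```python
from urllib.parse import quote

BASE_URL = "https://ws.parlament.ch/odata.svc"

def page_url(table, skip, top=1000, language="FR"):
    """Build one paged URL in the same shape as the GET lines in the log."""
    flt = quote(f"Language eq '{language}'", safe="'")
    return f"{BASE_URL}/{table}?$top={top}&$filter={flt}&$skip={skip}"

def fetch_all(fetch_page, top=1000):
    """Call fetch_page(skip) until it returns an empty page; collect all rows."""
    rows, skip = [], 0
    while True:
        page = fetch_page(skip)
        if not page:
            break
        rows.extend(page)
        skip += top
    return rows

# Stand-in for the network: 3525 fake rows served 1000 at a time.
FAKE_TABLE = list(range(3525))
requested = []

def fake_fetch(skip):
    requested.append(page_url("Person", skip))
    return FAKE_TABLE[skip:skip + 1000]

rows = fetch_all(fake_fetch)
print(len(rows), len(requested))
```

Stopping on the first empty page costs one extra request per table but avoids having to know the row count in advance.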

In [6]:
df_party[df_party.index == 0]

| Unnamed: 0 | EndDate | ID | Language | Modified | PartyAbbreviation | PartyName | PartyNumber | StartDate |
|---|---|---|---|---|---|---|---|---|
| 0 | | 12 | FR | 2010-12-26T13:05:26.43 | PSS | Parti socialiste suisse | 12 | 1888-01-01T00:00:00 |


In [7]:
df_party.shape

(79, 8)

Count how many entries exist in a table.
(This is used later to check that we got all the rows.)

In [8]:
n_parties = scrap.count('Party')
n_parties

79
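
The count can then be compared with the number of scraped rows, as the `[OK] ... as expected` lines in the log suggest. The sketch below is an assumption about how such a check could look: OData services expose a `/$count` resource, but whether `Scraper.count` uses it is not confirmed by this notebook, and `count_url` and `check_scraped` are hypothetical names.

```python
BASE_URL = "https://ws.parlament.ch/odata.svc"

def count_url(table, language="FR"):
    """URL of the OData `$count` resource for one table, filtered by language."""
    return f"{BASE_URL}/{table}/$count?$filter=Language%20eq%20'{language}'"

def check_scraped(table, n_rows, n_expected):
    """Return a status line comparing scraped rows with the expected count."""
    status = "OK" if n_rows == n_expected else "ERROR"
    return f"[{status}] table {table}: scraped {n_rows} rows, expected {n_expected}"

print(count_url("Party"))
print(check_scraped("Party", 79, 79))
```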

In [9]:
# Try to scrape the transcripts... this will be hard -_-
# It stops after 24 iterations, i.e. 25'000 transcripts requested or approx. 50 MB (I don't know which one is the limit)
scrap = Scraper(time_out=300)
df_transcript = scrap.get('Transcript')

GET: https://ws.parlament.ch/odata.svc/Transcript?$top=1000&$filter=Language%20eq%20'FR'&$skip=0
GET: https://ws.parlament.ch/odata.svc/Transcript?$top=1000&$filter=Language%20eq%20'FR'&$skip=1000
GET: https://ws.parlament.ch/odata.svc/Transcript?$top=1000&$filter=Language%20eq%20'FR'&$skip=2000
GET: https://ws.parlament.ch/odata.svc/Transcript?$top=1000&$filter=Language%20eq%20'FR'&$skip=3000
GET: https://ws.parlament.ch/odata.svc/Transcript?$top=1000&$filter=Language%20eq%20'FR'&$skip=4000
GET: https://ws.parlament.ch/odata.svc/Transcript?$top=1000&$filter=Language%20eq%20'FR'&$skip=5000
GET: https://ws.parlament.ch/odata.svc/Transcript?$top=1000&$filter=Language%20eq%20'FR'&$skip=6000
GET: https://ws.parlament.ch/odata.svc/Transcript?$top=1000&$filter=Language%20eq%20'FR'&$skip=7000
GET: https://ws.parlament.ch/odata.svc/Transcript?$top=1000&$filter=Language%20eq%20'FR'&$skip=8000
GET: https://ws.parlament.ch/odata.svc/Transcript?$top=1000&$filter=Language%20eq%20'FR'&$skip=9000
GET

ParseError: mismatched tag: line 1, column 177
b'<html><head><title>Request Rejected</title></head><body>The requested URL was rejected. Please consult with your administrator.<br><br>Your support ID is: 10862505001035633232</body></html>'
================================================================================================================================================================================^ (<string>)
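
The `ParseError` above comes from feeding an HTML rejection page to the XML parser. One way to handle it, sketched under assumptions, is to detect the rejection page before parsing and keep the pages fetched so far. `RequestRejected`, `parse_feed` and `collect_until_rejected` are hypothetical names, and the rejection body below is an abbreviated version of the one shown in the traceback.

```python
import xml.etree.ElementTree as ET

class RequestRejected(Exception):
    """Raised when the server answers with an HTML rejection page instead of XML."""

def parse_feed(body):
    """Parse an XML response body, turning rejection pages into a clear error."""
    if b"Request Rejected" in body:
        raise RequestRejected(body.decode("utf-8", errors="replace"))
    return ET.fromstring(body)

def collect_until_rejected(bodies):
    """Parse bodies in order; stop at the first rejection and keep what was parsed."""
    parsed = []
    for body in bodies:
        try:
            parsed.append(parse_feed(body))
        except RequestRejected:
            break
    return parsed

# Abbreviated rejection page, followed by a page that is never reached.
REJECTION = (b"<html><head><title>Request Rejected</title></head><body>"
             b"The requested URL was rejected.</body></html>")
pages = collect_until_rejected([b"<feed><entry/></feed>", REJECTION, b"<feed/>"])
print(len(pages))
```

Keeping the partial result means a rejected run still leaves usable data, and the last `$skip` value reached can serve as the starting point for a later retry.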

In [None]:
# count() returns a number, so store it under its own name instead of overwriting df_transcript
n_transcripts = scrap.count('Transcript')
n_transcripts

## Try to merge two tables

In [None]:
df_person.head()

In [None]:
df_member_council.head()
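
A merge of the two tables could look like the sketch below. It uses toy frames rather than the scraped data: the join key `PersonNumber` and the other column names and values are assumptions made for illustration and should be checked against the real `df_person` and `df_member_council` columns.

```python
import pandas as pd

# Toy stand-ins for df_person and df_member_council; column names are assumed.
df_person_toy = pd.DataFrame({
    "PersonNumber": [1, 2, 3],
    "LastName": ["A", "B", "C"],
})
df_member_council_toy = pd.DataFrame({
    "PersonNumber": [1, 2],
    "CouncilName": ["Conseil national", "Conseil des Etats"],
})

# Left join keeps every person; `indicator` adds a `_merge` column showing matches.
df_merged = df_person_toy.merge(
    df_member_council_toy, on="PersonNumber", how="left", indicator=True
)
print(df_merged)
```

The `_merge` column makes it easy to spot persons that have no matching council entry (`left_only`).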