# Scraping of http://www.parlament.ch

First, we need to scrape some information from the website http://parlament.ch. This notebook scrapes several datasets and stores them in the folder *data*. If you just cloned the repo and need the data, run this Python notebook to scrape all of it.

For the scraping, we use the library `requests`. The site's metadata is published as an OData service and can be browsed with XOData. We look up the URLs with XOData, fetch the XML with `requests`, and convert the XML into JSON with the library `xmltodict`.

URL of the metadata: https://ws.parlament.ch/odata.svc/$metadata
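As a rough illustration of the fetch-and-convert step, the sketch below parses a small OData Atom payload into a list of dictionaries. It swaps `xmltodict` for the standard-library `xml.etree.ElementTree` so that it runs without extra dependencies. The sample XML, the namespace URIs and the helper name `parse_odata_feed` are assumptions made for this example and are not taken from the `extract` module.

```python
import xml.etree.ElementTree as ET

# Namespace URIs commonly used by OData Atom feeds (assumed for this sketch).
NS = {
    "d": "http://schemas.microsoft.com/ado/2007/08/dataservices",
    "m": "http://schemas.microsoft.com/ado/2007/08/dataservices/metadata",
}

# Hand-written sample standing in for a real response (hypothetical payload).
SAMPLE_XML = """
<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:d="http://schemas.microsoft.com/ado/2007/08/dataservices"
      xmlns:m="http://schemas.microsoft.com/ado/2007/08/dataservices/metadata">
  <entry>
    <content type="application/xml">
      <m:properties>
        <d:ID>12</d:ID>
        <d:PartyAbbreviation>PSS</d:PartyAbbreviation>
      </m:properties>
    </content>
  </entry>
  <entry>
    <content type="application/xml">
      <m:properties>
        <d:ID>13</d:ID>
        <d:PartyAbbreviation>UDC</d:PartyAbbreviation>
      </m:properties>
    </content>
  </entry>
</feed>
"""


def parse_odata_feed(xml_text):
    """Return one dict per <m:properties> element, keyed by property name."""
    root = ET.fromstring(xml_text.strip())
    records = []
    for props in root.iterfind(".//m:properties", NS):
        # Tags look like "{namespace}Name"; keep only the local name.
        records.append({child.tag.split("}", 1)[-1]: child.text for child in props})
    return records


records = parse_odata_feed(SAMPLE_XML)
```

All values come back as strings here. In the notebook, the XML text would come from a call such as `requests.get(url).text`.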

In [78]:
# Import some useful libraries
%matplotlib inline
import pandas as pd
import urllib
import xml.etree.ElementTree as ET
import extract
import numpy as np
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## Scrape the Parties

First, we want to scrape all the parties. We will save them into a JSON file.

In [79]:
df_parties = extract.parties()

https://ws.parlament.ch/odata.svc/Party?$filter=Language%20eq%20'FR'
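The URL printed above combines an entity name with a `$filter` expression. A minimal sketch of building such a URL follows. `build_url` is a hypothetical helper, not the one used inside `extract`; only the base URL and the `Language eq 'FR'` filter come from the notebook.

```python
from urllib.parse import quote

BASE_URL = "https://ws.parlament.ch/odata.svc"


def build_url(entity, filter_expr):
    """Build an OData query URL, percent-encoding the spaces in the filter."""
    # Keep single quotes unescaped, as in the URLs printed by the notebook.
    encoded = quote(filter_expr, safe="'")
    return f"{BASE_URL}/{entity}?$filter={encoded}"


party_url = build_url("Party", "Language eq 'FR'")
```

Passing a different entity name, for example `"Person"`, gives the corresponding URL for that table.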


In [80]:
df_parties.head()

Unnamed: 0,EndDate,ID,Modified,PartyAbbreviation,PartyName,PartyNumber,StartDate
0,,12,2010-12-26T13:05:26.43,PSS,Parti socialiste suisse,12,1888-01-01T00:00:00
1,,13,2010-12-26T13:05:26.43,UDC,Union Démocratique du Centre,13,1848-01-01T00:00:00
2,,14,2010-12-26T13:05:26.43,PDC,Parti démocrate-chrétien suisse,14,1848-01-01T00:00:00
3,,15,2010-12-26T13:05:26.43,PLR,PLR.Les Libéraux-Radicaux,15,1848-01-01T00:00:00
4,,16,2010-12-26T13:05:26.43,PLD,Parti libéral démocrate,16,1848-01-01T00:00:00
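To keep the scraped parties for later use, the records can be written to a JSON file. The sketch below assumes the records are plain dictionaries; the helper names `save_records` and `load_records` are hypothetical, and only the idea of storing results in the folder *data* is taken from the notebook.

```python
import json
from pathlib import Path


def save_records(records, path):
    """Write a list of dicts to `path` as UTF-8 JSON, creating parent folders."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        # ensure_ascii=False keeps accented names such as "Démocratique" readable.
        json.dump(records, f, ensure_ascii=False, indent=2)
    return path


def load_records(path):
    """Read the list of dicts back from a JSON file."""
    with Path(path).open(encoding="utf-8") as f:
        return json.load(f)


# Two rows copied from the output above.
parties = [
    {"ID": 12, "PartyAbbreviation": "PSS", "PartyName": "Parti socialiste suisse"},
    {"ID": 13, "PartyAbbreviation": "UDC", "PartyName": "Union Démocratique du Centre"},
]
```

A call such as `save_records(parties, "data/parties.json")` would then create the file (the file name is an assumption). With a DataFrame, `df.to_json(path, orient="records")` is the pandas counterpart.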


## Scrape all the members (Past and Present) of councils

We know that there are 4211 persons registered in the database, but the API returns at most 1000 records per request.
Let's scrape!
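Because a single request is capped at 1000 rows, one way to fetch everything is to query consecutive ID windows, as the URLs printed by `extract.persons()` in this section suggest. The sketch below only generates those URLs. The helper name `person_urls` and the upper bound of 6000 are assumptions for illustration.

```python
from urllib.parse import quote

BASE_URL = "https://ws.parlament.ch/odata.svc"


def person_urls(max_id, step=1000, language="FR"):
    """Return one URL per half-open ID window [start, start + step)."""
    urls = []
    for start in range(0, max_id, step):
        filter_expr = (
            f"Language eq '{language}' "
            f"and ID ge {start} and ID lt {start + step}"
        )
        # Encode spaces as %20 but leave the single quotes as they are.
        encoded = quote(filter_expr, safe="'")
        urls.append(f"{BASE_URL}/Person?$filter={encoded}")
    return urls


urls = person_urls(6000)
```

Each window spans at most `step` distinct IDs, so a window stays under the cap as long as the table holds one row per ID and language.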

In [81]:
# URL to get all council members
url_council = "https://ws.parlament.ch/odata.svc/MemberCouncil?$filter=Language%20eq%20'FR'"

In [82]:
person = extract.persons()
person.shape

https://ws.parlament.ch/odata.svc/Person?&$filter=Language%20eq%20'FR'%20and%20ID%20ge%200%20and%20ID%20lt%201000
https://ws.parlament.ch/odata.svc/Person?&$filter=Language%20eq%20'FR'%20and%20ID%20ge%201000%20and%20ID%20lt%202000
https://ws.parlament.ch/odata.svc/Person?&$filter=Language%20eq%20'FR'%20and%20ID%20ge%202000%20and%20ID%20lt%203000
https://ws.parlament.ch/odata.svc/Person?&$filter=Language%20eq%20'FR'%20and%20ID%20ge%203000%20and%20ID%20lt%204000
https://ws.parlament.ch/odata.svc/Person?&$filter=Language%20eq%20'FR'%20and%20ID%20ge%204000%20and%20ID%20lt%205000
https://ws.parlament.ch/odata.svc/Person?&$filter=Language%20eq%20'FR'%20and%20ID%20ge%205000%20and%20ID%20lt%206000


(3525, 21)

In [83]:
url = "https://ws.parlament.ch/odata.svc/Person?$top=1000&$filter=Language%20eq%20%27FR%27%20and%20ID%20gt%2010000"
df = extractor.parsing_datas(url)

AttributeError: module 'extractor' has no attribute 'parsing_datas'

In [65]:
df.shape == (0,0)

True
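The check above tests whether a query beyond the highest ID comes back empty. That suggests a stopping rule for paging: keep requesting ID windows until one is empty. The sketch below shows such a loop with a stand-in `fake_fetch_page` function (hypothetical; in the notebook it would download and parse one URL), so it runs without network access.

```python
def fetch_all(fetch_page, step=1000, max_empty=1):
    """Collect rows window by window until `max_empty` consecutive windows are empty."""
    rows, start, empty_seen = [], 0, 0
    while empty_seen < max_empty:
        page = fetch_page(start, start + step)
        if page:
            rows.extend(page)
            empty_seen = 0
        else:
            empty_seen += 1
        start += step
    return rows


# Stand-in data source: made-up IDs with gaps between them.
FAKE_IDS = [1, 2, 6, 7, 1500, 2999, 3000]


def fake_fetch_page(lo, hi):
    """Return the fake rows whose ID falls in [lo, hi)."""
    return [{"ID": i} for i in FAKE_IDS if lo <= i < hi]


all_rows = fetch_all(fake_fetch_page)
```

Since IDs are not contiguous, a gap wider than one window would end the loop too early. Raising `max_empty` or using a known upper bound on the ID avoids that.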

In [58]:
df_persons = persons()

https://ws.parlament.ch/odata.svc/Person?$top=10000&$filter=Language%20eq%20'FR'%20and%20ID%20ge%201%20and%20ID%20lt%201001
https://ws.parlament.ch/odata.svc/Person?$top=10000&$filter=Language%20eq%20'FR'%20and%20ID%20ge%201001%20and%20ID%20lt%202001
https://ws.parlament.ch/odata.svc/Person?$top=10000&$filter=Language%20eq%20'FR'%20and%20ID%20ge%202001%20and%20ID%20lt%203001
https://ws.parlament.ch/odata.svc/Person?$top=10000&$filter=Language%20eq%20'FR'%20and%20ID%20ge%203001%20and%20ID%20lt%204001
https://ws.parlament.ch/odata.svc/Person?$top=10000&$filter=Language%20eq%20'FR'%20and%20ID%20ge%204001%20and%20ID%20lt%205001


In [59]:
df_persons.shape

(3525, 21)

In [53]:
df_council.shape

(729, 21)

In [52]:
pd.concat([df_council, df_council2]).shape

(1546, 21)
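When pages are stacked with `pd.concat`, each frame keeps its own 0-based index, so index labels repeat in the result. The sketch below uses two made-up pages (the overlap between them is invented for the example) to show `ignore_index=True` together with an optional de-duplication on `ID`, assuming `ID` identifies a row.

```python
import pandas as pd

# Two small stand-in pages; the row with ID 6 appears in both on purpose.
page_1 = pd.DataFrame({"ID": [1, 2, 6], "LastName": ["Aguet", "Allenspach", "Aregger"]})
page_2 = pd.DataFrame({"ID": [6, 7], "LastName": ["Aregger", "Aubry"]})

combined = (
    pd.concat([page_1, page_2], ignore_index=True)  # fresh 0..n-1 index
    .drop_duplicates(subset="ID")                   # keep the first row per ID
    .reset_index(drop=True)                         # close the gap left by the dropped row
)
```

Comparing `combined.shape` with the sum of the page sizes is a quick way to see whether any rows overlapped.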

In [29]:
df_council = extractor.parsing_datas(url_council)

https://ws.parlament.ch/odata.svc/MemberCouncil?$filter=Language%20eq%20'FR'


In [30]:
df_council.shape

(1000, 43)

In [136]:
df_council.head()

Unnamed: 0,Active,AdditionalActivity,AdditionalMandate,BirthPlace_Canton,BirthPlace_City,Canton,CantonAbbreviation,CantonName,Citizenship,Council,...,ParlGroupAbbreviation,ParlGroupFunction,ParlGroupFunctionText,ParlGroupName,ParlGroupNumber,Party,PartyAbbreviation,PartyName,PersonIdCode,PersonNumber
0,False,Réducteur TSV de 1971 à 1983,"Prés. Féd. Romande des Sociolistes ch., Prés. ...",,Pompaples,22,VD,Vaud,"Sullens (VD),Lutry (VD)",1,...,,,,,,12.0,PSS,Parti socialiste suisse,2200,1
1,False,,,,,1,ZH,Zurich,"Kreuzlingen (TG),Fällanden (ZH)",1,...,,,,,,15.0,PLR,PLR.Les Libéraux-Radicaux,2002,2
2,False,Zentralpräsident Schweiz. Skliverband 1985 bis...,,,Hasle,3,LU,Lucerne,Hasle (LU),1,...,,,,,,15.0,PLR,PLR.Les Libéraux-Radicaux,2004,6
3,False,,,,,2,BE,Berne,Tavannes (BE),1,...,,,,,,,,,2005,7
4,False,,,,,2,BE,Berne,"Siselen (BE),Richterswil (ZH)",1,...,,,,,,,,,2008,8


In [137]:
# Drop the column Language
df_council = df_council.drop('Language', axis=1)

In [138]:
df_council.any()

Active                    True
AdditionalActivity        True
AdditionalMandate         True
BirthPlace_Canton        False
BirthPlace_City           True
Canton                    True
CantonAbbreviation        True
CantonName                True
Citizenship               True
Council                   True
CouncilAbbreviation       True
CouncilName               True
DateElection              True
DateJoining               True
DateLeaving               True
DateOath                  True
DateOfBirth               True
DateOfDeath               True
DateResignation           True
FirstName                 True
GenderAsString            True
ID                        True
IdPredecessor             True
LastName                  True
Mandates                  True
MaritalStatus             True
MaritalStatusText         True
MilitaryRank              True
MilitaryRankText          True
Modified                  True
NumberOfChildren          True
OfficialName              True
ParlGrou

In [139]:
# Drop the column BirthPlace_Canton
df_council = df_council.drop('BirthPlace_Canton', axis=1)

In [148]:
# Drop the column BirthPlace_City
df_council = df_council.drop('BirthPlace_City', axis=1)
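`BirthPlace_Canton` appears to have been dropped because `any()` reported no value in it. As a sketch of an alternative, `dropna(axis=1, how="all")` removes every column that contains only missing values in one step. Note that `any()` also returns `False` for a column holding only falsy values such as `False` or `0`, whereas `dropna` looks at missing values only. The frame below is a made-up miniature, not the real data.

```python
import pandas as pd

df = pd.DataFrame({
    "Active": [False, False],
    "BirthPlace_Canton": [None, None],
    "CantonAbbreviation": ["VD", "ZH"],
})

# Columns in which every value is missing.
empty_columns = [col for col in df.columns if df[col].isna().all()]

# Drop them in one call; "Active" survives because False is not a missing value.
cleaned = df.dropna(axis=1, how="all")
```

Listing `empty_columns` first makes it easy to review what will be removed before committing to the drop.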

In [149]:
index_council = df_council.columns
index_council

Index(['Active', 'AdditionalActivity', 'AdditionalMandate', 'Canton',
       'CantonAbbreviation', 'CantonName', 'Citizenship', 'Council',
       'CouncilAbbreviation', 'CouncilName', 'DateElection', 'DateJoining',
       'DateLeaving', 'DateOath', 'DateOfBirth', 'DateOfDeath',
       'DateResignation', 'FirstName', 'GenderAsString', 'ID', 'IdPredecessor',
       'LastName', 'Mandates', 'MaritalStatus', 'MaritalStatusText',
       'MilitaryRank', 'MilitaryRankText', 'Modified', 'NumberOfChildren',
       'OfficialName', 'ParlGroupAbbreviation', 'ParlGroupFunction',
       'ParlGroupFunctionText', 'ParlGroupName', 'ParlGroupNumber', 'Party',
       'PartyAbbreviation', 'PartyName', 'PersonIdCode', 'PersonNumber'],
      dtype='object')

## Scrape all the persons

In [201]:
# URL to get all persons
url_person = "https://ws.parlament.ch/odata.svc/Person?$filter=Language%20eq%20'FR'"

In [218]:
df_person = parsing_datas(url_person)

https://ws.parlament.ch/odata.svc/Person?$filter=Language%20eq%20'FR'


In [219]:
df_person.shape

(1000, 21)

In [220]:
df_person['TitleText'].head()

0                 None
1                 None
2    dipl. Bauing. HTL
3                 None
4                 None
Name: TitleText, dtype: object

In [221]:
# Drop the column Language
df_person = df_person.drop('Language', axis=1)

In [222]:
df_person.any()

DateOfBirth           True
DateOfDeath           True
FirstName             True
GenderAsString        True
ID                    True
LastName              True
MaritalStatus         True
MaritalStatusText     True
MilitaryRank          True
MilitaryRankText      True
Modified              True
NativeLanguage        True
NumberOfChildren      True
OfficialName          True
PersonIdCode          True
PersonNumber          True
PlaceOfBirthCanton    True
PlaceOfBirthCity      True
Title                 True
TitleText             True
dtype: bool

In [223]:
index_person = df_person.columns
index_person

Index(['DateOfBirth', 'DateOfDeath', 'FirstName', 'GenderAsString', 'ID',
       'LastName', 'MaritalStatus', 'MaritalStatusText', 'MilitaryRank',
       'MilitaryRankText', 'Modified', 'NativeLanguage', 'NumberOfChildren',
       'OfficialName', 'PersonIdCode', 'PersonNumber', 'PlaceOfBirthCanton',
       'PlaceOfBirthCity', 'Title', 'TitleText'],
      dtype='object')

In [224]:
# Columns present in both frames (set intersection)
index_to_remove = list(set(index_person) & set(index_council))
index_to_remove

['OfficialName',
 'DateOfDeath',
 'NumberOfChildren',
 'PersonIdCode',
 'Modified',
 'MaritalStatus',
 'ID',
 'DateOfBirth',
 'MilitaryRankText',
 'FirstName',
 'PersonNumber',
 'MaritalStatusText',
 'LastName',
 'GenderAsString',
 'MilitaryRank']

In [225]:
for i in set(index_to_remove):
    print(i, i in set(index_council))

OfficialName True
DateOfDeath True
NumberOfChildren True
PersonIdCode True
Modified True
MaritalStatus True
DateOfBirth True
MilitaryRankText True
FirstName True
PersonNumber True
MaritalStatusText True
MilitaryRank True
LastName True
GenderAsString True
ID True
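`index_to_remove` holds the columns that exist in both frames. Because it is built from sets, its order changes from run to run. The sketch below computes the same kind of intersection with a stable order and separates the join keys from the columns to drop. The column lists are shortened subsets of the real ones, chosen for illustration.

```python
# Shortened column lists (subsets of the real frames, for illustration only).
person_columns = ["DateOfBirth", "FirstName", "ID", "LastName", "NativeLanguage", "TitleText"]
council_columns = ["Active", "Canton", "DateOfBirth", "FirstName", "ID", "LastName"]

council_set = set(council_columns)

# Columns present in both frames, in the order they appear in the person frame.
shared = [col for col in person_columns if col in council_set]

# Keep the join keys and mark the remaining duplicates for removal.
keys = {"ID", "LastName"}
to_drop = [col for col in shared if col not in keys]
```

With a DataFrame, the whole list can then be removed in one call, for example `df_person.drop(columns=to_drop)`.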


In [237]:
df_person_sparse = df_person
for idx in index_to_remove:
    # Keep the join keys; compare whole column names rather than substrings
    if idx not in ('ID', 'LastName'):
        print(idx)
        df_person_sparse = df_person_sparse.drop(idx, axis=1)

OfficialName
DateOfDeath
NumberOfChildren
PersonIdCode
Modified
MaritalStatus
DateOfBirth
MilitaryRankText
FirstName
PersonNumber
MaritalStatusText
GenderAsString
MilitaryRank


In [238]:
df_council.shape

(1000, 40)

In [239]:
df_person_sparse.shape

(1000, 7)

In [240]:
frames = [df_council, df_person_sparse]

In [241]:
df_council_complete = pd.concat(frames, axis=1)

In [242]:
df_council_complete

Unnamed: 0,Active,AdditionalActivity,AdditionalMandate,Canton,CantonAbbreviation,CantonName,Citizenship,Council,CouncilAbbreviation,CouncilName,...,PartyName,PersonIdCode,PersonNumber,ID,LastName,NativeLanguage,PlaceOfBirthCanton,PlaceOfBirthCity,Title,TitleText
0,false,Réducteur TSV de 1971 à 1983,"Prés. Féd. Romande des Sociolistes ch., Prés. ...",22,VD,Vaud,"Sullens (VD),Lutry (VD)",1,CN,Conseil national,...,Parti socialiste suisse,2200,1,1,Aguet,F,Vaud,Pompaples,,
1,false,,,1,ZH,Zurich,"Kreuzlingen (TG),Fällanden (ZH)",1,CN,Conseil national,...,PLR.Les Libéraux-Radicaux,2002,2,2,Allenspach,D,,,,
2,false,Zentralpräsident Schweiz. Skliverband 1985 bis...,,3,LU,Lucerne,Hasle (LU),1,CN,Conseil national,...,PLR.Les Libéraux-Radicaux,2004,6,6,Aregger,D,Lucerne,Hasle,9,dipl. Bauing. HTL
3,false,,,2,BE,Berne,Tavannes (BE),1,CN,Conseil national,...,,2005,7,7,Aubry,F,,,,
4,false,,,2,BE,Berne,"Siselen (BE),Richterswil (ZH)",1,CN,Conseil national,...,,2008,8,8,Bär,D,,,,
5,false,Stiftungsrat Greina-Stiftung,Präsident der Grünen Partei der Schweiz,2,BE,Berne,Wileroltigen (BE),1,CN,Conseil national,...,,2268,9,9,Baumann,D,Berne,Suberg,10,dipl. Ing. Agr. ETH
6,false,,,1,ZH,Zurich,"Winterthur (ZH),Balterswil (TG)",1,CN,Conseil national,...,Parti démocrate-chrétien suisse,2269,10,10,Baumberger,D,Zurich,Winterthur,6,Dr. iur.
7,false,,EKF,2,BE,Berne,"Bonau (TG),Berne (BE),Zurich (ZH)",1,CN,Conseil national,...,,2011,11,11,Bäumlin,D,Berne,Berne,12,lic. phil. I
8,false,,,2,BE,Berne,"Thal (SG),Bienne (BE)",2,CE,Conseil des Etats,...,,2335,12,12,Beerli,D,Berne,Bienne,115,lic. iur.
9,false,,,24,NE,Neuchâtel,Rochefort (NE),2,CE,Conseil des Etats,...,PLR.Les Libéraux-Radicaux,2202,13,13,Béguin,F,Neuchâtel,La Chaux-de-Fonds,3,lic. en droit


In [243]:
df_council_complete['ID']

Unnamed: 0,ID,ID.1
0,1,1
1,2,2
2,6,6
3,7,7
4,8,8
5,9,9
6,10,10
7,11,11
8,12,12
9,13,13


In [244]:
df_council_complete['LastName']

Unnamed: 0,LastName,LastName.1
0,Aguet,Aguet
1,Allenspach,Allenspach
2,Aregger,Aregger
3,Aubry,Aubry
4,Bär,Bär
5,Baumann,Baumann
6,Baumberger,Baumberger
7,Bäumlin,Bäumlin
8,Beerli,Beerli
9,Béguin,Béguin
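`pd.concat(frames, axis=1)` lines rows up by index label, which is why `ID` and `LastName` each appear twice above. It only gives the right result when both frames list the same people in the same order. As a sketch of an alternative, a key-based merge matches rows on `ID` explicitly. The two small frames are built from values shown in the outputs; whether `ID` is the right key for the real tables is an assumption.

```python
import pandas as pd

df_council_small = pd.DataFrame({
    "ID": [1, 2, 6],
    "LastName": ["Aguet", "Allenspach", "Aregger"],
    "CantonAbbreviation": ["VD", "ZH", "LU"],
})

# Deliberately in a different order to show that row position does not matter.
df_person_small = pd.DataFrame({
    "ID": [6, 1, 2],
    "LastName": ["Aregger", "Aguet", "Allenspach"],
    "NativeLanguage": ["D", "F", "D"],
})

merged = df_council_small.merge(
    df_person_small,
    on=["ID", "LastName"],    # shared keys end up as single columns
    how="left",               # keep every council row, in its original order
    validate="one_to_one",    # raise if a key is duplicated on either side
)
```

If a person can appear in several council rows, `validate="one_to_one"` will raise an error; `validate="many_to_one"` would be the setting to try in that case.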
