# Scraping of http://www.parlament.ch

First, we need to scrape some information from the website http://parlament.ch. This notebook scrapes several datasets and stores them in the *data* folder. If you have just cloned the repo and need the data, run this Python notebook to scrape all of it.

For the scraping, we use the `requests` library. The website's metadata is available and can be browsed with XOData. So we build the URLs with XOData, fetch the XML with `requests`, and transform the XML into a JSON-like Python dictionary with the `xmltodict` library.
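To make the XML-to-dictionary step concrete, here is a minimal sketch that parses a tiny hand-written Atom feed. It deliberately uses the standard-library `xml.etree.ElementTree` instead of `xmltodict` so that it runs offline. The sample XML, the namespace URIs, and the `parse_entries` helper are assumptions for illustration, not the service's actual response.

```python
import xml.etree.ElementTree as ET

# Namespace URIs assumed for an OData Atom feed (check them against a real response)
NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "m": "http://schemas.microsoft.com/ado/2007/08/dataservices/metadata",
    "d": "http://schemas.microsoft.com/ado/2007/08/dataservices",
}

# Hand-written sample that mimics one entry of the Party feed
SAMPLE_XML = """<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:m="http://schemas.microsoft.com/ado/2007/08/dataservices/metadata"
      xmlns:d="http://schemas.microsoft.com/ado/2007/08/dataservices">
  <entry>
    <content type="application/xml">
      <m:properties>
        <d:ID m:type="Edm.Int32">12</d:ID>
        <d:PartyName>Parti socialiste suisse</d:PartyName>
        <d:PartyAbbreviation>PSS</d:PartyAbbreviation>
      </m:properties>
    </content>
  </entry>
</feed>"""


def parse_entries(xml_text):
    """Hypothetical helper: return one plain dict per feed entry."""
    root = ET.fromstring(xml_text)
    rows = []
    for props in root.findall("atom:entry/atom:content/m:properties", NS):
        row = {}
        for child in props:
            # child.tag looks like "{namespace}Name"; keep only the local name
            row[child.tag.split("}", 1)[1]] = child.text
        rows.append(row)
    return rows


rows = parse_entries(SAMPLE_XML)
print(rows)
```

The nesting `feed` > `entry` > `content` > `m:properties` is the same path the notebook follows in the dictionary produced by `xmltodict`.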

In [1]:
# Import some useful libraries
import requests
from bs4 import BeautifulSoup as BSoup
import pandas as pd
import xmltodict
import json

## Scrape the parties

First, we want to scrape all the parties. We will save them into a JSON file.

In [2]:
# URL to get all the parties
url_party = "https://ws.parlament.ch/odata.svc/Party?$top=10000&$filter=Language eq 'FR'&$select=PartyAbbreviation,PartyName,ID"
# Use requests to get the XML
xml_party = requests.get(url_party)
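The URL above combines three OData query options: `$top`, `$filter`, and `$select`. As a rough sketch, the same kind of URL could be assembled from its parts with the standard library. The `build_odata_url` helper and its parameters are hypothetical names introduced here, and the service root is simply copied from the URL used in this notebook.

```python
from urllib.parse import quote, urlencode

# Service root copied from the URLs used in this notebook
BASE_URL = "https://ws.parlament.ch/odata.svc"


def build_odata_url(entity, language="FR", select=None, top=None):
    """Hypothetical helper: assemble an OData query URL for one entity set."""
    params = {"$filter": f"Language eq '{language}'"}
    if top is not None:
        params["$top"] = str(top)
    if select:
        params["$select"] = ",".join(select)
    # Keep "$", "'" and "," literal; spaces are encoded as %20 rather than "+"
    query = urlencode(params, safe="$',", quote_via=quote)
    return f"{BASE_URL}/{entity}?{query}"


url = build_odata_url("Party", select=["PartyAbbreviation", "PartyName", "ID"], top=10000)
print(url)
```

Building the query from a dictionary makes it easier to reuse the same filter for other entity sets such as `MemberParty` or `Person`.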

In [3]:
# Transform the XML to JSON
json_party = xmltodict.parse(xml_party.text)

Now, we need to process this JSON to get a clean, flat JSON.
We keep the following columns:
- ID
- PartyName
- PartyAbbreviation
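In the parsed feed, plain strings appear directly, while typed values such as `d:ID` are wrapped in a small dict whose content sits under `#text`. A minimal sketch of how one `m:properties` dict could be flattened is shown below; `flatten_properties` is a hypothetical helper, and the sample mirrors the entry printed by this notebook.

```python
def flatten_properties(properties):
    """Hypothetical helper: turn one 'm:properties' dict into plain values."""
    flat = {}
    for key, value in properties.items():
        name = key.split(":", 1)[-1]  # drop the "d:" prefix
        if isinstance(value, dict):
            # Typed values keep their content under "#text"; nulls have no "#text"
            value = value.get("#text")
        flat[name] = value
    return flat


sample = {
    "d:ID": {"@m:type": "Edm.Int32", "#text": "12"},
    "d:PartyName": "Parti socialiste suisse",
    "d:PartyAbbreviation": "PSS",
}
flat = flatten_properties(sample)
print(flat)
```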

In [4]:
json_party_clean = {'ID':[], 'PartyName':[], 'PartyAbbreviation':[]}

entries = json_party['feed']['entry']

# Print an entry to see where the party fields are located
print(json.dumps(json_party["feed"]["entry"][0], indent=2))

for i in range(len(entries)):
    # Get the content of the i-th entry
    properties = entries[i]['content']['m:properties']
    # Get the ID
    json_party_clean['ID'].append(properties['d:ID']['#text'])
    # Get the PartyName
    json_party_clean['PartyName'].append(properties['d:PartyName'])
    # Get the PartyAbbreviation
    json_party_clean['PartyAbbreviation'].append(properties['d:PartyAbbreviation'])

{
  "id": "https://ws.parlament.ch/OData.svc/Party(ID=12,Language='FR')",
  "category": {
    "@term": "itsystems.Pd.DataServices.DataModel.Party",
    "@scheme": "http://schemas.microsoft.com/ado/2007/08/dataservices/scheme"
  },
  "link": {
    "@rel": "edit",
    "@title": "Party",
    "@href": "Party(ID=12,Language='FR')"
  },
  "title": null,
  "updated": "2016-11-06T18:44:42Z",
  "author": {
    "name": null
  },
  "content": {
    "@type": "application/xml",
    "m:properties": {
      "d:ID": {
        "@m:type": "Edm.Int32",
        "#text": "12"
      },
      "d:PartyName": "Parti socialiste suisse",
      "d:PartyAbbreviation": "PSS"
    }
  }
}


In [5]:
# Save the JSON to data
with open('data/party.json', 'w') as ff:
    ff.write(json.dumps(json_party_clean, indent=2))
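The save step assumes that the *data* folder already exists, and `json.dumps` escapes accented characters by default (for example `\u00e9`). A possible variant, sketched below, creates the folder when needed and writes readable UTF-8. `save_json` is a hypothetical helper, and the demo writes to a temporary directory so that it does not touch the real *data* folder.

```python
import json
import os
import tempfile


def save_json(data, path):
    """Hypothetical helper: write data as UTF-8 JSON, creating the folder if needed."""
    folder = os.path.dirname(path)
    if folder:
        os.makedirs(folder, exist_ok=True)
    with open(path, "w", encoding="utf-8") as ff:
        # ensure_ascii=False keeps accented characters readable in the file
        json.dump(data, ff, indent=2, ensure_ascii=False)


# Demo in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "data", "party.json")
    save_json({"PartyName": ["Parti démocrate-chrétien suisse"]}, demo_path)
    with open(demo_path, encoding="utf-8") as ff:
        saved_text = ff.read()

print(saved_text)
```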

## Scrape the people

We scrape all the people who appear on the website and save them into a JSON file.

In [6]:
# URL to get all the party members
url_members = "https://ws.parlament.ch/odata.svc/MemberParty?$filter=Language eq 'FR'"
# Use requests to get the XML
xml_members = requests.get(url_members)
# Transform the XML to JSON
json_members = xmltodict.parse(xml_members.text)
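One caveat worth guarding against: `xmltodict` returns a single dict rather than a list when an element occurs only once, so a feed with exactly one `entry` would break the index-based loops in this notebook. The sketch below normalizes the entries to a list; `get_entries` is a hypothetical helper, not part of `xmltodict`.

```python
def get_entries(feed_json):
    """Hypothetical helper: always return the feed entries as a list."""
    feed = feed_json.get("feed") or {}
    entries = feed.get("entry") or []
    if isinstance(entries, dict):
        # A single <entry> is parsed as one dict, not as a one-element list
        entries = [entries]
    return entries


one_entry = {"feed": {"entry": {"content": {"m:properties": {"d:LastName": "Aebischer"}}}}}
two_entries = {"feed": {"entry": [{"content": {}}, {"content": {}}]}}
print(len(get_entries(one_entry)), len(get_entries(two_entries)), len(get_entries({"feed": None})))
```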

Now, we need to process this JSON to get a clean, flat JSON.
We keep the following columns:
- ID
- PersonIdCode (To get more info on this person)
- PartyID (i.e. PartyNumber)
- PartyAbbreviation
- FirstName
- LastName
- Gender (i.e. GenderAsString)
- PartyFunction
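Since some output columns are renamed (`PartyID` comes from `PartyNumber`, `Gender` from `GenderAsString`), the extraction can also be driven by a mapping from output column to source field. This is only a sketch under that assumption: `MEMBER_FIELDS` and `extract_columns` are hypothetical names, and the sample entry reuses values printed by this notebook.

```python
# Output column -> source field, following the list above
MEMBER_FIELDS = {
    "ID": "d:ID",
    "PersonIdCode": "d:PersonIdCode",
    "PartyID": "d:PartyNumber",
    "PartyAbbreviation": "d:PartyAbbreviation",
    "FirstName": "d:FirstName",
    "LastName": "d:LastName",
    "Gender": "d:GenderAsString",
    "PartyFunction": "d:PartyFunction",
}


def extract_columns(entries, fields):
    """Hypothetical helper: build a column-oriented dict from feed entries."""
    columns = {name: [] for name in fields}
    for entry in entries:
        properties = entry["content"]["m:properties"]
        for name, source in fields.items():
            value = properties.get(source)
            if isinstance(value, dict):
                # Typed values live under "#text"; null values have none
                value = value.get("#text")
            columns[name].append(value)
    return columns


sample_entries = [{"content": {"m:properties": {
    "d:ID": {"@m:type": "Edm.Int32", "#text": "1172"},
    "d:PartyNumber": {"@m:type": "Edm.Int32", "#text": "14"},
    "d:PersonIdCode": {"@m:type": "Edm.Int32", "@m:null": "true"},
    "d:FirstName": "Max",
    "d:LastName": "Aebischer",
    "d:GenderAsString": "m",
    "d:PartyFunction": "Mitglied",
    "d:PartyAbbreviation": "PDC",
}}}]

members = extract_columns(sample_entries, MEMBER_FIELDS)
print(members)
```

Because every column receives exactly one value per entry, the columns cannot drift out of sync.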

In [7]:
json_members_clean = {'ID':[], 'PersonIdCode': [], 
                      'PartyID':[], 'PartyAbbreviation':[],
                      'FirstName':[], 'LastName':[],
                      'Gender': [], 'PartyFunction': []}

entries = json_members['feed']['entry']

# Print an entry to see where the member fields are located
print(json.dumps(json_members["feed"]["entry"][430]['content']['m:properties'], indent=2))

for i in range(len(entries)):
    # Get the content of the i-th entry
    properties = entries[i]['content']['m:properties']
    # Get the ID
    json_members_clean['ID'].append(properties['d:ID']['#text'])
    # Get the PersonIdCode
    try:
        json_members_clean['PersonIdCode'].append(properties['d:PersonIdCode']['#text'])
    except (KeyError, TypeError):
        # A null PersonIdCode has no '#text' key
        json_members_clean['PersonIdCode'].append(None)
    # Get the PartyID
    json_members_clean['PartyID'].append(properties['d:PartyNumber']['#text'])
    # Get the PartyAbbreviation
    json_members_clean['PartyAbbreviation'].append(properties['d:PartyAbbreviation'])
    # Get the FirstName
    json_members_clean['FirstName'].append(properties['d:FirstName'])
    # Get the LastName
    json_members_clean['LastName'].append(properties['d:LastName'])
    # Get the Gender
    json_members_clean['Gender'].append(properties['d:GenderAsString'])
    # Get the PartyFunction
    json_members_clean['PartyFunction'].append(properties['d:PartyFunction'])

{
  "d:ID": {
    "@m:type": "Edm.Int32",
    "#text": "1172"
  },
  "d:Language": "FR",
  "d:PartyNumber": {
    "@m:type": "Edm.Int32",
    "#text": "14"
  },
  "d:PartyName": "Parti d\u00e9mocrate-chr\u00e9tien suisse",
  "d:PersonNumber": {
    "@m:type": "Edm.Int32",
    "#text": "1172"
  },
  "d:PersonIdCode": {
    "@m:type": "Edm.Int32",
    "@m:null": "true"
  },
  "d:FirstName": "Max",
  "d:LastName": "Aebischer",
  "d:GenderAsString": "m",
  "d:PartyFunction": "Mitglied",
  "d:Modified": {
    "@m:type": "Edm.DateTime",
    "#text": "2013-04-29T15:49:41.223"
  },
  "d:PartyAbbreviation": "PDC"
}


In [8]:
# Save the JSON to data
with open('data/party_members.json', 'w') as ff:
    ff.write(json.dumps(json_members_clean, indent=2))

## Have fun with the Party members

We just want to check that we can access some information using a pandas DataFrame.

In [9]:
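from io import StringIO

# Wrap the JSON text in StringIO: recent pandas versions deprecate passing a raw JSON string
df_members = pd.read_json(StringIO(json.dumps(json_members_clean, indent=2)))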
df_members.head()

Unnamed: 0,FirstName,Gender,ID,LastName,PartyAbbreviation,PartyFunction,PartyID,PersonIdCode
0,Pierre,m,1,Aguet,PSS,Mitglied,12,2200.0
1,Heinz,m,2,Allenspach,PLR,Mitglied,15,2002.0
2,Manfred,m,6,Aregger,PLR,Mitglied,15,2004.0
3,Peter,m,10,Baumberger,PDC,Mitglied,14,2269.0
4,Thierry,m,13,Béguin,PLR,Mitglied,15,2202.0


In [10]:
# Check whether a Federal Councillor (conseiller fédéral) is in here.
df_members[df_members['LastName'] == 'Schneider-Ammann']

Unnamed: 0,FirstName,Gender,ID,LastName,PartyAbbreviation,PartyFunction,PartyID,PersonIdCode
243,Johann N.,m,508,Schneider-Ammann,PLR,Mitglied,15,2530.0


## Get all the persons

Querying the persons gives us more information about the party members. The result contains more persons than the party members list, but it lets us extract additional information.
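To illustrate how the two datasets complement each other, the sketch below joins member rows with person rows on `ID`. It assumes that `ID` identifies the same person in both feeds, which the sample outputs in this notebook suggest (Schneider-Ammann has ID 508 in both) but which should be verified. `to_records` and `join_on_id` are hypothetical helpers.

```python
def to_records(columns):
    """Turn a column-oriented dict into a list of row dicts."""
    names = list(columns)
    return [dict(zip(names, values)) for values in zip(*columns.values())]


def join_on_id(members, persons):
    """Hypothetical helper: enrich each member row with the matching person row."""
    persons_by_id = {row["ID"]: row for row in to_records(persons)}
    joined = []
    for member in to_records(members):
        extra = persons_by_id.get(member["ID"], {})
        # Member fields win if both sides define the same column
        joined.append({**extra, **member})
    return joined


members_sample = {"ID": ["508"], "LastName": ["Schneider-Ammann"], "PartyAbbreviation": ["PLR"]}
persons_sample = {"ID": ["508"], "NativeLanguage": ["D"], "PlaceOfBirthCity": ["Sumiswald"]}

joined = join_on_id(members_sample, persons_sample)
print(joined)
```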

In [11]:
# URL to get all the persons
url_persons = "https://ws.parlament.ch/odata.svc/Person?$filter=Language eq 'FR'"
# Use requests to get the XML
xml_persons = requests.get(url_persons)
# Transform the XML to JSON
json_persons = xmltodict.parse(xml_persons.text)

Now, we need to process this JSON to get a clean, flat JSON.
We keep the following columns:
- ID
- PersonIdCode (To get more info on this person)
- FirstName
- LastName
- Gender (i.e. GenderAsString)
- Profession (i.e. Title)
- ProfessionText (i.e. TitleText)
- DateOfBirth_at (The "\_at" at the end is for pandas to parse the date)
- DateOfDeath_at (The "\_at" at the end is for pandas to parse the date)
- MaritalStatus
- MaritalStatusText
- PlaceOfBirthCity
- PlaceOfBirthCanton
- MilitaryRank
- MilitaryRankText
- NativeLanguage
- NumberOfChildren

In [12]:
json_persons_clean = {'ID':[], 'PersonIdCode': [], 
                      'FirstName':[], 'LastName':[],
                      'Gender': [], 'Profession': [],
                      'ProfessionText': [], 'DateOfBirth_at': [],
                      'DateOfDeath_at': [], 'MaritalStatus': [],
                      'MaritalStatusText': [], 'PlaceOfBirthCity': [],
                      'PlaceOfBirthCanton': [], 'MilitaryRank': [],
                      'MilitaryRankText': [], 'NativeLanguage': [],
                      'NumberOfChildren': []}

entries = json_persons['feed']['entry']

# Print an entry to see where the person fields are located
print(json.dumps(json_persons["feed"]["entry"][905]['content']['m:properties'], indent=2))

for i in range(len(entries)):
    # Get the content of the i-th entry
    properties = entries[i]['content']['m:properties']
    # Get the ID
    json_persons_clean['ID'].append(properties['d:ID']['#text'])
    # Get the PersonIdCode (a null value has no '#text' key)
    try:
        json_persons_clean['PersonIdCode'].append(properties['d:PersonIdCode']['#text'])
    except (KeyError, TypeError):
        json_persons_clean['PersonIdCode'].append(None)
    # Get the FirstName
    json_persons_clean['FirstName'].append(properties['d:FirstName'])
    # Get the LastName
    json_persons_clean['LastName'].append(properties['d:LastName'])
    # Get the Gender
    json_persons_clean['Gender'].append(properties['d:GenderAsString'])
    # Get the Profession and ProfessionText (both None if the title is null)
    try:
        profession = properties['d:Title']['#text']
        profession_text = properties['d:TitleText']
    except (KeyError, TypeError):
        profession, profession_text = None, None
    json_persons_clean['Profession'].append(profession)
    json_persons_clean['ProfessionText'].append(profession_text)
    # Get the DateOfBirth
    try:
        json_persons_clean['DateOfBirth_at'].append(properties['d:DateOfBirth']['#text'])
    except (KeyError, TypeError):
        json_persons_clean['DateOfBirth_at'].append(None)
    # Get the DateOfDeath
    try:
        json_persons_clean['DateOfDeath_at'].append(properties['d:DateOfDeath']['#text'])
    except (KeyError, TypeError):
        json_persons_clean['DateOfDeath_at'].append(None)
    # Get the MaritalStatus and MaritalStatusText (both None if the status is null)
    try:
        marital_status = properties['d:MaritalStatus']['#text']
        marital_status_text = properties['d:MaritalStatusText']
    except (KeyError, TypeError):
        marital_status, marital_status_text = None, None
    json_persons_clean['MaritalStatus'].append(marital_status)
    json_persons_clean['MaritalStatusText'].append(marital_status_text)
    # Get the PlaceOfBirthCanton and PlaceOfBirthCity
    try:
        # A null value carries an '@m:null' attribute
        properties['d:PlaceOfBirthCanton']['@m:null']
        canton, city = None, None
    except (KeyError, TypeError):
        # No '@m:null' attribute, so the value is not null
        canton = properties['d:PlaceOfBirthCanton']
        city = properties['d:PlaceOfBirthCity']
    json_persons_clean['PlaceOfBirthCanton'].append(canton)
    json_persons_clean['PlaceOfBirthCity'].append(city)
    # Get the MilitaryRank and MilitaryRankText (both None if the rank is null)
    try:
        military_rank = properties['d:MilitaryRank']['#text']
        military_rank_text = properties['d:MilitaryRankText']
    except (KeyError, TypeError):
        military_rank, military_rank_text = None, None
    json_persons_clean['MilitaryRank'].append(military_rank)
    json_persons_clean['MilitaryRankText'].append(military_rank_text)
    # Get the NativeLanguage
    try:
        # Check whether the NativeLanguage is null
        properties['d:NativeLanguage']['@m:null']
        # It is null, so we store None
        json_persons_clean['NativeLanguage'].append(None)
    except (KeyError, TypeError):
        json_persons_clean['NativeLanguage'].append(properties['d:NativeLanguage'])
    # Get the NumberOfChildren
    try:
        # Check whether the NumberOfChildren is null
        properties['d:NumberOfChildren']['@m:null']
        # It is null, so we store None
        json_persons_clean['NumberOfChildren'].append(None)
    except (KeyError, TypeError):
        json_persons_clean['NumberOfChildren'].append(properties['d:NumberOfChildren']['#text'])

{
  "d:ID": {
    "@m:type": "Edm.Int32",
    "#text": "1359"
  },
  "d:Language": "FR",
  "d:PersonNumber": {
    "@m:type": "Edm.Int32",
    "#text": "1359"
  },
  "d:PersonIdCode": {
    "@m:type": "Edm.Int32",
    "@m:null": "true"
  },
  "d:Title": {
    "@m:type": "Edm.Int32",
    "@m:null": "true"
  },
  "d:TitleText": {
    "@m:null": "true"
  },
  "d:LastName": "Ackermann",
  "d:GenderAsString": "m",
  "d:DateOfBirth": {
    "@m:type": "Edm.DateTime",
    "#text": "1907-04-02T00:00:00"
  },
  "d:DateOfDeath": {
    "@m:type": "Edm.DateTime",
    "#text": "1997-12-12T00:00:00"
  },
  "d:MaritalStatus": {
    "@m:type": "Edm.Int32",
    "@m:null": "true"
  },
  "d:MaritalStatusText": {
    "@m:null": "true"
  },
  "d:PlaceOfBirthCity": {
    "@m:null": "true"
  },
  "d:PlaceOfBirthCanton": {
    "@m:null": "true"
  },
  "d:Modified": {
    "@m:type": "Edm.DateTime",
    "#text": "2015-05-17T21:18:19.387"
  },
  "d:FirstName": "Alfred",
  "d:OfficialName": "Ackermann Alfred",
  "

In [13]:
# Save the JSON to data
with open('data/persons.json', 'w') as ff:
    ff.write(json.dumps(json_persons_clean, indent=2))

## Have fun with the Persons

We just want to check that we can access some information using a pandas DataFrame.

In [14]:
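from io import StringIO

# Wrap the JSON text in StringIO: recent pandas versions deprecate passing a raw JSON string
df_persons = pd.read_json(StringIO(json.dumps(json_persons_clean, indent=2)), convert_dates=True)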
df_persons.head()

Unnamed: 0,DateOfBirth_at,DateOfDeath_at,FirstName,Gender,ID,LastName,MaritalStatus,MaritalStatusText,MilitaryRank,MilitaryRankText,NativeLanguage,NumberOfChildren,PersonIdCode,PlaceOfBirthCanton,PlaceOfBirthCity,Profession,ProfessionText
0,1938-03-02,NaT,Pierre,m,1,Aguet,2.0,marié(e),5.0,Fourrier,F,,2200.0,Vaud,Pompaples,,
1,1928-02-22,NaT,Heinz,m,2,Allenspach,,,,,D,,2002.0,,,,
2,1931-01-27,NaT,Manfred,m,6,Aregger,2.0,marié(e),7.0,Adjudant sous-officier,D,5.0,2004.0,Lucerne,Hasle,9.0,dipl. Bauing. HTL
3,1928-03-04,NaT,Geneviève,f,7,Aubry,,,,,F,,2005.0,,,,
4,1947-12-01,NaT,Rosmarie,f,8,Bär,,,,,D,,2008.0,,,,


In [15]:
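# Sanity check: no column name should still contain an XML marker.
# Note that `x in df_persons` only tests column labels, not cell values.
print(any(':' in column for column in df_persons.columns))
print(any('@' in column for column in df_persons.columns))
print(any('#' in column for column in df_persons.columns))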

False
False
False
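A complementary check can be run on the cleaned dictionary itself, before it is loaded into pandas. The sketch below looks for two symptoms of an incomplete cleanup: columns of unequal length and values that are still raw dicts such as `{"@m:null": "true"}`. `check_columns` is a hypothetical helper and the sample data is made up for the demo.

```python
def check_columns(columns):
    """Hypothetical check: return a list of problems found in a column-oriented dict."""
    problems = []
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        problems.append(f"unequal column lengths: {lengths}")
    for name, values in columns.items():
        if any(isinstance(value, dict) for value in values):
            problems.append(f"column {name!r} still holds raw XML dicts")
    return problems


clean = {"ID": ["1", "2"], "LastName": ["Aguet", "Allenspach"]}
dirty = {"ID": ["1", "2"], "PlaceOfBirthCity": [{"@m:null": "true"}]}

print(check_columns(clean))
print(check_columns(dirty))
```

An empty list means neither symptom was found; it does not prove that every value is correct.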


In [16]:
df_persons[df_persons['LastName'] == 'Schneider-Ammann'] 

Unnamed: 0,DateOfBirth_at,DateOfDeath_at,FirstName,Gender,ID,LastName,MaritalStatus,MaritalStatusText,MilitaryRank,MilitaryRankText,NativeLanguage,NumberOfChildren,PersonIdCode,PlaceOfBirthCanton,PlaceOfBirthCity,Profession,ProfessionText
444,1952-02-18,NaT,Johann N.,m,508,Schneider-Ammann,2.0,marié(e),14.0,Colonel,D,2.0,2530.0,Berne,Sumiswald,148.0,dipl. El. Ing. ETH / MBA INSEAD
