# CSS: Politicians on Wikipedia 

## Which German politicians are captured on Wikipedia? Does search interest predict existence on Wikipedia?

1. Create a list of all German politicians between XX and XY
2. Analyse the search volume of politicians
	- 2.1 Plot the distribution of the number of countries from which search volume originates, for male and female politicians
	- 2.2 Plot the number of months during which search volume is above a threshold, for male and female politicians
3. Binary logistic regression
	- 3.1 Outcome variable: article exists on Wikipedia
	- 3.2 IV: search volume -> number of countries and months
	- 3.3 Control: experience (e.g. how often a politician has already been a member of parliament)
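
Step 3 of the plan can be sketched as follows. This is only an illustration under stated assumptions: the toy rows are made up (they are not project data), the three features stand for the scaled number of countries, number of months and number of parliaments, and the plain gradient-descent fit with the hypothetical helpers `fitLogistic` and `predictProbability` stands in for whatever statistics package the analysis ends up using.

```python
import math

def sigmoid(z):
    # numerically stable logistic function
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    expZ = math.exp(z)
    return expZ / (1.0 + expZ)

def fitLogistic(rows, labels, learningRate=0.5, epochs=5000):
    """Fit weights (intercept first) by batch gradient descent on the log loss."""
    weights = [0.0] * (len(rows[0]) + 1)
    for _ in range(epochs):
        gradient = [0.0] * len(weights)
        for features, label in zip(rows, labels):
            x = [1.0] + list(features)
            error = sigmoid(sum(w * v for w, v in zip(weights, x))) - label
            for i, value in enumerate(x):
                gradient[i] += error * value
        # average the gradient over all rows and take one step
        weights = [w - learningRate * g / len(rows) for w, g in zip(weights, gradient)]
    return weights

def predictProbability(weights, features):
    x = [1.0] + list(features)
    return sigmoid(sum(w * v for w, v in zip(weights, x)))

# made-up toy data, scaled to [0, 1]:
# (number of countries, number of months, number of parliaments)
toyRows = [(0.0, 0.1, 0.25), (0.1, 0.0, 0.25), (0.2, 0.2, 0.5),
           (0.6, 0.7, 0.5), (0.8, 0.9, 0.75), (1.0, 1.0, 1.0)]
# 1 = article exists on Wikipedia, 0 = no article (also made up)
toyLabels = [0, 0, 0, 1, 1, 1]

weights = fitLogistic(toyRows, toyLabels)
print(predictProbability(weights, toyRows[0]), predictProbability(weights, toyRows[-1]))
```

For the real analysis a dedicated package would also report standard errors and p-values, which this sketch does not.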

### Answers

#### Subtask 1
1. We create a list of all members of the Bundestag from 2005 until now (2018)
    - Bundestag 2005-2009 (Data: https://www.abgeordnetenwatch.de/api/parliament/bundestag%202005-2009/deputies.xml)
    - Bundestag 2009-2013 (Data: https://www.abgeordnetenwatch.de/api/parliament/bundestag%202009-2013/deputies.xml)
    - Bundestag 2013-2017 (Data: https://www.abgeordnetenwatch.de/api/parliament/bundestag%202013-2017/deputies.xml)
    - Bundestag 2017-     (Data: https://www.abgeordnetenwatch.de/api/parliament/bundestag/deputies.xml)

A local copy of the original data is kept.

The result can be seen in data/memberList.json

### Here is a sketch of the data flow
```
https://www.abgeordnetenwatch.de/api > parliament
    https://www.abgeordnetenwatch.de/api/parliament/{parliament}/deputies.xml > firstName, lastName    
        https://de.wikipedia.org/w/api.php?action=query&list=search&srsearch={firstName}%20{lastName}&format=xml 
            > pageId, title, url
```
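
The two URL templates in the sketch above could be filled in like this. The helper names `deputiesUrl` and `wikiSearchUrl` are hypothetical and only illustrate the flow; the URL patterns themselves are the ones listed in this notebook.

```python
from urllib.parse import quote, urlencode

API_BASE = "https://www.abgeordnetenwatch.de/api"

def deputiesUrl(parliament, fileFormat="xml"):
    # percent-encode the parliament name, e.g. the space in "bundestag 2005-2009"
    return "{base}/parliament/{name}/deputies.{fmt}".format(
        base=API_BASE, name=quote(parliament), fmt=fileFormat)

def wikiSearchUrl(firstName, lastName, language="de", fileFormat="xml"):
    # quote_via=quote encodes the space as %20, as in the sketch above
    query = urlencode({"action": "query",
                       "list": "search",
                       "srsearch": firstName + " " + lastName,
                       "format": fileFormat}, quote_via=quote)
    return "https://{language}.wikipedia.org/w/api.php?{query}".format(
        language=language, query=query)

print(deputiesUrl("bundestag 2005-2009"))
print(wikiSearchUrl("Angela", "Merkel"))
```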

In [17]:
#create data directory where all data used will be downloaded to
import os
dataDirectory="data/"
if not os.path.exists(dataDirectory):
    os.makedirs(dataDirectory)

In [36]:
#Solution for Task 1: Data Acquisition
#download the data from abgeordnetenwatch.de/api into the data directory
#urllib.request must be imported explicitly, "import urllib" alone is not enough
import urllib.request
import os
import json

#helper function to convert url to local paths
def toLocalPath(url):
    return os.path.join(dataDirectory, url.replace("/","_"))

parliamentsUrl="https://www.abgeordnetenwatch.de/api/parliaments.json"
parliamentsLocal=toLocalPath(parliamentsUrl)
#make a local copy of the parliaments list
urllib.request.urlretrieve(parliamentsUrl, parliamentsLocal)
print("Saved parliaments: "+ parliamentsLocal)  

with open(parliamentsLocal) as parliamentsJsonFile:
    
    #load the local copy of the parliaments list file
    parliaments = json.load(parliamentsJsonFile)

    #iterate parliaments
    for parliament in parliaments["parliaments"]:
        
        #restrict to Bundestag, maybe add more later
        if "Bundestag" in parliament["name"]:
            
            #get the URL of the deputies file for this specific parliament
            parliamentMembersUrl=parliament["datasets"]["deputies"]["by-name"]
            parliamentMembersLocal=toLocalPath(parliamentMembersUrl)
            #make a local copy of a specific parliament
            urllib.request.urlretrieve(parliamentMembersUrl, parliamentMembersLocal)
            print("Saved parliament: " + parliament["name"] + " to "+ parliamentMembersLocal)            
            

Saved parliaments: data/https:__www.abgeordnetenwatch.de_api_parliaments.json
Saved parliament: Bundestag to data/https:__www.abgeordnetenwatch.de_api_parliament_bundestag_deputies.json
Saved parliament: Bundestag 2005-2009 to data/https:__www.abgeordnetenwatch.de_api_parliament_bundestag%202005-2009_deputies.json
Saved parliament: Bundestag 2009-2013 to data/https:__www.abgeordnetenwatch.de_api_parliament_bundestag%202009-2013_deputies.json
Saved parliament: Bundestag 2013-2017 to data/https:__www.abgeordnetenwatch.de_api_parliament_bundestag%202013-2017_deputies.json


In [55]:
#Solution for Subtask 1: Purging Data
import requests
import urllib
import os
import json

#the result of this subtask, a list of parliament members
memberList = {}

with open(parliamentsLocal) as parliamentsJsonFile:
    #load the local copy of the parliaments list file
    parliaments = json.load(parliamentsJsonFile)

    #iterate parliaments
    for parliament in parliaments["parliaments"]:
        
        #restrict to Bundestag, maybe add more later
        if "Bundestag" in parliament["name"]:
            
            #get the URL of the deputies file for this specific parliament
            parliamentMembersUrl=parliament["datasets"]["deputies"]["by-name"]
            parliamentMembersLocal=toLocalPath(parliamentMembersUrl)
            
            with open(parliamentMembersLocal) as parliamentMembersJsonFile:
                #load local copy of the parliament file
                parliamentMembers = json.load(parliamentMembersJsonFile)
                for parliamentMember in parliamentMembers["profiles"]:
                    #read desired values
                    #we use uuid as id
                    uuid=parliamentMember["meta"]["uuid"]
                    firstName=parliamentMember["personal"]["first_name"]
                    lastName=parliamentMember["personal"]["last_name"]
                    gender=parliamentMember["personal"]["gender"]
                    
                    #test if we already have a member with that uuid in our member list
                    if(uuid in memberList):
                        #update existing member entry and update the "numberOfParliaments"
                        memberList[uuid]["numberOfParliaments"]=memberList[uuid]["numberOfParliaments"]+1
                    else:
                        #create new member entry
                        memberList[uuid] = {
                            "firstName" :firstName,
                            "lastName" :lastName,
                            "gender" :gender,
                            "numberOfParliaments":1
                        }

#Save members list
meberListJsonPath=os.path.join(dataDirectory,"memberList.json")
with open(meberListJsonPath, 'w') as meberListJsonFile:
    json.dump(memberList, meberListJsonFile, sort_keys=True, indent=4)
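
Since the later plots compare male and female politicians, a quick summary of the member list by gender can serve as a sanity check. The sketch below assumes the entry structure written to memberList.json above; the helper name `summarizeByGender`, the sample entries and the gender labels are made up for illustration.

```python
from collections import defaultdict

def summarizeByGender(memberList):
    """Count members and average numberOfParliaments per gender value."""
    totals = defaultdict(lambda: {"count": 0, "parliaments": 0})
    for member in memberList.values():
        entry = totals[member["gender"]]
        entry["count"] += 1
        entry["parliaments"] += member["numberOfParliaments"]
    return {
        gender: {"count": entry["count"],
                 "meanParliaments": entry["parliaments"] / entry["count"]}
        for gender, entry in totals.items()
    }

# made-up sample in the same shape as memberList
sampleMembers = {
    "uuid-1": {"firstName": "A", "lastName": "Example", "gender": "female", "numberOfParliaments": 3},
    "uuid-2": {"firstName": "B", "lastName": "Example", "gender": "male", "numberOfParliaments": 1},
    "uuid-3": {"firstName": "C", "lastName": "Example", "gender": "female", "numberOfParliaments": 1},
}
summary = summarizeByGender(sampleMembers)
print(summary)
```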

In [None]:
#Function to test if a wiki page exists
import json
import urllib.request
from urllib.parse import urlencode

def wikiPageExists(firstName, lastName, language):
    #build the query string for the wiki api
    payload = {"action":"query",
               "list": "search",
               "srsearch":"{firstName} {lastName}".format(firstName=firstName, lastName=lastName),
               "format":"json"}
    encodedPayload = urlencode(payload)

    #build wikipedia api url
    ##example document: https://de.wikipedia.org/w/api.php?action=query&list=search&srsearch=Angela+Merkel&format=json
    wikiUrl="https://{language}.wikipedia.org/w/api.php?{encodedPayload}".format(language=language, encodedPayload=encodedPayload)

    result={}

    #we use try to avoid errors if no such page exists
    try:
        wikiSearch = json.load(urllib.request.urlopen(wikiUrl))

        #read the attributes of the first search hit and put them in the result of the function
        result["wikiTitle"] = wikiSearch["query"]["search"][0]["title"]
        result["wikipageId"] = wikiSearch["query"]["search"][0]["pageid"]
        result["wikipageSnippet"] = wikiSearch["query"]["search"][0]["snippet"]
        #page exists
        result["exists"]=True
        
        #debug
        #print("Name: {firstName} {lastName}, language: {language} pageid: {pageId}, title: {pageTitle} url: https://{language}.wikipedia.org/wiki/{pageTitleQuoted}"
        #      .format(firstName=firstName, lastName=lastName, language=language, pageId=pageId, pageTitle=pageTitle, pageTitleQuoted=urllib.parse.quote(pageTitle)))

        #TODO check if it's the right person, maybe use some categories, or search the snippet for information

    except (IndexError, KeyError):
        #page does not exist (empty search result or no "query" key in the response)
        result["exists"]=False
        #print("Name: {firstName} {lastName}, language: {language} does not exists!"
        #      .format(firstName=firstName, lastName=lastName, language=language ))
        
    return result
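
The TODO above (checking whether the first search hit is really the person) could start with a simple title comparison. This is a hedged sketch, not a reliable identity check: `normalizeName` and `titleMatchesName` are hypothetical helpers, and the example titles in the usage lines are illustrative strings rather than verified Wikipedia titles.

```python
import re
import unicodedata

def normalizeName(text):
    #drop a disambiguation suffix in parentheses, e.g. " (Politikerin)"
    text = re.sub(r"\s*\(.*?\)", "", text)
    #strip accents and umlaut dots so that "Müller" and "Muller" compare equal
    decomposed = unicodedata.normalize("NFKD", text)
    withoutAccents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return re.sub(r"\s+", " ", withoutAccents).strip().lower()

def titleMatchesName(wikiTitle, firstName, lastName):
    #every part of the politician's name must appear as a word in the title
    titleTokens = normalizeName(wikiTitle).split()
    nameTokens = normalizeName(firstName + " " + lastName).split()
    return all(token in titleTokens for token in nameTokens)

print(titleMatchesName("Angela Merkel", "Angela", "Merkel"))
print(titleMatchesName("Bundestagswahl 2017", "Angela", "Merkel"))
```

A match on the title alone cannot tell two people with the same name apart, so categories or the snippet would still be needed for that.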


In [None]:
#Solution Subtask 2.1
#workflow:
#for each member in members
#  get search volume
#  for each language in languages
#    test if wiki page exists

import json
import os

#list of languages we want to test, more languages mean more time
languages = ["de", "fr", "nl", "pl"]


with open(meberListJsonPath) as meberListJsonFile:
    #load the local copy of the members list file
    members = json.load(meberListJsonFile)
    
    for memberUuid in members:
    
        #TODO
        #do google trends here, get the list of countries
        #above the threshold and then for each of these countries test if the wiki page exists
    
        for language in languages:

            #get firstName and lastName to search for
            firstName = members[memberUuid]["firstName"]
            lastName = members[memberUuid]["lastName"]

            
            
            #for each language we add the search volume from google trends and whether a wikipedia page exists
            members[memberUuid][language]={}

            #TODO get search volume with suggested library
            members[memberUuid][language]["searchVolume"]=1000

            #get information from wikipedia
            wikiResult=wikiPageExists(firstName, lastName, language)
            
            members[memberUuid][language]["pageExists"]=wikiResult["exists"]
            #"wikiTitle" is only set when a page was found, so fall back to None
            members[memberUuid][language]["pageTitle"]=wikiResult.get("wikiTitle")
            
         
            
#Save updated members list, keep old
memberListUpdatedJsonPath=os.path.join(dataDirectory,"memberListWikiSearch.json")
with open(memberListUpdatedJsonPath, 'w') as meberListUpdatedJsonFile:
    json.dump(members,meberListUpdatedJsonFile, sort_keys=True, indent=4)
    
print("Saved updated members list")

In [25]:
%matplotlib inline

from pytrends.request import TrendReq

#use a separate name for the request object so it does not shadow the pytrends module
trendReq = TrendReq(hl='de-DE', tz=360)
kw_list = ["Angela Merkel"]
#cat=396 is the category for politics on google trends, see here: https://github.com/pat310/google-trends-api/wiki/Google-Trends-Categories
trendReq.build_payload(kw_list, cat=396, timeframe='today 5-y', geo='', gprop='')

trendReq.interest_by_region(resolution='COUNTRY')

#is there a way to get the geoName as a country code (e.g. DE, FR)?
#that would be much easier to handle


geoName,Angela Merkel
Afghanistan,0
Albania,50
Algeria,10
American Samoa,0
Andorra,0
Angola,0
Anguilla,0
Antarctica,0
Antigua & Barbuda,0
Argentina,8
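
For Subtask 2.1 the per-country interest values have to be reduced to a number of countries above a threshold. A minimal sketch, assuming the values are available as a plain mapping from country name to interest (with a pandas DataFrame this could be obtained with something like `df["Angela Merkel"].to_dict()`); the helper name `countriesAboveThreshold` and the threshold values are assumptions, and the sample values are the rows shown above.

```python
def countriesAboveThreshold(interestByRegion, threshold=0):
    #return the sorted names of all countries whose interest value exceeds the threshold
    return sorted(name for name, value in interestByRegion.items() if value > threshold)

#the first rows of the interest-by-region output shown above
sampleInterest = {
    "Afghanistan": 0, "Albania": 50, "Algeria": 10, "American Samoa": 0,
    "Andorra": 0, "Angola": 0, "Anguilla": 0, "Antarctica": 0,
    "Antigua & Barbuda": 0, "Argentina": 8,
}

aboveZero = countriesAboveThreshold(sampleInterest)
print(len(aboveZero), aboveZero)
```

The length of the returned list is the per-politician count that would go into the distribution plot.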
