# Data Retrieval 02

### Retrieving a lot of data

Now that we have retrieved and processed a first sample, it is time to switch to a larger amount of data.
Because the available data covers only about four years, we want to use all of it.

---


In [4]:
import requests
import pandas as pd


In [5]:
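def getDictOfTextContents(textContents):
    """Flatten the text contents of one speech into a list of {type: text} dicts.

    Each part of a 'textBody' becomes one dict that maps the part's type to
    its sentences, joined by single spaces.
    """
    subBody = []
    for tc in textContents:
        for textPart in tc['textBody']:
            text = ' '.join(s['text'] for s in textPart['sentences'])
            subBody.append({textPart['type']: text})
    return subBody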
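
To make the shape of the return value concrete, the following sketch applies the function to a hand-made input. The sample record is invented for illustration and only mimics the fields the function reads (`textBody`, `type`, `sentences`, `text`); real API records carry more fields, and the `comment` type is an assumption. The function is repeated so that the block runs on its own.

```python
def getDictOfTextContents(textContents):
    # Same logic as the function above, repeated so this block is standalone.
    subBody = []
    for tc in textContents:
        for textPart in tc['textBody']:
            text = ' '.join(s['text'] for s in textPart['sentences'])
            subBody.append({textPart['type']: text})
    return subBody


# Hypothetical record that mimics only the fields used by the function.
sample = [{
    'textBody': [
        {'type': 'speech',
         'sentences': [{'text': 'Herr Präsident!'}, {'text': 'Vielen Dank.'}]},
        {'type': 'comment',
         'sentences': [{'text': '(Beifall)'}]},
    ]
}]

parts = getDictOfTextContents(sample)
print(parts)
# [{'speech': 'Herr Präsident! Vielen Dank.'}, {'comment': '(Beifall)'}]
```

Each part of the text body ends up as its own single-key dict, so the order of speech parts and interjections is preserved.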


The only legislative period available through the API is that of Chancellor Merkel's fourth cabinet, between 2017-10-24 and 2021-09-07.<br />
The highest page number that can be retrieved is 1000.<br />
The Elasticsearch documentation does not recommend the scroll API for deep pagination (https://www.elastic.co/guide/en/elasticsearch/reference/current/scroll-api.html).<br />
Each year contains fewer than 10000 speeches.<br />
-> The data is therefore retrieved year by year and appended to `result`.
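
The per-year splitting can be sketched as a small URL builder. This is only an illustration: the endpoint and the parameter names (`electoralPeriod`, `dateFrom`, `dateTo`, `page`) are taken from the request used in this notebook, while the helper name `build_page_url` and the constant `ENDPOINT` are made up for the example. No request is sent.

```python
from urllib.parse import urlencode

# Endpoint as used in this notebook's retrieval cell.
ENDPOINT = 'https://de.openparliament.tv/api/v1/search/media/'


def build_page_url(year, page, electoral_period='019'):
    """Return the search URL for one result page of one calendar year."""
    params = {
        'electoralPeriod': electoral_period,
        'dateFrom': f'{year}-01-01',
        'dateTo': f'{year}-12-31',
        'page': page,
    }
    return ENDPOINT + '?' + urlencode(params)


# First page of every calendar year touched by the legislative period.
urls = [build_page_url(year, 1) for year in range(2017, 2022)]
print(urls[0])
# https://de.openparliament.tv/api/v1/search/media/?electoralPeriod=019&dateFrom=2017-01-01&dateTo=2017-12-31&page=1
```

Building the query string from a dict avoids manual `&` concatenation and keeps the date range in one place.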


In [6]:
BASE_REQUEST = 'https://de.openparliament.tv/api/v1/search/media/?electoralPeriod=019'
AND = '&'
DATE_RANGES = [('dateFrom=2017-01-01&dateTo=2017-12-31'),
               ('dateFrom=2018-01-01&dateTo=2018-12-31'),
               ('dateFrom=2019-01-01&dateTo=2019-12-31'),
               ('dateFrom=2020-01-01&dateTo=2020-12-31'),
               ('dateFrom=2021-01-01&dateTo=2021-12-31')]
PAGES_PER_YEAR = 999


result = []
for dateRange in DATE_RANGES:
    print('retrieving speeches for date range "' + dateRange + '"')
    for page in range(1, PAGES_PER_YEAR+1):
        apiResponse = requests.get(
            BASE_REQUEST + AND + dateRange + AND + 'page=' + str(page)).json()
        if 'data' in apiResponse:
            result.extend(apiResponse['data'])
        elif apiResponse['meta']['results']['count'] == 0:
            # No further results for this date range.
            break


data = pd.json_normalize(result)
# data to store
# - relationships.people.data.attributes.label                           (main-speaker)
# - relationships.people.data.attributes.party.label                     (main-speaker party)
# - attributes.textContents -> list of {textBody.type: joined textBody.sentences.text} (speech and comments)
data


retrieving speeches for date range "dateFrom=2017-01-01&dateTo=2017-12-31"
retrieving speeches for date range "dateFrom=2018-01-01&dateTo=2018-12-31"
retrieving speeches for date range "dateFrom=2019-01-01&dateTo=2019-12-31"
retrieving speeches for date range "dateFrom=2020-01-01&dateTo=2020-12-31"
retrieving speeches for date range "dateFrom=2021-01-01&dateTo=2021-12-31"


,type,id,_score,_highlight,_finds,attributes.originID,attributes.originMediaID,attributes.creator,attributes.license,attributes.parliament,...,relationships.agendaItem.data.links.self,relationships.documents.data,relationships.documents.links.self,relationships.organisations.data,relationships.organisations.links.self,relationships.terms.data,relationships.terms.links.self,relationships.people.data,relationships.people.links.self,relationships.annotations.links.self
0,media,DE-0190001002,0,,,,7164718,Deutscher Bundestag,"<a href=""https://www.bundestag.de/nutzungsbedi...",DE,...,https://de.openparliament.tv/api/v1/agendaItem...,[],https://de.openparliament.tv/api/v1/search/ann...,[],https://de.openparliament.tv/api/v1/search/ann...,[],https://de.openparliament.tv/api/v1/search/ann...,"[{'type': 'person', 'id': 'Q70407', 'attribute...",https://de.openparliament.tv/api/v1/search/ann...,https://de.openparliament.tv/api/v1/search/ann...
1,media,DE-0190001007,0,,,,7164730,Deutscher Bundestag,"<a href=""https://www.bundestag.de/nutzungsbedi...",DE,...,https://de.openparliament.tv/api/v1/agendaItem...,"[{'type': 'document', 'id': '4022', 'attribute...",https://de.openparliament.tv/api/v1/search/ann...,"[{'type': 'organisation', 'id': 'Q1826856', 'a...",https://de.openparliament.tv/api/v1/search/ann...,[],https://de.openparliament.tv/api/v1/search/ann...,"[{'type': 'person', 'id': 'Q1599545', 'attribu...",https://de.openparliament.tv/api/v1/search/ann...,https://de.openparliament.tv/api/v1/search/ann...
2,media,DE-0190001008,0,,,,7164731,Deutscher Bundestag,"<a href=""https://www.bundestag.de/nutzungsbedi...",DE,...,https://de.openparliament.tv/api/v1/agendaItem...,"[{'type': 'document', 'id': '4022', 'attribute...",https://de.openparliament.tv/api/v1/search/ann...,"[{'type': 'organisation', 'id': 'Q1023134', 'a...",https://de.openparliament.tv/api/v1/search/ann...,[],https://de.openparliament.tv/api/v1/search/ann...,"[{'type': 'person', 'id': 'Q1770968', 'attribu...",https://de.openparliament.tv/api/v1/search/ann...,https://de.openparliament.tv/api/v1/search/ann...
3,media,DE-0190001010,0,,,,7164735,Deutscher Bundestag,"<a href=""https://www.bundestag.de/nutzungsbedi...",DE,...,https://de.openparliament.tv/api/v1/agendaItem...,"[{'type': 'document', 'id': '4022', 'attribute...",https://de.openparliament.tv/api/v1/search/ann...,"[{'type': 'organisation', 'id': 'Q1007353', 'a...",https://de.openparliament.tv/api/v1/search/ann...,[],https://de.openparliament.tv/api/v1/search/ann...,"[{'type': 'person', 'id': 'Q920726', 'attribut...",https://de.openparliament.tv/api/v1/search/ann...,https://de.openparliament.tv/api/v1/search/ann...
4,media,DE-0190001011,0,,,,7164736,Deutscher Bundestag,"<a href=""https://www.bundestag.de/nutzungsbedi...",DE,...,https://de.openparliament.tv/api/v1/agendaItem...,"[{'type': 'document', 'id': '4022', 'attribute...",https://de.openparliament.tv/api/v1/search/ann...,[],https://de.openparliament.tv/api/v1/search/ann...,[],https://de.openparliament.tv/api/v1/search/ann...,"[{'type': 'person', 'id': 'Q70407', 'attribute...",https://de.openparliament.tv/api/v1/search/ann...,https://de.openparliament.tv/api/v1/search/ann...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
25604,media,DE-0190239047,0,,,,7532003,Deutscher Bundestag,"<a href=""https://www.bundestag.de/nutzungsbedi...",DE,...,https://de.openparliament.tv/api/v1/agendaItem...,"[{'type': 'document', 'id': '3553', 'attribute...",https://de.openparliament.tv/api/v1/search/ann...,"[{'type': 'organisation', 'id': 'Q1023134', 'a...",https://de.openparliament.tv/api/v1/search/ann...,[],https://de.openparliament.tv/api/v1/search/ann...,"[{'type': 'person', 'id': 'Q98768', 'attribute...",https://de.openparliament.tv/api/v1/search/ann...,https://de.openparliament.tv/api/v1/search/ann...
25605,media,DE-0190239049,0,,,,7532005,Deutscher Bundestag,"<a href=""https://www.bundestag.de/nutzungsbedi...",DE,...,https://de.openparliament.tv/api/v1/agendaItem...,"[{'type': 'document', 'id': '3553', 'attribute...",https://de.openparliament.tv/api/v1/search/ann...,"[{'type': 'organisation', 'id': 'Q2207512', 'a...",https://de.openparliament.tv/api/v1/search/ann...,[],https://de.openparliament.tv/api/v1/search/ann...,"[{'type': 'person', 'id': 'Q1736645', 'attribu...",https://de.openparliament.tv/api/v1/search/ann...,https://de.openparliament.tv/api/v1/search/ann...
25606,media,DE-0190239050,0,,,,7532006,Deutscher Bundestag,"<a href=""https://www.bundestag.de/nutzungsbedi...",DE,...,https://de.openparliament.tv/api/v1/agendaItem...,"[{'type': 'document', 'id': '3553', 'attribute...",https://de.openparliament.tv/api/v1/search/ann...,"[{'type': 'organisation', 'id': 'Q1826856', 'a...",https://de.openparliament.tv/api/v1/search/ann...,[],https://de.openparliament.tv/api/v1/search/ann...,"[{'type': 'person', 'id': 'Q65114', 'attribute...",https://de.openparliament.tv/api/v1/search/ann...,https://de.openparliament.tv/api/v1/search/ann...
25607,media,DE-0190239052,0,,,,7532009,Deutscher Bundestag,"<a href=""https://www.bundestag.de/nutzungsbedi...",DE,...,https://de.openparliament.tv/api/v1/agendaItem...,"[{'type': 'document', 'id': '3553', 'attribute...",https://de.openparliament.tv/api/v1/search/ann...,"[{'type': 'organisation', 'id': 'Q1387991', 'a...",https://de.openparliament.tv/api/v1/search/ann...,[],https://de.openparliament.tv/api/v1/search/ann...,"[{'type': 'person', 'id': 'Q1429938', 'attribu...",https://de.openparliament.tv/api/v1/search/ann...,https://de.openparliament.tv/api/v1/search/ann...
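
The dotted column names in this output come from `pd.json_normalize`, which flattens nested dicts into one column per leaf and leaves lists untouched. A minimal sketch with a hand-made, heavily shortened record (the field selection is an assumption for illustration; real API items have many more fields):

```python
import pandas as pd

# Hypothetical, shortened record that only mimics the nesting of an API item.
records = [{
    'type': 'media',
    'id': 'DE-0190001002',
    'attributes': {'creator': 'Deutscher Bundestag', 'parliament': 'DE'},
    'relationships': {'people': {'data': [{'type': 'person', 'id': 'Q70407'}]}},
}]

flat = pd.json_normalize(records)

# Nested dicts become dotted column names; the list of people stays a list.
print(sorted(flat.columns))
print(type(flat.loc[0, 'relationships.people.data']))
```

This is why the speaker information still has to be unpacked by hand afterwards: `relationships.people.data` remains a list of dicts in a single column.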


In [7]:
# Work on an explicit copy so that adding columns does not trigger
# pandas' SettingWithCopyWarning.
dataExtracted = data[['attributes.textContents', 'relationships.people.data']].copy()

dataExtracted['sentences'] = dataExtracted['attributes.textContents'].apply(
    getDictOfTextContents)
dataExtracted = dataExtracted.drop(columns=['attributes.textContents'])

# The first listed person is treated as the main speaker.
for index, dataRow in dataExtracted.iterrows():
    people = dataRow['relationships.people.data']
    if len(people) > 0:
        dataExtracted.at[index, 'main-speaker'] = people[0]['attributes']['label']
        dataExtracted.at[index, 'main-speaker-party'] = people[0]['attributes']['party']['label']
    else:
        dataExtracted.at[index, 'main-speaker'] = None
        dataExtracted.at[index, 'main-speaker-party'] = None


dataExtracted = dataExtracted.drop(columns=['relationships.people.data'])

dataExtracted


,sentences,main-speaker,main-speaker-party
0,"[{'speech': 'Guten Morgen, liebe Kolleginnen u...",Hermann Otto Solms,Freie Demokratische Partei
1,[{'speech': 'Herr Präsident! Liebe Kolleginnen...,Jan Korte,Die Linke
2,[{'speech': 'Herr Präsident! Sehr verehrte Gäs...,Michael Grosse-Brömer,Christlich Demokratische Union Deutschlands
3,[{'speech': 'Sehr geehrter Herr Präsident! Mei...,Britta Haßelmann,Bündnis 90/Die Grünen
4,"[{'speech': 'Wir kommen nun zur Abstimmung.'},...",Hermann Otto Solms,Freie Demokratische Partei
...,...,...,...
25604,[{'speech': 'Frau Präsidentin! Liebe Kolleginn...,Andreas Jung,Christlich Demokratische Union Deutschlands
25605,[{'speech': 'Frau Präsidentin! Meine sehr vere...,Katja Mast,Sozialdemokratische Partei Deutschlands
25606,[{'speech': 'Vielen Dank. – Frau Präsidentin! ...,Gesine Lötzsch,Die Linke
25607,[{'speech': 'Vielen Dank. – Frau Präsidentin! ...,Florian Toncar,Freie Demokratische Partei
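
The extraction loop assumes that every person entry carries `attributes.party.label`. A more defensive variant could look like the sketch below. The helper name `main_speaker` and both sample inputs are hypothetical; only the key names `attributes`, `label` and `party` are taken from the loop in this notebook.

```python
def main_speaker(people):
    """Return (name, party) of the first listed person, or (None, None)."""
    if not people:
        return None, None
    attributes = people[0].get('attributes', {})
    party = attributes.get('party') or {}
    return attributes.get('label'), party.get('label')


# Hypothetical inputs shaped like entries of 'relationships.people.data'.
with_party = [{'type': 'person', 'id': 'Q70407',
               'attributes': {'label': 'Hermann Otto Solms',
                              'party': {'label': 'Freie Demokratische Partei'}}}]
without_party = [{'type': 'person', 'id': 'Q0',
                  'attributes': {'label': 'Guest Speaker'}}]

print(main_speaker(with_party))     # ('Hermann Otto Solms', 'Freie Demokratische Partei')
print(main_speaker(without_party))  # ('Guest Speaker', None)
print(main_speaker([]))             # (None, None)
```

Such a helper could be applied to the `relationships.people.data` column, for example with `Series.apply`, in place of the row-wise loop.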


In [8]:
# Store data for use in other notebooks.
%store dataExtracted


Stored 'dataExtracted' (DataFrame)
