# Step 1 - Gathering Raw Session Data into Local Data Lake
This step retrieves the raw JSON data of the Gartner IT Symposium Xpo 2020 session catalog and stores it in local JSON files, so that later steps do not depend on the availability, latency, or format of the live API.

The session catalog for Gartner IT Symposium Xpo 2020 is available through a public website, at: https://www.gartner.com/en/conferences/emea/symposium-spain/agenda. This website uses a backend API to search for and retrieve details of conference sessions. The session catalog data is available from this REST API, whose root endpoint is https://events.rainfocus.com/api/search.

A typical call to this API uses headers ('rfapiprofileid': 'jbODoWCsunb5Gzfge0yi8wG4qkkl73PZ', 'rfauthtoken': '566e710bcbb4423485e98e2059409ce3', 'rfwidgetid': 'bGU8m10LBxA0zjUyqD64KMDCSKlr4NVb') and form values to retrieve specific session information.

A bare-bones API call in Python looks like this:
```python
import requests

url = "https://events.rainfocus.com/api/search"

payload={'search': '',
'type': 'session',
'size': '500'}
files=[
]
headers = {
  'rfapiprofileid': 'jbODoWCsunb5Gzfge0yi8wG4qkkl73PZ',
  'rfauthtoken': '566e710bcbb4423485e98e2059409ce3',
  'rfwidgetid': 'bGU8m10LBxA0zjUyqD64KMDCSKlr4NVb'
}

response = requests.request("POST", url, headers=headers, data=payload, files=files)

print(response.text)

```
This request returns all sessions at the event.

Execute the next cell to see this code in action and check the response from the API:

In [1]:
import requests

url = "https://events.rainfocus.com/api/search"

payload={'search': '',
'type': 'session',
'size': '500'}
files=[
]
headers = {
  'rfapiprofileid': 'jbODoWCsunb5Gzfge0yi8wG4qkkl73PZ',
  'rfauthtoken': '566e710bcbb4423485e98e2059409ce3',
  'rfwidgetid': 'bGU8m10LBxA0zjUyqD64KMDCSKlr4NVb'
}

response = requests.request("POST", url, headers=headers, data=payload, files=files)
#print first 4000 characters of response
print(response.text[:4000])

{"responseCode":"0","responseMessage":"Success","sections":true,"totalSearchItems":160,"sectionList":[{"sectionId":"20201109t10","sectionTitle":"10:00 a.m. Monday, Nov 09","total":1,"numItems":1,"from":0,"size":50,"items":[{"sessionID":"1597610066073001MIKP","externalID":"2237883","code":"K1","abbreviation":"K1","title":"Gartner Opening Keynote: Seize the Moment to Compose a Resilient Future","abstract":"Leaders worldwide are guiding their organizations into an uncertain future. But can we do more than just survive change? An adaptive, resilient business — the composable business — promises bold outcomes. CIOs throughout the private and public sectors are combining what their organizations do well with new ideas about technology, learning and ecosystems. Gartner’s 2020 Symposium Keynote covers the strategies that, even in turmoil, create greater advantage.","type":"Keynote","status":"Accepted","length":45.0,"modified":"2020-11-09T10:49:09Z","published":1.0,"hasWebinarProfile":true,"web
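The output above suggests that the response groups sessions into sections (`sectionList`), each with its own `items`, and reports an overall count in `totalSearchItems`. A minimal sketch of how that structure could be inspected with the standard `json` module follows. The `sample_text` value is a hypothetical, heavily trimmed stand-in for `response.text`, and `summarize_catalog` is an illustrative helper, not part of the notebook.

```python
import json

# Hypothetical, heavily trimmed stand-in for response.text (assumption: same key names
# as in the real response shown above; the second section is made up for illustration)
sample_text = """
{"responseCode": "0", "responseMessage": "Success", "totalSearchItems": 3,
 "sectionList": [
   {"sectionId": "20201109t10", "sectionTitle": "10:00 a.m. Monday, Nov 09",
    "items": [{"code": "K1", "type": "Keynote"}]},
   {"sectionId": "20201109t11", "sectionTitle": "11:00 a.m. Monday, Nov 09",
    "items": [{"code": "11E", "type": "Track Sessions"},
              {"code": "11a", "type": "Track Sessions"}]}
 ]}
"""


def summarize_catalog(text):
    """Count the items per section and compare with the total the API reports."""
    catalog = json.loads(text)
    per_section = {s["sectionTitle"]: len(s["items"]) for s in catalog["sectionList"]}
    return {
        "reported_total": catalog["totalSearchItems"],
        "items_found": sum(per_section.values()),
        "per_section": per_section,
    }


summary = summarize_catalog(sample_text)
print(summary["reported_total"], summary["items_found"])
```

In the notebook, `summarize_catalog(response.text)` would give a quick indication of whether the number of items actually returned matches the reported total.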

Our challenge is to retrieve the complete data on all sessions for the event. With the code in the previous cell as a starting point, we can create a Python program that pulls session data from the (semi-)public API and stores it in a local file in our *data lake* (folder `./datalake`).

The next cell contains the first part of this program. Note that all imports and variables defined in this cell remain in effect, once the cell has been executed, for the rest of the Jupyter Notebook session.

In [18]:
import requests # module for making HTTP requests, for example invoking REST APIs
import pandas as pd # module for working with Panel Data; A panel is a 3D container of data. The term Panel data is derived from econometrics and is partially responsible for the name pandas
import json # module for working with JSON data 
import math 
url = "https://events.rainfocus.com/api/search" # rooturl for session catalog
dataLake = "datalake/" # file system directory used for storing the gathered data

#headers required in each API call (these identify the Gartner conference)
headers = {
  'rfapiprofileid': 'jbODoWCsunb5Gzfge0yi8wG4qkkl73PZ',
  'rfauthtoken': '566e710bcbb4423485e98e2059409ce3',
  'rfwidgetid': 'bGU8m10LBxA0zjUyqD64KMDCSKlr4NVb'
}


payload={'search': '',
'type': 'session',
'size': '500'}
files=[
]


Conference session data should be collected for all session types. The request used here has an empty `search` value and does not filter on session type, so a single call returns sessions of every type (the output above already shows both `Keynote` and `Track Sessions`).

If a filter on session type is needed, the identifiers recognized by the API can be found by searching for each available session type in the Web UI and inspecting the HTTP request that the Web UI sends to the backend API.

Two functions are defined to call the API and reshape the result.

Function `loadSessionDataForConference` coordinates the work: it parses the raw JSON response and converts the relevant part (the items in each section of `sectionList`) into a single Pandas DataFrame object. It relies on `loadSessionData` to make the actual HTTP request to the Session Catalog API.
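The HTTP request could be made somewhat more defensive. The sketch below is one possible variant, not the notebook's own code: `fetch_catalog` is a hypothetical helper that takes the posting function as a parameter (in the notebook this would be `requests.post`), sets a timeout, and raises on HTTP errors. Treating any `responseCode` other than `"0"` as a failure is an assumption based only on the successful response shown earlier. The stub classes exist solely to keep the example runnable without network access.

```python
def fetch_catalog(post, url, headers, payload, timeout=30):
    """POST the search form and return the parsed JSON body.

    `post` is any callable that behaves like `requests.post`.
    """
    response = post(url, headers=headers, data=payload, timeout=timeout)
    response.raise_for_status()  # raises for 4xx/5xx when `post` is requests.post
    body = response.json()
    # Assumption: "0" is the only success code; the source shows it only alongside "Success"
    if body.get("responseCode") != "0":
        raise RuntimeError("API reported an error: {0}".format(body.get("responseMessage")))
    return body


class StubResponse:
    """Offline stand-in for a requests.Response (illustration only)."""

    def __init__(self, body):
        self._body = body

    def raise_for_status(self):
        pass

    def json(self):
        return self._body


def stub_post(url, headers=None, data=None, timeout=None):
    # Returns a canned success body instead of calling the real API
    return StubResponse({"responseCode": "0", "responseMessage": "Success", "sectionList": []})


catalog = fetch_catalog(
    stub_post,
    "https://events.rainfocus.com/api/search",
    headers={},
    payload={"search": "", "type": "session", "size": "500"},
)
print(catalog["responseMessage"])
```

Against the live API, the call would be `fetch_catalog(requests.post, url, headers, payload)` with the `url`, `headers` and `payload` values defined earlier.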

In [49]:
def loadSessionData():
    # make the HTTP request
    payload={'search': '',
    'type': 'session',
    'size': '500'}
    files=[

    ]
    response = requests.request("POST", url, headers=headers, data=payload, files=files)
    return response.text

def loadSessionDataForConference():
    tempDict = json.loads(loadSessionData())
    # the sessions are clustered in sectionList; each section represents a slot (date & time)
    # for our data wrangling and analysis purposes we prefer the data to be unclustered (one flat list)
    sessions = []
    for sl in tempDict["sectionList"]:
        items = sl["items"]
        for item in items:
            # copy the slot's date and time onto each session so they survive the flattening
            item["date"] = sl["sectionInfo"]["date"]
            item["time"] = sl["sectionInfo"]["time"]
        sessions.extend(items)
    df = pd.DataFrame(sessions)
    return df

Let's retrieve all session data. The result is written to a JSON data file in our data lake (i.e. folder `./datalake`). 
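Writing the file presumes that the data lake folder already exists. A minimal sketch of a helper that creates the folder when needed and returns the full target path is shown here. The name `data_lake_file` is hypothetical, and the example uses a temporary directory so that running it does not touch the real `./datalake` folder.

```python
import pathlib
import tempfile


def data_lake_file(data_lake, filename):
    """Return the path of `filename` inside `data_lake`, creating the folder if needed."""
    folder = pathlib.Path(data_lake)
    folder.mkdir(parents=True, exist_ok=True)  # no error if the folder is already there
    return folder / filename


# Demonstration in a scratch directory (removed again when the block ends)
with tempfile.TemporaryDirectory() as scratch:
    target = data_lake_file(
        pathlib.Path(scratch) / "datalake", "gartner-it-symposium-xpo-2020.json"
    )
    folder_created = target.parent.is_dir()

print(target.name, folder_created)
```

In the notebook, the resulting path could be passed straight to the DataFrame, for example `ss.to_json(data_lake_file(dataLake, "gartner-it-symposium-xpo-2020.json"), force_ascii=False)`.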

In [50]:
#get session details for all sessions
ss = loadSessionDataForConference()
# write details to a JSON file called gartner-it-symposium-xpo-2020.json in the datalake (folder ./datalake)
ss.to_json("{0}gartner-it-symposium-xpo-2020.json".format(dataLake), force_ascii=False)
# show the first five entries in the data frame:
ss.head(5)


length of response.text 1662174


Unnamed: 0,sessionID,externalID,code,abbreviation,title,abstract,type,status,length,modified,...,times,attributevalues,featured_value,useWaitingList,es_metadata_id,highlight,videos,date,time,sponsors
0,1597610066073001MIKP,2237883,K1,K1,Gartner Opening Keynote: Seize the Moment to C...,Leaders worldwide are guiding their organizati...,Keynote,Accepted,45.0,2020-11-09T10:49:09Z,...,"[{'sessionTimeID': '1597610066073002MG6k', 'ex...","[{'value': 'All Sessions', 'attributevalue_id'...",1.0,1,1597610066073001MIKP,{},[],2020-11-09,10:00,
1,1599064516896001RRlL,2254587,11E,11E,Composable Businesses Need Antifragile Strategies,"In order to win in volatile times, composabili...",Track Sessions,Accepted,30.0,2020-11-09T16:12:38Z,...,"[{'sessionTimeID': '1599064516896002RuZ1', 'ex...","[{'value': 'All Sessions', 'attributevalue_id'...",1.0,1,1599064516896001RRlL,{},[],2020-11-09,11:00,
2,1599061799403001Zhvo,2247633,11a,11a,Postpandemic Planning of IT Strategy,As CIOs come out of the immediate response pha...,Track Sessions,Accepted,30.0,2020-11-09T16:08:52Z,...,"[{'sessionTimeID': '1599061799403002ZebZ', 'ex...","[{'value': 'All Sessions', 'attributevalue_id'...",1.0,1,1599061799403001Zhvo,{},[],2020-11-09,11:00,
3,1597342478776001J94Z,2238176,11C,11C,Ten Rules for Rapid IT Spend Reduction,Difficult times call for difficult actions. In...,Track Sessions,Accepted,30.0,2020-11-09T16:10:51Z,...,"[{'sessionTimeID': '1597342478776002JnhU', 'ex...","[{'value': 'All Sessions', 'attributevalue_id'...",1.0,1,1597342478776001J94Z,{},[],2020-11-09,11:00,
4,1599062702602001pCKN,2248715,11G,11G,The Cloud Computing Scenario: The Future Is Di...,"Distributed cloud brings together edge, hybrid...",Track Sessions,Accepted,30.0,2020-11-09T16:13:35Z,...,"[{'sessionTimeID': '1599062702602002pTY2', 'ex...","[{'value': 'All Sessions', 'attributevalue_id'...",1.0,1,1599062702602001pCKN,{},[],2020-11-09,11:00,


In [27]:
import pathlib

# define the path
dataLakeDirectory = pathlib.Path(dataLake)

for currentFile in dataLakeDirectory.iterdir():  
    print(currentFile)

datalake/oow2018-sessions_codeone_BUS.json
datalake/oow2018-sessions_codeone_BOF.json
datalake/oow2018-sessions_oow_PRM.json
datalake/oow2018-sessions_codeone_KEY.json
datalake/oow2018-sessions_oow_MTE.json
datalake/oow2018-sessions_oow_PKN.json
datalake/oow2018-sessions_oow_TRN.json
datalake/oow2018-sessions_codeone_THT.json
datalake/oow2018-sessions_codeone_IGN.json
datalake/oow2018-sessions_codeone_TLD.json
datalake/oow2018-sessions_codeone_PRM.json
datalake/oow2018-sessions_codeone_DEV.json
datalake/oow2018-sessions_codeone_TUT.json
datalake/oow2018-sessions_codeone_GEN.json
datalake/oow2018-sessions_oow_BUS.json
datalake/oow2018-sessions_oow_GEN.json
datalake/oow2018-sessions_oow_KEY.json
datalake/oow2018-sessions_codeone_HOL.json
datalake/oow2018-sessions_oow_DEV.json
datalake/oow2018-sessions_codeone_TIP.json
datalake/gartner-it-symposium-xpo-2020.json
datalake/oow2018-sessions_codeone_SIG.json
datalake/oow2018-sessions_oow_ESS.json
datalake/oow2018-sessions_oow_TLD.json
datalak

As a final check, let's load the contents of the generated file into a Pandas DataFrame. If that succeeds, we consider the data gathering stage complete. Time for some `data wrangling` in <a href="./2-OOW2018 Session Catalog - Wrangling I - one single, refined, reshaped file.ipynb" target="_new">Step 2 - Data Wrangling</a>.

In [28]:
#as a test, try to load data from the generated file 
sessionPandas = pd.read_json("{0}gartner-it-symposium-xpo-2020.json".format(dataLake))
sessionPandas.head(5)

Unnamed: 0,sessionID,externalID,code,abbreviation,title,abstract,type,status,length,modified,...,type_displayorder,type_displayorder_string,participants,times,attributevalues,featured_value,useWaitingList,es_metadata_id,highlight,videos
0,1597610066073001MIKP,2237883,K1,K1,Gartner Opening Keynote: Seize the Moment to C...,Leaders worldwide are guiding their organizati...,Keynote,Accepted,45,2020-11-09 10:49:09+00:00,...,9999,9999,[{'speakerId': '15132790924060010zbe_159475592...,"[{'sessionTimeID': '1597610066073002MG6k', 'ex...","[{'value': 'All Sessions', 'attributevalue_id'...",1,1,1597610066073001MIKP,{},[]
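Because `to_json` was called without an `orient` argument, the file uses the pandas default layout for a DataFrame: one JSON object per column, keyed by row label. The sketch below shows how such a file could be turned back into a list of records with only the standard `json` module, which can be handy for a quick check outside pandas. `sample_file_text` is a hypothetical miniature of the real file, and `records_from_columns` is an illustrative helper.

```python
import json

# Hypothetical miniature of the generated file, in column orientation
# (column name -> {row label -> value})
sample_file_text = json.dumps({
    "sessionID": {"0": "1597610066073001MIKP", "1": "1599064516896001RRlL"},
    "code": {"0": "K1", "1": "11E"},
    "type": {"0": "Keynote", "1": "Track Sessions"},
})


def records_from_columns(text):
    """Convert column-oriented JSON into a list with one dict per row."""
    columns = json.loads(text)
    # Row labels are taken from the first column; an empty file yields no rows
    row_labels = list(next(iter(columns.values()), {}))
    return [
        {name: values.get(label) for name, values in columns.items()}
        for label in row_labels
    ]


records = records_from_columns(sample_file_text)
print(len(records), records[0]["code"])
```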


# Resources
The public website for the session catalog for Gartner IT Symposium Xpo 2020: https://www.gartner.com/en/conferences/emea/symposium-spain/agenda.

The root endpoint for the REST API for session catalog data: https://events.rainfocus.com/api/search.

Dataquest Data Science Blog — Jupyter Notebook for Beginners: A Tutorial https://www.dataquest.io/blog/jupyter-notebook-tutorial/

The Pandas DataFrame – loading, editing, and viewing data in Python - by Shane Lynn https://www.shanelynn.ie/using-pandas-dataframe-creating-editing-viewing-data-in-python/