# Step 1 - Gathering Raw Session Data into Local Data Lake
Retrieving raw JSON data for Oracle OpenWorld & CodeOne Session Catalog into local JSON Files to surmount unavailability, latency and format limitations. 

The session catalog for Oracle OpenWorld 2018 is available through a public web site, at: https://events.rainfocus.com/widget/oracle/oow18/catalogoow18? . The session catalog for the co-located CodeOne conference is published at https://events.rainfocus.com/widget/oracle/oow18/catalogcodeone18?. These websites use a common backend API to search for and retrieve details of conference sessions. The session catalog data is available from this REST API whose root endpoint is https://events.rainfocus.com/api/search.

A typical call to this API uses Headers (rfwidgetid: 'KKA8rC3VuZo5clh8gX5Aq07XFonUTLyU',rfapiprofileid: 'uGiII5rYGOjoHXOZx0ch4r7f1KzFC0zd') and Form values to retrieve specific session information. 

A bare bone API call in Python looks like this:
```
import requests

url = "https://events.rainfocus.com/api/search"

querystring = {"search.sessiontype":"1522435540042001BxTD"}

payload = "------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; name=\"size\"\r\n\r\n50\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; name=\"type\"\r\n\r\nsession\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; name=\"from\"\r\n\r\n30\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW--"
headers = {
    'content-type': "multipart/form-data; boundary=----WebKitFormBoundary7MA4YWxkTrZu0gW",
    'rfapiprofileid': "uGiII5rYGOjoHXOZx0ch4r7f1KzFC0zd",
    'rfwidgetid': "KKA8rC3VuZo5clh8gX5Aq07XFonUTLyU",
    'cache-control': "no-cache",
    }

response = requests.request("POST", url, data=payload, headers=headers, params=querystring)

print(response.text)
```
This request returns all sessions at CodeOne of type Developer Session.

Execute the next cell to see this code in action and check the response from the API:

In [4]:
import requests

url = "https://events.rainfocus.com/api/search"

querystring = {"search.sessiontype":"1522435540042001BxTD"}

payload = "------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; name=\"size\"\r\n\r\n50\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; name=\"type\"\r\n\r\nsession\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; name=\"from\"\r\n\r\n30\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW--"
headers = {
    'content-type': "multipart/form-data; boundary=----WebKitFormBoundary7MA4YWxkTrZu0gW",
    'rfapiprofileid': "uGiII5rYGOjoHXOZx0ch4r7f1KzFC0zd",
    'rfwidgetid': "KKA8rC3VuZo5clh8gX5Aq07XFonUTLyU",
    'cache-control': "no-cache",
    'Postman-Token': "dadd9f76-6a7f-41ed-8f31-04359976c622"
    }

response = requests.request("POST", url, data=payload, headers=headers, params=querystring)

print(response.text)



Our challenge is to retrieve the complete data on all sessions for both conferences - CodeOne and Oracle OpenWorld. With the code in the previous cell as starting point - we can create a Python program that pulls session data from the (semi-)public API and stores it in a local file in our *data lake* (folder /data). Note that we have to make an API call for each session type and for both conferences in order to gather all data.

The next cell contains the first part of this program. Note that all imports and variables defined in this cell have - when executed - a lasting effect throughout the Jupyter Notebook session. In other words: variable `OOW_rfapiprofileid` is defined in the next cell and will be available to all subsequent cells in the notebook.

In [5]:
import requests # module for making HTTP requests, for example invoking REST APIs
import pandas as pd # module for working with Panel Data; A panel is a 3D container of data. The term Panel data is derived from econometrics and is partially responsible for the name pandas
import json # module for working with JSON data 
import math 
url = "https://events.rainfocus.com/api/search" # rooturl for session catalog
dataLake = "data/" # file system directory used for storing the gathered data

#headers required in each API call
querystring = {"search.sessiontype":"<session type specific code>"} 
payload = "------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; name=\"size\"\r\n\r\n50\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; name=\"type\"\r\n\r\nsession\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW\r\nContent-Disposition: form-data; name=\"from\"\r\n\r\n{0}\r\n------WebKitFormBoundary7MA4YWxkTrZu0gW--"

headers = {
    'content-type': "multipart/form-data; boundary=----WebKitFormBoundary7MA4YWxkTrZu0gW",
    'rfapiprofileid': "<conference specific profile id>",
    'rfwidgetid': "<conference specific widgetid>",
    'cache-control': "no-cache"
}

# OOW
OOW_rfapiprofileid ="K9HkkU5es180AVTifYUgYembIKJ15CMM"
OOW_rfwidgetid= "VEsNDADSTFH5azU4dH1QslO3lhpQTy4U"

# CodeOne
CodeOne_rfapiprofileid= "uGiII5rYGOjoHXOZx0ch4r7f1KzFC0zd"
CodeOne_rfwidgetid =  "KKA8rC3VuZo5clh8gX5Aq07XFonUTLyU"

Conference session data should be collected for all session types. The list of session types is defined in the next cell in a Map - associating the session type abbreviation to the identifier recognized by the API.

This list was compiled by searching one by one for every available session type in the Web UI and inspecting the HTTP request that was sent from the Web UI to the backend API.

In [6]:
sessionTypes =   {
      'BOF': '1518466139979001dQkv'
    , 'BQS': 'bqs'
    , 'BUS': '1519240082595001EpMm'
    , 'CAS': 'casestudy'
    , 'DEV': '1522435540042001BxTD'
    , 'ESS': 'ess'
    , 'FLP': 'flp'
    , 'GEN': 'general'
    , 'HOL': 'hol'
    , 'HOM': 'hom'
    , 'IGN': 'ignite'
    , 'KEY': 'option_1508950285425'
    , 'MTE':'1523906206279002QAu9'
    , 'PKN': '1527614217434001RBfj'
    , 'PRO': '1518464344082003KVWZ'
    , 'PRM': '1518464344082002KM3k'
    , 'TRN': '1518464344082001KHky'
    , 'SIG': 'sig'
    , 'THT': 'ts'
    , 'TLD': '1537894888625001RriS'
    , 'TIP': '1517517756579001F3CR'
    , 'TUT': 'tutorial'
    , 'TRN': '1518464344082001KHky'
}
                
print("Mapping of Session Type to header-code", sessionTypes)



Mapping of Session Type to header-code {'BOF': '1518466139979001dQkv', 'BQS': 'bqs', 'BUS': '1519240082595001EpMm', 'CAS': 'casestudy', 'DEV': '1522435540042001BxTD', 'ESS': 'ess', 'FLP': 'flp', 'GEN': 'general', 'HOL': 'hol', 'HOM': 'hom', 'IGN': 'ignite', 'KEY': 'option_1508950285425', 'MTE': '1523906206279002QAu9', 'PKN': '1527614217434001RBfj', 'PRO': '1518464344082003KVWZ', 'PRM': '1518464344082002KM3k', 'TRN': '1518464344082001KHky', 'SIG': 'sig', 'THT': 'ts', 'TLD': '1537894888625001RriS', 'TIP': '1517517756579001F3CR', 'TUT': 'tutorial'}


Two functions are defined to make API calls for a conference and specific session type. The session details are retrieved in batched of 50; for any session type for which there are more than 50 sessions of a certain type, multiple calls have to be made.

Function `loadSessionDataForSessionType` coordinates the API calling - handling the batches for example and converting the relevant section from the raw JSON response into a Pandas DataFrame object. This function leverages `loadSessionDataForSessionTypeStartingAt` to make the actual HTTP requests to the Session Catalog API - for session details for a specific conference and a session type from a certain offset.

In [22]:
def loadSessionDataForSessionTypeStartingAt(conference, sessionType, startingAt):
    querystring = {"search.sessiontype":sessionType} 
    if conference == "codeone":
       headers['rfapiprofileid'] = CodeOne_rfapiprofileid
       headers['rfwidgetid'] = CodeOne_rfwidgetid
    else :
       headers['rfapiprofileid'] = OOW_rfapiprofileid
       headers['rfwidgetid'] = OOW_rfwidgetid
    # make the HTTP request
    response = requests.request("POST", url, data=payload.format(startingAt), headers=headers, params=querystring)
    return response.text

def loadSessionDataForSessionType(conference, sessionType):
    tempDict = json.loads(loadSessionDataForSessionTypeStartingAt(conference,  sessionType,0))
    sl= tempDict ["sectionList"][0]
    sessions = sl["items"]
    total = sl.get("total",-1) # indicates the total number of sessions of type sessionType
    received = sl['numItems']
    startingAt = sl['from']
    print(" ** Received for session type ",sessionType," a set of ", received , "items, out of a total of ", total);
    if total > received:
        # figure out how many calls are required to get all data for the sessions of this type
        requestTotal = math.ceil(total/50)
        print ("number of additional requests = ",requestTotal-1)
        for k in range(1,requestTotal):
            # for subsequent request, the response does not contain the nested sectionList element
            tempDict = json.loads(loadSessionDataForSessionTypeStartingAt( key, sessionType, k*50))
            items = tempDict['items']

            received = tempDict['numItems']
            startingAt = tempDict['from']
            print(" ** Received from " , startingAt," set of ", received , "items, out of a total of ", total);
            #combine sessions with new set of items
            sessions = sessions + items
    #the JSON object is converted to a Dict; this Dict contains scalar values such as strings and more complex values such as Arrarys and nested Dicts)
    ss = pd.DataFrame(sessions)
    return ss;

Let's make a single call for one session type at CodeOne, to get a feel for what's happening. The result is written to a JSON data file in our data lake (i.e. folder `./data`). 

In [23]:
#get session details for session type BOF (Birds of a Feather) at the CodeOne conference
key = 'BOF'
ss = loadSessionDataForSessionType('codeone',sessionTypes[key])
# write details to a JSON file called oow2018-sessions_codeone_BOF.json in the datalake (folder ./data)
ss.to_json("{0}oow2018-sessions_codeone_{1}.json".format(dataLake, key), force_ascii=False)
# show the first five entries in the data frame:
ss.head(5)

 ** Received from  0  set of  43 items, out of a total of  43


Unnamed: 0,abbreviation,abstract,allowDoubleBooking,attributevalues,code,codeParts,code_id,es_metadata_id,event,eventCode,...,type_displayorder,type_displayorder_string,useDoubleBooking,useWaitingList,videos,viewAccess,viewAccessPublic,viewFileAccess,waitlistAccess,waitlistLimit
0,BOF6086,Data is the essential fuel to every analytic p...,0,"[{'value': 'Beginner', 'attributevalue_id': '1...",BOF6086,"{'alpha0': 'BOF', 'numeric1': '6086'}",bof6086,1526592270835001iepf,Oracle OpenWorld,oow18,...,9999.0,9999,True,0,[],"[1533140082245001une2, 1533141196635001EZdJ, 1...",True,[],"[1533141196635001EZdJ, 1533143193464001uQsQ]",0
1,BOF4977,Microservices are independent—sure. Complex tr...,0,"[{'value': 'Intermediate', 'attributevalue_id'...",BOF4977,"{'alpha0': 'BOF', 'numeric1': '4977'}",bof4977,1525702590325001nlzi,Oracle OpenWorld,oow18,...,9999.0,9999,True,0,[],"[1533140082245001une2, 1533141196635001EZdJ, 1...",True,[],"[1533141196635001EZdJ, 1533143193464001uQsQ]",0
2,BOF5402,This session takes the audience through the ri...,0,"[{'value': 'Beginner', 'attributevalue_id': '1...",BOF5402,"{'alpha0': 'BOF', 'numeric1': '5402'}",bof5402,15259703423020012GvZ,Oracle Code One,oow18,...,9999.0,9999,True,0,[],"[1533140082245001une2, 1533141196635001EZdJ, 1...",True,[],"[1533141196635001EZdJ, 1533143193464001uQsQ]",0
3,BOF4909,Leveraging the developer session “Evolutionary...,0,"[{'value': 'Advanced', 'attributevalue_id': '1...",BOF4909,"{'alpha0': 'BOF', 'numeric1': '4909'}",bof4909,1525400274646001Otyb,Oracle Code One,oow18,...,9999.0,9999,True,0,[],"[1533140082245001une2, 1533141196635001EZdJ, 1...",True,[],"[1533141196635001EZdJ, 1533143193464001uQsQ]",0
4,BOF5039,"<i>Cloud-native</i>. It’s a great term, one th...",0,"[{'value': 'Intermediate', 'attributevalue_id'...",BOF5039,"{'alpha0': 'BOF', 'numeric1': '5039'}",bof5039,15257821699770013WRE,Oracle Code One,oow18,...,9999.0,9999,True,0,,"[1533140082245001une2, 1533141196635001EZdJ, 1...",True,[],"[1533141196635001EZdJ, 1533143193464001uQsQ]",0


You can now check in folder `./data` for a file called oow2018-sessions_codeone_BOF.json that was created just now and contains in JSON format the fairly raw data on Bird of a Feathers sessions.

## Gather all raw session data into the Data Lake

The next cell will gather all session details for all types of sessions at both conferences (Oracle OpenWorld and CodeOne). This involves a substantial number of HTTP calls to the public API and a sizable data volume. 

*Note: Running the next cell can take several minutes.* 

In [24]:
#For both conferences, loop over all session types and invoke function loadSessionDataForSessionType
print("Loading CodeOne details")
for key, value in sessionTypes.items():
    ss = loadSessionDataForSessionType('codeone', value)
    ss.to_json("{0}oow2018-sessions_codeone_{1}.json".format(dataLake, key), force_ascii=False)

print("Loading OOW details")
for key, value in sessionTypes.items():
    ss = loadSessionDataForSessionType('oow', value)
    ss.to_json("{0}oow2018-sessions_oow_{1}.json".format(dataLake, key), force_ascii=False)
    


Loading CodeOne details
 ** Received from  0  set of  43 items, out of a total of  43
 ** Received from  0  set of  0 items, out of a total of  0
 ** Received from  0  set of  7 items, out of a total of  7
 ** Received from  0  set of  7 items, out of a total of  7
 ** Received from  0  set of  50 items, out of a total of  398
number of additional requests =  7
 ** Received from  50  set of  0 items, out of a total of  398
 ** Received from  100  set of  0 items, out of a total of  398
 ** Received from  150  set of  0 items, out of a total of  398
 ** Received from  200  set of  0 items, out of a total of  398
 ** Received from  250  set of  0 items, out of a total of  398
 ** Received from  300  set of  0 items, out of a total of  398
 ** Received from  350  set of  0 items, out of a total of  398
 ** Received from  0  set of  0 items, out of a total of  0
 ** Received from  0  set of  0 items, out of a total of  0
 ** Received from  0  set of  1 items, out of a total of  1
 ** Recei

The Data Lake should now contain 44 JSON files - for 22 session types at each of the two conferences. The next cell checks the file system and lists all files in the *Data Lake*.

In [26]:
import pathlib

# define the path
dataLakeDirectory = pathlib.Path(dataLake)

for currentFile in dataLakeDirectory.iterdir():  
    print(currentFile)

data/oow2018-sessions_codeone_PKN.json
data/oow2018-sessions_oow_ESS.json
data/oow2018-sessions_codeone_PRO.json
data/oow2018-sessions_oow_TIP.json
data/oow2018-sessions_oow_THT.json
data/oow2018-sessions_codeone_DEV.json
data/oow2018-sessions_codeone_BOF.json
data/oow2018-sessions_codeone_BUS.json
data/oow2018-sessions_codeone_MTE.json
data/oow2018-sessions_oow_PKN.json
data/oow2018-sessions_codeone_PRM.json
data/oow2018-sessions_codeone_TLD.json
data/oow2018-sessions_oow_PRM.json
data/oow2018-sessions_codeone_HOL.json
data/oow2018-sessions_codeone_KEY.json
data/oow2018-sessions_oow_TRN.json
data/oow2018-sessions_oow_HOM.json
data/oow2018-sessions_codeone_BQS.json
data/oow2018-sessions_oow_TLD.json
data/oow2018-sessions_codeone_TRN.json
data/oow2018-sessions_oow_KEY.json
data/oow2018-sessions_codeone_ESS.json
data/oow2018-sessions_oow_CAS.json
data/oow2018-sessions_oow_FLP.json
data/oow2018-sessions_oow_BQS.json
data/oow2018-sessions_codeone_TIP.json
data/oow2018-sessions_codeone_CAS.

As a final check, let's load the contents from a randomly selected file into a Pandas Data Frame. If that is successful, we consider the data gathering stage complete. Time for some `data wrangling` in <a href="./2-OOW2018 Session Catalog - Wrangling I - one single, refined, reshaped file.ipynb" target="_new">Step 2 - Data Wrangling</a>.

In [29]:
#as a test, try to load data from one of the generated files 
conference = 'oow' # could also be codeone
sessionType = 'HOL' # could also be one of 21 other values such as TUT, DEV, GEN, BOF,...
sessionPandas = pd.read_json("{0}oow2018-sessions_{1}_{2}.json".format(dataLake, conference, sessionType))
sessionPandas.head(5)

Unnamed: 0,abbreviation,abstract,allowDoubleBooking,attributevalues,code,codeParts,code_id,es_metadata_id,event,eventCode,...,type_displayorder,type_displayorder_string,useDoubleBooking,useWaitingList,videos,viewAccess,viewAccessPublic,viewFileAccess,waitlistAccess,waitlistLimit
0,HOL6306,"Enhance the security of your Oracle HCM, ERP, ...",0,"[{'value': 'Beginner', 'attributevalue_id': '1...",HOL6306,"{'alpha0': 'HOL', 'numeric1': '6306'}",hol6306,15311715163230013Hbl,Oracle OpenWorld,oow18,...,9999,9999,True,0,,"[1533140082245001une2, 1533141196635001EZdJ, 1...",True,[],"[1533141196635001EZdJ, 1533143193464001uQsQ]",0
1,HOL6286,Oracle Data Integration Platform Cloud allows ...,0,"[{'value': 'Beginner', 'attributevalue_id': '1...",HOL6286,"{'alpha0': 'HOL', 'numeric1': '6286'}",hol6286,1530813370901001gB2W,Oracle OpenWorld,oow18,...,9999,9999,True,0,,"[1533140082245001une2, 1533141196635001EZdJ, 1...",True,[],"[1533141196635001EZdJ, 1533143193464001uQsQ]",0
10,HOL6325,In this session learn to develop and deploy a ...,0,"[{'value': 'Beginner', 'attributevalue_id': '1...",HOL6325,"{'alpha0': 'HOL', 'numeric1': '6325'}",hol6325,15313473829130018nP4,Oracle OpenWorld,oow18,...,9999,9999,True,0,[],"[1533140082245001une2, 1533141196635001EZdJ, 1...",True,[],"[1533141196635001EZdJ, 1533143193464001uQsQ]",0
11,HOL6328,In this session learn about the latest feature...,0,"[{'value': 'Beginner', 'attributevalue_id': '1...",HOL6328,"{'alpha0': 'HOL', 'numeric1': '6328'}",hol6328,1531350426847001WHKn,Oracle OpenWorld,oow18,...,9999,9999,True,0,[],"[1533140082245001une2, 1533141196635001EZdJ, 1...",True,[],"[1533141196635001EZdJ, 1533143193464001uQsQ]",0
12,HOL6331,In this hands-on lab learn the basics of Oracl...,0,"[{'value': 'Beginner', 'attributevalue_id': '1...",HOL6331,"{'alpha0': 'HOL', 'numeric1': '6331'}",hol6331,1531436407190001W0ep,Oracle OpenWorld,oow18,...,9999,9999,True,0,,"[1533140082245001une2, 1533141196635001EZdJ, 1...",True,[],"[1533141196635001EZdJ, 1533143193464001uQsQ]",0


# Resources
The public website for the session catalog for Oracle OpenWorld 2018: https://events.rainfocus.com/widget/oracle/oow18/catalogoow18? . 

The session catalog for the co-located CodeOne conference is published at https://events.rainfocus.com/widget/oracle/oow18/catalogcodeone18?

The root endpoint for the REST API for session catalog data: https://events.rainfocus.com/api/search.

Dataquest Data Science Blog — Jupyter Notebook for Beginners: A Tutorial https://www.dataquest.io/blog/jupyter-notebook-tutorial/

The Pandas DataFrame – loading, editing, and viewing data in Python - by Shane Lynn https://www.shanelynn.ie/using-pandas-dataframe-creating-editing-viewing-data-in-python/