<a href="https://colab.research.google.com/github/predicthq/phq-data-science-docs/blob/master/academic-events/part_1_data_engineering.ipynb" target="_blank"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##### Academic Events Data Science Guides

# Part 1: Data Engineering

PredictHQ Academic Events data is event data related to Colleges and Universities.

This *How to Series* allows you to quickly extract the data (Part 1), explore the data (Part 2) and experiment with different aggregations (Part 3).

For more information about the category, see the [Academic Events category documentation](https://docs.predicthq.com/categoryinfo/attended-events/#academic-events).

<b>This How to Guide, Part 1, is how to extract data from PredictHQ's Academic Events and covers:</b>

- [Setup](#setup)
- [Access Token](#access_token)
- [Support Functions](#support_functions)
- [SDK Parameters](#sdk_parameters)
- [SDK Call](#sdk_call)
- [Output Dataframe](#output_dataframe)
- [Appendix - Finding Place ID](#appendix)

<a id='setup'></a>
## Setup

If you are using Google Colab, uncomment the following code block.

In [1]:
# %%capture
# !git clone https://github.com/predicthq/phq-data-science-docs.git
# %cd phq-data-science-docs/academic-events
# !pip install "predicthq>=1.6.3" timezonefinder

If running locally, configure the required dependencies in your Python environment by using the [requirements.txt](https://github.com/predicthq/phq-data-science-docs/blob/master/academic-events/requirements.txt) file which is shared alongside the notebook.

These requirements can be installed by running the command `pip install -r requirements.txt`.

In [5]:
import pandas as pd
from datetime import datetime
from datetime import timedelta
from timezonefinder import TimezoneFinder
import pytz

from predicthq import Client
import requests

<a id='access_token'></a>
## Access Token

To query the API, you will need an access token. 

The following link will guide you through creating an account and access token. 

 - https://docs.predicthq.com/guides/quickstart/

In [32]:
# Replace the placeholder below with your own access token.
ACCESS_TOKEN = '<REPLACE WITH YOUR ACCESS TOKEN>'
phq = Client(access_token=ACCESS_TOKEN)
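
Hard-coding a token in a shared notebook makes it easy to leak. As a minimal sketch, the token could instead be read from an environment variable. The variable name `PHQ_ACCESS_TOKEN` and the helper `load_access_token` are assumptions made for this illustration and are not part of the PredictHQ SDK.

```python
import os


def load_access_token(env_var='PHQ_ACCESS_TOKEN'):
    """Read the access token from an environment variable (hypothetical name)."""
    token = os.environ.get(env_var, '').strip()
    # Reject an unset variable or the unedited placeholder from this guide.
    if not token or token.startswith('<REPLACE'):
        raise RuntimeError(
            'Set the {} environment variable to your access token.'.format(env_var))
    return token


# Usage (commented out so the sketch runs without a token being set):
# ACCESS_TOKEN = load_access_token()
# phq = Client(access_token=ACCESS_TOKEN)
```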

<a id='support_functions'></a>
## Support Functions

We recommend creating four additional features from the raw data:

#### 1) Sub-category (sub_category)
Each event is assigned to one of five sub-categories that apply to academic events:

- holiday
- academic-session
- exam
- graduation
- social

Unlike the PredictHQ school holidays and public holidays categories, academic holiday events have an associated attendance. This represents the full-time undergraduate population that will be on holiday. An Academic Events holiday is also tied to a more specific location: the campus that students are on holiday from. Holidays in the academic events category are therefore likely to represent decremental demand in these locations.

#### 2) Session Type (session_type)
This summarises whether the event is attended physically, virtually or both. The three options for this field are:

- in-person session
- online session
- hybrid session

The attendance figures for these sessions have already been adjusted for online and hybrid delivery, so they still represent physical attendance.
        
#### 3) Estimated (estimated)
Events are added when the academic calendar is released. For recurrent events, these can be estimated in advance of the official calendar release. Estimated events have an 'estimate' label applied. Estimated events can also apply to historic events where the event was added to our system but official historic calendars are not available. This field mainly applies to academic sessions and holidays.

#### 4) On Campus (on_campus)
Not all events occur on campus. The calculation logic is:

- All holidays are classed as off campus.
- All other events are classed as on campus.

Because online and hybrid session attendance figures have already been adjusted to include only the students physically attending, these events are also assigned as on campus.

In [17]:
def extract_matching_label(event_labels, labels_to_match):
    '''Return the first label from labels_to_match that appears in
    event_labels, or None if there is no match.
    The order of an event's labels varies, so each candidate label
    in the frozenset of options is checked for membership.
    '''
    for label in labels_to_match:
        if label in event_labels:
            return label
    return None


SUB_CATEGORY = frozenset([
                          'academic-session',
                          'exam',
                          'graduation',
                          'holiday',
                          'social',
                          ])

SESSION_TYPE = frozenset([
                          'online-session',
                          'hybrid-session',
                         ])

ESTIMATED = frozenset([
                      'estimated',
                      ])


def extract_entity_name(row):
    '''Return the name of the event's first entity (its venue),
    or None if the event has no entities.'''
    if len(row['entities']) > 0:
        return row['entities'][0]['name']
    else:
        return None
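
As a rough illustration of how the four recommended features fall out of an event's labels, the sketch below applies the same lookup to two hand-written events. The helper names `first_match` and `derive_features` and both sample events are hypothetical; the default values `'in-person'` and `'not_estimated'` mirror the ones used in the SDK call later in this guide.

```python
SUB_CATEGORY = frozenset(['academic-session', 'exam', 'graduation', 'holiday', 'social'])
SESSION_TYPE = frozenset(['online-session', 'hybrid-session'])
ESTIMATED = frozenset(['estimated'])


def first_match(event_labels, labels_to_match):
    """Return a label from labels_to_match that is present in event_labels, else None."""
    for label in labels_to_match:
        if label in event_labels:
            return label
    return None


def derive_features(event):
    """Derive the four recommended features from one raw event dict."""
    labels = event['labels']
    sub_category = first_match(labels, SUB_CATEGORY)
    return {
        'sub_category': sub_category,
        'session_type': first_match(labels, SESSION_TYPE) or 'in-person',
        'estimated': first_match(labels, ESTIMATED) or 'not_estimated',
        # Holidays are off campus; everything else is treated as on campus.
        'on_campus': sub_category != 'holiday',
    }


# Hypothetical events; only the 'labels' field is used here.
session = {'labels': ['academic', 'academic-session', 'hybrid-session']}
holiday = {'labels': ['academic', 'holiday', 'estimated']}

session_features = derive_features(session)
holiday_features = derive_features(holiday)
print(session_features)
print(holiday_features)
```

Note that a frozenset has no defined iteration order, so if an event ever carried two labels from the same set, which one is returned would not be guaranteed.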

<a id='sdk_parameters'></a>
## SDK Parameters

We will create a dictionary of notable parameters and walk through each of the settings to use in the SDK call. A full list of available parameters and details of the API can be found in our [API documentation](https://docs.predicthq.com/resources/events/#search-events).

In [18]:
parameters_dict = dict()

#### Location

There are two options available to specify a location of interest:

 - place_id

 - radius @ latitude and longitude


`place__scope=place_id` or `within='radiusmi@lat,long'`

If you do not know the place_id for the location of interest, you can search for it using the API call in the [Appendix](#appendix).

For the SDK call, you can specify your own location of interest. Here are four default locations to query as an example:

 - Austin, Texas: 4671654 or (30.2785, -97.7395)
 - Los Angeles, California: 5368361 or (34.0778, -118.3602)
 - Chicago, Illinois: 4887398 or (41.8048, -87.5871)
 - Tallahassee, Florida: 4174715 or (30.4420, -84.2845)


In [19]:
parameters_dict.update(within='100mi@34.0778,-118.3602')

# m - Meters.
# km - Kilometers.
# ft - Feet.
# mi - Miles.
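
The `within` value is just a formatted string, so a small helper can reduce typos when switching units or coordinates. The sketch below is one possible way to build it; `build_within` and `ALLOWED_UNITS` are illustrative names, and the unit check only covers the four units listed above.

```python
ALLOWED_UNITS = ('m', 'km', 'ft', 'mi')


def build_within(radius, unit, latitude, longitude):
    """Format a radius search string such as '100mi@34.0778,-118.3602'."""
    if unit not in ALLOWED_UNITS:
        raise ValueError('unit must be one of {}'.format(ALLOWED_UNITS))
    return '{}{}@{},{}'.format(radius, unit, latitude, longitude)


within = build_within(100, 'mi', 34.0778, -118.3602)
print(within)  # 100mi@34.0778,-118.3602

# Usage with the parameters dictionary from this guide:
# parameters_dict.update(within=within)
```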

#### Time Limits 

Define the period of time for which events will be returned. Either ```start``` or ```active``` can be used. ```start``` returns events that start within the given time period. ```active``` returns all events that are active within the time period, even if they started before it began.
 
You can combine either parameter with any of these suffixes, depending on your time period of interest:

```gte - Greater than or equal.``` <br>
```gt - Greater than.```<br>
```lte - Less than or equal.```<br>
```lt - Less than.```<br>
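
To make the difference between `start` and `active` concrete, the sketch below emulates both filters locally with inclusive bounds (the `gte` and `lte` suffixes). It only mirrors the behaviour described above and is not how the API is implemented; the function names and the sample event dates are hypothetical.

```python
from datetime import date


def matches_start(event_start, period_start, period_end):
    """True if the event starts inside the period (start.gte / start.lte)."""
    return period_start <= event_start <= period_end


def matches_active(event_start, event_end, period_start, period_end):
    """True if the event overlaps the period at any point (active.gte / active.lte)."""
    return event_start <= period_end and event_end >= period_start


# Hypothetical academic session that began before the period of interest.
event_start, event_end = date(2018, 8, 27), date(2019, 5, 10)
period_start, period_end = date(2019, 1, 1), date(2021, 12, 14)

found_by_start = matches_start(event_start, period_start, period_end)
found_by_active = matches_active(event_start, event_end, period_start, period_end)
print(found_by_start, found_by_active)  # False True
```

For long-running events such as academic sessions, filtering on `active` is therefore the safer choice when you need everything that affects a given period.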


```start__tz``` or ```active__tz``` allows you to set the timezone to align with the location of interest. If no timezone is provided, UTC is used as default. This can lead to missing events at the edge of your time period, where they may not fall within the date range based on UTC, but fall within the dates based on the local timezone.

```parameters_dict.update(start__tz='America/Chicago')``` 

The <a href="https://en.wikipedia.org/wiki/List_of_tz_database_time_zones">tz database</a> list can help you find the timezone name.
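
The edge effect described above can be reproduced with plain `datetime` arithmetic. In this sketch a hypothetical event starts at 8 pm local time on the last day of the period: it falls inside the date range in local time but outside it in UTC. A fixed UTC-8 offset stands in for the local timezone purely for illustration.

```python
from datetime import date, datetime, timedelta, timezone

# Fixed UTC-8 offset used purely for illustration; real queries pass a
# tz database name such as 'America/Los_Angeles' instead.
LOCAL_TZ = timezone(timedelta(hours=-8))

end_date = date(2021, 12, 14)

# Hypothetical event starting at 8 pm local time on the last day of the period.
start_local = datetime(2021, 12, 14, 20, 0, tzinfo=LOCAL_TZ)
start_utc = start_local.astimezone(timezone.utc)

within_period_local = start_local.date() <= end_date
within_period_utc = start_utc.date() <= end_date

print(start_utc.isoformat())                   # 2021-12-15T04:00:00+00:00
print(within_period_local, within_period_utc)  # True False
```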

In [20]:
# Set your chosen start and end date.
START_DATE = '2019-01-01'
END_DATE = '2021-12-14'
parameters_dict.update(active__gte=START_DATE, active__lte=END_DATE) 
# parameters_dict.update(start__gte=START_DATE, start__lte=END_DATE)  # Alternative use of start

In [21]:
# timezonefinder looks up the timezone name for a latitude and longitude.
timezone = TimezoneFinder().timezone_at(lat=34.0778, lng=-118.3602)
print(timezone)

America/Los_Angeles


In [22]:
parameters_dict.update(active__tz='America/Los_Angeles')

#### Category

These notebooks only relate to the 'academic' category. For other categories please see the relevant documentation. 

In [23]:
parameters_dict.update(category=['academic']) 

#### Limits  `limit=500`

 - Pulling historical data for a long time period returns many results. To speed up execution, set ```limit``` to the highest available setting (500). Each call to the API then returns up to 500 results, which reduces the number of calls needed to retrieve a large dataset.

In [24]:
parameters_dict.update(limit=500) 
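
A quick back-of-the-envelope calculation shows why the page size matters. The sketch below assumes one API request per page of results; the total of 12,000 events and the smaller page size of 10 are hypothetical figures chosen for comparison.

```python
import math


def requests_needed(total_events, limit):
    """Number of paged requests needed to retrieve total_events results."""
    return math.ceil(total_events / limit)


total_events = 12000  # hypothetical result count
small_pages = requests_needed(total_events, 10)
large_pages = requests_needed(total_events, 500)
print(small_pages, large_pages)  # 1200 24
```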

In [25]:
# For example:
parameters_dict

{'within': '100mi@34.0778,-118.3602',
 'active__gte': '2019-01-01',
 'active__lte': '2021-12-14',
 'active__tz': 'America/Los_Angeles',
 'category': ['academic'],
 'limit': 500}

<a id='sdk_call'></a>
## SDK Call

Loop through the call to the API for each location of interest.

The data for each location is saved to csv as an example output. This can be adjusted to work with your own data pipeline.

In [30]:
# To run for your own location of interest.
# Either replace list of place ids.
# Or replace list of lat and long.
LIST_OF_PLACEID = [4671654, 5368361, 4887398, 4174715]
LIST_OF_LAT_LONG = [['30.2785', '-97.7395'],
                    ['34.0778', '-118.3602'],
                    ['41.8048', '-87.5871'],
                    ['30.4420', '-84.2845']]
TIMEZONES = ['America/Chicago',
             'America/Los_Angeles',
             'America/Chicago',
             'America/New_York']

START_DATE = '2019-01-01'
END_DATE = '2021-01-01'
# unit can be changed (currently set to miles)
RADIUS = 10
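
The timezone and location lists are matched by position, and Python's `zip` silently stops at the shortest list, so a missing entry would drop a location without any error. A minimal sketch of a guard is shown below; `pair_locations` is a hypothetical helper, not part of the original notebook.

```python
LIST_OF_LAT_LONG = [['30.2785', '-97.7395'],
                    ['34.0778', '-118.3602'],
                    ['41.8048', '-87.5871'],
                    ['30.4420', '-84.2845']]
TIMEZONES = ['America/Chicago',
             'America/Los_Angeles',
             'America/Chicago',
             'America/New_York']


def pair_locations(timezones, lat_longs):
    """Pair each timezone with its coordinates, failing loudly on a length mismatch."""
    if len(timezones) != len(lat_longs):
        raise ValueError('Expected one timezone per location, got {} and {}.'
                         .format(len(timezones), len(lat_longs)))
    return [{'timezone': tz, 'latitude': lat, 'longitude': lon}
            for tz, (lat, lon) in zip(timezones, lat_longs)]


locations = pair_locations(TIMEZONES, LIST_OF_LAT_LONG)
for location in locations:
    print(location)
```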

In the following example, comment or uncomment the marked lines depending on whether you are using place_ids or latitude and longitude.

In [31]:
# Define API parameters.
parameters_dict = dict()
parameters_dict.update(active__gte=START_DATE, active__lte=END_DATE) 
parameters_dict.update(category=['academic']) 
parameters_dict.update(limit=500) 

# Loop through each location of interest.
# Example code is provided to loop through either LIST_OF_PLACEID or LIST_OF_LAT_LONG.
# for timezone, place_id in zip(TIMEZONES, LIST_OF_PLACEID):  # uncomment/comment as required.
for timezone, lat_long in zip(TIMEZONES, LIST_OF_LAT_LONG):  # uncomment/comment as required.

    # parameters_dict.update(place__scope=place_id)  # uncomment/comment as required.
    parameters_dict.update(within='{}mi@{},{}'.format(RADIUS,
                                                      lat_long[0],
                                                      lat_long[1]))  # uncomment/comment as required.

    # If time zones are unknown, comment out this line to fall back to UTC.
    parameters_dict.update(active__tz=timezone)
    
    search_results = phq.events.search(**parameters_dict).iter_all()

    search_results = [result.to_dict() for result in search_results]

    df = pd.DataFrame(search_results)

    df['entity_name'] = df.apply(extract_entity_name, axis=1)

    df[['longitude', 'latitude']] = pd.DataFrame(df.location.tolist())

    # Create a list of unique entities.
    df_entities = df.drop_duplicates('entity_name')
    df_entities = df_entities[['entity_name',
                               'latitude',
                               'longitude']]


    df['sub_category'] = df.labels.apply(extract_matching_label,
                                         args=(SUB_CATEGORY, ))
    df['session_type'] = df.labels.apply(extract_matching_label,
                                         args=(SESSION_TYPE, ))
    df['estimated'] = df.labels.apply(extract_matching_label,
                                      args=(ESTIMATED, ))

    # Fill non-specified session_type with in-person.
    df['session_type'] = df['session_type'].fillna('in-person')
    # Fill non-specified estimated with not_estimated.
    df['estimated'] = df['estimated'].fillna('not_estimated')

    # Where events are missing attendance, fill with 0.
    # PredictHQ aims to have attendance for all events.
    # This assumption can be changed depending on your use case
    # (for example, the mean by sub-category or location).
    df['phq_attendance'] = df['phq_attendance'].fillna(0)

    # Holidays are off campus; all other events are on campus.
    df['on_campus'] = df['sub_category'] != 'holiday'

    # Naming functionality
    if 'within' in parameters_dict:
        file_name = ('radius_{}_{}_{}_{}_{}'
                    .format(RADIUS,
                            lat_long[0],
                            lat_long[1],
                            START_DATE,
                            END_DATE)
                     )      
    else:
        file_name = 'place_ids_{}_{}_{}'.format(place_id,
                                                START_DATE,
                                                END_DATE)

    df.to_csv('data/{}.csv'.format(file_name),
              index=False)
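
The loop above writes into a `data/` directory and will fail if that directory does not exist. As a hedged sketch, the file naming can be factored into a helper that also creates the directory. `build_output_path` is a hypothetical function, and the example writes to a temporary directory so that it can run anywhere.

```python
import tempfile
from pathlib import Path


def build_output_path(base_dir, start_date, end_date,
                      radius=None, lat_long=None, place_id=None):
    """Build the CSV path for one location, creating base_dir if needed."""
    if lat_long is not None:
        file_name = 'radius_{}_{}_{}_{}_{}'.format(
            radius, lat_long[0], lat_long[1], start_date, end_date)
    else:
        file_name = 'place_ids_{}_{}_{}'.format(place_id, start_date, end_date)
    output_dir = Path(base_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    return output_dir / '{}.csv'.format(file_name)


with tempfile.TemporaryDirectory() as tmp:
    path = build_output_path(Path(tmp) / 'data', '2019-01-01', '2021-01-01',
                             radius=10, lat_long=['34.0778', '-118.3602'])
    output_name = path.name
    directory_created = path.parent.is_dir()

print(output_name)  # radius_10_34.0778_-118.3602_2019-01-01_2021-01-01.csv
```

Inside the loop, the result could then be passed straight to `df.to_csv(path, index=False)`.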
    

The returned data is at the event level. In Part 2 of this *How to Series*, we will explore this data to understand the key trends. In Part 3, we'll prepare features to be used in a forecasting model.

<a id='output_dataframe'></a>
## Output Dataframe


In [25]:
df.head(2)

| | id | title | description | start | end | timezone | duration | category | labels | country | ... | updated | deleted_reason | duplicate_of_id | entity_name | longitude | latitude | sub_category | session_type | estimated | on_campus |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2LGRhd4ZTAHFtdTysF | Summer Session | | 2021-05-10 04:00:00+00:00 | 2021-07-31 03:59:59+00:00 | America/New_York | 7084799 | academic | [academic, academic-session, hybrid-session] | US | ... | 2021-01-26 04:26:47+00:00 | | | Florida Agricultural and Mechanical University | -84.285131 | 30.426857 | academic-session | hybrid-session | not_estimated | True |
| 1 | ZGxD9mFQPQ39afjCxJ | Summer Session | | 2021-05-10 04:00:00+00:00 | 2021-07-31 03:59:59+00:00 | America/New_York | 7084799 | academic | [academic, academic-session, hybrid-session] | US | ... | 2021-02-22 03:38:58+00:00 | | | Florida State University | -84.298489 | 30.441878 | academic-session | hybrid-session | not_estimated | True |
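
When reading these rows, note that `start` and `end` are shown in UTC while `timezone` holds the event's local timezone. The sketch below takes the values from the first row, checks that they are consistent with `duration` being expressed in seconds, and converts the timestamps to local time. A fixed UTC-4 offset (US Eastern daylight time) is assumed here for illustration only.

```python
from datetime import datetime, timedelta, timezone

# Values copied from the first row of the output above.
start = datetime.fromisoformat('2021-05-10 04:00:00+00:00')
end = datetime.fromisoformat('2021-07-31 03:59:59+00:00')

duration_seconds = int((end - start).total_seconds())
duration_days = duration_seconds / 86400

# Fixed UTC-4 offset used for illustration; in practice, convert with the
# tz database name from the event's 'timezone' field.
EASTERN_DAYLIGHT = timezone(timedelta(hours=-4))
start_local = start.astimezone(EASTERN_DAYLIGHT)
end_local = end.astimezone(EASTERN_DAYLIGHT)

print(duration_seconds)         # 7084799, matching the duration column
print(round(duration_days, 1))  # 82.0
print(start_local.isoformat())  # 2021-05-10T00:00:00-04:00
print(end_local.isoformat())    # 2021-07-30T23:59:59-04:00
```

In local time the session therefore runs from midnight on 10 May to the last second of 30 July.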


<a id='appendix'></a>
## Appendix: Finding ```place_id``` 

Here is a guide on how to link between locations and ```place_id```. Here the ```location``` could be a city, a state, a country or a continent. 

 - Query ```place_id``` based on ```location```
 - Query ```place_hierarchies``` based on ```latitude, longitude```
 - Query ```location``` based on ```place_id```

The full list of parameters you can use in your query is documented on our [Places API page](https://docs.predicthq.com/resources/places/).<br>PredictHQ uses the [GeoNames](https://www.geonames.org/) places convention.

#### 1) Query ```place_id``` based on ```location```

By using the PredictHQ Places API, you can find the ```place_id``` for a specific ```location```. Calling the API with ```q``` set to the ```location``` returns the matching places, and taking the top result gives the most relevant ```place_id``` for that ```location```.

In [16]:
# Example locations.
locations = ["Los Angeles", "California", "United States", "North America"]

place_id_rows = []

for location in locations:
    response = requests.get(
        url="https://api.predicthq.com/v1/places/",
        headers={
            "Authorization": "Bearer {}".format(ACCESS_TOKEN),
            "Accept": "application/json",
        },
        params={"q": location},
    )

    data = response.json()
    df = pd.json_normalize(data["results"])
    # Keep only the top (most relevant) result for each location.
    place_id_rows.append(df.iloc[[0]])

# DataFrame.append was removed in pandas 2.0, so combine the rows with pd.concat.
place_id_lookup = pd.concat(place_id_rows, ignore_index=True)

In [17]:
place_id_lookup[["id", "name", "type"]]

| | id | name | type |
|---|---|---|---|
| 0 | 5368361 | Los Angeles | locality |
| 1 | 5332921 | California | region |
| 2 | 6252001 | United States | country |
| 3 | 6255149 | North America | continent |
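
Taking the first row with `df.iloc[0]` raises an `IndexError` when a search returns no results, for example after a misspelt location. A minimal sketch of a guard on the parsed response is shown below. The helper `top_place` and both sample payloads are hypothetical, with real responses carrying more fields than the three shown.

```python
def top_place(payload):
    """Return id, name and type of the first result, or None if there are no results."""
    results = payload.get('results', [])
    if not results:
        return None
    first = results[0]
    return {key: first.get(key) for key in ('id', 'name', 'type')}


# Hypothetical payloads shaped like the parsed response (other fields omitted).
sample = {'results': [{'id': '5368361', 'name': 'Los Angeles', 'type': 'locality'}]}
empty = {'results': []}

best_match = top_place(sample)
no_match = top_place(empty)
print(best_match)
print(no_match)
```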


#### 2) Query ```place_hierarchies``` based on ```latitude, longitude```

By using the PredictHQ Places Hierarchies API, you can find the ```place_hierarchies``` for a specific ```latitude, longitude```. Calling the API with ```location.origin``` set to ```latitude, longitude``` returns the most relevant ```place_hierarchies```.

In [18]:
# Example locations.
latitude_longitudes = [[34.07, -118.25]]

hierarchy_frames = []

for latitude_longitude in latitude_longitudes:
    latitude, longitude = latitude_longitude
    response = requests.get(
        url="https://api.predicthq.com/v1/places/hierarchies",
        headers={
            "Authorization": "Bearer {}".format(ACCESS_TOKEN),
            "Accept": "application/json",
        },
        params={"location.origin": f"{latitude},{longitude}"},
    )

    data = response.json()
    df = pd.DataFrame(data)
    df["latitude"] = latitude
    df["longitude"] = longitude
    hierarchy_frames.append(df)

# DataFrame.append was removed in pandas 2.0, so combine the frames with pd.concat.
place_hierarchies_lookup = pd.concat(hierarchy_frames, ignore_index=True)

In [19]:
place_hierarchies_lookup

| | place_hierarchies | latitude | longitude |
|---|---|---|---|
| 0 | [6295630, 6255149, 6252001, 5332921, 5368381, ... | 34.07 | -118.25 |
| 1 | [6295630, 6255149, 6252001, 5332921, 5368381, ... | 34.07 | -118.25 |


For each ```latitude, longitude```, the response might include more than one hierarchy. The first is the hierarchy of the closest place. If that place is below region level, the response also includes the hierarchy of the closest major city within a radius of 50km, when one exists. In that case, the major city's hierarchy is always the second row of the DataFrame.
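
Each hierarchy is a list of place ids, which in the output above runs from the broadest place (Earth) towards more specific ones. The sketch below labels the ids using names that appear in the lookup tables of this guide. The sample payload is hypothetical: it contains only the leading ids visible in the truncated output, and ids are compared as strings because the exact type returned is an assumption.

```python
# Names taken from the lookup tables in this guide; other ids are left unnamed.
KNOWN_PLACES = {
    '6295630': 'Earth',
    '6255149': 'North America',
    '6252001': 'United States',
    '5332921': 'California',
}


def label_hierarchy(hierarchy):
    """Pair each place id with a known name, or 'unknown' if it is not in the table."""
    return [(str(place_id), KNOWN_PLACES.get(str(place_id), 'unknown'))
            for place_id in hierarchy]


# Hypothetical payload: the trailing ids elided in the output above are omitted.
sample = {'place_hierarchies': [
    [6295630, 6255149, 6252001, 5332921, 5368381],
    [6295630, 6255149, 6252001, 5332921, 5368381],
]}

hierarchies = sample['place_hierarchies']
closest_place = label_hierarchy(hierarchies[0])
# The major city's hierarchy, when present, is the second entry.
major_city = label_hierarchy(hierarchies[1]) if len(hierarchies) > 1 else None

print(closest_place)
```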

#### 3) Query ```location``` based on ```place_id```

By using the PredictHQ Places API, you can find the ```location``` for a specific ```place_id```. Calling the API with ```id``` set to the ```place_id``` returns the matching place, and taking the top result gives the ```location``` for that ```place_id```.

In [20]:
# Example locations.
place_ids = ["6295630", "6255148", "2510769", "2513413"]

location_rows = []

for place_id in place_ids:
    response = requests.get(
        url="https://api.predicthq.com/v1/places/",
        headers={
            "Authorization": "Bearer {}".format(ACCESS_TOKEN),
            "Accept": "application/json",
        },
        # The id could be a comma-separated list of place_ids. In this example, the
        # places are queried one place_id at a time.
        params={"id": place_id},
    )

    data = response.json()
    df = pd.json_normalize(data["results"])
    location_rows.append(df.iloc[[0]])

# DataFrame.append was removed in pandas 2.0, so combine the rows with pd.concat.
location_lookup = pd.concat(location_rows, ignore_index=True)

In [21]:
location_lookup[["id", "name", "type"]]

| | id | name | type |
|---|---|---|---|
| 0 | 6295630 | Earth | planet |
| 1 | 6255148 | Europe | continent |
| 2 | 2510769 | Spain | country |
| 3 | 2513413 | Murcia | region |
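
As the comment in the code above notes, `id` can also be a comma-separated list, which would let all four places be fetched in a single request. The sketch below only builds the query parameters; `build_id_params` is a hypothetical helper, and the request itself is left commented out because it needs a valid token.

```python
def build_id_params(place_ids):
    """Join place ids into the comma-separated form accepted by the id parameter."""
    return {'id': ','.join(str(place_id) for place_id in place_ids)}


place_ids = ["6295630", "6255148", "2510769", "2513413"]
params = build_id_params(place_ids)
print(params)  # {'id': '6295630,6255148,2510769,2513413'}

# Usage with the request from this appendix:
# response = requests.get(
#     url="https://api.predicthq.com/v1/places/",
#     headers={"Authorization": "Bearer {}".format(ACCESS_TOKEN),
#              "Accept": "application/json"},
#     params=params,
# )
```

With a combined request, every row of the results is a requested place, so all rows should be kept rather than only the first.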
