# Live TV Events Data Science Guides - Part 1: Data Extraction

PredictHQ’s Live TV Events data includes viewership predictions for seven top US leagues: NFL, NBA, NHL, MLB, D1 NCAA Basketball, D1 NCAA American Football, and MLS. Our TV viewership data is designed to help data scientists improve forecasting at the county and store level. This How to Series shows you how to quickly extract the data (Part 1), explore the data (Part 2), and experiment with different aggregations (Part 3).

<b>A How To Guide to extracting data from PredictHQ's Live TV Events.</b>

- [Setup](#setup)
- [Access Token](#access_token)
- [Support Functions](#support_functions)
- [SDK Parameters](#sdk_parameters)
- [SDK Call](#sdk_call)
- [Output Dataframe](#output_dataframe)
- [Appendix - Finding County place_id](#appendix)

<a id='setup'></a>
## Setup

If using Google Colab, uncomment the following code block.

In [2]:
# %%capture
# !git clone https://github.com/predicthq/phq-data-science-docs.git
# %cd phq-data-science-docs/live-tv-events
# !pip install predicthq

If running locally, set up a Python environment using the ```requirements.txt``` shared alongside the notebook to install the required dependencies.

In [1]:
import pandas as pd
from datetime import datetime
from datetime import timedelta
import numpy as np
import pytz

from predicthq import Client
import requests

<a id='access_token'></a>
## Access Token

To query the API, you will need an access token. If you have previously used the PredictHQ API to search and use events, you may still need to create a new access token to query broadcasts.

The following link will guide you through creating an account and access token. 

 - https://docs.predicthq.com/guides/quickstart/

In [24]:
# Replace the placeholder below with your own access token.
ACCESS_TOKEN = 'REPLACE_WITH_ACCESS_TOKEN'
phq = Client(access_token=ACCESS_TOKEN)
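
To keep a real token out of a shared notebook, one option is to read it from an environment variable. The sketch below is a minimal illustration and not part of the PredictHQ SDK: the variable name `PHQ_ACCESS_TOKEN` and the helper `load_access_token` are assumptions made for this example.

```python
import os

PLACEHOLDER_TOKEN = 'REPLACE_WITH_ACCESS_TOKEN'


def load_access_token(environ=os.environ, variable='PHQ_ACCESS_TOKEN'):
    """Return the token stored in the given environment mapping.

    `PHQ_ACCESS_TOKEN` is a hypothetical variable name chosen for this
    sketch; use whatever name fits your own setup.
    """
    token = environ.get(variable, '').strip()
    if not token or token == PLACEHOLDER_TOKEN:
        raise ValueError('Set {} to a valid access token.'.format(variable))
    return token


# Example with an explicit mapping, so the sketch runs without a real token.
example_token = load_access_token({'PHQ_ACCESS_TOKEN': 'abc123'})
```

The returned value can then be passed to `Client(access_token=...)` in place of the hard-coded string.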

Live TV Events data is available through the Broadcasts API.

#### Events Coverage
The Broadcasts API returns each live sports broadcast for the following seven sports leagues in the US:

- NFL
- NBA
- NHL
- MLB
- D1 NCAA Basketball
- D1 NCAA American Football
- MLS

(Only live games are included. There are no replays.)

#### Spatial Granularity

Data is available for the United States at county-level granularity.

#### Features

Each broadcast is provided with predicted viewership at the US county level. Additional data is available about the event, such as physical location and duration.

#### Date Availability

January 1, 2018 to 2 weeks into the future.

<a id='support_functions'></a>
## Support Functions

Each broadcast relates to a physical sports event from the PredictHQ events knowledge graph, and additional data about that event is returned with the broadcast. For example, the league and sport of the broadcast are included within the labels field. The following functions make it easier to extract the sport and league for each broadcast.

In [12]:
def _extract_matching_label(event_labels, labels_to_match):
    '''
    Return the first entry of labels_to_match found in event_labels.
    The league and sport of each broadcast are stored in the event's
    labels. As the order of the labels varies, each candidate is
    checked against the event's labels; None is returned if no
    candidate matches.
    '''
    for label in labels_to_match:
        if label in event_labels:
            return label
    return None


SPORTS = frozenset([
        'american-football',
        'baseball',
        'basketball',
        'ice-hockey',
        'soccer',
    ])
LEAGUES = frozenset([
        'mlb',
        'mls',
        'nba',
        'ncaa',
        'nfl',
        'nhl',
    ])

def convert_timezone(row):
    '''Convert event predicted end time to 
    broadcast location timezone from the event timezone.
    '''
    event_end_naive = row['dates_event']['predicted_end_local']
    event_timezone = pytz.timezone(row['dates_event']['timezone'])

    event_end_localtime = event_timezone.localize(event_end_naive, is_dst=None)
    event_end_utc = event_end_localtime.astimezone(pytz.utc)

    broadcast_timezone = row['dates_broadcast']['timezone']
    broadcast_end_localtime = event_end_utc.astimezone(pytz.timezone(broadcast_timezone))
    row['predicted_end_time_broadcast_local'] = broadcast_end_localtime.replace(tzinfo=None)

    return row
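
The two helpers above can be tried on hand-made inputs before calling the API. The following is a minimal, dependency-free sketch under stated assumptions: the sample labels and times are invented for illustration, `first_matching_label` is a stand-in copy of the label lookup, and fixed UTC offsets replace the named time zones that `pytz` handles in the real function (so daylight saving is not modelled here).

```python
from datetime import datetime, timedelta, timezone

SPORTS = frozenset(['american-football', 'baseball', 'basketball',
                    'ice-hockey', 'soccer'])
LEAGUES = frozenset(['mlb', 'mls', 'nba', 'ncaa', 'nfl', 'nhl'])


def first_matching_label(event_labels, labels_to_match):
    """Return the first candidate label present in event_labels, else None."""
    for label in labels_to_match:
        if label in event_labels:
            return label
    return None


# Hypothetical labels for one event; the order of labels is not guaranteed.
sample_labels = ['sport', 'nba', 'basketball']
sport = first_matching_label(sample_labels, SPORTS)
league = first_matching_label(sample_labels, LEAGUES)

# Assumption: fixed offsets stand in for the event and broadcast time zones.
EVENT_OFFSET = timezone(timedelta(hours=-5))      # e.g. US Eastern, winter
BROADCAST_OFFSET = timezone(timedelta(hours=-8))  # e.g. US Pacific, winter

# Naive local end time at the venue (hypothetical value).
event_end_naive = datetime(2018, 12, 31, 22, 20)

# Same chain as convert_timezone: attach the event zone, go through UTC,
# convert to the broadcast zone, then drop the tzinfo again.
event_end_utc = event_end_naive.replace(tzinfo=EVENT_OFFSET).astimezone(timezone.utc)
broadcast_end_local = event_end_utc.astimezone(BROADCAST_OFFSET).replace(tzinfo=None)

print(sport, league, broadcast_end_local)
```

Because the sample labels contain exactly one sport and one league, the result does not depend on the iteration order of the frozensets.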

<a id='sdk_parameters'></a>
## SDK Parameters

We will create a dictionary of the key parameters and walk through each of the settings to use in the SDK call.

In [13]:
parameters_dict = dict()

#### Viewership Limits ```phq_viewership__gte=100```
  -  We recommend filtering for broadcasts with a predicted viewership greater than or equal to 100. This removes the smallest, noisiest broadcast predictions, so fewer broadcasts are returned. Adjust the threshold to suit your use case.

In [14]:
parameters_dict.update(phq_viewership__gte=100) 

#### Time Limits ```start={'gte': '2019-01-01', 'lte':'2021-02-14'}```
 - To define the period of time for which broadcasts will be returned, set the greater than or equal `gte` and less than or equal `lte` parameters for `start`. This selects all broadcasts that start within this period.
 
 
Bear in mind that you can use any of these parameters, depending on your time period of interest:

```gte - Greater than or equal.``` <br>
```gt - Greater than.```<br>
```lte - Less than or equal.```<br>
```lt - Less than.```<br>
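
As an illustration of combining these bounds, the sketch below builds a `start` filter for one calendar month as a half-open range (`gte` on the first day, `lt` on the first day of the next month). The helper name `month_window` is hypothetical, and how the API interprets date-only strings at each bound should be confirmed against the Broadcasts API documentation.

```python
from datetime import date, timedelta


def month_window(year, month):
    """Return a `start` filter covering one calendar month.

    Hypothetical helper for this guide: the lower bound is inclusive
    (gte) and the upper bound is exclusive (lt).
    """
    first_day = date(year, month, 1)
    # Day 28 exists in every month; adding 4 days always lands in the
    # next month, and replace(day=1) snaps to its first day.
    next_month = (first_day.replace(day=28) + timedelta(days=4)).replace(day=1)
    return {'gte': first_day.isoformat(), 'lt': next_month.isoformat()}


start_filter = month_window(2019, 12)
print(start_filter)
```

The resulting dictionary can be passed as the `start` value, for example `parameters_dict.update(start=start_filter)`.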


In [15]:
# Set your chosen start and end date.
START_DATE = '2019-01-01'
END_DATE = '2021-02-14'
parameters_dict.update(start={'gte': START_DATE, 'lte':END_DATE}) 

#### Limits  ```limit=500```

 - When pulling historical data for a large time period, many results are returned. To speed up execution, set ```limit``` to the highest available setting (500). Each call to the API then returns up to 500 results, which reduces the number of calls needed for large datasets.
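
To see why the page size matters, the number of API calls can be estimated with simple arithmetic. This is a rough sketch: the total of 25,000 broadcasts is a made-up figure, and the smaller page size of 10 is an assumed default used only for comparison.

```python
import math


def pages_needed(total_results, limit):
    """Number of paginated API calls needed to fetch total_results."""
    return math.ceil(total_results / limit)


TOTAL_BROADCASTS = 25_000  # hypothetical result count for one county

calls_small_pages = pages_needed(TOTAL_BROADCASTS, 10)   # assumed default page size
calls_large_pages = pages_needed(TOTAL_BROADCASTS, 500)  # maximum page size

print(calls_small_pages, calls_large_pages)
```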

In [16]:
parameters_dict.update(limit=500) 

#### Location Limits ```location__place_id=4888671```

 - To define which counties to select, use `location__place_id` with the place_id of the county. The place_id of a county is its geonames id. The [Appendix](#appendix) explains how to find which county to use, depending on the locations of your business.

For the SDK call, you can specify your own counties of interest. However, here are four default counties to query as an example:

 - 'Clark County, Nevada': 5501879

 - 'Los Angeles County, California': 5368381

 - 'Cook County, Chicago, Illinois': 4888671

 - 'Harris County, Houston, Texas': 4696376
 
 
The place_id is set inside the SDK call loop, once for each county of interest.
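
One way to keep county names and ids together is a small lookup dictionary, from which a separate parameter dictionary is built per county. The sketch below is illustrative only: `COUNTY_PLACE_IDS` and `parameters_for_county` are hypothetical names, and the base parameters repeat the values used in this guide.

```python
# County names and place_ids as listed above.
COUNTY_PLACE_IDS = {
    'Clark County, Nevada': 5501879,
    'Los Angeles County, California': 5368381,
    'Cook County, Chicago, Illinois': 4888671,
    'Harris County, Houston, Texas': 4696376,
}

base_parameters = {
    'phq_viewership__gte': 100,
    'start': {'gte': '2019-01-01', 'lte': '2021-02-14'},
    'limit': 500,
}


def parameters_for_county(base, place_id):
    """Return a copy of the base parameters with the county filter added.

    The copy is shallow, so the nested `start` dictionary is shared;
    that is fine here because it is never modified.
    """
    county_parameters = dict(base)
    county_parameters['location__place_id'] = place_id
    return county_parameters


all_county_parameters = [
    parameters_for_county(base_parameters, place_id)
    for place_id in COUNTY_PLACE_IDS.values()
]
```

Building a fresh dictionary per county leaves the shared base parameters untouched between iterations.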

In [17]:
# To run for your own counties of interest - replace these ids.
LIST_OF_COUNTIES = [5501879] # 5368381, 4888671, 4696376]

In [18]:
# Note the place_id is set within the loop below.
parameters_dict

{'phq_viewership__gte': 100,
 'start': {'gte': '2019-01-01', 'lte': '2021-02-14'},
 'limit': 500}

<a id='sdk_call'></a>
## SDK Call

Loop through the call to the broadcasts API for each county of interest.

Not all broadcasts will be returned for each county. For example, if a county has low broadcast coverage (fewer than 45% of the county population have access to the broadcast), the broadcast is removed. A broadcast may also be missing because the `phq_viewership` filter excludes it: certain sports events are forecast to have low viewership in certain counties.

The data for each county is saved to csv as an example output. This can be adjusted to work with your own data pipeline.


In [19]:
# Loop through each county
for place_id in LIST_OF_COUNTIES:
    
    parameters_dict.update(location__place_id=place_id) 
    
    search_results = phq.broadcasts.search(parameters_dict).iter_all()

    search_results = [result.to_dict() for result in search_results]

    df = pd.DataFrame(search_results)

    # Extract out additional information
    # 'event' stores the additional data about the physical event
    df = df.merge(df['event'].apply(pd.Series),
                  left_index=True,
                  right_index=True,
                  suffixes=('_broadcast', '_event'))

    # Extract sport and league from the labels in the nested event data.
    df['sport'] = df.labels.apply(_extract_matching_label, args=(SPORTS,))
    df['league'] = df.labels.apply(_extract_matching_label, args=(LEAGUES,))

    df['local_start_date'] = (df.dates_broadcast
                                .apply(
                                        lambda start_dt:
                                        (start_dt['start_local']).date()
                                       )
                              )

    df['county_place_id'] = (df.location_broadcast
                               .apply(
                                       lambda location:
                                       location['places'][0]['place_id']
                                     )
                             )

    df['local_start_datetime'] = (df.dates_broadcast
                                    .apply(
                                            lambda start_dt:
                                            (start_dt['start_local'])
                                          )
                                  )

    # Check for any events without a predicted end time.
    # All broadcasts are expected to have a predicted end time
    broadcast_id_no_endtime = [row['broadcast_id'] for _, row in df.iterrows() \
                               if not row.get('dates_event', {}).get('predicted_end_local')]
    # Remove any broadcasts without a predicted end time.
    df = df[~df['broadcast_id'].isin(broadcast_id_no_endtime)]

    # Convert the predicted end time of the event to broadcast timezone.
    df = df.apply(convert_timezone, axis=1)

    df['sport_league'] = df['sport'] + '_' + df['league']
    # Calculate the duration of the broadcast. 
    df['duration'] = df['predicted_end_time_broadcast_local'] - df['local_start_datetime']
    # total_seconds() includes whole days; .seconds would drop them.
    df['duration_hours'] = df['duration'].dt.total_seconds()/(60*60)
    df['total_viewing'] = df['duration_hours'] * df['phq_viewership']
    
    # Save dataframe to csv
    df.to_csv('data/tv_events_data/{}_county_raw.csv'.format(place_id),
              index=False)

The returned data is at the broadcast level: every broadcast for the selected county that met the parameters of the SDK call is returned. In Part 2 of this How to Series we will explore this data to understand the key trends. In Part 3 we'll prepare features to be used in a forecasting model.
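
The `duration_hours` and `total_viewing` columns can be sanity-checked by hand. The sketch below redoes the arithmetic with the standard library only, using the start time, predicted end time and viewership of the first row shown in the Output Dataframe section. It also shows why `total_seconds()` is the safer choice for durations: the `seconds` attribute of a timedelta holds only the sub-day remainder.

```python
from datetime import datetime, timedelta

# Values taken from the first row of the sample output in this guide.
local_start = datetime(2018, 12, 31, 16, 0)
predicted_end = datetime(2018, 12, 31, 19, 20)
phq_viewership = 68084

duration = predicted_end - local_start
duration_hours = duration.total_seconds() / (60 * 60)
total_viewing = duration_hours * phq_viewership

print(duration, round(duration_hours, 6), round(total_viewing, 6))

# Illustration of the difference for a (hypothetical) duration over one day.
long_duration = timedelta(days=1, hours=2)
remainder_seconds = long_duration.seconds      # ignores the whole day
full_seconds = long_duration.total_seconds()   # counts the whole day
```

For broadcasts shorter than 24 hours both approaches agree, so the difference only matters for unusually long or malformed durations.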

<a id='output_dataframe'></a>
## Output Dataframe

It is important to understand the output data. 

There is one key aspect to be familiar with: which data fields relate to the broadcast and which relate to the physical sports event that the broadcast is showing. The fields extracted from the ```event``` column all relate to the actual physical event.

For absolute clarity in the returned dataframe, the following columns relate to the broadcast:

- broadcast_id
- updated
- dates_broadcast
- location_broadcast
- phq_viewership
- record_status
- broadcast_status
- local_start_date
- local_start_datetime
- county_place_id
- predicted_end_time_broadcast_local
- total_viewing


And the following columns relate to the actual physical event (Note: many of these are relevant additional data about the broadcast):

- event
- event_id
- title 
- category 
- labels
- dates_event
- location_event
- entities 
- phq_attendance
- phq_rank
- local_rank
- aviation_rank
- sport
- league
- duration
- duration_hours



In [20]:
df.head(2)

Unnamed: 0,broadcast_id,updated,dates_broadcast,location_broadcast,phq_viewership,record_status,broadcast_status,event,event_id,title,...,sport,league,local_start_date,county_place_id,local_start_datetime,predicted_end_time_broadcast_local,sport_league,duration,duration_hours,total_viewing
0,8Cy3b48jHZAPdxbDRbkaQk,2020-12-03 10:08:04+00:00,"{'start': 2019-01-01 00:00:00+00:00, 'start_lo...","{'geopoint': {'lat': 36.2152, 'lon': -115.0135...",68084,active,scheduled,"{'event_id': 'S8F7FMsKiiU4q8UF67', 'title': 'N...",S8F7FMsKiiU4q8UF67,Northwestern Wildcats vs Utah Utes,...,american-football,ncaa,2018-12-31,5501879,2018-12-31 16:00:00,2018-12-31 19:20:00,american-football_ncaa,0 days 03:20:00,3.333333,226946.666667
1,93LTY9p25MRrj39sGYxepc,2020-12-05 05:41:16+00:00,"{'start': 2019-01-01 00:00:00+00:00, 'start_lo...","{'geopoint': {'lat': 36.2152, 'lon': -115.0135...",18805,active,scheduled,"{'event_id': 'usqZVdBrXwBLVQfsRG', 'title': 'B...",usqZVdBrXwBLVQfsRG,Boston Celtics vs San Antonio Spurs,...,basketball,nba,2018-12-31,5501879,2018-12-31 16:00:00,2018-12-31 18:20:00,basketball_nba,0 days 02:20:00,2.333333,43878.333333


<a id='appendix'></a>
## Appendix: Finding County ```place_id``` 

Here is a guide on how to link store locations to the county ```place_id```, depending on the geodata you have available for your locations.

 - Location Longitude and Latitude
 - Location FIPS code
 
PredictHQ uses the geonames places convention https://www.geonames.org/ 

#### 1) Location Longitude and Latitude

Using the PredictHQ Places API, you can find the county for a specific latitude and longitude. When you call the API with the latitude and longitude and set ```type``` to ```county```, the API returns the most relevant counties. The top result is the county the location is in.

In [21]:
# Two example locations as [latitude, longitude].
locations = [[40.66677, -73.88236], [33.95345, -118.3392]]

county_rows = []

for latitude, longitude in locations:
    response = requests.get(
        url="https://api.predicthq.com/v1/places/",
        headers={
          "Authorization": "Bearer {}".format(ACCESS_TOKEN),
          "Accept": "application/json"
        },
        params={
            "location": "@{},{}".format(latitude, longitude),
            "type": "county"
        }
    )

    data = response.json()
    results_df = pd.json_normalize(data['results'])
    # Keep only the top result: the county containing the location.
    county_rows.append(results_df.iloc[[0]])

# DataFrame.append was removed in pandas 2.0, so the rows are
# collected in a list and concatenated once at the end.
location_county_lookup = pd.concat(county_rows, ignore_index=True)

In [22]:
location_county_lookup

Unnamed: 0,country,country_alpha2,country_alpha3,county,id,location,name,region,type
0,United States,US,USA,Queens County,5133268,"[-73.83875, 40.65749]",Queens County,New York,county
1,United States,US,USA,Los Angeles County,5368381,"[-118.26102, 34.19801]",Los Angeles County,California,county
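
Note that the request takes the point as latitude then longitude, while the `location` field in the output above lists longitude first. The sketch below isolates the two small steps involved, building the `location` value and picking the top county from a response, so they can be checked without calling the API. The helper names are hypothetical, and the sample response is a hand-made stand-in trimmed to fields that appear in the output above; the type of the `id` field in the real response is an assumption.

```python
def format_location(latitude, longitude):
    """Build the `location` query value in '@latitude,longitude' form."""
    return '@{},{}'.format(latitude, longitude)


def top_county(response_json):
    """Return (place_id, name) of the first result, or None if empty."""
    results = response_json.get('results', [])
    if not results:
        return None
    first = results[0]
    # int() accepts the id whether it arrives as a string or a number.
    return int(first['id']), first['name']


# Hypothetical response, trimmed to fields shown in the output above.
sample_response = {
    'results': [
        {'id': '5133268', 'name': 'Queens County', 'type': 'county',
         'location': [-73.83875, 40.65749]},
    ]
}

location_param = format_location(40.66677, -73.88236)
county = top_county(sample_response)
print(location_param, county)
```

Returning `None` for an empty result list makes it easier to skip locations that the API cannot match to a county.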


#### 2) Location FIPS Code

In [23]:
# We provide a lookup between FIPS code and place_id. (geoname_id = place_id)
mapping = pd.read_csv('data/geo_data/geoname_to_fips_mapping.csv')
mapping.head()

Unnamed: 0,geoname_id,county_name,county_fips
0,4047434,Russell County,1113
1,4048080,Long County,13183
2,4048522,Boone County,21015
3,4048572,Rowan County,21205
4,4049189,Bibb County,1007
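
Once the mapping is loaded, store FIPS codes can be translated into place_ids with a simple lookup. The sketch below uses only the five rows displayed above and plain Python rather than pandas. It assumes that store FIPS codes are held as five-digit strings, so the integer codes from the mapping are zero-padded before matching; the store codes themselves are made up for illustration.

```python
# (geoname_id, county_name, county_fips) rows copied from the output above.
mapping_rows = [
    (4047434, 'Russell County', 1113),
    (4048080, 'Long County', 13183),
    (4048522, 'Boone County', 21015),
    (4048572, 'Rowan County', 21205),
    (4049189, 'Bibb County', 1007),
]

# Reading the CSV as integers drops leading zeros, so pad back to 5 digits.
fips_to_place_id = {
    str(county_fips).zfill(5): geoname_id
    for geoname_id, _county_name, county_fips in mapping_rows
}

# Hypothetical FIPS codes for two store locations.
store_fips_codes = ['01113', '21015']
store_place_ids = [fips_to_place_id[code] for code in store_fips_codes]
print(store_place_ids)
```

The resulting ids can be used directly as the county list for the SDK call loop.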
