# Data acquisition

This notebook acquires the data from the API and creates a pandas DataFrame for each dataset. The datasets are then saved to disk for further processing.

## Get raw data from API

The following cell fetches the data from the API and stores the dump in the `ift6758/data/storage/dump` directory.


In [None]:
from ift6758.data import fetch_all_seasons_games_data

# This process can take anywhere from a few minutes to a few hours
fetch_all_seasons_games_data()

Fetching also stores every API response in the `ift6758/data/storage/cache` directory.
Once the raw data are stored in the `ift6758/data/storage/dump` directory, the cache is no longer needed.
Run the following cell to clear the cache.

In [None]:
from ift6758.data import clear_cache

clear_cache()
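
As an illustration of the cache-then-clear pattern described above, here is a minimal sketch of an on-disk response cache. It is not the implementation used by `ift6758.data`: the helper names (`cached_fetch`, `clear_cache_dir`, `fake_fetch`), the file naming scheme, and the example URL are all assumptions made for this example, and a temporary directory stands in for the real cache directory.

```python
import json
import tempfile
from pathlib import Path


def cached_fetch(url, cache_dir, fetch):
    """Return the response for `url`, reading it from `cache_dir` when present."""
    # Hypothetical naming scheme: one JSON file per URL.
    key = url.replace("://", "_").replace("/", "_") + ".json"
    path = Path(cache_dir) / key
    if path.exists():
        return json.loads(path.read_text())
    response = fetch(url)
    path.write_text(json.dumps(response))
    return response


def clear_cache_dir(cache_dir):
    """Delete every cached JSON file and return how many were removed."""
    removed = 0
    for path in Path(cache_dir).glob("*.json"):
        path.unlink()
        removed += 1
    return removed


calls = []


def fake_fetch(url):
    # Stand-in for a real HTTP request; records each call it receives.
    calls.append(url)
    return {"url": url}


with tempfile.TemporaryDirectory() as tmp:
    cached_fetch("https://example.com/game/1", tmp, fake_fetch)
    cached_fetch("https://example.com/game/1", tmp, fake_fetch)  # served from cache
    n_calls = len(calls)
    n_removed = clear_cache_dir(tmp)

print(n_calls, n_removed)
```

The second call never reaches `fake_fetch`, which is why an interrupted download can resume cheaply as long as the cache is kept, and why the cache can be cleared safely once the dump is complete.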


## Load raw data

Now that all the season data are stored in the `ift6758/data/storage/dump` directory, we can load them into objects.

In [None]:
from ift6758.data import load_raw_games_data

# You can pass a season number (first year) as argument to load only one season
data = load_raw_games_data()
print(len(data))

## Load flattened data

Extract features from the raw data set and convert them into records.


In [1]:
from ift6758.data import load_events_records

# You can pass a season number (first year) as argument to load only one season
data = load_events_records()
print(data[0])

Found 57734 events
{'game_id': 2020020001, 'season': 20202021, 'game_type': 2, 'game_date': '2021-01-13', 'venue': 'Wells Fargo Center', 'venue_location': 'Philadelphia', 'away_team_id': 5, 'away_team_abbrev': 'PIT', 'away_team_name': 'Penguins', 'home_team_id': 4, 'home_team_abbrev': 'PHI', 'home_team_name': 'Flyers', 'event_id': 53, 'event_idx': 2, 'period_number': 1, 'period_type': 'REG', 'time_in_period': '00:16', 'time_remaining': '19:44', 'situation_code': '1551', 'type_code': 506, 'type_desc_key': 'shot-on-goal', 'sort_order': 10, 'x_coord': -74, 'y_coord': 29, 'zone_code': 'O', 'shot_type': 'wrist', 'description': 'Tristan Jarry stops a shot from Travis Konecny', 'event_owner_team_id': 4, 'goalie_in_net_id': 8477465, 'goalie_in_net_name': 'Tristan Jarry', 'goalie_in_net_team_id': 5, 'goalie_in_net_position_code': 'G', 'shooting_player_id': 8478439, 'shooting_player_name': 'Travis Konecny', 'shooting_player_team_id': 4, 'shooting_player_position_code': 'C'}
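
The `season` field in the record above is an eight-digit code (`20202021`), whereas the loaders take only the first year of the season as their argument. A minimal sketch of the conversion, assuming the code is always the two years concatenated; `season_first_year` is a hypothetical helper written for this example, not part of `ift6758.data`.

```python
def season_first_year(season: int) -> int:
    """Return the first year of an eight-digit season code such as 20202021."""
    # Integer division drops the last four digits (the second year).
    return season // 10000


# A few fields copied from the record printed above.
record = {"game_id": 2020020001, "season": 20202021, "game_type": 2}

first_year = season_first_year(record["season"])
print(first_year)  # 2020
```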



## Load dataframe

Extract features from the raw data set and convert them into a pandas DataFrame.

In [1]:
from ift6758.data import load_events_dataframe

# You can pass a season number (first year) as argument to load only one season
load_events_dataframe(2020)

Found 57734 events


Unnamed: 0,game_id,season,game_type,game_date,venue,venue_location,away_team_id,away_team_abbrev,away_team_name,home_team_id,...,scoring_player_team_id,scoring_player_position_code,assist1_player_id,assist1_player_name,assist1_player_team_id,assist1_player_position_code,assist2_player_id,assist2_player_name,assist2_player_team_id,assist2_player_position_code
0,2020020001,20202021,2,2021-01-13,Wells Fargo Center,Philadelphia,5,PIT,Penguins,4,...,,,,,,,,,,
1,2020020001,20202021,2,2021-01-13,Wells Fargo Center,Philadelphia,5,PIT,Penguins,4,...,,,,,,,,,,
2,2020020001,20202021,2,2021-01-13,Wells Fargo Center,Philadelphia,5,PIT,Penguins,4,...,,,,,,,,,,
3,2020020001,20202021,2,2021-01-13,Wells Fargo Center,Philadelphia,5,PIT,Penguins,4,...,,,,,,,,,,
4,2020020001,20202021,2,2021-01-13,Wells Fargo Center,Philadelphia,5,PIT,Penguins,4,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
57729,2020030415,20202021,3,2021-07-07,Amalie Arena,Tampa,8,MTL,Canadiens,14,...,,,,,,,,,,
57730,2020030415,20202021,3,2021-07-07,Amalie Arena,Tampa,8,MTL,Canadiens,14,...,,,,,,,,,,
57731,2020030415,20202021,3,2021-07-07,Amalie Arena,Tampa,8,MTL,Canadiens,14,...,,,,,,,,,,
57732,2020030415,20202021,3,2021-07-07,Amalie Arena,Tampa,8,MTL,Canadiens,14,...,,,,,,,,,,
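
The introduction mentions saving the datasets to disk, but this notebook does not show that step. The sketch below is one possible way to do it with pandas, not necessarily how `ift6758.data` does it. It uses a toy two-row frame with column names and values taken from the output above, and an in-memory buffer standing in for a file path. Writing with `index=False` keeps the row index out of the file, so reloading does not produce an extra `Unnamed: 0` column.

```python
import io

import pandas as pd

# Toy frame: column names and values copied from the output above.
df = pd.DataFrame(
    {
        "game_id": [2020020001, 2020030415],
        "season": [20202021, 20202021],
        "game_type": [2, 3],
    }
)

# Write without the index so no unnamed column appears on reload.
buffer = io.StringIO()  # stands in for a file path on disk
df.to_csv(buffer, index=False)

buffer.seek(0)
reloaded = pd.read_csv(buffer)
print(list(reloaded.columns))  # ['game_id', 'season', 'game_type']
```

If a file was already written with its index, passing `index_col=0` to `pd.read_csv` restores that column as the index instead.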
