# Data acquisition

This notebook acquires the data from the API and creates a pandas DataFrame for each dataset. The datasets are then saved to disk for further processing.

## Get raw data from API

This will get the data from the API and store the dump in the `ift6758/data/storage/dump` directory.


In [None]:
from ift6758.data import fetch_all_seasons_games_data

# This process can take anywhere from a few minutes to a few hours
fetch_all_seasons_games_data()

The call above also stores every individual API response in the `ift6758/data/storage/cache` directory.
Once the raw data are stored in the `ift6758/data/storage/dump` directory, you can clear the cache.
Run the following cell to clear it.

In [None]:
from ift6758.data import clear_cache

clear_cache()
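
The relationship between the per-response cache and the API calls can be illustrated with a minimal sketch. This is not the package's implementation: `fetch_with_cache`, `fake_api`, and the one-file-per-key layout are hypothetical names and choices made for this example, and a stand-in function replaces the real HTTP request.

```python
import json
import tempfile
from pathlib import Path


def fetch_with_cache(key, fetch, cache_dir):
    """Return the response for `key`, calling `fetch` only on a cache miss."""
    path = Path(cache_dir) / f"{key}.json"
    if path.exists():
        # Cache hit: read the stored response instead of calling the API.
        return json.loads(path.read_text())
    response = fetch(key)
    path.write_text(json.dumps(response))
    return response


calls = []


def fake_api(game_id):
    # Stand-in for a real HTTP request (assumption for this sketch).
    calls.append(game_id)
    return {"game_id": game_id, "events": []}


with tempfile.TemporaryDirectory() as cache_dir:
    first = fetch_with_cache("2020020001", fake_api, cache_dir)
    second = fetch_with_cache("2020020001", fake_api, cache_dir)  # served from disk

print(len(calls))  # number of times the stand-in API was actually called
```

A cache of this kind mainly helps when a long download is interrupted and restarted, which is presumably why it can be cleared once the dump is complete.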


## Load raw data

Now that all the season data are stored in the `ift6758/data/storage/dump` directory, we can load them into objects.

In [None]:
from ift6758.data import load_raw_games_data

# Pass a season's first year as an argument to load only that season
data = load_raw_games_data()
print(len(data))

## Load flattened data

Extract features from the raw data set and convert them to records.


In [None]:
from ift6758.data import load_events_records

# Pass a season's first year as an argument to load only that season
data = load_events_records()
print(data[0])
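
As a rough illustration of what converting raw data to records can look like, the sketch below copies game-level fields onto every event of a nested game object. The payload shape, the `flatten_game` helper, and the `event_type` and `y_coord` fields are assumptions made for this example; the real dump and `load_events_records` may differ.

```python
def flatten_game(game):
    """Turn one nested game dict into a list of flat event records."""
    # Fields that describe the game as a whole are repeated on each event.
    game_fields = {k: v for k, v in game.items() if k != "events"}
    records = []
    for event in game["events"]:
        record = dict(game_fields)
        record.update(event)
        records.append(record)
    return records


# Hypothetical nested payload; the real dump has many more fields.
game = {
    "game_id": 2020020001,
    "season": 20202021,
    "events": [
        {"event_type": "shot", "x_coord": 61, "y_coord": -12},
        {"event_type": "goal", "x_coord": 80, "y_coord": 3},
    ],
}

records = flatten_game(game)
print(records[0])
```

Flat records of this kind map directly onto DataFrame rows, one event per row.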


## Load dataframe

Extract features from the raw data set and convert them to a pandas DataFrame.

In [None]:
from ift6758.data import load_events_dataframe

# Pass a season's first year as an argument to load only that season
df = load_events_dataframe()

In [6]:
# Sanity checks on the derived columns; each line prints a row count.
print("goal_distance is null:", df['goal_distance'].isnull().sum())
print("goal_distance is 0:", (df['goal_distance'] == 0).sum())
# The product is positive when the event and the goal lie on the same half of the rink.
side_product = df['goal_x_coord'] * df['x_coord']
print("wrong goal side offensive:", ((df['zone_code'] == "O") & (side_product < 0)).sum())
print("wrong goal side defense:", ((df['zone_code'] == "D") & (side_product > 0)).sum())
print("is empty net:", (df['is_empty_net'] == 1).sum())

goal_distance is null: 0
goal_distance is 0: 23
wrong goal side offensive: 164
wrong goal side defense: 456
is empty net: 17361
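
The two "wrong goal side" counts rely on the sign of `goal_x_coord * x_coord`: a positive product means the event and the goal are on the same half of the rink, a negative one means opposite halves. The sketch below applies the same rule to a few plain records. The coordinates are made up, `is_wrong_goal_side` is a hypothetical helper, and the `"N"` zone code is an assumption used only to show an unchecked case.

```python
def is_wrong_goal_side(record):
    """Flag an event whose zone code disagrees with its side of the rink."""
    product = record["goal_x_coord"] * record["x_coord"]
    if record["zone_code"] == "O":
        return product < 0   # offensive zone, but on the opposite half from the goal
    if record["zone_code"] == "D":
        return product > 0   # defensive zone, but on the same half as the goal
    return False             # other zone codes are not checked


# Toy records with made-up coordinates.
events = [
    {"zone_code": "O", "goal_x_coord": 89, "x_coord": 60},    # consistent
    {"zone_code": "O", "goal_x_coord": 89, "x_coord": -60},   # flagged
    {"zone_code": "D", "goal_x_coord": -89, "x_coord": 45},   # consistent
    {"zone_code": "D", "goal_x_coord": -89, "x_coord": -45},  # flagged
    {"zone_code": "N", "goal_x_coord": 89, "x_coord": 10},    # not checked
]

flagged = [is_wrong_goal_side(e) for e in events]
print(flagged)
```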


In [1]:
from ift6758.data import load_train_test_dataframes

train, test = load_train_test_dataframes()
test.head()

Found 80399 events
Found 87137 events
Found 85939 events
Found 73867 events
Found 57734 events


| Unnamed: 0 | game_id | season | game_type | game_date | venue | venue_location | away_team_id | away_team_abbrev | away_team_name | home_team_id | ... | scoring_player_team_id | scoring_player_position_code | assist1_player_id | assist1_player_name | assist1_player_team_id | assist1_player_position_code | assist2_player_id | assist2_player_name | assist2_player_team_id | assist2_player_position_code |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2020020001 | 20202021 | 2 | 2021-01-13 | Wells Fargo Center | Philadelphia | 5 | PIT | Penguins | 4 | ... |  |  |  |  |  |  |  |  |  |  |
| 1 | 2020020001 | 20202021 | 2 | 2021-01-13 | Wells Fargo Center | Philadelphia | 5 | PIT | Penguins | 4 | ... |  |  |  |  |  |  |  |  |  |  |
| 2 | 2020020001 | 20202021 | 2 | 2021-01-13 | Wells Fargo Center | Philadelphia | 5 | PIT | Penguins | 4 | ... |  |  |  |  |  |  |  |  |  |  |
| 3 | 2020020001 | 20202021 | 2 | 2021-01-13 | Wells Fargo Center | Philadelphia | 5 | PIT | Penguins | 4 | ... |  |  |  |  |  |  |  |  |  |  |
| 4 | 2020020001 | 20202021 | 2 | 2021-01-13 | Wells Fargo Center | Philadelphia | 5 | PIT | Penguins | 4 | ... |  |  |  |  |  |  |  |  |  |  |
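
The preview of `test` shows only rows from season `20202021`, which suggests, though the notebook does not state it, that the split holds out one season. A minimal sketch of such a split over plain records follows; `season_id` and `split_by_season` are hypothetical helpers, not part of `ift6758.data`.

```python
def season_id(first_year):
    """Build the 8-digit id seen in the `season` column, e.g. 2020 -> 20202021."""
    return first_year * 10000 + first_year + 1


def split_by_season(records, test_first_year):
    """Put every record of one season in the test set and the rest in train."""
    test_season = season_id(test_first_year)
    train = [r for r in records if r["season"] != test_season]
    test = [r for r in records if r["season"] == test_season]
    return train, test


# Toy records; only the fields needed for the split are included.
records = [
    {"game_id": 2019020001, "season": 20192020},
    {"game_id": 2019020002, "season": 20192020},
    {"game_id": 2020020001, "season": 20202021},
]

train, test = split_by_season(records, 2020)
print(len(train), len(test))
```

Splitting on whole seasons rather than on random rows keeps events from the same game out of both sets at once.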
