# Data Collection

Notebook objective: retrieve teams and games data for all leagues and all available seasons, and store it in the database so that a local copy is preserved.

This notebook is intended to be run once and then only very infrequently, because it completely rebuilds the tables unless otherwise specified.

## Set-Up

In [1]:
# libraries and dependencies
from etl.creds import api_key, postgresql_pw
from etl.helper_functions import apiSports
from pprint import pprint
from sqlalchemy import create_engine, text
import pandas as pd

The `apiSports` class has two methods:
- `request(payload, endpoint)` makes a call using the API, credentials, and version specified in the class object.
    - `payload` is a dictionary of parameters to pass in the API call.
    - `endpoint` is a string that specifies the API endpoint.
- `clean(endpoint, df_current)` transforms the dataframe created by the `request` method.
    - `endpoint` is a string that specifies the API endpoint. The method behaves differently for the "teams" versus "games" endpoints.
    - `df_current` is the existing dataframe to transform.
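
To make the `request(payload, endpoint)` interface more concrete, here is a minimal sketch of how such a wrapper might assemble a call. The real `apiSports` class lives in `etl.helper_functions` and is not shown in this notebook, so the class name `ApiSportsSketch`, the `build_request` helper, the default version, the URL pattern, and the header name are all assumptions for illustration. Nothing is sent over the network.

```python
from dataclasses import dataclass


@dataclass
class ApiSportsSketch:
    """Hypothetical stand-in for etl.helper_functions.apiSports (not the real class)."""
    api_key: str
    sport: str
    version: str = "v1"  # assumed default; the notebook omits the version for NFL and NHL

    def build_request(self, payload: dict, endpoint: str) -> dict:
        """Return the pieces of an HTTP GET request without sending anything."""
        return {
            # assumed URL pattern: <version>.<sport>.api-sports.io/<endpoint>
            "url": f"https://{self.version}.{self.sport}.api-sports.io/{endpoint}",
            # assumed header name for the API key
            "headers": {"x-apisports-key": self.api_key},
            # payload entries would be passed as query-string parameters
            "params": dict(payload),
        }


mls_sketch = ApiSportsSketch("demo-key", "football", "v3")
req = mls_sketch.build_request({"league": "253", "season": "2023"}, "fixtures")
print(req["url"])
print(req["params"])
```

Keeping the sport and version on the object is what lets one `request` signature serve all five leagues in the cells that follow.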

In [2]:
# creates a member of apiSports class and checks API key status
mlb = apiSports(api_key, 'baseball', 'v1')
response = mlb.request({}, 'status')
pprint(response)

{'errors': [],
 'get': 'status',
 'paging': {'current': 1, 'total': 1},
 'parameters': [],
 'response': {'account': {'email': 'hunterhollismobile@gmail.com',
                          'firstname': 'Hunter',
                          'lastname': 'Hollis'},
              'requests': {'current': 0, 'limit_day': 7500},
              'subscription': {'active': True,
                               'end': '2025-04-03T15:34:47+00:00',
                               'plan': 'Pro'}},
 'results': 0}


In [3]:
# checks API status for league object
def api_status(league_name, league_object):
    status = league_object.request({}, 'status')['response']
    print(f"{league_name.upper()}:")
    pprint(status['requests'])
    pprint(status['subscription'])
    print(f"{'-'*70}")

In [4]:
# creates a member of apiSports class and checks API key status
mlb = apiSports(api_key, 'baseball', 'v1')
api_status('mlb', mlb)

# create apiSports object for MLS
mls = apiSports(api_key, 'football', 'v3')
api_status('mls', mls)

# create apiSports object for NBA
nba = apiSports(api_key, 'basketball', 'v1')
api_status('nba', nba)

# create apiSports object for NFL
nfl = apiSports(api_key, 'american-football')
api_status('nfl', nfl)

# create apiSports object for NHL
nhl = apiSports(api_key, 'hockey')
api_status('nhl', nhl)

MLB:
{'current': 0, 'limit_day': 7500}
{'active': True, 'end': '2025-04-03T15:34:47+00:00', 'plan': 'Pro'}
----------------------------------------------------------------------
MLS:
{'current': 0, 'limit_day': 7500}
{'active': True, 'end': '2025-04-03T15:34:47+00:00', 'plan': 'Pro'}
----------------------------------------------------------------------
NBA:
{'current': 0, 'limit_day': 7500}
{'active': True, 'end': '2025-04-03T15:34:47+00:00', 'plan': 'Pro'}
----------------------------------------------------------------------
NFL:
{'current': 0, 'limit_day': 7500}
{'active': True, 'end': '2025-04-03T15:34:47+00:00', 'plan': 'Pro'}
----------------------------------------------------------------------
NHL:
{'current': 0, 'limit_day': 7500}
{'active': True, 'end': '2025-04-03T15:34:47+00:00', 'plan': 'Pro'}
----------------------------------------------------------------------


In [5]:
# create postgresql connection to export DataFrames to database tables
engine = create_engine(f'postgresql+psycopg2://postgres:{postgresql_pw}@localhost:5432/api_sports')

## Update Histories

In [6]:
def history(league_object, league_name, league_id, season_year, target):
    """
    Retrieve one season of teams or games data, clean it, and append it to the league's history table(s)
    Parameters:
        1) league_object: variable for member of apiSports class | [mlb, mls, nba, nfl, nhl]
        2) league_name: string | ['mlb', 'mls', 'nba', 'nfl', 'nhl']
        3) league_id: string | API league id: '1' (MLB), '253' (MLS), '12' (NBA), '1' (NFL), '57' (NHL)
        4) season_year: integer | [2021, 2022, 2023, 2024, (2025)]; for the NBA, the year the season starts
        5) target: string | ['teams', 'games']
    """

    # gets season year in correct string format
    if league_name == 'nba':
        season = f"{season_year}-{season_year+1}"
    else:
        season = f"{season_year}"

    # the football (soccer) API names its games endpoint 'fixtures'
    if league_name == 'mls' and target == 'games':
        endpoint = 'fixtures'
    else:
        endpoint = target

    print(f"Retrieving {target} data for {season} {league_name.upper()} season.")

    # API call
    payload = {'league': league_id, 'season': season}
    response = league_object.request(payload, endpoint)

    # create response df
    df = pd.DataFrame(response['response'])
    print(f"Returned {len(df)} {target}.")

    if league_name == 'mls' and target == 'teams':
        # dataframe cleaning; for MLS teams, clean() returns separate teams and venues dataframes
        teams_df_final, venues_df_final = league_object.clean('teams', df)

        # export teams_df and venues_df to history tables in api-sports database
        teams_df_final.to_sql('mls_teams_history', con=engine, if_exists='append', index=False)
        venues_df_final.to_sql('mls_venues_history', con=engine, if_exists='append', index=False)
        print(f"Cleaned and uploaded {len(teams_df_final)} teams to mls_teams_history,  {len(venues_df_final)} venues to mls_venues_history.\n")
    else: 
        # dataframe cleaning
        df_final = league_object.clean(target, df)

        # export final df to table in api-sports database
        df_final.to_sql(f'{league_name}_{target}_history', con=engine, if_exists='append', index=False)
        print(f"Cleaned and uploaded {len(df_final)} {target} to {league_name}_{target}_history.\n")
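
The league-specific branching inside `history()` (the NBA season string and the MLS `fixtures` endpoint) can be pulled out into small pure functions, which makes it easy to check without calling the API or touching the database. This is only a sketch: the helper names `season_label`, `resolve_endpoint`, and `history_table` are hypothetical and not part of the notebook.

```python
def season_label(league_name: str, season_year: int) -> str:
    """NBA seasons span two calendar years; the other leagues use a single year."""
    if league_name == "nba":
        return f"{season_year}-{season_year + 1}"
    return str(season_year)


def resolve_endpoint(league_name: str, target: str) -> str:
    """MLS games are requested from 'fixtures'; everything else uses the target name."""
    if league_name == "mls" and target == "games":
        return "fixtures"
    return target


def history_table(league_name: str, target: str) -> str:
    """Table name pattern used by the generic branch of history()."""
    return f"{league_name}_{target}_history"


print(season_label("nba", 2023))        # 2023-2024
print(season_label("mlb", 2023))        # 2023
print(resolve_endpoint("mls", "games")) # fixtures
print(resolve_endpoint("mls", "teams")) # teams
print(history_table("nhl", "games"))    # nhl_games_history
```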

In [7]:
def drop_table(table_name, engine=engine):
    """
    Function to drop existing table(s) in database
    Takes 2 parameters:
        1) table_name: string | name of table to drop (a comma-separated string of names drops several tables)
        2) engine: object | sqlalchemy engine created previously (defaults to `engine`)
    """
    with engine.connect() as conn:
        conn.execute(text(f'DROP TABLE IF EXISTS {table_name};'))
        conn.commit()
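
Because `history()` writes with `if_exists='append'`, re-running a season without dropping the table first duplicates its rows. The sketch below shows that effect with the standard-library `sqlite3` module standing in for PostgreSQL. The table names, the `drop_tables` and `append_rows` helpers, and the table-name check are illustrative assumptions, not the notebook's code. SQLite drops one table per statement, so the comma-separated string is split before dropping.

```python
import re
import sqlite3

# assumed naming rule: lowercase letters, digits, and underscores only
VALID_NAME = re.compile(r"[a-z_][a-z0-9_]*")


def drop_tables(conn: sqlite3.Connection, table_names: str) -> None:
    """Drop each table named in a comma-separated string after validating the name."""
    for name in (part.strip() for part in table_names.split(",")):
        if not VALID_NAME.fullmatch(name):
            raise ValueError(f"unexpected table name: {name!r}")
        conn.execute(f"DROP TABLE IF EXISTS {name}")
    conn.commit()


def append_rows(conn: sqlite3.Connection, table: str, rows: list) -> None:
    """Create the table if needed, then append rows (mimics if_exists='append')."""
    conn.execute(f"CREATE TABLE IF NOT EXISTS {table} (season INTEGER, team TEXT)")
    conn.executemany(f"INSERT INTO {table} VALUES (?, ?)", rows)
    conn.commit()


def row_count(conn: sqlite3.Connection, table: str) -> int:
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]


conn = sqlite3.connect(":memory:")
rows = [(2021, "Team A"), (2021, "Team B")]

append_rows(conn, "demo_teams_history", rows)
append_rows(conn, "demo_teams_history", rows)  # re-run without dropping first
duplicated = row_count(conn, "demo_teams_history")

drop_tables(conn, "demo_teams_history, demo_venues_history")
append_rows(conn, "demo_teams_history", rows)  # rebuild from an empty table
rebuilt = row_count(conn, "demo_teams_history")

print(duplicated, rebuilt)  # 4 2
```

Validating the name before interpolating it into SQL is a cheap guard, since table names cannot be passed as bound parameters.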

In [8]:
leagues = {'mlb': {'object': mlb,
                   'id': '1',
                   'tables': {'teams': 'mlb_teams_history', 'games': 'mlb_games_history'},
                   'update': False,
                   'rebuild_tables': False},
            'mls': {'object': mls,
                   'id': '253',
                   'tables': {'teams': 'mls_teams_history, mls_venues_history', 'games': 'mls_games_history'},
                   'update': False,
                   'rebuild_tables': False},
            'nba': {'object': nba,
                   'id': '12',
                   'tables': {'teams': 'nba_teams_history', 'games': 'nba_games_history'},
                   'update': False,
                   'rebuild_tables': False},
            'nfl': {'object': nfl,
                   'id': '1',
                   'tables': {'teams': 'nfl_teams_history', 'games': 'nfl_games_history'},
                   'update': False,
                   'rebuild_tables': False},
            'nhl': {'object': nhl,
                   'id': '57',
                   'tables': {'teams': 'nhl_teams_history', 'games': 'nhl_games_history'},
                   'update': False,
                   'rebuild_tables': False}}

def refresh_flags(update_all, rebuild_all, leagues_dict=leagues):
    """
    Function to set all leagues in dictionary to refresh and/or rebuild tables
    Takes 3 parameters:
        1) update_all: boolean | whether to update all leagues
        2) rebuild_all: boolean | whether to rebuild all league tables
        3) leagues_dict: dictionary | previously defined info about each league
    """
    if update_all:
        user_confirm_update = input("Please confirm you would like to update all leagues' histories by typing 'CONFIRM': ").upper()
        if user_confirm_update == 'CONFIRM':
            for name, info in leagues_dict.items():
                info['update']=True
            print('When `update_history()` function is run, all leagues will be updated.')
    if rebuild_all:
        user_confirm_rebuild = input("Please confirm you would like to DELETE ALL CURRENT TABLES with leagues' histories by typing 'CONFIRM': ").upper()
        if user_confirm_rebuild == 'CONFIRM':
            for name, info in leagues_dict.items():
                info['rebuild_tables']=True
            print('When `update_history()` function is run, all existing tables with league history will be dropped and rebuilt. Only do this if starting from scratch.')
    if not update_all and not rebuild_all:
        print("Leagues dictionary will not be changed further.")
    print(f"{'-'*70}")
    return leagues_dict

def update_history(start_season, end_season, endpoint, leagues=leagues):
    """
    Function to update teams or games history tables for all leagues flagged to update
    If a league is flagged to rebuild, its existing table(s) are dropped first so that each season appends to a fresh table
    Takes 4 parameters:
        1) start_season: integer | first season to retrieve
        2) end_season: integer | last season to retrieve (inclusive)
        3) endpoint: string | ['teams', 'games']
        4) leagues: dictionary | previously defined info about each league
    """
    for name, info in leagues.items():
        if info['update']:
            if info['rebuild_tables']:
                print(f"Dropping table(s) {info['tables'][endpoint]}...")
                drop_table(info['tables'][endpoint])
            for season in range(start_season, end_season + 1):
                history(info['object'], name, info['id'], season, endpoint)
        else:
            print(f"{name.upper()} is not set to update. Revise `leagues` dictionary if needed.")
        print(f"{'-'*70}")
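
Since each `history()` call makes one API request, it can be useful to list the calls a run would make before spending any of the daily quota. The following is a hedged sketch of such a dry run: `plan_requests` and `demo_leagues` are hypothetical names, and the dictionary only mimics the `update` flag of the real `leagues` dictionary.

```python
def plan_requests(leagues, start_season, end_season, endpoints=("teams", "games")):
    """List the (league, season, endpoint) calls a run would make, without calling the API."""
    return [
        (name, season, endpoint)
        for endpoint in endpoints
        for name, info in leagues.items()
        if info["update"]  # leagues not flagged to update are skipped
        for season in range(start_season, end_season + 1)
    ]


# simplified stand-in for the notebook's `leagues` dictionary
demo_leagues = {
    "mlb": {"update": True},
    "mls": {"update": True},
    "nba": {"update": False},
}

plan = plan_requests(demo_leagues, 2021, 2024)
DAILY_LIMIT = 7500  # 'limit_day' value reported by the status endpoint

print(len(plan))                 # 2 leagues x 4 seasons x 2 endpoints = 16
print(plan[0])                   # ('mlb', 2021, 'teams')
print(len(plan) <= DAILY_LIMIT)  # True
```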

In [9]:
# set arguments to True for new build or full re-build
leagues = refresh_flags(update_all=True, rebuild_all=True)
leagues

When `update_history()` function is run, all leagues will be updated.
When `update_history()` function is run, all existing tables with league history will be dropped and rebuilt. Only do this if starting from scratch.
----------------------------------------------------------------------


{'mlb': {'object': <etl.helper_functions.apiSports at 0x1b84814c1d0>,
  'id': '1',
  'tables': {'teams': 'mlb_teams_history', 'games': 'mlb_games_history'},
  'update': True,
  'rebuild_tables': True},
 'mls': {'object': <etl.helper_functions.apiSports at 0x1b846922f00>,
  'id': '253',
  'tables': {'teams': 'mls_teams_history, mls_venues_history',
   'games': 'mls_games_history'},
  'update': True,
  'rebuild_tables': True},
 'nba': {'object': <etl.helper_functions.apiSports at 0x1b847f6c560>,
  'id': '12',
  'tables': {'teams': 'nba_teams_history', 'games': 'nba_games_history'},
  'update': True,
  'rebuild_tables': True},
 'nfl': {'object': <etl.helper_functions.apiSports at 0x1b8481cf8c0>,
  'id': '1',
  'tables': {'teams': 'nfl_teams_history', 'games': 'nfl_games_history'},
  'update': True,
  'rebuild_tables': True},
 'nhl': {'object': <etl.helper_functions.apiSports at 0x1b8466095b0>,
  'id': '57',
  'tables': {'teams': 'nhl_teams_history', 'games': 'nhl_games_history'},
  'updat

In [10]:
# update teams histories
update_history(2021, 2024, 'teams')

Dropping table(s) mlb_teams_history...
Retrieving teams data for 2021 MLB season.
Returned 32 teams.
Cleaned and uploaded 32 teams to mlb_teams_history.

Retrieving teams data for 2022 MLB season.
Returned 33 teams.
Cleaned and uploaded 33 teams to mlb_teams_history.

Retrieving teams data for 2023 MLB season.
Returned 34 teams.
Cleaned and uploaded 34 teams to mlb_teams_history.

Retrieving teams data for 2024 MLB season.
Returned 32 teams.
Cleaned and uploaded 32 teams to mlb_teams_history.

----------------------------------------------------------------------
Dropping table(s) mls_teams_history, mls_venues_history...
Retrieving teams data for 2021 MLS season.
Returned 27 teams.
Cleaned and uploaded 27 teams to mls_teams_history,  27 venues to mls_venues_history.

Retrieving teams data for 2022 MLS season.
Returned 28 teams.
Cleaned and uploaded 28 teams to mls_teams_history,  28 venues to mls_venues_history.

Retrieving teams data for 2023 MLS season.
Returned 29 teams.
Cleaned and

In [11]:
# update games histories
update_history(2021, 2024, 'games')

Dropping table(s) mlb_games_history...
Retrieving games data for 2021 MLB season.
Returned 2883 games.
Cleaned and uploaded 2883 games to mlb_games_history.

Retrieving games data for 2022 MLB season.
Returned 2963 games.
Cleaned and uploaded 2963 games to mlb_games_history.

Retrieving games data for 2023 MLB season.
Returned 2940 games.
Cleaned and uploaded 2940 games to mlb_games_history.

Retrieving games data for 2024 MLB season.
Returned 2946 games.
Cleaned and uploaded 2946 games to mlb_games_history.

----------------------------------------------------------------------
Dropping table(s) mls_games_history...
Retrieving games data for 2021 MLS season.
Returned 472 games.
Cleaned and uploaded 472 games to mls_games_history.

Retrieving games data for 2022 MLS season.
Returned 489 games.
Cleaned and uploaded 489 games to mls_games_history.

Retrieving games data for 2023 MLS season.
Returned 526 games.
Cleaned and uploaded 526 games to mls_games_history.

Retrieving games data fo