# Introduction

We collected the data for our project with the `nba_api` Python package, which queries stats.nba.com. This first notebook covers data collection only.
**Note 1**: To run the code below, the `nba_api` library must be installed.
**Note 2**: Because the API began returning incomplete datasets, we ran the code below in Google Colab. Even so, we could not retrieve player game data for all three seasons.
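
One possible mitigation for transient request failures is to retry a call with an increasing delay. The sketch below is not used anywhere in this notebook, and the name `call_with_retries` and its parameters are hypothetical. It would not recover data that the API simply does not return.

```python
import time

def call_with_retries(fetch, max_attempts=3, base_delay=2.0, sleep=time.sleep):
    """Call fetch() until it succeeds or max_attempts is reached.

    Waits base_delay, then 2 * base_delay, 4 * base_delay, ... between attempts.
    The sleep argument is injectable so the helper can be tested without waiting.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except Exception:
            # Re-raise the last error once the attempts are used up
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
```

Any of the request calls in the cells below could be wrapped in a zero-argument function (for example a `lambda`) and passed in as `fetch`.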

In [0]:
!pip install nba_api

Collecting nba_api
  Downloading https://files.pythonhosted.org/packages/98/bc/f701f6f7c79354419107e2534d22ed27bac7de9a6c2d85ab73f87bc140a9/nba_api-1.1.5-py3-none-any.whl (215kB)

In [0]:
pip install --upgrade nba_api

Requirement already up-to-date: nba_api in /usr/local/lib/python3.6/dist-packages (1.1.5)


# Data source 1: Game data by team for three seasons
Seasons: 18-19, 17-18, 16-17
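
The API calls in this notebook take the season as a string such as `'2018-19'`. As a small sketch, these labels can be built from the starting year; the helper name `season_label` is hypothetical and is not part of `nba_api`.

```python
def season_label(start_year):
    """Return the 'YYYY-YY' label for the season that starts in start_year."""
    return f"{start_year}-{(start_year + 1) % 100:02d}"

# The three seasons covered in this notebook
seasons = [season_label(year) for year in (2018, 2017, 2016)]
print(seasons)  # ['2018-19', '2017-18', '2016-17']
```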

In [0]:
from pandas import DataFrame, Series
import pandas as pd
from nba_api.stats.static import teams

In [0]:
## Get list of NBA teams and their corresponding IDs
nba_teams = teams.get_teams()
print(nba_teams[0])

teams_id_list = [team['id'] for team in nba_teams]
teams_name_list = [team['abbreviation'] for team in nba_teams]
# Map each team ID to its abbreviation
nba_teams_dict = dict(zip(teams_id_list, teams_name_list))

{'id': 1610612737, 'full_name': 'Atlanta Hawks', 'abbreviation': 'ATL', 'nickname': 'Hawks', 'city': 'Atlanta', 'state': 'Atlanta', 'year_founded': 1949}


## Season 18-19

In [0]:
from nba_api.stats.endpoints import leaguegamefinder
game = leaguegamefinder.LeagueGameFinder(team_id_nullable = 1610612737,
                                         season_nullable = '2018-19',
                                         season_type_nullable = 'Regular Season')

In [0]:
## Get team game log for 18-19
from nba_api.stats.endpoints import leaguegamefinder
from time import sleep
TeamGamesLogsDF18 = DataFrame()
for key in nba_teams_dict.keys():
    prelim_games = leaguegamefinder.LeagueGameFinder(team_id_nullable = key,
                                         season_nullable = '2018-19',
                                         season_type_nullable = 'Regular Season')
    team_games = prelim_games.get_data_frames()[0]
    TeamGamesLogsDF18 = pd.concat([TeamGamesLogsDF18, team_games])
    sleep(2)

In [0]:
## Ensure the games are in regular season
TeamGamesLogsDF18['GAME_DATE'] = pd.to_datetime(TeamGamesLogsDF18['GAME_DATE'])
start_date = '2018-10-16'
mask = (TeamGamesLogsDF18['GAME_DATE'] >= start_date)
TeamGamesLogsDF18 = TeamGamesLogsDF18[mask]

In [0]:
## Add a column indicating whether the team in each row played at home or away
import re
def home_or_away(matchup):
    # MATCHUP looks like "MIN @ HOU" (away) or "HOU vs. MIN" (home)
    ha = re.match(r"[A-Za-z]+ (vs\.|@)", matchup)
    if ha.group(1) == "@":
        return "AWAY"
    else:
        return "HOME"
## home_or_away("MIN @ HOU") returns "AWAY"

TeamGamesLogsDF18["HOME_AWAY"] = TeamGamesLogsDF18["MATCHUP"].map(home_or_away)

## Change Home to 1 and Away to 0 (assign the result back rather than replacing in place on a column selection)
TeamGamesLogsDF18['HOME_AWAY'] = TeamGamesLogsDF18['HOME_AWAY'].map({"HOME": 1, "AWAY": 0})
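
The home/away encoding in the cell above can be sanity-checked on a few sample strings without calling the API. This is a minimal sketch that assumes `MATCHUP` values look like `"MIN @ HOU"` or `"HOU vs. MIN"`. The name `home_flag` is hypothetical, and unlike the cell above it raises an error on strings it cannot parse.

```python
import re

# Team abbreviation, then "vs." (home) or "@" (away), then the opponent
MATCHUP_PATTERN = re.compile(r"[A-Za-z]+ (vs\.|@) [A-Za-z]+")

def home_flag(matchup):
    """Return 1 for a home game and 0 for an away game."""
    m = MATCHUP_PATTERN.match(matchup)
    if m is None:
        raise ValueError(f"Unrecognized matchup format: {matchup!r}")
    return 0 if m.group(1) == "@" else 1

print(home_flag("MIN @ HOU"))    # 0
print(home_flag("HOU vs. MIN"))  # 1
```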

In [0]:
## Save the DF to a csv 
TeamGamesLogsDF18.to_csv(r'TeamGamesLogsDF18.csv', index = False, header = True)

## Season 17-18

In [0]:
## Get team game log for 17-18
TeamGamesLogsDF17 = DataFrame()
for key in nba_teams_dict.keys():
    prelim_games = leaguegamefinder.LeagueGameFinder(team_id_nullable = key,
                                         season_nullable = '2017-18',
                                         season_type_nullable = 'Regular Season')
    team_games = prelim_games.get_data_frames()[0]
    TeamGamesLogsDF17 = pd.concat([TeamGamesLogsDF17, team_games])
    sleep(2)  # pause between requests, as in the 18-19 loop

## Ensure the games are in regular season
TeamGamesLogsDF17['GAME_DATE'] = pd.to_datetime(TeamGamesLogsDF17['GAME_DATE'])
start_date = '2017-10-17'
mask = (TeamGamesLogsDF17['GAME_DATE'] >= start_date)
TeamGamesLogsDF17 = TeamGamesLogsDF17[mask]

## Add Home_Away column
TeamGamesLogsDF17["HOME_AWAY"] = TeamGamesLogsDF17["MATCHUP"].map(home_or_away)
## Change Home to 1 and Away to 0
TeamGamesLogsDF17['HOME_AWAY'] = TeamGamesLogsDF17['HOME_AWAY'].map({"HOME": 1, "AWAY": 0})

## Save the DF to a csv 
TeamGamesLogsDF17.to_csv(r'TeamGamesLogsDF17.csv', index = False, header = True)

## Season 16-17

In [0]:
## Get team game log for 16-17
TeamGamesLogsDF16 = DataFrame()
for key in nba_teams_dict.keys():
    prelim_games = leaguegamefinder.LeagueGameFinder(team_id_nullable = key,
                                         season_nullable = '2016-17',
                                         season_type_nullable = 'Regular Season')
    team_games = prelim_games.get_data_frames()[0]
    TeamGamesLogsDF16 = pd.concat([TeamGamesLogsDF16, team_games])
    sleep(2)  # pause between requests, as in the 18-19 loop

## Ensure the games are in regular season
TeamGamesLogsDF16['GAME_DATE'] = pd.to_datetime(TeamGamesLogsDF16['GAME_DATE'])
start_date = '2016-10-25'  # the 2016-17 regular season opened on October 25, 2016
mask = (TeamGamesLogsDF16['GAME_DATE'] >= start_date)
TeamGamesLogsDF16 = TeamGamesLogsDF16[mask]

## Add Home_Away column
TeamGamesLogsDF16["HOME_AWAY"] = TeamGamesLogsDF16["MATCHUP"].map(home_or_away)
## Change Home to 1 and Away to 0
TeamGamesLogsDF16['HOME_AWAY'] = TeamGamesLogsDF16['HOME_AWAY'].map({"HOME": 1, "AWAY": 0})

## Save the DF to a csv 
TeamGamesLogsDF16.to_csv(r'TeamGamesLogsDF16.csv', index = False, header = True)

# Data source 2: Game data by player for three seasons
Seasons: 18-19, 17-18, 16-17

As noted in the introduction, we also attempted to retrieve the player game logs for seasons 16-17 and 17-18. However, the API failed to return data for the complete list of active NBA players: it recognized only about 35% of the original list. We therefore used only the 18-19 player data, as originally planned. The code for all three seasons is nonetheless provided below.
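
A shortfall like this can be measured by comparing the list of player IDs with the JSON files actually saved. The sketch below assumes the file layout used in the cells that follow (one `<player id>.json` file per player, with rows under `resultSets[0]['rowSet']`). The function name `download_coverage` is hypothetical.

```python
import json
import os

def download_coverage(expected_ids, folder):
    """Return (fraction_covered, missing_ids) for the JSON files in folder.

    A player counts as covered only if the file exists and its first
    result set contains at least one row.
    """
    covered = set()
    for name in os.listdir(folder):
        if not name.endswith(".json"):
            continue
        with open(os.path.join(folder, name)) as player_file:
            payload = json.load(player_file)
        if payload["resultSets"][0]["rowSet"]:
            covered.add(os.path.splitext(name)[0])
    expected = [str(pid) for pid in expected_ids]
    missing = [pid for pid in expected if pid not in covered]
    fraction = 1 - len(missing) / len(expected) if expected else 0.0
    return fraction, missing
```

Calling it with `active_player_ids` and a folder such as `'player-json-files-18'` would give the share of players with at least one game row, plus the IDs still missing.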

## Season 18-19

In [0]:
import os, sys, json
from pandas import DataFrame, Series
import pandas as pd
import time

headers = {
    'Host': 'stats.nba.com',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:61.0) Gecko/20100101 Firefox/61.0',
    'Accept': 'application/json, text/plain, */*',
    'Accept-Language': 'en-US,en;q=0.5',
    'Referer': 'https://stats.nba.com/',
    'Accept-Encoding': 'gzip, deflate, br',
    'Connection': 'keep-alive',
}


# Get list of current player IDs
from nba_api.stats.endpoints import commonallplayers
players_json_raw = commonallplayers.CommonAllPlayers(is_only_current_season=1, league_id="00", season="2018-19", headers = headers).get_json()
players_json = json.loads(players_json_raw)
players = pd.DataFrame(columns=players_json["resultSets"][0]["headers"], data=players_json["resultSets"][0]["rowSet"])
active_player_ids = players["PERSON_ID"].values

In [0]:
len(players)

175

In [0]:
from nba_api.stats.endpoints import playergamelog

## Test one player
pgl = playergamelog.PlayerGameLog(player_id = 2544,
                                  season = "2018-19",
                                  season_type_all_star = "Regular Season").get_json()

In [0]:
## Obtain one json file per player containing that player's game log
os.makedirs("player-json-files-18", exist_ok=True)  # the loop below writes into this folder

for player in active_player_ids:
    # get_json on each player's data
    json_file = playergamelog.PlayerGameLog(player_id = player,
                                              season = "2018-19",
                                              season_type_all_star = "Regular Season").get_json()
    filepath = os.path.join('player-json-files-18', str(player)+'.json')

    # write json file into local folder
    with open(filepath, 'w') as JsonOut:
        JsonOut.write(json_file)
    time.sleep(10) # to avoid getting timed out

In [0]:
## Read the json files into a dataframe with json.load
PlayerGameLog18 = DataFrame()

for filename in os.listdir('player-json-files-18'):
    if filename.endswith(".json"): # only include the json files
        # os.path.join keeps the path valid on Colab (Linux) as well as Windows
        with open(os.path.join('player-json-files-18', filename)) as player_file:
            dict_player = json.load(player_file) # read in as a dictionary
        player_df = DataFrame.from_records(dict_player['resultSets'][0]['rowSet'],
                                           columns = dict_player['resultSets'][0]['headers'])
        PlayerGameLog18 = pd.concat([PlayerGameLog18, player_df])

In [0]:
## Save player df to csv
PlayerGameLog18.to_csv(r'PlayerGameLog18.csv', index = False, header = True)

## Season 17-18

In [0]:
# 17-18
# Get list of player IDs
from nba_api.stats.endpoints import commonallplayers
players_json_raw = commonallplayers.CommonAllPlayers(is_only_current_season=1, league_id="00", season="2017-18").get_json()
players_json = json.loads(players_json_raw)
players = pd.DataFrame(columns=players_json["resultSets"][0]["headers"], data=players_json["resultSets"][0]["rowSet"])
active_player_ids = players["PERSON_ID"].values

## Obtain json file by player by game
os.mkdir("player-json-files-17")

for player in active_player_ids:
    # get_json on each player's data
    json_file = playergamelog.PlayerGameLog(player_id = player,
                                              season = "2017-18",
                                              season_type_all_star = "Regular Season").get_json()
    # write json file into local folder
    with open(os.path.join('player-json-files-17', str(player) + '.json'), 'w') as JsonOut:
        JsonOut.write(json_file)
    time.sleep(10) # to avoid getting timed out
    
## Read the json files into a dataframe with json.load
PlayerGameLog17 = DataFrame()

for filename in os.listdir('player-json-files-17'):
    if filename.endswith(".json"): # only include the json files
        # read from the same season folder that was listed above
        with open(os.path.join('player-json-files-17', filename)) as player_file:
            dict_player = json.load(player_file) # read in as a dictionary
        player_df = DataFrame.from_records(dict_player['resultSets'][0]['rowSet'],
                                           columns = dict_player['resultSets'][0]['headers'])
        PlayerGameLog17 = pd.concat([PlayerGameLog17, player_df])
        
## Save player df to csv
PlayerGameLog17.to_csv(r'PlayerGameLog17.csv', index = False, header = True)

## Season 16-17

In [0]:
# Get list of player IDs
from nba_api.stats.endpoints import commonallplayers
players_json_raw = commonallplayers.CommonAllPlayers(is_only_current_season=1, league_id="00", season="2016-17").get_json()
players_json = json.loads(players_json_raw)
players = pd.DataFrame(columns=players_json["resultSets"][0]["headers"], data=players_json["resultSets"][0]["rowSet"])
active_player_ids = players["PERSON_ID"].values

## Obtain json file by player by game
os.mkdir("player-json-files-16")

for player in active_player_ids:
    # get_json on each player's data
    json_file = playergamelog.PlayerGameLog(player_id = player,
                                              season = "2016-17",
                                              season_type_all_star = "Regular Season").get_json()
    # write json file into local folder
    with open(os.path.join('player-json-files-16', str(player) + '.json'), 'w') as JsonOut:
        JsonOut.write(json_file)
    time.sleep(10) # to avoid getting timed out
    
## Read the json files into a dataframe with json.load
PlayerGameLog16 = DataFrame()

for filename in os.listdir('player-json-files-16'):
    if filename.endswith(".json"): # only include the json files
        # read from the 16-17 folder created above
        with open(os.path.join('player-json-files-16', filename)) as player_file:
            dict_player = json.load(player_file) # read in as a dictionary
        player_df = DataFrame.from_records(dict_player['resultSets'][0]['rowSet'],
                                           columns = dict_player['resultSets'][0]['headers'])
        PlayerGameLog16 = pd.concat([PlayerGameLog16, player_df])
        
## Save player df to csv
PlayerGameLog16.to_csv(r'PlayerGameLog16.csv', index = False, header = True)

# Data Source 3: Basic Player Info
This source is for simulation only; therefore it was only collected for the 18-19 season.

In [0]:
## Get list of current player IDs
import json, os, re, time
import pandas as pd
from nba_api.stats.endpoints import commonallplayers
players_json_raw = commonallplayers.CommonAllPlayers(is_only_current_season=1, league_id="00", season="2018-19").get_json()
players_json = json.loads(players_json_raw)
players = pd.DataFrame(columns=players_json["resultSets"][0]["headers"], data=players_json["resultSets"][0]["rowSet"])
active_player_ids = players["PERSON_ID"].values

In [0]:
## Get player json files
from nba_api.stats.endpoints import commonplayerinfo
os.mkdir("info-json-files")
for player in active_player_ids:
    # get_json on each player's data
    json_file = commonplayerinfo.CommonPlayerInfo(player_id = player).get_json()
    # write json file into local folder
    file_path = os.path.join("info-json-files", str(player) + '.json')
    with open(file_path, 'w') as JsonOut:
        JsonOut.write(json_file)
    time.sleep(10) # to avoid getting timed out

In [0]:
## Create a dataframe from the json files
infoDF = pd.DataFrame()  # must exist before the first concat
for playerID in active_player_ids:
    file_path = os.path.join("info-json-files", str(playerID) + '.json')
    with open(file_path) as player_file:
        player_json = json.load(player_file)
    playerDF = pd.DataFrame(columns=player_json["resultSets"][0]["headers"], data=player_json["resultSets"][0]["rowSet"])
    infoDF = pd.concat([infoDF, playerDF])
infoDF.set_index("PERSON_ID", inplace=True)
infoDF.to_csv("Player_Info.csv")