# Datasets
A brainstorm of which datasets would be useful.

## By player per game
A dictionary of players and their stats in each game they have played

## By team per game
A dictionary of all teams and overall stats for each game they've played

## Ratings
A dictionary of all players and their current ratings, as an aggregate rather than on a game-by-game basis.
Sources could include:
* NBA official stats
* 2K
* Other free stats sources

## Data structure
It might be useful to have custom data structures to access all this data. For example, I could have specific data structures for box scores, player stats, and team stats.
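
As a starting point, this idea could be sketched with a small dataclass per stat line and a helper that groups lines by player. The class name `PlayerGameLine`, its field selection, and `group_by_player` are hypothetical choices for illustration, not a settled design.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class PlayerGameLine:
    """One player's stat line in one game (hypothetical field selection)."""
    game_id: str
    player_id: int
    player_name: str
    team: str
    points: int
    rebounds: int
    assists: int


def group_by_player(lines):
    """Build a 'by player per game' dictionary: player_id -> list of stat lines."""
    by_player = defaultdict(list)
    for line in lines:
        by_player[line.player_id].append(line)
    return dict(by_player)


# Sample rows for illustration only
lines = [
    PlayerGameLine('0021800151', 203925, 'Joe Harris', 'BKN', 11, 0, 1),
    PlayerGameLine('0021800151', 1627747, 'Caris LeVert', 'BKN', 26, 5, 1),
]
by_player = group_by_player(lines)
```

A frozen dataclass keeps each line immutable, which seems a reasonable default for historical stats. A team-level structure could follow the same pattern.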

# Using tutorial
[Here](http://practicallypredictable.com/2017/12/21/web-scraping-nba-team-matchups-box-scores/) is a nice tutorial on how to scrape box score data.
A notebook viewer version is available [here](https://nbviewer.jupyter.org/github/practicallypredictable/posts/blob/master/basketball/nba/notebooks/scrape-stats_nba-team_matchups.ipynb).

In [109]:
from itertools import chain
from pathlib import Path
from time import sleep
import datetime as dt
import requests
from tqdm import tqdm
tqdm.monitor_interval = 0
import numpy as np
import pandas as pd
from collections import namedtuple
pd.options.display.max_rows = 999
pd.options.display.max_columns = 999

### User agent
I looked up my user-agent string [here](https://www.whoishostingthis.com/tools/user-agent/).

In [2]:
USER_AGENT = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) ' +
              'AppleWebKit/537.36 (KHTML, like Gecko) ' +
              'Chrome/70.0.3538.77 Safari/537.36')

REQUEST_HEADERS = {
    'user-agent': USER_AGENT,
}

### Getting gameid from game by game page

Make a function to return all GameIDs in a specified time period.

In [None]:
NBA_ID = '00'

NBA_SEASON_TYPES = {
    'regular': 'Regular Season',
    'playoffs': 'Playoffs',
    'preseason': 'Pre Season',
}

season_type = NBA_SEASON_TYPES['regular']


In [121]:
def parse_game_ids(date):
    '''
    Given a date, find all the GameIDs for that day.
    Takes in a datetime.date and returns a list of named tuples, each with GameID, Date, and Matchup.
    '''
    NBA_URL = 'https://stats.nba.com/stats/teamgamelogs'

    date_from = date
    date_to = date_from

    date_from_string = date_from.strftime('%m/%d/%Y')
    date_to_string = date_from_string

    if date_from.month >= 10:
        season_start_year = date_from.year
    else:
        season_start_year = date_from.year - 1

    season = str(season_start_year) + '-' + str(season_start_year + 1)[-2:]

    nba_params = {
        'LeagueID': NBA_ID,
        'Season': season,
        'SeasonType': season_type,
        'DateFrom': date_from_string, 
        'DateTo': date_to_string,
        'Outcome': 'W' # To prevent doubling up and recording both the winner and loser of one game.
    }

    r = requests.get(NBA_URL, params=nba_params, headers=REQUEST_HEADERS, allow_redirects=False, timeout=15)
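    # Redirects are disabled, so treat anything other than 200 as a failure
    if r.status_code != 200:
        raise RuntimeError('Unexpected status code: ' + str(r.status_code))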
    
    json_dict = r.json() # Turns the json text into a python dict
    games_dict = json_dict['resultSets'][0] # Only has one element
    headers = games_dict['headers']
    
    GAME_ID_COL = headers.index('GAME_ID')
    DATE_COL = headers.index('GAME_DATE')
    MATCHUP_COL = headers.index('MATCHUP')
    games_list = games_dict['rowSet'] # List of list
    
    Game_tuple = namedtuple('Game_tuple', ['GameID', 'Date', 'Matchup'])
    
    # Build one named tuple per game row
    output_games_list = [
        Game_tuple(GameID=game[GAME_ID_COL],
                   Date=dt.datetime.strptime(game[DATE_COL][:10], '%Y-%m-%d').date(),
                   Matchup=game[MATCHUP_COL])
        for game in games_list
    ]
    
    return output_games_list
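
The season string logic inside `parse_game_ids` can be pulled out into a small helper so it can be checked on its own. This is only a sketch: `season_for_date` and its `first_month` parameter are hypothetical names, and the October cutoff is the same assumption the function above makes about when a season starts.

```python
import datetime as dt


def season_for_date(date, first_month=10):
    """Return the season string (e.g. '2018-19') that a date falls in.

    Assumes a season starts in `first_month` and runs into the next
    calendar year, mirroring the month >= 10 check in parse_game_ids.
    """
    start_year = date.year if date.month >= first_month else date.year - 1
    # Two-digit end year, zero-padded so 1999 gives '1999-00'
    return f'{start_year}-{(start_year + 1) % 100:02d}'


season = season_for_date(dt.date(2018, 11, 7))
```

Dates between January and September map to the season that started the previous October, so a March 2019 game also falls in the `2018-19` season.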

In [123]:
parse_game_ids(dt.date(2018,11,7))

[Game_tuple(GameID='0021800160', Date=datetime.date(2018, 11, 7), Matchup='UTA vs. DAL'),
 Game_tuple(GameID='0021800161', Date=datetime.date(2018, 11, 7), Matchup='TOR @ SAC'),
 Game_tuple(GameID='0021800159', Date=datetime.date(2018, 11, 7), Matchup='NOP vs. CHI'),
 Game_tuple(GameID='0021800153', Date=datetime.date(2018, 11, 7), Matchup='OKC @ CLE'),
 Game_tuple(GameID='0021800154', Date=datetime.date(2018, 11, 7), Matchup='DET @ ORL'),
 Game_tuple(GameID='0021800156', Date=datetime.date(2018, 11, 7), Matchup='MIA vs. SAS'),
 Game_tuple(GameID='0021800157', Date=datetime.date(2018, 11, 7), Matchup='PHI @ IND'),
 Game_tuple(GameID='0021800155', Date=datetime.date(2018, 11, 7), Matchup='NYK @ ATL'),
 Game_tuple(GameID='0021800162', Date=datetime.date(2018, 11, 7), Matchup='LAL vs. MIN'),
 Game_tuple(GameID='0021800158', Date=datetime.date(2018, 11, 7), Matchup='MEM vs. DEN')]
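
Since `parse_game_ids` handles a single day, covering a longer time period means calling it once per date. One way this could look is sketched below. The names `iter_dates`, `collect_game_ids`, and `pause_seconds` are hypothetical, the one-second pause is a guess at a polite delay rather than a documented limit, and the fetch function is passed in so the loop can be tried without touching the network.

```python
import datetime as dt
from time import sleep


def iter_dates(start, end):
    """Yield every date from start to end, inclusive."""
    day = start
    while day <= end:
        yield day
        day += dt.timedelta(days=1)


def collect_game_ids(start, end, fetch_day, pause_seconds=1.0):
    """Call fetch_day(date) for each date in the range and flatten the results.

    fetch_day is expected to behave like parse_game_ids: it takes a
    datetime.date and returns a list of games for that day.
    """
    games = []
    for day in iter_dates(start, end):
        games.extend(fetch_day(day))
        sleep(pause_seconds)  # Pause between requests (assumed courtesy delay)
    return games


# Offline stand-in for parse_game_ids, used only to exercise the loop
def fake_fetch(day):
    return [day.isoformat()]


sample = collect_game_ids(dt.date(2018, 11, 5), dt.date(2018, 11, 7),
                          fake_fetch, pause_seconds=0)
```

With the real function, the call would be `collect_game_ids(dt.date(2018, 11, 5), dt.date(2018, 11, 7), parse_game_ids)`. A range that crosses a season boundary should still work, because `parse_game_ids` derives the season from each date.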

### Parsing the box score
Now that we have a way to get the GameIDs, we have to parse the box score.

The XHR that carries the individual player stats is boxscoretraditionalv2.

An example URL:

https://stats.nba.com/stats/boxscoretraditionalv2?EndPeriod=10&EndRange=28800&GameID=0021800162&RangeType=0&Season=2018-19&SeasonType=Regular+Season&StartPeriod=1&StartRange=0

I am not sure what several of these parameters mean, so I will compare them against another request. This example is from a different game, also a final box score:

https://stats.nba.com/stats/boxscoretraditionalv2?EndPeriod=10&EndRange=28800&GameID=0021800152&RangeType=0&Season=2018-19&SeasonType=Regular+Season&StartPeriod=1&StartRange=0

The parameters are identical except for the GameID. I suspect the other parameters change when the game is not final. At the time of writing there are no ongoing games, so I will try again tomorrow.
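
The comparison between the two URLs can also be done mechanically with the standard library instead of by eye. The helpers `query_params` and `differing_params` below are hypothetical names for this sketch. It assumes each parameter appears only once in the query string.

```python
from urllib.parse import parse_qs, urlsplit

BASE = 'https://stats.nba.com/stats/boxscoretraditionalv2'
URL_A = (BASE + '?EndPeriod=10&EndRange=28800&GameID=0021800162&RangeType=0'
         '&Season=2018-19&SeasonType=Regular+Season&StartPeriod=1&StartRange=0')
URL_B = (BASE + '?EndPeriod=10&EndRange=28800&GameID=0021800152&RangeType=0'
         '&Season=2018-19&SeasonType=Regular+Season&StartPeriod=1&StartRange=0')


def query_params(url):
    """Return the query string of url as a flat {name: value} dict."""
    parsed = parse_qs(urlsplit(url).query)
    # parse_qs returns a list per name; keep the first (assumed only) value
    return {name: values[0] for name, values in parsed.items()}


def differing_params(url_a, url_b):
    """Return {name: (value_in_a, value_in_b)} for every parameter that differs."""
    a, b = query_params(url_a), query_params(url_b)
    return {name: (a.get(name), b.get(name))
            for name in sorted(a.keys() | b.keys())
            if a.get(name) != b.get(name)}


print(differing_params(URL_A, URL_B))
# → {'GameID': ('0021800162', '0021800152')}
```

The same helper could be pointed at a request captured during a live game to see which parameters, if any, change.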

While a game is in progress, the page does not call boxscoretraditionalv2. Substituting a live game's GameID into the example URLs above returns null for every player's stats. Ways to handle this:
* Find the real-time box scores and update as we go
* Prevent any update for unfinished games
    * Check for NULL in every box score value
    * Check something else to see if a game is live. This is preferable.
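
The NULL-check option could be sketched roughly as follows. The function names and the choice of `MIN` and `PTS` as the columns to inspect are assumptions for illustration, and the sample result sets are hand-made in the `headers`/`rowSet` shape the endpoint returns. The check asks whether any row has a value, not whether all rows do, on the assumption that players who did not play may carry nulls even in a finished game.

```python
def rows_as_dicts(result_set):
    """Pair each row of a result set with its headers."""
    headers = result_set['headers']
    return [dict(zip(headers, row)) for row in result_set['rowSet']]


def boxscore_is_populated(result_set, stat_columns=('MIN', 'PTS')):
    """Return True if at least one row has a non-null value in stat_columns."""
    return any(
        row.get(col) is not None
        for row in rows_as_dicts(result_set)
        for col in stat_columns
    )


# Hand-made samples: a live game with nulls everywhere, and a finished game
live_game = {
    'headers': ['PLAYER_NAME', 'MIN', 'PTS'],
    'rowSet': [['Player A', None, None], ['Player B', None, None]],
}
final_game = {
    'headers': ['PLAYER_NAME', 'MIN', 'PTS'],
    'rowSet': [['Joe Harris', '28:19', 11], ['Bench Player', None, None]],
}

live_ok = boxscore_is_populated(live_game)
final_ok = boxscore_is_populated(final_game)
```

This only tells an empty box score from a populated one. It cannot confirm that a game is final, which is why a separate live-game indicator would still be preferable.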
    
    

In [93]:
def parse_boxscore(GameID, season='2018-19', season_type=NBA_SEASON_TYPES['regular']):
    ''' 
    Takes in the GameID in string format and returns a dataframe of the boxscore, broken down by player.
    If game is not yet finished, will return a boxscore with NULL values
    >>> df_boxscore = parse_boxscore('0021800151')
    '''
    if not isinstance(GameID, str):
        raise TypeError('GameID must be string')
    
    URL_GAME_BOXSCORE = 'https://stats.nba.com/stats/boxscoretraditionalv2'
    boxscore_params = {
        'GameID': GameID,
        'Season': season,
        'SeasonType': season_type,
        'EndPeriod': '10',
        'EndRange': '28800',
        'RangeType': '0',
        'StartPeriod': '1',
        'StartRange': '0'
    }

    r_boxscore = requests.get(URL_GAME_BOXSCORE, params=boxscore_params, 
                              headers=REQUEST_HEADERS, allow_redirects=False, timeout=15)

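    # Redirects are disabled, so treat anything other than 200 as a failure
    if r_boxscore.status_code != 200:
        raise RuntimeError('Unexpected status code: ' + str(r_boxscore.status_code))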

    json_boxscore = r_boxscore.json()

    player_stats = json_boxscore['resultSets'][0] # 0 is player stats, 1 is team stats, 2 is starter bench stats
    boxscore_headers = player_stats['headers'] 
    boxscore_stats = player_stats['rowSet']

    df_boxscore = pd.DataFrame(columns=boxscore_headers, data=boxscore_stats)
    return df_boxscore

In [96]:
parse_boxscore('0021800151')

Unnamed: 0,GAME_ID,TEAM_ID,TEAM_ABBREVIATION,TEAM_CITY,PLAYER_ID,PLAYER_NAME,START_POSITION,COMMENT,MIN,FGM,FGA,FG_PCT,FG3M,FG3A,FG3_PCT,FTM,FTA,FT_PCT,OREB,DREB,REB,AST,STL,BLK,TO,PF,PTS,PLUS_MINUS
0,21800151,1610612751,BKN,Brooklyn,203925,Joe Harris,F,,28:19,5.0,11.0,0.455,1.0,6.0,0.167,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,2.0,11.0,23.0
1,21800151,1610612751,BKN,Brooklyn,201162,Jared Dudley,F,,24:29,2.0,7.0,0.286,1.0,5.0,0.2,0.0,0.0,0.0,0.0,2.0,2.0,1.0,0.0,0.0,2.0,4.0,5.0,5.0
2,21800151,1610612751,BKN,Brooklyn,1628386,Jarrett Allen,C,,24:20,4.0,7.0,0.571,0.0,1.0,0.0,2.0,2.0,1.0,2.0,7.0,9.0,5.0,1.0,1.0,1.0,2.0,10.0,13.0
3,21800151,1610612751,BKN,Brooklyn,1627747,Caris LeVert,G,,32:12,10.0,16.0,0.625,3.0,6.0,0.5,3.0,4.0,0.75,1.0,4.0,5.0,1.0,1.0,1.0,3.0,2.0,26.0,9.0
4,21800151,1610612751,BKN,Brooklyn,1626156,D'Angelo Russell,G,,26:55,6.0,15.0,0.4,1.0,6.0,0.167,2.0,2.0,1.0,1.0,5.0,6.0,3.0,2.0,0.0,0.0,3.0,15.0,7.0
5,21800151,1610612751,BKN,Brooklyn,203459,Allen Crabbe,,,22:25,3.0,12.0,0.25,1.0,5.0,0.2,0.0,0.0,0.0,3.0,4.0,7.0,2.0,0.0,1.0,1.0,2.0,7.0,12.0
6,21800151,1610612751,BKN,Brooklyn,202334,Ed Davis,,,23:40,4.0,7.0,0.571,0.0,0.0,0.0,1.0,5.0,0.2,4.0,8.0,12.0,0.0,1.0,0.0,0.0,3.0,9.0,9.0
7,21800151,1610612751,BKN,Brooklyn,1626178,Rondae Hollis-Jefferson,,,12:33,1.0,6.0,0.167,0.0,1.0,0.0,3.0,4.0,0.75,0.0,2.0,2.0,1.0,1.0,0.0,2.0,0.0,5.0,-1.0
8,21800151,1610612751,BKN,Brooklyn,203915,Spencer Dinwiddie,,,29:56,5.0,7.0,0.714,2.0,4.0,0.5,0.0,0.0,0.0,0.0,5.0,5.0,3.0,2.0,1.0,0.0,4.0,12.0,20.0
9,21800151,1610612751,BKN,Brooklyn,203894,Shabazz Napier,,,15:12,1.0,4.0,0.25,0.0,2.0,0.0,2.0,2.0,1.0,2.0,2.0,4.0,5.0,1.0,0.0,1.0,0.0,4.0,13.0
