# NBA Game Predictions - Final Report
## Kevin Yang, Eric Lee, Derek Young


### Introduction
The NBA (National Basketball Association) is an American-based men's basketball league that is widely considered to be the best basketball league in the world. As of the 2016-2017 season, there are 30 teams divided into two conferences, each further divided into three divisions (six divisions in total).

Our main objective is to build prediction models for NBA team performance.

More specifically, we focus our project on predicting the outcome of a given NBA game. At a high level, our approach involves determining the features that matter most for game outcomes, and then training a supervised machine learning model on these features over many previous games.

### Report Contents
The sections below outline our pipeline for predicting NBA games.

1. [Data Collection and Scraping](#Data Collection and Scraping)
2. [Exploratory Data Analysis]()
3. [Feature Selection]()
4. [Cutoff and "N"-previous Game Selection]()
5. [Naive Approach]()
6. [Linear Regression Model](#Linear Regression Model)
7. [Logistic Regression Model]()
8. [Support Vector Machines]()
9. [Concluding Thoughts]()

While most of our project is written in Python, a subset was done in R, specifically the linear and logistic regression modeling. Throughout the report, we highlight the snippets of code that were most important to our results. All code can be found ____.

<a id='Data Collection and Scraping'></a>
### Data Collection

We found that the best place to get our data was [stats.nba.com](http://stats.nba.com). First, we surveyed the webpage, finding the location and organization of data. Using Google Chrome Developer Tools, we were able to find the link structures and JSON responses containing useful data for our models; these methods will be made more clear in the code below.

Once we found a scalable approach to get all team game data for any NBA season, we decided to store the scraped data in a local `sqlite3` database for further use. We found this to be a consistent and reliable approach to our data analysis: our data does not run the risk of changing unexpectedly, and we do not have to worry about overloading any servers with too many requests. We wrote the functions `get_league_gamelogs`, `generate_year_list`, and `load_all_gamelogs` to collect and then store our data.


In [None]:
import requests
import sqlite3
import pandas as pd

def get_league_gamelogs(season):
    """ Given a season (string, ex: 2016-17), returns a (header, log_list) where the header represents a 
        key describing the format of the logs in log_list
    Input:
        season (str): season string, ex: '2016-17'
    Output:
        (list, list): column names, and a list of game log rows (one row per team per game)
    """
    league_log_url = ("http://stats.nba.com/stats/leaguegamelog?Counter=1000&DateFrom=&DateTo=&" + 
                  "Direction=DESC&LeagueID=00&PlayerOrTeam=T&Season=" + str(season) + 
                  "&SeasonType=Regular+Season&Sorter=PTS")
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:39.0) Gecko/20100101 Firefox/39.0'}

    # Request URL and parse JSON
    response = requests.get(league_log_url, headers = headers)
    response.raise_for_status() # Raise exception if invalid response
    response_json = response.json()
    log_list = response_json['resultSets'][0]['rowSet']
    header = response_json['resultSets'][0]['headers']
    
    return (header, log_list)

def generate_year_list(start, yrs):
    """ Generate a year list to pass into load_all_gamelogs
    Input:
        start (int): The first year we are interested in loading
        yrs (int): How many years since start that we are including
    Output:
        (List): List of years
    """
    year_list = []
    curr_yr = start
    for i in range(yrs):
        nextyr = curr_yr + 1 
        year_list.append(str(curr_yr)+"-"+str(nextyr)[2:])
        curr_yr = nextyr
    return year_list
    
def load_all_gamelogs(conn, start, yrs):
    """ Load nba gamelog data for the past yrs years as a games tables into an sqlite database given in conn
    Input:
        conn (sqlite3.Connection): Connection object corresponding to the database; used to perform SQL commands.
        start (int): Starting year of the first season to load, ex: 1946
        yrs (int): Number of seasons to include in table
    Output:
        None
    """
    
    cursor = conn.cursor()
    
    year_list = generate_year_list(start,yrs)
    
    # Clear any existing league_log table
    cursor.execute('drop table if exists league_log')
    
    # Create league_log table
    cursor.execute("""
    CREATE TABLE IF NOT EXISTS league_log (
    season_id TEXT, team_id INTEGER, team_abbreviation TEXT, team_name TEXT, game_id INTEGER,
    game_date TEXT, matchup TEXT, wl TEXT, min INTEGER, fgm INTEGER, fga INTEGER,
    fg_pct REAL, fg3m INTEGER, fg3a INTEGER, fg3_pct REAL, ftm INTEGER, fta INTEGER, ft_pct REAL,
    oreb INTEGER, dreb INTEGER, reb INTEGER, ast INTEGER, stl INTEGER, blk INTEGER, tov INTEGER,
    pf INTEGER, pts INTEGER, plus_minus INTEGER
    )""")
    
    for year in year_list:
        (header, log_list) = get_league_gamelogs(year)
        
        question_marks = "(?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ? ,?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)"
        query_string = "INSERT INTO league_log VALUES " + question_marks
        for log in log_list:
            cursor.execute(query_string,
                          (log[0],log[1],log[2],log[3],log[4],log[5],log[6],log[7],
                          log[8], log[9], log[10], log[11], log[12], log[13], log[14],
                          log[15],log[16], log[17], log[18], log[19], log[20], log[21],
                          log[22], log[23], log[24], log[25], log[26], log[27]))
            
    conn.commit()
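
As a side note, the row-by-row insert above can also be written as a single bulk insert. The following is a minimal sketch, assuming a hypothetical four-column table `demo_log` and made-up rows in place of the real 28-column `league_log` data; it builds the placeholder string from the row width and uses `executemany` from the standard `sqlite3` module.

```python
import sqlite3

# Hypothetical, shortened schema: the real league_log table has 28 columns.
conn = sqlite3.connect(":memory:")
cursor = conn.cursor()
cursor.execute(
    "CREATE TABLE demo_log (season_id TEXT, team_id INTEGER, wl TEXT, pts INTEGER)"
)

# Made-up rows standing in for the 'rowSet' list returned by the API.
log_list = [
    ["22016", 1, "W", 110],
    ["22016", 2, "L", 98],
]

# Build "(?, ?, ?, ?)" from the row width instead of typing it by hand.
placeholders = "(" + ", ".join(["?"] * len(log_list[0])) + ")"

# Insert every row with one call.
cursor.executemany("INSERT INTO demo_log VALUES " + placeholders, log_list)
conn.commit()

row_count = cursor.execute("SELECT COUNT(*) FROM demo_log").fetchone()[0]
```

The parameterized `?` placeholders are kept, so the values are still bound by `sqlite3` rather than concatenated into the SQL string.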

In [None]:
# Connect to database file
conn = sqlite3.connect(r"db/league.db")
conn.text_factory = str

# Define relevant years
start_year = 1946
length = 2017 - start_year

load_all_gamelogs(conn, start_year, length)

# Load newly stored data into pandas DataFrame league_df
league_df = pd.read_sql_query('SELECT * FROM league_log', conn)
league_df.head()

In [None]:
def validate_data(df):
    """ Given a DataFrame of league game logs, performs a few sanity checks based upon the number of games 
        in a season.
    Input:
        df (pd.DataFrame): league game logs
    Output:
        str
    """
    year_list = df['season_id'].unique().tolist()
    
    for year in year_list:
        df_temp = df[df['season_id'] == year]
        
        if year == '22011':
            # Lockout year
            assert(df_temp.shape[0] == 1980)
        elif year == '22016':
            # Current ongoing year
            continue
        elif year >= '22004':
            # Normal 30 team, 82 game schedule
            assert(df_temp.shape[0] == 2460) 
    return "Passed!"

print(validate_data(league_df))
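
The same kind of sanity check can be sketched with a `groupby`. This is only an illustration on a made-up miniature log (`toy_df` is not our real data): it counts rows per season, which for a full 30-team, 82-game season should be 30 * 82 = 2460, and checks that every game appears exactly twice, once per team.

```python
import pandas as pd

# Made-up miniature log: two seasons, each game appears once per team.
toy_df = pd.DataFrame({
    "season_id": ["22014", "22014", "22015", "22015", "22015", "22015"],
    "game_id":   [1, 1, 2, 2, 3, 3],
})

# Number of rows per season.
rows_per_season = toy_df.groupby("season_id").size()

# Every game_id should appear exactly twice (one row per team).
rows_per_game = toy_df.groupby("game_id").size()
all_games_paired = bool((rows_per_game == 2).all())
```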

We have now successfully collected and stored league log data for years ranging from `1946 - 2016`. Further, the data passes a few sanity checks that help ensure its correctness. Notably, the head of `league_df` contains some NaN values; this is because some statistics are not available on `stats.nba.com` for certain older seasons. We print the following to show more recent data from the 2005-06 season.

In [None]:
(league_df[league_df['season_id'] == "22005"]).head()

### Feature Creation
Before running any exploratory data analysis, we need to preprocess our dataframe to include a few new columns. This is because each row in our current dataframe only contains data on postgame stats (i.e. points scored, win/loss), and does not contain any data that can be used as predictors before the game started. Based upon intuition and inspiration from other sources (mainly Amorim Torres [1]) we wanted to add the following features/columns:

1. is_home (indicator displaying 1 if the given team is playing at home, 0 otherwise)
2. opp_team_id (opposing team identifier)
3. wl_binary (indicator displaying 1 if the given team won, 0 otherwise)
4. home_win_pct (win percentage of the home team prior to the current game)
5. away_win_pct (win percentage of the away team prior to the current game)
6. home_avg_pt_diff (average point differential of the home team prior to the current game)
7. away_avg_pt_diff (average point differential of the away team prior to the current game)
8. home_win_pct_N (win percentage of the home team in the last N games prior to the current game)
9. away_win_pct_N (win percentage of the away team in the last N games prior to the current game)
10. home_win_pct_as_home (win percentage of the home team as home prior to the current game)
11. away_win_pct_as_away (win percentage of the away team as away prior to the current game)
12. home_back_to_back (indicator displaying 1 if the home team just played the day prior, 0 otherwise)
13. away_back_to_back (indicator displaying 1 if the away team just played the day prior, 0 otherwise)
14. home_game_count (number of games the home team has played prior to the current game)
15. away_game_count (number of games the away team has played prior to the current game)

We note that since the dynamics of NBA teams change greatly from season to season, all these features will
be season specific. For example, home_win_pct only considers the win percentage of the team in the games
within the current season. To create these features, we note that some of the features are not time-sensitive
(i.e. 1-3), while the other features 4-15 need to be ordered by time. Because of this, we will divide the
preprocessing into two functions, `add_constant_features` and `add_variable_features`, which add the 
constant features and the time-sensitive features respectively.
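
To make the constant features concrete, here is a minimal vectorized sketch on a made-up four-row dataframe (`toy_df`, with invented team ids and matchup strings). It assumes, as our data does, that each `game_id` has exactly two rows and that the away team's `matchup` string contains "@". It is an illustration of the idea, not the function we actually use.

```python
import numpy as np
import pandas as pd

# Made-up rows: two games, one row per team per game.
toy_df = pd.DataFrame({
    "game_id": [1, 1, 2, 2],
    "team_id": [10, 20, 30, 10],
    "matchup": ["AAA vs. BBB", "BBB @ AAA", "CCC vs. AAA", "AAA @ CCC"],
    "wl":      ["W", "L", "L", "W"],
})

# is_home: the away team's matchup string contains "@".
toy_df["is_home"] = np.where(toy_df["matchup"].str.contains("@"), 0, 1)

# wl_binary: 1 for a win, 0 otherwise.
toy_df["wl_binary"] = (toy_df["wl"] == "W").astype(int)

# opp_team_id: with exactly two rows per game, the opponent's id is the
# sum of both team ids in the game minus the row's own team id.
id_sum = toy_df.groupby("game_id")["team_id"].transform("sum")
toy_df["opp_team_id"] = id_sum - toy_df["team_id"]
```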

In [None]:
import numpy as np

def add_constant_features(league_df):
    """ Given a dataframe league_df, returns new league_df by converting the 'game_date' column to datetime if 
    necessary, and adds is_home indicator, opp_team_id, and wl_binary indicator
    Input:
        league_df (pandas.DataFrame): dataframe containing league logs
    Output:
        pd.DataFrame
    """
    # Convert to datetime, note that this modifies the input dataframe
    league_df['game_date'] = pd.to_datetime(league_df['game_date'])
    
    # Add new columns
    is_home = np.zeros(len(league_df), dtype=np.int64)
    opp_team_id = np.zeros(len(league_df), dtype=np.int64)
    wl_binary = np.zeros(len(league_df), dtype = np.int64)
    
    league_df = league_df.assign(is_home = is_home)
    league_df = league_df.assign(opp_team_id = opp_team_id)
    league_df = league_df.assign(wl_binary = wl_binary)
    
    
    # Add home indicator variable
    for (index, row) in league_df.iterrows():
        matchup = row['matchup']
        # The away team's matchup string contains "@"
        if "@" in matchup:
            league_df.at[index, "is_home"] = 0
        else:
            league_df.at[index, "is_home"] = 1
            
    # Add opposing team ID
    for (index,row) in league_df.iterrows():
        game_id = row['game_id']
        team_id = row['team_id']
        
        # Find other game with the same game ID
        df_game = league_df[league_df['game_id'] == game_id]
        assert(len(df_game) == 2)
        found_opp = False
        for (inner_index,inner_row) in df_game.iterrows():
            curr_team_id = inner_row['team_id']
            if curr_team_id == team_id:
                continue
            else:
                # Found opposing team, update opposing team ID
                league_df.at[index, 'opp_team_id'] = curr_team_id
                found_opp = True
        assert(found_opp)
        
    # Add binary representation of wins and losses
    for (index, row) in league_df.iterrows():
        wl = row['wl']
        if wl == 'W':
            league_df.at[index, 'wl_binary'] = 1
        else:
            league_df.at[index, 'wl_binary'] = 0
    
    return league_df

def add_variable_features(league_df, N=8):
    """ Given a dataframe league_df, returns a new df containing extra columns for each new feature.
        Features are computed per season, using only games played before the current game.
    Input:
        league_df (pandas.DataFrame): dataframe containing league logs
        N (int): lookback parameter
    Output:
        pd.DataFrame
    """    
    lookback = N
    new_df = league_df.sort_values('game_date')
    
    # Add new columns
    home_win_pct = np.zeros(len(new_df))
    away_win_pct = np.zeros(len(new_df))
    home_avg_pt_diff = np.zeros(len(new_df))
    away_avg_pt_diff = np.zeros(len(new_df))
    home_win_pct_N = np.zeros(len(new_df))
    away_win_pct_N = np.zeros(len(new_df))
    away_win_pct_as_away = np.zeros(len(new_df))
    home_win_pct_as_home = np.zeros(len(new_df))
    home_back_to_back = np.zeros(len(new_df))
    away_back_to_back = np.zeros(len(new_df))
    home_game_count = np.zeros(len(new_df))
    away_game_count = np.zeros(len(new_df))
    home_mileage = np.zeros(len(new_df))
    away_mileage = np.zeros(len(new_df))
    
    new_df = new_df.assign(home_win_pct = home_win_pct)
    new_df = new_df.assign(away_win_pct = away_win_pct)
    new_df = new_df.assign(home_avg_pt_diff = home_avg_pt_diff)
    new_df = new_df.assign(away_avg_pt_diff = away_avg_pt_diff)
    new_df = new_df.assign(home_win_pct_N = home_win_pct_N)
    new_df = new_df.assign(away_win_pct_N = away_win_pct_N)
    new_df = new_df.assign(away_win_pct_as_away = away_win_pct_as_away)
    new_df = new_df.assign(home_win_pct_as_home = home_win_pct_as_home)
    new_df = new_df.assign(home_back_to_back = home_back_to_back)
    new_df = new_df.assign(away_back_to_back = away_back_to_back)
    new_df = new_df.assign(home_game_count = home_game_count)
    new_df = new_df.assign(away_game_count = away_game_count)
    
    # add features
    # Split into one dataframe per season
    groupList = [group for (_, group) in new_df.groupby('season_id')]
    
    for season_df in groupList:
        # Initialize dictionary containing wins and losses for each team
        win_dict = dict()
        lose_dict = dict()
        running_dict = dict()
        
        # Stores list of game dates for each team
        running_date_dict = dict()
        
        # Total plus minus so far
        plus_minus_dict = dict()
        
        # Stores home and away game counts and w/l counts
        wins_as_home = dict()
        wins_as_away = dict()
        games_as_home = dict()
        games_as_away = dict()
        
        running_locations = dict()
        
        for team in season_df['team_id'].unique():
            win_dict[team] = 0
            lose_dict[team] = 0
            running_dict[team] = []
            plus_minus_dict[team] = 0
            running_date_dict[team] = []
            
            # Track wins at home, at away, and total games at home, at away
            wins_as_home[team] = 0
            wins_as_away[team] = 0
            games_as_home[team] = 0
            games_as_away[team] = 0
        
        # Sort season by day
        season_df = season_df.sort_values('game_date')
        
        seen_games = set()
        
        for (index, row) in season_df.iterrows():
            is_home = row['is_home']
            team_id = row['team_id']
            opp_team_id = row['opp_team_id']
            wl = row['wl']
            game_id = row['game_id']
            curr_team_plus_minus = row['plus_minus']
            opp_team_plus_minus = -curr_team_plus_minus
            game_date = row['game_date']
            
            season_id = row['season_id']
            
            if is_home == 1:
                home_team_id = team_id
                away_team_id = opp_team_id
            else:
                home_team_id = opp_team_id
                away_team_id = team_id
                
            # Update home_win_pct, away_win_pct
            home_win_pct = 0
            away_win_pct = 0

            if win_dict[home_team_id] + lose_dict[home_team_id] > 0:
                home_win_pct = (win_dict[home_team_id])/float(win_dict[home_team_id] + lose_dict[home_team_id])
            if win_dict[away_team_id] + lose_dict[away_team_id] > 0:
                away_win_pct = (win_dict[away_team_id])/float(win_dict[away_team_id] + lose_dict[away_team_id])
                
            new_df.at[index, 'home_win_pct'] = home_win_pct
            new_df.at[index, 'away_win_pct'] = away_win_pct
            
            # Update home_win_pct_N, away_win_pct_N
            home_win_pct_N = 0
            away_win_pct_N = 0
            
            home_games_count = len(running_dict[home_team_id])
            away_games_count = len(running_dict[away_team_id])
            
            new_df.at[index, 'home_game_count'] = home_games_count
            new_df.at[index, 'away_game_count'] = away_games_count
            
            if home_games_count > 0:
                if home_games_count > lookback:
                    lookback_games = running_dict[home_team_id][home_games_count - lookback:]
                else:
                    lookback_games = running_dict[home_team_id]
                home_win_pct_N = sum(lookback_games)/float(len(lookback_games))
                
            if away_games_count > 0:
                if away_games_count > lookback:
                    lookback_games = running_dict[away_team_id][away_games_count - lookback:]
                else:
                    lookback_games = running_dict[away_team_id]
                away_win_pct_N = sum(lookback_games)/float(len(lookback_games))
                
            new_df.at[index, 'home_win_pct_N'] = home_win_pct_N
            new_df.at[index, 'away_win_pct_N'] = away_win_pct_N
            
            # Update home_avg_pt_diff, away_avg_pt_diff
            home_avg_pt_diff = 0
            away_avg_pt_diff = 0
            
            if home_games_count > 0:
                running_pt_diff = plus_minus_dict[home_team_id]
                home_avg_pt_diff = running_pt_diff/float(home_games_count)
            if away_games_count > 0:
                running_pt_diff = plus_minus_dict[away_team_id]
                away_avg_pt_diff = running_pt_diff/float(away_games_count)
                
            new_df.at[index, 'home_avg_pt_diff'] = home_avg_pt_diff
            new_df.at[index, 'away_avg_pt_diff'] = away_avg_pt_diff
            
            # Update back-to-back indicators   
            home_back_to_back = 0
            away_back_to_back = 0
            
            if home_games_count > 0:
                most_recent_date = running_date_dict[home_team_id][home_games_count - 1]
                
                if game_date.toordinal() - most_recent_date.toordinal() == 1:
                    # Back-to-back
                    home_back_to_back = 1
                
            if away_games_count > 0:
                most_recent_date = running_date_dict[away_team_id][away_games_count - 1]
                
                if game_date.toordinal() - most_recent_date.toordinal() == 1:
                    # Back-to-back
                    away_back_to_back = 1
                    
            new_df.at[index, 'home_back_to_back'] = home_back_to_back
            new_df.at[index, 'away_back_to_back'] = away_back_to_back
            
            # Update home_win_pct_as_home, away_win_pct_as_away
            home_win_pct_as_home = 0
            away_win_pct_as_away = 0
            
            home_games_as_home = games_as_home[home_team_id]
            away_games_as_away = games_as_away[away_team_id]
            
            if (home_games_as_home > 0):
                home_win_pct_as_home = (wins_as_home[home_team_id])/float(home_games_as_home)
            if (away_games_as_away > 0):
                away_win_pct_as_away = (wins_as_away[away_team_id])/float(away_games_as_away)
                
            new_df.at[index, 'home_win_pct_as_home'] = home_win_pct_as_home
            new_df.at[index, 'away_win_pct_as_away'] = away_win_pct_as_away
                
            # Update running stats
            if (wl == 'W'):
                if game_id in seen_games:
                    win_dict[team_id] += 1
                    lose_dict[opp_team_id] += 1
                    running_dict[team_id].append(1)
                    running_dict[opp_team_id].append(0)
                    
                    # Update home team and away team w/l
                    if is_home == 1:
                        wins_as_home[team_id] += 1
                        games_as_home[team_id] += 1
                        games_as_away[opp_team_id] += 1
                    else:
                        wins_as_away[team_id] += 1
                        games_as_away[team_id] += 1
                        games_as_home[opp_team_id] += 1
                    
            else:
                if game_id in seen_games:
                    win_dict[opp_team_id] += 1
                    lose_dict[team_id] += 1
                    running_dict[opp_team_id].append(1)
                    running_dict[team_id].append(0)
                    
                    # update home team and away team w/l
                    if is_home == 1:
                        wins_as_away[opp_team_id] += 1
                        games_as_away[opp_team_id] += 1
                        games_as_home[team_id] += 1
                        
                    else:
                        wins_as_home[opp_team_id] += 1
                        games_as_home[opp_team_id] += 1
                        games_as_away[team_id] += 1
            if game_id in seen_games:
                plus_minus_dict[team_id] += curr_team_plus_minus
                plus_minus_dict[opp_team_id] += opp_team_plus_minus
                running_date_dict[team_id].append(game_date)
                running_date_dict[opp_team_id].append(game_date)
                
#                 if (new_years):
#                     running_locations[home_team_id].append(curr_location)
#                     running_locations[away_team_id].append(curr_location)
                
            seen_games.add(game_id)
    return new_df

In [None]:
all_games = add_variable_features(add_constant_features(league_df))
home_games = all_games[all_games['is_home'] == 1]
all_games.head()

From the output above, we can see some of the new columns we added. However, we see that most of the values are 0, which is expected since the head of the dataframe contains data from the beginning of a season. To display some of our results, we will show the dataframe tail.

In [None]:
all_games.tail()
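
For intuition, two of the time-sensitive features can be sketched per team with grouped cumulative operations. The example below uses a made-up schedule for a single team (`toy_df`) and the hypothetical column names `win_pct_before` and `back_to_back`; unlike `add_variable_features`, it works from one team's point of view rather than filling home and away columns for each game. It follows the same convention of using 0 when a team has no prior games.

```python
import pandas as pd

# Made-up schedule for one team in one season.
toy_df = pd.DataFrame({
    "season_id": ["22015"] * 4,
    "team_id":   [10, 10, 10, 10],
    "game_date": pd.to_datetime(
        ["2015-11-01", "2015-11-02", "2015-11-05", "2015-11-06"]
    ),
    "wl_binary": [1, 0, 1, 1],
}).sort_values("game_date")

grp = toy_df.groupby(["season_id", "team_id"])

# Games played and games won before the current game.
games_before = grp.cumcount()
wins_before = grp["wl_binary"].cumsum() - toy_df["wl_binary"]

# Win percentage prior to the current game (0 when no games were played).
ratio = wins_before / games_before.where(games_before > 0)
toy_df["win_pct_before"] = ratio.fillna(0.0)

# Back-to-back: the team's previous game was exactly one day earlier.
days_since_last = grp["game_date"].diff().dt.days
toy_df["back_to_back"] = (days_since_last == 1).astype(int)
```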

### Lookback Parameter (N) Tuning
A crucial parameter in our model is the lookback parameter, which is used in the calculation of the `home_win_pct_N` and `away_win_pct_N` features. The idea behind these features is that team performance may be streaky, either hot or cold. Further, team dynamics change throughout the season, e.g. through injuries and other roster changes. Thus, it is important to consider not only team performance over the entire season, but also recent team performance.

To determine which `N` value to use, we will run some basic tests. Naturally, we might ask: if we predicted game results solely on recent game performance, how well would we do for different `N`? By comparing these results, we can
isolate the lookback variable and make simple comparisons between different parameter values. We note that this approach was inspired by the work of Amorim Torres [1]. 
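
The lookback win percentage itself can be sketched as a shifted rolling mean. The example below is an illustration on made-up results for two teams (`toy_df`), with a hypothetical helper `prior_win_pct` and `N = 2`; it assumes the rows are already in date order. Shifting by one game excludes the current game, and `min_periods=1` lets the window shrink when fewer than `N` prior games exist.

```python
import pandas as pd

# Made-up results for two teams, rows assumed to be in date order.
toy_df = pd.DataFrame({
    "season_id": ["22015"] * 8,
    "team_id":   [10, 20, 10, 20, 10, 20, 10, 20],
    "wl_binary": [1, 0, 1, 0, 0, 1, 1, 0],
})

def prior_win_pct(results, n):
    """Win percentage over at most the last n games before each game."""
    # shift(1) drops the current game from its own window.
    shifted = results.shift(1)
    # min_periods=1 uses however many prior games exist, up to n.
    pct = shifted.rolling(window=n, min_periods=1).mean()
    # No prior games: use 0, as in the rest of the feature code.
    return pct.fillna(0.0)

toy_df["win_pct_2"] = (
    toy_df.groupby(["season_id", "team_id"])["wl_binary"]
          .transform(lambda s: prior_win_pct(s, 2))
)
```

Grouping by both `season_id` and `team_id` keeps the window from crossing season boundaries, in line with the season-specific features described earlier.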

In [None]:
# Use svg backend for better quality
import matplotlib
matplotlib.use("svg")
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('ggplot')
matplotlib.rcParams['figure.figsize'] = (10.0, 5.0)

def add_lookback_features(league_df, params):
    """ Given a dataframe league_df and a list of natural numbers params, returns a new df containing 
        extra columns for each lookback parameter in params.
    Input:
        league_df (pandas.DataFrame): dataframe containing league logs
        params (list): list containing positive integers
    Output:
        pd.DataFrame
    """ 
    
    # Ensure correct params formatting
    for i in params:
        assert (type(i) == int and i > 0)
        
    new_df = league_df.sort_values('game_date')
    
    # Initialize new columns
    for i in params:
        lookback = i
        
        home_col_name = 'home_win_pct_' + str(i)
        away_col_name = 'away_win_pct_' + str(i)
        
        new_df.loc[:,home_col_name] = np.zeros(len(new_df))
        new_df.loc[:,away_col_name] = np.zeros(len(new_df))
        
        # Split into one dataframe per season
        groupList = [group for (_, group) in new_df.groupby('season_id')]

        for season_df in groupList:
            # Initialize dictionary containing wins and losses for each team
            running_dict = dict()

            for team in season_df['team_id'].unique():
                running_dict[team] = []

            # Sort season by day
            season_df = season_df.sort_values('game_date')

            seen_games = set()
            
            for (index, row) in season_df.iterrows():
                team_id = row['team_id']
                opp_team_id = row['opp_team_id']
                wl = row['wl']
                game_id = row['game_id']
                is_home = row['is_home']
                
                if is_home == 1:
                    home_team_id = team_id
                    away_team_id = opp_team_id
                else:
                    home_team_id = opp_team_id
                    away_team_id = team_id
                
                # Update home_win_pct_N, away_win_pct_N
                home_win_pct_N = 0
                away_win_pct_N = 0
                
                home_games_count = len(running_dict[home_team_id])
                away_games_count = len(running_dict[away_team_id])

                if home_games_count > 0:
                    if home_games_count > lookback:
                        lookback_games = running_dict[home_team_id][home_games_count - lookback:]
                    else:
                        lookback_games = running_dict[home_team_id]
                    home_win_pct_N = sum(lookback_games)/float(len(lookback_games))

                if away_games_count > 0:
                    if away_games_count > lookback:
                        lookback_games = running_dict[away_team_id][away_games_count - lookback:]
                    else:
                        lookback_games = running_dict[away_team_id]
                    away_win_pct_N = sum(lookback_games)/float(len(lookback_games))

                new_df.at[index, home_col_name] = home_win_pct_N
                new_df.at[index, away_col_name] = away_win_pct_N
                
                # Update running stats
                if (wl == 'W'):
                    if game_id in seen_games:
                        running_dict[team_id].append(1)
                        running_dict[opp_team_id].append(0)

                else:
                    if game_id in seen_games:
                        running_dict[opp_team_id].append(1)
                        running_dict[team_id].append(0)
                        
                seen_games.add(game_id)
    return new_df

def graph_lookback(league_df, params):
    """ Given a dataframe league_df, a season_id, and a list of natural numbers params, graphs the effectiveness
        of using each of the params as a naive classifier
    Input:
        league_df (pandas.DataFrame): dataframe containing league logs, including the lookback columns from params
        params (list): sorted list containing positive integers
    Output:
        None
    """ 
    # Split into one dataframe per season
    groupList = [group for (_, group) in league_df.groupby('season_id')]
    
    fig = plt.figure()
    ax = fig.add_subplot(111)
    
    for season_df in groupList:
        # Initialize dictionary counting number of correct classifications for each param i
        correct_dict = dict()

        for i in params:
            correct_dict[i] = 0

        # Sort season by day
        season_df = season_df.sort_values('game_date')

        #seen_games = set()
        
        season_id = ""

        for (index, row) in season_df.iterrows():
            team_id = row['team_id']
            opp_team_id = row['opp_team_id']
            wl = row['wl_binary']
            game_id = row['game_id']
            is_home = row['is_home']
            season_id = row['season_id']
            
            
            for i in params:
                home_col_name = 'home_win_pct_' + str(i)
                away_col_name = 'away_win_pct_' + str(i)
                
                home_pct = row[home_col_name]
                away_pct = row[away_col_name]
                
                if (is_home == 1 and wl == 1 and home_pct >= away_pct):
                    correct_dict[i] += 1
                elif (is_home == 1 and wl == 0 and away_pct > home_pct):
                    correct_dict[i] += 1
                elif (is_home == 0 and wl == 1 and away_pct > home_pct):
                    correct_dict[i] += 1
                elif (is_home == 0 and wl == 0 and home_pct >= away_pct):
                    correct_dict[i] += 1
                    
        pct_list = []
        for i in params:
            pct_list.append(correct_dict[i]/float(len(season_df)))
            
        ax.plot(params, pct_list, label = season_id[1:])
        
    ax.set_xlabel("Lookback Parameter (N)", fontsize=15)
    ax.set_ylabel("Classification Accuracy", fontsize=18)
    
    fig.suptitle('Classification Accuracy Using Different Lookback Parameters', fontsize = 18)
    
    ax.legend(loc="upper left", bbox_to_anchor=(1,1))
        
    plt.show()
    

In [None]:
all_games_lookback = add_lookback_features(all_games, [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15])
all_games_lookback.tail()

In [None]:
temp_df = all_games_lookback[all_games_lookback["season_id"] >= "22004"]
graph_lookback(temp_df, [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15])
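
The naive classifier that `graph_lookback` evaluates predicts a home win whenever the home team's lookback win percentage is at least the away team's. A minimal sketch of that accuracy computation, on made-up values and restricted to home-team rows (`home_rows` is hypothetical, with `wl_binary` equal to 1 when the home team won), looks like this. Since both rows of a game are either counted correct or incorrect together, using only home rows should give the same accuracy as counting every row.

```python
import pandas as pd

# Made-up home-team rows; wl_binary is 1 if the home team won.
home_rows = pd.DataFrame({
    "home_win_pct_3": [0.67, 0.33, 0.50, 1.00],
    "away_win_pct_3": [0.33, 0.67, 0.50, 0.00],
    "wl_binary":      [1, 0, 0, 1],
})

# Predict a home win when the home team's recent form is at least as good.
predicted_home_win = (
    home_rows["home_win_pct_3"] >= home_rows["away_win_pct_3"]
).astype(int)

# Fraction of games where the prediction matches the actual result.
accuracy = (predicted_home_win == home_rows["wl_binary"]).mean()
```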

### Travel Distance (2005+)
One potentially useful feature is the distance a team traveled to reach the current game. This relates to fatigue: teams are likely to be tired from the flights, and time zone changes can be disorienting. We wish to add the features `home_mileage` and `away_mileage` to indicate the number of miles each team traveled to get to the current game.

To do this, we will use the library geopy to get coordinates and to calculate distances (distance will be calculated using the Vincenty formula). We need to manually create dictionaries mapping teams to locations. Because these need to be manually created, we limit the scope to the 2005-06 season onward. We do not expect this limitation to be significant, since recent seasons are the most relevant for predicting current games.
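
To give a feel for the distance calculation without any external dependency, the sketch below uses the haversine (great-circle) formula rather than the Vincenty formula we rely on through geopy. It treats the Earth as a sphere, so its results differ slightly from Vincenty's ellipsoidal distances. The function name `haversine_miles`, the Earth radius value, and the approximate city coordinates are all assumptions made for illustration.

```python
import math

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points in degrees."""
    radius_miles = 3958.8  # approximate mean Earth radius (assumed value)
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * radius_miles * math.asin(math.sqrt(a))

# Approximate city-center coordinates (assumed, not geocoded).
boston = (42.36, -71.06)
los_angeles = (34.05, -118.24)

miles = haversine_miles(boston[0], boston[1], los_angeles[0], los_angeles[1])
```

For a feature meant to capture travel fatigue, the small gap between spherical and ellipsoidal distances is unlikely to matter much, but that is a judgment call rather than something we tested.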

In [None]:
from geopy.geocoders import Nominatim
from geopy.distance import vincenty

# Only consider seasons from 2005-06 onward
temp_df = all_games[all_games['season_id'] >= '22005']
team_names = temp_df['team_name'].unique().tolist()
#print len(team_names), team_names

team_id_list = temp_df['team_id'].unique().tolist()

# Dictionary mapping of team_id to team_name
id_team = dict()
for team_id in team_id_list:
    temp_df_1 = temp_df[temp_df['team_id'] == team_id]
    id_team[team_id] = temp_df_1['team_name'].unique().tolist()

# Dictionary mapping of team_name to seasons
team_seasons = dict()
for team_name in team_names:
    temp_df_1 = temp_df[temp_df['team_name'] == team_name]
    team_seasons[team_name] = temp_df_1['season_id'].unique().tolist()

# Dictionary mapping team name to location
name_location = {'Dallas Mavericks': 'Dallas, Texas',
                'New Orleans/Oklahoma City Hornets': 'Oklahoma City, Oklahoma',
                'Milwaukee Bucks': 'Milwaukee, Wisconsin',
                'San Antonio Spurs': 'San Antonio, Texas',
                'Philadelphia 76ers': 'Philadelphia, Pennsylvania',
                'Phoenix Suns': 'Phoenix, Arizona',
                'Denver Nuggets': 'Denver, Colorado',
                'Sacramento Kings': 'Sacramento, California',
                'Atlanta Hawks': 'Atlanta, Georgia',
                'Miami Heat' : 'Miami, Florida', 
                'Toronto Raptors': 'Toronto, Ontario', 
                'New Jersey Nets': 'Newark, New Jersey', 
                'Houston Rockets': 'Houston, Texas', 
                'Boston Celtics': 'Boston, Massachusetts', 
                'Golden State Warriors' : 'Oakland, California', 
                'Utah Jazz': 'Salt Lake City, Utah', 
                'Seattle SuperSonics' : 'Seattle, Washington', 
                'Portland Trail Blazers': 'Portland, Oregon', 
                'Indiana Pacers' : 'Indianapolis, Indiana', 
                'Minnesota Timberwolves' : 'Minneapolis, Minnesota', 
                'Washington Wizards': 'Washington, D.C.',
                'Chicago Bulls' : 'Chicago, Illinois', 
                'New York Knicks': 'New York City, New York', 
                'Cleveland Cavaliers' : 'Cleveland, Ohio', 
                'Charlotte Bobcats' : 'Charlotte, North Carolina',
                'Detroit Pistons' : 'Detroit, Michigan', 
                'Memphis Grizzlies' : 'Memphis, Tennessee', 
                'Orlando Magic' : 'Orlando, Florida', 
                'Los Angeles Clippers' : 'Los Angeles, California', 
                'Los Angeles Lakers' : 'Los Angeles, California', 
                'New Orleans Hornets' : 'New Orleans, Louisiana', 
                'Oklahoma City Thunder': 'Oklahoma City, Oklahoma', 
                'Brooklyn Nets': 'Brooklyn, New York', 
                'New Orleans Pelicans' : 'New Orleans, Louisiana', 
                'Charlotte Hornets' : 'Charlotte, North Carolina', 
                'LA Clippers' : 'Los Angeles, California'}

# Dictionary mapping location to (latitude, longitude)
location_latlong = dict()
geolocator = Nominatim()
# Several team names share a city, so geocode each distinct location only once
for location in set(name_location.values()):
    geo_loc = geolocator.geocode(location)
    location_latlong[location] = (geo_loc.latitude, geo_loc.longitude)
    
def get_team_location(team_id, season_id):
    """ Given a team id and the season_id, returns a string containing the location of the team stadium
    Input: 
        team_id (int): team id
        season_id (str): season id 
    Output:
        (str) location of the team's home city, e.g. 'Dallas, Texas'
    """
    # A franchise can appear under several names; pick the one used in this season
    names = id_team[team_id]
    team_name = None

    for name in names:
        if season_id in team_seasons[name]:
            team_name = name

    if team_name is None:
        raise ValueError("No team name found for team %s in season %s (candidates: %s)"
                         % (team_id, season_id, names))

    return name_location[team_name]

def add_mileage(league_df):
    # Cache of distances already computed, keyed by the concatenated location names
    seen_distances = dict()
    
    # Add the new feature columns, initialized to zero
    new_df = league_df.sort_values('game_date')
    new_df = new_df.assign(home_mileage = np.zeros(len(new_df)))
    new_df = new_df.assign(away_mileage = np.zeros(len(new_df)))
    
    # Process each season separately
    grouped = new_df.groupby('season_id')
    groupList = [grouped.get_group(x) for x in grouped.groups]
    
    for season_df in groupList:
        running_locations = dict()
        
        for team in season_df['team_id'].unique():
            # track list of locations played at
            running_locations[team] = []
        
        # sort season by day
        season_df = season_df.sort_values('game_date')
        
        seen_games = set()
        
        for (index, row) in season_df.iterrows():
            is_home = row['is_home']
            team_id = row['team_id']
            opp_team_id = row['opp_team_id']
            game_id = row['game_id']
            season_id = row['season_id']
            
            if is_home == 1:
                home_team_id = team_id
                away_team_id = opp_team_id
            else:
                home_team_id = opp_team_id
                away_team_id = team_id
                
            
            home_games_count = row['home_game_count']
            away_games_count = row['away_game_count']
            
            home_mileage = 0
            away_mileage = 0
            
            curr_location = get_team_location(home_team_id, season_id)

            (curr_lat, curr_long) = location_latlong[curr_location]

            if home_games_count > 0:
                location_list = running_locations[home_team_id]
                assert(len(location_list) == home_games_count)
                last_location = location_list[-1]
                (last_lat, last_long) = location_latlong[last_location]

                str1 = curr_location + last_location
                str2 = last_location + curr_location

                if str1 in seen_distances:
                    home_mileage = seen_distances[str1]
                elif str2 in seen_distances:
                    home_mileage = seen_distances[str2]
                else:
                    home_mileage = vincenty((last_lat,last_long), (curr_lat,curr_long)).miles
                    seen_distances[str1] = home_mileage

            if away_games_count > 0:
                location_list = running_locations[away_team_id]
                assert(len(location_list) == away_games_count)
                last_location = location_list[-1]

                (last_lat, last_long) = location_latlong[last_location]

                str1 = curr_location + last_location
                str2 = last_location + curr_location

                if str1 in seen_distances:
                    away_mileage = seen_distances[str1]
                elif str2 in seen_distances:
                    away_mileage = seen_distances[str2]
                else:
                    away_mileage = vincenty((last_lat,last_long), (curr_lat,curr_long)).miles
                    seen_distances[str1] = away_mileage

            # DataFrame.set_value is deprecated; .at performs the same scalar assignment
            new_df.at[index, 'home_mileage'] = home_mileage
            new_df.at[index, 'away_mileage'] = away_mileage
            
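            # Each game appears twice (once per team). Record the new location only
            # after the second row, so both rows see the same previous locations.
            if game_id in seen_games:
                running_locations[home_team_id].append(curr_location)
                running_locations[away_team_id].append(curr_location)
                
            seen_games.add(game_id)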
            
    return new_df

In [None]:
mileage_df = add_mileage(all_games[all_games['season_id']>='22005'])
mileage_df.tail()
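The bookkeeping inside `add_mileage` can be hard to follow in full, so here is a simplified sketch of the core idea: remember where each team last played, and charge each team the distance from that location to the current game. Everything here is hypothetical and for illustration only: the teams `X`, `Y`, `Z`, the cities `A`, `B`, `C`, the round-number distances, and the helper names `distance` and `mileage_per_game` are not part of our pipeline. Unlike `add_mileage`, the sketch takes one entry per game rather than one row per team.

```python
# Hypothetical pairwise distances in miles. A frozenset key makes the lookup
# symmetric, so (A, B) and (B, A) share one entry.
DISTANCES = {
    frozenset(["A", "B"]): 100.0,
    frozenset(["A", "C"]): 250.0,
    frozenset(["B", "C"]): 400.0,
}


def distance(city1, city2):
    """Look up the distance between two cities; staying put costs nothing."""
    if city1 == city2:
        return 0.0
    return DISTANCES[frozenset([city1, city2])]


def mileage_per_game(games, home_city):
    """games: list of (home_team, away_team) tuples in chronological order.

    Returns one (home_mileage, away_mileage) tuple per game. A team's first
    game of the season gets 0 miles, since it has no previous location.
    """
    last_city = {}
    result = []
    for home, away in games:
        here = home_city[home]
        home_miles = distance(last_city[home], here) if home in last_city else 0.0
        away_miles = distance(last_city[away], here) if away in last_city else 0.0
        result.append((home_miles, away_miles))
        # Both teams are now located at the home team's city
        last_city[home] = here
        last_city[away] = here
    return result


home_city = {"X": "A", "Y": "B", "Z": "C"}
games = [("X", "Y"), ("Y", "X"), ("Z", "Y")]
mileage = mileage_per_game(games, home_city)
print(mileage)
```

In the second game both teams travel from city `A` to city `B`, which shows that a home team can also accumulate mileage when returning from a road game.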

### Sources
1. `http://homepages.cae.wisc.edu/~ece539/fall13/project/AmorimTorres_rpt.pdf`

### Linear Regression Model

