# COGS 108 - Final Project 

# Overview

*Fill in your overview here*

# Names

- Jared Vitug
- Miguel Morales
- Phat Ly
- Kevin Mach

# Group Members IDs

- A92083122
- A########
- A########
- A12647584

# Research Question

Can we predict the winning bracket for the current NBA playoffs from regular season matchup statistics collected over the last 10 years?
In a best-of-seven NBA playoff series, two teams play each other consecutively at least four times. This is very different from the regular season, where teams in the same division meet at most four times, spread across an 82-game schedule. However, we believe that regular season statistics for the eight current playoff matchups, gathered over the last ten years, can be used to predict an accurate playoff bracket for the current season’s playoffs. We will look at data such as each team’s winning percentage, number of assists, total offensive rebounds, number of roster changes, etc. to create a model that tells us which team will win a playoff series. We will then use that model to pick the winners in our bracket.


## Background and Prior Work

With the NBA playoffs in full effect, we knew we wanted to research something related to basketball that was also fun.  We all know sports are nearly impossible to predict and basketball is no exception - or is it? We began to wonder whether, given a playoff bracket, we could predict each round of the playoffs and thereby figure out who would win the championship.

Each of us watches basketball and knows a few plays can define a game.  Injuries happen, human error occurs, and upsets take place.  However, in the world of sports betting, none of this matters - only numbers matter.  That’s why we decided to find the best combination of regular season matchup stats (between two teams pitted against each other in the playoffs) that can successfully predict a series outcome.

When it comes to individual statistics, we know that the team that wins the turnover battle usually comes out on top, as does the team with more rebounds and a higher shooting percentage.  However, with all the different play-styles in today’s NBA (iso-ball, backcourt dominance, frontcourt dominance), we want to find out if there are other team stats that can help us predict winners, such as three pointers taken, team fouls, timeouts used, etc.


https://towardsdatascience.com/predicting-nba-winning-percentage-in-upcoming-season-using-linear-regression-f8687d9c0418
https://github.com/COGS108/FinalProjects-Wi18/blob/master/001-FinalProject.ipynb
The first link above is to a similar project in which someone took 40 years of basketball data from basketball-reference.com and trained a model to predict NBA winning percentage.  The article says that “average age of the players, margin of victory, number of points scored, number of returning players, and number of blocks” were some of the statistics included, which gives us some idea as to which statistics to pull.  The study’s analysis discusses how the model erred by over-predicting winning seasons and under-predicting losing seasons, with the reasons being injuries, trades, and retirements - all things which are hard to predict and difficult to assess from a mathematical perspective.

# Hypothesis


We hypothesize that it is possible to predict the winner of a 7-game playoff series between two NBA teams by analyzing their regular season matchup statistics.
By narrowing our data to historical team matchups, we can examine specific factors that may be pertinent in explaining why one team may have the competitive edge. There are many accessible statistics for every NBA game, and this data can be used to determine if one franchise has a playstyle better suited to a particular opponent. Features such as the record of the matchups (giving more weight to recent wins), number of possessions, offensive rebounds, total assists, turnovers, etc. are all telling of how much of an advantage one team has over the other. Although teams change over time due to player, coach, and management contracts, we will attempt to take those conditions into account through feature engineering, as winning teams tend to undergo fewer structural changes. Additionally, we can favor more recent data, as it will be more telling of the current matchup. Therefore, with a catalog of in-depth matchup statistics, we hypothesize that we can create a strong dataset for our model to predict a correct playoff bracket.
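
One way the idea of giving more weight to recent wins could be expressed is an exponentially decayed win rate. The sketch below is only an illustration under our own assumptions: the function name `recency_weighted_win_rate`, the `decay` parameter, and the toy record are hypothetical and are not part of the analysis in this notebook.

```python
import numpy as np

def recency_weighted_win_rate(seasons, wins, current_season, decay=0.8):
    """Weighted share of wins; a game played k seasons ago gets weight decay**k.

    `decay` is a hypothetical tuning parameter, not a value from our analysis.
    """
    seasons = np.asarray(seasons, dtype=float)
    wins = np.asarray(wins, dtype=float)
    weights = decay ** (current_season - seasons)
    return float(np.sum(weights * wins) / np.sum(weights))

# Toy head-to-head record (made up): two old wins followed by two recent losses
seasons = [2016, 2016, 2019, 2019]
wins = [1, 1, 0, 0]

unweighted = float(np.mean(wins))
weighted = recency_weighted_win_rate(seasons, wins, current_season=2019, decay=0.5)
```

With these toy numbers the weighted rate falls below the unweighted one, because the recent losses count for more than the older wins.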


# Dataset(s)

Dataset Name: NBA Advanced Stats
Dataset Source: https://stats.nba.com/
Dataset Link: https://stats.nba.com/teams/traditional/?sort=W_PCT&dir=-1&Season=2018-19&SeasonType=Regular%20Season
Number of datasets planned: 10 total, 1 for each regular season in the period August 2008 - April 2019
 
Dataset Explanation: This website contains datasets of NBA team statistics from 1999 to the present. The datasets are organized as tables, where the rows represent teams and the columns are observations such as win-loss record, shooting percentages, rebounds, etc. Since we plan to use the statistics of the past 10 regular season matchups, this website is useful because it contains the information from previous years; we will have to select the options provided on the website (at the link above) to get the right dataset for our project. We will try to pull the regular season information for each year (from 2008 to the present), then drop teams that are not participating in the current year's playoffs. We might also need to treat each season's statistics differently, perhaps giving more weight to recent years than to older ones.


# Setup

In [39]:
# Imports 
%matplotlib inline

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import requests
import bs4
from bs4 import BeautifulSoup
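# train_test_split lives in sklearn.model_selection (sklearn.cross_validation was removed in scikit-learn 0.20)
from sklearn.model_selection import train_test_split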
from sklearn import linear_model
from sklearn.linear_model import Lasso
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.metrics import accuracy_score, recall_score, f1_score

In [2]:
#example URL
# https://www.basketball-reference.com/teams/MIL/2019/gamelog/?fbclid=IwAR3BFW5ivLDuQE5NRVkPbHnEIlwe-CCCsoeo8RxOxmcPHssP0_mzfJgsVr8

year_array = ["2010", "2011", "2012", "2013", "2014", "2015", 
              "2016", "2017", "2018", "2019"]
team_array = ["GSW", "HOU", "POR", "DEN", "LAC", "UTA", "OKC", 
             "SAS", "MIL", "BOS", "PHI", "TOR", "DET", "IND", "ORL", "BRK"]

def create_url(year_array, team_array):
    #variables for appending the URLs
    begLink = "https://www.basketball-reference.com/teams/"
    endLink = "/gamelog/?fbclid=IwAR3BFW5ivLDuQE5NRVkPbHnEIlwe-CCCsoeo8RxOxmcPHssP0_mzfJgsVr8"
    #array to store the url's
    url_list = []
    #loop through team_array 
    for i in team_array:
        #loop through each year in year_array
        for j in year_array:
            team_name = i
            # if the team is Brooklyn and the year is 2010-2012, team name
            # is New Jersey Nets
            if(i == "BRK" and (j == "2010" or j == "2011" or j == "2012")):
                team_name = "NJN"    
            temp_url = begLink + team_name + "/" + j + endLink
            #print(temp_url)
            url_list = np.append(url_list, temp_url)
    #print(url_list)
    return url_list

            
url_list = create_url(year_array, team_array)


In [3]:
url_list

array([ 'https://www.basketball-reference.com/teams/GSW/2010/gamelog/?fbclid=IwAR3BFW5ivLDuQE5NRVkPbHnEIlwe-CCCsoeo8RxOxmcPHssP0_mzfJgsVr8',
       'https://www.basketball-reference.com/teams/GSW/2011/gamelog/?fbclid=IwAR3BFW5ivLDuQE5NRVkPbHnEIlwe-CCCsoeo8RxOxmcPHssP0_mzfJgsVr8',
       'https://www.basketball-reference.com/teams/GSW/2012/gamelog/?fbclid=IwAR3BFW5ivLDuQE5NRVkPbHnEIlwe-CCCsoeo8RxOxmcPHssP0_mzfJgsVr8',
       'https://www.basketball-reference.com/teams/GSW/2013/gamelog/?fbclid=IwAR3BFW5ivLDuQE5NRVkPbHnEIlwe-CCCsoeo8RxOxmcPHssP0_mzfJgsVr8',
       'https://www.basketball-reference.com/teams/GSW/2014/gamelog/?fbclid=IwAR3BFW5ivLDuQE5NRVkPbHnEIlwe-CCCsoeo8RxOxmcPHssP0_mzfJgsVr8',
       'https://www.basketball-reference.com/teams/GSW/2015/gamelog/?fbclid=IwAR3BFW5ivLDuQE5NRVkPbHnEIlwe-CCCsoeo8RxOxmcPHssP0_mzfJgsVr8',
       'https://www.basketball-reference.com/teams/GSW/2016/gamelog/?fbclid=IwAR3BFW5ivLDuQE5NRVkPbHnEIlwe-CCCsoeo8RxOxmcPHssP0_mzfJgsVr8',
       'https://www

In [4]:
#this code creates a list of dataframes corresponding to the url's in url_list

#list to store individual dataframes before combining
df_list = []
#i = 0
for url in url_list:
    #print(i)
    req = requests.get(url)
    soup = BeautifulSoup(req.content, 'html.parser') #get contents of webpage
    nbatables = soup.findAll("table", 'row_summable sortable stats_table') #get tables
    tbl1 = nbatables[0]
    new_tbl1 = pd.DataFrame(columns=range(0,40), index = range(0,91))
    
    #get the column names for our first table
    ind=0
    cols_list = []
    for header in tbl1.find_all('tr'): #specify HTML tags
        header_name = header.find_all('th') #tag containing column names
        for head in header_name:
            cols_list.append(head.get_text()) #get the text from between the tags
    
    #fill in contents for each table
    row_marker = -1
    for row in tbl1.find_all('tr'):
        column_marker = 0
        columns = row.find_all('td') # different tag than above for table contents
        for column in columns:
            new_tbl1.iat[row_marker,column_marker] = column.get_text()
            column_marker += 1
        row_marker += 1
    df_list.append(new_tbl1)
    #i = i + 1

In [5]:
def create_team_df(team, df_list):
    # df_list follows the order of url_list: one contiguous block of
    # len(year_array) dataframes per team, in the same order as team_array
    start = team_array.index(team) * len(year_array)
    team_df = pd.concat(df_list[start:start + len(year_array)])
    #NOTE: renamed opponent points to OppP
    team_df.columns = [
 'G',
 'Date',
 '\xa0',
 'Opp',
 'W/L',
 'Tm',
 'OppP',
 'FG',
 'FGA',
 'FG%',
 '3P',
 '3PA',
 '3P%',
 'FT',
 'FTA',
 'FT%',
 'ORB',
 'TRB',
 'AST',
 'STL',
 'BLK',
 'TOV',
 'PF',
 '\xa0',
 'oppFG',
 'oppFGA',
 'oppFG%',
 'opp3P',
 'opp3PA',
 'opp3P%',
 'oppFT',
 'oppFTA',
 'oppFT%',
 'oppORB',
 'oppTRB',
 'oppAST',
 'oppSTL',
 'oppBLK',
 'oppTOV',
 'oppPF']
    return team_df

In [6]:
GSW_df = create_team_df("GSW", df_list)
HOU_df = create_team_df("HOU", df_list)
POR_df = create_team_df("POR", df_list)
DEN_df = create_team_df("DEN", df_list)
LAC_df = create_team_df("LAC", df_list)
UTA_df = create_team_df("UTA", df_list)
OKC_df = create_team_df("OKC", df_list)
SAS_df = create_team_df("SAS", df_list)
MIL_df = create_team_df("MIL", df_list)
BOS_df = create_team_df("BOS", df_list)
PHI_df = create_team_df("PHI", df_list)
TOR_df = create_team_df("TOR", df_list)
DET_df = create_team_df("DET", df_list)
IND_df = create_team_df("IND", df_list)
ORL_df = create_team_df("ORL", df_list)
BRK_df = create_team_df("BRK", df_list)

In [7]:
#Create a team dictionary to hold all of the dataframes. Key = team name, Value = team dataframe
teams_dict = {
    "GSW" : GSW_df,
    "HOU" : HOU_df,
    "POR" : POR_df,
    "DEN" : DEN_df,
    "LAC" : LAC_df,
    "UTA" : UTA_df,
    "OKC" : OKC_df,
    "SAS" : SAS_df,
    "MIL" : MIL_df,
    "BOS" : BOS_df,
    "PHI" : PHI_df,
    "TOR" : TOR_df,
    "DET" : DET_df,
    "IND" : IND_df,
    "ORL" : ORL_df,
    "BRK" : BRK_df
 } 

In [8]:
# Add the team as a column for each dataframe
for team, team_df in teams_dict.items():
    # Assigning a scalar fills every row with the team's acronym
    team_df['Team'] = team

# Data Cleaning

Our end goal for the data cleaning is that the resulting dataframes have no NaN values. In addition, some columns, such as the game number, do not contain any data that we think is important, and we plan on removing those. Lastly, we want to transform some columns for easier reading and computing. In particular, the dataset contains all of the game data for a particular team from 2009 to 2019. However, a regular season spans two calendar years, so we want to change the date so that the year is the one in which the regular season ends. For the W/L record, the current data is stored as the strings "W" and "L" for win and loss. To make it easier for us to compute quantities like the win percentage, we want the data to be in terms of 1 and 0. The tools we will be using are pandas library functions like dropna() as well as some of our own functions to transform the data.
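
The transformations described above can be sketched on a toy game log. This is only an illustration: the rows are made up, the `Season` column name is hypothetical (the notebook overwrites `Date` instead), and the vectorized pandas calls are an alternative to the helper functions we define in the cells that follow.

```python
import pandas as pd

# Toy game log; the column names mirror ours but the rows are made up
toy = pd.DataFrame({
    'Date': ['2018-10-16', '2018-12-25', '2019-01-03', '2019-04-10'],
    'W/L': ['W', 'L', 'W', None],
})

# Drop rows with missing values, as dropna() does in the cleaning step
toy = toy.dropna().copy()

# Games played in October-December belong to the season that ends the next year
dates = pd.to_datetime(toy['Date'])
toy['Season'] = dates.dt.year + dates.dt.month.isin([10, 11, 12]).astype(int)

# Encode wins as 1 and losses as 0
toy['W/L'] = (toy['W/L'] == 'W').astype(int)
```

After these steps the three remaining toy rows all carry the season label 2019, and the W/L column holds integers that can be averaged directly into a win percentage.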

In [9]:
#Months (October-December) in which a game belongs to the season ending the following year
firstHalf   = ['10','11','12']

int_type = ['Tm', 'OppP', 'FG', 'FGA', '3P', '3PA', 'FT', 'FTA', 'ORB', 'TRB', 'AST', 'STL', 'BLK', 'TOV', 'PF', 
'oppFG', 'oppFGA', 'opp3P', 'opp3PA', 'oppFT', 'oppFTA', 'oppORB', 'oppTRB', 'oppAST', 'oppSTL', 'oppBLK', 'oppTOV','oppPF' ]

float_type = ['FG%', '3P%','FT%', 'oppFG%', 'opp3P%', 'oppFT%']



In [10]:
#This function maps a game date (YYYY-MM-DD) to the year in which its season ends, discarding the month and day
def standardize_year(input_year):
    year  = str(input_year)[:4]
    month = str(input_year)[5:7]
    if month in firstHalf:
        return int(year) + 1
    else:
        return int(year)

#This function allows us to transform the W/L to computable data. W => 1, L => 0    
def standardize_score(input_score):
    if input_score == "W":
        return int(1)
    else:
        return int(0)

In [11]:
#Cleaning the data
for w in team_array:
    #Getting dataframe of the current team
    currentTeam_df = teams_dict[w]      
    #Drop rows with nan values
    currentTeam_df = currentTeam_df.dropna() 
    #Drop columns that have unnecessary data 
    currentTeam_df = currentTeam_df.drop(labels = ['\xa0', 'G'], axis=1)            
    #Drop rows where the opponent is not in the current 2019 playoffs
    currentTeam_df = currentTeam_df[currentTeam_df['Opp'].isin(team_array)]
    #Filter and transform the "Date" column
    currentTeam_df['Date'] = currentTeam_df['Date'].apply(standardize_year)
    #Transform the "W/L" column 
    currentTeam_df['W/L']  = currentTeam_df['W/L'].apply(standardize_score)
    for label in int_type:
        currentTeam_df[label] = currentTeam_df[label].astype('int')
    for other_label in float_type:
        currentTeam_df[other_label] = currentTeam_df[other_label].astype('float')
    #Apply the modification
    teams_dict[w]  = currentTeam_df

In [12]:
teams_dict['GSW'].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 403 entries, 1 to 86
Data columns (total 38 columns):
Date      403 non-null int64
Opp       403 non-null object
W/L       403 non-null int64
Tm        403 non-null int32
OppP      403 non-null int32
FG        403 non-null int32
FGA       403 non-null int32
FG%       403 non-null float64
3P        403 non-null int32
3PA       403 non-null int32
3P%       403 non-null float64
FT        403 non-null int32
FTA       403 non-null int32
FT%       403 non-null float64
ORB       403 non-null int32
TRB       403 non-null int32
AST       403 non-null int32
STL       403 non-null int32
BLK       403 non-null int32
TOV       403 non-null int32
PF        403 non-null int32
oppFG     403 non-null int32
oppFGA    403 non-null int32
oppFG%    403 non-null float64
opp3P     403 non-null int32
opp3PA    403 non-null int32
opp3P%    403 non-null float64
oppFT     403 non-null int32
oppFTA    403 non-null int32
oppFT%    403 non-null float64
oppORB    403

# Data Analysis & Results

First, we will compute each team's average game statistics during the 2019 season against its respective opponent in the playoffs; these averages will be the input to our model, i.e. our "testing data". We will then concatenate these statistics to the original dataframe and begin feature engineering. Most of the features are binary variables that compare a team's stat with its opponent's: higherFG%, higherTOV, higher3P%, higherFT%, higherFGA, higherFTA, higher3PA, higherORB, higherTRB, higherBLK, higherSTL, higherAST, and higherPF. The label we wish to predict is whether or not the team wins the matchup.
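
As a small illustration of the binary features described above, the sketch below builds two of them on made-up rows. The toy values and the loop are our own assumptions for the example; the notebook's feature engineering cell builds each column explicitly.

```python
import pandas as pd

# Toy rows; the values are made up for illustration
games = pd.DataFrame({
    'FG%': [0.48, 0.44, 0.45],
    'oppFG%': [0.45, 0.47, 0.45],
    'TRB': [50, 41, 44],
    'oppTRB': [42, 45, 40],
})

features = pd.DataFrame(index=games.index)
for stat in ['FG%', 'TRB']:
    # 1 when the team's value is strictly higher than the opponent's, else 0
    features['higher' + stat] = (games[stat] > games['opp' + stat]).astype(int)
```

Note that a tie, as in the third row's FG%, is encoded as 0 under this strict comparison.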

In [13]:
# Concatenating all the teams dataframes into one dataframe
df_all = pd.DataFrame()
for team_df in teams_dict.values():
    df_all = pd.concat([df_all, team_df], axis=0)

In [14]:
df_all.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6467 entries, 1 to 85
Data columns (total 38 columns):
Date      6467 non-null int64
Opp       6467 non-null object
W/L       6467 non-null int64
Tm        6467 non-null int32
OppP      6467 non-null int32
FG        6467 non-null int32
FGA       6467 non-null int32
FG%       6467 non-null float64
3P        6467 non-null int32
3PA       6467 non-null int32
3P%       6467 non-null float64
FT        6467 non-null int32
FTA       6467 non-null int32
FT%       6467 non-null float64
ORB       6467 non-null int32
TRB       6467 non-null int32
AST       6467 non-null int32
STL       6467 non-null int32
BLK       6467 non-null int32
TOV       6467 non-null int32
PF        6467 non-null int32
oppFG     6467 non-null int32
oppFGA    6467 non-null int32
oppFG%    6467 non-null float64
opp3P     6467 non-null int32
opp3PA    6467 non-null int32
opp3P%    6467 non-null float64
oppFT     6467 non-null int32
oppFTA    6467 non-null int32
oppFT%    6467

In [15]:
# This function will calculate average matchup statistics between two teams in the 2019 season and append it to the overall dataframe
def average_stats_2019(team_name, df_team, df_total, opp):
    df_2019 = df_team[(df_team['Date'] == 2019) & (df_team['Opp'] == opp)]
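    # Average only the numeric columns; recent pandas versions raise on string columns such as 'Opp' and 'Team'
    avgs = df_2019.mean(axis=0, numeric_only=True)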
    avgs_df = avgs.to_frame().transpose()
    # Removing the 'W/L' column as that is what we will need to predict
    avgs_df.drop(labels = 'W/L', axis = 1, inplace = True)
    # Adding back the team and opp columns
    avgs_df['Team'] = team_name
    avgs_df['Opp'] = opp
    return pd.concat([df_total, avgs_df], axis = 0)

In [17]:
# Creating the average statistics between playoff matchups
# Playoff pairings, listed once and mirrored so each team maps to its opponent
first_round = [("GSW", "LAC"), ("HOU", "UTA"), ("POR", "OKC"), ("DEN", "SAS"),
               ("MIL", "DET"), ("TOR", "ORL"), ("PHI", "BRK"), ("BOS", "IND")]
opponents = dict(first_round)
opponents.update({b: a for a, b in first_round})

for team, df_team in teams_dict.items():
    if team not in opponents:
        continue
    opponent = opponents[team]
    print(team + ' vs ' + opponent)
    df_all = average_stats_2019(team, df_team, df_all, opponent)

PHI vs BRK
UTA vs HOU
MIL vs DET
GSW vs LAC
TOR vs ORL
DET vs MIL
BOS vs IND
POR vs OKC
ORL vs TOR
LAC vs GSW
IND vs BOS
OKC vs POR
HOU vs UTA
BRK vs PHI
DEN vs SAS
SAS vs DEN


In [18]:
# Checking the combined dataframe after adding the 2019 matchup averages
df_all.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6483 entries, 1 to 0
Data columns (total 38 columns):
3P        6483 non-null float64
3P%       6483 non-null float64
3PA       6483 non-null float64
AST       6483 non-null float64
BLK       6483 non-null float64
Date      6483 non-null float64
FG        6483 non-null float64
FG%       6483 non-null float64
FGA       6483 non-null float64
FT        6483 non-null float64
FT%       6483 non-null float64
FTA       6483 non-null float64
ORB       6483 non-null float64
Opp       6483 non-null object
OppP      6483 non-null float64
PF        6483 non-null float64
STL       6483 non-null float64
TOV       6483 non-null float64
TRB       6483 non-null float64
Team      6483 non-null object
Tm        6483 non-null float64
W/L       6467 non-null float64
opp3P     6483 non-null float64
opp3P%    6483 non-null float64
opp3PA    6483 non-null float64
oppAST    6483 non-null float64
oppBLK    6483 non-null float64
oppFG     6483 non-null float64
op

In [19]:
# Resetting index
df_all = df_all.reset_index(drop=True)

In [20]:
df_all.columns

Index([u'3P', u'3P%', u'3PA', u'AST', u'BLK', u'Date', u'FG', u'FG%', u'FGA',
       u'FT', u'FT%', u'FTA', u'ORB', u'Opp', u'OppP', u'PF', u'STL', u'TOV',
       u'TRB', u'Team', u'Tm', u'W/L', u'opp3P', u'opp3P%', u'opp3PA',
       u'oppAST', u'oppBLK', u'oppFG', u'oppFG%', u'oppFGA', u'oppFT',
       u'oppFT%', u'oppFTA', u'oppORB', u'oppPF', u'oppSTL', u'oppTOV',
       u'oppTRB'],
      dtype='object')

### Feature Engineering

In [24]:
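# higherFG%, higherTOV, higher3P%, higherFT%, higherFGA, higherFTA, higher3PA, higherORB, higherTRB, higherBLK, higherSTL 
# higherAST, higherPF
# Most features are 1 when the team's stat is strictly higher than the opponent's and 0 otherwise.
# Two exceptions as coded below: higherTOV is also 1 on ties (TOV >= oppTOV), and higherPF is
# inverted, so it is 1 when the team commits no more fouls than the opponent (PF <= oppPF).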
train_data = pd.DataFrame()
train_data['W'] = df_all['W/L']
train_data['Team'] = df_all['Team']
train_data['Opp'] = df_all['Opp']
train_data['higherFG%'] = (df_all['FG%'] - df_all['oppFG%']).apply(lambda x: 1 if x > 0 else 0)
train_data['higherTOV'] = (df_all['TOV'] - df_all['oppTOV']).apply(lambda x: 0 if x < 0 else 1)
train_data['higher3P%'] = (df_all['3P%'] - df_all['opp3P%']).apply(lambda x: 1 if x > 0 else 0)
train_data['higherFT%'] = (df_all['FT%'] - df_all['oppFT%']).apply(lambda x: 1 if x > 0 else 0)
train_data['higherFGA'] = (df_all['FGA'] - df_all['oppFGA']).apply(lambda x: 1 if x > 0 else 0)
train_data['higherFTA'] = (df_all['FTA'] - df_all['oppFTA']).apply(lambda x: 1 if x > 0 else 0)
train_data['higher3PA'] = (df_all['3PA'] - df_all['opp3PA']).apply(lambda x: 1 if x > 0 else 0)
train_data['higherORB'] = (df_all['ORB'] - df_all['oppORB']).apply(lambda x: 1 if x > 0 else 0)
train_data['higherTRB'] = (df_all['TRB'] - df_all['oppTRB']).apply(lambda x: 1 if x > 0 else 0)
train_data['higherBLK'] = (df_all['BLK'] - df_all['oppBLK']).apply(lambda x: 1 if x > 0 else 0)
train_data['higherSTL'] = (df_all['STL'] - df_all['oppSTL']).apply(lambda x: 1 if x > 0 else 0)
train_data['higherAST'] = (df_all['AST'] - df_all['oppAST']).apply(lambda x: 1 if x > 0 else 0)
train_data['higherPF'] = (df_all['PF'] - df_all['oppPF']).apply(lambda x: 0 if x > 0 else 1)

In [25]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6483 entries, 0 to 6482
Data columns (total 16 columns):
W            6467 non-null float64
Team         6483 non-null object
Opp          6483 non-null object
higherFG%    6483 non-null int64
higherTOV    6483 non-null int64
higher3P%    6483 non-null int64
higherFT%    6483 non-null int64
higherFGA    6483 non-null int64
higherFTA    6483 non-null int64
higher3PA    6483 non-null int64
higherORB    6483 non-null int64
higherTRB    6483 non-null int64
higherBLK    6483 non-null int64
higherSTL    6483 non-null int64
higherAST    6483 non-null int64
higherPF     6483 non-null int64
dtypes: float64(1), int64(13), object(2)
memory usage: 810.4+ KB


In [26]:
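# Splitting the data into the training set and testing set
# The first 6467 rows are real games with a W/L label; the remaining rows are the
# averaged 2019 playoff matchups, which have no label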
training_data = train_data.iloc[0:6467, :]
testing_data = train_data.iloc[6467: , :]

In [35]:
# Splitting training data into x and y
x = training_data.drop(labels = ['W', 'Team', 'Opp'], axis = 1)
y = training_data['W']

In [36]:
# Train and test split
X_train, X_test, y_train, y_test = train_test_split(x, y, train_size=0.8, random_state=0)
print X_train.shape, X_test.shape, y_train.shape, y_test.shape

(5173, 13) (1294, 13) (5173L,) (1294L,)


In [37]:
# Regression Technique
lr = LogisticRegression()
logmodel = lr.fit(X_train, y_train)

In [38]:
# Prediction
prediction = logmodel.predict(X_test)

In [40]:
# Checking accuracy of our model
accuracy_score(y_test, prediction)

0.86321483771251928
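
An accuracy figure is easier to interpret next to a simple baseline. The sketch below computes the accuracy of always predicting the most common label; the helper name `majority_baseline_accuracy` and the toy labels are hypothetical, and in this notebook the function would presumably be called on `y_test`.

```python
import numpy as np

def majority_baseline_accuracy(y_true):
    """Accuracy of always predicting the most common label in y_true (labels 0/1)."""
    y_true = np.asarray(y_true).astype(int)
    counts = np.bincount(y_true, minlength=2)
    return counts.max() / float(counts.sum())

# Toy labels (made up): 6 wins and 4 losses
toy_labels = [1, 1, 1, 0, 1, 0, 1, 0, 0, 1]
baseline = majority_baseline_accuracy(toy_labels)
```

If the model's accuracy is well above this baseline on the same test labels, the features are carrying real signal rather than just reflecting the class balance.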

### Playoff Predictions

In [42]:
x1 = testing_data.drop(labels = ['W', 'Team', 'Opp'], axis = 1)
round1_predictions = logmodel.predict(x1)

In [45]:
# Formatting into a dataframe
round1_results = pd.DataFrame()
round1_results['Team'] = testing_data['Team']
round1_results['Opp'] = testing_data['Opp']
round1_results['W'] = round1_predictions.astype(int)

round1_results

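|      | Team | Opp | W |
|------|------|-----|---|
| 6467 | PHI  | BRK | 1 |
| 6468 | UTA  | HOU | 1 |
| 6469 | MIL  | DET | 1 |
| 6470 | GSW  | LAC | 1 |
| 6471 | TOR  | ORL | 0 |
| 6472 | DET  | MIL | 0 |
| 6473 | BOS  | IND | 1 |
| 6474 | POR  | OKC | 1 |
| 6475 | ORL  | TOR | 1 |
| 6476 | LAC  | GSW | 0 |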

In [None]:
round1_winners = ['PHI', 'UTA', 'MIL', 'GSW', 'BOS', 'POR', 'ORL', 'DEN']
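
Because every series appears twice in the predictions (once from each team's point of view), the two mirrored rows can disagree. The sketch below is one possible way to derive winners and flag such conflicts; the function `series_winners`, its dictionary input, and the IND row in the toy data are hypothetical and are not taken from our results.

```python
def series_winners(predictions):
    """Return (winners, conflicts) from mirrored per-team predictions.

    `predictions` maps (team, opponent) -> 1 if `team` is predicted to win, else 0.
    A series is a conflict when its two mirrored rows do not pick exactly one winner.
    Series whose mirrored row is missing are skipped.
    """
    winners, conflicts = [], []
    seen = set()
    for (team, opp), team_wins in predictions.items():
        series = frozenset((team, opp))
        if series in seen or (opp, team) not in predictions:
            continue
        seen.add(series)
        opp_wins = predictions[(opp, team)]
        if team_wins == 1 and opp_wins == 0:
            winners.append(team)
        elif team_wins == 0 and opp_wins == 1:
            winners.append(opp)
        else:
            conflicts.append(tuple(sorted((team, opp))))
    return winners, conflicts

# Toy input shaped like round1_results; the IND row is made up to show a conflict
toy_predictions = {
    ('MIL', 'DET'): 1, ('DET', 'MIL'): 0,
    ('TOR', 'ORL'): 0, ('ORL', 'TOR'): 1,
    ('BOS', 'IND'): 1, ('IND', 'BOS'): 1,
}
winners, conflicts = series_winners(toy_predictions)
```

Series that land in `conflicts` would need a tie-breaking rule, for example comparing predicted probabilities instead of hard 0/1 labels.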

# Ethics & Privacy

*Fill in your ethics & privacy discussion here*

# Conclusion & Discussion

*Fill in your discussion information here*