In [1]:
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
import warnings
warnings.filterwarnings('ignore')

# Multiple Regression and Categorical Variables

## 1 NBA2K20 Data

#### NBA 2K20 is a basketball simulation video game developed by Visual Concepts and published by 2K Sports, based on the National Basketball Association (NBA). (Source: Wikipedia)

#### (1) Load the data "nba2k20-full.csv". How many observations do you find? Run some sanity checks. A missing value in team means that the player is currently a free agent, and a missing value in college means that the player did not attend college.

The nba2k20 dataset contains the following variables:

full_name: players' full name

rating: the overall ability of a player in the game

jersey: the number on players' jersey

team: players' current team (missing for free agents)

position: players' position on the court

b_day: players' birthday

height: players' height

weight: players' weight

salary: players' salary

country: players' nationality

draft_year: the year in which the player entered the NBA draft

draft_round: the round in which the player was chosen

draft_peak: the pick number at which the player was chosen

college: players' college

In [2]:
nba = pd.read_csv("nba2k20-full.csv",sep=",")
print(nba.shape)
print(nba.isnull().sum())
nba.head(5)

(429, 14)
full_name       0
rating          0
jersey          0
team           23
position        0
b_day           0
height          0
weight          0
salary          0
country         0
draft_year      0
draft_round     0
draft_peak      0
college        66
dtype: int64


,full_name,rating,jersey,team,position,b_day,height,weight,salary,country,draft_year,draft_round,draft_peak,college
0,LeBron James,97,#23,Los Angeles Lakers,F,12/30/84,6-9 / 2.06,250 lbs. / 113.4 kg.,$37436858,USA,2003,1,1,
1,Kawhi Leonard,97,#2,Los Angeles Clippers,F,06/29/91,6-7 / 2.01,225 lbs. / 102.1 kg.,$32742000,USA,2011,1,15,San Diego State
2,Giannis Antetokounmpo,96,#34,Milwaukee Bucks,F-G,12/06/94,6-11 / 2.11,242 lbs. / 109.8 kg.,$25842697,Greece,2013,1,15,
3,Kevin Durant,96,#7,Brooklyn Nets,F,09/29/88,6-10 / 2.08,230 lbs. / 104.3 kg.,$37199000,USA,2007,1,2,Texas
4,James Harden,96,#13,Houston Rockets,G,08/26/89,6-5 / 1.96,220 lbs. / 99.8 kg.,$38199000,USA,2009,1,3,Arizona State
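
The sanity checks asked for in (1) could look something like the sketch below. It does not read the real file; instead it builds a small hypothetical frame (`sample`) from three rows of the preview above, and the 0-100 rating range is an assumption about the game's scale rather than something stated in the data.

```python
import pandas as pd

# Hypothetical three-row sample copied from the preview above (not the full file).
sample = pd.DataFrame({
    "full_name": ["LeBron James", "Kawhi Leonard", "Giannis Antetokounmpo"],
    "rating": [97, 97, 96],
    "salary": ["$37436858", "$32742000", "$25842697"],
    "college": [None, "San Diego State", None],
})

# 1. No player should appear twice.
n_duplicates = int(sample["full_name"].duplicated().sum())

# 2. Ratings should fall in a plausible range (0-100 is an assumption).
ratings_ok = bool(sample["rating"].between(0, 100).all())

# 3. Every salary string should start with "$" before we strip it.
salary_format_ok = bool(sample["salary"].str.startswith("$").all())

# 4. Missing colleges are expected: the player did not attend college.
n_no_college = int(sample["college"].isna().sum())

print(n_duplicates, ratings_ok, salary_format_ok, n_no_college)
```

The same four checks can be run on the full `nba` frame once it is loaded.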


## 2 Multiple Regression

First, let's examine the relationship between players' salaries, their ability ratings in the game, and their draft picks.

(1) First, convert the salary column from string to int and express salary in millions of dollars (divide by 1_000_000).

In [3]:
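# Strip the leading "$" and convert to an integer number of dollars.
# A vectorized string operation avoids chained assignment through .iloc.
nba["salary"] = nba["salary"].str.lstrip("$").astype(int)
# Express salary in millions of dollars.
nba["salary"] = nba["salary"] / 1_000_000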

(2) Some players did not enter the NBA through the draft, so we want to drop undrafted players.

Hint: check the values in draft_round.

In [4]:
nba.draft_round.value_counts()

1            257
2            105
Undrafted     67
Name: draft_round, dtype: int64

In [5]:
# Take a copy so that later column assignments do not write into a view of nba.
nba_draft = nba[nba["draft_round"] != "Undrafted"].copy()

(3) Now we want to convert draft_peak from string to integer.

In [6]:
nba_draft["draft_peak"] = nba_draft["draft_peak"].astype(int)
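
One way to confirm that dropping undrafted players is what makes the integer conversion safe is sketched below. The frame `draft` is hypothetical, and it assumes that undrafted players carry the string "Undrafted" in draft_peak as well as in draft_round, which the output above does not show directly.

```python
import pandas as pd

# Hypothetical values; we assume undrafted players have "Undrafted"
# in draft_peak as well as in draft_round.
draft = pd.DataFrame({
    "draft_round": ["1", "2", "Undrafted"],
    "draft_peak": ["1", "35", "Undrafted"],
})

# errors="coerce" turns anything non-numeric into NaN instead of raising.
peak_numeric = pd.to_numeric(draft["draft_peak"], errors="coerce")

# The rows that failed to convert should be exactly the undrafted ones.
failed = draft.loc[peak_numeric.isna(), "draft_round"].tolist()

# Keep only the rows that converted, then cast to int.
drafted = draft[peak_numeric.notna()].copy()
drafted["draft_peak"] = peak_numeric[peak_numeric.notna()].astype(int)

print(failed, drafted["draft_peak"].tolist())
```

If `failed` contained anything other than "Undrafted", the column would need more cleaning before `astype(int)`.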

(4) Run the linear regression in the form

   salary = β0 + β1 * rating + β2 * draft_peak + ϵ

   and show the regression output.

   What are β0, β1 and β2? Interpret what they mean. Are they statistically significant?

In [7]:
m = smf.ols("salary ~ rating + draft_peak", 
            data = nba_draft).fit()

b0 = -89.6891
b1 = 1.2845
b2 = 0.0052

m.summary()

# b0 is -89.69, b1 is 1.28 and b2 is 0.0052.
# b0 is the predicted salary (-89.69 million dollars) for a player whose rating and draft pick are both 0.
# No such player exists, so b0 only anchors the regression line and has no practical meaning.
# b1 means that, holding draft pick fixed, a 1-point increase in rating is associated with a salary
# that is 1.28 million dollars higher.
# b2 means that, holding rating fixed, a 1-position increase in draft pick number is associated with a salary
# that is 0.0052 million dollars (about 5,200 dollars) higher.
# b0 and b1 are statistically significant, as their p-values are very small (reported as 0.000).
# b2 is not statistically significant: its t value is small (0.219) and its p-value is large (0.827), so
# once rating is controlled for, we cannot conclude that draft pick is related to salary.

0,1,2,3
Dep. Variable:,salary,R-squared:,0.592
Model:,OLS,Adj. R-squared:,0.59
Method:,Least Squares,F-statistic:,260.4
Date:,"Fri, 13 Nov 2020",Prob (F-statistic):,1.32e-70
Time:,17:49:02,Log-Likelihood:,-1171.2
No. Observations:,362,AIC:,2348.0
Df Residuals:,359,BIC:,2360.0
Df Model:,2,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,-89.6891,5.095,-17.603,0.000,-99.709,-79.669
rating,1.2845,0.063,20.482,0.000,1.161,1.408
draft_peak,0.0052,0.024,0.219,0.827,-0.042,0.052

0,1,2,3
Omnibus:,15.279,Durbin-Watson:,1.901
Prob(Omnibus):,0.0,Jarque-Bera (JB):,27.532
Skew:,0.24,Prob(JB):,1.05e-06
Kurtosis:,4.263,Cond. No.,1260.0
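
To make the interpretation concrete, the sketch below plugs the coefficients from the table above into the fitted equation. The two players (ratings 80 and 81, both picked 10th) are hypothetical, and `predicted_salary` is an illustrative helper, not part of statsmodels.

```python
# Coefficients copied from the regression table above.
b0, b1, b2 = -89.6891, 1.2845, 0.0052

def predicted_salary(rating, draft_peak):
    """Fitted salary in millions of dollars for a given rating and pick number."""
    return b0 + b1 * rating + b2 * draft_peak

# Hypothetical players: same pick, ratings one point apart.
low = predicted_salary(80, 10)
high = predicted_salary(81, 10)

# The gap between them equals b1, the effect of one extra rating point.
gap = high - low
print(round(low, 2), round(gap, 4))

# The t statistic is the coefficient divided by its standard error.
# With the rounded table values this gives about 0.22; the table reports
# 0.219 because statsmodels uses the unrounded numbers.
t_draft_peak = 0.0052 / 0.024
print(round(t_draft_peak, 2))
```

With a fitted model, `m.params` and `m.bse` hold the unrounded coefficients and standard errors, so they can replace the hard-coded values.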


## 3 Categorical Variables

Besides money, we now want to see whether there is a relationship between players' positions and their heights.

(1) First, let's get players' heights in centimeters by taking the last four characters of each value in the height column and converting them to numbers.

In [8]:
# Keep the last four characters (height in meters, e.g. "2.06") and convert to float.
nba["height"] = nba["height"].str[-4:].astype(float)
# Convert meters to centimeters.
nba["height"] = nba["height"] * 100
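
Taking the last four characters works because every metric height in the preview has the form "d.dd". A slightly more defensive alternative, sketched here on three strings copied from the preview, splits on the " / " separator instead, so it does not depend on the width of the number. The names `raw`, `meters` and `height_cm` are illustrative.

```python
import pandas as pd

# Sample strings copied from the preview of the height column.
raw = pd.Series(["6-9 / 2.06", "6-7 / 2.01", "6-11 / 2.11"])

# Split on the " / " separator and keep the metric part (second piece).
meters = raw.str.split(" / ").str[1].astype(float)

# Convert to centimeters; round to remove floating-point noise.
height_cm = (meters * 100).round(1)
print(height_cm.tolist())
```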

(2) Second, to keep things simple, we only need each player's main position. Thus, if a player's position is "F-C", we take F as the position.

In [9]:
# Keep only the first character, i.e. the main position.
nba["position"] = nba["position"].str[0]

(3) Now that the data is ready, let's find the relationship between heights and positions. G means guard, F means forward, and C means center.

Use the formula: height = β0 + β1 * ForwardYes + β2 * GuardYes + ϵ

What are your β0, β1, and β2? How do you interpret those values? Are they statistically significant? What do they tell you in reality?

In [10]:
m = smf.ols("height ~ position", data=nba).fit()

b0 = 210.74
b1 = -7.15
b2 = -18.39

m.summary()

# b0 is 210.74: the average height of centers (the baseline category) is 210.74 cm.
# b1 is -7.15: forwards are on average 7.15 cm shorter than centers.
# b2 is -18.39: guards are on average 18.39 cm shorter than centers.
# All three values are statistically significant, as their p-values are very small.
# This tells us that centers are usually taller than forwards and guards.

0,1,2,3
Dep. Variable:,height,R-squared:,0.655
Model:,OLS,Adj. R-squared:,0.653
Method:,Least Squares,F-statistic:,404.5
Date:,"Fri, 13 Nov 2020",Prob (F-statistic):,3.43e-99
Time:,17:49:02,Log-Likelihood:,-1290.4
No. Observations:,429,AIC:,2587.0
Df Residuals:,426,BIC:,2599.0
Df Model:,2,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,210.7407,0.669,314.996,0.000,209.426,212.056
position[T.F],-7.1525,0.760,-9.417,0.000,-8.645,-5.660
position[T.G],-18.3897,0.759,-24.227,0.000,-19.882,-16.898

0,1,2,3
Omnibus:,3.852,Durbin-Watson:,1.861
Prob(Omnibus):,0.146,Jarque-Bera (JB):,3.653
Skew:,-0.182,Prob(JB):,0.161
Kurtosis:,3.269,Cond. No.,5.96
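
With only a categorical predictor, the regression coefficients are just differences between group means. The sketch below illustrates this with plain NumPy on seven made-up heights (not taken from the dataset): the intercept reproduces the mean of the baseline group (centers), and intercept plus each dummy coefficient reproduces the mean of that group.

```python
import numpy as np

# Hypothetical heights in cm for a handful of players (not from the dataset).
position = np.array(["C", "C", "F", "F", "F", "G", "G"])
height = np.array([211.0, 213.0, 203.0, 206.0, 201.0, 191.0, 193.0])

# Design matrix: intercept, ForwardYes dummy, GuardYes dummy.
# Center has no dummy, so it is the baseline category.
X = np.column_stack([
    np.ones(len(height)),
    (position == "F").astype(float),
    (position == "G").astype(float),
])

# Ordinary least squares fit.
b0, b1, b2 = np.linalg.lstsq(X, height, rcond=None)[0]

# Group means for comparison.
mean_c = height[position == "C"].mean()
mean_f = height[position == "F"].mean()
mean_g = height[position == "G"].mean()

print(b0, b0 + b1, b0 + b2)
print(mean_c, mean_f, mean_g)
```

Applied to the table above, this means forwards average about 210.74 - 7.15 = 203.59 cm and guards about 210.74 - 18.39 = 192.35 cm.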


(4) What is your R^2, and what does it mean?

In [11]:
# R^2 is 0.655, which means 65.5% of the variation in players' heights can be explained by position.
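
As a rough illustration of where this number comes from, the sketch below computes R^2 = 1 - SS_res / SS_tot on seven made-up heights (not from the dataset), using each position's mean as the fitted value. It then checks the adjusted R^2 from the summary using the reported R^2, the 429 observations and the 2 predictors. The helper name `r_squared` is illustrative.

```python
import numpy as np

def r_squared(y, y_hat):
    """Share of the variation in y explained by the fitted values y_hat."""
    ss_res = np.sum((y - y_hat) ** 2)      # unexplained variation
    ss_tot = np.sum((y - y.mean()) ** 2)   # total variation
    return 1 - ss_res / ss_tot

# Hypothetical heights in cm; fitted values are the group means by position.
position = np.array(["C", "C", "F", "F", "F", "G", "G"])
height = np.array([211.0, 213.0, 203.0, 206.0, 201.0, 191.0, 193.0])
fitted = np.array([height[position == p].mean() for p in position])

r2_toy = r_squared(height, fitted)
print(round(r2_toy, 3))

# Adjusted R^2 from the summary: n = 429 observations, k = 2 predictors.
r2, n, k = 0.655, 429, 2
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
print(round(adj_r2, 3))
```

The adjusted value agrees with the 0.653 shown in the regression output, up to rounding of the reported R^2.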