## Problem 2
In this problem you will play with the idea of compound models. I have created a data set
(entirely fake!) of 2012 salaries in the NBA, of 10,000 basketball players that were in high-school
in 2011: nba_cc_fake_data.csv. Note that the vast majority of the salaries are equal to 0 because
the vast majority of these high-school players did not make it to the NBA and hence their NBA
salary equals zero.
There are three features you will use: height (in inches), average points scored during the last
year in high school competition, and a scoring from 1-10 of the competitiveness of the league these
players played in, with 10 being the most competitive.
The goal is to build a model to predict the NBA salary of a high school baller.


In [3]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

In [49]:
# Only 5.87% of the players (587 of 10,000) make it to the NBA.
nba_data = pd.read_csv("nba_cc_fake_data.csv", index_col=0)
nba_data.head()

|  | Comp | Height | Points | Salary |
|---|---|---|---|---|
| 0 | 9.0 | 76.0 | 27.0 | 0.0 |
| 1 | 7.0 | 78.0 | 39.0 | 0.0 |
| 2 | 9.0 | 76.0 | 39.0 | 0.0 |
| 3 | 9.0 | 74.0 | 39.0 | 0.0 |
| 4 | 9.0 | 74.0 | 26.0 | 0.0 |


In [9]:
len(nba_data), len(nba_data[nba_data['Salary']>0])

(10000, 587)

In [12]:
nba_data[nba_data['Salary']>0]['Salary'].min(), nba_data[nba_data['Salary']>0]['Salary'].max()

(176939.0, 1212013.0)

### 1\. Explain why linear regression is not appropriate, given the nature of the data.
The output above shows 10,000 samples in total, but only 587 of them have a nonzero value in the target column (Salary). The target is therefore zero-inflated: there is a large point mass at 0, then a gap up to the smallest nonzero salary (176,939), and the remaining values spread up to 1,212,013. A single linear function cannot fit both the mass at zero and the positive salaries, and it can produce negative predictions, so linear regression is a poor fit for this data.
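
To make the argument concrete, here is a minimal sketch on synthetic, made-up data (not the assignment's data set): a single "skill" feature where only the top 6% or so earn a nonzero salary. The helper `fit_line` and all values are assumptions for illustration. A least squares line fitted to such a target ends up with a negative intercept and predicts negative salaries for many players.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

random.seed(0)
# Synthetic data: skill is uniform in [0, 1]; only skill > 0.94 earns a salary.
skill = [random.random() for _ in range(10000)]
salary = [500000.0 if s > 0.94 else 0.0 for s in skill]

slope, intercept = fit_line(skill, salary)
predictions = [slope * s + intercept for s in skill]
n_negative = sum(p < 0 for p in predictions)
print("intercept:", round(intercept, 2))
print("negative predictions:", n_negative, "of", len(predictions))
```

The same effect is what drags down the least squares fit on the real data: the line is a compromise between thousands of zeros and a few large values.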

### 2\. Try least squares regression, anyway. How well do you do?

In [62]:
# Split the data into training set and testing set
X = nba_data.values[:, :3]
Y = nba_data.values[:, 3]
X_train, X_test, y_train, y_test = train_test_split(X, Y, train_size=0.7, random_state=0)

In [63]:
# Test linear regression (ordinary least squares)
LinearReg = LinearRegression()
LinearReg.fit(X_train, y_train)
print("R2 score for Ordinary Least Squares is %.2f" % r2_score(y_test, LinearReg.predict(X_test)))

R2 score for Ordinary Least Squares is 0.18


The $R^2$ score of 0.18 means that the least squares model explains only about 18% of the variance in the test-set salaries, which is only slightly better than always predicting the mean salary ($R^2 = 0$). This is poor performance.
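
For reference, $R^2$ compares the model's squared error with that of a baseline that always predicts the mean of the targets, so a score of 0 corresponds to that constant baseline and 1 to a perfect fit. A minimal sketch of the definition follows; the helper `r2` and the toy salaries are illustrative assumptions, not values from the data set.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

y = [0.0, 0.0, 0.0, 400000.0]            # made-up, zero-heavy salaries
mean_only = [sum(y) / len(y)] * len(y)   # baseline: always predict the mean

print(r2(y, mean_only))  # constant-mean baseline
print(r2(y, y))          # perfect predictions
```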

### 3\. You will next build a composite model.
You will first predict the probability that a player
actually makes it to the NBA at all, and then you will build a model to predict the salary of
a player, conditioned on the fact of making it to the NBA.<br>
– Build a model that predicts the probability of making it to the NBA.<br>
– Do a train-test split of 8000/2000 points, train your best model on the training set, and
compute the AUC on the test set.<br>
– Now, build a model to predict the salary. Note that you may wish to consider a nonlinear transformation of your data. What is your R2
score on the test set?

### Step 1: Build a model that predicts the probability of making it to the NBA.

In [64]:
# First, use logistic regression to predict the probability that a player makes it to the NBA.
# Before fitting the model, convert the target column into 0/1 labels.
Y_pro = np.copy(Y)
Y_pro[np.where(Y>0)] = 1

In [66]:
# The default value of the stratify parameter is None, which means the split is NOT stratified (it is purely random).
X_train_pro, X_test_pro, y_train_pro, y_test_pro = train_test_split(X, Y_pro, train_size=0.8, random_state=0)
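
With only about 6% positive labels, a purely random split can leave the training and test sets with slightly different class ratios. A stratified split keeps the ratio equal in both; scikit-learn's `train_test_split` supports this through its `stratify` argument. The sketch below shows the idea in plain Python; `stratified_split` is a hypothetical helper and the labels are made up.

```python
import random

def stratified_split(labels, train_fraction, seed=0):
    """Return (train_idx, test_idx), splitting each class in the same proportion."""
    rng = random.Random(seed)
    by_class = {}
    for i, label in enumerate(labels):
        by_class.setdefault(label, []).append(i)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = round(len(indices) * train_fraction)
        train_idx.extend(indices[:cut])
        test_idx.extend(indices[cut:])
    return train_idx, test_idx

labels = [1] * 60 + [0] * 940   # made-up labels, 6% positive
train_idx, test_idx = stratified_split(labels, 0.8)
train_rate = sum(labels[i] for i in train_idx) / len(train_idx)
test_rate = sum(labels[i] for i in test_idx) / len(test_idx)
print(train_rate, test_rate)
```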

In [67]:
def aucModel(clf, model_name, X_train=X_train_pro, y_train=y_train_pro, X_test=X_test_pro, y_test=y_test_pro):
    clf.fit(X_train, y_train)
    print("AUC score of %s"%model_name + " on the training set is %.4f" % roc_auc_score(y_train, clf.predict_proba(X_train)[:, 1])) 
    print("AUC score of %s"%model_name + " on the testing set is %.4f" % roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))

### Step 2: Do a train-test split of 8000/2000 points, train your best model on the training set, and compute the AUC on the test set.

In [72]:
# First, let's use a Logistic Regression model to test the performance.
pro_LR = LogisticRegression()
aucModel(pro_LR, "Logistic Regression")



AUC score of Logistic Regression on the training set is 0.9005
AUC score of Logistic Regression on the testing set is 0.9133


The test AUC of 0.9133 shows that logistic regression already ranks players who reach the NBA well above those who do not.
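
One way to read an AUC value: it is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it directly from that definition; `pairwise_auc` is a hypothetical helper and the labels and scores are toy values, not model output.

```python
def pairwise_auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(labels, scores))  # 0.75: three of the four pairs are ordered correctly
```

Because it depends only on the ranking of the scores, AUC stays informative even when positives are as rare as they are here.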

In [198]:
# We try Decision Tree here.
from sklearn.tree import DecisionTreeClassifier
DecTree = DecisionTreeClassifier(max_depth=6)
aucModel(DecTree, "Decision Tree")

AUC score of Decision Tree on the training set is 0.9494
AUC score of Decision Tree on the testing set is 0.9249


In [209]:
# This time, we try gradient boosting of trees.
from sklearn.ensemble import GradientBoostingClassifier
GradBoost = GradientBoostingClassifier(n_estimators=30, max_depth=4)
aucModel(GradBoost, "Gradient Boosting Classifier")

AUC score of Gradient Boosting Classifier on the training set is 0.9501
AUC score of Gradient Boosting Classifier on the testing set is 0.9383


In [204]:
# What about Random Forest which reduces the variance?
from sklearn.ensemble import RandomForestClassifier
RandForest = RandomForestClassifier(n_estimators=8, max_depth=5)
aucModel(RandForest, "Random Forest")

AUC score of Random Forest on the training set is 0.9390
AUC score of Random Forest on the testing set is 0.9380


From the results above, the Gradient Boosting Classifier (test AUC 0.9383) and the Random Forest (test AUC 0.9380) perform best. We choose the Gradient Boosting Classifier as our best model.

### Step 3: Now, build a model to predict the salary. Note that you may wish to consider a nonlinear transformation of your data. What is your R2 score on the test set?

In [84]:
# Now, build a model to predict the salary.
# The salary model is conditional on making it to the NBA, so keep only the rows with salary > 0.
X_salary = np.copy(X[np.where(Y>0)])
Y_salary = np.copy(Y[np.where(Y>0)])
X_train_sal, X_test_sal, y_train_sal, y_test_sal = train_test_split(X_salary, Y_salary, train_size=0.8, random_state=0)

First, get some insight from the original data.

In [85]:
nba_data[nba_data['Salary']>0].corr()

|  | Comp | Height | Points | Salary |
|---|---|---|---|---|
| Comp | 1.0 | -0.060623 | -0.25452 | 0.664805 |
| Height | -0.060623 | 1.0 | -0.210574 | 0.259522 |
| Points | -0.25452 | -0.210574 | 1.0 | -0.057897 |
| Salary | 0.664805 | 0.259522 | -0.057897 | 1.0 |


From the correlation matrix, the pairwise correlations among the features are weak (at most 0.25 in absolute value), so there is no sign of strong collinearity.

In [86]:
nba_data[nba_data['Salary']>0][['Comp', 'Height', 'Points']].corrwith(np.log(nba_data[nba_data['Salary']>0]['Salary']))

Comp      0.729033
Height    0.246162
Points   -0.065791
dtype: float64

In [99]:
def r2LR(clf, data_name, X_train, X_test, y_train, y_test):
    clf.fit(X_train, y_train)
    print("R2 score for %s"%data_name + " is:")
    print("Training score: %.4f" % r2_score(y_train, clf.predict(X_train)))
    print("Testing score:  %.4f" % r2_score(y_test, clf.predict(X_test)))

In [100]:
preLR = LinearRegression()
r2LR(preLR, "filtered data", X_train_sal, X_test_sal, y_train_sal, y_test_sal)

R2 score for filtered data is:
Training score: 0.5601
Testing score:  0.5932


In [103]:
# Try standardizing both the features and the target.
# The scalers are fit on the training set only and then applied to the test set.
from sklearn.preprocessing import StandardScaler
x_sc = StandardScaler()
x_sc.fit(X_train_sal)
X_train_sal_std = x_sc.transform(X_train_sal)
X_test_sal_std = x_sc.transform(X_test_sal)

y_sc = StandardScaler()
y_sc.fit(y_train_sal.reshape(-1, 1))
y_train_sal_std = y_sc.transform(y_train_sal.reshape(-1, 1)).ravel()
y_test_sal_std = y_sc.transform(y_test_sal.reshape(-1, 1)).ravel()

preLR_std = LinearRegression()
r2LR(preLR_std, "standardized data", X_train_sal_std, X_test_sal_std, y_train_sal_std, y_test_sal_std)

R2 score for standardized data is:
Training score: 0.5601
Testing score:  0.5932
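
The identical scores are expected rather than coincidental: ordinary least squares with an intercept is unaffected by shifting and rescaling the features or the target, so the fitted values are rescaled in the same way and $R^2$ does not change. A minimal one-feature sketch on made-up numbers (the helpers `fit_line`, `r2_of_fit`, and `standardize` are assumptions for illustration):

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def r2_of_fit(xs, ys):
    """In-sample R^2 of a one-feature least squares fit."""
    slope, intercept = fit_line(xs, ys)
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot

def standardize(values):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]

height = [72.0, 74.0, 76.0, 78.0, 80.0]                       # made-up heights
salary = [210000.0, 190000.0, 320000.0, 300000.0, 450000.0]   # made-up salaries

raw_r2 = r2_of_fit(height, salary)
std_r2 = r2_of_fit(standardize(height), standardize(salary))
print(raw_r2, std_r2)
```

Standardization still matters for regularized or distance-based models; it just has no effect on plain least squares.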


In [109]:
# Standardization does not change the scores.
# Since the salaries span a wide range, try a natural-log transform of the target instead.
y_train_log = np.log(y_train_sal)
y_test_log = np.log(y_test_sal)
preLR_log = LinearRegression()
r2LR(preLR_log, "logged salary", X_train_sal, X_test_sal, y_train_log, y_test_log)

R2 score for logged salary is:
Training score: 0.6441
Testing score:  0.6871


In [108]:
from sklearn.tree import DecisionTreeRegressor
preDTR = DecisionTreeRegressor(max_depth=3)
r2LR(preDTR, "Decision Tree Regressor (max depth = 3) with logged salary", X_train_sal, X_test_sal, y_train_log, y_test_log)
preDTR = DecisionTreeRegressor(max_depth=4)
r2LR(preDTR, "Decision Tree Regressor (max depth = 4) with logged salary", X_train_sal, X_test_sal, y_train_log, y_test_log)

R2 score for Decision Tree Regressor (max depth = 3) with logged salary is:
Training score: 0.6009
Testing score:  0.6125
R2 score for Decision Tree Regressor (max depth = 4) with logged salary is:
Training score: 0.6516
Testing score:  0.6007


In [176]:
# This time, we try gradient boosting of trees.
# This is the best result from manual tuning.
from sklearn.ensemble import GradientBoostingRegressor
preGradBoost = GradientBoostingRegressor(n_estimators=30, max_depth=3)
r2LR(preGradBoost, "Gradient Boosting Regressor (max depth = 3)", X_train_sal, X_test_sal, y_train_log, y_test_log)

R2 score for Gradient Boosting Regressor (max depth = 3) is:
Training score: 0.6883
Testing score:  0.6514


In [174]:
# What about Random Forest which reduces the variance?
from sklearn.ensemble import RandomForestRegressor
preRandForest = RandomForestRegressor(n_estimators=8, max_depth=5)
r2LR(preRandForest, "Random Forest Regressor (max depth = 5)", X_train_sal, X_test_sal, y_train_log, y_test_log)

R2 score for Random Forest Regressor (max depth = 5) is:
Training score: 0.7039
Testing score:  0.6843


On the test set, linear regression on log salary (0.6871) and the Random Forest Regressor (0.6843) score almost the same; we use the Random Forest Regressor as the salary model. <br>
In summary, we use the best classifier to predict the probability that a player makes it to the NBA, and a separate regression model to predict the salary given that the player makes it to the NBA.

### 4\.  Test for given sample
Compute the expected NBA salary of a high school basketball player who is 6’ 6” tall, is
averaging 46 points per game, and is playing in the second most competitive league (comp =
9), according to your model.

In [225]:
def predPlayer(player, prob_clf, pred_clf, threshold=0.5):
    # Predict the probability that this player makes it to the NBA
    prob = prob_clf.predict_proba(player)[0][1]
    print("The data for the player is:\nLeague competitiveness: {}\nHeight: {}\nPoints: {}".format(player[0][0], player[0][1], player[0][2]))
    print("The probability of this player making it to the NBA is %f" % prob)
    if prob > threshold:
        # The salary model was trained on np.log(salary), so invert with np.exp
        pred_sal = np.exp(pred_clf.predict(player)[0])
        print("The predicted salary for this player is %.2f" % pred_sal)
    else:
        print("The player is not predicted to make it to the NBA, so the predicted salary is 0.")

In [214]:
# Feature order matches the training data: Comp, Height (6'6" = 78 inches), Points
player = np.array([[9, 78, 46]])

In [226]:
predPlayer(player, GradBoost, preRandForest)

The data for the player is:
League competitiveness: 9
Height: 78
Points: 46
The probability of this player making it to the NBA is 0.068136
The player is not predicted to make it to the NBA, so the predicted salary is 0.
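
The problem asks for the expected salary, which a hard 0.5 threshold does not give. Under the composite model, one reasonable reading is E[salary] = P(NBA) × E[salary | NBA], with the salary model's log-scale prediction mapped back through the exponential. The sketch below assumes that reading; `expected_salary` is a hypothetical helper, and the conditional salary used in the example is a placeholder rather than output from the fitted model.

```python
import math

def expected_salary(p_nba, predicted_log_salary):
    """Expected salary = P(NBA) * salary-if-NBA, for a salary model fit on natural-log salary."""
    salary_if_nba = math.exp(predicted_log_salary)  # undo the natural log
    return p_nba * salary_if_nba

p_nba = 0.068136                                # probability printed above
hypothetical_log_salary = math.log(500000.0)    # placeholder, NOT a model prediction
print(round(expected_salary(p_nba, hypothetical_log_salary), 2))
```

In the notebook, `predicted_log_salary` would come from `preRandForest.predict(player)[0]`. Note that exponentiating a prediction of the mean log salary tends to underestimate the mean salary, so this is an approximation.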
