### <center>**Assignment**</center>

In this assignment, we would like to predict the success of shots made by basketball players in the NBA.

![](https://nthu-datalab.github.io/ml/labs/05_Regularization/fig-nba.png)

Please download the dataset first. You may need to try several models to achieve better performance.

### **Requirements**

- Submit to eeclass with your code file `Lab05_{student_id}.ipynb` (e.g. `Lab05_109069999.ipynb`) and prediction file `Lab05_{student_id}_y_pred.csv`. The notebook should contain the following parts:

    1. Use all features to train [any linear model in scikit-learn](https://scikit-learn.org/stable/modules/linear_model.html#linear-models) and try different hyperparameters (e.g. different polynomial degree or model complexity). Report the performance of each setting.
    2. Select 1 setting (model + hyperparameters) and plot the **error curve** to show that the setting you selected **does not overfit**.
    3. Use any method to choose the **best 3 features** that can best aid the model's prediction. Explain **how you find it**.
        - Note: the combination of features doesn't count as 1 feature, e.g. $x_{1}$, $x_{2}$, and $x_1^2+x_2$ count as only two features.
    4. Train the model selected in 2. using only the 3 features selected in 3., and present the training error.
    5. Export the predictions of the model trained in 4. for ``X_test`` (follow the format of ``y_train.csv``).
    6. **A brief report** of what you have done in this assignment.

- Prediction performance will have minimal impact on the assignment grade, so there's no need to be overly concerned about it.
- Deadline: **2023-10-19 (Thu) 23:59**.
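
One possible way to produce the error curve asked for in requirement 2 is scikit-learn's `learning_curve`, which refits a model on growing subsets of the training data and scores each fit with cross-validation. The sketch below is only an illustration: it runs on synthetic data from `make_classification` rather than the shot data, and the names `X_demo`, `y_demo`, and the chosen pipeline are assumptions, not part of the assignment.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the shot data (assumption): 8 features, binary label.
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=8, n_informative=4, random_state=0
)

# Example setting (assumption): standardized inputs + L2 logistic regression.
model = make_pipeline(StandardScaler(), LogisticRegression(C=1.0, max_iter=1000))

# Fit on growing subsets of the training folds, scoring each with 5-fold CV.
train_sizes, train_scores, valid_scores = learning_curve(
    model,
    X_demo,
    y_demo,
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=5,
    scoring="accuracy",
)

# Error = 1 - accuracy, averaged over the CV folds.
train_err = 1.0 - train_scores.mean(axis=1)
valid_err = 1.0 - valid_scores.mean(axis=1)

for n, tr, va in zip(train_sizes, train_err, valid_err):
    print(f"n={n:5d}  train error={tr:.4f}  validation error={va:.4f}")
```

Plotting `train_err` and `valid_err` against `train_sizes` gives the error curve. If the two curves converge to a similar value as the training size grows, the setting is probably not overfitting; a persistent gap with low training error would suggest that it is.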

### **Hints**

1. You can **preprocess the data** to help your training.
2. Since you don't have `y_test` this time, you may need to **split off a validation set** to check your performance.
3. It is possible to use a regression model as a classifier, for example [RidgeClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RidgeClassifier.html).
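
The three hints can be combined in a few lines. The following is a minimal sketch under stated assumptions: the data is synthetic (`make_classification`), standardization is just one example of preprocessing, and the names `X_demo`, `y_demo`, `X_tr`, and `X_va` are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import RidgeClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the labeled training data (assumption).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=8, n_informative=4, random_state=0
)

# Hint 2: hold out 20% of the labeled data as a validation set.
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

# Hint 1 + hint 3: standardize the features, then fit a ridge model as a classifier.
# Putting the scaler in the pipeline means it is fitted on the training split only.
clf = make_pipeline(StandardScaler(), RidgeClassifier(alpha=1.0))
clf.fit(X_tr, y_tr)

train_acc = accuracy_score(y_tr, clf.predict(X_tr))
valid_acc = accuracy_score(y_va, clf.predict(X_va))
print(f"ACC train: {train_acc:.4f}, valid: {valid_acc:.4f}")
```

Keeping the preprocessing inside the pipeline avoids leaking validation statistics into training, so the validation accuracy remains a fair estimate.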



### **References**

1. [Stanford CS229 Machine Learning](http://cs229.stanford.edu/proj2017/final-reports/5132133.pdf)
2. [NBA shot logs](https://www.kaggle.com/dansbecker/nba-shot-logs)


In [85]:
import pandas as pd
import numpy as np
from pylab import *
from sklearn.model_selection import train_test_split
from sklearn.model_selection import learning_curve
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import RidgeClassifier
from sklearn.metrics import accuracy_score

In [86]:
# Download the dataset into ./data if it is not already there
import os
import urllib.request

base_url = "https://nthu-datalab.github.io/ml/labs/05_Regularization/data/"
os.makedirs("./data", exist_ok=True)
for name in ["X_train.csv", "y_train.csv", "X_test.csv"]:
    path = os.path.join("./data", name)
    if not os.path.exists(path):
        urllib.request.urlretrieve(base_url + name, path)

X = pd.read_csv('./data/X_train.csv')
y = pd.read_csv('./data/y_train.csv')
X_test = pd.read_csv('./data/X_test.csv')

In [87]:
print(X.shape)
print(X.columns)
print(y.columns)

(85751, 8)
Index(['PERIOD', 'GAME_CLOCK', 'SHOT_CLOCK', 'DRIBBLES', 'TOUCH_TIME',
       'SHOT_DIST', 'PTS_TYPE', 'CLOSE_DEF_DIST'],
      dtype='object')
Index(['FGM'], dtype='object')


In [88]:
X.head()

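|   | PERIOD | GAME_CLOCK | SHOT_CLOCK | DRIBBLES | TOUCH_TIME | SHOT_DIST | PTS_TYPE | CLOSE_DEF_DIST |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 358 | 2.4 | 0 | 3.2 | 20.6 | 2 | 4.5 |
| 1 | 1 | 585 | 8.3 | 0 | 1.2 | 3.0 | 2 | 0.5 |
| 2 | 1 | 540 | 19.9 | 0 | 0.6 | 3.5 | 2 | 3.2 |
| 3 | 1 | 392 | 9.0 | 0 | 0.9 | 21.1 | 2 | 4.9 |
| 4 | 3 | 401 | 22.7 | 0 | 0.7 | 4.1 | 2 | 2.9 |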


In [89]:
print("Label with:", np.unique(y.values))

Label with: [0 1]


In [90]:
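# Keep the unlabeled X_test.csv data under its own name (X_test_unlabeled is an
# illustrative name), because the split below reuses the name X_test
X_test_unlabeled = X_test

# Split off a held-out test set (20%), then a validation set (20% of the remainder)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, test_size=0.2, random_state=0)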

print("Train data:", X_train.shape)
print("Valid data:", X_valid.shape)
print("Test data:", X_test.shape)

Train data: (54880, 8)
Valid data: (13720, 8)
Test data: (17151, 8)


In [91]:
# Logistic regression
penalty = [None, 'l2']

for p in penalty:
    logistic = LogisticRegression(penalty=p)

    logistic.fit(X_train, y_train.values.ravel())

    y_train_pred = logistic.predict(X_train)
    y_valid_pred = logistic.predict(X_valid)
    y_test_pred = logistic.predict(X_test)

    print("Penalty:", p)
    print('ACC train: %.4f, valid: %.4f, test: %.4f \n' % (
                    accuracy_score(y_train, y_train_pred),
                    accuracy_score(y_valid, y_valid_pred),
                    accuracy_score(y_test, y_test_pred)))

Penalty: None
ACC train: 0.6094, valid: 0.6034, test: 0.6091 

Penalty: l2
ACC train: 0.6094, valid: 0.6034, test: 0.6089 



In [92]:
# Ridge Classifier
alpha = [0.01, 0.1, 1, 10, 100]

for a in alpha:
    ridge = RidgeClassifier(alpha=a)

    ridge.fit(X_train, y_train.values.ravel())

    y_train_pred = ridge.predict(X_train)
    y_valid_pred = ridge.predict(X_valid)
    y_test_pred = ridge.predict(X_test)

    print("Alpha:", a)
    print('ACC train: %.4f, valid: %.4f, test: %.4f \n' % (
                    accuracy_score(y_train, y_train_pred),
                    accuracy_score(y_valid, y_valid_pred),
                    accuracy_score(y_test, y_test_pred)))

Alpha: 0.01
ACC train: 0.6091, valid: 0.6031, test: 0.6090 

Alpha: 0.1
ACC train: 0.6091, valid: 0.6031, test: 0.6090 

Alpha: 1
ACC train: 0.6091, valid: 0.6031, test: 0.6090 

Alpha: 10
ACC train: 0.6091, valid: 0.6030, test: 0.6091 

Alpha: 100
ACC train: 0.6091, valid: 0.6029, test: 0.6092 
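
The two sweeps above vary only the penalty and the regularization strength, and the accuracies barely move. Requirement 1 also mentions polynomial degree, so another axis worth trying is feature expansion. The sketch below is a hedged illustration on synthetic data (`make_classification`), not on the shot data; the pipeline, the degrees tried, and the names `X_demo`, `y_demo`, and `results` are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in for the shot data (assumption): 8 features, binary label.
X_demo, y_demo = make_classification(
    n_samples=1500, n_features=8, n_informative=4, random_state=0
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

results = {}
for degree in [1, 2, 3]:
    # Standardize, expand to polynomial terms up to `degree`, then fit a linear classifier.
    model = make_pipeline(
        StandardScaler(),
        PolynomialFeatures(degree=degree, include_bias=False),
        LogisticRegression(C=1.0, max_iter=2000),
    )
    model.fit(X_tr, y_tr)
    acc_train = accuracy_score(y_tr, model.predict(X_tr))
    acc_valid = accuracy_score(y_va, model.predict(X_va))
    results[degree] = (acc_train, acc_valid)
    print(f"Degree: {degree}  ACC train: {acc_train:.4f}, valid: {acc_valid:.4f}")
```

A training accuracy that keeps rising with the degree while the validation accuracy stalls or drops would indicate that the higher-degree model is starting to overfit.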

