# Threshold Adjustment

👇 Load the `player_performances.csv` dataset to see what you will be working with.

In [2]:
import pandas as pd

data = pd.read_csv('data/player_performances.csv')

data.head()

Unnamed: 0,games played,minutes played,points per game,field goals made,field goal attempts,field goal percent,3 point made,3 point attempt,3 point %,free throw made,free throw attempts,free throw %,offensive rebounds,defensive rebounds,rebounds,assists,steals,blocks,turnovers,target_5y
0,36,27.4,7.4,2.6,7.6,34.7,0.5,2.1,25.0,1.6,2.3,69.9,0.7,3.4,4.1,1.9,0.4,0.4,1.3,0
1,35,26.9,7.2,2.0,6.7,29.6,0.7,2.8,23.5,2.6,3.4,76.5,0.5,2.0,2.4,3.7,1.1,0.5,1.6,0
2,74,15.3,5.2,2.0,4.7,42.2,0.4,1.7,24.4,0.9,1.3,67.0,0.5,1.7,2.2,1.0,0.5,0.3,1.0,0
3,58,11.6,5.7,2.3,5.5,42.6,0.1,0.5,22.6,0.9,1.3,68.9,1.0,0.9,1.9,0.8,0.6,0.1,1.0,1
4,48,11.5,4.5,1.6,3.0,52.4,0.0,0.1,0.0,1.3,1.9,67.4,1.0,1.5,2.5,0.3,0.3,0.4,0.8,1


ℹ️ Each observation represents a player and each column a performance statistic. The target `target_5y` indicates whether the player's professional career lasted less than 5 years [0] or 5 years or more [1].

# Preprocessing

👇 To avoid spending too much time on preprocessing, apply robust scaling (`RobustScaler`) to the entire feature set. This practice is not optimal, but it can be used for preliminary preprocessing and/or to get models up and running quickly.

Save the scaled feature set as `X_scaled`.

In [3]:
# YOUR CODE HERE
from sklearn.preprocessing import RobustScaler

In [8]:
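# Separate the features from the target
X = data.drop(columns=["target_5y"])
y = data["target_5y"]

# Robust-scale every feature (centre on the median, divide by the interquartile range)
scaler = RobustScaler()
X_scaled = scaler.fit_transform(X)

# Restore the column names lost by fit_transform
X_scaled = pd.DataFrame(X_scaled, columns=X.columns)

# Scaled features alongside the target, for inspection
df = pd.concat([X_scaled, data["target_5y"]], axis=1)
df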

Unnamed: 0,games played,minutes played,points per game,field goals made,field goal attempts,field goal percent,3 point made,3 point attempt,3 point %,free throw made,free throw attempts,free throw %,offensive rebounds,defensive rebounds,rebounds,assists,steals,blocks,turnovers,target_5y
0,-0.900000,0.933884,0.352941,0.25,0.666667,-1.206557,1.00,1.500000,0.083077,0.585366,0.571429,-0.109375,-0.1,1.0625,0.659794,0.571429,-0.2,0.50,0.375,0
1,-0.933333,0.892562,0.313725,-0.05,0.452381,-1.875410,1.50,2.083333,0.036923,1.560976,1.357143,0.406250,-0.3,0.1875,-0.041237,1.857143,1.2,0.75,0.750,0
2,0.366667,-0.066116,-0.078431,-0.05,-0.023810,-0.222951,0.75,1.166667,0.064615,-0.097561,-0.142857,-0.335937,-0.3,0.0000,-0.123711,-0.071429,0.0,0.25,0.000,0
3,-0.166667,-0.371901,0.019608,0.10,0.166667,-0.170492,0.00,0.166667,0.009231,-0.097561,-0.142857,-0.187500,0.2,-0.5000,-0.247423,-0.214286,0.2,-0.25,0.000,1
4,-0.500000,-0.380165,-0.215686,-0.25,-0.428571,1.114754,-0.25,-0.166667,-0.686154,0.292683,0.285714,-0.304687,0.2,-0.1250,0.000000,-0.571429,-0.4,0.50,-0.250,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1323,0.566667,-0.024793,-0.254902,-0.25,-0.285714,-0.078689,-0.25,-0.083333,-0.246154,0.195122,0.000000,0.617188,-0.4,-0.5625,-0.536082,1.000000,0.2,0.00,-0.250,0
1324,0.166667,-0.289256,-0.333333,-0.30,-0.166667,-1.062295,0.00,0.333333,-0.172308,-0.195122,-0.357143,0.632813,-0.4,-0.3750,-0.412371,0.857143,0.6,-0.50,0.375,1
1325,-0.666667,-0.330579,-0.039216,0.05,-0.214286,1.455738,-0.25,-0.250000,-0.686154,0.000000,0.071429,-0.546875,0.7,0.3750,0.536082,-0.571429,-0.4,0.50,-0.125,0
1326,-0.366667,-0.338843,-0.215686,-0.20,-0.238095,0.000000,-0.25,-0.083333,-0.378462,0.195122,0.214286,-0.687500,-0.6,-0.8125,-0.742268,0.785714,-0.2,-0.25,-0.250,1
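To make the scaling step concrete, here is a minimal sketch of what `RobustScaler` computes with its default settings: each value has the column median subtracted and is then divided by the interquartile range. The `games` array is a made-up single column used only for illustration; it is not read from the dataset.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Hypothetical single-feature column, made up for illustration
games = np.array([[36.0], [35.0], [74.0], [58.0], [48.0]])

# RobustScaler with default settings: (x - median) / (Q3 - Q1)
scaled = RobustScaler().fit_transform(games)

# Same computation by hand
median = np.median(games, axis=0)
q1, q3 = np.percentile(games, [25, 75], axis=0)
manual = (games - median) / (q3 - q1)

print(np.allclose(scaled, manual))
```

Because the median and quartiles are barely affected by extreme values, a few outlying players do not distort the scale of the other rows.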


### ☑️ Check your code

In [9]:
from nbresult import ChallengeResult

result = ChallengeResult('scaled_features',
                         scaled_features = X_scaled
)

result.write()
print(result.check())


platform darwin -- Python 3.10.6, pytest-7.1.3, pluggy-1.0.0 -- /Users/Laetitia/.pyenv/versions/lewagon/bin/python3
cachedir: .pytest_cache
rootdir: /Users/Laetitia/code/juliensoudet/05-ML/03-Performance-metrics/data-threshold-adjustments/tests
plugins: asyncio-0.19.0, typeguard-2.13.3, anyio-3.6.2
asyncio: mode=strict
collecting ... collected 1 item

test_scaled_features.py::TestScaled_features::test_scaled_features PASSED [100%]



💯 You can commit your code:

git add tests/scaled_features.pickle

git commit -m 'Completed scaled_features step'

git push origin master



In [10]:
! git add tests/scaled_features.pickle

! git commit -m 'Completed scaled_features step'

! git push origin master


[master 6ca8799] Completed scaled_features step
 1 file changed, 0 insertions(+), 0 deletions(-)
 create mode 100644 tests/scaled_features.pickle
Enumerating objects: 17, done.
Counting objects: 100% (17/17), done.
Delta compression using up to 4 threads
Compressing objects: 100% (15/15), done.
Writing objects: 100% (17/17), 45.08 KiB | 3.00 MiB/s, done.
Total 17 (delta 2), reused 0 (delta 0), pack-reused 0 (from 0)
remote: Resolving deltas: 100% (2/2), done.
To github.com:juliensoudet/data-threshold-adjustments.git
 * [new branch]      master -> master


# Base modeling

🎯 The task is to detect players who will last at least 5 years as professionals, with a 90% guarantee (that is, 90% precision on the positive class).

👇 Is a default Logistic Regression model going to satisfy the coach's requirements? Use cross-validation and save the score that supports your answer under variable name `base_score`.
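As a toy illustration of the metric behind this question, precision is the share of predicted positives that are truly positive, TP / (TP + FP). The labels below are made up for the example and do not come from the player data.

```python
from sklearn.metrics import precision_score

# Made-up labels: 1 = lasted 5 years or more, 0 = did not
y_true = [1, 1, 1, 0, 0, 1, 0, 1, 1, 1]
y_pred = [1, 1, 1, 1, 0, 1, 0, 0, 1, 1]

# Count true positives and false positives by hand
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))

# Precision = TP / (TP + FP); should agree with scikit-learn's precision_score
manual_precision = tp / (tp + fp)
print(manual_precision, precision_score(y_true, y_pred))
```

Here 6 of the 7 players predicted to last actually do, so this toy classifier would fall short of a 90% guarantee even though it is right on 8 of 10 players overall.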

In [11]:
# YOUR CODE HERE
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

In [13]:


# Create a Logistic Regression model
model = LogisticRegression()

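# Perform 5-fold cross-validation
# NOTE: this scores accuracy, whereas the coach's requirement concerns precision;
# scoring='precision' would measure that directly
cv_scores = cross_val_score(model, X_scaled, y, cv=5, scoring='accuracy')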

# Calculate the mean score
base_score = cv_scores.mean()
base_score


0.7010554688608314
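Since the requirement is phrased in terms of precision, a cross-validated precision score is arguably the more direct check. The sketch below shows the call on synthetic data: `X_demo` and `y_demo` are stand-ins generated with `make_classification`, not the player data, so the printed value says nothing about the real model.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for X_scaled and y (not the player data)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)

# Score each of the 5 folds on precision instead of accuracy
precision_scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_demo, y_demo, cv=5, scoring="precision"
)
mean_precision = precision_scores.mean()

# The default 0.5 threshold meets the requirement only if this is at least 0.9
print(mean_precision, mean_precision >= 0.9)
```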

### ☑️ Check your code

In [14]:
from nbresult import ChallengeResult

result = ChallengeResult('base_precision',
                         score = base_score
)

result.write()
print(result.check())


platform darwin -- Python 3.10.6, pytest-7.1.3, pluggy-1.0.0 -- /Users/Laetitia/.pyenv/versions/lewagon/bin/python3
cachedir: .pytest_cache
rootdir: /Users/Laetitia/code/juliensoudet/05-ML/03-Performance-metrics/data-threshold-adjustments/tests
plugins: asyncio-0.19.0, typeguard-2.13.3, anyio-3.6.2
asyncio: mode=strict
collecting ... collected 1 item

test_base_precision.py::TestBase_precision::test_precision_score PASSED [100%]



💯 You can commit your code:

git add tests/base_precision.pickle

git commit -m 'Completed base_precision step'

git push origin master



In [15]:
! git add tests/base_precision.pickle

! git commit -m 'Completed base_precision step'

! git push origin master

[master 6e61e14] Completed base_precision step
 1 file changed, 0 insertions(+), 0 deletions(-)
 create mode 100644 tests/base_precision.pickle
Enumerating objects: 6, done.
Counting objects: 100% (6/6), done.
Delta compression using up to 4 threads
Compressing objects: 100% (4/4), done.
Writing objects: 100% (4/4), 525 bytes | 525.00 KiB/s, done.
Total 4 (delta 2), reused 0 (delta 0), pack-reused 0 (from 0)
remote: Resolving deltas: 100% (2/2), completed with 2 local objects.
To github.com:juliensoudet/data-threshold-adjustments.git
   6ca8799..6e61e14  master -> master


# Threshold adjustment

👇 Find the decision threshold that guarantees 90% precision when predicting that a player will last 5 years or more as a professional. Save the threshold under the variable name `new_threshold`.

<details>
<summary>💡 Hint</summary>

- Make cross validated probability predictions with [`cross_val_predict`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_predict.html)
    
- Plug the probabilities into [`precision_recall_curve`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_curve.html) to generate precision scores at different thresholds

- Find out which threshold guarantees a precision of 0.9
      
</details>



In [18]:
# YOUR CODE HERE
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import cross_val_predict

In [19]:
# Create a Logistic Regression model
model = LogisticRegression()

# Get cross-validated probabilities
y_probs = cross_val_predict(model, X_scaled, y, cv=5, method='predict_proba')

# Compute precision-recall pairs for different thresholds
precision, recall, thresholds = precision_recall_curve(y, y_probs[:, 1])

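# Find the lowest threshold that reaches 90% precision.
# thresholds is sorted in increasing order and precision has one extra
# trailing element, so zip stops at the last threshold.
for thr, prec in zip(thresholds, precision):
    if prec >= 0.9:
        new_threshold = thr
        break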

print("New Threshold for 90% Precision:", new_threshold)


New Threshold for 90% Precision: 0.8666405182816544
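The same search can be written without an explicit loop, and the result can be checked by re-applying the threshold to the probabilities. This is a sketch on synthetic data, not the notebook's own method: `X_demo`, `y_demo` and `demo_threshold` are hypothetical names, and the data comes from `make_classification` rather than the player file.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, precision_score
from sklearn.model_selection import cross_val_predict

# Synthetic stand-in for X_scaled and y (not the player data)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=5, n_informative=3, n_redundant=0,
    n_clusters_per_class=1, class_sep=1.5, flip_y=0.0, random_state=0,
)

# Out-of-fold probabilities of the positive class
probs = cross_val_predict(
    LogisticRegression(max_iter=1000), X_demo, y_demo, cv=5, method="predict_proba"
)[:, 1]

precision, recall, thresholds = precision_recall_curve(y_demo, probs)

# precision has one more entry than thresholds; drop the last one to align them
candidates = np.flatnonzero(precision[:-1] >= 0.9)
if candidates.size == 0:
    raise ValueError("no threshold reaches 90% precision")
demo_threshold = thresholds[candidates[0]]

# Re-apply the threshold and confirm the precision it yields
y_pred = (probs >= demo_threshold).astype(int)
achieved = precision_score(y_demo, y_pred)
print(demo_threshold, achieved)
```

Picking the lowest qualifying threshold keeps recall as high as possible while still meeting the precision target. The explicit check for an empty candidate list avoids silently returning a threshold when the target is unreachable.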


### ☑️ Check your code

In [20]:
from nbresult import ChallengeResult

result = ChallengeResult('decision_threshold',
                         threshold = new_threshold
)

result.write()
print(result.check())


platform darwin -- Python 3.10.6, pytest-7.1.3, pluggy-1.0.0 -- /Users/Laetitia/.pyenv/versions/lewagon/bin/python3
cachedir: .pytest_cache
rootdir: /Users/Laetitia/code/juliensoudet/05-ML/03-Performance-metrics/data-threshold-adjustments/tests
plugins: asyncio-0.19.0, typeguard-2.13.3, anyio-3.6.2
asyncio: mode=strict
collecting ... collected 1 item

test_decision_threshold.py::TestDecision_threshold::test_new_threshold PASSED [100%]



💯 You can commit your code:

git add tests/decision_threshold.pickle

git commit -m 'Completed decision_threshold step'

git push origin master



In [21]:
! git add tests/decision_threshold.pickle

! git commit -m 'Completed decision_threshold step'

! git push origin master

[master 90f26b7] Completed decision_threshold step
 1 file changed, 0 insertions(+), 0 deletions(-)
 create mode 100644 tests/decision_threshold.pickle
Enumerating objects: 6, done.
Counting objects: 100% (6/6), done.
Delta compression using up to 4 threads
Compressing objects: 100% (4/4), done.
Writing objects: 100% (4/4), 531 bytes | 531.00 KiB/s, done.
Total 4 (delta 2), reused 0 (delta 0), pack-reused 0 (from 0)
remote: Resolving deltas: 100% (2/2), completed with 2 local objects.
To github.com:juliensoudet/data-threshold-adjustments.git
   6e61e14..90f26b7  master -> master


# Using the new threshold

🎯 The coach has spotted a potentially interesting player, but wants your 90% guarantee that he would last at least 5 years as a pro. Download the player's data [here](https://wagon-public-datasets.s3.amazonaws.com/Machine%20Learning%20Datasets/ML_New_player.csv).

In [23]:
new_player = pd.read_csv("data/ML_New_player.csv")

new_player

Unnamed: 0,games played,minutes played,points per game,field goals made,field goal attempts,field goal percent,3 point made,3 point attempt,3 point %,free throw made,free throw attempts,free throw %,offensive rebounds,defensive rebounds,rebounds,assists,steals,blocks,turnovers
0,80,31.4,14.3,5.9,11.1,52.5,0.0,0.1,11.1,2.6,3.9,65.4,3.0,5.0,8.0,2.4,1.1,0.8,2.2


❓ Would you risk recommending the player to the coach? Save your answer as a string under the variable name `recommendation`, as either "recommend" or "not recommend".

In [28]:
# YOUR CODE HERE
# Scale the new player with the scaler already fitted on the training features
# (re-fitting on other data would change the scaling the model was trained with)
new_player_scaled = pd.DataFrame(scaler.transform(new_player), columns=new_player.columns)

# Fit a Logistic Regression model on the full scaled training set
model = LogisticRegression()
model.fit(X_scaled, y)

# Probability that the new player lasts 5 years or more
proba = model.predict_proba(new_player_scaled)[:, 1][0]

# Recommend only if the probability clears the 90%-precision threshold found above
recommendation = "recommend" if proba >= new_threshold else "not recommend"

print("Probability of lasting 5 years or more:", proba)
print("Recommendation:", recommendation)
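The overall flow for a single new observation can be sketched as follows: fit the scaler and the model on the training data only, transform the new row with that same scaler, then compare its positive-class probability with the chosen threshold. Everything here is illustrative: the data is synthetic, `x_new` is a made-up row, and `demo_threshold = 0.87` is a placeholder for the notebook's `new_threshold`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import RobustScaler

# Synthetic stand-in for the training features and target (not the player data)
X_train, y_train = make_classification(n_samples=300, n_features=5, random_state=0)
# Hypothetical new observation with the same 5 features
x_new = np.zeros((1, 5))

# Fit the scaler on the training data only, then reuse it for the new row
demo_scaler = RobustScaler().fit(X_train)
demo_model = LogisticRegression(max_iter=1000).fit(demo_scaler.transform(X_train), y_train)

# Placeholder threshold; in the notebook this would be new_threshold
demo_threshold = 0.87

# Probability of the positive class for the new row, then the thresholded decision
proba = demo_model.predict_proba(demo_scaler.transform(x_new))[0, 1]
decision = "recommend" if proba >= demo_threshold else "not recommend"
print(proba, decision)
```

Calling `model.predict` directly would apply the default 0.5 cut-off, which is why the comparison against the threshold is done by hand on `predict_proba`.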

### ☑️ Check your code

In [None]:
from nbresult import ChallengeResult

result = ChallengeResult('recommendation',
                         recommendation = recommendation
)

result.write()
print(result.check())

# 🏁