# **NBA Career Prediction**
**Predict 5-Year Career Longevity for NBA Rookies**

## WEEK 1 - Data Preparation

This notebook reads the data used in the project, explores it, checks the distribution of the dependent variable, and prepares the data for modelling.

**The steps are:**
1. Read data
2. Explore dataset
3. Prepare data

## 1. Read data

**[1.1]** Import all modules needed

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from joblib import dump
from sklearn.model_selection import train_test_split

**[1.2]** Download train.csv and test.csv into the data/raw folder, then read them into the notebook

In [2]:
# Set index_col=0 so the Id column becomes the index instead of a feature
data = pd.read_csv("../data/raw/train.csv", index_col=0)
test = pd.read_csv("../data/raw/test.csv", index_col=0)
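
The effect of `index_col=0` can be checked on a small in-memory sample. This is a minimal sketch: the two rows and the reduced column set are made up for illustration and only mimic the layout of train.csv.

```python
import io

import pandas as pd

# Hypothetical two-row sample in the same layout as train.csv (values are made up).
csv_text = "Id,GP,MIN,TARGET_5Yrs\n101,80,24.3,1\n102,75,21.8,0\n"

# index_col=0 moves the first column (Id) into the index,
# so it is no longer available as a feature column.
sample = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(sample.index.name)
print(list(sample.columns))
```

Because the Id ends up in the index, it can still be used later to look up or drop individual rows.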

## 2. Explore Dataset

**[2.1]** Display first 5 rows of train and test dataframe

In [3]:
data.head()

Id,GP,MIN,PTS,FGM,FGA,FG%,3P Made,3PA,3P%,FTM,FTA,FT%,OREB,DREB,REB,AST,STL,BLK,TOV,TARGET_5Yrs
10556,80,24.3,7.8,3.0,6.4,45.7,0.1,0.3,22.6,2.0,2.9,72.1,2.2,2.0,3.8,3.2,1.1,0.2,1.6,1
5342,75,21.8,10.5,4.2,7.9,55.1,-0.3,-1.0,34.9,2.4,3.6,67.8,3.6,3.7,6.6,0.7,0.5,0.6,1.4,1
5716,85,19.1,4.5,1.9,4.5,42.8,0.4,1.2,34.3,0.4,0.6,75.7,0.6,1.8,2.4,0.8,0.4,0.2,0.6,1
13790,63,19.1,8.2,3.5,6.7,52.5,0.3,0.8,23.7,0.9,1.5,66.9,0.8,2.0,3.0,1.8,0.4,0.1,1.9,1
5470,63,17.8,3.7,1.7,3.4,50.8,0.5,1.4,13.7,0.2,0.5,54.0,2.4,2.7,4.9,0.4,0.4,0.6,0.7,1


**Note:** Columns 3P Made and 3PA contain negative values, which are not valid for these statistics.

In [4]:
test.head()

Id,GP,MIN,PTS,FGM,FGA,FG%,3P Made,3PA,3P%,FTM,FTA,FT%,OREB,DREB,REB,AST,STL,BLK,TOV
1,56,9.1,4.0,1.6,3.7,43.7,0.1,0.3,7.3,0.7,1.2,63.4,1.2,0.8,1.7,0.4,0.2,0.3,0.8
8194,43,19.3,10.1,3.7,8.1,46.0,0.6,1.7,35.1,1.8,2.5,75.3,0.5,0.9,1.5,3.5,0.6,-0.0,1.8
3,82,33.9,11.3,4.9,10.6,45.6,0.5,1.9,44.8,1.8,2.7,71.2,1.3,3.3,4.5,2.5,1.3,0.3,2.0
8196,86,44.7,18.8,6.8,15.9,42.9,0.5,1.8,13.5,4.5,6.3,70.9,1.5,3.2,5.0,4.1,0.9,0.1,3.6
8197,58,12.3,4.7,1.6,4.0,40.0,0.5,1.7,38.7,1.1,1.3,76.9,0.2,0.6,0.9,1.5,0.5,-0.4,0.9


**Note:** Column BLK contains negative values, which are not valid for this statistic.

**[2.2]** Find all columns containing negative values in data and test

In [5]:
(data < 0).any()

GP              True
MIN            False
PTS            False
FGM            False
FGA            False
FG%            False
3P Made         True
3PA             True
3P%             True
FTM            False
FTA            False
FT%             True
OREB           False
DREB           False
REB            False
AST            False
STL            False
BLK             True
TOV            False
TARGET_5Yrs    False
dtype: bool

In [6]:
(test < 0).any()

GP         False
MIN        False
PTS        False
FGM        False
FGA        False
FG%        False
3P Made     True
3PA         True
3P%         True
FTM        False
FTA        False
FT%        False
OREB       False
DREB       False
REB        False
AST        False
STL        False
BLK         True
TOV        False
dtype: bool

**Note:** Columns containing negative values in both data and test: 3P Made, 3PA, 3P%, BLK

Columns containing negative values in data only: GP, FT%

**[2.3]** Further investigate negative values

In [7]:
data[data["GP"] < 0].index

Int64Index([1756, 7478], dtype='int64', name='Id')

In [8]:
data[data["FT%"] < 0].index

Int64Index([11294], dtype='int64', name='Id')

In [9]:
name_list = ["3P Made", "3PA", "3P%", "BLK"]
for name in name_list:
    print(f"No. of Negative values in column '{name}' in (data, test): ({len(data[data[name] < 0].index)}, {len(test[test[name] < 0].index)})")

No. of Negative values in column '3P Made' in (data, test): (1629, 775)
No. of Negative values in column '3PA' in (data, test): (1658, 773)
No. of Negative values in column '3P%' in (data, test): (878, 435)
No. of Negative values in column 'BLK' in (data, test): (1048, 456)
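
The per-column counts above can also be produced for every column at once. The sketch below uses two small made-up frames in place of `data` and `test`; the name `negative_counts` is introduced here for illustration only.

```python
import pandas as pd

# Toy frames standing in for `data` and `test` (values are made up).
toy_data = pd.DataFrame({"GP": [80, -5, 63], "3PA": [-1.0, 1.2, -0.3], "BLK": [0.2, 0.6, -0.4]})
toy_test = pd.DataFrame({"GP": [56, 43, 82], "3PA": [0.3, -1.7, 1.9], "BLK": [0.3, 0.0, 0.1]})

# (df < 0) is a boolean frame; summing it counts the negative entries per column.
negative_counts = pd.DataFrame({
    "data": (toy_data < 0).sum(),
    "test": (toy_test < 0).sum(),
})

# Keep only the columns that have at least one negative value in either frame.
negative_counts = negative_counts[(negative_counts > 0).any(axis=1)]
print(negative_counts)
```

A table like this makes it easier to compare how widespread the problem is in each column before deciding whether to drop rows or whole columns.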


## Data cleaning to be done:
    Drop columns: 3P Made, 3PA, 3P%, BLK (too many negative values in both the train and test CSV files).
        Reason: none of these values should be negative, so the columns are treated as invalid.
    Drop rows: 1756, 7478, 11294
        Reason: only 3 negative values in GP and FT% were found in the train set and none in the test set, so they are treated as input mistakes.
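
The cleaning plan above could be wrapped in one reusable function. This is a sketch under assumptions: `clean`, `INVALID_COLUMNS`, and `ROW_CHECK_COLUMNS` are hypothetical names that do not appear in the notebook, and the toy frame contains made-up values.

```python
import pandas as pd

INVALID_COLUMNS = ["3P Made", "3PA", "3P%", "BLK"]  # dropped from both sets
ROW_CHECK_COLUMNS = ["GP", "FT%"]                    # rows with negatives here are dropped


def clean(df: pd.DataFrame, drop_bad_rows: bool) -> pd.DataFrame:
    """Hypothetical helper: drop invalid columns and, optionally, rows with negative GP or FT%."""
    if drop_bad_rows:
        bad_rows = (df[ROW_CHECK_COLUMNS] < 0).any(axis=1)
        df = df[~bad_rows]
    return df.drop(columns=INVALID_COLUMNS)


# Toy frame with one clean row and two rows that should be removed (values are made up).
toy = pd.DataFrame(
    {"GP": [80, -5, 63], "FT%": [72.1, 67.8, -3.0], "3P Made": [0.1, -0.3, 0.4],
     "3PA": [0.3, -1.0, 1.2], "3P%": [22.6, 34.9, 34.3], "BLK": [0.2, 0.6, -0.2]},
    index=pd.Index([10, 11, 12], name="Id"),
)

cleaned = clean(toy, drop_bad_rows=True)
print(cleaned)
```

Calling the same helper with `drop_bad_rows=False` would match the treatment of the test set, where only the columns are removed.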

**[2.4]** Remove rows and columns

In [10]:
data_row_cleaned = data.drop(data[data["FT%"] < 0].index)
data_row_cleaned = data_row_cleaned.drop(data_row_cleaned[data_row_cleaned["GP"] < 0].index)

In [11]:
data_column_cleaned = data_row_cleaned.drop(['3P Made', '3PA', '3P%', 'BLK'], axis = 1)
test_column_cleaned = test.drop(['3P Made', '3PA', '3P%', 'BLK'], axis = 1)

**[2.5]** Display dimensions of train and test dataframe

In [12]:
data_column_cleaned.shape

(7997, 16)

In [13]:
test_column_cleaned.shape

(3799, 15)

**[2.6]** Display summary of train and test dataframe

In [14]:
data_column_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7997 entries, 10556 to 2900
Data columns (total 16 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   GP           7997 non-null   int64  
 1   MIN          7997 non-null   float64
 2   PTS          7997 non-null   float64
 3   FGM          7997 non-null   float64
 4   FGA          7997 non-null   float64
 5   FG%          7997 non-null   float64
 6   FTM          7997 non-null   float64
 7   FTA          7997 non-null   float64
 8   FT%          7997 non-null   float64
 9   OREB         7997 non-null   float64
 10  DREB         7997 non-null   float64
 11  REB          7997 non-null   float64
 12  AST          7997 non-null   float64
 13  STL          7997 non-null   float64
 14  TOV          7997 non-null   float64
 15  TARGET_5Yrs  7997 non-null   int64  
dtypes: float64(14), int64(2)
memory usage: 1.0 MB


In [15]:
test_column_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3799 entries, 1 to 8183
Data columns (total 15 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   GP      3799 non-null   int64  
 1   MIN     3799 non-null   float64
 2   PTS     3799 non-null   float64
 3   FGM     3799 non-null   float64
 4   FGA     3799 non-null   float64
 5   FG%     3799 non-null   float64
 6   FTM     3799 non-null   float64
 7   FTA     3799 non-null   float64
 8   FT%     3799 non-null   float64
 9   OREB    3799 non-null   float64
 10  DREB    3799 non-null   float64
 11  REB     3799 non-null   float64
 12  AST     3799 non-null   float64
 13  STL     3799 non-null   float64
 14  TOV     3799 non-null   float64
dtypes: float64(14), int64(1)
memory usage: 474.9 KB


**[2.7]** Display descriptive data of train and test dataframe

In [16]:
data_column_cleaned.describe()

Statistic,GP,MIN,PTS,FGM,FGA,FG%,FTM,FTA,FT%,OREB,DREB,REB,AST,STL,TOV,TARGET_5Yrs
count,7997.0,7997.0,7997.0,7997.0,7997.0,7997.0,7997.0,7997.0,7997.0,7997.0,7997.0,7997.0,7997.0,7997.0,7997.0,7997.0
mean,62.796299,18.581456,7.269138,2.807815,6.232875,44.610329,1.392935,1.948293,71.377529,1.077954,2.168951,3.245892,1.625084,0.648868,1.258022,0.833688
std,17.087419,8.933509,4.318241,1.693213,3.584202,6.154805,0.926075,1.252302,10.38897,0.785676,1.392218,2.085153,1.355918,0.407592,0.723257,0.372384
min,1.0,2.9,0.8,0.3,0.8,21.3,0.0,0.0,13.5,0.0,0.2,0.3,0.0,0.0,0.1,0.0
25%,51.0,12.0,4.1,1.6,3.6,40.4,0.7,1.0,65.0,0.5,1.1,1.7,0.7,0.3,0.7,1.0
50%,63.0,16.8,6.3,2.4,5.4,44.4,1.2,1.7,71.4,0.9,1.9,2.8,1.3,0.6,1.1,1.0
75%,74.0,23.5,9.5,3.7,8.1,48.7,1.9,2.6,77.5,1.5,2.9,4.3,2.2,0.9,1.6,1.0
max,123.0,73.8,34.2,13.1,28.9,67.2,8.1,11.1,168.9,5.5,11.0,15.9,12.8,3.6,5.3,1.0


In [17]:
test_column_cleaned.describe()

Statistic,GP,MIN,PTS,FGM,FGA,FG%,FTM,FTA,FT%,OREB,DREB,REB,AST,STL,TOV
count,3799.0,3799.0,3799.0,3799.0,3799.0,3799.0,3799.0,3799.0,3799.0,3799.0,3799.0,3799.0,3799.0,3799.0,3799.0
mean,62.853909,18.650224,7.328034,2.835404,6.30258,44.599079,1.399842,1.953567,71.612924,1.096025,2.179495,3.275783,1.636483,0.653593,1.25791
std,17.15174,8.727259,4.294724,1.688427,3.579221,6.040168,0.92614,1.250376,10.457336,0.785678,1.371935,2.070646,1.335496,0.410573,0.712449
min,6.0,3.7,0.7,0.3,0.8,25.1,0.0,0.0,23.7,0.0,0.2,0.3,0.0,0.0,0.1
25%,51.0,12.2,4.2,1.6,3.7,40.5,0.7,1.0,65.0,0.5,1.2,1.8,0.6,0.4,0.7
50%,63.0,17.0,6.4,2.5,5.5,44.6,1.2,1.7,71.5,0.9,1.9,2.8,1.3,0.6,1.1
75%,74.0,23.3,9.4,3.7,8.1,48.5,1.9,2.6,78.0,1.5,2.9,4.3,2.3,0.9,1.6
max,126.0,68.0,33.0,13.4,26.2,74.6,7.8,9.8,127.1,6.9,12.0,18.5,9.0,2.7,5.2
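
The summaries above also show maxima that look implausible for per-game statistics, such as an FT% of 168.9 and a GP of 123 in the train set. A simple range check could flag such values. The upper bounds in this sketch are assumptions (82 regular-season games, 48 minutes in a regulation game, percentages capped at 100), not limits defined in the notebook, and the toy frame is made up.

```python
import pandas as pd

# Assumed plausible upper bounds; these are not taken from the notebook.
UPPER_BOUNDS = {"GP": 82, "MIN": 48, "FG%": 100, "FT%": 100}

# Toy frame with one deliberately out-of-range row (values are made up).
toy = pd.DataFrame({
    "GP": [80, 123, 63],
    "MIN": [24.3, 73.8, 19.1],
    "FG%": [45.7, 55.1, 42.8],
    "FT%": [72.1, 168.9, 75.7],
})

bounds = pd.Series(UPPER_BOUNDS)

# Compare each column with its own bound, then count the violations per column.
out_of_range = toy[bounds.index].gt(bounds, axis="columns").sum()
print(out_of_range)
```

Whether such rows should be dropped, clipped, or kept is a modelling decision that this check does not make.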


**[2.8]** Check distribution of target variable

In [18]:
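# normalize=True divides by the number of rows in the cleaned frame (7997), not the original one
data_column_cleaned['TARGET_5Yrs'].value_counts(normalize=True)

1    0.833688
0    0.166312
Name: TARGET_5Yrs, dtype: float64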

The classes are imbalanced (about 83% of players are in the positive class), so SMOTE, oversampling, or another rebalancing method should be considered.
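
One of the "other methods" is class weighting, which leaves the data unchanged and instead penalises errors on the minority class more heavily. The sketch below is not what the notebook does; it only illustrates how balanced weights could be computed with scikit-learn on a made-up label vector with a similar 5:1 imbalance.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels with roughly the same 5:1 imbalance as TARGET_5Yrs (values are made up).
y_toy = np.array([1, 1, 1, 1, 1, 0])

classes = np.unique(y_toy)

# "balanced" weights are n_samples / (n_classes * count_per_class).
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_toy)
class_weight = {int(c): float(w) for c, w in zip(classes, weights)}
print(class_weight)
```

Many scikit-learn classifiers accept a dictionary like this through their `class_weight` parameter, or the string `"balanced"` directly.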

## 3. Prepare data

**[3.1]** Create a copy of data_column_cleaned and save it into a variable called data_cleaned

In [19]:
data_cleaned = data_column_cleaned.copy()

**[3.2]** Remove leading and trailing space from the column names

In [20]:
data_cleaned.columns = data_cleaned.columns.str.strip()

**[3.3]** Extract the column `TARGET_5Yrs` and save it into variable called target

In [21]:
target = data_cleaned.pop('TARGET_5Yrs')

**[3.4]** Split the data randomly with random_state=0 into 2 different sets, training (80%) and validation (20%), stratified on the target

In [22]:
X_train, X_val, y_train, y_val = train_test_split(data_cleaned, target, test_size=0.2, random_state=0, stratify=target)
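
Passing `stratify=target` keeps the class proportions the same in both sets, which matters with an imbalanced target. A minimal sketch on made-up data (100 rows, 80% positive) shows the effect; the `_toy` names are introduced here for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 100 rows with an 80/20 class split (values are made up).
X_toy = pd.DataFrame({"GP": np.arange(100), "MIN": np.linspace(5, 40, 100)})
y_toy = pd.Series([1] * 80 + [0] * 20, name="TARGET_5Yrs")

X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy
)

# Set sizes, then the share of the positive class in each set.
print(len(X_tr), len(X_va))
print(y_tr.mean(), y_va.mean())
```

Without `stratify`, the validation set could end up with noticeably more or fewer minority-class rows than the training set purely by chance.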

**[3.5]** Instantiate the StandardScaler

In [23]:
scaler = StandardScaler()

**[3.6]** Fit the scaler on the training set, then apply it to the validation and test sets

In [24]:
X_train = scaler.fit_transform(X_train)
X_val = scaler.transform(X_val)
test = scaler.transform(test_column_cleaned)
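
The scaler learns its mean and standard deviation from the training set only, and the validation and test sets are transformed with those same statistics. The sketch below illustrates this on randomly generated matrices; `toy_scaler`, `train_part`, and `val_part` are hypothetical names.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Toy feature matrices standing in for X_train and X_val (values are made up).
train_part = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
val_part = rng.normal(loc=50.0, scale=10.0, size=(50, 3))

toy_scaler = StandardScaler()
train_scaled = toy_scaler.fit_transform(train_part)  # learns mean_ and scale_ from train only
val_scaled = toy_scaler.transform(val_part)          # reuses the training statistics

# Training columns end up with mean ~0 and standard deviation ~1.
print(train_scaled.mean(axis=0).round(6), train_scaled.std(axis=0).round(6))
# Validation columns are close to, but not exactly, mean 0.
print(val_scaled.mean(axis=0).round(3))
```

Fitting on the training set alone avoids leaking information from the validation and test sets into the preprocessing step.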

**[3.7]** Save the scaler into the folder models and call the file scaler.joblib

In [25]:
dump(scaler, '../models/scaler.joblib')

['../models/scaler.joblib']

**[3.8]** Save the different sets in the folder `data/processed`

In [26]:
np.save('../data/processed/data_cleaned', data_cleaned)
np.save('../data/processed/target', target)

np.save('../data/processed/X_train', X_train)
np.save('../data/processed/y_train', y_train)

np.save('../data/processed/X_val', X_val)
np.save('../data/processed/y_val', y_val)

np.save('../data/processed/test', test)
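
A later notebook would need to read these artifacts back. The sketch below shows one way to do that with `joblib.load` and `np.load`; it writes to a temporary directory instead of `../models` and `../data/processed`, and the tiny matrix and `toy_scaler` are made up for illustration.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.preprocessing import StandardScaler

# Toy stand-ins for X_train and the fitted scaler (values are made up).
X_toy = np.array([[1.0, 10.0], [3.0, 30.0]])
toy_scaler = StandardScaler().fit(X_toy)

# A temporary directory stands in for ../models and ../data/processed.
with tempfile.TemporaryDirectory() as tmp_dir:
    dump(toy_scaler, os.path.join(tmp_dir, "scaler.joblib"))
    # np.save appends the .npy extension when the name does not already have it.
    np.save(os.path.join(tmp_dir, "X_train"), toy_scaler.transform(X_toy))

    saved_files = sorted(os.listdir(tmp_dir))
    restored_scaler = load(os.path.join(tmp_dir, "scaler.joblib"))
    restored_X = np.load(os.path.join(tmp_dir, "X_train.npy"))

print(saved_files)
print(restored_scaler.mean_, restored_scaler.scale_)
print(restored_X)
```

Note that `np.save` stores DataFrames such as `data_cleaned` and `target` as plain arrays, so the column names and the Id index are not kept in the saved files.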