# Group Name: The Big One

## Group Members: Nicholas Parker, Matthew King, Sean Sturtevant

### Dataset: Predict NHL Player Salaries

1) Ask
----

Can we accurately predict the salary of an NHL player for the 2016-2017 season?

2) Acquire
----

Link to data and data dictionary can be found [here](https://www.kaggle.com/camnugent/predict-nhl-player-salaries#train.csv).

The dataset contains 874 records of NHL players and 151 features.

3) Process
----

In [2]:
reset -fs

In [3]:
import pandas as pd
import numpy as np

##### Direct Feature Engineering

In [4]:
hockey_train = pd.read_csv('./data/clean/train.csv', encoding = "ISO-8859-1")

hockey_test = pd.read_csv('./data/clean/test.csv', encoding = "ISO-8859-1")

hockey_test_y = pd.read_csv('./data/clean/test_salaries.csv', encoding = "ISO-8859-1")

In [5]:
def combine_train_and_test(train_df, test_df, test_response):
    """
    Combine the train and test datasets that were previously split at the source of the data.
    """
    test_df = pd.concat([test_df, test_response], axis = 1)
    return pd.concat([train_df, test_df],ignore_index = True, sort = False)

hockey = combine_train_and_test(hockey_train, hockey_test, hockey_test_y)
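As a rough illustration of what the combine step does, the sketch below runs the same two `pd.concat` calls on tiny made-up frames. The toy column values and the `toy_*` names are assumptions for illustration only, not rows from the dataset. The column-wise concat aligns on the index, so it relies on the test features and test salaries sharing the same row order.

```python
import pandas as pd

# Hypothetical stand-ins: train has the response, test features and
# test response arrive as separate frames.
toy_train = pd.DataFrame({"Salary": [925000, 2250000], "GP": [82, 70]})
toy_test = pd.DataFrame({"GP": [60]})
toy_test_y = pd.DataFrame({"Salary": [575000]})

# Attach the response to the test features column-wise (aligned on index),
# then stack the rows and rebuild a clean 0..n-1 index.
toy_test_full = pd.concat([toy_test, toy_test_y], axis=1)
toy_combined = pd.concat([toy_train, toy_test_full], ignore_index=True, sort=False)

print(toy_combined)
```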

In [6]:
def nationality_group(df, nationalityCol):
    """
    Reduces the number of values in the Nationality column through binning.
    """
    # Feature-engineer the nationality column:
    # bin it from 16 unique values down to 5 to reduce the risk of overfitting
    scandanavianNations = ['SWE','NOR','FIN']
    otherNations = ['CHE','CZE','FRA','DEU','SVK','AUT','DNK','LVA','HRV','GBR','SVN']
    df.loc[(df[nationalityCol].isin(scandanavianNations)), nationalityCol] = 'Scandanavian'
    df.loc[(df[nationalityCol].isin(otherNations)), nationalityCol] = 'Other'
    return df

hockey = nationality_group(hockey, 'Nat')

In [7]:
# Code used to group and remove provinces and states that are only seen a few times
# Useful to prevent overfitting to values only observed a few times

extraneousStates = ['AK', 'AL', 'AZ', 'CO', 'CT', 'FL', 'IN', 'ME', 'MO', 'NC'
                    , 'ND', 'NE', 'NH', 'NJ', 'NL', 'NS', 'OH', 'OK', 'PA'
                    , 'PE', 'RI', 'SC', 'TX', 'UT', 'WA']

hockey.loc[(hockey['Pr/St'].isin(extraneousStates)),'Pr/St'] = 'Other'

# Replace the time variable 'Born' with a derived 'Age' (in 2017), using the two-digit birth year
hockey['Age'] = 117 - pd.to_numeric(hockey['Born'].str[0:2])
hockey.loc[(hockey['Age'] > 99), 'Age'] = hockey['Age'] - 100
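The age calculation can be traced on a few made-up values. This sketch assumes, as the `.str[0:2]` slice suggests, that `Born` holds strings beginning with a two-digit year, and that 117 stands for the year 2017. The sample dates are hypothetical.

```python
import pandas as pd

# Hypothetical birth dates in an assumed 'YY-MM-DD' format
born = pd.Series(["97-01-30", "72-02-15", "00-06-01"])

yy = pd.to_numeric(born.str[0:2])       # two-digit birth year
age = 117 - yy                          # years elapsed, counting from 1900
# A birth year in the 2000s produces a value above 99; wrap it back down.
age = age.where(age <= 99, age - 100)

print(age.tolist())
```

The wrap-around only works while no player is 100 or older, which is a safe assumption for active NHL players.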

In [8]:
# Adding isNa Cols
# These columns are useful to account for missing data
def addIsNACol(df, col_name):
    """
    Add columns that indicate whether a record had missing data for specified features.
    """
    na_col_name = col_name + '_is_na'
    df[na_col_name] = 0
    df.loc[(df[col_name].isna()), na_col_name] = 1
    return df

hockey = addIsNACol(hockey, 'DftYr')
hockey = addIsNACol(hockey, 'iCF')
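For reference, the same indicator column can be built in one expression. This is only an equivalent sketch on a made-up frame (the `toy` name and values are assumptions), not a replacement for the function above.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing draft year
toy = pd.DataFrame({"DftYr": [2010.0, np.nan, 2014.0]})

# isna() yields booleans; casting to int gives the 0/1 indicator
toy["DftYr_is_na"] = toy["DftYr"].isna().astype(int)

print(toy)
```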

##### Save Processed Data to be used by Model Pipeline
Further work on the columns (excluding variables and imputing missing values) is done in the model pipeline.

In [9]:
# hockey.to_csv('./data/processed/hockey.csv', index= False)

4) Exploratory Data Analysis
----

5) Models
----

In [10]:
from sklearn.model_selection import RandomizedSearchCV
from sklearn import linear_model
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn import compose
from sklearn.experimental import enable_iterative_imputer
from sklearn import impute
from sklearn import preprocessing
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import make_scorer
from sklearn.model_selection import cross_val_score

In [11]:
def add_commas(number):
    """
    Adds commas to values greater than 1,000 for evaluation metrics.
    """
    return ("{:,}".format(number))
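A small aside on formatting: the `"{:,}"` spec keeps however many decimals the rounded float happens to have, which is why some printed values in this notebook end in a single decimal. If a fixed two-decimal display is preferred, one option (a sketch, not what the notebook uses) is the `,.2f` format spec.

```python
value = 525000.0

as_in_notebook = "{:,}".format(value)      # thousands separators only
fixed_decimals = "{:,.2f}".format(value)   # separators plus two decimals

print(as_in_notebook)
print(fixed_decimals)
```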

In [12]:
def mape_metric(y_test, y_pred):
    """
    Calculates the mean absolute percentage error.
    """
    y_test, y_pred = np.array(y_test), np.array(y_pred)
    n = len(y_test)
    running_sum = 0
    for i in range(n):
        running_sum += abs((y_test[i] - y_pred[i])/y_test[i])
    return running_sum/n
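The loop above computes the mean of |y - ŷ| / |y| over all records. A vectorized NumPy version is sketched below; `mape_vectorized` is a hypothetical name introduced here for illustration. Like the loop, it is undefined when a true value is zero.

```python
import numpy as np

def mape_vectorized(y_true, y_pred):
    """Mean absolute percentage error as a fraction (multiply by 100 for %)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Element-wise relative error, then the average over all records
    return np.mean(np.abs((y_true - y_pred) / y_true))

# Two made-up records: 10% and 25% off, so the mean is 17.5%
example = mape_vectorized([100.0, 200.0], [110.0, 150.0])
print(example)
```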

In [13]:
# Create train, test, and validation sets of the data.
y = hockey['Salary']
X = hockey.drop('Salary', axis=1)

X_train, X_test, y_train, y_test = train_test_split(X, y) # Split to obtain the test set
X_train, X_validation, y_train, y_validation = train_test_split(X_train, y_train) # Split to obtain the validation set
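Because `train_test_split` defaults to holding out 25% of the rows, the two consecutive splits leave roughly 56% of the data for training, 19% for validation, and 25% for testing. The sketch below checks this on synthetic arrays; the array sizes and the `random_state` values are assumptions added for reproducibility and are not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 800 rows, 2 features
X_toy = np.arange(1600).reshape(800, 2)
y_toy = np.arange(800)

# First split holds out the test set, second split holds out the validation set
X_rest, X_te, y_rest, y_te = train_test_split(X_toy, y_toy, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X_rest, y_rest, random_state=0)

print(len(X_tr), len(X_val), len(X_te))
```

Fixing `random_state` also makes the reported metrics repeatable from run to run.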

### Baseline Model

In [14]:
# Create separate lists of numeric and categorical features to be passed through the pipeline.
numeric_baseline = [feature for feature in X_train.columns if np.issubdtype(X_train[feature], np.number)]
categorical_baseline = [feature for feature in X_train.columns if feature not in numeric_baseline]

In [15]:
def make_pipeline_ridge(regressor=None):
    """
    Creates pipeline to perform transformations on numeric and categorical features 
    and pass into a Ridge Regression Model.
    """
    numeric_features = numeric_baseline
    numeric_transformer = Pipeline(steps=[
        ('imputer', impute.SimpleImputer(strategy='median')),
        ('scaler', preprocessing.StandardScaler())])

    categorical_features = categorical_baseline
    categorical_transformer = Pipeline(steps=[
        ('imputer', impute.SimpleImputer(strategy='constant', fill_value='unknown')),
        ('onehot', preprocessing.OneHotEncoder(handle_unknown='ignore'))])

    preprocessor = compose.ColumnTransformer(transformers=[
        ('numerical', numeric_transformer, numeric_features),
        ('categorical', categorical_transformer, categorical_features)])

    pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                               ('regressor', regressor)])
    
    return pipeline

regressor = linear_model.Ridge(alpha=100, tol=0.001)
pipeline = make_pipeline_ridge(regressor)
pipeline.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('preprocessor',
                 ColumnTransformer(n_jobs=None, remainder='drop',
                                   sparse_threshold=0.3,
                                   transformer_weights=None,
                                   transformers=[('numerical',
                                                  Pipeline(memory=None,
                                                           steps=[('imputer',
                                                                   SimpleImputer(add_indicator=False,
                                                                                 copy=True,
                                                                                 fill_value=None,
                                                                                 missing_values=nan,
                                                                                 strategy='median',
                                                       

##### Baseline Evaluation

In [16]:
median_absolute_error_scorer = make_scorer(metrics.median_absolute_error)
cross_val_score(pipeline, 
                X_train, 
                y_train, 
                scoring=median_absolute_error_scorer,
                cv=10)

array([602094.73735785, 542646.39431729, 708999.37496388, 805774.7388271 ,
       713018.45145678, 793303.18126844, 693651.42613369, 616909.51806132,
       606597.7016723 , 560432.85050831])
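The ten fold scores are easier to compare across models once condensed into a mean and a spread. The sketch below does that for the values copied from the output above; the variable names are introduced here for illustration.

```python
import numpy as np

# Fold-level median absolute errors copied from the cross_val_score output
fold_medae = np.array([602094.73735785, 542646.39431729, 708999.37496388,
                       805774.7388271, 713018.45145678, 793303.18126844,
                       693651.42613369, 616909.51806132, 606597.7016723,
                       560432.85050831])

mean_medae = fold_medae.mean()
std_medae = fold_medae.std(ddof=1)   # sample standard deviation across folds

print(f"${mean_medae:,.2f} +/- ${std_medae:,.2f} medae across {len(fold_medae)} folds")
```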

Median Absolute Error

In [17]:
y_pred = pipeline.predict(X_train)
medae_value_train = add_commas(round(metrics.median_absolute_error(y_train, y_pred), 2))
print(f"${medae_value_train} medae on train dataset")

y_pred = pipeline.predict(X_validation)
medae_value_validation = add_commas(round(metrics.median_absolute_error(y_validation, y_pred), 2))
print(f"${medae_value_validation} medae on validation dataset")

$543,672.63 medae on train dataset
$672,986.27 medae on validation dataset


Root Mean Squared Error

In [18]:
y_pred = pipeline.predict(X_train)
rmse_value_train = add_commas(round(np.sqrt(metrics.mean_squared_error(y_train, y_pred)), 2))
print(f"{rmse_value_train} rmse on train dataset")

y_pred = pipeline.predict(X_validation)
rmse_value_validation = add_commas(round(np.sqrt(metrics.mean_squared_error(y_validation, y_pred)), 2))
print(f"{rmse_value_validation} rmse on validation dataset")

1,179,861.82 rmse on train dataset
1,366,249.95 rmse on validation dataset


Mean Absolute Percentage Error

In [19]:
y_pred = pipeline.predict(X_train)
mape_value_train = round(mape_metric(y_train, y_pred)*100, 2)
print(f"{mape_value_train}% mape on train dataset")

y_pred = pipeline.predict(X_validation)
mape_value_validation = round(mape_metric(y_validation, y_pred)*100, 2)
print(f"{mape_value_validation}% mape on validation dataset")

58.27% mape on train dataset
64.85% mape on validation dataset


### Model Selection Process

### KNN Model

In [20]:
nonpredictive_features = ['ENG', 'Wide', 'Over', 'PSG', 'PSA', 'S.Dflct', 'G.Bkhd', 'Post', 'G.Dflct', 'CBar ', 'G.Slap', 'G.Snap', 'G.Wrst', 'G.Wrap', 'G.Tip', 'S.Bkhd', 'Min', 'S.Slap', 'Misc', 'Noise', 'DAP', 'Grit', 'PS', 'DPS', 'OPS', 'DSA', 'DSF', 'Game', 'Match', 'S.Snap', 'Maj', '1G', 'NPD', 'iPenDf', 'iPenD', 'iPenT', 'S.Wrst', 'S.Wrap', 'S.Tip', 'GWG', 'FOL.Down', 'OTG', 'PIM', 'iSF.1', 'iCF.1', 'Diff', 'Pct%', 'FOL.Close', 'TOI/GP.1', 'TOI/GP', 'TOI', 'Shifts', 'E+/-', 'sDist', '+/-', 'PTS', 'A2', 'A1', 'A', 'G', 'GP', 'Wt', 'Ht', 'iSF.2', 'Age', 'iFOW', 'iBLK', 'iFOL', 'dzFOL', 'nzFOW', 'nzFOL', 'ozFOW', 'ozFOL', 'dzFOW', 'FOL.Up', 'FOW.Up', 'iTKA', 'iGVA', 'iMiss', 'FOW.Down', 'iHF', 'FOW.Close', 'FO%', 'Position', 'Team', 'PEND', 'TOIX', 'GA', 'xGA', 'iSCF', 'iPEND', 'sDist.1', 'iHF.1', 'PDO', 'Hand', 'SCA', 'iTKA.1', 'SA', 'IPP%', 'ixG', 'FA', 'Pace', 'iGVA.1', 'SV%', 'RBF', 'PENT', 'F/60', 'GVA', 'TKA', 'FOW', 'Diff/60']

# For KNN, we only want to include the numeric variables.
# At the moment we do not want to consider some of the categorical 
# variables since KNN will not handle these naturally.
numeric_knn = [feature for feature in X_train.columns if np.issubdtype(X_train[feature], np.number) 
                      and feature not in nonpredictive_features]

To fit the KNN model, the pipeline must first put the data into a form that KNN can handle.

In [21]:
def make_pipeline_knn(knn=None):
    """
    Creates pipeline that performs separate transformations on the categorical and numerical features
    for a KNN algorithm.
    """
    
    # All predictive numeric features
    numeric_features = numeric_knn
    numeric_transformer = Pipeline(steps=[
        # Impute any missing values with the median.
        ('imputer', impute.SimpleImputer(strategy='median')),
        # KNN compares points by Euclidean distance, so the inputs must be
        # rescaled. Note that Normalizer rescales each sample (row) to unit
        # norm; it does not put the features (columns) on a common scale,
        # which a per-feature scaler such as StandardScaler would do.
        ('normalizer', preprocessing.Normalizer())])

    preprocessor = compose.ColumnTransformer(transformers=[
        ('numerical', numeric_transformer, numeric_features)])

    pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                               ('knn', knn)])
    
    return pipeline

knn = KNeighborsRegressor()
pipeline_knn = make_pipeline_knn(knn)
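It is worth being explicit about what the `Normalizer` step does, since it is easy to confuse with feature scaling. `Normalizer` rescales each row to unit length, whereas `StandardScaler` centers and scales each column. The sketch below contrasts the two on a made-up matrix; the values are hypothetical, and `StandardScaler` is shown only for comparison, not as the notebook's choice.

```python
import numpy as np
from sklearn import preprocessing

# Two hypothetical features on very different scales; rows 0 and 1 are proportional
X_toy = np.array([[3.0, 4000.0],
                  [6.0, 8000.0],
                  [9.0, 1000.0]])

row_normalized = preprocessing.Normalizer().fit_transform(X_toy)
col_standardized = preprocessing.StandardScaler().fit_transform(X_toy)

print(np.linalg.norm(row_normalized, axis=1))  # every row has unit L2 norm
print(col_standardized.mean(axis=0))           # every column is centered
print(col_standardized.std(axis=0))            # every column has unit variance
```

After row normalization the first two rows become identical, because only the direction of each row survives; the large-valued column still dominates that direction.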

In [22]:
def make_random_cv_knn():
    """
    Define hyperparameter search space for KNN algorithm
    Instantiate RandomizedSearchCV with the pipeline.
    """
    
    algo = ['ball_tree', 'kd_tree', 'auto']
    weights = ['distance', 'uniform']
    neighbors = [8, 9, 10, 15]
    hyperparameters = dict(knn__algorithm=algo,
                          knn__n_neighbors=neighbors,
                          knn__weights=weights)
    
    reg_random_cv = RandomizedSearchCV(pipeline_knn, 
                                       hyperparameters, 
                                       cv=5, 
                                       n_iter=15, 
                                       verbose=1,
                                       random_state=42)
    
    return reg_random_cv

model_knn = make_random_cv_knn()
model_knn.fit(X_train, y_train)

Fitting 5 folds for each of 15 candidates, totalling 75 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  75 out of  75 | elapsed:    0.8s finished


RandomizedSearchCV(cv=5, error_score='raise-deprecating',
                   estimator=Pipeline(memory=None,
                                      steps=[('preprocessor',
                                              ColumnTransformer(n_jobs=None,
                                                                remainder='drop',
                                                                sparse_threshold=0.3,
                                                                transformer_weights=None,
                                                                transformers=[('numerical',
                                                                               Pipeline(memory=None,
                                                                                        steps=[('imputer',
                                                                                                SimpleImputer(add_indicator=False,
                                                               

#### Median Absolute Error

In [23]:

median_absolute_error_scorer = make_scorer(metrics.median_absolute_error)
cross_val_score(model_knn.best_estimator_, 
                X_train, 
                y_train, 
                scoring=median_absolute_error_scorer,
                cv=10)

array([632000. , 492812.5, 912500. , 537812.5, 649062.5, 915312.5,
       505312.5, 593750. , 640625. , 478750. ])

In [24]:
knn_y_pred = model_knn.best_estimator_.predict(X_train)
medae_value_train = add_commas(round(metrics.median_absolute_error(y_train, knn_y_pred), 2))
print(f"${medae_value_train} medae on train dataset")

knn_y_pred = model_knn.best_estimator_.predict(X_validation)
medae_value_validation = add_commas(round(metrics.median_absolute_error(y_validation, knn_y_pred), 2))
print(f"${medae_value_validation} medae on validation dataset")

$525,000.0 medae on train dataset
$551,562.5 medae on validation dataset


#### Root Mean Squared Error

In [25]:
knn_y_pred = model_knn.best_estimator_.predict(X_train)
rmse_value_train = add_commas(round(np.sqrt(metrics.mean_squared_error(y_train, knn_y_pred)), 2))
print(f"{rmse_value_train} rmse on train dataset")

knn_y_pred = model_knn.best_estimator_.predict(X_validation)
rmse_value_validation = add_commas(round(np.sqrt(metrics.mean_squared_error(y_validation, knn_y_pred)), 2))
print(f"{rmse_value_validation} rmse on validation dataset")

1,471,890.52 rmse on train dataset
1,644,663.08 rmse on validation dataset


#### Mean Absolute Percentage Error

In [26]:
knn_y_pred = model_knn.best_estimator_.predict(X_train)
mape_value_train = round(mape_metric(y_train, knn_y_pred)*100, 2)
print(f"{mape_value_train}% mape on train dataset")

knn_y_pred = model_knn.best_estimator_.predict(X_validation)
mape_value_validation = round(mape_metric(y_validation, knn_y_pred)*100, 2)
print(f"{mape_value_validation}% mape on validation dataset")

60.24% mape on train dataset
60.73% mape on validation dataset


### Random Forest Model

In [27]:
# These are features which should have no effect on Salary, like First Name
illogical_features = ['Born', 'City', 'Last Name', 'First Name', 'Cntry']
# These are features which are repeated later on with updated stats
redundant_features = ['TOI/GP', 'iCF', 'iSF', 'iSF.1', 'sDist', 'iHF', 'iGVA', 'iTKA', 'iBLK', 'iFOW', 'iFOL']
# Features found to be non-predictive using Parrt's rfpimp package are dropped later,
# in the feature importance analysis

drop_list = illogical_features + redundant_features

numeric_rf = [feature for feature in X_train.columns if np.issubdtype(X_train[feature], np.number) 
                      and feature not in drop_list]
categorical_rf = [feature for feature in X_train.columns if feature not in numeric_rf
                       and feature not in drop_list]

In [28]:
def make_pipeline_rf(regressor=None):
    """
    Creates pipeline that performs separate transformations on the categorical and numerical features.
    """
    
    numeric_features = numeric_rf
    numeric_transformer = Pipeline(steps=[
        ('imputer', impute.SimpleImputer(strategy='median'))])

    categorical_features = categorical_rf
    categorical_transformer = Pipeline(steps=[
        ('imputer', impute.SimpleImputer(strategy='constant', fill_value='unknown')),
        ('onehot', preprocessing.OneHotEncoder(handle_unknown='ignore'))])

    preprocessor = compose.ColumnTransformer(transformers=[
        ('numerical', numeric_transformer, numeric_features),
        ('categorical', categorical_transformer, categorical_features)])

    pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                               ('regressor', regressor)])
    
    return pipeline

regressor_rf = RandomForestRegressor()
pipeline_rf = make_pipeline_rf(regressor_rf)

In [29]:
def make_random_cv():
    """
    Define hyperparameter search space
    Instantiate RandomizedSearchCV with the pipeline.
    """
    
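    # These must be real booleans: the strings 'True' and 'False' are both truthy.
    # oob_score=True is only valid with bootstrap=True, so bootstrap is held fixed.
    bootstrap = [True]
    oob_score = [True, False]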
    max_features = ['auto', 'sqrt']
    min_samples_leaf = [2, 3, 5, 6 , 7, 8, 9, 10]
    n_estimators = [100, 150, 200]
    hyperparameters = dict(regressor__min_samples_leaf=min_samples_leaf,
                          regressor__bootstrap=bootstrap,
                          regressor__max_features=max_features,
                          regressor__n_estimators=n_estimators,
                          regressor__oob_score=oob_score)
    reg_random_cv = RandomizedSearchCV(pipeline_rf, 
                                       hyperparameters, 
                                       cv=5, 
                                       n_iter=15, 
                                       verbose=1,
                                       random_state=42)
    
    return reg_random_cv

model_rf = make_random_cv()
model_rf.fit(X_train, y_train)

Fitting 5 folds for each of 15 candidates, totalling 75 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  75 out of  75 | elapsed:  1.2min finished


RandomizedSearchCV(cv=5, error_score='raise-deprecating',
                   estimator=Pipeline(memory=None,
                                      steps=[('preprocessor',
                                              ColumnTransformer(n_jobs=None,
                                                                remainder='drop',
                                                                sparse_threshold=0.3,
                                                                transformer_weights=None,
                                                                transformers=[('numerical',
                                                                               Pipeline(memory=None,
                                                                                        steps=[('imputer',
                                                                                                SimpleImputer(add_indicator=False,
                                                               

Cross Validation Scores for Best Estimator from Randomized Search

In [30]:
median_absolute_error_scorer = make_scorer(metrics.median_absolute_error)
cross_val_score(model_rf.best_estimator_, 
                X_train, 
                y_train, 
                scoring=median_absolute_error_scorer,
                cv=10)

array([400490.33588865, 354668.12169312, 709603.74218374, 421945.45214045,
       802128.06998557, 648167.45983646, 592347.8013468 , 526480.46296296,
       596238.6002886 , 321501.96705147])

Median Absolute Error

In [31]:
y_pred = model_rf.predict(X_train)
medae_value_train = add_commas(round(metrics.median_absolute_error(y_train, y_pred), 2))
print(f"${medae_value_train} medae on train dataset")

y_pred = model_rf.predict(X_validation)
medae_value_validation = add_commas(round(metrics.median_absolute_error(y_validation, y_pred), 2))
print(f"${medae_value_validation} medae on validation dataset")

$215,435.65 medae on train dataset
$476,427.61 medae on validation dataset


Root Mean Squared Error

In [32]:
y_pred = model_rf.predict(X_train)
rmse_value_train = add_commas(round(np.sqrt(metrics.mean_squared_error(y_train, y_pred)), 2))
print(f"{rmse_value_train} rmse on train dataset")

y_pred = model_rf.predict(X_validation)
rmse_value_validation = add_commas(round(np.sqrt(metrics.mean_squared_error(y_validation, y_pred)), 2))
print(f"{rmse_value_validation} rmse on validation dataset")

666,153.4 rmse on train dataset
1,322,807.98 rmse on validation dataset


Mean Absolute Percentage Error

In [33]:
y_pred = model_rf.predict(X_train)
mape_value_train = round(mape_metric(y_train, y_pred)*100, 2)
print(f"{mape_value_train}% mape on train dataset")

y_pred = model_rf.predict(X_validation)
mape_value_validation = round(mape_metric(y_validation, y_pred)*100, 2)
print(f"{mape_value_validation}% mape on validation dataset")

24.88% mape on train dataset
49.57% mape on validation dataset


### Final Model Choice

We chose to continue with the Random Forest model because it had the best MedAE, RMSE, and MAPE on the validation set.

Feature Importance Analysis

In [34]:
from rfpimp import importances

# Using the rfpimp package by ParrT, we look at our best model and estimate feature importance
I = importances(model_rf.best_estimator_, X_train, y_train)
I.reset_index(inplace = True)

# Features with non-positive importance are deemed 'not important' and are dropped
# It doesn't affect the training score much, but a simpler model tends to generalize better
new_drop_list = list(I.loc[(I.Importance <= 0)]['Feature'])
print(len(new_drop_list), 'Features')

19 Features
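The idea behind dropping features with non-positive importance can be sketched on synthetic data: shuffle one column at a time and see how much the model's score degrades. The sketch below swaps in scikit-learn's `permutation_importance` rather than rfpimp, and treating the two as comparable is an assumption. The data, the names, and the `random_state` values are all made up for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

# Synthetic data: only the first of three features drives the target
rng = np.random.RandomState(0)
X_toy = rng.normal(size=(300, 3))
y_toy = 5.0 * X_toy[:, 0] + rng.normal(scale=0.1, size=300)

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_toy, y_toy)

# Shuffle each column in turn and measure the drop in the model's score
result = permutation_importance(forest, X_toy, y_toy, n_repeats=5, random_state=0)

most_important = int(np.argmax(result.importances_mean))
# Candidates for dropping: columns whose shuffling does not hurt the score
low_value = [i for i, imp in enumerate(result.importances_mean) if imp <= 0]

print(most_important, low_value)
```

Columns that the pipeline never passes to the model would score exactly zero under this kind of test, so they also land in the drop list.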


In [35]:
# Retrain the model on the new drop list
numeric_rf = [feature for feature in X_train.columns if np.issubdtype(X_train[feature], np.number) 
                      and feature not in new_drop_list]
categorical_rf = [feature for feature in X_train.columns if feature not in numeric_rf
                       and feature not in new_drop_list]

regressor_rf = RandomForestRegressor()
pipeline_rf = make_pipeline_rf(regressor_rf)
model_rf = make_random_cv()
model_rf.fit(X_train, y_train)

Fitting 5 folds for each of 15 candidates, totalling 75 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  75 out of  75 | elapsed:  1.2min finished


RandomizedSearchCV(cv=5, error_score='raise-deprecating',
                   estimator=Pipeline(memory=None,
                                      steps=[('preprocessor',
                                              ColumnTransformer(n_jobs=None,
                                                                remainder='drop',
                                                                sparse_threshold=0.3,
                                                                transformer_weights=None,
                                                                transformers=[('numerical',
                                                                               Pipeline(memory=None,
                                                                                        steps=[('imputer',
                                                                                                SimpleImputer(add_indicator=False,
                                                               

6) Deliver
----

### Evaluation Metrics

##### Median Absolute Error Test Set

In [36]:
y_pred = model_rf.predict(X_test)
medae_value_test = add_commas(round(metrics.median_absolute_error(y_test, y_pred), 2))
print(f"${medae_value_test} medae on test dataset")

$519,973.01 medae on test dataset


##### Root Mean Squared Error Test Set

In [37]:
y_pred = model_rf.predict(X_test)
rmse_value_test = add_commas(round(np.sqrt(metrics.mean_squared_error(y_test, y_pred)), 2))
print(f"{rmse_value_test} rmse on test dataset")

1,495,995.7 rmse on test dataset


##### Mean Absolute Percentage Error Test Set

In [38]:
y_pred = model_rf.predict(X_test)
mape_value_test = round(mape_metric(y_test, y_pred)*100, 2)
print(f"{mape_value_test}% mape on test dataset")

56.23% mape on test dataset


### Summary and Takeaways

<span style='font-size:18px'> 
To predict NHL player salaries, we fitted three regression models: a Ridge Regression model, a KNN model, and a Random Forest model. After comparing the models, we looked at three evaluation metrics for each, and each metric told us something uniquely useful. Median Absolute Error (MedAE) is the most useful for external communication of our model. Being able to say that we were off by \\$100,000 is something we could tell a hockey General Manager to help them understand what is going on. Mean Absolute Percentage Error (MAPE) is used for internal understanding of our model. Like MedAE, it measures error, but it is easier for us as data scientists to say our model is off by 53\% rather than off by \\$557,132. Finally, our 'North Star' metric, the one we use to pick a model, is Root Mean Squared Error (RMSE). RMSE penalizes large errors heavily and, unlike MAPE, it is symmetric. MAPE suffers from asymmetry: a non-negative prediction that undershoots the true value can be off by at most 100\%, while a prediction that overshoots has no upper bound on its percentage error, so the two kinds of miss are not penalized equally.</span>
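The asymmetry of percentage error is easy to see with a worked example. The sketch below uses a made-up salary of \$1,000,000 and a hypothetical helper `ape` for a single record's absolute percentage error.

```python
def ape(actual, predicted):
    """Absolute percentage error for one record, as a fraction."""
    return abs(actual - predicted) / actual

actual = 1_000_000  # hypothetical true salary

over = ape(actual, 2 * actual)    # predicting double the salary
under = ape(actual, actual / 2)   # predicting half the salary
worst_under = ape(actual, 0)      # the lowest non-negative prediction

print(over, under, worst_under)
```

Being wrong by a factor of two costs 100% when overshooting but only 50% when undershooting, and no non-negative undershoot can cost more than 100%. A model tuned on MAPE alone would therefore lean toward low predictions.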

<span style='font-size:18px'> 
Our single best model was our Random Forest model, which had the lowest validation RMSE of the three models (about 1.32 million) and a test RMSE of about 1.5 million. Having chosen this model, we interpret it through its test-set MAPE of 56.23\% and MedAE of \\$519,973. We can take this model to an NHL General Manager and give them a way to predict salary in the ever-changing salary cap environment of the NHL.</span>
