# Training a Machine Learning Model
This notebook contains the code for the seventh part of this data science project - model training. Section headings have been included for convenience and the full writeup is available [on my website](https://www.pineconedata.com/2024-09-13-basketball-train-ols/).

## Project Overview
This is part of a series that walks through the entire process of a data science project - from initial steps like data acquisition, preprocessing, and cleaning to more advanced steps like feature engineering, creating visualizations, and machine learning. The dataset used in this project contains individual basketball player statistics (such as total points scored and blocks made) for the 2023-2024 NCAA women’s basketball season.

### Articles in this Series   
1. [Acquiring and Combining the Datasets](https://www.pineconedata.com/2024-04-11-basketball-data-acquisition/)
2. [Cleaning and Preprocessing the Data](https://www.pineconedata.com/2024-05-02-basketball-data-cleaning-preprocessing/)
3. [Engineering New Features](https://www.pineconedata.com/2024-05-30-basketball-feature_engineering/)
4. [Exploratory Data Analysis](https://www.pineconedata.com/2024-06-28-basketball-data-exploration/)
5. [Visualizations, Charts, and Graphs](https://www.pineconedata.com/2024-07-29-basketball-visualizations/)
6. [Selecting a Machine Learning Model](https://www.pineconedata.com/2024-08-12-basketball-select-ml-ols/)
7. [Training the Machine Learning Model](https://www.pineconedata.com/2024-09-13-basketball-train-ols/) (This Notebook)
8. [Evaluating the Machine Learning Model](https://www.pineconedata.com/2024-11-27-basketball-evaluate-ols-model/2024-11-27-basketball-evaluate-ols-model/)


# Getting Started
Full requirements and environment setup information is detailed in the [first article of this series](https://www.pineconedata.com/2024-04-11-basketball-data-acquisition/).

## Import Packages

In [1]:
import joblib
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

## Import Data

In [2]:
from pathlib import Path


data_folder = Path.cwd().parent / 'data'
model_folder = Path.cwd().parent / 'models'

In [3]:
player_data = pd.read_excel(data_folder / 'player_data_engineered.xlsx')
player_data.head()

Unnamed: 0,PLAYER_NAME,Team,Class,Height,Position,PLAYER_ID,TEAM_NAME,GAMES,MINUTES_PLAYED,FIELD_GOALS_MADE,...,Conference,MINUTES_PER_GAME,FOULS_PER_GAME,POINTS_PER_GAME,ASSISTS_PER_GAME,STEALS_PER_GAME,BLOCKS_PER_GAME,REBOUNDS_PER_GAME,ASSIST_TO_TURNOVER,FANTASY_POINTS
0,Kiara Jackson,UNLV (Mountain West),Junior,67,Guard,ncaaw.p.67149,UNLV,29,895,128,...,Mountain West,30.862069,1.62069,11.137931,4.655172,1.068966,0.172414,4.448276,3.214286,710.3
1,Raven Johnson,South Carolina (SEC),Sophomore,68,Guard,ncaaw.p.67515,South Carolina,30,823,98,...,SEC,27.433333,1.133333,8.1,4.933333,2.0,0.166667,5.366667,2.792453,735.2
2,Gina Marxen,Montana (Big Sky),Senior,68,Guard,ncaaw.p.57909,Montana,29,778,88,...,Big Sky,26.827586,0.896552,10.241379,3.827586,0.551724,0.068966,2.068966,2.921053,533.5
3,McKenna Hofschild,Colorado St. (Mountain West),Senior,62,Guard,ncaaw.p.60402,Colorado St.,29,1046,231,...,Mountain West,36.068966,1.172414,22.551724,7.275862,1.241379,0.137931,3.965517,2.971831,1117.5
4,Kaylah Ivey,Boston College (ACC),Junior,68,Guard,ncaaw.p.64531,Boston Coll.,33,995,47,...,ACC,30.151515,1.454545,4.333333,5.636364,1.090909,0.030303,1.727273,2.90625,500.4


# Basics of Machine Learning

# Model Training

## Define the Variables

In [4]:
target = 'FANTASY_POINTS'
features = ['Height', 'MINUTES_PLAYED', 'FIELD_GOALS_MADE', 'THREE_POINTS_MADE',
            'TWO_POINTS_MADE', 'FREE_THROWS_MADE', 'TOTAL_REBOUNDS', 'ASSISTS',
            'TURNOVERS', 'STEALS', 'BLOCKS', 'FOULS', 'POINTS']

In [5]:
X = player_data[features]
y = player_data[target]

In [6]:
X

,Height,MINUTES_PLAYED,FIELD_GOALS_MADE,THREE_POINTS_MADE,TWO_POINTS_MADE,FREE_THROWS_MADE,TOTAL_REBOUNDS,ASSISTS,TURNOVERS,STEALS,BLOCKS,FOULS,POINTS
0,67,895,128,28,100,39,129,135,42,31,5,47,323
1,68,823,98,20,78,27,161,148,53,60,5,34,243
2,68,778,88,58,30,63,60,111,38,16,2,26,297
3,62,1046,231,55,176,137,115,211,71,36,4,34,654
4,68,995,47,32,15,17,57,186,64,36,1,48,143
...,...,...,...,...,...,...,...,...,...,...,...,...,...
890,66,742,92,53,39,45,113,73,66,45,2,54,282
891,73,815,108,58,50,26,140,34,46,19,19,51,300
892,71,774,102,56,46,67,176,29,48,29,3,68,327
893,71,848,127,54,73,76,123,71,90,35,9,94,384


In [7]:
y

0       710.3
1       735.2
2       533.5
3      1117.5
4       500.4
        ...  
890     555.1
891     549.0
892     597.7
893     636.1
894     597.9
Name: FANTASY_POINTS, Length: 895, dtype: float64

## Create Training and Testing Splits

In [8]:
X_train, X_test, y_train, y_test = train_test_split(X, y)

In [9]:
X_train.head(5)

,Height,MINUTES_PLAYED,FIELD_GOALS_MADE,THREE_POINTS_MADE,TWO_POINTS_MADE,FREE_THROWS_MADE,TOTAL_REBOUNDS,ASSISTS,TURNOVERS,STEALS,BLOCKS,FOULS,POINTS
470,75,737,214,4,210,103,221,31,70,11,16,63,535
593,72,942,160,1,159,104,201,51,68,98,11,104,425
218,69,1109,142,38,104,124,146,111,92,42,6,55,446
150,66,599,90,26,64,95,127,92,60,49,4,73,301
417,73,782,95,0,95,49,147,56,66,9,31,89,239


### Reproducibility

In [10]:
X_train, X_test, y_train, y_test = train_test_split(X, y)
X_train.head(5)

,Height,MINUTES_PLAYED,FIELD_GOALS_MADE,THREE_POINTS_MADE,TWO_POINTS_MADE,FREE_THROWS_MADE,TOTAL_REBOUNDS,ASSISTS,TURNOVERS,STEALS,BLOCKS,FOULS,POINTS
751,67,821,70,28,42,43,98,75,75,65,10,72,211
473,73,826,219,10,209,91,261,60,69,36,12,79,539
710,71,772,67,20,47,26,209,43,50,23,12,52,180
154,71,978,145,10,135,68,141,170,113,46,3,75,368
647,71,923,157,66,91,40,138,55,65,29,18,59,420


In [11]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=314)
X_train.head(5)

,Height,MINUTES_PLAYED,FIELD_GOALS_MADE,THREE_POINTS_MADE,TWO_POINTS_MADE,FREE_THROWS_MADE,TOTAL_REBOUNDS,ASSISTS,TURNOVERS,STEALS,BLOCKS,FOULS,POINTS
385,72,1047,135,40,95,64,166,46,58,53,31,53,374
488,68,903,204,60,144,149,226,79,62,76,3,98,617
763,68,854,132,28,104,66,163,61,84,65,3,87,358
688,73,576,136,8,128,71,204,38,53,11,10,63,351
523,68,658,172,4,168,51,187,29,25,29,5,72,399
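
The first two splits above return different rows because `train_test_split` shuffles the data before splitting, while the third pins the shuffle with `random_state`. Here's a minimal sketch of the same idea on a small made-up array (`X_demo` and `y_demo` are placeholders for illustration, not part of the project data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 rows, 2 features
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Two separate calls with the same random_state
a_train, a_test, _, _ = train_test_split(X_demo, y_demo, random_state=314)
b_train, b_test, _, _ = train_test_split(X_demo, y_demo, random_state=314)

# The same seed should select exactly the same rows both times
same_split = np.array_equal(a_train, b_train) and np.array_equal(a_test, b_test)
print(f'Identical splits: {same_split}')
```

Any integer works as the seed; what matters is using the same one every time the split is re-run.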


### Dataset Proportions

In [12]:
print(f'Test data split proportion: {len(X_test) / len(X)}')
print(f'Train data split proportion: {len(X_train) / len(X)}')

Test data split proportion: 0.25027932960893856
Train data split proportion: 0.7497206703910615


In [13]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=0.2)
print(f'Test data split proportion: {len(X_test) / len(X)}')
print(f'Train data split proportion: {len(X_train) / len(X)}')

Test data split proportion: 0.2
Train data split proportion: 0.8


In [14]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=314)
print(f'Test data split proportion: {len(X_test) / len(X)}')
print(f'Train data split proportion: {len(X_train) / len(X)}')

Test data split proportion: 0.25027932960893856
Train data split proportion: 0.7497206703910615
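
The default split isn't exactly 75/25 because 895 rows can't be divided evenly. As far as I can tell, scikit-learn rounds the test set size up and gives the remaining rows to the training set, which reproduces the proportions printed above. A quick arithmetic check (the rounding rule is my reading of scikit-learn's behaviour, so treat it as an assumption):

```python
import math

n_samples = 895   # number of rows in X, from the output above
test_size = 0.25  # the default when neither test_size nor train_size is set

# Assumption: the test count is rounded up, the rest goes to training
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(f'Test rows: {n_test}, proportion: {n_test / n_samples}')
print(f'Train rows: {n_train}, proportion: {n_train / n_samples}')
```

With these numbers the test set gets 224 rows and the training set gets 671, which is where the slightly uneven proportions come from.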


### A Note on DataFrames versus NumPy Arrays

#### Using pandas DataFrames or Series

In [15]:
print(f'type of X: {type(X)}')
print(f'type of y: {type(y)}')

type of X: <class 'pandas.core.frame.DataFrame'>
type of y: <class 'pandas.core.series.Series'>


#### Using NumPy Arrays

In [16]:
print(f'type of X: {type(X.to_numpy())}')
print(f'type of y: {type(y.to_numpy())}')

type of X: <class 'numpy.ndarray'>
type of y: <class 'numpy.ndarray'>


## Train the Model

In [17]:
linear_reg_model = LinearRegression()

In [18]:
linear_reg_model.fit(X_train, y_train)

## Print the Model Equation

In [19]:
linear_reg_model.coef_

array([-3.35206199e-15, -2.77555756e-17,  1.66666667e+00,  1.33333333e+00,
        3.33333333e-01,  1.00000000e+00,  1.20000000e+00,  1.50000000e+00,
       -1.00000000e+00,  2.00000000e+00,  2.00000000e+00, -1.28022593e-15,
        1.31960415e-14])

In [20]:
linear_reg_model.intercept_

2.2737367544323206e-13

In [21]:
linear_reg_model.feature_names_in_

array(['Height', 'MINUTES_PLAYED', 'FIELD_GOALS_MADE',
       'THREE_POINTS_MADE', 'TWO_POINTS_MADE', 'FREE_THROWS_MADE',
       'TOTAL_REBOUNDS', 'ASSISTS', 'TURNOVERS', 'STEALS', 'BLOCKS',
       'FOULS', 'POINTS'], dtype=object)

In [22]:
coef_series = pd.Series(data=linear_reg_model.coef_, index=linear_reg_model.feature_names_in_)
coef_series

Height              -3.352062e-15
MINUTES_PLAYED      -2.775558e-17
FIELD_GOALS_MADE     1.666667e+00
THREE_POINTS_MADE    1.333333e+00
TWO_POINTS_MADE      3.333333e-01
FREE_THROWS_MADE     1.000000e+00
TOTAL_REBOUNDS       1.200000e+00
ASSISTS              1.500000e+00
TURNOVERS           -1.000000e+00
STEALS               2.000000e+00
BLOCKS               2.000000e+00
FOULS               -1.280226e-15
POINTS               1.319604e-14
dtype: float64

In [23]:
coef_string = "\n + ".join(f"{coef}*{feat}" for feat, coef in coef_series.items())
print(coef_string)

-3.3520619913961595e-15*Height
 + -2.7755575615628914e-17*MINUTES_PLAYED
 + 1.6666666666666448*FIELD_GOALS_MADE
 + 1.3333333333333168*THREE_POINTS_MADE
 + 0.3333333333333284*TWO_POINTS_MADE
 + 0.9999999999999863*FREE_THROWS_MADE
 + 1.2000000000000013*TOTAL_REBOUNDS
 + 1.500000000000001*ASSISTS
 + -0.9999999999999996*TURNOVERS
 + 1.9999999999999991*STEALS
 + 2.000000000000002*BLOCKS
 + -1.2802259252708836e-15*FOULS
 + 1.3196041481755572e-14*POINTS


In [24]:
print(f'{target} = {coef_string} + {linear_reg_model.intercept_} + error')

FANTASY_POINTS = -3.3520619913961595e-15*Height
 + -2.7755575615628914e-17*MINUTES_PLAYED
 + 1.6666666666666448*FIELD_GOALS_MADE
 + 1.3333333333333168*THREE_POINTS_MADE
 + 0.3333333333333284*TWO_POINTS_MADE
 + 0.9999999999999863*FREE_THROWS_MADE
 + 1.2000000000000013*TOTAL_REBOUNDS
 + 1.500000000000001*ASSISTS
 + -0.9999999999999996*TURNOVERS
 + 1.9999999999999991*STEALS
 + 2.000000000000002*BLOCKS
 + -1.2802259252708836e-15*FOULS
 + 1.3196041481755572e-14*POINTS + 2.2737367544323206e-13 + error


### Analyze the Model Equation

In [25]:
coef_series_simple = coef_series[abs(coef_series) > 0.0001]
coef_string_simple = "\n\t\t + ".join(f"{coef:.4f} * {feat}" for feat, coef in coef_series_simple.items())
print(f'{target} = {coef_string_simple} + {linear_reg_model.intercept_} + error')

FANTASY_POINTS = 1.6667 * FIELD_GOALS_MADE
		 + 1.3333 * THREE_POINTS_MADE
		 + 0.3333 * TWO_POINTS_MADE
		 + 1.0000 * FREE_THROWS_MADE
		 + 1.2000 * TOTAL_REBOUNDS
		 + 1.5000 * ASSISTS
		 + -1.0000 * TURNOVERS
		 + 2.0000 * STEALS
		 + 2.0000 * BLOCKS + 2.2737367544323206e-13 + error


In [26]:
check = player_data['FIELD_GOALS_MADE'] == player_data['TWO_POINTS_MADE'] + player_data['THREE_POINTS_MADE']
print(f'True count: {check.sum()} rows')
print(f'False count: {(~check).sum()} rows')

True count: 895 rows
False count: 0 rows


## Alternate Training

In [27]:
X_alt = player_data[features].drop(columns=['FIELD_GOALS_MADE', 'TWO_POINTS_MADE', 'POINTS'])

In [28]:
X_train_alt, X_test_alt, y_train_alt, y_test_alt = train_test_split(X_alt, y, random_state=314)

In [29]:
ols_alt = LinearRegression()
ols_alt.fit(X_train_alt, y_train_alt)

In [30]:
coef_series_alt = pd.Series(data=ols_alt.coef_, index=ols_alt.feature_names_in_)
coef_series_alt = coef_series_alt[abs(coef_series_alt) > 0.0001]
coef_string_alt = "\n\t\t + ".join(f"{coef:.4f} * {feat}" for feat, coef in coef_series_alt.items())
print(f'{target} = {coef_string_alt} + {ols_alt.intercept_} + error')

FANTASY_POINTS = 2.4532 * Height
		 + 0.1039 * MINUTES_PLAYED
		 + 2.2037 * THREE_POINTS_MADE
		 + 2.3917 * FREE_THROWS_MADE
		 + 1.5219 * TOTAL_REBOUNDS
		 + 1.3231 * ASSISTS
		 + -0.5706 * TURNOVERS
		 + 2.2393 * STEALS
		 + 2.4818 * BLOCKS
		 + -0.2612 * FOULS + -203.27425560271263 + error


# Export Data & Models
If you're going to use a new Jupyter notebook / Python script for the next part of this series, then it's a good idea to export the training and testing datasets.

In [31]:
X_train.to_csv(data_folder / 'X_train_full.csv', index=False)
X_train_alt.to_csv(data_folder / 'X_train_few.csv', index=False)
X_test.to_csv(data_folder / 'X_test_full.csv', index=False)
X_test_alt.to_csv(data_folder / 'X_test_few.csv', index=False)
y_test.to_csv(data_folder / 'y_actual.csv', index=False)

It's not strictly necessary to export small, simple models like these, but it's often helpful for checkpointing and collaboration. There are multiple ways to export machine learning models, which are detailed on [scikit-learn's model persistence](https://scikit-learn.org/stable/model_persistence.html) page, including the popular [pickle](https://docs.python.org/3/library/pickle.html#module-pickle) library, but for today we'll use [joblib](https://joblib.readthedocs.io/en/latest/index.html#module-joblib).

In [32]:
joblib.dump(linear_reg_model, model_folder / 'model_full.sav')
joblib.dump(ols_alt, model_folder / 'model_few.sav')

['/home/scoops/Code/ncaa-basketball-stats/models/model_few.sav']
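
To pick the models back up in another notebook, `joblib.load` reads a dumped file into a fitted estimator. Here's a minimal round-trip sketch using a throwaway model and a temporary folder (the toy data and the `model_demo.sav` filename are made up for illustration; in the next notebook you would point at the files saved above instead):

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data following y = 2x + 1
X_demo = np.array([[1.0], [2.0], [3.0], [4.0]])
y_demo = np.array([3.0, 5.0, 7.0, 9.0])
model = LinearRegression().fit(X_demo, y_demo)

# Dump the model to a temporary folder, then load it back
with tempfile.TemporaryDirectory() as tmp_folder:
    model_path = Path(tmp_folder) / 'model_demo.sav'
    joblib.dump(model, model_path)
    reloaded_model = joblib.load(model_path)

# The reloaded model should make the same predictions as the original
predictions_match = np.allclose(model.predict(X_demo), reloaded_model.predict(X_demo))
print(f'Predictions match after reload: {predictions_match}')
```

As with pickle, only load files from sources you trust, and try to load them with the same scikit-learn version that created them.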