This notebook determines how to convert betting lines (closing spreads) to estimated win probabilities using historical data. The data is stored in a CSV file in the same directory and comes from https://www.teamrankings.com/nfl/odds-history/results/? , covering all games from the 2003 season to the date of writing (11/13/24).

# Setup

Imports

In [85]:
import pandas as pd
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import numpy as np
from random import sample

Retrieving the data from the CSV file.

In [45]:
historic_odds = pd.read_csv("historic_odds.csv")
historic_odds

Unnamed: 0,Closing Spread,Game Count,Win Record,Win %,Average MOV,Cover Record,Cover %
0,27.0,1.0,0-1-0,0.0%,-16.0,1-0-0,100.0%
1,24.0,1.0,0-1-0,0.0%,-3.0,1-0-0,100.0%
2,22.0,1.0,0-1-0,0.0%,-21.0,1-0-0,100.0%
3,21.5,1.0,0-1-0,0.0%,-25.0,0-1-0,0.0%
4,20.5,3.0,0-3-0,0.0%,-11.0,3-0-0,100.0%
...,...,...,...,...,...,...,...
87,-21.5,1.0,1-0-0,100.0%,25.0,1-0-0,100.0%
88,-22.0,1.0,1-0-0,100.0%,21.0,0-1-0,0.0%
89,-24.0,1.0,1-0-0,100.0%,3.0,0-1-0,0.0%
90,-27.0,1.0,1-0-0,100.0%,16.0,0-1-0,0.0%


Restricting to rows with a closing spread <= 0 (otherwise each game would be double counted, once from each team's side), then converting to a list of tuples whose first value is the closing spread (a float) and whose second value is either 1 (the favorite won) or 0 (the favorite lost). Ties are not included.

In [61]:
# Make sure the spreads are floats, then keep only the favorite's side of each game.
historic_odds["Closing Spread"] = historic_odds["Closing Spread"].astype(float)
historic_odds = historic_odds[historic_odds["Closing Spread"] <= 0]
outcomes = []
for _, row in historic_odds.iterrows():
    # "Win Record" is formatted as wins-losses-ties; ties are ignored.
    win_record = row["Win Record"].split("-")
    closing_spread = row["Closing Spread"]
    wins = int(win_record[0])
    losses = int(win_record[1])
    # One (spread, outcome) tuple per game: 1 for a win, 0 for a loss.
    outcomes += [(closing_spread, 1)] * wins
    outcomes += [(closing_spread, 0)] * losses

print("Test...")
print(sample(outcomes, 10))

Test...
[(-7.0, 1), (-6.5, 1), (-14.0, 1), (-7.0, 0), (-1.5, 0), (-6.5, 0), (-3.0, 0), (-4.0, 1), (-6.0, 1), (-6.5, 1)]
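
The expansion from an aggregated record to per-game outcomes can be checked on a single row. This is a minimal sketch using a hypothetical helper, `expand_record`, that is not part of the notebook; it mirrors the loop above and likewise drops ties.

```python
def expand_record(closing_spread, win_record):
    """Turn an aggregated 'wins-losses-ties' record into one tuple per game."""
    parts = win_record.split("-")
    wins = int(parts[0])
    losses = int(parts[1])
    # Ties (parts[2]) are ignored, as in the notebook's loop.
    return [(closing_spread, 1)] * wins + [(closing_spread, 0)] * losses

# A made-up row: the favorite at -3.0 won twice and lost once.
example = expand_record(-3.0, "2-1-0")
print(example)  # [(-3.0, 1), (-3.0, 1), (-3.0, 0)]
```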


Splitting outcomes into training (90%) and test (10%) sets.

In [71]:
x = [[outcome[0]] for outcome in outcomes]
y = [outcome[1] for outcome in outcomes]
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.9, random_state=1234)

# Linear regression

Fitting a linear regression model.

In [90]:
linear_regression_model = LinearRegression().fit(x_train, y_train)

Predicting on test data, and converting predictions to be within 0-1.

In [91]:
# Clip the raw linear predictions into the valid probability range [0, 1].
predictions = np.clip(linear_regression_model.predict(x_test), 0, 1)

Assessing against test data using RMSE.

In [92]:
RMSE = np.sqrt(mean_squared_error(y_test, predictions))
RMSE

0.4629383239953458
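
Because the outcomes are 0 or 1, a model that predicts 0.5 for every game has an RMSE of exactly 0.5, which gives a reference point for the value above. A minimal sketch of that comparison, using a hypothetical helper `rmse` and made-up toy outcomes rather than the notebook's data:

```python
from math import sqrt

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sqrt(sum(squared_errors) / len(squared_errors))

# Made-up outcomes for illustration only (NOT the notebook's data).
toy_outcomes = [1, 1, 0, 1, 0, 1, 1, 0]
n = len(toy_outcomes)

# Baseline 1: always predict 0.5; every squared error is 0.25.
coin_flip_rmse = rmse(toy_outcomes, [0.5] * n)

# Baseline 2: always predict the observed win rate of the favorites.
win_rate = sum(toy_outcomes) / n
base_rate_rmse = rmse(toy_outcomes, [win_rate] * n)

print(coin_flip_rmse)  # 0.5
print(base_rate_rmse)
```

A fitted model is only informative to the extent that its RMSE falls below baselines like these.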

# Logistic regression

Fitting a logistic regression model.

In [89]:
logistic_regression_model = LogisticRegression().fit(x_train, y_train)

Predicting on test data.

In [98]:
# Column 1 of predict_proba holds the probability of class 1 (the favorite won).
probability_predictions = logistic_regression_model.predict_proba(x_test)[:, 1]

Assessing against test data using RMSE.

In [100]:
RMSE = np.sqrt(mean_squared_error(y_test, probability_predictions))
RMSE

0.4630317925996652
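
Unlike the linear model, a logistic model passes a linear function of the spread through the sigmoid, so its output always lies strictly between 0 and 1 and needs no clipping. The sketch below illustrates the shape of that mapping. The intercept and coefficient are hypothetical placeholders, since the notebook does not print the fitted values of `logistic_regression_model`, and `logistic_win_probability` is an illustrative name.

```python
from math import exp

# Hypothetical coefficients for illustration only; they are NOT the
# fitted values from logistic_regression_model.
HYPOTHETICAL_INTERCEPT = 0.0
HYPOTHETICAL_COEFFICIENT = -0.15

def logistic_win_probability(closing_spread):
    """Sigmoid of a linear function of the favorite's closing spread."""
    z = HYPOTHETICAL_INTERCEPT + HYPOTHETICAL_COEFFICIENT * closing_spread
    return 1.0 / (1.0 + exp(-z))

for spread in (0.0, -7.0, -27.0):
    print(spread, logistic_win_probability(spread))
```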

# Conclusion

There is virtually no difference in RMSE between the linear and logistic regression models (about 0.4629 versus 0.4630 on the test set). We will use the linear model for simplicity's sake, clipping predictions above 1 down to 1 and predictions below 0 up to 0. Model intercept and coefficient:

In [103]:
print("Intercept:", linear_regression_model.intercept_)
print("Coefficient:", linear_regression_model.coef_[0])

Intercept: 0.5040042425856379
Coefficient: -0.030485526633855984
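
With these two numbers, the conversion from a spread to a win probability can be written without scikit-learn. This is a minimal sketch: the function name `spread_to_win_probability` is illustrative, and it assumes the input is the favorite's closing spread (<= 0), since that is the only range the model was fitted on.

```python
# Values copied from the printed output above.
INTERCEPT = 0.5040042425856379
COEFFICIENT = -0.030485526633855984

def spread_to_win_probability(closing_spread):
    """Estimated win probability for a favorite with the given closing spread."""
    raw = INTERCEPT + COEFFICIENT * closing_spread
    # Clip to [0, 1], matching how the linear predictions were assessed.
    return min(1.0, max(0.0, raw))

for spread in (0.0, -3.0, -7.0, -20.0):
    print(spread, round(spread_to_win_probability(spread), 3))
```

For example, a 7-point favorite comes out at roughly 0.717, and the estimate reaches the cap of 1 for spreads beyond about -16.5.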
