# Using Statistics to Predict Standings

Option 3:

The goal of this project is to use statistics collected by the Rocky Mountain Athletic Conference (RMAC) and NCAA Division II softball websites to predict how a team is likely to finish in the next season.

This notebook uses the 2025 statistics collected by the RMAC and the NCAA to generate standings for the RMAC, which contains teams like MSU Denver, Regis University, Colorado Mesa University, and Colorado School of Mines.

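The model used is linear regression, which predicts a numeric outcome (such as a win total or a ranking) as a weighted sum of input features. That makes it a reasonable starting point for estimating a team's standing from its record and team statistics.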


# Load the Dataset

To begin, we will bring in the datasets for both the RMAC and the NCAA. The RMAC softball conference consists of 12 teams, while this year's NCAA dataset consists of 268 teams.

In [26]:
import pandas as pd

RMAC = pd.read_csv('https://raw.githubusercontent.com/prietoc2022/RMACPredictionModel/refs/heads/main/rmac_2025_dataset.csv')
NCAA = pd.read_csv('https://raw.githubusercontent.com/prietoc2022/RMACPredictionModel/refs/heads/main/2025NCAA_Combined_Stats.csv')

print("RMAC Dataset: ")
print(RMAC.head())
print(RMAC.describe())

print("NCAA Dataset: ")
print(NCAA.head())
print(NCAA.describe())


RMAC Dataset: 
                 Team  RMAC_Wins  RMAC_Losses  RMAC_Pct  Overall_Wins  \
0  Colorado Christian         36            8     0.818            44   
1       Colorado Mesa         35            9     0.795            40   
2          MSU Denver         34            9     0.791            41   
3                UCCS         26           17     0.605            35   
4          CSU Pueblo         26           18     0.591            27   

   Overall_Losses  Overall_Pct  
0              16        0.733  
1              17        0.702  
2              17        0.707  
3              24        0.593  
4              26        0.509  
       RMAC_Wins  RMAC_Losses   RMAC_Pct  Overall_Wins  Overall_Losses  \
count  12.000000    12.000000  12.000000     12.000000       12.000000   
mean   21.833333    21.833333   0.500167     25.750000       28.916667   
std    10.743567    10.777361   0.245975     12.439855       10.352499   
min     3.000000     8.000000   0.068000      5.0000

As you can see, there are two discrepancies between the datasets. First, the RMAC dataset uses different team names than the NCAA dataset. Second, the NCAA dataset contains every team in Division II, not just the teams in the RMAC. To fix this, we normalize the team names across both datasets so that they merge properly, and we keep only the RMAC teams in the merge. The RMAC is a small conference, so this is simple. The NCAA dataset is also missing fielding percentage (FP), so we will calculate it and add it as a column in the NCAA dataset.


In [27]:
# Remove parenthesized text and surrounding whitespace from the team names in the NCAA dataset
NCAA['Team'] = NCAA['Team'].str.replace(r"\([^)]*\)", "", regex=True).str.strip()

# Normalize the NCAA Teams to have the same name from the RMAC Dataset
NCAA['Team'] = NCAA['Team'].replace({
    'UC-Colo. Springs': 'UCCS',
    'CSU Pueblo': 'CSU Pueblo',
    'MSU Denver': 'MSU Denver',
    'Colorado Mesa': 'Colorado Mesa',
    'Colo. Sch. of Mines': 'Colorado School of Mines',
    'Regis': 'Regis',
    'N.M. Highlands': 'New Mexico Highlands',
    'Black Hills St.': 'Black Hills State',
    'Adams St.': 'Adams State',
    'Colo. Christian': 'Colorado Christian',
    'Chadron St.': 'Chadron State',
    'Fort Lewis': 'Fort Lewis'
})

# Calculate fielding percentage = (Put Outs + Assists) / (Put Outs + Assists + Errors)
NCAA['FP'] = (NCAA['PO'] + NCAA['A']) / (NCAA['PO'] + NCAA['A'] + NCAA['E'])

# Merge NCAA Statistics on RMAC Standings
mergedData = pd.merge(NCAA, RMAC, on='Team', how='right')
mergedData.groupby('RMAC_Wins').RMAC_Wins.count()


print(mergedData)

    Rank                      Team   W   L  T    PCT   G    AB    H     BA  \
0     19        Colorado Christian  44  16  0  0.733  63  1612  533  0.331   
1     28             Colorado Mesa  40  17  0  0.702  54  1509  488  0.323   
2     26                MSU Denver  41  17  0  0.707  56  1504  489  0.325   
3     88                      UCCS  33  24  0  0.579  59  1529  458  0.300   
4    127                CSU Pueblo  27  26  0  0.509  42  1101  318  0.289   
5    129         Black Hills State  27  27  0  0.500  55  1428  412  0.289   
6    187  Colorado School of Mines  22  32  0  0.407  31   797  215  0.270   
7    185                     Regis  21  30  0  0.412  55  1443  390  0.270   
8    170             Chadron State  22  28  0  0.440  54  1406  387  0.275   
9    248                Fort Lewis  10  38  0  0.208  43  1021  250  0.245   
10   230      New Mexico Highlands  15  39  0  0.278  47  1238  317  0.256   
11   266               Adams State   5  51  0  0.089  42   961  
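
The two cleaning steps in the cell above can be tried out in isolation. The following is a minimal sketch in plain Python rather than pandas; the sample names with parenthetical suffixes and the put-out, assist, and error counts are made up for illustration and do not come from the NCAA dataset.

```python
import re

# Hypothetical raw names; the parenthetical suffixes are an assumption about
# the kind of text the notebook's regex is meant to strip.
raw_names = ["UC-Colo. Springs (RMAC)", "Colo. Sch. of Mines", "Regis (CO)", "Adams St."]

# A subset of the notebook's mapping from NCAA names to RMAC names.
name_map = {
    "UC-Colo. Springs": "UCCS",
    "Colo. Sch. of Mines": "Colorado School of Mines",
    "Adams St.": "Adams State",
}

def normalize(name):
    """Strip parenthesized text and whitespace, then apply the name mapping."""
    stripped = re.sub(r"\([^)]*\)", "", name).strip()
    # Names without a mapping entry (such as Regis) are returned unchanged.
    return name_map.get(stripped, stripped)

def fielding_pct(po, a, e):
    """Fielding percentage = (put outs + assists) / (put outs + assists + errors)."""
    return (po + a) / (po + a + e)

cleaned = [normalize(n) for n in raw_names]
fp = fielding_pct(po=1200, a=500, e=60)  # made-up counts

print(cleaned)
print(round(fp, 3))
```

Names that already match, such as Regis, need no entry in the mapping, which is why the identity entries in the notebook's dictionary are optional.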

Now that the datasets are imported and cleaned up, it is time to choose the target and the features that the model needs to predict standings. The target is each team's national Rank from the NCAA dataset. The features are RMAC wins, RMAC losses, RMAC win percentage, team batting average (BA), team fielding percentage (FP), and team earned run average (ERA).

In [28]:
# Choose target value
y = mergedData['Rank']

# Features being used to calculate ranking
features = ['RMAC_Wins', 'RMAC_Losses','RMAC_Pct','BA','FP','ERA']
x = mergedData[features]

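Now that we have our features and data refined the way we like, it is time to construct our model! Let's split the data into a training set and a test set. With 12 teams and a test size of 0.2, three teams are held out for testing and the remaining nine are used for training.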

In [29]:
from sklearn.model_selection import train_test_split  # To split the data
from sklearn.linear_model import LinearRegression  # The star of the show!

X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=1)
model = LinearRegression()
model.fit(X_train, y_train)

# Predict Rank of test teams based on the training data
y_pred = model.predict(X_test)

print(f"Predicted Rank: {y_pred}")
print(f"Actual Rank: {y_test.values}")



Predicted Rank: [ 64.09088465 111.7323314  129.39171769]
Actual Rank: [ 26  88 127]


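What is neat about this output is that the actual ranks tell you which teams were selected for the test set: matching them against the merged table gives MSU Denver (26), UCCS (88), and CSU Pueblo (127). The predicted values are the model's estimates of each team's rank within the NCAA dataset. Let's calculate the error metrics for the model just created.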

In [None]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Calculate metrics
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = mse ** 0.5
r2 = r2_score(y_test, y_pred)

# Print results
print(f"MAE: {mae}")
print(f"MSE: {mse}")
print(f"RMSE: {rmse}")
print(f"R2: {r2}")
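
The same four metrics can also be reproduced by hand from the three predicted and actual ranks printed by the model above. This is a minimal sketch in plain Python that spells out the standard formulas; it does not replace the scikit-learn calls, and with only three test teams the resulting numbers should be read as a rough indication.

```python
from math import sqrt

# Values copied from the "Predicted Rank" and "Actual Rank" output above.
y_pred = [64.09088465, 111.7323314, 129.39171769]
y_true = [26, 88, 127]
n = len(y_true)

# Mean absolute error: average size of the miss, in ranking places.
mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n

# Mean squared error and its square root.
mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n
rmse = sqrt(mse)

# R squared: 1 - (residual sum of squares / total sum of squares).
mean_true = sum(y_true) / n
ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
ss_tot = sum((t - mean_true) ** 2 for t in y_true)
r2 = 1 - ss_res / ss_tot

print(f"MAE: {mae}")
print(f"MSE: {mse}")
print(f"RMSE: {rmse}")
print(f"R2: {r2}")
```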

As you can see, the R-squared value is not quite what we are looking for in terms of accuracy when predicting the national NCAA rankings of RMAC teams. Let's remove the RMAC_Wins and RMAC_Losses features to see whether that gives a more accurate prediction. The reason for removing them is that RMAC_Pct is already computed from a team's wins and losses, so the three features carry largely redundant information.
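
The redundancy can be checked against the RMAC.head() output shown earlier. The sketch below, in plain Python, recomputes the win percentage as wins / (wins + losses) for the five teams printed there and compares it with the RMAC_Pct column. It assumes the column is rounded to three decimal places, which is how it appears in the output.

```python
# (team, RMAC_Wins, RMAC_Losses, RMAC_Pct) copied from the RMAC.head() output.
rows = [
    ("Colorado Christian", 36, 8, 0.818),
    ("Colorado Mesa", 35, 9, 0.795),
    ("MSU Denver", 34, 9, 0.791),
    ("UCCS", 26, 17, 0.605),
    ("CSU Pueblo", 26, 18, 0.591),
]

# Recompute the percentage from wins and losses alone.
recomputed = {team: round(w / (w + l), 3) for team, w, l, _ in rows}

# True when every recomputed value matches the published RMAC_Pct.
all_match = all(abs(recomputed[team] - pct) < 1e-9 for team, _, _, pct in rows)

for team, w, l, pct in rows:
    print(f"{team}: {w}-{l} -> {recomputed[team]} (dataset: {pct})")
print(f"All match: {all_match}")
```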

In [31]:
# Edit the features from above
features = ['RMAC_Pct','BA','FP','ERA']
x = mergedData[features]

In [None]:
# Retrain the model on the reduced feature set
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=1)
model = LinearRegression()
model.fit(X_train, y_train)

# Predict Rank of test teams based on the training data
y_pred = model.predict(X_test)


mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = mse ** 0.5
r2 = r2_score(y_test, y_pred)


print(f"Predicted Rank: {y_pred}")
print(f"Actual Rank: {y_test.values}")

# Metrics for the reduced feature set
print(f"MAE: {mae}")
print(f"MSE: {mse}")
print(f"RMSE: {rmse}")
print(f"R2: {r2}")


Now that we have a more accurate model, the next step for the project is to add datasets from additional years, including statistics from 2023 and 2024, and use them to build a model for the 2026 season this spring. Thanks for viewing!