# NCAA Basketball Rankings
This Jupyter notebook covers the data preprocessing and two ranking approaches: Massey's method and PageRank. The data come from Massey Ratings: https://masseyratings.com/scores.php?s=cb2025&sub=ncaa-d1&all=1&sch=1

## Data Preprocessing
First, we import the necessary libraries, then load and clean the data.

In [5]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from numpy.linalg import eig, matrix_power

In [3]:
df1 = pd.read_csv("NCAA_2025.csv")
df1 = df1[['Date', 'Winner', "WinnerScore", 'Loser', 'LoserScore']]
df1 = df1.iloc[:6192] # Only keep regular season games
n = len(df1)

In [4]:
# Clean up data: strip the @ symbols from the team names
df1["Winner"] = df1["Winner"].str.replace("@", "", regex=False)
df1["Loser"] = df1["Loser"].str.replace("@", "", regex=False)
df1.head()

Unnamed: 0,Date,Winner,WinnerScore,Loser,LoserScore
0,10/29/2024,S Illinois,106.0,North Park,71.0
1,11/4/2024,Siena,72.0,Brown,71.0
2,11/4/2024,Weber St,118.0,Northwest Indian,35.0
3,11/4/2024,Charlotte,88.0,Presbyterian,79.0
4,11/4/2024,Longwood,79.0,Randolph Col,68.0


In [None]:
teams = df1["Winner"]
teams = list(set(teams))  # only keep teams that won a game that involved a D1 team.
teams.sort()
m = len(teams)

## Massey's Method
In this section, we'll use Massey's method as described in [this post](https://yetanothermathblog.com/2016/12/03/sports-ranking-methods-1/).
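
To make the idea concrete before running it on the full season, here is a minimal sketch on a hypothetical three-team schedule. Each game contributes one row with +1 for the winner and -1 for the loser, and the right-hand side is the margin of victory. The names `games`, `toy_teams`, `M_toy`, and `b_toy` are made up for illustration, and the solve uses `np.linalg.lstsq` rather than the `LinearRegression` fit used in the cells of this notebook.

```python
import numpy as np

# Hypothetical toy schedule: (winner, loser, margin of victory).
games = [("A", "B", 10), ("B", "C", 5), ("A", "C", 15)]
toy_teams = sorted({t for w, l, _ in games for t in (w, l)})
idx = {team: k for k, team in enumerate(toy_teams)}

# One row per game: +1 in the winner's column, -1 in the loser's column.
M_toy = np.zeros((len(games), len(toy_teams)))
b_toy = np.zeros(len(games))
for row, (w, l, margin) in enumerate(games):
    M_toy[row, idx[w]] = 1
    M_toy[row, idx[l]] = -1
    b_toy[row] = margin

# Least-squares solve of M r = b. Adding a constant to every rating leaves
# M r unchanged, so the system is rank-deficient; lstsq returns the
# minimum-norm solution, whose ratings sum to zero.
ratings, *_ = np.linalg.lstsq(M_toy, b_toy, rcond=None)

for team, r in sorted(zip(toy_teams, ratings), key=lambda p: -p[1]):
    print(f"{team}: {r:.2f}")
```

Only differences between ratings are meaningful here: the gap between two teams' ratings is the fitted point margin between them.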


In [None]:
M = np.zeros((n, m))
b = np.zeros((n))
for i in range(n):
  game = df1.iloc[i]
  winner, loser = game["Winner"], game["Loser"]
  if winner in teams and loser in teams:
    wInd, lInd = teams.index(game["Winner"]), teams.index(game["Loser"])
    M[i][wInd] += 1
    M[i][lInd] -= 1
    b[i] += game["WinnerScore"] - game["LoserScore"]
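
The loop above calls `teams.index` for every game, which scans the list each time. A possible alternative, sketched here on hypothetical data, is to build a name-to-column dictionary once. The names `toy_teams`, `toy_games`, and `team_index` are illustrative and do not appear elsewhere in this notebook.

```python
import numpy as np

# Hypothetical stand-ins for `teams` and the game rows of df1:
# (winner, loser, winner score, loser score).
toy_teams = ["A", "B", "C"]
toy_games = [("B", "A", 84, 78), ("C", "B", 70, 67), ("A", "Z", 90, 60)]

# Name -> column index, built once instead of searching the list per game.
team_index = {name: k for k, name in enumerate(toy_teams)}

M_toy = np.zeros((len(toy_games), len(toy_teams)))
b_toy = np.zeros(len(toy_games))
for row, (winner, loser, w_pts, l_pts) in enumerate(toy_games):
    # Skip games involving a team outside the list, as the cell above does;
    # such a game leaves an all-zero row.
    if winner in team_index and loser in team_index:
        M_toy[row, team_index[winner]] = 1
        M_toy[row, team_index[loser]] = -1
        b_toy[row] = w_pts - l_pts

print(M_toy)
print(b_toy)
```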

In [None]:
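# Least-squares fit of M r = b; the coefficients are the team ratings.
# Note that LinearRegression also fits an intercept by default.
reg = LinearRegression()
reg.fit(M, b)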

In [None]:
rs = reg.coef_
expRs = [[rs[i],i] for i in range(m)]
expRs.sort()
expRs[:10]

[[np.float64(-25.009489638249406), 159],
 [np.float64(-17.648249953580986), 328],
 [np.float64(-15.112105019182376), 12],
 [np.float64(-13.156633062994679), 63],
 [np.float64(-13.108140644136693), 51],
 [np.float64(-12.894923839662882), 4],
 [np.float64(-12.106358631633185), 177],
 [np.float64(-11.728519229194907), 204],
 [np.float64(-11.728114416435835), 158],
 [np.float64(-11.636745767371472), 252]]

In [None]:
pd.DataFrame({"Rating" : [expRs[-i-1][0] for i in range(30)], "Team" : [teams[expRs[-i-1][1]] for i in range(30)]}, index = range(1,31)).round(2)

Unnamed: 0,Rating,Team
1,21.19,Duke
2,19.1,Auburn
3,19.0,Houston
4,18.38,Florida
5,17.8,Gonzaga
6,17.57,Texas Tech
7,17.25,Arizona
8,17.07,Maryland
9,16.78,Iowa St
10,16.63,Alabama


In [None]:
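# Note: expRs[-i-1] with i = 31 is the 32nd-best team, so the labels in this cell run one rank
# ahead of the true rank; range(30, 60) with the same index would cover ranks 31-60.
pd.Series([teams[expRs[-i-1][1]] for i in range(31,61)], index = range(31,61))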

Unnamed: 0,0
31,Louisville
32,Mississippi St
33,Cincinnati
34,Xavier
35,Mississippi
36,Penn St
37,Northwestern
38,Clemson
39,Villanova
40,St Mary's CA


In [None]:
# This cell and the next one were helpful for finding specific teams (the result is a 0-based position)
[teams[expRs[-i-1][1]] for i in range(m)].index("Memphis")

85

In [None]:
[a for a in teams if a[0] == "C"]

[]

## PageRank
Now let's do [PageRank](https://yetanothermathblog.com/2017/01/26/sports-ranking-methods-3/), the non-baby version.
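
As a warm-up, here is a minimal sketch of the idea on a hypothetical four-team league. It follows the convention used in the cells of this section: row `i` spreads its weight over the teams that team `i` outscored, so weight flows from winners to losers and a lower score means a stronger team. The matrix `net`, the 50/50 damping, and the stopping tolerance are assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical net margins: net[i][j] > 0 means team i outscored team j by that much.
net = np.array([
    [0, 10, 6, 0],
    [0, 0, 4, 8],
    [0, 0, 0, 5],
    [2, 0, 0, 0],
], dtype=float)

# Row-normalize so each row is a probability distribution.
P = net / net.sum(axis=1, keepdims=True)

# Damping: mix with the uniform matrix so every team is reachable from every other.
m_toy = len(P)
G = 0.5 * P + 0.5 * np.ones((m_toy, m_toy)) / m_toy

# Power iteration: repeatedly apply G to a starting distribution until it stops changing.
v = np.ones(m_toy) / m_toy
for _ in range(1000):
    v_next = v @ G
    converged = np.abs(v_next - v).sum() < 1e-12
    v = v_next
    if converged:
        break

# Ascending order: the team receiving the least weight comes first.
print(np.argsort(v))
print(v.round(4))
```

Because `G` is strictly positive, the iteration converges to a single stationary vector regardless of the starting distribution.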

In [None]:
mat = np.zeros((m, m))
game_cnts = np.zeros((m, m))
for i in range(len(df1)):
  game = df1.iloc[i]
  winner, loser = game["Winner"], game["Loser"]
  if winner in teams and loser in teams:
    wInd, lInd = teams.index(game["Winner"]), teams.index(game["Loser"])
    game_cnts[wInd][lInd] += 1
    game_cnts[lInd][wInd] += 1
    mat[wInd][lInd] += game["WinnerScore"]
    mat[lInd][wInd] += game["LoserScore"]

In [None]:
# Keep only the positive net head-to-head margin for each pair of teams
mat2 = np.maximum(mat - mat.T, 0)

In [None]:
# Normalize each row so it sums to 1; report teams with no net-positive matchup
for i in range(m):
  row_total = mat2[i].sum()
  if row_total:
    mat2[i] /= row_total
  else:
    print("team", i, "didn't win", teams[i])

team 159 didn't win MS Valley St
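
A team flagged this way has an all-zero row, so the matrix is not fully row-stochastic and any weight that reaches that team is lost on the next step. One common PageRank convention, shown here as a sketch on a hypothetical 3x3 matrix, replaces such rows with a uniform distribution. This notebook leaves the row at zero, so the snippet is an alternative and not what the cells here do.

```python
import numpy as np

# Hypothetical net-margin matrix; team 2 has no net-positive matchup.
net = np.array([
    [0.0, 7.0, 3.0],
    [0.0, 0.0, 4.0],
    [0.0, 0.0, 0.0],
])
m_toy = net.shape[0]
row_sums = net.sum(axis=1, keepdims=True)

# Avoid dividing by zero, then fill all-zero rows with the uniform distribution.
safe_sums = np.where(row_sums > 0, row_sums, 1.0)
P = np.where(row_sums > 0, net / safe_sums, 1.0 / m_toy)

print(P)
```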


In [None]:
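# Damped matrix: average the transition matrix with the uniform matrix J.
# Note: the next cell uses mat2, not mat3, so this damping is not applied to the results below.
J = np.ones((m,m))/m
mat3 = (mat2 + J)/2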

In [None]:
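# Start from the uniform vector and apply the transition matrix 100 times
evs = np.dot(np.ones(m)/m, matrix_power(mat2, 100))
exp_vs = [[evs[i],i] for i in range(m)]
# Sort ascending: weight flows from winners to losers, so lower values rank higher
exp_vs.sort()
# Only rank teams that played more than 3 games
pgRank = [teams[a[1]] for a in exp_vs if int(sum(game_cnts[a[1]])) > 3]
pd.Series([pgRank[i] for i in range(30)], index = range(1,31))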

Unnamed: 0,0
1,Drake
2,Florida
3,St John's
4,Houston
5,Auburn
6,Tennessee
7,Alabama
8,Duke
9,Oklahoma
10,St Mary's CA


In [None]:
pd.Series([pgRank[i] for i in range(30,60)], index = range(31,61))

Unnamed: 0,0
31,Penn St
32,Vanderbilt
33,Maryland
34,UCF
35,Utah
36,Iowa St
37,Illinois
38,UCLA
39,Utah St
40,Liberty


In [None]:
pgRank.index("Colorado St")

56

In [None]:
[a for a in teams if a[0] == "C"]

['C Michigan',
 'CS Bakersfield',
 'CS Fullerton',
 'CS Northridge',
 'CS Sacramento',
 'Cal Baptist',
 'Cal Poly',
 'California',
 'Campbell',
 'Canisius',
 'Cent Arkansas',
 'Central Conn',
 'Charleston So',
 'Charlotte',
 'Chattanooga',
 'Chicago St',
 'Cincinnati',
 'Citadel',
 'Clemson',
 'Cleveland St',
 'Coastal Car',
 'Col Charleston',
 'Colgate',
 'Colorado',
 'Colorado St',
 'Columbia',
 'Connecticut',
 'Coppin St',
 'Cornell',
 'Creighton']

In [None]:
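# Take a peek at the values within the eigenvector. The top 7 teams have a value of 0.0:
# they were net positive against each of their opponents, so no other team passes weight to them.
# Caveat: exp_vs is not filtered by games played the way pgRank is, so the two columns line up
# only if every team in exp_vs[:30] played more than 3 games.
pd.DataFrame({"Team" : [pgRank[i] for i in range(30)], "Points" : exp_vs[:30]}, index = range(1,31))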

Unnamed: 0,Team,Points
1,Drake,"[0.0, 28]"
2,Florida,"[0.0, 74]"
3,St John's,"[0.0, 91]"
4,Houston,"[0.0, 177]"
5,Auburn,"[0.0, 188]"
6,Tennessee,"[0.0, 252]"
7,Alabama,"[0.0, 367]"
8,Duke,"[3.9785107837423325e-09, 285]"
9,Oklahoma,"[8.229809868671208e-09, 115]"
10,St Mary's CA,"[1.7906001816134335e-08, 16]"
