# EDA for March Learning Mania

### Goal: 
We want to create a model which can provide predictions in the format of YEAR_TEAM1ID_TEAM2ID_WINCHANCE

where the winchance is a value from 0->1 representing the probability that the team with the higher team ID will win. 

"For example, "2025_1101_1102" indicates a hypothetical matchup between team 1101 and 1102 in the year 2025. You must predict the probability that the team with the lower TeamId beats the team with the higher TeamId"

Output should look like this:
```
ID,Pred
2025_1101_1102,0.5
2025_1101_1103,0.5
2025_1101_1104,0.5
...
```

### Data files:
They give you a suffocating amount of data, including: 
- Basic Team & Season Data
- Detailed Game Statistics
- Geographical Data
- Ranking & Rating Systems
- Tournament Bracket Structure
- Supplementary Data
- Data for Submissions
- Miscellaneous

GPT summarized them in the blob below. The key point is that you should probably be able to build a basic model with the basic team and season data. 

Based on historical matches between teams, how will they perform in the future?

The first steps will to get the data into basic form

#### **Basic Team & Season Data**
- **MTeams.csv / WTeams.csv** – Lists all NCAA teams with unique Team IDs and historical Division-I participation details.
- **MSeasons.csv / WSeasons.csv** – Details past seasons, including season start dates and region assignments.
- **MNCAATourneySeeds.csv / WNCAATourneySeeds.csv** – Provides historical NCAA tournament seedings for teams since 1985 (men) and 1998 (women).
- **MRegularSeasonCompactResults.csv / WRegularSeasonCompactResults.csv** – Contains game results (winner, loser, score, location) for regular season games since 1985 (men) and 1998 (women).
- **MNCAATourneyCompactResults.csv / WNCAATourneyCompactResults.csv** – Similar to above but specific to NCAA tournament games.

#### **Detailed Game Statistics**
- **MRegularSeasonDetailedResults.csv / WRegularSeasonDetailedResults.csv** – Contains extended game stats like field goals, rebounds, assists, etc., since 2003 (men) and 2010 (women).
- **MNCAATourneyDetailedResults.csv / WNCAATourneyDetailedResults.csv** – Similar to above but specific to NCAA tournament games.

#### **Geographical Data**
- **Cities.csv** – Lists cities where games were played, including city IDs, names, and state abbreviations.
- **MGameCities.csv / WGameCities.csv** – Maps each game to its city, starting from 2010.

#### **Ranking & Rating Systems**
- **MMasseyOrdinals.csv** – Weekly rankings of men's teams across various rating systems since 2003.

#### **Tournament Bracket Structure**
- **MNCAATourneySlots.csv / WNCAATourneySlots.csv** – Defines how teams progress through the tournament based on seed matchups.
- **MNCAATourneySeedRoundSlots.csv** – Maps tournament seeds to their expected bracket slots and game rounds (men only).

#### **Supplementary Data**
- **MTeamCoaches.csv** – Lists head coaches for teams per season, including mid-season changes.
- **Conferences.csv** – Contains NCAA conference names and abbreviations.
- **MTeamConferences.csv / WTeamConferences.csv** – Tracks which teams belonged to which conferences each season.
- **MConferenceTourneyGames.csv / WConferenceTourneyGames.csv** – Identifies games from conference tournaments before the NCAA tournament.
- **MSecondaryTourneyTeams.csv / WSecondaryTourneyTeams.csv** – Lists teams that participated in secondary postseason tournaments (e.g., NIT).
- **MSecondaryTourneyCompactResults.csv / WSecondaryTourneyCompactResults.csv** – Contains results for games in secondary tournaments.

#### **Data for Submissions**
- **SampleSubmissionStage1.csv / SampleSubmissionStage2.csv** – Example submission files showing expected format.
- **SeedBenchmarkStage1.csv** – Baseline model predictions based on seed matchups.

#### **Miscellaneous**
- **MTeamSpellings.csv / WTeamSpellings.csv** – Helps map external team name variations to standardized Team IDs.


## Organize data

In [2]:
import pandas as pd
import yaml

with open('config.yaml', 'r') as file:
    config_file = yaml.safe_load(file)
data_dir = config_file.get("data_dir")

def peek(file_name):
    df = pd.read_csv(f"{data_dir}/{file_name}.csv").head()
    return(df)

In [3]:
# Straight forward, mostly a lookup table. 
peek("/Kaggle/MTeams")

Unnamed: 0,TeamID,TeamName,FirstD1Season,LastD1Season
0,1101,Abilene Chr,2014,2025
1,1102,Air Force,1985,2025
2,1103,Akron,1985,2025
3,1104,Alabama,1985,2025
4,1105,Alabama A&M,2000,2025


In [4]:
# A bit confusing, will take a bit of work to get the seeds lined up, but maybed they've already done it for us
peek("/Kaggle/MNCAATourneySeeds")

Unnamed: 0,Season,Seed,TeamID
0,1985,W01,1207
1,1985,W02,1210
2,1985,W03,1228
3,1985,W04,1260
4,1985,W05,1374


In [5]:
## So this has a lot of potential. 
# You could get a better sense of a probability of a win based on team-by-team point differentials. 
# The model should also see how "good" the team is that season, based on this differential.
peek("/Kaggle/MRegularSeasonCompactResults")

Unnamed: 0,Season,DayNum,WTeamID,WScore,LTeamID,LScore,WLoc,NumOT
0,1985,20,1228,81,1328,64,N,0
1,1985,25,1106,77,1354,70,H,0
2,1985,25,1112,63,1223,56,H,0
3,1985,25,1165,70,1432,54,H,0
4,1985,25,1192,86,1447,74,H,0


In [6]:
# Exact same as above, but for the tournament.
# Still not sure how to use the daynumber to understand which game it is in the tournament.
# I'd like to include some "heating up" factor, becuase it seems teams go on runs, but maybe that's just a function of media
# aspirational if anything
peek("/Kaggle/MNCAATourneyCompactResults")

Unnamed: 0,Season,DayNum,WTeamID,WScore,LTeamID,LScore,WLoc,NumOT
0,1985,136,1116,63,1234,54,N,0
1,1985,136,1120,59,1345,58,N,0
2,1985,136,1207,68,1250,43,N,0
3,1985,136,1229,58,1425,55,N,0
4,1985,136,1242,49,1325,38,N,0


In [7]:
# More data than I know what to do with...
peek("/Kaggle/MRegularSeasonDetailedResults")

Unnamed: 0,Season,DayNum,WTeamID,WScore,LTeamID,LScore,WLoc,NumOT,WFGM,WFGA,...,LFGA3,LFTM,LFTA,LOR,LDR,LAst,LTO,LStl,LBlk,LPF
0,2003,10,1104,68,1328,62,N,0,27,58,...,10,16,22,10,22,8,18,9,2,20
1,2003,10,1272,70,1393,63,N,0,26,62,...,24,9,20,20,25,7,12,8,6,16
2,2003,11,1266,73,1437,61,N,0,24,58,...,26,14,23,31,22,9,12,2,5,23
3,2003,11,1296,56,1457,50,N,0,18,38,...,22,8,15,17,20,9,19,4,3,23
4,2003,11,1400,77,1208,71,N,0,30,61,...,16,17,27,21,15,12,10,7,1,14


### Actual EDA
On the topic of actual EDA and looking at the shape of the data, this person has already done a lot of it:
[BBall EDA Tutorial](https://www.kaggle.com/code/clehmann10/bball-eda-tutorial)

## Model Development!

I liked this approach to splitting up the submission to get the exact teams you'll need: [notebook](https://www.kaggle.com/code/paultimothymooney/simple-starter-notebook-for-march-mania-2025)

However, the model is very basic (just a function of difference_in_seed*.03+.5 to affect probability.), this [notebook is a bit better and shows building a random forest model](https://www.kaggle.com/code/jocelyndumlao/march-ml-mania-2025-brier-score-prediction)

In [8]:
import numpy as np

# Load the seeds, make them into one dataframe, and fill in the missing values with 0.05 (not sure why not 0)
# w_seed = pd.read_csv(f'{data_dir}/Kaggle/WNCAATourneySeeds.csv')
# m_seed = pd.read_csv(f'{data_dir}/Kaggle/MNCAATourneySeeds.csv')

# Import the submission file
submission_df = pd.read_csv(f'{data_dir}/Kaggle/SampleSubmissionStage2.csv')
submission_df.head()

Unnamed: 0,ID,Pred
0,2025_1101_1102,0.5
1,2025_1101_1103,0.5
2,2025_1101_1104,0.5
3,2025_1101_1105,0.5
4,2025_1101_1106,0.5


In [9]:
def extract_game_info(id_str):
    # Extract year and team_ids
    parts = id_str.split('_')
    year = int(parts[0])
    teamID1 = int(parts[1])
    teamID2 = int(parts[2])
    return year, teamID1, teamID2

def extract_seed_value(seed_str):
    # Extract seed value
    try:
        return int(seed_str[1:])
    # Set seed to 16 for unselected teams and errors
    except ValueError:
        return 16

In [10]:
# Reformat the data
submission_df[['Season', 'TeamID1', 'TeamID2']] = submission_df['ID'].apply(extract_game_info).tolist()

In [11]:
# Now we have the data in a more usable format, we can merge it with elo to get the expected outcome
submission_df

Unnamed: 0,ID,Pred,Season,TeamID1,TeamID2
0,2025_1101_1102,0.5,2025,1101,1102
1,2025_1101_1103,0.5,2025,1101,1103
2,2025_1101_1104,0.5,2025,1101,1104
3,2025_1101_1105,0.5,2025,1101,1105
4,2025_1101_1106,0.5,2025,1101,1106
...,...,...,...,...,...
131402,2025_3477_3479,0.5,2025,3477,3479
131403,2025_3477_3480,0.5,2025,3477,3480
131404,2025_3478_3479,0.5,2025,3478,3479
131405,2025_3478_3480,0.5,2025,3478,3480


## Get Nate Data
men: https://www.natesilver.net/p/sbcb-college-basketball-ratings-men

women: https://www.natesilver.net/p/sbcb-college-basketball-ratings-women

In [216]:
nate_men = pd.read_csv(f'{data_dir}/Nate/Mens.csv', index_col=0)
nate_women = pd.read_csv(f'{data_dir}/Nate/Womens.csv', index_col=0)
nate_men.head()

Unnamed: 0,Team,Conf.,Current Elo,Last,Season Min.,Season Max.,Home Court*
1,Duke,ACC,2089.303,+14 🟢@@13.77758789,1842.133,2089.303,83.04591
2,Houston,Big 12,2074.552,+7 🟢@@6.890441895,1863.668,2074.552,84.47469
3,Auburn,SEC,2049.707,-18 🟠@@-17.93383789,1863.583,2105.443,67.584
4,Florida,SEC,2040.319,+7 🟢@@6.597717285,1799.165,2040.319,62.82391
5,Alabama,SEC,2031.764,+18 🟢@@17.93389893,1903.559,2086.004,100.502


I'll need to match the nate table to the kaggle names, and I'll use fuzzywuzzy for that
Right now he only has mens teams, so for womens I'll need to derive ELO myself using a simple historic win-loss calculator I think

In [217]:
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

In [220]:
# Example of how this will work:
MTeamSpellings = pd.read_csv(f"{data_dir}/Kaggle/MTeamSpellings.csv")
Mchoices = list(MTeamSpellings["TeamNameSpelling"])
process.extractOne(nate_men['Team'][1], Mchoices)

('duke', 100)

In [None]:
# Make that into a function
def name_match(nate_df, names_df):
    ''' input should be nate's df and the path to ncaa name choices'''
    choices = list(names_df["TeamNameSpelling"])
    nate_df['kaggle_name'] = nate_df['Team'].apply(lambda x: process.extractOne(x, choices)[0])
    nate_df['fuzz_score'] = nate_df['Team'].apply(lambda x: process.extractOne(x, choices)[1])
    return nate_df

Match for men:
(These usually take about a minute to run)

In [None]:
MTeamSpellings = pd.read_csv(f"{data_dir}/Kaggle/MTeamSpellings.csv")
nate_men = name_match(nate_men, MTeamSpellings)
# check for erroneous matches by going worst to best
display(nate_men[['kaggle_name','Team','fuzz_score']].sort_values(by='fuzz_score', ascending=True))
# In my case, the only errors were that Miami University (OH), which got matches with 'uni'
# and saint francis in the nate elo is for PA, but it matches to ny
nate_men.loc[nate_men['kaggle_name'] == 'uni', 'kaggle_name'] = 'miami (oh)'
nate_men.loc[nate_men['kaggle_name'] == 'saint francis (ny)', 'kaggle_name'] = 'saint francis (pa)'

Unnamed: 0,kaggle_name,Team,fuzz_score
300,texas rio grande valley,UT Rio Grande Valley,88
211,umass,UMass (Amherst),90
167,miami (oh),Miami University (OH),90
215,queens (nc),Queens,90
364,mississippi valley state,Mississippi Valley St.,93
...,...,...,...
118,middle tennessee,Middle Tennessee,100
117,tcu,TCU,100
116,uab,UAB,100
114,furman,Furman,100


Match for women:

There are two teams fewer in Nate's ELO report for women

In [None]:
WTeamSpellings = pd.read_csv(f"{data_dir}/Kaggle/WTeamSpellings.csv")
nate_women = name_match(nate_women, WTeamSpellings)
# check for erroneous matches by going worst to best
display(nate_women[['kaggle_name','Team','fuzz_score']].sort_values(by='fuzz_score', ascending=True))
# In my case, the only error was Miami University (OH), which got matches with 'uni' again, same as men
nate_women.loc[nate_women['kaggle_name'] == 'uni', 'kaggle_name'] = 'miami (oh)'

Unnamed: 0,kaggle_name,Team,fuzz_score
267,texas rio grande valley,UT Rio Grande Valley,88
154,uni,Miami University (OH),90
153,umass,UMass (Amherst),90
317,queens (nc),Queens,90
334,southeast missouri state,Southeast Missouri St.,93
...,...,...,...
117,bowling green,Bowling Green,100
116,byu,BYU,100
115,rhode island,Rhode Island,100
135,georgetown,Georgetown,100


In [281]:
# Merge the two to get the team_id
M_merged = pd.merge(nate_men, MTeamSpellings, left_on='kaggle_name', right_on='TeamNameSpelling', how='left')
M_merged['gender'] = 'M'
W_merged = pd.merge(nate_women, WTeamSpellings, left_on='kaggle_name', right_on='TeamNameSpelling', how='left')
W_merged['gender'] = 'W'
# Merge the two dataframes to get all team_ids
All_merged = pd.concat([M_merged, W_merged], ignore_index=True)
All_merged.head()

Unnamed: 0,Team,Conf.,Current Elo,Last,Season Min.,Season Max.,Home Court*,kaggle_name,fuzz_score,TeamNameSpelling,TeamID,gender
0,Duke,ACC,2089.303,+14 🟢@@13.77758789,1842.133,2089.303,83.04591,duke,100,duke,1181,M
1,Houston,Big 12,2074.552,+7 🟢@@6.890441895,1863.668,2074.552,84.47469,houston,100,houston,1222,M
2,Auburn,SEC,2049.707,-18 🟠@@-17.93383789,1863.583,2105.443,67.584,auburn,100,auburn,1120,M
3,Florida,SEC,2040.319,+7 🟢@@6.597717285,1799.165,2040.319,62.82391,florida,100,florida,1196,M
4,Alabama,SEC,2031.764,+18 🟢@@17.93389893,1903.559,2086.004,100.502,alabama,100,alabama,1104,M


In [280]:
nate_men.query('Team == "Saint Francis"')

Unnamed: 0,Team,Conf.,Current Elo,Last,Season Min.,Season Max.,Home Court*,kaggle_name,fuzz_score
297,Saint Francis,NSEC,1295.311,+22 🟢@@21.66772461,1155.816,1295.311,44.21799,saint francis (pa),95


In [278]:
MTeamSpellings.query('TeamNameSpelling == "saint francis (pa)"')

Unnamed: 0,TeamNameSpelling,TeamID
790,saint francis (pa),1384


## sample submission
Now let's do a simple submission based on just nates elo scores

#### We need to join Current Elo with the submissions table, and calculate the expected outcome

In [292]:
import warnings

warnings.filterwarnings('ignore')
# Create a dictionary for quick lookup of ELO ratings by TeamID
elo_dict = All_merged.set_index('TeamID')['Current Elo'].to_dict()

# Map the ELO ratings to the TeamID1 column in the submission_df
submission_df['TeamID1_Elo'] = submission_df['TeamID1'].map(elo_dict)
submission_df['TeamID2_Elo'] = submission_df['TeamID2'].map(elo_dict)

# Fill missing values with 9999 - these would be teams that aren't in the nate database of mismatches in names
submission_df['TeamID1_Elo'].fillna(9999, inplace=True)
submission_df['TeamID2_Elo'].fillna(9999, inplace=True)

# Check the result, this should be 0
assert len(submission_df.query('TeamID1_Elo == 9999 or TeamID2_Elo == 9999')) == 0, "There are teams with missing ELO ratings"

In [293]:
# This should be replaced with goto conversion thing
def calc_elo_win(A, B):
    awin = 1 / (1 + 10**( (B - A) / 400))
    return(awin)
submission_df['Team1_win_prob'] = submission_df.apply(lambda x: calc_elo_win(x['TeamID1_Elo'], x['TeamID2_Elo']), axis=1)

In [307]:
finalsub = submission_df[['ID', 'Team1_win_prob']]
readable_finalsub = submission_df[['ID','TeamID1', 'TeamID2', 'TeamID1_Elo', 'TeamID2_Elo', 'Team1_win_prob']]
readable_finalsub['TeamName1'] = readable_finalsub['TeamID1'].map(All_merged.set_index('TeamID')['TeamNameSpelling'].to_dict())
readable_finalsub['TeamName2'] = readable_finalsub['TeamID2'].map(All_merged.set_index('TeamID')['TeamNameSpelling'].to_dict())
readable_finalsub.drop(['TeamID1', 'TeamID2'], axis=1, inplace=True)
readable_finalsub[['ID', 'TeamName1', 'TeamName2', 'TeamID1_Elo', 'TeamID2_Elo', 'Team1_win_prob']].head()

Unnamed: 0,ID,TeamName1,TeamName2,TeamID1_Elo,TeamID2_Elo,Team1_win_prob
0,2025_1101_1102,abilene christian,air force,1482.911,1267.39,0.775675
1,2025_1101_1103,abilene christian,akron,1482.911,1635.409,0.293624
2,2025_1101_1104,abilene christian,alabama,1482.911,2031.764,0.04072
3,2025_1101_1105,abilene christian,alabama a&m,1482.911,1109.851,0.895435
4,2025_1101_1106,abilene christian,alabama st,1482.911,1321.198,0.717257


Double check that all the teams are there

In [24]:
import duckdb as db

In [None]:
mensids = db.sql('FROM "./SourceData/Kaggle/MTeams.csv" WHERE LastD1Season = 2025').to_df()
womensids = db.sql('FROM "./SourceData/Kaggle/WTeams.csv" ').to_df()

In [70]:
unique_team_ids = pd.unique(submission_df[['TeamID1', 'TeamID2']].values.ravel('K'))
# Combine mensids and womensids into a single array
all_known_ids = np.concatenate((mensids['TeamID'].values, womensids['TeamID'].values))

# Find the unique team IDs that are not in all_known_ids
unknown_ids = np.setdiff1d(all_known_ids, unique_team_ids)
# How many of the womens teams id are in the final submission?
print("missing womens teams = ", len(np.intersect1d(womensids['TeamID'].values, unique_team_ids))-len(womensids['TeamID'].values))
print ("missing mens teams = ", len(np.intersect1d(mensids['TeamID'].values, unique_team_ids))-len(mensids['TeamID'].values))

missing womens teams =  -16
missing mens teams =  0


There are 16 teams missing in the final submission, all womens teams:

In [None]:
db.sql("SELECT * FROM womensids WHERE TeamID IN (FROM unknown_ids)")

┌────────┬────────────────┐
│ TeamID │    TeamName    │
│ int64  │    varchar     │
├────────┼────────────────┤
│   3109 │ Alliant Intl   │
│   3118 │ Armstrong St   │
│   3121 │ Augusta        │
│   3128 │ Birmingham So  │
│   3134 │ Brooklyn       │
│   3147 │ Centenary      │
│   3215 │ Hardin-Simmons │
│   3216 │ Hartford       │
│   3289 │ Morris Brown   │
│   3302 │ NE Illinois    │
│   3327 │ Okla City      │
│   3366 │ Savannah St    │
│   3383 │ St Francis NY  │
│   3432 │ Utica          │
│   3445 │ W Salem St     │
│   3446 │ W Texas A&M    │
├────────┴────────────────┤
│ 16 rows       2 columns │
└─────────────────────────┘

## ELO women
I'll need elo scores for women, which I'll do now following the basic guides of what Silver did for NBA: https://fivethirtyeight.com/features/how-we-calculate-nba-elo-ratings/

In [None]:
# Load the WRegularSeasonCompactResults.csv file
womens_results = pd.read_csv(f'{data_dir}/Kaggle/WRegularSeasonCompactResults.csv')

Unnamed: 0,TeamID,Elo
0,3101,1444.999172
1,3102,1442.118994
2,3103,1369.688215
3,3104,1951.949269
4,3105,1329.605468
...,...,...
373,3476,1207.995909
374,3477,1207.217733
375,3478,1288.593764
376,3479,1316.821760


In [None]:
def update_elo(winner_elo, loser_elo, k=30):
    expected_win = 1 / (1 + 10**((loser_elo - winner_elo) / 400))
    new_winner_elo = winner_elo + k * (1 - expected_win)
    new_loser_elo = loser_elo - k * (1 - expected_win)
    return new_winner_elo, new_loser_elo

seasons_array = sorted(womens_results['Season'].unique())
initial_elo = 1500
elo_ratings = {team_id: initial_elo for team_id in womensids['TeamID'].unique()}

for i in seasons_array:
    womens_results_season = womens_results[womens_results['Season'] == i]
    # print(i)
    for index, row in womens_results_season.iterrows():
        winner = row['WTeamID']
        loser = row['LTeamID']
        if row['WLoc'] == 'H':
            winner_elo = elo_ratings[winner] + 100
        elif row['WLoc'] == 'A':
            loser_elo = elo_ratings[loser] + 100
            
        winner_elo = elo_ratings[winner]
        loser_elo = elo_ratings[loser]
        new_winner_elo, new_loser_elo = update_elo(winner_elo, loser_elo)
        elo_ratings[winner] = new_winner_elo
        elo_ratings[loser] = new_loser_elo
        # mean_elo = np.mean(list(elo_ratings.values())) # This will always be 1300 because it's 0 sum
    elo_ratings = {team_id: 0.75 * elo + 0.25 * initial_elo for team_id, elo in elo_ratings.items()}
df = pd.DataFrame(list(elo_ratings.items()), columns=['TeamID', 'Elo'])


1998
1999
2000
2001
2002
2003
2004
2005
2006
2007
2008
2009
2010
2011
2012
2013
2014
2015
2016
2017
2018
2019
2020
2021
2022
2023
2024
2025


In [206]:
df.join(womensids.set_index('TeamID'), on='TeamID').sort_values(by='Elo', ascending=False)

Unnamed: 0,TeamID,Elo,TeamName
274,3376,1864.633298,South Carolina
315,3417,1822.938338,UCLA
298,3400,1821.401839,Texas
61,3163,1817.199989,Connecticut
323,3425,1799.645810,USC
...,...,...,...
65,3167,1260.018314,CS Bakersfield
51,3152,1250.008398,Chicago St
73,3175,1237.199027,Delaware St
319,3421,1222.529781,UNC Asheville


In [207]:
print("Womens elo mean: " , df['Elo'].mean())
print("Nates elo mean: " ,nate['Current Elo'].mean())
# close enough

Womens elo mean:  1500.0
Nates elo mean:  1499.899613186813


In [208]:
womens_results_season

Unnamed: 0,Season,DayNum,WTeamID,WScore,LTeamID,LScore,WLoc,NumOT
136612,2025,120,3129,72,3429,59,H,0
136613,2025,120,3161,90,3363,70,A,0
136614,2025,120,3178,73,3454,62,H,0
136615,2025,120,3194,69,3272,62,H,0
136616,2025,120,3204,70,3379,59,N,0
...,...,...,...,...,...,...,...,...
131683,2025,0,3449,95,3370,53,H,0
131684,2025,0,3450,83,3186,82,H,1
131685,2025,0,3463,64,3284,61,H,0
131686,2025,0,3466,66,3404,63,A,0


Calculate home field values
Nate in his 2015 blog uses a value of 100 for the NBA, but the mean value for the NCAA mens is ~50. I may try and calculate it by adjusting each time a team wins or loses.

In [214]:
nate['Home Court*'].mean()

np.float64(53.979152260989004)

In [310]:
nate_women['Current Elo'].mean()

np.float64(1500.2197309392266)

In [311]:
df['Elo'].mean()

np.float64(1500.0)