Baseball Reference Pitcher WAR #9

Closed
jfreynolds opened this issue Dec 6, 2017 · 11 comments

Comments

@jfreynolds
Contributor

Are there any plans to add WAR or any stats from the Player Value tables on Baseball Reference to pitching_stats_bref(season)?

I was looking to find the largest difference between bWAR and fWAR for pitchers, but I can't do that without a WAR column in the dataframe returned by pitching_stats_bref(season). Were there issues obtaining that data, or was it just never implemented?
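For concreteness, the comparison I have in mind could be sketched like this once both WAR values are available. The function name and the player keys are made up for illustration, and the numbers are placeholders rather than real stats.

```python
def largest_war_gaps(bwar, fwar, n=5):
    """Return the n players with the largest |bWAR - fWAR|, biggest first.

    Both arguments are hypothetical mappings of player id -> WAR.
    Players missing from either mapping are skipped.
    """
    shared = bwar.keys() & fwar.keys()
    gaps = [(player, bwar[player] - fwar[player]) for player in shared]
    # Sort by absolute gap (descending), then by id so ties are stable.
    gaps.sort(key=lambda item: (-abs(item[1]), item[0]))
    return gaps[:n]


# Placeholder values, not real player data.
bwar = {"pitcher_a": 5.0, "pitcher_b": 2.5, "pitcher_c": 1.0}
fwar = {"pitcher_a": 4.5, "pitcher_b": 4.5, "pitcher_d": 3.0}
print(largest_war_gaps(bwar, fwar, n=2))
```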

@jldbc
Owner

jldbc commented Dec 7, 2017

Good question. I didn't realize that wasn't included there. It's also missing in batting_stats_bref(season). The tables these scrape from don't include bWAR, but I'm open to moving these functions over to a better table.

They're currently using Baseball Reference's Daily Gamelog Finder, where batting_stats_bref(season) supplies a season-length date range to batting_stats_range(start_dt,end_dt). Know of a better table that includes WAR + all the other standard stats for each player?
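Roughly, the delegation looks like the sketch below. This is not the library's actual code: the month/day bounds are placeholder assumptions, and `range_fn` is a stand-in for batting_stats_range so the example runs without hitting the network.

```python
from datetime import date


def season_bounds(season, start=(3, 1), end=(11, 30)):
    """Return (start_dt, end_dt) ISO date strings spanning one season.

    The default month/day pairs are assumptions chosen for illustration.
    """
    return (date(season, *start).isoformat(), date(season, *end).isoformat())


def stats_for_season(season, range_fn):
    """Turn a season into a date range and hand it to a range-based fetcher."""
    start_dt, end_dt = season_bounds(season)
    return range_fn(start_dt, end_dt)


def fake_range(start_dt, end_dt):
    """Stand-in fetcher that only reports the range it was asked for."""
    return f"stats from {start_dt} to {end_dt}"


print(stats_for_season(2017, fake_range))
```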

@trojanguard25

I'm not aware of any way to query bWAR over a date range. Baseball-reference hosts files that include player bWAR (among other stats that go into their WAR calculations) for batters and pitchers. Every player has an entry broken up by year-team-stint.

http://www.baseball-reference.com/data/war_daily_bat.txt
http://www.baseball-reference.com/data/war_daily_pitch.txt

These files are updated daily during the season, as well as during the offseason whenever they make stat adjustments.

I don't think there are analogous files for traditional counting stats, so you will probably need separate interfaces.
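A minimal sketch of reading one of these files, assuming it is comma-separated with a header row and that missing values appear as the string "NULL" (both are assumptions worth checking against the live file). The sample text and the function names are made up; only the standard library is used so the example runs as-is.

```python
import csv
import io
import urllib.request

WAR_PITCH_URL = "http://www.baseball-reference.com/data/war_daily_pitch.txt"

# Made-up rows in the assumed layout: one row per player-year-team-stint.
SAMPLE = """name_common,player_ID,year_ID,team_ID,stint_ID,WAR
Example Pitcher,examp01,2017,AAA,1,1.5
Example Pitcher,examp01,2017,BBB,2,0.7
Other Pitcher,other01,2017,AAA,1,NULL
"""


def fetch_text(url=WAR_PITCH_URL):
    """Download the file as text (not called here, to avoid network access)."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")


def parse_war_file(text):
    """Parse WAR-file text into a list of dicts, one per stint row."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        war = row["WAR"]
        # Assumption: empty or "NULL" means the value is missing.
        row["WAR"] = None if war in ("", "NULL") else float(war)
        row["year_ID"] = int(row["year_ID"])
        row["stint_ID"] = int(row["stint_ID"])
        rows.append(row)
    return rows


rows = parse_war_file(SAMPLE)
print(len(rows), rows[0]["WAR"], rows[2]["WAR"])
```

With network access, `parse_war_file(fetch_text())` would be the equivalent call on the real file.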

@jfreynolds
Contributor Author

My initial thought was to get a player's Baseball Reference ID and just scrape the table from their actual player page. You could sum up something like WAR for a given range, but that feels like more of a band-aid fix, and it wouldn't work for aggregating other values over a range.

On top of that, it would be pretty slow all in all.

@jldbc
Owner

jldbc commented Dec 8, 2017

I can't see fetching WAR one player at a time scaling well beyond a small number of players.

The data @trojanguard25 mentioned look promising. If there's no single source with WAR and traditional stats side by side, a separate scrape for pulling this data might be the best route forward. From there a user can join the tables together on player id if necessary.
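The join step could be sketched as below. Rows are plain dicts here to keep the example dependency-free; with DataFrames the same thing would be a merge on the key columns. The key names and the sample values are assumptions, and the WAR side is assumed to already have one row per key.

```python
def join_on_player(stats_rows, war_rows, keys=("player_ID", "year_ID")):
    """Inner-join two lists of dicts on the given key columns."""
    lookup = {}
    for war_row in war_rows:
        key = tuple(war_row[k] for k in keys)
        if key in lookup:
            # Multiple stints per season would silently overwrite each other.
            raise ValueError(f"duplicate WAR row for {key}; aggregate first")
        lookup[key] = war_row

    joined = []
    for stats_row in stats_rows:
        match = lookup.get(tuple(stats_row[k] for k in keys))
        if match is not None:
            joined.append({**stats_row, **match})
    return joined


# Placeholder rows, not real player data.
stats = [
    {"player_ID": "examp01", "year_ID": 2017, "ERA": 3.2},
    {"player_ID": "other01", "year_ID": 2017, "ERA": 4.1},
]
war = [{"player_ID": "examp01", "year_ID": 2017, "WAR": 2.0}]
print(join_on_player(stats, war))
```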

Thoughts/objections?

@jfreynolds jfreynolds reopened this Dec 8, 2017
@jfreynolds
Contributor Author

The daily batting/pitching files seem to be the best option available.

Should all the data from those files be provided in a table to a user by default? Seems like there is a lot in there that isn't regularly sought after. Maybe by default they get the more common statistics from that file (WAR, salary, WAA, ERA+, etc.), and if a boolean is set to true, they get all the data available?

@jldbc
Owner

jldbc commented Dec 9, 2017

Most of these could be left out by default since the main point of this is to get WAR. Returning all 49 columns by default might be overkill.

Bare minimum would be WAR, its essential components (WAA and WAR_rep for batters; WAA, WAR_rep, and WAA_adj for pitchers), and everything needed to identify the player and connect them with another table. I think this would mean WAR, WAR_off, WAR_def, WAR_rep, WAA, mlb_ID, player_ID, team_ID, year_ID, stint_ID for both, plus WAA_adj for pitching, unless I'm missing anything.

On top of these it might get a bit arbitrary to decide what to leave in by default. Is anything else important to keep in or should the rest be optional with something along the lines of a boolean return_all parameter? Maybe G for both and GS for pitchers since these are common things people might filter on?

@jfreynolds
Contributor Author

Definitely agree that WAR values should be the default. The only other things that jump out to me as likely to be frequently requested are ERA+, salary, or even BIP. Other than that, nothing really strikes me.

So it seems like the best idea would be to provide WAR and its components by default, maybe allow an argument for some more commonly used columns such as ERA+, salary, RA, xRA, RAA, BIP, etc., and finally a return_all parameter like you said to return all of the columns.

I just think occasionally people may want a select few values outside of WAR, and forcing them to read all of the columns seems like unnecessary overhead. Should we just keep it simple though? WAR and its components by default, or all columns if some boolean argument is true?

@jldbc
Owner

jldbc commented Dec 10, 2017

Yeah we can keep some of the more commonly used ones in. For non-WAR, non-identification columns of interest I'm seeing:

Batting: salary, G, PA, runs_above_avg, runs_above_avg_off, runs_above_avg_def
Pitching: G, GS, RA, xRA, BIP, BIP_perc, salary, ERA_plus

Which all in all would have these as the defaults:

Batting: ['name_common', 'mlb_ID', 'player_ID', 'year_ID', 'team_ID', 'stint_ID', 'lg_ID', 'pitcher', 'G', 'PA', 'salary', 'runs_above_avg', 'runs_above_avg_off', 'runs_above_avg_def', 'WAR_rep', 'WAA', 'WAR']
Pitching: ['name_common', 'mlb_ID', 'player_ID', 'year_ID', 'team_ID', 'stint_ID', 'lg_ID', 'G', 'GS', 'RA', 'xRA', 'BIP', 'BIP_perc', 'salary', 'ERA_plus', 'WAR_rep', 'WAA', 'WAA_adj', 'WAR']

With everything else being retrievable with a return_all type of parameter. Anything important I missed? This leaves ~ 20 columns each which seems reasonable.
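As a rough sketch of the filtering, using the pitching defaults listed above: `select_columns` is a made-up helper name, rows are plain dicts to keep the example dependency-free, and this is not the committed implementation.

```python
PITCHING_DEFAULTS = [
    'name_common', 'mlb_ID', 'player_ID', 'year_ID', 'team_ID', 'stint_ID',
    'lg_ID', 'G', 'GS', 'RA', 'xRA', 'BIP', 'BIP_perc', 'salary',
    'ERA_plus', 'WAR_rep', 'WAA', 'WAA_adj', 'WAR',
]


def select_columns(rows, return_all=False, columns=PITCHING_DEFAULTS):
    """Keep only the default columns unless return_all is set."""
    if return_all:
        return [dict(row) for row in rows]
    # Preserve the order of `columns`; skip any column a row lacks.
    return [{c: row[c] for c in columns if c in row} for row in rows]


# One placeholder row: every default column plus an extra 'age' field.
row = {c: 0 for c in PITCHING_DEFAULTS}
row["age"] = 30

trimmed = select_columns([row])
everything = select_columns([row], return_all=True)
print(len(trimmed[0]), len(everything[0]))
```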

The function itself would basically be the top response to this Stack Overflow post with the above column filtering.

@trojanguard25

I think there should be some default 'groupby' that is done to combine player rows for the same year. I committed a potential option in my fork: https://github.com/trojanguard25/pybaseball/commits/cache
This function returns all the columns for a single season. By default, it groups the rows so each player has a single entry for the year submitted. I also added an option to split each player by team. I think those are the two most common use-cases. This does cause a problem since some of the columns (like ERA+) cannot be summed or averaged; rather, they need to be weighted by playing time. Not exactly sure the best way to handle that correctly.

@jldbc
Owner

jldbc commented Dec 16, 2017

Let's leave the groupby in the hands of the user for now since doing it for them without using proper weights might cause people to unknowingly use bad data (i.e. using a summed/averaged ERA+ without realizing it's not weighted).

I pushed the version I've been using to a new branch in 7b10b82. I'll merge later today if there aren't any objections.

It's probably worth opening a new issue for working on properly-weighted aggregations since it definitely would be useful to have.
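As a starting point for that discussion, a user-side aggregation might look like the sketch below: counting values such as WAR are summed across stints, while ERA+ is averaged with playing-time weights. The `IPouts` weight column is an assumption about what the file provides, the sample values are placeholders, and an innings-weighted mean is only an approximation of a properly recomputed season ERA+.

```python
from collections import defaultdict


def aggregate_by_season(rows):
    """Collapse stint rows to one entry per (player_ID, year_ID)."""
    totals = defaultdict(lambda: {"WAR": 0.0, "IPouts": 0, "weighted": 0.0})
    for row in rows:
        entry = totals[(row["player_ID"], row["year_ID"])]
        entry["WAR"] += row["WAR"]          # additive, so a plain sum is fine
        entry["IPouts"] += row["IPouts"]
        entry["weighted"] += row["ERA_plus"] * row["IPouts"]

    result = {}
    for key, entry in totals.items():
        outs = entry["IPouts"]
        result[key] = {
            "WAR": entry["WAR"],
            "IPouts": outs,
            # Weighted mean; undefined if the player recorded no outs.
            "ERA_plus": entry["weighted"] / outs if outs else None,
        }
    return result


# Placeholder stints for one pitcher traded mid-season.
stints = [
    {"player_ID": "examp01", "year_ID": 2017, "WAR": 1.5, "IPouts": 300, "ERA_plus": 120},
    {"player_ID": "examp01", "year_ID": 2017, "WAR": 0.5, "IPouts": 100, "ERA_plus": 80},
]
season = aggregate_by_season(stints)
print(season[("examp01", 2017)])
```

A plain average of the two ERA+ values here would give 100, while the weighted version gives 110, which is the kind of silent error the unweighted groupby would introduce.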

@jldbc
Owner

jldbc commented Dec 17, 2017

Merged branch bwar to master. Commit 7b10b82 adds a bwar_bat() and bwar_pitch() function, each with the optional argument return_all to retrieve all fields.

@jldbc jldbc closed this as completed Dec 17, 2017
schorrm pushed a commit that referenced this issue Sep 11, 2020
3 participants