Baseball Reference Pitcher WAR #9
Comments
Good question. I didn't realize that wasn't included there. It's also missing in They're currently using Baseball Reference's Daily Gamelog Finder, where
I'm not aware of any way to query bWAR over a date range. Baseball Reference hosts files that include player bWAR (among other stats that go into their WAR calculations) for batters and pitchers. Every player has an entry broken up by year-team-stint. http://www.baseball-reference.com/data/war_daily_bat.txt These files are updated daily during the season, as well as during the offseason whenever they make stat adjustments. I don't think there are analogous files for traditional counting stats, so you will probably need separate interfaces.
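A minimal sketch of parsing one of these files with pandas, assuming the file is comma-separated text with a header row. The function name `parse_war_file` is hypothetical, and the inline sample rows are made up to stand in for the downloaded contents.

```python
import io

import pandas as pd


def parse_war_file(text):
    """Parse the comma-separated contents of a bWAR file into a DataFrame.

    Hypothetical helper for illustration; pandas treats the literal
    string NULL as a missing value by default.
    """
    return pd.read_csv(io.StringIO(text))


# Made-up sample standing in for the downloaded file contents.
sample = """name_common,player_ID,year_ID,team_ID,stint_ID,WAR
Example Player,examp01,2016,AAA,1,1.5
Example Player,examp01,2016,BBB,2,NULL
"""

war = parse_war_file(sample)
print(war[["player_ID", "year_ID", "team_ID", "stint_ID", "WAR"]])
```

For the real file, the text could be fetched first (for example with `requests.get(url).text`) and passed to the same function, or the URL could be handed to `pd.read_csv` directly.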
My initial thought was to get a player's Baseball Reference ID and just scrape the table from their actual player page. You could sum up something like WAR for a given range, but that feels like more of a band-aid fix. It wouldn't work for aggregating other values over a range. On top of that, it would be pretty slow overall.
I can't see fetching WAR one player at a time scaling well beyond a small number of players. The data @trojanguard25 mentioned look promising. If there's no single source with WAR and traditional stats side by side, a separate scrape for pulling this data might be the best route forward. From there a user can join the tables together on player id if necessary. Thoughts/objections?
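A rough sketch of the join described above, assuming both tables carry the same `player_ID` and `year_ID` keys (an assumption: a traditional-stats table may need an ID mapping first). All rows are made up.

```python
import pandas as pd

# Made-up WAR rows keyed by player and season.
war = pd.DataFrame({
    "player_ID": ["examp01", "examp02"],
    "year_ID": [2016, 2016],
    "WAR": [2.1, 0.4],
})

# Made-up traditional stats, assumed to share the same keys.
stats = pd.DataFrame({
    "player_ID": ["examp01", "examp02"],
    "year_ID": [2016, 2016],
    "IP": [180.0, 65.0],
})

# A left join keeps every row of the stats table, even without a WAR match.
combined = stats.merge(war, on=["player_ID", "year_ID"], how="left")
print(combined)
```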
The daily batting/pitching files seem to be the best option available. Should all the data from those files be provided in a table to a user by default? It seems like there is a lot in there that isn't regularly sought after. Maybe by default the user gets the more common statistics (WAR, salary, WAA, ERA+, etc.) from that file, and if a boolean is set to true, all the available data is provided?
Most of these could be left out by default since the main point of this is to get WAR. Returning all 49 columns by default might be overkill. The bare minimum would be WAR, its essential components (WAA and WAR_rep for batters; WAA, WAR_rep, and WAA_adj for pitchers), and everything needed to identify the player and connect them with another table. I think this would mean WAR, WAR_off, WAR_def, WAR_rep, WAA, mlb_ID, player_ID, team_ID, year_ID, stint_ID for both, plus WAA_adj for pitching, unless I'm missing anything. On top of these it might get a bit arbitrary to decide what to leave in by default. Is anything else important to keep in, or should the rest be optional with something along the lines of a boolean return_all parameter? Maybe G for both and GS for pitchers, since these are common things people might filter on?
Definitely agree that WAR values should be the default. The only other things that jump out to me as likely to be frequently requested are ERA+, salary, or even BIP. Other than that, nothing really strikes me. So it seems like the best idea would be to provide WAR and its components by default, maybe allow an argument for some more commonly used columns such as ERA+, salary, RA, xRA, RAA, BIP, etc., and finally a return_all parameter like you said to return all of the columns. I just think people may occasionally want a select few values outside of WAR, and forcing them to read all of the columns seems like unnecessary overhead. Should we just keep it simple though? WAR and its components, or if some boolean argument is true, return all columns?
Yeah, we can keep some of the more commonly used ones in. For non-WAR, non-identification columns of interest I'm seeing:
Batting: salary, G, PA, runs_above_avg, runs_above_avg_off, runs_above_avg_def
Which all in all would have these as the defaults:
Batting: ['name_common', 'mlb_ID', 'player_ID', 'year_ID', 'team_ID', 'stint_ID', 'lg_ID', 'pitcher', 'G', 'PA', 'salary', 'runs_above_avg', 'runs_above_avg_off', 'runs_above_avg_def', 'WAR_rep', 'WAA', 'WAR']
With everything else being retrievable with a return_all type of parameter. Anything important I missed? This leaves ~20 columns each, which seems reasonable. The function itself would basically be the top response to this Stack Overflow post with the above column filtering.
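The column filtering proposed here could look roughly like the sketch below. The function name `select_batting_columns` and the constant name are hypothetical; only the column list and the idea of a `return_all` flag come from the discussion, and the sample frame is made up.

```python
import pandas as pd

# Default batting columns as proposed in the comment above.
BATTING_DEFAULTS = [
    'name_common', 'mlb_ID', 'player_ID', 'year_ID', 'team_ID', 'stint_ID',
    'lg_ID', 'pitcher', 'G', 'PA', 'salary', 'runs_above_avg',
    'runs_above_avg_off', 'runs_above_avg_def', 'WAR_rep', 'WAA', 'WAR',
]


def select_batting_columns(df, return_all=False):
    """Return every column when return_all is True, else only the defaults.

    Hypothetical helper; defaults missing from df are skipped silently.
    """
    if return_all:
        return df
    keep = [col for col in BATTING_DEFAULTS if col in df.columns]
    return df[keep]


# Made-up frame with one column outside the default list.
df = pd.DataFrame({
    "name_common": ["Example Player"],
    "player_ID": ["examp01"],
    "WAR": [1.5],
    "extra_col": [3],
})

default_cols = select_batting_columns(df).columns.tolist()
all_cols = select_batting_columns(df, return_all=True).columns.tolist()
print(default_cols)
print(all_cols)
```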
I think there should be some default 'groupby' that is done to combine player rows for the same year. I committed a potential option in my fork: https://github.com/trojanguard25/pybaseball/commits/cache
Let's leave the groupby in the hands of the user for now, since doing it for them without using proper weights might cause people to unknowingly use bad data (e.g. using a summed/averaged ERA+ without realizing it's not weighted). I pushed the version I've been using to a new branch in 7b10b82. I'll merge later today if there aren't any objections. It's probably worth opening a new issue for working on properly-weighted aggregations, since it definitely would be useful to have.
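To illustrate the weighting concern, here is a small sketch with made-up stint rows. It assumes columns named `IPouts` and `ERA_plus`, treats WAR as additive across stints, and weights the rate stat by outs recorded. The weighting is only an approximation chosen for illustration, not an exact recomputation of ERA+.

```python
import pandas as pd

# Made-up rows: one pitcher, two stints in the same season.
stints = pd.DataFrame({
    "player_ID": ["examp01", "examp01"],
    "year_ID": [2016, 2016],
    "stint_ID": [1, 2],
    "IPouts": [300, 100],
    "ERA_plus": [120.0, 80.0],
    "WAR": [2.0, -0.5],
})

# Pre-multiply the rate stat by its weight so a plain sum can be used per group.
stints["era_plus_x_outs"] = stints["ERA_plus"] * stints["IPouts"]

season = stints.groupby(["player_ID", "year_ID"]).agg(
    WAR=("WAR", "sum"),                    # treated as additive across stints
    IPouts=("IPouts", "sum"),
    ERA_plus_naive=("ERA_plus", "mean"),   # unweighted, potentially misleading
    weighted_sum=("era_plus_x_outs", "sum"),
)
season["ERA_plus_weighted"] = season["weighted_sum"] / season["IPouts"]
print(season[["WAR", "ERA_plus_naive", "ERA_plus_weighted"]])
```

With these made-up numbers the naive mean and the outs-weighted mean differ, which is the kind of silent discrepancy a default groupby could introduce.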
Merged branch bwar to master. Commit 7b10b82 adds a |
Are there any plans to add WAR or any stats from the Player Value tables on Baseball Reference to pitching_stats_bref(season)? I was looking to find the largest difference between bWAR and fWAR for pitchers, but I am unable to without a WAR column in the dataframe returned by pitching_stats_bref(season). Were there issues in obtaining that data, or was it just never implemented?