Baseball Reference Pitcher WAR #9

jfreynolds · 2017-12-06T23:24:35Z

Are there any plans to add WAR or any stats from the Player Value tables on Baseball Reference to pitching_stats_bref(season)?

I was looking to find the largest difference between bWAR and fWAR for pitchers, but I am unable to without a WAR column in the dataframe that returns from pitching_stats_bref(season). Were there issues in obtaining that data or just never implemented?


jldbc · 2017-12-07T05:00:48Z

Good question. I didn't realize that wasn't included there. It's also missing in batting_stats_bref(season). The tables these scrape from don't include bWAR, but I'm open to moving these functions over to a better table.

They're currently using Baseball Reference's Daily Gamelog Finder, where batting_stats_bref(season) supplies a season-length date range to batting_stats_range(start_dt,end_dt). Know of a better table that includes WAR + all the other standard stats for each player?
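As a rough illustration of the season-to-date-range idea described above, here is a minimal sketch. The helper name `season_date_range` and the March 1 to November 30 bounds are assumptions made for this example, not the values the library actually uses.

```python
from datetime import date
from typing import Tuple


def season_date_range(season: int) -> Tuple[str, str]:
    """Return (start_dt, end_dt) strings that span one season.

    The bounds are hypothetical: March 1 through November 30 is simply
    a window wide enough to cover a regular season and postseason.
    """
    start = date(season, 3, 1)
    end = date(season, 11, 30)
    return start.isoformat(), end.isoformat()


start_dt, end_dt = season_date_range(2017)
print(start_dt, end_dt)
```

The resulting pair could then be handed to a range-based function such as batting_stats_range(start_dt, end_dt).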

trojanguard25 · 2017-12-07T13:27:19Z

I'm not aware of any way to query bWAR over a date range. Baseball-reference hosts files that include player bWAR (among other stats that go into their WAR calculations) for batters and pitchers. Every player has an entry broken up by year-team-stint.

http://www.baseball-reference.com/data/war_daily_bat.txt
http://www.baseball-reference.com/data/war_daily_pitch.txt

These files are updated daily during the season, as well as during the offseason whenever they make stat adjustments.

I don't think there are analogous files for traditional counting stats, so you will probably need separate interfaces.
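A minimal sketch of reading one of these files, assuming they are comma-separated text with a header row. The function names are hypothetical, the sample rows are made up purely to exercise the parser, and only the standard library is used to keep the example dependency-free.

```python
import csv
import io
import urllib.request

WAR_PITCH_URL = "http://www.baseball-reference.com/data/war_daily_pitch.txt"


def parse_war_text(text):
    """Parse comma-separated text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text)))


def fetch_war_rows(url=WAR_PITCH_URL):
    """Download a WAR file and parse it (requires network access)."""
    with urllib.request.urlopen(url) as resp:
        text = resp.read().decode("utf-8")
    return parse_war_text(text)


# Made-up rows, only to show the year-team-stint layout described above.
sample = (
    "player_ID,year_ID,team_ID,stint_ID,WAR\n"
    "example01,2017,AAA,1,1.5\n"
    "example01,2017,BBB,2,0.5\n"
)
rows = parse_war_text(sample)
print(len(rows), rows[0]["player_ID"])
```

All values come back as strings here, so numeric columns such as WAR would need converting before any arithmetic.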

jfreynolds · 2017-12-07T15:50:10Z

My initial thought was to get a player's Baseball Reference ID and just scrape the table from their actual player page. You could sum up something like WAR for a given range, but that feels like more of a band-aid fix, and it wouldn't work for aggregating other values over a range.

On top of that, it would be pretty slow all in all.

jldbc · 2017-12-08T06:17:05Z

I can't see fetching WAR one player at a time scaling well beyond a small number of players.

The data @trojanguard25 mentioned look promising. If there's no single source with WAR and traditional stats side by side, a separate scrape for pulling this data might be the best route forward. From there a user can join the tables together on player id if necessary.

Thoughts/objections?
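The join on player id suggested above could look roughly like this. It is a sketch under assumptions: both tables are lists of dicts, they share the key columns named in `key_fields`, and the WAR table has at most one row per key. The helper name and the sample values are made up.

```python
def join_on_key(left, right, key_fields):
    """Inner-join two lists of dicts on the given key columns.

    Assumes `right` has at most one row per key; on a column-name
    clash the value from `left` wins.
    """
    index = {tuple(row[k] for k in key_fields): row for row in right}
    joined = []
    for row in left:
        match = index.get(tuple(row[k] for k in key_fields))
        if match is not None:
            joined.append({**match, **row})
    return joined


# Made-up rows: one traditional-stats table and one WAR table.
stats = [{"player_ID": "example01", "year_ID": "2017", "SO": "100"}]
war = [
    {"player_ID": "example01", "year_ID": "2017", "WAR": "2.0"},
    {"player_ID": "example02", "year_ID": "2017", "WAR": "1.0"},
]
merged = join_on_key(stats, war, ["player_ID", "year_ID"])
print(merged)
```

With DataFrames the same idea is usually expressed as a merge on the shared id columns.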

jfreynolds · 2017-12-08T18:42:38Z

The daily batting/pitching files seem to be the best option available.

Should all the data from those files be provided in a table to a user by default? Seems like there is a lot in there that isn't regularly sought after. Maybe by default they are provided the more common statistics (WAR, salary, WAA, ERA+, etc.) from that file, and if a boolean is set to true, all the data available?

jldbc · 2017-12-09T21:56:31Z

Most of these could be left out by default since the main point of this is to get WAR. Returning all 49 columns by default might be overkill.

Bare minimum would be WAR, its essential components (WAA and WAR_rep for batters; WAA, WAR_rep, and WAA_adj for pitchers), and everything needed to identify the player and connect them with another table. I think this would mean WAR, WAR_off, WAR_def, WAR_rep, WAA, mlb_ID, player_ID, team_ID, year_ID, stint_ID for both, plus WAA_adj for pitching, unless I'm missing anything.

On top of these it might get a bit arbitrary to decide what to leave in by default. Is anything else important to keep in or should the rest be optional with something along the lines of a boolean return_all parameter? Maybe G for both and GS for pitchers since these are common things people might filter on?

jfreynolds · 2017-12-09T22:32:01Z

Definitely agree that WAR values should be the default. The only other things that jump out to me as likely to be frequently requested are ERA+, salary, or even BIP. Other than that, nothing really strikes me.

So it seems like the best idea would be to provide WAR and its components by default, maybe allow an argument for some more commonly used columns such as ERA+, salary, RA, xRA, RAA, BIP, etc., and finally a return_all parameter like you said to return all of the columns.

I just think occasionally people may want a select few values outside of WAR, and forcing them to read all of the columns seems like unnecessary overhead. Should we just keep it simple though: WAR and its components by default, or all columns if some boolean argument is true?

jldbc · 2017-12-10T01:26:34Z

Yeah we can keep some of the more commonly used ones in. For non-WAR, non-identification columns of interest I'm seeing:

Batting: salary, G, PA, runs_above_avg, runs_above_avg_off, runs_above_avg_def
Pitching: G, GS, RA, xRA, BIP, BIP_perc, salary, ERA_plus

Which all in all would have these as the defaults:

Batting: ['name_common', 'mlb_ID', 'player_ID', 'year_ID', 'team_ID', 'stint_ID', 'lg_ID', 'pitcher', 'G', 'PA', 'salary', 'runs_above_avg', 'runs_above_avg_off', 'runs_above_avg_def', 'WAR_rep', 'WAA', 'WAR']
Pitching: ['name_common', 'mlb_ID', 'player_ID', 'year_ID', 'team_ID', 'stint_ID', 'lg_ID', 'G', 'GS', 'RA', 'xRA', 'BIP', 'BIP_perc', 'salary', 'ERA_plus', 'WAR_rep', 'WAA', 'WAA_adj', 'WAR']

With everything else being retrievable with a return_all type of parameter. Anything important I missed? This leaves ~ 20 columns each which seems reasonable.

The function itself would basically be the top response to this Stack Overflow post with the above column filtering.
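The column filtering with a return_all switch might be sketched as follows. The pitching defaults are the list proposed above; the helper name `filter_columns`, the list-of-dicts representation, and the sample row (including the `extra_col` field) are assumptions for illustration only.

```python
PITCHING_DEFAULTS = [
    "name_common", "mlb_ID", "player_ID", "year_ID", "team_ID",
    "stint_ID", "lg_ID", "G", "GS", "RA", "xRA", "BIP", "BIP_perc",
    "salary", "ERA_plus", "WAR_rep", "WAA", "WAA_adj", "WAR",
]


def filter_columns(rows, defaults, return_all=False):
    """Keep only the default columns unless return_all is True."""
    if return_all:
        return rows
    # Skip any default column a row happens not to contain.
    return [
        {col: row[col] for col in defaults if col in row}
        for row in rows
    ]


# Made-up row; "extra_col" stands in for any non-default column.
rows = [{"name_common": "Example Player", "WAR": "2.0", "extra_col": "x"}]
trimmed = filter_columns(rows, PITCHING_DEFAULTS)
everything = filter_columns(rows, PITCHING_DEFAULTS, return_all=True)
print(trimmed)
```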

trojanguard25 · 2017-12-12T03:09:04Z

I think there should be some default 'groupby' that is done to combine player rows for the same year. I committed a potential option in my fork: https://github.com/trojanguard25/pybaseball/commits/cache
This function returns all the columns for a single season. By default, it groups the rows so each player has a single entry for the year submitted. I also added an option to split each player by team. I think those are the two most common use-cases. This does cause a problem since some of the columns (like ERA+) cannot be summed or averaged; rather, they need to be weighted by playing time. Not exactly sure the best way to handle that correctly.
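For a column that can simply be summed, such as WAR, combining a player's stints within a year might look like the sketch below. This is not the code from the fork; the helper name and sample values are made up, and missing or non-numeric values would need handling before this runs against real data.

```python
from collections import defaultdict


def sum_war_by_player_year(rows):
    """Sum WAR across stints for each (player_ID, year_ID) pair."""
    totals = defaultdict(float)
    for row in rows:
        totals[(row["player_ID"], row["year_ID"])] += float(row["WAR"])
    return dict(totals)


# Made-up rows: one player with two stints in the same season.
rows = [
    {"player_ID": "example01", "year_ID": "2017", "stint_ID": "1", "WAR": "1.5"},
    {"player_ID": "example01", "year_ID": "2017", "stint_ID": "2", "WAR": "0.5"},
]
season_war = sum_war_by_player_year(rows)
print(season_war)
```

Rate-style columns such as ERA+ cannot be combined this way, which is the problem raised above.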

jldbc · 2017-12-16T22:55:22Z

Let's leave the groupby in the hands of the user for now since doing it for them without using proper weights might cause people to unknowingly use bad data (i.e. using a summed/averaged ERA+ without realizing it's not weighted).

I pushed the version I've been using to a new branch in 7b10b82. I'll merge later today if there aren't any objections.

It's probably worth opening a new issue for working on properly-weighted aggregations since it definitely would be useful to have.
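To illustrate why the weighting matters, here is a small sketch comparing a plain mean with a playing-time-weighted mean. The numbers are made up, the choice of weight (some measure of playing time) is an assumption, and whether a linear weighted mean is the right way to combine ERA+ at all is part of what a follow-up issue would need to settle.

```python
def weighted_mean(values, weights):
    """Return the mean of `values` weighted by `weights`."""
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total_weight


# Made-up stints: a long one with ERA+ 150 and a short one with ERA+ 90.
era_plus = [150.0, 90.0]
playing_time = [300.0, 100.0]  # hypothetical playing-time weights

plain = sum(era_plus) / len(era_plus)
weighted = weighted_mean(era_plus, playing_time)
print(plain, weighted)
```

The two results differ, so an unweighted aggregate would misrepresent a player whose stints were of unequal length.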

jldbc · 2017-12-17T21:17:58Z

Merged branch bwar to master. Commit 7b10b82 adds the functions bwar_bat() and bwar_pitch(), each with the optional argument return_all to retrieve all fields.
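A minimal usage sketch of the new function, assuming pybaseball is installed and the Baseball Reference file is reachable. The wrapper name `load_pitching_bwar` is hypothetical; it only defers the import so the module loads even without the package.

```python
def load_pitching_bwar(return_all=False):
    """Fetch pitcher bWAR via pybaseball (needs network access).

    return_all=False keeps the default columns; return_all=True
    retrieves every field, per the description above.
    """
    # Imported lazily so this sketch can be loaded without pybaseball.
    from pybaseball import bwar_pitch

    return bwar_pitch(return_all=return_all)


# Example call (commented out because it downloads data):
# pitching = load_pitching_bwar()
# everything = load_pitching_bwar(return_all=True)
```

bwar_bat() would be called the same way for position players.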


jfreynolds closed this as completed Dec 8, 2017

jfreynolds reopened this Dec 8, 2017

jldbc closed this as completed Dec 17, 2017

schorrm pushed a commit that referenced this issue Sep 11, 2020

Merge pull request #9 from jldbc/master

33ec36a

Merge master down
