Statcast pitcher spin rate fix #64

tpoatsy3 · 2019-08-30T18:47:52Z

This merge adds the statcast_pitcher_spin function to the library. The new function piggy-backs on the existing statcast_pitcher function, but adds calculations of the pitch's movement as a result of its spin and two angles measuring the spin's axis of rotation. The physics and formulae that I used are based on the work of Prof. Alan Nathan of the University of Illinois.

I used the test driven development method, so I also included my test scripts and data. This will validate the calculations and could provide a home for scripts that test other parts of the library.

Lastly, I included documentation for the function that mirrors the rest of the package's docs. The examples have a bias towards pitchers from the Chicago Cubs, but everything else should be in order.

tpoatsy3 · 2019-09-09T18:19:28Z

@jldbc I wanted to follow up on this to see if there are any changes that you think should be made before it's merged into pybaseball

tpoatsy3 · 2020-01-09T16:01:54Z

@jldbc Just circling back on this. This pull should close issue #58. Let me know if you think other changes should be made or if you have any concerns with the code that are preventing making the pull.

schorrm · 2020-05-08T09:37:11Z

@tpoatsy3 This is really good -- can you please make this PR to my fork? We're trying to get this going again

schorrm · 2020-08-28T11:07:20Z

Okay, I'm trying to merge the PR, but I can't get the unit testing to work here.
Also, I don't love the setup where it goes through a module indirection

schorrm · 2020-08-30T19:32:35Z

@tpoatsy3 -- can you please drop the gitignore from this, merge up to date?
and can you take a look at the init thing? Can we get this a different way? Because it feels like the module for unit testing creates a really unpleasant usage method on the thingy.

schorrm · 2020-08-31T07:26:48Z

Also, in terms of structure, I think this might be better as a boolean flag in the function calls for statcast. And possibly even on by default.

schorrm · 2020-09-13T10:40:15Z

@tpoatsy3, you still there?

schorrm · 2020-10-15T02:24:26Z

@bdilday can you take a look at this? @TheCleric started working on fixing the unit tests and there are discrepancies here apparently

TheCleric · 2020-10-15T02:49:20Z

@bdilday I checked my fixes on a branch on my copy of the repo:

https://github.com/TheCleric/pybaseball/tree/statcast-fix

The main issue is the integration test. It seems like the numbers in the example CSV and the ones being generated are a few hundredths off. Not sure if it's an algorithm issue, or that the CSV just needs to be updated.

bdilday · 2020-10-16T14:23:29Z

I did some digging on this. I was able to find the spreadsheet that the algorithm is based on here, http://baseball.physics.illinois.edu/trackman/MovementSpinEfficiencyTemplate.xlsx

I can say that the algorithm implemented in this PR is definitely consistent with Alan Nathan's excel-based computations. The example data for this test is actually from the spreadsheet, so we've effectively got a test for consistency, not only for the final result but for each individual component.

On the other hand, if I take the data from the live_Darvish_July data included in this PR and put it into the spreadsheet, the answer is identical to @tpoatsy3's code and is not identical to the values in the csv file itself, i.e., the csv file appears to be wrong (if we assume Alan Nathan's calculations to be correct).

it's hard to know why these aren't consistent - presumably this test passed at some point? so that would suggest the data being returned by mlbam has changed. we also don't know where the Darvish data came from (@tpoatsy3 ?) or if we should expect it to match Alan Nathan's spreadsheet.

overall, we don't need to block on the test of the Darvish data passing. It may be worth doing the pull from mlbam (df = spin.statcast_pitcher_spin(start_dt='2019-07-01', end_dt='2019-07-31', player_id=506433)), computing the spin stuff, saving that file, and testing against that going forward. This would allow for a regression test based on changing mlbam data.

TheCleric · 2020-10-16T14:28:24Z

Thanks @bdilday ! I'll update the CSV file then with the current data and we'll go from there.

bdilday · 2020-10-16T14:33:54Z

pybaseball/statcast_pitcher_spin.py

+import pandas as pd
+import numpy as np
+
+K = .005153 #Environmental Constant


would it be useful to compute this instead of hard-coding it? i.e. assume 70 F sea-level as a default but let a user specify something different in the call to statcast_pitcher_spin?

As well, wouldn't this be park dependent? I.e., I'm assuming Tropicana Field (sea-level) vs Coors Field (a mile up) would possibly be different.

It's great that you brought this up because I was hoping to replace K with a calculation later. The reason I didn't in the original PR was to manage scope and keep the test data consistent.
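
For reference, a minimal sketch of what computing K could look like. It assumes K is the usual aerodynamic prefactor 0.5 * rho * A / m in foot-pound units, which is not spelled out in this thread. The function name, the default ball dimensions (9.125 in circumference, 5.125 oz), the sample air density, and the 18% altitude reduction are all illustrative assumptions, not part of the PR.

```python
import math

def environmental_constant(air_density_lb_ft3, circumference_in=9.125, mass_oz=5.125):
    """Hypothetical helper: K = 0.5 * rho * A / m, in units of 1/ft.

    Assumes K is the standard drag/lift prefactor; the default ball
    dimensions are assumed nominal values, not taken from the PR.
    """
    circumference_ft = circumference_in / 12.0
    area_ft2 = circumference_ft ** 2 / (4.0 * math.pi)  # cross-section from circumference
    mass_lb = mass_oz / 16.0
    return 0.5 * air_density_lb_ft3 * area_ft2 / mass_lb

# Assumed density of roughly 0.0749 lb/ft^3 (about 70 F near sea level).
k_sea_level = environmental_constant(0.0749)
# Thinner air at altitude (assumed 18% less dense) gives a proportionally smaller K.
k_thin_air = environmental_constant(0.0749 * 0.82)
```

Under these assumptions the sea-level value comes out close to the 0.005383 that appears later in this thread, and K scales linearly with air density, so a park-dependent default would only need a density per park.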

bdilday · 2020-10-16T14:35:10Z

pybaseball/statcast_pitcher_spin.py

+import numpy as np
+
+K = .005153 #Environmental Constant
+SIG_DIG = 4


is rounding really necessary? if so maybe the result should be rounded only in the last step?

I had also thought that rounding at each step was superfluous, but when I compared my testing data to the excel sheet that I was basing the calculations on, there were marginal errors that compounded as I went further in the calculations and amounted to non-negligible differences in the output.

If we can verify the accuracy of the functions I built, then we can absolutely remove rounding at every step

I would make a suggestion that we leave out the rounding at each step perhaps for now and then ask Alan Nathan nicely on Twitter if he can check if the math is accurate?

I think that's unnecessary. I confirmed that this code agrees with Alan Nathan's spreadsheet, not just the end result but each individual cell.

Bumping this again -- @bdilday -- should we round intermediate steps or only at the end in the final formatted df?

I don't see any reason we need to round.

OK yeah then nerf the rounding
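
To illustrate the trade-off discussed above, here is a small sketch of how rounding after every step can drift away from rounding once at the end. The three arithmetic steps are made up for illustration and have nothing to do with the spin calculations themselves.

```python
def chain(x, digits=None):
    """Apply three dependent steps, optionally rounding after each one."""
    steps = (lambda v: v / 3.0, lambda v: v * 7.0, lambda v: v ** 2)
    for step in steps:
        x = step(x)
        if digits is not None:
            x = round(x, digits)  # intermediate rounding feeds into the next step
    return x

exact = chain(1.0)               # no rounding anywhere
stepwise = chain(1.0, digits=4)  # round after every step
final_only = round(exact, 4)     # round once, at the end
```

Here `stepwise` and `final_only` already disagree in the third decimal place after only three steps, which is the kind of compounding that makes step-by-step comparisons against a spreadsheet sensitive to where the rounding happens.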

bdilday · 2020-10-16T14:36:30Z

pybaseball/statcast_pitcher_spin.py

+	return df
+
+def find_release_time(df):
+	df['tR'] = time_duration(df['yR'], df['vy0'], df['ay'], 50, False).round(SIG_DIG)


should 50 be variable at top of the file?

I'm not actually sure what the variable is unfortunately. I think it could be the distance to the plate at the point of data capture. On line 52, there is also an unexplained 17/12. I'm not sure why 17 inches is there, but it is for some reason...

Both of these numbers are actually explained in the README tab of Prof. Nathan's excel workbook. Y Velocity is captured at Y=50 and the final data points are taken at y=17/12. These numbers will be defined at the top of the file as such.
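
A sketch of what naming these constants could look like, together with a constant-acceleration solve for the release time. The constant names and the `release_time` helper are hypothetical; the PR's actual `time_duration` function is not shown in this thread and may be written differently. The sketch assumes vy0 is negative (ball moving toward the plate) and ay is nonzero.

```python
import math

Y_VELOCITY_CAPTURE_FT = 50.0  # y position at which the initial velocity is captured
Y_FINAL_FT = 17.0 / 12.0      # y position of the final data point (17 inches)

def release_time(y_release, vy0, ay, y_ref=Y_VELOCITY_CAPTURE_FT):
    """Time, relative to crossing y_ref, at which the ball was at y_release.

    Solves y_ref + vy0*t + 0.5*ay*t**2 = y_release for the root near t = 0,
    assuming vy0 < 0 and ay != 0.
    """
    disc = vy0 ** 2 - 2.0 * ay * (y_ref - y_release)
    return (-vy0 - math.sqrt(disc)) / ay

# Made-up sample pitch: released at y = 55 ft, vy0 = -130 ft/s, ay = 25 ft/s^2.
t_release = release_time(55.0, -130.0, 25.0)
# Plugging the time back into the kinematics should recover the release point.
y_check = Y_VELOCITY_CAPTURE_FT + (-130.0) * t_release + 0.5 * 25.0 * t_release ** 2
```

The release time is negative in this convention because the ball is released before it reaches y = 50.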

bdilday · 2020-10-16T14:37:04Z

pybaseball/statcast_pitcher_spin.py

+	return df
+
+def find_spin_factor(df):
+	df['S'] = (0.4*df['Cl']/(1-2.32*df['Cl'])).round(SIG_DIG)


a doc string with a reference to where this formula came from would be helpful

yeah, that's a good call, I'll make that change too

TheCleric · 2020-10-16T15:20:48Z

@schorrm I've updated my branch with @bdilday's suggestions and all tests pass now. Since the original submitter seems to have disappeared, we can work off that branch if you'd like.

bdilday · 2020-10-16T16:11:35Z

I don't see this feature as being urgent. Maybe we could send him an email to let him know we want to merge this and give him the opportunity to finish this off? It's available on his linked profile, https://github.com/tpoatsy3

tpoatsy3 · 2020-10-21T14:48:10Z

I didn't see all of this activity on this thread. I'll review your comments this week and make some changes. Thanks for the interest in adding this.

schorrm · 2020-10-21T20:58:54Z

Also @tpoatsy3 -- take a look at @TheCleric's work on the testing for this -- it may be of use

schorrm · 2020-10-27T12:24:39Z

Hey, @tpoatsy3 -- any progress?

schorrm · 2020-10-31T21:20:10Z

@tpoatsy3 -- anything new doing on this?

schorrm · 2020-11-19T14:12:00Z

Anyone? Anyone? Bueller?

tpoatsy3 · 2021-01-21T00:25:57Z

@schorrm I'm making these edits now. I'll get something pushed later and hopefully we can resolve this.

tpoatsy3 · 2021-01-21T02:24:40Z

Just as a heads up, Prof Nathan changed the function for S. The latest push includes that change.

schorrm · 2021-01-21T08:15:58Z

@tpoatsy3 -- can you please drop the .gitignore from your PR to resolve that conflict?

schorrm · 2021-01-21T08:22:22Z

pybaseball/statcast_pitcher_spin.py

+    df['vxbar'] = df['vxbar'].round(4)
+    df['vybar'] = df['vybar'].round(4)
+    df['vzbar'] = df['vzbar'].round(4)


I'm not sure that we want to be rounding, but if we are, this should be SIG_DIG

schorrm · 2021-01-21T08:23:58Z

pybaseball/statcast_pitcher_spin.py

+
+
+def find_average_drag(df):
+    df['adrag'] = (-(df['ax']*df['vxbar'] + df['ay']*df['vybar'] + (df['az'] + 32.174)*df['vzbar'])/ df['vbar']).round(SIG_DIG)


should be GRAVITATIONAL_ACCELERATION, no?

schorrm · 2021-01-21T08:25:02Z

pybaseball/statcast_pitcher_spin.py

+
+
+def find_spin_factor(df):
+    df['S'] = (0.166*np.log(0.336/(0.336-df['Cl']))).round(SIG_DIG)


Can we comment / explain this line here?
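
One way to document the two S formulas that appear in this review: algebraically, each is the inverse of a simple lift-coefficient curve Cl(S). The sketch below only demonstrates that algebra in plain Python; the function names are hypothetical, and attributing the forward curves to a particular fit is an assumption that should be checked against Prof. Nathan's workbook.

```python
import math

def spin_factor_rational(cl):
    """Earlier formula: inverts Cl = S / (0.4 + 2.32 * S). Valid for Cl < 1 / 2.32."""
    return 0.4 * cl / (1.0 - 2.32 * cl)

def spin_factor_log(cl):
    """Later formula: inverts Cl = 0.336 * (1 - exp(-S / 0.166)). Valid for Cl < 0.336."""
    return 0.166 * math.log(0.336 / (0.336 - cl))

# Round trip: pick a spin factor, compute Cl with each forward curve, invert it.
s = 0.2
cl_rational = s / (0.4 + 2.32 * s)
cl_log = 0.336 * (1.0 - math.exp(-s / 0.166))
s_from_rational = spin_factor_rational(cl_rational)
s_from_log = spin_factor_log(cl_log)
```

Both round trips recover the starting value of S up to floating-point error. A docstring in this shape also makes the domain restriction visible: the log form is undefined once Cl reaches 0.336.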

schorrm · 2021-01-21T08:25:34Z

pybaseball/statcast_pitcher_spin.py

+def special_round(series, digit):
+    series = series * 10**digit
+    series = np.where(series >= 0, series + .5, series - .5)
+    series = series.astype('int64').astype('float64')
+    series = series / 10**digit
+    return series


Can this get a comment / explanation?
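
As a possible explanation: the function above appears to implement "round half away from zero" by shifting by 0.5 and truncating, whereas Python's built-in round (and NumPy's) rounds ties to the nearest even digit. Excel's ROUND function rounds halves away from zero, so matching the spreadsheet is presumably the motivation, though the thread does not say so. A scalar sketch of the same idea, with a hypothetical function name:

```python
def round_half_away(value, digit):
    """Round to `digit` decimals, sending exact ties away from zero."""
    scaled = value * 10 ** digit
    # Shift by half a unit toward the value's sign, then truncate toward zero.
    scaled = scaled + 0.5 if scaled >= 0 else scaled - 0.5
    return int(scaled) / 10 ** digit

# Ties are where the two conventions differ.
half_away_pos = round_half_away(2.5, 0)   # tie goes up to 3.0
half_even_pos = round(2.5)                # tie goes to the even neighbour, 2
half_away_neg = round_half_away(-2.5, 0)  # tie goes down to -3.0
half_away_frac = round_half_away(0.125, 2)
half_even_frac = round(0.125, 2)
```

The two conventions only disagree on exact ties, so the difference shows up rarely, but it is enough to break an exact comparison against spreadsheet output.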

tpoatsy3 · 2021-02-19T02:35:37Z

I've tried regenerating the files with the new formulas, but once we take out the aggressive rounding, I can't get the first test to pass (test_individual_calculations). Specifically, it fails on find_lift_coefficient. I think it's being caused by the difference in how python and excel are treating the K value (which is a constant for air density). I was looking into this and apparently MS Excel has its own proprietary decimal rounding which departs from IEEE standards. Since all of the previous calculations worked before, it must be how Excel is handling the decimal number compared to python.

Just to be clear, before I refreshed the data, I changed the Cl calculation in the excel data file from being updatable to being a static 0.005383 in the excel doc, which matches the K global variable that I have in the file. I let the excel doc reload and then saved that output as a csv. The numbers largely match, and when I use the check_less_precise=3 attribute with the pandas._testing.assert_frame_equal function, I'm getting a difference of 6.47%. With check_less_precise=2, it's 0.04%.

Is there another way to test this data that's not circular without using excel? I'm open to ideas.
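
A side note on the comparison itself: check_less_precise was deprecated in pandas 1.1 in favour of explicit rtol/atol tolerances, and pandas.testing is the public import path for assert_frame_equal. A minimal sketch of a tolerance-based comparison follows; the column names echo this PR, but the numbers are made up for illustration and are not from the test files.

```python
import pandas as pd

# Made-up "expected" values standing in for spreadsheet output.
expected = pd.DataFrame({"Cl": [0.21534, 0.19872], "S": [0.17011, 0.14790]})
# Simulate computed values that differ by a small relative error.
computed = expected * (1 + 5e-5)

# Passes: every value agrees to within a relative tolerance of 1e-3.
pd.testing.assert_frame_equal(computed, expected, check_exact=False, rtol=1e-3)

# A much tighter tolerance rejects the same pair of frames.
try:
    pd.testing.assert_frame_equal(computed, expected, check_exact=False, rtol=1e-6)
    strict_ok = True
except AssertionError:
    strict_ok = False
```

Stating the tolerance as rtol makes it explicit how much disagreement with the spreadsheet the test is willing to accept, rather than encoding it in a digit count.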

bdilday · 2021-02-19T12:24:55Z

@tpoatsy3 could you push your changes? I think it'd be easier to try to untangle this if we could see the most up-to-date code. Also, can you clarify which excel file you're using, i.e. what's the link on Alan Nathan's website?

bdilday · 2021-02-19T16:46:52Z

I believe I've addressed the issues here, with this PR tpoatsy3#2:

- hardcodes K = 0.005383 in the excel file
- formats the excel file to avoid rounding in display
- exports a query for Yu Darvish data, using the current statcast values, which have diverged from the test file included in the original PR

tpoatsy3 · 2021-02-20T19:42:42Z

@bdilday Thank you so much for the help. I'm not sure how you fixed the excel rounding issue, but it's passing that unittest now.

@schorrm I re-wrote the testing file to use pytest and the utility functions in tests/pybaseball/conftest.py. All of the tests from tests/pybaseball/test_statcast_pitching_spin.py pass.

schorrm · 2021-02-20T20:49:00Z

Will be going through as soon as the rest of the CI passes... can you pull back up to date?

schorrm · 2021-02-21T08:41:13Z

@TheCleric ?

Add testing folder

I've added a testing folder. The statcast_pitching additions are going to be added with the TDD method, so this is a necessary step. This could hold other testing scripts for the rest of the package's methods.

Add test data

Included are test data files. The test_data models what might get scraped from the web. The target data contains fields that I'm aiming to calculate.

Add statcast_pitcher_spin method with testing

statcast_pitcher_spin extends statcast_pitcher by adding spin data back into the file, replacing the deprecated spin columns. The math and physics behind the calculations were modeled off of Professor Alan Nathan's work at the University of Illinois. I have no impression of what information was in the original spin columns before they were deprecated, but they now have the magnitude of movement caused by spin in the X and Z directions ('Mx' and 'Mz') as well as the axis of rotation ('phi'). statcast_pitcher_spin was developed with the test driven development method, so a testing folder containing the testing file and data was added as well. Documentation is next on the to-dos for this method.

Add documentation for statcast_pitcher_spin

Include 'theta' calculation in results

Writing documentation led me to consider that users might want to see the 'theta' calculation. It requires some extra steps along the way (and their respective tests). This commit includes the updated method, associated tests, and refreshed test files.

Add changes requested in comments

Remove gitignore file
Replace magic numbers with variables
Add comments to some functions
Remove superfluous rounding

Add comments to some functions

Fix merge conflict errors

schorrm · 2021-02-21T21:47:43Z

Upon discussing with @bdilday, I think we'll merge and fix the unrelated FG bugs later.

tpoatsy3 · 2021-02-22T01:05:07Z

Ok, sounds good. I'll do a final commit that's rebased on the current master commit (1b8bb70)

tpoatsy3 · 2021-02-22T02:15:49Z

I'm all set, feel free to merge whenever

bdilday reviewed Oct 16, 2020

schorrm reviewed Jan 21, 2021

Statcast batter leaderboards (jldbc#179)

1b8bb70

* Additional Statcast Batter Leaderboards * add pitch arsenal leaderboard * docs for new statcast batter functions * Make exitvelo test a little more resilient (hopefully for the last time) - w/ Adam Weeden

bdilday reviewed Feb 19, 2021

bdilday mentioned this pull request Feb 19, 2021

Statcast fix update data tpoatsy3/pybaseball#2

Merged

TheCleric and others added 12 commits February 19, 2021 12:28

Fix statcast exitvelo tests, and use our own qualifier (jldbc#182)

888b0a4

replace fuzzywuzzy with difflib (jldbc#180)

5610654

Solves a license issue, removes a dependency

Statcast batter leaderboards (jldbc#179)

61e8eb7

* Additional Statcast Batter Leaderboards * add pitch arsenal leaderboard * docs for new statcast batter functions * Make exitvelo test a little more resilient (hopefully for the last time) - w/ Adam Weeden

updates data files based on excel w/o rounding

1e627c5

removes rounding from calcs

d43fc27

updates tests to use pandas

1dc3d15

updates tests to use full df

a1acfc6

updates darvish test data

c7e4834

Merge branch 'bdilday-statcast-fix-update-data' into statcast-fix

59778ca

Merge fixed testing with working branch

328a6c4

Add text attribute to monkeypatch DummyResponse

b9e7e67

Fix testing to use pytest properly with fixtures

74d0c88

tpoatsy3 added 3 commits February 21, 2021 09:42

remove old test data

422d75c

Merge branch 'statcast-fix' of https://github.com/tpoatsy3/pybaseball …

c6ad563

…into statcast-fix

Add back mistakenly removed file

7f6b22e

schorrm merged commit b755cbb into jldbc:master Feb 22, 2021

schorrm mentioned this pull request Mar 9, 2021

Question - Statcast pitcher spin rate #58

Closed


