Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Statcast pitcher spin rate fix #64

Merged
merged 20 commits into from Feb 22, 2021
Merged

Statcast pitcher spin rate fix #64

merged 20 commits into from Feb 22, 2021

Conversation

tpoatsy3
Copy link
Contributor

This merge adds the statcast_pitcher_spin function to the library. The new function piggy-backs on the existing statcast_pitcher function, but adds calculations of the pitch's movement as a result of its spin and two angles measuring the spin's axis of rotation. The physics and formulae that I used are based on the work of Prof. Alan Nathan of the University of Illinois.

I used the test driven development method, so I also included my test scripts and data. This will validate the calculations and could provide a home for scripts that test other parts of the library.

Lastly, I included documentation for the function that mirrors the rest of the package's docs. The examples have a bias towards pitchers from the Chicago Cubs, but everything else should be in order.

@tpoatsy3
Copy link
Contributor Author

tpoatsy3 commented Sep 9, 2019

@jldbc I wanted to follow up on this to see if there are any changes that you think should be made before it's merged into pybaseball

@tpoatsy3
Copy link
Contributor Author

tpoatsy3 commented Jan 9, 2020

@jldbc Just circling back on this. This pull should close issue #58. Let me know if you think other changes should be made or if you have any concerns with the code that are preventing making the pull.

@schorrm
Copy link
Collaborator

schorrm commented May 8, 2020

@tpoatsy3 This is really good -- can you please make this PR to my fork? We're trying to get this going again

@schorrm
Copy link
Collaborator

schorrm commented Aug 28, 2020

Okay, I'm trying to merge the PR, but I can't get the unit testing to work here.
Also, I don't love the setup where it goes through a module indirection

@schorrm
Copy link
Collaborator

schorrm commented Aug 30, 2020

@tpoatsy3 -- can you please drop the gitignore from this, merge up to date?
and can you take a look at the init thing? Can we get this a different way? Because it feels like the module for unit testing creates a really unpleasant usage method on the thingy.

@schorrm
Copy link
Collaborator

schorrm commented Aug 31, 2020

Also, in terms of structure, I think this might be better as a boolean flag in the function calls for statcast. And possibly even on by default.

@schorrm
Copy link
Collaborator

schorrm commented Sep 13, 2020

@tpoatsy3, you still there?

@schorrm
Copy link
Collaborator

schorrm commented Oct 15, 2020

@bdilday can you take a look at this? @TheCleric started working on fixing the unit tests and there are discrepancies here apparently

@TheCleric
Copy link
Contributor

@bdilday I checked my fixes on a branch on my copy of the repo:

https://github.com/TheCleric/pybaseball/tree/statcast-fix

The main issue is the integration test. It seems like the numbers in the example CSV and the ones being generated are a few hundredths off. Not sure if it's an algorithm issue, or that the CSV just needs to be updated.

@bdilday
Copy link
Contributor

bdilday commented Oct 16, 2020

I did some digging on this. I was able to find the spreadsheet that the algorithm is based on here, http://baseball.physics.illinois.edu/trackman/MovementSpinEfficiencyTemplate.xlsx

I can say that the algorithm implemented in this PR is definitely consistent with Alan Nathan's excel-based computations. The example data for this test is actually from the spreadsheet, so we've effectively got a test for consistency, not only for the final result but for each individual component.

On the other hand if I take the the data from the live_Darvish_July data included in this PR, and put it into the spreadsheet, the answer is identical to @tpoatsy3 code and is not identical to the values in the csv file itself. i.e., the csv file appears to be wrong (if we assume Alan Nathan's calculations to be correct).

it's hard to know why these aren;t consistent - presumably this test passed at some point? so that would suggest the data being returned by mlbam has changed. we also don't know where the Darvish data came from (@tpoatsy3 ?) or if we should expect it to match Alan Nathan's spreadsheet.

overall, we don't need to block on the test of the Darvish data passing. It may be worth doing the pull from mlbam (df = spin.statcast_pitcher_spin(start_dt='2019-07-01', end_dt='2019-07-31', player_id=506433) , computing the spin stuff, saving that file, and testing against that going forward. This would allow for a regression test based on changing mlbam data.

@TheCleric
Copy link
Contributor

Thanks @bdilday ! I'll update the CSV file then with the current data and we'll go from there.

import pandas as pd
import numpy as np

K = .005153 #Environmental Constant
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would it be useful to compute this instead of hard-coding it? i.e. assume 70 F sea-level as a default but let a user specify something different in the call to statcast_pitcher_spin?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As well, wouldn't this be park dependent? I.e., I'm assuming Tropicana Field (sea-level) vs Coors Field (a mile up) would possibly be different.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's great that you brought this up because I was hoping on replacing K with a calculation later. The reason I didnt in the original PR was to manage scope and keep the test data could be consistent.

import numpy as np

K = .005153 #Environmental Constant
SIG_DIG = 4
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is rounding really necessary? if so maybe the result should be rounded only in the last step?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had also thought that rounding at each step was superfluous, but I compared my testing data to the excel sheet that I was doing the calculations for and there were marginal errors that compounded as I went further in the calculations and amounted to non-insignificant differences in the output.

If we can verify the accuracy of the functions I built, then we can absolutely remove rounding at every step

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would make a suggestion that we leave out the rounding at each step perhaps for now and then ask Alan Nathan nicely on Twitter if he can check if the math is accurate?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that;s unnecessary. I confirmed that this code agrees with Alan Nathan's spreadsheet, not just the end result but each individual cell.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bumping this again -- @bdilday -- should we round intermediate steps or only at the end in the final formatted df?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see any reason we need to round.

Copy link
Collaborator

@schorrm schorrm Jan 21, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK yeah then nerf the rounding

return df

def find_release_time(df):
df['tR'] = time_duration(df['yR'], df['vy0'], df['ay'], 50, False).round(SIG_DIG)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should 50 be variable at top of the file?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not actually sure what the variable is unfortunately. I think it could be the distance to the plate at the point of data capture. On line 52, there is also an unexplained 17/12. I'm not sure why 17 inches is there, but it is for some reason...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both of these numbers are actually explained in the README tab of Prof. Nathan's excel workbook. Y Velocity is captured at Y=50 and the final data points are taken at y=17/12. These numbers will be defined at the top of the file as such.

return df

def find_spin_factor(df):
df['S'] = (0.4*df['Cl']/(1-2.32*df['Cl'])).round(SIG_DIG)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a doc string with a reference to where this formula came from would be helpful

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, that's a good call, I'll make that change too

@TheCleric
Copy link
Contributor

@schorrm I've updated my branch with @bdilday's suggestions and all tests pass now. Since the original submitted seems to have disappeared, we can work off that branch if you'd like.

@bdilday
Copy link
Contributor

bdilday commented Oct 16, 2020

I don't see this feature as being urgent, maybe we could send him on email to let me know we want to merge this and give him the opportunity to finish this off? It's available on linked in profile, https://github.com/tpoatsy3

@tpoatsy3
Copy link
Contributor Author

I didn't see all of this activity on this thread. I'll review your comments this week and make some changes. Thanks for the interest in adding this.

@schorrm
Copy link
Collaborator

schorrm commented Oct 21, 2020

Also @tpoatsy3 -- take a look at @TheCleric's work on the testing for this -- it may be of use

@schorrm
Copy link
Collaborator

schorrm commented Oct 27, 2020

Hey, @tpoatsy3 -- any progress?

@schorrm
Copy link
Collaborator

schorrm commented Oct 31, 2020

@tpoatsy3 -- anything new doing on this?

@schorrm
Copy link
Collaborator

schorrm commented Nov 19, 2020

Anyone? Anyone? Bueller?

@tpoatsy3
Copy link
Contributor Author

@schorrm I'm making these edits now. I'll get something pushed later and hopefully we can resolve this.

@tpoatsy3
Copy link
Contributor Author

Just as a heads up, Prof Nathan changed the function for S. The latest push includes that change.

@schorrm
Copy link
Collaborator

schorrm commented Jan 21, 2021

@tpoatsy3 -- can you please drop the .gitignore from your PR to resolve that conflict?

Comment on lines 100 to 102
df['vxbar'] = df['vxbar'].round(4)
df['vybar'] = df['vybar'].round(4)
df['vzbar'] = df['vzbar'].round(4)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure that we want to be rounding, but if we are, this should be SIG_DIG



def find_average_drag(df):
df['adrag'] = (-(df['ax']*df['vxbar'] + df['ay']*df['vybar'] + (df['az'] + 32.174)*df['vzbar'])/ df['vbar']).round(SIG_DIG)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be GRAVITATIONAL_ACCELERATION, no?



def find_spin_factor(df):
df['S'] = (0.166*np.log(0.336/(0.336-df['Cl']))).round(SIG_DIG)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we comment / explain this line here?

Comment on lines 175 to 196
def special_round(series, digit):
series = series * 10**digit
series = np.where(series >= 0, series + .5, series - .5)
series = series.astype('int64').astype('float64')
series = series / 10**digit
return series
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this get a comment / explanation?

* Additional Statcast Batter Leaderboards

* add pitch arsenal leaderboard

* docs for new statcast batter functions

* Make exitvelo test a little more resilient (hopefully for the last time) - w/ Adam Weeden
@tpoatsy3
Copy link
Contributor Author

I've tried regenerating the files with the new formulas, but once we take out the aggressive rounding, I cant get the first test to pass (test_individual_calculations). Specifically, it fails on find_lift_coefficient. I think it's being caused by the difference in how python and excel are treating the K value (which is a constant for air density). I was looking into this and apparently MS Excel has its own proprietary decimal rounding which departs from IEEE standards. Since all of the previous calculations work before, it must be how Excel is handling the decimal number compared to python.

Just to be clear, before I refreshed the data, I changed the Cl calculation in the excel data file from being updatable to being a static 0.005383' in the excel doc, which matches the K global variable that I have in the file. I let the excel doc reload and then saved that output as a csv. The numbers largely match and when I use the check_less_precise=3attribute withpandas._testing.assert_frame_equal` function, I'm getting a difference of 6.47%. With check_less_precise=2, it's 0.04%.

Is there another way to test this data that's not circular without using excel? I'm open to ideas.

@bdilday
Copy link
Contributor

bdilday commented Feb 19, 2021

@tpoatsy3 could you push your changes? I think it'd be easier to try to untangle this if we could see the most up-to-date code. Also, can you clarify which excel file you're using, i.e. what's the link on Alan Nathan's website?

@bdilday
Copy link
Contributor

bdilday commented Feb 19, 2021

I believe I've addressed the issues here, with this PR tpoatsy3#2

  • hardcodes K = 0.005383 in the excel file

  • formats the excel file to avoid rounding in display

  • exports a query for Yu Darvish data, using the current statcast values, which have diverged from the test file included in the original PR

@tpoatsy3
Copy link
Contributor Author

@bdilday Thank you so much for the help. I'm not sure how you fixed the excel rounding issue, but it's passing that unittest now.

@schorrm I re-wrote the testing file to use pytest and the utility functions in tests/pybaseball/conftest.py. All of the tests from tests/pybaseball/test_statcast_pitching_spin.py pass.

@schorrm
Copy link
Collaborator

schorrm commented Feb 20, 2021

Will be going through as soon as the rest of the CI passes... can you pull back up to date?

@schorrm
Copy link
Collaborator

schorrm commented Feb 21, 2021

@TheCleric ?

I've added a testing folder. The statcast_pitching additions are
going to be added with the TDD method, so this is a necessary
step. This could hold other testing scripts for the rest of the
package's methods.

Add test data

Included are test data files. The test_data models what might get
scraped from the web. The target data contains fields that I'm
aiming to calculate.

Add statcast_pitcher_spin method with testing

statcast_pitcher_spin extends statcast_pitcher by adding spin data
back into the file, replacing the deprecated spin columns. The
math and physics behind the calculations were modeled off of
Professor Alan Nathan's work at the University of Illinois. I have
no impression of what information was in the original spin columns
before they were deprecated, but they now have the magnitude of
movement cause by spin in the X and Z directions ('Mx' and 'Mz')
as well as the axis of rotation ('phi').

statcast_pitcher_spin was developed with the test driven development
method, so a testing folder containing the testing file and data
was added as well.

Documentation is next on the to-dos for this method.

Add documentation for statcast_pitcher_spin

Include 'theta' calculation in results

Writing documentation led me to consider that users might want to
see the 'theta' calculation. It requires some extra steps along
the way (and their respective tests). This commit includes the
updated method, associated tests, and refreshed test files.

Add changes requested in comments

Remove gitignore file

Replace magic numbers with variables

Add comments to some functions

Remove superfluous rounding

replace fuzzywuzzy with difflib (jldbc#180)

Solves a license issue, removes a dependency

Statcast batter leaderboards (jldbc#179)

* Additional Statcast Batter Leaderboards

* add pitch arsenal leaderboard

* docs for new statcast batter functions

* Make exitvelo test a little more resilient (hopefully for the last time) - w/ Adam Weeden

updates data files based on excel w/o rounding

removes rounding from calcs

updates tests to use pandas

updates tests to use full df

updates darvish test data

Merge fixed testing with working branch

Add text attribute to monkeypatch DummyResponse

Fix testing to use pytest properly with fixtures

Add testing folder

I've added a testing folder. The statcast_pitching additions are
going to be added with the TDD method, so this is a necessary
step. This could hold other testing scripts for the rest of the
package's methods.

Add test data

Included are test data files. The test_data models what might get
scraped from the web. The target data contains fields that I'm
aiming to calculate.

Add statcast_pitcher_spin method with testing

statcast_pitcher_spin extends statcast_pitcher by adding spin data
back into the file, replacing the deprecated spin columns. The
math and physics behind the calculations were modeled off of
Professor Alan Nathan's work at the University of Illinois. I have
no impression of what information was in the original spin columns
before they were deprecated, but they now have the magnitude of
movement cause by spin in the X and Z directions ('Mx' and 'Mz')
as well as the axis of rotation ('phi').

statcast_pitcher_spin was developed with the test driven development
method, so a testing folder containing the testing file and data
was added as well.

Documentation is next on the to-dos for this method.

Add documentation for statcast_pitcher_spin

Include 'theta' calculation in results

Writing documentation led me to consider that users might want to
see the 'theta' calculation. It requires some extra steps along
the way (and their respective tests). This commit includes the
updated method, associated tests, and refreshed test files.

Add comments to some functions

Fix merge conflict errors
@schorrm
Copy link
Collaborator

schorrm commented Feb 21, 2021

Upon discussing with @bdilday, I think we'll merge and fix the unrelated FG bugs later.

@tpoatsy3
Copy link
Contributor Author

Ok, sounds good. I'll do a final commit that's rebased on the current master commit (1b8bb70)

@tpoatsy3
Copy link
Contributor Author

I'm all set, feel free to merge whenever

@schorrm schorrm merged commit b755cbb into jldbc:master Feb 22, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants