New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Statcast pitcher spin rate fix #64
Conversation
@jldbc I wanted to follow up on this to see if there are any changes that you think should be made before it's merged into pybaseball |
@tpoatsy3 This is really good -- can you please make this PR to my fork? We're trying to get this going again |
Okay, I'm trying to merge the PR, but I can't get the unit testing to work here. |
@tpoatsy3 -- can you please drop the gitignore from this, merge up to date? |
Also, in terms of structure, I think this might be better as a boolean flag in the function calls for statcast. And possibly even on by default. |
@tpoatsy3, you still there? |
@bdilday can you take a look at this? @TheCleric started working on fixing the unit tests and there are discrepancies here apparently |
@bdilday I checked my fixes on a branch on my copy of the repo: https://github.com/TheCleric/pybaseball/tree/statcast-fix The main issue is the integration test. It seems like the numbers in the example CSV and the ones being generated are a few hundredths off. Not sure if it's an algorithm issue, or that the CSV just needs to be updated. |
I did some digging on this. I was able to find the spreadsheet that the algorithm is based on here, http://baseball.physics.illinois.edu/trackman/MovementSpinEfficiencyTemplate.xlsx I can say that the algorithm implemented in this PR is definitely consistent with Alan Nathan's excel-based computations. The example data for this test is actually from the spreadsheet, so we've effectively got a test for consistency, not only for the final result but for each individual component. On the other hand if I take the the data from the it's hard to know why these aren;t consistent - presumably this test passed at some point? so that would suggest the data being returned by mlbam has changed. we also don't know where the Darvish data came from (@tpoatsy3 ?) or if we should expect it to match Alan Nathan's spreadsheet. overall, we don't need to block on the test of the Darvish data passing. It may be worth doing the pull from mlbam ( |
Thanks @bdilday ! I'll update the CSV file then with the current data and we'll go from there. |
pybaseball/statcast_pitcher_spin.py
Outdated
import pandas as pd | ||
import numpy as np | ||
|
||
K = .005153 #Environmental Constant |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would it be useful to compute this instead of hard-coding it? i.e. assume 70 F sea-level as a default but let a user specify something different in the call to statcast_pitcher_spin
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As well, wouldn't this be park dependent? I.e., I'm assuming Tropicana Field (sea-level) vs Coors Field (a mile up) would possibly be different.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's great that you brought this up because I was hoping on replacing K with a calculation later. The reason I didnt in the original PR was to manage scope and keep the test data could be consistent.
pybaseball/statcast_pitcher_spin.py
Outdated
import numpy as np | ||
|
||
K = .005153 #Environmental Constant | ||
SIG_DIG = 4 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is rounding really necessary? if so maybe the result should be rounded only in the last step?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had also thought that rounding at each step was superfluous, but I compared my testing data to the excel sheet that I was doing the calculations for and there were marginal errors that compounded as I went further in the calculations and amounted to non-insignificant differences in the output.
If we can verify the accuracy of the functions I built, then we can absolutely remove rounding at every step
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would make a suggestion that we leave out the rounding at each step perhaps for now and then ask Alan Nathan nicely on Twitter if he can check if the math is accurate?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that;s unnecessary. I confirmed that this code agrees with Alan Nathan's spreadsheet, not just the end result but each individual cell.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bumping this again -- @bdilday -- should we round intermediate steps or only at the end in the final formatted df?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see any reason we need to round.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK yeah then nerf the rounding
pybaseball/statcast_pitcher_spin.py
Outdated
return df | ||
|
||
def find_release_time(df): | ||
df['tR'] = time_duration(df['yR'], df['vy0'], df['ay'], 50, False).round(SIG_DIG) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should 50 be variable at top of the file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not actually sure what the variable is unfortunately. I think it could be the distance to the plate at the point of data capture. On line 52, there is also an unexplained 17/12. I'm not sure why 17 inches is there, but it is for some reason...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Both of these numbers are actually explained in the README tab of Prof. Nathan's excel workbook. Y Velocity is captured at Y=50 and the final data points are taken at y=17/12. These numbers will be defined at the top of the file as such.
pybaseball/statcast_pitcher_spin.py
Outdated
return df | ||
|
||
def find_spin_factor(df): | ||
df['S'] = (0.4*df['Cl']/(1-2.32*df['Cl'])).round(SIG_DIG) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a doc string with a reference to where this formula came from would be helpful
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, that's a good call, I'll make that change too
I don't see this feature as being urgent, maybe we could send him on email to let me know we want to merge this and give him the opportunity to finish this off? It's available on linked in profile, https://github.com/tpoatsy3 |
I didn't see all of this activity on this thread. I'll review your comments this week and make some changes. Thanks for the interest in adding this. |
Also @tpoatsy3 -- take a look at @TheCleric's work on the testing for this -- it may be of use |
Hey, @tpoatsy3 -- any progress? |
@tpoatsy3 -- anything new doing on this? |
Anyone? Anyone? Bueller? |
@schorrm I'm making these edits now. I'll get something pushed later and hopefully we can resolve this. |
Just as a heads up, Prof Nathan changed the function for |
@tpoatsy3 -- can you please drop the .gitignore from your PR to resolve that conflict? |
pybaseball/statcast_pitcher_spin.py
Outdated
df['vxbar'] = df['vxbar'].round(4) | ||
df['vybar'] = df['vybar'].round(4) | ||
df['vzbar'] = df['vzbar'].round(4) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure that we want to be rounding, but if we are, this should be SIG_DIG
pybaseball/statcast_pitcher_spin.py
Outdated
|
||
|
||
def find_average_drag(df): | ||
df['adrag'] = (-(df['ax']*df['vxbar'] + df['ay']*df['vybar'] + (df['az'] + 32.174)*df['vzbar'])/ df['vbar']).round(SIG_DIG) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should be GRAVITATIONAL_ACCELERATION
, no?
pybaseball/statcast_pitcher_spin.py
Outdated
|
||
|
||
def find_spin_factor(df): | ||
df['S'] = (0.166*np.log(0.336/(0.336-df['Cl']))).round(SIG_DIG) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we comment / explain this line here?
pybaseball/statcast_pitcher_spin.py
Outdated
def special_round(series, digit): | ||
series = series * 10**digit | ||
series = np.where(series >= 0, series + .5, series - .5) | ||
series = series.astype('int64').astype('float64') | ||
series = series / 10**digit | ||
return series |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this get a comment / explanation?
* Additional Statcast Batter Leaderboards * add pitch arsenal leaderboard * docs for new statcast batter functions * Make exitvelo test a little more resilient (hopefully for the last time) - w/ Adam Weeden
I've tried regenerating the files with the new formulas, but once we take out the aggressive rounding, I cant get the first test to pass ( Just to be clear, before I refreshed the data, I changed the Is there another way to test this data that's not circular without using excel? I'm open to ideas. |
@tpoatsy3 could you push your changes? I think it'd be easier to try to untangle this if we could see the most up-to-date code. Also, can you clarify which excel file you're using, i.e. what's the link on Alan Nathan's website? |
I believe I've addressed the issues here, with this PR tpoatsy3#2
|
Solves a license issue, removes a dependency
* Additional Statcast Batter Leaderboards * add pitch arsenal leaderboard * docs for new statcast batter functions * Make exitvelo test a little more resilient (hopefully for the last time) - w/ Adam Weeden
@bdilday Thank you so much for the help. I'm not sure how you fixed the excel rounding issue, but it's passing that unittest now. @schorrm I re-wrote the testing file to use pytest and the utility functions in |
Will be going through as soon as the rest of the CI passes... can you pull back up to date? |
I've added a testing folder. The statcast_pitching additions are going to be added with the TDD method, so this is a necessary step. This could hold other testing scripts for the rest of the package's methods. Add test data Included are test data files. The test_data models what might get scraped from the web. The target data contains fields that I'm aiming to calculate. Add statcast_pitcher_spin method with testing statcast_pitcher_spin extends statcast_pitcher by adding spin data back into the file, replacing the deprecated spin columns. The math and physics behind the calculations were modeled off of Professor Alan Nathan's work at the University of Illinois. I have no impression of what information was in the original spin columns before they were deprecated, but they now have the magnitude of movement cause by spin in the X and Z directions ('Mx' and 'Mz') as well as the axis of rotation ('phi'). statcast_pitcher_spin was developed with the test driven development method, so a testing folder containing the testing file and data was added as well. Documentation is next on the to-dos for this method. Add documentation for statcast_pitcher_spin Include 'theta' calculation in results Writing documentation led me to consider that users might want to see the 'theta' calculation. It requires some extra steps along the way (and their respective tests). This commit includes the updated method, associated tests, and refreshed test files. Add changes requested in comments Remove gitignore file Replace magic numbers with variables Add comments to some functions Remove superfluous rounding replace fuzzywuzzy with difflib (jldbc#180) Solves a license issue, removes a dependency Statcast batter leaderboards (jldbc#179) * Additional Statcast Batter Leaderboards * add pitch arsenal leaderboard * docs for new statcast batter functions * Make exitvelo test a little more resilient (hopefully for the last time) - w/ Adam Weeden updates data files based on excel w/o rounding removes rounding from calcs updates tests to use pandas updates tests to use full df updates darvish test data Merge fixed testing with working branch Add text attribute to monkeypatch DummyResponse Fix testing to use pytest properly with fixtures Add testing folder I've added a testing folder. The statcast_pitching additions are going to be added with the TDD method, so this is a necessary step. This could hold other testing scripts for the rest of the package's methods. Add test data Included are test data files. The test_data models what might get scraped from the web. The target data contains fields that I'm aiming to calculate. Add statcast_pitcher_spin method with testing statcast_pitcher_spin extends statcast_pitcher by adding spin data back into the file, replacing the deprecated spin columns. The math and physics behind the calculations were modeled off of Professor Alan Nathan's work at the University of Illinois. I have no impression of what information was in the original spin columns before they were deprecated, but they now have the magnitude of movement cause by spin in the X and Z directions ('Mx' and 'Mz') as well as the axis of rotation ('phi'). statcast_pitcher_spin was developed with the test driven development method, so a testing folder containing the testing file and data was added as well. Documentation is next on the to-dos for this method. Add documentation for statcast_pitcher_spin Include 'theta' calculation in results Writing documentation led me to consider that users might want to see the 'theta' calculation. It requires some extra steps along the way (and their respective tests). This commit includes the updated method, associated tests, and refreshed test files. Add comments to some functions Fix merge conflict errors
…into statcast-fix
Upon discussing with @bdilday, I think we'll merge and fix the unrelated FG bugs later. |
Ok, sounds good. I'll do a final commit that's rebased on the current master commit (1b8bb70) |
I'm all set, feel free to merge whenever |
This merge adds the
statcast_pitcher_spin
function to the library. The new function piggy-backs on the existingstatcast_pitcher
function, but adds calculations of the pitch's movement as a result of its spin and two angles measuring the spin's axis of rotation. The physics and formulae that I used are based on the work of Prof. Alan Nathan of the University of Illinois.I used the test driven development method, so I also included my test scripts and data. This will validate the calculations and could provide a home for scripts that test other parts of the library.
Lastly, I included documentation for the function that mirrors the rest of the package's docs. The examples have a bias towards pitchers from the Chicago Cubs, but everything else should be in order.