Added thread_requests parameter to import_pbp_data and import_weekly … #46

Merged 7 commits on Sep 2, 2023


bendominguez0111
Contributor

…data functions.

Added an optional thread_requests parameter to import_pbp_data and import_weekly_data that uses threading to speed up requests for play-by-play and weekly data. I also tested async and multiprocessing, but threading gave the best results. Depending on the connection, loading PBP data from 1999 to 2022 was 25-50% faster.
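For readers unfamiliar with the approach, here is a minimal sketch of how an opt-in flag like thread_requests can switch between sequential and threaded downloads. It is not the code from this PR: fetch_year and load_years are hypothetical names, and the sleep stands in for a real HTTP request.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_year(year):
    """Hypothetical stand-in for one download; the sleep mimics network latency."""
    time.sleep(0.01)
    return {"season": year, "rows": 100}


def load_years(years, thread_requests=False):
    """Return one result per year, optionally fetching from worker threads."""
    if not thread_requests:
        # Default path: one request at a time
        return [fetch_year(year) for year in years]
    with ThreadPoolExecutor(max_workers=8) as pool:
        # map() returns results in input order even if requests finish out of order
        return list(pool.map(fetch_year, years))


years = list(range(1999, 2023))  # seasons 1999 through 2022
sequential = load_years(years)
threaded = load_years(years, thread_requests=True)
```

Because the work is I/O-bound, the threads spend most of their time waiting on the network, which is where the speedup comes from. Both calls return the same data; only the wall-clock time differs.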

Also added associated tests. I added an __init__.py file to the tests folder because it was the only way I could get pytest to run (although I may have done something wrong there).

I did not open an issue for this beforehand, but I spoke about the idea on Twitter with @cooperdff, who thought it was a good one.

Test results below:

image

Collaborator

@alecglen alecglen left a comment


Great idea, @bendominguez0111! Just a couple small fixes needed, plus a question on the dependency addition.

nfl_data_py/__init__.py (outdated)
nfl_data_py/__init__.py (outdated)
nfl_data_py/__init__.py
nfl_data_py/tests/nfl_test.py
nfl_data_py/tests/nfl_test.py
@bendominguez0111
Contributor Author

Cool! I will make some of these changes. I actually thought up another optimization for this but never got to it: it removes the need to sort the CSVs at the end, which takes a considerable amount of time if you're pulling a lot of data. I'll add that in as well, along with the requested changes.

… not to have to sort dfs when using threaded requests
@bendominguez0111
Contributor Author

Made those changes. I didn't move the if thread_requests block all the way down to the # load data comment because that didn't quite work, but the new code no longer bypasses caching. I also set engine = auto instead of explicitly setting pyarrow.

@bendominguez0111
Contributor Author

Oh, and compared to last time I added some logic to avoid sorting the years at the end. Since the HTTP requests can resolve out of order, the original version needed a sort at the end, which took some time. Now I just create a fixed-size list and insert each response into it as its thread resolves.
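The fixed-size list idea can be sketched roughly as follows. This is an illustration under assumptions, not the merged code: fetch_year and load_in_order are hypothetical names, and the random sleep simulates requests finishing out of order.

```python
import random
import threading
import time


def fetch_year(year):
    """Hypothetical stand-in for one download with variable latency."""
    time.sleep(random.uniform(0.0, 0.02))
    return f"data-{year}"


def load_in_order(years):
    """Fetch every year in its own thread, keeping results in request order."""
    results = [None] * len(years)  # one pre-allocated slot per requested year

    def worker(slot, year):
        # Each thread writes only to its own slot, so no final sort is needed
        results[slot] = fetch_year(year)

    threads = [
        threading.Thread(target=worker, args=(slot, year))
        for slot, year in enumerate(years)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()  # wait until every slot has been filled
    return results


ordered = load_in_order([2019, 2020, 2021, 2022])
```

Since the slot index is fixed when each thread is created, the completion order no longer matters and the results come back already ordered by year.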

Collaborator

@alecglen alecglen left a comment


Great idea on avoiding the extra sort operation, Ben! All seems to work as intended, just had a couple more quick suggestions you can commit if you like. We'll get this merged soon.

setup.py (outdated)
nfl_data_py/__init__.py (outdated)
nfl_data_py/__init__.py (outdated)
Ben Dominguez and others added 3 commits August 20, 2023 11:53
Co-authored-by: Alec Ostrander <alec.ostrander@gmail.com>
Co-authored-by: Alec Ostrander <alec.ostrander@gmail.com>
Co-authored-by: Alec Ostrander <alec.ostrander@gmail.com>
@bendominguez0111
Contributor Author

Committed those suggestions. Good call; those lines were a bit cluttered.

@cooperdff cooperdff merged commit 9fc431a into nflverse:main Sep 2, 2023
3 participants