Feature: Optimize dataframe conversion in JPMaQSDownload()
#1442
Conversation

Magnus167 commented on Feb 4, 2024 (edited):
- Reduces `API_DELAY_PARAM` from 0.3 to 0.2 seconds (200 ms)
- Removes the checks for duplicates, as DQ specifies that only unique date-value pairs will make it into a timeseries; also removes the corresponding test
- Uses a new method for converting JSON responses to DataFrames
- Adds progress notifications during dataframe conversion
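A minimal sketch of what progress reporting during JSON-to-DataFrame conversion might look like. The function name, the shape of the timeseries payloads, and the `report_every` parameter are all assumptions for illustration, not the PR's actual implementation:

```python
import pandas as pd

def jsons_to_dfs(timeseries_list, report_every=100):
    # Hypothetical helper: convert each JSON timeseries to a DataFrame,
    # printing periodic progress so long downloads are not silent.
    dfs = []
    total = len(timeseries_list)
    for ix, ts in enumerate(timeseries_list, start=1):
        dfs.append(pd.DataFrame(ts["data"], columns=["real_date", "obs"]))
        if ix % report_every == 0 or ix == total:
            print(f"Converted {ix}/{total} timeseries")
    return dfs
```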
Codecov Report

All modified and coverable lines are covered by tests ✅

```
@@ Coverage Diff @@
##           develop    #1442      +/-   ##
===========================================
+ Coverage    82.05%   82.09%   +0.04%
===========================================
  Files           59       59
  Lines         5906     5909       +3
===========================================
+ Hits          4846     4851       +5
+ Misses        1060     1058       -2
===========================================
```
Review thread on `macrosynergy/download/jpmaqs.py` (outdated):
Removed duplicate-check logic:

```python
final_df: pd.DataFrame = pd.concat(dfs, ignore_index=True)
dups_df: pd.DataFrame = final_df.groupby(
    ["real_date", "cid", "xcat", "metric"]
)["obs"].count()
if sum(dups_df > 1) > 0:
    dups_df = pd.DataFrame(dups_df[dups_df > 1].index.tolist())
    err_str: str = "Duplicate data found for the following expressions:\n"
    for i in dups_df.groupby([1, 2, 3]).groups:
        dts_series: pd.Series = dups_df.iloc[
            dups_df.groupby([1, 2, 3]).groups[i]
        ][0]
        dts: List[str] = dts_series.tolist()
        max_date: str = pd.to_datetime(max(dts)).strftime("%Y-%m-%d")
        min_date: str = pd.to_datetime(min(dts)).strftime("%Y-%m-%d")
        expression: str = self.construct_expressions(
            cids=[i[0]], xcats=[i[1]], metrics=[i[2]]
        )[0]
        err_str += (
            f"Expression: {expression}, Dates: {min_date} to {max_date}\n"
        )
    raise InvalidDataframeError(err_str)
```

Replaced with the new metric-wise merge:

```python
final_df: pd.DataFrame = functools.reduce(
    lambda left, right: pd.merge(
        left,
        right,
        on=["real_date", "cid", "xcat"],
    ),
    list(
        map(
            lambda metricx: pd.concat(dfs_dict[metricx], ignore_index=True),
            dfs_dict,
        )
    ),
)
```
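A self-contained toy illustration of the merge pattern above: one list of per-download DataFrames per metric, concatenated per metric, then merged pairwise on the shared index columns. The contents of `dfs_dict` here are invented sample data, not output from the library:

```python
import functools
import pandas as pd

# Hypothetical per-metric frames, as JPMaQSDownload might collect them.
dfs_dict = {
    "value": [pd.DataFrame({"real_date": ["2024-01-01"], "cid": ["USD"],
                            "xcat": ["EQXR"], "value": [1.0]})],
    "grading": [pd.DataFrame({"real_date": ["2024-01-01"], "cid": ["USD"],
                              "xcat": ["EQXR"], "grading": [1.0]})],
}

# Concatenate each metric's frames, then inner-merge across metrics so each
# (real_date, cid, xcat) row carries one column per metric.
final_df = functools.reduce(
    lambda left, right: pd.merge(left, right, on=["real_date", "cid", "xcat"]),
    [pd.concat(dfs_dict[m], ignore_index=True) for m in dfs_dict],
)
print(final_df.columns.tolist())
# → ['real_date', 'cid', 'xcat', 'value', 'grading']
```

Note that `pd.merge` defaults to an inner join, so a date missing for any one metric drops the whole row; whether that is the desired behaviour depends on how gaps in DQ timeseries should be handled.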
Should we drop this check, @emhbrine, @sandresen1?