Feature: Optimize dataframe conversion in JPMaQSDownload() #1442

Merged: 17 commits from feature/optimize_dq_df_conversion into develop on Feb 5, 2024

Conversation

@Magnus167 (Member) commented Feb 4, 2024

  • Reduces API_DELAY_PARAM from 0.3 to 0.2 (200 ms)
  • Removes the checks for duplicates, as DQ specifies that only unique date-value pairs will make it into a timeseries; the corresponding test is also removed
  • Uses a new method for converting JSONs to DFs (see the sketch after this list)
  • Adds progress notifications during dataframe conversion
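
For illustration, a minimal, self-contained sketch of the per-metric reduce-merge conversion. It mirrors the diff shown further down in this thread; the `dfs_dict` structure and the toy data here are invented stand-ins for the frames parsed from the DQ JSON responses:

```python
import functools
import pandas as pd

# One list of long-format frames per metric (toy frames standing in for
# the real parsed DQ timeseries responses).
dfs_dict = {
    "value": [
        pd.DataFrame(
            {"real_date": ["2024-01-01"], "cid": ["USD"], "xcat": ["FX"], "value": [1.0]}
        )
    ],
    "grading": [
        pd.DataFrame(
            {"real_date": ["2024-01-01"], "cid": ["USD"], "xcat": ["FX"], "grading": [3.0]}
        )
    ],
}

# Concatenate each metric's frames into one long frame, then merge the
# per-metric frames on the index columns, yielding a single wide dataframe
# with one column per metric.
final_df = functools.reduce(
    lambda left, right: pd.merge(left, right, on=["real_date", "cid", "xcat"]),
    (pd.concat(frames, ignore_index=True) for frames in dfs_dict.values()),
)
print(final_df)
#     real_date  cid xcat  value  grading
# 0  2024-01-01  USD   FX    1.0      3.0
```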

@Magnus167 Magnus167 requested a review from a team as a code owner February 4, 2024 18:35

codecov bot commented Feb 4, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (e31a926) 82.05% compared to head (9e3e957) 82.09%.
Report is 14 commits behind head on develop.

❗ Current head 9e3e957 differs from the pull request's most recent head 0995f1e. Consider uploading reports for the commit 0995f1e to get more accurate results.

Additional details and impacted files


@@             Coverage Diff             @@
##           develop    #1442      +/-   ##
===========================================
+ Coverage    82.05%   82.09%   +0.04%     
===========================================
  Files           59       59              
  Lines         5906     5909       +3     
===========================================
+ Hits          4846     4851       +5     
+ Misses        1060     1058       -2     
| Files | Coverage Δ |
|---|---|
| macrosynergy/download/jpmaqs.py | 84.98% <100.00%> (+0.51%) ⬆️ |

... and 2 files with indirect coverage changes


Comment on lines 382 to 403
Removed (the duplicate check):

final_df: pd.DataFrame = pd.concat(dfs, ignore_index=True)
dups_df: pd.DataFrame = final_df.groupby(
    ["real_date", "cid", "xcat", "metric"]
)["obs"].count()
if sum(dups_df > 1) > 0:
    dups_df = pd.DataFrame(dups_df[dups_df > 1].index.tolist())
    err_str: str = "Duplicate data found for the following expressions:\n"
    for i in dups_df.groupby([1, 2, 3]).groups:
        dts_series: pd.Series = dups_df.iloc[
            dups_df.groupby([1, 2, 3]).groups[i]
        ][0]
        dts: List[str] = dts_series.tolist()
        max_date: str = pd.to_datetime(max(dts)).strftime("%Y-%m-%d")
        min_date: str = pd.to_datetime(min(dts)).strftime("%Y-%m-%d")
        expression: str = self.construct_expressions(
            cids=[i[0]], xcats=[i[1]], metrics=[i[2]]
        )[0]
        err_str += (
            f"Expression: {expression}, Dates: {min_date} to {max_date}\n"
        )
    raise InvalidDataframeError(err_str)

Added (the new conversion):

final_df: pd.DataFrame = functools.reduce(
    lambda left, right: pd.merge(
        left,
        right,
        on=["real_date", "cid", "xcat"],
    ),
    list(
        map(
            lambda metricx: pd.concat(dfs_dict[metricx], ignore_index=True),
            dfs_dict,
        )
    ),
)
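
In short: the removed block scanned the concatenated frame for duplicate (real_date, cid, xcat, metric) rows and raised InvalidDataframeError listing the affected expressions and date ranges; the added block drops that scan and instead concatenates each metric's frames and reduce-merges them on the shared index columns, presumably yielding one wide dataframe with a column per metric.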
@Magnus167 (Member, Author):


Should we drop this check, @emhbrine, @sandresen1?

@Magnus167 merged commit df73806 into develop on Feb 5, 2024
5 checks passed
@Magnus167 deleted the feature/optimize_dq_df_conversion branch on February 5, 2024 at 14:56