Feature: Optimize dataframe conversion in JPMaQSDownload()
#1442
Conversation

Magnus167 commented on Feb 4, 2024 (edited):
- Reduces `API_DELAY_PARAM` from 0.3 to 0.2 seconds (200 ms)
- Removes the checks for duplicates, as DQ specifies that only unique date-value pairs will make it into a timeseries; also removes the corresponding test
- Uses a new method for converting JSON responses to DataFrames
- Adds progress notifications during dataframe conversion
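A minimal sketch of what progress reporting during JSON-to-DataFrame conversion might look like. The function name, the shape of the timeseries payloads, and the `report_every` parameter are all assumptions for illustration, not the PR's actual implementation:

```python
import pandas as pd

def jsons_to_dfs(timeseries_list, report_every=100):
    # Hypothetical helper: convert each JSON timeseries to a DataFrame,
    # printing periodic progress so long downloads are not silent.
    dfs = []
    total = len(timeseries_list)
    for ix, ts in enumerate(timeseries_list, start=1):
        dfs.append(pd.DataFrame(ts["data"], columns=["real_date", "obs"]))
        if ix % report_every == 0 or ix == total:
            print(f"Converted {ix}/{total} timeseries")
    return dfs
```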
Codecov Report

All modified and coverable lines are covered by tests ✅

```
@@ Coverage Diff @@
##           develop    #1442      +/-   ##
===========================================
+ Coverage    82.05%   82.09%   +0.04%
===========================================
  Files           59       59
  Lines         5906     5909       +3
===========================================
+ Hits          4846     4851       +5
+ Misses        1060     1058       -2
===========================================
```
Review thread on `macrosynergy/download/jpmaqs.py` (outdated):
Removed duplicate-check logic:

```python
final_df: pd.DataFrame = pd.concat(dfs, ignore_index=True)
dups_df: pd.DataFrame = final_df.groupby(
    ["real_date", "cid", "xcat", "metric"]
)["obs"].count()
if sum(dups_df > 1) > 0:
    dups_df = pd.DataFrame(dups_df[dups_df > 1].index.tolist())
    err_str: str = "Duplicate data found for the following expressions:\n"
    for i in dups_df.groupby([1, 2, 3]).groups:
        dts_series: pd.Series = dups_df.iloc[
            dups_df.groupby([1, 2, 3]).groups[i]
        ][0]
        dts: List[str] = dts_series.tolist()
        max_date: str = pd.to_datetime(max(dts)).strftime("%Y-%m-%d")
        min_date: str = pd.to_datetime(min(dts)).strftime("%Y-%m-%d")
        expression: str = self.construct_expressions(
            cids=[i[0]], xcats=[i[1]], metrics=[i[2]]
        )[0]
        err_str += (
            f"Expression: {expression}, Dates: {min_date} to {max_date}\n"
        )
    raise InvalidDataframeError(err_str)
```

Replaced with the new metric-wise merge:

```python
final_df: pd.DataFrame = functools.reduce(
    lambda left, right: pd.merge(
        left,
        right,
        on=["real_date", "cid", "xcat"],
    ),
    list(
        map(
            lambda metricx: pd.concat(dfs_dict[metricx], ignore_index=True),
            dfs_dict,
        )
    ),
)
```
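A self-contained toy illustration of the merge pattern above: one list of per-download DataFrames per metric, concatenated per metric, then merged pairwise on the shared index columns. The contents of `dfs_dict` here are invented sample data, not output from the library:

```python
import functools
import pandas as pd

# Hypothetical per-metric frames, as JPMaQSDownload might collect them.
dfs_dict = {
    "value": [pd.DataFrame({"real_date": ["2024-01-01"], "cid": ["USD"],
                            "xcat": ["EQXR"], "value": [1.0]})],
    "grading": [pd.DataFrame({"real_date": ["2024-01-01"], "cid": ["USD"],
                              "xcat": ["EQXR"], "grading": [1.0]})],
}

# Concatenate each metric's frames, then inner-merge across metrics so each
# (real_date, cid, xcat) row carries one column per metric.
final_df = functools.reduce(
    lambda left, right: pd.merge(left, right, on=["real_date", "cid", "xcat"]),
    [pd.concat(dfs_dict[m], ignore_index=True) for m in dfs_dict],
)
print(final_df.columns.tolist())
# → ['real_date', 'cid', 'xcat', 'value', 'grading']
```

Note that `pd.merge` defaults to an inner join, so a date missing for any one metric drops the whole row; whether that is the desired behaviour depends on how gaps in DQ timeseries should be handled.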
Should we drop this check, @emhbrine, @sandresen1?