Added in a retrieval function for the tables containing financial data. #17

Open
wants to merge 3 commits into
base: main



@dxlnr dxlnr commented Nov 18, 2023

Added a retrieve_html_tables function in extract_items that retrieves all tables containing financial data as pandas DataFrames. These can then be stored as CSV or in a similar file format.
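For readers who want a feel for the idea, here is a minimal sketch of pulling tables out of HTML and writing one of them as CSV. It is not the code from this PR: the PR returns pandas DataFrames, while this sketch uses only the standard library to stay self-contained. The names `TableCollector`, `tables_from_html`, and `table_to_csv` are hypothetical, the sample markup is illustrative, and nested tables and `colspan` are not handled.

```python
import csv
import io
from html.parser import HTMLParser


class TableCollector(HTMLParser):
    """Collect every <table> in an HTML document as a list of rows."""

    def __init__(self):
        super().__init__()
        self.tables = []   # one list of rows per <table>
        self._row = None   # row currently being read, if any
        self._cell = None  # text fragments of the current cell, if any

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only text inside a cell is kept; whitespace between tags is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None
            self._cell = None


def tables_from_html(html):
    """Return all tables found in `html` as lists of rows of cell strings."""
    parser = TableCollector()
    parser.feed(html)
    parser.close()
    return parser.tables


def table_to_csv(rows):
    """Serialize one table to a CSV string."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# Illustrative markup, not taken from a real filing.
sample = """
<table>
  <tr><th>Item</th><th>A</th><th>B</th></tr>
  <tr><td>Basic</td><td>952</td><td>835</td></tr>
  <tr><td>Diluted</td><td>1,039</td><td>835</td></tr>
</table>
"""

tables = tables_from_html(sample)
print(table_to_csv(tables[0]))
```

Values that contain a comma, such as `1,039`, are quoted by the `csv` writer, which matches the quoting seen in the extracted output later in this thread.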

@eloukas
Collaborator

eloukas commented Nov 19, 2023

Hi @dxlnr, thanks for the pull request! 👋
Indeed, this is something that people are interested in. Before merging it, we need to see how well it works.

Could you provide the output for a few (e.g., 4 or 5) reports from different companies?
(I am asking because table structure might change between different years and different companies.)

For example, you could copy/paste (or screenshot) the relevant table in the report, followed by the code's output.
It would also be good to say which report each one comes from, so I can reproduce the results before merging this into the codebase.

Once again, many thanks for the contribution 🙌

@dxlnr
Author

dxlnr commented Nov 19, 2023

Hi! I added two more commits with minor fixes and a testing pipeline for retrieving the tables. The results are in tests/fixtures/EXTRACTED_TABLES.zip. I used the AMD filings of 2017 and 2022 and kept the table extraction separate from the raw text/JSON extraction. I decided to create one CSV file for all tables from a single filing. There may be a better way to collect these tables; the problem is that they have different numbers of columns throughout.
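One possible way to keep tables of different widths in a single CSV file is to put a marker row before each table, so the file can be split again later. This is only a sketch under that assumption, not what the PR does: the `# table N` marker convention and the function names `tables_to_single_csv` and `split_single_csv` are hypothetical, and the sample rows are illustrative.

```python
import csv
import io


def tables_to_single_csv(tables):
    """Serialize several tables of differing widths into one CSV string."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for index, rows in enumerate(tables):
        if index > 0:
            writer.writerow([])                # blank line between tables
        writer.writerow([f"# table {index}"])  # marker row (hypothetical convention)
        writer.writerows(rows)
    return buffer.getvalue()


def split_single_csv(text):
    """Recover the list of tables written by tables_to_single_csv."""
    tables = []
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue  # blank separator line
        if len(row) == 1 and row[0].startswith("# table "):
            tables.append([])  # a marker row starts a new table
        elif tables:
            tables[-1].append(row)
    return tables


# Two illustrative tables with different column counts.
example_tables = [
    [["Basic", "952", "835"]],
    [["Item", "A", "B", "C", "D"], ["Diluted", "1,039", "952", "835", "835"]],
]

combined = tables_to_single_csv(example_tables)
print(combined)
```

The obvious alternative is one CSV file per table, which avoids the marker convention at the cost of many small files per filing.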

@dxlnr
Author

dxlnr commented Nov 19, 2023

[Screenshot from 2023-11-19 18-08-50]
The only remaining issue is with table rows that contain blank cells, like the Basic and Diluted rows in the image. They are extracted as:
,,Basic,952,952,835,835
,,Diluted,"1,039",952,835,835
This is not quite right: the blank cells end up at the start of the row instead of between the value columns.

The expected output would be:
Basic,952,,952,835,,835
Diluted,"1,039",,952,835,,835
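One way to think about a fix is to re-place the non-empty cells of a shifted row into the value slots of a row layout that is known to be correct. The sketch below assumes such a layout is available, for example from a well-formed neighboring row; `realign_row` and `template` are hypothetical names and not part of this PR. It would also misplace cells in a row where a value is legitimately empty, so it leaves a row untouched unless the counts match.

```python
def realign_row(row, template):
    """Place the non-empty cells of `row` into the value slots of `template`.

    `template` holds one boolean per column: True where a value belongs,
    False where the column should stay blank.
    """
    values = [cell for cell in row if cell != ""]
    if len(row) != len(template) or len(values) != sum(template):
        return list(row)  # layout does not match; leave the row unchanged
    remaining = iter(values)
    return [next(remaining) if is_value else "" for is_value in template]


# Layout of the desired rows: label, value, blank, value, value, blank, value.
template = [True, True, False, True, True, False, True]

basic = realign_row(["", "", "Basic", "952", "952", "835", "835"], template)
diluted = realign_row(["", "", "Diluted", "1,039", "952", "835", "835"], template)
print(basic)
print(diluted)
```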

@dxlnr
Author

dxlnr commented Nov 19, 2023

Other than that, I created an additional function extract_tables in extract_items.py to make the table extraction testable. This function uses parts of the extract_items function; the two could be merged in the future to avoid duplicated code.

@dxlnr
Author

dxlnr commented Nov 28, 2023

@eloukas Is there anything I can help with to get this merged?
