feat: Add ExcelDataLoader for Data Formulator #158
Closed
rafaelascanio wants to merge 1 commit into
This commit introduces a new data loader for Microsoft's Data Formulator, enabling you to load data directly from Excel files (.xlsx).
Key changes include:
- Created `ExcelDataLoader` class in `py-src/data_formulator/data_loader/excel_data_loader.py`, inheriting from `ExternalDataLoader`.
- Implemented methods in `ExcelDataLoader`:
  - `list_params()`: Defines `file_path` as a required parameter.
  - `__init__()`: Initializes the loader with the Excel file path and DuckDB connection, using pandas and openpyxl to read the file.
  - `list_tables()`: Lists all sheets in the Excel file as available tables, providing sheet names, column details, and sample data.
  - `ingest_data()`: Loads data from a specified Excel sheet into a DuckDB table, using the inherited `ingest_df_to_duckdb` method. Supports custom table naming via `name_as`.
  - `view_query_sample()`: Returns a JSON sample of the first few rows of a specified sheet.
  - `ingest_data_from_query()`: Raises `NotImplementedError` as direct querying is not applicable to Excel files in this context.
- Registered `ExcelDataLoader` in `py-src/data_formulator/data_loader/__init__.py` to make it available to the application.
- Added `openpyxl` to `requirements.txt` as a necessary dependency for pandas to handle `.xlsx` files (`pandas` was already listed).
- Created comprehensive unit tests in `py-src/data_formulator/tests/test_excel_data_loader.py` covering various functionalities of `ExcelDataLoader`, including initialization, listing tables, data ingestion, sample viewing, and error handling scenarios. A temporary Excel file with multiple sheets is generated during test setup for thorough testing.
This new loader enhances Data Formulator's capability to work with diverse data sources, allowing you to easily integrate your existing Excel-based datasets for visualization and analysis.
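For readers skimming the description, the interface shape listed above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `ExternalDataLoader` base class here is a hypothetical stand-in defined inline, the method signatures and the parameter dictionary layout are assumptions, and the pandas/openpyxl reading and DuckDB ingestion steps are deliberately left out so the sketch runs with the standard library alone.

```python
from abc import ABC, abstractmethod
from typing import Any


class ExternalDataLoader(ABC):
    """Hypothetical stand-in for the real base class; the actual API may differ."""

    @staticmethod
    @abstractmethod
    def list_params() -> list[dict[str, Any]]:
        """Describe the parameters the loader expects."""

    @abstractmethod
    def ingest_data_from_query(self, query: str, name_as: str) -> None:
        """Ingest the result of a query into a table named `name_as`."""


class ExcelDataLoader(ExternalDataLoader):
    def __init__(self, params: dict[str, Any], duckdb_conn: Any = None):
        # Assumption: the loader receives its parameters as a plain dict.
        # Checking required parameters up front is an illustrative choice,
        # not something the PR description states.
        missing = [
            p["name"]
            for p in self.list_params()
            if p["required"] and not params.get(p["name"])
        ]
        if missing:
            raise ValueError(f"Missing required parameter(s): {', '.join(missing)}")
        self.file_path = params["file_path"]
        self.duckdb_conn = duckdb_conn  # unused in this sketch

    @staticmethod
    def list_params() -> list[dict[str, Any]]:
        # `file_path` is the single required parameter named in the description;
        # the other keys in this dict are assumed for illustration.
        return [
            {
                "name": "file_path",
                "type": "string",
                "required": True,
                "description": "Path to the .xlsx file",
            }
        ]

    def ingest_data_from_query(self, query: str, name_as: str) -> None:
        # Excel sheets are not a queryable source, so this is unsupported.
        raise NotImplementedError(
            "ExcelDataLoader does not support query-based ingestion"
        )
```

Raising `NotImplementedError` keeps the subclass instantiable while making the unsupported path fail loudly instead of silently doing nothing.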