-
Notifications
You must be signed in to change notification settings - Fork 74
Google Drive handler integration #577
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@coderabbitai full review |
✅ Actions performedFull review triggered. |
WalkthroughThis update integrates Google Drive support into the application. A new global configuration variable controls threading when service account credentials are used. The MIME types mechanism is enhanced by initializing custom MIME mappings, and a new Changes
Sequence Diagram(s)sequenceDiagram
participant Client
participant DownloadStageHandler
participant GoogleDriveHandler
participant "Google Drive API"
Client->>DownloadStageHandler: Request file download
DownloadStageHandler->>GoogleDriveHandler: Invoke handle_file(path)
GoogleDriveHandler->>GoogleDriveHandler: Extract file ID & determine MIME type
alt Google Workspace File
GoogleDriveHandler->>Google Drive API: Export file (e.g., to PDF)
else Regular File
GoogleDriveHandler->>Google Drive API: Download file content
end
Google Drive API-->>GoogleDriveHandler: Return file data
GoogleDriveHandler->>DownloadStageHandler: Pass file data
DownloadStageHandler->>Client: Deliver file
Poem
✨ Finishing Touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (4)
ricecooker/config.py (1)
223-224: Consider documenting the threading limitation more explicitly.While the comment explains that single-threading is necessary to avoid random errors with service accounts, it would be helpful to provide more specific information about the nature of these errors or link to relevant documentation.
if GOOGLE_SERVICE_ACCOUNT_CREDENTIALS_PATH: - TASK_THREADS = 1 # If using service account, only one thread is allowed - random errors happen otherwise. + TASK_THREADS = 1 # Google Drive API with service accounts can experience rate limiting and concurrent request failures when using multiple threadstests/pipeline/test_transfer.py (1)
130-132: Use a context manager for temporary file writing.Replace manual file handling with a context manager to handle cleanup automatically and adhere to best practices:
- temp_file = tempfile.NamedTemporaryFile(mode="w", suffix=".json", delete=False) - json.dump(mock_credentials, temp_file) - temp_file.close() + with tempfile.NamedTemporaryFile(mode="w", suffix=".json", delete=False) as temp_file: + json.dump(mock_credentials, temp_file)🧰 Tools
🪛 Ruff (0.8.2)
130-130: Use a context manager for opening files
(SIM115)
ricecooker/utils/pipeline/transfer.py (2)
270-272: Clarify exception chaining for better context.When re-raising an exception inside an
exceptblock, addfrom None(or usefrom err) to preserve clarity about the original error context:-except ImportError: - raise RuntimeError("Google Drive downloads require google-auth and google-api-python-client libraries") +except ImportError as err: + raise RuntimeError("Google Drive downloads require google-auth and google-api-python-client libraries") from err🧰 Tools
🪛 Ruff (0.8.2)
270-272: Within an
exceptclause, raise exceptions withraise ... from errorraise ... from Noneto distinguish them from errors in exception handling(B904)
319-321: Improve error re-raising style.For consistency and to distinguish this from exceptions in the
exceptclause, include exception chaining:-except ImportError: - raise RuntimeError("Google Drive downloads require google-api-python-client library") +except ImportError as err: + raise RuntimeError("Google Drive downloads require google-api-python-client library") from err🧰 Tools
🪛 Ruff (0.8.2)
319-321: Within an
exceptclause, raise exceptions withraise ... from errorraise ... from Noneto distinguish them from errors in exception handling(B904)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (11)
ricecooker/config.py(1 hunks)ricecooker/utils/pipeline/__init__.py(2 hunks)ricecooker/utils/pipeline/mime.types(1 hunks)ricecooker/utils/pipeline/transfer.py(3 hunks)setup.py(1 hunks)tests/cassettes/test_gdrive_doc.yaml(1 hunks)tests/cassettes/test_gdrive_slideshow.yaml(1 hunks)tests/cassettes/test_gdrive_vtt.yaml(1 hunks)tests/pipeline/test_transfer.py(2 hunks)tests/vcr_config.py(1 hunks)tox.ini(1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
tests/pipeline/test_transfer.py (1)
ricecooker/utils/pipeline/transfer.py (3)
get_filename_from_content_disposition_header(89-100)GoogleDriveHandler(241-358)execute(398-404)
🪛 Gitleaks (8.21.2)
tests/pipeline/test_transfer.py
72-99: Identified a Private Key, which may compromise cryptographic security and sensitive data encryption.
(private-key)
🪛 Ruff (0.8.2)
tests/pipeline/test_transfer.py
130-130: Use a context manager for opening files
(SIM115)
ricecooker/utils/pipeline/transfer.py
270-272: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
319-321: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
🔇 Additional comments (20)
tests/vcr_config.py (1)
10-10: Good addition for security in test recordings.Adding
filter_headers=["authorization"]ensures that sensitive authorization headers aren't accidentally recorded in test cassettes. This is especially important when working with external APIs like Google Drive that use authorization tokens.setup.py (1)
63-63: Well-structured dependency management for Google Drive integration.The added optional dependencies for Google Drive are properly defined as extras, making them only installed when explicitly needed. The selected packages are the official Google client libraries required for Google Drive API integration.
tox.ini (1)
10-10: Correctly updated test environment to include Google Drive dependencies.Adding the
google_driveextra to the test environment ensures that the required dependencies are available during testing, which is essential for the new integration tests to pass.ricecooker/config.py (1)
219-221: Well-implemented configuration for Google Service Account credentials.The implementation correctly retrieves the credentials path from environment variables, following the established pattern in the codebase.
ricecooker/utils/pipeline/__init__.py (3)
1-2: Added necessary imports for MIME type handling.These imports are required to configure the custom MIME types handling that addresses Windows filetype registry issues.
14-16: Good addition of detailed comments explaining the workaround.The comment clearly explains why this modification is necessary - to prevent issues with Windows filetype registry affecting the
mimetypes.guess_typefunction. The link to the reference is a nice touch for anyone needing more information.
17-17: Effectively configures custom MIME types file.This line initializes the
mimetypesmodule with a custom MIME types file located in the same directory as the current file. Usingos.path.abspath()andos.path.join()ensures the path is correctly formed regardless of the current working directory.ricecooker/utils/pipeline/mime.types (6)
1-23: Comprehensive header with clear explanation of file purpose.The header provides excellent documentation about the file's purpose, format, and usage. It clearly explains how programs will use this file to map file extensions to their respective MIME types.
26-459: Well-structured application MIME types section.This section comprehensively covers application MIME types with their associated file extensions. The organization is clear and follows standard MIME types definition format.
462-560: Complete audio MIME types definitions.The audio MIME types section is well-organized and includes all common audio formats that the application might encounter.
562-617: Chemical MIME types for specialized content.Including chemical MIME types shows thoroughness and ensures the application can handle specialized scientific file formats.
619-671: Comprehensive image MIME types.This section covers all standard image formats, ensuring proper handling of various image files in the application.
673-679: Complete coverage of remaining MIME types.The remaining sections (inode, message, model, multipart, text, video, and others) ensure comprehensive coverage of all potential file types the application might handle.
Also applies to: 680-716, 717-789, 790-827
tests/cassettes/test_gdrive_doc.yaml (3)
1-59: Well-structured OAuth authentication test case.This section properly captures the OAuth authentication flow with Google's API, including the necessary headers and expected response. The use of binary encoding for the response body is appropriate for handling potentially complex JSON responses.
60-111: Correct implementation of file metadata retrieval test.This test case properly requests file metadata with appropriate fields (name, mimeType) and captures the expected response. The headers accurately reflect Google Drive API requirements.
112-523: Comprehensive PDF export test from Google Drive.This section thoroughly tests the file download functionality, including appropriate headers for range requests. The binary response contains a valid PDF document, which is essential for testing the Google Docs export functionality.
tests/cassettes/test_gdrive_vtt.yaml (3)
1-59: Well-structured OAuth authentication for VTT file test.Similar to the previous test file, this section properly captures the authentication flow required for accessing Google Drive files.
60-111: Proper file metadata retrieval for VTT file.The test correctly requests and captures the file metadata response for a VTT file.
112-214: Effective VTT file content testing.This section properly tests downloading a WebVTT subtitle file from Google Drive. The response contains valid WebVTT format with time codes and subtitle text. The response code 206 (Partial Content) is appropriate for range requests.
tests/cassettes/test_gdrive_slideshow.yaml (1)
1-603: Consider sanitizing or rotating OAuth token values.This cassette file includes raw token data in the request body (line 3) and response credentials (lines 17-33). If these credentials are valid, storing them could pose a security risk. For best practices, ensure that the tokens are revoked or that any functional tokens are masked when committing code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Copilot reviewed 14 out of 15 changed files in this pull request and generated 1 comment.
Files not reviewed (1)
- tox.ini: Language not supported
| fileId=file_id, mimeType=export_mime_type | ||
| ) | ||
| # Update extension based on export format | ||
| ext = mimetypes.guess_extension(export_mime_type) or "" |
Copilot
AI
Apr 16, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If mimetypes.guess_extension fails to determine a file extension, the handler defaults to an empty extension, which might lead to unsupported file types. Consider adding an explicit fallback or mapping for unrecognized export MIME types.
| ext = mimetypes.guess_extension(export_mime_type) or "" | |
| # Fallback mapping for unrecognized MIME types | |
| fallback_mime_map = { | |
| "application/vnd.openxmlformats-officedocument.wordprocessingml.document": ".docx", | |
| "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet": ".xlsx", | |
| "application/vnd.openxmlformats-officedocument.presentationml.presentation": ".pptx", | |
| "application/pdf": ".pdf", | |
| "text/plain": ".txt", | |
| } | |
| ext = ( | |
| mimetypes.guess_extension(export_mime_type) | |
| or fallback_mime_map.get(export_mime_type) | |
| or ".bin" | |
| ) |
|
After conversation with @marcellamaki, I think this is mostly good to go as is, but requires a few tweaks to make error messages more explicit and better document the required installation for the extras require google_drive dependencies. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
♻️ Duplicate comments (1)
ricecooker/utils/pipeline/transfer.py (1)
346-354: Fallback extension mapping prevents empty filenames
mimetypes.guess_extension()returnsNonefor several Google export MIME types on some platforms, which yields files with no extension (ext = ""). A simple fallback table (see earlier suggestion) avoids this.
🧹 Nitpick comments (2)
ricecooker/utils/pipeline/transfer.py (1)
295-305: Broaden URL patterns to cover additional Drive share links
_get_file_idmisses common variants such asuc?id=anddownload?id=links, leading to false negatives.- FILE_ID_PATTERNS = [ + FILE_ID_PATTERNS = [ r"drive\.google\.com/file/d/([^/]+)", # /file/d/{fileid}/view r"drive\.google\.com/open\?id=([^/]+)", # /open?id={fileid} r"docs\.google\.com/\w+/d/([^/]+)", # docs/sheets/etc + r"drive\.google\.com/uc\?id=([^&]+)", # /uc?id={fileid}&export=download + r"drive\.google\.com/[^/]+/download\?id=([^&]+)", # /xxx/download?id={fileid} ]This small extension greatly improves handler robustness.
tests/pipeline/test_transfer.py (1)
130-133: Prefer a context-managedNamedTemporaryFileUsing
NamedTemporaryFilewithout awithblock relies on a manualclose()and risks leaking file descriptors if an exception is raised between the two calls.- temp_file = tempfile.NamedTemporaryFile(mode="w", suffix=".json", delete=False) - json.dump(mock_credentials, temp_file) - temp_file.close() + with tempfile.NamedTemporaryFile( + mode="w", suffix=".json", delete=False + ) as temp_file: + json.dump(mock_credentials, temp_file)The yielded path (
temp_file.name) is still available after the block becausedelete=False.🧰 Tools
🪛 Ruff (0.8.2)
130-130: Use a context manager for opening files
(SIM115)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (11)
ricecooker/config.py(1 hunks)ricecooker/utils/pipeline/__init__.py(2 hunks)ricecooker/utils/pipeline/mime.types(1 hunks)ricecooker/utils/pipeline/transfer.py(3 hunks)setup.py(1 hunks)tests/cassettes/test_gdrive_doc.yaml(1 hunks)tests/cassettes/test_gdrive_slideshow.yaml(1 hunks)tests/cassettes/test_gdrive_vtt.yaml(1 hunks)tests/pipeline/test_transfer.py(2 hunks)tests/vcr_config.py(1 hunks)tox.ini(1 hunks)
✅ Files skipped from review due to trivial changes (6)
- tests/vcr_config.py
- tox.ini
- tests/cassettes/test_gdrive_vtt.yaml
- tests/cassettes/test_gdrive_doc.yaml
- ricecooker/utils/pipeline/mime.types
- tests/cassettes/test_gdrive_slideshow.yaml
🚧 Files skipped from review as they are similar to previous changes (3)
- setup.py
- ricecooker/utils/pipeline/init.py
- ricecooker/config.py
🧰 Additional context used
🧠 Learnings (1)
tests/pipeline/test_transfer.py (1)
Learnt from: rtibbles
PR: learningequality/ricecooker#577
File: tests/pipeline/test_transfer.py:72-100
Timestamp: 2025-04-14T16:03:33.574Z
Learning: In testing code, mock credentials like private keys that are clearly labeled as mocks (e.g., "MOCK_PRIVATE_KEY") and only used in test fixtures are acceptable and should not be flagged as security risks. These mock credentials are intentionally used when real credentials aren't available for testing purposes.
🪛 Ruff (0.8.2)
tests/pipeline/test_transfer.py
130-130: Use a context manager for opening files
(SIM115)
ricecooker/utils/pipeline/transfer.py
273-276: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
323-326: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
⏰ Context from checks skipped due to timeout of 90000ms (5)
- GitHub Check: Python unit tests (ubuntu-latest, 3.9)
- GitHub Check: Python unit tests (windows-latest, 3.12)
- GitHub Check: Python unit tests (windows-latest, 3.11)
- GitHub Check: Python unit tests (windows-latest, 3.10)
- GitHub Check: Python unit tests (windows-latest, 3.9)
| from googleapiclient.http import MediaIoBaseDownload | ||
| except ImportError: | ||
| raise RuntimeError( | ||
| "Google Drive downloads require google-api-python-client library\n" | ||
| + instructions | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Repeat the exception-chaining fix here as well
Replicate the from err change in this second ImportError handler to keep stack-traces intact.
🧰 Tools
🪛 Ruff (0.8.2)
323-326: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
| def drive_service(self): | ||
| try: | ||
| from google.oauth2.service_account import Credentials | ||
| from googleapiclient.discovery import build | ||
| except ImportError: | ||
| raise RuntimeError( | ||
| "Google Drive downloads require google-auth and google-api-python-client libraries\n" | ||
| + instructions | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Use exception chaining to preserve ImportError context
Catching an ImportError and re-raising as a generic RuntimeError hides the root cause, which makes troubleshooting much harder.
Chain the original exception with from err so the traceback still contains the missing-package details.
- except ImportError:
- raise RuntimeError(
+ except ImportError as err:
+ raise RuntimeError(
"Google Drive downloads require google-auth and google-api-python-client libraries\n"
- + instructions
- )
+ + instructions
+ ) from errApply the same pattern consistently throughout the file (see lines 321-326).
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| def drive_service(self): | |
| try: | |
| from google.oauth2.service_account import Credentials | |
| from googleapiclient.discovery import build | |
| except ImportError: | |
| raise RuntimeError( | |
| "Google Drive downloads require google-auth and google-api-python-client libraries\n" | |
| + instructions | |
| ) | |
| def drive_service(self): | |
| try: | |
| from google.oauth2.service_account import Credentials | |
| from googleapiclient.discovery import build | |
| except ImportError as err: | |
| raise RuntimeError( | |
| "Google Drive downloads require google-auth and google-api-python-client libraries\n" | |
| + instructions | |
| ) from err |
🧰 Tools
🪛 Ruff (0.8.2)
273-276: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
| _, ext = os.path.splitext(file["name"]) | ||
| if not ext and mime_type: | ||
| ext = mimetypes.guess_extension(mime_type) or "" | ||
|
|
||
| with self.write_file(ext.lstrip(".")) as fh: | ||
| downloader = MediaIoBaseDownload(fh, request) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Guarantee a non-empty file extension
ext may still be an empty string after both filename + mimetype checks, producing nameless files on disk. Add a final default such as .bin.
- if not ext and mime_type:
- ext = mimetypes.guess_extension(mime_type) or ""
+ if not ext and mime_type:
+ ext = mimetypes.guess_extension(mime_type) or ""
+ if not ext:
+ ext = ".bin"Combine with the earlier fallback table for export types.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| _, ext = os.path.splitext(file["name"]) | |
| if not ext and mime_type: | |
| ext = mimetypes.guess_extension(mime_type) or "" | |
| with self.write_file(ext.lstrip(".")) as fh: | |
| downloader = MediaIoBaseDownload(fh, request) | |
| _, ext = os.path.splitext(file["name"]) | |
| if not ext and mime_type: | |
| ext = mimetypes.guess_extension(mime_type) or "" | |
| if not ext: | |
| ext = ".bin" | |
| with self.write_file(ext.lstrip(".")) as fh: | |
| downloader = MediaIoBaseDownload(fh, request) |
marcellamaki
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With the updates made, I think this is good to go
Summary
References
This PR is a prerequisite for #558
Reviewer guidance
Look at the tests, this should give good confidence that the Google Drive handler is working as intended.
Summary by CodeRabbit