
Add mongo_sync script for syncing MongoDB data to JSON files#150

Open
SimoneBendazzoli93 wants to merge 11 commits into master from mongo-sync-to-git

SimoneBendazzoli93 (Collaborator) commented May 5, 2026

This new script connects to a MongoDB database, retrieves project and user data, filters the information, and writes it to JSON files organized by project namespace. It includes error handling for date formatting and ensures the project folder is created if it doesn't exist.

Summary by CodeRabbit

  • New Features
    • Added automated MongoDB data synchronization that exports project information and associated user lists to JSON files with normalized date formatting and namespace-based filtering to organize project exports.


Copilot AI review requested due to automatic review settings May 5, 2026 11:22
Copilot AI (Contributor) left a comment

Pull request overview

Adds a new mongo_sync utility script under the dashboard to export MongoDB maia_projects data to per-namespace JSON files, enriching each project with a derived users email list and normalizing the date field.

Changes:

  • Connect to MongoDB via env-provided credentials and fetch maia_projects / maia_users.
  • Build a filtered project payload (selected fields + derived users) and write Projects/<namespace>.json.
  • Add basic date normalization for multiple MongoDB date representations.

Overall Verdict

REQUEST CHANGES

Critical Issues

  • The user-to-project membership check is incorrect (substring match vs exact membership in a comma-separated namespace list), which can produce wrong users lists in the generated JSON.
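The difference between the two checks can be illustrated with a minimal sketch. The helper names and sample namespaces below are hypothetical and are not taken from the PR; the sketch only assumes that a user's namespaces are stored as one comma-separated string, as the review describes.

```python
def parse_namespaces(raw):
    """Split a comma-separated namespace string into trimmed, non-empty names."""
    return [part.strip() for part in (raw or "").split(",") if part.strip()]


def is_member_substring(project_namespace, raw_user_namespaces):
    # Problematic variant: substring test against the raw string.
    return project_namespace in (raw_user_namespaces or "")


def is_member_exact(project_namespace, raw_user_namespaces):
    # Exact membership test against the parsed list.
    return project_namespace in parse_namespaces(raw_user_namespaces)


# Hypothetical data: the user belongs to "team-beta-2", not "team-beta".
user_namespaces = "team-alpha, team-beta-2"
print(is_member_substring("team-beta", user_namespaces))  # True (false positive)
print(is_member_exact("team-beta", user_namespaces))      # False
```

With the substring check, any project whose namespace is a prefix or fragment of another namespace would pick up that other project's users.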

Optional Notes (non-blocking)

  • Some key-mapping branches (group_id/username) are currently unreachable due to the metadata gate and should be removed or reworked.
  • The script performs DB access and file writes at import time; guarding with a main()/if __name__ == "__main__": would prevent accidental side effects.
  • Remove or rewrite the inline “git clone with credentials in URL” comment to avoid promoting a secret-leaking workflow.
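The import-time side effect mentioned in the second note can be avoided with the usual entry-point guard. This is a sketch only: `run_sync` is a hypothetical placeholder for the script's real work and does not appear in the PR.

```python
def run_sync():
    """Placeholder for the real work: connect to MongoDB, build payloads, write JSON."""
    # Hypothetical stand-in; the actual script would query the database here.
    return 0


def main():
    """Entry point invoked only when the file is executed as a script."""
    return run_sync()


if __name__ == "__main__":
    # Importing this module from tests or other code does not reach this branch.
    main()
```

With this layout, importing the module only defines functions; nothing touches the database or the filesystem until `main()` is called.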

Comment thread dashboard/mongo_sync.py Outdated
Comment thread dashboard/mongo_sync.py Outdated
Comment thread dashboard/mongo_sync.py Outdated
Comment thread dashboard/mongo_sync.py Outdated
Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.

Comment thread dashboard/mongo_sync.py Outdated
Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 4 comments.

Comment thread dashboard/mongo_sync.py Outdated
Comment thread dashboard/mongo_sync.py Outdated
Comment thread dashboard/mongo_sync.py Outdated
Comment thread dashboard/mongo_sync.py Outdated
SimoneBendazzoli93 and others added 7 commits May 17, 2026 10:16
This new script connects to a MongoDB database, retrieves project and user data, filters the information, and writes it to JSON files organized by project namespace. It includes error handling for date formatting and ensures the project folder is created if it doesn't exist.
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Simone Bendazzoli <simonebendazzoli93@gmail.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Simone Bendazzoli <simonebendazzoli93@gmail.com>
This update organizes the mongo_sync script by wrapping the main functionality in a `main()` function. It maintains the existing logic for connecting to the MongoDB database, retrieving project and user data, filtering the information, and writing it to JSON files. The project folder creation and date formatting error handling remain intact.
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Simone Bendazzoli <simonebendazzoli93@gmail.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Simone Bendazzoli <simonebendazzoli93@gmail.com>
- Introduced a new script, mongo_sync.py, that connects to a MongoDB database and retrieves project and user data.
- The script filters and formats project information, including user emails associated with each project, and saves the data as JSON files in a designated "Projects" directory.
- Implemented error handling for date formatting and ensured safe filename generation to prevent path traversal vulnerabilities.
Comment thread dashboard/core/mongo_sync.py Fixed
Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>
Signed-off-by: Simone Bendazzoli <simonebendazzoli93@gmail.com>
SimoneBendazzoli93 self-assigned this May 17, 2026
coderabbitai Bot commented May 17, 2026

Walkthrough

This PR adds a MongoDB sync script that reads user and project documents from MongoDB, filters projects by matching user namespaces, normalizes metadata fields including dates, and exports each filtered project to a JSON file with a path-safe filename derived from the project namespace.

Changes

MongoDB Project Sync

| Layer / File(s) | Summary |
|---|---|
| MongoDB connection, filtering, and JSON export (`dashboard/core/mongo_sync.py`) | Script reads MongoDB credentials from environment, loads all users and projects into memory, filters projects by matching each project's namespace against user-derived namespace lists, normalizes the date field with multiple fallback paths for different shapes, generates path-safe filenames from sanitized namespaces, and writes one JSON file per project to Projects/. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 A rabbit hops through MongoDB's store,
Filters projects and users by the score,
Dates dance through fallback paths with care,
JSON files bloom in Projects' lair—
Namespace matching, safe and sound! 🌱

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main change: adding a mongo_sync script that syncs MongoDB data to JSON files, which directly matches the file additions and functionality summarized in the raw_summary. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • ✅ Generated successfully - (🔄 Check to regenerate)
  • Commit on current branch
🧪 Generate unit tests (beta)

✅ Unit Test PR creation complete.

  • Create PR with unit tests
  • Commit unit tests in branch mongo-sync-to-git

Comment @coderabbitai help to get the list of available commands and usage tips.

coderabbitai Bot left a comment

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@dashboard/core/mongo_sync.py`:
- Around line 59-67: The loop that populates filtered_project["users"] currently
appends raw user.get("email") values (in the users -> filtered_project
construction) which exports PII; change it to export a stable non-PII identifier
instead (e.g., user.get("id") or a deterministic hash of the email using a
project/system salt) so no raw emails are written to disk; update the same
pattern found around the other occurrence noted (lines ~103-104) to use the
ID/hash, and ensure any downstream consumers that expect emails read the new
field name/format (adjust keys or document the change) in the code paths
referencing filtered_project and project.
- Line 63: The code directly indexes project["namespace"] which can raise
KeyError for missing/invalid project docs; change the membership check in the
sync logic to use project.get("namespace") (or assign ns =
project.get("namespace")) and skip the record if ns is None or not a str, i.e.
only proceed to check membership against user_namespaces when a valid namespace
exists to avoid aborting the whole sync.
- Around line 71-77: The code assumes v["$date"] exists and is a string before
parsing in the block around filtered_project[k], which can raise KeyError or
TypeError; change it to safely fetch and normalize the $date value (e.g. use
v.get("$date") or check "$date" in v), ensure non-string values are converted to
str() before calling replace("+00:00")/fromisoformat, and keep the
datetime.fromisoformat(...) call inside the try/except so missing or malformed
values fall back to assigning the original (or stringified) date to
filtered_project[k]; update the logic around variables v, k, filtered_project
and datetime.fromisoformat accordingly.
- Around line 98-104: The current sanitization of raw_namespace into
safe_namespace can cause collisions (e.g., "team/a" -> "team_a") and overwrite
existing exports; update the write logic around safe_namespace (used in
project_folder.joinpath(safe_namespace + ".json") and
json.dump(filtered_project,...)) to detect collisions and produce a unique,
deterministic filename instead of silently overwriting: after computing
safe_namespace, check whether a file with that name already exists and if it was
produced from a different raw_namespace (track a mapping or compare existing
metadata), and if a collision is detected append a short deterministic suffix
(e.g., a hex hash of raw_namespace like first 8 chars) to the filename or raise
a ValueError; ensure the chosen approach uses raw_namespace for uniqueness and
still prevents path traversal via the existing Path(safe_namespace).name check
before opening the file.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: b0ec6139-1168-4315-9029-4f9fde529d77

📥 Commits

Reviewing files that changed from the base of the PR and between 7757d3d and e76abbb.

📒 Files selected for processing (1)
  • dashboard/core/mongo_sync.py

Comment on lines +59 to +67
        filtered_project = {"users": []}
        for user in users:
            user_namespace_value = user.get("namespace") or ""
            user_namespaces = [namespace.strip() for namespace in user_namespace_value.split(",") if namespace.strip()]
            if project["namespace"] in user_namespaces:
                user_email = user.get("email")
                if user_email:
                    filtered_project["users"].append(user_email)
        for k, v in project.items():

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Reassess exporting raw user emails to disk (PII retention risk).

The script writes email addresses into project JSON files under Projects/. If these files are committed/synced, this creates a privacy/compliance exposure. Prefer stable user IDs or hashed emails unless explicit policy/legal basis requires raw emails.

Also applies to: 103-104
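One way to export a stable identifier instead of a raw address is a keyed hash of the normalized email. The following is a minimal sketch under stated assumptions: the function name and the salt value are hypothetical, and in practice the key would need to come from a secret store rather than source code.

```python
import hashlib
import hmac


def pseudonymize_email(email, salt):
    """Return a stable hex identifier derived from an email address and a secret key."""
    # Normalize so that case and surrounding whitespace do not change the result.
    normalized = email.strip().lower().encode("utf-8")
    # HMAC-SHA256 keyed with the salt; without the key, guessing emails is much harder.
    return hmac.new(salt, normalized, hashlib.sha256).hexdigest()


# Hypothetical key for illustration only.
SALT = b"example-salt-change-me"
user_id = pseudonymize_email("Alice@Example.org", SALT)
print(len(user_id))  # 64 hex characters
```

A keyed hash is used here rather than a plain SHA-256 because unkeyed hashes of email addresses can often be reversed by hashing a list of known addresses. Whether a pseudonymous identifier satisfies the applicable policy is a separate question that this sketch does not answer.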


        for user in users:
            user_namespace_value = user.get("namespace") or ""
            user_namespaces = [namespace.strip() for namespace in user_namespace_value.split(",") if namespace.strip()]
            if project["namespace"] in user_namespaces:

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Guard missing project namespace before membership checks.

project["namespace"] can raise KeyError and abort the whole sync when a project document is missing/invalid. Use .get() and skip invalid records safely.

Proposed fix
-            if project["namespace"] in user_namespaces:
+            project_namespace = project.get("namespace")
+            if not project_namespace:
+                continue
+            if project_namespace in user_namespaces:
                 user_email = user.get("email")
                 if user_email:
                     filtered_project["users"].append(user_email)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-            if project["namespace"] in user_namespaces:
+            project_namespace = project.get("namespace")
+            if not project_namespace:
+                continue
+            if project_namespace in user_namespaces:
+                user_email = user.get("email")
+                if user_email:
+                    filtered_project["users"].append(user_email)

Comment on lines +71 to +77
                    if isinstance(v, dict):
                        date_str = v["$date"]
                        try:
                            date_obj = datetime.fromisoformat(date_str.replace("Z", "+00:00"))
                            filtered_project[k] = date_obj.strftime("%Y-%m-%d")
                        except Exception:
                            filtered_project[k] = date_str

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Handle $date shape safely before parsing.

date_str = v["$date"] is outside the try, so missing $date crashes the run. Also, non-string $date values should be normalized safely.

Proposed fix
                 if k == "date":
                     if isinstance(v, dict):
-                        date_str = v["$date"]
+                        date_raw = v.get("$date")
                         try:
-                            date_obj = datetime.fromisoformat(date_str.replace("Z", "+00:00"))
-                            filtered_project[k] = date_obj.strftime("%Y-%m-%d")
+                            if isinstance(date_raw, str):
+                                date_obj = datetime.fromisoformat(date_raw.replace("Z", "+00:00"))
+                                filtered_project[k] = date_obj.strftime("%Y-%m-%d")
+                            elif isinstance(date_raw, datetime):
+                                filtered_project[k] = date_raw.strftime("%Y-%m-%d")
+                            else:
+                                filtered_project[k] = str(date_raw) if date_raw is not None else ""
                         except Exception:
-                            filtered_project[k] = date_str
+                            filtered_project[k] = str(date_raw) if date_raw is not None else ""
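The same fallback logic can also be sketched as a standalone helper, which makes each input shape easy to test in isolation. The function name `normalize_date` is hypothetical, and the set of shapes handled (a `{"$date": ...}` dict, a `datetime`, a string, or a missing value) is an assumption based on the cases discussed in this comment.

```python
from datetime import datetime


def normalize_date(value):
    """Best-effort conversion of a MongoDB-style date value to YYYY-MM-DD."""
    # Extended JSON wraps dates as {"$date": ...}; unwrap without raising on a missing key.
    if isinstance(value, dict):
        value = value.get("$date")
    if isinstance(value, datetime):
        return value.strftime("%Y-%m-%d")
    if isinstance(value, str):
        try:
            parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
            return parsed.strftime("%Y-%m-%d")
        except ValueError:
            # Unparseable strings are passed through unchanged.
            return value
    # Missing values become an empty string; anything else is stringified.
    return "" if value is None else str(value)


print(normalize_date({"$date": "2026-05-05T11:22:00Z"}))  # 2026-05-05
print(normalize_date({}))                                  # (empty string)
```

Catching `ValueError` rather than a bare `Exception` keeps unrelated bugs visible, since the type checks already rule out the `TypeError` and `KeyError` cases.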

Comment on lines +98 to +104
        safe_namespace = re.sub(r"[^a-z0-9-]", "_", raw_namespace.lower())
        # Additionally, enforce that no path traversal can happen
        if Path(safe_namespace).name != safe_namespace or not safe_namespace:
            raise ValueError(f"Unsafe or empty namespace for filename: {raw_namespace}")

        with open(project_folder.joinpath(safe_namespace + ".json"), "w") as f:
            json.dump(filtered_project, f, indent=4)

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Prevent sanitized filename collisions from overwriting project exports.

Different namespaces can collapse to the same safe_namespace (e.g., team/a and team_a) and silently overwrite JSON files.
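An alternative to raising on collision is to make the filename itself unique by appending a short hash of the raw namespace whenever sanitization changed it. The sketch below is one possible approach, not the script's current behavior; the function name `export_filename` and the 8-character suffix length are assumptions.

```python
import hashlib
import re


def export_filename(raw_namespace):
    """Build a path-safe JSON filename that stays distinct for distinct raw namespaces."""
    # Same sanitization rule as the reviewed code: lowercase, then replace unsafe characters.
    safe = re.sub(r"[^a-z0-9-]", "_", raw_namespace.lower())
    if not safe:
        raise ValueError("Empty namespace for filename")
    if safe == raw_namespace:
        # Nothing was altered, so the plain name is used.
        return f"{safe}.json"
    # Sanitization changed the name: add a deterministic suffix derived from the raw value.
    digest = hashlib.sha256(raw_namespace.encode("utf-8")).hexdigest()[:8]
    return f"{safe}-{digest}.json"


print(export_filename("team_a"))  # team_a.json
print(export_filename("team/a") != export_filename("team_a"))  # True
```

This keeps filenames stable across runs because the suffix depends only on the raw namespace. A namespace that already happens to end in a matching hash-like suffix could still collide in principle, so an explicit existence check remains a reasonable extra safeguard.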

Proposed fix
+        existing_name = filtered_project.get("namespace", "")
+        file_name = f"{safe_namespace}.json"
+        output_path = project_folder / file_name
+        if output_path.exists():
+            with open(output_path, "r") as rf:
+                existing = json.load(rf)
+            if existing.get("namespace") != existing_name:
+                raise ValueError(
+                    f"Namespace collision after sanitization: '{existing_name}' -> '{file_name}'"
+                )
-
-        with open(project_folder.joinpath(safe_namespace + ".json"), "w") as f:
+        with open(output_path, "w") as f:
             json.dump(filtered_project, f, indent=4)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-        safe_namespace = re.sub(r"[^a-z0-9-]", "_", raw_namespace.lower())
-        # Additionally, enforce that no path traversal can happen
-        if Path(safe_namespace).name != safe_namespace or not safe_namespace:
-            raise ValueError(f"Unsafe or empty namespace for filename: {raw_namespace}")
-        with open(project_folder.joinpath(safe_namespace + ".json"), "w") as f:
-            json.dump(filtered_project, f, indent=4)
+        safe_namespace = re.sub(r"[^a-z0-9-]", "_", raw_namespace.lower())
+        # Additionally, enforce that no path traversal can happen
+        if Path(safe_namespace).name != safe_namespace or not safe_namespace:
+            raise ValueError(f"Unsafe or empty namespace for filename: {raw_namespace}")
+        existing_name = filtered_project.get("namespace", "")
+        file_name = f"{safe_namespace}.json"
+        output_path = project_folder / file_name
+        if output_path.exists():
+            with open(output_path, "r") as rf:
+                existing = json.load(rf)
+            if existing.get("namespace") != existing_name:
+                raise ValueError(
+                    f"Namespace collision after sanitization: '{existing_name}' -> '{file_name}'"
+                )
+        with open(output_path, "w") as f:
+            json.dump(filtered_project, f, indent=4)


coderabbitai Bot commented May 17, 2026

Note

Docstrings generation - SUCCESS
Generated docstrings for this pull request at #154

coderabbitai Bot added a commit that referenced this pull request May 17, 2026
Docstrings generation was requested by @SimoneBendazzoli93.

* #150 (comment)

The following files were modified:

* `dashboard/core/mongo_sync.py`

coderabbitai Bot commented May 17, 2026

Note

Unit test generation is a public access feature. Expect some limitations and changes as we gather feedback and continue to improve it.


Generating unit tests... This may take up to 20 minutes.


coderabbitai Bot commented May 17, 2026

✅ Created PR with unit tests: #155



2 participants