### Why Version Control Notebooks (with caveats)?

**Pros:**

1.  **Reproducibility (Code + Output):** A committed notebook records not just the code but also the order of execution and the results (plots, tables, etc.). This can be invaluable when revisiting past experiments or presentations.
2.  **Easier Sharing:** Colleagues can pull the notebook and see the exact state you left it in, including intermediate results.
3.  **Historical Record:** It tracks the evolution of your analysis and experimentation.

**Cons (and why caveats are needed):**

1.  **Terrible Diffing:** Jupyter notebooks are JSON files. A one-line code change can be buried among changed execution counts, cell metadata, and base64-encoded image outputs, so a plain Git diff makes it very hard to see what code actually changed, especially if outputs are large.
2.  **Merge Conflicts:** Resolving merge conflicts in JSON can be a nightmare, often leading to manual clean-up or corrupted notebooks.
3.  **Large File Sizes:** Storing outputs (especially images or large data frames) directly in the notebook can bloat the repository size.
4.  **Sensitive Data Exposure:** If you accidentally print sensitive information (like API keys or customer data) to the output, it gets stored in the notebook's JSON and committed to Git.
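
To make the diffing and data-exposure points concrete, it helps to look at what a notebook file actually contains. The following is a minimal sketch using only the standard library; the notebook dictionary is hand-written for illustration, the "secret" is a made-up placeholder, and `strip_outputs` is a hypothetical helper name rather than part of any library.

```python
import json

# A hand-written, minimal notebook (nbformat 4) with one executed code cell.
# The "secret" below is a made-up placeholder, not a real credential.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {
            "cell_type": "code",
            "execution_count": 3,
            "metadata": {},
            "source": ["print(api_key)"],
            "outputs": [
                {
                    "output_type": "stream",
                    "name": "stdout",
                    "text": ["sk-EXAMPLE-NOT-A-REAL-KEY\n"],
                }
            ],
        }
    ],
}


def strip_outputs(nb: dict) -> dict:
    """Return a copy of the notebook with code-cell outputs and counters removed."""
    cleaned = json.loads(json.dumps(nb))  # cheap deep copy via a JSON round trip
    for cell in cleaned["cells"]:
        if cell["cell_type"] == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return cleaned


# This is the text Git would store for the file, with and without outputs.
raw_text = json.dumps(notebook, indent=1)
clean_text = json.dumps(strip_outputs(notebook), indent=1)

leaked_before = "sk-EXAMPLE" in raw_text
leaked_after = "sk-EXAMPLE" in clean_text
print(f"secret in file before stripping: {leaked_before}")
print(f"secret in file after stripping:  {leaked_after}")
```

The printed value lives in the file itself, next to the code, which is why it ends up in Git history unless outputs are removed before the commit.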

### Best Practices for a Small Data Science Team

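To mitigate the cons while keeping most of the pros, adopt a "hybrid" approach: reusable code lives in `.py` modules, notebooks are committed without outputs, and results are logged to an experiment tracker.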

1.  **Modularize Your Code (Most Important!):**

      * **Extract Core Logic:** Any reusable code (like your `data_ingestion.py`, `data_processing.py`, `model_training.py`, `evaluation.py` etc.) should live in separate `.py` scripts. These are standard Python files that Git can diff and merge efficiently.
      * **Notebooks as Orchestrators/Explorers/Reports:** Use notebooks for:
          * Initial data exploration (EDA).
          * Prototyping new features or model ideas.
          * Running experiments by calling functions from your `.py` scripts.
          * Visualizing results and telling the story of your analysis (like a living report).
      * **Pipeline Entry Point:** The actual "main" script that runs your pipeline should be a `.py` file, not a notebook.

2.  **Clear Notebook Policy for Commits:**

      * **Clear Outputs Before Committing:** This is the most crucial step of the commit policy. Use `Cell -> All Output -> Clear` in the classic Jupyter Notebook interface (or the equivalent entry in JupyterLab's Edit menu) before saving and committing. This significantly reduces file size, makes diffs cleaner, and prevents accidental sensitive data leaks.
      * **Automate Clearing:**
          * **`nbstripout`:** This tool strips outputs from notebooks as they are staged, so outputs never reach a commit. Install it (`pip install nbstripout`), then enable it as a Git filter from inside the repository:
            ```bash
            cd your_repo_root
            nbstripout --install
            # Or, to enable it for every repository of the current user:
            # nbstripout --install --global
            ```
            The filter leaves your working copy untouched, so you can keep outputs locally for your own use while Git strips them when you `git add` and `git commit`. Because `--install` writes to the local Git configuration, each collaborator needs to run it once in their own clone; the notebooks they pull are already clean.
          * **Git Hooks:** You can set up pre-commit hooks to run `nbstripout` or other cleaning scripts.

3.  **Leverage Experiment Tracking (like MLflow):**

      * You're already using MLflow! This is perfect. Don't rely on notebook outputs for tracking metrics, parameters, and models. Log everything important to MLflow. This separates the "results" from the "code" and makes your notebooks lighter.

4.  **Adopt a Naming Convention:**

      * Prefix notebooks based on their purpose (e.g., `01_EDA_initial_data.ipynb`, `02_Feature_Engineering_V1.ipynb`, `03_Model_Experiment_XGBoost.ipynb`).
      * Consider adding author initials or dates if multiple people are working on similar topics simultaneously, but try to avoid parallel work on the *exact same* notebook.

5.  **Small Team Advantages:**

      * **Easier Communication:** With a small team, it's easier to establish and enforce these conventions early.
      * **Less Overhead:** Tools like `nbstripout` are simple to set up and maintain for a small group.
      * **Agility:** A clean codebase with modularized scripts and focused notebooks makes iteration faster.
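
The "clear outputs before committing" policy from item 2 can also be checked mechanically, for example from a pre-commit hook. The sketch below is one possible check, not how `nbstripout` works internally: it only reads the notebook JSON and reports code cells that still carry outputs. The function names `cells_with_outputs` and `main` are hypothetical.

```python
import json
from pathlib import Path


def cells_with_outputs(notebook_path: Path) -> list[int]:
    """Return the indices of code cells that still carry outputs."""
    nb = json.loads(notebook_path.read_text(encoding="utf-8"))
    return [
        index
        for index, cell in enumerate(nb.get("cells", []))
        if cell.get("cell_type") == "code" and cell.get("outputs")
    ]


def main(paths: list[str]) -> int:
    """Return 1 if any given notebook has outputs, else 0 (usable as an exit code)."""
    failed = False
    for raw_path in paths:
        dirty = cells_with_outputs(Path(raw_path))
        if dirty:
            failed = True
            print(f"{raw_path}: outputs found in cells {dirty}")
    return 1 if failed else 0
```

A hook script could pass the staged `.ipynb` paths to `main` and use the return value as its exit status, so that a non-zero result blocks the commit. Unlike a Git filter, this check only reports the problem and leaves the cleaning to the author.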

### In Summary:

For your current situation, after successfully ingesting data:

  * **Save your notebook.**
  * **Clear all outputs** before committing.
  * **Commit the cleaned notebook** to your Git repo.
  * **Most importantly, consider how the data ingestion logic could be refactored into your `data_ingestion.py` file** if it's more than just a one-off execution. Your notebook would then just call the function from that script.
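
As a small illustration of the last point, the path-building logic that the ingestion cell in this notebook does inline could move into a module function that the notebook imports. This is only a sketch: `expected_csv_path` is a hypothetical helper, and the `<base_dir>/<ticker>/<TICKER>_stock_data.csv` layout is taken from that cell's expectations, not from the actual `data_ingestion.py` implementation.

```python
from pathlib import PurePosixPath


def expected_csv_path(base_dir: str, ticker: str) -> str:
    """Build <base_dir>/<ticker lowercased>/<TICKER>_stock_data.csv as a string."""
    ticker_dir = PurePosixPath(base_dir) / ticker.lower()
    return str(ticker_dir / f"{ticker.upper()}_stock_data.csv")


# In the notebook, the inline os.path.join calls shrink to a single call:
print(expected_csv_path("/mnt/shared-data", "AAPL"))
# → /mnt/shared-data/aapl/AAPL_stock_data.csv
```

Keeping this in a `.py` file means a change to the directory layout shows up as a readable one-line diff instead of a change buried in notebook JSON.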

By following these practices, your small data science team can effectively collaborate, maintain a clean and reproducible codebase, and avoid the common pitfalls of version controlling Jupyter notebooks.

In [None]:
import sys
import os
import pandas as pd # Import pandas, as the downloaded data is a DataFrame

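# Assumes the working directory is the parent of the financial-mlops-pytorch checkout.
# If the notebook runs from the repo root instead, use os.path.abspath('src').
sys.path.append(os.path.abspath('financial-mlops-pytorch/src'))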

# Import the module AFTER sys.path is updated and kernel is restarted (if needed)
import data_ingestion
print("Successfully imported data_ingestion module.")

# --- Next Step: Download some data ---
ticker = "AAPL" # Example: Apple Inc.
start_date = "2020-01-01"
end_date = "2023-12-31"

# Define the output directory within your shared storage
output_base_dir = "/mnt/shared-data"
output_dir_for_ticker = os.path.join(output_base_dir, ticker.lower())

# Create the directory if it doesn't exist
os.makedirs(output_dir_for_ticker, exist_ok=True)
print(f"Ensured output directory exists: {output_dir_for_ticker}")

# Determine the *expected* file path where the data will be saved by the download function
# (The download function saves it, but returns the DataFrame)
expected_file_name = f"{ticker.upper()}_stock_data.csv"
expected_file_path = os.path.join(output_dir_for_ticker, expected_file_name)


# Call the function to download data.
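# IMPORTANT: data_ingestion.download_stock_data returns the DataFrame directly and is
# expected to save a CSV as well. No output directory is passed in this call, so the
# save location is whatever the function uses internally; the check at the end of
# this cell verifies whether it matches expected_file_path.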
print(f"Downloading data for {ticker} from {start_date} to {end_date}...")
df_downloaded = data_ingestion.download_stock_data(ticker, start_date, end_date)

print("Data downloaded and available in Python DataFrame 'df_downloaded'.")
print(f"Data should also be saved to: {expected_file_path}") # This is the file that was saved to disk.


# --- Inspect the downloaded DataFrame ---
print("\nFirst 5 rows of the downloaded data (from DataFrame in memory):")
print(df_downloaded.head())
print(f"\nShape of the data: {df_downloaded.shape}")

# --- Optional: Verify the file exists on disk ---
# You can check if the file was indeed saved as expected
if os.path.exists(expected_file_path):
    print(f"\nConfirmed: Data file '{expected_file_name}' exists on disk at {output_dir_for_ticker}/")
    # If you wanted to load it *again* from disk (e.g., in a new session), you'd use load_data like this:
    # df_loaded_from_disk = data_ingestion.load_data(expected_file_path)
    # print("\nFirst 5 rows of data loaded from disk (for verification):")
    # print(df_loaded_from_disk.head())
else:
    print(f"\nWarning: Data file '{expected_file_name}' was NOT found on disk at {output_dir_for_ticker}/. Check download_stock_data implementation.")