# Version Control with Git

In this notebook, we'll set up version control for our LLM finance project. Version control is essential for tracking changes, collaborating, and maintaining a professional workflow.

## Learning Objectives
- Understand the importance of version control in ML projects
- Set up a Git repository for the course project
- Create a proper `.gitignore` file for Python/ML projects
- Learn best practices for committing code
- Understand how to handle sensitive information (API keys)

## 1. Why Version Control for LLM Projects?

### Key Benefits:
- **Track experiments**: Record different model configurations and results
- **Collaboration**: Work with teammates on the same codebase
- **Rollback capability**: Revert to working versions when experiments fail
- **Documentation**: Commit messages serve as project history
- **Reproducibility**: Ensure others can reproduce your results

### What to Track:
- ✅ Source code and notebooks
- ✅ Configuration files
- ✅ Documentation
- ✅ Requirements/dependencies
- ❌ Large datasets (use Git LFS or external storage)
- ❌ API keys and secrets
- ❌ Model checkpoints (use model registries)
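
As a rough illustration of the list above, the following sketch looks for oversized files before anything is staged. The helper name `find_large_files` and the 50 MB default threshold are assumptions made for this example, not course requirements.

```python
from pathlib import Path

def find_large_files(root=".", max_mb=50.0):
    """Return (path, size_in_mb) pairs for files larger than max_mb, largest first."""
    limit_bytes = max_mb * 1024 * 1024
    large = []
    for path in Path(root).rglob("*"):
        # Skip Git's own object store and anything that is not a regular file
        if ".git" in path.parts or not path.is_file():
            continue
        size = path.stat().st_size
        if size > limit_bytes:
            large.append((path, size / (1024 * 1024)))
    return sorted(large, key=lambda item: item[1], reverse=True)

# Example usage (scans the current directory):
#   for path, size_mb in find_large_files():
#       print(f"{size_mb:8.1f} MB  {path}")
```

Anything this reports is a candidate for Git LFS, external storage, or a `.gitignore` rule rather than a regular commit.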

## 2. Setting Up Your Git Repository

Let's check if Git is installed and set up the repository:

In [None]:
import subprocess
import os
from pathlib import Path

def run_command(command, description=""):
    """
    Run a shell command and return the result
    """
    try:
        result = subprocess.run(
            command, 
            shell=True, 
            capture_output=True, 
            text=True, 
            check=True
        )
        if description:
            print(f"✅ {description}")
        if result.stdout.strip():
            print(f"   Output: {result.stdout.strip()}")
        return True, result.stdout.strip()
    except subprocess.CalledProcessError as e:
        # Fall back to the command text when no description was given
        print(f"❌ {description or command} failed")
        print(f"   Error: {e.stderr.strip() if e.stderr else str(e)}")
        return False, None
    except Exception as e:
        print(f"❌ {description or command} failed: {e}")
        return False, None

# Check if Git is installed
print("🔍 Checking Git installation...")
success, version = run_command("git --version", "Git version check")

if not success:
    print("\n📥 Git is not installed. Please install Git:")
    print("   - Windows: Download from https://git-scm.com/download/win")
    print("   - macOS: brew install git")
    print("   - Linux (Debian/Ubuntu): sudo apt-get install git")
else:
    print(f"\n🎉 Git is installed: {version}")
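
The helper above passes a command string through the shell. A possible alternative, sketched here under the assumption that only Git commands need to be run, is to pass the arguments as a list so no shell parsing is involved. The name `run_git` is hypothetical and is not used elsewhere in this notebook.

```python
import shutil
import subprocess

def run_git(args, cwd=None):
    """Run `git <args>` without a shell; return (ok, output)."""
    git = shutil.which("git")
    if git is None:
        return False, "git executable not found on PATH"
    result = subprocess.run(
        [git, *args],          # list form: arguments are passed as-is, no shell quoting
        capture_output=True,
        text=True,
        cwd=cwd,
    )
    output = result.stdout.strip() or result.stderr.strip()
    return result.returncode == 0, output

ok, version = run_git(["--version"])
print(version if ok else f"Git unavailable: {version}")
```

Checking `shutil.which("git")` first makes it possible to tell "Git is not installed" apart from "the Git command failed".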

## 3. Git Configuration

Let's check and configure Git settings:

In [None]:
# Check current Git configuration
print("🔧 Current Git configuration:")
success, name = run_command("git config --global user.name")
success, email = run_command("git config --global user.email")

if not name or not email:
    print("\n⚠️ Git user information not configured.")
    print("\n📝 To configure Git, run these commands in your terminal:")
    print('   git config --global user.name "Your Name"')
    print('   git config --global user.email "your.email@example.com"')
else:
    print(f"\n✅ Git configured for:")
    print(f"   Name: {name}")
    print(f"   Email: {email}")

# Check if we're in a Git repository
print("\n🏠 Repository status:")
is_repo, _ = run_command("git rev-parse --is-inside-work-tree", "Checking if in Git repository")

if not is_repo:
    print("\n📁 Current directory is not a Git repository.")
    print("   We'll create one for this project.")

## 4. Initialize Git Repository

Let's initialize a Git repository for our project:

In [None]:
# Create project directory structure
project_name = "llm-finance-course"
current_dir = Path.cwd()

print(f"📁 Current working directory: {current_dir}")
print(f"\n🏗️ Project structure we'll create:")
print(f"""
{project_name}/
├── .gitignore          # Files to ignore in version control
├── .env.example        # Template for environment variables
├── README.md           # Project documentation
├── requirements.txt    # Python dependencies
├── notebooks/          # Jupyter notebooks
│   ├── day1/
│   ├── day2/
│   └── ...
├── src/               # Source code modules
├── data/              # Data files (local only)
├── models/            # Saved models (local only)
└── outputs/           # Generated outputs
""")

# Note: We won't actually create the directory structure here
# since we're already in the course repository
print("\n💡 For a new project, you would typically run:")
print(f"   mkdir {project_name}")
print(f"   cd {project_name}")
print("   git init")
print("\nBut since we're in the course environment, we'll work with the current setup.")
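
For a fresh project outside the course environment, the layout printed above could be created with a few lines of `pathlib`. This is a minimal sketch: `create_skeleton` is a hypothetical helper, and the `.gitkeep` placeholder is a common convention rather than a Git feature. It is used because Git does not track empty directories.

```python
from pathlib import Path

TRACKED_DIRS = ("notebooks/day1", "notebooks/day2", "src")
LOCAL_ONLY_DIRS = ("data", "models", "outputs")  # kept on disk, not meant for commits

def create_skeleton(root):
    """Create the course directory layout under root; safe to re-run."""
    root = Path(root)
    for sub in TRACKED_DIRS:
        folder = root / sub
        folder.mkdir(parents=True, exist_ok=True)
        # Placeholder file so the otherwise empty directory can be committed
        (folder / ".gitkeep").touch()
    for sub in LOCAL_ONLY_DIRS:
        (root / sub).mkdir(parents=True, exist_ok=True)
    return root

# Example usage:
#   create_skeleton("llm-finance-course")
```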

## 5. Creating a Comprehensive .gitignore File

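The `.gitignore` file tells Git which untracked files to leave out of version control (files that are already tracked are not affected). This is crucial for LLM projects, where API keys, datasets, and model weights tend to sit next to the code: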

In [None]:
# Create a comprehensive .gitignore for LLM/ML projects
gitignore_content = '''# LLM Finance Course - .gitignore

# =============================================================================
# SENSITIVE INFORMATION - NEVER COMMIT THESE!
# =============================================================================
.env
*.key
api_keys.txt
secrets.json
config/secrets/

# =============================================================================
# PYTHON
# =============================================================================
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg

# Virtual environments
venv/
env/
ENV/
.venv/
.env/
conda-env/

# =============================================================================
# JUPYTER NOTEBOOKS
# =============================================================================
# Jupyter Notebook checkpoints
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# =============================================================================
# MACHINE LEARNING & DATA
# =============================================================================
# Large datasets
data/raw/
data/processed/
*.csv
*.parquet
*.h5
*.hdf5

# Model files and checkpoints
models/
checkpoints/
*.pkl
*.joblib
*.ckpt
*.pth
*.pt
*.bin
*.safetensors

# Weights & Biases
wandb/

# MLflow
mlruns/
mlartifacts/

# HuggingFace cache
.cache/
transformers_cache/

# =============================================================================
# OUTPUTS AND LOGS
# =============================================================================
# Logs
*.log
logs/

# Generated outputs
outputs/
results/
plots/
figures/

# =============================================================================
# SYSTEM FILES
# =============================================================================
# macOS
.DS_Store
.AppleDouble
.LSOverride

# Windows
Thumbs.db
ehthumbs.db
Desktop.ini

# Linux
*~

# =============================================================================
# IDE AND EDITORS
# =============================================================================
# VS Code
.vscode/
*.code-workspace

# PyCharm
.idea/

# Sublime Text
*.sublime-project
*.sublime-workspace

# =============================================================================
# TEMPORARY FILES
# =============================================================================
*.tmp
*.temp
*.swp
*.swo

# =============================================================================
# PROJECT SPECIFIC
# =============================================================================
# Add any project-specific files to ignore here
experimental/
scratch/
temp_notebooks/
'''

# Write .gitignore file (note: this overwrites any existing .gitignore in the current directory)
with open('.gitignore', 'w') as f:
    f.write(gitignore_content)

print("📄 Created comprehensive .gitignore file")
print("\n🔍 Key sections in .gitignore:")
print("   • Sensitive information (.env, API keys)")
print("   • Python artifacts (__pycache__, *.pyc)")
print("   • Jupyter checkpoints")
print("   • Large data files and models")
print("   • Generated outputs and logs")
print("   • System and IDE files")

print("\n⚠️ Important: The .env file (containing API keys) is ignored by Git!")
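
To confirm that a rule actually applies, Git can be asked directly with `git check-ignore`. The sketch below wraps that command. The function name `is_ignored` is an assumption for this example, and it returns `None` when Git is missing or the directory is not a repository.

```python
import shutil
import subprocess

def is_ignored(path, repo_dir="."):
    """Return True/False from `git check-ignore`, or None if it cannot be determined."""
    git = shutil.which("git")
    if git is None:
        return None
    result = subprocess.run(
        [git, "check-ignore", "-q", path],
        cwd=repo_dir,
        capture_output=True,
        text=True,
    )
    # Exit status: 0 = ignored, 1 = not ignored, anything else = error (e.g. not a repo)
    if result.returncode == 0:
        return True
    if result.returncode == 1:
        return False
    return None

for name in (".env", ".env.example"):
    print(f"{name}: ignored = {is_ignored(name)}")
```

A file that was committed before the rule was added is still tracked, so `git check-ignore` reports it as not ignored; it has to be removed from the index first (for example with `git rm --cached`).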

## 6. Creating Environment Variable Templates

We'll create a template for environment variables that can be safely committed:

In [None]:
# Create .env.example file - this can be safely committed to Git
env_example_content = '''# Environment Variables Template
# Copy this file to .env and fill in your actual API keys
# NEVER commit the .env file to version control!

# OpenAI API
OPENAI_API_KEY=your_openai_api_key_here
# Optional: organization ID
OPENAI_ORGANIZATION=your_openai_org_id_here

# DeepSeek API
DEEPSEEK_API_KEY=your_deepseek_api_key_here
DEEPSEEK_BASE_URL=https://api.deepseek.com/v1

# Anthropic Claude API
ANTHROPIC_API_KEY=your_anthropic_api_key_here

# HuggingFace
HUGGINGFACE_API_KEY=your_huggingface_token_here

# Cohere API
COHERE_API_KEY=your_cohere_api_key_here

# Model Configuration
DEFAULT_MODEL=gpt-3.5-turbo
MAX_TOKENS=1000
TEMPERATURE=0.7

# Data Paths
DATA_DIR=./data
MODELS_DIR=./models
OUTPUTS_DIR=./outputs

# Logging
LOG_LEVEL=INFO
LOG_FILE=./logs/app.log
'''

# Write .env.example file
with open('.env.example', 'w') as f:
    f.write(env_example_content)

print("📄 Created .env.example file")
print("\n📝 To set up your environment variables:")
print("   1. Copy .env.example to .env")
print("   2. Edit .env with your actual API keys")
print("   3. Never commit .env to version control")

print("\n🔒 Security reminder:")
print("   • .env.example is safe to commit (no real keys)")
print("   • .env contains real keys and is in .gitignore")
print("   • Always double-check before committing sensitive files")
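
Once `.env` exists, the code needs to read it. In practice the `python-dotenv` package is a common choice. The standard-library sketch below is a simplified stand-in that shows the idea. The names `parse_env_file` and `require_key` are hypothetical, and treating values that start with `your_` as unfilled placeholders is an assumption based on the template above.

```python
import os

PLACEHOLDER_PREFIX = "your_"  # assumption: template values start with "your_"

def parse_env_file(path):
    """Parse simple KEY=VALUE lines; blank lines and comment lines are skipped."""
    values = {}
    with open(path, encoding="utf-8") as f:
        for raw in f:
            line = raw.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            # Drop a trailing inline comment such as "value  # Optional"
            value = value.split(" #", 1)[0].strip().strip('"').strip("'")
            values[key.strip()] = value
    return values

def require_key(values, name):
    """Return a configured value, or raise if it is missing or still a placeholder."""
    value = values.get(name) or os.environ.get(name, "")
    if not value or value.startswith(PLACEHOLDER_PREFIX):
        raise RuntimeError(f"{name} is not set; copy .env.example to .env and fill it in")
    return value

# Example usage:
#   config = parse_env_file(".env")
#   api_key = require_key(config, "OPENAI_API_KEY")
```

Failing early with a clear message is usually easier to debug than an authentication error from the API later on.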

## 7. Git Best Practices for LLM Projects

Let's create a README file with best practices:

In [None]:
# Create a project README with Git best practices
readme_content = '''# LLM Finance Course Project

This repository contains the practical exercises and projects for the LLM in Finance course.

## 🚀 Quick Start

### 1. Clone and Setup
```bash
git clone <repository-url>
cd llm-finance-course

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\\Scripts\\activate

# Install dependencies
pip install -r requirements.txt
```

### 2. Environment Variables
```bash
# Copy the template
cp .env.example .env

# Edit .env with your API keys
# NEVER commit .env to version control!
```

### 3. API Keys Required
- **OpenAI**: Get your key from https://platform.openai.com/api-keys
- **DeepSeek**: Get your key from https://platform.deepseek.com/
- **HuggingFace**: Get your token from https://huggingface.co/settings/tokens

## 📁 Project Structure

```
llm-finance-course/
├── .env.example        # Environment variables template
├── .gitignore         # Git ignore rules
├── requirements.txt   # Python dependencies
├── README.md         # This file
├── notebooks/        # Jupyter notebooks
│   ├── day1/
│   ├── day2/
│   └── ...
├── src/             # Source code modules
├── data/            # Data files (gitignored)
├── models/          # Model files (gitignored)
└── outputs/         # Generated outputs (gitignored)
```

## 🔄 Git Workflow

### Daily Workflow
```bash
# Start working
git pull origin main

# Make changes, then stage and commit
git add .
git commit -m "Add: Implement sentiment analysis for financial news"

# Push changes
git push origin main
```

### Commit Message Convention
```
Type: Brief description

Types:
- Add: New feature or file
- Fix: Bug fix
- Update: Modify existing feature
- Remove: Delete feature or file
- Docs: Documentation changes
- Style: Code formatting
- Refactor: Code restructuring
- Test: Add or modify tests

Examples:
- "Add: OpenAI API integration for text generation"
- "Fix: Handle API rate limiting in chat function"
- "Update: Improve tokenization preprocessing"
```

## 🔐 Security Guidelines

### ✅ Do:
- Use `.env` files for API keys
- Add `.env` to `.gitignore`
- Use environment variable templates (`.env.example`)
- Review commits before pushing
- Use specific `.gitignore` rules

### ❌ Don\'t:
- Hardcode API keys in source code
- Commit `.env` files
- Push large datasets or models
- Commit personal configuration files
- Share API keys in chat/email

## 🤝 Collaboration

### Working with Others
```bash
# Before starting work
git pull origin main

# Create feature branch for major changes
git checkout -b feature/new-analysis

# Work on feature, then merge back
git checkout main
git merge feature/new-analysis
git push origin main
```

## 📊 Managing Large Files

For large datasets or models, consider:
- **Git LFS**: For large files that need versioning
- **External storage**: AWS S3, Google Drive, etc.
- **Data versioning tools**: DVC, MLflow
- **Model registries**: HuggingFace Hub, MLflow Model Registry

## 🆘 Common Issues

### "Git won\'t let me commit"
```bash
# Check what\'s happening
git status

# Stage your changes
git add .

# Then commit
git commit -m "Your message"
```

### "Accidentally committed API keys"
```bash
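# If not pushed yet: undo the commit but keep your changes staged
git reset --soft HEAD~1
# Unstage the secrets file (it stays on disk)
git reset HEAD .env
echo ".env" >> .gitignore
git add .gitignore
# Recommit everything except .env
git commit -m "Add .env to gitignore"

# If already pushed: treat the key as compromised. Revoke it and generate
# a new one right away, then contact your instructor about cleaning the history.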
```

## 📚 Resources

- [Git Handbook](https://guides.github.com/introduction/git-handbook/)
- [Conventional Commits](https://www.conventionalcommits.org/)
- [Git LFS Documentation](https://git-lfs.github.io/)
- [GitHub Security Best Practices](https://docs.github.com/en/code-security)
'''

# Write README file
with open('README_git_guide.md', 'w') as f:
    f.write(readme_content)

print("📄 Created comprehensive Git guide (README_git_guide.md)")
print("\n📋 Summary of files created:")
print("   • .gitignore - Files to ignore in version control")
print("   • .env.example - Template for environment variables")
print("   • README_git_guide.md - Git best practices guide")

print("\n🎯 Next steps:")
print("   1. Copy .env.example to .env")
print("   2. Add your API keys to .env")
print("   3. Test Git workflow with initial commit")
print("   4. Start working with LLM APIs")
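
The commit message convention in the guide can also be checked mechanically. The sketch below validates the first line of a message against the `Type: Brief description` pattern. `check_commit_message` is a hypothetical helper, and the 72-character limit is an assumed convention rather than something the guide specifies.

```python
import re

COMMIT_TYPES = ("Add", "Fix", "Update", "Remove", "Docs", "Style", "Refactor", "Test")
SUBJECT_RE = re.compile(rf"^({'|'.join(COMMIT_TYPES)}): \S.*$")

def check_commit_message(message, max_subject_len=72):
    """Return a list of problems with the first line of a commit message (empty = OK)."""
    lines = message.strip().splitlines()
    subject = lines[0] if lines else ""
    problems = []
    if not SUBJECT_RE.match(subject):
        problems.append("subject should look like 'Type: Brief description'")
    if len(subject) > max_subject_len:
        problems.append(f"subject is longer than {max_subject_len} characters")
    return problems

for msg in ("Add: OpenAI API integration for text generation", "fixed stuff"):
    print(f"{msg!r}: {check_commit_message(msg) or 'OK'}")
```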

## 8. Testing Git Workflow

Let's test the Git workflow with our new files:

In [None]:
# Test Git workflow
print("🧪 Testing Git workflow...\n")

# Check Git status
print("1️⃣ Checking Git status:")
success, output = run_command("git status --porcelain", "Git status check")

if success and output:
    print(f"   Changed or untracked files: {len(output.splitlines())}")
    
    # Show what would be staged
    print("\n2️⃣ Files that would be staged:")
    files = ['.gitignore', '.env.example', 'README_git_guide.md', 'requirements.txt']
    for file in files:
        if os.path.exists(file):
            print(f"   ✅ {file}")
        else:
            print(f"   ❌ {file} (not found)")
    
    print("\n3️⃣ Git commands you would run:")
    print("   git add .gitignore .env.example README_git_guide.md requirements.txt")
    print('   git commit -m "Add: Initial project setup with Git configuration"')
    print("   git push origin main")
    
else:
    print("   No changes to stage (this is normal in the course environment)")

print("\n📝 Git workflow summary:")
print("   • Always check status before committing: git status")
print("   • Stage specific files: git add <filename>")
print("   • Write descriptive commit messages")
print("   • Push regularly to back up your work")

print("\n⚠️ Remember:")
print("   • .env file is ignored (contains real API keys)")
print("   • .env.example is tracked (safe template)")
print("   • Large files (models, data) are ignored")
print("   • Always review what you're committing")
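
The output of `git status --porcelain` has one line per file: a two-character status code, a space, then the path. The sketch below shows one way to split that into pairs. `parse_porcelain` is a hypothetical helper, and it assumes the default (v1) format with plain, unquoted paths.

```python
def parse_porcelain(output):
    """Split `git status --porcelain` (v1) output into (status, path) pairs."""
    entries = []
    for line in output.splitlines():
        if len(line) < 4:
            continue
        status, path = line[:2], line[3:]
        # Renames and copies are reported as "old -> new"; keep the new path
        if status[0] in "RC" and " -> " in path:
            path = path.split(" -> ", 1)[1]
        entries.append((status, path))
    return entries

sample = " M src/app.py\n?? .env.example\nA  README_git_guide.md\n"
for status, path in parse_porcelain(sample):
    print(f"{status!r:6} {path}")
```

The raw output should not be stripped before parsing, because a leading space is part of the status code (`" M"` means modified but not staged).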

## 9. Summary and Best Practices

### ✅ What We've Accomplished:

1. **Git Setup**: Verified Git installation and configuration
2. **Repository Structure**: Defined a clean project structure
3. **Security Configuration**: 
   - Created comprehensive `.gitignore`
   - Set up environment variable template
   - Ensured API keys are never committed
4. **Documentation**: Created Git best practices guide
5. **Workflow**: Established a clear Git workflow

### 🔐 Security Checklist:

- ✅ `.env` file is in `.gitignore`
- ✅ API keys use environment variables
- ✅ Template file (`.env.example`) for setup
- ✅ Large files (models, data) are ignored
- ✅ Sensitive directories are ignored
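
As an extra safeguard beyond the checklist, text can be scanned for strings that look like credentials before it is committed. The sketch below is a heuristic only: the patterns and the name `scan_text_for_secrets` are assumptions made for illustration, real key formats differ between providers, and a clean result does not prove that no secret is present.

```python
import re

# Heuristic patterns only (assumptions); real key formats vary by provider
SECRET_PATTERNS = {
    "key-like token (sk-...)": re.compile(r"\bsk-[A-Za-z0-9_-]{20,}"),
    "hardcoded *_API_KEY assignment": re.compile(
        r"\b[A-Z0-9_]*API_KEY\s*=\s*['\"]?(?!your_)[A-Za-z0-9_\-]{16,}"
    ),
}

def scan_text_for_secrets(text):
    """Return (line_number, label) pairs for lines that look like they contain a secret."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        for label, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                hits.append((number, label))
    return hits

# Example usage on a file you are about to commit:
#   with open("src/client.py", encoding="utf-8") as f:
#       for line_no, label in scan_text_for_secrets(f.read()):
#           print(f"line {line_no}: {label}")
```

Placeholder values from `.env.example` (those starting with `your_`) are deliberately not flagged by the second pattern.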

### 📝 Daily Git Workflow:

```bash
# Morning routine
git pull origin main          # Get latest changes

# Work on your code...

# End of day routine
git status                    # See what changed
git add .                     # Stage changes
git commit -m "Add: Description"  # Commit with message
git push origin main          # Push to remote
```

### 🚀 Next Steps:

1. **Set up environment variables** (`.env` file)
2. **Test API connections** (OpenAI, DeepSeek)
3. **Explore HuggingFace models** locally
4. **Practice Git workflow** with real commits

### 🆘 If Something Goes Wrong:

- **Committed secrets accidentally**: Reset the commit, add the file to `.gitignore`, and recommit; if the commit was already pushed, also revoke the key and generate a new one
- **Can't push**: Check if you need to pull first (`git pull`)
- **Merge conflicts**: Ask for help - these need careful handling
- **Lost work**: Check `git reflog` for recent commits

**Remember**: Git is your safety net - commit often, push regularly!