# 02 - Data Extraction: Fetching and Decoding Transactions

**Goal**: Learn how to connect to Ethereum nodes, fetch transactions, and decode them using our custom decoders.

In this notebook, you'll learn:
- How to connect to Ethereum RPC endpoints (Infura, Alchemy, or local nodes)
- How to fetch transaction data programmatically
- How to use protocol-specific decoders (ETH, ERC-20, Uniswap)
- How to handle rate limits, retries, and errors gracefully
- How to save decoded data for training

**Prerequisites**: Completed `01-data-exploration.ipynb`, basic understanding of Ethereum transactions

## Setup

In [None]:
%load_ext autoreload
%autoreload 2

import json
import os
import sys
import time
from pathlib import Path
from typing import Any

import pandas as pd
from web3 import Web3

# Add project root to path (assumes the notebook is run from the notebooks/ directory)
project_root = Path.cwd().parent
sys.path.insert(0, str(project_root))

# Import project modules
from eth_finetuning.extraction.core.utils import Web3ConnectionManager, setup_logging
from eth_finetuning.extraction.core.fetcher import (
    load_transaction_hashes,
    fetch_transaction_data,
    fetch_transactions_batch,
)
from eth_finetuning.extraction.decoders.eth import decode_eth_transfer
from eth_finetuning.extraction.decoders.erc20 import decode_erc20_transfer
from eth_finetuning.extraction.decoders.uniswap import decode_uniswap_transaction

# Setup logging
setup_logging()

print("✓ Imports successful")
print(f"✓ Project root: {project_root}")

## Connecting to Ethereum RPC

### What is an RPC Endpoint?

An **RPC (Remote Procedure Call) endpoint** is a URL that lets you interact with an Ethereum node. Think of it as an API for the blockchain.
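
To make this concrete, here is a minimal sketch of what a request to an RPC endpoint looks like on the wire. Ethereum nodes speak JSON-RPC 2.0, and numeric results come back as hex strings. The helper names `build_rpc_request` and `parse_hex_quantity` are made up for this illustration; in the rest of the notebook, `web3` builds and sends these requests for us.

```python
import json


def build_rpc_request(method: str, params: list, request_id: int = 1) -> str:
    """Serialize a JSON-RPC 2.0 request body for an Ethereum node."""
    payload = {
        "jsonrpc": "2.0",
        "method": method,
        "params": params,
        "id": request_id,
    }
    return json.dumps(payload)


def parse_hex_quantity(result: str) -> int:
    """Convert a hex-encoded quantity such as '0x10d4f' into an int."""
    return int(result, 16)


# Body you would POST to the endpoint to ask for the latest block number
body = build_rpc_request("eth_blockNumber", [])
print(body)

# A node answers with a hex string in the "result" field
print(parse_hex_quantity("0x10d4f"))
```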

### RPC Provider Options:

1. **Infura** (https://infura.io)
   - Free tier: 100,000 requests/day at the time of writing (check the provider's current pricing)
   - URL format: `https://mainnet.infura.io/v3/YOUR_API_KEY`

2. **Alchemy** (https://alchemy.com)
   - Free tier: 300M compute units/month at the time of writing (check the provider's current pricing)
   - URL format: `https://eth-mainnet.g.alchemy.com/v2/YOUR_API_KEY`

3. **Local Node** (Geth, Erigon)
   - Requires syncing the blockchain (roughly 1 TB of storage, and growing)
   - URL: `http://localhost:8545`

### Setting Up Your RPC URL

**For this notebook**, set the `ETH_RPC_URL` environment variable to connect to a live node, or leave it unset to use our sample data.
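
Provider URLs usually embed the API key in the path, so it is worth being careful when logging them. The sketch below reads `ETH_RPC_URL` and prints only the scheme and host. `mask_rpc_url` is a hypothetical helper written for this example, not part of the project.

```python
import os
from urllib.parse import urlsplit


def mask_rpc_url(url: str) -> str:
    """Return scheme and host only, so a key in the URL path is never printed."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}/..."


rpc_url = os.environ.get("ETH_RPC_URL")
if rpc_url:
    print(f"RPC URL configured: {mask_rpc_url(rpc_url)}")
else:
    print("ETH_RPC_URL is not set; falling back to sample data")
```

Note that this keeps the port (useful for `http://localhost:8545`) but would also keep credentials written as `user:password@host`, so it is only suitable for key-in-path URLs like the ones listed above.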

In [None]:
# Option 1: Set your RPC URL here (recommended for testing)
# Get a free API key from https://infura.io or https://alchemy.com
RPC_URL = os.environ.get('ETH_RPC_URL', None)

# Option 2: Use sample data (no RPC connection needed)
USE_SAMPLE_DATA = RPC_URL is None

if USE_SAMPLE_DATA:
    print("⚠️  No RPC_URL configured - using sample data from fixtures")
    print("   To connect to live network, set ETH_RPC_URL environment variable")
    print("   Example: export ETH_RPC_URL='https://mainnet.infura.io/v3/YOUR_KEY'")
else:
    # Show only the host so an API key embedded in the URL path is not printed
    print(f"✓ RPC URL configured: {RPC_URL.split('//')[-1].split('/')[0]}")
    print("✓ Ready to fetch live data from Ethereum")

## Loading Sample Transaction Hashes

First, let's load some transaction hashes to fetch. These can come from:
- Text files (one hash per line)
- Block explorers (Etherscan)
- On-chain analysis tools
- Protocol-specific queries
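
Whatever the source, it helps to validate hashes before spending RPC calls on them. A transaction hash is `0x` followed by 64 hex characters. The sketch below shows one way to clean a list of lines; `parse_hash_lines` is a hypothetical helper and is not necessarily how `load_transaction_hashes` behaves.

```python
import re

# A transaction hash is 32 bytes: '0x' followed by exactly 64 hex characters
TX_HASH_PATTERN = re.compile(r"0x[0-9a-fA-F]{64}")


def parse_hash_lines(lines):
    """Return unique, lowercased hashes, skipping blanks, comments, and bad input."""
    seen = set()
    hashes = []
    for raw in lines:
        candidate = raw.strip()
        if not candidate or candidate.startswith("#"):
            continue
        if not TX_HASH_PATTERN.fullmatch(candidate):
            print(f"Skipping malformed hash: {candidate!r}")
            continue
        normalized = candidate.lower()
        if normalized not in seen:
            seen.add(normalized)
            hashes.append(normalized)
    return hashes


sample_lines = [
    "# exported from a block explorer",
    "0x" + "Ab" * 32 + "\n",
    "",
    "0x1234",
    "0x" + "ab" * 32,
]
print(parse_hash_lines(sample_lines))
```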

In [None]:
if USE_SAMPLE_DATA:
    # Load sample transactions directly
    fixtures_path = project_root / "tests" / "fixtures" / "sample_transactions.json"
    with open(fixtures_path, 'r') as f:
        sample_transactions = json.load(f)
    
    tx_hashes = list(sample_transactions.keys())
    print(f"✓ Loaded {len(tx_hashes)} sample transaction hashes from fixtures")
else:
    # Load transaction hashes from file
    hashes_file = project_root / "tests" / "fixtures" / "sample_tx_hashes.txt"
    
    if hashes_file.exists():
        tx_hashes = load_transaction_hashes(hashes_file)
        print(f"✓ Loaded {len(tx_hashes)} transaction hashes from {hashes_file.name}")
    else:
        print("⚠️  No sample hashes file found")
        # Example hashes (you can add your own)
        tx_hashes = [
            "0x1234567890abcdef1234567890abcdef1234567890abcdef1234567890abcdef",
        ]
        print(f"Using example hashes: {len(tx_hashes)} hash(es)")

# Display loaded hashes
print("\nTransaction hashes to process:")
for i, tx_hash in enumerate(tx_hashes[:5], 1):
    print(f"  {i}. {tx_hash}")
if len(tx_hashes) > 5:
    print(f"  ... and {len(tx_hashes) - 5} more")

## Fetching Transaction Data

### How Transaction Fetching Works

To fully decode a transaction, we need to fetch:
1. **Transaction data** (`eth_getTransactionByHash`)
   - Basic info: from, to, value, input, gas
2. **Transaction receipt** (`eth_getTransactionReceipt`)
   - Execution result: status, logs, gas used
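
The two responses can then be merged into one record. Here is a minimal sketch using plain dicts shaped like the JSON-RPC responses (camelCase keys, values already converted to integers). The output keys mirror the ones this notebook reads later (`block_number`, `gas_used`, `status`, `logs`), but the exact shape returned by `fetch_transaction_data` is an assumption here.

```python
def merge_transaction_and_receipt(tx: dict, receipt: dict) -> dict:
    """Combine the transaction and its receipt into one flat record."""
    return {
        "hash": tx["hash"],
        "block_number": tx["blockNumber"],
        "from": tx["from"],
        "to": tx.get("to"),  # None for contract-creation transactions
        "value": tx["value"],
        "input": tx["input"],
        "gas": tx["gas"],  # gas limit set by the sender
        "gas_price": tx.get("gasPrice"),
        "gas_used": receipt["gasUsed"],  # gas actually consumed
        "status": receipt["status"],  # 1 = success, 0 = reverted
        "logs": receipt["logs"],
    }


example_tx = {
    "hash": "0x" + "ab" * 32,
    "blockNumber": 18_000_000,
    "from": "0x" + "11" * 20,
    "to": "0x" + "22" * 20,
    "value": 10**18,
    "input": "0x",
    "gas": 21_000,
    "gasPrice": 30 * 10**9,
}
example_receipt = {"gasUsed": 21_000, "status": 1, "logs": []}

record = merge_transaction_and_receipt(example_tx, example_receipt)
print(record["block_number"], record["gas_used"], record["status"])
```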

### Rate Limiting & Retries

Our `Web3ConnectionManager` handles:
- **Exponential backoff**: Automatically retries failed requests, waiting longer after each failure
- **Rate limiting**: Respects provider limits
- **Error handling**: Gracefully handles network issues
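
To show the idea behind exponential backoff, here is a small standalone sketch. It is not the `Web3ConnectionManager` implementation; `call_with_backoff` and its parameters are hypothetical. The delay doubles after each failed attempt, with a little random jitter so many clients do not retry in lockstep.

```python
import random
import time


def call_with_backoff(func, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call func(); on failure wait base_delay * 2**attempt (plus jitter) and retry."""
    for attempt in range(max_retries + 1):
        try:
            return func()
        except Exception as exc:  # a real client should catch only transient errors
            if attempt == max_retries:
                raise  # out of retries: surface the last error
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            print(f"Attempt {attempt + 1} failed ({exc}); retrying in {delay:.2f}s")
            sleep(delay)


# Demo: a call that fails twice, then succeeds. Delays are recorded instead of slept.
calls = {"count": 0}
recorded_delays = []


def flaky_request():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


result = call_with_backoff(flaky_request, sleep=recorded_delays.append)
print(result, len(recorded_delays))
```

Passing `sleep` as a parameter is a design choice that keeps the demo (and any tests) fast; in normal use the default `time.sleep` applies.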

In [None]:
if USE_SAMPLE_DATA:
    # Use pre-loaded sample data
    fetched_transactions = sample_transactions
    print(f"✓ Using {len(fetched_transactions)} sample transactions")
else:
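    # Start with an empty result so later cells still run if the connection fails
    fetched_transactions = {}

    # Connect to Ethereum RPC and fetch transactions
    print("Connecting to Ethereum RPC...")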
    
    # Initialize connection manager with retry logic
    manager = Web3ConnectionManager(
        rpc_url=RPC_URL,
        max_retries=3,
        retry_delay=1.0,
    )
    
    # Check connection
    is_connected = manager.w3.is_connected()
    print(f"Connection status: {'✓ Connected' if is_connected else '✗ Failed'}")
    
    if not is_connected:
        print("\n⚠️  Unable to connect to RPC endpoint")
        print("Please check your RPC_URL and internet connection")
    else:
        # Get chain info
        chain_id = manager.w3.eth.chain_id
        block_number = manager.w3.eth.block_number
        print(f"✓ Chain ID: {chain_id}")
        print(f"✓ Latest block: {block_number:,}")
        
        # Fetch transactions
        print(f"\nFetching {len(tx_hashes)} transaction(s)...")
        fetched_transactions = {}
        
        for i, tx_hash in enumerate(tx_hashes, 1):
            print(f"  [{i}/{len(tx_hashes)}] Fetching {tx_hash[:10]}...")
            
            tx_data = fetch_transaction_data(tx_hash, manager)
            
            if tx_data:
                fetched_transactions[tx_hash] = tx_data
                print(f"    ✓ Success (block {tx_data['block_number']})")
            else:
                print(f"    ✗ Failed to fetch")
            
            # Rate limiting: small delay between requests
            if i < len(tx_hashes):
                time.sleep(0.5)
        
        print(f"\n✓ Fetched {len(fetched_transactions)}/{len(tx_hashes)} transactions")

## Inspecting Fetched Data

Let's examine what data we've fetched.
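
The `input` field is the most opaque part of a transaction, so here is a short sketch of how it is laid out for a contract call: a 4-byte method ID (selector) followed by 32-byte argument words. The example encodes an ERC-20 `transfer(address,uint256)` call, whose selector is `0xa9059cbb`. `split_calldata` is a hypothetical helper, and it assumes the input is a hex string as in our JSON fixtures.

```python
def split_calldata(input_hex: str):
    """Split hex calldata into a 4-byte selector and a list of 32-byte words."""
    body = input_hex[2:] if input_hex.startswith("0x") else input_hex
    if len(body) < 8:
        return None, []  # empty input, e.g. a plain ETH transfer
    selector = "0x" + body[:8]
    args = body[8:]
    words = [args[i:i + 64] for i in range(0, len(args), 64)]
    return selector, words


# transfer(address,uint256): selector, then recipient and amount, each padded to 32 bytes
calldata = (
    "0xa9059cbb"
    + "000000000000000000000000" + "1f9840a85d5af5bf1d1762f925bdaddc4201f984"
    + hex(1_000_000)[2:].rjust(64, "0")
)

selector, words = split_calldata(calldata)
recipient = "0x" + words[0][-40:]  # an address is the last 20 bytes of its word
amount = int(words[1], 16)
print(selector, recipient, amount)
```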

In [None]:
# Pick first transaction for detailed inspection
if fetched_transactions:
    first_hash = list(fetched_transactions.keys())[0]
    first_tx = fetched_transactions[first_hash]
    
    print("=" * 80)
    print("FETCHED TRANSACTION DATA")
    print("=" * 80)
    print(f"\nTransaction Hash: {first_hash}")
    print(f"\nBasic Info:")
    print(f"  Block Number:     {first_tx.get('block_number', 'N/A')}")
    print(f"  From:             {first_tx.get('from', 'N/A')}")
    print(f"  To:               {first_tx.get('to', 'N/A')}")
    print(f"  Value (Wei):      {first_tx.get('value', 0)}")
    print(f"  Value (ETH):      {Web3.from_wei(first_tx.get('value', 0), 'ether')}")
    print(f"\nGas Details:")
    print(f"  Gas Limit:        {first_tx.get('gas', 'N/A')}")
    print(f"  Gas Used:         {first_tx.get('gas_used', 'N/A')}")
    print(f"  Gas Price:        {first_tx.get('gas_price', 'N/A')} Wei")
    print(f"\nExecution:")
    print(f"  Status:           {'✓ Success' if first_tx.get('status') == 1 else '✗ Failed'}")
    print(f"  Events Emitted:   {len(first_tx.get('logs', []))} log(s)")
    print(f"\nInput Data:")
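    input_data = first_tx.get('input', '0x')
    # Hex string: skip the '0x' prefix; two hex characters encode one byte
    input_bytes = max((len(input_data) - 2) // 2, 0)
    print(f"  Length:           {input_bytes} bytes")
    print(f"  Preview:          {input_data[:66]}...")
    
    # A method ID is the first 4 bytes: '0x' plus 8 hex characters
    if len(input_data) >= 10:
        print(f"  Method ID:        {input_data[:10]}")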
else:
    print("⚠️  No transactions available")

## Decoding Transactions

Now let's use our protocol-specific decoders to extract structured information.

### Available Decoders:

1. **ETH Transfer Decoder** - Simple value transfers
2. **ERC-20 Decoder** - Token transfers and approvals
3. **Uniswap Decoder** - V2 and V3 swaps

### Decoding Process:

Each decoder:
1. Checks if the transaction matches its pattern
2. Extracts relevant data (amounts, addresses, tokens)
3. Returns structured JSON with decoded information
4. Returns `None` if transaction doesn't match the pattern
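
Because every decoder follows the same "return a dict or `None`" contract, they can be chained. The sketch below tries a list of decoders in order and returns the first match. The two toy decoders are stand-ins written for this example only; they are far simpler than the project's real decoders, and `decode_first_match` is a hypothetical helper.

```python
from typing import Callable, Optional

Decoder = Callable[[dict], Optional[dict]]


def toy_eth_decoder(tx: dict) -> Optional[dict]:
    """Stand-in: treat empty input with a positive value as a plain ETH transfer."""
    if tx.get("input", "0x") in ("0x", "") and tx.get("value", 0) > 0:
        return {"protocol": "ethereum", "action": "transfer", "amount_wei": tx["value"]}
    return None


def toy_erc20_decoder(tx: dict) -> Optional[dict]:
    """Stand-in: match the transfer(address,uint256) selector 0xa9059cbb."""
    if tx.get("input", "").startswith("0xa9059cbb"):
        return {"protocol": "erc20", "action": "transfer"}
    return None


def decode_first_match(tx: dict, decoders: list[Decoder]) -> Optional[dict]:
    """Try each decoder in order; return the first non-None result."""
    for decoder in decoders:
        try:
            result = decoder(tx)
        except (KeyError, TypeError, ValueError):
            result = None  # malformed input counts as "no match"
        if result is not None:
            return result
    return None


decoders = [toy_erc20_decoder, toy_eth_decoder]
print(decode_first_match({"input": "0x", "value": 10**18}, decoders))
print(decode_first_match({"input": "0xa9059cbb" + "00" * 64, "value": 0}, decoders))
print(decode_first_match({"input": "0xdeadbeef", "value": 0}, decoders))
```

Ordering matters with this approach: more specific decoders should usually come before more general ones.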

### Decoder 1: ETH Transfers

In [None]:
print("DECODING ETH TRANSFERS")
print("=" * 80)

eth_transfers = []

for tx_hash, tx_data in fetched_transactions.items():
    # Try to decode as ETH transfer
    decoded = decode_eth_transfer(tx_data)
    
    if decoded:
        eth_transfers.append(decoded)
        print(f"\n✓ Decoded ETH transfer: {tx_hash[:10]}...")
        print(f"  Action:    {decoded['action']}")
        print(f"  Protocol:  {decoded['protocol']}")
        print(f"  From:      {decoded['from'][:10]}...")
        print(f"  To:        {decoded['to'][:10]}...")
        print(f"  Amount:    {decoded['amount_eth']} ETH")
        print(f"  Status:    {decoded['status']}")

print(f"\n\nFound {len(eth_transfers)} ETH transfer(s)")

### Decoder 2: ERC-20 Token Transfers
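
Before running our decoder, it may help to see what an ERC-20 `Transfer` event looks like at the log level. The first topic is the hash of the event signature `Transfer(address,address,uint256)`, the next two topics hold the sender and recipient, and the data field holds the raw amount. This is a simplified sketch, not the implementation of `decode_erc20_transfer`; `decode_transfer_log` is hypothetical and assumes topics and data are hex strings, as in JSON fixtures.

```python
from decimal import Decimal

# keccak256("Transfer(address,address,uint256)")
TRANSFER_TOPIC = "0xddf252ad1be2c89b69c2b068fc378daa952ba7f163c4a11628f55a4df523b3ef"


def decode_transfer_log(log: dict, decimals: int = 18):
    """Decode one ERC-20 Transfer log, or return None if the log does not match."""
    topics = log.get("topics", [])
    # ERC-20 Transfer has exactly 3 topics (ERC-721 uses the same signature with 4)
    if len(topics) != 3 or topics[0].lower() != TRANSFER_TOPIC:
        return None
    raw_amount = int(log["data"], 16)
    return {
        "token_address": log["address"],
        "from": "0x" + topics[1][-40:],  # address = last 20 bytes of the topic
        "to": "0x" + topics[2][-40:],
        "raw_amount": raw_amount,
        "amount": Decimal(raw_amount) / Decimal(10) ** decimals,
    }


example_log = {
    "address": "0x1f9840a85d5aF5bf1D1762F925BDADdC4201F984",
    "topics": [
        TRANSFER_TOPIC,
        "0x" + "00" * 12 + "11" * 20,
        "0x" + "00" * 12 + "22" * 20,
    ],
    "data": "0x" + hex(1_500_000)[2:].rjust(64, "0"),
}

decoded_log = decode_transfer_log(example_log, decimals=6)
print(decoded_log["from"], decoded_log["to"], decoded_log["amount"])
```

The raw amount is an integer in the token's smallest unit, which is why the token's `decimals` value is needed to display a human-readable amount.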

In [None]:
print("DECODING ERC-20 TRANSFERS")
print("=" * 80)

erc20_transfers = []

# For ERC-20 decoding, we may need RPC connection to fetch token metadata
# If we have a connection manager, use it; otherwise, work with available data
rpc_manager = None if USE_SAMPLE_DATA else manager

for tx_hash, tx_data in fetched_transactions.items():
    # Try to decode as ERC-20 transfer
    decoded = decode_erc20_transfer(tx_data, rpc_manager)
    
    if decoded:
        erc20_transfers.append(decoded)
        print(f"\n✓ Decoded ERC-20 transfer: {tx_hash[:10]}...")
        print(f"  Action:       {decoded['action']}")
        print(f"  Protocol:     {decoded['protocol']}")
        print(f"  Token:        {decoded['token_address'][:10]}...")
        print(f"  Token Symbol: {decoded.get('token_symbol', 'Unknown')}")
        print(f"  From:         {decoded['from'][:10]}...")
        print(f"  To:           {decoded['to'][:10]}...")
        print(f"  Amount:       {decoded['amount']}")
        print(f"  Decimals:     {decoded.get('decimals', 'Unknown')}")

print(f"\n\nFound {len(erc20_transfers)} ERC-20 transfer(s)")

### Decoder 3: Uniswap Swaps
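
For intuition, a Uniswap V2 `Swap` event carries four amounts in its data field: `amount0In`, `amount1In`, `amount0Out`, `amount1Out`. Which ones are non-zero tells us the swap direction. The sketch below covers only this simple V2 case and is not the logic inside `decode_uniswap_transaction`; `interpret_v2_swap` and the token labels are hypothetical.

```python
def interpret_v2_swap(data_hex: str, token0: str, token1: str):
    """Read the four uint256 amounts of a V2 Swap event and infer the direction."""
    body = data_hex[2:] if data_hex.startswith("0x") else data_hex
    if len(body) != 4 * 64:
        return None  # not four 32-byte words
    amount0_in, amount1_in, amount0_out, amount1_out = (
        int(body[i:i + 64], 16) for i in range(0, 256, 64)
    )
    if amount0_in > 0 and amount1_out > 0:
        return {"token_in": token0, "token_out": token1,
                "amount_in": amount0_in, "amount_out": amount1_out}
    if amount1_in > 0 and amount0_out > 0:
        return {"token_in": token1, "token_out": token0,
                "amount_in": amount1_in, "amount_out": amount0_out}
    return None  # ambiguous or empty swap


def encode_words(*values: int) -> str:
    """Helper for the demo: pack integers as 32-byte hex words."""
    return "0x" + "".join(hex(v)[2:].rjust(64, "0") for v in values)


# 1000 units of token0 in, 250 units of token1 out
swap_data = encode_words(1000, 0, 0, 250)
print(interpret_v2_swap(swap_data, "TOKEN0", "TOKEN1"))
```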

In [None]:
print("DECODING UNISWAP SWAPS")
print("=" * 80)

uniswap_swaps = []

for tx_hash, tx_data in fetched_transactions.items():
    # Try to decode as Uniswap swap
    decoded = decode_uniswap_transaction(tx_data)
    
    if decoded:
        uniswap_swaps.append(decoded)
        print(f"\n✓ Decoded Uniswap swap: {tx_hash[:10]}...")
        print(f"  Action:       {decoded['action']}")
        print(f"  Protocol:     {decoded['protocol']}")
        print(f"  Pool:         {decoded['pool_address'][:10]}...")
        token_in = decoded.get('token_in')
        token_out = decoded.get('token_out')
        print(f"  Token In:     {token_in[:10] + '...' if token_in else 'N/A'}")
        print(f"  Token Out:    {token_out[:10] + '...' if token_out else 'N/A'}")
        print(f"  Amount In:    {decoded.get('amount_in', 'N/A')}")
        print(f"  Amount Out:   {decoded.get('amount_out', 'N/A')}")

print(f"\n\nFound {len(uniswap_swaps)} Uniswap swap(s)")

## Decoding Summary

Let's create a summary of all decoded transactions.

In [None]:
# Combine all decoded transactions
all_decoded = eth_transfers + erc20_transfers + uniswap_swaps

print("DECODING SUMMARY")
print("=" * 80)
print(f"\nTotal Transactions Fetched: {len(fetched_transactions)}")
print(f"Successfully Decoded:        {len(all_decoded)}")
print(f"\nBreakdown by Protocol:")
print(f"  ETH Transfers:  {len(eth_transfers)}")
print(f"  ERC-20 Tokens:  {len(erc20_transfers)}")
print(f"  Uniswap Swaps:  {len(uniswap_swaps)}")

if len(all_decoded) < len(fetched_transactions):
    undecoded = len(fetched_transactions) - len(all_decoded)
    print(f"\n⚠️  {undecoded} transaction(s) could not be decoded")
    print("   (May be other protocols or complex interactions)")

## Converting to DataFrame

Let's organize decoded data into a pandas DataFrame for easier analysis.

In [None]:
if all_decoded:
    # Create DataFrame from decoded transactions
    df_decoded = pd.DataFrame(all_decoded)
    
    print("DECODED TRANSACTIONS DATAFRAME")
    print("=" * 80)
    print(f"\nShape: {df_decoded.shape[0]} rows × {df_decoded.shape[1]} columns")
    print(f"\nColumns: {', '.join(df_decoded.columns)}")
    print(f"\nFirst few rows:\n")
    display(df_decoded.head())
    
    # Show protocol distribution
    print("\nProtocol Distribution:")
    print(df_decoded['protocol'].value_counts())
else:
    print("⚠️  No decoded transactions to display")

## Saving Decoded Data

Let's save the decoded transactions for use in the next notebook (dataset preparation).
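
One thing to watch for when saving: the `json` module cannot encode every Python type. Values such as `Decimal` (returned by `Web3.from_wei`) or raw `bytes` need a fallback. Here is a minimal sketch of such a fallback; `to_jsonable` is a hypothetical helper, and whether our decoders actually emit these types is an assumption.

```python
import json
from decimal import Decimal


def to_jsonable(value):
    """Fallback for json.dumps: convert types the json module cannot serialize."""
    if isinstance(value, Decimal):
        return str(value)  # keep full precision as text
    if isinstance(value, (bytes, bytearray)):
        return "0x" + bytes(value).hex()
    raise TypeError(f"Cannot serialize {type(value).__name__}")


record = {"amount_eth": Decimal("1.5"), "tx_hash": bytes.fromhex("ab" * 32)}
encoded = json.dumps(record, default=to_jsonable, indent=2)
print(encoded)
```

Storing decimals as strings avoids the rounding that a conversion to `float` could introduce for large token amounts.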

In [None]:
# Create output directory if it doesn't exist
output_dir = project_root / "data" / "processed"
output_dir.mkdir(parents=True, exist_ok=True)

# Save as JSON
json_output = output_dir / "decoded_transactions.json"
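with open(json_output, 'w') as f:
    # default=str is a precaution: it stringifies values json cannot encode (e.g. Decimal)
    json.dump(all_decoded, f, indent=2, default=str)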

print(f"✓ Saved decoded transactions to {json_output}")
print(f"  Total transactions: {len(all_decoded)}")

# Also save as CSV for easy inspection
if all_decoded:
    csv_output = output_dir / "decoded_transactions.csv"
    df_decoded.to_csv(csv_output, index=False)
    print(f"✓ Saved CSV version to {csv_output}")

## Error Handling Examples

Let's demonstrate how our decoders handle edge cases and errors gracefully.
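
As a reference point for the tests below, here is a sketch of the defensive pattern we expect from a decoder: check required fields first, return `None` instead of raising, and report the execution status rather than dropping failed transactions. `sketch_decode_eth_transfer` is a simplified, hypothetical stand-in and may differ from the real `decode_eth_transfer`.

```python
from decimal import Decimal

REQUIRED_FIELDS = ("from", "to", "value")
STATUS_LABELS = {1: "success", 0: "failed"}


def sketch_decode_eth_transfer(tx):
    """Return a decoded record, or None if the input is unusable."""
    if not isinstance(tx, dict):
        return None
    if any(tx.get(field) is None for field in REQUIRED_FIELDS):
        return None  # missing data: signal "no match" instead of raising
    return {
        "action": "transfer",
        "protocol": "ethereum",
        "from": tx["from"],
        "to": tx["to"],
        "amount_eth": Decimal(tx["value"]) / Decimal(10**18),
        "status": STATUS_LABELS.get(tx.get("status"), "unknown"),
    }


address = "0x1f9840a85d5aF5bf1D1762F925BDADdC4201F984"
print(sketch_decode_eth_transfer({}))
print(sketch_decode_eth_transfer({"to": address, "value": 10**18}))
print(sketch_decode_eth_transfer(
    {"from": address, "to": address, "value": 10**18, "status": 0}
))
```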

In [None]:
print("ERROR HANDLING EXAMPLES")
print("=" * 80)

# Test Case 1: Empty transaction data
print("\nTest 1: Empty transaction data")
try:
    result = decode_eth_transfer({})
    print(f"  Result: {result}")
    if result is None:
        print("  ✓ Handled gracefully (returned None)")
    else:
        print("  ⚠️  Decoder returned a result for empty input")
except Exception as e:
    print(f"  ✗ Error: {e}")

# Test Case 2: Missing required fields
print("\nTest 2: Missing 'from' field")
try:
    incomplete_tx = {'to': '0x742d35Cc6634C0532925a3b844Bc9e7595f0', 'value': 1000000}
    result = decode_eth_transfer(incomplete_tx)
    print(f"  Result: {result}")
    if result is None:
        print("  ✓ Handled gracefully (returned None)")
    else:
        print("  ⚠️  Decoder returned a result despite the missing field")
except Exception as e:
    print(f"  ✗ Error: {e}")

# Test Case 3: Failed transaction (status = 0)
print("\nTest 3: Failed transaction")
try:
    failed_tx = {
        'from': '0x742d35Cc6634C0532925a3b844Bc9e7595f0',
        'to': '0x1f9840a85d5aF5bf1D1762F925BDADdC4201F984',
        'value': 1000000,
        'status': 0,  # Failed!
    }
    result = decode_eth_transfer(failed_tx)
    if result:
        print(f"  Status in result: {result.get('status')}")
        print("  ✓ Decoder includes failure status")
    else:
        print("  Result: None")
except Exception as e:
    print(f"  ✗ Error: {e}")

print("\n✓ All error handling tests completed")

## Key Takeaways

✓ **RPC Connection**: Use connection managers with retry logic for reliability

✓ **Transaction Fetching**: Need both transaction and receipt for complete data

✓ **Protocol Decoders**: Each protocol has specific patterns and data structures

✓ **Error Handling**: Gracefully handle missing data, failed transactions, and edge cases

✓ **Data Storage**: Save decoded data in structured formats (JSON, CSV) for reuse

## Troubleshooting Tips

**Issue**: `Connection refused` or timeout errors
- **Solution**: Check RPC URL, verify API key, ensure internet connection

**Issue**: Rate limit errors (HTTP 429)
- **Solution**: Increase delay between requests, upgrade RPC provider tier, or use local node

**Issue**: Transaction not found
- **Solution**: Verify transaction hash, check if it's from the correct network (mainnet vs testnet)

**Issue**: Decoder returns `None` for valid transaction
- **Solution**: Transaction might be from a different protocol - check logs and input data manually

**Issue**: Token symbol shows as 'Unknown'
- **Solution**: The token contract may not implement the optional `symbol()` function, or the RPC call may have failed
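
For the "Transaction not found" case, a quick check is to compare the chain ID reported by the node (`manager.w3.eth.chain_id` in this notebook) with the network you expect. The sketch below uses a small lookup of well-known chain IDs; `check_expected_chain` is a hypothetical helper.

```python
KNOWN_CHAINS = {
    1: "Ethereum mainnet",
    11155111: "Sepolia testnet",
    17000: "Holesky testnet",
}


def describe_chain(chain_id: int) -> str:
    """Return a readable name for a chain ID, or a generic label if unknown."""
    return KNOWN_CHAINS.get(chain_id, f"unknown chain (id {chain_id})")


def check_expected_chain(actual_id: int, expected_id: int = 1):
    """Return None if the node is on the expected chain, else a warning message."""
    if actual_id == expected_id:
        return None
    return (
        f"Connected to {describe_chain(actual_id)}, "
        f"but expected {describe_chain(expected_id)}"
    )


print(check_expected_chain(1))
print(check_expected_chain(11155111))
```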

## Next Steps

In the next notebook (**03-dataset-preparation.ipynb**), we'll learn how to:
- Extract intents from decoded transactions
- Format data for instruction-tuning (Alpaca format)
- Split into training/validation/test sets
- Validate dataset quality

---

**Ready to continue?** → `notebooks/03-dataset-preparation.ipynb`