# Post-2024 Meme Coin Selection & Analysis

## Objective
Select high-quality meme coins launched after 2024-01-01 from the full 375M+ transaction dataset for comprehensive analysis.

## Selection Criteria
1. **First trading activity after 2024-01-01**
2. **Sufficient trading volume** (minimum thresholds)
3. **Active trading period** (not dead coins)
4. **Meme coin characteristics** (high volatility, community-driven)
5. **Data quality** (clean trading patterns)
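
A minimal sketch of how criteria 1 to 3 could be expressed as filters over a per-token summary table. The column names mirror those produced later in this notebook; the `select_candidates` helper and every threshold value are hypothetical placeholders, not values chosen by this analysis.

```python
import pandas as pd

def select_candidates(tokens: pd.DataFrame,
                      min_trades: int = 1_000,        # hypothetical threshold
                      min_traders: int = 100,         # hypothetical threshold
                      min_active_hours: float = 1.0,  # hypothetical threshold
                      top_n: int = 20) -> pd.DataFrame:
    """Keep tokens that pass simple activity filters, ranked by trade count."""
    active_hours = (
        tokens["last_trade_date"] - tokens["first_trade_date"]
    ).dt.total_seconds() / 3600
    mask = (
        (tokens["first_trade_date"] >= pd.Timestamp("2024-01-01"))
        & (tokens["total_trades"] >= min_trades)
        & (tokens["unique_traders"] >= min_traders)
        & (active_hours >= min_active_hours)
    )
    return tokens[mask].nlargest(top_n, "total_trades")

# Toy data for illustration only
toy = pd.DataFrame({
    "mint": ["A", "B", "C"],
    "first_trade_date": pd.to_datetime(["2024-03-01", "2023-12-31", "2024-05-01"]),
    "last_trade_date": pd.to_datetime(
        ["2024-03-01 12:00", "2023-12-31 20:00", "2024-05-01 00:10"]
    ),
    "total_trades": [5_000, 9_000, 50],
    "unique_traders": [400, 800, 10],
})
selected = select_candidates(toy)
print(selected["mint"].tolist())
```

Criteria 4 and 5 (volatility and data quality) need price-level data and are not covered by this sketch.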

## Expected Outcome
- Curated list of 10-20 high-quality post-2024 meme coins
- Initial characterization of each coin
- Foundation for comprehensive EDA framework

---


In [3]:
# Setup and imports
import sys
import os

# Add parent directory to path to import from root if needed
sys.path.append('.')
sys.path.append('..')

from solana_eda_utils import SolanaDataAnalyzer, format_large_number, truncate_address
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import warnings
from datetime import datetime
warnings.filterwarnings('ignore')

# Initialize the analyzer
analyzer = SolanaDataAnalyzer()

print("✅ Connected to DuckDB database")
print(f"📊 Database: {analyzer.db_path}")
print(f"📋 Table: {analyzer.table_name}")
print(f"🎯 Target: Post-2024 meme coins selection")


Connected to database: /Volumes/Extreme SSD/DuckDB/solana.duckdb
✅ Connected to DuckDB database
📊 Database: /Volumes/Extreme SSD/DuckDB/solana.duckdb
📋 Table: first_day_trades
🎯 Target: Post-2024 meme coins selection


In [4]:
# Step 1: Find all tokens with first trading activity after 2024-01-01
print("🔍 STEP 1: IDENTIFYING POST-2024 TOKENS")
print("=" * 60)

# Query to get first trade date for each token
query_first_trades = """
SELECT 
    mint,
    MIN(block_timestamp) as first_trade_date,
    MAX(block_timestamp) as last_trade_date,
    COUNT(*) as total_trades,
    COUNT(DISTINCT swapper) as unique_traders,
    SUM(CASE WHEN succeeded THEN 1 ELSE 0 END) as successful_trades
FROM first_day_trades
GROUP BY mint
HAVING MIN(block_timestamp) >= '2024-01-01 00:00:00'
ORDER BY first_trade_date DESC
"""

print("Executing query to find post-2024 tokens...")
post_2024_tokens = analyzer.execute_query(query_first_trades)

if post_2024_tokens is not None and len(post_2024_tokens) > 0:
    print(f"✅ Found {len(post_2024_tokens):,} tokens that started trading after 2024-01-01")
    print(f"📊 Total trades from these tokens: {post_2024_tokens['total_trades'].sum():,}")
    # Note: this sums per-token distinct counts, so a wallet that traded
    # several tokens is counted once per token, not once overall.
    print(f"👥 Total unique traders: {post_2024_tokens['unique_traders'].sum():,}")
    print(f"✅ Success rate: {post_2024_tokens['successful_trades'].sum() / post_2024_tokens['total_trades'].sum():.4f}")
    
    # Show first few examples
    print(f"\n📋 First 10 post-2024 tokens (newest first):")
    for i, row in post_2024_tokens.head(10).iterrows():
        mint_short = truncate_address(row['mint'])
        print(f"{i+1:2d}. {mint_short} | First: {row['first_trade_date'].date()} | "
              f"Trades: {format_large_number(row['total_trades'])} | "
              f"Traders: {format_large_number(row['unique_traders'])}")
else:
    print("❌ No post-2024 tokens found or query failed")


🔍 STEP 1: IDENTIFYING POST-2024 TOKENS
Executing query to find post-2024 tokens...
✅ Found 5,746 tokens that started trading after 2024-01-01
📊 Total trades from these tokens: 324,436,589
👥 Total unique traders: 43,232,161
✅ Success rate: 1.0000

📋 First 10 post-2024 tokens (newest first):
 1. EtmD8Bjd...PUMP | First: 2025-06-15 | Trades: 40.6K | Traders: 164
 2. 5Dh8TMLk...PUMP | First: 2025-06-15 | Trades: 53.8K | Traders: 214
 3. 2bkeRecU...pump | First: 2025-06-15 | Trades: 71.6K | Traders: 222
 4. 9KNtK6b8...pump | First: 2025-06-15 | Trades: 34.4K | Traders: 162
 5. 8J7Tygt5...pump | First: 2025-06-15 | Trades: 41.4K | Traders: 218
 6. 6ixcwjqf...Pump | First: 2025-06-15 | Trades: 36.5K | Traders: 159
 7. 5ykMDwJq...PUMP | First: 2025-06-15 | Trades: 34.2K | Traders: 173
 8. EjQLDvtg...Pump | First: 2025-06-15 | Trades: 59.1K | Traders: 200
 9. EPZEeKAE...PUMP | First: 2025-06-15 | Trades: 51.8K | Traders: 183
10. 39WSXpJs...PUMP | First: 2025-06-15 | Trades: 37.2
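
The "Total unique traders" figure above is a sum of per-token distinct counts, so a wallet active in several tokens contributes once per token. The toy example below (made-up wallets and mints, reusing the `mint` and `swapper` column names from the query) sketches the difference between that sum and a global distinct count.

```python
import pandas as pd

# Toy trades: wallet "w1" trades both tokens
trades = pd.DataFrame({
    "mint": ["A", "A", "B", "B"],
    "swapper": ["w1", "w2", "w1", "w3"],
})

per_token = trades.groupby("mint")["swapper"].nunique()
trader_token_pairs = int(per_token.sum())       # w1 is counted once per token
distinct_traders = trades["swapper"].nunique()  # each wallet counted once
print(trader_token_pairs, distinct_traders)
```

On the real table, the global figure would presumably come from a single `COUNT(DISTINCT swapper)` without the `GROUP BY mint`.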

In [5]:
# Step 2: Analysis of first-day trading patterns (NO FILTERING)
print("\n📊 STEP 2: FIRST-DAY TRADING ANALYSIS")
print("=" * 60)

if post_2024_tokens is not None and len(post_2024_tokens) > 0:
    # Add calculated metrics for analysis
    post_2024_tokens['trading_hours'] = (post_2024_tokens['last_trade_date'] - post_2024_tokens['first_trade_date']).dt.total_seconds() / 3600
    post_2024_tokens['trades_per_trader'] = post_2024_tokens['total_trades'] / post_2024_tokens['unique_traders']
    post_2024_tokens['trades_per_hour'] = post_2024_tokens['total_trades'] / (post_2024_tokens['trading_hours'] + 0.1)  # Avoid division by 0
    
    print("📈 FIRST-DAY TRADING ANALYSIS:")
    print(f"Total post-2024 first-day meme coins: {len(post_2024_tokens):,}")
    print(f"Total first-day trades: {post_2024_tokens['total_trades'].sum():,}")
    print(f"Total unique first-day traders: {post_2024_tokens['unique_traders'].sum():,}")
    
    print(f"\n🔥 TRADING INTENSITY DISTRIBUTION:")
    print(f"  🏆 Most active token: {post_2024_tokens['total_trades'].max():,} trades")
    print(f"  📊 95th percentile: {post_2024_tokens['total_trades'].quantile(0.95):,.0f} trades")
    print(f"  📊 75th percentile: {post_2024_tokens['total_trades'].quantile(0.75):,.0f} trades")
    print(f"  📊 50th percentile: {post_2024_tokens['total_trades'].median():,.0f} trades")
    print(f"  📊 25th percentile: {post_2024_tokens['total_trades'].quantile(0.25):,.0f} trades")
    
    print(f"\n👥 TRADER ENGAGEMENT:")
    print(f"  🏆 Most traders: {post_2024_tokens['unique_traders'].max():,} unique traders")
    print(f"  📊 95th percentile: {post_2024_tokens['unique_traders'].quantile(0.95):,.0f} traders")
    print(f"  📊 Median: {post_2024_tokens['unique_traders'].median():,.0f} traders")
    
    print(f"\n⚡ TRADING INTENSITY:")
    print(f"  🔥 Peak intensity: {post_2024_tokens['trades_per_hour'].max():,.0f} trades/hour")
    print(f"  📊 Median intensity: {post_2024_tokens['trades_per_hour'].median():,.0f} trades/hour")
    
    # No filtering - all tokens are valid first-day opportunities
    print(f"\n✅ ALL TOKENS INCLUDED:")
    print(f"🎯 Strategy: First-day trading opportunities")
    print(f"📋 Dataset: Already filtered to first-day trades only")
    print(f"🚀 Ready for volume analysis of all {len(post_2024_tokens):,} tokens")
    
    # Keep all tokens for next step
    filtered_tokens = post_2024_tokens.copy()
    
else:
    print("❌ No data available for analysis")



📊 STEP 2: FIRST-DAY TRADING ANALYSIS
📈 FIRST-DAY TRADING ANALYSIS:
Total post-2024 first-day meme coins: 5,746
Total first-day trades: 324,436,589
Total unique first-day traders: 43,232,161

🔥 TRADING INTENSITY DISTRIBUTION:
  🏆 Most active token: 884,631 trades
  📊 95th percentile: 185,234 trades
  📊 75th percentile: 71,541 trades
  📊 50th percentile: 36,500 trades
  📊 25th percentile: 11,545 trades

👥 TRADER ENGAGEMENT:
  🏆 Most traders: 161,259 unique traders
  📊 95th percentile: 28,481 traders
  📊 Median: 3,180 traders

⚡ TRADING INTENSITY:
  🔥 Peak intensity: 215,781 trades/hour
  📊 Median intensity: 4,618 trades/hour

✅ ALL TOKENS INCLUDED:
🎯 Strategy: First-day trading opportunities
📋 Dataset: Already filtered to first-day trades only
🚀 Ready for volume analysis of all 5,746 tokens
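
The `trades_per_hour` metric in the cell above adds 0.1 hours to every duration to avoid dividing by zero, which also slightly lowers the rate for long-lived tokens. As a sketch only, the toy example below compares that offset with one possible alternative, flooring the duration at a minimum value; the `MIN_HOURS` constant is an assumption, not something the notebook defines.

```python
import pandas as pd

tokens = pd.DataFrame({
    "total_trades": [100, 100, 2400],
    "trading_hours": [0.0, 0.05, 24.0],
})

# Offset used in the cell above: adds 0.1 h to every duration
offset_rate = tokens["total_trades"] / (tokens["trading_hours"] + 0.1)

# Alternative (assumption): floor the duration at 0.1 h instead
MIN_HOURS = 0.1  # hypothetical floor
floored_rate = tokens["total_trades"] / tokens["trading_hours"].clip(lower=MIN_HOURS)

print(pd.DataFrame({"offset": offset_rate, "floored": floored_rate}).round(1))
```

With the floor, a token that traded for a full 24 hours keeps its exact rate, while very short-lived tokens are still protected from a zero denominator.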


In [None]:
# Export coin addresses for DexPaprika API lookup
print("\n💾 EXPORTING COIN ADDRESSES FOR DEXPAPRIKA LOOKUP")
print("=" * 60)

if post_2024_tokens is not None and len(post_2024_tokens) > 0:
    # Create a simple DataFrame with coin addresses and basic info for DexPaprika lookup
    coin_export_df = pd.DataFrame({
        'mint_address': post_2024_tokens['mint'],
        'first_trade_date': post_2024_tokens['first_trade_date'],
        'total_trades': post_2024_tokens['total_trades'],
        'unique_traders': post_2024_tokens['unique_traders'],
        'successful_trades': post_2024_tokens['successful_trades']
    })
    
    # Sort by total trades (most active first) for easier analysis
    coin_export_df = coin_export_df.sort_values('total_trades', ascending=False).reset_index(drop=True)
    
    # Save to CSV
    csv_filename = '../post2024_meme_coins_for_dexpaprika.csv'
    coin_export_df.to_csv(csv_filename, index=False)
    
    print(f"✅ Exported {len(coin_export_df):,} coin addresses to:")
    print(f"   📄 {csv_filename}")
    print(f"   📊 Sorted by trading activity (most active first)")
    print(f"   🔍 Ready for DexPaprika API enrichment")
    
    # Show preview of what was exported
    print(f"\n📋 PREVIEW OF EXPORTED DATA:")
    print("Top 10 most active coins:")
    for i, row in coin_export_df.head(10).iterrows():
        mint_short = truncate_address(row['mint_address'])
        print(f"{i+1:2d}. {mint_short} | {row['first_trade_date'].date()} | "
              f"{format_large_number(row['total_trades'])} trades | "
              f"{format_large_number(row['unique_traders'])} traders")
    
    print(f"\n💡 NEXT STEPS:")
    print(f"   1. Use the exported CSV with DexPaprika API")
    print(f"   2. Get token metadata (name, symbol, description)")
    print(f"   3. Merge back with trading data for enriched analysis")
    
else:
    print("❌ No post-2024 tokens available for export")
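
The third next step, merging metadata back onto the trading data, could look roughly like the sketch below. The metadata frame is a made-up stand-in for whatever the enrichment step returns; only `mint_address`, `symbol` and `api_status` are taken from column names that appear later in this notebook.

```python
import pandas as pd

trading = pd.DataFrame({
    "mint_address": ["A", "B", "C"],
    "total_trades": [500, 300, 100],
})
# Hypothetical metadata as it might come back from an enrichment step
metadata = pd.DataFrame({
    "mint_address": ["A", "B"],
    "symbol": ["AAA", "BBB"],
    "api_status": ["success", "success"],
})

merged = trading.merge(
    metadata,
    on="mint_address",
    how="left",
    validate="one_to_one",  # raises if either side has duplicate mints
    indicator=True,         # adds a _merge column: both / left_only
)
missing = merged.loc[merged["_merge"] == "left_only", "mint_address"].tolist()
print(f"{len(merged)} rows, missing metadata for: {missing}")
```

The `validate` and `indicator` arguments are optional; they are included here because they make duplicate keys and unmatched tokens visible at merge time.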


In [6]:
# CORRECTED Step 3: Get SOL volume data for comprehensive analysis
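print("\n💰 CORRECTED STEP 3: SOL VOLUME ANALYSIS")
print("=" * 60)
# Reload the token list from the DexPaprika-enriched CSV. This replaces the
# in-memory frame from Step 1 and adds metadata columns such as api_status and fdv.
post_2024_tokens = pd.read_csv('../post2024_meme_coins_enriched.csv')
# Corrected Query - uses actual column names from table inspection
# SOL mint address: So11111111111111111111111111111111111111112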
volume_query_corrected = """
SELECT 
    mint as mint_address,
    COUNT(*) as total_trades,
    COUNT(DISTINCT swapper) as unique_traders,
    SUM(CASE WHEN succeeded THEN 1 ELSE 0 END) as successful_trades,
    -- Calculate SOL volume: when SOL is being sold (swap_from) or bought (swap_to)
    SUM(CASE 
        WHEN succeeded AND swap_from_mint = 'So11111111111111111111111111111111111111112' THEN swap_from_amount 
        WHEN succeeded AND swap_to_mint = 'So11111111111111111111111111111111111111112' THEN swap_to_amount 
        ELSE 0 
    END) as total_sol_volume,
    AVG(CASE 
        WHEN succeeded AND swap_from_mint = 'So11111111111111111111111111111111111111112' THEN swap_from_amount 
        WHEN succeeded AND swap_to_mint = 'So11111111111111111111111111111111111111112' THEN swap_to_amount 
        ELSE NULL 
    END) as avg_sol_per_trade,
    MAX(CASE 
        WHEN succeeded AND swap_from_mint = 'So11111111111111111111111111111111111111112' THEN swap_from_amount 
        WHEN succeeded AND swap_to_mint = 'So11111111111111111111111111111111111111112' THEN swap_to_amount 
        ELSE 0 
    END) as max_sol_trade,
    MIN(block_timestamp) as first_trade,
    MAX(block_timestamp) as last_trade,
    -- Count SOL pair trades
    SUM(CASE 
        WHEN (swap_from_mint = mint AND swap_to_mint = 'So11111111111111111111111111111111111111112') 
          OR (swap_to_mint = mint AND swap_from_mint = 'So11111111111111111111111111111111111111112') 
        THEN 1 
        ELSE 0 
    END) as sol_pair_trades,
    -- Count non-SOL pair trades
    SUM(CASE 
        WHEN (swap_from_mint = mint AND swap_to_mint != 'So11111111111111111111111111111111111111112') 
          OR (swap_to_mint = mint AND swap_from_mint != 'So11111111111111111111111111111111111111112') 
        THEN 1 
        ELSE 0 
    END) as non_sol_pair_trades,
    -- Analyze paired tokens (what tokens are being traded against the mint)
    COUNT(DISTINCT CASE 
        WHEN swap_from_mint = mint THEN swap_to_mint 
        WHEN swap_to_mint = mint THEN swap_from_mint 
        ELSE NULL 
    END) as unique_pairs
FROM first_day_trades 
WHERE block_timestamp >= '2024-01-01 00:00:00'
GROUP BY mint
HAVING COUNT(*) > 0
ORDER BY total_sol_volume DESC
"""

print("Executing CORRECTED SOL volume analysis query...")
volume_data = analyzer.execute_query(volume_query_corrected)



if volume_data is not None and len(volume_data) > 0:
    print(f"✅ Got volume data for {len(volume_data):,} tokens")
    
    # Merge with our post-2024 tokens data
    enriched_tokens = post_2024_tokens.merge(volume_data, on='mint_address', how='left', suffixes=('', '_vol'))
    
    print(f"\n💎 SOL VOLUME INSIGHTS:")
    print(f"  🏆 Highest SOL volume: {enriched_tokens['total_sol_volume'].max():,.2f} SOL")
    print(f"  📊 Total SOL volume (all tokens): {enriched_tokens['total_sol_volume'].sum():,.2f} SOL")
    print(f"  📊 Median SOL volume: {enriched_tokens['total_sol_volume'].median():,.2f} SOL")
    print(f"  📊 Average trade size: {enriched_tokens['avg_sol_per_trade'].mean():,.4f} SOL")
    
    # Show SOL volume distribution
    print(f"\n📊 SOL VOLUME DISTRIBUTION:")
    print(f"  🏆 Top 1% by volume: {enriched_tokens['total_sol_volume'].quantile(0.99):,.2f} SOL")
    print(f"  📊 95th percentile: {enriched_tokens['total_sol_volume'].quantile(0.95):,.2f} SOL")
    print(f"  📊 75th percentile: {enriched_tokens['total_sol_volume'].quantile(0.75):,.2f} SOL")
    print(f"  📊 50th percentile: {enriched_tokens['total_sol_volume'].median():,.2f} SOL")
    
else:
    print("❌ Failed to get volume data")
    enriched_tokens = post_2024_tokens.copy()



💰 CORRECTED STEP 3: SOL VOLUME ANALYSIS
Executing CORRECTED SOL volume analysis query...
✅ Got volume data for 5,746 tokens

💎 SOL VOLUME INSIGHTS:
  🏆 Highest SOL volume: 4,324,645.32 SOL
  📊 Total SOL volume (all tokens): 719,816,538.38 SOL
  📊 Median SOL volume: 93,425.68 SOL
  📊 Average trade size: 2.9564 SOL

📊 SOL VOLUME DISTRIBUTION:
  🏆 Top 1% by volume: 802,568.26 SOL
  📊 95th percentile: 378,664.40 SOL
  📊 75th percentile: 158,971.58 SOL
  📊 50th percentile: 93,425.68 SOL
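
The `CASE` expression behind `total_sol_volume` takes the SOL leg of each successful swap, whichever side it sits on, and counts everything else as zero. The sketch below reproduces that logic in pandas on four made-up trades so it can be checked by hand; the amounts and the `"OTHER"` mint are placeholders, not real data.

```python
import numpy as np
import pandas as pd

SOL_MINT = "So11111111111111111111111111111111111111112"

trades = pd.DataFrame({
    "mint": ["T1", "T1", "T1", "T1"],
    "succeeded": [True, True, False, True],
    "swap_from_mint": [SOL_MINT, "T1", SOL_MINT, "T1"],
    "swap_to_mint": ["T1", SOL_MINT, "T1", "OTHER"],
    "swap_from_amount": [2.0, 1000.0, 5.0, 500.0],
    "swap_to_amount": [900.0, 1.5, 0.0, 40.0],
})

ok = trades["succeeded"]
conditions = [
    ok & (trades["swap_from_mint"] == SOL_MINT),  # SOL spent to buy the token
    ok & (trades["swap_to_mint"] == SOL_MINT),    # SOL received for the token
]
choices = [trades["swap_from_amount"], trades["swap_to_amount"]]
# First matching condition wins, like SQL CASE; unmatched rows get 0
trades["sol_amount"] = np.select(conditions, choices, default=0.0)

sol_volume = trades.groupby("mint")["sol_amount"].sum()
print(sol_volume)
```

The failed swap and the non-SOL swap both contribute zero, so only the first two rows count toward the total.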


In [7]:
# Step 4: Comprehensive Analysis of Enriched Tokens
print("\n🔍 STEP 4: COMPREHENSIVE ENRICHED TOKENS ANALYSIS")
print("=" * 70)

if 'enriched_tokens' in locals() and len(enriched_tokens) > 0:
    
    # 1. DATASET OVERVIEW
    print("📊 1. DATASET OVERVIEW:")
    print(f"   Total tokens analyzed: {len(enriched_tokens):,}")
    print(f"   Total trades: {enriched_tokens['total_trades'].sum():,}")
    print(f"   Total unique traders: {enriched_tokens['unique_traders'].sum():,}")
    print(f"   Total SOL volume: {enriched_tokens['total_sol_volume'].sum():,.2f} SOL")
    
    # 2. DEXPAPRIKA ENRICHMENT SUCCESS
    print(f"\n🎯 2. DEXPAPRIKA ENRICHMENT SUCCESS:")
    successful_enrichment = enriched_tokens['api_status'] == 'success'
    success_count = successful_enrichment.sum()
    success_rate = success_count / len(enriched_tokens) * 100
    
    print(f"   Successfully enriched: {success_count:,} tokens ({success_rate:.1f}%)")
    print(f"   Failed enrichment: {len(enriched_tokens) - success_count:,} tokens ({100-success_rate:.1f}%)")
    
    # Analyze successful vs failed enrichment
    if success_count > 0:
        successful_tokens = enriched_tokens[successful_enrichment]
        failed_tokens = enriched_tokens[~successful_enrichment]
        
        print(f"   Success vs Failed comparison:")
        print(f"     Avg trades - Success: {successful_tokens['total_trades'].mean():,.0f} | Failed: {failed_tokens['total_trades'].mean():,.0f}")
        print(f"     Avg SOL volume - Success: {successful_tokens['total_sol_volume'].mean():,.0f} | Failed: {failed_tokens['total_sol_volume'].mean():,.0f}")
    
    # 3. PRICE AND MARKET CAP ANALYSIS (for successfully enriched tokens)
    if success_count > 0:
        print(f"\n💰 3. PRICE & MARKET CAP ANALYSIS (Successfully Enriched Tokens):")
        
        # Filter valid price data
        valid_price = successful_tokens['price_usd'] > 0
        valid_fdv = successful_tokens['fdv'] > 0
        valid_liquidity = successful_tokens['liquidity_usd'] > 0
        
        price_tokens = successful_tokens[valid_price]
        fdv_tokens = successful_tokens[valid_fdv]
        liquidity_tokens = successful_tokens[valid_liquidity]
        
        print(f"   Tokens with valid price: {len(price_tokens):,} ({len(price_tokens)/success_count:.1%})")
        print(f"   Tokens with valid FDV: {len(fdv_tokens):,} ({len(fdv_tokens)/success_count:.1%})")
        print(f"   Tokens with valid liquidity: {len(liquidity_tokens):,} ({len(liquidity_tokens)/success_count:.1%})")
        
        if len(price_tokens) > 0:
            print(f"\n   💵 PRICE DISTRIBUTION:")
            print(f"     Highest price: ${price_tokens['price_usd'].max():.8f}")
            print(f"     Median price: ${price_tokens['price_usd'].median():.8f}")
            print(f"     Lowest price: ${price_tokens['price_usd'].min():.10f}")
            
        if len(fdv_tokens) > 0:
            print(f"\n   📈 FULLY DILUTED VALUATION:")
            print(f"     Highest FDV: ${fdv_tokens['fdv'].max():,.0f}")
            print(f"     Median FDV: ${fdv_tokens['fdv'].median():,.0f}")
            print(f"     Lowest FDV: ${fdv_tokens['fdv'].min():,.0f}")
            
        if len(liquidity_tokens) > 0:
            print(f"\n   💧 LIQUIDITY ANALYSIS:")
            print(f"     Highest liquidity: ${liquidity_tokens['liquidity_usd'].max():,.0f}")
            print(f"     Median liquidity: ${liquidity_tokens['liquidity_usd'].median():,.0f}")
            print(f"     Low liquidity (<$1K): {(liquidity_tokens['liquidity_usd'] < 1000).sum():,} tokens")
    
    # 4. SOL vs NON-SOL TRADING PATTERNS
    print(f"\n🔗 4. SOL vs NON-SOL TRADING PATTERNS:")
    
    # Calculate totals
    total_sol_trades = enriched_tokens['sol_pair_trades'].sum()
    total_non_sol_trades = enriched_tokens['non_sol_pair_trades'].sum()
    total_all_trades = total_sol_trades + total_non_sol_trades
    
    print(f"   Total SOL pair trades: {total_sol_trades:,} ({total_sol_trades/total_all_trades:.1%})")
    print(f"   Total non-SOL pair trades: {total_non_sol_trades:,} ({total_non_sol_trades/total_all_trades:.1%})")
    
    # Analyze tokens with significant non-SOL trading
    enriched_tokens['non_sol_ratio'] = enriched_tokens['non_sol_pair_trades'] / (enriched_tokens['sol_pair_trades'] + enriched_tokens['non_sol_pair_trades'])
    high_non_sol = enriched_tokens[enriched_tokens['non_sol_ratio'] > 0.1]  # >10% non-SOL trades
    
    print(f"   Tokens with >10% non-SOL trades: {len(high_non_sol):,} ({len(high_non_sol)/len(enriched_tokens):.1%})")
    print(f"   Tokens with >50% non-SOL trades: {(enriched_tokens['non_sol_ratio'] > 0.5).sum():,}")
    
    # 5. TRADING PAIR DIVERSITY
    print(f"\n🌐 5. TRADING PAIR DIVERSITY:")
    print(f"   Average unique pairs per token: {enriched_tokens['unique_pairs'].mean():.1f}")
    print(f"   Median unique pairs per token: {enriched_tokens['unique_pairs'].median():.0f}")
    print(f"   Most diverse token: {enriched_tokens['unique_pairs'].max()} unique pairs")
    
    # Tokens with high pair diversity
    diverse_tokens = enriched_tokens[enriched_tokens['unique_pairs'] >= 5]
    print(f"   Tokens with 5+ trading pairs: {len(diverse_tokens):,} ({len(diverse_tokens)/len(enriched_tokens):.1%})")
    
    # 6. PUMP.FUN vs OTHER TOKENS
    print(f"\n🚀 6. PUMP.FUN vs OTHER TOKENS ANALYSIS:")
    
    # Identify pump.fun tokens
    enriched_tokens['is_pumpfun'] = enriched_tokens['mint_address'].str.lower().str.endswith('pump')
    pumpfun_tokens = enriched_tokens[enriched_tokens['is_pumpfun']]
    other_tokens = enriched_tokens[~enriched_tokens['is_pumpfun']]
    
    print(f"   Pump.fun tokens: {len(pumpfun_tokens):,} ({len(pumpfun_tokens)/len(enriched_tokens):.1%})")
    print(f"   Other tokens: {len(other_tokens):,} ({len(other_tokens)/len(enriched_tokens):.1%})")
    
    if len(pumpfun_tokens) > 0 and len(other_tokens) > 0:
        print(f"\n   📊 PUMP.FUN vs OTHERS COMPARISON:")
        print(f"     Avg trades - Pump.fun: {pumpfun_tokens['total_trades'].mean():,.0f} | Others: {other_tokens['total_trades'].mean():,.0f}")
        print(f"     Avg SOL volume - Pump.fun: {pumpfun_tokens['total_sol_volume'].mean():,.0f} | Others: {other_tokens['total_sol_volume'].mean():,.0f}")
        print(f"     Avg unique pairs - Pump.fun: {pumpfun_tokens['unique_pairs'].mean():.1f} | Others: {other_tokens['unique_pairs'].mean():.1f}")
        print(f"     Enrichment success - Pump.fun: {(pumpfun_tokens['api_status'] == 'success').mean():.1%} | Others: {(other_tokens['api_status'] == 'success').mean():.1%}")
    
    # 7. TOP PERFORMERS ANALYSIS
    print(f"\n🏆 7. TOP PERFORMERS ANALYSIS:")
    
    # Top by different metrics
    top_by_volume = enriched_tokens.nlargest(10, 'total_sol_volume')
    top_by_trades = enriched_tokens.nlargest(10, 'total_trades')
    top_by_traders = enriched_tokens.nlargest(10, 'unique_traders')
    
    print(f"\n   🥇 TOP 5 BY SOL VOLUME:")
    for i, (_, row) in enumerate(top_by_volume.head(5).iterrows(), 1):
        name_display = row.get('symbol', 'Unknown') if pd.notna(row.get('symbol')) else truncate_address(row['mint_address'])
        pumpfun_indicator = "🚀" if row['is_pumpfun'] else "💎"
        print(f"   {i}. {pumpfun_indicator} {name_display:<12} | {row['total_sol_volume']:>10,.0f} SOL | {row['total_trades']:>8,} trades")
    
    print(f"\n   🥈 TOP 5 BY TRADE COUNT:")
    for i, (_, row) in enumerate(top_by_trades.head(5).iterrows(), 1):
        name_display = row.get('symbol', 'Unknown') if pd.notna(row.get('symbol')) else truncate_address(row['mint_address'])
        pumpfun_indicator = "🚀" if row['is_pumpfun'] else "💎"
        print(f"   {i}. {pumpfun_indicator} {name_display:<12} | {row['total_trades']:>8,} trades | {row['unique_traders']:>6,} traders")
    
    print(f"\n   🥉 TOP 5 BY TRADER COUNT:")
    for i, (_, row) in enumerate(top_by_traders.head(5).iterrows(), 1):
        name_display = row.get('symbol', 'Unknown') if pd.notna(row.get('symbol')) else truncate_address(row['mint_address'])
        pumpfun_indicator = "🚀" if row['is_pumpfun'] else "💎"
        print(f"   {i}. {pumpfun_indicator} {name_display:<12} | {row['unique_traders']:>6,} traders | {row['total_sol_volume']:>10,.0f} SOL")
    
    # 8. TRADING EFFICIENCY METRICS
    print(f"\n⚡ 8. TRADING EFFICIENCY METRICS:")
    
    # Calculate efficiency metrics
    enriched_tokens['sol_per_trader'] = enriched_tokens['total_sol_volume'] / enriched_tokens['unique_traders']
    enriched_tokens['trades_per_trader'] = enriched_tokens['total_trades'] / enriched_tokens['unique_traders']
    
    print(f"   Average SOL volume per trader: {enriched_tokens['sol_per_trader'].mean():,.2f} SOL")
    print(f"   Average trades per trader: {enriched_tokens['trades_per_trader'].mean():.1f}")
    print(f"   Average SOL per trade: {enriched_tokens['avg_sol_per_trade'].mean():.4f} SOL")
    
    # High-efficiency tokens
    high_efficiency = enriched_tokens[enriched_tokens['sol_per_trader'] > enriched_tokens['sol_per_trader'].quantile(0.95)]
    print(f"   High-efficiency tokens (top 5% SOL/trader): {len(high_efficiency):,}")
    
    print(f"\n✅ COMPREHENSIVE ANALYSIS COMPLETE")
    print(f"🎯 Dataset ready for advanced signal generation and trading strategy development")
    
else:
    print("❌ No enriched token data available for analysis")



🔍 STEP 4: COMPREHENSIVE ENRICHED TOKENS ANALYSIS
📊 1. DATASET OVERVIEW:
   Total tokens analyzed: 5,746
   Total trades: 374,886,380
   Total unique traders: 43,232,161
   Total SOL volume: 719,816,538.38 SOL

🎯 2. DEXPAPRIKA ENRICHMENT SUCCESS:
   Successfully enriched: 5,739 tokens (99.9%)
   Failed enrichment: 7 tokens (0.1%)
   Success vs Failed comparison:
     Avg trades - Success: 65,247 | Failed: 62,012
     Avg SOL volume - Success: 125,240 | Failed: 151,977

💰 3. PRICE & MARKET CAP ANALYSIS (Successfully Enriched Tokens):
   Tokens with valid price: 2,363 (41.2%)
   Tokens with valid FDV: 2,363 (41.2%)
   Tokens with valid liquidity: 2,362 (41.2%)

   💵 PRICE DISTRIBUTION:
     Highest price: $107320.05531353
     Median price: $0.00006150
     Lowest price: $0.0000000000

   📈 FULLY DILUTED VALUATION:
     Highest FDV: $869,098,992
     Median FDV: $64,737
     Lowest FDV: $0

   💧 LIQUIDITY ANALYSIS:
     Highest liquidity: $22,920,349
     Median liqu
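
The `non_sol_ratio` computed in section 4 divides by the sum of SOL and non-SOL pair trades, which can be zero for a token with no recorded pair trades. The toy example below is a sketch of one way to make that case explicit; the numbers are made up.

```python
import numpy as np
import pandas as pd

tokens = pd.DataFrame({
    "sol_pair_trades": [95, 0, 10],
    "non_sol_pair_trades": [5, 0, 30],
})

pair_trades = tokens["sol_pair_trades"] + tokens["non_sol_pair_trades"]
# Replace a zero denominator with NaN so the ratio is NaN rather than inf
tokens["non_sol_ratio"] = tokens["non_sol_pair_trades"] / pair_trades.replace(0, np.nan)

# NaN > 0.1 evaluates to False, so tokens without pair trades drop out of the count
high_non_sol = int((tokens["non_sol_ratio"] > 0.1).sum())
print(tokens["non_sol_ratio"].tolist(), high_non_sol)
```

Because comparisons against NaN are false, such tokens are silently excluded from the ">10% non-SOL" count while still being part of the denominator `len(enriched_tokens)` used for the percentage.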

In [8]:
enriched_tokens.columns

Index(['mint_address', 'first_trade_date', 'total_trades', 'unique_traders',
       'successful_trades', 'name', 'symbol', 'total_supply', 'price_usd',
       'fdv', 'liquidity_usd', 'api_status', 'total_trades_vol',
       'unique_traders_vol', 'successful_trades_vol', 'total_sol_volume',
       'avg_sol_per_trade', 'max_sol_trade', 'first_trade', 'last_trade',
       'sol_pair_trades', 'non_sol_pair_trades', 'unique_pairs',
       'non_sol_ratio', 'is_pumpfun', 'sol_per_trader', 'trades_per_trader'],
      dtype='object')
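
The `_vol` columns in this listing come from the `suffixes=('', '_vol')` argument of the Step 3 merge: columns present in both frames keep their name on the left side and get the suffix on the right side. The sketch below shows the mechanism on made-up data and how the paired columns could be used to check whether the CSV and the query agree.

```python
import pandas as pd

left = pd.DataFrame({"mint_address": ["A", "B"], "total_trades": [500, 300]})
right = pd.DataFrame({
    "mint_address": ["A", "B"],
    "total_trades": [510, 300],
    "total_sol_volume": [12.5, 7.0],
})

merged = left.merge(right, on="mint_address", how="left", suffixes=("", "_vol"))
print(merged.columns.tolist())

# Rows where the two sources disagree on the trade count
mismatch = merged[merged["total_trades"] != merged["total_trades_vol"]]
print(mismatch["mint_address"].tolist())
```

A check of this kind might help explain why the trade total reported in Step 4 differs from the one reported in Step 1.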

In [9]:
# Step 5: FDV Distribution Analysis for Success Definition
print("\n📊 STEP 5: FDV DISTRIBUTION ANALYSIS")
print("=" * 60)

if 'enriched_tokens' in locals() and len(enriched_tokens) > 0:
    
    # Filter tokens with valid FDV data
    valid_fdv_tokens = enriched_tokens[
        (enriched_tokens['api_status'] == 'success') & 
        (enriched_tokens['fdv'] > 0) & 
        (enriched_tokens['fdv'].notna())
    ].copy()
    
    print(f"📈 FDV DATA OVERVIEW:")
    print(f"   Total tokens: {len(enriched_tokens):,}")
    print(f"   Tokens with valid FDV: {len(valid_fdv_tokens):,} ({len(valid_fdv_tokens)/len(enriched_tokens):.1%})")
    print(f"   Tokens without FDV: {len(enriched_tokens) - len(valid_fdv_tokens):,}")
    
    if len(valid_fdv_tokens) > 0:
        # Basic FDV statistics
        fdv_stats = valid_fdv_tokens['fdv'].describe()
        
        print(f"\n💰 FDV DISTRIBUTION STATISTICS:")
        print(f"   Count: {fdv_stats['count']:,.0f}")
        print(f"   Mean: ${fdv_stats['mean']:,.0f}")
        print(f"   Median: ${fdv_stats['50%']:,.0f}")
        print(f"   Standard Deviation: ${fdv_stats['std']:,.0f}")
        print(f"   Minimum: ${fdv_stats['min']:,.0f}")
        print(f"   Maximum: ${fdv_stats['max']:,.0f}")
        
        # Detailed percentile analysis
        percentiles = [0.1, 0.25, 0.5, 0.75, 0.9, 0.95, 0.99]
        print(f"\n📊 FDV PERCENTILE BREAKDOWN:")
        for p in percentiles:
            value = valid_fdv_tokens['fdv'].quantile(p)
            print(f"   {p*100:4.1f}th percentile: ${value:>12,.0f}")
        
        # Count tokens in different ranges
        fdv_ranges = [
            (0, 1000, "< $1K"),
            (1000, 10000, "$1K - $10K"), 
            (10000, 100000, "$10K - $100K"),
            (100000, 1000000, "$100K - $1M"),
            (1000000, 10000000, "$1M - $10M"),
            (10000000, 100000000, "$10M - $100M"),
            (100000000, float('inf'), "> $100M")
        ]
        
        print(f"\n🎯 FDV RANGE DISTRIBUTION:")
        for min_val, max_val, label in fdv_ranges:
            if max_val == float('inf'):
                count = (valid_fdv_tokens['fdv'] >= min_val).sum()
            else:
                count = ((valid_fdv_tokens['fdv'] >= min_val) & (valid_fdv_tokens['fdv'] < max_val)).sum()
            pct = count / len(valid_fdv_tokens) * 100
            print(f"   {label:<15}: {count:>4,} tokens ({pct:>5.1f}%)")
        
        # Analyze log distribution for better understanding (np is imported in the setup cell)
        valid_fdv_tokens['log_fdv'] = np.log10(valid_fdv_tokens['fdv'])
        
        print(f"\n📈 LOG10(FDV) DISTRIBUTION:")
        log_stats = valid_fdv_tokens['log_fdv'].describe()
        print(f"   Mean Log10(FDV): {log_stats['mean']:.2f} (${10**log_stats['mean']:,.0f})")
        print(f"   Median Log10(FDV): {log_stats['50%']:.2f} (${10**log_stats['50%']:,.0f})")
        print(f"   Std Log10(FDV): {log_stats['std']:.2f}")
        
        # Identify natural breakpoints based on distribution
        q25 = valid_fdv_tokens['fdv'].quantile(0.25)
        q50 = valid_fdv_tokens['fdv'].quantile(0.50) 
        q75 = valid_fdv_tokens['fdv'].quantile(0.75)
        q90 = valid_fdv_tokens['fdv'].quantile(0.90)
        q95 = valid_fdv_tokens['fdv'].quantile(0.95)
        
        print(f"\n🎯 SUGGESTED SUCCESS TIERS (Based on Data Distribution):")
        print(f"   🚀 Elite Success    (Top 5%):     FDV > ${q95:,.0f}")
        print(f"   💎 High Success    (Top 10%):    FDV > ${q90:,.0f}")
        print(f"   ⭐ Good Success    (Top 25%):    FDV > ${q75:,.0f}")
        print(f"   📈 Moderate Success (Top 50%):   FDV > ${q50:,.0f}")
        print(f"   📉 Below Average   (Bottom 50%): FDV <= ${q50:,.0f}")
        
        # Count tokens in each tier
        elite_count = (valid_fdv_tokens['fdv'] > q95).sum()
        high_count = ((valid_fdv_tokens['fdv'] > q90) & (valid_fdv_tokens['fdv'] <= q95)).sum()
        good_count = ((valid_fdv_tokens['fdv'] > q75) & (valid_fdv_tokens['fdv'] <= q90)).sum()
        moderate_count = ((valid_fdv_tokens['fdv'] > q50) & (valid_fdv_tokens['fdv'] <= q75)).sum()
        below_count = (valid_fdv_tokens['fdv'] <= q50).sum()
        
        print(f"\n📊 TIER DISTRIBUTION:")
        print(f"   🚀 Elite Success:    {elite_count:>4,} tokens ({elite_count/len(valid_fdv_tokens)*100:>5.1f}%)")
        print(f"   💎 High Success:     {high_count:>4,} tokens ({high_count/len(valid_fdv_tokens)*100:>5.1f}%)")
        print(f"   ⭐ Good Success:     {good_count:>4,} tokens ({good_count/len(valid_fdv_tokens)*100:>5.1f}%)")
        print(f"   📈 Moderate Success: {moderate_count:>4,} tokens ({moderate_count/len(valid_fdv_tokens)*100:>5.1f}%)")
        print(f"   📉 Below Average:    {below_count:>4,} tokens ({below_count/len(valid_fdv_tokens)*100:>5.1f}%)")
        
        # Store the thresholds for next analysis
        success_thresholds = {
            'elite': q95,
            'high': q90, 
            'good': q75,
            'moderate': q50
        }
        
        print(f"\n✅ SUCCESS TIERS DEFINED BASED ON DATA DISTRIBUTION")
        print(f"🎯 Ready for success correlation analysis")
        
    else:
        print("❌ No valid FDV data available")
        
else:
    print("❌ No enriched token data available")



📊 STEP 5: FDV DISTRIBUTION ANALYSIS
📈 FDV DATA OVERVIEW:
   Total tokens: 5,746
   Tokens with valid FDV: 2,363 (41.1%)
   Tokens without FDV: 3,383

💰 FDV DISTRIBUTION STATISTICS:
   Count: 2,363
   Mean: $6,202,711
   Median: $64,737
   Standard Deviation: $46,310,079
   Minimum: $0
   Maximum: $869,098,992

📊 FDV PERCENTILE BREAKDOWN:
   10.0th percentile: $      11,127
   25.0th percentile: $      19,937
   50.0th percentile: $      64,737
   75.0th percentile: $     474,663
   90.0th percentile: $   3,102,274
   95.0th percentile: $  11,732,301
   99.0th percentile: $ 141,977,935

🎯 FDV RANGE DISTRIBUTION:
   < $1K          :   22 tokens (  0.9%)
   $1K - $10K     :  171 tokens (  7.2%)
   $10K - $100K   : 1,158 tokens ( 49.0%)
   $100K - $1M    :  579 tokens ( 24.5%)
   $1M - $10M     :  305 tokens ( 12.9%)
   $10M - $100M   :   91 tokens (  3.9%)
   > $100M        :   37 tokens (  1.6%)

📈 LOG10(FDV) DISTRIBUTION:
   Mean Log10(FDV): 5.06 ($115,428)
   Median 
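
The success tiers in Step 5 are defined by chained comparisons against the 50th, 75th, 90th and 95th percentiles. As a sketch of an alternative (the notebook does not do this), the same tiers can be assigned in one pass with `pd.cut`. The FDV values and tier label strings below are made up for illustration.

```python
import pandas as pd

fdv = pd.Series([1e3, 2e4, 5e4, 8e4, 3e5, 9e5, 4e6, 2e7, 6e7, 9e8], name="fdv")

quantile_points = [0.50, 0.75, 0.90, 0.95]
thresholds = fdv.quantile(quantile_points).tolist()
bins = [float("-inf"), *thresholds, float("inf")]
labels = ["below_average", "moderate", "good", "high", "elite"]

# right=True (the default) makes each bin (low, high], matching the
# `> q` / `<= q` comparisons used for the tier counts
tier = pd.cut(fdv, bins=bins, labels=labels)
print(tier.value_counts().reindex(labels))
```

Storing the tier as a categorical column would also let later steps group by tier directly instead of recomputing the boolean masks.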

In [10]:
# Step 6: First-Day Trading Metrics vs FDV/Liquidity Correlation Analysis
print("\n📊 STEP 6: CORRELATION ANALYSIS - FIRST DAY METRICS vs FDV/LIQUIDITY")
print("=" * 80)

if 'enriched_tokens' in locals() and len(enriched_tokens) > 0:
    
    # Filter tokens with valid FDV and liquidity data
    analysis_tokens = enriched_tokens[
        (enriched_tokens['api_status'] == 'success') & 
        (enriched_tokens['fdv'] > 0) & 
        (enriched_tokens['fdv'].notna()) &
        (enriched_tokens['liquidity_usd'] > 0) & 
        (enriched_tokens['liquidity_usd'].notna())
    ].copy()
    
    print(f"📈 ANALYSIS DATASET:")
    print(f"   Total tokens: {len(enriched_tokens):,}")
    print(f"   Tokens with valid FDV & Liquidity: {len(analysis_tokens):,} ({len(analysis_tokens)/len(enriched_tokens):.1%})")
    
    if len(analysis_tokens) > 0:
        
        # Create log-scale versions of target variables
        import numpy as np
        analysis_tokens['log_fdv'] = np.log10(analysis_tokens['fdv'])
        analysis_tokens['log_liquidity'] = np.log10(analysis_tokens['liquidity_usd'])
        
        # Calculate additional first-day metrics if not already present
        if 'trading_hours' not in analysis_tokens.columns:
            analysis_tokens['trading_hours'] = (analysis_tokens['last_trade'] - analysis_tokens['first_trade']).dt.total_seconds() / 3600
        if 'trades_per_hour' not in analysis_tokens.columns:
            analysis_tokens['trades_per_hour'] = analysis_tokens['total_trades'] / (analysis_tokens['trading_hours'] + 0.1)
        if 'sol_per_trader' not in analysis_tokens.columns:
            analysis_tokens['sol_per_trader'] = analysis_tokens['total_sol_volume'] / analysis_tokens['unique_traders']
        if 'trades_per_trader' not in analysis_tokens.columns:
            analysis_tokens['trades_per_trader'] = analysis_tokens['total_trades'] / analysis_tokens['unique_traders']
        
        # Define first-day metrics for correlation analysis
        first_day_metrics = {
            'total_trades': 'Total Trades',
            'unique_traders': 'Unique Traders',
            'total_sol_volume': 'Total SOL Volume',
            'avg_sol_per_trade': 'Avg SOL per Trade',
            'sol_per_trader': 'SOL per Trader',
            'trades_per_trader': 'Trades per Trader',
            'trades_per_hour': 'Trading Intensity (trades/hour)',
            'unique_pairs': 'Unique Trading Pairs',
            'sol_pair_trades': 'SOL Pair Trades',
            'non_sol_pair_trades': 'Non-SOL Pair Trades'
        }
        
        # Calculate correlations with log(FDV)
        print(f"\n🎯 CORRELATION WITH LOG10(FDV):")
        fdv_correlations = []
        for metric, label in first_day_metrics.items():
            if metric in analysis_tokens.columns:
                # Filter out any infinite or NaN values
                valid_data = analysis_tokens[
                    (analysis_tokens[metric].notna()) & 
                    (np.isfinite(analysis_tokens[metric])) &
                    (analysis_tokens[metric] > 0)
                ]
                if len(valid_data) > 10:  # Need sufficient data points
                    corr = valid_data[metric].corr(valid_data['log_fdv'])
                    fdv_correlations.append((metric, label, corr, len(valid_data)))
                    print(f"   {label:<25}: {corr:>7.3f} (n={len(valid_data):,})")
        
        # Calculate correlations with log(Liquidity)
        print(f"\n💧 CORRELATION WITH LOG10(LIQUIDITY):")
        liquidity_correlations = []
        for metric, label in first_day_metrics.items():
            if metric in analysis_tokens.columns:
                valid_data = analysis_tokens[
                    (analysis_tokens[metric].notna()) & 
                    (np.isfinite(analysis_tokens[metric])) &
                    (analysis_tokens[metric] > 0)
                ]
                if len(valid_data) > 10:
                    corr = valid_data[metric].corr(valid_data['log_liquidity'])
                    liquidity_correlations.append((metric, label, corr, len(valid_data)))
                    print(f"   {label:<25}: {corr:>7.3f} (n={len(valid_data):,})")
        
        # Sort correlations by strength
        fdv_correlations.sort(key=lambda x: abs(x[2]), reverse=True)
        liquidity_correlations.sort(key=lambda x: abs(x[2]), reverse=True)
        
        print(f"\n🏆 STRONGEST CORRELATIONS WITH LOG(FDV):")
        for i, (metric, label, corr, n) in enumerate(fdv_correlations[:5], 1):
            print(f"   {i}. {label:<25}: {corr:>7.3f}")
        
        print(f"\n🏆 STRONGEST CORRELATIONS WITH LOG(LIQUIDITY):")
        for i, (metric, label, corr, n) in enumerate(liquidity_correlations[:5], 1):
            print(f"   {i}. {label:<25}: {corr:>7.3f}")
        
        # Summary insights
        print(f"\n📊 CORRELATION INSIGHTS:")
        
        # Find strongest predictors
        strong_fdv_predictors = [x for x in fdv_correlations if abs(x[2]) > 0.3]
        strong_liq_predictors = [x for x in liquidity_correlations if abs(x[2]) > 0.3]
        
        print(f"   Strong FDV predictors (|r| > 0.3): {len(strong_fdv_predictors)}")
        print(f"   Strong Liquidity predictors (|r| > 0.3): {len(strong_liq_predictors)}")
        
        # Compare FDV vs Liquidity correlations
        fdv_liq_corr = analysis_tokens['log_fdv'].corr(analysis_tokens['log_liquidity'])
        print(f"   Log(FDV) vs Log(Liquidity) correlation: {fdv_liq_corr:.3f}")
        
        print(f"\n✅ CORRELATION ANALYSIS COMPLETE")
        print(f"🎯 Ready for visualization and deeper analysis")
        
        # Store correlation results for plotting
        correlation_results = {
            'fdv_correlations': fdv_correlations,
            'liquidity_correlations': liquidity_correlations,
            'analysis_tokens': analysis_tokens
        }
        
    else:
        print("❌ No valid FDV and liquidity data available")
        
else:
    print("❌ No enriched token data available")



📊 STEP 6: CORRELATION ANALYSIS - FIRST DAY METRICS vs FDV/LIQUIDITY
📈 ANALYSIS DATASET:
   Total tokens: 5,746
   Tokens with valid FDV & Liquidity: 2,362 (41.1%)

🎯 CORRELATION WITH LOG10(FDV):
   Total Trades             :  -0.059 (n=2,362)
   Unique Traders           :  -0.073 (n=2,362)
   Total SOL Volume         :   0.095 (n=2,347)
   Avg SOL per Trade        :   0.120 (n=2,347)
   SOL per Trader           :   0.038 (n=2,347)
   Trades per Trader        :  -0.071 (n=2,362)
   Trading Intensity (trades/hour):  -0.037 (n=2,362)
   Unique Trading Pairs     :   0.172 (n=2,362)
   SOL Pair Trades          :  -0.095 (n=2,347)
   Non-SOL Pair Trades      :   0.295 (n=780)

💧 CORRELATION WITH LOG10(LIQUIDITY):
   Total Trades             :  -0.037 (n=2,362)
   Unique Traders           :  -0.033 (n=2,362)
   Total SOL Volume         :   0.119 (n=2,347)
   Avg SOL per Trade        :   0.100 (n=2,347)
   SOL per Trader           :  -0.105 (n=2,347)
   Trades per Trader        : 
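
Step 6 relies on `Series.corr`, which defaults to Pearson correlation and is sensitive to the extreme values that dominate raw trade counts and volumes. As a hedged cross-check, a rank-based (Spearman) correlation can be obtained by ranking both series before correlating them. The helper name `spearman_corr` and the synthetic data below are assumptions for illustration only and are not drawn from the dataset.

```python
import numpy as np
import pandas as pd

def spearman_corr(x: pd.Series, y: pd.Series) -> float:
    """Spearman rank correlation: Pearson correlation of average ranks."""
    # Drop rows where either value is missing so both rankings cover the same rows.
    paired = pd.concat([x, y], axis=1).dropna()
    return paired.iloc[:, 0].rank().corr(paired.iloc[:, 1].rank())

# Synthetic heavy-tailed predictor and a noisy log-scale target (illustration only).
rng = np.random.default_rng(42)
demo_volume = pd.Series(10 ** rng.normal(2, 1, size=500))
demo_log_fdv = pd.Series(0.3 * np.log10(demo_volume) + rng.normal(5, 1, size=500))

rho_raw = spearman_corr(demo_volume, demo_log_fdv)
rho_log = spearman_corr(np.log10(demo_volume), demo_log_fdv)
print(f"Spearman (raw volume): {rho_raw:.3f}")
print(f"Spearman (log volume): {rho_log:.3f}")
```

Because ranks are unchanged by any strictly increasing transformation, the Spearman value is the same for a strictly positive predictor and its log10 version, which sidesteps the raw-versus-log question.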

In [11]:
# Step 8: Enhanced Correlation Analysis with Log Transformations
print("\n📊 STEP 8: ENHANCED CORRELATION ANALYSIS - LOG TRANSFORMATIONS")
print("=" * 80)

if 'enriched_tokens' in locals() and len(enriched_tokens) > 0:
    
    # Filter tokens with valid FDV and liquidity data
    analysis_tokens = enriched_tokens[
        (enriched_tokens['api_status'] == 'success') & 
        (enriched_tokens['fdv'] > 0) & 
        (enriched_tokens['fdv'].notna()) &
        (enriched_tokens['liquidity_usd'] > 0) & 
        (enriched_tokens['liquidity_usd'].notna())
    ].copy()
    
    print(f"📈 ENHANCED ANALYSIS DATASET:")
    print(f"   Tokens with valid FDV & Liquidity: {len(analysis_tokens):,}")
    
    if len(analysis_tokens) > 0:
        
        import numpy as np
        
        # Create log-scale versions of both target AND predictor variables
        analysis_tokens['log_fdv'] = np.log10(analysis_tokens['fdv'])
        analysis_tokens['log_liquidity'] = np.log10(analysis_tokens['liquidity_usd'])
        
        # Calculate additional metrics if needed
        if 'trading_hours' not in analysis_tokens.columns:
            analysis_tokens['trading_hours'] = (analysis_tokens['last_trade'] - analysis_tokens['first_trade']).dt.total_seconds() / 3600
        if 'trades_per_hour' not in analysis_tokens.columns:
            analysis_tokens['trades_per_hour'] = analysis_tokens['total_trades'] / (analysis_tokens['trading_hours'] + 0.1)
        if 'sol_per_trader' not in analysis_tokens.columns:
            analysis_tokens['sol_per_trader'] = analysis_tokens['total_sol_volume'] / analysis_tokens['unique_traders']
        if 'trades_per_trader' not in analysis_tokens.columns:
            analysis_tokens['trades_per_trader'] = analysis_tokens['total_trades'] / analysis_tokens['unique_traders']
        
        # Create log versions of skewed predictor variables
        log_transform_vars = ['total_trades', 'unique_traders', 'total_sol_volume', 'sol_pair_trades', 'trades_per_hour', 'sol_per_trader']
        
        for var in log_transform_vars:
            if var in analysis_tokens.columns:
                # Only log transform positive values
                valid_mask = (analysis_tokens[var] > 0) & (analysis_tokens[var].notna())
                analysis_tokens[f'log_{var}'] = np.nan
                analysis_tokens.loc[valid_mask, f'log_{var}'] = np.log10(analysis_tokens.loc[valid_mask, var])
        
        # Define enhanced metrics for correlation analysis (both original and log versions)
        enhanced_metrics = {
            # Original metrics
            'total_trades': 'Total Trades',
            'unique_traders': 'Unique Traders', 
            'total_sol_volume': 'Total SOL Volume',
            'avg_sol_per_trade': 'Avg SOL per Trade',
            'sol_per_trader': 'SOL per Trader',
            'trades_per_trader': 'Trades per Trader',
            'trades_per_hour': 'Trading Intensity',
            'unique_pairs': 'Unique Trading Pairs',
            # Log-transformed metrics
            'log_total_trades': 'Log(Total Trades)',
            'log_unique_traders': 'Log(Unique Traders)',
            'log_total_sol_volume': 'Log(Total SOL Volume)',
            'log_sol_pair_trades': 'Log(SOL Pair Trades)',
            'log_trades_per_hour': 'Log(Trading Intensity)',
            'log_sol_per_trader': 'Log(SOL per Trader)'
        }
        
        # Calculate correlations with log(FDV)
        print(f"🎯 ENHANCED CORRELATION WITH LOG10(FDV):")
        print(f"   {'Metric':<30} {'Original':<10} {'Log Version':<12} {'Better'}")
        print(f"   {'-'*65}")
        
        enhanced_fdv_correlations = []
        
        for base_metric in ['total_trades', 'unique_traders', 'total_sol_volume', 'sol_per_trader', 'trades_per_hour']:
            if base_metric in analysis_tokens.columns:
                log_metric = f'log_{base_metric}'
                
                # Original correlation
                orig_valid = analysis_tokens[
                    (analysis_tokens[base_metric].notna()) & 
                    (np.isfinite(analysis_tokens[base_metric])) &
                    (analysis_tokens[base_metric] > 0)
                ]
                orig_corr = orig_valid[base_metric].corr(orig_valid['log_fdv']) if len(orig_valid) > 10 else np.nan
                
                # Log-transformed correlation
                log_valid = analysis_tokens[
                    (analysis_tokens[log_metric].notna()) & 
                    (np.isfinite(analysis_tokens[log_metric]))
                ]
                log_corr = log_valid[log_metric].corr(log_valid['log_fdv']) if len(log_valid) > 10 else np.nan
                
                # Determine which is better
                better = "LOG" if abs(log_corr) > abs(orig_corr) else "ORIG"
                if pd.isna(orig_corr) or pd.isna(log_corr):
                    better = "N/A"
                
                print(f"   {enhanced_metrics[base_metric]:<30} {orig_corr:>8.3f}  {log_corr:>10.3f}   {better}")
                
                # Store best correlation
                if not pd.isna(log_corr):
                    enhanced_fdv_correlations.append((log_metric, enhanced_metrics[log_metric], log_corr, len(log_valid)))
                if not pd.isna(orig_corr):
                    enhanced_fdv_correlations.append((base_metric, enhanced_metrics[base_metric], orig_corr, len(orig_valid)))
        
        # Add remaining non-transformed metrics
        for metric in ['avg_sol_per_trade', 'trades_per_trader', 'unique_pairs']:
            if metric in analysis_tokens.columns:
                valid_data = analysis_tokens[
                    (analysis_tokens[metric].notna()) & 
                    (np.isfinite(analysis_tokens[metric])) &
                    (analysis_tokens[metric] > 0)
                ]
                if len(valid_data) > 10:
                    corr = valid_data[metric].corr(valid_data['log_fdv'])
                    enhanced_fdv_correlations.append((metric, enhanced_metrics[metric], corr, len(valid_data)))
        
        # Same analysis for liquidity
        print(f"\n💧 ENHANCED CORRELATION WITH LOG10(LIQUIDITY):")
        print(f"   {'Metric':<30} {'Original':<10} {'Log Version':<12} {'Better'}")
        print(f"   {'-'*65}")
        
        enhanced_liq_correlations = []
        
        for base_metric in ['total_trades', 'unique_traders', 'total_sol_volume', 'sol_per_trader', 'trades_per_hour']:
            if base_metric in analysis_tokens.columns:
                log_metric = f'log_{base_metric}'
                
                # Original correlation
                orig_valid = analysis_tokens[
                    (analysis_tokens[base_metric].notna()) & 
                    (np.isfinite(analysis_tokens[base_metric])) &
                    (analysis_tokens[base_metric] > 0)
                ]
                orig_corr = orig_valid[base_metric].corr(orig_valid['log_liquidity']) if len(orig_valid) > 10 else np.nan
                
                # Log-transformed correlation
                log_valid = analysis_tokens[
                    (analysis_tokens[log_metric].notna()) & 
                    (np.isfinite(analysis_tokens[log_metric]))
                ]
                log_corr = log_valid[log_metric].corr(log_valid['log_liquidity']) if len(log_valid) > 10 else np.nan
                
                # Determine which is better
                better = "LOG" if abs(log_corr) > abs(orig_corr) else "ORIG"
                if pd.isna(orig_corr) or pd.isna(log_corr):
                    better = "N/A"
                
                print(f"   {enhanced_metrics[base_metric]:<30} {orig_corr:>8.3f}  {log_corr:>10.3f}   {better}")
                
                # Store best correlation
                if not pd.isna(log_corr):
                    enhanced_liq_correlations.append((log_metric, enhanced_metrics[log_metric], log_corr, len(log_valid)))
                if not pd.isna(orig_corr):
                    enhanced_liq_correlations.append((base_metric, enhanced_metrics[base_metric], orig_corr, len(orig_valid)))
        
        # Add remaining metrics
        for metric in ['avg_sol_per_trade', 'trades_per_trader', 'unique_pairs']:
            if metric in analysis_tokens.columns:
                valid_data = analysis_tokens[
                    (analysis_tokens[metric].notna()) & 
                    (np.isfinite(analysis_tokens[metric])) &
                    (analysis_tokens[metric] > 0)
                ]
                if len(valid_data) > 10:
                    corr = valid_data[metric].corr(valid_data['log_liquidity'])
                    enhanced_liq_correlations.append((metric, enhanced_metrics[metric], corr, len(valid_data)))
        
        # Sort by absolute correlation strength
        enhanced_fdv_correlations.sort(key=lambda x: abs(x[2]), reverse=True)
        enhanced_liq_correlations.sort(key=lambda x: abs(x[2]), reverse=True)
        
        print(f"\n🏆 TOP 5 STRONGEST FDV PREDICTORS (Enhanced):")
        for i, (metric, label, corr, n) in enumerate(enhanced_fdv_correlations[:5], 1):
            log_indicator = "📊" if "log_" in metric else "📈"
            print(f"   {i}. {log_indicator} {label:<25}: {corr:>7.3f}")

        print(f"\n🏆 TOP 5 STRONGEST LIQUIDITY PREDICTORS (Enhanced):")
        for i, (metric, label, corr, n) in enumerate(enhanced_liq_correlations[:5], 1):
            log_indicator = "📊" if "log_" in metric else "📈"
            print(f"   {i}. {log_indicator} {label:<25}: {corr:>7.3f}")
        
        # Summary of improvements
        print(f"\n💡 LOG TRANSFORMATION IMPROVEMENTS:")
        
        log_improvements = 0
        for base_metric in ['total_trades', 'unique_traders', 'total_sol_volume', 'sol_per_trader']:
            if base_metric in analysis_tokens.columns and f'log_{base_metric}' in analysis_tokens.columns:
                orig_fdv_corr = analysis_tokens[base_metric].corr(analysis_tokens['log_fdv'])
                log_fdv_corr = analysis_tokens[f'log_{base_metric}'].corr(analysis_tokens['log_fdv'])
                
                if abs(log_fdv_corr) > abs(orig_fdv_corr) + 0.05:  # Significant improvement
                    log_improvements += 1
        
        print(f"   Variables with improved correlations: {log_improvements}")
        print(f"   (improved = |r| with log10(FDV) rises by more than 0.05 after log-transforming the predictor)")
        
        print(f"\n✅ ENHANCED CORRELATION ANALYSIS COMPLETE")
        print(f"🎯 Log-transformed variables ready for improved visualizations")
        
        # Store enhanced results
        enhanced_correlation_results = {
            'fdv_correlations': enhanced_fdv_correlations,
            'liquidity_correlations': enhanced_liq_correlations,
            'analysis_tokens': analysis_tokens,
            'enhanced_metrics': enhanced_metrics
        }
        
    else:
        print("❌ No valid data for enhanced analysis")
        
else:
    print("❌ No enriched token data available")



📊 STEP 8: ENHANCED CORRELATION ANALYSIS - LOG TRANSFORMATIONS
📈 ENHANCED ANALYSIS DATASET:
   Tokens with valid FDV & Liquidity: 2,362
🎯 ENHANCED CORRELATION WITH LOG10(FDV):
   Metric                         Original   Log Version  Better
   -----------------------------------------------------------------
   Total Trades                     -0.059      -0.164   LOG
   Unique Traders                   -0.073      -0.164   LOG
   Total SOL Volume                  0.095      -0.126   LOG
   SOL per Trader                    0.038      -0.069   LOG
   Trading Intensity                -0.037      -0.178   LOG

💧 ENHANCED CORRELATION WITH LOG10(LIQUIDITY):
   Metric                         Original   Log Version  Better
   -----------------------------------------------------------------
   Total Trades                     -0.037      -0.134   LOG
   Unique Traders                   -0.033      -0.125   LOG
   Total SOL Volume                  0.119      -0.096   ORIG
   SOL p
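
The FDV and liquidity loops in Step 8 repeat the same filter-and-correlate logic. One possible consolidation is a small helper, sketched here on the assumption that the raw and log correlations should be computed on exactly the same strictly positive, finite rows. `raw_vs_log_corr` and the synthetic `demo` frame are hypothetical names and are not part of the notebook.

```python
import numpy as np
import pandas as pd

def raw_vs_log_corr(df: pd.DataFrame, metric: str, target: str, min_n: int = 10):
    """Return (raw_r, log_r, n) for one metric, using the same rows for both."""
    values = df[metric]
    # Keep rows where the metric is finite and positive and the target is present.
    mask = values.notna() & np.isfinite(values) & (values > 0) & df[target].notna()
    subset = df.loc[mask]
    if len(subset) <= min_n:  # mirrors the notebook's "len(...) > 10" guard
        return np.nan, np.nan, len(subset)
    raw_r = subset[metric].corr(subset[target])
    log_r = np.log10(subset[metric]).corr(subset[target])
    return raw_r, log_r, len(subset)

# Synthetic data for illustration only: 200 rows, 5 zeros and 5 missing values.
rng = np.random.default_rng(7)
demo = pd.DataFrame({'total_trades': 10 ** rng.normal(2, 1, size=200)})
demo['log_fdv'] = 5 - 0.2 * np.log10(demo['total_trades']) + rng.normal(0, 1, size=200)
demo.loc[:4, 'total_trades'] = 0
demo.loc[5:9, 'total_trades'] = np.nan

raw_r, log_r, n = raw_vs_log_corr(demo, 'total_trades', 'log_fdv')
print(f"raw r={raw_r:.3f}  log r={log_r:.3f}  n={n}")
```

Calling the helper once per metric and target (`'log_fdv'`, then `'log_liquidity'`) would replace both loops and keep the reported `n` identical for the raw and log columns.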

In [12]:
# Step 9: Enhanced Correlation Visualizations
print("\n📊 STEP 9: ENHANCED CORRELATION VISUALIZATIONS")
print("=" * 70)

if 'enhanced_correlation_results' in locals() and len(enhanced_correlation_results['analysis_tokens']) > 0:
    
    analysis_tokens = enhanced_correlation_results['analysis_tokens']
    enhanced_metrics = enhanced_correlation_results['enhanced_metrics']
    
    from plotly.subplots import make_subplots
    import plotly.graph_objects as go
    import plotly.express as px
    import pandas as pd
    import numpy as np
    
    print(f"🎯 CREATING ENHANCED VISUALIZATIONS...")
    
    # 1. Original vs Log Correlation Comparison Chart
    print(f"   📊 1. Correlation comparison chart")
    
    comparison_data = []
    for base_metric in ['total_trades', 'unique_traders', 'total_sol_volume', 'sol_per_trader', 'trades_per_hour']:
        if base_metric in analysis_tokens.columns and f'log_{base_metric}' in analysis_tokens.columns:
            orig_fdv = analysis_tokens[base_metric].corr(analysis_tokens['log_fdv'])
            log_fdv = analysis_tokens[f'log_{base_metric}'].corr(analysis_tokens['log_fdv'])
            orig_liq = analysis_tokens[base_metric].corr(analysis_tokens['log_liquidity'])
            log_liq = analysis_tokens[f'log_{base_metric}'].corr(analysis_tokens['log_liquidity'])
            
            comparison_data.append({
                'metric': enhanced_metrics[base_metric],
                'orig_fdv': orig_fdv,
                'log_fdv': log_fdv,
                'orig_liq': orig_liq,
                'log_liq': log_liq,
                'fdv_improvement': abs(log_fdv) - abs(orig_fdv),
                'liq_improvement': abs(log_liq) - abs(orig_liq)
            })
    
    comparison_df = pd.DataFrame(comparison_data)
    
    # Create comparison visualization
    fig1 = make_subplots(
        rows=1, cols=2,
        subplot_titles=['FDV Correlations: Original vs Log', 'Liquidity Correlations: Original vs Log'],
        horizontal_spacing=0.1
    )
    
    # FDV comparison
    fig1.add_trace(
        go.Scatter(
            x=comparison_df['orig_fdv'],
            y=comparison_df['log_fdv'],
            mode='markers+text',
            text=comparison_df['metric'],
            textposition='middle right',
            marker=dict(size=10, color='blue', opacity=0.7),
            name='FDV Correlations',
            hovertemplate='<b>%{text}</b><br>Original: %{x:.3f}<br>Log: %{y:.3f}<extra></extra>'
        ),
        row=1, col=1
    )
    
    # Add diagonal line for FDV
    min_val = min(comparison_df['orig_fdv'].min(), comparison_df['log_fdv'].min())
    max_val = max(comparison_df['orig_fdv'].max(), comparison_df['log_fdv'].max())
    fig1.add_trace(
        go.Scatter(
            x=[min_val, max_val],
            y=[min_val, max_val],
            mode='lines',
            line=dict(dash='dash', color='gray'),
            name='Equal Performance',
            showlegend=False
        ),
        row=1, col=1
    )
    
    # Liquidity comparison
    fig1.add_trace(
        go.Scatter(
            x=comparison_df['orig_liq'],
            y=comparison_df['log_liq'],
            mode='markers+text',
            text=comparison_df['metric'],
            textposition='middle right',
            marker=dict(size=10, color='green', opacity=0.7),
            name='Liquidity Correlations',
            hovertemplate='<b>%{text}</b><br>Original: %{x:.3f}<br>Log: %{y:.3f}<extra></extra>'
        ),
        row=1, col=2
    )
    
    # Add diagonal line for Liquidity
    min_val = min(comparison_df['orig_liq'].min(), comparison_df['log_liq'].min())
    max_val = max(comparison_df['orig_liq'].max(), comparison_df['log_liq'].max())
    fig1.add_trace(
        go.Scatter(
            x=[min_val, max_val],
            y=[min_val, max_val],
            mode='lines',
            line=dict(dash='dash', color='gray'),
            name='Equal Performance',
            showlegend=False
        ),
        row=1, col=2
    )
    
    fig1.update_xaxes(title_text="Original Correlation", row=1, col=1)
    fig1.update_yaxes(title_text="Log-Transformed Correlation", row=1, col=1)
    fig1.update_xaxes(title_text="Original Correlation", row=1, col=2)
    fig1.update_yaxes(title_text="Log-Transformed Correlation", row=1, col=2)
    
    fig1.update_layout(
        title="Log Transformation Impact on Correlations<br><sub>Dashed diagonal = no change; improvement means larger |r|, so for negative correlations improved points lie below the diagonal</sub>",
        height=600,
        width=1200
    )
    
    fig1.show()
    
    # 2. Top Correlations Scatter Plots
    print(f"   📈 2. Top correlations scatter plots")
    
    # Get top 4 correlations for each target
    fdv_top = enhanced_correlation_results['fdv_correlations'][:4]
    liq_top = enhanced_correlation_results['liquidity_correlations'][:4]
    
    fig2 = make_subplots(
        rows=2, cols=4,
        subplot_titles=[
            f"{corr[1]}<br>r={corr[2]:.3f}" for corr in fdv_top
        ] + [
            f"{corr[1]}<br>r={corr[2]:.3f}" for corr in liq_top
        ],
        vertical_spacing=0.12,
        horizontal_spacing=0.08
    )
    
    # FDV correlations (top row)
    for i, (metric, label, corr, n) in enumerate(fdv_top):
        if metric in analysis_tokens.columns:
            valid_data = analysis_tokens[
                (analysis_tokens[metric].notna()) & 
                (np.isfinite(analysis_tokens[metric]))
            ]
            
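            # Log-transformed metrics are already on a log scale, so keep their axis linear.
            # Use a log axis only for raw, strictly positive metrics spanning over two orders of magnitude.
            x_min, x_max = valid_data[metric].min(), valid_data[metric].max()
            x_type = "log" if ("log_" not in metric and x_min > 0 and x_max / x_min > 100) else "linear"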
            
            fig2.add_trace(
                go.Scatter(
                    x=valid_data[metric],
                    y=valid_data['log_fdv'],
                    mode='markers',
                    marker=dict(
                        size=3,
                        opacity=0.6,
                        color='blue',
                        line=dict(width=0)
                    ),
                    name=f"{label}",
                    showlegend=False,
                    hovertemplate=f"<b>{label}</b>: %{{x:,.2f}}<br>Log(FDV): %{{y:.2f}}<br>FDV: $%{{customdata:,.0f}}<extra></extra>",
                    customdata=10**valid_data['log_fdv']
                ),
                row=1, col=i+1
            )
            
            fig2.update_xaxes(type=x_type, row=1, col=i+1)
    
    # Liquidity correlations (bottom row)
    for i, (metric, label, corr, n) in enumerate(liq_top):
        if metric in analysis_tokens.columns:
            valid_data = analysis_tokens[
                (analysis_tokens[metric].notna()) & 
                (np.isfinite(analysis_tokens[metric]))
            ]
            
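            # Keep log-transformed metrics on a linear axis; guard against a zero or negative minimum
            x_min, x_max = valid_data[metric].min(), valid_data[metric].max()
            x_type = "log" if ("log_" not in metric and x_min > 0 and x_max / x_min > 100) else "linear"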
            
            fig2.add_trace(
                go.Scatter(
                    x=valid_data[metric],
                    y=valid_data['log_liquidity'],
                    mode='markers',
                    marker=dict(
                        size=3,
                        opacity=0.6,
                        color='green',
                        line=dict(width=0)
                    ),
                    name=f"{label}",
                    showlegend=False,
                    hovertemplate=f"<b>{label}</b>: %{{x:,.2f}}<br>Log(Liquidity): %{{y:.2f}}<br>Liquidity: $%{{customdata:,.0f}}<extra></extra>",
                    customdata=10**valid_data['log_liquidity']
                ),
                row=2, col=i+1
            )
            
            fig2.update_xaxes(type=x_type, row=2, col=i+1)
    
    # Update y-axes
    for i in range(1, 5):
        fig2.update_yaxes(title_text="Log10(FDV)", row=1, col=i)
        fig2.update_yaxes(title_text="Log10(Liquidity)", row=2, col=i)
    
    fig2.update_layout(
        title="Strongest First-Day Predictors of Long-Term Success",
        height=800,
        width=1600,
        font=dict(size=10)
    )
    
    fig2.show()
    
    # 3. Correlation Heatmap
    print(f"   🔥 3. Comprehensive correlation heatmap")
    
    # Create correlation matrix for all key metrics
    key_metrics = [
        'log_total_trades', 'log_unique_traders', 'log_total_sol_volume', 
        'log_trades_per_hour', 'log_sol_per_trader',
        'avg_sol_per_trade', 'trades_per_trader', 'unique_pairs',
        'log_fdv', 'log_liquidity'
    ]
    
    # Filter metrics that exist in the data
    available_metrics = [m for m in key_metrics if m in analysis_tokens.columns]
    
    # Calculate correlation matrix
    corr_matrix = analysis_tokens[available_metrics].corr()
    
    # Create labels for display
    display_labels = []
    for metric in available_metrics:
        if metric in enhanced_metrics:
            display_labels.append(enhanced_metrics[metric])
        elif metric == 'log_fdv':
            display_labels.append('Log(FDV)')
        elif metric == 'log_liquidity':
            display_labels.append('Log(Liquidity)')
        else:
            display_labels.append(metric)
    
    fig3 = go.Figure(data=go.Heatmap(
        z=corr_matrix.values,
        x=display_labels,
        y=display_labels,
        colorscale='RdBu',
        zmid=0,
        text=corr_matrix.round(3).values,
        texttemplate="%{text}",
        textfont={"size": 10},
        hovertemplate='<b>%{y}</b> vs <b>%{x}</b><br>Correlation: %{z:.3f}<extra></extra>'
    ))

    fig3.update_layout(
        title="Enhanced Correlation Matrix: First-Day Metrics vs Success Indicators",
        width=800,
        height=800,
        xaxis={'side': 'bottom'},
        font=dict(size=11)
    )

    fig3.show()

    # 4. Key Insights Summary
    print(f"\n💡 KEY INSIGHTS FROM ENHANCED ANALYSIS:")
    
    # Find the strongest predictors
    strongest_fdv = enhanced_correlation_results['fdv_correlations'][0]
    strongest_liq = enhanced_correlation_results['liquidity_correlations'][0]
    
    print(f"\n   🏆 STRONGEST PREDICTORS:")
    print(f"   • FDV: {strongest_fdv[1]} (r={strongest_fdv[2]:.3f})")
    print(f"   • Liquidity: {strongest_liq[1]} (r={strongest_liq[2]:.3f})")
    
    # Analyze positive vs negative correlations
    positive_fdv = [x for x in enhanced_correlation_results['fdv_correlations'] if x[2] > 0]
    negative_fdv = [x for x in enhanced_correlation_results['fdv_correlations'] if x[2] < 0]
    
    print(f"\n   📈 POSITIVE FDV CORRELATIONS (higher = better):")
    for metric, label, corr, n in positive_fdv[:3]:
        print(f"   • {label}: r={corr:.3f}")
        
    print(f"\n   📉 NEGATIVE FDV CORRELATIONS (lower = better):")
    for metric, label, corr, n in sorted(negative_fdv, key=lambda x: x[2])[:3]:
        print(f"   • {label}: r={corr:.3f}")
    
    # Log transformation benefits
    log_better_count = sum(1 for row in comparison_df.itertuples() 
                          if abs(row.log_fdv) > abs(row.orig_fdv) + 0.01)
    
    print(f"\n   🔄 LOG TRANSFORMATION BENEFITS:")
    print(f"   • {log_better_count}/{len(comparison_df)} metrics improved with log transformation")
    print(f"   • Log transformations reveal linear relationships in power-law data")
    print(f"   • Essential for proper modeling and prediction")
    
    print(f"\n✅ ENHANCED VISUALIZATIONS COMPLETE")
    print(f"🎯 Ready for insights interpretation and strategy development")
    
else:
    print("❌ No enhanced correlation results available for visualization")



📊 STEP 9: ENHANCED CORRELATION VISUALIZATIONS
🎯 CREATING ENHANCED VISUALIZATIONS...
   📊 1. Correlation comparison chart


   📈 2. Top correlations scatter plots


   🔥 3. Comprehensive correlation heatmap



💡 KEY INSIGHTS FROM ENHANCED ANALYSIS:

   🏆 STRONGEST PREDICTORS:
   • FDV: Log(Trading Intensity) (r=-0.178)
   • Liquidity: Trades per Trader (r=-0.145)

   📈 POSITIVE FDV CORRELATIONS (higher = better):
   • Unique Trading Pairs: r=0.172
   • Avg SOL per Trade: r=0.120
   • Total SOL Volume: r=0.095

   📉 NEGATIVE FDV CORRELATIONS (lower = better):
   • Log(Trading Intensity): r=-0.178
   • Log(Unique Traders): r=-0.164
   • Log(Total Trades): r=-0.164

   🔄 LOG TRANSFORMATION BENEFITS:
   • 5/5 metrics improved with log transformation
   • Log transformations reveal linear relationships in power-law data
   • Essential for proper modeling and prediction

✅ ENHANCED VISUALIZATIONS COMPLETE
🎯 Ready for insights interpretation and strategy development
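
With every |r| below 0.2, it is worth checking how precisely the strongest correlation is estimated and how much variance it explains. The sketch below applies the Fisher z-transformation to r = -0.178 with n = 2,362 (the Trading Intensity sample size reported in Step 6). It assumes approximately bivariate-normal data and independent tokens, which may not hold here. `pearson_ci` is an illustrative helper, not part of the notebook.

```python
import math

def pearson_ci(r: float, n: int, z_crit: float = 1.96):
    """Approximate confidence interval for a Pearson r via the Fisher z-transformation."""
    z = math.atanh(r)                 # Fisher transform of r
    se = 1.0 / math.sqrt(n - 3)       # standard error of z
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)

r, n = -0.178, 2362
ci_low, ci_high = pearson_ci(r, n)
r_squared = r * r
print(f"r = {r:.3f}, 95% CI = ({ci_low:.3f}, {ci_high:.3f}), r^2 = {r_squared:.3f}")
```

Under these assumptions the interval is roughly (-0.217, -0.139) and r² is about 0.032, so the strongest first-day predictor accounts for only about 3% of the variance in log10(FDV).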
