# MTG Card Pricing Analysis by Type

This notebook explores the pricing dataset joined with card data to provide comprehensive statistics by card type. We'll analyze pricing patterns across different card types (creatures, enchantments, instants, etc.) while excluding basic lands.

## Analysis Overview
- **Data Source**: CosmosDB with 110,000+ cards and pricing data  
- **Focus**: Statistical analysis by card type
- **Exclusions**: Basic lands  
- **Metrics**: Count, Min, Max, Average, Median prices

## Import Required Libraries

Import pandas, numpy, and database connection libraries for data analysis and visualization.

In [5]:
import pandas as pd
import numpy as np
import os
import sys
import asyncio
from datetime import datetime, timezone, date
import warnings
warnings.filterwarnings('ignore')

# Add project paths for imports (notebooks folder)
sys.path.append('/workspaces/mtgecorec')
sys.path.append('/workspaces/mtgecorec/core')

# Import the simple core database driver
from core.data_engine.cosmos_driver import get_mongo_client, get_collection

print("Libraries imported successfully!")
print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")
print(f"Using core cosmos_driver system")
print(f"Notebook location: /workspaces/mtgecorec/notebooks/")

Libraries imported successfully!
Pandas version: 2.3.3
NumPy version: 2.3.5
Using core cosmos_driver system
Notebook location: /workspaces/mtgecorec/notebooks/


## Database Connection & Configuration

Set up connection to CosmosDB using the existing configuration management system.

In [2]:
# Set up database connection using core system
# Get MongoDB client
client = get_mongo_client()

# Specify database and collection names
database_name = "mtgecorec" 

## Load Card and Pricing Datasets

Load card data from the `cards` collection and pricing data from the `card_pricing_daily` collection, then join them together.

In [3]:
# Load cards data
print("Loading cards from database...")
cards_collection = get_collection(client, database_name, "cards")
all_cards = list(cards_collection.find({}))
print(f"‚úÖ Loaded {len(all_cards)} cards from 'cards' collection")

# Load pricing data  
print("Loading pricing data from database...")
pricing_collection = get_collection(client, database_name, "card_pricing_daily")
all_pricing = list(pricing_collection.find({}))
print(f"‚úÖ Loaded {len(all_pricing)} pricing records from 'card_pricing_daily' collection")

# Convert to DataFrames
df_cards = pd.DataFrame(all_cards)
df_pricing = pd.DataFrame(all_pricing)

print(f"\nCards DataFrame shape: {df_cards.shape}")
print(f"Pricing DataFrame shape: {df_pricing.shape}")

print(f"\nCards columns: {list(df_cards.columns)}")
print(f"Pricing columns: {list(df_pricing.columns)}")

# Show sample data
print("\n=== SAMPLE CARDS DATA ===")
print(df_cards[['id', 'name', 'type_line'] if all(col in df_cards.columns for col in ['id', 'name', 'type_line']) else df_cards.columns[:3]].head(3))

print("\n=== SAMPLE PRICING DATA ===")
print(df_pricing.head(3))

Loading cards from database...
‚úÖ Loaded 110031 cards from 'cards' collection
Loading pricing data from database...
‚úÖ Loaded 321433 pricing records from 'card_pricing_daily' collection

Cards DataFrame shape: (110031, 91)
Pricing DataFrame shape: (321433, 17)

Pricing columns: ['_id', 'card_uuid', 'card_name', 'set_code', 'scryfall_id', 'price_usd', 'price_type', 'source', 'tcgplayer_id', 'cardmarket_id', 'date', 'timestamp', 'created_at', 'collected_at', 'price_value', 'currency', 'finish']

=== SAMPLE CARDS DATA ===
                                     id           name               type_line
0  0000419b-0bba-4488-8f7a-6194544ce91e         Forest     Basic Land ‚Äî Forest
1  0000579f-7b35-4ed3-b44c-db2a538066fe    Fury Sliver       Creature ‚Äî Sliver
2  00006596-1166-4a79-8443-ca9f82e6db4e  Kor Outfitter  Creature ‚Äî Kor Soldier

=== SAMPLE PRICING DATA ===
                        _id                 card_uuid  \
0  693c9a8390c74c6f72124891  68d04c8859fb4c414fdabc7e   
1  693c9

## Join Card Data with Pricing Data

Join the cards dataset with pricing data using `card_pricing_daily.card_uuid` ‚Üí `cards.id`.

In [4]:
# Join cards with pricing data
print("Joining card data with pricing data...")
print(f"Join key - Cards: 'id', Pricing: 'scryfall_id'")

# Check join keys exist
if 'id' not in df_cards.columns:
    print("‚ùå ERROR: 'id' column not found in cards data")
    print(f"Available columns in cards: {list(df_cards.columns)}")
    
if 'card_uuid' not in df_pricing.columns:
    print("‚ùå ERROR: 'card_uuid' column not found in pricing data")  
    print(f"Available columns in pricing: {list(df_pricing.columns)}")

# Perform the join (inner join to only keep cards with pricing)
df = df_cards.merge(df_pricing, left_on='id', right_on='scryfall_id', how='inner', suffixes=('', '_pricing'))

print(f"\n‚úÖ Join completed!")
print(f"Cards with pricing: {len(df):,}")
print(f"Cards without pricing: {len(df_cards) - len(df):,}")
print(f"Join success rate: {(len(df)/len(df_cards))*100:.1f}%")

print(f"\nJoined DataFrame shape: {df.shape}")
print(f"Columns after join: {len(df.columns)} total")

# Display first few rows to understand the joined structure  
df.head()

Joining card data with pricing data...
Join key - Cards: 'id', Pricing: 'card_uuid'

‚úÖ Join completed!
Cards with pricing: 321,433
Cards without pricing: -211,402
Join success rate: 292.1%

Joined DataFrame shape: (321433, 108)
Columns after join: 108 total


Unnamed: 0,_id,object,id,oracle_id,multiverse_ids,mtgo_id,arena_id,tcgplayer_id,cardmarket_id,name,...,source,tcgplayer_id_pricing,cardmarket_id_pricing,date,timestamp,created_at,collected_at,price_value,currency,finish
0,68d04b6b8255b4901067c399,card,0000419b-0bba-4488-8f7a-6194544ce91e,b34bb2dc-c1af-4d77-b0b3-a0fb342a5fc6,[668564],129825.0,91829.0,558404.0,777725.0,Forest,...,scryfall_usd,558404.0,777725.0,2025-12-12,2025-12-12T22:43:07.906282+00:00,2025-12-12 22:43:07.906,NaT,,,
1,68d04b6b8255b4901067c399,card,0000419b-0bba-4488-8f7a-6194544ce91e,b34bb2dc-c1af-4d77-b0b3-a0fb342a5fc6,[668564],129825.0,91829.0,558404.0,777725.0,Forest,...,scryfall_usd_foil,558404.0,777725.0,2025-12-12,2025-12-12T22:43:07.906282+00:00,2025-12-12 22:43:07.906,NaT,,,
2,68d04b6b8255b4901067c399,card,0000419b-0bba-4488-8f7a-6194544ce91e,b34bb2dc-c1af-4d77-b0b3-a0fb342a5fc6,[668564],129825.0,91829.0,558404.0,777725.0,Forest,...,,558404.0,777725.0,2025-12-13,,NaT,2025-12-12 23:05:29.334,,,
3,68d04b6b8255b4901067c39a,card,0000579f-7b35-4ed3-b44c-db2a538066fe,44623693-51d6-49ad-8cd7-140505caf02f,[109722],25527.0,,14240.0,13850.0,Fury Sliver,...,scryfall_bulk,14240.0,13850.0,2025-12-13,,NaT,2025-12-13 00:01:04.421,0.44,usd,nonfoil
4,68d04b6b8255b4901067c39a,card,0000579f-7b35-4ed3-b44c-db2a538066fe,44623693-51d6-49ad-8cd7-140505caf02f,[109722],25527.0,,14240.0,13850.0,Fury Sliver,...,scryfall_bulk,14240.0,13850.0,2025-12-13,,NaT,2025-12-13 00:01:04.421,3.72,usd,foil


## Data Exploration & Cleaning

Explore the data structure and identify pricing columns, card types, and prepare for analysis.

In [6]:
# Explore the data structure
print("=== DATA EXPLORATION ===")
print(f"Total cards: {len(df)}")
print(f"DataFrame info:")
df.info()

print("\n=== COLUMN ANALYSIS ===")
print("Available columns:")
for i, col in enumerate(df.columns):
    print(f"{i+1:2d}. {col}")

# Look for pricing columns
pricing_columns = [col for col in df.columns if 'price' in col.lower() or 'usd' in col.lower()]
print(f"\nPricing columns found: {pricing_columns}")

# Look for type columns
type_columns = [col for col in df.columns if 'type' in col.lower()]
print(f"Type columns found: {type_columns}")

# Check for card name and other key fields
key_fields = ['name', 'id', 'set', 'rarity']
existing_fields = [col for col in key_fields if col in df.columns]
print(f"Key fields available: {existing_fields}")

print("\n=== SAMPLE DATA ===")
# Show a sample of the most relevant columns
sample_cols = ['name'] + type_columns + pricing_columns + ['rarity', 'set']
sample_cols = [col for col in sample_cols if col in df.columns][:8]  # Limit to first 8 relevant columns
df[sample_cols].head()

=== DATA EXPLORATION ===
Total cards: 321433
DataFrame info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 321433 entries, 0 to 321432
Columns: 108 entries, _id to finish
dtypes: bool(14), datetime64[ns](3), float64(13), object(78)
memory usage: 234.8+ MB

=== COLUMN ANALYSIS ===
Available columns:
 1. _id
 2. object
 3. id
 4. oracle_id
 5. multiverse_ids
 6. mtgo_id
 7. arena_id
 8. tcgplayer_id
 9. cardmarket_id
10. name
11. lang
12. released_at
13. uri
14. scryfall_uri
15. layout
16. highres_image
17. image_status
18. image_uris
19. mana_cost
20. cmc
21. type_line
22. oracle_text
23. colors
24. color_identity
25. keywords
26. produced_mana
27. legalities
28. games
29. reserved
30. game_changer
31. foil
32. nonfoil
33. finishes
34. oversized
35. promo
36. reprint
37. variation
38. set_id
39. set
40. set_name
41. set_type
42. set_uri
43. set_search_uri
44. scryfall_set_uri
45. rulings_uri
46. prints_search_uri
47. collector_number
48. digital
49. rarity
50. card_back_id
51. artis

Unnamed: 0,name,type_line,set_type,promo_types,printed_type_line,price_type,prices,last_price_update
0,Forest,Basic Land ‚Äî Forest,expansion,,,usd,"{'usd': '0.24', 'usd_foil': '0.56', 'usd_etche...",2025-12-11 03:28:45.347
1,Forest,Basic Land ‚Äî Forest,expansion,,,usd_foil,"{'usd': '0.24', 'usd_foil': '0.56', 'usd_etche...",2025-12-11 03:28:45.347
2,Forest,Basic Land ‚Äî Forest,expansion,,,usd,"{'usd': '0.24', 'usd_foil': '0.56', 'usd_etche...",2025-12-11 03:28:45.347
3,Fury Sliver,Creature ‚Äî Sliver,expansion,,,usd,"{'usd': '0.40', 'usd_foil': '3.79', 'usd_etche...",2025-12-11 03:28:45.405
4,Fury Sliver,Creature ‚Äî Sliver,expansion,,,usd_foil,"{'usd': '0.40', 'usd_foil': '3.79', 'usd_etche...",2025-12-11 03:28:45.405


## Filter Out Basic Lands

Remove basic lands from the dataset to focus analysis on non-basic card types.

In [7]:
# Identify the type column (most likely 'type_line' or 'type')
type_col = None
for col in ['type_line', 'types', 'type', 'card_type']:
    if col in df.columns:
        type_col = col
        break

print(f"Using type column: {type_col}")

if type_col:
    # Check unique values in type column to understand the data
    print(f"\nSample type values:")
    print(df[type_col].value_counts().head(10))
    
    # Filter out basic lands
    # Basic lands typically have "Basic Land" in their type line
    initial_count = len(df)
    
    # Create filters for basic lands
    basic_land_filters = [
        df[type_col].str.contains('Basic Land', case=False, na=False),
        df[type_col].str.contains('Basic Snow Land', case=False, na=False),
        # Also filter specific basic land names
        df['name'].isin(['Plains', 'Island', 'Swamp', 'Mountain', 'Forest', 
                        'Snow-Covered Plains', 'Snow-Covered Island', 'Snow-Covered Swamp', 
                        'Snow-Covered Mountain', 'Snow-Covered Forest'])
    ]
    
    # Combine all basic land filters
    is_basic_land = pd.Series([False] * len(df))
    for filter_condition in basic_land_filters:
        is_basic_land = is_basic_land | filter_condition
    
    # Filter out basic lands
    df_filtered = df[~is_basic_land].copy()
    
    basic_lands_removed = initial_count - len(df_filtered)
    print(f"\nFiltering Results:")
    print(f"Initial card count: {initial_count:,}")
    print(f"Basic lands removed: {basic_lands_removed:,}")
    print(f"Remaining cards: {len(df_filtered):,}")
    print(f"Percentage remaining: {(len(df_filtered)/initial_count)*100:.1f}%")
    
else:
    print("Warning: Could not find type column. Proceeding with all cards.")
    df_filtered = df.copy()

Using type column: type_line

Sample type values:
type_line
Instant                     34014
Sorcery                     32249
Land                        16754
Enchantment                 16289
Artifact                    15110
Enchantment ‚Äî Aura           9631
Artifact ‚Äî Equipment         4627
Creature ‚Äî Human Wizard      3593
Creature ‚Äî Human Soldier     3158
Creature ‚Äî Elemental         3102
Name: count, dtype: int64

Filtering Results:
Initial card count: 321,433
Basic lands removed: 12,705
Remaining cards: 308,728
Percentage remaining: 96.0%


## Prepare Pricing Data

Extract and clean pricing information, handling missing values and converting to numeric format.

In [13]:
# Debug: Let's examine what's in the pricing columns more closely
print("=== DEBUGGING PRICING DATA ===")

# Check the 'prices' column structure (which seems promising)
print("\nSample 'prices' column values:")
sample_prices = df_filtered['prices'].dropna().head(10)
for i, price_val in enumerate(sample_prices):
    print(f"Sample {i+1}: Type={type(price_val)}, Value={price_val}")

# Check price_usd column
print(f"\nprice_usd column stats:")
print(f"Total non-null values: {df_filtered['price_usd'].count()}")
print(f"Unique values: {df_filtered['price_usd'].nunique()}")
print(f"Sample values: {df_filtered['price_usd'].dropna().head(10).tolist()}")

# Check other pricing columns
for col in ['price_type', 'price_value']:
    if col in df_filtered.columns:
        print(f"\n{col} column stats:")
        print(f"Non-null values: {df_filtered[col].count()}")
        print(f"Sample values: {df_filtered[col].dropna().head(5).tolist()}")

print("\n" + "="*50)

=== DEBUGGING PRICING DATA ===

Sample 'prices' column values:
Sample 1: Type=<class 'dict'>, Value={'usd': '0.40', 'usd_foil': '3.79', 'usd_etched': None, 'eur': '0.19', 'eur_foil': '1.42', 'tix': '0.03'}
Sample 2: Type=<class 'dict'>, Value={'usd': '0.40', 'usd_foil': '3.79', 'usd_etched': None, 'eur': '0.19', 'eur_foil': '1.42', 'tix': '0.03'}
Sample 3: Type=<class 'dict'>, Value={'usd': '0.40', 'usd_foil': '3.79', 'usd_etched': None, 'eur': '0.19', 'eur_foil': '1.42', 'tix': '0.03'}
Sample 4: Type=<class 'dict'>, Value={'usd': '0.40', 'usd_foil': '3.79', 'usd_etched': None, 'eur': '0.19', 'eur_foil': '1.42', 'tix': '0.03'}
Sample 5: Type=<class 'dict'>, Value={'usd': '0.40', 'usd_foil': '3.79', 'usd_etched': None, 'eur': '0.19', 'eur_foil': '1.42', 'tix': '0.03'}
Sample 6: Type=<class 'dict'>, Value={'usd': '0.14', 'usd_foil': '1.69', 'usd_etched': None, 'eur': '0.27', 'eur_foil': '2.40', 'tix': '0.03'}
Sample 7: Type=<class 'dict'>, Value={'usd': '0.14', 'usd_foil': '1.69', 'usd_e

In [14]:
# IMPROVED PRICING EXTRACTION
print("=== IMPROVED PRICING DATA EXTRACTION ===")

df_analysis = df_filtered.copy()

# Strategy 1: Extract from 'prices' dictionary column
def extract_usd_from_prices_dict(prices_dict):
    if pd.isna(prices_dict) or not isinstance(prices_dict, dict):
        return np.nan
    usd_val = prices_dict.get('usd', np.nan)
    if pd.isna(usd_val) or usd_val is None:
        return np.nan
    try:
        return float(usd_val)
    except:
        return np.nan

# Strategy 2: Extract from price_type/price_value columns
def extract_usd_from_type_value():
    # Create a mask for USD prices
    usd_mask = df_analysis['price_type'] == 'usd'
    return df_analysis.loc[usd_mask, 'price_value'].astype(float)

# Apply Strategy 1: Extract from prices dictionary
df_analysis['price_from_dict'] = df_analysis['prices'].apply(extract_usd_from_prices_dict)

# Apply Strategy 2: Extract from type/value structure
# Create a series to hold USD prices from the type/value structure
price_from_type_value = pd.Series(index=df_analysis.index, dtype=float)
if 'price_type' in df_analysis.columns and 'price_value' in df_analysis.columns:
    usd_indices = df_analysis[df_analysis['price_type'] == 'usd'].index
    price_from_type_value.loc[usd_indices] = df_analysis.loc[usd_indices, 'price_value'].astype(float)

# Strategy 3: Use existing price_usd if available
price_from_usd_col = df_analysis['price_usd'] if 'price_usd' in df_analysis.columns else pd.Series(index=df_analysis.index, dtype=float)

# Combine all strategies - use the first available price
df_analysis['price_clean'] = df_analysis['price_from_dict'].fillna(price_from_type_value).fillna(price_from_usd_col)

# Remove invalid prices
initial_count = len(df_analysis)
df_analysis = df_analysis.dropna(subset=['price_clean'])
df_analysis = df_analysis[df_analysis['price_clean'] > 0]  # Remove zero or negative prices

print(f"\nIMPROVED Pricing Data Results:")
print(f"Cards before pricing filter: {initial_count:,}")
print(f"Cards with valid pricing: {len(df_analysis):,}")
print(f"Cards removed (no/invalid pricing): {initial_count - len(df_analysis):,}")
print(f"Success rate: {len(df_analysis)/initial_count*100:.1f}%")

# Pricing method breakdown
has_dict_price = df_analysis['price_from_dict'].notna().sum()
has_type_value_price = price_from_type_value.loc[df_analysis.index].notna().sum()
has_usd_col_price = price_from_usd_col.loc[df_analysis.index].notna().sum()

print(f"\nPricing source breakdown:")
print(f"From 'prices' dict: {has_dict_price:,}")
print(f"From type/value structure: {has_type_value_price:,}")
print(f"From 'price_usd' column: {has_usd_col_price:,}")

# Price statistics
print(f"\nPrice Range (Improved):")
print(f"Min price: ${df_analysis['price_clean'].min():.2f}")
print(f"Max price: ${df_analysis['price_clean'].max():.2f}")
print(f"Median price: ${df_analysis['price_clean'].median():.2f}")
print(f"Mean price: ${df_analysis['price_clean'].mean():.2f}")

print("=" * 50)

=== IMPROVED PRICING DATA EXTRACTION ===

IMPROVED Pricing Data Results:
Cards before pricing filter: 308,728
Cards with valid pricing: 280,092
Cards removed (no/invalid pricing): 28,636
Success rate: 90.7%

Pricing source breakdown:
From 'prices' dict: 279,579
From type/value structure: 77,139
From 'price_usd' column: 727

Price Range (Improved):
Min price: $0.01
Max price: $3418.22
Median price: $0.24
Mean price: $3.30


## Extract Card Types

Parse the type line to extract primary card types for grouping analysis.

In [16]:
def extract_primary_card_type(type_line):
    """Extract the primary card type from a type line"""
    if pd.isna(type_line):
        return 'Unknown'
    
    type_str = str(type_line).lower()
    
    # Define card type priorities (more specific first)
    card_types = {
        'planeswalker': ['planeswalker'],
        'creature': ['creature'],
        'artifact_creature': ['artifact creature', 'artifact ‚Äî creature'],
        'enchantment_creature': ['enchantment creature', 'enchantment ‚Äî creature'], 
        'instant': ['instant'],
        'sorcery': ['sorcery'],
        'enchantment': ['enchantment'],
        'artifact': ['artifact'],
        'land': ['land'],  # Non-basic lands
        'battle': ['battle'],
        'tribal': ['tribal'],
        'conspiracy': ['conspiracy']
    }
    
    # Check for each type (order matters for compound types)
    for card_type, patterns in card_types.items():
        for pattern in patterns:
            if pattern in type_str:
                return card_type.replace('_', ' ').title()
    
    return 'Other'

# Apply type extraction
df_analysis['primary_type'] = df_analysis[type_col].apply(extract_primary_card_type)

# Check type distribution
print("Primary Type Distribution:")
type_counts = df_analysis['primary_type'].value_counts()
for card_type, count in type_counts.items():
    percentage = (count / len(df_analysis)) * 100
    print(f"  {card_type:<20}: {count:>7,} cards ({percentage:>5.1f}%)")

print(f"\nTotal unique types identified: {len(type_counts)}")
print(f"Cards successfully categorized: {len(df_analysis):,}")

# Show some examples
print(f"\nSample type categorizations:")
sample_types = df_analysis[[type_col, 'primary_type', 'name']].dropna().drop_duplicates('primary_type').head(10)
for _, row in sample_types.iterrows():
    name_str = str(row['name'])[:25] if pd.notna(row['name']) else 'Unknown'
    type_str = str(row[type_col])[:30] if pd.notna(row[type_col]) else 'Unknown'
    print(f"  {name_str:<25} | {type_str:<30} | {row['primary_type']}")

Primary Type Distribution:
  Creature            : 141,868 cards ( 50.7%)
  Instant             :  33,602 cards ( 12.0%)
  Sorcery             :  30,848 cards ( 11.0%)
  Enchantment         :  26,863 cards (  9.6%)
  Artifact            :  22,830 cards (  8.2%)
  Land                :  19,220 cards (  6.9%)
  Planeswalker        :   3,362 cards (  1.2%)
  Other               :   1,184 cards (  0.4%)
  Unknown             :     210 cards (  0.1%)
  Conspiracy          :     105 cards (  0.0%)

Total unique types identified: 10
Cards successfully categorized: 280,092

Sample type categorizations:
  Fury Sliver               | Creature ‚Äî Sliver              | Creature
  Web                       | Enchantment ‚Äî Aura             | Enchantment
  Wastewood Verge           | Land                           | Land
  Surge of Brilliance       | Instant                        | Instant
  Wildcall                  | Sorcery                        | Sorcery
  Fell Beast's Shriek // Fe | Card //

## Calculate Comprehensive Statistics by Card Type

Group data by card type and calculate detailed statistics including count, min, max, mean, median, and percentiles.

In [17]:
# Calculate comprehensive statistics by card type
def calculate_price_statistics(group):
    """Calculate comprehensive price statistics for a group"""
    prices = group['price_clean']
    
    return pd.Series({
        'count': len(prices),
        'min_price': prices.min(),
        'max_price': prices.max(), 
        'mean_price': prices.mean(),
        'median_price': prices.median(),
        'std_price': prices.std(),
        'q25_price': prices.quantile(0.25),
        'q75_price': prices.quantile(0.75),
        'q90_price': prices.quantile(0.90),
        'q95_price': prices.quantile(0.95),
        'q99_price': prices.quantile(0.99)
    })

# Group by primary type and calculate statistics
print("Calculating statistics by card type...")
stats_by_type = df_analysis.groupby('primary_type').apply(calculate_price_statistics, include_groups=False)

# Sort by count (most common types first)
stats_by_type = stats_by_type.sort_values('count', ascending=False)

print(f"Statistics calculated for {len(stats_by_type)} card types")
print(f"Based on {df_analysis['count'].sum() if 'count' in df_analysis.columns else len(df_analysis)} total cards")

# Preview the statistics
stats_by_type.head()

Calculating statistics by card type...
Statistics calculated for 10 card types
Based on 280092 total cards


Unnamed: 0_level_0,count,min_price,max_price,mean_price,median_price,std_price,q25_price,q75_price,q90_price,q95_price,q99_price
primary_type,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
Creature,141868.0,0.01,3418.22,2.151897,0.21,19.447895,0.09,0.68,3.59,8.22,32.67
Instant,33602.0,0.01,2968.93,2.570665,0.18,27.275866,0.08,0.56,3.279,8.32,44.8574
Sorcery,30848.0,0.01,2650.0,2.84833,0.25,31.503408,0.11,0.81,3.67,8.03,33.64
Enchantment,26863.0,0.01,1450.0,3.507712,0.33,23.530722,0.14,1.56,5.938,14.0,41.89
Artifact,22830.0,0.01,3290.0,7.735823,0.42,72.975639,0.17,2.34,9.141,20.4465,90.06


## Display Comprehensive Statistical Summary Table

Create a beautifully formatted table showing all calculated statistics organized by card type.

In [18]:
# Create a comprehensive summary table
summary_table = stats_by_type.copy()

# Round monetary values to 2 decimal places
price_columns = ['min_price', 'max_price', 'mean_price', 'median_price', 'std_price', 
                'q25_price', 'q75_price', 'q90_price', 'q95_price', 'q99_price']

for col in price_columns:
    summary_table[col] = summary_table[col].round(2)

# Format for display
display_table = summary_table[['count', 'min_price', 'q25_price', 'median_price', 
                              'mean_price', 'q75_price', 'q90_price', 'max_price', 'std_price']].copy()

# Rename columns for better readability
display_table.columns = ['Count', 'Min $', '25th %', 'Median $', 'Mean $', '75th %', '90th %', 'Max $', 'Std Dev']

# Format count column as integers
display_table['Count'] = display_table['Count'].astype(int)

print("="*100)
print("MTG CARD PRICING STATISTICS BY TYPE")
print("="*100)
print("(Excludes Basic Lands)")
print("")

# Display the table with nice formatting
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.float_format', '${:.2f}'.format)

print(display_table.to_string())

print(f"\n{'='*100}")
print(f"SUMMARY")
print(f"{'='*100}")
print(f"Total card types analyzed: {len(display_table)}")
print(f"Total cards with pricing: {display_table['Count'].sum():,}")
print(f"Overall price range: ${summary_table['min_price'].min():.2f} - ${summary_table['max_price'].max():.2f}")
print(f"Dataset median price: ${df_analysis['price_clean'].median():.2f}")
print(f"Dataset mean price: ${df_analysis['price_clean'].mean():.2f}")

# Reset display options
pd.reset_option('display.max_columns')
pd.reset_option('display.width')
pd.reset_option('display.float_format')

MTG CARD PRICING STATISTICS BY TYPE
(Excludes Basic Lands)

               Count  Min $  25th %  Median $  Mean $  75th %  90th %    Max $  Std Dev
primary_type                                                                           
Creature      141868  $0.01   $0.09     $0.21   $2.15   $0.68   $3.59 $3418.22   $19.45
Instant        33602  $0.01   $0.08     $0.18   $2.57   $0.56   $3.28 $2968.93   $27.28
Sorcery        30848  $0.01   $0.11     $0.25   $2.85   $0.81   $3.67 $2650.00   $31.50
Enchantment    26863  $0.01   $0.14     $0.33   $3.51   $1.56   $5.94 $1450.00   $23.53
Artifact       22830  $0.01   $0.17     $0.42   $7.74   $2.34   $9.14 $3290.00   $72.98
Land           19220  $0.03   $0.18     $0.38   $7.92   $2.72  $13.35 $2700.00   $64.81
Planeswalker    3362  $0.12   $0.71     $1.98   $4.73   $4.63  $10.35  $354.47   $13.27
Other           1184  $0.04   $0.32     $1.39   $4.07   $4.03   $9.25  $154.97    $9.89
Unknown          210  $0.86   $2.22     $5.38   $9.41  $11.8

## Additional Analysis: Price Distribution Insights

Examine interesting patterns and insights from the pricing data by card type.

In [19]:
# Additional insights and analysis
print("="*80)
print("PRICE DISTRIBUTION INSIGHTS")
print("="*80)

# Most expensive card by type
print("\nüî• MOST EXPENSIVE CARDS BY TYPE:")
for card_type in stats_by_type.head(8).index:  # Top 8 most common types
    type_cards = df_analysis[df_analysis['primary_type'] == card_type]
    most_expensive = type_cards.loc[type_cards['price_clean'].idxmax()]
    print(f"  {card_type:<15}: ${most_expensive['price_clean']:>7.2f} - {most_expensive['name']}")

# Types with highest average prices
print(f"\nüí∞ HIGHEST AVERAGE PRICES BY TYPE:")
high_value_types = stats_by_type.sort_values('mean_price', ascending=False).head(6)
for card_type, stats in high_value_types.iterrows():
    if stats['count'] >= 10:  # Only include types with reasonable sample size
        print(f"  {card_type:<15}: ${stats['mean_price']:>7.2f} avg (n={stats['count']:,})")

# Price volatility (highest standard deviation)
print(f"\nüìà MOST VOLATILE PRICING (High Std Dev):")
volatile_types = stats_by_type.sort_values('std_price', ascending=False).head(6)
for card_type, stats in volatile_types.iterrows():
    if stats['count'] >= 10:
        cv = stats['std_price'] / stats['mean_price']  # Coefficient of variation
        print(f"  {card_type:<15}: ${stats['std_price']:>7.2f} std dev, CV: {cv:.2f}")

# Types with most cards
print(f"\nüìä LARGEST CARD TYPE CATEGORIES:")
for card_type, stats in stats_by_type.head(5).iterrows():
    percentage = (stats['count'] / display_table['Count'].sum()) * 100
    print(f"  {card_type:<15}: {stats['count']:>7,} cards ({percentage:>4.1f}% of dataset)")

print(f"\n{'='*80}")

# Close database connection
client.close()
print("‚úÖ Database connection closed.")
print("Analysis complete! üéâ")
print(f"Notebook location: /workspaces/mtgecorec/notebooks/card_pricing_analysis.ipynb")

PRICE DISTRIBUTION INSIGHTS

üî• MOST EXPENSIVE CARDS BY TYPE:
  Creature       : $3418.22 - Cloud, Midgar Mercenary
  Instant        : $2968.93 - Ancestral Recall
  Sorcery        : $2650.00 - Time Walk
  Enchantment    : $1450.00 - Raging River
  Artifact       : $3290.00 - Mox Sapphire
  Land           : $2700.00 - The Tabernacle at Pendrell Vale
  Planeswalker   : $ 354.47 - Jace, the Mind Sculptor
  Other          : $ 154.97 - Titania

üí∞ HIGHEST AVERAGE PRICES BY TYPE:
  Unknown        : $   9.41 avg (n=210.0)
  Land           : $   7.92 avg (n=19,220.0)
  Artifact       : $   7.74 avg (n=22,830.0)
  Planeswalker   : $   4.73 avg (n=3,362.0)
  Other          : $   4.07 avg (n=1,184.0)
  Enchantment    : $   3.51 avg (n=26,863.0)

üìà MOST VOLATILE PRICING (High Std Dev):
  Artifact       : $  72.98 std dev, CV: 9.43
  Land           : $  64.81 std dev, CV: 8.19
  Sorcery        : $  31.50 std dev, CV: 11.06
  Instant        : $  27.28 std dev, CV: 10.61
  Enchantment    : $  

## Quick Check: Today's Pricing Collection Status

Simple count of today's pricing records and unique cards processed.

In [13]:
# Quick status check for today's pricing collection
from datetime import date

# Get today's date
today = date.today().isoformat()
print(f"üìä Pricing Collection Status for {today}")
print("=" * 50)

# Get fresh database connection
client = get_mongo_client()
pricing_collection = get_collection(client, "mtgecorec", "card_pricing_daily")
cards_collection = get_collection(client, "mtgecorec", "cards")

# Count total records for today
total_records_today = pricing_collection.count_documents({'date': today})
print(f"Total pricing records for today: {total_records_today:,}")

# Count unique cards with pricing for today
unique_cards_today = len(set(pricing_collection.distinct('scryfall_id', {'date': today})))
print(f"Unique cards with pricing for today: {unique_cards_today:,}")

# Get total cards in database for comparison
total_cards = cards_collection.count_documents({})
print(f"Total cards in database: {total_cards:,}")

# Calculate coverage and remaining
coverage_pct = (unique_cards_today / total_cards) * 100 if total_cards > 0 else 0
remaining_cards = total_cards - unique_cards_today

print(f"Coverage: {coverage_pct:.1f}%")
print(f"Cards remaining: {remaining_cards:,}")

# Status check
if remaining_cards < 1000:
    print("‚úÖ COLLECTION COMPLETE!")
else:
    print("‚ö†Ô∏è  Collection still in progress")
    
# Close connection
client.close()

üìä Pricing Collection Status for 2025-12-25
Total pricing records for today: 57,618
Unique cards with pricing for today: 17,922
Total cards in database: 110,031
Coverage: 16.3%
Cards remaining: 92,109
‚ö†Ô∏è  Collection still in progress


In [12]:
# Reset stuck pricing lock and trigger new collection
import requests
import json

print("üîß Azure Functions Management")
print("=" * 40)

# Correct Azure Functions URL
base_url = "https://mtgecorecfunc-akeuc0excwg9h7dd.westus3-01.azurewebsites.net"

# Step 1: Reset the lock
print("1. Resetting pricing collection lock...")
try:
    reset_response = requests.post(
        f"{base_url}/api/pricing/reset_lock", 
        timeout=30
    )
    if reset_response.status_code == 200:
        print("‚úÖ Lock reset successfully")
        reset_data = reset_response.json()
        print(f"   Was running: {reset_data.get('was_running', 'Unknown')}")
    else:
        print(f"‚ùå Lock reset failed: {reset_response.status_code}")
        print(f"Response: {reset_response.text[:200]}...")
except Exception as e:
    print(f"‚ùå Lock reset error: {e}")

print()

# Step 2: Trigger new collection with fixed auto-chaining
print("2. Triggering pricing collection with auto-chaining...")
try:
    trigger_response = requests.get(
        f"{base_url}/api/pricing/collect",
        timeout=30
    )
    if trigger_response.status_code == 200:
        print("‚úÖ Collection triggered successfully")
        trigger_data = trigger_response.json()
        print(f"   Status: {trigger_data.get('status', 'Unknown')}")
        if 'batch_info' in trigger_data:
            batch_info = trigger_data['batch_info']
            print(f"   Batch size: {batch_info.get('batch_size', 'Unknown'):,}")
            print(f"   Auto-continue: {batch_info.get('auto_continue', 'Unknown')}")
    else:
        print(f"‚ùå Collection trigger failed: {trigger_response.status_code}")
        print(f"Response: {trigger_response.text[:200]}...")
except Exception as e:
    print(f"‚ùå Collection trigger error: {e}")

print("\nüöÄ Check Azure Functions logs to monitor progress!")
print("Expected: Should process all 92k+ remaining cards with auto-chaining")

üîß Azure Functions Management
1. Resetting pricing collection lock...
‚úÖ Lock reset successfully
   Was running: False

2. Triggering pricing collection with auto-chaining...
‚ùå Collection trigger error: HTTPSConnectionPool(host='mtgecorecfunc-akeuc0excwg9h7dd.westus3-01.azurewebsites.net', port=443): Read timed out. (read timeout=30)

üöÄ Check Azure Functions logs to monitor progress!
Expected: Should process all 92k+ remaining cards with auto-chaining


## Analysis: Cards WITHOUT Pricing Records

Identify which card types are most frequently missing pricing data. This helps us understand which cards to prioritize (or skip) in the pricing pipeline.

In [6]:
# Analyze cards WITHOUT pricing records
print("üîç ANALYZING CARDS WITHOUT PRICING RECORDS")
print("=" * 60)

# Get fresh database connection
client = get_mongo_client()
cards_collection = get_collection(client, "mtgecorec", "cards")
pricing_collection = get_collection(client, "mtgecorec", "card_pricing_daily")

# Get today's date for analysis
today = date.today().isoformat()
print(f"Analysis date: {today}")

# Load all cards
print("Loading all cards...")
all_cards_df = pd.DataFrame(list(cards_collection.find({})))
print(f"Total cards: {len(all_cards_df):,}")

# Get cards that have pricing for today
print("Finding cards with pricing...")
cards_with_pricing_today = set(pricing_collection.distinct('scryfall_id', {'date': today}))
print(f"Cards with pricing: {len(cards_with_pricing_today):,}")

# Find cards WITHOUT pricing
all_cards_df['has_pricing'] = all_cards_df['id'].isin(cards_with_pricing_today)
cards_without_pricing = all_cards_df[~all_cards_df['has_pricing']].copy()

print(f"Cards WITHOUT pricing: {len(cards_without_pricing):,}")
print(f"Missing pricing rate: {len(cards_without_pricing)/len(all_cards_df)*100:.1f}%")

client.close()

üîç ANALYZING CARDS WITHOUT PRICING RECORDS
Analysis date: 2025-12-28
Loading all cards...
Total cards: 110,031
Finding cards with pricing...
Cards with pricing: 90,327
Cards WITHOUT pricing: 19,704
Missing pricing rate: 17.9%


In [7]:
# Extract card types for cards WITHOUT pricing
print("\nüè∑Ô∏è CARD TYPE ANALYSIS FOR CARDS WITHOUT PRICING")
print("=" * 60)

# Use the same type extraction function from earlier
def extract_primary_card_type(type_line):
    """Extract the primary card type from a type line"""
    if pd.isna(type_line):
        return 'Unknown'
    
    type_str = str(type_line).lower()
    
    # Define card type priorities (more specific first)
    card_types = {
        'planeswalker': ['planeswalker'],
        'creature': ['creature'],
        'artifact_creature': ['artifact creature', 'artifact ‚Äî creature'],
        'enchantment_creature': ['enchantment creature', 'enchantment ‚Äî creature'], 
        'instant': ['instant'],
        'sorcery': ['sorcery'],
        'enchantment': ['enchantment'],
        'artifact': ['artifact'],
        'land': ['land'],
        'battle': ['battle'],
        'tribal': ['tribal'],
        'conspiracy': ['conspiracy'],
        'token': ['token']
    }
    
    # Check for each type (order matters for compound types)
    for card_type, patterns in card_types.items():
        for pattern in patterns:
            if pattern in type_str:
                return card_type.replace('_', ' ').title()
    
    return 'Other'

# Apply type extraction to cards without pricing
type_col = 'type_line' if 'type_line' in cards_without_pricing.columns else 'types'
cards_without_pricing['primary_type'] = cards_without_pricing[type_col].apply(extract_primary_card_type)

# Also analyze all cards for comparison
all_cards_df['primary_type'] = all_cards_df[type_col].apply(extract_primary_card_type)

print("Analysis complete!")


üè∑Ô∏è CARD TYPE ANALYSIS FOR CARDS WITHOUT PRICING
Analysis complete!


In [8]:
# Compare card types: WITH pricing vs WITHOUT pricing
print("üìä PRICING COVERAGE BY CARD TYPE")
print("=" * 80)

# Count cards by type (with and without pricing)
without_pricing_counts = cards_without_pricing['primary_type'].value_counts()
all_cards_counts = all_cards_df['primary_type'].value_counts()

# Create comparison table
comparison_data = []
for card_type in all_cards_counts.index:
    total_cards = all_cards_counts[card_type]
    missing_cards = without_pricing_counts.get(card_type, 0)
    with_pricing = total_cards - missing_cards
    missing_rate = (missing_cards / total_cards) * 100 if total_cards > 0 else 0
    
    comparison_data.append({
        'Card Type': card_type,
        'Total Cards': total_cards,
        'With Pricing': with_pricing,
        'Without Pricing': missing_cards,
        'Missing Rate %': missing_rate
    })

# Convert to DataFrame and sort by missing rate (highest first)
comparison_df = pd.DataFrame(comparison_data)
comparison_df = comparison_df.sort_values('Missing Rate %', ascending=False)

# Display the results
print("CARD TYPES MOST FREQUENTLY WITHOUT PRICING:")
print("-" * 80)
print(f"{'Card Type':<20} {'Total':<8} {'With':<8} {'Without':<8} {'Missing %':<10}")
print("-" * 80)

for _, row in comparison_df.head(15).iterrows():  # Show top 15
    print(f"{row['Card Type']:<20} {row['Total Cards']:<8,} "
          f"{row['With Pricing']:<8,} {row['Without Pricing']:<8,} "
          f"{row['Missing Rate %']:<10.1f}%")

print("-" * 80)
print(f"Overall missing rate: {len(cards_without_pricing)/len(all_cards_df)*100:.1f}%")

# Show interesting insights
print(f"\nüéØ KEY INSIGHTS:")
print(f"‚Ä¢ Worst coverage: {comparison_df.iloc[0]['Card Type']} ({comparison_df.iloc[0]['Missing Rate %']:.1f}% missing)")
print(f"‚Ä¢ Best coverage: {comparison_df.iloc[-1]['Card Type']} ({comparison_df.iloc[-1]['Missing Rate %']:.1f}% missing)")

# Cards most worth prioritizing (high total count + high missing rate)
comparison_df['priority_score'] = comparison_df['Without Pricing'] * comparison_df['Missing Rate %'] / 100
top_priority = comparison_df.sort_values('priority_score', ascending=False).iloc[0]
print(f"‚Ä¢ Highest impact to fix: {top_priority['Card Type']} ({top_priority['Without Pricing']:,} missing cards)")

print(f"\n{'=' * 80}")

üìä PRICING COVERAGE BY CARD TYPE
CARD TYPES MOST FREQUENTLY WITHOUT PRICING:
--------------------------------------------------------------------------------
Card Type            Total    With     Without  Missing % 
--------------------------------------------------------------------------------
Battle               1        0        1        100.0     %
Token                32       2        30       93.8      %
Other                3,556    668      2,888    81.2      %
Creature             52,328   43,321   9,007    17.2      %
Conspiracy           30       25       5        16.7      %
Artifact             8,702    7,315    1,387    15.9      %
Instant              11,176   9,531    1,645    14.7      %
Sorcery              11,012   9,407    1,605    14.6      %
Enchantment          9,705    8,355    1,350    13.9      %
Land                 11,790   10,211   1,579    13.4      %
Planeswalker         1,624    1,421    203      12.5      %
Unknown              75       71       4

In [11]:
# Analyze characteristics of cards WITHOUT pricing
print("üîé CHARACTERISTICS OF CARDS WITHOUT PRICING")
print("=" * 60)

# Look at other attributes that might explain missing pricing
attributes_to_check = ['layout', 'set_type', 'rarity', 'digital', 'games', 'border_color', 'frame']

for attr in attributes_to_check:
    if attr in cards_without_pricing.columns:
        print(f"\nüìã {attr.upper()} breakdown for cards without pricing:")
        
        try:
            # Get counts for cards without pricing
            without_counts = cards_without_pricing[attr].value_counts().head(8)
            
            for value, count in without_counts.items():
                try:
                    # Handle different data types safely
                    if isinstance(value, (list, tuple, np.ndarray)):
                        # For list/array values, convert to string for comparison
                        value_str = str(value)
                        # Find matching records by converting to string
                        total_with_attr = all_cards_df[all_cards_df[attr].astype(str) == value_str]
                        display_value = value_str[:15] + "..." if len(value_str) > 15 else value_str
                    else:
                        # For scalar values, normal comparison
                        total_with_attr = all_cards_df[all_cards_df[attr] == value]
                        display_value = str(value)
                    
                    missing_rate = (count / len(total_with_attr)) * 100 if len(total_with_attr) > 0 else 0
                    print(f"  {display_value:<20}: {count:>6,} cards ({missing_rate:>5.1f}% of all {display_value})")
                    
                except Exception as e:
                    # If individual value comparison fails, show what we can
                    print(f"  {str(value):<20}: {count:>6,} cards (comparison failed)")
                    
        except Exception as e:
            print(f"  Error analyzing {attr}: {str(e)}")

print(f"\n{'=' * 60}")
print("üöÄ RECOMMENDATIONS FOR PIPELINE OPTIMIZATION:")
print("=" * 60)

# Generate recommendations based on the analysis
high_missing_types = comparison_df[comparison_df['Missing Rate %'] > 50]['Card Type'].tolist()
low_missing_types = comparison_df[comparison_df['Missing Rate %'] < 10]['Card Type'].tolist()

print(f"‚ö†Ô∏è  SKIP these card types (>50% missing pricing):")
for card_type in high_missing_types[:5]:
    print(f"   ‚Ä¢ {card_type}")

print(f"\n‚úÖ PRIORITIZE these card types (<10% missing pricing):")
for card_type in low_missing_types[:5]:
    print(f"   ‚Ä¢ {card_type}")

print(f"\nüí° POTENTIAL FILTERS:")
if 'layout' in cards_without_pricing.columns:
    try:
        problem_layouts = cards_without_pricing['layout'].value_counts().head(3).index.tolist()
        print(f"   ‚Ä¢ Skip layouts: {', '.join([str(x) for x in problem_layouts])}")
    except:
        print(f"   ‚Ä¢ Skip layouts: (analysis failed)")

if 'set_type' in cards_without_pricing.columns:
    try:
        problem_set_types = cards_without_pricing['set_type'].value_counts().head(3).index.tolist()  
        print(f"   ‚Ä¢ Skip set types: {', '.join([str(x) for x in problem_set_types])}")
    except:
        print(f"   ‚Ä¢ Skip set types: (analysis failed)")

print(f"\nThis analysis will help optimize the pricing pipeline! üéØ")

üîé CHARACTERISTICS OF CARDS WITHOUT PRICING

üìã LAYOUT breakdown for cards without pricing:
  normal              : 14,855 cards ( 14.7% of all normal)
  art_series          :  2,253 cards ( 98.9% of all art_series)
  token               :  1,847 cards ( 65.2% of all token)
  transform           :    140 cards ( 13.6% of all transform)
  planar              :     98 cards ( 29.7% of all planar)


  double_faced_token  :     82 cards ( 71.9% of all double_faced_token)
  split               :     79 cards ( 23.0% of all split)
  emblem              :     73 cards ( 54.5% of all emblem)

üìã SET_TYPE breakdown for cards without pricing:
  memorabilia         :  3,224 cards ( 60.0% of all memorabilia)
  expansion           :  3,113 cards ( 10.4% of all expansion)
  masters             :  2,786 cards ( 19.2% of all masters)
  token               :  1,806 cards ( 65.7% of all token)
  promo               :  1,597 cards ( 14.3% of all promo)
  commander           :  1,252 cards (  8.9% of all commander)
  box                 :  1,129 cards ( 23.1% of all box)
  draft_innovation    :  1,116 cards ( 13.7% of all draft_innovation)

üìã RARITY breakdown for cards without pricing:
  common              :  8,892 cards ( 24.9% of all common)
  rare                :  5,410 cards ( 13.9% of all rare)
  uncommon            :  3,890 cards ( 15.2% of all uncommon)
  mythic              :  1,433

In [13]:
# Deep dive into "Other" card types without pricing
print("üî¨ DEEP DIVE: 'OTHER' CARD TYPES WITHOUT PRICING")
print("=" * 70)

# Filter for "Other" cards without pricing
other_cards_no_pricing = cards_without_pricing[cards_without_pricing['primary_type'] == 'Other']

if len(other_cards_no_pricing) > 0:
    print(f"Found {len(other_cards_no_pricing):,} 'Other' cards without pricing")
    print(f"This represents {len(other_cards_no_pricing)/len(cards_without_pricing)*100:.1f}% of all cards without pricing")
    
    # Analyze raw type_line values for "Other" cards
    print(f"\nüìã RAW TYPE LINES for 'Other' cards without pricing:")
    other_type_lines = other_cards_no_pricing[type_col].value_counts().head(15)
    
    for type_line, count in other_type_lines.items():
        # Also show percentage of total "Other" cards this represents
        pct_of_other = (count / len(other_cards_no_pricing)) * 100
        print(f"  {str(type_line):<40}: {count:>4,} cards ({pct_of_other:>4.1f}%)")
    
    # Show some sample card names for context
    print(f"\nüìù SAMPLE CARDS in 'Other' category:")
    sample_other = other_cards_no_pricing[['name', type_col, 'set', 'rarity']].head(10)
    
    for i, (_, row) in enumerate(sample_other.iterrows(), 1):
        name = str(row['name'])[:30] if pd.notna(row['name']) else 'Unknown'
        type_line = str(row[type_col])[:35] if pd.notna(row[type_col]) else 'Unknown'  
        set_code = str(row['set'])[:8] if pd.notna(row['set']) else 'Unknown'
        rarity = str(row['rarity'])[:10] if pd.notna(row['rarity']) else 'Unknown'
        
        print(f"  {i:2d}. {name:<30} | {type_line:<35} | {set_code:<8} | {rarity}")
    
    # Identify patterns that could be added to our type classifier
    print(f"\nüí° POTENTIAL NEW TYPE PATTERNS TO ADD:")
    
    # Look for common patterns in "Other" type lines
    common_patterns = {}
    for type_line in other_cards_no_pricing[type_col].dropna():
        type_str = str(type_line).lower()
        
        # Check for patterns we might have missed
        potential_patterns = [
            ('Emblem', 'emblem'),
            ('Scheme', 'scheme'), 
            ('Phenomenon', 'phenomenon'),
            ('Plane', 'plane'),
            ('Vanguard', 'vanguard'),
            ('Dungeon', 'dungeon'),
            ('Case', 'case'),
            ('Role', 'role'),
            ('Stickers', 'sticker'),
            ('Attraction', 'attraction')
        ]
        
        for pattern_name, pattern in potential_patterns:
            if pattern in type_str:
                common_patterns[pattern_name] = common_patterns.get(pattern_name, 0) + 1
    
    # Show patterns found
    if common_patterns:
        sorted_patterns = sorted(common_patterns.items(), key=lambda x: x[1], reverse=True)
        for pattern_name, count in sorted_patterns:
            if count >= 5:  # Only show patterns with at least 5 cards
                print(f"  ‚Ä¢ Add '{pattern_name}': {count:,} cards found")
    else:
        print("  ‚Ä¢ No obvious patterns found - these might be truly miscellaneous cards")
        
else:
    print("No 'Other' cards found without pricing.")

print(f"\n{'=' * 70}")

üî¨ DEEP DIVE: 'OTHER' CARD TYPES WITHOUT PRICING
Found 2,888 'Other' cards without pricing
This represents 14.7% of all cards without pricing

üìã RAW TYPE LINES for 'Other' cards without pricing:
  Card // Card                            : 2,307 cards (79.9%)
  Card                                    :  355 cards (12.3%)
  Emblem                                  :   37 cards ( 1.3%)
  Vanguard                                :   19 cards ( 0.7%)
  Phenomenon                              :   10 cards ( 0.3%)
  Plane ‚Äî MagicCon                        :    9 cards ( 0.3%)
  Plane ‚Äî Secret Lair                     :    9 cards ( 0.3%)
  Plane ‚Äî Chicago                         :    8 cards ( 0.3%)
  Scheme                                  :    8 cards ( 0.3%)
  Stickers                                :    6 cards ( 0.2%)
  Plane ‚Äî Las Vegas                       :    6 cards ( 0.2%)
  Boss                                    :    4 cards ( 0.1%)
  Ongoing Scheme                   

## Testing Double-Faced Card Pricing Issues

Test whether "Card // Card" naming is causing pricing issues by querying Scryfall with just the first face name + set + collector number.

In [16]:
# Find double-faced cards without pricing to test
print("üî¨ TESTING DOUBLE-FACED CARD PRICING ISSUES")
print("=" * 60)

# Find cards with " // " in the name that don't have pricing
double_faced_cards = cards_without_pricing[
    cards_without_pricing['name'].str.contains(' // ', na=False)
].copy()

print(f"Found {len(double_faced_cards):,} double-faced cards without pricing")

if len(double_faced_cards) > 0:
    # Show sample double-faced cards
    print(f"\nüìã SAMPLE DOUBLE-FACED CARDS WITHOUT PRICING:")
    sample_df_cards = double_faced_cards[['name', 'set', 'collector_number', 'rarity']].head(8)
    
    for i, (_, row) in enumerate(sample_df_cards.iterrows(), 1):
        name = str(row['name'])[:50] if pd.notna(row['name']) else 'Unknown'
        set_code = str(row['set'])[:8] if pd.notna(row['set']) else 'Unknown'
        coll_num = str(row['collector_number'])[:10] if pd.notna(row['collector_number']) else 'Unknown'
        rarity = str(row['rarity'])[:10] if pd.notna(row['rarity']) else 'Unknown'
        
        print(f"  {i}. {name:<50} | {set_code:<8} | #{coll_num:<10} | {rarity}")
    
    # Extract first face names
    print(f"\nüéØ EXTRACTING FIRST FACE NAMES:")
    test_cards = []
    
    for _, row in sample_df_cards.head(3).iterrows():  # Test with 3 cards
        full_name = str(row['name'])
        if ' // ' in full_name:
            first_face = full_name.split(' // ')[0].strip()
            test_cards.append({
                'full_name': full_name,
                'first_face': first_face,
                'set': str(row['set']),
                'collector_number': str(row['collector_number']),
                'scryfall_id': str(row['id']) if 'id' in row else 'Unknown'
            })
            print(f"  ‚Ä¢ Full: {full_name}")
            print(f"    First face: '{first_face}' | Set: {row['set']} | #{row['collector_number']}")
            print()
    
    # Store for next step
    print(f"Prepared {len(test_cards)} cards for Scryfall testing")
    
else:
    print("No double-faced cards found without pricing.")
    test_cards = []

üî¨ TESTING DOUBLE-FACED CARD PRICING ISSUES
Found 2,677 double-faced cards without pricing

üìã SAMPLE DOUBLE-FACED CARDS WITHOUT PRICING:
  1. Clearwater Pathway // Clearwater Pathway           | aznr     | #25         | common
  2. Punchcard // Punchcard                             | teoe     | #12         | common
  3. Hushwood Verge // Hushwood Verge                   | adsk     | #26         | common
  4. Forgehammer Centurion // Forgehammer Centurion     | aone     | #26         | common
  5. Atraxa's Skitterfang // Atraxa's Skitterfang       | aone     | #55         | common
  6. Sakiko, Mother of Summer // Sakiko, Mother of Summ | acmm     | #24         | common
  7. Raff, Weatherlight Stalwart // Raff, Weatherlight  | admu     | #76         | common
  8. Verdant Outrider // Verdant Outrider               | awoe     | #28         | common

üéØ EXTRACTING FIRST FACE NAMES:
  ‚Ä¢ Full: Clearwater Pathway // Clearwater Pathway
    First face: 'Clearwater Pathway' | Set: aznr |

In [17]:
# Test Scryfall API queries for double-faced cards
import requests
import time

print("üîç TESTING SCRYFALL API QUERIES")
print("=" * 60)

if test_cards:
    for i, card_info in enumerate(test_cards, 1):
        print(f"\nüì¶ TEST {i}: {card_info['full_name']}")
        print("-" * 50)
        
        # Method 1: Query by full name (current approach)
        print("Method 1: Full name query")
        try:
            response1 = requests.get(
                f"https://api.scryfall.com/cards/named",
                params={'fuzzy': card_info['full_name']},
                timeout=10
            )
            if response1.status_code == 200:
                data1 = response1.json()
                print(f"  ‚úÖ Success: Found '{data1.get('name', 'Unknown')}'")
                if 'prices' in data1:
                    usd_price = data1['prices'].get('usd')
                    print(f"  üí∞ USD Price: ${usd_price}" if usd_price else "  üí∞ USD Price: None")
                else:
                    print("  üí∞ No pricing data")
            else:
                print(f"  ‚ùå Failed: {response1.status_code}")
        except Exception as e:
            print(f"  ‚ùå Error: {e}")
        
        time.sleep(0.1)  # Rate limiting
        
        # Method 2: Query by set + collector number (proposed fix)
        print("Method 2: Set + collector number query")
        try:
            response2 = requests.get(
                f"https://api.scryfall.com/cards/{card_info['set']}/{card_info['collector_number']}",
                timeout=10
            )
            if response2.status_code == 200:
                data2 = response2.json()
                print(f"  ‚úÖ Success: Found '{data2.get('name', 'Unknown')}'")
                if 'prices' in data2:
                    usd_price = data2['prices'].get('usd')
                    print(f"  üí∞ USD Price: ${usd_price}" if usd_price else "  üí∞ USD Price: None")
                else:
                    print("  üí∞ No pricing data")
            else:
                print(f"  ‚ùå Failed: {response2.status_code}")
        except Exception as e:
            print(f"  ‚ùå Error: {e}")
        
        time.sleep(0.1)  # Rate limiting
        
        # Method 3: Query by first face name (alternative)
        print("Method 3: First face name query")
        try:
            response3 = requests.get(
                f"https://api.scryfall.com/cards/named",
                params={'fuzzy': card_info['first_face']},
                timeout=10
            )
            if response3.status_code == 200:
                data3 = response3.json()
                print(f"  ‚úÖ Success: Found '{data3.get('name', 'Unknown')}'")
                if 'prices' in data3:
                    usd_price = data3['prices'].get('usd')
                    print(f"  üí∞ USD Price: ${usd_price}" if usd_price else "  üí∞ USD Price: None")
                else:
                    print("  üí∞ No pricing data")
            else:
                print(f"  ‚ùå Failed: {response3.status_code}")
        except Exception as e:
            print(f"  ‚ùå Error: {e}")
        
        print()
        time.sleep(0.5)  # Longer pause between cards
        
else:
    print("No test cards available - run the previous cell first")

print("=" * 60)
print("üéØ ANALYSIS:")
print("Compare the success rates and pricing availability between methods.")
print("Method 2 (set + collector) is often most reliable for double-faced cards.")

üîç TESTING SCRYFALL API QUERIES

üì¶ TEST 1: Clearwater Pathway // Clearwater Pathway
--------------------------------------------------
Method 1: Full name query
  ‚úÖ Success: Found 'Clearwater Pathway // Murkwater Pathway'
  üí∞ USD Price: $4.11
Method 2: Set + collector number query
  ‚úÖ Success: Found 'Clearwater Pathway // Clearwater Pathway'
  üí∞ USD Price: None
Method 3: First face name query
  ‚úÖ Success: Found 'Clearwater Pathway // Murkwater Pathway'
  üí∞ USD Price: $4.11


üì¶ TEST 2: Punchcard // Punchcard
--------------------------------------------------
Method 1: Full name query
  ‚úÖ Success: Found 'Punchcard // Punchcard'
  üí∞ USD Price: None
Method 2: Set + collector number query
  ‚úÖ Success: Found 'Punchcard // Punchcard'
  üí∞ USD Price: None
Method 3: First face name query
  ‚úÖ Success: Found 'Punchcard // Punchcard'
  üí∞ USD Price: None


üì¶ TEST 3: Hushwood Verge // Hushwood Verge
--------------------------------------------------
Method 1: 

## Smart Filtering Analysis: Set Types and Codes

Based on the double-faced card testing, the real issue is that many cards are from **variant/alternate sets** that don't have market pricing. Let's analyze set types and set codes to create smart filtering rules for the pricing pipeline.

In [18]:
# Analyze set types and set codes for cards without pricing
print("üéØ SET ANALYSIS FOR SMART FILTERING")
print("=" * 70)

# Get fresh database connection for set analysis
client = get_mongo_client()
cards_collection = get_collection(client, "mtgecorec", "cards")
pricing_collection = get_collection(client, "mtgecorec", "card_pricing_daily")

# Load all cards fresh (in case variables were cleared)
print("Loading cards for set analysis...")
all_cards_df = pd.DataFrame(list(cards_collection.find({})))
print(f"Total cards loaded: {len(all_cards_df):,}")

# Get today's pricing data
today = date.today().isoformat()
cards_with_pricing_today = set(pricing_collection.distinct('scryfall_id', {'date': today}))
print(f"Cards with pricing today: {len(cards_with_pricing_today):,}")

# Mark cards with/without pricing
all_cards_df['has_pricing'] = all_cards_df['id'].isin(cards_with_pricing_today)
cards_without_pricing = all_cards_df[~all_cards_df['has_pricing']].copy()

print(f"Cards without pricing: {len(cards_without_pricing):,}")
print(f"Missing rate: {len(cards_without_pricing)/len(all_cards_df)*100:.1f}%")

client.close()

üéØ SET ANALYSIS FOR SMART FILTERING
Loading cards for set analysis...
Total cards loaded: 110,031
Cards with pricing today: 90,327
Cards without pricing: 19,704
Missing rate: 17.9%


In [19]:
# Analyze SET TYPE patterns for missing pricing
print("\nüìä SET TYPE ANALYSIS")
print("=" * 50)

if 'set_type' in cards_without_pricing.columns:
    # Set types without pricing
    set_types_no_pricing = cards_without_pricing['set_type'].value_counts()
    
    # All set types for comparison  
    all_set_types = all_cards_df['set_type'].value_counts()
    
    print("SET TYPES WITH HIGHEST MISSING RATE:")
    print(f"{'Set Type':<25} {'Total':<8} {'Missing':<8} {'Rate %':<8}")
    print("-" * 55)
    
    set_type_analysis = []
    for set_type in all_set_types.index[:15]:  # Top 15 set types
        total = all_set_types[set_type]
        missing = set_types_no_pricing.get(set_type, 0)
        missing_rate = (missing / total) * 100 if total > 0 else 0
        
        set_type_analysis.append({
            'set_type': set_type,
            'total': total,
            'missing': missing,
            'missing_rate': missing_rate
        })
        
        print(f"{str(set_type):<25} {total:<8,} {missing:<8,} {missing_rate:<8.1f}%")
    
    # Identify problematic set types
    print(f"\nüö® PROBLEMATIC SET TYPES (>70% missing):")
    problematic_set_types = []
    for item in set_type_analysis:
        if item['missing_rate'] > 70 and item['total'] >= 50:  # High missing rate, decent sample size
            problematic_set_types.append(item['set_type'])
            print(f"  ‚Ä¢ {item['set_type']}: {item['missing_rate']:.1f}% missing ({item['missing']:,}/{item['total']:,} cards)")
    
    print(f"\n‚úÖ GOOD SET TYPES (<20% missing):")
    good_set_types = []
    for item in set_type_analysis:
        if item['missing_rate'] < 20 and item['total'] >= 100:  # Low missing rate, good sample size
            good_set_types.append(item['set_type'])
            print(f"  ‚Ä¢ {item['set_type']}: {item['missing_rate']:.1f}% missing ({item['missing']:,}/{item['total']:,} cards)")
    
else:
    print("set_type column not found in data")
    problematic_set_types = []
    good_set_types = []


üìä SET TYPE ANALYSIS
SET TYPES WITH HIGHEST MISSING RATE:
Set Type                  Total    Missing  Rate %  
-------------------------------------------------------
expansion                 29,895   3,113    10.4    %
masters                   14,499   2,786    19.2    %
commander                 14,066   1,252    8.9     %
promo                     11,172   1,597    14.3    %
core                      9,629    1,094    11.4    %
draft_innovation          8,159    1,116    13.7    %
memorabilia               5,372    3,224    60.0    %
box                       4,897    1,129    23.1    %
token                     2,747    1,806    65.7    %
funny                     2,163    667      30.8    %
duel_deck                 1,945    174      8.9     %
masterpiece               1,460    191      13.1    %
starter                   1,124    285      25.4    %
alchemy                   939      939      100.0   %
planechase                647      65       10.0    %

üö® PROBLEMATIC SE

In [20]:
# Analyze SET CODE patterns for missing pricing
print("\nüìã SET CODE ANALYSIS")
print("=" * 50)

# Look at set codes (like 'aznr', 'teoe', 'adsk' from our double-faced card test)
if 'set' in cards_without_pricing.columns:
    # Set codes without pricing
    set_codes_no_pricing = cards_without_pricing['set'].value_counts().head(20)
    
    print("SET CODES WITH MOST MISSING CARDS:")
    print(f"{'Set Code':<10} {'Missing':<8} {'Sample Card Name':<40}")
    print("-" * 65)
    
    problematic_set_codes = []
    for set_code, missing_count in set_codes_no_pricing.items():
        # Get sample card name from this set
        sample_card = cards_without_pricing[cards_without_pricing['set'] == set_code]['name'].iloc[0]
        sample_name = str(sample_card)[:35] if pd.notna(sample_card) else 'Unknown'
        
        # Calculate total cards in this set
        total_in_set = len(all_cards_df[all_cards_df['set'] == set_code])
        missing_rate = (missing_count / total_in_set) * 100 if total_in_set > 0 else 0
        
        print(f"{str(set_code):<10} {missing_count:<8,} {sample_name:<40}")
        
        # Mark as problematic if high missing rate
        if missing_rate > 80 and missing_count >= 20:
            problematic_set_codes.append(set_code)
    
    print(f"\nüîç SET CODE PATTERNS:")
    
    # Analyze set code patterns
    pattern_analysis = {}
    for set_code in set_codes_no_pricing.head(30).index:
        set_str = str(set_code)
        
        # Look for patterns
        if set_str.startswith('a') and len(set_str) == 4:
            pattern_analysis['Alternate (a***)'] = pattern_analysis.get('Alternate (a***)', 0) + set_codes_no_pricing[set_code]
        elif set_str.startswith('t') and len(set_str) == 4:
            pattern_analysis['Token/Test (t***)'] = pattern_analysis.get('Token/Test (t***)', 0) + set_codes_no_pricing[set_code]
        elif len(set_str) <= 3:
            pattern_analysis['Main Sets (‚â§3 chars)'] = pattern_analysis.get('Main Sets (‚â§3 chars)', 0) + set_codes_no_pricing[set_code]
        elif len(set_str) > 4:
            pattern_analysis['Special (>4 chars)'] = pattern_analysis.get('Special (>4 chars)', 0) + set_codes_no_pricing[set_code]
        else:
            pattern_analysis['Other (4 chars)'] = pattern_analysis.get('Other (4 chars)', 0) + set_codes_no_pricing[set_code]
    
    print("Pattern breakdown for cards without pricing:")
    for pattern, count in sorted(pattern_analysis.items(), key=lambda x: x[1], reverse=True):
        print(f"  ‚Ä¢ {pattern:<25}: {count:>6,} cards")
    
else:
    print("set column not found in data")
    problematic_set_codes = []


üìã SET CODE ANALYSIS
SET CODES WITH MOST MISSING CARDS:
Set Code   Missing  Sample Card Name                        
-----------------------------------------------------------------
prm        774      Zodiac Rabbit                           
plst       537      Copy                                    
unk        489      The Convincing General                  
hbg        436      Grim Wanderer                           
pio        398      Devour Flesh                            
j21        389      Ethereal Grasp                          
akr        339      Enigma Drake                            
30a        310      Mox Jet                                 
klr        302      Aethertorch Renegade                    
sir        294      Briarbridge Patrol                      
sld        201      Brisela, Voice of Nightmares            
ps11       198      Immaculate Magistrate                   
psal       166      Lure                                    
amh2       162      Z

In [21]:
# Generate Smart Filtering Rules
print("\nüéØ SMART FILTERING RECOMMENDATIONS")
print("=" * 60)

print("Based on the analysis, here are the smart filtering rules for the pricing pipeline:")
print()

# Set type filters
if problematic_set_types:
    print("üö´ SKIP THESE SET TYPES:")
    for set_type in problematic_set_types:
        print(f"   ‚Ä¢ {set_type}")
    print()

# Set code pattern filters  
print("üö´ SKIP THESE SET CODE PATTERNS:")
print("   ‚Ä¢ Sets starting with 'a' + 3 more chars (alternate versions)")
print("   ‚Ä¢ Sets starting with 't' + 3 more chars (tokens/test)")
print("   ‚Ä¢ Sets longer than 4 characters (special releases)")
print()

# Layout filters (if we found any)
print("üö´ SKIP THESE LAYOUTS (if applicable):")
if 'layout' in cards_without_pricing.columns:
    problem_layouts = cards_without_pricing['layout'].value_counts()
    for layout, count in problem_layouts.head(3).items():
        total_layout = len(all_cards_df[all_cards_df['layout'] == layout])
        missing_rate = (count / total_layout) * 100 if total_layout > 0 else 0
        if missing_rate > 60:
            print(f"   ‚Ä¢ {layout} ({missing_rate:.1f}% missing rate)")

print()

# Positive filters
if good_set_types:
    print("‚úÖ PRIORITIZE THESE SET TYPES:")
    for set_type in good_set_types:
        print(f"   ‚Ä¢ {set_type}")
    print()

print("‚úÖ PRIORITIZE THESE SET CODE PATTERNS:")
print("   ‚Ä¢ 3-character or shorter set codes (main sets)")
print("   ‚Ä¢ Standard expansion sets")
print("   ‚Ä¢ Sets that don't start with 'a' or 't'")
print()

# Calculate potential impact
skip_set_types_count = 0
if problematic_set_types and 'set_type' in cards_without_pricing.columns:
    skip_set_types_count = sum(cards_without_pricing['set_type'].isin(problematic_set_types))

skip_patterns_count = 0
if 'set' in cards_without_pricing.columns:
    # Count cards that match skip patterns
    skip_patterns = cards_without_pricing['set'].astype(str).str.match(r'^[at].{3}$|^.{5,}$')
    skip_patterns_count = skip_patterns.sum()

total_without_pricing = len(cards_without_pricing)
potential_skip = max(skip_set_types_count, skip_patterns_count)

print("üìà IMPACT ESTIMATE:")
print(f"   ‚Ä¢ Cards currently without pricing: {total_without_pricing:,}")
print(f"   ‚Ä¢ Cards we could skip with smart filters: {potential_skip:,}")
print(f"   ‚Ä¢ Remaining cards to process: {total_without_pricing - potential_skip:,}")
print(f"   ‚Ä¢ Potential processing reduction: {potential_skip/total_without_pricing*100:.1f}%")

print(f"\n{'=' * 60}")
print("üöÄ NEXT STEPS:")
print("1. Implement these filters in the pricing pipeline")
print("2. Focus on main set cards with higher success rates")
print("3. Skip variant/alternate cards that don't have market pricing")
print("4. Monitor success rates and adjust filters as needed")


üéØ SMART FILTERING RECOMMENDATIONS
Based on the analysis, here are the smart filtering rules for the pricing pipeline:

üö´ SKIP THESE SET TYPES:
   ‚Ä¢ alchemy

üö´ SKIP THESE SET CODE PATTERNS:
   ‚Ä¢ Sets starting with 'a' + 3 more chars (alternate versions)
   ‚Ä¢ Sets starting with 't' + 3 more chars (tokens/test)
   ‚Ä¢ Sets longer than 4 characters (special releases)

üö´ SKIP THESE LAYOUTS (if applicable):
   ‚Ä¢ art_series (98.9% missing rate)
   ‚Ä¢ token (65.2% missing rate)

‚úÖ PRIORITIZE THESE SET TYPES:
   ‚Ä¢ expansion
   ‚Ä¢ masters
   ‚Ä¢ commander
   ‚Ä¢ promo
   ‚Ä¢ core
   ‚Ä¢ draft_innovation
   ‚Ä¢ duel_deck
   ‚Ä¢ masterpiece
   ‚Ä¢ planechase

‚úÖ PRIORITIZE THESE SET CODE PATTERNS:
   ‚Ä¢ 3-character or shorter set codes (main sets)
   ‚Ä¢ Standard expansion sets
   ‚Ä¢ Sets that don't start with 'a' or 't'

üìà IMPACT ESTIMATE:
   ‚Ä¢ Cards currently without pricing: 19,704
   ‚Ä¢ Cards we could skip with smart filters: 4,034
   ‚Ä¢ Remaining cards to 

## Implementation Code: Smart Filtering for Pricing Pipeline

Here's the exact Python code to implement these smart filters in the pricing pipeline:

In [22]:
# IMPLEMENTATION: Smart Filtering Function for Pricing Pipeline
print("üíª PRICING PIPELINE SMART FILTER IMPLEMENTATION")
print("=" * 60)

def should_skip_card_for_pricing(card_data):
    """
    Smart filter to determine if a card should be skipped in pricing collection.
    
    Args:
        card_data: Dictionary with card information including 'set_type', 'set', 'layout'
    
    Returns:
        bool: True if card should be skipped, False if it should be processed
    """
    
    # Skip problematic set types
    skip_set_types = {
        'alchemy',        # 100% missing rate
        'memorabilia',    # 60% missing rate  
        'token'           # 65.7% missing rate
    }
    
    set_type = card_data.get('set_type', '')
    if set_type in skip_set_types:
        return True
    
    # Skip problematic layouts
    skip_layouts = {
        'art_series',     # 98.9% missing rate
        'token'           # 65.2% missing rate  
    }
    
    layout = card_data.get('layout', '')
    if layout in skip_layouts:
        return True
    
    # Skip problematic set code patterns
    set_code = str(card_data.get('set', '')).lower()
    
    # Skip alternate versions (a*** pattern)
    if len(set_code) == 4 and set_code.startswith('a'):
        return True
        
    # Skip token/test versions (t*** pattern)  
    if len(set_code) == 4 and set_code.startswith('t'):
        return True
    
    # Skip very long set codes (special releases)
    if len(set_code) > 4:
        return True
    
    # Skip specific problematic sets
    skip_sets = {
        'prm',    # Promotional with 774 missing
        'plst',   # The List with 537 missing
        'unk',    # Unknown with 489 missing
        'hbg',    # Happy Birthday with 436 missing
        'pio',    # Pioneer with 398 missing
        'j21',    # Judge Gift Cards with 389 missing
    }
    
    if set_code in skip_sets:
        return True
    
    # Default: process the card
    return False


def prioritize_card_for_pricing(card_data):
    """
    Determine if a card should be prioritized in pricing collection.
    
    Args:
        card_data: Dictionary with card information
    
    Returns:
        int: Priority level (1=highest, 2=medium, 3=lowest)
    """
    
    # High priority set types (good success rates)
    high_priority_set_types = {
        'expansion',        # 10.4% missing
        'commander',        # 8.9% missing  
        'core',             # 11.4% missing
        'duel_deck',        # 8.9% missing
        'planechase'        # 10.0% missing
    }
    
    # Medium priority
    medium_priority_set_types = {
        'masters',          # 19.2% missing
        'promo',            # 14.3% missing
        'draft_innovation', # 13.7% missing
        'masterpiece'       # 13.1% missing
    }
    
    set_type = card_data.get('set_type', '')
    set_code = str(card_data.get('set', ''))
    
    # Prioritize main sets (3 chars or less)
    if len(set_code) <= 3:
        if set_type in high_priority_set_types:
            return 1  # Highest priority
        elif set_type in medium_priority_set_types:
            return 2  # Medium priority
    
    # Everything else
    return 3  # Lowest priority


# Test the filters
print("\nüß™ TESTING THE FILTERS:")
print("-" * 40)

# Test cases
test_cards = [
    {'set_type': 'expansion', 'set': 'neo', 'layout': 'normal', 'name': 'Test Expansion Card'},
    {'set_type': 'alchemy', 'set': 'y22', 'layout': 'normal', 'name': 'Test Alchemy Card'},
    {'set_type': 'token', 'set': 'tneo', 'layout': 'token', 'name': 'Test Token Card'},
    {'set_type': 'memorabilia', 'set': 'astx', 'layout': 'normal', 'name': 'Test Alternate Card'},
    {'set_type': 'commander', 'set': 'cmd', 'layout': 'normal', 'name': 'Test Commander Card'}
]

for i, card in enumerate(test_cards, 1):
    skip = should_skip_card_for_pricing(card)
    priority = prioritize_card_for_pricing(card) if not skip else 'N/A'
    status = "SKIP" if skip else f"PROCESS (Priority {priority})"
    
    print(f"{i}. {card['name']:<25} | {status}")
    print(f"   Set: {card['set']:<8} | Type: {card['set_type']:<15} | Layout: {card['layout']}")
    print()

print("‚úÖ Filter functions ready for integration into pricing pipeline!")
print("\nTo use in pricing_pipeline.py:")
print("1. Add these functions to the pricing pipeline module")
print("2. Filter cards before making API calls: if should_skip_card_for_pricing(card): continue")
print("3. Sort remaining cards by priority: cards.sort(key=prioritize_card_for_pricing)")
print("4. Process high priority cards first to maximize success rate")

üíª PRICING PIPELINE SMART FILTER IMPLEMENTATION

üß™ TESTING THE FILTERS:
----------------------------------------
1. Test Expansion Card       | PROCESS (Priority 1)
   Set: neo      | Type: expansion       | Layout: normal

2. Test Alchemy Card         | SKIP
   Set: y22      | Type: alchemy         | Layout: normal

3. Test Token Card           | SKIP
   Set: tneo     | Type: token           | Layout: token

4. Test Alternate Card       | SKIP
   Set: astx     | Type: memorabilia     | Layout: normal

5. Test Commander Card       | PROCESS (Priority 1)
   Set: cmd      | Type: commander       | Layout: normal

‚úÖ Filter functions ready for integration into pricing pipeline!

To use in pricing_pipeline.py:
1. Add these functions to the pricing pipeline module
2. Filter cards before making API calls: if should_skip_card_for_pricing(card): continue
3. Sort remaining cards by priority: cards.sort(key=prioritize_card_for_pricing)
4. Process high priority cards first to maximize succe