# Vizard Advanced Polars Test Suite 2

**Purpose:** Test fixes and new features added after vz_polars_advanced_test.ipynb

**Categories:**
1. BIN fixes (actual bin values, starting at syntax) - 4 tests
2. NULL handling (DROP_NULLS, FILL_NULLS, IS_NULL) - 9 tests
3. DATA/SEP I/O combinations - 10 tests

**Total:** 23 tests

## Setup

In [None]:
import altair as alt
import polars as pl
import pandas as pd
import numpy as np
from altair.datasets import data

In [None]:
%load_ext vizard_magic

In [None]:
%cc HELP

In [None]:
%cc RESET

## Load Datasets

In [None]:
# Reuse same datasets from original test
df_cars = pl.DataFrame(data.cars())
print(f"cars shape: {df_cars.shape}")
df_cars.head()

In [None]:
df_weather = pl.DataFrame(data.seattle_weather())
print(f"seattle_weather shape: {df_weather.shape}")
df_weather.head()

## Create Test DataFrames with Nulls

In [None]:
# DataFrame with actual nulls and NaN values
df_nulls = pl.DataFrame({
    'id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'name': ['Alice', None, 'Charlie', 'David', 'Eve', None, 'George', 'Helen', 'Ivan', None],
    'value': [100.0, 200.0, None, 400.0, float('nan'), 600.0, 700.0, float('nan'), 900.0, 1000.0],
    'score': [85.5, float('nan'), 92.0, 90.0, None, 88.0, 87.5, None, 92.0, 87.0]
})
print("Test DataFrame with actual nulls (None) and NaN:")
df_nulls

In [None]:
# DataFrame for multi-column null testing
df_nulls_multi = pl.DataFrame({
    'gene': ['BRCA1', 'TP53', 'EGFR', 'KRAS', 'MYC'],
    'expression': [5.2, None, 3.8, 4.1, float('nan')],
    'pvalue': [0.001, None, float('nan'), 0.05, 0.003],
    'significant': [True, True, False, True, False]
})
print("Multi-column null test DataFrame:")
df_nulls_multi

---
# Category 1: BIN Fixes (4 tests)

## Test 1.1: BIN - Actual Bin Values (Equal Width)

In [None]:
# Before: Check weight range
print(f"Weight range: {df_cars['Weight_in_lbs'].min()} - {df_cars['Weight_in_lbs'].max()}")
df_cars.select(['Name', 'Weight_in_lbs']).head(10)

In [None]:
# Should produce bin values: 1500, 2000, 2500, 3000... (not 0, 1, 2, 3...)
%cc DATA df_cars SELECT Name, Weight_in_lbs BIN Weight_in_lbs by 500 as weight_bin ||

## Test 1.2: BIN - With Starting Point

In [None]:
# Should produce bins starting at 2000: 2000, 2500, 3000, 3500...
%cc DATA df_cars SELECT Name, Weight_in_lbs BIN Weight_in_lbs by 500 starting at 2000 as weight_bin ||

## Test 1.3: BIN - Year Data (Fix for Test 12.4)

In [None]:
# Before: Check Year range
print(f"Year range: {df_cars['Year'].min()} - {df_cars['Year'].max()}")
df_cars.select(['Name', 'Year']).head(10)

In [None]:
# Should produce: 70, 75, 80 (for years 70-74, 75-79, 80-84)
# NOT astronomical numbers like 75738240000000000
%cc DATA df_cars SELECT Name, Origin, Year CAST Year to integer BIN Year by 5 as year_range GROUP by Origin, year_range aggregating count() as n_cars ||

## Test 1.4: BIN - MPG with Ascending Order

In [None]:
# Should produce bins: 10, 15, 20, 25, 30, 35... ordered
%cc DATA df_cars SELECT Name, Miles_per_Gallon BIN Miles_per_Gallon by 5 ascending as mpg_bin HEAD 20 ||

---
# Category 2: NULL Handling (9 tests)

## Test 2.1: DROP_NULLS - Single Column (Various Null Types)

In [None]:
# Before: Has None in name column (rows 2, 6, 10)
print("Before DROP_NULLS:")
df_nulls

In [None]:
# Should drop rows 2, 6, 10 (where name is None)
# Keep rows: 1, 3, 4, 5, 7, 8, 9 (Alice, Charlie, David, Eve, George, Helen, Ivan)
%cc DATA df_nulls DROP_NULLS name ||

## Test 2.2: DROP_NULLS - Multiple Columns

In [None]:
# Before: Has None and NaN in expression and pvalue columns
print("Before DROP_NULLS:")
df_nulls_multi

In [None]:
# Should drop rows where expression OR pvalue is None/NaN
# expression: row 2 (None), row 5 (NaN)
# pvalue: row 2 (None), row 3 (NaN)
# Rows 2, 3, and 5 are dropped, so keep only rows 1 and 4 (BRCA1, KRAS)
%cc DATA df_nulls_multi DROP_NULLS expression, pvalue ||

## Test 2.3: FILL_NULLS - Single Column with Constant

In [None]:
# Before: value has None (row 3) and NaN (rows 5, 8)
print("Before FILL_NULLS:")
df_nulls.select(['id', 'name', 'value'])

In [None]:
# Should fill None and NaN with -1 (rows 3, 5, 8)
%cc DATA df_nulls SELECT id, name, value FILL_NULLS value with -1 ||

## Test 2.4: FILL_NULLS - Multiple Columns

In [None]:
# Before: expression and pvalue have None and NaN
print("Before FILL_NULLS:")
df_nulls_multi

In [None]:
# Should fill None and NaN in both columns with 0
%cc DATA df_nulls_multi FILL_NULLS expression, pvalue with 0 ||

## Test 2.5: IS_NULL - Create Boolean Flag

In [None]:
# Before: name has None at rows 2, 6, 10
print("Before IS_NULL:")
df_nulls.select(['id', 'name'])

In [None]:
# Should create boolean column: True for rows 2, 6, 10 (where name is None)
%cc DATA df_nulls SELECT id, name IS_NULL name as name_is_missing ||

## Test 2.6: Null Handling - NaN and None in a Numeric Column

In [None]:
# Test NaN handling in numeric columns (score has NaN at row 2 and None at rows 5, 8)
print("Before (score has NaN and None):")
df_nulls.select(['id', 'score'])

In [None]:
# Should drop rows 2, 5, 8 (NaN, None, None) from numeric column
%cc DATA df_nulls SELECT id, score DROP_NULLS score ||

## Test 2.7: Null Handling with Real Dataset

In [None]:
# Check if cars dataset has any nulls
print("Null counts in cars dataset:")
df_cars.null_count()

In [None]:
# If Horsepower has nulls, drop them
%cc DATA df_cars DROP_NULLS Horsepower SELECT Name, Horsepower HEAD 10 ||

## Test 2.8: Combination - IS_NULL then FILTER

In [None]:
# Create flag, then filter to only missing values
%cc DATA df_nulls_multi IS_NULL expression as expr_missing FILTER expr_missing == true ||

## Test 2.9: Combination - FILL_NULLS then GROUP

In [None]:
# Fill nulls, then aggregate
%cc DATA df_nulls_multi FILL_NULLS expression with 0 GROUP by significant aggregating mean(expression) as avg_expr, count() as n ||

---
# Category 3: DATA/SEP I/O Testing (10 tests)

## Create Test Files

In [None]:
# Create test CSV file
test_data = pl.DataFrame({
    'gene': ['BRCA1', 'TP53', 'EGFR'],
    'expression': [5.2, 8.1, 3.4],
    'pvalue': [0.001, 0.003, 0.05]
})

test_data.write_csv('test_data.csv')
print("Created test_data.csv")

In [None]:
# Create test TSV file
test_data.write_csv('test_data.tsv', separator='\t')
print("Created test_data.tsv")

In [None]:
# Create ambiguous .dat file (tab-separated)
test_data.write_csv('test_data.dat', separator='\t')
print("Created test_data.dat (tab-separated)")

## Test 3.1: DATA - CSV File

In [None]:
%cc DATA test_data.csv ||

## Test 3.2: DATA - TSV File with SEP \t

In [None]:
%cc DATA test_data.tsv SEP \t ||

## Test 3.3: DATA - .dat File with SEP csv

In [None]:
# Should try the comma separator; the file is tab-separated, so this may fail, which is acceptable
%cc DATA test_data.dat SEP csv ||

## Test 3.4: DATA - .dat File with SEP tsv

In [None]:
# Should work - file is tab-separated
%cc DATA test_data.dat SEP tsv ||

## Test 3.5: DATA - .dat File with SEP both

In [None]:
# Should try tab-separated first, which succeeds, so the comma fallback is not needed
%cc DATA test_data.dat SEP both ||

## Test 3.6: DATA - DataFrame Variable

In [None]:
# Should use df_cars directly
%cc DATA df_cars SELECT Name, Origin HEAD 5 ||

## Test 3.7: DATA - Altair Dataset

In [None]:
# Should load from altair.datasets
%cc DATA cars HEAD 10 ||

## Test 3.8: DATA - CSV with Explicit SEP ,

In [None]:
%cc DATA test_data.csv SEP , ||

## Test 3.9: DATA - URL (if available)

In [None]:
# Test with a known public CSV URL
%cc DATA https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv HEAD 5 ||

## Test 3.10: DATA - Chained with Wrangling

In [None]:
# Load TSV and immediately wrangle
%cc DATA test_data.tsv SEP \t FILTER pvalue < 0.01 ||

---
# Cleanup

In [None]:
# Remove test files
import os
for file in ['test_data.csv', 'test_data.tsv', 'test_data.dat']:
    if os.path.exists(file):
        os.remove(file)
        print(f"Removed {file}")

---
# Summary

**Tests defined:** 23 tests total

**Category 1 - BIN Fixes (4 tests):**
- Actual bin values (lower bounds)
- Starting at syntax
- Year data binning fix
- Ascending order

**Category 2 - NULL Handling (9 tests):**
- DROP_NULLS single/multiple columns
- FILL_NULLS with constants
- IS_NULL boolean flags
- NaN and None in numeric columns
- Real dataset nulls
- Combinations with other operations

**Category 3 - DATA/SEP I/O (10 tests):**
- CSV files
- TSV files with SEP variants
- .dat files with SEP csv/tsv/both
- DataFrame variables
- Altair datasets
- CSV URL loading (JSON URLs not supported by Polars)
- Chained wrangling

**Next steps:**
1. Run all tests and identify failures
2. Report issues for CLAUDE.md updates
3. Iterate until all pass