# Vizard Advanced Polars Keywords Test Suite

**Purpose:** Test 11 new Polars preprocessing keywords comprehensively

**New Keywords:** RENAME, BIN, JOIN, STRING, CAST, PIVOT, UNPIVOT, UNIQUE, HEAD, CONCAT, MAP

**Datasets:** lookup_people, lookup_groups, seattle_weather, unemployment_across_industries, cars

**Test Coverage:** 38 distinct tests across simple, medium, complex, and combination cases

## Setup

In [None]:
import altair as alt
import polars as pl
import pandas as pd
import numpy as np
from altair.datasets import data

In [None]:
%load_ext vizard_magic

In [None]:
%cc --model haiku

In [None]:
%%time
%cc RESET

## Load Datasets

In [None]:
# For JOIN testing
df_lookup_people = pl.DataFrame(data.lookup_people())
print(f"lookup_people shape: {df_lookup_people.shape}")
df_lookup_people.head()

In [None]:
df_lookup_groups = pl.DataFrame(data.lookup_groups())
print(f"lookup_groups shape: {df_lookup_groups.shape}")
df_lookup_groups.head()

In [None]:
# For STRING and PIVOT testing
df_weather = pl.DataFrame(data.seattle_weather())
print(f"seattle_weather shape: {df_weather.shape}")
df_weather.head()

In [None]:
# For PIVOT and multi-category testing
df_unemployment = pl.DataFrame(data.unemployment_across_industries())
print(f"unemployment shape: {df_unemployment.shape}")
df_unemployment.head()

In [None]:
# Cars dataset (continue using for consistency)
df_cars = pl.DataFrame(data.cars())
print(f"cars shape: {df_cars.shape}")
df_cars.head()

---
# Simple Keywords (1-2 tests each)

In [None]:
%%time
%cc DATA df_cars SELECT Name, Weight_in_lbs RENAME Weight_in_lbs as weight ||

In [None]:
%%time
%cc DATA df_cars SELECT Name, Miles_per_Gallon, Weight_in_lbs RENAME Miles_per_Gallon as mpg, Weight_in_lbs as weight ||

In [None]:
%%time
%cc DATA df_cars HEAD 10 ||

In [None]:
%%time
%cc DATA df_cars SELECT Origin UNIQUE ||

In [None]:
%%time
%cc DATA df_cars UNIQUE on Origin, Cylinders keeping first ||

In [None]:
%%time
%cc DATA df_cars SELECT Name, Year CAST Year to integer ||

In [None]:
%%time
%cc DATA df_cars SELECT Name, Horsepower CAST Horsepower to float ||

---
# Medium Complexity (2-4 tests each)

In [None]:
%%time
%cc DATA df_cars SELECT Name, Weight_in_lbs BIN Weight_in_lbs by 500 as weight_category ||

In [None]:
%%time
%cc DATA df_cars SELECT Name, Miles_per_Gallon BIN Miles_per_Gallon into 5 as mpg_range ||

In [None]:
%%time
%cc DATA df_cars SELECT Name, Horsepower BIN Horsepower by 50 ascending as power_class ||

In [None]:
%%time
%cc DATA df_weather SELECT weather STRING uppercase weather ||

In [None]:
%%time
%cc DATA df_cars SELECT Name, Origin STRING lowercase Origin ||

In [None]:
%%time
%cc DATA df_weather SELECT weather STRING replace weather sun to sunny ||

In [None]:
%%time
%cc DATA df_cars SELECT Name STRING substring Name from 0 to 10 ||

## Test 7.1: CONCAT - Vertical (Default)

In [None]:
# Create subset datasets for CONCAT testing
df_cars_usa = df_cars.filter(pl.col('Origin') == 'USA').head(5)
df_cars_japan = df_cars.filter(pl.col('Origin') == 'Japan').head(5)
print("Created df_cars_usa and df_cars_japan")

In [None]:
%%time
%cc DATA df_cars_usa CONCAT df_cars_japan ||

In [None]:
# Create complementary columns for horizontal CONCAT
df_cars_cols1 = df_cars.select(['Name', 'Origin']).head(10)
df_cars_cols2 = df_cars.select(['Miles_per_Gallon', 'Horsepower']).head(10)
print("Created df_cars_cols1 and df_cars_cols2")

## Test 7.2: CONCAT - Horizontal

In [None]:
%%time
%cc DATA df_cars_cols1 CONCAT df_cars_cols2 horizontally ||

In [None]:
# Create wide format data for UNPIVOT testing
df_wide = pl.DataFrame({
    'name': ['A', 'B', 'C'],
    'value1': [10, 20, 30],
    'value2': [15, 25, 35]
})
print("Created df_wide")
df_wide

## Test 8.1: UNPIVOT - Simple

In [None]:
%%time
%cc DATA df_wide UNPIVOT value1, value2 keeping name as metric, amount ||

In [None]:
%%time
%cc DATA df_weather HEAD 5 SELECT date, temp_max, temp_min UNPIVOT temp_max, temp_min keeping date as temp_type, temperature ||

In [None]:
# Rename person column for JOIN testing
df_lookup_groups_renamed = df_lookup_groups.rename({'person': 'name'})
print("Created df_lookup_groups_renamed")

In [None]:
%%time
%cc DATA df_wide UNPIVOT value1, value2 keeping name ||

---
# Complex Keywords (4-5 tests each)

## Test 9.1: JOIN - Simple (Same Column Name)

In [None]:
# First, let's see the data
print("People:")
display(df_lookup_people)
print("\nGroups:")
display(df_lookup_groups)

In [None]:
%%time
%cc DATA df_lookup_people JOIN df_lookup_groups_renamed on name ||

In [None]:
# Create datasets for JOIN with multiple keys
df_cars_a = df_cars.select(['Origin', 'Cylinders', 'Miles_per_Gallon']).unique(subset=['Origin', 'Cylinders']).head(10)
df_cars_b = df_cars.select(['Origin', 'Cylinders', 'Horsepower']).unique(subset=['Origin', 'Cylinders']).head(10)
print("Created df_cars_a and df_cars_b")
print(f"df_cars_a shape: {df_cars_a.shape}")
df_cars_a

In [None]:
%%time
%cc DATA df_lookup_people JOIN df_lookup_groups on name = person ||

In [None]:
%%time
%cc DATA df_lookup_people JOIN df_lookup_groups on name = person type left ||

In [None]:
# Create simple long-format data for PIVOT testing
df_long = pl.DataFrame({
    'date': ['2020-01-01', '2020-01-01', '2020-01-02', '2020-01-02'],
    'symbol': ['AAPL', 'MSFT', 'AAPL', 'MSFT'],
    'price': [100, 50, 105, 52]
})
print("Created df_long")
df_long

## Test 9.4: JOIN - Multiple Keys (Cars Example)

In [None]:
%%time
%cc DATA df_cars_a JOIN df_cars_b on Origin, Cylinders ||

In [None]:
%%time
%cc DATA df_lookup_people FILTER age > 25 JOIN df_lookup_groups on name = person ||

## Test 10.1: PIVOT - Simple

In [None]:
%%time
%cc DATA df_long PIVOT price by date for symbol ||

In [None]:
%%time
%cc DATA df_unemployment HEAD 100 PIVOT count by year for series aggregating mean ||

In [None]:
%%time
# Use only the first 20 rows
%cc DATA df_weather HEAD 20 SELECT date, weather, temp_max PIVOT temp_max by date for weather ||

In [None]:
%%time
%cc DATA df_cars SELECT Origin, Cylinders, Miles_per_Gallon PIVOT Miles_per_Gallon by Origin for Cylinders aggregating mean ||

In [None]:
%%time
%cc DATA df_cars SELECT Name, Origin MAP Origin using {USA: United States, Japan: Japan, Europe: European Union} as origin_full ||

In [None]:
%%time
%cc DATA df_cars SELECT Name, Miles_per_Gallon MAP Miles_per_Gallon where > 30 is Efficient, else Inefficient as efficiency ||

In [None]:
%%time
%cc DATA df_cars SELECT Name, Miles_per_Gallon MAP Miles_per_Gallon where > 30 is High, > 20 is Medium, else Low as mpg_category ||

In [None]:
%%time
%cc DATA df_weather SELECT date, weather, precipitation MAP precipitation where > 10 is Heavy, > 5 is Moderate, > 0 is Light, else None as rain_level ||

In [None]:
%%time
%cc DATA df_cars FILTER Horsepower > 100 SELECT Name, Origin, Horsepower MAP Horsepower where > 150 is High, else Medium as power_class HEAD 10 ||

---
# Combination Tests (Complex Chains)

In [None]:
%%time
%cc DATA df_cars SELECT Name, Weight_in_lbs, Miles_per_Gallon RENAME Weight_in_lbs as weight BIN weight by 500 as weight_cat MAP Miles_per_Gallon where > 25 is Efficient, else Inefficient as efficiency GROUP by weight_cat, efficiency aggregating count() as n_cars ||

In [None]:
%%time
%cc DATA df_lookup_people JOIN df_lookup_groups on name = person FILTER age > 25 UNIQUE on group HEAD 5 ||

In [None]:
%%time
%cc DATA df_weather HEAD 10 SELECT date, temp_max, temp_min UNPIVOT temp_max, temp_min keeping date as temp_type, temperature MAP temperature where > 15 is Warm, else Cold as temp_cat PIVOT temperature by date for temp_type aggregating mean ||

In [None]:
%%time
%cc DATA df_cars SELECT Name, Origin, Year STRING uppercase Origin CAST Year to integer BIN Year by 5 as year_range GROUP by Origin, year_range aggregating count() as n_cars ||

In [None]:
%%time
%cc DATA df_lookup_people JOIN df_lookup_groups on name = person RENAME height as height_cm MAP age where > 30 is Senior, > 20 is Adult, else Young as age_group BIN height_cm by 10 as height_range GROUP by age_group, height_range aggregating count() as count SORT by count descending ||

---
# Summary

**Tests included:** 38 tests total

**Simple keywords (7 tests):**
- RENAME: 2 tests
- HEAD: 1 test
- UNIQUE: 2 tests
- CAST: 2 tests

**Medium complexity (12 tests):**
- BIN: 3 tests
- STRING: 4 tests
- CONCAT: 2 tests
- UNPIVOT: 3 tests

**Complex keywords (14 tests):**
- JOIN: 5 tests
- PIVOT: 4 tests
- MAP: 5 tests

**Combination tests:** 5 complex chains

**Datasets used:**
- lookup_people / lookup_groups (JOIN operations)
- seattle_weather (STRING, UNPIVOT, PIVOT, MAP operations)
- unemployment_across_industries (PIVOT, multi-category)
- cars (BIN, CAST, RENAME, general operations)

**Next steps:**
1. Run all tests and identify any failures
2. Report syntax errors or unexpected behavior
3. Verify generated Polars code is correct
4. Test `HELP <keyword>` functionality separately