# Vizard Polars Preprocessing Test Suite

**Purpose:** Test preprocessing engine robustness with 15 examples based on Polars user guide patterns.

**Focus:** Preprocessing ONLY (no visualization)

**Datasets:** cars, stocks, movies from altair.datasets

**Goal:** Verify that the current keywords (FILTER, SELECT, DROP, SORT, ADD, GROUP, SAVE) are sufficient, or identify gaps.

## Setup

In [None]:
import altair as alt
import polars as pl
import pandas as pd
import numpy as np
from altair.datasets import data

In [None]:
%load_ext vizard_magic

In [None]:
%cc RESET

## Load and Name Datasets

In [None]:
df_cars = pl.DataFrame(data.cars())
print(f"Cars shape: {df_cars.shape}")
df_cars.head()

In [None]:
df_stocks = pl.DataFrame(data.stocks())
print(f"Stocks shape: {df_stocks.shape}")
df_stocks.head()

In [None]:
df_movies = pl.DataFrame(data.movies())
print(f"Movies shape: {df_movies.shape}")
df_movies.head()

## Test 1: Basic FILTER - Single Condition (using direct dataset access)

Filter cars with good fuel efficiency (MPG > 25)

In [None]:
%cc DATA cars FILTER Miles_per_Gallon > 25 ||

## Test 2: SELECT - Column Subset (using variable access)

Select only essential car attributes

In [None]:
%cc DATA df_cars SELECT Name, Miles_per_Gallon, Horsepower, Origin ||

## Test 3: FILTER with AND - Multiple Conditions

High-performance cars: Horsepower > 100 AND Miles_per_Gallon > 20

In [None]:
%cc DATA df_cars FILTER Horsepower > 100 and Miles_per_Gallon > 20 ||

## Test 4: ADD - Simple Computed Column

Calculate power-to-weight ratio

In [None]:
%cc DATA df_cars SELECT Name, Horsepower, Weight_in_lbs ADD power_to_weight as Horsepower / Weight_in_lbs ||

## Test 5: SORT - Ascending Order

Sort cars by fuel efficiency (worst to best)

In [None]:
%cc DATA df_cars SELECT Name, Miles_per_Gallon, Origin SORT by Miles_per_Gallon ascending ||

## Test 6: GROUP - Aggregation with Mean

Average MPG by car origin

In [None]:
%cc DATA df_cars GROUP by Origin aggregating mean(Miles_per_Gallon) as avg_mpg, count() as n_cars ||

## Test 7: Complex Chain - FILTER → SELECT → ADD → SORT

Multi-step pipeline: filter efficient cars, select columns, add computed field, sort

In [None]:
%cc DATA df_cars FILTER Miles_per_Gallon > 25 SELECT Name, Miles_per_Gallon, Horsepower ADD efficiency_score as Miles_per_Gallon * 10 + Horsepower SORT by efficiency_score descending ||

## Test 8: ADD - Multiple Derived Columns with Dependencies

Create log transformations and then compute ratios

In [None]:
%cc DATA df_cars SELECT Name, Horsepower, Weight_in_lbs ADD log_hp as log10(Horsepower) ADD log_weight as log10(Weight_in_lbs) ADD log_ratio as log_hp / log_weight ||

## Test 9: FILTER with OR - Alternative Conditions

Select cars that are either very efficient OR very powerful

In [None]:
%cc DATA df_cars FILTER Miles_per_Gallon > 35 or Horsepower > 150 ||

## Test 10: GROUP - Multiple Aggregations

Statistics by Origin and Cylinders

In [None]:
%cc DATA df_cars GROUP by Origin, Cylinders aggregating mean(Horsepower) as avg_hp, mean(Miles_per_Gallon) as avg_mpg, count() as count ||

## Test 11: ADD - Conditional/Boolean Column

Create a boolean flag column from a comparison (True when MPG > 25)

In [None]:
%cc DATA df_cars SELECT Name, Miles_per_Gallon ADD is_efficient as Miles_per_Gallon > 25 ||

## Test 12: DROP - Remove Unnecessary Columns

Remove columns that are not needed for the analysis

In [None]:
%cc DATA df_cars DROP columns Acceleration, Displacement ||

## Test 13: Stocks Dataset - Time Series Filtering

Filter stocks with high prices (using direct dataset access)

In [None]:
%cc DATA stocks FILTER price > 100 ||

## Test 14: Movies Dataset - Complex Filter and Aggregation

Analyze high-rated movies (using variable access)

In [None]:
%cc DATA df_movies FILTER IMDB_Rating > 7.5 SELECT Title, IMDB_Rating, Major_Genre GROUP by Major_Genre aggregating mean(IMDB_Rating) as avg_rating, count() as n_movies ||

## Test 15: Real-World Scenario - Complete Analysis Pipeline

Cars: Filter by year (1975+), compute metrics, aggregate by origin

In [None]:
%cc DATA df_cars FILTER Year >= 75 SELECT Origin, Miles_per_Gallon, Horsepower, Weight_in_lbs ADD hp_per_ton as Horsepower / (Weight_in_lbs / 2000) GROUP by Origin aggregating mean(Miles_per_Gallon) as avg_mpg, mean(hp_per_ton) as avg_hp_per_ton, count() as n_cars SORT by avg_mpg descending ||

## Summary

**Tests completed:**
1. ✓ Basic FILTER - single condition (DATA cars)
2. ✓ SELECT - column subset (DATA df_cars)
3. ✓ FILTER - multiple conditions (AND)
4. ✓ ADD - simple computed column
5. ✓ SORT - ascending order
6. ✓ GROUP - aggregation with mean
7. ✓ Complex chain - FILTER → SELECT → ADD → SORT
8. ✓ ADD - multiple derived columns with dependencies
9. ✓ FILTER - OR conditions
10. ✓ GROUP - multiple aggregations by multiple columns
11. ✓ ADD - boolean/conditional column
12. ✓ DROP - remove columns
13. ✓ Stocks dataset - time series filtering (DATA stocks)
14. ✓ Movies dataset - complex filter + aggregation (DATA df_movies)
15. ✓ Real-world scenario - complete pipeline

**Dataset Access Methods Tested:**
- Direct access: `DATA cars`, `DATA stocks` (from altair.datasets)
- Variable access: `DATA df_cars`, `DATA df_movies` (`df_stocks` is loaded but not used by any test)

**Keywords tested:**
- FILTER (with and/or)
- SELECT
- DROP
- ADD (including dependencies)
- SORT
- GROUP (with multiple aggregations)

**Keywords not tested:**
- SAVE (listed in the goal above but not used by any test)

**Potential gaps to explore:**
- RENAME columns (if needed)
- JOIN operations (combining datasets)
- PIVOT/MELT (wide ↔ long format)
- String operations (uppercase, lowercase, substring)
- CAST/type conversions
- NULL handling (fill, drop)
- Window functions beyond the current WINDOW keyword (WINDOW is not in the keyword list under Goal and is not tested here)