# <font color="#418FDE" size="6.5" uppercase>**Interoperability and Testing**</font>

>Last update: 20251228.
    
By the end of this lecture, you will be able to:
- Convert data between Pandas and Polars safely when partial migration is required. 
- Design tests that compare outputs of Pandas and Polars implementations for the same logic. 
- Detect and handle subtle behavioral differences between Pandas and Polars, such as type coercion or null semantics. 


## **1. Safe Data Conversion**

### **1.1. Pandas Polars Conversion**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_05/Lecture_B/image_01_01.jpg?v=1766903734" width="250">



>* Treat Pandas–Polars conversion as a critical boundary
>* Preserve schema details and meaning during round trips

>* Understand and align Pandas and Polars dtypes
>* Clean data and verify Polars schema after conversion

>* Preserve data meaning when returning to Pandas
>* Document, validate, and control schema changes on conversion



In [None]:
#@title Python Code - Pandas Polars Conversion

# Demonstrate safe conversion between Pandas and Polars dataframes.
# Show how mixed types in Pandas affect Polars conversion.
# Compare schemas before and after conversion for safer partial migration.

import pandas as pd
import polars as pl

# Create a small Pandas dataframe with mixed type column.
# Revenue column mixes numbers and string unknown marker.
data_pandas = {
    "customer_id": [1, 2, 3],
    "revenue_usd": [100.0, "unknown", 250.5],
}

# Build the Pandas dataframe and inspect dtypes.
df_pandas = pd.DataFrame(data_pandas)
print("Pandas dataframe dtypes before cleaning:")
print(df_pandas.dtypes)

# Clean the mixed column by coercing non numeric values to missing.
df_pandas["revenue_usd"] = pd.to_numeric(df_pandas["revenue_usd"], errors="coerce")
print("\nPandas dataframe dtypes after cleaning:")
print(df_pandas.dtypes)

# Convert cleaned Pandas dataframe into Polars dataframe safely.
df_polars = pl.from_pandas(df_pandas)
print("\nPolars schema after conversion:")
print(df_polars.schema)

# Convert back from Polars to Pandas and compare dtypes again.
df_roundtrip = df_polars.to_pandas()
print("\nRoundtrip Pandas dataframe dtypes:")
print(df_roundtrip.dtypes)



### **1.2. Arrow Interchange Bridge**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_05/Lecture_B/image_01_02.jpg?v=1766903752" width="250">



>* Use Arrow as neutral format between libraries
>* Reduces quirks and keeps mixed pipelines consistent

>* Arrow handles complex types consistently across systems
>* Standardized schemas prevent subtle conversion bugs

>* Arrow standardizes shared datasets across diverse teams
>* Enables reproducible analyses and reliable cross-tool interoperability



In [None]:
#@title Python Code - Arrow Interchange Bridge

# Demonstrate an Arrow bridge between Pandas and Polars.
# Convert Pandas to Arrow, then Arrow to Polars.
# Check that types and missing values are preserved across conversions.

import pandas as pd
import polars as pl
import pyarrow as pa

# Create simple Pandas DataFrame with mixed types.
# Include float, category, timestamp, and missing value.

df_pandas = pd.DataFrame({
    "city": pd.Series(["Boston", "Dallas", "Miami"], dtype="category"),
    "temperature_f": [72.5, None, 88.0],
    "timestamp": pd.to_datetime([
        "2024-01-01 10:00:00",
        "2024-01-01 11:00:00",
        "2024-01-01 12:00:00",
    ]),
})

# Convert Pandas DataFrame into Arrow Table container.
# Arrow acts as neutral shipping container here.

table_arrow = pa.Table.from_pandas(df_pandas, preserve_index=False)

# Convert Arrow Table into Polars DataFrame safely.
# Polars understands Arrow columnar format.

df_polars = pl.from_arrow(table_arrow)

# Show Pandas dtypes and head for comparison.
# This helps visualize original representation.

print("PANDAS DATAFRAME AND DTYPES:")
print(df_pandas.dtypes)
print(df_pandas.head())

# Show Polars schema and head after Arrow bridge.
# Types and nulls should remain consistent.

print("\nPOLARS DATAFRAME AND SCHEMA:")
print(df_polars.schema)
print(df_polars.head())

# Convert back from Polars to Arrow then Pandas again.
# Confirm roundtrip stability through Arrow bridge.

table_arrow_back = df_polars.to_arrow()
df_pandas_back = table_arrow_back.to_pandas()

# Show final Pandas dtypes after roundtrip conversion.
# Compare with original to check safety.

print("\nPANDAS DTYPES AFTER ROUNDTRIP:")
print(df_pandas_back.dtypes)



### **1.3. Conversion Performance Tradeoffs**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_05/Lecture_B/image_01_03.jpg?v=1766903814" width="250">



>* Dataframe conversions add memory and time overhead
>* Frequent small conversions can outweigh engine speed gains

>* Plan around data size and conversion frequency
>* Convert big tables rarely; reuse small converted data

>* Conversion overhead affects memory, scalability, reliability
>* Minimize conversions with selective columns and batching



In [None]:
#@title Python Code - Conversion Performance Tradeoffs

# Compare conversion heavy workflow versus conversion light workflow performance.
# Show how repeated conversions add overhead and reduce overall speed.
# Help reason about when conversions are worth the performance toll.

import time
import numpy as np
import pandas as pd

try:
    import polars as pl
except ImportError:
    import sys
    !{sys.executable} -m pip install polars --quiet
    import polars as pl

n_rows = 2_000_000
n_steps = 20  # Number of transformation steps applied in both workflows.

values = np.random.rand(n_rows)
ids = np.random.randint(0, 100, size=n_rows)

pdf = pd.DataFrame({"id": ids, "value": values})

# Conversion-heavy workflow: cross the Pandas/Polars boundary on every step.
start_heavy = time.perf_counter()
current_pdf = pdf.copy()

for _ in range(n_steps):
    pl_df = pl.from_pandas(current_pdf)
    pl_df = pl_df.with_columns((pl.col("value") * 1.01).alias("value"))
    current_pdf = pl_df.to_pandas()

heavy_duration = time.perf_counter() - start_heavy

# Conversion-light workflow: convert once, apply the same steps, convert back once.
start_light = time.perf_counter()
pl_df_light = pl.from_pandas(pdf)

for _ in range(n_steps):
    pl_df_light = pl_df_light.with_columns((pl.col("value") * 1.01).alias("value"))

light_pdf = pl_df_light.to_pandas()
light_duration = time.perf_counter() - start_light

print("Rows processed:", n_rows)
print("Transformation steps in each workflow:", n_steps)
print("Heavy workflow boundary crossings:", 2 * n_steps)
print("Light workflow boundary crossings:", 2)
print("Heavy workflow seconds:", round(heavy_duration, 3))
print("Light workflow seconds:", round(light_duration, 3))
print("Speedup factor (heavy / light):", round(heavy_duration / light_duration, 1))
print("Final heavy mean value:", round(current_pdf["value"].mean(), 4))
print("Final light mean value:", round(light_pdf["value"].mean(), 4))
print("Observation: fewer crossings usually save time.")



## **2. Testing Crossframe Consistency**

### **2.1. Golden data baselines**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_05/Lecture_B/image_02_01.jpg?v=1766903844" width="250">



>* Curated datasets define correct results for transformations
>* Both Pandas and Polars must match baseline

>* Design small, clear datasets covering tricky edge cases
>* Lock baselines, reuse them, and expand with discoveries

>* Golden baselines anchor automated, versioned crossframe tests
>* They expose differences, document changes, preserve business meaning



In [None]:
#@title Python Code - Golden data baselines

# Demonstrate golden baseline testing with Pandas and Polars together.
# Show how expected results are stored and compared consistently.
# Highlight how both libraries must match the same trusted baseline.

import pandas as pd
import polars as pl

# Create a tiny golden baseline input dataset in memory.
# This simulates monthly revenue per customer with simple edge cases.
input_data = [
    {"customer_id": "A", "month": "2024-01", "amount_usd": 100.0},
    {"customer_id": "A", "month": "2024-01", "amount_usd": -20.0},
    {"customer_id": "B", "month": "2024-01", "amount_usd": 50.0},
]

# Create the trusted golden expected output for grouped revenue.
# These values would normally be reviewed and version controlled.
golden_output = [
    {"customer_id": "A", "month": "2024-01", "total_revenue_usd": 80.0},
    {"customer_id": "B", "month": "2024-01", "total_revenue_usd": 50.0},
]

# Build Pandas DataFrame from the golden input data.
# Then compute grouped totals using Pandas operations.
pd_df = pd.DataFrame(input_data)
pd_result = (
    pd_df.groupby(["customer_id", "month"], as_index=False)["amount_usd"].sum()
    .rename(columns={"amount_usd": "total_revenue_usd"})
)

# Build Polars DataFrame from the same golden input data.
# Then compute grouped totals using Polars operations.
pl_df = pl.DataFrame(input_data)
pl_result = (
    pl_df.group_by(["customer_id", "month"]).agg(
        pl.col("amount_usd").sum().alias("total_revenue_usd")
    ).sort(["customer_id", "month"])
)

# Convert golden expected output into Pandas and Polars for comparison.
# In practice these would be loaded from stable files.
golden_pd = pd.DataFrame(golden_output)
golden_pl = pl.DataFrame(golden_output).sort(["customer_id", "month"])

# Compare Pandas result to golden baseline using equality checks.
# Any mismatch means the implementation is not trusted yet.
pandas_matches = pd_result.equals(golden_pd)

# Compare Polars result to golden baseline after conversion.
# We convert Polars result into Pandas for simple comparison.
polars_matches = pl_result.to_pandas().equals(golden_pd)

# Print a short summary showing whether each library matches the baseline.
# This mirrors what automated tests would assert in continuous integration.
print("Golden baseline comparison summary:")
print("Pandas matches golden baseline:", pandas_matches)
print("Polars matches golden baseline:", polars_matches)




### **2.2. Property Based Testing**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_05/Lecture_B/image_02_02.jpg?v=1766903865" width="250">



>* Define general behaviors that must always hold
>* Auto-generate many dataframes to compare library outputs

>* Focus on invariants, not exact row matches
>* Encode structural and aggregate expectations as properties

>* Generated data reveals rare, tricky edge cases
>* Shrunk failures become reusable regression test examples



In [None]:
#@title Python Code - Property Based Testing

# Demonstrate property based testing comparing Pandas and Polars results.
# Use Hypothesis to generate random DataFrames with numeric revenue values.
# Check that grouped revenue sums match across both libraries consistently.

import pandas as pd
import polars as pl
from hypothesis import given, settings
from hypothesis import strategies as st

# Create a Hypothesis strategy that generates rows for small DataFrames.
# Each row holds a customer id and a finite revenue value (no NaN, no nulls).
# This strategy keeps sizes small so tests run quickly.

customer_ids_strategy = st.integers(min_value=1, max_value=5)
revenue_strategy = st.floats(min_value=-1000.0, max_value=1000.0, allow_nan=False)
rows_strategy = st.lists(st.tuples(customer_ids_strategy, revenue_strategy), min_size=1, max_size=8)

# Convert generated rows into a simple Pandas DataFrame for testing.
# Columns represent customer identifiers and associated revenue amounts.
# This function is used inside the property based test.

def build_pandas_frame(rows):
    df = pd.DataFrame(rows, columns=["customer_id", "revenue"])
    return df

# Define a property that grouped revenue sums must match across libraries.
# We group by customer identifier and sum revenue values for each customer.
# The property compares numeric results within a small tolerance.

@given(rows=rows_strategy)
@settings(max_examples=20)
def test_groupby_sum_matches(rows):
    pdf = build_pandas_frame(rows)
    pl_df = pl.from_pandas(pdf)
    pandas_result = pdf.groupby("customer_id", as_index=False)["revenue"].sum()

    polars_result = (
        pl_df.group_by("customer_id")
        .agg(pl.col("revenue").sum())
        .sort("customer_id")
        .to_pandas()
    )

    pandas_sorted = pandas_result.sort_values("customer_id").reset_index(drop=True)
    polars_sorted = polars_result.sort_values("customer_id").reset_index(drop=True)

    assert len(pandas_sorted) == len(polars_sorted)
    for idx in range(len(pandas_sorted)):
        assert pandas_sorted.loc[idx, "customer_id"] == polars_sorted.loc[idx, "customer_id"]
        diff = abs(pandas_sorted.loc[idx, "revenue"] - polars_sorted.loc[idx, "revenue"])
        assert diff < 1e-9

# Run the property based test and print a short confirmation message.
# Hypothesis will generate many random inputs behind the scenes.
# If no assertion fails, we consider implementations consistent.

if __name__ == "__main__":
    test_groupby_sum_matches()
    print("Property based test passed, grouped revenue sums match across libraries.")



### **2.3. Numeric Tolerance Strategies**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_05/Lecture_B/image_02_03.jpg?v=1766903883" width="250">



>* Small numeric differences are normal between frameworks
>* Use tolerances to treat close results as equivalent

>* Use absolute and relative tolerances for comparisons
>* Apply context-aware thresholds at multiple data levels

>* Report detailed stats for out-of-tolerance differences
>* Use diagnostics to refine thresholds and document quirks



In [None]:
#@title Python Code - Numeric Tolerance Strategies

# Demonstrate numeric tolerance when comparing two calculation results from different implementations.
# Show absolute and relative tolerance checks using simple floating point examples.
# Print clear messages explaining when values are considered close enough or significantly different.

import math

# Define two results that should be almost equal but not exactly equal.
# Note: 0.1 + 0.1 + 0.1 evaluates to 0.30000000000000004 in binary floating point.
result_pandas_style = 0.1 + 0.1 + 0.1
result_polars_style = 0.3

# Define absolute tolerance for values around one dollar or less.
abs_tolerance = 0.0001

# Define relative tolerance for values that might be much larger.
rel_tolerance = 0.001

# Compute absolute difference between the two simulated framework results.
abs_difference = abs(result_pandas_style - result_polars_style)

# Check closeness using only absolute tolerance for small scale values.
abs_close = abs_difference <= abs_tolerance

# Compute relative difference scaled by the larger magnitude value.
max_magnitude = max(abs(result_pandas_style), abs(result_polars_style))

rel_difference = abs_difference / max_magnitude if max_magnitude != 0 else 0.0

# Check closeness using relative tolerance for potentially large scale values.
rel_close = rel_difference <= rel_tolerance

# Use math.isclose to combine absolute and relative tolerance checks.
combined_close = math.isclose(result_pandas_style, result_polars_style, rel_tol=rel_tolerance, abs_tol=abs_tolerance)

# Print numeric values and tolerance decisions for clear comparison understanding.
print("Pandas style result:", result_pandas_style)
print("Polars style result:", result_polars_style)
print("Absolute difference value:", abs_difference)
print("Absolute close within tolerance:", abs_close)
print("Relative difference ratio:", rel_difference)
print("Relative close within tolerance:", rel_close)
print("Combined tolerance close decision:", combined_close)
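The bullet about reporting detailed statistics for out-of-tolerance differences can be sketched with a small column-level report. `tolerance_report` and its default thresholds are hypothetical choices for illustration; suitable tolerances depend on the data.

```python
import math


def tolerance_report(expected, actual, rel_tol=1e-9, abs_tol=1e-12):
    """Summarize element-wise differences between two equal-length numeric lists."""
    if len(expected) != len(actual):
        raise ValueError("expected and actual must have the same length")
    diffs = [abs(e - a) for e, a in zip(expected, actual)]
    # Indices where the pair is not close under the combined tolerance.
    failures = [
        i
        for i, (e, a) in enumerate(zip(expected, actual))
        if not math.isclose(e, a, rel_tol=rel_tol, abs_tol=abs_tol)
    ]
    worst = max(range(len(diffs)), key=diffs.__getitem__) if diffs else None
    return {
        "n_compared": len(diffs),
        "n_failures": len(failures),
        "failure_indices": failures,
        "max_abs_diff": max(diffs, default=0.0),
        "worst_index": worst,
    }


# Two tiny floating point wobbles and one genuine discrepancy.
report = tolerance_report(
    expected=[0.3, 1000.0, 5.0],
    actual=[0.1 + 0.1 + 0.1, 1000.0000001, 5.5],
)
print(report)
```

Reporting the failing indices and the worst difference makes it easier to decide whether a threshold is too strict or whether one of the implementations has a real bug.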



## **3. Behavioral Differences**

### **3.1. Type Coercion Pitfalls**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_05/Lecture_B/image_03_01.jpg?v=1766903909" width="250">



>* Different libraries silently change column data types
>* Hidden coercion causes rounding errors and discrepancies

>* Mixed-type columns are coerced differently across libraries
>* Unaligned coercion affects calculations, comparisons, and decisions

>* Type promotion differs in aggregations, joins, booleans
>* Make types explicit, validate schemas, cast deliberately



In [None]:
#@title Python Code - Type Coercion Pitfalls

# Demonstrate type coercion when dividing integer columns in Pandas and Polars.
# Show how true division (/) silently turns integer cents into floats.
# Contrast with floor division (//), which keeps an integer dtype.

import pandas as pd
import polars as pl

# Create simple cents amounts representing dollars and cents as integers.
amounts_list = [150, 275, 325, 500]
print("Original integer cents list values:", amounts_list)

# Build Pandas DataFrame and perform division by integer divisor.
pd_df = pd.DataFrame({"cents": amounts_list})
pd_df["dollars"] = pd_df["cents"] / 100
print("Pandas column types after division:")
print(pd_df.dtypes)

# Build Polars DataFrame and perform same division operation similarly.
pl_df = pl.DataFrame({"cents": amounts_list})
pl_df = pl_df.with_columns((pl.col("cents") / 100).alias("dollars"))
print("Polars schema types after division:")
print(pl_df.schema)

# Use floor division in Polars to keep an integer dtype (whole dollars; cents are dropped).
pl_df_fixed = pl_df.with_columns((pl.col("cents") // 100).alias("dollars_int"))
print("Polars schema types after integer floor division:")
print(pl_df_fixed.schema)



### **3.2. Null Handling Differences**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_05/Lecture_B/image_03_02.jpg?v=1766903930" width="250">



>* Pandas and Polars treat missing values differently
>* Differences affect types, counts, and downstream behavior

>* Nulls affect math, grouping, and joins differently
>* Define and encode clear null semantics in both

>* Create test datasets to compare null behavior
>* Derive shared rules and conventions for nulls



In [None]:
#@title Python Code - Null Handling Differences

# Demonstrate different null handling between Pandas and Polars clearly.
# Show how arithmetic and grouping behave with missing values differently.
# Help you design consistent null handling rules across both libraries.

import pandas as pd
import polars as pl

# Create simple customer revenue data with missing values included.
# None represents missing revenue, and None user means unknown customer.
# Values are small and readable, representing dollars and user identifiers.

data = {"user": ["A", "B", None, "A"], "revenue": [10.0, None, 5.0, None]}

# Build a Pandas DataFrame from the shared dictionary data.
# Build a Polars DataFrame from the same dictionary data.
# These frames look similar but handle nulls differently internally.

df_pd = pd.DataFrame(data)
df_pl = pl.DataFrame(data)

# Show original dataframes to confirm identical starting values visually.
# Printing both helps compare structures and missing markers quickly.

print("Pandas original dataframe:")
print(df_pd)
print("\nPolars original dataframe:")
print(df_pl)

# Compute mean revenue per user in Pandas.
# mean skips NaN revenues; dropna=False keeps the missing user key as its own group.

pd_group = df_pd.groupby("user", dropna=False)["revenue"].mean()

# Compute mean revenue per user in Polars with the same aggregation.
# Polars also skips nulls and keeps the null key; results show null instead of NaN.

pl_group = df_pl.group_by("user").agg(pl.col("revenue").mean()).sort("user")

# Show grouped results side by side for clear comparison.
# Notice how missing user keys and null revenues are treated differently.

print("\nPandas mean revenue by user:")
print(pd_group)
print("\nPolars mean revenue by user:")
print(pl_group)



### **3.3. Index and Order Semantics**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_05/Lecture_B/image_03_03.jpg?v=1766903949" width="250">



>* Row identity and order differ across libraries
>* Index versus position affects joins, alignment, comparisons

>* Row order can change across different systems
>* Always sort and normalize order before comparing

>* Indexes can align rows or act positional
>* Standardize keys, indexes, and sorting before comparison



In [None]:
#@title Python Code - Index and Order Semantics

# Demonstrate index and order differences between Pandas and Polars.
# Show how joins behave with index labels versus plain columns.
# Emphasize sorting and resetting index before comparing results.

import pandas as pd
import polars as pl

# Create simple customer orders dataframe with custom index labels.
pd_orders = pd.DataFrame({"customer_id": [101, 102], "amount_usd": [50, 80]}, index=["rowA", "rowB"])

# Create payments dataframe with different index labels but matching customer ids.
pd_payments = pd.DataFrame({"customer_id": [102, 101], "paid_usd": [80, 50]}, index=["x1", "x2"])

# Pandas join aligns on index labels. Forcing payments onto the orders index pairs rows by position, so customer_id values no longer match.
pd_join_index = pd_orders.join(pd_payments.set_index(pd_orders.index), rsuffix="_pay")

# Pandas join using explicit customer_id column, enforcing key based alignment.
pd_join_key = pd_orders.reset_index(drop=True).merge(pd_payments, on="customer_id", how="inner")

# Convert same dataframes to Polars, which ignores Pandas index semantics.
pl_orders = pl.from_pandas(pd_orders.reset_index(drop=True))

pl_payments = pl.from_pandas(pd_payments.reset_index(drop=True))

# Polars join always uses explicit columns, here customer_id, for alignment.
pl_join_key = pl_orders.join(pl_payments, on="customer_id", how="inner")

# Print results with clear labels, highlighting index and order behavior.
print("Pandas join using index alignment order:")

print(pd_join_index)

print("\nPandas join using customer_id key:")

print(pd_join_key)

print("\nPolars join using customer_id key:")

print(pl_join_key)



# <font color="#418FDE" size="6.5" uppercase>**Interoperability and Testing**</font>


In this lecture, you learned to:
- Convert data between Pandas and Polars safely when partial migration is required. 
- Design tests that compare outputs of Pandas and Polars implementations for the same logic. 
- Detect and handle subtle behavioral differences between Pandas and Polars, such as type coercion or null semantics. 

<font color='yellow'>Congratulations on completing this course!</font>