# <font color="#418FDE" size="6.5" uppercase>**DataFrames and Schema**</font>

>Last update: 20260101.
    
By the end of this lecture, you will be able to:
- Create Polars DataFrames and LazyFrames from in-memory data and external files that are commonly used with pandas. 
- Inspect and adjust Polars schemas to ensure column names and data types match expectations from existing pandas workflows. 
- Compare pandas and Polars data loading patterns to identify where direct one-to-one translations are possible. 


## **1. Building Polars DataFrames**

### **1.1. Dictionaries and Lists**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_02/Lecture_A/image_01_01.jpg?v=1767311272" width="250">



>* Use dictionaries and lists to build DataFrames
>* Polars turns them into columns for quick analysis

>* Lists represent row-wise records like customers
>* Polars turns nested lists and dicts into DataFrames

>* Prototype with small in-memory data in Polars
>* Reuse same patterns when scaling to larger datasets



In [None]:
#@title Python Code - Dictionaries and Lists

# Demonstrate creating Polars DataFrames from dictionaries and lists.
# Show how column names map from dictionary keys and record fields.
# Print small DataFrames to compare structures and understand behavior.

# !pip install polars --quiet

# Import polars library for DataFrame creation.
import polars as pl

# Create a dictionary where keys represent column names.
sales_dict = {"product": ["Book", "Pen"], "quantity": [3, 10]}

# Build a DataFrame directly from the dictionary structure.
df_from_dict = pl.DataFrame(sales_dict)

# Print the DataFrame created from the dictionary.
print("DataFrame from dictionary:\n", df_from_dict)

# Create a list of dictionaries representing individual sales records.
sales_list_dicts = [{"product": "Book", "quantity": 3}, {"product": "Pen", "quantity": 10}]

# Build a DataFrame from the list of dictionary records.
df_from_list_dicts = pl.DataFrame(sales_list_dicts)

# Print the DataFrame created from the list of dictionaries.
print("\nDataFrame from list of dictionaries:\n", df_from_list_dicts)

# Create a list of lists representing rows with consistent ordering.
sales_list_rows = [["Book", 3], ["Pen", 10]]

# Build a DataFrame from list rows with explicit column names.
df_from_list_rows = pl.DataFrame(sales_list_rows, schema=["product", "quantity"], orient="row")

# Print the DataFrame created from the list of row lists.
print("\nDataFrame from list of lists:\n", df_from_list_rows)



### **1.2. Loading CSV and Parquet**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_02/Lecture_A/image_01_02.jpg?v=1767311307" width="250">



>* Polars loads CSV and Parquet efficiently
>* Creates accurate DataFrames ready for further analysis

>* Control headers, missing values, and type inference
>* Avoid mis-typed columns that break downstream work

>* Parquet stores typed, columnar data Polars optimizes
>* Enables fast, selective reads across large datasets



In [None]:
#@title Python Code - Loading CSV and Parquet

# Demonstrate loading CSV and Parquet files with Polars DataFrames.
# Show basic options for headers and data type inference with CSV files.
# Compare loading the same data from CSV and Parquet formats efficiently.

# !pip install polars pandas pyarrow

# Import required libraries for data handling and file formats.
import polars as pl
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Create a small pandas DataFrame representing simple sales records.
data_dict = {"order_id": ["0001", "0002"], "amount_usd": [19.99, 5.50]}
pdf = pd.DataFrame(data_dict)

# Save the pandas DataFrame as a CSV file for Polars loading.
csv_path = "sales_small.csv"
pdf.to_csv(csv_path, index=False)

# Save the same pandas DataFrame as a Parquet file for comparison.
parquet_path = "sales_small.parquet"
table = pa.Table.from_pandas(pdf)
pq.write_table(table, parquet_path)

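# Load the CSV file into a Polars DataFrame, reading column names from the header row.
# Note: order_id is inferred as an integer here, so "0001" loses its leading zeros.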
df_csv = pl.read_csv(csv_path, has_header=True)
print("CSV DataFrame loaded with Polars:")
print(df_csv)

# Load the Parquet file into a Polars DataFrame; Parquet stores column types, so order_id stays a string.
df_parquet = pl.read_parquet(parquet_path)
print("Parquet DataFrame loaded with Polars:")
print(df_parquet)



### **1.3. Migrating from pandas**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_02/Lecture_A/image_01_03.jpg?v=1767311336" width="250">



>* Reuse existing workflows when moving to Polars
>* Match outputs to build confidence before optimizations

>* Replace initial tabular steps with Polars creation
>* Validate schema and sample values before migrating

>* Use lazy queries instead of eager loading
>* Gain speed, scalability, and easier workflow maintenance



In [None]:
#@title Python Code - Migrating from pandas

# Show simple migration from pandas DataFrame into Polars DataFrame and LazyFrame.
# Compare basic properties to confirm both DataFrames contain matching rows and columns.
# Demonstrate starting with pandas habits while gradually adopting Polars loading patterns.

# Install polars, pandas, and pyarrow if they are missing (pl.from_pandas relies on pyarrow).
# !pip install polars pandas pyarrow --quiet

# Import pandas and polars libraries for DataFrame creation and comparison.
import pandas as pd
import polars as pl

# Create small pandas DataFrame that mimics daily sales report structure.
pd_df = pd.DataFrame({"day":["Mon","Tue","Wed"],"region":["East","West","East"],"sales_dollars":[120.0,150.5,99.0]})

# Convert pandas DataFrame into Polars DataFrame using from_pandas constructor.
pl_from_pd = pl.from_pandas(pd_df)

# Build the same data directly in Polars, skipping the pandas step entirely.
pl_direct = pl.DataFrame({"day":["Mon","Tue","Wed"],"region":["East","West","East"],"sales_dollars":[120.0,150.5,99.0]})

# Create Polars LazyFrame to prepare for larger future datasets and lazy execution.
lazy_sales = pl_direct.lazy()

# Print basic shape comparison to confirm matching row and column counts.
print("pandas shape:", pd_df.shape, "polars shape:", pl_from_pd.shape)

# Print column names from both libraries to verify consistent schema alignment.
print("pandas columns:", list(pd_df.columns))
print("polars columns:", pl_from_pd.columns)

# Collect lazy result and show first rows to mirror familiar head style checks.
print("lazy head:")
print(lazy_sales.head(2).collect())



## **2. LazyFrame Schema Essentials**

### **2.1. Building LazyFrames From Scans**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_02/Lecture_A/image_02_01.jpg?v=1767311403" width="250">



>* Use scans to build lazy data blueprints
>* Infer, review, and fix schemas before analysis

>* Schema comes from file metadata (Parquet) or sampled rows (CSV)
>* Immediately inspect and fix column names and types

>* Review and adjust scanned schemas before transformations
>* Rename columns, fix types, ensure workflow alignment



In [None]:
#@title Python Code - Building LazyFrames From Scans

# Demonstrate building LazyFrames from scan operations with schema inspection.
# Show how Polars infers column names and data types from external files.
# Adjust schema to match expectations from previous pandas style workflows.

# !pip install polars

# Import Polars for lazy scanning and os for filesystem operations.
import polars as pl
import os

# Create a small directory for our example CSV files.
os.makedirs("sales_data", exist_ok=True)

# Define CSV content representing daily sales with mixed data types.
csv_content = "transaction_id,date,amount_usd\n1,2024-01-01,10.5\n2,2024-01-02,20.0\n3,2024-01-03,15.75\n"

# Write the CSV content into a file inside the directory.
with open("sales_data/day1.csv", "w", encoding="utf-8") as file_handle:
    file_handle.write(csv_content)

# Build a LazyFrame by lazily scanning every CSV file that matches the glob pattern.
lazy_sales = pl.scan_csv("sales_data/*.csv")

# Print the inferred schema to inspect column names and data types.
# collect_schema() (Polars 1.0+) resolves the schema without running the full query.
print("Inferred schema from scan:")
print(lazy_sales.collect_schema())

# Rename columns to match expected pandas style naming conventions.
lazy_renamed = lazy_sales.rename({"amount_usd": "revenue_usd"})

# Cast transaction_id to string to match identifier expectations.
lazy_casted = lazy_renamed.with_columns(pl.col("transaction_id").cast(pl.Utf8))

# Print the adjusted schema after renaming and casting operations.
print("\nAdjusted schema after alignment:")
print(lazy_casted.collect_schema())

# Collect a small sample to confirm that lazy operations worked correctly.
print("\nSample rows after lazy alignment:")
print(lazy_casted.limit(3).collect())



### **2.2. Deferred Execution Basics**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_02/Lecture_A/image_02_02.jpg?v=1767311420" width="250">



>* Lazy queries build a plan, not data
>* Schema preview shows future column names and types

>* Design-time schema checks speed up migration tweaks
>* Early validation prevents costly, real-world data errors

>* Deferred execution lets Polars globally optimize transformations
>* Optimizations change how the work runs, not the resulting schema



In [None]:
#@title Python Code - Deferred Execution Basics

# Demonstrate lazy deferred execution with Polars LazyFrame schemas.
# Show schema prediction before any actual data loading happens.
# Compare lazy schema inspection with final executed DataFrame result.

# !pip install polars pyarrow.

# Import required Polars library for lazy DataFrame operations.
import polars as pl

# Create a small in memory dataset representing simple web log rows.
data_rows = [
    {"user_id": "u1", "clicks": "3", "miles": "12.5"},
    {"user_id": "u2", "clicks": "5", "miles": "7.0"},
]

# Build a LazyFrame directly from the in-memory records; nothing is computed until collect().
lf = pl.LazyFrame(data_rows)

# Define a lazy transformation pipeline with renaming and type casting operations.
lf_transformed = (
    lf.rename({"miles": "distance_miles"})
      .with_columns(pl.col("clicks").cast(pl.Int64))
      .with_columns(pl.col("distance_miles").cast(pl.Float64))
)

# Inspect the lazy schema prediction before any data execution happens.
print("Lazy schema prediction before execution:")
print(lf_transformed.collect_schema())

# Execute the lazy pipeline to materialize an actual DataFrame result.
df_result = lf_transformed.collect()

# Show the final DataFrame and its concrete schema after execution.
print("\nExecuted DataFrame and its schema:")
print(df_result)
print(df_result.schema)



### **2.3. DataFrame LazyFrame Conversion**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_02/Lecture_A/image_02_03.jpg?v=1767311438" width="250">



>* Convert DataFrame to LazyFrame to preserve schema
>* LazyFrame keeps original column names and types

>* Use LazyFrame as a workspace for schema cleanup
>* Rename columns, fix types, and defer execution

>* Chain several schema fixes into a single lazy plan
>* Standardize types and names in one execution



In [None]:
#@title Python Code - DataFrame LazyFrame Conversion

# Demonstrate converting Polars DataFrame into LazyFrame safely.
# Show that schema stays consistent during lazy transformations.
# Compare DataFrame and LazyFrame schemas before and after changes.

# !pip install polars

# Import polars library for DataFrame and LazyFrame usage.
import polars as pl

# Create example data using a simple Python dictionary.
data = {
    "customer_id": [101, 102, 103],
    "signup_date": ["2024-01-01", "2024-01-05", "2024-01-10"],
    "revenue_usd": ["$10.50", "$20.00", "$7.25"],
}

# Build a Polars DataFrame from the in memory dictionary.
df = pl.DataFrame(data)

# Print original DataFrame schema to inspect column types.
print("Original DataFrame schema:")
print(df.schema)

# Convert the existing DataFrame into a LazyFrame object.
lf = df.lazy()

# Print LazyFrame schema which should match DataFrame schema.
print("\nLazyFrame schema after conversion:")
print(lf.collect_schema())

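# Plan lazy schema refinements without executing immediately.
# literal=True matters here: "$" is otherwise read as a regex end-of-string anchor,
# so the dollar sign would stay in place and the cast would produce nulls.
lf_fixed = (
    lf.with_columns([
        pl.col("revenue_usd").str.replace_all("$", "", literal=True).cast(pl.Float64, strict=False),
        pl.col("signup_date").str.strptime(pl.Date, strict=False),
    ])
    .rename({"revenue_usd": "revenue_dollars"})
)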

# Print refined LazyFrame schema showing updated column types and names.
print("\nLazyFrame schema after lazy refinements:")
print(lf_fixed.collect_schema())

# Collect final DataFrame to execute lazy plan and view results.
result_df = lf_fixed.collect()

# Print final DataFrame to confirm schema and values after execution.
print("\nFinal DataFrame after lazy execution:")
print(result_df)



## **3. Schema alignment basics**

### **3.1. Column Types Overview**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_02/Lecture_A/image_03_01.jpg?v=1767311465" width="250">



>* Pandas and Polars describe column types differently
>* Type differences affect one‑to‑one code translation

>* Group columns into numeric, text, temporal, boolean, categorical
>* Compare these families to spot semantic differences

>* Mismatched column types can break existing workflows
>* Understand each library’s types to safely translate



### **3.2. Column casting and renaming**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_02/Lecture_A/image_03_02.jpg?v=1767311475" width="250">



>* Use casting and renaming to match expectations
>* Align schemas to avoid pipeline bugs

>* Casting fixes wrong column data types
>* Prevents lost leading zeros and broken date logic

>* Rename columns to match shared naming contracts
>* Keeps pipelines unchanged while swapping dataframe libraries



In [None]:
#@title Python Code - Column casting and renaming

# Demonstrate simple column casting between pandas and Polars dataframes.
# Show how identifier columns can be safely converted to string types.
# Show how columns can be renamed to match existing pipeline expectations.
# !pip install polars pandas pyarrow

# Import required libraries for pandas and Polars usage.
import pandas as pd
import polars as pl

# Create a small pandas DataFrame with numeric looking identifiers.
pd_df = pd.DataFrame({"customer_id": [101, 102, 103], "net_revenue": [25.5, 40.0, 32.5]})

# Display original pandas DataFrame and its dtypes for comparison.
print("Original pandas DataFrame and dtypes:")
print(pd_df)
print(pd_df.dtypes)

# Convert pandas DataFrame into a Polars DataFrame for migration demonstration.
pl_df = pl.from_pandas(pd_df)

# Show Polars schema before any casting or renaming operations.
print("\nOriginal Polars schema and data:")
print(pl_df)
print(pl_df.schema)

# Cast customer_id to the Utf8 string type so identifiers are treated as text, not numbers.
pl_casted = pl_df.with_columns(pl.col("customer_id").cast(pl.Utf8))

# Rename columns to match a different expected schema from existing pipelines.
pl_renamed = pl_casted.rename({"customer_id": "customerId", "net_revenue": "netRevenue"})

# Show final Polars schema and data after casting and renaming operations.
print("\nPolars after casting and renaming:")
print(pl_renamed)
print(pl_renamed.schema)



### **3.3. Aligning schemas with pandas expectations**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_02/Lecture_A/image_03_03.jpg?v=1767311494" width="250">



>* Check schemas match what analyses expect downstream
>* Adjust column names and types to avoid errors

>* Start from existing workflows and their expectations
>* Match column names and types to prevent mismatches

>* Schema alignment supports collaboration, regulation, and documentation
>* It ensures reproducibility, validation, and cross-tool consistency



In [None]:
#@title Python Code - Aligning schemas with pandas expectations

# Demonstrate aligning Polars schema with existing pandas expectations.
# Show mismatched types and names between two similar DataFrames.
# Fix Polars schema so it matches the trusted pandas based workflow.

# !pip install polars pandas

# Import required libraries for pandas and Polars usage.
import pandas as pd
import polars as pl

# Create a pandas DataFrame representing trusted workflow expectations.
pd_df = pd.DataFrame({"campaign_id": ["A101", "B202"], "revenue_cents": [1500, 2300]})

# Create a Polars DataFrame with slightly different schema assumptions.
pl_df = pl.DataFrame({"campaignId": [101, 202], "revenue_dollars": [15.0, 23.0]})

# Show original pandas schema and first rows for comparison clarity.
print("PANDAS SCHEMA AND DATA:")
print(pd_df.dtypes)
print(pd_df.head())

# Show original Polars schema and first rows before any alignment.
print("\nPOLARS ORIGINAL SCHEMA AND DATA:")
print(pl_df.schema)
print(pl_df.head())

# Align Polars column names and types with pandas expectations.
# Dollars become cents: multiply by 100 and round before the integer cast to avoid float truncation.
pl_aligned = pl_df.rename({"campaignId": "campaign_id", "revenue_dollars": "revenue_cents"}).with_columns(
    pl.col("campaign_id").cast(pl.Utf8),
    (pl.col("revenue_cents") * 100).round(0).cast(pl.Int64),
)

# Show aligned Polars schema and first rows after transformations.
print("\nPOLARS ALIGNED SCHEMA AND DATA:")
print(pl_aligned.schema)
print(pl_aligned.head())



# <font color="#418FDE" size="6.5" uppercase>**DataFrames and Schema**</font>


In this lecture, you learned to:
- Create Polars DataFrames and LazyFrames from in-memory data and external files that are commonly used with pandas. 
- Inspect and adjust Polars schemas to ensure column names and data types match expectations from existing pandas workflows. 
- Compare pandas and Polars data loading patterns to identify where direct one-to-one translations are possible. 

In the next lecture (Lecture B), we will go over 'Selecting and Filtering'.