# <font color="#418FDE" size="6.5" uppercase>**DataFrames and Schema**</font>

>Last update: 20260101.
    
By the end of this lecture, you will be able to:
- Create Polars DataFrames and LazyFrames from in-memory data and external files that are commonly used with pandas. 
- Inspect and adjust Polars schemas to ensure column names and data types match expectations from existing pandas workflows. 
- Compare pandas and Polars data loading patterns to identify where direct one-to-one translations are possible. 


## **1. Building Polars DataFrames**

### **1.1. Dictionaries and Lists**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_02/Lecture_A/image_01_01.jpg?v=1767310151" width="250">



>* Polars builds DataFrames from familiar Python containers
>* Dictionaries and lists become efficient columnar tables

>* Use dicts and lists for raw data
>* Convert them to DataFrames for analysis

>* Keep columns and rows aligned and consistently typed
>* Clean structure ensures reliable schemas and later analysis



In [None]:
#@title Python Code - Dictionaries and Lists

# Demonstrate building Polars DataFrames from dictionaries and lists.
# Show dictionary of lists and list of dictionaries conversions.
# Print resulting DataFrames to compare structures clearly.

# !pip install polars --quiet

# Import Polars library for DataFrame creation.
import polars as pl

# Create sales data using dictionary of lists.
sales_dict = {"day": ["Mon", "Tue", "Wed"], "orders": [12, 18, 9]}

# Build DataFrame from dictionary of lists.
df_from_dict = pl.DataFrame(sales_dict)

# Create sales data using list of dictionaries.
sales_list = [
    {"day": "Mon", "orders": 12},
    {"day": "Tue", "orders": 18},
    {"day": "Wed", "orders": 9},
]

# Build DataFrame from list of dictionaries.
df_from_list = pl.DataFrame(sales_list)

# Print both DataFrames to compare structures.
print("DataFrame from dictionary of lists:\n", df_from_dict)

# Print second DataFrame showing equivalent result.
print("\nDataFrame from list of dictionaries:\n", df_from_list)



### **1.2. File Based DataFrames**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_02/Lecture_A/image_01_02.jpg?v=1767310170" width="250">



>* Load Polars DataFrames directly from common files
>* Choose DataFrame or LazyFrame for existing datasets

>* Configure how files map to columns
>* Handle CSV, JSON, Parquet with correct types

>* Choose eager DataFrames for small, interactive datasets
>* Use LazyFrames for large, optimized, scalable workflows



In [None]:
#@title Python Code - File Based DataFrames

# Demonstrate reading small CSV files into Polars DataFrames and LazyFrames.
# Show basic file based loading using simple in memory created files.
# Compare eager read_csv with lazy scan_csv for the same dataset.

# !pip install polars pyarrow fsspec

# Import required libraries for file handling and Polars usage.
import os
import polars as pl

# Create a small CSV text representing daily temperatures in Fahrenheit.
csv_text = "day,city,temp_f\nMonday,Denver,68\nTuesday,Denver,70\nWednesday,Denver,65\n"

# Define a file path inside the Colab working directory for saving CSV.
csv_path = "weather_small.csv"

# Write the CSV text content into the file using standard Python open.
with open(csv_path, "w", encoding="utf-8") as file_handle:
    file_handle.write(csv_text)

# Read the CSV eagerly into a Polars DataFrame using read_csv function.
df_eager = pl.read_csv(csv_path)

# Print the eager DataFrame to inspect loaded columns and values.
print("Eager DataFrame from CSV:")
print(df_eager)

# Create a LazyFrame using scan_csv which defers reading until collect.
lf_lazy = pl.scan_csv(csv_path)

# Add a new Celsius column lazily using a simple Fahrenheit conversion.
lf_with_c = lf_lazy.with_columns(((pl.col("temp_f") - 32) * 5 / 9).alias("temp_c"))

# Collect the LazyFrame into a DataFrame after transformations are defined.
df_lazy_result = lf_with_c.collect()

# Print the resulting DataFrame showing both Fahrenheit and Celsius columns.
print("\nLazyFrame collected with Celsius:")
print(df_lazy_result)



### **1.3. Migrating from pandas**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_02/Lecture_A/image_01_03.jpg?v=1767310191" width="250">



>* Reuse familiar table-like objects when migrating
>* Build column-based DataFrames from the same sources

>* Reuse existing files and in-memory data
>* Rebuild matching DataFrames while learning new syntax

>* Lazy evaluation defers running data transformations
>* Same sources scale better for large complex workflows



In [None]:
#@title Python Code - Migrating from pandas

# Demonstrate migrating simple pandas DataFrame into Polars DataFrame and LazyFrame.
# Show same in memory data used with both libraries for familiarity.
# Compare eager DataFrame and lazy LazyFrame using identical underlying data.

# !pip install pandas polars pyarrow

# Import pandas and polars libraries together for comparison.
import pandas as pd
import polars as pl

# Create simple pandas DataFrame from dictionary of lists data.
pd_orders = pd.DataFrame({"order_id": [1, 2, 3], "amount_usd": [20, 35, 50]})

# Convert pandas DataFrame into Polars eager DataFrame directly.
pl_orders = pl.from_pandas(pd_orders)

# Build Polars LazyFrame from same pandas DataFrame source.
lazy_orders = pl.from_pandas(pd_orders).lazy()

# Print original pandas DataFrame to show familiar structure.
print("Pandas DataFrame:")
print(pd_orders)

# Print Polars eager DataFrame to show migrated structure.
print("\nPolars DataFrame:")
print(pl_orders)

# Collect lazy Polars DataFrame and print filtered result.
print("\nPolars LazyFrame filtered amount_usd greater than thirty:")
print(lazy_orders.filter(pl.col("amount_usd") > 30).collect())



## **2. LazyFrame Schema Essentials**

### **2.1. Building LazyFrames From Scans**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_02/Lecture_A/image_02_01.jpg?v=1767310219" width="250">



>* Use scans to build memory-efficient LazyFrames
>* Scans reveal schema early for pipeline alignment

>* Inspect and tweak schema right after scanning
>* Fix column names and types before processing

>* Use one lazy scan for many files
>* Standardize schemas to unify data and workflows



In [None]:
#@title Python Code - Building LazyFrames From Scans

# Demonstrate building Polars LazyFrames from scan operations on CSV files.
# Show how to inspect inferred schemas before loading full dataset eagerly.
# Adjust column data types at scan stage to match expected analytical workflows.

# !pip install polars pyarrow

# Import required libraries for file handling and Polars usage.
import os
import polars as pl

# Create a small sample CSV file representing customer transactions data.
csv_content = "customer_id,order_date,amount_usd\nA001,2024-01-01,19.99\nA002,2024-01-02,5.50\nA003,2024-01-03,100.00\n"

# Write the CSV content into a temporary file within current working directory.
file_path = os.path.join(os.getcwd(), "transactions_sample.csv")
with open(file_path, "w", encoding="utf-8") as f:
    f.write(csv_content)

# Build a LazyFrame using scan_csv which avoids immediate full data loading.
lf = pl.scan_csv(file_path)

# Print the inferred schema to inspect column names and detected data types.
print("Inferred schema from scan_csv:")
print(lf.collect_schema())

# Adjust schema by casting order_date to Date and customer_id to categorical type.
lf_fixed = lf.with_columns([
    pl.col("order_date").str.strptime(pl.Date, strict=False).alias("order_date"),
    pl.col("customer_id").cast(pl.Categorical).alias("customer_id"),
])

# Print the updated schema after applying type corrections on the LazyFrame.
print("\nUpdated schema after casting columns:")
print(lf_fixed.collect_schema())

# Collect the LazyFrame into a DataFrame only after schema adjustments are complete.
df_result = lf_fixed.collect()

# Display the final DataFrame to confirm values and corrected data types visually.
print("\nFinal DataFrame after lazy scan and schema fixes:")
print(df_result)



### **2.2. Deferred Execution Basics**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_02/Lecture_A/image_02_02.jpg?v=1767310235" width="250">



>* Lazy queries build a plan, not data
>* Schema shows predicted columns and types for validation

>* Deferred execution lets you inspect schemas early
>* Catch type mismatches and fix with planned casts

>* Schema can change silently as sources evolve
>* Regular lazy schema checks catch breaking changes



In [None]:
#@title Python Code - Deferred Execution Basics

# Demonstrate lazy deferred execution with Polars LazyFrame schemas.
# Show schema prediction before executing any heavy data loading.
# Compare schema before and after adding a cast transformation.

# !pip install polars pyarrow

# Import required Polars library for lazy operations.
import polars as pl

# Create a small in memory DataFrame mimicking a CSV file.
df = pl.DataFrame({"id": [1, 2], "price_str": ["10.5", "20.0"]})

# Build a LazyFrame from the eager DataFrame without executing transformations.
lf = df.lazy()

# Print the initial lazy schema prediction before any transformations.
print("Initial lazy schema:")
print(lf.collect_schema())

# Add a cast transformation to convert price_str into a float column.
lf_cast = lf.with_columns(pl.col("price_str").cast(pl.Float64).alias("price_usd"))

# Print the updated lazy schema prediction after adding the cast step.
print("\nSchema after cast step:")
print(lf_cast.collect_schema())

# Finally collect executes the plan and materializes the transformed DataFrame.
result_df = lf_cast.select(["id", "price_usd"]).collect()

# Show the final materialized DataFrame confirming schema predictions were accurate.
print("\nFinal collected DataFrame:")
print(result_df)



### **2.3. DataFrame LazyFrame Conversion**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_02/Lecture_A/image_02_03.jpg?v=1767310250" width="250">



>* Convert DataFrame to LazyFrame without changing schema
>* Keeps column names and types for easy validation

>* Clean and standardize data eagerly in DataFrames
>* Convert to LazyFrame for scalable, optimized pipelines

>* Use conversion to compare pandas and Polars
>* Validate schemas and results during gradual migration



In [None]:
#@title Python Code - DataFrame LazyFrame Conversion

# Demonstrate converting Polars DataFrame into LazyFrame while preserving schema.
# Show schema before conversion and after conversion for clear comparison.
# Highlight how eager cleaning can precede lazy optimized transformations.

# !pip install polars pyarrow pandas

# Import required libraries for DataFrame and LazyFrame demonstration.
import polars as pl

# Create a simple Polars DataFrame with mixed column types.
df = pl.DataFrame({"customer_id": ["A1", "B2", "C3"],
                   "product_category": ["Books", "Tools", "Books"],
                   "amount_usd": [12.5, 7.0, 20.0]})

# Print original DataFrame schema to inspect column names and types.
print("Original DataFrame schema:")
print(df.schema)

# Perform eager cleaning by casting product_category to categorical type.
df_clean = df.with_columns(pl.col("product_category").cast(pl.Categorical))

# Print cleaned DataFrame schema to confirm updated logical types.
print("\nCleaned DataFrame schema:")
print(df_clean.schema)

# Convert cleaned DataFrame into a LazyFrame for deferred execution.
lf = df_clean.lazy()

# Print LazyFrame schema to verify it matches cleaned DataFrame schema.
print("\nLazyFrame schema after conversion:")
print(lf.collect_schema())

# Collect LazyFrame back to DataFrame and print to confirm identical visible structure.
print("\nMaterialized DataFrame from LazyFrame:")
print(lf.collect())



## **3. Schema alignment basics**

### **3.1. Inspecting Columns and Types**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_02/Lecture_A/image_03_01.jpg?v=1767310286" width="250">



>* Carefully check loaded columns and inferred types
>* Use checks to judge safe one-to-one translations

>* Messy data makes pandas and Polars infer differently
>* Check columns and dtypes to avoid downstream issues

>* Check schemas match team and workflow expectations
>* Consistent inspection reduces bugs and reveals library differences



In [None]:
#@title Python Code - Inspecting Columns and Types

# Demonstrate inspecting columns and types in pandas and Polars.
# Show how the same CSV loads with different inferred dtypes.
# Help compare schemas for safer pandas to Polars migration.

# !pip install pandas polars pyarrow

# Import required libraries for data handling.
import pandas as pd
import polars as pl
from io import StringIO

# Create a small CSV string with mixed types.
csv_text = "id,score,joined_at\n001,10,2024-01-01\n002,NA,2024-01-02\n003,15,not_a_date"

# Load the CSV into a pandas DataFrame.
pdf = pd.read_csv(StringIO(csv_text))

# Load the same CSV into a Polars DataFrame.
pldf = pl.read_csv(source=StringIO(csv_text))

# Print pandas columns and dtypes for comparison.
print("Pandas columns and dtypes:")
print(pdf.dtypes)

# Print Polars columns and dtypes for comparison.
print("\nPolars columns and dtypes:")
print(pldf.dtypes)

# Print Polars schema showing column names and types.
print("\nPolars schema:")
print(pldf.schema)



### **3.2. Column casting and renaming**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_02/Lecture_A/image_03_02.jpg?v=1767310315" width="250">



>* Cast each column to its intended type
>* Explicit casting prevents subtle cross-library data bugs

>* Consistent casting keeps behavior aligned across libraries
>* Matching logical types reduces errors and mental effort

>* Rename columns to match shared naming conventions
>* Standardized names and types simplify cross-tool workflows



In [None]:
#@title Python Code - Column casting and renaming

# Demonstrate casting and renaming columns for schema alignment between pandas and Polars.
# Show how string columns become numeric, date, and categorical types explicitly.
# Compare original and aligned schemas to mirror existing pandas based workflows.

# !pip install polars pandas pyarrow

# Import required libraries for pandas and Polars usage.
import pandas as pd
import polars as pl

# Create a small pandas DataFrame with all columns as strings.
pd_df = pd.DataFrame({
    "order_id": ["A1", "A2"],
    "purchase_date": ["2024-01-01", "2024-01-02"],
    "quantity": ["3", "5"],
    "unit_price": ["10.50", "7.25"],
})

# Show original pandas dtypes for quick schema inspection.
print("Original pandas dtypes:\n", pd_df.dtypes)

# Convert pandas DataFrame into a Polars DataFrame directly.
pl_df = pl.from_pandas(pd_df)

# Show original Polars schema before any casting or renaming.
print("\nOriginal Polars schema:", pl_df.schema)

# Cast quantity and unit_price to numeric, purchase_date to date type.
pl_casted = pl_df.with_columns([
    pl.col("quantity").cast(pl.Int64),
    pl.col("unit_price").cast(pl.Float64),
    pl.col("purchase_date").str.strptime(pl.Date, strict=False),
])

# Rename columns to match a preferred existing pandas naming convention.
# purchase_date already matches the convention, so it is left out of the mapping.
pl_aligned = pl_casted.rename({
    "order_id": "order_label",
    "quantity": "quantity_units",
    "unit_price": "unit_price_usd",
})

# Show final Polars schema after casting and renaming operations.
print("\nAligned Polars schema:", pl_aligned.schema)




### **3.3. Aligning schemas with pandas expectations**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Pandas to Polars Migration/Module_02/Lecture_A/image_03_03.jpg?v=1767310333" width="250">



>* Understand downstream expectations for columns and types
>* Shape Polars schemas to match pandas-based workflows

>* Treat schema alignment like passing a contract
>* Match pandas column names and types to avoid bugs

>* Schema alignment smooths mixed Polars–pandas pipelines
>* Enables gradual Polars adoption without disrupting workflows



# <font color="#418FDE" size="6.5" uppercase>**DataFrames and Schema**</font>


In this lecture, you learned to:
- Create Polars DataFrames and LazyFrames from in-memory data and external files that are commonly used with pandas. 
- Inspect and adjust Polars schemas to ensure column names and data types match expectations from existing pandas workflows. 
- Compare pandas and Polars data loading patterns to identify where direct one-to-one translations are possible. 

In the next lecture (Lecture B), we will go over 'Selecting and Filtering'.