# How to code and Polars DataFrames

Lecture, April 28, 2025

## Part 1: how to code and avoid spaghetti code

**What is Spaghetti Code?**

Spaghetti code refers to tangled, poorly structured code that is difficult to understand or maintain. It typically lacks logical organization, relies heavily on interdependent sections, and makes debugging and scaling complex.

**Symptoms of Spaghetti Code:**

- Long, unstructured scripts without functions or classes.
- Code with deep nesting of `if`, `for`, or `while` blocks.
- Repeated patterns instead of reusable functions.
- Lack of comments or documentation.

**Why is it a Problem in Data Processing?**

- **Data Volume Increases Complexity**: As datasets grow larger, unstructured code becomes harder to debug.
- **Collaboration Challenges**: Team members may struggle to understand or contribute to tangled code.
- **Risk of Bugs**: A minor change can introduce unpredictable side effects.

**Q:**

- What is the DRY principle?
- When can spaghetti code be good?

**Short Demonstration:**

Spaghetti code example for processing a dataset:

In [None]:
data = [1, 2, 3, 4, 5]
results = []
for i in range(len(data)):
    if data[i] % 2 == 0:
        results.append(data[i] * 2)
    else:
        results.append(data[i] + 3)
print(results)

**Q:** What makes this code hard to read or maintain?

## Prior notes to help with coding

- Minimal arguments: avoid signatures like `def func_run(x, y, z, another, anotherx, ano, why_this):`
- Logical separation: consistent whitespace, blank lines between blocks, related functions kept close together
- Extraction of subroutines
- Type hints for arguments: `x: list`, `x: pd.DataFrame`
- Naming conventions
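
A minimal sketch of the first and fourth notes, assuming a plotting helper as the use case: related parameters are grouped into a dataclass and every argument carries a type hint. The names `PlotConfig` and `describe_plot` are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass


@dataclass
class PlotConfig:
    """Hypothetical container that replaces a long argument list."""

    title: str
    width: int = 800
    height: int = 600
    color: str = "gray"


def describe_plot(values: list[float], config: PlotConfig) -> str:
    """Return a one-line summary of what would be plotted."""
    return (
        f"{config.title}: {len(values)} points, "
        f"{config.width}x{config.height}, color={config.color}"
    )


summary = describe_plot([1.0, 2.5, 4.0], PlotConfig(title="Demo"))
print(summary)
```

The function now takes two arguments instead of five, and new options can be added to the dataclass without touching every call site.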

## Core Principles of Clean Coding

**2.1 Code Readability:**

- Follow [PEP 8 standards](https://peps.python.org/pep-0008/), or [realpython pep8](https://realpython.com/python-pep8/):
    - Indentation: Use 4 spaces per level.
    - Line length: Limit to 79 characters.
    - Use meaningful variable names: `data` instead of `x`, `process_even` instead of `pe`.
- Keep logic simple and intuitive. Avoid over-engineering solutions.
- Add comments where necessary.

Proper indentation

In [None]:
x = 1

In [None]:
# Bad indentation:
if x>10:
 print("x is large")

# Correct indentation:
if x > 10:
    print("x is large")

White space clarity

In [None]:
x = 1+2*3  # Hard to read
x = 1 + 2 * 3  # Easier to read

**2.2 Modularity:**

- Decompose tasks into logical parts using functions, classes, and modules.
- Aim for functions to perform a single responsibility.

In [None]:
def calculate_area(radius):
    """Calculate the area of a circle."""
    return 3.14 * radius**2

**2.3 DRY Principle (Don't Repeat Yourself):**

- Avoid duplicating logic; instead, reuse functions.
- Example:

In [None]:
data = [1, 2, 3, 4, 5]

In [None]:
# BAD: Repeated logic
total_odd = sum(x for x in data if x % 2 != 0)
total_even = sum(x for x in data if x % 2 == 0)

In [None]:
# GOOD: Refactor into a reusable function
def calculate_sum(data, condition):
    """
    Calculate the sum of elements in a list that satisfy a given condition.

    Args:
        data (list): A list of numerical values.
        condition (function): A function that takes a single argument and returns a boolean.
                              The function is used to test each element in the list.

    Returns:
        int or float: The sum of elements that satisfy the condition.
    """
    return sum(x for x in data if condition(x))

total_odd = calculate_sum(data, lambda x: x % 2 != 0)
total_even = calculate_sum(data, lambda x: x % 2 == 0)

**2.4 Naming Conventions:**

- Variables: `lowercase_with_underscores`. Example: `student_scores`.
- Functions: Verb-based, descriptive. Example: `fetch_data_from_api`.
- Constants: `ALL_CAPS`. Example: `PI = 3.14`.
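
The three conventions can be seen together in a short sketch. The names `MAX_SCORE` and `normalize_scores` are made up for illustration and are not part of the lecture material.

```python
MAX_SCORE = 100  # constant: ALL_CAPS


def normalize_scores(student_scores: list[float]) -> list[float]:
    """Scale raw scores to the 0-1 range (function name starts with a verb)."""
    return [score / MAX_SCORE for score in student_scores]


# Variables: lowercase_with_underscores
student_scores = [55.0, 80.0, 100.0]
normalized_scores = normalize_scores(student_scores)
print(normalized_scores)
```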

**2.5 Documentation:**

- Use docstrings for functions, classes, and modules.
- Example:

In [None]:
def calculate_area(radius: float) -> float:
    """
    Calculate the area of a circle given its radius.

    Args:
        radius (float): The radius of the circle.

    Returns:
        float: The area of the circle.
    """
    return 3.14 * radius**2

In [None]:
calculate_area(1)

## Practical Techniques to Avoid Spaghetti Code

**3.1 Structuring Projects:**

- Folder structure for a data processing project:
    
    ```bash
    project/
        data/            # Raw and processed data files
            raw/
            processed/
        src/             # Python scripts
            preprocess.py
            analyze.py
        tests/           # Unit tests
        main.py          # Entry point
        README.md        # Documentation
    
    ```
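
With a layout like this, paths are easier to maintain when they are built in one place instead of being hard-coded across scripts. The sketch below assumes the folder names from the tree above; the helper `build_paths` is hypothetical.

```python
from pathlib import Path


def build_paths(project_root: Path) -> dict[str, Path]:
    """Return the main data folders of the project, relative to its root."""
    data_dir = project_root / "data"
    return {
        "raw": data_dir / "raw",
        "processed": data_dir / "processed",
    }


# "project" is the assumed root folder from the tree above.
paths = build_paths(Path("project"))
print(paths["raw"])
print(paths["processed"])
```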
    

**3.2 Breaking Down Tasks:**

- Instead of long scripts, divide the logic:

In [None]:
def load_data(filepath):
    """Load data from a CSV file."""
    pass

def preprocess_data(data):
    """Clean and prepare data."""
    pass

def analyze_data(data):
    """Perform analysis."""
    pass

if __name__ == "__main__":
    filepath = "data.csv"
    raw_data = load_data(filepath)
    processed_data = preprocess_data(raw_data)
    analysis = analyze_data(processed_data)

**3.3 Using Type Hints and Annotations:**

- Improve clarity and reduce bugs:

In [None]:
def load_data(filepath: str) -> list[dict]:
    """Load data from a JSON file."""
    pass
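
One possible way to fill in the stub, sketched with the standard `json` module. It assumes the file holds a JSON list of records; the sample file and its content are invented so the example can run on its own.

```python
import json
import tempfile
from pathlib import Path


def load_data(filepath: str) -> list[dict]:
    """Load a JSON file that contains a list of records."""
    with open(filepath, encoding="utf-8") as f:
        records = json.load(f)
    if not isinstance(records, list):
        raise ValueError(f"Expected a JSON list in {filepath}")
    return records


# Write a small sample file so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = Path(tmp_dir) / "sample.json"
    sample_path.write_text(
        json.dumps([{"title": "A", "year": 2020}]), encoding="utf-8"
    )
    records = load_data(str(sample_path))

print(records)
```

The return annotation tells the caller what to expect, and the explicit check turns a silent type mismatch into an early, readable error.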

**3.4 Error Handling:**

- Add meaningful error messages:

In [None]:
try:
    with open("data.csv") as f:
        data = f.readlines()
except FileNotFoundError:
    print("Error: File not found. Check the filepath.")
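
Printing a message is fine in a notebook, but inside a reusable function it is often better to re-raise the error with more context so the caller can decide what to do. A minimal sketch of that variant follows; `read_lines` is a hypothetical helper and the file name is deliberately one that does not exist.

```python
from pathlib import Path


def read_lines(filepath: str) -> list[str]:
    """Read all lines from a text file, failing with a descriptive message."""
    path = Path(filepath)
    try:
        with path.open(encoding="utf-8") as f:
            return f.readlines()
    except FileNotFoundError as err:
        # Re-raise with the resolved path so the user sees where we looked.
        raise FileNotFoundError(
            f"File not found: {path.resolve()}. Check the filepath."
        ) from err


try:
    read_lines("this_file_does_not_exist.csv")
except FileNotFoundError as err:
    message = str(err)
    print(message)
```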

## Example Code Walkthrough

**Scenario:** Process a list of numbers, doubling evens and adding 3 to odds.

**Spaghetti Code:**

In [None]:
data = [1, 2, 3, 4, 5]
results = []
for num in data:
    if num % 2 == 0:
        results.append(num * 2)
    else:
        results.append(num + 3)
print(results)

**Refactored Code:**

In [None]:
# Extract the per-number operations into functions
def process_even(number: int) -> int:
    """Double the input number (intended for even inputs)."""
    return number * 2


def process_odd(number: int) -> int:
    """Add 3 to the input number (intended for odd inputs)."""
    return number + 3


# Modularize the logic
def process_numbers(numbers: list[int]) -> list[int]:
    """Process a list of numbers, doubling evens and adding 3 to odds."""
    return [
        process_even(num) if num % 2 == 0 else process_odd(num)
        for num in numbers
    ]

In [None]:
if __name__ == "__main__":
    data = [1, 2, 3, 4, 5]
    print(process_numbers(data))

**Can we do it even better?**

In [None]:
def process_number(num):
    return num * 2 if num % 2 == 0 else num + 3

In [None]:
[process_number(num) for num in data]
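
A small check helps confirm that a refactoring did not change behavior. The sketch below redefines the compact function so it runs on its own and compares its output with the values the original loop produces for the same input.

```python
def process_number(num: int) -> int:
    """Double even numbers; add 3 to odd numbers."""
    return num * 2 if num % 2 == 0 else num + 3


data = [1, 2, 3, 4, 5]
expected = [4, 4, 6, 8, 8]  # output of the original loop for this input

results = [process_number(num) for num in data]
assert results == expected, f"Refactoring changed the output: {results}"
print(results)
```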

## Recap

- Recap best practices:
    - Follow PEP 8.
    - Use functions for modularity.
    - Keep code readable and reusable.

## Part 2: `polars` DataFrames

[Documentation](https://docs.pola.rs) + [GitHub page](https://github.com/pola-rs/polars)

Nice website: [Modern polars](https://kevinheavey.github.io/modern-polars/timeseries.html)

### **What is `polars`?**

`polars` is a **high-performance DataFrame library for Python** that is designed to handle large datasets efficiently. It's written in Rust and optimized for both speed and memory usage.

**Key Features:**

- Lazy evaluation for optimized computation
- Native support for multi-threading
- Designed to work seamlessly with Arrow and Parquet formats
- Immutable data structures for safety

### **Why Choose `polars`?**

| Feature | `polars` | `pandas` |
| --- | --- | --- |
| Speed | Faster due to Rust and parallelism | Slower for large datasets |
| Memory Efficiency | Optimized for low memory usage | High memory usage for large operations |
| Lazy Evaluation | Available (process only when required) | Not available |
| Parquet Support | Built-in | Requires additional packages |

### Hands-on dataset with `polars`

We’ll use the **Kaggle "Netflix Movies and TV Shows" Dataset** (`netflix_titles.csv`) for demonstration.

In [None]:
# !pip install polars

In [None]:
import polars as pl
# import altair as alt

### Load the Dataset

In [None]:
df = pl.read_csv("../99_files/netflix_titles.csv")

In [None]:
df.head()

### Explore the Dataset

In [None]:
type(df)

In [None]:
# Print schema
df.schema

In [None]:
# Basic statistics
df.describe()

In [None]:
print(df.describe())

### Manipulation and Filtering

In [None]:
df["type"].unique()

In [None]:
df["type"].value_counts()

In [None]:
df["type"].n_unique()

In [None]:
df.filter((pl.col("type") == "Movie"))

In [None]:
df.filter(pl.col("release_year") > 2020)

In [None]:
def sel_split(x):
    return x.split()[0]

dfm = df.filter((pl.col("type") == "Movie"))

result = dfm.select(
    pl.col(["type", "title", "release_year", "duration"]),
    pl.col("duration").map_elements(sel_split, return_dtype=pl.String).alias("duration_min"),
)
result

In [None]:
# Filter movies released after 2015
filtered_df = df.filter((pl.col("type") == "Movie") & (pl.col("release_year") > 2015))
filtered_df

In [None]:
# df.group_by vs. pd.df.groupby

In [None]:
df0 = df.group_by("release_year").len()
df0.sort("release_year")

In [None]:
df0.to_pandas()

In [None]:
# Group by and aggregate
agg_df = df.group_by("release_year").agg([
    pl.col("type").count().alias("count"),
])
print(agg_df)

In [None]:
q = (
    df.lazy()
    .sort("release_year")
    .select(["type", "title", "release_year", "duration"])
    .filter(pl.col("type") == "Movie")
)

dfs = q.collect()
print(dfs)

In [None]:
def extract_min_pl() -> pl.Expr:
    cols = ["duration"]
    return pl.col(cols).str.split(" ").list.get(0)

dfs = dfs.with_columns(
    extract_min_pl().alias("duration_min").cast(pl.Int32)
)
dfs

In [None]:
# Ratio of each duration to the shortest and to the longest movie
dfs.with_columns(
    (pl.col("duration_min") / pl.col("duration_min").min()).alias("ratio_to_min"),
    (pl.col("duration_min") / pl.col("duration_min").max()).alias("ratio_to_max"),
)

### Plots (unstable) - do `to_pandas()` then plot

In [None]:
# s = pl.Series([1, 4, 4, 6, 2, 4, 3, 5, 5, 7, 1])
# s.plot.hist()

In [None]:
# retina resolution seaborn plot
import seaborn as sns
sns.set_context("notebook", font_scale=1.2)
%config InlineBackend.figure_format = "retina"

In [None]:
dfs["duration_min"].to_pandas().plot.hist(title="Movie Duration Distribution", color="gray", bins=30, xlabel="Duration (minutes)")

## Parquet

### Dive into Parquet Format: What is Parquet?

**Parquet** is a columnar storage file format designed for efficient data processing. It is widely used in big data frameworks like Spark and Hive.

[https://docs.pola.rs/user-guide/io/parquet/](https://docs.pola.rs/user-guide/io/parquet/)

**Advantages of Parquet:**

- Compact storage through compression
- Faster read/write times for columnar operations
- Schema evolution support

### Using Parquet with `polars`

In [None]:
# Save as Parquet
df.write_parquet("netflix_titles.parquet")

In [None]:
# Load from Parquet
parquet_df = pl.read_parquet("netflix_titles.parquet")
print(parquet_df)

## Performance Showcase

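We'll benchmark `polars` and `pandas` on the Netflix dataset loaded above. It is a small file, so treat the timings as illustrative; the difference is expected to matter more on larger datasets.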

### Benchmark Code

In [None]:
import pandas as pd
import time

# Pandas benchmark (CSV)
start = time.perf_counter()
pandas_df = pd.read_csv("../99_files/netflix_titles.csv")
pandas_filtered = pandas_df[(pandas_df["type"] == "Movie") & (pandas_df["release_year"] > 2015)]
print("Pandas (CSV) time:", time.perf_counter() - start)

# Polars benchmark (CSV)
start = time.perf_counter()
polars_df = pl.read_csv("../99_files/netflix_titles.csv")
polars_filtered = polars_df.filter((pl.col("type") == "Movie") & (pl.col("release_year") > 2015))
print("Polars (CSV) time:", time.perf_counter() - start)

# Polars benchmark (Parquet); uses the file written in the Parquet section
start = time.perf_counter()
polars_df = pl.read_parquet("netflix_titles.parquet")
polars_filtered = polars_df.filter((pl.col("type") == "Movie") & (pl.col("release_year") > 2015))
print("Polars (Parquet) time:", time.perf_counter() - start)

### Expected Results

- `polars` is expected to be faster than `pandas` here, and reading Parquet is expected to be faster than reading CSV. Memory usage is not measured by this benchmark.
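
A single timed run is noisy, especially on a small file. One common remedy is to repeat the measurement and keep the fastest run. The helper `time_best_of` below is a hypothetical sketch using only the standard library; the workload is a stand-in to be replaced by the pandas or polars pipeline.

```python
import time
from typing import Callable


def time_best_of(func: Callable[[], object], repeats: int = 5) -> float:
    """Return the fastest wall-clock time (in seconds) over several runs."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return min(timings)


# Stand-in workload; swap in the pipeline you want to compare.
best = time_best_of(lambda: sum(x * x for x in range(100_000)))
print(f"Best of 5: {best:.6f} s")
```

In a notebook, the `%timeit` magic serves the same purpose with less code.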

### **Summary**

1. **`polars` vs `pandas`:**
    - Faster, more efficient for large datasets
    - Lazy evaluation for complex pipelines
2. **Parquet Format:**
    - Ideal for compact, efficient storage
    - Works seamlessly with `polars`