# Best Practices, Performance, & Troubleshooting

This chapter provides tips and best practices for working effectively with `Flow`, optimizing performance, and troubleshooting common issues. Understanding these points will help you write cleaner, more efficient, and more robust data processing pipelines.

## Embrace (and Understand) Laziness

- Work is Deferred: `Flow` operations like `.filter()`, `.assign()`, and `.select()` don't process data immediately; they build up a plan. The actual computation happens only when you call a "materialization" method (e.g., `.collect()`, `.to_pandas()`, `.to_jsonl()`, or `len(my_flow)`).
- Errors Surface Late: Consequently, an error in your lambda function within an `.assign()` might only appear when you call `.collect()` much later in your chain. If you get an error during materialization, trace back through your pipeline steps.
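
The late-surfacing behaviour can be reproduced with plain Python generators, which are lazy in a similar spirit. The sketch below does not use `Flow` at all: `lazy_assign` is a hypothetical stand-in for a lazy step such as `.assign()`, and the records are made up. It only illustrates where the exception appears.

```python
def lazy_assign(records, **new_fields):
    """Hypothetical stand-in for a lazy step: nothing runs until iteration."""
    for record in records:
        computed = {name: fn(record) for name, fn in new_fields.items()}
        yield {**record, **computed}


records = [{"goals": 2, "shots": 4}, {"goals": 1, "shots": 0}]

# Building the pipeline raises nothing, even though the second record
# will cause a division by zero once it is actually processed.
pipeline = lazy_assign(records, conversion=lambda r: r["goals"] / r["shots"])

error = None
try:
    result = list(pipeline)  # the error only surfaces here, at materialization
except ZeroDivisionError as exc:
    error = exc
    print(f"Failed during materialization: {exc!r}")
```

The traceback points at the line that materializes the data, so the useful clue is the failing lambda further down the stack, not the call that triggered it.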

## Managing Flow Consumption & Reusability

- Single Pass by Default: A `Flow` instance generally represents a single pass through its underlying data stream. Once records are consumed by a materializing operation or by iterating through the `Flow`, they are no longer available to later operations on that same instance.
- When to Re-instantiate: If you need to perform multiple, independent analyses starting from the exact same original data, create a fresh `Flow` from your source for each independent pipeline:

```python
source_data = Flow.from_jsonl("my_data.jsonl")

# Analysis 1
result1 = source_data.filter(condition1).collect() # Consumes source_data

# Analysis 2 - WRONG if you expect it to see original data
# result2 = source_data.limit(10).collect() # source_data is already exhausted!

# Analysis 2 - CORRECT
source_data_for_analysis2 = Flow.from_jsonl("my_data.jsonl")
result2 = source_data_for_analysis2.limit(10).collect()
```

Alternatively, if the source data is small enough, `.collect()` it into a list first and create Flows from that list. If in doubt though, just create a new `Flow`.

- Know Consuming Operations: Be aware of methods that consume the `Flow` they are called on:
    - Materialization: `.collect()`, `.to_pandas()`, `.to_json()`, `.to_jsonl()`, `.to_json_files()`, `.describe()`.
    - Full Stream Inspection: `len(my_flow)`, `.last()`, `.sort()`, `.summary()`, `.group_by()`, `.sample()`, `.row_number()`, `.take_last()`.
    - Key Discovery: `.keys()` (if `limit=None`).

After any of these calls, the `Flow` instance they were called on is exhausted. The methods themselves usually return a new `Flow` or the resulting data structure.
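
A Python generator gives a rough feel for these single-pass semantics. This is an analogy only, not a description of how `Flow` is implemented; `read_records` is a hypothetical source function standing in for something like a file reader.

```python
def read_records():
    """Hypothetical source: yields records one at a time, like a stream."""
    for i in range(5):
        yield {"id": i}


stream = read_records()
first_pass = [r for r in stream if r["id"] % 2 == 0]  # consumes the stream
second_pass = list(stream)                            # nothing is left

# Option 1: re-create the source for each independent pass.
fresh_pass = list(read_records())[:2]

# Option 2: if the data is small, cache it once and reuse the list.
cached = list(read_records())
evens = [r for r in cached if r["id"] % 2 == 0]
head = cached[:2]
```

Option 2 trades memory for convenience, which is why it only suits sources that comfortably fit in RAM.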

## Memory and Performance Considerations

- Stream When Possible: Leverage `Flow`'s streaming capabilities by chaining non-materializing operations (`.filter()`, `.assign()`, `.select()`, `.drop()`, `.rename()`, `.flatten()`, `.explode()`, `.sample_frac()`) as much as possible before any operation that requires loading data into memory.
- Memory-Intensive Operations:
    - `.sort()`, `.group_by()`, `.summary()`, `.row_number()`, `.take_last()`, `.sample()` all need to process or hold all relevant data in memory.
    - `.join()` materializes its right-hand side (other `Flow`/list) into a lookup table. Keep the right-hand side as small as feasible if memory is a concern.
    - `.to_pandas()` and `.to_json_single()` (as array) inherently load all data into memory.
    - `.drop_duplicates()`, `.unique()` need to store seen keys/records.
- Large Datasets: If an operation requires materializing a dataset that's too large for memory (e.g., sorting a 50GB file), `Flow` (like any single-machine in-memory tool) will struggle. Consider:
    - Pre-filtering data aggressively to reduce its size.
    - Processing data in chunks if your logic allows.
    - For truly massive datasets, explore tools designed for distributed computing (e.g., Dask, Spark) or database solutions.
- `folder_flow` for File-Level Parallelism: Use `folder_flow` to process multiple files concurrently, which can significantly speed up workflows where per-file I/O or CPU-bound work is the bottleneck. Note that you cannot pass lambda functions to `folder_flow` directly; wrap them in a named function instead. This is a limitation of Python's multiprocessing, which has to pickle the functions it sends to worker processes and cannot pickle lambdas.
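
The "process data in chunks" suggestion above can be sketched with the standard library alone. `iter_chunks` and `record_stream` are hypothetical helpers, not part of `Flow`. The approach assumes your logic can be computed per chunk and combined afterwards: a running sum works this way, a global sort does not.

```python
from itertools import islice


def iter_chunks(records, chunk_size):
    """Yield lists of at most chunk_size records from any iterable."""
    iterator = iter(records)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


def record_stream(n):
    """Hypothetical lazy source standing in for a large file."""
    for i in range(n):
        yield {"id": i, "value": i * 2}


# Only one chunk is held in memory at a time; partial sums are combined.
total = 0
chunk_sizes = []
for chunk in iter_chunks(record_stream(10), chunk_size=4):
    chunk_sizes.append(len(chunk))
    total += sum(r["value"] for r in chunk)
```

Peak memory is bounded by `chunk_size` rather than by the size of the dataset, at the cost of restricting you to operations that can be combined across chunks.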

## Debugging Your Pipelines

- Break Down Long Chains: If a long chain of `Flow` operations isn't working as expected or raises an error:
    - Assign intermediate steps to variables.
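    - Use `.first()`, `.limit(N)`, or `.collect()` on these intermediate Flows to inspect the data at each stage. Given the single-pass behaviour described earlier, inspection may consume records, so treat these checks as a debugging aid and re-create the source `Flow` if later results look incomplete.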

```python
initial_flow = Flow.from_file("data.json")
step1_flow = initial_flow.filter(my_filter_fn)
print("After step 1:", step1_flow.limit(3).collect()) # Inspect

step2_flow = step1_flow.assign(new_col=my_assign_fn)
print("After step 2:", step2_flow.limit(3).collect()) # Inspect
```

- `.pipe()` for Quick Inspection: You can use `.pipe()` with a small function to print intermediate records or properties without breaking the chain (make sure the function returns the flow).

```python
def peek(flow, num_records=1):
    print(f"Peek: {flow.limit(num_records).collect()}")
    return flow # Must return the flow to continue the chain

result = events_flow.filter(...) \
                    .pipe(peek) \
                    .assign(...) \
                    .collect()
```

- Debugging `folder_flow`:
    - Test your `flow_fn` and `reduce_fn` thoroughly on a single file first by calling them directly.
    - When using `folder_flow`, set `n_jobs=1` to run sequentially for easier debugging of issues within your `flow_fn` or `reduce_fn`. Stack traces will be simpler.
    - Ensure `flow_fn` and `reduce_fn` are pickleable.
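
A quick way to check the pickling requirement before launching a parallel run is to try `pickle.dumps` on the function yourself. The helper below is a minimal sketch using only the standard library; `is_pickleable` and `count_records` are hypothetical names, and the check assumes `folder_flow` relies on standard pickling, which the multiprocessing note above suggests but this sketch does not verify.

```python
import pickle


def is_pickleable(obj):
    """Return True if obj survives pickle.dumps, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


def count_records(records):
    """Hypothetical named function, defined at module level."""
    return len(list(records))


# Named module-level functions are pickled by reference to their name.
named_ok = is_pickleable(count_records)

# Lambdas have no importable name, so standard pickle rejects them.
lambda_ok = is_pickleable(lambda records: len(list(records)))
```

Running a check like this on `flow_fn` and `reduce_fn` up front tends to give a clearer failure than an error raised from inside a worker process.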

## Working with Data Structures

- Nested Data: Use `.flatten()` for straightforward un-nesting. For more complex extraction from nested dicts/lists, `.assign()` with custom lambda functions accessing specific keys/indices is powerful. Remember `.explode()` and `.split_array()` for list-like fields.
- Type Safety in Lambdas: When writing lambda functions for `.filter()` or `.assign()`, be mindful of missing keys or unexpected data types. Use `.get("key", default_value)` for dictionaries and consider type checks if necessary.

```python
# Safer access
events_flow.assign(x=lambda r: r.get("location")[0] if isinstance(r.get("location"), list) and len(r.get("location")) > 0 else None)
# you could also wrap this in a custom function like:
```

## Choosing IO Methods

- Large Streaming Data: Prefer `.from_jsonl()` for loading and `.to_jsonl()` for saving when dealing with large datasets that benefit from line-by-line processing without loading the entire file content into memory.
- Self-Contained JSON: Use `.from_file()` for JSON files containing a single array or object, and `.to_json_single()` to save as such (but be aware it materializes the whole `Flow` into a list first).
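
The difference between the two formats is easy to see with the standard `json` module. This sketch illustrates the file formats only, not how `Flow` reads them internally; `iter_jsonl` is a hypothetical helper and the in-memory strings stand in for real files.

```python
import io
import json


def iter_jsonl(handle):
    """Yield one parsed record per non-empty line of a JSON Lines stream."""
    for line in handle:
        line = line.strip()
        if line:
            yield json.loads(line)


jsonl_text = '{"id": 1}\n{"id": 2}\n{"id": 3}\n'
array_text = '[{"id": 1}, {"id": 2}, {"id": 3}]'

# JSON Lines: each line is a complete document, so records can be
# parsed and handled one at a time.
streamed_ids = [record["id"] for record in iter_jsonl(io.StringIO(jsonl_text))]

# Single JSON array: json.load parses the whole document before returning.
all_records = json.load(io.StringIO(array_text))
array_ids = [record["id"] for record in all_records]
```

Both approaches yield the same records here; the difference only matters once the file is too large to parse in one go.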

## Code Readability

- Descriptive Variable Names: Assign intermediate `Flow` objects to well-named variables if your pipeline is complex. This is usually easier to read than one extremely long chain.
- Helper Functions for `flow_fn`: For `folder_flow` or complex `.pipe()` operations, define your processing logic in clearly named helper functions rather than very long lambdas.

## When Flow Might Not Be the Best Tool

`Flow` is excellent for stream processing and transformations on datasets that can be handled by a single machine, especially for JSON-like record structures. However, consider alternatives if:

- Data Exceeds Single-Machine Memory for Key Operations: Core operations like sorting or grouping are performed in memory after `Flow` collects the necessary data. If your dataset cannot fit into RAM for these operations, tools like Dask (for Python) or Apache Spark might be more appropriate, as they can distribute the work across multiple machines or spill to disk.
- Complex Relational Queries & Indexing: For highly relational data requiring SQL-like joins across many tables, persistent storage with indexing, and transactional integrity, a proper database (SQL or NoSQL) is currently better suited.
- True Real-Time, Low-Latency Streaming: If you need to process unbounded streams of data with very low latency (milliseconds), specialized stream processing engines (e.g., Kafka Streams, Flink, Spark Streaming) are designed for that purpose. `Flow` is more for batch-oriented or finite stream processing.

By keeping these practices in mind, you can harness the power and flexibility of `Flow` for your data analysis tasks efficiently and effectively.
