# File Handling Basics

In Data Engineering and AI workflows, files are a primary way to **ingest**, **store**, and **exchange** data (logs, CSV/JSON, model artifacts, configs).

## Learning goals
By the end of this notebook, you should be able to:
- Open files safely using `with open(...)`
- Choose the correct **mode** (`r`, `w`, `a`, `x`, `b`) and handle **encoding**
- Read files in a **memory-safe** way (streaming line-by-line)
- Write simple datasets (including **JSON Lines / JSONL**)
- Handle common file errors (`FileNotFoundError`, permissions)


## Paths and the working directory

When your script runs, Python has a **current working directory (CWD)**. Relative paths (like `"data/file.txt"`) are resolved from the CWD.

For portable code, prefer `pathlib.Path` for path building.


In [None]:
from pathlib import Path

cwd = Path.cwd()
print("Current working directory:", cwd)

data_dir = cwd / "data"
data_dir.mkdir(exist_ok=True)

sample_path = data_dir / "sample.txt"
print("Sample file path:", sample_path)

## Creating a small sample file

We create a small text file so everyone can run the same examples.


In [None]:
lines = [
    "id,name,score\n",
    "1,Alice,98\n",
    "2,Bob,87\n",
    "3,Carol,91\n",
    "4,Dan,73\n",
    "5,Eve,88\n",
]

with open(sample_path, "w", encoding="utf-8") as file_handler:
    file_handler.writelines(lines)

print("Wrote", len(lines), "lines to", sample_path.name)

## Opening files: modes, text vs binary, encoding

`open(path, mode, encoding=...)` returns a **file handle** (a stream).

Common modes:
- `"r"`: read (fails if file does not exist)
- `"w"`: write (overwrites if file exists)
- `"a"`: append (creates if missing)
- `"x"`: create (fails if file already exists)
- Add `"b"` for binary: `"rb"`, `"wb"` (images, compressed files, Parquet files). Binary mode reads and writes `bytes` and does not accept an `encoding` argument.

Best practice for text files:
- Always specify `encoding="utf-8"` unless you have a reason not to.
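
To make the text vs binary distinction concrete, here is a minimal sketch that writes one short string and reads it back both ways. The throwaway directory and the file name `modes_demo.txt` are assumptions made for the demo, not part of the notebook's `data/` folder.

```python
import tempfile
from pathlib import Path

# Hypothetical demo file in a throwaway directory, so nothing in data/ is touched.
demo_path = Path(tempfile.mkdtemp()) / "modes_demo.txt"

# Text mode: we pass str and Python encodes it to bytes with the given encoding.
# newline="" disables newline translation so the bytes on disk are predictable.
with open(demo_path, "w", encoding="utf-8", newline="") as file_handler:
    file_handler.write("café\n")

# Text mode read: the bytes are decoded back into str.
with open(demo_path, "r", encoding="utf-8", newline="") as file_handler:
    text = file_handler.read()

# Binary mode read: no decoding happens, we get the raw bytes.
with open(demo_path, "rb") as file_handler:
    raw = file_handler.read()

print(type(text).__name__, len(text), repr(text))  # str 5 'café\n'
print(type(raw).__name__, len(raw), repr(raw))     # bytes 6 b'caf\xc3\xa9\n'
```

The character `é` takes two bytes in UTF-8, which is why the text has 5 characters while the raw content has 6 bytes.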


## Reading patterns

### Read the entire file (easy, not always scalable)
Use `.read()` only when you know the file is small enough to fit in memory.


In [None]:
with open(sample_path, "r", encoding="utf-8") as file_handler:
    content = file_handler.read()

print(content)

### Stream line-by-line (recommended for large files)

Iterating over a file handle reads lazily and avoids loading everything into memory.


In [None]:
with open(sample_path, "r", encoding="utf-8") as file_handler:
    for line in file_handler:
        print(line.rstrip("\n"))

### Peek at the first *N* lines (a common data engineering task)

Useful when exploring unknown files or validating ingestion.


In [None]:
N = 3

with open(sample_path, "r", encoding="utf-8") as file_handler:
    for i, line in enumerate(file_handler):
        if i == N:
            break
        print(line.rstrip("\n"))

### Read in chunks (for very large files)

Chunking is useful for large text or binary files (e.g., upload/download, hashing).


In [None]:
chunk_size = 16  # in text mode, read(n) returns up to n characters, not bytes; kept small for the demo

with open(sample_path, "r", encoding="utf-8") as file_handler:
    while True:
        chunk = file_handler.read(chunk_size)
        if not chunk:
            break
        print(repr(chunk))
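
The hashing use case mentioned above could look roughly like the following sketch. It reads in binary mode so that `chunk_size` really is a byte count. The helper name `sha256_of_file`, the default chunk size, and the demo file are assumptions chosen for illustration.

```python
import hashlib
import tempfile
from pathlib import Path

def sha256_of_file(path, chunk_size=64 * 1024):
    """Return the SHA-256 hex digest of a file, reading chunk_size bytes at a time."""
    digest = hashlib.sha256()
    with open(path, "rb") as file_handler:
        while True:
            chunk = file_handler.read(chunk_size)
            if not chunk:  # empty bytes object means end of file
                break
            digest.update(chunk)
    return digest.hexdigest()

# Hypothetical demo file in a throwaway directory.
demo_path = Path(tempfile.mkdtemp()) / "hash_demo.bin"
demo_path.write_bytes(b"id,name,score\n1,Alice,98\n")

# A tiny chunk size forces several loop iterations; the digest is the same either way.
print(sha256_of_file(demo_path, chunk_size=8))
```

Memory use stays bounded by `chunk_size` regardless of how large the file is, which is the point of the pattern.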

## Writing patterns

### Write vs append
- `"w"` overwrites the file
- `"a"` appends at the end

Remember: `write()` does not add newlines automatically.


In [None]:
log_path = data_dir / "log.txt"

with open(log_path, "w", encoding="utf-8") as file_handler:
    file_handler.write("Start of log\n")

with open(log_path, "a", encoding="utf-8") as file_handler:
    file_handler.write("Another line\n")

print("Log content:")
with open(log_path, "r", encoding="utf-8") as file_handler:
    print(file_handler.read())

### Create-only mode (`"x"`) to avoid accidental overwrites

This is a safety pattern when generating outputs in pipelines.


In [None]:
safe_path = data_dir / "created_once.txt"

try:
    with open(safe_path, "x", encoding="utf-8") as file_handler:
        file_handler.write("This file is created only if it does not exist.\n")
    print("Created:", safe_path.name)
except FileExistsError:
    print("Already exists:", safe_path.name)

## Why `with` matters (context managers)

`with open(...)` ensures the file is **closed** even if an exception occurs.

Technical detail:
- `__enter__()` runs at the start of the block
- `__exit__()` runs at the end (even on errors)
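
To see the two hooks in action, here is a small sketch with a toy context manager that records when each method runs. `TracedResource` is a hypothetical class written for this illustration. File objects implement the same two methods, with `__exit__()` closing the file.

```python
class TracedResource:
    """Hypothetical toy context manager that records when each hook runs."""

    def __init__(self):
        self.events = []

    def __enter__(self):
        self.events.append("enter")
        return self  # this is the value bound by "as" in a with statement

    def __exit__(self, exc_type, exc_value, traceback):
        self.events.append("exit")
        return False  # False means: do not swallow the exception

resource = TracedResource()
try:
    with resource:
        resource.events.append("body")
        raise ValueError("simulated failure inside the block")
except ValueError:
    resource.events.append("handled")

print(resource.events)  # ['enter', 'body', 'exit', 'handled']
```

Note that `"exit"` is recorded before `"handled"`: cleanup runs first, then the exception continues to the surrounding `except`.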


## Common errors and defensive patterns

Most common issues:
- File does not exist: `FileNotFoundError`
- Permission problems: `PermissionError`
- Wrong encoding: `UnicodeDecodeError`

In data engineering pipelines, fail fast with a clear error message or implement a fallback strategy.
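
One possible fallback strategy for encoding problems is sketched below, assuming the input is either UTF-8 or a legacy Latin-1 export. The helper name `read_text_with_fallback`, the encoding order, and the demo file are assumptions. Missing files and permission problems are deliberately left to propagate so the caller fails fast.

```python
import tempfile
from pathlib import Path

def read_text_with_fallback(path, encodings=("utf-8", "latin-1")):
    """Try each encoding in order and return (text, encoding_used)."""
    last_error = None
    for encoding in encodings:
        try:
            with open(path, "r", encoding=encoding) as file_handler:
                return file_handler.read(), encoding
        except UnicodeDecodeError as error:
            last_error = error  # remember why this encoding failed, try the next one
    raise ValueError(f"Could not decode {path} with any of {encodings}") from last_error

# Hypothetical legacy file: these Latin-1 bytes are not valid UTF-8.
demo_path = Path(tempfile.mkdtemp()) / "legacy.txt"
demo_path.write_bytes("café au lait\n".encode("latin-1"))

text, used = read_text_with_fallback(demo_path)
print(repr(text), "decoded with", used)

# FileNotFoundError is not caught inside the helper, so it surfaces here.
try:
    read_text_with_fallback(demo_path.with_name("missing.txt"))
except FileNotFoundError as error:
    print("Input file is missing:", error.filename)
```

Keep in mind that Latin-1 can decode any byte sequence, so it never fails; as a last resort it may silently produce wrong characters, and logging which encoding was used is a sensible habit.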


In [None]:
missing_path = data_dir / "does_not_exist.txt"

try:
    with open(missing_path, "r", encoding="utf-8") as file_handler:
        file_handler.read()
except FileNotFoundError:
    print(f"File not found: {missing_path}")

## Data Engineering example: JSON Lines (JSONL)

JSONL is common for logs and event streams:
- One JSON object per line
- Stream-friendly: you can process line-by-line


In [None]:
import json

records = [
    {"id": 1, "name": "Alice", "score": 98},
    {"id": 2, "name": "Bob", "score": 87},
    {"id": 3, "name": "Carol", "score": 91},
]

jsonl_path = data_dir / "records.jsonl"

with open(jsonl_path, "w", encoding="utf-8") as file_handler:
    for record in records:
        file_handler.write(json.dumps(record) + "\n")

print("Wrote JSONL:", jsonl_path.name)

In [None]:
# Read JSONL line-by-line (streaming)
import json

parsed = []
with open(jsonl_path, "r", encoding="utf-8") as file_handler:
    for line in file_handler:
        parsed.append(json.loads(line))

print(parsed)
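
When the file is too large to collect into a list, a generator keeps the reading truly streaming. The following is a sketch under assumptions: `iter_jsonl` is a hypothetical helper name, and skipping blank lines and reporting the line number of a bad record are design choices, not requirements of the JSONL format.

```python
import json
import tempfile
from pathlib import Path

def iter_jsonl(path):
    """Yield one parsed record per non-blank line of a JSONL file."""
    with open(path, "r", encoding="utf-8") as file_handler:
        for line_number, line in enumerate(file_handler, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines, e.g. a trailing empty line
            try:
                yield json.loads(line)
            except json.JSONDecodeError as error:
                raise ValueError(f"{path}: invalid JSON on line {line_number}") from error

# Hypothetical demo file with a blank line in the middle.
demo_path = Path(tempfile.mkdtemp()) / "demo.jsonl"
demo_path.write_text('{"id": 1, "score": 98}\n\n{"id": 2, "score": 87}\n', encoding="utf-8")

# Only one record is held in memory at a time while summing.
total = sum(record["score"] for record in iter_jsonl(demo_path))
print(total)  # 185
```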

## Options to open a file

The table below summarizes the mode characters accepted by Python's `open` function:

Option | Description
------- | -----------
"r" | Open the file for reading (default).
"w" | Open the file for writing. If the file already exists, its contents will be overwritten. If the file does not exist, a new file will be created.
"a" | Open the file for appending. Any new data written to the file will be added to the end of the file, without overwriting its existing contents.
"x" | Open the file for exclusive creation. If the file already exists, an error will be raised. This option is only available in Python 3.3 and later.
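"b" | Binary mode modifier, combined with another mode (`"rb"`, `"wb"`, `"ab"`). Contents are read and written as `bytes`, with no encoding or decoding.
"+" | Update modifier, combined with another mode (`"r+"`, `"w+"`, `"a+"`). Opens the file for both reading and writing.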
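
The `"+"` modifier is the only option in the table without an example elsewhere in the notebook, so here is a minimal sketch using `"r+"`. The demo file name and its contents are assumptions made for illustration.

```python
import tempfile
from pathlib import Path

# Hypothetical demo file in a throwaway directory.
demo_path = Path(tempfile.mkdtemp()) / "plus_demo.txt"
demo_path.write_text("status=pending\n", encoding="utf-8")

# "r+" opens an existing file for reading AND writing without truncating it.
with open(demo_path, "r+", encoding="utf-8") as file_handler:
    before = file_handler.read()          # after a full read, the position is at the end
    file_handler.write("status=done\n")   # so this write lands after the existing content
    file_handler.seek(0)                  # rewind to the start to re-read everything
    after = file_handler.read()

print(repr(before))  # 'status=pending\n'
print(repr(after))   # 'status=pending\nstatus=done\n'
```

By contrast, `"w+"` also allows reading and writing but truncates the file on open, so `"r+"` is usually the safer choice when existing content matters.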

## Quick exercises (3–7 minutes each)

1) **Peek**: Print the first 5 lines of `sample.txt` (no `.read()`).
2) **Count**: Count how many lines are in `sample.txt`.
3) **Append**: Append a new log line with your name to `log.txt`.
4) **JSONL**: Add one new record to `records.jsonl` using append mode.
5) **Defensive open**: Try to open a missing file and print a clean error message.

You can implement them in a new code cell below.


> Content created by **Carlos Cruz-Maldonado**.  
> Updated and expanded for the training session.
