# Working with Files in Python

Python can access files stored outside your code. This notebook covers how to read several common file types and how to navigate and inspect directories from Python.

Contents:
- [Navigate directories using `os` and `pathlib`](#pathlib)
- [Read different file formats](#format)
- [Work with multiple datasets from a folder](#files)
- [Exercises](#)


<a id=pathlib></a>

## Navigating Directories

### Using the `os` module


In [None]:
import os

In [None]:
# Get current working directory
current_dir = os.getcwd()
print(f"Current directory: {current_dir}")

In [None]:
# List all files in a directory
files = os.listdir('data')
print(f"\nFiles in data/: {files}")

In [None]:
# Check if a file exists
file_exists = os.path.exists('data/sales.csv')
print(f"\nDoes sales.csv exist? {file_exists}")

In [None]:
# Get file size
if file_exists:
    file_size = os.path.getsize('data/sales.csv')
    print(f"File size: {file_size} bytes")



### Using `pathlib` (Modern Python approach)


In [None]:
from pathlib import Path

In [None]:
# Get current directory
current_dir = Path.cwd()
print(f"Current directory: {current_dir}")

In [None]:
# Create a path object
data_dir = Path('data')

In [None]:
# List all files
files = list(data_dir.iterdir())
print(f"\nFiles in data/:")
for file in files:
    print(f"  - {file.name}")

In [None]:
# Check if file exists
sales_file = data_dir / 'sales.csv'
print(f"\nDoes sales.csv exist? {sales_file.exists()}")

In [None]:
# Get file info
if sales_file.exists():
    print(f"File size: {sales_file.stat().st_size} bytes")
    print(f"Is it a file? {sales_file.is_file()}")
    print(f"Is it a directory? {sales_file.is_dir()}")
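
Beyond existence checks, `Path` objects expose the pieces of a path as attributes and can filter a folder by pattern. The following is a minimal sketch that runs in a temporary directory so it does not depend on your own files; the `data` folder and `sales.csv` created here are throwaway stand-ins with invented contents.

```python
import os
import tempfile
from pathlib import Path

# Work in a throwaway directory so the sketch does not touch real files
with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp) / "data"
    data_dir.mkdir()
    sales_file = data_dir / "sales.csv"
    sales_file.write_text("id,amount\n1,9.99\n", encoding="utf-8")  # invented contents

    # Path components are available as attributes
    parts = (sales_file.name, sales_file.stem, sales_file.suffix, sales_file.parent.name)
    print(parts)

    # The os equivalent builds the same path from plain strings
    os_style = os.path.join(tmp, "data", "sales.csv")
    same_path = Path(os_style) == sales_file
    print(f"Same path as os.path.join: {same_path}")

    # glob() keeps only the entries matching a pattern
    csv_names = sorted(p.name for p in data_dir.glob("*.csv"))
    print(csv_names)
```

The `/` operator joins path segments with the correct separator for the operating system, which is the main reason `pathlib` code tends to be shorter than the `os.path` equivalent.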

<a id=format></a>

## Reading Different File Types

You will work with many file types in Python. The first you meet is often a `.csv` or `.xlsx` file read with pandas, but other formats serve different purposes:

- [Comma-separated values](#csv) `.csv`: A tabular data format where each column is separated by a comma
    - Commonly used with internal data

- Text `.txt`: Plain text files, a simple format containing unstructured or semi-structured data
    - Often used for logs, notes, or simple datasets

- JSON `.json`: JavaScript Object Notation, a structured, hierarchical format using key-value pairs
    - Commonly used for APIs, configuration files, and modern data interchange

- YAML `.yaml` / `.yml`: "YAML Ain't Markup Language", a human-readable structured format similar to JSON but easier to read
    - Often used for configuration files, pipeline definitions, and infrastructure-as-code

- Parquet `.parquet`: Columnar, binary data format optimized for analytics and big data processing
    - Commonly used in data engineering workflows for large datasets because it's fast and memory-efficient



<a id=csv></a>

### Reading CSV Files

Pandas provides a built-in function to read CSVs:

In [None]:
import pandas as pd

# Read CSV file
sales = pd.read_csv('data/sales.csv', header=2)

print("CSV file contents:")
print(sales.head())
print(f"\nShape: {sales.shape}")
print(f"Columns: {', '.join(sales.columns)}")

It can also handle files whose formatting is less straightforward than expected:

In [None]:
import pandas as pd

df = pd.read_csv(
    'data/employee_data.csv',
    skiprows=2,                 # skip the comment lines
    sep='|',                     # delimiter is a pipe
    usecols=['ID', 'Name', 'Salary', 'Start Date'], 
    dtype={'ID': str},           # ensure ID stays a string
    parse_dates=['Start Date'],   # parse Start Date as datetime
)

print(df)
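
To make these parameters concrete, here is a small sketch that reads an in-memory stand-in for such a file. The contents are invented for illustration (the real `employee_data.csv` is not shown in this notebook); only the column names are taken from the call above, and the extra `Department` column is an assumption added to show what `usecols` drops.

```python
import io

import pandas as pd

# Hypothetical file contents: two comment lines, then pipe-delimited data
raw = io.StringIO(
    "# Exported from HR system\n"
    "# Do not edit by hand\n"
    "ID|Name|Department|Salary|Start Date\n"
    "007|Ana|Finance|52000|2021-03-01\n"
    "012|Ben|IT|61000|2019-11-15\n"
)

sample = pd.read_csv(
    raw,
    skiprows=2,                  # skip the two comment lines
    sep="|",                     # columns are separated by pipes
    usecols=["ID", "Name", "Salary", "Start Date"],  # drops Department
    dtype={"ID": str},           # keeps leading zeros such as "007"
    parse_dates=["Start Date"],  # converts the column to datetime
)

print(sample)
print(sample.dtypes)
```

Without `dtype={'ID': str}`, pandas would infer integers for the `ID` column and the leading zeros would be lost.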


### Reading Text Files

Knowing the best practices for reading text files in Python is important because it ensures your code is safe, efficient, and portable, preventing common issues like memory overload, encoding errors, or leaving files open accidentally.

Best practices include:
- Use the `with` context manager
- Use `f.read()` to load the entire content of small files, and iterate line by line for larger files
- Handle exceptions (more on that in the next notebook)
- Avoid hard-coding paths

Let's say you have a log file saved and you want to check how many errors have occurred:

In [None]:
# Method 1: Read entire file as string
with open('data/system_log.txt', 'r') as f:
    log_content = f.read()

print("Text file contents:")
print(log_content[:200])  # First 200 characters
print('\nNumber of errors:', log_content.count('ERROR'))

Another option is to read the file into a list of lines, which makes quick analysis such as counting lines, checking content, or searching for specific keywords straightforward. Holding all lines in a list allows multiple passes over the data without reopening the file. This can be helpful when exploring a new dataset or log file.

In [None]:
# Method 2: Read line by line
with open('data/system_log.txt', 'r') as f:
    lines = f.readlines()

print(f"Total lines: {len(lines)}")
num_errors = len([lin for lin in lines if 'ERROR' in lin])

print("\nFirst 3 lines:")
print(''.join(lines[:3]))

print('\nNumber of errors:', num_errors)

If `system_log.txt` is very large (hundreds of MBs or GBs), `f.readlines()` can use a lot of memory. In that case, it's better to iterate over the file line by line instead:

In [None]:
# Method 3: Process line by line (memory efficient for large files)
error_lines = []
with open('data/system_log.txt', 'r') as f:
    for line in f:
        if 'ERROR' in line:
            error_lines.append(line.strip())

print(f"\nFound {len(error_lines)} error lines:")
for error in error_lines:
    print(f"  {error}")
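
The best practices listed at the start of this section can be combined in one small function. This is a sketch under stated assumptions rather than a definitive implementation: `count_errors` is a hypothetical helper name, UTF-8 is assumed as the encoding, and the log lines are invented so the example runs without your own files.

```python
import tempfile
from pathlib import Path


def count_errors(log_path: Path) -> int:
    """Count lines containing 'ERROR', reading one line at a time."""
    try:
        # The context manager closes the file even if an exception is raised
        with log_path.open("r", encoding="utf-8") as f:
            return sum(1 for line in f if "ERROR" in line)
    except FileNotFoundError:
        print(f"Log file not found: {log_path}")
        return 0


# Demo on a throwaway file; the log lines are invented for illustration
with tempfile.TemporaryDirectory() as tmp:
    log_file = Path(tmp) / "system_log.txt"
    log_file.write_text(
        "2024-01-15 10:24:12 ERROR Database connection failed\n"
        "2024-01-15 10:24:30 INFO Retrying\n"
        "2024-01-15 10:27:45 ERROR Timeout on API call\n",
        encoding="utf-8",
    )
    found = count_errors(log_file)
    missing = count_errors(Path(tmp) / "no_such_file.txt")

print(found, missing)
```

Passing the path in as an argument keeps the function free of hard-coded locations, and stating the encoding explicitly avoids depending on the platform default.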

### Reading JSON Files

JSON (JavaScript Object Notation) is widely used in Python because it's a lightweight, human-readable way to represent structured data. The most common use cases are configuration files, API requests and responses, and storing structured data.

Here is a configuration example, where the `json` library is used to convert the JSON file into a Python dictionary, so you can access nested values using standard dictionary syntax instead of parsing text manually.

In [None]:
import json

with open('data/config.json', 'r') as f:
    config = json.load(f)

print("JSON file contents:")
print(json.dumps(config, indent=2))


Now if you want to access the nested values:

In [None]:
print(f"\nDatabase host: {config['database']['host']}")
print(f"API endpoint: {config['api']['endpoint']}")
print(f"Regions: {config['regions']}")
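
Square-bracket access raises a `KeyError` when a key is absent, which matters for optional settings. The sketch below shows one common way to handle that with `dict.get()`. The configuration is parsed from a string so the example is self-contained; the keys mirror those used above, while the values and the `timeout` and `password` keys are invented for illustration.

```python
import json

# Hypothetical config; keys mirror those used above, values are invented
config_text = """
{
  "database": {"host": "localhost", "port": 5432},
  "api": {"endpoint": "https://example.com/v1"},
  "regions": ["north", "south"]
}
"""
config = json.loads(config_text)  # json.loads parses a string, json.load a file

# Square brackets raise KeyError when a key is missing
try:
    config["database"]["password"]
except KeyError as exc:
    print(f"Missing key: {exc}")

# .get() returns a default instead of raising
timeout = config.get("api", {}).get("timeout", 30)

# JSON numbers and arrays arrive as Python ints/floats and lists
port = config["database"]["port"]
print(timeout, type(port).__name__, config["regions"])
```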

### Reading Parquet Files

Parquet is a columnar, compressed, binary format designed for large datasets. Its main advantages are:

- Faster reads/writes than CSV or JSON because only the needed columns are loaded.
- Smaller disk space usage due to compression.
- Preserves data types (e.g., dates, integers) better than CSV.

pandas reads Parquet well for medium-sized datasets, but Polars is often faster and lighter on memory for very large ones, so it is used here.

In [None]:
import polars as pl

# Read Parquet file
df_parquet = pl.read_parquet('data/large_sales.parquet')

print("Parquet file contents:")
print(df_parquet.head())  # Polars DataFrame head

# Shape
print(f"\nShape: {df_parquet.shape}")

# Memory usage (approximate, Polars doesn't have exact equivalent of Pandas deep=True)
print(f"\nEstimated memory usage: {df_parquet.estimated_size() / (1024**2):.2f} MB")

# Access first and last transaction dates
print(f"\nFirst transaction: {df_parquet[0, 'date']}")
print(f"Last transaction: {df_parquet[-1, 'date']}")

### Reading YAML Files

YAML files are used for storing structured, human-readable configuration data, often for applications, pipelines, or infrastructure. They are like JSON but easier for humans to read and write, supporting comments, nested structures, and lists more cleanly.

Using Python you can:
- Easily parse structured data: YAML is converted into Python dictionaries and nested lists, so you can access values naturally using dict syntax
- Keep configuration separate from code

In [None]:
import yaml

# Read YAML file
with open('data/pipeline_config.yml', 'r') as f:
    pipeline_config = yaml.safe_load(f)

print("YAML file contents:")
print(f"Pipeline name: {pipeline_config['pipeline_name']}")
print(f"Version: {pipeline_config['version']}")
print(f"Schedule: {pipeline_config['schedule']}")

print("\nPipeline stages:")
for stage in pipeline_config['stages']:
    print(f"  - {stage['name']}: {stage['enabled']}")

print(f"\nRetry policy: {pipeline_config['retry_policy']}")

<a id=files></a>

## Multiple CSV Files from a Folder

By combining `pathlib` with the file readers above, you can process whole folders of files. For example, let's say you receive monthly company sales files:


In [None]:
from pathlib import Path
import pandas as pd

# Get all CSV files in the monthly_sales folder
sales_folder = Path('data/monthly_sales')
csv_files = list(sales_folder.glob('*.csv'))
print(f"Found {len(csv_files)} CSV files:")
for file in csv_files:
    print(f"  - {file.name}")


To analyze them together, read every file and combine the results into a single DataFrame:


In [None]:
# Read all files and combine them
sales_dfs = [
    pd.read_csv(file).assign(source_file=file.stem)
    for file in csv_files
]

# Combine all dataframes
combined_sales = pd.concat(sales_dfs, ignore_index=True)

print(f"\nCombined dataset shape: {combined_sales.shape}")
combined_sales.head()
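
Two details are worth guarding against in this pattern: `glob()` does not guarantee any particular file order, and `pd.concat()` raises a `ValueError` when given an empty list. The sketch below handles both. The folder is created in a temporary directory and the two monthly files, their names, and their contents are invented for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    sales_folder = Path(tmp) / "monthly_sales"
    sales_folder.mkdir()
    # Invented sample files standing in for the monthly exports
    (sales_folder / "2024-02.csv").write_text("product,amount\nB,20\n", encoding="utf-8")
    (sales_folder / "2024-01.csv").write_text("product,amount\nA,10\nB,5\n", encoding="utf-8")

    # Sort the matches so the combined rows come out in a reproducible order
    csv_files = sorted(sales_folder.glob("*.csv"))

    if not csv_files:
        # Fail early with a clearer message than pd.concat would give
        raise FileNotFoundError(f"No CSV files found in {sales_folder}")

    combined = pd.concat(
        [pd.read_csv(f).assign(source_file=f.stem) for f in csv_files],
        ignore_index=True,
    )

print(combined)
print(combined.groupby("source_file")["amount"].sum())
```

Keeping the `source_file` column makes it possible to trace each row back to its month after the files are merged.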

## <mark>Exercises</mark>

### <mark>Exercise 1: Filter and Combine Log Files</mark>

You have multiple log files from different servers in the `data/logs/` folder. Your task is to:
1. Read all log files from the folder
2. Extract only the ERROR and WARNING lines
3. Create a DataFrame with columns: `timestamp`, `level`, `message`, `server`
4. Sort by timestamp

**Expected output:**
```
   timestamp            level    message                     server
0  2024-01-15 10:24:12  ERROR    Database connection failed  server1
1  2024-01-15 10:25:03  WARNING  High memory usage: 85%      server1
2  2024-01-15 10:27:45  ERROR    Timeout on API call         server1
3  2024-01-15 10:31:22  WARNING  Disk space low: 15%         server2
4  2024-01-15 10:32:33  ERROR    Failed to write file        server2
5  2024-01-15 10:41:11  ERROR    Network timeout             server3
6  2024-01-15 10:42:22  WARNING  CPU usage: 95%              server3
```

In [None]:

from pathlib import Path
import pandas as pd

# TODO: Your code here
# 1. Get all .log files from data/logs/
# 2. Read each file and extract ERROR and WARNING lines
# 3. Parse each line into timestamp, level, and message
# 4. Create a DataFrame and sort by timestamp

# Your solution here


### <mark>Exercise 2: Configuration Merger</mark>

You have multiple JSON configuration files for different environments (dev, staging, prod). Your task is to:
1. Read all JSON files from `data/configs/`
2. Merge them into a single DataFrame showing settings across environments
3. Identify which settings differ between environments

Your expected output is as follows:
```
                          dev        staging                 prod
database_host             localhost  staging-db.company.com  prod-db.company.com
database_port             5432       5432                    5432
database_max_connections  10         50                      200
api_timeout               30         60                      60
api_rate_limit            100        500                     1000
debug                     True       True                    False
```

In [None]:
from pathlib import Path
import pandas as pd
import json

# TODO: Your code here
# 1. Read all JSON files from data/configs/
# 2. Flatten the nested structure
# 3. Create a DataFrame where each row is a setting and columns are environments
# 4. Identify settings that differ across environments

# Your solution here


**Answers**: Uncomment and run the code to see answers

In [None]:
# %load answers/file-1.py

In [None]:
# %load answers/file-2.py

## Summary: File Reading Best Practices

**Key Takeaways:**

- **Use `pathlib`** over `os` for modern, cleaner code
- **CSV files**: Use `pd.read_csv()` for tabular data
- **JSON files**: Use `json.load()` for configs, `pd.read_json()` for tabular data
- **Parquet files**: Use `pl.read_parquet()` (Polars) or `pd.read_parquet()` (pandas) to read large datasets efficiently
- **Text files**: Use context managers (`with open()`) to ensure files close properly
- **YAML files**: Use `yaml.safe_load()` for configuration files
- **Multiple files**: Use `Path.glob()` + `pd.concat()` to combine datasets

**Common patterns:**
```python
import json
from pathlib import Path

import pandas as pd

# Read single CSV
df = pd.read_csv('file.csv')

# Read all CSVs from folder
dfs = [pd.read_csv(f) for f in Path('folder').glob('*.csv')]
combined = pd.concat(dfs, ignore_index=True)

# Read config file
with open('config.json') as f:
    config = json.load(f)
```

**Remember**: Always use context managers (`with open()`) when reading files to ensure they're properly closed, even if an error occurs! 📁