# Working with Files in Python

This notebook will teach you how to:
- [Navigate directories using `pathlib` and `os`](#pathlib)
- [Read different file formats](#format)
- [Work with multiple datasets from a folder](#files)
- Practice these skills in the exercises at the end


<a id=pathlib></a>

## Navigating Directories

### Using `os` module


In [None]:
import os

In [None]:
# Get current working directory
current_dir = os.getcwd()
print(f"Current directory: {current_dir}")

In [None]:
# List all entries (files and subdirectories) in a directory
files = os.listdir('data')
print(f"\nFiles in data/: {files}")

In [None]:
# Check if a file exists
file_exists = os.path.exists('data/sales.csv')
print(f"\nDoes sales.csv exist? {file_exists}")

In [None]:
# Get file size
if file_exists:
    file_size = os.path.getsize('data/sales.csv')
    print(f"File size: {file_size} bytes")



### Using `pathlib` (Modern Python approach)


In [None]:
from pathlib import Path

In [None]:
# Get current directory
current_dir = Path.cwd()
print(f"Current directory: {current_dir}")

In [None]:
# Create a path object
data_dir = Path('data')

In [None]:
# List all entries (iterdir() yields both files and subdirectories)
files = list(data_dir.iterdir())
print("\nFiles in data/:")
for file in files:
    print(f"  - {file.name}")

In [None]:
# Check if file exists
sales_file = data_dir / 'sales.csv'
print(f"\nDoes sales.csv exist? {sales_file.exists()}")

In [None]:
# Get file info
if sales_file.exists():
    print(f"File size: {sales_file.stat().st_size} bytes")
    print(f"Is it a file? {sales_file.is_file()}")
    print(f"Is it a directory? {sales_file.is_dir()}")
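
Beyond existence checks, a `Path` object exposes the individual parts of a path as attributes. The following is a minimal sketch that reuses the `data/sales.csv` path from above purely as text; it uses `PurePosixPath` so that nothing is read from disk and the result is the same on every operating system. The `.parquet` sibling path is a made-up example.

```python
from pathlib import PurePosixPath

# PurePosixPath only manipulates the path text, so the file does not need to exist
sales_file = PurePosixPath('data') / 'sales.csv'

print(sales_file.name)     # file name with extension
print(sales_file.stem)     # file name without extension
print(sales_file.suffix)   # extension, including the dot
print(sales_file.parent)   # containing directory

# with_suffix() builds a sibling path with a different extension (hypothetical target)
parquet_file = sales_file.with_suffix('.parquet')
print(parquet_file)
```

The same attributes are available on regular `Path` objects, which is why `file.stem` can be used later in this notebook to label data by its source file.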

<a id=format></a>

## Reading Different File Types

### Reading CSV Files


In [None]:
import pandas as pd

# Read CSV file
df_csv = pd.read_csv('data/sales.csv')

print("CSV file contents:")
print(df_csv.head())
print(f"\nShape: {df_csv.shape}")
print(f"Columns: {df_csv.columns.tolist()}")
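
`pd.read_csv()` accepts options that control which columns are loaded and how they are typed. The sketch below is illustrative only: the CSV text and its column names (`date`, `region`, `amount`) are assumptions standing in for the real `data/sales.csv`, and `io.StringIO` is used so the example runs without any file on disk.

```python
import io

import pandas as pd

# Hypothetical CSV content; the real data/sales.csv may have different columns
csv_text = """date,region,amount
2024-01-15,North,120.50
2024-01-16,South,80.00
2024-01-17,North,42.25
"""

# usecols limits which columns are loaded;
# parse_dates converts the 'date' column from text to datetime values
df = pd.read_csv(
    io.StringIO(csv_text),
    usecols=['date', 'amount'],
    parse_dates=['date'],
)

print(df.dtypes)
print(df)
```

Loading only the columns you need can reduce memory use on wide files, and parsing dates up front avoids a separate conversion step later.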

### Reading JSON Files


In [None]:
import json

with open('data/config.json', 'r') as f:
    config = json.load(f)

print("JSON file contents:")
print(json.dumps(config, indent=2))


In [None]:
# Access nested values
print(f"\nDatabase host: {config['database']['host']}")
print(f"API endpoint: {config['api']['endpoint']}")
print(f"Regions: {config['regions']}")
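
Indexing with `config['api']['endpoint']` raises a `KeyError` if any key along the way is missing. One possible way to handle optional settings is `dict.get()` with a default. The config text below is a made-up stand-in for `data/config.json`, deliberately missing the `api` section.

```python
import json

# Hypothetical config text; the real data/config.json may contain different keys
config_text = '{"database": {"host": "localhost", "port": 5432}, "regions": ["North", "South"]}'
config = json.loads(config_text)

# Direct indexing raises KeyError when a key is missing
try:
    endpoint = config['api']['endpoint']
except KeyError as exc:
    print(f"Missing key: {exc}")

# .get() with a default returns a fallback value instead of raising
endpoint = config.get('api', {}).get('endpoint', 'not configured')
print(f"API endpoint: {endpoint}")
print(f"Database port: {config['database']['port']}")
```

Direct indexing is still a reasonable choice for required settings, since a loud `KeyError` points straight at the missing key.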


### Reading Parquet Files


In [None]:

# Read Parquet file
df_parquet = pd.read_parquet('data/large_sales.parquet')

print("Parquet file contents:")
print(df_parquet.head())
print(f"\nShape: {df_parquet.shape}")
print(f"\nMemory usage:")
print(df_parquet.memory_usage(deep=True))

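# Parquet is a columnar, compressed format, so large datasets usually load
# faster and take less disk space than the equivalent CSV.
# Note: the first and last rows are the earliest and latest transactions
# only if the data is sorted by date.
print(f"\nFirst transaction: {df_parquet.iloc[0]['date']}")
print(f"Last transaction: {df_parquet.iloc[-1]['date']}")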


### Reading Text Files


In [None]:
# Method 1: Read entire file as string
with open('data/system_log.txt', 'r') as f:
    log_content = f.read()

print("Text file contents:")
print(log_content[:200])  # First 200 characters


In [None]:
# Method 2: Read line by line
with open('data/system_log.txt', 'r') as f:
    lines = f.readlines()

print(f"Total lines: {len(lines)}")
print("\nFirst 3 lines:")
for line in lines[:3]:
    print(line.strip())


In [None]:
# Method 3: Process line by line (memory efficient for large files)
error_lines = []
with open('data/system_log.txt', 'r') as f:
    for line in f:
        if 'ERROR' in line:
            error_lines.append(line.strip())

print(f"\nFound {len(error_lines)} error lines:")
for error in error_lines:
    print(f"  {error}")
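
The same line-by-line loop can also aggregate as it goes, so the full file never has to be held in memory. The sketch below counts lines per log level. The log layout (`date time LEVEL message`) and the sample lines are assumptions, since the exact format of `data/system_log.txt` is not shown here; `io.StringIO` stands in for the open file.

```python
import io
from collections import Counter

# Hypothetical log text; the layout of data/system_log.txt is assumed
log_text = """2024-01-15 10:23:45 INFO Service started
2024-01-15 10:24:12 ERROR Database connection failed
2024-01-15 10:25:03 WARNING High memory usage: 85%
2024-01-15 10:27:45 ERROR Timeout on API call
"""

level_counts = Counter()
# io.StringIO behaves like an open text file, so this loop matches
# iterating over a real file handle one line at a time
with io.StringIO(log_text) as f:
    for line in f:
        parts = line.split(maxsplit=3)  # date, time, level, message
        if len(parts) == 4:
            level_counts[parts[2]] += 1

print(level_counts)
```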


### Reading YAML Files

In [None]:
import yaml

# Read YAML file
with open('data/pipeline_config.yml', 'r') as f:
    pipeline_config = yaml.safe_load(f)

print("YAML file contents:")
print(f"Pipeline name: {pipeline_config['pipeline_name']}")
print(f"Version: {pipeline_config['version']}")
print(f"Schedule: {pipeline_config['schedule']}")

print("\nPipeline stages:")
for stage in pipeline_config['stages']:
    print(f"  - {stage['name']}: {stage['enabled']}")

print(f"\nRetry policy: {pipeline_config['retry_policy']}")

<a id=files></a>

## Multiple CSV Files from a Folder


### Using `pathlib` and `pandas.concat()`

In [None]:

from pathlib import Path
import pandas as pd

# Get all CSV files in the monthly_sales folder
sales_folder = Path('data/monthly_sales')
# glob() returns files in arbitrary order, so sort for a reproducible result
csv_files = sorted(sales_folder.glob('*.csv'))

print(f"Found {len(csv_files)} CSV files:")
for file in csv_files:
    print(f"  - {file.name}")

# Read all files and combine them
dfs = []
for file in csv_files:
    df = pd.read_csv(file)
    # Add a column to track which file the data came from
    df['source_file'] = file.stem  # stem gives filename without extension
    dfs.append(df)

# Combine all dataframes
combined_df = pd.concat(dfs, ignore_index=True)

print(f"\nCombined dataset shape: {combined_df.shape}")
print(f"Total rows: {len(combined_df)}")
print(f"\nFirst few rows:")
print(combined_df.head())
print(f"\nLast few rows:")
print(combined_df.tail())
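
One edge case worth handling: `pd.concat()` raises a `ValueError` when given an empty list, which happens if the folder contains no matching files. The following sketch wraps the pattern in a helper with an explicit check. The function name `load_csv_folder`, the temporary file names, and the `amount` column are all hypothetical and exist only to make the example runnable.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_csv_folder(folder):
    """Combine every CSV in `folder` into one DataFrame (hypothetical helper)."""
    csv_files = sorted(Path(folder).glob('*.csv'))  # sorted for a stable order
    if not csv_files:
        # pd.concat raises ValueError on an empty list, so fail with a clearer message
        raise FileNotFoundError(f"No CSV files found in {folder}")
    # assign() adds the source file name as a column on each frame
    frames = [pd.read_csv(f).assign(source_file=f.stem) for f in csv_files]
    return pd.concat(frames, ignore_index=True)


# Demo with throwaway files; the names and columns are made up
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, '2024_02.csv').write_text("amount\n30\n")
    Path(tmp, '2024_01.csv').write_text("amount\n10\n20\n")
    combined = load_csv_folder(tmp)

print(combined)
```

Because the file list is sorted, rows from `2024_01` come before rows from `2024_02` regardless of the order in which the files were created.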


## <mark>Exercises</mark>

### <mark>Exercise 1: Filter and Combine Log Files</mark>

You have multiple log files from different servers in the `data/logs/` folder. Your task is to:
1. Read all `.log` files from the folder
2. Extract only the ERROR and WARNING lines
3. Create a DataFrame with columns: `timestamp`, `level`, `message`, `server`
4. Sort by timestamp

**Expected output:**
```
   timestamp            level    message                     server
0  2024-01-15 10:24:12  ERROR    Database connection failed  server1
1  2024-01-15 10:25:03  WARNING  High memory usage: 85%      server1
2  2024-01-15 10:27:45  ERROR    Timeout on API call         server1
3  2024-01-15 10:31:22  WARNING  Disk space low: 15%         server2
4  2024-01-15 10:32:33  ERROR    Failed to write file        server2
5  2024-01-15 10:41:11  ERROR    Network timeout             server3
6  2024-01-15 10:42:22  WARNING  CPU usage: 95%              server3
```
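
**Hint:** a regular expression with named groups is one possible way to split a log line into fields. The sketch below assumes each line looks like `YYYY-MM-DD HH:MM:SS LEVEL message`; that layout is inferred from the expected output, so adjust the pattern if the real files differ. `LINE_PATTERN` and `sample` are illustrative names.

```python
import re

# Assumed layout: "YYYY-MM-DD HH:MM:SS LEVEL message"; the real files may differ
LINE_PATTERN = re.compile(
    r'^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+'
    r'(?P<level>[A-Z]+)\s+'
    r'(?P<message>.*)$'
)

sample = "2024-01-15 10:24:12 ERROR Database connection failed"
match = LINE_PATTERN.match(sample)

# groupdict() returns the named groups as a dict, ready to become a DataFrame row
record = match.groupdict() if match else None
print(record)
```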

In [None]:

from pathlib import Path
import pandas as pd

# TODO: Your code here
# 1. Get all .log files from data/logs/
# 2. Read each file and extract ERROR and WARNING lines
# 3. Parse each line into timestamp, level, and message, and record the server
# 4. Create a DataFrame and sort by timestamp

# Your solution here


### <mark>Exercise 2: Configuration Merger</mark>

You have multiple JSON configuration files for different environments (dev, staging, prod). Your task is to:
1. Read all JSON files from `data/configs/`
2. Merge them into a single DataFrame showing settings across environments
3. Identify which settings differ between environments

**Expected output:**
```
                          dev        staging                 prod
database_host             localhost  staging-db.company.com  prod-db.company.com
database_port             5432       5432                    5432
database_max_connections  10         50                      200
api_timeout               30         60                      60
api_rate_limit            100        500                     1000
debug                     True       True                    False
```
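
**Hint:** the row labels in the expected output (`database_host`, `api_timeout`, ...) suggest that nested keys are joined with underscores. A small recursive helper is one way to do that. The function name `flatten` and the sample `dev_config` dictionary are assumptions for illustration; the structure of the real files in `data/configs/` may differ.

```python
def flatten(nested, prefix=''):
    """Flatten nested dicts into one level, joining keys with underscores."""
    flat = {}
    for key, value in nested.items():
        full_key = f"{prefix}{key}"
        if isinstance(value, dict):
            # Recurse into sub-dictionaries, extending the key prefix
            flat.update(flatten(value, prefix=f"{full_key}_"))
        else:
            flat[full_key] = value
    return flat


# Hypothetical config; the structure of the real files is assumed
dev_config = {
    'database': {'host': 'localhost', 'port': 5432},
    'api': {'timeout': 30},
    'debug': True,
}
flat_dev = flatten(dev_config)
print(flat_dev)
```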

In [None]:
from pathlib import Path
import pandas as pd
import json

# TODO: Your code here
# 1. Read all JSON files from data/configs/
# 2. Flatten the nested structure
# 3. Create a DataFrame where each row is a setting and columns are environments
# 4. Identify settings that differ across environments

# Your solution here


**Answers**: Uncomment and run the cells below to load the answers.

In [None]:
# %load answers/file-1.py

In [None]:
# %load answers/file-2.py

## Summary: File Reading Best Practices

**Key Takeaways:**

- **Use `pathlib`** over `os` for modern, cleaner code
- **CSV files**: Use `pd.read_csv()` for tabular data
- **JSON files**: Use `json.load()` for configs, `pd.read_json()` for tabular data
- **Parquet files**: Use `pd.read_parquet()` for efficient storage of large datasets
- **Text files**: Use context managers (`with open()`) to ensure files close properly
- **YAML files**: Use `yaml.safe_load()` for configuration files
- **Multiple files**: Use `Path.glob()` + `pd.concat()` to combine datasets

**Common patterns:**
```python
# Read single CSV
df = pd.read_csv('file.csv')

# Read all CSVs from folder
dfs = [pd.read_csv(f) for f in sorted(Path('folder').glob('*.csv'))]
combined = pd.concat(dfs, ignore_index=True)

# Read config file
with open('config.json') as f:
    config = json.load(f)
```

**Remember**: Always use context managers (`with open()`) when reading files to ensure they're properly closed, even if an error occurs! 📁
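
To see this behaviour in action, the sketch below raises an error inside a `with` block and then checks whether the file was closed. The temporary file and its contents are made up for the demonstration.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'example.txt'  # throwaway file for the demo
    path.write_text("first line\n", encoding='utf-8')

    try:
        with open(path, encoding='utf-8') as f:
            first = f.readline()
            raise ValueError("simulated failure while processing")
    except ValueError:
        pass

    # The with block closed the file even though an exception was raised
    was_closed = f.closed
    print(f"File closed after error: {was_closed}")
```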