# Python File Operations

This notebook covers working with files in Python, based on [Google's Python Class](https://developers.google.com/edu/python/), specially modified for Marina PANDOLFINO by Daniel Patrick MORGAN.

## Introduction: Working with Files

You have thus far been playing in a sandbox: you have been running Python in little windows inside a JupyterLab project, with access only to the files therein. However, you can just as easily use Python to delete all the files on your computer, one by one, or to open every single webpage on the internet to compile a comprehensive list of animated cat GIFs.

Between these two extremes, real-world programs need to read data from files or write results to files. Python makes this straightforward, but there are important concepts to understand:

- **File paths** - Where the file is located
- **File modes** - How you want to open the file (read, write, etc.)
- **Encoding** - How text is stored (especially important for Japanese characters!)
- **File handles** - The connection between your program and the file


## Opening Files

To work with a file, you first need to **open** it using the `open()` function. This creates a **file handle** (also called a "file object") that you use to read from or write to the file.

**Basic syntax:**
```python
file_handle = open('filename.txt', 'r')  # 'r' means read mode
```

**Important:** Always close files when you're done! (We'll learn a better way to do this automatically later.)


In [None]:
# Example: Opening a file (basic way - we'll improve this later)
file_handle = open('../data/太平記·卷一.txt', 'r', encoding='utf-8')
print(f"File opened: {file_handle}")
print(f"File name: {file_handle.name}")
file_handle.close()  # Always close when done!
print("File closed")


## Why Encoding Matters: A Critical Example

**This is why you MUST specify `encoding='utf-8'` for Japanese text!**

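If you don't specify the encoding, Python does not guess: it uses your system's default (locale) encoding, which on many Windows systems is not UTF-8. If that default (or a wrong encoding that you chose explicitly) does not match the file, you get either a `UnicodeDecodeError` or **garbled, unreadable characters** instead of proper Japanese text.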


In [None]:
# Example: What happens WITHOUT specifying encoding (WRONG WAY - DON'T DO THIS!)
# Depending on your system's default encoding, this may print garbled characters,
# raise an error, or (if the default happens to be UTF-8) work by pure luck
try:
    # Opening without encoding - Python uses system default (often wrong for Japanese!)
    file_handle = open('../data/太平記·卷一.txt', 'r')  # NO encoding specified!
    try:
        content = file_handle.read(100)
    finally:
        file_handle.close()  # Close the file even if read() fails
    print(f"Without encoding (system default was: {file_handle.encoding}):")
    print(content)
    print("\n⚠️ Weird characters mean the default was not UTF-8. If the text looks fine, you were lucky!")
except UnicodeDecodeError as e:
    print(f"❌ Error! Python couldn't decode the file: {e}")
    print("This is exactly why we need to specify encoding='utf-8'!")
except Exception as e:
    print(f"Error: {e}")

print("\n" + "="*60 + "\n")

# Example: The CORRECT way with UTF-8 encoding
file_handle = open('../data/太平記·卷一.txt', 'r', encoding='utf-8')
content = file_handle.read(100)
file_handle.close()
print("With encoding='utf-8' (CORRECT - readable Japanese text):")
print(content)
print("\n✅ Perfect! The text is readable because we specified UTF-8 encoding.")
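
What goes wrong can be reproduced without any file at all, because reading a text file amounts to decoding its bytes. The following is a minimal sketch, not part of the original class material: the variable names are illustrative, and `latin-1` merely stands in for whatever non-UTF-8 default a system might have.

```python
text = "太平記"

# Encoding turns characters into bytes; each of these kanji takes 3 bytes in UTF-8
raw = text.encode("utf-8")
print(len(text), "characters ->", len(raw), "bytes")

# Decoding with the matching encoding restores the original text
restored = raw.decode("utf-8")
print(restored == text)

# Decoding with a mismatched encoding does not fail here: it silently produces garbage
# (latin-1 is an assumption standing in for a non-UTF-8 system default)
garbled = raw.decode("latin-1")
print(repr(garbled))

# Other mismatches raise an error instead of producing garbage
try:
    raw.decode("ascii")
except UnicodeDecodeError as e:
    print(f"ASCII cannot decode these bytes: {e}")
```

Both failure modes in the cell above (garbled output and `UnicodeDecodeError`) come from the same cause: the bytes were written with one encoding and read with another.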


## File Modes

When opening a file, you specify a **mode** that tells Python what you want to do:

- `'r'` - **Read mode** (default) - Open for reading only
- `'w'` - **Write mode** - Open for writing (creates the file if needed, erases the contents of an existing file!)
- `'a'` - **Append mode** - Open for writing, adds to end of file
- `'r+'` - **Read and write mode** - Open for both reading and writing (the file must already exist)
- `'b'` - **Binary mode** - Add 'b' to any mode for binary files (e.g., `'rb'`, `'wb'`); binary modes do not take an `encoding` argument

**Warning:** `'w'` mode will **erase the contents** of the file if it already exists! Use `'a'` if you want to add to a file.
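
The difference between `'w'` and `'a'` can be tried safely in a throwaway folder. This is a small sketch under stated assumptions: the file name `modes_demo.txt` is made up, the temporary directory is deleted automatically at the end, and the `with` blocks it uses are explained in the section on the `with` statement further down.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "modes_demo.txt")  # hypothetical file name

    # 'w' creates the file (or empties it if it already exists)
    with open(path, "w", encoding="utf-8") as f:
        f.write("first\n")

    # 'a' keeps what is there and adds to the end
    with open(path, "a", encoding="utf-8") as f:
        f.write("second\n")
    with open(path, "r", encoding="utf-8") as f:
        after_append = f.read()

    # 'w' again: the two earlier lines are gone
    with open(path, "w", encoding="utf-8") as f:
        f.write("third\n")
    with open(path, "r", encoding="utf-8") as f:
        after_overwrite = f.read()

print(repr(after_append))     # 'first\nsecond\n'
print(repr(after_overwrite))  # 'third\n'
```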


In [None]:
# Example: Different file modes
# Read mode
read_file = open('../data/太平記·卷一.txt', 'r', encoding='utf-8')
first_line = read_file.readline()
print(f"First line (read mode): {first_line[:50]}...")
read_file.close()

# Note: We won't demonstrate 'w' mode here to avoid accidentally overwriting files!
# But you would use: write_file = open('output.txt', 'w', encoding='utf-8')


## Reading Files

There are several ways to read from a file:

1. **`read()`** - Reads the entire file as a single string
2. **`readline()`** - Reads one line at a time
3. **`readlines()`** - Reads all lines into a list
4. **Iterating** - Loop through the file line by line (most memory-efficient)

**Which to use?**
- Small files: `read()` or `readlines()` is fine
- Large files: Iterate line by line to save memory
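
To see side by side what each of the four approaches returns, here is a minimal sketch on a three-line file created in a temporary directory. The file name `sample.txt` and its contents are invented for illustration.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.txt")  # hypothetical file name
    with open(path, "w", encoding="utf-8") as f:
        f.write("一\n二\n三\n")

    # 1. read(): one string containing the whole file
    with open(path, "r", encoding="utf-8") as f:
        whole = f.read()

    # 2. readline(): one line per call, newline included
    with open(path, "r", encoding="utf-8") as f:
        first = f.readline()

    # 3. readlines(): a list with one string per line
    with open(path, "r", encoding="utf-8") as f:
        as_list = f.readlines()

    # 4. Iterating: one line at a time, never the whole file in memory
    with open(path, "r", encoding="utf-8") as f:
        stripped = [line.rstrip("\n") for line in f]

print(repr(whole))    # '一\n二\n三\n'
print(repr(first))    # '一\n'
print(as_list)        # ['一\n', '二\n', '三\n']
print(stripped)       # ['一', '二', '三']
```

Note that every method keeps the trailing newline unless you strip it yourself.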


In [None]:
# Example 1: read() - Read the entire file, or only part of it
file_handle = open('../data/太平記·卷一.txt', 'r', encoding='utf-8')
content = file_handle.read(200)  # Read only the first 200 characters; read() with no argument reads everything
print("First 200 characters:")
print(content)
file_handle.close()


In [None]:
# Example 2: readline() - Read one line at a time
file_handle = open('../data/太平記·卷一.txt', 'r', encoding='utf-8')
line1 = file_handle.readline()
line2 = file_handle.readline()
line3 = file_handle.readline()
print("First 3 lines:")
print(f"Line 1: {line1.strip()}")
print(f"Line 2: {line2.strip()}")
print(f"Line 3: {line3.strip()}")
file_handle.close()


In [None]:
# Example 3: readlines() - Read all lines into a list
file_handle = open('../data/太平記·卷一.txt', 'r', encoding='utf-8')
all_lines = file_handle.readlines()
file_handle.close()

print(f"Total lines: {len(all_lines)}")
print("First 5 lines:")
for i, line in enumerate(all_lines[:5], 1):
    print(f"  {i}: {line.strip()}")


In [None]:
# Example 4: Iterating line by line (most memory-efficient for large files)
file_handle = open('../data/太平記·卷一.txt', 'r', encoding='utf-8')
line_count = 0
for line in file_handle:
    line_count += 1
    if line_count <= 5:  # Print first 5 lines
        print(f"Line {line_count}: {line.strip()}")
file_handle.close()
print(f"\nTotal lines processed: {line_count}")


## Writing Files

To write to a file, open it in write mode (`'w'`) or append mode (`'a'`), then use:
- **`write(string)`** - Writes a string to the file
- **`print(..., file=file_handle)`** - Prints to a file instead of the screen

**Important:** 
- `'w'` mode **overwrites** the file if it exists
- `'a'` mode **adds** to the end of the file
- Always specify `encoding='utf-8'` for text files with Japanese characters!


In [None]:
# Example: Writing to a file
# Create a test output file
output_file = open('test_output.txt', 'w', encoding='utf-8')

# Method 1: Using write()
output_file.write("Hello, world!\n")
output_file.write("This is line 2.\n")

# Method 2: Using print() with file parameter
print("This is line 3.", file=output_file)
print("日本語のテキスト", file=output_file)

output_file.close()
print("File written! Check 'test_output.txt'")

# Read it back to verify
read_back = open('test_output.txt', 'r', encoding='utf-8')
print("\nContents of test_output.txt:")
print(read_back.read())
read_back.close()


## The `with` Statement (Recommended Way!)

The `with` statement automatically closes the file when you're done, even if an error occurs. This is the **recommended way** to work with files in Python.

**Syntax:**
```python
with open('filename.txt', 'r', encoding='utf-8') as file_handle:
    # Do something with the file
    content = file_handle.read()
# File is automatically closed here, even if an error occurred
```

**Why use `with`?**
- Automatically closes the file
- Closes the file even if an error is raised inside the block (the error itself still has to be caught with `try/except`)
- Cleaner, more readable code
- Prevents file handle leaks
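
The claim that the file is closed "even if an error occurs" can be checked through the file object's `closed` attribute. A minimal sketch, assuming a made-up file `with_demo.txt` in a temporary directory and using a deliberately failing `int()` call to provoke an error:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "with_demo.txt")  # hypothetical file name

    with open(path, "w", encoding="utf-8") as f:
        f.write("hello\n")
    closed_after_write = f.closed  # the 'with' block has ended, so the file is closed

    try:
        with open(path, "r", encoding="utf-8") as f:
            int(f.read())  # "hello\n" is not a number, so this raises ValueError
    except ValueError:
        closed_after_error = f.closed  # closed although the block was left by an error

print(closed_after_write)  # True
print(closed_after_error)  # True
```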


In [None]:
# Example: Using 'with' statement (recommended!)
with open('../data/太平記·卷一.txt', 'r', encoding='utf-8') as f:
    first_100_chars = f.read(100)
    print("First 100 characters:")
    print(first_100_chars)
# File is automatically closed here - no need to call .close()!


In [None]:
# Example: Reading line by line with 'with'
with open('../data/太平記·卷一.txt', 'r', encoding='utf-8') as f:
    for i, line in enumerate(f, 1):
        print(f"Line {i}: {line.strip()}")
        if i >= 5:  # Stop after the first 5 lines
            break


## Encoding: Working with Japanese Text

When working with text files containing Japanese characters (or any non-ASCII characters), you should always specify the encoding. UTF-8 is the standard encoding, and it can represent every Unicode character.

**Why encoding matters:**
- Without specifying encoding, Python uses your system's default (which may not be UTF-8 and may not support Japanese)
- UTF-8 can handle Japanese, Chinese, Korean, and every other script in Unicode
- Always use `encoding='utf-8'` when opening text files with non-ASCII characters, provided the file was actually saved as UTF-8 (older Japanese files may use Shift_JIS or EUC-JP instead)
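
If you are curious which default your own system would fall back on, the standard library can tell you. This is only a quick check, and the result differs from machine to machine, so no expected output is shown.

```python
import locale

# The encoding that open() falls back on when no encoding= is given
default_encoding = locale.getpreferredencoding(False)
print(f"Default text encoding on this system: {default_encoding}")
```

Whatever this prints, passing `encoding='utf-8'` explicitly makes your code behave the same way everywhere.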


In [None]:
# Example: Reading Japanese text with UTF-8 encoding
with open('../data/太平記·卷一.txt', 'r', encoding='utf-8') as f:
    first_line = f.readline()
    print("First line (with UTF-8 encoding):")
    print(first_line)


## File Paths: Relative vs Absolute

**Relative paths** are relative to your current working directory:
- `'data/file.txt'` - looks in the `data` folder in the current directory
- `'../data/file.txt'` - goes up one directory, then into `data`

**Absolute paths** specify the full path from the root:
- `'/home/user/project/data/file.txt'` - full path from the root of the filesystem

**Which to use?**
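- Relative paths are more portable: they keep working on another computer as long as the project keeps the same folder layout and the code is run from the same working directory
- Absolute paths are more explicit but break as soon as the project is moved or copied elsewhere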


In [None]:
# Example: Relative vs absolute paths
import os

# Relative path (relative to current directory)
relative_path = '../data/太平記·卷一.txt'

# Convert to absolute path
absolute_path = os.path.abspath(relative_path)
print(f"Relative path: {relative_path}")
print(f"Absolute path: {absolute_path}")
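
To tell the two kinds of path apart in code, `os.path.isabs()` can be used. In this short sketch the path `data/file.txt` is hypothetical and does not need to exist, since none of these functions touch the disk.

```python
import os

relative_path = os.path.join("data", "file.txt")  # hypothetical path, need not exist
absolute_path = os.path.abspath(relative_path)

print(os.path.isabs(relative_path))  # False
print(os.path.isabs(absolute_path))  # True

# abspath() resolves the relative path against the current working directory,
# so the result depends on where the notebook is running
print(f"Working directory: {os.getcwd()}")
print(f"Resolved path:     {absolute_path}")
```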


## OS Tools: Working with the File System

Python's `os` and `os.path` modules provide tools for interacting with the file system. This is based on [Google's Python Utilities](https://developers.google.com/edu/python/utilities).

**Common functions:**
- `os.listdir(dir)` - List the names of all entries (files and subdirectories) in a directory
- `os.path.join(dir, filename)` - Safely join directory and filename
- `os.path.abspath(path)` - Get absolute path
- `os.path.exists(path)` - Check if file/directory exists
- `os.path.dirname(path)` - Get directory part of path
- `os.path.basename(path)` - Get filename part of path
- `os.mkdir(dir)` - Create a directory
- `os.makedirs(dir)` - Create directory and all parent directories
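
The functions that create things are best tried somewhere harmless. The sketch below works entirely inside a temporary directory; the folder names `a` and `b` and the file `note.txt` are made up, and `exist_ok=True` is an optional argument to `os.makedirs()` that avoids an error if the directory is already there.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # makedirs() creates 'a' and 'a/b' in one call
    nested_dir = os.path.join(tmp_dir, "a", "b")
    os.makedirs(nested_dir, exist_ok=True)

    file_path = os.path.join(nested_dir, "note.txt")  # hypothetical file name
    with open(file_path, "w", encoding="utf-8") as f:
        f.write("hi\n")

    top_entries = os.listdir(tmp_dir)        # names only, and only one level deep
    nested_entries = os.listdir(nested_dir)
    is_file = os.path.isfile(file_path)
    is_dir = os.path.isdir(nested_dir)
    parent = os.path.dirname(file_path)
    name = os.path.basename(file_path)

print(top_entries)      # ['a']
print(nested_entries)   # ['note.txt']
print(is_file, is_dir)  # True True
print(name)             # note.txt
```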


In [1]:
# Example: Using os and os.path
import os

# List files in a directory
data_dir = '../data'
if os.path.exists(data_dir):
    filenames = os.listdir(data_dir)
    print(f"Files in {data_dir}:")
    for filename in filenames[:5]:  # Show first 5
        # Join directory and filename safely
        full_path = os.path.join(data_dir, filename)
        # Get absolute path
        abs_path = os.path.abspath(full_path)
        print(f"  {filename}")
        print(f"    Full path: {full_path}")
        print(f"    Absolute: {abs_path}")
        print(f"    Exists: {os.path.exists(full_path)}")
        print()


Files in ../data:
  chise_ids
    Full path: ../data/chise_ids
    Absolute: /home/d/Python/codeclass/data/chise_ids
    Exists: True

  jisho_radicals.json
    Full path: ../data/jisho_radicals.json
    Absolute: /home/d/Python/codeclass/data/jisho_radicals.json
    Exists: True

  jisho_radicals.csv
    Full path: ../data/jisho_radicals.csv
    Absolute: /home/d/Python/codeclass/data/jisho_radicals.csv
    Exists: True

  jisho_radicals.py
    Full path: ../data/jisho_radicals.py
    Absolute: /home/d/Python/codeclass/data/jisho_radicals.py
    Exists: True

  jisho_radicals.xml
    Full path: ../data/jisho_radicals.xml
    Absolute: /home/d/Python/codeclass/data/jisho_radicals.xml
    Exists: True



In [None]:
# Example: Using os.path.dirname() and os.path.basename()
file_path = '../data/太平記·卷一.txt'

dirname = os.path.dirname(file_path)
basename = os.path.basename(file_path)

print(f"Full path: {file_path}")
print(f"Directory: {dirname}")
print(f"Filename: {basename}")


## Copying Files with shutil

The `shutil` module provides file operations like copying files.

**Common function:**
- `shutil.copy(source, destination)` - Copy a file from source to destination


In [None]:
# Example: Copying a file (commented out to avoid creating files)
import shutil

# shutil.copy('../data/太平記·卷一.txt', 'copy_of_太平記.txt')
# print("File copied!")

# Note: Uncomment the lines above to actually copy a file
# Make sure the destination directory exists first!
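
For a version that can be run without leaving anything behind, the following sketch copies a file inside a temporary directory. The names `original.txt` and `copy.txt` are invented for illustration.

```python
import os
import shutil
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    source = os.path.join(tmp_dir, "original.txt")   # hypothetical file names
    destination = os.path.join(tmp_dir, "copy.txt")

    with open(source, "w", encoding="utf-8") as f:
        f.write("日本語のテキスト\n")

    # copy() returns the path of the newly created file
    returned_path = shutil.copy(source, destination)

    with open(destination, "r", encoding="utf-8") as f:
        copied_text = f.read()
    source_still_exists = os.path.exists(source)

print(repr(copied_text))     # '日本語のテキスト\n'
print(source_still_exists)   # True: copying leaves the original in place
```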


## Debugging and Error Handling: try/except

When working with files, things can go wrong:
- File doesn't exist
- Permission denied
- Disk full
- Wrong encoding

**Exceptions** are Python's way of handling errors. An exception interrupts normal execution and transfers control to error-handling code.

**The `try/except` block:**
- `try:` - Code that might cause an error
- `except:` - Code to run if an error occurs
- The program continues after the try/except block

This is based on [Google's Python Utilities](https://developers.google.com/edu/python/utilities).


In [None]:
# Example: Basic try/except
try:
    # This might fail if the file doesn't exist
    with open('../data/nonexistent_file.txt', 'r', encoding='utf-8') as f:
        content = f.read()
    print("File read successfully!")
except FileNotFoundError:
    print("Error: File not found!")
    print("The program continues here...")

print("\nProgram continues normally after error handling.")


In [None]:
# Example: Handling multiple types of errors
try:
    with open('../data/太平記·卷一.txt', 'r', encoding='utf-8') as f:
        content = f.read(100)
    print("Successfully read file!")
    print(f"First 50 characters: {content[:50]}...")
except FileNotFoundError:
    print("Error: File not found!")
except PermissionError:
    print("Error: Permission denied!")
except Exception as e:
    # Catch any other error
    print(f"Unexpected error: {e}")
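
The "wrong encoding" case from the list above can be caught in the same way. As a sketch under stated assumptions, the code below first writes a small file in Shift_JIS (the file name `sjis.txt` is made up), then tries to read it as UTF-8 and handles the resulting `UnicodeDecodeError`.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sjis.txt")  # hypothetical file name

    # Create a file that is NOT UTF-8
    with open(path, "w", encoding="shift_jis") as f:
        f.write("日本語\n")

    try:
        with open(path, "r", encoding="utf-8") as f:
            content = f.read()
        decode_failed = False
    except UnicodeDecodeError as e:
        decode_failed = True
        print(f"Could not read the file as UTF-8: {e}")

    # Reading with the encoding the file was written in works
    with open(path, "r", encoding="shift_jis") as f:
        content = f.read()

print(decode_failed)  # True
print(content)
```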


## Raising Errors

Sometimes you want to **raise** (create) your own errors when something goes wrong in your code. This helps with debugging and makes your code more robust.

**Syntax:** `raise ExceptionType("Error message")`


In [None]:
# Example: Raising errors
def read_file_safely(filename):
    """Read a file, but raise an error if it doesn't exist."""
    if not os.path.exists(filename):
        raise FileNotFoundError(f"File '{filename}' does not exist!")
    
    with open(filename, 'r', encoding='utf-8') as f:
        return f.read()

# This will work
try:
    content = read_file_safely('../data/太平記·卷一.txt')
    print("File read successfully!")
except FileNotFoundError as e:
    print(f"Error: {e}")

# This will raise an error
try:
    content = read_file_safely('../data/nonexistent.txt')
except FileNotFoundError as e:
    print(f"Caught error: {e}")


## Recursive File Searching: Finding All Files of a Type

Often you need to find all files of a certain type (e.g., all `.txt` files) in a directory and its subdirectories. Python provides several ways to do this:

- **Method 1: `os.walk()`** - Walks through the directory tree recursively
- **Method 2: `glob.glob()`** - Pattern matching for file paths


In [None]:
# Example 1: Using os.walk() to find all .txt files
import os

txt_files = []
search_dir = '../data'

# os.walk() yields a (dirpath, dirnames, filenames) tuple for each directory
for dirpath, dirnames, filenames in os.walk(search_dir):
    for filename in filenames:
        if filename.endswith('.txt'):
            full_path = os.path.join(dirpath, filename)
            txt_files.append(full_path)

print(f"Found {len(txt_files)} .txt files:")
for file_path in txt_files[:5]:  # Show first 5
    print(f"  {file_path}")


In [None]:
# Example 2: Using glob to find files (simpler for patterns)
import glob

# Find all .txt files in data directory (non-recursive)
txt_files = glob.glob('../data/*.txt')
print(f"Found {len(txt_files)} .txt files (non-recursive):")
for file_path in txt_files:
    print(f"  {file_path}")

# Find all .txt files recursively (using **)
txt_files_recursive = glob.glob('../data/**/*.txt', recursive=True)
print(f"\nFound {len(txt_files_recursive)} .txt files (recursive):")
for file_path in txt_files_recursive[:5]:  # Show first 5
    print(f"  {file_path}")
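
To check that the two methods agree, one can build a tiny directory tree and search it both ways. This is a sketch with made-up file names (`a.txt`, `b.json`, `sub/c.txt`) inside a temporary directory, not a search of the course data.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # Build a small tree: two files at the top level, one in a subdirectory
    os.makedirs(os.path.join(tmp_dir, "sub"))
    for rel in ["a.txt", "b.json", os.path.join("sub", "c.txt")]:
        with open(os.path.join(tmp_dir, rel), "w", encoding="utf-8") as f:
            f.write("x\n")

    # Method 1: os.walk() visits every directory in the tree
    walked = []
    for dirpath, dirnames, filenames in os.walk(tmp_dir):
        for filename in filenames:
            if filename.endswith(".txt"):
                walked.append(os.path.join(dirpath, filename))

    # Method 2: glob with ** and recursive=True
    globbed = glob.glob(os.path.join(tmp_dir, "**", "*.txt"), recursive=True)

    # Without **, glob only looks at the top level
    top_level = glob.glob(os.path.join(tmp_dir, "*.txt"))

    walked_names = sorted(os.path.basename(p) for p in walked)
    globbed_names = sorted(os.path.basename(p) for p in globbed)
    top_level_names = sorted(os.path.basename(p) for p in top_level)

print(walked_names)     # ['a.txt', 'c.txt']
print(globbed_names)    # ['a.txt', 'c.txt']
print(top_level_names)  # ['a.txt']
```

Neither method promises a particular order, which is why the results are sorted before being compared.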


## Practical Example: Processing All Files of a Type

Let's combine what we've learned: find all IDS files in the `chise_ids` directory and process them.


In [None]:
# Example: Find and process all IDS files
import os
import glob

# Find all .txt files in chise_ids directory
ids_dir = '../data/chise_ids'
ids_files = glob.glob(os.path.join(ids_dir, '*.txt'))

print(f"Found {len(ids_files)} IDS files:")
for file_path in ids_files[:3]:  # Show first 3
    print(f"  {os.path.basename(file_path)}")

# Process first file as example
if ids_files:
    first_file = ids_files[0]
    print(f"\nProcessing: {os.path.basename(first_file)}")
    
    try:
        with open(first_file, 'r', encoding='utf-8') as f:
            line_count = 0
            for line in f:
                line_count += 1
                if line_count <= 5:  # Show first 5 lines
                    print(f"  Line {line_count}: {line.strip()}")
            print(f"  Total lines: {line_count}")
    except Exception as e:
        print(f"Error reading file: {e}")


## Practice Exercises

Now it's your turn! Practice working with files, OS tools, and error handling.


### Exercise 1: List Files in a Directory

Use `os.listdir()` to list all files in the `../data` directory. Print each filename and check if it's a file (not a directory) using `os.path.isfile()`.


In [None]:
# Exercise 1: List Files in a Directory
import os

# Your code here:
# 1. Use os.listdir() to get files in '../data'
# 2. Loop through and print each filename
# 3. Use os.path.isfile() to check if it's a file (not a directory)


### Exercise 2: Read File with Error Handling

Write code to read `../data/太平記·卷一.txt` using a `try/except` block. Handle the `FileNotFoundError` exception and print a helpful error message if the file doesn't exist.


In [None]:
# Exercise 2: Read File with Error Handling
# Your code here:
# 1. Use try/except to read '../data/太平記·卷一.txt'
# 2. Handle FileNotFoundError
# 3. Print the first 100 characters if successful
# 4. Print an error message if file not found


### Exercise 3: Find All JSON Files

Use `glob.glob()` to find all `.json` files in the `../data` directory (recursively). Print the full path of each file found.


In [None]:
# Exercise 3: Find All JSON Files
import glob

# Your code here:
# 1. Use glob.glob() with recursive=True to find all .json files in '../data'
# 2. Print each file path found


### Exercise 4: Process Multiple Files

Use `os.walk()` to find all `.txt` files in the `../data` directory (recursively). For each file, try to read the first line and print the filename and first line. Use error handling to skip files that can't be read.


In [None]:
# Exercise 4: Process Multiple Files
import os

# Your code here:
# 1. Use os.walk() to find all .txt files in '../data'
# 2. For each file, try to read the first line
# 3. Use try/except to handle errors
# 4. Print filename and first line for each successful read


### Exercise 5: Count Lines in All IDS Files

Find all `.txt` files in the `../data/chise_ids` directory and count the total number of lines across all files. Use error handling to skip any files that can't be read.


In [None]:
# Exercise 5: Count Lines in All IDS Files
import glob
import os

# Your code here:
# 1. Find all .txt files in '../data/chise_ids'
# 2. For each file, count the lines
# 3. Sum up the total lines across all files
# 4. Use error handling to skip files that can't be read
# 5. Print the total line count
