# Interacting with the Filesystem

In this notebook we will focus on how to navigate and interact with the filesystem using libraries such as `os` and `pathlib`.

## 00. Getting Setup

In [10]:
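from pathlib import Path
# import only the typing names this notebook uses instead of `from typing import *`
from typing import Any, Generator, Iterable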

In [2]:
!pip install -r ../requirements.txt



ERROR: Could not find a version that satisfies the requirement gzip (from versions: none)
ERROR: No matching distribution found for gzip

[notice] A new release of pip is available: 23.0.1 -> 25.0
[notice] To update, run: python.exe -m pip install --upgrade pip


---

## 01. Introduction

A filesystem is the structure an operating system uses to organise, store, and retrieve data on a storage device, typically as a hierarchy of directories and files.

In [11]:
# what is a path: a string that describes where a file or directory lives in the filesystem
r"c:\users\can134\onedrive - csiro\documents\csiro\nextgen\2025\coding-bootcamp\nextgen2025-codingbootcamp-session05\.venv\lib\site-packages"

In [12]:
# relative path: a location given relative to another directory (here, the project root)
r".\notebooks\01-interacting-with-the-filesystem.ipynb"

In [13]:
# absolute path: the full location from the drive root, i.e. the project root followed by the relative path
r"c:\users\can134\onedrive - csiro\documents\csiro\nextgen\2025\coding-bootcamp\nextgen2025-codingbootcamp-session05" + r"\notebooks\01-interacting-with-the-filesystem.ipynb"
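
The difference between relative and absolute paths can also be checked in code. The following is a minimal sketch that uses `pathlib.PureWindowsPath` so that it behaves the same on any operating system; the project root `C:\projects\session05` is a hypothetical location chosen only for illustration.

```python
from pathlib import PureWindowsPath

# hypothetical project root, used only for illustration
root = PureWindowsPath(r"C:\projects\session05")
relative = PureWindowsPath(r"notebooks\01-interacting-with-the-filesystem.ipynb")

# a relative path has no drive or root, so it only has meaning relative to some directory
print(relative.is_absolute())  # False

# joining the root with the relative path gives an absolute path
absolute = root / relative
print(absolute.is_absolute())  # True
print(absolute)
```

Pure paths only manipulate the text of a path and never touch the disk, which makes them convenient for examples like this one.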

In [14]:
# directory

In [15]:
# what is a working directory
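
The working directory is the directory a process is currently running in, and it is the starting point for every relative path. The sketch below shows one way to inspect it with `Path.cwd()`; the filename `requirements.txt` is used only as an example and does not need to exist.

```python
from pathlib import Path

# the working directory of the current Python process
cwd = Path.cwd()
print(cwd)

# a relative path is interpreted relative to the working directory
relative = Path("requirements.txt")  # example filename, it does not need to exist
same_location = relative.absolute() == cwd / relative
print(same_location)  # True
```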

In [16]:
# file
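
A directory is a container for other directories and files, while a file holds the data itself. As a small sketch, the code below creates a temporary directory containing one file (the name `example.txt` is hypothetical) and asks `pathlib` which is which.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    directory = Path(tmp)
    file = directory / "example.txt"  # hypothetical file, created only for this example
    file.write_text("hello")

    # record (is_dir, is_file) for both paths while they still exist
    dir_checks = (directory.is_dir(), directory.is_file())
    file_checks = (file.is_dir(), file.is_file())

print(dir_checks)   # (True, False)
print(file_checks)  # (False, True)
```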

---

## 02. Pathlib

There are several Python libraries that simplify interacting with the filesystem; in this workshop we will use `pathlib`.

In [6]:
from pathlib import Path

# we can use pathlib to more easily handle navigating through the filesystem
p = Path("..").absolute().resolve()

print(p)

C:\Users\can134\OneDrive - CSIRO\Documents\CSIRO\NextGen\2025\coding-bootcamp\nextgen2025-codingbootcamp-session05


In [14]:
# we can use it to check if the path exists, is a file, or is a directory
p = Path("../requirements.txt")

print(p.is_file())

print(p.absolute().resolve())



True
C:\Users\can134\OneDrive - CSIRO\Documents\CSIRO\NextGen\2025\coding-bootcamp\nextgen2025-codingbootcamp-session05\requirements.txt


In [20]:
# we can split a path into its components, for example the full path and its parent directory
print(p.absolute().resolve())

print(p.parent.absolute().resolve())

C:\Users\can134\OneDrive - CSIRO\Documents\CSIRO\NextGen\2025\coding-bootcamp\nextgen2025-codingbootcamp-session05\requirements.txt
C:\Users\can134\OneDrive - CSIRO\Documents\CSIRO\NextGen\2025\coding-bootcamp\nextgen2025-codingbootcamp-session05


In [21]:
p.absolute().resolve().parts

('C:\\',
 'Users',
 'can134',
 'OneDrive - CSIRO',
 'Documents',
 'CSIRO',
 'NextGen',
 '2025',
 'coding-bootcamp',
 'nextgen2025-codingbootcamp-session05',
 'requirements.txt')

In [20]:
# we can access parent directories
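
One way to walk up the directory tree is with the `.parent` and `.parents` attributes. The sketch below uses a hypothetical Windows-style path and `PureWindowsPath`, so it runs on any operating system without touching the disk.

```python
from pathlib import PureWindowsPath

# hypothetical path, used only for illustration
p = PureWindowsPath(r"C:\projects\session05\data\files\wave.txt")

print(p.parent)         # C:\projects\session05\data\files
print(p.parent.parent)  # C:\projects\session05\data
print(p.parents[2])     # C:\projects\session05

# the final component can be split further into stem and suffix
print(p.name, p.stem, p.suffix)  # wave.txt wave .txt
```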

In [21]:
# we can construct child directories
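
Child paths can be built with the `/` operator or with `.joinpath(...)`; both produce the same result. This is a minimal sketch using a hypothetical project root.

```python
from pathlib import PureWindowsPath

# hypothetical project root, used only for illustration
root = PureWindowsPath(r"C:\projects\session05")

# the `/` operator appends one component at a time
with_operator = root / "data" / "files" / "wave.txt"

# `.joinpath` accepts several components in a single call
with_joinpath = root.joinpath("data", "files", "wave.txt")

print(with_operator)                   # C:\projects\session05\data\files\wave.txt
print(with_operator == with_joinpath)  # True
```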

### Activity

Create a `pathlib.Path` that references the file `wave.txt` in `<root>/data/files/` and validate the file exists.

In [31]:
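# option 1: build the absolute path step by step, starting from the notebook's directory
filepath = Path(".").absolute().resolve().parent.joinpath("data").joinpath("files").joinpath("wave.txt")

# option 2: use a relative path directly
# note that this filename is misspelled (`wavez.txt`), so the check below reports False
filepath = Path("../data/files/wavez.txt")

print(filepath)

print(filepath.exists())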

..\data\files\wavez.txt
False


---

## 03. Programmatically Trawling

A common part of data wrangling and processing is retrieving and processing files. Manually defining each file quickly becomes burdensome. Thankfully, we can trawl through the filesystem programmatically using `pathlib` and glob pattern matching.

Glob matching is the process of finding paths whose names match a wildcard pattern, such as `*.txt`. Glob patterns are sometimes called "glob regular expressions", but they are simpler than full regular expressions.

https://www.ibm.com/docs/en/netcoolconfigmanager/6.4.2.0?topic=wildcards-glob-regular-expressions

In [33]:
for path in Path("..").glob("*"):
    print(path)

..\.git
..\.gitignore
..\.venv
..\data
..\docs
..\notebooks
..\README.md
..\requirements.txt
..\scripts


In [40]:
# let's access all the `.txt` files in `<root>/data/files`
for path in Path("../data/files").glob("*.txt"):
    print(path)

..\data\files\label.txt
..\data\files\wave.txt


In [None]:
items: Generator = Path("../data/files").glob("*.txt")

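A `Generator` is an `Iterable` object that `yield`s one item at a time as it is iterated over, rather than building all of its items up front. Once it has been iterated over, it is exhausted and cannot be reused.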
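
As a minimal sketch of this behaviour, the generator function below (the name `count_up_to` is made up for this example) yields integers one at a time and is empty once it has been consumed.

```python
from typing import Generator


def count_up_to(limit: int) -> Generator[int, None, None]:
    """Yield the integers 1..limit one at a time."""
    for n in range(1, limit + 1):
        yield n


numbers = count_up_to(3)

first = next(numbers)     # pull a single item: 1
rest = list(numbers)      # consume the remaining items: [2, 3]
leftover = list(numbers)  # the generator is now exhausted: []

print(first, rest, leftover)
```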

In [32]:
# you can also create your own iterable class by writing `__iter__` as a generator function
class CustomGenerator:
    def __init__(self, items: Iterable) -> None:
        super().__init__()
        self.items = items

    def __iter__(self) -> Generator:
        # `yield` hands back one item per iteration
        for item in self.items:
            yield item

In [34]:
# `.glob` returns a `Generator` that we can iterate over
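
The sketch below illustrates this on a temporary directory holding two hypothetical files. It checks only that the result of `.glob` is a lazy, single-use iterator rather than a list, because the exact type of the returned object can differ between Python versions.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in ("label.txt", "wave.txt"):  # hypothetical files for this example
        (root / name).touch()

    matches = root.glob("*.txt")

    # the result is an iterator, not a list of paths
    is_list = isinstance(matches, list)
    is_iterator = iter(matches) is matches

    # iterating consumes the matches ...
    names = sorted(path.name for path in matches)
    # ... so a second pass over the same object finds nothing
    second_pass = list(matches)

print(is_list, is_iterator)  # False True
print(names)                 # ['label.txt', 'wave.txt']
print(second_pass)           # []
```

If the matches are needed more than once, wrapping the call in `list(...)` or `sorted(...)` stores them.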


In [35]:
# lets iterate over the files in notebooks

We can use different glob patterns to retrieve different items.

In [37]:
# get all items in a directory
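
A sketch of this case: the `*` pattern matches every direct child of a directory, both files and subdirectories, but does not descend into them. The directory layout below is hypothetical and created in a temporary location.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # hypothetical layout: one file and one subdirectory containing another file
    (root / "README.md").touch()
    (root / "data").mkdir()
    (root / "data" / "wave.txt").touch()

    # `*` matches direct children only, so `data/wave.txt` is not included
    all_items = sorted(path.name for path in root.glob("*"))

print(all_items)  # ['README.md', 'data']
```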

In [38]:
# get all items in a directory with a part of a name
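
Placing `*` before or after a fixed piece of text matches items by part of their name. The sketch below assumes a hypothetical set of files in a temporary directory.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in ("label.txt", "wave.bin", "wave.png", "wave.txt"):  # hypothetical files
        (root / name).touch()

    # names that start with "wave", whatever the extension
    wave_files = sorted(path.name for path in root.glob("wave*"))

    # names that end with ".txt", whatever the stem
    text_files = sorted(path.name for path in root.glob("*.txt"))

print(wave_files)  # ['wave.bin', 'wave.png', 'wave.txt']
print(text_files)  # ['label.txt', 'wave.txt']
```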

In [39]:
# recursively access items in a directory
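
To search subdirectories as well, one option is `.rglob(pattern)`, which behaves like `.glob` with `**/` placed in front of the pattern. The nested layout below is hypothetical.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # hypothetical nested layout
    (root / "sub" / "deep").mkdir(parents=True)
    (root / "a.txt").touch()
    (root / "sub" / "b.txt").touch()
    (root / "sub" / "deep" / "c.txt").touch()

    # a plain glob only sees the top level
    top_level = sorted(path.name for path in root.glob("*.txt"))

    # `rglob` and the `**/` prefix both descend into subdirectories
    recursive = sorted(path.name for path in root.rglob("*.txt"))
    double_star = sorted(path.name for path in root.glob("**/*.txt"))

print(top_level)  # ['a.txt']
print(recursive)  # ['a.txt', 'b.txt', 'c.txt']
```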

In [None]:
# lots of different options
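
Two further wildcards that are often useful are `?`, which matches exactly one character, and `[...]`, which matches one character from a set or range. The filenames in this sketch are hypothetical.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in ("notes.txt", "sample_1.txt", "sample_2.txt", "sample_10.txt"):
        (root / name).touch()  # hypothetical files for this example

    # `?` matches exactly one character, so `sample_10.txt` is excluded
    single_digit = sorted(path.name for path in root.glob("sample_?.txt"))

    # `[0-9]` matches one digit and `*` matches whatever follows it
    any_sample = sorted(path.name for path in root.glob("sample_[0-9]*.txt"))

print(single_digit)  # ['sample_1.txt', 'sample_2.txt']
print(any_sample)    # ['sample_1.txt', 'sample_10.txt', 'sample_2.txt']
```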

### Activity

Create a `pathlib.Path` for the `data` directory in this project and use `.glob(...)` to match all of the files in the `files` subdirectory.

In [41]:
for path in Path("../data/files").glob("*"):
    print(path)

..\data\files\label.txt
..\data\files\wave.bin
..\data\files\wave.HDF5
..\data\files\wave.png
..\data\files\wave.txt


Can you build on this and iterate over the two samples in the `datasets` subdirectory?