# Motivation

The biggest drawback of `os.path` was treating system paths as strings, which led to unreadable, messy code and a steep learning curve.

By representing paths as fully-fledged objects, Pathlib solves all these issues and introduces elegance, consistency, and a breath of fresh air into path handling.

And this long-overdue article of mine will outline some of the best functions/features and tricks of `pathlib` to perform tasks that would have been truly horrible experiences in `os.path`.

Learning these features of Pathlib will make everything related to paths and files easier for you as a data professional, especially during data processing workflows where you have to move around thousands of images, CSVs, or audio files.

# Working with paths

## 1. Creating paths

Almost all features of `pathlib` are accessible through its `Path` class, which you can use to create paths to files and directories.

There are a few ways you can create paths with `Path`. First, there are class methods like `cwd` and `home` for the current working directory and the user's home directory:

In [1]:
from pathlib import Path

In [2]:
Path.cwd()

WindowsPath('C:/Users/johnw/Data_Science_Libraries/Pathlib')

In [3]:
Path.home()

WindowsPath('C:/Users/johnw')

You can also create paths from string paths:

In [4]:
p = Path("documents")
p

WindowsPath('documents')

Joining paths is a breeze in Pathlib with the **forward slash operator**:

In [5]:
data_dir = Path(".") / "data"  # . does nothing
data_dir

WindowsPath('data')

In [6]:
csv_file = data_dir / "file.csv"
csv_file

WindowsPath('data/file.csv')

Please, don't let anyone ever catch you using `os.path.join` after this.
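
Here is a small sketch of a few joining details that are easy to miss. It uses `PurePosixPath` only so the printed results look the same on every operating system; the folder and file names are made up for illustration.

```python
from pathlib import PurePosixPath

# PurePosixPath keeps the output identical on Windows, macOS and Linux.
base = PurePosixPath("data")

with_slash = base / "raw" / "file.csv"            # the slash operator
with_joinpath = base.joinpath("raw", "file.csv")  # equivalent method call

# A plain string on the left works too, as long as one side is a path object.
string_first = "data" / PurePosixPath("raw")

# Gotcha: joining with an absolute path discards everything before it.
reset = base / "/etc/hosts"

print(with_slash)                   # data/raw/file.csv
print(with_slash == with_joinpath)  # True
print(string_first)                 # data/raw
print(reset)                        # /etc/hosts
```

The last case is worth remembering when a path segment comes from user input or a config file.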

To check whether a path exists, you can use the boolean method `exists`:

In [7]:
data_dir.exists()

True

In [8]:
csv_file.exists()

True

Sometimes it isn't obvious from a path alone whether it points to a directory or a file. You can check with the `is_dir` and `is_file` methods:

In [9]:
data_dir.is_dir()

True

In [10]:
csv_file.is_file()

True
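
One detail the cells above don't show: all three checks simply return `False` for a path that doesn't exist, instead of raising an error. A minimal sketch inside a throwaway temporary directory (the file names are hypothetical):

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    existing = folder / "file.csv"
    existing.touch()               # create an empty file
    missing = folder / "nope.csv"  # never created

    folder_is_dir = folder.is_dir()
    existing_is_file = existing.is_file()
    existing_is_dir = existing.is_dir()
    missing_exists = missing.exists()
    missing_is_file = missing.is_file()

print(folder_is_dir, existing_is_file)  # True True
print(existing_is_dir)                  # False
print(missing_exists, missing_is_file)  # False False
```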

Most paths you work with will be relative to your current directory. But there are cases where you have to provide the exact location of a file or a directory so that it can be found no matter where a script runs from. This is when you use `absolute` paths:

In [11]:
csv_file.absolute()

WindowsPath('C:/Users/johnw/Data_Science_Libraries/Pathlib/data/file.csv')
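
A related method is `resolve`. As a rough sketch of the difference: `absolute` only prepends the current working directory, while `resolve` also collapses `..` segments and follows symlinks. The path below is made up and doesn't need to exist for this to run.

```python
from pathlib import Path

# A relative path with a redundant ".." hop in the middle.
relative = Path("data") / ".." / "data" / "file.csv"

absolute = relative.absolute()  # prepends the cwd, keeps ".." as-is
resolved = relative.resolve()   # also collapses ".." and follows symlinks

print(relative.is_absolute())  # False
print(absolute.is_absolute())  # True
print(".." in absolute.parts)  # True
print(".." in resolved.parts)  # False
```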

Lastly, if you have the misfortune of working with libraries that still require string paths, you can call `str(path)`:

In [12]:
str(Path.home())

'C:\\Users\\johnw'

Most libraries in the data stack have long supported `Path` objects, including `sklearn`, `pandas`, `matplotlib`, `seaborn`, etc.

## 2. Path attributes

`Path` objects have many useful attributes. Let’s see some examples using this path object that points to an image file.

In [13]:
image_file = Path("images/midjourney.png").absolute()

In [14]:
image_file

WindowsPath('C:/Users/johnw/Data_Science_Libraries/Pathlib/images/midjourney.png')

Let's start with the `parent`. It returns a path object that is one level up from the path itself, i.e. the directory that contains it.

In [15]:
image_file.parent

WindowsPath('C:/Users/johnw/Data_Science_Libraries/Pathlib/images')

Sometimes, you may want only the file `name` instead of the whole path. There is an attribute for that:

In [16]:
image_file.name

'midjourney.png'

There is also `stem` for the file name without the suffix:

In [17]:
image_file.stem

'midjourney'

Or the `suffix` itself with the dot for the file extension:

In [18]:
image_file.suffix

'.png'
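
Files with more than one extension behave slightly differently, so here is a short sketch with a made-up archive name. It also shows `with_suffix` and `with_name`, which build a new path without touching the disk. `PurePosixPath` is used only to keep the output the same on every system.

```python
from pathlib import PurePosixPath

archive = PurePosixPath("backups/dataset.tar.gz")

print(archive.suffix)    # .gz  (only the last extension)
print(archive.suffixes)  # ['.tar', '.gz']
print(archive.stem)      # dataset.tar

# Both calls return new path objects; the original is unchanged.
as_zip = archive.with_suffix(".zip")
renamed = archive.with_name("labels.csv")

print(as_zip)   # backups/dataset.tar.zip
print(renamed)  # backups/labels.csv
```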

If you want to divide a path into its components, you can use `parts` instead of `str.split('/')`:

In [19]:
image_file.parts

('C:\\',
 'Users',
 'johnw',
 'Data_Science_Libraries',
 'Pathlib',
 'images',
 'midjourney.png')

If you want the ancestors of a path as `Path` objects in themselves, you can use the `parents` attribute, which is an immutable sequence you can loop over:

In [20]:
for i in image_file.parents:
    print(i)

C:\Users\johnw\Data_Science_Libraries\Pathlib\images
C:\Users\johnw\Data_Science_Libraries\Pathlib
C:\Users\johnw\Data_Science_Libraries
C:\Users\johnw
C:\Users
C:\
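
Since `parents` is a sequence, you can also index it and take its length instead of looping. A small sketch with a hypothetical POSIX-style path, so the result is the same everywhere:

```python
from pathlib import PurePosixPath

image = PurePosixPath("/home/johnw/project/images/midjourney.png")
ancestors = image.parents

first = ancestors[0]        # same as image.parent
second = ancestors[1]       # two levels up
count = len(ancestors)      # number of ancestors, including the root
as_list = list(ancestors)   # materialize if you need a real list

print(first)   # /home/johnw/project/images
print(second)  # /home/johnw/project
print(count)   # 5
```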


# Working with files

To create files and write to them, you don't have to use the `open` function anymore. Just create a `Path` object and call `write_text` or `write_bytes` on it:

In [21]:
markdown = data_dir / "file.md"

In [22]:
markdown

WindowsPath('data/file.md')

In [23]:
# Create (overwrite) and write text
markdown.write_text("# This is a test markdown")

25

Or, if you already have a file, you can `read_text` or `read_bytes`:

In [24]:
markdown.read_text()

'# This is a test markdown'

In [25]:
len(image_file.read_bytes())

114790

However, note that `write_text` and `write_bytes` overwrite the existing contents of a file.

In [26]:
# Write new text to existing file
markdown.write_text("## This is a new line")

21

In [27]:
# The file is overwritten
markdown.read_text()

'## This is a new line'

To append new information to existing files, you should use the `open` method of `Path` objects in `a` (append) mode:

In [28]:
# Append text
with markdown.open(mode="a") as file:
    file.write("\n### This is the second line")

In [29]:
markdown.read_text()

'## This is a new line\n### This is the second line'
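
The same write, append, and read cycle can be sketched in a temporary directory so it leaves nothing behind. Passing `encoding="utf-8"` explicitly is my own habit rather than something the cells above require; it avoids depending on the platform's default encoding. The file name is made up.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    notes = Path(tmp) / "notes.md"

    # write_text creates (or overwrites) the file and returns the character count
    written = notes.write_text("# Title", encoding="utf-8")

    # open in append mode to add to the end instead of overwriting
    with notes.open(mode="a", encoding="utf-8") as file:
        file.write("\nsecond line")

    content = notes.read_text(encoding="utf-8")

print(written)               # 7
print(content.splitlines())  # ['# Title', 'second line']
```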

It is also common to rename files. The `rename` method accepts the destination path for the renamed file.

To create the destination path in the same directory, i.e. rename the file, you can use `with_stem` (Python 3.9+) on the existing path, which replaces the `stem` of the original file:

In [30]:
renamed_md = markdown.with_stem("new_markdown")

In [31]:
renamed_md

WindowsPath('data/new_markdown.md')

In [32]:
markdown

WindowsPath('data/file.md')

In [33]:
markdown.rename(renamed_md)

WindowsPath('data/new_markdown.md')

Above, `file.md` is turned into `new_markdown.md`.
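
To make the order of events explicit, here is a sketch of the same rename in a temporary directory. It assumes Python 3.9 or newer for `with_stem`; the file contents are made up.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    old = Path(tmp) / "file.md"
    old.write_text("# draft", encoding="utf-8")

    # with_stem only builds a new path object; nothing changes on disk yet
    new = old.with_stem("new_markdown")
    exists_before = new.exists()

    # rename does the actual work and returns the new path (Python 3.8+)
    returned = old.rename(new)

    old_exists = old.exists()
    new_exists = new.exists()

print(exists_before)           # False
print(returned.name)           # new_markdown.md
print(old_exists, new_exists)  # False True
```

Note that the original `Path` object still points at the old location after the call, so keep working with the returned path.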

Let's see the file size through `stat().st_size`:

In [34]:
# Display file size
renamed_md.stat().st_size

50

Or the last time the file was modified, which was a few seconds ago:

In [35]:
from datetime import datetime

In [36]:
modified_timestamp = renamed_md.stat().st_mtime

datetime.fromtimestamp(modified_timestamp)

datetime.datetime(2023, 4, 23, 17, 1, 5, 996965)

`st_mtime` returns a timestamp, which is the count of seconds since January 1, 1970. To make it readable, you can use the `fromtimestamp` function of `datetime`.
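
Raw byte counts get hard to read for large files, so here is a sketch that formats them. `human_size` is a hypothetical helper of mine, not part of `pathlib`, and the file is a dummy created just for the example.

```python
import tempfile
from datetime import datetime
from pathlib import Path


def human_size(num_bytes):
    """Hypothetical helper: format a byte count using 1024-based units."""
    size = float(num_bytes)
    for unit in ("B", "KB", "MB", "GB"):
        if size < 1024:
            return f"{size:.1f} {unit}"
        size /= 1024
    return f"{size:.1f} TB"


with tempfile.TemporaryDirectory() as tmp:
    blob = Path(tmp) / "blob.bin"
    blob.write_bytes(b"\x00" * 2048)  # 2048 bytes of dummy data

    info = blob.stat()  # call stat() once and reuse the result
    size_in_bytes = info.st_size
    modified = datetime.fromtimestamp(info.st_mtime)

print(size_in_bytes)              # 2048
print(human_size(size_in_bytes))  # 2.0 KB
print(modified.year >= 2023)      # True
```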

To remove unwanted files, you can `unlink` them:

In [37]:
renamed_md.unlink(missing_ok=True)

Setting `missing_ok` to `True` won't raise any alarms if the file doesn't exist.
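
For contrast, a quick sketch of what happens with and without `missing_ok` (available since Python 3.8) on a file that was never created. The file name is made up.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    ghost = Path(tmp) / "ghost.txt"  # never created

    # With missing_ok=True the call quietly does nothing
    ghost.unlink(missing_ok=True)

    # Without it, a missing file raises FileNotFoundError
    try:
        ghost.unlink()
        raised = False
    except FileNotFoundError:
        raised = True

print(raised)  # True
```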

# Working with directories

There are a few neat tricks to work with directories in Pathlib. First, let's see how to create directories recursively.

In [38]:
new_dir = (
    Path.cwd()
    / "new_dir"
    / "child_dir"
    / "grandchild_dir"
)

In [39]:
new_dir

WindowsPath('C:/Users/johnw/Data_Science_Libraries/Pathlib/new_dir/child_dir/grandchild_dir')

In [40]:
new_dir.exists()

False

The `new_dir` doesn't exist, so let's create it with all its children:

In [41]:
new_dir.mkdir(parents=True, exist_ok=True)

By default, `mkdir` creates only the last child of the given path. If the intermediate parents don't exist, you have to set `parents` to `True`. Setting `exist_ok` to `True` keeps the call from failing when the directory is already there.

To remove empty directories, you can use `rmdir`. If the given path object is nested, only the last child directory is deleted:

In [42]:
# Removes the last child directory
new_dir.rmdir()
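
The failure modes are easier to see side by side. This sketch runs in a temporary directory and reuses the directory names from above. Falling back to `shutil.rmtree` for non-empty directories is my suggestion, since `pathlib` itself only removes empty ones.

```python
import shutil
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    top = Path(tmp) / "new_dir"
    nested = top / "child_dir" / "grandchild_dir"

    # Without parents=True, missing intermediate directories are an error
    try:
        nested.mkdir()
        needs_parents = False
    except FileNotFoundError:
        needs_parents = True

    nested.mkdir(parents=True, exist_ok=True)
    nested.mkdir(parents=True, exist_ok=True)  # safe to repeat

    # rmdir refuses to delete a directory that still has contents
    try:
        top.rmdir()
        rmdir_refused = False
    except OSError:
        rmdir_refused = True

    # shutil.rmtree deletes the whole tree, so use it with care
    shutil.rmtree(top)
    top_exists = top.exists()

print(needs_parents, rmdir_refused, top_exists)  # True True False
```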

To list the contents of a directory like `ls` on the terminal, you can use `iterdir`. The result is a generator object, yielding directory contents as separate path objects one at a time:

In [43]:
for p in Path.cwd().iterdir():
    print(p)

C:\Users\johnw\Data_Science_Libraries\Pathlib\.ipynb_checkpoints
C:\Users\johnw\Data_Science_Libraries\Pathlib\data
C:\Users\johnw\Data_Science_Libraries\Pathlib\images
C:\Users\johnw\Data_Science_Libraries\Pathlib\Medium_15_Pathlib_tricks_to_master_file_system_in_Python.ipynb
C:\Users\johnw\Data_Science_Libraries\Pathlib\new_dir


To capture all files with a specific extension or a name pattern, you can use the `glob` method with a glob pattern (shell-style wildcards such as `*` and `?`, not a regular expression).

For example, below, we look for text files directly inside the parent of my current directory with `glob("*.txt")`:

In [44]:
data_science_libraries = Path.cwd().parent
text_files = list(data_science_libraries.glob("*.txt"))

len(text_files)

0

To search for text files recursively, meaning inside all child directories as well, you can use *recursive glob* with `rglob`:

In [45]:
all_text_files = [p for p in data_science_libraries.rglob("*.txt")]

len(all_text_files)

50174
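
To see exactly which files each pattern picks up, here is a sketch on a tiny made-up directory tree built in a temporary folder:

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub" / "deeper").mkdir(parents=True)
    (root / "top.txt").touch()
    (root / "table.csv").touch()
    (root / "sub" / "inner.txt").touch()
    (root / "sub" / "deeper" / "deep.txt").touch()

    shallow = sorted(p.name for p in root.glob("*.txt"))      # this level only
    one_level = sorted(p.name for p in root.glob("*/*.txt"))  # exactly one level down
    recursive = sorted(p.name for p in root.rglob("*.txt"))   # every level
    prefixed = sorted(p.name for p in root.glob("t*"))        # name pattern

print(shallow)    # ['top.txt']
print(one_level)  # ['inner.txt']
print(recursive)  # ['deep.txt', 'inner.txt', 'top.txt']
print(prefixed)   # ['table.csv', 'top.txt']
```

The results are sorted here because the order in which `glob` yields paths isn't something to rely on.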

You can also use `rglob('*')` to list directory contents recursively. It is like the supercharged version of `iterdir()`.

One of the use cases of this is counting how many files of each format appear within a directory.

To do this, we import the `Counter` class from `collections` and provide all file suffixes to it within the folder of `data_science_libraries`:

In [46]:
from collections import Counter

In [47]:
file_counts = Counter(
    path.suffix for path in data_science_libraries.rglob("*")
)

file_counts

Counter({'': 8202,
         '.md': 4,
         '.sample': 13,
         '.idx': 1,
         '.pack': 1,
         '.csv': 71,
         '.ipynb': 330,
         '.xls': 2,
         '.pkl': 164,
         '.py': 16,
         '.zip': 1,
         '.png': 895,
         '.css': 1,
         '.pbtxt': 2,
         '.config': 1,
         '.txt': 50174,
         '.data-00000-of-00001': 2,
         '.index': 2,
         '.pb': 1,
         '.html': 24,
         '.pyc': 5,
         '.pth': 19,
         '.log': 7,
         '.pt': 15,
         '.json': 16,
         '.tsv': 14,
         '.tfevents': 6,
         '.jpg': 21379,
         '.ply': 2,
         '.xml': 2,
         '.feather': 3,
         '.parquet': 3,
         '.pickle': 2,
         '.tmp': 3,
         '.yaml': 449,
         '.runName': 172,
         '.name': 172,
         '.type': 172,
         '.user': 172,
         '.history': 153,
         '.parentRunId': 159,
         '.gz': 6,
         '.mat': 20,
         '.meta': 1,
         '.0': 2,
   
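
The large count under the empty suffix `''` includes directories as well as files without an extension, because `rglob("*")` yields both. A sketch of one way to tighten the count, on a made-up tree: keep only files and lowercase the suffix so `.png` and `.PNG` are counted together.

```python
import tempfile
from collections import Counter
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "images").mkdir()
    (root / "images" / "a.png").touch()
    (root / "images" / "b.PNG").touch()
    (root / "README").touch()
    (root / "data.csv").touch()

    # Counts directories too, and treats .png and .PNG as different formats
    everything = Counter(p.suffix for p in root.rglob("*"))

    # Files only, with case-insensitive suffixes
    files_only = Counter(
        p.suffix.lower() for p in root.rglob("*") if p.is_file()
    )

print(everything[""])             # 2  (the images folder and README)
print(files_only[""])             # 1  (just README)
print(files_only.most_common(1))  # [('.png', 2)]
```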

# Operating system differences

Sorry, but we have to talk about this nightmare of an issue.

Up until now, we have been dealing with `WindowsPath` objects, which are the default for Windows systems:

In [48]:
type(Path.home())

pathlib.WindowsPath

If you were on a UNIX-like system, you would get a `PosixPath` object instead. Trying to create one on Windows fails:

In [49]:
from pathlib import PosixPath

# Use raw strings that start with r to write Windows paths
path = PosixPath(r"C:\users")
path

NotImplementedError: cannot instantiate 'PosixPath' on your system

Instantiating another system's path raises an error like the above.

But what if you were forced to work with paths from another system, like code written by coworkers who use UNIX?

As a solution, `pathlib` offers pure path objects like `PureWindowsPath` or `PurePosixPath`:

In [50]:
from pathlib import PurePosixPath, PureWindowsPath

path = PurePosixPath(r"C:\users")
path

PurePosixPath('C:\\users')

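These are primitive path objects. You have access to the methods and attributes that only manipulate the path itself, but nothing that touches the file system, such as `exists` or `read_text`. Also note that `PurePosixPath` treats the backslash as an ordinary character, so `C:\users` is a single component here, which explains the `parent` and `stem` results below: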

In [51]:
path / "johnw"

PurePosixPath('C:\\users/johnw')

In [52]:
path.parent

PurePosixPath('.')

In [53]:
path.stem

'C:\\users'
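
For comparison, here is a sketch of matching each pure class to its own flavor of path. Unlike `WindowsPath` and `PosixPath`, both pure classes can be instantiated on any operating system. The paths are made up for illustration.

```python
from pathlib import PurePosixPath, PureWindowsPath

# A Windows-style path, parsed correctly on any OS
win = PureWindowsPath(r"C:\users\johnw\notes.txt")

print(win.drive)       # C:
print(win.parts)       # ('C:\\', 'users', 'johnw', 'notes.txt')
print(win.parent)      # C:\users\johnw
print(win.stem)        # notes
print(win.as_posix())  # C:/users/johnw/notes.txt

# A UNIX-style path, also parsed correctly on any OS
posix = PurePosixPath("/home/johnw/notes.txt")

print(posix.parts)   # ('/', 'home', 'johnw', 'notes.txt')
print(posix.parent)  # /home/johnw
```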