# 8. Handling paths and files

Getting a list of all the `files and folders in a directory` is a natural first step for many file-related operations in Python.

In this section, we will learn how to use the most general-purpose techniques in the **pathlib** module to list items in a directory. Before `pathlib` was added in Python 3.4, you would work with file paths through the `os` module (mainly `os.path`). That approach performs well, but it forces you to handle every path as a plain string.

If you only work on one OS, handling paths as strings is fine. But once you bring multiple operating systems into the mix, you need to write a fair amount of string-manipulation code to adapt the paths to each OS.
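To make the contrast concrete, here is a minimal sketch that builds the same path with `os.path.join` and with pathlib. The directory and file names are hypothetical. It uses `PurePosixPath` and `PureWindowsPath`, which never touch the file system, so the output is the same on any OS.

```python
import os.path
from pathlib import PurePosixPath, PureWindowsPath

# String-based approach: the result is a plain str, and the separator
# depends on the OS the code runs on.
joined_str = os.path.join("project", "data", "diabetes.csv")

# pathlib approach: the same parts, rendered with each OS's conventions.
posix_path = PurePosixPath("project", "data", "diabetes.csv")
windows_path = PureWindowsPath("project", "data", "diabetes.csv")

print(type(joined_str).__name__)  # str
print(posix_path)                 # project/data/diabetes.csv
print(windows_path)               # project\data\diabetes.csv
```

In everyday code you would normally use `pathlib.Path`, which picks the right flavour for the OS it runs on.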


In [1]:
import pathlib

## 8.1 Path object

As a first step, pathlib converts a string path into a `Path` object, whose concrete type depends on your operating system (OS). On Windows, you get a `WindowsPath` object, while Linux and macOS return a `PosixPath`.

```python
import pathlib

# On Windows
desktop = pathlib.Path("C:/Users/toto/Desktop")
desktop
# output
# WindowsPath('C:/Users/toto/Desktop')

# On Linux/macOS
desktop = pathlib.Path("/home/toto/Desktop")
desktop
# output
# PosixPath('/home/toto/Desktop')
```


An example of a Path object:

In [5]:
path_str="/home/pliu/git/RecetteConstance"
root_path=pathlib.Path(path_str)

raw_data_folder_path=f"{path_str}/data"
data_folder_path=pathlib.Path(raw_data_folder_path)

In [6]:
data_folder_path

PosixPath('/home/pliu/git/RecetteConstance/data')
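The cell above builds the sub-path with an f-string. Path objects also support the `/` operator for joining segments, which avoids manual string formatting. The sketch below uses `PurePosixPath` so it prints the same thing on any OS, and the file name `diabetes.csv` is only used for illustration.

```python
from pathlib import PurePosixPath

# PurePosixPath keeps the output identical on every OS; with a real
# directory you would use pathlib.Path instead.
root = PurePosixPath("/home/pliu/git/RecetteConstance")

# The / operator joins path segments.
data_folder = root / "data"
csv_file = data_folder / "diabetes.csv"

print(data_folder)      # /home/pliu/git/RecetteConstance/data
print(csv_file.name)    # diabetes.csv
print(csv_file.stem)    # diabetes
print(csv_file.suffix)  # .csv
print(csv_file.parent)  # /home/pliu/git/RecetteConstance/data
```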

## 8.2 List the contents of a path (directory)

If you only need to list the contents of a given directory, and you don’t need to get the contents of each subdirectory too, then you can use the **Path object’s .iterdir()** method.

In [7]:
file_list=list(data_folder_path.iterdir())
for file in file_list:
    print(f"file path: {file}")


file path: /home/pliu/git/RecetteConstance/data/dia_gen_2019.csv
file path: /home/pliu/git/RecetteConstance/data/Descripteur_CONSTANCES_Extraction2014.csv
file path: /home/pliu/git/RecetteConstance/data/diabetes.csv
file path: /home/pliu/git/RecetteConstance/data/dia_gen_2018.csv
file path: /home/pliu/git/RecetteConstance/data/diabetes_profile_report.html
file path: /home/pliu/git/RecetteConstance/data/descriptor.json
file path: /home/pliu/git/RecetteConstance/data/dia_gen_2020.csv
file path: /home/pliu/git/RecetteConstance/data/updated_descriptor.json


Note that `.iterdir()` returns an iterator of Path objects. Each Path object has an `is_dir()` method that tells you whether it is a directory or not.

In [8]:
for item in root_path.iterdir():
    print(f"{item} - {'dir' if item.is_dir() else 'file'}")

/home/pliu/git/RecetteConstance/pyproject.toml - file
/home/pliu/git/RecetteConstance/.idea - dir
/home/pliu/git/RecetteConstance/data - dir
/home/pliu/git/RecetteConstance/README.md - file
/home/pliu/git/RecetteConstance/src - dir
/home/pliu/git/RecetteConstance/LICENSE - file
/home/pliu/git/RecetteConstance/.git - dir
/home/pliu/git/RecetteConstance/notebooks - dir
/home/pliu/git/RecetteConstance/.gitignore - file
/home/pliu/git/RecetteConstance/poetry.lock - file


With the `.is_dir()` method, we can also return only the files or only the directories of a path:

In [16]:
def get_all_files_of_path(path_obj):
    return [item for item in path_obj.iterdir() if not item.is_dir()]

def get_all_dirs_of_path(path_obj):
    return [item for item in path_obj.iterdir() if item.is_dir()]


In [17]:
file_list=get_all_files_of_path(root_path)
print(file_list)

[PosixPath('/home/pliu/git/RecetteConstance/pyproject.toml'), PosixPath('/home/pliu/git/RecetteConstance/README.md'), PosixPath('/home/pliu/git/RecetteConstance/LICENSE'), PosixPath('/home/pliu/git/RecetteConstance/.gitignore'), PosixPath('/home/pliu/git/RecetteConstance/poetry.lock')]


In [18]:
dir_list=get_all_dirs_of_path(root_path)
print(dir_list)

[PosixPath('/home/pliu/git/RecetteConstance/.idea'), PosixPath('/home/pliu/git/RecetteConstance/data'), PosixPath('/home/pliu/git/RecetteConstance/src'), PosixPath('/home/pliu/git/RecetteConstance/.git'), PosixPath('/home/pliu/git/RecetteConstance/notebooks')]
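Note that `.iterdir()` yields children in arbitrary order, as the listings above suggest. If you need a stable order, you can sort the results. The following is a small sketch on a temporary directory with a hypothetical layout, not on the project shown above.

```python
import pathlib
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    base = pathlib.Path(tmp)
    # Hypothetical layout: two files and one subdirectory.
    (base / "b.txt").touch()
    (base / "a.txt").touch()
    (base / "sub").mkdir()

    # Sort the names so the result does not depend on file system order.
    files = sorted(p.name for p in base.iterdir() if p.is_file())
    dirs = sorted(p.name for p in base.iterdir() if p.is_dir())

print(files)  # ['a.txt', 'b.txt']
print(dirs)   # ['sub']
```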


## 8.3 Recursively Listing With .rglob()

To recursively list the items in a directory means to list not only the directory’s contents, but also the contents of the subdirectories, their subdirectories, and so on. With pathlib, you can use **.rglob()** with the `"*"` pattern to return every file and folder under a path.



In [20]:
# It returns a generator by default, you can convert it to a list
root_path.rglob("*")

<generator object Path.rglob at 0x7f0024febe40>

In [21]:
list(root_path.rglob("*"))

[PosixPath('/home/pliu/git/RecetteConstance/pyproject.toml'),
 PosixPath('/home/pliu/git/RecetteConstance/.idea'),
 PosixPath('/home/pliu/git/RecetteConstance/data'),
 PosixPath('/home/pliu/git/RecetteConstance/README.md'),
 PosixPath('/home/pliu/git/RecetteConstance/src'),
 PosixPath('/home/pliu/git/RecetteConstance/LICENSE'),
 PosixPath('/home/pliu/git/RecetteConstance/.git'),
 PosixPath('/home/pliu/git/RecetteConstance/notebooks'),
 PosixPath('/home/pliu/git/RecetteConstance/.gitignore'),
 PosixPath('/home/pliu/git/RecetteConstance/poetry.lock'),
 PosixPath('/home/pliu/git/RecetteConstance/.idea/inspectionProfiles'),
 PosixPath('/home/pliu/git/RecetteConstance/.idea/vcs.xml'),
 PosixPath('/home/pliu/git/RecetteConstance/.idea/RecetteConstance.iml'),
 PosixPath('/home/pliu/git/RecetteConstance/.idea/workspace.xml'),
 PosixPath('/home/pliu/git/RecetteConstance/.idea/misc.xml'),
 PosixPath('/home/pliu/git/RecetteConstance/.idea/.gitignore'),
 PosixPath('/home/pliu/git/RecetteConstance/

The `.rglob() method with "*"` as an argument produces a generator that yields all the files and folders from the Path object recursively.

## 8.4 Filter returned content with glob patterns

Sometimes you don’t want all the files. So you need to filter the result with a certain pattern of characters in their name.

Pathlib provides the `.rglob()` and `.glob()` methods. Both make use of [glob patterns](https://en.wikipedia.org/wiki/Glob_(programming)), which are not regular expressions: they use a small set of wildcard characters to match names. For example, the single asterisk `*` matches everything in the directory.

There are many glob patterns that you can take advantage of. Check out the following selection of glob patterns for some ideas:

| Glob Pattern      | Matches                                                                                                                        |
|-------------------|--------------------------------------------------------------------------------------------------------------------------------|
| `*`               | Every item                                                                                                                     |
| `*.txt`           | Every item ending in .txt, such as notes.txt or hello.txt                                                                      |
| `??????`          | Every item whose name is six characters long, such as 01.txt, A-01.c, or .zshrc                                                |
| `A*`              | Every item that starts with the character A, such as Album, A.txt, or AppData                                                  |
| `[abc][abc][abc]` | Every item whose name is three characters long and composed only of the characters a, b, and c, such as abc, aaa, or cba       |

With these patterns, you can flexibly match many types of files. Check out the documentation on [fnmatch](https://docs.python.org/3/library/fnmatch.html#module-fnmatch), which is the underlying module governing the behavior of .glob(), to get a feel for the other patterns that you can use in Python.

Note that on **Windows, glob patterns are case-insensitive**, because paths are case-insensitive in general. On Unix-like systems like **Linux and macOS, glob patterns are case-sensitive**.
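The patterns in the table can be tried out without touching the file system. The sketch below checks a hypothetical list of names with `fnmatch.fnmatchcase`, the case-sensitive matcher from the fnmatch module mentioned above. It is only meant to illustrate the pattern syntax, and it assumes that matching a single name this way mirrors what `.glob()` does for one path segment.

```python
from fnmatch import fnmatchcase

# Hypothetical file and folder names.
names = ["notes.txt", "01.txt", ".zshrc", "Album", "abc", "abd"]

def matching(pattern):
    """Return the names that match the glob pattern, case-sensitively."""
    return [n for n in names if fnmatchcase(n, pattern)]

print(matching("*.txt"))            # ['notes.txt', '01.txt']
print(matching("??????"))           # ['01.txt', '.zshrc']
print(matching("A*"))               # ['Album']
print(matching("[abc][abc][abc]"))  # ['abc']
```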

### 8.4.1 Some examples of glob

The `.glob()` method of a Path object behaves in much the same way as `.rglob()`, but `without recursion`. If you pass the `"*"` argument, it produces a `generator that yields all the items in the directory` that’s represented by the Path object, without going into the subdirectories. In this way, it yields the same items as `.iterdir()`, and you can use the resulting generator in a for loop or a comprehension, just as you would with `.iterdir()`.

In [22]:
list(root_path.glob("*"))

[PosixPath('/home/pliu/git/RecetteConstance/pyproject.toml'),
 PosixPath('/home/pliu/git/RecetteConstance/.idea'),
 PosixPath('/home/pliu/git/RecetteConstance/data'),
 PosixPath('/home/pliu/git/RecetteConstance/README.md'),
 PosixPath('/home/pliu/git/RecetteConstance/src'),
 PosixPath('/home/pliu/git/RecetteConstance/LICENSE'),
 PosixPath('/home/pliu/git/RecetteConstance/.git'),
 PosixPath('/home/pliu/git/RecetteConstance/notebooks'),
 PosixPath('/home/pliu/git/RecetteConstance/.gitignore'),
 PosixPath('/home/pliu/git/RecetteConstance/poetry.lock')]

In [23]:
# list items ending with .md
list(root_path.glob("*.md"))

[PosixPath('/home/pliu/git/RecetteConstance/README.md')]

### 8.4.2 Example of rglob

rglob is just the recursive version of glob.

In [16]:
# find all csv files under the root path recursively
list(root_path.rglob("*.csv"))

[PosixPath('/home/pengfei/git/RecetteConstance/data/dia_gen_2019.csv'),
 PosixPath('/home/pengfei/git/RecetteConstance/data/dia_gen_2020.csv'),
 PosixPath('/home/pengfei/git/RecetteConstance/data/Descripteur_CONSTANCES_Extraction2014.csv'),
 PosixPath('/home/pengfei/git/RecetteConstance/data/diabetes.csv'),
 PosixPath('/home/pengfei/git/RecetteConstance/data/dia_gen_2018.csv')]

In [24]:
# find all csv files directly under the root path (no recursion)
list(root_path.glob("*.csv"))

[]

In [25]:
# You can actually use .glob() and get it to behave in the same way as .rglob() by adjusting the glob pattern
list(root_path.glob("**/*.csv"))

[PosixPath('/home/pliu/git/RecetteConstance/data/dia_gen_2019.csv'),
 PosixPath('/home/pliu/git/RecetteConstance/data/Descripteur_CONSTANCES_Extraction2014.csv'),
 PosixPath('/home/pliu/git/RecetteConstance/data/diabetes.csv'),
 PosixPath('/home/pliu/git/RecetteConstance/data/dia_gen_2018.csv'),
 PosixPath('/home/pliu/git/RecetteConstance/data/dia_gen_2020.csv')]

So we can conclude that `.glob("**/*.csv")` is equivalent to `.rglob("*.csv")`. Likewise, a call to `.glob("**/*")` is equivalent to `.rglob("*")`.
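This equivalence can be checked on a small throwaway tree. The sketch below builds a temporary directory with hypothetical file names and compares the three calls. Paths are converted to relative POSIX-style strings and sorted so the comparison does not depend on iteration order.

```python
import pathlib
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    base = pathlib.Path(tmp)
    # Hypothetical layout: one Markdown file at the top, one in a subfolder.
    (base / "data").mkdir()
    (base / "README.md").touch()
    (base / "data" / "a.csv").touch()
    (base / "data" / "notes.md").touch()

    def rel(paths):
        """Sorted paths relative to base, as POSIX-style strings."""
        return sorted(p.relative_to(base).as_posix() for p in paths)

    recursive = rel(base.rglob("*.md"))
    with_glob = rel(base.glob("**/*.md"))
    top_level = rel(base.glob("*.md"))

print(recursive)               # ['README.md', 'data/notes.md']
print(recursive == with_glob)  # True
print(top_level)               # ['README.md']
```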

## 8.5 Advanced filtering With the Glob Methods

**One of the potential drawbacks with the glob methods is that you can only select files based on glob patterns**. If you want to do more advanced matching or filter on the attributes of the item, then you need to reach for something extra.

To run more complex matching and filtering, you can follow at least three strategies. You can use:

- A for loop with a conditional check
- A comprehension with a conditional expression
- The built-in filter() function

### 8.5.1 Loop with conditional check

In [26]:
for item in root_path.rglob("*"):
    if item.is_file() and str(item).endswith(".csv"):
        print(item)

/home/pliu/git/RecetteConstance/data/dia_gen_2019.csv
/home/pliu/git/RecetteConstance/data/Descripteur_CONSTANCES_Extraction2014.csv
/home/pliu/git/RecetteConstance/data/diabetes.csv
/home/pliu/git/RecetteConstance/data/dia_gen_2018.csv
/home/pliu/git/RecetteConstance/data/dia_gen_2020.csv


### 8.5.2 Comprehension with a conditional expression

In [27]:
[item for item in root_path.rglob("*") if item.is_file() and str(item).endswith(".csv")]

[PosixPath('/home/pliu/git/RecetteConstance/data/dia_gen_2019.csv'),
 PosixPath('/home/pliu/git/RecetteConstance/data/Descripteur_CONSTANCES_Extraction2014.csv'),
 PosixPath('/home/pliu/git/RecetteConstance/data/diabetes.csv'),
 PosixPath('/home/pliu/git/RecetteConstance/data/dia_gen_2018.csv'),
 PosixPath('/home/pliu/git/RecetteConstance/data/dia_gen_2020.csv')]

### 8.5.3 Built-in filter() function

In [28]:
list(filter(lambda item: item.is_file() and str(item).endswith(".csv"), root_path.rglob("*")))

[PosixPath('/home/pliu/git/RecetteConstance/data/dia_gen_2019.csv'),
 PosixPath('/home/pliu/git/RecetteConstance/data/Descripteur_CONSTANCES_Extraction2014.csv'),
 PosixPath('/home/pliu/git/RecetteConstance/data/diabetes.csv'),
 PosixPath('/home/pliu/git/RecetteConstance/data/dia_gen_2018.csv'),
 PosixPath('/home/pliu/git/RecetteConstance/data/dia_gen_2020.csv')]
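The three examples above filter on the file name, which a glob pattern could also do. A condition on an attribute of the item, such as its size, is something a glob pattern cannot express. Here is a minimal sketch, assuming a hypothetical helper `files_larger_than` and a temporary directory with made-up files:

```python
import pathlib
import tempfile

def files_larger_than(root, min_bytes):
    """Yield files under root whose size is at least min_bytes."""
    for item in root.rglob("*"):
        # stat().st_size is the file size in bytes.
        if item.is_file() and item.stat().st_size >= min_bytes:
            yield item

with tempfile.TemporaryDirectory() as tmp:
    base = pathlib.Path(tmp)
    (base / "small.csv").write_text("a,b\n")        # a few bytes
    (base / "big.csv").write_text("a,b\n" * 100)    # a few hundred bytes
    big_files = [p.name for p in files_larger_than(base, 100)]

print(big_files)  # ['big.csv']
```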

The glob methods are extremely versatile, but for large directory trees, they can be a bit slow. In the next section, you’ll be examining an example in which reaching for more controlled iteration with .iterdir() may be a better choice.

## 8.6 Opting Out of Listing Junk Directories

Imagine you want to find all the files on your system, but some subdirectories contain lots and lots of subdirectories and files. Some of the largest `subdirectories hold temporary files` that you aren’t interested in.

### 8.6.1 Attempt with rglob

We want to skip some directories that only contain junk files. We can check whether two iterables have an item in common by taking advantage of [sets](https://realpython.com/python-sets/). If you cast one of the iterables to a set, then you can use the `.isdisjoint()` method to determine whether they have any elements in common.

In [29]:
{"documents", "notes", "find_me.txt"}.isdisjoint({"temp", "temporary"})

True

In [None]:
{"documents", "temp", "find_me.txt"}.isdisjoint({"temp", "temporary"})

If the two sets have no elements in common, then .isdisjoint() returns True. If the two sets have at least one element in common, then .isdisjoint() returns False. You can incorporate this check into a for loop that goes over all the items returned by .rglob("*")

In [32]:
SKIP_DIRS = [".idea", ".git", "src", "data"]
for item in root_path.rglob("*"):
    if set(item.parts).isdisjoint(SKIP_DIRS):
        print(item)

/home/pliu/git/RecetteConstance/pyproject.toml
/home/pliu/git/RecetteConstance/README.md
/home/pliu/git/RecetteConstance/LICENSE
/home/pliu/git/RecetteConstance/notebooks
/home/pliu/git/RecetteConstance/.gitignore
/home/pliu/git/RecetteConstance/poetry.lock
/home/pliu/git/RecetteConstance/notebooks/03.Check_file_number_and_names.ipynb
/home/pliu/git/RecetteConstance/notebooks/01.check_data_format.ipynb
/home/pliu/git/RecetteConstance/notebooks/02.EDA_on_descriptor.ipynb
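The check above relies on `.parts`, which splits a path into a tuple of its components. The following sketch shows what that tuple looks like and how `.isdisjoint()` behaves on it. The paths and the `SKIP` set are hypothetical, and `PurePosixPath` is used so the output is the same on any OS.

```python
from pathlib import PurePosixPath

SKIP = {"temp", "temporary"}  # hypothetical names to skip

kept = PurePosixPath("/home/toto/documents/find_me.txt")
skipped = PurePosixPath("/home/toto/temp/cache/junk.txt")

print(kept.parts)  # ('/', 'home', 'toto', 'documents', 'find_me.txt')
print(set(kept.parts).isdisjoint(SKIP))     # True: no component is in SKIP
print(set(skipped.parts).isdisjoint(SKIP))  # False: 'temp' is a component
```

Because every component is compared, including those of the root path itself, a root located under a folder named like one of the skipped entries would filter out everything.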


You can definitely filter out whole folders with .rglob(), but you can’t get away from the fact that the resulting generator will yield all the items and then filter out the undesirable ones, one by one. This can make the glob methods very slow, depending on your use case. That’s why you might opt for a recursive .iterdir() function, which you’ll explore next.

### 8.6.2 Recursive .iterdir() Function

In the example of junk directories, you ideally want the ability to opt out of iterating over all the files in a given subdirectory if they match one of the names in SKIP_DIRS:

In [33]:
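def get_all_items(root: pathlib.Path, exclude=SKIP_DIRS):
    for item in root.iterdir():
        # skip excluded names entirely, including everything below them
        if item.name in exclude:
            continue
        yield item
        if item.is_dir():
            # pass exclude along so a custom list also applies to subdirectories
            yield from get_all_items(item, exclude)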

This function uses SKIP_DIRS, the list of directory names to ignore that was defined in the previous cell, as the default value of `exclude`. It is a generator function that uses .iterdir() to go over each item.

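The generator function uses the type annotation `: pathlib.Path` after the first argument to signal that it expects a Path object rather than a string that represents a path. Python does not enforce the annotation, but passing a plain string would fail as soon as the function calls `.iterdir()` on it.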

If the item name is in the exclude list, then you just move on to the next item, skipping the whole subdirectory tree in one go.

If the item isn’t in the list, then you yield the item, and if it’s a directory, you invoke the function again on that directory. That is, within the function body, the function conditionally invokes the same function again. This is a hallmark of a recursive function.

This recursive function efficiently yields all the files and directories that you want, excluding all that you aren’t interested in:

In [35]:
all_item=get_all_items(root_path,SKIP_DIRS)
print(list(all_item))

[PosixPath('/home/pliu/git/RecetteConstance/pyproject.toml'), PosixPath('/home/pliu/git/RecetteConstance/README.md'), PosixPath('/home/pliu/git/RecetteConstance/LICENSE'), PosixPath('/home/pliu/git/RecetteConstance/notebooks'), PosixPath('/home/pliu/git/RecetteConstance/notebooks/03.Check_file_number_and_names.ipynb'), PosixPath('/home/pliu/git/RecetteConstance/notebooks/01.check_data_format.ipynb'), PosixPath('/home/pliu/git/RecetteConstance/notebooks/02.EDA_on_descriptor.ipynb'), PosixPath('/home/pliu/git/RecetteConstance/.gitignore'), PosixPath('/home/pliu/git/RecetteConstance/poetry.lock')]


Crucially, you’ve managed to opt out of having to examine all the files in the undesired directories. Once your generator identifies that the directory is in the SKIP_DIRS list, it just skips the whole thing.
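To see the skipping behaviour in isolation, here is a self-contained sketch of the same idea on a temporary tree. The function name `walk_skipping` and the folder names are hypothetical. The junk directory `temp` contains a nested file that is never visited, because the generator does not descend into it.

```python
import pathlib
import tempfile

def walk_skipping(root, exclude):
    """Yield items under root, never descending into excluded names."""
    for item in root.iterdir():
        if item.name in exclude:
            continue  # the whole subtree below this item is skipped
        yield item
        if item.is_dir():
            yield from walk_skipping(item, exclude)

with tempfile.TemporaryDirectory() as tmp:
    base = pathlib.Path(tmp)
    # Hypothetical layout: one junk tree and one folder of interest.
    (base / "temp" / "deep").mkdir(parents=True)
    (base / "temp" / "deep" / "junk.txt").touch()
    (base / "notes").mkdir()
    (base / "notes" / "find_me.txt").touch()

    found = sorted(
        p.relative_to(base).as_posix() for p in walk_skipping(base, {"temp"})
    )

print(found)  # ['notes', 'notes/find_me.txt']
```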

So, in this case, using .iterdir() is going to be far more efficient than the equivalent glob methods.

In fact, you’ll find that .iterdir() is generally more efficient than the glob methods if you need to filter on anything more complex than can be achieved with a glob pattern. However, if all you need to do is to get a list of all the .txt files recursively, then the glob methods will be faster.