# Organizing Files

Description: This notebook demonstrates common file system operations.

Useful prior knowledge
* Familiarity with Python fundamentals: loops, if statements, string operations, data structures (lists, dictionaries), and exception handling (try-except)
* See chapters 1-6 of [https://automatetheboringstuff.com](https://automatetheboringstuff.com)

### Outline

[File System Path Handling](#File-System-Path-Handling)
* os module - operating system utilities
* pathlib - object-oriented file system paths

[Reading and writing files](#Reading-and-writing-files)
* Reading and writing files
* Reading large files
* Note on with statement (a.k.a., context manager)

[More file operations](#More-file-operations)
* Renaming a file
* Copying a file
* Moving a file
* Removing files with os.remove(path)

[Pattern Matching](#Pattern-Matching)
* sample of string operations
* regular expressions with re module

[More directory operations](#More-directory-operations)
* make a directory
* copy a directory tree
* remove a directory tree

[Inventory a directory](#Inventory-a-directory)
* Display contents of a directory (non-recursive, non-inclusive of subdirectories) using os.listdir()
* Display contents (including subdirectories) with os.walk()
* Get file statistics
* Display inventory with pandas and prep for export (CSV, Excel)
* Pulling it all together. Walk a directory and collect the file statistics.

[Working with ZIP Files](#Working-with-ZIP-Files)
* Create and add to ZIP file
* Read contents and file info
* Extract contents with ZipFile.extractall() or ZipFile.extract()

### Tips
* Use backups. It's easy to miss subtle errors and irreversibly move or rename files in a way that may be impossible to undo.
* Use print() statements to test a dry run first.
* Notepad++ for searching across files and directories
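
The dry-run tip can be made concrete with a small sketch. The helper names `plan_renames` and `apply_renames` are hypothetical (they are not part of this notebook or of any library), and the demo runs inside a temporary directory so that no real files are touched. The idea is to compute the planned changes first, print them, and only apply them once the printed plan looks right.

```python
import tempfile
from pathlib import Path


def plan_renames(directory: Path, old_suffix: str, new_suffix: str):
    """Return (source, target) pairs without touching the file system."""
    return [(p, p.with_suffix(new_suffix))
            for p in sorted(directory.iterdir())
            if p.is_file() and p.suffix == old_suffix]


def apply_renames(plan, dry_run=True):
    """Print the plan when dry_run is True; otherwise perform the renames."""
    for src, dst in plan:
        if dry_run:
            print(f'[dry run] {src.name} -> {dst.name}')
        else:
            src.rename(dst)


# demo in a throwaway directory so nothing real is modified
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    (tmp_dir / 'notes.txt').write_text('hello')

    plan = plan_renames(tmp_dir, '.txt', '.md')

    apply_renames(plan, dry_run=True)     # prints only, changes nothing
    names_after_dry_run = sorted(p.name for p in tmp_dir.iterdir())

    apply_renames(plan, dry_run=False)    # actually renames
    names_after_rename = sorted(p.name for p in tmp_dir.iterdir())
```

Keeping the planning step separate from the step that changes files makes it easy to review the plan, or save it to a log, before committing to it.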

### Further Reading
* https://automatetheboringstuff.com
    * Chapter 7 – Pattern Matching with Regular Expressions
    * Chapter 9 – Reading and Writing Files
    * Chapter 10 – Organizing Files
* Python standard library documentation
    * https://docs.python.org/3/library/index.html
    * Use `help(<object_name>)` in Python or search `python docs <object_name>` for detailed documentation.
    * E.g., `>>> help(os.walk)`, `>>> help(shutil.copy2)`.


In [1]:
from pathlib import Path
import os
import re
import shutil

import pandas as pd                 # $ python -m pip install pandas
from file_utils import file_utils   # imports ./file_utils/file_utils.py

# full option names avoid ambiguous pattern matches in newer pandas versions
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_rows', 1000)

### File System Path Handling

In [2]:
# backslash is an escape character in many languages but on Windows OS, it is the path separator 

# use double backslash to represent a single backslash in a string
filepath = 'c:\\users\\user\\file.txt'
print(filepath)

# or use raw string prefix (r'')
filepath = r'c:\users\user\file.txt'
print(filepath)

c:\users\user\file.txt
c:\users\user\file.txt


In [3]:
# manually splitting strings into components
filepath.split('\\')

['c:', 'users', 'user', 'file.txt']

In [4]:
# split a string into a list on a delimiter
print('file.txt'.split('.'))


# assign filename to variable
filename = 'file.txt'.split('.')[0]
print(filename)

['file', 'txt']
file


The `os` and `pathlib` modules cover many common utilities in a more readable and less error-prone manner.
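
As a small illustration of why the library helpers are less error-prone, consider a file name that contains more than one dot. The file name below is made up for the example; `os.path.splitext`, `PurePath.stem`, and `PurePath.suffix` are standard library calls.

```python
import os
from pathlib import PurePath

name = 'report.v2.final.txt'   # illustrative file name with several dots

# naive split keeps only the text before the FIRST dot
naive_stem = name.split('.')[0]

# os.path.splitext splits on the LAST dot only
stem, ext = os.path.splitext(name)

# pathlib exposes the same pieces as attributes
p = PurePath(name)

print(naive_stem)          # report
print(stem, ext)           # report.v2.final .txt
print(p.stem, p.suffix)    # report.v2.final .txt
```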

### `os` module - operating system utilities

In [5]:
filepath = r'c:\users\user\file.txt'
basename = os.path.basename(filepath)
dirname = os.path.dirname(filepath)
full_path = os.path.join(r'c:\users\user', 'file.txt')

print('filepath:', filepath)
print('basename:', basename)
print('dirname:', dirname)
print('full_path:', full_path)

filepath: c:\users\user\file.txt
basename: file.txt
dirname: c:\users\user
full_path: c:\users\user\file.txt


In [6]:
# getcwd() -> get Current Working Directory
# chdir() -> change directory

print(os.getcwd())
os.chdir('..')

print(os.getcwd())
os.chdir('./demo-organizing-files')

print(os.getcwd())

C:\Users\pzuradzki\Code\_unordered\demo-organizing-files
C:\Users\pzuradzki\Code\_unordered
C:\Users\pzuradzki\Code\_unordered\demo-organizing-files


### `pathlib`- object-oriented file system paths

* Treat paths like custom objects/data structures rather than strings or outputs of functions.
* Operator overloading of `/` division operator to represent path separator
* OS-agnostic path objects that ensure the correct path format: PosixPath on Mac/Linux, WindowsPath on Windows.
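
To see both path flavors side by side on any machine, the "pure" path classes can be used. They only manipulate path strings and never touch the file system, so this sketch runs the same on Windows, Mac, and Linux. The directory names are illustrative.

```python
from pathlib import PurePosixPath, PureWindowsPath

# the same '/' operator builds paths in either flavor
win = PureWindowsPath(r'c:\users\user') / 'subdirectory' / 'files.txt'
posix = PurePosixPath('/home/user') / 'subdirectory' / 'files.txt'

print(win)              # c:\users\user\subdirectory\files.txt
print(posix)            # /home/user/subdirectory/files.txt

# .parts breaks a path into its components
print(win.parts)
print(posix.parts)

# as_posix() renders a Windows path with forward slashes
print(win.as_posix())   # c:/users/user/subdirectory/files.txt
```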

In [7]:
from pathlib import Path

filepath = Path(r'c:\users\user\files.txt')
filepath

WindowsPath('c:/users/user/files.txt')

In [8]:
filepath = Path(os.getcwd()) / 'file_utils.py'

print(filepath)
print(repr(filepath))
print(filepath.name)
print(filepath.parent)
print(filepath.is_dir())
print(filepath.is_file())

C:\Users\pzuradzki\Code\_unordered\demo-organizing-files\file_utils.py
WindowsPath('C:/Users/pzuradzki/Code/_unordered/demo-organizing-files/file_utils.py')
file_utils.py
C:\Users\pzuradzki\Code\_unordered\demo-organizing-files
False
False


In [9]:
# operator overloading with '/' symbol
directory = Path(r'c:\users\user')
filepath = directory / 'subdirectory/files.txt'
print(filepath)

c:\users\user\subdirectory\files.txt


### Reading and writing files
* Reading and writing files
* Reading large files
* Note on with statement (a.k.a., context manager)

### Reading and writing files

In [10]:
with open('test_file.txt', 'w') as f:
    f.write('qty,price\n1,2\n3,4\n')

In [11]:
with open('test_file.txt', 'a') as f:
    f.write('4,5\n6,7')

In [12]:
with open('test_file.txt', 'r') as f:
    data_as_str = f.read()
        
with open('test_file.txt', 'r') as f:
    data_as_list = f.readlines()

print(data_as_str)
print()
print(data_as_list)

qty,price
1,2
3,4
4,5
6,7

['qty,price\n', '1,2\n', '3,4\n', '4,5\n', '6,7']


In [13]:
# in practice, using the built-in csv module or pandas is less error-prone than manually writing strings to a file
# these libraries take care of many subtle complications like quotes, escape characters, data types (1 vs '1'), and performance (pandas)

import csv

with open('test_csv.csv', 'w', newline='') as f:
    writer = csv.writer(f, delimiter=',')
    data = [['qty','price'], [1,2], [3,4]]
    for row in data:
        writer.writerow(row)

In [14]:
with open('test_csv.csv', 'r', newline='') as f:   # newline='' is recommended by the csv docs
    reader = csv.reader(f)
    for row in reader:
        print(','.join(row))

qty,price
1,2
3,4
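
One of the "subtle complications" mentioned above is a field that itself contains a comma or a quote. The sketch below uses made-up rows and an in-memory buffer (`io.StringIO`) instead of a file on disk. It shows that a naive `split(',')` breaks such a row apart, while `csv.reader` recovers the original fields.

```python
import csv
import io

rows = [['item', 'note'],
        ['widget', 'small, blue'],    # field with an embedded comma
        ['gadget', 'says "hi"']]      # field with embedded quotes

# write to an in-memory buffer; csv.writer adds quoting where needed
buffer = io.StringIO(newline='')
csv.writer(buffer).writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)

# naive split breaks the 'widget' row into three pieces
naive = csv_text.splitlines()[1].split(',')

# csv.reader restores the original two fields per row
parsed = list(csv.reader(io.StringIO(csv_text, newline='')))
```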


### Reading large files

Sometimes, you simply want to inspect a file for initial exploration and don't want to read the whole file into memory. Perhaps you want a row count.

A Python file object is a lazy iterator over its lines, which means that we can loop through a file one line at a time without loading everything into memory.

In [15]:
import numpy as np

with open('big_file.txt', 'w') as f:
    for n in range(100_000):
        data = np.random.rand(5).round(2).tolist()
        f.write(str(data)[1:-1])
        f.write('\n')

In [16]:
# peek at first N rows without reading entire file

with open('big_file.txt', 'r') as f:
    for row in range(10):
        print(f.readline(), end='')

0.69, 0.01, 0.97, 0.15, 0.79
0.32, 0.28, 0.63, 0.59, 0.14
0.99, 0.88, 0.0, 0.79, 0.4
0.61, 0.37, 0.55, 0.4, 0.92
0.61, 0.74, 0.93, 0.33, 0.59
0.96, 0.02, 0.41, 0.85, 0.11
0.71, 0.47, 0.03, 0.55, 0.7
0.81, 0.64, 0.29, 0.53, 0.02
0.47, 0.46, 0.43, 0.89, 0.3
0.46, 0.67, 0.04, 0.9, 0.36


In [17]:
# get row count without reading entire file into memory

row_count = 0

with open('big_file.txt', 'r') as f:
    for row in f:
        row_count = row_count + 1

print(row_count)

100000
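
The same two tasks (peeking at the first rows and counting rows) can also be written more compactly with `itertools.islice` and a generator expression. This is a sketch that writes its own small sample file to a temporary directory as a stand-in for a large file; the file name and contents are made up.

```python
import itertools
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # small sample file standing in for a large one
    sample_path = os.path.join(tmp_dir, 'sample.txt')
    with open(sample_path, 'w') as f:
        for n in range(1000):
            f.write(f'row {n}\n')

    # peek at the first 3 lines; islice stops pulling lines after 3 items
    with open(sample_path, 'r') as f:
        first_rows = [line.rstrip('\n') for line in itertools.islice(f, 3)]

    # count rows lazily; the lines are never collected into a list
    with open(sample_path, 'r') as f:
        row_count = sum(1 for _ in f)

print(first_rows)
print(row_count)
```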


### Note on `with` statement (a.k.a., context manager)

The `with` statement was introduced to Python to assist with handling objects that require some sort of setup or teardown.
Ex: opening and closing a file, connecting and disconnecting from a database.

It is syntactic sugar, but it is recommended because (1) it signals that a setup/teardown context is involved and (2) it avoids loose ends such as failing to close a file or connection, which can waste memory or have unintended consequences. For example, if a file is opened and an error occurs before `close()` is called, the file is never closed; `with` closes it even when an exception is raised. It also lessens the need to rewrite exception-handling code.

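The following are equivalent. Note the `try`/`finally` needed to guarantee that the file is closed if an error occurs:
```python
# Manual open() and close(). Not idiomatic.
f = open('test_file.txt', 'r')
try:
    print(f.read())
finally:
    f.close()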

# Using context manager
with open('test_file.txt', 'r') as f:
    print(f.read())
```
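
To make the setup/teardown idea visible, here is a minimal sketch of a custom context manager built with `contextlib.contextmanager`. The `managed_resource` function and the `events` list are hypothetical names used only for this illustration. The teardown step runs even though an exception is raised inside the `with` block.

```python
from contextlib import contextmanager

events = []   # records the order in which things happen


@contextmanager
def managed_resource(name):
    events.append(f'open {name}')         # setup
    try:
        yield name                        # body of the with block runs here
    finally:
        events.append(f'close {name}')    # teardown always runs


try:
    with managed_resource('demo') as r:
        events.append(f'use {r}')
        raise ValueError('something went wrong')
except ValueError as err:
    events.append(f'handled: {err}')

print(events)
```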

### More file operations
* Renaming a file
* Copying a file
* Moving a file

In [18]:
import os
from pathlib import Path
import shutil

# create a file then rename
    # catch exception if file already exists and continue
with open('a_file.txt', 'w') as f:
    f.write('File contents.')

try:
    os.rename('a_file.txt', 'a_file_renamed.txt')
except FileExistsError as err:
    print(err)

# copy a file. copy2() (vs copy()) also tries to preserve file metadata such as last-modified and last-access times.
shutil.copy2(src='a_file_renamed.txt', dst='a_file_copy.txt')

# make a directory and continue if it already exists
    # can also use os.mkdir('foo_dir') with try-except for same effect
Path('foo_dir').mkdir(exist_ok=True)


# create a file then move it
with open('a_file.txt', 'w') as f:
    f.write('File contents.')
    
_ = shutil.move(src='a_file.txt', dst='foo_dir/a_file_copy.txt')

### Removing files with `os.remove(path)`

In [19]:
# loop through list of files in current directory returned by os.listdir()
# if the file name ends with 'csv' or 'txt', we remove it
for file in os.listdir():
    if file.endswith('.csv') or file.endswith('.txt'):
        os.remove(file)
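
A `pathlib` variant of the same clean-up is sketched below. It runs in a temporary directory with made-up file names so that it cannot delete anything real, skips directories, and compares the lowercased suffix so that `.TXT` is treated like `.txt`. The constant `SUFFIXES_TO_REMOVE` is a name chosen for this example.

```python
import tempfile
from pathlib import Path

SUFFIXES_TO_REMOVE = {'.csv', '.txt'}

with tempfile.TemporaryDirectory() as tmp:
    work_dir = Path(tmp)
    for name in ['a.csv', 'b.txt', 'keep.json', 'notes.TXT']:
        (work_dir / name).write_text('demo')

    removed = []
    for path in sorted(work_dir.iterdir()):
        # only remove regular files whose suffix is in the target set
        if path.is_file() and path.suffix.lower() in SUFFIXES_TO_REMOVE:
            path.unlink()
            removed.append(path.name)

    remaining = sorted(p.name for p in work_dir.iterdir())

print(removed)
print(remaining)
```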

### Pattern Matching
* sample of string operations
* regular expressions with `re` module

In [20]:
'.TXT'.lower()              # => '.txt'
'.json'.upper()             # => '.JSON'
'txt' in 'file.txt'         # => True
'file.txt'.endswith('txt')  # => True
'file.txt'.split('.')       # => ['file', 'txt']
'file.txt'.split('.')[0]    # => 'file'

pass

In [21]:
import re

s1 = '20220101_file.txt'
s2 = '20201212_file.txt'
s3 = '2020_file.txt'
s4 = 'file.txt'

rgx = re.compile(r'(\d\d\d\d\d?\d?\d?\d?)(.*)')

for s in [s1, s2, s3, s4]:
    print(s)
    if match := rgx.search(s):
        print(match.groups())
    print()

20220101_file.txt
('20220101', '_file.txt')

20201212_file.txt
('20201212', '_file.txt')

2020_file.txt
('2020', '_file.txt')

file.txt



**Pattern Matching with Regular Expressions**

More complicated patterns can be captured using a small-but-powerful text processing language known as regular expressions.

Resources for more depth:
* https://automatetheboringstuff.com/chapter7
* https://regex101.com

In [22]:
import re

# objective: capture the id prefix, separated year/month/day, and file extension for CSV or TXT files matching the pattern
    # note that we cannot split on '_' underscore because there can be an arbitrary number of underscores
    # say we want to capture only patterns like the first 3 strings
    
s1 = 'id01_foo_file_20220101.txt'
s2 = 'id02_bar_2022-01-01.csv'
s3 = 'id03_yow_file_filler_fill_fill_2022.01.01.txt'
s4 = 'oddball.json'
s5 = 'oddball.txt'

rgx = re.compile(r'''
([^_]+)    # id
.*         # filler
(\d\d\d\d) # year
.*?        # optional date separator
(\d\d)     # month
.*?        # optional date separator
(\d\d)     # day
\.(txt|csv) # file extension (escaped dot matches a literal period)
''', re.VERBOSE)

for s in [s1, s2, s3, s4, s5]:
    if match:=rgx.search(s):
        print(s)
        print(match.groups())
        print()
    else:
        print(f"{s} => no match")

id01_foo_file_20220101.txt
('id01', '2022', '01', '01', 'txt')

id02_bar_2022-01-01.csv
('id02', '2022', '01', '01', 'csv')

id03_yow_file_filler_fill_fill_2022.01.01.txt
('id03', '2022', '01', '01', 'txt')

oddball.json => no match
oddball.txt => no match


In [23]:
for s in [s1, s2, s3, s4, s5]:
    print(s)

id01_foo_file_20220101.txt
id02_bar_2022-01-01.csv
id03_yow_file_filler_fill_fill_2022.01.01.txt
oddball.json
oddball.txt


### More directory operations
* make a directory
* copy a directory tree
* remove a directory tree

In [24]:
# makes a directory
    # exist_ok=False by default. This will raise a FileExistsError and halt the program if the directory already exists.
    # Setting exist_ok=True suppresses the FileExistsError and skips making a new directory.
    # In other words, with exist_ok=True this method leaves an existing directory and its contents untouched.
Path('new_directory').mkdir(exist_ok=True)
    
# removes directory including its contents (careful with this; there is no "undo")
shutil.rmtree('new_directory')

Path('new_directory').mkdir(exist_ok=True)
Path(r'new_directory/subdirectory').mkdir(exist_ok=True)
with open(r'new_directory/subdirectory/file.txt', 'w') as f:
    f.write('hello')


try:
    shutil.copytree('new_directory', 'new_directory_copied')
except FileExistsError as err:
    print(err)
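
Two related options are worth knowing, sketched here in a temporary directory with illustrative names. `mkdir(parents=True)` creates missing intermediate directories in one call, and `shutil.copytree(..., dirs_exist_ok=True)` (available from Python 3.8) copies into a destination that already exists instead of raising `FileExistsError`.

```python
import shutil
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)

    # parents=True creates 'new_directory' and 'subdirectory' in one call
    nested = base / 'new_directory' / 'subdirectory'
    nested.mkdir(parents=True, exist_ok=True)
    (nested / 'file.txt').write_text('hello')

    # the first copy creates the destination tree
    shutil.copytree(base / 'new_directory', base / 'copied')

    # a second copy into the existing destination fails by default
    try:
        shutil.copytree(base / 'new_directory', base / 'copied')
        second_copy_failed = False
    except FileExistsError:
        second_copy_failed = True

    # dirs_exist_ok=True copies into the existing tree instead (Python 3.8+)
    shutil.copytree(base / 'new_directory', base / 'copied', dirs_exist_ok=True)
    copied_text = (base / 'copied' / 'subdirectory' / 'file.txt').read_text()

print(second_copy_failed, copied_text)
```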

In [25]:
for path in ['new_directory', 'new_directory_copied', 'foo_dir']:
    #print(path)
    
    try:
        shutil.rmtree(path) 
    except FileNotFoundError as err:
        print(err)

# Inventory a directory

### Make dummy data/files

Make a nested directory two levels deep. The outer loop creates 3 directories with one file each. The inner loop creates 3 subdirectories in each directory, again with one file each.

In [26]:
Path('./root').mkdir(exist_ok=True)

for i in range(1,4):    
    directory = Path(fr'./root/dir_{str(i).zfill(2)}')
    directory.mkdir(exist_ok=True)
    
    with open(directory / fr'file_{str(i).zfill(2)}.txt', 'w') as f:
        f.write("Hello, I'm a file.")
        
    for j in range(1,4):        
        subdirectory = directory / f'subdir_{str(j).zfill(2)}'
        subdirectory.mkdir(exist_ok=True)
        
        with open(subdirectory / fr'file_{str(i).zfill(2)}.txt', 'w') as f:
            f.write("Hello, I'm a file.")

Path(r'./root').mkdir(exist_ok=True)

# call shutil.rmtree to remove entire directory and contents
# shutil.rmtree('./root')

### Display contents of a directory (non-recursive, non-inclusive of subdirectories) using `os.listdir()`

`os.listdir()` returns a list, which is useful for cataloging.

In [27]:
# list contents of the root folder in the current directory ('.'=current directory)
print(os.listdir('./root'))
print(os.listdir('.'))

['dir_01', 'dir_02', 'dir_03']
['.ipynb_checkpoints', 'file_utils', 'organizing_files.ipynb', 'root', '__pycache__']


### Display contents (including subdirectories) with `os.walk()`

**`os.walk(top_directory_path)`**: Directory tree generator. A "generator" is a special Python object that is iterable. Unlike iterable containers such as a list or dictionary, a generator produces its items one at a time as you loop over it. The items are not all held in memory at once, which is useful for memory efficiency.


Unlike with `os.listdir()`, which returns a list, with `os.walk()` we must build up our own data structure depending on what we want to include.

In [28]:
# walk the entire `root` directory and its subdirectories
    # try commenting out one of the loops to target only files or directories
    # note that we have to join the root path with the filename to get the full file path 
    # (else we'll have a filename and not know where it came from)
    # rather than print, we can append these to a list or build a dictionary

root_dir = Path('./root')

for root, dirs, files in os.walk(root_dir):
    for f in files:
        print(Path(root)/f)
    for d in dirs:
        print(Path(root)/d)

root\dir_01
root\dir_02
root\dir_03
root\dir_01\file_01.txt
root\dir_01\subdir_01
root\dir_01\subdir_02
root\dir_01\subdir_03
root\dir_01\subdir_01\file_01.txt
root\dir_01\subdir_02\file_01.txt
root\dir_01\subdir_03\file_01.txt
root\dir_02\file_02.txt
root\dir_02\subdir_01
root\dir_02\subdir_02
root\dir_02\subdir_03
root\dir_02\subdir_01\file_02.txt
root\dir_02\subdir_02\file_02.txt
root\dir_02\subdir_03\file_02.txt
root\dir_03\file_03.txt
root\dir_03\subdir_01
root\dir_03\subdir_02
root\dir_03\subdir_03
root\dir_03\subdir_01\file_03.txt
root\dir_03\subdir_02\file_03.txt
root\dir_03\subdir_03\file_03.txt
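
For comparison, `pathlib` offers `Path.rglob()`, which walks a tree recursively and yields full `Path` objects directly, so there is no need to join the root with each name. The sketch below builds a tiny tree in a temporary directory (the names are illustrative) and collects the results into lists.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root_dir = Path(tmp) / 'root'
    (root_dir / 'dir_01' / 'subdir_01').mkdir(parents=True)
    (root_dir / 'dir_01' / 'file_01.txt').write_text('a')
    (root_dir / 'dir_01' / 'subdir_01' / 'file_01.txt').write_text('b')
    (root_dir / 'dir_01' / 'subdir_01' / 'data.csv').write_text('c')

    # rglob('*') yields every file and directory below root_dir
    all_paths = sorted(p.relative_to(root_dir).as_posix()
                       for p in root_dir.rglob('*'))

    # a pattern narrows the result to matching names only
    txt_files = sorted(p.relative_to(root_dir).as_posix()
                       for p in root_dir.rglob('*.txt'))

print(all_paths)
print(txt_files)
```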


### Get file statistics

```
os.stat_result object may be accessed either as a tuple of
      (mode, ino, dev, nlink, uid, gid, size, atime, mtime, ctime)
    or via the attributes st_mode, st_ino, st_dev, st_nlink, st_uid, and so on.
```

In [29]:
filepath = Path('root/dir_03/subdir_01/file_03.txt')   # forward slashes work on Windows, Mac, and Linux
stat_result = os.stat(filepath)
stat_result

os.stat_result(st_mode=33206, st_ino=32932572275263565, st_dev=875114692, st_nlink=1, st_uid=0, st_gid=0, st_size=18, st_atime=1642005006, st_mtime=1642005006, st_ctime=1642005006)

In [30]:
# attribute access with dot operator
print(stat_result.st_size)
print(stat_result.st_mtime)

18
1642005006.6217597


In [31]:
# tuple unpacking -> separate variables per statistic
mode, ino, dev, nlink, uid, gid, size, atime, mtime, ctime = stat_result
print(size)
print(mtime)

18
1642005006


**Sidenote demo on tuple unpacking**

In [32]:
# these have the same effect

# tuple unpacking
x, y, z = (1, 2, 3)
print(x,y,z)

# variable assignment
a_tuple = (1, 2, 3)
x = a_tuple[0]
y = a_tuple[1]
z = a_tuple[2]
print(x,y,z)

1 2 3
1 2 3


In [33]:
# using a helper function to also format key statistics and convert to dictionary 
    # ex: epoch timestamp converted to localized YYYY.MM.DD-hhmm 
    # system owner user ID # converted to domain/username
    # file size converted to megabytes

filepath = Path('root/dir_03/subdir_01/file_03.txt')   # forward slashes work on Windows, Mac, and Linux
file_utils.get_file_stats(filepath)

{'path': WindowsPath('root/dir_03/subdir_01/file_03.txt'),
 'name': 'file_03.txt',
 'parent': WindowsPath('root/dir_03/subdir_01'),
 'size_mb': 0.0,
 'modified_dt': '2022.01.12-1030',
 'create_dt': '2022.01.12-1030',
 'owner': 'EVOLENTHEALTH/PZuradzki',
 'is_file': 1,
 'is_dir': 0}
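
The source of `file_utils.get_file_stats()` is not shown in this notebook. The sketch below is a hypothetical stand-in, not the actual helper, that illustrates how such conversions could be done with the standard library. It assumes megabytes means bytes divided by 1024 * 1024 and that the timestamp format is `YYYY.MM.DD-hhmm`. Owner lookup is left out because it is platform-specific.

```python
import os
import tempfile
from datetime import datetime
from pathlib import Path


def sketch_file_stats(path: Path) -> dict:
    """Hypothetical stand-in for a stats helper (not the notebook's file_utils code)."""
    st = os.stat(path)
    return {
        'path': path,
        'name': path.name,
        'parent': path.parent,
        # bytes -> megabytes (assumed divisor: 1024 * 1024)
        'size_mb': round(st.st_size / (1024 * 1024), 4),
        # epoch seconds -> local time formatted as YYYY.MM.DD-hhmm
        'modified_dt': datetime.fromtimestamp(st.st_mtime).strftime('%Y.%m.%d-%H%M'),
        'is_file': int(path.is_file()),
        'is_dir': int(path.is_dir()),
    }


# demo on a throwaway file
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / 'file_03.txt'
    demo_path.write_text("Hello, I'm a file.")
    stats = sketch_file_stats(demo_path)

print(stats['name'], stats['size_mb'], stats['modified_dt'])
```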

### Display inventory with pandas and prep for export (CSV, Excel)

In [34]:
import pandas as pd

# forward slashes work on Windows, Mac, and Linux
fp1 = Path('root/dir_01/file_01.txt')
fp2 = Path('root/dir_03/file_03.txt')

# file_utils.get_file_stats() returns a dictionary
# stats_list is a list of dictionaries
stats_list = [file_utils.get_file_stats(fp1),
              file_utils.get_file_stats(fp2)]

# pandas can convert from a list of dictionaries to a tabular format that we can display or export
df = pd.DataFrame(stats_list)
df

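|    | path                    | name        | parent      |   size_mb | modified_dt     | create_dt       | owner                   |   is_file |   is_dir |
|---:|:------------------------|:------------|:------------|----------:|:----------------|:----------------|:------------------------|----------:|---------:|
|  0 | root\dir_01\file_01.txt | file_01.txt | root\dir_01 |       0.0 | 2022.01.12-1030 | 2022.01.12-1030 | EVOLENTHEALTH/PZuradzki |         1 |        0 |
|  1 | root\dir_03\file_03.txt | file_03.txt | root\dir_03 |       0.0 | 2022.01.12-1030 | 2022.01.12-1030 | EVOLENTHEALTH/PZuradzki |         1 |        0 |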
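
If pandas is not available, a list of dictionaries can also be exported with the standard library's `csv.DictWriter`. The rows below are a trimmed, illustrative version of the statistics shown above, and the output goes to an in-memory buffer rather than a file on disk.

```python
import csv
import io

# illustrative rows; a real inventory would come from a stats helper
stats_list = [
    {'name': 'file_01.txt', 'parent': 'root/dir_01', 'size_mb': 0.0},
    {'name': 'file_03.txt', 'parent': 'root/dir_03', 'size_mb': 0.0},
]

buffer = io.StringIO(newline='')
writer = csv.DictWriter(buffer, fieldnames=['name', 'parent', 'size_mb'])
writer.writeheader()             # column names from fieldnames
writer.writerows(stats_list)     # one CSV row per dictionary

csv_lines = buffer.getvalue().splitlines()
print('\n'.join(csv_lines))
```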


### Pulling it all together. Walk a directory *and* collect the file statistics.

In [35]:
from pprint import pprint

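def make_file_inventory(root_dir: Path, as_df=False):
    """Walk root_dir and collect stats for every file and directory.

    Returns a list of dicts, or a sorted DataFrame if as_df=True.
    """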
    inventory = []

    for root, dirs, files in os.walk(root_dir):
        for f in files:
            filepath = Path(root) / f
            inventory.append(file_utils.get_file_stats(filepath))
        for d in dirs:
            dirpath = Path(root) / d
            inventory.append(file_utils.get_file_stats(dirpath))
            
    if as_df:
        return pd.DataFrame(inventory).sort_values(by=['is_file', 'path'])
    
    return inventory

In [36]:
inventory = make_file_inventory(Path('./root'))

# peek at first two objects
for file_obj in inventory[:2]:
    pprint(file_obj)

{'create_dt': '2022.01.12-1030',
 'is_dir': 1,
 'is_file': 0,
 'modified_dt': '2022.01.12-1030',
 'name': 'dir_01',
 'owner': 'EVOLENTHEALTH/PZuradzki',
 'parent': WindowsPath('root'),
 'path': WindowsPath('root/dir_01'),
 'size_mb': 0.0039}
{'create_dt': '2022.01.12-1030',
 'is_dir': 1,
 'is_file': 0,
 'modified_dt': '2022.01.12-1030',
 'name': 'dir_02',
 'owner': 'EVOLENTHEALTH/PZuradzki',
 'parent': WindowsPath('root'),
 'path': WindowsPath('root/dir_02'),
 'size_mb': 0.0039}


In [37]:
df = make_file_inventory(Path('./root'), as_df=True)
df.to_excel('./file_inventory.xlsx', index=False)
# df.to_clipboard(index=False)

In [38]:
_df = df.head(10).copy()
_df['owner'] = _df['owner'].str.replace(r'EVOLENTHEALTH', 'domain')

In [39]:
print(_df.to_markdown(index=False))

| path                  | name      | parent      |   size_mb | modified_dt     | create_dt       | owner            |   is_file |   is_dir |
|:----------------------|:----------|:------------|----------:|:----------------|:----------------|:-----------------|----------:|---------:|
| root\dir_01           | dir_01    | root        |    0.0039 | 2022.01.12-1030 | 2022.01.12-1030 | domain/PZuradzki |         0 |        1 |
| root\dir_01\subdir_01 | subdir_01 | root\dir_01 |    0      | 2022.01.12-1030 | 2022.01.12-1030 | domain/PZuradzki |         0 |        1 |
| root\dir_01\subdir_02 | subdir_02 | root\dir_01 |    0      | 2022.01.12-1030 | 2022.01.12-1030 | domain/PZuradzki |         0 |        1 |
| root\dir_01\subdir_03 | subdir_03 | root\dir_01 |    0      | 2022.01.12-1030 | 2022.01.12-1030 | domain/PZuradzki |         0 |        1 |
| root\dir_02           | dir_02    | root        |    0.0039 | 2022.01.12-1030 | 2022.01.12-1030 | domain/PZuradzki |         0 |        1 |
| root

# Working with ZIP Files
* Create and add to ZIP file
* Read contents and file info
* Extract contents with ZipFile.extractall() or ZipFile.extract()

### Create and add to ZIP file

In [40]:
import zipfile

with zipfile.ZipFile('./root_zipped.zip', 'w') as zf:
    for root, dirs, files in os.walk('./root'):
        for file in files:
            zf.write(Path(root)/file, compress_type=zipfile.ZIP_DEFLATED)
        for d in dirs:
            zf.write(Path(root)/d, compress_type=zipfile.ZIP_DEFLATED)
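
The name stored inside the archive defaults to the path passed to `zf.write()`. When the source paths are absolute, or when you want a different layout inside the ZIP, the `arcname` argument sets the stored name explicitly. The sketch below uses a temporary directory and an in-memory archive (`io.BytesIO`) for illustration; a file path works the same way.

```python
import io
import tempfile
import zipfile
from pathlib import Path

archive = io.BytesIO()   # in-memory archive for the demo

with tempfile.TemporaryDirectory() as tmp:
    root_dir = Path(tmp) / 'root'
    (root_dir / 'dir_01').mkdir(parents=True)
    (root_dir / 'dir_01' / 'file_01.txt').write_text("Hello, I'm a file.")

    with zipfile.ZipFile(archive, 'w', compression=zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(root_dir.rglob('*')):
            if path.is_file():
                # store the path relative to root_dir's parent, e.g. 'root/dir_01/...'
                stored_name = path.relative_to(root_dir.parent).as_posix()
                zf.write(path, arcname=stored_name)

# reopen the archive for reading and list what was stored
with zipfile.ZipFile(archive) as zf:
    names = zf.namelist()

print(names)
```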

### Read contents and file info

In [41]:
with zipfile.ZipFile('root_zipped.zip') as zf:
    print(zf.namelist())
    print()
    print(zf.getinfo(r'root/dir_01/file_01.txt'))

['root/dir_01/', 'root/dir_02/', 'root/dir_03/', 'root/dir_01/file_01.txt', 'root/dir_01/subdir_01/', 'root/dir_01/subdir_02/', 'root/dir_01/subdir_03/', 'root/dir_01/subdir_01/file_01.txt', 'root/dir_01/subdir_02/file_01.txt', 'root/dir_01/subdir_03/file_01.txt', 'root/dir_02/file_02.txt', 'root/dir_02/subdir_01/', 'root/dir_02/subdir_02/', 'root/dir_02/subdir_03/', 'root/dir_02/subdir_01/file_02.txt', 'root/dir_02/subdir_02/file_02.txt', 'root/dir_02/subdir_03/file_02.txt', 'root/dir_03/file_03.txt', 'root/dir_03/subdir_01/', 'root/dir_03/subdir_02/', 'root/dir_03/subdir_03/', 'root/dir_03/subdir_01/file_03.txt', 'root/dir_03/subdir_02/file_03.txt', 'root/dir_03/subdir_03/file_03.txt']

<ZipInfo filename='root/dir_01/file_01.txt' compress_type=deflate filemode='-rw-rw-rw-' file_size=18 compress_size=20>


### Extract contents with ZipFile.extractall() or ZipFile.extract()

In [42]:
with zipfile.ZipFile('root_zipped.zip') as zf:
    zf.extractall('unzipped')
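
To pull out a single member rather than the whole archive, `ZipFile.extract()` writes one member to disk and `ZipFile.read()` returns a member's bytes without writing anything. The sketch below builds a small in-memory archive with made-up members so that it is self-contained.

```python
import io
import tempfile
import zipfile
from pathlib import Path

# build a small in-memory archive to work with
archive = io.BytesIO()
with zipfile.ZipFile(archive, 'w') as zf:
    zf.writestr('root/dir_01/file_01.txt', "Hello, I'm a file.")
    zf.writestr('root/dir_02/file_02.txt', "Hello, I'm a file.")

with tempfile.TemporaryDirectory() as tmp, zipfile.ZipFile(archive) as zf:
    # extract() writes one member below the target directory and returns its path
    extracted = Path(zf.extract('root/dir_01/file_01.txt', path=tmp))
    extracted_name = extracted.name
    extracted_text = extracted.read_text()

    # read() returns the member's bytes without touching the disk
    in_memory_text = zf.read('root/dir_02/file_02.txt').decode('utf-8')

print(extracted_name, extracted_text)
print(in_memory_text)
```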

### Cleanup

In [43]:
shutil.rmtree('./root')
shutil.rmtree('./unzipped')
os.remove('root_zipped.zip')
os.remove('file_inventory.xlsx')