# Working with Files in Python

_Updated September 2023_

This notebook explains and illustrates some of the basic operations of working with files in Python. It begins with reading and writing files (opening and saving), including an example that extracts specific information from one file and then writes it to a new file. After that, the notebook introduces a few modules that help gather basic system information, including filenames, locations (paths), and sizes. Once such information is extracted, you can use it to create an inventory or manifest.

As anyone who has worked with files and operating systems will know, the operating system data about files is often contextual and depends on the system in which it was created or saved. Other information, like user profiles, access controls, and encryption are typically not reflected in this sort of information and that context will require other methods to preserve.

## Python libraries/modules used

* `os` module ([docs](https://docs.python.org/3/library/os.html)) for navigating and interacting with the file system
* `os.path` module ([docs](https://docs.python.org/3/library/os.path.html)) for working with paths - note this is a submodule of `os`, so when you import `os`, you also have access to all of these functions
* `shutil` module ([docs](https://docs.python.org/3/library/shutil.html)) for copying, moving, removing
* `hashlib` module ([docs](https://docs.python.org/3/library/hashlib.html)) for creating secure hash digests for files. This is primarily for creating file integrity metadata, which is important for confirming or transferring files


## Key points / Questions

* How can I interact with the files on my computer using Python?
* How can I work with file paths in various operating systems?
* How do I move around the file system in Python? 
* How can I find out the current location on the file system, and how can I look in other locations, using Python?
* How can I identify file information and metadata using Python?

First, let's look at writing files. 

# Part 1: Writing and Reading Files

The basic tool for writing files is the `write()` method of a file object. It writes the string passed as its argument, which can be a single line or multi-line content. Unlike in other environments like the GUI or shell, where opening a file is often implicit, you need to `open()` a file before using it in Python and `close()` it afterwards. You cannot write to a file that has not been opened, and a file that is not closed may end up incomplete or corrupted.

Fortunately, we can usually use the contextual opener:

```python
with open(file, 'w') as f:
```

This will automatically close the file when the indented block under `with` completes. The `w` argument indicates that the file is opened in "write" mode. If the file doesn't exist, it will be created; if it already exists, its contents are overwritten.
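
To make the behavior of `with` and the `'w'` mode concrete, here is a minimal sketch. The file name `demo-note.txt` and the temporary directory are illustrative assumptions and are not part of the lab files.

```python
import os
import tempfile

# Work in a throwaway directory so no real files are touched (an assumption for this demo)
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'demo-note.txt')

    # 'w' creates the file if it is missing
    with open(demo_path, 'w') as f:
        f.write('first draft\n')
    closed_after_block = f.closed  # the with block has already closed the file

    # opening in 'w' again replaces the old contents
    with open(demo_path, 'w') as f:
        f.write('second draft\n')

    # 'a' (append) adds to the end instead of overwriting
    with open(demo_path, 'a') as f:
        f.write('a later addition\n')

    with open(demo_path, 'r') as f:
        contents = f.read()

print(closed_after_block)
print(contents)
```

Only the second draft and the appended line survive, because each `'w'` open starts from an empty file.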

In [1]:
# Basic use of open() and write()

line = 'Believe that life is worth living, and your belief will help create the fact.'
# Credit William James https://en.wikiquote.org/wiki/William_James

fout = open('quote-output.txt', 'w')

fout.write(line)

fout.close()

In [2]:
# use the with open() syntax to check if the file is there

with open('quote-output.txt', 'r') as f:
    print(f.read())

Believe that life is worth living, and your belief will help create the fact.


# Part 2: Navigating the file system

Moving around and pointing toward different locations in the file system operates 
in many ways similar to navigation on the command line. 
All movements and file locations are relative to the current location of the terminal (or in this case, program),
but "absolute" paths can be created that point completely from the root of the filesystem. 
In Python, the most useful tools are in the `os` and `os.path` modules.  

## `os`: Navigation and File Paths

Now, let's look at moving around and managing file paths. This uses the `os` and `os.path` modules. Importing `os` also makes all of the `os.path` functions available. We can use the `os` module to interact with the operating system. For example, it provides functions similar to the file navigation commands on the terminal.

In [3]:
import os

The `getcwd()` function is like `pwd`:

In [4]:
os.getcwd()

'/Users/jajohnst/Desktop/networked-services-labs-2023'

The `chdir()` function allows you to move around, like `cd`:

In [5]:
os.chdir('data')

In [6]:
os.getcwd()

'/Users/jajohnst/Desktop/networked-services-labs-2023/data'

In [7]:
os.chdir('..')

In [8]:
os.getcwd()

'/Users/jajohnst/Desktop/networked-services-labs-2023'
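
Because `os.chdir()` changes the location for the whole program, it is easy to forget to move back. One possible pattern, sketched here with a temporary directory standing in for a real folder (an assumption for the demo), is to record the starting point and restore it in a `finally` clause:

```python
import os
import tempfile

start_dir = os.getcwd()   # remember where we started

with tempfile.TemporaryDirectory() as tmp_dir:
    try:
        os.chdir(tmp_dir)          # like `cd` into the folder
        inside_dir = os.getcwd()   # like `pwd` while inside it
    finally:
        os.chdir(start_dir)        # always move back, even if an error occurs

end_dir = os.getcwd()
print(end_dir == start_dir)
```

In many scripts it is simpler still to avoid `chdir()` and build paths with `os.path.join()` instead.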

The `os.path` module allows building up paths from multiple parts by passing them as arguments to the `os.path.join()` function. The result can also be assigned to a variable, such as `path_to_file` below:

In [9]:
file = 'mbox-short.txt'

path_to_file = os.path.join('data','emails',file)

print(path_to_file)

data/emails/mbox-short.txt
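
The output above uses forward slashes because the notebook was run on a Unix-like system. `os.path` is an alias for the `posixpath` module on macOS and Linux and for `ntpath` on Windows, so the following sketch calls both modules directly to show how the same parts would be joined on each platform:

```python
import ntpath      # the path rules used on Windows
import posixpath   # the path rules used on macOS and Linux

parts = ('data', 'emails', 'mbox-short.txt')

posix_style = posixpath.join(*parts)
windows_style = ntpath.join(*parts)

print(posix_style)
print(windows_style)
```

In normal use you would call `os.path.join()` and let Python pick the right separator for the machine running the script.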


And, you can check to see if the entity is indeed a file:

In [10]:
os.path.isfile(path_to_file)

True

## `os.path`: Identifying parts of the path

The path module also helps to split up paths, for example to extract the directory location of a file, the name of a file, and even the extension of a file. These functions do this:

* `os.path.basename()` returns what is usually considered the filename. Technically, it returns whatever follows the final slash or path separator character
* `os.path.dirname()` returns a string of everything before the last separator
* `os.path.splitext()` returns a tuple containing the path without its extension and the extension of the file. This assumes that the file extension is indicated by the last dot in the final part of the path.

In [11]:
path_to_file

'data/emails/mbox-short.txt'

In [12]:
dirName = os.path.dirname(path_to_file)
baseName = os.path.basename(path_to_file)
fExtension = os.path.splitext(path_to_file)

print(dirName,'\n',baseName,'\n',fExtension)

data/emails 
 mbox-short.txt 
 ('data/emails/mbox-short', '.txt')


Note that the `.splitext()` function returns a tuple, with the first element being the directory+basename and the second element the extension.

In [13]:
# to isolate the file extension portion:
fExtension[1]

'.txt'
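
Since `splitext()` only looks at the last dot, it is worth seeing how it treats less tidy names. The sketch below uses a few example names (chosen for illustration, apart from `glmp_cig.EQ.wm.p20.t12z`, which appears in the sample data later in this notebook):

```python
import os

examples = [
    'mbox-short.txt',
    'archive.tar.gz',            # only the last extension is split off
    'README',                    # no dot, so the extension is empty
    '.gitignore',                # a leading dot is not treated as an extension
    'glmp_cig.EQ.wm.p20.t12z',
]

results = {name: os.path.splitext(name) for name in examples}

for name, (root, ext) in results.items():
    print(f'{name!r}: root={root!r}, extension={ext!r}')
```

An empty extension is a useful signal that a file may need to be identified some other way.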

## Absolute & Relative Paths

* `os.path.abspath()` returns the absolute path as a string
* `os.path.isabs()` returns `True` or `False` depending on whether the argument it receives is an absolute path or not
* `os.path.relpath()` returns a string of the relative path

As in the shell terminal, we can refer to the current directory with a single dot `.`, and the parent directory with two dots `..`.

In [14]:
os.path.abspath('.')

'/Users/jajohnst/Desktop/networked-services-labs-2023'

In [15]:
cur_loc = os.getcwd()

os.path.isabs(cur_loc)

True

Note how it is possible to "ask" for the relative path to your current location, and you should receive the dot directory `.`.

In [16]:
# what is the relative path to the current location?
os.path.relpath(os.getcwd())

'.'

The relative path can also be used to return what the path to another location is, relative to the current location:

In [17]:
os.path.relpath('data','webfiles-samples')

'../data'
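
The second argument to `relpath()` is the starting point for the calculation. A small sketch, using a hypothetical `example-project` folder that does not need to exist (these functions only manipulate path strings):

```python
import os

# hypothetical project layout, used only to build path strings
project_root = os.path.abspath('example-project')
data_dir = os.path.join(project_root, 'data')
samples_dir = os.path.join(project_root, 'webfiles-samples')

# how do we get to data/ when standing in webfiles-samples/ ?
rel_from_samples = os.path.relpath(data_dir, start=samples_dir)
# and when standing in the project root?
rel_from_root = os.path.relpath(data_dir, start=project_root)

print(os.path.isabs(data_dir))
print(rel_from_samples)
print(rel_from_root)
```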

# Part 3: Getting file information (i.e., metadata)

![Metadataz cat](https://i.imgflip.com/6uaiuk.jpg "A metadataz cat meme that says 'I can haz metadataz'.")

Finally, `os.path` also allows us to query a location (if it's a file) for things like size and date modified. The basic things, like the size of the file, are easy to get:

* `os.path.getsize()` returns the file size in bytes. 
* `os.path.getmtime()` returns the modification time. Note that the result is a number (seconds since the epoch, usually with a fractional part) as reported by the file system where you are running the script, and it requires conversion into a string format to be human readable. Also, it is difficult to verify the accuracy of this sort of date information since it can vary depending on the operating system. Still, it is useful to record. 

The same information, and more, is contained in an `os.stat_result` object. This is returned by the `os.stat()` function when it is given a path. (Here is where we may need to test if something is or is not a file.) In brief, the result contains things like:

* `st_size` also contains the size in bytes
* `st_mtime` indicates the modification time as a value that represents time according to the computer's operating system. Note that Unix and Windows systems may calculate time differently, and these may or may not have reliable information. In other cases, the time may be called modify time but the system records something else, like the last time the file was accessed (vs the last time it was saved/written), or the time it was created. This is, unfortunately (for those of us interested in these kinds of things), very difficult to track across systems, so always take this under advisement. 
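
As a quick check that the two routes agree, the following sketch writes a small temporary file (the name `sample.txt` is an assumption for the demo) and compares `os.path.getsize()` and `os.path.getmtime()` with the matching fields from `os.stat()`:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, 'sample.txt')
    with open(sample_path, 'w') as f:
        f.write('hello')   # 5 ASCII characters, so 5 bytes on disk

    stats = os.stat(sample_path)
    size_from_stat = stats.st_size
    size_from_getsize = os.path.getsize(sample_path)
    mtime_from_stat = stats.st_mtime
    mtime_from_getmtime = os.path.getmtime(sample_path)

print(size_from_stat, size_from_getsize)
print(mtime_from_stat == mtime_from_getmtime)
```

If you need several pieces of metadata for the same file, calling `os.stat()` once and reading the fields from the result avoids asking the file system repeatedly.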



In [18]:
path_to_file

'data/emails/mbox-short.txt'

In [19]:
os.stat(path_to_file)

os.stat_result(st_mode=33188, st_ino=28064473, st_dev=16777229, st_nlink=1, st_uid=504, st_gid=20, st_size=94625, st_atime=1694411544, st_mtime=1663731214, st_ctime=1694411358)

This can be assigned to a variable, and the various elements of the result object called using dot notation. For example, to make a dictionary of the file size and modify time:

In [20]:
file_stats = os.stat(path_to_file)

file_info = {
    'size' : file_stats.st_size,
    'modify_time' : file_stats.st_mtime
}

print(file_info)

{'size': 94625, 'modify_time': 1663731214.6876292}


Building these all together, we can extract the file metadata! :tada:

In [21]:
file_stats = os.stat(path_to_file)

file_info = {
    'absPath' : os.path.abspath(path_to_file),
    'directory_location' : os.path.dirname(path_to_file),
    'name' : os.path.basename(path_to_file),
    'extension' : os.path.splitext(path_to_file)[1],
    'size': os.path.getsize(path_to_file),
    'modify_time' : file_stats.st_mtime
}

print(file_info)

{'absPath': '/Users/jajohnst/Desktop/networked-services-labs-2023/data/emails/mbox-short.txt', 'directory_location': 'data/emails', 'name': 'mbox-short.txt', 'extension': '.txt', 'size': 94625, 'modify_time': 1663731214.6876292}


### Handling time and date information

In the above example, you may note that the value returned by the "modified time" query doesn't look much like a time value. You will need to convert this from the numerical value that is received to a date value. This can be accomplished using the `datetime` module and format conversion to produce and record the information in a more usable and portable format. 

In the example below, note that the `datetime` module is imported, then the result is converted to a string that will represent the modification information more reliably. 

In [22]:
import datetime

In [23]:
file_info = {
    'name' : os.path.basename(path_to_file),
    'extension' : os.path.splitext(path_to_file)[1],
    'size' : os.path.getsize(path_to_file), 
    'modify_datetime' : datetime.datetime.strftime(datetime.datetime.fromtimestamp(os.path.getmtime(path_to_file)), "%Y-%m-%dT%H:%M:%S%Z")
}
print(file_info)

{'name': 'mbox-short.txt', 'extension': '.txt', 'size': 94625, 'modify_datetime': '2022-09-20T23:33:34'}


In the line that provides the `modify_datetime` information, the timestamp from `os.path.getmtime()` is passed to the `datetime.datetime.fromtimestamp()` method, which converts it to a `datetime` object in local time. That object is then formatted as a string using the format mask specified at the end, producing a string that represents the modification date and time in ISO 8601 style under the `modify_datetime` key. Because the `datetime` object carries no timezone information, `%Z` adds nothing to the output.
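
A local time without a timezone can be ambiguous once a manifest leaves the machine where it was made. One possible alternative, sketched here using the `st_mtime` value printed earlier in this notebook, is to ask `fromtimestamp()` for a timezone-aware UTC value and let `isoformat()` produce the string:

```python
import datetime

# the modification timestamp reported for mbox-short.txt earlier in this notebook
mtime = 1663731214.6876292

# tz=datetime.timezone.utc makes the result timezone-aware instead of local and naive
modified_utc = datetime.datetime.fromtimestamp(mtime, tz=datetime.timezone.utc)

# timespec='seconds' drops the fractional seconds from the string
iso_utc = modified_utc.isoformat(timespec='seconds')

print(iso_utc)
```

The `+00:00` suffix records the offset from UTC, so the value can be compared with timestamps gathered on other systems.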

## Interacting with files and paths

We can also extract information from a file then reuse that in another file. 
For example, we could extract the email addresses from `mbox-short.txt` and create
an address book file:

In [24]:
# in this case, file in is already stored in the path_to_file variable

# set up a file name for a file to create
fout = 'email-list.txt'

#establish a list to record emails as they are identified
emails = []

# open the source file to extract emails
with open(path_to_file, 'r') as f:
    for line in f:
        if line.startswith('From:'):
            email = line[6:].rstrip()
            if email not in emails:
                emails.append(email)
print(emails, '\n\n')

# open another file in write mode to write the emails.
with open(fout, 'w') as f:
    for email in emails:
        f.write(email + '\n')

print(open(fout).read())

['stephen.marquard@uct.ac.za', 'louis@media.berkeley.edu', 'zqian@umich.edu', 'rjlowe@iupui.edu', 'cwen@iupui.edu', 'gsilver@umich.edu', 'wagnermr@iupui.edu', 'antranig@caret.cam.ac.uk', 'gopal.ramasammycook@gmail.com', 'david.horwitz@uct.ac.za', 'ray@media.berkeley.edu'] 


stephen.marquard@uct.ac.za
louis@media.berkeley.edu
zqian@umich.edu
rjlowe@iupui.edu
cwen@iupui.edu
gsilver@umich.edu
wagnermr@iupui.edu
antranig@caret.cam.ac.uk
gopal.ramasammycook@gmail.com
david.horwitz@uct.ac.za
ray@media.berkeley.edu
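
The `if email not in emails` check works well for a small file. For longer inputs, one alternative (a sketch, not the approach used above) is to collect every match and remove duplicates afterwards with `dict.fromkeys()`, which keeps the first occurrence of each item in order. The addresses below are made up for illustration:

```python
import io

# a few made-up lines in the same shape as the mbox file (illustrative data only)
sample_mbox = io.StringIO(
    'From: alice@example.org\n'
    'Subject: first message\n'
    'From: bob@example.org\n'
    'From: alice@example.org\n'
)

found = []
for line in sample_mbox:
    if line.startswith('From:'):
        # drop the 'From:' prefix, then any surrounding whitespace
        found.append(line[len('From:'):].strip())

# dictionary keys are unique and keep insertion order
unique_emails = list(dict.fromkeys(found))

print(unique_emails)
```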



## Inventorying Files

For this activity, we are going to expand the use of modules that allow us to interact with the file system. These should be somewhat familiar after we have already looked into basic shell commands.

* `os` assists in using aspects of the operating system, in this case particularly file information and paths. See https://docs.python.org/3/library/os.html
* `os.path` is often called by itself and allows us to interact with file path and directory information. See https://docs.python.org/3/library/os.path.html#module-os.path
* `shutil` gives us access to some shell-like utilities, such as move, copy, rename, and delete. See https://docs.python.org/3/library/shutil.html

We will also use the `csv` module since it will help us to write the information that we gather to a structured data file that can later be opened in Excel or other spreadsheet applications. See https://docs.python.org/3/library/csv.html

In [25]:
# if you haven't yet imported the modules, run this cell
import os
# if you only want to access certain elements of the modules, or to alias them to simpler calls, run this line
#from os.path import join, getsize, getctime
# activities to write out the files to a CSV will use the CSV module
import csv

Once we know what we want in the csv, how do we get that information? We can use the `os` module to get file information. We will use the `os.walk` function to "walk" over the file tree, identify folder lists, paths, and filenames.  

In [26]:
walk_this_directory = os.path.join('data','webfiles-samples')

print(walk_this_directory)

data/webfiles-samples


### Using os.listdir()

We can generate a list of the contents of the directory using the `os.listdir()` function. This list will include the names of all the files and subdirectories in the directory. 

In [27]:
dir_list = os.listdir(walk_this_directory)

print(dir_list)

['video', 'pdf', 'image', 'audio', 'web-files-small-metadata.csv', 'presentation']


Let's use the `listdir()` function to create a manifest of the files in the `pdf` directory:

In [28]:
# create the list
pdf_dir_list = os.listdir(os.path.join(walk_this_directory, 'pdf'))

# set up a file name for a file to create
fout = 'pdf-file-list.txt'

# open outfile in write mode to write the filenames
with open(fout, 'w') as f:
    for filename in pdf_dir_list:
        f.write(filename)
        f.write('\n')
print('wrote',fout)

wrote pdf-file-list.txt


To view the contents of a file quickly, it's possible to use the `.read()` method, like `cat` on the terminal.

In [29]:
print(open(fout).read())

01-1480.pdf
file.pdf
Chapter03.pdf
PFCHEJ.pdf
HR2021 commtext.pdf



We could, for example, use this to calculate the total size of the files in the directory: 

In [30]:
total_size = 0

# set up a path to the directory
pdf_directory = os.path.join(walk_this_directory, 'pdf')

# use listdir() to loop through the directory and tally up the file sizes:
for pdf in os.listdir(pdf_directory):
    total_size = total_size + os.path.getsize(os.path.join(walk_this_directory, 'pdf', pdf))

print('The directory pdf has this many bytes:')
print(total_size)

The directory pdf has this many bytes:
149427


Note: the `listdir()` function creates a list. It's helpful to use when creating a list of the files, but for more complex operations, there are additional options that allow us to iterate through, or loop through the contents of a directory in more useful ways. For example, note above the somewhat tedious reiteration of `os.path.join()` when going through the list of files in the for loop.
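
One way to reduce that repetition while still using `listdir()` is to build each full path once and wrap the tally in a function. The function name `directory_size` and the temporary files are assumptions made for this sketch:

```python
import os
import tempfile

def directory_size(directory):
    '''Return the total size in bytes of the files directly inside directory.'''
    total = 0
    for name in os.listdir(directory):
        full_path = os.path.join(directory, name)   # build the path once per item
        if os.path.isfile(full_path):               # skip subdirectories
            total += os.path.getsize(full_path)
    return total

# build a small throwaway directory to try the function on
with tempfile.TemporaryDirectory() as tmp_dir:
    with open(os.path.join(tmp_dir, 'a.txt'), 'w') as f:
        f.write('12345')            # 5 bytes
    with open(os.path.join(tmp_dir, 'b.txt'), 'w') as f:
        f.write('1234567890')       # 10 bytes
    os.mkdir(os.path.join(tmp_dir, 'subfolder'))
    demo_total = directory_size(tmp_dir)

print(demo_total)
```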

### Using os.scandir()

The previous steps are a good way to record a simple list of files in a directory, but there is still a lot of information missing beyond the names. For example, what are the absolute paths to the files? How big are the files? That information is stored in the filesystem, but we do not get it using the `listdir()` function. 

We can use the `scandir()` function to get an iterator of `os.DirEntry` objects that we can loop through. For example, we may want to check whether items are recognized as files by the system. Each entry has methods such as `is_file()` and `is_dir()` that evaluate whether the item is a file or a directory. We can work with the entries using a `with` ... `as` construction, like we have seen in opening files.
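
Before running `scandir()` on the sample data, here is a minimal sketch of what a directory entry offers. It builds a tiny temporary directory (the names `pdf` and `notes.txt` are assumptions for the demo) and records each entry's name, kind, and size:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    os.mkdir(os.path.join(tmp_dir, 'pdf'))
    with open(os.path.join(tmp_dir, 'notes.txt'), 'w') as f:
        f.write('abc')   # 3 bytes

    summary = []
    with os.scandir(tmp_dir) as entries:
        for entry in entries:
            kind = 'file' if entry.is_file() else 'directory'
            # entry.stat() returns the same kind of result object as os.stat()
            size = entry.stat().st_size if entry.is_file() else None
            summary.append((entry.name, kind, size))

summary.sort()   # scandir() does not promise any particular order
print(summary)
```

Each entry also has a `path` attribute, which saves joining the directory and the name by hand.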

In [31]:
# recall, we have set walk_this_directory to the data/webfiles-samples directory 
walk_this_directory

'data/webfiles-samples'

In [32]:
item_count = 0

with os.scandir(walk_this_directory) as items_list:    
    for entry in items_list:
        item_count += 1
        print('Looking at:',entry.name)
        if entry.is_dir():
            file_list = os.listdir(entry)
            print('This is a directory and contains',len(file_list),'files (',file_list,').')
        if entry.is_file():
            print('This is a file named',entry.name,'that takes up',os.path.getsize(entry),'bytes')

print('Scanned',item_count,'items.')

Looking at: video
This is a directory and contains 4 files ( ['vlwhcssc.asx', '04-04-21full.asf', 'glmp_cig.EQ.wm.p20.t12z', 'oct17cc.asx'] ).
Looking at: pdf
This is a directory and contains 5 files ( ['01-1480.pdf', 'file.pdf', 'Chapter03.pdf', 'PFCHEJ.pdf', 'HR2021 commtext.pdf'] ).
Looking at: image
This is a directory and contains 5 files ( ['13080t.jpg', 'orca.via_.moc_.noaa_.jpg', 'k7989-7x.jpg', 'm237a2f.gif', '1005107061.tif'] ).
Looking at: audio
This is a directory and contains 4 files ( ['11-3250JohnsonvFolinoEtAl.wma', 'NEWSLINE_802AF71F439D401585C6FCB02F358307.mp3', 'mj_telework_exchange_final_100710.mp3', '000727.ram'] ).
Looking at: web-files-small-metadata.csv
This is a file named web-files-small-metadata.csv that takes up 9069 bytes
Looking at: presentation
This is a directory and contains 3 files ( ['BudgetandGrants012710.ppt', 'ADAEMPLOYMENTTaxIncentives.ppt', 'Non-FTE-Trainee-Activities-060109.ppt'] ).
Scanned 6 items.


### Using os.walk()

The `os.walk()` function allows us to do a more complex mapping of the directory. For each directory in the tree, it yields a "tuple" (an immutable sequence) of three items: the path to the folder, a list of the subfolder names, and a list of the filenames. We can use that information to derive folder names and paths to individual files. 
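
To see those three items on a small scale, this sketch builds a two-level temporary tree (the file and folder names are assumptions for the demo) and records what `os.walk()` yields at each step:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # tmp_dir/readme.txt and tmp_dir/pdf/report.pdf
    os.makedirs(os.path.join(tmp_dir, 'pdf'))
    for relative in ('readme.txt', os.path.join('pdf', 'report.pdf')):
        with open(os.path.join(tmp_dir, relative), 'w') as f:
            f.write('x')

    walked = []
    for step in os.walk(tmp_dir):
        folder_path, subfolder_names, file_names = step   # unpack the 3-item tuple
        walked.append((
            os.path.relpath(folder_path, tmp_dir),   # shorten the path for display
            sorted(subfolder_names),
            sorted(file_names),
        ))
    step_type = type(step)

print(step_type)
print(walked)
```

The top folder comes first, followed by each subfolder in turn, which is why a single `for` loop is enough to reach every file in the tree.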

In [33]:
for folderName, subfolders, filenames in os.walk(walk_this_directory):
    # see what this produces
    print('folderName is a type',type(folderName),
          '\nsubfolders is a type',type(subfolders),
          '\nfilenames is a type',type(filenames))

    ## so, each pass through the loop handles one folder in the tree:
    ## the first item is a string for the folder path, 
    ## and the other two are lists of the contained folders and files 

folderName is a type <class 'str'> 
subfolders is a type <class 'list'> 
filenames is a type <class 'list'>
folderName is a type <class 'str'> 
subfolders is a type <class 'list'> 
filenames is a type <class 'list'>
folderName is a type <class 'str'> 
subfolders is a type <class 'list'> 
filenames is a type <class 'list'>
folderName is a type <class 'str'> 
subfolders is a type <class 'list'> 
filenames is a type <class 'list'>
folderName is a type <class 'str'> 
subfolders is a type <class 'list'> 
filenames is a type <class 'list'>
folderName is a type <class 'str'> 
subfolders is a type <class 'list'> 
filenames is a type <class 'list'>


In [34]:
# get information about how many files are in each directory and how much space they take up
for FolderPaths, SubfolderNames, filenames in os.walk(walk_this_directory):
    print(FolderPaths, "consumes", end=" ")
    print(sum(os.path.getsize(os.path.join(FolderPaths, name)) for name in filenames), end=" ")
    print("bytes in", len(filenames), "non-directory files")


data/webfiles-samples consumes 9069 bytes in 1 non-directory files
data/webfiles-samples/video consumes 115695 bytes in 4 non-directory files
data/webfiles-samples/pdf consumes 149427 bytes in 5 non-directory files
data/webfiles-samples/image consumes 497284 bytes in 5 non-directory files
data/webfiles-samples/audio consumes 25856261 bytes in 4 non-directory files
data/webfiles-samples/presentation consumes 289792 bytes in 3 non-directory files


In [35]:
for folderName, subfolders, filenames in os.walk(walk_this_directory):
    print('Current folder:',folderName)
    
    for subfolder in subfolders:
        print('Parent folder:',folderName,'; subfolder:',subfolder)
        
    for filename in filenames:
        print('The file', filename, 
              '\n    This is the folder:', folderName,
             '\n    The filepath is:',os.path.join(folderName, filename))
    
    print('\n')

    ## Note that os.walk() does not list the special entries . and ..

Current folder: data/webfiles-samples
Parent folder: data/webfiles-samples ; subfolder: video
Parent folder: data/webfiles-samples ; subfolder: pdf
Parent folder: data/webfiles-samples ; subfolder: image
Parent folder: data/webfiles-samples ; subfolder: audio
Parent folder: data/webfiles-samples ; subfolder: presentation
The file web-files-small-metadata.csv 
    This is the folder: data/webfiles-samples 
    The filepath is: data/webfiles-samples/web-files-small-metadata.csv


Current folder: data/webfiles-samples/video
The file vlwhcssc.asx 
    This is the folder: data/webfiles-samples/video 
    The filepath is: data/webfiles-samples/video/vlwhcssc.asx
The file 04-04-21full.asf 
    This is the folder: data/webfiles-samples/video 
    The filepath is: data/webfiles-samples/video/04-04-21full.asf
The file glmp_cig.EQ.wm.p20.t12z 
    This is the folder: data/webfiles-samples/video 
    The filepath is: data/webfiles-samples/video/glmp_cig.EQ.wm.p20.t12z
The file oct17cc.asx 
    Thi

In [36]:
file_count = 0

# get information about each of the files
for folderName, subfolders, filenames in os.walk(walk_this_directory):
    
    for filename in filenames:
        file_count += 1
        folder = folderName
        path = os.path.join(folderName, filename)
        # build the absolute path from the full relative path, not just the bare filename
        absolutePath = os.path.abspath(path)
        size = os.path.getsize(path)
        print('Found:', filename, folder, path, size,'\n',absolutePath)
print(f'Counted {file_count} files')
    ## Note that os.walk() does not list the special entries . and ..

Found: web-files-small-metadata.csv data/webfiles-samples data/webfiles-samples/web-files-small-metadata.csv 9069 
 /Users/jajohnst/Desktop/networked-services-labs-2023/data/webfiles-samples/web-files-small-metadata.csv
Found: vlwhcssc.asx data/webfiles-samples/video data/webfiles-samples/video/vlwhcssc.asx 356 
 /Users/jajohnst/Desktop/networked-services-labs-2023/data/webfiles-samples/video/vlwhcssc.asx
Found: 04-04-21full.asf data/webfiles-samples/video data/webfiles-samples/video/04-04-21full.asf 98 
 /Users/jajohnst/Desktop/networked-services-labs-2023/data/webfiles-samples/video/04-04-21full.asf
Found: glmp_cig.EQ.wm.p20.t12z data/webfiles-samples/video data/webfiles-samples/video/glmp_cig.EQ.wm.p20.t12z 8296 
 /Users/jajohnst/Desktop/networked-services-labs-2023/data/webfiles-samples/video/glmp_cig.EQ.wm.p20.t12z
Found: oct17cc.asx data/webfiles-samples/video data/webfiles-samples/video/oct17cc.asx 106945 
 /Users/jajohnst/Desktop/networked-services-labs-2023/data/webfiles-samples/video/oct17cc.asx
Found: 01-1480.pdf data/webfiles-samples/pdf data/webfiles-samples/pdf/01-1480.pdf 49088 
 /Users/jajohnst/Deskto

In [37]:
## get information about each of the files

# first set some counters
fileCount = 0

# and a list fileInfo to hold the information about the file, 
# and another list manifestInfo to hold information about all of the files
fileInfo = list()
manifestInfo = list()

for folderName, subfolders, filenames in os.walk(walk_this_directory):    
    for filename in filenames:
        fileCount += 1
        index = fileCount
        filename = filename 
        folder = folderName
        path = os.path.join(folderName, filename)
        size = os.path.getsize(path)
#        print('Found:', filename, folder, path, size)
        # put the information into a list:
        fileInfo = [
            index,
            filename,
            folder,
            path,
            size
            ]
        # add the list to the list of the lists!
        manifestInfo.append(fileInfo)

print('Found',len(manifestInfo),'items.\n\n',manifestInfo[:2])

Found 22 items.

 [[1, 'web-files-small-metadata.csv', 'data/webfiles-samples', 'data/webfiles-samples/web-files-small-metadata.csv', 9069], [2, 'vlwhcssc.asx', 'data/webfiles-samples/video', 'data/webfiles-samples/video/vlwhcssc.asx', 356]]


There are additional things that the `os` module can tell us. So let's take a moment to pause here to explore some of its additional uses. 

Now that we can handle file paths, we may want additional information about the files that we are taking a look at. 
Our goal is to create a script that inventories a specific directory and records metadata about the files in that directory.
Using `os.path`, it is possible to get the size of the file (bytes), as well as additional path information, such as the full path to the file (aka, the "absolute path"), and the "base" name of the file (just the name of the file), as well as the file extension. 

To retrieve this metadata:

* `os.path.abspath(path)` returns a string that has the absolute path to the requested argument; this can also be used to convert a relative path to an absolute path
* `os.path.relpath(path, start)` returns a string of the relative path to the given argument "path"; the optional "start" argument sets the location the path is calculated from (by default, the current directory)
* `os.path.getsize(path)` returns the size of the requested file in bytes; a relative path is resolved against the current working directory, so, as illustrated above, you may need to join the folder and filename to create the full path
* `os.path.splitext(file)` returns a tuple with the path without the extension as the first element and the extension of the file as the second element
* `os.path.getmtime(file)` returns a timestamp with the most recent modification time recorded; this should work on both macOS and Windows platforms, but some file metadata will be specific to your operating system; the returned value is a number of seconds since the epoch, and to 'decode' the time requires a time conversion, as covered above in the section on handling time and date information
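
These functions can be bundled into a single helper that returns one record per file. The sketch below is one possible arrangement; the function name `describe_file`, the dictionary keys, and the temporary `report.pdf` file are all assumptions made for illustration:

```python
import datetime
import os
import tempfile

def describe_file(path):
    '''Return a dictionary of basic metadata for the file at path.'''
    stats = os.stat(path)
    modified = datetime.datetime.fromtimestamp(stats.st_mtime, tz=datetime.timezone.utc)
    return {
        'absolute_path': os.path.abspath(path),
        'name': os.path.basename(path),
        'extension': os.path.splitext(path)[1],
        'size': stats.st_size,
        'modified_utc': modified.isoformat(timespec='seconds'),
    }

# try it on a throwaway file
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'report.pdf')
    with open(demo_path, 'wb') as f:
        f.write(b'%PDF-demo')   # 9 bytes of placeholder content, not a real PDF
    demo_info = describe_file(demo_path)

print(demo_info)
```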

## Generate file hashes for integrity (fixity)

There is an additional Python library that can help us create fixity information. The `hashlib` library can create many different types of hash digests, including MD5, SHA256, and SHA512.

The following includes a helper function to create secure hashes in [md5](https://datatracker.ietf.org/doc/html/rfc1321.html) or sha256 (defined in the [NIST standard FIPS 180-4](https://csrc.nist.gov/publications/detail/fips/180/4/final) as of September 2022).

In [38]:
# to run the following, you will need the hashlib module:
import hashlib


In [39]:

def get_checksum(filePath, checksum_type):
    '''This is a helper function to create a checksum. 
    In this example we will focus on MD5, which can be used to check data integrity.
    
    The filePath argument should be a string (or path-like object) representing a valid path.
    The checksum_type argument should be a valid type of checksum.
    
    The function returns the string of characters for an MD5 or SHA256 checksum.
    This function only allows you to create MD5 or SHA256 and will raise a ValueError for other types.'''
    checksum_type = checksum_type.lower()

    # read the file in binary mode so the digest reflects the exact bytes on disk
    with open(filePath, 'rb') as f:
        data = f.read()
        if checksum_type == 'md5':
            hash_string = hashlib.md5(data).hexdigest()
        elif checksum_type == 'sha256':
            hash_string = hashlib.sha256(data).hexdigest()
        else:
            raise ValueError(f'{checksum_type} is not a hash function supported by this program. You must ask for MD5 or SHA256.')
    return hash_string
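
The helper above reads the whole file into memory, which is fine for small files. For large files, a common alternative is to feed the hash object in chunks. The following is a sketch of that variation, not a replacement for the function above; the name `get_checksum_chunked`, the default chunk size, and the demo file are assumptions. Note that `hashlib.new()` accepts any algorithm name available in your Python build, not just MD5 and SHA256.

```python
import hashlib
import os
import tempfile

def get_checksum_chunked(file_path, checksum_type='sha256', chunk_size=65536):
    '''Return the hex digest of a file, reading it chunk_size bytes at a time.'''
    digest = hashlib.new(checksum_type.lower())   # raises ValueError for unknown names
    with open(file_path, 'rb') as f:
        # keep reading until read() returns an empty bytes object
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'demo.bin')
    payload = b'fixity check ' * 10000   # 130,000 bytes, more than one chunk
    with open(demo_path, 'wb') as f:
        f.write(payload)
    chunked_md5 = get_checksum_chunked(demo_path, 'md5')
    chunked_sha256 = get_checksum_chunked(demo_path, 'sha256')

# the chunked digests should match hashing the same bytes in one go
print(chunked_md5 == hashlib.md5(payload).hexdigest())
print(chunked_sha256 == hashlib.sha256(payload).hexdigest())
```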

In [40]:
# for example, look at the pdfs

for file in os.scandir(os.path.join(walk_this_directory, 'pdf')):
    print(file,type(file))
    if os.path.isfile(file):
        path = os.path.join(walk_this_directory, 'pdf', file.name)
        print(path)
        md5digest = get_checksum(file, 'md5')
        sha256digest = get_checksum(file, 'sha256')
        print(f'md5digest: {md5digest}\nsha256_digest: {sha256digest}', end='\n\n')
    else:
        print(f'ERROR: No file found at {file}.')

<DirEntry '01-1480.pdf'> <class 'posix.DirEntry'>
data/webfiles-samples/pdf/01-1480.pdf
md5digest: 01c8ed0f4635e7974734087ffec0087d
sha256_digest: a9b5603c9cdb8e9b5d362758dc1e0b3c10bb250fade691d58e3bc2467dfa9020

<DirEntry 'file.pdf'> <class 'posix.DirEntry'>
data/webfiles-samples/pdf/file.pdf
md5digest: b76bb219e68469283095b12a0b6aef2c
sha256_digest: aa3749f994f8869e3d982514ea4dd17e08812feb40e70c03b7e6b8503f802e69

<DirEntry 'Chapter03.pdf'> <class 'posix.DirEntry'>
data/webfiles-samples/pdf/Chapter03.pdf
md5digest: b4c4ac3b610fc0057ef57acaace0c78a
sha256_digest: 69781aa7568c6dd54443c11cbc8e3ff4223c36b8d5f1f2adc3d387b4cae05694

<DirEntry 'PFCHEJ.pdf'> <class 'posix.DirEntry'>
data/webfiles-samples/pdf/PFCHEJ.pdf
md5digest: fec5b4c417432aa2de0a482186f3d9d5
sha256_digest: d38d370486dda2eb8a5c677c5d2e5d8d174301e30011776ea2689a2270f81f87

<DirEntry 'HR2021 commtext.pdf'> <class 'posix.DirEntry'>
data/webfiles-samples/pdf/HR2021 commtext.pdf
md5digest: 1de98a2344bec62c650f8adb4b8f33e0
sha2

# Conclusion: Write the results to a CSV

Now that we can gather the information, let's write to a file:

In [41]:
import csv

In [42]:
## write to a CSV
# To do: Create a header row, write rows of file information, close the complete file

# set up the csv, create a header row
headers = [
    'index',
    'filename',
    'in_folder_path',
    'full_file_path',
    'size'
    ]

# write the information using csv.writer()
# newline='' lets the csv module control line endings (avoids blank rows on Windows)
with open('file-manifest.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    print('writing file manifest CSV')
    writer.writerow(headers)
    for file in manifestInfo:
        print(f'adding {file[1]}')
        writer.writerow(file)
    print('Wrote the file manifest')

writing file manifest CSV
adding web-files-small-metadata.csv
adding vlwhcssc.asx
adding 04-04-21full.asf
adding glmp_cig.EQ.wm.p20.t12z
adding oct17cc.asx
adding 01-1480.pdf
adding file.pdf
adding Chapter03.pdf
adding PFCHEJ.pdf
adding HR2021 commtext.pdf
adding 13080t.jpg
adding orca.via_.moc_.noaa_.jpg
adding k7989-7x.jpg
adding m237a2f.gif
adding 1005107061.tif
adding 11-3250JohnsonvFolinoEtAl.wma
adding NEWSLINE_802AF71F439D401585C6FCB02F358307.mp3
adding mj_telework_exchange_final_100710.mp3
adding 000727.ram
adding BudgetandGrants012710.ppt
adding ADAEMPLOYMENTTaxIncentives.ppt
adding Non-FTE-Trainee-Activities-060109.ppt
Wrote the file manifest
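
It can be reassuring to read a manifest back and confirm what was written. The sketch below writes the first two rows shown earlier to a temporary file (the temporary location is an assumption for the demo) and reads them back with `csv.DictReader`, which uses the header row as dictionary keys:

```python
import csv
import os
import tempfile

headers = ['index', 'filename', 'in_folder_path', 'full_file_path', 'size']
rows = [
    [1, 'web-files-small-metadata.csv', 'data/webfiles-samples',
     'data/webfiles-samples/web-files-small-metadata.csv', 9069],
    [2, 'vlwhcssc.asx', 'data/webfiles-samples/video',
     'data/webfiles-samples/video/vlwhcssc.asx', 356],
]

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, 'file-manifest.csv')

    # newline='' is recommended by the csv module for files it reads or writes
    with open(csv_path, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(headers)
        writer.writerows(rows)

    with open(csv_path, 'r', newline='') as f:
        read_back = list(csv.DictReader(f))

print(len(read_back))
print(read_back[0]['filename'], read_back[0]['size'])
```

Values come back from a CSV as strings, so a size such as `9069` would need `int()` before doing arithmetic with it.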


## Reflection Activities

1. Write a script that uses `os.listdir()` for each of the directories in the `webfiles-samples` directory. You can put in the path names directly as variables, but you should use the `os.path.join()` function to create filepaths that do not depend on your inputting the exact filepath string, which will vary across operating systems.
1. Write a script that uses `os.scandir()` to check whether or not the entities in the directory are files or directories. The script should output a count of files and a count of directories.
1. Examine the examples above that use `os.walk()`. What is the difference between this and the previous two functions? In some ways it lets you get deeper into the file structure, so please explain your observation in a sentence or two.  
1. Create a script that will print out an inventory of all the files in the assets folder `webfiles-samples`. The list should include the filename of the file, the directory path for the file, the full path to the file, and the file size. You may include any other information that you think is important. Call your script file `inventory_script.py`.
1. Extend the previous script: Create a script that will print out an inventory of all the files in the assets folder `webfiles-samples`. The inventory should be a CSV file, and it should include the filename of the file, the directory path for the file, the full path to the file, and the file size. You may include any other information that you think is important. Call your script file `inventory_script.py`.
1. Extend the above script, using the techniques demonstrated here, by adding a way to determine the file extension of each file, then add the extension to the CSV output. (Hint: you could split the filename string, right?)
1. Write a script that can walk through a series of directories and identify files based on their file extension. For example, perhaps you want to count the number of .pdf files or .jpg files. Create a file that can look for this information and then tally the files. Then, have the program output the list of filenames and filepaths in a CSV file. Call this file `extension_detector.py`. 

### Function exercises 

1. Building on the above examples, can you a) write functions that bundle code to ask for a directory? You could call this function `create_manifest_information` and it should be able to accept a path to a directory as an argument and return the manifestInfo list. And b) write a function that would accept the manifestInfo list as an argument and create a CSV? 
1. Write a script that creates a `main` and `derivative` directory within a subdirectory that has the file's name as its name. For example, if there are two files, one named `001.jpg` and another named `audition.wav`, there should be a directory named `001` and another named `audition`. Within these, there should be `main` and `derivative` folders. The original files should be in the `main` folder. Call this file `main_and_derivatives.py`.