# Writing Files, Inventorying Files

This notebook goes through the basics of writing files. We look through one basic example and one that extracts specific information from one file then writes it to a new file. After that, we look at a few modules that will help us to build an inventory of basic system information including filenames, locations (paths), and sizes. Once we identify this information we can use it to create an inventory manifest. 

## Python libraries/modules used

* `os` module ([docs](https://docs.python.org/3/library/os.html)) for navigating and interacting with the file system
* `os.path` module ([docs](https://docs.python.org/3/library/os.path.html)) for working with paths
* `shutil` module ([docs](https://docs.python.org/3/library/shutil.html)) for copying, moving, removing


## Key points / Questions

* How can I interact with the files on my computer using Python?
* How can I work with file paths in various operating systems?
* How do I move around the file system in Python? 
* How can I find out the current location on the file system, and how can I look in other locations, using Python?
* How can I identify file information and metadata using Python?

First, let's look at writing files. 

## Writing Files

The basic tool for writing files is the `write()` method of a file object. It can write the contents of a single string argument or 
multi-line content. Unlike in other environments like the GUI or shell, where opening a file is often implicit, 
you need to `open()` and then `close()` files yourself when working in Python. You cannot write to a file that has not been opened, and a file that is not closed may be left incomplete or corrupted. 

Fortunately, we can usually use the contextual opener:

```python
with open(file, 'w') as f:
    f.write(line)
```

This will automatically close the file when the `with` block completes. The `w` argument indicates that the file is opened in "write" mode. If the file doesn't exist, it will be created; if it already exists, its contents will be overwritten. 
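
The behavior of write mode can be sketched as follows. This is a minimal illustration, assuming a throwaway temporary directory and a hypothetical file name `demo.txt` (neither is part of the lab repository). It also shows append mode (`'a'`), which this notebook does not otherwise use:

```python
import os
import tempfile

# Work in a throwaway directory so nothing in the repo is touched.
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, 'demo.txt')  # hypothetical file name

# 'w' creates the file if it is missing and empties it if it exists.
with open(demo_path, 'w') as f:
    f.write('first line\n')

# The file object is closed as soon as the with block ends.
closed_after_block = f.closed

# Opening with 'w' again replaces the earlier contents.
with open(demo_path, 'w') as f:
    f.write('second line\n')

# 'a' (append) keeps the existing contents and adds to the end.
with open(demo_path, 'a') as f:
    f.write('third line\n')

with open(demo_path, 'r') as f:
    contents = f.read()

print(closed_after_block)
print(contents)
```

Only the second and third lines survive, because the second `open(..., 'w')` discarded what the first one wrote.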

In [1]:
# Basic use of open() and write()

line = 'Believe that life is worth living, and your belief will help create the fact.'
# Credit William James https://en.wikiquote.org/wiki/William_James

fout = open('quote-output.txt', 'w')

fout.write(line)

fout.close()

In [2]:
# use the with open() syntax to check if the file is there

with open('quote-output.txt', 'r') as f:
    print(f.read())

Believe that life is worth living, and your belief will help create the fact.


## OS Navigation and File Paths

Now, let's look at moving around and managing file paths. This uses the `os` and `os.path` libraries. We can use the `os` module to interact with the operating system. For example, 
it provides functions that work like the file navigation commands on the terminal.

In [4]:
import os
import os.path

The `getcwd()` function is like `pwd`:

In [5]:
os.getcwd()

'/Users/jajohnst/Desktop/networked-services-labs-2022'

The `chdir()` function allows you to move around, like `cd`:

In [6]:
os.chdir('data')

In [7]:
os.getcwd()

'/Users/jajohnst/Desktop/networked-services-labs-2022/data'

In [8]:
os.chdir('..')

In [9]:
os.getcwd()

'/Users/jajohnst/Desktop/networked-services-labs-2022'

You should now be in the repo's top directory (`networked-services-labs-2022`).

The `os.path` functions help to work with paths. This can also help to make your scripts more platform agnostic (they will usually work in Mac, Windows, and *nix settings).

In [10]:
os.getcwd()

'/Users/jajohnst/Desktop/networked-services-labs-2022'

In [11]:
file = 'mbox-short.txt'

path_to_file = os.path.join('data','emails',file)

print(path_to_file)

data/emails/mbox-short.txt
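
As a small sketch of why `os.path.join()` helps with portability, the example below builds the same relative path from its parts and then takes it apart again. The variable names are illustrative only; the path does not need to exist for these functions to work, since they only manipulate strings:

```python
import os.path

# Build the same relative path used above from its parts.
parts = ['data', 'emails', 'mbox-short.txt']
joined = os.path.join(*parts)

# os.sep is the separator for the current platform ('/' or '\\'),
# so joining by hand with os.sep gives the same result.
expected = os.sep.join(parts)

# Take the path apart again.
folder, name = os.path.split(joined)        # directory part, file name
stem, extension = os.path.splitext(name)    # name without extension, extension

print(joined == expected)
print(folder, name, stem, extension)
```

Because the separator is chosen by Python, the same script can produce `data/emails/mbox-short.txt` on Mac or Linux and `data\emails\mbox-short.txt` on Windows.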


## Interacting with files and paths

We can also extract information from a file then reuse that in another file. 
For example, we could extract the email addresses from `mbox-short.txt` and create
an address book file:

In [15]:
# set up a file name for a file to create
fout = 'email-list.txt'

#establish a list to record emails as they are identified
emails = []

# open the source file to extract emails
with open(path_to_file, 'r') as f:
    for line in f:
        if line.startswith('From:'):
            email = line[6:].strip()
            if email not in emails:
                emails.append(email)
print(emails, '\n\n')

# open another file in write mode to write the emails.
with open(fout, 'w') as f:
    for email in emails:
        f.write(email + '\n')

print(open(fout).read())

['stephen.marquard@uct.ac.za', 'louis@media.berkeley.edu', 'zqian@umich.edu', 'rjlowe@iupui.edu', 'cwen@iupui.edu', 'gsilver@umich.edu', 'wagnermr@iupui.edu', 'antranig@caret.cam.ac.uk', 'gopal.ramasammycook@gmail.com', 'david.horwitz@uct.ac.za', 'ray@media.berkeley.edu'] 


stephen.marquard@uct.ac.za
louis@media.berkeley.edu
zqian@umich.edu
rjlowe@iupui.edu
cwen@iupui.edu
gsilver@umich.edu
wagnermr@iupui.edu
antranig@caret.cam.ac.uk
gopal.ramasammycook@gmail.com
david.horwitz@uct.ac.za
ray@media.berkeley.edu



## Inventorying Files

For this activity, we are going to expand the use of modules that allow us to interact with the file system. These should be somewhat familiar after we have already looked into basic shell commands.

* `os` assists in using aspects of the operating system, in this case particularly file information and paths. See https://docs.python.org/3/library/os.html.
* `os.path` is often imported by itself and allows us to interact with file path and directory information. See https://docs.python.org/3/library/os.path.html#module-os.path.
* `shutil` gives us access to some shell utilities, like move, copy, rename, delete. See https://docs.python.org/3/library/shutil.html?highlight=shutils.
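
Since `shutil` is not demonstrated elsewhere in this notebook, here is a minimal sketch of its copy, move, and remove operations. It assumes a temporary working directory and made-up file names (`original.txt`, `copy.txt`, `archive`), so it does not touch the lab data:

```python
import os
import shutil
import tempfile

# Set up a throwaway directory with one small file in it.
work_dir = tempfile.mkdtemp()
source = os.path.join(work_dir, 'original.txt')
with open(source, 'w') as f:
    f.write('sample contents\n')

# Copy the file (copy2 also tries to preserve timestamps).
copy_path = shutil.copy2(source, os.path.join(work_dir, 'copy.txt'))

# Move the copy into a new subdirectory.
archive_dir = os.path.join(work_dir, 'archive')
os.mkdir(archive_dir)
moved_path = shutil.move(copy_path, os.path.join(archive_dir, 'copy.txt'))

remaining = sorted(os.listdir(work_dir))
print(remaining)

# Remove the whole temporary tree, including its contents.
shutil.rmtree(work_dir)
print(os.path.exists(work_dir))
```

Note that `shutil.rmtree()` deletes a directory and everything in it without asking, so it is worth double-checking the path before running it on real data.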

We will also use the `csv` module since it will help us to write the information that we gather to a structured data file that can later be opened in Excel or other spreadsheet applications. See https://docs.python.org/3/library/csv.html

In [4]:
# if you haven't yet imported the modules, run this cell
import os
from os.path import join, getsize, getctime
import csv

Once we know what we want in the csv, how do we get that information? We can use the `os` module to get file information. We will use the `os.walk` function to "walk" over the file tree, identify folder lists, paths, and filenames.  

In [17]:
walk_this_directory = os.path.join('data','webfiles-samples')

print(walk_this_directory)

data/webfiles-samples


### Using os.listdir()

We can generate a list of the files in the directory using the `os.listdir()` function. This list will include the file names for all the files in the directory. 

In [18]:
dir_list = os.listdir(walk_this_directory)

print(dir_list)

['video', '.DS_Store', 'pdf', 'image', 'audio', 'web-files-small-metadata.csv', 'presentation']


Let's use the `listdir()` function to create a manifest of the files in the `pdf` directory:

In [19]:
# create the list
pdf_dir_list = os.listdir(os.path.join(walk_this_directory, 'pdf'))

# set up a file name for a file to create
fout = 'pdf-file-list.txt'

# open outfile in write mode to write the filenames
with open(fout, 'w') as f:
    for filename in pdf_dir_list:
        f.write(filename)
        f.write('\n')
print('wrote',fout)

wrote pdf-file-list.txt


To view the contents of a file quickly, it's possible to use the `.read()` method, much like `cat` on the terminal.

In [20]:
print(open(fout).read())

01-1480.pdf
file.pdf
Chapter03.pdf
PFCHEJ.pdf
HR2021 commtext.pdf



### Using os.scandir()

The previous steps are a good way to record a simple list of files in a directory, but a lot of information about each file is still missing. For example, what are the absolute paths to the files? How big are the files? That information is stored in the filesystem, but we do not get it using the `listdir()` function. 

We can use the `scandir()` function to get entries that we can iterate through. For example, we may want to check whether items are recognized as files by the system. Each entry has methods such as `is_file()` and `is_dir()` that evaluate whether the item is a file or a directory. We can use `scandir()` in a `with` ... `as` construction, like we have seen in opening files, so that the scan is closed when we are finished.

In [26]:
item_count = 0

with os.scandir(walk_this_directory) as items_list:    
    for entry in items_list:
        item_count += 1
        print('Looking at:',entry)
        if entry.is_dir():
            file_list = os.listdir(os.path.join(entry))
            print('This is a directory and contains',len(file_list),'files (',file_list,').')
        if entry.is_file():
            print('This is a file named',entry,'that takes up',os.path.getsize(entry),'bytes')

print('Scanned',item_count,'items.')

Looking at: <DirEntry 'video'>
This is a directory and contains 4 files ( ['vlwhcssc.asx', '04-04-21full.asf', 'glmp_cig.EQ.wm.p20.t12z', 'oct17cc.asx'] ).
Looking at: <DirEntry '.DS_Store'>
This is a file named <DirEntry '.DS_Store'> that takes up 6148 bytes
Looking at: <DirEntry 'pdf'>
This is a directory and contains 5 files ( ['01-1480.pdf', 'file.pdf', 'Chapter03.pdf', 'PFCHEJ.pdf', 'HR2021 commtext.pdf'] ).
Looking at: <DirEntry 'image'>
This is a directory and contains 5 files ( ['13080t.jpg', 'orca.via_.moc_.noaa_.jpg', 'k7989-7x.jpg', 'm237a2f.gif', '1005107061.tif'] ).
Looking at: <DirEntry 'audio'>
This is a directory and contains 4 files ( ['11-3250JohnsonvFolinoEtAl.wma', 'NEWSLINE_802AF71F439D401585C6FCB02F358307.mp3', 'mj_telework_exchange_final_100710.mp3', '000727.ram'] ).
Looking at: <DirEntry 'web-files-small-metadata.csv'>
This is a file named <DirEntry 'web-files-small-metadata.csv'> that takes up 9069 bytes
Looking at: <DirEntry 'presentation'>
This is a directory
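
The entries returned by `scandir()` carry more than a printable label. As a hedged sketch, the example below builds a tiny temporary directory (the names `pdf` and `notes.txt` are made up for illustration) and reads each entry's `name`, `path`, and size via `stat()`:

```python
import os
import tempfile

# Build a small throwaway directory: one subfolder and one 5-byte file.
scan_dir = tempfile.mkdtemp()
os.mkdir(os.path.join(scan_dir, 'pdf'))
with open(os.path.join(scan_dir, 'notes.txt'), 'w') as f:
    f.write('hello')

entries = {}
with os.scandir(scan_dir) as items:
    for entry in items:
        kind = 'directory' if entry.is_dir() else 'file'
        size = entry.stat().st_size   # size in bytes, as reported by the filesystem
        entries[entry.name] = (kind, size)
        # entry.name is just the name; entry.path includes the scanned folder
        print(entry.name, kind, entry.path, size)
```

Using `entry.name` in a message gives the plain filename rather than the `<DirEntry '...'>` label that appears when the entry object itself is printed.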

### Using os.walk()

The `os.walk()` function allows us to do a more complex mapping of the directory. For each directory in the tree, it yields a "tuple" (a small, immutable sequence) of three items: the path to the current folder, a list of its subfolder names, and a list of its filenames. We can unpack that tuple to derive folder names and paths to individual files. 

In [27]:
for folderName, subfolders, filenames in os.walk(walk_this_directory):
    # see what this produces
    print('folderName is a type',type(folderName),
          '\nsubfolders is a type',type(subfolders),
          '\nfilenames is a type',type(filenames))

    ## so, each pass through the loop handles one folder in the tree: 
    ## the first item is a string holding the path to that folder, 
    ## and the other two are lists of the folders and files it contains 

folderName is a type <class 'str'> 
subfolders is a type <class 'list'> 
filenames is a type <class 'list'>
folderName is a type <class 'str'> 
subfolders is a type <class 'list'> 
filenames is a type <class 'list'>
folderName is a type <class 'str'> 
subfolders is a type <class 'list'> 
filenames is a type <class 'list'>
folderName is a type <class 'str'> 
subfolders is a type <class 'list'> 
filenames is a type <class 'list'>
folderName is a type <class 'str'> 
subfolders is a type <class 'list'> 
filenames is a type <class 'list'>
folderName is a type <class 'str'> 
subfolders is a type <class 'list'> 
filenames is a type <class 'list'>
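
To see what those three values contain, not just their types, here is a minimal sketch that walks a small temporary tree. The folder and file names are invented for the example and are not the lab's sample data:

```python
import os
import tempfile

# Build a small tree: root/readme.txt, root/image/, root/pdf/a.pdf, root/pdf/b.pdf
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, 'image'))
os.makedirs(os.path.join(root, 'pdf'))
for relative in [('readme.txt',), ('pdf', 'a.pdf'), ('pdf', 'b.pdf')]:
    with open(os.path.join(root, *relative), 'w') as f:
        f.write('x')

walked = {}
for folder_name, subfolders, filenames in os.walk(root):
    # Record each folder relative to the starting point ('.' is the root itself).
    key = os.path.relpath(folder_name, root)
    walked[key] = (sorted(subfolders), sorted(filenames))
    print(key, sorted(subfolders), sorted(filenames))
```

The loop runs once per folder (three times here), which is why the type report above repeats once for every folder in `webfiles-samples`.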


In [30]:
# get information about how many files are in each directory and how much space they take up
for FolderPaths, SubfolderNames, filenames in os.walk(walk_this_directory):
    print(FolderPaths, "consumes", end=" ")
    print(sum(os.path.getsize(os.path.join(FolderPaths, name)) for name in filenames), end=" ")
    print("bytes in", len(filenames), "non-directory files")


data/webfiles-samples consumes 15217 bytes in 2 non-directory files
data/webfiles-samples/video consumes 115695 bytes in 4 non-directory files
data/webfiles-samples/pdf consumes 149427 bytes in 5 non-directory files
data/webfiles-samples/image consumes 497284 bytes in 5 non-directory files
data/webfiles-samples/audio consumes 25856261 bytes in 4 non-directory files
data/webfiles-samples/presentation consumes 289792 bytes in 3 non-directory files


In [34]:
for folderName, subfolders, filenames in os.walk(walk_this_directory):
    print('Current folder:',folderName)
    
    for subfolder in subfolders:
        print('Parent folder:',folderName,'; subfolder:',subfolder)
        
    for filename in filenames:
        print('The file', filename, 
              '\n    This is the folder:', folderName,
             '\n    The filepath is:',os.path.join(folderName, filename))
    
    print('\n')

    ## Note that os.walk() does not list the special entries . and .. (hidden files like .DS_Store are listed)

Current folder: data/webfiles-samples
Parent folder: data/webfiles-samples ; subfolder: video
Parent folder: data/webfiles-samples ; subfolder: pdf
Parent folder: data/webfiles-samples ; subfolder: image
Parent folder: data/webfiles-samples ; subfolder: audio
Parent folder: data/webfiles-samples ; subfolder: presentation
The file .DS_Store 
    This is the folder: data/webfiles-samples 
    The filepath is: data/webfiles-samples/.DS_Store
The file web-files-small-metadata.csv 
    This is the folder: data/webfiles-samples 
    The filepath is: data/webfiles-samples/web-files-small-metadata.csv


Current folder: data/webfiles-samples/video
The file vlwhcssc.asx 
    This is the folder: data/webfiles-samples/video 
    The filepath is: data/webfiles-samples/video/vlwhcssc.asx
The file 04-04-21full.asf 
    This is the folder: data/webfiles-samples/video 
    The filepath is: data/webfiles-samples/video/04-04-21full.asf
The file glmp_cig.EQ.wm.p20.t12z 
    This is the folder: data/webfil

In [35]:
# get information about each of the files
for folderName, subfolders, filenames in os.walk(walk_this_directory):
    
    for filename in filenames:
        folder = folderName
        path = os.path.join(folderName, filename)
        # build the absolute path from the full relative path, not from the bare filename
        absolutePath = os.path.abspath(path)
        size = os.path.getsize(path)
        print('Found:', filename, folder, path, size,'\n',absolutePath)

    ## Note that os.walk() does not list the special entries . and .. (hidden files like .DS_Store are listed)

Found: .DS_Store data/webfiles-samples data/webfiles-samples/.DS_Store 6148 
 /Users/jajohnst/Desktop/networked-services-labs-2022/data/webfiles-samples/.DS_Store
Found: web-files-small-metadata.csv data/webfiles-samples data/webfiles-samples/web-files-small-metadata.csv 9069 
 /Users/jajohnst/Desktop/networked-services-labs-2022/data/webfiles-samples/web-files-small-metadata.csv
Found: vlwhcssc.asx data/webfiles-samples/video data/webfiles-samples/video/vlwhcssc.asx 356 
 /Users/jajohnst/Desktop/networked-services-labs-2022/data/webfiles-samples/video/vlwhcssc.asx
Found: 04-04-21full.asf data/webfiles-samples/video data/webfiles-samples/video/04-04-21full.asf 98 
 /Users/jajohnst/Desktop/networked-services-labs-2022/data/webfiles-samples/video/04-04-21full.asf
Found: glmp_cig.EQ.wm.p20.t12z data/webfiles-samples/video data/webfiles-samples/video/glmp_cig.EQ.wm.p20.t12z 8296 
 /Users/jajohnst/Desktop/networked-services-labs-2022/data/webfiles-samples/video/glmp_cig.EQ.wm.p20.t12z
Found: oct17cc.asx data/webfiles-samples/video data/webfiles-samples/video/oct17cc.asx 106945 
 /Users/jajohnst/Desktop/networked-ser

In [36]:
## get information about each of the files

# first set some counters
fileCount = 0

# and a list to hold the information about one file (fileInfo), and another to hold every file's fileInfo (manifestInfo)
fileInfo = list()
manifestInfo = list()

for folderName, subfolders, filenames in os.walk(walk_this_directory):    
    for filename in filenames:
        fileCount += 1
        index = fileCount
        folder = folderName
        path = os.path.join(folderName, filename)
        size = os.path.getsize(path)
#        print('Found:', filename, folder, path, size)

        fileInfo = [
            index,
            filename,
            folder,
            path,
            size
            ]
        manifestInfo.append(fileInfo)

print('Found',len(manifestInfo),'items.\n\n',manifestInfo)

Found 23 items.

 [[1, '.DS_Store', 'data/webfiles-samples', 'data/webfiles-samples/.DS_Store', 6148], [2, 'web-files-small-metadata.csv', 'data/webfiles-samples', 'data/webfiles-samples/web-files-small-metadata.csv', 9069], [3, 'vlwhcssc.asx', 'data/webfiles-samples/video', 'data/webfiles-samples/video/vlwhcssc.asx', 356], [4, '04-04-21full.asf', 'data/webfiles-samples/video', 'data/webfiles-samples/video/04-04-21full.asf', 98], [5, 'glmp_cig.EQ.wm.p20.t12z', 'data/webfiles-samples/video', 'data/webfiles-samples/video/glmp_cig.EQ.wm.p20.t12z', 8296], [6, 'oct17cc.asx', 'data/webfiles-samples/video', 'data/webfiles-samples/video/oct17cc.asx', 106945], [7, '01-1480.pdf', 'data/webfiles-samples/pdf', 'data/webfiles-samples/pdf/01-1480.pdf', 49088], [8, 'file.pdf', 'data/webfiles-samples/pdf', 'data/webfiles-samples/pdf/file.pdf', 1538], [9, 'Chapter03.pdf', 'data/webfiles-samples/pdf', 'data/webfiles-samples/pdf/Chapter03.pdf', 51919], [10, 'PFCHEJ.pdf', 'data/webfiles-samples/pdf', 'dat

There are additional things that the `os` module can tell us. So let's take a moment to pause here to explore some of its additional uses. 

Now that we can handle file paths, we may want additional information about the files that we are taking a look at. 
Our goal is to create a script that inventories a specific directory and records metadata about the files in that directory.
Using `os.path`, it is possible to get the size of the file (bytes), as well as additional path information, such as the full path to the file (aka the "absolute path"), the "base" name of the file (just the name of the file), and the file extension. 

To retrieve this metadata:

* `os.path.abspath(path)` returns a string that has the absolute path to the requested argument; this can also be used to convert a relative path to an absolute path
* `os.path.relpath(path, start)` returns a string of the relative path to the given argument "path," and you can specify where to "start", that is, the location the path is relative to
* `os.path.getsize(path)` returns the size of the requested file location in bytes; note that a relative path is resolved against the current working directory, so a bare filename is not enough for a file in another folder, and as illustrated above you may need to join the folder and filename first
* `os.path.splitext(file)` returns a tuple with the first part of the filename as the first element and the extension of the file as the second element
* `os.path.getmtime(file)` returns a timestamp with the most recent modification time recorded; this should work on both Mac and Windows platforms, but some file metadata will be specific to your operating system; the returned value will be an epoch number (seconds since January 1, 1970), and to 'decode' the time will require a time conversion, which we will cover later
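
The functions in this list can be tried together in a short sketch. It assumes a temporary directory and a made-up file, `report.pdf`, containing ten bytes. The `time.strftime()` conversion at the end is one possible way to make the timestamp readable and is included only as a preview:

```python
import os
import tempfile
import time

# Create a 10-byte sample file in a throwaway directory.
meta_dir = tempfile.mkdtemp()
sample = os.path.join(meta_dir, 'report.pdf')   # hypothetical file name
with open(sample, 'wb') as f:
    f.write(b'0123456789')

absolute = os.path.abspath(sample)                    # full path from the filesystem root
relative = os.path.relpath(sample, start=meta_dir)    # path as seen from meta_dir
size = os.path.getsize(sample)                        # size in bytes
stem, extension = os.path.splitext(os.path.basename(sample))
modified = os.path.getmtime(sample)                   # seconds since the epoch (a float)

# Convert the epoch number to a readable local date and time.
readable = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(modified))

print(absolute)
print(relative, size, stem, extension)
print(modified, readable)
```

The extension returned by `splitext()` keeps its leading dot (`.pdf`), which is worth remembering when comparing extensions.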

## Getting file hashes

There is an additional Python library that can help us create fixity information. The `hashlib` library can create many different types of hash digests, including MD5, SHA256, and SHA512.
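
As a rough sketch of how a fixity value is produced, the helper below (a hypothetical function named `hash_file`, not part of `hashlib`) reads a file in binary mode a chunk at a time and feeds the bytes to a hash object. Reading in chunks is an assumption made here so that large files do not need to fit in memory; the sample file is created in a temporary directory:

```python
import hashlib
import os
import tempfile

def hash_file(path, algorithm='md5', chunk_size=65536):
    """Return the hex digest of the file at path (illustrative helper)."""
    hasher = hashlib.new(algorithm)
    # Open in binary mode: digests are computed over bytes, not text.
    with open(path, 'rb') as f:
        # Keep reading until read() returns an empty bytes object.
        for chunk in iter(lambda: f.read(chunk_size), b''):
            hasher.update(chunk)
    return hasher.hexdigest()

# Create a sample file larger than one chunk.
hash_dir = tempfile.mkdtemp()
sample = os.path.join(hash_dir, 'sample.bin')   # hypothetical file name
payload = b'hello world' * 10000
with open(sample, 'wb') as f:
    f.write(payload)

md5_value = hash_file(sample, 'md5')
sha256_value = hash_file(sample, 'sha256')
print(md5_value)
print(sha256_value)
```

The digest depends only on the file's contents, so two files with different names but identical bytes produce the same value.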

In [40]:
import hashlib

In [64]:
# for example, look at the pdfs

for entry in os.scandir(os.path.join(walk_this_directory, 'pdf')):
    # hash the bytes stored in the file (not its name): open in binary mode and read the contents
    with open(entry.path, 'rb') as f:
        digest = hashlib.md5(f.read()).hexdigest()
    print(os.path.abspath(entry.path), digest)

## Write the results to a CSV

Now that we can gather the information, let's write to a file:

In [38]:
import csv

In [39]:
## write to a CSV
# To do: Create a header row, write rows of file information, close the complete file

# set up the csv, create a header row
headers = [
    'index',
    'filename',
    'in_folder_path',
    'full_file_path',
    'size'
    ]

# write the information using csv.writer(); newline='' lets the csv module control the line endings
with open('file-manifest.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(headers)
    for file in manifestInfo:
        print(file)
        writer.writerow(file)
    print('Wrote the file manifest')

[1, '.DS_Store', 'data/webfiles-samples', 'data/webfiles-samples/.DS_Store', 6148]
[2, 'web-files-small-metadata.csv', 'data/webfiles-samples', 'data/webfiles-samples/web-files-small-metadata.csv', 9069]
[3, 'vlwhcssc.asx', 'data/webfiles-samples/video', 'data/webfiles-samples/video/vlwhcssc.asx', 356]
[4, '04-04-21full.asf', 'data/webfiles-samples/video', 'data/webfiles-samples/video/04-04-21full.asf', 98]
[5, 'glmp_cig.EQ.wm.p20.t12z', 'data/webfiles-samples/video', 'data/webfiles-samples/video/glmp_cig.EQ.wm.p20.t12z', 8296]
[6, 'oct17cc.asx', 'data/webfiles-samples/video', 'data/webfiles-samples/video/oct17cc.asx', 106945]
[7, '01-1480.pdf', 'data/webfiles-samples/pdf', 'data/webfiles-samples/pdf/01-1480.pdf', 49088]
[8, 'file.pdf', 'data/webfiles-samples/pdf', 'data/webfiles-samples/pdf/file.pdf', 1538]
[9, 'Chapter03.pdf', 'data/webfiles-samples/pdf', 'data/webfiles-samples/pdf/Chapter03.pdf', 51919]
[10, 'PFCHEJ.pdf', 'data/webfiles-samples/pdf', 'data/webfiles-samples/pdf/PFCHE
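
One way to check a manifest after writing it is to read it back. The sketch below is self-contained: it writes two made-up rows (not the lab's files) to a temporary CSV using the same headers, then reads them with `csv.DictReader`, which returns each row as a dictionary keyed by the header names:

```python
import csv
import os
import tempfile

headers = ['index', 'filename', 'in_folder_path', 'full_file_path', 'size']
# Made-up rows for illustration only.
rows = [
    [1, 'a.pdf', 'docs', os.path.join('docs', 'a.pdf'), 120],
    [2, 'b.jpg', 'images', os.path.join('images', 'b.jpg'), 4096],
]

csv_path = os.path.join(tempfile.mkdtemp(), 'demo-manifest.csv')

# newline='' is recommended by the csv module so it can manage line endings.
with open(csv_path, 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(headers)
    writer.writerows(rows)

# Read the file back; every value comes back as a string.
with open(csv_path, 'r', newline='') as f:
    records = list(csv.DictReader(f))

total_size = sum(int(record['size']) for record in records)
print(len(records), 'rows;', total_size, 'bytes in total')
```

Because CSV stores everything as text, numeric columns such as `size` need to be converted with `int()` before doing arithmetic.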

## Reflection Activities

1. Write a script that uses `os.listdir()` for each of the directories in the `webfiles-samples` directory. You can put in the path names directly as variables, but you should use the `os.path.join()` function to create filepaths that do not depend on your inputting the exact filepath string, which will vary across operating systems.
1. Write a script that uses `os.scandir()` to check whether the entities in the directory are files or directories. The script should output a count of files and a count of directories.
1. Examine the examples above that use `os.walk()`. What is the difference between this and the previous two functions? In some ways it lets you get deeper into the file structure, so please explain your observation in a sentence or two.  
1. Create a script that will print out an inventory of all the files in the assets folder `webfiles-samples`. The list should include the filename of the file, the directory path for the file, the full path to the file, and the file size. You may include any other information that you think is important. Call your script file `inventory_script.py`.
1. Extend the previous script: Create a script that will print out an inventory of all the files in the assets folder `webfiles-samples`. The inventory should be a CSV file, and it should include the filename of the file, the directory path for the file, the full path to the file, and the file size. You may include any other information that you think is important. Call your script file `inventory_script.py`.
1. Extend the above script, using the techniques demonstrated here, and add in a way to determine the file extension of the file, then add the extension to the CSV output. (Hint: you could split the filename string, right?)
1. Write a script that can walk through a series of directories and identify files based on their file extension. For example, perhaps you want to count the number of .pdf or .jpg files. Create a file that can look for this information and then tally the files. Then, have the program output the list of filenames and filepaths in a CSV file. Call this file `extension_detector.py`. 

### Function exercises 

1. Building on the above examples, can you a) write functions that bundle code to ask for a directory? You could call this function `create_manifest_information` and it should be able to accept a path to a directory as an argument and return the manifestInfo list. And b) write a function that would accept the manifestInfo list as an argument and create a CSV? 
1. Write a script that creates a `master` and `derivative` directory within a subdirectory that has the file's name as its name. For example, if there are two files, one named `001.jpg` and another named `audition.wav`, there should be a directory named `001` and another named `audition`. Within each of these, there should be `master` and `derivative` folders. The original files should be in the `master` folder. Call this file `master_and_derivatives.py`.