# Python 104 - Writing Files, Inventorying Files

This notebook covers the basics of writing files. We work through one basic example and one that extracts specific information from a file and writes it to a new file. After that, we look at a few modules that help us build an inventory of basic file system information, including filenames, locations (paths), and sizes. Once we have gathered this information, we can use it to create an inventory manifest.

First, let's look at the basics of writing files. 

## Writing Files

The basic tool for writing files is the `write()` method of a file object. It writes the string passed as its argument, 
which can include multi-line content. Unlike in the GUI or the shell, where opening a file is often implicit, 
in Python you need to `open()` a file before writing to it and `close()` it afterwards. You cannot write to a file that has not been opened, and a file that is not closed may be left incomplete or corrupted. 

Fortunately, we can usually use the contextual opener (a `with` statement):

```python
with open(file, 'w') as f:
    f.write('some text')  # any writing happens inside the indented block
```

This will automatically close the file when the indented block finishes. The `'w'` argument indicates that the file is opened in "write" mode. If the file doesn't exist, it will be created; if it already exists, its contents will be overwritten.
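
As a minimal sketch of this behavior (the file name `demo.txt` and the temporary directory are placeholders chosen for illustration, not part of the lesson's assets), the following shows that `'w'` mode creates a missing file, that opening the same file in `'w'` mode again replaces its contents, and that the `with` block closes the file for us:

```python
import os
import tempfile

# Work in a throwaway directory so nothing in the lesson folder is touched.
# 'demo.txt' is a hypothetical file name used only for this sketch.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, 'demo.txt')

# The file does not exist yet; opening it in 'w' mode creates it.
with open(demo_path, 'w') as f:
    f.write('first version')

# After the with block ends, the file object is closed automatically.
closed_after_block = f.closed

# Opening the same file in 'w' mode again replaces the old contents.
with open(demo_path, 'w') as f:
    f.write('second version')

with open(demo_path, 'r') as f:
    final_contents = f.read()

print(closed_after_block)
print(final_contents)
```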

In [1]:
# Basic use of open() and write()

line = 'Believe that life is worth living, and your belief will help create the fact.'
# Credit William James https://en.wikiquote.org/wiki/William_James

fout = open('quote-output.txt', 'w')

fout.write(line)

fout.close()

In [5]:
# use the with open() syntax to read the file back and confirm it was written

with open('quote-output.txt', 'r') as f:
    print(f.read())

Believe that life is worth living, and your belief will help create the fact.
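
`write()` writes exactly the string it is given and does not add line breaks, so multi-line content needs explicit `\n` characters. A small sketch, assuming a hypothetical output file `lines-output.txt` in a temporary directory:

```python
import os
import tempfile

# Hypothetical file name, placed in a temporary directory for this sketch.
lines_path = os.path.join(tempfile.mkdtemp(), 'lines-output.txt')

lines = ['first line', 'second line', 'third line']

with open(lines_path, 'w') as f:
    for text in lines:
        # write() does not append a newline, so add one explicitly.
        f.write(text + '\n')

with open(lines_path, 'r') as f:
    written = f.read()

print(written)
```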


We can also extract information from a file then reuse that in another file. 
For example, we could extract the email addresses from `mbox-short.txt` and create
an address book file:

In [1]:
# create a path to the file
file = '../assets/mbox-short.txt'

# set up a file name for a file to create
fout = 'email-list.txt'

# establish a list to record emails as they are identified
emails = []

# open the source file to extract emails
with open(file, 'r') as f:
    for line in f:
        if line.startswith('From:'):
            email = line[6:]
            if email not in emails:
                emails.append(email)
print(emails, '\n\n')

# open another file in write mode to write the emails.
with open(fout, 'w') as f:
    for email in emails:
        f.write(email)

# read the new file back, closing it again when we are done
with open(fout) as f:
    print(f.read())

['stephen.marquard@uct.ac.za\n', 'louis@media.berkeley.edu\n', 'zqian@umich.edu\n', 'rjlowe@iupui.edu\n', 'cwen@iupui.edu\n', 'gsilver@umich.edu\n', 'wagnermr@iupui.edu\n', 'antranig@caret.cam.ac.uk\n', 'gopal.ramasammycook@gmail.com\n', 'david.horwitz@uct.ac.za\n', 'ray@media.berkeley.edu\n'] 


stephen.marquard@uct.ac.za
louis@media.berkeley.edu
zqian@umich.edu
rjlowe@iupui.edu
cwen@iupui.edu
gsilver@umich.edu
wagnermr@iupui.edu
antranig@caret.cam.ac.uk
gopal.ramasammycook@gmail.com
david.horwitz@uct.ac.za
ray@media.berkeley.edu
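
The list output above shows that each address still carries its trailing `\n`, because `line[6:]` keeps the end of the line. One alternative, sketched here with a few made-up lines standing in for `mbox-short.txt` (the sample text, addresses, and file names are invented for illustration), is to strip the whitespace when extracting and add the line break back when writing:

```python
import os
import tempfile

# Invented sample lines that mimic the 'From:' headers in an mbox file.
sample_text = (
    'From: alice@example.org\n'
    'Subject: hello\n'
    'From: bob@example.org\n'
    'From: alice@example.org\n'
)

work_dir = tempfile.mkdtemp()
source_path = os.path.join(work_dir, 'sample-mbox.txt')
address_path = os.path.join(work_dir, 'sample-email-list.txt')

with open(source_path, 'w') as f:
    f.write(sample_text)

addresses = []
with open(source_path, 'r') as f:
    for line in f:
        if line.startswith('From:'):
            # Drop the 'From:' prefix, then strip spaces and the newline.
            address = line[len('From:'):].strip()
            if address not in addresses:
                addresses.append(address)

with open(address_path, 'w') as f:
    for address in addresses:
        f.write(address + '\n')

print(addresses)
```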



## Inventorying Files

For this activity, we are going to use a few modules that let us interact with the file system. These should feel somewhat familiar now that we have looked at basic shell commands.

* `os` provides access to operating system functionality, in this case particularly file information and paths. See https://docs.python.org/3/library/os.html.
* `os.path` is often imported on its own and lets us work with file path and directory information. See https://docs.python.org/3/library/os.path.html#module-os.path.
* `shutil` provides higher-level file operations similar to some shell utilities, such as move, copy, rename, and delete. See https://docs.python.org/3/library/shutil.html?highlight=shutils.
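
To make these modules a little more concrete, here is a short sketch that exercises a few of their functions on a throwaway file. The file and folder names (`example.txt`, `backup`, `renamed.txt`) are placeholders, and the calls shown are only a small sample of what each module offers:

```python
import os
import shutil
import tempfile

base_dir = tempfile.mkdtemp()

# Build a path in a platform-independent way and create a small file.
original = os.path.join(base_dir, 'example.txt')
with open(original, 'w') as f:
    f.write('hello')

# os.path can take a path apart and report on the file it points to.
name = os.path.basename(original)          # 'example.txt'
stem, extension = os.path.splitext(name)   # ('example', '.txt')
size = os.path.getsize(original)           # size in bytes

# os can create directories; shutil can copy and move (rename) files.
backup_dir = os.path.join(base_dir, 'backup')
os.mkdir(backup_dir)
copied = shutil.copy(original, backup_dir)
moved = shutil.move(original, os.path.join(base_dir, 'renamed.txt'))

print(name, stem, extension, size)
print(os.path.exists(original), os.path.exists(copied), os.path.exists(moved))
```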

We will also use the `csv` module, since it helps us write the information we gather to a structured data file that can later be opened in Excel or other spreadsheet applications. See https://docs.python.org/3/library/csv.html.

In [3]:
import os
from os.path import join, getsize
import csv

Once we know what we want in the CSV, how do we get that information? We can use the `os` module to get file information. We will use the `os.walk` function to "walk" the file tree: for each directory it visits, it gives us the folder path, a list of subfolder names, and a list of filenames.
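
As a rough, self-contained sketch of what `os.walk` yields, the example below builds a tiny directory tree in a temporary location (the folder and file names are invented) and collects one entry per directory visited:

```python
import os
import tempfile

# Build a small, invented tree:  root/readme.txt  and  root/audio/clip.mp3
root = tempfile.mkdtemp()
os.mkdir(os.path.join(root, 'audio'))
for relative_path in ['readme.txt', os.path.join('audio', 'clip.mp3')]:
    with open(os.path.join(root, relative_path), 'w') as f:
        f.write('placeholder')

# os.walk yields a (folder path, subfolder names, file names) tuple
# for the starting directory and for every directory beneath it.
visited = []
for folder_path, subfolder_names, file_names in os.walk(root):
    relative_folder = os.path.relpath(folder_path, root)
    visited.append((relative_folder, sorted(subfolder_names), sorted(file_names)))

for entry in visited:
    print(entry)
```

Each tuple describes a single directory, so the nesting of the tree shows up as additional iterations of the loop, not as nested lists.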

In [4]:
walk_this_directory = os.path.join('..','assets','Bundle-web-files-small')

print(walk_this_directory)

../assets/Bundle-web-files-small


In [13]:
for folderName, subfolders, filenames in os.walk(walk_this_directory):
    # see what this produces
    print('folderName is a type',type(folderName),
          '\nsubfolders is a type',type(subfolders),
          '\nfilenames is a type',type(filenames))

    ## so, for every directory in the tree, os.walk yields three values:
    ## a string holding the path of the current folder,
    ## and two lists naming the folders and files it contains

folderName is a type <class 'str'> 
subfolders is a type <class 'list'> 
filenames is a type <class 'list'>
folderName is a type <class 'str'> 
subfolders is a type <class 'list'> 
filenames is a type <class 'list'>
folderName is a type <class 'str'> 
subfolders is a type <class 'list'> 
filenames is a type <class 'list'>
folderName is a type <class 'str'> 
subfolders is a type <class 'list'> 
filenames is a type <class 'list'>
folderName is a type <class 'str'> 
subfolders is a type <class 'list'> 
filenames is a type <class 'list'>
folderName is a type <class 'str'> 
subfolders is a type <class 'list'> 
filenames is a type <class 'list'>


In [10]:
# get information about how many files are in each directory and how much space they take up
for FolderPaths, SubfolderNames, filenames in os.walk(walk_this_directory):
    print(FolderPaths, "consumes", end=" ")
    print(sum(getsize(join(FolderPaths, name)) for name in filenames), end=" ")
    print("bytes in", len(filenames), "non-directory files")


../assets/Bundle-web-files-small consumes 9069 bytes in 1 non-directory files
../assets/Bundle-web-files-small/audio consumes 25856261 bytes in 4 non-directory files
../assets/Bundle-web-files-small/image consumes 497284 bytes in 5 non-directory files
../assets/Bundle-web-files-small/pdf consumes 149427 bytes in 5 non-directory files
../assets/Bundle-web-files-small/presentation consumes 289792 bytes in 3 non-directory files
../assets/Bundle-web-files-small/video consumes 115706 bytes in 4 non-directory files


In [24]:
for folderName, subfolders, filenames in os.walk(walk_this_directory):
    print('Current folder:',folderName)
    
    for subfolder in subfolders:
        print('Parent folder:',folderName,'; subfolder:',subfolder)
        
    for filename in filenames:
        print('The file', filename, 
              '\n    This is the folder:', folderName,
             '\n    The filepath is:',os.path.join(folderName, filename))
    
    print('\n')

    ## Note that os.walk does not list the special entries . and ..

Current folder: ../assets/Bundle-web-files-small
Parent folder: ../assets/Bundle-web-files-small ; subfolder: audio
Parent folder: ../assets/Bundle-web-files-small ; subfolder: image
Parent folder: ../assets/Bundle-web-files-small ; subfolder: pdf
Parent folder: ../assets/Bundle-web-files-small ; subfolder: presentation
Parent folder: ../assets/Bundle-web-files-small ; subfolder: video
The file web-files-small-metadata.csv 
    This is the folder: ../assets/Bundle-web-files-small 
    The filepath is: ../assets/Bundle-web-files-small/web-files-small-metadata.csv


Current folder: ../assets/Bundle-web-files-small/audio
The file 000727.ram 
    This is the folder: ../assets/Bundle-web-files-small/audio 
    The filepath is: ../assets/Bundle-web-files-small/audio/000727.ram
The file 11-3250JohnsonvFolinoEtAl.wma 
    This is the folder: ../assets/Bundle-web-files-small/audio 
    The filepath is: ../assets/Bundle-web-files-small/audio/11-3250JohnsonvFolinoEtAl.wma
The file mj_telework_exc

In [31]:
# get information about each of the files
for folderName, subfolders, filenames in os.walk(walk_this_directory):
    
    for filename in filenames:
        folder = folderName
        path = os.path.join(folderName, filename)
        size = os.path.getsize(path)  # size in bytes
        print('Found:', filename, folder, path, size)

    ## Note that os.walk does not list the special entries . and ..

Found: web-files-small-metadata.csv ../assets/Bundle-web-files-small ../assets/Bundle-web-files-small/web-files-small-metadata.csv 9069
Found: 000727.ram ../assets/Bundle-web-files-small/audio ../assets/Bundle-web-files-small/audio/000727.ram 79
Found: 11-3250JohnsonvFolinoEtAl.wma ../assets/Bundle-web-files-small/audio ../assets/Bundle-web-files-small/audio/11-3250JohnsonvFolinoEtAl.wma 21423499
Found: mj_telework_exchange_final_100710.mp3 ../assets/Bundle-web-files-small/audio ../assets/Bundle-web-files-small/audio/mj_telework_exchange_final_100710.mp3 3471488
Found: NEWSLINE_802AF71F439D401585C6FCB02F358307.mp3 ../assets/Bundle-web-files-small/audio ../assets/Bundle-web-files-small/audio/NEWSLINE_802AF71F439D401585C6FCB02F358307.mp3 961195
Found: 1005107061.tif ../assets/Bundle-web-files-small/image ../assets/Bundle-web-files-small/image/1005107061.tif 395734
Found: 13080t.jpg ../assets/Bundle-web-files-small/image ../assets/Bundle-web-files-small/image/13080t.jpg 3764
Found: k7989-

In [34]:
## get information about each of the files

# first let's set a counter
fileCount = 0
# and a list to hold one row of information per file
manifestInfo = list()

for folderName, subfolders, filenames in os.walk(walk_this_directory):    
    for filename in filenames:
        fileCount += 1
        index = fileCount
        folder = folderName
        path = os.path.join(folderName, filename)
        size = os.path.getsize(path)
#        print('Found:', filename, folder, path, size)

        fileInfo = [
            index,
            filename,
            folder,
            path,
            size
            ]
        manifestInfo.append(fileInfo)

print('Found',len(manifestInfo),'items.\n\n',manifestInfo)

Found 22 items.

 [[1, 'web-files-small-metadata.csv', '../assets/Bundle-web-files-small', '../assets/Bundle-web-files-small/web-files-small-metadata.csv', 9069], [2, '000727.ram', '../assets/Bundle-web-files-small/audio', '../assets/Bundle-web-files-small/audio/000727.ram', 79], [3, '11-3250JohnsonvFolinoEtAl.wma', '../assets/Bundle-web-files-small/audio', '../assets/Bundle-web-files-small/audio/11-3250JohnsonvFolinoEtAl.wma', 21423499], [4, 'mj_telework_exchange_final_100710.mp3', '../assets/Bundle-web-files-small/audio', '../assets/Bundle-web-files-small/audio/mj_telework_exchange_final_100710.mp3', 3471488], [5, 'NEWSLINE_802AF71F439D401585C6FCB02F358307.mp3', '../assets/Bundle-web-files-small/audio', '../assets/Bundle-web-files-small/audio/NEWSLINE_802AF71F439D401585C6FCB02F358307.mp3', 961195], [6, '1005107061.tif', '../assets/Bundle-web-files-small/image', '../assets/Bundle-web-files-small/image/1005107061.tif', 395734], [7, '13080t.jpg', '../assets/Bundle-web-files-small/image'

In [38]:
## write to a CSV
# To do: Create a header row, write rows of file information, close the complete file

# set up the csv, create a header row
headers = [
    'index',
    'filename',
    'in_folder_path',
    'full_file_path',
    'size'
    ]

# write the information using csv.writer()
# newline='' is recommended by the csv docs to avoid blank rows on some platforms
with open('file-manifest.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(headers)
    for file in manifestInfo:
        print(file)
        writer.writerow(file)
    print('Wrote the file manifest')
        

[1, 'web-files-small-metadata.csv', '../assets/Bundle-web-files-small', '../assets/Bundle-web-files-small/web-files-small-metadata.csv', 9069]
[2, '000727.ram', '../assets/Bundle-web-files-small/audio', '../assets/Bundle-web-files-small/audio/000727.ram', 79]
[3, '11-3250JohnsonvFolinoEtAl.wma', '../assets/Bundle-web-files-small/audio', '../assets/Bundle-web-files-small/audio/11-3250JohnsonvFolinoEtAl.wma', 21423499]
[4, 'mj_telework_exchange_final_100710.mp3', '../assets/Bundle-web-files-small/audio', '../assets/Bundle-web-files-small/audio/mj_telework_exchange_final_100710.mp3', 3471488]
[5, 'NEWSLINE_802AF71F439D401585C6FCB02F358307.mp3', '../assets/Bundle-web-files-small/audio', '../assets/Bundle-web-files-small/audio/NEWSLINE_802AF71F439D401585C6FCB02F358307.mp3', 961195]
[6, '1005107061.tif', '../assets/Bundle-web-files-small/image', '../assets/Bundle-web-files-small/image/1005107061.tif', 395734]
[7, '13080t.jpg', '../assets/Bundle-web-files-small/image', '../assets/Bundle-web-f
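
To check a manifest after writing it, the file can be read back with `csv.reader`. The sketch below uses two invented rows and a temporary file in place of the real `file-manifest.csv`, and passes `newline=''` to `open()` as the `csv` documentation recommends:

```python
import csv
import os
import tempfile

sample_headers = ['index', 'filename', 'size']
# Invented rows standing in for the real manifest entries.
sample_rows = [[1, 'a.txt', 10], [2, 'b.jpg', 2048]]

sample_manifest = os.path.join(tempfile.mkdtemp(), 'sample-manifest.csv')

with open(sample_manifest, 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(sample_headers)
    writer.writerows(sample_rows)

with open(sample_manifest, 'r', newline='') as f:
    rows_read = list(csv.reader(f))

# Every value comes back as a string, so numbers need converting.
total_size = sum(int(row[2]) for row in rows_read[1:])

print(rows_read)
print(total_size)
```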

## Reflection Activities

1. Create a script that will create an inventory of all the files in the assets folder `Bundle-web-files-small`. The inventory should be a CSV file, and it should include the filename of the file, the directory path for the file, the full path to the file, and the file size. You may include any other information that you think is important. Call this file `inventory_script.py`.
1. Extend the above script, using the techniques demonstrated here, to determine each file's extension and add it to the CSV output. (Hint: you could split the filename string, right?)
1. Building on the above examples, can you a) write a function that bundles the code for gathering information about a directory? You could call this function `create_manifest_information`; it should accept a path to a directory as an argument and return the `manifestInfo` list. And b) write a function that accepts the `manifestInfo` list as an argument and creates a CSV?
1. Write a script that can walk through a series of directories and identify files based on their file extension. For example, perhaps you want to count the number of .pdf or .jpg files. Create a file that can look for this information and then tally the files. Then, have the program output the list of filenames and filepaths in a CSV file. Call this file `extension_detector.py`.
1. Write a script that creates a `master` and a `derivative` directory within a subdirectory named after each file. For example, if there are two files, `001.jpg` and `audition.wav`, there should be a directory named `001` and another named `audition`. Within each of these, there should be `master` and `derivative` folders. The original files should be in the `master` folder. Call this file `master_and_derivatives.py`.
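
As one possible starting point for activity 3 (a sketch, not a model answer), the code from this notebook could be bundled into two functions. `create_manifest_information` is the name suggested in the activity; `write_manifest_csv` and its `output_path` parameter are names made up here:

```python
import csv
import os


def create_manifest_information(directory):
    """Walk directory and return rows of [index, filename, folder, path, size]."""
    manifest_info = []
    for folder_name, subfolders, filenames in os.walk(directory):
        for filename in filenames:
            path = os.path.join(folder_name, filename)
            manifest_info.append(
                [len(manifest_info) + 1, filename, folder_name, path,
                 os.path.getsize(path)]
            )
    return manifest_info


def write_manifest_csv(manifest_info, output_path):
    """Write a header row plus one row per file to output_path."""
    headers = ['index', 'filename', 'in_folder_path', 'full_file_path', 'size']
    with open(output_path, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(headers)
        writer.writerows(manifest_info)
```

These could then be called as `write_manifest_csv(create_manifest_information(walk_this_directory), 'file-manifest.csv')`, reusing the directory path defined earlier in the notebook.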

Next week: dictionaries (to streamline CSV creation) and additional derived information, such as mimetype and fixity/hash.