# Python 104 - Writing Files, Inventorying Files

This notebook goes through the basics of writing files. We look through one basic example and one that extracts specific information from one file then writes it to a new file. After that, we look at a few modules that will help us to build an inventory of basic system information including filenames, locations (paths), and sizes. Once we identify this information we can use it to create an inventory manifest. 

First, let's look at the basics of writing files. 

## Writing Files

The basic tool for writing files is the `write()` method of a file object. It writes the string passed as its argument, 
which can be a single line or multi-line content. Unlike in other environments like the GUI or shell, where opening a file is often implicit, 
you need to `open()` a file before writing to it and `close()` it afterwards when working in Python. You cannot write to a file that has not been opened, and a file that is not closed may be left incomplete or corrupted. 

Fortunately, we can usually use the contextual opener (a `with` statement):

```python
with open(file, 'w') as f:
    f.write(line)  # any writing happens inside the indented block
```

This will automatically close the file when the indented block completes. The `'w'` argument indicates that the file is opened in "write" mode. If the file doesn't exist, it will be created; if it already exists, its contents will be overwritten. 

In [1]:
# Basic use of open() and write()

line = 'Believe that life is worth living, and your belief will help create the fact.'
# Credit William James https://en.wikiquote.org/wiki/William_James

fout = open('quote-output.txt', 'w')

fout.write(line)

fout.close()

In [2]:
# use the with open() syntax to read the file back and check that it was written

with open('quote-output.txt', 'r') as f:
    print(f.read())

Believe that life is worth living, and your belief will help create the fact.
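
The "write" mode described above replaces whatever a file already contains. As a minimal sketch of that behavior, the cell below writes to a throwaway file twice in `'w'` mode and then once in `'a'` (append) mode. The file name `mode-demo.txt` and the temporary directory are assumptions made for illustration and are not part of the course assets.

```python
import os
import tempfile

# Work in a throwaway directory so the sketch does not touch real files.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'mode-demo.txt')  # hypothetical file name

    # 'w' creates the file if it does not exist yet.
    with open(demo_path, 'w') as f:
        f.write('first\n')

    # Opening with 'w' again empties the file, so 'first' is lost.
    with open(demo_path, 'w') as f:
        f.write('second\n')

    # 'a' adds to the end of the file instead of emptying it.
    with open(demo_path, 'a') as f:
        f.write('third\n')

    with open(demo_path, 'r') as f:
        contents = f.read()

print(contents)
```

Only `second` and `third` should be printed, which is why append mode is usually the safer choice when adding to a file you want to keep.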


We can also extract information from a file then reuse that in another file. 
For example, we could extract the email addresses from `mbox-short.txt` and create
an address book file:

In [3]:
# create a path to the file
file = '../assets/mbox-short.txt'

# set up a file name for a file to create
fout = 'email-list.txt'

# establish a list to record emails as they are identified
emails = []

# open the source file to extract emails
with open(file, 'r') as f:
    for line in f:
        if line.startswith('From:'):
            email = line[6:]
            if email not in emails:
                emails.append(email)
print(emails, '\n\n')

# open another file in write mode to write the emails.
with open(fout, 'w') as f:
    for email in emails:
        f.write(email)

# read the new file back, closing it automatically when done
with open(fout, 'r') as f:
    print(f.read())

['stephen.marquard@uct.ac.za\n', 'louis@media.berkeley.edu\n', 'zqian@umich.edu\n', 'rjlowe@iupui.edu\n', 'cwen@iupui.edu\n', 'gsilver@umich.edu\n', 'wagnermr@iupui.edu\n', 'antranig@caret.cam.ac.uk\n', 'gopal.ramasammycook@gmail.com\n', 'david.horwitz@uct.ac.za\n', 'ray@media.berkeley.edu\n'] 


stephen.marquard@uct.ac.za
louis@media.berkeley.edu
zqian@umich.edu
rjlowe@iupui.edu
cwen@iupui.edu
gsilver@umich.edu
wagnermr@iupui.edu
antranig@caret.cam.ac.uk
gopal.ramasammycook@gmail.com
david.horwitz@uct.ac.za
ray@media.berkeley.edu



## Inventorying Files

For this activity, we are going to use a few modules that allow us to interact with the file system. These should be somewhat familiar now that we have looked into basic shell commands.

* `os` assists in using aspects of the operating system, in this case particularly file information and paths. See https://docs.python.org/3/library/os.html.
* `os.path` is often imported by itself and allows us to interact with file path and directory information. See https://docs.python.org/3/library/os.path.html#module-os.path.
* `shutil` gives us access to some shell-like utilities, such as copy, move (which also covers renaming), and delete. See https://docs.python.org/3/library/shutil.html?highlight=shutils.

We will also use the `csv` module since it will help us to write the information that we gather to a structured data file that can later be opened in Excel or other spreadsheet applications. See https://docs.python.org/3/library/csv.html
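
Of the modules listed above, `shutil` is the one that the cells in this notebook do not exercise, so here is a minimal sketch of copying and moving a file with it. The file and folder names (`original.txt`, `copy.txt`, `archive`) are hypothetical, and everything happens in a temporary directory rather than in the course assets.

```python
import os
import shutil
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # create a small file to work with
    source = os.path.join(tmp_dir, 'original.txt')
    with open(source, 'w') as f:
        f.write('some content\n')

    # shutil.copy() copies the file and returns the path of the new copy
    copy_path = shutil.copy(source, os.path.join(tmp_dir, 'copy.txt'))

    # shutil.move() moves (or renames) a file and returns its new path
    archive_dir = os.path.join(tmp_dir, 'archive')
    os.mkdir(archive_dir)
    moved_path = shutil.move(copy_path, os.path.join(archive_dir, 'copy.txt'))

    # record what is left in each folder before the temporary directory is removed
    remaining = sorted(os.listdir(tmp_dir))
    archived = os.listdir(archive_dir)

print(remaining)
print(archived)
```

After the move, the top-level folder should hold only `archive` and `original.txt`, and `copy.txt` should sit inside `archive`.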

In [4]:
import os
from os.path import join, getsize, getctime
import csv

Once we know what we want in the CSV, how do we get that information? We can use the `os` module to get file information. We will use the `os.walk()` function to "walk" over the file tree and identify folders, paths, and filenames.

In [5]:
walk_this_directory = os.path.join('..','assets','Bundle-web-files-small')

print(walk_this_directory)

../assets/Bundle-web-files-small


### Using os.listdir()

We can generate a list of the contents of the directory using the `os.listdir()` function. This list will include the names of all the entries in the directory, both files and subdirectories. 

In [6]:
dir_list = os.listdir(walk_this_directory)

print(dir_list)

['.DS_Store', 'audio', 'image', 'pdf', 'presentation', 'video', 'web-files-small-metadata.csv']


Let's use the `listdir()` function to create a manifest of the files in the `pdf` directory:

In [7]:
# create the list
dir_list = os.listdir(os.path.join(walk_this_directory, 'pdf'))

# set up a file name for a file to create
fout = 'pdf-file-list.txt'

# open another file in write mode to write the filenames, one per line.
with open(fout, 'w') as f:
    for filename in dir_list:
        f.write(filename)
        f.write('\n')

# read the new file back, closing it automatically when done
with open(fout, 'r') as f:
    print(f.read())

01-1480.pdf
Chapter03.pdf
file.pdf
HR2021 commtext.pdf
PFCHEJ.pdf



### Using os.scandir()

This is useful, but `listdir()` returns only names. The filepath information, which gives a lot of context about the file in the filesystem, and details such as whether an item is a file or a directory are not included. To get a list that we can iterate through while checking whether each item is recognized as a file by the system, we can use the `os.scandir()` function. It yields `os.DirEntry` objects, which have methods like `is_file()` and `is_dir()` that evaluate whether the item is a file or a directory. We can use `os.scandir()` in a `with` ... `as` construction, like we have seen in opening files.

In [8]:
with os.scandir(walk_this_directory) as items_list:
    for entry in items_list:
        print('Looking at:',entry)
        if entry.is_dir():
            file_list = os.listdir(entry.path)
            print('This is a directory and contains',len(file_list),'files (',file_list,').')
        if entry.is_file():
            print('This is a file named',entry,'that takes up',os.path.getsize(entry),'bytes')

Looking at: <DirEntry '.DS_Store'>
This is a file named <DirEntry '.DS_Store'> that takes up 6148 bytes
Looking at: <DirEntry 'audio'>
This is a directory and contains 4 files ( ['000727.ram', '11-3250JohnsonvFolinoEtAl.wma', 'mj_telework_exchange_final_100710.mp3', 'NEWSLINE_802AF71F439D401585C6FCB02F358307.mp3'] ).
Looking at: <DirEntry 'image'>
This is a directory and contains 5 files ( ['1005107061.tif', '13080t.jpg', 'k7989-7x.jpg', 'm237a2f.gif', 'orca.via_.moc_.noaa_.jpg'] ).
Looking at: <DirEntry 'pdf'>
This is a directory and contains 5 files ( ['01-1480.pdf', 'Chapter03.pdf', 'file.pdf', 'HR2021 commtext.pdf', 'PFCHEJ.pdf'] ).
Looking at: <DirEntry 'presentation'>
This is a directory and contains 3 files ( ['ADAEMPLOYMENTTaxIncentives.ppt', 'BudgetandGrants012710.ppt', 'Non-FTE-Trainee-Activities-060109.ppt'] ).
Looking at: <DirEntry 'video'>
This is a directory and contains 4 files ( ['04-04-21full.asf', 'glmp_cig.EQ.wm.p20.t12z', 'oct17cc.asx', 'vlwhcssc.asx'] ).
Looking at
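
The entries that `os.scandir()` yields carry more than what is printed above. As a minimal sketch, the cell below reads each entry's `name` and uses `entry.stat().st_size` for the size of files. It runs against a small temporary directory whose contents (`subfolder`, `notes.txt`) are made up for illustration.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # build a tiny, hypothetical directory: one subfolder and one file
    os.mkdir(os.path.join(tmp_dir, 'subfolder'))
    with open(os.path.join(tmp_dir, 'notes.txt'), 'w') as f:
        f.write('hello')

    kinds = {}   # entry name -> 'file' or 'directory'
    sizes = {}   # file name -> size in bytes
    with os.scandir(tmp_dir) as entries:
        for entry in entries:
            # entry.name is the bare name; entry.path would include the directory
            kinds[entry.name] = 'directory' if entry.is_dir() else 'file'
            if entry.is_file():
                # stat() returns the same kind of information as os.stat()
                sizes[entry.name] = entry.stat().st_size

print(kinds)
print(sizes)
```

Using `entry.name` instead of printing the entry itself gives cleaner output than the `<DirEntry '...'>` form shown above.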

### Using os.walk()

The `os.walk()` function allows us to do a more complex mapping of the directory. For each directory in the tree, it yields a "tuple" (a special datatype that holds a small, immutable sequence of values) containing the folder path, a list of its subfolders, and a list of its filenames. We can store that information to derive folder names and paths to individual files. 

In [9]:
for folderName, subfolders, filenames in os.walk(walk_this_directory):
    # see what this produces
    print('folderName is a type',type(folderName),
          '\nsubfolders is a type',type(subfolders),
          '\nfilenames is a type',type(filenames))

    ## so, each pass through the loop unpacks one tuple: 
    ## the first item is a string holding the folder path, 
    ## and the other two are lists of the folders and files it contains 

folderName is a type <class 'str'> 
subfolders is a type <class 'list'> 
filenames is a type <class 'list'>
folderName is a type <class 'str'> 
subfolders is a type <class 'list'> 
filenames is a type <class 'list'>
folderName is a type <class 'str'> 
subfolders is a type <class 'list'> 
filenames is a type <class 'list'>
folderName is a type <class 'str'> 
subfolders is a type <class 'list'> 
filenames is a type <class 'list'>
folderName is a type <class 'str'> 
subfolders is a type <class 'list'> 
filenames is a type <class 'list'>
folderName is a type <class 'str'> 
subfolders is a type <class 'list'> 
filenames is a type <class 'list'>
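
To see the values rather than just their types, here is a minimal sketch that walks a tiny temporary tree. The names `readme.txt`, `pdf`, and `report.pdf` are hypothetical, and `os.path.relpath()` is used only to keep the printed paths short.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # build a tiny, hypothetical tree: one file at the top and one in a subfolder
    os.makedirs(os.path.join(tmp_dir, 'pdf'))
    for relative in ('readme.txt', os.path.join('pdf', 'report.pdf')):
        with open(os.path.join(tmp_dir, relative), 'w') as f:
            f.write('x')

    walked = []
    for item in os.walk(tmp_dir):
        # each item is one tuple of three values
        folder, subfolders, filenames = item
        walked.append((os.path.relpath(folder, tmp_dir),
                       sorted(subfolders),
                       sorted(filenames)))

for row in walked:
    print(row)
```

The top folder (shown as `.`) comes first with its subfolder and file, followed by one tuple for the `pdf` folder.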


In [10]:
# get information about how many files are in each directory and how much space they take up
for FolderPaths, SubfolderNames, filenames in os.walk(walk_this_directory):
    print(FolderPaths, "consumes", end=" ")
    print(sum(getsize(join(FolderPaths, name)) for name in filenames), end=" ")
    print("bytes in", len(filenames), "non-directory files")


../assets/Bundle-web-files-small consumes 15217 bytes in 2 non-directory files
../assets/Bundle-web-files-small/audio consumes 25856261 bytes in 4 non-directory files
../assets/Bundle-web-files-small/image consumes 497284 bytes in 5 non-directory files
../assets/Bundle-web-files-small/pdf consumes 149427 bytes in 5 non-directory files
../assets/Bundle-web-files-small/presentation consumes 289792 bytes in 3 non-directory files
../assets/Bundle-web-files-small/video consumes 115706 bytes in 4 non-directory files


In [11]:
for folderName, subfolders, filenames in os.walk(walk_this_directory):
    print('Current folder:',folderName)
    
    for subfolder in subfolders:
        print('Parent folder:',folderName,'; subfolder:',subfolder)
        
    for filename in filenames:
        print('The file', filename, 
              '\n    This is the folder:', folderName,
             '\n    The filepath is:',os.path.join(folderName, filename))
    
    print('\n')

    ## Note that os.walk() does not list the special entries . and ..
    ## but it does include hidden files like .DS_Store

Current folder: ../assets/Bundle-web-files-small
Parent folder: ../assets/Bundle-web-files-small ; subfolder: audio
Parent folder: ../assets/Bundle-web-files-small ; subfolder: image
Parent folder: ../assets/Bundle-web-files-small ; subfolder: pdf
Parent folder: ../assets/Bundle-web-files-small ; subfolder: presentation
Parent folder: ../assets/Bundle-web-files-small ; subfolder: video
The file .DS_Store 
    This is the folder: ../assets/Bundle-web-files-small 
    The filepath is: ../assets/Bundle-web-files-small/.DS_Store
The file web-files-small-metadata.csv 
    This is the folder: ../assets/Bundle-web-files-small 
    The filepath is: ../assets/Bundle-web-files-small/web-files-small-metadata.csv


Current folder: ../assets/Bundle-web-files-small/audio
The file 000727.ram 
    This is the folder: ../assets/Bundle-web-files-small/audio 
    The filepath is: ../assets/Bundle-web-files-small/audio/000727.ram
The file 11-3250JohnsonvFolinoEtAl.wma 
    This is the folder: ../assets/Bu

In [12]:
# get information about each of the files
for folderName, subfolders, filenames in os.walk(walk_this_directory):
    
    for filename in filenames:
        folder = folderName
        path = os.path.join(folderName, filename)
        size = os.path.getsize(path)
        print('Found:', filename, folder, path, size)

    ## Note that os.walk() does not list the special entries . and ..
    ## but it does include hidden files like .DS_Store

Found: .DS_Store ../assets/Bundle-web-files-small ../assets/Bundle-web-files-small/.DS_Store 6148
Found: web-files-small-metadata.csv ../assets/Bundle-web-files-small ../assets/Bundle-web-files-small/web-files-small-metadata.csv 9069
Found: 000727.ram ../assets/Bundle-web-files-small/audio ../assets/Bundle-web-files-small/audio/000727.ram 79
Found: 11-3250JohnsonvFolinoEtAl.wma ../assets/Bundle-web-files-small/audio ../assets/Bundle-web-files-small/audio/11-3250JohnsonvFolinoEtAl.wma 21423499
Found: mj_telework_exchange_final_100710.mp3 ../assets/Bundle-web-files-small/audio ../assets/Bundle-web-files-small/audio/mj_telework_exchange_final_100710.mp3 3471488
Found: NEWSLINE_802AF71F439D401585C6FCB02F358307.mp3 ../assets/Bundle-web-files-small/audio ../assets/Bundle-web-files-small/audio/NEWSLINE_802AF71F439D401585C6FCB02F358307.mp3 961195
Found: 1005107061.tif ../assets/Bundle-web-files-small/image ../assets/Bundle-web-files-small/image/1005107061.tif 395734
Found: 13080t.jpg ../assets

In [13]:
## get information about each of the files

# first let's set some counters
fileCount = 0
# and a list to hold the information about one file, and another to hold the manifest of all files
fileInfo = list()
manifestInfo = list()

for folderName, subfolders, filenames in os.walk(walk_this_directory):    
    for filename in filenames:
        fileCount += 1
        index = fileCount
        folder = folderName
        path = os.path.join(folderName, filename)
        size = os.path.getsize(path)
#        print('Found:', filename, folder, path, size)

        fileInfo = [
            index,
            filename,
            folder,
            path,
            size
            ]
        manifestInfo.append(fileInfo)

print('Found',len(manifestInfo),'items.\n\n',manifestInfo)

Found 23 items.

 [[1, '.DS_Store', '../assets/Bundle-web-files-small', '../assets/Bundle-web-files-small/.DS_Store', 6148], [2, 'web-files-small-metadata.csv', '../assets/Bundle-web-files-small', '../assets/Bundle-web-files-small/web-files-small-metadata.csv', 9069], [3, '000727.ram', '../assets/Bundle-web-files-small/audio', '../assets/Bundle-web-files-small/audio/000727.ram', 79], [4, '11-3250JohnsonvFolinoEtAl.wma', '../assets/Bundle-web-files-small/audio', '../assets/Bundle-web-files-small/audio/11-3250JohnsonvFolinoEtAl.wma', 21423499], [5, 'mj_telework_exchange_final_100710.mp3', '../assets/Bundle-web-files-small/audio', '../assets/Bundle-web-files-small/audio/mj_telework_exchange_final_100710.mp3', 3471488], [6, 'NEWSLINE_802AF71F439D401585C6FCB02F358307.mp3', '../assets/Bundle-web-files-small/audio', '../assets/Bundle-web-files-small/audio/NEWSLINE_802AF71F439D401585C6FCB02F358307.mp3', 961195], [7, '1005107061.tif', '../assets/Bundle-web-files-small/image', '../assets/Bundle-

In [14]:
## write to a CSV
# To do: Create a header row, write rows of file information, close the complete file

# set up the csv, create a header row
headers = [
    'index',
    'filename',
    'in_folder_path',
    'full_file_path',
    'size'
    ]

# write the information using csv.writer()
# newline='' lets the csv module control line endings (avoids blank rows on Windows)
with open('file-manifest.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(headers)
    for file in manifestInfo:
        print(file)
        writer.writerow(file)
    print('Wrote the file manifest')
        

[1, '.DS_Store', '../assets/Bundle-web-files-small', '../assets/Bundle-web-files-small/.DS_Store', 6148]
[2, 'web-files-small-metadata.csv', '../assets/Bundle-web-files-small', '../assets/Bundle-web-files-small/web-files-small-metadata.csv', 9069]
[3, '000727.ram', '../assets/Bundle-web-files-small/audio', '../assets/Bundle-web-files-small/audio/000727.ram', 79]
[4, '11-3250JohnsonvFolinoEtAl.wma', '../assets/Bundle-web-files-small/audio', '../assets/Bundle-web-files-small/audio/11-3250JohnsonvFolinoEtAl.wma', 21423499]
[5, 'mj_telework_exchange_final_100710.mp3', '../assets/Bundle-web-files-small/audio', '../assets/Bundle-web-files-small/audio/mj_telework_exchange_final_100710.mp3', 3471488]
[6, 'NEWSLINE_802AF71F439D401585C6FCB02F358307.mp3', '../assets/Bundle-web-files-small/audio', '../assets/Bundle-web-files-small/audio/NEWSLINE_802AF71F439D401585C6FCB02F358307.mp3', 961195]
[7, '1005107061.tif', '../assets/Bundle-web-files-small/image', '../assets/Bundle-web-files-small/image/100
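
Once a manifest has been written, it can be read back to check it or to summarize it. The sketch below is one possible way to do that with `csv.DictReader`, which uses the header row as keys. So that it runs on its own, it writes a small stand-in manifest with made-up rows to a temporary file instead of reading `file-manifest.csv`.

```python
import csv
import os
import tempfile

headers = ['index', 'filename', 'in_folder_path', 'full_file_path', 'size']
# hypothetical rows standing in for the manifestInfo list
sample_rows = [
    [1, 'a.txt', 'demo', os.path.join('demo', 'a.txt'), 10],
    [2, 'b.txt', 'demo', os.path.join('demo', 'b.txt'), 25],
]

with tempfile.TemporaryDirectory() as tmp_dir:
    manifest_path = os.path.join(tmp_dir, 'sample-manifest.csv')

    # write the stand-in manifest
    with open(manifest_path, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(headers)
        writer.writerows(sample_rows)

    # read it back; each record is a dictionary keyed by the header names
    with open(manifest_path, 'r', newline='') as f:
        records = list(csv.DictReader(f))

# values read from a CSV are strings, so convert the size before adding
total_size = sum(int(record['size']) for record in records)
print(len(records), 'rows;', total_size, 'bytes in total')
```

Note that everything read from a CSV comes back as text, which is why the sizes are converted with `int()` before they are added up.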

## Reflection Activities

1. Write a script that uses `os.listdir()` for each of the directories in the `Bundle-web-files-small` directory. You can put in the path names directly as variables, but you should use the `os.path.join()` function to create filepaths that do not depend on your inputting the exact filepath string, which will vary across operating systems.
1. Write a script that uses `os.scandir()` to check whether or not the entities in the directory are files or directories. The script should output a count of files and a count of directories.
1. Examine the examples above that use `os.walk()`. What is the difference between this and the previous two functions? In some ways it lets you get deeper into the file structure, so please explain your observation in a sentence or two.  
1. Create a script that will create an inventory of all the files in the assets folder `Bundle-web-files-small`. The inventory should be a CSV file, and it should include the filename of the file, the directory path for the file, the full path to the file, and the file size. You may include any other information that you think is important. Call this file `inventory_script.py`.
1. Extend the above script, using the techniques demonstrated here, and add in a way to determine the file extension of the file, then add the extension to the CSV output. (Hint: you could split the filename string, right?)
1. Write a script that can walk through a series of directories and identify files based on their file extension. For example, perhaps you want to count the number of .pdf or .jpg files. Create a file that can look for this information and then tally the files. Then, have the program output the list of filenames and filepaths in a CSV file. Call this file `extension_detector.py`. 
1. Building on the above examples, can you a) write functions that bundle code to ask for a directory? You could call this function `create_manifest_information` and it should be able to accept a path to a directory as an argument and return the manifestInfo list. And b) write a function that would accept the manifestInfo list as an argument and create a CSV? 
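
As a starting point for the hint about splitting the filename, here is a minimal sketch that uses `os.path.splitext()` instead of a plain string split. The helper name `split_extension` is hypothetical, and this is only one of several reasonable approaches. The sample names come from the directory listings above.

```python
import os

def split_extension(filename):
    """Return the name without its extension, and the extension in lowercase."""
    root, extension = os.path.splitext(filename)
    return root, extension.lower()

examples = ['01-1480.pdf', 'orca.via_.moc_.noaa_.jpg', '.DS_Store']
results = [split_extension(name) for name in examples]

for name, (root, extension) in zip(examples, results):
    print(name, '->', repr(root), repr(extension))
```

`os.path.splitext()` splits only at the last dot and treats a leading dot as part of the name, so `.DS_Store` is reported as having no extension, which a simple `split('.')` would not do.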

This activity took two weeks when combined with the [Git exercise](exercise-git-intro.md). 

Next week: dictionaries (to streamline CSV creation) and additional derived information, such as mimetype and fixity/hash.

1. Write a script that creates a `master` and `derivative` directory within a subdirectory that has the file's name as its name. For example, if there are two files, one named `001.jpg` and another named `audition.wav`, there should be a directory named `001` and another named `audition`. Within these, there should be `master` and `derivative` folders. The original files should be in the `master` folder. Call this file `master_and_derivatives.py`.