# Review - iterating, listing files, writing lists to files

Notes from class, 14 March 2019. Reviewing where we've been so far. 

Below are worked examples from the previous activities on looking through files, including `os.listdir()`, `os.scandir()`, and `os.walk()`.

The sections below work through tasks based on the Reflection Questions from the Python 104 examples.

## Work throughs: listdir()

1. Write a script that uses `os.listdir()` for each of the directories in the `Bundle-web-files-small` directory. You can put in the path names directly as variables, but you should use the `os.path.join()` function to create filepaths that do not depend on your inputting the exact filepath string, which will vary across operating systems.

In [3]:
import os
from os.path import join, getsize
import csv

walk_this_directory = os.path.join('..','..','assets','Bundle-web-files-small')
print(walk_this_directory)

../../assets/Bundle-web-files-small
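
To see why `os.path.join()` is preferred over typing the separator by hand, here is a minimal sketch that builds the same relative path both ways. The variable names `parts`, `joined`, and `by_hand` are only for illustration and are not part of the class notebook.

```python
import os

# Build the path from its parts; os.path.join inserts the separator
# that is appropriate for the operating system running the script.
parts = ['..', '..', 'assets', 'Bundle-web-files-small']
joined = os.path.join(*parts)

# For simple relative parts like these, this is equivalent to joining
# the parts with os.sep ourselves.
by_hand = os.sep.join(parts)

print(joined)
print(joined == by_hand)
```

On macOS or Linux the separator is `/`; on Windows the same code produces backslashes, which is why the string is never typed out directly.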


In [4]:
os.listdir(walk_this_directory)

['.DS_Store',
 'audio',
 'image',
 'pdf',
 'presentation',
 'video',
 'web-files-small-metadata.csv']

In [5]:
path_to_query = os.path.join(walk_this_directory,'image')

print(path_to_query)
os.listdir(path_to_query)

../../assets/Bundle-web-files-small/image


['1005107061.tif',
 '13080t.jpg',
 'k7989-7x.jpg',
 'm237a2f.gif',
 'orca.via_.moc_.noaa_.jpg']

In [7]:
parent_directory_list = os.listdir(walk_this_directory)

for thing in parent_directory_list:
    if len(thing.split('.')) > 1:
#        continue
        print('It\'s a file!', thing)
    else:
        print(thing,':',os.listdir(os.path.join(walk_this_directory,thing)))

It's a file! .DS_Store
audio : ['000727.ram', '11-3250JohnsonvFolinoEtAl.wma', 'mj_telework_exchange_final_100710.mp3', 'NEWSLINE_802AF71F439D401585C6FCB02F358307.mp3']
image : ['1005107061.tif', '13080t.jpg', 'k7989-7x.jpg', 'm237a2f.gif', 'orca.via_.moc_.noaa_.jpg']
pdf : ['01-1480.pdf', 'Chapter03.pdf', 'file.pdf', 'HR2021 commtext.pdf', 'PFCHEJ.pdf']
presentation : ['ADAEMPLOYMENTTaxIncentives.ppt', 'BudgetandGrants012710.ppt', 'Non-FTE-Trainee-Activities-060109.ppt']
video : ['04-04-21full.asf', 'glmp_cig.EQ.wm.p20.t12z', 'oct17cc.asx', 'vlwhcssc.asx']
It's a file! web-files-small-metadata.csv
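
The cell above treats any name containing a `.` as a file. That works for this bundle, but it would misreport a directory whose name contains a dot, or a file with no extension. A possible alternative, sketched here on a throwaway directory with made-up names, is to ask the filesystem with `os.path.isdir()`. The function name `describe_entries` is hypothetical.

```python
import os
import tempfile

def describe_entries(directory):
    """Return a list of (name, kind) tuples for the entries in directory."""
    results = []
    for name in sorted(os.listdir(directory)):
        full_path = os.path.join(directory, name)
        # ask the filesystem rather than inspecting the name
        kind = 'directory' if os.path.isdir(full_path) else 'file'
        results.append((name, kind))
    return results

# Build a tiny temporary tree (hypothetical names) to try it on.
with tempfile.TemporaryDirectory() as tmp:
    os.mkdir(os.path.join(tmp, 'release.v1'))        # a directory with a dot
    open(os.path.join(tmp, 'README'), 'w').close()   # a file without a dot
    demo_entries = describe_entries(tmp)

print(demo_entries)
```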


## Work throughs: scandir()
1. Write a script that uses `os.scandir()` to check whether the entities in the directory are files or directories. The script should output a count of files and a count of directories.

In [8]:
# check that we still have our desired directory:
print(walk_this_directory)

../../assets/Bundle-web-files-small


In [9]:
# here is one way to use os.scandir, and to check if we are getting information back:
for item in os.scandir(walk_this_directory):
    print(item)

<DirEntry '.DS_Store'>
<DirEntry 'audio'>
<DirEntry 'image'>
<DirEntry 'pdf'>
<DirEntry 'presentation'>
<DirEntry 'video'>
<DirEntry 'web-files-small-metadata.csv'>


In [21]:
# notice that these are "DirEntry" objects, though. That means 
# we can get more information. os.scandir() returns an iterator, 
# opened here with the context manager 'with' so it is closed when the loop finishes

with os.scandir(walk_this_directory) as items_list:
    for entry in items_list:
        # each entry exposes attributes such as "name" and "path", so os.path.join is not needed
        print('Looking at:',entry.name)
        # scandir also can retrieve file information using the stat() call, which gives size and other information
        statinfo = entry.stat()
        print('Stat size:',statinfo.st_size)
        # the dirEntry information returned by scandir also allows us to do 
        # a logical check to see if it's a directory or file:
        if entry.is_dir():
            file_list = os.listdir(entry.path)
            print('This is a directory and contains',len(file_list),'files:',file_list)
            # loop through the file_list and use os.path.getsize() to calculate the bytes in the directory
            dirSize = 0
            for file in file_list:
                size = getsize(os.path.join(entry.path,file))
                dirSize = dirSize + size
            print('The directory takes up',dirSize,'bytes.')
        elif entry.is_file():
            print('This is a file named',entry.name,'that takes up',statinfo.st_size,'bytes at inode',statinfo.st_ino)
        # just in case something is not a file, add in this option:
        else:
            print('This object is unrecognized:', entry.name)

Looking at: .DS_Store
Stat size: 6148
This is a file named .DS_Store that takes up 6148 bytes at inode 31920562
Looking at: audio
Stat size: 204
This is a directory and contains 4 files: ['000727.ram', '11-3250JohnsonvFolinoEtAl.wma', 'mj_telework_exchange_final_100710.mp3', 'NEWSLINE_802AF71F439D401585C6FCB02F358307.mp3']
The directory takes up 25856261 bytes.
Looking at: image
Stat size: 238
This is a directory and contains 5 files: ['1005107061.tif', '13080t.jpg', 'k7989-7x.jpg', 'm237a2f.gif', 'orca.via_.moc_.noaa_.jpg']
The directory takes up 497284 bytes.
Looking at: pdf
Stat size: 238
This is a directory and contains 5 files: ['01-1480.pdf', 'Chapter03.pdf', 'file.pdf', 'HR2021 commtext.pdf', 'PFCHEJ.pdf']
The directory takes up 149427 bytes.
Looking at: presentation
Stat size: 170
This is a directory and contains 3 files: ['ADAEMPLOYMENTTaxIncentives.ppt', 'BudgetandGrants012710.ppt', 'Non-FTE-Trainee-Activities-060109.ppt']
The directory takes up 289792 bytes.
Looking at: vide
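
The task asks for a count of files and a count of directories, which the cell above does not print. One way to get those counts is sketched here, run against a temporary directory with made-up contents so it is self-contained. The function name `count_entries` is hypothetical.

```python
import os
import tempfile

def count_entries(directory):
    """Count the files and directories directly inside directory."""
    file_count = 0
    dir_count = 0
    with os.scandir(directory) as entries:
        for entry in entries:
            if entry.is_dir():
                dir_count += 1
            elif entry.is_file():
                file_count += 1
    return file_count, dir_count

# Hypothetical layout: two folders and one file at the top level.
with tempfile.TemporaryDirectory() as tmp:
    for folder in ('audio', 'image'):
        os.mkdir(os.path.join(tmp, folder))
    open(os.path.join(tmp, 'metadata.csv'), 'w').close()
    entry_counts = count_entries(tmp)

print('files:', entry_counts[0], 'directories:', entry_counts[1])
```

Calling `count_entries(walk_this_directory)` would apply the same check to the class bundle.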

## Work throughs: comparing listdir(), scandir(), walk()
1. Examine the examples above that use `os.walk()`. What is the difference between this and the previous two functions? In some ways it lets you get deeper into the file structure, so please explain your observation in a sentence or two.  

### Ways to find files

```
listdir() only provides a list of the names of the files and directories at the given path. If names are all we need, then this is good (it does less work than the other options since it isn't retrieving extra information); if we want anything else, such as size, type, or full path, we need further calls. It reveals hidden files (beginning with '.') and is not recursive, so it will not query subdirectories.

scandir() returns DirEntry objects rather than plain names, so it exposes more information, like name and path, and can be used to find size information through stat(). Like listdir(), it is not recursive, but it allows us to check if things are files, directories, or other sorts of objects, so we can descend into subdirectories ourselves.

walk() is recursive: it visits the starting directory and every subdirectory beneath it. It includes hidden files (the .DS_Store file appears in the manifest output in these notes), and it can be used to provide context about items in a directory (the containing folder, from which the full path can be built).

yet another way to look through the file tree is the glob module, which matches paths against wildcard patterns; by default its '*' wildcard skips hidden files.
```
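
To make the comparison concrete, here is a small sketch that runs all four approaches over the same temporary directory. The layout (one hidden file, one subdirectory holding one file) is invented for the demonstration.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # hypothetical layout: one hidden file, one subdirectory with one file
    os.mkdir(os.path.join(tmp, 'image'))
    open(os.path.join(tmp, '.hidden'), 'w').close()
    open(os.path.join(tmp, 'image', 'photo.jpg'), 'w').close()

    # listdir: names only, top level only
    listed = sorted(os.listdir(tmp))

    # scandir: DirEntry objects, top level only
    with os.scandir(tmp) as entries:
        scanned = sorted((entry.name, entry.is_dir()) for entry in entries)

    # walk: visits every directory below the starting point
    walked = []
    for folder, subfolders, filenames in os.walk(tmp):
        for filename in filenames:
            walked.append(filename)
    walked.sort()

    # glob: the '*' wildcard does not match names beginning with '.'
    pattern = os.path.join(glob.escape(tmp), '*')
    globbed = sorted(os.path.basename(path) for path in glob.glob(pattern))

print('listdir:', listed)
print('scandir:', scanned)
print('walk:   ', walked)
print('glob:   ', globbed)
```

Only `walk()` reaches `photo.jpg` without extra work, and only `glob` leaves out the hidden file.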

## Work throughs: harvesting file metadata 

_Activities: walking through the file tree, gathering metadata, creating a CSV_

1. Create a script that will create an inventory of all the files in the assets folder `Bundle-web-files-small`. The inventory should be a CSV file, and it should include the filename of the file, the directory path for the file, the full path to the file, and the file size. You may include any other information that you think is important. Call this file `inventory_script.py`.

In [11]:
# first use os.walk() as in the notebook, Python 104

## get information about each of the files

# first let's set a counter
fileCount = 0
# and a list to hold the information about one file (fileInfo), and another to hold every file's row (manifestInfo)
fileInfo = list()
manifestInfo = list()

for folderName, subfolders, filenames in os.walk(walk_this_directory):    
    for filename in filenames:
        fileCount += 1
        index = fileCount
        filename = filename 
        folder = folderName
        path = os.path.join(folderName, filename)
        size = os.path.getsize(path)
#        print('Found:', filename, folder, path, size)

        fileInfo = [
            index,
            filename,
            folder,
            path,
            size
            ]
        manifestInfo.append(fileInfo)
print('Looked through the file tree. Found',len(manifestInfo),'files.')

## write to a CSV

# set up the csv, create a header row
headers = [
    'index',
    'filename',
    'in_folder_path',
    'full_file_path',
    'size'
    ]

# write the information using csv.writer(); newline='' lets the csv module handle line endings
with open('file-manifest.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(headers)
    for file in manifestInfo:
#        print(file)
        writer.writerow(file)
    print('Wrote the file manifest')   

Looked through the file tree. Found 23 files.
Wrote the file manifest
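
A quick way to check a manifest like this is to read it back. The sketch below writes a two-row manifest with invented values to a temporary file, then loads it with `csv.DictReader`, which keys each row by the header names. The rows and the name `demo-manifest.csv` are hypothetical.

```python
import csv
import os
import tempfile

# hypothetical rows, in the same column order as the manifest above
headers = ['index', 'filename', 'in_folder_path', 'full_file_path', 'size']
rows = [
    [1, 'a.pdf', 'pdf', os.path.join('pdf', 'a.pdf'), 120],
    [2, 'b.jpg', 'image', os.path.join('image', 'b.jpg'), 80],
]

with tempfile.TemporaryDirectory() as tmp:
    manifest_path = os.path.join(tmp, 'demo-manifest.csv')
    # newline='' lets the csv module control the line endings
    with open(manifest_path, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(headers)
        writer.writerows(rows)

    # read it back: each row becomes a dictionary keyed by the header row
    with open(manifest_path, newline='') as f:
        records = list(csv.DictReader(f))

# CSV stores text, so the sizes come back as strings and need converting
total_bytes = sum(int(record['size']) for record in records)
print(len(records), 'rows,', total_bytes, 'bytes in total')
```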


## Work throughs: same as above (harvesting metadata), but adding extension to metadata
1. Extend the above script, using the techniques demonstrated here, to determine the file extension of each file, then add the extension to the CSV output. (Hint: you could split the filename string, right?)

In [12]:
# first use os.walk() as in the notebook, Python 104

## get information about each of the files

# first let's set a counter
fileCount = 0
# and a list to hold the information about one file (fileInfo), and another to hold every file's row (manifestInfo)
fileInfo = list()
manifestInfo = list()

for folderName, subfolders, filenames in os.walk(walk_this_directory):    
    for filename in filenames:
        fileCount += 1
        index = fileCount
        filename = filename 
        folder = folderName
        path = os.path.join(folderName, filename)
        size = os.path.getsize(path)
        extension = filename.split('.')[-1]
#        print('Found:', filename, folder, path, size, extension)

        fileInfo = [
            index,
            filename,
            folder,
            path,
            size,
            extension
            ]
        manifestInfo.append(fileInfo)
print('Looked through the file tree. Found',len(manifestInfo),'files.')

## write to a CSV

# set up the csv, create a header row
headers = [
    'index',
    'filename',
    'in_folder_path',
    'full_file_path',
    'size',
    'extension'
    ]

# write the information using csv.writer(); newline='' lets the csv module handle line endings
with open('file-manifest-with-extension.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(headers)
    for file in manifestInfo:
#        print(file)
        writer.writerow(file)
    print('Wrote the file manifest with extensions!')   

Looked through the file tree. Found 23 files.
Wrote the file manifest with extensions!
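
Splitting on `.` follows the hint, but it has two edge cases: a hidden file such as `.DS_Store` is recorded with the extension `DS_Store`, and a name with no dot at all is recorded as its own extension. A possible alternative is `os.path.splitext()`, sketched here with a hypothetical helper named `get_extension`.

```python
import os

def get_extension(filename):
    """Return the extension without its leading dot, or '' if there is none."""
    root, ext = os.path.splitext(filename)
    # ext is either '' or starts with '.', so dropping one character is safe
    return ext[1:]

for name in ['13080t.jpg', 'orca.via_.moc_.noaa_.jpg', '.DS_Store', 'README']:
    print(repr(name), '->', repr(get_extension(name)))
```

`os.path.splitext()` only splits at the last dot and ignores a leading dot, so hidden files and extensionless files both come back with an empty extension.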


## Work throughs: identifying file extensions, and sorting files by extension

~~1. Write a script that can walk through a series of directories and identify files based on their file extension. For example, perhaps you want to count the number of .pdf or .jpg files. Create a script that can look for this information and then tally the files. Then, have the program output the list of filenames and filepaths in a CSV file. Call this file `extension_detector.py`.~~


_This activity works, but it is much easier after introducing dictionaries to count values._

In [16]:
folderList = os.scandir(walk_this_directory)

# create a list of the file extensions:
extensionsList = list()

for item in folderList:
    if item.name.startswith('.'):
        continue
    elif item.is_file():
        extension = item.name.split('.')[-1]
        if extension not in extensionsList:
            extensionsList.append(extension)
    else:
        if item.is_dir():
            files = os.listdir(item.path)
            for file in files:
                # split off the extension (last in the list) and assign to 'extension'
                extension = file.split('.')[-1]
                # check if it's in the list, if not, then add (append) to the list
                if extension not in extensionsList:
                    extensionsList.append(extension)
print(extensionsList)

['ram', 'wma', 'mp3', 'tif', 'jpg', 'gif', 'pdf', 'ppt', 'asf', 't12z', 'asx', 'csv']


In [None]:
# to count and output, use the extensionsList as a headers row, 
# then count up and output to CSV. 

In [22]:
# use a dictionary to count the occurrences
# NB: dictionary operations are necessary for this block

folderList = os.scandir(walk_this_directory)

# create a list to hold the file extensions, and a dictionary to count the extensions
extensionsList = list()
extensionsCount = dict()

for item in folderList:
    # skip the system files that begin with a '.'
    if item.name.startswith('.'):
        continue
    elif item.is_file():
        # split off the extension (last in the list) and assign to 'extension'
        extension = item.name.split('.')[-1]
        # if not in the list, append
        if extension not in extensionsList:
            extensionsList.append(extension)
        # check the dictionary: add an entry if the extension is new, otherwise increment its count
        if extension not in extensionsCount:
            extensionsCount[extension] = 1
        else:
            extensionsCount[extension] = extensionsCount[extension] + 1
    else:
        if item.is_dir():
            files = os.listdir(item.path)
            for file in files:
                # split off the extension 
                extension = file.split('.')[-1]
                # check if it's in the list, if not, then add (append) to the list
                if extension not in extensionsList:
                    extensionsList.append(extension)
                # check the dictionary: add an entry if the extension is new, otherwise increment its count
                if extension not in extensionsCount:
                    extensionsCount[extension] = 1
                else:
                    extensionsCount[extension] = extensionsCount[extension] + 1

print(extensionsList)
print(extensionsCount)

['ram', 'wma', 'mp3', 'tif', 'jpg', 'gif', 'pdf', 'ppt', 'asf', 't12z', 'asx', 'csv']
{'ram': 1, 'wma': 1, 'mp3': 2, 'tif': 1, 'jpg': 3, 'gif': 1, 'pdf': 5, 'ppt': 3, 'asf': 1, 't12z': 1, 'asx': 2, 'csv': 1}
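
The add-or-increment pattern above can be shortened with `collections.Counter` from the standard library, which starts every missing key at zero. This is a sketch rather than a replacement for the class solution: it uses `os.walk()`, so it counts files at every depth rather than one level down, and the function name `count_extensions` and the demo files are invented.

```python
import os
import tempfile
from collections import Counter

def count_extensions(directory):
    """Tally file extensions for every non-hidden file under directory."""
    counts = Counter()
    for folder, subfolders, filenames in os.walk(directory):
        for filename in filenames:
            # skip the system files that begin with a '.'
            if filename.startswith('.'):
                continue
            extension = filename.split('.')[-1]
            counts[extension] += 1   # missing keys start at 0
    return counts

# Hypothetical layout to try it on.
with tempfile.TemporaryDirectory() as tmp:
    os.mkdir(os.path.join(tmp, 'pdf'))
    for name in ('a.pdf', 'b.pdf'):
        open(os.path.join(tmp, 'pdf', name), 'w').close()
    open(os.path.join(tmp, 'notes.csv'), 'w').close()
    open(os.path.join(tmp, '.DS_Store'), 'w').close()
    extension_counts = count_extensions(tmp)

print(dict(extension_counts))
```

The keys of the counter double as the list of extensions, so a separate `extensionsList` is not needed.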


## Advanced: wrap the above code into functions for gathering and recording file metadata

1. Building on the above examples, can you a) write functions that bundle code to ask for a directory? You could call this function `create_manifest_information` and it should be able to accept a path to a directory as an argument and return the manifestInfo list. And b) write a function that would accept the manifestInfo list as an argument and create a CSV? You could call the second function `write_manifest_to_csv`. 

_Advanced_

In [24]:
def create_manifest_information(directoryPath):
    '''This function takes a path and returns a list of lists, one for each file 
    found at the path and its subpaths. Each inner list records an index, the filename, 
    the folder containing the file, the full path, the number of bytes the file uses,
    and the file extension.'''
    # set counters
    fileCount = 0
    # instantiate lists to hold the information about the file (fileInfo) and the manifest (manifestInfo)
    fileInfo = list()
    manifestInfo = list()

    for folderName, subfolders, filenames in os.walk(directoryPath):    
        for filename in filenames:
            fileCount += 1
            index = fileCount
            filename = filename 
            folder = folderName
            path = os.path.join(folderName, filename)
            size = os.path.getsize(path)
            extension = filename.split('.')[-1]
    #        print('Found:', filename, folder, path, size, extension)

            fileInfo = [
                index,
                filename,
                folder,
                path,
                size,
                extension
                ]
            manifestInfo.append(fileInfo)
    
    # print a message to notify that the function completed. Comment this out if no other response desired
    print('Looked through the file tree. Found',len(manifestInfo),'files.')
    
    return manifestInfo

create_manifest_information(walk_this_directory)

Looked through the file tree. Found 23 files.


[[1,
  '.DS_Store',
  '../../assets/Bundle-web-files-small',
  '../../assets/Bundle-web-files-small/.DS_Store',
  12292,
  'DS_Store'],
 [2,
  'web-files-small-metadata.csv',
  '../../assets/Bundle-web-files-small',
  '../../assets/Bundle-web-files-small/web-files-small-metadata.csv',
  9069,
  'csv'],
 [3,
  '000727.ram',
  '../../assets/Bundle-web-files-small/audio',
  '../../assets/Bundle-web-files-small/audio/000727.ram',
  79,
  'ram'],
 [4,
  '11-3250JohnsonvFolinoEtAl.wma',
  '../../assets/Bundle-web-files-small/audio',
  '../../assets/Bundle-web-files-small/audio/11-3250JohnsonvFolinoEtAl.wma',
  21423499,
  'wma'],
 [5,
  'mj_telework_exchange_final_100710.mp3',
  '../../assets/Bundle-web-files-small/audio',
  '../../assets/Bundle-web-files-small/audio/mj_telework_exchange_final_100710.mp3',
  3471488,
  'mp3'],
 [6,
  'NEWSLINE_802AF71F439D401585C6FCB02F358307.mp3',
  '../../assets/Bundle-web-files-small/audio',
  '../../assets/Bundle-web-files-small/audio/NEWSLINE_802AF71F43

In [26]:
def write_manifest_to_csv(manifestInfo, outfile='file-manifest-with-extension.csv'):
    '''This function will write a CSV file from a simple file manifest. 
    The manifest going into the function must contain the following information, 
    in this order: index, filename, in_folder_path, full_file_path, size, extension.
    
    The function also allows you to provide a filename as outfile.'''

    # set up the csv, create a header row
    headers = [
        'index',
        'filename',
        'in_folder_path',
        'full_file_path',
        'size',
        'extension'
        ]

    # write the information using csv.writer(); newline='' lets the csv module handle line endings
    with open(outfile, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(headers)
        for file in manifestInfo:
    #        print(file)
            writer.writerow(file)
    print('Wrote',outfile)   
    
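# NB: manifestInfo here refers to the global list left over from an earlier cell, since the
# result of create_manifest_information() above was displayed but not assigned to a variable
write_manifest_to_csv(manifestInfo)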

Wrote file-manifest-with-extension.csv


In [29]:
# try the functions with a different directory:
newManifest = create_manifest_information(os.path.join('..','..','assets','Bundle-legacy-files'))

write_manifest_to_csv(newManifest, 'second-test-manifest.csv')

Looked through the file tree. Found 9 files.
Wrote second-test-manifest.csv


In [30]:
# get plaintext output of the manifest to see if it's expected information
!head second-test-manifest.csv

index,filename,in_folder_path,full_file_path,size,extension
1,.DS_Store,../../assets/Bundle-legacy-files,../../assets/Bundle-legacy-files/.DS_Store,6148,DS_Store
2,01N-0256_emc-000171-01.wps,../../assets/Bundle-legacy-files,../../assets/Bundle-legacy-files/01N-0256_emc-000171-01.wps,18451,wps
3,0528.jp2,../../assets/Bundle-legacy-files,../../assets/Bundle-legacy-files/0528.jp2,9400598,jp2
4,080-b2-tambarani.mp3,../../assets/Bundle-legacy-files,../../assets/Bundle-legacy-files/080-b2-tambarani.mp3,12705156,mp3
5,1898111001_1.txt,../../assets/Bundle-legacy-files,../../assets/Bundle-legacy-files/1898111001_1.txt,70753,txt
6,a0000003c.gdbtable.sdc,../../assets/Bundle-legacy-files,../../assets/Bundle-legacy-files/a0000003c.gdbtable.sdc,165699,sdc
7,ch00001.rtf,../../assets/Bundle-legacy-files,../../assets/Bundle-legacy-files/ch00001.rtf,1086,rtf
8,fertilizeruse.xls,../../assets/Bundle-legacy-files,../../assets/Bundle-legacy-files/fertilizeruse.xls,665088,xls
9,master narra
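
`!head` is a notebook shell escape and relies on a Unix-style `head` command being available. A portable alternative in plain Python might look like the sketch below. The helper name `head` and the demo file are hypothetical.

```python
import itertools
import os
import tempfile

def head(path, line_count=10):
    """Return the first line_count lines of a text file, without line endings."""
    with open(path, newline='') as f:
        # islice stops after line_count lines, so large files are not read in full
        return [line.rstrip('\r\n') for line in itertools.islice(f, line_count)]

# Write a small made-up CSV to a temporary file, then peek at the top of it.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, 'demo.csv')
    with open(demo_path, 'w', newline='') as f:
        f.write('index,filename\n')
        for number in range(1, 20):
            f.write(f'{number},file{number}.txt\n')
    first_lines = head(demo_path, 3)

print(first_lines)
```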