# Review - iterating, listing files, writing lists to files

Notes from class, 14 March 2019. Reviewing where we've been so far. 

Below are worked examples from the previous activities on looking through files, using `os.listdir()`, `os.scandir()`, and `os.walk()`.

The sections below work through the Reflection Questions from the Python 104 examples.

## Work throughs: listdir()

1. Write a script that uses `os.listdir()` for each of the directories in the `Bundle-web-files-small` directory. You can put in the path names directly as variables, but you should use the `os.path.join()` function to build the filepaths, so that the script does not depend on one exact filepath string; the path separator varies across operating systems.

In [77]:
import os
from os.path import join, getsize
import csv

walk_this_directory = os.path.join('..','assets','Bundle-web-files-small')
print(walk_this_directory)

../assets/Bundle-web-files-small


In [3]:
os.listdir(walk_this_directory)

['audio',
 'image',
 'pdf',
 'presentation',
 'video',
 'web-files-small-metadata.csv']

In [7]:
path_to_query = os.path.join(walk_this_directory,'image')

print(path_to_query)
os.listdir(path_to_query)

../assets/Bundle-web-files-small/image


['1005107061.tif',
 '13080t.jpg',
 'k7989-7x.jpg',
 'm237a2f.gif',
 'orca.via_.moc_.noaa_.jpg']

In [23]:
parent_directory_list = os.listdir(walk_this_directory)

for thing in parent_directory_list:
    thing_path = os.path.join(walk_this_directory, thing)
    # ask the filesystem rather than guessing from a '.' in the name
    if os.path.isfile(thing_path):
        print(thing, "It's a file!")
    else:
        print(thing, ':', os.listdir(thing_path))

audio : ['000727.ram', '11-3250JohnsonvFolinoEtAl.wma', 'mj_telework_exchange_final_100710.mp3', 'NEWSLINE_802AF71F439D401585C6FCB02F358307.mp3']
image : ['1005107061.tif', '13080t.jpg', 'k7989-7x.jpg', 'm237a2f.gif', 'orca.via_.moc_.noaa_.jpg']
pdf : ['01-1480.pdf', 'Chapter03.pdf', 'file.pdf', 'HR2021 commtext.pdf', 'PFCHEJ.pdf']
presentation : ['ADAEMPLOYMENTTaxIncentives.ppt', 'BudgetandGrants012710.ppt', 'Non-FTE-Trainee-Activities-060109.ppt']
video : ['04-04-21full.asf', 'glmp_cig.EQ.wm.p20.t12z', 'oct17cc.asx', 'vlwhcssc.asx']
web-files-small-metadata.csv It's a file!
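
Guessing whether something is a file from a dot in its name can misfire: a folder may contain dots, and a file such as `README` has none. Here is a minimal sketch of the same listing that asks the filesystem through `os.path.isdir()` instead. The function name `describe_directory` and the throwaway demo tree are assumptions for illustration, not part of the class materials.

```python
import os
import tempfile


def describe_directory(parent):
    """Return (name, kind, children) tuples for each entry in parent."""
    results = []
    for name in sorted(os.listdir(parent)):
        full_path = os.path.join(parent, name)
        if os.path.isdir(full_path):
            # a directory: list what it contains
            results.append((name, 'directory', sorted(os.listdir(full_path))))
        else:
            results.append((name, 'file', []))
    return results


# Build a small throwaway tree (hypothetical names) to try it out
with tempfile.TemporaryDirectory() as demo_root:
    os.mkdir(os.path.join(demo_root, 'folder.with.dots'))
    open(os.path.join(demo_root, 'folder.with.dots', 'a.txt'), 'w').close()
    open(os.path.join(demo_root, 'README'), 'w').close()
    demo_result = describe_directory(demo_root)

for name, kind, children in demo_result:
    print(name, ':', kind, children)
```

To use it on the class data, call `describe_directory(walk_this_directory)`.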


## Work throughs: scandir()
1. Write a script that uses `os.scandir()` to check whether the entities in the directory are files or directories. The script should output a count of files and a count of directories.

In [30]:
# check that we still have our desired directory:
print(walk_this_directory)

../assets/Bundle-web-files-small


In [37]:
# here is one way to use os.scandir, and to check if we are getting information back:
for item in os.scandir(walk_this_directory):
    print(item)

<DirEntry 'audio'>
<DirEntry 'image'>
<DirEntry 'pdf'>
<DirEntry 'presentation'>
<DirEntry 'video'>
<DirEntry 'web-files-small-metadata.csv'>


In [78]:
# notice that these are "DirEntry" objects, though. That means 
# we can get more information from each one. os.scandir() returns an iterator;
# opening it with the context manager 'with' closes it when the loop is done

with os.scandir(walk_this_directory) as items_list:
    for entry in items_list:
        # each DirEntry exposes a few specific things, like the "name" and the "path", without using os.path.join
        print('Looking at:',entry.name)
        # scandir also can retrieve file information using the stat() call, which gives size and other information
        statinfo = entry.stat()
        print('Stat size:',statinfo.st_size)
        # the dirEntry information returned by scandir also allows us to do 
        # a logical check to see if it's a directory or file:
        if entry.is_dir():
            file_list = os.listdir(entry.path)
            print('This is a directory and contains',len(file_list),'these files',file_list)
        elif entry.is_file():
            print('This is a file named',entry.name,'that takes up',statinfo.st_size,'bytes')
        # just in case something is neither a file nor a directory, add in this option:
        else:
            print('This object is unrecognized:', entry.name)

Looking at: audio
Stat size: 204
This is a directory and contains 4 these files ['000727.ram', '11-3250JohnsonvFolinoEtAl.wma', 'mj_telework_exchange_final_100710.mp3', 'NEWSLINE_802AF71F439D401585C6FCB02F358307.mp3']
Looking at: image
Stat size: 238
This is a directory and contains 5 these files ['1005107061.tif', '13080t.jpg', 'k7989-7x.jpg', 'm237a2f.gif', 'orca.via_.moc_.noaa_.jpg']
Looking at: pdf
Stat size: 238
This is a directory and contains 5 these files ['01-1480.pdf', 'Chapter03.pdf', 'file.pdf', 'HR2021 commtext.pdf', 'PFCHEJ.pdf']
Looking at: presentation
Stat size: 170
This is a directory and contains 3 these files ['ADAEMPLOYMENTTaxIncentives.ppt', 'BudgetandGrants012710.ppt', 'Non-FTE-Trainee-Activities-060109.ppt']
Looking at: video
Stat size: 204
This is a directory and contains 4 these files ['04-04-21full.asf', 'glmp_cig.EQ.wm.p20.t12z', 'oct17cc.asx', 'vlwhcssc.asx']
Looking at: web-files-small-metadata.csv
Stat size: 9069
This is a file named web-files-small-metad
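
The question asks for a count of files and a count of directories, which the cell above does not print. A minimal sketch of that tally follows. The function name `count_entries` and the throwaway demo tree are assumptions for illustration.

```python
import os
import tempfile


def count_entries(directory):
    """Return (file_count, dir_count) for the entries directly inside directory."""
    file_count = 0
    dir_count = 0
    with os.scandir(directory) as entries:
        for entry in entries:
            if entry.is_dir():
                dir_count += 1
            elif entry.is_file():
                file_count += 1
    return file_count, dir_count


# Build a small throwaway tree (hypothetical names) to try it out
with tempfile.TemporaryDirectory() as demo_root:
    os.mkdir(os.path.join(demo_root, 'audio'))
    os.mkdir(os.path.join(demo_root, 'image'))
    open(os.path.join(demo_root, 'metadata.csv'), 'w').close()
    demo_counts = count_entries(demo_root)

print('Files:', demo_counts[0], 'Directories:', demo_counts[1])
```

Calling `count_entries(walk_this_directory)` would count only the top level, since `os.scandir()` does not descend into subdirectories.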

## Work throughs: comparing listdir(), scandir(), walk()
1. Examine the examples above that use `os.walk()`. How does it differ from the previous two functions? In some ways it lets you get deeper into the file structure, so please explain your observation in a sentence or two.

### Ways to find files

```
listdir() only provides us a list of the names of the files and directories at the given path. If names are all we need, this is the simplest option; for anything more (size, type) we have to make further calls such as os.path.getsize() or os.path.isdir(). It reveals hidden files (beginning with '.') and is not recursive, so it will not query subdirectories.

scandir() returns DirEntry objects that carry more information, like name and path, and through stat() it can be used to find size information. It allows us to check if things are files, directories, or other sorts of objects. Like listdir(), it reveals hidden files and is not recursive.

walk() is recursive: it visits the starting directory and every subdirectory beneath it, and at each step gives the folder path, its subfolders, and its filenames. That provides context about items in a directory (full path and containing folder). It also reveals hidden files.

Yet another way to look through the file tree is the glob library, which matches path patterns. By default its '*' pattern skips hidden files.
```
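
The differences noted above can be checked on a small throwaway tree. This is a sketch only; the file and folder names are made up for the demonstration.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    # one hidden file, one visible file, and one file inside a subfolder
    os.mkdir(os.path.join(root, 'sub'))
    for relative in ['.hidden', 'top.txt', os.path.join('sub', 'deep.txt')]:
        open(os.path.join(root, relative), 'w').close()

    # listdir and scandir: top level only, hidden files included
    listdir_names = sorted(os.listdir(root))
    with os.scandir(root) as entries:
        scandir_names = sorted(entry.name for entry in entries)

    # walk: descends into 'sub', hidden files included
    walk_names = sorted(
        name for _, _, filenames in os.walk(root) for name in filenames
    )

    # glob with '*': top level only, hidden files skipped
    glob_names = sorted(
        os.path.basename(p) for p in glob.glob(os.path.join(root, '*'))
    )

print('listdir:', listdir_names)
print('scandir:', scandir_names)
print('walk:   ', walk_names)
print('glob:   ', glob_names)
```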

## Work throughs: walking through the file tree, gathering metadata, creating a CSV
1. Create a script that will create an inventory of all the files in the assets folder `Bundle-web-files-small`. The inventory should be a CSV file, and it should include the filename of the file, the directory path for the file, the full path to the file, and the file size. You may include any other information that you think is important. Call this file `inventory_script.py`.

In [83]:
# first use os.walk() as in the notebook, Python 104

## get information about each of the files

# first let's set some counters
fileCount = 0
# and a list to hold the information about one file (fileInfo), and another to hold every file's row (manifestInfo)
fileInfo = list()
manifestInfo = list()

for folderName, subfolders, filenames in os.walk(walk_this_directory):    
    for filename in filenames:
        fileCount += 1
        index = fileCount
        filename = filename 
        folder = folderName
        path = os.path.join(folderName, filename)
        size = os.path.getsize(path)
#        print('Found:', filename, folder, path, size)

        fileInfo = [
            index,
            filename,
            folder,
            path,
            size
            ]
        manifestInfo.append(fileInfo)
print('Looked through the file tree. Found',len(manifestInfo),'files.')

## write to a CSV

# set up the csv, create a header row
headers = [
    'index',
    'filename',
    'in_folder_path',
    'full_file_path',
    'size'
    ]

# write the information using csv.writer()
# newline='' lets the csv module control line endings (avoids blank rows on Windows)
with open('file-manifest.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(headers)
    for file in manifestInfo:
#        print(file)
        writer.writerow(file)
    print('Wrote the file manifest')   

Looked through the file tree. Found 22 files.
Wrote the file manifest
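
One way to check a manifest is to read it back. The sketch below writes two made-up rows to a temporary file, then reads them with `csv.DictReader`. The sample rows are assumptions for illustration, not data from the bundle.

```python
import csv
import os
import tempfile

headers = ['index', 'filename', 'in_folder_path', 'full_file_path', 'size']
# hypothetical rows standing in for manifestInfo
sample_rows = [
    [1, 'a.txt', 'demo', os.path.join('demo', 'a.txt'), 10],
    [2, 'b.pdf', 'demo', os.path.join('demo', 'b.pdf'), 2048],
]

with tempfile.TemporaryDirectory() as tmp:
    manifest_path = os.path.join(tmp, 'file-manifest.csv')
    with open(manifest_path, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(headers)
        writer.writerows(sample_rows)

    # read it back; each row becomes a dict keyed by the header row
    with open(manifest_path, newline='') as f:
        rows_read = list(csv.DictReader(f))

# values come back as strings, so convert before doing arithmetic
total_size = sum(int(row['size']) for row in rows_read)
print('Rows:', len(rows_read), 'Total bytes:', total_size)
```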


## Work throughs: same as above (harvesting metadata), but adding extension to metadata
1. Extend the above script, using the techniques demonstrated here, to determine the file extension of each file and add the extension to the CSV output. (Hint: you could split the filename string, right?)

In [85]:
# first use os.walk() as in the notebook, Python 104

## get information about each of the files

# first let's set some counters
fileCount = 0
# and a list to hold the information about one file (fileInfo), and another to hold every file's row (manifestInfo)
fileInfo = list()
manifestInfo = list()

for folderName, subfolders, filenames in os.walk(walk_this_directory):    
    for filename in filenames:
        fileCount += 1
        index = fileCount
        filename = filename 
        folder = folderName
        path = os.path.join(folderName, filename)
        size = os.path.getsize(path)
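        # splitext gives '' when a name has no extension; split('.')[-1] would give the whole name
        extension = os.path.splitext(filename)[1].lstrip('.')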
#        print('Found:', filename, folder, path, size, extension)

        fileInfo = [
            index,
            filename,
            folder,
            path,
            size,
            extension
            ]
        manifestInfo.append(fileInfo)
print('Looked through the file tree. Found',len(manifestInfo),'files.')

## write to a CSV

# set up the csv, create a header row
headers = [
    'index',
    'filename',
    'in_folder_path',
    'full_file_path',
    'size',
    'extension'
    ]

# write the information using csv.writer()
# newline='' lets the csv module control line endings (avoids blank rows on Windows)
with open('file-manifest-with-extension.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(headers)
    for file in manifestInfo:
#        print(file)
        writer.writerow(file)
    print('Wrote the file manifest with extensions!')   

Looked through the file tree. Found 22 files.
Wrote the file manifest with extensions!
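
Splitting on `'.'` works for the files in this bundle, but it returns the whole name when there is no dot at all. A minimal sketch comparing it with `os.path.splitext()` follows. The helper name `get_extension` and the sample filenames other than `orca.via_.moc_.noaa_.jpg` are assumptions for illustration.

```python
import os


def get_extension(filename):
    """Return the extension without its dot, lowercased, or '' if there is none."""
    return os.path.splitext(filename)[1].lstrip('.').lower()


for name in ['orca.via_.moc_.noaa_.jpg', 'Photo.JPG', 'README', '.bashrc']:
    print(name, '| split:', repr(name.split('.')[-1]),
          '| splitext:', repr(get_extension(name)))
```

Lowercasing is a design choice here so that `JPG` and `jpg` tally together; drop `.lower()` to keep the original case.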


## Work throughs: identifying file extensions, and sorting files by extension
1. Write a script that can walk through a series of directories and identify files based on their file extension. For example, perhaps you want to count the number of .pdf or .jpg files. Create a script that looks for this information and then tallies the files. Then, have the program output the list of filenames and filepaths in a CSV file. Call this file `extension_detector.py`.

In [86]:
for folderName, subfolders, filenames in os.walk(walk_this_directory):
    
    for filename in filenames:
        filename = filename
        folder = folderName
        path = os.path.join(folderName,filename)
        size = os.path.getsize(path)
#        print('Found:',filename,folder,path,size)
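        # splitext gives '' when a name has no extension; split('.')[-1] would give the whole name
        extension = os.path.splitext(filename)[1].lstrip('.')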
        print('Found:',filename,folder,path,size,extension)


Found: web-files-small-metadata.csv ../assets/Bundle-web-files-small ../assets/Bundle-web-files-small/web-files-small-metadata.csv 9069 csv
Found: 000727.ram ../assets/Bundle-web-files-small/audio ../assets/Bundle-web-files-small/audio/000727.ram 79 ram
Found: 11-3250JohnsonvFolinoEtAl.wma ../assets/Bundle-web-files-small/audio ../assets/Bundle-web-files-small/audio/11-3250JohnsonvFolinoEtAl.wma 21423499 wma
Found: mj_telework_exchange_final_100710.mp3 ../assets/Bundle-web-files-small/audio ../assets/Bundle-web-files-small/audio/mj_telework_exchange_final_100710.mp3 3471488 mp3
Found: NEWSLINE_802AF71F439D401585C6FCB02F358307.mp3 ../assets/Bundle-web-files-small/audio ../assets/Bundle-web-files-small/audio/NEWSLINE_802AF71F439D401585C6FCB02F358307.mp3 961195 mp3
Found: 1005107061.tif ../assets/Bundle-web-files-small/image ../assets/Bundle-web-files-small/image/1005107061.tif 395734 tif
Found: 13080t.jpg ../assets/Bundle-web-files-small/image ../assets/Bundle-web-files-small/image/13080
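
The cell above finds each extension but does not tally them. A minimal sketch of the tally using `collections.Counter` follows. The function name `tally_extensions`, the `'(none)'` label, and the throwaway demo tree are assumptions for illustration.

```python
import os
import tempfile
from collections import Counter


def tally_extensions(directory):
    """Walk directory and count files per extension (lowercased, no dot)."""
    tally = Counter()
    for folder_name, subfolders, filenames in os.walk(directory):
        for filename in filenames:
            extension = os.path.splitext(filename)[1].lstrip('.').lower()
            # group files that have no extension under one label
            tally[extension or '(none)'] += 1
    return tally


# Build a small throwaway tree (hypothetical names) to try it out
with tempfile.TemporaryDirectory() as demo_root:
    os.mkdir(os.path.join(demo_root, 'sub'))
    for relative in ['a.pdf', 'b.pdf', 'README', os.path.join('sub', 'c.jpg')]:
        open(os.path.join(demo_root, relative), 'w').close()
    demo_tally = tally_extensions(demo_root)

for extension, count in demo_tally.most_common():
    print(extension, count)
```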

1. Building on the above examples, can you a) write a function that bundles the code for walking a directory? You could call this function `create_manifest_information`; it should accept a path to a directory as an argument and return the manifestInfo list. And b) write a function that accepts the manifestInfo list as an argument and creates a CSV?
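
One possible answer, offered as a sketch rather than the definitive solution. `create_manifest_information` is the name suggested in the question; `write_manifest_csv`, its `csv_path` parameter, and the throwaway demo tree are assumptions for illustration.

```python
import csv
import os
import tempfile


def create_manifest_information(directory):
    """Walk directory and return one row per file:
    [index, filename, folder, full path, size, extension]."""
    manifest_info = []
    for folder_name, subfolders, filenames in os.walk(directory):
        for filename in filenames:
            path = os.path.join(folder_name, filename)
            manifest_info.append([
                len(manifest_info) + 1,
                filename,
                folder_name,
                path,
                os.path.getsize(path),
                os.path.splitext(filename)[1].lstrip('.'),
            ])
    return manifest_info


def write_manifest_csv(manifest_info, csv_path):
    """Write the manifest rows to csv_path with a header row; return csv_path."""
    headers = ['index', 'filename', 'in_folder_path',
               'full_file_path', 'size', 'extension']
    with open(csv_path, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(headers)
        writer.writerows(manifest_info)
    return csv_path


# Build a small throwaway tree (hypothetical names) to try it out
with tempfile.TemporaryDirectory() as demo_root:
    os.mkdir(os.path.join(demo_root, 'pdf'))
    with open(os.path.join(demo_root, 'pdf', 'file.pdf'), 'w') as f:
        f.write('12345')
    demo_manifest = create_manifest_information(demo_root)
    demo_csv = write_manifest_csv(
        demo_manifest, os.path.join(demo_root, 'file-manifest.csv'))
    with open(demo_csv, newline='') as f:
        demo_rows = list(csv.reader(f))

print('Found', len(demo_manifest), 'file(s); wrote', len(demo_rows), 'CSV rows')
```

With the class data, the two calls would be `write_manifest_csv(create_manifest_information(walk_this_directory), 'file-manifest.csv')`.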