# Photo Metadata Extraction and Formatting

## Overview

This program extracts selected metadata from photos in a directory and formats the metadata as a CSV (comma-separated values) text file. The resulting CSV file can be opened and edited in a spreadsheet application; it also can be used to upload metadata to a digital library system.

## Step 1: Import modules.
Sources for the modules used in this program:<br>
<ul>
<li>https://docs.python.org/3/library/os.html#files-and-directories
<li>https://docs.python.org/3/library/mimetypes.html
<li>https://pypi.org/project/ExifRead/
<li>https://docs.python.org/3/library/csv.html</ul><br>
For information about Exchangeable Image File Format (EXIF) metadata, see https://en.wikipedia.org/wiki/Exif.

In [1]:
import os
import mimetypes
import exifread
import csv

## Step 2: Identify the photo directory.

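Prompt the user to enter the path to a directory of photos from which to extract metadata. The example uses the Photos folder of a local GitHub repository, '/Users/heather_m_campbell/Documents/GitHub/452-final-project/Photos'.<br>
<br>Enter the path without surrounding quotation marks. The ```input()``` function returns the typed characters verbatim, so quotation marks become part of the path string and the directory will not be found. In the example run shown here, the path was typed with quotation marks, which is the likely reason Step 4 reports 0 files.<br>
<br>Assign the path string to the ```photo_directory``` variable.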

In [2]:
photo_directory = input('Enter the file path to your photo directory: ')

Enter the file path to your photo directory: '/Users/heather_m_campbell/Documents/GitHub/452-final-project/Photos'
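
A typed path can be tidied before it is used. The following is a minimal sketch, not part of the original program; ```clean_path``` is a hypothetical helper name, and the sample path is made up for illustration.

```python
import os

def clean_path(raw):
    """Hypothetical helper: strip whitespace and one pair of surrounding quotes."""
    cleaned = raw.strip()
    # Remove matching single or double quotes wrapped around the whole path.
    if len(cleaned) >= 2 and cleaned[0] == cleaned[-1] and cleaned[0] in ('"', "'"):
        cleaned = cleaned[1:-1]
    # Expand a leading ~ to the user's home directory, if present.
    return os.path.expanduser(cleaned)

example = clean_path("  '/Users/example/Photos' ")  # made-up sample input
print(example)
print(os.path.isdir(example))  # True only if the directory exists on this machine
```

In the notebook, the same idea could be applied as ```photo_directory = clean_path(input(...))```, assuming the helper is defined in an earlier cell.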


## Step 3: Generate the file names in the photo directory tree.
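Use the ```os.walk()``` function to generate file names. By default, this function "walks" the directory tree from the top down, starting at the root. For each directory in the tree, including the root itself, it yields a 3-tuple: the directory path (position 0), a list of subdirectory names (position 1), and a list of file names (position 2).<br>
<br>Assign the resulting generator object to the ```allfiles``` variable. Printing it shows only the generator itself; the tuples are produced when a loop iterates over it.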

In [3]:
allfiles = os.walk(photo_directory)
print(allfiles)

<generator object walk at 0x111af5f68>
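
To see what the generator yields, the following sketch builds a small temporary directory tree and walks it. The folder and file names are invented for the illustration, and the paths are shown relative to the root for readability.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    # Build a tiny tree: one file in the root, one file in a subfolder.
    os.mkdir(os.path.join(root, 'Graduation'))
    for relative in ('a.jpg', os.path.join('Graduation', 'b.jpg')):
        with open(os.path.join(root, relative), 'w') as handle:
            handle.write('')
    # Collect each 3-tuple, replacing the absolute path with a relative one.
    walked = [(os.path.relpath(dirpath, root), dirnames, filenames)
              for dirpath, dirnames, filenames in os.walk(root)]

for entry in walked:
    print(entry)
# ('.', ['Graduation'], ['a.jpg'])
# ('Graduation', [], ['b.jpg'])
```

A generator can be iterated only once, so a second loop over the same ```os.walk()``` object would yield nothing; call ```os.walk()``` again if the tree has to be traversed a second time.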


## Step 4: Use loops to collect file information in a list.
For each directory in the ```allfiles``` object, the outer loop assigns the first value in the 3-tuple, the path (position 0), to the ```path``` variable.<br>
<br>Then the ```path``` string is sliced to extract the subdirectory name (```folder```), if any. The slice starts at the length of the ```photo_directory``` string plus 1, which skips the slash before the folder name; this assumes the photo directory was entered without a trailing slash. For files in the root of the photo directory, ```folder``` is an empty string.<br>
<br>Next, the third value in the 3-tuple (position 2), which is a list of files in that directory, is assigned to the ```file_names``` list.<br>
<br>The inner loop goes through the ```file_names``` list. It skips any file whose name contains 'ipynb' or 'DS_Store', because those nonimage files do not belong in the output. Note: In the ```if``` statement, the Boolean operator ```and``` ensures both conditions are met.<br>
<br>For each remaining file, the inner loop increments a counter to create a file ID, ```file_num```, and creates a list of data (```file_data```) about the file: its ID, its path, its folder, and the file name. Finally, it appends the ```file_data``` list to the larger ```all_file_list```.<br>
<br>After these loops finish, ```all_file_list``` contains one list of data for every retained file in the photo directory and all of its subdirectories.

In [4]:
file_num = 0
file_names = []
file_data = []
all_file_list = []

for entry in allfiles:  # 'entry' avoids shadowing the built-in dir()
    path = entry[0]
    folder = path[(len(photo_directory)+1):]  # slice path after photo directory to get folder
    file_names = entry[2]
    for file in file_names:
        if 'ipynb' not in file and 'DS_Store' not in file:
            file_num = file_num + 1
            file_data = [file_num, path, folder, file]
            all_file_list.append(file_data)

print(len(all_file_list), 'files to process.')
# print(all_file_list)  # Use this statement to view the contents of the list.

0 files to process.
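
The slice in the loop depends on how the photo directory was typed. The sketch below compares it with ```os.path.relpath()```, a standard-library alternative that the original program does not use; the paths are hypothetical.

```python
import os

photo_directory = os.path.join('Users', 'example', 'Photos')  # hypothetical path
path = os.path.join(photo_directory, 'Graduation')

# The approach used in the loop: skip the root path plus one separator.
folder_by_slice = path[len(photo_directory) + 1:]
# Alternative: let the standard library compute the relative path.
folder_by_relpath = os.path.relpath(path, photo_directory)

# If the root was entered with a trailing separator, the slice drops a letter.
with_trailing = photo_directory + os.sep
sliced_trailing = path[len(with_trailing) + 1:]
relpath_trailing = os.path.relpath(path, with_trailing)

print(folder_by_slice, folder_by_relpath)   # Graduation Graduation
print(sliced_trailing, relpath_trailing)    # raduation Graduation
```

One difference to keep in mind: for the root directory itself, the slice gives an empty string, whereas ```os.path.relpath()``` returns ```'.'```.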


## Step 5: Extract metadata from files.
The ```all_file_list``` now comprises a list for each file in the photo directory. Each file list contains the file ID number (position 0), its path (position 1), its folder (position 2), and its name (position 3). The next loop adds extracted metadata to these lists.<br>
<br>For each file list, the loop assigns values to variables that will correspond to column headings in the CSV file. Values from the previously generated file lists are assigned to the first several variables in the loop.<br>
<br>Several functions used in the loop require a complete file path, including the file name, so the ```file_path``` variable concatenates the path, a slash, and the file name.<br>
<br>To get the file format, the ```mimetypes.guess_type()``` function returns a tuple (type, encoding), based on the file extension. Later in the loop, ```file_format[0]``` returns only the first value in the tuple, the file type.<br>
<br>The ```os.stat()``` function returns file status information; its ```st_size``` attribute provides the file size in bytes. For the subsequent variable, the ```file_size``` value is multiplied by 0.000001 to get the size in megabytes (1 MB = 1,000,000 bytes), rounded to two decimal places.<br>
<br>Next, the program prepares to extract the image metadata by opening the file in binary mode. The ```exifread.process_file()``` function creates a dictionary of EXIF tags and their values. The dictionary ```.get()``` method then can be used to retrieve the values for specified tags; it returns ```None``` for any tag the file does not contain.<br>
<br>Finally, the data collected for the file is assigned to the ```file_metadata``` list, which, in turn, is appended to the larger ```all_file_metadata``` list.

In [5]:
file_metadata = []
all_file_metadata = []

for file in all_file_list:
    file_ID = file[0]
    event = file[2]
    file_name = file[3]
    file_path = file[1] + '/' + file_name    
    file_format = mimetypes.guess_type(file_path, strict=False)
    file_size = os.stat(file_path).st_size
    file_size_MB = round((file_size*0.000001),2)  # round to 2 decimal places
    with open(file_path, 'rb') as image_file:  # binary mode; closed automatically after reading
        tags = exifread.process_file(image_file, details=False)
    datetime_original = tags.get('EXIF DateTimeOriginal')
    datetime_digitized = tags.get('EXIF DateTimeDigitized')
    image_software = tags.get('Image Software')
    # XResolution/YResolution are pixel densities (pixels per ResolutionUnit), not pixel dimensions
    image_width = tags.get('Image XResolution')
    image_height = tags.get('Image YResolution')
    image_units = tags.get('Image ResolutionUnit')
    latitude_ref = tags.get('GPS GPSLatitudeRef')  # hemisphere reference: N or S
    latitude = tags.get('GPS GPSLatitude')  # list of coordinates: degrees, minutes, seconds
    longitude_ref = tags.get('GPS GPSLongitudeRef')  # hemisphere reference: E or W
    longitude = tags.get('GPS GPSLongitude')
    camera = tags.get('Image Model')
    exposure = tags.get('EXIF ExposureTime')
    flash = tags.get('EXIF Flash')
    lens = tags.get('EXIF LensModel')
    file_metadata = [file_ID, event, file_name, file_format[0], file_size, file_size_MB, 
                     datetime_original, datetime_digitized,
                     image_software, image_width, image_height, image_units, 
                     latitude_ref, latitude, longitude_ref, longitude, 
                     camera, exposure, flash, lens]
    all_file_metadata.append(file_metadata)
    
# print(all_file_metadata)
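
The format and size steps can be tried on their own without any photo files. This is a small sketch using made-up file names and a made-up byte count.

```python
import mimetypes

# guess_type() looks only at the file extension; the file need not exist.
file_format = mimetypes.guess_type('IMG_0001.jpg', strict=False)
print(file_format)      # a (type, encoding) tuple, typically ('image/jpeg', None)
print(file_format[0])   # the type alone

# A name with no extension cannot be classified.
unknown = mimetypes.guess_type('file_without_extension', strict=False)
print(unknown)          # (None, None)

# Convert a byte count to decimal megabytes, as in the loop.
file_size = 2_345_678   # made-up size in bytes
file_size_MB = round(file_size * 0.000001, 2)
print(file_size_MB)     # 2.35
```

Because the guess relies on the extension alone, a mislabeled file (for example, a text file renamed with a .jpg extension) would still be reported as an image.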

## Step 6: Output file metadata in CSV format.
Now that the ```all_file_metadata``` list holds metadata for all files in the photo directory, this information can be written out as a CSV file.<br>
<br>First, the ```photo_data.csv``` file is opened for writing. Then the ```csv.writer()``` function creates a writer object that is used to populate the file. The first row is populated by column headings that correspond to the metadata values in the list. Each subsequent row is populated by the values in each file list collected in ```all_file_metadata```. Last, the populated file is closed.<br>
<br>The resulting CSV file is written to the current working directory, which is normally the directory that contains the program file.

In [6]:
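# newline='' lets the csv module control line endings (avoids blank rows on Windows)
with open('photo_data.csv', 'w', newline='') as outfile:
    csv_out = csv.writer(outfile)
    # Headings follow the order of the values in each file_metadata list.
    csv_out.writerow(['ID', 'Event', 'File Name', 'File Format', 'File Size (Bytes)',
                      'File Size (MB)', 'Date Taken', 'Date Digitized',
                      'Software', 'X Resolution', 'Y Resolution', 'Resolution Unit',
                      'Latitude Ref', 'Latitude (DMS)', 'Longitude Ref', 'Longitude (DMS)',
                      'Camera', 'Exposure Time', 'Flash Used?', 'Lens Model'])
    csv_out.writerows(all_file_metadata)
# Leaving the with block closes the file.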

print('Your CSV file is ready.')

Your CSV file is ready.
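
To check how the writer treats missing values, the sketch below writes a few rows to an in-memory buffer instead of a file and reads them back. The rows are invented sample data with only four of the columns; ```None``` stands in for a tag that a photo does not contain.

```python
import csv
import io

# Invented sample rows; None represents a missing EXIF tag.
rows = [[1, 'Graduation', 'b.jpg', None],
        [2, '', 'a.jpg', 'Example Camera']]

buffer = io.StringIO(newline='')  # in-memory stand-in for the output file
writer = csv.writer(buffer)
writer.writerow(['ID', 'Event', 'File Name', 'Camera'])
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)

# Read the text back to confirm the structure survived the round trip.
read_back = list(csv.reader(io.StringIO(csv_text, newline='')))
print(read_back[1])  # ['1', 'Graduation', 'b.jpg', '']
```

The writer stores ```None``` as an empty field, so photos without a given tag produce blank cells rather than errors. All values come back as strings when the file is read again.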
