# Python 105 - Harvesting Integrity Metadata

There are a few final touches that can improve our file inventory, including
recording more robust integrity information, identifying the MIME type, and recording Tika output.

This workbook walks through the creation and recording of more robust 
integrity information using checksums.

## Setup

Import the libraries and functions, as developed in the previous notebooks.
The `hashlib` module supports the generation of checksum information.

In [1]:
import os
from os.path import join, getsize
import csv
import hashlib

In [2]:
# check which algorithms are available in your setup. We'll focus on MD5 and the SHA family
hashlib.algorithms_available

{'DSA',
 'DSA-SHA',
 'MD4',
 'MD5',
 'MDC2',
 'RIPEMD160',
 'SHA',
 'SHA1',
 'SHA224',
 'SHA256',
 'SHA384',
 'SHA512',
 'blake2b',
 'blake2s',
 'dsaEncryption',
 'dsaWithSHA',
 'ecdsa-with-SHA1',
 'md4',
 'md5',
 'mdc2',
 'ripemd160',
 'sha',
 'sha1',
 'sha224',
 'sha256',
 'sha384',
 'sha3_224',
 'sha3_256',
 'sha3_384',
 'sha3_512',
 'sha512',
 'shake_128',
 'shake_256',
 'whirlpool'}
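
The set above depends on the local OpenSSL build, so it can differ from machine to machine. As a minimal sketch (the variable names `wanted` and `supported` are illustrative, not from the earlier notebooks), `hashlib.algorithms_guaranteed` lists the algorithms present on every platform, and `hashlib.new()` builds a hash object from an algorithm name:

```python
import hashlib

# algorithms_guaranteed is the subset of algorithms available on all platforms
wanted = ['md5', 'sha256']
supported = {name: name in hashlib.algorithms_guaranteed for name in wanted}
print(supported)

# hashlib.new() takes the algorithm name as a string, plus optional initial bytes
digest = hashlib.new('sha256', b'hello').hexdigest()
# a SHA-256 hex digest is always 64 characters long
print(len(digest))
```

Checking against `algorithms_guaranteed` rather than `algorithms_available` is one way to keep a script portable across setups.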

In [13]:
print(os.getcwd())
sampleDir = join('..','assets','Bundle-web-files-small')
os.listdir(sampleDir)

/Users/rickypunzalan/Desktop/digcur/activities


['.DS_Store',
 'audio',
 'image',
 'pdf',
 'presentation',
 'video',
 'web-files-small-metadata.csv']

In [17]:
sampleFile = join(sampleDir,'web-files-small-metadata.csv')

print(sampleFile)
type(sampleFile)

../assets/Bundle-web-files-small/web-files-small-metadata.csv


str

In [26]:
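# use the following example to create an MD5 checksum
# hashlib.md5() creates an MD5 hash object from bytes; here the path string
# itself is encoded and hashed, so the digest reflects the path text,
# not the contents of the file
m = hashlib.md5(sampleFile.encode())
# use hexdigest() to print the hash as a hexadecimal string
print(m.hexdigest())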

c28c08c3e94fe867c81b73502405d7a9


In [27]:
# make a function
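def create_md5_digest(file):
    '''This is a simple function to create an md5 checksum.

    The accepted file value should be a string representing a valid path.
    Note: the digest is computed from the encoded path string itself,
    not from the bytes stored in the file.'''
    m = hashlib.md5(file.encode())
    return m.hexdigest()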

In [28]:
create_md5_digest(sampleFile)

'c28c08c3e94fe867c81b73502405d7a9'
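
The digest above is computed from the path string, so it would not change if the file's contents changed. For integrity checking, the usual approach is to hash the bytes stored in the file. The following is a minimal sketch of that approach; the function name `create_md5_digest_from_contents` and the `chunk_size` parameter are assumptions for illustration and do not appear in the earlier notebooks.

```python
import hashlib

def create_md5_digest_from_contents(path, chunk_size=65536):
    '''Return the MD5 hex digest of the bytes stored in the file at path.

    The file is read in binary mode, chunk by chunk, so large files
    do not need to fit in memory all at once.'''
    m = hashlib.md5()
    with open(path, 'rb') as f:
        # iter() with a sentinel keeps calling f.read() until it returns b''
        for chunk in iter(lambda: f.read(chunk_size), b''):
            m.update(chunk)
    return m.hexdigest()
```

Calling `create_md5_digest_from_contents(sampleFile)` should return a different value from the path-based digest, and that value should change whenever the file is edited.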

### Add to metadata harvesting script

Now, let's add to the metadata harvesting code from the previous notebook (Python 104).

In [30]:
# first use os.walk() as in the notebook, Python 104

## get information about each of the files

# first let's set a counter for the files found
fileCount = 0
# and a list to hold one fileInfo row per file
manifestInfo = list()

for folderName, subfolders, filenames in os.walk(sampleDir):
    for filename in filenames:
        fileCount += 1
        index = fileCount
        folder = folderName
        path = os.path.join(folderName, filename)
        size = os.path.getsize(path)
        extension = filename.split('.')[-1]
        # add in checksum, reusing the path built above
        checksumMD5 = create_md5_digest(path)
#        print('Found:', filename, folder, path, size, extension)

        fileInfo = [
            index,
            filename,
            folder,
            path,
            size,
            extension,
            #add in checksum
            checksumMD5
            ]
        manifestInfo.append(fileInfo)
print('Looked through the file tree. Found',len(manifestInfo),'files.')

## write to a CSV

# set up the csv, create a header row
headers = [
    'index',
    'filename',
    'in_folder_path',
    'full_file_path',
    'size',
    'extension',
    # add in md5
    'checksum_md5'
    ]

# write the information using csv.writer()
# newline='' stops the csv module from writing blank rows between records on Windows
with open('file-manifest-with-extension-and-md5.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(headers)
    for file in manifestInfo:
#        print(file)
        writer.writerow(file)
    print('Wrote the file manifest with extensions!')   

Looked through the file tree. Found 23 files.
Wrote the file manifest with extensions!


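Compare the value returned for `web-files-small-metadata.csv` above with the value recorded for that file in the CSV written by the code above. If the values match, the checksum was generated consistently. Note that both values are digests of the file's path string, so a match shows that the function ran the same way both times; it does not verify the file's contents.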
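
The comparison can also be done in code instead of by eye. The sketch below assumes the manifest has the `filename` and `checksum_md5` columns written above; the helper name `lookup_checksum` is hypothetical.

```python
import csv

def lookup_checksum(manifest_path, filename, column='checksum_md5'):
    '''Return the recorded checksum for filename, or None if it is not listed.'''
    with open(manifest_path, newline='') as f:
        # DictReader uses the header row as keys for each record
        for row in csv.DictReader(f):
            if row['filename'] == filename:
                return row[column]
    return None
```

With the manifest from this notebook, `lookup_checksum('file-manifest-with-extension-and-md5.csv', 'web-files-small-metadata.csv') == create_md5_digest(sampleFile)` should evaluate to `True` if the two values agree. If several folders contain a file with the same name, this sketch returns only the first match.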

For more exploration, try adding the SHA256 or another type of digest value. 
Also, try modifying the functions created in the previous notebook (see completed functions in the 
worked examples in the 2019-03-14 notebook in the `class-notes` folder). 
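
As a starting point for that exploration, here is one possible sketch that computes several digests of a file's contents in a single pass. The function name `create_digests` and its parameters are assumptions for illustration; the algorithm names are passed to `hashlib.new()`, so any name listed in `hashlib.algorithms_available` should work.

```python
import hashlib

def create_digests(path, algorithms=('md5', 'sha256'), chunk_size=65536):
    '''Return a dict mapping each algorithm name to the hex digest of the file.'''
    # one hash object per requested algorithm
    hashers = {name: hashlib.new(name) for name in algorithms}
    with open(path, 'rb') as f:
        # read the file once and feed every chunk to every hash object
        for chunk in iter(lambda: f.read(chunk_size), b''):
            for h in hashers.values():
                h.update(chunk)
    return {name: h.hexdigest() for name, h in hashers.items()}
```

Each value in the returned dict could be written to its own manifest column, for example `checksum_md5` and a new `checksum_sha256` header.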