# Week 3 Exercises
## Jenna Landy
Note from student: I didn't have time to create full Python programs with command-line arguments, so for now I have functions within this Jupyter notebook.

In [1]:
import pandas as pd
from pathlib import Path
from glob import glob
import re
from warnings import warn
import hashlib

Note: the lesson for this week covered both the backup tool and how to use mock objects
like pyfakefs in testing. It was clearly too much material—my apologies for trying to
cram it all in—so the exercises below do _not_ require you to use pyfakefs. I will put
together an entire lesson on mock objects to run later in the course.

## Comparing manifests

Write a program `compare-manifests.py` that reads two manifest files and reports:

-   Which files have the same names but different hashes
    (i.e., their contents have changed).
-   Which files have the same hashes but different names
    (i.e., they have been renamed).
-   Which files are in the first manifest but neither their names nor their hashes are in the second
    (i.e., they have been deleted).
-   Which files are in the second manifest but neither their names nor their hashes are in the first
    (i.e., they have been added).

You can test your program by hand-writing a few manifest CSV files with made-up hashes.

First, recall what our manifest files look like. They are CSV files with two values per row: the filename and the hash.
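
A minimal sketch of reading such a file into a `{filename: hash}` dictionary, assuming no header row and exactly two columns per row. The name `read_manifest` is hypothetical (it is not from the lesson), and the `csv` module is used here instead of splitting on commas by hand so that a quoted filename containing a comma would still parse.

```python
import csv
from pathlib import Path


def read_manifest(path):
    """Read a two-column manifest CSV into a {filename: hash} dict.

    Hypothetical helper, not part of the lesson code.
    Assumes no header row and exactly two columns per non-empty row.
    """
    manifest = {}
    with open(Path(path), newline="") as f:
        for row in csv.reader(f):
            if not row:  # csv.reader yields [] for blank lines; skip them
                continue
            filename, file_hash = row
            manifest[filename] = file_hash
    return manifest
```

A row with more or fewer than two values raises a `ValueError` at the unpacking step, which seems preferable to silently reading a malformed manifest.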

Second, make two made-up manifests. From manifest1 to manifest2, `test1.py` has been altered, `test3.py` has been deleted, `test4.py` has been renamed to `te4st.py`, and `test5.py` has been created.

In [2]:
manifest1 = pd.DataFrame({
    'file': ['test1.py', 'test2.py', 'test3.py', 'test4.py'],
    'hash': ['abc', 'abcd', 'abcde','dabc']
})

manifest1.to_csv('manifest1.csv', index = False, header = False)

In [3]:
manifest2 = pd.DataFrame({
    'file': ['test1.py', 'test2.py', 'te4st.py', 'test5.py'],
    'hash': ['abe', 'abcd', 'dabc', 'abcdef']
})

manifest2.to_csv('manifest2.csv', index = False, header = False)

In [4]:
def compare_manifests(path1, path2):
    assert Path(path1).exists()
    assert Path(path2).exists()
    
    with open(Path(path1)) as f:
        manifest1 = [line.replace('\n','').split(',') for line in f.readlines()]
        manifest1 = {m[0]:m[1] for m in manifest1}
    
    with open(Path(path2)) as f:
        manifest2 = [line.replace('\n','').split(',') for line in f.readlines()]
        manifest2 = {m[0]:m[1] for m in manifest2}
        manifest2_rev = {v:k for k,v in manifest2.items()}
    
    renamed = []
    changed = []
    deleted = []
    
    new = manifest2.copy()
    # new: no matching hash or name in first
    
    for file, file_hash in manifest1.items():
        if file in manifest2:
            if file_hash != manifest2[file]:
                # changed: same name different hash
                changed.append(file)

            # matching name, not new (pop tolerates a name already removed)
            new.pop(file, None)
        elif file_hash in manifest2_rev:
            # renamed: same hash different name
            new_name = manifest2_rev[file_hash]
            renamed.append(file + ' -> ' + new_name)

            # matching hash, not new (pop tolerates a name already removed)
            new.pop(new_name, None)
        else:
            # deleted: no matching hash or name in second
            deleted.append(file)
    
    new = new.keys()
                
    # print report
    if len(new) > 0:
        print('New Files:')
        for f in new:
            print('- ' + f)
    if len(changed) > 0:
        print('\nChanged Files:')
        for f in changed:
            print('- ' + f)
    if len(renamed) > 0:
        print('\nRenamed Files:')
        for f in renamed:
            print('- ' + f)
    if len(deleted) > 0:
        print('\nDeleted Files:')
        for f in deleted:
            print('- ' + f)

Expected Report:
- Changed: `test1.py` 
- Deleted: `test3.py` 
- Renamed: `test4.py` -> `te4st.py`
- Created: `test5.py`

In [5]:
compare_manifests('manifest1.csv','manifest2.csv')

New Files:
- test5.py

Changed Files:
- test1.py

Renamed Files:
- test4.py -> te4st.py

Deleted Files:
- test3.py
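
Checking the printed report by eye works for a small example, but it is hard to automate. A minimal sketch of the same four categories computed as a return value, so they can be checked with `assert`. The function name `diff_manifests` is hypothetical, it takes two `{filename: hash}` dictionaries rather than paths, and it assumes each hash appears at most once per manifest.

```python
def diff_manifests(old, new):
    """Classify files between two {filename: hash} dicts.

    Hypothetical helper: returns the four categories instead of printing them.
    Assumes each hash appears at most once per manifest.
    """
    old_by_hash = {h: name for name, h in old.items()}
    new_by_hash = {h: name for name, h in new.items()}

    # same name, different hash
    changed = sorted(n for n in old.keys() & new.keys() if old[n] != new[n])
    # old name is gone, but its hash survives under another name
    renamed = sorted(
        (n, new_by_hash[h])
        for n, h in old.items()
        if n not in new and h in new_by_hash
    )
    # neither the name nor the hash survives in the new manifest
    deleted = sorted(
        n for n, h in old.items() if n not in new and h not in new_by_hash
    )
    # neither the name nor the hash existed in the old manifest
    added = sorted(
        n for n, h in new.items() if n not in old and h not in old_by_hash
    )
    return {"changed": changed, "renamed": renamed,
            "deleted": deleted, "added": added}


# The same made-up manifests as above, written as dictionaries.
old = {"test1.py": "abc", "test2.py": "abcd",
       "test3.py": "abcde", "test4.py": "dabc"}
new = {"test1.py": "abe", "test2.py": "abcd",
       "te4st.py": "dabc", "test5.py": "abcdef"}

result = diff_manifests(old, new)
assert result == {
    "changed": ["test1.py"],
    "renamed": [("test4.py", "te4st.py")],
    "deleted": ["test3.py"],
    "added": ["test5.py"],
}
```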


## File history

Write a program called `file_history.py`
that takes the name of a file as a command-line argument
and displays the history of that file
by tracing it back in time through the available manifests.
Again, you can test your program using made-up manifest files.


I am assuming that manifest files are named `manifest[number].csv`, where the numbers increase over time (e.g., timestamps, or in my baby example just `manifest1.csv` and `manifest2.csv`).
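
This naming assumption means the manifests have to be ordered by the number in the name, not by the name as a string. A small sketch of the difference, using a hypothetical helper `manifest_number` and made-up file names:

```python
import re
from pathlib import Path


def manifest_number(path):
    """Return the first run of digits in the file name as an int.

    Hypothetical helper. Only the file name is searched, so digits in
    directory names are ignored. Raises ValueError if there is no number.
    """
    match = re.search(r"\d+", Path(path).name)
    if match is None:
        raise ValueError(f"no number in manifest name: {path}")
    return int(match.group())


names = ["manifest1.csv", "manifest2.csv", "manifest10.csv"]

# Plain string sorting puts manifest10 before manifest2.
print(sorted(names))
# ['manifest1.csv', 'manifest10.csv', 'manifest2.csv']

# Sorting on the extracted number, newest first.
print(sorted(names, key=manifest_number, reverse=True))
# ['manifest10.csv', 'manifest2.csv', 'manifest1.csv']
```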

My program first finds all manifest files in (or under) the current directory, then sorts them from most recent to oldest. Because it works backwards, the input should be the file's most recent name (i.e., its name in the latest manifest in which it appears). It looks for the file in the most recent manifest, then traces it back through the older ones: if the name is still present, it records the hash found there (which differs if the contents changed); if the name is missing but the last known hash is present, it switches to the name that hash had in that manifest (a rename). My program returns a pandas DataFrame reporting the hash and filename of the tracked file in each manifest in which it appears.

I made a third manifest to test this where I changed the name of `test4.py` once again.

In [6]:
manifest3 = pd.DataFrame({
    'file': ['test1.py', 'test2.py', 'test4b.py', 'test5.py'],
    'hash': ['abe', 'abcd', 'dabc', 'abcdef']
})

manifest3.to_csv('manifest3.csv', index = False, header = False)

In [7]:
def find_manifests(root = '.'):
    '''finds all csv files starting with prefix `manifest` under `root`'''
    pattern = str(Path(root) / '**' / 'manifest*.csv')
    return glob(pattern, recursive=True)
        
def file_history(filename, root = '.'):
    manifest_files = find_manifests(root)
    # take the number from the file name only, so digits in directory names are ignored
    manifest_numbers = [float(re.findall(r'\d+', Path(f).name)[0]) for f in manifest_files]

    traceback_m = []
    traceback_f = []
    traceback_h = []

    filehash = ''

    # iterate from most recent to oldest
    for manifest in sorted(zip(manifest_numbers, manifest_files), key = lambda x: -1*x[0]):
        manifest_file = manifest[1]
        with open(Path(manifest_file)) as f:
            manifest_dict = [line.replace('\n','').split(',') for line in f.readlines()]
            manifest_dict = {m[0]:m[1] for m in manifest_dict}

        # updates hash if only name matches, updates name if only hash matches
        if filename in manifest_dict.keys():
            filehash = manifest_dict[filename]
            traceback_m.append(manifest_file)
            traceback_f.append(filename)
            traceback_h.append(filehash)
        elif filehash in manifest_dict.values():
            manifest_dict_rev = {v:k for k,v in manifest_dict.items()}
            filename = manifest_dict_rev[filehash]
            traceback_m.append(manifest_file)
            traceback_f.append(filename)
            traceback_h.append(filehash)

    traceback_df = pd.DataFrame({
        'manifest_file': traceback_m,
        'filename': traceback_f,
        'hash': traceback_h
    })
    
    if len(traceback_m) == 0:
        warn('File '+filename+' not found in any manifest')
    
    return(traceback_df)

`test0.py` never existed

In [8]:
file_history('test0.py')

  warn('File '+filename+' not found in any manifest')


| | manifest_file | filename | hash |
|---|---|---|---|


`test1.py` was changed once

In [9]:
file_history('test1.py')

| | manifest_file | filename | hash |
|---|---|---|---|
| 0 | manifest3.csv | test1.py | abe |
| 1 | manifest2.csv | test1.py | abe |
| 2 | manifest1.csv | test1.py | abc |


`test2.py` was never changed

In [10]:
file_history('test2.py')

| | manifest_file | filename | hash |
|---|---|---|---|
| 0 | manifest3.csv | test2.py | abcd |
| 1 | manifest2.csv | test2.py | abcd |
| 2 | manifest1.csv | test2.py | abcd |


`test3.py` was deleted after the first manifest

In [11]:
file_history('test3.py')

| | manifest_file | filename | hash |
|---|---|---|---|
| 0 | manifest1.csv | test3.py | abcde |


`test4b.py` did not change, but was renamed twice.

In [12]:
file_history('test4b.py')

| | manifest_file | filename | hash |
|---|---|---|---|
| 0 | manifest3.csv | test4b.py | dabc |
| 1 | manifest2.csv | te4st.py | dabc |
| 2 | manifest1.csv | test4.py | dabc |


`test5.py` was created after manifest 1

In [13]:
file_history('test5.py')

| | manifest_file | filename | hash |
|---|---|---|---|
| 0 | manifest3.csv | test5.py | abcdef |
| 1 | manifest2.csv | test5.py | abcdef |


## Finding duplicate files

Write a program called `finddup.py` that takes a list of filenames as command-line
arguments, and reports which of those files are duplicates of each other.  The
fastest way to do this is to calculate the hash for each file, and then group files
with the same hashes together. Note that there may be several duplicates of a file,
not just two.

You can test your program by creating a few directories with test files in them
rather than using pyfakefs.


In [14]:
BUFFER_SIZE = 4 * 1024  # how much data to read at once
def hash_stream(reader):
    # copied from slides
    sha256 = hashlib.sha256()
    while True:
        block = reader.read(BUFFER_SIZE)
        if not block:
            break
        sha256.update(block)
    return sha256.hexdigest()

def hash_file(filename):
    # use a context manager so the file is closed even if hashing raises
    with open(filename, "rb") as reader:
        return hash_stream(reader)

In [15]:
def finddup(filenames):
    filenames = pd.Series(filenames)
    hashes = pd.Series([hash_file(f) for f in filenames])
    
    dups = {}
    for i in range(len(hashes)):
        if hashes[i] not in dups.keys() and sum(hashes == hashes[i]) > 1:
            dups[hashes[i]] = list(filenames[hashes == hashes[i]])
            
    # print report
    for filehash, dup_files in dups.items():
        out_str = 'Hash '+ filehash +' duplicated in:'
        for dup in dup_files:
            out_str += ('\n- ' + dup)
        print(out_str)
        
    # return dictionary with results
    return(dups)

In [16]:
dups = finddup(filenames = glob("backup_test/**/**.**", recursive=True))

Hash 1b4f0e9851971998e732078544c96b36c3d01cedf7caa332359d6f1d83567014 duplicated in:
- backup_test/test1.txt
- backup_test/test1_copy.txt
- backup_test/test_subdir/test1_copy2.txt


In [17]:
dups

{'1b4f0e9851971998e732078544c96b36c3d01cedf7caa332359d6f1d83567014': ['backup_test/test1.txt',
  'backup_test/test1_copy.txt',
  'backup_test/test_subdir/test1_copy2.txt']}
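
For comparison, a sketch of the same grouping without pandas, which hashes each file once and collects names in a dictionary keyed by hash. It also shows one way the function might be wrapped as the command-line program the exercise asks for. The names `group_duplicates` and `main` are hypothetical, and the `argparse` wrapper is an assumption about how `finddup.py` could look, not something from the lesson.

```python
import argparse
import hashlib
from collections import defaultdict

BUFFER_SIZE = 4 * 1024  # how much data to read at once


def hash_file(filename):
    """Return the SHA-256 hex digest of a file, read in blocks."""
    sha256 = hashlib.sha256()
    with open(filename, "rb") as reader:
        while True:
            block = reader.read(BUFFER_SIZE)
            if not block:
                break
            sha256.update(block)
    return sha256.hexdigest()


def group_duplicates(filenames):
    """Map each hash shared by two or more files to the list of those files."""
    by_hash = defaultdict(list)
    for name in filenames:
        by_hash[hash_file(name)].append(name)
    # keep only hashes that occur more than once
    return {h: names for h, names in by_hash.items() if len(names) > 1}


def main(argv=None):
    """Hypothetical command-line wrapper: python finddup.py FILE [FILE ...]"""
    parser = argparse.ArgumentParser(description="Report duplicate files.")
    parser.add_argument("filenames", nargs="+", help="files to compare")
    args = parser.parse_args(argv)
    for file_hash, names in group_duplicates(args.filenames).items():
        print(f"Hash {file_hash} duplicated in:")
        for name in names:
            print(f"- {name}")
```

In a standalone `finddup.py`, `main()` would be called under an `if __name__ == "__main__":` guard so that it reads the filenames from the command line. Passing a list to `main` directly, as in `main(["a.txt", "b.txt"])`, makes the wrapper usable from a notebook or a test as well.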