# Challenge 2

Find files with identical content in a given folder, and print which two files are the same.

In [21]:
# Imports
import os
import time
import hashlib

## Hash Function

The provided hash function:

In [22]:
def hash_file(filename):
    """Return the SHA-1 hash (as a hex string) of the file passed in."""

    # make a hash object
    h = hashlib.sha1()

    # open file for reading in binary mode
    with open(filename, 'rb') as file:
        # loop till the end of the file
        while True:
            # read only 1024 bytes at a time
            chunk = file.read(1024)
            if not chunk:
                break
            h.update(chunk)

    # return the hex representation of digest
    return h.hexdigest()
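
As a quick sanity check of the chunked approach, the sketch below hashes a temporary file 1024 bytes at a time and compares the result with hashing the same bytes in a single call. The helper name `sha1_of_file`, the `chunk_size` parameter, and the sample payload are assumptions made for illustration; they are not part of the notebook.

```python
import hashlib
import os
import tempfile


def sha1_of_file(path, chunk_size=1024):
    """Hypothetical helper: return the SHA-1 hex digest of the file at `path`."""
    h = hashlib.sha1()
    with open(path, "rb") as file:
        # iter() with a sentinel keeps calling read() until it returns b""
        for chunk in iter(lambda: file.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


# Write some sample bytes to a temporary file and hash it both ways.
payload = b"hello world\n" * 500
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
    tmp_path = tmp.name

try:
    chunked_digest = sha1_of_file(tmp_path)
finally:
    os.remove(tmp_path)

whole_digest = hashlib.sha1(payload).hexdigest()
print(chunked_digest == whole_digest)  # True: chunking does not change the digest
```

Reading in chunks only bounds memory use; the digest depends solely on the byte sequence, which is why two files with identical content always produce the same hash.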

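## Checking a folder for duplicate files

* Use os.listdir to list each file
    * For each file, compute its hash
    * Check whether the hash already exists in file_hashes (as a key)  
        ➡️ If it does, print the names of both files
    * Record the hash (key) and the file name (value) in file_hashes
    * Repeat for the next file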
    

In [23]:
# Setup variables
dir = "data"       # folder to scan (note: this name shadows the built-in dir())
file_hashes = {}   # maps hash -> first file seen with that content
for filename in os.listdir(dir):
    f = os.path.join(dir, filename)

    if os.path.isfile(f):
        file_hash = hash_file(f)
        if file_hash in file_hashes:
            print(f'{f} and {file_hashes[file_hash]} is the same')
        else:
            file_hashes[file_hash] = f
    elif os.path.isdir(f):
        pass # We do nothing with the folder

data/file1_copy.txt and data/file1.txt is the same


Expected result:

`data/file1_copy.txt and data/file1.txt is the same`
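
The loop above keeps only the first file seen for each hash and reports duplicates one pair at a time. If a folder holds three or more copies, it can be easier to read the result as groups. The following is a minimal sketch of that variation, not the notebook's method: `file_digest`, `group_duplicates`, and the demo file names are hypothetical, and `file_digest` reads each file in one go for brevity.

```python
import hashlib
import os
import tempfile
from collections import defaultdict


def file_digest(path):
    """Hypothetical stand-in for hash_file: SHA-1 of the file's bytes."""
    with open(path, "rb") as file:
        return hashlib.sha1(file.read()).hexdigest()


def group_duplicates(folder):
    """Map each shared digest to the list of files in `folder` that have it."""
    groups = defaultdict(list)
    for filename in sorted(os.listdir(folder)):
        path = os.path.join(folder, filename)
        if os.path.isfile(path):
            groups[file_digest(path)].append(path)
    # Keep only digests that are shared by more than one file
    return {d: paths for d, paths in groups.items() if len(paths) > 1}


# Demo on a throwaway folder (the file names here are made up).
with tempfile.TemporaryDirectory() as demo_dir:
    for name, text in [("a.txt", "same"), ("b.txt", "same"),
                       ("c.txt", "same"), ("d.txt", "different")]:
        with open(os.path.join(demo_dir, name), "w") as out:
            out.write(text)
    duplicate_names = [
        [os.path.basename(p) for p in paths]
        for paths in group_duplicates(demo_dir).values()
    ]

print(duplicate_names)  # [['a.txt', 'b.txt', 'c.txt']]
```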

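## How should sub_folder be handled?

Handle it with recursion:

* Turn the code above into a function
* While listing entries, check whether the current filename is a file or a folder
    * If it is a file, check it as before
    * If it is a folder, pass it to check_duplicates to check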

In [24]:
my_file_hashes = {}   # shared across recursive calls: hash -> first file seen
def check_duplicates(dir):
    for filename in os.listdir(dir):
        f = os.path.join(dir, filename)

        if os.path.isfile(f):
            file_hash = hash_file(f)
            if file_hash in my_file_hashes:
                print(f'{f} and {my_file_hashes[file_hash]} is the same')
            else:
                my_file_hashes[file_hash] = f
        elif os.path.isdir(f):
            # Recurse into the sub-folder
            check_duplicates(f)

# Starting point
check_duplicates(dir)

data/file1_copy.txt and data/file1.txt is the same
data/sub_folder/file4.txt and data/file2.txt is the same


Expected result:

```
data/file1_copy.txt and data/file1.txt is the same
data/sub_folder/file4.txt and data/file2.txt is the same
```
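
For comparison, the same traversal can be written without explicit recursion by using the standard library's `os.walk`, which yields every folder under a root. This is an alternative technique, not the approach used in this notebook. In the sketch below, `find_duplicates_walk` and the demo file names are hypothetical, and each file is hashed in one read for brevity.

```python
import hashlib
import os
import tempfile


def find_duplicates_walk(root):
    """Return (duplicate, first_seen) path pairs found anywhere under `root`."""
    seen = {}
    pairs = []
    for folder, subfolders, filenames in os.walk(root):
        subfolders.sort()  # visit sub-folders in a predictable order
        for filename in sorted(filenames):
            path = os.path.join(folder, filename)
            with open(path, "rb") as file:
                digest = hashlib.sha1(file.read()).hexdigest()
            if digest in seen:
                pairs.append((path, seen[digest]))
            else:
                seen[digest] = path
    return pairs


# Demo: one file at the top level and an identical one in a sub-folder.
with tempfile.TemporaryDirectory() as demo_root:
    os.mkdir(os.path.join(demo_root, "sub"))
    for rel, text in [("top.txt", "x"), (os.path.join("sub", "nested.txt"), "x")]:
        with open(os.path.join(demo_root, rel), "w") as out:
            out.write(text)
    walk_pairs = [
        (os.path.relpath(dup, demo_root), os.path.relpath(first, demo_root))
        for dup, first in find_duplicates_walk(demo_root)
    ]

for dup, first in walk_pairs:
    print(f"{dup} and {first} is the same")
```

Returning the pairs instead of printing inside the loop also removes the need for a module-level dictionary such as `my_file_hashes`.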

## How to avoid infinite loops (or descending too many levels)?

Add a max_depth parameter to check_duplicates to limit how deep the recursion goes.

```python
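def check_duplicates(dir, max_depth=0):
    # max_depth counts folder levels, including the starting folder:
    # sub-folders are only entered while max_depth is greater than 1
    for filename in os.listdir(dir):
        f = os.path.join(dir, filename)

        if os.path.isfile(f):
            pass # ...
        elif os.path.isdir(f) and max_depth > 1:
            check_duplicates(f, max_depth - 1)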
```
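
A fuller, runnable sketch of the depth-limited idea follows. It keeps the same `max_depth > 1` test as the outline above, so `max_depth=1` scans only the starting folder and each extra unit allows one more level. Everything else is an assumption made for illustration: the name `check_duplicates_limited`, the `seen` and `pairs` parameters, returning pairs instead of printing, and skipping symbolic links (one plausible source of endless recursion).

```python
import hashlib
import os
import tempfile


def check_duplicates_limited(folder, max_depth=1, seen=None, pairs=None):
    """Collect (duplicate, first_seen) pairs, scanning at most `max_depth` folder levels."""
    seen = {} if seen is None else seen
    pairs = [] if pairs is None else pairs
    for filename in sorted(os.listdir(folder)):
        path = os.path.join(folder, filename)
        if os.path.islink(path):
            continue  # assumption: skip symlinks so a link back to a parent cannot loop
        if os.path.isfile(path):
            with open(path, "rb") as file:
                digest = hashlib.sha1(file.read()).hexdigest()
            if digest in seen:
                pairs.append((path, seen[digest]))
            else:
                seen[digest] = path
        elif os.path.isdir(path) and max_depth > 1:
            # Each level of recursion uses up one unit of the depth budget
            check_duplicates_limited(path, max_depth - 1, seen, pairs)
    return pairs


# Demo: identical files at three nesting levels (names are made up).
with tempfile.TemporaryDirectory() as demo_root:
    os.makedirs(os.path.join(demo_root, "sub", "deeper"))
    for rel in ["a.txt",
                os.path.join("sub", "b.txt"),
                os.path.join("sub", "deeper", "c.txt")]:
        with open(os.path.join(demo_root, rel), "w") as out:
            out.write("x")
    pair_counts = [len(check_duplicates_limited(demo_root, depth))
                   for depth in (1, 2, 3)]

print(pair_counts)  # [0, 1, 2]
```

With a depth of 1 no sub-folder is opened, so no duplicate is found; each additional level exposes one more copy.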