# Python & Data - Week 2

Question: how can we tell whether two files have the same content?

## This week's content

1. [Cheatsheet](cheatsheet2.html) | [ipynb file](cheatsheet2.ipynb)
2. [Weekly Challenge](challenge2.html) | [ipynb file](challenge2.ipynb)

[Download everything as a zip](../week2.zip)

In [1]:
# Imports
import os
import time
import hashlib

## Reading files (text mode)

In [2]:
# open file1.txt
with open("data/file1.txt") as f:
    # prints only one line (with line ending at the end)
    print(f.readline())

# open file2.txt
with open("data/file2.txt") as f:
    # prints only one line (with line ending at the end)
    print(f.readline())

Hi! This is file 1.

Hi! This is file 2.
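
`readline()` returns a single line per call. To go through every line, one option is to iterate over the file object itself. The sketch below writes a small throwaway file first so that it runs on its own; the file name and its content are made up for illustration.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical example file, not one of the course data files
    path = os.path.join(tmp, "example.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("first line\nsecond line\n")

    lines = []
    with open(path, encoding="utf-8") as f:
        # Iterating over a file object yields one line at a time
        for line in f:
            # rstrip("\n") drops the trailing line ending
            lines.append(line.rstrip("\n"))

print(lines)
```

Reading line by line keeps memory use low, which matters once files get large.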



## Comparing file contents

In [3]:
# Compare the full contents of file1_copy.txt and file1.txt,
# not just their first lines
with open("data/file1_copy.txt") as f1, open("data/file1.txt") as f2:
    same_file = f1.read() == f2.read()

if same_file:
    print("file1.txt and file1_copy.txt are the same")
else:
    print("file1.txt and file1_copy.txt are NOT the same")

file1.txt and file1_copy.txt are the same
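
The standard library also ships a helper for this, `filecmp.cmp`. The sketch below uses three made-up temporary files; note that `a.txt` and `c.txt` share the same first line and only differ on the second, so a check that looks at the first line alone would wrongly call them equal.

```python
import filecmp
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    a = os.path.join(tmp, "a.txt")
    b = os.path.join(tmp, "b.txt")
    c = os.path.join(tmp, "c.txt")
    # Hypothetical contents, for illustration only
    samples = [
        (a, "Hi!\nSame text.\n"),
        (b, "Hi!\nSame text.\n"),
        (c, "Hi!\nOther text.\n"),
    ]
    for path, text in samples:
        with open(path, "w", encoding="utf-8") as f:
            f.write(text)

    # shallow=False forces a comparison of the actual bytes instead of
    # trusting os.stat() metadata such as size and modification time
    a_vs_b = filecmp.cmp(a, b, shallow=False)
    a_vs_c = filecmp.cmp(a, c, shallow=False)

print(a_vs_b, a_vs_c)
```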


## Getting the file size

In [4]:
size = os.path.getsize("data/file1.txt")
print(f'Size of file1.txt {size} bytes')

size = os.path.getsize("data/file2.txt")
print(f'Size of file2.txt {size} bytes')

Size of file1.txt 37 bytes
Size of file2.txt 37 bytes
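
Equal size does not mean equal content: file1.txt and file2.txt are both 37 bytes, yet their first lines differ. Different sizes, on the other hand, do prove that two files differ, so size can serve as a cheap first check. A minimal sketch, where `may_be_same` is a hypothetical helper name and the files are throwaway examples:

```python
import os
import tempfile


def may_be_same(path_a, path_b):
    """Return False if the sizes differ; True only means 'check the content next'."""
    return os.path.getsize(path_a) == os.path.getsize(path_b)


with tempfile.TemporaryDirectory() as tmp:
    a = os.path.join(tmp, "a.txt")
    b = os.path.join(tmp, "b.txt")
    c = os.path.join(tmp, "c.txt")
    # Written in binary mode so the sizes are identical on every platform
    for path, data in [(a, b"file 1"), (b, b"file 2"), (c, b"a longer file")]:
        with open(path, "wb") as f:
            f.write(data)

    same_size_ab = may_be_same(a, b)  # equal size, but different content
    same_size_ac = may_be_same(a, c)  # different size, so definitely different

print(same_size_ab, same_size_ac)
```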


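## Getting the file creation date

Note: `os.path.getctime` returns the creation time on some systems (such as Windows); on others (such as Unix) it returns the time of the last metadata change.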

In [5]:
created_ts = os.path.getctime('data/file1.txt')
print(f'{created_ts}')

created_time = time.ctime(created_ts)
print(f'{created_time}')

1638844711.2520077
Tue Dec  7 10:38:31 2021
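
Timestamps are also available through `os.stat`. The sketch below reads the modification time (`st_mtime`) and, where the platform provides it, the birth time (`st_birthtime`); that attribute is not available everywhere, so the code falls back to `None`. The file is a throwaway example.

```python
import os
import tempfile
from datetime import datetime

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical example file
    path = os.path.join(tmp, "example.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("hello")

    st = os.stat(path)
    # Time of the last content modification
    modified = datetime.fromtimestamp(st.st_mtime)
    # st_birthtime exists only on some platforms (e.g. macOS), so use getattr
    birth_ts = getattr(st, "st_birthtime", None)
    created = datetime.fromtimestamp(birth_ts) if birth_ts is not None else None

print("modified:", modified.strftime("%Y-%m-%d %H:%M:%S"))
print("created: ", created)
```

Timestamps change when a file is copied or re-saved, so they say little about whether two files hold the same content.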


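## Getting a file's hash

### What is a hash?

* For a file, a hash works like a fingerprint;
* A hash function is a mathematical algorithm: given the same input, it always produces the same result;
* A hash is designed to be one-way: in practice, the original input cannot be computed back from it.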
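
The "same input, same result" property can be checked directly with `hashlib`. A small sketch; the byte strings are made-up inputs, not the actual contents of the course files, and `sha256_hex` is a hypothetical helper name.

```python
import hashlib


def sha256_hex(data):
    """Return the SHA-256 digest of a bytes object as a hex string."""
    return hashlib.sha256(data).hexdigest()


first = sha256_hex(b"Hi! This is file 1.")
again = sha256_hex(b"Hi! This is file 1.")
other = sha256_hex(b"Hi! This is file 2.")

print(first == again)  # same input hashed twice
print(first == other)  # input differs by a single character
print(len(first))      # number of hex characters in the digest
```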

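### Uses of hashes

* Cryptographic purposes, such as storing passwords;
* Checking the integrity of a file (for example, whether a file downloaded from the internet is complete or has been modified).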
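
The integrity check can be sketched as follows: hash the data you received and compare it with the checksum the publisher lists. Everything here is hypothetical (the payload, the "published" checksum, and the helper name `matches_checksum`); the choice of SHA-256 is an assumption as well.

```python
import hashlib
import hmac


def matches_checksum(data, expected_hex):
    """Return True if the SHA-256 digest of data equals expected_hex."""
    actual_hex = hashlib.sha256(data).hexdigest()
    # compare_digest compares the two strings in constant time
    return hmac.compare_digest(actual_hex, expected_hex.lower())


# Stand-in for a downloaded file and the checksum its publisher would list
payload = b"pretend this is a downloaded file"
published = hashlib.sha256(payload).hexdigest()

intact = matches_checksum(payload, published)
tampered = matches_checksum(payload + b"!", published)

print(intact, tampered)
```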

### Types of hash functions and their output lengths

* MD5 - 128 bit (not suitable for cryptographic use)
* SHA-1 - 160 bit (not suitable for cryptographic use)
* SHA-256 / SHA-512 - 256 bit / 512 bit
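
These output lengths can be read from `hashlib` itself. The sketch assumes all four algorithms are available in your Python build, which is usually the case, although some restricted builds disable MD5.

```python
import hashlib

bits = {}
for name in ["md5", "sha1", "sha256", "sha512"]:
    # digest_size is given in bytes, so multiply by 8 to get bits
    bits[name] = hashlib.new(name).digest_size * 8

for name, size in bits.items():
    print(f"{name}: {size} bit")
```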

In [6]:
def hash_file(filename):
    """Return the SHA-1 hash (as a hex string) of the file passed in."""

    # make a hash object
    h = hashlib.sha1()

    # open file for reading in binary mode
    with open(filename, 'rb') as file:
        # read 1024 bytes at a time until read() returns b'' (end of file)
        chunk = file.read(1024)
        while chunk != b'':
            h.update(chunk)
            chunk = file.read(1024)

    # return the hex representation of digest
    return h.hexdigest()


hash1 = hash_file("data/file1.txt")
print(f'Hash of file1.txt is {hash1}')

hash2 = hash_file("data/file2.txt")
print(f'Hash of file2.txt is {hash2}')

hash3 = hash_file("data/file1_copy.txt")
print(f'Hash of file1_copy.txt is {hash3}')

Hash of file1.txt is 4ee3673447c6b4f860b2db83ec823ac73c3a39f0
Hash of file2.txt is 01f5a8326e391a56a3346ebf98e73af2b1adc7ea
Hash of file1_copy.txt is 4ee3673447c6b4f860b2db83ec823ac73c3a39f0


## Listing a directory

`os.listdir` returns a list of names; you can use it to go through every file and folder name inside a directory.

In [7]:
dirname = 'data'

for filename in os.listdir(dirname):
    f = os.path.join(dirname, filename)
    
    if os.path.isfile(f):
        print(f'{f} is file')
    elif os.path.isdir(f):
        print(f'{f} is directory')

data/file2.txt is file
data/file1.txt is file
data/.DS_Store is file
data/file1_copy.txt is file
data/sub_folder is directory
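
Listing a directory and hashing each file can be combined to answer the opening question for a whole folder at once. The following is one possible sketch, not the course's reference solution: `sha256_of` and `find_duplicates` are hypothetical helper names, SHA-256 is used here instead of SHA-1, and the files are throwaway examples created in a temporary folder.

```python
import hashlib
import os
import tempfile
from collections import defaultdict


def sha256_of(path, chunk_size=1024):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns b"" (end of file)
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(dirname):
    """Group the files directly inside dirname by hash; return groups of 2 or more."""
    groups = defaultdict(list)
    for filename in sorted(os.listdir(dirname)):
        path = os.path.join(dirname, filename)
        if os.path.isfile(path):  # skip sub-folders
            groups[sha256_of(path)].append(filename)
    return [names for names in groups.values() if len(names) > 1]


with tempfile.TemporaryDirectory() as tmp:
    samples = [("file1.txt", b"one"), ("file1_copy.txt", b"one"), ("file2.txt", b"two")]
    for name, data in samples:
        with open(os.path.join(tmp, name), "wb") as f:
            f.write(data)
    os.mkdir(os.path.join(tmp, "sub_folder"))

    duplicates = find_duplicates(tmp)

print(duplicates)
```

Grouping by hash avoids comparing every pair of files, since each file is read only once.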
