<a href="https://colab.research.google.com/github/linesn/llm_security/blob/main/check_hf_hash.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Check the HuggingFace Hashes for a Model
This notebook helps you check SHA-256 hash values to confirm that the model files you download match those in the repository.

In [1]:
import os
import hashlib
import requests
from tqdm.notebook import tqdm
import pandas as pd

In [7]:
def calculate_sha256(file_path):
  """Return the SHA-256 hex digest of the file, reading it in chunks.
  """
  sha256_hash = hashlib.sha256()
  with open(file_path, "rb") as infile:
    chunk_size = 4096
    while True:
      chunk = infile.read(chunk_size)
      if not chunk:
        break
      sha256_hash.update(chunk)
  return sha256_hash.hexdigest()


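def get_and_check_file(filename, prefix, chunk_size=4096):
  """Download the file, then return its SHA-256 hash and size in bytes.
  """
  address = prefix + filename
  with requests.get(address, stream=True) as r:
    # Fail loudly on HTTP errors so an error page is never hashed as the file.
    r.raise_for_status()
    with open(filename, "wb") as f:
      for chunk in r.iter_content(chunk_size):
        if chunk:
          f.write(chunk)
  file_hash = calculate_sha256(filename)  # avoid shadowing the built-in hash()
  size = os.path.getsize(filename)
  #os.remove(filename)  # uncomment to delete each file after hashing
  return file_hash, size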

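The function above writes each file to disk before hashing it. When storage is tight, the digest can instead be updated as chunks arrive, so nothing needs to be saved. The following is a minimal sketch of that idea; `hash_chunks` is a hypothetical helper, not part of the original notebook.

```python
import hashlib
from typing import Iterable, Tuple


def hash_chunks(chunks: Iterable[bytes]) -> Tuple[str, int]:
    """Return (SHA-256 hex digest, total byte count) for a stream of chunks."""
    sha256_hash = hashlib.sha256()
    size = 0
    for chunk in chunks:
        if chunk:  # skip empty chunks
            sha256_hash.update(chunk)
            size += len(chunk)
    return sha256_hash.hexdigest(), size


# Chunk boundaries do not change the digest.
digest, size = hash_chunks([b"hello ", b"world"])
print(digest == hashlib.sha256(b"hello world").hexdigest(), size)
```

With `requests`, the chunks could come from `r.iter_content(chunk_size)` on a streamed response, which would give the same hash and size without creating a local file.
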
Check the repository for the files you plan to download, and list each one you intend to check in the `hffiles` list below. Set `hfprefix` to the download address **specific to the branch or commit you are downloading from!**

In [3]:
hffiles = [
  "pytorch_model-00001-of-00002.bin",
  "pytorch_model-00002-of-00002.bin",
  "adapt_tokenizer.py",
  "attention.py",
  "blocks.py",
  "config.json",
  "configuration_mpt.py",
  "custom_embedding.py",
  "flash_attn_triton.py",
  "generation_config.json",
  "hf_prefixlm_converter.py",
  "meta_init_context.py",
  "modeling_mpt.py",
  "norm.py",
  "param_init_fns.py",
  "requirements.txt",
  "special_tokens_map.json",
  "tokenizer.json",
  "tokenizer_config.json",
  'README.md',
  'pytorch_model_.bin.index.json',
  'gitattributes.txt',
]
hfprefix = "https://huggingface.co/mosaicml/mpt-7b-instruct/resolve/bd1748ec173f1c43e11f1973fc6e61cb3de0f327/"
# https://huggingface.co/mosaicml/mpt-7b-instruct/blob/main/
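
The prefix above follows a `resolve/<revision>/` pattern, so it can be assembled from a repository name and a revision rather than typed by hand. This is a small sketch under that assumption; `build_prefix` is a hypothetical helper, and the URL layout is inferred only from the `hfprefix` value used in this notebook.

```python
def build_prefix(repo_id: str, revision: str) -> str:
    """Build a download prefix pinned to one branch name or commit hash."""
    # Assumed layout, copied from the hfprefix value in this notebook.
    return f"https://huggingface.co/{repo_id}/resolve/{revision}/"


prefix = build_prefix(
    "mosaicml/mpt-7b-instruct",
    "bd1748ec173f1c43e11f1973fc6e61cb3de0f327",
)
print(prefix)
```

Pinning to a full commit hash, as the notebook does, means the files being hashed cannot change if the branch is later updated.
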

HuggingFace reports the hashes of large binary files, so you can either record those values below and trust them, or include the large binaries in the `hffiles` list and hash them yourself (note that the latter approach might exhaust your storage on Colab).

In [4]:
ReportedHashes = {
    "pytorch_model-00001-of-00002.bin":"0693d4371bfecf4260145f0735708846d5e0ededeb91d535d8370962342cef4e",
    "pytorch_model-00002-of-00002.bin":"13197cfe0fc856d6079651e997868c9253886059c4aaad3f42ea5699a65e4cde",
}

In [5]:
ReportedSizes = {
    "pytorch_model-00001-of-00002.bin": 0,
    "pytorch_model-00002-of-00002.bin": 0,
}
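
The two dictionaries above are filled in by hand. If the repository stores its large binaries with Git LFS (an assumption; the notebook does not say where the reported hashes come from), each binary has a short text pointer with `version`, `oid sha256:<digest>`, and `size <bytes>` lines, which supplies both values at once. The sketch below parses such a pointer; `parse_lfs_pointer` is a hypothetical helper, and the sample pointer is assembled for illustration from the hash and size that appear elsewhere in this notebook.

```python
from typing import Tuple


def parse_lfs_pointer(text: str) -> Tuple[str, int]:
    """Return (sha256 hex digest, size in bytes) from Git LFS pointer text."""
    fields = {}
    for line in text.splitlines():
        key, _, value = line.partition(" ")
        if key:
            fields[key] = value.strip()
    algorithm, _, digest = fields["oid"].partition(":")
    if algorithm != "sha256":
        raise ValueError(f"unexpected hash algorithm: {algorithm}")
    return digest, int(fields["size"])


# Illustrative pointer text, not copied from the repository.
pointer = (
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:0693d4371bfecf4260145f0735708846d5e0ededeb91d535d8370962342cef4e\n"
    "size 9943040275\n"
)
reported_hash, reported_size = parse_lfs_pointer(pointer)
print(reported_hash, reported_size)
```
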

In [8]:
hashes = dict()
sizes = dict()
for filename in tqdm(hffiles):
  print(filename)
  hashes[filename], sizes[filename]  = get_and_check_file(filename, hfprefix)

pytorch_model-00001-of-00002.bin
pytorch_model-00002-of-00002.bin
adapt_tokenizer.py
attention.py
blocks.py
config.json
configuration_mpt.py
custom_embedding.py
flash_attn_triton.py
generation_config.json
hf_prefixlm_converter.py
meta_init_context.py
modeling_mpt.py
norm.py
param_init_fns.py
requirements.txt
special_tokens_map.json
tokenizer.json
tokenizer_config.json
README.md
pytorch_model_.bin.index.json
gitattributes.txt


Check if the hashes agree.

In [9]:
for filename in ReportedHashes:
  if filename in hashes:
    print(filename, "matches" if hashes[filename]==ReportedHashes[filename] else "fails")
  else:
    # Not downloaded: fall back to the reported values.
    hashes[filename] = ReportedHashes[filename]
    sizes[filename] = ReportedSizes[filename]

pytorch_model-00001-of-00002.bin matches
pytorch_model-00002-of-00002.bin matches
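
The check above prints a result only for files that were downloaded, and silently copies the reported values for the rest. The sketch below returns an explicit status for every reported file, so a binary that was never hashed locally is not mistaken for a verified one. `compare_reported` is a hypothetical helper, and the case-insensitive comparison is a design choice of this sketch.

```python
from typing import Dict


def compare_reported(computed: Dict[str, str], reported: Dict[str, str]) -> Dict[str, str]:
    """Map each reported filename to 'matches', 'fails', or 'not downloaded'."""
    status = {}
    for filename, expected in reported.items():
        if filename not in computed:
            status[filename] = "not downloaded"
        elif computed[filename].lower() == expected.lower():
            status[filename] = "matches"
        else:
            status[filename] = "fails"
    return status


# Placeholder digests for illustration only.
print(compare_reported({"a.bin": "ab12"}, {"a.bin": "AB12", "b.bin": "cd34"}))
```
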


Export the results.

In [10]:
df = pd.DataFrame()
df["File"] = hashes.keys()
df["SHA256"] = hashes.values()
df["Sizes"] = sizes.values()
df

| | File | SHA256 | Sizes |
|---|---|---|---|
| 0 | pytorch_model-00001-of-00002.bin | 0693d4371bfecf4260145f0735708846d5e0ededeb91d5... | 9943040275 |
| 1 | pytorch_model-00002-of-00002.bin | 13197cfe0fc856d6079651e997868c9253886059c4aaad... | 3355599187 |
| 2 | adapt_tokenizer.py | 1a877c5856cbea1fee9d986b5fee1889ca14ccba765f01... | 1752 |
| 3 | attention.py | ace86f3e715b709f4ec0c44179d3675fcfa03abc21248f... | 16124 |
| 4 | blocks.py | efc9ffe783e0f1b1bd88d7544cf129fb6905ed4afd56f2... | 2493 |
| 5 | config.json | 4f4a7f15b8f979b45bfddb165ce9fbbfea9ef7cb5ef05a... | 1227 |
| 6 | configuration_mpt.py | 3798eac89b49c0e469fad694cb6abccb12e6a033f2f464... | 9080 |
| 7 | custom_embedding.py | f36668ddf22403a332f978057d527cf285b01468bc3431... | 15 |
| 8 | flash_attn_triton.py | f36668ddf22403a332f978057d527cf285b01468bc3431... | 15 |
| 9 | generation_config.json | 11e4dba2f93adb5aee189e9aa0dce3d624e72ecf943ac6... | 91 |

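In the results above, `custom_embedding.py` and `flash_attn_triton.py` show the same truncated hash and the same 15-byte size. Two distinct source files should not collide like this, so one plausible explanation (an assumption, not something the output confirms) is that both requests returned the same short error body rather than the file contents. A quick way to surface such cases is to group filenames by digest; `find_shared_hashes` below is a hypothetical helper.

```python
from collections import defaultdict
from typing import Dict, List


def find_shared_hashes(hashes: Dict[str, str]) -> Dict[str, List[str]]:
    """Return each digest that appears for more than one filename."""
    by_hash = defaultdict(list)
    for filename, digest in hashes.items():
        by_hash[digest].append(filename)
    return {digest: names for digest, names in by_hash.items() if len(names) > 1}


# Placeholder digests for illustration only.
example = {"one.py": "aaa", "two.py": "bbb", "three.py": "aaa"}
print(find_shared_hashes(example))
```

Running it on the notebook's `hashes` dictionary before exporting would list any group of files worth downloading again or inspecting by hand.
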

In [11]:
df.to_csv("mpt-7b-instruct_SHA256_HASHES.csv")
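
The exported CSV can later serve as a manifest for re-checking local copies. The sketch below reads it back into a dictionary using only the standard library. `load_manifest` is a hypothetical helper, and it assumes the `File`, `SHA256`, and `Sizes` column names used in the DataFrame above.

```python
import csv
from typing import Dict, Tuple


def load_manifest(csv_path: str) -> Dict[str, Tuple[str, int]]:
    """Map each filename in the exported CSV to (SHA-256 hex digest, size)."""
    manifest = {}
    with open(csv_path, newline="") as infile:
        for row in csv.DictReader(infile):
            # The unnamed index column written by to_csv is ignored.
            manifest[row["File"]] = (row["SHA256"], int(row["Sizes"]))
    return manifest
```

Each entry can then be compared with the output of `calculate_sha256` and `os.path.getsize` for the corresponding local file.
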