
Structure of zip files

Louis Maddox edited this page Jul 2, 2021 · 11 revisions

The source code for Python's zipfile module is here

The best description is the one Python's zipfile module links to, the APPNOTE.TXT which I also backed up to a GitHub Gist for posterity. The section on the general format is as follows:

4.3 General Format of a .ZIP file

4.3.1 A ZIP file MUST contain an "end of central directory record". A ZIP file containing only an "end of central directory record" is considered an empty ZIP file. Files MAY be added or replaced within a ZIP file, or deleted. A ZIP file MUST have only one "end of central directory record". Other records defined in this specification MAY be used as needed to support storage requirements for individual ZIP files.

4.3.2 Each file placed into a ZIP file MUST be preceded by a "local file header" record for that file. Each "local file header" MUST be accompanied by a corresponding "central directory header" record within the central directory section of the ZIP file.

4.3.3 Files MAY be stored in arbitrary order within a ZIP file. A ZIP file MAY span multiple volumes or it MAY be split into user-defined segment sizes. All values MUST be stored in little-endian byte order unless otherwise specified in this document for a specific data element.

4.3.4 Compression MUST NOT be applied to a "local file header", an "encryption header", or an "end of central directory record". Individual "central directory records" MUST NOT be compressed, but the aggregate of all central directory records MAY be compressed.

4.3.5 File data MAY be followed by a "data descriptor" for the file. Data descriptors are used to facilitate ZIP file streaming.

4.3.6 Overall .ZIP file format:

  [local file header 1]
  [encryption header 1]
  [file data 1]
  [data descriptor 1]
  . 
  .
  .
  [local file header n]
  [encryption header n]
  [file data n]
  [data descriptor n]
  [archive decryption header] 
  [archive extra data record] 
  [central directory header 1]
  .
  .
  .
  [central directory header n]
  [zip64 end of central directory record]
  [zip64 end of central directory locator] 
  [end of central directory record]

The archive opens with a "local file header", followed by the encryption header (if any), the file data, and then a data descriptor (if any). This group repeats for every file stored in the archive.

Here are the details of the local file header, which begins the file:

4.3.7 Local file header:

 local file header signature     4 bytes  (0x04034b50)
 version needed to extract       2 bytes
 general purpose bit flag        2 bytes
 compression method              2 bytes
 last mod file time              2 bytes
 last mod file date              2 bytes
 crc-32                          4 bytes
 compressed size                 4 bytes
 uncompressed size               4 bytes
 file name length                2 bytes
 extra field length              2 bytes

 file name (variable size)
 extra field (variable size)
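As a rough illustration of the layout above, the sketch below builds a tiny archive in memory and unpacks the fixed 30-byte part of its first local file header with struct. The member name, its contents and the dictionary keys are my own choices for illustration; the format string simply spells out each 2-byte and 4-byte field in order.

```python
import io
import struct
import zipfile

# Build a small archive in memory (hypothetical member name and contents).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_STORED) as zf:
    zf.writestr("hello.txt", b"hello world")
data = buf.getvalue()

# Fixed part of the local file header: 4-byte signature, five 2-byte fields,
# three 4-byte fields, then two 2-byte lengths (little-endian, no padding).
LOCAL_HEADER = struct.Struct("<4s5H3I2H")
FIELDS = (
    "signature", "version_needed", "flags", "method", "mod_time", "mod_date",
    "crc32", "compressed_size", "uncompressed_size", "name_length", "extra_length",
)

header = dict(zip(FIELDS, LOCAL_HEADER.unpack_from(data, 0)))

# The variable-size file name follows immediately after the fixed part.
name_start = LOCAL_HEADER.size
name = data[name_start:name_start + header["name_length"]]
print(header["signature"], name, header["uncompressed_size"])
```

Because the archive is written to a seekable BytesIO, zipfile stores the real sizes in the local header rather than deferring them to a data descriptor.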

After these comes the central directory:

4.3.12 Central directory structure:

  [central directory header 1]
  .
  .
  .
  [central directory header n]
  [digital signature]

File header:

 central file header signature   4 bytes  (0x02014b50)
 version made by                 2 bytes
 version needed to extract       2 bytes
 general purpose bit flag        2 bytes
 compression method              2 bytes
 last mod file time              2 bytes
 last mod file date              2 bytes
 crc-32                          4 bytes
 compressed size                 4 bytes
 uncompressed size               4 bytes
 file name length                2 bytes
 extra field length              2 bytes
 file comment length             2 bytes
 disk number start               2 bytes
 internal file attributes        2 bytes
 external file attributes        4 bytes
 relative offset of local header 4 bytes

 file name (variable size)
 extra field (variable size)
 file comment (variable size)

4.3.13 Digital signature:

 header signature                4 bytes  (0x05054b50)
 size of data                    2 bytes
 signature data (variable size)

With the introduction of the Central Directory Encryption feature in version 6.2 of this specification, the Central Directory Structure MAY be stored both compressed and encrypted. Although not required, it is assumed when encrypting the Central Directory Structure, that it will be compressed for greater storage efficiency. Information on the Central Directory Encryption feature can be found in the section describing the Strong Encryption Specification. The Digital Signature record will be neither compressed nor encrypted.

4.3.14 Zip64 end of central directory record

 zip64 end of central dir 
 signature                       4 bytes  (0x06064b50)
 size of zip64 end of central
 directory record                8 bytes
 version made by                 2 bytes
 version needed to extract       2 bytes
 number of this disk             4 bytes
 number of the disk with the 
 start of the central directory  4 bytes
 total number of entries in the
 central directory on this disk  8 bytes
 total number of entries in the
 central directory               8 bytes
 size of the central directory   8 bytes
 offset of start of central
 directory with respect to
 the starting disk number        8 bytes
 zip64 extensible data sector    (variable size)

4.3.14.1 The value stored into the "size of zip64 end of central directory record" SHOULD be the size of the remaining record and SHOULD NOT include the leading 12 bytes.

Size = SizeOfFixedFields + SizeOfVariableData - 12.
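A quick worked instance of that formula, assuming an empty zip64 extensible data sector (my assumption, chosen to keep the arithmetic simple). The format string is the one Python's zipfile module uses for this record.

```python
import struct

# Fixed fields of the zip64 end of central directory record.
fixed_size = struct.calcsize("<4sQ2H2L4Q")
variable_size = 0          # assumption: empty zip64 extensible data sector
leading_bytes = 4 + 8      # the signature plus the size field itself

stored_size = fixed_size + variable_size - leading_bytes
print(fixed_size, stored_size)   # 56 44
```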

...

4.3.15 Zip64 end of central directory locator

 zip64 end of central dir locator
 signature                       4 bytes  (0x07064b50)
 number of the disk with the
 start of the zip64 end of
 central directory               4 bytes
 relative offset of the zip64
 end of central directory record 8 bytes
 total number of disks           4 bytes

4.3.16 End of central directory record:

 end of central dir signature    4 bytes  (0x06054b50)
 number of this disk             2 bytes
 number of the disk with the
 start of the central directory  2 bytes
 total number of entries in the
 central directory on this disk  2 bytes
 total number of entries in
 the central directory           2 bytes
 size of the central directory   4 bytes
 offset of start of central
 directory with respect to
 the starting disk number        4 bytes
 .ZIP file comment length        2 bytes
 .ZIP file comment               (variable size)

The most important part to grasp is this section of the overall structure:

  [central directory header 1]
  ...
  [central directory header n]
  [zip64 end of central directory record]
  [zip64 end of central directory locator]
  [end of central directory record]

So that's:

  • Local File Headers marked by 0x04034b50 (zipfile.stringFileHeader = b"PK\003\004")
  • Central Directory Headers marked by 0x02014b50 (zipfile.stringCentralDir = b"PK\001\002")
    • to clarify, this signature is at the start of each central directory header
  • the Digital Signature marked by 0x05054b50 (not declared in zipfile but = b"PK\x05\x05")
    • to clarify, this signature is at the start of the 'Digital Signature' [which contains further data]
    • the 'Digital Signature' is the final part of the 'Central Directory'
  • the Zip64 End Of Central Directory Record marked by 0x06064b50 (zipfile.stringEndArchive64 = b"PK\x06\x06")
    • to clarify, this signature is at the start of the 'Z64EOCDR'
  • the End Of Central Directory Record marked by 0x06054b50 (zipfile.stringEndArchive = b"PK\005\006")

If you open a file (e.g. a .conda zip file such as https://repo.anaconda.com/pkgs/main/linux-64/decorator-4.1.2-py36hd076ac8_0.conda) with mode rb in Python and look through the bytes (printed as integers rather than characters), you can search for 80, 75 (the byte values of "PK") to identify:

  • P K 3 4 (three times in a row, the first immediately at the start)
    • three Local File Header signatures
  • P K 1 2 (three times in a row)
    • three Central Directory Header signatures
  • P K 5 6
    • one End Of Central Directory Record signature
    • followed by 18 bytes: 0 0 0 0 3 0 3 0 220 0 0 0 167 70 0 0 0 0
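Those 18 trailing bytes can be decoded by hand with struct. This is only a sketch of the arithmetic; the variable names are mine, following the field order of the End Of Central Directory record listed earlier.

```python
import struct

# The 18 bytes printed above, i.e. everything after the 4-byte signature.
tail = bytes([0, 0, 0, 0, 3, 0, 3, 0, 220, 0, 0, 0, 167, 70, 0, 0, 0, 0])

# Four 2-byte fields, two 4-byte fields, one 2-byte field (little-endian).
(disk_number, disk_with_cd, entries_this_disk, entries_total,
 cd_size, cd_offset, comment_length) = struct.unpack("<4H2IH", tail)

print(entries_total, cd_size, cd_offset, comment_length)   # 3 220 18087 0
```

So this archive holds 3 entries, its central directory is 220 bytes long and starts at offset 18087 (167 + 70 * 256), and there is no archive comment.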

The first time zipfile opens a zip file for reading (on initialising the class, which calls self._RealGetContents), it sets the value of start_dir to the offset of the start of the central directory, which it works out from the End Of Central Directory record.

If you create a ZipFile from the bytes of a zip file:

import io
import zipfile

with open("example.zip", "rb") as f:
    b = f.read()

z = zipfile.ZipFile(io.BytesIO(b))

...You will see that z.fp is preserved (a BytesIO object storing the bytes read from the file), and the value of z.fp.tell() is a few hundred bytes past z.start_dir. In fact, it is the same value as the last item returned by zipfile._EndRecData(z.fp), as the docstring on that function explains:

def _EndRecData(fpin):
    """Return data from the "End of Central Directory" record, or None.
    The data is a list of the nine items in the ZIP "End of central dir"
    record followed by a tenth item, the file seek offset of this record."""

If I run that on my example file:

>>> zipfile._EndRecData(z.fp)
[b'PK\x05\x06', 0, 0, 3, 3, 291, 5640033, 0, b'', 5640324]
>>> z.fp.tell()
5640324
>>> z.start_dir
5640033

So we can see that the Central Directory stretches from byte position 5640033 (start_dir) up to 5640324 ("the file seek offset of this [End Of Central Directory] record"), which is where the End Of Central Directory record itself begins. That is 5640324 - 5640033 = 291 bytes, matching the "size of the central directory" field (291) in the record.

If I use a Python .conda archive from conda-forge, decorator-4.3.2-py37_0.conda, I get:

>>> z.start_dir
18087
>>> z.fp.tell()
18307

So here the Central Directory stretches over 18307 - 18087 = 220 bytes.

From what I've seen from inspecting various .conda zip files from the conda-forge repository, start_dir is usually somewhere from 240-260 bytes before the End Of Central Directory record.
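The relationship between these numbers can be checked on a throwaway archive. The sketch below assumes an archive with no comment and no ZIP64 records, so that the End Of Central Directory record is exactly the last 22 bytes; the member names are made up.

```python
import io
import struct
import zipfile

# Build a small archive in memory (hypothetical member names).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("a.txt", b"alpha")
    zf.writestr("b.txt", b"bravo")
data = buf.getvalue()

# With no archive comment, the record is exactly the last 22 bytes.
eocd_location = len(data) - 22
(signature, _, _, _, entries_total,
 cd_size, cd_offset, comment_length) = struct.unpack_from("<4s4H2LH", data, eocd_location)

z = zipfile.ZipFile(io.BytesIO(data))
print(z.start_dir, cd_size, eocd_location)
```

Here start_dir equals the offset stored in the record, and adding the central directory size lands exactly on the record's own position.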

It's important to note that though these positions vary, the size of the fixed part of the "End Of Central Directory" structure is constant: it is zipfile.sizeEndCentDir = 22 bytes (any trailing .ZIP file comment comes on top of that). This allows the following:

This "End Of Central Directory Record" can be read to determine the positions of the individual files within the zip, and an example of code that does that is here (async port here)

  • Note that it uses a Struct('<H2sHHHIIIHH'): see Python's struct library

    • The < means the byte-order is little-endian
    • H means unsigned short size 2
    • s means char[] (no standard size, each is 1 byte)
      • size is given by the number before it (which is 2 here)
    • I means unsigned int size 4
    • So this means the sequence <H2sHHHIIIHH has length H+2(s)+3(H)+3(I)+2(H) = 6(H)+2(s)+3(I) = 6(2)+2+3(4) = 12+2+12 = 26 bytes
      • You can do this in Python by calling struct.calcsize("<H2sHHHIIIHH") or struct.Struct("<H2sHHHIIIHH").size = 26
  • central_directory_signature is given as b'\x50\x4b\x01\x02' which is equivalent to b'PK\x01\x02'

    • line 101 of Python's zipfile.py gives stringCentralDir = b"PK\001\002"
  • local_file_header_signature = b'\x50\x4b\x03\x04' (or b'PK\x03\x04')

  • Note that the central directory is not read, but passed on (this script just extracts files and doesn't try to be selective using the central directory)
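To make the 26-byte struct concrete, here is a minimal sketch in the same spirit (it is not the linked script): it walks the local file headers in order and never consults the central directory. It assumes sizes are stored in the local headers (no data descriptors), no encryption and no ZIP64; the function name and the test archive are hypothetical.

```python
import io
import struct
import zipfile

LOCAL_SIG = b"PK\x03\x04"
# The 26 bytes that follow the 4-byte local file header signature.
AFTER_SIG = struct.Struct("<H2sHHHIIIHH")

def iter_local_entries(data):
    """Yield (name, method, raw_data) by walking local headers in order."""
    pos = 0
    while data[pos:pos + 4] == LOCAL_SIG:
        (_version, _flags, method, _time, _date, _crc,
         compressed_size, _uncompressed_size,
         name_length, extra_length) = AFTER_SIG.unpack_from(data, pos + 4)
        name_start = pos + 4 + AFTER_SIG.size
        data_start = name_start + name_length + extra_length
        name = data[name_start:name_start + name_length].decode("utf-8")
        yield name, method, data[data_start:data_start + compressed_size]
        # The next local header (or the central directory) starts here.
        pos = data_start + compressed_size

# Stored (uncompressed) members so the raw data is readable as-is.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_STORED) as zf:
    zf.writestr("a.txt", b"alpha")
    zf.writestr("b.txt", b"bravo!")
archive = buf.getvalue()

entries = list(iter_local_entries(archive))
print(entries)
```

The loop stops as soon as the next four bytes are not a local file header signature, which in a well-formed archive is where the central directory begins.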

The "signature" of a structure is also known as its "magic number": one signals the start of each central directory header, and another the start of the end of central directory record.

Signature values begin with the two byte constant marker of 0x4b50, representing the characters "PK".

In fact these are all hard-coded into the zipfile module itself:

# The "end of central directory" structure, magic number, size, and indices
# (section V.I in the format document)
structEndArchive = b"<4s4H2LH"
stringEndArchive = b"PK\005\006"
sizeEndCentDir = struct.calcsize(structEndArchive)

# The "central directory" structure, magic number, size, and indices
# of entries in the structure (section V.F in the format document)
structCentralDir = "<4s4B4HL2L5H2L"
stringCentralDir = b"PK\001\002"
sizeCentralDir = struct.calcsize(structCentralDir)

# The "local file header" structure, magic number, size, and indices
# (section V.A in the format document)
structFileHeader = "<4s2B4HL2L2H"
stringFileHeader = b"PK\003\004"
sizeFileHeader = struct.calcsize(structFileHeader)

# The "Zip64 end of central directory locator" structure, magic number, and size
structEndArchive64Locator = "<4sLQL"
stringEndArchive64Locator = b"PK\x06\x07"
sizeEndCentDir64Locator = struct.calcsize(structEndArchive64Locator)

# The "Zip64 end of central directory" record, magic number, size, and indices
# (section V.G in the format document)
structEndArchive64 = "<4sQ2H2L4Q"
stringEndArchive64 = b"PK\x06\x06"
sizeEndCentDir64 = struct.calcsize(structEndArchive64)

More concisely:

  • EndArchive = b"PK\005\006" (struct: b"<4s4H2LH")
  • CentralDir = b"PK\001\002" (struct: "<4s4B4HL2L5H2L")
  • FileHeader = b"PK\003\004" (struct: "<4s2B4HL2L2H")
  • EndArchive64Locator = b"PK\x06\x07" (struct: "<4sLQL")
  • EndArchive64 = b"PK\x06\x06" (struct: "<4sQ2H2L4Q")
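The sizes these format strings produce can be checked directly. A small sketch; the dictionary keys are just labels of mine, and the comparison at the end uses the size constants defined in the zipfile listing above.

```python
import struct
import zipfile

# Format strings copied from the listing above.
formats = {
    "EndArchive": "<4s4H2LH",
    "CentralDir": "<4s4B4HL2L5H2L",
    "FileHeader": "<4s2B4HL2L2H",
    "EndArchive64Locator": "<4sLQL",
    "EndArchive64": "<4sQ2H2L4Q",
}
sizes = {name: struct.calcsize(fmt) for name, fmt in formats.items()}
print(sizes)

# The same numbers are exposed as module-level constants in zipfile.
matches = (
    sizes["EndArchive"] == zipfile.sizeEndCentDir
    and sizes["CentralDir"] == zipfile.sizeCentralDir
    and sizes["FileHeader"] == zipfile.sizeFileHeader
)
print(matches)
```

That gives 22, 46, 30, 20 and 56 bytes respectively for the fixed parts of the five structures.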

Next in zipfile.py there is a list of the index of each entry in the central directory struct, which I'll accompany below with the corresponding description from the APPNOTE.TXT:



# indexes of entries in the central directory structure
_CD_SIGNATURE = 0

The signature of the central directory. This is always b"\x50\x4b\x01\x02"

_CD_CREATE_VERSION = 1
_CD_CREATE_SYSTEM = 2

4.4.2 version made by (2 bytes)

 4.4.2.1 The upper byte indicates the compatibility of the file
 attribute information.  If the external file attributes 
 are compatible with MS-DOS and can be read by PKZIP for 
 DOS version 2.04g then this value will be zero.  If these 
 attributes are not compatible, then this value will 
 identify the host system on which the attributes are 
 compatible.  Software can use this information to determine
 the line record format for text files etc.  

 4.4.2.2 The current mappings are:

  0 - MS-DOS and OS/2 (FAT / VFAT / FAT32 file systems)
  1 - Amiga                     2 - OpenVMS
  3 - UNIX                      4 - VM/CMS
  5 - Atari ST                  6 - OS/2 H.P.F.S.
  7 - Macintosh                 8 - Z-System
  9 - CP/M                     10 - Windows NTFS
 11 - MVS (OS/390 - Z/OS)      12 - VSE
 13 - Acorn Risc               14 - VFAT
 15 - alternate MVS            16 - BeOS
 17 - Tandem                   18 - OS/400
 19 - OS X (Darwin)            20 thru 255 - unused

 4.4.2.3 The lower byte indicates the ZIP specification version 
 (the version of this document) supported by the software 
 used to encode the file.  The value/10 indicates the major 
 version number, and the value mod 10 is the minor version 
 number.  
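A minimal sketch of splitting this 2-byte field as described in 4.4.2.1 and 4.4.2.3. The function name and the sample values are hypothetical.

```python
def split_version_made_by(value):
    """Split the 2-byte 'version made by' field into its parts."""
    host_system = value >> 8        # upper byte: host system (see 4.4.2.2)
    spec_version = value & 0xFF     # lower byte: ZIP specification version
    major, minor = divmod(spec_version, 10)
    return host_system, major, minor

# Hypothetical value: upper byte 3 (UNIX), lower byte 20 (version 2.0).
print(split_version_made_by((3 << 8) | 20))   # (3, 2, 0)
```

zipfile avoids the shifting by unpacking the field as two single bytes (part of the 4B in structCentralDir), which is why it has separate version and system indices.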
_CD_EXTRACT_VERSION = 3
_CD_EXTRACT_SYSTEM = 4

4.4.3 version needed to extract (2 bytes)

 4.4.3.1 The minimum supported ZIP specification version needed 
 to extract the file, mapped as above.  This value is based on 
 the specific format features a ZIP program MUST support to 
 be able to extract the file.  If multiple features are
 applied to a file, the minimum version MUST be set to the 
 feature having the highest value. New features or feature 
 changes affecting the published format specification will be 
 implemented using higher version numbers than the last 
 published value to avoid conflict.

 4.4.3.2 Current minimum feature versions are as defined below:

  1.0 - Default value
  1.1 - File is a volume label
  2.0 - File is a folder (directory)
  2.0 - File is compressed using Deflate compression
  2.0 - File is encrypted using traditional PKWARE encryption
  2.1 - File is compressed using Deflate64(tm)
  2.5 - File is compressed using PKWARE DCL Implode 
  2.7 - File is a patch data set 
  4.5 - File uses ZIP64 format extensions
  4.6 - File is compressed using BZIP2 compression*
  5.0 - File is encrypted using DES
  5.0 - File is encrypted using 3DES
  5.0 - File is encrypted using original RC2 encryption
  5.0 - File is encrypted using RC4 encryption
  5.1 - File is encrypted using AES encryption
  5.1 - File is encrypted using corrected RC2 encryption**
  5.2 - File is encrypted using corrected RC2-64 encryption**
  6.1 - File is encrypted using non-OAEP key wrapping***
  6.2 - Central directory encryption
  6.3 - File is compressed using LZMA
  6.3 - File is compressed using PPMd+
  6.3 - File is encrypted using Blowfish
  6.3 - File is encrypted using Twofish

 4.4.3.3 Notes on version needed to extract 

 * Early 7.x (pre-7.2) versions of PKZIP incorrectly set the
 version needed to extract for BZIP2 compression to be 50
 when it SHOULD have been 46.

 ** Refer to the section on Strong Encryption Specification
 for additional information regarding RC2 corrections.

 *** Certificate encryption using non-OAEP key wrapping is the
 intended mode of operation for all versions beginning with 6.1.
 Support for OAEP key wrapping MUST only be used for
 backward compatibility when sending ZIP files to be opened by
 versions of PKZIP older than 6.1 (5.0 or 6.0).

 + Files compressed using PPMd MUST set the version
 needed to extract field to 6.3, however, not all ZIP 
 programs enforce this and MAY be unable to decompress 
 data files compressed using PPMd if this value is set.

 When using ZIP64 extensions, the corresponding value in the
 zip64 end of central directory record MUST also be set.  
 This field SHOULD be set appropriately to indicate whether 
 Version 1 or Version 2 format is in use.

_CD_FLAG_BITS = 5

4.4.4 general purpose bit flag: (2 bytes)

Bit 0: If set, indicates that the file is encrypted.

(For Method 6 - Imploding)
Bit 1: If the compression method used was type 6,
       Imploding, then this bit, if set, indicates
       an 8K sliding dictionary was used.  If clear,
       then a 4K sliding dictionary was used.

Bit 2: If the compression method used was type 6,
       Imploding, then this bit, if set, indicates
       3 Shannon-Fano trees were used to encode the
       sliding dictionary output.  If clear, then 2
       Shannon-Fano trees were used.

(For Methods 8 and 9 - Deflating)
Bit 2  Bit 1
  0      0    Normal (-en) compression option was used.
  0      1    Maximum (-exx/-ex) compression option was used.
  1      0    Fast (-ef) compression option was used.
  1      1    Super Fast (-es) compression option was used.

(For Method 14 - LZMA)
Bit 1: If the compression method used was type 14,
       LZMA, then this bit, if set, indicates
       an end-of-stream (EOS) marker is used to
       mark the end of the compressed data stream.
       If clear, then an EOS marker is not present
       and the compressed data size must be known
       to extract.

Note:  Bits 1 and 2 are undefined if the compression
       method is any other.

Bit 3: If this bit is set, the fields crc-32, compressed 
       size and uncompressed size are set to zero in the 
       local header.  The correct values are put in the 
       data descriptor immediately following the compressed
       data.  (Note: PKZIP version 2.04g for DOS only 
       recognizes this bit for method 8 compression, newer 
       versions of PKZIP recognize this bit for any 
       compression method.)

Bit 4: Reserved for use with method 8, for enhanced
       deflating. 

Bit 5: If this bit is set, this indicates that the file is 
       compressed patched data.  (Note: Requires PKZIP 
       version 2.70 or greater)

Bit 6: Strong encryption.  If this bit is set, you MUST
       set the version needed to extract value to at least
       50 and you MUST also set bit 0.  If AES encryption
       is used, the version needed to extract value MUST 
       be at least 51. See the section describing the Strong
       Encryption Specification for details.  Refer to the 
       section in this document entitled "Incorporating PKWARE 
       Proprietary Technology into Your Product" for more 
       information.

Bit 7: Currently unused.

Bit 8: Currently unused.

Bit 9: Currently unused.

Bit 10: Currently unused.

Bit 11: Language encoding flag (EFS).  If this bit is set,
        the filename and comment fields for this file
        MUST be encoded using UTF-8. (see APPENDIX D)

Bit 12: Reserved by PKWARE for enhanced compression.

Bit 13: Set when encrypting the Central Directory to indicate 
        selected data values in the Local Header are masked to
        hide their actual values.  See the section describing 
        the Strong Encryption Specification for details.  Refer
        to the section in this document entitled "Incorporating 
        PKWARE Proprietary Technology into Your Product" for 
        more information.

Bit 14: Reserved by PKWARE for alternate streams.

Bit 15: Reserved by PKWARE.
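As a sketch of how these bits can be tested, the snippet below checks three of them (bits 0, 3 and 11) against the flag_bits attribute that zipfile exposes on each ZipInfo. The helper name, the label strings and the archive members are my own.

```python
import io
import zipfile

# Bit positions taken from the list above.
FLAG_ENCRYPTED = 1 << 0
FLAG_DATA_DESCRIPTOR = 1 << 3
FLAG_UTF8 = 1 << 11

def describe_flags(flag_bits):
    """Return labels for whichever of the three flags above are set."""
    names = []
    if flag_bits & FLAG_ENCRYPTED:
        names.append("encrypted")
    if flag_bits & FLAG_DATA_DESCRIPTOR:
        names.append("data descriptor")
    if flag_bits & FLAG_UTF8:
        names.append("utf-8 names")
    return names

# Hypothetical archive with one ASCII and one non-ASCII member name.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("plain.txt", b"x")
    zf.writestr("caf\u00e9.txt", b"y")

with zipfile.ZipFile(io.BytesIO(buf.getvalue())) as zf:
    results = {info.filename: describe_flags(info.flag_bits) for info in zf.infolist()}
print(results)
```

When writing, zipfile sets bit 11 only for names that cannot be encoded as ASCII, so only the second member reports it.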
_CD_COMPRESS_TYPE = 6

4.4.5 compression method: (2 bytes)

0 - The file is stored (no compression)
1 - The file is Shrunk
2 - The file is Reduced with compression factor 1
3 - The file is Reduced with compression factor 2
4 - The file is Reduced with compression factor 3
5 - The file is Reduced with compression factor 4
6 - The file is Imploded
7 - Reserved for Tokenizing compression algorithm
8 - The file is Deflated
9 - Enhanced Deflating using Deflate64(tm)

10 - PKWARE Data Compression Library Imploding (old IBM TERSE)
11 - Reserved by PKWARE
12 - File is compressed using BZIP2 algorithm
13 - Reserved by PKWARE
14 - LZMA
15 - Reserved by PKWARE
16 - IBM z/OS CMPSC Compression
17 - Reserved by PKWARE
18 - File is compressed using IBM TERSE (new)
19 - IBM LZ77 z Architecture
20 - deprecated (use method 93 for zstd)
93 - Zstandard (zstd) Compression
94 - MP3 Compression
95 - XZ Compression
96 - JPEG variant
97 - WavPack compressed data
98 - PPMd version I, Rev 1
99 - AE-x encryption marker (see APPENDIX E)

4.4.5.1 Methods 1-6 are legacy algorithms and are no longer recommended for use when compressing files.
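Only a handful of these method numbers have named constants in Python's zipfile module. A small sketch mapping them to labels (the label strings are mine):

```python
import zipfile

# Method numbers from the list above that zipfile exposes as constants.
METHODS = {
    zipfile.ZIP_STORED: "stored",
    zipfile.ZIP_DEFLATED: "deflated",
    zipfile.ZIP_BZIP2: "bzip2",
    zipfile.ZIP_LZMA: "lzma",
}
print(sorted(METHODS))   # [0, 8, 12, 14]
```

The compress_type attribute of a ZipInfo holds this number for each member of an archive.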

_CD_TIME = 7
_CD_DATE = 8

4.4.6 date and time fields: (2 bytes each)

The date and time are encoded in standard MS-DOS format.
If input came from standard input, the date and time are
those at which compression was started for this data. 
If encrypting the central directory and general purpose bit 
flag 13 is set indicating masking, the value stored in the 
Local Header will be zero. MS-DOS time format is different
from more commonly used computer time formats such as 
UTC. For example, MS-DOS uses year values relative to 1980
and 2 second precision.
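The spec text does not spell out the bit layout, so the sketch below assumes the usual MS-DOS packing: the date holds year-1980 in the top 7 bits, then 4 bits of month and 5 of day; the time holds 5 bits of hour, 6 of minute and 5 bits of seconds divided by two. The function names are hypothetical.

```python
def decode_dos_datetime(dos_date, dos_time):
    """Unpack the two 2-byte MS-DOS fields into calendar values."""
    year = (dos_date >> 9) + 1980
    month = (dos_date >> 5) & 0x0F
    day = dos_date & 0x1F
    hour = dos_time >> 11
    minute = (dos_time >> 5) & 0x3F
    second = (dos_time & 0x1F) * 2      # stored with 2 second precision
    return year, month, day, hour, minute, second

def encode_dos_datetime(year, month, day, hour, minute, second):
    """Pack calendar values into the two 2-byte MS-DOS fields."""
    dos_date = ((year - 1980) << 9) | (month << 5) | day
    dos_time = (hour << 11) | (minute << 5) | (second // 2)
    return dos_date, dos_time

packed = encode_dos_datetime(2021, 7, 2, 12, 34, 57)
print(decode_dos_datetime(*packed))   # (2021, 7, 2, 12, 34, 56)
```

The round trip loses the odd second (57 becomes 56), which is the 2 second precision the text mentions.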
_CD_CRC = 9

4.4.7 CRC-32: (4 bytes)

The CRC-32 algorithm was generously contributed by
David Schwaderer and can be found in his excellent
book "C Programmers Guide to NetBIOS" published by
Howard W. Sams & Co. Inc.  The 'magic number' for
the CRC is 0xdebb20e3.  The proper CRC pre and post
conditioning is used, meaning that the CRC register
is pre-conditioned with all ones (a starting value
of 0xffffffff) and the value is post-conditioned by
taking the one's complement of the CRC residual.
If bit 3 of the general purpose flag is set, this
field is set to zero in the local header and the correct
value is put in the data descriptor and in the central
directory. When encrypting the central directory, if the
local header is not in ZIP64 format and general purpose 
bit flag 13 is set indicating masking, the value stored 
in the Local Header will be zero. 
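The pre- and post-conditioning described above can be sketched with a deliberately slow bit-by-bit CRC and compared against the standard library. The constant 0xEDB88320 (the reflected CRC-32 polynomial) is not taken from the text above, and the function name is hypothetical.

```python
import binascii

def crc32_bitwise(data):
    """Bit-by-bit CRC-32 with the pre- and post-conditioning described above."""
    crc = 0xFFFFFFFF                      # pre-condition with all ones
    for byte in data:
        crc ^= byte
        for _ in range(8):
            # Shift out one bit; fold in the polynomial if that bit was set.
            crc = (crc >> 1) ^ (0xEDB88320 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF               # post-condition: one's complement

payload = b"hello world"
print(hex(crc32_bitwise(payload)), hex(binascii.crc32(payload)))
```

Both calls print the same value, which is also what zipfile stores in the CRC attribute of a ZipInfo for a member with that content.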
_CD_COMPRESSED_SIZE = 10
_CD_UNCOMPRESSED_SIZE = 11

4.4.8 compressed size: (4 bytes)
4.4.9 uncompressed size: (4 bytes)

The size of the file compressed (4.4.8) and uncompressed,
(4.4.9) respectively.  When a decryption header is present it 
will be placed in front of the file data and the value of the
compressed file size will include the bytes of the decryption
header.  If bit 3 of the general purpose bit flag is set, 
these fields are set to zero in the local header and the 
correct values are put in the data descriptor and
in the central directory.  If an archive is in ZIP64 format
and the value in this field is 0xFFFFFFFF, the size will be
in the corresponding 8 byte ZIP64 extended information 
extra field.  When encrypting the central directory, if the
local header is not in ZIP64 format and general purpose bit 
flag 13 is set indicating masking, the value stored for the 
uncompressed size in the Local Header will be zero.

_CD_FILENAME_LENGTH = 12
_CD_EXTRA_FIELD_LENGTH = 13
_CD_COMMENT_LENGTH = 14

4.4.10 file name length: (2 bytes)
4.4.11 extra field length: (2 bytes)
4.4.12 file comment length: (2 bytes)

The length of the file name, extra field, and comment
fields respectively.  The combined length of any
directory record and these three fields SHOULD NOT
generally exceed 65,535 bytes.  If input came from standard
input, the file name length is set to zero.

_CD_DISK_NUMBER_START = 15

4.4.13 disk number start: (2 bytes)

The number of the disk on which this file begins.  If an 
archive is in ZIP64 format and the value in this field is 
0xFFFF, the size will be in the corresponding 4 byte zip64 
extended information extra field.

_CD_INTERNAL_FILE_ATTRIBUTES = 16

4.4.14 internal file attributes: (2 bytes)

Bits 1 and 2 are reserved for use by PKWARE.

4.4.14.1 The lowest bit of this field indicates, if set, 
that the file is apparently an ASCII or text file.  If not
set, that the file apparently contains binary data.
The remaining bits are unused in version 1.0.

4.4.14.2 The 0x0002 bit of this field indicates, if set, that 
a 4 byte variable record length control field precedes each 
logical record indicating the length of the record. The 
record length control field is stored in little-endian byte
order.  This flag is independent of text control characters, 
and if used in conjunction with text data, includes any 
control characters in the total length of the record. This 
value is provided for mainframe data transfer support.

_CD_EXTERNAL_FILE_ATTRIBUTES = 17

4.4.15 external file attributes: (4 bytes)

The mapping of the external attributes is
host-system dependent (see 'version made by').  For
MS-DOS, the low order byte is the MS-DOS directory
attribute byte.  If input came from standard input, this
field is set to zero.

_CD_LOCAL_HEADER_OFFSET = 18

4.4.16 relative offset of local header: (4 bytes)

This is the offset from the start of the first disk on
which this file appears, to where the local header SHOULD
be found.  If an archive is in ZIP64 format and the value
in this field is 0xFFFFFFFF, the size will be in the 
corresponding 8 byte zip64 extended information extra field.

The Python module's list of entries stops here, but the PKWARE list continues:

4.4.17 file name: (Variable)

4.4.17.1 The name of the file, with optional relative path.
The path stored MUST NOT contain a drive or
device letter, or a leading slash.  All slashes
MUST be forward slashes '/' as opposed to
backwards slashes '\' for compatibility with Amiga
and UNIX file systems etc.  If input came from standard
input, there is no file name field.  

4.4.17.2 If using the Central Directory Encryption Feature and 
general purpose bit flag 13 is set indicating masking, the file 
name stored in the Local Header will not be the actual file name.  
A masking value consisting of a unique hexadecimal value will 
be stored.  This value will be sequentially incremented for each 
file in the archive. See the section on the Strong Encryption 
Specification for details on retrieving the encrypted file name. 
Refer to the section in this document entitled "Incorporating PKWARE 
Proprietary Technology into Your Product" for more information.

4.4.18 file comment: (Variable)

The comment for this file.

4.4.19 number of this disk: (2 bytes)

The number of this disk, which contains central
directory end record. If an archive is in ZIP64 format
and the value in this field is 0xFFFF, the size will 
be in the corresponding 4 byte zip64 end of central 
directory field.

4.4.20 number of the disk with the start of the central directory: (2 bytes)

The number of the disk on which the central
directory starts. If an archive is in ZIP64 format
and the value in this field is 0xFFFF, the size will 
be in the corresponding 4 byte zip64 end of central 
directory field.

4.4.21 total number of entries in the central dir on this disk: (2 bytes)

The number of central directory entries on this disk. If an archive is in ZIP64 format and the value in this field is 0xFFFF, the size will be in the corresponding 8 byte zip64 end of central directory field.

4.4.22 total number of entries in the central dir: (2 bytes)

The total number of files in the .ZIP file. If an archive is in ZIP64 format and the value in this field is 0xFFFF, the size will be in the corresponding 8 byte zip64 end of central directory field.

4.4.23 size of the central directory: (4 bytes)

The size (in bytes) of the entire central directory. If an archive is in ZIP64 format and the value in this field is 0xFFFFFFFF, the size will be in the corresponding 8 byte zip64 end of central directory field.

4.4.24 offset of start of central directory with respect to the starting disk number: (4 bytes)

Offset of the start of the central directory on the disk on which the central directory starts. If an archive is in ZIP64 format and the value in this field is 0xFFFFFFFF, the size will be in the corresponding 8 byte zip64 end of central directory field.

4.4.25 .ZIP file comment length: (2 bytes)

The length of the comment for this .ZIP file.

4.4.26 .ZIP file comment: (Variable)

The comment for this .ZIP file. ZIP file comment data is stored unsecured. No encryption or data authentication is applied to this area at this time. Confidential information SHOULD NOT be stored in this section.

4.4.27 zip64 extensible data sector (variable size)

(currently reserved for use by PKWARE)

4.4.28 extra field: (Variable)

This SHOULD be used for storage expansion. If additional information needs to be stored within a ZIP file for special application or platform needs, it SHOULD be stored here.
Programs supporting earlier versions of this specification can then safely skip the file, and find the next file or header.
This field will be 0 length in version 1.0.

Existing extra fields are defined in the section Extensible data fields that follows.


This struct is read in during the _RealGetContents method. To see how it works, grab a copy of the [self-contained] zipfile module, set a breakpoint in there, and inspect the variables involved. You'll see that fp is an io.BytesIO object, much like the one you get when calling io.BytesIO on the bytes of a GET response or of a file read from disk, and each time some bytes are read() in, the offset (given by the tell() method) advances accordingly.
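A simplified sketch of that reading loop, written from the struct layout above rather than copied from zipfile. It assumes a single-disk archive with no archive comment and no ZIP64 records; the function name and the test archive are hypothetical.

```python
import io
import struct
import zipfile

CENTRAL_DIR = struct.Struct("<4s4B4HL2L5H2L")   # structCentralDir, 46 bytes
END_ARCHIVE = struct.Struct("<4s4H2LH")         # structEndArchive, 22 bytes

def list_central_directory(data):
    """Return (filename, local_header_offset) pairs from the central directory."""
    end = END_ARCHIVE.unpack_from(data, len(data) - END_ARCHIVE.size)
    cd_size, cd_offset = end[5], end[6]
    fp = io.BytesIO(data[cd_offset:cd_offset + cd_size])
    entries = []
    while fp.tell() < cd_size:
        record = CENTRAL_DIR.unpack(fp.read(CENTRAL_DIR.size))
        if record[0] != b"PK\x01\x02":
            raise ValueError("bad central directory signature")
        # Indexes 12, 13 and 14 are the three variable-length field sizes.
        name_length, extra_length, comment_length = record[12:15]
        filename = fp.read(name_length).decode("utf-8")
        fp.seek(extra_length + comment_length, io.SEEK_CUR)   # skip the rest
        entries.append((filename, record[18]))   # 18: local header offset
    return entries

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("a.txt", b"alpha")
    zf.writestr("b.txt", b"bravo")
archive = buf.getvalue()

listing = list_central_directory(archive)
print(listing)
```

Each read() of 46 bytes plus the variable-length fields advances tell() to the next header, until the whole central directory has been consumed.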

Equally important is the 'end of central directory' structure:

_ECD_SIGNATURE = 0

The signature of the 'end of central directory' record. This is always b"\x50\x4b\x05\x06"

_ECD_DISK_NUMBER = 1
_ECD_DISK_START = 2
_ECD_ENTRIES_THIS_DISK = 3
_ECD_ENTRIES_TOTAL = 4
_ECD_SIZE = 5
_ECD_OFFSET = 6
_ECD_COMMENT_SIZE = 7
# These last two indices are not part of the structure as defined in the
# spec, but they are used internally by this module as a convenience
_ECD_COMMENT = 8
_ECD_LOCATION = 9
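To show what the two convenience items add, here is a simplified sketch that builds the same ten-item list by hand. It is not zipfile's own implementation: it simply takes the last occurrence of the signature and ignores ZIP64 records. The ECD_* names mirror the indexes above, and the archive comment is made up.

```python
import io
import struct
import zipfile

END_ARCHIVE = struct.Struct("<4s4H2LH")
# Index names mirroring the _ECD_* constants listed above.
ECD_SIZE, ECD_OFFSET, ECD_COMMENT_SIZE, ECD_COMMENT, ECD_LOCATION = 5, 6, 7, 8, 9

def find_end_record(data):
    """Return the eight record fields plus comment and record offset, or None."""
    location = data.rfind(b"PK\x05\x06")
    if location < 0 or location + END_ARCHIVE.size > len(data):
        return None
    record = list(END_ARCHIVE.unpack_from(data, location))
    comment_start = location + END_ARCHIVE.size
    record.append(data[comment_start:comment_start + record[ECD_COMMENT_SIZE]])
    record.append(location)
    return record

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("a.txt", b"alpha")
    zf.comment = b"demo comment"    # hypothetical archive comment
archive = buf.getvalue()

record = find_end_record(archive)
print(record[ECD_COMMENT], record[ECD_LOCATION])
```

With a comment present, the record no longer sits in the last 22 bytes of the file, which is why the location has to be searched for and is worth keeping alongside the parsed fields.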