# Analyzing the filesystem footprint of Python 3 on Fedora

In [1]:
import collections
import pathlib

import humanize
import tabulate

from IPython.display import HTML, display

The list of Python 3 subpackages follows. `python-unversioned-command` is omitted, as we will query Python 3.8 on Fedora 31.

In [2]:
python_pkgs = [
    'python3',
    'python3-libs',
    'python3-tkinter',
    'python3-idle',
    'python3-test',
    'python3-devel',
]

What version of Python 3.8 is installed?

In [3]:
!rpm -q python38

python38-3.8.0-1.fc31.x86_64


We do care about all files in the `python38` package. The differences between `python38` on Fedora 31 and `python3` subpackages on Fedora 32 are not relevant in this context. The standard library is the same.

In [4]:
python38_files = !rpm -ql python38
python38_files = [f for f in python38_files if not f.startswith('/usr/lib/.build-id')]  # avoid noise
python38_files[:5]

['/usr/bin/idle3.8',
 '/usr/bin/msgfmt3.8.py',
 '/usr/bin/pydoc3.8',
 '/usr/bin/pygettext3.8.py',
 '/usr/bin/python3.8']

For each Fedora 32 `python3` subpackage, we get the relevant files installed from `python38` on Fedora 31:

In [5]:
pkg_files = {}
for pkg in python_pkgs:
    pkg_files[pkg] = !repoquery --repo=rawhide -l {pkg} 2>/dev/null
    pkg_files[pkg] = [f for f in pkg_files[pkg] if f in python38_files]
pkg_files['python3-libs'][:5]

['/usr/include/python3.8',
 '/usr/lib/python3.8',
 '/usr/lib/python3.8/site-packages',
 '/usr/lib/python3.8/site-packages/__pycache__',
 '/usr/include/python3.8']

In [6]:
file_pkgs = {path: pkg for pkg in pkg_files for path in pkg_files[pkg]}
file_pkgs['/usr/lib64/python3.8/tkinter']

'python3-tkinter'
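The cell above inverts the package-to-files mapping so that each path points to the subpackage that owns it. A minimal sketch of the same inversion on made-up data (the paths and package names below are hypothetical) shows one caveat: if a path were listed by two packages, the package iterated last would win.

```python
# Hypothetical package-to-files mapping, for illustration only.
pkg_files_demo = {
    'pkg-a': ['/demo/a.py', '/demo/shared'],
    'pkg-b': ['/demo/b.py', '/demo/shared'],
}

# Same dict comprehension shape as in the notebook cell.
file_pkgs_demo = {path: pkg for pkg in pkg_files_demo for path in pkg_files_demo[pkg]}

print(file_pkgs_demo['/demo/a.py'])    # pkg-a
print(file_pkgs_demo['/demo/shared'])  # pkg-b: the later package overwrites the earlier one
```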

Finally, we get the size of every file. On different architectures, with different Python versions, or even with different compiler versions in different Fedora releases, the sizes might differ. But we don't care about small differences: we are after the big stuff, and we will assume that what's big here will be big everywhere. We are aiming for a long-term solution, so considering the differences here would not be helpful anyway.

Note that the `Counter.most_common()` method gives us the largest files, but we will care about directories and file types more.

In [7]:
file_sizes = collections.Counter({p: pathlib.Path(p).stat().st_size for p in python38_files})
file_sizes.most_common()[:8]

[('/usr/lib64/libpython3.8.so', 3534768),
 ('/usr/lib64/libpython3.8.so.1.0', 3534768),
 ('/usr/lib64/python3.8/lib-dynload/unicodedata.cpython-38-x86_64-linux-gnu.so',
  1096792),
 ('/usr/lib64/python3.8/pydoc_data/topics.py', 668784),
 ('/usr/lib64/python3.8/test/testtar.tar', 435200),
 ('/usr/lib64/python3.8/pydoc_data/__pycache__/topics.cpython-38.opt-1.pyc',
  416327),
 ('/usr/lib64/python3.8/pydoc_data/__pycache__/topics.cpython-38.opt-2.pyc',
  416327),
 ('/usr/lib64/python3.8/pydoc_data/__pycache__/topics.cpython-38.pyc', 416327)]
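The two `libpython3.8.so*` entries report the same size. That is what one would see if one path were a symlink to the other, because `pathlib.Path.stat()` follows symlinks. Whether that is the case here is an assumption; the output alone does not prove it. A minimal sketch of how the difference between `stat()` and `lstat()` shows up, using a temporary directory and hypothetical file names:

```python
import pathlib
import tempfile

def size_without_following(path):
    """Size of the path itself; for a symlink, that is the link, not its target."""
    return pathlib.Path(path).lstat().st_size

with tempfile.TemporaryDirectory() as tmp:
    target = pathlib.Path(tmp) / 'libdemo.so.1.0'  # hypothetical name
    link = pathlib.Path(tmp) / 'libdemo.so'        # hypothetical name
    target.write_bytes(b'\0' * 1000)
    link.symlink_to(target.name)

    # stat() follows the link, so both paths report the target's size.
    followed = [p.stat().st_size for p in (link, target)]
    # lstat() reports the link itself, which is much smaller than its target.
    not_followed = [size_without_following(p) for p in (link, target)]
    link_is_symlink = link.is_symlink()

print(followed)
print(link_is_symlink)
```

If symlinks are present in the file list, summing `stat().st_size` counts their targets twice; filtering with `is_symlink()` would be one way to avoid that.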

A quick check: how are directories sized?

In [8]:
humanize.naturalsize(file_sizes['/usr/lib64/python3.8'])

'12.3 kB'

Clearly, not recursively.
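`stat()` on a directory reports the size of the directory entry itself, not of its contents. If a recursive figure is wanted, it can be derived from the per-file mapping. A sketch, assuming a `file_sizes`-like mapping of absolute paths to sizes (the function name and the sample data are made up for illustration):

```python
def recursive_size(sizes, directory):
    """Sum the sizes of the directory entry and all paths below it."""
    directory = directory.rstrip('/')
    prefix = directory + '/'
    return sum(size for path, size in sizes.items()
               if path == directory or path.startswith(prefix))

# Made-up sizes; '/demo/library.txt' must not be counted under '/demo/lib'.
demo_sizes = {
    '/demo/lib': 4096,
    '/demo/lib/a.py': 100,
    '/demo/lib/sub': 4096,
    '/demo/lib/sub/b.py': 50,
    '/demo/library.txt': 7,
}
print(recursive_size(demo_sizes, '/demo/lib'))  # 8342
```

Matching on the prefix with a trailing slash keeps sibling paths that merely share the same leading characters out of the sum.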

## Filesystem footprint by subpackages

In total, this has a large footprint, although large chunks of this are already split out:

In [9]:
humanize.naturalsize(sum(file_sizes.values()))

'115.5 MB'

In [10]:
{pkg: humanize.naturalsize(sum(s for p, s in file_sizes.items() if p in pkg_files[pkg])) for pkg in pkg_files}

{'python3': '21.6 kB',
 'python3-libs': '38.9 MB',
 'python3-tkinter': '2.1 MB',
 'python3-idle': '4.4 MB',
 'python3-test': '65.7 MB',
 'python3-devel': '4.4 MB'}

It can be seen that the "main" `python3` package is not very relevant here. The `python3-test` package is optional and pretty much only useful for testing Python itself. It contains a lot of test data and we will not try to optimize its size. The `python3-idle` package contains an application, and while we could aim to minimize anything, we will not focus on this package either.

**The main problem is in the `python3-libs` package** – it is always installed when Python is installed.

The `python3-tkinter` package is less problematic. It is optional and only recommended if *Tk* is installed.

The `python3-devel` package is quite big as well, and it is used both for building Python extension modules and for building Python RPM packages. Getting it slimmed down might be nice, but we would also consider moving stuff from `python3-libs` into it.

## Filesystem footprint by filetype

The standard library (`/usr/lib64/python3.8/`, mostly in `python3-libs` and `python3-test`) contains several file types:

In [11]:
def ext(path):
    """Get a file extension, but treat .opt-?.pyc as a special case"""
    suffixes = pathlib.Path(path).suffixes
    if not suffixes:
        return None
    # Guard the length so a bare "name.pyc" cannot raise IndexError.
    if len(suffixes) > 1 and suffixes[-1] == '.pyc' and suffixes[-2].startswith('.opt-'):
        return suffixes[-2] + suffixes[-1]
    return suffixes[-1]
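The special case exists because `pathlib` splits a file name at every dot. A short illustration of what `suffixes` returns for the kinds of file names seen in the listings above (using `PurePosixPath`, so nothing has to exist on disk):

```python
import pathlib

names = [
    'topics.cpython-38.opt-1.pyc',
    'topics.cpython-38.pyc',
    'unicodedata.cpython-38-x86_64-linux-gnu.so',
    'topics.py',
]

# Map each file name to the list of suffixes pathlib sees in it.
demo_suffixes = {name: pathlib.PurePosixPath(name).suffixes for name in names}
for name, suffixes in demo_suffixes.items():
    print(name, suffixes)
```

Only the first name has `.opt-1` as its second-to-last suffix, which is what `ext()` keys on to tell optimized bytecode apart from plain `.pyc` files.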

In [12]:
stdlib_files = [p for p in python38_files if p.startswith('/usr/lib64/python3.8/')]
extensions = {path: ext(path) for path in stdlib_files}

In [13]:
exts = collections.Counter(extensions.values())
exts.most_common()[:10]

[('.py', 1639),
 ('.opt-1.pyc', 1622),
 ('.opt-2.pyc', 1622),
 ('.pyc', 1622),
 (None, 242),
 ('.decTest', 143),
 ('.txt', 109),
 ('.so', 75),
 ('.xml', 56),
 ('.pem', 22)]

In [14]:
extsizes = collections.Counter({ext: sum(s for p, s in file_sizes.items()
                                         if p in stdlib_files and extensions[p] == ext)
                                for ext in exts})

{ext: humanize.naturalsize(size) for ext, size in extsizes.most_common()[:5]}

{'.py': '27.7 MB',
 '.pyc': '23.1 MB',
 '.opt-1.pyc': '23.0 MB',
 '.opt-2.pyc': '20.7 MB',
 '.so': '5.5 MB'}

Only from `python3-libs`:

In [15]:
extensions_libs = {p: e for p, e in extensions.items() if p in pkg_files['python3-libs']}
exts_libs = collections.Counter(extensions_libs.values())
exts_libs.most_common()[:6]

[('.py', 607),
 ('.opt-1.pyc', 607),
 ('.opt-2.pyc', 607),
 ('.pyc', 607),
 (None, 84),
 ('.so', 67)]

In [16]:
extsizes_libs = collections.Counter({ext: sum(s for p, s in file_sizes.items()
                                         if p in stdlib_files and p in pkg_files['python3-libs'] and extensions[p] == ext)
                                for ext in exts})

{ext: humanize.naturalsize(size) for ext, size in extsizes_libs.most_common()[:5]}

{'.py': '10.2 MB',
 '.pyc': '7.0 MB',
 '.opt-1.pyc': '7.0 MB',
 '.opt-2.pyc': '5.4 MB',
 '.so': '5.1 MB'}

## Filesystem footprint by module

In [17]:
module_sizes_by_extension = collections.defaultdict(lambda: collections.defaultdict(int))
msbe = module_sizes_by_extension

In [18]:
modules_packages = collections.defaultdict(set)

In [19]:
for path in stdlib_files:
    libdir = '/usr/lib64/python3.8/'
    _path = path[len(libdir):]
    if _path.endswith(('.pyc', '.py', '.so')):
        if _path.startswith('lib-dynload/'):
            _path = _path[len('lib-dynload/'):]
        elif _path.startswith('__pycache__/'):
            _path = _path[len('__pycache__/'):]

        if '/' in _path:
            modname = _path.partition('/')[0]
        else:
            modname = _path.partition('.')[0]

        msbe[modname][ext(path)] += file_sizes[path]
        modules_packages[modname].add(file_pkgs[path])
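The loop above maps every `.py`, `.pyc`, and `.so` file to a top-level module name. The same rules, pulled into a standalone function (the name `module_name` is introduced here for illustration only) and applied to a few paths relative to the standard library directory:

```python
def module_name(relpath):
    """Top-level module for a path relative to the stdlib directory."""
    # Strip at most one of the two special top-level directories.
    for prefix in ('lib-dynload/', '__pycache__/'):
        if relpath.startswith(prefix):
            relpath = relpath[len(prefix):]
            break
    # Packages are named by their first directory, plain modules by their stem.
    if '/' in relpath:
        return relpath.partition('/')[0]
    return relpath.partition('.')[0]

examples = [
    'os.py',
    '__pycache__/os.cpython-38.pyc',
    'lib-dynload/unicodedata.cpython-38-x86_64-linux-gnu.so',
    'asyncio/__pycache__/events.cpython-38.opt-2.pyc',
    'test/test_json/test_dump.py',
]
for relpath in examples:
    print(relpath, '->', module_name(relpath))
```

Source, bytecode, and extension-module files of one module all collapse to the same key, which is what lets the per-extension sizes be added up per module.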

In [20]:
by_total = collections.Counter({m: sum(e.values()) for m, e in msbe.items()})
by_total.most_common()[:7]

[('test', 51690932),
 ('idlelib', 4004920),
 ('unittest', 3173550),
 ('distutils', 2720049),
 ('encodings', 2573657),
 ('lib2to3', 2321923),
 ('tkinter', 2219606)]

In [21]:
def ns(num):
    if num:
        return humanize.naturalsize(num)
    return ''

In [26]:
sizes = [
    [
        m,
        ns(msbe[m]['.py']),
        ns(msbe[m]['.pyc']),
        ns(msbe[m]['.opt-1.pyc']),
        ns(msbe[m]['.opt-2.pyc']),
        ns(msbe[m]['.so']),
        ' / '.join(sorted(p[len('python3-'):] for p in modules_packages[m])),
        ns(sum(s for s in msbe[m].values())),
    ]
    for m, _ in by_total.most_common()]

In [27]:
hdr = ['', '.py', '.pyc', '.opt-1.pyc', '.opt-2.pyc', '.so', 'package', 'total']
display(HTML(tabulate.tabulate([hdr] + sizes, tablefmt='html')))

| | .py | .pyc | .opt-1.pyc | .opt-2.pyc | .so | package | total |
|---|---|---|---|---|---|---|---|
| test | 13.9 MB | 12.7 MB | 12.7 MB | 12.3 MB | | test | 51.7 MB |
| idlelib | 1.2 MB | 997.9 kB | 996.5 kB | 856.7 kB | | idle | 4.0 MB |
| unittest | 801.1 kB | 806.4 kB | 806.1 kB | 760.0 kB | | libs / test | 3.2 MB |
| distutils | 910.1 kB | 639.6 kB | 638.8 kB | 531.4 kB | | libs / test | 2.7 MB |
| encodings | 1.4 MB | 387.4 kB | 386.9 kB | 371.1 kB | | libs | 2.6 MB |
| lib2to3 | 649.4 kB | 583.6 kB | 570.2 kB | 518.8 kB | | libs / test | 2.3 MB |
| tkinter | 612.1 kB | 580.3 kB | 580.3 kB | 446.8 kB | | test / tkinter | 2.2 MB |
| pydoc_data | 668.8 kB | 416.5 kB | 416.5 kB | 416.5 kB | | libs | 1.9 MB |
| asyncio | 451.5 kB | 374.4 kB | 372.1 kB | 298.0 kB | | libs | 1.5 MB |