This repository has been archived by the owner on Jan 30, 2023. It is now read-only.

To do sharding or not #1

Open
infinisil opened this issue Sep 2, 2022 · 6 comments

@infinisil
Member

infinisil commented Sep 2, 2022

We had some good discussion about this RFC on Matrix, mainly around whether sharding (splitting the auto-called directory up based on prefixes) is a good idea. Involved were @alyssais, @roberth, @adisbladis and @infinisil. The strength of the arguments was also discussed.

Additional motivations for this RFC in general:

  • Improves git status performance (good with less recursive structures)

These are additional arguments for sharding:

  • Improves git performance, big directories are not great for git trees (can we quantify this? @roberth says that git packfiles can delta encode trees)
  • nix edit .#hello works in any case, no need to manually type package paths

These are some additional arguments against sharding:

  • CLI usage is nicer, e.g. cd pkgs/auto/hello or cd pkgs/hello if auto isn't needed anymore, which mirrors pkgs.hello very well
  • When we do sharding with 2-prefix, 1-letter package names like R, h, j, o, q and t would need special handling (suggested was e.g. pkgs/auto/R/R/default.nix)
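The special handling for one-letter names can be illustrated with a small sketch. The function name `shard_path` and its parameters are hypothetical and not part of the RFC. The sketch assumes the shard is simply the first `prefix_len` characters of the package name with case preserved, so a one-letter name like R ends up in a shard named after itself, matching the suggested pkgs/auto/R/R/default.nix:

```python
from pathlib import PurePosixPath


def shard_path(name: str, prefix_len: int = 2, root: str = "pkgs/auto") -> PurePosixPath:
    """Return a hypothetical sharded path for a package name.

    Assumption: the shard is the first `prefix_len` characters of the name,
    with the original case kept.
    """
    if not name:
        raise ValueError("package name must not be empty")
    # Slicing a name shorter than prefix_len yields the whole name,
    # so "R" lands in shard "R".
    shard = name[:prefix_len]
    return PurePosixPath(root) / shard / name / "default.nix"


if __name__ == "__main__":
    for name in ["hello", "R", "libfoo"]:
        print(shard_path(name))
```

Whether shard names should be lowercased (as the counting script further down does) is a separate decision that this sketch does not settle.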

Benchmarking:

  • Sorted 1- and 2-prefix sharding counts by @infinisil: https://gist.github.com/infinisil/0afcae04298390b7d02f91fca4a22219

  • Python script by @adisbladis with counts based on directory names:

    Source
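    #!/usr/bin/env python
    # Count package directories per name prefix, to estimate shard sizes.
    import json
    import os
    
    
    # Number of leading characters of the directory name used as the shard key
    PREFIX_LEN = 2
    
    
    if __name__ == "__main__":
        count = {}
    
        for root, dirs, files in os.walk("."):
            # Only count leaf directories that contain a default.nix
            if "default.nix" in files and not dirs:
                prefix = os.path.basename(root)[:PREFIX_LEN].lower()
                count[prefix] = count.get(prefix, 0) + 1
    
        # Printed in the order prefixes were first encountered, not sorted
        print(json.dumps(count, indent=2))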
    Result (sorted)
    {
      "py": 1010,
      "li": 955,
      "co": 316,
      "ma": 287,
      "go": 235,
      "op": 224,
      "re": 223,
      "pa": 222,
      "ca": 220,
      "te": 196,
      "sp": 174,
      "st": 167,
      "cl": 162,
      "pr": 161,
      "mi": 160,
      "fl": 157,
      "mo": 149,
      "gi": 146,
      "di": 140,
      "ai": 138,
      "po": 137,
      "gn": 135,
      "se": 131,
      "pi": 130,
      "me": 130,
      "so": 129,
      "gr": 129,
      "in": 128,
      "az": 127,
      "ch": 126,
      "ge": 124,
      "ba": 123,
      "ne": 120,
      "sc": 113,
      "de": 112,
      "da": 112,
      "do": 109,
      "as": 109,
      "un": 108,
      "si": 108,
      "tr": 107,
      "bi": 107,
      "ti": 104,
      "ra": 103,
      "ar": 102,
      "fa": 101,
      "ro": 100,
      "ni": 96,
      "sh": 96,
      "pl": 96,
      "wa": 96,
      "fi": 94,
      "dj": 94,
      "cr": 92,
      "ta": 90,
      "no": 88,
      "to": 88,
      "an": 85,
      "ap": 84,
      "ha": 84,
      "sa": 83,
      "el": 81,
      "lo": 79,
      "vi": 79,
      "bl": 79,
      "bo": 79,
      "sy": 77,
      "su": 76,
      "al": 75,
      "we": 75,
      "oc": 74,
      "fr": 73,
      "au": 72,
      "he": 71,
      "mu": 70,
      "la": 69,
      "na": 69,
      "bu": 69,
      "ya": 66,
      "pu": 66,
      "ke": 65,
      "ga": 65,
      "fe": 65,
      "gl": 65,
      "pe": 64,
      "im": 62,
      "ju": 61,
      "fo": 61,
      "le": 61,
      "wi": 60,
      "br": 60,
      "ht": 60,
      "xf": 60,
      "ja": 59,
      "sn": 58,
      "ho": 57,
      "qu": 57,
      "ka": 57,
      "en": 57,
      "ex": 57,
      "du": 56,
      "be": 54,
      "th": 54,
      "js": 53,
      "ip": 52,
      "ad": 51,
      "os": 51,
      "ss": 51,
      "nu": 50,
      "gp": 50,
      "aw": 50,
      "ci": 50,
      "hy": 49,
      "ru": 49,
      "ku": 48,
      "gt": 48,
      "am": 48,
      "mp": 48,
      "gu": 48,
      "dr": 46,
      "sm": 45,
      "sw": 44,
      "ur": 44,
      "cp": 44,
      "pd": 44,
      "ps": 44,
      "rt": 44,
      "hi": 43,
      "at": 43,
      "ri": 43,
      "ce": 43,
      "xm": 43,
      "cu": 42,
      "sd": 42,
      "sq": 42,
      "va": 41,
      "ph": 41,
      "ve": 41,
      "sl": 41,
      "tw": 40,
      "my": 40,
      "ll": 40,
      "lu": 39,
      "zo": 39,
      "fu": 38,
      "zs": 38,
      "qt": 37,
      "pc": 37,
      "tu": 37,
      "ac": 36,
      "zi": 35,
      "ty": 35,
      "ms": 35,
      "md": 35,
      "sk": 34,
      "em": 34,
      "et": 34,
      "pg": 34,
      "ev": 34,
      "vo": 33,
      "lx": 33,
      "wo": 33,
      "ki": 33,
      "pp": 33,
      "on": 32,
      "xs": 32,
      "io": 31,
      "ec": 31,
      "mk": 31,
      "or": 30,
      "is": 30,
      "ko": 30,
      "ze": 30,
      "cs": 30,
      "dn": 29,
      "ed": 29,
      "gs": 29,
      "ic": 28,
      "cm": 28,
      "ep": 27,
      "ji": 27,
      "xc": 27,
      "fs": 27,
      "je": 26,
      "rs": 26,
      "wh": 26,
      "ea": 26,
      "av": 26,
      "us": 26,
      "fc": 25,
      "jo": 25,
      "rp": 25,
      "ir": 24,
      "hu": 24,
      "ag": 24,
      "xd": 24,
      "gh": 24,
      "kr": 23,
      "ib": 23,
      "sr": 23,
      "db": 23,
      "tt": 22,
      "sv": 22,
      "up": 22,
      "gm": 22,
      "ml": 22,
      "it": 22,
      "xa": 22,
      "mm": 22,
      "om": 21,
      "cc": 21,
      "bs": 21,
      "nv": 21,
      "kd": 20,
      "ob": 20,
      "bt": 20,
      "ds": 20,
      "cd": 20,
      "gd": 19,
      "id": 19,
      "yo": 19,
      "es": 19,
      "nc": 19,
      "hd": 19,
      "cf": 19,
      "ab": 19,
      "tc": 19,
      "mf": 19,
      "dd": 18,
      "ls": 18,
      "ae": 18,
      "ff": 18,
      "od": 18,
      "yu": 18,
      "wr": 17,
      "xo": 17,
      "mc": 17,
      "dc": 17,
      "vc": 17,
      "bc": 17,
      "wl": 17,
      "ld": 17,
      "xp": 17,
      "tl": 16,
      "er": 16,
      "nt": 16,
      "nb": 16,
      "cy": 16,
      "fp": 16,
      "dm": 16,
      "tm": 16,
      "ws": 16,
      "rd": 16,
      "mb": 16,
      "vu": 16,
      "gc": 16,
      "ns": 16,
      "mt": 16,
      "ng": 15,
      "nd": 15,
      "pt": 15,
      "cb": 15,
      "xb": 15,
      "qi": 15,
      "xl": 15,
      "dv": 15,
      "xt": 15,
      "qm": 15,
      "za": 14,
      "xi": 14,
      "vm": 14,
      "dy": 14,
      "bp": 14,
      "pk": 14,
      "af": 14,
      "mr": 13,
      "oa": 13,
      "jp": 13,
      "cv": 13,
      "tx": 13,
      "qc": 13,
      "tp": 13,
      "lt": 13,
      "pn": 13,
      "of": 12,
      "ov": 12,
      "uc": 12,
      "xk": 12,
      "km": 12,
      "lz": 12,
      "qs": 12,
      "gf": 12,
      "pm": 12,
      "jd": 12,
      "dp": 12,
      "kc": 12,
      "cg": 12,
      "ud": 12,
      "ut": 12,
      "ts": 12,
      "ot": 12,
      "lm": 11,
      "ef": 11,
      "rf": 11,
      "hp": 11,
      "if": 11,
      "rm": 11,
      "pw": 11,
      "dl": 11,
      "rn": 11,
      "yt": 11,
      "uu": 11,
      "dt": 11,
      "xe": 11,
      "wm": 11,
      "mn": 10,
      "ia": 10,
      "s6": 10,
      "ye": 10,
      "np": 10,
      "s3": 10,
      "qr": 10,
      "sf": 10,
      "kl": 10,
      "lc": 10,
      "lw": 10,
      "fd": 10,
      "ct": 10,
      "nf": 9,
      "eb": 9,
      "ul": 9,
      "ez": 9,
      "um": 9,
      "dw": 9,
      "ow": 9,
      "gv": 9,
      "zf": 9,
      "fn": 9,
      "hc": 9,
      "ei": 9,
      "lr": 9,
      "ua": 9,
      "ig": 9,
      "hs": 9,
      "uh": 9,
      "wg": 9,
      "fb": 9,
      "lv": 8,
      "qo": 8,
      "cn": 8,
      "ol": 8,
      "vd": 8,
      "rl": 8,
      "xx": 8,
      "cx": 8,
      "pb": 8,
      "aa": 8,
      "nw": 8,
      "ft": 8,
      "ly": 8,
      "tf": 8,
      "vp": 8,
      "eg": 8,
      "hl": 8,
      "og": 8,
      "ky": 8,
      "ks": 8,
      "xr": 8,
      "jb": 7,
      "ub": 7,
      "rh": 7,
      "ui": 7,
      "uf": 7,
      "wc": 7,
      "x1": 7,
      "gb": 7,
      "x2": 7,
      "i3": 7,
      "jw": 7,
      "vs": 7,
      "wt": 7,
      "kb": 7,
      "jm": 7,
      "xv": 7,
      "rc": 7,
      "ou": 7,
      "nr": 7,
      "wx": 7,
      "jf": 7,
      "ug": 7,
      "td": 7,
      "pf": 7,
      "kp": 7,
      "qp": 7,
      "wp": 7,
      "df": 7,
      "dh": 7,
      "ln": 7,
      "fw": 7,
      "uv": 6,
      "zu": 6,
      "ox": 6,
      "kh": 6,
      "uw": 6,
      "nm": 6,
      "qd": 6,
      "fx": 6,
      "kn": 6,
      "pv": 6,
      "mw": 6,
      "vt": 6,
      "ue": 6,
      "bz": 6,
      "nx": 6,
      "cj": 6,
      "xn": 6,
      "bw": 6,
      "cw": 6,
      "fm": 6,
      "vg": 6,
      "kw": 6,
      "sb": 6,
      "qg": 6,
      "ik": 6,
      "oh": 6,
      "gx": 6,
      "bf": 6,
      "zn": 5,
      "eq": 5,
      "eu": 5,
      "yd": 5,
      "tz": 5,
      "bg": 5,
      "tb": 5,
      "rx": 5,
      "px": 5,
      "rk": 5,
      "hw": 5,
      "tv": 5,
      "vn": 5,
      "i2": 5,
      "mx": 5,
      "oo": 5,
      "zc": 5,
      "nl": 5,
      "il": 5,
      "hm": 5,
      "ck": 5,
      "aq": 5,
      "bm": 5,
      "zd": 5,
      "vl": 5,
      "zl": 5,
      "tn": 5,
      "p4": 5,
      "nn": 5,
      "oi": 5,
      "qb": 5,
      "kt": 5,
      "zp": 5,
      "lb": 5,
      "zk": 5,
      "eo": 5,
      "gw": 5,
      "vk": 5,
      "wd": 5,
      "ny": 5,
      "lk": 4,
      "iw": 4,
      "xh": 4,
      "bb": 4,
      "ax": 4,
      "jq": 4,
      "zm": 4,
      "s2": 4,
      "qn": 4,
      "hg": 4,
      "by": 4,
      "zx": 4,
      "tg": 4,
      "aj": 4,
      "rb": 4,
      "h2": 4,
      "sg": 4,
      "rr": 4,
      "gg": 4,
      "4.": 4,
      "qj": 4,
      "ok": 4,
      "ek": 4,
      "ak": 4,
      "vb": 4,
      "dg": 4,
      "ry": 4,
      "sx": 4,
      "kv": 4,
      "xw": 4,
      "bd": 4,
      "fz": 4,
      "k3": 3,
      "ih": 3,
      "e1": 3,
      "gy": 3,
      "v2": 3,
      "w3": 3,
      "hk": 3,
      "zb": 3,
      "bk": 3,
      "gq": 3,
      "wk": 3,
      "ij": 3,
      "tk": 3,
      "h5": 3,
      "jx": 3,
      "rq": 3,
      "z3": 3,
      "uk": 3,
      "yq": 3,
      "dk": 3,
      "ew": 3,
      "nk": 3,
      "jc": 3,
      "xg": 3,
      "pq": 3,
      "bj": 3,
      "vr": 3,
      "xy": 3,
      "gj": 3,
      "lp": 3,
      "cz": 3,
      "zz": 3,
      "qx": 3,
      "m4": 3,
      "kf": 3,
      "lh": 3,
      "dx": 3,
      "jr": 3,
      "iv": 3,
      "dz": 3,
      "rg": 3,
      "7": 3,
      "f2": 3,
      "d-": 3,
      "ej": 3,
      "qa": 3,
      "mj": 3,
      "ii": 3,
      "ql": 3,
      "3.": 3,
      "qe": 3,
      "zg": 3,
      "qv": 3,
      "ah": 3,
      "xz": 3,
      "m1": 3,
      "lf": 3,
      "mh": 3,
      "x4": 3,
      "wv": 3,
      "1.": 3,
      "hf": 3,
      "uq": 3,
      "xj": 3,
      "gz": 3,
      "ay": 2,
      "d2": 2,
      "wq": 2,
      "i-": 2,
      "c2": 2,
      "jt": 2,
      "h3": 2,
      "r2": 2,
      "hj": 2,
      "vx": 2,
      "rj": 2,
      "hv": 2,
      "m2": 2,
      "rz": 2,
      "wf": 2,
      "wu": 2,
      "oy": 2,
      "ao": 2,
      "p1": 2,
      "tq": 2,
      "b2": 2,
      "oe": 2,
      "zw": 2,
      "m3": 2,
      "sj": 2,
      "fq": 2,
      "mv": 2,
      "g2": 2,
      "c-": 2,
      "sz": 2,
      "v8": 2,
      "t1": 2,
      "z8": 2,
      "vy": 2,
      "c3": 2,
      "8": 2,
      "10": 2,
      "6": 2,
      "11": 2,
      "9": 2,
      "4t": 2,
      "b4": 2,
      "xq": 2,
      "jh": 2,
      "k2": 2,
      "jl": 2,
      "kg": 2,
      "vf": 2,
      "p2": 2,
      "k4": 2,
      "1p": 2,
      "gk": 2,
      "nh": 2,
      "wb": 2,
      "qf": 2,
      "bv": 2,
      "mg": 2,
      "ux": 2,
      "cq": 2,
      "f3": 2,
      "jn": 2,
      "hq": 2,
      "ym": 2,
      "zr": 2,
      "a2": 2,
      "r1": 2,
      "a1": 2,
      "r8": 2,
      "x8": 2,
      "v4": 2,
      "ix": 2,
      "lg": 2,
      "t-": 2,
      "n2": 2,
      "20": 2,
      "wy": 2,
      "hh": 2,
      "e2": 2,
      "l2": 2,
      "mq": 2,
      "s-": 2,
      "rw": 2,
      "nz": 2,
      "b6": 1,
      "f5": 1,
      "u0": 1,
      "32": 1,
      "ww": 1,
      "f9": 1,
      "ey": 1,
      "uj": 1,
      "h1": 1,
      "u-": 1,
      "zh": 1,
      "fj": 1,
      "j2": 1,
      "yf": 1,
      "vq": 1,
      "k5": 1,
      "3t": 1,
      "uo": 1,
      "bx": 1,
      "x5": 1,
      "hx": 1,
      "bh": 1,
      "zy": 1,
      "qh": 1,
      "iq": 1,
      "l-": 1,
      "dq": 1,
      "qw": 1,
      "5": 1,
      "kq": 1,
      "a5": 1,
      "s9": 1,
      "2.": 1,
      "j": 1,
      "jy": 1,
      "12": 1,
      "8.": 1,
      "oq": 1,
      "yc": 1,
      "zq": 1,
      "yj": 1,
      "k6": 1,
      "g-": 1,
      "x3": 1,
      "a4": 1,
      "w_": 1,
      "f1": 1,
      "p3": 1,
      "hr": 1,
      "kj": 1,
      "jg": 1,
      "j4": 1,
      "9m": 1,
      "q4": 1,
      "n3": 1,
      "yi": 1,
      "vh": 1,
      "r": 1,
      "2": 1,
      "bq": 1,
      "o": 1,
      "x-": 1,
      "2b": 1,
      "vw": 1,
      "3p": 1,
      "n8": 1,
      "pj": 1,
      "r5": 1,
      "k9": 1,
      "k0": 1,
      "fv": 1,
      "g1": 1,
      "91": 1,
      "i7": 1,
      "i8": 1,
      "g4": 1,
      "m-": 1,
      "zj": 1,
      "nj": 1,
      "90": 1,
      "fh": 1,
      "qq": 1,
      "0a": 1,
      "0v": 1,
      "7k": 1,
      "t4": 1,
      "1o": 1,
      "u3": 1,
      "9p": 1,
      "xu": 1,
      "kz": 1,
      "nq": 1,
      "t": 1,
      "0x": 1,
      "q-": 1,
      "z-": 1,
      "3l": 1,
      "3m": 1,
      "zt": 1,
      "yl": 1,
      "g9": 1,
      "h": 1,
      "fg": 1,
      "p7": 1,
      "7z": 1,
      "6t": 1,
      "q": 1,
      "s4": 1,
      "yg": 1,
      "s5": 1,
      "yr": 1,
      "2f": 1,
      "p0": 1,
      "b3": 1,
      "hb": 1,
      "38": 1,
      "p9": 1,
      "u9": 1
    }
  • About 21.7% of nixpkgs files are inaccessible by clicking on directories in GitHub (@alyssais), though a lot of these (5456) are in Python, which currently isn't in scope for the approach of the RFC.
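The counting script prints prefixes in the order it encounters them, so the sorted listing needs a separate step. The following is a minimal sketch of how that could be done, along with a few summary figures for comparing sharding schemes. The helper names `sort_counts` and `summarize` are made up for illustration, and breaking ties alphabetically is an arbitrary choice:

```python
import statistics


def sort_counts(counts: dict[str, int]) -> list[tuple[str, int]]:
    """Sort (prefix, count) pairs by descending count, ties by prefix."""
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


def summarize(counts: dict[str, int]) -> dict[str, float]:
    """Summarize shard sizes: shard count, total, largest and median."""
    sizes = list(counts.values())
    return {
        "shards": len(sizes),
        "packages": sum(sizes),
        "largest": max(sizes),
        "median": statistics.median(sizes),
    }


if __name__ == "__main__":
    # Three entries taken from the result above, as sample input
    sample = {"li": 955, "py": 1010, "j": 1}
    for prefix, n in sort_counts(sample):
        print(f"{prefix}\t{n}")
    print(summarize(sample))
```

In practice `counts` would be the full JSON object printed by the script, e.g. loaded with `json.load`.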

How would sharding be done?

  • 1-prefix is very simple and already pretty good, though a bunch of letters have a lot of files in them.
  • 2-prefix sharding decreases shard sizes a lot, though there's still li (because of lib* packages) with a bit over 1000 packages. This also is more complicated for handling single-letter packages.
  • Dynamic sharding (any prefix is allowed, as long as it doesn't conflict with others) would certainly allow GitHub navigation and allow balancing directory sizes much better. If a prefix has too many packages, it can be split up by adding another letter to the prefix.
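One possible reading of dynamic sharding can be sketched as follows. This is only an illustration under stated assumptions, not the scheme discussed on Matrix: it starts from 1-letter prefixes and splits any shard with more than `max_size` packages by adding another letter. The name `dynamic_shards` and the threshold parameter are hypothetical, and package names are assumed to be unique:

```python
from collections import defaultdict


def dynamic_shards(names, max_size, depth=1):
    """Group names by prefix, splitting oversized groups by one more letter.

    Returns a dict mapping shard prefix -> sorted list of package names.
    """
    groups = defaultdict(list)
    for name in names:
        # Names shorter than `depth` form a shard named after themselves
        groups[name[:depth]].append(name)

    shards = {}
    for prefix, members in groups.items():
        # Splitting only helps if some member is longer than the prefix
        can_split = any(len(n) > depth for n in members)
        if len(members) > max_size and can_split:
            shards.update(dynamic_shards(members, max_size, depth + 1))
        else:
            shards[prefix] = sorted(members)
    return shards


if __name__ == "__main__":
    names = ["libfoo", "libbar", "linux", "hello", "htop", "R"]
    for prefix, members in sorted(dynamic_shards(names, max_size=2).items()):
        print(prefix, members)
```

With this approach a package is found by looking for the longest existing shard prefix that matches its name. A manually curated prefix list, as the bullet above allows, would be more flexible than this automatic split.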

Meta:

  • Polling is biased and doesn't reach all the relevant people, but it may be a good option for finding out the weight of individual arguments
  • Such polls should ask for estimations or objective facts and should be easily answerable. E.g. "Are you using GitHub navigation?" or "How often do you use the git CLI with nixpkgs?". Not "Do you prefer sharding?"
@roberth
Contributor

roberth commented Sep 3, 2022

Meta:

  • Before consulting the wider community, we should be able to inform them (and ourselves).
    Claims about performance should be backed up by experiments. Imagine making such a far-reaching decision based on wrong assumptions.

@SuperSandro2000

More arguments for sharding:

  • CLI tools that list files in a directory get slower with more files
  • Glob expansion like rm pkgs/auto/* would stop working pretty quickly, because there is a limit to the number of arguments you can supply
  • Improves git status performance (good with less recursive structures)

git has a built-in setting, core.fsmonitor, which already improves speed in the current layout by a lot.
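The glob expansion argument could be checked roughly with a sketch like the one below. It is only an estimate: the real limit also covers the environment and per-argument overhead, which is why the `reserved` margin (an arbitrary value chosen here) is subtracted. The function name `expansion_fits` is hypothetical, and `os.sysconf("SC_ARG_MAX")` is only available on POSIX systems:

```python
import os


def expansion_fits(paths, arg_max=None, reserved=4096):
    """Roughly check whether `paths` would fit on a single command line.

    Assumption: each argument costs its encoded length plus a NUL byte,
    and `reserved` bytes are set aside for the environment and overhead.
    """
    if arg_max is None:
        # System limit on the combined size of arguments and environment
        arg_max = os.sysconf("SC_ARG_MAX")
    needed = sum(len(os.fsencode(p)) + 1 for p in paths)
    return needed + reserved <= arg_max


if __name__ == "__main__":
    # Stand-in for what the shell would expand pkgs/auto/* to
    paths = [f"pkgs/auto/pkg{i:05d}" for i in range(20000)]
    print(expansion_fits(paths))
```

Running it against the actual directory listing on the systems contributors use would show whether the limit is a practical concern.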

@infinisil mentioned this issue Nov 10, 2022
@infinisil changed the title from "Matrix discussion points" to "To do sharding or not" Nov 10, 2022
@infinisil
Member Author

We had a poll to probe how many people relied on the GitHub code navigation feature to find packages. Turns out a lot of people do, which is a strong argument in favor of sharding.

@infinisil
Member Author

Depends on #18

@infinisil
Member Author

The general consensus is that we should do sharding. PR #20 changes the draft accordingly. We discussed this a lot in Meeting #18.

@infinisil
Member Author

Depends on #17
