Commits on Dec 5, 2011
  1. sysctl: add register_net_sysctl_table_net_cookie

    Signed-off-by: Lucian Adrian Grijincu <>
    committed May 5, 2011
  2. sysctl: add cookie to __register_sysctl_paths

    Extend the sysctl registration APIs to receive a cookie + cookie handler.
    Signed-off-by: Lucian Adrian Grijincu <>
    committed May 5, 2011
  3. sysctl: add ctl_cookie and ctl_cookie_handler

    The cookie represents a piece of data, paired with a handler for it,
    that is associated with the header.
    Nothing uses this yet. Patches that pass a cookie + handler will be
    submitted later for the various kernel subsystems that can benefit
    from this.
    The handler is meant to create a new ctl_table based on a template
    ctl_table and the cookie. This is useful in lots of places in the
    kernel where we kmemdup() an array of template ctl_tables and change
    fields in a predictable way.
    Signed-off-by: Lucian Adrian Grijincu <>
    committed May 5, 2011
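    A minimal userspace sketch of the template + cookie idea, assuming the
    handler copies a template array and points each entry at a field of a
    per-instance object passed as the cookie. The names tmpl_entry,
    dev_conf, conf_template and instantiate_table are hypothetical and are
    not the API added by this series.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, trimmed-down stand-in for struct ctl_table. */
struct tmpl_entry {
    const char *procname; /* NULL terminates the array */
    size_t data_off;      /* offset of the field inside the per-instance object */
    void *data;           /* filled in by the handler, NULL in the template */
};

/* Hypothetical per-device settings that the entries point into. */
struct dev_conf {
    int forwarding;
    int mtu;
};

/* Shared template: only offsets, no instance pointers. */
static const struct tmpl_entry conf_template[] = {
    { "forwarding", offsetof(struct dev_conf, forwarding), NULL },
    { "mtu",        offsetof(struct dev_conf, mtu),        NULL },
    { NULL, 0, NULL },
};

/*
 * Sketch of a cookie handler: duplicate the template (n entries, including
 * the terminator) and resolve every .data pointer against the cookie.
 * The caller owns the returned array and releases it with free().
 */
static struct tmpl_entry *instantiate_table(const struct tmpl_entry *tmpl,
                                            size_t n, void *cookie)
{
    assert(tmpl != NULL && cookie != NULL);

    struct tmpl_entry *t = malloc(n * sizeof(*t));
    if (!t)
        return NULL;
    memcpy(t, tmpl, n * sizeof(*t));
    for (size_t i = 0; i < n && t[i].procname; i++)
        t[i].data = (char *)cookie + t[i].data_off;
    return t;
}
```

    The predictable per-instance change (here: rebasing .data) lives in one
    place instead of being repeated after every kmemdup()-style copy.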
  4. sysctl: add register_sysctl_dir: register an empty sysctl directory

    There are a few places in the tree that register an empty ctl_table
    array just to make sure the directories are created.
    The empty ctl_table takes up memory and is checked at lookup/readdir.
    Registering empty dirs with this function only takes care of creating
    the directories, without adding the empty ctl_table.
    Signed-off-by: Lucian Adrian Grijincu <>
    committed May 5, 2011
  5. sysctl: union-ize some ctl_table_header fields

    Both types of directory ctl_table_headers are now stored as rbtree
    nodes, so the list entry is valid only for ctl_table_headers that wrap
    ctl_table arrays.
    This saves 2*sizeof(void*).
    Signed-off-by: Lucian Adrian Grijincu <>
    committed May 22, 2011
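    The size arithmetic can be checked with a small sketch. The two node
    types below only mimic the shape of the kernel's list_head (two
    pointers) and rb_node (one word plus two pointers); the struct names
    and the surrounding header layout are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Shape of a doubly linked list entry: two pointers. */
struct list_head_sketch {
    struct list_head_sketch *next, *prev;
};

/* Shape of a red-black tree node: one word plus two pointers. */
struct rb_node_sketch {
    uintptr_t parent_color;
    struct rb_node_sketch *right, *left;
};

/* Both members stored side by side. */
struct hdr_separate {
    struct list_head_sketch entry;
    struct rb_node_sketch node;
};

/* Only one of the two is ever valid for a given header, so they can overlap. */
struct hdr_unioned {
    union {
        struct list_head_sketch entry; /* headers that wrap ctl_table arrays */
        struct rb_node_sketch node;    /* directory headers */
    } u;
};

/* Bytes saved per header by overlapping the two members. */
static size_t bytes_saved(void)
{
    assert(sizeof(struct hdr_separate) >= sizeof(struct hdr_unioned));
    return sizeof(struct hdr_separate) - sizeof(struct hdr_unioned);
}
```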
  6. sysctl: replace netns corresp list with rbtree

    Similar to the last patch that replaced the subdirectory list with a rbtree.
    Signed-off-by: Lucian Adrian Grijincu <>
    committed May 22, 2011
  7. sysctl: replace subdirectory list with rbtree

    Before this we kept all subdirectories in a linked list. Some
    directories can have very many subdirs and the list can grow quite
    large: some workloads require 10^4..10^5 network interfaces and each
    of them adds a new directory in /proc/sys/ipv{4|6}/conf/DEVICE/.
    With a linked list of N elements:
    - complexity of insert  = O(N^2)
    - complexity of lookup  = O(N)
    - complexity of readdir = O(N)
    This patch replaces the list with a rbtree:
    - complexity of insert  = O(N*log(N))
    - complexity of lookup  = O(log(N))
    - complexity of readdir = O(N)
    Signed-off-by: Lucian Adrian Grijincu <>
    committed May 22, 2011
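    As a userspace illustration of the O(log(N)) lookup and duplicate check,
    the sketch below uses POSIX tsearch()/tfind() as a stand-in for the
    kernel rbtree. glibc implements these with a red-black tree, but POSIX
    does not require balancing. subdir_insert and subdir_exists are
    hypothetical helper names.

```c
#include <assert.h>
#include <search.h>
#include <stddef.h>
#include <string.h>

/* Orders directory names; tsearch()/tfind() pass the stored key pointers. */
static int cmp_name(const void *a, const void *b)
{
    return strcmp((const char *)a, (const char *)b);
}

/* Returns 1 if a subdirectory called `name` is already in the tree. */
static int subdir_exists(void *const *root, const char *name)
{
    return tfind(name, root, cmp_name) != NULL;
}

/*
 * Inserts `name` unless it is already present.
 * Returns 1 if inserted, 0 if it was a duplicate, -1 if allocation failed.
 * The tree stores the pointer, so `name` must outlive the tree.
 */
static int subdir_insert(void **root, const char *name)
{
    assert(root != NULL && name != NULL);

    if (subdir_exists(root, name))
        return 0;
    return tsearch(name, root, cmp_name) ? 1 : -1;
}
```

    With a list, the duplicate check alone walks every sibling on each
    insert, which is where the O(N^2) total for N inserts comes from.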
  8. sysctl: add ctl_type member

    We have three types of headers and it's hard to keep track of which is
    which and things will get a bit worse later when dirs will be stored
    in a rbtree.
    This doesn't take up extra space because we had a 3B hole to fill (now
    reduced to 2B).
    There was a subtle check in unregister_sysctl_table(_impl):
     - dirs and tables are members of their parent's corresponding list
     - netns-specific dirs are members of a netns-specific list of headers
    When deleting the header from the list we needed to take the list owner's lock:
     - the parent's lock for dirs/tables
     - netns-specific list protector lock for netns-specific dirs
    The decision was (ctl_dirname==NULL) => type = netns-specific dir.
    That only made sense because ctl_dirname and ctl_table_arg are unioned
    and ctl_table_headers that wrap file arrays never have ctl_table_arg==NULL.
    But once the ctl_dirname/ctl_table_arg fields no longer share the same
    memory the check will not be valid.
    Using ctl_type the check is no longer dependent on that union.
    Signed-off-by: Lucian Adrian Grijincu <>
    committed May 21, 2011
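    A minimal sketch of the decision described above, assuming an explicit
    type tag replaces the (ctl_dirname==NULL) test. The enum values, struct
    hdr and lock_for_unlink are hypothetical names, not the ones used in
    the patch.

```c
#include <assert.h>
#include <stddef.h>

/* The three kinds of header the message refers to. */
enum hdr_type {
    HDR_TABLE,         /* wraps an array of files */
    HDR_DIR,           /* regular directory */
    HDR_NETNS_CORRESP  /* netns-specific directory */
};

/* Which lock protects the list the header is linked into. */
enum lock_owner {
    LOCK_PARENT,      /* the parent's lock */
    LOCK_NETNS_LIST   /* the netns-specific list protector lock */
};

struct hdr {
    enum hdr_type type;
    const char *dirname; /* may be NULL for any type once the union is gone */
};

/* Pick the lock from the explicit tag, never from which field is NULL. */
static enum lock_owner lock_for_unlink(const struct hdr *h)
{
    assert(h != NULL);

    switch (h->type) {
    case HDR_NETNS_CORRESP:
        return LOCK_NETNS_LIST;
    case HDR_DIR:
    case HDR_TABLE:
        return LOCK_PARENT;
    }
    return LOCK_PARENT;
}
```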
  9. sysctl: reorder members of ctl_table_header (cleanup)

    In the next commits entries that are specific to directory
    ctl_table_headers and to file-wrapping ctl_table_headers will be put
    in unions. Things will get ugly if we keep members in
    This makes the structure grow: the .rcu member is kept out of the
    union until I understand which fields may still be accessed in
    read-side rcu after the header is scheduled for deletion with call_rcu.
    Signed-off-by: Lucian Adrian Grijincu <>
    committed May 21, 2011
  10. sysctl: always perform sysctl checks

    I'm not sure if we should always perform checks without giving
    EMBEDDED folks a chance to skip these checks. If the system is
    verified with sysctl checks, one can disable them to skip some
    performance penalties.
    Two of these checks will be more costly than the others:
    - sysctl_check_netns_correspondents
    - sysctl_check_duplicates
    The first check is run for all regular (not netns dependent) sysctl
    directories that are registered. For each element of a path
    (e.g. kernel/sched_domain/cpu0/domain0/) we scan the current netns's
    list of netns correspondents and check for matches.
    This is O(len(current-netns-corresp-list)) * O(depth(registered path))
    The second check compares every item in a registered table with all
    siblings (other tables or subdirectories that are children of the same
    parent directory).
    This is O(len(table-being-registered)) * O(sum(len(already-registered-tables))+nr-subdirs).
    Signed-off-by: Lucian Adrian Grijincu <>
    committed May 8, 2011
  11. sysctl: warn if registration/unregistration order is not respected

    This patch sends a warning for each sysctl unregistration that cannot
    delete all the directories that it created.
    For example:
    - register   /existingdir/newdir/file-a
    - register   /existingdir/newdir/dir3/file-b
    - unregister /existingdir/newdir/file-a
    - unregister /existingdir/newdir/dir3/file-b
    Here the order is violated because the first unregister operation
    cannot delete all the directories it has created (namely 'newdir')
    because they are used by another registered path.
    If you get this warning, the rule violation can be fixed in (at least) two ways:
    - enforce order of unregistration:
      - register   /existingdir/newdir/file-a
      - register   /existingdir/newdir/dir3/file-b
      - unregister /existingdir/newdir/dir3/file-b
      - unregister /existingdir/newdir/file-a
    - have a third party register the common part:
      - register   /existingdir/newdir/
      - register   /existingdir/newdir/file-a
      - register   /existingdir/newdir/dir3/file-b
      - unregister /existingdir/newdir/file-a
      - unregister /existingdir/newdir/dir3/file-b
      - unregister /existingdir/newdir/
    The current implementation works well regardless of this order being
    respected. In the future, other sysctl implementations may only work
    if this rule is respected.
    Signed-off-by: Lucian Adrian Grijincu <>
    committed May 4, 2011
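    The rule can be modelled with per-directory reference counts. This is a
    userspace sketch under the assumption that every registration remembers
    which directories it created; struct registration, sketch_register and
    sketch_unregister are hypothetical names, and the function returns a
    warning count where the patch would print a warning.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_DIRS 16

/* One slot per live directory; refs == 0 marks a free slot. */
struct dir_ref {
    const char *path;
    int refs;
};

static struct dir_ref dirs[MAX_DIRS];

struct registration {
    const char *const *path; /* directory prefixes, outermost first */
    int depth;               /* number of prefixes in path[] */
    int first_created;       /* index of the first directory created here */
};

static struct dir_ref *find_dir(const char *path)
{
    for (int i = 0; i < MAX_DIRS; i++)
        if (dirs[i].refs > 0 && strcmp(dirs[i].path, path) == 0)
            return &dirs[i];
    return NULL;
}

static struct dir_ref *new_dir(const char *path)
{
    for (int i = 0; i < MAX_DIRS; i++) {
        if (dirs[i].refs == 0) {
            dirs[i].path = path;
            dirs[i].refs = 1;
            return &dirs[i];
        }
    }
    return NULL;
}

/* Take a reference on existing directories, create the missing ones. */
static void sketch_register(struct registration *r)
{
    r->first_created = r->depth; /* "created nothing" until proven otherwise */
    for (int i = 0; i < r->depth; i++) {
        struct dir_ref *d = find_dir(r->path[i]);
        if (d) {
            d->refs++;
            continue;
        }
        d = new_dir(r->path[i]);
        assert(d != NULL); /* the fixed-size table is a limit of this sketch */
        if (i < r->first_created)
            r->first_created = i;
    }
}

/* Drop the references; count created directories that are still in use. */
static int sketch_unregister(const struct registration *r)
{
    int warnings = 0;

    for (int i = r->depth - 1; i >= 0; i--) {
        struct dir_ref *d = find_dir(r->path[i]);
        assert(d != NULL);
        d->refs--;
        if (i >= r->first_created && d->refs > 0)
            warnings++;
    }
    return warnings;
}
```

    Unregistering file-a's path while dir3's registration still holds
    'newdir' yields one warning; reversing the unregistration order yields
    none.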
  12. sysctl: check netns-specific registration order respected

    Signed-off-by: Lucian Adrian Grijincu <>
    committed May 5, 2011
  13. sysctl: alloc ctl_table_header with kmem_cache

    Because now ctl_table_header objects are allocated with a fixed size
    buffer (sizeof(struct ctl_table_header)) we can do the allocations
    with kmem_cache.
    Also, by making sure that the objects that are returned to the cache
    are in a sane state we don't waste time reinitializing every field
    after kmem_cache_alloc. We only initialize fields that were not left
    with a sane value before returning an object to the cache.
    Signed-off-by: Lucian Adrian Grijincu <>
    committed Apr 30, 2011
  14. sysctl: add duplicate entry and sanity ctl_table checks

    Signed-off-by: Lucian Adrian Grijincu <>
    committed May 4, 2011
  15. sysctl: faster tree-based sysctl implementation

    The old implementation used inefficient algorithms for lookup, readdir
    and registration.
    This patch introduces an improved algorithm:
    - lower memory consumption,
    - better time complexity for lookup/readdir/registration.
    Locking is a bit heavier in this algorithm (in this patch: reader
    locks for lookup/readdir, writer locks for register/unregister; in a
    later patch in this series: RCU + spin-lock). I'll address this
    locking issue later in this commit.
    I will shortly describe the previous algorithm, the new one and brag
    at the end with an endless list of improvements and new limitations.
    = Old algorithm =
    == Description ==
    We created a ctl_table_header for each registered sysctl table. The
    header's role is to maintain sysctl internal data and reference counts,
    and to serve as a token to unregister the table.
    All headers were put in a list in the order of registration without
    regard to the position of the tables in the sysctl tree. Headers were
    also 'attached' one to another to (somewhat) speed up lookup/readdir.
    Attaching a header meant looking at each other already registered
    header and comparing the paths to the tables. A newly registered
    header would be attached to the first header with which it would share
    most of its path.
    e.g. paths registered: /, /a/b/c, /a/b/c/d, /a/x, /a/x/y, /a/z
      + /a/b/c
         |   + /a/b/c/d
         + /a/x
         | /a/x/y
         + /a/z
    == Time complexity ==
    - register N tables would take O(N^2) steps (see above)
    - lookup: if the item searched for is not found in the current header,
      iterate the list of headers until you find another header that's
      attached to the current position in the header's table. Lookups for
      elements that are in a header registered under the current position
      or nonexistent elements would take O(N) steps each.
    - readdir: after searching the current header's table in the current
      position, always do an O(N) search for a header attached to the
      current table position.
    == Memory ==
    Each header was allocated some data and a variable-length path.
    O(1) with kzalloc/kfree.
    = New algorithm =
    == Description ==
    Reuses the 'ctl_table_header' concept but with two distinct meanings:
    - as a wrapper of a table registered by the user
    - as a directory entry.
    Registering the paths from the above example gives this tree:
     paths: /, /a/b/c, /a/b/c/d, /a/x, /a/x/y, /a/z
         /: .subdirs = a
           a: .subdirs = b x z
             b: subdirs = c
                c: subdirs = d
             x: subdirs = y
    Each directory gets a header. Each header has a parent (except root)
    and two lists:
     - ctl_subdirs: list of sub-directories - other headers
     - ctl_tables: list of headers that wrap a ctl_table array
    Because the directory structure is now maintained as ctl_table_header
    objects, we needed to remove the .child from ctl_tables (this explains
    the previous patches). A ctl_table array represents a list of files.
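    A userspace sketch of the directory tree described above, assuming
    singly linked sibling lists and omitting the ctl_tables list and all
    locking. struct dir_hdr, find_subdir and register_path are hypothetical
    names chosen for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* A header used as a directory entry. */
struct dir_hdr {
    const char *name;
    struct dir_hdr *parent;       /* NULL for the root */
    struct dir_hdr *first_subdir; /* head of the list of sub-directories */
    struct dir_hdr *next_sibling; /* next entry in the parent's list */
    int refs;                     /* registrations that pass through this dir */
};

/* Linear scan of one directory's sub-directory list. */
static struct dir_hdr *find_subdir(struct dir_hdr *dir, const char *name)
{
    for (struct dir_hdr *d = dir->first_subdir; d; d = d->next_sibling)
        if (strcmp(d->name, name) == 0)
            return d;
    return NULL;
}

/*
 * Walk path[0..depth): take a reference on directories that already exist
 * and create the missing ones. Returns the deepest directory, or NULL if
 * an allocation fails.
 */
static struct dir_hdr *register_path(struct dir_hdr *root,
                                     const char *const *path, int depth)
{
    assert(root != NULL && (path != NULL || depth == 0));

    struct dir_hdr *cur = root;
    for (int i = 0; i < depth; i++) {
        struct dir_hdr *d = find_subdir(cur, path[i]);
        if (d) {
            d->refs++;
        } else {
            d = calloc(1, sizeof(*d));
            if (!d)
                return NULL;
            d->name = path[i];
            d->parent = cur;
            d->refs = 1;
            d->next_sibling = cur->first_subdir;
            cur->first_subdir = d;
        }
        cur = d;
    }
    return cur;
}
```

    Registering the five non-root paths from the example leaves 'a' with
    the sub-directories b, x and z and a reference count of 5.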
    == Time complexity ==
    - registration of N headers. Registration means adding new directories
      at each level or incrementing an existing directory's refcount.
      - O(N * lnN) - if the paths to the headers are evenly distributed
      - O(N^2) - if most of the headers registered are children of the
        same parent directory (searching the list of subdirs takes O(N)).
        There are cases where this happens (e.g. registering sysctl
        entries for net devices under /proc/sys/net/ipv4|6/conf/device).
        A few later patches will add an optimisation, to fix locations
        that might trigger the O(N^2) issue.
    - lookup: O(len(subdirs) + sum(len(tarr) for each tarr in ctl_tables))
      - could be made better:
         - sort ctl_subdirs (for binary search)
         - replace ctl_subdirs with a hash-table (increase memory footprint)
         - sort ctl_table entries at registration time (for binary search).
        Could be done, but I'm too lazy to do it now.
    - readdir: O(len(subdirs) + sum(len(tarr) for each tarr in ctl_tables))
       - can't get any better than this :)
    == Memory complexity ==
    Although we create more ctl_table_header (one for each directory, one
    for each table, and because we deleted the .child from ctl_table there
    are more tables registered than before this patch) we remove the need
    to store a full path (from the top to the table) as was done in the old
    solution => a small O(N) memory gain with respect to the old algo.
    = Limitations =
    == ctl_table will lose .child => some code uglification ==
    Registering tables with multiple directories and files cannot be done
    in a single operation: there must be at least a table registered for
    each directory. This makes code that registers sysctls uglier.
    The first patches in this series made the conversion from paths
    encoded with .child to paths specified by ctl_path. Later patches will
    convert all users of .child to ctl_path and the conversion layer will
    be deleted.
    == Handling of netns specific paths is weirder ==
    The algorithm descriptions from above are simplified. In reality the
    code needs to handle directories and files that must be visible only
    in some net namespaces. E.g. the /proc/sys/net/ipv4/conf/DEVICENAME/
    directory and its files must be visible only in the netns of the
    'DEVICENAME' device.
    The old algorithm used secondary lists that indexed all netns specific
    headers (one such list per netns). The old-algorithm description is
    still valid, with the mention that besides searching the global list,
    the algorithm would also look into the current netns' list of
    headers. This scales perfectly with respect to the number of network
    The new algorithm does something similar, but a bit more complicated.
    We also use netns specific lists of directories/tables and store them
    in a special directory ctl_table_header (which I dubbed the
    "netns-correspondent" of another directory - I'm not very pleased with
    the name either).
    When registering a netns specific table, we will create a
    "netns-correspondent" to the last directory that is not netns specific
    in that path.
    E.g.: we're registering a netns specific table for 'lo':
          common path: /proc/sys/net/ipv4/
           netns path: /proc/sys/net/ipv4/conf/lo/
       We'll create an (unnamed) netns correspondent for 'ipv4' which will
       have 'conf' as its subdir.
    E.g.: We're registering a netns specific file in /proc/sys/net/core/somaxconn
          common path: /proc/sys/net/core/
           netns path: /proc/sys/net/core/
    We'll create an (unnamed) netns correspondent for 'core' with the
    table containing 'somaxconn' in ctl_tables.
    All netns correspondents of one netns are held in a single list, and
    each netns gets its own list. This keeps the algorithm complexity
    independent of the number of network namespaces (as was the old one).
    However, now only a smaller part of directories are members of this
    list, improving register/lookup/readdir time complexity.
    There is one ugly limitation that stems from this approach.
    E.g.: register these files in this order:
     - register common         /dir1/file-common1
     - register netns specific /dir1/dir2/file-netns
     - register common         /dir1/dir2/file-common2
      We'll have this tree:
       'dir1' { .subdirs = ['dir2'], .tables = ['file-common1'] }
         ^                    |
         |                    -> { .subdirs = [], .tables = ['file-common2'] }
         | (unnamed netns-corresp for dir1)
         -> { .subdir = ['dir2'] }
                            -> { .subdirs = [], .tables = ['file-netns'] }
    readdir: when we list the contents of 'dir1' we'll see it has two
             sub-directories named 'dir2' each with a file in it.
    lookup: lookup of /dir1/dir2/file-netns will not work because we find
            'dir2' as a subdir of 'dir1' and stick with it and never look
            into the netns correspondent of 'dir1'.
    This can be fixed in two ways:
    - A) by making sure to never register a netns specific directory and
      after that register that directory as a common one. From what I can
      tell there isn't such a problem in the kernel at the moment, but I
      did not study the source in detail.
    - B) by increasing the complexity of the code:
      - readdir: looking at both lists and comparing if we have already
                 listed a directory as common, so we don't list twice.
                 -> For imbalanced trees this can make readdir O(N^2) :(
      - register: the netns 'dir2' from the example above needs to be
                  connected to the common 'dir2' when 'dir2' is
                  registered. I'm not even going to think of how time
                  complexity/ugliness is going to explode here.
    A later patch will implement version A): checks to make sure the
    registration order is maintained (a non-netns specific directory will
    not be added after a netns specific directory with the same path was
    already added).
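    The lookup limitation can be reproduced with a small userspace model,
    assuming lookup scans the common directory first and consults the
    netns correspondent only when nothing is found. struct node, lookup_dir
    and has_file are hypothetical names, and fixed-size arrays stand in for
    the real lists.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_ITEMS 4

struct node {
    const char *name;                /* NULL for an unnamed correspondent */
    struct node *subdirs[MAX_ITEMS]; /* NULL-terminated */
    const char *files[MAX_ITEMS];    /* NULL-terminated */
    struct node *netns_corresp;      /* per-netns shadow of this dir, or NULL */
};

static struct node *find_subdir_in(struct node *dir, const char *name)
{
    for (int i = 0; i < MAX_ITEMS && dir->subdirs[i]; i++)
        if (strcmp(dir->subdirs[i]->name, name) == 0)
            return dir->subdirs[i];
    return NULL;
}

/* Common sub-directories win; the correspondent is only a fallback. */
static struct node *lookup_dir(struct node *dir, const char *name)
{
    assert(dir != NULL && name != NULL);

    struct node *d = find_subdir_in(dir, name);
    if (!d && dir->netns_corresp)
        d = find_subdir_in(dir->netns_corresp, name);
    return d;
}

static int has_file(const struct node *dir, const char *name)
{
    for (int i = 0; i < MAX_ITEMS && dir->files[i]; i++)
        if (strcmp(dir->files[i], name) == 0)
            return 1;
    return 0;
}
```

    With both a common and a netns-specific 'dir2' under 'dir1', lookup_dir
    returns the common one, so 'file-netns' is unreachable. Once the common
    'dir2' is gone, the fallback finds the netns-specific directory.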
    = Change summary =
    * include/linux/sysctl.h
      - removed _set and _root, replaced with _group
      - netns correspondent directories are held in each netns's
      - reused the header structure to represent directories which don't
        use ctl_table_arg, but store the directory name directly.
      - each directory header also gets two lists: subdirs and tables
    * fs/proc/proc_sysctl.c
      - a proc inode has ->sysctl_entry set only for files, not
        directories as these store the dirname directly
      - lookup:
         - take the dirs read-lock and iterate through subdirs and tables
         - if nothing is found, try the dir's netns-correspondent
      - scan: list every subdir and file that was not listed before
      - readdir: scan the current dir and its netns correspondent
    * kernel/sysctl.c
      - inlines the code of use_table/unuse_table as it is not used
        elsewhere (used to be called from __register, but aren't any more)
      - adds routines to get/set the netns-correspondent
      - adds routines to protect the subdirs/tables lists (rwsem)
      - __register_sysctl_paths:
        - preallocate ctl_table_header for every dir in 'path'
        - increase the ctl_header_refs of every existing directory
        - if the group needs a netns-correspondent it is created for the
          last existing directory that is part of the non-netns specific
        - all the non-existing directories are added as children of their
          parent's subdir lists.
       - unregister:
         - wait until no one uses the header
         - for normal directories and table-wrapper headers take the
           parent's write lock to be able to delete something from one of
           its lists (ctl_subdir or ctl_tables).
         - netns-correspondent headers must take the netns group list lock
           before deleting.
    Signed-off-by: Lucian Adrian Grijincu <>
    committed May 2, 2011
  16. sysctl: introduce ctl_table_group and ctl_table_group_ops

    ctl_table_group will replace in the future ctl_table_root and ctl_table_set.
    Signed-off-by: Lucian Adrian Grijincu <>
    committed May 2, 2011
  17. sysctl: move removal from list out of start_unregistering

    Later on we'll switch from a global list protected by the sysctl_lock
    spin lock to rwsem protected per-header lists.
    At that point we'll need to hold the parent header's rwsem to remove
    the header from the list, not the sysctl_lock spin lock.
    As start_unregistering is called under the sysctl_lock, we move the
    list removal out.
    Signed-off-by: Lucian Adrian Grijincu <>
    committed May 3, 2011
  18. sysctl: simplify ->permissions hook

    The @root parameter was not used at all.
    The @namespaces parameter was used to transmit current->nsproxy. We
    can access current->nsproxy directly in the ->permissions function, no
    need to send it.
    Signed-off-by: Lucian Adrian Grijincu <>
    committed May 2, 2011
  19. sysctl: rename (un)use_table to __sysctl_(un)use_header

    The former names were not semantically correct, as the use/unuse was
    related to the header, not the table. Also this makes it clearer that
    sysctl_use_header and __sysctl_use_header are related (one takes the
    spin lock inside and the other doesn't).
    Signed-off-by: Lucian Adrian Grijincu <>
    committed May 3, 2011
  20. sysctl: rename sysctl_head_get/put to sysctl_proc_inode_get/put

    Clarify the purpose of those references. No functional changes.
    Signed-off-by: Lucian Adrian Grijincu <>
    committed May 1, 2011
  21. sysctl: split ->count into ctl_procfs_refs and ctl_header_refs

    This is not necessary at this point, but will be later when we replace
    the sysctl implementation with one that uses ctl_table_header objects
    as directories.
    Signed-off-by: Lucian Adrian Grijincu <>
    committed May 1, 2011
  22. sysctl: rename sysctl_head_next to sysctl_use_next_header

    The new name makes it clear that this increments ctl_use_refs and
    that _unuse must be used on the header. No functional change.
    Signed-off-by: Lucian Adrian Grijincu <>
    committed May 1, 2011
  23. sysctl: rename sysctl_head_grab/finish to sysctl_use_header/unuse

    The function names are clearer and they reflect the reference counter
    that is being inc/decremented. No functional change, just aesthetics.
    Signed-off-by: Lucian Adrian Grijincu <>
    committed May 1, 2011
  24. sysctl: rename ->used to ->ctl_use_refs

    In a later patch I will split the 'count' counter in two. We need to
    have a clear distinction between the counters to be able to understand
    the code.
    This counts the number of references to this object from places that
    can tinker with its internals (e.g. ctl_table, ctl_entry,
    attached_to, attached_by, etc.).
    The removal of `header->used = 0;` from __register_sysctl_paths_impl
    does not change anything, as `header` is allocated with kzalloc.
    Signed-off-by: Lucian Adrian Grijincu <>
    committed May 1, 2011
  25. sysctl: delete useless grab_header function

    There are lots of header grabbing/getting functions around. We'll
    start changing them later on and this one will just make conversions
    harder. It doesn't help much, so kill it!
    Signed-off-by: Lucian Adrian Grijincu <>
    committed May 1, 2011
  26. sysctl: sysctl_head_grab defaults to root header on NULL

    The code that could send NULL to sysctl_head_grab is grab_header
    because for the root sysctl directory ('/proc/sys/')
    PROC_I(inode)->sysctl is NULL.
    For it we used to return root_table_header indirectly through a call
    to sysctl_head_next(NULL). Now we default to the root header here.
    The BUG() has not been triggered until now so we can assume no one
    else is sending NULL here.
    Signed-off-by: Lucian Adrian Grijincu <>
    committed May 1, 2011
  27. sysctl: simplify find_in_table

    The if (!p->procname) check is useless because the loop condition
    prevents it from happening.
    Signed-off-by: Lucian Adrian Grijincu <>
    committed May 1, 2011
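    A sketch of the simplified loop shape, assuming a NULL procname
    terminates the array. struct entry and find_in_table_sketch are
    hypothetical stand-ins; the real function works on ctl_table entries
    and a name with an explicit length.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, trimmed-down table entry. */
struct entry {
    const char *procname; /* NULL terminates the array */
    int value;
};

/* Return the entry called `name`, or NULL if the table has none. */
static const struct entry *find_in_table_sketch(const struct entry *p,
                                                const char *name)
{
    assert(p != NULL && name != NULL);

    /*
     * The loop condition already guarantees p->procname != NULL inside
     * the body, so a second check there can never trigger.
     */
    for (; p->procname; p++)
        if (strcmp(p->procname, name) == 0)
            return p;
    return NULL;
}
```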
  28. sysctl: remove useless ctl_table->parent field

    The 'parent' field was added for selinux in:
        commit d912b0c
        [PATCH] sysctl: add a parent entry to ctl_table and set the parent entry
    and then was used for sysctl_check_table.
    Both of the users have found other implementations.
    Signed-off-by: Lucian Adrian Grijincu <>
    committed Feb 4, 2011
  29. sysctl: faster reimplementation of sysctl_check_table

    Determining the parent of a node at depth d
    - previous implementation: O(d)
    - current  implementation: O(1)
    Printing the path to a node at depth d
    - previous implementation: O(d^2)
    - current  implementation: O(d)
    This comes with a small cost: we use an array ('parents') holding as many
    pointers as there can be sysctl levels (currently CTL_MAXNAME=10).
    The 'parents' array of pointers holds the same values as the
    ctl_table->parent field because the function that updates ->parent
    (sysctl_set_parent) is called with either NULL (for root nodes) or
    with sysctl_set_parent(table, table->child).
    Signed-off-by: Lucian Adrian Grijincu <>
    committed Feb 4, 2011
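    A userspace sketch of the 'parents' array technique: while descending
    through .child tables, the ancestor at each depth is recorded in an
    array, so the parent is available in O(1) and a path is printed in
    O(d). struct tnode, format_path and find_path are hypothetical names;
    MAX_DEPTH only mirrors the CTL_MAXNAME value quoted above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define MAX_DEPTH 10

struct tnode {
    const char *procname;      /* NULL terminates an array */
    const struct tnode *child; /* sub-table, or NULL for a file */
};

/* Write "/p0/p1/.../leaf" into buf with one pass over parents[]. */
static void format_path(const struct tnode *const *parents, int depth,
                        const struct tnode *leaf, char *buf, size_t len)
{
    size_t off = 0;

    assert(len > 0);
    buf[0] = '\0';
    for (int i = 0; i <= depth; i++) {
        const char *name = i < depth ? parents[i]->procname : leaf->procname;
        int n = snprintf(buf + off, len - off, "/%s", name);
        if (n < 0 || (size_t)n >= len - off)
            return; /* output truncated, stop here */
        off += (size_t)n;
    }
}

/*
 * Depth-first search for the first entry called `target`. On success the
 * path is written to buf and 1 is returned; otherwise 0 is returned.
 * parents[] must have room for MAX_DEPTH pointers.
 */
static int find_path(const struct tnode *table, const struct tnode **parents,
                     int depth, const char *target, char *buf, size_t len)
{
    assert(depth < MAX_DEPTH);

    for (; table->procname; table++) {
        if (strcmp(table->procname, target) == 0) {
            format_path(parents, depth, table, buf, len);
            return 1;
        }
        if (table->child) {
            parents[depth] = table; /* O(1) parent bookkeeping */
            if (find_path(table->child, parents, depth + 1, target, buf, len))
                return 1;
        }
    }
    return 0;
}
```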
  30. sysctl: no-child: manually register root tables

    Signed-off-by: Lucian Adrian Grijincu <>
    committed Apr 14, 2011
  31. sysctl: no-child: manually register fs/epoll

    Signed-off-by: Lucian Adrian Grijincu <>
    committed Apr 14, 2011
  32. sysctl: no-child: manually register fs/inotify

    Signed-off-by: Lucian Adrian Grijincu <>
    committed Apr 14, 2011
  33. sysctl: no-child: manually register kernel/keys

    Signed-off-by: Lucian Adrian Grijincu <>
    committed Apr 14, 2011
  34. sysctl: no-child: manually register kernel/usermodehelper

    Signed-off-by: Lucian Adrian Grijincu <>
    committed Jun 2, 2011
  35. sysctl: no-child: manually register kernel/random

    Signed-off-by: Lucian Adrian Grijincu <>
    committed Apr 13, 2011