RFC: add (path) "prefix" option to GPU plugin for scalability testing #1104

Closed
wants to merge 13 commits into from

Commits on Aug 19, 2022

  1. Add "prefix" option to GPU plugin for scalability testing

    Devices can be faked for scalability testing when non-standard paths
    are used (GPU plugin code assumes container paths to match host paths,
    and container runtime prevents creating fake files under real paths).
    
    Note: To run both the normal GPU plugin and a faked one in the same
    cluster, label all nodes providing fake "i915" resources differently
    from the nodes with the real GPU plugin and devices, so that real GPU
    workloads can be limited to the correct nodes with a suitable
    nodeSelector.
    
    Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>
    eero-t committed Aug 19, 2022
    a02335f
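A minimal sketch of how such a path prefix could be applied, assuming the plugin scans `/sys/class/drm` and `/dev/dri` and that the option is a plain string flag. The `withPrefix` helper and the flag description are hypothetical and not taken from the PR; the option name itself was still being adjusted in a later commit.

```go
package main

import (
	"flag"
	"fmt"
	"path/filepath"
)

// Default host paths scanned for GPU devices.
// Assumption: these two locations are used for illustration only.
const (
	sysfsDrmDir = "/sys/class/drm"
	devfsDriDir = "/dev/dri"
)

// withPrefix prepends prefix to an absolute path (hypothetical helper).
// An empty prefix returns the path unchanged, i.e. normal operation.
func withPrefix(prefix, path string) string {
	return filepath.Join(prefix, path)
}

func main() {
	// Assumption: the option is called "prefix", as in the PR title.
	prefix := flag.String("prefix", "", "directory prepended to sysfs and devfs paths (fake devices)")
	flag.Parse()

	fmt.Println("sysfs:", withPrefix(*prefix, sysfsDrmDir))
	fmt.Println("devfs:", withPrefix(*prefix, devfsDriDir))
}
```

With an empty prefix the real paths are scanned, so one binary could serve both the normal and the fake deployment.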
  2. More detailed log for number of found GPU devices / resource types

    Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>
    eero-t committed Aug 19, 2022
    460fce1
  3. Add code for generating fake GPU sysfs + devfs files

    Based on input JSON file
    eero-t committed Aug 19, 2022
    f49ca25
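The generator idea can be sketched roughly as follows. The JSON schema (`devCount`, `driver`), the `fakeSpec` type, and both function names are hypothetical, since the PR's actual input format is not shown here. The sketch also assumes that a `device/vendor` file containing Intel's PCI vendor ID (`0x8086`) is enough for a card directory to look like an Intel GPU.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
)

// fakeSpec is a hypothetical input schema, not the PR's real one.
type fakeSpec struct {
	DevCount int    `json:"devCount"`
	Driver   string `json:"driver"`
}

// parseSpec decodes and validates the JSON device description.
func parseSpec(data []byte) (fakeSpec, error) {
	var spec fakeSpec
	if err := json.Unmarshal(data, &spec); err != nil {
		return spec, fmt.Errorf("parsing device spec: %w", err)
	}
	if spec.DevCount < 1 {
		return spec, fmt.Errorf("devCount must be positive, got %d", spec.DevCount)
	}
	return spec, nil
}

// generateSysfs creates one cardN/device directory per fake device
// under root/sys/class/drm and returns the created directories.
func generateSysfs(root string, spec fakeSpec) ([]string, error) {
	var dirs []string
	for i := 0; i < spec.DevCount; i++ {
		dir := filepath.Join(root, "sys/class/drm", fmt.Sprintf("card%d", i), "device")
		if err := os.MkdirAll(dir, 0o755); err != nil {
			return nil, fmt.Errorf("creating %s: %w", dir, err)
		}
		// 0x8086 is Intel's PCI vendor ID.
		vendor := filepath.Join(dir, "vendor")
		if err := os.WriteFile(vendor, []byte("0x8086\n"), 0o644); err != nil {
			return nil, fmt.Errorf("writing %s: %w", vendor, err)
		}
		dirs = append(dirs, dir)
	}
	return dirs, nil
}

func main() {
	spec, err := parseSpec([]byte(`{"devCount": 8, "driver": "i915"}`))
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	root, err := os.MkdirTemp("", "fake-gpu-")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	defer os.RemoveAll(root)

	dirs, err := generateSysfs(root, spec)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	fmt.Printf("generated %d fake %s devices under %s\n", len(dirs), spec.Driver, root)
}
```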
  4. Remove pre-existing fake sysfs & devfs content + more info

    The fake devfs directory is mounted from the host so that the OCI
    runtime can "mount" device files also into workloads requesting fake
    devices. This means those files can outlive the fake GPU plugin, so
    earlier files need to be removed, as they may not match.
    
    Also, DaemonSet restarts failing init containers, so errors about
    directories left over from a previous generator run would hide the
    logs of the real error from the first generator run.
    eero-t committed Aug 19, 2022
    7ecd8a3
  5. Container runtime requires device files to be real devices

    Represent fake GPU devices with null devices:
    https://www.kernel.org/doc/Documentation/admin-guide/devices.txt
    
    The real devfs check also needed changing, and the removal warnings
    were simplified, as there is always just one entry.
    eero-t committed Aug 19, 2022
    ca83c87
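As a rough illustration of representing a fake GPU with a null device, the sketch below creates a character device node with the major/minor numbers of `/dev/null` (1, 3) from the kernel device list linked above. It is Linux-only, uses the standard library `syscall.Mknod`, and the `mkdev` and `createNullDevice` helpers are hypothetical names, not the PR's code. Node creation is expected to fail without root or `CAP_MKNOD`.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"syscall"
)

// Major and minor numbers of the null device (/dev/null).
const (
	nullMajor = 1
	nullMinor = 3
)

// mkdev encodes major/minor into a Linux device number using the
// same bit layout as glibc's makedev().
func mkdev(major, minor uint32) uint64 {
	dev := (uint64(major) & 0x00000fff) << 8
	dev |= (uint64(major) & 0xfffff000) << 32
	dev |= uint64(minor) & 0x000000ff
	dev |= (uint64(minor) & 0xffffff00) << 12
	return dev
}

// createNullDevice creates a character device node at path that
// behaves like /dev/null (hypothetical helper).
func createNullDevice(path string) error {
	mode := uint32(syscall.S_IFCHR | 0o666)
	if err := syscall.Mknod(path, mode, int(mkdev(nullMajor, nullMinor))); err != nil {
		return fmt.Errorf("mknod %s (needs root or CAP_MKNOD): %w", path, err)
	}
	return nil
}

func main() {
	dir, err := os.MkdirTemp("", "fake-devfs-")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, "card0")
	if err := createNullDevice(path); err != nil {
		fmt.Fprintln(os.Stderr, "device creation failed:", err)
		return
	}
	fmt.Println("created fake device node", path)
}
```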
  6. Apply golangci-lint suggestions to device generator

    Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>
    eero-t committed Aug 19, 2022
    1e0e04d
  7. Use normal GPU plugin deployment pod spec as base

    With latest devices release.
    eero-t committed Aug 19, 2022
    e540306
  8. Add 8x DG1 configMap for fake GPU device generator

    Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>
    eero-t committed Aug 19, 2022
    d5ff613
  9. Switch Intel plugin pod to use faked devices

    Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>
    eero-t committed Aug 19, 2022
    269788e
  10. Apply golangci-lint suggestions to GPU plugin

    Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>
    eero-t committed Aug 19, 2022
    b9038fa

Commits on Aug 22, 2022

  1. Trivialize GPU plugin -prefix option handling

    As suggested by Ukri.
    
    Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>
    eero-t committed Aug 22, 2022
    ee1ac15
  2. Better error checks+logs for MkNod(), ReadDir() and RemoveAll()

    Log the most likely failure in more detail, as MkNod() device node
    creation can fail when run as a normal user.
    
    The additional error checking in the new directory removal helper
    addresses Ukri's review comments. There is now an error if the
    to-be-removed fake sysfs has more content than expected (earlier,
    that check covered only fake devfs content).
    
    Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>
    eero-t committed Aug 22, 2022
    dafd079
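A directory removal helper with this kind of content check might look roughly like the sketch below. The `removeExpected` name, its signature, and the example entry names are assumptions for illustration; the PR's actual helper is not shown here.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
)

// removeExpected removes dir and everything under it, but only if its
// top-level entries are all in the expected list (hypothetical helper).
// A missing directory is not an error: there is nothing to clean up.
func removeExpected(dir string, expected ...string) error {
	entries, err := os.ReadDir(dir)
	if errors.Is(err, os.ErrNotExist) {
		return nil
	}
	if err != nil {
		return fmt.Errorf("reading %s: %w", dir, err)
	}

	allowed := make(map[string]bool, len(expected))
	for _, name := range expected {
		allowed[name] = true
	}
	for _, entry := range entries {
		if !allowed[entry.Name()] {
			return fmt.Errorf("refusing to remove %s: unexpected entry %q", dir, entry.Name())
		}
	}

	if err := os.RemoveAll(dir); err != nil {
		return fmt.Errorf("removing %s: %w", dir, err)
	}
	return nil
}

func main() {
	root, err := os.MkdirTemp("", "fake-gpu-")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	defer os.RemoveAll(root)

	sysfs := filepath.Join(root, "sys")
	if err := os.MkdirAll(filepath.Join(sysfs, "class"), 0o755); err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	fmt.Println("clean tree:", removeExpected(sysfs, "class"))

	// A stray entry next to the expected one must block the removal.
	if err := os.MkdirAll(filepath.Join(sysfs, "stray"), 0o755); err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	fmt.Println("tree with stray entry:", removeExpected(sysfs, "class"))
}
```

Checking the contents before calling `os.RemoveAll` guards against deleting a real directory when a path is misconfigured.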
  3. Fix -prefix option name

    Noticed by Tuomas.
    
    Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>
    eero-t committed Aug 22, 2022
    968e294