Commits on Feb 18, 2011
  1. @gregkh

    Linux 2.6.32.29

    gregkh authored
Commits on Feb 17, 2011
  1. @namhyung @gregkh

    kernel/user.c: add lock release annotation on free_user()

    namhyung authored gregkh committed
    commit 571428b upstream.
    
    free_user() releases uidhash_lock but was missing annotation.  Add it.
    This removes following sparse warnings:
    
     include/linux/spinlock.h:339:9: warning: context imbalance in 'free_user' - unexpected unlock
     kernel/user.c:120:6: warning: context imbalance in 'free_uid' - wrong count at exit
    
    Signed-off-by: Namhyung Kim <namhyung@gmail.com>
    Cc: Ingo Molnar <mingo@elte.hu>
    Cc: Dhaval Giani <dhaval.giani@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mike Galbraith <efault@gmx.de>
    Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
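
The warnings quoted above arise because free_user() is entered with the lock held and returns with it released. A minimal userspace sketch of that annotation pattern follows. The names `demo_lock_held`, `demo_refs` and `demo_put_and_unlock` are hypothetical, and the toy integer "lock" only stands in for a real spinlock; the macro fallback assumes a non-sparse compiler should see nothing at all.

```c
/* sparse defines __CHECKER__; a regular compiler sees an empty macro. */
#ifdef __CHECKER__
# define __releases(x) __attribute__((context(x, 1, 0)))
#else
# define __releases(x)
#endif

/* Hypothetical stand-ins for a real spinlock and a reference count. */
static int demo_lock_held;
static int demo_refs = 1;

/*
 * Entered with the (toy) lock held, returns with it released.
 * The annotation documents that imbalance so the checker expects it.
 */
static int demo_put_and_unlock(void)
	__releases(demo_lock_held)
{
	int remaining = --demo_refs;

	demo_lock_held = 0;	/* stands in for the real unlock call */
	return remaining;
}
```

The annotation changes no generated code; it only tells the checker that the unlock without a matching lock in this function is intentional.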
  2. @error27 @gregkh

    sched: Remove some dead code

    error27 authored gregkh committed
    commit 618765801ebc271fe0ba3eca99fcfd62a1f786e1 upstream.
    
    This was left over from "7c9414385e sched: Remove USER_SCHED"
    
    Signed-off-by: Dan Carpenter <error27@gmail.com>
    Acked-by: Dhaval Giani <dhaval.giani@gmail.com>
    Cc: Kay Sievers <kay.sievers@vrfy.org>
    Cc: Greg Kroah-Hartman <gregkh@suse.de>
    LKML-Reference: <20100315082148.GD18181@bicker>
    Signed-off-by: Ingo Molnar <mingo@elte.hu>
    Signed-off-by: Mike Galbraith <efault@gmx.de>
    Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
  3. @gregkh

    sched: Fix wake_affine() vs RT tasks

    Peter Zijlstra authored gregkh committed
    Commit: e51fd5e upstream
    
    Mike reports that since e9e9250 (sched: Scale down cpu_power due to RT
    tasks), wake_affine() goes funny on RT tasks due to them still having a
    !0 weight and wake_affine() still subtracts that from the rq weight.
    
    Since nobody should be using se->weight for RT tasks, set the value to
    zero. Also, since we now use ->cpu_power to normalize rq weights to
    account for RT cpu usage, add that factor into the imbalance computation.
    
    Reported-by: Mike Galbraith <efault@gmx.de>
    Tested-by: Mike Galbraith <efault@gmx.de>
    Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
    LKML-Reference: <1275316109.27810.22969.camel@twins>
    Signed-off-by: Ingo Molnar <mingo@elte.hu>
    Signed-off-by: Mike Galbraith <efault@gmx.de>
    Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
    Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
  4. @gregkh

    sched: Fix idle balancing

    Nikhil Rao authored gregkh committed
    Commit: d5ad140 upstream
    
    An earlier commit reverts idle balancing throttling reset to fix a 30%
    regression in volanomark throughput. We still need to reset idle_stamp
    when we pull a task in newidle balance.
    
    Reported-by: Alex Shi <alex.shi@intel.com>
    Signed-off-by: Nikhil Rao <ncrao@google.com>
    Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
    LKML-Reference: <1290022924-3548-1-git-send-email-ncrao@google.com>
    Signed-off-by: Ingo Molnar <mingo@elte.hu>
    Signed-off-by: Mike Galbraith <efault@gmx.de>
    Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
    Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
  5. @gregkh

    sched: Fix volanomark performance regression

    Alex Shi authored gregkh committed
    Commit: b5482cf upstream
    
    Commit fab4762 triggers excessive idle balancing, causing a ~30% loss in
    volanomark throughput. Remove idle balancing throttle reset.
    
    Originally-by: Alex Shi <alex.shi@intel.com>
    Signed-off-by: Mike Galbraith <efault@gmx.de>
    Acked-by: Nikhil Rao <ncrao@google.com>
    Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
    LKML-Reference: <1289928732.5169.211.camel@maggy.simson.net>
    Signed-off-by: Ingo Molnar <mingo@elte.hu>
    Signed-off-by: Mike Galbraith <efault@gmx.de>
    Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
    Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
  6. @gregkh

    sched: Fix cross-sched-class wakeup preemption

    Peter Zijlstra authored gregkh committed
    Commit: 1e5a740 upstream
    
    Instead of dealing with sched classes inside each check_preempt_curr()
    implementation, pull out this logic into the generic wakeup preemption
    path.
    
    This fixes a hang in KVM (and others) where we are waiting for the
    stop machine thread to run ...
    
    Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de>
    Tested-by: Marcelo Tosatti <mtosatti@redhat.com>
    Tested-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
    Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
    LKML-Reference: <1288891946.2039.31.camel@laptop>
    Signed-off-by: Ingo Molnar <mingo@elte.hu>
    Signed-off-by: Mike Galbraith <efault@gmx.de>
    Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
    Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
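
The idea of handling class differences once, in the generic path, can be sketched as follows. This is an illustration under assumptions, not the kernel's code: the class ranks, `struct demo_task` and both functions are hypothetical, and "lower rank wins" is simply the ordering chosen for the sketch.

```c
/* Hypothetical class ranks, highest priority first. */
enum demo_class { DEMO_STOP, DEMO_RT, DEMO_FAIR, DEMO_IDLE };

struct demo_task {
	enum demo_class cls;
	int prio;	/* lower value = more urgent; compared within a class only */
};

/* Per-class hook: it only ever sees two tasks of the same class. */
static int demo_same_class_preempt(const struct demo_task *curr,
				   const struct demo_task *woken)
{
	return woken->prio < curr->prio;
}

/* Generic wakeup path: returns 1 if the woken task should preempt curr. */
static int demo_check_preempt(const struct demo_task *curr,
			      const struct demo_task *woken)
{
	if (woken->cls == curr->cls)
		return demo_same_class_preempt(curr, woken);

	/* Different classes: the higher-ranked class always wins. */
	return woken->cls < curr->cls;
}
```

With this split, a wakeup of a stop-class task preempts a fair task without any per-class hook having to know about other classes.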
  7. @gregkh

    sched: Use group weight, idle cpu metrics to fix imbalances during idle

    Suresh Siddha authored gregkh committed
    Commit: aae6d3d upstream
    
    Currently we consider a sched domain to be well balanced when the imbalance
    is less than the domain's imbalance_pct. As the number of cores and threads
    increases, current values of imbalance_pct (for example 25% for a
    NUMA domain) are not enough to detect imbalances like:
    
    a) On a WSM-EP system (two sockets, each having 6 cores and 12 logical threads),
    24 cpu-hogging tasks get scheduled as 13 on one socket and 11 on another
    socket, leading to an idle HT cpu.
    
    b) On a hypothetical 2 socket NHM-EX system (each socket having 8 cores and
    16 logical threads), 16 cpu-hogging tasks can get scheduled as 9 on one
    socket and 7 on another socket, leaving one core in a socket idle
    whereas in another socket we have a core having both its HT siblings busy.
    
    While this issue can be fixed by decreasing the domain's imbalance_pct
    (by making it a function of number of logical cpus in the domain), it
    can potentially cause more task migrations across sched groups in an
    overloaded case.
    
    Fix this by using imbalance_pct only during newly_idle and busy
    load balancing. And during idle load balancing, check if there
    is an imbalance in number of idle cpu's across the busiest and this
    sched_group or if the busiest group has more tasks than its weight that
    the idle cpu in this_group can pull.
    
    Reported-by: Nikhil Rao <ncrao@google.com>
    Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
    Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
    LKML-Reference: <1284760952.2676.11.camel@sbsiddha-MOBL3.sc.intel.com>
    Signed-off-by: Ingo Molnar <mingo@elte.hu>
    Signed-off-by: Mike Galbraith <efault@gmx.de>
    Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
    Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
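
Cases (a) and (b) above can be worked through numerically. The sketch below assumes equal-weight tasks, so that load is just the task count, and uses hypothetical names (`demo_group`, `demo_pct_balanced`, `demo_idle_balanced`). It illustrates the two kinds of check the message describes and is not a copy of the kernel's implementation.

```c
struct demo_group {
	unsigned int nr_running;	/* runnable tasks in the group */
	unsigned int weight;		/* logical cpus in the group */
	unsigned int idle_cpus;		/* idle logical cpus in the group */
};

/* Percentage check: balanced if the busiest load is within imbalance_pct. */
static int demo_pct_balanced(unsigned int busiest_load, unsigned int this_load,
			     unsigned int imbalance_pct)
{
	return 100 * busiest_load <= imbalance_pct * this_load;
}

/*
 * Idle-time check: balanced only if this group does not have noticeably
 * more idle cpus than the busiest one, and the busiest group has no more
 * tasks than it has cpus.
 */
static int demo_idle_balanced(const struct demo_group *busiest,
			      const struct demo_group *this_grp)
{
	return this_grp->idle_cpus <= busiest->idle_cpus + 1 &&
	       busiest->nr_running <= busiest->weight;
}
```

For case (a), 100 * 13 = 1300 is not greater than 125 * 11 = 1375, so the percentage check calls the 13/11 split balanced, while the idle-time check flags it because 13 tasks exceed the 12 cpus of the busiest socket. For case (b), the idle-time check flags the 9/7 split because this socket has 9 idle cpus against 7 on the busiest one.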
  8. @gregkh

    sched, cgroup: Fixup broken cgroup movement

    Peter Zijlstra authored gregkh committed
    Commit: b2b5ce0 upstream
    
    Dima noticed that we fail to correct the ->vruntime of sleeping tasks
    when we move them between cgroups.
    
    Reported-by: Dima Zavin <dima@android.com>
    Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
    Tested-by: Mike Galbraith <efault@gmx.de>
    LKML-Reference: <1287150604.29097.1513.camel@twins>
    Signed-off-by: Ingo Molnar <mingo@elte.hu>
    Signed-off-by: Mike Galbraith <efault@gmx.de>
    Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
    Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
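
The message does not spell out the correction. One plausible way to do it, assumed here purely for illustration, is to re-base a sleeping task's vruntime from the old queue's minimum to the new queue's minimum. All names (`demo_cfs_rq`, `demo_entity`, `demo_move_group`) are hypothetical.

```c
#include <stdint.h>

struct demo_cfs_rq {
	uint64_t min_vruntime;	/* baseline of this queue's virtual time */
};

struct demo_entity {
	uint64_t vruntime;
	int on_rq;		/* non-zero while the task is queued */
};

/* Move a task between groups, keeping its lag relative to the baseline. */
static void demo_move_group(struct demo_entity *se,
			    const struct demo_cfs_rq *from,
			    const struct demo_cfs_rq *to)
{
	/* Queued tasks are assumed to be handled elsewhere (not modelled). */
	if (!se->on_rq)
		se->vruntime -= from->min_vruntime;	/* make it relative */

	/* ... the group switch itself would happen here ... */

	if (!se->on_rq)
		se->vruntime += to->min_vruntime;	/* make it absolute again */
}
```

Without such a step, a sleeper coming from a queue with a large baseline would carry a vruntime far ahead of everything in the new queue.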
  9. @gregkh

    sched: Export account_system_vtime()

    Ingo Molnar authored gregkh committed
    Commit: b7dadc3 upstream
    
    KVM uses it for example:
    
     ERROR: "account_system_vtime" [arch/x86/kvm/kvm.ko] undefined!
    
    Cc: Venkatesh Pallipadi <venki@google.com>
    Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
    LKML-Reference: <1286237003-12406-3-git-send-email-venki@google.com>
    Signed-off-by: Ingo Molnar <mingo@elte.hu>
    Signed-off-by: Mike Galbraith <efault@gmx.de>
    Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
    Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
  10. @gregkh

    sched: Call tick_check_idle before __irq_enter

    Venkatesh Pallipadi authored gregkh committed
    Commit: d267f87 upstream
    
    When a CPU is idle and takes its first interrupt, irq_enter calls
    tick_check_idle() to signal the interruption from idle. This is a problem
    if the call is made after __irq_enter, because every routine in __irq_enter
    may then see stale time until tick_check_idle has run.
    
    Specifically, this affects trace calls in __irq_enter that use the global
    clock, and also the account_system_vtime change in this patch, which wants
    to use sched_clock_cpu() for proper irq timing.
    
    But, tick_check_idle was moved after __irq_enter intentionally to
    prevent problem of unneeded ksoftirqd wakeups by the commit ee5f80a:
    
        irq: call __irq_enter() before calling the tick_idle_check
        Impact: avoid spurious ksoftirqd wakeups
    
    Moving tick_check_idle() before __irq_enter and wrapping it with
    local_bh_enable/disable would solve both the problems.
    
    Fixed-by: Yong Zhang <yong.zhang0@gmail.com>
    Signed-off-by: Venkatesh Pallipadi <venki@google.com>
    Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
    LKML-Reference: <1286237003-12406-9-git-send-email-venki@google.com>
    Signed-off-by: Ingo Molnar <mingo@elte.hu>
    Signed-off-by: Mike Galbraith <efault@gmx.de>
    Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
    Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
  11. @gregkh

    sched: Remove irq time from available CPU power

    Venkatesh Pallipadi authored gregkh committed
    Commit: aa48380 upstream
    
    The idea was suggested by Peter Zijlstra here:
    
      http://marc.info/?l=linux-kernel&m=127476934517534&w=2
    
    irq time is technically not available to the tasks running on the CPU.
    This patch removes irq time from CPU power piggybacking on
    sched_rt_avg_update().
    
    Tested this by keeping CPU X busy with a network intensive task that spends
    75% of a single CPU in irq processing (hard+soft) on a 4-way system, then
    starting seven cycle soakers on the system. Without this change, there are
    two tasks on each CPU. With this change, there is a single task on irq-busy
    CPU X and the remaining 7 tasks are spread among the other 3 CPUs.
    
    Signed-off-by: Venkatesh Pallipadi <venki@google.com>
    Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
    LKML-Reference: <1286237003-12406-8-git-send-email-venki@google.com>
    Signed-off-by: Ingo Molnar <mingo@elte.hu>
    Signed-off-by: Mike Galbraith <efault@gmx.de>
    Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
    Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
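
The effect on CPU power can be sketched with simple proportional scaling. This is an assumption-laden illustration: `DEMO_POWER_SCALE`, `demo_cpu_power` and the linear formula are hypothetical, chosen only to show how time spent in irq processing shrinks the capacity seen by the load balancer.

```c
#include <stdint.h>

#define DEMO_POWER_SCALE 1024ULL	/* assumed value for a fully available cpu */

/* Scale capacity by the fraction of the period not spent in irq context. */
static uint32_t demo_cpu_power(uint64_t period_ns, uint64_t irq_ns)
{
	uint64_t available = irq_ns < period_ns ? period_ns - irq_ns : 0;
	uint64_t power = period_ns ? DEMO_POWER_SCALE * available / period_ns : 0;

	/* Never report zero capacity, so later divisions stay safe. */
	return power ? (uint32_t)power : 1;
}
```

With 75% of the period spent in irq processing, the sketch yields a quarter of the full capacity, which is consistent with the tested setup where the irq-busy CPU ends up with fewer tasks than the others.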
  12. @gregkh

    sched: Do not account irq time to current task

    Venkatesh Pallipadi authored gregkh committed
    Commit: 305e683 upstream
    
    Scheduler accounts both softirq and interrupt processing times to the
    currently running task. This means, if the interrupt processing was
    for some other task in the system, then the current task ends up being
    penalized as it gets shorter runtime than otherwise.
    
    Change sched task accounting to charge only actual task time to the
    currently running task. update_curr() now computes delta_exec from
    rq->clock_task.
    
    Note that this change only handles the CONFIG_IRQ_TIME_ACCOUNTING case. We
    can extend this to CONFIG_VIRT_CPU_ACCOUNTING with minimal effort, but
    that's for later.
    
    This change will impact scheduling behavior in interrupt heavy conditions.
    
    Tested on a 4-way system with eth0 handled by CPU 2 and a network heavy
    task (nc) running on CPU 3 (and no RSS/RFS). With that I have CPU 2
    spending 75%+ of its time in irq processing. CPU 3 spending around 35%
    time running nc task.
    
    Now, if I run another CPU intensive task on CPU 2, without this change
    /proc/<pid>/schedstat shows 100% of time accounted to this task. With this
    change, it rightly shows less than 25% accounted to this task as remaining
    time is actually spent on irq processing.
    
    Signed-off-by: Venkatesh Pallipadi <venki@google.com>
    Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
    LKML-Reference: <1286237003-12406-7-git-send-email-venki@google.com>
    Signed-off-by: Ingo Molnar <mingo@elte.hu>
    Signed-off-by: Mike Galbraith <efault@gmx.de>
    Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
    Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
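
A minimal sketch of a task clock that excludes irq time, assuming cumulative nanosecond counters. `demo_rq`, `demo_update_rq_clock` and `demo_delta_exec` are hypothetical names; only the idea of deriving delta_exec from a clock_task value comes from the message above.

```c
#include <stdint.h>

struct demo_rq {
	uint64_t clock;		/* wall time seen by this runqueue */
	uint64_t clock_task;	/* wall time minus time spent in irq context */
};

/* Refresh both clocks from cumulative counters. */
static void demo_update_rq_clock(struct demo_rq *rq, uint64_t now_ns,
				 uint64_t irq_total_ns)
{
	rq->clock = now_ns;
	rq->clock_task = now_ns - irq_total_ns;
}

/* Runtime charged to the current task since exec_start (task-clock units). */
static uint64_t demo_delta_exec(const struct demo_rq *rq, uint64_t exec_start)
{
	return rq->clock_task - exec_start;
}
```

With 750 ns of irq time in a 1000 ns window, the task is charged 250 ns instead of the full 1000 ns, matching the "less than 25%" observation in the test description.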
  13. @gregkh

    x86: Add IRQ_TIME_ACCOUNTING

    Venkatesh Pallipadi authored gregkh committed
    Commit: e82b8e4 upstream
    
    This patch adds IRQ_TIME_ACCOUNTING option on x86 and runtime enables it
    when TSC is enabled.
    
    This change just enables fine grained irq time accounting; it isn't used yet.
    The following patches use it for different purposes.
    
    Signed-off-by: Venkatesh Pallipadi <venki@google.com>
    Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
    LKML-Reference: <1286237003-12406-6-git-send-email-venki@google.com>
    Signed-off-by: Ingo Molnar <mingo@elte.hu>
    Signed-off-by: Mike Galbraith <efault@gmx.de>
    Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
    Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
  14. @gregkh

    sched: Add IRQ_TIME_ACCOUNTING, finer accounting of irq time

    Venkatesh Pallipadi authored gregkh committed
    Commit: b52bfee upstream
    
    s390/powerpc/ia64 have support for CONFIG_VIRT_CPU_ACCOUNTING which does
    the fine granularity accounting of user, system, hardirq, softirq times.
    Adding that option on archs like x86 will be challenging however, given the
    state of TSC reliability on various platforms and also the overhead it will
    add in syscall entry exit.
    
    Instead, add a lighter variant that only does finer accounting of
    hardirq and softirq times, providing precise irq times (instead of timer tick
    based samples). This accounting is added with a new config option
    CONFIG_IRQ_TIME_ACCOUNTING so that there won't be any overhead for users not
    interested in paying the perf penalty.
    
    This accounting is based on sched_clock, with the code being generic.
    So, other archs may find it useful as well.
    
    This patch just adds the core logic and does not enable this logic yet.
    
    Signed-off-by: Venkatesh Pallipadi <venki@google.com>
    Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
    LKML-Reference: <1286237003-12406-5-git-send-email-venki@google.com>
    Signed-off-by: Ingo Molnar <mingo@elte.hu>
    Signed-off-by: Mike Galbraith <efault@gmx.de>
    Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
    Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
  15. @gregkh

    sched: Add a PF flag for ksoftirqd identification

    Venkatesh Pallipadi authored gregkh committed
    Commit: 6cdd519 upstream
    
    To account softirq time cleanly in scheduler, we need to identify whether
    softirq is invoked in ksoftirqd context or softirq at hardirq tail context.
    Add PF_KSOFTIRQD for that purpose.
    
    As all PF flag bits are currently taken, create space by moving one of the
    infrequently used bits (PF_THREAD_BOUND) down in task_struct to be along
    with some other state fields.
    
    Signed-off-by: Venkatesh Pallipadi <venki@google.com>
    Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
    LKML-Reference: <1286237003-12406-4-git-send-email-venki@google.com>
    Signed-off-by: Ingo Molnar <mingo@elte.hu>
    Signed-off-by: Mike Galbraith <efault@gmx.de>
    Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
    Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
  16. @hidave @gregkh

    sched: Remove unused PF_ALIGNWARN flag

    hidave authored gregkh committed
    Commit: 637bbdc upstream
    
    PF_ALIGNWARN is not implemented and, according to its comment, is meant
    for the 486.
    
    It is unlikely that anyone will implement this flag.
    So remove it and leave the valuable 0x00000001 bit free for
    future use.
    
    Signed-off-by: Dave Young <hidave.darkstar@gmail.com>
    Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    LKML-Reference: <20100913121903.GB22238@darkstar>
    Signed-off-by: Ingo Molnar <mingo@elte.hu>
    Signed-off-by: Mike Galbraith <efault@gmx.de>
    Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
    Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
  17. @gregkh

    sched: Consolidate account_system_vtime extern declaration

    Venkatesh Pallipadi authored gregkh committed
    Commit: e1e10a2 upstream
    
    Just a minor cleanup patch that makes things easier for the following patches.
    No functionality change in this patch.
    
    Signed-off-by: Venkatesh Pallipadi <venki@google.com>
    Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
    LKML-Reference: <1286237003-12406-3-git-send-email-venki@google.com>
    Signed-off-by: Ingo Molnar <mingo@elte.hu>
    Signed-off-by: Mike Galbraith <efault@gmx.de>
    Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
    Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
  18. @gregkh

    sched: Fix softirq time accounting

    Venkatesh Pallipadi authored gregkh committed
    Commit: 75e1056 upstream
    
    Peter Zijlstra found a bug in the way softirq time is accounted in
    VIRT_CPU_ACCOUNTING on this thread:
    
       http://lkml.indiana.edu/hypermail//linux/kernel/1009.2/01366.html
    
    The problem is, softirq processing uses local_bh_disable internally. There
    is no way, later in the flow, to tell whether softirq is being processed
    or bh has merely been disabled. So, a hardirq arriving while bh
    is disabled results in time being wrongly accounted as softirq.
    
    Looking at the code a bit more, the problem exists in !VIRT_CPU_ACCOUNTING
    as well, since account_system_time() in normal tick based accounting also uses
    softirq_count, which will be set even when not in softirq with bh disabled.
    
    Peter also suggested the solution of using 2*SOFTIRQ_OFFSET as irq count
    for local_bh_{disable,enable} and using just SOFTIRQ_OFFSET while processing
    softirq. The patch below does that and adds the API in_serving_softirq(), which
    returns whether we are currently processing softirq or not.
    
    Also changes one of the usages of softirq_count in net/sched/cls_cgroup.c
    to in_serving_softirq.
    
    Looks like many usages of in_softirq really want in_serving_softirq. Those
    changes can be made individually on a case by case basis.
    
    Signed-off-by: Venkatesh Pallipadi <venki@google.com>
    Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
    LKML-Reference: <1286237003-12406-2-git-send-email-venki@google.com>
    Signed-off-by: Ingo Molnar <mingo@elte.hu>
    Signed-off-by: Mike Galbraith <efault@gmx.de>
    Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
    Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
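
The counting trick can be sketched in userspace. The bit layout (`DEMO_SOFTIRQ_SHIFT`, an 8-bit field) and all `demo_` names are assumptions for the illustration; what is taken from the message is only that bh disable/enable steps by 2*SOFTIRQ_OFFSET while softirq processing steps by SOFTIRQ_OFFSET.

```c
#define DEMO_SOFTIRQ_SHIFT	8	/* assumed position of the softirq field */
#define DEMO_SOFTIRQ_OFFSET	(1UL << DEMO_SOFTIRQ_SHIFT)
#define DEMO_SOFTIRQ_MASK	(0xffUL << DEMO_SOFTIRQ_SHIFT)
#define DEMO_BH_DISABLE_OFFSET	(2 * DEMO_SOFTIRQ_OFFSET)

static unsigned long demo_count;	/* stand-in for the per-task counter */

static unsigned long demo_softirq_count(void)
{
	return demo_count & DEMO_SOFTIRQ_MASK;
}

/* True when bh is disabled OR a softirq is being served. */
static int demo_in_softirq(void)
{
	return demo_softirq_count() != 0;
}

/* True only while a softirq is being served: the lowest field bit is set. */
static int demo_in_serving_softirq(void)
{
	return (demo_softirq_count() & DEMO_SOFTIRQ_OFFSET) != 0;
}

static void demo_local_bh_disable(void) { demo_count += DEMO_BH_DISABLE_OFFSET; }
static void demo_local_bh_enable(void)  { demo_count -= DEMO_BH_DISABLE_OFFSET; }
static void demo_softirq_enter(void)    { demo_count += DEMO_SOFTIRQ_OFFSET; }
static void demo_softirq_exit(void)     { demo_count -= DEMO_SOFTIRQ_OFFSET; }
```

Because bh disabling always adds an even multiple of the offset, the lowest bit of the field is set only while a softirq is actually running, however deeply bh disables are nested.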
  19. @gregkh

    sched: Drop group_capacity to 1 only if local group has extra capacity

    Nikhil Rao authored gregkh committed
    Commit: 75dd321 upstream
    
    When SD_PREFER_SIBLING is set on a sched domain, drop group_capacity to 1
    only if the local group has extra capacity. The extra check prevents the case
    where you always pull from the heaviest group when it is already under-utilized
    (possible when a large weight task outweighs the other tasks on the system).
    
    For example, consider a 16-cpu quad-core quad-socket machine with MC and NUMA
    scheduling domains. Let's say we spawn 15 nice0 tasks and one nice -15 task,
    and each task is running on one core. In this case, we observe the following
    events when balancing at the NUMA domain:
    
    - find_busiest_group() will always pick the sched group containing the niced
      task to be the busiest group.
    - find_busiest_queue() will then always pick one of the cpus running the
      nice0 task (never picks the cpu with the nice -15 task since
      weighted_cpuload > imbalance).
    - The load balancer fails to migrate the task since it is the running task
      and increments sd->nr_balance_failed.
    - It repeats the above steps a few more times until sd->nr_balance_failed > 5,
      at which point it kicks off the active load balancer, wakes up the migration
      thread and kicks the nice 0 task off the cpu.
    
    The load balancer doesn't stop until we kick out all nice 0 tasks from
    the sched group, leaving you with 3 idle cpus and one cpu running the
    nice -15 task.
    
    When balancing at the NUMA domain, we drop sgs.group_capacity to 1 if the child
    domain (in this case MC) has SD_PREFER_SIBLING set.  Subsequent load checks are
    not relevant because the niced task has a very large weight.
    
    In this patch, we add an extra condition to the "if(prefer_sibling)" check in
    update_sd_lb_stats(). We drop the capacity of a group only if the local group
    has extra capacity, ie. nr_running < group_capacity. This patch preserves the
    original intent of the prefer_siblings check (to spread tasks across the system
    in low utilization scenarios) and fixes the case above.
    
    It helps in the following ways:
    - In low utilization cases (where nr_tasks << nr_cpus), we still drop
      group_capacity down to 1 if we prefer siblings.
    - On very busy systems (where nr_tasks >> nr_cpus), sgs.nr_running will most
      likely be > sgs.group_capacity.
    - When balancing large weight tasks, if the local group does not have extra
      capacity, we do not pick the group with the niced task as the busiest group.
      This prevents failed balances, active migration and the under-utilization
      described above.
    
    Signed-off-by: Nikhil Rao <ncrao@google.com>
    Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
    LKML-Reference: <1287173550-30365-5-git-send-email-ncrao@google.com>
    Signed-off-by: Ingo Molnar <mingo@elte.hu>
    Signed-off-by: Mike Galbraith <efault@gmx.de>
    Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
    Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
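
The extra condition can be sketched as a small pure function. `demo_effective_capacity` and its parameters are hypothetical; the rule itself (drop to 1 only when siblings are preferred, the group is not the local one, and the local group has spare capacity) follows the description above.

```c
/*
 * Capacity to use for a group when computing load-balance statistics.
 * local_has_capacity means: local nr_running < local group_capacity.
 */
static unsigned long demo_effective_capacity(unsigned long group_capacity,
					     int prefer_sibling,
					     int is_local_group,
					     int local_has_capacity)
{
	if (prefer_sibling && !is_local_group && local_has_capacity &&
	    group_capacity > 1)
		return 1;	/* spread tasks out: treat group as single-slot */

	return group_capacity;
}
```

When the local group is already full, the remote group keeps its real capacity, so a group holding one heavy task is no longer treated as overloaded by default.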
  20. @gregkh

    sched: Force balancing on newidle balance if local group has capacity

    Nikhil Rao authored gregkh committed
    Commit: fab4762 upstream
    
    This patch forces a load balance on a newly idle cpu when the local group has
    extra capacity and the busiest group does not have any. It improves system
    utilization when balancing tasks with a large weight differential.
    
    Under certain situations, such as a niced down task (i.e. nice = -15) in the
    presence of nr_cpus NICE0 tasks, the niced task lands on a sched group and
    kicks away other tasks because of its large weight. This leads to sub-optimal
    utilization of the machine. Even though the sched group has capacity, it does
    not pull tasks because sds.this_load >> sds.max_load, and f_b_g() returns NULL.
    
    With this patch, if the local group has extra capacity, we shortcut the checks
    in f_b_g() and try to pull a task over. A sched group has extra capacity if the
    group capacity is greater than the number of running tasks in that group.
    
    Thanks to Mike Galbraith for discussions leading to this patch and for the
    insight to reuse SD_NEWIDLE_BALANCE.
    
    Signed-off-by: Nikhil Rao <ncrao@google.com>
    Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
    LKML-Reference: <1287173550-30365-4-git-send-email-ncrao@google.com>
    Signed-off-by: Ingo Molnar <mingo@elte.hu>
    Signed-off-by: Mike Galbraith <efault@gmx.de>
    Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
    Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
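
The shortcut described above can be sketched as follows. The structure and function names are hypothetical, and the final load comparison is a simplified stand-in for the remaining checks in f_b_g(); "extra capacity" is defined as in the message, capacity greater than the number of running tasks.

```c
struct demo_lb_stats {
	unsigned long this_load, max_load;
	unsigned long this_nr_running, this_capacity;
	unsigned long busiest_nr_running, busiest_capacity;
};

static int demo_has_capacity(unsigned long nr_running, unsigned long capacity)
{
	return capacity > nr_running;
}

/* Returns 1 if the balancer should try to pull a task to this group. */
static int demo_should_pull(const struct demo_lb_stats *s, int newly_idle)
{
	/* Newly idle cpu with room locally and none remotely: force a pull. */
	if (newly_idle &&
	    demo_has_capacity(s->this_nr_running, s->this_capacity) &&
	    !demo_has_capacity(s->busiest_nr_running, s->busiest_capacity))
		return 1;

	/* Otherwise fall back to a plain load comparison (simplified). */
	return s->max_load > s->this_load;
}
```

In the niced-task scenario the local load dwarfs the remote load, so only the forced path lets the idle cpu in the local group pick up work.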
  21. @gregkh

    sched: Set group_imb only a task can be pulled from the busiest cpu

    Nikhil Rao authored gregkh committed
    Commit: 2582f0e upstream
    
    When cycling through sched groups to determine the busiest group, set
    group_imb only if the busiest cpu has more than 1 runnable task. This patch
    fixes the case where two cpus in a group have one runnable task each, but there
    is a large weight differential between these two tasks. The load balancer is
    unable to migrate any task from this group, and hence we do not consider this
    group to be imbalanced.
    
    Signed-off-by: Nikhil Rao <ncrao@google.com>
    Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
    LKML-Reference: <1286996978-7007-3-git-send-email-ncrao@google.com>
    [ small code readability edits ]
    Signed-off-by: Ingo Molnar <mingo@elte.hu>
    Signed-off-by: Mike Galbraith <efault@gmx.de>
    Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
    Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
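
A sketch of the added condition. The function name is hypothetical, and the spread threshold used here (twice the average per-task load) is an assumption for illustration; the part taken from the message is that group_imb is set only when the busiest cpu has more than one runnable task.

```c
/*
 * Decide whether a group counts as internally imbalanced.
 * max/min_cpu_load: highest and lowest per-cpu load in the group.
 * max_nr_running:   runnable tasks on the busiest cpu of the group.
 */
static int demo_group_imb(unsigned long max_cpu_load,
			  unsigned long min_cpu_load,
			  unsigned long avg_load_per_task,
			  unsigned long max_nr_running)
{
	int large_spread = (max_cpu_load - min_cpu_load) > 2 * avg_load_per_task;

	/* A lone running task cannot be migrated, so it is not an imbalance. */
	return large_spread && max_nr_running > 1;
}
```

With one heavy task and several light ones spread one per cpu, the load spread is large but nothing can be pulled, so the group is left alone.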
  22. @gregkh

    sched: Do not consider SCHED_IDLE tasks to be cache hot

    Nikhil Rao authored gregkh committed
    Commit: ef8002f upstream
    
    This patch adds a check in task_hot to return if the task has SCHED_IDLE
    policy. SCHED_IDLE tasks have very low weight, and when run with regular
    workloads, are typically scheduled many milliseconds apart. There is no
    need to consider these tasks hot for load balancing.
    
    Signed-off-by: Nikhil Rao <ncrao@google.com>
    Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
    LKML-Reference: <1287173550-30365-2-git-send-email-ncrao@google.com>
    Signed-off-by: Ingo Molnar <mingo@elte.hu>
    Signed-off-by: Mike Galbraith <efault@gmx.de>
    Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
    Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
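
A sketch of such an early return, under assumptions: the policy enum, `demo_hot_task` and the "ran more recently than the migration cost" test are hypothetical stand-ins for whatever task_hot actually compares.

```c
#include <stdint.h>

enum demo_policy { DEMO_SCHED_NORMAL, DEMO_SCHED_IDLE };

struct demo_hot_task {
	enum demo_policy policy;
	uint64_t exec_start_ns;	/* when the task last started running */
};

/* Returns 1 if the task is likely cache hot and should not be migrated. */
static int demo_task_hot(const struct demo_hot_task *p, uint64_t now_ns,
			 int64_t migration_cost_ns)
{
	/* SCHED_IDLE tasks run rarely; never treat them as cache hot. */
	if (p->policy == DEMO_SCHED_IDLE)
		return 0;

	return (int64_t)(now_ns - p->exec_start_ns) < migration_cost_ns;
}
```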
  23. @gregkh

    sched: fix RCU lockdep splat from task_group()

    Peter Zijlstra authored gregkh committed
    Commit: 6506cf6 upstream
    
    This addresses the following RCU lockdep splat:
    
    [0.051203] CPU0: AMD QEMU Virtual CPU version 0.12.4 stepping 03
    [0.052999] lockdep: fixing up alternatives.
    [0.054105]
    [0.054106] ===================================================
    [0.054999] [ INFO: suspicious rcu_dereference_check() usage. ]
    [0.054999] ---------------------------------------------------
    [0.054999] kernel/sched.c:616 invoked rcu_dereference_check() without protection!
    [0.054999]
    [0.054999] other info that might help us debug this:
    [0.054999]
    [0.054999]
    [0.054999] rcu_scheduler_active = 1, debug_locks = 1
    [0.054999] 3 locks held by swapper/1:
    [0.054999]  #0:  (cpu_add_remove_lock){+.+.+.}, at: [<ffffffff814be933>] cpu_up+0x42/0x6a
    [0.054999]  #1:  (cpu_hotplug.lock){+.+.+.}, at: [<ffffffff810400d8>] cpu_hotplug_begin+0x2a/0x51
    [0.054999]  #2:  (&rq->lock){-.-...}, at: [<ffffffff814be2f7>] init_idle+0x2f/0x113
    [0.054999]
    [0.054999] stack backtrace:
    [0.054999] Pid: 1, comm: swapper Not tainted 2.6.35 #1
    [0.054999] Call Trace:
    [0.054999]  [<ffffffff81068054>] lockdep_rcu_dereference+0x9b/0xa3
    [0.054999]  [<ffffffff810325c3>] task_group+0x7b/0x8a
    [0.054999]  [<ffffffff810325e5>] set_task_rq+0x13/0x40
    [0.054999]  [<ffffffff814be39a>] init_idle+0xd2/0x113
    [0.054999]  [<ffffffff814be78a>] fork_idle+0xb8/0xc7
    [0.054999]  [<ffffffff81068717>] ? mark_held_locks+0x4d/0x6b
    [0.054999]  [<ffffffff814bcebd>] do_fork_idle+0x17/0x2b
    [0.054999]  [<ffffffff814bc89b>] native_cpu_up+0x1c1/0x724
    [0.054999]  [<ffffffff814bcea6>] ? do_fork_idle+0x0/0x2b
    [0.054999]  [<ffffffff814be876>] _cpu_up+0xac/0x127
    [0.054999]  [<ffffffff814be946>] cpu_up+0x55/0x6a
    [0.054999]  [<ffffffff81ab562a>] kernel_init+0xe1/0x1ff
    [0.054999]  [<ffffffff81003854>] kernel_thread_helper+0x4/0x10
    [0.054999]  [<ffffffff814c353c>] ? restore_args+0x0/0x30
    [0.054999]  [<ffffffff81ab5549>] ? kernel_init+0x0/0x1ff
    [0.054999]  [<ffffffff81003850>] ? kernel_thread_helper+0x0/0x10
    [0.056074] Booting Node   0, Processors  #1lockdep: fixing up alternatives.
    [0.130045]  #2lockdep: fixing up alternatives.
    [0.203089]  #3 Ok.
    [0.275286] Brought up 4 CPUs
    [0.276005] Total of 4 processors activated (16017.17 BogoMIPS).
    
    The cgroup_subsys_state structures referenced by idle tasks are never
    freed, because the idle tasks should be part of the root cgroup,
    which is not removable.
    
    The problem is that while we do in fact hold rq->lock, the newly spawned
    idle thread's cpu is not yet set to the correct cpu so the lockdep check
    in task_group():
    
      lockdep_is_held(&task_rq(p)->lock)
    
    will fail.
    
    But this is a chicken and egg problem.  Setting the CPU's runqueue requires
    that the CPU's runqueue already be set.  ;-)
    
    So insert an RCU read-side critical section to avoid the complaint.
    
    Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    Signed-off-by: Mike Galbraith <efault@gmx.de>
    Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
    Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
  24. @paulmck @gregkh

    sched: suppress RCU lockdep splat in task_fork_fair

    paulmck authored gregkh committed
    Commit: b0a0f66 upstream
    
    > ===================================================
    > [ INFO: suspicious rcu_dereference_check() usage. ]
    > ---------------------------------------------------
    > /home/greearb/git/linux.wireless-testing/kernel/sched.c:618 invoked rcu_dereference_check() without protection!
    >
    > other info that might help us debug this:
    >
    > rcu_scheduler_active = 1, debug_locks = 1
    > 1 lock held by ifup/23517:
    >   #0:  (&rq->lock){-.-.-.}, at: [<c042f782>] task_fork_fair+0x3b/0x108
    >
    > stack backtrace:
    > Pid: 23517, comm: ifup Not tainted 2.6.36-rc6-wl+ #5
    > Call Trace:
    >   [<c075e219>] ? printk+0xf/0x16
    >   [<c0455842>] lockdep_rcu_dereference+0x74/0x7d
    >   [<c0426854>] task_group+0x6d/0x79
    >   [<c042686e>] set_task_rq+0xe/0x57
    >   [<c042f79e>] task_fork_fair+0x57/0x108
    >   [<c042e965>] sched_fork+0x82/0xf9
    >   [<c04334b3>] copy_process+0x569/0xe8e
    >   [<c0433ef0>] do_fork+0x118/0x262
    >   [<c076302f>] ? do_page_fault+0x16a/0x2cf
    >   [<c044b80c>] ? up_read+0x16/0x2a
    >   [<c04085ae>] sys_clone+0x1b/0x20
    >   [<c04030a5>] ptregs_clone+0x15/0x30
    >   [<c0402f1c>] ? sysenter_do_call+0x12/0x38
    
    Here a newly created task is having its runqueue assigned.  The new task
    is not yet on the tasklist, so cannot go away.  This is therefore a false
    positive, suppress with an RCU read-side critical section.
    
    Reported-by: Ben Greear <greearb@candelatech.com>
    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    Tested-by: Ben Greear <greearb@candelatech.com>
    Signed-off-by: Mike Galbraith <efault@gmx.de>
    Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
    Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
  25. @gregkh

    sched: Give CPU bound RT tasks preference

    stable-bot for Steven Rostedt authored gregkh committed
    From: Steven Rostedt <srostedt@redhat.com>
    
    Commit: b3bc211 upstream
    
    If a high priority task is waking up on a CPU that is running a
    lower priority task that is bound to a CPU, see if we can move the
    high RT task to another CPU first. Note, if all other CPUs are
    running higher priority tasks than the CPU bounded current task,
    then it will be preempted regardless.
    
    Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
    Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
    Cc: Gregory Haskins <ghaskins@novell.com>
    LKML-Reference: <20100921024138.888922071@goodmis.org>
    Signed-off-by: Ingo Molnar <mingo@elte.hu>
    Signed-off-by: Mike Galbraith <efault@gmx.de>
    Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
    Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
  26. @gregkh

    sched: Try not to migrate higher priority RT tasks

    Steven Rostedt authored gregkh committed
    Commit: 43fa546 upstream
    
    When first working on the RT scheduler design, we concentrated on
    keeping all CPUs running RT tasks instead of having multiple RT
    tasks on a single CPU waiting for the migration thread to move
    them. Instead we take a more proactive stance and push or pull RT
    tasks from one CPU to another on wakeup or scheduling.
    
    When an RT task wakes up on a CPU that is running another RT task,
    instead of preempting it and killing the cache of the running RT
    task, we look to see if we can migrate the RT task that is waking
    up, even if the RT task waking up is of higher priority.
    
    This may sound a bit odd, but RT tasks should be limited in
    migration by the user anyway. But in practice, people do not do
    this, which causes high prio RT tasks to bounce around the CPUs.
    This becomes even worse when we have priority inheritance, because
    a high prio task can block on a lower prio task and boost its
    priority. When the lower prio task wakes up the high prio task, if
    it happens to be on the same CPU it will migrate off of it.
    
    But in reality, the above does not happen much either, because the
    wake up of the lower prio task, which has already been boosted, if
    it was on the same CPU as the higher prio task, it would then
    migrate off of it. But anyway, we do not want to migrate them
    either.
    
    To examine the scheduling, I created a test program and examined it
    under kernelshark. The test program created CPU * 2 threads, where
    each thread had a different priority. The program takes different
    options. The option varied for this change log was whether to use
    priority inheritance mutexes or not.
    
    All threads did the following loop:
    
    static void grab_lock(long id, int iter, int l)
    {
    	ftrace_write("thread %ld iter %d, taking lock %d\n",
    		     id, iter, l);
    	pthread_mutex_lock(&locks[l]);
    	ftrace_write("thread %ld iter %d, took lock %d\n",
    		     id, iter, l);
    	busy_loop(nr_tasks - id);
    	ftrace_write("thread %ld iter %d, unlock lock %d\n",
    		     id, iter, l);
    	pthread_mutex_unlock(&locks[l]);
    }
    
    void *start_task(void *id)
    {
    	[...]
    	while (!done) {
    		for (l = 0; l < nr_locks; l++) {
    			grab_lock(id, i, l);
    			ftrace_write("thread %ld iter %d sleeping\n",
    				     id, i);
    			ms_sleep(id);
    		}
    		i++;
    	}
    	[...]
    }
    
    The busy_loop(ms) keeps the CPU spinning for ms milliseconds. The
    ms_sleep(ms) sleeps for ms milliseconds. The ftrace_write() writes
    to the ftrace buffer to help analyze via ftrace.
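    
    The two timing helpers are not shown in this change log. A minimal
    sketch of what they could look like follows, assuming POSIX
    clock_gettime() and nanosleep(); now_ms() is a hypothetical helper
    introduced here, and the real test program may differ:

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <time.h>

/* Hypothetical helper: milliseconds on the monotonic clock. */
long long now_ms(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (long long)ts.tv_sec * 1000 + ts.tv_nsec / 1000000;
}

/* Keep the CPU spinning (never sleeping) for at least ms milliseconds. */
void busy_loop(long ms)
{
	long long end = now_ms() + ms;

	while (now_ms() < end)
		;
}

/* Give up the CPU for ms milliseconds, resuming if a signal interrupts. */
void ms_sleep(long ms)
{
	struct timespec req = {
		.tv_sec = ms / 1000,
		.tv_nsec = (ms % 1000) * 1000000L,
	};

	/* On EINTR, nanosleep() stores the remaining time back in req. */
	while (nanosleep(&req, &req) == -1 && errno == EINTR)
		;
}
```

    The distinction matters for the test: busy_loop() keeps the task
    runnable, so it can be preempted or migrated, while ms_sleep() takes
    it off the runqueue and produces a wakeup afterwards.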
    
    The higher the id, the higher the prio, the shorter it does the
    busy loop, but the longer it sleeps. This is usually the case with
    RT tasks: the lower priority tasks usually run longer than higher
    priority tasks.
    
    At the end of the test, it records the number of loops each thread
    took, as well as the number of voluntary preemptions, non-voluntary
    preemptions, and number of migrations each thread took, taking the
    information from /proc/$$/sched and /proc/$$/status.
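    
    As a rough illustration of how the preemption counts could be
    collected, the sketch below pulls a "key: value" field out of
    /proc/<pid>/status text, for example voluntary_ctxt_switches and
    nonvoluntary_ctxt_switches. The function names are hypothetical and
    this is not the code of the test program; the migration count in
    /proc/$$/sched uses a different layout and is not handled here:

```c
#include <stdio.h>
#include <string.h>

/*
 * Return the numeric value of the line starting with "key:" in
 * status-formatted text, or -1 if there is no such line.
 */
long status_field(const char *text, const char *key)
{
	size_t klen = strlen(key);
	const char *line = text;

	while (line && *line) {
		long val;

		/* Match only at the start of a line, followed by ':'. */
		if (strncmp(line, key, klen) == 0 && line[klen] == ':' &&
		    sscanf(line + klen + 1, "%ld", &val) == 1)
			return val;
		line = strchr(line, '\n');
		if (line)
			line++;
	}
	return -1;
}

/* Read a status-style file (up to 8 KiB) and look up one field. */
long status_field_from_file(const char *path, const char *key)
{
	char buf[8192];
	size_t n;
	FILE *f = fopen(path, "r");

	if (!f)
		return -1;
	n = fread(buf, 1, sizeof(buf) - 1, f);
	fclose(f);
	buf[n] = '\0';
	return status_field(buf, key);
}
```

    A thread could call status_field_from_file("/proc/self/status",
    "nonvoluntary_ctxt_switches") at exit to fill in the "nonvol" column.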
    
    Running this on a 4 CPU processor, the results without changes to
    the kernel looked like this:
    
    Task        vol    nonvol   migrated     iterations
    ----        ---    ------   --------     ----------
      0:         53      3220       1470             98
      1:        562       773        724             98
      2:        752       933       1375             98
      3:        749        39        697             98
      4:        758         5        515             98
      5:        764         2        679             99
      6:        761         2        535             99
      7:        757         3        346             99
    
    total:     5156       4977      6341            787
    
    Each thread, regardless of priority, migrated a few hundred times.
    The higher priority tasks were a little better but still took
    quite an impact.
    
    By letting higher priority tasks bump the lower prio task from the
    CPU, things changed a bit:
    
    Task        vol    nonvol   migrated     iterations
    ----        ---    ------   --------     ----------
      0:         37      2835       1937             98
      1:        666      1821       1865             98
      2:        654      1003       1385             98
      3:        664       635        973             99
      4:        698       197        352             99
      5:        703       101        159             99
      6:        708         1         75             99
      7:        713         1          2             99
    
    total:     4843       6594      6748            789
    
    The total # of migrations did not change (several runs showed the
    difference all within the noise). But we now see a dramatic
    improvement to the higher priority tasks. (kernelshark showed that
    the watchdog timer bumped the highest priority task to give it the
    2 count. This was actually consistent with every run).
    
    Notice that the # of iterations did not change either.
    
    The above was with priority inheritance mutexes. That is, when the
    higher priority task blocked on a lower priority task, the lower
    priority task would inherit the priority of the higher priority
    task (which shows why task 6 was bumped so many times). When not
    using priority inheritance mutexes, the current kernel shows this:
    
    Task        vol    nonvol   migrated     iterations
    ----        ---    ------   --------     ----------
      0:         56      3101       1892             95
      1:        594       713        937             95
      2:        625       188        618             95
      3:        628         4        491             96
      4:        640         7        468             96
      5:        631         2        501             96
      6:        641         1        466             96
      7:        643         2        497             96
    
    total:     4458       4018      5870            765
    
    Not much changed with or without priority inheritance mutexes. But
    if we let the high priority task bump lower priority tasks on
    wakeup we see:
    
    Task        vol    nonvol   migrated     iterations
    ----        ---    ------   --------     ----------
      0:        115      3439       2782             98
      1:        633      1354       1583             99
      2:        652       919       1218             99
      3:        645       713        934             99
      4:        690         3          3             99
      5:        694         1          4             99
      6:        720         3          4             99
      7:        747         0          1            100
    
    Which shows an even bigger change. The big difference between task 3
    and task 4 is because we have only 4 CPUs on the machine, causing
    the 4 highest prio tasks to always have preference.
    
    Although I did not measure cache misses, and I'm sure there would
    be little to measure since the test was not data intensive, I could
    imagine large improvements for higher priority tasks when dealing
    with lower priority tasks. Thus, I'm satisfied with making the
    change and agreeing with what Gregory Haskins argued a few years
    ago when we first had this discussion.
    
    One final note. All tasks in the above tests were RT tasks. Any RT
    task will always preempt a non RT task that is running on the CPU
    the RT task wants to run on.
    
    Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
    Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
    Cc: Gregory Haskins <ghaskins@novell.com>
    LKML-Reference: <20100921024138.605460343@goodmis.org>
    Signed-off-by: Ingo Molnar <mingo@elte.hu>
    Signed-off-by: Mike Galbraith <efault@gmx.de>
    Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
    Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
  27. @gregkh

    sched: Increment cache_nice_tries only on periodic lb

    Venkatesh Pallipadi authored gregkh committed
    Commit: 58b26c4 upstream
    
    The scheduler uses cache_nice_tries as an indicator to do cache_hot and
    active load balance, when normal load balance fails. Currently,
    this value is changed on any failed load balance attempt. That ends
    up being not so nice to workloads that enter/exit idle often, as
    they do more frequent new_idle balance and that pretty soon results
    in cache hot tasks being pulled in.
    
    Making the cache_nice_tries ignore failed new_idle balance seems to
    make better sense. With that only the failed load balance in
    periodic load balance gets accounted and the rate of accumulation
    of cache_nice_tries will not depend on idle entry/exit (short
    running sleep-wakeup kind of tasks). This reduces movement of
    cache_hot tasks.
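    
    The accounting rule can be pictured with a small userspace model.
    This is only a sketch of the idea described above, not the kernel
    code: struct domain_model, the enum, and both function names are
    made up for illustration, and the "failures exceed
    cache_nice_tries" test is an assumption about how the threshold is
    applied.

```c
#include <stdbool.h>

/* Hypothetical names for the two kinds of balance attempt. */
enum idle_type { PERIODIC, NEW_IDLE };

struct domain_model {
	unsigned int cache_nice_tries;	/* threshold for pulling hot tasks */
	unsigned int failed;		/* failed attempts counted so far */
};

/*
 * Record a failed balance attempt. When count_new_idle is false
 * (the behaviour this change argues for), failures seen while
 * entering idle are ignored and only periodic ones accumulate.
 */
void note_failed_balance(struct domain_model *d, enum idle_type type,
			 bool count_new_idle)
{
	if (type == NEW_IDLE && !count_new_idle)
		return;
	d->failed++;
}

/* Cache-hot tasks become eligible once failures pass the threshold. */
bool may_pull_cache_hot(const struct domain_model *d)
{
	return d->failed > d->cache_nice_tries;
}
```

    With count_new_idle set, a workload that enters idle ten times in a
    row passes a threshold of 2 almost immediately. With it cleared,
    the same burst leaves the counter untouched, and only failed
    periodic balances move it.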
    
    schedstat diff (after-before) excerpt from a workload that has a
    frequent and short wakeup-idle pattern (:2 in the cpu column below
    refers to the NEWIDLE idx). This snapshot was across ~400 seconds.
    
    Without this change:
    domainstats:  domain0
     cpu     cnt      bln      fld      imb     gain    hgain  nobusyq  nobusyg
     0:2  306487   219575    73167  110069413    44583    19070     1172   218403
     1:2  292139   194853    81421  120893383    50745    21902     1259   193594
     2:2  283166   174607    91359  129699642    54931    23688     1287   173320
     3:2  273998   161788    93991  132757146    57122    24351     1366   160422
     4:2  289851   215692    62190   83398383    36377    13680      851   214841
     5:2  316312   222146    77605  117582154    49948    20281      988   221158
     6:2  297172   195596    83623  122133390    52801    21301      929   194667
     7:2  283391   178078    86378  126622761    55122    22239      928   177150
     8:2  297655   210359    72995  110246694    45798    19777     1125   209234
     9:2  297357   202011    79363  119753474    50953    22088     1089   200922
    10:2  278797   178703    83180  122514385    52969    22726     1128   177575
    11:2  272661   167669    86978  127342327    55857    24342     1195   166474
    12:2  293039   204031    73211  110282059    47285    19651      948   203083
    13:2  289502   196762    76803  114712942    49339    20547     1016   195746
    14:2  264446   169609    78292  115715605    50459    21017      982   168627
    15:2  260968   163660    80142  116811793    51483    21281     1064   162596
    
    With this change:
    domainstats:  domain0
     cpu     cnt      bln      fld      imb     gain    hgain  nobusyq  nobusyg
     0:2  272347   187380    77455  105420270    24975        1      953   186427
     1:2  267276   172360    86234  116242264    28087        6     1028   171332
     2:2  259769   156777    93281  123243134    30555        1     1043   155734
     3:2  250870   143129    97627  127370868    32026        6     1188   141941
     4:2  248422   177116    64096   78261112    22202        2      757   176359
     5:2  275595   180683    84950  116075022    29400        6      778   179905
     6:2  262418   162609    88944  119256898    31056        4      817   161792
     7:2  252204   147946    92646  122388300    32879        4      824   147122
     8:2  262335   172239    81631  110477214    26599        4      864   171375
     9:2  261563   164775    88016  117203621    28331        3      849   163926
    10:2  243389   140949    93379  121353071    29585        2      909   140040
    11:2  242795   134651    98310  124768957    30895        2     1016   133635
    12:2  255234   166622    79843  104696912    26483        4      746   165876
    13:2  244944   151595    83855  109808099    27787        3      801   150794
    14:2  241301   140982    89935  116954383    30403        6      845   140137
    15:2  232271   128564    92821  119185207    31207        4     1416   127148
    
    Signed-off-by: Venkatesh Pallipadi <venki@google.com>
    Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
    LKML-Reference: <1284167957-3675-1-git-send-email-venki@google.com>
    Signed-off-by: Ingo Molnar <mingo@elte.hu>
    Signed-off-by: Mike Galbraith <efault@gmx.de>
    Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
    Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
  28. @gregkh

    sched: Move sched_avg_update() to update_cpu_load()

    Suresh Siddha authored gregkh committed
    Commit: da2b71e upstream
    
    Currently sched_avg_update() (which updates rt_avg stats in the rq)
    is getting called from scale_rt_power() (in the load balance context)
    which doesn't take rq->lock.
    
    Fix it by moving the sched_avg_update() to more appropriate
    update_cpu_load() where the CFS load gets updated as well.
    
    Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
    Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
    LKML-Reference: <1282596171.2694.3.camel@sbsiddha-MOBL3>
    Signed-off-by: Ingo Molnar <mingo@elte.hu>
    Signed-off-by: Mike Galbraith <efault@gmx.de>
    Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
    Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
  29. @gregkh

    sched: Remove remaining USER_SCHED code

    Li Zefan authored gregkh committed
    Commit: 32bd7eb upstream
    
    This is left over from commit 7c94143 ("sched: Remove USER_SCHED")
    
    Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
    Acked-by: Dhaval Giani <dhaval.giani@gmail.com>
    Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
    Cc: David Howells <dhowells@redhat.com>
    LKML-Reference: <4BA9A05F.7010407@cn.fujitsu.com>
    Signed-off-by: Ingo Molnar <mingo@elte.hu>
    Signed-off-by: Mike Galbraith <efault@gmx.de>
    Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
  30. @giani @gregkh

    sched: Remove USER_SCHED

    giani authored gregkh committed
    Commit: 7c94143 upstream
    
    Remove the USER_SCHED feature. It has been scheduled to be removed in
    2.6.34 as per http://marc.info/?l=linux-kernel&m=125728479022976&w=2
    
    [trace from referenced thread]
    [1046577.884289] general protection fault: 0000 [#1] SMP
    [1046577.911332] last sysfs file: /sys/devices/platform/coretemp.7/temp1_input
    [1046577.938715] CPU 3
    [1046577.965814] Modules linked in: ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables coretemp k8temp
    [1046577.994456] Pid: 38, comm: events/3 Not tainted 2.6.32.27intel #1 X8DT3
    [1046578.023166] RIP: 0010:[] [] sched_destroy_group+0x3c/0x10d
    [1046578.052639] RSP: 0000:ffff88043e5abe10 EFLAGS: 00010097
    [1046578.081360] RAX: ffff880139fa5540 RBX: ffff8803d18419c0 RCX: ffff8801d2f8fb78
    [1046578.109903] RDX: dead000000200200 RSI: 0000000000000000 RDI: 0000000000000000
    [1046578.109905] RBP: 0000000000000246 R08: 0000000000000020 R09: ffffffff816339b8
    [1046578.109907] R10: 0000000004e6e5f0 R11: 0000000000000006 R12: ffffffff816339b8
    [1046578.109909] R13: ffff8803d63ac4e0 R14: ffff88043e582340 R15: ffffffff8104a216
    [1046578.109911] FS: 0000000000000000(0000) GS:ffff880028260000(0000) knlGS:0000000000000000
    [1046578.109914] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
    [1046578.109915] CR2: 00007f55ab220000 CR3: 00000001e5797000 CR4: 00000000000006e0
    [1046578.109917] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [1046578.109919] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    [1046578.109922] Process events/3 (pid: 38, threadinfo ffff88043e5aa000, task ffff88043e582340)
    [1046578.109923] Stack:
    [1046578.109924] ffff8803d63ac498 ffff8803d63ac4d8 ffff8803d63ac440 ffffffff8104a2c3
    [1046578.109927] <0> ffff88043e5abef8 ffff880028276040 ffff8803d63ac4d8 ffffffff81050395
    [1046578.109929] <0> ffff88043e582340 ffff88043e5826c8 ffff88043e582340 ffff88043e5abfd8
    [1046578.109932] Call Trace:
    [1046578.109938] [] ? cleanup_user_struct+0xad/0xcc
    [1046578.109942] [] ? worker_thread+0x148/0x1d4
    [1046578.109946] [] ? autoremove_wake_function+0x0/0x2e
    [1046578.109948] [] ? worker_thread+0x0/0x1d4
    [1046578.109951] [] ? kthread+0x79/0x81
    [1046578.109955] [] ? child_rip+0xa/0x20
    [1046578.109957] [] ? kthread+0x0/0x81
    [1046578.109959] [] ? child_rip+0x0/0x20
    [1046578.109961] Code: 3c 00 4c 8b 25 02 98 3d 00 48 89 c5 83 cf ff eb 5c 48 8b 43 10 48 63 f7 48 8b 04 f0 48 8b 90 80 00 00 00 48 8b 48 78 48 89 51 08 <48> 89 0a 48 b9 00 02 20 00 00 00 ad de 48 89 88 80 00 00 00 48
    [1046578.109975] RIP [] sched_destroy_group+0x3c/0x10d
    [1046578.109979] RSP
    [1046578.109981] ---[ end trace 5ebc2944b7872d4a ]---
    
    Signed-off-by: Dhaval Giani <dhaval.giani@gmail.com>
    Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
    LKML-Reference: <1263990378.24844.3.camel@localhost>
    LKML-Reference: http://marc.info/?l=linux-kernel&m=129466345327931
    Signed-off-by: Ingo Molnar <mingo@elte.hu>
    Signed-off-by: Mike Galbraith <efault@gmx.de>
    Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
  31. @gregkh

    usb: Realloc xHCI structures after a hub is verified.

    Sarah Sharp authored gregkh committed
    commit 653a39d upstream.
    
    When there's an xHCI host power loss after a suspend from memory, the USB
    core attempts to reset and verify the USB devices that are attached to the
    system.  The xHCI driver has to reallocate those devices, since the
    hardware lost all knowledge of them during the power loss.
    
    When a hub is plugged in, and the host loses power, the xHCI hardware
    structures are not updated to say the device is a hub.  This is usually
    done in hub_configure() when the USB hub is detected.  That function is
    skipped during a reset and verify by the USB core, since the core restores
    the old configuration and alternate settings, and the hub driver has no
    idea this happened.  This bug makes the xHCI host controller reject the
    enumeration of low speed devices under the resumed hub.
    
    Therefore, make the USB core re-setup the internal xHCI hub device
    information by calling update_hub_device() when hub_activate() is called
    for a hub reset resume.  After a host power loss, all devices under the
    roothub get a reset-resume or a disconnect.
    
    This patch should be queued for the 2.6.37 stable tree.
    
    Signed-off-by: Sarah Sharp <sarah.a.sharp@linux.intel.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
  32. @gregkh

    x86, mm: avoid possible bogus tlb entries by clearing prev mm_cpumask…

    Suresh Siddha authored gregkh committed
    … after switching mm
    
    commit 831d52b upstream.
    
    Clearing the cpu in prev's mm_cpumask early will avoid the flush tlb
    IPI's while the cr3 is still pointing to the prev mm.  And this window
    can lead to the possibility of bogus TLB fills resulting in strange
    failures.  One such problematic scenario is mentioned below.
    
     T1. CPU-1 is context switching from mm1 to mm2 and takes an NMI
         etc between the point of clearing the cpu from the mm_cpumask(mm1)
         and before reloading the cr3 with the new mm2.
    
     T2. CPU-2 is tearing down a specific vma for mm1 and will proceed with
         flushing the TLB for mm1.  It doesn't send the flush TLB to CPU-1
         as it doesn't see that cpu listed in the mm_cpumask(mm1).
    
     T3. After the TLB flush is complete, CPU-2 goes ahead and frees the
         page-table pages associated with the removed vma mapping.
    
     T4. CPU-2 now allocates those freed page-table pages for something
         else.
    
     T5. As the CR3 and TLB caches for mm1 are still active on CPU-1, CPU-1
         can potentially speculate and walk through the page-table caches
         and can insert new TLB entries.  As the page-table pages are
         already freed and being used on CPU-2, this page walk can
         potentially insert a bogus global TLB entry depending on the
         (random) contents of the page that is being used on CPU-2.
    
     T6. This bogus TLB entry being global will be active across future CR3
         changes and can result in weird memory corruption etc.
    
    To avoid this issue, for the prev mm that is handing over the cpu to
    another mm, clear the cpu from the mm_cpumask(prev) after the cr3 is
    changed.
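    
    The ordering argument can be illustrated with a toy userspace
    model. Everything in this sketch (struct mm_model, struct
    cpu_model, switch_mm_model() and the clear_early flag) is
    hypothetical and only mimics the two orderings discussed above; it
    is not the x86 switch_mm() code.

```c
#include <stdbool.h>

struct mm_model { unsigned long cpumask; };	   /* CPUs that get flush IPIs */
struct cpu_model { const struct mm_model *cr3; };  /* page tables in use */

/* True if the CPU still uses mm's page tables but would miss its flushes. */
static bool exposed(const struct cpu_model *c, int id,
		    const struct mm_model *mm)
{
	return c->cr3 == mm && !(mm->cpumask & (1UL << id));
}

/*
 * Switch CPU 'id' from prev to next and report whether an exposed
 * window existed at any step. clear_early selects the old ordering
 * (clear the bit, then load cr3); otherwise the bit is cleared only
 * after cr3 points at next.
 */
bool switch_mm_model(struct cpu_model *c, int id, struct mm_model *prev,
		     struct mm_model *next, bool clear_early)
{
	bool window = false;

	if (clear_early) {
		prev->cpumask &= ~(1UL << id);
		/* An NMI landing here delays the cr3 reload (step T1). */
		window |= exposed(c, id, prev);
	}
	next->cpumask |= 1UL << id;
	c->cr3 = next;				/* models reloading cr3 */
	if (!clear_early)
		prev->cpumask &= ~(1UL << id);
	window |= exposed(c, id, prev);
	return window;
}
```

    Both orderings end in the same state; only the early-clear variant
    passes through a point where CPU-1 would be skipped by a TLB flush
    for mm1 while still walking mm1's page tables.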
    
    Marking it for -stable, though we haven't seen any reported failure that
    can be attributed to this.
    
    Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
    Acked-by: Ingo Molnar <mingo@elte.hu>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
  33. @ickle @gregkh

    drm/i915: Add dependency on CONFIG_TMPFS

    ickle authored gregkh committed
    commit f7ab9b4 upstream.
    
    Without tmpfs, shmem_readpage() is not compiled in causing an OOPS as
    soon as we try to allocate some swappable pages for GEM.
    
    Jan 19 22:52:26 harlie kernel: Modules linked in: i915(+) drm_kms_helper cfbcopyarea video backlight cfbimgblt cfbfillrect
    Jan 19 22:52:26 harlie kernel:
    Jan 19 22:52:26 harlie kernel: Pid: 1125, comm: modprobe Not tainted 2.6.37Harlie #10 To be filled by O.E.M./To be filled by O.E.M.
    Jan 19 22:52:26 harlie kernel: EIP: 0060:[<00000000>] EFLAGS: 00010246 CPU: 3
    Jan 19 22:52:26 harlie kernel: EIP is at 0x0
    Jan 19 22:52:26 harlie kernel: EAX: 00000000 EBX: f7b7d000 ECX: f3383100 EDX: f7b7d000
    Jan 19 22:52:26 harlie kernel: ESI: f1456118 EDI: 00000000 EBP: f2303c98 ESP: f2303c7c
    Jan 19 22:52:26 harlie kernel:  DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
    Jan 19 22:52:26 harlie kernel: Process modprobe (pid: 1125, ti=f2302000 task=f259cd80 task.ti=f2302000)
    Jan 19 22:52:26 harlie kernel: Stack:
    Jan 19 22:52:26 harlie udevd-work[1072]: '/sbin/modprobe -b pci:v00008086d00000046sv00000000sd00000000bc03sc00i00' unexpected exit with status 0x0009
    Jan 19 22:52:26 harlie kernel:  c1074061 000000d0 f2f42b80 00000000 000a13d2 f2d5dcc0 00000001 f2303cac
    Jan 19 22:52:26 harlie kernel:  c107416f 00000000 000a13d2 00000000 f2303cd4 f8d620ed f2cee620 00001000
    Jan 19 22:52:26 harlie kernel:  00000000 000a13d2 f1456118 f2d5dcc0 f1a40000 00001000 f2303d04 f8d637ab
    Jan 19 22:52:26 harlie kernel: Call Trace:
    Jan 19 22:52:26 harlie kernel:  [<c1074061>] ? do_read_cache_page+0x71/0x160
    Jan 19 22:52:26 harlie kernel:  [<c107416f>] ? read_cache_page_gfp+0x1f/0x30
    Jan 19 22:52:26 harlie kernel:  [<f8d620ed>] ? i915_gem_object_get_pages+0xad/0x1d0 [i915]
    Jan 19 22:52:26 harlie kernel:  [<f8d637ab>] ? i915_gem_object_bind_to_gtt+0xeb/0x2d0 [i915]
    Jan 19 22:52:26 harlie kernel:  [<f8d65961>] ? i915_gem_object_pin+0x151/0x190 [i915]
    Jan 19 22:52:26 harlie kernel:  [<c11e16ed>] ? drm_gem_object_init+0x3d/0x60
    Jan 19 22:52:26 harlie kernel:  [<f8d65aa5>] ? i915_gem_init_ringbuffer+0x105/0x1e0 [i915]
    Jan 19 22:52:26 harlie kernel:  [<f8d571b7>] ? i915_driver_load+0x667/0x1160 [i915]
    
    Reported-by: John J. Stimson-III <john@idsfa.net>
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
  34. @gregkh

    drm/i915/lvds: Add AOpen i915GMm-HFS to the list of false-positive LVDS

    Knut Petersen authored gregkh committed
    commit 22ab70d upstream.
    
    Signed-off-by: Knut Petersen <knut_petersen@t-online.de>
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>