Commits on Aug 1, 2012
  1. @gregkh

    Linux 3.0.39

    gregkh committed Aug 1, 2012
  2. @koct9i @gregkh

    vmscan: fix initial shrinker size handling

    commit 635697c663f38106063d5659f0cf2e45afcd4bb5 upstream.
    
    Stable note: The commit [acf92b48: vmscan: shrinker->nr updates race and
    	go wrong] aimed to reduce excessive reclaim of slab objects but
    	had a bug in how it treated shrinker functions that returned -1.
    
    A shrinker function can return -1, which means that it cannot do anything
    without a risk of deadlock.  For example prune_super() does this if it
    cannot grab a superblock reference, even if nr_to_scan=0.  Currently we
    interpret this -1 as a ULONG_MAX size shrinker and evaluate `total_scan'
    according to this.  So the next time around this shrinker can cause
    really big pressure.  Let's skip such shrinkers instead.
    
    Also make total_scan signed, otherwise the check (total_scan < 0) below
    never works.
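    
    A minimal userspace sketch of the two points above, not the kernel code
    itself.  The function names are hypothetical; only the -1 return value,
    ULONG_MAX and the (total_scan < 0) check come from the text.

```c
#include <limits.h>   /* ULONG_MAX, used when checking the misread value */
#include <stdbool.h>

/* How an unsigned variable ends up reading the callback's -1. */
static unsigned long misread_as_size(long ret)
{
	return (unsigned long)ret; /* -1 converts to ULONG_MAX */
}

/* Skip shrinkers that report nothing to scan or that cannot run now. */
static bool shrinker_should_skip(long max_pass)
{
	return max_pass <= 0;
}

/* The overflow check is only meaningful for a signed total_scan. */
static bool total_scan_overflowed(long total_scan)
{
	return total_scan < 0;
}
```

    With an unsigned total_scan the last check could never be true, which is
    why the type change goes together with the skip.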
    
    Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
    Cc: Dave Chinner <david@fromorbit.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    koct9i committed with gregkh Dec 8, 2011
  3. @koct9i @gregkh

    mm/hugetlb: fix warning in alloc_huge_page/dequeue_huge_page_vma

    commit b1c12cbcd0a02527c180a862e8971e249d3b347d upstream.
    
    Stable note: Not tracked in Bugzilla. [get|put]_mems_allowed() is extremely
    	expensive and severely impacted page allocator performance. This
    	is part of a series of patches that reduce page allocator overhead.
    
    Fix a gcc warning (and bug?) introduced in cc9a6c877 ("cpuset: mm: reduce
    large amounts of memory barrier related damage v3")
    
    Local variable "page" can be uninitialized if the nodemask from vma policy
    does not intersect with nodemask from cpuset.  Even if it doesn't happen
    it is better to initialize this variable explicitly than to introduce
    a kernel oops in a weird corner case.
    
    mm/hugetlb.c: In function `alloc_huge_page':
    mm/hugetlb.c:1135:5: warning: `page' may be used uninitialized in this function
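    
    The pattern behind the warning can be sketched in userspace as follows.
    struct page_sketch, dequeue_page_sketch() and the single-word masks are
    hypothetical stand-ins, not the hugetlb code.

```c
#include <stddef.h>

struct page_sketch {
	int node;
};

/*
 * Return a page from the first node present in both masks, or NULL.
 * Assumes nr_nodes is smaller than the number of bits in unsigned long.
 */
static struct page_sketch *dequeue_page_sketch(struct page_sketch *per_node,
					       unsigned int nr_nodes,
					       unsigned long policy_mask,
					       unsigned long cpuset_mask)
{
	/* Explicit init: if the masks are disjoint the loop never assigns. */
	struct page_sketch *page = NULL;

	for (unsigned int node = 0; node < nr_nodes; node++) {
		unsigned long bit = 1UL << node;

		if ((policy_mask & bit) && (cpuset_mask & bit)) {
			page = &per_node[node];
			break;
		}
	}
	return page;
}
```

    Without the initializer, the disjoint case would return an indeterminate
    pointer, which is the corner case the message describes.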
    
    Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
    Acked-by: Mel Gorman <mgorman@suse.de>
    Acked-by: David Rientjes <rientjes@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    koct9i committed with gregkh Apr 25, 2012
  4. @gregkh

    cpuset: mm: reduce large amounts of memory barrier related damage v3

    commit cc9a6c8776615f9c194ccf0b63a0aa5628235545 upstream.
    
    Stable note:  Not tracked in Bugzilla. [get|put]_mems_allowed() is extremely
    	expensive and severely impacted page allocator performance. This
    	is part of a series of patches that reduce page allocator overhead.
    
    Commit c0ff745 ("cpuset,mm: fix no node to alloc memory when
    changing cpuset's mems") wins a super prize for the largest number of
    memory barriers entered into fast paths for one commit.
    
    [get|put]_mems_allowed is incredibly heavy with pairs of full memory
    barriers inserted into a number of hot paths.  This was detected while
    investigating a large page allocator slowdown introduced some time
    after 2.6.32.  The largest portion of this overhead was shown by
    oprofile to be at an mfence introduced by this commit into the page
    allocator hot path.
    
    For extra style points, the commit introduced the use of yield() in an
    implementation of what looks like a spinning mutex.
    
    This patch replaces the full memory barriers on both read and write
    sides with a sequence counter with just read barriers on the fast path
    side.  This is much cheaper on some architectures, including x86.  The
    main bulk of the patch is the retry logic if the nodemask changes in a
    manner that can cause a false failure.
    
    While updating the nodemask, a check is made to see if a false failure
    is a risk.  If it is, the sequence number gets bumped and parallel
    allocators will briefly stall while the nodemask update takes place.
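    
    The read/retry idea can be sketched in userspace with C11 atomics.  This
    is an illustration under assumptions, not the patch: the *_sketch names
    are hypothetical, a single writer is assumed, the sketch bumps the
    counter on every update (the patch only does so when a false failure is a
    risk), and it uses sequentially consistent atomics for simplicity where
    the patch relies on read barriers on the fast path.

```c
#include <stdatomic.h>
#include <stdbool.h>

struct task_sketch {
	atomic_uint seq;                    /* odd while an update is running */
	_Atomic unsigned long mems_allowed; /* one-word nodemask stand-in */
};

static void task_sketch_init(struct task_sketch *t, unsigned long mask)
{
	atomic_init(&t->seq, 0);
	atomic_init(&t->mems_allowed, mask);
}

/* Reader entry: wait out any running update, return a cookie. */
static unsigned int get_mems_allowed_sketch(struct task_sketch *t)
{
	unsigned int seq;

	do {
		seq = atomic_load(&t->seq);
	} while (seq & 1);
	return seq;
}

/* Reader exit: true if the nodemask did not change since the cookie. */
static bool put_mems_allowed_sketch(struct task_sketch *t, unsigned int seq)
{
	return atomic_load(&t->seq) == seq;
}

/* Writer: bump to odd, store the new mask, bump back to even. */
static void update_mems_allowed_sketch(struct task_sketch *t,
				       unsigned long new_mask)
{
	atomic_fetch_add(&t->seq, 1);
	atomic_store(&t->mems_allowed, new_mask);
	atomic_fetch_add(&t->seq, 1);
}
```

    An allocation that failed while put_mems_allowed_sketch() returns false
    would be retried rather than reported as a failure.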
    
    In a page fault test microbenchmark, oprofile samples from
    __alloc_pages_nodemask went from 4.53% of all samples to 1.15%.  The
    actual results were
    
                                 3.3.0-rc3          3.3.0-rc3
                                 rc3-vanilla        nobarrier-v2r1
        Clients   1 UserTime       0.07 (  0.00%)   0.08 (-14.19%)
        Clients   2 UserTime       0.07 (  0.00%)   0.07 (  2.72%)
        Clients   4 UserTime       0.08 (  0.00%)   0.07 (  3.29%)
        Clients   1 SysTime        0.70 (  0.00%)   0.65 (  6.65%)
        Clients   2 SysTime        0.85 (  0.00%)   0.82 (  3.65%)
        Clients   4 SysTime        1.41 (  0.00%)   1.41 (  0.32%)
        Clients   1 WallTime       0.77 (  0.00%)   0.74 (  4.19%)
        Clients   2 WallTime       0.47 (  0.00%)   0.45 (  3.73%)
        Clients   4 WallTime       0.38 (  0.00%)   0.37 (  1.58%)
        Clients   1 Flt/sec/cpu  497620.28 (  0.00%) 520294.53 (  4.56%)
        Clients   2 Flt/sec/cpu  414639.05 (  0.00%) 429882.01 (  3.68%)
        Clients   4 Flt/sec/cpu  257959.16 (  0.00%) 258761.48 (  0.31%)
        Clients   1 Flt/sec      495161.39 (  0.00%) 517292.87 (  4.47%)
        Clients   2 Flt/sec      820325.95 (  0.00%) 850289.77 (  3.65%)
        Clients   4 Flt/sec      1020068.93 (  0.00%) 1022674.06 (  0.26%)
        MMTests Statistics: duration
        Sys Time Running Test (seconds)             135.68    132.17
        User+Sys Time Running Test (seconds)         164.2    160.13
        Total Elapsed Time (seconds)                123.46    120.87
    
    The overall improvement is small but the System CPU time is much
    improved and roughly in correlation to what oprofile reported (these
    performance figures are without profiling so skew is expected).  The
    actual number of page faults is noticeably improved.
    
    For benchmarks like kernel builds, the overall benefit is marginal but
    the system CPU time is slightly reduced.
    
    To test the actual bug the commit fixed I opened two terminals.  The
    first ran within a cpuset and continually ran a small program that
    faulted 100M of anonymous data.  In a second window, the nodemask of the
    cpuset was continually randomised in a loop.
    
    Without the commit, the program would fail every so often (usually
    within 10 seconds) and obviously with the commit everything worked fine.
    With this patch applied, it also worked fine so the fix should be
    functionally equivalent.
    
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Cc: Miao Xie <miaox@cn.fujitsu.com>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
    Cc: Christoph Lameter <cl@linux.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Mel Gorman committed with gregkh Mar 21, 2012
  5. @gregkh

    cpusets: stall when updating mems_allowed for mempolicy or disjoint nodemask
    
    commit b246272ecc5ac68c743b15c9e41a2275f7ce70e2 upstream.
    
    Stable note: Not tracked in Bugzilla. [get|put]_mems_allowed() is extremely
    	expensive and severely impacted page allocator performance. This is
    	part of a series of patches that reduce page allocator overhead.
    
    Kernels where MAX_NUMNODES > BITS_PER_LONG may temporarily see an empty
    nodemask in a tsk's mempolicy if its previous nodemask is remapped onto a
    new set of allowed cpuset nodes where the two nodemasks, as a result of
    the remap, are now disjoint.
    
    c0ff745 ("cpuset,mm: fix no node to alloc memory when changing
    cpuset's mems") adds get_mems_allowed() to prevent the set of allowed
    nodes from changing for a thread.  This causes any update to a set of
    allowed nodes to stall until put_mems_allowed() is called.
    
    This stall is unnecessary, however, if at least one node remains unchanged
    in the update to the set of allowed nodes.  This was addressed by
    89e8a244b97e ("cpusets: avoid looping when storing to mems_allowed if one
    node remains set"), but it's still possible that an empty nodemask may be
    read from a mempolicy because the old nodemask may be remapped to the new
    nodemask during rebind.  To prevent this, only avoid the stall if there is
    no mempolicy for the thread being changed.
    
    This is a temporary solution until all reads from mempolicy nodemasks can
    be guaranteed to not be empty without the get_mems_allowed()
    synchronization.
    
    Also moves the check for nodemask intersection inside task_lock() so that
    tsk->mems_allowed cannot change.  This ensures that nothing can set this
    tsk's mems_allowed out from under us and also protects tsk->mempolicy.
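    
    The resulting decision can be sketched as a single predicate.  The
    function name is hypothetical and the nodemask is reduced to one word for
    brevity, although the problem described only arises when the mask spans
    several words.

```c
#include <stdbool.h>

/*
 * Stall the update if the task has a mempolicy (its nodemask may be
 * remapped during rebind) or if no node is common to old and new masks.
 */
static bool update_needs_stall(bool has_mempolicy,
			       unsigned long old_mask,
			       unsigned long new_mask)
{
	return has_mempolicy || (old_mask & new_mask) == 0;
}
```

    In the patch this check is made under task_lock(), so neither input can
    change while it is evaluated.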
    
    Reported-by: Miao Xie <miaox@cn.fujitsu.com>
    Signed-off-by: David Rientjes <rientjes@google.com>
    Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
    Cc: Paul Menage <paul@paulmenage.org>
    Cc: Stephen Rothwell <sfr@canb.auug.org.au>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    David Rientjes committed with gregkh Dec 19, 2011
  6. @gregkh

    cpusets: avoid looping when storing to mems_allowed if one node remains set
    
    commit 89e8a244b97e48f1f30e898b6f32acca477f2a13 upstream.
    
    Stable note: Not tracked in Bugzilla. [get|put]_mems_allowed() is
    	extremely expensive and severely impacted page allocator performance.
    	This is part of a series of patches that reduce page allocator
    	overhead.
    
    {get,put}_mems_allowed() exist so that general kernel code may locklessly
    access a task's set of allowable nodes without having the chance that a
    concurrent write will cause the nodemask to be empty on configurations
    where MAX_NUMNODES > BITS_PER_LONG.
    
    This could incur a significant delay, however, especially in low memory
    conditions because the page allocator is blocking and reclaim requires
    get_mems_allowed() itself.  It is not atypical to see writes to
    cpuset.mems take over 2 seconds to complete, for example.  In low memory
    conditions, this is problematic because it's one of the most important
    times to change cpuset.mems in the first place!
    
    The only way a task's set of allowable nodes may change is through cpusets
    by writing to cpuset.mems and when attaching a task to a different cpuset.
    This is done by setting all new nodes, ensuring generic code is
    not reading the nodemask with get_mems_allowed() at the same time, and
    then clearing all the old nodes.  This prevents the possibility that a
    reader will see an empty nodemask at the same time the writer is storing a
    new nodemask.
    
    If at least one node remains unchanged, though, it's possible to simply
    set all new nodes and then clear all the old nodes.  Changing a task's
    nodemask is protected by cgroup_mutex so it's guaranteed that two threads
    are not changing the same task's nodemask at the same time, so the
    nodemask is guaranteed to be stored before another thread changes it and
    determines whether a node remains set or not.
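    
    A userspace sketch of the two-step store, assuming a two-word nodemask to
    model MAX_NUMNODES > BITS_PER_LONG.  The struct and function names are
    hypothetical and the wait for readers is only marked by a comment.

```c
#include <stdbool.h>

#define NODEMASK_WORDS 2

struct nodemask_sketch {
	unsigned long bits[NODEMASK_WORDS];
};

static bool nodemasks_intersect(const struct nodemask_sketch *a,
				const struct nodemask_sketch *b)
{
	for (int i = 0; i < NODEMASK_WORDS; i++)
		if (a->bits[i] & b->bits[i])
			return true;
	return false;
}

/*
 * Store newmask into cur.  Returns true if the writer would have had to
 * wait for readers between the two steps (no node remains set).
 */
static bool store_nodemask(struct nodemask_sketch *cur,
			   const struct nodemask_sketch *newmask)
{
	bool must_wait = !nodemasks_intersect(cur, newmask);

	/* Step 1: set all new nodes; the old ones stay set for now. */
	for (int i = 0; i < NODEMASK_WORDS; i++)
		cur->bits[i] |= newmask->bits[i];

	/* A real writer would wait for readers here when must_wait is true. */

	/* Step 2: clear the old nodes by storing the final mask. */
	*cur = *newmask;
	return must_wait;
}
```

    When a node is common to both masks, a reader sees that node set at every
    point of the update, so the wait can be skipped.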
    
    Signed-off-by: David Rientjes <rientjes@google.com>
    Cc: Miao Xie <miaox@cn.fujitsu.com>
    Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
    Cc: Nick Piggin <npiggin@kernel.dk>
    Cc: Paul Menage <paul@paulmenage.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    David Rientjes committed with gregkh Nov 2, 2011
  7. @gregkh

    mm: vmscan: convert global reclaim to per-memcg LRU lists

    commit b95a2f2d486d0d768a92879c023a03757b9c7e58 upstream - WARNING: this is a substitute patch.
    
    Stable note: Not tracked in Bugzilla. This is a partial backport of an
    	upstream commit addressing a completely different issue
    	that accidentally contained an important fix. The workload
    	this patch helps was memcached when IO is started in the
    	background. memcached should stay resident but without this patch
    	it gets swapped. Sometimes this manifests as a drop in throughput
    	but mostly it was observed through /proc/vmstat.
    
    Commit [246e87a: memcg: fix get_scan_count() for small targets] was meant
    to fix a problem whereby small scan targets on memcg were ignored, causing
    priority to rise too sharply. It forced scanning to take place if the
    target was small, memcg or kswapd.
    
    From the time it was introduced it caused excessive reclaim by kswapd
    with workloads being pushed to swap that previously would have stayed
    resident. This was accidentally fixed in commit [b95a2f2d: mm: vmscan:
    convert global reclaim to per-memcg LRU lists] by making it harder for
    kswapd to force scan small targets but that patchset is not suitable for
    backporting. This was later changed again by commit [90126375: mm/vmscan:
    push lruvec pointer into get_scan_count()] into a format that looks
    like it would be a straight-forward backport but there is a subtle
    difference due to the use of lruvecs.
    
    The impact of the accidental fix is to make it harder for kswapd to force
    scan small targets by taking zone->all_unreclaimable into account. This
    patch is the closest equivalent available based on what is backported.
    
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Johannes Weiner committed with gregkh Jan 12, 2012
  8. @gregkh

    mm: test PageSwapBacked in lumpy reclaim

    commit 043bcbe5ec51e0478ef2b44acef17193e01d7f70 upstream.
    
    Stable note: Not tracked in Bugzilla. There were reports of shared
    	mapped pages being unfairly reclaimed in comparison to older kernels.
    	This is being addressed over time. Even though the subject
    	refers to lumpy reclaim, it impacts compaction as well.
    
    Lumpy reclaim does well to stop at a PageAnon when there's no swap, but
    better is to stop at any PageSwapBacked, which includes shmem/tmpfs too.
    
    Signed-off-by: Hugh Dickins <hughd@google.com>
    Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
    Reviewed-by: Minchan Kim <minchan@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Hugh Dickins committed with gregkh Jan 10, 2012
  9. @minchank @gregkh

    mm/vmscan.c: consider swap space when deciding whether to continue reclaim
    
    commit 86cfd3a45042ab242d47f3935a02811a402beab6 upstream.
    
    Stable note: Not tracked in Bugzilla. This patch reduces kswapd CPU
    	usage on swapless systems with high anonymous memory usage.
    
    It's pointless to continue reclaiming when we have no swap space and lots
    of anon pages in the inactive list.
    
    Without this patch, it is possible when swap is disabled to continue
    trying to reclaim when there are only anonymous pages in the system even
    though that will not make any progress.
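    
    A sketch of the idea: inactive anon pages only count as reclaimable when
    swap space is available.  The struct, the function names and the
    (2UL << order) threshold are assumptions made for illustration.

```c
#include <stdbool.h>

struct lru_counts_sketch {
	unsigned long inactive_file;
	unsigned long inactive_anon;
};

/* Inactive pages that reclaim can actually make progress on. */
static unsigned long reclaimable_inactive(const struct lru_counts_sketch *lru,
					  long nr_swap_pages)
{
	unsigned long pages = lru->inactive_file;

	if (nr_swap_pages > 0)
		pages += lru->inactive_anon;
	return pages;
}

/* Continue only if the target is unmet and enough candidates remain. */
static bool should_continue_reclaim_sketch(const struct lru_counts_sketch *lru,
					   long nr_swap_pages,
					   unsigned long nr_reclaimed,
					   unsigned int order)
{
	unsigned long pages_for_compaction = 2UL << order; /* assumed target */

	return nr_reclaimed < pages_for_compaction &&
	       reclaimable_inactive(lru, nr_swap_pages) > pages_for_compaction;
}
```

    On a swapless system with only anonymous pages the candidate count is
    zero, so the loop ends instead of spinning.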
    
    Signed-off-by: Minchan Kim <minchan@kernel.org>
    Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
    Acked-by: Mel Gorman <mgorman@suse.de>
    Reviewed-by: Rik van Riel <riel@redhat.com>
    Cc: Johannes Weiner <jweiner@redhat.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    minchank committed with gregkh Jan 10, 2012
  10. @koct9i @gregkh

    vmscan: activate executable pages after first usage

    commit c909e99364c8b6ca07864d752950b6b4ecf6bef4 upstream.
    
    Stable note: Not tracked in Bugzilla. There were reports of shared
    	mapped pages being unfairly reclaimed in comparison to older kernels.
    	This is being addressed over time.
    
    Logic added in commit 8cab475 ("vmscan: make mapped executable pages
    the first class citizen") was noticeably weakened in commit
    6457474 ("vmscan: detect mapped file pages used only once").
    
    Currently these pages can become "first class citizens" only after second
    usage.  After this patch page_check_references() will activate them after
    first usage, and executable code gets an even better chance to stay in memory.
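    
    The change in behaviour can be sketched as below.  The function and its
    "uses" counter (references seen while the page sits on the inactive list)
    are hypothetical simplifications of page_check_references().

```c
#include <stdbool.h>

/*
 * Should a mapped file page be activated?  Two uses always qualify;
 * with this patch a single use is enough for executable mappings.
 */
static bool activate_file_page_sketch(unsigned int uses, bool vm_exec)
{
	if (uses >= 2)
		return true;
	return uses == 1 && vm_exec;
}
```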
    
    Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
    Cc: Pekka Enberg <penberg@kernel.org>
    Cc: Minchan Kim <minchan.kim@gmail.com>
    Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
    Cc: Wu Fengguang <fengguang.wu@intel.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Nick Piggin <npiggin@kernel.dk>
    Cc: Mel Gorman <mel@csn.ul.ie>
    Cc: Shaohua Li <shaohua.li@intel.com>
    Cc: Rik van Riel <riel@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    koct9i committed with gregkh Jan 10, 2012
  11. @koct9i @gregkh

    vmscan: promote shared file mapped pages

    commit 34dbc67a644f11ab3475d822d72e25409911e760 upstream.
    
    Stable note: Not tracked in Bugzilla. There were reports of shared
    	mapped pages being unfairly reclaimed in comparison to older kernels.
    	This is being addressed over time. The specific workload being
    	addressed here is described in paragraph four and while paragraph
    	five says it did not help performance as such, it made a difference
    	to major page faults. I'm aware of at least one bug for a large
    	vendor that was due to increased major faults.
    
    Commit 6457474 ("vmscan: detect mapped file pages used only once")
    greatly decreases the lifetime of single-used mapped file pages.
    Unfortunately it also decreases the lifetime of all shared mapped file
    pages, because after commit bf3f3bc ("mm: don't mark_page_accessed
    in fault path") the page-fault handler does not mark a page active or even
    referenced.
    
    Thus page_check_references() activates a file page only if it was used twice
    while it stays in the inactive list, whereas it activates anon pages after the
    first access.  The inactive list can be small, so the reclaimer can
    accidentally throw away a widely used page if it wasn't used twice in a
    short period.
    
    After this patch page_check_references() also activates a file mapped page at
    the first inactive list scan if this page is already used multiple times via
    several ptes.
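    
    A sketch of the decision for a mapped file page after this patch.  The
    enum and function names are hypothetical, and treating an unreferenced
    page as a reclaim candidate is an assumption added to make the example
    complete.

```c
#include <stdbool.h>

enum pageref_sketch {
	PAGEREF_SKETCH_RECLAIM,
	PAGEREF_SKETCH_KEEP,
	PAGEREF_SKETCH_ACTIVATE,
};

/*
 * referenced_ptes: number of ptes found referencing the page in this scan.
 * was_referenced:  the page was already marked referenced by an earlier scan.
 */
static enum pageref_sketch file_page_decision(unsigned int referenced_ptes,
					      bool was_referenced)
{
	if (referenced_ptes == 0)
		return PAGEREF_SKETCH_RECLAIM;  /* assumption: not in use */

	/* Second use, or shared via several ptes on the first scan. */
	if (was_referenced || referenced_ptes > 1)
		return PAGEREF_SKETCH_ACTIVATE;

	return PAGEREF_SKETCH_KEEP;             /* first use via one pte */
}
```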
    
    I found this while trying to fix a degradation in rhel6 (~2.6.32) from rhel5
    (~2.6.18).  There is a complete mess with >100 web/mail/spam/ftp containers;
    they share all their files but there are a lot of anonymous pages: ~500mb
    shared file mapped memory and 15-20Gb non-shared anonymous memory.  In
    this situation major-pagefaults are very costly, because all containers
    share the same page.  In my load the kernel created disproportionate
    pressure on the file memory compared with the anonymous; they equaled
    only if I raised swappiness up to 150 =)
    
    These patches did not actually help a lot with my problem, but I saw a
    noticeable (10-20 times) reduction in the count and average time of
    major-pagefaults in file-mapped areas.
    
    Actually both patches are fixes for commit v2.6.33-5448-g6457474, because
    it was aimed at one scenario (singly used pages), but it breaks the logic
    in other scenarios (shared and/or executable pages)
    
    Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
    Acked-by: Pekka Enberg <penberg@kernel.org>
    Acked-by: Minchan Kim <minchan.kim@gmail.com>
    Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
    Cc: Wu Fengguang <fengguang.wu@intel.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Nick Piggin <npiggin@kernel.dk>
    Cc: Mel Gorman <mel@csn.ul.ie>
    Cc: Shaohua Li <shaohua.li@intel.com>
    Cc: Rik van Riel <riel@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    koct9i committed with gregkh Jan 10, 2012
  12. @gregkh

    mm: vmscan: check if reclaim should really abort even if compaction_ready() is true for one zone
    
    commit 0cee34fd72c582b4f8ad8ce00645b75fb4168199 upstream.
    
    Stable note: Not tracked on Bugzilla. THP and compaction were found to
    	aggressively reclaim pages and stall systems under different
    	situations that were addressed piecemeal over time.
    
    If compaction can proceed for a given zone, shrink_zones() does not
    reclaim any more pages from it.  After commit [e0c2327: vmscan: abort
    reclaim/compaction if compaction can proceed], do_try_to_free_pages()
    tries to finish as soon as possible once one zone can compact.
    
    This was intended to prevent slabs being shrunk unnecessarily but there
    are side-effects.  One is that a small zone that is ready for compaction
    will abort reclaim even if the chances of successfully allocating a THP
    from that zone is small.  It also means that reclaim can return too early
    even though sc->nr_to_reclaim pages were not reclaimed.
    
    This partially reverts the commit until it is proven that slabs are really
    being shrunk unnecessarily but preserves the check to return 1 to avoid
    OOM if reclaim was aborted prematurely.
    
    [aarcange@redhat.com: This patch replaces a revert from Andrea]
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Reviewed-by: Rik van Riel <riel@redhat.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Minchan Kim <minchan.kim@gmail.com>
    Cc: Dave Jones <davej@redhat.com>
    Cc: Jan Kara <jack@suse.cz>
    Cc: Andy Isaacson <adi@hexapodia.org>
    Cc: Nai Xia <nai.xia@gmail.com>
    Cc: Johannes Weiner <jweiner@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Mel Gorman committed with gregkh Jan 12, 2012
  13. @gregkh

    mm: vmscan: do not OOM if aborting reclaim to start compaction

    commit 7335084d446b83cbcb15da80497d03f0c1dc9e21 upstream.
    
    Stable note: Not tracked in Bugzilla. This patch makes later patches
    	easier to apply but otherwise has little to justify it. The
    	problem it fixes was never observed but the source of the
    	theoretical problem did not exist for very long.
    
    During direct reclaim it is possible that reclaim will be aborted so that
    compaction can be attempted to satisfy a high-order allocation.  If this
    decision is made before any pages are reclaimed, it is possible that 0 is
    returned to the page allocator potentially triggering an OOM.  This has
    not been observed but it is a possibility so this patch addresses it.
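    
    The return-value rule can be sketched as follows; the function name and
    parameters are hypothetical.

```c
#include <stdbool.h>

/*
 * Value handed back to the page allocator.  Zero may be treated as
 * "no progress" and trigger an OOM, so an abort made in favour of
 * compaction reports 1 instead.
 */
static unsigned long reclaim_result_sketch(unsigned long nr_reclaimed,
					   bool aborted_for_compaction)
{
	if (nr_reclaimed)
		return nr_reclaimed;
	if (aborted_for_compaction)
		return 1;
	return 0;
}
```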
    
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Reviewed-by: Rik van Riel <riel@redhat.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Minchan Kim <minchan.kim@gmail.com>
    Cc: Dave Jones <davej@redhat.com>
    Cc: Jan Kara <jack@suse.cz>
    Cc: Andy Isaacson <adi@hexapodia.org>
    Cc: Nai Xia <nai.xia@gmail.com>
    Cc: Johannes Weiner <jweiner@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Mel Gorman committed with gregkh Jan 12, 2012
  14. @gregkh

    mm: vmscan: when reclaiming for compaction, ensure there are sufficient free pages available
    
    commit fe4b1b244bdb96136855f2c694071cb09d140766 upstream.
    
    Stable note: Not tracked on Bugzilla. THP and compaction were found to
    	aggressively reclaim pages and stall systems under different
    	situations that were addressed piecemeal over time. This patch
    	addresses a problem where the fix regressed THP allocation
    	success rates.
    
    In commit e0887c19 ("vmscan: limit direct reclaim for higher order
    allocations"), Rik noted that reclaim was too aggressive when THP was
    enabled.  In his initial patch he used the number of free pages to decide
    if reclaim should abort for compaction.  My feedback was that reclaim and
    compaction should be using the same logic when deciding if reclaim should
    be aborted.
    
    Unfortunately, this had the effect of reducing THP success rates when the
    workload included something like streaming reads that continually
    allocated pages.  The window during which compaction could run and return
    a THP was too small.
    
    This patch combines Rik's two patches together.  compaction_suitable() is
    still used to decide if reclaim should be aborted so that compaction can be
    used.  However, it will also ensure that there is a reasonable buffer of
    free pages available.  This improves upon the THP allocation success rates
    but bounds the number of pages that are freed for compaction.
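    
    A sketch of the combined check.  The struct, the gap ratio of 100 and the
    exact size of the buffer (high watermark plus a balance gap plus
    2UL << order pages) are assumptions for illustration; the text above only
    says that a reasonable buffer of free pages is required in addition to
    compaction_suitable().

```c
#include <stdbool.h>

#define BALANCE_GAP_RATIO_SKETCH 100UL /* assumed value */

struct zone_sketch {
	unsigned long free_pages;
	unsigned long low_wmark;
	unsigned long high_wmark;
	unsigned long present_pages;
};

static unsigned long min_ul(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

/* True if reclaim can stop in this zone and let compaction run. */
static bool compaction_ready_sketch(const struct zone_sketch *z,
				    unsigned int order,
				    bool compaction_suitable)
{
	unsigned long gap = min_ul(z->low_wmark,
		(z->present_pages + BALANCE_GAP_RATIO_SKETCH - 1) /
		BALANCE_GAP_RATIO_SKETCH);
	unsigned long watermark = z->high_wmark + gap + (2UL << order);

	if (z->free_pages < watermark)
		return false;           /* buffer too small: keep reclaiming */
	return compaction_suitable;     /* same test compaction itself uses */
}
```

    Because the buffer is a fixed function of the zone and the order, the
    number of pages freed on behalf of compaction stays bounded.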
    
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Reviewed-by: Rik van Riel <riel@redhat.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Minchan Kim <minchan.kim@gmail.com>
    Cc: Dave Jones <davej@redhat.com>
    Cc: Jan Kara <jack@suse.cz>
    Cc: Andy Isaacson <adi@hexapodia.org>
    Cc: Nai Xia <nai.xia@gmail.com>
    Cc: Johannes Weiner <jweiner@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Mel Gorman committed with gregkh Jan 12, 2012
  15. @gregkh

    mm: compaction: introduce sync-light migration for use by compaction

    commit a6bc32b899223a877f595ef9ddc1e89ead5072b8 upstream.
    
    Stable note: Not tracked in Bugzilla. This was part of a series that
    	reduced interactivity stalls experienced when THP was enabled.
    	These stalls were particularly noticeable when copying data
    	to a USB stick but the experiences for users varied a lot.
    
    This patch adds a lightweight sync migrate operation MIGRATE_SYNC_LIGHT
    mode that avoids writing back pages to backing storage.  Async compaction
    maps to MIGRATE_ASYNC while sync compaction maps to MIGRATE_SYNC_LIGHT.
    For other migrate_pages users such as memory hotplug, MIGRATE_SYNC is
    used.
    
    This avoids sync compaction stalling for an excessive length of time,
    particularly when copying files to a USB stick where there might be a
    large number of dirty pages backed by a filesystem that does not support
    ->writepages.
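    
    The mapping described above can be sketched like this.  The enum mirrors
    the three mode names from the text; the helper functions are hypothetical
    and only model the one property stated here, namely which mode may write
    pages back.

```c
#include <stdbool.h>

enum migrate_mode_sketch {
	MIGRATE_ASYNC,       /* async compaction */
	MIGRATE_SYNC_LIGHT,  /* sync compaction: no writeback */
	MIGRATE_SYNC,        /* other users, e.g. memory hotplug */
};

/* Compaction never asks for the full MIGRATE_SYNC mode. */
static enum migrate_mode_sketch compaction_migrate_mode(bool sync_compaction)
{
	return sync_compaction ? MIGRATE_SYNC_LIGHT : MIGRATE_ASYNC;
}

/* Only full sync migration may write pages back to backing storage. */
static bool mode_may_writeback(enum migrate_mode_sketch mode)
{
	return mode == MIGRATE_SYNC;
}
```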
    
    [aarcange@redhat.com: This patch is heavily based on Andrea's work]
    [akpm@linux-foundation.org: fix fs/nfs/write.c build]
    [akpm@linux-foundation.org: fix fs/btrfs/disk-io.c build]
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Reviewed-by: Rik van Riel <riel@redhat.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Minchan Kim <minchan.kim@gmail.com>
    Cc: Dave Jones <davej@redhat.com>
    Cc: Jan Kara <jack@suse.cz>
    Cc: Andy Isaacson <adi@hexapodia.org>
    Cc: Nai Xia <nai.xia@gmail.com>
    Cc: Johannes Weiner <jweiner@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Mel Gorman committed with gregkh Jan 12, 2012
  16. @gregkh

    kswapd: assign new_order and new_classzone_idx after wakeup in sleeping

    commit f0dfcde099453aa4c0dc42473828d15a6d492936 upstream.
    
    Stable note: Fixes https://bugzilla.redhat.com/show_bug.cgi?id=712019.  This
    	patch reduces kswapd CPU usage.
    
    There are 2 places to read pgdat in kswapd.  One is on return from a successful
    balance, the other is on wakeup from kswapd sleeping.  The new_order and
    new_classzone_idx represent the balance input order and classzone_idx.
    
    But currently new_order and new_classzone_idx are not assigned after
    kswapd_try_to_sleep(), which will cause a bug in the following scenario.
    
    1: after a successful balance, kswapd goes to sleep, and new_order = 0;
       new_classzone_idx = __MAX_NR_ZONES - 1;
    
    2: kswapd is woken up with order = 3 and classzone_idx = ZONE_NORMAL
    
    3: while balance_pgdat() is running, a new balance wakeup happens with
       order = 5, and classzone_idx = ZONE_NORMAL
    
    4: the first wakeup (order = 3) finishes successfully and returns order = 3,
       but new_order is still 0, so this balancing will be treated as a
       failed balance.  And then the second tighter balancing will be missed.
    
    So, to avoid the above problem, the new_order and new_classzone_idx need
    to be assigned for later successful comparison.
    
    Signed-off-by: Alex Shi <alex.shi@intel.com>
    Acked-by: Mel Gorman <mgorman@suse.de>
    Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
    Tested-by: Pádraig Brady <P@draigBrady.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Alex Shi committed with gregkh Oct 31, 2011
  17. @gregkh

    kswapd: avoid unnecessary rebalance after an unsuccessful balancing

    commit d2ebd0f6b89567eb93ead4e2ca0cbe03021f344b upstream.
    
    Stable note: Fixes https://bugzilla.redhat.com/show_bug.cgi?id=712019.  This
    	patch reduces kswapd CPU usage.
    
    In commit 215ddd6 ("mm: vmscan: only read new_classzone_idx from pgdat
    when reclaiming successfully"), Mel Gorman said it is better for kswapd to sleep
    after an unsuccessful balancing if there is a tighter reclaim request pending
    in the balancing.  But in the following scenario, kswapd does something that
    does not match our expectation.  The patch fixes this issue.
    
    1, Read pgdat request A (classzone_idx, order = 3)
    2, balance_pgdat()
    3, During pgdat, a new pgdat request B (classzone_idx, order = 5) is placed
    4, balance_pgdat() returns but has failed since the returned order = 0
    5, pgdat of request A is assigned to balance_pgdat(), and balancing is done
       again, whereas the expected behavior is that kswapd should try to sleep.
    
    Signed-off-by: Alex Shi <alex.shi@intel.com>
    Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
    Acked-by: Mel Gorman <mgorman@suse.de>
    Tested-by: Pádraig Brady <P@draigBrady.com>
    Cc: Rik van Riel <riel@redhat.com>
    Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Alex Shi committed with gregkh Oct 31, 2011
  18. @gregkh

    mm: compaction: make isolate_lru_page() filter-aware again

    commit c82449352854ff09e43062246af86bdeb628f0c3 upstream.
    
    Stable note: Not tracked in Bugzilla. A fix aimed at preserving page aging
    	information by reducing LRU list churning had the side-effect of
    	reducing THP allocation success rates. This was part of a series
    	to restore the success rates while preserving the reclaim fix.
    
    Commit 39deaf85 ("mm: compaction: make isolate_lru_page() filter-aware")
    noted that compaction does not migrate dirty or writeback pages and that
    it was meaningless to pick the page and re-add it to the LRU list.  This
    had to be partially reverted because some dirty pages can be migrated by
    compaction without blocking.
    
    This patch updates "mm: compaction: make isolate_lru_page" by skipping
    over pages that migration has no possibility of migrating to minimise LRU
    disruption.
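
    A minimal userspace sketch of the kind of filter described, under stated
    assumptions: struct page_model and its fields are hypothetical and only
    mirror the page state the text talks about (writeback, dirty, and whether
    the mapping has a ->migratepage callback).  It is not the kernel's
    isolation code.

```c
#include <stdbool.h>

/* Hypothetical, simplified view of the page state the filter looks at. */
struct page_model {
	bool writeback;        /* page is under writeback */
	bool dirty;            /* page is dirty */
	bool has_mapping;      /* page belongs to an address space */
	bool has_migratepage;  /* that mapping provides ->migratepage */
};

/* Return true if async migration has some chance of moving the page. */
static bool worth_isolating_for_async_migration(const struct page_model *p)
{
	/* All a caller could do with a page under writeback is block. */
	if (p->writeback)
		return false;

	/* Dirty pages need a ->migratepage callback unless they have no mapping. */
	if (p->dirty && p->has_mapping && !p->has_migratepage)
		return false;

	return true;
}
```

    Pages rejected by such a filter stay where they are on the LRU, which is
    the point of skipping them.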
    
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Reviewed-by: Rik van Riel <riel@redhat.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Reviewed-by: Minchan Kim <minchan@kernel.org>
    Cc: Dave Jones <davej@redhat.com>
    Cc: Jan Kara <jack@suse.cz>
    Cc: Andy Isaacson <adi@hexapodia.org>
    Cc: Nai Xia <nai.xia@gmail.com>
    Cc: Johannes Weiner <jweiner@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Mel Gorman committed with gregkh Jan 12, 2012
  19. @gregkh

    mm: page allocator: do not call direct reclaim for THP allocations while compaction is deferred
    
    commit 66199712e9eef5aede09dbcd9dfff87798a66917 upstream.
    
    Stable note: Not tracked in Bugzilla. This was part of a series that
    	reduced interactivity stalls experienced when THP was enabled.
    
    If compaction is deferred, direct reclaim is used to try to free enough
    pages for the allocation to succeed.  For small high-orders, this has a
    reasonable chance of success.  However, if the caller has specified
    __GFP_NO_KSWAPD to limit the disruption to the system, it makes more sense
    to fail the allocation rather than stall the caller in direct reclaim.
    This patch skips direct reclaim if compaction is deferred and the caller
    specifies __GFP_NO_KSWAPD.
    
    Async compaction only considers a subset of pages so it is possible for
    compaction to be deferred prematurely and not enter direct reclaim even in
    cases where it should.  To compensate for this, this patch also defers
    compaction only if sync compaction failed.
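
    The two rules above can be sketched as a pair of predicates.  This is an
    illustrative userspace model only: MODEL_GFP_NO_KSWAPD is a made-up flag
    value standing in for __GFP_NO_KSWAPD, and both function names are
    hypothetical.

```c
#include <stdbool.h>

/* Hypothetical flag value; the real __GFP_NO_KSWAPD bit differs. */
#define MODEL_GFP_NO_KSWAPD 0x1u

/*
 * Fail the allocation instead of entering direct reclaim when compaction
 * is deferred and the caller asked not to disrupt the system.
 */
static bool skip_direct_reclaim(bool deferred_compaction, unsigned int gfp_mask)
{
	return deferred_compaction && (gfp_mask & MODEL_GFP_NO_KSWAPD) != 0;
}

/*
 * Only a failed sync compaction defers future compaction; an async failure
 * saw just a subset of pages and is not treated as conclusive.
 */
static bool should_defer_compaction(bool compaction_failed, bool sync_compaction)
{
	return compaction_failed && sync_compaction;
}
```

    A caller without the flag still falls through to direct reclaim, so the
    behaviour for ordinary high-order allocations is unchanged in this model.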
    
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Acked-by: Minchan Kim <minchan.kim@gmail.com>
    Reviewed-by: Rik van Riel <riel@redhat.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Dave Jones <davej@redhat.com>
    Cc: Jan Kara <jack@suse.cz>
    Cc: Andy Isaacson <adi@hexapodia.org>
    Cc: Nai Xia <nai.xia@gmail.com>
    Cc: Johannes Weiner <jweiner@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Mel Gorman committed with gregkh Jan 12, 2012
  20. @gregkh

    mm: compaction: determine if dirty pages can be migrated without blocking within ->migratepage
    
    commit b969c4ab9f182a6e1b2a0848be349f99714947b0 upstream.
    
    Stable note: Not tracked in Bugzilla. A fix aimed at preserving page
    	aging information by reducing LRU list churning had the side-effect
    	of reducing THP allocation success rates. This was part of a series
    	to restore the success rates while preserving the reclaim fix.
    
    Asynchronous compaction is used when allocating transparent hugepages to
    avoid blocking for long periods of time.  Due to reports of stalling,
    there was a debate on disabling synchronous compaction but this severely
    impacted allocation success rates.  Part of the reason was that many dirty
    pages are skipped in asynchronous compaction by the following check:
    
    	if (PageDirty(page) && !sync &&
    		mapping->a_ops->migratepage != migrate_page)
    			rc = -EBUSY;
    
    This skips over all mapping aops using buffer_migrate_page() even though
    it is possible to migrate some of these pages without blocking.  This
    patch updates the ->migratepage callback with a "sync" parameter.  It is
    the responsibility of the callback to fail gracefully if migration would
    block.
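
    As a rough illustration of the contract, here is a userspace sketch of a
    callback that honours a "sync" argument.  The types and names
    (struct mig_page, migratepage_fn, example_migratepage) are hypothetical
    and far simpler than the real ->migratepage signature; the only point is
    that the callback itself returns -EBUSY when it would have to block and
    sync is false.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical page model: would moving it require blocking work? */
struct mig_page {
	bool would_block;  /* e.g. a buffer lock is held by someone else */
	bool migrated;     /* set once the page has been moved */
};

/* Hypothetical callback type carrying the new "sync" argument. */
typedef int (*migratepage_fn)(struct mig_page *page, bool sync);

static int example_migratepage(struct mig_page *page, bool sync)
{
	/* Async callers must not stall: fail gracefully instead. */
	if (page->would_block && !sync)
		return -EBUSY;

	page->migrated = true;
	return 0;
}
```

    With this shape the generic migration code no longer needs to guess which
    mappings are safe for async migration; each callback decides per page.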
    
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Reviewed-by: Rik van Riel <riel@redhat.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Minchan Kim <minchan.kim@gmail.com>
    Cc: Dave Jones <davej@redhat.com>
    Cc: Jan Kara <jack@suse.cz>
    Cc: Andy Isaacson <adi@hexapodia.org>
    Cc: Nai Xia <nai.xia@gmail.com>
    Cc: Johannes Weiner <jweiner@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Mel Gorman committed with gregkh Jan 12, 2012
  21. @gregkh

    mm: compaction: allow compaction to isolate dirty pages

    commit a77ebd333cd810d7b680d544be88c875131c2bd3 upstream.
    
    Stable note: Not tracked in Bugzilla. A fix aimed at preserving page aging
    	information by reducing LRU list churning had the side-effect of
    	reducing THP allocation success rates. This was part of a series
    	to restore the success rates while preserving the reclaim fix.
    
    Short summary: There are severe stalls when a USB stick using VFAT is
    used with THP enabled that are reduced by this series.  If you are
    experiencing this problem, please test and report back.  Considering I
    have seen complaints from openSUSE and Fedora users on this as well as a
    few private mails, I'm guessing it's a widespread issue.  This is a new
    type of USB-related stall because it is due to synchronous compaction
    writing, whereas in the past the big problem was dirty pages reaching
    the end of the LRU and being written by reclaim.
    
    Am cc'ing Andrew this time and this series would replace
    mm-do-not-stall-in-synchronous-compaction-for-thp-allocations.patch.
    I'm also cc'ing Dave Jones as he might have merged that patch to Fedora
    for wider testing and ideally it would be reverted and replaced by this
    series.
    
    That said, the later patches could really do with some review.  If this
    series is not the answer then a new direction needs to be discussed
    because as it is, the stalls are unacceptable as the results in this
    leader show.
    
    For testers that try backporting this to 3.1, it won't work because
    there is a non-obvious dependency on not writing back pages in direct
    reclaim so you need those patches too.
    
    Changelog since V5
    o Rebase to 3.2-rc5
    o Tidy up the changelogs a bit
    
    Changelog since V4
    o Added reviewed-bys, credited Andrea properly for sync-light
    o Allow dirty pages without mappings to be considered for migration
    o Bound the number of pages freed for compaction
    o Isolate PageReclaim pages on their own LRU list
    
    This is against 3.2-rc5 and follows on from discussions on "mm: Do
    not stall in synchronous compaction for THP allocations" and "[RFC
    PATCH 0/5] Reduce compaction-related stalls". Initially, the proposed
    patch eliminated stalls due to compaction which sometimes resulted in
    user-visible interactivity problems on browsers by simply never using
    sync compaction. The downside was that THP allocation success rates
    were lower because dirty pages were not being migrated as reported by
    Andrea. His approach at fixing this was nacked on the grounds that
    it reverted fixes from Rik merged that reduced the number of pages
    reclaimed as it severely impacted his workloads' performance.
    
    This series attempts to reconcile the requirements of maximising THP
    usage, without stalling in a user-visible fashion due to compaction
    or cheating by reclaiming an excessive number of pages.
    
    Patch 1 partially reverts commit 39deaf85 to allow migration to isolate
    	dirty pages. This is because migration can move some dirty
    	pages without blocking.
    
    Patch 2 notes that the /proc/sys/vm/compact_memory handler is not using
    	synchronous compaction when it should be. This is unrelated
    	to the reported stalls but is worth fixing.
    
    Patch 3 checks if we isolated a compound page during lumpy scan and
    	account for it properly. For the most part, this affects
    	tracing so it's unrelated to the stalls but worth fixing.
    
    Patch 4 notes that it is possible to abort reclaim early for compaction
    	and return 0 to the page allocator potentially entering the
    	"may oom" path. This has not been observed in practice but
    	the rest of the series potentially makes it easier to happen.
    
    Patch 5 adds a sync parameter to the migratepage callback and gives
    	the callback responsibility for migrating the page without
    	blocking if sync==false. For example, fallback_migrate_page
    	will not call writepage if sync==false. This increases the
    	number of pages that can be handled by asynchronous compaction
    	thereby reducing stalls.
    
    Patch 6 restores filter-awareness to isolate_lru_page for migration.
    	In practice, it means that pages under writeback and pages
    	without a ->migratepage callback will not be isolated
    	for migration.
    
    Patch 7 avoids calling direct reclaim if compaction is deferred but
    	makes sure that compaction is only deferred if sync
    	compaction was used.
    
    Patch 8 introduces a sync-light migration mechanism that sync compaction
    	uses. The objective is to allow some stalls but to not call
    	->writepage which can lead to significant user-visible stalls.
    
    Patch 9 notes that while we want to abort reclaim ASAP to allow
    	compaction to go ahead that we leave a very small window of
    	opportunity for compaction to run. This patch allows more pages
    	to be freed by reclaim but bounds the number to a reasonable
    	level based on the high watermark on each zone.
    
    Patch 10 allows slabs to be shrunk even after compaction_ready() is
    	true for one zone. This is to avoid a problem whereby a single
    	small zone can abort reclaim even though no pages have been
    	reclaimed and no suitably large zone is in a usable state.
    
    Patch 11 fixes a problem with the rate of page scanning. As reclaim is
    	rarely stalling on pages under writeback it means that scan
    	rates are very high. This is particularly true for direct
    	reclaim which is not calling writepage. The vmstat figures
    	implied that much of this was busy work with PageReclaim pages
    	marked for immediate reclaim. This patch is a prototype that
    	moves these pages to their own LRU list.
    
    This has been tested and other than 2 USB keys getting trashed,
    nothing horrible fell out. That said, I am a bit unhappy with the
    rescue logic in patch 11 but did not find a better way around it. It
    does significantly reduce scan rates and System CPU time indicating
    it is the right direction to take.
    
    What is of critical importance is that stalls due to compaction
    are massively reduced even though sync compaction was still
    allowed. Testing from people complaining about stalls copying to USBs
    with THP enabled are particularly welcome.
    
    The following tests all involve THP usage and USB keys in some
    way. Each test follows this type of pattern
    
    1. Read from some fast storage, be it raw device or file. Each time
       the copy finishes, start again until the test ends
    2. Write a large file to a filesystem on a USB stick. Each time the copy
       finishes, start again until the test ends
    3. When memory is low, start an alloc process that creates a mapping
       the size of physical memory to stress THP allocation. This is the
       "real" part of the test and the part that is meant to trigger
       stalls when THP is enabled. Copying continues in the background.
    4. Record the CPU usage and time to execute of the alloc process
    5. Record the number of THP allocs and fallbacks as well as the number of THP
       pages in use at the end of the test just before alloc exited
    6. Run the test 5 times to get an idea of variability
    7. Between each run, sync is run and caches dropped and the test
       waits until nr_dirty is a small number to avoid interference
       or caching between iterations that would skew the figures.
    
    The individual tests were then
    
    writebackCPDeviceBasevfat
    	Disable THP, read from a raw device (sda), vfat on USB stick
    writebackCPDeviceBaseext4
    	Disable THP, read from a raw device (sda), ext4 on USB stick
    writebackCPDevicevfat
    	THP enabled, read from a raw device (sda), vfat on USB stick
    writebackCPDeviceext4
    	THP enabled, read from a raw device (sda), ext4 on USB stick
    writebackCPFilevfat
    	THP enabled, read from a file on fast storage and USB, both vfat
    writebackCPFileext4
    	THP enabled, read from a file on fast storage and USB, both ext4
    
    The kernels tested were
    
    3.1		3.1
    vanilla		3.2-rc5
    freemore	Patches 1-10
    immediate	Patches 1-11
    andrea		The 8 patches Andrea posted as a basis of comparison
    
    The results are very long unfortunately. I'll start with the case
    where we are not using THP at all
    
    writebackCPDeviceBasevfat
                       3.1.0-vanilla         rc5-vanilla       freemore-v6r1        isolate-v6r1         andrea-v2r1
    System Time         1.28 (    0.00%)   54.49 (-4143.46%)   48.63 (-3687.69%)    4.69 ( -265.11%)   51.88 (-3940.81%)
    +/-                 0.06 (    0.00%)    2.45 (-4305.55%)    4.75 (-8430.57%)    7.46 (-13282.76%)    4.76 (-8440.70%)
    User Time           0.09 (    0.00%)    0.05 (   40.91%)    0.06 (   29.55%)    0.07 (   15.91%)    0.06 (   27.27%)
    +/-                 0.02 (    0.00%)    0.01 (   45.39%)    0.02 (   25.07%)    0.00 (   77.06%)    0.01 (   52.24%)
    Elapsed Time      110.27 (    0.00%)   56.38 (   48.87%)   49.95 (   54.70%)   11.77 (   89.33%)   53.43 (   51.54%)
    +/-                 7.33 (    0.00%)    3.77 (   48.61%)    4.94 (   32.63%)    6.71 (    8.50%)    4.76 (   35.03%)
    THP Active          0.00 (    0.00%)    0.00 (    0.00%)    0.00 (    0.00%)    0.00 (    0.00%)    0.00 (    0.00%)
    +/-                 0.00 (    0.00%)    0.00 (    0.00%)    0.00 (    0.00%)    0.00 (    0.00%)    0.00 (    0.00%)
    Fault Alloc         0.00 (    0.00%)    0.00 (    0.00%)    0.00 (    0.00%)    0.00 (    0.00%)    0.00 (    0.00%)
    +/-                 0.00 (    0.00%)    0.00 (    0.00%)    0.00 (    0.00%)    0.00 (    0.00%)    0.00 (    0.00%)
    Fault Fallback      0.00 (    0.00%)    0.00 (    0.00%)    0.00 (    0.00%)    0.00 (    0.00%)    0.00 (    0.00%)
    +/-                 0.00 (    0.00%)    0.00 (    0.00%)    0.00 (    0.00%)    0.00 (    0.00%)    0.00 (    0.00%)
    
    The THP figures are obviously all 0 because THP was disabled. The
    main thing to watch is the elapsed times and how they compare to
    times when THP is enabled later. It's also important to note that
    elapsed time is improved by this series as System CPU time is much
    reduced.
    
    writebackCPDevicevfat
    
                       3.1.0-vanilla         rc5-vanilla       freemore-v6r1        isolate-v6r1         andrea-v2r1
    System Time         1.22 (    0.00%)   13.89 (-1040.72%)   46.40 (-3709.20%)    4.44 ( -264.37%)   47.37 (-3789.33%)
    +/-                 0.06 (    0.00%)   22.82 (-37635.56%)    3.84 (-6249.44%)    6.48 (-10618.92%)    6.60 (-10818.53%)
    User Time           0.06 (    0.00%)    0.06 (   -6.90%)    0.05 (   17.24%)    0.05 (   13.79%)    0.04 (   31.03%)
    +/-                 0.01 (    0.00%)    0.01 (   33.33%)    0.01 (   33.33%)    0.01 (   39.14%)    0.01 (   25.46%)
    Elapsed Time     10445.54 (    0.00%) 2249.92 (   78.46%)   70.06 (   99.33%)   16.59 (   99.84%)  472.43 (   95.48%)
    +/-               643.98 (    0.00%)  811.62 (  -26.03%)   10.02 (   98.44%)    7.03 (   98.91%)   59.99 (   90.68%)
    THP Active         15.60 (    0.00%)   35.20 (  225.64%)   65.00 (  416.67%)   70.80 (  453.85%)   62.20 (  398.72%)
    +/-                18.48 (    0.00%)   51.29 (  277.59%)   15.99 (   86.52%)   37.91 (  205.18%)   22.02 (  119.18%)
    Fault Alloc       121.80 (    0.00%)   76.60 (   62.89%)  155.40 (  127.59%)  181.20 (  148.77%)  286.60 (  235.30%)
    +/-                73.51 (    0.00%)   61.11 (   83.12%)   34.89 (   47.46%)   31.88 (   43.36%)   68.13 (   92.68%)
    Fault Fallback    881.20 (    0.00%)  926.60 (   -5.15%)  847.60 (    3.81%)  822.00 (    6.72%)  716.60 (   18.68%)
    +/-                73.51 (    0.00%)   61.26 (   16.67%)   34.89 (   52.54%)   31.65 (   56.94%)   67.75 (    7.84%)
    MMTests Statistics: duration
    User/Sys Time Running Test (seconds)       3540.88   1945.37    716.04     64.97   1937.03
    Total Elapsed Time (seconds)              52417.33  11425.90    501.02    230.95   2520.28
    
    The first thing to note is the "Elapsed Time" for the vanilla kernels
    of 2249 seconds versus 56 with THP disabled which might explain the
    reports of USB stalls with THP enabled. Applying the patches brings
    performance in line with THP-disabled performance while isolating
    pages for immediate reclaim from the LRU cuts down System CPU time.
    
    The "Fault Alloc" success rate figures are also improved. The vanilla
    kernel only managed to allocate 76.6 pages on average over the course
    of 5 iterations whereas applying the series allocated 181.20 on
    average, albeit well within variance. It's worth noting that
    applying the series at least decreases the amount of variance which
    implies an improvement.
    
    Andrea's series had a higher success rate for THP allocations but
    at a severe cost to elapsed time which is still better than vanilla
    but still much worse than disabling THP altogether. One can bring my
    series close to Andrea's by removing this check
    
            /*
             * If compaction is deferred for high-order allocations, it is because
             * sync compaction recently failed. If this is the case and the caller
             * has requested the system not be heavily disrupted, fail the
             * allocation now instead of entering direct reclaim
             */
            if (deferred_compaction && (gfp_mask & __GFP_NO_KSWAPD))
                    goto nopage;
    
    I didn't include a patch that removed the above check because hurting
    overall performance to improve the THP figure is not what the average
    user wants. It's something to consider though if someone really wants
    to maximise THP usage no matter what it does to the workload initially.
    
    This is a summary of vmstat figures from the same test.
    
                                           3.1.0-vanilla rc5-vanilla freemore-v6r1 isolate-v6r1 andrea-v2r1
    Page Ins                                  3257266139  1111844061    17263623    10901575   161423219
    Page Outs                                   81054922    30364312     3626530     3657687     8753730
    Swap Ins                                        3294        2851        6560        4964        4592
    Swap Outs                                     390073      528094      620197      790912      698285
    Direct pages scanned                      1077581700  3024951463  1764930052   115140570  5901188831
    Kswapd pages scanned                        34826043     7112868     2131265     1686942     1893966
    Kswapd pages reclaimed                      28950067     4911036     1246044      966475     1497726
    Direct pages reclaimed                     805148398   280167837     3623473     2215044    40809360
    Kswapd efficiency                                83%         69%         58%         57%         79%
    Kswapd velocity                              664.399     622.521    4253.852    7304.360     751.490
    Direct efficiency                                74%          9%          0%          1%          0%
    Direct velocity                            20557.737  264745.137 3522673.849  498551.938 2341481.435
    Percentage direct scans                          96%         99%         99%         98%         99%
    Page writes by reclaim                        722646      529174      620319      791018      699198
    Page writes file                              332573        1080         122         106         913
    Page writes anon                              390073      528094      620197      790912      698285
    Page reclaim immediate                             0  2552514720  1635858848   111281140  5478375032
    Page rescued immediate                             0           0           0       87848           0
    Slabs scanned                                  23552       23552        9216        8192        9216
    Direct inode steals                              231           0           0           0           0
    Kswapd inode steals                                0           0           0           0           0
    Kswapd skipped wait                            28076         786           0          61           6
    THP fault alloc                                  609         383         753         906        1433
    THP collapse alloc                                12           6           0           0           6
    THP splits                                       536         211         456         593        1136
    THP fault fallback                              4406        4633        4263        4110        3583
    THP collapse fail                                120         127           0           0           4
    Compaction stalls                               1810         728         623         779        3200
    Compaction success                               196          53          60          80         123
    Compaction failures                             1614         675         563         699        3077
    Compaction pages moved                        193158       53545      243185      333457      226688
    Compaction move failure                         9952        9396       16424       23676       45070
    
    The main things to look at are
    
    1. Page In/out figures are much reduced by the series.
    
    2. Direct page scanning is incredibly high (264745.137 pages scanned
       per second on the vanilla kernel) but isolating PageReclaim pages
       on their own list reduces the number of pages scanned significantly.
    
    3. The fact that "Page rescued immediate" is a positive number implies
       that we sometimes race removing pages from the LRU_IMMEDIATE list
       that need to be put back on a normal LRU but it happens only for
       0.07% of the pages marked for immediate reclaim.
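
    For readers checking the derived rows, the sketch below shows how they
    appear to be computed.  This is an inference from the figures in the
    table, not a statement about the MMTests scripts: velocity looks like
    pages scanned divided by the "Total Elapsed Time" of the run, and
    efficiency looks like reclaimed over scanned as a truncated whole
    percentage.  The function names are made up for illustration.

```c
/* Pages scanned per second of total elapsed test time. */
static double scan_velocity(unsigned long long scanned, double elapsed_secs)
{
	return (double)scanned / elapsed_secs;
}

/* Reclaimed pages as a whole percentage of scanned pages, truncated. */
static unsigned long long reclaim_efficiency(unsigned long long reclaimed,
					     unsigned long long scanned)
{
	return scanned ? reclaimed * 100 / scanned : 0;
}

/* Rescued pages as a percentage of pages marked for immediate reclaim. */
static double rescue_percentage(unsigned long long rescued,
				unsigned long long immediate)
{
	return immediate ? 100.0 * (double)rescued / (double)immediate : 0.0;
}
```

    With the rc5-vanilla column (3024951463 direct pages scanned over
    11425.90 seconds) scan_velocity() gives roughly 264745 pages per second,
    and rescue_percentage(87848, 111281140) gives a little under 0.08%.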
    
    writebackCPDeviceext4
                       3.1.0-vanilla         rc5-vanilla       freemore-v6r1        isolate-v6r1         andrea-v2r1
    System Time         1.51 (    0.00%)    1.77 (  -17.66%)    1.46 (    2.92%)    1.15 (   23.77%)    1.89 (  -25.63%)
    +/-                 0.27 (    0.00%)    0.67 ( -148.52%)    0.33 (  -22.76%)    0.30 (  -11.15%)    0.19 (   30.16%)
    User Time           0.03 (    0.00%)    0.04 (  -37.50%)    0.05 (  -62.50%)    0.07 ( -112.50%)    0.04 (  -18.75%)
    +/-                 0.01 (    0.00%)    0.02 ( -146.64%)    0.02 (  -97.91%)    0.02 (  -75.59%)    0.02 (  -63.30%)
    Elapsed Time      124.93 (    0.00%)  114.49 (    8.36%)   96.77 (   22.55%)   27.48 (   78.00%)  205.70 (  -64.65%)
    +/-                20.20 (    0.00%)   74.39 ( -268.34%)   59.88 ( -196.48%)    7.72 (   61.79%)   25.03 (  -23.95%)
    THP Active        161.80 (    0.00%)   83.60 (   51.67%)  141.20 (   87.27%)   84.60 (   52.29%)   82.60 (   51.05%)
    +/-                71.95 (    0.00%)   43.80 (   60.88%)   26.91 (   37.40%)   59.02 (   82.03%)   52.13 (   72.45%)
    Fault Alloc       471.40 (    0.00%)  228.60 (   48.49%)  282.20 (   59.86%)  225.20 (   47.77%)  388.40 (   82.39%)
    +/-                88.07 (    0.00%)   87.42 (   99.26%)   73.79 (   83.78%)  109.62 (  124.47%)   82.62 (   93.81%)
    Fault Fallback    531.60 (    0.00%)  774.60 (  -45.71%)  720.80 (  -35.59%)  777.80 (  -46.31%)  614.80 (  -15.65%)
    +/-                88.07 (    0.00%)   87.26 (    0.92%)   73.79 (   16.22%)  109.62 (  -24.47%)   82.29 (    6.56%)
    MMTests Statistics: duration
    User/Sys Time Running Test (seconds)         50.22     33.76     30.65     24.14    128.45
    Total Elapsed Time (seconds)               1113.73   1132.19   1029.45    759.49   1707.26
    
    Similar test but the USB stick is using ext4 instead of vfat. As
    ext4 does not use writepage for migration, the large stalls due to
    compaction when THP is enabled are not observed. Still, isolating
    PageReclaim pages on their own list helped completion time largely
    by reducing the number of pages scanned by direct reclaim although
    time spent in congestion_wait could also be a factor.
    
    Again, Andrea's series had far higher success rates for THP allocation
    at the cost of elapsed time. I didn't look too closely but a quick
    look at the vmstat figures tells me kswapd reclaimed 8 times more pages
    than the patch series and direct reclaim reclaimed roughly three times
    as many pages. It follows that if memory is aggressively reclaimed,
    there will be more available for THP.
    
    writebackCPFilevfat
                       3.1.0-vanilla         rc5-vanilla       freemore-v6r1        isolate-v6r1         andrea-v2r1
    System Time         1.76 (    0.00%)   29.10 (-1555.52%)   46.01 (-2517.18%)    4.79 ( -172.35%)   54.89 (-3022.53%)
    +/-                 0.14 (    0.00%)   25.61 (-18185.17%)    2.15 (-1434.83%)    6.60 (-4610.03%)    9.75 (-6863.76%)
    User Time           0.05 (    0.00%)    0.07 (  -45.83%)    0.05 (   -4.17%)    0.06 (  -29.17%)    0.06 (  -16.67%)
    +/-                 0.02 (    0.00%)    0.02 (   20.11%)    0.02 (   -3.14%)    0.01 (   31.58%)    0.01 (   47.41%)
    Elapsed Time     22520.79 (    0.00%) 1082.85 (   95.19%)   73.30 (   99.67%)   32.43 (   99.86%)  291.84 (  98.70%)
    +/-              7277.23 (    0.00%)  706.29 (   90.29%)   19.05 (   99.74%)   17.05 (   99.77%)  125.55 (   98.27%)
    THP Active         83.80 (    0.00%)   12.80 (   15.27%)   15.60 (   18.62%)   13.00 (   15.51%)    0.80 (    0.95%)
    +/-                66.81 (    0.00%)   20.19 (   30.22%)    5.92 (    8.86%)   15.06 (   22.54%)    1.17 (    1.75%)
    Fault Alloc       171.00 (    0.00%)   67.80 (   39.65%)   97.40 (   56.96%)  125.60 (   73.45%)  133.00 (   77.78%)
    +/-                82.91 (    0.00%)   30.69 (   37.02%)   53.91 (   65.02%)   55.05 (   66.40%)   21.19 (   25.56%)
    Fault Fallback    832.00 (    0.00%)  935.20 (  -12.40%)  906.00 (   -8.89%)  877.40 (   -5.46%)  870.20 (   -4.59%)
    +/-                82.91 (    0.00%)   30.69 (   62.98%)   54.01 (   34.86%)   55.05 (   33.60%)   20.91 (   74.78%)
    MMTests Statistics: duration
    User/Sys Time Running Test (seconds)       7229.81    928.42    704.52     80.68   1330.76
    Total Elapsed Time (seconds)             112849.04   5618.69    571.11    360.54   1664.28
    
    In this case, the test is reading/writing only from filesystems but as
    it's vfat, it's slow due to calling writepage during compaction. Little
    to observe really - the time to complete the test goes way down
    with the series applied and THP allocation success rates go up in
    comparison to 3.2-rc5.  The success rates are lower than 3.1.0 but
    the elapsed time for that kernel is abysmal so it is not really a
    sensible comparison.
    
    As before, Andrea's series allocates more THPs at the cost of overall
    performance.
    
    writebackCPFileext4
                       3.1.0-vanilla         rc5-vanilla       freemore-v6r1        isolate-v6r1         andrea-v2r1
    System Time         1.51 (    0.00%)    1.77 (  -17.66%)    1.46 (    2.92%)    1.15 (   23.77%)    1.89 (  -25.63%)
    +/-                 0.27 (    0.00%)    0.67 ( -148.52%)    0.33 (  -22.76%)    0.30 (  -11.15%)    0.19 (   30.16%)
    User Time           0.03 (    0.00%)    0.04 (  -37.50%)    0.05 (  -62.50%)    0.07 ( -112.50%)    0.04 (  -18.75%)
    +/-                 0.01 (    0.00%)    0.02 ( -146.64%)    0.02 (  -97.91%)    0.02 (  -75.59%)    0.02 (  -63.30%)
    Elapsed Time      124.93 (    0.00%)  114.49 (    8.36%)   96.77 (   22.55%)   27.48 (   78.00%)  205.70 (  -64.65%)
    +/-                20.20 (    0.00%)   74.39 ( -268.34%)   59.88 ( -196.48%)    7.72 (   61.79%)   25.03 (  -23.95%)
    THP Active        161.80 (    0.00%)   83.60 (   51.67%)  141.20 (   87.27%)   84.60 (   52.29%)   82.60 (   51.05%)
    +/-                71.95 (    0.00%)   43.80 (   60.88%)   26.91 (   37.40%)   59.02 (   82.03%)   52.13 (   72.45%)
    Fault Alloc       471.40 (    0.00%)  228.60 (   48.49%)  282.20 (   59.86%)  225.20 (   47.77%)  388.40 (   82.39%)
    +/-                88.07 (    0.00%)   87.42 (   99.26%)   73.79 (   83.78%)  109.62 (  124.47%)   82.62 (   93.81%)
    Fault Fallback    531.60 (    0.00%)  774.60 (  -45.71%)  720.80 (  -35.59%)  777.80 (  -46.31%)  614.80 (  -15.65%)
    +/-                88.07 (    0.00%)   87.26 (    0.92%)   73.79 (   16.22%)  109.62 (  -24.47%)   82.29 (    6.56%)
    MMTests Statistics: duration
    User/Sys Time Running Test (seconds)         50.22     33.76     30.65     24.14    128.45
    Total Elapsed Time (seconds)               1113.73   1132.19   1029.45    759.49   1707.26
    
    Same type of story - elapsed times go down. In this case, allocation
    success rates are roughly the same. As before, Andrea's has higher
    success rates but takes a lot longer.
    
    Overall the series does reduce latencies and while the tests are
    inherently racy as alloc competes with the cp processes, the variability
    was included. The THP allocation rates are not as high as they could
    be but that is because we would have to be more aggressive about
    reclaim and compaction impacting overall performance.
    
    This patch:
    
    Commit 39deaf85 ("mm: compaction: make isolate_lru_page() filter-aware")
    noted that compaction does not migrate dirty or writeback pages and that
    it was meaningless to pick the page and re-add it to the LRU list.
    
    What was missed during review is that asynchronous migration moves dirty
    pages if their ->migratepage callback is migrate_page() because these can
    be moved without blocking.  This potentially impacted hugepage allocation
    success rates by a factor depending on how many dirty pages are in the
    system.
    
    This patch partially reverts 39deaf85 to allow migration to isolate dirty
    pages again.  This increases how much compaction disrupts the LRU but that
    is addressed later in the series.
    
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
    Reviewed-by: Rik van Riel <riel@redhat.com>
    Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
    Cc: Dave Jones <davej@redhat.com>
    Cc: Jan Kara <jack@suse.cz>
    Cc: Andy Isaacson <adi@hexapodia.org>
    Cc: Nai Xia <nai.xia@gmail.com>
    Cc: Johannes Weiner <jweiner@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Mel Gorman committed with gregkh Jan 12, 2012
  22. @gregkh

    mm: migration: clean up unmap_and_move()

    commit 0dabec93de633a87adfbbe1d800a4c56cd19d73b upstream.
    
    Stable note: Not tracked in Bugzilla. This patch makes later patches
    	easier to apply but has no other impact.
    
    unmap_and_move() is a big messy function.  Clean it up.
    
    Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
    Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
    Cc: Mel Gorman <mgorman@suse.de>
    Cc: Rik van Riel <riel@redhat.com>
    Cc: Michal Hocko <mhocko@suse.cz>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Minchan Kim committed with gregkh Oct 31, 2011
  23. @gregkh

    mm: zone_reclaim: make isolate_lru_page() filter-aware

    commit f80c0673610e36ae29d63e3297175e22f70dde5f upstream.
    
    Stable note: Not tracked in Bugzilla. THP and compaction disrupt the LRU list
    	leading to poor reclaim decisions which has a variable
    	performance impact.
    
    In the __zone_reclaim case, we don't want to shrink mapped pages.
    Nonetheless, we isolate mapped pages and re-add them to the head of the
    LRU.  This is unnecessary CPU overhead and causes LRU churning.
    
    Of course, when we isolate the page, the page might be mapped but by the
    time we try to migrate it, it might no longer be mapped.  So it could be
    migrated.  But the race is rare and even when it happens, it's no big deal.
    
    Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
    Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
    Reviewed-by: Michal Hocko <mhocko@suse.cz>
    Cc: Mel Gorman <mgorman@suse.de>
    Cc: Rik van Riel <riel@redhat.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Minchan Kim committed with gregkh Oct 31, 2011
  24. @gregkh

    mm: compaction: make isolate_lru_page() filter-aware

    commit 39deaf8585152f1a35c1676d3d7dc6ae0fb65967 upstream.
    
    Stable note: Not tracked in Bugzilla. THP and compaction disrupt the LRU
    	list leading to poor reclaim decisions which has a variable
    	performance impact.
    
    In async mode, compaction doesn't migrate dirty or writeback pages.  So,
    it's meaningless to pick the page and re-add it to lru list.
    
    Of course, when we isolate the page in compaction, the page might be dirty
    or under writeback but by the time we try to migrate it, it might no longer
    be dirty or under writeback.  So it could be migrated.  But that is very
    unlikely as the isolate and migration cycle is much faster than writeout.
    
    So, this patch reduces CPU overhead and prevents unnecessary LRU churning.
    
    Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
    Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
    Acked-by: Mel Gorman <mgorman@suse.de>
    Acked-by: Rik van Riel <riel@redhat.com>
    Reviewed-by: Michal Hocko <mhocko@suse.cz>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Minchan Kim committed with gregkh Oct 31, 2011
  25. @gregkh

    mm: change isolate mode from #define to bitwise type

    commit 4356f21d09283dc6d39a6f7287a65ddab61e2808 upstream.
    
    Stable note: Not tracked in Bugzilla. This patch makes later patches
    	easier to apply but has no other impact.
    
    Replace the ISOLATE_XXX macros with a bitwise isolate_mode_t type.  Macros
    are normally not recommended as they are type-unsafe and make debugging
    harder because the symbols cannot be passed through to the debugger.
    
    Quote from Johannes
    " Hmm, it would probably be cleaner to fully convert the isolation mode
    into independent flags.  INACTIVE, ACTIVE, BOTH is currently a
    tri-state among flags, which is a bit ugly."
    
    This patch moves isolate mode from swap.h to mmzone.h by memcontrol.h
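
    As a rough userspace illustration of the independent-flags idea in the
    quote above (this is not the kernel definition: the plain unsigned
    typedef, the flag values and the mode_allows() helper are assumptions
    made for this sketch):

```c
/* Sketch only: a plain typedef stands in for the kernel's bitwise type. */
typedef unsigned int isolate_mode_t;

/* Assumed flag values, chosen so each mode occupies its own bit. */
#define ISOLATE_INACTIVE ((isolate_mode_t)0x1) /* isolate inactive pages */
#define ISOLATE_ACTIVE   ((isolate_mode_t)0x2) /* isolate active pages */

/*
 * Hypothetical helper: returns non-zero if a page with the given
 * active/inactive state may be isolated under the requested mode.
 */
static int mode_allows(isolate_mode_t mode, int page_is_active)
{
	isolate_mode_t wanted = page_is_active ? ISOLATE_ACTIVE
					       : ISOLATE_INACTIVE;

	return (mode & wanted) != 0;
}
```

    With independent flags, "both" is simply ISOLATE_ACTIVE | ISOLATE_INACTIVE
    rather than a third state, and further filter bits can be added later
    without renumbering the existing ones.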
    
    Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
    Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
    Cc: Mel Gorman <mgorman@suse.de>
    Cc: Rik van Riel <riel@redhat.com>
    Cc: Michal Hocko <mhocko@suse.cz>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Minchan Kim committed with gregkh Oct 31, 2011
  26. @gregkh

    mm: compaction: trivial clean up in acct_isolated()

    commit b9e84ac1536d35aee03b2601f19694949f0bd506 upstream.
    
    Stable note: Not tracked in Bugzilla. This patch makes later patches
    	easier to apply but has no other impact.
    
    acct_isolated of compaction uses page_lru_base_type which returns only
    the base type of the LRU list so it never returns LRU_ACTIVE_ANON or
    LRU_ACTIVE_FILE.  In addition, cc->nr_[anon|file] is used only in
    acct_isolated so they do not need to be fields in compact_control.
    
    This patch removes the fields from compact_control and clarifies the
    function of acct_isolated, which counts the number of anon|file pages
    isolated.
    
    Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
    Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
    Acked-by: Mel Gorman <mgorman@suse.de>
    Acked-by: Rik van Riel <riel@redhat.com>
    Reviewed-by: Michal Hocko <mhocko@suse.cz>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Minchan Kim committed with gregkh Oct 31, 2011
  27. @gregkh

    vmscan: abort reclaim/compaction if compaction can proceed

    commit e0c23279c9f800c403f37511484d9014ac83adec upstream.
    
    Stable note: Not tracked on Bugzilla. THP and compaction were found to
    	aggressively reclaim pages and stall systems under different
    	situations that were addressed piecemeal over time.
    
    If compaction can proceed, shrink_zones() stops doing any work but its
    callers still call shrink_slab() which raises the priority and potentially
    sleeps.  This is unnecessary and wasteful so this patch aborts direct
    reclaim/compaction entirely if compaction can proceed.
    
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Acked-by: Rik van Riel <riel@redhat.com>
    Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
    Acked-by: Johannes Weiner <jweiner@redhat.com>
    Cc: Josh Boyer <jwboyer@redhat.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Mel Gorman committed with gregkh Oct 31, 2011
  28. @rikvanriel @gregkh

    vmscan: limit direct reclaim for higher order allocations

    commit e0887c19b2daa140f20ca8104bdc5740f39dbb86 upstream.
    
    Stable note: Not tracked on Bugzilla. THP and compaction were found to
    	aggressively reclaim pages and stall systems under different
    	situations that were addressed piecemeal over time.  Paragraph
    	3 of this changelog is the motivation for this patch.
    
    When suffering from memory fragmentation due to unfreeable pages, THP page
    faults will repeatedly try to compact memory.  Due to the unfreeable
    pages, compaction fails.
    
    Needless to say, at that point page reclaim also fails to create free
    contiguous 2MB areas.  However, that doesn't stop the current code from
    trying, over and over again, and freeing a minimum of 4MB (2UL <<
    sc->order pages) at every single invocation.
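
    The 4MB figure can be checked with a little arithmetic, assuming 4KB
    base pages and an order-9 (2MB) THP allocation.  Neither value is
    stated in this changelog and the helper names are hypothetical:

```c
/* Minimum number of pages reclaimed per invocation: 2UL << order. */
static unsigned long min_reclaim_pages(unsigned int order)
{
	return 2UL << order;
}

/* Convert a page count to KiB for a given base page size in KiB. */
static unsigned long pages_to_kib(unsigned long pages, unsigned long page_kib)
{
	return pages * page_kib;
}
```

    Under those assumptions min_reclaim_pages(9) is 1024 pages, and 1024
    pages of 4KB each is 4096KB, i.e. twice the size of the 2MB area being
    requested.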
    
    This resulted in my 12GB system having 2-3GB free memory, a corresponding
    amount of used swap and very sluggish response times.
    
    This can be avoided by having the direct reclaim code not reclaim from
    zones that already have plenty of free memory available for compaction.
    
    If compaction still fails due to unmovable memory, doing additional
    reclaim will only hurt the system, not help.
    
    [jweiner@redhat.com: change comment to explain the order check]
    Signed-off-by: Rik van Riel <riel@redhat.com>
    Acked-by: Johannes Weiner <jweiner@redhat.com>
    Acked-by: Mel Gorman <mgorman@suse.de>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
    Signed-off-by: Johannes Weiner <jweiner@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    rikvanriel committed with gregkh Oct 31, 2011
  29. @gregkh

    vmscan: reduce wind up shrinker->nr when shrinker can't do work

    commit 3567b59aa80ac4417002bf58e35dce5c777d4164 upstream.
    
    Stable note: Not tracked in Bugzilla. This patch reduces excessive
    	reclaim of slab objects reducing the amount of information that
    	has to be brought back in from disk. The third and fourth paragraphs
    	in the series describe the impact.
    
    When a shrinker returns -1 to shrink_slab() to indicate it cannot do
    any work given the current memory reclaim requirements, it adds the
    entire total_scan count to shrinker->nr. The idea behind this is that
    when the shrinker is next called and can do work, it will do the work
    of the previously aborted shrinker call as well.
    
    However, if a filesystem is doing lots of allocation with GFP_NOFS
    set, then we get many, many more aborts from the shrinkers than we
    do successful calls. The result is that shrinker->nr winds up to
    its maximum permissible value (twice the current cache size) and
    then when the next shrinker call that can do work is issued, it
    has enough scan count built up to free the entire cache twice over.
    
    This manifests itself in the cache going from full to empty in a
    matter of seconds, even when only a small part of the cache is
    needed to be emptied to free sufficient memory.
    
    Under metadata intensive workloads on ext4 and XFS, I'm seeing the
    VFS caches increase memory consumption up to 75% of memory (no page
    cache pressure) over a period of 30-60s, and then the shrinker
    empties them down to zero in the space of 2-3s. This cycle repeats
    over and over again, with the shrinker completely trashing the inode
    and dentry caches every minute or so the workload continues.
    
    This behaviour was made obvious by the shrink_slab tracepoints added
    earlier in the series, and made worse by the patch that corrected
    the concurrent accounting of shrinker->nr.
    
    To avoid this problem, stop repeated small increments of the total
    scan value from winding shrinker->nr up to a value that can cause
    the entire cache to be freed. We still need to allow it to wind up,
    so use the delta as the "large scan" threshold check - if the delta
    is more than a quarter of the entire cache size, then it is a large
    scan and allowed to cause lots of windup because we are clearly
    needing to free lots of memory.
    
    If it isn't a large scan then limit the total scan to half the size
    of the cache so that windup never increases to consume the whole
    cache. Reducing the total scan limit further does not allow enough
    wind-up to maintain the current levels of performance, whilst a
    higher threshold does not prevent the windup from freeing the entire
    cache under sustained workloads.
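
    The limits described above can be sketched as a small standalone
    function.  This is a userspace approximation, not the patch itself:
    the function name is hypothetical, and max_pass is assumed to mean the
    current cache size as reported by the shrinker.

```c
static long min_long(long a, long b)
{
	return a < b ? a : b;
}

/*
 * total_scan: deferred work plus this call's delta
 * delta:      scan count added by the current call
 * max_pass:   assumed to be the current number of objects in the cache
 */
static long limit_total_scan(long total_scan, long delta, long max_pass)
{
	/* Not a large scan: cap windup at half the cache size. */
	if (delta < max_pass / 4)
		total_scan = min_long(total_scan, max_pass / 2);

	/* Never scan more than twice the cache size in any case. */
	if (total_scan > max_pass * 2)
		total_scan = max_pass * 2;

	return total_scan;
}
```

    For a cache of 100 objects, a small delta of 10 with 1000 objects of
    accumulated windup is cut back to 50, whereas a delta of 30 (more than
    a quarter of the cache) is only limited by the twice-the-cache ceiling
    of 200.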
    
    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Dave Chinner committed with gregkh Jul 8, 2011
  30. @gregkh

    vmscan: shrinker->nr updates race and go wrong

    commit acf92b485cccf028177f46918e045c0c4e80ee10 upstream.
    
    Stable note: Not tracked in Bugzilla. This patch reduces excessive
    	reclaim of slab objects reducing the amount of information
    	that has to be brought back in from disk.
    
    shrink_slab() allows shrinkers to be called in parallel so the
    struct shrinker can be updated concurrently. It does not provide any
    exclusion for such updates, so we can get the shrinker->nr value
    increasing or decreasing incorrectly.
    
    As a result, when a shrinker repeatedly returns a value of -1 (e.g.
    a VFS shrinker called w/ GFP_NOFS), the shrinker->nr goes haywire,
    sometimes updating with the scan count that wasn't used, sometimes
    losing it altogether. Worse is when a shrinker does work and that
    update is lost due to racy updates, which means the shrinker will do
    the work again!
    
    Fix this by making the total_scan calculations independent of
    shrinker->nr, and making the shrinker->nr updates atomic w.r.t.
    other updates via cmpxchg loops.
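
    A minimal userspace sketch of the compare-and-exchange pattern, using
    C11 <stdatomic.h> in place of the kernel's cmpxchg().  The struct and
    function names are hypothetical and only illustrate the idea of
    claiming the deferred count and handing back unused work atomically:

```c
#include <stdatomic.h>

struct shrinker_sketch {
	atomic_long nr;		/* deferred scan count */
};

/* Atomically claim the whole deferred count, leaving zero behind. */
static long take_deferred(struct shrinker_sketch *s)
{
	long old = atomic_load(&s->nr);

	/* On failure, 'old' is refreshed with the current value; retry. */
	while (!atomic_compare_exchange_weak(&s->nr, &old, 0))
		;
	return old;
}

/* Atomically add unused work back; returns the resulting count. */
static long return_unused(struct shrinker_sketch *s, long unused)
{
	long old, new_nr;

	if (unused <= 0)
		return atomic_load(&s->nr);

	old = atomic_load(&s->nr);
	do {
		new_nr = old + unused;
	} while (!atomic_compare_exchange_weak(&s->nr, &old, new_nr));

	return new_nr;
}
```

    Because each caller works on its own claimed value and only publishes
    the remainder through a compare-and-exchange, concurrent callers can
    no longer overwrite each other's updates.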
    
    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Dave Chinner committed with gregkh Jul 8, 2011
  31. @gregkh

    vmscan: add shrink_slab tracepoints

    commit 095760730c1047c69159ce88021a7fa3833502c8 upstream.
    
    Stable note: This patch makes later patches easier to apply but otherwise
            has little to justify it. It is a diagnostic patch that was part
            of a series addressing excessive slab shrinking after GFP_NOFS
            failures. There is detailed information on the series' motivation
            at https://lkml.org/lkml/2011/6/2/42 .
    
    It is impossible to understand what the shrinkers are actually doing
    without instrumenting the code, so add some tracepoints to allow
    insight to be gained.
    
    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Dave Chinner committed with gregkh Jul 8, 2011
  32. @gregkh

    vmscan: clear ZONE_CONGESTED for zone with good watermark

    commit 439423f6894aa0dec22187526827456f5004baed upstream.
    
    Stable note: Not tracked in Bugzilla. kswapd is responsible for clearing
    	ZONE_CONGESTED after it balances a zone and this patch fixes a bug
    	where that was failing to happen. Without this patch, processes
    	can stall in wait_iff_congested unnecessarily. For users, this can
    	look like an interactivity stall but some workloads would see it
    	as sudden drop in throughput.
    
    ZONE_CONGESTED is only cleared in kswapd, but pages can be freed in any
    task.  It's possible ZONE_CONGESTED isn't cleared in some cases:
    
     1. the zone is already balanced just entering balance_pgdat() for
        order-0 because concurrent tasks free memory.  In this case, later
        check will skip the zone as it's balanced so the flag isn't cleared.
    
     2. high-order balancing falls back to order-0.  Quote from Mel: At the
        end of balance_pgdat(), kswapd uses the following logic;
    
    	If reclaiming at high order {
    		for each zone {
    			if all_unreclaimable
    				skip
    			if watermark is not met
    				order = 0
    				loop again
    
    			/* watermark is met */
    			clear congested
    		}
    	}
    
        i.e. it clears ZONE_CONGESTED if the zone is balanced.  If not,
        it restarts balancing at order-0.  However, if the higher zones are
        balanced for order-0, kswapd will miss clearing ZONE_CONGESTED as
        that only happens after a zone is shrunk.  This can mean that
        wait_iff_congested() stalls unnecessarily.
    
    This patch makes kswapd clear ZONE_CONGESTED during its initial
    highmem->dma scan for zones that are already balanced.
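
    The idea can be sketched in userspace as follows.  The struct, its
    fields and the simplified "free pages against high watermark" test are
    assumptions for illustration and stand in for the kernel's zone and
    watermark helpers:

```c
#include <stdbool.h>

struct zone_sketch {
	unsigned long free_pages;
	unsigned long high_wmark;
	bool congested;
};

/*
 * Scan from the highest zone down.  Zones that are already balanced
 * have their congested flag cleared on the way; the index of the first
 * unbalanced zone is returned, or -1 if every zone is balanced.
 */
static int find_end_zone(struct zone_sketch *zones, int nr_zones)
{
	int i;

	for (i = nr_zones - 1; i >= 0; i--) {
		if (zones[i].free_pages < zones[i].high_wmark)
			return i;		/* needs reclaim */
		zones[i].congested = false;	/* balanced: clear flag */
	}
	return -1;
}
```

    The flag is cleared for balanced zones even though they are never
    shrunk, which is the case the two scenarios above were missing.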
    
    Signed-off-by: Shaohua Li <shaohua.li@intel.com>
    Acked-by: Mel Gorman <mgorman@suse.de>
    Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Shaohua Li committed with gregkh Aug 25, 2011
  33. @gregkh

    mm: vmscan: fix force-scanning small targets without swap

    commit a4d3e9e76337059406fcf3ead288c0df22a790e9 upstream.
    
    Stable note: Not tracked in Bugzilla. This patch augments an earlier commit
            that avoids scanning priority being artificially raised. The older
    	fix was particularly important for small memcgs to avoid calling
    	wait_iff_congested() unnecessarily.
    
    Without swap, anonymous pages are not scanned.  As such, they should not
    count when considering force-scanning a small target if there is no swap.
    
    Otherwise, targets are not force-scanned even when their effective scan
    number is zero and the other conditions--kswapd/memcg--apply.
    
    This fixes 246e87a ("memcg: fix get_scan_count() for small
    targets").
    
    [akpm@linux-foundation.org: fix comment]
    Signed-off-by: Johannes Weiner <jweiner@redhat.com>
    Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
    Reviewed-by: Michal Hocko <mhocko@suse.cz>
    Cc: Ying Han <yinghan@google.com>
    Cc: Balbir Singh <bsingharora@gmail.com>
    Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
    Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
    Acked-by: Mel Gorman <mel@csn.ul.ie>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Johannes Weiner committed with gregkh Sep 14, 2011
  34. @gregkh

    mm: reduce the amount of work done when updating min_free_kbytes

    commit 938929f14cb595f43cd1a4e63e22d36cab1e4a1f upstream.
    
    Stable note: Fixes https://bugzilla.novell.com/show_bug.cgi?id=726210 .
            Large machines with 1TB or more of RAM take a long time to boot
            without this patch and may spew out soft lockup warnings.
    
    When min_free_kbytes is updated, some pageblocks are marked
    MIGRATE_RESERVE.  Ordinarily, this work is unnoticeable as it happens early
    in boot but on large machines with 1TB of memory, this has been reported
    to delay boot times, probably due to the NUMA distances involved.
    
    The bulk of the work is due to calling pageblock_is_reserved() an
    unnecessary number of times and accessing far more struct page metadata
    than is necessary.  This patch significantly reduces the amount of work
    done by setup_zone_migrate_reserve() improving boot times on 1TB machines.
    
    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Mel Gorman committed with gregkh Jan 10, 2012
  35. @gregkh

    mm: memory hotplug: Check if pages are correctly reserved on a per-se…

    …ction basis
    
    commit 2bbcb8788311a40714b585fc11b51da6ffa2ab92 upstream.
    
    Stable note: Fixes https://bugzilla.novell.com/show_bug.cgi?id=721039 .
            Without the patch, memory hot-add can fail for kernel configurations
            that do not set CONFIG_SPARSEMEM_VMEMMAP.
    
    (Resending as I am not seeing it in -next so maybe it got lost)
    
    mm: memory hotplug: Check if pages are correctly reserved on a per-section basis
    
    It is expected that memory being brought online is PageReserved
    similar to what happens when the page allocator is being brought up.
    Memory is onlined in "memory blocks" which consist of one or more
    sections. Unfortunately, the code that verifies PageReserved is
    currently assuming that the memmap backing all these pages is virtually
    contiguous which is only the case when CONFIG_SPARSEMEM_VMEMMAP is set.
    As a result, memory hot-add is failing on those configurations with
    the message;
    
    kernel: section number XXX page number 256 not reserved, was it already online?
    
    This patch updates the PageReserved check to lookup struct page once
    per section to guarantee the correct struct page is being checked.
    
    [Check pages within sections properly: rientjes@google.com]
    [original patch by: nfont@linux.vnet.ibm.com]
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
    Tested-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
    Mel Gorman committed with gregkh Oct 17, 2011