
Commit 5bf24c6

bingjiao-g authored and Sasha Levin committed
mm/vmscan: fix demotion targets checks in reclaim/demotion
[ Upstream commit 1aceed5 ]

Patch series "mm/vmscan: fix demotion targets checks in reclaim/demotion", v9.

This patch series addresses two issues in demote_folio_list(), can_demote(),
and next_demotion_node() in reclaim/demotion.

1. demote_folio_list() and can_demote() do not correctly check demotion
   targets against cpuset.mems_effective, which causes (a) pages to be
   demoted to not-allowed nodes and (b) pages to fail demotion even if the
   system still has allowed demotion nodes. Patch 1 fixes this bug by
   updating cpuset_node_allowed() and mem_cgroup_node_allowed() to return
   effective_mems, allowing a direct logical-AND operation against the
   demotion targets.

2. next_demotion_node() returns a preferred demotion target, but it does
   not check that node against the allowed nodes. Patch 2 ensures that
   next_demotion_node() filters against the allowed node mask and selects
   the closest demotion target to the source node.

This patch (of 2):

Fix two bugs in demote_folio_list() and can_demote() caused by incorrect
demotion target checks against cpuset.mems_effective in reclaim/demotion.

Commit 7d709f4 ("vmscan,cgroup: apply mems_effective to reclaim")
introduced the cpuset.mems_effective check and applied it to can_demote().
However:

1. It does not apply this check in demote_folio_list(), which leads to
   situations where pages are demoted to nodes that are explicitly
   excluded from the task's cpuset.mems.

2. It checks only the nodes in the immediate next demotion hierarchy and
   does not check all allowed demotion targets in can_demote(). This can
   cause pages to never be demoted if the nodes in the next demotion
   hierarchy are not set in mems_effective.

These bugs break the resource isolation provided by cpuset.mems. This is
visible from userspace because pages can either fail to be demoted entirely
or are demoted to nodes that are not allowed in multi-tier memory systems.
To address these bugs, update cpuset_node_allowed() and
mem_cgroup_node_allowed() to return effective_mems, allowing a direct
logical-AND operation against the demotion targets. Also update
can_demote() and demote_folio_list() accordingly.

Bug 1 reproduction:

Assume a system with 4 nodes, where nodes 0-1 are top-tier and nodes 2-3
are far-tier memory. All nodes have equal capacity.

Test script:

    echo 1 > /sys/kernel/mm/numa/demotion_enabled
    mkdir /sys/fs/cgroup/test
    echo +cpuset > /sys/fs/cgroup/cgroup.subtree_control
    echo "0-2" > /sys/fs/cgroup/test/cpuset.mems
    echo $$ > /sys/fs/cgroup/test/cgroup.procs
    swapoff -a

    # Expectation: Should respect the node 0-2 limit.
    # Observation: Node 3 shows significant allocation (MemFree drops).
    stress-ng --oomable --vm 1 --vm-bytes 150% --mbind 0,1

Bug 2 reproduction:

Assume a system with 6 nodes, where nodes 0-2 are top-tier, node 3 is a
far-tier node, and nodes 4-5 are the farthest-tier nodes. All nodes have
equal capacity.

Test script:

    echo 1 > /sys/kernel/mm/numa/demotion_enabled
    mkdir /sys/fs/cgroup/test
    echo +cpuset > /sys/fs/cgroup/cgroup.subtree_control
    echo "0-2,4-5" > /sys/fs/cgroup/test/cpuset.mems
    echo $$ > /sys/fs/cgroup/test/cgroup.procs
    swapoff -a

    # Expectation: Pages are demoted to nodes 4-5.
    # Observation: No pages are demoted before OOM.
    stress-ng --oomable --vm 1 --vm-bytes 150% --mbind 0,1,2

Link: https://lkml.kernel.org/r/20260114205305.2869796-1-bingjiao@google.com
Link: https://lkml.kernel.org/r/20260114205305.2869796-2-bingjiao@google.com
Fixes: 7d709f4 ("vmscan,cgroup: apply mems_effective to reclaim")
Signed-off-by: Bing Jiao <bingjiao@google.com>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: David Hildenbrand <david@kernel.org>
Cc: Gregory Price <gourry@gourry.net>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
Cc: Wei Xu <weixugc@google.com>
Cc: Yuanchu Xie <yuanchu@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
1 parent 90f5e87 commit 5bf24c6

5 files changed: +78 −38 lines


include/linux/cpuset.h — 3 additions, 3 deletions

@@ -174,7 +174,7 @@ static inline void set_mems_allowed(nodemask_t nodemask)
 	task_unlock(current);
 }
 
-extern bool cpuset_node_allowed(struct cgroup *cgroup, int nid);
+extern void cpuset_nodes_allowed(struct cgroup *cgroup, nodemask_t *mask);
 #else /* !CONFIG_CPUSETS */
 
 static inline bool cpusets_enabled(void) { return false; }
@@ -301,9 +301,9 @@ static inline bool read_mems_allowed_retry(unsigned int seq)
 	return false;
 }
 
-static inline bool cpuset_node_allowed(struct cgroup *cgroup, int nid)
+static inline void cpuset_nodes_allowed(struct cgroup *cgroup, nodemask_t *mask)
 {
-	return true;
+	nodes_copy(*mask, node_states[N_MEMORY]);
 }
 #endif /* !CONFIG_CPUSETS */
 
include/linux/memcontrol.h — 3 additions, 3 deletions

@@ -1744,7 +1744,7 @@ static inline void count_objcg_events(struct obj_cgroup *objcg,
 	rcu_read_unlock();
 }
 
-bool mem_cgroup_node_allowed(struct mem_cgroup *memcg, int nid);
+void mem_cgroup_node_filter_allowed(struct mem_cgroup *memcg, nodemask_t *mask);
 
 void mem_cgroup_show_protected_memory(struct mem_cgroup *memcg);
 
@@ -1815,9 +1815,9 @@ static inline ino_t page_cgroup_ino(struct page *page)
 	return 0;
 }
 
-static inline bool mem_cgroup_node_allowed(struct mem_cgroup *memcg, int nid)
+static inline void mem_cgroup_node_filter_allowed(struct mem_cgroup *memcg,
+						  nodemask_t *mask)
 {
-	return true;
 }
 
 static inline void mem_cgroup_show_protected_memory(struct mem_cgroup *memcg)

kernel/cgroup/cpuset.c — 36 additions, 18 deletions

@@ -4424,40 +4424,58 @@ bool cpuset_current_node_allowed(int node, gfp_t gfp_mask)
 	return allowed;
 }
 
-bool cpuset_node_allowed(struct cgroup *cgroup, int nid)
+/**
+ * cpuset_nodes_allowed - return effective_mems mask from a cgroup cpuset.
+ * @cgroup: pointer to struct cgroup.
+ * @mask: pointer to struct nodemask_t to be returned.
+ *
+ * Returns effective_mems mask from a cgroup cpuset if it is cgroup v2 and
+ * has cpuset subsys. Otherwise, returns node_states[N_MEMORY].
+ *
+ * This function intentionally avoids taking the cpuset_mutex or callback_lock
+ * when accessing effective_mems. This is because the obtained effective_mems
+ * is stale immediately after the query anyway (e.g., effective_mems is updated
+ * immediately after releasing the lock but before returning).
+ *
+ * As a result, returned @mask may be empty because cs->effective_mems can be
+ * rebound during this call. Besides, nodes in @mask are not guaranteed to be
+ * online due to hot plugins. Callers should check the mask for validity on
+ * return based on its subsequent use.
+ **/
+void cpuset_nodes_allowed(struct cgroup *cgroup, nodemask_t *mask)
 {
 	struct cgroup_subsys_state *css;
 	struct cpuset *cs;
-	bool allowed;
 
 	/*
 	 * In v1, mem_cgroup and cpuset are unlikely in the same hierarchy
 	 * and mems_allowed is likely to be empty even if we could get to it,
-	 * so return true to avoid taking a global lock on the empty check.
+	 * so return directly to avoid taking a global lock on the empty check.
 	 */
-	if (!cpuset_v2())
-		return true;
+	if (!cgroup || !cpuset_v2()) {
+		nodes_copy(*mask, node_states[N_MEMORY]);
+		return;
+	}
 
 	css = cgroup_get_e_css(cgroup, &cpuset_cgrp_subsys);
-	if (!css)
-		return true;
+	if (!css) {
+		nodes_copy(*mask, node_states[N_MEMORY]);
+		return;
+	}
 
 	/*
-	 * Normally, accessing effective_mems would require the cpuset_mutex
-	 * or callback_lock - but node_isset is atomic and the reference
-	 * taken via cgroup_get_e_css is sufficient to protect css.
-	 *
-	 * Since this interface is intended for use by migration paths, we
-	 * relax locking here to avoid taking global locks - while accepting
-	 * there may be rare scenarios where the result may be innaccurate.
+	 * The reference taken via cgroup_get_e_css is sufficient to
+	 * protect css, but it does not imply safe accesses to effective_mems.
 	 *
-	 * Reclaim and migration are subject to these same race conditions, and
-	 * cannot make strong isolation guarantees, so this is acceptable.
+	 * Normally, accessing effective_mems would require the cpuset_mutex
+	 * or callback_lock - but the correctness of this information is stale
+	 * immediately after the query anyway. We do not acquire the lock
+	 * during this process to save lock contention in exchange for racing
+	 * against mems_allowed rebinds.
 	 */
 	cs = container_of(css, struct cpuset, css);
-	allowed = node_isset(nid, cs->effective_mems);
+	nodes_copy(*mask, cs->effective_mems);
 	css_put(css);
-	return allowed;
 }
 
 /**

mm/memcontrol.c — 14 additions, 2 deletions

@@ -5624,9 +5624,21 @@ subsys_initcall(mem_cgroup_swap_init);
 
 #endif /* CONFIG_SWAP */
 
-bool mem_cgroup_node_allowed(struct mem_cgroup *memcg, int nid)
+void mem_cgroup_node_filter_allowed(struct mem_cgroup *memcg, nodemask_t *mask)
 {
-	return memcg ? cpuset_node_allowed(memcg->css.cgroup, nid) : true;
+	nodemask_t allowed;
+
+	if (!memcg)
+		return;
+
+	/*
+	 * Since this interface is intended for use by migration paths, and
+	 * reclaim and migration are subject to race conditions such as changes
+	 * in effective_mems and hot-unpluging of nodes, inaccurate allowed
+	 * mask is acceptable.
+	 */
+	cpuset_nodes_allowed(memcg->css.cgroup, &allowed);
+	nodes_and(*mask, *mask, allowed);
 }
 
 void mem_cgroup_show_protected_memory(struct mem_cgroup *memcg)

mm/vmscan.c — 22 additions, 12 deletions

@@ -344,19 +344,21 @@ static void flush_reclaim_state(struct scan_control *sc)
 static bool can_demote(int nid, struct scan_control *sc,
 		       struct mem_cgroup *memcg)
 {
-	int demotion_nid;
+	struct pglist_data *pgdat = NODE_DATA(nid);
+	nodemask_t allowed_mask;
 
-	if (!numa_demotion_enabled)
+	if (!pgdat || !numa_demotion_enabled)
 		return false;
 	if (sc && sc->no_demotion)
 		return false;
 
-	demotion_nid = next_demotion_node(nid);
-	if (demotion_nid == NUMA_NO_NODE)
+	node_get_allowed_targets(pgdat, &allowed_mask);
+	if (nodes_empty(allowed_mask))
 		return false;
 
-	/* If demotion node isn't in the cgroup's mems_allowed, fall back */
-	return mem_cgroup_node_allowed(memcg, demotion_nid);
+	/* Filter out nodes that are not in cgroup's mems_allowed. */
+	mem_cgroup_node_filter_allowed(memcg, &allowed_mask);
+	return !nodes_empty(allowed_mask);
 }
 
 static inline bool can_reclaim_anon_pages(struct mem_cgroup *memcg,
@@ -1019,9 +1021,10 @@ static struct folio *alloc_demote_folio(struct folio *src,
  * Folios which are not demoted are left on @demote_folios.
  */
 static unsigned int demote_folio_list(struct list_head *demote_folios,
-				      struct pglist_data *pgdat)
+				      struct pglist_data *pgdat,
+				      struct mem_cgroup *memcg)
 {
-	int target_nid = next_demotion_node(pgdat->node_id);
+	int target_nid;
 	unsigned int nr_succeeded;
 	nodemask_t allowed_mask;
 
@@ -1033,18 +1036,25 @@ static unsigned int demote_folio_list(struct list_head *demote_folios,
 		 */
 		.gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) |
 			__GFP_NOMEMALLOC | GFP_NOWAIT,
-		.nid = target_nid,
 		.nmask = &allowed_mask,
 		.reason = MR_DEMOTION,
 	};
 
 	if (list_empty(demote_folios))
 		return 0;
 
-	if (target_nid == NUMA_NO_NODE)
+	node_get_allowed_targets(pgdat, &allowed_mask);
+	mem_cgroup_node_filter_allowed(memcg, &allowed_mask);
+	if (nodes_empty(allowed_mask))
 		return 0;
 
-	node_get_allowed_targets(pgdat, &allowed_mask);
+	target_nid = next_demotion_node(pgdat->node_id);
+	if (target_nid == NUMA_NO_NODE)
+		/* No lower-tier nodes or nodes were hot-unplugged. */
+		return 0;
+	if (!node_isset(target_nid, allowed_mask))
+		target_nid = node_random(&allowed_mask);
+	mtc.nid = target_nid;
 
 	/* Demotion ignores all cpuset and mempolicy settings */
 	migrate_pages(demote_folios, alloc_demote_folio, NULL,
@@ -1566,7 +1576,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 	/* 'folio_list' is always empty here */
 
 	/* Migrate folios selected for demotion */
-	nr_demoted = demote_folio_list(&demote_folios, pgdat);
+	nr_demoted = demote_folio_list(&demote_folios, pgdat, memcg);
 	nr_reclaimed += nr_demoted;
 	stat->nr_demoted += nr_demoted;
 	/* Folios that could not be demoted are still in @demote_folios */
