
Scheduler is not working #1

Closed · Msiavashi opened this issue Oct 22, 2021 · 20 comments

Comments

@Msiavashi commented Oct 22, 2021

Hi,

My question might be naive, but I'm trying to run some experiments with the provided schedulers (e.g., the FIFO scheduler and the Shinjuku scheduler). I've noticed that the agent starts and initializes the cores successfully; however, it does not appear to schedule any process. I placed a few printfs in the Enqueue method as well as in the constructor. The printf in the constructor fires once the scheduler is initialized, but the prints inside the Enqueue method never run. I also observe no significant difference in performance.

I have a web server running under the SCHED_FIFO policy with priority 99.

The provided tests also pass, but the scheduler does not log anything except the core-mapping and initialization prints.

I also debugged the kernel to check whether its SCHED_GHOST policy works. The kernel schedules the agent perfectly; however, the agent does not schedule my real-time (or any other) processes.

So my question is simply: how can I make sure the agent's scheduler is working correctly when the Enqueue method never seems to be called?

I appreciate your help.

Thanks

@Msiavashi changed the title from "Agents not working" to "Scheduler is not working" on Oct 22, 2021
@neelnatu (Collaborator) commented Oct 23, 2021

Hi Mohammad, are you trying to schedule the web server (with SCHED_FIFO policy) using one of the ghost schedulers (say fifo_scheduler)?

If yes, then you'll need to move the application threads into the ghost scheduling class. All of the various ghost schedulers only schedule tasks with policy == SCHED_GHOST.

There are a couple of ways to move application threads into ghost (for your use case I'd recommend #3):

  1. modify the application itself to create GhostThreads, but unless you are writing something from scratch this is usually not feasible.
  2. have the agent monitor cgroups tagged with a special file to indicate that all tasks in the cgroup should be moved into ghost (we use this approach extensively but haven't pushed it into the open source repository).
  3. move the application threads into ghost with a command line tool. I am attaching the source code for a tool we use internally (pushtosched.c); a rough sketch of such a tool is shown below. Let us know if this is useful and we can add it to our GitHub repository.
    pushtosched.c.txt
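
For readers without access to the attachment: the following is a rough, hypothetical sketch of what such a tool could look like, not the attached pushtosched.c. It assumes the ghOSt kernel exposes SCHED_GHOST as policy number 18 (matching the pushtosched 18 invocations later in this thread) and simply calls sched_setscheduler() on every TID it reads from stdin.

// Hypothetical sketch of a pushtosched-style tool (not the attached pushtosched.c).
// Usage: echo <tid> [<tid> ...] | ./pushtosched <policy>
// e.g. policy 18 to move tasks into ghost (assumed SCHED_GHOST value in the
// ghOSt kernel) or 0 (SCHED_OTHER) to move them back out.
#include <sched.h>
#include <sys/types.h>

#include <cstdio>
#include <cstdlib>
#include <iostream>

int main(int argc, char* argv[]) {
  if (argc != 2) {
    std::cerr << "usage: " << argv[0] << " <policy>" << std::endl;
    return 1;
  }
  const int policy = std::atoi(argv[1]);
  sched_param param = {};  // priority 0; ghost/SCHED_OTHER ignore sched_priority
  pid_t tid;
  while (std::cin >> tid) {
    // Move each thread ID read from stdin into the requested scheduling class.
    if (sched_setscheduler(tid, policy, &param) != 0) {
      std::perror("sched_setscheduler");
    }
  }
  return 0;
}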

@Msiavashi (Author)

Thanks @neelnatu,

Yep, that is exactly what I'm trying to do. I had misunderstood and thought that SCHED_GHOST was only for the agents.

I'll try the tool and update the issue with the result.

Thanks

@Msiavashi (Author)

I tried the tool and it seems to work fine. I pipe the PID of my target process to the tool and the agents start logging immediately. However, when using the FIFO scheduler, my web server process no longer responds until I move it back out of SCHED_GHOST with the tool (pushtosched 0).

I then tried agent_shinjuku to see whether the same problem occurs with that scheduler. The Shinjuku agent starts without error and produces the following output:

Core map
(  0 )	(  1 )	(  2 )	(  3 )	(  4 )	(  5 )	(  6 )	(  7 )

Initializing...
Initialization complete, ghOSt active.

Once I migrate my web server's threads to SCHED_GHOST using the tool, the agent exits with the following error:

Core map
(  0 )	(  1 )	(  2 )	(  3 )	(  4 )	(  5 )	(  6 )	(  7 )

Initializing...
Initialization complete, ghOSt active.
PID 6149 Fatal segfault at addr 0x10: 
[0] 0x7fe13de823c0 : __restore_rt
[1] 0x55babc33b2cb : ghost::BasicDispatchScheduler<>::DispatchMessage()
[2] 0x55babc339f19 : ghost::ShinjukuAgent::AgentThread()
[3] 0x55babc33e3c4 : ghost::Agent::ThreadBody()
[4] 0x7fe13e0d2de4 : (unknown)

For example, when I run a Flask dev web server, I use the following command to pipe the PID of the running python process:

pidof python | sudo ./pushtosched 18

Where do you think the problem might be?

Thanks

@jackhumphries (Collaborator) commented Oct 26, 2021

Hi Mohammad,

Neel is out of office today, so let me help you until he is back.

Regarding Shinjuku, I assume you compiled with optimizations turned on, which strips many of the debug symbols out of the crash trace. What I suspect is happening is that Shinjuku requires the client app being scheduled to set up a shared memory region for communication (check out the RocksDB experiment), and it is crashing in your case because this region is not set up. I would not use the Shinjuku scheduler for the web server at the moment, since it requires more extensive setup.

As for the FIFO scheduler, you do not need to do any sort of setup beyond running pushtosched.c to move the threads into the ghOSt sched class, so let me get some more information from you.

  1. Can you see the FIFO scheduler committing scheduling decisions in FifoScheduler::FifoSchedule() in fifo_scheduler.cc?
  2. If you try to schedule a very simple app with the FIFO scheduler (e.g., spawn a pthread from main(), printf() in the pthread, then join the pthread — something like the sketch below), does that app make forward progress (e.g., can it print things)?
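
A minimal test program along those lines might look like the following (an illustrative sketch, not part of the ghOSt repository):

// Minimal forward-progress test: run it, move its threads into ghost with
// pushtosched, and check that it keeps printing and eventually exits.
#include <pthread.h>
#include <unistd.h>

#include <cstdio>

void* worker(void*) {
  for (int i = 0; i < 10; ++i) {
    std::printf("hello from the pthread (%d)\n", i);
    sleep(1);  // leave time to move the threads into ghost while it runs
  }
  return nullptr;
}

int main() {
  pthread_t tid;
  if (pthread_create(&tid, nullptr, worker, nullptr) != 0) {
    std::perror("pthread_create");
    return 1;
  }
  pthread_join(tid, nullptr);
  std::printf("pthread joined, exiting\n");
  return 0;
}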

I want to make sure the policy is up and running correctly on your machine. If it is, then we can delve into the specifics of the server. For example, the FIFO policy is non-preemptive, so perhaps certain threads are taking a while to run, which is why you do not see the server responding.

@Msiavashi (Author) commented Oct 27, 2021

Hi Jack,

Thanks for your hints.

The FifoSchedule() method works. As you suggested, I wrote a very simple app with two pthreads and noticed that the problem was with the pidof command, which only returns the PID of the process, not of the spawned threads (ls /proc/$(pidof my_multi_thread_app)/task | sudo ./pushtosched 18 solved the problem). My web server spawns pthreads on the fly as it receives new requests, so while this now works pretty well, I believe I should use the second solution suggested by Neel: the pushtosched tool can't track pthreads that are created and terminated at runtime. Could you please share the code for the second solution Neel mentioned? I can implement it and create a PR if you'd like; I'm going to need this for my research anyway.

Regarding the Shinjuku scheduler, I haven't taken a look at it yet, but since this is for research purposes I'm going to need to run it too. I'll investigate the RocksDB experiment as you suggested to see how it works, and I'll keep this issue updated with the results.

Many thanks to you guys for your kind help. 🌹

@jackhumphries (Collaborator) commented Oct 27, 2021

Hi Mohammad,

Glad that it is working now. We're happy to open source the cgroup code, but there is a complication. We use cgroups v1 internally at Google, whereas others generally use v2 nowadays. So our code may not be useful to others, and since cgroups v1 does not support inotify (to detect when a new thread is added to a cgroup), we rely on a periodic polling mechanism to detect new threads, which may be too slow for a web server that spawns a pthread for every new request.

If you want to implement this functionality for cgroups v2, we would be thrilled to merge it in. Alternatively, is it possible for you to modify the pthread spawn code in the server to instead use GhostThreads?

@Msiavashi (Author) commented Oct 27, 2021

For my current test, yes, I can modify the source of my web server. But in the end, I believe I should run my experiments on latency-sensitive apps like Memcached (I'm not sure whether Memcached spawns new threads at runtime; I assume it doesn't). Modifying the source would probably not be easy in that case.

@neelnatu (Collaborator)

Hmm, once you move all threads of the web server into ghost, any new threads created by that application should automatically start out in the ghost sched_class. You can verify this by periodically doing a 'grep policy /proc/<tid>/sched' for all tasks you get via ls /proc/$(pidof webserver_app)/task.

Is that not happening in your case?

@Msiavashi (Author)

Hi Neel,

You're right. I double-checked and it's working as you said. However, the throughput drops significantly with FIFO, probably because of its non-preemptive implementation.

Anyway, I'm trying to run Shinjuku to compare the performance, but I still have some problems running it. I'll ask for your help if I can't fix it today.

Thanks

@neelnatu (Collaborator) commented Nov 1, 2021 via email

@jackhumphries (Collaborator) commented Nov 1, 2021

For Shinjuku, here are the main gotchas:

  1. Integrate your web server with the PrioTable and make sure each pthread is marked as runnable. Our RocksDB experiment app demonstrates how to do this.
  2. Set the preemption time slice properly in Shinjuku (this is specified to the Shinjuku agent process via the --preemption_time_slice command line arg). The default is 50 µs but you probably want something higher than that for a web server.

@jackhumphries (Collaborator)

And just to be clear, Shinjuku is a centralized scheduler that has preemption. So you should get better QPS for your web server, though the spinning Shinjuku global agent will take up a logical core.

@Msiavashi (Author) commented Nov 1, 2021

Thanks,

I've just started trying to get Shinjuku running. It seems the UnzipPar function in the Setup.py file gets a wrong par path from GetPar(): it returns the shinjuku.runfiles path instead of the shinjuku.par path. With that fixed, it progresses to the RunAllExperiments function, but it throws an error related to the number of cores: lib/topology.h:472(20266) CHECK FAILED: cpu < cpus_.size() [8 >= 8]

I'm looking into it.

@jackhumphries (Collaborator)

That's likely because the Shinjuku experiments require a machine with at least 8 logical cores. You can lower this by updating _NUM_CPUS in shinjuku.py.

@jackhumphries (Collaborator)

Also take a look at the _FIRST_CPU variable and related options in options.py. We tend to affine the experiment away from lower CPUs (e.g., CPU 0) since a lot of background daemons run on those CPUs inside Google.

@Msiavashi (Author)

The Shinjuku experiment worked successfully, and thanks to your help I now have some clues for getting my web server working with the Shinjuku scheduler. I'll integrate my web server with the PrioTable and hope it goes without a hitch.

Thanks again

@jackhumphries (Collaborator) commented Nov 2, 2021

Yes let us know. A key point is that you need to add a pthread to the PrioTable as soon as it is spawned, so make sure to allocate a large enough PrioTable when the server starts.

Depending on the scheduling policy and the scheduling hints from the server that your research requires, it may make sense to ditch the PrioTable altogether, since it becomes a bottleneck at high thread counts (the agent needs to scan the whole table). You could use Shinjuku or FIFO as a starting point and then take out the PrioTable and slim down other irrelevant parts of the policies.

Another option is to use the SOL scheduler as a starting point. It is a centralized FIFO scheduler with no PrioTable. Maybe adding preemption support to it (which should be relatively easy) would be what you're looking for.

@Msiavashi (Author)

Dear @jackhumphries,

I used the orchestrators as samples and implemented a very simple demonstration to get familiar with the code. Here is my code:

void printHelloWorld(uint32_t num){
    cout << "Hello World" << endl;
}


int main(int argc, char const *argv[])
{
    ghost_test::Ghost ghost_(1, 1);
    ghost_test::ExperimentThreadPool thread_pool_(1);
    vector<ghost::GhostThread::KernelScheduler> kernelSchedulers;
    vector<function<void (uint32_t)>> threadWork;
    kernelSchedulers.push_back(ghost::GhostThread::KernelScheduler::kGhost);
    threadWork.push_back(&printHelloWorld);
    thread_pool_.Init(kernelSchedulers, threadWork);
    ghost::sched_item si;
    ghost_.GetSchedItem(0, si);
    si.sid = 0;
    si.gpid = thread_pool_.GetGtids()[0].id();
    si.flags |= SCHED_ITEM_RUNNABLE;
    ghost_.SetSchedItem(0, si);
    return 0;
}

As you can see, I create only one ghOSt thread, assign my printHelloWorld function to it, and mark it as RUNNABLE. The Shinjuku agent does not crash and successfully schedules the task; however, I get multiple prints on stdout even though I only print once. Here is the output:

Hello World
Hello World
Hello World
Hello World
Hello World
Hello World
Hello World
Hello World
Hello World
Hello World
Hello World
Hello World
Hello World
Hello World
Hello World
Hello World
Hello World
Hello World
Hello World
Hello World
Hello World
Hello World
Hello World
Hello World
Hello World
Hello World
[2] 0x5575a4913104 : main
Hello World
Hello World
Hello World
Hello World
Hello World
Hello World
Hello World
Hello World
Hello World
Hello World
Hello World
Hello World
Hello World
Hello World
Hello World
Hello World
Hello World
Hello World
Hello World
Hello World
Hello World
Hello World
Hello World
[3] 0x7f14c895b0b3 : __libc_start_main
Hello World
Hello World
Hello World

Any idea why this repetition happens? Am I missing something?

My second question is: how is it possible to ditch the PrioTable? Is that easily achievable with the current implementation, or does it require a new implementation of Shinjuku? I assume it would not be easy with the current implementation of its agent. Am I correct?

Thanks

@jackhumphries (Collaborator)

Hi Mohammad,

In the experiments directory, we have a helper thread pool that runs a thread body over and over again until the thread is marked as ready to exit. Below is some updated code that prints "Hello World" only once, as you would expect. Just update the include paths to work with your build.

#include <functional>
#include <iostream>
#include <vector>

#include "third_party/absl/synchronization/notification.h"
#include "third_party/ghost/experiments/shared/ghost.h"
#include "third_party/ghost/experiments/shared/thread_pool.h"
#include "third_party/ghost/lib/base.h"
#include "third_party/ghost/lib/ghost.h"

void printHelloWorld(uint32_t num, absl::Notification* printed,
                     absl::Notification* wait) {
  std::cout << "Hello World" << std::endl;
  printed->Notify();
  wait->WaitForNotification();
}

int main(int argc, char const* argv[]) {
  ghost_test::Ghost ghost_(1, 1);
  ghost_test::ExperimentThreadPool thread_pool_(1);
  std::vector<ghost::GhostThread::KernelScheduler> kernelSchedulers;
  std::vector<std::function<void(uint32_t)>> threadWork;
  kernelSchedulers.push_back(ghost::GhostThread::KernelScheduler::kGhost);

  absl::Notification printed;
  absl::Notification wait;
  threadWork.push_back(
      std::bind(printHelloWorld, std::placeholders::_1, &printed, &wait));

  thread_pool_.Init(kernelSchedulers, threadWork);
  ghost::sched_item si;
  ghost_.GetSchedItem(0, si);
  si.sid = 0;
  si.gpid = thread_pool_.GetGtids()[0].id();
  si.flags |= SCHED_ITEM_RUNNABLE;
  ghost_.SetSchedItem(0, si);

  printed.WaitForNotification();
  thread_pool_.MarkExit(/*sid=*/0);
  wait.Notify();
  thread_pool_.Join();

  return 0;
}

Basically, I use notifications to wait for the ghOSt thread to print once, then I mark the ghOSt thread as ready to exit, which causes the thread pool to let it exit. Then I call Join() on the thread pool, which avoids triggering the CHECK in the thread pool destructor (I can see that CHECK was triggered in your output).

You can modify the thread pool to avoid the repeating behavior or just create the ghOSt threads directly yourself.

If you want to ditch the PrioTable altogether (which I would recommend for your case), then I would take a look at the SOL scheduler. It is a centralized FIFO scheduler without a PrioTable. The only downside is that it is not preemptive, but you can implement preemption very easily yourself. On each iteration of the global scheduling loop, see how long each thread has been running so far and then schedule something else on a CPU whose currently running thread has exceeded the time slice.
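
To illustrate that idea only (this is not SOL code; every type and name below is hypothetical and merely stands in for the real scheduler state), the time-slice check inside a centralized scheduling loop could look roughly like this:

// Hypothetical sketch of adding a time-slice check to a centralized FIFO
// scheduling loop. None of these types exist in the ghOSt repository.
#include <chrono>
#include <deque>
#include <optional>
#include <vector>

using Clock = std::chrono::steady_clock;

struct Task {
  int tid;
};

struct CpuState {
  std::optional<Task> current;   // task currently running on this CPU
  Clock::time_point started_at;  // when that task was placed on the CPU
};

constexpr std::chrono::microseconds kTimeSlice{500};

// One iteration of the global scheduling loop.
void GlobalSchedule(std::vector<CpuState>& cpus, std::deque<Task>& run_queue) {
  const Clock::time_point now = Clock::now();
  for (CpuState& cpu : cpus) {
    const bool expired = cpu.current && (now - cpu.started_at) >= kTimeSlice;
    if (cpu.current && !expired) continue;  // current task keeps its CPU
    if (run_queue.empty()) continue;        // nothing else to run
    if (expired) {
      // Preempt: requeue the running task at the back of the FIFO queue.
      run_queue.push_back(*cpu.current);
    }
    // Dispatch the next task in FIFO order; a real agent would commit this
    // decision to the kernel via a ghOSt transaction here.
    cpu.current = run_queue.front();
    run_queue.pop_front();
    cpu.started_at = now;
  }
}

int main() {
  std::vector<CpuState> cpus(2);
  std::deque<Task> run_queue = {{101}, {102}, {103}};
  GlobalSchedule(cpus, run_queue);  // first pass: fills the two idle CPUs
  return 0;
}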

@jackhumphries (Collaborator)

Hi Mohammad,

Please let me know if you have any additional questions. I will close this thread for now.

hannahyp pushed a commit that referenced this issue Jan 14, 2022
… ghost during agent restart.

Prior to this change task discovery would abort with the following error:
"ERROR: Got repeated ESTALEs from a quiescent reassociation for gtid 4494803534350351, flags 7!"

When a task departs and then moves back into ghost there can be more than one
status_word that refers to the same task (all except the most recently allocated
one are associated with earlier incarnations of the task in ghost).

By definition using the 'barrier' from an earlier incarnation will result in
queue association failing with ESTALE. We now detect if we are dealing with an
orphaned status_word by checking GHOST_SW_F_CANFREE and freeing the status_word
in that case.

Note that we could still get a false positive on queue association if the
barrier in the _orphaned_ status_word happens to match the barrier of the
_current_ status_word. This will be fixed but this is how it "works" currently:
- the first false-positive association injects a synthetic TASK_NEW.
- any subsequent false-positive associations are no-ops because they
  return GHOST_ASSOC_SF_ALREADY (thanks to the first false-positive).
- all status words associated with earlier incarnations are leaked.
- agent restarts successfully and task is scheduled.

This false-positive can be engineered by killing the agent and moving the task
into and out of ghost multiple times (each of the orphaned status words will
have a barrier of 2: TASK_NEW + TASK_DEPARTED). After the final move into ghost
we can use 'taskset -p <cpumask> <pid>' and bump the barrier of the current
status_word to 2: TASK_NEW + AFFINITY_CHANGED.

TESTED:
# create enclave
mount -t ghost ghost /sys/fs/ghost
echo "create 1" > /sys/fs/ghost/ctl
echo "0-5" > /sys/fs/ghost/enclave_1/cpulist

# start agent
fifo_agent --enclave=/sys/fs/ghost/enclave_1 &

# start test program
cat > /tmp/t.sh << EOF
#!/bin/bash
while [ 1 ]; do
	sleep 1
	iter=$((iter+1))
	echo $iter
done
EOF

/tmp/t.sh &

# move /tmp/t.sh into ghost
echo $T_SH_PID | pushtosched 18

# kill the agent
kill -INT $FIFO_AGENT_PID

# move /tmp/t.sh out of ghost and then back into ghost.
echo $T_SH_PID | pushtosched 0
echo $T_SH_PID | pushtosched 18

# We now have two status words with gtid matching /tmp/t.sh:
# - the live status word associated with the current incarnation.
# - the old status word associated with the previous incarnation.

# restart the agent
fifo_agent --enclave=/sys/fs/ghost/enclave_1 &

Prior to this change DiscoverTasks would error out trying to associate
the old status_word since it would always return ESTALE (see below).

root@(none):/usr/local/google/home/neelnatu/linux/9xx# Initializing...
sw(0,6): gtid 0xff8000000080f(2044), flags 0x7
Associating with status_word in region 0, index 6
Trying to associate with gtid 4494803534350351(2044), flags 0x7, runtime 23719147, barrier 38
AssociateTask failed: errno 116
Trying to associate with gtid 4494803534350351(2044), flags 0x7, runtime 23719147, barrier 38
AssociateTask failed: errno 116
./third_party/ghost/lib/scheduler.h:231(2088) ERROR: Got repeated ESTALEs from a quiescent reassociation for gtid 4494803534350351, flags 7!
PID 2087 Backtrace:
[0] 0x56180ef1782d : ghost::Exit()
[1] 0x56180eed9eb3 : ghost::BasicDispatchScheduler<>::DiscoverTasks()::{lambda()#1}::operator()()
[2] 0x56180eed9dae : std::__u::__function::__policy_invoker<>::__call_impl<>()
[3] 0x56180eef1faf : ghost::StatusWordTable::ForEachTaskStatusWord()
[4] 0x56180eee861d : ghost::LocalEnclave::ForEachTaskStatusWord()
[5] 0x56180eed9262 : ghost::BasicDispatchScheduler<>::DiscoverTasks()
[6] 0x56180eee762c : ghost::Enclave::Ready()
[7] 0x56180eed882e : ghost::FullFifoAgent<>::FullFifoAgent()
[8] 0x56180eed85db : std::__u::make_unique<>()
[9] 0x56180eed7d4e : ghost::AgentProcess<>::AgentProcess()
[10] 0x56180eed7091 : main
[11] 0x7f26c3bb6bbd : __libc_start_main
[12] 0x56180eed60e9 : _start
hannahyp pushed a commit that referenced this issue Mar 18, 2022
Specifically the bug is that we are depending on the initialization of
GHOST_TID_SEQNUM_BITS (in base.cc) during the initialization of another
global Agent::kVersionCheck (in agent.cc). Since there is no ordering
across compilation units this can result in accessing GHOST_TID_SEQNUM_BITS
before it has been initialized.

READ of size 4 at 0x7ff981751180 thread T0
    #0 0x7ff98134368f in ghost::gtid(long) third_party/ghost/lib/base.cc:122:19
    #1 0x7ff98134196d in GetGtid third_party/ghost/lib/base.cc:212:28
    #2 0x7ff98134196d in Current third_party/ghost/lib/base.h:164:40
    #3 0x7ff98134196d in ghost::Exit(int) third_party/ghost/lib/base.cc:285:28
    #4 0x7ff981b93db4 in ghost::Ghost::MountGhostfs() third_party/ghost/lib/ghost.cc:147:5
    #5 0x7ff981b93fe5 in ghost::Ghost::GetSupportedVersions(std::__u::vector<unsigned int, std::__u::allocator<unsigned int> >&) third_party/ghost/lib/ghost.cc:155:5
    #6 0x7ff982102744 in ghost::Ghost::CheckVersion() third_party/ghost/lib/ghost.h:274:5
    #7 0x7ff98217c564 in __cxx_global_var_init third_party/ghost/lib/agent.cc:102:35
    #8 0x7ff98217c564 in _GLOBAL__sub_I_agent.cc third_party/ghost/lib/agent.cc
    #9 0x7ff9cc37b4cc in call_init (/usr/grte/v5/lib64/ld-linux-x86-64.so.2+0x1c4cc)
    #10 0x7ff9cc37b329 in _dl_init (/usr/grte/v5/lib64/ld-linux-x86-64.so.2+0x1c329)

Fix this by caching the result in a function-local static variable in
get_tid_seqnum_bits() which is guaranteed to be initialized the first time
it is accessed, and this initialization will happen before any other access.

TESTED=all units tests pass in virtme
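
As a generic illustration of the fix described above (the names here are hypothetical and do not match the actual base.cc code): a function-local static is initialized on first use, so callers in other translation units can never observe it before initialization.

// Hypothetical sketch of the function-local-static pattern described in the
// commit message above; names do not match the actual ghOSt sources.
#include <cstdint>
#include <cstdio>

namespace {

uint32_t ComputeTidSeqnumBits() {
  // Stand-in for whatever probing/parsing the real code does at startup.
  return 23;
}

}  // namespace

uint32_t GetTidSeqnumBits() {
  // Initialized the first time this function is called, regardless of the
  // order in which global constructors across translation units run.
  static const uint32_t kBits = ComputeTidSeqnumBits();
  return kBits;
}

int main() {
  std::printf("tid seqnum bits: %u\n", GetTidSeqnumBits());
  return 0;
}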