a system call is the programmatic way in which a computer program requests a service from the kernel of the operating system on which it is executed.
▪ vs. 中断
系统调用是从用户态切换到核心态的同步机制,中断则是一种异步切换。
系统调用的整个过程中,会从用户态切换到内核态,然后又切换到用户态。中断则不一定,例如中断有可能全部在核心态中完成。
系统调用与中断在处理过程上,有较多相似之处,早期的系统调用通过中断来实现的。
▪ 有哪些系统调用?
https://elixir.bootlin.com/linux/latest/source/include/linux/syscalls.h
系统调用个数: __NR_syscalls: https://elixir.bootlin.com/linux/latest/source/include/uapi/asm-generic/unistd.h
▪ 命令
apropos man 例如: apropos write; man 2 write
IA32早期使用汇编指令int $0x80来引发中断(kernel 2.5之前),更为现代的处理器使用汇编指令syscall和sysret来快速进入和退出核心态,避免了中断的开销。
ARM64则是使用svc指令。
AMD SYSCALL/SYSRET:
"SYSCALL and SYSRET are low-latency system call and return instructions. These instructions assume the operating system implements a flat-memory model, which greatly simplifies calls to and returns from the operating system. This simplification comes from eliminating unneeded checks, and by loading pre-determined values into the CS and SS segment registers (both visible and hidden portions). As a result, SYSCALL and SYSRET can take fewer than one-fourth the number of internal clock cycles to complete than the legacy CALL and RET instructions. SYSCALL and SYSRET are particularly well-suited for use in 64-bit mode, which requires implementation of a paged, flat-memory model." — AMD System programming
In legacy mode AMD CPUs support SYSENTER/SYSEXIT. However, in long mode only SYSCALL and SYSRET are supported.
INTEL SYSENTER/SYSEXIT:
"Executes a fast call to a level 0 system procedure or routine. SYSENTER is a companion instruction to SYSEXIT.
The instruction is optimized to provide the maximum performance for system calls from user code running at privilege level 3 to
operating system or executive procedures running at privilege level 0." — Intel IA-32 (64) programming manual, volume 2B.
syscall/sysenter汇编指令作了什么呢?
https://wiki.osdev.org/SYSENTER
https://www.felixcloutier.com/x86/syscall
syscall:
(1)将当前函数执行地址(rip寄存器的值)保存到rcx中;
(2)将当前标志寄存器rflag的值保存到r11寄存器中;
(3)通过修改rip跳转到MSR_LSTAR寄存器指向的内核函数入口;
(4)根据MSR_SYSCALL_MASK寄存器修改rflag寄存器。
/* Please consult the file sysdeps/unix/sysv/linux/x86-64/sysdep.h for
more information about the value -4095 used below. */
/* Usage: long syscall (syscall_number, arg1, arg2, arg3, arg4, arg5, arg6)
We need to do some arg shifting, the syscall_number will be in
rax. */
.text
ENTRY (syscall)
movq %rdi, %rax /* Syscall number -> rax. */
movq %rsi, %rdi /* shift arg1 - arg5. */
movq %rdx, %rsi
movq %rcx, %rdx
movq %r8, %r10
movq %r9, %r8
movq 8(%rsp),%r9 /* arg6 is on the stack. */
syscall /* Do the system call. */
cmpq $-4095, %rax /* Check %rax for error. */
jae SYSCALL_ERROR_LABEL /* Jump to error handler if error. */
ret /* Return to caller. */
PSEUDO_END (syscall)
-
进入
/* May not be marked __init: used by software suspend */
void syscall_init(void)
{
//...
wrmsrl(MSR_LSTAR, (unsigned long)entry_SYSCALL_64);
//...
}
初始化系统调用:
当内核启动时,MSR(Model specific registers-模型特定寄存器)会存储syscall指令的入口函数地址;
当syscall指令执行时,MSR寄存器中取出入口函数地址(这里为entry_SYSCALL_64函数)进行调用。
entry_SYSCALL_64:
/*
* 64-bit SYSCALL instruction entry. Up to 6 arguments in registers.
*
* This is the only entry point used for 64-bit system calls. The
* hardware interface is reasonably well designed and the register to
* argument mapping Linux uses fits well with the registers that are
* available when SYSCALL is used.
*
* SYSCALL instructions can be found inlined in libc implementations as
* well as some other programs and libraries. There are also a handful
* of SYSCALL instructions in the vDSO used, for example, as a
* clock_gettimeofday fallback.
*
* 64-bit SYSCALL saves rip to rcx, clears rflags.RF, then saves rflags to r11,
* then loads new ss, cs, and rip from previously programmed MSRs.
* rflags gets masked by a value from another MSR (so CLD and CLAC
* are not needed). SYSCALL does not save anything on the stack
* and does not change rsp.
*
* Registers on entry:
* rax system call number
* rcx return address
* r11 saved rflags (note: r11 is callee-clobbered register in C ABI)
* rdi arg0
* rsi arg1
* rdx arg2
* r10 arg3 (needs to be moved to rcx to conform to C ABI)
* r8 arg4
* r9 arg5
* (note: r12-r15, rbp, rbx are callee-preserved in C ABI)
*
* Only called from user space.
*
* When user can change pt_regs->foo always force IRET. That is because
* it deals with uncanonical addresses better. SYSRET has trouble
* with them due to bugs in both AMD and Intel CPUs.
*/
SYM_CODE_START(entry_SYSCALL_64)
UNWIND_HINT_ENTRY
ENDBR
swapgs
/* tss.sp2 is scratch space. */
movq %rsp, PER_CPU_VAR(cpu_tss_rw + TSS_sp2)
SWITCH_TO_KERNEL_CR3 scratch_reg=%rsp
movq PER_CPU_VAR(pcpu_hot + X86_top_of_stack), %rsp
SYM_INNER_LABEL(entry_SYSCALL_64_safe_stack, SYM_L_GLOBAL)
ANNOTATE_NOENDBR
/* Construct struct pt_regs on stack */
pushq $__USER_DS /* pt_regs->ss */
pushq PER_CPU_VAR(cpu_tss_rw + TSS_sp2) /* pt_regs->sp */
pushq %r11 /* pt_regs->flags */
pushq $__USER_CS /* pt_regs->cs */
pushq %rcx /* pt_regs->ip */
SYM_INNER_LABEL(entry_SYSCALL_64_after_hwframe, SYM_L_GLOBAL)
pushq %rax /* pt_regs->orig_ax */
PUSH_AND_CLEAR_REGS rax=$-ENOSYS
/* IRQs are off. */
movq %rsp, %rdi
/* Sign extend the lower 32bit as syscall numbers are treated as int */
movslq %eax, %rsi
/* clobbers %rax, make sure it is after saving the syscall nr */
IBRS_ENTER
UNTRAIN_RET
call do_syscall_64 /* returns with IRQs disabled */
/*
* Try to use SYSRET instead of IRET if we're returning to
* a completely clean 64-bit userspace context. If we're not,
* go to the slow exit path.
* In the Xen PV case we must use iret anyway.
*/
ALTERNATIVE "", "jmp swapgs_restore_regs_and_return_to_usermode", \
X86_FEATURE_XENPV
movq RCX(%rsp), %rcx
movq RIP(%rsp), %r11
cmpq %rcx, %r11 /* SYSRET requires RCX == RIP */
jne swapgs_restore_regs_and_return_to_usermode
/*
* On Intel CPUs, SYSRET with non-canonical RCX/RIP will #GP
* in kernel space. This essentially lets the user take over
* the kernel, since userspace controls RSP.
*
* If width of "canonical tail" ever becomes variable, this will need
* to be updated to remain correct on both old and new CPUs.
*
* Change top bits to match most significant bit (47th or 56th bit
* depending on paging mode) in the address.
*/
#ifdef CONFIG_X86_5LEVEL
ALTERNATIVE "shl $(64 - 48), %rcx; sar $(64 - 48), %rcx", \
"shl $(64 - 57), %rcx; sar $(64 - 57), %rcx", X86_FEATURE_LA57
#else
shl $(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx
sar $(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx
#endif
/* If this changed %rcx, it was not canonical */
cmpq %rcx, %r11
jne swapgs_restore_regs_and_return_to_usermode
cmpq $__USER_CS, CS(%rsp) /* CS must match SYSRET */
jne swapgs_restore_regs_and_return_to_usermode
movq R11(%rsp), %r11
cmpq %r11, EFLAGS(%rsp) /* R11 == RFLAGS */
jne swapgs_restore_regs_and_return_to_usermode
/*
* SYSCALL clears RF when it saves RFLAGS in R11 and SYSRET cannot
* restore RF properly. If the slowpath sets it for whatever reason, we
* need to restore it correctly.
*
* SYSRET can restore TF, but unlike IRET, restoring TF results in a
* trap from userspace immediately after SYSRET. This would cause an
* infinite loop whenever #DB happens with register state that satisfies
* the opportunistic SYSRET conditions. For example, single-stepping
* this user code:
*
* movq $stuck_here, %rcx
* pushfq
* popq %r11
* stuck_here:
*
* would never get past 'stuck_here'.
*/
testq $(X86_EFLAGS_RF|X86_EFLAGS_TF), %r11
jnz swapgs_restore_regs_and_return_to_usermode
/* nothing to check for RSP */
cmpq $__USER_DS, SS(%rsp) /* SS must match SYSRET */
jne swapgs_restore_regs_and_return_to_usermode
/*
* We win! This label is here just for ease of understanding
* perf profiles. Nothing jumps here.
*/
syscall_return_via_sysret:
IBRS_EXIT
POP_REGS pop_rdi=0
/*
* Now all regs are restored except RSP and RDI.
* Save old stack pointer and switch to trampoline stack.
*/
movq %rsp, %rdi
movq PER_CPU_VAR(cpu_tss_rw + TSS_sp0), %rsp
UNWIND_HINT_EMPTY
pushq RSP-RDI(%rdi) /* RSP */
pushq (%rdi) /* RDI */
/*
* We are on the trampoline stack. All regs except RDI are live.
* We can do future final exit work right here.
*/
STACKLEAK_ERASE_NOCLOBBER
SWITCH_TO_USER_CR3_STACK scratch_reg=%rdi
popq %rdi
popq %rsp
SYM_INNER_LABEL(entry_SYSRETQ_unsafe_stack, SYM_L_GLOBAL)
ANNOTATE_NOENDBR
swapgs
sysretq
SYM_INNER_LABEL(entry_SYSRETQ_end, SYM_L_GLOBAL)
ANNOTATE_NOENDBR
int3
SYM_CODE_END(entry_SYSCALL_64)
swapgs指令切换gs寄存器从用户态到内核态。
swapgs toggles whether gs is the kernel gs or the user gs.
swapgs: https://stackoverflow.com/questions/62546189/where-i-should-use-swapgs-instruction
movq %rsp, PER_CPU_VAR(cpu_tss_rw + TSS_sp2): 先保存用户栈到TSS_sp2
SWITCH_TO_KERNEL_CR3 scratch_reg=%rsp: 切换到内核空间
movq PER_CPU_VAR(pcpu_hot + X86_top_of_stack), %rsp: 切换到内核栈
https://www.zhihu.com/question/381383261
连续的pushq以及PUSH_AND_CLEAR_REGS:
构建内核栈结构体的基本成员变量,保存通用目的寄存器到内核栈。
movq %rsp, %rdi: 将rsp内核栈地址保存到rdi寄存器
movslq %eax, %rsi: 将eax系统调用号保存到rsi寄存器
call do_syscall_64: 进行调用。
SWITCH_TO_USER_CR3_STACK scratch_reg=%rdi: 切换到用户进程空间
popq %rsp: 恢复用户进程栈
sysretq: 退出系统调用
-
调用
static __always_inline bool do_syscall_x64(struct pt_regs *regs, int nr)
{
/*
* Convert negative numbers to very high and thus out of range
* numbers for comparisons.
*/
unsigned int unr = nr;
if (likely(unr < NR_syscalls)) {
unr = array_index_nospec(unr, NR_syscalls);
regs->ax = sys_call_table[unr](regs);
return true;
}
return false;
}
//...
__visible noinstr void do_syscall_64(struct pt_regs *regs, int nr)
{
add_random_kstack_offset();
nr = syscall_enter_from_user_mode(regs, nr);
instrumentation_begin();
if (!do_syscall_x64(regs, nr) && !do_syscall_x32(regs, nr) && nr != -1) {
/* Invalid system call, but still a system call. */
regs->ax = __x64_sys_ni_syscall(regs);
}
instrumentation_end();
syscall_exit_to_user_mode(regs);
}
-
切回用户态
do_syscall_64() → syscall_exit_to_user_mode()
__visible noinstr void syscall_exit_to_user_mode(struct pt_regs *regs)
{
instrumentation_begin();
__syscall_exit_to_user_mode_work(regs);
instrumentation_end();
__exit_to_user_mode();
}
▪ x86-64 linux kernel Calling Conventions
1. User-level applications use as integer registers for passing the sequence %rdi, %rsi, %rdx, %rcx, %r8 and %r9. The kernel interface uses %rdi, %rsi, %rdx, %r10, %r8 and %r9.
2. A system-call is done via the syscall instruction. The kernel destroys registers %rcx and %r11.
3. The number of the syscall has to be passed in register %rax.
4. System-calls are limited to six arguments, no argument is passed directly on the stack.
5. Returning from the syscall, register %rax contains the result of the system-call. A value in the range between -4095 and -1 indicates an error, it is -errno.
6. Only values of class INTEGER or class MEMORY are passed to the kernel.
所有的系统调用在核心态执行
某些情况下, 内核必须访问应用程序的虚拟内存:
系统调用需要超过6个不同的参数 此时只能借助进程内存空间的C结构指针来传递给内核 系统调用产生了大量数据,不能通过返回值机制传递给用户进程 此时必须通过指定的内存区交换数据
copy_from_user(void *to, const void user *from, unsigned long n)
copy_to_user(void user *to, const void *from, unsigned long n)
https://elixir.bootlin.com/linux/latest/source/include/linux/uaccess.h
vs. memcpy
参考: http://www.wowotech.net/memory_management/454.html
参考: https://blog.51cto.com/u_15015138/2554741
__user宏: 属性标记,一些源代码检查工具例如sparse会用到
https://stackoverflow.com/questions/4521551/what-are-the-implications-of-the-linux-user-macro
get_user get_user
put_user put_user
strncpy_from_user clear_user等
https://elixir.bootlin.com/linux/latest/source/arch/x86/include/asm/uaccess.h
系统调用的过程中,如果发生信号了呢?
要么返回EINTR错误,要么会自动重启该函数。
与SA_RESTART以及哪种系统调用有关:
static void
handle_signal(struct ksignal *ksig, struct pt_regs *regs)
{
bool stepping, failed;
struct fpu *fpu = ¤t->thread.fpu;
if (v8086_mode(regs))
save_v86_state((struct kernel_vm86_regs *) regs, VM86_SIGNAL);
/* Are we from a system call? */
if (syscall_get_nr(current, regs) != -1) {
/* If so, check system call restarting.. */
switch (syscall_get_error(current, regs)) {
case -ERESTART_RESTARTBLOCK:
case -ERESTARTNOHAND:
regs->ax = -EINTR;
break;
case -ERESTARTSYS:
if (!(ksig->ka.sa.sa_flags & SA_RESTART)) {
regs->ax = -EINTR;
break;
}
fallthrough;
case -ERESTARTNOINTR:
regs->ax = regs->orig_ax;
regs->ip -= 2;
break;
}
}
/*
* If TF is set due to a debugger (TIF_FORCED_TF), clear TF now
* so that register information in the sigcontext is correct and
* then notify the tracer before entering the signal handler.
*/
stepping = test_thread_flag(TIF_SINGLESTEP);
if (stepping)
user_disable_single_step(current);
failed = (setup_rt_frame(ksig, regs) < 0);
if (!failed) {
/*
* Clear the direction flag as per the ABI for function entry.
*
* Clear RF when entering the signal handler, because
* it might disable possible debug exception from the
* signal handler.
*
* Clear TF for the case when it wasn't set by debugger to
* avoid the recursive send_sigtrap() in SIGTRAP handler.
*/
regs->flags &= ~(X86_EFLAGS_DF|X86_EFLAGS_RF|X86_EFLAGS_TF);
/*
* Ensure the signal handler starts with the new fpu state.
*/
fpu__clear_user_states(fpu);
}
signal_setup_done(failed, ksig, stepping);
}
参考: Interruption of system calls and library functions by signal handlers https://man7.org/linux/man-pages/man7/signal.7.html
参考: https://www.halolinux.us/kernel-reference/reexecution-of-system-calls.html
进入系统调用后系统并不会自动关闭中断,因此系统调用的过程是可以中断的。
https://en.wikipedia.org/wiki/System_call#Processor_mode_and_context_switching
https://stackoverflow.com/questions/54878237/is-there-a-separate-kernel-level-thread-for-handling-system-calls-by-user-proces
系统调用可能被多个进程/线程同时调用,因此系统调用的实现里需要考虑竞态条件。
linux非线程安全函数: https://man7.org/linux/man-pages/man7/pthreads.7.html
vsyscall(virtual system call) 2.5.53引入内核,它在用户空间映射一个包含一些变量及一些系统调用的实现的内存页,这些系统调用将在用户空间下执行,不需要触发trap机制进入内核。
vsyscall存在一些问题:
vsyscall映射到内存的固定位置处,有潜在的安全风险;
vsyscall内存页不包含符号表等信息,程序出错时进行core dump比较麻烦。
为了解决上述问题,设计了vDSO机制,2.6引入内核。
vDSO(virtual dynamic shared object)也是一种系统调用加速机制。
vDSO和vsyscall原理类似,都是通过映射到用户空间的代码和数据来模拟系统调用。
它通过以下手段规避vsyscall的缺陷:
vDSO依赖ASLR(地址空间布局随机化)技术,对vDSO的地址进行随机化
vDSO是一个ELF格式的动态库,拥有完整的符号表信息
不过一般vDSO实现的也就几个系统调用(x86 4个):
https://elixir.bootlin.com/linux/latest/source/arch/x86/um/vdso/vdso.lds.S
clock_gettime;
vdso_clock_gettime;
gettimeofday;
vdso_gettimeofday;
getcpu;
vdso_getcpu;
time;
vdso_time;
ldd /bin/ls
linux-vdso.so.1 (0x00007ffe3c9a0000)
可以看到,linux-vdso.so.1,其映射的基地址每次都是不同的。
strace工具(使用ptrace系统调用, 即sys_ptrace)
示例:
strace -c ./testgettimeofday strace -ce mmap ./testgettimeofday #统计某个系统调用 strace -o result.txt ./testgettimeofday
系统调用的实现分布在不同的文件中。
Q: 如何查看系统调用的实现代码呢?
A: 下面以gettimeofday和brk为例, 说明如何查看系统调用的实现代码:
方法1:
打开https://elixir.bootlin.com/linux/latest/source/include/linux/syscalls.h 搜索gettimeofday, 会看到kernel/time.c这样的所在文件注释,这说明其存在于kernel/time.c(实际上是kernel/time/time.c)。
方法2:
搜索SYSCALL_DEFINE2(gettimeofday 其中数字2表示参数个数,根据函数个数酌情修改, 也可以考虑正则表达式匹配。 例如: find . -name '*.*' | xargs grep -rn 'SYSCALL_DEFINE2(gettimeofday' 正则表达式匹配: find . -name '*.*' | xargs grep -rn 'DEFINE[[:digit:]](gettimeofday'