Skip to content

Latest commit

 

History

History
570 lines (466 loc) · 19.9 KB

File metadata and controls

570 lines (466 loc) · 19.9 KB

系统调用

概念

a system call is the programmatic way in which a computer program requests a service from the kernel of the operating system on which it is executed.

▪ vs. 中断
系统调用是从用户态切换到核心态的同步机制,中断则是一种异步切换。
系统调用的整个过程中,会从用户态切换到内核态,然后又切换到用户态。中断则不一定,例如中断有可能全部在核心态中完成。
系统调用与中断在处理过程上,有较多相似之处,早期的系统调用通过中断来实现的。

▪ 命令

apropos
man
例如: apropos write; man 2 write

实现

汇编指令

IA32早期使用汇编指令int $0x80来引发中断(kernel 2.5之前),更为现代的处理器使用汇编指令syscall和sysret来快速进入和退出核心态,避免了中断的开销。
ARM64则是使用svc指令。

AMD SYSCALL/SYSRET:
"SYSCALL and SYSRET are low-latency system call and return instructions. These instructions assume the operating system implements a flat-memory model, which greatly simplifies calls to and returns from the operating system. This simplification comes from eliminating unneeded checks, and by loading pre-determined values into the CS and SS segment registers (both visible and hidden portions). As a result, SYSCALL and SYSRET can take fewer than one-fourth the number of internal clock cycles to complete than the legacy CALL and RET instructions. SYSCALL and SYSRET are particularly well-suited for use in 64-bit mode, which requires implementation of a paged, flat-memory model." — AMD System programming

In legacy mode AMD CPUs support SYSENTER/SYSEXIT. However, in long mode only SYSCALL and SYSRET are supported.

INTEL SYSENTER/SYSEXIT:
"Executes a fast call to a level 0 system procedure or routine. SYSENTER is a companion instruction to SYSEXIT.
The instruction is optimized to provide the maximum performance for system calls from user code running at privilege level 3 to
operating system or executive procedures running at privilege level 0." — Intel IA-32 (64) programming manual, volume 2B.

syscall/sysenter汇编指令作了什么呢?
https://wiki.osdev.org/SYSENTER
https://www.felixcloutier.com/x86/syscall

syscall:
(1)将当前函数执行地址(rip寄存器的值)保存到rcx中;
(2)将当前标志寄存器rflag的值保存到r11寄存器中;
(3)通过修改rip跳转到MSR_LSTAR寄存器指向的内核函数入口;
(4)根据MSR_SYSCALL_MASK寄存器修改rflag寄存器。

/* Please consult the file sysdeps/unix/sysv/linux/x86-64/sysdep.h for
   more information about the value -4095 used below.  */

/* Usage: long syscall (syscall_number, arg1, arg2, arg3, arg4, arg5, arg6)
   We need to do some arg shifting, the syscall_number will be in
   rax.  */


	.text
ENTRY (syscall)
	movq %rdi, %rax		/* Syscall number -> rax.  */
	movq %rsi, %rdi		/* shift arg1 - arg5.  */
	movq %rdx, %rsi
	movq %rcx, %rdx
	movq %r8, %r10
	movq %r9, %r8
	movq 8(%rsp),%r9	/* arg6 is on the stack.  */
	syscall			/* Do the system call.  */
	cmpq $-4095, %rax	/* Check %rax for error.  */
	jae SYSCALL_ERROR_LABEL	/* Jump to error handler if error.  */
	ret			/* Return to caller.  */

PSEUDO_END (syscall)

进入调用退出

  • 进入

/* May not be marked __init: used by software suspend */
void syscall_init(void)
{
	//...
	wrmsrl(MSR_LSTAR, (unsigned long)entry_SYSCALL_64);
    //...
}

初始化系统调用:
当内核启动时,MSR(Model specific registers-模型特定寄存器)会存储syscall指令的入口函数地址;
当syscall指令执行时,MSR寄存器中取出入口函数地址(这里为entry_SYSCALL_64函数)进行调用。

entry_SYSCALL_64:

/*
 * 64-bit SYSCALL instruction entry. Up to 6 arguments in registers.
 *
 * This is the only entry point used for 64-bit system calls.  The
 * hardware interface is reasonably well designed and the register to
 * argument mapping Linux uses fits well with the registers that are
 * available when SYSCALL is used.
 *
 * SYSCALL instructions can be found inlined in libc implementations as
 * well as some other programs and libraries.  There are also a handful
 * of SYSCALL instructions in the vDSO used, for example, as a
 * clock_gettimeofday fallback.
 *
 * 64-bit SYSCALL saves rip to rcx, clears rflags.RF, then saves rflags to r11,
 * then loads new ss, cs, and rip from previously programmed MSRs.
 * rflags gets masked by a value from another MSR (so CLD and CLAC
 * are not needed). SYSCALL does not save anything on the stack
 * and does not change rsp.
 *
 * Registers on entry:
 * rax  system call number
 * rcx  return address
 * r11  saved rflags (note: r11 is callee-clobbered register in C ABI)
 * rdi  arg0
 * rsi  arg1
 * rdx  arg2
 * r10  arg3 (needs to be moved to rcx to conform to C ABI)
 * r8   arg4
 * r9   arg5
 * (note: r12-r15, rbp, rbx are callee-preserved in C ABI)
 *
 * Only called from user space.
 *
 * When user can change pt_regs->foo always force IRET. That is because
 * it deals with uncanonical addresses better. SYSRET has trouble
 * with them due to bugs in both AMD and Intel CPUs.
 */

SYM_CODE_START(entry_SYSCALL_64)
	UNWIND_HINT_ENTRY
	ENDBR

	swapgs
	/* tss.sp2 is scratch space. */
	movq	%rsp, PER_CPU_VAR(cpu_tss_rw + TSS_sp2)
	SWITCH_TO_KERNEL_CR3 scratch_reg=%rsp
	movq	PER_CPU_VAR(pcpu_hot + X86_top_of_stack), %rsp

SYM_INNER_LABEL(entry_SYSCALL_64_safe_stack, SYM_L_GLOBAL)
	ANNOTATE_NOENDBR

	/* Construct struct pt_regs on stack */
	pushq	$__USER_DS				/* pt_regs->ss */
	pushq	PER_CPU_VAR(cpu_tss_rw + TSS_sp2)	/* pt_regs->sp */
	pushq	%r11					/* pt_regs->flags */
	pushq	$__USER_CS				/* pt_regs->cs */
	pushq	%rcx					/* pt_regs->ip */
SYM_INNER_LABEL(entry_SYSCALL_64_after_hwframe, SYM_L_GLOBAL)
	pushq	%rax					/* pt_regs->orig_ax */

	PUSH_AND_CLEAR_REGS rax=$-ENOSYS

	/* IRQs are off. */
	movq	%rsp, %rdi
	/* Sign extend the lower 32bit as syscall numbers are treated as int */
	movslq	%eax, %rsi

	/* clobbers %rax, make sure it is after saving the syscall nr */
	IBRS_ENTER
	UNTRAIN_RET

	call	do_syscall_64		/* returns with IRQs disabled */

	/*
	 * Try to use SYSRET instead of IRET if we're returning to
	 * a completely clean 64-bit userspace context.  If we're not,
	 * go to the slow exit path.
	 * In the Xen PV case we must use iret anyway.
	 */

	ALTERNATIVE "", "jmp	swapgs_restore_regs_and_return_to_usermode", \
		X86_FEATURE_XENPV

	movq	RCX(%rsp), %rcx
	movq	RIP(%rsp), %r11

	cmpq	%rcx, %r11	/* SYSRET requires RCX == RIP */
	jne	swapgs_restore_regs_and_return_to_usermode

	/*
	 * On Intel CPUs, SYSRET with non-canonical RCX/RIP will #GP
	 * in kernel space.  This essentially lets the user take over
	 * the kernel, since userspace controls RSP.
	 *
	 * If width of "canonical tail" ever becomes variable, this will need
	 * to be updated to remain correct on both old and new CPUs.
	 *
	 * Change top bits to match most significant bit (47th or 56th bit
	 * depending on paging mode) in the address.
	 */
#ifdef CONFIG_X86_5LEVEL
	ALTERNATIVE "shl $(64 - 48), %rcx; sar $(64 - 48), %rcx", \
		"shl $(64 - 57), %rcx; sar $(64 - 57), %rcx", X86_FEATURE_LA57
#else
	shl	$(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx
	sar	$(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx
#endif

	/* If this changed %rcx, it was not canonical */
	cmpq	%rcx, %r11
	jne	swapgs_restore_regs_and_return_to_usermode

	cmpq	$__USER_CS, CS(%rsp)		/* CS must match SYSRET */
	jne	swapgs_restore_regs_and_return_to_usermode

	movq	R11(%rsp), %r11
	cmpq	%r11, EFLAGS(%rsp)		/* R11 == RFLAGS */
	jne	swapgs_restore_regs_and_return_to_usermode

	/*
	 * SYSCALL clears RF when it saves RFLAGS in R11 and SYSRET cannot
	 * restore RF properly. If the slowpath sets it for whatever reason, we
	 * need to restore it correctly.
	 *
	 * SYSRET can restore TF, but unlike IRET, restoring TF results in a
	 * trap from userspace immediately after SYSRET.  This would cause an
	 * infinite loop whenever #DB happens with register state that satisfies
	 * the opportunistic SYSRET conditions.  For example, single-stepping
	 * this user code:
	 *
	 *           movq	$stuck_here, %rcx
	 *           pushfq
	 *           popq %r11
	 *   stuck_here:
	 *
	 * would never get past 'stuck_here'.
	 */
	testq	$(X86_EFLAGS_RF|X86_EFLAGS_TF), %r11
	jnz	swapgs_restore_regs_and_return_to_usermode

	/* nothing to check for RSP */

	cmpq	$__USER_DS, SS(%rsp)		/* SS must match SYSRET */
	jne	swapgs_restore_regs_and_return_to_usermode

	/*
	 * We win! This label is here just for ease of understanding
	 * perf profiles. Nothing jumps here.
	 */
syscall_return_via_sysret:
	IBRS_EXIT
	POP_REGS pop_rdi=0

	/*
	 * Now all regs are restored except RSP and RDI.
	 * Save old stack pointer and switch to trampoline stack.
	 */
	movq	%rsp, %rdi
	movq	PER_CPU_VAR(cpu_tss_rw + TSS_sp0), %rsp
	UNWIND_HINT_EMPTY

	pushq	RSP-RDI(%rdi)	/* RSP */
	pushq	(%rdi)		/* RDI */

	/*
	 * We are on the trampoline stack.  All regs except RDI are live.
	 * We can do future final exit work right here.
	 */
	STACKLEAK_ERASE_NOCLOBBER

	SWITCH_TO_USER_CR3_STACK scratch_reg=%rdi

	popq	%rdi
	popq	%rsp
SYM_INNER_LABEL(entry_SYSRETQ_unsafe_stack, SYM_L_GLOBAL)
	ANNOTATE_NOENDBR
	swapgs
	sysretq
SYM_INNER_LABEL(entry_SYSRETQ_end, SYM_L_GLOBAL)
	ANNOTATE_NOENDBR
	int3
SYM_CODE_END(entry_SYSCALL_64)

swapgs指令切换gs寄存器从用户态到内核态。
swapgs toggles whether gs is the kernel gs or the user gs.
swapgs: https://stackoverflow.com/questions/62546189/where-i-should-use-swapgs-instruction

movq %rsp, PER_CPU_VAR(cpu_tss_rw + TSS_sp2): 先保存用户栈到TSS_sp2
SWITCH_TO_KERNEL_CR3 scratch_reg=%rsp: 切换到内核空间
movq PER_CPU_VAR(pcpu_hot + X86_top_of_stack), %rsp: 切换到内核栈
https://www.zhihu.com/question/381383261

连续的pushq以及PUSH_AND_CLEAR_REGS:
构建内核栈结构体的基本成员变量,保存通用目的寄存器到内核栈。

movq %rsp, %rdi: 将rsp内核栈地址保存到rdi寄存器
movslq %eax, %rsi: 将eax系统调用号保存到rsi寄存器

call do_syscall_64: 进行调用。

SWITCH_TO_USER_CR3_STACK scratch_reg=%rdi: 切换到用户进程空间
popq %rsp: 恢复用户进程栈
sysretq: 退出系统调用

  • 调用

static __always_inline bool do_syscall_x64(struct pt_regs *regs, int nr)
{
	/*
	 * Convert negative numbers to very high and thus out of range
	 * numbers for comparisons.
	 */
	unsigned int unr = nr;

	if (likely(unr < NR_syscalls)) {
		unr = array_index_nospec(unr, NR_syscalls);
		regs->ax = sys_call_table[unr](regs);
		return true;
	}
	return false;
}
//...
__visible noinstr void do_syscall_64(struct pt_regs *regs, int nr)
{
	add_random_kstack_offset();
	nr = syscall_enter_from_user_mode(regs, nr);

	instrumentation_begin();

	if (!do_syscall_x64(regs, nr) && !do_syscall_x32(regs, nr) && nr != -1) {
		/* Invalid system call, but still a system call. */
		regs->ax = __x64_sys_ni_syscall(regs);
	}

	instrumentation_end();
	syscall_exit_to_user_mode(regs);
}
  • 切回用户态

do_syscall_64() → syscall_exit_to_user_mode()

__visible noinstr void syscall_exit_to_user_mode(struct pt_regs *regs)
{
	instrumentation_begin();
	__syscall_exit_to_user_mode_work(regs);
	instrumentation_end();
	__exit_to_user_mode();
}

conventions

▪ x86-64 linux kernel Calling Conventions
1. User-level applications use as integer registers for passing the sequence %rdi, %rsi, %rdx, %rcx, %r8 and %r9. The kernel interface uses %rdi, %rsi, %rdx, %r10, %r8 and %r9.
2. A system-call is done via the syscall instruction. The kernel destroys registers %rcx and %r11.
3. The number of the syscall has to be passed in register %rax.
4. System-calls are limited to six arguments, no argument is passed directly on the stack.
5. Returning from the syscall, register %rax contains the result of the system-call. A value in the range between -4095 and -1 indicates an error, it is -errno.
6. Only values of class INTEGER or class MEMORY are passed to the kernel.

访问用户空间

所有的系统调用在核心态执行

某些情况下, 内核必须访问应用程序的虚拟内存:

系统调用需要超过6个不同的参数
    此时只能借助进程内存空间的C结构指针来传递给内核
系统调用产生了大量数据,不能通过返回值机制传递给用户进程
    此时必须通过指定的内存区交换数据

copy_from_user(void *to, const void user *from, unsigned long n)
copy_to_user(void
user *to, const void *from, unsigned long n)
https://elixir.bootlin.com/linux/latest/source/include/linux/uaccess.h

__user宏: 属性标记,一些源代码检查工具例如sparse会用到
https://stackoverflow.com/questions/4521551/what-are-the-implications-of-the-linux-user-macro

get_user get_user
put_user
put_user
strncpy_from_user clear_user等
https://elixir.bootlin.com/linux/latest/source/arch/x86/include/asm/uaccess.h

系统调用与信号

系统调用的过程中,如果发生信号了呢?

要么返回EINTR错误,要么会自动重启该函数。

与SA_RESTART以及哪种系统调用有关:

static void
handle_signal(struct ksignal *ksig, struct pt_regs *regs)
{
	bool stepping, failed;
	struct fpu *fpu = &current->thread.fpu;

	if (v8086_mode(regs))
		save_v86_state((struct kernel_vm86_regs *) regs, VM86_SIGNAL);

	/* Are we from a system call? */
	if (syscall_get_nr(current, regs) != -1) {
		/* If so, check system call restarting.. */
		switch (syscall_get_error(current, regs)) {
		case -ERESTART_RESTARTBLOCK:
		case -ERESTARTNOHAND:
			regs->ax = -EINTR;
			break;

		case -ERESTARTSYS:
			if (!(ksig->ka.sa.sa_flags & SA_RESTART)) {
				regs->ax = -EINTR;
				break;
			}
			fallthrough;
		case -ERESTARTNOINTR:
			regs->ax = regs->orig_ax;
			regs->ip -= 2;
			break;
		}
	}

	/*
	 * If TF is set due to a debugger (TIF_FORCED_TF), clear TF now
	 * so that register information in the sigcontext is correct and
	 * then notify the tracer before entering the signal handler.
	 */
	stepping = test_thread_flag(TIF_SINGLESTEP);
	if (stepping)
		user_disable_single_step(current);

	failed = (setup_rt_frame(ksig, regs) < 0);
	if (!failed) {
		/*
		 * Clear the direction flag as per the ABI for function entry.
		 *
		 * Clear RF when entering the signal handler, because
		 * it might disable possible debug exception from the
		 * signal handler.
		 *
		 * Clear TF for the case when it wasn't set by debugger to
		 * avoid the recursive send_sigtrap() in SIGTRAP handler.
		 */
		regs->flags &= ~(X86_EFLAGS_DF|X86_EFLAGS_RF|X86_EFLAGS_TF);
		/*
		 * Ensure the signal handler starts with the new fpu state.
		 */
		fpu__clear_user_states(fpu);
	}
	signal_setup_done(failed, ksig, stepping);
}

参考: Interruption of system calls and library functions by signal handlers https://man7.org/linux/man-pages/man7/signal.7.html
参考: https://www.halolinux.us/kernel-reference/reexecution-of-system-calls.html

系统调用与中断

进入系统调用后系统并不会自动关闭中断,因此系统调用的过程是可以中断的。

加速:vDSO与vsyscall

vsyscall(virtual system call) 2.5.53引入内核,它在用户空间映射一个包含一些变量及一些系统调用的实现的内存页,这些系统调用将在用户空间下执行,不需要触发trap机制进入内核。

vsyscall存在一些问题:
vsyscall映射到内存的固定位置处,有潜在的安全风险;
vsyscall内存页不包含符号表等信息,程序出错时进行core dump比较麻烦。

为了解决上述问题,设计了vDSO机制,2.6引入内核。

vDSO(virtual dynamic shared object)也是一种系统调用加速机制。
vDSO和vsyscall原理类似,都是通过映射到用户空间的代码和数据来模拟系统调用。
它通过以下手段规避vsyscall的缺陷:
vDSO依赖ASLR(地址空间布局随机化)技术,对vDSO的地址进行随机化
vDSO是一个ELF格式的动态库,拥有完整的符号表信息

不过一般vDSO实现的也就几个系统调用(x86 4个):
https://elixir.bootlin.com/linux/latest/source/arch/x86/um/vdso/vdso.lds.S
clock_gettime;
vdso_clock_gettime;
gettimeofday;
vdso_gettimeofday;
getcpu;
vdso_getcpu;
time;
vdso_time;

ldd /bin/ls
linux-vdso.so.1 (0x00007ffe3c9a0000)
可以看到,linux-vdso.so.1,其映射的基地址每次都是不同的。

实战

追踪系统调用

strace工具(使用ptrace系统调用, 即sys_ptrace)

示例:

strace -c ./testgettimeofday
strace -ce mmap ./testgettimeofday   #统计某个系统调用
strace -o result.txt ./testgettimeofday

查看对应代码

系统调用的实现分布在不同的文件中。

Q: 如何查看系统调用的实现代码呢?
A: 下面以gettimeofday和brk为例, 说明如何查看系统调用的实现代码:

方法1:

打开https://elixir.bootlin.com/linux/latest/source/include/linux/syscalls.h
搜索gettimeofday, 会看到kernel/time.c这样的所在文件注释,这说明其存在于kernel/time.c(实际上是kernel/time/time.c)。

方法2:

搜索SYSCALL_DEFINE2(gettimeofday
其中数字2表示参数个数,根据函数个数酌情修改, 也可以考虑正则表达式匹配。
例如: find . -name '*.*' | xargs grep -rn 'SYSCALL_DEFINE2(gettimeofday'
正则表达式匹配:
find . -name '*.*' | xargs grep -rn 'DEFINE[[:digit:]](gettimeofday'