\chapter{Appendix}
\section{Shell}
A shell is how you interact with the system. Before user-friendly operating systems, when a computer started up, all you had access to was a shell, and all of your commands and editing had to be done this way. Nowadays, our computers boot into a graphical desktop, but one can still access a shell using a terminal.
\begin{lstlisting}[language=bash]
(Stuff) $
\end{lstlisting}
It is ready for your next command! You can type in many Unix utilities like \keyword{ls} or \keyword{echo\ Hello}, and the shell will execute them and give you the result. Some of these are \keyword{shell-builtins}, meaning that the code is in the shell program itself. Others are compiled programs that the shell runs for you. To find those, the shell looks through a special environment variable called \keyword{PATH}, which contains a colon-separated list of directories to search for an executable with that name. Here is an example path.
\begin{lstlisting}[language=bash]
$ echo $PATH
/usr/local/sbin:/usr/local/bin:/usr/sbin:
/usr/bin:/sbin:/bin:/usr/games:/usr/local/games
\end{lstlisting}
So when the shell executes \keyword{ls}, it looks through all of those directories, finds \keyword{/bin/ls} and executes that.
\begin{lstlisting}[language=bash]
$ ls
...
$ /bin/ls
\end{lstlisting}
You can always call a program through its full path. That is why in past classes, if you wanted to run something from the terminal, you had to type \keyword{./exe}: typically the directory that you are working in is not in the \keyword{PATH} variable. The \keyword{.} expands to your current directory, so your shell executes \keyword{\textless{}current\_dir\textgreater{}/exe}, which is a valid command.
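Your own programs can rely on the same lookup. Here is a minimal sketch, not from the original text, using \keyword{execvp}, which searches \keyword{PATH} the same way the shell does.
\begin{lstlisting}[language=C]
#include <stdio.h>
#include <unistd.h>

// execvp performs the same PATH search the shell does: "ls" is looked
// up in each directory listed in PATH until an executable is found.
int main(void) {
    char *args[] = {"ls", "-l", NULL};
    execvp(args[0], args);
    perror("execvp"); // only reached if nothing in PATH matched
    return 1;
}
\end{lstlisting}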
\subsection{Shell tricks and tips}
\begin{itemize}
\item The up arrow will get you your most recent command
\item \keyword{ctrl-r} will search commands that you previously ran
\item \keyword{ctrl-c} will interrupt your shell's process
\item \keyword{!!} will execute the last command
\item \keyword{!<num>} goes back that many commands and runs that
\item \keyword{!<prefix>} runs the last command that has that prefix
\item \keyword{!\$} is the last arg of the previous command
\item \keyword{!*} is all args of the previous command
\item \keyword{\^pat\^sub} takes the last command and substitutes the pattern pat for the substitution sub
\item \keyword{cd -} goes to the previous directory
\item \keyword{pushd <dir>} pushes the current directory on a stack and cds
\item \keyword{popd} cds to the directory at the top of the stack
\end{itemize}
\subsection{What's a terminal?}
A terminal is an application that displays the output from the shell. You can use your system's default terminal, a Quake-style drop-down terminal, terminator; the options are endless!
\subsection{Common Utilities}
\begin{enumerate}
\item \keyword{cat} concatenate multiple files. It is regularly used to print out the contents of a file to the terminal but the original use was concatenation.
\begin{lstlisting}[language=bash]
$ cat file.txt
...
$ cat shakespeare.txt shakespeare.txt > two_shakes.txt
\end{lstlisting}
\item \keyword{diff} tells you the difference between two files. If nothing is printed, then zero is returned, meaning the files are the same byte for byte. Otherwise, the longest common subsequence difference is printed.
\begin{lstlisting}[language=bash]
$ cat prog.txt
hello
world
$ cat adele.txt
hello
it's me
$ diff prog.txt prog.txt
$ diff prog.txt adele.txt
2c2
< world
---
> it's me
\end{lstlisting}
\item \keyword{grep} tells you which lines in a file or standard input match a POSIX pattern.
\begin{lstlisting}[language=bash]
$ grep it adele.txt
it's me
\end{lstlisting}
\item \keyword{ls} tells you which files are in the current directory.
\item \keyword{cd} is a shell builtin that changes to a relative or absolute directory.
\begin{lstlisting}[language=bash]
$ cd /usr
$ cd lib/
$ cd -
$ pwd
/usr
\end{lstlisting}
\item \keyword{man} is every system programmer's favorite command. It tells you more about all your favorite functions!
\item \keyword{make} executes programs according to a makefile.
\end{enumerate}
\subsection{Syntactic}
Shells have many useful utilities like saving some output to a file using redirection \keyword{>}.
This overwrites the file from the beginning.
If you only meant to append to the file, you can use \keyword{>>}.
Unix also allows file descriptor swapping.
This means that you can take the output going to one file descriptor and make it seem like it's coming out of another.
The most common one is \keyword{2>&1} which means take the stderr and make it seem like it is coming out of standard out.
This is important because \keyword{>} and \keyword{>>} only redirect the standard output of the program.
Note that order matters: \keyword{2>&1} must come after the file redirection, because it points stderr at wherever stdout points at that moment.
There are some examples below.
\begin{lstlisting}[language=bash]
$ ./program > output.txt # To overwrite
$ ./program >> output.txt # To append
$ ./program > output_all.txt 2>&1 # stderr & stdout
$ ./program > /dev/null 2>&1 # don't care about any output
\end{lstlisting}
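Under the hood, redirection is file descriptor manipulation. Here is a rough sketch, with a made-up filename, of what the shell does for \keyword{./program > out.txt 2>&1} using \keyword{dup2}.
\begin{lstlisting}[language=C]
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

// Roughly what the shell sets up for `./program > out.txt 2>&1`:
// point fd 1 at the file, then make fd 2 a copy of fd 1.
int main(void) {
    int fd = open("out.txt", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }
    dup2(fd, STDOUT_FILENO);
    dup2(STDOUT_FILENO, STDERR_FILENO);
    close(fd);
    printf("to stdout\n");
    fprintf(stderr, "to stderr\n"); // both land in out.txt
    return 0;
}
\end{lstlisting}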
The pipe operator has a fascinating history.
The UNIX philosophy is writing small programs and chaining them together to do new and interesting things.
Back in the early days, hard disk space was limited and write times were slow.
Brian Kernighan wanted to maintain the philosophy while omitting intermediate files that take up hard drive space.
So, the UNIX pipe was born.
A pipe takes the \keyword{stdout} of the program on its left and feeds it to the \keyword{stdin} of the program on its right.
Consider the command \keyword{tee}.
It can be used as a replacement for the redirection operators because tee will both write to a file and output to standard out.
It also has the added benefit that it doesn't need to be the last command in the list. Meaning, that you can write an intermediate result and continue your piping.
\begin{lstlisting}[language=bash]
$ ./program | tee output.txt # Overwrite
$ ./program | tee -a output.txt # Append
$ head output.txt | wc | head -n 1 # Multi pipes
$ ((head output.txt) | wc) | head -n 1 # Same as above
$ ./program | tee intermediate.txt | wc
\end{lstlisting}
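A pipe is a kernel object that a C program can set up directly. The sketch below is an approximation of what the shell does for \keyword{ls | wc}: one pipe, with the writer's stdout and the reader's stdin rewired to its two ends.
\begin{lstlisting}[language=C]
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    int fds[2];
    if (pipe(fds) < 0) { perror("pipe"); return 1; }
    if (fork() == 0) {            // child: `ls`
        dup2(fds[1], STDOUT_FILENO);
        close(fds[0]); close(fds[1]);
        execlp("ls", "ls", (char *)NULL);
        _exit(1);                 // only reached if exec failed
    }
    if (fork() == 0) {            // child: `wc`
        dup2(fds[0], STDIN_FILENO);
        close(fds[0]); close(fds[1]);
        execlp("wc", "wc", (char *)NULL);
        _exit(1);
    }
    close(fds[0]); close(fds[1]); // parent must close both ends
    while (wait(NULL) > 0) {}     // reap both children
    return 0;
}
\end{lstlisting}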
The \keyword{&&} and \keyword{||} operators execute commands sequentially. \keyword{&&} only executes the next command if the previous command succeeds; \keyword{||} only executes the next command if the previous command fails.
\begin{lstlisting}[language=bash]
$ false && echo "Hello!"
$ true && echo "Hello!"
$ false || echo "Hello!"
\end{lstlisting}
\subsection{What are environment variables?}
Each process gets its own dictionary of environment variables, which are copied over to the child. If the parent changes its environment variables after the fork, the change won't be transferred to the child, and vice versa. This is important in the fork-exec-wait trilogy if you want to exec a program with different environment variables than your parent (or any other process).
For example, you can write a C program that loops through all of the time zones and executes the \keyword{date} command to print out the date and time in each locale. Environment variables are used by all sorts of programs, so modifying them is important.
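Here is a minimal sketch of that idea; the zone names are just examples. Each child modifies only its own copy of the environment before exec'ing \keyword{date}, so the parent's environment is untouched.
\begin{lstlisting}[language=C]
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    // A few example time zones; any valid TZ names work.
    const char *zones[] = {"America/Chicago", "Europe/London", "Asia/Tokyo", NULL};
    for (int i = 0; zones[i]; i++) {
        pid_t pid = fork();
        if (pid == 0) {
            // Child: change only our copy of the environment, then exec date.
            setenv("TZ", zones[i], 1 /* overwrite */);
            execlp("date", "date", (char *)NULL);
            exit(1); // only reached if exec fails
        }
        waitpid(pid, NULL, 0); // the parent's TZ is untouched
    }
    return 0;
}
\end{lstlisting}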
\subsubsection{Struct packing}
Structs may require something called \href{http://www.catb.org/esr/structure-packing/}{padding} (tutorial).
\textbf{We do not expect you to pack structs in this course; just know that compilers perform padding.}
This is because in the early days (and even now) loading an address in memory happens in 32-bit or 64-bit blocks.
This also means requested addresses need to be multiples of the block size.
\begin{lstlisting}[language=C]
struct picture{
int height;
pixel** data;
int width;
char* encoding;
}
\end{lstlisting}
You think the picture looks like this.
One box is four bytes.
\begin{figure}[H]
\centering
\includegraphics[width=.7\textwidth]{appendix/drawings/struct_clean.eps}
\caption{Six box struct}
\label{fig:clean_struct}
\end{figure}
However, with struct padding, it would conceptually look like this:
\begin{lstlisting}[language=C]
struct picture{
int height;
char slop1[4];
pixel** data;
int width;
char slop2[4];
char* encoding;
}
\end{lstlisting}
Visually, we'd add two extra boxes to our diagram
\begin{figure}[H]
\centering
\includegraphics[width=.7\textwidth]{appendix/drawings/struct_slop.eps}
\caption{Eight box struct, two boxes of slop}
\label{fig:sloppy_struct}
\end{figure}
This padding is common on 64-bit systems.
Other times, a processor supports unaligned access, which leaves the compiler able to pack structs.
What does this mean?
A variable can start at a non-64-bit-aligned boundary, and the processor will figure out the rest.
To enable this, set an attribute.
\begin{lstlisting}[language=C]
struct __attribute__((packed, aligned(4))) picture{
int height;
pixel** data;
int width;
char* encoding;
}
\end{lstlisting}
Now our figure will look like the clean struct in figure \ref{fig:clean_struct}.
But now, every time the processor needs to access \keyword{data} or \keyword{encoding},
two memory accesses are required.
A possible alternative is to reorder the struct.
\begin{lstlisting}[language=C]
struct picture{
int height;
int width;
pixel** data;
char* encoding;
}
\end{lstlisting}
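You can check the effect yourself with \keyword{sizeof}. Here is a small sketch, assuming a typical 64-bit ABI with 4-byte \keyword{int}s and 8-byte pointers, where we'd expect 32 bytes for the padded layout and 24 bytes for the reordered one.
\begin{lstlisting}[language=C]
#include <stdio.h>

typedef struct pixel pixel; // opaque; we only use pointers to it

struct padded    { int height; pixel **data; int width; char *encoding; };
struct reordered { int height; int width; pixel **data; char *encoding; };

int main(void) {
    printf("padded:    %zu\n", sizeof(struct padded));    // 32 on a typical 64-bit ABI
    printf("reordered: %zu\n", sizeof(struct reordered)); // 24 on a typical 64-bit ABI
    return 0;
}
\end{lstlisting}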
\section{Stack Smashing}
Each thread uses a stack memory.
The stack `grows downwards' - if a function calls another function, then the stack is extended to smaller memory addresses.
Stack memory includes non-static automatic (temporary) variables, parameter values, and the return address.
If a buffer is too small for some data (e.g.~input values from the user), then there is a real possibility that other stack variables and even the return address will be overwritten.
The precise layout of the stack's contents and order of the automatic variables is architecture and compiler dependent. With a little investigative work, we can learn how to deliberately smash the stack for a particular architecture.
The example below demonstrates how the return address is stored on the stack.
For a particular 32 bit architecture \href{http://cs-education.github.io/sys/}{Live Linux Machine}, we determine that the return address is stored at an address two pointers (8 bytes) above the address of the automatic variable.
The code deliberately changes the stack value so that when the input function returns, rather than continuing on inside the main method, it jumps to the exploit function instead.
\begin{lstlisting}[language=C]
// Overwrites the return address on the following machine:
// http://cs-education.github.io/sys/
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
void breakout() {
puts("Welcome. Have a shell...");
system("/bin/sh");
}
void input() {
void *p;
printf("Address of stack variable: %p\n", &p);
printf("Something that looks like a return address on stack: %p\n", *((&p)+2));
// Let's change it to point to the start of our sneaky function.
*((&p)+2) = (void *)breakout;
}
int main() {
printf("main() code starts at %p\n",main);
input();
while (1) {
puts("Hello");
sleep(1);
}
return 0;
}
\end{lstlisting}
There are \href{https://en.wikipedia.org/wiki/Stack_buffer_overflow}{a lot} of ways that computers defend against this.
\section{Compiling and Linking}
This is a high-level overview from the time you compile your program to the time you run your program.
Compiling your program often seems easy.
You run the program through an IDE or a terminal, and it just works.
\begin{lstlisting}[language=bash]
$ cat main.c
#include <stdio.h>
int main() {
printf("Hello World!\n");
return 0;
}
$ gcc main.c -o main
$ ./main
Hello World!
$
\end{lstlisting}
Here are the rough stages of compiling for gcc.
\begin{enumerate}
\item Preprocessing: The preprocessor expands all preprocessor directives.
\item Parsing: The compiler parses the text file for function declarations, variable declarations, etc.
\item Assembly Generation: The compiler then generates assembly code for all the functions after some optimizations if enabled.
\item Assembling: The assembler turns the assembly into 0s and 1s and creates an object file. This object file maps names to pieces of code.
\item Static Linking: The linker takes a series of object files and static libraries and resolves references to variables and functions from one object file to another. The linker then finds the main method and makes that the entry point for the program. The linker also notices when a function is meant to be dynamically linked and creates a section in the executable that tells the operating system that these functions need addresses right before running.
\item Dynamic Linking: As the program is getting ready to be executed, the operating system looks at what libraries that the program needs and links those functions to the dynamic library.
\item The program is run.
\end{enumerate}
Further classes will teach you about parsing and assembly -- preprocessing is an extension of parsing.
Most classes won't teach you about the two different types of linking though.
Static linking a library is similar to combining object files.
To create a static library, an archiver combines different object files into a single archive.
A static library is literally an archive of object files.
These libraries are useful when you want your executable to be secure (you know all the code that is being included in your executable) and portable (all the code is bundled with your executable, meaning no additional installs).
The other type is a dynamic library.
Typically, dynamic libraries are installed user-wide or system-wide and are accessible by most programs.
Dynamic libraries' functions are filled in right before they are run.
There are a number of benefits to this.
\begin{itemize}
\item Lower code footprint for common libraries like the C standard library
\item Late binding means more generalized code and less reliance on specific behavior.
\item Differentiation means that the shared library can be updated while keeping the executable the same.
\end{itemize}
There are a number of drawbacks as well.
\begin{itemize}
\item All the code is no longer bundled into your program. This means that users have to install something else.
\item There could be security flaws in the other code leading to security exploits in your program.
\item Standard Linux allows you to ``replace'' dynamic libraries, leading to possible social engineering attacks.
\item This adds additional complexity to your application. Two identical binaries with different shared libraries could lead to different results.
\end{itemize}
\subsubsection{Explanation of the Fork-FILE Problem}
To parse the \href{http://pubs.opengroup.org/onlinepubs/9699919799.2008edition/functions/V2_chap02.html}{POSIX documentation}, we'll have to go deep into the terminology.
The sentence that sets the expectation is the following
\begin{quote}
The result of function calls involving any one handle (the "active handle") is defined elsewhere in this volume of POSIX.1-2008, but if two or more handles are used, and any one of them is a stream, the application shall ensure that their actions are coordinated as described below. If this is not done, the result is undefined.
\end{quote}
What this means is that if we don't follow POSIX to the letter when using two file descriptors that refer to the same description across processes, we get undefined behavior.
To be technical, the file descriptor must have a ``position'' meaning that it needs to have a beginning and an end like a file, not like an arbitrary stream of bytes.
POSIX then goes on to introduce the idea of an active handle, where a handle may be a file descriptor or a \keyword{FILE*} pointer.
File handles don't have a flag called ``active''.
An active file descriptor is one that is currently being used for reading and writing and other operations (such as \keyword{exit}).
The standard says that before a \keyword{fork}, the \textit{application} (your code) must execute a series of steps to prepare the state of the file.
In simplified terms, the descriptor needs to be closed, flushed, or read in its entirety -- the gory details are explained later.
\begin{quote}
For a handle to become the active handle, the application shall ensure that the actions below are performed between the last use of the handle (the current active handle) and the first use of the second handle (the future active handle). The second handle then becomes the active handle. All activity by the application affecting the file offset on the first handle shall be suspended until it again becomes the active file handle. (If a stream function has as an underlying function one that affects the file offset, the stream function shall be considered to affect the file offset.)
\end{quote}
Summarizing: if two file descriptors are actively being used, the behavior is undefined.
The other note is that after a fork, the library code must prepare the file descriptor as if the other process were to make the file active at any time.
The last bullet point concerns itself with how a process prepares a file descriptor in our case.
\begin{quote}
If the stream is open with a mode that allows reading and the underlying open file description refers to a device that is capable of seeking, the application shall either perform an fflush(), or the stream shall be closed.
\end{quote}
The documentation says that the child needs to perform an fflush or close the stream because the file descriptor needs to be prepared in case the parent process needs to make it active.
glibc is in a no-win situation if it closes a file descriptor that the parent may expect to be open, so it'll opt for the fflush on exit because exit in POSIX terminology counts as accessing a file.
That means that for our parent process, this clause gets triggered.
\begin{quote}
If any previous active handle has been used by a function that explicitly changed the file offset, except as required above for the first handle, the application shall perform an lseek() or fseek() (as appropriate to the type of handle) to an appropriate location.
\end{quote}
Since the child calls fflush and the parent didn't prepare, the operating system chooses where the file gets reset.
Different file systems will do different things which are supported by the standard.
The OS may look at modification times and conclude that the file hasn't changed so no resets are needed or may conclude that exit denotes a change and needs to rewind the file back to the beginning.
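The practical rule that falls out of all this: flush (or close) user-space buffers before forking. Here is a small sketch, with a made-up filename, of the defensive pattern.
\begin{lstlisting}[language=C]
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    FILE *f = fopen("log.txt", "w");
    if (!f) return 1;
    fprintf(f, "before fork\n");
    fflush(f); // prepare the handle: no buffered bytes cross the fork
    pid_t pid = fork();
    if (pid == 0) {
        fprintf(f, "child\n");
        fclose(f);
        exit(0);
    }
    waitpid(pid, NULL, 0); // the shared file offset is now past the child's write
    fprintf(f, "parent\n");
    fclose(f);
    return 0;
}
\end{lstlisting}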
\section{Banker's Algorithm}
We can start with a single resource Banker's Algorithm.
Consider a banker, who has a finite amount of money.
With a finite amount of money, she wants to make loans and eventually get her money back.
Let's say that we have a set of $n$ people, where each person has a set amount or limit $a_i$ ($i$ being the $i$th person) that they need to obtain before they can do any work.
The banker keeps track of how much she has given to each person $l_i$. She maintains an amount of money $p$ with her, at all times.
For people to request money, they do the following:
Consider the state of the system $(A=\{a_1, a_2, ...\}, L_t=\{l_{t,1}, l_{t,2}, ...\}, p)$ at time $t$.
A precondition is that we have $p \geq \min(A)$, or we have enough money to suit at least one person.
Also, each person will work for a finite period and give back our money.
\begin{itemize}
\item A person $j$ requests $m$ from the banker
\begin{itemize}
\item if $m > p$, they are denied.
\item if $m + l_j > a_j$, they are denied
\item Pretend we are in a new state $(A, L_{t+1}=\{.., l_{t+1, j} = l_{t, j} + m, ...\}, p - m)$ where the process is granted the resource.
\end{itemize}
\item If person $j$ is now either satisfied ($l_{t+1,j} = a_j$) or $\min_i(a_i - l_{t+1, i}) \leq p - m$, in other words, we still have enough money to suit at least one other person, consider the transaction safe and give them the money.
\end{itemize}
Why does this work? Well, at the start we are in a safe state, defined as having enough money to suit at least one person.
Each of these "loans" results in a safe state.
If we have exhausted our reserve, one person is working and will give us money greater than or equal to our previous "loan", thus putting us in a safe state again.
Since we can always make one additional move, the system can never deadlock.
Now, there is no guarantee that the system won't livelock.
If the person we are counting on never makes a request, no work will be done -- but not due to deadlock.
This analogy expands to multiple resource types, but requires that either a process can do its work entirely or there exists a process whose combination of resources can be satisfied. This makes the algorithm a little more tricky (an additional for loop) but nothing too bad.
There are some notable downsides; a compact C sketch of the single-resource check follows this list.
\begin{itemize}
\item The program first needs to know how much of each resource a process needs. A lot of times that is impossible or the process requests the wrong amount because the programmer didn't foresee it.
\item The system could livelock.
\item We know that in most systems, resources vary -- pipes and sockets, for example. This could mean that the runtime of the algorithm is slow for systems with millions of resources.
\item Also, this can't keep track of the resources that come and go. A process may delete a resource as a side effect or create a resource. The algorithm assumes a static allocation and that each process performs a non-destructive operation.
\end{itemize}
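Here is a compact sketch in C of the single-resource check described above. It follows the chapter's simplified analogy and is an illustration, not a production allocator.
\begin{lstlisting}[language=C]
#include <stdbool.h>

// a[i] = person i's total need, l[i] = amount already lent, *p = cash on hand.
// Grant person j's request of m only if the resulting state is still safe.
bool request_is_safe(int n, const int a[], int l[], int *p, int j, int m) {
    if (m > *p) return false;          // can't lend more than we have
    if (l[j] + m > a[j]) return false; // asking beyond the declared limit
    // Pretend to grant it.
    l[j] += m;
    int left = *p - m;
    // Safe if j is now satisfied, or we can still satisfy someone else.
    bool safe = (l[j] == a[j]);
    for (int i = 0; i < n && !safe; i++) {
        if (a[i] - l[i] <= left) safe = true;
    }
    if (!safe) { l[j] -= m; return false; } // roll back the pretend grant
    *p = left;
    return true;
}
\end{lstlisting}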
\section{Clean/Dirty Forks (Chandy/Misra Solution)}
There are many more advanced solutions.
One such solution is by Chandy and Misra \cite{Chandy:1984:DPP:1780.1804}.
This is not a true solution to the dining philosophers problem because it has the requirement that philosophers can speak to each other.
It is a solution that ensures some notion of fairness.
In essence, it defines a series of rounds that a philosopher must eat in a given round before going to the next one.
We won't detail the proof here because it is a little more involved, but feel free to read more.
\section{Actor Model}
The actor model is another form of synchronization that doesn't involve negotiating locks or waiting.
The idea is simple.
Each actor can either perform work, create more actors, send messages, or respond to messages.
Any time an actor needs something from another actor, it sends a message.
Most importantly, an actor is only responsible for one thing.
If we were implementing a real-world application, we may have an actor that handles the database, one that handles the incoming connections, one that services the connections, etc.
These actors would pass messages to each other like ``there is a new connection'' from the incoming connection actor to the servicing actor.
The servicing actor may send a data request message to the database actor and a data response message comes back.
While this seems like the perfect solution there are drawbacks.
The first is that the communication library itself needs to be synchronized.
If you don't have a framework that does this already -- like the Message Passing Interface or MPI for High-Performance Computing -- then the framework will have to be built and would most likely be as much work to build efficiently as direct synchronization.
Also, the messages now incur additional overhead for serializing and deserializing, at the least.
And a final drawback is that an actor could take an arbitrarily long time to respond to a message, spurring the need for shadow actors who service the same job.
As mentioned, there are frameworks like \href{https://en.wikipedia.org/wiki/Message\_Passing\_Interface}{Message Passing Interface} that are somewhat based on the actor model and allow distributed systems in high-performance computing to work effectively, but your mileage may vary.
If you want to read further on the model, feel free to glance over the Wikipedia page listed below.
\href{https://en.wikipedia.org/wiki/Actor\_model}{Further reading on the actor model}
\section{Includes and conditionals}
Other preprocessor features are the \keyword{\#include} directive and conditionals.
The include directive is explained by example.
\begin{lstlisting}[language=C]
// foo.h
int bar();
\end{lstlisting}
This is our file \keyword{bar.c} unpreprocessed.
\begin{lstlisting}[language=C]
#include "foo.h"
int bar() {
}
\end{lstlisting}
After preprocessing, the compiler sees this
\begin{lstlisting}[language=C]
// bar.c preprocessed
int bar();
int bar() {
}
\end{lstlisting}
The other tool is preprocessor conditionals.
If a macro is defined or truthy, that branch is taken.
\begin{lstlisting}[language=C]
int main() {
#ifdef __GNUC__
return 1;
#else
return 0;
#endif
}
\end{lstlisting}
Using \keyword{gcc} your compiler would preprocess the source to the following.
\begin{lstlisting}[language=C]
int main() {
return 1;
}
\end{lstlisting}
Using \keyword{clang} your compiler would preprocess to this.
\begin{lstlisting}[language=C]
int main() {
return 0;
}
\end{lstlisting}
\subsection{Thread Scheduling}
There are a few ways to split up the work.
These are common to the OpenMP framework \cite{silberschatz2005operating}.
\begin{itemize}
\item \keyword{static scheduling} breaks up the problem into fixed-size chunks (predetermined) and has each thread work on one of the chunks.
This works well when each of the subproblems takes roughly the same time because there is no additional overhead.
All you need to do is write a loop and give the map function to each sub-array.
\item \keyword{dynamic scheduling} hands a thread a new chunk of the problem as one becomes available.
This is useful when you don't know how long each subproblem will take.
\item \keyword{guided scheduling} This is a mix of the above with a mix of the benefits and tradeoffs.
You start with static scheduling and move slowly to dynamic if needed
\item \keyword{runtime scheduling} You have absolutely no idea how long the problems are going to take.
Instead of deciding it yourself, let the program decide what to do!
\end{itemize}
No need to memorize any of the scheduling routines though.
OpenMP is a standard that is an alternative to pthreads.
For example, here is how to parallelize a for loop.
\begin{lstlisting}[language=C]
#pragma omp parallel for
for (int i = 0; i < n; i++) {
// Do stuff
}
// Specify the scheduling as follows
// #pragma omp parallel for schedule(static)
\end{lstlisting}
Static scheduling divides the problem into fixed-size chunks.
Dynamic scheduling hands a thread a new chunk when it finishes its current one.
Guided scheduling is dynamic scheduling with chunk sizes that shrink over time.
Runtime is a whole bag of worms: the choice is deferred until run time, through the \keyword{OMP\_SCHEDULE} environment variable.
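As a hedged sketch (compile with \keyword{-fopenmp}; the triangular inner loop is made-up busy work), here is how a schedule clause looks when iterations have uneven cost.
\begin{lstlisting}[language=C]
#include <stdio.h>

int main(void) {
    long sum = 0;
    // Iteration i costs roughly i units of work, so dynamic scheduling
    // helps: threads grab the next chunk of 4 iterations as they finish.
    #pragma omp parallel for schedule(dynamic, 4) reduction(+:sum)
    for (int i = 0; i < 1000; i++) {
        for (int j = 0; j < i; j++) {
            sum += j; // simulated uneven work
        }
    }
    printf("%ld\n", sum);
    return 0;
}
\end{lstlisting}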
\section{threads.h}
We have discussed a lot of threading libraries in the extra section.
There are the standard POSIX threads and OpenMP threads, and we also have a new C11 threading library that is built into the standard, \keyword{threads.h}.
This library provides restricted functionality.
Why use restricted functionality?
The key is in the name.
Since this is the C standard library, it has to be implemented by all compliant operating systems, which is pretty much all of them.
This means there is first-class portability when using threads.
We won't drone on about the functions.
Most of them are renaming of pthread functions anyway.
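To give a flavor, here is a hedged minimal example; it assumes a libc that actually ships \keyword{threads.h} (glibc 2.28 or later, for instance).
\begin{lstlisting}[language=C]
#include <stdio.h>
#include <threads.h> // C11 threads; check that your libc provides it

// The C11 analogue of a pthread start routine returns int, not void*.
int run(void *arg) {
    printf("Hello from %s\n", (const char *)arg);
    return 0;
}

int main(void) {
    thrd_t t;
    if (thrd_create(&t, run, "a C11 thread") != thrd_success) return 1;
    thrd_join(t, NULL); // second argument would receive run's return value
    return 0;
}
\end{lstlisting}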
If you ask why we don't teach these, there are a few reasons
\begin{enumerate}
\item They are pretty new. Even though the standard came out in roughly 2011, POSIX threads have been around forever, and a lot of pthreads' quirks have been ironed out.
\item You lose expressivity.
This is a concept that we'll talk about in later chapters, but when you make something portable, you lose some expressivity with the host hardware.
That means that the threads.h library is pretty bare bones.
It is hard to set CPU affinities, schedule threads together, or efficiently look at a thread's internals for performance reasons.
\item A lot of legacy code is already written with POSIX threads in mind.
Other libraries like OpenMP, CUDA, MPI will either use POSIX processes or POSIX threads with a begrudging port to Windows.
\end{enumerate}
\section{Modern Filesystems}
While the POSIX filesystem API has stayed mostly the same over the years, the filesystems themselves have grown many important features.
\begin{itemize}
\item Data Integrity. File systems use journaling and sometimes checksums to ensure that the data written is valid. Journaling is a simple invention where the file system writes an operation in a journal. If the filesystem crashes before the operation is complete, it can resume the operation when booted up again using the partial journal.
\item Caching. Linux does a good job of caching file system operations like finding inodes. This makes disk operations seem nearly instant. If you want to see a slow system, look at Windows with FAT/NTFS. Disk operations need to be cached by the application, or it will burn through the CPU.
\item Speed. On spinning-disk machines, data toward the outer edge of a platter moves faster under the head (linear velocity is higher farther from the center). Programs used this to reduce the time spent loading large files, like movies in video editing software. SSDs don't have this problem because there is no spinning disk, but they will portion off a section of their space as spare area to keep writes fast.
\item Parallelism. Filesystems with multiple heads (for physical hard disks) or multiple controllers (for SSDs) can utilize parallelism by multiplexing the PCIe slot with data, always serving some data to the application whenever possible.
\item Encryption. Data can be encrypted with one or more keys. A good example of this is Apple's APFS file system.
\item Redundancy. Sometimes data can be replicated across blocks to ensure that the data is always available.
\item Efficient Backups. Many of us have data that we can't store in the cloud for one reason or another. When a filesystem is either the backup medium or the source of a backup, it is useful for it to calculate what has changed efficiently, compress files, and sync with the external drive.
\item Integrity and Bootability. File systems need to be resilient to bit flipping. Most readers have their operating system installed on the same partition as the file system that they use for day-to-day operations. The file system needs to make sure a stray read or write doesn't destroy the boot sector -- meaning your computer can't start up again.
\item Fragmentation. Just like a memory allocator, allocating space for a file leads to both internal and external fragmentation. The same caching benefit occurs when disk blocks for a single file are located next to each other. File systems need to perform well under low, high, and possible fragmentation usage.
\item Distributed. Sometimes, the filesystem should tolerate the failure of a single machine. Hadoop and other distributed file systems allow you to do that.
\end{itemize}
\subsection{Cutting Edge File systems}
There are a few filesystem technologies nowadays that are truly cutting edge.
The one we'd briefly like to touch on is AMD's StoreMI.
We aren't trying to sell AMD chipsets, but the feature set of StoreMI warrants a mention.
StoreMI is a hardware microcontroller that analyzes how the operating system accesses files and moves files/blocks around to speed up load times.
A common usage can be imagined as having a fast but small-capacity SSD and a slower, large-capacity HDD.
To make it seem like all the files are on an SSD, StoreMI watches the pattern of file accesses.
Windows, for example, will often access many files in the same order on startup.
StoreMI takes note of that, and when the microcontroller notices the boot starting, it moves files from the HDD to the SSD before they are requested by the operating system.
By the time the operating system needs them, they are already on the SSD.
StoreMI does this with other applications as well.
The technology still leaves a lot to be desired, but it is an interesting intersection of data and pattern matching with filesystems.
\section{Linux Scheduling}
As of February 2016, Linux by default uses the \emph{Completely Fair Scheduler} for CPU scheduling and Budget Fair Queueing (``BFQ'') for I/O scheduling. Appropriate scheduling can have a significant impact on throughput and latency. Latency is particularly important for interactive and soft real-time applications such as audio and video streaming. See the discussion and comparative benchmarks \href{https://lkml.org/lkml/2014/5/27/314}{here} for more information.
Here is how the CFS schedules:
\begin{itemize}
\tightlist
\item
The kernel maintains a red-black tree keyed on each process's virtual runtime (runtime / nice\_value), with a sleeper fairness flag -- if the process is waiting on something, give it the CPU when it is done waiting.
\item
Nice values are the kernel's way of giving priority to certain processes, the lower nice value the higher priority.
\item
The kernel chooses the process with the lowest virtual runtime and schedules it to run next, taking it off the tree. Since the red-black tree is self-balancing, this operation is guaranteed \(O(\log n)\) (selecting the min process is the same runtime).
\end{itemize}
Although it is called the Completely Fair Scheduler, it has a fair number of problems.
\begin{itemize}
\tightlist
\item
Groups of processes that are scheduled may have imbalanced loads, so the scheduler roughly distributes the load. When another CPU gets free, it can only look at the average load of a group schedule, not the individual cores. So the free CPU may not take the work from a CPU that is burning so long as the average is fine.
\item
If a group of processes is running on non-adjacent cores then there is a bug. If the two cores are more than a hop away, the load balancing algorithm won't even consider that core. Meaning if a CPU is free and a CPU that is doing more work is more than a hop away, it won't take the work (may have been patched).
\item
After a thread goes to sleep on a subset of cores, when it wakes up it can only be scheduled on the cores that it was sleeping on. If those cores are now busy, the thread will have to wait on them, wasting opportunities to use other idle cores.
\item
To read more on the problems of the Fair Scheduler, read \href{https://blog.acolyer.org/2016/04/26/the-linux-scheduler-a-decade-of-wasted-cores}{here}.
\end{itemize}
\subsection{Implementing Software Mutex}
Can mutual exclusion be implemented in software alone? Yes.
With a bit of searching, it is possible to find it in production for specific simple mobile processors today.
Peterson's algorithm is used to implement low-level Linux kernel locks for the Tegra mobile processor (a system-on-chip with an ARM processor and GPU core by Nvidia) \href{https://android.googlesource.com/kernel/tegra.git/+/android-tegra-3.10/arch/arm/mach-tegra/sleep.S\#58}{Link to Lock Source}.
In general though, CPUs and C compilers can re-order CPU instructions or use CPU-core-specific local cache values that are stale if another core updates the shared variables.
Thus, a simple pseudocode-to-C implementation is too naive for most platforms.
Warning, here be dragons! Consider this an advanced and gnarly topic with (spoiler alert) a happy ending.
Consider the following code,
\begin{lstlisting}[language=C]
while (flag2) { /* busy loop - go around again */ }
\end{lstlisting}
An efficient compiler would infer that the \keyword{flag2} variable is never changed inside the loop, so the test can be optimized to \keyword{while(true)}. Using \keyword{volatile} goes some way to prevent compiler optimizations of this kind.
Let's say that we solved this by telling the compiler not to optimize.
Independent instructions can be re-ordered by an optimizing compiler or at runtime by an out-of-order execution optimization by the CPU.
A related challenge is that CPU cores include a data cache to store recently read or modified main memory values.
Modified values may not be written back to main memory or re-read from memory immediately.
Thus data changes, such as the state of a flag and turn variable in the above example, may not be shared between two CPU cores.
But there is a happy ending.
Modern hardware addresses these issues using `memory fences', also known as memory barriers.
A fence prevents memory operations from being reordered across it.
There is a performance loss, but it is needed for correct programs!
Also, there are CPU instructions to ensure that main memory and the CPU's cache is in a reasonable and coherent state.
Higher-level synchronization primitives, such as \keyword{pthread\_mutex\_lock}, will call these CPU instructions as part of their implementation.
Thus, in practice, surrounding critical sections with a mutex lock and unlock calls is sufficient to ignore these lower-level problems.
For further reading, we suggest the following web post that discusses implementing Peterson's algorithm on an x86 processor and the Linux documentation on memory barriers; a concrete sketch follows the list.
\begin{enumerate}
\item \href{http://bartoszmilewski.com/2008/11/05/who-ordered-memory-fences-on-an-x86/}{Memory Fences}
\item \href{http://lxr.free-electrons.com/source/Documentation/memory-barriers.txt}{Memory Barriers}
\end{enumerate}
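To make the happy ending concrete, here is a sketch of Peterson's algorithm for two threads using C11 sequentially consistent atomics, which ask both the compiler and the CPU not to reorder the flag and turn accesses. This is an illustration under those assumptions, not the kernel's code.
\begin{lstlisting}[language=C]
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

static atomic_bool flag[2];
static atomic_int turn;
static int counter; // protected by the lock below

static void lock(int self) {
    int other = 1 - self;
    atomic_store(&flag[self], true); // declare intent
    atomic_store(&turn, other);      // politely let the other go first
    while (atomic_load(&flag[other]) && atomic_load(&turn) == other)
        ; // busy wait until the other thread defers or leaves
}

static void unlock(int self) {
    atomic_store(&flag[self], false);
}

static void *worker(void *arg) {
    int self = *(int *)arg;
    for (int i = 0; i < 100000; i++) {
        lock(self);
        counter++; // critical section
        unlock(self);
    }
    return NULL;
}

int main(void) {
    pthread_t t[2];
    int ids[2] = {0, 1};
    pthread_create(&t[0], NULL, worker, &ids[0]);
    pthread_create(&t[1], NULL, worker, &ids[1]);
    pthread_join(t[0], NULL);
    pthread_join(t[1], NULL);
    printf("%d\n", counter); // expect 200000
    return 0;
}
\end{lstlisting}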
\section{The Curious Case of Spurious Wakeups}
Condition variables need a mutex for a few reasons.
One is simply that a mutex is needed to synchronize the changes of the \textit{condition variable} across threads.
Imagine a condition variable needing to provide its own internal synchronization to ensure its data structures work correctly.
Often, we use a mutex to synchronize other parts of our code anyway, so why double the cost of using a condition variable?
Another example relates to high priority systems.
Let's examine a code snippet.
\begin{lstlisting}[language=C]
// Thread 1
while (answer < 42) pthread_cond_wait(cv);
// Thread 2
answer = 42
pthread_cond_signal(cv);
\end{lstlisting}
\begin{center}
\begin{table}[h]
\caption{Signaling without Mutex}
\begin{tabular}{|c|c|}
Thread 1 & Thread 2 \\ \hline
while(answer < 42) & \\
& answer = 42 \\
& pthread\_cond\_signal(cv) \\
pthread\_cond\_wait(cv)
\end{tabular}
\end{table}
\end{center}
The problem here is that a programmer expects the signal to wake up the waiting thread.
Since instructions are allowed to be interleaved without a mutex, this causes an interleaving that is confusing to application designers.
Note that technically the API of the condition variable is satisfied.
The wait call \textit{happens-after} the call to signal, and signal is only required to release at most a single thread whose call to wait \textit{happened-before}.
Another problem is the need to satisfy real-time scheduling concerns which we only outline here.
In a time-critical application, the waiting thread with the \emph{highest priority} should be allowed to continue first.
To satisfy this requirement the mutex must also be locked before calling \keyword{pthread\_cond\_signal} or \keyword{pthread\_cond\_broadcast}.
For the curious, \href{https://groups.google.com/forum/?hl=ky\#!msg/comp.programming.threads/wEUgPq541v8/ZByyyS8acqMJ}{here is a longer, historical discussion}.
\section{Condition Wait Example}
The call \keyword{pthread\_cond\_wait} performs three actions:
\begin{enumerate}
\item Unlock the mutex. The mutex must be locked.
\item Sleep until \keyword{pthread\_cond\_signal} is called on the same condition variable.
\item Before returning, lock the mutex.
\end{enumerate}
Condition variables are \emph{always} used with a mutex lock.
Before calling \emph{wait}, the mutex lock must be locked and \emph{wait} must be wrapped with a loop.
\begin{lstlisting}[language=C]
pthread_cond_t cv;
pthread_mutex_t m;
int count;
// Initialize
pthread_cond_init(&cv, NULL);
pthread_mutex_init(&m, NULL);
count = 0;
// Thread 1
pthread_mutex_lock(&m);
while (count < 10) {
pthread_cond_wait(&cv, &m);
/* Remember that cond_wait unlocks the mutex before blocking (waiting)! */
/* After unlocking, other threads can claim the mutex. */
/* When this thread is later woken it will */
/* re-lock the mutex before returning */
}
pthread_mutex_unlock(&m);
//later clean up with pthread_cond_destroy(&cv); and mutex_destroy
// Thread 2:
while (1) {
pthread_mutex_lock(&m);
count++;
pthread_cond_signal(&cv);
/* Even though the other thread is woken up, it cannot return */
/* from pthread_cond_wait until we have unlocked the mutex. This is */
/* a good thing! In fact, it is usually the best practice to call */
/* cond_signal or cond_broadcast before unlocking the mutex */
pthread_mutex_unlock(&m);
}
\end{lstlisting}
This is a pretty naive example, but it shows that we can tell threads to wake up in a standardized manner.
In the next section, we will use these to implement efficient blocking data structures.
\section{Implementing CVs with Mutexes Alone}
Implementing a condition variable using only a mutex isn't trivial.
Here is a sketch of how we could do it.
\begin{lstlisting}[language=C]
typedef struct cv_node_ {
    int is_awoken;
    struct cv_node_ *next;
} cv_node;

typedef struct {
    cv_node *head;
    pthread_mutex_t *dynamic;
} cond_t;
void cond_init(cond_t *cv) {
cv->head = NULL;
cv->dynamic = NULL;
}
void cond_destroy(cond_t *cv) {
// Nothing to see here
// Though may be useful for the future to put pieces
}
static void remove_from_list(cond_t *cv, cv_node *ptr) {
    // Function assumes mutex is locked
    // Some sanity checking
    if (ptr == NULL) {
        return;
    }
    // Special case head
    if (ptr == cv->head) {
        cv->head = cv->head->next;
        return;
    }
    // Otherwise find the previous node and patch it through
    for (cv_node *prev = cv->head; prev->next; prev = prev->next) {
        if (prev->next == ptr) {
            prev->next = prev->next->next;
            return;
        }
    }
    // We couldn't find the node, invalid call
}
\end{lstlisting}
This is all the boring definitional stuff.
The interesting stuff is below.
\begin{lstlisting}[language=C]
void cond_wait(cond_t *cv, pthread_mutex_t *m) {
// See note (dynamic) below
if (cv->dynamic == NULL) {
cv->dynamic = m;
} else if (cv->dynamic != m) {
// Error can't wait with a different mutex!
abort();
}
// mutex is locked so we have the critical section right now
// Create linked list node _on the stack_
cv_node my_node;
my_node.is_awoken = 0;
my_node.next = cv->head;
cv->head = &my_node;
pthread_mutex_unlock(m);
// May do some cache busting here
while (my_node.is_awoken == 0) {
pthread_yield();
}
pthread_mutex_lock(m);
remove_from_list(cv, &my_node);
// The dynamic binding is over
if (cv->head == NULL) {
cv->dynamic = NULL;
}
}
void cond_signal(cond_t *cv) {
for (cv_node *iter = cv->head; iter; iter = iter->next) {
// Signal makes sure one thread that has not woken up
// is woken up
if (iter->is_awoken == 0) {
// DON'T remove from the linked list here
// There is no mutual exclusion, so we could
// have a race condition
iter->is_awoken = 1;
return;
}
}
// No more threads to free! No-op
}
void cond_broadcast(cond_t *cv) {
for (cv_node *iter = cv->head; iter; iter = iter->next) {
// Wake everyone up!
iter->is_awoken = 1;
}
}
\end{lstlisting}
So how does this work?
Instead of allocating space, which could lead to deadlock, we keep the linked list nodes on each thread's stack.
The linked list node in the wait function is created \textbf{while the thread holds the mutex}. This is important because otherwise we could have a race condition on the insert and removal.
A more robust implementation would have a mutex per condition variable.
What is the note about (dynamic)?
In the pthread man pages, wait creates a runtime binding to a mutex. This means that after the first call to wait, a mutex is associated with the condition variable for as long as a thread is waiting on it.
Each new thread coming in must have the same mutex, and it must be locked.
Hence, the beginning and end of wait (everything besides the while loop) are mutually exclusive.
After the last thread leaves, meaning when head is NULL, then the binding is lost.
The signal and broadcast functions merely tell either one thread or all threads respectively that they should be woken up.
\textbf{It doesn't modify the linked lists because there is no mutex to prevent corruption if two threads call signal or broadcast}
Now an advanced point.
Do you see how a broadcast could cause a spurious wakeup in this case? Consider this series of events.
\begin{enumerate}
\item Some number more than 2 threads start waiting
\item Another thread calls broadcast.
\item That thread calling broadcast is stopped before it wakes any threads.
\item Another thread calls wait on the condition variable and adds itself to the queue.
\item Broadcast iterates through and frees all of the threads.
\end{enumerate}
Without the mutex held, there is no assurance as to \textit{when} the broadcast was called relative to when threads were added.
The ways to prevent this behavior are to include Lamport timestamps or require that broadcast be called with the mutex in question.
That way something that \textit{happens-before} the broadcast call doesn't get signaled after.
The same argument is put forward for signal too.
Did you also notice something else?
\textbf{This is why we ask you to signal or broadcast before you unlock}.
If you broadcast after you unlock, the time that broadcast takes could be infinite!
\begin{enumerate}
\item Broadcast is called on a waiting queue of threads
\item The first thread is freed; the broadcast thread is frozen. Since the mutex is unlocked, the freed thread locks it and continues.
\item It continues for such a long time that it calls broadcast again.
\item With our implementation of a condition variable, this would terminate.
If you had an implementation that appended to the tail of the list and iterated from the head to the tail, this could go on infinitely many times.
\end{enumerate}
In high-performance systems, we want to make sure that each thread that calls wait isn't passed by another thread that calls wait.
With the current API that we have, we can't assure that.
We'd have to ask users to pass in a mutex or use a global mutex.
Instead, we tell programmers to always signal or broadcast before unlocking.
\section{Higher Order Models of Synchronization}
When using atomics, you need to specify the right model of synchronization to ensure a program behaves correctly.
You can read more about them \href{https://gcc.gnu.org/wiki/Atomic/GCCMM/AtomicSync}{on the gcc wiki}.
These examples are adapted from those.
\subsection{Sequentially Consistent}
Sequentially consistent is the simplest, least error-prone, and most expensive model. This model says that for any change that happens, all changes before it will be visible to all threads.
\begin{verbatim}
Thread 1 Thread 2
1.0 atomic_store(x, 1)
1.1 y = 10 2.1 if (atomic_load(x) == 0)
1.2 atomic_store(x, 0); 2.2 y != 10 && abort();
\end{verbatim}
This code will never abort.
Under sequential consistency, either the store to x in 1.2 happens before the load in thread 2, in which case y is already 10, or the load happens first, in which case x is not 0 and the check on y never runs.
\subsection{Relaxed}
Relaxed is a simple memory order providing for more optimizations.
This means that only a particular operation needs to be atomic.
One can have stale reads and writes, but after reading the new value, it won't become old.
\begin{verbatim}
-Thread 1- -Thread 2-
atomic_store(x, 1); printf("%d\n", x) // 1
atomic_store(x, 0); printf("%d\n", x) // could be 1 or 0
printf("%d\n", x) // could be 1 or 0
\end{verbatim}
But that means that previous loads and stores don't need to affect other threads.
In the previous example, the code can now fail.
\subsection{Acquire/Release}
The order of atomic variables doesn't need to be consistent across threads -- meaning if atomic variable y is assigned 10 and then atomic variable x is assigned 0, those writes don't need to propagate in that order, and a thread could get stale reads.
Non-atomic variables have to get updated in all threads though.
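Here is a sketch of the classic release/acquire pairing with C11 atomics; the guarantee holds only between the release store and the acquire load on the same atomic variable.
\begin{lstlisting}[language=C]
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

// The writer publishes data and then sets ready with a release store.
// A reader that observes ready == 1 with an acquire load is guaranteed
// to also see data == 42, even though data itself is not atomic.
static int data;
static atomic_int ready;

static void *writer(void *arg) {
    data = 42;
    atomic_store_explicit(&ready, 1, memory_order_release);
    return NULL;
}

static void *reader(void *arg) {
    while (atomic_load_explicit(&ready, memory_order_acquire) == 0)
        ; // spin until the writer publishes
    printf("%d\n", data); // prints 42
    return NULL;
}

int main(void) {
    pthread_t w, r;
    pthread_create(&r, NULL, reader, NULL);
    pthread_create(&w, NULL, writer, NULL);
    pthread_join(w, NULL);
    pthread_join(r, NULL);
    return 0;
}
\end{lstlisting}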
\subsection{Consume}
Imagine the same as above except non-atomic variables don't need to get updated in all threads.
This model was introduced so that there can be an Acquire/Release/Consume model without mixing in Relaxed, because Consume is similar to Relaxed.
\section{Actor Model and Goroutines}
There are a \textit{lot} of other methods of concurrency than the ones described in this book.
POSIX threads are the finest-grained thread construct, allowing for tight control of the threads and the CPU.
Other languages have their own abstractions.
We'll talk about Go (golang), a language that is similar to C in terms of simplicity and design.
To get a five-minute introduction, feel free to read \href{https://learnxinyminutes.com/docs/go/}{the learn X in Y guide for Go}.
Here is how we create a ``thread'' in Go.
\begin{lstlisting}[language=golang]
package main

import "fmt"
import "time"

func hello(out string) {
	fmt.Println(out)
}

func main() {
	to_print := "Hello World!"
	go hello(to_print)
	time.Sleep(time.Second) // let the goroutine run before main exits
}
\end{lstlisting}