cit_architectural_ideas.txt
handling credentials in a CI job
jenkins offers a key ring but that works only if all jobs run under a user with the access rights
one way to avoid the problem is to do all the checkouts locally and send the file system for builds to the CI system
the CI system just handles allocation of cpu, not credentials
one tool that can be used to simplify the sending of the checkout files is docker
either build an image with the files checked out; that needs to be done for every build with a code change
or check out the files in the container and commit it
the local machine can act as a docker hub, the CI system can pull the image from it
if the image was previously pulled only the new fs layer will need to be synchronized
after the build the container can be committed and the build originator can pull the image from the CI system to work further in it or debug if necessary (sketch below)
encryption of the layers would ensure security
encryption of the transport layer is also a possibility
we also need to ensure that the run is secure, either by
securing the build node
securing the run of the docker container
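a rough sketch of the docker flow above; the registry address, image name, and build command are all made up:
$ docker build -t dev_box:5000/cit-job-1234 .            # image contains the local checkout
$ docker push dev_box:5000/cit-job-1234                  # local machine acts as the registry
# on the build node: pull, build, commit the result so the originator can pull it back
$ docker pull dev_box:5000/cit-job-1234
$ docker run --name cit-job-1234 dev_box:5000/cit-job-1234 make all
$ docker commit cit-job-1234 dev_box:5000/cit-job-1234:built
$ docker push dev_box:5000/cit-job-1234:built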
job slaves
a job may be running on multiple slaves (at the same time or sequentially)
we need to know what has been run where
commit to build, builds to commit, and sequential list of builds
given commits in one or multiple repositories
find all the builds that include the commits
find all the build started because of those commits
generally, keep a commit to build information in the builds
find the builds, failing or passing, after commits were made
No central Queue
make the whole system p2p
list advantages and drawbacks
job comment
cit -m 'manual start because we have done something special, .....
the job comment is shown when listing the current jobs
if the build supports it, cit info could query the build/job
more likely that cit gives information about where the job is run
ssh $(cit where) make info
cluster management
run job if the load level in /proc/loadavg is below some threshold
see batch(1)
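a minimal sketch of that check; the 4.0 threshold and the job id are made up:
$ awk '{ exit !($1 < 4.0) }' /proc/loadavg && cit run job_id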
time limit (not CI, cluster/build management)
timeout (1)
not CI, build system timing information
/usr/bin/time to get info
replay
hmm, simply no!
CI is not a versioning system
what's the real use case for replay?
hack around the fact that Jenkins is a crappy development environment
when is it most often used?
to make slight modifications to make it run on jenkins
this is eliminated since cit runs on the user's environment
when a simple change needs to be done but we want to reuse the build directory on jenkins
moving the build directory is prohibitive
hoping the build will be re-run on the same node
=> use normal development strategies
if changes are made on the build node and committed, the job needs to reflect the changes
otherwise re-running the job would give a different result
the broken job will already have reported its failure thus this means
that it's a new job, one that didn't go through queueing
or we can see it as a new job which needs to run on the node where the modification
was done (artefacts are there)
if the change is made on a non-cluster node this is equivalent to pushing a done-job to the
done-queue; artefacts must be pushed to a cluster node or made available from the build box
getting CI builds locally
utility not CI
cit rsync UUID
creates, if necessary, a UUID+name directory and rsyncs the whole build
this can be large and time/cpu consuming if done too often on large builds
cit mount UUID
leave the build directory on the CI nodes and mount the filesystem in UUID directory
some operations do not require the copying of the whole build locally
we don't need to, we can work on the node directly
how do we stop cleanup of workspace while working on it?
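a sketch of what the two commands could do underneath; the build path and target directory names are assumptions:
$ rsync -a $(cit where):builds/UUID/ UUID_name/        # cit rsync UUID
$ sshfs $(cit where):builds/UUID UUID_name             # cit mount UUID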
restart the build, on the CI node, after modification
isn't it a replay?
we need to save the changes and make them very clear to people using the result
although the build can be restarted on the node directly, since we are logged in to make changes, it is
important that the CI system is not short-circuited
if the change is small we want the CI system to prioritize this build
we want the build to be queued
stash
make local archive
stash url to archive on "master"
unstash
get url
scp
build nodes may not be on the same network
use third party service
use url
stashes can be encrypted for security
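a minimal sketch of stash/unstash on top of tar and scp; the master path is made up:
$ tar czf job_id.artefacts.tgz artefacts/ && scp job_id.artefacts.tgz master:stashes/     # cit stash
$ scp master:stashes/job_id.artefacts.tgz . && tar xzf job_id.artefacts.tgz               # cit unstash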
pushing local builds to the CI system
not CI, code/binary management
in a controlled environment
a local build and a CI build are the same
the local build can be pushed to the CI system and made available to others without having to
queue for resources
rebuild it on the CI system
local binaries can shorten the build time on CI
binary repo from client to server!
can be used between developers, send my binaries so helper doesn't need to build
helper can send back what he has built
=> better, let helper ssh on own machine
choosing the build node (affinity/scheduling)
apart from needed resources, the presence of a previous build is an important factor
allows us to do incremental builds
CI runs an executable whose source is in the job data
in sandbox (docker, chroot)
allows job_scheduler to install resources for computation of go/no-go of job build
environment control
if we want to do incremental builds, environment control becomes paramount
needs to be done by the build system
needs to be a separate target and hierarchical
components can choose between incremental and clean builds
build directories must be different for components in case a component wants a "clean" build
or cleanup the component build directory
build system use chroot/docker to sandbox
this allows build on local machine with same build system
if "base images" exist to be reused, we must still check the versions of dependencies
that are installed and install them if not matching
application name: cit, or metra
"latest"
CI uses the configuration management system to freeze "latest" into
the version used during the build
CI -> fix versions -> CM
<- fixed version
CI -> Queue build -> Q -> CI scheduler
-> build -> build system
cit computes affinity before pushing jobs in queue
Simple parts connected by "clean" interfaces
parts are easy to replace
parts can, and will, be written in multiple languages!
Interface is always a data file to help with debugging
it also acts as a log
each file generated during the process is prefixed with the job's name (hash + user given name)
cit file is a container for multiple types of information (tiff style)
CI setup:
generate UUID for integration
reference to BUILD setup
hosts: host1+key host2+key
key allows scheduler to log onto the build nodes
and run
compute affinity cost
can build now?
build
affinity:
config:
gcc 4.15
matlab
run: bash cat /proc/loadavg < 12
scheduling:
...
BUILD setup:
configure: ...
build description
configuration for the build
use_module_x = 1
build_log_file = ...
BUILD setup module X:
build description
configuration for the build
...
CIT log:
generate UUID, where
Queued nadim@his_box -> CI-Q@CI_Q_box
started scheduler: no candidates
started scheduler: candidates: A, B
send to B Queue
...
Integration done
***
sections can be encrypted but have a textual header
sections are tag-delimited to make it easy to extract sections on the command line
preferred format in order
tab delimited
TOML (simple: key=value pairs under [sections]) (send link)
XML (XML >> json because it has validation)
json
yaml
tarball
sections are versioned
this allows >> in the files (after locking)
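a sketch of one possible tag-delimited section and how it could be extracted with standard tools; the tag names are invented:
[CI_SETUP version=2]
hosts = host1 host2
[END CI_SETUP]
$ sed -n '/^\[CI_SETUP/,/^\[END CI_SETUP\]/p' cit_file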
Information (Debugging) is not an afterthought
logging per session in case multiple jobs are being run (see data file interface)
messages to stderr with header tag
verbosity level
a job must be traceable from creation to completion, including execution rights
tool to step through a CI job instead of looking through logs that may be mixed with
sub-tools' log output
have different logs
How do we build a local setup as a CI job?
we want to access resources in the cluster and not be bothered by creating a job description
create a script that takes a repo and version and makes a cit job
possibly local, as the remote job can check out from our local setup
possibility to build the job on own node not cluster
just sugar to avoid doing the steps by hand
results get back on the Q
jobs can start immediately
thus moving jobs from the in-Q to waiting-Q tries to schedule the jobs
from a job, created by someone else, one can get the commands to setup the job locally
IE: take a cit job description and run it locally
adding the current machine to the build cluster
cit add-builder info_file
allows anyone to add one's machine to the cluster
what are the security implications?
info_file describes
the resources the node has
how many concurrent jobs it can run
permissions
we can check permissions with pki
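a sketch of what info_file could contain; all field names and values are invented:
concurrent_jobs = 2
resources = gcc docker 16GB_ram
users_allowed_to_queue = nadim builders
public_key = ssh-ed25519 AAAA...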
this mechanism allows easy management of the build nodes by the admins
ssh to a node
cit add-builder info_file
equivalent cit commands for:
remove
what if jobs are running
pause/restart
get info (ls)
how do we get a status after the job is built
the status exists on the cluster node where the build was done
the status exists in the done-Queue
done-Q is
per job, all jobs, my jobs, job ID
just one type, the rest derived on demand
the status is sent from the job on the cluster node back to the main done-queue
a job is run by "cit run job_id"
cit run monitors the job till it ends
failure
or success
cit run updates the done-Q (and maybe success or failure Qs)
the status is kept on the cluster node before sending it to the done-Q
in case the "central" Q is off line
extra information can be added to the job before synchronizing it
done-job extra information is up to the job itself
if the information is a log, the log could be queried for ongoing job status
a bit like the stages in jenkins except it is not cit specific, it's just a log
user can query the log continuously to show the job ongoing status ("interactive")
"cit run" can update the ongoing-Q by synchronizing the log from cluster node to ongoing-Q node
CI framework doesn't know how to start builds
only user knows how a build is done
total separation CI - build system
the job format is decided by the builder
full blown shell script
including the danger associated
call to a build interface
only the parameters
verifiable
generated by a public API
encryption is possible to anonymize jobs in the Q
using the Q's private key
the CI does the minimum
handle queuing jobs
FIFO
handles triggering at the request of the jobs
triggering a build is just queueing a build, no reason for the
Q manager and trigger to be the same code
find build nodes where the jobs can be run
forward the job to the builder
keep statistics about jobs queuing
why? isn't the data about them enough, and stats can be computed in different ways
keep links between job (ID) and the job builder
who built what
does the CI start the job?
that would mean running as the user on the build node and we do not want that
just deliver the job to the user's Q and let the user build, IE cron job
the user could make a run_ci_job application available so other users can start jobs
sticky bit set
for specific user
extra permissions can be embedded in the job
the continuous integration framework consists of a set of small programs that can be used
by the user and by the automation system that builds components
No master
the Q and executor need to be run on a specific node
no they don't, it's just a normal user; they need to be cron'ed
but the node where they run is irrelevant if the data they need
is accessible and the job queuers know where to find it
use discovery instead of fixed IP?
Q data
Q can be synchronized to any amount of nodes for replication/backup
need use cases of replication
case: only one Q is active, data is replicated, make the replicated Q active
this entails
make Q1 inactive (change permissions)
make executor inactive (changed permission should be enough)
synchronize Q1 -> Q2
create new replicator and synchronize Q2 -> Q3
make Q2 active
update cit's Q list to add Q2 and Q3, removing Q1 if necessary
note: Q1, Q2, Q3, ... can be in the list from the beginning
cit tries the Qs in order and if only one is active it
uses it
administrative tools can run on all the Qs
is it active
how many jobs in the Q
...
does the Q need to be on a single node?
no but it needs to be on machines that the Q manager can access
thus the user machine is not possible but any of the build nodes is
what if the Q is not accessible when adding a job
the job is saved locally
cit can show the list of not Queued jobs
cit Queues all the jobs at next attempt to queue a job or on demand
tools to manage the local queue
the local Q is a directory, ls, grep, ... are enough, can be wrapped in small script
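for example, assuming the local Q lives in ~/.cit/queue (a made-up path):
$ ls ~/.cit/queue                          # jobs waiting to be pushed
$ grep -l some_repo ~/.cit/queue/*         # which local jobs reference a given repo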
multiple Qs/Q managers can co-exist on the same nodes
their Qs are simply different directories, with possibly different
access rights to allow only specific users to add CI jobs
triggers
application, or person, that decides that a job needs to be queued
can be different applications and persons
where do the applications reside and how are they started
anywhere, cron
how does one make a trigger inactive ?
application specific, can be as simple as a '#' in a configuration file
use discovery instead of fixed IP
we should have the possibility to do both as fixed IP has advantages too
discovery is for the Queue and the build nodes
discovery means a service
it could be used to deliver capacity information needed for scheduling
we don't want services at least none specific to cit
what about an account?
scan network, ssh as cit on the nodes, read a capacity file
using cit commands we can easily enable/disable the cit account and
modify the capacity file (using a text editor)
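a sketch of that scan; the node list and the capacity file location are assumptions:
$ for node in $(cat build_nodes.list); do ssh cit@$node cat capacity; done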
No service
except basic unix services, cron, sshd.
non service tool: pgp, ...
these are module dependent, the modules install them not CIT
good if the modules list their tools dependencies
better if they verify the tools are installed to fail early (not mandatory)
excellent if the installation/verification is made through a makefile thus
allowing admins to debug installation if necessary
automated installation
no pre-installation, only a user account is needed
needed applications are scp'ed or installed on demand
No database (or a database that acts like a file system)
text files
CI commands can be simple pipelines in the shell
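for example, counting queued jobs per user could be a plain pipeline (the queue path and the 'user:' field are assumptions):
$ grep -h '^user:' /var/cit/queue/* | sort | uniq -c | sort -rn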
public key infrastructure
to encrypt sensitive data
to log into the build node
to copy from the build nodes
Q and job description
Q is directory
Q manager has full rights
users can only write their own
job description is plain text (sketch below)
sensitive data is encrypted and rendered in base64
contains data needed by the sub modules to build
configuration
name:value
url/filename of name:values
url to tarball with configuration
needed artefacts
...
or it can be files in a git repo (that way, when someone
contributes a section to the file, each contribution is tracked.
maybe not necessary if the files are append-only)
good way to keep history although files in a history directory are simpler
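a sketch of such a plain-text job description; every field name and value here is invented:
job_name = nightly_integration
repo = ssh://git_server/product.git
version = v1.2.3
use_module_x = 1
configuration_url = http://config_server/nightly.tar.gz
credentials = BASE64:...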
cit command to suspend and restart the cron job running the queue manager
log who runs it
where?
how do we get jobs to the Q?
cit push job_file
how do we tag which user added a job?
accept only signed jobs?
Getting data from SM builds
URL that can be scp'ed
synchronizing with SM build
a build can be sleeping waiting for its dependencies to be available
cit build SM --blocking
a build can put itself back in the Q till its dependencies are available
the scheduler can verify the dependencies
by running the top module's check_dependencies target
build nodes are considered secured (and must be secured)
direct access to them is subject to access control
data access is via a secured protocol implementing access control or
via encryption of the sent data
using ssl keys
build of sub modules (SM) can be done via services (rather than checking out the sub module)
CIT is the service
a build can Q a build for a SM
the SM build request can be generated by the SM's public interface
the SM can have private repositories, it is built as the SM owner
selected portions of the SM are made available to the requester
portions are defined in the requests
delivery of the SM is done by making the SM portions directly available
URL that can be scp'ed
directly from the build directory
access rights are set or encrypted
SM can be/are encrypted with requestor's key
Service has a list of accepted public keys
SM can be cached in a binary repo, only the URLs are returned, no build is done
artefacts can be accessed by HASH of encrypted data
distribute the workload (no master)
computation of what needs to be done, and where, can be done on client machines
distribute the data load
logs, reports, ... are shared from the node having the data to the user machine directly
distribute reporting load
only data is transferred, report generation is done on the user machine
reports can be sent back or cached or put in the binary repo
how do we get information about private SM?
the data and/or transformed data must be public
transformation procedure can be made available through SM public interface
seeing builds
one needs to have the right to see them
attach to tty
get a log
run build in tmux and allow read
gotty, wtty, .. ?
https://metacpan.org/pod/release/BBB/ttylog-0.83/ttylog
https://github.com/nbedos/cistern
script -o in file that can be scp'ed
ttyrec
exec > file 2>&1
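a sketch with script(1): record the build output on the node and let a watcher follow it; the log name and build command are made up:
$ script -f -c 'make all' build.log          # on the build node
$ ssh $(cit where) tail -f build.log         # from the watcher's box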
cit is a cheap batch-like linux cluster
add user
add box to build cluster
cit add user@ip:ssh_port @queue @queue ...
cit configuration
$ cit config config.name value
contains
main queue
queues this box was added to, if any
jobs on this node
Continuous integration
access control
persons looking at builds, artefacts, reports, code, ... must have the right to
not all builds want to be watchable
build logs:
use a build system that generates better logs
wrap components builds in different logs
not changing the build system just the system around it
redirect all output in a/multiple log file
wrap the bash, if necessary, to associate logs to a component build
if log is hierarchical, components have separate logs, it is possible to see the
hierarchy and traverse it
Reports:
keep list of builds started locally (and some meta data)
we may not be interested by more than those builds
keep list of global builds
synch with the build queue?
only builds we can access are synchronized
how do we send our access keys?
data sent back must be encrypted
build running/failed/succeeded -> location
how does the CI know where things are built?
the build (or service building) knows about that information, the CI just delivers jobs
and thus doesn't know.
build system PID + status file ?
generated artefacts
go to the binary repo
binary repo has access control
are accessible via scp
how does the user find them?
* the job description should set a build directory
presents the builds in a hierarchy (tabs, ...)
previous builds
triggering
start job manually
repo dependencies
build scheduling
synchronize builds with equivalent input; if 2 dependents wait for the same build then build only once
access to workspace
only to code that the user can checkout
code in the workspace can be r or rw, up to the SM
the SM has RW so SM maintainers can debug/fix broken builds
keeps credential for repos
NOPE, uses nodes with the right credentials and gets back artefacts
UI, configure jobs
the terminal + $EDITOR is the UI
for complex configurations that need a wizard ... write a text mode wizard
NOPE, all config files under version control
EG: make a branch change the config, build the branch
if the repos are not accessible but still need a specific configuration
the configuration can be in a separate repo that is accessible
make a branch, check in changes, give the name of the branch as input to the main build
the config can be passed in the job
replay
what's replay?
modify something and rebuild
the modification should be done in a repo
the source repo in a branch or a new local repo
rebuild is queueing the job
does the new job need to be linked to the previous job?
no but it may help to have that in the meta data
what really needs to be saved is the commit
what happens in Jenkins if two users do a replay?
if the repo is not accessible then replay is not accessible
if accessible, clone the repo, rebuild the local changes
locally if possible
on CI-cluster otherwise
optimize by using binary repo from CI-cluster
scripting to implement build steps
NOPE! build in the build system
add a layer of build system if necessary
scripting to control CI system
need use cases
plugins
nope! to do what
Agents
all reasons below are useless, there is no need for agents
just users that can build, a cron job that makes the user check the queue
could be considered an agent but it is an agent of the user not the CI
multiplatform build
just needs ssh and permissions
pre-setup agents (build tools, ...)
let the build handle its dependencies
dependencies can be installed via the package manager or locally
local cache
let the build handle the cache as a dependency
agent configuration and management
just normal user management!
list of agents
can be a share list in a repository
find agent for a build
the executor finds the available build users in the network
ssh to the build nodes
check load
cit add job user_queue
check that the user exists
stashing
binary repo per build
this means that multiple builds with the same configuration having the same
stashed artefact (yes it's the same or we have a bigger problem) can share the
same artefact (and no need to build it either)
just have a cit stash and cit unstash that puts the artefacts in a local or global repo
the artefact name contains the job ID (even better a hash so artefacts can be shared)
synchronizing jobs on different build nodes
IE: start a job and wait for it to succeed
unique build ID
generated on the machine where the build is started
setup CI job
clone jobs
different types of jobs
support one type or let the jobs decide what they are
what does Jenkins support and why?
triggering by repo checkins
scheduling
scheduled at the trigger
rescheduled by maintainer
pause, restart later (via cron)
kill
immediate if possible, after finding a free agent
build under an alias and protect the build code
IE: user without rights can build projects that need special access to repo
find a node in the cluster that can build as an alias
parallel
ssh to node
include path to special jobs
run can_build_XXX_protected
test dependencies
tools
docker images
cpu, disk, ... availability
decide where to build
or queue (can be handled by the trigger)
build
generate unique build ID
ssh to node
build self or ask a can_build?
gather all necessary dependencies (tools, docker)
wrap build system to catch output
run build system
Workspaces per user Vs per cit
all work is done as the user to be able to check credentials
this means that the same build started by different users will have different workspaces
if a lot of users start the same job a lot of duplicates will exist
Note that this is the right thing to do, there is no reason someone's build is shared with someone else
* the workdirs are named after the UUID
but they can be under the build name
if group builds, sharing workspaces, are needed, build as a group representative
that needs to be a service, one that is accessible to people in the group only
the workspaces are reachable as the users are part of the group
scheduling jobs on the nodes from the scheduler
the job run as user but the user is not logged in and the scheduler should not
run as root nor start the build as the user
so the user, on the build node, has to wait till the scheduler allows it to run
this means that the user could just run without scheduling
* good or bad that the user has an account on the cluster?
root could help control the users, not letting them run anything before the scheduler
allows them to run a job (what's in the job can't be controlled, just how much it gets to run)
suid script that reduces pid resources if the pid belongs to a user that has no work scheduled
job waits for a notification to start, a file in a directory accessible by the cit scheduler? (sketch below)
cit (under the user's account) runs the build, in UUID_directory, based on the meta data
who starts the waiting
cit when it queues the job to the scheduler
jobs are serialized so they can be put in the wait queue in case of errors
* we can probe all the shared directories to check who has jobs queued
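a sketch of the wait-for-a-token-file idea; the directory and file names are invented:
$ until [ -e /cit_shared/go.UUID ]; do sleep 30; done; cit run UUID    # on the build node, as the user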
statistics about build node usage
needed to plan future capacity
some of the data we need:
load per machine
load per job
resource needed per job
disk usage
* just data extraction, stats done by someone else
workspace retention:
building for ourselves we can clean up our own workspaces
what if there is not enough disk space and the other workspaces are not under our control?
start a cleanup job per user from the scheduler
how do we handle "latest"?
*** builds that use latest of some repo should be killed, optionally, if a new latest pops up
so we need to keep a list of the running jobs or poll the build nodes
backup of scheduling data and build data
pause the Qs, tarball them
note
cit is more like a pipeline of commands than an application
the system is ad hoc by design
the system is self-describing by design to reduce the need for central control
no data structure describes the system, information must be gathered at run time
documentation of intermediary data is paramount as there is no
central mechanism to query
this also makes validation extra important. The user could
validate the input file with "cit -c info_file"
new steps can be added independently from the "core" implementation