cit_architectural_ideas.txt
handling credentials in a CI job
jenkins offers a key ring but that works only if all jobs run under a user with the access rights
one way to avoid the problem is to do all the checkouts locally and send the file system for builds to the CI system
the CI system just handles allocation of cpu, not credentials
one tool that can be used to simplify the sending of the checkout files is docker
either build an image with the files checked out; that needs to be done for every build with a code change
or check out the files in the container and commit it
the local machine can act as a docker hub, the CI system can pull the image from it
if the image was previously pulled only the new fs layer will need to be synchronized
after the build the container can be committed and the build originator can pull the image from the CI system to work further in it or debug if necessary (sketch below)
encryption of the layers would ensure security
encryption of the transport layer is also a possibility
we also need to ensure that the run is secure, either by
securing the build node
securing the run of the docker container
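a rough sketch of the docker flow above; the registry address, image name, and build command are all made up:
$ docker build -t dev_box:5000/cit-job-1234 .            # image contains the local checkout
$ docker push dev_box:5000/cit-job-1234                  # local machine acts as the registry
# on the build node: pull, build, commit the result so the originator can pull it back
$ docker pull dev_box:5000/cit-job-1234
$ docker run --name cit-job-1234 dev_box:5000/cit-job-1234 make all
$ docker commit cit-job-1234 dev_box:5000/cit-job-1234:built
$ docker push dev_box:5000/cit-job-1234:built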
job slaves
a job may be running on multiple slaves (at the same time or sequentially)
we need to know what has been run where
commit to build, builds to commit, and sequential list of builds
given commits in one or multiple repositories
find all the builds that include the commits
find all the build started because of those commits
generally, keep a commit to build information in the builds
find the builds, failing or passing, after commits were made
No central Queue
make the whole system p2p
list advantages and drawbacks
job comment
cit -m 'manual start because we have done something special, .....
the job comment is shown when listing the current jobs
if the build supports it, cit info could query the build/job
more likely that cit gives information about where the job is run
ssh $(cit where) make info
cluster management
run job if the load level in /proc/loadavg is below some threshold
see batch(1)
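a minimal sketch of that check; the 4.0 threshold and the job id are made up:
$ awk '{ exit !($1 < 4.0) }' /proc/loadavg && cit run job_id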
time limit (not CI, cluster/build management)
timeout (1)
not CI, build system timing information
/usr/bin/time to get info
replay
hmm, simply no!
CI is not a versioning system
what's the real use case for replay?
hack around the fact that Jenkins is a crappy development environment
when is it most often used?
to make slight modifications to make it run on jenkins
this is eliminated since cit runs on the user's environment
when a simple change needs to be done but we want to reuse the build directory on jenkins
moving the build directory is prohibitive
hoping the build will be re-run on the same node
=> use normal development strategies
if changes are made on the build node and committed, the job needs to reflect the changes
otherwise re-running the job would give a different result
the broken job will already have reported its failure thus this means
that it's a new job, one that didn't go through queueing
or we can see it as a new job which needs to run on the node where the modification
was done (artefacts are there)
if the change is made on a non-cluster node this is equivalent to pushing a done-job to the
done-queue; artefacts must be pushed to a cluster node or made available from the build box
getting CI builds locally
utility not CI
cit rsync UUID
creates, if necessary, a UUID+name directory and rsyncs the whole build
this can be large and time/cpu consuming if done too often on large builds
cit mount UUID
leave the build directory on the CI nodes and mount the filesystem in UUID directory
some operations do not require the copying of the whole build locally
we don't need to, we can work on the node directly
how do we stop cleanup of workspace while working on it?
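a sketch of what the two commands could do underneath; the build path and target directory names are assumptions:
$ rsync -a $(cit where):builds/UUID/ UUID_name/        # cit rsync UUID
$ sshfs $(cit where):builds/UUID UUID_name             # cit mount UUID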
restart the build, on the CI node, after modification
isn't it a replay?
we need to save the changes and make them very clear to people using the result
although the build can be restarted on the node directly, since we are logged in to make changes, it is
important that the CI system is not short-circuited
if the change is small we want the CI system to prioritize this build
we want the build to be queued
stash
make local archive
stash url to archive on "master"
unstash
get url
scp
build nodes may not be on the same network
use third party service
use url
stashes can be encrypted for security
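a minimal sketch of stash/unstash on top of tar and scp; the master path is made up:
$ tar czf job_id.artefacts.tgz artefacts/ && scp job_id.artefacts.tgz master:stashes/     # cit stash
$ scp master:stashes/job_id.artefacts.tgz . && tar xzf job_id.artefacts.tgz               # cit unstash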
pushing local builds to the CI system
not CI, code/binary management
in a controlled environment
a local build and a CI build are the same
the local build can be pushed to the CI system and made available to others without having to
queue for resources
rebuild it on the CI system
local binaries can shorten the build time on CI
binary repo from client to server!
can be used between developers, send my binaries so helper doesn't need to build
helper can send back what he has built
=> better, let helper ssh on own machine
choosing the build node (affinity/scheduling)
apart from needed resources, the presence of a previous build is an important factor
allows us to do incremental builds
CI runs an executable whose source is in the job data
in sandbox (docker, chroot)
allows job_scheduler to install resources for computation of go/no-go of job build
environment control
if we want to do incremental builds, environment control becomes paramount
needs to be done by the build system
needs to be a separate target and hierarchical
components can choose between incremental and clean builds
build directories must be different for components in case a component wants a "clean" build
or cleanup the component build directory
build system use chroot/docker to sandbox
this allows build on local machine with same build system
if "base images" exist to be reused, we must still check the versions of dependencies
that are installed and install them if not matching
application name: cit, or metra
"latest"
CI uses the configuration management system to freeze "latest" into
the version used during the build
CI -> fix versions -> CM
<- fixed version
CI -> Queue build -> Q -> CI scheduler
-> build -> build system
cit computes affinity before pushing jobs in queue
Simple parts connected by "clean" interfaces
parts are easy to replace
parts can, and will, be written in multiple languages!
Interface is always a data file to help with debugging
it also acts as a log
each file generated during the process is prefixed with the job's name (hash + user given name)
cit file is a container for multiple types of information (tiff style)
CI setup:
generate UUID for integration
reference to BUILD setup
hosts: host1+key host2+key
key allows scheduler to log onto the build nodes
and run
compute affinity cost
can build now?
build
affinity:
config:
gcc 4.15
matlab
run: bash cat /proc/loadavg < 12
scheduling:
...
BUILD setup:
configure: ...
build description
configuration for the build
use_module_x = 1
build_log_file = ...
BUILD setup module X:
build description
configuration for the build
...
CIT log:
generate UUID, where
Queued nadim@his_box -> CI-Q@CI_Q_box
started scheduler: no candidates
started scheduler: candidates: A, B
send to B Queue
...
Integration done
***
sections can be encrypted but have a textual header
sections are tag-delimited to make it easy to extract sections on the command line
preferred format in order
tab delimited
TOML (simple: key=value pairs under [sections]) (send link)
XML (XML >> json because it has validation)
json
yaml
tarball
sections are versioned
this allows >> in the files (after locking)
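a sketch of one possible tag-delimited section and how it could be extracted with standard tools; the tag names are invented:
[CI_SETUP version=2]
hosts = host1 host2
[END CI_SETUP]
$ sed -n '/^\[CI_SETUP/,/^\[END CI_SETUP\]/p' cit_file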
Information (Debugging) is not an afterthought
logging per session in case multiple jobs are being run (see data file interface)
messages to stderr with header tag
verbosity level
a job must be traceable from creation to completion, including execution rights
tool to step through a CI job instead of looking through logs that may be mixed with
sub-tools' log output
have different logs
How do we build a local setup as a CI job?
we want to access resources in the cluster and not be bothered by creating a job description
create a script that takes a repo and version and makes a cit job
possibly local, as the remote job can check out from our local setup
possibility to build the job on own node not cluster
just sugar to avoid doing the steps by hand
results get back on the Q
jobs can start immediately
thus moving jobs from the in-Q to waiting-Q tries to schedule the jobs
from a job, created by someone else, one can get the commands to setup the job locally
IE: take a cit job description and run it locally
adding the current machine to the build cluster
cit add-builder info_file
allows anyone to add one's machine to the cluster
what are the security implications?
info_file describes
the resources the node has
how many concurrent jobs it can run
permissions
we can check permissions with pki
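a sketch of what info_file could contain; all field names and values are invented:
concurrent_jobs = 2
resources = gcc docker 16GB_ram
users_allowed_to_queue = nadim builders
public_key = ssh-ed25519 AAAA...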
this mechanism allows easy management of the build nodes by the admins
ssh to a node
cit add-builder info_file
equivalent cit commands for:
remove
what if jobs are running
pause/restart
get info (ls)
how do we get a status after the job is built
the status exists on the cluster node where the build was done
the status exists in the done-Queue
done-Q is
per job, all jobs, my jobs, job ID
just one type, the rest derived on demand
the status is sent from the job on the cluster node back to the main done-queue
a job is run by "cit run job_id"
cit run monitors the job till it ends
failure
or success
cit run updates the done-Q (and maybe success or failure Qs)
the status is kept on the cluster node before sending it to the done-Q
in case the "central" Q is off line
extra information can be added to the job before synchronizing it
done-job extra information is up to the job itself
if the information is a log, the log could be queried for ongoing job status
a bit like the stages in jenkins except it is not cit specific, it's just a log
user can query the log continuously to show the job ongoing status ("interactive")
"cit run" can update the ongoing-Q by synchronizing the log from cluster node to ongoing-Q node
CI framework doesn't know how to start builds
only user knows how a build is done
total separation CI - build system
the job format is decided by the builder
full blown shell script
including the danger associated
call to a build interface
only the parameters
verifiable
generated by a public API
encryption is possible to anonymize jobs in the Q
using the Q's private key
the CI does the minimum
handle queuing jobs
FIFO
handles triggering at the request of the jobs
triggering a build is just queueing a build, no reason for the
Q manager and trigger to be the same code
find build nodes where the jobs can be run
forward the job to the builder
keep statistics about jobs queuing
why? isn't the data about them enough, and stats can be computed in different ways
keep links between job (ID) and the job builder
who built what
does the CI start the job?
that would mean running as the user on the build node and we do not want that
just deliver the job to the user's Q and let the user build, IE cron job
the user could make a run_ci_job application available so other users can start jobs
sticky bit set
for specific user
extra permissions can be embedded in the job
the continuous integration framework consists of a set of small programs that can be used
by the user and by the automation system that builds components
No master
the Q and executor need to be run on a specific node
no they don't, it's just a normal user; they need to be cron'ed
but the node where they run is irrelevant if the data they need
is accessible and the job queuers know where to find it
use discovery instead of fixed IP?
Q data
Q can be synchronized to any amount of nodes for replication/backup
need use cases of replication
case: only one Q is active, data is replicated, make the replicated Q active
this entails
make Q1 inactive (change permissions)
make executor inactive (changed permission should be enough)
synchronize Q1 -> Q2
create new replicator and synchronize Q2 -> Q3
make Q2 active
update cit's Q list to add Q2 and Q3, removing Q1 if necessary
note: Q1, Q2, Q3, ... can be in the list from the beginning
cit tries the Qs in order and if only one is active it
uses it
administrative tools can run on all the Qs
is it active
how many jobs in the Q
...
does the Q need to be on a single node?
no but it needs to be on machines that the Q manager can access
thus the user machine is not possible but any of the build nodes is
what if the Q is not accessible when adding a job
the job is saved locally
cit can show the list of not Queued jobs
cit Queues all the jobs at next attempt to queue a job or on demand
tools to manage the local queue
the local Q is a directory, ls, grep, ... are enough, can be wrapped in small script
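for example, assuming the local Q lives in ~/.cit/queue (a made-up path):
$ ls ~/.cit/queue                          # jobs waiting to be pushed
$ grep -l some_repo ~/.cit/queue/*         # which local jobs reference a given repo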
multiple Qs/Q managers can co-exist on the same nodes
their Qs are simply different directories, with possibly different
access rights to allow only specific users to add CI jobs
triggers
application, or person, that decides that a job needs to be queued
can be different applications and persons
where do the applications reside and how are they started
anywhere, cron
how does one make a trigger inactive ?
application specific, can be as simple as a '#' in a configuration file
use discovery instead of fixed IP
we should have the possibility to do both as fixed IP has advantages too
discovery is for the Queue and the build nodes
discovery means a service
it could be used to deliver capacity information needed for scheduling
we don't want services at least none specific to cit
what about an account?
scan network, ssh as cit on the nodes, read a capacity file
using cit commands we can easily enable/disable the cit account and
modify the capacity file (using a text editor)
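a sketch of that scan; the node list and the capacity file location are assumptions:
$ for node in $(cat build_nodes.list); do ssh cit@$node cat capacity; done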
No service
except basic unix services, cron, sshd.
non service tool: pgp, ...
these are module dependent, the modules install them not CIT
good if the modules list their tools dependencies
better if they verify the tools are installed to fail early (not mandatory)
excellent if the installation/verification is made through a makefile thus
allowing admins to debug installation if necessary
automated installation
no pre-installation, only a user account is needed
needed applications are scp'ed or installed on demand
No database (or a database that acts like a file system)
text files
CI commands can be simple pipelines in the shell
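for example, counting queued jobs per user could be a plain pipeline (the queue path and the 'user:' field are assumptions):
$ grep -h '^user:' /var/cit/queue/* | sort | uniq -c | sort -rn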
public key infrastructure
to encrypt sensitive data
to log into the build node
to copy from the build nodes
Q and job description
Q is directory
Q manager has full rights
users can only write their own
job description is plain text (sketch below)
sensitive data is encrypted and rendered in base64
contains data needed by the sub modules to build
configuration
name:value
url/filename of name:values
url to tarball with configuration
needed artefacts
...
or it can be files in a git repo (that way, when someone
contributes a section to the file, each contribution is tracked.
maybe not necessary if the files are append-only)
good way to keep history although files in a history directory are simpler
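a sketch of such a plain-text job description; every field name and value here is invented:
job_name = nightly_integration
repo = ssh://git_server/product.git
version = v1.2.3
use_module_x = 1
configuration_url = http://config_server/nightly.tar.gz
credentials = BASE64:...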
cit command to suspend and restart the cron job running the queue manager
log who runs it
where?
how do we get jobs to the Q?
cit push job_file
how do we tag which user added a job?
accept only signed jobs?
Getting data from SM builds
URL that can be scp'ed
synchronizing with SM build
a build can be sleeping waiting for its dependencies to be available
cit build SM --blocking
a build can put itself back in the Q till its dependencies are available
the scheduler can verify the dependencies
by running the top module's check_dependencies target
build nodes are considered secured (and must be secured)
direct access to them is subject to access control
data access is via a secured protocol implementing access control or
via encryption of the sent data
using ssl keys
build of sub modules (SM) can be done via services (rather than checking out the sub module)
CIT is the service
a build can Q a build for a SM
the SM build request can be generated by the SM's public interface
the SM can have private repositories, it is built as the SM owner
selected portions of the SM are made available to the requester
portions are defined in the requests
delivery of the SM is done by making the SM portions directly available
URL that can be scp'ed
directly from the build directory
access rights are set or encrypted
SM can be/are encrypted with requestor's key
Service has a list of accepted public keys
SM can be cached in a binary repo, only the URLs are returned, no build is done
artefacts can be accessed by HASH of encrypted data
distribute the workload (no master)
computation of what needs to be done, and where, can be done on client machines
distribute the data load
logs, reports, ... are shared from the node having the data to the user machine directly
distribute reporting load
only data is transferred, report generation is done on the user machine
reports can be sent back or cached or put in the binary repo
how do we get information about private SM?
the data and/or transformed data must be public
transformation procedure can be made available through SM public interface
seeing builds
one needs to have the right to see them
attach to tty
get a log
run build in tmux and allow read
gotty, wtty, .. ?
https://metacpan.org/pod/release/BBB/ttylog-0.83/ttylog
https://github.com/nbedos/cistern
script -o in file that can be scp'ed
ttyrec
exec > file 2>&1
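a sketch with script(1): record the build output on the node and let a watcher follow it; the log name and build command are made up:
$ script -f -c 'make all' build.log          # on the build node
$ ssh $(cit where) tail -f build.log         # from the watcher's box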
cit is a cheap batch-like linux cluster
add user
add box to build cluster
cit add user@ip:ssh_port @queue @queue ...
cit configuration
$ cit config config.name value
contains
main queue
queues this box was added to, if any
jobs on this node
Continuous integration
access control
persons looking at builds, artefacts, reports, code, ... must have the right to
not all builds want to be watchable
build logs:
use a build system that generates better logs
wrap components builds in different logs
not changing the build system just the system around it
redirect all output in a/multiple log file
wrap the bash, if necessary, to associate logs to a component build
if log is hierarchical, components have separate logs, it is possible to see the
hierarchy and traverse it
Reports:
keep list of builds started locally (and some meta data)
we may not be interested by more than those builds
keep list of global builds
synch with the build queue?
only builds we can access are synchronized
how do we send our access keys?
data sent back must be encrypted
build running/failed/succeeded -> location
how does the CI know where things are built?
the build (or service building) knows about that information, the CI just delivers jobs
and thus doesn't know.
build system PID + status file ?
generated artefacts
go to the binary repo
binary repo has access control
are accessible via scp
how does the user find them?
* the job description should set a build directory
presents the builds in a hierarchy (tabs, ...)
previous builds
triggering
start job manually
repo dependencies
build scheduling
synchronize builds with equivalent input; if 2 dependents wait for the same build then build only once
access to workspace
only to code that the user can checkout
code in the workspace can be r or rw, up to the SM
the SM has RW so SM maintainers can debug/fix broken builds
keeps credential for repos
NOPE, uses nodes with the right credentials and gets back artefacts
UI, configure jobs
the terminal + $EDITOR is the UI
for complex configurations that need a wizard ... write a text mode wizard
NOPE, all config files under version control
EG: make a branch change the config, build the branch
if the repos are not accessible but still need a specific configuration
the configuration can be in a separate repo that is accessible
make a branch, check in changes, give the name of the branch as input to the main build
the config can be passed in the job
replay
what's replay?
modify something and rebuild
the modification should be done in a repo
the source repo in a branch or a new local repo
rebuild is queueing the job
does the new job need to be linked to the previous job?
no but it may help to have that in the meta data
what really needs to be saved is the commit
what happens in Jenkins if two users do a replay?
if the repo is not accessible then replay is not accessible
if accessible, clone the repo, rebuild the local changes
locally if possible
on CI-cluster otherwise
optimize by using binary repo from CI-cluster
scripting to implement build steps
NOPE! build in the build system
add a layer of build system if necessary
scripting to control CI system
need use cases
plugins
nope! to do what
Agents
all reasons below are useless, there is no need for agents
just users that can build, a cron job that makes the user check the queue
could be considered an agent but it is an agent of the user not the CI
multiplatform build
just needs ssh and permissions
pre-setup agents (build tools, ...)
let the build handle its dependencies
dependencies can be installed via the package manager or locally
local cache
let the build handle the cache as a dependency
agent configuration and management
just normal user management!
list of agents
can be a share list in a repository
find agent for a build
the executor finds the available build users in the network
ssh to the build nodes
check load
cit add job user_queue
check that the user exists
stashing
binary repo per build
this means that multiple builds with the same configuration having the same
stashed artefact (yes it's the same or we have a bigger problem) can share the
same artefact (and no need to build it either)
just have a cit stash and cit unstash that puts the artefacts in a local or global repo
the artefact name contains the job ID (even better a hash so artefacts can be shared)
synchronizing jobs on different build nodes
IE: start a job and wait for it to succeed
unique build ID
generated on the machine where the build is started
setup CI job
clone jobs
different types of jobs
support one type or let the jobs decide what they are
what does Jenkins support and why?
triggering by repo checkins
scheduling
scheduled at the trigger
rescheduled by maintainer
pause, restart later (via cron)
kill
immediate if possible, after finding a free agent
build under an alias and protect the build code
IE: user without rights can build projects that need special access to repo
find a node in the cluster that can build as an alias
parallel
ssh to node
include path to special jobs
run can_build_XXX_protected
test dependencies
tools
docker images
cpu, disk, ... availability
decide where to build
or queue (can be handled by the trigger)
build
generate unique build ID
ssh to node
build self or ask a can_build?
gather all necessary dependencies (tools, docker)
wrap build system to catch output
run build system
Workspaces per user Vs per cit
all work is done as the user to be able to check credentials
this means that the same build started by different users will have different workspaces
if a lot of users start the same job a lot of duplicates will exist
Note that this is the right thing to do, there is no reason someone's build is shared with someone else
* the workdirs are named after the UUID
but they can be under the build name
if group builds, sharing workspaces, are needed, build as a group representative
that needs to be a service, one that is accessible to people in the group only
the workspaces are reachable as the users are part of the group
scheduling jobs on the nodes from the scheduler
the job run as user but the user is not logged in and the scheduler should not
run as root nor start the build as the user
so the user, on the build node, has to wait till the scheduler allows it to run
this means that the user could just run without scheduling
* good or bad that the user has an account on the cluster?
root could help control the users, not letting them run anything before the scheduler
allows them to run a job (what's in the job can't be controlled, just how much it gets to run)
suid script that reduces pid resources if the pid belongs to a user that has no work scheduled
job waits for a notification to start, a file in a directory accessible by the cit scheduler? (sketch below)
cit (under the user's account) runs the build, in UUID_directory, based on the meta data
who starts the waiting
cit when it queues the job to the scheduler
jobs are serialized so they can be put in the wait queue in case of errors
* we can probe all the shared directories to check who has jobs queued
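a sketch of the wait-for-a-token-file idea; the directory and file names are invented:
$ until [ -e /cit_shared/go.UUID ]; do sleep 30; done; cit run UUID    # on the build node, as the user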
statistics about build node usage
needed to plan future capacity
some of the data we need:
load per machine
load per job
resource needed per job
disk usage
* just data extraction, stats done by someone else
workspace retention:
building for ourselves we can clean up our own workspaces
what if there is not enough disk space and the other workspaces are not under our control?
start a cleanup job per user from the scheduler
how do we handle "latest"?
*** builds that use latest of some repo should be killed, optionally, if a new latest pops up
so we need to keep a list of the running jobs or poll the build nodes
backup of scheduling data and build data
pause the Qs, tarball them
note
cit is more like a pipeline of commands than an application
the system is ad hoc by design
the system is self-describing by design to reduce the need for central control
no data structure describes the system, information must be gathered at run time
documentation of intermediary data is paramount as there is no
central mechanism to query
this also makes validation extra important. The user could
validate the input file with "cit -c info_file"
new steps can be added independently from the "core" implementation