# Scheduling

## Single-processor scheduling
Problem: We have a lot of tasks to run.

System model:
* Limited number of processors
* Finite amount of memory, disk, and network bandwidth

Objectives:
* Good throughput (jobs completed per time unit)
    * Related: job completion time
* High resource utilization
    * Don't leave any resources idle
        
### Strategies
* FIFO
    * Run tasks sequentially
    * Average completion time: $\frac{1}{n}\sum_{i=1}^{n} C_i$, where $C_i$ is the time at which task $i$ finishes and $n$ is the number of tasks
* Shortest Task First
    * Maintain priority queue of tasks
        * Priority: Ascending job run time
    * Provably gives the smallest average job completion time
    * The queue could also use a custom priority
* Round-Robin
    * Divide each task's run time into fixed-size _quanta_ and schedule the quanta separately
    * Run one quantum of the task at the head of the queue, then move that task to the back of the queue
    * Preferable for interactive apps requiring quick responses
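
The three strategies above can be compared on a toy workload. The sketch below is only an illustration: the run times, the quantum of 2, and the function names are assumptions made for the example, and all tasks are assumed to arrive at time zero.

```python
from collections import deque
from itertools import accumulate


def average_completion_time(run_times):
    """Average finish time when tasks run back to back in the given order."""
    finish_times = list(accumulate(run_times))
    return sum(finish_times) / len(finish_times)


def round_robin_finish_times(run_times, quantum):
    """Finish time of each task when each turn lasts at most `quantum` time units."""
    remaining = list(run_times)
    queue = deque(range(len(run_times)))
    finish = [0] * len(run_times)
    clock = 0
    while queue:
        i = queue.popleft()
        ran = min(quantum, remaining[i])
        clock += ran
        remaining[i] -= ran
        if remaining[i] > 0:
            queue.append(i)  # not done yet: go to the back of the queue
        else:
            finish[i] = clock
    return finish


# Hypothetical run times, listed in arrival order.
tasks = [10, 5, 3]

fifo_avg = average_completion_time(tasks)         # finish times 10, 15, 18
stf_avg = average_completion_time(sorted(tasks))  # finish times 3, 8, 18
rr_finish = round_robin_finish_times(tasks, quantum=2)

print(f"FIFO average: {fifo_avg:.2f}")
print(f"STF average:  {stf_avg:.2f}")
print(f"Round-robin finish times: {rr_finish}")
```

With these numbers STF's average (29/3) beats FIFO's (43/3). Round-robin does not improve the average here, but every task gets its first turn within 4 time units, whereas under FIFO the last task waits 15.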

## Multi-Tenant System

* __Tenant__: a user or a job
* __Container__: a set of resources sufficient to run one task of one job; the basic cluster unit

#### Goals for a Multi-Tenant Scheduler
* We need a system that allocates resources (nodes, containers, processor time) to tasks (and their sub-tasks)
* We want to be fair(ish) to all tenants, and can weight fairness more heavily when it is an important principle
* We want to use as much of the system as we can, all the time

### Hadoop Capacity Scheduler
* Multiple queue system
* Each queue is guaranteed some portion of the cluster capacity
    * Q1 gets 80% of the cluster
    * Q2 gets 20% of the cluster
    * High priority jobs go to Q1
* Each queue is then in charge of running its own jobs on the resources it was given

#### Attributes
* Queues are typically FIFO
* Queues can be hierarchical
* Admins can configure queues
    * Add limits
        * Soft limit: minimum % of cluster resources the queue is guaranteed
        * Hard limit: maximum % of cluster resources the queue may use
    * Elasticity: a queue can occupy additional resources when cluster load is low
* Users can specify requirements
    * Must have $x$ memory available, etc.
* Pre-emption is not allowed!
    * If load is high, the scheduler must wait for a queue's running tasks to finish before it can take resources back from that queue.
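
The interaction between soft limits, hard limits, and elasticity can be sketched with a toy allocator. This is not Hadoop's actual algorithm or API: the function name, the integer-percentage limits, and the hard limits of 100% and 50% are assumptions made for illustration, and the guaranteed percentages are assumed to sum to at most 100.

```python
def allocate_containers(total, limits, demand):
    """Toy capacity allocation (illustrative only, not Hadoop's implementation).

    limits: queue name -> (guaranteed_pct, max_pct), integer percentages
    demand: queue name -> number of containers the queue wants
    """
    # Pass 1: each queue gets what it asks for, up to its guaranteed (soft) share.
    alloc = {}
    for name, (guaranteed_pct, _) in limits.items():
        alloc[name] = min(demand[name], total * guaranteed_pct // 100)

    # Pass 2 (elasticity): idle containers go to queues that still want more,
    # but never beyond a queue's hard limit.
    spare = total - sum(alloc.values())
    for name, (_, max_pct) in limits.items():
        headroom = total * max_pct // 100 - alloc[name]
        extra = max(0, min(spare, demand[name] - alloc[name], headroom))
        alloc[name] += extra
        spare -= extra
    return alloc


# Hypothetical limits: Q1 is guaranteed 80%, Q2 is guaranteed 20% and capped at 50%.
limits = {"Q1": (80, 100), "Q2": (20, 50)}
light_load = allocate_containers(100, limits, {"Q1": 30, "Q2": 60})
heavy_load = allocate_containers(100, limits, {"Q1": 90, "Q2": 60})
print(light_load)
print(heavy_load)
```

When Q1 is lightly loaded, Q2 grows past its 20% guarantee up to its 50% hard limit; when both queues are busy, each is held to its guaranteed share.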


### Hadoop Fair Scheduler
* Goal: All jobs get an equal share of resources
* When only one job is running, it gets the entire cluster
* As more jobs arrive, the cluster is divided evenly among the running jobs

#### Attributes
* Divides cluster into pools
    * Usually, one user per pool
* Resources are evenly divided between pools.
* Scheduling within a pool is configurable: FIFO (a.k.a. FCFS), fair sharing, etc.
* Pools can have _min shares_
    * Guaranteed min % of cluster 
* When a pool's min share is not met
    * If the min share is still not provided after a _timeout_
        * Pre-empt running tasks in other pools
            * a.k.a. kill 'em
            * This is OK because tasks are idempotent; the killed work is just wasted and has to be redone.
        * The scheduler kills the most-recently-started tasks
            * Minimizes wasted work
        * The community has proposed pre-emption with saved state
            * Save the pre-empted task's state so it can resume later
            * Common in grid computing
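
The "kill the most-recently-started tasks" rule can be sketched as a simple selection function. The `RunningTask` type, its fields, and the pool names are hypothetical, and the sketch simplifies by treating every task outside the starved pool as a candidate; it is not the Fair Scheduler's real code.

```python
from dataclasses import dataclass


@dataclass
class RunningTask:
    task_id: str
    pool: str
    start_time: int  # larger value means the task started later


def pick_tasks_to_preempt(running, starved_pool, needed):
    """Pick `needed` tasks from other pools, most recently started first."""
    candidates = [t for t in running if t.pool != starved_pool]
    # Newest tasks have done the least work, so killing them wastes the least.
    candidates.sort(key=lambda t: t.start_time, reverse=True)
    return candidates[:needed]


running = [
    RunningTask("a1", "alice", start_time=100),
    RunningTask("a2", "alice", start_time=400),
    RunningTask("b1", "bob", start_time=250),
    RunningTask("c1", "carol", start_time=300),
]
victims = pick_tasks_to_preempt(running, starved_pool="carol", needed=2)
print([t.task_id for t in victims])
```

Here the pool `carol` is below its min share and needs two containers, so the two newest tasks from the other pools are chosen.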

### Estimating Task Lengths
* We'd rather use Shortest-Task-First (STF) for scheduling, since it minimizes average completion time
* But that means we need to estimate each task's length before running it

#### Estimation Approaches
* Within a job: $\text{run time} \propto \text{input size}$
* Across tasks: $\text{run time} \approx$ average of the run times of sibling tasks within the same parent job
    * Can weight the averaged tasks by input size
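
One way to combine the two approaches is to estimate a per-unit-of-input rate from sibling tasks that have already finished, weighted by their input sizes, and scale it by the new task's input size. The function name and the sample numbers below are assumptions for illustration.

```python
def estimate_run_time(input_size, finished_siblings):
    """Estimate a task's run time from finished sibling tasks in the same job.

    finished_siblings: list of (input_size, run_time) pairs.
    Dividing total time by total input weights each sibling by its input size.
    """
    total_input = sum(size for size, _ in finished_siblings)
    total_time = sum(run_time for _, run_time in finished_siblings)
    return input_size * total_time / total_input


# Hypothetical siblings: (input size in MB, run time in seconds).
siblings = [(100, 20), (200, 50)]
estimate = estimate_run_time(150, siblings)
print(estimate)
```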

## Dominant Resource-Fair Scheduling

Scheduling VMs in a cloud (cluster)?
* Jobs may have multi-resource requirements
    * Each task of Job 1 may need 2 CPUs, 8 GB RAM
    * Each task of Job 2 may need 6 CPUs, 2 GB RAM
* What _is_ fairness?

### Dominant Resource Fairness (DRF)
* Out of UC Berkeley
* Proven to be fair for multi-tenant systems
* Can't be gamed by individual tenants; requesting more resources than needed only hurts the tenant that asks
* Envy-free: no tenant would prefer another tenant's allocation to its own
* Works for scheduling VMs & Hadoop
* Used in Mesos
    * A cloud environment OS
    
### System Model
* 18 CPUs
* 36 GB RAM

#### Example
* Using the job examples from above, each task's demand as a fraction of the cluster is:
    * Job 1 needs $2/18 = 1/9$ of the CPUs
    * Job 1 needs $8/36 = 2/9$ of the RAM
        * _RAM_ intensive (RAM is its dominant resource)
    * Job 2 needs $6/18 = 1/3$ of the CPUs
    * Job 2 needs $2/36 = 1/18$ of the RAM
        * _CPU_ intensive (CPU is its dominant resource)
* DRF guarantees that every job gets the same cluster-wide share of its dominant resource
    * Job 1's % of RAM = Job 2's % of CPU
* Solution: 
    * Job 1 gets 3 tasks with <2 CPUs, 8 GB> each
        * Job 1's % of RAM: $3 \times 8/36 = 2/3$
    * Job 2 gets 2 tasks with <6 CPUs, 2 GB> each
        * Job 2's % of CPU: $2 \times 6/18 = 2/3$
    * Total usage is 18 of 18 CPUs and 28 of 36 GB RAM, so no further task of either job fits
* DRF generalizes to more than two jobs
* DRF also generalizes to more than 2 resource types
    * CPU, RAM, Network, Disk, etc.
* _Guarantees_ each job gets a fair share of its dominant resource
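
The allocation in the example can be reproduced with a small loop that repeatedly gives one more task to the job with the lowest dominant share. This is a minimal sketch, not the Mesos implementation: the function and resource names are made up for the example, and dropping a job once its next task no longer fits is a simplifying assumption.

```python
from fractions import Fraction


def drf_allocate(capacity, demands):
    """Return the number of tasks each job gets under a simple DRF loop.

    capacity: resource name -> total amount in the cluster
    demands:  job name -> {resource name -> amount needed per task}
    """
    used = {r: 0 for r in capacity}
    tasks = {job: 0 for job in demands}

    def dominant_share(job):
        # Largest fraction of any single resource that the job currently holds.
        return max(Fraction(tasks[job] * demands[job][r], capacity[r]) for r in capacity)

    active = sorted(demands)  # jobs that might still fit another task
    while active:
        job = min(active, key=dominant_share)  # ties go to the first name
        need = demands[job]
        if all(used[r] + need[r] <= capacity[r] for r in capacity):
            for r in capacity:
                used[r] += need[r]
            tasks[job] += 1
        else:
            active.remove(job)  # assumption: stop considering jobs that no longer fit
    return tasks


capacity = {"cpu": 18, "ram_gb": 36}
demands = {
    "job1": {"cpu": 2, "ram_gb": 8},
    "job2": {"cpu": 6, "ram_gb": 2},
}
allocation = drf_allocate(capacity, demands)
print(allocation)
```

On the 18-CPU, 36 GB cluster this yields 3 tasks for Job 1 and 2 tasks for Job 2, matching the solution above. Exact fractions are used so that equal shares compare as equal.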

## Summary
* STF gives the smallest average completion time, but you have to be able to estimate task length