## Comparing the Performance of Agents using different Observations on the Traffic Signal Control Task

### Motivation

In a model-free reinforcement learning setting, a state (also called an observation in the single-agent case) can be represented in many different ways. Generally, the only constraint on a state representation is that it should satisfy the Markov property. That leaves many candidate representations to choose from, but are all state representations created equal? No: depending on the function approximator chosen to represent the Q function, the set of state representations that the approximator can _comprehend_ relatively easily shrinks.

We want to measure the effect of state representation on agent performance in the traffic signal control domain. The two main objectives of the project are:

1. Establish a Short-Range Temporal Scan performance baseline.
2. Change the representation to queue lengths. Replace the convolutional layers with fully connected (FC) layers while keeping the parameter count the same. Compare the results with the previous representation.
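Keeping the parameter count the same across architectures can be checked with simple arithmetic. The sketch below is only an illustration: the layer shapes, the four queue inputs, and the two actions are hypothetical placeholders, not the architecture used in this project.

```python
def conv2d_params(in_channels, out_channels, kernel_h, kernel_w, bias=True):
    """Number of trainable parameters in a 2D convolutional layer."""
    return out_channels * (in_channels * kernel_h * kernel_w + (1 if bias else 0))


def linear_params(in_features, out_features, bias=True):
    """Number of trainable parameters in a fully connected layer."""
    return out_features * (in_features + (1 if bias else 0))


def mlp_params(n_inputs, hidden, n_outputs):
    """Parameters of a one-hidden-layer MLP: inputs -> hidden -> outputs."""
    return linear_params(n_inputs, hidden) + linear_params(hidden, n_outputs)


def fc_hidden_width(budget, n_inputs, n_outputs):
    """Hidden width whose one-hidden-layer MLP is closest to the budget.

    mlp_params(h) = h * (n_inputs + n_outputs + 1) + n_outputs, solved for h.
    """
    per_unit = n_inputs + n_outputs + 1
    return max(1, round((budget - n_outputs) / per_unit))


# Hypothetical convolutional stack (shapes are assumptions for illustration).
conv_budget = conv2d_params(1, 16, 4, 4) + conv2d_params(16, 32, 2, 2)

# Hypothetical queue input: one queue length per approach, two actions.
hidden = fc_hidden_width(conv_budget, n_inputs=4, n_outputs=2)
print(conv_budget, hidden, mlp_params(4, hidden, 2))
```

An exact match is usually not possible with integer layer widths, so the closest width is taken and the small remaining difference should be reported alongside the results.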

### Background

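A traffic signal cycles through a set of phases, where each phase specifies which movements at the intersection have a green light.

In SUMO, a phase is written as a state string with one character per controlled link: `r` means red, `y` yellow, and `G` green with priority. `rGrG` and `GrGr` are therefore the two complementary green phases of the intersection used here: each gives green to one pair of links while the other pair waits on red.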

### Code

![title](https://i.ytimg.com/vi/kp0nmL_KAek/hqdefault.jpg)

The code for this project is confidential.

### Environment

#### Simulator

We use the SUMO traffic simulator (https://sumo.dlr.de/docs/). SUMO is an open-source package that can perform microscopic and continuous traffic simulations. We use Flow to abstract away the SUMO APIs, but some scenarios, such as working with traffic loop detectors, needed some tinkering.

#### Network

The network is a single intersection. In RL, experiments generally begin with the simplest scenario. This is because RL agents are very hard to train, and testing on the simplest scenario is a good sanity check. Even this simple scenario can be scaled to more complex tasks with minor changes, such as changing the number of lanes or the traffic flow parameters.

![Image](plots/empty-network.png)

![Image](plots/network.png)

#### Traffic Inflow

Earlier, traffic inflow was defined in terms of `vehsPerHour`.

Inflow is now set through a spawn probability chosen so that the road is highly saturated, and we will see how that affects the results of the different algorithms.

We first ran in unsaturated conditions, but those are not interesting, because the queue metric can go to zero under an optimal policy. Since vehicles are spawned stochastically, this initialization approach ensures that there are some moments where vehicles are sparse on the road, while also ensuring that the agent mostly learns in the tough conditions of high traffic.
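Stochastic spawning of this kind can be sketched as an independent Bernoulli trial per second. This is a minimal illustration under stated assumptions: the function names are hypothetical, and at most one vehicle per second per inflow is assumed, which may differ from how the simulator inserts vehicles.

```python
import random


def spawn_times(spawn_prob, horizon, seed=0):
    """Seconds in [0, horizon) at which a vehicle is spawned.

    Each second is an independent Bernoulli trial with success
    probability spawn_prob (assumption: at most one vehicle per second).
    """
    rng = random.Random(seed)
    return [t for t in range(horizon) if rng.random() < spawn_prob]


def equivalent_vehs_per_hour(spawn_prob):
    """Expected hourly inflow implied by a per-second spawn probability."""
    return spawn_prob * 3600


times = spawn_times(spawn_prob=0.05, horizon=3600)
gaps = [b - a for a, b in zip(times, times[1:])]
print(len(times), equivalent_vehs_per_hour(0.05), max(gaps, default=0))
```

Because the gaps between arrivals are random, even a high spawn probability occasionally produces a sparse stretch, which is the property the initialization relies on.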

### Baselines

To establish baselines for the traffic signal control task, we have two options: a random policy and a static time policy. The random policy chooses actions at random, so it serves as a baseline for the lower limit of controller performance.

In contrast to the random policy, the static time policy establishes a much more useful baseline for judging the controller's performance, as it is similar to how actual traffic signals work. In a static time policy, each phase is given a fixed time, and once that interval has passed, the controller switches to the next phase, which has its own fixed time interval.
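The static time policy can be sketched as a lookup into a repeating schedule. This is a simplified illustration, not the project's controller: the function name and the 30-second durations are hypothetical, and yellow and interphase intervals are ignored.

```python
def fixed_time_phase(t, schedule):
    """Return the phase that is active at time t (in seconds).

    schedule is an ordered list of (phase_name, duration_seconds) pairs
    that repeats forever, as in a static time controller.
    """
    cycle = sum(duration for _, duration in schedule)
    offset = t % cycle
    for name, duration in schedule:
        if offset < duration:
            return name
        offset -= duration
    raise ValueError("schedule must contain positive durations")


# Hypothetical schedule: both phases get 30 seconds of green.
schedule = [("GrGr", 30), ("rGrG", 30)]
for t in (0, 29, 30, 59, 60):
    print(t, fixed_time_phase(t, schedule))
```

The policy never looks at the traffic state, which is why it cannot adapt when one direction carries more traffic than the other.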

When training on traffic networks, there are many constants that represent inherent attributes of the network. One such constant is `traffic_light_params`, which holds the minimum and maximum duration each phase can take at the intersection. Note that this object will differ between intersections, and the values can vary heavily from phase to phase, as traffic in certain directions may be higher. These parameters are also tied to a particular traffic flow, so if the flow changes, these guidelines should change too.


```python
traffic_light_params = {
    'rGrG': {'min': 10, 'max': 60},
    'GrGr': {'min': 10, 'max': 60},
    'yellow_duration': 3,
    'interphase_duration': 2
}
```

But to create our baselines, we assume these parameters work for varied levels of traffic flow. Our second baseline therefore has three variants, which run each phase for its `min time`, its `max time`, and its `mean time ((min + max) / 2)`. This should ensure that at least some baseline works at every level of traffic.
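The three static variants follow directly from `traffic_light_params`. The sketch below shows one way to derive them; the helper name `static_baseline_durations` is hypothetical, and the variant names follow the static-min, static-mid, and static-max labels used in the results.

```python
traffic_light_params = {
    'rGrG': {'min': 10, 'max': 60},
    'GrGr': {'min': 10, 'max': 60},
    'yellow_duration': 3,
    'interphase_duration': 2
}


def static_baseline_durations(params):
    """Green time per phase for the three static baseline variants."""
    # Phase entries are the dict-valued keys; the rest are scalar settings.
    phases = [key for key, value in params.items() if isinstance(value, dict)]
    return {
        'static-min': {p: params[p]['min'] for p in phases},
        'static-mid': {p: (params[p]['min'] + params[p]['max']) / 2 for p in phases},
        'static-max': {p: params[p]['max'] for p in phases},
    }


for variant, durations in static_baseline_durations(traffic_light_params).items():
    print(variant, durations)
```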

### Benchmark

The benchmark we wanted to design involves:

1. Varying the number of incoming and outgoing lanes.
2. Varying the traffic flow parameters.

Due to lack of time, we are only doing:

The rest is intended for future work.

### Experiments

Evaluate multiple metrics: a controller can do well on one and badly on another. It is critical to look at box plots of performance over all cars, not a single mean value.
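To make the point about distributions concrete, the sketch below computes the statistics a box plot would show from per-vehicle travel times. The helper name and the numbers are made-up placeholders for illustration only, not results from these experiments.

```python
import statistics


def summarize(values):
    """Box-plot style summary of a per-vehicle metric."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    return {
        'mean': statistics.fmean(values),
        'median': median,
        'q1': q1,
        'q3': q3,
        'max': max(values),
    }


# Placeholder travel times in seconds (NOT measured data).
placeholder_travel_times = {
    'controller_a': [40, 42, 45, 47, 50, 52, 55, 58],
    'controller_b': [20, 22, 25, 27, 30, 35, 90, 140],
}

for name, times in placeholder_travel_times.items():
    print(name, summarize(times))
```

In this made-up example both controllers have almost the same mean, but the second has a much longer tail of slow vehicles, which a single mean value would hide.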

#### Architecture

#### Graphs

Figure 2: Travel time comparison between different algorithms. We create a violin plot of the time taken by each vehicle to reach its destination. The middle dash in each violin represents the mean value. tdtse has the lowest mean value, so it is the best algorithm on the travel time metric. The worst algorithm is the random policy baseline, which does only slightly worse than the static-min baseline controller.

![title](plots/travel-time.png)

Even though static-max is quite close to the best-performing algorithms, it would not perform this well if the amount of traffic coming from the north-south direction differed from that coming from the east-west direction. By its nature, a static controller cannot adapt to such conditions.

Why do we see such a progression between the static-min, static-mid and static-max baselines?

This is because the network is run under saturated conditions. In these conditions, if you switch rapidly between phases, the time vehicles take to decelerate and accelerate when approaching and leaving the traffic light is added to the overall travel time. This is what the static-min baseline controller does, and it therefore has a very large travel time, as shown above. In contrast, the static-max baseline controller does not switch as rapidly and therefore performs quite well here, almost as well as the RL algorithms.
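A rough back-of-the-envelope calculation makes the cost of rapid switching visible. The sketch assumes that every green phase is followed by `yellow_duration` and `interphase_duration` seconds during which no approach discharges vehicles; that is a simplifying assumption about the controller, and the function names are hypothetical.

```python
def effective_green_fraction(green, yellow, interphase):
    """Share of time spent in green, given the lost time after each phase."""
    return green / (green + yellow + interphase)


def switches_per_hour(green, yellow, interphase):
    """Number of phase changes per hour for a fixed green time."""
    return 3600 / (green + yellow + interphase)


# Green times of the static-min, static-mid and static-max baselines,
# with yellow_duration = 3 and interphase_duration = 2.
for label, green in (("static-min", 10), ("static-mid", 35), ("static-max", 60)):
    fraction = effective_green_fraction(green, 3, 2)
    switches = switches_per_hour(green, 3, 2)
    print(f"{label}: green fraction {fraction:.3f}, {switches:.1f} switches/hour")
```

Under this assumption the green share rises from about 67% for static-min to about 92% for static-max, which is consistent with the ordering of the static baselines in saturated traffic.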

An important thing to notice here is that the ranking of the static baseline controllers is inverted if we run them in a low-saturation situation, as shown below. This is because when the queue lengths are small, it is more useful to

To demonstrate this, the static controllers were run in very low traffic conditions (probability of spawning every second = 0.05).

Figure 3: Queue length comparison between different algorithms.

![title](plots/queue-length.png)

Which is the better metric: travel time or queue length?

Even though rapid switching between phases decreases the queue length, it ultimately increases the travel time of vehicles. Thus the travel time of each vehicle, or the mean travel time, is a much better metric for judging the effectiveness of controllers.

discussion on graphs

### Videos

### Conclusions


Downsides of the current queue representation:
it cannot differentiate between five vehicles moving on a lane and an empty lane.
The next version would be:

Why don't we train on travel time if it is a better metric than queue length for single-intersection scenarios?
If the episode length is 1000 timesteps, the travel time metric is only accessible at the end of the episode, which is

The travel time and queue length metrics optimize for similar things in low-traffic scenarios. It is in high-traffic scenarios that they optimize for somewhat different things. In that case, an agent optimizing one quantity will do so at the expense of the other.

### Future Work

1. Establish the benchmark.
2. Extend the queue representation by learning a regressor from the loop detector data (show the data one can get from the detector).
3. Use less data by skipping frames.