connection closed by SUMO #295

Closed
BBDrive opened this issue Dec 10, 2020 · 20 comments · Fixed by #1235 · May be fixed by #619
@BBDrive

BBDrive commented Dec 10, 2020

When I run multiple instances with ray, it gives an error.

(pid=26758) ERROR:RemoteAgentBuffer:Waiting for local zoo worker to start up, retrying 0 / 3
(pid=26765) ERROR:RemoteAgentBuffer:Waiting for local zoo worker to start up, retrying 0 / 3
(pid=26767) ERROR:RemoteAgentBuffer:Waiting for local zoo worker to start up, retrying 0 / 3
(pid=26759) ERROR:RemoteAgentBuffer:Waiting for local zoo worker to start up, retrying 0 / 3
(pid=26750) ERROR:RemoteAgentBuffer:Waiting for local zoo worker to start up, retrying 0 / 3
(pid=26763) ERROR:Zoo Worker:Failure while handling connection EOFError()
(pid=26753) ERROR:RemoteAgentBuffer:Waiting for local zoo worker to start up, retrying 0 / 3
(pid=26766) ERROR:RemoteAgentBuffer:Waiting for local zoo worker to start up, retrying 0 / 3
(pid=26752) ERROR:RemoteAgentBuffer:Waiting for local zoo worker to start up, retrying 0 / 3
(pid=26759) ERROR:Zoo Worker:Failure while handling connection EOFError()
(pid=26758) ERROR:Zoo Worker:Failure while handling connection EOFError()
(pid=26765) ERROR:Zoo Worker:Failure while handling connection EOFError()
(pid=26750) ERROR:Zoo Worker:Failure while handling connection EOFError()
(pid=26767) ERROR:Zoo Worker:Failure while handling connection EOFError()
(pid=26752) ERROR:Zoo Worker:Failure while handling connection EOFError()
(pid=26766) ERROR:Zoo Worker:Failure while handling connection EOFError()
(pid=26753) ERROR:Zoo Worker:Failure while handling connection EOFError()
(pid=26760) ERROR:RemoteAgentBuffer:Waiting for local zoo worker to start up, retrying 0 / 3
(pid=26760) ERROR:Zoo Worker:Failure while handling connection EOFError()

It can still run for a few rounds, but after running for a while it crashes.

(pid=26763) ERROR:SMARTS:Simulation crashed with exception. Attempting to cleanly shutdown.
(pid=26763) ERROR:SMARTS:connection closed by SUMO
(pid=26763) Traceback (most recent call last):
(pid=26763)   File "/home/hp/anaconda3/envs/smarts/lib/python3.7/site-packages/smarts/core/smarts.py", line 170, in step
(pid=26763)     return self._step(agent_actions)
(pid=26763)   File "/home/hp/anaconda3/envs/smarts/lib/python3.7/site-packages/smarts/core/smarts.py", line 212, in _step
(pid=26763)     provider_state = self._step_providers(all_agent_actions, dt)
(pid=26763)   File "/home/hp/anaconda3/envs/smarts/lib/python3.7/site-packages/smarts/core/smarts.py", line 684, in _step_providers
(pid=26763)     provider, actions, dt, self._elapsed_sim_time
(pid=26763)   File "/home/hp/anaconda3/envs/smarts/lib/python3.7/site-packages/smarts/core/smarts.py", line 723, in _step_provider
(pid=26763)     provider_state = provider.step(provider_actions, dt, elapsed_sim_time)
(pid=26763)   File "/home/hp/anaconda3/envs/smarts/lib/python3.7/site-packages/smarts/core/sumo_traffic_simulation.py", line 305, in step
(pid=26763)     self._traci_conn.simulationStep(self._cumulative_sim_seconds)
(pid=26763)   File "/home/hp/anaconda3/envs/smarts/lib/python3.7/site-packages/traci/connection.py", line 302, in simulationStep
(pid=26763)     result = self._sendCmd(tc.CMD_SIMSTEP, None, None, "D", step)
(pid=26763)   File "/home/hp/anaconda3/envs/smarts/lib/python3.7/site-packages/traci/connection.py", line 180, in _sendCmd
(pid=26763)     return self._sendExact()
(pid=26763)   File "/home/hp/anaconda3/envs/smarts/lib/python3.7/site-packages/traci/connection.py", line 90, in _sendExact
(pid=26763)     raise FatalTraCIError("connection closed by SUMO")
(pid=26763) traci.exceptions.FatalTraCIError: connection closed by SUMO
Traceback (most recent call last):
  File "main.py", line 166, in <module>
    main(args)
  File "main.py", line 65, in main
    memory = sampler.sample(network)
  File "/home/hp/PycharmProjects/kaylen/ppo_highway/sampler_asyn.py", line 84, in sample
    for epi in ray.get(episode):
  File "/home/hp/anaconda3/envs/smarts/lib/python3.7/site-packages/ray/worker.py", line 1513, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(FatalTraCIError): ray::Environment.one_episode() (pid=26763, ip=172.31.73.204)
  File "python/ray/_raylet.pyx", line 452, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 407, in ray._raylet.execute_task.function_executor
  File "/home/hp/PycharmProjects/kaylen/ppo_highway/sampler_asyn.py", line 61, in one_episode
    new_observation, reward, done, _ = self.env.step(action[0])
  File "/home/hp/PycharmProjects/kaylen/ppo_highway/ENV/smartsEnv.py", line 41, in step
    observation, reward, done, info = self.env.step({self.AGENT_ID: action})
  File "/home/hp/anaconda3/envs/smarts/lib/python3.7/site-packages/smarts/env/hiway_env.py", line 155, in step
    observations, rewards, agent_dones, extras = self._smarts.step(agent_actions)
  File "/home/hp/anaconda3/envs/smarts/lib/python3.7/site-packages/smarts/core/smarts.py", line 170, in step
    return self._step(agent_actions)
  File "/home/hp/anaconda3/envs/smarts/lib/python3.7/site-packages/smarts/core/smarts.py", line 212, in _step
    provider_state = self._step_providers(all_agent_actions, dt)
  File "/home/hp/anaconda3/envs/smarts/lib/python3.7/site-packages/smarts/core/smarts.py", line 684, in _step_providers
    provider, actions, dt, self._elapsed_sim_time
  File "/home/hp/anaconda3/envs/smarts/lib/python3.7/site-packages/smarts/core/smarts.py", line 723, in _step_provider
    provider_state = provider.step(provider_actions, dt, elapsed_sim_time)
  File "/home/hp/anaconda3/envs/smarts/lib/python3.7/site-packages/smarts/core/sumo_traffic_simulation.py", line 305, in step
    self._traci_conn.simulationStep(self._cumulative_sim_seconds)
  File "/home/hp/anaconda3/envs/smarts/lib/python3.7/site-packages/traci/connection.py", line 302, in simulationStep
    result = self._sendCmd(tc.CMD_SIMSTEP, None, None, "D", step)
  File "/home/hp/anaconda3/envs/smarts/lib/python3.7/site-packages/traci/connection.py", line 180, in _sendCmd
    return self._sendExact()
  File "/home/hp/anaconda3/envs/smarts/lib/python3.7/site-packages/traci/connection.py", line 90, in _sendExact
    raise FatalTraCIError("connection closed by SUMO")
traci.exceptions.FatalTraCIError: connection closed by SUMO
/home/hp/anaconda3/envs/smarts/lib/python3.7/subprocess.py:883: ResourceWarning: subprocess 26716 is still running
  ResourceWarning, source=self)
ResourceWarning: Enable tracemalloc to get the object allocation traceback
@Gamenot Gamenot self-assigned this Dec 11, 2020
@Gamenot
Collaborator

Gamenot commented Dec 11, 2020

Hello, could you give a few more details on what you were running to cause this? Does this occur for you with the examples/rllib.py example?

@BBDrive
Author

BBDrive commented Dec 14, 2020

I ran the following code.

import ray
import gym

from smarts.core.agent_interface import AgentInterface, AgentType
from smarts.core.agent import AgentSpec, Agent


class SimpleAgent(Agent):
    def act(self, obs):
        return "keep_lane"

@ray.remote
class Environment:
    def __init__(self):
        self.AGENT_ID = "Agent-007"
        agent_spec = AgentSpec(
            interface=AgentInterface.from_type(AgentType.Laner, max_episode_steps=1000),
            agent_builder=SimpleAgent,
        )
        self.env = gym.make(
            "smarts.env:hiway-v0",
            scenarios=["/home/hp/SMARTS/scenarios/loop"],
            agent_specs={self.AGENT_ID: agent_spec},
        )
        self.agent = agent_spec.build_agent()

    def sample(self):
        observations = self.env.reset()

        while True:
            agent_action = self.agent.act(observations[self.AGENT_ID])
            observations, reward, done, _ = self.env.step({self.AGENT_ID:agent_action})
            if done[self.AGENT_ID]:
                break
        return 1  # return sampled trajectory


def train(trajectory):
    return 0


if __name__ == '__main__':
    ray.init()
    cpu = 8
    environment = [Environment.remote() for _ in range(cpu)]
    for i in range(100000):
        trajectory = ray.get([env.sample.remote() for env in environment])
        train(trajectory)
        print("Episode:%d" % i)
    ray.shutdown()

The console output is

(pid=20551) pybullet build time: Nov 26 2020 23:08:25
(pid=20549) pybullet build time: Nov 26 2020 23:08:25
(pid=20545) pybullet build time: Nov 26 2020 23:08:25
(pid=20539) pybullet build time: Nov 26 2020 23:08:25
(pid=20552) pybullet build time: Nov 26 2020 23:08:25
(pid=20559) pybullet build time: Nov 26 2020 23:08:25
(pid=20553) pybullet build time: Nov 26 2020 23:08:25
(pid=20562) pybullet build time: Nov 26 2020 23:08:25
(pid=20539) ERROR:RemoteAgentBuffer:Waiting for local zoo worker to start up, retrying 0 / 3
(pid=20545) ERROR:RemoteAgentBuffer:Waiting for local zoo worker to start up, retrying 0 / 3
(pid=20552) ERROR:RemoteAgentBuffer:Waiting for local zoo worker to start up, retrying 0 / 3
(pid=20553) ERROR:RemoteAgentBuffer:Waiting for local zoo worker to start up, retrying 0 / 3
(pid=20549) ERROR:RemoteAgentBuffer:Waiting for local zoo worker to start up, retrying 0 / 3
(pid=20551) ERROR:RemoteAgentBuffer:Waiting for local zoo worker to start up, retrying 0 / 3
(pid=20562) ERROR:RemoteAgentBuffer:Waiting for local zoo worker to start up, retrying 0 / 3
(pid=20559) ERROR:RemoteAgentBuffer:Waiting for local zoo worker to start up, retrying 0 / 3
(pid=20539) ERROR:Zoo Worker:Failure while handling connection EOFError()
(pid=20545) ERROR:Zoo Worker:Failure while handling connection EOFError()
(pid=20552) ERROR:Zoo Worker:Failure while handling connection EOFError()
(pid=20553) ERROR:Zoo Worker:Failure while handling connection EOFError()
(pid=20549) ERROR:Zoo Worker:Failure while handling connection EOFError()
(pid=20551) ERROR:Zoo Worker:Failure while handling connection EOFError()
(pid=20562) ERROR:Zoo Worker:Failure while handling connection EOFError()
(pid=20559) ERROR:Zoo Worker:Failure while handling connection EOFError()
Episode:0
Episode:1
...
Episode:60
Episode:61
(pid=20553) 2000 -> Problem solution failed (solver error)
(pid=20549) 2000 -> Problem solution failed (solver error)
(pid=20545) 2000 -> Problem solution failed (solver error)
(pid=20551) 2000 -> Problem solution failed (solver error)
(pid=20559) 2000 -> Problem solution failed (solver error)
(pid=20562) 2000 -> Problem solution failed (solver error)
(pid=20539) 2000 -> Problem solution failed (solver error)
(pid=20552) 2000 -> Problem solution failed (solver error)
Episode:62
Episode:63
...
Episode:80
Episode:81
(pid=20553) 2000 -> Problem solution failed (solver error)
(pid=20545) 2000 -> Problem solution failed (solver error)
(pid=20552) 2000 -> Problem solution failed (solver error)
(pid=20553) 2000 -> Problem solution failed (solver error)
(pid=20562) 2000 -> Problem solution failed (solver error)
(pid=20545) 2000 -> Problem solution failed (solver error)
(pid=20552) 2000 -> Problem solution failed (solver error)
(pid=20562) 2000 -> Problem solution failed (solver error)
Episode:82
Episode:83
...
Episode:366
Episode:367
(pid=20552) ERROR:SMARTS:Simulation crashed with exception. Attempting to cleanly shutdown.
(pid=20552) ERROR:SMARTS:connection closed by SUMO
(pid=20552) Traceback (most recent call last):
(pid=20552)   File "/home/hp/anaconda3/envs/smarts/lib/python3.7/site-packages/smarts/core/smarts.py", line 170, in step
(pid=20552)     return self._step(agent_actions)
(pid=20552)   File "/home/hp/anaconda3/envs/smarts/lib/python3.7/site-packages/smarts/core/smarts.py", line 212, in _step
(pid=20552)     provider_state = self._step_providers(all_agent_actions, dt)
(pid=20552)   File "/home/hp/anaconda3/envs/smarts/lib/python3.7/site-packages/smarts/core/smarts.py", line 684, in _step_providers
(pid=20552)     provider, actions, dt, self._elapsed_sim_time
(pid=20552)   File "/home/hp/anaconda3/envs/smarts/lib/python3.7/site-packages/smarts/core/smarts.py", line 723, in _step_provider
(pid=20552)     provider_state = provider.step(provider_actions, dt, elapsed_sim_time)
(pid=20552)   File "/home/hp/anaconda3/envs/smarts/lib/python3.7/site-packages/smarts/core/sumo_traffic_simulation.py", line 305, in step
(pid=20552)     self._traci_conn.simulationStep(self._cumulative_sim_seconds)
(pid=20552)   File "/home/hp/anaconda3/envs/smarts/lib/python3.7/site-packages/traci/connection.py", line 302, in simulationStep
(pid=20552)     result = self._sendCmd(tc.CMD_SIMSTEP, None, None, "D", step)
(pid=20552)   File "/home/hp/anaconda3/envs/smarts/lib/python3.7/site-packages/traci/connection.py", line 180, in _sendCmd
(pid=20552)     return self._sendExact()
(pid=20552)   File "/home/hp/anaconda3/envs/smarts/lib/python3.7/site-packages/traci/connection.py", line 90, in _sendExact
(pid=20552)     raise FatalTraCIError("connection closed by SUMO")
(pid=20552) traci.exceptions.FatalTraCIError: connection closed by SUMO
Traceback (most recent call last):
  File "test.py", line 47, in <module>
    trajectory = ray.get([env.sample.remote() for env in environment])
  File "/home/hp/anaconda3/envs/smarts/lib/python3.7/site-packages/ray/worker.py", line 1513, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(FatalTraCIError): ray::Environment.sample() (pid=20552, ip=172.31.73.204)
  File "python/ray/_raylet.pyx", line 452, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 407, in ray._raylet.execute_task.function_executor
  File "test.py", line 32, in sample
    observations, reward, done, _ = self.env.step({self.AGENT_ID:agent_action})
  File "/home/hp/anaconda3/envs/smarts/lib/python3.7/site-packages/smarts/env/hiway_env.py", line 155, in step
    observations, rewards, agent_dones, extras = self._smarts.step(agent_actions)
  File "/home/hp/anaconda3/envs/smarts/lib/python3.7/site-packages/smarts/core/smarts.py", line 170, in step
    return self._step(agent_actions)
  File "/home/hp/anaconda3/envs/smarts/lib/python3.7/site-packages/smarts/core/smarts.py", line 212, in _step
    provider_state = self._step_providers(all_agent_actions, dt)
  File "/home/hp/anaconda3/envs/smarts/lib/python3.7/site-packages/smarts/core/smarts.py", line 684, in _step_providers
    provider, actions, dt, self._elapsed_sim_time
  File "/home/hp/anaconda3/envs/smarts/lib/python3.7/site-packages/smarts/core/smarts.py", line 723, in _step_provider
    provider_state = provider.step(provider_actions, dt, elapsed_sim_time)
  File "/home/hp/anaconda3/envs/smarts/lib/python3.7/site-packages/smarts/core/sumo_traffic_simulation.py", line 305, in step
    self._traci_conn.simulationStep(self._cumulative_sim_seconds)
  File "/home/hp/anaconda3/envs/smarts/lib/python3.7/site-packages/traci/connection.py", line 302, in simulationStep
    result = self._sendCmd(tc.CMD_SIMSTEP, None, None, "D", step)
  File "/home/hp/anaconda3/envs/smarts/lib/python3.7/site-packages/traci/connection.py", line 180, in _sendCmd
    return self._sendExact()
  File "/home/hp/anaconda3/envs/smarts/lib/python3.7/site-packages/traci/connection.py", line 90, in _sendExact
    raise FatalTraCIError("connection closed by SUMO")
traci.exceptions.FatalTraCIError: connection closed by SUMO
/home/hp/anaconda3/envs/smarts/lib/python3.7/subprocess.py:883: ResourceWarning: subprocess 20493 is still running
  ResourceWarning, source=self)
ResourceWarning: Enable tracemalloc to get the object allocation traceback

@BBDrive
Author

BBDrive commented Dec 14, 2020

I'm looking forward to your answer. Thanks.

@Gamenot
Collaborator

Gamenot commented Dec 15, 2020

Hello, sorry for the late reply. I have done some testing for this error and the solution is unclear but the issue is reproducible.

Traceback (most recent call last):
  File "examples/rllib_problem.py", line 47, in <module>
    trajectory = ray.get([env.sample.remote() for env in environment])
  File ".../SMARTS/.venv/lib/python3.7/site-packages/ray/worker.py", line 1506, in get
    values = worker.get_objects(object_ids, timeout=timeout)
  File ".../SMARTS/.venv/lib/python3.7/site-packages/ray/worker.py", line 312, in get_objects
    return self.deserialize_objects(data_metadata_pairs, object_ids)
  File ".../SMARTS/.venv/lib/python3.7/site-packages/ray/worker.py", line 280, in deserialize_objects
    return context.deserialize_objects(data_metadata_pairs, object_ids)
  File ".../SMARTS/.venv/lib/python3.7/site-packages/ray/serialization.py", line 323, in deserialize_objects
    self._deserialize_object(data, metadata, object_id))
  File ".../SMARTS/.venv/lib/python3.7/site-packages/ray/serialization.py", line 284, in _deserialize_object
    obj = self._deserialize_pickle5_data(data)
  File ".../SMARTS/.venv/lib/python3.7/site-packages/ray/serialization.py", line 262, in _deserialize_pickle5_data
    obj = pickle.loads(in_band)
ModuleNotFoundError: No module named 'traci'

It looks like one of the ray workers was unable to import traci, and shortly afterwards the traci connection also failed. This is not an issue we have seen before, so it might take some time to resolve.
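
One way to narrow this down is to check, inside each worker process, whether the modules SMARTS needs can be located at all. The following is a minimal sketch using only the standard library; `missing_modules` is a hypothetical helper (not part of SMARTS or ray), and where it gets called, for example at the top of the actor's `__init__`, is an assumption:

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found by the import system."""
    missing = []
    for name in names:
        # find_spec returns None when a top-level module cannot be located.
        if importlib.util.find_spec(name) is None:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # "traci" is the module named in the traceback above; "json" is always present.
    print(missing_modules(["json", "traci"]))
```

If a worker reports `traci` as missing while the driver process does not, the two processes are probably not seeing the same Python environment or `sys.path`.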

@Gamenot
Collaborator

Gamenot commented Dec 15, 2020

May have found the potential cause: #313

@Gamenot Gamenot added this to the 0.5 milestone Dec 16, 2020
@Gamenot Gamenot added this to To do in SMARTS v0.4.10 Dec 16, 2020
@JianmingTONG

Hi Gamenot. I also face the same problem. Is there any progress on this problem?

@Gamenot
Collaborator

Gamenot commented Dec 22, 2020

Hello, @JianmingTONG, there is some good progress on #331. This is a fairly critical fix and should solve both the error messages and the SUMO connection issue.

@Gamenot Gamenot added this to To do in SMARTS v0.4.11 Dec 23, 2020
@JianmingTONG

JianmingTONG commented Dec 23, 2020

Thanks for the reply, @Gamenot. I have tried the version that seems to solve the SUMO issue, i.e. commit 2a45972. However, the "connection closed by SUMO" error still occurs when I run the following commands. Note: I changed the number of episodes from 10 to 1000000 to test the scenario without launching the training process.

# terminal 1
$ scl envision start -s ./scenarios -p 8081

# terminal 2
$ python examples/single_agent.py benchmark/scenarios/two_ways/bid

# terminal 3
$ python examples/single_agent.py benchmark/scenarios/two_ways/bid_sv

PS: Both terminal 2 and terminal 3 died at iteration 8185.

@Adaickalavan
Member

Hi @JianmingTONG, we are aware of two separate shortcomings in the implementation of (i) ray and (ii) remote agents. We are currently looking into both. Unfortunately, #331 is not ready for use yet.

@JianmingTONG

Hi @Adaickalavan @Gamenot, I see that #366 has been closed. Has the issue been solved?

Thanks, wish you a happy new year.

@Gamenot
Collaborator

Gamenot commented Jan 4, 2021

@JianmingTONG Happy new year, and thank you. It looks like the problem has been addressed; however, we are still testing to make sure it is in fact solved.

@Gamenot
Collaborator

Gamenot commented Jan 4, 2021

As for use with ray, we have found that it is important to call env.close() explicitly inside each ray actor; otherwise some resources may be left over, which can prevent some ray workers from exiting properly.

A modified example is as follows:

import gym
import ray

from smarts.core.agent import Agent, AgentSpec
from smarts.core.agent_interface import AgentInterface, AgentType


class SimpleAgent(Agent):
    def act(self, obs):
        return "keep_lane"


@ray.remote
class Environment:
    def __init__(self):
        self.AGENT_ID = "Agent-007"
        agent_spec = AgentSpec(
            interface=AgentInterface.from_type(AgentType.Laner, max_episode_steps=1000),
            agent_builder=SimpleAgent,
        )
        self.env = gym.make(
            "smarts.env:hiway-v0",
            scenarios=["scenarios/loop"],
            agent_specs={self.AGENT_ID: agent_spec},
            headless=True,
        )
        self.agent = agent_spec.build_agent()

    def sample(self):
        observations = self.env.reset()

        while True:
            agent_action = self.agent.act(observations[self.AGENT_ID])
            observations, reward, done, _ = self.env.step({self.AGENT_ID: agent_action})
            if done[self.AGENT_ID]:
                break

        return 1  # return sampled trajectory

    # Should be called when the environment is no longer needed
    def close(self):
        self.env.close()


def train(trajectories):
    # Placeholder for the training step, as in the original report
    return 0


if __name__ == "__main__":
    num_cpus = 2
    ray.init(num_cpus=num_cpus)
    environments = [Environment.remote() for _ in range(num_cpus)]
    try:
        for i in range(10000):
            futures = [env.sample.remote() for env in environments]
            trajectories = []
            for env, f in zip(environments, futures):
                trajectories.append(ray.get([f]))
            train(trajectories)
            print("Episode:%d" % i)
    finally:
        close_futures = [env.close.remote() for env in environments]
        ray.get(close_futures)
        ray.shutdown()

What this means specifically is that the underlying smarts instance needs to be disposed:

def close(self):
    if self._smarts is not None:
        self._smarts.destroy()
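
The same idea can be sketched without ray or SMARTS. In the snippet below, `FakeSimulation` and `EnvWrapper` are hypothetical stand-ins (not SMARTS classes); they only illustrate an idempotent `close()` combined with `contextlib.closing`, so that the underlying instance is destroyed even if stepping raises:

```python
from contextlib import closing


class FakeSimulation:
    """Hypothetical stand-in for the underlying simulation instance."""

    def __init__(self):
        self.destroyed = False

    def destroy(self):
        self.destroyed = True


class EnvWrapper:
    """Hypothetical environment wrapper that owns one simulation."""

    def __init__(self):
        self._sim = FakeSimulation()

    def step(self, fail=False):
        if fail:
            raise RuntimeError("simulated crash")
        return "ok"

    def close(self):
        # Safe to call more than once: the reference is dropped after destroy().
        if self._sim is not None:
            self._sim.destroy()
            self._sim = None


def run_once(env, fail=False):
    """Step the env once and always close it; return the result, or None on a crash."""
    with closing(env):
        try:
            return env.step(fail=fail)
        except RuntimeError:
            return None
```

Because `close()` drops its reference after destroying the simulation, calling it a second time (for instance from both a `finally` block and an `atexit` hook) does nothing.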

@Adaickalavan
Member

Adaickalavan commented Jan 4, 2021

Hi @JianmingTONG , happy new year.

It appears that we have fixed this issue alongside other distributed computing issues (#331).

I have verified that the code runs successfully to completion when the commands below are executed.

Run in terminal 1:

$ cd /path/to/repository/SMARTS/
$ scl scenario build --clean ./benchmark/scenarios/two_ways/bid
$ scl scenario build --clean ./benchmark/scenarios/two_ways/bid_sv
$ scl envision start -s ./scenarios -p 8081

See the visualization in a browser at http://localhost:8081/.
Run in terminal 2:

$ python3.7 ./examples/single_agent.py benchmark/scenarios/two_ways/bid_sv --episodes 10000

Run in terminal 3:

$ python3.7 ./examples/single_agent.py benchmark/scenarios/two_ways/bid --episodes 10000

Going forward, please

  • pull the latest SMARTS code from the main branch,
  • set up your Python virtual environment, and
  • re-run pip install -r requirements.txt.

@Adaickalavan
Member

Adaickalavan commented Jan 4, 2021

To summarize, the problem is broken down to two parts:

@Gamenot Gamenot reopened this Jan 4, 2021
@Gamenot Gamenot added this to To do in SMARTS v0.4.12 Jan 4, 2021
@JianmingTONG

JianmingTONG commented Jan 5, 2021

Hi, I followed the instructions above to launch the example evaluation. However, it fails with the following error.

│       4088/1000000 │               3.58 │                 25 │              35.79 │             bid_sv │          1.rou.xml │ 351907917703150455 │  60.96 - Agent-007 │
│       4089/1000000 │               3.72 │                 39 │              37.19 │             bid_sv │          2.rou.xml │ 351907917703150455 │  93.60 - Agent-007 │
╰────────────────────┴────────────────────┴────────────────────┴────────────────────┴────────────────────┴────────────────────┴────────────────────┴────────────────────╯
Traceback (most recent call last):
  File "./examples/single_agent.py", line 82, in <module>
    seed=args.seed,
  File "./examples/single_agent.py", line 60, in main
    observations = env.reset()
  File "/media/nics/Data/SMARTS/smarts/env/hiway_env.py", line 189, in reset
    env_observations = self._smarts.reset(scenario)
  File "/media/nics/Data/SMARTS/smarts/core/smarts.py", line 306, in reset
    self.setup(scenario)
  File "/media/nics/Data/SMARTS/smarts/core/smarts.py", line 353, in setup
    provider_state = self._setup_providers(self._scenario)
  File "/media/nics/Data/SMARTS/smarts/core/smarts.py", line 643, in _setup_providers
    provider_state.merge(provider.setup(scenario))
  File "/media/nics/Data/SMARTS/smarts/core/sumo_traffic_simulation.py", line 249, in setup
    [tc.VAR_DEPARTED_VEHICLES_IDS, tc.VAR_ARRIVED_VEHICLES_IDS]
  File "/home/nics/Package/sumo/tools/traci/_simulation.py", line 440, in subscribe
    Domain.subscribe(self, "", varIDs, begin, end)
  File "/home/nics/Package/sumo/tools/traci/domain.py", line 208, in subscribe
    self._connection._subscribe(self._subscribeID, begin, end, objectID, varIDs)
  File "/home/nics/Package/sumo/tools/traci/connection.py", line 231, in _subscribe
    result = self._sendCmd(cmdID, (begin, end), objID, format, *args)
  File "/home/nics/Package/sumo/tools/traci/connection.py", line 178, in _sendCmd
    return self._sendExact()
  File "/home/nics/Package/sumo/tools/traci/connection.py", line 88, in _sendExact
    raise FatalTraCIError("connection closed by SUMO")
traci.exceptions.FatalTraCIError: connection closed by SUMO
ERROR:RemoteAgentBuffer:Exception while tearing down buffered remote agent. ValueError('Cannot invoke RPC on closed channel!')
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/media/nics/Data/SMARTS/smarts/core/remote_agent_buffer.py", line 109, in destroy
    raise e
  File "/media/nics/Data/SMARTS/smarts/core/remote_agent_buffer.py", line 104, in destroy
    remote_agent.terminate()
  File "/media/nics/Data/SMARTS/smarts/core/remote_agent.py", line 88, in terminate
    manager_pb2.Port(num=self._worker_address[1])
  File "/home/nics/venv/python37_smarts_1_5/lib/python3.7/site-packages/grpc/_channel.py", line 825, in __call__
    wait_for_ready, compression)
  File "/home/nics/venv/python37_smarts_1_5/lib/python3.7/site-packages/grpc/_channel.py", line 812, in _blocking
    ),), self._context)
  File "src/python/grpcio/grpc/_cython/_cygrpc/channel.pyx.pxi", line 498, in grpc._cython.cygrpc.Channel.segregated_call
  File "src/python/grpcio/grpc/_cython/_cygrpc/channel.pyx.pxi", line 353, in grpc._cython.cygrpc._segregated_call
  File "src/python/grpcio/grpc/_cython/_cygrpc/channel.pyx.pxi", line 357, in grpc._cython.cygrpc._segregated_call
ValueError: Cannot invoke RPC on closed channel!

I have also tested some other scenarios, with the following results:
[Image: test results; the last column is the number of episodes]

Might I request your help to fix it?

@Gamenot
Collaborator

Gamenot commented Jan 6, 2021

I have found another potential source of the crash when going through the crash report from running the example I provided. I am hoping we can do something about this without going into SUMO code.

StacktraceTop:
 MSLCHelper::getRoundaboutDistBonus(MSVehicle const&, double, MSVehicle::LaneQ const&, MSVehicle::LaneQ const&, MSVehicle::LaneQ const&) ()
 MSLCM_LC2013::_wantsChange(int, MSAbstractLaneChangeModel::MSLCMessager&, int, std::pair<MSVehicle*, double> const&, std::pair<MSVehicle*, double> const&, std::pair<MSVehicle*, double> const&, MSLane const&, std::vector<MSVehicle::LaneQ, std::allocator<MSVehicle::LaneQ> > const&, MSVehicle**, MSVehicle**) ()
 MSLaneChanger::checkChange(int, MSLane const*, std::pair<MSVehicle* const, double> const&, std::pair<MSVehicle* const, double> const&, std::pair<MSVehicle* const, double> const&, std::vector<MSVehicle::LaneQ, std::allocator<MSVehicle::LaneQ> > const&) const ()
 MSLaneChanger::checkChangeWithinEdge(int, std::pair<MSVehicle* const, double> const&, std::vector<MSVehicle::LaneQ, std::allocator<MSVehicle::LaneQ> > const&) const ()
 MSLaneChanger::change() ()
Tags: bionic third-party-packages
ThreadStacktrace:
 .
 Thread 1 (Thread 0x7fdb69499780 (LWP 14606)):
 #0  0x0000557a1be90151 in MSLCHelper::getRoundaboutDistBonus(MSVehicle const&, double, MSVehicle::LaneQ const&, MSVehicle::LaneQ const&, MSVehicle::LaneQ const&) ()
 No symbol table info available.
 #1  0x0000557a1be7d29e in MSLCM_LC2013::_wantsChange(int, MSAbstractLaneChangeModel::MSLCMessager&, int, std::pair<MSVehicle*, double> const&, std::pair<MSVehicle*, double> const&, std::pair<MSVehicle*, double> const&, MSLane const&, std::vector<MSVehicle::LaneQ, std::allocator<MSVehicle::LaneQ> > const&, MSVehicle**, MSVehicle**) ()
 No symbol table info available.
 #2  0x0000557a1bcdc041 in MSLaneChanger::checkChange(int, MSLane const*, std::pair<MSVehicle* const, double> const&, std::pair<MSVehicle* const, double> const&, std::pair<MSVehicle* const, double> const&, std::vector<MSVehicle::LaneQ, std::allocator<MSVehicle::LaneQ> > const&) const ()
 No symbol table info available.
 #3  0x0000557a1bcdd367 in MSLaneChanger::checkChangeWithinEdge(int, std::pair<MSVehicle* const, double> const&, std::vector<MSVehicle::LaneQ, std::allocator<MSVehicle::LaneQ> > const&) const ()
 No symbol table info available.
 #4  0x0000557a1bce0538 in MSLaneChanger::change() ()
 No symbol table info available.
 #5  0x0000557a1bcdae19 in MSLaneChanger::laneChange(long long) ()
 No symbol table info available.
 #6  0x0000557a1bcaad7c in MSEdgeControl::changeLanes(long long) ()
 No symbol table info available.
 #7  0x0000557a1bc07a8e in MSNet::simulationStep() ()
 No symbol table info available.
 #8  0x0000557a1bc080a6 in MSNet::simulate(long long, long long) ()
 No symbol table info available.
 #9  0x0000557a1bbf0c4d in main ()
 No symbol table info available.
Title: sumo crashed with SIGSEGV in MSLCHelper::getRoundaboutDistBonus()
UnreportableReason:
 You have some obsolete package versions installed. Please upgrade the following packages and check if the problem still occurs:
 
 libp11-kit0
UpgradeStatus: No upgrade log present (probably fresh install)
_MarkForUpload: True

The cause looks like it might be from changes in the 1.7.0 release of SUMO. I am unsure why this getRoundaboutDistBonus() method is being called since there is not a roundabout in the loop scenario.

@Gamenot Gamenot moved this from To do to In progress in SMARTS v0.4.12 Jan 11, 2021
@Gamenot
Collaborator

Gamenot commented Jan 11, 2021

SUMO connection closed

  • Fix the sumo error logs (works with sumo-gui)
  • Tie SUMO version to the dependency list
  • Compile debug version of SUMO
  • Dev fork of SUMO (if problems are fixed)
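
Tying SMARTS to a known SUMO version could start with a check like the one below. This is a sketch under assumptions: `parse_version` and `is_supported` are hypothetical helpers, the sample version string is illustrative, and the upper bound of 1.7.0 only reflects the suspicion voiced earlier in this thread, not a confirmed compatibility rule:

```python
import re


def parse_version(text):
    """Extract the first 'X.Y.Z' triple from text as a tuple of ints, or None."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def is_supported(text, below=(1, 7, 0)):
    """Return True if the version found in text is strictly lower than `below`."""
    version = parse_version(text)
    return version is not None and version < below


if __name__ == "__main__":
    # Illustrative input; the real string would come from the installed SUMO.
    print(is_supported("SUMO Version 1.6.0"))
```

The input could be the text printed by `sumo --version`; its exact format is not shown in this thread, so the regular expression only looks for the first dotted triple.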

@Gamenot Gamenot removed this from In progress in SMARTS v0.4.12 Jan 11, 2021
@Gamenot Gamenot added this to In progress in SMARTS v0.4.13 Jan 18, 2021
@Adaickalavan Adaickalavan removed this from To do in SMARTS v0.4.10 Jan 19, 2021
@Adaickalavan Adaickalavan removed this from To do in SMARTS v0.4.11 Jan 19, 2021
@Adaickalavan
Member

Hi @JianmingTONG,

Given the occurrence of the traci.exceptions.FatalTraCIError: connection closed by SUMO error, could you try running all your commands and experiments inside a docker container and report back here whether the error still occurs?

I think the error does not happen when SMARTS is run inside a docker container.

$ docker run --rm -it --network=host huaweinoah/smarts:v0.4.12

Do not map the source code using -v $PWD:/src when running the docker container.

@Gamenot Gamenot modified the milestones: 0.5, Backlog Jan 27, 2021
This was linked to pull requests Mar 3, 2021
@dineshresearch

I am also currently facing the same issue. After 10 million training steps, the training process gets killed. I am using SMARTS version 0.4.16.

@Gamenot @Adaickalavan @JianmingTONG @BBDrive Is the issue fixed? If so, can you please mention the pull request that fixes it?

Also, would moving to version 0.4.18 or any other branch solve this issue? If so, could you mention the branch that I can use?

@Adaickalavan
Member

Hi @dineshresearch,

Unfortunately, the traci.exceptions.FatalTraCIError: connection closed by SUMO error, which originates from SUMO, is not solved yet.

For the time being, if you do not need background traffic vehicles, you may consider setting traffic_sim=None when instantiating SMARTS. This sidesteps the error, but removes background traffic vehicles.

class SMARTS:
    def __init__(
        self,
        agent_interfaces,
        traffic_sim: SumoTrafficSimulation,
        envision: EnvisionClient = None,
        visdom: VisdomClient = None,
        timestep_sec=0.1,
        reset_agents_only=False,
        zoo_addrs=None,
    ):
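
If background traffic is needed, another stopgap (not proposed in this thread, so treat it as an assumption) is to rebuild the environment when an episode crashes instead of letting the whole run die. The sketch below uses hypothetical names throughout: `make_env` is any factory returning an object with `reset()`, `step()` and `close()`, and `FlakyEnv` is a stand-in used only to exercise the logic. With SMARTS, the exception to catch would presumably be `traci.exceptions.FatalTraCIError`:

```python
def run_episodes(make_env, num_episodes, max_rebuilds=3, crash_type=Exception):
    """Run episodes, rebuilding the env after a crash, at most max_rebuilds times."""
    env = make_env()
    completed, rebuilds = 0, 0
    try:
        while completed < num_episodes:
            try:
                env.reset()
                done = False
                while not done:
                    done = env.step()
                completed += 1
            except crash_type:
                if rebuilds >= max_rebuilds:
                    raise
                rebuilds += 1
                env.close()       # release whatever the crashed env still holds
                env = make_env()  # start again from a fresh simulation
    finally:
        env.close()
    return completed, rebuilds


class FlakyEnv:
    """Hypothetical env: the first instance crashes during its second episode."""

    instances = 0

    def __init__(self):
        FlakyEnv.instances += 1
        self._crashy = FlakyEnv.instances == 1
        self._episodes = 0
        self._steps = 0

    def reset(self):
        self._episodes += 1
        self._steps = 0

    def step(self):
        if self._crashy and self._episodes == 2:
            raise ConnectionError("connection closed (simulated)")
        self._steps += 1
        return self._steps >= 3  # an episode ends after three steps

    def close(self):
        pass
```

The crashed episode is discarded and retried on the new environment, so the loop still finishes the requested number of episodes. This only hides the underlying SUMO crash; it does not fix it.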

@Gamenot Gamenot added this to To do in SMARTS v0.5.0 Dec 28, 2021
@Gamenot Gamenot removed this from In progress in SMARTS v0.4.13 Dec 28, 2021
@Gamenot Gamenot mentioned this issue Dec 28, 2021
1 task
@Gamenot Gamenot moved this from To do to In Review in SMARTS v0.5.0 Dec 28, 2021
@Gamenot Gamenot moved this from In Review to Done in SMARTS v0.5.0 Jan 10, 2022
@Gamenot Gamenot closed this as completed Jan 10, 2022
@Gamenot Gamenot linked a pull request Jan 11, 2022 that will close this issue