
Planetlab deploy, fall 2013

HU, Pili edited this page Sep 25, 2013 · 27 revisions

Experiment Description

This section provides the most up-to-date experiment description. We will add more details about the traffic and node resources if requested by PL admins or site admins. We are planning a series of experiments; with the data collected from earlier runs, we can give more accurate resource estimates.

General facts:

  • Our slice name: cuhk_ie_01
  • Slice user: Pili Hu, hupili [at] ie [dot] cuhk [dot] edu [dot] hk
  • Machines: we added 650+ machines with status "boot"; after building and various checks, about 500 machines are used in the experiment.
  • Duration: Sept and Oct 2013. We expect 5-10 experiments; each runs for 1-7 days.
  • During the deployment stage, there is bulk traffic from the 500 PL nodes to http://snsapi.ie.cuhk.edu.hk. We limit concurrent downloads to 10.

Expected resource consumption in running stage:

  • Mild CPU usage. (exact number to be added later)
  • ~200MB memory
  • Listen on port 5900 (may be modified later)
  • ~2.5 HTTP request/response pairs per second per machine; no keep-alive.
  • Traffic is all within PlanetLab.
  • Each machine is expected to initiate connections to almost all of the other ~500 machines (number of destinations = 12 simulated users per machine * 60 average degree = 720, more than the number of machines).
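The request-rate bullet above can be reproduced from the data summary quoted later on this page (12 users per machine, 370K edges over 6K nodes, one poll per feed every 5 minutes):

```shell
# ~2.5 request/response pairs per second per machine:
# 12 users/machine * (370000 edges / 6000 nodes) feeds/user / 300 s
rate=$(awk 'BEGIN { printf "%.1f", 12 * (370000 / 6000) / (60 * 5) }')
echo "$rate requests/s per machine"
```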

Current Status

  • Sept 25, 2013 12:05 HKT. The series of experiments has finished. No further experiments are planned in the near term.
  • Sept 17, 2013 23:15 HKT. We are running a series of experiments continuously this week. Each experiment takes about 2 hours of deployment time and 1 hour of running time.
  • Sept 13, 2013 16:10 HKT. The 16th exp runs from 15:30 for about 12 hours.
  • Sept 13, 2013 01:40 HKT. The 15th exp runs from 00:50 for about 12 hours.
  • Sept 12, 2013 00:25 HKT. The 13th experiment will run starting from 00:30 for about 12 hours. Configuration is the same as exp12. Server load will be more uniform.
  • Sept 11, 2013 15:55 HKT. The 12th experiment will run starting from 16:10. Same configuration as exp11, except that the data set is trimmed smaller.
  • Sept 10, 2013 23:00 HKT. The 11th experiment will run from 23:15 for about 12 hours. HTTP request frequency is the same as exp9 and exp10.
  • Sept 10, 2013 23:00 HKT. Just finished exp10, the settings are the same as exp9.
  • Sept 10, 2013 15:00 HKT. The 9th experiment will run between 16:00~18:00. Most configurations are the same, except that the number of HTTP requests will increase threefold.
  • Sept 9, 2013 16:40 HKT. The 8th experiment is running. Configuration is similar to previous ones, with unused data removed.
  • Sept 7, 2013 19:55 HKT. The 5th exp starts running.
  • Sept 7, 2013 13:50 HKT. The 4th exp was run during the past one day.
  • Sept 5, 2013 11:00 HKT. The 3rd exp is terminated.
  • Sept 3, 2013 20:40 HKT. We started to distribute data for the 3rd round. The settings are almost the same as the previous two experiments. Forwarding data is filtered according to the output of a local simulator. Resource consumption is similar to previous two experiments.
  • Sept 3, 2013 15:40 HKT. The 2nd experiment finished running at about 10 am today. All bots are killed.
  • Sept 2, 2013 10:40 HKT. We are running the 2nd experiment. The setup is the same as the 1st one. This experiment is for validating variance in real deployment.
  • Sept 1, 2013 15:50 HKT. Currently all bots are killed. We just finished the first experiment and are doing data analysis.

Quoting my email to PL support (Aug 31, 2013):

Dear admin,

We are currently running an experiment on PlanetLab for Decentralized Social Networks. We have built a middleware that bridges all kinds of social networking services, unifying their interfaces and data structures. We hope it will be a viable solution to the migration problem from centralized to decentralized services. The middleware supports some non-conventional social networking platforms like RSS and Email. The 2nd stage we envision for the migration process is to let SNSAPI users connect to each other directly through those platforms. In this experiment, we built a backbone using the RSS platform of SNSAPI. We replay traces from real social networks to evaluate its efficiency.

The brief idea is in our wiki page of SNSAPI:

https://github.com/hupili/snsapi/wiki/Rsoc

The data summary:

  • 6K nodes from Sina Weibo (a microblogging service in China).
  • 370K edges.
  • Each node (user) consumes about 15M memory.
  • On average, 6000 nodes / 500 machines = 12 users per machine. The memory consumption is below 200M in most cases.
  • Due to random assignment, some machines may hold slightly more instances.

Resource summary:

  • 200M memory on average;
  • mild CPU usage;
  • About 2.5 HTTP request/ response per second per machine on average. (12 * (370000.0 / 6000) / (60 * 5))
  • Duration: the current experiment runs for 1.5 days (0.5 days of deployment time + 1 day of trace-replaying time).

Killing a few (e.g. < 10) slow PL nodes should not affect the overall experiment, because we also expect robustness from the system. Please keep our experiment running unless the node is seriously short of resources. We are constantly monitoring the experiment; if you observe any abnormality, please let us know.

Many thanks!

Pili Hu
Information Engineering Department
Chinese University of Hong Kong

Experience for PlanetLab Deployment

This section collects notes from our PL experiments. We hope it is useful to other researchers.

Python General

The default Python is too old.

  • Compile Python 2.7 from source.
  • Use virtualenv.

Let pip from virtualenv handle everything later.
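A sketch of the environment preparation, written as a script we could push to each node. The Python version, the install paths, and the assumption that virtualenv.py and requirements.txt are shipped alongside are ours, not prescribed anywhere:

```shell
# Build Python 2.7 under $HOME and create a virtualenv on a node.
# `sh -n` only checks the syntax here; the script itself would run
# on the PL node, not locally.
cat > setup_env.sh <<'EOF'
set -e
cd "$HOME"
wget -q http://www.python.org/ftp/python/2.7.5/Python-2.7.5.tgz
tar xzf Python-2.7.5.tgz
cd Python-2.7.5
./configure --prefix="$HOME/py27"
make && make install
"$HOME/py27/bin/python" "$HOME/virtualenv.py" "$HOME/env"
. "$HOME/env/bin/activate"
pip install -r "$HOME/requirements.txt"
EOF
sh -n setup_env.sh && echo "syntax ok"
```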

Known issues:

  • Some dependencies consume a lot of resources during compilation, e.g. lxml. Our slice was killed on two machines during environment preparation.

PLE and PLC Difference

Some configurations are different between PLE and PLC.

Time Drift

Some machines have very large time drift. Running ntp to calibrate the time is important, but:

  • The time of the host machine cannot be modified from inside a slice.
  • We store a file called time_diff on each machine and use this information to calibrate timestamps in our bot.
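A minimal sketch of the calibration, assuming time_diff stores the node's clock offset in whole seconds relative to our reference host (the file format is our own convention):

```shell
# During deployment we would write the offset once, e.g.:
#   echo $(( $(date +%s) - reference_ts )) > time_diff
echo "-37" > time_diff        # example: node clock is 37 s behind
offset=$(cat time_diff)
raw_ts=1378800000             # a raw timestamp logged by the bot
calibrated=$((raw_ts - offset))
echo "$calibrated"            # timestamp shifted onto reference time
```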

Cannot sudo due to no tty

See the blog post by our intern FAN Qijiang (in Chinese).

Resource Monitor

PL's virtualization is not fully isolated, so monitoring per-host resource consumption does not give correct numbers for our experiment.

  • CPU and memory: ps seems to give only instantaneous numbers; we finally used top in batch mode.
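A sketch of the sampling: take one batch-mode snapshot and pull %CPU and resident memory out of the bot's line. The sample line below is fabricated, and the field positions assume procps top's default batch columns:

```shell
# On a node we would snapshot with:  top -b -n 1 > snapshot.txt
# Then extract our bot's line; here we parse a fabricated example.
line="12345 cuhk_ie_  20   0  215m 198m 2104 S  3.0  9.8   1:23.45 python"
set -- $line                  # split on whitespace into $1..$12
res=$6                        # resident memory (RES column)
cpu=$9                        # %CPU column
echo "cpu=$cpu% rss=$res"
```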

Port to Listen

The listening port was first chosen as 5000, which some IDSes flagged as being associated with:

several different Trojans and could indicate an attempt to exploit a Microsoft Universal Plug and Play vulnerability.

PL nodes behind firewall

Note that PL requires nodes to be placed outside the local firewall, e.g. in a DMZ. However, not all sites comply, so we need to check the firewall status before running experiments; otherwise, requests to those machines will all fail.

Suppose an HTTP server is launched on each node, listening on 0.0.0.0:5900.

Test availability of the server locally:

pssh -o out.local -h hosts -l cuhk_ie_01 'curl localhost:5900/test/s'

Test availability of the server remotely:

cat hosts | xargs -i sh -c 'echo -n {}, ; curl -m 5 -s {}:5900/test/hello; echo' | tee -a out.remote

Try to filter out those nodes behind firewalls.
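One way to do the filtering, assuming out.remote contains the "host,response" lines produced by the loop above (the sample data below is hypothetical):

```shell
# Firewalled nodes time out, so their response field is empty.
cat > out.remote <<'EOF'
nodeA.example.edu,hello
nodeB.example.edu,
nodeC.example.edu,hello
EOF
awk -F, '$2 != "" { print $1 }' out.remote > hosts.reachable
cat hosts.reachable
```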

Check before running

PLE and PLC are different in several ways. We initially used relative paths in scripts after sourcing a setup script; this did not work on about 200 machines, so we later changed to absolute paths.
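A sketch of the fix: resolve paths relative to the script's own location instead of the caller's working directory (the demo script and paths below are hypothetical):

```shell
mkdir -p /tmp/deploy_demo
cat > /tmp/deploy_demo/env.sh <<'EOF'
# Anchor on the script's own directory, not $PWD.
script_dir=$(cd "$(dirname "$0")" && pwd)
echo "$script_dir/config.json"
EOF
cd /    # simulate running from an unrelated directory
config=$(sh /tmp/deploy_demo/env.sh)
echo "$config"
```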

Push vs. pull

  • push: use scp to actively copy files to individual machines.
  • pull: first prepare each node's data, then ask the nodes to download it from our web server.

We finally chose the pull approach because it involves less waiting on our side.
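A sketch of the pull side; the /data/ path on our web server and the per-hostname tarball naming are assumptions for illustration:

```shell
cat > hosts <<'EOF'
nodeA.example.edu
nodeB.example.edu
EOF
# On the nodes we would run something like:
# pssh -h hosts -l cuhk_ie_01 \
#   'curl -s -o data.tar.gz http://snsapi.ie.cuhk.edu.hk/data/$(hostname).tar.gz'
# Locally, just preview the URL each node would fetch:
urls=$(while read -r h; do
  echo "http://snsapi.ie.cuhk.edu.hk/data/$h.tar.gz"
done < hosts)
echo "$urls"
```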

IP vs. domain

Should PL nodes download from a URL given as a domain name or as an IP address directly?

Notice that domain names may not always be resolvable on a node. Although there are only a few such cases, we opt to use IPs in the next version.
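A sketch of resolving once on our side and handing nodes the IP. The resolver output below is hypothetical (192.0.2.10 is a documentation address, not the real server's IP):

```shell
# e.g. from: getent hosts snsapi.ie.cuhk.edu.hk
resolver_out="192.0.2.10   snsapi.ie.cuhk.edu.hk"
ip=$(echo "$resolver_out" | awk '{ print $1 }')
echo "curl -m 5 http://$ip/data.tar.gz"
```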

Fetch logs

PL nodes can fail at any time and in any form. Running-stage failures are OK as long as we build the system to be robust. However, pay attention when fetching logs (and other data): make sure they are not from a previous instance. Getting fewer log files is a less serious problem; there is just less data to analyze. Getting logs from previous instances results in dirty data.
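One way to guard against stale logs, assuming we tag each run with an id written as the first line of every log (the run-id scheme and log format are our own convention):

```shell
run_id="exp16"
mkdir -p logs_demo
printf 'run=exp15\nstale data\n' > logs_demo/nodeA.log   # previous instance
printf 'run=exp16\nfresh data\n' > logs_demo/nodeB.log
for f in logs_demo/*.log; do
  head -n 1 "$f" | grep -qx "run=$run_id" || rm "$f"
done
ls logs_demo
```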

Disk error?

We have no idea of the cause: we could not delete or add files, so we removed the machine from the experiment.

[cuhk_ie_01@planetlab2 ~]$ rm -rf tmp
rm: cannot remove `tmp/planetlab2.eecs.jacobs-university.de.tar.gz': Input/output error
rm: cannot remove `tmp/code': Input/output error
rm: cannot remove `tmp/config.json': Input/output error