Watchdog is down once every hour #313

Closed
kentarosasaki opened this issue Feb 24, 2015 · 8 comments

Comments

@kentarosasaki

I found the following error log in 1.2.5. It appears once every hour. I don't use the watchdog function on this gateway node; that is, I haven't modified the watchdog section of leo_gateway.conf.

[E]     gateway_0@127.0.0.1      2015-02-24 13:28:12.964202 +0900        1424752092      leo_watchdog:handle_info/2      119     {leo_watchdog_disk,{function_clause,[{leo_watchdog_disk,'-get_disk_data/2-fun-1-',[["/dev/mapper/VolGroup00-LV_root"]],[{file,"src/leo_watchdog_disk.erl"},{line,170}]},{leo_watchdog_disk,'-get_disk_data/2-lc$^1/1-0-',2,[{file,"src/leo_watchdog_disk.erl"},{line,178}]},{leo_watchdog_disk,check,4,[{file,"src/leo_watchdog_disk.erl"},{line,295}]},{leo_watchdog_disk,handle_call,2,[{file,"src/leo_watchdog_disk.erl"},{line,238}]},{leo_watchdog,handle_info,2,[{file,"src/leo_watchdog.erl"},{line,112}]},{gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,604}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,239}]}]}}
[E]     gateway_0@127.0.0.1      2015-02-24 13:28:12.965045 +0900        1424752092      null:null       0       gen_server leo_watchdog_disk terminated with reason: no case clause matching ok in leo_watchdog:handle_info/2 line 120
[E]     gateway_0@127.0.0.1      2015-02-24 13:28:12.965708 +0900        1424752092      null:null       0       ["CRASH REPORT ",[80,114,111,99,101,115,115,32,"leo_watchdog_disk",32,119,105,116,104,32,"0",32,110,101,105,103,104,98,111,117,114,115,32,"exited",32,119,105,116,104,32,114,101,97,115,111,110,58,32,[["no case clause matching ","ok"," in ",[["leo_watchdog",58,"handle_info",47,"2"],[32,108,105,110,101,32,"120"]]]," in ",[["gen_server",58,"terminate",47,"6"],[32,108,105,110,101,32,"744"]]]]]
[E]     gateway_0@127.0.0.1      2015-02-24 13:28:12.966528 +0900        1424752092      null:null       0       Supervisor leo_watchdog_sup had child leo_watchdog_disk started with leo_watchdog_disk:start_link(["/"], [], 80, 100, 262144, 262144, 5, 5000000) at <0.23974.374> exit with reason no case clause matching ok in leo_watchdog:handle_info/2 line 120 in context child_terminated
[E]     gateway_0@127.0.0.1      2015-02-24 13:28:13.161113 +0900        1424752093      null:null       0       Error in process <0.28582.375> on node 'gateway_0@127.0.0.1' with exit value: {{badmatch,{error,{error,{already_started,<0.1056.0>}}}},[{leo_watchdog_disk,'-init/1-fun-0-',1,[{file,"src/leo_watchdog_disk.erl"},{line,225}]}]}
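
The CRASH REPORT entry above is printed as a nested list of character codes and strings, which makes it hard to read. As a reading aid only, here is a minimal Python sketch that flattens that term into plain text. The helper name `flatten_iolist` is hypothetical and not part of LeoFS; the list literal is copied from the log line above.

```python
def flatten_iolist(term):
    """Flatten a nested list of character codes and strings into one string."""
    if isinstance(term, int):
        return chr(term)          # a bare integer is a single character code
    if isinstance(term, str):
        return term               # quoted fragments are already text
    return "".join(flatten_iolist(item) for item in term)


# Term copied verbatim from the CRASH REPORT log entry.
crash_report = ["CRASH REPORT ",[80,114,111,99,101,115,115,32,"leo_watchdog_disk",32,119,105,116,104,32,"0",32,110,101,105,103,104,98,111,117,114,115,32,"exited",32,119,105,116,104,32,114,101,97,115,111,110,58,32,[["no case clause matching ","ok"," in ",[["leo_watchdog",58,"handle_info",47,"2"],[32,108,105,110,101,32,"120"]]]," in ",[["gen_server",58,"terminate",47,"6"],[32,108,105,110,101,32,"744"]]]]]

decoded = flatten_iolist(crash_report)
print(decoded)
```

The decoded text reads "CRASH REPORT Process leo_watchdog_disk with 0 neighbours exited with reason: no case clause matching ok in leo_watchdog:handle_info/2 line 120 in gen_server:terminate/6 line 744", which matches the plain-text entry logged just before it.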
yosukehara added a commit to leo-project/leo_gateway that referenced this issue Feb 24, 2015
@mocchira
Member

@kentarosasaki
Thank you for reporting this issue.
There are two issues:

  • The watchdog is enabled on leo_gateway even if you have disabled it in leo_gateway.conf
  • The output of the df -lk command fails to parse

The first one is a bug, so we will fix it.

Regarding the second one, the df -lk output in your environment is not what we expect, so please let us know the df -lk output on your leo_gateway.
We will fix the parsing logic for df -lk based on your result.

This is the result we expect on *nix environments:

ubuntu@ip-10-126-25-246:~/dev/leofs_client_tests/aws-sdk-ruby$ df -lk
Filesystem     1K-blocks    Used Available Use% Mounted on
/dev/xvda1      20496628 7637664  11920972  40% /
none                   4       0         4   0% /sys/fs/cgroup
udev             3824544      12   3824532   1% /dev
tmpfs             765948     356    765592   1% /run
none                5120       0      5120   0% /run/lock
none             3829736       0   3829736   0% /run/shm
none              102400       0    102400   0% /run/user
/dev/xvdb       30824956   45124  29207352   1% /mnt

@shuichiro-makigaki
Contributor

$ df -lk
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/mapper/VolGroup00-LV_root
                      26730364   3776520  21596020  15% /
tmpfs                  8167252         0   8167252   0% /dev/shm
/dev/sda1                99150     32442     61588  35% /boot
none                   8167252        12   8167240   1% /tmp
/dev/mapper/VolGroup01-LV_leofs
                     262098944 134873592 127225352  52% /leofs

Oh, the filesystem name is too long, so df wraps the row onto a second line.

However, LVM usually produces long filesystem names like these (e.g. OpenStack Cinder uses a 32-character UUID, which is even longer than in this case).
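
To illustrate why the wrapped row breaks a parser that expects six whitespace-separated columns per line, here is a minimal Python sketch. It is not the Erlang code in leo_watchdog_disk.erl; the function name `join_wrapped_rows` is hypothetical, and the sample is an excerpt of the output above. It assumes a wrapped row consists of a line holding only the device name followed by a line holding the remaining five columns.

```python
DF_LK_OUTPUT = """\
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/mapper/VolGroup00-LV_root
                      26730364   3776520  21596020  15% /
tmpfs                  8167252         0   8167252   0% /dev/shm
/dev/sda1                99150     32442     61588  35% /boot
"""


def join_wrapped_rows(output):
    """Return one field list per filesystem, merging rows that df wrapped."""
    rows = []
    pending = None  # device name waiting for its continuation line
    for line in output.splitlines()[1:]:  # skip the header line
        fields = line.split()
        if not fields:
            continue
        if pending is not None:
            fields = pending + fields
            pending = None
        if len(fields) == 1:
            # Only a device name: the numbers follow on the next line.
            pending = fields
            continue
        rows.append(fields)
    return rows


# Field counts seen by a naive line-by-line split: the first two are not 6.
naive_counts = [len(line.split()) for line in DF_LK_OUTPUT.splitlines()[1:]]
rows = join_wrapped_rows(DF_LK_OUTPUT)
print(naive_counts)
print(rows[0])
```

A line-by-line split sees one field and then five fields for the LVM volume, which is the shape of input that a fixed six-column pattern cannot match.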

@yosukehara added this to the 1.2.7 milestone Feb 24, 2015
@kentarosasaki
Author

Additionally, in the case clause at line 120 of leo_watchdog.erl, handle_fail returns just ok, but nothing handles that return value afterwards.
https://github.com/leo-project/leo_watchdog/blob/develop/src/leo_watchdog.erl#L120
https://github.com/leo-project/leo_watchdog/blob/develop/src/leo_watchdog_disk.erl#L247

Of course, I'm not sure about the details of the watchdog, but I am worried about the behaviour of this case clause.

@mocchira
Member

@kentarosasaki No problem.
handle_fail is ONLY meant to be used when there is something to roll back.
If you write code in handle_call that has side effects,
you can use handle_fail to roll back those side effects.

@mocchira
Member

@shuichiro-makigaki Thanks.
I found out how to avoid this strange behaviour (the row being wrapped when a filesystem name is too long).

http://git.savannah.gnu.org/gitweb/?p=coreutils.git;a=blob;f=src/df.c;h=4523c440f5f14f4933545841d54c68b525e97897;hb=ca637bff0e29732aad99f4c08afbf61c44ed94f0#l390

This behaviour has been fixed in GNU df versions >= v8.10-40-g99679ff.

For lower versions, adding the posix_format option (-P) should prevent this unexpected line break.
Please try df -lkP on leo_gateway and let us know the result.
If this works for you, we will fix this issue by replacing df -lk with df -lkP.

@shuichiro-makigaki
Contributor

$ df --version
df (GNU coreutils) 8.4
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Written by Torbjorn Granlund, David MacKenzie, and Paul Eggert.
$ df -lkP
Filesystem         1024-blocks      Used Available Capacity Mounted on
/dev/mapper/VolGroup00-LV_root  26730364   7594808  17777732      30% /
tmpfs                  8167252         0   8167252       0% /dev/shm
/dev/sda1                99150     32442     61588      35% /boot
none                   8167252        20   8167232       1% /tmp
/dev/mapper/VolGroup01-LV_leofs 104816640     33000 104783640       1% /leofs

Looks good.
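
Since every filesystem now occupies exactly one line, the output can be split into fixed columns. The following Python sketch shows one way to do that for the df -lkP output above. It is an illustration only, not the change made in leo_watchdog; `parse_df_posix` and the dictionary keys are hypothetical names, and it assumes device names contain no whitespace.

```python
DF_LKP_OUTPUT = """\
Filesystem         1024-blocks      Used Available Capacity Mounted on
/dev/mapper/VolGroup00-LV_root  26730364   7594808  17777732      30% /
tmpfs                  8167252         0   8167252       0% /dev/shm
/dev/sda1                99150     32442     61588      35% /boot
none                   8167252        20   8167232       1% /tmp
/dev/mapper/VolGroup01-LV_leofs 104816640     33000 104783640       1% /leofs
"""


def parse_df_posix(output):
    """Parse POSIX-format df output into a list of dicts, one per filesystem."""
    entries = []
    for line in output.splitlines()[1:]:  # skip the header line
        if not line.strip():
            continue
        # Split at most 5 times so a mount point with spaces stays in one field.
        fields = line.split(None, 5)
        if len(fields) != 6:
            raise ValueError("unexpected df row: %r" % line)
        device, total, used, available, capacity, mount = fields
        entries.append({
            "device": device,
            "total_kb": int(total),
            "used_kb": int(used),
            "available_kb": int(available),
            "capacity_pct": int(capacity.rstrip("%")),
            "mount": mount.rstrip(),
        })
    return entries


entries = parse_df_posix(DF_LKP_OUTPUT)
for entry in entries:
    print(entry["mount"], entry["capacity_pct"])
```

Raising on a row that does not have six fields makes an unexpected format visible at the parsing step instead of surfacing later as an unrelated error.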

@mocchira
Member

Fixed with this commit.
leo-project/leo_watchdog@767897a

Confirmed on CentOS 6.5 and Ubuntu LTS.
Testing is now ongoing on other *nix platforms.

@mocchira
Member

Confirmed on the latest FreeBSD/SmartOS.

@kentarosasaki @shuichiro-makigaki
This fix will be included in the next release (probably 1.2.7).

Thank you all for your contributions.
