NWBib down due to problems on quaoar2 #302

Closed
acka47 opened this issue Mar 29, 2016 · 3 comments

acka47 commented Mar 29, 2016

Over Easter (26-28 March), several Nagios messages concerning quaoar2 and also emphytos came in.

The first critical one (26.03.2016, 16:07):

***** Nagios *****

Notification Type: PROBLEM

Service: Check disk - ALL
Host: quaoar2
Address: 193.30.112.171
State: CRITICAL

Date/Time: Sat Mar 26 16:07:04 CET 2016

Additional Info:

DISK CRITICAL - free space: / 22274 MB (5% inode=97%): /files/open_data 489833 MB (31% inode=90%): /usr/remotesrc 12982 MB (7% inode=98%):

From then on, CRITICAL messages like this one for quaoar2 kept coming in.

The first regarding emphytos (26.03.2016, 23:58):

***** Nagios *****

Notification Type: PROBLEM

Service: Frontend
Host: emphytos-lobid
Address: 193.30.112.187
State: CRITICAL

Date/Time: Sat Mar 26 23:58:54 CET 2016

Additional Info:

CRITICAL - Socket timeout after 10 seconds

I realized yesterday evening that NWBib was down and asked @jschnasse to restart the Play app this morning. NWBib is up again, but strangely the detail view for NWBib resources doesn't work, e.g. http://nwbib.de/HT018866841. For other hbz01 resources it does work, e.g. http://nwbib.de/HT018715226.

Besides resolving this bug, I need some more documentation and probably training to deal with something like this myself.
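
For reference, a rough diagnostic sketch of what one could check first in a case like this (assuming shell access to quaoar2 and an Elasticsearch HTTP endpoint on the default port 9200; both are assumptions, not documented settings):

# Check free space on the root filesystem the Nagios alert complains about
df -h /
# Check overall Elasticsearch cluster health (green/yellow/red)
curl 'http://localhost:9200/_cluster/health?pretty'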

fsteeg commented Mar 29, 2016

The issue seems to be that quaoar2 hasn't recovered from its critical state yet.

The lobid API is configured to access the cluster via quaoar3, so that remained working. NWBib accessed the cluster via quaoar2 for its classification data, so only NWBib titles failed. Configured NWBib to access the cluster via quaoar3:

http://nwbib.de/HT018866841
http://nwbib.de/HT018826410
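
A quick way to spot-check such a detail view from the command line (plain curl, nothing NWBib-specific; the URL is one of those reported above):

# Should print 200 once the configuration change is effective
curl -s -o /dev/null -w '%{http_code}\n' http://nwbib.de/HT018866841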

Also deleted old indexes on the cluster via quaoar3 to make space (lack of disk space seems to be the original issue on quaoar2). The load of the elasticsearch process on quaoar2 is high, so I'll just leave it running and we'll see if the changes made via quaoar3 propagate and quaoar2 sorts itself out.
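
Listing and deleting indexes goes through the regular Elasticsearch REST API; a minimal sketch, assuming the cluster is reachable on quaoar3:9200 and using a placeholder index name (the actual index names are not part of this issue):

# List indexes with their on-disk size to find obsolete ones
curl 'http://quaoar3:9200/_cat/indices?v'
# Delete an obsolete index (placeholder name)
curl -XDELETE 'http://quaoar3:9200/some-old-index'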

@fsteeg fsteeg added review and removed working labels Mar 29, 2016
@fsteeg fsteeg assigned acka47 and unassigned fsteeg Mar 29, 2016
acka47 commented Mar 30, 2016

It doesn't look like quaoar2 is going to recover on its own. Nagios mail from 30.03.2016, 04:17:

***** Nagios *****

Notification Type: PROBLEM

Service: Check disk - ALL
Host: quaoar2
Address: 193.30.112.171
State: WARNING

Date/Time: Wed Mar 30 04:17:04 CEST 2016

Additional Info:

DISK WARNING - free space: / 27256 MB (7% inode=97%): /files/open_data 489209 MB (31% inode=90%): /usr/remotesrc 12982 MB (7% inode=98%):

fsteeg commented Apr 5, 2016

After manual deletion of some indexes, a restart of elasticsearch on quaoar2, and some time, the cluster is now back to green status. Closing.

Attempted restart on quaoar2 with:
sudo service elasticsearch restart

But it did not start up, as checked with:
sudo service elasticsearch status

It came back after:
sudo service elasticsearch start
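
To confirm the recovery, one can ask the cluster health endpoint to wait for green status; a minimal check, assuming the default port 9200 on quaoar2:

# Blocks until the cluster is green or the timeout expires
curl 'http://quaoar2:9200/_cluster/health?wait_for_status=green&timeout=60s&pretty'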

Opened hbz/lobid-resources#67 to avoid this kind of problem in the future.

@fsteeg fsteeg closed this as completed Apr 5, 2016
@fsteeg fsteeg removed the working label Apr 5, 2016