Disk issues on yevaud #143

Closed
tomhughes opened this issue Jan 30, 2017 · 10 comments

@tomhughes
Member

There seem to be some disk issues on yevaud. They started pretty much as soon as I tried to deploy the stylesheet update last night, with the machine becoming unresponsive and the serial console showing what appeared to be disk-related errors.

It was a few hours before I noticed, but I then power cycled it, only for it to fall over again after about ten minutes, spewing errors about rejected writes on the swap device.

I paused it on Pingdom and it then stayed up overnight and completed the low zoom render, but within a few hours of being brought back online this morning it went down again.

I couldn't see any SMART errors, but the BIOS on the Areca RAID controller is reporting read errors on IDE channel 5 at around the relevant times. That disk is a Western Digital WD3000HLHX, serial number WD-WXG1C30V9532.
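
If further digging is needed, SMART data for an individual channel behind the Areca controller can be read with smartctl's Areca device type. This is a sketch rather than a command verified on yevaud; the /dev/sg3 device and the channel/enclosure numbers are taken from the smartd report quoted further down.

    # Full SMART report for the disk on Areca channel 5, enclosure 1
    # (matches the areca_disk#05_enc#01 naming used by smartd)
    smartctl -a -d areca,5/1 /dev/sg3

    # Quick health summary only
    smartctl -H -d areca,5/1 /dev/sg3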

@tomhughes
Member Author

I couldn't find a way in the Areca BIOS to forcibly mark channel 5 as failed, so for now the machine is up but with rendering stopped and the low zoom array (which uses that disk) unmounted.
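
For whoever picks this up later, the mitigation above amounts to something along these lines; the renderd unit name and the mount point are assumptions rather than values taken from yevaud:

    # Stop tile rendering so nothing writes to the low zoom store
    sudo systemctl stop renderd

    # Unmount the low zoom array that includes the suspect channel 5 disk
    # (mount point is a placeholder, not the real path on yevaud)
    sudo umount /store/tiles-low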

@tomhughes
Member Author

That disk also generated a SMART report late last night, around the time I did the initial reboot. It was only a single pending sector, though:

This message was generated by the smartd daemon running on:

   host name:  yevaud
   DNS domain: openstreetmap.org

The following warning/error was logged by the smartd daemon:

Device: /dev/sg3 [areca_disk#05_enc#01], 1 Currently unreadable (pending) sectors

Device info:
WDC WD3000HLHX-01JJPV0, S/N:WD-WXG1C30V9532, WWN:5-0014ee-0024ecd5f, FW:04.05G04, 300 GB

For details see host's SYSLOG.

You can also use the smartctl utility for further investigation.
Another message will be sent in 24 hours if the problem persists.

@zerebubuth
Collaborator

It seems like this is so flaky now that it's not much use for tile serving?

We knew that the Areca controller was on its way out since the battery died a while back. There's an 8-port LSI JBOD controller in there which we can use instead. We would have plenty of ports for 2x OS, 1x 1TB database, 2x 1TB tiles-high and 3x ?TB tiles-low, if that's what we wanted to do.

Back when we updated orm (#88) we swapped the high-zoom array out for SSDs to improve latency. Did we see any tangible benefit from that?

@tomhughes
Member Author

I don't know about orm, but scorch (which is all SSD) is definitely doing well given it only has 8 CPU cores compared to 12 in orm and yevaud.

@Firefishy
Member

Are we able to remove the Areca controller completely and wire up the backplane / disks to the LSI JBOD controller?

We have at least 2x 500GB SATA SSDs, ex-poldi.

@Firefishy
Member

I have pulled the IDE channel 5 disk, Western Digital WD3000HLHX, serial number WD-WXG1C30V9532.

@kocio-pl

Does that mean this server is now functional again and the issue can be closed, or is it just some kind of test run?

https://munin.openstreetmap.org/openstreetmap/yevaud.openstreetmap/uptime.html

@Firefishy
Member

Not quite; the machine is functional and is catching up on the replication backlog.
https://munin.openstreetmap.org/openstreetmap/yevaud.openstreetmap/replication_delay.html

There is another disk throwing warnings, and it should be replaced or removed.

@Firefishy
Member

PS: I have installed 2x Samsung 840 Pro 512GB disks into the tiles-low array.
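
As a rough sketch of what adding those SSDs would look like, assuming the tiles-low array is Linux software RAID behind the LSI JBOD controller (the md device and disk names below are placeholders):

    # Add the two new SSDs to the existing array (device names are placeholders)
    sudo mdadm /dev/md2 --add /dev/sdf /dev/sdg

    # Confirm the array has picked them up and watch the rebuild
    cat /proc/mdstat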

@Firefishy
Member

Closing this issue. Going to create a new one for the WDC WD1000DHTZ-04N21V0 (S/N: WD-WX41E72EL389, WWN: 5-0014ee-602f524a0, FW: 04.06A01, 1.00 TB).
