Journal Node's task lost when launching #194
hi @F21, a couple of things here:
Hey @elingg
Hm, taking a second look, it seems you have the oversubscription module running and it is affecting your containerizer. I've seen these issues with Mesos modules. As you can see below, it's giving the unknown container error and the abnormal executor error. You should see this error when launching various types of tasks, including with other frameworks.

I see this in the scheduler log:

I see this in the agent log:
I couldn't find a way to disable the oversubscription module, so I built binaries for Mesos 0.22.1 (I think oversubscription was a new feature in 0.23). However, once that was done, if I used Marathon, the scheduler kept getting killed once it finished downloading the mesos-hdfs framework. I ended up manually copying the mesos-hdfs framework to a node and running it. Having done that, I am still seeing JN1 getting killed. This is the log from the scheduler:
Other frameworks seem to be running fine; I tried Elasticsearch and Kafka.
If you are running other frameworks on your cluster, including ones that run in Docker, it could be related to a Mesos bug involving systemd and the containerizer: when you run multiple frameworks (some in Docker and some outside Docker), executors and tasks get killed. If you keep seeing something like that, let us know.
In the above tests, I ran the hdfs framework by itself (after stopping and removing all other frameworks) and on fresh clusters. Is there any way to get more debug info out of the framework or Mesos to work out the exact root cause? I will also set up an Ubuntu cluster to test whether it suffers from the same problem.
Here are my results from setting up an Ubuntu cluster containing 4 nodes: 1 master/slave and 3 slaves. Mesos is 0.23.0 and Marathon is 0.10.1. In this case, I am still seeing the JournalNode task being lost (the same situation as when I was using CoreOS). The logs from the mesos-slave where the scheduler is launched:
Hm, I have an Ubuntu cluster currently running correctly, but without oversubscription. This still appears to be some kind of containerizer issue related to oversubscription; I have seen issues with the oversubscription module interacting with the containerizer. The other possibility is a bug similar to https://issues.apache.org/jira/browse/MESOS-2601 or https://issues.apache.org/jira/browse/MESOS-2605, but those were fixed in Mesos 0.23.0.

Sep 11 09:45:56 mesos-slave-03 mesos-slave[733]: I0911 09:45:56.242110 888 slave.cpp:3798] Terminating executor executor.journalnode.NodeExecutor.1441964696232 of framework 20150911-055603-169978048-5050-644-0000 because it did not register within 1mins
Which versions of Ubuntu, Mesos and Marathon are you running? Also, which version of hdfs are you using? I've been building the latest HEAD. I will try to set up an identical cluster to yours and see if I can get it to work.
There are a couple of pretty easy setups I usually run:

1. Uninstall the preinstalled HDFS, then, before building HDFS-Mesos to run on GCE, change one configuration value: the only value I typically change is mesos.hdfs.zkfc.ha.zookeeper.quorum in mesos-site.xml, pointing localhost at the list of zookeeper nodes (zookeeper is running on the master, so I can use masterip:2181). See the sketch after this list.
2. Upload the HDFS tarball to the master and run it there.
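As a sketch, the changed property in mesos-site.xml would look like this (the ZooKeeper address is a placeholder for your master's IP):

<property>
  <name>mesos.hdfs.zkfc.ha.zookeeper.quorum</name>
  <description>Comma-separated list of zookeeper hostname-port pairs for HDFS HA features</description>
  <value>10.0.0.2:2181</value>
</property>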
I'm very curious to see whether your issue is related to Mesos modules or the containerizer. Let's get to the bottom of it and see if we can update the documentation and fix any additional issues. Thanks for your investigation, @F21!
Here's some more testing. This cluster consists of 4 nodes: 1 master/slave and 3 slaves. Ubuntu is 14.04 64-bit, Mesos is 0.22.1 and Marathon is 0.10.1. The JDK is OpenJDK. I also built the latest HEAD. After launching the framework, I can see it registered in the Mesos web UI. However, no tasks are being launched. This is the output of the hdfs framework:
Noticed there was a bug in Mesos 0.22.1 that didn't play well with Docker 1.8.1:

Will downgrade the Docker version and report back.
Thanks @F21. We can also discuss by email or IRC to debug what the issue is with the Mesos cluster or the framework.
@elingg Here are my latest findings: Mesos is 0.22.1, Ubuntu is 14.04 64-bit, Marathon is 0.10.1 and Docker is 1.7.1. I am still seeing the same issue, where the journal node fails to launch:
What's the best way to debug this over IRC? My timezone is AEST (Australia).
Ah, timezones are tricky, but we can set up a time if needed. You can always check with other Mesos users on IRC, as well as the Mesos user mailing lists, to see if you get a response in your timezone. In the meantime, we can proceed through GitHub issues. This still seems to be a Mesos issue with the containerizer; see the similar Marathon issue with an unknown container, mesosphere/marathon#734. Also, this is a custom executor: does the launch of the executor exceed the default mesos-slave executor registration timeout? What do you see being downloaded in the executor sandbox?
I am currently removing Marathon from the equation and launching the scheduler manually. In terms of the download, I can see the executor and Java 7 being downloaded. What is strange is that I got the hdfs framework running back in June or July without too many problems. I will go and build an older revision and see if that works.
If it's a Mesos containerizer issue, downgrading the HDFS-Mesos version probably won't help much, but you can try it to help diagnose. Running without Marathon might help if you are running Marathon in Docker or other apps in Docker, because the recent containerizer issues we've seen occur when the user runs Docker containers alongside Mesos containers. Since HDFS-Mesos doesn't require Docker, one thing to try is to remove docker from the slave containerizer flags.
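For example, a hedged sketch of dropping docker from the containerizers (the /etc/mesos-slave flag-file convention below is how the Mesosphere Ubuntu packages pass flags; treat the paths as assumptions about your setup):

# Restrict the slave to the Mesos containerizer only, then restart it:
echo 'mesos' | sudo tee /etc/mesos-slave/containerizers
sudo service mesos-slave restart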
I did have Docker in my previous setup, but it's a good idea to set up a very minimal Mesos cluster with just Mesos, Mesos DNS and hdfs-mesos to replicate this; I'll skip Marathon as well. All of these are installed from the Mesosphere apt repos. I'll give that a go and see if it makes a difference.
Just reporting back: I am still seeing the JournalNode task being lost. My cluster is running Ubuntu 14.04 64-bit with Mesos DNS and the HEAD of hdfs-mesos. Docker and Marathon were not installed.
Very strange; not sure why you are seeing containerizer issues then. Still the same container error messages?
@elingg I made some progress! I got the JN to launch on my stripped-down setup with the HEAD of hdfs-mesos. I noticed that I had neglected to set the executor registration timeout. Currently, the clusters are launched using Vagrant. Since Mesos 0.24 was just released, I am going to build the binaries and set up a CoreOS cluster again to see if it resolves the problem.
@F21, that's great news! executor_registration_timeout is a common issue (which is why I mentioned it above), as the container will be destroyed if the executor does not register in time. It's awesome that Ubuntu is working. CoreOS should be fine, but remember the comment I made about the symlink permissions issue; to get around that, we use the predistributed binaries option.
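For reference, a minimal sketch of raising the timeout (the 5mins value is illustrative, not a recommendation):

# As a mesos-slave flag:
mesos-slave --master=zk://master.mesos:2181/mesos \
  --executor_registration_timeout=5mins

# Or as the equivalent environment variable, as used later in this thread:
export MESOS_EXECUTOR_REGISTRATION_TIMEOUT=5mins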
The strange thing is that I always had the executor registration timeout set on my CoreOS cluster.
Is this safe to close, @F21, with the MESOS_EXECUTOR_REGISTRATION_TIMEOUT environment variable fix?
@elingg Unfortunately, that didn't make any difference on CoreOS. I am still investigating to see why that's happening.
Understood. Let us know if you have questions, @F21. And as a product plug: with DCOS, installation on CoreOS is as easy as one command 👍
@elingg Is there any way to increase the verbosity of the executor?
Is the executor meant to exit? I think this item in the mesos-slave log looks a bit suspicious:
Interestingly, Mesos tries to relaunch the executor on the same node a few times, and then the mesos-slave crashes and goes offline:
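A hedged note on the verbosity question above: libmesos logs through glog, so raising the glog verbosity in the slave's environment can surface more detail. This affects only the native Mesos side, not the framework's own Java logging, and whether executors inherit these variables depends on your setup:

# Assumption: set in the slave's environment before it launches executors.
export GLOG_v=2                    # more verbose glog output from libmesos
export MESOS_LOGGING_LEVEL=INFO    # the slave's --logging_level flag, in env form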
Yes, that is a suspicious containerizer issue. It points to a possible Mesos containerizer problem related to systemd, or to not giving the executor enough time to register (i.e. the executor registration timeout).
@elingg I think I might be one step closer to finding the problem. I noticed the following in hdfs-site.xml:

<property>
<name>dfs.namenode.rpc-address.hdfs.nn1</name>
<value>:50071</value>
</property>
<property>
<name>dfs.namenode.http-address.hdfs.nn1</name>
<value>:50070</value>
</property>

It looks like the hostname part of the value is empty. How is the value of ${nn1Hostname} populated?
I don't think that's the issue: the NNs have not yet launched, so that value is expected to be blank until they do. You should actually check the *-site.xml files in the hdfs-mesos-executor-0.1.4/etc/hadoop folder in your sandbox.
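A sketch of doing that on the slave (the work_dir below is the Mesos default /tmp/mesos; adjust it to your --work_dir):

# List the *-site.xml files inside executor sandboxes on this slave:
find /tmp/mesos/slaves -path '*etc/hadoop*' -name '*-site.xml'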
Within the hdfs-mesos-executor-0.1.4/etc/hadoop folder in the sandbox, this is mesos-site.xml:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>mesos.hdfs.data.dir</name>
<description>The primary data directory in HDFS</description>
<value>/var/lib/hdfs/data</value>
</property>
<property>
<name>mesos.hdfs.secondary.data.dir</name>
<description>The secondary data directory in HDFS</description>
<value>/var/run/hadoop-hdfs</value>
</property>
<property>
<name>mesos.hdfs.native-hadoop-binaries</name>
<description>Mark true if you have hadoop pre-installed on your host machines (otherwise it will be distributed by the scheduler)</description>
<value>false</value>
</property>
<property>
<name>mesos.hdfs.framework.mnt.path</name>
<description>Mount location (if mesos.hdfs.native-hadoop-binaries is marked false)</description>
<value>/opt/mesosphere</value>
</property>
<property>
<name>mesos.hdfs.state.zk</name>
<description>Comma-separated hostname-port pairs of zookeeper node locations for HDFS framework state information</description>
<value>master.mesos:2181</value>
</property>
<property>
<name>mesos.master.uri</name>
<description>Zookeeper entry for mesos master location</description>
<value>zk://master.mesos:2181/mesos</value>
</property>
<property>
<name>mesos.hdfs.zkfc.ha.zookeeper.quorum</name>
<description>Comma-separated list of zookeeper hostname-port pairs for HDFS HA features</description>
<value>master.mesos:2181</value>
</property>
<property>
<name>mesos.hdfs.framework.name</name>
<description>Your Mesos framework name and cluster name when accessing files (hdfs://YOUR_NAME)</description>
<value>hdfs</value>
</property>
<property>
<name>mesos.hdfs.mesosdns</name>
<description>Whether to use Mesos DNS for service discovery within HDFS</description>
<value>true</value>
</property>
<property>
<name>mesos.hdfs.mesosdns.domain</name>
<description>Root domain name of Mesos DNS (usually 'mesos')</description>
<value>mesos</value>
</property>
<property>
<name>mesos.native.library</name>
<description>Location of libmesos.so</description>
<value>/opt/test/packages/mesos/lib/libmesos.so</value>
</property>
<property>
<name>mesos.hdfs.journalnode.count</name>
<description>Number of journal nodes (must be odd)</description>
<value>1</value>
</property>
<!-- Additional settings for fine-tuning -->
<property>
<name>mesos.hdfs.jvm.overhead</name>
<description>Multiplier on resources reserved in order to account for JVM allocation</description>
<value>1</value>
</property>
<property>
<name>mesos.hdfs.hadoop.heap.size</name>
<value>512</value>
</property>
<property>
<name>mesos.hdfs.namenode.heap.size</name>
<value>512</value>
</property>
<property>
<name>mesos.hdfs.datanode.heap.size</name>
<value>512</value>
</property>
<property>
<name>mesos.hdfs.executor.heap.size</name>
<value>256</value>
</property>
<property>
<name>mesos.hdfs.executor.cpus</name>
<value>0.5</value>
</property>
<property>
<name>mesos.hdfs.namenode.cpus</name>
<value>0.5</value>
</property>
<property>
<name>mesos.hdfs.journalnode.cpus</name>
<value>0.5</value>
</property>
<property>
<name>mesos.hdfs.datanode.cpus</name>
<value>0.5</value>
</property>
<property>
<name>mesos.hdfs.user</name>
<value>root</value>
</property>
<property>
<name>mesos.hdfs.role</name>
<value>*</value>
</property>
</configuration>

And this is hdfs-site.xml, still containing the unrendered template placeholders:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>dfs.ha.automatic-failover.enabled</name>
<value>true</value>
</property>
<property>
<name>dfs.nameservice.id</name>
<value>${frameworkName}</value>
</property>
<property>
<name>dfs.nameservices</name>
<value>${frameworkName}</value>
</property>
<property>
<name>dfs.ha.namenodes.${frameworkName}</name>
<value>nn1</value>
</property>
<property>
<name>dfs.namenode.rpc-address.${frameworkName}.nn1</name>
<value>${nn1Hostname}:50071</value>
</property>
<property>
<name>dfs.namenode.http-address.${frameworkName}.nn1</name>
<value>${nn1Hostname}:50070</value>
</property>
<property>
<name>dfs.client.failover.proxy.provider.${frameworkName}</name>
<value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
<property>
<name>dfs.namenode.shared.edits.dir</name>
<value>qjournal://${journalnodes}/${frameworkName}</value>
</property>
<property>
<name>ha.zookeeper.quorum</name>
<value>${haZookeeperQuorum}</value>
</property>
<property>
<name>dfs.journalnode.edits.dir</name>
<value>${dataDir}/jn</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file://${dataDir}/name</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file://${dataDir}/data</value>
</property>
<property>
<name>dfs.ha.fencing.methods</name>
<value>shell(/bin/true)</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
<property>
<name>dfs.datanode.du.reserved</name>
<value>10485760</value>
</property>
<property>
<name>dfs.datanode.balance.bandwidthPerSec</name>
<value>41943040</value>
</property>
<property>
<name>dfs.namenode.safemode.threshold-pct</name>
<value>0.90</value>
</property>
<property>
<name>dfs.namenode.heartbeat.recheck-interval</name>
<!-- 60 seconds -->
<value>60000</value>
</property>
<property>
<name>dfs.datanode.handler.count</name>
<value>10</value>
</property>
<property>
<name>dfs.namenode.handler.count</name>
<value>20</value>
</property>
<property>
<name>dfs.image.compress</name>
<value>true</value>
</property>
<property>
<name>dfs.image.compression.codec</name>
<value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
<property>
<name>dfs.namenode.invalidate.work.pct.per.iteration</name>
<value>0.35f</value>
</property>
<property>
<name>dfs.namenode.replication.work.multiplier.per.iteration</name>
<value>4</value>
</property>
<!-- This property allows us to use IP's directly for communication instead of hostnames. -->
<property>
<name>dfs.namenode.datanode.registration.ip-hostname-check</name>
<value>false</value>
</property>
<property>
<name>dfs.client.read.shortcircuit</name>
<value>true</value>
</property>
<property>
<name>dfs.client.read.shortcircuit.streams.cache.size</name>
<value>1000</value>
</property>
<property>
<name>dfs.client.read.shortcircuit.streams.cache.size.expiry.ms</name>
<value>1000</value>
</property>
<!-- This property needs to be consistent with mesos.hdfs.secondary.data.dir -->
<property>
<name>dfs.domain.socket.path</name>
<value>/var/run/hadoop-hdfs/dn._PORT</value>
</property>
</configuration>

I tried running it manually as well. Is there any way to turn up the verbosity of the executor so that I can get some output in stderr?
I copied in the expanded hdfs-site.xml as well:
My best guess, based on your findings, would be a containerizer issue on CoreOS or an executor registration timeout issue. It might be best to check with the Mesos core team.
@elingg I have made some progress! The current stack is CoreOS 808.0.0, Mesos 0.24.0 and Marathon 0.10.0. I compiled Mesos myself for this setup. I am now able to get some useful information out of stderr. The current mesos-site.xml:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>mesos.hdfs.data.dir</name>
<description>The primary data directory in HDFS</description>
<value>/var/lib/hdfs/data</value>
</property>
<property>
<name>mesos.hdfs.secondary.data.dir</name>
<description>The secondary data directory in HDFS</description>
<value>/var/run/hadoop-hdfs</value>
</property>
<property>
<name>mesos.hdfs.native-hadoop-binaries</name>
<description>Mark true if you have hadoop pre-installed on your host machines (otherwise it will be distributed by the scheduler)</description>
<value>false</value>
</property>
<property>
<name>mesos.hdfs.framework.mnt.path</name>
<description>Mount location (if mesos.hdfs.native-hadoop-binaries is marked false)</description>
<value>/opt/mesosphere</value>
</property>
<property>
<name>mesos.hdfs.state.zk</name>
<description>Comma-separated hostname-port pairs of zookeeper node locations for HDFS framework state information</description>
<value>master.mesos:2181</value>
</property>
<property>
<name>mesos.master.uri</name>
<description>Zookeeper entry for mesos master location</description>
<value>zk://master.mesos:2181/mesos</value>
</property>
<property>
<name>mesos.hdfs.zkfc.ha.zookeeper.quorum</name>
<description>Comma-separated list of zookeeper hostname-port pairs for HDFS HA features</description>
<value>master.mesos:2181</value>
</property>
<property>
<name>mesos.hdfs.framework.name</name>
<description>Your Mesos framework name and cluster name when accessing files (hdfs://YOUR_NAME)</description>
<value>hdfs</value>
</property>
<property>
<name>mesos.hdfs.mesosdns</name>
<description>Whether to use Mesos DNS for service discovery within HDFS</description>
<value>true</value>
</property>
<property>
<name>mesos.hdfs.mesosdns.domain</name>
<description>Root domain name of Mesos DNS (usually 'mesos')</description>
<value>mesos</value>
</property>
<property>
<name>mesos.native.library</name>
<description>Location of libmesos.so</description>
<value>/opt/test/mesos/lib/libmesos.so</value>
</property>
<property>
<name>mesos.hdfs.journalnode.count</name>
<description>Number of journal nodes (must be odd)</description>
<value>1</value>
</property>
<!-- Additional settings for fine-tuning -->
<property>
<name>mesos.hdfs.jvm.overhead</name>
<description>Multiplier on resources reserved in order to account for JVM allocation</description>
<value>1</value>
</property>
<property>
<name>mesos.hdfs.hadoop.heap.size</name>
<value>512</value>
</property>
<property>
<name>mesos.hdfs.namenode.heap.size</name>
<value>512</value>
</property>
<property>
<name>mesos.hdfs.datanode.heap.size</name>
<value>512</value>
</property>
<property>
<name>mesos.hdfs.executor.heap.size</name>
<value>256</value>
</property>
<property>
<name>mesos.hdfs.executor.cpus</name>
<value>0.5</value>
</property>
<property>
<name>mesos.hdfs.namenode.cpus</name>
<value>0.5</value>
</property>
<property>
<name>mesos.hdfs.journalnode.cpus</name>
<value>0.5</value>
</property>
<property>
<name>mesos.hdfs.datanode.cpus</name>
<value>0.5</value>
</property>
<property>
<name>mesos.hdfs.user</name>
<value>root</value>
</property>
<property>
<name>mesos.hdfs.role</name>
<value>*</value>
</property>
<property>
<name>mesos.hdfs.ld-library-path</name>
<value>/opt/test/mesos/lib</value>
</property>
</configuration>

This is the output from the journal node executor's stderr:
This is the output of the mesos-slave logs:
Any ideas what might be causing this?
I just remembered your comment regarding CoreOS being locked down. I then pre-distributed the Hadoop binaries (CDH distribution) to all my CoreOS nodes and added them to the PATH environment variable available to the mesos-slaves. Now the journal node is able to launch, but it keeps complaining about the journal directory not being an absolute path:
If I check in the sandbox, the rendered hdfs-site.xml looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>dfs.ha.automatic-failover.enabled</name>
<value>true</value>
</property>
<property>
<name>dfs.nameservice.id</name>
<value>hdfs</value>
</property>
<property>
<name>dfs.nameservices</name>
<value>hdfs</value>
</property>
<property>
<name>dfs.ha.namenodes.hdfs</name>
<value>nn1</value>
</property>
<property>
<name>dfs.namenode.rpc-address.hdfs.nn1</name>
<value>:50071</value>
</property>
<property>
<name>dfs.namenode.http-address.hdfs.nn1</name>
<value>:50070</value>
</property>
<property>
<name>dfs.client.failover.proxy.provider.hdfs</name>
<value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
<property>
<name>dfs.namenode.shared.edits.dir</name>
<value>qjournal://192.168.33.10:8485/hdfs</value>
</property>
<property>
<name>ha.zookeeper.quorum</name>
<value>master.mesos:2181</value>
</property>
<property>
<name>dfs.journalnode.edits.dir</name>
<value>/var/lib/hdfs/data/jn</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///var/lib/hdfs/data/name</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///var/lib/hdfs/data/data</value>
</property>
<property>
<name>dfs.ha.fencing.methods</name>
<value>shell(/bin/true)</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
<property>
<name>dfs.datanode.du.reserved</name>
<value>10485760</value>
</property>
<property>
<name>dfs.datanode.balance.bandwidthPerSec</name>
<value>41943040</value>
</property>
<property>
<name>dfs.namenode.safemode.threshold-pct</name>
<value>0.90</value>
</property>
<property>
<name>dfs.namenode.heartbeat.recheck-interval</name>
<!-- 60 seconds -->
<value>60000</value>
</property>
<property>
<name>dfs.datanode.handler.count</name>
<value>10</value>
</property>
<property>
<name>dfs.namenode.handler.count</name>
<value>20</value>
</property>
<property>
<name>dfs.image.compress</name>
<value>true</value>
</property>
<property>
<name>dfs.image.compression.codec</name>
<value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
<property>
<name>dfs.namenode.invalidate.work.pct.per.iteration</name>
<value>0.35f</value>
</property>
<property>
<name>dfs.namenode.replication.work.multiplier.per.iteration</name>
<value>4</value>
</property>
<!-- This property allows us to use IP's directly for communication instead of hostnames. -->
<property>
<name>dfs.namenode.datanode.registration.ip-hostname-check</name>
<value>false</value>
</property>
<property>
<name>dfs.client.read.shortcircuit</name>
<value>true</value>
</property>
<property>
<name>dfs.client.read.shortcircuit.streams.cache.size</name>
<value>1000</value>
</property>
<property>
<name>dfs.client.read.shortcircuit.streams.cache.size.expiry.ms</name>
<value>1000</value>
</property>
<!-- This property needs to be consistent with mesos.hdfs.secondary.data.dir -->
<property>
<name>dfs.domain.socket.path</name>
<value>/var/run/hadoop-hdfs/dn._PORT</value>
</property>
</configuration>

In the pre-distributed binaries' etc/hadoop folder, however, hdfs-site.xml is still the unrendered template:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>dfs.ha.automatic-failover.enabled</name>
<value>true</value>
</property>
<property>
<name>dfs.nameservice.id</name>
<value>${frameworkName}</value>
</property>
<property>
<name>dfs.nameservices</name>
<value>${frameworkName}</value>
</property>
<property>
<name>dfs.ha.namenodes.${frameworkName}</name>
<value>nn1</value>
</property>
<property>
<name>dfs.namenode.rpc-address.${frameworkName}.nn1</name>
<value>${nn1Hostname}:50071</value>
</property>
<property>
<name>dfs.namenode.http-address.${frameworkName}.nn1</name>
<value>${nn1Hostname}:50070</value>
</property>
<property>
<name>dfs.client.failover.proxy.provider.${frameworkName}</name>
<value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
<property>
<name>dfs.namenode.shared.edits.dir</name>
<value>qjournal://${journalnodes}/${frameworkName}</value>
</property>
<property>
<name>ha.zookeeper.quorum</name>
<value>${haZookeeperQuorum}</value>
</property>
<property>
<name>dfs.journalnode.edits.dir</name>
<value>${dataDir}/jn</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file://${dataDir}/name</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file://${dataDir}/data</value>
</property>
<property>
<name>dfs.ha.fencing.methods</name>
<value>shell(/bin/true)</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
<property>
<name>dfs.datanode.du.reserved</name>
<value>10485760</value>
</property>
<property>
<name>dfs.datanode.balance.bandwidthPerSec</name>
<value>41943040</value>
</property>
<property>
<name>dfs.namenode.safemode.threshold-pct</name>
<value>0.90</value>
</property>
<property>
<name>dfs.namenode.heartbeat.recheck-interval</name>
<!-- 60 seconds -->
<value>60000</value>
</property>
<property>
<name>dfs.datanode.handler.count</name>
<value>10</value>
</property>
<property>
<name>dfs.namenode.handler.count</name>
<value>20</value>
</property>
<property>
<name>dfs.image.compress</name>
<value>true</value>
</property>
<property>
<name>dfs.image.compression.codec</name>
<value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
<property>
<name>dfs.namenode.invalidate.work.pct.per.iteration</name>
<value>0.35f</value>
</property>
<property>
<name>dfs.namenode.replication.work.multiplier.per.iteration</name>
<value>4</value>
</property>
<!-- This property allows us to use IP's directly for communication instead of hostnames. -->
<property>
<name>dfs.namenode.datanode.registration.ip-hostname-check</name>
<value>false</value>
</property>
<property>
<name>dfs.client.read.shortcircuit</name>
<value>true</value>
</property>
<property>
<name>dfs.client.read.shortcircuit.streams.cache.size</name>
<value>1000</value>
</property>
<property>
<name>dfs.client.read.shortcircuit.streams.cache.size.expiry.ms</name>
<value>1000</value>
</property>
<!-- This property needs to be consistent with mesos.hdfs.secondary.data.dir -->
<property>
<name>dfs.domain.socket.path</name>
<value>/var/run/hadoop-hdfs/dn._PORT</value>
</property>
</configuration>

Is there any reason why it's trying to load the template version of hdfs-site.xml?
Glad to hear of your progress! If you use the predistributed binaries option (which makes sense, as CoreOS is locked down like we discussed), you need to fill out hdfs-site.xml yourself to make sure it's configured properly. My recommendation, if you are using predistributed binaries, is to also use Mesos DNS, with the example configs for Mesos DNS as part of your binaries, i.e. https://github.com/mesosphere/hdfs/tree/master/example-conf/mesosphere-dcos (see the sketch below).
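As a sketch of what filling it out looks like with Mesos DNS, the namenode addresses become concrete names instead of template placeholders (the hostname below is illustrative; take the real names from the example configs linked above):

<property>
  <name>dfs.namenode.rpc-address.hdfs.nn1</name>
  <value>namenode1.hdfs.mesos:50071</value>
</property>
<property>
  <name>dfs.namenode.http-address.hdfs.nn1</name>
  <value>namenode1.hdfs.mesos:50070</value>
</property>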
Closing this, as I was able to successfully launch a POC cluster with 3 slaves and 1 master/slave, all running CoreOS!
I have 4 Mesos 0.23 nodes (1 master/slave and 3 slaves) running via Vagrant. I am using Java 8 and running everything natively on CoreOS 794.0.0 (not inside Docker containers). I have built the HEAD of the hdfs framework and launched it using Marathon 0.10.1. I noticed that after deploying the scheduler via Marathon, it attempts to launch a JournalNode on one of my slaves. However, the JournalNode task becomes lost and another attempt to launch the JournalNode starts, until I kill the scheduler.

This is my mesos-site.xml:

This is the stderr of the scheduler:

Inspecting the journal of the slave node where the JournalNode is launched didn't yield anything interesting: