Add a tool to map static cluster map information into Helix #523

Merged
merged 7 commits into linkedin:master from helixTool on Jan 26, 2017

Conversation

Contributor

@pnarayanan commented Jan 13, 2017

  • Adds a tool to bootstrap and, from time to time, upgrade
    cluster map information in Helix from the static clustermap files.
  • Adds tests to verify the tool's claims.
  • Adds certain functionalities in the test classes of the static
    cluster map to help test the tool.

Part of #524

--
Notes:

  • Main changes to be reviewed are in HelixBootstrapUpgradeTool.java and HelixBootstrapUpgradeToolTest.java.
  • Tests cover all of the core functionalities.
  • We make use of a recently added method in Helix. This is available in Helix 0.6.7, which has not yet been published. For now, I have commented out the applicable lines and marked TODOs that will be taken care of once 0.6.7 is out.

Contributor

@vgkholla left a comment

Just quickly went through the scaffolding around the upgrade tool. Have to take a look at the tool and its test.

for (int i = 0; i < datacenterCount; i++) {
names.add(i, "DC" + i);
dataNodes.add(i, getDataNodes(curBasePort, sslPort, getDisks()));
curBasePort += dataNodeCount;
curBasePort += 5000;
Contributor

any reason to move to an absolute number?

Contributor Author

Have refactored this a bit. Simply trying to partition things so that when new nodes are added at a later time (which is the new functionality added in these classes), the ports do not collide.
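A rough sketch of the port-partitioning idea described above (the base port and range width here are assumed values for illustration, not the constants the test classes actually use):

class PortAllocationSketch {
  // Assumed values, for illustration only.
  private static final int BASE_PORT = 6000;
  private static final int PORT_RANGE_PER_DC = 10000;

  // Each datacenter gets a disjoint port range, so nodes added to a datacenter
  // later cannot collide with ports already handed out for other datacenters.
  static int basePortForDatacenter(int dcIndex) {
    return BASE_PORT + dcIndex * PORT_RANGE_PER_DC;
  }
}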

@@ -81,4 +83,15 @@ public static Properties createConnectionPoolProperties() {
props.put("connectionpool.connect.timeout.ms", "2000");
return props;
}

public static void ensureOrExit(List<OptionSpec> requiredArgs, OptionSet actualArgs, OptionParser parser)
Contributor

doc

Contributor Author

done.

@@ -17,4 +17,5 @@ ext {
commonsVersion = "1.5"
bouncycastleVersion = "1.52"
javaxVersion = "3.0.1"
helixVersion = "0.6.7-SNAPSHOT"
Contributor

this can't be merged in this state, right? Do we have to wait until the relevant version of Helix is published?

Contributor Author

Yeah - will figure this out (mentioned this in the description).

Contributor Author

Will have to figure something out for the general case. For now, Helix is planning to cut a release later this week, so we are good.

Contributor

@vgkholla left a comment

Continuing to review ..

* The tool does the following:
* 1. bootstraps a static cluster map, adding nodes and partitions to Helix.
* 2. upgrades that involve new nodes and new partitions. To avoid over-complicating things, it assumes that the
* existing partition assignment do not change during an upgrade. Newly added partitions can be distributed in any
Contributor

does this also mean it assumes no new replicas are added for existing partitions?

Contributor Author

Correct. We can make the change if required. At this time, I do not see a requirement as this never happens in the current usage of Ambry.

private final TreeMap<Long, Long> existingPartitions = new TreeMap<>();
private final TreeSet<Long> existingResources = new TreeSet<>();
private final String localDc;
private String clusterName;
Contributor

haven't seen the code yet but just leaving a comment here in case i forget.

Can this be final?

Contributor Author

yes, changed.

private final String localDc;
private String clusterName;

static final int MAX_PARTITIONS_PER_RESOURCE = 100;
Contributor

what does this represent and why is it necessary?

Contributor Author

This states the maximum number of Ambry partitions that will be grouped under a single Helix resource. See this section in the design.
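For illustration, a cap like this implies a grouping along the following lines (a minimal sketch; the resource-naming arithmetic is assumed, not the tool's exact code):

class ResourceGroupingSketch {
  static final int MAX_PARTITIONS_PER_RESOURCE = 100;

  // With a cap of 100, Ambry partitions 0..99 would fall under one Helix
  // resource, 100..199 under the next, and so on.
  static String resourceNameFor(long ambryPartitionId) {
    return Long.toString(ambryPartitionId / MAX_PARTITIONS_PER_RESOURCE);
  }
}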

listOpt.add(zkLayoutPathOpt);
ToolUtils.ensureOrExit(listOpt, options, parser);
bootstrapOrUpgrade(hardwareLayoutPath, partitionLayoutPath, zkLayoutPath, options.valueOf(localDcOpt));
} catch (Exception e) {
Contributor

this catch isn't required. Since this is the main function, it will simply exit the program and print the stack trace of the exception (the stack trace is lost if you print it this way).

Contributor Author

Ah, didn't know main() could throw - was simply following what other tools did.
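A minimal illustration of the reviewer's point, with placeholder names (ToolMainSketch, runTool):

public class ToolMainSketch {
  // An exception escaping main() terminates the JVM with a non-zero exit status
  // and prints the complete stack trace, which a catch-and-print block would lose.
  public static void main(String[] args) throws Exception {
    runTool(args);
  }

  // Placeholder for the actual bootstrap/upgrade invocation.
  private static void runTool(String[] args) throws Exception {
  }
}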

try {
partitionLayoutString = Utils.readStringFromFile(partitionLayoutPath);
} catch (FileNotFoundException e) {
System.out.println("Partition layout path not found. Creating new file");
Contributor

could we instead do

String partitionLayoutString = null;
if (new File(partitionLayoutPath).exists()) {
  partitionLayoutString = Utils.readStringFromFile(partitionLayoutPath);
}

?

I would vote against using exceptions for control flow.

Contributor Author

In general, exceptions are implicitly used for control flow of the kind you have in mind, unless they are bail-out exceptions (which in this case it is not). Look at it this way - if lines 204 to 207 were encapsulated in a method, would you have had the same comment?
It so happens that File provides an exists() method, but if it hadn't, this exception would have to be used for control flow (the JSON parser class, for example, simply throws a JSONException if the contents are not valid JSON - and that might not be a bail-out case for the caller).

That said, your suggested change is much better, have made the change.

JSONArray all = (root).getJSONArray("zkHosts");
for (int i = 0; i < all.length(); i++) {
JSONObject entry = all.getJSONObject(i);
dataCenterToZkAddress.put(entry.getString("datacenter"),
Contributor

could you define static constants for these strings?

Contributor Author

Done.
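Something along these lines, assuming hypothetical constant names (the actual names in the patch may differ):

import org.json.JSONException;
import org.json.JSONObject;

class ZkLayoutFieldsSketch {
  private static final String DATACENTER_STR = "datacenter";
  private static final String HOSTNAME_STR = "hostname";
  private static final String PORT_STR = "port";

  // Builds the "host:port" ZooKeeper connect string from one zkHosts entry.
  static String zkConnectString(JSONObject entry) throws JSONException {
    return entry.getString(HOSTNAME_STR) + ":" + entry.getString(PORT_STR);
  }
}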

for (int i = 0; i < all.length(); i++) {
JSONObject entry = all.getJSONObject(i);
dataCenterToZkAddress.put(entry.getString("datacenter"),
entry.getString("hostname") + ":" + entry.getString("port"));
Contributor

if you are going to concatenate, instead of hostname and port, why not let the user provide an endpoint?

Contributor Author

Just a more structured and extensible way of reading in fields in my opinion (say, if tomorrow the tool decides to make use of an API that takes host and port as arguments, or it makes a REST based connection).

Contributor

@nsivabalan left a comment

Went over HelixBootstrapUpgradeTool. Will review the rest in the 2nd pass once the feedback is addressed

JSONObject jsonObject = jsonArray.getJSONObject(i);
currentPartitions.add((long) jsonObject.get("id"));
}
for (long i = 0; i < partitionCount; i++) {
Contributor

minor: we can just start from currentPartitions.size(), right? Why do you want to start from 0?

Contributor Author

Yes, thanks. Had started off with something more complex which ended up like this. Refactored.

jsonArray.put(getJsonPartition(i, partitionState, replicaCapacityInBytes, jsonReplicas));
}
return jsonArray;
}

static JSONArray getUpdatedJsonPartitions(JSONArray jsonArray, long partitionCount,
Contributor

can you add javadocs for the methods added as part of this patch?

Contributor Author

fixed.

@@ -269,32 +271,52 @@ public static JSONObject getJsonPartition(long id, PartitionState partitionState
}

public static JSONArray getJsonPartitions(long partitionCount, PartitionState partitionState,
long replicaCapacityInBytes, int replicaCount, TestHardwareLayout testHardwareLayout) throws JSONException {
long replicaCapacityInBytes, int replicaCountPerDc, TestHardwareLayout testHardwareLayout) throws JSONException {
Contributor

can you resolve the access specifiers for all the methods in this class using IntelliJ? Most of them can be made package private. IntelliJ will highlight them in yellow and will give suggestions as well.

Contributor Author

This file is not really core to this patch. Just added a method or so to help test the tool. If I go down the route of correcting these, this will become a huge patch.

Also, IntelliJ is basing these on what it knows about. You could argue that whoever authored a public method in TestUtils wanted it to be exposed outside of the package so that an outside component can make use of it (say if this tool was not part of the clustermap package) - even though currently no component outside needs it.

jsonArray.put(getJsonPartition(i, partitionState, replicaCapacityInBytes, jsonReplicas));
}
return jsonArray;
}

static JSONArray getUpdatedJsonPartitions(JSONArray jsonArray, long partitionCount,
Contributor

private

Contributor Author

fixed.

* @param dataNodeCountPerDc number of datanodes to get from each datacenter.
* @return a list of datanodes.
*/
public List<DataNode> getIndependentDataNodes(int dataNodeCountPerDc) {
Contributor

package private

Contributor Author

fixed.

Contributor

not sure if this was fixed?

Contributor Author

sorry, fixed now.

if (!admin.getClusters().contains(clusterName)) {
admin.addCluster(clusterName);
admin.addStateModelDef(clusterName, "LeaderStandby",
BuiltInStateModelDefinitions.LeaderStandby.getStateModelDefinition());
Contributor

do you know what the difference is between this call and LeaderStandbySMD.build()?

Contributor Author

Effectively the same, though it seems like LeaderStandbySMD.build() is the recommended way of defining this, so I will use your suggestion (the other one eventually makes use of a deprecated method).
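A minimal sketch of the setup the review settled on, combining the name constant and build() from LeaderStandbySMD:

import org.apache.helix.HelixAdmin;
import org.apache.helix.model.LeaderStandbySMD;

class StateModelSetupSketch {
  // Adds the cluster if absent and registers the built-in LeaderStandby state
  // model via the recommended LeaderStandbySMD helpers.
  static void ensureCluster(HelixAdmin admin, String clusterName) {
    if (!admin.getClusters().contains(clusterName)) {
      admin.addCluster(clusterName);
      admin.addStateModelDef(clusterName, LeaderStandbySMD.name, LeaderStandbySMD.build());
    }
  }
}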

InstanceConfig instanceConfig = dcAdmin.getInstanceConfig(clusterName, instanceName);
List<String> currentSealedPartitions = instanceConfig.getRecord().getListField("SEALED");
if (isSealed && !currentSealedPartitions.contains(partitionName)) {
List<String> newSealedPartitionsList = new ArrayList<>(currentSealedPartitions);
Contributor

lines 363 to 366 are repeated in more than one place. Do you think we can make it a private method?

Contributor Author

reorganized to avoid duplication - thanks.

* tool groups together partitions under resources, with a limit to the number of partitions that will be grouped
* under a single resource.
*/
private void updateHelix() {
Contributor

maybe "updateClusterMapInHelix"

Contributor Author

sure.

String[] instances = updateInstancesAndGetInstanceNames(dcAdmin, partitionName, replicaList, sealed);
resourceISBuilder.assignPreferenceList(partitionName, instances);
}
resourceISBuilder.setNumReplica(numReplicas);
Contributor

Each Helix resource contains N partitions. So what does the number of replicas mean for a resource?

Contributor Author

numReplicas is the number of replicas for every partition in a resource (yes, all partitions of a resource have to have the same number of replicas).

private String clusterName;

static final int MAX_PARTITIONS_PER_RESOURCE = 100;
static final String CAPACITY_STR = "capacityInBytes";
Contributor

these could be private (93 to 100)

Contributor Author

They are used by the test.

Contributor Author

@pnarayanan left a comment

Addressed feedback.

@@ -269,32 +271,52 @@ public static JSONObject getJsonPartition(long id, PartitionState partitionState
}

public static JSONArray getJsonPartitions(long partitionCount, PartitionState partitionState,
long replicaCapacityInBytes, int replicaCount, TestHardwareLayout testHardwareLayout) throws JSONException {
long replicaCapacityInBytes, int replicaCountPerDc, TestHardwareLayout testHardwareLayout) throws JSONException {
Contributor Author

This file is not really core to this patch. Just added a method or so to help test the tool. If I go down the route of correcting these, this will become a huge patch.

Also, IntelliJ is basing these on what it knows about. You could argue that whoever authored a public method in TestUtils wanted it to be exposed outside of the package so that an outside component can make use of it (say if this tool was not part of the clustermap package) - even though currently no component outside needs it.

jsonArray.put(getJsonPartition(i, partitionState, replicaCapacityInBytes, jsonReplicas));
}
return jsonArray;
}

static JSONArray getUpdatedJsonPartitions(JSONArray jsonArray, long partitionCount,
Contributor Author

fixed.

jsonArray.put(getJsonPartition(i, partitionState, replicaCapacityInBytes, jsonReplicas));
}
return jsonArray;
}

static JSONArray getUpdatedJsonPartitions(JSONArray jsonArray, long partitionCount,
Contributor Author

fixed.

JSONObject jsonObject = jsonArray.getJSONObject(i);
currentPartitions.add((long) jsonObject.get("id"));
}
for (long i = 0; i < partitionCount; i++) {
Contributor Author

Yes, thanks. Had started off with something more complex which ended up like this. Refactored.

for (int i = 0; i < datacenterCount; i++) {
names.add(i, "DC" + i);
dataNodes.add(i, getDataNodes(curBasePort, sslPort, getDisks()));
curBasePort += dataNodeCount;
curBasePort += 5000;
Contributor Author

Have refactored this a bit. Simply trying to partition things so that when new nodes are added at a later time (which is the new functionality added in these classes), the ports do not collide.

String[] instances = updateInstancesAndGetInstanceNames(dcAdmin, partitionName, replicaList, sealed);
resourceISBuilder.assignPreferenceList(partitionName, instances);
}
resourceISBuilder.setNumReplica(numReplicas);
Contributor Author

numReplicas is the number of replicas for every partition in a resource (yes, all partitions of a resource have to have the same number of replicas).

/**
* Updates instances that hosts replicas of this partition with the replica information (including the mount points
* on which these replicas should reside, which will be purely an instance level information).
* @param dcAdmin the admin to the Zk server on which this operatin is to be done.
Contributor Author

fixed.

for (int i = 0; i < replicaList.size(); i++) {
Replica replica = (Replica) replicaList.get(i);
DataNodeId node = replica.getDataNodeId();
String instanceName = node.getHostname() + "_" + node.getPort();
Contributor Author

done.

@@ -81,4 +83,15 @@ public static Properties createConnectionPoolProperties() {
props.put("connectionpool.connect.timeout.ms", "2000");
return props;
}

public static void ensureOrExit(List<OptionSpec> requiredArgs, OptionSet actualArgs, OptionParser parser)
Contributor Author

done.

@@ -17,4 +17,5 @@ ext {
commonsVersion = "1.5"
bouncycastleVersion = "1.52"
javaxVersion = "3.0.1"
helixVersion = "0.6.7-SNAPSHOT"
Contributor Author

Will have to figure something out for the general case. For now, Helix is planning to cut a release later this week, so we are good.

@pnarayanan
Contributor Author

@vgkholla @nsivabalan please take a look when you get a chance. thanks.

Contributor

@vgkholla left a comment

LGTM. As discussed, it might be useful to have a verifier at the end of the tool itself to build confidence that the state in helix is consistent with the state in the static file.

parseZkJsonAndPopulateZkInfo(zkLayoutPath);

ClusterMapConfig clusterMapConfig = new ClusterMapConfig(new VerifiableProperties(new Properties()));
staticClusterMap =
Contributor

The ternary operator made this hard to read for me.

Contributor Author

changed.

* ]
* }
*
* This tool should be run from an admin node that has access to the nodes in the hardware layout. The access is
Contributor

do you think we should word it as "this tool should be run (in every datacenter) from an admin node..."

Contributor Author

This needs to be run only once - does not have to be run from every datacenter.

Contributor

oops. got you. I don't know why I made that comment.

private final Map<String, String> dataCenterToZkAddress = new HashMap<>();
private final Map<String, HelixAdmin> adminForDc = new HashMap<>();
// A mapping from partitions to Helix resource, populated at the start with what is present in Helix.
private final TreeMap<Long, Long> existingPartitions = new TreeMap<>();
Contributor

this is a bit confusing. Can we name it partitionsToResourceMap? Also, why do we need to append "existing" to this one and the next data structure? We don't have any other variable which stores upcoming partitions or resources. In that case, I would like to keep it simple.

Contributor Author

changed this to just a set of existing partitions, as the partition to resource mapping was unnecessary.
These are the partitions and resources that are already present in Helix when the tool starts up. That way I think "existing" is appropriate. These are not modified after initialization.

private final String localDc;
private final String clusterName;

static final int MAX_PARTITIONS_PER_RESOURCE = 100;
Contributor

Do you think this value should be made configurable?

Contributor Author

Made this an optional argument.

// Add a cluster entry in every DC
if (!admin.getClusters().contains(clusterName)) {
admin.addCluster(clusterName);
admin.addStateModelDef(clusterName, "LeaderStandby", LeaderStandbySMD.build());
Contributor

replace the string "LeaderStandby" with LeaderStandbySMD.name

Contributor Author

fixed.

DataNodeId node = replicaId.getDataNodeId();
String instanceName = getInstanceName(node);
InstanceConfig instanceConfig = dcAdmin.getInstanceConfig(clusterName, instanceName);
List<String> currentSealedPartitions = instanceConfig.getRecord().getListField("SEALED");
Contributor

SEALED_STR

Contributor Author

fixed.

HelixBootstrapUpgradeTool clusterMapToHelixMapper =
new HelixBootstrapUpgradeTool(hardwareLayoutPath, partitionLayoutPath, zkLayoutPath, localDc);
clusterMapToHelixMapper.updateClusterMapInHelix();
}
Contributor

after the upgrade, can we close all the HelixAdmins that are currently open?

Contributor Author

Done.
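A sketch of what that cleanup might look like, assuming the per-datacenter admin map and a close() method on the admin objects (names here are illustrative):

import java.util.Map;
import org.apache.helix.HelixAdmin;

class AdminCleanupSketch {
  // Closes every per-datacenter HelixAdmin once the bootstrap/upgrade completes,
  // releasing the underlying ZooKeeper connections.
  static void closeAdmins(Map<String, HelixAdmin> adminForDc) {
    for (HelixAdmin admin : adminForDc.values()) {
      admin.close();
    }
  }
}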

Contributor

@nsivabalan left a comment

Done with prod code. Reviewing test code now

* @param basePort the starting standard port number for nodes generated
* @param sslPort the starting SSL port number for nodes generated
* @param hardwareState a {@link HardwareLayout} value for each node
* @param disks a {@link JSONArray} of disks for each node
Contributor

does every data node added have the same set of disks?

Contributor Author

Yes, that is how TestUtils is written. I don't think that matters for the purposes of testing this tool.

}

void shutdown() {
zkServer.shutdown();
Contributor

don't we need to clean up the directories?

Contributor Author

they get deleted because a deleteOnExit() call is made on the root directory after creation.
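For reference, the mechanism is roughly this (a sketch, not the test utility's actual code; note that deleteOnExit() only removes the directory at JVM exit if it is empty by then):

import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

class TempDirSketch {
  // Creates a temporary root directory and registers it for deletion on JVM exit,
  // which is why shutdown() does not clean it up explicitly.
  static File createTempRootDir(String prefix) throws IOException {
    File dir = Files.createTempDirectory(prefix).toFile();
    dir.deleteOnExit();
    return dir;
  }
}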

*/
private void verifyDataNodeAndDiskEquivalencyInDc(Datacenter dc, String clusterName, HardwareLayout hardwareLayout,
PartitionLayout partitionLayout) {
ClusterMapManager staticClusterMap = new ClusterMapManager(partitionLayout);
Contributor

it's ironic that hardwareLayout is never used to check for data node and disk equivalency ;)

Contributor Author

It gets used indirectly. The partitionLayout contains a reference to the associated hardwareLayout and that is what gets used. I have removed the unused variable from the signature.

Contributor

yes I knew how it works :) just called it out

@@ -17,4 +17,5 @@ ext {
commonsVersion = "1.5"
bouncycastleVersion = "1.52"
javaxVersion = "3.0.1"
helixVersion = "0.6.7"
Contributor

is this a stable released version?

Contributor Author

this is the latest version as of this writing from the 0.6.x branch, which is the stable branch. This version itself has not yet been officially announced, but this has the changes we need.

Contributor

@nsivabalan left a comment

mostly minor comments only.

IdealState resourceIS = admin.getResourceIdealState(clusterName, resourceName);
Set<String> resourcePartitions = resourceIS.getPartitionSet();
for (String resourcePartition : resourcePartitions) {
Assert.assertNull(
Contributor

why assertNull?

Contributor Author

Validating that an Ambry partition is not seen under more than one resource.
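The idiom relies on Map.put() returning the previous value for the key; a small sketch of the check (names are illustrative):

import java.util.Map;
import org.junit.Assert;

class UniqueMappingCheckSketch {
  // Map.put() returns the value previously associated with the key, or null if
  // there was none, so asserting null verifies that each Ambry partition shows
  // up under at most one Helix resource as the resources are iterated.
  static void record(Map<String, String> partitionToResource, String partition, String resource) {
    Assert.assertNull("Partition " + partition + " seen under more than one resource",
        partitionToResource.put(partition, resource));
  }
}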

Map<String, Set<String>> allPartitionsToInstancesInHelix = new HashMap<>();
long numResources = 0;
for (String resourceName : admin.getResourcesInCluster(clusterName)) {
IdealState resourceIS = admin.getResourceIdealState(clusterName, resourceName);
Contributor

do you verify stateModelName and numReplicas for each resource?

Contributor Author

Added.

Contributor

@nsivabalan left a comment

Sorry, a couple of minor comments.

@pnarayanan
Contributor Author

@nsivabalan @vgkholla added validation within the tool itself and tested it with a "real-life" cluster map; addressed other comments.

Looks like Helix 0.6.7 is not yet published, will follow up with the Helix team while you review the latest commit.

Contributor

@vgkholla left a comment

Minor comments. LGTM

private static String getInstanceName(DataNodeId dataNode) {
return dataNode.getHostname() + "_" + dataNode.getPort();
}

void validateAndClose() {
Contributor

doc

"Replica information not consistent for instance " + instanceName + " disk " + disk.getMountPath()
+ "\n in Helix: " + replicaList + "\n in static clustermap: " + replicasInClusterMap);
}

Contributor

check for diskInfos.empty()?

@pnarayanan
Contributor Author

Incorporated comments, built and tested on OS X and Linux.
Helix 0.6.7 has not yet been published (I am told it will be out in a week), so for now I have commented out the applicable lines and marked TODOs so this can be merged.

@pnarayanan
Contributor Author

@nsivabalan could you merge this PR once you take a look at the latest changes?

@xiahome
Contributor

xiahome commented Jan 26, 2017

[update] Made a clean build and ran it multiple times. The error was not reproduced.

I saw ./gradlew build intermittently fail. The error message is below.

:ambry-router:unitTest

com.github.ambry.router.DeleteManagerTest > testRouterClosedDuringOperation FAILED
    java.lang.AssertionError: Deletion should be unsuccessful. Exception is expected.
        at org.junit.Assert.fail(Assert.java:91)
        at com.github.ambry.router.DeleteManagerTest.assertFailureAndCheckErrorCode(DeleteManagerTest.java:430)
        at com.github.ambry.router.DeleteManagerTest.access$300(DeleteManagerTest.java:54)
        at com.github.ambry.router.DeleteManagerTest$4.testAndAssert(DeleteManagerTest.java:377)
        at com.github.ambry.router.RouterTestHelpers.testWithErrorCodes(RouterTestHelpers.java:79)
        at com.github.ambry.router.RouterTestHelpers.testWithErrorCodes(RouterTestHelpers.java:62)
        at com.github.ambry.router.DeleteManagerTest.testRouterClosedDuringOperation(DeleteManagerTest.java:371)

90 tests completed, 1 failed
:ambry-router:unitTest FAILED

FAILURE: Build failed with an exception.

* What went wrong:
Execution failed for task ':ambry-router:unitTest'.
> There were failing tests. See the report at: file:///Users/mxia/ws/linkedin_git_ambry/ambry-router/build/reports/tests/index.html

* Try:
Run with --stacktrace option to get the stack trace. Run with --info or --debug option to get more log output.

BUILD FAILED

Total time: 1 mins 46.927 secs

@pnarayanan
Contributor Author

@xiahome this patch does not have any changes that should fail that test. Could you

  1. run the test on a clean build
  2. take a look at the test to see why it might be failing? I think you are more familiar with the tests in the DeleteManager.

Collections.shuffle(Arrays.asList(instances));
resourceISBuilder.assignPreferenceList(partitionName, instances);
}
resourceISBuilder.setNumReplica(numReplicas);
Contributor

Does a resource have information about the number of partitions it has? Or do you just get the list of partitions and take its size?

Contributor Author

I think it is the latter.

.describedAs("local_dc")
.ofType(String.class);

ArgumentAcceptingOptionSpec<String> maxPartitionsInOneResourceOpt = parser.accepts("maxPartitionsInOneResource",
Contributor

Do we have any suggestion as to how to set this number? During different upgrades, the total number of partitions added might be different, right? Do you think adding a line of documentation is required?

Contributor Author

This does not have to be set. This is an override option, so the user would know.

* @param hardwareLayout the {@link HardwareLayout} of the static clustermap.
* @param partitionLayout the {@link PartitionLayout} of the static clustermap.
*/
private void verifyEquivalencyWithStatic(HardwareLayout hardwareLayout, PartitionLayout partitionLayout) {
Contributor

maybe rename to "verifyEquivalencyWithStaticClusterMap" or "verifyEquivalencyWithStaticCS"

Contributor Author

renamed.

@@ -199,15 +210,18 @@ static void bootstrapOrUpgrade(String hardwareLayoutPath, String partitionLayout
* @throws JSONException if there is an error parsing the JSON content in any of the files.
*/
private HelixBootstrapUpgradeTool(String hardwareLayoutPath, String partitionLayoutPath, String zkLayoutPath,
String localDc) throws IOException, JSONException {
String localDc, int maxPartitionsInOneResource) throws IOException, JSONException {
Contributor

java doc for this param

* @param clusterName the cluster to be verified.
* @param partitionLayout the {@link PartitionLayout} of the static clustermap.
*/
private void verifyDataNodeAndDiskEquivalencyInDc(Datacenter dc, String clusterName,
Contributor

I assume these verify methods are copied from the test class and hence I am not reviewing them once again. If there are many changes, let me know.

Contributor Author

yeah, they are the same.

+ replicaHostsInHelix);
}
ensureOrThrow(allPartitionsToInstancesInHelix.isEmpty(),
"More partitions in Helix than in clustermap, additional" + " partitions: "
Contributor

remove unwanted "+" (string concatenation)

Contributor Author

fixed.

@nsivabalan
Contributor

@pnarayanan: very few minor comments. Once you are done, I will merge it.

@pnarayanan
Contributor Author

@nsivabalan incorporated comments. built and formatted.

@nsivabalan
Contributor

Looks good. Merging. Can you confirm that you built using java 1.7 too?

@pnarayanan
Contributor Author

Yes, I always build with 1.7. Builds in both Linux and OS X.

@nsivabalan nsivabalan merged commit a98d757 into linkedin:master Jan 26, 2017
@pnarayanan pnarayanan deleted the helixTool branch January 28, 2017 20:00