initial import
keith-turner committed Dec 20, 2011
0 parents commit 1d95b95
Showing 13 changed files with 1,014 additions and 0 deletions.
59 changes: 59 additions & 0 deletions README
@@ -0,0 +1,59 @@
Accumulo has a simple test suite that verifies that data is not lost at scale.
This test suite is called continuous ingest. This test runs many ingest
clients that continually create linked lists containing 25 million nodes. At
some point the clients are stopped and a map reduce job is run to ensure no
linked list has a hole. A hole indicates data was lost.

This project is a version of the test suite written using Gora. Theoretically
it could run against other column stores. Currently I have only tested it at
scale using Accumulo.

Below is a rough sketch of how data is written. For specific details look at the
Generator code.

1 Write out 1 million nodes
2 Flush
3 Write out 1 million nodes that reference the previous million
4 If this is the 25th set of 1 million nodes, then update the 1st set of
  1 million to point to the last
5 goto 1

The key is that nodes only reference flushed nodes. Therefore a node should
never reference a missing node, even if the ingest client is killed at any
point in time.
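
The write pattern above can be illustrated with a small in-memory model. This
is only a sketch under stated assumptions, not the Generator code: the class
GeneratorSketch, the NO_PREV sentinel, the sequential keys, and the HashMap
standing in for the data store are all hypothetical, and the way the first set
is linked to the last may differ from what the Generator actually does.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of the write pattern; an in-memory map stands in for the data store. */
class GeneratorSketch {
    /** Hypothetical sentinel meaning "this node does not reference anything yet". */
    static final long NO_PREV = -1L;

    /** Nodes that have been flushed: key -> key of the node it references. */
    final Map<Long, Long> flushed = new HashMap<>();

    /** Sequential keys keep the sketch deterministic (an assumption of this sketch). */
    private long nextKey = 0;

    /** Writes one set of nodes that reference prevSet (already flushed), then flushes. */
    long[] writeSet(long[] prevSet, int width) {
        Map<Long, Long> pending = new HashMap<>();
        long[] keys = new long[width];
        for (int i = 0; i < width; i++) {
            keys[i] = nextKey++;
            pending.put(keys[i], prevSet == null ? NO_PREV : prevSet[i]);
        }
        flushed.putAll(pending); // the "flush": the whole set becomes visible
        return keys;
    }

    /** Writes `depth` sets of `width` nodes, then points the first set at the last. */
    void generate(int width, int depth) {
        long[] first = writeSet(null, width);
        long[] prev = first;
        for (int d = 1; d < depth; d++) {
            prev = writeSet(prev, width);
        }
        for (int i = 0; i < width; i++) {
            flushed.put(first[i], prev[i]); // safe: the last set is already flushed
        }
    }

    public static void main(String[] args) {
        GeneratorSketch g = new GeneratorSketch();
        g.generate(4, 3);
        System.out.println(g.flushed.size() + " nodes written");
    }
}
```

In this model a reference is only ever written to a key that is already in the
flushed map, which is the property the test relies on.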

When running this test suite w/ Accumulo we also run a script called the
Agitator that randomly and continuously kills server processes. We found many
data loss bugs in Accumulo by doing this. This test suite can also help find
bugs that impact uptime and stability when run for days or weeks.

This test suite consists of a few Java programs, a little helper script to run
the Java programs, and a Maven script to build it. To build the code, you may
need to edit the Maven script to point to the Gora data store that you want to
use. Or just use the Maven script to build this Java code, and copy whatever
dependencies you need into lib. To compile, do "mvn compile package". The
current Maven build script depends on an unreleased version of Accumulo and an
unreleased version of gora-accumulo. Both of these can be downloaded and
installed in your local Maven repo using mvn install.

Below is a description of the Java programs

* goraci.Generator - A map only job that generates data.
* goraci.Verify    - A map reduce job that looks for holes. Look at the
                     counts after running. REFERENCED and UNREFERENCED are
                     ok, any UNDEFINED counts are bad. Do not run at the
                     same time as the Generator.
* goraci.Walker    - A standalone program that starts following a linked list
                     and emits timing info.
* goraci.Print     - A standalone program that prints nodes in the linked list.
goraci.sh is a helper script that you can use to run the above programs. It
assumes all needed jars are in the lib dir. It does not need the package name,
so you can just run "./goraci.sh Generator". Below is an example.

$ ./goraci.sh Generator
Usage : Generator <num mappers> <num nodes>

This test suite does not do everything that the Accumulo test suite does.
Mainly, it does not collect statistics or generate reports.

10 changes: 10 additions & 0 deletions cinode.json
@@ -0,0 +1,10 @@
{
  "type": "record",
  "name": "CINode",
  "namespace": "goraci.generated",
  "fields" : [
    {"name": "prev", "type": "long"},
    {"name": "client", "type": "string"},
    {"name": "count", "type": "long"}
  ]
}
14 changes: 14 additions & 0 deletions goraci.sh
@@ -0,0 +1,14 @@
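#!/bin/bash
# Requires bash: the classpath below is built with an array, which plain sh lacks.

GORACI_HOME=`dirname "$0"`
# Join every jar in lib with ':' for the Hadoop classpath.
export HADOOP_CLASSPATH=$(JARS=("$GORACI_HOME/lib"/*.jar); IFS=:; echo "${JARS[*]}")
# -libjars takes a comma-separated list.
LIBJARS=`echo "$HADOOP_CLASSPATH" | tr : ,`

PACKAGE="goraci"

# The first argument is the class name without the package; the rest pass through.
CMD=$1
shift

hadoop jar "$GORACI_HOME/lib/goraci-0.0.1-SNAPSHOT.jar" "$PACKAGE.$CMD" -libjars "$LIBJARS" "$@"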

111 changes: 111 additions & 0 deletions pom.xml
@@ -0,0 +1,111 @@
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>goraci</groupId>
<artifactId>goraci</artifactId>
<version>0.0.1-SNAPSHOT</version>


<dependencies>
<dependency>
<groupId>org.apache.gora</groupId>
<artifactId>gora-core</artifactId>
<version>0.2-SNAPSHOT</version>
</dependency>

<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>avro</artifactId>
<version>1.3.3</version>
</dependency>

<!-- begin dependencies for accumulo.... all needed runtime deps are specified
because enabling transitive deps brings in too much junk. Comment out if not
using accumulo -->
<!-- see https://issues.apache.org/jira/browse/GORA-65 to obtain source
for gora-accumulo -->
<dependency>
<groupId>org.apache.gora</groupId>
<artifactId>gora-accumulo</artifactId>
<version>0.2-SNAPSHOT</version>
<scope>runtime</scope>
</dependency>
<dependency>
<groupId>org.apache.accumulo</groupId>
<artifactId>accumulo-core</artifactId>
<version>1.4.0-incubating-SNAPSHOT</version>
<scope>runtime</scope>
</dependency>
<dependency>
<groupId>org.apache.accumulo</groupId>
<artifactId>cloudtrace</artifactId>
<version>1.4.0-incubating-SNAPSHOT</version>
<scope>runtime</scope>
</dependency>
<dependency>
<groupId>org.apache.thrift</groupId>
<artifactId>libthrift</artifactId>
<version>0.6.1</version>
<scope>runtime</scope>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>zookeeper</artifactId>
<version>3.3.1</version>
<scope>runtime</scope>
</dependency>
<!-- end accumulo deps -->

</dependencies>

<build>
<plugins>
<plugin>
<artifactId>maven-dependency-plugin</artifactId>
<version>2.4</version>
<executions>
<execution>
<id>copy-dependencies</id>
<phase>package</phase>
<goals>
<goal>copy-dependencies</goal>
</goals>
<configuration>
<outputDirectory>lib</outputDirectory>
<overWriteReleases>false</overWriteReleases>
<overWriteSnapshots>true</overWriteSnapshots>
<overWriteIfNewer>true</overWriteIfNewer>
<excludeTransitive>true</excludeTransitive>
</configuration>
</execution>
</executions>
</plugin>


<plugin>
<artifactId>maven-jar-plugin</artifactId>
<version>2.3</version>
<configuration>
<outputDirectory>lib</outputDirectory>
</configuration>
</plugin>

<plugin>
<artifactId>maven-clean-plugin</artifactId>
<version>2.4.1</version>
<configuration>
<filesets>
<fileset>
<directory>lib</directory>
<includes>
<include>**/*.jar</include>
</includes>
<followSymlinks>false</followSymlinks>
</fileset>
</filesets>
</configuration>
</plugin>

</plugins>
</build>
</project>
36 changes: 36 additions & 0 deletions src/main/java/goraci/Clear.java
@@ -0,0 +1,36 @@
/**
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package goraci;

import goraci.generated.CINode;

import java.io.IOException;

import org.apache.gora.store.DataStore;
import org.apache.gora.store.DataStoreFactory;
import org.apache.hadoop.conf.Configuration;

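/**
 * Deletes all CINode data from the configured Gora data store.
 */
public class Clear {
  public static void main(String[] args) throws IOException {
    DataStore<Long,CINode> store = DataStoreFactory.getDataStore(Long.class, CINode.class, new Configuration());
    // Remove the stored data so a fresh test run starts from an empty store.
    store.truncateSchema();
    store.close();
  }
}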
