
Commit

Merge branch 'master' of github.com:linkedin-sna/sna-page
Hao Yan committed Nov 5, 2011
2 parents d203544 + 54ad948 commit 3281c40
Showing 33 changed files with 434 additions and 1,356 deletions.
Binary file removed fang/images/bg.jpg
Binary file not shown.
52 changes: 0 additions & 52 deletions fang/index.html

This file was deleted.

2 changes: 1 addition & 1 deletion includes/project_list.php
@@ -3,7 +3,7 @@
<a href="http://sna-projects.com/decomposer" title="Massive matrix decompositions.">decomposer</a> &middot;
<a href="http://sna-projects.com/norbert" title="Partitioned routing and cluster management.">norbert</a> &middot;
<a href="http://project-voldemort.com" title="A distributed database.">voldemort</a> &middot;
<a href="http://sna-projects.com/kafka" title="A high-throughput distributed messaging system.">kafka</a> &middot;
<a href="http://incubator.apache.org/kafka" title="A high-throughput distributed messaging system.">kafka</a> &middot;
<a href="http://sna-projects.com/kamikaze" title="Doc set compression.">kamikaze</a> &middot;
<a href="http://sna-projects.com/krati" title="A persistent high-performance data store.">krati</a> &middot;
<a href="http://sna-projects.com/sensei" title="A distributed, elastic, realtime, searchable database.">sensei</a> &middot;
29 changes: 4 additions & 25 deletions kafka/configuration.php
@@ -1,25 +1,4 @@
<?php require "includes/project_info.php" ?>
<?php require "../includes/header.php" ?>
<?php include "../includes/advert.php" ?>

<h2> Configuration </h2>

<h3> Important configuration properties for Kafka broker: </h3>

<p>More details about server configuration can be found in the scala class <code>kafka.server.KafkaConfig</code>.</p>

<?php include('includes/server_config.php'); ?>

<h3> Important configuration properties for the high-level consumer: </h3>

<p>More details about consumer configuration can be found in the scala class <code>kafka.consumer.ConsumerConfig</code>.</p>

<?php include('includes/consumer_config.php'); ?>

<h3> Important configuration properties for the producer: </h3>

<p>More details about producer configuration can be found in the scala class <code>kafka.producer.ProducerConfig</code>.</p>

<?php include('includes/producer_config.php'); ?>

<?php require "../includes/footer.php" ?>
<?php
header( "HTTP/1.1 301 Moved Permanently" );
header('Location: http://incubator.apache.org/kafka/configuration.html');
?>
4 changes: 4 additions & 0 deletions kafka/contact.php
@@ -0,0 +1,4 @@
<?php
header( "HTTP/1.1 301 Moved Permanently" );
header('Location: http://incubator.apache.org/kafka/contact.html');
?>
574 changes: 4 additions & 570 deletions kafka/design.php

Large diffs are not rendered by default.

23 changes: 23 additions & 0 deletions kafka/downloads.php
@@ -0,0 +1,23 @@
<?php require "includes/project_info.php" ?>
<?php require "../includes/header.php" ?>
<?php include "../includes/advert.php" ?>

<h2>Downloads</h2>
<ul>
<li><a href="downloads/kafka-0.05.zip">kafka-0.05.zip</a></li>
<li><a href="downloads/kafka-0.6.RC1.zip">kafka-0.6.RC1.zip</a></li>
<li><a href="downloads/kafka-0.6.RC2.zip">kafka-0.6.RC2.zip</a></li>
<li><a href="downloads/kafka-0.6.zip">kafka-0.6.zip</a></li>
</ul>

<h2>Release Notes</h2>
<ul>
<li><a href="downloads/v0.6-RC1-release-notes.html">v0.6.RC1 release notes</a></li>
<li><a href="downloads/v0.6-RC2-release-notes.html">v0.6.RC2 release notes</a></li>
<li><a href="downloads/v0.6-release-notes.html">v0.6 release notes</a></li>
</ul>

<h2>API docs</h2>
<ul>
<li><a href="downloads/0.6-api/index.html">0.6 API docs</a></li>
</ul>
17 changes: 4 additions & 13 deletions kafka/faq.php
@@ -1,13 +1,4 @@
<?php require "includes/project_info.php" ?>
<?php require "../includes/header.php" ?>
<?php include "../includes/advert.php" ?>

<h2>Frequently asked questions</h2>
<ol>
<li> <h3> Why does my consumer get InvalidMessageSizeException? </h3>
This typically means that the "fetch size" of the consumer is too small. Each time the consumer pulls data from the broker, it reads bytes up to a configured limit. If that limit is smaller than the largest single message stored in Kafka, the consumer cannot decode the message and throws an InvalidMessageSizeException. To fix this, increase the limit by setting the "fetch.size" property in config/consumer.properties. The default fetch.size is 300,000 bytes.

<li> <h3> On EC2, why can't my high-level consumers connect to the brokers? </h3>
When a broker starts up, it registers its host ip in ZK. The high-level consumer later uses the registered host ip to establish the socket connection to the broker. By default, the registered ip is given by InetAddress.getLocalHost.getHostAddress. Typically, this should return the real ip of the host. However, in EC2, the returned ip is an internal one and can't be connected to from outside. The solution is to explicitly set the host ip to be registered in ZK by setting the "hostname" property in server.properties.

</ol>
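
The two fixes above each amount to one line in a property file. A hedged sketch (the property names are as documented in the answers above; the values are illustrative, not recommendations):

```properties
# config/consumer.properties -- raise the per-fetch byte limit above the
# largest single message the broker may hold (default is 300000 bytes)
fetch.size=1048576

# config/server.properties -- on EC2, register a publicly reachable address
# in ZooKeeper instead of the internal one InetAddress.getLocalHost returns
# (hostname value below is illustrative)
hostname=ec2-203-0-113-10.compute-1.amazonaws.com
```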
<?php
header( "HTTP/1.1 301 Moved Permanently" );
header('Location: http://incubator.apache.org/kafka/faq.html');
?>
13 changes: 9 additions & 4 deletions kafka/includes/producer_config.php
@@ -6,7 +6,7 @@
</tr>
<tr>
<td><code>serializer.class</code></td>
<td>None. This is required</td>
<td>kafka.serializer.DefaultEncoder. This is a no-op encoder. The serialization of data to Message should be handled outside the Producer</td>
<td>class that implements the <code>kafka.serializer.Encoder&lt;T&gt;</code> interface, used to encode data of type T into a Kafka message </td>
</tr>
<tr>
@@ -17,12 +17,12 @@
<tr>
<td><code>producer.type</code></td>
<td>sync</td>
<td>this parameter specifies whether the messages are sent asynchronously or not. Valid values are - <ul><li><code>async</code> for asynchronous batching send through <code>kafka.producer.AsyncProducer</code></li><li>sync for synchronous send through <code>kafka.producer.SyncProducer</code></li></ul></td>
<td>this parameter specifies whether the messages are sent asynchronously or not. Valid values are - <ul><li><code>async</code> for asynchronous batching send through <code>kafka.producer.AsyncProducer</code></li><li><code>sync</code> for synchronous send through <code>kafka.producer.SyncProducer</code></li></ul></td>
</tr>
<tr>
<td><code>broker.partition.info</code></td>
<td><code>broker.list</code></td>
<td>null. Either this parameter or zk.connect needs to be specified by the user.</td>
<td>For bypassing zookeeper based auto partition discovery, use this config to pass in static broker and per-broker partition information. Format-<code>brokerid1:host1:port1, brokerid2:host2:port2..</code>
<td>For bypassing zookeeper based auto partition discovery, use this config to pass in static broker and per-broker partition information. Format-<code>brokerid1:host1:port1, brokerid2:host2:port2.</code>
If you use this option, the <code>partitioner.class</code> will be ignored and each producer request will be routed to a random broker partition.</td>
</tr>
<tr>
@@ -79,6 +79,11 @@
<td>5000</td>
<td>the maximum time spent by <code>kafka.producer.SyncProducer</code> trying to connect to the kafka broker. Once it elapses, the producer throws an ERROR and stops.</td>
</tr>
<tr>
<td><code>socket.timeout.ms</code></td>
<td>30000</td>
<td>The socket timeout in milliseconds</td>
</tr>
<tr>
<td><code>reconnect.interval</code> </td>
<td>30000</td>
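
Taken together, the producer settings in the table above would appear in a producer property file roughly like this. A hedged sketch using only keys shown in the table; the broker ids, hosts, and ports are illustrative:

```properties
# producer.properties -- illustrative values only
# no-op encoder (the new default); serialization handled outside the Producer
serializer.class=kafka.serializer.DefaultEncoder
# synchronous send via kafka.producer.SyncProducer
producer.type=sync
# static broker list in brokerid:host:port form; bypasses zookeeper-based
# partition discovery, so partitioner.class is ignored
broker.list=0:broker1:9092,1:broker2:9092
# socket timeout in milliseconds (the default)
socket.timeout.ms=30000
```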
19 changes: 10 additions & 9 deletions kafka/includes/project_info.php
@@ -16,17 +16,18 @@
$PROJ_FAVICON_MIME = "image/png";

/* Navigation links in the sidebar */
$PROJ_NAV_LINKS = array("download" => "downloads/",
$PROJ_NAV_LINKS = array("download" => "downloads.php",
"code" => "https://github.com/kafka-dev/kafka",
"quickstart" => "quickstart.php",
"design" => "design.php",
"configuration" => "configuration.php",
"performance" => "performance.php",
"current&nbsp;work" => "projects.php",
"javadoc" => "javadoc/current",
"quickstart" => "http://incubator.apache.org/kafka/quickstart.html",
"design" => "http://incubator.apache.org/kafka/design.html",
"configuration" => "http://incubator.apache.org/kafka/configuration.html",
"performance" => "http://incubator.apache.org/kafka/performance.html",
"projects" => "http://incubator.apache.org/kafka/projects.html",
"faq" => "faq.php",
"wiki" => "http://linkedin.jira.com/wiki/display/KAFKA",
"bugs" => "http://linkedin.jira.com/browse/KAFKA",
"mailing list" => "http://groups.google.com/group/kafka-dev"
"bugs" => "https://issues.apache.org/jira/browse/KAFKA",
"mailing&nbsp;lists" => "http://incubator.apache.org/kafka/contact.html",
"unit&nbsp;tests" => "http://test.project-voldemort.com:8080/",
);

/* Project color */
29 changes: 4 additions & 25 deletions kafka/index.php
@@ -1,25 +1,4 @@
<?php require "includes/project_info.php" ?>
<?php require "../includes/header.php" ?>
<?php include "../includes/advert.php" ?>

<h2>Kafka is a distributed publish/subscribe messaging system</h2>
<p>
Kafka is a distributed publish-subscribe messaging system. It is designed to support the following:
<ul>
<li>Persistent messaging with O(1) disk structures that provide constant time performance even with many TB of stored messages.</li>
<li>High-throughput: even with very modest hardware Kafka can support hundreds of thousands of messages per second.</li>
<li>Explicit support for partitioning messages over Kafka servers and distributing consumption over a cluster of consumer machines while maintaining per-partition ordering semantics.</li>
<li>Support for parallel data load into Hadoop.</li>
</ul>

Kafka is aimed at providing a publish-subscribe solution that can handle all activity stream data and processing on a consumer-scale web site. This kind of activity (page views, searches, and other user actions) is a key ingredient in many of the social features on the modern web. This data is typically handled by "logging" and ad hoc log aggregation solutions due to the throughput requirements. Such ad hoc solutions are viable for feeding logging data to an offline analysis system like Hadoop, but are very limiting for building real-time processing. Kafka aims to unify offline and online processing by providing a mechanism for parallel load into Hadoop as well as the ability to partition real-time consumption over a cluster of machines.
</p>

<p>
The use for activity stream processing makes Kafka comparable to <a href="https://github.com/facebook/scribe">Facebook's Scribe</a> or <a href="https://github.com/cloudera/flume">Cloudera's Flume</a>, though the architecture and primitives are very different for these systems and make Kafka more comparable to a traditional messaging system. See our <a href="design.php">design</a> page for more details.
</p>

<p>
This is a new project, and we are interested in building the community; we would welcome any thoughts or patches. You can reach us <a href="http://groups.google.com/group/kafka-dev">here</a>.
</p>

<?php require "../includes/footer.php" ?>
<?php
header( "HTTP/1.1 301 Moved Permanently" );
header('Location: http://incubator.apache.org/kafka/index.html');
?>
91 changes: 4 additions & 87 deletions kafka/performance.php
@@ -1,87 +1,4 @@
<?php require "includes/project_info.php" ?>
<?php require "../includes/header.php" ?>
<?php include "../includes/advert.php" ?>

<h2>Performance Results</h2>
<p>The following tests give some basic information on Kafka throughput as the number of topics, consumers and producers and overall data size varies. Since Kafka nodes are independent, these tests are run with a single producer, consumer, and broker machine. Results can be extrapolated for a larger cluster.
</p>

<p>
We run producer and consumer tests separately to isolate their performance. For the consumer, these tests measure <i>cold</i> performance, that is, consuming a large uncached backlog of messages. Simultaneous production and consumption tends to help performance, since the cache is hot.
</p>

<p>We used the following settings for some of the parameters:</p>

<ul>
<li>message size = 200 bytes</li>
<li>batch size = 200 messages</li>
<li>fetch size = 1MB</li>
<li>flush interval = 600 messages</li>
</ul>

In our performance tests, we run experiments to answer the following questions.
<h3>What is the producer throughput as a function of batch size?</h3>
<p>We can push about 50MB/sec to the system. However, this number changes with the batch size. The graphs below show the relation between these two quantities.</p>
<p><span style="" class="image-wrap"><img border="0" src="images/onlyBatchSize.jpg" width="500" height="300"/></span><br /></p>

<h3>What is the consumer throughput?</h3>
<p>According to our experiments, we can consume about 100MB/sec from a broker, and the total does not seem to change much as we increase the number of consumer threads.</p>
<p><span style="" class="image-wrap"><img border="0" src="images/onlyConsumer.jpg" width="500" height="300"/></span> </p>

<h3> Does data size affect our performance? </h3>
<p><span style="" class="image-wrap"><img border="0" src="images/dataSize.jpg" width="500" height="300"/></span><br /></p>

<h3>What is the effect of the number of producer threads on producer throughput?</h3>
<p>We are able to max out production with only a few threads.</p>
<p><span style="" class="image-wrap"><img border="0" src="images/onlyProducer.jpg" width="500" height="300"/></span><br /></p>

<h3> What is the effect of number of topics on producer throughput?</h3>
<p>Based on our experiments, the number of topics has a minimal effect on the total data produced.
The graph below shows an experiment where we used 40 producers and varied the number of topics.</p>

<p><span style="" class="image-wrap"><img border="0" src="images/onlyTopic.jpg" width="500" height="300"/></span><br /></p>

<h2>How to Run a Performance Test</h2>

<p>The performance-related code is under the perf folder. To run the simulator:</p>

<p><code>../run-simulator.sh -kafkaServer=localhost -numTopic=10 -reportFile=report-html/data -time=15 -numConsumer=20 -numProducer=40 -xaxis=numTopic</code></p>

<p>It will run a simulator with 40 producer threads and 20 consumer threads producing to and consuming from a local Kafka server. The simulator runs for 15 minutes, and the results are saved under report-html/data, from where they are plotted. It records the MB of data consumed/produced and the number of messages consumed/produced for a given number of topics, and report.html plots the charts.</p>

<p>Other parameters include numParts, fetchSize, messageSize.</p>

<p>To test how the number of topics affects performance, the script below can be used (it is under utl-bin):</p>



<pre>
#!/bin/bash
for i in 1 10 20 30 40 50;
do
  ../kafka-server.sh server.properties 2>&1 >kafka.out&
  sleep 60
  ../run-simulator.sh -kafkaServer=localhost -numTopic=$i -reportFile=report-html/data -time=15 -numConsumer=20 -numProducer=40 -xaxis=numTopic
  ../stop-server.sh
  rm -rf /tmp/kafka-logs
  sleep 300
done
</pre>



<p>Charts similar to the graphs above are plotted automatically by report.html.</p>

<?php require "../includes/footer.php" ?>
<?php
header( "HTTP/1.1 301 Moved Permanently" );
header('Location: http://incubator.apache.org/kafka/performance.html');
?>
