Gephi is now available in version 0.9.1 with an optimized plugin mechanism. Gephi is my tool of choice for graph visualization, and Hadoop stores all my data, especially the time series from which I generate the graph data that is then processed in GraphX (also using GraphFrames). Therefore, I need a direct integration between Gephi and Hadoop.
## Some Background

In general, such an integration between analysis tools running on a workstation and the cluster that holds the data is done at the file or file system level. This allows a graph to be loaded from a remote location in a transparent way. Using APIs, it is possible to handle data on a more abstract level: a JDBC connection and an SQL query are used instead of a filename to get the required node and edge data into Gephi.
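To make the idea concrete, here is a minimal Scala sketch of such a JDBC-based load. The Impala endpoint, the database, and the `nodes`/`edges` tables with their columns are purely hypothetical placeholders, not part of the connector itself; the Hive/Impala JDBC driver is assumed to be on the classpath.

```scala
import java.sql.DriverManager

object JdbcGraphLoadSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical Impala endpoint and schema -- adjust to your cluster.
    val conn = DriverManager.getConnection(
      "jdbc:hive2://impala-host:21050/graphdb;auth=noSasl")
    val stmt = conn.createStatement()

    // Node data: one row per vertex, selected by an SQL query instead of a filename.
    val nodes = stmt.executeQuery("SELECT id, label FROM nodes")
    while (nodes.next())
      println(s"node ${nodes.getLong("id")} ${nodes.getString("label")}")

    // Edge data: one row per edge with source, target and weight.
    val edges = stmt.executeQuery("SELECT src, dst, weight FROM edges")
    while (edges.next())
      println(s"edge ${edges.getLong("src")} -> ${edges.getLong("dst")} (${edges.getDouble("weight")})")

    conn.close()
  }
}
```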
The Gephi-Hadoop-Connector goes beyond that. First, you can use a file from HDFS to load edge and node data. Second, using an SQL query and the Impala connector allows you to filter huge graphs in Hadoop or even to merge multiple data sets on the fly before you bring them into Gephi. Furthermore, the Apache Spark connector enables cluster-side processing of graphs that are too large for Gephi.
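For the cluster-side case, a sketch of what such preprocessing could look like with Spark and GraphFrames is shown below. The HDFS paths, the column names, and the `weight > 0.8` filter are illustrative assumptions; the point is only that the graph is reduced on the cluster before the (now small) node and edge lists are handed to Gephi.

```scala
import org.apache.spark.sql.SparkSession
import org.graphframes.GraphFrame

object ClusterSideReduceSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("gephi-preprocess").getOrCreate()

    // Hypothetical HDFS locations; the full graph is far too large for a workstation.
    // Vertices are expected to carry an "id" column, edges "src" and "dst" columns.
    val vertices = spark.read.parquet("hdfs:///graphs/full/vertices")
    val edges    = spark.read.parquet("hdfs:///graphs/full/edges")
    val g = GraphFrame(vertices, edges)

    // Keep only heavy edges and the vertices they touch, so the result fits into Gephi.
    val strongEdges  = g.edges.filter("weight > 0.8")
    val keptIds      = strongEdges.select("src").union(strongEdges.select("dst")).distinct()
    val keptVertices = g.vertices.join(keptIds, g.vertices("id") === keptIds("src")).drop(keptIds("src"))

    // Export the reduced graph as CSV node and edge lists that Gephi can import.
    keptVertices.write.option("header", "true").csv("hdfs:///graphs/reduced/nodes")
    strongEdges.write.option("header", "true").csv("hdfs:///graphs/reduced/edges")

    spark.stop()
  }
}
```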
This direct interaction between Gephi and Hadoop clusters simplifies the task of network analysis.
Initially, I created the Gephi-Hadoop-Connector with just a JDBC interface for Hive and Impala. Now we integrate the Etosha metadata toolbox for linking data sets and for exposing graphs and metadata to external tools.
Another important feature in Gephi is the support for time-dependent graphs. This means that a topological analysis and visualization of networks over time is possible in one tool, while all the huge datasets still sit in a scalable environment. Building a timeline that presents the topology of a complex system now becomes an easy task. For each time interval, we can specify an individual query; in this way, even data from very different sources can be merged, based on mappings included in the layer-defining queries. This procedure becomes even more efficient if the data is already partitioned by time. The data slices provided by Spark Streaming also fit well into this framework.
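A minimal sketch of such layer-defining queries is given below. The two source tables (`calls` and `mails`) and their columns are hypothetical, but they show how each interval gets its own query and how different schemas are mapped onto a common (src, dst, weight) layout before the layers are merged in Gephi.

```scala
object LayerQueriesSketch {
  // One layer-defining query per time interval; different sources are mapped
  // onto the same (src, dst, weight) schema inside the query itself.
  val byInterval: Map[String, String] = Map(
    "2016-01-01" -> "SELECT caller AS src, callee AS dst, duration AS weight FROM calls WHERE ds = '2016-01-01'",
    "2016-01-02" -> "SELECT sender AS src, receiver AS dst, 1.0 AS weight FROM mails WHERE ds = '2016-01-02'"
  )
}
```

If the tables are partitioned on a time column like `ds`, each slice query only touches a single partition, which is what makes the time-partitioned case so efficient.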
One important question remains: how do I handle a multilayer, time-dependent graph? A lot of metadata is required, and the Etosha data link engine was created for exactly this purpose.
## Quickstart

In order to test this tool (we are currently at proof-of-concept status), clone this project including its submodules:
```bash
git clone https://github.com/kamir/ghc
cd ghc
git submodule update --init --recursive
```
Next, run the bootstrap.sh script in the main folder. It installs some JAR files that are not available in public repositories and prepares your system for building the tool. Now you can compile and package the "gephi-plugins" project and start Gephi with the freshly built plugin using Maven:
```bash
cd git-modules
cd gephi-plugins
mvn clean compile package -DskipTests
mvn org.gephi:gephi-maven-plugin:run
```
The run.sh script in the main folder makes these steps easy.
Have fun with Gephi and all your huge graphs!