MemSQL Spark Library

This git repository contains a number of Scala projects that provide interoperation between MemSQL and a Spark cluster.

Name                      Description
MemSQL Spark Interface    A Spark app providing an API to run MemSQL Streamliner Pipelines on Spark
MemSQL etlLib             A library of interfaces for building custom MemSQL Streamliner Pipelines
MemSQL Spark Connector    Scala tools for connecting to MemSQL from Spark

Supported Spark version

This project currently supports only Spark 1.5.2. It has been tested primarily against the MemSQL Spark Distribution, which you can download here: http://versions.memsql.com/memsql-spark/latest

Documentation

You can find Scala documentation for everything exposed in this repo here: memsql.github.io/memsql-spark-connector

You can find MemSQL documentation on our Spark ecosystem here: docs.memsql.com/latest/spark/

MemSQL Spark Interface

The MemSQL Spark Interface is a Spark application that runs in a Spark cluster. It provides an HTTP API for running real-time pipelines on Spark, and it is also required to connect MemSQL Ops to a Spark cluster.
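
As an illustration of what talking to that HTTP API can look like, here is a short Scala sketch. The host, port, and /version endpoint used here are placeholders invented for this example, not documented values; consult the Interface documentation for the actual API.

import scala.io.Source

// Hypothetical: ask the Interface for its version over HTTP.
// The host, port, and endpoint below are placeholders, not documented values.
val url = new java.net.URL("http://spark-master:10001/version")
val response = Source.fromInputStream(url.openStream()).mkString
println(response)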

MemSQL etlLib

The MemSQL ETL library provides the interfaces and utilities required to write custom pipeline JARs. You can learn more about doing this in our docs.
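
To give a flavor of what implementing one of these interfaces involves, here is a minimal transformer sketch. The Transformer, PhaseConfig, and PhaseLogger names and the exact transform signature are assumptions based on the etlLib scaladoc; verify them against the version you depend on.

import org.apache.spark.sql.{DataFrame, SQLContext}
// Assumed package and class names; check the etlLib scaladoc for the authoritative interfaces.
import com.memsql.spark.etl.api.{PhaseConfig, Transformer}
import com.memsql.spark.etl.utils.PhaseLogger

// A pass-through transformer that logs the size of each incoming batch.
class NoopTransformer extends Transformer {
  override def transform(sqlContext: SQLContext, df: DataFrame,
                         config: PhaseConfig, logger: PhaseLogger): DataFrame = {
    logger.info(s"transforming a batch of ${df.count()} rows")
    df
  }
}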

MemSQL Spark Connector

The MemSQL Spark connector provides tools for reading from and writing to MemSQL databases in Spark.

The connector provides a number of integrations with Apache Spark including a custom RDD type, DataFrame helpers and a MemSQL Context.

MemSQLContext

The MemSQL Context maintains metadata about a MemSQL cluster and extends the Spark SQLContext.

import com.memsql.spark.connector.MemSQLContext

// NOTE: The connection details for your MemSQL Master Aggregator must be in
// the Spark configuration. See http://memsql.github.io/memsql-spark-connector/latest/api/#com.memsql.spark.connector.MemSQLConf
// for details.
val memsqlContext = new MemSQLContext(sparkContext)

val myTableDF = memsqlContext.table("my_table")
// myTableDF now is a Spark DataFrame which represents the specified MemSQL table
// and can be queried using Spark DataFrame query functions

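A minimal sketch of supplying those connection details programmatically, assuming configuration keys named memsql.host, memsql.port, memsql.user, memsql.password, and memsql.defaultDatabase; confirm the exact key names against the MemSQLConf documentation linked above.

import org.apache.spark.{SparkConf, SparkContext}
import com.memsql.spark.connector.MemSQLContext

// Key names below are assumed from the MemSQLConf docs; adjust to match your version.
val conf = new SparkConf()
  .setAppName("memsql-example")
  .set("memsql.host", "127.0.0.1")        // Master Aggregator host
  .set("memsql.port", "3306")
  .set("memsql.user", "root")
  .set("memsql.password", "")
  .set("memsql.defaultDatabase", "db")

val sparkContext = new SparkContext(conf)
val memsqlContext = new MemSQLContext(sparkContext)
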
You can also use memsqlContext.sql to pull arbitrary tables and expressions into a DataFrame:

val df = memsqlContext.sql("SELECT * FROM test_table")

val result = df.select(df("test_column")).where(df("other_column") === 1).limit(1)
// result now contains the first row where other_column == 1

Additionally, you can use the DataFrameReader API:

val df = memsqlContext.read.format("com.memsql.spark.connector").load("db.table")

saveToMemSQL

The saveToMemSQL function writes a DataFrame to a MemSQL table.

import org.apache.spark.sql._
import org.apache.spark.sql.types._
import org.apache.spark.sql.memsql.SparkImplicits._

...

val rdd = sc.parallelize(Array(Row("foo", "bar"), Row("baz", "qux")))
val schema = StructType(Seq(StructField("col1", StringType, false),
                            StructField("col2", StringType, false)))
val df = sqlContext.createDataFrame(rdd, schema)
df.saveToMemSQL("db", "table")  // writes the DataFrame to the table `table` in database `db`

You can also use the DataFrameWriter API:

df.write.format("com.memsql.spark.connector").save("db.table")
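
Because this goes through Spark's standard DataFrameWriter, you can also specify a save mode. SaveMode itself is standard Spark API, but how each mode maps onto MemSQL write behavior is connector-specific, so treat this as a sketch:

import org.apache.spark.sql.SaveMode

df.write
  .format("com.memsql.spark.connector")
  .mode(SaveMode.Append)  // standard Spark save mode; semantics in MemSQL are connector-defined
  .save("db.table")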

Using

To compile this library you must have the Simple Build Tool (sbt) installed.

Artifacts are published to Maven Central: http://repo1.maven.org/maven2/com/memsql/

Inside a project definition you can depend on our MemSQL Connector like so:

libraryDependencies += "com.memsql" %% "memsql-connector" % "VERSION"

And our ETL interface for MemSQL Streamliner:

libraryDependencies += "com.memsql" %% "memsql-etl" % "VERSION"
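
Putting those together, a minimal build.sbt might look like the following. The Scala 2.10 version is an assumption based on Spark 1.5.2's default Scala build, and VERSION is a placeholder for the release you want:

name := "my-streamliner-pipeline"

scalaVersion := "2.10.5"  // assumed: Spark 1.5.2 builds against Scala 2.10 by default

libraryDependencies ++= Seq(
  "com.memsql" %% "memsql-connector" % "VERSION",
  "com.memsql" %% "memsql-etl" % "VERSION"
)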

Building

You can use sbt to compile all of the projects in this repo. To build all of them:

sbt "project etlLib" build \
    "project connectorLib" build \
    "project interface" build

Testing

All unit tests can be run via sbt; they also run automatically at build time.

sbt 'project etlLib' test
sbt 'project connectorLib' test
sbt 'project interface' test