In [1]:
import sqlContext.implicits._
import org.apache.spark.sql.functions._

In [2]:
import org.apache.spark.sql.SaveMode

In [3]:
val pageviewDF = sqlContext.read.format("com.databricks.spark.csv")
  .option("header", "true")
  .option("delimiter", "\t")
  .option("mode", "PERMISSIVE")
  .option("inferSchema", "true")
  .load("file:///mnt/ephemeral/summitdata/pageviews-by-second-tsv")

In [4]:
pageviewDF.printSchema

In [5]:
pageviewDF.first

In [6]:
pageviewDF.count()

Long = 7200000

### The data has duplicates.

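Whilst this would not be a problem for saving to files, it causes upsert data loss in Cassandra, because a write whose primary key already exists overwrites the earlier row. To get around this, add a uuid column to the data. As this version of Spark does not have a built-in uuid generator, we simply define a udf for it. To define a udf, take your function and wrap it in `udf(...)` from `org.apache.spark.sql.functions`.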

In [7]:
def uuid_as_string():String = java.util.UUID.randomUUID().toString()

In [8]:
val udf_uuid = udf(() => uuid_as_string)

In [9]:
udf_uuid

org.apache.spark.sql.UserDefinedFunction = UserDefinedFunction(<function0>,StringType,List())
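As a quick sanity check on the helper itself, the following plain-Scala sketch (no Spark needed) shows what the function behind the udf produces. The name `uuidAsString` is only a camel-case stand-in for the notebook's `uuid_as_string`; everything else is standard `java.util.UUID`.

```scala
import java.util.UUID

// Same idea as the notebook helper: a random (version 4) UUID rendered as a string.
def uuidAsString(): String = UUID.randomUUID().toString

val a = uuidAsString()
val b = uuidAsString()

// The canonical text form is 36 characters: 32 hex digits plus 4 hyphens.
println(a.length)
// Parsing the string back yields a UUID whose text form matches the original.
println(UUID.fromString(a).toString == a)
// Two independent calls are expected to differ; a collision is astronomically unlikely.
println(a != b)
```

Because the function takes no arguments and returns a new value on every call, each row gets its own identifier when the udf is applied to a DataFrame.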

In [None]:
%%cql create keyspace if not exists pageviews_ks with replication = {'class':'SimpleStrategy','replication_factor':1}

In [11]:
%%cql create table if not exists pageviews_ks.pageviews(
  uid uuid,  
  ts text,
  site text,
  requests int,
  PRIMARY KEY (uid)
)
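To see why a per-row uuid key avoids the upsert problem, here is a small plain-Scala sketch that stands in for a Cassandra table with an in-memory `Map`: inserting under an existing key replaces the previous value, much as a Cassandra write with an existing primary key does. The `PageView` case class and the three sample rows are made up for illustration, and the `(ts, site)` key is a hypothetical alternative to `uid`, not something the notebook uses.

```scala
import java.util.UUID

// Hypothetical stand-in for one row of the pageviews data.
case class PageView(ts: String, site: String, requests: Int)

// Made-up sample rows; the second is an exact duplicate of the first.
val rows = Seq(
  PageView("2015-03-16T10:36:11", "mobile", 958),
  PageView("2015-03-16T10:36:11", "mobile", 958),
  PageView("2015-03-16T10:36:12", "desktop", 1223)
)

// Simulated table keyed on (ts, site): a later write with the same key replaces the earlier one.
val keyedOnData = rows.foldLeft(Map.empty[(String, String), PageView]) {
  (table, row) => table + ((row.ts, row.site) -> row)
}

// Simulated table keyed on a fresh random UUID per row: no two writes share a key.
val keyedOnUuid = rows.foldLeft(Map.empty[UUID, PageView]) {
  (table, row) => table + (UUID.randomUUID() -> row)
}

println(keyedOnData.size) // 2: the duplicate row was silently overwritten
println(keyedOnUuid.size) // 3: every input row is kept
```

The trade-off is that a random key makes the load non-idempotent: running the same save twice would store every row twice under different keys.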

In [12]:
val with_uuid = pageviewDF.select(udf_uuid().as("uid"), $"timestamp", $"site", $"requests")

In [13]:
with_uuid.printSchema

root
 |-- uid: string (nullable = true)
 |-- timestamp: string (nullable = true)
 |-- site: string (nullable = true)
 |-- requests: integer (nullable = true)



Save the DataFrame to Cassandra. This takes a few minutes.

In [14]:
val renamed = with_uuid.withColumnRenamed("timestamp", "ts")

renamed.write
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "pageviews", "keyspace" -> "pageviews_ks"))
  .mode(SaveMode.Overwrite)
  .save()

### Watch it run in the spark UI

In [15]:
val addr = java.net.InetAddress.getByName("node0_ext").getHostAddress
kernel.magics.html(s"""<iframe src="http://$addr:4040/stages" width=1000 height=500/>""")

### Check it out ...

In [16]:
%%cql select * from pageviews_ks.pageviews limit 5

uid,requests,site,ts
33849e63-6787-42b1-843b-3c066308394b,958,mobile,2015-03-16T10:36:11
89190499-d45c-4767-9002-d52efc19a4d8,1223,mobile,2015-04-17T14:58:46
c03d6a37-7b20-4aad-8087-0c10bf715fa9,1656,mobile,2015-04-05T03:30:06
da876a70-35c9-4713-b4b0-448b434de2b5,1338,mobile,2015-04-02T19:52:50
4ed32969-f1e3-447c-b32b-752229263623,1264,mobile,2015-04-18T00:34:40
