Create a DataFrame from a list:

In [26]:
val df = List(
    (1,"A","what a piece of work is man"),
    (2,"B","to reason is to touch the sun"),
    (3,"C","rare jongens die Romeinen")
).toDF("id","letter","text")

df: org.apache.spark.sql.DataFrame = [id: int, letter: string ... 1 more field]


In [27]:
df.show

+---+------+--------------------+
| id|letter|                text|
+---+------+--------------------+
|  1|     A|what a piece of w...|
|  2|     B|to reason is to t...|
|  3|     C|rare jongens die ...|
+---+------+--------------------+
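
By default, show cuts string values longer than 20 characters, which is why the text column above ends in "...". A minimal sketch of printing the full values, assuming the df value from the cell above is still in scope:

```scala
// Assumes `df` is the DataFrame created in the cell above.
// Passing false disables the 20-character truncation of string columns.
df.show(false)

// Same, but limited to the first 2 rows.
df.show(2, false)
```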



Write the data as Parquet file(s) into the directory "data":

In [28]:
df.write.parquet("data")
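
Running the write a second time fails when the target directory already exists, because the default save mode is to raise an error. The sketch below, which assumes the notebook's spark session and the df value from above, overwrites the directory instead and reads the files back as a check. The name fromParquet is only illustrative.

```scala
// Assumes `spark` (the notebook's SparkSession) and `df` from the cells above.
// mode("overwrite") replaces an existing "data" directory instead of failing.
df.write.mode("overwrite").parquet("data")

// Parquet files carry their own schema, so none has to be supplied on read.
val fromParquet = spark.read.parquet("data")
fromParquet.printSchema()

// Row order is not guaranteed after a round trip, so sort before showing.
fromParquet.orderBy("id").show(false)
```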

## Loading a CSV

Define a schema for the CSV file to be loaded:

In [13]:
import org.apache.spark.sql.types._

val schema = StructType(Array(
    StructField("id", IntegerType, true),
    StructField("quote", StringType, true),
    StructField("author", StringType, true)
))

import org.apache.spark.sql.types._
schema: org.apache.spark.sql.types.StructType = StructType(StructField(id,IntegerType,true), StructField(quote,StringType,true), StructField(author,StringType,true))
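
The same schema can also be written more compactly as a DDL string. This is a sketch of an alternative, not what the notebook uses; it assumes a Spark version that provides StructType.fromDDL, and the name ddlSchema is only illustrative.

```scala
import org.apache.spark.sql.types.StructType

// Three columns declared in DDL syntax; fields parsed this way are nullable,
// matching the `true` flags in the StructField version above.
val ddlSchema = StructType.fromDDL("id INT, quote STRING, author STRING")

// Print the parsed schema as a tree to compare it with `schema` by eye.
ddlSchema.printTreeString()
```

Either value can be passed to .schema(...) when reading the file.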


In [14]:
val quotes = spark.read.format("csv")
    .option("header","false")
    .schema(schema)
    .load("litemind-quotes.csv")

quotes: org.apache.spark.sql.DataFrame = [id: int, quote: string ... 1 more field]


In [15]:
quotes.show(4)

+---+--------------------+---------------+
| id|               quote|         author|
+---+--------------------+---------------+
|  1|The third-rate mi...|    A. A. Milne|
|  2|History teaches u...|      Abba Eban|
|  3|How many legs doe...|Abraham Lincoln|
|  4|Nearly all men ca...|Abraham Lincoln|
+---+--------------------+---------------+
only showing top 4 rows



In [16]:
quotes.createOrReplaceTempView("quotes")

In [24]:
val authors = spark.sql("select count(*) as nr,author from quotes group by author order by nr desc")

authors: org.apache.spark.sql.DataFrame = [nr: bigint, author: string]


In [25]:
authors.show

+---+-------------------+
| nr|             author|
+---+-------------------+
| 27|     Unknown Author|
| 14|         Mark Twain|
| 11|    Albert Einstein|
|  9|               null|
|  8|        Oscar Wilde|
|  8|George Bernard Shaw|
|  5|    Abraham Lincoln|
|  5|             Gandhi|
|  4|           Socrates|
|  4|        Eric Hoffer|
|  4|       Groucho Marx|
|  4|   Bertrand Russell|
|  3|Arthur Schopenhauer|
|  3|     Samuel Johnson|
|  3|       George Burns|
|  3|  Franklin P. Jones|
|  3|          Aristotle|
|  3|         André Gide|
|  3|        Woody Allen|
|  3|        Dick Cavett|
+---+-------------------+
only showing top 20 rows
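
The same aggregation can be expressed with the DataFrame API instead of a SQL string. A minimal sketch, assuming the quotes DataFrame loaded above; the name authorsDf is only illustrative.

```scala
import org.apache.spark.sql.functions.{count, desc, lit}

// Assumes `quotes` is the DataFrame loaded from the CSV above.
val authorsDf = quotes
  .groupBy("author")               // rows with a null author form their own group
  .agg(count(lit(1)).as("nr"))     // equivalent of count(*) per group
  .select("nr", "author")          // same column order as the SQL version
  .orderBy(desc("nr"))

authorsDf.show()
```

Unlike the SQL version, this does not need the temporary view, and column names are checked when the plan is built rather than when the SQL string is parsed.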



In [40]:
spark.sql("SELECT quote from quotes WHERE author = 'Oscar Wilde'").show(20, false)

+----------------------------------------------------------------------------------------------+
|quote                                                                                         |
+----------------------------------------------------------------------------------------------+
|Always forgive your enemies; nothing annoys them so much.                                     |
|True friends stab you in the front.                                                           |
|Be yourself; everyone else is already taken.                                                  |
|I am not young enough to know everything.                                                     |
|Truth, in matters of religion, is simply the opinion that has survived.                       |
|Every saint has a past, and every sinner has a future.                                        |
|There is no such thing as a moral or an immoral book. Books are well written or badly written.|
|Seriousness is the only refug