## Spark CSV package

### Add the spark-csv package

In [1]:
%AddJar https://repo1.maven.org/maven2/com/databricks/spark-csv_2.10/1.3.0/spark-csv_2.10-1.3.0.jar --magic
%AddDeps com.databricks spark-csv_2.10 1.3.0 --transitive

Using cached version of spark-csv_2.10-1.3.0.jar
Marking com.databricks:spark-csv_2.10:1.3.0 for download
Preparing to fetch from:
-> file:/tmp/.ivy2/
-> https://repo1.maven.org/maven2
-> New file at /tmp/.ivy2/https/repo1.maven.org/maven2/org/apache/commons/commons-csv/1.1/commons-csv-1.1.jar
-> New file at /tmp/.ivy2/https/repo1.maven.org/maven2/com/univocity/univocity-parsers/1.5.1/univocity-parsers-1.5.1.jar
-> New file at /tmp/.ivy2/https/repo1.maven.org/maven2/com/databricks/spark-csv_2.10/1.3.0/spark-csv_2.10-1.3.0.jar


### Create a DataFrame from a CSV file

In [2]:
import org.apache.spark.sql.SQLContext
val sqlContext = new SQLContext(sc)
// The file has no header row, so columns get the default names C0..C4; types are inferred.
val df = sqlContext.read.format("com.databricks.spark.csv").
  option("header", "false").
  option("inferSchema", "true").
  load("data/iris.csv")


In [3]:
df.take(5)

Array([5.1,3.5,1.4,0.2,Iris-setosa], [4.9,3.0,1.4,0.2,Iris-setosa], [4.7,3.2,1.3,0.2,Iris-setosa], [4.6,3.1,1.5,0.2,Iris-setosa], [5.0,3.6,1.4,0.2,Iris-setosa])

In [4]:
df

[C0: double, C1: double, C2: double, C3: double, C4: string]

In [5]:
df.schema

StructType(StructField(C0,DoubleType,true), StructField(C1,DoubleType,true), StructField(C2,DoubleType,true), StructField(C3,DoubleType,true), StructField(C4,StringType,true))

In [6]:
df.show()

+---+---+---+---+-----------+
| C0| C1| C2| C3|         C4|
+---+---+---+---+-----------+
|5.1|3.5|1.4|0.2|Iris-setosa|
|4.9|3.0|1.4|0.2|Iris-setosa|
|4.7|3.2|1.3|0.2|Iris-setosa|
|4.6|3.1|1.5|0.2|Iris-setosa|
|5.0|3.6|1.4|0.2|Iris-setosa|
|5.4|3.9|1.7|0.4|Iris-setosa|
|4.6|3.4|1.4|0.3|Iris-setosa|
|5.0|3.4|1.5|0.2|Iris-setosa|
|4.4|2.9|1.4|0.2|Iris-setosa|
|4.9|3.1|1.5|0.1|Iris-setosa|
|5.4|3.7|1.5|0.2|Iris-setosa|
|4.8|3.4|1.6|0.2|Iris-setosa|
|4.8|3.0|1.4|0.1|Iris-setosa|
|4.3|3.0|1.1|0.1|Iris-setosa|
|5.8|4.0|1.2|0.2|Iris-setosa|
|5.7|4.4|1.5|0.4|Iris-setosa|
|5.4|3.9|1.3|0.4|Iris-setosa|
|5.1|3.5|1.4|0.3|Iris-setosa|
|5.7|3.8|1.7|0.3|Iris-setosa|
|5.1|3.8|1.5|0.3|Iris-setosa|
+---+---+---+---+-----------+
only showing top 20 rows
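
The default names `C0` to `C4` appear because the file is loaded without a header row. A minimal sketch of supplying an explicit schema instead, assuming `data/iris.csv` follows the usual Iris column order (sepal length, sepal width, petal length, petal width, species). The column names and the values `irisSchema` and `named` are illustrative and do not come from the file; `sqlContext` is the one created in the cell above.

```scala
import org.apache.spark.sql.types.{StructType, StructField, DoubleType, StringType}

// Hypothetical column names: the CSV itself carries no header row.
val irisSchema = StructType(Seq(
  StructField("sepal_length", DoubleType, nullable = true),
  StructField("sepal_width", DoubleType, nullable = true),
  StructField("petal_length", DoubleType, nullable = true),
  StructField("petal_width", DoubleType, nullable = true),
  StructField("species", StringType, nullable = true)
))

// With an explicit schema there is no need for the inferSchema option.
val named = sqlContext.read.format("com.databricks.spark.csv").
  option("header", "false").
  schema(irisSchema).
  load("data/iris.csv")

named.printSchema()
```

Supplying the schema up front also avoids the extra pass over the data that `inferSchema` needs in order to guess the column types.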



### Create a SQL table from a DataFrame

In [7]:
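// Expose the DataFrame to Spark SQL under the table name "iris".
df.registerTempTable("iris")
val results = sqlContext.sql("SELECT * FROM iris")
// collect() pulls every row to the driver, which is fine for a dataset this small.
results.collect().foreach(println)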


[5.1,3.5,1.4,0.2,Iris-setosa]
[4.9,3.0,1.4,0.2,Iris-setosa]
[4.7,3.2,1.3,0.2,Iris-setosa]
[4.6,3.1,1.5,0.2,Iris-setosa]
[5.0,3.6,1.4,0.2,Iris-setosa]
[5.4,3.9,1.7,0.4,Iris-setosa]
[4.6,3.4,1.4,0.3,Iris-setosa]
[5.0,3.4,1.5,0.2,Iris-setosa]
[4.4,2.9,1.4,0.2,Iris-setosa]
[4.9,3.1,1.5,0.1,Iris-setosa]
[5.4,3.7,1.5,0.2,Iris-setosa]
[4.8,3.4,1.6,0.2,Iris-setosa]
[4.8,3.0,1.4,0.1,Iris-setosa]
[4.3,3.0,1.1,0.1,Iris-setosa]
[5.8,4.0,1.2,0.2,Iris-setosa]
[5.7,4.4,1.5,0.4,Iris-setosa]
[5.4,3.9,1.3,0.4,Iris-setosa]
[5.1,3.5,1.4,0.3,Iris-setosa]
[5.7,3.8,1.7,0.3,Iris-setosa]
[5.1,3.8,1.5,0.3,Iris-setosa]
[5.4,3.4,1.7,0.2,Iris-setosa]
[5.1,3.7,1.5,0.4,Iris-setosa]
[4.6,3.6,1.0,0.2,Iris-setosa]
[5.1,3.3,1.7,0.5,Iris-setosa]
[4.8,3.4,1.9,0.2,Iris-setosa]
[5.0,3.0,1.6,0.2,Iris-setosa]
[5.0,3.4,1.6,0.4,Iris-setosa]
[5.2,3.5,1.5,0.2,Iris-setosa]
[5.2,3.4,1.4,0.2,Iris-setosa]
[4.7,3.2,1.6,0.2,Iris-setosa]
[4.8,3.1,1.6,0.2,Iris-setosa]
[5.4,3.4,1.5,0.4,Iris-setosa]
[5.2,4.1,1.5,0.1,Iris-setosa]
[5.5,4.2,1
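
Once the table is registered, other Spark SQL queries can run against it as well. A small sketch that counts rows per class, assuming the last column (`C4`) holds the species label as in the output above. The alias `n` and the value `perSpecies` are made up for illustration; `df`, `sqlContext`, and the `iris` table come from the earlier cells.

```scala
// Count rows per species label (column C4) through the registered "iris" table.
val perSpecies = sqlContext.sql("SELECT C4, COUNT(*) AS n FROM iris GROUP BY C4")
perSpecies.show()

// The same aggregation through the DataFrame API, without writing SQL.
df.groupBy("C4").count().show()
```

Both forms return a DataFrame, so the result can be registered as a table or queried further in the same way.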
