![DataStax Academy](https://s3.amazonaws.com/datastaxtraining/vq8Jr36Gk48v/datastax-academy.svg "DataStax Academy")

# Exercise 08.02 - Spark SQL: Creating Dataframes

## Background

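This exercise reviews four ways to create a dataframe: from an RDD with `toDF`, from an RDD with a programmatically specified schema, from a SQL query, and with the dataframe reader.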

***

## Directions

Let's begin by creating a dataframe using the `toDF` method, inferring the schema using reflection.

In [7]:
import sqlContext.implicits._

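// The case class fields supply the column types through reflection
case class Movie(title: String, year: Int)
val rdd = sc.parallelize(Array(Movie("Pirates of the Caribbean: On Stranger Tides", 2011)))
// toDF comes from sqlContext.implicits._; the arguments name the columns
val df = rdd.toDF("title", "year")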
df.show()

+--------------------+----+
|               title|year|
+--------------------+----+
|Pirates of the Ca...|2011|
+--------------------+----+
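
As a minimal sketch of what reflection contributes, the cell below calls `toDF()` with no arguments, so both the column names and the types come from the case class. It assumes the notebook's `sc` and `sqlContext` are available; `MovieRecord`, `movieRdd`, and `inferredDf` are hypothetical names chosen to avoid clashing with the cell above.

```scala
import sqlContext.implicits._

// Hypothetical case class; its field names and types become the schema
case class MovieRecord(title: String, year: Int)

val movieRdd = sc.parallelize(Seq(MovieRecord("Pirates of the Caribbean: On Stranger Tides", 2011)))

// No column names passed: they are taken from the case class fields
val inferredDf = movieRdd.toDF()

// Print the inferred column names and types
inferredDf.printSchema()
```

Passing names to `toDF`, as in the cell above, only renames the columns; the types still come from the case class.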



Now let's create a dataframe specifying the schema programmatically.

In [12]:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

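// Wrap each tuple in a Row so it can be paired with an explicit schema
val rdd = sc.parallelize(Array(("Pirates of the Caribbean: On Stranger Tides", 2011))).map { case (t, y) => Row(t, y) }

// One StructField per column: name, type, nullable
val schema = StructType(List(
  StructField("title", StringType, false),
  StructField("year", IntegerType, false)
))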

val df = sqlContext.createDataFrame(rdd, schema)
df.show()

+--------------------+----+
|               title|year|
+--------------------+----+
|Pirates of the Ca...|2011|
+--------------------+----+
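
A programmatic schema is mainly useful when the columns are only known at runtime. The sketch below builds the same schema from a list of name and type pairs; it assumes the notebook's `sc` and `sqlContext`, and `columns`, `runtimeSchema`, `rowRdd`, and `runtimeDf` are hypothetical names introduced for illustration.

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Hypothetical column description, e.g. loaded from configuration at runtime
val columns = Seq(("title", StringType), ("year", IntegerType))

// Build one non-nullable StructField per (name, type) pair
val runtimeSchema = StructType(columns.map { case (name, dataType) => StructField(name, dataType, nullable = false) })

val rowRdd = sc.parallelize(Seq(Row("Pirates of the Caribbean: On Stranger Tides", 2011)))

// Pair the rows with the schema that was assembled above
val runtimeDf = sqlContext.createDataFrame(rowRdd, runtimeSchema)
runtimeDf.printSchema()
```

The values in each `Row` must line up, in order and type, with the fields of the schema.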



Next, create a dataframe from a SQL query. The column types are inferred from the table, so no schema has to be declared, which makes this a convenient method to use.

In [18]:
val df = sqlContext.sql("SELECT * FROM killr_video.videos")
df.show()

+-----------+----------+--------------------+--------------------+-----------+--------------------+------------+--------------------+-----------+
|   video_id|avg_rating|         description|              genres|mpaa_rating|        release_date|release_year|               title|    user_id|
+-----------+----------+--------------------+--------------------+-----------+--------------------+------------+--------------------+-----------+
|[B@310da662|       7.0|After being wrong...|  ArrayBuffer(Crime)|          R|2005-09-09 00:00:...|        2005|Green Street Hool...|[B@7d7c83b0|
|[B@1a3e2122|       5.8|Paulie, an intell...| ArrayBuffer(Family)|         PG|1998-04-17 00:00:...|        1998|              Paulie|[B@3665e379|
|[B@6a876d61|       6.0|A Reno singer wit...| ArrayBuffer(Comedy)|         PG|1992-05-28 00:00:...|        1992|          Sister Act|[B@737f1282|
|[B@18a31afe|       6.1|After a lightning...|ArrayBuffer(Famil...|         PG|1986-05-09 00:00:...|        1986|       Short
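
The query does not have to select every column. As a hedged sketch, the cell below projects two of the columns shown in the output above and filters on `release_year`; it assumes the exercise's `killr_video.videos` table is reachable from `sqlContext`, and `videos2005` is a hypothetical name.

```scala
// Assumes the exercise's killr_video.videos table is reachable from sqlContext
val videos2005 = sqlContext.sql("SELECT title, release_year FROM killr_video.videos WHERE release_year = 2005")

// The column types come from the table definition, not from the query text
videos2005.printSchema()
videos2005.show()
```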

Lastly, let's create a dataframe using the dataframe reader, `sqlContext.read`.

In [22]:
val df = sqlContext.read.format("org.apache.spark.sql.cassandra").options(Map("keyspace" -> "killr_video", "table" -> "videos")).load

df.show()

+--------------------+----------+--------------------+--------------------+-----------+--------------------+------------+--------------------+--------------------+
|            video_id|avg_rating|         description|              genres|mpaa_rating|        release_date|release_year|               title|             user_id|
+--------------------+----------+--------------------+--------------------+-----------+--------------------+------------+--------------------+--------------------+
|ece8de8f-a5e2-11e...|       7.0|After being wrong...|  ArrayBuffer(Crime)|          R|2005-09-09 00:00:...|        2005|Green Street Hool...|6b234a61-faa6-4b4...|
|ecf288d1-a5e2-11e...|       5.8|Paulie, an intell...| ArrayBuffer(Family)|         PG|1998-04-17 00:00:...|        1998|              Paulie|6b234a61-faa6-4b4...|
|ece73c02-a5e2-11e...|       6.0|A Reno singer wit...| ArrayBuffer(Comedy)|         PG|1992-05-28 00:00:...|        1992|          Sister Act|6b234a61-faa6-4b4...|
|ece77fe6-a5e2-1