# Using DataFrames in Spark
DataFrames are the standard data structure in Spark 2.0 and later. They offer a consistent way to work with data in any Spark-supported language (such as Python, Scala, R, and Java), and they also support the Spark SQL API so you can query and manipulate data using SQL syntax.

In this lab exercise, you'll use DataFrames to explore some data from the United Kingdom Government Department for Transport that includes details of road traffic accidents in 2016 (you'll find more of this data and related documentation at https://data.gov.uk/dataset/cb7ae6f0-4be6-4935-9277-47e5ce24a11f/road-safety-data).

## Read a DataFrame from a File
After uploading the data files for this lab to your Azure storage account, adapt the code below to read the *Accidents.csv* file from your account into a DataFrame by replacing ***ACCOUNT_NAME*** with the name of your storage account:

In [2]:
textFile = spark.read.text('wasb://spark@ACCOUNT_NAME.blob.core.windows.net/data/Accidents.csv')
textFile.printSchema()

View the output returned, which describes the schema of the DataFrame. Note that the file content has been loaded into a DataFrame with a single column named **value**.

Let's take a look at the first ten lines of the text file:

In [4]:
textFile.show(10, truncate = False)

The file seems to contain comma-separated values, with the column header names in the first line.

You can use the **spark.read.csv** function to read a CSV file and infer the schema from its contents. Adapt the following code to use your storage account and run it to see the schema it infers:

In [6]:
accidents = spark.read.csv('wasb://spark@ACCOUNT_NAME.blob.core.windows.net/data/Accidents.csv', header=True, inferSchema=True)
accidents.printSchema()

Now let's look at the first ten rows of data:

In [8]:
accidents.show(10)

Inferring the schema makes it easy to read structured data files into a DataFrame containing multiple columns. However, it incurs a performance overhead because Spark must read the data to work out the types, and in some cases you may want specific control over column names or data types. For example, use the following code to define the schema of the *Vehicles.csv* file:

In [10]:
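# Import only the type classes that the schema definition uses
from pyspark.sql.types import StructType, StructField, StringType, IntegerType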

vehicle_schema = StructType([
  StructField("Accident_Index", StringType(), False),
  StructField("Vehicle_Reference", IntegerType(), False),
  StructField("Vehicle_Type", IntegerType(), False),
  StructField("Towing_and_Articulation", StringType(), False),
  StructField("Vehicle_Manoeuvre", IntegerType(), False),
  StructField("Vehicle_Location-Restricted_Lane", IntegerType(), False),
  StructField("Junction_Location", IntegerType(), False),
  StructField("Skidding_and_Overturning", IntegerType(), False),
  StructField("Hit_Object_in_Carriageway", IntegerType(), False),
  StructField("Vehicle_Leaving_Carriageway", IntegerType(), False),
  StructField("Hit_Object_off_Carriageway", IntegerType(), False),
  StructField("1st_Point_of_Impact", IntegerType(), False),
  StructField("Was_Vehicle_Left_Hand_Drive?", IntegerType(), False),
  StructField("Journey_Purpose_of_Driver", IntegerType(), False),
  StructField("Sex_of_Driver", IntegerType(), False),
  StructField("Age_of_Driver", IntegerType(), False),
  StructField("Age_Band_of_Driver", IntegerType(), False),
  StructField("Engine_Capacity_(CC)", IntegerType(), False),
  StructField("Propulsion_Code", IntegerType(), False),
  StructField("Age_of_Vehicle", IntegerType(), False),
  StructField("Driver_IMD_Decile", IntegerType(), False),
  StructField("Driver_Home_Area_Type", IntegerType(), False),
  StructField("Vehicle_IMD_Decile", IntegerType(), False)
])

print(vehicle_schema.simpleString())

Now you can use the **spark.read.csv** function with the **schema** argument to load the data from the file based on the schema you have defined.

Adapt the following code to read the *Vehicles.csv* file from your storage account and verify that its schema matches the one you defined:

In [12]:
vehicles = spark.read.csv('wasb://spark@ACCOUNT_NAME.blob.core.windows.net/data/Vehicles.csv', schema=vehicle_schema, header=True)
vehicles.printSchema()

Once again, let's take a look at the first ten rows of data:

In [14]:
vehicles.show(10)

## Use DataFrame Methods
The DataFrame class provides numerous properties and methods that you can use to work with data.

For example, run the code in the following cell to use the **select** method. This creates a new dataframe that contains specific columns from an existing dataframe:

In [16]:
vehicle_driver = vehicles.select('Accident_Index', 'Vehicle_Reference', 'Vehicle_Type', 'Age_of_Vehicle', 'Sex_of_Driver', 'Age_of_Driver', 'Age_Band_of_Driver')
vehicle_driver.show()


The **filter** method creates a new dataframe that contains only the rows of an existing dataframe that match a specified condition; rows that don't match are left out:
> *Note: The code imports the **pyspark.sql.functions** library so we can use the **col** function to specify a particular column.*

In [18]:
# Import only col; a wildcard import would shadow Python built-ins such as sum and max
from pyspark.sql.functions import col

drivers = vehicle_driver.filter(col('Age_Band_of_Driver') != -1)
drivers.show()


You can chain multiple operations together into a single statement. For example, the following code uses the **select** method to define a subset of the **accidents** dataframe, and chains the output of that to the **join** method, which creates a new dataframe by combining the columns from two dataframes based on a common key field:

In [20]:
driver_accidents = accidents.select('Accident_Index', 'Accident_Severity', 'Speed_Limit', 'Weather_Conditions').join(drivers, 'Accident_Index')
driver_accidents.show()

## Use the Spark SQL API
The Spark SQL API enables you to use SQL syntax to query dataframes that have been persisted as temporary or global tables.
For example, run the following cell to save the driver accident data as a temporary table, and then use the **spark.sql** function to query it using a SQL expression:

In [22]:
driver_accidents.createOrReplaceTempView('tmp_accidents')

q = spark.sql("SELECT * FROM tmp_accidents WHERE Speed_Limit > 50")
q.show()

When using a notebook to work with your data, you can use the **%sql** *magic* to embed SQL code directly into the notebook. For example, run the following cell to use a SQL query to filter, aggregate, and group accident data from the temporary table you created:

In [24]:
%sql
SELECT Vehicle_Type, Age_Band_of_Driver, COUNT(*) AS Accidents
FROM tmp_accidents
WHERE Vehicle_Type <> -1
GROUP BY Vehicle_Type, Age_Band_of_Driver
ORDER BY Vehicle_Type, Age_Band_of_Driver


Databricks notebooks include built-in data visualization tools that you can use to make sense of your query results. For example, perform the following steps with the table of results returned by the query above to view the data as a bar chart:
1. In the drop-down list for the chart type, select **Bar**.
2. Click **Plot Options...**.
3. Apply the following plot options:
    - **Keys**: Age_Band_of_Driver
    - **Series groupings**: Vehicle_Type
    - **Values**: Accidents
    - **Stacked**: Selected
    - **Aggregation**: SUM
    - **Display type**: Bar chart

View the resulting chart (you can resize it by dragging the handle at the bottom-right) and note that the largest number of accidents involves drivers in age band **6** and vehicle type **9**. The UK Department for Transport publishes a lookup table for these variables at http://data.dft.gov.uk/road-accidents-safety-data/Road-Accident-Safety-Data-Guide.xls, which indicates that these values correspond to drivers aged between *26* and *35* in *cars*.

Temporary tables exist only within the current session, which makes them a good fit for interactive analytics: you can explore the data and have it discarded automatically when the session ends. If you want to persist the data for future analysis, or to share it with other data processing applications in different sessions, you can save the dataframe as a global table.

Run the following cell to save the data as a global table and query it.

In [27]:
driver_accidents.write.mode('overwrite').saveAsTable("accidents")
q = spark.sql("SELECT * FROM accidents WHERE Speed_Limit > 50")
q.show()

On the left of the screen, click the **Data** tab, and in the **default** database note that a table named **accidents** has been created.