# Querying Files with Dataframes

Apache Spark&trade; and Databricks&reg; allow you to use DataFrames to query large data files.

### Getting Started

Run the following cell to configure our "classroom."

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> Remember to attach your notebook to a cluster; click  <b>Detached</b> in the upper left hand corner and then select your preferred cluster.

<img src="https://files.training.databricks.com/images/eLearning/attach-to-cluster.png" style="border: 1px solid #aaa; border-radius: 10px 10px 10px 10px; box-shadow: 5px 5px 5px #aaa"/>

In [3]:
%run "./Includes/Classroom-Setup"

### Introducing DataFrames

Under the covers, DataFrames are derived from data structures known as Resilient Distributed Datasets (RDDs). RDDs and DataFrames are immutable distributed collections of data. Let's take a closer look at what some of these terms mean before we understand how they relate to DataFrames:

* **Resilient**: They are fault tolerant, so if part of your operation fails, Spark quickly recovers the lost computation.
* **Distributed**: RDDs are distributed across networked machines known as a cluster.
* **DataFrame**: A data structure where data is organized into named columns, like a table in a relational database, but with richer optimizations under the hood. 

Without the named columns and declared types provided by a schema, Spark wouldn't know how to optimize the execution of any computation. Since DataFrames have a schema, they use the Catalyst Optimizer to determine the optimal way to execute your code.

DataFrames were invented because the business community uses tables in a relational database, Pandas or R DataFrames, or Excel worksheets. A Spark DataFrame is conceptually equivalent to these, with richer optimizations under the hood and the benefit of being distributed across a cluster.

#### Interacting with DataFrames

Once created (instantiated), a DataFrame object has methods attached to it. Methods are operations one can perform on DataFrames such as filtering,
counting, aggregating and many others.

> <b>Example</b>: To create (instantiate) a DataFrame, use this syntax: `df = ...`

To display the contents of the DataFrame, apply a `show` operation (method) on it using the syntax `df.show()`. 

The `.` indicates you are *applying a method on the object*.

In working with DataFrames, it is common to chain operations together, such as: `df.select().filter().orderBy()`.  

By chaining operations together, you don't need to save intermediate DataFrames into local variables (thereby avoiding extra variable names in your code).

Also note that you do not have to worry about how to order operations, because the optimizer determines the optimal order of execution for you.

`df.select(...).orderBy(...).filter(...)`

versus

`df.filter(...).select(...).orderBy(...)`
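The two orderings above can be sketched in plain Python, with no Spark required. This is only an illustration of why reordering is safe: the helper functions `select`, `where`, and `order_by` are hypothetical stand-ins for the DataFrame methods, and the rows are taken from the `people-10m` sample used in this lesson.

```python
# A list of dictionaries stands in for a DataFrame (assumption for illustration).
rows = [
    {"firstName": "Pennie", "gender": "F", "salary": 56172},
    {"firstName": "An", "gender": "F", "salary": 40203},
    {"firstName": "Quyen", "gender": "F", "salary": 53417},
    {"firstName": "Coralie", "gender": "F", "salary": 94727},
    {"firstName": "Terrie", "gender": "F", "salary": 79908},
]


def select(data, *names):
    """Keep only the named columns of every row."""
    return [{name: row[name] for name in names} for row in data]


def where(data, predicate):
    """Keep only the rows for which the predicate is true."""
    return [row for row in data if predicate(row)]


def order_by(data, name):
    """Sort rows by one column, ascending."""
    return sorted(data, key=lambda row: row[name])


def high_paid(row):
    return row["salary"] > 55000


# Ordering 1: select, then sort, then filter.
plan_a = where(order_by(select(rows, "firstName", "salary"), "salary"), high_paid)

# Ordering 2: filter, then select, then sort.
plan_b = order_by(select(where(rows, high_paid), "firstName", "salary"), "salary")

print(plan_a == plan_b)
print([row["firstName"] for row in plan_b])
```

Both orderings return the same rows, but the second one sorts fewer of them. That is the kind of rearrangement an optimizer can make on your behalf.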

#### DataFrames and SQL

DataFrame syntax is more flexible than SQL syntax. Here we illustrate general usage patterns of SQL and DataFrames.

Suppose we have a data set that we loaded as a table called `myTable` and an equivalent DataFrame called `df`.
It has three columns: `col_1` (numeric type), `col_2` (string type), and `col_3` (timestamp type).
Here are basic SQL operations and their DataFrame equivalents.

Notice that columns in DataFrames are referenced by `col("<columnName>")`.

| SQL                                         | DataFrame (Python)                    |
| ------------------------------------------- | ------------------------------------- | 
| `SELECT col_1 FROM myTable`                 | `df.select(col("col_1"))`             | 
| `DESCRIBE myTable`                          | `df.printSchema()`                    | 
| `SELECT * FROM myTable WHERE col_1 > 0`     | `df.filter(col("col_1") > 0)`         | 
| `..GROUP BY col_2`                          | `..groupBy(col("col_2"))`             | 
| `..ORDER BY col_2`                          | `..orderBy(col("col_2"))`             | 
| `..WHERE year(col_3) > 1990`                | `..filter(year(col("col_3")) > 1990)` | 
| `SELECT * FROM myTable LIMIT 10`            | `df.limit(10)`                        |
| `display(myTable)` (text format)            | `df.show()`                           | 
| `display(myTable)` (html format)            | `display(df)`                         |

<img alt="Hint" title="Hint" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.3em" src="https://files.training.databricks.com/static/images/icon-light-bulb.svg"/>&nbsp;**Hint:** You can also run SQL queries with the special syntax `spark.sql("SELECT * FROM myTable")`

In this course you will see many other uses of DataFrames. Working out the SQL equivalents is left to you
(as exercises in some cases).

### Querying Data 
This lesson uses the `people-10m` data set, which is in Parquet format.

The data is fictitious; in particular, the Social Security numbers are fake.

Run the command below to list the part files that make up the `people-10m.parquet` data set.

In [9]:
%fs ls /mnt/training/dataframes/people-10m.parquet

path,name,size
dbfs:/mnt/training/dataframes/people-10m.parquet/._SUCCESS.crc,._SUCCESS.crc,8
dbfs:/mnt/training/dataframes/people-10m.parquet/.part-00000-04bbf354-608c-43f2-9a81-7df139d1af69.snappy.parquet.crc,.part-00000-04bbf354-608c-43f2-9a81-7df139d1af69.snappy.parquet.crc,220072
dbfs:/mnt/training/dataframes/people-10m.parquet/.part-00001-04bbf354-608c-43f2-9a81-7df139d1af69.snappy.parquet.crc,.part-00001-04bbf354-608c-43f2-9a81-7df139d1af69.snappy.parquet.crc,217008
dbfs:/mnt/training/dataframes/people-10m.parquet/.part-00002-04bbf354-608c-43f2-9a81-7df139d1af69.snappy.parquet.crc,.part-00002-04bbf354-608c-43f2-9a81-7df139d1af69.snappy.parquet.crc,217012
dbfs:/mnt/training/dataframes/people-10m.parquet/.part-00003-04bbf354-608c-43f2-9a81-7df139d1af69.snappy.parquet.crc,.part-00003-04bbf354-608c-43f2-9a81-7df139d1af69.snappy.parquet.crc,217356
dbfs:/mnt/training/dataframes/people-10m.parquet/.part-00004-04bbf354-608c-43f2-9a81-7df139d1af69.snappy.parquet.crc,.part-00004-04bbf354-608c-43f2-9a81-7df139d1af69.snappy.parquet.crc,216656
dbfs:/mnt/training/dataframes/people-10m.parquet/.part-00005-04bbf354-608c-43f2-9a81-7df139d1af69.snappy.parquet.crc,.part-00005-04bbf354-608c-43f2-9a81-7df139d1af69.snappy.parquet.crc,216656
dbfs:/mnt/training/dataframes/people-10m.parquet/.part-00006-04bbf354-608c-43f2-9a81-7df139d1af69.snappy.parquet.crc,.part-00006-04bbf354-608c-43f2-9a81-7df139d1af69.snappy.parquet.crc,216656
dbfs:/mnt/training/dataframes/people-10m.parquet/.part-00007-04bbf354-608c-43f2-9a81-7df139d1af69.snappy.parquet.crc,.part-00007-04bbf354-608c-43f2-9a81-7df139d1af69.snappy.parquet.crc,206860
dbfs:/mnt/training/dataframes/people-10m.parquet/_SUCCESS,_SUCCESS,0


In [10]:
peopleDF = spark.read.parquet("/mnt/training/dataframes/people-10m.parquet")
display(peopleDF)

id,firstName,middleName,lastName,gender,birthDate,ssn,salary
1,Pennie,Carry,Hirschmann,F,1955-07-02T04:00:00.000+0000,981-43-9345,56172
2,An,Amira,Cowper,F,1992-02-08T05:00:00.000+0000,978-97-8086,40203
3,Quyen,Marlen,Dome,F,1970-10-11T04:00:00.000+0000,957-57-8246,53417
4,Coralie,Antonina,Marshal,F,1990-04-11T04:00:00.000+0000,963-39-4885,94727
5,Terrie,Wava,Bonar,F,1980-01-16T05:00:00.000+0000,964-49-8051,79908
6,Chassidy,Concepcion,Bourthouloume,F,1990-11-24T05:00:00.000+0000,954-59-9172,64652
7,Geri,Tambra,Mosby,F,1970-12-19T05:00:00.000+0000,968-16-4020,38195
8,Patria,Nancy,Arstall,F,1985-01-02T05:00:00.000+0000,984-76-3770,102053
9,Terese,Alfredia,Tocque,F,1967-11-17T05:00:00.000+0000,967-48-7309,91294
10,Wava,Lyndsey,Jeandon,F,1963-12-30T05:00:00.000+0000,997-82-2946,56521


Take a look at the schema with the `printSchema` method. This tells you the field name, field type, and whether the column is nullable or not (default is true).

In [12]:
peopleDF.printSchema()

Answer the following question:
> According to our data, which women were born after 1990?

Use the DataFrame `select` and `filter` methods.

In [14]:
from pyspark.sql.functions import year
display(peopleDF 
  .select("firstName","middleName","lastName","birthDate","gender") 
  .filter("gender = 'F'") 
  .filter(year("birthDate") > 1990)
)

firstName,middleName,lastName,birthDate,gender
An,Amira,Cowper,1992-02-08T05:00:00.000+0000,F
Caroyln,Mamie,Cardon,1994-05-15T04:00:00.000+0000,F
Yesenia,Eileen,Goldring,1997-07-09T04:00:00.000+0000,F
Hedwig,Dulcie,Pendleberry,1998-12-02T05:00:00.000+0000,F
Kala,Violeta,Lyfe,1994-06-23T04:00:00.000+0000,F
Gussie,India,McKeeman,1991-11-15T05:00:00.000+0000,F
Pansy,Suzie,Shrieves,1991-05-24T04:00:00.000+0000,F
Chung,Dian,Dautry,1998-01-12T05:00:00.000+0000,F
Erica,Louvenia,O'Drought,1991-03-08T05:00:00.000+0000,F
Katelyn,Merrie,Pocklington,1994-01-16T05:00:00.000+0000,F
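To make the filter logic concrete, here is a plain-Python sketch of the same predicate applied to four rows from the sample output above. It is not Spark code: the `birth_year` helper is a hypothetical stand-in for the `year` function, and the timestamp format string is an assumption based on how the values are displayed in this notebook.

```python
from datetime import datetime

# (firstName, gender, birthDate) tuples copied from the sample output.
sample = [
    ("An", "F", "1992-02-08T05:00:00.000+0000"),
    ("Coralie", "F", "1990-04-11T04:00:00.000+0000"),
    ("Chassidy", "F", "1990-11-24T05:00:00.000+0000"),
    ("Patria", "F", "1985-01-02T05:00:00.000+0000"),
]


def birth_year(timestamp):
    """Extract the year as an integer, like year("birthDate") does per row."""
    return datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%S.%f%z").year


# "After 1990" is a strict comparison against an integer year,
# so people born during 1990 are excluded.
born_after_1990 = [
    name
    for name, gender, birth_date in sample
    if gender == "F" and birth_year(birth_date) > 1990
]

print(born_after_1990)
```

Only An passes the filter here, because Coralie and Chassidy were born in 1990 itself.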


### Built-In Functions

Spark provides a number of <a href="https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$" target="_blank">built-in functions</a>, many of which can be used directly with DataFrames.  Use these functions in the `filter` expressions to filter data and in `select` expressions to create derived columns.

The following DataFrame statement finds women born after 1990; it uses the `year` function, and it creates a `birthYear` column on the fly.

In [16]:
display(peopleDF
  .select("firstName","middleName","lastName",year("birthDate").alias("birthYear"),"salary")
  .filter(year("birthDate") > 1990)
  .filter("gender = 'F' "))

firstName,middleName,lastName,birthYear,salary
An,Amira,Cowper,1992,40203
Caroyln,Mamie,Cardon,1994,60449
Yesenia,Eileen,Goldring,1997,73060
Hedwig,Dulcie,Pendleberry,1998,60857
Kala,Violeta,Lyfe,1994,101601
Gussie,India,McKeeman,1991,46945
Pansy,Suzie,Shrieves,1991,73811
Chung,Dian,Dautry,1998,47190
Erica,Louvenia,O'Drought,1991,80113
Katelyn,Merrie,Pocklington,1994,77925


### Visualization

Databricks provides easy-to-use, built-in visualizations for your data. 

Display the data by invoking the Databricks `display` function.

Visualize the query below by selecting the bar graph icon once the table is displayed:

<img src="https://files.training.databricks.com/images/eLearning/visualization-1.png" style="border: 1px solid #aaa; padding: 10px; border-radius: 10px 10px 10px 10px"/>

How many women named Mary were born in each year?

In [19]:
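marysDF = (peopleDF
  .filter("firstName = 'Mary'")   # filter while firstName and gender are still available
  .filter("gender = 'F'")
  .select(year("birthDate").alias("birthYear"))
  .groupBy("birthYear")
  .count()
  .orderBy("birthYear")           # sort the aggregated result, not the input rows
)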

To start the visualization process, first apply the `display` function to the DataFrame. 

Next, click the graph button in the bottom left corner (second from left) to display data in different ways.

The data initially shows up in HTML format as an `n x 2` table, where one column is `birthYear` and the other is `count`.

In [21]:
display(marysDF)
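To picture the two-column result that `groupBy("birthYear").count()` produces, here is a plain-Python sketch that uses `collections.Counter` as a stand-in for the grouping step. The ten birth years are copied from the `birthYear` column in the Built-In Functions output earlier in this lesson, so the counts describe that ten-row sample only, not the full data set.

```python
from collections import Counter

# birthYear values from the ten sample rows shown earlier in the lesson.
birth_years = [1992, 1994, 1997, 1998, 1994, 1991, 1991, 1998, 1991, 1994]

# Group identical years together and count the rows in each group.
counts = Counter(birth_years)

# Sort by year, which mirrors ordering the aggregated result by birthYear.
result = sorted(counts.items())

print("birthYear,count")
for birth_year, total in result:
    print(f"{birth_year},{total}")
```

Each distinct year becomes one row, which is why the result has as many rows as there are distinct birth years and exactly two columns.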

Compare the popularity of two names among women born after 1990.

In [23]:
from pyspark.sql.functions import col
dordonDF = (peopleDF
  .filter((col("firstName") == 'Donna') | (col("firstName") == 'Dorothy'))
  .filter("gender = 'F'")
  .filter(year("birthDate") > 1990)
  .select(year("birthDate").alias("birthYear"), "firstName")
  .groupBy("birthYear", "firstName")
  .count()
  .orderBy("birthYear", "firstName")  # sort the aggregated result
)
display(dordonDF)

### Temporary Views

A <b>temporary view</b> makes a DataFrame available to SQL, so that you can query it with ordinary SQL syntax.

A temporary view gives you a name to query from SQL, but unlike a table it exists only for the duration of your Spark Session. As a result, the temporary view will not carry over when you restart the cluster or switch to a new notebook. It also won't show up in the Data button on the menu on the left side of a Databricks notebook which provides easy access to databases and tables.

The statement in the following cell creates a temporary view containing the same data.

In [25]:
peopleDF.createOrReplaceTempView("People10M")

To view the contents of the temporary view, use a `SELECT` statement.

In [27]:
display(spark.sql("SELECT * FROM  People10M where firstName = 'Donna' "))

Create a DataFrame with a more specific query.

In [29]:
womenBornAfter1990DF = (peopleDF 
  .select("firstName", "middleName", "lastName",year("birthDate").alias("birthYear"), "salary") 
  .filter(year("birthDate") > 1990) 
  .filter("gender = 'F' ") 
)
display(womenBornAfter1990DF)

Create a temporary view from the `womenBornAfter1990DF` DataFrame.

In [31]:
womenBornAfter1990DF.createOrReplaceTempView("womenBornAfter1990")

Once a temporary view has been created, it can be queried as if it were a table. 

Find out how many Marys are in the `womenBornAfter1990` temporary view.

In [33]:
display(spark.sql("SELECT count(*) FROM womenBornAfter1990 where firstName = 'Mary' "))

## Exercise 1

Create a DataFrame called `top10FemaleFirstNamesDF` that contains the 10 most common female first names in the people data set. It should have two columns:

* `firstName` - the first name
* `total` - the total number of rows with that first name

<img alt="Hint" title="Hint" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.3em" src="https://files.training.databricks.com/static/images/icon-light-bulb.svg"/>&nbsp;**Hint:** 
* You may need to break ties by using firstName because some of the totals are identical.
* To restrict the number of names to 10, you need to use the `limit(10)` method.
* You also need to use the `agg()` method to do a count of `firstName` and give it an alias.
* The `agg()` method is applied after the `groupBy` since it requires column values to be collected in some fashion.
* You will need to import the `count` and `desc` functions in Scala or Python, as appropriate.

Display the results.

### Step 1
Create a DataFrame called `top10FemaleFirstNamesDF` and display the results.

In [36]:
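# TODO
from pyspark.sql.functions import count, desc

# Ties on `total` are not broken here, so the order among equally common names is
# not deterministic. Add "firstName" as a second sort key (see the hint) to fix that.
top10FemaleFirstNamesDF = (peopleDF
  .filter("gender = 'F'")
  .groupBy("firstName")
  .agg(count("firstName").alias("total"))
  .sort(desc("total"))
  .limit(10)
)

display(top10FemaleFirstNamesDF)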

firstName,total
Sharyn,1394
Lashell,1387
Lucille,1384
Alice,1384
Louie,1382
Jacquelyn,1381
Cristen,1375
Katherin,1373
Bridgette,1373
Thresa,1368
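The hint about breaking ties can be illustrated in plain Python with the totals shown above. This is a sketch of the sorting logic only, not Spark code: the sort key orders by `total` descending and then by `firstName` ascending, so names with identical totals always come out in the same order.

```python
# (firstName, total) pairs copied from the output above.
totals = [
    ("Sharyn", 1394), ("Lashell", 1387), ("Lucille", 1384), ("Alice", 1384),
    ("Louie", 1382), ("Jacquelyn", 1381), ("Cristen", 1375), ("Katherin", 1373),
    ("Bridgette", 1373), ("Thresa", 1368),
]

# Negating the total sorts it in descending order; the name is the tie-breaker.
ranked = sorted(totals, key=lambda row: (-row[1], row[0]))

for first_name, total in ranked:
    print(f"{first_name},{total}")
```

With the tie-breaker in place, Alice is listed before Lucille and Bridgette before Katherin, regardless of the order in which the rows arrived.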


In [37]:
top10FemaleNamesDF = top10FemaleFirstNamesDF.orderBy("firstName")

display(top10FemaleNamesDF)

firstName,total
Alice,1384
Bridgette,1373
Cristen,1375
Jacquelyn,1381
Katherin,1373
Lashell,1387
Louie,1382
Lucille,1384
Sharyn,1394
Thresa,1368


In [38]:
from pyspark.sql import Row
results = top10FemaleNamesDF.collect()

dbTest("DF-L2-names-0", Row(firstName=u"Alesha",    total=1368), results[0])  
dbTest("DF-L2-names-1", Row(firstName=u"Alice",     total=1384), results[1])
dbTest("DF-L2-names-2", Row(firstName=u"Bridgette", total=1373), results[2])
dbTest("DF-L2-names-3", Row(firstName=u"Cristen",   total=1375), results[3])
dbTest("DF-L2-names-4", Row(firstName=u"Jacquelyn", total=1381), results[4])
dbTest("DF-L2-names-5", Row(firstName=u"Katherin",  total=1373), results[5])
dbTest("DF-L2-names-6", Row(firstName=u"Lashell",   total=1387), results[6])
dbTest("DF-L2-names-7", Row(firstName=u"Louie",     total=1382), results[7])
dbTest("DF-L2-names-8", Row(firstName=u"Lucille",   total=1384), results[8])
dbTest("DF-L2-names-9", Row(firstName=u"Sharyn",    total=1394), results[9]) 

print("Tests passed!")

### Step 2

Convert the DataFrame to a temporary view and display the contents of the temporary view.

In [40]:
# TODO
top10FemaleFirstNamesDF.createOrReplaceTempView("Top10FemaleFirstNames")
# Query the temporary view itself rather than the original DataFrame
resultsDF = spark.sql("SELECT firstName FROM Top10FemaleFirstNames")
display(resultsDF)

firstName
Sharyn
Lashell
Lucille
Alice
Louie
Jacquelyn
Cristen
Katherin
Bridgette
Thresa


## Summary
* Spark DataFrames can be used to query data sets.
* Visualize the results of your queries with built-in Databricks graphs.

## Review Questions

**Q:** How do you *create* a DataFrame object?  
**A:** Assign the result of a read or a transformation to a variable, for example `myDataFrameDF = spark.read.parquet(...)`.

**Q:** What methods (operations) can you perform on a DataFrame object?  
**A:** The full list is here: <a href="http://spark.apache.org/docs/2.0.0/api/python/pyspark.sql.html" target="_blank"> pyspark.sql module</a>

**Q:** Why do you chain methods (operations) `myDataFrameDF.select().filter().groupBy()`?  
**A:** To avoid the creation of temporary DataFrames as local variables. 

For example, you could have written the above as: `tempDF1 = myDataFrameDF.select()`, `tempDF2 = tempDF1.filter()` and
then `tempDF2.groupBy()`.

This is functionally equivalent, but notice how you now have extra local variables.

**Q:** What is the DataFrame syntax to create a temporary view?    
**A:** ```myDF.createOrReplaceTempView("MyTempView")```

## Next Steps

Start the next lesson, [Aggregations, JOINs and Nested Queries]($./03-Joins-Aggregations ).

## Additional Topics & Resources

* <a href="https://spark.apache.org/docs/latest/sql-programming-guide.html" target="_blank">Spark SQL, DataFrames and Datasets Guide</a>