
# Use common DataFrame methods

In the previous notebook, you finished by executing a count of the records in a DataFrame. We will now build on that concept by introducing common DataFrame methods.

**Technical Accomplishments:**
* Develop familiarity with the `DataFrame` APIs
* Use common DataFrame methods for performance
* Explore the Spark API documentation

##![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Getting Started

Run the following cell to configure our "classroom."

In [0]:
%run "./Includes/Classroom-Setup"

Prepare the data source.

In [0]:
(source, sasEntity, sasToken) = getAzureDataSource()

spark.conf.set(sasEntity, sasToken)

Create the DataFrame. This is the same one we created in the previous notebook.

In [0]:
parquetDir = source + "/wikipedia/pagecounts/staging_parquet_en_only_clean/"

In [0]:
pagecountsEnAllDF = (spark  # Our SparkSession & Entry Point
  .read                     # Our DataFrameReader
  .parquet(parquetDir)      # Returns an instance of DataFrame
)
print(pagecountsEnAllDF)    # Python hack to see the data type

Execute a count on the DataFrame as we did at the end of the previous notebook.

In [0]:
total = pagecountsEnAllDF.count()

print("Record Count: {0:,}".format( total ))

That tells us that there are around 2 million rows in the `DataFrame`. 
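
The `{0:,}` format specifier used in the cell above inserts thousands separators into the count. Here is a minimal sketch in plain Python; `sample_total` is a made-up value for illustration, not the actual record count.

```python
# Hypothetical value for illustration only; not the real record count.
sample_total = 1234567

# "{0:,}" formats positional argument 0 with a comma as the thousands separator.
formatted = "Record Count: {0:,}".format(sample_total)
print(formatted)  # Record Count: 1,234,567

# An f-string with the same format spec produces the same text.
same_text = f"Record Count: {sample_total:,}"
print(same_text == formatted)  # True
```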

Before we take a closer look at the contents of the `DataFrame`, let us introduce a technique that speeds up processing.  

##![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) cache() & persist()

The ability to cache data is one technique for achieving better performance with Apache Spark. 

Without caching, every action requires Spark to read the data from its source (Azure Blob, Amazon S3, HDFS, etc.). Caching moves that data into the memory of the executors for "instant" access.

`cache()` is shorthand for `persist()` with the default storage level.

In [0]:
(pagecountsEnAllDF
  .cache()         # Mark the DataFrame as cached
  .count()         # Materialize the cache
) 

If you re-run that command, it should take significantly less time.

In [0]:
pagecountsEnAllDF.count()


## Performance considerations of caching data

When you cache data, you place it on the workers of the cluster.

Caching takes resources. Before moving a notebook into production, verify that you are using the cache appropriately.

As a quick side note, you can remove a cache by calling the `DataFrame`'s `unpersist()` method, but it is not necessary.

##![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Our Data

Let's continue by taking a look at the type of data we have. 

We can do this with the `printSchema()` command:

In [0]:
pagecountsEnAllDF.printSchema()

We should now be able to see that we have four columns of data:
* **project** (*string*): The name of the Wikipedia project. This will include values such as:
  * **en**: The English version of Wikipedia.
  * **fr**: The French version of Wikipedia.
  * **en.d**: The English version of Wiktionary.
  * **fr.b**: The French version of Wikibooks.
  * **de.n**: The German version of Wikinews.
* **article** (*string*): The name of the article in the corresponding project. This will include values such as:
  * <a href="https://en.wikipedia.org/wiki/Apache_Spark" target="_blank">Apache_Spark</a>
  * <a href="https://en.wikipedia.org/wiki/Matei_Zaharia" target="_blank">Matei_Zaharia</a>
  * <a href="https://en.wikipedia.org/wiki/Kevin_Bacon" target="_blank">Kevin_Bacon</a>
* **requests** (*integer*): The number of requests (clicks) the article has received in the hour this data represents.
* **bytes_served** (*long*): The total number of bytes delivered for the requested article.
  * **Note:** In our copy of the data, this value is zero for all records and consequently is of no value to us.

##![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Spark API

You have already seen one command available to the `DataFrame` class, namely `DataFrame.printSchema()`.
  
Let's take a look at the API to see what other operations we have available.

### **Spark API Home Page**
0. Open a new browser tab
0. Google for **Spark API Latest** or **Spark API _x.x.x_** for a specific version.
0. Select **Spark API Documentation - Spark _x.x.x_ Documentation - Apache Spark** 

Other Documentation:
* Programming Guides for DataFrames, SQL, Graphs, Machine Learning, Streaming...
* Deployment Guides for Spark Standalone, Mesos, Yarn...
* Configuration, Monitoring, Tuning, Security...

Here are some shortcuts:
  * <a href="https://spark.apache.org/docs/latest/" target="_blank">Spark API Documentation - Latest</a>
  * <a href="https://spark.apache.org/docs/2.1.1/api.html" target="_blank">Spark API Documentation - 2.1.1</a>
  * <a href="https://spark.apache.org/docs/2.1.0/api.html" target="_blank">Spark API Documentation - 2.1.0</a>
  * <a href="https://spark.apache.org/docs/2.0.2/api.html" target="_blank">Spark API Documentation - 2.0.2</a>
  * <a href="https://spark.apache.org/docs/1.6.3/api.html" target="_blank">Spark API Documentation - 1.6.3</a>

Naturally, which set of documentation you will use depends on which language you will use.

### Spark API (Python)

0. Select **Spark Python API (Sphinx)**.
0. Look up the documentation for `pyspark.sql.DataFrame`.
  0. In the lower left-hand corner, type **DataFrame** into the search field.
  0. Hit **[Enter]**.
  0. The search results should appear in the right-hand pane.
  0. Click on **pyspark.sql.DataFrame (Python class, in pyspark.sql module)**
  0. The documentation should open in the right-hand pane.

### Spark API (Scala)

0. Select **Spark Scala API (Scaladoc)**.
0. Look up the documentation for `org.apache.spark.sql.DataFrame`.
  0. In the upper left-hand corner, type **DataFrame** into the search field.
  0. The search will execute automatically.
  0. In the class/package list, click on **DataFrame**.
  0. The documentation should open in the right-hand pane.
  
This isn't going to work, but why?

### Spark API (Scala), Try #2

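In Scala, `DataFrame` is a type alias for `Dataset[Row]`, so it has no class page of its own. Look up the documentation for `org.apache.spark.sql.Dataset` instead.
  0. In the upper left-hand corner, type **Dataset** into the search field.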
  0. The search will execute automatically.
  0. In the class/package list, click on **Dataset**.
  0. The documentation should open in the right-hand pane.

Now that we have found the proper documentation, we can take a quick peek at the function `printSchema()`.

Nothing special here.

If you look at the API docs, `printSchema(..)` is described like this:
> Prints the schema to the console in a nice tree format.

## Next steps

Start the next lesson, [Use the Display function]($./3.Display-function)