Suppose we load an initial DataFrame from a CSV file and then derive several new DataFrames from it using transformations. We can avoid recomputing the original DataFrame (that is, loading and parsing the CSV file) for each of them by adding a line that caches it along the way:

In [0]:
# Original loading code that does *not* cache DataFrame
DF1 = spark.read.format("csv")\
  .option("inferSchema", "true")\
  .option("header", "true")\
  .load("/FileStore/tables/2015_summary.csv")


- Here we have our “lazily” created DataFrame (DF1), along with three other queries (DF2, DF3, and DF4) that access data in DF1.
- All of these downstream queries share that common parent (DF1), so each one repeats the same work when we run the following code.
- In this case, that work is just reading and parsing the raw CSV data, but that can be a fairly intensive process, especially for large datasets.

In [0]:
# Each collect() is an action, so each line re-reads and re-parses the CSV
DF2 = DF1.groupBy("DEST_COUNTRY_NAME").count().collect()
DF3 = DF1.groupBy("ORIGIN_COUNTRY_NAME").count().collect()
DF4 = DF1.groupBy("count").count().collect()


**Command took 6.10 seconds**
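
The timing above is reported by the notebook environment. Outside such an environment, a small helper can take the same kind of measurement. This is a minimal sketch using only the standard library; `timed` is a hypothetical helper name, and the workload passed to it here is a stand-in so the snippet runs without Spark.

```python
import time


def timed(action):
    """Run a zero-argument callable and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = action()
    return result, time.perf_counter() - start


# Stand-in workload so the snippet runs anywhere. With Spark you would pass
# something like: lambda: DF1.groupBy("DEST_COUNTRY_NAME").count().collect()
result, seconds = timed(lambda: sum(range(1_000)))
print(f"Command took {seconds:.2f} seconds")
```

Wrapping the whole action (including `collect()`) in the callable matters: transformations alone are lazy, so timing them without an action would measure almost nothing.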

- Caching can help speed things up. When we ask for a DataFrame to be cached, Spark saves the data in memory or on disk the first time it computes it.
- Then, when any other queries come along, they read the stored copy instead of going back to the original file.
- You do this using the DataFrame’s cache method:

In [0]:
DF1.cache()
DF1.count()


- We called count above to cache the data eagerly (that is, we ran an action to force Spark to materialize the cache), because caching itself is lazy: the data is cached only the first time you run an action on the DataFrame.
- Now that the data is cached, the previous commands will be faster, as we can see by running the following code:

In [0]:
# Same three queries as before; DF1 is now read from the cache
DF2 = DF1.groupBy("DEST_COUNTRY_NAME").count().collect()
DF3 = DF1.groupBy("ORIGIN_COUNTRY_NAME").count().collect()
DF4 = DF1.groupBy("count").count().collect()


**Command took 2.07 seconds**
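
The lazy behavior of `cache()` described above can be illustrated with a toy model in plain Python. This is only a sketch of the idea, not how Spark implements caching: `LazyTable`, its `loads` counter, and the sample rows are all hypothetical names and data invented for the illustration.

```python
class LazyTable:
    """Toy stand-in for a lazily evaluated DataFrame (not Spark's implementation)."""

    def __init__(self, loader):
        self._loader = loader          # expensive step, e.g. parsing a CSV file
        self._cache_requested = False
        self._cached_rows = None
        self.loads = 0                 # how many times the loader actually ran

    def cache(self):
        # Only marks the data for caching; no work happens here.
        self._cache_requested = True
        return self

    def _rows(self):
        if self._cached_rows is not None:
            return self._cached_rows   # served from the cache, loader is skipped
        self.loads += 1
        rows = self._loader()
        if self._cache_requested:
            self._cached_rows = rows   # stored on the first action after cache()
        return rows

    def count(self):
        # An "action": forces the data to be computed.
        return len(self._rows())


# Made-up rows standing in for the parsed CSV.
table = LazyTable(lambda: [("A", "B", 1), ("A", "C", 2)])

table.count()
table.count()
loads_without_cache = table.loads      # loader ran once per action

table.cache()
loads_after_cache_call = table.loads   # unchanged: cache() did no work

table.count()                          # first action fills the cache
table.count()                          # second action reuses it
loads_with_cache = table.loads
```

In this model the loader runs once per action until `cache()` is requested, and only once more afterward, which mirrors why the notebook calls `count()` right after `cache()`.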