# Working with File Partitioning

In this exercise, we will cover how to partition data for fast querying.

In this lesson you will:
 - Partition your data for increased query performance
 - Minimize the small file problem

Let's start with some CSV data in a single folder:
* people-10m.csv
* people-10m-partitioned.csv

In [4]:
df = spark.read.csv("data/people-10m", header="true", inferSchema="true")

In [17]:
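# Number of partitions Spark chose when reading the CSV files
print(df.rdd.getNumPartitions())
df.show()
# Hash-partition the rows into 8 partitions keyed on id, lastName and gender
redf = df.repartition(8, "id", "lastName", "gender")
print(redf.rdd.getNumPartitions())
redf.show()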


8
+---+----------+----------+-------------+------+-------------------+-----------+------+
| id| firstName|middleName|     lastName|gender|          birthDate|        ssn|salary|
+---+----------+----------+-------------+------+-------------------+-----------+------+
|  1|    Pennie|     Carry|   Hirschmann|     F|1955-07-02 04:00:00|981-43-9345| 56172|
|  2|        An|     Amira|       Cowper|     F|1992-02-08 05:00:00|978-97-8086| 40203|
|  3|     Quyen|    Marlen|         Dome|     F|1970-10-11 04:00:00|957-57-8246| 53417|
|  4|   Coralie|  Antonina|      Marshal|     F|1990-04-11 04:00:00|963-39-4885| 94727|
|  5|    Terrie|      Wava|        Bonar|     F|1980-01-16 05:00:00|964-49-8051| 79908|
|  6|  Chassidy|Concepcion|Bourthouloume|     F|1990-11-24 05:00:00|954-59-9172| 64652|
|  7|      Geri|    Tambra|        Mosby|     F|1970-12-19 05:00:00|968-16-4020| 38195|
|  8|    Patria|     Nancy|      Arstall|     F|1985-01-02 05:00:00|984-76-3770|102053|
|  9|    Terese|  Alfredia|   
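
To make the `repartition(8, "id", "lastName", "gender")` call above more concrete, here is a minimal pure-Python sketch of hash partitioning: the key columns of each row are hashed, and the hash modulo the partition count selects the target partition. Spark uses its own hash function internally; `zlib.crc32` and the `assign_partition` helper are stand-ins chosen for illustration only, so the partition ids printed here will not match Spark's.

```python
import zlib
from collections import Counter

NUM_PARTITIONS = 8  # same count as the repartition() call above


def assign_partition(row, key_columns, num_partitions=NUM_PARTITIONS):
    """Pick a partition by hashing the key columns (stand-in for Spark's hash)."""
    key = "|".join(str(row[c]) for c in key_columns).encode("utf-8")
    return zlib.crc32(key) % num_partitions


# A few rows taken from the output above (only the key columns are kept).
rows = [
    {"id": 1, "lastName": "Hirschmann", "gender": "F"},
    {"id": 2, "lastName": "Cowper", "gender": "F"},
    {"id": 3, "lastName": "Dome", "gender": "F"},
    {"id": 4, "lastName": "Marshal", "gender": "F"},
]

key_columns = ["id", "lastName", "gender"]
assignments = [assign_partition(r, key_columns) for r in rows]

# How many of the sample rows landed in each partition id.
print(Counter(assignments))
```

Because `id` is unique per row, including it in the key tends to spread rows evenly, but it also means rows that share a `lastName` or `gender` are not kept together.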

What happens when we filter by the year of birth?

In [24]:
df.where("year(birthDate) between 1970 and 1980").explain(mode="cost")

== Optimized Logical Plan ==
Filter ((year(cast(birthDate#57 as date)) >= 1970) AND (year(cast(birthDate#57 as date)) <= 1980)), Statistics(sizeInBytes=713.3 MiB)
+- RelationV2[id#52, firstName#53, middleName#54, lastName#55, gender#56, birthDate#57, ssn#58, salary#59] csv hdfs://training.io:8020/user/training/data/people-10m, Statistics(sizeInBytes=713.3 MiB)

== Physical Plan ==
*(1) Project [id#52, firstName#53, middleName#54, lastName#55, gender#56, birthDate#57, ssn#58, salary#59]
+- *(1) Filter ((year(cast(birthDate#57 as date)) >= 1970) AND (year(cast(birthDate#57 as date)) <= 1980))
   +- BatchScan[id#52, firstName#53, middleName#54, lastName#55, gender#56, birthDate#57, ssn#58, salary#59] CSVScan Location: InMemoryFileIndex[hdfs://training.io:8020/user/training/data/people-10m], ReadSchema: struct<id:int,firstName:string,middleName:string,lastName:string,gender:string,birthDate:timestam...




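Why did counting the filtered rows take as long as counting the whole dataset, or ***even longer***? Look at the query plan to understand: the `Filter` on `year(birthDate)` sits on top of a scan of the entire dataset (713.3 MiB), so every row is read before the predicate is applied.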

So let's try with a partitioned version instead.
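
The idea behind a partitioned layout can be sketched without Spark. The example below assumes the data is stored in Hive-style `birthYear=<year>` folders, which is the usual result of `partitionBy("birthYear")`; the sample rows and the helpers `write_partitioned` and `count_with_pruning` are hypothetical. It shows that a range filter on the partition column only needs to open the matching folders, whereas an unpartitioned scan has to parse every row.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical sample: (id, birthYear) pairs standing in for the people data.
people = [(i, 1950 + i % 50) for i in range(1000)]


def write_partitioned(root, rows):
    """Write rows into Hive-style folders: <root>/birthYear=<year>/part-0.csv."""
    by_year = {}
    for person_id, year in rows:
        by_year.setdefault(year, []).append(person_id)
    for year, ids in by_year.items():
        folder = Path(root) / f"birthYear={year}"
        folder.mkdir(parents=True)
        with open(folder / "part-0.csv", "w", newline="") as f:
            csv.writer(f).writerows([[i] for i in ids])


def count_with_pruning(root, low, high):
    """Count matching rows, opening only folders whose year is in range."""
    matched, rows_read = 0, 0
    for folder in Path(root).iterdir():
        year = int(folder.name.split("=", 1)[1])
        if not (low <= year <= high):
            continue  # pruned: this folder's files are never opened
        with open(folder / "part-0.csv", newline="") as f:
            n = sum(1 for _ in csv.reader(f))
        matched += n
        rows_read += n
    return matched, rows_read


with tempfile.TemporaryDirectory() as root:
    write_partitioned(root, people)
    matched, rows_read = count_with_pruning(root, 1970, 1980)

# Without partitioning, every row must be parsed to evaluate the filter.
full_scan_rows = len(people)
print(matched, rows_read, full_scan_rows)
```

Whether Spark actually prunes folders for a given query depends on the data source and on the filter referencing the partition column directly, so treat this as a model of the mechanism rather than a guarantee.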

In [14]:
sc.defaultParallelism

2

In [17]:
spark.conf.get("spark.default.parallelism")

Py4JJavaError: An error occurred while calling o56.get.
: java.util.NoSuchElementException: spark.driver.cores
	at org.apache.spark.sql.internal.SQLConf.$anonfun$getConfString$3(SQLConf.scala:2656)
	at scala.Option.getOrElse(Option.scala:189)
	at org.apache.spark.sql.internal.SQLConf.getConfString(SQLConf.scala:2656)
	at org.apache.spark.sql.RuntimeConfig.get(RuntimeConfig.scala:73)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)


In [15]:
df_by_year = spark.read.csv("data/people-10m-partitioned.csv", header="true", inferSchema="true")
print(df_by_year.rdd.getNumPartitions())


18


In [23]:
# explain() prints the plan itself and returns None, so it needs no print()
df_by_year.explain(mode="cost")


== Optimized Logical Plan ==
RelationV2[id#454, firstName#455, middleName#456, lastName#457, gender#458, birthDate#459, ssn#460, salary#461, birthYear#462] csv hdfs://training.io:8020/user/training/data/people-10m-partitioned.csv, Statistics(sizeInBytes=713.3 MiB)

== Physical Plan ==
*(1) Project [id#454, firstName#455, middleName#456, lastName#457, gender#458, birthDate#459, ssn#460, salary#461, birthYear#462]
+- BatchScan[id#454, firstName#455, middleName#456, lastName#457, gender#458, birthDate#459, ssn#460, salary#461, birthYear#462] CSVScan Location: InMemoryFileIndex[hdfs://training.io:8020/user/training/data/people-10m-partitioned.csv], ReadSchema: struct<id:int,firstName:string,middleName:string,lastName:string,gender:string,birthDate:timestam...


In [30]:
df_by_year.where("birthYear between 1970 and 1980").count()

2287326

That's quite good, but let's examine the query plan.

Why are the reads so small, with 8 tasks?

In [31]:
print(df_by_year.where("birthYear between 1970 and 1980").rdd.getNumPartitions())
print(df_by_year.rdd.getNumPartitions())
print(df_by_year.count())


18
18
10000000


We have 8 small files per partition folder, which is very inefficient, especially on cloud storage, where every file costs a separate listing and open request!
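
One way to check for this kind of layout is to count the data files under each partition folder. The sketch below is a pure-Python illustration and assumes the dataset is reachable through a local filesystem path (data on HDFS, as in this exercise, would need a different listing tool). The `files_per_partition` helper and the throwaway folder it inspects are hypothetical.

```python
import tempfile
from collections import Counter
from pathlib import Path


def files_per_partition(root):
    """Return {partition folder name: number of data files} under root."""
    counts = Counter()
    for path in Path(root).rglob("*"):
        # Skip directories and marker files such as _SUCCESS or .crc checksums.
        if path.is_file() and not path.name.startswith(("_", ".")):
            counts[path.parent.name] += 1
    return dict(counts)


# Build a throwaway layout that mimics 8 part files per birthYear folder.
with tempfile.TemporaryDirectory() as root:
    for year in (1970, 1971):
        folder = Path(root) / f"birthYear={year}"
        folder.mkdir()
        for part in range(8):
            (folder / f"part-{part:05d}.csv").write_text("id\n")
    (Path(root) / "_SUCCESS").write_text("")
    counts = files_per_partition(root)

print(counts)
```

A count well above one per folder, combined with small file sizes, is a hint that the data was written by many tasks that each held rows for the same partition value.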

**Question**: Why do we need `repartition` AND `partitionBy`?

In [19]:
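# repartition("birthYear") shuffles all rows of a given year into the same task;
# partitionBy("birthYear") then writes one folder per year.
(df_by_year.repartition("birthYear")
  .write.partitionBy("birthYear")
  .format("parquet")
  .mode("overwrite")
  .option("path", "people_by_year.parquet")
  .saveAsTable("people_by_year_optimized"))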
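
As a rough way to reason about the question above, the following pure-Python simulation counts output files under a simplified model: each write task emits exactly one file per distinct `birthYear` value it holds. The task count, the sample rows, and the helper `count_output_files` are assumptions made for illustration, and the modulo used for the repartitioned case is a stand-in for Spark's hash partitioning.

```python
from collections import defaultdict

NUM_TASKS = 8  # assumed number of write tasks

# Hypothetical rows: (id, birthYear), with 50 distinct years.
rows = [(i, 1950 + i % 50) for i in range(4000)]


def count_output_files(rows, task_for_row):
    """Each task writes one file per distinct birthYear it holds."""
    years_seen_by_task = defaultdict(set)
    for row in rows:
        years_seen_by_task[task_for_row(row)].add(row[1])
    return sum(len(seen) for seen in years_seen_by_task.values())


# partitionBy alone: rows stay in arbitrary blocks, so every task holds
# rows for every year and writes a file into every year folder.
files_without = count_output_files(rows, lambda r: (r[0] // 50) % NUM_TASKS)

# repartition("birthYear") first: all rows of a year land in a single task,
# so each year folder receives exactly one file.
files_with = count_output_files(rows, lambda r: r[1] % NUM_TASKS)

print(files_without, files_with)
```

In this model, `partitionBy` decides the folder layout on disk, while `repartition` decides how many tasks contribute files to each folder.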

In [30]:
print(spark.read.table("people_by_year_optimized").where("birthYear between 1970 AND 1980").rdd.getNumPartitions())


3



Now, we're reading in a single larger file per partition!

## End of Exercise