### Repartition
Repartition is an expensive operation because it reshuffles the data across the cluster. It is a wide transformation, so use it wisely.

* Used to increase / decrease the number of partitions of an RDD / Data Frame
* DataFrame repartition takes 2 parameters (numPartitions, *cols)
* We can specify either one of the parameters, or both
* If numPartitions is not given, the default number of shuffle partitions (spark.sql.shuffle.partitions) is used


In [None]:
# Creating a Spark session
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("learning").getOrCreate()

In [None]:
# Reading data from csv file - with a header row (inferSchema is not set, so every column is read as a string)

df = spark.read.format('csv').option('header',True).load('sample_data/customers-100.csv')

In [None]:
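# Write to csv file
# The save mode is set with .mode(), not .option('mode', ...); without .format('csv') the default format is parquet

# Number of part files generated is equal to the number of partitions

df.write.format('csv').mode('overwrite').save('sample_data/output')
df.rdd.getNumPartitions()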


1

In [None]:
# Changing the partition count using the repartition()

df2 = df.repartition(numPartitions=9)
df2.rdd.getNumPartitions()
df2.write.format('csv').mode('overwrite').save('sample_data/output_df2') # Changed the output path


In [None]:
df2.show()

+-----+---------------+----------+-----------+--------------------+------------------+--------------------+--------------------+--------------------+--------------------+-----------------+--------------------+
|Index|    Customer Id|First Name|  Last Name|             Company|              City|             Country|             Phone 1|             Phone 2|               Email|Subscription Date|             Website|
+-----+---------------+----------+-----------+--------------------+------------------+--------------------+--------------------+--------------------+--------------------+-----------------+--------------------+
|   42|6e5ad5a5e2bB5Ca|     Bryan|       Dunn|    Kaufman and Sons|     North Jimstad|        Burkina Faso|    001-710-802-5565|  078.699.8982x13881|woodwardandres@ph...|       2021-09-08|http://www.butler...|
|    6|2d08FB17EE273F4|     Aimee|      Downs|        Steele Group|     Chavezborough|Bosnia and Herzeg...| (283)437-3886x88321|        999-728-1637| louis27@gi

In [None]:
# Repartition using column names: rows with the same Company value go to the same partition

from pyspark.sql.functions import col
df3 = df2.repartition(5,col("Company"))
df3.write.format('csv').mode('overwrite').save('sample_data/output_df2')


### Repartition in RDD
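These SparkContext methods accept an optional partition count when the RDD is created:
* parallelize(c, numSlices)
* textFile(name, minPartitions)
* wholeTextFiles(path, minPartitions)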

In [None]:
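# Parallelize method

# Without a second argument, the default parallelism decides the partition count
rdd = spark.sparkContext.parallelize(range(1, 20))  # values 1..19
print('Default rdd partition : ',rdd.getNumPartitions())

# The second argument (numSlices) sets the partition count explicitly
rdd2 = spark.sparkContext.parallelize(range(1, 20),7)
print('Updated rdd partition count : ',rdd2.getNumPartitions())

# For textFile the second argument is minPartitions, a lower bound rather than an exact count
rdd3 = spark.sparkContext.textFile('sample_data/customers-100.csv',5)
print('Text file partition count : ',rdd3.getNumPartitions())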

Default rdd partition :  2
Updated rdd partition count :  7
Text file partition count :  5


### Coalesce
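Coalesce decreases the partition count. Unlike repartition, it merges existing partitions without a full shuffle, so it is cheaper, but by default it cannot increase the count.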

In [None]:
# Rdd coalesce
rdd = spark.sparkContext.textFile('sample_data/customers-100.csv',5)
rdd.getNumPartitions()

5

In [None]:
rdd2 = rdd.coalesce(2)

In [None]:
rdd2.getNumPartitions()

2

In [None]:
# df coalesce

df = spark.range(1,20)
df.rdd.getNumPartitions()

2

In [None]:
df = df.coalesce(1)
df.rdd.getNumPartitions()

1

### Shuffle Partition
* Number of partitions Spark creates when it shuffles data for joins, group by and aggregations.
* Default shuffle partitions: 200 (controlled by spark.sql.shuffle.partitions).

In [None]:
df = spark.read.format('csv').option('header',True).load('sample_data/customers-100.csv')

In [None]:
df.show(10)

+-----+---------------+----------+---------+--------------------+-----------------+--------------------+--------------------+--------------------+--------------------+-----------------+--------------------+
|Index|    Customer Id|First Name|Last Name|             Company|             City|             Country|             Phone 1|             Phone 2|               Email|Subscription Date|             Website|
+-----+---------------+----------+---------+--------------------+-----------------+--------------------+--------------------+--------------------+--------------------+-----------------+--------------------+
|    1|DD37Cf93aecA6Dc|    Sheryl|   Baxter|     Rasmussen Group|     East Leonard|               Chile|        229.077.5154|    397.884.0519x718|zunigavanessa@smi...|       2020-08-24|http://www.stephe...|
|    2|1Ef7b82A4CAAD10|   Preston|   Lozano|         Vega-Gentry|East Jimmychester|            Djibouti|          5153435776|    686-620-1820x944|     vmata@colon.com|     

In [None]:
df2 = df.groupBy('Company').count()

In [None]:
df2.rdd.getNumPartitions()

1

In [None]:
df2.show()

+--------------------+-----+
|             Company|count|
+--------------------+-----+
|Palmer, Barnes an...|    1|
|           Novak LLC|    1|
|      Caldwell Group|    1|
|      Carter-Hancock|    1|
|      Greer and Sons|    1|
|    Osborne-Erickson|    1|
|Fitzpatrick-Lawrence|    1|
|     Perkins-Trevino|    1|
|     Decker-Mcknight|    1|
|       Murillo-Perry|    1|
|Prince, Malone an...|    1|
|Martin, Lang and ...|    1|
|Coffey, Lamb and ...|    1|
|Waters, Chase and...|    1|
|        Steele Group|    1|
|   Carter-Strickland|    1|
|           Petty Ltd|    1|
|Lee, Lucero and J...|    1|
|        Guzman-Brown|    1|
|Mcdonald, Bird an...|    1|
+--------------------+-----+
only showing top 20 rows



In [None]:
df3 = df2.repartition(5)

In [None]:
df3.rdd.getNumPartitions()

5