In [1]:
import findspark
findspark.init()

#### Why a Spark Session is needed
- It is required to run a program on Spark.
- Configuration details are available through the Spark Session.
- It is the entry point of the driver program.
- Before SparkSession was introduced in Spark 2.0, the SparkContext object served as the entry point.
- A Spark program normally works with one SparkSession, which wraps a single SparkContext.
- Cluster-related information is available through the SparkSession object.
- It is used to build DataFrames.
- To build RDDs, use the SparkContext, which is available from the SparkSession.

In [2]:
## Create a Spark session
### SparkSession is the gateway for creating a Spark program.
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").appName("RDDProgram").getOrCreate()
## local[*] - run locally, using all the cores on the machine
### local[2] - run locally, using 2 CPU cores
### On a cluster, the master is a URL such as spark://<ip-address>:<port> (standalone) or simply yarn

In [3]:
spark

* appName: Sets a name for the application, which will be shown in the Spark web UI.

* config: Sets a config option. Options set using this method are automatically propagated to both SparkConf and SparkSession's own configuration.

* master: A Spark, Mesos or YARN cluster URL, or a special “local” string to run in local mode.

* getOrCreate(): Gets an existing SparkSession or, if there is no existing one, creates a new one based on the options set in this builder.

In [4]:
sc = spark.sparkContext

* Spark introduces the concept of an RDD (Resilient Distributed Dataset), an
 immutable, fault-tolerant, distributed collection of objects that can be operated on
 in parallel. 

* An RDD can contain any type of object and is created by loading an
 external dataset or distributing a collection from the driver program.

In [5]:
sc

### Creating an RDD in PySpark

#####  There are three ways to create an RDD in Spark.

* Parallelizing an existing collection in the driver program.
* Referencing a dataset in an external storage system (e.g. HDFS, HBase, shared file system).
* Creating an RDD from existing RDDs.

In [6]:
l1 = [2,3,4,5,6]
l1

[2, 3, 4, 5, 6]

In [7]:
rdd0 = sc.parallelize([2,3,4,5,6])
rdd0

ParallelCollectionRDD[0] at readRDDFromFile at PythonRDD.scala:274

In [8]:
rdd0.collect()

[2, 3, 4, 5, 6]

In [9]:
rdd1 = sc.parallelize([("maths",92),("english",75),("SCiences",85),("Social",90)])

In [10]:
rdd1

ParallelCollectionRDD[1] at readRDDFromFile at PythonRDD.scala:274

In [11]:
rdd1.collect()

[('maths', 92), ('english', 75), ('SCiences', 85), ('Social', 90)]

#### Creating an RDD by reading a file

In [12]:
import os
os.getcwd()

'C:\\Users\\Jayanth\\Python_DataAnalysis\\PySpark-lessons'

In [13]:
rdd2 = sc.textFile('file:///C:\\Users\\Jayanth\\Python_DataAnalysis\\PySpark-lessons\\datasets\\temp_data.txt')

In [14]:
rdd2.collect()

['1901\t-78\t1',
 '1901\t-72\t1',
 '1901\t-94\t1',
 '1901\t-61\t1',
 '1901\t-56\t1',
 '1901\t-28\t1',
 '1901\t-67\t1',
 '1901\t-33\t1',
 '1901\t-28\t1',
 '1901\t-33\t1',
 '1901\t-44\t1',
 '1901\t-39\t1',
 '1901\t0\t1',
 '1901\t6\t1',
 '1901\t0\t1',
 '1901\t6\t1',
 '1901\t6\t1',
 '1901\t-11\t1',
 '1901\t-33\t1',
 '1901\t-50\t1',
 '1901\t-44\t1',
 '1901\t-28\t1',
 '1901\t-33\t1',
 '1901\t-33\t1',
 '1901\t-50\t1',
 '1901\t-33\t1',
 '1901\t-28\t1',
 '1901\t-44\t1',
 '1901\t-44\t1',
 '1901\t-44\t1',
 '1901\t-39\t1',
 '1901\t-50\t1',
 '1901\t-44\t1',
 '1901\t-39\t1',
 '1901\t-33\t1',
 '1901\t-22\t1',
 '1901\t0\t1',
 '1901\t-6\t1',
 '1901\t-17\t1',
 '1901\t-44\t1',
 '1901\t-39\t1',
 '1901\t-33\t1',
 '1901\t-6\t1',
 '1901\t17\t1',
 '1901\t22\t1',
 '1901\t28\t1',
 '1901\t28\t1',
 '1901\t11\t1',
 '1901\t-17\t1',
 '1901\t-28\t1',
 '1901\t-56\t1',
 '1901\t-44\t1',
 '1901\t-44\t1',
 '1901\t-67\t1',
 '1901\t-44\t1',
 '1901\t-39\t1',
 '1901\t-22\t1',
 '1901\t-22\t1',
 '1901\t-22\t1',
 '1901\t-39\t1',

In [15]:
rdd3 = rdd2.map(lambda s: s.split('\t'))

In [16]:
type(rdd3)

pyspark.rdd.PipelinedRDD

In [17]:
rdd3.collect()

[['1901', '-78', '1'],
 ['1901', '-72', '1'],
 ['1901', '-94', '1'],
 ['1901', '-61', '1'],
 ['1901', '-56', '1'],
 ['1901', '-28', '1'],
 ['1901', '-67', '1'],
 ['1901', '-33', '1'],
 ['1901', '-28', '1'],
 ['1901', '-33', '1'],
 ['1901', '-44', '1'],
 ['1901', '-39', '1'],
 ['1901', '0', '1'],
 ['1901', '6', '1'],
 ['1901', '0', '1'],
 ['1901', '6', '1'],
 ['1901', '6', '1'],
 ['1901', '-11', '1'],
 ['1901', '-33', '1'],
 ['1901', '-50', '1'],
 ['1901', '-44', '1'],
 ['1901', '-28', '1'],
 ['1901', '-33', '1'],
 ['1901', '-33', '1'],
 ['1901', '-50', '1'],
 ['1901', '-33', '1'],
 ['1901', '-28', '1'],
 ['1901', '-44', '1'],
 ['1901', '-44', '1'],
 ['1901', '-44', '1'],
 ['1901', '-39', '1'],
 ['1901', '-50', '1'],
 ['1901', '-44', '1'],
 ['1901', '-39', '1'],
 ['1901', '-33', '1'],
 ['1901', '-22', '1'],
 ['1901', '0', '1'],
 ['1901', '-6', '1'],
 ['1901', '-17', '1'],
 ['1901', '-44', '1'],
 ['1901', '-39', '1'],
 ['1901', '-33', '1'],
 ['1901', '-6', '1'],
 ['1901', '17', '1'],
 ['

### RDDs support two types of operations:
* Transformations are operations (such as map, filter, join, union, and so on) that are performed on an RDD and yield a new RDD containing the result.
* Actions are operations (such as collect, count, first, reduce, and so on) that run a computation on an RDD and return a value to the driver program.

* Transformations in Spark are “lazy”, meaning that they do not compute their results right away.
* They just “remember” the operation to be performed and the dataset (e.g., a file) on which it is to be performed.
* The transformations are only computed when an action is called, and the result is then returned to the driver program.
* This design enables Spark to run more efficiently. For example, if a big file is transformed in various ways and then passed to the first() action, Spark only processes and returns the result for the first line, rather than doing the work for the entire file.
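
The lazy behaviour described above can be illustrated without Spark. The following is a minimal plain-Python sketch, not Spark code: Python's built-in `map` stands in for a transformation and `next` loosely stands in for the first() action. The function `expensive` and the sample lines are hypothetical.

```python
calls = {"count": 0}

def expensive(line):
    # Stand-in for a costly per-record transformation; counts its invocations.
    calls["count"] += 1
    return line.upper()

lines = ["first line", "second line", "third line"]

# Building the pipeline does no work yet, much like chaining transformations.
pipeline = map(expensive, lines)
calls_after_build = calls["count"]

# Pulling a single element processes only one record, not the whole input.
first_result = next(pipeline)
calls_after_first = calls["count"]

print(calls_after_build, first_result, calls_after_first)  # 0 FIRST LINE 1
```

The counter shows that nothing runs when the pipeline is defined and that only one record is processed when a single result is requested.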

**There are two types of transformations on RDDs**
* Narrow Transformations

In a narrow transformation, all the elements required to compute the records in a single partition live in a single partition of the parent RDD, so only a limited subset of partitions is needed to calculate the result. Narrow transformations are the result of operations such as map() and filter().

* Wide Transformations 

In a wide transformation, the elements required to compute the records in a single partition may live in many partitions of the parent RDD, so data has to be shuffled between partitions. Wide transformations are the result of operations such as groupByKey() and reduceByKey().

<img src='images/narrowWide.JPG' />
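
The difference can be sketched in plain Python, with lists standing in for partitions. This is an illustration only and not how Spark implements a shuffle. The data, the helper `shuffle_by_key` and its toy partitioner (based on the first character of the key) are all hypothetical.

```python
from collections import defaultdict

# Two hypothetical parent partitions of (key, value) pairs.
parent = [
    [("comp", 1), ("tab", 1)],
    [("tab", 1), ("comp", 1), ("tab", 1)],
]

# Narrow: each output partition is computed from exactly one parent partition.
narrow = [[(k, v * 10) for k, v in part] for part in parent]

# Wide: records with the same key must be brought together, so one output
# partition may read from every parent partition (a shuffle).
def shuffle_by_key(partitions, num_partitions):
    out = [defaultdict(list) for _ in range(num_partitions)]
    for part in partitions:
        for k, v in part:
            # Deterministic toy partitioner: code point of the first character.
            out[ord(k[0]) % num_partitions][k].append(v)
    return [dict(d) for d in out]

wide = shuffle_by_key(parent, 2)
print(narrow)  # [[('comp', 10), ('tab', 10)], [('tab', 10), ('comp', 10), ('tab', 10)]]
print(wide)    # [{'tab': [1, 1, 1]}, {'comp': [1, 1]}]
```

In the wide case, the values for `tab` come from both parent partitions, which is why such operations require data movement.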

#### map transformation

In [25]:
intRdd = sc.parallelize([10,20,30,40,50])
mapRdd = intRdd.map(lambda x : x**2)

In [26]:
mapRdd.collect()

[100, 400, 900, 1600, 2500]

#### filter

In [27]:
numRdd = sc.parallelize([9,10,11,12,13,14,15,16,17,18,19,20])
oddRdd = numRdd.filter(lambda num : num%2 == 1)
oddRdd.collect()

[9, 11, 13, 15, 17, 19]

In [28]:
evenRdd = numRdd.filter(lambda num : num%2 == 0)
evenRdd.collect()

[10, 12, 14, 16, 18, 20]

#### Construction of a pair RDD

In [29]:
x = sc.parallelize([('comp',1),('tab',1),('comp',1),('comp',1),('comp',1),
                     ('tab',1),('tab',1),('tab',1),('tab',1),('tab',1),('tab',1)])

In [30]:
x.collect()

[('comp', 1),
 ('tab', 1),
 ('comp', 1),
 ('comp', 1),
 ('comp', 1),
 ('tab', 1),
 ('tab', 1),
 ('tab', 1),
 ('tab', 1),
 ('tab', 1),
 ('tab', 1)]

#### reduceByKey

In [31]:
y = x.reduceByKey(lambda a,b:a+b)
y.collect()

[('comp', 4), ('tab', 7)]

#### map vs flatMap - passing a lambda function

In [32]:
range(1,4)

range(1, 4)

In [33]:
mapRddint = sc.parallelize([3,4,5]).map(lambda x :range(1,x)).collect()

In [34]:
mapRddint

[range(1, 3), range(1, 4), range(1, 5)]

In [35]:
flatMapRddInt = sc.parallelize([3,4,5]).flatMap(lambda x : range(1,x)).collect()

In [36]:
flatMapRddInt

[1, 2, 1, 2, 3, 1, 2, 3, 4]
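
For comparison, the same two results can be reproduced in plain Python. This sketch only mirrors the semantics (one output element per input for map, concatenated outputs for flatMap) and uses `itertools.chain` as a stand-in for the flattening step.

```python
from itertools import chain

nums = [3, 4, 5]

# map-like: one output element (here a list) per input element
mapped = [list(range(1, x)) for x in nums]

# flatMap-like: the per-element outputs are concatenated into one flat sequence
flat_mapped = list(chain.from_iterable(mapped))

print(mapped)       # [[1, 2], [1, 2, 3], [1, 2, 3, 4]]
print(flat_mapped)  # [1, 2, 1, 2, 3, 1, 2, 3, 4]
```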

#### Splitting a sentence using flatMap and map

In [37]:
sentence = ['Welecome to Spark.','The batch is LTI-MindTree','Module is Pyspark','Rdd are constructed']
sentRdd = sc.parallelize(sentence)

In [38]:
sentRdd

ParallelCollectionRDD[31] at readRDDFromFile at PythonRDD.scala:274

In [39]:
sentRdd.map(lambda sent : sent.split(' ')).collect()

[['Welecome', 'to', 'Spark.'],
 ['The', 'batch', 'is', 'LTI-MindTree'],
 ['Module', 'is', 'Pyspark'],
 ['Rdd', 'are', 'constructed']]

In [40]:
wordsRdd = sentRdd.flatMap(lambda sent : sent.split(' '))
wordsRdd.collect()

['Welecome',
 'to',
 'Spark.',
 'The',
 'batch',
 'is',
 'LTI-MindTree',
 'Module',
 'is',
 'Pyspark',
 'Rdd',
 'are',
 'constructed']


#### Applying operations on pair RDDs - groupByKey vs groupBy

**groupBy()** <br/>

groupBy() groups together data that has the same key. It is a transformation on an RDD, which means it is lazily evaluated. It is a wide operation that results in data shuffling, so it is costly. It can be used on both pair and unpaired RDDs, but it is mostly used on unpaired ones, because it lets the programmer explicitly specify the key to group by.

**groupByKey()** <br/>

groupByKey() also groups together data that has the same key, but it is meant only for pair RDDs: the programmer cannot explicitly specify the key field as with groupBy(), so the operation is specialised for RDDs whose key is already defined. Like groupBy(), it is a transformation and a wide, costly operation. An important point is that groupByKey() results in a hash-partitioned RDD by default.
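
The contrast between the two can be sketched in plain Python. The helpers `group_by` and `group_by_key` below are hypothetical stand-ins that only mimic the grouping semantics on a local list; they involve no partitioning or shuffling.

```python
from collections import defaultdict

def group_by(items, key_func):
    """Group arbitrary items by a key computed from each item."""
    groups = defaultdict(list)
    for item in items:
        groups[key_func(item)].append(item)
    return dict(groups)

def group_by_key(pairs):
    """Group (key, value) pairs by their existing key, keeping only the values."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)

words = ["proff1", "student1", "proff2"]
pairs = [("proff1", 1), ("proff2", 1), ("proff1", 1)]

by_first_letter = group_by(words, lambda w: w[0])
by_existing_key = group_by_key(pairs)

print(by_first_letter)  # {'p': ['proff1', 'proff2'], 's': ['student1']}
print(by_existing_key)  # {'proff1': [1, 1], 'proff2': [1]}
```

With `group_by` the caller supplies the key function and whole items are grouped, whereas `group_by_key` relies on the key already present in each pair and groups only the values.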



In [43]:
profRdd = sc.parallelize([('proff1',1),('proff2',1),('proff1',1),
                          ('proff1',1),('proff2',1),('proff3',1)])

In [44]:
profRdd.groupByKey().collect()

[('proff1', <pyspark.resultiterable.ResultIterable at 0x1ca9f783bb0>),
 ('proff3', <pyspark.resultiterable.ResultIterable at 0x1ca9fbfc0d0>),
 ('proff2', <pyspark.resultiterable.ResultIterable at 0x1ca9fbfc130>)]

In [45]:
profRdd.groupByKey().map(lambda x : (x[0],list(x[1]))).collect()

[('proff1', [1, 1, 1]), ('proff3', [1]), ('proff2', [1, 1])]

In [46]:
proffRdd1 = sc.parallelize(['proff1','proff2','proff1'
,'proff1','proff2','proff3','proff1','proff2','proff3','student1','student2','student3'])

In [47]:
result = proffRdd1.groupBy(lambda word : word[0]).collect()

In [48]:
result

[('p', <pyspark.resultiterable.ResultIterable at 0x1ca9f782950>),
 ('s', <pyspark.resultiterable.ResultIterable at 0x1ca9fbfc370>)]

In [49]:
[(x,list(y)) for (x,y) in result]

[('p',
  ['proff1',
   'proff2',
   'proff1',
   'proff1',
   'proff2',
   'proff3',
   'proff1',
   'proff2',
   'proff3']),
 ('s', ['student1', 'student2', 'student3'])]

#### Pair RDDs - map and mapValues

In [50]:
deviceRdd = sc.parallelize(['tab','computer','mobile','router','mouseclick'])
pairRdd = deviceRdd.map(lambda x : (len(x),x))

In [51]:
pairRdd.collect()

[(3, 'tab'), (8, 'computer'), (6, 'mobile'), (6, 'router'), (10, 'mouseclick')]

In [52]:
result = pairRdd.mapValues(lambda y : "Device Name is " + y)

In [53]:
result.collect()

[(3, 'Device Name is tab'),
 (8, 'Device Name is computer'),
 (6, 'Device Name is mobile'),
 (6, 'Device Name is router'),
 (10, 'Device Name is mouseclick')]

### RDDs - joins

<img src='images/joins_sql.JPG' />

<img src='images/joinDAG.png' />

In [54]:
rdd1 = sc.parallelize([("Mercedes","E-Class"),("Toyota","Corolla"),("Renault","Duster")])
rdd2 = sc.parallelize([("Mercedes","S-Class"),("Toyota","Fortuner"),("Suzuki","Mayona")])
innerJoinRdd = rdd1.join(rdd2)

In [55]:
innerJoinRdd.collect()

[('Mercedes', ('E-Class', 'S-Class')), ('Toyota', ('Corolla', 'Fortuner'))]

#### leftOuterJoin - RDDs

In [56]:
leftOuterJoinRdd = rdd1.leftOuterJoin(rdd2)
leftOuterJoinRdd.collect()

[('Mercedes', ('E-Class', 'S-Class')),
 ('Toyota', ('Corolla', 'Fortuner')),
 ('Renault', ('Duster', None))]

#### union - RDDs

In [53]:
unionRdd = rdd1.union(rdd2)
unionRdd.collect()

[('Mercedes', 'E-Class'),
 ('Toyota', 'Corolla'),
 ('Renault', 'Duster'),
 ('Mercedes', 'S-Class'),
 ('Toyota', 'Fortuner'),
 ('Suzuki', 'Mayona')]

In [54]:
unionRdd.first()

('Mercedes', 'E-Class')

In [55]:
unionRdd.take(2)

[('Mercedes', 'E-Class'), ('Toyota', 'Corolla')]

In [56]:
unionRdd.takeOrdered(4)

[('Mercedes', 'E-Class'),
 ('Mercedes', 'S-Class'),
 ('Renault', 'Duster'),
 ('Suzuki', 'Mayona')]

#### groupByKey followed by map and mapValues

In [61]:
from operator import add
pairRDD = sc.parallelize([('prodA', 20), ('prodA', 30), ('prodB', 1)])
# mapValues only used to improve format for printing
print(pairRDD.groupByKey().mapValues(lambda x: list(x)).collect())

# Different ways to sum by key
# Using mapValues, which is recommended when the key doesn't change
print(pairRDD.groupByKey().mapValues(lambda x: sum(x)).collect())
# reduceByKey is more efficient / scalable
print(pairRDD.reduceByKey(add).collect())

[('prodA', [20, 30]), ('prodB', [1])]
[('prodA', 50), ('prodB', 1)]
[('prodA', 50), ('prodB', 1)]
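
One reason reduceByKey scales better is that values can be merged within each partition before the shuffle, so fewer records are moved. The following plain-Python sketch counts the records that would cross the shuffle in each case. The partitions and the helper `combine_locally` are hypothetical, and the sketch only models the idea, not Spark's actual implementation.

```python
from operator import add

# Hypothetical partitions of (product, quantity) pairs.
partitions = [
    [("prodA", 20), ("prodA", 30), ("prodB", 1)],
    [("prodA", 5), ("prodB", 2), ("prodB", 3)],
]

def combine_locally(partition, func):
    """Merge values per key inside one partition before anything is shuffled."""
    combined = {}
    for key, value in partition:
        combined[key] = func(combined[key], value) if key in combined else value
    return list(combined.items())

# groupByKey-style: every record crosses the shuffle unchanged.
shuffled_without_combine = sum(len(p) for p in partitions)

# reduceByKey-style: at most one record per key per partition crosses the shuffle.
pre_combined = [combine_locally(p, add) for p in partitions]
shuffled_with_combine = sum(len(p) for p in pre_combined)

# Final merge of the pre-combined records.
totals = {}
for part in pre_combined:
    for key, value in part:
        totals[key] = add(totals[key], value) if key in totals else value

print(shuffled_without_combine, shuffled_with_combine, totals)
# 6 4 {'prodA': 55, 'prodB': 6}
```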


In [44]:
# Create sub function to subtract 1
def sub(value):
    """Subtract one from `value`.

    Args:
        value (int): A number.
    Returns:
        int: `value` minus one.
    """
    return value - 1
# map is a transformation and Spark uses lazy evaluation, so no jobs, stages,
# or tasks are launched when sub is passed to map in the next cell.


In [45]:
rdd1 = sc.parallelize([3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20])
subRdd = rdd1.map(sub)
subRdd.collect()

[2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]

In [47]:
print(subRdd.toDebugString().decode())

(4) PythonRDD[25] at collect at C:\Users\Jayanth\AppData\Local\Temp\ipykernel_7636\1655375454.py:3 []
 |  ParallelCollectionRDD[24] at readRDDFromFile at PythonRDD.scala:274 []


In [71]:
# Define a function to filter a single value
def ten(value):
    """Return whether value is below ten.

    Args:
        value (int): A number.

    Returns:
        bool: Whether `value` is less than ten.
    """
    if (value < 10):
        return True
    else:
        return False
# The ten function could also be written concisely as: def ten(value): return value < 10

# Pass the function ten to the filter transformation
# Filter is a transformation so no tasks are run
filteredRDD = subRdd.filter(ten)

# View the results using collect()
# Collect is an action and triggers the filter transformation to run
print(filteredRDD.collect())

[2, 3, 4, 5, 6, 7, 8, 9]


#### Actions - take, takeOrdered and top

In [72]:
# Let's get the first element
print(filteredRDD.first())
# The first 4
print(filteredRDD.take(4))
# Note that it is ok to take more elements than the RDD has
print(filteredRDD.take(12))

2
[2, 3, 4, 5]
[2, 3, 4, 5, 6, 7, 8, 9]


In [73]:
# Retrieve the three smallest elements
print(filteredRDD.takeOrdered(3))
# Retrieve the five largest elements
print(filteredRDD.top(5))

[2, 3, 4]
[9, 8, 7, 6, 5]


In [74]:
# Obtain Python's add function
from operator import add
# Efficiently sum the RDD using reduce
print(filteredRDD.reduce(add))
# Sum using reduce with a lambda function
print(filteredRDD.reduce(lambda a, b: a + b))
# Note that subtraction is neither associative nor commutative, so the result depends on the partitioning
print(filteredRDD.reduce(lambda a, b: a - b))
print(filteredRDD.repartition(4).reduce(lambda a, b: a - b))
# While addition is
print(filteredRDD.repartition(4).reduce(lambda a, b: a + b))

44
44
8
-8
44
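
The dependence on partitioning can be reproduced in plain Python. The helper `partitioned_reduce` below is a hypothetical model of a partition-wise reduce (reduce each non-empty partition, then reduce the partial results), and the two-partition split is an assumed layout chosen for illustration.

```python
from functools import reduce

def partitioned_reduce(partitions, func):
    """Reduce each non-empty partition, then reduce the partial results."""
    partials = [reduce(func, part) for part in partitions if part]
    return reduce(func, partials)

data = [2, 3, 4, 5, 6, 7, 8, 9]
one_partition = [data]
two_partitions = [[2, 3, 4, 5], [6, 7, 8, 9]]  # hypothetical split

def subtract(a, b):
    return a - b

def add(a, b):
    return a + b

sub_one = partitioned_reduce(one_partition, subtract)
sub_two = partitioned_reduce(two_partitions, subtract)
add_one = partitioned_reduce(one_partition, add)
add_two = partitioned_reduce(two_partitions, add)

print(sub_one, sub_two)  # -40 8
print(add_one, add_two)  # 44 44
```

Addition gives the same answer for every split because it is associative and commutative, while subtraction changes with the way the data is divided.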


#### takeSample action

In [75]:
# takeSample reusing elements
print(filteredRDD.takeSample(withReplacement=True, num=6))
# takeSample without reuse
print(filteredRDD.takeSample(withReplacement=False, num=6))

[4, 2, 5, 7, 6, 6]
[6, 3, 4, 7, 5, 2]


#### countByValue

In [78]:
# Create new base RDD to show countByValue
repetitiveRDD = sc.parallelize([11, 12, 13, 11, 12, 13, 11, 12, 11, 12, 13, 13, 13, 14, 15, 14, 16])
print(repetitiveRDD.countByValue())

defaultdict(<class 'int'>, {11: 4, 12: 4, 13: 5, 14: 2, 15: 1, 16: 1})


In [None]:
langRDD = sc.parallelize(['python','java','python','cpp','golang','rust','julia'])

#### get the number of partitions
getNumPartitions(): Returns the number of partitions our RDD is split into.

In [83]:
print('RDD Num Partitions:', langRDD.getNumPartitions())

RDD Num Partitions: 4


#### mapPartitions

In [81]:

itemsRDD = langRDD.mapPartitions(lambda iterator: [','.join(iterator)])
print(itemsRDD.collect())

['python', 'java,python', 'cpp,golang', 'rust,julia']


#### mapPartitionsWithIndex

In [82]:
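# Returning a list with one tuple yields one (index, items) element per partition
itemsByPartRDD = langRDD.mapPartitionsWithIndex(lambda index, iterator: [(index, list(iterator))])

print(itemsByPartRDD.collect())

# Returning a bare tuple instead: Spark iterates over it, so the index and the list become separate elements
itemsByPartRDD = langRDD.mapPartitionsWithIndex(lambda index, iterator: (index, list(iterator)))
print(itemsByPartRDD.collect())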

[(0, ['python']), (1, ['java', 'python']), (2, ['cpp', 'golang']), (3, ['rust', 'julia'])]
[0, ['python'], 1, ['java', 'python'], 2, ['cpp', 'golang'], 3, ['rust', 'julia']]



#### SparkContext - number of workers and lazy evaluation
#### Checking the impact of number of workers

While initializing the SparkContext, we can specify the number of worker threads with local[n]. Generally, it is recommended to have one worker per core of the machine, but the number can be smaller or larger. In the following code, we examine the impact of the number of workers on a parallelized operation.

In [7]:
sc.stop()

In [8]:
from time import time
from pyspark import SparkContext

for j in range(1,5):
    sc= SparkContext(master = "local[%d]"%(j))
    t0=time()
    for i in range(10):
        sc.parallelize([1,2]*10000).reduce(lambda x,y:x+y)
    print(f"{j} executors, time = {time()-t0}")
    sc.stop()

1 executors, time = 19.61769390106201
2 executors, time = 33.40109062194824
3 executors, time = 45.035314083099365
4 executors, time = 58.93501329421997


#### Evaluating the computing time

In [14]:
from math import sin
def computetime(x):
    [sin(j) for j in range(100)]
    return sin(x)

In [15]:
%%time
computetime(2)

CPU times: total: 0 ns
Wall time: 0 ns


0.9092974268256817

In [11]:
%%time
sc= SparkContext(master = "local[*]")
rdd1 = sc.parallelize(range(10000))

CPU times: total: 15.6 ms
Wall time: 346 ms


In [16]:
%%time
interim = rdd1.map(lambda x: computetime(x))

CPU times: total: 0 ns
Wall time: 0 ns


- The variable interim does not point to a data structure; instead, it points to a plan of execution, expressed as a dependency graph. The dependency graph defines how RDDs are computed from each other.

In [17]:
print(interim.toDebugString().decode())

(4) PythonRDD[1] at RDD at PythonRDD.scala:53 []
 |  ParallelCollectionRDD[0] at readRDDFromFile at PythonRDD.scala:274 []


In [18]:
%%time
print('output =',interim.reduce(lambda x,y:x+y))

output = 1.9395054106807104
CPU times: total: 0 ns
Wall time: 5.99 s


In [19]:
%%time
print(interim.filter(lambda x:x>0).count())

5001
CPU times: total: 0 ns
Wall time: 6.05 s



- Caching trades memory for computation time when similar operations are repeated on the same RDD.
- Run the same computation as before, using the cache() method to tell the dependency graph to plan for caching.

In [24]:
%%time
interim = rdd1.map(lambda x: computetime(x)).cache()

CPU times: total: 31.2 ms
Wall time: 24 ms


In [22]:
print(interim.toDebugString().decode())

(4) PythonRDD[5] at RDD at PythonRDD.scala:53 [Memory Serialized 1x Replicated]
 |  ParallelCollectionRDD[0] at readRDDFromFile at PythonRDD.scala:274 [Memory Serialized 1x Replicated]


In [25]:
%%time
print('output =',interim.reduce(lambda x,y:x+y))

output = 1.9395054106807104
CPU times: total: 31.2 ms
Wall time: 11.2 s


In [26]:
%%time
print(interim.filter(lambda x:x>0).count())

5001
CPU times: total: 46.9 ms
Wall time: 5.64 s
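
The effect of caching on repeated actions can be modelled with a small plain-Python class. `LazyDataset` is a hypothetical toy, not part of PySpark. It recomputes its values on every action unless `cache()` was called, in which case the first action stores the result for later ones.

```python
class LazyDataset:
    """Toy stand-in for an RDD: recomputes on every action unless cached."""

    def __init__(self, source, func):
        self.source = source
        self.func = func
        self.evaluations = 0      # how many times func has been applied
        self._use_cache = False
        self._cached = None

    def cache(self):
        # Only marks the dataset for caching; nothing is computed yet.
        self._use_cache = True
        return self

    def _compute(self):
        if self._cached is not None:
            return self._cached
        result = []
        for x in self.source:
            self.evaluations += 1
            result.append(self.func(x))
        if self._use_cache:
            self._cached = result
        return result

    def sum(self):
        return sum(self._compute())

    def count_positive(self):
        return sum(1 for v in self._compute() if v > 0)

uncached = LazyDataset(range(100), lambda x: x - 50)
uncached.sum()
uncached.count_positive()

cached = LazyDataset(range(100), lambda x: x - 50).cache()
cached.sum()
cached.count_positive()

print(uncached.evaluations, cached.evaluations)  # 200 100
```

In this model the first action on a cached dataset still pays the full computation cost, and only later actions benefit.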


#### Slices and Repartitioning

In [27]:
intRdd = sc.parallelize(range(1000000))
print(intRdd.getNumPartitions())

4


- We can repartition intRdd into any number of partitions we want.

In [29]:
reintrdd = intRdd.repartition(10)
print(reintrdd.getNumPartitions())

10


- The number of partitions can also be set while creating the RDD, using the numSlices argument.

In [30]:
intRdd = sc.parallelize(range(10000),numSlices=8)
print(intRdd.getNumPartitions())

8


### Why are partitions important?

   * They define the unit the executor works on
   * You should have at least as many partitions as the number of worker nodes
   * Smaller partitions may allow more parallelization



* glom() returns an RDD created by coalescing all elements within each partition into a list
* Repartitioning for Load Balancing

In [33]:
intRdd=sc.parallelize(range(100000)).map(lambda x:(x,x)).partitionBy(10)
print(intRdd.glom().map(len).collect())

[10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000]


In [34]:
# Select 20% of the entries
# A bad filter: keeping only numbers divisible by 5 leaves most partitions empty
output=intRdd.filter(lambda pair: pair[0]%5==0)
# Get the number of elements in each partition
print(output.glom().map(len).collect())

[10000, 0, 0, 0, 0, 10000, 0, 0, 0, 0]





* Future operations on output will use only two workers.
* The other workers will do nothing, because their partitions are empty.

#### Solution to the above problem

To fix the situation we need to repartition the unbalanced RDD.
One way to do that is to repartition on a new key, using the method partitionBy().

    The method .partitionBy(k) expects a (key,value) RDD; in this example the keys are integers.
    It partitions the RDD into k partitions.
    With integer keys, the element (key,value) is placed into partition no. key % k

In [35]:
# Integer division (//) keeps the new keys integers: 0, 0, 1, 1, 2, 2, ...
result=output.map(lambda pair:(pair[1]//10,pair[1])).partitionBy(10) 
print(result.glom().map(len).collect())

[2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000]
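
The partition sizes seen in this section can be checked with plain Python, assuming integer keys and the key % k placement rule described above. The helper `partition_sizes` is hypothetical, and the new key uses integer division (`//`) so that it stays an integer.

```python
def partition_sizes(keys, k):
    """Count how many keys land in each of k partitions under key % k."""
    sizes = [0] * k
    for key in keys:
        sizes[key % k] += 1
    return sizes

all_keys = range(100000)
kept = [x for x in all_keys if x % 5 == 0]  # same filter as in the example

balanced_before = partition_sizes(all_keys, 10)
skewed = partition_sizes(kept, 10)          # surviving keys end in 0 or 5
rebalanced = partition_sizes([x // 10 for x in kept], 10)

print(balanced_before)  # [10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000]
print(skewed)           # [10000, 0, 0, 0, 0, 10000, 0, 0, 0, 0]
print(rebalanced)       # [2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000]
```

The filter keeps only keys ending in 0 or 5, which all map to partitions 0 and 5. Dividing the key by 10 spreads the surviving elements evenly again.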



#### Another approach is to use random partitioning using repartition(k)

    An advantage of random partitioning is that it does not require defining a key.
    A disadvantage of random partitioning is that you have no control over the partitioning, i.e. over which elements go to which partition.



In [36]:
result=output.repartition(10)
print(result.glom().map(len).collect())

[2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000]


#### Word Count problem

In [9]:
import os
os.getcwd()

'C:\\Users\\Jayanth\\Python_DataAnalysis\\PySpark-lessons'

In [37]:
inputRdd = sc.textFile('file:///C:\\Users\\Jayanth\\Python_DataAnalysis\\PySpark-lessons\\datasets\\input.txt')

inputRdd.take(2)

['2.3.0', 'Overview']

In [38]:
### Number of lines in the file
inputRdd.count()

77

In [39]:
# Split each line into words, pair each word with 1, then sum the counts per word
wordRdd = (inputRdd
           .flatMap(lambda x : x.split(' '))
           .map(lambda word : (word,1))
           .reduceByKey(lambda a,b:a+b))

In [40]:
wordRdd.collect()

[('2.3.0', 2),
 ('Overview', 2),
 ('Programming', 3),
 ('Guides', 1),
 ('Docs', 1),
 ('Spark', 48),
 ('Apache', 3),
 ('is', 4),
 ('general-purpose', 1),
 ('It', 3),
 ('provides', 4),
 ('high-level', 1),
 ('APIs', 2),
 ('in', 12),
 ('Java,', 3),
 ('Scala,', 2),
 ('Python', 8),
 ('an', 2),
 ('optimized', 1),
 ('engine', 1),
 ('supports', 2),
 ('execution', 1),
 ('set', 1),
 ('of', 15),
 ('tools', 1),
 ('SQL', 2),
 ('processing,', 2),
 ('MLlib', 1),
 ('machine', 3),
 ('learning,', 1),
 ('GraphX', 1),
 ('graph', 1),
 ('Streaming.', 1),
 ('Downloading', 1),
 ('Get', 1),
 ('page', 1),
 ('project', 1),
 ('documentation', 1),
 ('version', 4),
 ('2.3.0.', 2),
 ('uses', 2),
 ('Hadoop’s', 1),
 ('HDFS', 1),
 ('YARN.', 1),
 ('are', 6),
 ('pre-packaged', 1),
 ('handful', 1),
 ('Users', 1),
 ('free”', 1),
 ('binary', 1),
 ('run', 10),
 ('augmenting', 1),
 ('Java', 6),
 ('include', 1),
 ('projects', 2),
 ('using', 5),
 ('Maven', 2),
 ('coordinates', 1),
 ('install', 1),
 ('like', 1),
 ('visit', 1),
 (

In [41]:
sortedWords = wordRdd.sortByKey()

In [42]:
sortedWords.collect()

[('(2.11.x).', 1),
 ('(Behind', 1),
 ('(Javadoc)', 1),
 ('(MkDocs)', 1),
 ('(Roxygen2)', 1),
 ('(Scala,', 1),
 ('(Scaladoc)', 1),
 ('(Sphinx)', 1),
 ('(YARN)', 1),
 ('(core', 1),
 ('(e.g.', 1),
 ('(newer', 1),
 ('(old', 1),
 ('(only', 1),
 ('(using', 1),
 ('-', 1),
 ('--help', 1),
 ('--master', 4),
 ('./bin/pyspark', 1),
 ('./bin/run-example', 1),
 ('./bin/spark-shell', 1),
 ('./bin/spark-submit', 2),
 ('./bin/sparkR', 1),
 ('1.4', 1),
 ('10', 2),
 ('2.10', 1),
 ('2.11.', 1),
 ('2.2.0.', 1),
 ('2.3.0', 2),
 ('2.3.0.', 2),
 ('2.6', 1),
 ('2.6.5', 1),
 ('2.7+/3.4+', 1),
 ('3.1+.', 1),
 ('5', 1),
 ('7,', 1),
 ('8+,', 1),
 ('<class>', 1),
 ('AMP', 1),
 ('API', 9),
 ('API)', 1),
 ('API),', 1),
 ('API,', 1),
 ('API.', 1),
 ('API;', 1),
 ('APIs', 2),
 ('Amazon', 1),
 ('Apache', 3),
 ('Applications:', 1),
 ('Berkeley', 1),
 ('Building', 2),
 ('Built-in', 1),
 ('Camps:', 1),
 ('Cloud', 1),
 ('Cluster', 2),
 ('Code', 1),
 ('Community', 1),
 ('Configuration:', 1),
 ('Contributing', 1),
 ('DStream