<h3>Creating RDDs</h3>

<b>There are three ways to create RDDs in Spark:</b>
<ul>
       <li>Parallelizing an existing collection in the driver program.</li>
       <li>Referencing a dataset in an external storage system (e.g. HDFS, HBase, or a shared file system).</li>
       <li>Creating a new RDD from an existing RDD.</li>
</ul>

<h3>Parallelizing a collection in the driver program</h3>

In [8]:
# Required imports
from pyspark import SparkConf
from pyspark import SparkContext
from pyspark.sql import SparkSession

In [3]:
sparkConf = SparkConf() \
    .setAppName("WordCount") \
    .setMaster("local")
sc = SparkContext(conf=sparkConf)

In [9]:
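spark = SparkSession.builder \
        .appName("WordCount") \
        .master("local[3]") \
        .getOrCreate()
# getOrCreate() reuses a SparkContext that is already running, so the master set here may not take effect.
# The underlying context is available as the attribute spark.sparkContext (it is not a method call).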

In [5]:
data = [1,2,3,4,5,6,7,8,9,10,11,12]
rdd = sc.parallelize(data)
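rdd.collect()        # returns every element to the driver; not displayed because it is not the cell's last expression
print(rdd.take(20))  # take(20) returns at most 20 elements, here all 12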

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]


<h3>What parallelize does in Spark</h3>
<p>Parallelize is a <b>method that creates an RDD from an existing collection (e.g. a list or an array)</b>
    present in the driver.<br> The elements of the collection are copied to form a distributed dataset that can be operated on in parallel.</p>
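<p>The idea can be illustrated without Spark. The sketch below is only an analogy in plain Python: it copies a local list into contiguous slices and processes each slice on its own worker thread. The helper names <b>split_into_partitions</b> and <b>map_in_parallel</b> are hypothetical and are not part of the Spark API.</p>

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(collection, num_partitions):
    """Copy a local collection into num_partitions contiguous slices."""
    items = list(collection)  # copy, so the original collection is left untouched
    n = len(items)
    return [items[i * n // num_partitions:(i + 1) * n // num_partitions]
            for i in range(num_partitions)]


def map_in_parallel(partitions, func):
    """Apply func to every element, using one worker thread per partition."""
    def process(partition):
        return [func(x) for x in partition]

    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        return list(pool.map(process, partitions))


partitions = split_into_partitions([1, 2, 3, 4, 5, 6], 3)
squared = map_in_parallel(partitions, lambda x: x * x)
flattened = [x for part in squared for x in part]  # loosely comparable to collect()

print(partitions)
print(flattened)
```

<p>In Spark the slices live on executors rather than in threads of one process, but the shape of the work is similar: split once, then apply the same function to each partition independently.</p>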

<h3>Create an RDD with a specified number of partitions</h3>

In [10]:
data = [1,2,3,4,5,6,7,8,9,10,11,12]
rdd = sc.parallelize(data)  # no partition count given, so Spark uses its default parallelism
print("Initial Partition Count:"+str(rdd.getNumPartitions()))
partition_data = [1,2,3,4,5,6,7,8,9,10,11,12]
partition_rdd = spark.sparkContext.parallelize(partition_data, 5)  # second argument requests 5 partitions
print(partition_rdd.collect())
print(partition_rdd.glom().collect())  # glom() groups the elements of each partition into a list
print("After changing Partition Count:"+str(partition_rdd.getNumPartitions()))

Initial Partition Count:1
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
[[1, 2], [3, 4], [5, 6], [7, 8], [9, 10, 11, 12]]
After changing Partition Count:5
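<p>The uneven layout in the glom() output (four partitions of two elements and one of four) can be reproduced with a small pure-Python model. It rests on an assumption about PySpark internals that may vary between versions: the list is first serialized in batches of <b>len(data) // numSlices</b> elements, and the batches, not the individual elements, are then sliced evenly across the partitions. The function names are hypothetical.</p>

```python
def even_slices(items, num_slices):
    """Slice a sequence into num_slices contiguous, nearly equal parts."""
    n = len(items)
    return [items[i * n // num_slices:(i + 1) * n // num_slices]
            for i in range(num_slices)]


def model_pyspark_parallelize(data, num_slices, max_batch=1024):
    """Model of the assumed PySpark behaviour: batch first, then slice the batches."""
    batch_size = max(1, min(len(data) // num_slices, max_batch))
    batches = [data[i:i + batch_size] for i in range(0, len(data), batch_size)]
    # Each partition receives whole batches; flatten them back into elements.
    return [[x for batch in group for x in batch]
            for group in even_slices(batches, num_slices)]


layout = model_pyspark_parallelize(list(range(1, 13)), 5)
print(layout)
```

<p>With 12 elements and 5 partitions the batch size is 2, which gives 6 batches. Slicing 6 batches into 5 parts leaves two batches in the last partition, which matches the output above.</p>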


<h3>Create an RDD from an external text file</h3>

In [11]:
readFile = sc.textFile("file:///home/saif/LFS/datasets/emp.txt")  # one RDD element per line of the file
readFile.collect()

['id,name,city', '101,saif,mumbai', '102,mitali,pune', '103,ram,balewadi']

<h3>Creating an RDD from an existing RDD</h3>

In [12]:
print("Initial Partition Count:"+str(readFile.getNumPartitions()))
# repartition() shuffles the data and can increase or decrease the partition count
repartition_Rdd = readFile.repartition(4)
print("Re-partition count:"+str(repartition_Rdd.getNumPartitions()))
# coalesce() merges partitions without a shuffle, so it can only reduce the count
coalesce_Rdd = repartition_Rdd.coalesce(3)
print("Coalesce count:"+str(coalesce_Rdd.getNumPartitions()))

Initial Partition Count:1
Re-partition count:4
Coalesce count:3
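<p>Spark offers two transformations that derive an RDD with a different partition count: repartition, which redistributes every element through a shuffle, and coalesce, which by default merges existing partitions without a shuffle and therefore cannot raise the count. The pure-Python model below sketches that difference. The functions <b>model_coalesce</b> and <b>model_repartition</b> are hypothetical, and the real grouping and distribution strategies inside Spark are more involved.</p>

```python
def model_coalesce(partitions, num_partitions):
    """Merge adjacent partitions without moving individual elements (no shuffle)."""
    current = len(partitions)
    if num_partitions >= current:
        # Without a shuffle the partition count cannot grow.
        return [list(p) for p in partitions]
    merged = []
    for i in range(num_partitions):
        start = i * current // num_partitions
        end = (i + 1) * current // num_partitions
        merged.append([x for p in partitions[start:end] for x in p])
    return merged


def model_repartition(partitions, num_partitions):
    """Redistribute every element round-robin (a stand-in for a full shuffle)."""
    result = [[] for _ in range(num_partitions)]
    all_elements = (x for p in partitions for x in p)
    for index, x in enumerate(all_elements):
        result[index % num_partitions].append(x)
    return result


# One partition holding the four lines of the text file shown earlier.
single = [["id,name,city", "101,saif,mumbai", "102,mitali,pune", "103,ram,balewadi"]]

grown = model_coalesce(single, 3)      # the count stays at 1
four = model_repartition(single, 4)    # the count becomes 4
three = model_coalesce(four, 3)        # the count drops to 3

print(len(grown), len(four), len(three))
```

<p>The practical consequence is that coalesce is the cheaper choice when reducing the number of partitions, while repartition is needed to increase it.</p>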
