<h1>PySpark</h1>

<ol>
    <li>PySpark is the Python API for Apache Spark, used to process large datasets across a
distributed cluster. <br>It lets you write Spark applications in Python and use
Apache Spark's capabilities from them.</li>
    <li>Spark itself is written in Scala. Because of its wide adoption in industry, <br> an
equivalent Python API, PySpark, was released; it is built on Py4J. </li>
    <li><b>Py4J</b> is a library bundled with PySpark that lets Python
dynamically interface with JVM objects. <br>To run PySpark you therefore need Java
installed along with Python and Apache Spark. </li>
</ol>

<h2>Features of PySpark</h2>
<ol>
    <li>PySpark RDD (pyspark.RDD)</li>
    <li>PySpark DataFrame and SQL (pyspark.sql)</li>
    <li>PySpark Streaming (pyspark.streaming)</li>
    <li>PySpark MLlib (pyspark.ml, pyspark.mllib)</li>
    <li>PySpark GraphFrames (GraphFrames)</li>
</ol>

<h3>How to create a SparkContext using SparkConf?</h3>

In [1]:
# Classes that must be imported
from pyspark import SparkConf
from pyspark import SparkContext

In [2]:
# Create conf object
# setAppName should describe the program being run
sparkConf = SparkConf() \
 .setAppName("WordCount") \
 .setMaster("local")

In [3]:
# Create SparkContext object 
sc = SparkContext(conf=sparkConf)

Note: <b>Only one SparkContext</b> may be active per <b>JVM</b>. You must stop the active one with sc.stop() before
creating a new one.

<h3>How to create a SparkSession?</h3>

In [5]:
from pyspark.sql import SparkSession

In [9]:
spark = SparkSession.builder\
        .appName("WordCount")\
        .master("local[3]")\
        .getOrCreate()
# The underlying SparkContext is available as the property spark.sparkContext (no parentheses)

<h3>What is setMaster in the SparkContext and SparkSession code?</h3>
<p><b>setMaster(master)</b> sets the master URL to connect to,<br>
      such as "local" to run locally with one thread,<br> "local[4]" to run locally with 4 cores,<br>
    or "spark://master:7077" to run on a Spark standalone cluster.<br> With SparkSession, the same URL is passed to the builder's master() method.<br> Separately, SparkConf.setSparkHome(home) sets the location where Spark is installed on worker nodes.</p>

<h3>What does getOrCreate do in Spark?</h3>
<p>getOrCreate() returns the session (or context) that is already active in the current JVM and creates a new one only if none exists;<br>
   because every caller gets the same instance,<br>
   code running under the same Spark driver can share state such as broadcast variables.</p>
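
<p>The get-or-create behaviour can be sketched in plain Python. This is only an illustration of the pattern, not Spark's implementation: the Session class and its get_or_create method are hypothetical names, and the sketch ignores how Spark handles builder options when a session already exists.</p>

```python
class Session:
    """Hypothetical stand-in for a session object; not a Spark class."""

    _active = None  # the single shared instance, if one has been created

    def __init__(self, app_name):
        self.app_name = app_name

    @classmethod
    def get_or_create(cls, app_name):
        # Reuse the active instance when there is one; otherwise create it.
        if cls._active is None:
            cls._active = cls(app_name)
        return cls._active


first = Session.get_or_create("WordCount")
second = Session.get_or_create("AnotherName")

print(first is second)   # True: the second call returned the existing instance
print(second.app_name)   # WordCount: the name from the first call is kept
```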

<h3>Read data from a text file into an RDD</h3>

In [21]:
readFile = sc.textFile("hdfs://localhost:9000/user/saif/HFS/Input/wordcount.txt")

In [29]:
print(f"type(readFile) => {type(readFile)}")
print(f"type(readFile.collect()) => {type(readFile.collect())}")
print("\noutput=>")
for i in readFile.collect():
    print(i)

type(readFile) => <class 'pyspark.rdd.RDD'>
type(readFile.collect()) => <class 'list'>

output=>
Saif Ram Ram Ram Saif
Mitali Manas Mitali Mitali
Pramod Shravan Shravan


<h3>What does collect() do?</h3>
<p>collect() (and collectAsList() in the Java/Scala Dataset API) is an <b>action</b> that retrieves all the elements of the <b>RDD/DataFrame/Dataset</b> (from all nodes)<br> <b>to the driver node</b>.<br> Use collect() on small datasets, usually after operations such as filter(), groupBy() or count() have reduced the data. <br>Collecting a large dataset can cause an out-of-memory error on the driver.</p>

<h3>Split each line into words from the same file</h3>

In [36]:
splitWords = readFile.flatMap(lambda line: line.split(" "))
# take(n) returns the first n records to the driver
for i in splitWords.take(5):
    print(i)

Saif
Ram
Ram
Ram
Saif
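
<p>Why flatMap rather than map is used here can be seen in a plain-Python sketch. It only mimics the shape of the results on the three sample lines from the file and does not use Spark; the variable names are illustrative.</p>

```python
from itertools import chain

lines = [
    "Saif Ram Ram Ram Saif",
    "Mitali Manas Mitali Mitali",
    "Pramod Shravan Shravan",
]

# map-style: one output element per input line, so the result is a list of lists.
mapped = [line.split(" ") for line in lines]

# flatMap-style: the per-line lists are flattened into one sequence of words.
flat_mapped = list(chain.from_iterable(line.split(" ") for line in lines))

print(len(mapped))       # 3
print(len(flat_mapped))  # 12
print(flat_mapped[:5])   # ['Saif', 'Ram', 'Ram', 'Ram', 'Saif']
```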


<h3>Pair each word with the value 1</h3>

In [43]:
wordAssign = splitWords.map(lambda word: (word, 1)) 
for i in wordAssign.collect():
    print(i)

('Saif', 1)
('Ram', 1)
('Ram', 1)
('Ram', 1)
('Saif', 1)
('Mitali', 1)
('Manas', 1)
('Mitali', 1)
('Mitali', 1)
('Pramod', 1)
('Shravan', 1)
('Shravan', 1)


<h3>Count the occurrences of each word</h3>

In [44]:
wordCount = wordAssign.reduceByKey(lambda a, b: a + b)
for i in wordCount.collect():
    print(i)

('Saif', 2)
('Ram', 3)
('Mitali', 3)
('Manas', 1)
('Pramod', 1)
('Shravan', 2)
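
<p>What reduceByKey computes can be sketched in plain Python. The helper reduce_by_key below is hypothetical and runs on a single machine; Spark merges values within each partition and then across partitions, which is why the function passed in should be associative and commutative, and the order of the collected results is not guaranteed.</p>

```python
def reduce_by_key(pairs, func):
    """Hypothetical helper: merge the values that share a key using func."""
    merged = {}
    for key, value in pairs:
        if key in merged:
            merged[key] = func(merged[key], value)
        else:
            merged[key] = value
    return list(merged.items())


words = ["Saif", "Ram", "Ram", "Ram", "Saif",
         "Mitali", "Manas", "Mitali", "Mitali",
         "Pramod", "Shravan", "Shravan"]
pairs = [(word, 1) for word in words]

counts = reduce_by_key(pairs, lambda a, b: a + b)
for item in counts:
    print(item)
```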


<h3>Complete Word Count Program</h3>

In [1]:
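from pyspark import SparkConf
from pyspark import SparkContext

# Configure the application name and run locally with one thread
sparkConf = SparkConf() \
 .setAppName("WordCount") \
 .setMaster("local")

sc = SparkContext(conf=sparkConf)

# Read the input file as an RDD of lines
readFile = sc.textFile("hdfs://localhost:9000/user/saif/HFS/Input/wordcount.txt")

# Split each line into words, then pair each word with the value 1
splitWords = readFile.flatMap(lambda line: line.split(" "))
wordAssign = splitWords.map(lambda word: (word, 1))

# Sum the values for each word
wordCount = wordAssign.reduceByKey(lambda a, b: a + b)
for i in wordCount.collect():
    print(i)

# Stop the context so that a new one can be created later
sc.stop()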

('Saif', 2)
('Ram', 3)
('Mitali', 3)
('Manas', 1)
('Pramod', 1)
('Shravan', 2)
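
<p>One caveat about the split step: line.split(" ") works for the sample file because its words are separated by exactly one space. With repeated or trailing spaces it produces empty strings, which would then be counted as words. The sketch below shows the difference on a made-up line; calling split() with no argument is one possible alternative, not something the program above uses.</p>

```python
# Made-up line with two spaces between the words and one trailing space
line = "Saif  Ram "

by_single_space = line.split(" ")  # splits on every single space
by_whitespace = line.split()       # splits on runs of whitespace

print(by_single_space)  # ['Saif', '', 'Ram', '']
print(by_whitespace)    # ['Saif', 'Ram']
```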
