In [10]:
# Import SparkSession
from pyspark.sql import SparkSession

# Create SparkSession 
spark = SparkSession.builder \
      .master("local[1]") \
      .appName("SparkByExamples.com") \
      .getOrCreate() 

# Create RDD from external Data source
rdd2 = spark.sparkContext.textFile("/path/test.txt")

In [None]:
RDD Transformations
Spark RDD transformations are lazy: they do not execute until you call an action on the RDD. Because RDDs are immutable, a transformation such as map() returns a new RDD instead of updating the current one.

Some transformations on RDDs are flatMap(), map(), reduceByKey(), filter(), and sortByKey(); each returns a new RDD instead of updating the current one.

RDD Actions
An RDD action returns values from an RDD to the driver node. In other words, any RDD function that returns something other than an RDD is considered an action. Actions trigger the computation and return the result to the driver program; collect(), for example, returns the elements of the RDD as a list.

Some actions on RDDs are count(), collect(), first(), max(), and reduce().

In [None]:
How to create SparkSession?
A SparkSession is created with SparkSession.builder.master("master-details").appName("app-name").getOrCreate(). In PySpark, builder is an attribute, so it is written without parentheses. The getOrCreate() method returns the existing SparkSession if there is one; otherwise it creates a new one.

How many SparkSessions can I create?
You can create as many SparkSession objects as you want in a Spark application, typically by calling newSession() on an existing session, because SparkSession.builder.getOrCreate() returns the existing session when one is already running. Multiple sessions are useful when you want to keep Spark tables (relational entities) logically separated.

How to stop SparkSession?
To stop a SparkSession, call the stop() method on the session object; if spark is your SparkSession, call spark.stop(). Calling stop() when your Spark application is finished ensures that resources are properly released and that the application terminates gracefully.

How SparkSession is different from SparkContext?
SparkSession and SparkContext are two core components of Apache Spark. Though they sound similar, they serve different purposes and are used in different contexts within a Spark application.
SparkContext provides the connection to a Spark cluster and is responsible for coordinating and distributing the operations on that cluster. SparkContext is used for low-level RDD (Resilient Distributed Dataset) programming.
SparkSession was introduced in Spark 2.0 to provide a more convenient and unified API for working with structured data. It’s designed to work with DataFrames and Datasets, which provide more structured and optimized operations than RDDs.

Do we need to stop SparkSession?
It is recommended to end the Spark session after finishing the Spark job in order for the JVMs to close and free the resources. 

How do I know if my Spark session is active?
To check whether a session is active in PySpark, call SparkSession.getActiveSession() (available since Spark 3.0). It returns the session that is active on the current thread, or None if there is none. The Scala API additionally exposes spark.sparkContext.isStopped, which returns true once the underlying SparkContext has been stopped.

In [None]:
What are the differences between Pandas and PySpark DataFrame?
Pandas and PySpark are both powerful tools for data manipulation and analysis in Python. Pandas is a widely-used library for working with smaller datasets in memory on a single machine, offering a rich set of functions for data manipulation and analysis. In contrast, PySpark, built on top of Apache Spark, is designed for distributed computing, allowing for the processing of massive datasets across multiple machines in a cluster.

In [None]:
Pandas is one of the most widely used open-source Python libraries for analyzing structured tabular data. It is heavily used in data analytics, machine learning, and data science projects.
Pandas can load data from CSV, JSON, SQL, and many other formats into a DataFrame, a structured object containing rows and columns (similar to a SQL table).
It doesn’t support distributed processing, so as your data grows you have to add resources to a single machine to get additional horsepower.
Pandas DataFrames are mutable and are evaluated eagerly rather than lazily; statistical functions are applied to each column by default.


In [11]:
import pandas as pd    
data = [["James","","Smith",30,"M",60000], 
        ["Michael","Rose","",50,"M",70000], 
        ["Robert","","Williams",42,"",400000], 
        ["Maria","Anne","Jones",38,"F",500000], 
        ["Jen","Mary","Brown",45,None,0]] 
columns=['First Name','Middle Name','Last Name','Age','Gender','Salary']

# Create the pandas DataFrame 
pandasDF=pd.DataFrame(data=data, columns=columns) 
  
# print dataframe. 
print(pandasDF)

  First Name Middle Name Last Name  Age Gender  Salary
0      James                 Smith   30      M   60000
1    Michael        Rose             50      M   70000
2     Robert              Williams   42         400000
3      Maria        Anne     Jones   38      F  500000
4        Jen        Mary     Brown   45   None       0


In [None]:
Note that Pandas adds a sequential index to every DataFrame. Below are some commonly used Pandas DataFrame methods. The statistical ones apply to numeric data; in pandas 2.x, select the numeric columns first or pass numeric_only=True when the DataFrame also contains strings.
df.count() – Returns the count of each column (the count includes only non-null values).
df.corr() – Returns the correlation between columns in a data frame.
df.head(n) – Returns first n rows from the top.
df.max() – Returns the maximum of each column.
df.mean() – Returns the mean of each column.
df.median() – Returns the median of each column.
df.min() – Returns the minimum value in each column.
df.std() – Returns the standard deviation of each column.
df.tail(n) – Returns last n rows.
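
A few of these methods can be tried on a small frame. This is a minimal sketch: the frame below reuses values from the example above but is trimmed to three columns, and the names df and numeric are chosen for the illustration. The numeric columns are selected explicitly so that the statistics do not trip over the string column.

```python
import pandas as pd

# Values reused from the example above, trimmed to three columns for brevity.
df = pd.DataFrame(
    {
        "Gender": ["M", "M", "", "F", None],
        "Age": [30, 50, 42, 38, 45],
        "Salary": [60000, 70000, 400000, 500000, 0],
    }
)

# count() skips missing values, so Gender reports 4 rather than 5.
print(df.count())

# Select the numeric columns before calling the statistical methods.
numeric = df[["Age", "Salary"]]
print(numeric.mean())
print(numeric.median())
print(numeric.corr())

# head(n) returns the first n rows.
print(df.head(2))
```

Each call runs immediately and returns its result, which is the eager behavior that distinguishes Pandas from Spark's lazy evaluation.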

In [None]:
PySpark is a Python API for Apache Spark, a distributed computing framework designed for processing large-scale datasets across clusters 
of machines. PySpark enables parallelized data processing and analysis by distributing computations across multiple nodes in a cluster, 
providing scalability and high performance for big data analytics tasks. It offers a DataFrame API that resembles Pandas, allowing users 
to perform similar data manipulation operations but on distributed datasets.
In comparison, PySpark is designed for handling large-scale datasets that exceed the memory capacity of a single machine, making it 
suitable for big data analytics tasks that require distributed computing capabilities. While both PySpark and Pandas offer similar 
DataFrame APIs and data manipulation functionalities, PySpark’s distributed architecture provides scalability and parallelism for 
processing massive datasets across distributed clusters. Ultimately, the choice between PySpark and Pandas depends on the scale of 
the datasets and the computational resources available for data analysis tasks.

In [None]:
A PySpark DataFrame is immutable (it cannot be changed once created) and fault-tolerant, and its transformations are evaluated lazily (they are not executed until an action is called). PySpark DataFrames are distributed across the cluster (the data is stored in partitions on different machines), and operations on them run in parallel across those machines.

Below are some PySpark transformations.
select() – Choose specific columns from a DataFrame.
filter() – Filter rows based on a condition.
groupBy() – Group rows based on one or more columns.
agg() – Perform aggregate functions (e.g., sum, average) on grouped data.
orderBy() – Sort rows based on one or more columns.
dropDuplicates() – Remove duplicate rows from the DataFrame.
withColumn() – Add a new column or replace an existing column with modified data.
drop() – Remove one or more columns from the DataFrame.
join() – Merge two DataFrames based on common columns or a join expression.
pivot() – Called after groupBy() to turn the distinct values of a column into separate columns.


In [12]:
from pyspark.sql import SparkSession

# Create SparkSession
spark = SparkSession.builder \
               .appName('SparkByExamples.com') \
               .getOrCreate()

data = [("James","","Smith",30,"M",60000),
        ("Michael","Rose","",50,"M",70000),
        ("Robert","","Williams",42,"",400000),
        ("Maria","Anne","Jones",38,"F",500000),
        ("Jen","Mary","Brown",45,"F",0)]

columns = ["first_name","middle_name","last_name","Age","gender","salary"]
pysparkDF = spark.createDataFrame(data = data, schema = columns)
pysparkDF.printSchema()
pysparkDF.show(truncate=False)

root
 |-- first_name: string (nullable = true)
 |-- middle_name: string (nullable = true)
 |-- last_name: string (nullable = true)
 |-- Age: long (nullable = true)
 |-- gender: string (nullable = true)
 |-- salary: long (nullable = true)

+----------+-----------+---------+---+------+------+
|first_name|middle_name|last_name|Age|gender|salary|
+----------+-----------+---------+---+------+------+
|James     |           |Smith    |30 |M     |60000 |
|Michael   |Rose       |         |50 |M     |70000 |
|Robert    |           |Williams |42 |      |400000|
|Maria     |Anne       |Jones    |38 |F     |500000|
|Jen       |Mary       |Brown    |45 |F     |0     |
+----------+-----------+---------+---+------+------+



In [None]:
PySpark supports SQL queries to run transformations. All you need to do is create a Table/View from the PySpark DataFrame.

In [13]:
# PySpark SQL
pysparkDF.createOrReplaceTempView("Employee")
spark.sql("select * from Employee where salary > 100000").show()
spark.sql("select mean(age),mean(salary) from Employee").show()

+----------+-----------+---------+---+------+------+
|first_name|middle_name|last_name|Age|gender|salary|
+----------+-----------+---------+---+------+------+
|    Robert|           | Williams| 42|      |400000|
|     Maria|       Anne|    Jones| 38|     F|500000|
+----------+-----------+---------+---+------+------+

+---------+------------+
|mean(age)|mean(salary)|
+---------+------------+
|     41.0|    206000.0|
+---------+------------+



In [None]:
Create PySpark DataFrame from Pandas
Because PySpark can execute in parallel on all cores of multiple machines, it can process large datasets faster than Pandas, so you often need to convert a Pandas DataFrame to PySpark (Spark with Python) for better performance. This is one of the major differences between Pandas and PySpark DataFrames.


In [14]:
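# Create PySpark DataFrame from Pandas
# Note: pandas 2.0 removed DataFrame.iteritems(), which older PySpark releases
# still call inside createDataFrame(); that causes the AttributeError recorded below.
# Upgrading PySpark, or aliasing the method before the call, avoids it:
# pd.DataFrame.iteritems = pd.DataFrame.items
pysparkDF2 = spark.createDataFrame(pandasDF)
pysparkDF2.printSchema()
pysparkDF2.show()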

AttributeError: 'DataFrame' object has no attribute 'iteritems'

In [None]:
Create Pandas from PySpark DataFrame
Once the transformations are done in Spark, you can easily convert the result back to Pandas using the toPandas() method.

Note: toPandas() is an action that collects all the data into the Spark driver's memory, so be very careful with large datasets. If the collected data doesn’t fit in driver memory, the job fails with an out-of-memory error.

In [15]:
# Convert PySpark to Pandas
pandasDF = pysparkDF.toPandas()
print(pandasDF)

  first_name middle_name last_name  Age gender  salary
0      James                 Smith   30      M   60000
1    Michael        Rose             50      M   70000
2     Robert              Williams   42         400000
3      Maria        Anne     Jones   38      F  500000
4        Jen        Mary     Brown   45      F       0


In [None]:
How to Decide Between Pandas vs PySpark
Deciding between Pandas and PySpark depends on several factors, including the scale of the data, available computational resources, and specific requirements of the data analysis tasks. Here are some considerations to help you decide:

Data Scale:
Use Pandas for small to medium-sized datasets that fit into memory and require rapid in-memory data manipulation and analysis.
Choose PySpark for large-scale datasets that exceed the memory capacity of a single machine and require distributed computing capabilities for parallelized data processing.
Computational Resources:
If you have limited computational resources or a single machine environment, Pandas may be more suitable due to its in-memory processing capabilities.
For distributed computing environments with access to clusters of machines, PySpark offers scalability and parallelism for processing massive datasets across distributed clusters.
Performance:
Pandas performs well for small to medium-sized datasets but may struggle with large-scale datasets due to memory constraints.
PySpark excels in processing large-scale datasets across distributed clusters, offering scalability and parallelism for improved performance.
Ecosystem and Integration:
Pandas has a mature ecosystem with extensive support for data manipulation, visualization, and analysis tools, making it suitable for a wide range of data analysis tasks.
PySpark integrates with the broader Apache Spark ecosystem, offering support for various data sources, machine learning libraries, and streaming processing capabilities.

In [None]:
Python SparkContext:
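Once you have a SparkContext object, you can create a PySpark RDD in several ways; the example below uses the range() function.
Only one SparkContext can be active at a time, so trying to create a second one raises the error below. SparkContext.getOrCreate() avoids this by returning the context that is already running.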

ValueError: Cannot run multiple SparkContexts at once;

In [1]:
# Create Spark Context
from pyspark import SparkConf, SparkContext
conf = SparkConf()
conf.setMaster("local").setAppName("Spark Example App")
# getOrCreate() reuses a running SparkContext instead of raising the ValueError above
sc = SparkContext.getOrCreate(conf)
print(sc.appName)

# Create RDD (use sc directly; no SparkSession named spark exists in this cell)
rdd = sc.range(1, 5)
print(rdd.collect())

Spark Example App
[1, 2, 3, 4]

In [None]:
SparkContext Commonly Used Variables

applicationId – Returns the unique ID of a PySpark application.
version – The version of Spark on which your job is running.
uiWebUrl – The URL of the Spark Web UI started by this SparkContext.

In [None]:
SparkContext Commonly Used Methods
accumulator(value[, accum_param]) – Creates a PySpark accumulator variable with the specified initial value. Tasks can only add to an accumulator; only the driver can read its value.

broadcast(value) – Creates a read-only PySpark broadcast variable. Its value is sent to each node in the cluster once and cached there for reuse across tasks.

emptyRDD() – Creates an empty RDD

getOrCreate() – Creates or returns a SparkContext

hadoopFile() – Returns an RDD of a Hadoop file

newAPIHadoopFile() – Creates an RDD for a Hadoop file with a new API InputFormat.

sequenceFile() – Get an RDD for a Hadoop SequenceFile with given key and value types.

setLogLevel() – Changes the log level, for example to DEBUG, INFO, WARN, ERROR, or FATAL.

textFile() – Reads a text file from HDFS, the local file system, or any Hadoop-supported file system and returns an RDD of strings.

union() – Builds the union of a list of RDDs.

wholeTextFiles() – Reads a directory of text files from HDFS, the local file system, or any Hadoop-supported file system and returns an RDD of (file path, file content) pairs, one pair per file.