<a href="https://colab.research.google.com/github/imtheguna/PySpark-Learning/blob/GoogleColab/1_What_is_SparkSession.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **What is SparkSession?**

SparkSession is the entry point for any PySpark application, introduced in Spark 2.0 as a unified API to replace the need for separate SparkContext, SQLContext, and HiveContext.

# **Benefits of SparkSession**

**Simplified API:** SparkSession unifies the APIs of SparkContext, SQLContext, and HiveContext, making it easier for developers to interact with Spark’s core features without switching between multiple contexts.

**Configuration management:** You can easily configure a SparkSession by setting various options, such as the application name, the master URL, and other configurations.

**Access to Spark ecosystem:** SparkSession gives you access to the broader Spark ecosystem, such as DataFrames, Spark SQL, and MLlib, enabling you to build powerful data processing pipelines. (The typed Dataset API is available only in Scala and Java, not in PySpark.)

**Improved code readability:** By encapsulating multiple Spark contexts, SparkSession helps you write cleaner and more maintainable code.
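The configuration options described above are set on the builder before `getOrCreate()` is called. The following is a minimal sketch rather than this notebook's own setup: the helper name `build_session` and the values in `SETTINGS` are illustrative assumptions, while `appName`, `master`, `config`, and `getOrCreate` are the standard builder methods. PySpark is imported inside the function so the helper can be defined before the installation cell has run.

```python
# Illustrative settings (assumed values, not taken from this notebook).
SETTINGS = {
    "spark.sql.shuffle.partitions": "8",      # fewer shuffle partitions for small local data
    "spark.ui.showConsoleProgress": "false",  # hide console progress bars
}


def build_session(app_name="PySpark Application", master="local[*]", settings=None):
    """Return a SparkSession configured with the given key/value settings."""
    # Imported lazily so this helper can be defined before PySpark is installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name).master(master)
    for key, value in (settings or {}).items():
        builder = builder.config(key, value)
    # getOrCreate() returns the existing session if one is already running.
    return builder.getOrCreate()
```

Once PySpark is available, `spark = build_session(settings=SETTINGS)` would create the session, and `spark.conf.get("spark.sql.shuffle.partitions")` can be used to read a setting back.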

In [2]:
!apt-get update # Refresh the apt package index.
!apt-get install openjdk-8-jdk-headless -qq > /dev/null # Install Java 8.
!wget -q http://archive.apache.org/dist/spark/spark-3.1.1/spark-3.1.1-bin-hadoop3.2.tgz # Download Apache Spark 3.1.1.
!tar xf spark-3.1.1-bin-hadoop3.2.tgz # Extract the archive.
!pip install -q findspark # Install findspark, which adds PySpark to sys.path at runtime.

# Set environment variables
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.1.1-bin-hadoop3.2"

!ls

# Initialize findspark
import findspark
findspark.init()
!pip install pyspark # Optional: findspark already exposes the PySpark bundled under SPARK_HOME.


In [3]:
## Creating a SparkSession

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("PySpark Application") \
    .master("local[*]") \
    .getOrCreate()


In [4]:
## Accessing SparkSession Components

# Access SparkContext
spark_context = spark.sparkContext

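# Access the legacy SQLContext wrapper.
# _wrapped is a private attribute: it exists in Spark 3.1.x but may be missing in newer releases.
sql_context = spark._wrapped

# _jwrapped is the JVM-side SQLContext handle, not a HiveContext.
# Hive support is enabled on the builder with .enableHiveSupport();
# there is no separate HiveContext object to retrieve from the session.
jvm_sql_context = spark._jwrapped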
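
The underscore-prefixed attributes above are implementation details and may not exist in every Spark release. As a hedged alternative, the sketch below reads session details through public attributes only (`version`, `sparkContext.master`, `sparkContext.appName`). The helper name `describe_session` is hypothetical.

```python
def describe_session(spark):
    """Collect basic facts about a session using public attributes only."""
    sc = spark.sparkContext
    return {
        "version": spark.version,  # Spark version string
        "master": sc.master,       # master URL, e.g. "local[*]"
        "app_name": sc.appName,    # name set via builder.appName(...)
    }
```

In this notebook it could be called as `describe_session(spark)` right after the session is created.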

In [7]:
## Read CSV File

DF = spark.read.csv('/content/data2.csv', header=True, inferSchema=True)

DF.show(5)

+-------+----------------+-------------+-----------+-------------------+----------------+------------------+--------------+----------------------+-------------+---------------+
| period|series_reference|  region_name|filled jobs|filled jobs revised|filled jobs diff|filled jobs % diff|total_earnings|total earnings revised|earnings diff|earnings % diff|
+-------+----------------+-------------+-----------+-------------------+----------------+------------------+--------------+----------------------+-------------+---------------+
|2020.09|     BDCQ.SED1RA|    Northland|      65520|              65904|             384|               0.6|           953|                   959|            6|            0.6|
|2020.09|     BDCQ.SED1RB|     Auckland|     708372|             714506|            6134|               0.9|         12420|                 12530|          110|            0.9|
|2020.09|     BDCQ.SED1RC|      Waikato|     198776|             200265|            1489|               0.7|       

In [10]:
## Executing SQL Queries with SparkSession

DF.createOrReplaceTempView('Data')

df = spark.sql('select * from data limit 1')

df.show()

+-------+----------------+-----------+-----------+-------------------+----------------+------------------+--------------+----------------------+-------------+---------------+
| period|series_reference|region_name|filled jobs|filled jobs revised|filled jobs diff|filled jobs % diff|total_earnings|total earnings revised|earnings diff|earnings % diff|
+-------+----------------+-----------+-----------+-------------------+----------------+------------------+--------------+----------------------+-------------+---------------+
|2020.09|     BDCQ.SED1RA|  Northland|      65520|              65904|             384|               0.6|           953|                   959|            6|            0.6|
+-------+----------------+-----------+-----------+-------------------+----------------+------------------+--------------+----------------------+-------------+---------------+
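
Several columns in this table contain spaces or `%` (for example `filled jobs`), so they cannot be written as bare identifiers in a SQL string. Spark SQL accepts such names when they are wrapped in backticks. The sketch below builds a query that way; `quote_identifier` and `build_select` are hypothetical helpers, and the chosen columns are only an example.

```python
def quote_identifier(name):
    """Wrap a column name in backticks, doubling any embedded backtick."""
    return "`" + name.replace("`", "``") + "`"


def build_select(columns, table, limit=None):
    """Build a SELECT statement with every column name backtick-quoted."""
    query = "SELECT " + ", ".join(quote_identifier(c) for c in columns) + " FROM " + table
    if limit is not None:
        query += f" LIMIT {int(limit)}"
    return query


query = build_select(["region_name", "filled jobs", "filled jobs % diff"], "Data", limit=1)
print(query)
# SELECT `region_name`, `filled jobs`, `filled jobs % diff` FROM Data LIMIT 1
```

The resulting string can be passed to `spark.sql(query)` in the same way as the query above.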



In [14]:
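## Word Count

# Import only the functions used below; a wildcard import shadows Python built-ins such as sum and max.
from pyspark.sql.functions import col, explode, split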

df = spark.read.text('/content/sample.txt')

df = df.select(explode(split(col('value'),' ')).alias('Word'))

df = df.groupBy(col('Word')).count()

df.show()

+------------+-----+
|        Word|count|
+------------+-----+
|    Parquet,|    1|
|     reading|    1|
|         you|    1|
|         CSV|    1|
|     example|    1|
|        read|    1|
|      Here’s|    1|
|        such|    1|
|       file:|    1|
|    formats,|    1|
|       more.|    1|
|        data|    2|
|SparkSession|    1|
|       Avro,|    1|
|        file|    1|
|         the|    1|
|       write|    1|
|     writing|    1|
|        from|    1|
|         and|    3|
+------------+-----+
only showing top 20 rows
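
Because the text is split on a single space, punctuation stays attached to the words, which is why tokens such as `Parquet,` and `more.` appear in the output. The pure-Python sketch below mirrors the split, explode, and group-by steps on a made-up sentence (the contents of `sample.txt` are not shown here, so `SAMPLE` is an assumption) and contrasts them with a variant that lowercases the text and drops punctuation first.

```python
import re
from collections import Counter

# Assumed sample text; the notebook's sample.txt is not reproduced here.
SAMPLE = "Read data, write data and more."

# Same logic as split(col('value'), ' ') + explode + groupBy().count()
raw_counts = Counter(SAMPLE.split(" "))

# Variant: lowercase, then keep only runs of letters, digits, or apostrophes
clean_counts = Counter(re.findall(r"[a-z0-9']+", SAMPLE.lower()))

print(raw_counts["data"], raw_counts["data,"])  # 1 1
print(clean_counts["data"])                     # 2
```

In PySpark, a comparable cleanup could be applied before splitting by using `lower` and `regexp_replace` from `pyspark.sql.functions`.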



In [None]:
spark.stop()
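
If a cell fails before `spark.stop()` is reached, the session keeps running. One hedged way to guarantee cleanup is to wrap the work in `try`/`finally`. `run_and_stop` below is a hypothetical helper that accepts any session object and a callable.

```python
def run_and_stop(spark, job):
    """Run job(spark) and stop the session afterwards, even if job raises."""
    try:
        return job(spark)
    finally:
        spark.stop()
```

For example, `run_and_stop(spark, lambda s: s.range(3).count())` would run a small job and then release the session's resources.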