# Introduction To PySpark

PySpark is an interface for Apache Spark in Python.

It not only allows you to write Spark applications using Python APIs, but also provides the PySpark shell for interactively analyzing your data in a distributed environment.

PySpark supports most of Spark’s features such as Spark SQL, DataFrame, Streaming, MLlib (Machine Learning) and Spark Core.

<img src="https://spark.apache.org/docs/latest/api/python/_images/pyspark-components.png" />

## Spark SQL and DataFrame

Spark SQL is a Spark module for structured data processing. It provides a programming abstraction called DataFrame and can also act as a distributed SQL query engine.

### pandas API on Spark

pandas API on Spark allows you to scale out your pandas workload. With this package, you can:

- Be immediately productive with Spark, with no learning curve, if you are already familiar with pandas.
- Have a single codebase that works both with pandas (tests, smaller datasets) and with Spark (distributed datasets).
- Switch between the pandas API and PySpark API contexts easily, without any overhead.

## Streaming

Running on top of Spark, the streaming feature enables powerful interactive and analytical applications across both streaming and historical data, while inheriting Spark’s ease of use and fault-tolerance characteristics.

## MLlib

Built on top of Spark, MLlib is a scalable machine learning library that provides a uniform set of high-level APIs that help users create and tune practical machine learning pipelines.

## Spark Core

Spark Core is the underlying general execution engine for the Spark platform, on which all other functionality is built. It provides RDDs (Resilient Distributed Datasets) and in-memory computing capabilities.

## Installation

In [1]:
# !pip install pyspark findspark

In [2]:
import pyspark

In [3]:
import findspark

In [4]:
from pyspark.sql import SparkSession

In [5]:
findspark.init()
findspark.find()

'f:\\dcodetech\\pgp\\lib\\site-packages\\pyspark'

### Creating Spark Session

In [6]:
spark = SparkSession.builder.appName('Session 01').getOrCreate()

In [7]:
spark

## Read Data With PySpark

In [8]:
df = spark.read.csv('dataset/test1.csv')
df.show()

+---------+---+----------+------+
|      _c0|_c1|       _c2|   _c3|
+---------+---+----------+------+
|     Name|age|Experience|Salary|
|    Krish| 31|        10| 30000|
|Sudhanshu| 30|         8| 25000|
|    Sunny| 29|         4| 20000|
|     Paul| 24|         3| 20000|
|   Harsha| 21|         1| 15000|
|  Shubham| 23|         2| 18000|
+---------+---+----------+------+



### Using Header

In [9]:
df = spark.read.csv('dataset/test1.csv', header=True)
df.show()

+---------+---+----------+------+
|     Name|age|Experience|Salary|
+---------+---+----------+------+
|    Krish| 31|        10| 30000|
|Sudhanshu| 30|         8| 25000|
|    Sunny| 29|         4| 20000|
|     Paul| 24|         3| 20000|
|   Harsha| 21|         1| 15000|
|  Shubham| 23|         2| 18000|
+---------+---+----------+------+



In [10]:
df.printSchema()

root
 |-- Name: string (nullable = true)
 |-- age: string (nullable = true)
 |-- Experience: string (nullable = true)
 |-- Salary: string (nullable = true)



### Using inferSchema

In [11]:
df = spark.read.csv('dataset/test1.csv', header=True, inferSchema=True)
df.show()

+---------+---+----------+------+
|     Name|age|Experience|Salary|
+---------+---+----------+------+
|    Krish| 31|        10| 30000|
|Sudhanshu| 30|         8| 25000|
|    Sunny| 29|         4| 20000|
|     Paul| 24|         3| 20000|
|   Harsha| 21|         1| 15000|
|  Shubham| 23|         2| 18000|
+---------+---+----------+------+



In [12]:
df.printSchema()

root
 |-- Name: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- Experience: integer (nullable = true)
 |-- Salary: integer (nullable = true)



In [13]:
type(df)

pyspark.sql.dataframe.DataFrame

In [14]:
df

DataFrame[Name: string, age: int, Experience: int, Salary: int]

In [15]:
df.head()

Row(Name='Krish', age=31, Experience=10, Salary=30000)

In [17]:
df.tail(2)

[Row(Name='Harsha', age=21, Experience=1, Salary=15000),
 Row(Name='Shubham', age=23, Experience=2, Salary=18000)]

In [18]:
df.show()

+---------+---+----------+------+
|     Name|age|Experience|Salary|
+---------+---+----------+------+
|    Krish| 31|        10| 30000|
|Sudhanshu| 30|         8| 25000|
|    Sunny| 29|         4| 20000|
|     Paul| 24|         3| 20000|
|   Harsha| 21|         1| 15000|
|  Shubham| 23|         2| 18000|
+---------+---+----------+------+



In [22]:
df.dtypes

[('Name', 'string'), ('age', 'int'), ('Experience', 'int'), ('Salary', 'int')]

In [24]:
df.describe().show()

+-------+------+------------------+-----------------+------------------+
|summary|  Name|               age|       Experience|            Salary|
+-------+------+------------------+-----------------+------------------+
|  count|     6|                 6|                6|                 6|
|   mean|  null|26.333333333333332|4.666666666666667|21333.333333333332|
| stddev|  null| 4.179314138308661|3.559026084010437| 5354.126134736337|
|    min|Harsha|                21|                1|             15000|
|    max| Sunny|                31|               10|             30000|
+-------+------+------------------+-----------------+------------------+
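
The `stddev` row reports the sample standard deviation (dividing by n - 1), and `min`/`max` on the string column `Name` compare values alphabetically. A quick cross-check in plain Python, with no Spark involved, using the values from the table above:

```python
import statistics

# The age and Name columns as shown in the DataFrame above.
ages = [31, 30, 29, 24, 21, 23]
names = ["Krish", "Sudhanshu", "Sunny", "Paul", "Harsha", "Shubham"]

print(statistics.mean(ages))     # compare with the "mean" row
print(statistics.stdev(ages))    # sample standard deviation, compare with "stddev"
print(statistics.pstdev(ages))   # population standard deviation, for contrast
print(min(names), max(names))    # alphabetical min and max of the names
```

The sample figure agrees with the `stddev` row; the population figure is noticeably smaller.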



### Select

In [19]:
df.select('Name')

DataFrame[Name: string]

In [20]:
df.select('Name').show()

+---------+
|     Name|
+---------+
|    Krish|
|Sudhanshu|
|    Sunny|
|     Paul|
|   Harsha|
|  Shubham|
+---------+



In [21]:
df.select(['Name','Age']).show()

+---------+---+
|     Name|Age|
+---------+---+
|    Krish| 31|
|Sudhanshu| 30|
|    Sunny| 29|
|     Paul| 24|
|   Harsha| 21|
|  Shubham| 23|
+---------+---+



### Where

In [41]:
df.where(df.Name != 'Krish').show()

+---------+---+----------+------+
|     Name|age|Experience|Salary|
+---------+---+----------+------+
|Sudhanshu| 30|         8| 25000|
|    Sunny| 29|         4| 20000|
|     Paul| 24|         3| 20000|
|   Harsha| 21|         1| 15000|
|  Shubham| 23|         2| 18000|
+---------+---+----------+------+



In [42]:
df.where((df.age >= 24) & (df.Experience >= 3)).show()

+---------+---+----------+------+
|     Name|age|Experience|Salary|
+---------+---+----------+------+
|    Krish| 31|        10| 30000|
|Sudhanshu| 30|         8| 25000|
|    Sunny| 29|         4| 20000|
|     Paul| 24|         3| 20000|
+---------+---+----------+------+



### Add a Column

In [27]:
df = df.withColumn('Exp After 5 Years', df['Experience']+5)

In [28]:
df.show()

+---------+---+----------+------+-----------------+
|     Name|age|Experience|Salary|Exp After 5 Years|
+---------+---+----------+------+-----------------+
|    Krish| 31|        10| 30000|               15|
|Sudhanshu| 30|         8| 25000|               13|
|    Sunny| 29|         4| 20000|                9|
|     Paul| 24|         3| 20000|                8|
|   Harsha| 21|         1| 15000|                6|
|  Shubham| 23|         2| 18000|                7|
+---------+---+----------+------+-----------------+



### Drop a Column

In [29]:
df = df.drop('Exp After 5 Years')

In [30]:
df.show()

+---------+---+----------+------+
|     Name|age|Experience|Salary|
+---------+---+----------+------+
|    Krish| 31|        10| 30000|
|Sudhanshu| 30|         8| 25000|
|    Sunny| 29|         4| 20000|
|     Paul| 24|         3| 20000|
|   Harsha| 21|         1| 15000|
|  Shubham| 23|         2| 18000|
+---------+---+----------+------+



### Rename a Column

In [31]:
df.withColumnRenamed('Experience','Exp').show()

+---------+---+---+------+
|     Name|age|Exp|Salary|
+---------+---+---+------+
|    Krish| 31| 10| 30000|
|Sudhanshu| 30|  8| 25000|
|    Sunny| 29|  4| 20000|
|     Paul| 24|  3| 20000|
|   Harsha| 21|  1| 15000|
|  Shubham| 23|  2| 18000|
+---------+---+---+------+



In [32]:
from pyspark.sql.functions import lit

### Filter

In [33]:
df.filter(df['Name'] == 'Krish').show()

+-----+---+----------+------+
| Name|age|Experience|Salary|
+-----+---+----------+------+
|Krish| 31|        10| 30000|
+-----+---+----------+------+



### Replace

In [34]:
df.replace('Krish','Dharmesh').show()

+---------+---+----------+------+
|     Name|age|Experience|Salary|
+---------+---+----------+------+
| Dharmesh| 31|        10| 30000|
|Sudhanshu| 30|         8| 25000|
|    Sunny| 29|         4| 20000|
|     Paul| 24|         3| 20000|
|   Harsha| 21|         1| 15000|
|  Shubham| 23|         2| 18000|
+---------+---+----------+------+



### Creating a DataFrame

In [35]:
cols = ['Name','age','Experience','Salary']

In [37]:
new_df = spark.createDataFrame([('Test', 29, 5, 40000)], cols)

In [38]:
new_df.show()

+----+---+----------+------+
|Name|age|Experience|Salary|
+----+---+----------+------+
|Test| 29|         5| 40000|
+----+---+----------+------+



### Union

In [39]:
df_u = df.union(new_df)

In [40]:
df_u.show()

+---------+---+----------+------+
|     Name|age|Experience|Salary|
+---------+---+----------+------+
|    Krish| 31|        10| 30000|
|Sudhanshu| 30|         8| 25000|
|    Sunny| 29|         4| 20000|
|     Paul| 24|         3| 20000|
|   Harsha| 21|         1| 15000|
|  Shubham| 23|         2| 18000|
|     Test| 29|         5| 40000|
+---------+---+----------+------+



## Handling Missing Values

In [43]:
df_spark = spark.read.csv('dataset/test2.csv', header=True, inferSchema=True)

In [44]:
df_spark.show()

+--------+----+----------+------+
|    Name| Age|Experience|Salary|
+--------+----+----------+------+
|Dharmesh|  22|         3|  5000|
|  Mahesh|  23|         2|  4500|
|  Akshay|  24|         3|  4000|
|   Salam|  25|         1|  3800|
|  Mayuri|  23|         2|  3000|
|  Nilesh|null|      null|  4800|
|    null|  25|         5|  6000|
|    null|  26|      null|  null|
+--------+----+----------+------+



### Drop null Values

In [46]:
df_spark.na.drop().show()

+--------+---+----------+------+
|    Name|Age|Experience|Salary|
+--------+---+----------+------+
|Dharmesh| 22|         3|  5000|
|  Mahesh| 23|         2|  4500|
|  Akshay| 24|         3|  4000|
|   Salam| 25|         1|  3800|
|  Mayuri| 23|         2|  3000|
+--------+---+----------+------+



### Using how

In [47]:
df_spark.na.drop(how='any').show()

+--------+---+----------+------+
|    Name|Age|Experience|Salary|
+--------+---+----------+------+
|Dharmesh| 22|         3|  5000|
|  Mahesh| 23|         2|  4500|
|  Akshay| 24|         3|  4000|
|   Salam| 25|         1|  3800|
|  Mayuri| 23|         2|  3000|
+--------+---+----------+------+



In [48]:
df_spark.na.drop(how='all').show()

+--------+----+----------+------+
|    Name| Age|Experience|Salary|
+--------+----+----------+------+
|Dharmesh|  22|         3|  5000|
|  Mahesh|  23|         2|  4500|
|  Akshay|  24|         3|  4000|
|   Salam|  25|         1|  3800|
|  Mayuri|  23|         2|  3000|
|  Nilesh|null|      null|  4800|
|    null|  25|         5|  6000|
|    null|  26|      null|  null|
+--------+----+----------+------+



### Using thresh

In [50]:
df_spark.na.drop(thresh=3).show()

+--------+---+----------+------+
|    Name|Age|Experience|Salary|
+--------+---+----------+------+
|Dharmesh| 22|         3|  5000|
|  Mahesh| 23|         2|  4500|
|  Akshay| 24|         3|  4000|
|   Salam| 25|         1|  3800|
|  Mayuri| 23|         2|  3000|
|    null| 25|         5|  6000|
+--------+---+----------+------+



In [52]:
df_spark.na.drop(how='all', thresh=3).show()

+--------+---+----------+------+
|    Name|Age|Experience|Salary|
+--------+---+----------+------+
|Dharmesh| 22|         3|  5000|
|  Mahesh| 23|         2|  4500|
|  Akshay| 24|         3|  4000|
|   Salam| 25|         1|  3800|
|  Mayuri| 23|         2|  3000|
|    null| 25|         5|  6000|
+--------+---+----------+------+



In [53]:
df_spark.na.drop(how='any', thresh=3).show()

+--------+---+----------+------+
|    Name|Age|Experience|Salary|
+--------+---+----------+------+
|Dharmesh| 22|         3|  5000|
|  Mahesh| 23|         2|  4500|
|  Akshay| 24|         3|  4000|
|   Salam| 25|         1|  3800|
|  Mayuri| 23|         2|  3000|
|    null| 25|         5|  6000|
+--------+---+----------+------+
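
The last two cells return the same six rows because `thresh`, when given, takes precedence over `how`: a row is kept if it has at least `thresh` non-null values, and `how` is not consulted. The plain-Python sketch below models that rule on the rows of `test2.csv`; it illustrates the behavior and is not Spark's implementation (`keep_row` is a hypothetical helper).

```python
# Rows of dataset/test2.csv as shown above; None stands for null.
rows = [
    ("Dharmesh", 22, 3, 5000),
    ("Mahesh", 23, 2, 4500),
    ("Akshay", 24, 3, 4000),
    ("Salam", 25, 1, 3800),
    ("Mayuri", 23, 2, 3000),
    ("Nilesh", None, None, 4800),
    (None, 25, 5, 6000),
    (None, 26, None, None),
]


def keep_row(row, how="any", thresh=None):
    """Plain-Python model of the na.drop rule (hypothetical helper)."""
    non_null = sum(value is not None for value in row)
    if thresh is not None:
        # thresh takes precedence: `how` is ignored entirely.
        return non_null >= thresh
    if how == "any":
        return non_null == len(row)   # drop the row if any value is null
    return non_null > 0               # how="all": drop only rows that are all null


for how in ("any", "all"):
    kept = [r for r in rows if keep_row(r, how=how, thresh=3)]
    print(how, len(kept))
```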



### Using Subset

In [54]:
df_spark.na.drop(subset=['Age']).show()

+--------+---+----------+------+
|    Name|Age|Experience|Salary|
+--------+---+----------+------+
|Dharmesh| 22|         3|  5000|
|  Mahesh| 23|         2|  4500|
|  Akshay| 24|         3|  4000|
|   Salam| 25|         1|  3800|
|  Mayuri| 23|         2|  3000|
|    null| 25|         5|  6000|
|    null| 26|      null|  null|
+--------+---+----------+------+



In [63]:
df_spark.na.drop(how='all', thresh=2, subset=['Age','Experience']).show()

+--------+---+----------+------+
|    Name|Age|Experience|Salary|
+--------+---+----------+------+
|Dharmesh| 22|         3|  5000|
|  Mahesh| 23|         2|  4500|
|  Akshay| 24|         3|  4000|
|   Salam| 25|         1|  3800|
|  Mayuri| 23|         2|  3000|
|    null| 25|         5|  6000|
+--------+---+----------+------+



In [64]:
df_spark.na.drop(how='any', thresh=2, subset=['Age', 'Experience']).show()

+--------+---+----------+------+
|    Name|Age|Experience|Salary|
+--------+---+----------+------+
|Dharmesh| 22|         3|  5000|
|  Mahesh| 23|         2|  4500|
|  Akshay| 24|         3|  4000|
|   Salam| 25|         1|  3800|
|  Mayuri| 23|         2|  3000|
|    null| 25|         5|  6000|
+--------+---+----------+------+



### Fill null Values

In [66]:
df_spark = spark.read.csv('dataset/test2.csv', header=True)

In [68]:
df_spark.na.fill(value='Missing').show()

+--------+-------+----------+-------+
|    Name|    Age|Experience| Salary|
+--------+-------+----------+-------+
|Dharmesh|     22|         3|   5000|
|  Mahesh|     23|         2|   4500|
|  Akshay|     24|         3|   4000|
|   Salam|     25|         1|   3800|
|  Mayuri|     23|         2|   3000|
|  Nilesh|Missing|   Missing|   4800|
| Missing|     25|         5|   6000|
| Missing|     26|   Missing|Missing|
+--------+-------+----------+-------+



In [69]:
df_spark.na.fill(value='Missing', subset=['Experience','Salary']).show()

+--------+----+----------+-------+
|    Name| Age|Experience| Salary|
+--------+----+----------+-------+
|Dharmesh|  22|         3|   5000|
|  Mahesh|  23|         2|   4500|
|  Akshay|  24|         3|   4000|
|   Salam|  25|         1|   3800|
|  Mayuri|  23|         2|   3000|
|  Nilesh|null|   Missing|   4800|
|    null|  25|         5|   6000|
|    null|  26|   Missing|Missing|
+--------+----+----------+-------+



In [71]:
df.show()

+---------+---+----------+------+
|     Name|age|Experience|Salary|
+---------+---+----------+------+
|    Krish| 31|        10| 30000|
|Sudhanshu| 30|         8| 25000|
|    Sunny| 29|         4| 20000|
|     Paul| 24|         3| 20000|
|   Harsha| 21|         1| 15000|
|  Shubham| 23|         2| 18000|
+---------+---+----------+------+



## Filter Data

In [72]:
df.filter('Salary < 20000').show()

+-------+---+----------+------+
|   Name|age|Experience|Salary|
+-------+---+----------+------+
| Harsha| 21|         1| 15000|
|Shubham| 23|         2| 18000|
+-------+---+----------+------+



In [74]:
df.filter(df['Salary'] < 20000).show()

+-------+---+----------+------+
|   Name|age|Experience|Salary|
+-------+---+----------+------+
| Harsha| 21|         1| 15000|
|Shubham| 23|         2| 18000|
+-------+---+----------+------+



In [75]:
df.filter('Salary < 20000').select(['Name','age']).show()

+-------+---+
|   Name|age|
+-------+---+
| Harsha| 21|
|Shubham| 23|
+-------+---+



In [77]:
df.show()

+---------+---+----------+------+
|     Name|age|Experience|Salary|
+---------+---+----------+------+
|    Krish| 31|        10| 30000|
|Sudhanshu| 30|         8| 25000|
|    Sunny| 29|         4| 20000|
|     Paul| 24|         3| 20000|
|   Harsha| 21|         1| 15000|
|  Shubham| 23|         2| 18000|
+---------+---+----------+------+



In [79]:
df.filter((df['Salary'] >= 25000) | (df['age'] <= 25)).show()

+---------+---+----------+------+
|     Name|age|Experience|Salary|
+---------+---+----------+------+
|    Krish| 31|        10| 30000|
|Sudhanshu| 30|         8| 25000|
|     Paul| 24|         3| 20000|
|   Harsha| 21|         1| 15000|
|  Shubham| 23|         2| 18000|
+---------+---+----------+------+



In [80]:
df.filter((df['age'] > 24) & (df['Salary'] >= 25000)).show()

+---------+---+----------+------+
|     Name|age|Experience|Salary|
+---------+---+----------+------+
|    Krish| 31|        10| 30000|
|Sudhanshu| 30|         8| 25000|
+---------+---+----------+------+



## GroupBy

In [81]:
df_spark = spark.read.csv('dataset/test3.csv', header=True, inferSchema=True)

In [83]:
df_spark.show()

+---------+------------+------+
|     Name| Departments|salary|
+---------+------------+------+
|    Krish|Data Science| 10000|
|    Krish|         IOT|  5000|
|   Mahesh|    Big Data|  4000|
|    Krish|    Big Data|  4000|
|   Mahesh|Data Science|  3000|
|Sudhanshu|Data Science| 20000|
|Sudhanshu|         IOT| 10000|
|Sudhanshu|    Big Data|  5000|
|    Sunny|Data Science| 10000|
|    Sunny|    Big Data|  2000|
+---------+------------+------+



In [84]:
df_spark.printSchema()

root
 |-- Name: string (nullable = true)
 |-- Departments: string (nullable = true)
 |-- salary: integer (nullable = true)



In [85]:
df_spark.groupBy('Departments').sum().show()

+------------+-----------+
| Departments|sum(salary)|
+------------+-----------+
|         IOT|      15000|
|    Big Data|      15000|
|Data Science|      43000|
+------------+-----------+



In [86]:
df_spark.groupBy('Departments').avg().show()

+------------+-----------+
| Departments|avg(salary)|
+------------+-----------+
|         IOT|     7500.0|
|    Big Data|     3750.0|
|Data Science|    10750.0|
+------------+-----------+



In [87]:
df_spark.groupBy('Departments').count().show()

+------------+-----+
| Departments|count|
+------------+-----+
|         IOT|    2|
|    Big Data|    4|
|Data Science|    4|
+------------+-----+



In [88]:
df_spark.show()

+---------+------------+------+
|     Name| Departments|salary|
+---------+------------+------+
|    Krish|Data Science| 10000|
|    Krish|         IOT|  5000|
|   Mahesh|    Big Data|  4000|
|    Krish|    Big Data|  4000|
|   Mahesh|Data Science|  3000|
|Sudhanshu|Data Science| 20000|
|Sudhanshu|         IOT| 10000|
|Sudhanshu|    Big Data|  5000|
|    Sunny|Data Science| 10000|
|    Sunny|    Big Data|  2000|
+---------+------------+------+



In [89]:
from pyspark.sql import functions as f

In [92]:
df_spark.groupBy('Name').agg(f.min('Salary'), f.max('Salary')).show()

+---------+-----------+-----------+
|     Name|min(Salary)|max(Salary)|
+---------+-----------+-----------+
|Sudhanshu|       5000|      20000|
|    Sunny|       2000|      10000|
|    Krish|       4000|      10000|
|   Mahesh|       3000|       4000|
+---------+-----------+-----------+



## Sorting

In [95]:
df_spark.sort('Salary').show()

+---------+------------+------+
|     Name| Departments|salary|
+---------+------------+------+
|    Sunny|    Big Data|  2000|
|   Mahesh|Data Science|  3000|
|   Mahesh|    Big Data|  4000|
|    Krish|    Big Data|  4000|
|    Krish|         IOT|  5000|
|Sudhanshu|    Big Data|  5000|
|    Krish|Data Science| 10000|
|    Sunny|Data Science| 10000|
|Sudhanshu|         IOT| 10000|
|Sudhanshu|Data Science| 20000|
+---------+------------+------+



In [105]:
df_spark.sort(df_spark.salary.desc()).show()

+---------+------------+------+
|     Name| Departments|salary|
+---------+------------+------+
|Sudhanshu|Data Science| 20000|
|Sudhanshu|         IOT| 10000|
|    Krish|Data Science| 10000|
|    Sunny|Data Science| 10000|
|    Krish|         IOT|  5000|
|Sudhanshu|    Big Data|  5000|
|    Krish|    Big Data|  4000|
|   Mahesh|    Big Data|  4000|
|   Mahesh|Data Science|  3000|
|    Sunny|    Big Data|  2000|
+---------+------------+------+



In [108]:
df_spark.sort('Salary', ascending=False).show()

+---------+------------+------+
|     Name| Departments|salary|
+---------+------------+------+
|Sudhanshu|Data Science| 20000|
|Sudhanshu|         IOT| 10000|
|    Krish|Data Science| 10000|
|    Sunny|Data Science| 10000|
|    Krish|         IOT|  5000|
|Sudhanshu|    Big Data|  5000|
|    Krish|    Big Data|  4000|
|   Mahesh|    Big Data|  4000|
|   Mahesh|Data Science|  3000|
|    Sunny|    Big Data|  2000|
+---------+------------+------+



### OrderBy

In [109]:
df_spark.orderBy('Salary').show()

+---------+------------+------+
|     Name| Departments|salary|
+---------+------------+------+
|    Sunny|    Big Data|  2000|
|   Mahesh|Data Science|  3000|
|   Mahesh|    Big Data|  4000|
|    Krish|    Big Data|  4000|
|    Krish|         IOT|  5000|
|Sudhanshu|    Big Data|  5000|
|    Krish|Data Science| 10000|
|    Sunny|Data Science| 10000|
|Sudhanshu|         IOT| 10000|
|Sudhanshu|Data Science| 20000|
+---------+------------+------+



In [110]:
df_spark.orderBy(df_spark['Salary'].desc()).show()

+---------+------------+------+
|     Name| Departments|salary|
+---------+------------+------+
|Sudhanshu|Data Science| 20000|
|Sudhanshu|         IOT| 10000|
|    Krish|Data Science| 10000|
|    Sunny|Data Science| 10000|
|    Krish|         IOT|  5000|
|Sudhanshu|    Big Data|  5000|
|    Krish|    Big Data|  4000|
|   Mahesh|    Big Data|  4000|
|   Mahesh|Data Science|  3000|
|    Sunny|    Big Data|  2000|
+---------+------------+------+



In [111]:
df_spark.orderBy(df_spark['Salary'].desc(), df_spark['Name']).show()

+---------+------------+------+
|     Name| Departments|salary|
+---------+------------+------+
|Sudhanshu|Data Science| 20000|
|    Krish|Data Science| 10000|
|Sudhanshu|         IOT| 10000|
|    Sunny|Data Science| 10000|
|    Krish|         IOT|  5000|
|Sudhanshu|    Big Data|  5000|
|    Krish|    Big Data|  4000|
|   Mahesh|    Big Data|  4000|
|   Mahesh|Data Science|  3000|
|    Sunny|    Big Data|  2000|
+---------+------------+------+

