## Introduction to PySpark

In this section, we will cover the following concepts:
1. Importing "pyspark" and other library packages
2. Creating a new Spark session and inspecting the attributes of SparkSession
3. Reading a local .csv file using PySpark
4. Reading a remote .csv file using PySpark
5. Displaying the contents of the .csv file using the show() function

In [1]:
# Import the pyspark package and the SparkSession entry point
import pyspark
from pyspark.sql import SparkSession

In [2]:
# Create a new Spark session (or reuse the active one, if any)
spark = SparkSession.builder.appName("BinDataframeApp").getOrCreate()

Create a new .csv file here so that it can be processed with PySpark in the later steps. Some sample .csv datasets are attached.
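
If the attached datasets are not at hand, a small sample file can be generated with Python's standard csv module. The sketch below is only an illustration: the file name sample_country_wise.csv and the temporary directory are assumptions, and the three rows are copied from the output shown later in this section.

```python
import csv
import os
import tempfile

# Hypothetical location: a temporary directory, so nothing in ./datasets is overwritten.
sample_dir = tempfile.mkdtemp()
sample_path = os.path.join(sample_dir, "sample_country_wise.csv")

# Header and rows copied from the output shown in this section.
header = ["Country/Region", "Confirmed", "Deaths", "Recovered", "Active",
          "New cases", "New deaths", "New recovered", "WHO Region"]
rows = [
    ["Afghanistan", 36263, 1269, 25198, 9796, 106, 10, 18, "Eastern Mediterranean"],
    ["Albania", 4880, 144, 2745, 1991, 117, 6, 63, "Europe"],
    ["Algeria", 27973, 1163, 18837, 7973, 616, 8, 749, "Africa"],
]

# newline="" lets the csv module control line endings itself.
with open(sample_path, "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(header)
    writer.writerows(rows)

print(sample_path)
```

The printed path can then be passed to spark.read.csv in place of the dataset path used in the cells that follow.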

In [3]:
# Display the spark session details
spark

In [17]:
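# Load a CSV file and return the result as a DataFrame.
# Without the header option, the first line of the file is read as data
# and the columns get the default names _c0, _c1, ...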
# Ref: https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameReader.csv.html

csv_df = spark.read.csv('./datasets/country_wise_latest.csv')

In [5]:
# Display the first rows of the DataFrame (show() prints 20 rows by default)
csv_df.show()

+-------------------+---------+------+---------+------+---------+----------+-------------+--------------------+
|                _c0|      _c1|   _c2|      _c3|   _c4|      _c5|       _c6|          _c7|                 _c8|
+-------------------+---------+------+---------+------+---------+----------+-------------+--------------------+
|     Country/Region|Confirmed|Deaths|Recovered|Active|New cases|New deaths|New recovered|          WHO Region|
|        Afghanistan|    36263|  1269|    25198|  9796|      106|        10|           18|Eastern Mediterra...|
|            Albania|     4880|   144|     2745|  1991|      117|         6|           63|              Europe|
|            Algeria|    27973|  1163|    18837|  7973|      616|         8|          749|              Africa|
|            Andorra|      907|    52|      803|    52|       10|         0|            0|              Europe|
|             Angola|      950|    41|      242|   667|       18|         1|            0|              

In [6]:
# Treat the first line of the CSV file as the column headers of the DataFrame
spark.read.option('header','true').csv('./datasets/country_wise_latest.csv').show()

+-------------------+---------+------+---------+------+---------+----------+-------------+--------------------+
|     Country/Region|Confirmed|Deaths|Recovered|Active|New cases|New deaths|New recovered|          WHO Region|
+-------------------+---------+------+---------+------+---------+----------+-------------+--------------------+
|        Afghanistan|    36263|  1269|    25198|  9796|      106|        10|           18|Eastern Mediterra...|
|            Albania|     4880|   144|     2745|  1991|      117|         6|           63|              Europe|
|            Algeria|    27973|  1163|    18837|  7973|      616|         8|          749|              Africa|
|            Andorra|      907|    52|      803|    52|       10|         0|            0|              Europe|
|             Angola|      950|    41|      242|   667|       18|         1|            0|              Africa|
|Antigua and Barbuda|       86|     3|       65|    18|        4|         0|            5|            Am

In [16]:
# Print the schema of the dataframe
csv_df.printSchema()

root
 |-- _c0: string (nullable = true)
 |-- _c1: string (nullable = true)
 |-- _c2: string (nullable = true)
 |-- _c3: string (nullable = true)
 |-- _c4: string (nullable = true)
 |-- _c5: string (nullable = true)
 |-- _c6: string (nullable = true)
 |-- _c7: string (nullable = true)
 |-- _c8: string (nullable = true)



In [7]:
# Check the type of csv_df (a PySpark DataFrame)
type(csv_df)

pyspark.sql.dataframe.DataFrame

In [10]:
# Return the first n rows of the DataFrame as a list of Row objects
csv_df.head(3)

[Row(_c0='Country/Region', _c1='Confirmed', _c2='Deaths', _c3='Recovered', _c4='Active', _c5='New cases', _c6='New deaths', _c7='New recovered', _c8='WHO Region'),
 Row(_c0='Afghanistan', _c1='36263', _c2='1269', _c3='25198', _c4='9796', _c5='106', _c6='10', _c7='18', _c8='Eastern Mediterranean'),
 Row(_c0='Albania', _c1='4880', _c2='144', _c3='2745', _c4='1991', _c5='117', _c6='6', _c7='63', _c8='Europe')]