In [1]:
import pyspark
from pyspark.sql import SparkSession
import warnings

warnings.filterwarnings('ignore')

In [2]:
# Version of the module
pyspark.__version__

'3.5.5'

In [3]:
# Location of the module
pyspark.__file__

'/home/peter/spark/spark-3.5.5-bin-hadoop3/python/pyspark/__init__.py'

### Creating a Spark Session

Initialising the `spark` object with the `SparkSession` builder is relatively straightforward:
- `.getOrCreate()` : returns the existing Spark session if one has already been created, or else creates a new one from the builder settings
- `.appName()` : sets the name of the application, which is also the name shown in the Spark UI
- `.master()` : defines where and how Spark runs its computations. `local[*]` runs Spark locally using all available CPU cores, which is convenient for testing and for a demo session like this one
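Besides `local[*]`, the master URL also accepts `local` (a single worker thread) and `local[N]` (N worker threads). A minimal sketch of how this choice could be wrapped is shown here; `local_master` and `make_session` are hypothetical helper names for illustration, not part of PySpark, and `make_session` assumes `pyspark` is installed.

```python
def local_master(threads=None):
    """Return a local master URL: local[*] by default, local[N] for N threads."""
    if threads is None:
        return "local[*]"
    if threads < 1:
        raise ValueError("threads must be a positive integer")
    return f"local[{threads}]"


def make_session(app_name, threads=None):
    """Build (or fetch) a local SparkSession using the master URL above."""
    # Imported lazily so local_master() can be used without Spark installed.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .master(local_master(threads))
        .appName(app_name)
        .getOrCreate()
    )
```

For example, `make_session("demo", threads=2)` would request two local threads. If a session is already running, `.getOrCreate()` returns that session, so the master it was started with stays in effect.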

In [4]:
spark = SparkSession.builder.master("local[*]").appName('demo').getOrCreate()

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/03/10 13:43:12 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [5]:
# Read the .csv file. Spark does not assume a header row, so declare it with .option("header", "true")
df = spark.read.option("header", "true").csv("taxi_zone_lookup.csv")
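With only the `header` option set, Spark reads every CSV column as a string. One way to get typed columns is to pass an explicit schema, sketched below. The column names come from the file's header row, but the types (in particular `INT` for `LocationID`) are an assumption, and `ZONES_SCHEMA` and `read_zones` are hypothetical names used for illustration.

```python
# Assumed column types; the column names come from the CSV header.
ZONES_SCHEMA = "LocationID INT, Borough STRING, Zone STRING, service_zone STRING"


def read_zones(spark, path):
    """Read the zone lookup CSV with an explicit schema instead of all-string columns."""
    return (
        spark.read
        .option("header", "true")  # the first line holds the column names
        .schema(ZONES_SCHEMA)      # DDL-formatted schema string
        .csv(path)
    )
```

Calling `read_zones(spark, "taxi_zone_lookup.csv").printSchema()` would show the resulting types. An alternative is `.option("inferSchema", "true")`, which lets Spark guess the types at the cost of an extra pass over the data.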

In [6]:
df.show()

+----------+-------------+--------------------+------------+
|LocationID|      Borough|                Zone|service_zone|
+----------+-------------+--------------------+------------+
|         1|          EWR|      Newark Airport|         EWR|
|         2|       Queens|         Jamaica Bay|   Boro Zone|
|         3|        Bronx|Allerton/Pelham G...|   Boro Zone|
|         4|    Manhattan|       Alphabet City| Yellow Zone|
|         5|Staten Island|       Arden Heights|   Boro Zone|
|         6|Staten Island|Arrochar/Fort Wad...|   Boro Zone|
|         7|       Queens|             Astoria|   Boro Zone|
|         8|       Queens|        Astoria Park|   Boro Zone|
|         9|       Queens|          Auburndale|   Boro Zone|
|        10|       Queens|        Baisley Park|   Boro Zone|
|        11|     Brooklyn|          Bath Beach|   Boro Zone|
|        12|    Manhattan|        Battery Park| Yellow Zone|
|        13|    Manhattan|   Battery Park City| Yellow Zone|
|        14|     Brookly

> **NOTE:** The `.write.parquet()` method produces a folder in the current working directory. The folder contains an empty `_SUCCESS` file, which indicates that the job completed successfully, and one or more `part-*` files that hold the actual output.

In [7]:
# Now to output the dataframe as a .parquet file
df.write.parquet('zones')
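To confirm what the write produced, the output folder can be inspected with the standard library alone. This is a small sketch; `summarize_output` is a hypothetical helper, and it assumes the usual layout of an empty `_SUCCESS` marker next to `part-*.parquet` data files.

```python
from pathlib import Path


def summarize_output(path):
    """Report whether the _SUCCESS marker exists and list the Parquet part files."""
    out = Path(path)
    # Hidden checksum files start with a dot, so this pattern skips them.
    parts = sorted(p.name for p in out.glob("part-*.parquet"))
    return {"success": (out / "_SUCCESS").exists(), "parts": parts}
```

After the cell above has run, `summarize_output("zones")` would describe the new folder. Note that rerunning the write cell fails because the default save mode raises an error when the path already exists; `df.write.mode("overwrite").parquet("zones")` replaces the folder instead.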

> **IMPORTANT:** We need to forward another port, `4040`. This is the default port for the Spark web UI, where you can monitor your Spark jobs. The jobs we have run so far should appear there. Do check it out!
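Before opening the browser, it can help to check that the forwarded port is actually reachable. The sketch below uses only the standard library; `port_is_open` is a hypothetical helper, and `localhost` assumes the port has been forwarded to your own machine.

```python
import socket


def port_is_open(host="localhost", port=4040, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False
```

If `port_is_open()` returns `False` while the session is running, `spark.sparkContext.uiWebUrl` shows the address the UI is really bound to. When `4040` is already taken, Spark moves on to the next free port (`4041`, and so on).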