<a href="https://colab.research.google.com/github/groda/big_data/blob/master/demoSparkSQLPython.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<a href="https://github.com/groda/big_data"><div><img src="https://github.com/groda/big_data/blob/master/logo_bdb.png?raw=true" align=right width="90" alt="Logo Big Data for Beginners"></div></a>
# Getting started with Spark: Spark SQL in Python

This tutorial is based on [Spark SQL Guide - Getting started](https://spark.apache.org/docs/latest/sql-getting-started.html). All Spark jobs run on the local Spark engine included in the PySpark packaging (see [PySpark project](https://pypi.org/project/pyspark/) for details).

For this demo we used the city of Vienna trees dataset ("Baumkataster") made available by [Open Data Österreich](https://www.data.gv.at) and downloadable from [here](https://www.data.gv.at/katalog/dataset/c91a4635-8b7d-43fe-9b27-d95dec8392a7).

# Table of contents
1. [Spark session](#sparkSession)
2. [Count the number of rows with `count()`](#count)
3. [Pretty-printing](#prettyprint)
4. [`groupBy`](#groupby)
5. [Running SQL Queries Programmatically](#SQLquery)
   - [Data cleaning](#cleaning)
6. [Some data exploration](#exploration)
7. [Close Spark session](#closing)

## Spark session <a name="sparkSession"></a>

We're going to start by creating a [Spark _session_](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/spark_session.html). Our Spark job will be named "Python Spark SQL basic example". `spark` is the variable holding our Spark session.

In [1]:
from pyspark.sql import SparkSession
spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .getOrCreate()

First download the dataset as a CSV file. We will then read it into a Spark [_dataframe_](https://spark.apache.org/docs/latest/sql-programming-guide.html#datasets-and-dataframes).

In [2]:
import pandas as pd
import requests
import os
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, IntegerType

url = "https://data.wien.gv.at/daten/geo?service=WFS&request=GetFeature&version=1.1.0&typeName=ogdwien:BAUMKATOGD&srsName=EPSG:4326&outputFormat=csv"
file_path = "baumkataster.csv"

# Download the file using requests and save it locally if it doesn't exist
if not os.path.exists(file_path):
    print(f"Downloading {file_path}...")
    response = requests.get(url)
    response.raise_for_status()  # stop here instead of saving an error page
    with open(file_path, "wb") as f:
        f.write(response.content)
else:
    print(f"{file_path} already exists, skipping download.")


Downloading baumkataster.csv...


Look at the first few lines of the `.csv` file to figure out its format (the separator, and whether the first row contains column headers). Spark reads CSV files as UTF-8 unless told otherwise, which works for this file.

In [3]:
!head baumkataster.csv

FID,OBJECTID,SHAPE,BAUM_ID,DATENFUEHRUNG,BEZIRK,OBJEKT_STRASSE,GEBIETSGRUPPE,GATTUNG_ART,PFLANZJAHR,PFLANZJAHR_TXT,STAMMUMFANG,STAMMUMFANG_TXT,BAUMHOEHE,BAUMHOEHE_TXT,KRONENDURCHMESSER,KRONENDURCHMESSER_TXT,BAUMNUMMER,SE_ANNO_CAD_DATA
BAUMKATOGD.742832161,742832161,POINT (16.37305290330624 48.25396526394613),128890,magistrat,20,Knoten Nordbrücke,Hauptstraße B14,Prunus mahaleb (Steinweichsel),0,nicht definiert,148,148 cm,2,6-10 m,3,7-9 m,2560,
BAUMKATOGD.742832162,742832162,POINT (16.27350822664159 48.14422147386553),401554,magistrat,23,Schwarzwaldgasse,"MA 28 - Straße, Grünanlage",Jungbaum wird gepflanzt,0,nicht definiert,0,nicht bekannt,0,nicht bekannt,0,nicht bekannt,2016,
BAUMKATOGD.742832163,742832163,POINT (16.398122022316993 48.20013311276198),130667,magistrat,3,Apostelgasse,"MA 28 - Straße, Grünanlage",Tilia tomentosa (Silberlinde),1927,1927,230,230 cm,4,16-20 m,5,13-15 m,1007,
BAUMKATOGD.742832164,742832164,POINT (16.37306490734576 48.25399152008363),128891,magistrat,20,Kno
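
The same check can also be done from Python. The sketch below is only an illustration: it parses the first three lines shown above with the standard-library `csv` module (the names `sample_lines`, `rows` and `gebiet` are ours, not part of the notebook). It confirms that the separator is a comma, that the first row is a header, and that a quoted value containing a comma is still a single field.

```python
import csv

# The first three lines of baumkataster.csv, copied from the `head` output above
sample_lines = [
    "FID,OBJECTID,SHAPE,BAUM_ID,DATENFUEHRUNG,BEZIRK,OBJEKT_STRASSE,GEBIETSGRUPPE,"
    "GATTUNG_ART,PFLANZJAHR,PFLANZJAHR_TXT,STAMMUMFANG,STAMMUMFANG_TXT,BAUMHOEHE,"
    "BAUMHOEHE_TXT,KRONENDURCHMESSER,KRONENDURCHMESSER_TXT,BAUMNUMMER,SE_ANNO_CAD_DATA",
    "BAUMKATOGD.742832161,742832161,POINT (16.37305290330624 48.25396526394613),"
    "128890,magistrat,20,Knoten Nordbrücke,Hauptstraße B14,"
    "Prunus mahaleb (Steinweichsel),0,nicht definiert,148,148 cm,2,6-10 m,3,7-9 m,2560,",
    'BAUMKATOGD.742832162,742832162,POINT (16.27350822664159 48.14422147386553),'
    '401554,magistrat,23,Schwarzwaldgasse,"MA 28 - Straße, Grünanlage",'
    'Jungbaum wird gepflanzt,0,nicht definiert,0,nicht bekannt,0,nicht bekannt,'
    '0,nicht bekannt,2016,',
]

# csv.reader accepts any iterable of lines; the default delimiter is a comma
rows = list(csv.reader(sample_lines))
header = rows[0]

# Every row has as many fields as the header
print(len(header), [len(r) for r in rows])

# The quoted field keeps its embedded comma
gebiet = rows[2][header.index("GEBIETSGRUPPE")]
print(gebiet)
```

Because the header and the data rows have the same number of fields, `sep=","` and `header="true"` are reasonable settings for the Spark reader.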

In [4]:
# Load the local file into a Spark DataFrame
df = spark.read \
          .load("baumkataster.csv",
           format="csv", sep=",", header="true")

In [5]:
df.show(10)

+--------------------+---------+--------------------+-------+-------------+------+-------------------+--------------------+--------------------+----------+---------------+-----------+---------------+---------+-------------+-----------------+---------------------+----------+----------------+
|                 FID| OBJECTID|               SHAPE|BAUM_ID|DATENFUEHRUNG|BEZIRK|     OBJEKT_STRASSE|       GEBIETSGRUPPE|         GATTUNG_ART|PFLANZJAHR| PFLANZJAHR_TXT|STAMMUMFANG|STAMMUMFANG_TXT|BAUMHOEHE|BAUMHOEHE_TXT|KRONENDURCHMESSER|KRONENDURCHMESSER_TXT|BAUMNUMMER|SE_ANNO_CAD_DATA|
+--------------------+---------+--------------------+-------+-------------+------+-------------------+--------------------+--------------------+----------+---------------+-----------+---------------+---------+-------------+-----------------+---------------------+----------+----------------+
|BAUMKATOGD.742832161|742832161|POINT (16.3730529...| 128890|    magistrat|    20|  Knoten Nordbrücke|     Hauptstraße B14|P

## Count the number of rows with `count()`<a name="count"></a>

Use [pyspark.sql.DataFrame.count](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.count.html) to count the rows of the dataframe.

In [6]:
df.count()

229231

The size of the data file is $50$ MB.

In [7]:
!ls -lh baumkataster.csv

-rw-r--r-- 1 root root 50M Oct 11 10:44 baumkataster.csv


## Pretty-printing <a name="prettyprint"></a>

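For pretty-printing you can use `toPandas()`, which converts the Spark dataframe into a pandas dataframe. Keep in mind that this collects all rows on the driver, so it is only suitable for data that fits in memory.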

In [8]:
pdf = df.toPandas()

In [9]:
pdf.head()

Unnamed: 0,FID,OBJECTID,SHAPE,BAUM_ID,DATENFUEHRUNG,BEZIRK,OBJEKT_STRASSE,GEBIETSGRUPPE,GATTUNG_ART,PFLANZJAHR,PFLANZJAHR_TXT,STAMMUMFANG,STAMMUMFANG_TXT,BAUMHOEHE,BAUMHOEHE_TXT,KRONENDURCHMESSER,KRONENDURCHMESSER_TXT,BAUMNUMMER,SE_ANNO_CAD_DATA
0,BAUMKATOGD.742832161,742832161,POINT (16.37305290330624 48.25396526394613),128890,magistrat,20,Knoten Nordbrücke,Hauptstraße B14,Prunus mahaleb (Steinweichsel),0,nicht definiert,148,148 cm,2,6-10 m,3,7-9 m,2560,
1,BAUMKATOGD.742832162,742832162,POINT (16.27350822664159 48.14422147386553),401554,magistrat,23,Schwarzwaldgasse,"MA 28 - Straße, Grünanlage",Jungbaum wird gepflanzt,0,nicht definiert,0,nicht bekannt,0,nicht bekannt,0,nicht bekannt,2016,
2,BAUMKATOGD.742832163,742832163,POINT (16.398122022316993 48.20013311276198),130667,magistrat,3,Apostelgasse,"MA 28 - Straße, Grünanlage",Tilia tomentosa (Silberlinde),1927,1927,230,230 cm,4,16-20 m,5,13-15 m,1007,
3,BAUMKATOGD.742832164,742832164,POINT (16.37306490734576 48.25399152008363),128891,magistrat,20,Knoten Nordbrücke,Hauptstraße B14,Prunus mahaleb (Steinweichsel),0,nicht definiert,91,91 cm,2,6-10 m,3,7-9 m,2561,
4,BAUMKATOGD.742832165,742832165,POINT (16.397960567598144 48.200110806556005),130669,magistrat,3,Apostelgasse,"MA 28 - Straße, Grünanlage",Acer platanoides 'Globosum' (Kugelspitzahorn),1996,1996,60,60 cm,1,0-5 m,2,4-6 m,1009,


On Google Colab, use Google's `data_table` for interactive tables; elsewhere, fall back to `itables`.

In [10]:
# true if running on Google Colab
import sys
IN_COLAB = 'google.colab' in sys.modules

if IN_COLAB:
  from google.colab import data_table
  data_table.enable_dataframe_formatter()
else:
  !pip install itables
  !pip install bokeh
  !pip install matplotlib
  from itables import init_notebook_mode
  init_notebook_mode(all_interactive=True)

In [11]:
pdf.head(100)

Unnamed: 0,FID,OBJECTID,SHAPE,BAUM_ID,DATENFUEHRUNG,BEZIRK,OBJEKT_STRASSE,GEBIETSGRUPPE,GATTUNG_ART,PFLANZJAHR,PFLANZJAHR_TXT,STAMMUMFANG,STAMMUMFANG_TXT,BAUMHOEHE,BAUMHOEHE_TXT,KRONENDURCHMESSER,KRONENDURCHMESSER_TXT,BAUMNUMMER,SE_ANNO_CAD_DATA
0,BAUMKATOGD.742832161,742832161,POINT (16.37305290330624 48.25396526394613),128890,magistrat,20,Knoten Nordbrücke,Hauptstraße B14,Prunus mahaleb (Steinweichsel),0,nicht definiert,148,148 cm,2,6-10 m,3,7-9 m,2560,
1,BAUMKATOGD.742832162,742832162,POINT (16.27350822664159 48.14422147386553),401554,magistrat,23,Schwarzwaldgasse,"MA 28 - Straße, Grünanlage",Jungbaum wird gepflanzt,0,nicht definiert,0,nicht bekannt,0,nicht bekannt,0,nicht bekannt,2016,
2,BAUMKATOGD.742832163,742832163,POINT (16.398122022316993 48.20013311276198),130667,magistrat,3,Apostelgasse,"MA 28 - Straße, Grünanlage",Tilia tomentosa (Silberlinde),1927,1927,230,230 cm,4,16-20 m,5,13-15 m,1007,
3,BAUMKATOGD.742832164,742832164,POINT (16.37306490734576 48.25399152008363),128891,magistrat,20,Knoten Nordbrücke,Hauptstraße B14,Prunus mahaleb (Steinweichsel),0,nicht definiert,91,91 cm,2,6-10 m,3,7-9 m,2561,
4,BAUMKATOGD.742832165,742832165,POINT (16.397960567598144 48.200110806556005),130669,magistrat,3,Apostelgasse,"MA 28 - Straße, Grünanlage",Acer platanoides 'Globosum' (Kugelspitzahorn),1996,1996,60,60 cm,1,0-5 m,2,4-6 m,1009,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,BAUMKATOGD.742832497,742832497,POINT (16.322439773232073 48.15880274326233),343206,magistrat,12,"12., Parkanlage An den Eisteichen, MA42",MA 42 - Parkanlage,Carpinus betulus 'Frans Fontaine' (Schlanke Sä...,2020,2020,29,29 cm,2,6-10 m,1,0-3 m,66,
96,BAUMKATOGD.742832498,742832498,POINT (16.414271910022116 48.20241321681734),29285,magistrat,2,Rustenschacherallee,"MA 28 - Straße, Grünanlage",Acer platanoides (Spitzahorn),1974,1974,135,135 cm,3,11-15 m,3,7-9 m,3198,
97,BAUMKATOGD.742832499,742832499,POINT (16.48418195149353 48.27670809760198),295651,magistrat,22,Wagramer Straße,"MA 28 - Straße, Grünanlage",Pyrus calleryana 'Chanticleer' (Zierbirne),2018,2018,36,36 cm,1,0-5 m,1,0-3 m,1012,
98,BAUMKATOGD.742832500,742832500,POINT (16.4843354038467 48.276755203728975),243711,magistrat,22,Wagramer Straße,"MA 28 - Straße, Grünanlage",Celtis australis (Südlicher Zürgelbaum),2018,2018,43,43 cm,1,0-5 m,2,4-6 m,1013,


## `groupBy` <a name="groupby"></a>

Use [pyspark.sql.DataFrame.groupBy](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.groupBy.html) to show how many trees there are of each kind: group the rows of `df` by species name (`GATTUNG_ART`), count the rows in each group, and sort by count in descending order.

Use `truncate=False` to show the full content of the column `GATTUNG_ART`.

In [12]:
df.groupBy("GATTUNG_ART").count().orderBy('count', ascending=False).show(truncate=False)

+-------------------------------------------------+-----+
|GATTUNG_ART                                      |count|
+-------------------------------------------------+-----+
|Acer platanoides (Spitzahorn)                    |19312|
|Aesculus hippocastanum (Rosskastanie)            |11805|
|Celtis australis (Südlicher Zürgelbaum)          |9191 |
|Fraxinus excelsior (Gemeine Esche)               |8467 |
|Tilia cordata (Winterlinde)                      |8034 |
|Acer campestre (Feldahorn)                       |7605 |
|Platanus x acerifolia (Ahornblättrige Platane)   |7377 |
|Acer pseudoplatanus (Bergahorn)                  |7000 |
|Pinus nigra (Schwarzkiefer, Schwarzföhre)        |6181 |
|Robinia pseudoacacia (Scheinakazie)              |5768 |
|Tilia platyphyllos (Sommerlinde)                 |5505 |
|Pyrus calleryana 'Chanticleer' (Zierbirne)       |3993 |
|Acer platanoides 'Columnare' (Säulenahorn)       |3963 |
|Acer campestre 'Elsrijk' (Feldahorn)             |3872 |
|Populus nigra
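
As a rough analogy, `groupBy("GATTUNG_ART").count()` followed by a descending sort does on a distributed dataframe what `collections.Counter` does on a small in-memory list. The snippet below is plain Python, not Spark code, and the `species` list is a made-up sample built from names in the output above.

```python
from collections import Counter

# Made-up sample of GATTUNG_ART values (names taken from the output above)
species = [
    "Acer platanoides (Spitzahorn)",
    "Tilia cordata (Winterlinde)",
    "Acer platanoides (Spitzahorn)",
    "Acer campestre (Feldahorn)",
    "Acer platanoides (Spitzahorn)",
    "Tilia cordata (Winterlinde)",
]

# Group equal values and count them, then list them from most to least frequent
counts = Counter(species)
for name, n in counts.most_common():
    print(f"{name:<32}{n}")
```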

## Running SQL Queries Programmatically <a name="SQLquery"></a>

Here is an example of a SQL query (see [Running SQL Queries Programmatically](https://spark.apache.org/docs/latest/sql-getting-started.html#running-sql-queries-programmatically)): let's sort trees by height ("Hoehe").

`trees` is a local temporary view of `df` (see [pyspark.sql.DataFrame.createOrReplaceTempView](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.createOrReplaceTempView.html?highlight=createorreplace#pyspark.sql.DataFrame.createOrReplaceTempView)).

In [13]:
df.createOrReplaceTempView("trees")

In [14]:
spark.sql("SELECT BAUM_ID, GATTUNG_ART, BAUMHOEHE, BAUMHOEHE_TXT, BEZIRK FROM trees order by BAUMHOEHE_TXT desc").toPandas()



Unnamed: 0,BAUM_ID,GATTUNG_ART,BAUMHOEHE,BAUMHOEHE_TXT,BEZIRK
0,401554,Jungbaum wird gepflanzt,0,nicht bekannt,23
1,351187,Jungbaum wird gepflanzt,0,nicht bekannt,20
2,401555,Jungbaum wird gepflanzt,0,nicht bekannt,22
3,391210,Jungbaum wird gepflanzt,0,nicht bekannt,23
4,401553,Jungbaum wird gepflanzt,0,nicht bekannt,23
...,...,...,...,...,...
229226,384803,Tamarix spec. (Tamariske),1,0-5 m,5
229227,226708,Malus spec. (Apfelbaum),1,0-5 m,19
229228,226716,Prunus serrulata 'Kanzan' (Japanische Blütenki...,1,0-5 m,19
229229,226720,Carpinus betulus (Hainbuche),1,0-5 m,19


### Data cleaning <a name="cleaning"></a>

This doesn't quite make sense. Let us investigate what the `BAUMHOEHE` and `BAUMHOEHE_TXT` columns contain.

💡 Making sense of "dirty data" is called _data cleansing_ or _data cleaning_. This process involves identifying and correcting or removing errors, inconsistencies, and inaccuracies in a dataset to make it accurate, complete, and consistent.

In [15]:
spark.sql("SELECT distinct BAUMHOEHE, BAUMHOEHE_TXT from trees order by BAUMHOEHE asc").toPandas()

Unnamed: 0,BAUMHOEHE,BAUMHOEHE_TXT
0,0,nicht bekannt
1,1,0-5 m
2,102 cm,3
3,116 cm,3
4,117 cm,3
5,134 cm,2
6,145 cm,4
7,177 cm,3
8,2,6-10 m
9,29 cm,1
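
The order of these rows may look odd (`102 cm` comes before `2`). The reason is that the CSV file was loaded without a schema, so every column, including `BAUMHOEHE`, is a string, and strings are compared character by character. A small plain-Python illustration using values from the output above (the variable names are ours):

```python
# BAUMHOEHE values as they appear in the dataframe: all of them are strings
values = ["2", "0", "102 cm", "1", "29 cm", "116 cm"]

# String comparison works character by character, so "102 cm" sorts before "2"
as_text = sorted(values)
print(as_text)

# Sorting the purely numeric codes as integers gives the intuitive order
codes = sorted((v for v in values if v.isdigit()), key=int)
print(codes)
```

In a SQL query the numeric interpretation can be requested with a cast, as in `order by cast(BAUMHOEHE as int)`.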


It looks like there was an attempt to map height ranges (`BAUMHOEHE_TXT`) to height categories, where `BAUMHOEHE` is a number from $0$ to $8$:
 - `0`: unknown height
 - `1`: 0-5m
 - `2`: 6-10m
 - `3`: 11-15m
 - `4`: 16-20m
 - `5`: 21-25m
 - `6`: 26-30m
 - `7`: 31-35m
 - `8`: >35m

But for a few trees these columns don't seem to make much sense.

In [16]:
spark.sql("SELECT BAUMHOEHE, BAUMHOEHE_TXT, count(BAUM_ID) from trees group by BAUMHOEHE, BAUMHOEHE_TXT sort by count(BAUM_ID) desc").show()

+---------+-------------+--------------+
|BAUMHOEHE|BAUMHOEHE_TXT|count(BAUM_ID)|
+---------+-------------+--------------+
|        2|       6-10 m|         81407|
|        1|        0-5 m|         62924|
|        3|      11-15 m|         55737|
|        4|      16-20 m|         18304|
|        5|      21-25 m|          4978|
|        0|nicht bekannt|          4448|
|        6|      26-30 m|          1284|
|        7|      31-35 m|           110|
|        8|       > 35 m|            14|
|    86 cm|            2|             2|
|    76 cm|            2|             2|
|    55 cm|            2|             1|
|   134 cm|            2|             1|
|   145 cm|            4|             1|
|    58 cm|            2|             1|
|    68 cm|            3|             1|
|    73 cm|            2|             1|
|   117 cm|            3|             1|
|    65 cm|            2|             1|
|    78 cm|            3|             1|
+---------+-------------+--------------+
only showing top

In [17]:
spark.sql("SELECT BAUMHOEHE, BAUMHOEHE_TXT, count(BAUM_ID) from trees where BAUMHOEHE RLIKE '^[0-8]$' group by BAUMHOEHE, BAUMHOEHE_TXT sort by count(BAUM_ID) desc").show()

+---------+-------------+--------------+
|BAUMHOEHE|BAUMHOEHE_TXT|count(BAUM_ID)|
+---------+-------------+--------------+
|        2|       6-10 m|         81407|
|        1|        0-5 m|         62924|
|        3|      11-15 m|         55737|
|        4|      16-20 m|         18304|
|        5|      21-25 m|          4978|
|        0|nicht bekannt|          4448|
|        6|      26-30 m|          1284|
|        7|      31-35 m|           110|
|        8|       > 35 m|            14|
+---------+-------------+--------------+
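
The pattern `^[0-8]$` accepts a value only if it consists of exactly one digit between 0 and 8. Spark's `RLIKE` uses Java regular expressions, but for a pattern this simple Python's `re` module behaves the same way, so the filter can be tried out locally. In the sketch below, `"86 cm"` comes from the output above, while `"9"`, `"10"` and the empty string are hypothetical values added to show what else would be rejected.

```python
import re

# Same pattern as in the SQL query: one single digit from 0 to 8
pattern = re.compile(r"^[0-8]$")

# "86 cm" appears in the data; "9", "10" and "" are hypothetical
samples = ["0", "2", "8", "9", "10", "86 cm", ""]

valid = [s for s in samples if pattern.match(s)]
invalid = [s for s in samples if not pattern.match(s)]

print("kept:    ", valid)
print("rejected:", invalid)
```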



We can simply ignore the few trees (there are just $25$ of them) that have an incorrectly formatted `BAUMHOEHE`. As the next query shows, these trees are all in the same location, and their `GATTUNG_ART` and `PFLANZJAHR` entries are apparently incorrect as well.

In [18]:
spark.sql("SELECT BAUM_ID, BAUMHOEHE, BAUMHOEHE_TXT, GATTUNG_ART, PFLANZJAHR, BEZIRK, OBJEKT_STRASSE from trees where not BAUMHOEHE RLIKE '^[0-8]$'").toPandas()

Unnamed: 0,BAUM_ID,BAUMHOEHE,BAUMHOEHE_TXT,GATTUNG_ART,PFLANZJAHR,BEZIRK,OBJEKT_STRASSE
0,315066,68 cm,3,Diverse,Robinia pseudoacacia (Scheinakazie),10,"""10., Monte Laa Wiesenfläche """"Dreieck"""""
1,286030,50 cm,1,Diverse,Morus nigra (Schwarzer Maulbeerbaum),10,"""10., Monte Laa Wiesenfläche """"Dreieck"""""
2,286040,59 cm,2,Diverse,Pyrus calleryana 'Chanticleer' (Zierbirne),10,"""10., Monte Laa Wiesenfläche """"Dreieck"""""
3,315064,102 cm,3,Diverse,Robinia pseudoacacia (Scheinakazie),10,"""10., Monte Laa Wiesenfläche """"Dreieck"""""
4,315071,116 cm,3,Diverse,Prunus avium (Vogelkirsche),10,"""10., Monte Laa Wiesenfläche """"Dreieck"""""
5,330854,29 cm,1,Diverse,Pyrus calleryana 'Aristocrat' (Zierbirne),10,"""10., Monte Laa Wiesenfläche """"Dreieck"""""
6,286029,177 cm,3,Diverse,Populus nigra 'Italica' (Pyramidenpappel),10,"""10., Monte Laa Wiesenfläche """"Dreieck"""""
7,286033,52 cm,2,Diverse,Pyrus calleryana 'Chanticleer' (Zierbirne),10,"""10., Monte Laa Wiesenfläche """"Dreieck"""""
8,286034,68 cm,2,Diverse,Pyrus calleryana 'Chanticleer' (Zierbirne),10,"""10., Monte Laa Wiesenfläche """"Dreieck"""""
9,286041,62 cm,2,Diverse,Pyrus calleryana 'Chanticleer' (Zierbirne),10,"""10., Monte Laa Wiesenfläche """"Dreieck"""""
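
One plausible explanation, which we have not verified against the raw file, is that these rows are fine in the source data and were only split in the wrong place during parsing: the species name shows up under `PFLANZJAHR`, as if the columns had shifted. The street name contains quotation marks (`"Dreieck"`), which CSV files conventionally write as doubled quotes inside a quoted field, whereas Spark's CSV reader uses a backslash as its default escape character. The sketch below uses Python's `csv` module, which follows the doubled-quote convention, on a shortened, hypothetical line assembled from values in the table above.

```python
import csv

# Shortened, hypothetical line with three fields:
# BAUM_ID, OBJEKT_STRASSE, GEBIETSGRUPPE
line = '315066,"10., Monte Laa Wiesenfläche ""Dreieck""",Diverse'

# A parser that understands doubled quotes sees three fields
fields = next(csv.reader([line]))
print(len(fields))
print(fields[1])

# Splitting naively on commas breaks the street name apart and shifts
# every following column by one position
naive = line.split(",")
print(len(naive))
```

If this is indeed the cause, passing `quote='"'` and `escape='"'` as extra options to `spark.read.load` should keep these rows intact. Treat this as an assumption to test, not as a confirmed fix.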


Let us filter out these $25$ trees from the original dataframe using [`pyspark.sql.DataFrame.filter`](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.filter.html).

In [19]:
df.filter("BAUMHOEHE NOT RLIKE '^[0-8]$'").count()

25

In [20]:
df = df.filter("BAUMHOEHE RLIKE '^[0-8]$'")

In [21]:
df.count()

229206

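Since `df` now refers to a new, filtered dataframe, the temporary view `trees` needs to be updated. After that, the query for incorrectly formatted rows returns an empty result.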

In [22]:
df.createOrReplaceTempView("trees")
spark.sql("SELECT BAUM_ID, BAUMHOEHE, BAUMHOEHE_TXT, GATTUNG_ART, PFLANZJAHR, BEZIRK, OBJEKT_STRASSE from trees where not BAUMHOEHE RLIKE '^[0-8]$'").toPandas()

Unnamed: 0,BAUM_ID,BAUMHOEHE,BAUMHOEHE_TXT,GATTUNG_ART,PFLANZJAHR,BEZIRK,OBJEKT_STRASSE


We can now show trees sorted by their height category (`BAUMHOEHE`).

In [23]:
spark.sql("SELECT BAUMHOEHE, BAUMHOEHE_TXT, count(*) as COUNT from trees group by BAUMHOEHE, BAUMHOEHE_TXT sort by BAUMHOEHE desc").toPandas()

Unnamed: 0,BAUMHOEHE,BAUMHOEHE_TXT,COUNT
0,8,> 35 m,14
1,7,31-35 m,110
2,6,26-30 m,1284
3,5,21-25 m,4978
4,4,16-20 m,18304
5,3,11-15 m,55737
6,2,6-10 m,81407
7,1,0-5 m,62924
8,0,nicht bekannt,4448


## Some data exploration <a name="exploration"></a>

Which species are the tallest trees?

In [24]:
spark.sql("SELECT BAUMHOEHE, BAUMHOEHE_TXT, GATTUNG_ART, count(*) as COUNT from trees where BAUMHOEHE=8 group by BAUMHOEHE, BAUMHOEHE_TXT, GATTUNG_ART sort by COUNT desc").toPandas()

Unnamed: 0,BAUMHOEHE,BAUMHOEHE_TXT,GATTUNG_ART,COUNT
0,8,> 35 m,Populus nigra 'Italica' (Pyramidenpappel),7
1,8,> 35 m,Fraxinus excelsior (Gemeine Esche),2
2,8,> 35 m,Populus nigra (Schwarzpappel),2
3,8,> 35 m,Platanus x acerifolia (Ahornblättrige Platane),1
4,8,> 35 m,Acer platanoides (Spitzahorn),1
5,8,> 35 m,Tilia tomentosa 'Brabant' (Silberlinde),1


In [25]:
spark.sql("SELECT BAUMHOEHE, BAUMHOEHE_TXT, GATTUNG_ART, BEZIRK, OBJEKT_STRASSE from trees where BAUMHOEHE=8 sort by cast(BEZIRK as int)").toPandas()

Unnamed: 0,BAUMHOEHE,BAUMHOEHE_TXT,GATTUNG_ART,BEZIRK,OBJEKT_STRASSE
0,8,> 35 m,Platanus x acerifolia (Ahornblättrige Platane),1,"01., Rathauspark, MA42"
1,8,> 35 m,Populus nigra 'Italica' (Pyramidenpappel),2,"02., Wettsteinpark, MA42"
2,8,> 35 m,Populus nigra 'Italica' (Pyramidenpappel),2,"02., Wettsteinpark, MA42"
3,8,> 35 m,Populus nigra 'Italica' (Pyramidenpappel),2,"02., Wettsteinpark, MA42"
4,8,> 35 m,Populus nigra (Schwarzpappel),2,"02., Donaukanal Pachtflächen 2. Bezirk, DHK"
5,8,> 35 m,Populus nigra 'Italica' (Pyramidenpappel),2,"02., Wettsteinpark, MA42"
6,8,> 35 m,Fraxinus excelsior (Gemeine Esche),18,"17., Schafbergbad, BAD"
7,8,> 35 m,Acer platanoides (Spitzahorn),19,"19., Grinzinger Straße 111, SPO"
8,8,> 35 m,Tilia tomentosa 'Brabant' (Silberlinde),2,Stella-Klein-Löw-Weg
9,8,> 35 m,Populus nigra 'Italica' (Pyramidenpappel),2,"02., Wettsteinpark, MA42"
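
Note that the result above is not in ascending `BEZIRK` order: two rows with district 2 follow district 19. The likely reason is that in Spark SQL `SORT BY` sorts rows only within each partition, whereas `ORDER BY` guarantees a total order across all partitions. The plain-Python sketch below mimics this with two made-up partitions holding the district numbers from the output. It is an analogy, not Spark code, and the actual partitioning of the data is unknown to us.

```python
# Two hypothetical partitions holding BEZIRK values
partitions = [[2, 19, 1, 2, 18, 2, 2, 2], [2, 2]]

# SORT BY: each partition is sorted on its own, then the results are concatenated
sort_by = [b for part in partitions for b in sorted(part)]

# ORDER BY: all rows are sorted together
order_by = sorted(b for part in partitions for b in part)

print("sort by: ", sort_by)
print("order by:", order_by)
```

Replacing `sort by` with `order by` in the queries of this notebook should give globally sorted results.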


## Close Spark session <a name="closing"></a>

When done, close the Spark session. This will release resources.

In [26]:
spark.stop()