<a href="https://colab.research.google.com/github/groda/big_data/blob/master/generate_data_with_Faker.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<a href="https://github.com/groda/big_data"><div><img src="https://github.com/groda/big_data/blob/master/logo_bdb.png?raw=true" align=right width="90"></div></a>

# Data Generation and Aggregation with Python's Faker Library and PySpark
<br>
<br>

Explore the capabilities of the Python Faker library (https://faker.readthedocs.io/) for dynamic data generation!

Whether you're a data scientist, engineer, or analyst, this tutorial guides you through creating realistic and diverse datasets with Faker and then using the distributed computing capabilities of PySpark to aggregate and analyze the generated data. Along the way, it compares several ways of generating data and looks at how they affect running time and resource usage.

**Note:** This is not _synthetic_ data, as it is generated using straightforward methods and is unlikely to conform to any real-life distribution.  Still, it serves as a valuable resource for testing purposes when authentic data is unavailable.

# Install Faker

The Python `faker` module needs to be installed. Note that on Google Colab you can use `!pip` as well as just `pip` (no exclamation mark).

In [1]:
!pip install faker

Collecting faker
  Downloading faker-40.4.0-py3-none-any.whl.metadata (16 kB)
Downloading faker-40.4.0-py3-none-any.whl (2.0 MB)
Installing collected packages: faker
Successfully installed faker-40.4.0


# Generate a Pandas dataframe with fake data

Import `Faker` and set a random seed ($42$).

In [2]:
from faker import Faker
# Set the seed value of the shared `random.Random` object
# across all internal generators that will ever be created
Faker.seed(42)

`fake` is a fake data generator with the `de_DE` locale.

In [3]:
fake = Faker('de_DE')
fake.seed_locale('de_DE', 42)
# Creates and seeds a unique `random.Random` object for
# each internal generator of this `Faker` instance
fake.seed_instance(42)
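The difference between seeding a shared generator and seeding per-instance generators can be illustrated with the standard-library `random` module, on whose `random.Random` objects Faker relies according to the comments above. This is only a rough analogy under that assumption, not a reproduction of Faker's internals.

```python
import random

# Shared generator: random.seed() reseeds the module-level generator
# that every caller of random.random() draws from.
random.seed(42)
shared_first = random.random()
random.seed(42)
shared_again = random.random()

# Per-instance generators: each random.Random object keeps its own state,
# so drawing from one does not advance the other.
rng_a = random.Random(42)
rng_b = random.Random(42)
a_values = [rng_a.randint(0, 100) for _ in range(3)]
b_values = [rng_b.randint(0, 100) for _ in range(3)]

print(shared_first == shared_again)  # True: same seed, same first draw
print(a_values == b_values)          # True: same seed, same sequence
```

Reseeding with the same value restarts the sequence, which is why seeded runs are reproducible.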

With `fake` you can generate fake data, such as name, email, etc.

In [4]:
print(f"A fake name: {fake.name()}")
print(f"A fake email: {fake.email()}")

A fake name: Aleksandr Weihmann
A fake email: ioannis32@example.net


Import Pandas to save data into a dataframe

In [5]:
import pandas as pd

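The function `create_row_faker` creates `num` rows of fake data (one row by default). Here we choose to generate rows containing the following fields:
 - `fake.name()`
 - `fake.random_int(0, 100)` (used as the age)
 - `fake.postcode()`
 - `fake.email()`
 - `fake.country()`.

Because the function creates and re-seeds its own `Faker` instance, every call returns the same sequence of rows.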

In [6]:
def create_row_faker(num=1):
    fake = Faker('de_DE')
    fake.seed_locale('de_DE', 42)
    fake.seed_instance(42)
    output = [{"name": fake.name(),
               "age": fake.random_int(0, 100),
               "postcode": fake.postcode(),
               "email": fake.email(),
               "nationality": fake.country(),
              } for x in range(num)]
    return output

Generate a single row

In [7]:
create_row_faker()

[{'name': 'Aleksandr Weihmann',
  'age': 35,
  'postcode': '32181',
  'email': 'bbeckmann@example.org',
  'nationality': 'Fidschi'}]

Generate `n=3` rows

In [8]:
create_row_faker(3)

[{'name': 'Aleksandr Weihmann',
  'age': 35,
  'postcode': '32181',
  'email': 'bbeckmann@example.org',
  'nationality': 'Fidschi'},
 {'name': 'Prof. Kurt Bauer B.A.',
  'age': 91,
  'postcode': '37940',
  'email': 'hildaloechel@example.com',
  'nationality': 'Guatemala'},
 {'name': 'Ekkehart Wiek-Kallert',
  'age': 13,
  'postcode': '61559',
  'email': 'maja07@example.net',
  'nationality': 'Brasilien'}]

Generate a dataframe `df_fake` of 5000 rows using `create_row_faker`.

We're using the _cell magic_ `%%time` to time the operation.

In [9]:
%%time
df_fake = pd.DataFrame(create_row_faker(5000))

CPU times: user 393 ms, sys: 2.71 ms, total: 396 ms
Wall time: 416 ms
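Outside a notebook, where the `%%time` cell magic is not available, a similar wall-clock measurement can be sketched with `time.perf_counter`. The helper name `timed` and the stand-in workload are hypothetical and only serve the illustration.

```python
import time

def timed(func, *args, **kwargs):
    """Call func and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

# Stand-in workload; in the notebook this could be create_row_faker(5000)
rows, seconds = timed(lambda n: [{"id": i} for i in range(n)], 5000)
print(f"Generated {len(rows)} rows in {seconds:.4f} s")
```

Unlike `%%time`, this sketch reports wall-clock time only, not CPU time.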


View dataframe

In [10]:
df_fake

Unnamed: 0,name,age,postcode,email,nationality
0,Aleksandr Weihmann,35,32181,bbeckmann@example.org,Fidschi
1,Prof. Kurt Bauer B.A.,91,37940,hildaloechel@example.com,Guatemala
2,Ekkehart Wiek-Kallert,13,61559,maja07@example.net,Brasilien
3,Annelise Rohleder-Hornig,80,93103,daniel31@example.com,Guatemala
4,Magrit Knappe B.A.,47,34192,gottliebmisicher@example.com,Guadeloupe
...,...,...,...,...,...
4995,Hanno Jopich-Rädel,99,13333,keudelstanislaus@example.org,Syrien
4996,Herr Arno Ebert B.A.,63,36790,josefaebert@example.org,Slowenien
4997,Miroslawa Schüler,22,11118,ruppersbergerbetina@example.org,Republik Moldau
4998,Janusz Nerger,74,33091,ann-kathrinseip@example.net,Belarus


For more fake data generators see Faker's [standard providers](https://faker.readthedocs.io/en/master/providers.html#standard-providers) as well as [community providers](https://faker.readthedocs.io/en/master/communityproviders.html#community-providers).

# Generate PySpark dataframe with fake data

PySpark is already installed in Colab.

In [11]:
#!pip install pyspark

In [12]:
from pyspark.sql import SparkSession
spark = SparkSession \
    .builder \
    .appName("Faker demo") \
    .getOrCreate()

In [13]:
df = spark.createDataFrame(create_row_faker(5000))

Depending on the PySpark version, creating a dataframe from a list of dictionaries may produce a warning that inferring the schema from `dict` is deprecated. To avoid it, either use [pyspark.sql.Row](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Row) and let Spark infer datatypes or create a schema for the dataframe specifying the datatypes of all fields (here's the list of all [datatypes](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=types#module-pyspark.sql.types)).

In [14]:
from pyspark.sql.types import *
schema = StructType([StructField('name', StringType()),
                     StructField('age',IntegerType()),
                     StructField('postcode',StringType()),
                     StructField('email', StringType()),
                     StructField('nationality',StringType())])

In [15]:
df = spark.createDataFrame(create_row_faker(5000), schema)

In [16]:
df.printSchema()

root
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- postcode: string (nullable = true)
 |-- email: string (nullable = true)
 |-- nationality: string (nullable = true)



Let's generate some more data (dataframe with $5\cdot10^4$ rows). The dataframe will be partitioned by Spark.

In [17]:
%%time
n = 5*10**4
df = spark.createDataFrame(create_row_faker(n), schema)

CPU times: user 2.73 s, sys: 46 ms, total: 2.78 s
Wall time: 2.85 s


In [18]:
df.show(10, truncate=False)

+---------------------------------+---+--------+----------------------------+----------------------+
|name                             |age|postcode|email                       |nationality           |
+---------------------------------+---+--------+----------------------------+----------------------+
|Aleksandr Weihmann               |35 |32181   |bbeckmann@example.org       |Fidschi               |
|Prof. Kurt Bauer B.A.            |91 |37940   |hildaloechel@example.com    |Guatemala             |
|Ekkehart Wiek-Kallert            |13 |61559   |maja07@example.net          |Brasilien             |
|Annelise Rohleder-Hornig         |80 |93103   |daniel31@example.com        |Guatemala             |
|Magrit Knappe B.A.               |47 |34192   |gottliebmisicher@example.com|Guadeloupe            |
|Univ.Prof. Gotthilf Wilmsen B.Sc.|29 |56413   |heini76@example.net         |Litauen               |
|Franjo Etzold-Hentschel          |95 |96965   |frederikpechel@example.com  |Belize        

It took a long time (almost 3 sec. for 50000 rows)!

Can we do better?

The function `create_row_faker()` returns a list, which is built completely in memory before it is handed to Spark. This is not efficient; let us try a _generator_ instead.

In [19]:
d = create_row_faker(5)
# what type is d?
type(d)

list

Let us turn `d` into a generator

In [20]:
d = ({"name": fake.name(),
      "age": fake.random_int(0, 100),
      "postcode": fake.postcode(),
      "email": fake.email(),
      "nationality": fake.country()} for i in range(5))
# what type is d?
type(d)

generator

In [21]:
%%time
n = 5*10**4
fake = Faker('de_DE')
fake.seed_locale('de_DE', 42)
fake.seed_instance(42)
d = ({"name": fake.name(),
      "age": fake.random_int(0, 100),
      "postcode": fake.postcode(),
      "email": fake.email(),
      "nationality": fake.country()}
     for i in range(n))
df = spark.createDataFrame(d, schema)

CPU times: user 2.86 s, sys: 23.7 ms, total: 2.88 s
Wall time: 2.94 s


In [22]:
df.show(10, truncate=False)

+---------------------------------+---+--------+----------------------------+----------------------+
|name                             |age|postcode|email                       |nationality           |
+---------------------------------+---+--------+----------------------------+----------------------+
|Aleksandr Weihmann               |35 |32181   |bbeckmann@example.org       |Fidschi               |
|Prof. Kurt Bauer B.A.            |91 |37940   |hildaloechel@example.com    |Guatemala             |
|Ekkehart Wiek-Kallert            |13 |61559   |maja07@example.net          |Brasilien             |
|Annelise Rohleder-Hornig         |80 |93103   |daniel31@example.com        |Guatemala             |
|Magrit Knappe B.A.               |47 |34192   |gottliebmisicher@example.com|Guadeloupe            |
|Univ.Prof. Gotthilf Wilmsen B.Sc.|29 |56413   |heini76@example.net         |Litauen               |
|Franjo Etzold-Hentschel          |95 |96965   |frederikpechel@example.com  |Belize        

This wasn't faster.
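One plausible reason is that a generator only defers the work: every row still has to be produced once the consumer iterates over it. The following sketch counts calls to a stand-in row function (`make_row` is hypothetical and replaces the Faker calls) to show this.

```python
counter = {"calls": 0}

def make_row(i):
    """Stand-in for the Faker calls; counts how often it runs."""
    counter["calls"] += 1
    return {"id": i}

n = 1000
as_list = [make_row(i) for i in range(n)]   # all rows are built immediately
calls_after_list = counter["calls"]

counter["calls"] = 0
as_gen = (make_row(i) for i in range(n))    # nothing is built yet
calls_after_gen_creation = counter["calls"]
consumed = list(as_gen)                     # the consumer pulls every row
calls_after_consumption = counter["calls"]

print(calls_after_list, calls_after_gen_creation, calls_after_consumption)
# 1000 0 1000
```

A generator can reduce peak memory on the producing side, but the total amount of computation stays the same.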

Let us look at how one can leverage Spark's parallelism to generate dataframes and speed up the process.

## A more efficient way to generate a large amount of records

We are going to use Spark's RDD and the function `parallelize`. In order to do this, we are going to need to extract the Spark _context_ from the current session.

In [23]:
sc = spark.sparkContext
sc

To decide on the number of partitions, we look at the number of (virtual) CPUs on the local machine. A cluster can offer a larger number of CPUs across multiple nodes, but this is not the case here.

The standard Google Colab virtual machine has $2$ virtual CPUs (one CPU with two threads), so that is the maximum parallelization that you can achieve.

**Note:**

CPUs = threads per core × cores per socket × sockets

In [24]:
!lscpu | grep -E '^Thread|^Core|^Socket|^CPU\('

CPU(s):                                  2
Thread(s) per core:                      2
Core(s) per socket:                      1
Socket(s):                               1
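The formula can be checked against the `lscpu` values shown above. The helper name `logical_cpus` is made up for this sketch; `os.cpu_count()` is the standard-library way to ask Python for the number of logical CPUs.

```python
import os

def logical_cpus(threads_per_core, cores_per_socket, sockets):
    """CPUs = threads per core x cores per socket x sockets."""
    return threads_per_core * cores_per_socket * sockets

# Values taken from the lscpu output above
colab_cpus = logical_cpus(threads_per_core=2, cores_per_socket=1, sockets=1)
print(colab_cpus)       # 2

# Logical CPUs as seen by Python on the current machine (may differ from 2)
print(os.cpu_count())
```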


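Due to the limited number of CPUs on this machine, we'll use only $2$ partitions. Even so, the cell below returns much faster. Keep in mind, though, that Spark evaluates RDD transformations lazily: `toDF()` only needs to look at the first row to infer the schema, so most rows are generated later, when an action such as `show()` or `take()` runs. The timing below is therefore not directly comparable with the earlier ones.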

In [25]:
%%time
n = 5*10**4
num_partitions = 2
# Create a placeholder RDD with one element per partition
empty_rdd = sc.parallelize(range(num_partitions), num_partitions)
# Define a function that will run on each partition to generate the fake data
def generate_fake_data(_):
    fake = Faker('de_DE') # Create a new Faker instance per partition
    # Note: every partition uses the same seed (42), so all partitions
    # generate the same sequence of rows
    fake.seed_locale('de_DE', 42)
    fake.seed_instance(42)
    for _ in range(n // num_partitions):  # Divide work across partitions
        yield {
            "name": fake.name(),
            "age": fake.random_int(0, 100),
            "postcode": fake.postcode(),
            "email": fake.email(),
            "nationality": fake.country()
        }

# Use mapPartitions to generate fake data for each partition
rdd = empty_rdd.mapPartitions(generate_fake_data)
# Convert the RDD to a DataFrame
df = rdd.toDF()

CPU times: user 15.8 ms, sys: 5.01 ms, total: 20.8 ms
Wall time: 780 ms
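Two details of the partitioned approach are worth a closer look: how `n` rows are divided when `n` is not a multiple of the number of partitions, and what happens when every partition uses the same seed. The sketch below simulates partitions with plain Python and `random.Random` instead of Spark and Faker; the function names and the `base_seed + index` scheme are assumptions made for illustration. In PySpark, `RDD.mapPartitionsWithIndex` passes the partition index to the function and could be used to derive such a per-partition seed.

```python
import random

def rows_per_partition(n, num_partitions):
    """Split n rows as evenly as possible; the first n % p partitions get one extra row."""
    base, extra = divmod(n, num_partitions)
    return [base + (1 if i < extra else 0) for i in range(num_partitions)]

def fake_partition(index, count, base_seed=42, per_partition_seed=True):
    """Stand-in for a partition function: yields `count` pseudo-random ages."""
    seed = base_seed + index if per_partition_seed else base_seed
    rng = random.Random(seed)
    for _ in range(count):
        yield rng.randint(0, 100)

counts = rows_per_partition(5, 2)
same_seed = [list(fake_partition(i, 20, per_partition_seed=False)) for i in range(2)]
own_seed = [list(fake_partition(i, 20)) for i in range(2)]

print(counts)                         # [3, 2]
print(same_seed[0] == same_seed[1])   # True: both partitions produce the same stream
print(own_seed[0] == own_seed[1])     # expected to be False with distinct seeds
```

With identical seeds, each partition repeats the same rows, so the resulting dataset contains duplicates.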


I'm convinced that the reason everyone always looks at the first $5$ rows in Spark's RDDs is an homage to the classic jazz piece 🎷🎶.

In [26]:
rdd.take(5)

[{'name': 'Aleksandr Weihmann',
  'age': 35,
  'postcode': '32181',
  'email': 'bbeckmann@example.org',
  'nationality': 'Fidschi'},
 {'name': 'Prof. Kurt Bauer B.A.',
  'age': 91,
  'postcode': '37940',
  'email': 'hildaloechel@example.com',
  'nationality': 'Guatemala'},
 {'name': 'Ekkehart Wiek-Kallert',
  'age': 13,
  'postcode': '61559',
  'email': 'maja07@example.net',
  'nationality': 'Brasilien'},
 {'name': 'Annelise Rohleder-Hornig',
  'age': 80,
  'postcode': '93103',
  'email': 'daniel31@example.com',
  'nationality': 'Guatemala'},
 {'name': 'Magrit Knappe B.A.',
  'age': 47,
  'postcode': '34192',
  'email': 'gottliebmisicher@example.com',
  'nationality': 'Guadeloupe'}]

In [27]:
df.show()

+---+--------------------+--------------------+--------------------+--------+
|age|               email|                name|         nationality|postcode|
+---+--------------------+--------------------+--------------------+--------+
| 35|bbeckmann@example...|  Aleksandr Weihmann|             Fidschi|   32181|
| 91|hildaloechel@exam...|Prof. Kurt Bauer ...|           Guatemala|   37940|
| 13|  maja07@example.net|Ekkehart Wiek-Kal...|           Brasilien|   61559|
| 80|daniel31@example.com|Annelise Rohleder...|           Guatemala|   93103|
| 47|gottliebmisicher@...|  Magrit Knappe B.A.|          Guadeloupe|   34192|
| 29| heini76@example.net|Univ.Prof. Gotthi...|             Litauen|   56413|
| 95|frederikpechel@ex...|Franjo Etzold-Hen...|              Belize|   96965|
| 19| qraedel@example.net|   Steffen Dörschner|            Tunesien|   69166|
| 14|  uadler@example.net|       Milos Ullmann|        Griechenland|   51462|
| 80|augustewulff@exam...|  Prof. Urban Döring|Vereinigtes Kö

# Filter and aggregate with PySpark

Show the first five records in the dataframe of fake data.

In [28]:
df.show(n=5, truncate=False)

+---+----------------------------+------------------------+-----------+--------+
|age|email                       |name                    |nationality|postcode|
+---+----------------------------+------------------------+-----------+--------+
|35 |bbeckmann@example.org       |Aleksandr Weihmann      |Fidschi    |32181   |
|91 |hildaloechel@example.com    |Prof. Kurt Bauer B.A.   |Guatemala  |37940   |
|13 |maja07@example.net          |Ekkehart Wiek-Kallert   |Brasilien  |61559   |
|80 |daniel31@example.com        |Annelise Rohleder-Hornig|Guatemala  |93103   |
|47 |gottliebmisicher@example.com|Magrit Knappe B.A.      |Guadeloupe |34192   |
+---+----------------------------+------------------------+-----------+--------+
only showing top 5 rows


Do some data aggregation:
 - group by postcode
 - count the number of persons and the average age for each postcode
 - filter out postcodes with fewer than 4 persons
 - sort by average age descending
 - show the first 5 entries

In [29]:
import pyspark.sql.functions as F
aggregated_df = df.groupBy('postcode') \
  .agg(F.count('postcode').alias('Count'),
       F.round(F.avg('age'), 2).alias('Average age')) \
  .filter('Count>3') \
  .orderBy('Average age', ascending=False)

aggregated_df.show(5)

+--------+-----+-----------+
|postcode|Count|Average age|
+--------+-----+-----------+
|   60653|    4|       98.5|
|   59679|    4|       98.5|
|   37287|    4|       98.5|
|   63287|    4|       98.0|
|   37841|    4|       97.5|
+--------+-----+-----------+
only showing top 5 rows
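To make the semantics of the pipeline above explicit, here is the same sequence of steps (group, count and average, filter, sort, take the first entries) in plain Python. The sample rows are made up for this sketch and are not taken from the generated data.

```python
from collections import defaultdict

# Made-up sample rows, for illustration only
rows = [
    {"postcode": "11111", "age": 90}, {"postcode": "11111", "age": 80},
    {"postcode": "11111", "age": 70}, {"postcode": "11111", "age": 60},
    {"postcode": "22222", "age": 99}, {"postcode": "22222", "age": 98},
    {"postcode": "33333", "age": 20}, {"postcode": "33333", "age": 30},
    {"postcode": "33333", "age": 40}, {"postcode": "33333", "age": 50},
]

# Group ages by postcode
ages_by_postcode = defaultdict(list)
for row in rows:
    ages_by_postcode[row["postcode"]].append(row["age"])

# Count and average per postcode, keeping only postcodes with more than 3 persons
summary = [
    (postcode, len(ages), round(sum(ages) / len(ages), 2))
    for postcode, ages in ages_by_postcode.items()
    if len(ages) > 3
]

# Sort by average age, descending, and show the first 5 entries
summary.sort(key=lambda item: item[2], reverse=True)
print(summary[:5])
# [('11111', 4, 75.0), ('33333', 4, 35.0)]
```

Postcode `22222` has the highest average age but is dropped by the count filter, just as `.filter('Count>3')` does in the Spark version.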


While these are just simulated postcodes and ages, you might think of the postcodes with the highest average age as the 'best to live in'.

Real-world data analysis would require a much deeper statistical dive. For instance, you'd want to consider a broader range of demographic and socio-economic factors beyond just age. When analyzing age, looking at the full distribution (e.g., standard deviation, median, skewness) rather than just the average would provide a more complete picture of the population structure within each postcode. This would help identify areas with diverse age groups, which might be a better indicator of a 'best place to live' depending on your criteria.

But let's go on with our Spark code.

Show all entries for postcodes with max average age using `filter`.

In [30]:
import pyspark.sql.functions as F

# Find the maximum average age
max_avg_age = aggregated_df.agg(F.max('Average age')).collect()[0][0]

# Get all postcodes that have this maximum average age
postcodes = aggregated_df.filter(F.col('Average age') == max_avg_age).select('postcode').rdd.flatMap(lambda x: x).collect()

print(f"Postcodes with the maximum average age: {postcodes}")

# Filter the original DataFrame to show entries for these postcodes
df.filter(F.col('postcode').isin(postcodes)).show(truncate=False)

Postcodes with the maximum average age: ['60653', '59679', '37287']
+---+----------------------------+---------------------+------------------------+--------+
|age|email                       |name                 |nationality             |postcode|
+---+----------------------------+---------------------+------------------------+--------+
|100|natalie67@example.net       |Ing. Olaf Killer B.A.|Neukaledonien           |37287   |
|98 |magdalange@example.org      |Luzie Putz           |Jordanien               |59679   |
|99 |evelyn93@example.org        |Lutz Schomber        |Palästinensische Gebiete|60653   |
|99 |maria-theresia61@example.com|Astrid Löchel        |Barbados                |59679   |
|98 |victoria54@example.net      |Pawel Werner         |Vanuatu                 |60653   |
|97 |jstey@example.net           |Herr Emil Höfig B.A. |Mali                    |37287   |
|100|natalie67@example.net       |Ing. Olaf Killer B.A.|Neukaledonien           |37287   |
|98 |magdalange@exa

# Another example with multiple locales and weights

We are going to use multiple locales with weights (following the [examples](https://faker.readthedocs.io/en/master/fakerclass.html#examples) in the documentation).

Here's the [list of all available locales](https://faker.readthedocs.io/en/master/locales.html).

In [31]:
from faker import Faker
# set a seed for the random generator
Faker.seed(0)

Generate data with locales `de_DE` and `de_AT`, with weights $5$ and $2$ respectively.

The distribution of locales will be:
 - `de_DE` - $71.43\%$ of the time ($5 / (5+2)$)
 - `de_AT` - $28.57\%$ of the time ($2 / (5+2)$)
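The percentages follow directly from the weights. The sketch below computes them and, as a sanity check, draws locales with `random.choices` from the standard library. It illustrates weighted sampling in general and is not meant to reproduce how Faker selects a locale internally.

```python
import random
from collections import Counter

weights = {"de_DE": 5, "de_AT": 2}
total = sum(weights.values())

# Expected share of each locale, in percent
shares = {locale: round(100 * w / total, 2) for locale, w in weights.items()}
print(shares)   # {'de_DE': 71.43, 'de_AT': 28.57}

# Empirical check with weighted sampling
rng = random.Random(0)
draws = rng.choices(list(weights), weights=list(weights.values()), k=10_000)
observed = Counter(draws)
print({locale: observed[locale] / len(draws) for locale in weights})
```

The observed frequencies should come out close to $5/7$ and $2/7$.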


In [32]:
from collections import OrderedDict
locales = OrderedDict([
    ('de_DE', 5),
    ('de_AT', 2),
])
fake = Faker(locales)
fake.seed_instance(42)
fake.locales

['de_DE', 'de_AT']

In [33]:
fake.seed_locale('de_DE', 0)
fake.seed_locale('de_AT', 0)

In [34]:
fake.profile(fields=['name', 'birthdate', 'sex', 'blood_group',
                     'mail', 'current_location'])

{'current_location': (Decimal('60.2738415'), Decimal('-125.936450')),
 'blood_group': 'AB-',
 'name': 'Axel Jung',
 'sex': 'M',
 'mail': 'claragollner@gmail.com',
 'birthdate': datetime.date(2005, 6, 5)}

In [35]:
from pyspark.sql.types import *
location = StructField('current_location',
                       StructType([StructField('lat', DecimalType()),
                                   StructField('lon', DecimalType())])
                      )
schema = StructType([StructField('name', StringType()),
                     StructField('birthdate', DateType()),
                     StructField('sex', StringType()),
                     StructField('blood_group', StringType()),
                     StructField('mail', StringType()),
                     location
                     ])

In [36]:
fake.profile(fields=['name', 'birthdate', 'sex', 'blood_group',
                     'mail', 'current_location'])

{'current_location': (Decimal('74.492712'), Decimal('-118.072077')),
 'blood_group': 'O-',
 'name': 'Dana Rieder',
 'sex': 'F',
 'mail': 'uhlanna-maria@chello.at',
 'birthdate': datetime.date(1935, 10, 3)}

In [37]:
from pyspark.sql import SparkSession
spark = SparkSession \
    .builder \
    .appName("Faker demo - part 2") \
    .getOrCreate()

Create dataframe with $5\cdot10^3$ rows.

In [38]:
%%time
n = 5*10**3
d = (fake.profile(fields=['name', 'birthdate', 'sex', 'blood_group',
                          'mail', 'current_location'])
     for i in range(n))
df = spark.createDataFrame(d, schema)

CPU times: user 1.9 s, sys: 19.7 ms, total: 1.92 s
Wall time: 1.95 s


In [39]:
df.printSchema()

root
 |-- name: string (nullable = true)
 |-- birthdate: date (nullable = true)
 |-- sex: string (nullable = true)
 |-- blood_group: string (nullable = true)
 |-- mail: string (nullable = true)
 |-- current_location: struct (nullable = true)
 |    |-- lat: decimal(10,0) (nullable = true)
 |    |-- lon: decimal(10,0) (nullable = true)



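Note how `location` represents a _tuple_ data structure (a `StructType` of `StructField`s). Also note that `DecimalType()` without arguments defaults to precision 10 and scale 0 (`decimal(10,0)` in the schema above), so the coordinates are stored without their fractional part. Specifying precision and scale explicitly, for example `DecimalType(10, 7)`, would preserve the decimals.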
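The effect of the scale on a coordinate can be illustrated with Python's `decimal` module, using the coordinates from the first `fake.profile` output above. This only mimics the idea of a fixed scale; the helper `to_scale` is hypothetical and the rounding mode chosen here is an assumption, not a statement about Spark's behavior.

```python
from decimal import Decimal, ROUND_HALF_UP

# Coordinates taken from the first fake.profile() output above
lat, lon = Decimal("60.2738415"), Decimal("-125.936450")

def to_scale(value, scale):
    """Round a Decimal to `scale` digits after the decimal point."""
    return value.quantize(Decimal(10) ** -scale, rounding=ROUND_HALF_UP)

print(to_scale(lat, 0), to_scale(lon, 0))   # 60 -126
print(to_scale(lat, 6), to_scale(lon, 6))   # 60.273842 -125.936450
```

With a scale of 0 only whole degrees survive, which is a very coarse location.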

In [40]:
df.show(n=10, truncate=False)

+---------------------------+----------+---+-----------+-----------------------------+----------------+
|name                       |birthdate |sex|blood_group|mail                         |current_location|
+---------------------------+----------+---+-----------+-----------------------------+----------------+
|Prof. Valentine Niemeier   |1965-06-01|F  |O-         |schmidtkegerlind@gmx.de      |{82, -70}       |
|Betty Schuster             |1918-06-27|F  |O+         |hesschristoph@gmail.com      |{-44, -1}       |
|Maya Käster                |1946-05-03|F  |AB-        |benderagatha@gmx.de          |{47, -13}       |
|Ing. Walfried Rosenow      |1914-04-11|M  |AB-        |kambsliane@hotmail.de        |{-21, -108}     |
|Ava Kappel                 |1936-01-07|F  |O-         |hannafroehlich@gmail.com     |{-23, -117}     |
|Rosa-Maria Schwital B.Sc.  |1929-06-17|F  |O+         |johannessauer@yahoo.de       |{-59, 87}       |
|Liane Hornig MBA.          |2002-04-05|F  |B-         |ruppers

# Save to Parquet

[Write to parquet](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=write#pyspark.sql.DataFrameWriter.parquet) file ([Parquet](http://parquet.apache.org/) is a compressed, efficient columnar data representation compatible with all frameworks in the Hadoop ecosystem):

In [41]:
df.write.mode("overwrite").parquet("fakedata.parquet")

Check the size of the parquet file (it is actually a directory containing the partitions):

In [42]:
!du -h fakedata.parquet

188K	fakedata.parquet


In [43]:
!ls -lh fakedata.parquet

total 172K
-rw-r--r-- 1 root root 71K Feb 20 12:44 part-00000-4add539e-d9cf-4b4d-90ba-766983ded591-c000.snappy.parquet
-rw-r--r-- 1 root root 98K Feb 20 12:44 part-00001-4add539e-d9cf-4b4d-90ba-766983ded591-c000.snappy.parquet
-rw-r--r-- 1 root root   0 Feb 20 12:44 _SUCCESS
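The size of such a directory can also be checked from Python by summing the sizes of the files it contains. The sketch below uses `pathlib` and demonstrates the idea on a temporary directory with made-up file names. The result is the sum of file sizes in bytes, which can differ from what `du` reports, since `du` measures disk usage.

```python
import tempfile
from pathlib import Path

def directory_size(path):
    """Sum the sizes (in bytes) of all regular files below `path`."""
    return sum(f.stat().st_size for f in Path(path).rglob("*") if f.is_file())

# Demonstration on a temporary directory with made-up files
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "part-00000.bin").write_bytes(b"x" * 100)
    (Path(tmp) / "_SUCCESS").write_bytes(b"")
    demo_size = directory_size(tmp)

print(demo_size)   # 100
```

In the notebook, `directory_size("fakedata.parquet")` would be the corresponding call.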


# Stop Spark session

Don't forget to close the Spark session when you're done!

## Why you should stop your Spark session

Even when no jobs are running, the Spark session holds on to memory, which is released only when the session is properly stopped.

In [44]:
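# Helper that returns the used and total memory reported by `free -h`
# (despite its name, it returns both values rather than a ratio)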
import subprocess

def get_memory_usage_ratio():
    # Run the 'free -h' command
    result = subprocess.run(['free', '-h'], stdout=subprocess.PIPE, text=True)

    # Parse the output
    lines = result.stdout.splitlines()

    # Initialize used and total memory
    used_memory = None
    total_memory = None

    # The second line contains the memory information
    if len(lines) > 1:
        # Split the line into parts
        memory_parts = lines[1].split()
        total_memory = memory_parts[1]  # Total memory
        used_memory = memory_parts[2]   # Used memory

    return used_memory, total_memory
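If an actual ratio is wanted, the human-readable strings from `free -h` are awkward to work with. A possible variant, sketched here under the assumption of a Linux system with the `free` command, parses the byte counts from `free -b` instead. The function names and the sample output are made up for illustration.

```python
import subprocess

def parse_free_bytes(output):
    """Return (used, total) in bytes from the 'Mem:' line of `free -b` output."""
    for line in output.splitlines():
        if line.startswith("Mem:"):
            parts = line.split()
            return int(parts[2]), int(parts[1])
    raise ValueError("no 'Mem:' line found")

def memory_usage_ratio():
    """Run `free -b` (Linux only) and return used/total as a float."""
    result = subprocess.run(["free", "-b"], stdout=subprocess.PIPE,
                            text=True, check=True)
    used, total = parse_free_bytes(result.stdout)
    return used / total

# Made-up sample output, for illustration only
sample = """\
               total        used        free      shared  buff/cache   available
Mem:     13000000000  1625000000  8000000000     1000000  3375000000 11000000000
Swap:              0           0           0
"""
used, total = parse_free_bytes(sample)
print(f"{used / total:.1%}")   # 12.5%
```

Keeping the parsing separate from the `subprocess` call makes the parser easy to test on a fixed string.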

Compare memory usage before and after stopping the session.

In [45]:
# Check memory usage before stopping the Spark session
used_memory, total_memory = get_memory_usage_ratio()
print(f"Memory used before stopping Spark session: {used_memory}")
print(f"Total Memory: {total_memory}")


Memory used before stopping Spark session: 1.6Gi
Total Memory: 12Gi


In [46]:
# Stop the Spark session
spark.stop()

# Check memory usage after stopping the Spark session
used_memory, total_memory = get_memory_usage_ratio()
print(f"Memory used after stopping Spark session: {used_memory}")
print(f"Total Memory: {total_memory}")

Memory used after stopping Spark session: 1.5Gi
Total Memory: 12Gi


The amount of memory released may not be impressive in this case, but holding onto unnecessary resources is inefficient. Also, memory waste can add up quickly when multiple sessions are running.