<a href="https://colab.research.google.com/github/groda/big_data/blob/master/generate_data_with_Faker.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<a href="https://github.com/groda/big_data"><div><img src="https://github.com/groda/big_data/blob/master/logo_bdb.png?raw=true" align=right width="90"></div></a>

# Data Generation and Aggregation with Python's Faker Library and PySpark
<br>
<br>

Explore the capabilities of the Python Faker library (https://faker.readthedocs.io/) for dynamic data generation!

Whether you're a data scientist, engineer, or analyst, this tutorial will guide you through the process of creating realistic and diverse datasets using Faker and then harnessing the distributed computing capabilities of PySpark to aggregate and analyze the generated data.



**Note:** This is not _synthetic_ data as it is generated using simple methods and will most likely not fit any real-life distribution. Still, it serves as a valuable resource for testing purposes when authentic data is unavailable.

# Install Faker

The Python `faker` module needs to be installed. Note that on Google Colab you can use `!pip` as well as just `pip` (no exclamation mark).

In [1]:
!pip install faker

Collecting faker
  Downloading Faker-24.1.0-py3-none-any.whl.metadata (15 kB)
Downloading Faker-24.1.0-py3-none-any.whl (1.8 MB)
Installing collected packages: faker
Successfully installed faker-24.1.0


# Generate a Pandas dataframe with fake data

Import `Faker` and set a random seed ($42$).

In [2]:
from faker import Faker
# Set the seed value of the shared `random.Random` object
# across all internal generators that will ever be created
Faker.seed(42)

`fake` is a fake data generator with the `de_DE` locale.

In [3]:
fake = Faker('de_DE')
fake.seed_locale('de_DE', 42)
# Creates and seeds a unique `random.Random` object for
# each internal generator of this `Faker` instance
fake.seed_instance(42)

Import Pandas to save data into a dataframe

In [4]:
import sys
# True if running on Google Colab
IN_COLAB = 'google.colab' in sys.modules
if not IN_COLAB:
    !pip install pandas==1.5.3

import pandas as pd

Collecting pandas==1.5.3
  Downloading pandas-1.5.3-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (11 kB)
Collecting numpy>=1.20.3 (from pandas==1.5.3)
  Downloading numpy-1.24.4-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.6 kB)
Downloading pandas-1.5.3-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.2 MB)
Downloading numpy-1.24.4-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.3 MB)
Installing collected packages: numpy, pandas
Successfully installed numpy-1.24.4 pandas-1.5.3


The function `create_row_faker` creates `num` rows of fake data (one by default). Each row contains the following fields:
 - `name`: `fake.name()`
 - `age`: `fake.random_int(0, 100)`
 - `postcode`: `fake.postcode()`
 - `email`: `fake.email()`
 - `nationality`: `fake.country()`

In [5]:
def create_row_faker(num=1):
    output = [{"name": fake.name(),
               "age": fake.random_int(0, 100),
               "postcode": fake.postcode(),
               "email": fake.email(),
               "nationality": fake.country(),
              } for x in range(num)]
    return output

Generate a single row

In [6]:
create_row_faker()

[{'name': 'Aleksandr Weinhage',
  'age': 35,
  'postcode': '32181',
  'email': 'bbeckmann@example.org',
  'nationality': 'Fidschi'}]

Generate a dataframe `df_fake` of 5000 rows using `create_row_faker`.

We're using the _cell magic_ `%%time` to time the operation.

In [7]:
%%time
df_fake = pd.DataFrame(create_row_faker(5000))

CPU times: user 269 ms, sys: 5.41 ms, total: 275 ms
Wall time: 275 ms


View dataframe

In [8]:
df_fake

|      | name | age | postcode | email | nationality |
|------|------|-----|----------|-------|-------------|
| 0    | Prof. Kurt Bauer B.A. | 91 | 37940 | hildaloeffler@example.com | Guatemala |
| 1    | Ekkehart Wilms-Kallert | 13 | 61559 | maja07@example.net | Brasilien |
| 2    | Annelise Röhrdanz-Hornig | 80 | 93103 | daniel31@example.com | Guatemala |
| 3    | Magrit Knappe B.A. | 47 | 34192 | gottliebmitschke@example.com | Guadeloupe |
| 4    | Univ.Prof. Gotthilf Wirth B.Sc. | 29 | 56413 | heini76@example.net | Litauen |
| ...  | ... | ... | ... | ... | ... |
| 4995 | Janusz Nette | 74 | 33091 | ann-kathrinsiering@example.net | Belarus |
| 4996 | Frau Cathleen Bähr | 97 | 89681 | hethurhubertus@example.org | St. Barthélemy |
| 4997 | Ulla Seidel | 66 | 28358 | klotzbabette@example.net | St. Lucia |
| 4998 | Janin Speer MBA. | 64 | 76879 | giesskarl-hermann@example.com | Kroatien |


For more fake data generators see Faker's [standard providers](https://faker.readthedocs.io/en/master/providers.html#standard-providers) as well as [community providers](https://faker.readthedocs.io/en/master/communityproviders.html#community-providers).

# Generate PySpark dataframe with fake data

Install PySpark.

In [9]:
!pip install pyspark

In [10]:
from pyspark.sql import SparkSession
spark = SparkSession \
    .builder \
    .appName("Faker demo") \
    .getOrCreate()

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


24/03/09 20:35:06 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [11]:
df = spark.createDataFrame(create_row_faker(5000))

Depending on the Spark version, creating a dataframe from a list of dictionaries without a schema can trigger a deprecation warning about schema inference. To avoid it, either use [pyspark.sql.Row](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Row) and let Spark infer the datatypes, or create a schema for the dataframe that specifies the datatypes of all fields (here's the list of all [datatypes](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=types#module-pyspark.sql.types)).

In [12]:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
schema = StructType([StructField('name', StringType()),
                     StructField('age',IntegerType()),
                     StructField('postcode',StringType()),
                     StructField('email', StringType()),
                     StructField('nationality',StringType())])

In [13]:
df = spark.createDataFrame(create_row_faker(5000), schema)

In [14]:
df.printSchema()

root
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- postcode: string (nullable = true)
 |-- email: string (nullable = true)
 |-- nationality: string (nullable = true)



Let's generate some more data (a dataframe with $5\cdot10^4$ rows). The dataframe will be partitioned by Spark.

In [15]:
%%time
n = 5*10**4
df = spark.createDataFrame(create_row_faker(n), schema)

CPU times: user 2.9 s, sys: 18.4 ms, total: 2.92 s
Wall time: 2.94 s


It took a long time (~3 sec. for 50000 rows)!

Can we do better?

The function `create_row_faker()` returns a list. This is not efficient; what we need is a _generator_ instead.

In [16]:
d = create_row_faker(5)
# what type is d?
type(d)

list

If we replace the list comprehension with a generator expression, `d` becomes a generator:

In [17]:
d = ({"name": fake.name(),
      "age": fake.random_int(0, 100),
      "postcode": fake.postcode(),
      "email": fake.email(),
      "nationality": fake.country()} for i in range(5))
# what type is d?
type(d)

generator
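
The difference between the two can be made visible with a counter. In this minimal sketch `make_row` is a hypothetical stand-in for the Faker calls: the list comprehension runs it for every row immediately, while the generator expression runs it only when a row is requested.

```python
counter = {"calls": 0}

def make_row(i):
    """Stand-in for a Faker call; counts how often it is executed."""
    counter["calls"] += 1
    return {"id": i}

rows_list = [make_row(i) for i in range(5)]   # runs make_row 5 times right away
calls_after_list = counter["calls"]

rows_gen = (make_row(i) for i in range(5))    # nothing is executed yet
calls_after_gen = counter["calls"]

first = next(rows_gen)                        # executes make_row exactly once
calls_after_next = counter["calls"]

print(calls_after_list, calls_after_gen, calls_after_next)  # 5 5 6
```

A generator saves memory because rows are produced on demand, but every row still has to be computed once it is consumed.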

In [18]:
%%time
n = 5*10**4
d = ({"name": fake.name(),
      "age": fake.random_int(0, 100),
      "postcode": fake.postcode(),
      "email": fake.email(),
      "nationality": fake.country()}
     for i in range(n))
df = spark.createDataFrame(d, schema)

CPU times: user 2.94 s, sys: 18.5 ms, total: 2.96 s
Wall time: 2.98 s


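This wasn't faster. The time is dominated by the Faker calls themselves, which run one after the other in the driver process whether the rows are collected in a list or produced by a generator.

I will look into how one can leverage Spark's parallelism to generate dataframes and speed up the process.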

## Filter and aggregate with PySpark

In [19]:
type(df)

pyspark.sql.dataframe.DataFrame

Show the first five records in the dataframe of fake data.

In [20]:
df.show(n=5, truncate=False)

+---------------------+---+--------+------------------------+-----------+
|name                 |age|postcode|email                   |nationality|
+---------------------+---+--------+------------------------+-----------+
|Rudolf Rust          |76 |86566   |kornelia11@example.org  |Indonesien |
|Kerstin Putz-Köster  |54 |18953   |qrudolph@example.org    |Kenia      |
|Ivo Schinke B.Sc.    |3  |60202   |radischjames@example.net|Chile      |
|Hans-Henning Staude  |82 |68552   |ladislaus89@example.com |Thailand   |
|Justine Weinhage B.A.|53 |62346   |ytaesche@example.org    |Österreich |
+---------------------+---+--------+------------------------+-----------+
only showing top 5 rows



Do some data aggregation:
 - group by postcode
 - count the number of persons and the average age for each postcode
 - filter out postcodes with fewer than 4 persons
 - sort by average age descending
 - show the first 5 entries

In [21]:
import pyspark.sql.functions as F
df.groupBy('postcode') \
  .agg(F.count('postcode').alias('Count'), F.round(F.avg('age'), 2).alias('Average age')) \
  .filter('Count>3') \
  .orderBy('Average age', ascending=False) \
  .show(5)

+--------+-----+-----------+
|postcode|Count|Average age|
+--------+-----+-----------+
|   18029|    4|      91.75|
|   67611|    4|       87.0|
|   47898|    4|       85.5|
|   45386|    4|      82.75|
|   46755|    4|       78.5|
+--------+-----+-----------+
only showing top 5 rows
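
As a cross-check of the steps listed above, the same pipeline can be sketched in Pandas on a tiny hand-made dataframe. The rows below are made up for illustration (they are not Faker output) and are chosen so that the result is easy to verify by hand.

```python
import pandas as pd

# Hand-made toy rows: two postcodes with 4 persons each, one with only 2
toy = pd.DataFrame({
    "postcode": ["10115"] * 4 + ["20095"] * 4 + ["80331"] * 2,
    "age":      [90, 80, 70, 60, 50, 40, 30, 20, 99, 97],
})

summary = (toy.groupby("postcode")                      # group by postcode
              .agg(Count=("age", "size"),               # number of persons
                   Average_age=("age", "mean"))         # average age
              .query("Count > 3")                       # drop postcodes with fewer than 4 persons
              .sort_values("Average_age", ascending=False)
              .round(2)
              .head(5))
print(summary)
```

Postcode `80331` has the highest average age in the toy data but is dropped by the filter because it has only two persons.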



                                                                                

Postcode $18029$ has the highest average age ($91.75$). Show all entries for postcode $18029$ using `filter`.

In [22]:
df.filter("postcode == '18029'").show(truncate=False)

+---------------------------+---+--------+------------------------------+------------------+
|name                       |age|postcode|email                         |nationality       |
+---------------------------+---+--------+------------------------------+------------------+
|Univ.Prof. Roderich Liebelt|89 |18029   |anne-katrinscholtz@example.com|Grönland          |
|Herwig Matthäi B.A.        |90 |18029   |steinberggerta@example.com    |Amerikanisch-Samoa|
|Univ.Prof. Mijo Weihmann   |92 |18029   |hoevelantonius@example.net    |Niederlande       |
|Aynur Karz B.Eng.          |96 |18029   |cschleich@example.com         |Puerto Rico       |
+---------------------------+---+--------+------------------------------+------------------+



# Another example with multiple locales and weights

We are going to use multiple locales with weights (following the [examples](https://faker.readthedocs.io/en/master/fakerclass.html#examples) in the documentation).

Here's the [list of all available locales](https://faker.readthedocs.io/en/master/locales.html).

In [23]:
from faker import Faker
# set a seed for the random generator
Faker.seed(0)

Generate data with locales `de_DE` and `de_AT` with weights $5$ and $2$, respectively.

The distribution of locales will be:
 - `de_DE` - $71.43\%$ of the time ($5 / (5+2)$)
 - `de_AT` - $28.57\%$ of the time ($2 / (5+2)$)
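
The percentages above are simply each weight divided by the sum of the weights. The sketch below reproduces the arithmetic and compares it with a seeded simulation that uses `random.choices` from the standard library. It illustrates weighted selection in general and makes no claim about how Faker implements it internally; the number of draws is arbitrary.

```python
import random
from collections import Counter, OrderedDict

locale_weights = OrderedDict([('de_DE', 5), ('de_AT', 2)])
total = sum(locale_weights.values())

# Expected share of each locale: weight / sum of weights
expected = {loc: w / total for loc, w in locale_weights.items()}

# Simulate 70,000 weighted draws with a fixed seed
rng = random.Random(0)
draws = rng.choices(list(locale_weights),
                    weights=list(locale_weights.values()),
                    k=70_000)
observed = {loc: count / len(draws) for loc, count in Counter(draws).items()}

for loc in locale_weights:
    print(f"{loc}: expected {expected[loc]:.2%}, observed {observed[loc]:.2%}")
```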


In [24]:
from collections import OrderedDict
locales = OrderedDict([
    ('de_DE', 5),
    ('de_AT', 2),
])
fake = Faker(locales)
fake.seed_instance(42)
fake.locales

['de_DE', 'de_AT']

In [25]:
fake.seed_locale('de_DE', 0)
fake.seed_locale('de_AT', 0)

In [26]:
fake.profile(fields=['name', 'birthdate', 'sex', 'blood_group',
                     'mail', 'current_location'])

{'current_location': (Decimal('26.547114'), Decimal('-10.243190')),
 'blood_group': 'B-',
 'name': 'Axel Jung',
 'sex': 'M',
 'mail': 'claragollner@gmail.com',
 'birthdate': datetime.date(2003, 6, 23)}

In [27]:
from pyspark.sql.types import StructType, StructField, StringType, DateType, DecimalType
location = StructField('current_location',
                       StructType([StructField('lat', DecimalType()),
                                   StructField('lon', DecimalType())])
                      )
schema = StructType([StructField('name', StringType()),
                     StructField('birthdate', DateType()),
                     StructField('sex', StringType()),
                     StructField('blood_group', StringType()),
                     StructField('mail', StringType()),
                     location
                     ])

In [28]:
fake.profile(fields=['name', 'birthdate', 'sex', 'blood_group',
                     'mail', 'current_location'])

{'current_location': (Decimal('79.153888'), Decimal('-0.003034')),
 'blood_group': 'B-',
 'name': 'Dr. Anita Suppan',
 'sex': 'F',
 'mail': 'schauerbenedict@kabsi.at',
 'birthdate': datetime.date(1980, 3, 5)}

In [29]:
from pyspark.sql import SparkSession
spark = SparkSession \
    .builder \
    .appName("Faker demo - part 2") \
    .getOrCreate()

24/03/09 20:35:21 WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect.


Create dataframe with $5\cdot10^3$ rows.

In [30]:
%%time
n = 5*10**3
d = (fake.profile(fields=['name', 'birthdate', 'sex', 'blood_group',
                          'mail', 'current_location'])
     for i in range(n))
df = spark.createDataFrame(d, schema)

CPU times: user 1.62 s, sys: 4.09 ms, total: 1.62 s
Wall time: 1.71 s


In [31]:
df.printSchema()

root
 |-- name: string (nullable = true)
 |-- birthdate: date (nullable = true)
 |-- sex: string (nullable = true)
 |-- blood_group: string (nullable = true)
 |-- mail: string (nullable = true)
 |-- current_location: struct (nullable = true)
 |    |-- lat: decimal(10,0) (nullable = true)
 |    |-- lon: decimal(10,0) (nullable = true)



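Note how `location` represents a _tuple_ data structure (a `StructType` of `StructField`s). Also note that `DecimalType()` defaults to precision 10 and scale 0 (`decimal(10,0)` in the schema above), so the coordinates are stored as whole numbers and lose their decimal places.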

In [32]:
df.show(n=10, truncate=False)

+-----------------------+----------+---+-----------+-----------------------------+----------------+
|name                   |birthdate |sex|blood_group|mail                         |current_location|
+-----------------------+----------+---+-----------+-----------------------------+----------------+
|Prof. Valentine Noack  |1979-04-08|F  |B-         |maricagotthard@aol.de        |{74, 164}       |
|Magrit Graf            |1943-02-09|F  |A-         |hartungclaudio@web.de        |{-86, -34}      |
|Harriet Weller-Lindau  |1959-12-19|F  |AB+        |heserhilma@gmail.com         |{20, 126}       |
|Ing. Walfried Roskoth  |1912-04-28|M  |B-         |kambsliane@hotmail.de        |{73, 169}       |
|Alexa Loidl-Schönberger|1934-01-24|F  |O-         |hannafroehlich@gmail.com     |{-23, -117}     |
|Hans-Erich Hartmann    |1971-07-27|M  |O-         |vadimkostolzin@gmx.de        |{39, -118}      |
|Ing. Sofia Fritsch B.A.|1966-12-25|F  |A-         |weinhagehans-christian@gmx.de|{-11, 73}       |


# Save to Parquet

[Write to parquet](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=write#pyspark.sql.DataFrameWriter.parquet) file ([Parquet](http://parquet.apache.org/) is a compressed, efficient columnar data representation compatible with all frameworks in the Hadoop ecosystem):

In [33]:
df.write.mode("overwrite").parquet("fakedata.parquet")

Check the size of the Parquet output (it is actually a directory containing one file per partition):

In [34]:
!du -h fakedata.parquet

212K	fakedata.parquet


In [35]:
!ls -lh fakedata.parquet

total 188K
-rw-r--r-- 1 runner docker   0 Mar  9 20:35 _SUCCESS
-rw-r--r-- 1 runner docker 39K Mar  9 20:35 part-00000-88fcb83e-85d9-4d02-879b-00a7593a60aa-c000.snappy.parquet
-rw-r--r-- 1 runner docker 40K Mar  9 20:35 part-00001-88fcb83e-85d9-4d02-879b-00a7593a60aa-c000.snappy.parquet
-rw-r--r-- 1 runner docker 39K Mar  9 20:35 part-00002-88fcb83e-85d9-4d02-879b-00a7593a60aa-c000.snappy.parquet
-rw-r--r-- 1 runner docker 67K Mar  9 20:35 part-00003-88fcb83e-85d9-4d02-879b-00a7593a60aa-c000.snappy.parquet
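
On systems where `du` and `ls` are not available, the same information can be collected with the standard library. This is a minimal sketch; the helper name `parquet_parts` is an assumption made for illustration.

```python
from pathlib import Path

def parquet_parts(directory):
    """Return (file name, size in bytes) for each part file in a Parquet directory."""
    return sorted((p.name, p.stat().st_size)
                  for p in Path(directory).glob('part-*.parquet'))
```

Calling `parquet_parts('fakedata.parquet')` after the write above should list one entry per partition file, skipping the `_SUCCESS` marker.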


# Stop Spark session

Don't forget to close the Spark session when you're done!

In [36]:
spark.stop()