## Need test data?

Here's a quick way to generate some fake data using the Python `Faker` library ([https://faker.readthedocs.io/](https://faker.readthedocs.io/)).

**Note:** this is not _synthetic_ data as it is generated with simple methods and will most likely not fit any real-life distribution. Still, it can be useful for test purposes when no data is at hand.

Import `Faker` and set a random seed.

In [1]:
from faker import Faker
# Set the seed value of the shared `random.Random` object
# across all internal generators that will ever be created
Faker.seed(42)

`fake` is a fake data generator with the `de_DE` locale.

In [2]:
fake = Faker('de_DE')
fake.seed_locale('de_DE', 42)
# Creates and seeds a unique `random.Random` object for
# each internal generator of this `Faker` instance
fake.seed_instance(42)
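
The effect of seeding can be illustrated with the standard library alone. This is only a sketch: `draw_ages` is a hypothetical helper, and a plain `random.Random` object stands in for one of Faker's internal generators (which, per the comments above, are what the seed calls act on). The same seed gives the same sequence of values, which is what makes the fake data reproducible.

```python
import random

def draw_ages(seed, count=5):
    """Hypothetical stand-in for a seeded Faker generator."""
    rng = random.Random(seed)  # independent generator with an explicit seed
    return [rng.randint(0, 100) for _ in range(count)]

first_run = draw_ages(42)
second_run = draw_ages(42)   # same seed: same sequence as first_run
other_seed = draw_ages(43)   # different seed: very likely a different sequence

print(first_run == second_run)
```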

Import pandas to save the data into a dataframe.

In [3]:
import pandas as pd

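The function `create_row_faker` returns a list of `num` rows of fake data (one row by default). Here we choose to generate rows containing the following fields:
 - `name` from `fake.name()`
 - `age` from `fake.random_int(0, 100)`
 - `postcode` from `fake.postcode()`
 - `email` from `fake.email()`
 - `nationality` from `fake.country()`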

In [4]:
def create_row_faker(num=1):
    """Return a list of `num` dictionaries, each holding one row of fake data."""
    output = [{"name": fake.name(),
               "age": fake.random_int(0, 100),
               "postcode": fake.postcode(),
               "email": fake.email(),
               "nationality": fake.country(),
              } for _ in range(num)]
    return output

Generate a single row

In [5]:
create_row_faker()

[{'name': 'Alida Harloff',
  'age': 28,
  'postcode': '18196',
  'email': 'gnatzdajana@rosenow.net',
  'nationality': 'Grönland'}]

Generate a dataframe `df_fake` of 5000 rows using `create_row_faker`. 

We're using the _cell magic_ `%%time` to time the operation.

In [6]:
%%time
df_fake = pd.DataFrame(create_row_faker(5000))

CPU times: user 2.63 s, sys: 6.27 ms, total: 2.64 s
Wall time: 2.64 s


View dataframe

In [7]:
df_fake

| | name | age | postcode | email | nationality |
|---|---|---|---|---|---|
| 0 | Prof. Carolina Kade | 35 | 51161 | jochembeer@dippel.com | Osttimor |
| 1 | Gisela Flantz B.Sc. | 84 | 41316 | olaf55@haering.de | Niue |
| 2 | Nuri Ehlert | 48 | 83503 | djunck@yahoo.de | Namibia |
| 3 | Deborah Kroker | 82 | 42388 | polina65@zorbach.com | Antigua und Barbuda |
| 4 | Leonardo Schottin | 54 | 66978 | sonja18@schlosser.com | Guernsey |
| ... | ... | ... | ... | ... | ... |
| 4995 | Ing. Angelique Bender | 51 | 29000 | ecarsten@zaenker.net | Serbien und Montenegro |
| 4996 | Dr. Dietmar Kabus B.Eng. | 9 | 20621 | hahnclarissa@loewer.com | Brasilien |
| 4997 | Rose Cichorius B.Sc. | 16 | 12841 | sieglindestahr@googlemail.com | Äthiopien |
| 4998 | Prof. Rigo Seifert | 46 | 24254 | brittkoehler@gmail.com | Suriname |


For more fake data generators see Faker's [standard providers](https://faker.readthedocs.io/en/master/providers.html#standard-providers) as well as [community providers](https://faker.readthedocs.io/en/master/communityproviders.html#community-providers).

In [8]:
from pyspark.sql import SparkSession
spark = SparkSession \
    .builder \
    .appName("dataLAB demo fake data") \
    .getOrCreate()

In [9]:
df = spark.createDataFrame(create_row_faker(5000))



Creating a dataframe from a list of dictionaries makes Spark warn that inferring the schema from `dict` is deprecated. To avoid the warning, either use [pyspark.sql.Row](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Row) and let Spark infer the datatypes, or create a schema for the dataframe that specifies the datatype of every field (here's the list of all [datatypes](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=types#module-pyspark.sql.types)).

In [10]:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
schema = StructType([StructField('name', StringType()),
                     StructField('age', IntegerType()),
                     StructField('postcode', StringType()),
                     StructField('email', StringType()),
                     StructField('nationality', StringType())])

In [11]:
df = spark.createDataFrame(create_row_faker(5000), schema)

In [12]:
df.printSchema()

root
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- postcode: string (nullable = true)
 |-- email: string (nullable = true)
 |-- nationality: string (nullable = true)



Let's generate some more data (a dataframe with $5\cdot10^4$ rows). The dataframe will be partitioned by Spark.

In [13]:
%%time
n = 5*10**4
df = spark.createDataFrame(create_row_faker(n), schema)

CPU times: user 27 s, sys: 137 ms, total: 27.1 s
Wall time: 27.1 s


It took a long time (27 sec. for 50000 rows)!

Can we do better?

The function `create_row_faker()` returns a list. Building the whole list in memory first is not efficient; what we need is a _generator_ instead.

In [14]:
d = create_row_faker(5)
# what type is d?
type(d)

list

Now `d` is a generator:

In [15]:
d = ({"name": fake.name(),
      "age": fake.random_int(0, 100),
      "postcode": fake.postcode(),
      "email": fake.email(),
      "nationality": fake.country()} for i in range(5))
# what type is d?
type(d)

generator
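
What a generator changes is _when_ the rows are produced, not how much work each row costs. The following sketch counts calls to make this visible; `fake_row` is a hypothetical stand-in for a row built from Faker calls, and it uses `random` from the standard library instead of Faker.

```python
import random

calls = 0  # counts how many rows have actually been generated

def fake_row(rng):
    """Hypothetical stand-in for one row built from Faker calls."""
    global calls
    calls += 1
    return {"age": rng.randint(0, 100)}

rng = random.Random(42)

rows_list = [fake_row(rng) for _ in range(5)]   # list: all 5 rows are built now
calls_after_list = calls                        # 5

rows_gen = (fake_row(rng) for _ in range(5))    # generator: no row built yet
calls_after_gen_creation = calls                # still 5

consumed = list(rows_gen)                       # consuming the generator does the work
calls_after_consumption = calls                 # 10
```

Creating the generator is instantaneous, but every row still has to be generated once something consumes it.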

In [16]:
%%time
n = 5*10**4
d = ({"name": fake.name(),
      "age": fake.random_int(0, 100),
      "postcode": fake.postcode(),
      "email": fake.email(),
      "nationality": fake.country()} 
     for i in range(n))
df = spark.createDataFrame(d, schema)

CPU times: user 27.5 s, sys: 158 ms, total: 27.6 s
Wall time: 27.6 s


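This wasn't faster: the time is spent generating the rows with Faker in a single Python process, and that work is the same whether the rows come from a list or from a generator.

I will look into how one can leverage the parallelism of the Hadoop cluster to generate dataframes and speed up the process.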
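
One possible direction, sketched here under assumptions and not benchmarked: let each partition generate its own rows, with one seeded generator per partition. The block below only shows such a per-partition function and runs it locally on plain Python ranges. `make_rows` and `base_seed` are hypothetical names, and `random.Random` stands in for a `Faker` instance created inside the partition.

```python
import random

def make_rows(partition_index, indices, base_seed=42):
    """Yield one (name, age) tuple per element of `indices`.

    Hypothetical helper: `random.Random` stands in for a `Faker` instance.
    """
    # One generator per partition, seeded differently for each partition,
    # so that partitions do not all produce the same rows.
    rng = random.Random(base_seed + partition_index)
    for _ in indices:
        yield ("person-%06d" % rng.randrange(10**6), rng.randint(0, 100))

# Local dry run without Spark: two "partitions" of 3 rows each.
part0 = list(make_rows(0, range(3)))
part1 = list(make_rows(1, range(3, 6)))
```

A function of this shape (partition index plus an iterator) is what `RDD.mapPartitionsWithIndex` expects, for example on an RDD built with `spark.sparkContext.parallelize(range(n), numSlices)`; the resulting RDD could then be passed to `spark.createDataFrame` together with a matching schema. This has not been run on the cluster, and Faker would have to be installed on every worker.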

In [17]:
type(df)

pyspark.sql.dataframe.DataFrame

Show the first five records in the dataframe of fake data.

In [18]:
df.show(n=5)

+--------------------+---+--------+--------------------+----------------+
|                name|age|postcode|               email|     nationality|
+--------------------+---+--------+--------------------+----------------+
|     Natalija Textor| 91|   27481|  kreising@gmail.com|         Eritrea|
| Dietmar Preiß-Nette| 79|   78578|rosabiggen@doersc...|Äquatorialguinea|
|    Stanislav Scholz|  7|   57091|fjockel@googlemai...|      Martinique|
|         Marian Gute| 64|   91681|emilhartmann@goog...|            Guam|
|Dipl.-Ing. Klaus ...| 63|   98220|jacobi-jaeckelges...|       Nicaragua|
+--------------------+---+--------+--------------------+----------------+
only showing top 5 rows



Do some data aggregation:
 - group by postcode
 - compute the number of persons and the average age for each postcode
 - filter out postcodes with fewer than 4 persons
 - sort by average age descending
 - show the first 5 entries

In [19]:
import pyspark.sql.functions as F
df.groupBy('postcode') \
  .agg(F.count('postcode').alias('Count'), F.round(F.avg('age'), 2).alias('Average age')) \
  .filter('Count>3') \
  .orderBy('Average age', ascending=False) \
  .show(5)  

+--------+-----+-----------+
|postcode|Count|Average age|
+--------+-----+-----------+
|   86678|    4|       90.0|
|   23084|    4|       87.5|
|   89884|    4|      86.75|
|   99646|    4|       84.5|
|   96353|    4|       82.0|
+--------+-----+-----------+
only showing top 5 rows
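
To make the aggregation steps concrete, here is the same logic in plain Python on a tiny hand-made dataset (these values are invented for illustration and are not Faker output). It is only a sketch of what the Spark query computes, not a replacement for it.

```python
from collections import defaultdict

# Tiny hand-made dataset: (postcode, age) pairs.
people = [("11111", 90), ("11111", 80), ("11111", 70), ("11111", 60),
          ("22222", 50), ("22222", 40), ("22222", 30), ("22222", 20), ("22222", 10),
          ("33333", 99)]

# 1. group by postcode
ages_by_postcode = defaultdict(list)
for postcode, age in people:
    ages_by_postcode[postcode].append(age)

# 2. count the persons and compute the average age per postcode
stats = [(pc, len(ages), round(sum(ages) / len(ages), 2))
         for pc, ages in ages_by_postcode.items()]

# 3. keep only postcodes with more than 3 persons (drops "33333")
stats = [row for row in stats if row[1] > 3]

# 4. sort by average age, descending
stats.sort(key=lambda row: row[2], reverse=True)

# 5. take the first 5 entries
top = stats[:5]
print(top)  # [('11111', 4, 75.0), ('22222', 5, 30.0)]
```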



Among the postcodes with at least 4 persons, postcode $86678$ has the highest average age ($90$). Show all entries for postcode $86678$ using `filter`.

In [20]:
df.filter("postcode == '86678'").show()

+--------------------+---+--------+--------------------+--------------------+
|                name|age|postcode|               email|         nationality|
+--------------------+---+--------+--------------------+--------------------+
|  Klaus Peter Weller| 90|   86678|dgeissler@benthin...|Russische Föderation|
|Anthony Wohlgemut...| 92|   86678|rmangold@googlema...|           Venezuela|
|Ing. Albertine Sc...| 84|   86678|awieloch@mosemann.de|      Marshallinseln|
|    Klaus-Peter Kaul| 94|   86678|  angelasalz@roht.de|               China|
+--------------------+---+--------+--------------------+--------------------+



## Another example

We are going to use multiple locales with weights (following the [examples](https://faker.readthedocs.io/en/master/fakerclass.html#examples) in the documentation). 

Here's the [list of all available locales](https://faker.readthedocs.io/en/master/locales.html).

In [21]:
from faker import Faker
# set a seed for the random generator
Faker.seed(0) 

In [22]:
from collections import OrderedDict
locales = OrderedDict([
    ('de_DE', 5), 
    ('de_AT', 2),
])
fake = Faker(locales)
fake.seed_instance(42)
fake.locales

['de_DE', 'de_AT']
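
The weights can be read as relative frequencies: with weights 5 and 2, roughly 5 out of 7 values should come from `de_DE`. The sketch below illustrates this with `random.choices` from the standard library. It assumes that the weights act as relative selection probabilities, as the Faker documentation describes, and does not call Faker itself.

```python
import random
from collections import Counter, OrderedDict

locales = OrderedDict([("de_DE", 5), ("de_AT", 2)])

rng = random.Random(0)
# Draw 7000 locales, each picked with probability proportional to its weight.
draws = rng.choices(list(locales.keys()),
                    weights=list(locales.values()),
                    k=7000)
counts = Counter(draws)
share_de = counts["de_DE"] / len(draws)  # expected to be close to 5/7

print(counts, round(share_de, 3))
```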

In [23]:
fake.seed_locale('de_DE', 0)
fake.seed_locale('de_AT', 0)

In [24]:
fake.profile(fields=['name', 'birthdate', 'sex', 'blood_group', 
                     'mail', 'current_location'])

{'current_location': (Decimal('73.9837235'), Decimal('163.824695')),
 'blood_group': 'O-',
 'name': 'Eva Enzinger',
 'sex': 'F',
 'mail': 'rvogl@kabsi.at',
 'birthdate': datetime.date(2017, 10, 27)}

In [25]:
from pyspark.sql.types import (StructType, StructField, StringType,
                               DateType, DecimalType)
# `current_location` is a nested struct with two decimal fields
location = StructField('current_location',
                       StructType([StructField('lat', DecimalType()),
                                   StructField('lon', DecimalType())])
                      )
schema = StructType([StructField('name', StringType()),
                     StructField('birthdate', DateType()),
                     StructField('sex', StringType()),
                     StructField('blood_group', StringType()),
                     StructField('mail', StringType()),
                     location
                     ])

In [26]:
fake.profile(fields=['name', 'birthdate', 'sex', 'blood_group', 
                     'mail', 'current_location'])

{'current_location': (Decimal('-50.7121925'), Decimal('-62.546840')),
 'blood_group': 'A+',
 'name': 'Aleksander Wiesinger',
 'sex': 'M',
 'mail': 'marctischler@gmail.com',
 'birthdate': datetime.date(1982, 6, 29)}

In [27]:
from pyspark.sql import SparkSession
spark = SparkSession \
    .builder \
    .appName("dataLAB demo fake data - part 2") \
    .getOrCreate()

Create dataframe with $5\cdot10^3$ rows.

In [28]:
%%time
n = 5*10**3
d = (fake.profile(fields=['name', 'birthdate', 'sex', 'blood_group', 
                          'mail', 'current_location']) 
     for i in range(n))
df = spark.createDataFrame(d, schema)

CPU times: user 9.45 s, sys: 50 ms, total: 9.5 s
Wall time: 9.54 s


In [29]:
df.printSchema()

root
 |-- name: string (nullable = true)
 |-- birthdate: date (nullable = true)
 |-- sex: string (nullable = true)
 |-- blood_group: string (nullable = true)
 |-- mail: string (nullable = true)
 |-- current_location: struct (nullable = true)
 |    |-- lat: decimal(10,0) (nullable = true)
 |    |-- lon: decimal(10,0) (nullable = true)



Note how `location` represents a _tuple_ data structure (a `StructType` of `StructField`s).
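
The schema above also prints `decimal(10,0)` for `lat` and `lon`: `DecimalType()` without arguments uses precision 10 and scale 0, so no digits after the decimal point are kept. The sketch below shows the effect of the scale with Python's `decimal` module, using the coordinates from the first `fake.profile` output. The rounding mode is an assumption chosen for illustration only.

```python
from decimal import Decimal, ROUND_HALF_UP

lat = Decimal("73.9837235")
lon = Decimal("163.824695")

# Scale 0: no digits after the decimal point
# (ROUND_HALF_UP is an assumption, used here only for illustration).
lat_scale0 = lat.quantize(Decimal("1"), rounding=ROUND_HALF_UP)
lon_scale0 = lon.quantize(Decimal("1"), rounding=ROUND_HALF_UP)

# Scale 7: seven digits after the decimal point are kept.
lat_scale7 = lat.quantize(Decimal("0.0000001"))
lon_scale7 = lon.quantize(Decimal("0.0000001"))

print(lat_scale0, lon_scale0)   # 74 164
print(lat_scale7, lon_scale7)   # 73.9837235 163.8246950
```

In the schema, a non-zero scale such as `DecimalType(10, 7)` would presumably keep the fractional digits; this has not been run against the dataframe above.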

In [30]:
df.show(n=10)

+--------------------+----------+---+-----------+--------------------+----------------+
|                name| birthdate|sex|blood_group|                mail|current_location|
+--------------------+----------+---+-----------+--------------------+----------------+
|         Ilija Stroh|1986-02-06|  M|        AB-|emueller@googlema...|       [-5, 148]|
|     Philomena Hesse|2002-09-06|  F|         A+|naserhans-hermann...|        [82, 12]|
| Prof. Annelise Mude|1910-03-30|  F|         A-|ftschentscher@aol.de|       [29, 177]|
|       Branka Hamann|1970-06-08|  F|         A+|fheinz@googlemail...|      [78, -136]|
|       Lilli Lercher|1996-12-01|  F|         A-|arnoldlouise@kabs...|     [-15, -108]|
|  Hans-Karl Fröhlich|2000-12-23|  M|         A+|    wlosekann@aol.de|       [-32, 47]|
|Hanife Mitschke MBA.|1986-11-15|  F|        AB-|    hkramer@yahoo.de|      [64, -131]|
|Ing. Susi Weiß B....|1914-07-09|  F|         O-|        lwiek@aol.de|      [85, -113]|
|         Anika Knoll|2007-01-02

[Write to parquet](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=write#pyspark.sql.DataFrameWriter.parquet) file ([Parquet](http://parquet.apache.org/) is a compressed, efficient columnar data representation compatible with all frameworks in the Hadoop ecosystem):

In [31]:
df.write.mode("overwrite").parquet("fakedata.parquet")

Check the size of the Parquet file (it is actually a directory containing one file per partition):

In [32]:
!hdfs dfs -ls -h fakedata.parquet

Found 3 items
-rw-r--r--   3 datalab supergroup          0 2020-12-12 12:28 fakedata.parquet/_SUCCESS
-rw-r--r--   3 datalab supergroup     73.6 K 2020-12-12 12:28 fakedata.parquet/part-00000-81067930-f1a5-41c4-a699-1ca436547bf9-c000.snappy.parquet
-rw-r--r--   3 datalab supergroup    103.4 K 2020-12-12 12:28 fakedata.parquet/part-00001-81067930-f1a5-41c4-a699-1ca436547bf9-c000.snappy.parquet


Don't forget to close the Spark session when you're done!

In [33]:
spark.stop()