## Introduction

The goal of this notebook is to demonstrate how fake test data can be generated using Python. I am using Databricks Free Edition because anyone with a Google or Microsoft account can sign up easily. The notebook and the Python packages it uses are free and open source, so it can be run in any service that supports Python notebooks, or locally on your computer.

## Structure of this Tutorial
1. Short introduction to a fake story
1. Quick introduction to the Faker Python library
1. Installation of the Faker Python library
1. Quick test data generation
1. Customized test data generation
1. Try-it-yourself guide



## Fake Story
ODP is producing an animation titled "Liberator of Data". We would like to know more about our talent, for example the number of music composers, hidden talents, etc.

## Faker Python Package
Faker is a Python package that generates fake data.
Use cases: test data for a proof of concept, data anonymization.

[Faker documentation](https://faker.readthedocs.io)

Install the Faker Python package:
`%pip install faker`

Quick start:
```
from faker import Faker

# Seeding makes Faker generate the same values every time the code is run.
Faker.seed(0)
# The locale makes generated data, such as addresses, look Canadian.
fake = Faker("en_CA")
```


## Installation of Faker Python package
- Click the "Run cell" button in the top left corner of the cell below.

In [0]:
%pip install faker

Note: you may need to restart the kernel using %restart_python or dbutils.library.restartPython() to use updated packages.


## Variable
- `NUMBER_OF_ROWS` is the number of rows to generate; the default is 5.
  - Feel free to change it to the number of rows that you want.
- Click the "Run cell" button to set the variable.

In [0]:
NUMBER_OF_ROWS = 5

## Quick Test Data Generation
The Faker Python package provides a convenience method called profile() that generates one complete fake user profile. Calling it once per row produces a list of profiles.

spark.createDataFrame() loads the generated test data into an in-memory Spark DataFrame.
The data can be downloaded as CSV or Excel by clicking the button at the bottom of the result.

Alternatively, the data can be stored in Unity Catalog as a Delta table (not shown in this notebook).

Click the "Run cell" button in the top left corner of the cell below to generate the test data.
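Outside Databricks, where `spark` and the download button are not available, the same rows can be written to CSV with the standard library. The sketch below is only an illustration: it uses two rows taken from this section's sample output, trimmed to three columns, and the helper name `rows_to_csv` is made up for this example.

```python
import csv
import io

# Two rows taken from this section's sample output, trimmed to three columns.
rows = [
    {"name": "Robert Stewart", "job": "Musician", "mail": "shannon51@yahoo.com"},
    {"name": "Melissa Myers", "job": "Radiographer, therapeutic", "mail": "vharmon@hotmail.com"},
]


def rows_to_csv(rows):
    """Serialize a list of dicts to CSV text, using the first row's keys as the header."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = rows_to_csv(rows)
print(csv_text)
```

Values that contain a comma, such as the second job title, are quoted automatically by the `csv` module. To save a file, the text can be written out with `open(..., "w", newline="")`.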

In [0]:
from faker import Faker

Faker.seed(0)
fake = Faker("en_CA")

data = [fake.profile() for _ in range(NUMBER_OF_ROWS)]

df = spark.createDataFrame(data=data)
display(df)

address,birthdate,blood_group,company,current_location,job,mail,name,residence,sex,ssn,username,website
"471 Erika Curve North Megan, SK X4L4X7",1980-09-17,A+,Williams-Sheppard,"List(-64.962524500000000000, 34.116758000000000000)",Musician,shannon51@yahoo.com,Robert Stewart,"19489 Kyle Stream Apt. 578 West Ryanborough, ON T1X9L1",M,738 137 405,kellylopez,"List(https://www.nguyen.org/, http://www.pratt.com/, http://bolton.com/, http://harris.com/)"
"22584 Candice Mills North Thomas, NS P3R9L1",1914-12-06,O-,"Mccoy, Bruce and Sanchez","List(-40.346973500000000000, -113.484289000000000000)","Radiographer, therapeutic",vharmon@hotmail.com,Melissa Myers,"30989 Anthony Roads New Maria, BC C1P1E4",F,455 823 344,michael79,"List(http://www.coleman.com/, http://graham-brown.info/)"
"30089 James Rest Apt. 442 South Melissafurt, QC A2L 5T3",1914-07-15,A+,Allen Group,"List(51.619332500000000000, 81.598724000000000000)",Health visitor,darin24@gmail.com,Tonya Irwin,"109 Holly Estate Apt. 376 Johnsonshire, NS H5T 3C8",F,528 175 664,valeriemorales,List(https://monroe-williams.biz/)
"33060 Phillip Path Shanefort, YT S4C 5P3",2004-12-23,A+,"Green, Edwards and Richardson","List(38.329506500000000000, -51.308777000000000000)","Engineer, land",hlee@gmail.com,Michael Moore,"402 Joseph Junction Suite 159 North Josephberg, PE P3R5Y2",M,085 332 146,courtneybennett,"List(http://www.powell-murphy.biz/, http://www.kerr.com/, http://lyons.com/)"
"20407 Atkins Union Port Diana, NT X2X9B2",1971-09-27,B-,"Hayes, Rhodes and Wilson","List(55.709200000000000000, 147.368102000000000000)",Air broker,nbullock@yahoo.com,Lauren Robbins,"6582 Vanessa Oval New Richard, QC X8V8N4",F,621 704 287,fjones,List(https://johnson-hernandez.com/)


## Customized Test Data Generation
The cell below contains Python code that generates test data based on the following requirements:

| Seq | Columns | Meaning | Example |
| :------- | :------: | -------: | -------: |
| 1 | Member_ID | Unique identifier: the prefix MID followed by six digits from 111111 to 999999 | MID123456 |
| 2 | Role | 5% chance = Director <br> 15% chance = Music Composer <br> 20% chance = Background Artist <br> 30% chance = Character Design <br> 30% chance = Voice Acting | Voice Acting |
| 3 | User_Name | Random unique user name without spaces | dhzoey |
| 4 | First_Name | Random Canadian English first name | Celine |
| 5 | Last_Name | Random Canadian English last name | Myers |
| 6 | Birth_Date | Birth date that gives an age between 18 and 65 | 1982-07-02 |
| 7 | SIN | Random 9-digit number that mimics a Canadian SIN | 774 564 306 |
| 8 | Pronounce | Random selection of "He/Him", "She/Her", or "They/Them" | They/Them |
| 9 | Dream_Job | Random occupation name | Fashion designer |
| 10 | Start_Date | A random date between 10 years ago and yesterday | 2023-08-15 |
| 11 | Phone_Number | A random fake Canadian phone number | (492) 411-5781 x565 |
| 12 | Mailing_Address | A random fake Canadian address | 408 Christopher Ville Suite 097 Johnmouth, SK J3K 9K8 |
| 13 | Office_Address | Random street address, <br> Random selection of "Toronto", "Vaughan", or "Guelph", <br> Hardcoded ON as province, <br> Random postal code | 58714 Mann Plaza, Toronto ON Y2M 7T6 |
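To get a feel for what the Role percentages mean in practice, here is a small sketch that draws a large sample with the standard library's `random.choices`. It is an illustration only and does not use Faker; the sample size of 10,000 and the seed are arbitrary choices made for this example.

```python
import math
import random
from collections import OrderedDict

# Role weights from the requirements table (5%, 15%, 20%, 30%, 30%).
role_choices = OrderedDict([
    ("Director", 0.05),
    ("Music Composer", 0.15),
    ("Background Artist", 0.20),
    ("Character Design", 0.30),
    ("Voice Acting", 0.30),
])

# The weights should add up to 1.0, allowing for floating point rounding.
total_weight = sum(role_choices.values())
print("Weights sum to 1:", math.isclose(total_weight, 1.0))

# Draw a large sample so that the observed shares approach the weights.
rng = random.Random(0)  # seeded so that the sketch is repeatable
sample = rng.choices(list(role_choices), weights=list(role_choices.values()), k=10_000)
shares = {role: sample.count(role) / len(sample) for role in role_choices}
for role, share in shares.items():
    print(f"{role}: {share:.1%}")
```

With only a handful of rows, such as the default of 5, the observed mix can differ a lot from these percentages; the shares only settle near the weights as the number of rows grows.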

spark.createDataFrame() loads the generated test data into an in-memory Spark DataFrame.
The data can be downloaded as CSV or Excel by clicking the button at the bottom of the result.

Alternatively, the data can be stored in Unity Catalog as a Delta table (not shown in this notebook).

Click the "Run cell" button in the top left corner of the cell below to generate the test data.

In [0]:
from faker import Faker
from collections import OrderedDict

Faker.seed(0)
fake = Faker("en_CA")

schema=[
    "Member_ID",
    "Role",
    "User_Name",
    "First_Name",
    "Last_Name",
    "Birth_Date",
    "SIN",
    "Pronounce",
    "Dream_Job",
    "Start_Date",
    "Phone_Number",
    "Mailing_Address",
    "Office_Address",
]

# Prefix for every Member_ID.
prefix_mID = 'MID'
# Pronouns come from a plain list, so each one is equally likely.
pronoun_choices = ["He/Him", "She/Her", "They/Them"]
# Roles and offices map each choice to its selection weight.
role_choices = OrderedDict([
    ("Director", 0.05),
    ("Music Composer", 0.15),
    ("Background Artist", 0.20),
    ("Character Design", 0.30),
    ("Voice Acting", 0.30),
])
office_choices = OrderedDict([
    ("Toronto", 0.60),
    ("Vaughan", 0.30),
    ("Guelph", 0.10),
])

# Build one tuple per row, with values in the same order as the schema.
data = []
for _ in range(NUMBER_OF_ROWS):
    data.append(
        (
            # Member_ID: prefix plus a unique six-digit number.
            prefix_mID + str(fake.unique.random_int(min=111111, max=999999)),
            fake.random_element(role_choices),
            fake.user_name(),
            fake.first_name(),
            fake.last_name(),
            fake.date_of_birth(minimum_age=18, maximum_age=65),
            fake.ssn(),
            fake.random_element(pronoun_choices),
            fake.job(),
            # Start_Date: between 10 years ago and yesterday.
            fake.date_between(start_date='-10y', end_date='-1d'),
            fake.phone_number(),
            fake.address(),
            # Office_Address: random street, weighted city, fixed province ON.
            "{}, {} ON {}".format(
                fake.street_address(),
                fake.random_element(office_choices),
                fake.postcode(),
            ),
        )
    )

df = spark.createDataFrame(schema=schema, data=data)
display(df)

Member_ID,Role,User_Name,First_Name,Last_Name,Birth_Date,SIN,Pronounce,Dream_Job,Start_Date,Phone_Number,Mailing_Address,Office_Address
MID996551,Background Artist,gwilliams,Todd,Hull,1999-06-17,774 564 306,He/Him,Water engineer,2023-08-15,(492) 411-5781 x565,"408 Christopher Ville Suite 097 Johnmouth, SK J3K 9K8","58714 Mann Plaza, Toronto ON Y2M 7T6"
MID714724,Background Artist,jaimelopez,Chloe,Douglas,1982-07-02,118 056 738,They/Them,Fashion designer,2024-02-14,+1 (894) 775-1591,"04135 Marvin Via North Kristabury, AB A2E 7J2","09032 Timothy Stream Apt. 086, Vaughan ON C5K6C7"
MID300188,Music Composer,gwilliams,Holly,Myers,1972-02-11,715 310 868,He/Him,"Lighting technician, broadcasting/film/video",2020-12-02,697.207.6984,"80715 Amy Dale Apt. 759 Emilyshire, QC M2R 1S4","33769 Johnson Well Suite 027, Toronto ON T9X 1V1"
MID629718,Background Artist,zwilliams,Robert,Dunn,2005-11-07,110 658 622,He/Him,"Accountant, chartered",2025-06-05,(291) 319-3442 x176,"1428 Wilson Drives Suite 000 Lake Jordan, NT B7V6T9","69402 Joseph Junction, Vaughan ON N5B 3B3"
MID722903,Background Artist,ihays,Kelly,Woods,1971-02-18,021 881 669,She/Her,Patent examiner,2016-09-17,(377) 551-7176 x045,"111 Kara Circle Suite 016 Shanefort, YT S4C 5P3","51108 Goodwin Flats Apt. 764, Toronto ON J4G 8E7"
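Generated rows can be checked against the requirements table with a few lines of plain Python. The sketch below is one possible approach, not part of the original notebook: the helper names (`age_on`, `check_row`, `find_duplicates`) and the reference date are assumptions made for illustration, and only three of the thirteen columns plus user name uniqueness are checked.

```python
import re
from collections import Counter
from datetime import date

# Patterns derived from the requirements table.
MEMBER_ID_PATTERN = re.compile(r"MID(\d{6})")
SIN_PATTERN = re.compile(r"\d{3} \d{3} \d{3}")


def age_on(birth_date, today):
    """Return the age in whole years on the given day."""
    had_birthday = (today.month, today.day) >= (birth_date.month, birth_date.day)
    return today.year - birth_date.year - (0 if had_birthday else 1)


def check_row(member_id, birth_date, sin, today):
    """Return the names of the columns that break a requirement."""
    problems = []
    match = MEMBER_ID_PATTERN.fullmatch(member_id)
    if not match or not 111111 <= int(match.group(1)) <= 999999:
        problems.append("Member_ID")
    if not 18 <= age_on(birth_date, today) <= 65:
        problems.append("Birth_Date")
    if not SIN_PATTERN.fullmatch(sin):
        problems.append("SIN")
    return problems


def find_duplicates(values):
    """Return the values that occur more than once, sorted."""
    return sorted(value for value, count in Counter(values).items() if count > 1)


# Values copied from the output above; the reference date is an assumption.
reference_date = date(2025, 7, 1)
first_row_problems = check_row("MID996551", date(1999, 6, 17), "774 564 306", reference_date)
duplicate_user_names = find_duplicates(
    ["gwilliams", "jaimelopez", "gwilliams", "zwilliams", "ihays"]
)
print(first_row_problems)    # → []
print(duplicate_user_names)  # → ['gwilliams']
```

The duplicate check shows that the sample output contains the user name gwilliams twice, although the requirements ask for unique user names. One possible fix, not applied in the cell above, is to call `fake.unique.user_name()`, which uses the same `unique` proxy that the cell already applies to Member_ID. Doing so would change the generated values.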
