<a href="https://www.kaggle.com/code/nickkrikota/creating-a-synthetic-dataset-in-faker?scriptVersionId=160172624" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Introduction

This tutorial walks through, step by step, how this synthetic dataset was created with Faker. On Kaggle, Faker must be installed with the !pip command before it can be imported.

In [1]:
# Install Faker if not installed yet

!pip install Faker

Collecting Faker
  Obtaining dependency information for Faker from https://files.pythonhosted.org/packages/d8/36/47df38210deb1f076a3fc4786ccb3951f920fcecae8ba20c6f57bc2ddc29/Faker-22.5.0-py3-none-any.whl.metadata
  Downloading Faker-22.5.0-py3-none-any.whl.metadata (15 kB)
Downloading Faker-22.5.0-py3-none-any.whl (1.7 MB)
Installing collected packages: Faker
Successfully installed Faker-22.5.0


In [2]:
# Import Libraries

from faker import Faker
import random
import pandas as pd
fake = Faker()

print('Imported Successfully')

Imported Successfully


# Creating the Dataset

Let's start by randomly generating the number of customers and the number of sales representatives. It is also a good idea to set a fixed seed so that the generated data stays the same on every run.

In [3]:
# Generate number of customers, sales reps

Faker.seed(846518)
random.seed(846518)
customer_number = random.randint(700, 800) # random number from 700 to 800
salesrep_number = random.randint(20, 30) # random number from 20 to 30

print('Number of customers:', customer_number)
print('Number of sales representatives:', salesrep_number)

Number of customers: 773
Number of sales representatives: 30

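To see why a fixed seed matters, here is a minimal sketch that uses only Python's random module. The helper name draw_counts is hypothetical and not part of the notebook; it simply repeats the two randint calls from the cell above after re-seeding. Faker.seed plays the same role for the values that Faker itself produces.

```python
import random

def draw_counts(seed):
    """Return (customers, sales reps) drawn right after seeding the generator."""
    random.seed(seed)  # fixing the seed fixes every draw that follows
    customers = random.randint(700, 800)
    salesreps = random.randint(20, 30)
    return customers, salesreps

first = draw_counts(846518)
second = draw_counts(846518)  # same seed, so the same two numbers come back

print(first == second)  # True
```

Without the seed call, each run would most likely produce a different pair of counts.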

Most of the columns are created using Faker. The exceptions are the phone number, the email and the subscription.

For the phone number, I found it easier to write a function that always generates the number in the same format. For the email, it is easier to append a fictional email domain to the customer's name, because generating it in Faker would create an email address based on a different name.

Phone numbers, cities and zip codes are generated independently, so they are not consistent with the states, and the states are chosen at random regardless of population.
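
The phone_number helper in this tutorial draws all ten digits freely, so an area code can start with 0 or 1, which does not happen in real US numbers. If that matters for your use case, one possible variation is sketched below. The function name nanp_phone_number is hypothetical, and the sketch only enforces the rule that the area code and the exchange start with a digit from 2 to 9; it does not check other reserved codes.

```python
import random

def nanp_phone_number():
    """Return a '(NXX) NXX-XXXX' string where N is 2-9 and X is 0-9."""
    area_code = random.choice('23456789') + ''.join(random.choices('0123456789', k=2))
    exchange = random.choice('23456789') + ''.join(random.choices('0123456789', k=2))
    line = ''.join(random.choices('0123456789', k=4))
    return '({}) {}-{}'.format(area_code, exchange, line)

random.seed(846518)
sample = [nanp_phone_number() for _ in range(5)]
print(sample)
```

Swapping this in changes the sequence of random draws, so the generated table would no longer match the output shown in this tutorial.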

In [4]:
# Create the customer dataframe

Faker.seed(846518)
random.seed(846518)

customer_names = [fake.name() for _ in range(customer_number)] # A name for each customer
salesrep_names = [fake.name() for _ in range(salesrep_number)] # A name for each sales rep

def phone_number():
    area_code = ''.join(random.choices('0123456789', k=3)) # Random 3 digits
    three = ''.join(random.choices('0123456789', k=3)) # Random 3 digits
    four = ''.join(random.choices('0123456789', k=4)) # Random 4 digits

    # Formatted to match the US phone number layout: (XXX) XXX-XXXX
    return '({}) {}-{}'.format(area_code, three, four)

emails = [name.replace(' ', '').lower() + '@samplemail.com' for name in customer_names] # lowercase name without spaces plus a fictional domain, the way many people create email addresses

subscriptions = ['Standard', 'Plus', 'Premium'] # three fictional subscription plans

customerdata = {
    'Customer ID': range(1, customer_number + 1), # customer ID starting at 1, one per customer
    'Name': customer_names, # names from earlier
    'Email': emails, # emails from earlier
    'Phone Number': [phone_number() for customer in range(customer_number)], # phone number for each customer
    'Street Address': [fake.street_address() for customer in range(customer_number)], # street address for each customer
    'City': [fake.city() for customer in range(customer_number)], # fictional city
    'State': [fake.state() for customer in range(customer_number)], # one of the U.S. states
    'Zip Code': [fake.zipcode() for customer in range(customer_number)], # fictional zipcode
    'Sales Rep': [random.choice(salesrep_names) for customer in range(customer_number)], # names of the sales reps
    'Subscription': [random.choice(subscriptions) for customer in range(customer_number)] # randomly assigned subscription type
    
}

customers = pd.DataFrame(customerdata).set_index('Customer ID')
customers

Customer ID,Name,Email,Phone Number,Street Address,City,State,Zip Code,Sales Rep,Subscription
1,Kathryn Williams,kathrynwilliams@samplemail.com,(550) 269-8345,6122 Debra Court,Stewartville,Hawaii,55169,Patricia Escobar,Standard
2,Maxwell Meza,maxwellmeza@samplemail.com,(712) 706-8059,740 Bean Station,Lake Aprilton,Maine,09073,Thomas Murray,Standard
3,Jamie Crawford,jamiecrawford@samplemail.com,(278) 738-1122,65114 Tracy Track Suite 604,South Sarahbury,North Dakota,51657,Amber Taylor,Plus
4,Raven Hernandez,ravenhernandez@samplemail.com,(376) 539-3142,45083 Cunningham Drive,Jasonstad,Hawaii,13042,Ashley Collins,Standard
5,Robert Brown,robertbrown@samplemail.com,(368) 263-0915,3965 Jay Ford,New Ann,New Hampshire,50080,Samantha Horton,Plus
...,...,...,...,...,...,...,...,...,...
769,Nancy Cole,nancycole@samplemail.com,(134) 653-8529,735 Sutton Square,Lake Meganville,Pennsylvania,47331,Ashley Collins,Plus
770,Samuel Barrett,samuelbarrett@samplemail.com,(289) 571-9595,53731 Fitzgerald Keys,South David,North Carolina,22817,Bradley Todd,Standard
771,Alex Hayes,alexhayes@samplemail.com,(123) 679-2096,313 Deborah Prairie,Port Joy,California,66171,Samantha Graham,Standard
772,Roberto Kennedy,robertokennedy@samplemail.com,(852) 816-8525,48959 Kim Field,South Emily,Texas,94465,Ashley Collins,Standard

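Because each name is drawn independently, two customers can end up with the same name and therefore the same email address. If unique emails are needed, one option is to add a counter to repeated names. The sketch below is an assumption about how this could be done, not part of the original notebook, and unique_emails is a hypothetical helper name.

```python
from collections import Counter

def unique_emails(names, domain='samplemail.com'):
    """Build one email per name, adding 2, 3, ... to names seen before."""
    seen = Counter()
    emails = []
    for name in names:
        local = name.replace(' ', '').lower()  # same rule as in the tutorial
        seen[local] += 1
        suffix = '' if seen[local] == 1 else str(seen[local])
        emails.append('{}{}@{}'.format(local, suffix, domain))
    return emails

print(unique_emails(['Alex Hayes', 'Nancy Cole', 'Alex Hayes']))
# ['alexhayes@samplemail.com', 'nancycole@samplemail.com', 'alexhayes2@samplemail.com']
```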

In [5]:
# Create sales rep dataframe

salesrep_emails = [name.replace(' ', '').lower() + '@company.com' for name in salesrep_names] # same rule as the customer emails, but at the company domain

salesrepdata = {
    'Employee ID': range(1, salesrep_number + 1), # employee ID starting with 1
    'Name': salesrep_names,
    'Email': salesrep_emails,
}

salesreps = pd.DataFrame(salesrepdata).set_index('Employee ID')
salesreps

Employee ID,Name,Email
1,Ricky Baker,rickybaker@company.com
2,Ashley Nguyen,ashleynguyen@company.com
3,Danielle Summers,daniellesummers@company.com
4,Samantha Horton,samanthahorton@company.com
5,Paige Jones,paigejones@company.com
6,Tara Peck,tarapeck@company.com
7,Jennifer Sanchez,jennifersanchez@company.com
8,Mark Lindsey,marklindsey@company.com
9,Taylor Mason,taylormason@company.com
10,Karen Beasley,karenbeasley@company.com

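The customer table stores each sales rep by name. If you would rather link the two tables by Employee ID, a name-to-ID lookup is one way to do it. The sketch below uses tiny hand-written toy frames (the row assignments are made up for illustration) and a hypothetical column name, Sales Rep ID; it assumes that sales rep names are unique.

```python
import pandas as pd

# Toy stand-ins for the salesreps and customers dataframes
salesreps = pd.DataFrame(
    {'Employee ID': [1, 2], 'Name': ['Ricky Baker', 'Ashley Nguyen']}
).set_index('Employee ID')

customers = pd.DataFrame(
    {'Customer ID': [1, 2, 3],
     'Sales Rep': ['Ashley Nguyen', 'Ricky Baker', 'Ashley Nguyen']}
).set_index('Customer ID')

# Build a name -> Employee ID lookup from the sales rep table
name_to_id = dict(zip(salesreps['Name'], salesreps.index))

# Translate each sales rep name into the matching Employee ID
customers['Sales Rep ID'] = customers['Sales Rep'].map(name_to_id)
print(customers)
```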

In [6]:
# Create subscription plan dataframe

subscriptionsdata = {
    'Subscription ID': range(1, 4), # Subscription ID 1 - 3
    'Subscription': subscriptions, # Names of the subscription service
    'Cost': ['$99', '$199', '$499'], # Cost of each subscription
}

subscriptionsdf = pd.DataFrame(subscriptionsdata).set_index('Subscription ID')
subscriptionsdf

Subscription ID,Subscription,Cost
1,Standard,$99
2,Plus,$199
3,Premium,$499

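The Cost column holds strings such as '$99', so it cannot be summed directly. As a rough sketch of how the subscription table could be used, the example below strips the dollar sign into a numeric column and joins it to a few customers. The column name Cost (USD) and the three-row customers frame are assumptions made for illustration.

```python
import pandas as pd

subscriptionsdf = pd.DataFrame({
    'Subscription ID': [1, 2, 3],
    'Subscription': ['Standard', 'Plus', 'Premium'],
    'Cost': ['$99', '$199', '$499'],
}).set_index('Subscription ID')

# Numeric copy of the cost so it can be used in calculations
subscriptionsdf['Cost (USD)'] = subscriptionsdf['Cost'].str.lstrip('$').astype(int)

# Toy customers frame with only the column needed for the join
customers = pd.DataFrame({
    'Customer ID': [1, 2, 3],
    'Subscription': ['Standard', 'Standard', 'Plus'],
}).set_index('Customer ID')

# Look up the plan cost for every customer by subscription name
merged = (
    customers.reset_index()
    .merge(subscriptionsdf, on='Subscription', how='left')
    .set_index('Customer ID')
)

total = merged['Cost (USD)'].sum()
print(total)  # 397
```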

In [7]:
# Save the dataframes (uncomment the lines below to write the CSV files)

# customers.to_csv('customers.csv')
# salesreps.to_csv('salesreps.csv')
# subscriptionsdf.to_csv('subscriptions.csv')
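
One thing to watch when loading the saved files again: zip codes such as 09073 are stored as text, and pandas will read them back as integers by default, which drops the leading zero. The sketch below shows the effect with a single row and an in-memory buffer instead of a real file; passing dtype for the Zip Code column is one way to keep the values as strings.

```python
import io
import pandas as pd

customers = pd.DataFrame(
    {'Customer ID': [2], 'Name': ['Maxwell Meza'], 'Zip Code': ['09073']}
).set_index('Customer ID')

# Write to an in-memory buffer instead of customers.csv
buffer = io.StringIO()
customers.to_csv(buffer)

# Default parsing turns the zip code into a number
buffer.seek(0)
naive = pd.read_csv(buffer, index_col='Customer ID')

# Asking for strings keeps the leading zero
buffer.seek(0)
typed = pd.read_csv(buffer, index_col='Customer ID', dtype={'Zip Code': str})

print(naive['Zip Code'].iloc[0])  # 9073
print(typed['Zip Code'].iloc[0])  # 09073
```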

# Conclusion

I hope this tutorial was helpful and makes the process of creating a synthetic dataset much easier.