# Code for Generating Synthetic Data

### Regis University MSDS_631_SQL_NoSQL
#### Instructor: Dr. Bush
| Version | Date | Author | Notes |
|:-------:|:----:|:-------|:-----:|
| 0.1 | 29 March 2023 | Ken Dizon | Initial Version |

**Objective**

Write a program that creates large files of records in a format acceptable to the load command, then load the data into your PDA relations.


### Task Project 3
- Note 1: This assignment is a slight modification by Dr. Scott Leutenegger (University of Denver) of material developed by the Stanford Database Group.
- Note 2: Remember to back up your work!

In this phase of your PDA you will make sure your relations have keys and foreign keys, create a substantial amount of data for your database, and load your database with this data using the MySQL "load" command.

- First, make sure that your relations have keys and that the relations created from ER model relationships use foreign keys to specify the keys of the relations you are referencing.
- To create the data, either
    * write a program in any programming language you like that creates large files of records in a format acceptable to the load command, then load the data into your PDA relations.
    * OR -- if using real data for your PDA, your program will need to transform the data into files of records conforming to your PDA schema.
    * OR -- use an application such as Mockaroo to fabricate data. Generate either random or nonrandom (e.g., sequential) records conforming to your schema. It is both fine and expected for your data values—strings especially—to be meaningless gibberish. The point of generating large amounts of data is so that you can experiment with a database of realistic size. 
    
The data you generate and load should be on the order of:
- At least one table (relation) with tens of thousands of rows (tuples)
- At least two additional tables (relations) with thousands of rows (tuples)

When writing a program to fabricate data, keep in mind the following two points:
1. Make sure you do not generate duplicate values for attributes that serve as keys. 
2. Your PDA almost certainly includes relations that are expected to join with each other. For example, you may have a STUDENT relation with attribute courseNo that's expected to join with attribute number in relation COURSES. In generating data, be sure to generate values that actually do join; otherwise, all of your interesting queries will have empty results! One way to guarantee join-ability is to generate the values in one relation, then use those values to select joining values for the other relation. For example, you could generate course numbers first (either sequentially or randomly), then use these numbers to fill in the courseNo values in the STUDENT relation.
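
The two points above can be sketched as follows. This is a minimal illustration built on the STUDENT/COURSES example from the text; the row counts, the number range, and the variable names are assumptions made for the example, not part of the assignment.

```python
import random

# Hypothetical sizes, chosen only for illustration
NUM_COURSES = 50
NUM_STUDENTS = 1000

# Point 1: draw key values without replacement so that no key repeats
course_numbers = random.sample(range(1000, 10000), NUM_COURSES)

# Sequential IDs are another simple way to keep keys unique
student_ids = list(range(1, NUM_STUDENTS + 1))

# Point 2: pick each courseNo from the course numbers already generated,
# so every STUDENT row joins with some COURSES row
students = [(sid, random.choice(course_numbers)) for sid in student_ids]
```

The same pattern applies to any parent/child pair: generate the parent keys first, then sample the child's foreign key column from that list.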


- **Content**
    1. Libraries
    2. Manual Data
    3. Faker Library

In [1]:
!pip install Faker



In [2]:
# Libraries 
import random
import string
import datetime
from faker import Faker
fake = Faker()

## 2. Manual Data
Create data with the proper data types and syntax, using a library to ease the generation process.

## 3. Faker Library
Docs: https://faker.readthedocs.io/en/master/

- Step 1: Create Clinics
- Step 2: Create Products
- Step 3: Generate Customers
- Step 4: Create Order Details
- Step 5: Generate Orders

**1) Create Clinics**

```SQL 
  `Clinic_id` INT NOT NULL,
  `Name` VARCHAR(45) NULL,
  `Address` VARCHAR(45) NULL,
  `City` VARCHAR(45) NULL,
  `State` VARCHAR(45) NULL,
  `Zip` INT NULL,
  `Phone` VARCHAR(45) NULL,
  PRIMARY KEY (`Clinic_id`))
  ```

In [3]:
# Define the number of rows (Clinics) of data to generate
num_rows = 30

# Create a list of tuples containing the generated data
Clinic_data = []
for i in range(num_rows):
    clinic_id = i + 1
    name = fake.company()
    address = fake.street_address()
    city = fake.city()
    state = fake.state()
    zipcode = fake.zipcode()
    phone = fake.phone_number()
    Clinic_data.append((clinic_id, name, address, city, state, zipcode, phone))

In [4]:
# Write the data to a text file
with open('clinic_data.txt', 'w') as f:
    for row in Clinic_data:
        f.write(';'.join(map(str, row)) + '\n')
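
The file written above separates fields with `;` and ends each record with a newline, so the MySQL load statement has to name the same separators. The sketch below only builds the statement text. The helper name `build_load_statement` is hypothetical, the table name `Clinic` is taken from the schema shown later in this notebook, and `LOCAL` assumes that both the client and the server allow local file loading.

```python
def build_load_statement(path, table, delimiter=';'):
    """Return a MySQL LOAD DATA statement for a delimited text file.

    Hypothetical helper for illustration; paste the returned text
    into a MySQL client to run it.
    """
    return (
        f"LOAD DATA LOCAL INFILE '{path}'\n"
        f"INTO TABLE {table}\n"
        f"FIELDS TERMINATED BY '{delimiter}'\n"
        "LINES TERMINATED BY '\\n';"
    )

print(build_load_statement('clinic_data.txt', 'Clinic'))
```

If the file is written on Windows in text mode, lines end with `\r\n`, and the `LINES TERMINATED BY` clause would need to match.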

In [5]:
Clinic_data[0:3]

[(1,
  'Garcia, Harrell and Phillips',
  '3197 Koch Tunnel Apt. 731',
  'New Mikeview',
  'Nevada',
  '54039',
  '520-541-1189x358'),
 (2,
  'Moore LLC',
  '1582 Noah Orchard',
  'Brianland',
  'Hawaii',
  '93431',
  '108-327-9309'),
 (3,
  'Kim, Taylor and Carter',
  '099 Jonathan Wells',
  'North Emilyberg',
  'West Virginia',
  '43320',
  '074-315-0439x527')]

**2) Create products**

``` SQL
 `Product_id` INT NOT NULL,
  `Name` VARCHAR(45) NULL,
  `Description` VARCHAR(45) NULL,
  `Price` DECIMAL NULL,
 ```

In [6]:
# List of 30 sports medicine products, each written as 'Name: description'

products = ['Knee braces: designed to support and protect the knee joint during physical activity or recovery from injury.',
             'Ankle braces: designed to provide stability and support for the ankle joint during physical activity or recovery from injury.',
             'Compression garments: designed to improve blood flow, reduce swelling, and support muscles during physical activity or recovery from injury.',
             'Ice packs: designed to reduce pain and swelling by applying cold therapy to the affected area.',
             'Heating pads: designed to improve blood flow, reduce muscle tension, and promote healing by applying heat therapy to the affected area.',
             'Massage balls: designed to relieve muscle tension and improve flexibility by applying pressure to trigger points in the muscles.',
             'Foam rollers: designed to relieve muscle tension and improve flexibility by rolling over the muscles.',
             'Resistance bands: designed to improve strength and flexibility by providing resistance during exercise.',
             'Balance boards: designed to improve balance and stability by challenging the body to maintain equilibrium on an unstable surface.',
             'Sports tape: designed to provide support and stability for joints and muscles during physical activity or recovery from injury.',
             'Cold sprays: designed to provide instant cooling relief for muscle or joint pain.',
             'Anti-inflammatory creams: designed to reduce pain and swelling by applying a topical cream containing anti-inflammatory ingredients.',
             'Electrolyte replacement drinks: designed to replace fluids and essential minerals lost through sweating during physical activity.',
             'Protein supplements: designed to aid in muscle recovery and growth by providing additional protein to the body.',
             'Energy gels: designed to provide a quick source of energy during physical activity.',
             'Electrotherapy devices: designed to provide pain relief and promote healing through the use of electrical impulses.',
             'Insoles: designed to provide additional support and cushioning for the feet during physical activity or recovery from injury.',
             'Athletic tape: designed to prevent or treat injuries, provide support to muscles and joints, or secure bandages and splints.',
             'Wrist supports: designed to protect and support the wrist during physical activity or recovery from injury.',
             'Finger splints: designed to protect and immobilize the finger during physical activity or recovery from injury.',
             'Mouthguards: designed to protect the teeth and jaw from impact during contact sports.',
             'Shin guards: designed to protect the shins from impact during contact sports.',
             'Sports goggles: designed to protect the eyes during physical activity or contact sports.',
             'Sports-specific shoes: designed with features specific to a particular sport, such as cleats for soccer or basketball shoes with extra cushioning.',
             'Athletic socks: designed with extra cushioning and support for the feet during physical activity.',
             'Blister pads: designed to protect the skin from friction and prevent blisters during physical activity.',
             'Hydration packs: designed to provide hands-free hydration during outdoor activities.',
             'GPS watches: designed to track distance, pace, and other metrics during physical activity.',
             'Heart rate monitors: designed to track heart rate and other metrics during physical activity.',
             'Recovery tools: designed to aid in muscle recovery, such as massage guns, percussive therapy devices, or pneumatic compression boots.']

In [7]:
random.choice(products)

'Protein supplements: designed to aid in muscle recovery and growth by providing additional protein to the body.'

In [8]:
# Initialize empty lists for products 
prod_list = []

# Loop through each product and generate a unique ID and a random price
for i, product in enumerate(products):
    product_id = i + 1  # Use index as ID (1 to 30)
    Pname, Pdesc = product.split(':', 1)  # Split name and description at the first colon
    price = round(random.uniform(1.0, 100.0), 2)  # Generate random price
    # Add product data to output list
    prod_list.append((product_id, Pname.strip(), Pdesc.strip(), price))

In [9]:
# Write the output data to a text file
with open('product_data.txt', 'w') as f:
    for row in prod_list:
        f.write(';'.join(str(item) for item in row) + '\n')

In [10]:
prod_list[:3]

[(1,
  'Knee braces',
  'designed to support and protect the knee joint during physical activity or recovery from injury.',
  22.02),
 (2,
  'Ankle braces',
  'designed to provide stability and support for the ankle joint during physical activity or recovery from injury.',
  97.14),
 (3,
  'Compression garments',
  'designed to improve blood flow, reduce swelling, and support muscles during physical activity or recovery from injury.',
  67.38)]

**3) Generate Customers**
``` SQL
  `Customer_id` INT NOT NULL,
  `Start_date` VARCHAR(45) NULL,
  `First_name` VARCHAR(45) NULL,
  `Last_name` VARCHAR(45) NULL,
  `DOB` DATE NULL,
  `Phone` VARCHAR(45) NULL,
  `Email` VARCHAR(45) NULL,
  `Clinic_id` INT NULL,
  PRIMARY KEY (`Customer_id`),
  INDEX `Clinic_id_idx` (`Clinic_id` ASC) VISIBLE,
  CONSTRAINT `Clinic_id`
  FOREIGN KEY (`Clinic_id`)
  REFERENCES `mydb`.`Clinic` (`Clinic_id`)
  ```

In [11]:
# Define the number of rows (Customers) of data to generate
num_rows = 3000

# Date range for customer start dates
range_start = datetime.date(2012, 1, 1)
range_end = datetime.date(2020, 12, 31)

# Create a list of tuples containing the generated data
Customer_data = []
for i in range(num_rows):
    customer_id = i + 1
    # Keep the range bounds in their own variables: reusing `start_date`
    # for both the bound and the result would shrink the range on every iteration
    start_date = fake.date_between(start_date=range_start, end_date=range_end)
    first_name = fake.first_name()
    last_name = fake.last_name()
    dob = fake.date_of_birth()
    phone = fake.phone_number()
    email = fake.email()
    clinic_id = random.randint(1, 30) # Randomly assign a clinic ID between 1 and 30
    Customer_data.append((customer_id, start_date, first_name, last_name, dob, phone, email, clinic_id))

In [12]:
# Write the data to a text file
with open('customer_data.txt', 'w') as f:
    for row in Customer_data:
        f.write(';'.join(map(str, row)) + '\n')

In [13]:
Customer_data[:3]

[(1,
  datetime.date(2018, 9, 1),
  'David',
  'Harding',
  datetime.date(1941, 12, 30),
  '620.445.2906',
  'tsmith@example.org',
  11),
 (2,
  datetime.date(2020, 11, 26),
  'Sara',
  'Galvan',
  datetime.date(2005, 1, 11),
  '007-814-7795x152',
  'hoffmanmindy@example.net',
  27),
 (3,
  datetime.date(2020, 12, 3),
  'Joshua',
  'Li',
  datetime.date(1956, 3, 16),
  '+1-073-664-6793x950',
  'john59@example.com',
  5)]

**Note**
We now have:
- 30 clinics
- 30 products
- 3000 customers
    * We want to generate 5000 orders that tie these customers and products together

**4) Create Order Details**
``` SQL
  `OrderDetials_id` INT NOT NULL,
  `Product_id` INT NULL,
  `Order_id` INT NULL,
  `Quantity` INT NULL,
  ```

In [14]:
# Generate 5000 order-detail rows linking products to orders
OrderDets = []
for i in range(5000):
    OD_id = i + 1
    product_id = random.randint(1, 30)  # One of the 30 products
    order_id = random.randint(1, 5000)  # One of the 5000 orders
    quantity = random.randint(1, 10)
    OrderDets.append((OD_id, product_id, order_id, quantity))

In [15]:
# Save data to a text file
with open('orderdetails_data.txt', 'w') as f:
    for row in OrderDets:
        f.write(';'.join(str(col) for col in row) + '\n')

In [16]:
OrderDets[:3]

[(1, 16, 403, 7), (2, 27, 1169, 1), (3, 21, 1935, 2)]
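
Because each `order_id` above is drawn independently, some of the 5000 orders will typically receive no detail rows while others receive several. If every order should have at least one detail row, one option is sketched below. It is an alternative to the cell above, not the approach used in this notebook; the function name and the `extra_rows` parameter are assumptions for illustration.

```python
import random

def make_order_details(num_orders, num_products, extra_rows=0):
    """Build (detail_id, product_id, order_id, quantity) tuples so that
    every order_id in 1..num_orders appears at least once."""
    # One guaranteed row per order, plus optional extra rows on random orders
    order_ids = list(range(1, num_orders + 1))
    order_ids += [random.randint(1, num_orders) for _ in range(extra_rows)]
    random.shuffle(order_ids)  # So detail IDs do not simply track order IDs

    return [
        (detail_id, random.randint(1, num_products), order_id, random.randint(1, 10))
        for detail_id, order_id in enumerate(order_ids, start=1)
    ]

details = make_order_details(num_orders=5000, num_products=30, extra_rows=2000)
```

The detail IDs stay sequential and unique, and the extra rows give some orders more than one line item.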

**5) Generate Orders**

``` SQL
`Order_id` INT NOT NULL,
  `Order_date` DATE NULL,
  `Customer_id` INT NULL,
  ```

In [18]:
# Generate 5000 orders, each placed by one of the 3000 customers
Orders = []

for i in range(5000):
    order_id = i + 1
    date = fake.date_between(start_date='-2y', end_date='today')  # Within the last two years
    customer_id = random.randint(1, 3000)
    Orders.append((order_id, date, customer_id))

In [19]:
# Save data to a text file
with open('orders_data.txt', 'w') as f:
    for row in Orders:
        f.write(';'.join(str(col) for col in row) + '\n')

In [20]:
Orders[:3]

[(1, datetime.date(2022, 12, 18), 815),
 (2, datetime.date(2022, 9, 27), 2230),
 (3, datetime.date(2022, 1, 8), 717)]
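
Before loading the files, it can help to confirm that every foreign key value points at an existing key, since unmatched values would make joins come back empty. The check below is a minimal sketch; `find_orphans` is a hypothetical helper, and the small hand-made rows only stand in for the lists generated in this notebook.

```python
def find_orphans(child_rows, fk_index, parent_rows, pk_index=0):
    """Return child rows whose foreign key has no matching parent key."""
    parent_keys = {row[pk_index] for row in parent_rows}
    return [row for row in child_rows if row[fk_index] not in parent_keys]

# Tiny hand-made rows standing in for Orders and OrderDets
sample_orders = [(1, '2022-01-08', 717), (2, '2022-09-27', 2230)]
sample_details = [(1, 16, 1, 7), (2, 27, 2, 1), (3, 21, 9, 2)]

orphans = find_orphans(sample_details, fk_index=2, parent_rows=sample_orders)
print(orphans)  # [(3, 21, 9, 2)]
```

With the lists built above, `find_orphans(OrderDets, 2, Orders)`, `find_orphans(Orders, 2, Customer_data)`, and `find_orphans(Customer_data, 7, Clinic_data)` should each return an empty list.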