# Synthetic Data Generation
## Description

This notebook demonstrates how to use the nemotron-4-340b-instruct model to generate the synthetic data used in this blueprint.

It uses the NVIDIA Gear Store data as the source of product data.

It first creates a sample set of customers and then generates a realistic order history for them based on that product data.

## Usage Instructions
1. Install the required libraries using pip.
2. Run each cell sequentially to execute the notebook.

### Install requirements

In [None]:
## Install requirements
!pip install pandas
!pip install --upgrade --quiet langchain-nvidia-ai-endpoints
!pip install langchain

### Set the NVIDIA API Key 

The nemotron-4-340b-instruct model is accessed from the NVIDIA API Catalog. In order to access it, you need to set a valid API Key.

In [None]:
import getpass
import os
if not os.environ.get("NVIDIA_API_KEY", "").startswith("nvapi-"):
    nvidia_api_key = getpass.getpass("Enter your NVIDIA API key: ")
    assert nvidia_api_key.startswith("nvapi-"), f"{nvidia_api_key[:5]}... is not a valid key"
    os.environ["NVIDIA_API_KEY"] = nvidia_api_key

## Re-usable functions

This function generates the data. It sends a prompt to the model and returns the response parsed from JSON into a Python object.

In [None]:
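def get_do_task(llm, template, input_variables=None, input_values=None):
    # Imported here so the function does not depend on cell execution order
    from langchain_core.output_parsers import JsonOutputParser
    from langchain_core.prompts import PromptTemplate
    parser = JsonOutputParser()
    # Use None defaults to avoid sharing mutable default arguments between calls
    task_template = PromptTemplate(
        input_variables=input_variables or [],
        template=template
    )
    # Chain: prompt template -> LLM -> JSON parser
    synthetic_data_chain = task_template | llm | parser
    response = synthetic_data_chain.invoke(input_values or {})
    return response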
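
Model responses sometimes wrap the JSON in extra text even when the prompt asks for none. The following is a minimal sketch of a defensive fallback using only the standard library; `parse_json_array` is a hypothetical helper introduced for illustration and is not how `JsonOutputParser` works internally.

```python
import json

def parse_json_array(raw_text):
    """Hypothetical helper: extract and parse the outermost JSON array in raw model output."""
    start = raw_text.find("[")
    end = raw_text.rfind("]")
    if start == -1 or end < start:
        raise ValueError("no JSON array found in model output")
    # Everything outside the outermost brackets is treated as chatter and ignored
    return json.loads(raw_text[start:end + 1])

sample = 'Here is the data:\n[{"FNAME": "Ada", "LNAME": "Lovelace"}]\nLet me know if you need more.'
rows = parse_json_array(sample)
print(rows[0]["FNAME"])
```

This assumes the response contains exactly one top-level array; anything more complex is better left to the parser in the chain.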

In [None]:
import random

unique_numbers = set()
def get_random_number(low, high):
    # Keep drawing until we get a number that has not been handed out before
    while True:
        random_number = random.randint(low, high)
        if random_number not in unique_numbers:
            unique_numbers.add(random_number)
            break
    return random_number
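
The retry loop above never terminates once every number in the range has been used. When the number of IDs needed is known up front, one alternative is to draw them all at once. This is a sketch only; `draw_unique_ids` and its `seed` parameter are hypothetical names, not part of the notebook.

```python
import random

def draw_unique_ids(low, high, count, seed=None):
    """Hypothetical helper: return `count` distinct integers in [low, high]."""
    population = range(low, high + 1)
    if count > len(population):
        raise ValueError("range too small for the requested number of unique IDs")
    rng = random.Random(seed)
    # random.sample draws without replacement, so no duplicates are possible
    return rng.sample(population, count)

customer_ids = draw_unique_ids(100, 10000, 10, seed=42)
print(len(customer_ids), len(set(customer_ids)))
```

Passing a seed makes the generated IDs reproducible between runs, which can help when regenerating the dataset.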

In [None]:
import pandas as pd
from langchain.prompts import PromptTemplate
from langchain_nvidia_ai_endpoints import ChatNVIDIA

# temperature=1.0 encourages varied output; max_tokens caps the length of each response
llm = ChatNVIDIA(model="nvidia/llama-3.1-nemotron-70b-instruct", temperature=1.0, max_tokens=2048)

## Variables

In [None]:
CUSTOMER_FILENAME="./customers.csv"
PRODUCT_FILENAME="../data/gear-store.csv"
ORDERHISTORY_FILENAME="./orders.csv"

## Customer Profile Generation

In [None]:
customer_template = '''
Create 10 rows of data, each representing a customer in a database.
The customer table has the following schema.

  'FNAME': This is the customer's first name,
  'LNAME': This is the customer's last name,
  'AGE': age,
  'GENDER': gender. This should be male or female,
  'STATE': state in the United States,
  'ZIPCODE': zipcode, and this should be present in the state mentioned above,
  'INTERESTS': this is any 2-3 interests,
  'MEMBER_SINCE': This is a date between 1st January 2015 and 30th June 2024


Ensure the first name and last name pairs are unique in the dataset created.
Return the data in the form of a json array of customer rows.
Do not include any niceties.
Do not add any more attributes.
Return the data in the form of a json array that can be loaded into a python variable with json.loads(..). Do not have the word "json" in the returned string
'''
customers = get_do_task(llm, customer_template,input_variables=[], input_values={})
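
The model is not guaranteed to honor the schema or the uniqueness rule in the prompt, so a quick check of the returned rows can catch problems before anything is written to disk. The sketch below is illustrative; `find_customer_problems` is a hypothetical helper, and the expected keys are copied from the prompt above.

```python
EXPECTED_KEYS = {"FNAME", "LNAME", "AGE", "GENDER", "STATE",
                 "ZIPCODE", "INTERESTS", "MEMBER_SINCE"}

def find_customer_problems(rows):
    """Hypothetical validator: list schema and uniqueness problems in customer rows."""
    problems = []
    seen_names = set()
    for position, row in enumerate(rows):
        keys = set(row) - {"CID"}  # CID is added by the notebook, not by the model
        missing = EXPECTED_KEYS - keys
        extra = keys - EXPECTED_KEYS
        if missing:
            problems.append(f"row {position}: missing {sorted(missing)}")
        if extra:
            problems.append(f"row {position}: unexpected {sorted(extra)}")
        name = (row.get("FNAME"), row.get("LNAME"))
        if name in seen_names:
            problems.append(f"row {position}: duplicate name {name}")
        seen_names.add(name)
    return problems

template_row = {key: "x" for key in EXPECTED_KEYS}
sample_rows = [dict(template_row), dict(template_row, NICKNAME="y")]
for problem in find_customer_problems(sample_rows):
    print(problem)
```

An empty result means the rows match the prompt's schema; otherwise the generation cell can simply be re-run.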

In [None]:
import json
import csv
# Specify the CSV file name
csv_file = CUSTOMER_FILENAME
# Initialize an empty set to store unique random numbers
unique_numbers = set()

for customer in customers:
    for key, value in customer.items():
        if isinstance(value, str):
            customer[key] = value.strip()
        elif isinstance(value, list):
            # Convert items to str first so lists with non-string items do not raise
            customer[key] = ', '.join(str(item) for item in value).strip()
    customer['CID'] = get_random_number(100,10000)

fieldnames = ['CID'] + [key for key in customers[0].keys() if key != 'CID']
        
# Write the data to a CSV file
with open(csv_file, mode='w', newline='') as file:
    writer = csv.DictWriter(file, fieldnames=fieldnames)
    writer.writeheader()  # Write the header row
    for customer in customers:
        writer.writerow(customer)

print(f"Data successfully written to {csv_file}")

## Order History Generation

In [None]:
# Create the data frames 
# 1. Electronics 
# 2. Other stuff
# 3. Ensure you don't use the gift card 

import pandas as pd

# Load the CSV file
file_path = PRODUCT_FILENAME
df = pd.read_csv(file_path)

# Filter out any products that contain the word "Gift card" in the name column
df_filtered = df[~df['name'].str.contains('Gift card', case=False, na=False)]

# First DataFrame: NVIDIA Electronics category (excluding gift cards)
df_nvidia_electronics = df_filtered[df_filtered['category'] == 'NVIDIA Electronics'][['name', 'description', 'price']]

# Second DataFrame: All other categories (excluding gift cards)
df_other = df_filtered[df_filtered['category'] != 'NVIDIA Electronics'][['name', 'description', 'price']]

# Display both DataFrames
print("NVIDIA Electronics DataFrame:")
print(df_nvidia_electronics)

print("\nOther Categories DataFrame:")
print(df_other)

NUM_ELECTRONIC_PRODUCTS = len(df_nvidia_electronics)
NUM_OTHER_PRODUCTS = len(df_other)

In [None]:
order_template = '''
    You will be given information regarding 10 products in an array where each array element contains the 
    product name, product description and product price. You will also be given the current date. Use these to do the following:
    
    Create an order history table containing 10 rows of data,
    where each row maps to one of the 10 products you were given.
    Make the data interesting, i.e. do not make every order Delivered; include returns and rejected returns.
    Do not include any niceties. 
    Do not add any more attributes.
    Return the data in the form of a json array that can be loaded into a python variable with json.loads(..).
    Do not have the word "json" in the returned string
    
    You will be given the product description to use in the creation of the ReturnReason. 
    
    Product Description: {product_desc}
    Current Date: {current_date}
    
    This schema of the data is presented like this:- 
    <attribute> <type> <description>
    product_name STRING "Product name"
    product_description STRING "Product description"
    OrderDate DATETIME "Date and time when the order was placed"
    Quantity INT "Number of units ordered" 
    OrderAmount INT "Product Price X Quantity"
    OrderStatus VARCHAR(50) "Status of the order" 
    ReturnStatus VARCHAR(50) "Status of the return" 
    ReturnStartDate DATETIME "Date when the return was started" 
    ReturnReceivedDate DATETIME "Date when the return was received" 
    ReturnCompletedDate DATETIME "Date when the return was completed" 
    ReturnReason VARCHAR(255) "Reason for the return" 
    Notes VARCHAR(255) "Notes sent to the customer" 
    
    Instructions to set the various attributes:
    * Keep the current date in mind when generating the data.
    * product_name: This should match the product name from the input that this record is generated for.
    * product_description: This should match the product description input that this record is generated for.
    * OrderDate: This should be a date between October 1st 2024 and October 20th 2024. 
    * Quantity: This should be a number between 1 and 8
    * OrderAmount: This is the product of the Quantity and the Product price in the input. 
    * OrderStatus: This should be one of these values [Pending, Processing, Shipped, In Transit, Out for Delivery, Delivered, Cancelled, Returned, Return Requested, Delayed, On Hold].
    * ReturnStatus: This should be set to one of these values [None, Requested, Approved, Rejected, Received, Processing, Refunded, Pending Approval, Return to Sender, Awaiting Customer Action]
      This should be set ONLY when OrderStatus is set to either "Returned" or "Return Requested", else set to None
    
    * ReturnStartDate: This is set only when the OrderStatus is "Return Requested". It should be a date within 7 days after the OrderDate but before the "current date", else set to None.
    * ReturnReceivedDate: This should be set to a date within 5 days after the ReturnStartDate, else set to None
    * ReturnCompletedDate: This should be a date when the return was completed and set only when the OrderStatus field is set to "Returned" and ReturnStatus field set to "Approved".
                           This should be between 15 and 30 days after the ReturnReceivedDate, else set to None
    * ReturnReason: This should be set to a creative reason that would make logical sense based on the product description only when the OrderStatus is set to "Returned" or "Return Requested"
      This should be set to something when the ReturnStatus is not None. 
    * Notes: This is information sent back to the customer. This should be something that is relevant and makes sense for the 
             product when the ReturnStatus is set to "Rejected". It can also be used to put notes if the order has been in processing state for longer than usual.
             
    
    In the returned string:-
    Do not include any niceties. 
    Do not add any more attributes.
    Return the data in the form of a json array that can be loaded into a python variable with json.loads(..). Do not have the word "json" in the returned string
    '''
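
Several of the rules in this prompt can be checked mechanically once the model responds. The following is a minimal sketch covering three of them (quantity range, order amount, and when a return status is allowed). `check_order`, `is_unset`, and the numeric `unit_price` argument are assumptions made for illustration; the notebook itself does not validate the output.

```python
RETURN_ORDER_STATUSES = {"Returned", "Return Requested"}

def is_unset(value):
    """Treat JSON null, an empty string, and the literal string "None" as 'not set'."""
    return value is None or str(value).strip().lower() in {"", "none"}

def check_order(order, unit_price):
    """Hypothetical validator: return a list of rule violations for one order row."""
    problems = []
    quantity = order.get("Quantity")
    if not isinstance(quantity, int) or not 1 <= quantity <= 8:
        problems.append("Quantity must be an integer between 1 and 8")
    elif order.get("OrderAmount") != unit_price * quantity:
        problems.append("OrderAmount must equal price x Quantity")
    in_return_flow = order.get("OrderStatus") in RETURN_ORDER_STATUSES
    if not in_return_flow and not is_unset(order.get("ReturnStatus")):
        problems.append("ReturnStatus set although the order is not in a return state")
    if not is_unset(order.get("ReturnStatus")) and is_unset(order.get("ReturnReason")):
        problems.append("ReturnReason missing for an order with a ReturnStatus")
    return problems

good = {"Quantity": 2, "OrderAmount": 50, "OrderStatus": "Delivered",
        "ReturnStatus": None, "ReturnReason": None}
bad = {"Quantity": 9, "OrderAmount": 50, "OrderStatus": "Shipped",
       "ReturnStatus": "Approved", "ReturnReason": "None"}
print(check_order(good, 25))
print(check_order(bad, 25))
```

Rows that fail such checks could be dropped or regenerated, depending on how strict the downstream ingestion needs to be.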

In [None]:
# Create a list of 10 products: 7 electronics products and 3 "other" products
def get_random_products():
    # Randomly pick 7 rows from the electronics DataFrame, keeping name, description and price
    random_rows = df_nvidia_electronics.sample(n=7)[['name', 'description', 'price']]
    # Convert the selected rows to a list of dicts (one dict per product)
    product_rows = random_rows.to_dict(orient='records')
    # Randomly pick 3 rows from the remaining categories
    random_rows = df_other.sample(n=3)[['name', 'description', 'price']]
    # Use extend (not append) so the result is a flat list of 10 product dicts
    product_rows.extend(random_rows.to_dict(orient='records'))
    return product_rows
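
When two lists of product records are combined, `list.append` and `list.extend` give different shapes, and only a flat list matches the "10 products in an array" wording of the prompt. A small illustration with made-up product names:

```python
electronics = [{"name": "Product A"}, {"name": "Product B"}]
other = [{"name": "Product C"}]

nested = list(electronics)
nested.append(other)   # adds the whole list as ONE element

flat = list(electronics)
flat.extend(other)     # adds each product dict individually

print(len(nested), type(nested[-1]).__name__)
print(len(flat), type(flat[-1]).__name__)
```

In this example both lists have three elements, but the last element of `nested` is itself a list while every element of `flat` is a product dict.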

In [None]:

# Delete the order history file if it exists
import os
file_path = ORDERHISTORY_FILENAME

# Check if the file exists and delete it
if os.path.exists(file_path):
    os.remove(file_path)
    print(f"File '{file_path}' has been deleted.")
else:
    print(f"File '{file_path}' does not exist.")

# Now read the customer data file one row at a time
import pandas as pd

# Load the CSV file
file_path = CUSTOMER_FILENAME
dfcustomer = pd.read_csv(file_path)

# Initialize an empty set to store unique random numbers
unique_numbers = set()

# Loop over each customer row in the DataFrame
for _, customer_row in dfcustomer.iterrows():
    random_prod_list = get_random_products()
    print(customer_row['CID'])

    order_history = get_do_task(llm, order_template, input_variables = ['product_desc', 'current_date'], 
                                    input_values = {'product_desc': random_prod_list, 'current_date': "23rd October 2024"})
    df = pd.DataFrame(order_history)

    # Add the Customer ID (CID) as a new column
    df['CID'] = customer_row['CID']
    # Assign a unique random OrderID to every order row
    df['OrderID'] = [get_random_number(10, 100000) for _ in range(len(df))]

    # Reorder columns so that 'CID' and 'OrderID' come first
    columns = ['CID', 'OrderID'] + [col for col in df.columns if col not in  ['CID', 'OrderID']]
    df = df[columns]
    # Append the DataFrame to the CSV file, writing the header only on the first write
    df.to_csv(ORDERHISTORY_FILENAME, mode='a', index=False, header=not os.path.exists(ORDERHISTORY_FILENAME))
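
The append-with-header-once pattern used for the order history file can be shown in isolation with the standard library. This is a sketch under stated assumptions: `append_rows`, the demo path, and the field values are hypothetical and only mirror the idea, not the notebook's pandas code.

```python
import csv
import os
import tempfile

def append_rows(path, rows, fieldnames):
    """Hypothetical helper: append rows to a CSV, writing the header only once."""
    needs_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, mode="a", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        if needs_header:
            writer.writeheader()
        writer.writerows(rows)

demo_path = os.path.join(tempfile.mkdtemp(), "orders_demo.csv")
fields = ["CID", "OrderID", "OrderStatus"]
append_rows(demo_path, [{"CID": 101, "OrderID": 11, "OrderStatus": "Delivered"}], fields)
append_rows(demo_path, [{"CID": 102, "OrderID": 12, "OrderStatus": "Returned"}], fields)

with open(demo_path, newline="") as handle:
    lines = handle.read().splitlines()
print(lines)
```

Note that this pattern assumes every batch has the same columns in the same order; if the model returns a different set of attributes for one customer, the appended rows will not line up with the header.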
    

#### Copy over the csv files to the "data" folder

If you are satisfied with the csv files generated, copy them over to the data folder for ingestion.