############## Synthetic Data Generation for Predictive Lead Scoring ##########

In [None]:
import pandas as pd
import random
from datetime import datetime, timedelta

# Sample Data Generation
data = []
for i in range(1000):  # Generate 1000 leads
    lead_id = i + 1
    name = f"Lead {lead_id}"
    age = random.randint(25, 60)
    job_title = random.choice(["Marketing Manager", "Sales Executive", "Software Engineer", "CEO", "Product Manager", "Data Analyst", "Business Analyst"])
    company_size = random.choice([50, 100, 200, 300, 500, 1000])
    industry = random.choice(["Technology", "Finance", "Retail", "Healthcare", "Consulting"])
    location = random.choice(["New York", "Chicago", "San Francisco", "Boston", "Seattle", "Austin", "Denver", "Miami"])
    email_engagement_score = random.randint(0, 100)
    website_visits = random.randint(0, 30)
    content_downloads = random.randint(0, 10)
    last_interaction_date = (datetime.now() - timedelta(days=random.randint(0, 30))).date()
    conversion = random.choice([0, 1])  # Binary target

    data.append([lead_id, name, age, job_title, company_size, industry, location, email_engagement_score, website_visits, content_downloads, last_interaction_date, conversion])

# Create DataFrame
df = pd.DataFrame(data, columns=["Lead ID", "Name", "Age", "Job Title", "Company Size", "Industry", "Location", "Email Engagement Score", "Website Visits", "Content Downloads", "Last Interaction Date", "Conversion"])
print(df.head())


Using OpenAI

Explanation
OpenAI API: The code uses the OpenAI API to generate synthetic lead data. It creates a prompt asking the model to generate a lead record in JSON format.

Data Structure: The response from the API is expected to be a JSON string that represents the lead data.

Evaluating Response: The string response is converted to a Python dictionary. eval() would execute whatever the string contains, so json.loads(), which only parses JSON, is the safer way to do this conversion.

Creating DataFrame: The generated records are stored in a DataFrame for further analysis or processing.

CSV Output: The generated data can be saved to a CSV file for later use.
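
The caution about eval() can be made concrete with a small parsing helper. This is a minimal sketch, not the notebook's own code: parse_lead_record is a hypothetical name, and the fence-stripping step assumes the model may wrap its JSON in a Markdown code fence.

```python
import json

# Markdown code-fence marker, built from single backticks.
FENCE = "`" * 3


def parse_lead_record(text):
    """Parse one model reply into a dict without executing any of it."""
    cleaned = text.strip()
    if cleaned.startswith(FENCE):
        # Drop the opening fence line (it may carry a tag such as "json")
        # and everything from the closing fence onward.
        cleaned = cleaned.split("\n", 1)[1] if "\n" in cleaned else ""
        cleaned = cleaned.rsplit(FENCE, 1)[0]
    record = json.loads(cleaned)  # raises json.JSONDecodeError on bad input
    if not isinstance(record, dict):
        raise ValueError("expected a JSON object for one lead record")
    return record
```

Input that is not valid JSON raises an exception instead of being executed, which is the property eval() lacks.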

Note
Replace 'YOUR_API_KEY' with your actual OpenAI API key.
Monitor your API usage, as generating a large number of entries may incur costs based on your usage tier.

Generates 5 million entries in batches, writing them to a CSV file incrementally.

Explanation of Modifications
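Batch Processing: The code generates leads in batches of batch_size (1000 in this case). Each lead still takes one API call, so batching does not reduce request volume or rate-limit pressure; it limits how many records are held in memory before they are written out.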

Appending to CSV: Each batch is written to a CSV file incrementally. The header parameter is set to True only for the first batch, so the column names are written only once.

Efficiency: This approach ensures you don’t need to hold all 5 million records in memory at once, making it more memory efficient.
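
The batching and header-once behaviour described above can be sketched without any API calls. This is an illustration under stated assumptions: it uses the standard-library csv module in place of pandas so it has no dependencies, and batch_sizes, append_batch, write_in_batches and make_row are hypothetical names. Unlike a loop over range(num_entries // batch_size), batch_sizes also yields a final partial batch when the total is not a multiple of the batch size.

```python
import csv


def batch_sizes(total_entries, batch_size):
    """Yield the size of each batch, including a final partial batch."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    full, remainder = divmod(total_entries, batch_size)
    for _ in range(full):
        yield batch_size
    if remainder:
        yield remainder


def append_batch(path, rows, fieldnames, write_header):
    """Write one batch; the first batch creates the file and its header."""
    mode = "w" if write_header else "a"
    with open(path, mode, newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        if write_header:
            writer.writeheader()
        writer.writerows(rows)


def write_in_batches(path, total_entries, batch_size, make_row, fieldnames):
    """Build and write rows batch by batch; only one batch is in memory."""
    next_id = 1
    for index, size in enumerate(batch_sizes(total_entries, batch_size)):
        # make_row stands in for whatever produces one lead record.
        rows = [make_row(next_id + offset) for offset in range(size)]
        next_id += size
        append_batch(path, rows, fieldnames, write_header=(index == 0))
```

Here make_row is any callable that returns one record as a dict keyed by fieldnames.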

Notes
API Limitations: Monitor your usage. The number of API calls equals the total number of leads regardless of batch_size, so reduce the total or add delays between calls to stay within your rate limits and budget.
Error Handling: Consider adding error handling to manage potential API call failures.
File Size: The resulting CSV file will be large. Ensure your environment can handle large files efficiently.
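
One way to act on the error-handling note is a retry wrapper with exponential backoff. This is a rough sketch under assumptions: call_with_retries and its parameters are hypothetical, and it catches Exception broadly for brevity, whereas real code would catch only the client library's transient error types.

```python
import time


def call_with_retries(func, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(); after a failure wait base_delay * 2**attempt and retry."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the last error to the caller.
            sleep(base_delay * 2 ** attempt)
```

In the notebook's loop, func would be a zero-argument callable (for example a lambda) that makes a single API request. The sleep parameter exists so the delay can be replaced in tests.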


In [None]:
import json

import openai
import pandas as pd

# Set your OpenAI API key
# Note: openai.ChatCompletion is the pre-1.0 interface of the openai package;
# openai>=1.0 uses a client object instead.
openai.api_key = 'YOUR_API_KEY'

def generate_lead_data(num_entries):
    data = []
    
    for _ in range(num_entries):
        prompt = (
            "Generate a lead record with the following attributes: "
            "Lead ID, Name, Age, Job Title, Company Size, Industry, "
            "Location, Email Engagement Score, Website Visits, "
            "Content Downloads, Last Interaction Date, Conversion (0 or 1). "
            "Provide the output in JSON format."
        )
        
        response = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}]
        )

        # Parse the JSON response (json.loads only parses; eval() would execute the text)
        lead_record = response['choices'][0]['message']['content']
        data.append(json.loads(lead_record))  # Convert JSON string to dictionary
    
    return data

# Generate synthetic data
num_entries = 1000  # Number of leads to generate
synthetic_data = generate_lead_data(num_entries)

# Create DataFrame from the generated data
df = pd.DataFrame(synthetic_data)
print(df.head())

# Optional: Save to CSV
df.to_csv('synthetic_lead_data.csv', index=False)


Save the generated data directly to an Amazon S3 bucket

---- Move to README file ----
Install Boto3: If you haven't already, install the boto3 library:

pip install boto3

AWS Credentials: Ensure you have AWS credentials configured. You can set them in the ~/.aws/credentials file or use environment variables.

S3 Bucket: Make sure you have an S3 bucket created where you want to store the CSV file.
----
Explanation of Modifications
Boto3 Library: The boto3 library is used to interact with AWS services, including S3.

S3 Configuration: Set the S3_BUCKET_NAME and S3_OBJECT_NAME variables to define where the CSV will be uploaded.

Uploading to S3: The upload_to_s3 function uploads the CSV file to the specified S3 bucket after all data has been generated.

Error Handling: The S3 upload is wrapped in a try/except block that prints any error raised by the upload call.

Notes
AWS Permissions: Ensure your AWS credentials have the necessary permissions to upload files to the specified S3 bucket.
File Name in S3: The CSV file is uploaded with the same name as specified in S3_OBJECT_NAME. You can customize this if needed.
Efficiency: This approach generates the file locally and uploads it once after all data is generated, which may help with performance compared to uploading in smaller increments. If you prefer, you can modify it to upload batches directly after they are created.

In [None]:
import openai
import pandas as pd
import json
import boto3
import os
from datetime import datetime

# Set your OpenAI API key
openai.api_key = 'YOUR_API_KEY'

# AWS S3 configuration
S3_BUCKET_NAME = 'your-s3-bucket-name'  # Replace with your bucket name
S3_OBJECT_NAME = 'synthetic_lead_data.csv'  # Name of the file in S3

def generate_lead_data(num_entries, batch_size):
    for batch_index in range(num_entries // batch_size):
        batch_data = []
        for _ in range(batch_size):
            prompt = (
                "Generate a lead record with the following attributes: "
                "Lead ID, Name, Age, Job Title, Company Size, Industry, "
                "Location, Email Engagement Score, Website Visits, "
                "Content Downloads, Last Interaction Date, Conversion (0 or 1). "
                "Provide the output in JSON format."
            )
            
            response = openai.ChatCompletion.create(
                model="gpt-3.5-turbo",
                messages=[{"role": "user", "content": prompt}]
            )

            # Parse the JSON response
            lead_record = response['choices'][0]['message']['content']
            batch_data.append(json.loads(lead_record))  # Convert string to dictionary

        # Create a DataFrame from the batch data
        batch_df = pd.DataFrame(batch_data)

        # Write the header with the first batch only; later batches are appended without it
        if batch_index == 0:
            batch_df.to_csv(S3_OBJECT_NAME, mode='w', index=False)
        else:
            batch_df.to_csv(S3_OBJECT_NAME, mode='a', index=False, header=False)

    # Upload to S3 after all batches are generated
    upload_to_s3(S3_OBJECT_NAME)

def upload_to_s3(file_name):
    s3_client = boto3.client('s3')
    try:
        s3_client.upload_file(file_name, S3_BUCKET_NAME, file_name)
        print(f"File {file_name} uploaded to {S3_BUCKET_NAME} successfully.")
    except Exception as e:
        print(f"Error uploading file to S3: {e}")

# Parameters
total_entries = 5000000  # Total number of leads to generate
batch_size = 1000        # Number of leads per batch (each lead is one API call)

# Generate synthetic data
generate_lead_data(total_entries, batch_size)


Each batch of data is uploaded to the S3 bucket immediately after it is generated

Explanation of Modifications
Batch Data Creation: Each batch of generated lead data is saved to a local CSV file named batch_{batch_index + 1}.csv.

Immediate Upload: After creating each batch, the code immediately uploads the CSV file to the specified S3 bucket.

Local Storage: Each batch is saved locally before being uploaded, so a failed upload can be retried from the file on disk without regenerating the batch.

Notes
Temporary Storage: If you want to avoid keeping files locally after uploading, you can delete each one after its upload using os.remove(local_file_name). This requires adding import os to the cell.
Monitoring and Error Handling: This code prints messages for successful uploads and errors, which can help you monitor the process.
Performance: Depending on your network speed and the AWS region, uploading a large number of files may take some time, so you may want to adjust batch sizes or include progress indicators as needed.
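
The temporary-storage note can be sketched as a helper that deletes the local file only when the upload succeeded. This is an illustration under assumptions: upload_then_remove is a hypothetical name, and upload is assumed to be a callable that returns a truthy value on success. The notebook's upload_to_s3 only prints its result, so it would need to return a status to be used this way.

```python
import os


def upload_then_remove(local_file_name, upload):
    """Upload the file, then delete the local copy only if that succeeded."""
    succeeded = bool(upload(local_file_name))
    if succeeded:
        os.remove(local_file_name)
    # On failure the file is kept so the upload can be retried later.
    return succeeded
```

Keeping the file on failure preserves the retry benefit of writing each batch to disk first.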

In [None]:
import openai
import pandas as pd
import json
import boto3
from datetime import datetime

# Set your OpenAI API key
openai.api_key = 'YOUR_API_KEY'

# AWS S3 configuration
S3_BUCKET_NAME = 'your-s3-bucket-name'  # Replace with your bucket name

def generate_lead_data(num_entries, batch_size):
    for batch_index in range(num_entries // batch_size):
        batch_data = []
        for _ in range(batch_size):
            prompt = (
                "Generate a lead record with the following attributes: "
                "Lead ID, Name, Age, Job Title, Company Size, Industry, "
                "Location, Email Engagement Score, Website Visits, "
                "Content Downloads, Last Interaction Date, Conversion (0 or 1). "
                "Provide the output in JSON format."
            )
            
            response = openai.ChatCompletion.create(
                model="gpt-3.5-turbo",
                messages=[{"role": "user", "content": prompt}]
            )

            # Parse the JSON response
            lead_record = response['choices'][0]['message']['content']
            batch_data.append(json.loads(lead_record))  # Convert string to dictionary

        # Create a DataFrame from the batch data
        batch_df = pd.DataFrame(batch_data)

        # Save the batch DataFrame to a CSV file locally
        local_file_name = f'batch_{batch_index + 1}.csv'
        batch_df.to_csv(local_file_name, index=False)

        # Upload the batch to S3
        upload_to_s3(local_file_name)

def upload_to_s3(file_name):
    s3_client = boto3.client('s3')
    try:
        s3_client.upload_file(file_name, S3_BUCKET_NAME, file_name)
        print(f"File {file_name} uploaded to {S3_BUCKET_NAME} successfully.")
    except Exception as e:
        print(f"Error uploading file to S3: {e}")

# Parameters
total_entries = 5000000  # Total number of leads to generate
batch_size = 1000        # Number of leads per batch (each lead is one API call)

# Generate synthetic data
generate_lead_data(total_entries, batch_size)


Data generated in batches is uploaded directly to the S3 bucket without storing any files locally.
Each DataFrame is converted to a CSV string, and that string is then uploaded to S3.

Explanation of Modifications
StringIO Buffer: The io.StringIO class is used to create an in-memory buffer that behaves like a file. This allows you to store the CSV data in memory instead of writing it to disk.

Direct Upload to S3: After converting the DataFrame to a CSV string, the code uses s3_client.put_object to upload the string directly to S3.

No Local Storage: This version of the code does not create any local files, ensuring that all data handling happens in memory.

Notes
Memory Considerations: Generating large batches may increase memory usage. Ensure that your environment has enough memory to handle the data being processed.
Performance: Depending on your network speed and S3 upload limits, the process may take time. You may want to monitor or log the process for better tracking.

In [None]:
import openai
import pandas as pd
import json
import boto3
import io

# Set your OpenAI API key
openai.api_key = 'YOUR_API_KEY'

# AWS S3 configuration
S3_BUCKET_NAME = 'your-s3-bucket-name'  # Replace with your bucket name

def generate_lead_data(num_entries, batch_size):
    for batch_index in range(num_entries // batch_size):
        batch_data = []
        for _ in range(batch_size):
            prompt = (
                "Generate a lead record with the following attributes: "
                "Lead ID, Name, Age, Job Title, Company Size, Industry, "
                "Location, Email Engagement Score, Website Visits, "
                "Content Downloads, Last Interaction Date, Conversion (0 or 1). "
                "Provide the output in JSON format."
            )
            
            response = openai.ChatCompletion.create(
                model="gpt-3.5-turbo",
                messages=[{"role": "user", "content": prompt}]
            )

            # Parse the JSON response
            lead_record = response['choices'][0]['message']['content']
            batch_data.append(json.loads(lead_record))  # Convert string to dictionary

        # Create a DataFrame from the batch data
        batch_df = pd.DataFrame(batch_data)

        # Convert DataFrame to CSV string
        csv_buffer = io.StringIO()
        batch_df.to_csv(csv_buffer, index=False)
        csv_buffer.seek(0)  # Rewind the buffer (optional here: getvalue() returns the full contents regardless of position)

        # Upload the CSV string to S3
        upload_to_s3(csv_buffer, f'batch_{batch_index + 1}.csv')

def upload_to_s3(csv_buffer, file_name):
    s3_client = boto3.client('s3')
    try:
        s3_client.put_object(Bucket=S3_BUCKET_NAME, Key=file_name, Body=csv_buffer.getvalue())
        print(f"File {file_name} uploaded to {S3_BUCKET_NAME} successfully.")
    except Exception as e:
        print(f"Error uploading file to S3: {e}")

# Parameters
total_entries = 5000000  # Total number of leads to generate
batch_size = 1000        # Number of leads per batch (each lead is one API call)

# Generate synthetic data
generate_lead_data(total_entries, batch_size)


Ensuring that memory is freed after each batch is uploaded to S3 by explicitly deleting the variables that hold the batch data once the upload is complete

Key Changes
Memory Management: After the upload is complete, the code explicitly deletes the variables holding the batch data (batch_data, batch_df, and csv_buffer) using del. This helps free up memory immediately after each batch is processed and uploaded.

No Change in Logic: The overall logic of generating leads, creating DataFrames, and uploading to S3 remains the same.

Notes
Garbage Collection: Python reclaims objects automatically once nothing references them. Using del drops the references right after the upload rather than when the names are rebound during the next iteration, which can lower peak memory when batches are large.
Performance: This approach helps manage memory more effectively, particularly useful when generating a significant number of entries in a constrained memory environment.
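
The effect of del can be observed with a weak reference, which watches an object without keeping it alive. This is a small sketch, assuming CPython, where reference counting frees an object as soon as its last strong reference is gone. Batch and release_demo are hypothetical names used only for illustration.

```python
import weakref


class Batch:
    """Stand-in for a large per-batch object such as a DataFrame."""

    def __init__(self, rows):
        self.rows = rows


def release_demo():
    batch = Batch([{"Lead ID": i} for i in range(1000)])
    probe = weakref.ref(batch)  # a weak reference does not keep batch alive
    alive_before = probe() is not None
    del batch  # drop the only strong reference
    alive_after = probe() is not None
    return alive_before, alive_after
```

On CPython the object is still alive before del and gone immediately after it. Other Python implementations may reclaim it later.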

In [None]:
import openai
import pandas as pd
import json
import boto3
import io

# Set your OpenAI API key
openai.api_key = 'YOUR_API_KEY'

# AWS S3 configuration
S3_BUCKET_NAME = 'your-s3-bucket-name'  # Replace with your bucket name

def generate_lead_data(num_entries, batch_size):
    for batch_index in range(num_entries // batch_size):
        batch_data = []
        for _ in range(batch_size):
            prompt = (
                "Generate a lead record with the following attributes: "
                "Lead ID, Name, Age, Job Title, Company Size, Industry, "
                "Location, Email Engagement Score, Website Visits, "
                "Content Downloads, Last Interaction Date, Conversion (0 or 1). "
                "Provide the output in JSON format."
            )
            
            response = openai.ChatCompletion.create(
                model="gpt-3.5-turbo",
                messages=[{"role": "user", "content": prompt}]
            )

            # Parse the JSON response
            lead_record = response['choices'][0]['message']['content']
            batch_data.append(json.loads(lead_record))  # Convert string to dictionary

        # Create a DataFrame from the batch data
        batch_df = pd.DataFrame(batch_data)

        # Convert DataFrame to CSV string
        csv_buffer = io.StringIO()
        batch_df.to_csv(csv_buffer, index=False)
        csv_buffer.seek(0)  # Rewind the buffer (optional here: getvalue() returns the full contents regardless of position)

        # Upload the CSV string to S3
        upload_to_s3(csv_buffer, f'batch_{batch_index + 1}.csv')

        # Free up memory
        del batch_data
        del batch_df
        del csv_buffer

def upload_to_s3(csv_buffer, file_name):
    s3_client = boto3.client('s3')
    try:
        s3_client.put_object(Bucket=S3_BUCKET_NAME, Key=file_name, Body=csv_buffer.getvalue())
        print(f"File {file_name} uploaded to {S3_BUCKET_NAME} successfully.")
    except Exception as e:
        print(f"Error uploading file to S3: {e}")

# Parameters
total_entries = 5000000  # Total number of leads to generate
batch_size = 1000        # Number of leads per batch (each lead is one API call)

# Generate synthetic data
generate_lead_data(total_entries, batch_size)
