### Engineer Assistant
This notebook creates a GPT Assistant that acts as a data engineer through the OpenAI API, gives it a dataset, asks it to process the data according to its instructions, and then prepares the resulting dataframes for download. For this proof-of-concept project I use simple classification data from the UCI data repository: [The Wisconsin Breast Cancer](https://archive.ics.uci.edu/dataset/15/breast+cancer+wisconsin+original) dataset. The idea is to test whether the assistant can successfully complete data preparation and some feature engineering, then reduce the feature set using a logistic regression with LASSO regularization, keeping only the features with non-zero LASSO coefficients in the final dataset.

In [18]:
!pip install openai



In [19]:
import openai
import time
import ipywidgets as widgets
from IPython.display import display
import pandas as pd
from io import StringIO
import io
import json

## Define Functions

def read_and_save_file(first_file_id, file_name):
    # The API returns binary content, so read the bytes and wrap them in a file-like object
    file_data = client.files.content(first_file_id)
    file_data_bytes = file_data.read()
    file_like_object = io.BytesIO(file_data_bytes)
    # Parse the bytes as CSV, save a local copy, and return the DataFrame
    returned_data = pd.read_csv(file_like_object)
    returned_data.to_csv(file_name, index=False)
    return returned_data

# Example: file = read_and_save_file(first_file_id, "analyst_output.csv")

def files_from_messages(messages, asst_name):
    # messages.data[0] is the most recent message (the list is returned newest-first by default)
    first_thread_message = messages.data[0]
    message_ids = first_thread_message.file_ids
    print(message_ids)
    # Loop through each file ID and save the file with a sequential name
    for i, file_id in enumerate(message_ids):
        file_name = f"{asst_name}_output_{i+1}.csv"  # Generate a sequential file name
        read_and_save_file(file_id, file_name)
        print(f'saved {file_name}')
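
The download helper above depends on one small step: wrapping raw bytes in a file-like object so a CSV parser can read them. Here is a minimal offline sketch of that step using only the standard library. The `fake_download` payload and the `rows_from_bytes` name are made up for illustration; in the notebook the bytes come from `client.files.content(...).read()` and the parsing is done by `pd.read_csv`.

```python
import csv
import io

# Hypothetical payload standing in for the bytes returned by the Files API
fake_download = b"feature,coefficient\nClump Thickness,0.0\nBare Nuclei,0.25\n"

def rows_from_bytes(payload):
    """Wrap raw bytes in a file-like object and parse them as CSV rows."""
    binary_file = io.BytesIO(payload)  # bytes -> binary file-like object
    text_file = io.TextIOWrapper(binary_file, encoding="utf-8", newline="")
    return list(csv.DictReader(text_file))

rows = rows_from_bytes(fake_download)
print(rows[1]["feature"], rows[1]["coefficient"])
```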

## Initialize API Session

In [20]:
# read the API key from the environment; never hard-code a secret key in a notebook
import os
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]

# Instantiate the OpenAI client
client = openai.OpenAI(api_key=OPENAI_API_KEY)

In [21]:
# load and check the file for the engineer
asst_file = 'tumor.csv'
df = pd.read_csv(asst_file)

display(df)

Unnamed: 0,Sample code number,Clump Thickness,Uniformity of Cell Size,Uniformity of Cell Shape,Marginal Adhesion,Single Epithelial Cell Size,Bare Nuclei,Bland Chromatin,Normal Nucleoli,Mitoses,Class
0,1000025,5,1,1,1,2,1,3,1,1,2
1,1002945,5,4,4,5,7,10,3,2,1,2
2,1015425,3,1,1,1,2,2,3,1,1,2
3,1016277,6,8,8,1,3,4,3,7,1,2
4,1017023,4,1,1,3,2,1,3,1,1,2
...,...,...,...,...,...,...,...,...,...,...,...
678,776715,3,1,1,1,3,2,1,1,1,2
679,841769,2,1,1,1,2,1,1,1,1,2
680,888820,5,10,10,3,7,3,8,10,2,4
681,897471,4,8,6,4,3,4,10,6,1,4


## Create the Engineer and Pass the csv File

In [22]:
# create the assistant and give it the CSV file

mls = '''
You are a data engineer who will work with data in a csv file in your files. 
When the user asks you to perform your actions, use the csv file to read the data into a pandas dataframe.
The data set is to be used for a classification model.
Execute each of the steps listed below in your ACTIONS section. The user will identify the target variable. 

ACTIONS:

1. Read the file data into a pandas DataFrame. 
2. Summarize each feature and the target variable in the data set and prepare the results as Table_1.
3. Check for missing values and impute the column mean for any missing values.
4. Create two new feature interaction columns for each unique pair of variables, using multiplication for one interaction column and division for the other.
5. Run a logistic regression to predict the target variable with LASSO to select features. Use a lambda value of 1.
6. Prepare the Lasso coefficient values as Table_2.
7. Prepare a final data set that only contains features with non-zero LASSO coefficients and the target variable as Table_3
8. Provide a summary paragraph explaining the preparation of the data set.
9. Prepare Table_1, Table_2, and Table_3 as csv files for download by the user. 

DO NOT:
1. Do not return any images. 
'''

# upload the csv file to OpenAI with the "assistants" purpose
with open(asst_file, "rb") as f:
    response = client.files.create(file=f, purpose="assistants")
print(response)
file__id = response.id

my_assistant = client.beta.assistants.create(
    instructions=mls,
    name="engine_1",
    tools=[{"type": "code_interpreter"}],
    model="gpt-4-1106-preview", # gpt-4
    file_ids=[file__id]
)

# get the file id
fileId = my_assistant.file_ids[0]
print(my_assistant)

FileObject(id='file-zk2jFjpvjhn73vN4mb9KQO5q', bytes=20320, created_at=1700943295, filename='tumor.csv', object='file', purpose='assistants', status='processed', status_details=None)
Assistant(id='asst_wEjzHPtxQZqm9lsnfM1PHLAy', created_at=1700943296, description=None, file_ids=['file-zk2jFjpvjhn73vN4mb9KQO5q'], instructions='\nYou are a data engineer who will work with data in a csv file in your files. \nWhen the user asks you to perform your actions, use the csv file to read the data into a pandas dataframe.\nThe data set is to be used for a classification model.\nExecute each of the steps listed below in your ACTIONS section. The user will identify the target variable. \n\nACTIONS:\n\n1. Read the file data into a pandas DataFrame. \n2. Summarize each feature and the target variable in the data set and prepare the results as Table_1.\n3. Check for missing values and impute the column mean for any missing values.\n4. Create a two new feature interaction columns for each unique pair of
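
Step 4 of the instructions asks for two interaction columns per unique pair of variables. For n columns that is n(n-1)/2 pairs and therefore n(n-1) new columns. The sketch below shows one way to build them, using a plain dict of lists as a stand-in for a DataFrame. The `add_interactions` helper is hypothetical and is not the code the assistant ran; the `_x_` and `_div_` suffixes mirror the column names in the assistant's output, and returning `None` for a zero denominator is my own assumption.

```python
from itertools import combinations

def add_interactions(columns):
    """Return a new dict with a product and a ratio column for each unique pair.

    `columns` maps a column name to a list of numbers (a stand-in for a DataFrame).
    """
    out = dict(columns)
    for a, b in combinations(columns, 2):  # each unordered pair exactly once
        out[f"{a}_x_{b}"] = [x * y for x, y in zip(columns[a], columns[b])]
        # Assumption: a zero denominator yields None instead of raising
        out[f"{a}_div_{b}"] = [x / y if y != 0 else None
                               for x, y in zip(columns[a], columns[b])]
    return out

toy = {"A": [1, 2], "B": [2, 4], "C": [5, 0]}
result = add_interactions(toy)
print(len(result) - len(toy))  # number of new columns
```

With three toy columns there are three pairs, so six columns are added to the original three.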

## Create Message to Assistant

In [23]:
# make the request to the assistant

message_string = "Please execute your ACTIONS on the data stored in the csv file " + fileId + " . The Target variable is Class"
print(message_string)

# Step 2: Create a Thread
thread = client.beta.threads.create()

# Step 3: Add a Message to a Thread
message = client.beta.threads.messages.create(
    thread_id=thread.id,
    role="user",
    content= message_string
)

# Step 4: Run the Assistant
run = client.beta.threads.runs.create(
    thread_id=thread.id,
    assistant_id=my_assistant.id
    #instructions="Overwrite hard-coded instructions here"
)

print(run.model_dump_json(indent=4))

while True:
    sec = 60
    # Wait 60 seconds between status checks
    time.sleep(sec)  
    # Retrieve the run status
    run_status = client.beta.threads.runs.retrieve(
        thread_id=thread.id,
        run_id=run.id
    )
    print(f'{sec} seconds later...')
    # If run is completed, get messages
    if run_status.status == 'completed':
        messages = client.beta.threads.messages.list(
            thread_id=thread.id
        )
        # Loop through messages and print content based on role
        for msg in messages.data:
            role = msg.role
            try:
                content = msg.content[0].text.value
                print(f"{role.capitalize()}: {content}")
            except AttributeError:
                # This will execute if .text does not exist
                print(f"{role.capitalize()}: [Non-text content, possibly an image or other file type]")
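        break
    # Stop polling if the run ended without completing; otherwise this loop never exits
    if run_status.status in ('failed', 'cancelled', 'expired'):
        print(f'Run ended with status: {run_status.status}')
        break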

Please execute your ACTIONS on file-zk2jFjpvjhn73vN4mb9KQO5q and prepare the resulting df as a csv download. The Target variable is Class
{
    "id": "run_oNRMggKbPoRTWG7N1PdSs2mk",
    "assistant_id": "asst_wEjzHPtxQZqm9lsnfM1PHLAy",
    "cancelled_at": null,
    "completed_at": null,
    "created_at": 1700943331,
    "expires_at": 1700943931,
    "failed_at": null,
    "file_ids": [
        "file-zk2jFjpvjhn73vN4mb9KQO5q"
    ],
    "instructions": "\nYou are a data engineer who will work with data in a csv file in your files. \nWhen the user asks you to perform your actions, use the csv file to read the data into a pandas dataframe.\nThe data set is to be used for a classification model.\nExecute each of the steps listed below in your ACTIONS section. The user will identify the target variable. \n\nACTIONS:\n\n1. Read the file data into a pandas DataFrame. \n2. Summarize each feature and the target variable in the data set and prepare the results as Table_1.\n3. Check for missing va

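The polling loop above can be factored into a small reusable helper with a timeout. This is a sketch only: `poll_until_done`, its parameters, and the `TERMINAL` set are names I made up, and the status function is injected so the logic can be exercised offline with a fake sequence of statuses.

```python
import time

# Run statuses after which polling should stop
TERMINAL = {"completed", "failed", "cancelled", "expired"}

def poll_until_done(get_status, interval=5.0, timeout=600.0, sleep=time.sleep):
    """Call get_status() until it returns a terminal status or the timeout elapses."""
    waited = 0.0
    while True:
        status = get_status()
        if status in TERMINAL:
            return status
        if waited >= timeout:
            raise TimeoutError(f"run still '{status}' after {waited:.0f}s")
        sleep(interval)
        waited += interval

# Offline demonstration with a fake status sequence and a no-op sleep
fake = iter(["queued", "in_progress", "in_progress", "completed"])
final = poll_until_done(lambda: next(fake), interval=1.0, sleep=lambda s: None)
print(final)
```

In this notebook, `get_status` could be `lambda: client.beta.threads.runs.retrieve(thread_id=thread.id, run_id=run.id).status`, which is the same call the loop above makes.
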
## Extract File Name and Download Content

In [24]:
# extract the file names from the response and retrieve the content
asst_name = 'engineer'        
files_from_messages(messages, asst_name)

['file-msT5J8aN5P6KPuxyNRVrFAEK', 'file-t8dP42hgArSR713xBZ4pqkLJ', 'file-kV5rJ89aD0parXJuteOkKCNI']
saved engineer_output_1.csv
saved engineer_output_2.csv
saved engineer_output_3.csv


In [25]:
df1 = pd.read_csv('engineer_output_1.csv')
display(df1)

Unnamed: 0.1,Unnamed: 0,Class,Uniformity of Cell Shape,Bland Chromatin,Normal Nucleoli,Mitoses,Sample code number_div_Uniformity of Cell Shape,Sample code number_x_Single Epithelial Cell Size,Sample code number_div_Bare Nuclei,Sample code number_x_Normal Nucleoli,...,Single Epithelial Cell Size_div_Bare Nuclei,Single Epithelial Cell Size_x_Bland Chromatin,Single Epithelial Cell Size_div_Normal Nucleoli,Single Epithelial Cell Size_x_Mitoses,Bare Nuclei_x_Bland Chromatin,Bare Nuclei_div_Bland Chromatin,Bare Nuclei_div_Normal Nucleoli,Bare Nuclei_div_Mitoses,Bland Chromatin_div_Normal Nucleoli,Normal Nucleoli_div_Mitoses
0,0,2,1,3,1,1,1.000015e+06,2000050,1.000015e+06,1000025,...,1.999980,6,1.999980,2,3,0.333332,0.999990,0.999990,2.999970,0.999990
1,1,2,4,3,2,1,2.507356e+05,7020615,1.002944e+05,2005890,...,0.699999,21,3.499983,7,30,3.333322,4.999975,9.999900,1.499993,1.999980
2,2,2,1,3,1,1,1.015415e+06,2030850,5.077100e+05,1015425,...,0.999995,6,1.999980,2,6,0.666664,1.999980,1.999980,2.999970,0.999990
3,3,2,8,3,7,1,1.270345e+05,3048831,2.540686e+05,7113939,...,0.749998,9,0.428571,3,12,1.333329,0.571428,3.999960,0.428571,6.999930
4,4,2,1,3,1,1,1.017013e+06,2034046,1.017013e+06,1017023,...,1.999980,6,1.999980,2,3,0.333332,0.999990,0.999990,2.999970,0.999990
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
678,678,2,1,1,1,1,7.767072e+05,2330145,3.883556e+05,776715,...,1.499993,3,2.999970,3,2,1.999980,1.999980,1.999980,0.999990,0.999990
679,679,2,1,1,1,1,8.417606e+05,1683538,8.417606e+05,841769,...,1.999980,2,1.999980,2,1,0.999990,0.999990,0.999990,0.999990,0.999990
680,680,4,10,8,10,2,8.888191e+04,6221740,2.962723e+05,8888200,...,2.333326,56,0.699999,14,24,0.375000,0.300000,1.499993,0.799999,4.999975
681,681,4,6,10,6,1,1.495783e+05,2692413,2.243672e+05,5384826,...,0.749998,30,0.499999,3,40,0.400000,0.666666,3.999960,1.666664,5.999940


In [26]:
df2 = pd.read_csv('engineer_output_2.csv')
display(df2)

Unnamed: 0.1,Unnamed: 0,feature,coefficient
0,0,Clump Thickness,0.000000
1,1,Uniformity of Cell Size,0.000000
2,2,Uniformity of Cell Shape,0.380313
3,3,Marginal Adhesion,0.000000
4,4,Single Epithelial Cell Size,0.000000
...,...,...,...
94,94,Bland Chromatin_div_Normal Nucleoli,0.051929
95,95,Bland Chromatin_x_Mitoses,0.000000
96,96,Bland Chromatin_div_Mitoses,0.000000
97,97,Normal Nucleoli_x_Mitoses,0.000000
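
To show why many coefficients in the table above are exactly zero, here is a small pure-Python sketch of L1-penalized (LASSO) logistic regression fitted by proximal gradient descent. It is not the code the assistant ran, which is not shown in its reply. The toy data, the function names, and the penalty `lam=0.1` are all assumptions for illustration. How the "lambda of 1" in the instructions maps onto a library parameter depends on the library; scikit-learn, for example, takes the inverse regularization strength `C`.

```python
import math

def soft_threshold(value, threshold):
    """Proximal operator of the L1 penalty: shrink toward zero, clip to exactly 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def lasso_logistic(X, y, lam, lr=0.5, steps=500):
    """Fit L1-penalized logistic regression by proximal gradient descent (ISTA)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(steps):
        # Residuals p_i - y_i under the current weights
        resid = []
        for row, target in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, row))
            resid.append(1.0 / (1.0 + math.exp(-z)) - target)
        # Gradient step on the mean log-loss, then soft-threshold each weight
        for j in range(d):
            grad_j = sum(r * row[j] for r, row in zip(resid, X)) / n
            w[j] = soft_threshold(w[j] - lr * grad_j, lr * lam)
        b -= lr * sum(resid) / n  # the intercept is not penalized
    return w, b

# Toy data: "signal" determines the class, "noise" is unrelated to it
names = ["signal", "noise"]
X = [[-2, 1], [-1, 1], [1, 1], [2, 1], [-2, -1], [-1, -1], [1, -1], [2, -1]]
y = [0, 0, 1, 1, 0, 0, 1, 1]

w, b = lasso_logistic(X, y, lam=0.1)
coef = dict(zip(names, w))
selected = [name for name, value in coef.items() if value != 0.0]
print(selected)
```

The soft-threshold step is what sets uninformative weights to exactly zero instead of merely making them small, so filtering on `value != 0.0` is a meaningful selection rule.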


In [27]:
df3 = pd.read_csv('engineer_output_3.csv')
display(df3)

Unnamed: 0.1,Unnamed: 0,count,mean,std,min,25%,50%,75%,max,missing_values,percent_missing
0,Sample code number,683.0,1076720.0,620644.047655,63375.0,877617.0,1171795.0,1238705.0,13454352.0,0,0.0
1,Clump Thickness,683.0,4.442167,2.820761,1.0,2.0,4.0,6.0,10.0,0,0.0
2,Uniformity of Cell Size,683.0,3.150805,3.065145,1.0,1.0,1.0,5.0,10.0,0,0.0
3,Uniformity of Cell Shape,683.0,3.215227,2.988581,1.0,1.0,1.0,5.0,10.0,0,0.0
4,Marginal Adhesion,683.0,2.830161,2.864562,1.0,1.0,1.0,4.0,10.0,0,0.0
5,Single Epithelial Cell Size,683.0,3.234261,2.223085,1.0,2.0,2.0,4.0,10.0,0,0.0
6,Bare Nuclei,683.0,3.544656,3.643857,1.0,1.0,1.0,6.0,10.0,0,0.0
7,Bland Chromatin,683.0,3.445095,2.449697,1.0,2.0,3.0,5.0,10.0,0,0.0
8,Normal Nucleoli,683.0,2.869693,3.052666,1.0,1.0,1.0,4.0,10.0,0,0.0
9,Mitoses,683.0,1.603221,1.732674,1.0,1.0,1.0,1.0,10.0,0,0.0
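
The summary table above reports zero missing values for every column, so the mean-imputation step had nothing to fill in this run. For reference, here is a minimal sketch of what steps 2 and 3 amount to for a single column, using only the standard library. The `summarize` and `impute_mean` helpers and the toy column are hypothetical, and `None` stands in for a missing value.

```python
import statistics

def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = statistics.fmean(observed)
    return [mean if v is None else v for v in values]

def summarize(values):
    """Count, mean, sample standard deviation, and missing count for one column."""
    observed = [v for v in values if v is not None]
    return {
        "count": len(observed),
        "mean": statistics.fmean(observed),
        "std": statistics.stdev(observed),  # sample std (n - 1), like pandas describe()
        "missing_values": len(values) - len(observed),
    }

column = [1, None, 3, 5]  # toy column with one missing value
print(summarize(column))
print(impute_mean(column))
```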


## Clean Up the Assistant

In [28]:
# Clean up the assistant

response = client.beta.assistants.delete(my_assistant.id)
print(response)

AssistantDeleted(id='asst_wEjzHPtxQZqm9lsnfM1PHLAy', deleted=True, object='assistant.deleted')
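
The cell above removes only the assistant. To my understanding, the uploaded CSV and the files the assistant generated remain in the account's file storage until they are deleted separately. Below is a sketch of a fuller cleanup. The `cleanup` helper is hypothetical, and the demonstration uses a stub client so it runs offline; with the real client it would call `client.beta.assistants.delete` and `client.files.delete`.

```python
from types import SimpleNamespace

def cleanup(client, assistant_id, file_ids):
    """Delete the assistant, then each listed file; return the ids that were removed."""
    removed = []
    client.beta.assistants.delete(assistant_id)
    removed.append(assistant_id)
    for file_id in file_ids:
        client.files.delete(file_id)
        removed.append(file_id)
    return removed

# Offline demonstration: a stub client that records what it was asked to delete
deleted = []
fake_client = SimpleNamespace(
    beta=SimpleNamespace(assistants=SimpleNamespace(delete=deleted.append)),
    files=SimpleNamespace(delete=deleted.append),
)
removed = cleanup(fake_client, "asst_example", ["file_a", "file_b"])
print(removed)
```

In this notebook the call might look like `cleanup(client, my_assistant.id, [file__id])`, extended with the generated file IDs printed by `files_from_messages`.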


## Launch the Analyst Notebook 

In [10]:
# import nbformat
# from nbconvert.preprocessors import ExecutePreprocessor

# # Load the notebook you want to run
# with open('path_to_notebook.ipynb') as f:
#     nb = nbformat.read(f, as_version=4)

# # Execute the notebook
# ep = ExecutePreprocessor(timeout=600, kernel_name='python3')
# ep.preprocess(nb)

# # Optionally, save the executed notebook
# with open('executed_notebook.ipynb', 'w', encoding='utf-8') as f:
#     nbformat.write(nb, f)