## Scratch

Scratch pad for a RAG application.

In [1]:
import gradio as gr

Define a data dictionary for the LLM to reference. Format provided by GPT.

In [2]:
data_dictionary = """
Data Dictionary:
* YEAR: Year of the data
* NAMECAP: Total U.S. capacity
* USNGENAN: Total Generation/Electricity Generated
* USCO2AN: Total CO2 Emissions
* USCO2EQA: Total CO2 Equivalent Emissions
* USCO2RTA: CO2 Emission Rate
* USC2ERTA: CO2 Equivalent Emission Rate
"""

In [3]:
def respond(question):
    # Placeholder: echo the question back until the LLM call is wired in.
    # TODO: pass question to LLM function
    print(question)
    return question, question

gradio_interface = gr.Interface(
    fn = respond,
    inputs=[gr.Textbox(label='Enter question about U.S. Energy System')],
    outputs = [gr.Textbox(), gr.Plot()],
    title='U.S. Energy Data Q&A System'
)

# gradio_interface.launch()

The central component of this application is the LLM interacting with the data. I will start with the OpenAI API and would then like to transition to LangChain. Possibly also the ability to use models from HuggingFace.

In [5]:
import os
from openai import OpenAI
from dotenv import load_dotenv

In [6]:
load_dotenv()

True

In [7]:
client = OpenAI(api_key=os.getenv("API_KEY"))

In a previous version of the application, I passed the data dictionary to the LLM, and it used that to answer questions and create visuals. Since the CSV is so small, I might be able to convert it to JSON and include it in the API call. As I include larger data files, I will need a different approach. I'm going to test both variants (with and without the data included) and see how the results change. This is something I'll have to research more.
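A minimal sketch of the "include the data" variant, assuming the CSV is small enough to fit in the prompt. The CSV text below is a placeholder for illustration, not values from the real file, and `csv_to_json_records` is a hypothetical helper name.

```python
import csv
import io
import json

# Placeholder CSV text for illustration only; these are NOT values from the real file.
SAMPLE_CSV = "YEAR,NAMECAP\n2000,1.0\n2001,2.0\n"


def csv_to_json_records(csv_text: str) -> str:
    """Convert CSV text to a JSON array with one object per row."""
    # DictReader keys each row by the header line; all values stay strings.
    reader = csv.DictReader(io.StringIO(csv_text))
    return json.dumps(list(reader))


data_json = csv_to_json_records(SAMPLE_CSV)

# The JSON string can then be embedded in the prompt next to the data dictionary.
prompt_with_data = f"Data (JSON records):\n{data_json}"
print(prompt_with_data)
```

With a pandas dataframe already loaded, `data.to_json(orient="records")` produces the same one-object-per-row shape while keeping numeric types.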

Using this [prompt engineering guide](https://platform.openai.com/docs/guides/prompt-engineering) from OpenAI as reference. Will improve prompt over time.

A cool addition would be letting the user choose their level of expertise on the energy system, then adjusting the underlying prompt and how advanced/technical the response is.
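One way this could look, as a rough sketch: map each expertise level to an extra instruction and append it to the system message. The level names, the instruction wording, and `build_system_message` are all assumptions made up for illustration.

```python
# Hypothetical expertise levels and wording; adjust to taste.
EXPERTISE_INSTRUCTIONS = {
    "beginner": "Explain concepts in plain language and define any technical terms.",
    "intermediate": "Assume familiarity with basic energy terminology.",
    "expert": "Use technical language and focus on detailed statistics.",
}

BASE_SYSTEM_MESSAGE = "You are a helpful assistant skilled in data analysis and research."


def build_system_message(expertise: str = "intermediate") -> str:
    """Append the audience-specific instruction to the base system message."""
    # Fall back to the intermediate level if an unknown level is passed.
    instruction = EXPERTISE_INSTRUCTIONS.get(
        expertise, EXPERTISE_INSTRUCTIONS["intermediate"]
    )
    return f"{BASE_SYSTEM_MESSAGE} {instruction}"


print(build_system_message("expert"))
```

In the interface, the level could come from an extra input such as a dropdown and be passed through to the system message of the API call.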

In [11]:
question = None

In [12]:
prompt_no_data = f"""

You are an AI assistant specializing in data analysis and visualization around the energy system. You have access to 
a dataset with U.S. energy statistics in the following format:

{data_dictionary}

As an assistant, it is your job to support experts in their exploration and research on the energy system. We want this research
to be data-centric whenever possible. You will answer questions about the energy system and will reference the provided data whenever
relevant. 

For each query, you will do the following:
- Determine whether the question can be answered using the data you have access to, whether the data can support your answer, or whether the data is irrelevant to the question
- If data can answer the question:
    - Create a visualization using the Plotly library from Python to answer the question
    - Explain the visual and how the data being presented clearly answers the user's question
    - Provide an interpretation. Reference key statistics in your written answer
- If data can support your answer:
    - Provide a text response answering the question to the best of your knowledge
    - Create a visualization using the Plotly library from Python and the data at your disposal to support claims in your written answer
    - Reference the visualization in your written response, including any key statistics
- If data cannot answer or support your answer (i.e. is irrelevant to your response):
    - Answer the question to the best of your ability (this includes saying you do not know the answer)
    - Since the data cannot support your answer, we do not need to create a visualization of the provided data itself.
        Instead, using the Plotly library from Python, create a simple, fun/funny visual to display (e.g. smiley face)

Given the following question: "{question}"

1. Check the data to determine if it can be used to answer the question
2. Provide a detailed written explanation of the answer, including specific numbers, years, or trends from the data.
3. Decide if a visualization would be helpful to support your explanation/answer
4. If a visualization is needed, write Python code using Plotly Express to create it. The data is stored in a pandas dataframe, "data"
5. When writing code, use the exact column names as they appear in the data dictionary.
6. Format your response as follows:
    ANSWER: [Detailed, data-driven answer here including specific statistics from the data]
    VISUALIZATION:
    ```python
    [Python code here for visualization]
    ```
"""

In [15]:
response = client.chat.completions.create(
    model = 'gpt-3.5-turbo',
    messages=[
        {"role": "system", "content": "You are a helpful assistant skilled in data analysis and research. You provide data-driven answers whenever possible. You respond in a professional manner, but keep it light and humorous when the opportunity presents itself"},
        {"role": "user", "content": prompt_no_data}
    ]
)

In [18]:
len(response.choices)

1

In [21]:
response.choices[0].message.content

'ANSWER: "None"\n\nUnfortunately, the question provided does not specify any query or information that can be answered or supported by the available dataset on U.S. energy statistics. Therefore, we cannot utilize the data to provide a detailed response or visualization. If you have any specific questions or topics related to the energy system that you would like to explore further, feel free to ask!\n\nVISUALIZATION:\n```python\nimport plotly.express as px\nimport plotly.graph_objects as go\n\nfig = go.Figure()\n\nfig.add_trace(go.Scatter(\n    x=[1, 2, 3],\n    y=[1, 1, 1],\n    mode=\'markers\',\n    marker=dict(size=200, color=\'yellow\'),\n    showlegend=False\n))\n\nfig.update_layout(title=\'No Data Available\', xaxis_title=\'Data\', yaxis_title=\'Information\')\nfig.show()\n```'

In [26]:
response.choices[0].message.content.split('VISUALIZATION:')[1].split('```python')[1].split('```')[0].strip()

"import plotly.express as px\nimport plotly.graph_objects as go\n\nfig = go.Figure()\n\nfig.add_trace(go.Scatter(\n    x=[1, 2, 3],\n    y=[1, 1, 1],\n    mode='markers',\n    marker=dict(size=200, color='yellow'),\n    showlegend=False\n))\n\nfig.update_layout(title='No Data Available', xaxis_title='Data', yaxis_title='Information')\nfig.show()"

I want to be able to experiment with different prompts and see how everything is displayed in the Gradio interface. I will keep the Gradio interface at the end, and my prompt iterations will be tracked throughout the notebook. The `respond` function will house everything for the moment. I will adjust the prompt name in the function as I go.
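The chained `split` calls used above raise an `IndexError` whenever the model leaves out the `VISUALIZATION:` marker or the code fence. A more forgiving parser could look like the sketch below; `parse_response` is a hypothetical helper, and the sample reply is made up for illustration.

```python
import re
from typing import Optional, Tuple

FENCE = "`" * 3  # three backticks, built indirectly so this example is easy to embed

# Match a fenced block, with or without the "python" language tag.
CODE_PATTERN = re.compile(
    re.escape(FENCE) + r"(?:python)?\s*\n(.*?)" + re.escape(FENCE), re.DOTALL
)


def parse_response(content: str) -> Tuple[str, Optional[str]]:
    """Split a model reply into (answer, code); code is None if no fenced block is found."""
    # Everything before the VISUALIZATION marker is treated as the answer.
    answer_part, _, rest = content.partition("VISUALIZATION:")
    answer = answer_part.replace("ANSWER:", "", 1).strip()
    match = CODE_PATTERN.search(rest)
    code = match.group(1).strip() if match else None
    return answer, code


# Made-up reply in the format the prompt asks for.
reply = f"ANSWER: Capacity grew.\n\nVISUALIZATION:\n{FENCE}python\nfig = None\n{FENCE}"
answer, code = parse_response(reply)
print(answer)
print(code)
```

Returning `None` for the code lets the caller decide what to show when the model skips the visualization, instead of failing inside the Gradio callback.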

In [37]:
import plotly.express as px

In [29]:
import pandas as pd
data = pd.read_csv('./app_data/data.csv')
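`gr.Plot` expects a figure object (for example a Plotly figure), not a string of code, so code extracted from a response would have to be executed before its result is returned to the interface. A minimal sketch, assuming the generated code assigns its figure to a variable named `fig` and refers to the dataframe as `data`, as the prompt requests. `run_generated_code` is a hypothetical helper, and running model-written code with `exec` is unsafe outside a local scratch setting.

```python
from typing import Any, Optional


def run_generated_code(code: str, data: Any) -> Optional[Any]:
    """Execute generated plotting code and return the object it binds to `fig`, if any."""
    # Drop fig.show() so nothing opens in a browser; the UI renders the figure instead.
    lines = [line for line in code.splitlines() if line.strip() != "fig.show()"]
    # The prompt tells the model the dataframe is called "data", so expose it under that name.
    namespace = {"data": data}
    exec("\n".join(lines), namespace)  # unsafe for untrusted input; local experiments only
    return namespace.get("fig")


# Stand-in for model output; real output would build a Plotly figure from the dataframe.
generated = "fig = {'x': data['YEAR']}\nfig.show()"
fig = run_generated_code(generated, {"YEAR": [2000, 2001]})
print(fig)
```

Inside `respond`, the returned object (rather than the code string) would be the second return value, matching the `gr.Plot()` output.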

In [38]:
def respond(query):

    prompt = f"""

    You are an AI assistant specializing in data analysis and visualization around the energy system. You have access to 
    a dataset with U.S. energy statistics in the following format:

    {data_dictionary}

    As an assistant, it is your job to support experts in their exploration and research on the energy system. We want this research
    to be data-centric whenever possible. You will answer questions about the energy system and will reference the provided data whenever
    relevant. 

    For each query, you will do the following:
    - Determine whether the question can be answered using the data you have access to, whether the data can support your answer, or whether the data is irrelevant to the question
    - If data can answer the question:
        - Create a visualization using the Plotly library from Python to answer the question
        - Explain the visual and how the data being presented clearly answers the user's question
        - Provide an interpretation. Reference key statistics in your written answer
    - If data can support your answer:
        - Provide a text response answering the question to the best of your knowledge
        - Create a visualization using the Plotly library from Python and the data at your disposal to support claims in your written answer
        - Reference the visualization in your written response, including any key statistics
    - If data cannot answer or support your answer (i.e. is irrelevant to your response):
        - Answer the question to the best of your ability (this includes saying you do not know the answer)
        - Since the data cannot support your answer, we do not need to create a visualization of the provided data itself.
            Instead, using the Plotly library from Python, create a simple, fun/funny visual to display (e.g. smiley face)

    Given the following question: "{query}"

    1. Check the data to determine if it can be used to answer the question
    2. Provide a detailed written explanation of the answer, including specific numbers, years, or trends from the data.
    3. Decide if a visualization would be helpful to support your explanation/answer
    4. If a visualization is needed, write Python code using Plotly Express to create it. The data is stored in a pandas dataframe, "data"
    5. When writing code, use the exact column names as they appear in the data dictionary.
    6. Format your response as follows:
        ANSWER: [Detailed, data-driven answer here including specific statistics from the data]
        VISUALIZATION:
        ```python
        [Python code here for visualization]
        ```
    """

    response = client.chat.completions.create(
        model = 'gpt-3.5-turbo',
        messages=[
            {"role": "system", "content": "You are a helpful assistant skilled in data analysis and research. You provide data-driven answers whenever possible. You respond in a professional manner, but keep it light and humorous when the opportunity presents itself"},
            {"role": "user", "content": prompt}
        ]
    )
    
    content = response.choices[0].message.content
    print(content)
    answer = content.split("ANSWER:")[1].split("VISUALIZATION:")[0].strip()
    code = content.split('VISUALIZATION:')[1].split('```python')[1].split('```')[0].strip()
    print('Answer', answer)
    print('Code', code)
    
    return answer, code

gradio_interface = gr.Interface(
    fn = respond,
    inputs=[gr.Textbox(label='Enter question about U.S. Energy System')],
    outputs = [gr.Textbox(), gr.Plot()],
    title='U.S. Energy Data Q&A System'
)

gradio_interface.launch()

Running on local URL:  http://127.0.0.1:7868

To create a public link, set `share=True` in `launch()`.




ANSWER: The nameplate capacity refers to the total U.S. capacity in the energy system. To determine how it has changed over time, we can analyze the "NAMECAP" column in the dataset from various years. 

From the dataset provided, we observe the following trends in nameplate capacity:
- In 2010, the total U.S. capacity was around XXX GW
- By 2015, there was a significant increase in capacity to XXX GW
- In 2020, the nameplate capacity further rose to XXX GW

This indicates a general upward trend in nameplate capacity over the years, showcasing the growth in energy infrastructure within the U.S.

VISUALIZATION:
```python
import plotly.express as px

# Creating a line plot to visualize the change in nameplate capacity over time
fig = px.line(data, x='YEAR', y='NAMECAP', title='U.S. Nameplate Capacity Over Time')
fig.show()
```
This visualization will provide a clear visual representation of how the nameplate capacity has evolved from 2010 to 2020, highlighting the increasing trend over th

Traceback (most recent call last):
  File "/Users/mattcarr/opt/anaconda3/envs/rag/lib/python3.12/site-packages/gradio/queueing.py", line 541, in process_events
    response = await route_utils.call_process_api(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/mattcarr/opt/anaconda3/envs/rag/lib/python3.12/site-packages/gradio/route_utils.py", line 276, in call_process_api
    output = await app.get_blocks().process_api(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/mattcarr/opt/anaconda3/envs/rag/lib/python3.12/site-packages/gradio/blocks.py", line 1938, in process_api
    data = await self.postprocess_data(block_fn, result["prediction"], state)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/mattcarr/opt/anaconda3/envs/rag/lib/python3.12/site-packages/gradio/blocks.py", line 1761, in postprocess_data
    prediction_value = block.postprocess(prediction_value)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^