# Ask the Dataset

This Jupyter Notebook demonstrates the backend logic of a Data Analysis Chatbot. Built on OpenAI's GPT model, the tool takes the user's API key, the paths to one or more CSV files, and the user's questions about the data. The loaded data is sent to the model together with each question, and the model returns its analysis in a natural, conversational manner.


In [25]:
#| default_exp AskTheDataset

In [26]:
#| export
import pandas as pd
import os
import getpass
import openai
from openai import OpenAI
import gradio as gr
from io import StringIO
import sys

In [27]:
#| export
class CSVFileManager:
    """This class is responsible for handling CSV file operations. 
    It can read one or more CSV files, either from file paths or file-like objects, and concatenate them into a
    single pandas DataFrame.
    """
    def __init__(self):
        self.data_frame = None

    def load_data(self, files):
        """Read each CSV (path or file-like object) and concatenate them into one DataFrame."""
        if not files:
            raise ValueError("No files provided.")

        # pd.read_csv accepts both string paths and file-like objects,
        # so no type check is needed.
        data_frames = [pd.read_csv(file) for file in files]

        self.data_frame = pd.concat(data_frames, ignore_index=True)
        return self.data_frame

class GPTQuestionsHandler:
    """This class interfaces with OpenAI's GPT model. 
    It sends user queries about the dataset to the GPT model and retrieves responses.
    """
    
    def __init__(self, api_key):
        # Initialize the OpenAI API client with the key passed in,
        # rather than writing the key to the process environment.
        self.api_key = api_key
        self.client = OpenAI(api_key=api_key)

    def ask_gpt(self, history, data):
        """
        Sends the conversation and the data to the OpenAI API and returns the reply.

        Arguments:
            history: list of {"role", "content"} message dicts, ending with the
                user's latest question. The assistant's reply is appended in place.
            data: the dataset serialized as a string.
        Returns the text of the model's response.
        """
        
        system_message = {
            "role": "system", 
            "content":("You are a helpful assistant skilled at data science and data analysis. "
                          "You are an expert at reading files, interpreting them and also writing python codes. "
                          "Here is the data you need to work with:\n" + data)}
        
        # Create a copy of history to include system message for API call
        messages_for_api = [system_message] + history

        response = self.client.chat.completions.create(
            model="gpt-3.5-turbo-1106",
            temperature=0.7,
            top_p=1,
            frequency_penalty=0,
            presence_penalty=0,
            messages=messages_for_api)
        answer = response.choices[0].message.content
        # Update the conversation history with the new response
        history.append({"role": "assistant", "content": answer})
        
        return answer
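
The `CSVFileManager` docstring says it accepts file paths or file-like objects. As a minimal sketch of that behavior outside the class, the snippet below reads two in-memory CSVs and concatenates them. The data and the `Sentence` column name are made up for illustration; only the `Label` column name appears in the outputs further down.

```python
from io import StringIO

import pandas as pd

# Two small in-memory CSVs standing in for files on disk (made-up data).
csv_a = StringIO("Sentence,Label\nHello there,OTR\nWell done,PRS\n")
csv_b = StringIO("Sentence,Label\nGood morning,OTR\n")

# pd.read_csv accepts both paths and file-like objects, so one call covers both cases.
frames = [pd.read_csv(source) for source in (csv_a, csv_b)]

# ignore_index=True renumbers the rows 0..n-1 instead of repeating each file's own index.
combined = pd.concat(frames, ignore_index=True)

print(combined.shape)
print(list(combined.index))
```

Passing a list of path strings instead of `StringIO` objects would go through the same `pd.read_csv` call.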
    


In [32]:
#| export
def main():
    try:
        file_paths_input = input("Enter CSV file paths, separated by a comma: ")
        file_paths = [path.strip() for path in file_paths_input.split(',')]
    except Exception:
        # Fall back to a sample file when input is unavailable (e.g. non-interactive runs)
        file_paths = ['../data/009-1.csv']

    data_manager = CSVFileManager()
    data_frame = data_manager.load_data(file_paths)

    api_key = getpass.getpass("Enter OpenAI API Key: ")
    ai_handler = GPTQuestionsHandler(api_key)

    # Initialize the conversation history
    history = []
    try:
        while True:
            user_question = input("Please enter your question about the data: ")
            # Update the history with the user's question
            history.append({"role": "user", "content": user_question})

            answer = ai_handler.ask_gpt(history, data_frame.to_string(index=False))

            print("Question: ", user_question,"\n")
            print("Answer:", answer, "\n")
            
            # Ask the user if they want to continue or exit
            continue_chat = input("Do you have another question? (yes/no): ")
            if continue_chat.lower() == 'no':
                print("Exiting the chat.")
                break
            
    except Exception as e:
        print("An error occurred. Please try again.",e)


if __name__ == "__main__":
    main()



Question:  How many rows are there? 

Answer: The given dataset has 36 rows. 

Question:  Give me the line of code in python to calculate this 

Answer: Certainly! You can use the following line of code in Python to calculate the number of rows in a dataset using the pandas library:

```python
import pandas as pd

# Load the dataset into a pandas DataFrame
data = pd.read_csv('your_file.csv')  # Replace 'your_file.csv' with the actual file name

# Calculate the number of rows
num_rows = data.shape[0]
print(num_rows)
```

Replace 'your_file.csv' with the actual file name or file path of the dataset. This code will load the dataset into a pandas DataFrame and then calculate the number of rows in the dataset. 

Exiting the chat.


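The output above shows that the model can answer follow-up questions: "this" in the second question refers back to the row count from the first. The model itself keeps no state between API calls. The context comes from the `history` list, which is sent in full with every request.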
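
The way the conversation history grows can be sketched without calling the API. In the snippet below, `build_messages` mirrors how `ask_gpt` prepends the system message, while `fake_model` is a hypothetical stand-in for the real API call and the data string is made up.

```python
def build_messages(history, data):
    """Prepend a system message carrying the data to the running history."""
    system_message = {
        "role": "system",
        "content": "Here is the data you need to work with:\n" + data,
    }
    return [system_message] + history


def fake_model(messages):
    """Hypothetical stand-in for the API call; reports how many messages it saw."""
    return f"I received {len(messages)} messages."


history = []
data = "Label\nPRS\nOTR"  # made-up data string

for question in ["How many rows are there?", "Give me the code to calculate this"]:
    # Each turn adds the user's question, sends everything so far, then stores the reply.
    history.append({"role": "user", "content": question})
    messages = build_messages(history, data)
    answer = fake_model(messages)
    history.append({"role": "assistant", "content": answer})

print([m["role"] for m in history])
print(history[-1]["content"])
```

On the second turn the stand-in sees four messages: the system message, the first question, the first answer, and the follow-up. That accumulated list is what lets a follow-up question be understood in context.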

### Commentary on code functionality

#### Analysis of GPT Model Responses with examples


In [35]:
#| hide
def main():
    file_paths_input = '../data/009-1.csv, ../data/009-2.csv'
    file_paths = [path.strip() for path in file_paths_input.split(',')]

    data_manager = CSVFileManager()
    data_frame = data_manager.load_data(file_paths)

    try:
        api_key = getpass.getpass("Enter OpenAI API Key: ")
        ai_handler = GPTQuestionsHandler(api_key)
    except Exception as e:
        print(f"An error occurred: {e}")
        return

    # Initialize the conversation history
    history = []
    try:
        while True:
            user_question = "Can you provide examples of sentences labeled as 'PRS'"
            # Update the history with the user's question
            history.append({"role": "user", "content": user_question})

            answer = ai_handler.ask_gpt(history, data_frame.to_string(index=False))

            print("Question: ", user_question,"\n")
            print("Answer:", answer, "\n")
            
            # Ask the user if they want to continue or exit
            continue_chat = "no"
            if continue_chat.lower() == 'no':
                print("Exiting the chat.")
                break
            
    except Exception as e:
        print("An error occurred. Please try again.",e)

if __name__ == "__main__":
    main()


Question:  Can you provide examples of sentences labeled as 'PRS' 

Answer: Sure! Here are the examples of sentences labeled as 'PRS' from the given data:

1. Great example, 'run' is a verb because it is an action.
2. Our sentence is now: 'The fast cat runs'.
3. Great job, that sentence definitely needed an exclamation mark!
4. Great, now our sentence is: 'She reads a book'.
5. Great, now our sentence is: 'She runs quickly'.
6. Well done, everyone! You have learned a lot today.
7. Wonderful! That sentence is showing strong emotion, so it is an exclamatory sentence.
8. Great, now our sentence is: 'The book on the table'.
9. Well done, everyone! You have learned a lot today.
10. I am looking forward to reading your paragraphs.
11. Have a great day, and remember to keep practicing your English!

These are the examples of sentences labeled as 'PRS' from the given data. 

Exiting the chat.


In [37]:
#| hide
def main():
    file_paths_input = '../data/009-1.csv, ../data/009-2.csv'
    file_paths = [path.strip() for path in file_paths_input.split(',')]

    data_manager = CSVFileManager()
    data_frame = data_manager.load_data(file_paths)

    try:
        api_key = getpass.getpass("Enter OpenAI API Key: ")
        ai_handler = GPTQuestionsHandler(api_key)
    except Exception as e:
        print(f"An error occurred: {e}")
        return

    # Initialize the conversation history
    history = []
    try:
        while True:
            user_question = "What are the unique labels in the label column?"
            # Update the history with the user's question
            history.append({"role": "user", "content": user_question})

            answer = ai_handler.ask_gpt(history, data_frame.to_string(index=False))

            print("Question: ", user_question,"\n")
            print("Answer:", answer, "\n")
            
            # Ask the user if they want to continue or exit
            continue_chat = "no"
            if continue_chat.lower() == 'no':
                print("Exiting the chat.")
                break
            
    except Exception as e:
        print("An error occurred. Please try again.",e)

if __name__ == "__main__":
    main()


Question:  What are the unique labels in the label column? 

Answer: The unique labels in the "Label" column are:
- PRS
- OTR
- NaN 

Exiting the chat.


1. Response to "What are the unique labels in the label column?"
The model lists three values in the "Label" column: 'PRS', 'OTR', and 'NaN'.
'NaN' appears alongside the labels. Because pandas uses 'NaN' to represent missing data, this entry most likely corresponds to rows with no label rather than to a third category.
The response indicates the model's ability to interpret and categorize data based on the information provided.

2. Response to "Can you provide examples of sentences labeled as 'PRS'"
The model provided a list of sentences which are assumed to be labeled as 'PRS'.
The model's response demonstrates its ability to extract and present data based on specified criteria.

From the above two examples, we can see that overall, the model shows a strong understanding of the context and structure of the data. It can differentiate between labels and provide relevant examples. The provided examples seem relevant and accurately categorized based on the 'PRS' label. This indicates the model's effectiveness in filtering and presenting data as per user queries.
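
Answers like the two above can be cross-checked directly in pandas. The following is a minimal sketch on made-up rows, assuming columns named `Sentence` and `Label`; the real files may use different names.

```python
import numpy as np
import pandas as pd

# Made-up rows; the column names "Sentence" and "Label" are assumptions.
df = pd.DataFrame(
    {
        "Sentence": ["Well done, everyone!", "Open your books.", "Hmm."],
        "Label": ["PRS", "OTR", np.nan],
    }
)

# unique() keeps NaN, so a missing value shows up alongside the real labels.
all_values = df["Label"].unique().tolist()

# dropna() removes missing values first, leaving only actual labels.
real_labels = df["Label"].dropna().unique().tolist()

# Boolean indexing returns exactly the rows labeled 'PRS'.
prs_sentences = df.loc[df["Label"] == "PRS", "Sentence"].tolist()

print(real_labels)
print(prs_sentences)
```

Comparing `prs_sentences` with the model's list is a quick way to confirm that no sentence was dropped, duplicated, or mislabeled in the model's answer.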


While the model can answer questions about the data supplied in the prompt and can write Python code to perform an analysis, it cannot execute that code or produce visualizations.
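
The model only sees the data as the text produced by `to_string(index=False)`, so the prompt grows with the dataset. The sketch below, on made-up rows with assumed column names, shows that serialization and compares it with a one-row preview. Sending only a preview is an illustrative alternative, not something the notebook does.

```python
import pandas as pd

# Made-up rows; the column names are assumptions for illustration.
df = pd.DataFrame(
    {
        "Sentence": ["Well done", "Open your books"],
        "Label": ["PRS", "OTR"],
    }
)

# This is the same serialization the notebook passes to ask_gpt.
data_string = df.to_string(index=False)
print(data_string)
print(len(data_string), "characters")

# A one-row preview is shorter, but the model can no longer see the other rows.
preview = df.head(1).to_string(index=False)
print(len(preview), "characters")
```

The trade-off is direct: anything left out of the string is invisible to the model, so questions such as row counts can only be answered for the rows actually sent.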


The model's responses to queries about the dataset showcase its capabilities in data analysis, context understanding, and relevant information extraction. The model effectively interprets and categorizes data, providing coherent and contextually appropriate responses. 

Moreover, it can draw on the conversation history to answer follow-up questions.

These capabilities make it a valuable tool for gaining insights from structured data, such as CSV files, although it's important to remember its limitations.


In [38]:
#| hide
import nbdev; nbdev.nbdev_export()