# Ask the Dataset

This Jupyter notebook demonstrates the backend logic of a data analysis chatbot. Built on an OpenAI GPT model, the tool takes the user's API key, the paths to one or more CSV files, and a question about the data. The contents of the files are passed to the model as text, and the model answers the question in a natural, conversational manner.


In [1]:
#| default_exp AskTheDataset

In [2]:
#| export
import pandas as pd
import os
import getpass
import openai
from openai import OpenAI
import gradio as gr
from io import StringIO

In [3]:
#| export
class CSVFileManager:
    """This class is responsible for handling CSV file operations. 
    It can read one or more CSV files, either from file paths or file-like objects, and concatenate them into a
    single pandas DataFrame.
    """
    def __init__(self):
        self.data_frame = None

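    def load_data(self, files):
        """Reads each CSV in `files` (paths or file-like objects) and returns
        them concatenated into a single DataFrame."""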
        if not files:
            raise ValueError("No files provided.")

        data_frames = []
        for file in files:
            # pd.read_csv accepts both file paths and file-like objects
            df = pd.read_csv(file)
            data_frames.append(df)

        self.data_frame = pd.concat(data_frames, ignore_index=True)
        return self.data_frame

class GPTQuestionsHandler:
    """This class interfaces with OpenAI's GPT model. 
    It sends user queries about the dataset to the GPT model and retrieves responses.
    """
    
    def __init__(self, api_key):

        # Initialize the OpenAI API client with the supplied key
        self.api_key = api_key
        self.client = OpenAI(api_key=api_key)

    def ask_gpt(self, question, data):
        """
        Sends a question and the dataset to the OpenAI API and returns the response.

        Arguments:
            question: The user's question about the data.
            data: The dataset, serialized as a string.

        Returns:
            The text of the model's reply.
        """

        system_message = ("You are a helpful assistant skilled at data science and data analysis. "
                          "You are an expert at reading files, interpreting them and also writing Python code. "
                          "Here is the data you need to work with:\n" + data)
        response = self.client.chat.completions.create(
            model="gpt-3.5-turbo-1106",
            messages=[
                {"role": "system", "content": system_message},
                {"role": "user", "content": question}
            ])
        return response.choices[0].message.content
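
The concatenation step in `CSVFileManager.load_data` can be illustrated on its own. The sketch below is a minimal example, assuming two small in-memory CSV files with made-up contents and a hypothetical `Sentence` column (only the `Label` column name comes from this notebook). It reads them through `StringIO` file-like objects and stacks them with `pd.concat(..., ignore_index=True)`, which renumbers the rows of the combined frame from 0.

```python
import pandas as pd
from io import StringIO

# Made-up CSV contents for illustration; not the notebook's data
csv_a = StringIO("Sentence,Label\nShe reads a book.,PRS\nOpen your notebooks.,OTR\n")
csv_b = StringIO("Sentence,Label\nShe runs quickly.,PRS\n")

# pd.read_csv accepts file-like objects as well as paths
frames = [pd.read_csv(f) for f in (csv_a, csv_b)]

# ignore_index=True discards each file's own row index and renumbers the rows
combined = pd.concat(frames, ignore_index=True)
print(combined.shape)
print(list(combined.index))
```

Without `ignore_index=True`, each file would keep its own row numbers and the combined index would contain repeated values.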


In [5]:
#| export
def main():
    file_paths_input = input("Enter CSV file paths, separated by a comma: ")
    file_paths = [path.strip() for path in file_paths_input.split(',')]

    data_manager = CSVFileManager()
    data_frame = data_manager.load_data(file_paths)


    api_key = getpass.getpass("Enter OpenAI API Key: ")
    ai_handler = GPTQuestionsHandler(api_key)

    user_question = input("Please enter your question about the data: ")
    answer = ai_handler.ask_gpt(user_question, data_frame.to_string(index=False))

    print("Question: ", user_question,"\n")
    print("Answer:", answer, "\n")

if __name__ == "__main__":
    main()


Question:  How many rows are in the data? Also give me a python code to find this out for any dataset 

Answer: The data contains 33 rows. Here's a Python code snippet to find the number of rows in any dataset using pandas:

```python
import pandas as pd

# Load the dataset
data = pd.read_csv("your_dataset.csv")  # Replace "your_dataset.csv" with the actual file name

# Get the number of rows
num_rows = data.shape[0]
print("Number of rows:", num_rows)
```

Replace "your_dataset.csv" with the actual file name of your dataset. This code will load the dataset into a pandas DataFrame and then print the number of rows in the dataset. 



### Commentary on code functionality

#### Analysis of GPT Model Responses with examples


In [6]:
#| hide
def main():
    file_paths_input = '../data/009-1.csv, ../data/009-2.csv'
    file_paths = [path.strip() for path in file_paths_input.split(',')]

    data_manager = CSVFileManager()
    data_frame = data_manager.load_data(file_paths)


    api_key = getpass.getpass("Enter OpenAI API Key: ")
    ai_handler = GPTQuestionsHandler(api_key)

    user_question = "Can you provide examples of sentences labeled as 'PRS'"
    answer = ai_handler.ask_gpt(user_question, data_frame.to_string(index=False))

    print("Question: ", user_question,"\n")
    print("Answer:", answer, "\n")

if __name__ == "__main__":
    main()


Question:  Can you provide examples of sentences labeled as 'PRS' 

Answer: Sure! Here are some examples of sentences labeled as 'PRS' from the provided data:

1. Great example, 'run' is a verb because it is an action. 
2. Great job, that sentence definitely needed an exclamation mark!
3. She reads a book.
4. She runs quickly.
5. That's a perfect example! Declarative sentences end with a period.
6. Great job! That is indeed an interrogative sentence.
7. Wonderful! That sentence is showing strong emotion, so it is an exclamatory sentence.
8. Well done, everyone! You have learned a lot today.
9. Well done, everyone! You have learned a lot today.
10. I am looking forward to reading your stories.
11. Well done, everyone! You have learned a lot today.
12. I am looking forward to reading your paragraphs.
13. Have a great day, and remember to keep practicing your English! 



In [14]:
#| hide
def main():
    file_paths_input = '../data/009-1.csv'
    file_paths = [path.strip() for path in file_paths_input.split(',')]

    data_manager = CSVFileManager()
    data_frame = data_manager.load_data(file_paths)


    api_key = getpass.getpass("Enter OpenAI API Key: ")
    ai_handler = GPTQuestionsHandler(api_key)

    user_question = "What are the unique labels in the Label column?"
    answer = ai_handler.ask_gpt(user_question, data_frame.to_string(index=False))

    print("Question: ", user_question,"\n")
    print("Answer:", answer, "\n")

if __name__ == "__main__":
    main()


Question:  What are the unique labels in the Label column? 

Answer: The unique labels in the "Label" column are: PRS, OTR, and NaN. 



1. Response to "What are the unique labels in the Label column?"
The model reports three unique labels: 'PRS', 'OTR', and 'NaN'.
'NaN' appears in the answer because pandas renders missing values as NaN when the DataFrame is converted to text. It marks rows that have no label rather than a third category, and the model lists it alongside the real labels without making that distinction.
The response indicates the model's ability to read and categorize the labels in the table it was given.

2. Response to "Can you provide examples of sentences labeled as 'PRS'"
The model returned a list of 13 sentences that it presents as labeled 'PRS'. One sentence ("Well done, everyone! You have learned a lot today.") appears three times in the list.
The model's response demonstrates its ability to extract and present data based on specified criteria.

From the two examples above, the model appears to follow the context and structure of the data: it distinguishes between labels and returns examples for a requested label. The examples look plausible for the 'PRS' label, but they are not checked against the data in this notebook, so they indicate the model's ability to filter and present data in response to user queries rather than confirm it.
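
The model's answers can be checked directly with pandas. The following is a minimal sketch, not the notebook's own code: it uses a made-up four-row CSV in place of the real files, and the `Sentence` column name is an assumption (only `Label` is named in the notebook). It lists the unique labels, counts missing ones, and filters the rows labeled 'PRS'.

```python
import pandas as pd
from io import StringIO

# Made-up stand-in for the real data; the "Sentence" column name is hypothetical
csv_text = (
    "Sentence,Label\n"
    "She reads a book.,PRS\n"
    "Open your notebooks.,OTR\n"
    "She runs quickly.,PRS\n"
    "Okay.,\n"  # empty label, read by pandas as NaN
)
df = pd.read_csv(StringIO(csv_text))

# Unique values in the Label column, including the missing-value marker
print(df["Label"].unique())

# Real labels only, plus a count of rows with no label
real_labels = sorted(df["Label"].dropna().unique())
missing_count = int(df["Label"].isna().sum())

# Sentences whose label is exactly 'PRS'
prs_sentences = df.loc[df["Label"] == "PRS", "Sentence"].tolist()
print(real_labels, missing_count, prs_sentences)
```

Comparing `prs_sentences` with the model's list would show whether every returned sentence really carries the 'PRS' label and whether any were left out.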


While the model can answer questions about the data included in its prompt and can write Python code for an analysis, it cannot execute that code or produce visualizations.
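
The model only ever sees the table as text. As a minimal sketch, using a made-up two-row frame with a hypothetical `Sentence` column, this is the kind of string that `to_string(index=False)` (the call used in `main`) produces before it is appended to the system message:

```python
import pandas as pd

# Made-up frame for illustration; the second row has a missing label
df = pd.DataFrame({
    "Sentence": ["She reads a book.", "Okay."],
    "Label": ["PRS", float("nan")],
})

# Same serialization call that main() uses before sending data to the model
prompt_text = df.to_string(index=False)
print(prompt_text)
print(len(prompt_text), "characters")
```

Every row of the DataFrame is serialized this way, so the prompt grows with the size of the dataset, and missing values reach the model as the literal text `NaN`.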


The model's responses to queries about the dataset showcase its capabilities in data analysis, context understanding, and relevant information extraction. It interprets and categorizes the data it is given and provides coherent, contextually appropriate responses. These capabilities make it a useful tool for gaining insights from structured data such as CSV files, provided its limitations are kept in mind: it works only from the text in its prompt and does not run any code itself.


In [15]:
#| hide
import nbdev; nbdev.nbdev_export()