<a href="https://colab.research.google.com/github/CDAC-lab/isie2023/blob/main/tutorial-notebook-4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Overview

This notebook demonstrates how to read data from a data source and support natural language queries on top of it. As an example, we load a CSV file with Python's pandas library and generate various types of graphs based on user prompts.

## Table of Contents

1. [Introduction and Setting Up](#section1)
    - Introduction to the Notebook
    - Installing Necessary Libraries
    - Importing Libraries and Dependencies
2. [Loading the CSV Data](#section2)
    - Introduction to CSV Files
    - Reading CSV Data with Pandas
3. [Exploring the Data](#section3)
    - Introduction to the Dataset
    - Basic Data Analysis with Pandas
4. [Graph Generation](#section4)
    - Introduction to Data Visualization with Python
    - User Prompt for Graph Generation
5. [Conclusion and Possible Extensions](#section5)
    - Summary of Achievements
    - Potential Future Work
6. [References and Additional Resources](#section6)

## Install libraries

In [1]:
!pip install openai


# Introduction and Setting Up

## Introduction to the Notebook
Welcome to our notebook! This project reads a dataset from a CSV file and generates a variety of graphs based on user prompts. pandas handles the data, and the plotting code is generated by an OpenAI model that the prompt later in the notebook asks to use Plotly, giving an interactive data visualization experience.

## Installing Necessary Libraries
This section covers installing the libraries the notebook needs. The install cell earlier in the notebook adds the `openai` package; pandas, used for data handling, is already available on Colab.

## Importing Libraries and Dependencies
Here, we will import all the required Python libraries and dependencies that we'll be using in our notebook. This includes standard libraries for data handling and visualization, as well as any additional libraries we might need.

## Import libraries

In [2]:
#libraries for google drive authentication
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials


import pandas as pd
import openai
import re
import os

In [3]:
from getpass import getpass  # prompt for the key instead of hard-coding a secret
os.environ["OPENAI_API_KEY"] = getpass("OpenAI API key: ")

# Loading the CSV Data

## Introduction to CSV Files
CSV (Comma Separated Values) files are a common format for storing tabular data. In this section, we'll provide a brief overview of CSV files and how they're used to store and share data.

## Reading CSV Data with Pandas
Here, we'll walk you through the process of loading a CSV file into a pandas DataFrame, which will allow us to easily manipulate and analyze the data.
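
As a minimal sketch of what loading looks like, the example below reads CSV text from an in-memory buffer instead of a file on disk. The variable names `csv_text` and `demo_df` are illustrative assumptions; the two rows are copied from the dataset preview shown later in this notebook.

```python
import io

import pandas as pd

# Hypothetical in-memory stand-in for a CSV file on disk
csv_text = """Solar Site,Simulation,Prediction,Actual
Health Sciences 1,26050.766,25415.479,22126.585
Library,131889.003,128426.684,110282.812
"""

# read_csv accepts a file path or any file-like object
demo_df = pd.read_csv(io.StringIO(csv_text))

print(demo_df.shape)   # (rows, columns)
print(demo_df.dtypes)  # numeric columns are parsed as floats
```

Passing a path such as `"sample_data.csv"` instead of the buffer works the same way.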


## Load CSV file

In [14]:
# authenticate with your Google Drive credentials
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

# File ID of the dataset; the lines below download the data file from the shared location
file_id = '1z3gNK6PYhhWlFFuigfPA406gAVjntNjb'
sample_data = drive.CreateFile({'id':file_id})
sample_data.GetContentFile('sample_data.csv')

In [15]:
df = pd.read_csv(r"sample_data.csv")

In [16]:
df.head(5)

| Unnamed: 0 | Solar Site | Simulation | Prediction | Actual |
|---|---|---|---|---|
| 0 | Health Sciences 1 | 26050.766 | 25415.479 | 22126.585 |
| 1 | Health Sciences 2 | 11027.448 | 10739.996 | 10844.289 |
| 2 | Health Sciences 3 | 21911.317 | 21496.428 | 20349.937 |
| 3 | Humanities 03 | 25057.726 | 24384.783 | 24448.976 |
| 4 | Library | 131889.003 | 128426.684 | 110282.812 |


## Process CSV file

In [17]:
# Convert any column whose name contains "date" to a datetime type
for col in df.columns:
    # check if a column name contains date substring
    if 'date' in col.lower():
        df[col] = pd.to_datetime(df[col])
        # remove timestamp
        #df[col] = df[col].dt.date

# reset index
df = df.reset_index(drop=True)

# replace space with _ in column names
df.columns = df.columns.str.replace(' ', '_')

cols = df.columns
cols = ", ".join(cols)
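
To see what this processing step does, here is a small sketch on a toy DataFrame. The frame `toy` and its `Reading Date` column are hypothetical and exist only to exercise the date branch, since the columns in the preview shown earlier include no date column.

```python
import pandas as pd

# Hypothetical toy frame with one date-like column
toy = pd.DataFrame({
    "Reading Date": ["2023-01-01", "2023-01-02"],
    "Solar Site": ["Library", "Humanities 03"],
})

# Convert columns whose name contains "date" to datetime
for col in toy.columns:
    if "date" in col.lower():
        toy[col] = pd.to_datetime(toy[col])

# Replace spaces so the names are easier to reference in generated code
toy.columns = toy.columns.str.replace(" ", "_")

# Comma-separated column list, in the same form later passed to the prompt
cols_text = ", ".join(toy.columns)
print(cols_text)  # Reading_Date, Solar_Site
```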

# Exploring the Data

## Introduction to the Dataset
Once we've loaded the data, we'll provide an introduction to the dataset. This includes a description of what the data represents, as well as an overview of the various columns in the DataFrame.

## Basic Data Analysis with Pandas
Before we can generate graphs, it's important to explore and understand our data. In this section, we'll guide you through some basic data analysis techniques using pandas, such as calculating summary statistics and identifying any missing values.
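
A minimal sketch of these checks is shown below. It rebuilds the five rows shown by `df.head(5)` in a separate frame named `sample` (an illustrative name) so that it runs on its own; the `Abs_Error` column is an added example, not part of the dataset.

```python
import pandas as pd

# The five rows shown by df.head(5), with underscores in the column names
sample = pd.DataFrame({
    "Solar_Site": ["Health Sciences 1", "Health Sciences 2",
                   "Health Sciences 3", "Humanities 03", "Library"],
    "Simulation": [26050.766, 11027.448, 21911.317, 25057.726, 131889.003],
    "Prediction": [25415.479, 10739.996, 21496.428, 24384.783, 128426.684],
    "Actual": [22126.585, 10844.289, 20349.937, 24448.976, 110282.812],
})

# Summary statistics (count, mean, std, min, quartiles, max) per numeric column
summary = sample.describe()

# Number of missing values per column
missing = sample.isna().sum()

# Example derived column: absolute gap between prediction and actual
sample["Abs_Error"] = (sample["Prediction"] - sample["Actual"]).abs()

print(summary)
print(missing)
print(sample[["Solar_Site", "Abs_Error"]])
```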

# Graph Generation

## Introduction to Data Visualization with Python
Data visualization is a crucial part of any data analysis process. In this section, we'll provide a brief introduction to data visualization with Python, and discuss how libraries like matplotlib and seaborn can help us create beautiful and informative visualizations.

## User Prompt for Graph Generation
In this interactive section, we'll prompt the user to specify what type of graph they would like to generate from the data. We'll provide a variety of options, including bar graphs, pie charts, and scatter plots, and guide the user through the process of creating each type of graph.

# Conclusion and Possible Extensions

## Summary of Achievements
In this section, we'll summarize what we've achieved in this notebook, from loading a CSV file to generating user-specified graphs.

## Potential Future Work
The methods we've used here could be extended in a number of ways. We'll discuss some possibilities for future work, such as supporting additional types of graphs, adding more complex data analysis techniques, or expanding our notebook to handle different types of data files.

# References and Additional Resources
To wrap up the notebook, we'll provide a list of references and additional resources that you can use to further explore the topics covered in this notebook.

## Post Processing GPT response

In [18]:
def generate_gpt_reponse(gpt_input, max_tokens):

    # read the API key from the environment variable set earlier
    openai.api_key = os.environ["OPENAI_API_KEY"]

    completion = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        max_tokens=max_tokens,
        temperature=0,
        messages=[
            {"role": "user", "content": gpt_input},
        ]
    )

    gpt_response = completion.choices[0].message['content'].strip()
    return gpt_response



def extract_code(gpt_response):
    """Extract the first fenced code block from a GPT response."""

    if "```" in gpt_response:
        # capture the text between the fences, skipping an optional
        # "python" language tag on the opening fence only
        pattern = r'```(?:python\b)?\s*(.*?)```'
        code = re.search(pattern, gpt_response, re.DOTALL)
        if code is None:
            # unmatched fence: fall back to the raw response
            return gpt_response
        return code.group(1)
    else:
        return gpt_response
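
To illustrate the extraction step on a typical reply, here is a self-contained sketch. The helper `extract_fenced_code` and the sample `reply` are hypothetical and separate from the notebook's own function; the fence marker is built programmatically only so the example can itself sit inside a fenced block.

```python
import re

FENCE = "`" * 3  # three backticks


def extract_fenced_code(text):
    """Return the body of the first fenced block, or the text unchanged."""
    # Skip an optional "python" tag on the opening fence, then capture lazily
    pattern = FENCE + r"(?:python)?[ \t]*\n(.*?)" + FENCE
    match = re.search(pattern, text, re.DOTALL)
    return match.group(1).strip() if match else text.strip()


# Hypothetical model reply: prose followed by a fenced code block
reply = (
    "Here is the chart:\n"
    + FENCE + "python\n"
    + "label = 'python'\n"
    + "fig = px.bar(df, x='Solar_Site', y='Actual', title=label)\n"
    + FENCE
)

code = extract_fenced_code(reply)
print(code)
```

Only the language tag on the opening fence is dropped, so the word "python" inside the code itself is left intact.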

## Construct the Prompt

In [19]:
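def create_plot(user_input, cols):
    """Ask GPT for Plotly code that answers user_input, then run it on df."""
    # the prompt forbids import statements, so provide Plotly to the generated code
    import plotly.express as px
    import plotly.graph_objects as go

    prompt = (
        'Write code in Python using Plotly to address the following request: {} '
        'Use df that has the following columns: {}. '
        'Do not use animation_group argument and return '
        'only code with no import statements and the data '
        'has been already loaded in a df variable.'
    ).format(user_input, cols)

    gpt_response = generate_gpt_reponse(prompt, max_tokens=1500)
    extracted_code = extract_code(gpt_response)
    # exec runs model-generated code unreviewed; use only in a trusted environment
    exec(extracted_code, {"df": df, "pd": pd, "px": px, "go": go})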

In [21]:
user_input = "draw a bar chart on simulations data and a line chart on predictions in the same plot" #@param {type:"string"}
create_plot(user_input,cols)
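
Because `create_plot` executes whatever the model returns, it can help to inspect the code before running it. The sketch below is one possible check, assuming the standard-library `ast` module is an acceptable way to verify the prompt's "no import statements" instruction. `has_imports` is a hypothetical helper, and this check is not a sandbox: it does not make arbitrary generated code safe.

```python
import ast


def has_imports(source):
    """Return True if the source contains any import statement."""
    tree = ast.parse(source)  # raises SyntaxError if the code does not parse
    return any(
        isinstance(node, (ast.Import, ast.ImportFrom))
        for node in ast.walk(tree)
    )


# Hypothetical examples of generated code
clean_code = "fig = px.bar(df, x='Solar_Site', y='Actual')\nfig.show()"
code_with_import = "import os\nprint(os.getcwd())"

print(has_imports(clean_code))        # False
print(has_imports(code_with_import))  # True
```

A caller could skip `exec` and show the code to the user whenever `has_imports` returns `True` or `ast.parse` raises `SyntaxError`.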