Crash "Invalid input data. Must be a Pandas or Polars dataframe" on "row" question #465

Closed
PavelAgurov opened this issue Aug 19, 2023 · 22 comments
Labels
bug Something isn't working

Comments

@PavelAgurov
Contributor

PavelAgurov commented Aug 19, 2023

🐛 Describe the bug

I use the Titanic dataset (attached).

The model is gpt-3.5-turbo.

MODEL_NAME = "gpt-3.5-turbo" # gpt-3.5-turbo-16k
llm = OpenAI(api_token=LLM_OPENAI_API_KEY, model=MODEL_NAME, temperature=0, max_tokens=1000)

Load data:
df = pd.read_csv('./data_examples/titanic.csv')

Run:

    smart_df = SmartDataframe(
        df,
        config={
            "llm": llm,
            "conversational": False,
            "enable_cache": True,
            "middlewares": [StreamlitMiddleware(), ChartsMiddleware()],
        },
        logger=logger,
    )
    with get_openai_callback() as cb:
        result = smart_df.chat(question)

The question is "what is first row?"

Result: a crash.

Error: Invalid input data. Must be a Pandas or Polars dataframe.. Track:
Traceback (most recent call last):
  File "C:\DiskD\GptPOCs\AskYourDataPOC\main.py", line 142, in <module>
    result = smart_df.chat(question)
  File "d:\Anaconda3\lib\site-packages\pandasai\smart_dataframe\__init__.py", line 167, in chat
    return self.dl.chat(query)
  File "d:\Anaconda3\lib\site-packages\pandasai\smart_datalake\__init__.py", line 329, in chat
    return self._format_results(result)
  File "d:\Anaconda3\lib\site-packages\pandasai\smart_datalake\__init__.py", line 356, in _format_results
    return SmartDataframe(
  File "d:\Anaconda3\lib\site-packages\pandasai\smart_dataframe\__init__.py", line 68, in __init__
    self._load_engine()
  File "d:\Anaconda3\lib\site-packages\pandasai\smart_dataframe\__init__.py", line 119, in _load_engine
    raise ValueError(
ValueError: Invalid input data. Must be a Pandas or Polars dataframe.

Trace log:

Question: what is first row?
Running PandasAI with openai LLM...
Prompt ID: cd2cb999-52ae-4e94-ad07-f7cd56442874
Using cached response

                    Code generated:
                    
                    # TODO import all the dependencies required
import pandas as pd

# Analyze the data
# 1. Prepare: Preprocessing and cleaning data if necessary
# 2. Process: Manipulating data for analysis (grouping, filtering, aggregating, etc.)
# 3. Analyze: Conducting the actual analysis (if the user asks to create a chart save it to an image in exports/charts/temp_chart.png and do not show the chart.)
# 4. Output: return a dictionary of:
# - type (possible values "text", "number", "dataframe", "plot")
# - value (can be a string, a dataframe or the path of the plot, NOT a dictionary)
# Example output: { "type": "text", "value": "The average loan amount is $15,000." }
def analyze_data(dfs: list[pd.DataFrame]) -> dict:
    # Code goes here (do not add comments)
    first_row = dfs[0].iloc[0]
    return {"type": "dataframe", "value": first_row}

# Declare a result variable
result = analyze_data(dfs)
                    


Code running:

def analyze_data(dfs: list[pd.DataFrame]) ->dict:
    first_row = dfs[0].iloc[0]
    return {'type': 'dataframe', 'value': first_row}


result = analyze_data(dfs)
        
Executed in: 0.00791025161743164s

conversational can be True or False; it crashes in both cases.

@PavelAgurov
Contributor Author

titanic.csv

@PavelAgurov
Contributor Author

Based on the code in pandasai\smart_datalake\__init__.py (line 325):

        self._logger.log(f"Executed in: {time.time() - self._start_time}s")

        self._add_result_to_memory(result)

        return self._format_results(result)

And the method _add_result_to_memory:

        if result["type"] == "string":
            self._memory.add(result["result"], False)
        elif result["type"] == "dataframe":
            self._memory.add("Here is the data you requested.", False)
        elif result["type"] == "plot" or result["type"] == "image":
            self._memory.add("Here is the plot you requested.", False)

I checked smart_df.datalake._memory.get_conversation() and see the correct message:

User: give me all names Bot: Here is the data you requested.

This means that result['type'] is "dataframe".

I also do not get the exception "Invalid input data. We cannot convert it to a dataframe." from pandasai\smart_dataframe\__init__.py.
That also suggests that df is a DataFrame.

@PavelAgurov
Contributor Author

I think the problem is in pandasai\smart_dataframe\__init__.py:

    def _load_df(self, df: DataFrameType):
        """
        Load a dataframe into the smart dataframe

        Args:
            df (DataFrameType): Pandas or Polars dataframe or path to a file
        """
        if isinstance(df, str):
            self._df = self._import_from_file(df)
        elif isinstance(df, (list, dict)):
            # if the list can be converted to a dataframe, convert it
            # otherwise, raise an error
            try:
                self._df = pd.DataFrame(df)
            except ValueError:
                raise ValueError(
                    "Invalid input data. We cannot convert it to a dataframe."
                )
        else:
            self._df = df

Let's check it:

df = pd.read_csv('./data_examples/titanic.csv')
print(f'{isinstance(df, pd.DataFrame)} {isinstance(df, (list, dict))}')

Output: True False

This means that elif isinstance(df, (list, dict)) is a bug. It should be elif isinstance(df, pd.DataFrame):

@PavelAgurov
Contributor Author

PavelAgurov commented Aug 19, 2023

Or maybe even this version, which checks for pandas and polars without importing polars here:


    def _load_df(self, df: DataFrameType):
        """
        Load a dataframe into the smart dataframe

        Args:
            df (DataFrameType): Pandas or Polars dataframe or path to a file
        """
        if isinstance(df, str):
            self._df = self._import_from_file(df)
        elif 'DataFrame' in type(df).__name__:  # stupid way to check pandas and polars
            # if the list can be converted to a dataframe, convert it
            # otherwise, raise an error
            try:
                self._df = pd.DataFrame(df)
            except ValueError:
                raise ValueError(
                    "Invalid input data. We cannot convert it to a dataframe."
                )
        else:
            self._df = df

But in this case I'm not sure what to do with pd.Series. Maybe it is better to import polars here and check it the normal way.

Or do not raise an exception at all: just keep the original df if it's not a string and we can't load it as a DataFrame.
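
The concern about pd.Series with a name-based check can be probed without pandasai at all. The following is a minimal sketch, assuming only pandas is installed; looks_like_dataframe is a hypothetical helper written for illustration, not a pandasai function.

```python
import pandas as pd


def looks_like_dataframe(obj) -> bool:
    # Hypothetical helper: match on the class name so that both
    # pandas.DataFrame and polars.DataFrame pass without importing polars.
    return "DataFrame" in type(obj).__name__


df = pd.DataFrame({"Name": ["Braund, Mr. Owen Harris"], "Age": [22.0]})
first_row = df.iloc[0]  # integer position indexing yields a pd.Series

frame_matches = looks_like_dataframe(df)
series_matches = looks_like_dataframe(first_row)
print(frame_matches, series_matches)
```

A Series has the class name "Series", so it does not match the name-based check and would still need a branch of its own.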

@sandiemann
Contributor

@PavelAgurov I believe it has something to do with the prompt, so I tried some other prompts to test:

from pandasai import SmartDataframe
from pandasai.llm import OpenAI

llm = OpenAI()
df = SmartDataframe("/Users/sanchit/Downloads/titanic.csv", config={"llm": llm})

response = df.chat("how many male passengers?")
> response
Out[4]: 577

prompt: fetch first row?

response2 = df.chat("fetch first row?")
> response2
Out[5]:
There are 577 male passengers. The first row is:
PassengerId                          1
Survived                             0
Pclass                               3
Name           Braund, Mr. Owen Harris
Sex                               male
Age                               22.0
SibSp                                1
Parch                                0
Ticket                       A/5 21171
Fare                              7.25
Cabin                              NaN
Embarked                             S
Name: 0, dtype: object

I will look into this in more depth later. Hope it helps for now!

@PavelAgurov
Contributor Author

PavelAgurov commented Aug 20, 2023

Here is a short test: https://colab.research.google.com/drive/15WniinCDUd_tL_z6APwEqQTXcbD9nVq2?usp=sharing
image

(Skip the second part of this test; it relates to #470.)

@PavelAgurov
Contributor Author

My assumption: you loaded the data directly from the CSV file, but I loaded it from a DataFrame.

@sandiemann
Contributor

@PavelAgurov I looked deeper into it now. The issue is with the prompt: the generated code returns the first row as df.iloc[0], which is a pd.Series, while a pd.DataFrame is expected:

Code running:

def analyze_data(dfs: list[pd.DataFrame]) ->dict:
    first_row = dfs[0].iloc[0]
    return {'type': 'dataframe', 'value': first_row}


result = analyze_data(dfs)

2023-08-20 15:30:14 [INFO] Answer: {'type': 'dataframe', 'value': survived                 0
pclass                   3
sex                   male
age                   22.0
sibsp                    1
parch                    0
fare                  7.25
embarked                 S
class                Third
who                    man
adult_male            True
deck                   NaN
embark_town    Southampton
alive                   no
alone                False
Name: 0, dtype: object}

Changing the prompt to smart_df.chat("fetch first row as a df") returns the result value as a pd.DataFrame.

Also, you do not need to wrap sns.load_dataset in pd.DataFrame (pd.DataFrame(sns.load_dataset("titanic"))), as it already returns a pandas DataFrame. Simply do:

df = sns.load_dataset("titanic")

llm = OpenAI()
smart_df = SmartDataframe(df, config={"llm": llm})
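
The Series-versus-DataFrame difference described in this comment can be seen in plain pandas. The following is a minimal sketch, assuming only pandas is installed; the two-row frame is made-up sample data standing in for the Titanic dataset.

```python
import pandas as pd

# Made-up sample data standing in for the Titanic dataset.
df = pd.DataFrame(
    {
        "Name": ["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina"],
        "Age": [22.0, 26.0],
    }
)

as_series = df.iloc[0]    # scalar position -> pd.Series
as_frame = df.iloc[[0]]   # list of positions -> one-row pd.DataFrame
as_head = df.head(1)      # also a one-row pd.DataFrame

print(type(as_series).__name__, type(as_frame).__name__, type(as_head).__name__)
```

Generated code that uses df.iloc[[0]] or df.head(1) would therefore satisfy a "dataframe" result type, whereas df.iloc[0] would not.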

@PavelAgurov
Contributor Author

Maybe it would be better to fix the bug in the code instead of changing the prompt? :)

And yes, wrapping it in a DataFrame is not needed; it was just a stupid test on my side :)

@sandiemann
Contributor

@gventuri we need a workaround to handle these cases.

@PavelAgurov
Contributor Author

Do you need the exception "Invalid input data. We cannot convert it to a dataframe."? We could try to convert the data into a DataFrame and, if that fails, just return it "as is" (after validating that it is not a string).

@gventuri
Collaborator

@sandiemann @PavelAgurov thanks a lot for looking into it, will try to figure out how to handle it. Maybe we could make it so a SmartDataframe also accepts a series as input and converts it to a dataframe?
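
Converting a Series to a DataFrame can go two ways, which matters for a result that represents one row. The following is a minimal sketch in plain pandas; the sample values are taken from the first row shown earlier in the thread, and the thread does not say which orientation pandasai should use.

```python
import pandas as pd

# A row selected with df.iloc[0] arrives as a Series like this one.
row = pd.Series({"PassengerId": 1, "Sex": "male", "Age": 22.0}, name=0)

as_column = row.to_frame()   # fields stay in the index: one column named 0
as_row = row.to_frame().T    # transposed: one row, fields become columns

print(as_column.shape, as_row.shape)
print(list(as_row.columns))
```

Series.to_frame() on its own keeps the field names as the index, so the "first row" is displayed as a single column; transposing restores the row layout, at the cost of turning every column into object dtype.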

@gventuri added the bug label Aug 21, 2023
@PavelAgurov
Contributor Author

Maybe just like this?

    def _load_df(self, df: DataFrameType):
        """
        Load a dataframe into the smart dataframe

        Args:
            df (DataFrameType): Pandas or Polars dataframe or path to a file
        """
        if isinstance(df, str):
            self._df = self._import_from_file(df)
            return
        # if the input can be converted to a dataframe, convert it
        # otherwise, keep it "as is"
        try:
            self._df = pd.DataFrame(df)
        except Exception:
            self._df = df

@PavelAgurov
Contributor Author

I think this is the most critical of my findings, because most of my questions about the data fail with this error.

@PavelAgurov
Contributor Author

Any ideas?

@PavelAgurov
Contributor Author

I made a fork and will test my solution.

@PavelAgurov
Contributor Author

Tested with the fix: it works well, no error.

    def _load_df(self, df: DataFrameType):
        """
        Load a dataframe into the smart dataframe

        Args:
            df (DataFrameType): Pandas or Polars dataframe or path to a file
        """
        if isinstance(df, str):
            if not (
                df.endswith(".csv")
                or df.endswith(".parquet")
                or df.endswith(".xlsx")
                or df.startswith("https://docs.google.com/spreadsheets/")
            ):
                df_config = self._load_from_config(df)
                if df_config:
                    if self._name is None:
                        self._name = df_config["name"]
                    if self._description is None:
                        self._description = df_config["description"]
                    df = df_config["import_path"]
                else:
                    raise ValueError(
                        "Could not find a saved dataframe configuration "
                        "with the given name."
                    )

            self._df = self._import_from_file(df)
        elif isinstance(df, pd.Series):
            self._df = df.to_frame()
        else:
            # if the list can be converted to a dataframe, convert it
            # otherwise, return it 'as is'
            try:
                self._df = pd.DataFrame(df)
            except ValueError:
                self._df = df
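
The non-string branches of this fix can be exercised in isolation. The following is a minimal sketch, assuming only pandas is installed; coerce_to_frame is a hypothetical standalone function that mirrors those branches for illustration and is not the pandasai method itself.

```python
import pandas as pd


def coerce_to_frame(obj):
    # Hypothetical stand-in for the non-string branches of _load_df.
    if isinstance(obj, pd.Series):
        return obj.to_frame()
    try:
        return pd.DataFrame(obj)
    except ValueError:
        # Not convertible: hand the object back unchanged.
        return obj


from_series = coerce_to_frame(pd.Series([1, 2, 3], name="x"))
from_dict = coerce_to_frame({"a": [1, 2], "b": [3, 4]})
from_scalar = coerce_to_frame(42)  # pd.DataFrame(42) raises ValueError

print(type(from_series).__name__, type(from_dict).__name__, type(from_scalar).__name__)
```

With this shape, anything that cannot be converted is passed along unchanged, so whatever validation happens later still decides whether the input is acceptable.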

@gventuri
Collaborator

gventuri commented Sep 3, 2023

@PavelAgurov thanks a lot for reporting. Glad the fix works, closing the issue :)

@gventuri closed this as completed Sep 3, 2023
@PavelAgurov
Contributor Author

Will you merge it?

@gventuri
Collaborator

gventuri commented Sep 3, 2023

@PavelAgurov from what I understand, the fix is the following, right?

        elif isinstance(df, pd.Series):
            self._df = df.to_frame()

So basically it also handles series.
This fix has already been merged. If I'm missing something, just let me know!

@PavelAgurov
Contributor Author

Not sure, but I see a problem here:

        elif isinstance(df, pd.Series):
            self._df = df.to_frame()
        elif isinstance(df, (list, dict)):
            # if the list can be converted to a dataframe, convert it
            # otherwise, raise an error
            try:
                self._df = pd.DataFrame(df)
            except ValueError:
                raise ValueError(
                    "Invalid input data. We cannot convert it to a dataframe."
                )
        else:
            self._df = df

pd.DataFrame is not an instance of (list, dict), so this branch will not run if we have a DataFrame here. Let's check it:

df = sns.load_dataset("titanic") # load the dataset
print(f'{isinstance(df, pd.DataFrame)} {isinstance(df, (list, dict))} {isinstance(df, pd.Series)}')

Output: True False False

image

A solution could be to remove this check or to add an explicit one:

        elif isinstance(df, pd.Series):
            self._df = df.to_frame()
        elif isinstance(df, pd.DataFrame):

@PavelAgurov
Contributor Author

On the other hand, I can't find an example that reproduces it with a DataFrame. Maybe it's not an actual problem.
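
One way to see why a DataFrame input does not reproduce the error is to trace which branch each input type reaches. The following is a minimal sketch, assuming only pandas is installed; branch_taken is a hypothetical, simplified stand-in for the isinstance chain quoted above and only reports which branch would run.

```python
import pandas as pd


def branch_taken(obj) -> str:
    # Simplified stand-in for the quoted isinstance chain.
    if isinstance(obj, str):
        return "import from file"
    elif isinstance(obj, pd.Series):
        return "to_frame"
    elif isinstance(obj, (list, dict)):
        return "pd.DataFrame(obj)"
    else:
        return "stored as-is"


df = pd.DataFrame({"Age": [22.0, 38.0]})
branches = {
    "DataFrame": branch_taken(df),
    "Series": branch_taken(df.iloc[0]),
    "dict": branch_taken({"Age": [22.0]}),
}
print(branches)
```

A DataFrame matches none of the explicit checks and lands in the final else, where it is stored unchanged, so the missing pd.DataFrame check does not raise by itself.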
