
[Bug, lake] pl.Dataframe constructor - 'float' object cannot be interpreted as an integer #657

Closed
trentmc opened this issue Feb 22, 2024 · 7 comments · Fixed by #669 or #762
Labels
Type: Bug Something isn't working

Comments

@trentmc
Member

trentmc commented Feb 22, 2024

Where encountered

With this setup: my_ppss.yaml. Key params: predict BTC, 5m; approach 3; just BTC c input.

In main branch.

I ran: pdr predictoor 3 my_ppss.yaml sapphire-mainnet.

After about 3h runtime, I got an error:

2024-02-21 22:09:13,933 INFO Fetch up to 1000 pts from timestamp=1708548000000, dt=2024-02-21_20:40:00.000
...
  File "/Users/trentmc/code/pdr-backend/pdr_backend/lake/ohlcv_data_factory.py", line 139, in _update_rawohlcv_files_at_feed
    next_df = pl.DataFrame(tohlcv_data, schema=TOHLCV_SCHEMA_PL)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
...
  File "/Users/trentmc/code/pdr-backend/venv/lib/python3.11/site-packages/polars/utils/_construction.py", line 392, in _construct_series_with_fallbacks
    return constructor(name, values, strict)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: 'float' object cannot be interpreted as an integer

Full log

out.txt

Full Traceback

2024-02-21 22:09:13,076 INFO cur_epoch=5695165, cur_block_number=2555532, cur_timestamp=1708549743, next_slot=1708549800, target_slot=1708550100. 57 s left in epoch (predict if <= 60 s left). s_per_epoch=300
2024-02-21 22:09:13,911 INFO Predict for time slot = 1708550100...
2024-02-21 22:09:13,912 INFO Get historical data, across many exchanges & pairs: begin.
2024-02-21 22:09:13,912 INFO Data start: timestamp=1685577600000, dt=2023-06-01_00:00:00.000
2024-02-21 22:09:13,912 INFO Data fin: timestamp=1708549753912, dt=2024-02-21_21:09:13.912
2024-02-21 22:09:13,912 INFO Update all rawohlcv files: begin
2024-02-21 22:09:13,912 INFO Update rawohlcv file at exchange=binance, pair=BTC/USDT: begin
2024-02-21 22:09:13,913 INFO filename=/Users/trentmc/code/pdr-backend/parquet_data/binance_BTC-USDT_5m.parquet
2024-02-21 22:09:13,913 INFO File already exists
2024-02-21 22:09:13,932 INFO File starts at: timestamp=1685577600000, dt=2023-06-01_00:00:00.000
2024-02-21 22:09:13,932 INFO File finishes at: timestamp=1708547700000, dt=2024-02-21_20:35:00.000
2024-02-21 22:09:13,932 INFO User-specified start >= file start, so append file
2024-02-21 22:09:13,932 INFO Aim to fetch data from start time: timestamp=1708548000000, dt=2024-02-21_20:40:00.000
2024-02-21 22:09:13,933 INFO Fetch up to 1000 pts from timestamp=1708548000000, dt=2024-02-21_20:40:00.000
Traceback (most recent call last):
  File "/Users/trentmc/code/pdr-backend/./pdr", line 6, in <module>
    cli_module._do_main()
  File "/Users/trentmc/code/pdr-backend/venv/lib/python3.11/site-packages/enforce_typing/decorator.py", line 29, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/trentmc/code/pdr-backend/pdr_backend/cli/cli_module.py", line 49, in _do_main
    func(args, nested_args)
  File "/Users/trentmc/code/pdr-backend/venv/lib/python3.11/site-packages/enforce_typing/decorator.py", line 29, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/trentmc/code/pdr-backend/pdr_backend/cli/cli_module.py", line 85, in do_predictoor
    agent.run()
  File "/Users/trentmc/code/pdr-backend/venv/lib/python3.11/site-packages/enforce_typing/decorator.py", line 29, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/trentmc/code/pdr-backend/pdr_backend/predictoor/base_predictoor_agent.py", line 56, in run
    self.take_step()
  File "/Users/trentmc/code/pdr-backend/venv/lib/python3.11/site-packages/enforce_typing/decorator.py", line 29, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/trentmc/code/pdr-backend/pdr_backend/predictoor/base_predictoor_agent.py", line 85, in take_step
    predval, stake = self.get_prediction(target_slot)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/trentmc/code/pdr-backend/venv/lib/python3.11/site-packages/enforce_typing/decorator.py", line 29, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/trentmc/code/pdr-backend/pdr_backend/predictoor/approach3/predictoor_agent3.py", line 38, in get_prediction
    mergedohlcv_df = self.get_data_components()
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/trentmc/code/pdr-backend/venv/lib/python3.11/site-packages/enforce_typing/decorator.py", line 29, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/trentmc/code/pdr-backend/pdr_backend/predictoor/approach3/predictoor_agent3.py", line 20, in get_data_components
    mergedohlcv_df = ohlcv_data_factory.get_mergedohlcv_df()
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/trentmc/code/pdr-backend/pdr_backend/lake/ohlcv_data_factory.py", line 82, in get_mergedohlcv_df
    self._update_rawohlcv_files(fin_ut)
  File "/Users/trentmc/code/pdr-backend/pdr_backend/lake/ohlcv_data_factory.py", line 95, in _update_rawohlcv_files
    self._update_rawohlcv_files_at_feed(feed, fin_ut)
  File "/Users/trentmc/code/pdr-backend/pdr_backend/lake/ohlcv_data_factory.py", line 139, in _update_rawohlcv_files_at_feed
    next_df = pl.DataFrame(tohlcv_data, schema=TOHLCV_SCHEMA_PL)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/trentmc/code/pdr-backend/venv/lib/python3.11/site-packages/polars/dataframe/frame.py", line 377, in __init__
    self._df = sequence_to_pydf(
               ^^^^^^^^^^^^^^^^^
  File "/Users/trentmc/code/pdr-backend/venv/lib/python3.11/site-packages/polars/utils/_construction.py", line 1016, in sequence_to_pydf
    return _sequence_to_pydf_dispatcher(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/python@3.11/3.11.6/Frameworks/Python.framework/Versions/3.11/lib/python3.11/functools.py", line 909, in wrapper
    return dispatch(args[0].__class__)(*args, **kw)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/trentmc/code/pdr-backend/venv/lib/python3.11/site-packages/polars/utils/_construction.py", line 1175, in _sequence_of_sequence_to_pydf
    data_series: list[PySeries] = [
                                  ^
  File "/Users/trentmc/code/pdr-backend/venv/lib/python3.11/site-packages/polars/utils/_construction.py", line 1176, in <listcomp>
    pl.Series(
  File "/Users/trentmc/code/pdr-backend/venv/lib/python3.11/site-packages/polars/series/series.py", line 298, in __init__
    self._s = sequence_to_pyseries(
              ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/trentmc/code/pdr-backend/venv/lib/python3.11/site-packages/polars/utils/_construction.py", line 486, in sequence_to_pyseries
    pyseries = _construct_series_with_fallbacks(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/trentmc/code/pdr-backend/venv/lib/python3.11/site-packages/polars/utils/_construction.py", line 392, in _construct_series_with_fallbacks
    return constructor(name, values, strict)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: 'float' object cannot be interpreted as an integer
@idiom-bytes
Member

idiom-bytes commented Feb 23, 2024

Hi @trentmc, I'm a bit worried that somehow a timestamp from ccxt is showing up as a float, rather than integer.

I created a couple of tests:

  1. test_schema_interpreter_float_as_integer, which reproduces how to get the error TypeError: 'float' object cannot be interpreted as an integer
  2. test_fix_schema_interpreter_float_as_integer, which tries to provide a way to fix it

The issue is that I'm having a hard time reproducing the issue, in lines 130-140 of ohlcv_data_factory.py:
raw_tohlcv_data = safe_fetch_ohlcv_ccxt(
    exch,
    symbol=str(pair_str).replace("-", "/"),
    timeframe=str(feed.timeframe),
    since=st_ut,
    limit=limit,
)
tohlcv_data = clean_raw_ohlcv(raw_tohlcv_data, feed, st_ut, fin_ut)

# concat both TOHLCV data
next_df = pl.DataFrame(tohlcv_data, schema=TOHLCV_SCHEMA_PL)

When implementing the fix in (2), when I try to pass raw_tohlcv_data w/ a float to clean_raw_ohlcv() in the test, I get the error... hopefully I can reproduce it successfully, and then fix it.

[screenshot]

@trentmc
Member Author

trentmc commented Feb 23, 2024

When I try to pass raw_tohlcv_data w/ a float to clean_raw_ohlcv() in the test, I get the error... hopefully I can reproduce it successfully, and then fix it.

Looks like you're not passing a list of [t, o, h, l, c, v], you're just passing a float. So of course that fails.

What you need to test is: if you pass a list of [t, o, h, l, c, v], where t is a float, how does it do on that?

I wouldn't be surprised if sometimes an exchange gives timestamps as floats. Remember that sometimes we get NaNs for o, h, l, c or v values, and we work around those. So we'd also need to identify exactly what it's doing with timestamps, and figure out a workaround. But first we need to get to the bottom of the issue.

To get to the bottom of the issue, reproduce fetching the following data: BTC/USDT on Binance. That's what it was trying right before the traceback.

Fetch up to 1000 pts from timestamp=1708548000000, dt=2024-02-21_20:40:00.000

@kdetry
Contributor

kdetry commented Feb 23, 2024

I created a temporary script to inspect the data and determine if there is a float-typed timestamp. The data appears to be clean. Could this be a temporary issue with their API?

import ccxt
import csv
from datetime import datetime
import polars as pl
from pdr_backend.lake.constants import (
    TOHLCV_SCHEMA_PL,
)
# Helper function to extract the Unix timestamp from each TOHLCV row
def _ohlcv_to_uts(tohlcv_data):
    return [row[0] for row in tohlcv_data]

# Function to filter data within a specified time range
def _filter_within_timerange(tohlcv_data, st_ut, fin_ut):
    uts = _ohlcv_to_uts(tohlcv_data)
    return [vec for ut, vec in zip(uts, tohlcv_data) if st_ut <= ut <= fin_ut]


# Initialize Binance exchange
exchange = ccxt.binanceus()

# Define the symbol and timeframe
symbol = 'BTC/USDT'
timeframe = '1m'  # 1 minute timeframe

# Fetch historical data
data = exchange.fetch_ohlcv(symbol, timeframe, since=1708548000000, limit=1000)

# Specify the CSV file name
csv_filename = './historical_data.csv'

# Save data to CSV
with open(csv_filename, mode='w', newline='') as file:
    writer = csv.writer(file, delimiter=',')
    # Write the header
    writer.writerow(['timestamp', 'open', 'high', 'low', 'close', 'volume'])
    # Write the data rows
    for row in data:
        writer.writerow(row)

# Example start and finish Unix timestamps for filtering
start_ut = 1708537200000  # Adjust to your specific start time
finish_ut = 1708597140000  # Adjust to your specific end time

# Filter the data
filtered_data = _filter_within_timerange(data, start_ut, finish_ut)

# Check whether any timestamp is a float
for row in filtered_data:
    if isinstance(row[0], float):
        print(row[0])

# Save the filtered data to a CSV file
csv_filename = './filtered_data.csv'
with open(csv_filename, mode='w', newline='') as file:
    writer = csv.writer(file, delimiter=',')
    writer.writerow(['timestamp', 'open', 'high', 'low', 'close', 'volume'])
    writer.writerows(filtered_data)

next_df = pl.DataFrame(filtered_data, schema=TOHLCV_SCHEMA_PL)


print(f"Filtered data successfully saved to {csv_filename}")

@idiom-bytes
Member

idiom-bytes commented Feb 23, 2024

Thanks for the feedback @trentmc and for fixing my issue @kdetry, it was EoD and I was tired.

My first step was to hard reset, nuke lake, and then configure ppss.yaml to do the same fetch.

lake_ss:
  parquet_dir: parquet_data
  feeds:
    - binance BTC/USDT 5m
#    - binance BTC/USDT ETH/USDT BNB/USDT XRP/USDT ADA/USDT DOGE/USDT SOL/USDT LTC/USDT TRX/USDT DOT/USDT 5m
#    - kraken BTC/USDT 5m
  st_timestr: 2024-02-21_20:00 # starting date for data
  fin_timestr: now # ending date for data

However, the data returned from ccxt was clean. There were no floats out of place, and all ohlcv values were valid (not null). My guess is that ccxt provides info straight from the CEX API, and if I remember correctly, CEX APIs sometimes yield wrong data which then gets patched (addressing Mustafa's findings and mine).

What I'm thinking is that we can't repro the issue because it will always be temporary, at the CEX-API level.

We either (1) enforce validation through coercion of expected types, value ranges, and expected data provided by ccxt/cex, or (2) assume the row is bad if validation fails, don't use it, apply martingale, and try to patch it (by re-fetching).

Even better than the call stack would be (3) to have those bad records logged somewhere, so we can further inspect the issues we're seeing and address them accordingly.
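Option (1) could be sketched roughly as below. Everything here is hedged: coerce_tohlcv_rows is a hypothetical helper name, not an existing pdr-backend function, and the rows are made-up examples.

```python
# Hedged sketch of option (1): coerce each raw TOHLCV row to the
# expected types before DataFrame construction, and (per option (2))
# drop rows that cannot be coerced. The helper name is illustrative,
# not an existing pdr-backend function.
def coerce_tohlcv_rows(raw_rows):
    clean = []
    for row in raw_rows:
        try:
            if len(row) != 6:
                raise ValueError("malformed row")
            t = int(row[0])                      # timestamp: int epoch-ms
            ohlcv = [float(x) for x in row[1:]]  # prices/volume: floats
            clean.append([t] + ohlcv)
        except (TypeError, ValueError):
            # bad row: skip it here; a fuller fix would log and re-fetch
            continue
    return clean

rows = [
    [1709887500000.0, 67289.04, 67390.0, 67274.7, 67300.0, 118.03],  # float ts
    ["n/a", 1.0, 2.0, 3.0, 4.0, 5.0],                                # bad ts
]
print(coerce_tohlcv_rows(rows))  # first row coerced to int ts, second dropped
```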

@idiom-bytes
Member

idiom-bytes commented Feb 28, 2024

After reviewing PRs & tasks today, we discussed:

  1. we merge the PR to add defensive code around timestamps
  2. we update the fetch_ohlcv step to log the ohlcv records, so that we can observe the data if there are errors
  3. we create a new task to continue improving ingestion/defensive code and error handling

Based on this, I propose we:
(1) Do - It became more obvious that we should move forward, as we believe this is the issue being encountered. Float values are already being enforced; ints should be too. There are other, additional checks, such as if_uts_have_gaps() (and perhaps more), so we should continue to expand/improve our ingestion code.
(2) Do - Create a task to improve logging, so we can see the data and identify what the data problems are.
(3) Do - Create a task to improve CLI tooling around merge_df and other lake artifacts (i.e. lake inspect/cull/build: gql tables, etl tables, ohlcv data, merge_df).

idiom-bytes added a commit that referenced this issue Feb 28, 2024
* Adding tests so we can review issue w/ ohlcv data, timestamps, and floats

* test fix

* black fix

---------

Co-authored-by: Mustafa Tuncay <mustafaislev@gmail.com>
@trentmc trentmc reopened this Mar 8, 2024
@trentmc
Member Author

trentmc commented Mar 8, 2024

I ran across this error again, and I'm able to reproduce it now :) Gonna try to fix it.

Run, with error:

(venv) trentmc@tlm-macbook: ~/code/pdr-backend $ pdr sim my_ppss.yaml 
2024-03-08 10:11:27,824 INFO pdr sim: Begin
2024-03-08 10:11:27,825 INFO Arguments:
2024-03-08 10:11:27,825 INFO PPSS_FILE=my_ppss.yaml
2024-03-08 10:11:27,825 INFO {}
2024-03-08 10:11:27,935 INFO Start run
2024-03-08 10:11:27,935 INFO Get historical data, across many exchanges & pairs: begin.
2024-03-08 10:11:27,935 INFO Data start: timestamp=1685577600000, dt=2023-06-01_00:00:00.000
2024-03-08 10:11:27,935 INFO Data fin: timestamp=1709889087935, dt=2024-03-08_09:11:27.935
2024-03-08 10:11:27,935 INFO Update all rawohlcv files: begin
2024-03-08 10:11:27,935 INFO Update rawohlcv file at exchange=binance, pair=BTC/USDT: begin
2024-03-08 10:11:27,935 INFO filename=/Users/trentmc/code/pdr-backend/parquet_data/binance_BTC-USDT_5m.parquet
2024-03-08 10:11:27,935 INFO File already exists
2024-03-08 10:11:27,940 INFO File starts at: timestamp=1685577600000, dt=2023-06-01_00:00:00.000
2024-03-08 10:11:27,940 INFO File finishes at: timestamp=1709887200000, dt=2024-03-08_08:40:00.000
2024-03-08 10:11:27,940 INFO User-specified start >= file start, so append file
2024-03-08 10:11:27,940 INFO Aim to fetch data from start time: timestamp=1709887500000, dt=2024-03-08_08:45:00.000
2024-03-08 10:11:27,941 INFO Fetch up to 1000 pts from timestamp=1709887500000, dt=2024-03-08_08:45:00.000
Traceback (most recent call last):
  File "/Users/trentmc/code/pdr-backend/./pdr", line 6, in <module>
    cli_module._do_main()
  File "/Users/trentmc/code/pdr-backend/venv/lib/python3.11/site-packages/enforce_typing/decorator.py", line 29, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/trentmc/code/pdr-backend/pdr_backend/cli/cli_module.py", line 54, in _do_main
    func(args, nested_args)
  File "/Users/trentmc/code/pdr-backend/venv/lib/python3.11/site-packages/enforce_typing/decorator.py", line 29, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/trentmc/code/pdr-backend/pdr_backend/cli/cli_module.py", line 69, in do_sim
    sim_engine.run()
  File "/Users/trentmc/code/pdr-backend/venv/lib/python3.11/site-packages/enforce_typing/decorator.py", line 29, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/trentmc/code/pdr-backend/pdr_backend/sim/sim_engine.py", line 84, in run
    mergedohlcv_df = f.get_mergedohlcv_df()
                     ^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/trentmc/code/pdr-backend/pdr_backend/lake/ohlcv_data_factory.py", line 83, in get_mergedohlcv_df
    self._update_rawohlcv_files(fin_ut)
  File "/Users/trentmc/code/pdr-backend/pdr_backend/lake/ohlcv_data_factory.py", line 96, in _update_rawohlcv_files
    self._update_rawohlcv_files_at_feed(feed, fin_ut)
  File "/Users/trentmc/code/pdr-backend/pdr_backend/lake/ohlcv_data_factory.py", line 140, in _update_rawohlcv_files_at_feed
    next_df = pl.DataFrame(tohlcv_data, schema=TOHLCV_SCHEMA_PL)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/trentmc/code/pdr-backend/venv/lib/python3.11/site-packages/polars/dataframe/frame.py", line 377, in __init__
    self._df = sequence_to_pydf(
               ^^^^^^^^^^^^^^^^^
  File "/Users/trentmc/code/pdr-backend/venv/lib/python3.11/site-packages/polars/utils/_construction.py", line 1016, in sequence_to_pydf
    return _sequence_to_pydf_dispatcher(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/python@3.11/3.11.6/Frameworks/Python.framework/Versions/3.11/lib/python3.11/functools.py", line 909, in wrapper
    return dispatch(args[0].__class__)(*args, **kw)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/trentmc/code/pdr-backend/venv/lib/python3.11/site-packages/polars/utils/_construction.py", line 1175, in _sequence_of_sequence_to_pydf
    data_series: list[PySeries] = [
                                  ^
  File "/Users/trentmc/code/pdr-backend/venv/lib/python3.11/site-packages/polars/utils/_construction.py", line 1176, in <listcomp>
    pl.Series(
  File "/Users/trentmc/code/pdr-backend/venv/lib/python3.11/site-packages/polars/series/series.py", line 298, in __init__
    self._s = sequence_to_pyseries(
              ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/trentmc/code/pdr-backend/venv/lib/python3.11/site-packages/polars/utils/_construction.py", line 486, in sequence_to_pyseries
    pyseries = _construct_series_with_fallbacks(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/trentmc/code/pdr-backend/venv/lib/python3.11/site-packages/polars/utils/_construction.py", line 392, in _construct_series_with_fallbacks
    return constructor(name, values, strict)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: 'float' object cannot be interpreted as an integer

Run, with breakpoint, to see input values:

(venv) trentmc@tlm-macbook: ~/code/pdr-backend $ pdr sim my_ppss.yaml 
2024-03-08 10:12:25,949 INFO pdr sim: Begin
2024-03-08 10:12:25,949 INFO Arguments:
2024-03-08 10:12:25,949 INFO PPSS_FILE=my_ppss.yaml
2024-03-08 10:12:25,949 INFO {}
2024-03-08 10:12:26,056 INFO Start run
2024-03-08 10:12:26,056 INFO Get historical data, across many exchanges & pairs: begin.
2024-03-08 10:12:26,056 INFO Data start: timestamp=1685577600000, dt=2023-06-01_00:00:00.000
2024-03-08 10:12:26,056 INFO Data fin: timestamp=1709889146056, dt=2024-03-08_09:12:26.056
2024-03-08 10:12:26,056 INFO Update all rawohlcv files: begin
2024-03-08 10:12:26,056 INFO Update rawohlcv file at exchange=binance, pair=BTC/USDT: begin
2024-03-08 10:12:26,056 INFO filename=/Users/trentmc/code/pdr-backend/parquet_data/binance_BTC-USDT_5m.parquet
2024-03-08 10:12:26,056 INFO File already exists
2024-03-08 10:12:26,061 INFO File starts at: timestamp=1685577600000, dt=2023-06-01_00:00:00.000
2024-03-08 10:12:26,061 INFO File finishes at: timestamp=1709887200000, dt=2024-03-08_08:40:00.000
2024-03-08 10:12:26,061 INFO User-specified start >= file start, so append file
2024-03-08 10:12:26,061 INFO Aim to fetch data from start time: timestamp=1709887500000, dt=2024-03-08_08:45:00.000
2024-03-08 10:12:26,062 INFO Fetch up to 1000 pts from timestamp=1709887500000, dt=2024-03-08_08:45:00.000
> /Users/trentmc/code/pdr-backend/pdr_backend/lake/ohlcv_data_factory.py(144)_update_rawohlcv_files_at_feed()
-> df = concat_next_df(df, next_df)
(Pdb) pl.DataFrame(tohlcv_data, schema=TOHLCV_SCHEMA_PL)
*** TypeError: 'float' object cannot be interpreted as an integer
(Pdb) raw_tohlcv_data
[[1709887500000, 67289.04, 67390.0, 67274.7, 67300.0, 118.03242], [1709887800000, 67300.01, 67320.0, 67256.4, 67256.41, 68.07976], [1709888100000, 67256.4, 67291.97, 67200.01, 67208.55, 114.46873], [1709888400000, 67208.56, 67213.19, 67040.76, 67070.82, 173.93243], [1709888700000, 67070.82, 67120.57, 67050.0, 67100.0, 175.64183], [1709889000000, 67100.0, 67223.54, 67095.16, 67216.42, 155.12712]]
(Pdb) tohlcv_data
[[1709887500000, 67289.04, 67390.0, 67274.7, 67300.0, 118.03242], [1709887800000, 67300.01, 67320.0, 67256.4, 67256.41, 68.07976], [1709888100000, 67256.4, 67291.97, 67200.01, 67208.55, 114.46873], [1709888400000, 67208.56, 67213.19, 67040.76, 67070.82, 173.93243], [1709888700000, 67070.82, 67120.57, 67050.0, 67100.0, 175.64183], [1709889000000, 67100.0, 67223.54, 67095.16, 67216.42, 155.12712]]
(Pdb) st_ut
1709887500000
(Pdb) limit
1000
(Pdb) exch
ccxt.binance()
(Pdb) str(pair_str).replace("-", "/")
'BTC/USDT'

At the top of the call stack is:

> /Users/trentmc/code/pdr-backend/venv/lib/python3.11/site-packages/polars/utils/_construction.py(392)_construct_series_with_fallbacks()
-> return constructor(name, values, strict)

The values there are:

(Pdb) name
'timestamp'
(Pdb) values
[1709887500000, 67289.04, 67390.0, 67274.7, 67300.0, 118.03242]
(Pdb) strict
True

If name=timestamp and it thinks all these values are supposed to be timestamps, then that's the issue! Because only the first value is int, and only the first value has a reasonable value for timestamp. The other values are clearly BTC price related (ohlcv).

@trentmc
Member Author

trentmc commented Mar 8, 2024

I got to the bottom of the issue.

I discovered other issues in the polars repo that had encountered similar errors.

From the discussions there, I realized that the issue may be:

  • the DataFrame constructor sets an "orient" value of "row" or "col"
  • i.e. if you input a list of lists, does it interpret each inner list as one series, or the outer list?
  • by default, it infers "orient" automatically from the input data
  • but you can set "orient" to "row" or "col" explicitly as an optional arg to the constructor

Supporting evidence is what I observed above:

If name=timestamp and it thinks all these values are supposed to be timestamps, then that's the issue! Because only the first value is int, and only the first value has a reasonable value for timestamp. The other values are clearly BTC price related (ohlcv).

So I created three separate unit tests, and observed the behavior:

  1. test_issue657_infer_orientation(). Let "orient" be inferred automatically, on two datasets: dataset 1 known to be ok, and dataset 2 known to fail. And that's what happened.
  2. test_issue657_set_col_orientation(). Set orient="col" on both. It fails on both.
  3. test_issue657_set_row_orientation(). Set orient="row" on both. It passes on both.

From this, it's clear that we need to set orient="row" everywhere we create ohlcv dfs from data. I put that fix into the PR accordingly.

trentmc added a commit that referenced this issue Mar 8, 2024
Fix #657: [Bug, lake] pl.Dataframe constructor - 'float' object cannot be interpreted as an integer

* Write unit tests that capture the base issue
* Update unit tests to expose orient=infer/col/row behavior. They show: infer can fail sometimes, col always fails, row always passes
* Update DataFrame() constructor calls to always explicitly set orient="row"