# 🙇 Searching for Errors in a Dataset

## ☝️Task Description
This test task presents a sample of high-frequency data containing trades and updates to the limit order book. We have carefully concealed a substantial number of realistic errors in it, and your task is to identify as many errors as possible.

You don't need to show how you searched for errors, create plots, or perform any other analytics.
You also don't need to indicate the row numbers where you found errors.
The only thing required is a brief description (one line is enough) of the nature of each error found.
Task score: the percentage of error types found.

## 💻 Preparing the data for analysis

In [67]:
# load dataset
import pandas as pd

# read file
df = pd.read_feather("test.feather").reset_index(drop=True)

# overview dataset
df.head()


Unnamed: 0,type,msgSeqNum,exchHostTime,adapterTime,px_buy_1,amt_buy_1,px_buy_2,amt_buy_2,px_buy_3,amt_buy_3,...,px_sell_8,amt_sell_8,px_sell_9,amt_sell_9,px_sell_10,amt_sell_10,trade_px,trade_amt,trade_cnt,moreTradesInBatch
0,4,173525833,1596240037000000000,1596240037431704679,11357.99,20.199,11357.67,0.169,11357.5,0.062,...,11358.82,0.001,12494.7,0.063,13744.17,0.111,11358.25,0.175,0,0.0
1,4,173525834,1596240037000000000,1596240037431749155,11357.99,20.199,11357.67,0.169,11357.5,0.062,...,11358.82,0.001,12494.7,0.063,13744.17,0.111,11358.26,2.0,0,0.0
2,4,122537841,1596240037000000000,1596240037431776197,11357.99,20.199,11357.67,0.169,11357.5,0.062,...,11358.82,0.001,12494.7,0.063,13744.17,0.111,11358.25,0.175,1,0.0
3,4,122537842,1596240037000000000,1596240037431782097,11357.99,20.199,11357.67,0.169,11357.5,0.062,...,11358.82,0.001,12494.7,0.063,13744.17,0.111,11358.26,2.0,1,0.0
4,4,173525835,1596240037000000000,1596240037431794710,11357.99,20.199,11357.67,0.169,11357.5,0.062,...,11358.82,0.001,12494.7,0.063,13744.17,0.111,11358.33,0.004,0,0.0


In [65]:
# overview data types
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40374 entries, 1000 to 41373
Data columns (total 48 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   type               40374 non-null  int64  
 1   msgSeqNum          40374 non-null  int64  
 2   exchHostTime       40374 non-null  int64  
 3   adapterTime        40374 non-null  int64  
 4   px_buy_1           40373 non-null  float64
 5   amt_buy_1          40373 non-null  float64
 6   px_buy_2           40374 non-null  float64
 7   amt_buy_2          40373 non-null  float64
 8   px_buy_3           40374 non-null  float64
 9   amt_buy_3          40373 non-null  float64
 10  px_buy_4           40374 non-null  float64
 11  amt_buy_4          40373 non-null  float64
 12  px_buy_5           40374 non-null  float64
 13  amt_buy_5          40373 non-null  float64
 14  px_buy_6           40374 non-null  float64
 15  amt_buy_6          40373 non-null  float64
 16  px_buy_7           

**Dataset description and column meanings**
1. System fields and timestamps
- type: message type (event code).
- msgSeqNum: message sequence number.
- exchHostTime: event time on the exchange side.
- adapterTime: time the event was received by the adapter.

2. Limit order book, buy orders (bids)
- px_buy_[1–10]: bid price at levels 1–10 (1 is the best).
- amt_buy_[1–10]: bid volume at levels 1–10.

3. Limit order book, sell orders (asks)
- px_sell_[1–10]: ask price at levels 1–10 (1 is the best).
- amt_sell_[1–10]: ask volume at levels 1–10.

4. Trades
- trade_px: trade price.
- trade_amt: trade volume.
- trade_cnt: number of trades in this message.

5. Additional trade field
- moreTradesInBatch: flag indicating that the batch contains more trades.
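
The column layout described above can be turned into a quick schema check. The sketch below is only an illustration: the helper names `expected_columns` and `missing_columns` are hypothetical, and the column order is an assumption based on the table headers shown in this notebook.

```python
def expected_columns(levels: int = 10) -> list[str]:
    """Build the column names implied by the description (order is assumed)."""
    cols = ["type", "msgSeqNum", "exchHostTime", "adapterTime"]
    for side in ("buy", "sell"):
        for i in range(1, levels + 1):
            cols += [f"px_{side}_{i}", f"amt_{side}_{i}"]
    cols += ["trade_px", "trade_amt", "trade_cnt", "moreTradesInBatch"]
    return cols


def missing_columns(actual) -> list[str]:
    """Return the expected columns that are absent from `actual`."""
    present = set(actual)
    return [c for c in expected_columns() if c not in present]


# Toy usage: pretend the last column was dropped on load.
print(len(expected_columns()))
print(missing_columns(expected_columns()[:-1]))
```

With ten levels per side the list has 48 names, which agrees with the "total 48 columns" line in the `df.info()` output. On the real data the call would be `missing_columns(df.columns)`.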

## 👀 Looking for data errors

In [73]:
# Convert nanosecond timestamps to datetime
import pandas as pd

df['exchHostTime_dt'] = pd.to_datetime(df['exchHostTime'], unit='ns')
df['adapterTime_dt'] = pd.to_datetime(df['adapterTime'], unit='ns')

# Difference between adapterTime and exchHostTime, in seconds
df['time_diff_sec'] = (df['adapterTime'] - df['exchHostTime']) / 1e9

# The same difference in milliseconds
df['time_diff_ms'] = df['time_diff_sec'] * 1000

# Show the first 10 rows as a sanity check
df.head(10)


Unnamed: 0,type,msgSeqNum,exchHostTime,adapterTime,px_buy_1,amt_buy_1,px_buy_2,amt_buy_2,px_buy_3,amt_buy_3,...,px_sell_10,amt_sell_10,trade_px,trade_amt,trade_cnt,moreTradesInBatch,exchHostTime_dt,adapterTime_dt,time_diff_sec,time_diff_ms
0,4,173525833,1596240037000000000,1596240037431704679,11357.99,20.199,11357.67,0.169,11357.5,0.062,...,13744.17,0.111,11358.25,0.175,0,0.0,2020-08-01 00:00:37,2020-08-01 00:00:37.431704679,0.431705,431.704679
1,4,173525834,1596240037000000000,1596240037431749155,11357.99,20.199,11357.67,0.169,11357.5,0.062,...,13744.17,0.111,11358.26,2.0,0,0.0,2020-08-01 00:00:37,2020-08-01 00:00:37.431749155,0.431749,431.749155
2,4,122537841,1596240037000000000,1596240037431776197,11357.99,20.199,11357.67,0.169,11357.5,0.062,...,13744.17,0.111,11358.25,0.175,1,0.0,2020-08-01 00:00:37,2020-08-01 00:00:37.431776197,0.431776,431.776197
3,4,122537842,1596240037000000000,1596240037431782097,11357.99,20.199,11357.67,0.169,11357.5,0.062,...,13744.17,0.111,11358.26,2.0,1,0.0,2020-08-01 00:00:37,2020-08-01 00:00:37.431782097,0.431782,431.782097
4,4,173525835,1596240037000000000,1596240037431794710,11357.99,20.199,11357.67,0.169,11357.5,0.062,...,13744.17,0.111,11358.33,0.004,0,0.0,2020-08-01 00:00:37,2020-08-01 00:00:37.431794710,0.431795,431.79471
5,4,173525836,1596240037000000000,1596240037431825867,11357.99,20.199,11357.67,0.169,11357.5,0.062,...,13744.17,0.111,11358.35,0.001,0,0.0,2020-08-01 00:00:37,2020-08-01 00:00:37.431825867,0.431826,431.825867
6,4,122537843,1596240037000000000,1596240037431853669,11357.99,20.199,11357.67,0.169,11357.5,0.062,...,13744.17,0.111,11358.33,0.004,1,0.0,2020-08-01 00:00:37,2020-08-01 00:00:37.431853669,0.431854,431.853669
7,4,122537844,1596240037000000000,1596240037431858991,11357.99,20.199,11357.67,0.169,11357.5,0.062,...,13744.17,0.111,11358.35,0.001,1,0.0,2020-08-01 00:00:37,2020-08-01 00:00:37.431858991,0.431859,431.858991
8,4,173525837,1596240037000000000,1596240037431863851,11357.99,20.199,11357.67,0.169,11357.5,0.062,...,13744.17,0.111,11358.82,0.001,0,0.0,2020-08-01 00:00:37,2020-08-01 00:00:37.431863851,0.431864,431.863851
9,4,173525838,1596240037000000000,1596240037431890572,11357.99,20.199,11357.67,0.169,11357.5,0.062,...,13744.17,0.111,11359.0,0.018,0,0.0,2020-08-01 00:00:37,2020-08-01 00:00:37.431890572,0.431891,431.890572
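
The time difference computed above can be used to flag suspicious timestamps. This is a minimal sketch, assuming both columns are integer nanoseconds; the function name `suspicious_latency` and the one-second default threshold are illustrative choices, not part of the task.

```python
import pandas as pd

NS_PER_SECOND = 1_000_000_000


def suspicious_latency(df: pd.DataFrame, max_ns: int = NS_PER_SECOND) -> pd.DataFrame:
    """Rows where the adapter saw the event before the exchange did,
    or more than `max_ns` nanoseconds after it."""
    latency = df["adapterTime"] - df["exchHostTime"]
    return df[(latency < 0) | (latency > max_ns)]


# Toy data: latencies of 0.4 s, 1.5 s and -1.0 s.
toy = pd.DataFrame({
    "exchHostTime": [1_000_000_000, 1_000_000_000, 5_000_000_000],
    "adapterTime": [1_400_000_000, 2_500_000_000, 4_000_000_000],
})
print(suspicious_latency(toy))
```

In the rows displayed above, `exchHostTime` is always a whole number of seconds, so a threshold close to one second sits right at that rounding error and may need to be loosened.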


In [68]:
df['type'].value_counts()

type
6    30156
4    10218
Name: count, dtype: int64

In [69]:
df.head(3)

Unnamed: 0,type,msgSeqNum,exchHostTime,adapterTime,px_buy_1,amt_buy_1,px_buy_2,amt_buy_2,px_buy_3,amt_buy_3,...,px_sell_8,amt_sell_8,px_sell_9,amt_sell_9,px_sell_10,amt_sell_10,trade_px,trade_amt,trade_cnt,moreTradesInBatch
0,4,173525833,1596240037000000000,1596240037431704679,11357.99,20.199,11357.67,0.169,11357.5,0.062,...,11358.82,0.001,12494.7,0.063,13744.17,0.111,11358.25,0.175,0,0.0
1,4,173525834,1596240037000000000,1596240037431749155,11357.99,20.199,11357.67,0.169,11357.5,0.062,...,11358.82,0.001,12494.7,0.063,13744.17,0.111,11358.26,2.0,0,0.0
2,4,122537841,1596240037000000000,1596240037431776197,11357.99,20.199,11357.67,0.169,11357.5,0.062,...,11358.82,0.001,12494.7,0.063,13744.17,0.111,11358.25,0.175,1,0.0


In [70]:
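import pandas as pd

# Assumes the DataFrame is named df

# Difference between consecutive messages
# (the first row's diff is NaN, so it is always counted as a violation)
diffs = df['msgSeqNum'].diff()

# Rows where the difference is not exactly 1
# (the sample appears to interleave several independent sequences,
# e.g. 173525833... and 122537841..., so most hits are not real gaps)
violations = df[diffs != 1]

print(f"Found {len(violations)} msgSeqNum increment violations")
print(violations[['msgSeqNum']].head(10))


Found 37906 msgSeqNum increment violations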
    msgSeqNum
0   173525833
2   122537841
4   173525835
6   122537843
8   173525837
10  122537845
11  173525839
13  122537846
14  173525841
16  122537847
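
Because the sequence numbers above alternate between at least two ranges, a global `diff()` mostly measures jumps between streams. A hedged alternative is to check increments within each stream. The right stream key is not known from the data; the key used below (`type` combined with `trade_cnt > 0`) is only a guess based on the rows displayed in this notebook, and `seq_gaps` is a hypothetical helper.

```python
import pandas as pd


def seq_gaps(df: pd.DataFrame, key: pd.Series) -> pd.DataFrame:
    """Rows whose msgSeqNum is not exactly previous + 1 within the same stream."""
    step = df.groupby(key)["msgSeqNum"].diff()
    # The first row of every stream has no predecessor (NaN) and is skipped.
    return df[step.notna() & (step != 1)]


# Toy data: two interleaved streams, with one gap (101 -> 103) in the first.
toy = pd.DataFrame({
    "type": [4, 4, 4, 4, 4],
    "msgSeqNum": [100, 200, 101, 201, 103],
    "trade_cnt": [0, 1, 0, 1, 0],
})
# Assumed stream key: message type plus whether the row carries trades.
stream_key = toy["type"].astype(str) + "/" + (toy["trade_cnt"] > 0).astype(str)
print(seq_gaps(toy, stream_key))
```

If the assumed key is wrong, the result still over-counts, so the number of flagged rows is worth comparing against the global check before drawing conclusions.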


In [34]:
# error number 9: px_sell_7 is stored as object instead of float
df['px_sell_7'] = df['px_sell_7'].astype('float')

In [13]:
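# error number 1: null values in dataset
# Note: df[df.isna()] keeps only the cells that are NaN and masks everything
# else, so the frame below is entirely NaN; df.isna().sum() gives the
# per-column counts instead.
df[df.isna()]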

Unnamed: 0,type,msgSeqNum,exchHostTime,adapterTime,px_buy_1,amt_buy_1,px_buy_2,amt_buy_2,px_buy_3,amt_buy_3,...,px_sell_8,amt_sell_8,px_sell_9,amt_sell_9,px_sell_10,amt_sell_10,trade_px,trade_amt,trade_cnt,moreTradesInBatch
1000,,,,,,,,,,,...,,,,,,,,,,
1001,,,,,,,,,,,...,,,,,,,,,,
1002,,,,,,,,,,,...,,,,,,,,,,
1003,,,,,,,,,,,...,,,,,,,,,,
1004,,,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
41369,,,,,,,,,,,...,,,,,,,,,,
41370,,,,,,,,,,,...,,,,,,,,,,
41371,,,,,,,,,,,...,,,,,,,,,,
41372,,,,,,,,,,,...,,,,,,,,,,
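
The all-NaN table above comes from boolean-masking the frame with `df.isna()`, which hides every non-null cell. A minimal sketch of two more informative views follows; the helper names `null_counts` and `rows_with_nulls` are made up for illustration.

```python
import pandas as pd


def null_counts(df: pd.DataFrame) -> pd.Series:
    """Number of missing values per column, only for columns that have any."""
    counts = df.isna().sum()
    return counts[counts > 0]


def rows_with_nulls(df: pd.DataFrame) -> pd.DataFrame:
    """Rows that contain at least one missing value."""
    return df[df.isna().any(axis=1)]


# Toy frame with a single missing cell.
toy = pd.DataFrame({"a": [1.0, None], "b": [1.0, 2.0]})
print(null_counts(toy))
print(rows_with_nulls(toy))
```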


In [14]:
# error number 2: duplicated rows in dataset
df[df.duplicated()]

Unnamed: 0,type,msgSeqNum,exchHostTime,adapterTime,px_buy_1,amt_buy_1,px_buy_2,amt_buy_2,px_buy_3,amt_buy_3,...,px_sell_8,amt_sell_8,px_sell_9,amt_sell_9,px_sell_10,amt_sell_10,trade_px,trade_amt,trade_cnt,moreTradesInBatch
20276,4,173531184,1596240392000000000,1596240392681608559,11378.22,0.14,11378.19,1.597,11378.11,0.005,...,11379.23,1.161,12517.15,2.0,13768.87,0.161,11378.44,-0.906,0,0.0
20277,4,173531184,1596240392000000000,1596240392681608559,11378.22,0.14,11378.19,1.597,11378.11,0.005,...,11379.23,1.161,12517.15,2.0,13768.87,0.161,11378.44,-0.906,0,0.0
20278,4,173531184,1596240392000000000,1596240392681608559,11378.22,0.14,11378.19,1.597,11378.11,0.005,...,11379.23,1.161,12517.15,2.0,13768.87,0.161,11378.44,-0.906,0,0.0
20279,4,173531184,1596240392000000000,1596240392681608559,11378.22,0.14,11378.19,1.597,11378.11,0.005,...,11379.23,1.161,12517.15,2.0,13768.87,0.161,11378.44,-0.906,0,0.0
20280,4,173531184,1596240392000000000,1596240392681608559,11378.22,0.14,11378.19,1.597,11378.11,0.005,...,11379.23,1.161,12517.15,2.0,13768.87,0.161,11378.44,-0.906,0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
20371,4,173531184,1596240392000000000,1596240392681608559,11378.22,0.14,11378.19,1.597,11378.11,0.005,...,11379.23,1.161,12517.15,2.0,13768.87,0.161,11378.44,-0.906,0,0.0
20372,4,173531184,1596240392000000000,1596240392681608559,11378.22,0.14,11378.19,1.597,11378.11,0.005,...,11379.23,1.161,12517.15,2.0,13768.87,0.161,11378.44,-0.906,0,0.0
20373,4,173531184,1596240392000000000,1596240392681608559,11378.22,0.14,11378.19,1.597,11378.11,0.005,...,11379.23,1.161,12517.15,2.0,13768.87,0.161,11378.44,-0.906,0,0.0
20374,4,173531184,1596240392000000000,1596240392681608559,11378.22,0.14,11378.19,1.597,11378.11,0.005,...,11379.23,1.161,12517.15,2.0,13768.87,0.161,11378.44,-0.906,0,0.0
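
The duplicates above all repeat the same message, so a per-group count is easier to read than the raw rows. The sketch below groups on a subset of columns; using `msgSeqNum` plus `adapterTime` as the identifying key is an assumption, and `duplicate_groups` is a hypothetical helper.

```python
import pandas as pd


def duplicate_groups(df: pd.DataFrame, subset: list[str]) -> pd.DataFrame:
    """One row per repeated key, with the number of copies found."""
    mask = df.duplicated(subset=subset, keep=False)
    return df[mask].groupby(subset).size().rename("copies").reset_index()


# Toy data: the second message appears three times.
toy = pd.DataFrame({
    "msgSeqNum": [1, 2, 2, 2],
    "adapterTime": [10, 20, 20, 20],
})
print(duplicate_groups(toy, ["msgSeqNum", "adapterTime"]))
```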


In [41]:
# error number 3: number of trades in the message is < 0
df[df['trade_cnt']<0].head(2)

Unnamed: 0,type,msgSeqNum,exchHostTime,adapterTime,px_buy_1,amt_buy_1,px_buy_2,amt_buy_2,px_buy_3,amt_buy_3,...,px_sell_8,amt_sell_8,px_sell_9,amt_sell_9,px_sell_10,amt_sell_10,trade_px,trade_amt,trade_cnt,moreTradesInBatch
38548,6,45257909550,1596241367000000000,1596241367117860448,11303.9,0.24,11303.72,0.448,11303.35,0.518,...,11306.31,0.001,12436.94,0.503,13680.63,2.4,0.0,0.0,-1,


In [40]:
# error number 4: trade volume is below 0
df[df['trade_amt']<0].head(2)

Unnamed: 0,type,msgSeqNum,exchHostTime,adapterTime,px_buy_1,amt_buy_1,px_buy_2,amt_buy_2,px_buy_3,amt_buy_3,...,px_sell_8,amt_sell_8,px_sell_9,amt_sell_9,px_sell_10,amt_sell_10,trade_px,trade_amt,trade_cnt,moreTradesInBatch
1025,4,173525848,1596240037000000000,1596240037432199325,11357.99,20.199,11357.67,0.169,11357.5,0.062,...,11358.82,0.001,12494.7,0.063,13744.17,0.111,11357.99,-0.001,0,0.0
1032,4,173525849,1596240037000000000,1596240037437540602,11359.01,0.011,11358.01,0.24,11358.0,2.086,...,11359.77,1.0,12495.75,0.051,13745.32,1.576,11359.34,-0.539,0,0.0


In [28]:
# error number 5: price < 0
# (note: the filter below tests trade_amt <= 0, not a price column)
df[df['trade_amt']<=0].head(2)

Unnamed: 0,type,msgSeqNum,exchHostTime,adapterTime,px_buy_1,amt_buy_1,px_buy_2,amt_buy_2,px_buy_3,amt_buy_3,...,px_sell_8,amt_sell_8,px_sell_9,amt_sell_9,px_sell_10,amt_sell_10,trade_px,trade_amt,trade_cnt,moreTradesInBatch
1025,4,173525848,1596240037000000000,1596240037432199325,11357.99,20.199,11357.67,0.169,11357.5,0.062,...,11358.82,0.001,12494.7,0.063,13744.17,0.111,11357.99,-0.001,0,0.0
1026,6,45252671063,1596240037000000000,1596240037432311550,11357.99,20.198,11357.67,0.169,11357.5,0.062,...,11359.77,1.0,12495.75,1.0,13745.32,0.056,0.0,0.0,0,
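
The cell above filters on `trade_amt`, so it does not actually test prices. A hedged sketch of a check on the order book price columns follows. `trade_px` is left out on purpose, because rows without a trade show `trade_px` equal to 0.0 in this sample; the helper name `nonpositive_prices` is hypothetical.

```python
import pandas as pd


def nonpositive_prices(df: pd.DataFrame, px_cols: list[str]) -> pd.Series:
    """Series indexed by (row, column) holding every price that is <= 0."""
    book = df[px_cols]
    # Keep only offending cells, then flatten to (row, column) pairs.
    return book.where(book <= 0).stack().dropna()


# Toy book with a zero and a negative price in the second row.
toy = pd.DataFrame({"px_buy_1": [10.0, 0.0], "px_buy_2": [9.0, -5.0]})
print(nonpositive_prices(toy, ["px_buy_1", "px_buy_2"]))
```

On the full data the column list could be built as `[f"px_{side}_{i}" for side in ("buy", "sell") for i in range(1, 11)]`.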


In [30]:
# error number 6: bid-ask crossing
crossed = df[df['px_buy_1'] > df['px_sell_1']]
crossed[['px_buy_1', 'px_sell_1']]

Unnamed: 0,px_buy_1,px_sell_1
21651,11374000000.0,11374.01
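
In the crossed row above, `px_buy_1` is roughly a million times larger than `px_sell_1`, which looks like a scale error and not a real crossing. One way to catch such values is to compare each price with the column median. This is a sketch only: the factor of 100 and the helper name `scale_outliers` are illustrative assumptions.

```python
import pandas as pd


def scale_outliers(prices: pd.Series, factor: float = 100.0) -> pd.Series:
    """Prices more than `factor` times above or below the column median."""
    median = prices.median()
    return prices[(prices > median * factor) | (prices < median / factor)]


# Toy column: two normal prices, one scaled-up price and one zero.
toy = pd.Series([11374.0, 11374.01, 11374000000.0, 0.0])
print(scale_outliers(toy))
```

Zero and negative prices fall below `median / factor` as well, so this check overlaps with a plain sign check.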


In [32]:
# error number 7: monotonicity violations in bid levels


import pandas as pd

# Assumes the DataFrame is named df
bid_cols = [f'px_buy_{i}' for i in range(1, 11)]

# Collect violations here
violations = []

# Walk through every row
for idx, row in df.iterrows():
    for i in range(len(bid_cols) - 1):
        # Bids must not increase with depth: level i should be >= level i+1
        if row[bid_cols[i]] < row[bid_cols[i+1]]:
            violations.append({
                'index': idx,
                'level_1': bid_cols[i],
                'level_2': bid_cols[i+1],
                'value_1': row[bid_cols[i]],
                'value_2': row[bid_cols[i+1]]
            })

# Convert to a DataFrame for easier analysis
violations_df = pd.DataFrame(violations)

print(f"Found {len(violations_df)} monotonicity violations in bid levels.")
violations_df


Found 2 monotonicity violations in bid levels.


Unnamed: 0,index,level_1,level_2,value_1,value_2
0,10511,px_buy_9,px_buy_10,-11384.15,11384.13
1,15299,px_buy_1,px_buy_2,0.0,11381.78


In [36]:
# error number 8: monotonicity violations in ask levels
import pandas as pd

# Assumes the DataFrame is named df
ask_cols = [f'px_sell_{i}' for i in range(1, 11)]

# Collect violations here
violations = []

# Check every row
for idx, row in df.iterrows():
    for i in range(len(ask_cols) - 1):
        # Asks must not decrease with depth: level i should be <= level i+1
        if row[ask_cols[i]] > row[ask_cols[i+1]]:
            violations.append({
                'index': idx,
                'level_1': ask_cols[i],
                'level_2': ask_cols[i+1],
                'value_1': row[ask_cols[i]],
                'value_2': row[ask_cols[i+1]]
            })

# Convert to a DataFrame for analysis
violations_df = pd.DataFrame(violations)

print(f"Found {len(violations_df)} monotonicity violations in ask levels.")
violations_df.head(10)


Found 7 monotonicity violations in ask levels.


Unnamed: 0,index,level_1,level_2,value_1,value_2
0,19538,px_sell_1,px_sell_2,11373.38,11373.37
1,19538,px_sell_2,px_sell_3,11373.37,11373.04
2,19538,px_sell_3,px_sell_4,11373.04,11373.0
3,19538,px_sell_4,px_sell_5,11373.0,11372.41
4,19538,px_sell_5,px_sell_6,11372.41,11372.36
5,19538,px_sell_6,px_sell_7,11372.36,11372.33
6,19538,px_sell_7,px_sell_8,11372.33,11372.27
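
The two `iterrows` loops above apply the same rule in opposite directions, and they are slow on tens of thousands of rows. A vectorized sketch with NumPy that should give the same pairs follows; the function name `monotonicity_violations` is hypothetical, and equal neighbouring prices are not flagged, as in the loops.

```python
import numpy as np
import pandas as pd


def monotonicity_violations(df: pd.DataFrame, cols: list[str], increasing: bool) -> pd.DataFrame:
    """Adjacent level pairs that break the expected price ordering.

    increasing=True  -> asks: level i must be <= level i+1
    increasing=False -> bids: level i must be >= level i+1
    """
    values = df[cols].to_numpy(dtype=float)
    step = np.diff(values, axis=1)  # value at level i+1 minus value at level i
    bad = step < 0 if increasing else step > 0
    rows, pos = np.nonzero(bad)
    return pd.DataFrame({
        "index": df.index[rows],
        "level_1": [cols[p] for p in pos],
        "level_2": [cols[p + 1] for p in pos],
        "value_1": values[rows, pos],
        "value_2": values[rows, pos + 1],
    })


# Toy books with three levels per side.
bids = pd.DataFrame({"px_buy_1": [3.0, 0.0], "px_buy_2": [2.0, 2.0], "px_buy_3": [1.0, 1.0]})
asks = pd.DataFrame({"px_sell_1": [1.0, 3.0], "px_sell_2": [2.0, 2.0], "px_sell_3": [3.0, 1.0]})
print(monotonicity_violations(bids, list(bids.columns), increasing=False))
print(monotonicity_violations(asks, list(asks.columns), increasing=True))
```

Comparisons against NaN are false in both versions, so rows with missing prices are skipped here just as in the loops.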


In [49]:
# error number 9: inconsistent trade data
import pandas as pd

# Assumes the DataFrame is named df

# Condition for inconsistent trades
mask = (
    (df['trade_cnt'] > 0) & 
    ((df['trade_px'].isna()) | 
     (df['trade_amt'].isna()) | 
     (df['trade_amt'] <= 0))
)

# Filter the rows
inconsistent_trades = df[mask]

# Print the count and show the first 10 rows
print(f"Found {len(inconsistent_trades)} inconsistent rows with trade_cnt > 0")
inconsistent_trades.head(10)


Found 1839 inconsistent rows with trade_cnt > 0


Unnamed: 0,type,msgSeqNum,exchHostTime,adapterTime,px_buy_1,amt_buy_1,px_buy_2,amt_buy_2,px_buy_3,amt_buy_3,...,px_sell_8,amt_sell_8,px_sell_9,amt_sell_9,px_sell_10,amt_sell_10,trade_px,trade_amt,trade_cnt,moreTradesInBatch
1033,4,122537851,1596240037000000000,1596240037437568304,11359.01,0.011,11358.01,0.24,11358.0,2.086,...,11359.77,1.0,12495.75,0.051,13745.32,1.576,11357.99,-0.001,1,0.0
1051,4,122537852,1596240037000000000,1596240037589669722,11359.34,2.276,11359.01,0.566,11358.08,0.4,...,11360.15,0.157,12496.16,0.007,13745.78,0.02,11359.34,-0.939,2,0.0
1071,4,122537853,1596240037000000000,1596240037742970355,11359.36,17.258,11359.34,1.439,11359.01,0.566,...,11360.22,0.007,12496.24,0.02,13745.86,1.007,11359.34,-0.061,2,0.0
1082,4,122537854,1596240037000000000,1596240037870627424,11359.36,17.245,11359.34,1.44,11359.04,0.4,...,11360.15,0.157,12496.16,0.007,13745.78,0.02,11359.36,-0.015,1,0.0
1142,4,122537875,1596240037000000000,1596240037903307034,11359.36,17.245,11359.34,1.44,11359.04,0.4,...,11360.15,0.157,12496.16,0.007,13745.78,0.02,11360.37,-0.061,1,0.0
1144,4,122537876,1596240037000000000,1596240037903344680,11359.36,17.245,11359.34,1.44,11359.04,0.4,...,11360.15,0.157,12496.16,0.007,13745.78,0.02,11360.32,-0.148,1,0.0
1145,4,122537877,1596240037000000000,1596240037903359314,11359.36,17.245,11359.34,1.44,11359.04,0.4,...,11360.15,0.157,12496.16,0.007,13745.78,0.02,11359.99,-0.241,1,0.0
1169,4,122537882,1596240037000000000,1596240038008543243,11360.35,1.914,11360.3,1.91,11360.28,2.0,...,11361.59,0.063,12497.75,0.118,13747.53,0.003,11360.27,-0.431,1,0.0
1175,4,122537883,1596240037000000000,1596240038012061914,11360.36,0.312,11360.35,1.874,11360.0,26.545,...,11361.59,0.068,12497.75,0.123,13747.53,0.008,11360.35,-0.04,1,0.0
1182,4,122537884,1596240037000000000,1596240038012580603,11360.36,0.13,11360.35,1.874,11360.0,26.706,...,11361.59,0.068,12497.75,0.123,13747.53,0.008,11360.36,-0.11,3,0.0


In [50]:
# error number 10: no trades reported, yet a volume is given

import pandas as pd

# Assumes the DataFrame is named df

# Condition for "inconsistent" rows
mask = (df['trade_cnt'] == 0) & (df['trade_amt'] > 0)

# Filter the rows
inconsistent_trades = df[mask]

# Print the count and show the first 10 rows
print(f"Found {len(inconsistent_trades)} rows with trade_cnt == 0 but trade_amt > 0")
inconsistent_trades.head(10)


Found 3190 rows with trade_cnt == 0 but trade_amt > 0


Unnamed: 0,type,msgSeqNum,exchHostTime,adapterTime,px_buy_1,amt_buy_1,px_buy_2,amt_buy_2,px_buy_3,amt_buy_3,...,px_sell_8,amt_sell_8,px_sell_9,amt_sell_9,px_sell_10,amt_sell_10,trade_px,trade_amt,trade_cnt,moreTradesInBatch
1000,4,173525833,1596240037000000000,1596240037431704679,11357.99,20.199,11357.67,0.169,11357.5,0.062,...,11358.82,0.001,12494.7,0.063,13744.17,0.111,11358.25,0.175,0,0.0
1001,4,173525834,1596240037000000000,1596240037431749155,11357.99,20.199,11357.67,0.169,11357.5,0.062,...,11358.82,0.001,12494.7,0.063,13744.17,0.111,11358.26,2.0,0,0.0
1004,4,173525835,1596240037000000000,1596240037431794710,11357.99,20.199,11357.67,0.169,11357.5,0.062,...,11358.82,0.001,12494.7,0.063,13744.17,0.111,11358.33,0.004,0,0.0
1005,4,173525836,1596240037000000000,1596240037431825867,11357.99,20.199,11357.67,0.169,11357.5,0.062,...,11358.82,0.001,12494.7,0.063,13744.17,0.111,11358.35,0.001,0,0.0
1008,4,173525837,1596240037000000000,1596240037431863851,11357.99,20.199,11357.67,0.169,11357.5,0.062,...,11358.82,0.001,12494.7,0.063,13744.17,0.111,11358.82,0.001,0,0.0
1009,4,173525838,1596240037000000000,1596240037431890572,11357.99,20.199,11357.67,0.169,11357.5,0.062,...,11358.82,0.001,12494.7,0.063,13744.17,0.111,11359.0,0.018,0,0.0
1011,4,173525839,1596240037000000000,1596240037431919754,11357.99,20.199,11357.67,0.169,11357.5,0.062,...,11358.82,0.001,12494.7,0.063,13744.17,0.111,11359.0,0.04,0,0.0
1012,4,173525840,1596240037000000000,1596240037431949812,11357.99,20.199,11357.67,0.169,11357.5,0.062,...,11358.82,0.001,12494.7,0.063,13744.17,0.111,11359.0,0.005,0,0.0
1014,4,173525841,1596240037000000000,1596240037431984267,11357.99,20.199,11357.67,0.169,11357.5,0.062,...,11358.82,0.001,12494.7,0.063,13744.17,0.111,11359.3,0.106,0,0.0
1015,4,173525842,1596240037000000000,1596240037432010555,11357.99,20.199,11357.67,0.169,11357.5,0.062,...,11358.82,0.001,12494.7,0.063,13744.17,0.111,11359.3,0.005,0,0.0


In [52]:
import pandas as pd
import numpy as np

# Assumes the DataFrame is named df

# Tolerance for floating-point comparison
epsilon = 1e-8

# Inconsistency condition
# (note: trade_amt / trade_cnt is a volume per trade while trade_px is a price,
# so the two differ on essentially every row with trade_cnt > 0)
mask = (df['trade_cnt'] > 0) & (
    np.abs(df['trade_amt'] / df['trade_cnt'] - df['trade_px']) > epsilon
)

# Filter the rows
inconsistent_trades = df[mask]

# Print the count and show the first 10 rows
print(f"Found {len(inconsistent_trades)} rows where trade_px does not match trade_amt / trade_cnt")
inconsistent_trades.head(10)


Found 3930 rows where trade_px does not match trade_amt / trade_cnt


Unnamed: 0,type,msgSeqNum,exchHostTime,adapterTime,px_buy_1,amt_buy_1,px_buy_2,amt_buy_2,px_buy_3,amt_buy_3,...,px_sell_8,amt_sell_8,px_sell_9,amt_sell_9,px_sell_10,amt_sell_10,trade_px,trade_amt,trade_cnt,moreTradesInBatch
1002,4,122537841,1596240037000000000,1596240037431776197,11357.99,20.199,11357.67,0.169,11357.5,0.062,...,11358.82,0.001,12494.7,0.063,13744.17,0.111,11358.25,0.175,1,0.0
1003,4,122537842,1596240037000000000,1596240037431782097,11357.99,20.199,11357.67,0.169,11357.5,0.062,...,11358.82,0.001,12494.7,0.063,13744.17,0.111,11358.26,2.0,1,0.0
1006,4,122537843,1596240037000000000,1596240037431853669,11357.99,20.199,11357.67,0.169,11357.5,0.062,...,11358.82,0.001,12494.7,0.063,13744.17,0.111,11358.33,0.004,1,0.0
1007,4,122537844,1596240037000000000,1596240037431858991,11357.99,20.199,11357.67,0.169,11357.5,0.062,...,11358.82,0.001,12494.7,0.063,13744.17,0.111,11358.35,0.001,1,0.0
1010,4,122537845,1596240037000000000,1596240037431914638,11357.99,20.199,11357.67,0.169,11357.5,0.062,...,11358.82,0.001,12494.7,0.063,13744.17,0.111,11358.82,0.001,1,0.0
1013,4,122537846,1596240037000000000,1596240037431978620,11357.99,20.199,11357.67,0.169,11357.5,0.062,...,11358.82,0.001,12494.7,0.063,13744.17,0.111,11359.0,0.063,3,0.0
1016,4,122537847,1596240037000000000,1596240037432042464,11357.99,20.199,11357.67,0.169,11357.5,0.062,...,11358.82,0.001,12494.7,0.063,13744.17,0.111,11359.3,0.111,2,0.0
1018,4,122537848,1596240037000000000,1596240037432074254,11357.99,20.199,11357.67,0.169,11357.5,0.062,...,11358.82,0.001,12494.7,0.063,13744.17,0.111,11359.31,1.008,1,0.0
1021,4,122537849,1596240037000000000,1596240037432135019,11357.99,20.199,11357.67,0.169,11357.5,0.062,...,11358.82,0.001,12494.7,0.063,13744.17,0.111,11359.32,0.021,2,0.0
1024,4,122537850,1596240037000000000,1596240037432193442,11357.99,20.199,11357.67,0.169,11357.5,0.062,...,11358.82,0.001,12494.7,0.063,13744.17,0.111,11359.36,1.025,2,0.0


In [56]:
# trade_amt = trade_cnt × trade_px

0.063 / 3 

0.021

In [62]:
df.loc[1000:1020]

Unnamed: 0,type,msgSeqNum,exchHostTime,adapterTime,px_buy_1,amt_buy_1,px_buy_2,amt_buy_2,px_buy_3,amt_buy_3,...,px_sell_8,amt_sell_8,px_sell_9,amt_sell_9,px_sell_10,amt_sell_10,trade_px,trade_amt,trade_cnt,moreTradesInBatch
1000,4,173525833,1596240037000000000,1596240037431704679,11357.99,20.199,11357.67,0.169,11357.5,0.062,...,11358.82,0.001,12494.7,0.063,13744.17,0.111,11358.25,0.175,0,0.0
1001,4,173525834,1596240037000000000,1596240037431749155,11357.99,20.199,11357.67,0.169,11357.5,0.062,...,11358.82,0.001,12494.7,0.063,13744.17,0.111,11358.26,2.0,0,0.0
1002,4,122537841,1596240037000000000,1596240037431776197,11357.99,20.199,11357.67,0.169,11357.5,0.062,...,11358.82,0.001,12494.7,0.063,13744.17,0.111,11358.25,0.175,1,0.0
1003,4,122537842,1596240037000000000,1596240037431782097,11357.99,20.199,11357.67,0.169,11357.5,0.062,...,11358.82,0.001,12494.7,0.063,13744.17,0.111,11358.26,2.0,1,0.0
1004,4,173525835,1596240037000000000,1596240037431794710,11357.99,20.199,11357.67,0.169,11357.5,0.062,...,11358.82,0.001,12494.7,0.063,13744.17,0.111,11358.33,0.004,0,0.0
1005,4,173525836,1596240037000000000,1596240037431825867,11357.99,20.199,11357.67,0.169,11357.5,0.062,...,11358.82,0.001,12494.7,0.063,13744.17,0.111,11358.35,0.001,0,0.0
1006,4,122537843,1596240037000000000,1596240037431853669,11357.99,20.199,11357.67,0.169,11357.5,0.062,...,11358.82,0.001,12494.7,0.063,13744.17,0.111,11358.33,0.004,1,0.0
1007,4,122537844,1596240037000000000,1596240037431858991,11357.99,20.199,11357.67,0.169,11357.5,0.062,...,11358.82,0.001,12494.7,0.063,13744.17,0.111,11358.35,0.001,1,0.0
1008,4,173525837,1596240037000000000,1596240037431863851,11357.99,20.199,11357.67,0.169,11357.5,0.062,...,11358.82,0.001,12494.7,0.063,13744.17,0.111,11358.82,0.001,0,0.0
1009,4,173525838,1596240037000000000,1596240037431890572,11357.99,20.199,11357.67,0.169,11357.5,0.062,...,11358.82,0.001,12494.7,0.063,13744.17,0.111,11359.0,0.018,0,0.0


In [63]:
df

Unnamed: 0,type,msgSeqNum,exchHostTime,adapterTime,px_buy_1,amt_buy_1,px_buy_2,amt_buy_2,px_buy_3,amt_buy_3,...,px_sell_8,amt_sell_8,px_sell_9,amt_sell_9,px_sell_10,amt_sell_10,trade_px,trade_amt,trade_cnt,moreTradesInBatch
1000,4,173525833,1596240037000000000,1596240037431704679,11357.99,20.199,11357.67,0.169,11357.50,0.062,...,11358.82,0.001,12494.70,0.063,13744.17,0.111,11358.25,0.175,0,0.0
1001,4,173525834,1596240037000000000,1596240037431749155,11357.99,20.199,11357.67,0.169,11357.50,0.062,...,11358.82,0.001,12494.70,0.063,13744.17,0.111,11358.26,2.000,0,0.0
1002,4,122537841,1596240037000000000,1596240037431776197,11357.99,20.199,11357.67,0.169,11357.50,0.062,...,11358.82,0.001,12494.70,0.063,13744.17,0.111,11358.25,0.175,1,0.0
1003,4,122537842,1596240037000000000,1596240037431782097,11357.99,20.199,11357.67,0.169,11357.50,0.062,...,11358.82,0.001,12494.70,0.063,13744.17,0.111,11358.26,2.000,1,0.0
1004,4,173525835,1596240037000000000,1596240037431794710,11357.99,20.199,11357.67,0.169,11357.50,0.062,...,11358.82,0.001,12494.70,0.063,13744.17,0.111,11358.33,0.004,0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
41369,6,45258026841,1596241406000000000,1596241406563025313,11308.23,0.480,11307.86,1.009,11307.85,0.240,...,11309.87,22.000,12440.86,0.449,13684.95,1.004,0.00,0.000,0,
41370,6,45258026853,1596241406000000000,1596241406565277457,11308.23,6.480,11307.86,1.009,11307.85,0.240,...,11309.90,0.449,12440.89,1.004,13684.98,5.265,0.00,0.000,0,
41371,6,45258026878,1596241406000000000,1596241406567760680,11308.23,6.480,11307.85,0.240,11307.84,8.250,...,11309.90,0.449,12440.89,1.004,13684.98,5.265,0.00,0.000,0,
41372,6,45258026892,1596241406000000000,1596241406570122673,11308.23,0.480,11307.85,0.240,11307.84,8.250,...,11309.90,0.449,12440.89,1.004,13684.98,5.265,0.00,0.000,0,


I ran an automated audit and found a set of real anomalies and errors. Below are short one-line descriptions of each error type, with the number of rows affected.

- moreTradesInBatch has 30156 missing values (most of the column is empty); this is the same as the number of type 6 rows.

- px_buy_1 and amt_buy_1..10 each have 1 missing value (the book levels are not filled consistently).

- 2 rows have a price ≤ 0 (not valid for prices).

- 3 rows have a volume ≤ 0 (not a valid volume).

- In 1 row the best bid is above the best ask (px_buy_1 > px_sell_1), i.e. a crossed book.

- Monotonicity violations in bid levels: 2 violating pairs (px_buy_i < px_buy_{i+1}).

- Monotonicity violations in ask levels: 7 violating pairs (px_sell_i > px_sell_{i+1}).

- In 1839 rows trade_cnt > 0, but trade_px or trade_amt is missing or trade_amt <= 0 (inconsistent trade data).

- In 3190 rows trade_cnt == 0, but trade_amt > 0 (no trades reported, yet a volume is given).

- In 3930 rows trade_px does not agree with trade_amt / trade_cnt (price/volume mismatch with a non-zero trade count; note that this compares a per-trade volume with a price, so it is a weak signal).

- 953 rows show a large gap between adapterTime and exchHostTime (> 1e9 ns, i.e. more than 1 second), which suggests suspicious timestamps or clock synchronization problems.

- 3 rows contain extreme price outliers (prices >100× the median or <1/100 of the median), possibly scale or unit errors.
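
Several of the checks listed above are plain boolean masks, so they can be collected into one table of counts. The sketch below reuses conditions that appear earlier in the notebook; the function name `audit_counts` and the check labels are hypothetical, and the list is deliberately not exhaustive.

```python
import pandas as pd


def audit_counts(df: pd.DataFrame) -> pd.Series:
    """Number of rows matched by each mask-based check."""
    checks = {
        "crossed_book": df["px_buy_1"] > df["px_sell_1"],
        "trade_cnt_negative": df["trade_cnt"] < 0,
        "trade_amt_negative": df["trade_amt"] < 0,
        "trades_without_valid_amt": (df["trade_cnt"] > 0) & (
            df["trade_px"].isna() | df["trade_amt"].isna() | (df["trade_amt"] <= 0)
        ),
        "amt_without_trades": (df["trade_cnt"] == 0) & (df["trade_amt"] > 0),
    }
    return pd.Series({name: int(mask.sum()) for name, mask in checks.items()})


# Toy data: the second row is crossed and has a negative trade volume.
toy = pd.DataFrame({
    "px_buy_1": [10.0, 12.0],
    "px_sell_1": [11.0, 11.0],
    "trade_cnt": [0, 1],
    "trade_px": [10.5, 11.0],
    "trade_amt": [0.2, -0.1],
})
print(audit_counts(toy))
```

Keeping each check as a named mask makes it easy to add or drop a rule, and to pull the matching rows for any single check when needed.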