# Subgraph Queries with DataStreams

The goal of this notebook is to show and explain how to create a decentralized data pipeline powered by CoW Protocol subgraphs with DataStreams. Right now DataStreams acts only as a GraphQL query manager. Should DataStreams implement any preprocessing, or should that be done outside of the library? Currently I am leaning towards doing the preprocessing outside of the library but still in the Jupyter notebook.

Note (2.4.23): What needs to happen to keep the preprocessing simple enough to fit in a single notebook? How would I describe my current preprocessing approach and what has been done so far?

- Subgraph link: https://thegraph.com/hosted-service/subgraph/cowprotocol/cow
- Dune query link: https://dune.com/queries/1941061

### TODO 
- Why do the `settlements.trades` and `trades` queries return different values?

## Setup Jupyter Environment

First install DataStreams with `!pip install git+https://github.com/Evan-Kim2028/DataStreams.git` in Jupyter or `pip install git+https://github.com/Evan-Kim2028/DataStreams.git` in a virtual environment. The primary DataStreams dependencies are Python 3.10, Subgrounds, and Pandas.

In [23]:
# !pip install git+https://github.com/Evan-Kim2028/DataStreams.git

In [1]:
from datastreams.datastream import Streamer

import os
import pandas as pd
import polars as pl

In [2]:
# These commands enlarge the column size of the dataframe so things like 0x... are not truncated
pd.set_option('display.max_columns', None)
pd.set_option('display.expand_frame_repr', False)
pd.set_option('max_colwidth', None)

## Subgraph Query with DataStreams

1. Setup `Streamer` with the subgraph endpoint. 
2. Select the schemas to query through `queryDict`, a dictionary of `{str: FieldPath}` key-value pairs. FieldPaths are the main objects used in Subgrounds to represent subgraph data as functional Python objects.
3. Save data to a local folder.
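
Step 3 is not shown explicitly in the cells that follow, so here is a minimal sketch of one way to do it. The helper name `save_frame`, the folder layout, and the toy frame are assumptions made for illustration; they are not part of DataStreams.

```python
from pathlib import Path

import pandas as pd


def save_frame(df: pd.DataFrame, folder: str, name: str) -> Path:
    """Write df to <folder>/<name>.csv, creating the folder if it is missing."""
    out_dir = Path(folder)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / f"{name}.csv"
    df.to_csv(out_path, index=False)
    return out_path


# Toy frame standing in for a query result (hypothetical values)
example_df = pd.DataFrame(
    {
        "settlements_id": ["0xabc", "0xdef"],
        "settlements_firstTradeTimestamp": [1648449982, 1670299751],
    }
)
```

Calling `save_frame(example_df, "data", "settlements")` would write `data/settlements.csv` and return its path.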

In [3]:
# graphql endpoint of cowprotocol for ethereum
endpoint = 'https://api.thegraph.com/subgraphs/name/cowprotocol/cow'

# instantiate Streamer class
ds = Streamer(endpoint)

### Trying with trades instead of settlements

In [4]:
# select the trades FieldPath
trades_fp = ds.queryDict.get('trades')

# fetch path for trades.trades
trade_query_cols = ds.getFieldPathQueryCols(trades_fp)
trade_col_dict = ds.getQueryCols(trades_fp, trade_query_cols)

In [5]:
trades_fp1 = trades_fp(    
    orderBy='firstTradeTimestamp', # or settlements_firstTradeTimestamp
    orderDirection='desc',
    )

In [6]:
trades_fp2 = trade_col_dict['settlement'](
    orderBy='timestamp', # or timestamp
    orderDirection='desc',
)

In [7]:
# trades schema query
trades_df1 = ds.runQuery(trades_fp1, query_size=1001)

FIELD - trades


In [8]:
trades_df2 = ds.runQuery(trades_fp2, query_size=1001)

FIELD - trades.settlement


In [9]:
settlements_df1 = ds.runQuery(ds.queryDict.get('settlements'), query_size=1001)

FIELD - settlements


In [29]:
# sort settlements_df1 by firstTradeTimestamp ascending
settlements_df1 = settlements_df1.sort_values(by='settlements_firstTradeTimestamp', ascending=True)

In [30]:
# get first and last values of the trades_timestamp column for trades_df1
first_trades_df1 = trades_df1["trades_timestamp"].head(1)
second_trades_df1 = trades_df1["trades_timestamp"].tail(1)

print(f'TESTDF1\nfirst timestamp: {first_trades_df1}, last timestamp: {second_trades_df1}')

TESTDF1
first timestamp: 0    1663907363
Name: trades_timestamp, dtype: int64, last timestamp: 99999    1638090602
Name: trades_timestamp, dtype: int64


In [33]:
# NOTE: the nested-field query (trades.settlement) was not useful here, so this check stays commented out
# # get first and last values of the trades_settlement_firstTradeTimestamp column for trades_df2
# first_trades_df2 = trades_df2["trades_settlement_firstTradeTimestamp"].head(1)
# second_trades_df2 = trades_df2["trades_settlement_firstTradeTimestamp"].tail(1)

# print(f'TESTDF2\nfirst timestamp: {first_trades_df2}, last timestamp: {second_trades_df2}')

In [32]:
# get first and last values of the settlements_firstTradeTimestamp column for settlements_df1
first_settlements_df1 = settlements_df1["settlements_firstTradeTimestamp"].head(1)
second_settlements_df1 = settlements_df1["settlements_firstTradeTimestamp"].tail(1)

print(f'SETTLEDF\nfirst timestamp: {first_settlements_df1}, last timestamp: {second_settlements_df1}')

SETTLEDF
first timestamp: 11384    1628089484
Name: settlements_firstTradeTimestamp, dtype: int64, last timestamp: 73636    1676231795
Name: settlements_firstTradeTimestamp, dtype: int64
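
The raw integers are hard to compare by eye. Assuming these timestamps are Unix epoch seconds (their magnitude suggests so, but the notebook does not state it), a small sketch like the following turns them into UTC datetimes. The helper name `to_utc` is made up for this example.

```python
from datetime import datetime, timezone


def to_utc(ts: int) -> datetime:
    """Convert Unix epoch seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(ts, tz=timezone.utc)


# First and last settlements_firstTradeTimestamp values printed above
first_ts, last_ts = 1628089484, 1676231795
first_dt, last_dt = to_utc(first_ts), to_utc(last_ts)

print(first_dt.isoformat(), last_dt.isoformat())
print("span in days:", (last_dt - first_dt).days)
```

The same conversion applies to `trades_timestamp`, which makes it easier to see whether two query results cover the same period.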


In [38]:
# list of the settlement ids referenced by trades_df1 (the variable says "timestamps" but holds ids)
trades_df1_timestamps = trades_df1["trades_settlement_id"].to_list()

# list of the settlement ids in settlements_df1
settlements_df1_timestamps = settlements_df1["settlements_id"].to_list()

In [44]:
print(f'trades_df1_timestamps: {len(trades_df1_timestamps)}, settlements_df1_timestamps: {len(settlements_df1_timestamps)}')

trades_df1_timestamps: 100000, settlements_df1_timestamps: 100000


In [39]:
settlements_df1.columns

Index(['settlements_id', 'settlements_txHash',
       'settlements_firstTradeTimestamp', 'settlements_solver_id', 'endpoint'],
      dtype='object')

In [40]:
trades_df1.columns

Index(['trades_id', 'trades_timestamp', 'trades_gasPrice', 'trades_feeAmount',
       'trades_txHash', 'trades_settlement_id', 'trades_buyAmount',
       'trades_sellAmount', 'trades_sellToken_id', 'trades_buyToken_id',
       'trades_order_id', 'trades_buyAmountEth', 'trades_sellAmountEth',
       'trades_buyAmountUsd', 'trades_sellAmountUsd', 'endpoint'],
      dtype='object')

In [41]:
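# keep the trade-side settlement ids that also appear in settlements_df1
# NOTE: despite the variable name, the filter is `in`, so this keeps ids that ARE in settlements
settlement_id_set = set(settlements_df1_timestamps)  # set lookup avoids rescanning the list per element
trades_df1_timestamps_not_in_settlements_df1_timestamps = [x for x in trades_df1_timestamps if x in settlement_id_set]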

In [42]:
print(len(trades_df1_timestamps_not_in_settlements_df1_timestamps))

28075


In [34]:
# the reverse check: settlement ids that also appear among the trade-side settlement ids
blahblah_list = [x for x in settlements_df1_timestamps if x in trades_df1_timestamps]

In [36]:
print(len(blahblah_list))

25773
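
The two counts above (28075 and 25773) differ even though both measure an overlap between the same two lists. One plausible explanation, stated here as an assumption and not verified against the subgraph, is that a settlement id repeats on the trades side because one settlement can contain several trades. The toy data below is invented to show how row-level counts and unique-id overlap diverge in that case.

```python
from collections import Counter

# Toy stand-ins: each trade row carries the id of its settlement,
# so a settlement id can repeat in the trades list.
trade_settlement_ids = ["s1", "s1", "s2", "s3", "s9"]
settlement_ids = ["s1", "s2", "s3", "s4"]

settlement_set = set(settlement_ids)
trade_set = set(trade_settlement_ids)

# Row-level counts, which is what a list comprehension over rows measures
trades_matched = sum(1 for x in trade_settlement_ids if x in settlement_set)
settlements_matched = sum(1 for x in settlement_ids if x in trade_set)

# Unique-id overlap is symmetric, unlike the two row-level counts
shared_ids = settlement_set & trade_set

# Settlement ids that occur on more than one trade row
repeated = {k: v for k, v in Counter(trade_settlement_ids).items() if v > 1}

print(trades_matched, settlements_matched, len(shared_ids), repeated)
```

Comparing `len(shared_ids)` with the two row-level counts separates "duplicates on one side" from "ids genuinely missing on the other side".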


In [None]:
### END SECTION

In [27]:
# select the settlements FieldPath
settlements_fp = ds.queryDict.get('settlements')

# fetch path for settlements.trades
settlement_query_cols = ds.getFieldPathQueryCols(settlements_fp)
settlement_col_dict = ds.getQueryCols(settlements_fp, settlement_query_cols)

In [21]:
# settlement schema queries
settlements_df = ds.runQuery(settlements_fp, query_size=1200)


FIELD - settlements


#### Troubleshooting section 2.12.23

In [67]:
new_fp = ds.queryDict.get('settlements')(    
    orderBy='firstTradeTimestamp', # or settlements_firstTradeTimestamp
    orderDirection='asc',
    )

In [68]:
new_trades_fp = settlement_col_dict['trades'](
    orderBy='timestamp', # or settlements_trades_timestamp
    orderDirection='asc',
)

In [69]:
old_trades_fp = ds.queryDict.get('trades')(
    orderBy='timestamp', # shows up as trades_timestamp in the result
    orderDirection='asc',
)

In [70]:
testdf1 = ds.runQuery(new_fp, query_size=5000) 

FIELD - settlements


In [71]:
testdf2 = ds.runQuery(new_trades_fp, query_size=5000)

FIELD - settlements.trades


In [72]:
testdf3 = ds.runQuery(old_trades_fp, query_size=5000)

FIELD - trades


In [73]:
# get first and last values of the settlements_firstTradeTimestamp column for testdf1
firstdf1 = testdf1["settlements_firstTradeTimestamp"].head(1)
seconddf1 = testdf1["settlements_firstTradeTimestamp"].tail(1)

print(f'TESTDF1\nfirst timestamp: {firstdf1}, last timestamp: {seconddf1}')

TESTDF1
first timestamp: 0    1648449982
Name: settlements_firstTradeTimestamp, dtype: int64, last timestamp: 4999    1673800007
Name: settlements_firstTradeTimestamp, dtype: int64


In [74]:
# get first and last values of the settlements_trades_timestamp column for testdf2
firstdf2 = testdf2["settlements_trades_timestamp"].head(1)
seconddf2 = testdf2["settlements_trades_timestamp"].tail(1)

print(f'TESTDF2\nfirst timestamp: {firstdf2}, last timestamp: {seconddf2}')

TESTDF2
first timestamp: 0    1648449982
Name: settlements_trades_timestamp, dtype: int64, last timestamp: 154    1674983159
Name: settlements_trades_timestamp, dtype: int64


In [75]:
# get first and last values of the trades_timestamp column for testdf3
firstdf3 = testdf3["trades_timestamp"].head(1)
seconddf3 = testdf3["trades_timestamp"].tail(1)

print(f'TESTDF3\nfirst timestamp: {firstdf3}, last timestamp: {seconddf3}')

TESTDF3
first timestamp: 0    1663907363
Name: trades_timestamp, dtype: int64, last timestamp: 4999    1673039735
Name: trades_timestamp, dtype: int64


In [76]:
testdf3.columns

Index(['trades_id', 'trades_timestamp', 'trades_gasPrice', 'trades_feeAmount',
       'trades_txHash', 'trades_settlement_id', 'trades_buyAmount',
       'trades_sellAmount', 'trades_sellToken_id', 'trades_buyToken_id',
       'trades_order_id', 'trades_buyAmountEth', 'trades_sellAmountEth',
       'trades_buyAmountUsd', 'trades_sellAmountUsd', 'endpoint'],
      dtype='object')

In [None]:
# END TROUBLESHOOT SECTION

In [None]:
# token schema queries
tokens_df = ds.runQuery(ds.queryDict.get('tokens'), query_size=10000) 

In [15]:
settlement_missing_df = ds.runQuery(settlement_col_dict['trades'], query_size=1200) 

# settlement_missing_df = ds.runQuery(ds.queryDict.get('trades'), query_size=1200) # For Troubleshooting purposes

FIELD - settlements.trades


In [16]:
# convert the DataFrames into lists of record dictionaries
settlement_missing_dict = settlement_missing_df.to_dict('records')
settlement_dict = settlements_df.to_dict('records')
tokens_dict = tokens_df.to_dict('records')
# Don't need these for now
# orders_dict = orders_df.to_dict('records')
# trades_dict = trades_df.to_dict('records')

In [17]:
settlement_missing_pl = pl.from_dicts(settlement_missing_dict)
settlement_pl = pl.from_dicts(settlement_dict)
tokens_pl = pl.from_dicts(tokens_dict)

In [18]:
settlement_missing_pl.head(7)

settlements_trades_id,settlements_trades_timestamp,settlements_trades_gasPrice,settlements_trades_feeAmount,settlements_trades_txHash,settlements_trades_settlement_id,settlements_trades_buyAmount,settlements_trades_sellAmount,settlements_trades_sellToken_id,settlements_trades_buyToken_id,settlements_trades_order_id,settlements_trades_buyAmountEth,settlements_trades_sellAmountEth,settlements_trades_buyAmountUsd,settlements_trades_sellAmountUsd,endpoint
str,i64,i64,f64,str,str,f64,f64,str,str,str,f64,f64,f64,f64,str
"""0x15d95aaa251a...",1648449982,26612766385,1.0247e+19,"""0x000012606964...","""0x000012606964...",1.5979e+16,6.334e+19,"""0x6b175474e890...","""0xeeeeeeeeeeee...","""0x15d95aaa251a...",0.015979,0.019104,52.977187,63.340168,"""https://api.th..."
"""0x8b819086f258...",1648449982,26612766385,3086400000000000.0,"""0x000012606964...","""0x000012606964...",7.6655e+17,1e+17,"""0xc02aaa39b223...","""0x6810e776880c...","""0x8b819086f258...",0.099501,0.1,329.898511,331.552775,"""https://api.th..."
"""0xd0249d0794e1...",1648449982,26612766385,9593700000000000.0,"""0x000012606964...","""0x000012606964...",1.3153e+18,6.4482e+18,"""0xa1d65e8fb6e8...","""0xeeeeeeeeeeee...","""0xd0249d0794e1...",1.315289,0.0,4360.87806,0.0,"""https://api.th..."
"""0xbc6a06f7ce5f...",1670299751,13098041329,3391000000.0,"""0x00008e5e8787...","""0x00008e5e8787...",9454900000000000.0,21203000000.0,"""0xfc4913214444...","""0xeeeeeeeeeeee...","""0xbc6a06f7ce5f...",0.009455,0.01135,11.959972,14.357202,"""https://api.th..."
"""0x6244200e0939...",1656651460,17645394095,3.2313e+18,"""0x000098565f5d...","""0x000098565f5d...",1.8867e+17,1.9908e+20,"""0x6b175474e890...","""0xeeeeeeeeeeee...","""0x6244200e0939...",0.188675,0.189935,197.761251,199.0821,"""https://api.th..."
"""0x6669b9d04516...",1648145708,61132586787,1.0754e+17,"""0x0000b671e285...","""0x0000b671e285...",4.6844e+16,1.2419e+18,"""0x990f341946a3...","""0xeeeeeeeeeeee...","""0x6669b9d04516...",0.046844,0.051457,145.460728,159.78521,"""https://api.th..."
"""0x351fe0a5da44...",1643815289,113195266740,4.402e+19,"""0x0000eb0ede5f...","""0x0000eb0ede5f...",4699800000.0,4.7442e+21,"""0x6b175474e890...","""0xa0b86991c621...","""0x351fe0a5da44...",1.755841,1.77241,4699.847482,4744.196623,"""https://api.th..."


In [19]:
settlement_pl.head(7)

| settlements_id | settlements_txHash | settlements_firstTradeTimestamp | settlements_solver_id | endpoint |
| --- | --- | --- | --- | --- |
| str | str | i64 | str | str |
| "0x000012606964..." | "0x000012606964..." | 1648449982 | "0xde1c59bc25d8..." | "https://api.th..." |
| "0x00008e5e8787..." | "0x00008e5e8787..." | 1670299751 | "0xa21740833858..." | "https://api.th..." |
| "0x000098565f5d..." | "0x000098565f5d..." | 1656651460 | "0xc9ec550bea1c..." | "https://api.th..." |
| "0x0000b671e285..." | "0x0000b671e285..." | 1648145708 | "0xdae69affe582..." | "https://api.th..." |
| "0x0000eb0ede5f..." | "0x0000eb0ede5f..." | 1643815289 | "0xde786877a10d..." | "https://api.th..." |
| "0x00011f3edd4a..." | "0x00011f3edd4a..." | 1664525399 | "0xe9ae2d792f98..." | "https://api.th..." |
| "0x00012679ac52..." | "0x00012679ac52..." | 1647661115 | "0x15f4c337122e..." | "https://api.th..." |


In [20]:
# rename column to match names for join
settlement_missing_pl = settlement_missing_pl.rename({"settlements_trades_txHash": "settlements_id"})

In [21]:
# print the settlement and settlement_missing shapes
print(settlement_pl.shape)
print(settlement_missing_pl.shape)

(1200, 5)
(1704, 16)


In [22]:
# inner join on settlements_id (on the trades side this is the renamed settlements_trades_txHash column)
total_settlement_pl = settlement_pl.join(settlement_missing_pl, on="settlements_id", how="inner")
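
The shapes printed above show more trade rows than settlement rows, which is what a one-to-many relationship would produce. An inner join silently drops rows that have no partner, so it can help to check first what would be lost. The sketch below does that check in pandas (not polars) on invented toy frames, using `merge` with `validate` and `indicator`.

```python
import pandas as pd

# Toy stand-ins for the settlements and settlements.trades frames
settlements = pd.DataFrame(
    {"settlements_id": ["0xa", "0xb", "0xc"], "settlements_solver_id": ["s1", "s2", "s1"]}
)
trades = pd.DataFrame(
    {"settlements_id": ["0xa", "0xa", "0xb", "0xz"], "settlements_trades_timestamp": [1, 1, 2, 3]}
)

# Outer merge with an indicator column records where each row came from.
# validate="one_to_many" raises if a settlement id repeats on the left side.
merged = settlements.merge(
    trades, on="settlements_id", how="outer", validate="one_to_many", indicator=True
)

inner_rows = merged[merged["_merge"] == "both"]             # what an inner join keeps
only_settlements = merged[merged["_merge"] == "left_only"]  # settlements with no trades
only_trades = merged[merged["_merge"] == "right_only"]      # trades with no settlement

print(len(inner_rows), len(only_settlements), len(only_trades))
```

Non-empty `left_only` or `right_only` groups point at rows the inner join would discard.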

In [23]:
# keep only the token columns needed for the joins
tokens_pl = tokens_pl[["tokens_address", "tokens_decimals", "tokens_symbol", "tokens_name"]]

In [24]:
# separate copies for the sell-token and buy-token joins
tokens_sell_pl = tokens_pl.rename({"tokens_address": "settlements_trades_sellToken_id"})
tokens_buy_pl = tokens_pl.rename({"tokens_address": "settlements_trades_buyToken_id"})

In [25]:
# inner join token info
total_settlement_tokens_pl = total_settlement_pl.join(tokens_sell_pl, on="settlements_trades_sellToken_id", how="inner").join(tokens_buy_pl, on="settlements_trades_buyToken_id", how="inner")

In [26]:
total_settlement_tokens_pl = total_settlement_tokens_pl[
    [
'settlements_id',
'settlements_solver_id',
'settlements_trades_timestamp', # everything below is from settlement query on trades
 'settlements_trades_gasPrice',
 'settlements_trades_feeAmount',
 'settlements_trades_buyAmount',
 'settlements_trades_sellAmount',
 'settlements_trades_sellToken_id',
 'settlements_trades_buyToken_id',
 'settlements_trades_order_id',
 'settlements_trades_buyAmountEth',
 'settlements_trades_sellAmountEth',
 'settlements_trades_buyAmountUsd',
 'settlements_trades_sellAmountUsd',
 'tokens_decimals', # everything below is from token query. Double joins (sell and buy)
 'tokens_symbol',
 'tokens_name',
 'tokens_decimals_right',
 'tokens_symbol_right',
 'tokens_name_right'
    ]
]
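
Because the token frame is joined twice, the second set of token columns ends up with a `_right` suffix, and the reader has to remember that unsuffixed means sell and `_right` means buy. A possible alternative, sketched here in pandas with invented toy data and hypothetical `sell_`/`buy_` prefixes, is to prefix the token columns before each join.

```python
import pandas as pd

# Toy token and trade frames (hypothetical values)
tokens = pd.DataFrame(
    {
        "tokens_address": ["0x1", "0x2"],
        "tokens_symbol": ["DAI", "USDC"],
        "tokens_decimals": [18, 6],
    }
)
trades = pd.DataFrame({"sellToken_id": ["0x1"], "buyToken_id": ["0x2"]})

# Prefix every token column so the side is visible in the column name
sell_tokens = tokens.add_prefix("sell_")
buy_tokens = tokens.add_prefix("buy_")

labelled = trades.merge(
    sell_tokens, left_on="sellToken_id", right_on="sell_tokens_address"
).merge(buy_tokens, left_on="buyToken_id", right_on="buy_tokens_address")

print(labelled.columns.tolist())
```

The resulting columns read `sell_tokens_symbol` and `buy_tokens_symbol`, with no suffix to interpret.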

#### merge solver with total_settlement_tokens

In [27]:
solvers = pd.read_csv('data/cowv2_solvers.csv') # load with pandas instead of polars; replacing the '\' character was troublesome in polars

In [28]:
# rename address to settlements_solver_id in pandas
solvers = solvers.rename(columns={"address": "settlements_solver_id"})

In [29]:
# NOTE - Dune formats addresses as \x..., so the leading '\' needs to be converted to '0' to get 0x...
solvers['settlements_solver_id'] = solvers['settlements_solver_id'].str.replace('\\', '0', regex=False)
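
The replacement above swaps every backslash in the string. A slightly stricter variant, sketched below in plain Python, only rewrites a leading `\x` prefix and lowercases the result. The function name `normalize_address` and the shortened example addresses are made up for illustration, and the lowercasing assumes the subgraph ids are lowercase hex.

```python
def normalize_address(raw: str) -> str:
    r"""Turn a '\x...' hex string into a lowercase '0x...' string."""
    addr = raw.strip()
    if addr.startswith("\\x"):
        # Only the two-character prefix is rewritten; the hex body is untouched
        addr = "0x" + addr[2:]
    return addr.lower()


# Shortened placeholder addresses, not real solver addresses
examples = ["\\xDE1C59BC25D8", "0xde1c59bc25d8"]
print([normalize_address(a) for a in examples])
```

On a pandas column this could be applied with `solvers['settlements_solver_id'].map(normalize_address)`.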

In [30]:
# turn solvers into a dictionary
solvers_dict = solvers.to_dict('records')

# convert dict to polars
solvers_pl = pl.from_dicts(solvers_dict)

In [31]:
# inner join solvers_pl on total_settlement_tokens_pl
total_settlement_tokens_solvers_pl = total_settlement_tokens_pl.join(solvers_pl, on="settlements_solver_id", how="inner")

In [32]:
total_settlement_tokens_solvers_pl

settlements_id,settlements_solver_id,settlements_trades_timestamp,settlements_trades_gasPrice,settlements_trades_feeAmount,settlements_trades_buyAmount,settlements_trades_sellAmount,settlements_trades_sellToken_id,settlements_trades_buyToken_id,settlements_trades_order_id,settlements_trades_buyAmountEth,settlements_trades_sellAmountEth,settlements_trades_buyAmountUsd,settlements_trades_sellAmountUsd,tokens_decimals,tokens_symbol,tokens_name,tokens_decimals_right,tokens_symbol_right,tokens_name_right,environment,name,active
str,str,i64,i64,f64,f64,f64,str,str,str,f64,f64,f64,f64,i64,str,str,i64,str,str,str,str,bool
"""0x007662fcbc40...","""0x15f4c337122e...",1642062959,128643559896,1.87307354e8,2.8211e22,2.0000e9,"""0xdac17f958d2e...","""0x0000000de40d...","""0xf8f5e4cf8c98...",0.0,0.597915,0.0,2000.0,6,"""USDT""","""Tether USD""",18,"""DLTA""","""delta.theta""","""prod""","""Gnosis_ParaSwa...",false
"""0x000488b508c8...","""0xe18b5632df2e...",1660832563,12666845214,2.2470e20,1.2340e21,8.3777e21,"""0x4332f8a38f14...","""0x01597e397605...","""0xab1785ef9353...",0.0,0.0,0.0,0.0,18,"""APEFI""","""Ape Finance""",18,"""BENT""","""Bent Token""","""prod""","""Atlas""",false
"""0x00722acf4103...","""0xde1c59bc25d8...",1651152085,62673974907,6.9533e17,1.9724e20,1.7797e19,"""0x4e3fbd56cd56...","""0x01597e397605...","""0x4b081f515119...",0.0,0.166684,0.0,486.394626,18,"""CVX""","""Convex Token""",18,"""BENT""","""Bent Token""","""prod""","""Gnosis_1inch""",false
"""0x00722acf4103...","""0xde1c59bc25d8...",1651152085,62673974907,343476.0,1.0102e19,601870.0,"""0xbd31ea821211...","""0x01597e397605...","""0xd2f1eef485c6...",0.0,0.018856,0.0,55.12417,6,"""LUNA""","""LUNA (Wormhole...",18,"""BENT""","""Bent Token""","""prod""","""Gnosis_1inch""",false
"""0x00ad37eb273a...","""0x15f4c337122e...",1644748155,47523051478,9.3757e18,2.0343e15,1.0000e20,"""0x6b175474e890...","""0x0327112423f3...","""0x6fc448c0ca4c...",0.0,0.03426,0.0,100.0,18,"""DAI""","""Dai Stablecoin...",18,"""BTC++""","""PieDAO BTC++""","""prod""","""Gnosis_ParaSwa...",false
"""0x00b500c3b3c1...","""0xc9ec550bea1c...",1660043150,17228308814,5.3034e18,1.9260e21,5.6700e21,"""0x6b175474e890...","""0x03ab45863491...","""0x8cab99366b72...",3.29362,3.302856,5665.213835,5670.0,18,"""DAI""","""Dai Stablecoin...",18,"""RAI""","""Rai Reflex Ind...","""prod""","""Otex""",true
"""0x000274c636cc...","""0x6fa201c3aff9...",1653399335,39685121633,1.571467e7,1.6480e22,4.9497e10,"""0xa0b86991c621...","""0x03ab45863491...","""0xce595451fe08...",25.054155,25.242469,49471.59565,49497.201149,6,"""USDC""","""USD Coin""",18,"""RAI""","""Rai Reflex Ind...","""prod""","""Otex""",false
"""0x00c4ce2499d0...","""0xc9ec550bea1c...",1674143279,25817928436,3.0684144e7,3.7402e22,1.0300e11,"""0xa0b86991c621...","""0x03ab45863491...","""0x6c2728ae124a...",66.82064,67.001595,102853.235527,103000.0,6,"""USDC""","""USD Coin""",18,"""RAI""","""Rai Reflex Ind...","""prod""","""Otex""",true
"""0x0004ffeb0bbf...","""0x149d0f928233...",1661197209,23280051092,2.4216e19,8.617896e6,8.6261e22,"""0x4fabb145d646...","""0x056fd409e1d7...","""0x6ba08daa7ae4...",55.198666,55.263135,86192.865222,86293.532856,18,"""BUSD""","""Binance USD""",2,"""GUSD""","""Gemini dollar""","""prod""","""PLM""",true
"""0x00d55052a6e7...","""0xa21740833858...",1668085823,44690603333,4.8074e19,6.448118e6,6.4986e22,"""0x853d955acef8...","""0x056fd409e1d7...","""0xa33865b4e6d7...",53.72145,54.333841,64502.749465,65238.039902,18,"""FRAX""","""Frax""",2,"""GUSD""","""Gemini dollar""","""prod""","""Laertes""",true


In [33]:
# drop null values
total_settlement_tokens_solvers_pl = total_settlement_tokens_solvers_pl.drop_nulls()

In [34]:
# save total_settlement_tokens_solvers_pl to csv
total_settlement_tokens_solvers_pl.write_csv('data/cowv2_trades.csv')
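
The saved table keeps raw token amounts next to `tokens_decimals`, so human-readable amounts still need a scaling step. As a worked example, the first row of the joined table printed earlier sells 2.0000e9 raw units of USDT, which has 6 decimals. The sketch below assumes the displayed value is exactly 2,000,000,000 and uses a hypothetical helper, `to_token_units`.

```python
from decimal import Decimal


def to_token_units(raw_amount: int, decimals: int) -> Decimal:
    """Scale a raw integer token amount down by 10**decimals."""
    return Decimal(raw_amount) / (Decimal(10) ** decimals)


# sellAmount of the first row, assumed to be exactly 2_000_000_000 raw units
usdt_sold = to_token_units(2_000_000_000, 6)
print(usdt_sold)
```

The scaled amount lines up with the `settlements_trades_sellAmountUsd` value of 2000.0 on the same row, which is what one would expect for a dollar-pegged token.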

### Basic Aggs

In [41]:
# filter by "prod" environments
filter_df = total_settlement_tokens_solvers_pl.filter(pl.col("environment") == "prod")

In [44]:
filter_df = filter_df.filter(pl.col("settlements_trades_buyAmountUsd") != 0.0)

In [45]:
filter_df = filter_df.filter(pl.col("settlements_trades_sellAmountUsd") != 0.0)

In [48]:
filter_df.shape

(1113, 23)

In [49]:
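# count trades per solver name, most active solvers first
# NOTE: newer polars releases spell these `group_by` and `sort(..., descending=True)`
grouped_df = filter_df.groupby("name").agg(
    pl.count("settlements_trades_buyAmountUsd").alias("total_trades")).sort("total_trades", reverse=True)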


In [50]:
grouped_df

| name | total_trades |
| --- | --- |
| str | u32 |
| "Gnosis_0x" | 196 |
| "Gnosis_1inch" | 141 |
| "Otex" | 134 |
| "QuasiModo" | 104 |
| "PLM" | 101 |
| "Legacy" | 93 |
| "DexCowAgg" | 80 |
| "Laertes" | 62 |
| "Gnosis_ParaSwa..." | 55 |
| "MIP" | 37 |
