## AlphaMEV
### The goal of AlphaMEV is to automate and generalise the most common forms of MEV extraction on EVM-compatible blockchains. You can read more about how this project started on our Twitter.
### The true North Star of this project is still very far away and this is only the beginning, so we have decided to host an ML competition to gather ideas from the community and compare them against benchmarks.

## Competition Information
### The goal of this competition is to predict which transactions are back-runnable and the cumulative miner profit that each such transaction would generate. Many kinds of transactions open MEV opportunities immediately after them:
1) Oracle updates, which make liquidations possible.
2) Large AMM swaps, which make cross-DEX arbitrage possible.
3) Accepted governance proposals that change pool parameters.
And many others.
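
To make the second case concrete, here is a minimal sketch of a back-run after a large swap. It assumes two fee-less constant-product pools (x * y = k) that start at the same price. The pool sizes, the victim trade and the function names are all hypothetical and are not taken from the competition data.

```python
def swap_out(reserve_in, reserve_out, amount_in):
    """Output amount of a fee-less constant-product pool (x * y = k)."""
    return reserve_out * amount_in / (reserve_in + amount_in)

# Hypothetical ETH/TOKEN pools on two DEXes, initially at the same price.
pool_a = {"eth": 1000.0, "token": 1_000_000.0}
pool_b = {"eth": 1000.0, "token": 1_000_000.0}

# The "victim" transaction: a large swap selling 100 ETH into pool A.
victim_eth_in = 100.0
victim_token_out = swap_out(pool_a["eth"], pool_a["token"], victim_eth_in)
pool_a["eth"] += victim_eth_in
pool_a["token"] -= victim_token_out

def backrun_profit(eth_in):
    """ETH profit from selling ETH on pool B, then selling the tokens on pool A."""
    tokens = swap_out(pool_b["eth"], pool_b["token"], eth_in)
    eth_back = swap_out(pool_a["token"], pool_a["eth"], tokens)
    return eth_back - eth_in

# ETH is now cheaper on pool A than on pool B, so the round trip is profitable.
profit = backrun_profit(10.0)
print(f"Back-run profit for 10 ETH: {profit:.4f} ETH")
```

In practice fees, gas costs and the bribe paid to the miner reduce this profit; the sketch only shows why a large swap creates the opportunity.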

## Each row of the training dataset contains the following columns:
1) txHash - transaction hash on the Ethereum blockchain
2) txData - dictionary representing all basic transaction information
3) txTrace - Geth-style transaction trace
4) Label0 - binary label indicating whether this transaction is back-runnable
5) Label1 - total amount of ETH sent to miners as bribes via MEV bundles due to this transaction
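
As a rough illustration of how the `txData` and `txTrace` columns can be read, the sketch below parses one made-up row. It assumes both columns are Python-literal dictionaries stored as strings, with numeric fields encoded as `0x`-prefixed hex strings, which is what the baseline's use of `ast.literal_eval` and `int(..., 0)` suggests. All values here are invented.

```python
import ast

# Hypothetical row: every value below is invented for illustration only.
row = {
    "txData": (
        "{'blockNumber': '0xc5043f', "
        "'from': '0x00000000000000000000000000000000000000aa', "
        "'to': '0x00000000000000000000000000000000000000bb', "
        "'gas': '0x5208', 'gasPrice': '0x3b9aca00', "
        "'input': '0xa9059cbb0000', 'nonce': '0x1'}"
    ),
    "txTrace": "{'gasUsed': '0x5208'}",
}

# literal_eval safely turns the string form back into a dict.
tx_data = ast.literal_eval(row["txData"])
tx_trace = ast.literal_eval(row["txTrace"])

# int(x, 0) infers the base from the prefix, so '0x...' parses as hex.
block_number = int(tx_data["blockNumber"], 0)
fee_wei = int(tx_trace["gasUsed"], 0) * int(tx_data["gasPrice"], 0)

# The first 10 characters of the input ('0x' + 8 hex digits) are the function selector.
selector = tx_data["input"][:10]
has_error = tx_trace.get("error") is not None

print(block_number, fee_wei, selector, has_error)
```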

## You can find a link to the dataset below; it is a zip archive containing two files, "train.csv" and "test.csv".
## For each row in "test.csv" you are expected to generate two predictions separated by a comma:
1) P[Label0 == 1]
2) E[Label1 | Label0 == 1]
## You can also find a very basic Python solution, which generates the required predictions in the correct format, using the link below.
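
The two-value format described above can be sketched as follows. This is only an illustration: the helper name `write_submission` is hypothetical, and the sketch assumes one line per test row with no header, mirroring the commented-out writer in the baseline. Whether the organisers expect a header row is not stated here.

```python
import csv
import io

def write_submission(fileobj, probabilities, expected_bribes):
    """Write one 'P[Label0 == 1],E[Label1 | Label0 == 1]' line per test row."""
    if len(probabilities) != len(expected_bribes):
        raise ValueError("both prediction lists must have one entry per test row")
    writer = csv.writer(fileobj)
    for p, e in zip(probabilities, expected_bribes):
        writer.writerow([p, e])

# Demo with made-up predictions, written to an in-memory buffer.
buffer = io.StringIO()
write_submission(buffer, [0.9, 0.1], [0.05, 0.0])
lines = buffer.getvalue().splitlines()
print(lines)
```

To write a real file, open it with `open('submission.csv', 'w', newline='', encoding='UTF8')` inside a `with` block and pass the file object instead of the buffer.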



In [1]:
import pandas as pd
import numpy as np
import xgboost as xgb
import ast
import csv
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, mean_squared_error
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

In [2]:
# Solution is kept trivial and highly inefficient on purpose as it's provided
# purely as an example which should be straightforward to beat by anyone
def convert_dataset(dataset):
  examples = []
  # dat = dataset['txData'].map(ast.literal_eval)
  # trc = dataset['txTrace'].map(ast.literal_eval)  

  # txTrace = pd.json_normalize(trc)
  # print(txTrace.columns)
  # err_le = LabelEncoder()
  # err_label = err_le.fit_transform(txTrace['error'])
  # err_ohe = OneHotEncoder()
  # dataset['errorLabel'] = err_label

  # err_feature_arr = err_ohe.fit_transform(dataset[['errorLabel']]).toarray()
  # # print((err_feature_arr))
  # err_feature_label = list(err_le.classes_)
  # # print(err_feature_label)
  # err_feature = pd.DataFrame(err_feature_arr.astype(int).astype(str))
  # err_feature['error'] = err_feature.stack().groupby(level=0).apply(''.join)
  

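  for dat, trc in zip(dataset['txData'], dataset['txTrace']):
    # Both columns hold dictionaries serialized as strings.
    data = ast.literal_eval(dat)
    trace = ast.literal_eval(trc)
    # int(x, 0) infers the base from the prefix, so '0x...' strings parse as hex.
    examples.append([
      int(data.get('blockNumber'), 0),
      # Addresses are reduced modulo 2 ** 30 to keep the feature small.
      int(data.get('from'), 0) % (2 ** 30),
      (int(data.get('to'), 0) if data.get('to') is not None else 0) % (2 ** 30),
      int(data.get('gas'), 0),
      # Fee paid: gasUsed * gasPrice (0 when the trace has no gasUsed).
      (int(trace.get('gasUsed'), 0) if trace.get('gasUsed') is not None else 0) * int(data.get('gasPrice'), 0),
      # Leading bytes of the calldata (the function selector), or 0 for empty input.
      (int(data.get('input')[:10], 0) if data.get('input') != '0x' else 0) % (2 ** 30),
      int(data.get('nonce'), 0),
      # 1 if the trace recorded an error, else 0.
      (1 if trace.get('error') is not None else 0)
    ])
  return np.array(examples)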


In [23]:
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
testFeatures = convert_dataset(test)
 
X_train, X_test, y_train, y_test = train_test_split(convert_dataset(train), train['Label0'], test_size=0.2)


print('Training data shape: {}'.format(X_train.shape))
print('Test data shape: {}'.format(X_test.shape))

Training data shape: (656347, 8)
Test data shape: (164087, 8)


In [None]:
binaryModel = xgb.XGBClassifier(n_estimators=50)
binaryModel.fit(X_train, y_train)
pred = binaryModel.predict(X_test)
accuracy = accuracy_score(y_test, pred)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
print(pred.shape)
binaryPredictions = binaryModel.predict_proba(testFeatures)[:, 1]
 
# Fit the regressor on back-runnable rows only, so that it estimates E[Label1 | Label0 == 1].
positives = train[train['Label0'] == True]
regressionModel = xgb.XGBRegressor(n_estimators=50)
regressionModel.fit(convert_dataset(positives), positives['Label1'])
regressionPredictions = regressionModel.predict(testFeatures)
 
# submission = csv.writer(open('submission.csv', 'w', encoding='UTF8'))
# for x, y in zip(binaryPredictions, regressionPredictions):
#   submission.writerow([x, y])
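
The accuracy printed above scores hard 0/1 predictions, whereas the submission asks for the probability P[Label0 == 1]. A probability-aware metric such as log loss can complement it for local validation. The competition's actual scoring metric is not stated here, so the choice of log loss is an assumption, and the helper `binary_log_loss` and the toy labels are hypothetical.

```python
import numpy as np

def binary_log_loss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    y = np.asarray(y_true, dtype=float)
    # Clip to avoid log(0) for over-confident predictions.
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1 - eps)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

# Toy validation labels and two sets of made-up predicted probabilities.
y_val = [1, 0, 0, 1]
confident = [0.9, 0.1, 0.2, 0.8]
uninformative = [0.5, 0.5, 0.5, 0.5]

loss_confident = binary_log_loss(y_val, confident)
loss_uninformative = binary_log_loss(y_val, uninformative)
print(loss_confident, loss_uninformative)
```

With the variables from the cell above, the same helper could be called as `binary_log_loss(y_test, binaryModel.predict_proba(X_test)[:, 1])`.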

In [35]:
import matplotlib.pyplot as plt
def plot_learning_curve(history):
    """ Function that accepts the history dict from a training run and generates loss curves. """
    plt.plot(history["loss"], label="training loss")
    # "val_loss" is only recorded when validation data is passed to fit().
    if "val_loss" in history:
        plt.plot(history["val_loss"], label="validation loss")
    plt.xlabel("Epoch")
    plt.ylabel("Loss")
    plt.yscale('log')
    plt.legend()
    plt.show()


In [37]:
import tensorflow as tf

# We extract the number of classes and the input shape from the data
num_classes = len(np.unique(y_train))
input_shape = X_train.shape[1:]
print(input_shape)
print('Training Features:\n   Shape: {}\n   Type: {}\n'.format(X_train.shape, X_train.dtype))
print('Training Targets:\n   Shape: {}\n   Type: {}'.format(y_train.shape, y_train.dtype))
# Define a sequential model
model = tf.keras.Sequential()

# ADD LAYERS HERE

model.add(tf.keras.layers.Dense(128, input_shape=input_shape, activation='relu'))
model.add(tf.keras.layers.Dense(1))
model.add(tf.keras.layers.Dense(1))
model.add(tf.keras.layers.Dense(num_classes, activation="softmax"))
model.build(input_shape)
# This will print an overview of the network architecture
model.summary()

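# The targets are class labels and the last layer is a softmax over num_classes,
# so sparse categorical cross-entropy is used rather than mean squared error.
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.1),
              loss=tf.keras.losses.SparseCategoricalCrossentropy(),
              metrics=['accuracy'])

result = model.fit(X_train.astype('float32'),
          y_train.astype('int32'),
          batch_size=50,
          epochs=20)

# With metrics=['accuracy'], evaluate() returns [loss, accuracy].
test_loss, test_accuracy = model.evaluate(X_test.astype('float32'), y_test.astype('int32'), verbose=0)

print('Test accuracy: {:.04}'.format(test_accuracy))
# plot_learning_curve(result.history)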

(8,)
Training Features:
   Shape: (656347, 8)
   Type: object

Training Targets:
   Shape: (656347,)
   Type: bool
Model: "sequential_24"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_35 (Dense)             (None, 128)               1152      
_________________________________________________________________
dense_36 (Dense)             (None, 1)                 129       
_________________________________________________________________
dense_37 (Dense)             (None, 1)                 2         
_________________________________________________________________
dense_38 (Dense)             (None, 2)                 4         
Total params: 1,287
Trainable params: 1,287
Non-trainable params: 0
_________________________________________________________________
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20