# SmartEmbed

In [1]:
%load_ext autoreload
%autoreload 2

### modifying SmartEmbed
- this library contains some Python 2 code, which I had to patch quickly
- it has a Java dependency. It's obviously deployable, but I've never done that before

In [4]:
import os
from SmartEmbed.smartembed import SmartEmbed

os.chdir("/Users/pablo/Documents/Coding/company_challenges/OpenZeppelin/SmartEmbed")
se = SmartEmbed()
# read contract1 from file
contract1 = open('todo/test.sol', 'r').read() 
# get vector representation for contract1
vector1 = se.get_vector(contract1)
# read contract2 from file
contract2 = open('todo/KOTH.sol', 'r').read()
# get vector representation for contract2
vector2 = se.get_vector(contract2)
# estimate similarity between contract1 and contract2 
similarity = se.get_similarity(vector1, vector2)
print("similarity between c1 and c2:", similarity)

similarity between c1 and c2: 0.8895723343637358


### latency of calculating single embedding

In [None]:
%%time
vector1 = se.get_vector(contract1)
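
A single `%%time` run can be noisy, so it may help to repeat the call a few times and look at the median. The sketch below is a minimal, stdlib-only version of that idea. `time_calls` is a hypothetical helper, and the lambda is a stand-in workload; in this notebook it would be replaced by `lambda: se.get_vector(contract1)`.

```python
import statistics
import time
from typing import Callable, List


def time_calls(fn: Callable[[], object], repeats: int = 5) -> List[float]:
    """Return the wall-clock duration (seconds) of each of `repeats` calls to fn."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


# Stand-in workload (assumption): swap in `lambda: se.get_vector(contract1)`.
durations = time_calls(lambda: sum(range(100_000)), repeats=5)
print(f"median: {statistics.median(durations):.6f}s, max: {max(durations):.6f}s")
```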

### process dataset
- first let's scan and find out how many files there are
- make decision as to how to store this data based on next step: FAISS
- also keep track of how many files the tool fails to process

In [2]:
import os

from tqdm import tqdm

base_dir = "/Users/pablo/Documents/Coding/company_challenges/OpenZeppelin/smart-contract-sanctuary-ethereum/contracts"
smart_embed_dir = "/Users/pablo/Documents/Coding/company_challenges/OpenZeppelin/SmartEmbed"
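ignore_dirs = (".ipynb_checkpoints",)  # trailing comma makes this a tuple, not a string

def contract_path_gen(base_dir: str):
    """Yield the path of every .sol file under base_dir, skipping ignore_dirs."""
    for root, dirs, files in os.walk(base_dir, topdown=True):
        # prune ignored directories in place so os.walk does not descend into them
        dirs[:] = [d for d in dirs if d not in ignore_dirs]
        for _file in files:
            if _file.endswith(".sol"):
                yield os.path.join(root, _file)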
    
def read_contract(contract_path : str) -> str:
    with open(contract_path, "r") as f:
        return f.read()

def get_embedding(contract_path: str):
    try:
        return {contract_path: se.get_vector(read_contract(contract_path))}
    except Exception:  # deliberately broad: we don't know which errors can happen
        return dict()
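
Returning an empty dict hides which files failed, while the plan above was to keep track of them. A minimal sketch of one way to do that is below. `embed_all` and `fake_embed` are hypothetical names used only for illustration; in this notebook `embed_fn` would wrap `se.get_vector(read_contract(path))`.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def embed_all(
    paths: Iterable[str],
    embed_fn: Callable[[str], object],
) -> Tuple[Dict[str, object], List[Tuple[str, str]]]:
    """Embed every path, collecting (path, error) pairs instead of dropping failures."""
    embeddings: Dict[str, object] = {}
    failures: List[Tuple[str, str]] = []
    for path in paths:
        try:
            embeddings[path] = embed_fn(path)
        except Exception as exc:  # broad on purpose, but the error is recorded
            failures.append((path, repr(exc)))
    return embeddings, failures


def fake_embed(path: str) -> list:
    """Stand-in for the real embedding call (assumption): fails on one file."""
    if path.endswith("bad.sol"):
        raise ValueError("parse error")
    return [0.0, 1.0]


embeddings, failures = embed_all(["a.sol", "bad.sol", "c.sol"], fake_embed)
print(f"embedded: {len(embeddings)}, failed: {len(failures)}")
```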

In [32]:
n_contracts = sum(1 for _ in contract_path_gen(base_dir))
print(f"total number of smart contracts: {n_contracts}")

total number of smart contracts: 1083325


In [None]:
%%time
import warnings
warnings.filterwarnings("ignore")

embeddings = dict()

for ix, contract_path in enumerate(contract_path_gen(base_dir)):
    contract = read_contract(contract_path)
    embeddings[contract_path] = se.get_vector(contract)
    if ix>100:
        break

- there were a lot of warnings so I had to clear the output, but it took 28.5 seconds to process roughly 100 embeddings

In [40]:
(1083325 * 28.5) / (100 * 60 * 60)

85.76322916666666

- that means roughly 85 hours of processing time. That's a problem; maybe we can parallelize
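
The extrapolation above is a simple linear scaling of the sample timing. As a small sketch, the same arithmetic can be wrapped in a function so it is easy to redo with a different sample. `estimate_hours` is a hypothetical helper, and the "8 workers" figure assumes perfect linear speedup, which is optimistic.

```python
def estimate_hours(n_total: int, sample_seconds: float, sample_size: int) -> float:
    """Linearly extrapolate a sample timing to n_total items, in hours."""
    seconds_per_item = sample_seconds / sample_size
    return n_total * seconds_per_item / 3600


serial_hours = estimate_hours(n_total=1083325, sample_seconds=28.5, sample_size=100)
print(f"serial estimate: {serial_hours:.1f} hours")

# Assumption: ideal scaling across 8 workers; real speedups are usually lower.
ideal_parallel_hours = serial_hours / 8
print(f"ideal estimate with 8 workers: {ideal_parallel_hours:.1f} hours")
```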

In [None]:
import time

from joblib import delayed, Parallel

gen = contract_path_gen(base_dir)
l = list()
for ix, _cp in enumerate(gen):
    if ix > 100:
        break
    l.append(_cp)
    
st = time.time()
result = Parallel(n_jobs=-1)(delayed(get_embedding)(contract_path) for contract_path in l)
elapsed = time.time() - st


In [65]:
from datetime import timedelta

print(f"time elapsed: {timedelta(seconds=elapsed)}")

time elapsed: 0:00:30.225761


- I don't know why parallelization is not speeding things up. It may have to do with each call running a Java command and Java not handling the concurrency well; I just don't know enough about Java.
- unfortunately we have to keep moving, so let's see how the index part will work
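
One way to narrow this down later would be to run the same serial-versus-parallel comparison on a stand-in task that is known to parallelize well. If the stand-in speeds up but the embedding call does not, the bottleneck is probably inside the embedding call itself. The sketch below is only a diagnostic harness under that assumption: it uses the standard library's `ThreadPoolExecutor` instead of joblib to stay dependency-free, and `compare_serial_parallel` and `stand_in` are hypothetical names.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence, Tuple


def compare_serial_parallel(
    fn: Callable[[int], int],
    items: Sequence[int],
    max_workers: int = 4,
) -> Tuple[float, float]:
    """Return (serial_seconds, parallel_seconds) for applying fn to every item."""
    start = time.perf_counter()
    for item in items:
        fn(item)
    serial_s = time.perf_counter() - start

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        list(pool.map(fn, items))  # consume the iterator so all work finishes
    parallel_s = time.perf_counter() - start
    return serial_s, parallel_s


def stand_in(x: int) -> int:
    """Stand-in task (assumption): sleeping releases the GIL, so threads overlap."""
    time.sleep(0.01)
    return x * 2


serial_s, parallel_s = compare_serial_parallel(stand_in, list(range(20)))
print(f"serial: {serial_s:.3f}s, parallel: {parallel_s:.3f}s")
```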

### Spotify Annoy
- I had considered using ElasticSearch, but I don't think deploying it will be that easy
- I was looking at Python-only options and found FAISS, but it needs cmake and C++ BLAS libraries, and everything has to be linked; again I'm worried about deployment given the time constraint
- let's explore Spotify's Annoy library instead: it looks good and is a simple pip install

In [47]:
from annoy import AnnoyIndex
t = AnnoyIndex(150, 'angular')

In [48]:
t.add_item(1, vector1.reshape(-1))

In [14]:
%%time
t.add_item(2, vector2.reshape(-1))

CPU times: user 22 µs, sys: 1 µs, total: 23 µs
Wall time: 25 µs


In [16]:
t.build(1)

True

In [18]:
t.get_nns_by_item(2, 1)

[2]
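
Note that the query returns item 2 itself: an item is always at distance zero from itself, so asking for one neighbour of an indexed item gives that item back. The brute-force sketch below illustrates this with Annoy's documented angular distance, sqrt(2 * (1 - cos(u, v))). It is a pure-Python stand-in for intuition, not how Annoy searches internally, and the vectors and the helper names `angular_distance` and `nearest` are made up for the example.

```python
import math
from typing import Dict, List, Sequence


def angular_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Angular distance as defined by Annoy: sqrt(2 * (1 - cosine similarity))."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cos = dot / (norm_u * norm_v)
    cos = max(-1.0, min(1.0, cos))  # guard against floating-point drift
    return math.sqrt(2.0 * (1.0 - cos))


def nearest(items: Dict[int, Sequence[float]], query_id: int, n: int) -> List[int]:
    """Exhaustively rank all item ids by angular distance to the query item."""
    query = items[query_id]
    ranked = sorted(items, key=lambda i: angular_distance(query, items[i]))
    return ranked[:n]


# Toy 2-d vectors (assumption); the real embeddings have 150 dimensions.
items = {1: [1.0, 0.0], 2: [0.0, 1.0], 3: [0.1, 1.0]}
print(nearest(items, 2, 1))  # the query item itself comes first
print(nearest(items, 2, 2))  # ask for n + 1 to get n "other" neighbours
```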

In [19]:
t.save('test.ann')

True

In [20]:
u = AnnoyIndex(150, 'angular')
u.load("test.ann")

True

In [22]:
u.get_nns_by_item(2, 1)

[2]

- ok, this will have to do and seems to have enough functionality for now. I'm going to work on a script that saves the processed embeddings and creates the index at the same time.
- since parallelizing is not working, I'm going to do it serially, which has some advantages. I don't know whether AnnoyIndex handles concurrency anyway, and this way I can save snapshots of the processed data which I can then work with separately.
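
A minimal sketch of the serial-with-snapshots idea is below, assuming pickle files as the snapshot format (the format is not decided in these notes). `process_with_snapshots` is a hypothetical helper and the lambda is a stand-in for the real embedding call. Each record carries an integer `contract_id`, which is the kind of id an Annoy index expects.

```python
import os
import pickle
import tempfile
from typing import Callable, Iterable, List


def process_with_snapshots(
    paths: Iterable[str],
    embed_fn: Callable[[str], list],
    out_dir: str,
    snapshot_every: int = 1000,
) -> List[str]:
    """Embed paths serially, writing a pickle snapshot every `snapshot_every` records."""
    snapshot_files: List[str] = []
    batch: List[dict] = []

    def flush() -> None:
        if not batch:
            return
        file_path = os.path.join(out_dir, f"snapshot_{len(snapshot_files):05d}.pkl")
        with open(file_path, "wb") as f:
            pickle.dump(batch, f)
        snapshot_files.append(file_path)
        batch.clear()

    for contract_id, path in enumerate(paths):
        try:
            embedding = embed_fn(path)
        except Exception:
            continue  # skip files that cannot be embedded
        batch.append(
            {"contract_id": contract_id, "contract_path": path, "embedding": embedding}
        )
        if len(batch) >= snapshot_every:
            flush()
    flush()  # write whatever is left over
    return snapshot_files


with tempfile.TemporaryDirectory() as tmp_dir:
    fake_paths = [f"c{i}.sol" for i in range(5)]
    # Stand-in embedding (assumption): replace with the real SmartEmbed call.
    files = process_with_snapshots(fake_paths, lambda p: [float(len(p))], tmp_dir, 2)
    n_files = len(files)
    with open(files[0], "rb") as f:
        first_snapshot = pickle.load(f)

print(f"wrote {n_files} snapshot files")
```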

In [57]:
from typing import Dict
def get_embedding2(contract_path: str) -> Dict:
    try:
        return {
            "contract_path": contract_path,
            "embedding": se.get_vector(read_contract(contract_path)).reshape(-1)}
    except Exception:  # deliberately broad: we don't know which errors can happen
        return dict()

In [50]:
import pandas as pd
tmp = [get_embedding2(l[0]), get_embedding2(l[1])]

In [51]:
df = pd.DataFrame(tmp)

In [55]:
df.reset_index().rename(columns={"index":"contract_id"})

| | contract_id | contract_path | embedding |
|---|---|---|---|
| 0 | 0 | /Users/pablo/Documents/Coding/company_challeng... | [404.5833587087691, -725.1700658425689, -302.5... |
| 1 | 1 | /Users/pablo/Documents/Coding/company_challeng... | [545.3918035291135, -195.70304829627275, -68.8... |


- pandas is not that fast, but it still gives us an indexed lookup, and at around 1M rows the data is not that big
- I'll keep this as a parquet file "database". That's far from ideal, but it is enough for an MVP; the idea is to move to a full solution like ElasticSearch or DynamoDB in a later iteration
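
The role of that table is to map the integer ids stored in the index back to contract paths. A minimal sketch of the lookup is below, with a plain dict standing in for the parquet-backed table. `build_id_table` and `resolve_neighbours` are hypothetical names, and the paths are made up for illustration.

```python
from typing import Dict, List


def build_id_table(contract_paths: List[str]) -> Dict[int, str]:
    """Assign consecutive integer ids (0, 1, 2, ...) to contract paths."""
    return dict(enumerate(contract_paths))


def resolve_neighbours(neighbour_ids: List[int], id_table: Dict[int, str]) -> List[str]:
    """Translate ids returned by a nearest-neighbour query back into paths."""
    return [id_table[i] for i in neighbour_ids]


id_table = build_id_table(["a/Token.sol", "b/KOTH.sol", "c/Vault.sol"])
# Pretend a nearest-neighbour query returned ids [2, 0] (assumption).
paths = resolve_neighbours([2, 0], id_table)
print(paths)
```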

### to-do
- I need to figure out how to deploy a Java environment
- I need to commit all the changes I made to the SmartEmbed library and create a fork. I also need to figure out how to install that fork in deployment, since the original was written for Python 2 and needed all those changes

In [68]:
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(message)s",
    handlers=[
        logging.FileHandler("debug.log"),
        logging.StreamHandler()
    ]
)

In [70]:
logging.info('Useful message')

2023-02-26 12:40:34,788 [INFO] Useful message
