# Sybil Slayer

- **Author**: Oujin Labs
- **Creation Date**: 30/10/2022
- **Last Update**: 30/10/2022

## Description

The goal of this notebook is to explore Gitcoin Grants data in order to propose a solution for identifying sybil accounts. Although one of the project tasks is to train an ML model able to detect sybil accounts, we haven't been provided with labels to do so. Hence, in this notebook, I'll try to extract information and create signals which could indicate that, in the context of a Gitcoin grant, one of the contributors is a sybil account.

The data we will use consists of:

- GR15: `hackathon-contributions-dataset_v2.csv`
- Etherscan API (transactions)

In [1]:
import os
import requests

import numpy as np
import pandas as pd

In [2]:
os.listdir('../data/GR15_public_hackathon/')

['gr15_grants.json',
 'hackathon-contributions-dataset_v2.csv',
 'grants_applications_gr15.json',
 'GR15_grant_review_experiment.xlsx']

In [3]:
df_grants = pd.read_json('../data/GR15_public_hackathon/gr15_grants.json').T

In [4]:
df_grants.shape

(1503, 16)

# Load Data

First, let's load the hackathon contribution dataset, which records every contribution made to a grant on Gitcoin. For every contribution, we know the user who contributed and the grant to which the contribution was made, as well as the chain used, the transaction hash, the amount and the timestamp at which the contribution was made.

In [5]:
%%time
df_contrib = pd.read_csv('../data/GR15_public_hackathon/hackathon-contributions-dataset_v2.csv')

CPU times: user 2.65 s, sys: 285 ms, total: 2.93 s
Wall time: 3.01 s


In [6]:
df_contrib.shape

(475046, 10)

In [7]:
df_contrib.head()

Unnamed: 0,txn_id,user_id,address,grant_id,chain,txn_hash,network,token,amount_in_usdt,timestamp
0,9a7955760b945121d7ac2e3ccf8c53bb19389939d83a78...,205fdbe0783c1725849a6c243e23156d3c6679e2403beb...,0x76f69dcddd0593b0aff5fd3280c3433ddb68e0d2,6972,eth_std,0x1429017d0f1c7878713edb89f09f21ef5fce6afa6369...,mainnet,ETH,1.28365,2022-09-22 23:59:55.773190+00:00
1,8d656b874ef8b642bbab30e2e0164922287d8156182e61...,9872683fceb5011d980cbdce560d5c3c1dd4007ee936be...,0x1bc5ebee4738fd95bd96751c45a90889f772e0f3,2400,eth_zksync,sync-tx:d36b99aa6a9e2a3de02dc0f82562f6bcfd7caa...,mainnet,ETH,1.001247,2022-09-22 23:59:53.052568+00:00
2,fe14bb2c777595fcc407de7ec454e26ddaa48df2c11a6e...,67d35e9ecb471310efe99c9d12ffc6a9816db2cf6e9ac0...,0x3812801cbf0e41413db4835a5e36228ad45e32bf,7243,eth_polygon,0xa9ffea4e1135b52184647d0e42ced22b4f99a7a6e778...,mainnet,MATIC,1.001943,2022-09-22 23:59:52.331798+00:00
3,1d5435fae1a9942df47eb4a30c49acf418888fc7bbd233...,9872683fceb5011d980cbdce560d5c3c1dd4007ee936be...,0x1bc5ebee4738fd95bd96751c45a90889f772e0f3,3023,eth_zksync,sync-tx:7a8d72bdf0af9cf3b3756437e69610a706902e...,mainnet,ETH,1.001247,2022-09-22 23:59:52.133974+00:00
4,e3aa9456a2fac0019086e42b9a065ac0e40d108e436d1b...,9872683fceb5011d980cbdce560d5c3c1dd4007ee936be...,0x1bc5ebee4738fd95bd96751c45a90889f772e0f3,4811,eth_zksync,sync-tx:1f0521d2dbe03f283cc412eb9f6affa527eae0...,mainnet,ETH,1.001247,2022-09-22 23:59:51.260699+00:00


In [8]:
df_contrib = df_contrib.drop_duplicates()
df_contrib.shape

(432387, 10)

In [9]:
df_contrib = df_contrib.sort_values(by='timestamp')

Since I'm running out of time, I will focus my analysis on a single grant. It should be a grant with relatively few contributors so that I can do some "manual analysis".

In [163]:
df_contrib['grant_id'].value_counts().sample(12)

7408    989
6847     19
7833     64
7541      6
5900    614
3920     33
7331     19
6069      2
524      30
5041      5
7887      8
4385    119
Name: grant_id, dtype: int64

In [168]:
# Let's select the grant 7300 - it has had 367 contributions.
df_contrib_sub = df_contrib.loc[df_contrib['grant_id'] == 7300]

In [169]:
print(df_contrib_sub.shape)
df_contrib_sub.head()

(367, 10)


Unnamed: 0,txn_id,user_id,address,grant_id,chain,txn_hash,network,token,amount_in_usdt,timestamp
474565,8a87ed993f448090d98eb6c85a952603bbb75682ad5eef...,434abbc6237b312b30c10383adb3ab797162bc0d4c0115...,0x6780f6e186667addec40c1e2b2d3c5c6cb6b1240,7300,eth_std,0xe1eb6aa5058a438dcc69fbaaeb0e7f65b8042fa4d05b...,mainnet,DAI,15.0,2022-09-07 17:09:16.243541+00:00
473810,87baa1dace499aedaf0fe9048fa1d53ee7efe5563595c7...,0429c9ccc8a41294887a6fd419d28911523d726c1779f1...,0x5a5d9ab7b1bd978f80909503ebb828879daca9c3,7300,eth_zksync,sync-tx:85d7eee5346d5c067e3dc29d9e51e1c844cf40...,mainnet,DAI,1.01,2022-09-07 19:14:00.370912+00:00
473678,18d678fc1c3e0b773a8200003ed09794a1830a6c14d00e...,e1e5ee0ac53f1b9ef99e623c561c7ff8575aff1e898c2c...,0x681bb473da4045b26fabc040ca9c6c5e80cbd451,7300,eth_std,0xe82befbc1082a1a0376257978e771ab1491518744580...,mainnet,DAI,2.0,2022-09-07 19:33:21.953516+00:00
473354,88301092a4c18304b778ef82ef0dcb075aee569ca9aea0...,b9b40b6c8a85842113129af338f8e38e925123197b652e...,0x4a24d4a7c36257e0bf256ea2970708817c597a2c,7300,eth_std,0xb819f8e7980f0899f802f21da5217b9f8499cbfccdc1...,mainnet,DAI,7.0,2022-09-07 20:11:01.178786+00:00
473281,d27b86aef4edd1f64a5fd549b1b9091b72ac337eebfa9e...,f9e614b640569ead272178d5bb40b49d758e421e7fbaa4...,0xce849efc35a0a0a046e67c76b477c5432e4ba58b,7300,eth_zksync,sync-tx:f283065c04e69053a8abe8d9daff3c598b87a9...,mainnet,ETH,1.253164,2022-09-07 20:15:33.463158+00:00


In [170]:
# Display info about the grant
df_grants.loc[
    df_grants['grant_id'] == 7300
]

Unnamed: 0,grant_id,active,title,address,amount_received,amount_received_in_round,contribution_count,contributor_count,description,website,github_project_url,twitter_handle_2,twitter_handle_1,twitter_verified,created_on,last_update
342,7300,True,MRV 101,0x7fc1a3D5264a2B2563162EbAF17180ED2e5804BC,1655.7895,1654.0433,369,255,TL;DR\r\nHere is a detailed twitter thread by ...,https://discord.gg/3J9qeau2vQ,,magentaceiba,MRV1O1,True,2022-08-30 18:49:19+00:00,2022-09-14 18:13:39.693505+00:00


## Grant 7300 - MRV 101

In [171]:
# Grant description
print(df_grants.loc[
    df_grants['grant_id'] == 7300,
    'description'
].values[0])

TL;DR
Here is a detailed twitter thread by @darinmandarin. and here is a tweet from @Gitcoin


MRV 1O1 is a project that spun out of a founder’s circle in the ReFi DAO. When it comes to safeguarding our food supply from effects of climate change, effective measurement, reporting and verification (MRV) of soil regeneration, Crops’ health improvement and biodiversity enhancement is critical to help small scale farmers and growers navigate tough challenges of climate change.

To help small scale projects get full benefits of carbon, ecocredits and other incentives for ecosystem services, MRV Research Group is working to publish MRV1O1, a guide to quickly understand and implement relevant MRV best practices.



The Problem

Traditional MRV approaches are prohibitively expensive, often constituting 30-40% of the costs for large projects over 1200 hectares. What’s worse, these expenses are always upfront before generating revenue from the project. Additionally, the MRV methodolog

# Exploratory Analysis

In [15]:
df_contrib_sub.nunique()

txn_id            367
user_id           253
address           340
grant_id            1
chain               3
txn_hash          367
network             2
token               6
amount_in_usdt    168
timestamp         367
dtype: int64

The grant had 367 contributions from 253 different contributors (Gitcoin accounts). The contributors used 340 different addresses on 3 different chains (eth, zksync & polygon). Some transactions were also made on `rinkeby`, which is a testnet. We will drop the testnet transactions since they don't really interest us.

In [181]:
df_contrib_sub = df_contrib_sub.loc[df_contrib_sub['network'] == 'mainnet']

In [183]:
df_contrib_sub.nunique()

txn_id            260
user_id           231
address           237
grant_id            1
chain               3
txn_hash          260
network             1
token               6
amount_in_usdt    141
timestamp         260
dtype: int64

If we remove testnet transactions, we end up with 260 contributions from 231 different contributors (Gitcoin accounts), who used 237 different addresses on 3 different chains (eth, zksync & polygon).

In [184]:
df_contrib_sub['chain'].fillna('Missing').value_counts()

eth_std        89
eth_zksync     86
eth_polygon    85
Name: chain, dtype: int64

## Addresses with several user IDs

In [185]:
df_contrib_sub.groupby('address')['user_id'].nunique().sort_values(ascending=False).head(5)

address
0x8da20a81aa510c420d5a81d62cbbd013e7c76d9e    2
0x784c1bbf7135cf8c84869e5dfcfedf378d0721d9    2
0x029acdbe5404114088b0824803fd04a7b9ee33a2    1
0xb6d57a7ef4f16b3369095c7f99f8aee7849ef8c6    1
0xadaae0cf49b422fb24cb988d669e77f4e015608c    1
Name: user_id, dtype: int64

There were two addresses which were used by more than a single user_id.

In [186]:
for addr in ['0x8da20a81aa510c420d5a81d62cbbd013e7c76d9e',
             '0x784c1bbf7135cf8c84869e5dfcfedf378d0721d9']:
    display(df_contrib_sub.loc[df_contrib_sub['address'] == addr])

Unnamed: 0,txn_id,user_id,address,grant_id,chain,txn_hash,network,token,amount_in_usdt,timestamp
302795,02532b03b40597b7e2f45e9fb5cb94020e05ffe1acb180...,e605583cc7f57addbc2106b42510853efec4ba23bd5082...,0x8da20a81aa510c420d5a81d62cbbd013e7c76d9e,7300,eth_std,0x529ae252e20af8880f968843224be0ef0e1e6607bead...,mainnet,ETH,1.502725,2022-09-15 23:54:28.887513+00:00
286840,f3fc8b05087a4c221b94a8ca0a5a6109b8d91247f00b82...,dd96030dc1ba5082b941ef9c8709438debc5ce8d2675da...,0x8da20a81aa510c420d5a81d62cbbd013e7c76d9e,7300,eth_std,0x9d07af779a34ed5f6744225e9a0a9620b9cf7cff864b...,mainnet,ETH,25.019325,2022-09-16 17:06:36.524997+00:00


Unnamed: 0,txn_id,user_id,address,grant_id,chain,txn_hash,network,token,amount_in_usdt,timestamp
309470,9cb99274079adb1ca2526a08e9b4749c7f4dfd34770305...,6434cd3876df3b4c6be72a9eec4e23a4799193d92d856a...,0x784c1bbf7135cf8c84869e5dfcfedf378d0721d9,7300,eth_zksync,sync-tx:deb6b8bd87a52c898ccd4185682e71b9e80f24...,mainnet,DAI,1.0,2022-09-15 12:07:30.506888+00:00
306920,7a3f05dbb1d61e3fbe084d7d488acd94dbb96e0791214a...,a8463943947b6e5c65c274ae3d2f54d25d4526de66d6b3...,0x784c1bbf7135cf8c84869e5dfcfedf378d0721d9,7300,eth_zksync,sync-tx:a5dcc22b266dc8aae1f55960ee29073530362c...,mainnet,DAI,1.0,2022-09-15 14:03:04.970205+00:00
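
The check above, and the reverse check on user IDs with several addresses, are two directions of the same grouping. A minimal sketch of a generic helper, assuming the contributions are available as plain dicts (for instance via `df_contrib_sub.to_dict('records')`); the helper name `keys_with_multiple_values` and the toy records are hypothetical and not taken from the dataset:

```python
from collections import defaultdict

def keys_with_multiple_values(records, key, value):
    """Map each `key` to its set of distinct `value`s, keeping only keys with more than one."""
    seen = defaultdict(set)
    for rec in records:
        seen[rec[key]].add(rec[value])
    return {k: v for k, v in seen.items() if len(v) > 1}

# Toy records (made up for illustration)
records = [
    {"user_id": "u1", "address": "0xaaa"},
    {"user_id": "u2", "address": "0xaaa"},
    {"user_id": "u3", "address": "0xbbb"},
    {"user_id": "u3", "address": "0xccc"},
]

# Addresses used by several user IDs, and user IDs using several addresses
shared_addresses = keys_with_multiple_values(records, "address", "user_id")
multi_address_users = keys_with_multiple_values(records, "user_id", "address")
print(shared_addresses)
print(multi_address_users)
```

This avoids hard-coding the addresses to inspect, which would matter when running the same check over many grants.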


## User IDs with several addresses

In [188]:
df_contrib_sub.groupby('user_id')['address'].nunique().sort_values(ascending=False).head(5)

user_id
5c17351a16e2381e8f7f025ce508de9a11cb120497556ff89da380ac3fc56ee3    7
38f5857fbb681121c6103fc75722413ab556efe16f1e3772cd3e59e5ae7f3732    2
e605583cc7f57addbc2106b42510853efec4ba23bd50820e7dfd65ea26bf1868    2
008664cef10287b8e8955937c13afe25ebd256b08bace6f0544de4a86ae0f7d5    1
b1997eca88ba1e2f1f497a70443a568fd1404f0a535fa5e4b68807b26925a230    1
Name: address, dtype: int64

There were 3 user_id values which used at least two different addresses to contribute to the grant. Let's dive into user `5c17351a16e2381e8f7f025ce508de9a11cb120497556ff89da380ac3fc56ee3`, who used 7.

In [189]:
df_contrib_sub.query("user_id == '5c17351a16e2381e8f7f025ce508de9a11cb120497556ff89da380ac3fc56ee3'")\
              .sort_values(by='grant_id')

Unnamed: 0,txn_id,user_id,address,grant_id,chain,txn_hash,network,token,amount_in_usdt,timestamp
442175,e29e4cc1e156f3e0086c5601491744bdd3f997a6835679...,5c17351a16e2381e8f7f025ce508de9a11cb120497556f...,0x78ca75106f5a940269c55dc9ff51e6c16c3407d4,7300,eth_zksync,sync-tx:0f3afa5d4fa56adf103a11e4a7d7b250fd7da2...,mainnet,ETH,1.027056,2022-09-10 02:40:06.733659+00:00
435097,794448b57b3e6066b4ca06b0101600796657aeb7cef738...,5c17351a16e2381e8f7f025ce508de9a11cb120497556f...,0x76608a1ede697de8945b495aa2c3a0f91fbf469d,7300,eth_zksync,sync-tx:315557526afb0166f080b80d43f036bcf9ccbc...,mainnet,ETH,1.039485,2022-09-10 11:54:19.686124+00:00
434362,764a7f65bad0682812e6f49c6b2a9b6a6aee496ee57d94...,5c17351a16e2381e8f7f025ce508de9a11cb120497556f...,0x50b4b60858f680695d665947c7e829c7b7123658,7300,eth_zksync,sync-tx:dc52a0fa885a8170dc4f91e94bd4fb869505dd...,mainnet,ETH,1.035978,2022-09-10 12:36:47.844457+00:00
434188,cb214c98b9244e7ebc417ecd43499f8de858a087bbd1ea...,5c17351a16e2381e8f7f025ce508de9a11cb120497556f...,0x4453719be5df4984b3a76a1edfdf0b8d91fb2761,7300,eth_zksync,sync-tx:d382f627c12c9cebba8bdf6157f001c1f251c3...,mainnet,ETH,1.035978,2022-09-10 12:49:51.506615+00:00
432986,d1269ab0efe81a30fa3a28835765c054f77d826fc6dcb9...,5c17351a16e2381e8f7f025ce508de9a11cb120497556f...,0x942259d956369d6bbaed10ca9f675fdf85104e7d,7300,eth_zksync,sync-tx:340c32c3fe52eb28b09caeb26681e3530700e2...,mainnet,ETH,1.035978,2022-09-10 14:03:33.185896+00:00
432738,028a5781ec8c16648a6556175e2f497e4d6186a415b6e3...,5c17351a16e2381e8f7f025ce508de9a11cb120497556f...,0xa83697deeeb9d86e4c2165d34814c7f4d9b139e8,7300,eth_zksync,sync-tx:02436b29770f84bce9e160610a8c428c298123...,mainnet,ETH,1.035978,2022-09-10 14:16:55.543733+00:00
432700,706fcefde137ecbd87b67fa5e2ca41920876caa400f007...,5c17351a16e2381e8f7f025ce508de9a11cb120497556f...,0xd20d9fb4f194bdca4ff401fe7a0f1a5d55f8385d,7300,eth_zksync,sync-tx:5ca876ce014a0302f32aa6c818ec6fe9d8db5e...,mainnet,ETH,1.035978,2022-09-10 14:19:50.527742+00:00


Interestingly, the user made 7 contributions on the same day, each from a different address and for roughly the same amount. Not sure what the rationale behind this could be.
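
The pattern observed above (one user, several addresses, similar amounts, short time span) could be turned into a signal. The following is only a sketch under stated assumptions: the function name `has_multi_address_burst`, the one-day window, the minimum of 3 addresses and the 5% amount tolerance are all hypothetical choices, and the toy rows are made up rather than copied from the dataset.

```python
from datetime import datetime, timedelta

def has_multi_address_burst(contribs, window=timedelta(days=1),
                            min_addresses=3, rel_tol=0.05):
    """
    contribs: iterable of (iso_timestamp, address, amount) for a single user.
    Return True if, starting from some contribution, at least `min_addresses`
    distinct addresses contributed within `window` with amounts within
    `rel_tol` (relative) of that contribution's amount.
    """
    rows = sorted(
        (datetime.fromisoformat(ts), addr, amount) for ts, addr, amount in contribs
    )
    for start_time, _, ref_amount in rows:
        addresses = {
            addr
            for t, addr, amount in rows
            if timedelta(0) <= t - start_time <= window
            and abs(amount - ref_amount) <= rel_tol * ref_amount
        }
        if len(addresses) >= min_addresses:
            return True
    return False

# Toy contributions for two hypothetical users
burst_user = [
    ("2022-09-10 02:40:06+00:00", "0xa1", 1.03),
    ("2022-09-10 11:54:19+00:00", "0xa2", 1.04),
    ("2022-09-10 12:36:47+00:00", "0xa3", 1.04),
]
regular_user = [
    ("2022-09-10 02:40:06+00:00", "0xa1", 1.03),
    ("2022-09-12 02:40:06+00:00", "0xa1", 25.0),
]
burst_flag = has_multi_address_burst(burst_user)
regular_flag = has_multi_address_burst(regular_user)
print(burst_flag, regular_flag)
```

Such a flag would not prove anything on its own, but it could feed into the scoring alongside the on-chain signals.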

# Sybil detection

## Detect Sybil in real time - Simulation

In this section, we will try to mimic the behaviour that a real-time sybil detection system could have. When someone makes a contribution, we fetch data about the contributor and update the state of knowledge we have. The idea is to gather more and more information about contributors, and the relations they have with each other, to know whether there is a chance we are dealing with sybils.

In [222]:
from collections import Counter, defaultdict
from bs4 import BeautifulSoup
from tqdm import tqdm
import time

In [191]:
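ETHERSCAN_API_KEY = 'SET_YOUR_API_KEY_HERE'  # replace with your own Etherscan API key
MAX_ATTEMPT = 0  # passed as the `offset` query parameter in get_on_chain_info
PREFIX = 'api'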

In [192]:
def ens_lookup(address: str):
    """
        Access etherscan's ENS lookup page and return the ENS name of an address.
        Returns -1 on a request or 404 error, and None if nothing could be parsed.
    """
    url = f'https://etherscan.io/enslookup-search?search={address}'

    # Access the url
    try:
        result = requests.get(url, headers={'User-Agent': 'Mozilla/6.0'})
    except requests.RequestException:
        print('Error while accessing the page.')
        return -1

    if result.status_code == 200:
        soup = BeautifulSoup(result.content, 'html.parser')
        container = soup.find("div", {"class": "col-md-9 d-flex mb-n1"})
        # The expected block may be missing (e.g. page layout change or rate limiting)
        if container is None:
            return None
        for s in container:
            return str(s.find("span")).split('/>')[-1][:-7]

    elif result.status_code == 404:
        print('404 error.')
        return -1

In [258]:
def get_transactions_counterparts(transactions: list, address: str):
    """
        Given a list of transactions and the address they belong to, return
        counters of all the "from" and "to" counterpart addresses.
    """
    from_addr = Counter([r['from'] for r in transactions if r['from'] != address])
    to_addr = Counter([r['to'] for r in transactions if r['to'] != address])

    return from_addr, to_addr

def get_first_transaction(transaction: dict, transaction_erc20: dict = None):
    """
        Return the timestamp, originator and value of the earlier of the two transactions.
    """
    if transaction_erc20 is not None:
        # 'timeStamp' is a string, so compare as integers
        transaction = transaction if int(transaction['timeStamp']) < int(transaction_erc20['timeStamp']) else transaction_erc20
    first_transaction = {}
    first_transaction['timestamp'] = transaction['timeStamp']
    first_transaction['from'] = transaction['from']
    first_transaction['value'] = transaction['value']
    return first_transaction

def get_perc_of_gtc_bulk_checkout_transactions(transactions: list):
    """
        Given a list of transactions, return the fraction which was directed to
        the gtc bulk checkout address.
    """
    GTC_BULK_ADDR = '0x7d655c57f71464b6f83811c55d84009cd9f5221c'
    gtc_trans = [r['hash'] for r in transactions if r['to'] == GTC_BULK_ADDR]
    
    return len(gtc_trans) / len(transactions)


def get_total_fees_paid(transactions: list):
    # gasUsed * gasPrice is in wei; 1e-9**2 (i.e. 1e-18) converts it to ETH
    return sum([float(r['gasUsed']) * float(r['gasPrice']) * 1e-9**2 for r in transactions])

In [264]:
def get_on_chain_info(address: str):
    """
        Given an eth address, fetch its transactions from etherscan and compute a set of 
        selected information.
    """
    result = {}
    transactions_base = requests.get(
            f"https://{PREFIX}.etherscan.io/api?module=account&action=txlist&address={address}&startblock=0&endblock=99999999&page=1&offset={MAX_ATTEMPT}&sort=desc&apikey={ETHERSCAN_API_KEY}"
        ).json()
    
    transactions_erc20 = requests.get(
            f"https://{PREFIX}.etherscan.io/api?module=account&action=tokentx&address={address}&startblock=0&endblock=99999999&page=1&offset={MAX_ATTEMPT}&sort=desc&apikey={ETHERSCAN_API_KEY}"
        ).json()
    
    
    if transactions_base['status'] == '1':
        # Copy the list so that extending it below does not mutate transactions_base['result']
        transactions = list(transactions_base['result'])
        if transactions_erc20['status'] == '1':
            transactions += transactions_erc20['result']
            # Results are sorted in descending order, so [-1] is the oldest transaction of each list
            result['first_transaction'] = get_first_transaction(transactions_base['result'][-1], transactions_erc20['result'][-1])
        else:
            result['first_transaction'] = get_first_transaction(transactions_base['result'][-1])

        _from, _to = get_transactions_counterparts(transactions, address)
        result['from_all'] = _from
        result['to_all'] = _to
        result['perc_gtc_bulk_trans'] = get_perc_of_gtc_bulk_checkout_transactions(transactions)
        result['total_fees_paid'] = get_total_fees_paid(transactions)
        result['n_transactions'] = len(transactions)
        
    return result
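
One caveat with the function above: it retrieves the address's history as of query time, which can include transactions made after the contribution being evaluated. A possible way to limit this look-ahead is to filter the transactions by timestamp before computing any signal. This is a minimal sketch, assuming Etherscan-style dicts whose `timeStamp` field holds Unix seconds as a string; the helper name `transactions_before` and the toy transactions are hypothetical.

```python
from datetime import datetime

def transactions_before(transactions, cutoff_iso):
    """Keep only transactions whose Unix `timeStamp` is at or before the cutoff.

    cutoff_iso: timezone-aware ISO timestamp, e.g. a contribution's `timestamp` value.
    """
    cutoff = datetime.fromisoformat(cutoff_iso).timestamp()
    return [tx for tx in transactions if int(tx["timeStamp"]) <= cutoff]

# Toy transactions (made up): one before and one after the cutoff
toy_transactions = [
    {"hash": "0xold", "timeStamp": "1662700000"},
    {"hash": "0xnew", "timeStamp": "1662800000"},
]
visible = transactions_before(toy_transactions, "2022-09-10 00:00:00+00:00")
print([tx["hash"] for tx in visible])
```

In the simulation loop, the cutoff would be the `timestamp` of the contribution currently being processed.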

In [265]:
address_state = Counter()
address_contrib = defaultdict(lambda: [])
address_ens = dict()
address_tr = dict()
address_on_chain_info = dict()

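# Etherscan returns lower-case addresses, so normalise the creator address before comparing
fund_creator = '0x7fc1a3D5264a2B2563162EbAF17180ED2e5804BC'.lower()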
prev_address = ''

def record_info(on_chain_info, address):
    """Store the on-chain info and the combined counterpart counts of an address."""
    address_on_chain_info[address] = on_chain_info
    _from = on_chain_info['from_all']
    _to = on_chain_info['to_all']
    address_tr[address] = _from + _to
        

for i, row in tqdm(enumerate(df_contrib_sub.itertuples()), total=len(df_contrib_sub)):
    
    address_state[row.address] += 1
    address_contrib[row.address] += [row.amount_in_usdt]
    
    if row.address == prev_address:
        continue

    if row.address not in address_tr:
        # Get ENS name (if any) - there is a limit on how many requests per minute we can make,
        # so this is not entirely reliable
        address_ens[row.address] = ens_lookup(row.address)
        time.sleep(1)
        # Get on chain info
        on_chain_info = get_on_chain_info(row.address)
        if on_chain_info == {}:
            continue
        _from = on_chain_info['from_all']
        _to = on_chain_info['to_all']
        
        record_info(on_chain_info, row.address)
        
        # Assume a relatively large number of transactions removes suspicion
        if on_chain_info['n_transactions'] > 50:
            continue
            
        # Signal - related to past transactions
        for key in address_tr:
            # Reset the score for each candidate pair so it does not accumulate across keys
            CNT = 0
            if (key != row.address) and (key in address_tr[row.address]):
                CNT += 1
                # print('#' * 70)
                # print(f'Warning: {key} and {row.address} transacted before.')
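                tot_from = sum(_from.values())
                tot_to = sum(_to.values())
                
                # Guard against addresses with no incoming or no outgoing counterparts
                perc_rec = _from[key]/tot_from if tot_from else 0.0
                perc_emit = _to[key]/tot_to if tot_to else 0.0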
                if perc_rec > 0.33:
                    CNT += 1
                    # print(f'{key} ({address_ens[key]}) ->  {row.address} ({address_ens[row.address]}) accounts for {perc_rec:.2f}% of transactions {row.address} interacted with (as receiver).')
                
                if perc_emit > 0.33:
                    CNT += 1
                    # print(f'{row.address} ({address_ens[row.address]}) -> {key} ({address_ens[key]}) accounts for {perc_emit:.2f}% of transactions {row.address} interacted with (as emitter).')
                
                counterpart_intersects = np.intersect1d(
                                    list(address_tr[row.address].keys()),
                                    list(address_tr[key].keys())
                                )
                perc_inter = 100 * len(counterpart_intersects) / len(list(address_tr[row.address].keys()))
                if perc_inter >= 50:
                    CNT += 1
                    # print(f"Furthermore, {perc_inter:.3f}% of the counterparts of {row.address} are shared with {key}")
                
                if key == on_chain_info['first_transaction']['from']:
                    # print(f"{key} originally funded {row.address}")
                    CNT += 2
                    
                if address_on_chain_info[key]['first_transaction']['from'] == on_chain_info['first_transaction']['from']:
                    # Same original funder
                    CNT += 2
                
                if CNT > 4:
                    print(f'{row.timestamp} - Warning: {row.address} ({address_ens[row.address]}) & {key} ({address_ens[key]}) are likely to be related')
        
        if (on_chain_info['perc_gtc_bulk_trans'] > 0.33) and (on_chain_info['total_fees_paid'] < 0.05):
            print(f'Suspect activity from {row.address} - more than 33% of its transaction are for GTC grants')
            funder = on_chain_info['first_transaction']['from']
            print(f'This account was original funded by {funder}')
            if funder in df_contrib_sub.iloc[:i]['address'].unique():
                print(f"Funder also contributed to this grant")
                
    if fund_creator in address_tr[row.address]:
        print('Warning: Past interaction with the creator of the grant.')
    
    prev_address = row.address

 36%|███████████████▏                          | 94/260 [07:14<09:59,  3.61s/it]

Suspect activity from 0xea4d28f8218f44c84c8d7cbae5deac41acf66ced - more than 33% of its transaction are for GTC grants
This account was original funded by 0x4976a4a02f38326660d17bf34b431dc6e2eb2327


 46%|██████████████████▊                      | 119/260 [09:07<07:09,  3.05s/it]

Suspect activity from 0x8c4136f2073c69323f57c7b5b12100464244243b - more than 33% of its transaction are for GTC grants
This account was original funded by 0x455f491985c2f18b2c77d181f009ee6bdc41b1f8
Funder also contributed to this grant


 61%|█████████████████████████                | 159/260 [12:02<09:46,  5.81s/it]



 80%|████████████████████████████████▉        | 209/260 [15:01<02:17,  2.70s/it]

Suspect activity from 0xb8afab52cf30762bb80d2a2c5905588bf275077a - more than 33% of its transaction are for GTC grants
This account was original funded by 0xddfabcdc4d8ffc6d5beaf154f18b778f892a0740


 81%|█████████████████████████████████▎       | 211/260 [15:07<02:14,  2.74s/it]



 88%|████████████████████████████████████▎    | 230/260 [16:14<01:31,  3.05s/it]

Suspect activity from 0x0bd3d73638679b616022e785b898557ce43c16ea - more than 33% of its transaction are for GTC grants
This account was original funded by 0x22bc0693163ec3cee5ded3c2ee55ddbcb2ba9bbe


 95%|███████████████████████████████████████  | 248/260 [17:29<01:26,  7.18s/it]

Suspect activity from 0xa246d89f06f5fa8438ed877544aede68061e2c8f - more than 33% of its transaction are for GTC grants
This account was original funded by 0x4d846da8257bb0ebd164eff513dff0f0c2c3c0ba


100%|█████████████████████████████████████████| 260/260 [18:33<00:00,  4.28s/it]


# Solution Summary

Basically, the proposed solution is very simple. When a user makes a contribution, we fetch their past transactions using the Etherscan API (warning: this is ill-defined at the moment, since we query future information which was not available at the time the user actually contributed to the grant) and check a few things:

- The list of addresses the user interacted with, which we compare against previous contributors
  - If they overlap, we check whether the two addresses mainly transact with each other
  - Or whether they have very similar on-chain activity (transactions to the same addresses)
- How much total on-chain activity is there? If large enough, we assume the user is legit
- What is the address used for besides making contributions?
- How much was spent on fees? If it's large, then the user is probably legit too.

Then, depending on whether one or more of these apply, we raise a flag and decide to investigate. Alternatively, if a validation set showed that our solution is accurate enough, we could simply reject a contribution automatically.
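
The pairwise checks listed above can be condensed into a single scoring function. The sketch below mirrors the weights and thresholds used in the simulation loop (33% share, 50% counterpart overlap, +2 for funding links), but the function name `relatedness_score`, the dict layout with a `first_funder` key and the toy addresses are assumptions made for illustration only.

```python
from collections import Counter

def relatedness_score(new, known, known_address):
    """
    new / known: dicts with 'from_all' and 'to_all' (Counters of counterpart
    addresses) and 'first_funder' (address that sent the first transaction).
    Return an integer score for the pair; higher means more likely related.
    """
    counterparts = new["from_all"] + new["to_all"]
    if known_address not in counterparts:
        return 0
    score = 1  # the two addresses transacted before

    tot_from = sum(new["from_all"].values())
    tot_to = sum(new["to_all"].values())
    if tot_from and new["from_all"][known_address] / tot_from > 0.33:
        score += 1  # known address is a major sender
    if tot_to and new["to_all"][known_address] / tot_to > 0.33:
        score += 1  # known address is a major recipient

    known_counterparts = known["from_all"] + known["to_all"]
    shared = set(counterparts) & set(known_counterparts)
    if len(shared) / len(counterparts) >= 0.5:
        score += 1  # at least half of the counterparts are shared

    if new["first_funder"] == known_address:
        score += 2  # known address originally funded the new one
    if new["first_funder"] == known["first_funder"]:
        score += 2  # both were originally funded by the same address
    return score

# Toy example (made-up addresses): 0xb funded the new address
new_info = {
    "from_all": Counter({"0xb": 2, "0xcex": 1}),
    "to_all": Counter({"0xgrant": 1}),
    "first_funder": "0xb",
}
known_info = {
    "from_all": Counter({"0xcex": 3}),
    "to_all": Counter({"0xa": 2, "0xgrant": 1}),
    "first_funder": "0xcex",
}
score = relatedness_score(new_info, known_info, "0xb")
print(score)
```

Keeping the score a pure function of two address summaries makes it straightforward to evaluate thresholds later, should labels become available.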

# Suspicious addresses analysis

With our method, 7 addresses raised a flag. By analysing their activity on Etherscan, we concluded that:

- `0xea4d28f8218f44c84c8d7cbae5deac41acf66ced` - Difficult to conclude. Its funds came from a CEX and it did nothing else but fund the grant.

- `0x8c4136f2073c69323f57c7b5b12100464244243b` - Suspect. It did nothing else but contribute to the grant, and it got its ETH from `0x455f491985c2f18b2c77d181f009ee6bdc41b1f8`, which also contributed to the grant. Might be sybils.

- `0x92dc6fcb703d46e90d363290181379cee79e192a` - Seems legit.

- `0xb8afab52cf30762bb80d2a2c5905588bf275077a` - Difficult to conclude. Its funds came from a CEX and it did nothing else but fund the grant.

- `0x6e13285a1b24b4b7512f56fc2969a270ba0bb92f` - Seems legit.

- `0x0bd3d73638679b616022e785b898557ce43c16ea` - Difficult to conclude due to very little on-chain activity.

- `0xa246d89f06f5fa8438ed877544aede68061e2c8f` - Difficult to conclude. Its funds came from a CEX and it did nothing else but fund the grant.

2/7 were most likely actual users, and the remaining 5 remain suspicious, especially `0x8c4136f2073c69323f57c7b5b12100464244243b`.

# Conclusion

In this notebook, I worked on a (small) solution that tries to find sybils during a grant funding round on Gitcoin using on-chain data only. The approach consists mainly in retrieving the historical transactions of each address making a contribution, and trying to find relationships with previous contributors.

It is not an easy problem to solve: we found very few potential sybils and some false positives, but we chose to be pretty conservative. The simple rule-based method we applied could be greatly improved by applying a supervised (or weakly supervised) training method on top, which would also require us to format the data in a more ML-friendly way.

Of course, this solution isn't ready for production, and it could be enriched with various additional sources of data in order to yield better performance. This includes:

- Better formatting and exploitation of the transaction data and graph
- Digging into ENS names and checking for NFT or POAP activity (this could reduce suspicion)
- Trying to connect ETH addresses/ENS names to Twitter handles (could be present in the description, or elsewhere) to expand the knowledge graph
- etc.
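
As a starting point for the graph idea in the first bullet, contributors could be clustered by their funding links. The sketch below uses a small union-find over (address, first funder) edges; the function name `funding_clusters` and the toy addresses are hypothetical. Note that a shared funder is weak evidence when the funder is an exchange wallet, as several of the flagged addresses above were funded from a CEX.

```python
from collections import defaultdict

def funding_clusters(first_funder):
    """
    first_funder: dict mapping a contributor address to the address that
    first funded it. Return the groups of contributors (size > 1) that are
    connected through funding edges.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Link every contributor to its first funder
    for address, funder in first_funder.items():
        parent[find(address)] = find(funder)

    # Group contributors (not funders) by the root of their component
    clusters = defaultdict(set)
    for address in first_funder:
        clusters[find(address)].add(address)
    return [group for group in clusters.values() if len(group) > 1]

# Toy mapping (made up): 0xb funded 0xa, and 0xc and 0xd share a funder
toy_funders = {
    "0xa": "0xb",
    "0xb": "0xcex1",
    "0xc": "0xcex2",
    "0xd": "0xcex2",
    "0xe": "0xcex3",
}
clusters = funding_clusters(toy_funders)
print(sorted(sorted(group) for group in clusters))
```

In practice, known exchange addresses would probably need to be excluded before clustering, otherwise unrelated users funded from the same exchange end up in one group.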