### Project MMDB
Authors:
- Nazarii Drushchak
- Igor Babin
- Uliana Zbezhkhovska

Task:

- Consider all the changes made to Wikipedia as a stream.
    - Check here: https://wikitech.wikimedia.org/wiki/RCStream
- Each action is received in JSON format.
- The data is full of bots. There is a flag where programmers can indicate that an action has been done by a bot.
- Using this information as ground truth, develop a system able to classify users as bot or human.
- Constraint: You need to sample, and use just 20% of the data stream.
- Describe the distribution of edits per user and per bot.

In [1]:
!pip install -q sseclient



In [1]:
from sseclient import SSEClient as EventSource
import json
import random
from transformers import AutoTokenizer, AutoModel
import torch
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [2]:
LOAD_DATA=True

Train data collection. We collected 800 thousand edit events from Wikipedia. Each event is a JSON object. We store slices of 50k events in separate files for faster loading and processing.

In [3]:
def collect_data():
  import datetime
  EVENTS = 1_000_000  # upper bound on collected events (1 million)
  SAVE_FREQ = 50000   # number of events per output file
  URL = 'https://stream.wikimedia.org/v2/stream/recentchange'

  count = 0
  data = []

  for event in EventSource(URL):
    if event.event == 'message':
        try:
            change = json.loads(event.data)
        except ValueError:
            # Skip payloads that are not valid JSON
            continue

        data.append(change)

        count += 1

        # Flush the current slice before checking the stop condition,
        # so the last full slice is written to disk
        if count % SAVE_FREQ == 0:
            with open(f'data_{count}.json', 'w') as outfile:
                json.dump(data, outfile)
            data = []
            print('Time: {}, saved {} events'.format(datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"), count))

        if count >= EVENTS:
            break
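
The slicing logic can be tried without a network connection. The sketch below uses a hypothetical helper, `save_in_slices`, that applies the same buffer-and-flush pattern to any iterable of already parsed events. The helper name, the `out_dir` parameter, and the toy events are assumptions made for illustration and are not part of the notebook.

```python
import json
import os
import tempfile


def save_in_slices(events, save_freq, out_dir, max_events=None):
    """Write events to out_dir in files of save_freq events each.

    Hypothetical helper: same buffer-and-flush pattern as collect_data,
    but it accepts any iterable so it can run offline.
    Returns the list of written file paths.
    """
    buffer, paths, count = [], [], 0
    for event in events:
        buffer.append(event)
        count += 1
        if count % save_freq == 0:
            path = os.path.join(out_dir, f"data_{count}.json")
            with open(path, "w") as f:
                json.dump(buffer, f)
            paths.append(path)
            buffer = []
        if max_events is not None and count >= max_events:
            break
    return paths


# Toy usage with made-up events
toy_events = [{"user": f"user{i}", "bot": i % 3 == 0} for i in range(10)]
with tempfile.TemporaryDirectory() as tmp:
    written = save_in_slices(toy_events, save_freq=4, out_dir=tmp)
    slice_sizes = []
    for p in written:
        with open(p) as f:
            slice_sizes.append(len(json.load(f)))
```

As in the collection cell, an incomplete trailing slice (here the last two toy events) is not written.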

For faster evaluation, we store the data in Google Drive and load it from there.

In [4]:
def load_from_drive():
  !gdown 1JpYwYB1FsjUUOfZxCcr6wxdxt6Ap7PZ0
  !gdown 18UCs2o_QRszZM_1M1OiXGTqywrjUL8hH
  !gdown 1o1suE-eS8iUL1YUWgBhgkcjdPWEAlDQE
  !gdown 1Wa-NQ-X4SCn3bMEbxNKGxsFpS7lc7yVm
  !gdown 1OOx4S1EhsqxJ5wMAZsOU6iERJqsyYlur
  !gdown 1l_18qAjkDf1fz7jZ-W5JQkj-sr71erAe
  !gdown 1tjpl-RpxLWTkQ-v8P-eqS0IscW0qKX1h
  !gdown 1pytAK5dY3Nd7GIqj0xfXkvW7kSY64tja
  !gdown 1qp3_RM8m35kWaJnm0I8LgllqqjpRevTa
  !gdown 1EHf8Focau2JlH0K874Wu7qrVMyKUHm4c
  !gdown 1oaS07sIGdXkRNpXry0MoK32GMzNeSVCi
  !gdown 1FGORShD9TkQGZOZxv0CrdipbxkHBbWD9
  !gdown 11xgv0gHi9qB95aw525bEM8KLrpvR9o6P
  !gdown 1SQZmTxFykknN7zypKbV8vR_MALBpeJlv
  !gdown 166v5f1XKl5AS4wSd_RQxzWQN699a2xe2
  !gdown 1Pr_Kwl6VfivIhfx9FojEdslAzzMu40nz

In [None]:
if LOAD_DATA:
  load_from_drive()
else:
  collect_data()

Hyperparameters:

In [5]:
train_path = ["data_50000.json", "data_100000.json", "data_150000.json", "data_200000.json", "data_250000.json",
              "data_300000.json", "data_350000.json", "data_400000.json", "data_450000.json", "data_500000.json",
              "data_550000.json", "data_600000.json", "data_650000.json", "data_700000.json", "data_750000.json"]
val_path = ["data_800000.json"]

filter_size = 10000
filter_sizes = [1000, 10000, 15000]
hash_functions = 20
save_bot = True
filter_type = 'user'  # 'user' or 'random'
filter_prob = 0.2
target_amount = 500
log_freq = 100
URL = 'https://stream.wikimedia.org/v2/stream/recentchange'

Function to load data from files

In [6]:
def load_data(paths):
    data = []
    for p in paths:
        with open(p, 'r') as f:
            data += json.load(f)
    return [{'bot': d['bot'], 'user': d['user']} for d in data]
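
The task also asks for the distribution of edits per user and per bot. A minimal sketch of how the records returned by load_data could be summarized is shown below. The helper names `edits_per_user` and `summarize` and the toy records are assumptions for illustration only.

```python
from collections import Counter


def edits_per_user(records):
    """Count edits per user name, split into bots and humans.

    records: list of dicts with 'user' and 'bot' keys, in the shape
    returned by load_data.
    """
    bot_counts, human_counts = Counter(), Counter()
    for r in records:
        target = bot_counts if r["bot"] else human_counts
        target[r["user"]] += 1
    return bot_counts, human_counts


def summarize(counts):
    """Return (number of users, total edits, mean edits per user, max edits)."""
    if not counts:
        return 0, 0, 0.0, 0
    total = sum(counts.values())
    return len(counts), total, total / len(counts), max(counts.values())


# Made-up records; with real data, pass load_data(train_path) instead
toy = [
    {"user": "BotA", "bot": True},
    {"user": "BotA", "bot": True},
    {"user": "BotA", "bot": True},
    {"user": "Alice", "bot": False},
    {"user": "Bob", "bot": False},
    {"user": "Alice", "bot": False},
]
bots, humans = edits_per_user(toy)
bot_summary = summarize(bots)
human_summary = summarize(humans)
```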

First building block of the Bloom filter. We use randomly parameterized hash functions to map user names to positions in the filter.

In [7]:
def rand_hash(modulus, a=None, b=None):
    if a is None or b is None:
        a, b = random.randint(1, modulus - 1), random.randint(1, modulus - 1)
    # print(a, b)
    func = lambda x: hash(x) % (a + b) % modulus
    return {'func': func, 'a': a, 'b': b}
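
Python's built-in hash for strings is randomized per process unless PYTHONHASHSEED is set, so stored `a` and `b` values alone would not reproduce the same filter positions in a new session. As a hedged alternative, the sketch below uses the common form `((a * x + b) mod p) mod m` on top of a SHA-256 digest. The name `stable_hash`, the choice of prime, and the example parameters are assumptions and not the notebook's method.

```python
import hashlib

PRIME = 2**61 - 1  # a Mersenne prime, far larger than the filter sizes used here


def stable_hash(modulus, a, b):
    """Return a function mapping a string to an index in [0, modulus).

    Hypothetical alternative to rand_hash: it relies on a SHA-256 digest
    instead of the built-in hash(), so results are identical across
    Python processes.
    """
    def func(text):
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        x = int.from_bytes(digest[:8], "big")
        return ((a * x + b) % PRIME) % modulus
    return func


h = stable_hash(modulus=15000, a=12345, b=678)
index = h("ExampleBot")
```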

A wrapper that generates several random hash functions at once, or rebuilds previously saved hash functions from their parameters.

In [8]:
def random_hashes(modulus, amount_hash_functions, seed=42, params=None):
    # e.g. params = [{'a': 1, 'b': 2}, {'a': 3, 'b': 4}]
    random.seed(seed)
    if params is None:
        fns = [rand_hash(modulus) for _ in range(amount_hash_functions)]
    else:
        fns = [rand_hash(modulus, **param) for param in params]
    return fns

Utility function to check whether a user is in the filter. The user is considered present only if every hash function points to a position of the filter that is set to 1.

In [43]:
def is_str_in_filter(bloom_filter, hashes, data):
    for h in hashes:
        if bloom_filter[h['func'](data["user"])] == 0:
            return False
    return True
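
To illustrate the membership rule in isolation, here is a small self-contained sketch of a Bloom filter with toy values. The helper names, the SHA-256 based hash functions, and the user names are assumptions made for the example, and the notebook's own functions are not used.

```python
import hashlib


def make_hashes(k, m):
    """Build k hash functions onto [0, m) by salting SHA-256 with an index."""
    def make(i):
        def func(text):
            digest = hashlib.sha256(f"{i}:{text}".encode("utf-8")).digest()
            return int.from_bytes(digest[:8], "big") % m
        return func
    return [make(i) for i in range(k)]


def add(bits, hash_funcs, name):
    # Set every position the hash functions point to
    for h in hash_funcs:
        bits[h(name)] = 1


def contains(bits, hash_funcs, name):
    # Present only if every hashed position is set
    return all(bits[h(name)] == 1 for h in hash_funcs)


m, k = 1000, 3
demo_bits = [0] * m
demo_hashes = make_hashes(k, m)
known_bots = ["CleanupBot", "ArchiveBot", "StatsBot"]  # made-up names
for name in known_bots:
    add(demo_bits, demo_hashes, name)

all_found = all(contains(demo_bits, demo_hashes, name) for name in known_bots)
empty_filter_hit = contains([0] * m, demo_hashes, "SomeHuman")
```

Every stored name is found again, so a Bloom filter has no false negatives for stored items. A name that was never stored can still be reported as present when all of its positions happen to be set by other names.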

Function to evaluate the filter on data. It returns the ground-truth bot flags together with the filter's predictions so the two can be compared.

In [10]:
def eval_filter(bloom_filter, hashes, data):
    gt = [d['bot'] for d in data]
    pred = [is_str_in_filter(bloom_filter, hashes, d) for d in data]
    return gt, pred

Function to fit the filter on data using the pre-defined hash functions. For every stored user, we set to 1 each position of the filter array that a hash function returns as an index. With save_bot enabled, only users flagged as bots are stored.

In [11]:
def fit_filter(bloom_filter, hashes, data, save_bot=save_bot):
    for d in data:
        if d['bot'] != 1 and save_bot:
            continue
        for h in hashes:
            bloom_filter[h['func'](d['user'])] = 1

    return bloom_filter

A plain accuracy function: the fraction of predictions that match the ground truth.

In [12]:
def accuracy(gt, pred):
    return sum([1 for i in range(len(gt)) if gt[i] == pred[i]]) / len(gt)
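
Accuracy alone does not show which kind of error the filter makes. A Bloom filter can flag a human as a bot (false positive), while a stored bot is never missed. A minimal sketch of a per-class breakdown follows. The function names and toy labels are assumptions for illustration.

```python
def confusion_counts(gt, pred):
    """Return (tp, fp, fn, tn), treating 'bot' as the positive class."""
    tp = fp = fn = tn = 0
    for g, p in zip(gt, pred):
        g, p = bool(g), bool(p)
        if g and p:
            tp += 1
        elif not g and p:
            fp += 1
        elif g and not p:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def precision_recall(gt, pred):
    """Return (precision, recall) for the bot class, 0.0 when undefined."""
    tp, fp, fn, _ = confusion_counts(gt, pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up labels and predictions
toy_gt = [True, True, False, False, False]
toy_pred = [True, False, True, False, False]
counts = confusion_counts(toy_gt, toy_pred)
pr = precision_recall(toy_gt, toy_pred)
```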

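Function to pick the best filter size. For each candidate size we fit a filter on the train data, report train and validation accuracy, and keep the filter with the highest validation accuracy.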

In [44]:
def get_filter():
    train_data = load_data(train_path)
    val_data = load_data(val_path)

    total_train = len(train_data)
    amount_bots_train = len([d for d in train_data if d['bot'] == 1])

    total_val = len(val_data)
    amount_bots_val = len([d for d in val_data if d['bot'] == 1])

    print("Total train amount of users: ", total_train)
    print("Total val amount of users: ", total_val)

    print("Train amount of bots: ", amount_bots_train)
    print("Val amount of bots: ", amount_bots_val)

    # Calculate amount of the same bots from val in train
    names = set([d['user'] for d in train_data])
    same_users = 0
    same_bots = 0
    for d in val_data:
        if d["user"] in names:
            if d['bot'] == 0:
                same_users += 1
            else:
                same_bots += 1
    print("Amount of the same users from val in train: ", same_users)
    print("Amount of the same bots from val in train: ", same_bots)

    # with open('bloom_filter.json', 'r') as f:
    #     data = json.load(f)
    # bloom_filter_saved = data['filter']
    # hashes_a_b = data['hashes']

    best_acc = -1
    for filter_size in filter_sizes:
        bloom_filter = [0] * filter_size
        hashes = random_hashes(filter_size, hash_functions)#, params=hashes_a_b)

        bloom_filter_fitted = fit_filter(bloom_filter, hashes, train_data)

        gt, pred = eval_filter(bloom_filter_fitted, hashes, train_data)
        print("Train accuracy: ", round(accuracy(gt, pred), 3))

        gt, pred = eval_filter(bloom_filter_fitted, hashes, val_data)
        eval_acc = round(accuracy(gt, pred), 3)
        print("Val accuracy: ", eval_acc)

        if eval_acc > best_acc:
            best_filter_size = filter_size
            best_acc = eval_acc
            best_filter = bloom_filter_fitted
            best_hashes = hashes

    # Save filter
    # with open('bloom_filter.json', 'w') as f:
    #     json.dump({'filter': bloom_filter_fitted, 'hashes': [{'a': h['a'], 'b': h['b']} for h in hashes]}, f)
    print(f"Best filter size is: {best_filter_size} ")
    return best_filter, best_hashes

In [45]:
bloom_filter, hashes = get_filter()

Total train amount of users:  750000
Total val amount of users:  50000
Train amount of bots:  271831
Val amount of bots:  15798
Amount of the same users from val in train:  28251
Amount of the same bots from val in train:  15775
Train accuracy:  0.386
Val accuracy:  0.343
Train accuracy:  0.94
Val accuracy:  0.941
Train accuracy:  0.941
Val accuracy:  0.943
Best filter size is: 15000 
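
One plausible reading of the low accuracy at size 1000 is that the filter is nearly full, so almost every user looks like a bot. The standard approximation for the false-positive rate of a Bloom filter with m positions, k hash functions, and n stored items is (1 - e^(-kn/m))^k, assuming well-distributed hash functions. The sketch below evaluates it for the three filter sizes. The value of n is a made-up placeholder, because the notebook does not print the number of distinct bot names.

```python
import math


def bloom_false_positive_rate(m, k, n):
    """Approximate false-positive rate for m positions, k hashes, n stored items."""
    return (1.0 - math.exp(-k * n / m)) ** k


# n = 300 is a hypothetical count of distinct bot names, for illustration only
rates = {
    m: bloom_false_positive_rate(m, k=20, n=300)
    for m in (1000, 10000, 15000)
}
```

Under this assumption the rate is close to 1 for the smallest filter and drops sharply for the two larger ones, which matches the direction of the accuracies above.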


Add logistic regression to confirm whether a user is a bot or not

In [42]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

Function to turn user names into embeddings with a pretrained model. We use the BERT-based BAAI/bge-large-en-v1.5 model to embed user names and then use logistic regression to classify each user as bot or not.

In [49]:
def preprocess_text(data):
    tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-large-en-v1.5')
    model = AutoModel.from_pretrained('BAAI/bge-large-en-v1.5').to(device)

    # Mean Pooling - Take attention mask into account for correct averaging
    def mean_pooling(model_output, attention_mask):
        token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
        input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
        return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

    # Tokenize sentences (accepts a single event dict or a DataFrame with a 'user' column)
    if isinstance(data['user'], str):
        encoded_input = tokenizer(data['user'], padding=True, truncation=True, return_tensors='pt')
    else:
        encoded_input = tokenizer(data['user'].values.tolist(), padding=True, truncation=True, return_tensors='pt')

    # Move the inputs to the same device as the model
    encoded_input = encoded_input.to(device)

    # Compute token embeddings
    with torch.no_grad():
        model_output = model(**encoded_input)

    # Perform pooling. In this case, mean pooling.
    sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
    # Move back to the CPU before converting to NumPy (required when running on CUDA)
    return pd.DataFrame(sentence_embeddings.cpu().numpy())
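
The mean pooling step averages token embeddings while ignoring padding positions. The plain-Python sketch below reproduces that masked average for a single sequence of toy numbers. The function name `masked_mean` and the values are assumptions for illustration, and the notebook itself does this with torch tensors.

```python
def masked_mean(token_vectors, attention_mask):
    """Average token vectors where the mask is 1, ignoring padding positions."""
    dim = len(token_vectors[0])
    sums = [0.0] * dim
    kept = 0
    for vec, mask_value in zip(token_vectors, attention_mask):
        if mask_value:
            kept += 1
            for j in range(dim):
                sums[j] += vec[j]
    # Guard against division by zero, like torch.clamp(..., min=1e-9)
    kept = max(kept, 1e-9)
    return [s / kept for s in sums]


# Two real tokens and one padding token that must not affect the result
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
pooled = masked_mean(tokens, mask)
```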

In [27]:
train_data = load_data(train_path)[:5000]
x_train = pd.DataFrame(train_data)
y_train = x_train['bot'].astype('int')

In [28]:
x_train = preprocess_text(x_train)   

Train Logistic regression model

In [29]:
model = LogisticRegression(solver='sag').fit(x_train, y_train)



Evaluate model on validation data

In [None]:
# model accuracy
val_data = load_data(val_path)[:5000]  # first 5000 validation events
x_val = pd.DataFrame(val_data)
y_test = x_val['bot'].astype('int')
y_pred = model.predict(preprocess_text(x_val))
print(f'The accuracy of Logistic regression model is {accuracy_score(y_test, y_pred)}')

Final function to predict whether a user is a bot. We first check whether the user is in the Bloom filter. If the filter says the user is not a known bot, we run logistic regression on the user name to confirm, and users it classifies as bots are stored in the filter.

In [38]:
def predict_bot(data, bloom_filter, hashes):
    # Check bloom filter first
    bloom_filter_result = is_str_in_filter(bloom_filter, hashes, data)

    # If bloom filter suggests not bot, use logistic regression for confirmation
    if not bloom_filter_result:
        print("Run logistic regression")
        # predict returns a one-element array; convert it to a plain bool
        logistic_regression_result = bool(model.predict(preprocess_text(data))[0])
        if logistic_regression_result:
            # store in bloom filter; save_bot=False so the user is stored based on
            # the prediction instead of being skipped by the ground-truth flag
            bloom_filter = fit_filter(bloom_filter, hashes, [data], save_bot=False)
        return bloom_filter, logistic_regression_result

    return bloom_filter, True
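
The two-stage idea can be sketched independently of the model: a cheap membership test answers first, and a slower classifier runs only for unknown users. In the sketch below a set stands in for the Bloom filter and a toy rule stands in for logistic regression. All names here are hypothetical.

```python
def cascade_predict(user, known_bots, fallback):
    """Two-stage check: fast membership test first, slower classifier second.

    known_bots: a set standing in for the Bloom filter.
    fallback: any callable returning a truthy value for a suspected bot
    (a stand-in for the logistic regression model).
    Returns (is_bot, used_fallback).
    """
    if user in known_bots:
        return True, False
    is_bot = bool(fallback(user))
    if is_bot:
        # Remember the user so the next event skips the classifier
        known_bots.add(user)
    return is_bot, True


def toy_rule(name):
    # Made-up rule, only for the example
    return name.lower().endswith("bot")


seen = {"ArchiveBot"}
first = cascade_predict("CleanupBot", seen, toy_rule)
second = cascade_predict("CleanupBot", seen, toy_rule)
human = cascade_predict("Alice", seen, toy_rule)
```

The second call for the same user is answered by the membership test alone, which is the saving the cascade is meant to provide.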


In [None]:
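def left_user(user):
    # Returns True when the event is kept in the sample (about filter_prob of the stream)
    if filter_type == 'user':
        # Hash-based sampling: a given user is always kept or always dropped
        return hash(user['user']) % 100 < filter_prob * 100
    elif filter_type == 'random':
        # Independent sampling of each event
        return random.random() < filter_prob
    else:
        return True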
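
Because Python's built-in hash for strings is randomized per process, user-based sampling with hash() selects a different subset of users in each session. A minimal sketch of a session-independent variant is shown below. The function name `keep_user`, the bucket count, and the generated user names are assumptions for illustration.

```python
import hashlib


def keep_user(user_name, keep_fraction=0.2, buckets=100):
    """Return True for roughly keep_fraction of user names, deterministically."""
    digest = hashlib.sha256(user_name.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % buckets
    return bucket < keep_fraction * buckets


# Made-up user names to check the share that is kept
names = [f"user{i}" for i in range(10000)]
kept = [n for n in names if keep_user(n)]
kept_fraction = len(kept) / len(names)
```

Sampling by user keeps all edits of a selected user, whereas sampling each event independently keeps about the same share of the stream but splits a user's edits between the kept and dropped parts.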

In [41]:
count = 0
non_filtered_gt = []
gt_bot = []
pred_bot = []
log_freq = 5
target_amount = 50
for event in EventSource(URL):
    if event.event == 'message':
        try:
            change = json.loads(event.data)
        except ValueError:
            continue

        non_filtered_gt.append(change["bot"])
        if not left_user(change): 
            continue

        count += 1
        print("Event: ", count)

        gt_bot.append(change['bot'])
        bloom_filter, is_bot = predict_bot(change, bloom_filter, hashes)
        pred_bot.append(is_bot)

        if count % log_freq == 0:
            print('Processed {} events'.format(count))

        if count >= target_amount:
            break

print('Accuracy: {}'.format(accuracy(gt_bot, pred_bot)))
print('Non-filtered amount of bots/users: {}/{}'.format(sum(non_filtered_gt), len(non_filtered_gt) - sum(non_filtered_gt)))
print('Filtered amount of bots/users: {}/{}'.format(sum(gt_bot), len(gt_bot) - sum(gt_bot)))
print('Predicted amount of bots/users: {}/{}'.format(sum(pred_bot), len(pred_bot) - sum(pred_bot)))

Event:  1
Event:  2
Event:  3
Event:  4
Run logistic regression
Event:  5
Run logistic regression
Processed 5 events
Event:  6
Run logistic regression
Event:  7
Run logistic regression
Event:  8
Event:  9
Run logistic regression
Event:  10
Processed 10 events
Event:  11
Run logistic regression
Event:  12
Run logistic regression
Event:  13
Run logistic regression
Event:  14
Run logistic regression
Event:  15
Run logistic regression
Processed 15 events
Event:  16
Run logistic regression
Event:  17
Run logistic regression
Event:  18
Run logistic regression
Event:  19
Run logistic regression
Event:  20
Run logistic regression
Processed 20 events
Event:  21
Run logistic regression
Event:  22
Event:  23
Run logistic regression
Event:  24
Run logistic regression
Event:  25
Processed 25 events
Event:  26
Event:  27
Run logistic regression
Event:  28
Run logistic regression
Event:  29
Run logistic regression
Event:  30
Run logistic regression
Processed 30 events
Event:  31
Run logistic regressi