# The Vowpal Wabbit library


Vowpal Wabbit represents each example as one line: [Label] [Importance] [Tag] | Namespace Features |Namespace Features ... |Namespace Features

Label - the class label for classification, or the target value for regression

Importance - the weight of the example during training, which helps with imbalanced data

Tag - a "name" for the example that is kept alongside its prediction

Namespace - features can be grouped into namespaces to make them easier to work with

Training example: -1 | Feature1:0.4222 Feature2:-0.305 Feature3:1.038 Feature4:-1.044 Feature5:-0.935

Test example: | Feature1:0.4222 Feature2:-0.305 Feature3:1.038 Feature4:-1.044 Feature5:-0.935
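
To make the format concrete, here is a minimal sketch that assembles one line from its parts. The helper `make_vw_line` and its arguments are hypothetical names introduced for illustration; they are not part of Vowpal Wabbit. The sketch assumes the tag is marked with a leading apostrophe, which VW accepts, and that feature names contain no spaces, `:` or `|`.

```python
def make_vw_line(features_by_namespace, label=None, importance=None, tag=None):
    """Build one example line: [Label] [Importance] [Tag] |Namespace Features ..."""
    head = []
    if label is not None:
        head.append(str(label))
        # An importance weight only makes sense next to a label.
        if importance is not None:
            head.append(str(importance))
    if tag is not None:
        # Assumption: mark the tag with a leading apostrophe.
        head.append("'" + tag)
    blocks = []
    for namespace, features in features_by_namespace.items():
        pairs = " ".join(f"{name}:{value}" for name, value in features.items())
        # The namespace name must touch the bar: "|text", not "| text".
        blocks.append(f"|{namespace} {pairs}")
    return " ".join(head + blocks)


line = make_vw_line(
    {"text": {"good": 1, "movie": 1}, "meta": {"length": 120}},
    label=1, importance=2.0, tag="ex1",
)
print(line)  # 1 2.0 'ex1 |text good:1 movie:1 |meta length:120
```

Leaving out `label` produces a line that starts with the first bar, which is the shape used for test data.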

# Converting incoming data to the Vowpal Wabbit format

In [487]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score
from sklearn.datasets import fetch_20newsgroups
import pandas as pd
import numpy as np 
import re

In [488]:
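def to_vw_format(document, label=None):
    # Tokens are runs of 3+ word characters, so ':' and '|' never reach the output.
    tokens = re.findall(r'\w{3,}', document.lower())
    return ('' if label is None else str(label)) + ' | ' + ' '.join(tokens) + '\n'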

In [489]:
to_vw_format("sdsds. ,ddf@", label=-1)

'-1 | sdsds ddf\n'

The IMDB Dataset.csv file comes from Kaggle: https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews

In [490]:
df = pd.read_csv("IMDB Dataset.csv")
df["target"] = df["sentiment"].map({"positive": 1, "negative": -1})
df = df[["review", "target"]]

In [491]:
X_train, X_test, y_train, y_test = train_test_split(df["review"], df["target"], test_size=0.3, stratify=df["target"])

In [492]:
with open("train.vw", "w", encoding="utf-8") as f:
    for text, target in zip(X_train, y_train):
        f.write(to_vw_format(text, target))

In [493]:
with open("test.vw", "w", encoding="utf-8") as f:
    for text in X_test:
        f.write(to_vw_format(text))

In [494]:
with open("test.vw", "r", encoding="utf-8") as f:
    for i in range(2):
        print(f.readline())

 | saw this movie when first came out was official selection for the temecula valley international film festival and voted for for best picture justine priestley hot the psychotic but complex amanda this not your ordinary psycho movie lots interesting and original slants the genre sort fatal attraction for the younger set with some great blues music mixed the object amanda affection married and coming blues singer who has less time for her husband her career takes off

 | lot movies often bring year old son glad did not bring him this one there are many references sex and skinny dipping scene however that not the primary reason would not take him the trailers lead you believe light hearted comedy nevertheless virtually all the funny moments are the previews kept waiting for get interesting funny anything but serious however nearly fell asleep the plot less story dragged understand that dogs can great company that being said the entire story focused poorly behaving dog that the owners w

# The most important Vowpal Wabbit flags

-d - path to the data for training or testing, in VW format

-f - path where the trained model is saved

--passes - number of passes over the training set, the analogue of epochs

-c - use a cache file, which speeds up every pass after the first (without it, --passes will not work)

--learning_rate arg - learning rate; arg is the value used in the gradient descent step

--power_t - exponent that controls how fast the learning rate decays (default 0.5)

--initial_t - value that controls the gradient descent step through the formula $a_T = lr \cdot \left(\frac{j}{j+T}\right)^{p}$, where lr is learning_rate, j is initial_t, p is power_t, and T is the step number

--loss_function arg - loss function, which determines the model: logistic gives logistic regression, hinge gives an SVM, squared gives least squares for regression (other losses exist)

--ngram arg - use n-grams when working with text; arg is the n-gram size

-t - test mode; the class labels in the data are ignored

-p - path to the file where the model's predictions are saved

--quiet - do not print diagnostic output

-i - load a trained model, either for testing or to resume training

--l1 arg - L1 regularization

--l2 arg - L2 regularization

--oaa arg - multiclass classification (one against all); arg is the number of classes

-b arg - use arg bits for hashing, so the feature space is limited to $2^{arg}$ features
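
The step-size formula given for --initial_t can be evaluated directly to see how --learning_rate, --initial_t and --power_t interact. This is a sketch of that formula only; the function name `step_size` is hypothetical, and VW's default update rule applies further adjustments that are not modeled here. The sketch assumes initial_t > 0, because with initial_t = 0 the fraction is 0/0 at step 0.

```python
def step_size(learning_rate, initial_t, power_t, step):
    """a_T = lr * (j / (j + T)) ** p, with j = initial_t, p = power_t, T = step."""
    return learning_rate * (initial_t / (initial_t + step)) ** power_t


# Illustrative values: lr = 0.5 and power_t = 0.5 match the defaults printed
# in the training logs; initial_t = 1 is chosen here only for the example.
for step in (0, 3, 99):
    print(step, step_size(0.5, 1.0, 0.5, step))
```

A larger power_t makes the step shrink faster, and power_t = 0 keeps it constant at learning_rate.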

# Other flags:
https://github.com/VowpalWabbit/vowpal_wabbit/wiki/Command-Line-Arguments

# Binary classification

# Training

In [495]:
!C:/Users/Ilsaf/vw.exe -d train.vw -f model.vw --learning_rate 0.5 --loss_function logistic --ngram 2 --passes 40 -c

You have chosen to generate 2-grams
final_regressor = model.vw
Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
decay_learning_rate = 1
creating cache_file = train.vw.cache
Reading datafile = train.vw
num sources = 1
average    since         example     example  current  current  current
loss       last          counter      weight    label  predict features
0.762334   0.762334            3         3.0   1.0000  -0.0283      192
1.040178   1.318023            6         6.0   1.0000   0.0081      252
0.884937   0.698648           11        11.0  -1.0000   0.0323       72
0.733942   0.582948           22        22.0  -1.0000  -0.4131      304
0.718473   0.703003           44        44.0   1.0000  -0.5033      204
0.675556   0.631641           87        87.0   1.0000  -0.8584      746
0.654739   0.633921          174       174.0   1.0000  -2.5562     1218
0.618759   0.582779          348       348.0  -1.0000   0.0680      162
0.572478   0.526197          696       696.
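
The `--ngram 2` flag in the training command adds pairs of adjacent tokens as extra features, which is why the feature counts in the log are larger than the number of words in a review. A rough plain-Python illustration follows. The function `with_bigrams` is hypothetical, the `^` joiner is only a display convention chosen here, and the sketch assumes that single tokens are kept alongside the pairs; VW itself hashes the features internally.

```python
def with_bigrams(tokens):
    """Return the original tokens followed by all pairs of adjacent tokens."""
    bigrams = [f"{left}^{right}" for left, right in zip(tokens, tokens[1:])]
    return list(tokens) + bigrams


print(with_bigrams(["not", "very", "good"]))
# ['not', 'very', 'good', 'not^very', 'very^good']
```

Pairs such as `not^very` let a linear model pick up short phrases that single words cannot express.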

# Predictions

In [496]:
!C:/Users/Ilsaf/vw.exe -t -d test.vw -i model.vw -p predictions.txt --quiet

In [497]:
with open("predictions.txt", "r") as f:
    # Each line holds one raw score; float() ignores the trailing newline.
    predictions = [float(line) for line in f]
# sigmoid(x) > 0.5 is equivalent to x > 0
predictions = [1 if 1 / (1 + np.exp(-x)) > 0.5 else -1 for x in predictions]

In [498]:
accuracy_score(y_test, predictions)

0.9090666666666667

In [499]:
with open("predictions.txt", "r") as f:
    predictions = [1 / (1 + np.exp(-float(line))) for line in f]

In [500]:
roc_auc_score(y_test.map({-1: 0, 1: 1}), predictions)

0.9660740622222224
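
With `--loss_function logistic`, each line of the predictions file is a raw score, not a probability. The sigmoid maps the score to a probability, and thresholding the probability at 0.5 is the same as thresholding the score at 0. The sketch below shows the parsing and conversion steps in isolation. The helper names are hypothetical, and it assumes one number per line, possibly followed by a tag.

```python
import math


def parse_vw_scores(lines):
    """Parse raw scores from an iterable of lines, skipping blank lines."""
    # split()[0] keeps only the number if a tag follows it on the same line.
    return [float(line.split()[0]) for line in lines if line.strip()]


def sigmoid(score):
    """Map a raw score to a probability of the positive class."""
    return 1.0 / (1.0 + math.exp(-score))


lines = ["0.5\n", "-1.25\n", "2\n"]  # stand-in for the contents of a predictions file
scores = parse_vw_scores(lines)
probabilities = [sigmoid(s) for s in scores]
labels = [1 if s > 0 else -1 for s in scores]
print(labels)  # [1, -1, 1]
```

An open file can be passed in place of `lines`, because iterating over a file yields its lines.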

# Parameter search in Vowpal Wabbit

In [501]:
%%time
for p in [1,10,25]:
    !C:/Users/Ilsaf/vw.exe \
        -d train.vw \
        --loss_function logistic \
        --passes {p} \
        -f model_{p}.vw \
        --random_seed 17 \
        --quiet \
        -c
    print ('model_{}.vw is ready'.format(p))

model_1.vw is ready
model_10.vw is ready
model_25.vw is ready
CPU times: total: 46.9 ms
Wall time: 8.94 s


In [502]:
%%time
for p in [1,10,25]: 
    !C:/Users/Ilsaf/vw.exe \
        -i model_{p}.vw \
        -t -d test.vw \
        -p pred_{p}.txt \
        --quiet
    print ('pred_{}.txt is ready'.format(p))

pred_1.txt is ready
pred_10.txt is ready
pred_25.txt is ready
CPU times: total: 0 ns
Wall time: 384 ms


In [503]:
for p in [1,10,25]:
    with open(f"pred_{p}.txt", "r") as f:
        predictions = [1 / (1 + np.exp(-float(line))) for line in f]
    print(f"auc score for model_{p} = {round(roc_auc_score(y_test, predictions), 4)}")

auc score for model_1 = 0.9586
auc score for model_10 = 0.9605
auc score for model_25 = 0.959


# Multiclass classification

In [514]:
data = fetch_20newsgroups()

In [515]:
X, y = data.data, data.target + 1  # labels must run from 1 to N, hence the + 1

In [516]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y)

In [517]:
with open("train.vw", "w", encoding="utf-8") as f:
    for text, target in zip(X_train, y_train):
        f.write(to_vw_format(text, target))

In [518]:
with open("test.vw", "w", encoding="utf-8") as f:
    for text in X_test:
        f.write(to_vw_format(text))

In [519]:
with open("train.vw", "r", encoding="utf-8") as f:
    print(f.readline())

13 | from wayne alan martin wm1h andrew cmu edu subject dayton hamfest organization senior electrical and computer engineering carnegie mellon pittsburgh lines distribution usa nntp posting host po5 andrew cmu edu reply 1993apr19 163122 20454 cbfsb att com yes the and but does anyone have directions how get there after get dayton thanks wayne martin



# Training

In [520]:
!C:/Users/Ilsaf/vw.exe --oaa 20 -d train.vw -f model.vw --loss_function hinge --ngram 2 --passes 10 -c

You have chosen to generate 2-grams
final_regressor = model.vw
Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
decay_learning_rate = 1
creating cache_file = train.vw.cache
Reading datafile = train.vw
num sources = 1
average    since         example     example  current  current  current
loss       last          counter      weight    label  predict features
1.000000   1.000000          3      3.0         20       13      306
1.000000   1.000000          6      6.0          5       13      138
1.000000   1.000000         11     11.0         17       13      114
1.000000   1.000000         22     22.0          9       13      188
0.931818   0.863636         44     44.0         15       11      274
0.885057   0.837209         87     87.0         19        2      404
0.839080   0.793103        174    174.0          1       20      280
0.758621   0.678161        348    348.0         14       14      420
0.602011   0.445402        696    696.0         13       16      44

# Predictions

In [521]:
!C:/Users/Ilsaf/vw.exe -t -d test.vw -i model.vw -p predictions.txt --quiet

In [522]:
with open("predictions.txt", "r") as f:
    pred = [float(label) for label in f.readlines()]

In [523]:
accuracy_score(y_test, pred)

0.8695139911634757
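
One against all reduces the N-class problem to N binary "class k versus the rest" problems and predicts the class whose binary scorer gives the highest score. The prediction step can be sketched as follows. The function `oaa_predict` and the example scores are hypothetical; VW computes the per-class scores internally, and with the command above predictions.txt contains only the winning label.

```python
def oaa_predict(class_scores):
    """Pick the label whose binary scorer is most confident.

    class_scores[k] is the score of the "class k + 1 versus the rest" model.
    """
    best_index = max(range(len(class_scores)), key=lambda k: class_scores[k])
    return best_index + 1  # labels run from 1 to N, as in the training data


print(oaa_predict([-0.3, 1.2, 0.4]))  # 2
```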

# Regression in Vowpal Wabbit

In [524]:
from sklearn.datasets import make_regression
from sklearn.metrics import mean_absolute_error, r2_score

In [525]:
X, y = make_regression(n_samples=100000, n_features=10, noise=1, random_state=42)

In [526]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

In [527]:
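def to_vw_format(data, target=None):
    # "target or ''" would silently drop a target of exactly 0.0, so test for None instead.
    label = "" if target is None else str(target)
    return label + " | " + " ".join(f"f{i}:{value}" for i, value in enumerate(data)) + "\n"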

In [528]:
to_vw_format(X_train[0], target=y_train[0])

'119.09515770007185 | f0:0.6062904726446545 f1:0.23675980113915374 f2:-0.9161579471392548 f3:0.3858363071380209 f4:-0.14952873540255326 f5:-1.3318700525131166 f6:0.9948813991791735 f7:-1.4846834513046567 f8:2.088451561879559 f9:-0.30632136060998555\n'

In [529]:
with open("train.vw", "w", encoding="utf-8") as f:
    for text, target in zip(X_train, y_train):
        f.write(to_vw_format(text, target))

In [530]:
with open("test.vw", "w", encoding="utf-8") as f:
    for text in X_test:
        f.write(to_vw_format(text))

In [531]:
!C:/Users/Ilsaf/vw.exe -d train.vw -f model.vw --loss_function squared

final_regressor = model.vw
Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
using no cache
Reading datafile = train.vw
num sources = 1
average    since         example     example  current  current  current
loss       last          counter      weight    label  predict features
51566.670247 51566.670247          3         3.0 -190.5473  -0.2736       11
36107.659627 20648.649007          6         6.0  44.8172   0.1673       11
30266.249379 23256.557080         11        11.0  64.5526   0.0028       11
22921.081461 15575.913543         22        22.0 -10.1413  -1.4213       11
36569.137712 50217.193964         44        44.0 -177.9140  -0.6585       11
43729.614545 51056.614095         87        87.0  46.3969   0.7297       11
44185.753772 44641.892999        174       174.0 109.2240   2.0686       11
38380.427460 32575.101148        348       348.0 122.3701   2.5391       11
38216.297827 38052.168194        696       696.0 377.4459   8.5119       11
35179.642484 32

In [532]:
!C:/Users/Ilsaf/vw.exe -t -d test.vw -i model.vw -p predictions.txt --quiet

In [533]:
with open("predictions.txt", "r") as f:
    pred = [float(label) for label in f.readlines()]

In [534]:
mean_absolute_error(y_test, pred)

127.67117144897841

In [535]:
r2_score(y_test, pred)

0.3409035049408746

In [536]:
pd.DataFrame({"real": y_test, "pred": pred}).head()

|   | real | pred |
|---|---|---|
| 0 | -26.410513 | -4.305271 |
| 1 | -399.834335 | -74.87928 |
| 2 | -188.712635 | -38.90591 |
| 3 | 381.151092 | 74.717155 |
| 4 | 37.087914 | 10.775252 |
