# active learning for text classification
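in this notebook we evaluate vowpal wabbit on a binary text classification problem (positive vs. negative movie reviews), first with regular supervised training and then with its active learning mode.
we use the movie review polarity dataset available here: http://www.cs.cornell.edu/people/pabo/movie-review-data/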

In [1]:
!mkdir -p data
!curl http://www.cs.cornell.edu/people/pabo/movie-review-data/review_polarity.tar.gz -o data/review_polarity.tar.gz
!tar zxC data -f data/review_polarity.tar.gz

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 3053k  100 3053k    0     0   610k      0  0:00:05  0:00:05 --:--:--  648k


the data is almost ready to use: it is already tokenized and lowercased. we only need to replace the `:` and `|` characters, which are reserved in the vowpal wabbit input format (`|` starts a namespace and `:` separates a feature from its value).

In [2]:
import os

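def read_files(directory):
    """Yield each review in `directory` as one line of space-separated tokens."""
    for filename in os.listdir(directory):
        if filename.endswith('.txt'):
            filename = os.path.join(directory, filename)
            with open(filename) as f:
                # join all lines of the review into a single line
                tokens = ' '.join(line.strip() for line in f)
            # ':' and '|' are reserved in the vw input format
            yield tokens.replace(':', 'COLON').replace('|', 'PIPE')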
                
positive = [f'+1 | {s}' for s in read_files('data/txt_sentoken/pos')]
negative = [f'-1 | {s}' for s in read_files('data/txt_sentoken/neg')]

print(f'read {len(positive):,} positive examples')
print(f'read {len(negative):,} negative examples')

read 1,000 positive examples
read 1,000 negative examples
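
the escaping step in `read_files` can also be isolated into a small helper, which makes it easy to check on its own. this is only a sketch: `escape_vw` and `to_vw_line` are illustrative names, not part of the notebook or of vowpal wabbit, and unlike the cell above it also collapses repeated whitespace inside a review.

```python
def escape_vw(text):
    """Replace the characters that the vw text format treats as syntax."""
    # ':' separates a feature from its value, '|' starts a namespace
    return text.replace(':', 'COLON').replace('|', 'PIPE')


def to_vw_line(label, text):
    """Build one labeled example such as '+1 | some tokens'."""
    # collapse newlines and repeated whitespace so the example stays on one line
    tokens = ' '.join(text.split())
    # the '+d' format writes an explicit sign: +1 or -1
    return f'{label:+d} | {escape_vw(tokens)}'


print(to_vw_line(1, 'synopsis : a good | bad movie'))
print(to_vw_line(-1, 'two\nlines'))
```

keeping the label formatting and the escaping in one place means a change to the replacement tokens only has to be made once.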


let's take a peek at the data

In [3]:
print(positive[0][:100], '...')

+1 | of circumcision , psychic wounds and the family sitcom the opening segment is something of a fo ...


In [4]:
print(negative[0][:100], '...')

-1 | janeane garofalo in a romantic comedy -- it was a good idea a couple years ago with the truth a ...


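shuffle the data with a fixed seed so that the run is reproducible. the printed string shows the label sign of the first 50 examples after shuffling.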

In [5]:
import random
random.seed(1234)
data = positive + negative
random.shuffle(data)
print(''.join(d[0] for d in data[:50]))

+---++-+-+++-+-+++--+++-+++++++-++--+-+----++++-++


prepare train and test datasets, keeping 80% of the data for training and 20% for testing.

In [6]:
def write_file(filename, lines):
    with open(filename, 'w') as f:
        for line in lines:
            print(line, file=f)

split = int(len(data) * 0.8)
train = data[:split]
test = data[split:]

write_file('data/train.vw', train)
write_file('data/test.vw', test)

!wc -l data/train.vw data/test.vw

   1600 data/train.vw
    400 data/test.vw
   2000 total
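
since the split is a plain slice of the shuffled list, the two classes are not guaranteed to be equally represented in each part. a quick way to check is to count labels per split. the sketch below uses a toy list in place of `data`, and `label_counts` is an illustrative helper, not something defined in the notebook.

```python
from collections import Counter


def label_counts(lines):
    """Count examples per label, taking the label as the text before the first '|'."""
    return Counter(line.split('|', 1)[0].strip() for line in lines)


# toy stand-in for the shuffled `data` list
toy = ['+1 | good', '-1 | bad', '+1 | great', '-1 | awful', '-1 | dull']
split = int(len(toy) * 0.8)

print('train', label_counts(toy[:split]))
print('test ', label_counts(toy[split:]))
```

if the imbalance turned out to matter, a stratified split would be one option, but that is outside what this notebook does.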


In [7]:
!vw --version

8.6.1


In [8]:
!vw --binary data/train.vw -f data/sentiment.model -c -k -b 28 --ngram 2

Generating 2-grams for all namespaces.
final_regressor = data/sentiment.model
Num weight bits = 28
learning rate = 0.5
initial_t = 0
power_t = 0.5
creating cache_file = data/train.vw.cache
Reading datafile = data/train.vw
num sources = 1
average  since         example        example  current  current  current
loss     last          counter         weight    label  predict features
1.000000 1.000000            1            1.0   1.0000  -1.0000     1936
1.000000 1.000000            2            2.0  -1.0000   1.0000     2470
0.500000 0.000000            4            4.0  -1.0000  -1.0000     1542
0.625000 0.750000            8            8.0   1.0000  -1.0000     2006
0.687500 0.750000           16           16.0   1.0000  -1.0000     1246
0.531250 0.375000           32           32.0  -1.0000   1.0000      342
0.484375 0.437500           64           64.0   1.0000  -1.0000     1452
0.476562 0.468750          128          128.0  -1.0000   1.0000     1664
0.398438 0.320312          256  

In [9]:
!vw -c -k -t -i data/sentiment.model data/test.vw -p data/test.pred.vw

Generating 2-grams for all namespaces.
only testing
predictions = data/test.pred.vw
Num weight bits = 28
learning rate = 0.5
initial_t = 0
power_t = 0.5
creating cache_file = data/test.vw.cache
Reading datafile = data/test.vw
num sources = 1
average  since         example        example  current  current  current
loss     last          counter         weight    label  predict features
0.000000 0.000000            1            1.0   1.0000   1.0000      710
0.000000 0.000000            2            2.0   1.0000   1.0000     1794
0.000000 0.000000            4            4.0   1.0000   1.0000     2050
0.000000 0.000000            8            8.0  -1.0000  -1.0000     1588
0.062500 0.125000           16           16.0  -1.0000   1.0000     1214
0.031250 0.000000           32           32.0   1.0000   1.0000     1004
0.078125 0.125000           64           64.0   1.0000   1.0000      578
0.109375 0.140625          128          128.0  -1.0000   1.0000     1928
0.148438 0.187500          2

In [10]:
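from sklearn.metrics import classification_report
import numpy as np
# predictions written by vw, one per line (+1 or -1)
y_pred = np.loadtxt('data/test.pred.vw')
# the first two characters of every test line are the true label
y_true = !cut -c 1-2 data/test.vw
y_true = np.array([int(y) for y in y_true])
print(classification_report(y_true, y_pred))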

              precision    recall  f1-score   support

          -1       0.83      0.89      0.86       193
           1       0.89      0.83      0.86       207

   micro avg       0.86      0.86      0.86       400
   macro avg       0.86      0.86      0.86       400
weighted avg       0.86      0.86      0.86       400
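
the numbers in a report like this can be cross-checked by hand from the (true, predicted) pairs. a minimal sketch on toy labels (not the notebook's predictions); `confusion_counts` and `recall` are illustrative helpers written for this example.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Count how often each (true, predicted) label pair occurs."""
    return Counter(zip(y_true, y_pred))


def recall(counts, label):
    """Fraction of examples whose true label is `label` that were predicted as `label`."""
    support = sum(n for (t, _), n in counts.items() if t == label)
    return counts[(label, label)] / support if support else 0.0


# toy labels, only for illustration
y_true = [1, 1, 1, -1, -1, -1, -1, 1]
y_pred = [1, -1, 1, -1, -1, 1, -1, 1]

counts = confusion_counts(y_true, y_pred)
# accuracy is the share of pairs where true and predicted labels agree
accuracy = sum(n for (t, p), n in counts.items() if t == p) / len(y_true)

print(counts)
print('accuracy', accuracy)
print('recall(+1)', recall(counts, 1), 'recall(-1)', recall(counts, -1))
```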



In [11]:
!vw --binary data/train.vw -f data/sentiment.model --active --simulation --mellowness 1e-7 --ngram 2

Generating 2-grams for all namespaces.
final_regressor = data/sentiment.model
Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
using no cache
Reading datafile = data/train.vw
num sources = 1
average  since         example        example  current  current  current
loss     last          counter         weight    label  predict features
1.000000 1.000000            1            1.0   1.0000  -1.0000     1936
1.000000 1.000000            2            2.0  -1.0000   1.0000     2470
0.500000 0.000000           57            4.0  -1.0000  -1.0000     4362
0.250000 0.000000          181            8.0  -1.0000  -1.0000     2710
0.125000 0.000000          386           16.0  -1.0000  -1.0000     2644
0.062500 0.000000          703           32.0  -1.0000  -1.0000     2294
0.025119 0.000000          888           79.6  -1.0000  -1.0000     1852
0.005587 0.000000         1258          358.0  -1.0000  -1.0000      584
0.344591 0.683202         1505          716.4  -1.0000  -1.

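prepare unlabeled data by cutting the label (the first three characters, e.g. `+1 `) from every line, so that each example starts at the `|`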

In [12]:
!cut -c 4- data/train.vw > data/train.unlabeled.vw
!cut -c 4- data/test.vw > data/test.unlabeled.vw
!wc -l data/train.unlabeled.vw data/test.unlabeled.vw

   1600 data/train.unlabeled.vw
    400 data/test.unlabeled.vw
   2000 total
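
`cut -c 4-` assumes every label is exactly two characters followed by a space. a slightly more defensive alternative, sketched here in python, is to keep everything from the first `|` onwards. `strip_label` is an illustrative name, not something the notebook defines.

```python
def strip_label(line):
    """Drop everything before the first '|' so only the features remain."""
    _, sep, features = line.partition('|')
    if not sep:
        # without a '|' there is no feature section to keep
        raise ValueError(f'no "|" found in example: {line!r}')
    return sep + features


print(strip_label('+1 | a fine film'))
print(strip_label('-1 | a dull film'))
```

for the files written in this notebook both approaches should give the same lines, since every example starts with `+1 |` or `-1 |`.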


In [13]:
import subprocess
import shlex
command = shlex.split('vw -i data/sentiment.model -p data/train.pred.al.vw --port 31337 --mellowness 1e-100')
proc = subprocess.Popen(command, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
!python2 active_interactor.py localhost 31337 data/train.unlabeled.vw

connecting to localhost:31337 ...
done
sending unlabeled examples ...
sending unlabeled '| the promotion for '
sending unlabeled '| way of the gun is '
sending unlabeled '| " ladybugs " is a '
sending unlabeled '| the classic story '
sending unlabeled '| the postman delive'
sending unlabeled '| recently i read 4 '
sending unlabeled '| mighty joe young b'
sending unlabeled '| the computer-anima'
sending unlabeled "| if you haven't plu"
sending unlabeled '| armageddon , in it'
sending unlabeled '| after hearing revi'
sending unlabeled '| with stars like si'
sending unlabeled '| movies based on vi'
sending unlabeled "| if you're the type"
sending unlabeled '| 200 cigarettes tak'
sending unlabeled '| ultra low budget b'
sending unlabeled '| steven spielberg i'
sending unlabeled '| in wonder boys mic'
sending unlabeled '| violence is bad . '
sending unlabeled '| it happens every y'
sending unlabeled "| synopsis COLON it'"
sending unlabeled '| allen , star of ma'
sen

sending unlabeled '| on april 12th , 19'
sending unlabeled '| we share the desce'
sending unlabeled '| martial arts maste'
sending unlabeled '| starship troopers '
sending unlabeled '| in the interest of'
sending unlabeled '| have you ever been'
sending unlabeled '| let me begin by sa'
sending unlabeled '| i like movies with'
sending unlabeled '| director luis mand'
sending unlabeled '| with the sudden li'
sending unlabeled '| originally titled '
sending unlabeled '| the dramatic comed'
sending unlabeled '| synopsis COLON in '
sending unlabeled '| can a horror movie'
sending unlabeled '| so here is the sec'
sending unlabeled '| seen december 2 , '
sending unlabeled "| don't let the foll"
sending unlabeled '| first troy beyer w'
sending unlabeled '| barb wire , pamela'
sending unlabeled '| in " the 13th warr'
sending unlabeled '| note COLON some ma'
sending unlabeled '| close your eyes fo'
sending unlabeled '| i have never seen '
sending unlabeled '| " if there\'s

sending unlabeled '| marie ( charlotte '
sending unlabeled '| driving miss daisy'
sending unlabeled '| as a hot-shot defe'
sending unlabeled '| " payback , " bria'
sending unlabeled '| written by david j'
sending unlabeled '| because no one dem'
sending unlabeled '| david schwimmer ( '
sending unlabeled '| after the terminal'
sending unlabeled '| apparantly money t'
sending unlabeled '| synopsis COLON whe'
sending unlabeled '| plot COLON two sis'
sending unlabeled '| coinciding with th'
sending unlabeled '| when critics attac'
sending unlabeled "| the summer of 00' "
sending unlabeled '| eric rohmer\'s " pa'
sending unlabeled '| imagine this . you'
sending unlabeled '| george little ( jo'
sending unlabeled '| albert brooks save'
sending unlabeled '| synopsis COLON big'
sending unlabeled '| overblown remake o'
sending unlabeled '| a documentary from'
sending unlabeled '| battlefield earth '
sending unlabeled '| review COLON ghost'
sending unlabeled '| you know the

sending unlabeled '| eyes wide shut isn'
sending unlabeled '| " alcohol and drug'
sending unlabeled "| everyone's heard a"
sending unlabeled '| i have little agai'
sending unlabeled '| the kids in the ha'
sending unlabeled '| expectation rating'
sending unlabeled '| " virus " is a mon'
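
the `vw` process started with `subprocess.Popen` in the cell above is never waited on, so its log and exit status are not inspected. a minimal sketch of collecting them once the client has finished. a short python child process stands in for `vw` here so the snippet runs without it, and `finish` is an illustrative helper.

```python
import subprocess
import sys


def finish(proc, timeout=30):
    """Wait for a child process and return (returncode, stdout, stderr)."""
    try:
        out, err = proc.communicate(timeout=timeout)
    except subprocess.TimeoutExpired:
        # the child did not exit on its own, so stop it and collect what it wrote
        proc.kill()
        out, err = proc.communicate()
    return proc.returncode, out, err


# stand-in for the vw command: writes one line to stderr and exits
command = [sys.executable, '-c', 'import sys; print("child finished", file=sys.stderr)']
proc = subprocess.Popen(command, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True)

# ... the client that talks to the server would run here ...

code, out, err = finish(proc)
print(code, repr(err))
```

reading both pipes through `communicate` also avoids the child blocking on a full pipe buffer if it writes a lot of output.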


In [14]:
command = shlex.split('vw -i data/sentiment.model -p data/test.pred.al.vw --port 31337 --mellowness 1e-100')
proc = subprocess.Popen(command, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
!python2 active_interactor.py localhost 31337 data/test.unlabeled.vw

connecting to localhost:31337 ...
done
sending unlabeled examples ...
sending unlabeled '| synopsis COLON leo'
sending unlabeled "| lisa cholodenko's "
sending unlabeled '| synopsis COLON val'
sending unlabeled '| mike myers , you c'
sending unlabeled '| did i do something'
sending unlabeled "| the happy bastard'"
sending unlabeled "| it's probably inev"
sending unlabeled '| call it a road tri'
sending unlabeled "| it's been a good l"
sending unlabeled '| synopsis COLON ori'
sending unlabeled '| the swooping shots'
sending unlabeled '| synopsis COLON an '
sending unlabeled '| in present day han'
sending unlabeled '| deuce bigalow ( ro'
sending unlabeled '| barely scrapping b'
sending unlabeled '| august and septemb'
sending unlabeled '| synopsis COLON pri'
sending unlabeled '| some of my friends'
sending unlabeled '| this is the movie '
sending unlabeled '| uncompromising fre'
sending unlabeled '| ever wonder what h'
sending unlabeled '| unfortunately it d'
sen

sending unlabeled '| scream 2 has a tit'
sending unlabeled '| clue is an unfairl'
sending unlabeled '| one of my colleagu'
sending unlabeled "| it's terribly unfo"
sending unlabeled "| it's a shame the e"
sending unlabeled '| though it is a fin'
sending unlabeled "| okay , i just don'"
sending unlabeled '| to paraphrase a so'
sending unlabeled "| martin scorsese's "
sending unlabeled '| when i first heard'
sending unlabeled '| the " italian hitc'
sending unlabeled '| around the end of '
sending unlabeled '| as you should know'
sending unlabeled '| the coen brothers '
sending unlabeled "| there isn't much g"
sending unlabeled '| director andrew da'
sending unlabeled "| the happy bastard'"
sending unlabeled '| i was anxious to s'
sending unlabeled '| director jan de bo'
sending unlabeled '| to put it bluntly '
sending unlabeled "| the general's daug"
sending unlabeled '| taking a few tips '
sending unlabeled '| today , war became'
sending unlabeled '| jay and silen

In [15]:
y_pred = np.loadtxt('data/test.pred.al.vw')
print(classification_report(y_true, y_pred))

              precision    recall  f1-score   support

          -1       0.51      0.98      0.67       193
           1       0.88      0.10      0.18       207

   micro avg       0.53      0.53      0.53       400
   macro avg       0.69      0.54      0.42       400
weighted avg       0.70      0.53      0.42       400



In [16]:
y_true = !cut -c 1-2 data/train.vw
y_true = np.array([int(y) for y in y_true])
y_pred = np.loadtxt('data/train.pred.al.vw')
print(classification_report(y_true, y_pred))

              precision    recall  f1-score   support

          -1       0.54      0.99      0.70       807
           1       0.95      0.13      0.23       793

   micro avg       0.57      0.57      0.57      1600
   macro avg       0.75      0.56      0.47      1600
weighted avg       0.74      0.57      0.47      1600

