# Test Linear Regression

This notebook tests how well a linear regression from image features to word-vector-based label targets generalizes to a different vocabulary. Three corpora are involved.

## Required Data

1. The _word vector corpus_
  * Examples: New York Times, Wikipedia Text8
  * Data: Pretrained word vectors (word2vec, etc.)
2. The _training corpus_
  * Examples: IAPR-TC12, MSCOCO, Visual Genome
  * Data:
    - Training image features and text labels
    - Testing image features and text labels (used as validation data)
3. A _testing corpus_ with a different vocabulary
  * Examples: MSCOCO, Visual Genome, etc.
  * Data: Training and testing image features


### Imports

In [3]:
import numpy as np
import matplotlib.pylab as plt
import sys

# TODO: set PYTHONPATH instead of appending to sys.path here
sys.path.append('/data/fs4/home/kni/attalos/')

# Word vector reader
import attalos.imgtxt_algorithms.util.readw2v as rw2v

# Linear regression
import attalos.imgtxt_algorithms.linearregression.LinearRegression as linreg

# Evaluation code (for now, the Octave soft evaluation)
# from attalos.evaluation.evaluation import Eval
from oct2py import octave
octave.addpath('../evaluation/')

# Pick up local edits to the regression module without restarting the kernel
reload(linreg)
%matplotlib inline

### Load the word vectors in

In [4]:
F = rw2v.ReadW2V('/data/fs4/teams/attalos/wordvecs/text9Bvin.bin')
vectors = F.readlines()

### Load the training corpus

In [5]:
data = np.load('linearregression/data/iaprtc_alexfc7.npz')

def read_lines(path):
    # Return the lines of a text file without trailing newlines
    with open(path) as f:
        return f.read().splitlines()

D = read_lines('linearregression/data/iaprtc_dictionary.txt')  # label vocabulary
train_ims_full = read_lines('linearregression/data/iaprtc_trainlist.txt')
test_ims_full = read_lines('linearregression/data/iaprtc_testlist.txt')
train_ims = [im.split('/')[-1] for im in train_ims_full]  # file names only

# Transpose so that each row is one image
xTr = data['xTr'].T
yTr = data['yTr'].T
xTe = data['xTe'].T
yTe = data['yTe'].T

### Create probability distribution for words

In [121]:
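wordvecs = np.zeros((len(D), 200))  # 200 = dimensionality of the loaded word vectors
for i, word in enumerate(D):
    if word in vectors:
        wordvecs[i] = vectors[word]
    else:
        # No pretrained vector for this word; its row stays all zeros
        print("{}: {}".format(word, i))

# Squash pairwise dot products through a sigmoid to get word-to-word similarities
distvecs = 1 / (1 + np.exp(-0.1 * wordvecs.dot(wordvecs.T)))
# distvecs = np.tanh( 0.1*wordvecs.dot(wordvecs.T) )
# Note: norm(..., axis=1) has shape (n,), so this divides column j by the norm of
# row j. distvecs is symmetric, so columns (not rows) end up with unit L2 norm;
# pass keepdims=True to normalize rows instead.
distvecs = distvecs / np.linalg.norm(distvecs, axis=1)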

bedcover: 15
table-cloth: 259
tee-shirt: 261

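The normalization step above depends on how NumPy broadcasts the norm vector. The following is a minimal sketch on a small random matrix (not the notebook's data) that shows which axis ends up with unit norm with and without `keepdims=True`. The names `vecs`, `sim`, `col_normed` and `row_normed` are illustrative only.

```python
import numpy as np

rng = np.random.RandomState(0)
vecs = rng.randn(4, 3)                                # 4 toy "word vectors" of dimension 3
sim = 1.0 / (1.0 + np.exp(-0.1 * vecs.dot(vecs.T)))   # symmetric 4x4 similarity matrix

# norm(..., axis=1) has shape (4,), so it broadcasts across columns:
# entry [i, j] is divided by the norm of row j.
col_normed = sim / np.linalg.norm(sim, axis=1)

# keepdims=True gives shape (4, 1), so each row is divided by its own norm.
row_normed = sim / np.linalg.norm(sim, axis=1, keepdims=True)

print(np.linalg.norm(col_normed, axis=0))   # norms of the columns
print(np.linalg.norm(row_normed, axis=1))   # norms of the rows
```

Because `sim` is symmetric, the first variant yields unit-norm columns. Later cells index `distvecs` by row, so the choice of axis affects the targets.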

### Convert multi-hot vectors to distribution vectors

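1. Word vectors -> per-word distribution vectors (`distvecs`, computed above)
2. Multi-hot label vector -> sum of the distribution vectors of its active labels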

In [122]:
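def multi2dist(multihot, distvecs, skip=(15, 259, 261)):
    """Sum the distribution vectors of all active labels in a multi-hot vector.

    Indices in skip are ignored; the defaults are the dictionary words
    without a pretrained vector (bedcover, table-cloth, tee-shirt).
    """
    indices = np.where(multihot)[0]
    distvec = np.zeros(len(multihot))
    for i in indices:
        if i not in skip:
            distvec += distvecs[i]
    return distvec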

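To make the conversion concrete, here is a small self-contained sketch with a made-up three-word vocabulary. The matrix `toy_distvecs` and the `skip` argument are assumptions for illustration; the values are not taken from the notebook's data.

```python
import numpy as np

def multi2dist(multihot, distvecs, skip=()):
    """Sum the distribution vectors of every active label not listed in skip."""
    distvec = np.zeros(len(multihot))
    for i in np.where(multihot)[0]:
        if i not in skip:
            distvec += distvecs[i]
    return distvec

# Toy 3-word vocabulary: one made-up distribution vector per row
toy_distvecs = np.array([[0.8, 0.1, 0.1],
                         [0.1, 0.8, 0.1],
                         [0.1, 0.1, 0.8]])
multihot = np.array([1, 0, 1])   # image tagged with words 0 and 2

target = multi2dist(multihot, toy_distvecs)
print(target)
```

The result is the element-wise sum of rows 0 and 2, so words related to either tag receive weight in the regression target.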
### Convert all images in the training set to distribution vectors

In [123]:
badims = []
yTarget = np.zeros((len(yTr), len(D)))
for i in range(len(yTr)):
    distvec = multi2dist(yTr[i], distvecs)
    if not distvec.sum():
        print("Error, there are no tags associated with image " + str(i))
        badims.append(i)
        continue
    yTarget[i] = distvec

# Drop images without usable tags from both targets and features.
# np.delete removes all listed rows in one call, so indices do not shift,
# and xFeats is defined even when badims is empty.
yTarget = np.delete(yTarget, badims, axis=0)
xFeats = np.delete(xTr, badims, axis=0)

Error, there are no tags associated with image 2868

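Removing rows one at a time is error-prone once more than one image has no tags, because each deletion shifts the indices of the rows after it. The sketch below uses toy data (`feats` and `bad` are illustrative names) to contrast that with a single `np.delete` call.

```python
import numpy as np

feats = np.arange(10).reshape(5, 2)   # 5 toy "images", 2 features each
bad = [1, 3]                          # rows to drop

# Deleting one row at a time shifts later indices: after removing row 1,
# the original row 3 sits at index 2, so deleting index 3 removes row 4.
shifted = feats.copy()
for i in bad:
    shifted = np.delete(shifted, i, axis=0)

# Deleting all rows in one call uses the original indices.
clean = np.delete(feats, bad, axis=0)

print(shifted[:, 0])
print(clean[:, 0])
```

With a single bad image, as in the run above, both approaches give the same result; they diverge only when several rows are removed.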

### Load the testing corpus


-------------------------------

### Train and validate

In [124]:
mp_solution = linreg.LinearRegression(normX = True, normY = True)
mp_solution.train(xFeats, yTarget)
yHat = mp_solution.predict(xTe)

Building W matrix = Y \ X = Y^T X (X X^T)^-1

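For readers without access to the `attalos` package, the following is a minimal sketch of a closed-form least-squares fit in plain NumPy. It is not the `LinearRegression` class used above: the `normX` and `normY` options are not modeled, the data are synthetic, and rows are assumed to be samples.

```python
import numpy as np

rng = np.random.RandomState(0)
n, d, k = 200, 5, 3                   # samples, feature dim, target dim
X = rng.randn(n, d)                   # rows are samples
W_true = rng.randn(d, k)
Y = X.dot(W_true) + 0.01 * rng.randn(n, k)

# Least-squares solution of X W = Y; lstsq avoids forming an explicit inverse
W, residuals, rank, sing_vals = np.linalg.lstsq(X, Y, rcond=None)

Y_hat = X.dot(W)                      # predictions for the training inputs
print(np.abs(W - W_true).max())       # largest error in the recovered weights
```

Using `np.linalg.lstsq` rather than an explicit matrix inverse is a common choice for numerical stability when the feature matrix is poorly conditioned.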

### Test

### Evaluate the regression

In [125]:
[precision, recall, f1score] = octave.evaluate(yTe.T, yHat.T, 5)
print(precision)
print(recall)
print(f1score)

0.213402617401
0.203053943138
0.208099701362

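The Octave `evaluate` routine is not shown in this notebook. As a rough point of reference, the sketch below computes micro-averaged precision, recall and F1 over the top-k predicted labels in NumPy. The function `precision_recall_f1_at_k` is hypothetical, and its definition may differ from the Octave soft evaluation used above.

```python
import numpy as np

def precision_recall_f1_at_k(y_true, y_score, k):
    """Micro-averaged precision, recall and F1 over the top-k scored labels.

    y_true:  (n_images, n_labels) binary ground truth
    y_score: (n_images, n_labels) predicted scores
    """
    topk = np.argsort(-y_score, axis=1)[:, :k]           # k best labels per image
    y_pred = np.zeros_like(y_true, dtype=bool)
    y_pred[np.arange(len(y_true))[:, None], topk] = True

    hits = np.logical_and(y_pred, y_true > 0).sum()      # correctly predicted labels
    precision = hits / float(y_pred.sum())
    recall = hits / float((y_true > 0).sum())
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1

# Toy data: 2 images, 4 labels
y_true = np.array([[1, 0, 1, 0],
                   [0, 1, 0, 0]])
y_score = np.array([[0.9, 0.1, 0.2, 0.8],
                    [0.3, 0.7, 0.6, 0.1]])
p, r, f = precision_recall_f1_at_k(y_true, y_score, k=2)
print(p)
print(r)
print(f)
```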

### Visualize

In [126]:
# Randomly select a test image (rows of yTe are images, columns are labels)
i = np.random.randint(0, yTe.shape[0])

# Run example
imname = 'linearregression/images/' + test_ims_full[i] + '.jpg'
print("Looking at the " + str(i) + "th image: " + imname)
im = plt.imread(imname)

# Prediction: labels scoring above 0.2, best first
ypwords = [D[j] for j in yHat[i].argsort()[::-1][:(yHat[i] > 0.2).sum()]]
# Truth
ytwords = [D[j] for j in np.where(yTe[i] > 0.5)[0]]

plt.imshow(im)
print('Predicted: ' + ', '.join(ypwords))
print('Truth:     ' + ', '.join(ytwords))

plt.figure()
plt.stem(yHat[i])

Looking at the 98th image: linearregression/images/02/2474.jpg


IOError: [Errno 2] No such file or directory: 'linearregression/images/02/2474.jpg'
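
The word-selection step does not depend on the image file, so it can be checked on its own. This sketch uses a made-up vocabulary and score vector (`vocab`, `scores` and `threshold` are illustrative) to show how the thresholded, best-first label list is built.

```python
import numpy as np

vocab = ['sky', 'tree', 'car', 'dog']          # toy vocabulary
scores = np.array([0.35, 0.05, 0.60, 0.21])    # toy predicted scores for one image
threshold = 0.2

n_keep = int((scores > threshold).sum())       # how many scores pass the threshold
order = scores.argsort()[::-1]                 # label indices, best first
words = [vocab[j] for j in order[:n_keep]]
print(words)
```

Sorting first and then keeping the leading `n_keep` entries returns exactly the labels above the threshold, ordered from highest to lowest score.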