# Test Linear Regression

This notebook tests how well a linear regression from image features to word-vector space generalizes. Three corpora are involved.

## Required Data

1. The _word vector corpus_
  * Examples: New York Times, Wikipedia Text8
  * Data: Pretrained word vectors (word2vec, etc.)
2. The _training corpus_
  * Examples: IAPR-TC12, MSCOCO, Visual Genome
  * Data:
    - Training image features and text labels
    - Testing image features and text labels <-- Used as validation data
3. A _testing corpus_ with a different vocabulary
  * Examples: MSCOCO, Visual Genome, etc.
  * Data: Training and testing image features


### Imports

In [9]:
import numpy as np
import matplotlib.pylab as plt
import sys

## TODO: set PYTHONPATH instead of patching sys.path here
sys.path.append('/data/fs4/home/kni/attalos/')
# ls /data/fs4/home/kni/attalos/

## Import word vector load in
import attalos.imgtxt_algorithms.util.readw2v as rw2v

## Import linear regression
import attalos.imgtxt_algorithms.linearregression.LinearRegression as linreg

## Import evaluation code (right now, using Octave soft evaluation)
# from attalos.evaluation.evaluation import Eval
from oct2py import octave
octave.addpath('../evaluation/')

reload(linreg)
%matplotlib inline

### Load the word vectors in

In [18]:
import pickle
import numpy as np
from scipy.special import expit

def save_centroids(centroids, target_path):
    with open(target_path, "wb") as f:
        pickle.dump(centroids, f)

def load_centroids(target_path):
    with open(target_path, "rb") as f:
        centroids = pickle.load(f)
    return centroids

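def compute_centroid_projection(basis, v):
    # Map word vector v to one score per centroid: sigmoid(10 * dot(v, centroid)).
    # basis is a dict keyed by cluster index; a missing index scores 0.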
    projection = []
    for dim in xrange(0, len(basis.keys())):
        if dim not in basis:
            projection.append(0)
            continue
        similarity = np.dot(v, basis[dim])
        similarity = expit(10*similarity)
        projection.append(similarity)
    return np.asarray(projection)
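
To make the projection concrete, here is a minimal sketch on toy data. The names `toy_basis`, `project_onto_centroids`, and `gain` are hypothetical and not part of attalos; the logistic function is written out with NumPy, which should match `scipy.special.expit` for these inputs. A vector aligned with a centroid scores close to 1, an orthogonal one scores exactly 0.5, and an opposed one scores close to 0.

```python
import numpy as np

# Hypothetical toy basis: three 2-D centroids keyed by cluster index,
# mirroring the dict layout that compute_centroid_projection expects.
toy_basis = {
    0: np.array([1.0, 0.0]),
    1: np.array([0.0, 1.0]),
    2: np.array([-1.0, 0.0]),
}

def project_onto_centroids(basis, v, gain=10.0):
    """Return one logistic similarity score per centroid index."""
    n = len(basis)
    scores = np.zeros(n)
    for k in range(n):
        if k in basis:  # missing indices keep a score of 0
            scores[k] = 1.0 / (1.0 + np.exp(-gain * np.dot(v, basis[k])))
    return scores

toy_projection = project_onto_centroids(toy_basis, np.array([1.0, 0.0]))
```

The gain of 10 sharpens the sigmoid, so moderately positive dot products already saturate toward 1.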

In [14]:
centroids = load_centroids('/data/fs4/teams/attalos/wordvecs/centroids_kmeans500skipg.pkl')
F = rw2v.ReadW2V('/data/fs4/teams/attalos/wordvecs/text9Bvin.bin')
vectors = F.readlines()

In [25]:
hellocentroid = compute_centroid_projection( centroids, vectors['television'] )

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0], dtype=uint8)

### Load training corpora in

In [7]:
data = np.load('linearregression/data/iaprtc_alexfc7.npz')
D = open('linearregression/data/iaprtc_dictionary.txt').read().splitlines()
train_ims = [ im.split('/')[-1] for im in open('linearregression/data/iaprtc_trainlist.txt').read().splitlines() ]
xTr = data['xTr'].T
yTr = data['yTr'].T
xTe = data['xTe'].T
yTe = data['yTe'].T

test_ims_full = [ im for im in open('linearregression/data/iaprtc_testlist.txt').read().splitlines() ]
train_ims_full = [ im for im in open('linearregression/data/iaprtc_trainlist.txt').read().splitlines() ]

In [67]:
newvecs = np.zeros((len(yTr),500))
excluded = ['tee-shirt', 'bedcover', 'table-cloth']
for m in xrange( 0, len(yTr) ):
    # Tags of image m, minus the excluded ones
    tags = [ D[ i ] for i in np.where( yTr[m] > 0.5 )[0] ]
    tagvecs = [ vectors[ tag ] for tag in tags if tag not in excluded ]
    if tagvecs:
        # Regression target = mean centroid projection over the image's tags
        projections = np.array( [compute_centroid_projection(centroids, vec) for vec in tagvecs] )
        newvecs[m] = projections.mean( axis=0 )
    else:
        print "tag list for image "+str(m)+" is empty!"

tag list for image 2868 is empty!
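
The loop above averages the projected tag vectors of each image into a single regression target and leaves the row at zero when no tags survive. The same idea can be sketched as one matrix product on toy data. `toy_labels`, `toy_embed`, and `mean_tag_targets` are hypothetical names, and the sketch assumes every label already has an embedding row.

```python
import numpy as np

# Hypothetical toy inputs: 3 images x 4 labels, and a 2-D embedding per label.
toy_labels = np.array([[1, 0, 1, 0],
                       [0, 0, 0, 0],   # image with no tags
                       [0, 1, 0, 0]])
toy_embed = np.array([[1.0, 0.0],
                      [0.0, 1.0],
                      [0.0, 1.0],
                      [0.5, 0.5]])

def mean_tag_targets(label_matrix, label_embeddings):
    """Average the embeddings of each image's active labels."""
    mask = (np.asarray(label_matrix) > 0.5).astype(float)
    counts = mask.sum(axis=1, keepdims=True)
    sums = mask.dot(label_embeddings)
    # Rows without tags stay all-zero instead of dividing by zero.
    return np.divide(sums, counts, out=np.zeros_like(sums), where=counts > 0)

toy_targets = mean_tag_targets(toy_labels, toy_embed)
```

An all-zero target row still enters the regression as a training example, so it may be worth dropping such images before fitting.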


In [68]:
labelvecs = np.zeros( (len(D), 500) )
for i in range(len(D)):
    if D[i] not in ['tee-shirt', 'bedcover', 'table-cloth']:
        vec2save = compute_centroid_projection(centroids, vectors[D[i]])
        labelvecs[i] = vec2save / np.linalg.norm(vec2save)

### Load testing corpora in


In [77]:
# Sanity check: neither target matrix should contain NaNs
print np.isnan(labelvecs).any()
print np.isnan(newvecs).any()

False
False


-------------------------------

### Train and validate

In [78]:
mp_solution = linreg.LinearRegression(normX = True)
mp_solution.train(xTr, newvecs)
yHat = mp_solution.predict(xTe)

Building W matrix = Y \ X = Y^T X (X X^T)^-1
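
The log line above suggests a closed-form least-squares solve. The following is a minimal sketch of that idea, not the attalos implementation: it assumes ordinary least squares with samples stored as rows, in which case the solution is usually written W = (X^T X)^-1 X^T Y. The function `fit_ridge` and the optional `lam` regularizer are hypothetical additions for illustration.

```python
import numpy as np

def fit_ridge(X, Y, lam=0.0):
    """Solve (X^T X + lam * I) W = X^T Y for W, with samples as rows of X."""
    d = X.shape[1]
    A = X.T.dot(X) + lam * np.eye(d)
    # np.linalg.solve avoids forming the explicit inverse.
    return np.linalg.solve(A, X.T.dot(Y))

# Toy check: noiseless targets generated from a known weight matrix.
rng = np.random.RandomState(0)
X_toy = rng.randn(50, 4)
W_true = rng.randn(4, 3)
Y_toy = X_toy.dot(W_true)

W_hat = fit_ridge(X_toy, Y_toy)
Y_pred = X_toy.dot(W_hat)
```

With `lam=0.0` and well-conditioned features, the sketch recovers the generating weights; a small positive `lam` can help when `X^T X` is close to singular.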


### Test

In [87]:
FinalVal = yHat.dot(labelvecs.T)
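
The product above scores every label for every test image by taking the dot product of the predicted vector with each unit-normalized label vector. A minimal sketch on toy data, with hypothetical names throughout, shows the shapes involved and how a per-image ranking follows from the scores.

```python
import numpy as np

# Hypothetical toy data: 2 images, 3 labels, 2-D embedding space.
toy_pred = np.array([[0.9, 0.1],
                     [0.2, 0.8]])
toy_label_vecs = np.array([[1.0, 0.0],
                           [0.0, 1.0],
                           [1.0, 1.0]])
# Unit-normalize each label vector, as done for labelvecs.
toy_label_vecs = toy_label_vecs / np.linalg.norm(toy_label_vecs, axis=1, keepdims=True)

# One score per (image, label) pair: shape (n_images, n_labels).
toy_scores = toy_pred.dot(toy_label_vecs.T)

# Label indices for each image, best score first.
toy_ranking = np.argsort(-toy_scores, axis=1)
```

Because only the label vectors are normalized, the scores are comparable across labels for a single image but not necessarily across images.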

### Evaluate the regression

In [88]:
[precision, recall, f1score] = octave.evaluate(yTe.T, FinalVal.T, 5)
print precision
print recall
print f1score

0.0940166343386
0.0513964161015
0.0664605830674
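
For readers without Octave, a rough NumPy stand-in for the metric can be sketched as follows. This is an assumption, not a port: the Octave `evaluate` function is not shown here and is described as a soft evaluation, whereas this hypothetical `precision_recall_f1_at_k` computes hard, micro-averaged precision, recall, and F1 over the top-k labels per image, so its numbers may differ. It also expects images as rows, unlike the transposed arrays passed to Octave.

```python
import numpy as np

def precision_recall_f1_at_k(truth, scores, k):
    """Micro-averaged precision, recall and F1 over the top-k labels per image."""
    truth = np.asarray(truth) > 0.5
    topk = np.argsort(-np.asarray(scores), axis=1)[:, :k]
    pred = np.zeros_like(truth, dtype=bool)
    rows = np.arange(truth.shape[0])[:, None]
    pred[rows, topk] = True                      # mark the k best labels per row
    tp = float(np.logical_and(pred, truth).sum())
    precision = tp / pred.sum()
    recall = tp / truth.sum()
    f1 = 0.0 if tp == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy example: 2 images, 4 labels, top-2 predictions.
toy_truth = np.array([[1, 0, 1, 0],
                      [0, 1, 0, 0]])
toy_scores_at_k = np.array([[0.9, 0.8, 0.1, 0.0],
                            [0.1, 0.7, 0.6, 0.2]])
p, r, f = precision_recall_f1_at_k(toy_truth, toy_scores_at_k, 2)
```

The sketch does not guard against an empty ground truth, which would make the recall division undefined.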


### Visualize

In [None]:
# Randomly select a test image (rows of yTe are images after the transpose)
i=np.random.randint(0, yTe.shape[0])

# Run example
imname='linearregression/images/'+test_ims_full[i]+'.jpg'
print "Looking at the "+str(i)+"th image: "+imname
im=plt.imread(imname)

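# Prediction: rank labels by FinalVal (label scores), not yHat (centroid space);
# the top 5 are shown to match the cutoff passed to the evaluation
ypwords = [D[j] for j in FinalVal[i].argsort()[::-1][:5] ]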
# Truth
ytwords = [D[j] for j in np.where(yTe[i] > 0.5)[0] ]

plt.imshow(im)
print 'Predicted: '+ ', '.join(ypwords)
print 'Truth:     '+ ', '.join(ytwords)

plt.figure()
plt.stem( yHat[i] )