# PART 2: Compute a feature vector for each business (less memory-intensive, but slower version)

Summary:<br>

     This Kaggle competition is a multiple-instance learning (MIL) problem:
     each training example (a business) has multiple instances (photos).
     We'll use the SimpleMI algorithm briefly mentioned in
     https://en.wikipedia.org/wiki/Multiple_instance_learning
     
     In part 1, we obtained a 4096-dim feature vector for each image.
     In part 2, for each business, we compute the mean feature vector over the images that belong to it.
     In this way, each business corresponds to a single feature vector, i.e., the mean of its image features.
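As a minimal sketch of the averaging step, the toy example below pools per-image features into one vector per business. The array sizes, the business IDs, and the helper name `mean_feature_per_business` are made up for illustration; the real features are 4096-dimensional.

```python
import numpy as np

def mean_feature_per_business(features, business_ids):
    """Average the rows of `features` that share a business ID.

    features:     (n_images, n_dims) array, one row per image
    business_ids: length-n_images sequence giving the business of each image
    Returns a dict mapping business ID -> (n_dims,) mean vector.
    """
    features = np.asarray(features, dtype=float)
    business_ids = np.asarray(business_ids)
    return {
        biz: features[business_ids == biz].mean(axis=0)
        for biz in np.unique(business_ids)
    }

# Toy data: 3-dim features instead of 4096-dim, two hypothetical businesses.
toy_features = np.array([[1.0, 0.0, 2.0],
                         [3.0, 0.0, 4.0],
                         [0.0, 6.0, 0.0]])
toy_business_ids = ["a", "a", "b"]

pooled = mean_feature_per_business(toy_features, toy_business_ids)
print(pooled["a"])  # mean of the first two rows
print(pooled["b"])  # the single row of business "b"
```
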

Note:<br>

     This is the same as the original Step2_BusinessFeatureFc7.ipynb, with a slight
     modification so that it runs without crashing on machines with limited memory.
     Instead of copying f['feature'] into a numpy array, it reads feature values
     from the H5 file as needed. As a result, this code runs slower than the
     original, but it avoids memory-related errors on machines that lack enough
     memory to run the original script.
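The difference between the two approaches can be sketched on a small stand-in file. The file name, dataset size, and row indices below are invented for illustration; only the dataset name `feature` follows the notebook. The memory-hungry variant is written here with `[:]`, which reads the whole dataset, and may differ from the exact call used in the original notebook.

```python
import os
import shutil
import tempfile

import h5py
import numpy as np

# Build a small stand-in for the feature file (hypothetical name and size).
tmp_dir = tempfile.mkdtemp()
h5_path = os.path.join(tmp_dir, "toy_features.h5")
toy = np.arange(40, dtype=np.float32).reshape(10, 4)
with h5py.File(h5_path, "w") as out:
    out.create_dataset("feature", data=toy)

with h5py.File(h5_path, "r") as f:
    # Memory-hungry variant: reads the entire dataset into RAM at once.
    all_in_memory = f["feature"][:]

    # Memory-light variant: keep a dataset handle and read rows on demand.
    dataset = f["feature"]      # no feature values are read yet
    rows = [1, 4, 7]            # h5py expects index lists in increasing order
    subset = dataset[rows]      # reads only these rows from disk
    mean_vector = subset.mean(axis=0)

print(mean_vector)

shutil.rmtree(tmp_dir)  # remove the temporary stand-in file
```

With the real file, only the rows of one business are held in memory at a time, at the cost of one disk read per business.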

## Process businesses in the training set

In [6]:
# data_root = '/Users/ncchen/Kaggle-Yelp/input/'  # alternative location, overridden by the next line
data_root = '/mnt/D264B07564B05E43/Kaggle/YelpPhoto/code/input/'
import numpy as np
import pandas as pd 
import h5py
import time

train_photo_to_biz = pd.read_csv(data_root+'train_photo_to_biz_ids.csv')
train_labels = pd.read_csv(data_root+'train.csv').dropna()
# Convert the space-separated label string into a sorted tuple of ints
train_labels['labels'] = train_labels['labels'].apply(lambda x: tuple(sorted(int(t) for t in x.split())))
train_labels.set_index('business_id', inplace=True)
biz_ids = train_labels.index.unique()
print "Number of businesses: ", len(biz_ids), "(4 businesses with missing labels are dropped)"

## Load image features (a dataset handle; values are read from the H5 file on demand)
f = h5py.File(data_root+'train_image_fc7features.h5','r')
train_image_features = f['feature']


t = time.time()
## For each business, compute a feature vector 
df = pd.DataFrame(columns=['business','label','feature vector'])
index = 0
for biz in biz_ids:  
    
    label = train_labels.loc[biz]['labels']
    # Row positions of this business's photos; this relies on the H5 rows
    # following the row order of train_photo_to_biz_ids.csv
    image_index = train_photo_to_biz[train_photo_to_biz['business_id']==biz].index.tolist()
    
    # Read only this business's rows from disk and average them
    features = train_image_features[image_index]
    mean_feature = list(np.mean(features,axis=0))

    df.loc[index] = [biz, label, mean_feature]
    index+=1
    if index%1000==0:
        print "Businesses processed: ", index, "Time passed: ", "{0:.1f}".format(time.time()-t), "sec"

# Use a separate name for the CSV handle so that f still refers to the H5 file
with open(data_root+"train_biz_fc7features.csv",'w') as csv_file:  
    df.to_csv(csv_file, index=False)

f.close()

Number of businesses:  1996 (4 businesses with missing labels are dropped)


In [2]:
# Check file content
train_business = pd.read_csv(data_root+'train_biz_fc7features.csv')
print train_business.shape
train_business[0:5]

(1996, 3)


| | business | label | feature vector |
|---|---|---|---|
| 0 | 1000 | (1, 2, 3, 4, 5, 6, 7) | [0.19977248, 0.43287012, 0.22732441, 0.3551769... |
| 1 | 1001 | (0, 1, 6, 8) | [0.0, 0.58892941, 0.53906041, 0.17221935, 0.01... |
| 2 | 100 | (1, 2, 4, 5, 6, 7) | [0.11154944, 0.034822457, 0.1202566, 0.5201095... |
| 3 | 1006 | (1, 2, 4, 5, 6) | [0.078059368, 0.054452561, 0.056381688, 0.6942... |
| 4 | 1010 | (0, 6, 8) | [0.39656404, 0.279632, 0.0, 0.1720508, 0.36192... |


## Process businesses in the test set

In [5]:
data_root = '/Users/ncchen/Kaggle-Yelp/input/'

import numpy as np
import pandas as pd 
import h5py
import time

In [6]:
test_photo_to_biz = pd.read_csv(data_root+'test_photo_to_biz.csv')
biz_ids = test_photo_to_biz['business_id'].unique()

## Load image features
f = h5py.File(data_root+'test_image_fc7features.h5','r')
image_filenames = list(np.copy(f['photo_id']))
image_filenames = [name.split('/')[-1][:-4] for name in image_filenames]  # strip the directory path and the ".jpg" extension
image_features = f['feature']
print "Number of businesses: ", len(biz_ids)

df = pd.DataFrame(columns=['business','feature vector'])
index = 0
t = time.time()

for biz in biz_ids:     
    
    image_ids = test_photo_to_biz[test_photo_to_biz['business_id']==biz]['photo_id'].tolist()  
    # h5py expects index lists in increasing order; sorting does not change the mean
    image_index = sorted(image_filenames.index(str(x)) for x in image_ids)
     
    # Read only this business's rows from disk and average them
    features = image_features[image_index]
    mean_feature = list(np.mean(features,axis=0))

    df.loc[index] = [biz, mean_feature]
    index+=1
    if index%1000==0:
        print "Businesses processed: ", index, "Time passed: ", "{0:.1f}".format(time.time()-t), "sec"

# Use a separate name for the CSV handle so that f still refers to the H5 file
with open(data_root+"test_biz_fc7features.csv",'w') as csv_file:  
    df.to_csv(csv_file, index=False)

f.close()

Number of businesses:  10000
Businesses processed:  1000 Time passed:  188.1 sec
Businesses processed:  2000 Time passed:  470.2 sec
Businesses processed:  3000 Time passed:  785.6 sec
Businesses processed:  4000 Time passed:  1239.6 sec
Businesses processed:  5000 Time passed:  1676.3 sec
Businesses processed:  6000 Time passed:  2081.7 sec
Businesses processed:  7000 Time passed:  2503.9 sec
Businesses processed:  8000 Time passed:  2843.3 sec
Businesses processed:  9000 Time passed:  3180.2 sec
Businesses processed:  10000 Time passed:  3526.0 sec
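
Each call to `image_filenames.index(...)` scans the list from the start, so the lookups alone can become costly when there are many photos. One possible alternative, sketched below with made-up photo IDs and a hypothetical helper `rows_for_photos`, builds a dictionary from photo ID to row position once and reuses it. It assumes photo IDs are unique within the H5 file; whether it would noticeably shorten the run time reported above has not been measured here.

```python
# Hypothetical photo IDs, in the order they are stored in the H5 file.
image_filenames = ["317818", "30679", "455084", "371381", "86224"]

# Build the lookup table once: photo ID -> row position in the feature matrix.
row_of_photo = {name: row for row, name in enumerate(image_filenames)}

def rows_for_photos(photo_ids):
    """Return the sorted row positions of the given photo IDs."""
    return sorted(row_of_photo[str(photo_id)] for photo_id in photo_ids)

# Photo IDs as they might appear in the photo-to-business table (integers here).
image_index = rows_for_photos([86224, 30679, 371381])
print(image_index)  # [1, 3, 4]
```
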


In [7]:
# Check file content
test_business = pd.read_csv(data_root+'test_biz_fc7features.csv')
print test_business.shape
test_business[0:5]

(10000, 2)


| | business | feature vector |
|---|---|---|
| 0 | 003sg | [0.19304858, 0.25836363, 0.19439963, 0.4623323... |
| 1 | 00er5 | [0.19397432, 0.25547543, 0.18416314, 0.3357939... |
| 2 | 00kad | [0.12131316, 0.12655321, 0.076526381, 0.383446... |
| 3 | 00mc6 | [0.28428948, 0.11111, 0.4785029, 0.44944447, 0... |
| 4 | 00q7x | [0.23811676, 0.33040056, 0.25543597, 0.3258069... |
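
Because each mean vector is written to the CSV as the string form of a Python list, it comes back as text when the file is read. A minimal sketch of turning such strings back into a numeric matrix follows, assuming the cells look like `"[0.1, 0.2, ...]"` as in the previews. The sample strings and the helper name `parse_feature_column` are invented for illustration.

```python
import ast

import numpy as np

def parse_feature_column(cells):
    """Turn strings such as "[0.1, 0.2]" into a 2-D float array, one row per cell."""
    return np.array([ast.literal_eval(cell) for cell in cells], dtype=float)

# Two made-up cells standing in for the 'feature vector' column.
cells = ["[0.5, 0.25, 0.0]", "[1.0, 0.75, 0.5]"]
X = parse_feature_column(cells)
print(X.shape)  # (2, 3)
```

On the files written above, the same helper could be applied to the `feature vector` column, for example `parse_feature_column(test_business['feature vector'])`.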
