In [1]:
import json
import pandas as pd

In [2]:
def _partial_acc(obs):
    """
    Inner function that computes a weighted accuracy based on how many human
    raters' annotations the predicted value matches: each match earns 1/3
    credit, capped at 1.  Applied to a single row via DataFrame.apply(axis=1).

    Args:
        obs: pandas.Series (one DataFrame row) containing 'predicted_answer'
            and 'annotations' entries
    """
    prediction = obs['predicted_answer']
    annotations = obs['annotations']
    matches = 0
    for a in annotations:
        if prediction.strip().lower() == a.strip().lower():
            matches += 1
    return min(1, matches/3)
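
As a quick sanity check of the scoring rule, the sketch below applies the same logic to a few hand-made rows. The `partial_acc` copy and the toy rows are illustrative assumptions and are not taken from the results file.

```python
import pandas as pd

def partial_acc(obs):
    """Same rule as _partial_acc: 1/3 credit per matching annotation, capped at 1."""
    prediction = obs['predicted_answer'].strip().lower()
    matches = sum(prediction == a.strip().lower() for a in obs['annotations'])
    return min(1, matches / 3)

# Hypothetical rows for illustration only
toy = pd.DataFrame({
    'predicted_answer': ['yes', 'No ', '2'],
    'annotations': [
        ['yes'] * 10,                 # 10 matches, capped at 1
        ['no', 'yes', 'yes', 'no'],   # 2 matches, scored 2/3
        ['5', '5', '5'],              # 0 matches, scored 0
    ],
})
toy['partial'] = toy.apply(partial_acc, axis=1)
print(toy['partial'].tolist())
```

Because the score is capped, a prediction that matches three annotators scores the same as one that matches all ten.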

## Download the appropriate results file from the Google Cloud Storage bucket

`gsutil cp gs://mids-w266-mw/mlflow/28/204b0b355a814e2a819e7266abc28963/artifacts/test2015_results_mrr_san_expt10_2018-11-27-21:29:56.json .`

Update the `json_results_path` variable in the next code cell:

### 1. Best "enhanced" model

In [3]:
# Enhanced w/ POS *test* run_timestamp: 2018-11-27-21:29:56
json_results_path = '/home/mwinton/final_runs/test2015_results_mrr_san_expt10_2018-11-27-21:29:56.json'  # Enhanced

with open(json_results_path) as f:
    resultsj = json.load(f)
    
df = pd.DataFrame(resultsj)
df['correct'] = (df['answer_str'].str.strip().str.lower() == df['predicted_answer'].str.strip().str.lower()).astype(int)
df['partial'] = df.apply(_partial_acc, axis=1)

In [4]:
df.describe()

Unnamed: 0,answer_id,image_id,question_id,correct,partial
count,60712.0,60712.0,60712.0,60712.0,60712.0
mean,29122800.0,291227.893267,2912280.0,0.441478,0.511491
std,16760830.0,167608.346535,1676083.0,0.496567,0.482391
min,4200.0,42.0,420.0,0.0,0.0
25%,14792110.0,147921.0,1479211.0,0.0,0.0
50%,29222720.0,292227.0,2922272.0,0.0,0.666667
75%,43499390.0,434993.75,4349939.0,1.0,1.0
max,58191320.0,581913.0,5819132.0,1.0,1.0


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 60712 entries, 0 to 60711
Data columns (total 12 columns):
annotations         60712 non-null object
answer_id           60712 non-null int64
answer_str          60712 non-null object
answer_type         60712 non-null object
complement_id       0 non-null object
image_id            60712 non-null int64
predicted_answer    60712 non-null object
question_id         60712 non-null int64
question_str        60712 non-null object
question_type       60712 non-null object
correct             60712 non-null int64
partial             60712 non-null float64
dtypes: float64(1), int64(4), object(7)
memory usage: 5.6+ MB


In [6]:
df.head()

Unnamed: 0,annotations,answer_id,answer_str,answer_type,complement_id,image_id,predicted_answer,question_id,question_str,question_type,correct,partial
0,"[5, 5, 5, 5, 5, 5, 5, 5, 5, 5]",9201400,5,number,,92014,2,920140,how many urinals are in this bathroom,how many,0,0.0
1,"[peeing, urinating, urinating, urination, urin...",9201410,urinating,other,,92014,<unk>,920141,what are these used for,what are,0,0.0
2,"[white, gray, white, white, gray, white, white...",9201420,white,other,,92014,white,920142,what is the color of the walls,what is the color of the,1,1.0
3,"[1, 0, 1, 1, 0, 0, 0, 0, 5, 0]",26546400,0,number,,265464,2,2654640,how many surfer are standing on the boards,how many,0,0.0
4,"[no, yes, yes, yes, no, no, no, no, yes, no]",26546410,no,yes/no,,265464,yes,2654641,are there high tides,are there,0,1.0


In [7]:
# count by answer type
df.groupby(['answer_type'])['correct'].count().sort_values(ascending=False)

answer_type
other     30351
yes/no    22762
number     7599
Name: correct, dtype: int64

In [8]:
# count by question type
df.groupby(['question_type'])['correct'].count().sort_values(ascending=False)

question_type
how many                    5439
is the                      4921
what                        4554
what color is the           4170
what is the                 3197
none of the above           2508
is this                     2226
is this a                   2140
what is                     1762
what kind of                1646
are the                     1570
is there a                  1275
what type of                1125
where is the                1118
is it                       1044
what are the                 922
is                           898
what color are the           891
does the                     867
is there                     847
are these                    795
what is the man              786
are there                    768
which                        749
how                          688
is the man                   671
are                          666
does this                    602
what is on the               580
what does the                

In [9]:
# accuracy overall
acc = df['correct'].mean()
partial_acc = df['partial'].mean()
print('Accuracy = {:.3f}. Partial Accuracy = {:.3f}.'.format(acc, partial_acc))

Accuracy = 0.441. Partial Accuracy = 0.511.


In [10]:
pd.set_option('display.max_rows', 75)

In [11]:
# accuracy by question type
acc_by_qtype = df.groupby(['question_type'])[['correct', 'partial']] \
    .mean() \
    .sort_values(['correct'], ascending=False)
acc_by_qtype

question_type,correct,partial
could,0.847826,0.92029
is there a,0.833725,0.878693
what room is,0.831169,0.847042
what sport is,0.815476,0.834325
is there,0.744982,0.815821
are there,0.739583,0.807292
do you,0.734694,0.829932
was,0.70082,0.800546
is it,0.700192,0.793103
can you,0.684211,0.77193


In [12]:
# accuracy by answer type
acc_by_anstype = df.groupby(['answer_type'])[['correct', 'partial']] \
    .mean() \
    .sort_values(['correct'], ascending=False)
acc_by_anstype

answer_type,correct,partial
yes/no,0.678763,0.781829
other,0.309545,0.353091
number,0.257665,0.334386


In [13]:
acc_by_anstype.to_dict('index')

{'yes/no': {'correct': 0.6787628503646428, 'partial': 0.7818293647306933},
 'other': {'correct': 0.3095449902803861, 'partial': 0.3530910568569955},
 'number': {'correct': 0.2576654823003027, 'partial': 0.334386103434663}}

### 1a. Yes/No Answer Type

In [14]:
# accuracy by question type
df[df.answer_type=='yes/no'].groupby(['question_type'])[['correct', 'partial']].mean().sort_values(['correct'])

question_type,correct,partial
why,0.0,0.0
what,0.0,0.0
is that a,0.603175,0.758377
is he,0.624113,0.739953
is this person,0.62963,0.767196
are these,0.634877,0.760672
are,0.635528,0.751404
none of the above,0.638195,0.765243
has,0.642586,0.737643
is the person,0.648352,0.763736


In [15]:
# examples of 0% accuracy for yes/no answer type - 7 data points
df[(df.answer_type=='yes/no') & \
   ((df.question_type=='why') | (df.question_type=='what') | (df.question_type=='what is the'))]

Unnamed: 0,annotations,answer_id,answer_str,answer_type,complement_id,image_id,predicted_answer,question_id,question_str,question_type,correct,partial
750,"[yes, yes, yes, company logo, yes, yes, yes, y...",22455400,yes,yes/no,,224554,<unk>,2245540,what there a crown on the train,what,0,0.0
6642,"[no, no, no, no, no, yes, no, yes, yes, yes]",8276500,no,yes/no,,82765,<unk>,827650,what someone using the computer in bed,what,0,0.0
7091,"[no, emergency, no, no, ve, no, no, can't see,...",3987110,no,yes/no,,39871,<unk>,398711,what word is show on the bus,what,0,0.0
11750,"[make feeding easier, yes, yes, yes, yes, view...",18663720,yes,yes/no,,186637,<unk>,1866372,why is there a wooden platform behind the fence,why,0,0.0
29714,"[yes, his preference, yes, yes, yes, because h...",9924220,yes,yes/no,,99242,cold,992422,why does the man have a beard,why,0,0.0
43696,"[no, no, no, no, no, no, no, yes, yes, no]",32017100,no,yes/no,,320171,<unk>,3201710,what this photo taken in the present century,what,0,0.0
58184,"[no, no, no, no, no, no, no, no, no, no]",39427510,no,yes/no,,394275,<unk>,3942751,what this picture taken in the united states,what,0,0.0


In [16]:
# examples of <60% accuracy for yes/no answer type with correct answers - 914 data points
df[(df.answer_type=='yes/no') & \
   ((df.question_type=='are there any') | (df.question_type=='none of the above')) & \
   (df.correct==1)].tail(10)

Unnamed: 0,annotations,answer_id,answer_str,answer_type,complement_id,image_id,predicted_answer,question_id,question_str,question_type,correct,partial
60160,"[yes, yes, yes, yes, yes, yes, yes, yes, yes, ...",11373610,yes,yes/no,,113736,yes,1137361,does animal like picnic sites,none of the above,1,1.0
60291,"[yes, yes, yes, yes, yes, yes, yes, yes, yes, ...",52944720,yes,yes/no,,529447,yes,5294472,can this phone take pictures,none of the above,1,1.0
60326,"[yes, yes, yes, yes, yes, yes, yes, yes, yes, no]",7995510,yes,yes/no,,79955,yes,799551,should the driver of the truck merge with the ...,none of the above,1,1.0
60351,"[yes, yes, yes, yes, yes, yes, yes, yes, yes, ...",55165000,yes,yes/no,,551650,yes,5516500,does it look cloudy,none of the above,1,1.0
60483,"[yes, yes, yes, yes, yes, yes, yes, yes, yes, ...",14258100,yes,yes/no,,142581,yes,1425810,does there appear to be fog in this image,none of the above,1,1.0
60509,"[no, no, no, no, no, no, no, no, no, no]",5048220,no,yes/no,,50482,no,504822,are there any clouds in the sky,are there any,1,1.0
60592,"[yes, yes, yes, no, yes, yes, yes, yes, yes, yes]",196010,yes,yes/no,,1960,yes,19601,will he make the goal,none of the above,1,1.0
60646,"[yes, yes, yes, yes, yes, looks like it, yes, ...",44533410,yes,yes/no,,445334,yes,4453341,did the little girl eat all of her food,none of the above,1,1.0
60684,"[yes, yes, yes, yes, yes, yes, yes, yes, yes, ...",44006000,yes,yes/no,,440060,yes,4400600,does she have a wrist watch,none of the above,1,1.0
60710,"[yes, yes, yes, yes, yes, yes, yes, yes, yes, ...",53508020,yes,yes/no,,535080,yes,5350802,are there any shoes in the background,are there any,1,1.0


In [17]:
# examples of <60% accuracy for yes/no answer type with incorrect answers - 1555 data points
df[(df.answer_type=='yes/no') & \
   ((df.question_type=='are there any') | (df.question_type=='none of the above')) & \
   (df.correct==0)].tail(10)

Unnamed: 0,annotations,answer_id,answer_str,answer_type,complement_id,image_id,predicted_answer,question_id,question_str,question_type,correct,partial
59524,"[no, no, yes, no, no, no, no, no, no, no]",40814700,no,yes/no,,408147,yes,4081470,will the elephant eat the wood,none of the above,0,0.333333
59574,"[yes, yes, yes, yes, no, yes, yes, no, yes, yes]",17928500,yes,yes/no,,179285,no,1792850,were these objects standing still when they we...,none of the above,0,0.666667
59656,"[yes, no, no, no, no, no, no, no, yes, no]",47234910,no,yes/no,,472349,yes,4723491,can this elephant knock down the fence,none of the above,0,0.666667
59747,"[yes, no, no, yes, yes, no, no, no, yes, no]",13493510,no,yes/no,,134935,yes,1349351,are there any flowers in the bouquet,are there any,0,1.0
59757,"[yes, yes, no, yes, yes, yes, yes, yes, yes, yes]",6601120,yes,yes/no,,66011,no,660112,did the cat destroy the flowers,none of the above,0,0.333333
59919,"[no, 1, no, no, yes, no, yes, yes, yes, 3]",29184500,yes,yes/no,,291845,no,2918450,are there any sidekicks in this picture,are there any,0,1.0
60256,"[no, no, no, no, no, no, nope, no, no, no]",14282610,no,yes/no,,142826,yes,1428261,would you drink water from the stream,none of the above,0,0.0
60281,"[yes, yes, yes, yes, yes, no, yes, yes, yes, yes]",22122220,yes,yes/no,,221222,no,2212222,does some grass need to be watered,none of the above,0,0.333333
60444,"[no, no, no, no, no, no, no, no, no, no]",23774500,no,yes/no,,237745,yes,2377450,does his board match his suit,none of the above,0,0.0
60496,"[yes, yes, yes, yes, yes, yes, weeds, yes, yes...",12547610,yes,yes/no,,125476,no,1254761,would anything be in the way of someone trying...,none of the above,0,0.0


In [18]:
# how does model compare to humans?
# what percentage of incorrect answers predicted by model are also predicted by humans?
num_incorrect = df[(df.answer_type=='yes/no') & (df.correct==0)]['annotations'].count()
num_atleast1 = df[(df.answer_type=='yes/no') & (df.correct==0) & (df.partial>0)]['annotations'].count() 
num_atleast2 = df[(df.answer_type=='yes/no') & (df.correct==0) & (df.partial>0.35)]['annotations'].count() 
num_atleast3 = df[(df.answer_type=='yes/no') & (df.correct==0) & (df.partial==1)]['annotations'].count() 

print('Percentage of incorrect answers predicted by at least one human: {:.1%}'.format(num_atleast1/num_incorrect))
print('Percentage of incorrect answers predicted by at least two humans: {:.1%}'.format(num_atleast2/num_incorrect))
print('Percentage of incorrect answers predicted by at least three humans: {:.1%}'.format(num_atleast3/num_incorrect))

Percentage of incorrect answers predicted by at least one human: 47.8%
Percentage of incorrect answers predicted by at least two humans: 29.6%
Percentage of incorrect answers predicted by at least three humans: 18.8%
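
The thresholds on `partial` in the cell above work because each matching annotation is worth 1/3. As an alternative sketch, assuming the same `predicted_answer` and `annotations` columns, the matching annotators can be counted directly so that the thresholds read as plain integers. The `count_matches` helper and the toy rows are hypothetical.

```python
import pandas as pd

def count_matches(obs):
    """Number of human annotations equal to the prediction (ignoring case and padding)."""
    prediction = obs['predicted_answer'].strip().lower()
    return sum(prediction == a.strip().lower() for a in obs['annotations'])

# Hypothetical incorrect yes/no predictions, for illustration only
toy = pd.DataFrame({
    'predicted_answer': ['yes', 'no', 'yes', 'no'],
    'annotations': [
        ['no'] * 10,
        ['yes'] * 9 + ['no'],
        ['no'] * 8 + ['yes'] * 2,
        ['yes'] * 6 + ['no'] * 4,
    ],
})
toy['n_matches'] = toy.apply(count_matches, axis=1)

for k in (1, 2, 3):
    share = (toy['n_matches'] >= k).mean()
    print('Matched by at least {} annotator(s): {:.1%}'.format(k, share))
```

Unlike `partial`, the raw count is not capped, so it also distinguishes predictions matched by three annotators from those matched by all ten.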


### 1b. Other Answer Type

In [19]:
# accuracy by question type
df[df.answer_type=='other'].groupby(['question_type'])[['correct', 'partial']].mean().sort_values(['correct'])

question_type,correct,partial
how many people are,0.0,0.0
has,0.0,0.111111
what is the name,0.010309,0.012027
why,0.026667,0.035556
how many,0.041667,0.111111
why is the,0.047619,0.067019
are,0.076923,0.102564
how,0.081206,0.116783
can you,0.1,0.4
who is,0.12,0.149744


In [20]:
# examples of incorrect predictions for other answer type
df[(df.answer_type=='other') & (df.correct==0)].tail(10)

Unnamed: 0,annotations,answer_id,answer_str,answer_type,complement_id,image_id,predicted_answer,question_id,question_str,question_type,correct,partial
60689,"[clocks, clocks, clocks, decorations, clock, c...",22462220,clocks,other,,224622,<unk>,2246222,what is attached to two sides of the building,what is,0,0.0
60691,"[bike, bike, bike, bike, suitcase, bicycle, bi...",57266510,bike,other,,572665,<unk>,5726651,what is leaning against the tree in the middle...,what is,0,0.0
60692,"[briefcase, suitcase, suitcase, yes, suitcase,...",57266520,suitcase,other,,572665,<unk>,5726652,what item is featured up front,what,0,0.0
60693,"[driving, driving, driving, driving, driving, ...",1438000,driving,other,,14380,<unk>,143800,what is the car doing,what is the,0,0.0
60695,"[white, white, white, white, white, white, whi...",1438020,white,other,,14380,gray,143802,what color is the bridge,what color is the,0,0.0
60696,"[library, library, library, library, library, ...",29819700,library,other,,298197,restaurant,2981970,where are these people,none of the above,0,0.0
60697,"[computers, laptops, laptops, laptops, compute...",29819710,laptops,other,,298197,<unk>,2981971,what is sitting on the desk in front of the boys,what is,0,0.0
60701,"[field, grassland, on hillside, white, on gras...",25109420,field,other,,251094,<unk>,2510942,where are the sheep,where are the,0,0.0
60703,"[middle, middle hot dog, middle, middle, middl...",38821100,middle,other,,388211,mustard,3882110,which hot dog is garnished differently,which,0,0.0
60711,"[paper person, paper doll, miniature person, p...",52397400,paper doll,other,,523974,umbrella,5239740,what is the person holding,what is the person,0,0.0


In [21]:
# create column with number of words in answers
df['answer_length'] = df['answer_str'].str.split().str.len()
df.tail(10)

Unnamed: 0,annotations,answer_id,answer_str,answer_type,complement_id,image_id,predicted_answer,question_id,question_str,question_type,correct,partial,answer_length
60702,"[white, white, white, white, white, white, whi...",38821120,white,other,,388211,white,3882112,what color is the plate,what color is the,1,1.0,1
60703,"[middle, middle hot dog, middle, middle, middl...",38821100,middle,other,,388211,mustard,3882110,which hot dog is garnished differently,which,0,0.0,1
60704,"[3, 3, 3, 3, 3, 3, 3, 3, 3, 3]",38821110,3,number,,388211,2,3882111,how many hot dogs are pictured,how many,0,0.0,1
60705,"[pizza, pizza, pizza, pizza, pizza, pizza, piz...",13241520,pizza,other,,132415,pizza,1324152,what food is this,what,1,1.0,1
60706,"[no, no, no, no, yes, no, no, no, no, yes]",13241500,no,yes/no,,132415,no,1324150,does this pizza need to be eaten with a fork,does this,1,1.0,1
60707,"[2, 4, 3, 4, 4, 4, 4, 4, 4, 4]",13241510,4,number,,132415,4,1324151,how many prongs are on the fork,how many,1,1.0,1
60708,"[sleeping, sleeping on suitcase, sleeping, sle...",53508000,sleeping,other,,535080,sleeping,5350800,what is the cat doing,what is the,1,1.0,1
60709,"[suitcase, luggage, luggage, suitcase, bag, su...",53508010,suitcase,other,,535080,suitcase,5350801,what is the cat laying on,what is the,1,1.0,1
60710,"[yes, yes, yes, yes, yes, yes, yes, yes, yes, ...",53508020,yes,yes/no,,535080,yes,5350802,are there any shoes in the background,are there any,1,1.0,1
60711,"[paper person, paper doll, miniature person, p...",52397400,paper doll,other,,523974,umbrella,5239740,what is the person holding,what is the person,0,0.0,2


In [22]:
# number of incorrect predictions by answer length
df[(df.answer_type=='other') & (df.correct==0)].groupby(['answer_length'])['annotations'].count()

answer_length
0         4
1     16666
2      2944
3      1092
4       160
5        57
6        16
7         8
8         4
10        2
11        2
12        1
Name: annotations, dtype: int64

In [23]:
# number of correct predictions by answer length
df[(df.answer_type=='other') & (df.correct==1)].groupby(['answer_length'])['annotations'].count()

answer_length
1    8890
2     385
3     119
4       1
Name: annotations, dtype: int64

### 1c. Number Answer Type

In [24]:
# accuracy by question type
df[df.answer_type=='number'].groupby(['question_type'])[['correct', 'partial']].mean().sort_values(['correct'])

question_type,correct,partial
are the,0.0,0.333333
where is the,0.0,0.0
what type of,0.0,0.0
what time,0.0,0.00391
what is this,0.0,0.0
what is the name,0.0,0.0
what is on the,0.0,0.0
what brand,0.0,0.0
what are the,0.0,0.0
which,0.0,0.0


In [25]:
# number of data points by question type
df[(df.answer_type=='number')].groupby(['question_type'])['annotations'].count().sort_values(ascending=False)

question_type
how many                  5391
how many people are        541
what time                  341
how                        257
how many people are in     241
what is the                237
what                       227
what number is             180
none of the above           81
what does the               23
what are the                19
which                       19
what is                     12
what is the name             4
is this                      4
does the                     3
are there                    3
what is this                 2
where is the                 2
why                          2
can you                      1
was                          1
is                           1
is the                       1
is the woman                 1
is this person               1
what brand                   1
what is on the               1
what type of                 1
are the                      1
Name: annotations, dtype: int64

In [26]:
# examples of incorrect predictions for number answer type
# df[(df.answer_type=='number') & (df.correct==1) & (df.question_type=='how many people are')].tail(10)
df[(df.answer_type=='number') & (df.correct==0)].tail(10)

Unnamed: 0,annotations,answer_id,answer_str,answer_type,complement_id,image_id,predicted_answer,question_id,question_str,question_type,correct,partial,answer_length
60605,"[3, 6, 2, 1, 3 on each side so 6, 3, 3, 3, 4, 2]",9775420,3,number,,97754,2,977542,how many doors does the plane have,how many,0,0.666667,1
60618,"[many, 10, 11, 12, 11, lot, 11, 11, 11, 11]",16285800,11,number,,162858,2,1628580,how many cars are there,how many,0,0.0,1
60623,"[6, 10, 6, 6, 6, 6, 6, 6, 6, 6]",35133110,6,number,,351331,2,3513311,how many slices is the food cut into,how many,0,0.0,1
60627,"[19, 19, 19, 19, 19, 15, 19, 12, 19, 19]",43120800,19,number,,431208,<unk>,4312080,what is the player's number,what is the,0,0.0,1
60640,"[8, 9, 7, 7, 7, 6, 8, 9, 7, 7]",20702700,7,number,,207027,1,2070270,how many surfboards are on the rack,how many,0,0.0,1
60654,"[3, 4, 5, 3, 5, 3, 4, 3, 4, 3]",5856900,3,number,,58569,2,585690,how many buses,how many,0,0.0,1
60671,"[3, 3, 3, 3, 3, 3, 3, 3, 3, 3]",48602020,3,number,,486020,2,4860202,how many zebras are visible,how many,0,0.0,1
60682,"[young, 3, 15, 10 years, 4, 4 years, 5 years, ...",15574300,3,number,,155743,old,1557430,how old is this zebra,how,0,0.0,1
60688,"[1:46, 1:48, 1:48, 147, 1:47, 1:47, 1:47 pm, 1...",22462210,1:47,number,,224622,<unk>,2246221,what time is on the clock,what time,0,0.0,1
60704,"[3, 3, 3, 3, 3, 3, 3, 3, 3, 3]",38821110,3,number,,388211,2,3882111,how many hot dogs are pictured,how many,0,0.0,1


In [27]:
# create column with unk token flag
df['unk_flag'] = df['predicted_answer'] == '<unk>'
df.head(5)

Unnamed: 0,annotations,answer_id,answer_str,answer_type,complement_id,image_id,predicted_answer,question_id,question_str,question_type,correct,partial,answer_length,unk_flag
0,"[5, 5, 5, 5, 5, 5, 5, 5, 5, 5]",9201400,5,number,,92014,2,920140,how many urinals are in this bathroom,how many,0,0.0,1,False
1,"[peeing, urinating, urinating, urination, urin...",9201410,urinating,other,,92014,<unk>,920141,what are these used for,what are,0,0.0,1,True
2,"[white, gray, white, white, gray, white, white...",9201420,white,other,,92014,white,920142,what is the color of the walls,what is the color of the,1,1.0,1,False
3,"[1, 0, 1, 1, 0, 0, 0, 0, 5, 0]",26546400,0,number,,265464,2,2654640,how many surfer are standing on the boards,how many,0,0.0,1,False
4,"[no, yes, yes, yes, no, no, no, no, yes, no]",26546410,no,yes/no,,265464,yes,2654641,are there high tides,are there,0,1.0,1,False


In [28]:
# number of incorrect answers by unk token flag
df[(df.answer_type=='number') & (df.correct==0)].groupby(['unk_flag'])['annotations'].count()

unk_flag
False    4615
True     1026
Name: annotations, dtype: int64

In [29]:
# examples of incorrect predictions that are unk tokens
df[(df.answer_type=='number') & (df.correct==0) & (df.predicted_answer == '<unk>')].tail(50)

Unnamed: 0,annotations,answer_id,answer_str,answer_type,complement_id,image_id,predicted_answer,question_id,question_str,question_type,correct,partial,answer_length,unk_flag
57735,"[3:23, 3:23, 3:23, 3:22, 3:22, 3:22, 3:22, 3:2...",681000,3:22,number,,6810,<unk>,68100,what time is it in this scene,what time,0,0.0,1,True
57793,"[9867, 9867, 9867, 9862, 9867, 9867, 9867, 986...",33031610,9867,number,,330316,<unk>,3303161,what 4 digit number is visible on the upper ri...,what,0,0.0,1,True
57820,"[7 eleven, 7 eleven, 7 eleven, 7 11, 7 eleven,...",21170700,7 eleven,number,,211707,<unk>,2117070,what store did that come from,what,0,0.0,2,True
57830,"[$1.39, 1.39, 1.39, $1.39, 1.39, 1.39, 1.39, 1...",895320,1.39,number,,8953,<unk>,89532,how much does the sign say the cheeseburger li...,how,0,0.0,1,True
57854,"[12:45, 12:40, 12:43, 1:40, 12:39, 12:43, 12:4...",57192020,12:43,number,,571920,<unk>,5719202,what time is shown on the clock,what time,0,0.0,1,True
57882,"[12 inches, 12 inches, 15 inches, 13 inch, 15 ...",37433300,15 inch,number,,374333,<unk>,3743330,how big is screen on left,how,0,0.0,2,True
57916,"[10:00, 11:00, 11:00, 11:00, 10:00, 11, 10:00 ...",24318910,10:00,number,,243189,<unk>,2431891,what time does the clock show,what time,0,0.0,1,True
57927,"[winter, winter, 19.12.2007, 2007, 19 12 2007,...",2581200,winter,number,,25812,<unk>,258120,when was this picture took,none of the above,0,0.0,1,True
57928,"[19.12.2007, 12 19 2007, 19.12.2007, 19 12 200...",2581210,19.12.2007,number,,25812,<unk>,258121,what is the date on the photo,what is the,0,0.0,1,True
57972,"[12:40, 12:40, 12:40, 12:40, 12:41, 12:40, 12:...",40695920,12:40,number,,406959,<unk>,4069592,what time is the clock displaying,what time,0,0.0,1,True


In [30]:
# examples of incorrect predictions that are not unk tokens
df[(df.answer_type=='number') & (df.correct==0) & (df.predicted_answer != '<unk>')].head(10)

Unnamed: 0,annotations,answer_id,answer_str,answer_type,complement_id,image_id,predicted_answer,question_id,question_str,question_type,correct,partial,answer_length,unk_flag
0,"[5, 5, 5, 5, 5, 5, 5, 5, 5, 5]",9201400,5,number,,92014,2,920140,how many urinals are in this bathroom,how many,0,0.0,1,False
3,"[1, 0, 1, 1, 0, 0, 0, 0, 5, 0]",26546400,0,number,,265464,2,2654640,how many surfer are standing on the boards,how many,0,0.0,1,False
6,"[5, 4, 7, 8, 6, 10, 4, no, 10, 6]",15420200,10,number,,154202,1,1542020,how many of these items contain electronics,how many,0,0.0,1,False
22,"[4, 4, 4, 4, 4, 4, 4, 4, 4, 4]",2735310,4,number,,27353,2,273531,how many tines on the fork,how many,0,0.0,1,False
28,"[4, 4, 4, 13, 4, 4, 4, 4, 4, 4]",6462910,4,number,,64629,2,646291,how many propellers are in this shot,how many,0,0.0,1,False
44,"[6, 7, 6, 9, 6, 6, 7, 6, 5, 5]",19634120,6,number,,196341,2,1963412,how many individuals are in this photo,how many,0,0.0,1,False
59,"[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]",23153420,1,number,,231534,2,2315342,how many plane wings are visible,how many,0,0.0,1,False
70,"[2, 2, 1, 1, 1, 10, 2, 0, 1, 10]",41582300,1,number,,415823,2,4158230,how many slices of the orange are there,how many,0,1.0,1,False
141,"[4, 4, 4, 4, 4, 4, 4, 4, 4, 4]",14353300,4,number,,143533,0,1435330,how many cows are laying down,how many,0,0.0,1,False
143,"[0, 0, 0, 2, 0, 0, 2, 0, 1, 0]",14353320,0,number,,143533,2,1435332,how many of the cows have spots,how many,0,0.666667,1,False


#### Summary
- Best accuracy (68%): Yes/No answer type
- Second-best accuracy (31%): Other answer type
- Worst accuracy (26%): Number answer type

**Yes/No Answer Type**  
The model performs consistently well, with accuracy above 50% for every question type except three: "why", "what", and "what is the".  The phrasing of these question types does not point to a yes or no answer, and they cover only 7 data points, so it is not surprising that the model does poorly on them.  For the other question types, the model does best when questions have clear, direct answers and worst when they are abstract, subjective, or require common-sense knowledge.  It is also worth noting that 47.8% of the incorrect predictions matched the answer given by at least one human annotator, and 29.6% matched at least two.

**Other Answer Type**  
We explored whether multi-word answers lead to low accuracy for this answer type, since multi-word phrases are more likely to be excluded from the training vocabulary.  However, we found that most of the incorrect predictions (80%) have one-word answers and only 20% have multi-word answers.  In general, the model does best at predicting rooms, animals, sports, and colors for this answer type.
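
The 80% figure is a share of all incorrect predictions. Since one-word answers also dominate the dataset, accuracy within each answer length is a useful complement. A minimal sketch, using only the counts printed by the two answer-length cells above (the variable names are illustrative):

```python
import pandas as pd

# Counts copied from the answer-length outputs above ('other' answer type)
incorrect = pd.Series({0: 4, 1: 16666, 2: 2944, 3: 1092, 4: 160, 5: 57,
                       6: 16, 7: 8, 8: 4, 10: 2, 11: 2, 12: 1})
correct = pd.Series({1: 8890, 2: 385, 3: 119, 4: 1})

# Share of incorrect predictions whose ground-truth answer is a single word
one_word_share = incorrect.loc[1] / incorrect.sum()

# Accuracy within each answer length (lengths with no correct predictions get 0)
correct = correct.reindex(incorrect.index, fill_value=0)
acc_by_length = correct / (correct + incorrect)

print('One-word share of incorrect predictions: {:.1%}'.format(one_word_share))
print(acc_by_length.round(3))
```

Comparing `acc_by_length` across lengths shows whether longer answers are harder for the model, independently of how common each length is.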


**Number Answer Type**  
The model does worst on the number answer type, with only 26% accuracy.  Most of the questions for this answer type involve counting (i.e., "how many" questions).  About 18% of the incorrect predictions (1,026 of 5,641) have the UNK token as the predicted answer.  Many of the UNK token predictions involve answers that relate to time or to number sequences (such as a bus number or the number on a jersey) that must be read off objects in the image.
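
To put a rough number on the observation about time-like answers, the ground-truth answers behind UNK predictions can be bucketed with a simple pattern. This is a sketch only: `answer_kind` and its categories are hypothetical, and the sample list is limited to the ten `<unk>` examples displayed above.

```python
import re
import pandas as pd

# Matches clock-style answers such as "3:22" or "12:40"
TIME_RE = re.compile(r'^\d{1,2}:\d{2}')

def answer_kind(answer):
    """Rough, hypothetical bucketing of a ground-truth answer string."""
    answer = answer.strip().lower()
    if TIME_RE.match(answer):
        return 'time'
    if answer.isdigit():
        return 'integer'
    return 'other'

# Ground-truth answers from the <unk> examples shown above
unk_answers = pd.Series(['3:22', '9867', '7 eleven', '1.39', '12:43',
                         '15 inch', '10:00', 'winter', '19.12.2007', '12:40'])
kind_counts = unk_answers.map(answer_kind).value_counts()
print(kind_counts)
```

Applied to `df.loc[df.unk_flag, 'answer_str']`, the same mapping would give the breakdown for the full test set.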



In future iterations, we could measure accuracy using the distance between word vectors, so that words or phrases with similar meanings receive credit (e.g., "nighttime" vs. "night").  We could also find better ways to tokenize times and numbers and explore different UNK replacement techniques.
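
One possible shape for the word-vector idea is sketched below. It is not part of the current evaluation: the 3-dimensional vectors are made up for illustration (a real run would load pretrained embeddings), and `soft_match` and its `threshold` are hypothetical choices.

```python
import numpy as np

# Hypothetical toy embeddings; replace with pretrained vectors in practice
toy_vectors = {
    'night': np.array([0.9, 0.1, 0.0]),
    'nighttime': np.array([0.8, 0.2, 0.1]),
    'umbrella': np.array([0.0, 0.1, 0.9]),
}

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def soft_match(prediction, answer, vectors, threshold=0.9):
    """Credit an exact match, or a pair whose cosine similarity reaches `threshold`."""
    if prediction == answer:
        return True
    if prediction not in vectors or answer not in vectors:
        return False  # no vector available, fall back to exact matching
    return cosine_similarity(vectors[prediction], vectors[answer]) >= threshold

print(soft_match('nighttime', 'night', toy_vectors))
print(soft_match('umbrella', 'night', toy_vectors))
```

The threshold trades leniency for precision, so it would need to be tuned against human judgments before the resulting scores are compared with the exact-match numbers above.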
