In [1]:
import json
import pandas as pd

In [2]:
def _partial_acc(obs):
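    """
    Inner function used to calculate a weighted accuracy based on how many human
    raters' annotations the predicted value matches.  Each matching annotation is
    worth 1/3 credit, capped at 1.  Applied to a single row (Series).

    Args:
        obs: pandas.Series object containing 'predicted_answer' (str) and
            'annotations' (list of str) fields

    Returns:
        Partial accuracy score between 0 and 1.
    """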
    prediction = obs['predicted_answer']
    annotations = obs['annotations']
    matches = 0
    for a in annotations:
        if prediction.strip().lower() == a.strip().lower():
            matches += 1
    return min(1, matches/3)
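
To make the scoring rule concrete, here is a minimal standalone sketch of the same calculation applied to three rows that appear later in this notebook. The helper name `partial_accuracy` is introduced only for illustration; it restates `_partial_acc` without the pandas dependency.

```python
def partial_accuracy(prediction, annotations):
    """Standalone restatement of _partial_acc for one question (illustrative)."""
    target = prediction.strip().lower()
    # Count how many human annotations match the prediction exactly
    # after trimming whitespace and lowercasing.
    matches = sum(1 for a in annotations if a.strip().lower() == target)
    # Each match is worth 1/3 credit, capped at full credit.
    return min(1, matches / 3)


# Annotations copied from rows 51901, 51247 and 51985 of the results below.
one_match = ['no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'yes', 'no']
three_matches = ['yes', 'yes', 'yes', 'yes', 'yes', 'no', 'yes', 'no', 'yes', 'no']
no_match = ['no'] * 10

print(partial_accuracy('yes', one_match))     # one rater agrees: 1/3 credit
print(partial_accuracy('no', three_matches))  # three raters agree: full credit
print(partial_accuracy('yes', no_match))      # nobody agrees: no credit
```

Note that a prediction can score full partial credit while still being marked incorrect, because `correct` compares only against the single `answer_str` value.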

## Download appropriate results file from Google Storage bucket.

Base Model (Top 1000 classification):
- `gs://mids-w266-mw/test2015_results_san_expt0_2018-12-06-05:05:51.json`

Best Enhanced Model (Top 1000 classification):
- `gs://mids-w266-mw/mlflow/52/b25acc49a4b04ac48d0d9e5be5a4f020/artifacts/test2015_results_mrr_san_expt28_2018-12-07-19:57:00.json`

Update the `json_results_path` variable in the next cell to point to the downloaded file:

### 1. Best "enhanced" model

In [3]:
# Best Enhanced Model (Top 1000 classification)
json_results_path = '/home/mwinton/report_results/test2015_results_mrr_san_expt28_2018-12-07-19:57:00.json'

with open(json_results_path) as f:
    resultsj = json.load(f)
    
df = pd.DataFrame(resultsj)
df['correct'] = (df['answer_str'].str.strip().str.lower() == df['predicted_answer'].str.strip().str.lower()).astype(int)
df['partial'] = df.apply(_partial_acc, axis=1)

In [4]:
df.describe()

Unnamed: 0,answer_id,image_id,one_hot_index,question_id,correct,partial
count,52213.0,52213.0,52213.0,52213.0,52213.0,52213.0
mean,29125090.0,291250.822515,84.372513,2912509.0,0.535518,0.61599
std,16826810.0,168268.075268,181.851758,1682681.0,0.498742,0.465337
min,4200.0,42.0,1.0,420.0,0.0,0.0
25%,14675720.0,146757.0,1.0,1467572.0,0.0,0.0
50%,29183400.0,291834.0,5.0,2918340.0,1.0,1.0
75%,43617210.0,436172.0,58.0,4361721.0,1.0,1.0
max,58191320.0,581913.0,1000.0,5819132.0,1.0,1.0


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52213 entries, 0 to 52212
Data columns (total 13 columns):
annotations         52213 non-null object
answer_id           52213 non-null int64
answer_str          52213 non-null object
answer_type         52213 non-null object
complement_id       0 non-null object
image_id            52213 non-null int64
one_hot_index       52213 non-null int64
predicted_answer    52213 non-null object
question_id         52213 non-null int64
question_str        52213 non-null object
question_type       52213 non-null object
correct             52213 non-null int64
partial             52213 non-null float64
dtypes: float64(1), int64(5), object(7)
memory usage: 5.2+ MB


In [6]:
df.head()

Unnamed: 0,annotations,answer_id,answer_str,answer_type,complement_id,image_id,one_hot_index,predicted_answer,question_id,question_str,question_type,correct,partial
0,"[16, 16, 16, 16, 16, 16, 16, 16, 16, 16]",9786500,16,number,,97865,241,fire hydrant,978650,what # is it,what,0,0.0
1,"[yes, yes, yes, yes, yes, yes, yes, yes, yes, ...",9786510,yes,yes/no,,97865,1,no,978651,is there people here,is there,0,0.0
2,"[container, frisbee golf, frisbee golf goal, f...",9786520,frisbee,other,,97865,24,fire hydrant,978652,what is the object on the right,what is the,0,0.0
3,"[garbage, no, no, no, no, no, no, no, no, no]",57484500,no,yes/no,,574845,2,no,5748450,is this inside,is this,1,1.0
4,"[yes, yes, yes, yes, yes, yes, yes, yes, yes, ...",57484510,yes,yes/no,,574845,1,yes,5748451,could someone sleep here,could,1,1.0


In [7]:
# count by answer type
df.groupby(['answer_type'])['correct'].count().sort_values(ascending=False)

answer_type
yes/no    22792
other     22686
number     6735
Name: correct, dtype: int64

In [8]:
pd.set_option('display.max_rows', 75)

In [9]:
# count by question type
df.groupby(['question_type'])['correct'].count().sort_values(ascending=False)

question_type
how many                    5478
is the                      4871
what color is the           4019
what                        2952
what is the                 2342
is this                     2180
none of the above           2038
is this a                   2034
are the                     1577
what is                     1294
is there a                  1245
what kind of                1178
is it                       1020
does the                     924
is                           900
is there                     891
what color are the           872
what type of                 803
are these                    755
are there                    709
is the man                   662
what are the                 645
where is the                 640
are                          636
does this                    607
what is the man              603
which                        557
how many people are          512
do                           445
what is on the               

In [10]:
# accuracy overall
acc = df['correct'].mean()
partial_acc = df['partial'].mean()
print('Accuracy = {:.3f}. Partial Accuracy = {:.3f}.'.format(acc, partial_acc))

Accuracy = 0.536. Partial Accuracy = 0.616.


In [11]:
# accuracy by question type
acc_by_qtype = df.groupby(['question_type'])[['correct', 'partial']] \
    .mean() \
    .sort_values(['correct'], ascending=False)
acc_by_qtype

question_type,correct,partial
what room is,0.864407,0.878531
what sport is,0.858934,0.881923
is there a,0.852209,0.909505
could,0.813953,0.905039
are there,0.739069,0.801128
do you,0.734375,0.833333
can you,0.734177,0.801688
is there,0.728395,0.808081
was,0.718876,0.792503
is the woman,0.715302,0.791222


In [12]:
# accuracy by answer type
acc_by_anstype = df.groupby(['answer_type'])[['correct', 'partial']] \
    .mean() \
    .sort_values(['correct'], ascending=False)
acc_by_anstype

answer_type,correct,partial
yes/no,0.678703,0.784208
other,0.468086,0.521717
number,0.278099,0.364266


In [13]:
acc_by_anstype.to_dict('index')

{'yes/no': {'correct': 0.6787030537030537, 'partial': 0.7842079092079145},
 'other': {'correct': 0.46808604425636957, 'partial': 0.5217167709894512},
 'number': {'correct': 0.2780994803266518, 'partial': 0.36426627072506856}}

### 1a. Yes/No Answer Type

In [14]:
# accuracy by question type
df[df.answer_type=='yes/no'].groupby(['question_type'])[['correct', 'partial']].mean().sort_values(['correct'])

question_type,correct,partial
what is the,0.0,0.0
what,0.25,0.25
why,0.5,0.5
are there any,0.619178,0.751598
do,0.621005,0.73516
none of the above,0.621644,0.769855
are,0.631746,0.761905
is that a,0.63871,0.763441
are these,0.642857,0.777015
is he,0.645756,0.767528


In [15]:
# examples from the lowest-accuracy question types (50% or below) for yes/no answer type - 7 data points
df[(df.answer_type=='yes/no') & \
   ((df.question_type=='why') | (df.question_type=='what') | (df.question_type=='what is the'))]

Unnamed: 0,annotations,answer_id,answer_str,answer_type,complement_id,image_id,one_hot_index,predicted_answer,question_id,question_str,question_type,correct,partial
7754,"[yes, yes, yes, company logo, yes, yes, yes, y...",22455400,yes,yes/no,,224554,1,tracks,2245540,what there a crown on the train,what,0,0.0
19452,"[no, no, no, no, no, no, no, no, no, no]",8474900,no,yes/no,,84749,2,no,847490,what this picture taken indoors,what,1,1.0
20140,"[make feeding easier, yes, yes, yes, yes, view...",18663720,yes,yes/no,,186637,1,yes,1866372,why is there a wooden platform behind the fence,why,1,1.0
23660,"[no, no, no, no, no, yes, no, yes, yes, yes]",8276500,no,yes/no,,82765,2,laptop,827650,what someone using the computer in bed,what,0,0.0
44443,"[no, emergency, no, no, ve, no, no, can't see,...",3987110,no,yes/no,,39871,2,subway,398711,what word is show on the bus,what,0,0.0
44703,"[yes, 2 men skateboarding, old picture, color,...",27829010,yes,yes/no,,278290,1,black and white,2782901,what is the picture white and black,what is the,0,0.0
50225,"[yes, his preference, yes, yes, yes, because h...",9924220,yes,yes/no,,99242,1,cold,992422,why does the man have a beard,why,0,0.0


In [16]:
# examples from low-accuracy question types (about 62%) for yes/no answer type with correct answers - 914 data points
df[(df.answer_type=='yes/no') & \
   ((df.question_type=='are there any') | (df.question_type=='none of the above')) & \
   (df.correct==1)].tail(10)

Unnamed: 0,annotations,answer_id,answer_str,answer_type,complement_id,image_id,one_hot_index,predicted_answer,question_id,question_str,question_type,correct,partial
51801,"[yes, yes, yes, yes, yes, yes, yes, yes, yes, ...",41001900,yes,yes/no,,410019,1,yes,4100190,should this sport be done away from house stru...,none of the above,1,1.0
51845,"[no, yes, no, no, no, no, no, yes, no, no]",53550600,no,yes/no,,535506,2,no,5355060,does she look happy,none of the above,1,1.0
51852,"[no, no, no, no, no, no, no, no, no, 0]",21420400,no,yes/no,,214204,2,no,2142040,are there any towels in this bathroom,are there any,1,1.0
51888,"[no, no, no, no, no, no, no, no, no, no]",24424600,no,yes/no,,244246,2,no,2442460,are there any clouds in the sky,are there any,1,1.0
51911,"[yes, maybe, yes, yes, yes, yes, yes, yes, yes...",52385410,yes,yes/no,,523854,1,yes,5238541,can someone eat outside,none of the above,1,1.0
51979,"[yes, yes, yes, yes, yes, yes, yes, yes, yes, ...",28985510,yes,yes/no,,289855,1,yes,2898551,are there any palm trees in this picture,are there any,1,1.0
52020,"[yes, yes, yes, yes, yes, yes, yes, yes, yes, ...",40771720,yes,yes/no,,407717,1,yes,4077172,would a vegetarian eat this food,none of the above,1,1.0
52055,"[no, no, no, no, no, no, no, no, no, no]",25803610,no,yes/no,,258036,2,no,2580361,does she have any clothes on,none of the above,1,1.0
52152,"[no, yes, yes, no, no, yes, yes, yes, yes, no]",6783220,yes,yes/no,,67832,1,yes,678322,would you eat this,none of the above,1,1.0
52168,"[yes, no, yes, yes, yes, yes, yes, yes, yes, yes]",32166500,yes,yes/no,,321665,1,yes,3216650,will this clock keep time,none of the above,1,1.0


In [17]:
# examples from low-accuracy question types (about 62%) for yes/no answer type with incorrect answers - 1555 data points
df[(df.answer_type=='yes/no') & \
   ((df.question_type=='are there any') | (df.question_type=='none of the above')) & \
   (df.correct==0)].tail(10)

Unnamed: 0,annotations,answer_id,answer_str,answer_type,complement_id,image_id,one_hot_index,predicted_answer,question_id,question_str,question_type,correct,partial
51066,"[radio tv, radio and television, yes, radio an...",18692710,yes,yes/no,,186927,1,no,1869271,anything on the bed,none of the above,0,0.0
51247,"[yes, yes, yes, yes, yes, no, yes, no, yes, no]",27385900,yes,yes/no,,273859,1,no,2738590,does near the door need painted,none of the above,0,1.0
51428,"[yes, yes, yes, yes, yes, yes, yes, yes, yes, ...",51220600,yes,yes/no,,512206,1,no,5122060,are there any butterflies in the photo,are there any,0,0.0
51467,"[yes, no, yes, yes, yes, yes, yes, no, no, yes]",3139020,yes,yes/no,,31390,1,no,313902,does everyone have on short,none of the above,0,1.0
51547,"[no, no, no, no, no, no, no, no, no, no]",40843920,no,yes/no,,408439,2,yes,4084392,if a person swan to shore would the be able to...,none of the above,0,0.0
51828,"[yes, yes, yes, yes, yes, yes, yes, yes, yes, ...",37031520,yes,yes/no,,370315,1,no,3703152,does he wear glasses,none of the above,0,0.0
51879,"[no, no, no, no, no, no, no, no, no, no]",15277620,no,yes/no,,152776,2,yes,1527762,did these come from a money garden,none of the above,0,0.0
51901,"[no, no, no, no, no, no, no, no, yes, no]",29378200,no,yes/no,,293782,2,yes,2937820,will this street sign wave in the wind,none of the above,0,0.333333
51985,"[no, no, no, no, no, no, no, no, no, no]",53321720,no,yes/no,,533217,2,yes,5332172,does it look like a cloudy day,none of the above,0,0.0
52038,"[no, no, no, no, no, no, no, no, no, no]",34400520,no,yes/no,,344005,2,yes,3440052,will it rain soon,none of the above,0,0.0


In [18]:
# how does model compare to humans?
# what percentage of incorrect answers predicted by model are also predicted by humans?
num_incorrect = df[(df.answer_type=='yes/no') & (df.correct==0)]['annotations'].count()
num_atleast1 = df[(df.answer_type=='yes/no') & (df.correct==0) & (df.partial>0)]['annotations'].count() 
num_atleast2 = df[(df.answer_type=='yes/no') & (df.correct==0) & (df.partial>0.35)]['annotations'].count() 
num_atleast3 = df[(df.answer_type=='yes/no') & (df.correct==0) & (df.partial==1)]['annotations'].count() 

print('Percentage of incorrect answers predicted by at least one human: {:.1%}'.format(num_atleast1/num_incorrect))
print('Percentage of incorrect answers predicted by at least two humans: {:.1%}'.format(num_atleast2/num_incorrect))
print('Percentage of incorrect answers predicted by at least three humans: {:.1%}'.format(num_atleast3/num_incorrect))

Percentage of incorrect answers predicted by at least one human: 48.7%
Percentage of incorrect answers predicted by at least two humans: 30.3%
Percentage of incorrect answers predicted by at least three humans: 19.5%
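
The thresholds on `partial` in the cell above (`>0`, `>0.35`, `==1`) stand in for one, two, and three or more matching raters, because `partial` is `min(1, matches/3)`. As a hedged alternative, the sketch below counts matching raters directly, which avoids floating-point thresholds. The helper `count_matches` and the three sample rows are illustrative only; the annotations are copied from rows 51247, 51901 and 51985 above.

```python
def count_matches(prediction, annotations):
    """Number of human annotations that equal the prediction (illustrative helper)."""
    target = prediction.strip().lower()
    return sum(a.strip().lower() == target for a in annotations)


# (predicted_answer, annotations) pairs for three incorrect yes/no predictions.
rows = [
    ('no', ['yes', 'yes', 'yes', 'yes', 'yes', 'no', 'yes', 'no', 'yes', 'no']),
    ('yes', ['no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'yes', 'no']),
    ('yes', ['no'] * 10),
]

match_counts = [count_matches(pred, anns) for pred, anns in rows]

# Share of predictions that at least k raters also gave.
shares = {k: sum(m >= k for m in match_counts) / len(match_counts) for k in (1, 2, 3)}
for k, share in shares.items():
    print('Matched by at least {} rater(s): {:.1%}'.format(k, share))
```

On the full DataFrame, the same count could be stored in a new column and compared against integer thresholds.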


### 1b. Other Answer Type

In [19]:
# accuracy by question type
df[df.answer_type=='other'].groupby(['question_type'])[['correct', 'partial']].mean().sort_values(['correct'])

question_type,correct,partial
are,0.0,0.111111
how many people are in,0.0,0.0
can you,0.0,0.266667
how many people are,0.0,0.0
how many,0.172414,0.183908
are they,0.230769,0.358974
where is the,0.282813,0.363542
why,0.284848,0.305051
why is the,0.298701,0.359307
what color,0.303116,0.340888


In [20]:
# examples of incorrect predictions for other answer type
df[(df.answer_type=='other') & (df.correct==0)].tail(10)

Unnamed: 0,annotations,answer_id,answer_str,answer_type,complement_id,image_id,one_hot_index,predicted_answer,question_id,question_str,question_type,correct,partial
52183,"[lights, lights, lights, lights, lights, light...",53979120,lights,other,,539791,321,birds,5397912,what's at the top of the poles,none of the above,0,0.0
52186,"[black and white, black and white striped, bla...",18924110,black,other,,189241,11,red,1892411,what color apron is the woman wearing,what color,0,0.0
52187,"[white silver, gray and white, white, white an...",18924120,white,other,,189241,5,red,1892412,what color is the vehicles,what color is the,0,0.0
52189,"[green and yellow, green, yellow and green, gr...",11380110,green and yellow,other,,113801,814,red,1138011,what color is the bus,what color is the,0,0.0
52193,"[helmet, helmet, helmet, helmet, helmet, helme...",40976310,helmet,other,,409763,91,hat,4097631,what is on the man's head,what is on the,0,0.333333
52202,"[flowers, flowers, flowers, sunglass, flowers ...",8603610,flowers,other,,86036,73,hat,860361,what is on the girl's head,what is on the,0,0.0
52205,"[log, tree, tree, tree, log, tree, log, tree, ...",53298910,tree,other,,532989,110,rocks,5329891,what is laying on the ground behind the giraffe,what is,0,0.0
52207,"[yellow, yellow, yellow white black, yellow, y...",20132610,yellow,other,,201326,12,white,2013261,what color is he wearing,what color is,0,0.0
52208,"[red black, red white black, red, orange, red ...",20132620,red and black,other,,201326,436,red,2013262,what color is the racquet,what color is the,0,0.333333
52212,"[table, on table, on right, on right, by napki...",1603000,table,other,,16030,77,on table,160300,where is the fork,where is the,0,1.0


In [21]:
# create column with number of words in answers
df['answer_length'] = df['answer_str'].str.split().str.len()
df.tail(10)

Unnamed: 0,annotations,answer_id,answer_str,answer_type,complement_id,image_id,one_hot_index,predicted_answer,question_id,question_str,question_type,correct,partial,answer_length
52203,"[yes, yes, yes, yes, yes, yes, yes, yes, yes, ...",8603620,yes,yes/no,,86036,1,no,860362,are the girls topless,are the,0,0.0,1
52204,"[yes, yes, yes, yes, yes, yes, yes, yes, yes, ...",53298900,yes,yes/no,,532989,1,no,5329890,is the giraffe at the zoo,is the,0,0.0,1
52205,"[log, tree, tree, tree, log, tree, log, tree, ...",53298910,tree,other,,532989,110,rocks,5329891,what is laying on the ground behind the giraffe,what is,0,0.0,1
52206,"[yes, yes, yes, yes, yes, yes, yes, no, yes, yes]",20132600,yes,yes/no,,201326,1,yes,2013260,is the tennis player wearing a nike t shirt,is the,1,1.0,1
52207,"[yellow, yellow, yellow white black, yellow, y...",20132610,yellow,other,,201326,12,white,2013261,what color is he wearing,what color is,0,0.0,1
52208,"[red black, red white black, red, orange, red ...",20132620,red and black,other,,201326,436,red,2013262,what color is the racquet,what color is the,0,0.333333,3
52209,"[night, night, night, night, night, night, nig...",47747000,night,other,,477470,92,night,4774700,what time of day was this photo taken,what time,1,1.0,1
52210,"[yes, yes, yes, no, no, yes, yes, yes, yes, yes]",47747010,yes,yes/no,,477470,1,yes,4774701,is this the right atmosphere for dracula,is this,1,1.0,1
52211,"[stop, stop, stop, stop, stop, stop, stop, sto...",47747020,stop,other,,477470,50,stop,4774702,what does the traffic light say to do,what does the,1,1.0,1
52212,"[table, on table, on right, on right, by napki...",1603000,table,other,,16030,77,on table,160300,where is the fork,where is the,0,1.0,1


In [22]:
# number of incorrect predictions by answer length
df[(df.answer_type=='other') & (df.correct==0)].groupby(['answer_length'])['annotations'].count()

answer_length
0        5
1    11058
2      692
3      304
4        8
Name: annotations, dtype: int64

In [23]:
# number of correct predictions by answer length
df[(df.answer_type=='other') & (df.correct==1)].groupby(['answer_length'])['annotations'].count()

answer_length
1    9940
2     529
3     147
4       3
Name: annotations, dtype: int64
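
The two tables above can be turned into proportions with a little arithmetic. The sketch below hard-codes the counts printed by cells 22 and 23; the helper name `one_word_share` is introduced only for illustration.

```python
# Counts copied from the outputs of cells 22 and 23 ("other" answer type).
incorrect_by_length = {0: 5, 1: 11058, 2: 692, 3: 304, 4: 8}
correct_by_length = {1: 9940, 2: 529, 3: 147, 4: 3}


def one_word_share(counts):
    """Fraction of rows whose ground-truth answer is exactly one word."""
    return counts.get(1, 0) / sum(counts.values())


def accuracy_for(lengths):
    """Accuracy restricted to answers whose word count is in `lengths`."""
    right = sum(correct_by_length.get(n, 0) for n in lengths)
    wrong = sum(incorrect_by_length.get(n, 0) for n in lengths)
    return right / (right + wrong)


print('One-word share of incorrect predictions: {:.1%}'.format(one_word_share(incorrect_by_length)))
print('One-word share of correct predictions: {:.1%}'.format(one_word_share(correct_by_length)))
print('Accuracy on one-word answers: {:.1%}'.format(accuracy_for([1])))
print('Accuracy on multi-word answers: {:.1%}'.format(accuracy_for([2, 3, 4])))
```

With these counts, one-word answers make up roughly 92% of the incorrect predictions and roughly 94% of the correct ones, so answer length alone does not separate the two groups.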

### 1c. Number Answer Type

In [24]:
# accuracy by question type
df[df.answer_type=='number'].groupby(['question_type'])[['correct', 'partial']].mean().sort_values(['correct'])

question_type,correct,partial
are there,0.0,0.0
what are the,0.0,0.0
was,0.0,0.333333
is this person,0.0,0.0
what type of,0.0,0.0
is there,0.0,1.0
is the man,0.0,0.0
is the,0.0,0.0
is this,0.0,0.0
is he,0.0,0.0


In [25]:
# number of data points by question type
df[(df.answer_type=='number')].groupby(['question_type'])['annotations'].count().sort_values(ascending=False)

question_type
how many                  5449
how many people are        508
how many people are in     221
how                        136
what                       129
what number is             118
what is the                 83
none of the above           41
which                       13
what does the                8
what time                    6
what is                      5
does the                     4
is                           2
is this                      2
what are the                 2
is the                       1
is he                        1
is the man                   1
is there                     1
what type of                 1
is this person               1
was                          1
are there                    1
Name: annotations, dtype: int64
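
Most of the question types listed with 0% accuracy in cell 24 have only one or two data points, so their means say little on their own. One possible way to keep this visible is to report the group size next to the means and filter on it. The sketch below uses a small made-up DataFrame, and the `min_count` threshold is an arbitrary illustrative choice.

```python
import pandas as pd

# Toy data standing in for the number-answer rows of `df` (values are made up).
toy = pd.DataFrame({
    'question_type': ['how many', 'how many', 'how many', 'is there', 'was'],
    'correct': [1, 0, 0, 0, 0],
    'partial': [1.0, 0.0, 1 / 3, 1.0, 1 / 3],
})

# Named aggregation: group size plus mean exact and partial accuracy.
summary = toy.groupby('question_type').agg(
    n=('correct', 'size'),
    correct=('correct', 'mean'),
    partial=('partial', 'mean'),
)

# Keep only question types with enough rows to be meaningful.
min_count = 2  # hypothetical threshold
reliable = summary[summary['n'] >= min_count].sort_values('correct')
print(reliable)
```

Applied to the real data, the same pattern would put the single-row question types in context instead of listing them alongside "how many" with its 5,449 rows.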

In [26]:
# examples of incorrect predictions for number answer type
# df[(df.answer_type=='number') & (df.correct==1) & (df.question_type=='how many people are')].tail(10)
df[(df.answer_type=='number') & (df.correct==0)].tail(10)

Unnamed: 0,annotations,answer_id,answer_str,answer_type,complement_id,image_id,one_hot_index,predicted_answer,question_id,question_str,question_type,correct,partial,answer_length
52103,"[more than 15, 20, many, many, 60, 20, 10, 22,...",31425110,many,number,,314251,161,1,3142511,how many trees are by the road,how many,0,0.0,1
52106,"[2, fork, 3, 4, 2, 1, 2, 2, 3, 2]",53063020,2,number,,530630,3,4,5306302,how many silverware items are there,how many,0,0.333333,1
52136,"[4, 7, 4, 5, 2, 3, 6, 6, 10, 15]",22553200,4,number,,225532,9,2,2255320,how many buildings are in this picture,how many,0,0.333333,1
52137,"[20, 20, 20, 20, 20, 20, 20, 20, 60, 20]",22553210,20,number,,225532,112,2,2255321,how many mph,how many,0,0.0,1
52149,"[1, 1, 1, 1, 1, obits, 1, 1, 1, 1]",51762910,1,number,,517629,4,2,5176291,how many doors are in the room,how many,0,0.0,1
52162,"[5, 1, 6, 4, 4, 3, 4, 4, 10, 4]",23307910,4,number,,233079,9,2,2330791,how many benches are in the lobby,how many,0,0.0,1
52173,"[20, 4, 20, 20, 9, 3, lot, 1, 23, 19]",16313220,20,number,,163132,112,3,1631322,how many lights are below the plane,how many,0,0.333333,1
52180,"[2, 2, 2, 2, 2, 2, 2, 2, 2, 2]",3766000,2,number,,37660,3,3,376600,how many items are in the hand,how many,0,0.0,1
52188,"[0, 0, 0, 0, 0, 1, 0, 0, 0, 0]",11380100,0,number,,113801,19,2,1138010,how many boats are in the photo,how many,0,0.0,1
52191,"[2, 2, 2, 2, 2, 2, 2, 2, 2, 2]",40976320,2,number,,409763,3,13,4097632,what is the number on the back of the batter o...,what is the,0,0.0,1


In [27]:
# create column with unk token flag
df['unk_flag'] = df['predicted_answer'] == '<unk>'
df.head(5)

Unnamed: 0,annotations,answer_id,answer_str,answer_type,complement_id,image_id,one_hot_index,predicted_answer,question_id,question_str,question_type,correct,partial,answer_length,unk_flag
0,"[16, 16, 16, 16, 16, 16, 16, 16, 16, 16]",9786500,16,number,,97865,241,fire hydrant,978650,what # is it,what,0,0.0,1,False
1,"[yes, yes, yes, yes, yes, yes, yes, yes, yes, ...",9786510,yes,yes/no,,97865,1,no,978651,is there people here,is there,0,0.0,1,False
2,"[container, frisbee golf, frisbee golf goal, f...",9786520,frisbee,other,,97865,24,fire hydrant,978652,what is the object on the right,what is the,0,0.0,1,False
3,"[garbage, no, no, no, no, no, no, no, no, no]",57484500,no,yes/no,,574845,2,no,5748450,is this inside,is this,1,1.0,1,False
4,"[yes, yes, yes, yes, yes, yes, yes, yes, yes, ...",57484510,yes,yes/no,,574845,1,yes,5748451,could someone sleep here,could,1,1.0,1,False


In [28]:
# number of incorrect answers by unk token flag
df[(df.answer_type=='number') & (df.correct==0)].groupby(['unk_flag'])['annotations'].count()

unk_flag
False    4862
Name: annotations, dtype: int64

In [29]:
# examples of incorrect predictions that are unk tokens (none in this run)
df[(df.answer_type=='number') & (df.correct==0) & (df.predicted_answer == '<unk>')].tail(50)

Unnamed: 0,annotations,answer_id,answer_str,answer_type,complement_id,image_id,one_hot_index,predicted_answer,question_id,question_str,question_type,correct,partial,answer_length,unk_flag


In [30]:
# examples of incorrect predictions that are not unk tokens
df[(df.answer_type=='number') & (df.correct==0) & (df.predicted_answer != '<unk>')].head(10)

Unnamed: 0,annotations,answer_id,answer_str,answer_type,complement_id,image_id,one_hot_index,predicted_answer,question_id,question_str,question_type,correct,partial,answer_length,unk_flag
0,"[16, 16, 16, 16, 16, 16, 16, 16, 16, 16]",9786500,16,number,,97865,241,fire hydrant,978650,what # is it,what,0,0.0,1,False
10,"[2, 2, 2, 2, 2, 2, 2, 2, 2, 2]",19991810,2,number,,199918,3,1,1999181,how many waterfalls are entering the pool,how many,0,0.0,1,False
46,"[16, 15, 14, 13, 18, 10, 16, 15, lot, 13]",38084200,13,number,,380842,137,3,3808420,how many crosswalk stripes painted on the street,how many,0,0.0,1,False
50,"[5, 4, 4, 5, 5, 5, 5, 5, 5, 5]",57433200,5,number,,574332,14,2,5743320,how many umbrellas are open,how many,0,0.0,1,False
81,"[2, 2, 2, 2, 2, 2, 2, 4, 2, 2]",44040010,2,number,,440400,3,1,4404001,how many people are behind the woman,how many people are,0,0.0,1,False
83,"[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]",53446720,1,number,,534467,4,2,5344672,how many towels are in the photo,how many,0,0.0,1,False
97,"[1, 1, 1, 1, 1, 2, 1, 1, 1, 1]",30653620,1,number,,306536,4,2,3065362,how many clocks,how many,0,0.333333,1,False
134,"[bananas, 7, boggles, 7, 7, 7, 7, 7, 6, 7]",49597510,7,number,,495975,31,3,4959751,how many bunches are on this scene,how many,0,0.0,1,False
135,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]",18817300,0,number,,188173,19,2,1881730,how many kangaroos are there,how many,0,0.0,1,False
136,"[5, 5, 5, 5, 5, 5, 5, 5, 5, 5]",18817320,5,number,,188173,14,2,1881732,how many cars are in the photo,how many,0,0.0,1,False


#### Summary
- Best accuracy (68%): Yes/No answer type
- Second-best accuracy (47%): Other answer type
- Worst accuracy (28%): Number answer type

**Yes/No Answer Type**  
The model performs consistently across question types, with accuracy above 60% for all but three: "what is the" (0%), "what" (25%), and "why" (50%), which together account for only 7 data points.  The phrasing of these question types does not point to a yes or no answer, so it is no surprise that the model does poorly on them.  For the other question types, the model does best when the questions have clear and direct answers and performs poorly when the questions are abstract, subjective, or require common-sense knowledge.  It is also interesting to note that almost half of the incorrect predictions (48.7%) matched the answer given by at least one human, and 30.3% matched at least two humans.

**Other Answer Type**  
We explored whether multi-word answers led to low accuracy for this answer type, since multi-word phrases are more likely to be excluded from the training vocabulary.  However, we found that most of the incorrect predictions (about 92%) have one-word answers and only about 8% have multi-word answers.  In general, the model does best at predicting rooms, animals, sports, and colors for this answer type.


**Number Answer Type**  
The model does worst on the number answer type, with only 28% accuracy.  Most of the questions for this answer type involve counting (i.e., "how many").  None of the 4,862 incorrect predictions in this run is the UNK token, so UNK predictions do not explain the errors here.  Some of the errors involve answers that relate to time or number sequences (such as a bus number or the number on a jersey) that need to be identified on objects.



In future iterations, we could measure accuracy using the distance between word vectors, so that words or phrases with similar meanings receive credit (e.g., "nighttime" vs. "night").  We could also find better ways to tokenize times and numbers and explore different UNK replacement techniques.
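
As a rough illustration of the word-vector idea, the sketch below gives credit when the cosine similarity between the predicted and ground-truth answer vectors exceeds a threshold. Everything here is an assumption made for illustration: the three-dimensional `toy_vectors` are invented (real embeddings would come from a pretrained model), and the `soft_match` helper and its `threshold` value are hypothetical rather than part of this project.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Invented 3-d embeddings, for illustration only.
toy_vectors = {
    'night': [0.90, 0.10, 0.00],
    'nighttime': [0.85, 0.15, 0.05],
    'red': [0.00, 0.20, 0.95],
}


def soft_match(prediction, answer, vectors, threshold=0.9):
    """Score 1.0 for an exact match, the similarity if it clears the threshold, else 0.0."""
    if prediction == answer:
        return 1.0
    if prediction not in vectors or answer not in vectors:
        return 0.0  # no embedding available, so no soft credit
    similarity = cosine_similarity(vectors[prediction], vectors[answer])
    return similarity if similarity >= threshold else 0.0


print(soft_match('nighttime', 'night', toy_vectors))  # near-synonym earns credit
print(soft_match('red', 'night', toy_vectors))        # unrelated word earns none
```

The threshold would need tuning on held-out data, since a value that is too low would reward answers that are related but wrong (for example, two different colors).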
