In [1]:
import json
import pandas as pd

_Download appropriate results file(s) from Google Storage bucket.  (URLs are available in MLFlow.)_

In [2]:
def _partial_acc(obs):
    """
        Inner function used to calculate a weighted accuracy based on how many human
        raters' annotations the predicted value matches.  Applied to a single row (Series).
        Returns min(1, matches / 3), so three or more matching annotations earn full credit.

        Args:
            obs: pandas.Series containing 'predicted_answer' and 'annotations' entries
    """
    prediction = obs['predicted_answer']
    annotations = obs['annotations']
    matches = 0
    for a in annotations:
        if prediction.strip().lower() == a.strip().lower():
            matches += 1
    return min(1, matches/3)
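
As a quick illustration of the scoring rule, the sketch below applies a condensed copy of `_partial_acc` to a few made-up rows. The `toy` frame and its values are hypothetical and are not taken from the results file. Each matching annotation earns one third of a point, capped at 1.

```python
import pandas as pd

def _partial_acc(obs):
    # Condensed copy of the function above: 1/3 credit per matching annotation, capped at 1.
    prediction = obs['predicted_answer']
    matches = sum(1 for a in obs['annotations']
                  if prediction.strip().lower() == a.strip().lower())
    return min(1, matches / 3)

# Hypothetical rows, invented for illustration only.
toy = pd.DataFrame({
    'predicted_answer': ['frisbee', 'hat', 'No ', 'red'],
    'annotations': [
        ['container', 'frisbee', 'frisbee golf', 'Frisbee'],  # 2 matches (case-insensitive)
        ['helmet', 'helmet', 'hat'],                          # 1 match
        ['no'] * 10,                                          # 10 matches, capped at 1
        ['blue', 'green'],                                    # 0 matches
    ],
})
toy['partial'] = toy.apply(_partial_acc, axis=1)
print(toy[['predicted_answer', 'partial']])
```

Matching ignores case and surrounding whitespace, so `'No '` counts as a match for `'no'`.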

## Download appropriate results file from Google Storage bucket.

Base Model (Top 1000 classification):
- `gs://mids-w266-mw/test2015_results_san_expt0_2018-12-06-05:05:51.json`

Best Enhanced Model (Top 1000 classification):
- `gs://mids-w266-mw/mlflow/52/b25acc49a4b04ac48d0d9e5be5a4f020/artifacts/test2015_results_mrr_san_expt28_2018-12-07-19:57:00.json`

Update `json_results_path` variable in the next cell:

### 1. Yang's Original

In [3]:
# Base Model (Top 1000 classification)
json_results_path = '/home/mwinton/report_results/test2015_results_san_expt0_2018-12-06-05:05:51.json'

with open(json_results_path) as f:
    resultsj = json.load(f)
    
df = pd.DataFrame(resultsj)
df['correct'] = (df['answer_str'].str.strip().str.lower() == df['predicted_answer'].str.strip().str.lower()).astype(int)
df['partial'] = df.apply(_partial_acc, axis=1)

In [4]:
df.describe()

Unnamed: 0,answer_id,image_id,one_hot_index,question_id,correct,partial
count,52212.0,52212.0,52212.0,52212.0,52212.0,52212.0
mean,29125620.0,291256.093733,84.372654,2912562.0,0.514594,0.595476
std,16826540.0,168265.375679,181.853497,1682654.0,0.499792,0.469363
min,4200.0,42.0,1.0,420.0,0.0,0.0
25%,14676460.0,146764.5,1.0,1467646.0,0.0,0.0
50%,29183400.0,291834.0,5.0,2918340.0,1.0,1.0
75%,43617210.0,436172.0,58.0,4361721.0,1.0,1.0
max,58191320.0,581913.0,1000.0,5819132.0,1.0,1.0


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52212 entries, 0 to 52211
Data columns (total 13 columns):
annotations         52212 non-null object
answer_id           52212 non-null int64
answer_str          52212 non-null object
answer_type         52212 non-null object
complement_id       0 non-null object
image_id            52212 non-null int64
one_hot_index       52212 non-null int64
predicted_answer    52212 non-null object
question_id         52212 non-null int64
question_str        52212 non-null object
question_type       52212 non-null object
correct             52212 non-null int64
partial             52212 non-null float64
dtypes: float64(1), int64(5), object(7)
memory usage: 5.2+ MB


In [6]:
df.head()

Unnamed: 0,annotations,answer_id,answer_str,answer_type,complement_id,image_id,one_hot_index,predicted_answer,question_id,question_str,question_type,correct,partial
0,"[16, 16, 16, 16, 16, 16, 16, 16, 16, 16]",9786500,16,number,,97865,241,frisbee,978650,what # is it,what,0,0.0
1,"[yes, yes, yes, yes, yes, yes, yes, yes, yes, ...",9786510,yes,yes/no,,97865,1,no,978651,is there people here,is there,0,0.0
2,"[container, frisbee golf, frisbee golf goal, f...",9786520,frisbee,other,,97865,24,frisbee,978652,what is the object on the right,what is the,1,0.666667
3,"[garbage, no, no, no, no, no, no, no, no, no]",57484500,no,yes/no,,574845,2,no,5748450,is this inside,is this,1,1.0
4,"[yes, yes, yes, yes, yes, yes, yes, yes, yes, ...",57484510,yes,yes/no,,574845,1,yes,5748451,could someone sleep here,could,1,1.0


In [7]:
# count by answer type
df.groupby(['answer_type'])['correct'].count().sort_values(ascending=False)

answer_type
yes/no    22792
other     22685
number     6735
Name: correct, dtype: int64

In [8]:
pd.set_option('display.max_rows', 75)

In [9]:
# count by question type
df.groupby(['question_type'])['correct'].count().sort_values(ascending=False)

question_type
how many                    5478
is the                      4871
what color is the           4019
what                        2952
what is the                 2342
is this                     2180
none of the above           2038
is this a                   2034
are the                     1577
what is                     1294
is there a                  1245
what kind of                1178
is it                       1020
does the                     924
is                           900
is there                     891
what color are the           872
what type of                 803
are these                    755
are there                    709
is the man                   662
what are the                 645
where is the                 639
are                          636
does this                    607
what is the man              603
which                        557
how many people are          512
do                           445
what is on the               

In [10]:
# accuracy overall
acc = df['correct'].mean()
partial_acc = df['partial'].mean()
print('Accuracy = {:.3f}. Partial Accuracy = {:.3f}.'.format(acc, partial_acc))

Accuracy = 0.515. Partial Accuracy = 0.595.


In [11]:
pd.set_option('display.max_rows', 75)

In [12]:
# accuracy by question type
acc_by_qtype = df.groupby(['question_type'])[['correct', 'partial']] \
    .mean() \
    .sort_values(['correct'], ascending=False)
acc_by_qtype

question_type,correct,partial
what sport is,0.874608,0.882968
what room is,0.847458,0.853107
is there a,0.841767,0.899866
could,0.825581,0.905039
is there,0.758698,0.832772
are there,0.7433,0.802069
do you,0.6875,0.795139
does the,0.681818,0.780303
has,0.673077,0.770513
is it,0.665686,0.751634


In [13]:
# accuracy by answer type
acc_by_anstype = df.groupby(['answer_type'])[['correct', 'partial']] \
    .mean() \
    .sort_values(['correct'], ascending=False)
acc_by_anstype

answer_type,correct,partial
yes/no,0.664926,0.771119
other,0.444479,0.497495
number,0.242019,0.331106


In [14]:
acc_by_anstype.to_dict('index')

{'yes/no': {'correct': 0.6649262899262899, 'partial': 0.7711185211185259},
 'other': {'correct': 0.4444787304386158, 'partial': 0.49749467342590886},
 'number': {'correct': 0.24201930215293244, 'partial': 0.3311061618411277}}

### 1a. Yes/No Answer Type

In [15]:
# accuracy by question type
df[df.answer_type=='yes/no'].groupby(['question_type'])[['correct', 'partial']].mean().sort_values(['correct'])

question_type,correct,partial
what,0.0,0.0
what is the,0.0,0.0
why,0.5,0.5
is this person,0.594059,0.749175
none of the above,0.600671,0.756991
was,0.612335,0.71072
are,0.612698,0.732804
are these,0.627747,0.745879
are they,0.632022,0.719101
is that a,0.632258,0.767742


In [16]:
# examples from the lowest-accuracy question types for yes/no answer type ('what', 'what is the', 'why') - 7 data points
df[(df.answer_type=='yes/no') & \
   ((df.question_type=='why') | (df.question_type=='what') | (df.question_type=='what is the'))]

Unnamed: 0,annotations,answer_id,answer_str,answer_type,complement_id,image_id,one_hot_index,predicted_answer,question_id,question_str,question_type,correct,partial
7754,"[yes, yes, yes, company logo, yes, yes, yes, y...",22455400,yes,yes/no,,224554,1,people,2245540,what there a crown on the train,what,0,0.0
19452,"[no, no, no, no, no, no, no, no, no, no]",8474900,no,yes/no,,84749,2,cat,847490,what this picture taken indoors,what,0,0.0
20140,"[make feeding easier, yes, yes, yes, yes, view...",18663720,yes,yes/no,,186637,1,yes,1866372,why is there a wooden platform behind the fence,why,1,1.0
23660,"[no, no, no, no, no, yes, no, yes, yes, yes]",8276500,no,yes/no,,82765,2,bed,827650,what someone using the computer in bed,what,0,0.0
44443,"[no, emergency, no, no, ve, no, no, can't see,...",3987110,no,yes/no,,39871,2,microwave,398711,what word is show on the bus,what,0,0.0
44703,"[yes, 2 men skateboarding, old picture, color,...",27829010,yes,yes/no,,278290,1,skateboard,2782901,what is the picture white and black,what is the,0,0.0
50225,"[yes, his preference, yes, yes, yes, because h...",9924220,yes,yes/no,,99242,1,protection,992422,why does the man have a beard,why,0,0.0
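
The three lowest-ranked question types above cover only seven rows in total, so their accuracy figures are noisy. Below is a minimal sketch, on made-up toy data, of reporting the group size next to the mean so that very small groups can be filtered out before ranking. The `toy` frame and the `min_rows` cutoff are assumptions for illustration, not part of the original analysis.

```python
import pandas as pd

# Hypothetical data, invented for illustration only.
toy = pd.DataFrame({
    'question_type': ['what', 'what', 'is the', 'is the', 'is the', 'is the', 'why'],
    'correct':       [0, 0, 1, 1, 0, 1, 1],
    'partial':       [0.0, 0.0, 1.0, 1.0, 0.333, 1.0, 1.0],
})

# Named aggregation: group size plus mean accuracy for each question type.
summary = (toy.groupby('question_type')
              .agg(n=('correct', 'size'),
                   correct=('correct', 'mean'),
                   partial=('partial', 'mean'))
              .sort_values('correct'))

min_rows = 3  # hypothetical cutoff; choose based on the real group sizes
reliable = summary[summary['n'] >= min_rows]

print(summary)
print(reliable)
```

On the real results, the same pattern would be applied to `df[df.answer_type=='yes/no']` in place of `toy`.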


In [17]:
# examples of <60% accuracy for yes/no answer type with correct answers - 914 data points
df[(df.answer_type=='yes/no') & \
   ((df.question_type=='are there any') | (df.question_type=='none of the above')) & \
   (df.correct==1)].tail(10)

Unnamed: 0,annotations,answer_id,answer_str,answer_type,complement_id,image_id,one_hot_index,predicted_answer,question_id,question_str,question_type,correct,partial
51644,"[no, no, no, no, no, no, no, no, no, no]",57382320,no,yes/no,,573823,2,no,5738232,are there any people,are there any,1,1.0
51691,"[yes, yes, yes, yes, yes, yes, yes, yes, yes, ...",13674020,yes,yes/no,,136740,1,yes,1367402,if the grass gets much higher could the smalle...,none of the above,1,1.0
51801,"[yes, yes, yes, yes, yes, yes, yes, yes, yes, ...",41001900,yes,yes/no,,410019,1,yes,4100190,should this sport be done away from house stru...,none of the above,1,1.0
51828,"[yes, yes, yes, yes, yes, yes, yes, yes, yes, ...",37031520,yes,yes/no,,370315,1,yes,3703152,does he wear glasses,none of the above,1,1.0
51852,"[no, no, no, no, no, no, no, no, no, 0]",21420400,no,yes/no,,214204,2,no,2142040,are there any towels in this bathroom,are there any,1,1.0
51911,"[yes, maybe, yes, yes, yes, yes, yes, yes, yes...",52385410,yes,yes/no,,523854,1,yes,5238541,can someone eat outside,none of the above,1,1.0
51979,"[yes, yes, yes, yes, yes, yes, yes, yes, yes, ...",28985510,yes,yes/no,,289855,1,yes,2898551,are there any palm trees in this picture,are there any,1,1.0
52020,"[yes, yes, yes, yes, yes, yes, yes, yes, yes, ...",40771720,yes,yes/no,,407717,1,yes,4077172,would a vegetarian eat this food,none of the above,1,1.0
52055,"[no, no, no, no, no, no, no, no, no, no]",25803610,no,yes/no,,258036,2,no,2580361,does she have any clothes on,none of the above,1,1.0
52152,"[no, yes, yes, no, no, yes, yes, yes, yes, no]",6783220,yes,yes/no,,67832,1,yes,678322,would you eat this,none of the above,1,1.0


In [18]:
# examples of <60% accuracy for yes/no answer type with incorrect answers - 1555 data points
df[(df.answer_type=='yes/no') & \
   ((df.question_type=='are there any') | (df.question_type=='none of the above')) & \
   (df.correct==0)].tail(10)

Unnamed: 0,annotations,answer_id,answer_str,answer_type,complement_id,image_id,one_hot_index,predicted_answer,question_id,question_str,question_type,correct,partial
51391,"[no, no, no, no, no, no, no, no, no, no]",54913600,no,yes/no,,549136,2,yes,5491360,did the bear climb up the pole,none of the above,0,0.0
51428,"[yes, yes, yes, yes, yes, yes, yes, yes, yes, ...",51220600,yes,yes/no,,512206,1,no,5122060,are there any butterflies in the photo,are there any,0,0.0
51547,"[no, no, no, no, no, no, no, no, no, no]",40843920,no,yes/no,,408439,2,yes,4084392,if a person swan to shore would the be able to...,none of the above,0,0.0
51845,"[no, yes, no, no, no, no, no, yes, no, no]",53550600,no,yes/no,,535506,2,yes,5355060,does she look happy,none of the above,0,0.666667
51879,"[no, no, no, no, no, no, no, no, no, no]",15277620,no,yes/no,,152776,2,yes,1527762,did these come from a money garden,none of the above,0,0.0
51888,"[no, no, no, no, no, no, no, no, no, no]",24424600,no,yes/no,,244246,2,yes,2442460,are there any clouds in the sky,are there any,0,0.0
51901,"[no, no, no, no, no, no, no, no, yes, no]",29378200,no,yes/no,,293782,2,yes,2937820,will this street sign wave in the wind,none of the above,0,0.333333
51985,"[no, no, no, no, no, no, no, no, no, no]",53321720,no,yes/no,,533217,2,yes,5332172,does it look like a cloudy day,none of the above,0,0.0
52038,"[no, no, no, no, no, no, no, no, no, no]",34400520,no,yes/no,,344005,2,yes,3440052,will it rain soon,none of the above,0,0.0
52168,"[yes, no, yes, yes, yes, yes, yes, yes, yes, yes]",32166500,yes,yes/no,,321665,1,no,3216650,will this clock keep time,none of the above,0,0.333333


In [19]:
# how does model compare to humans?
# what percentage of incorrect answers predicted by model are also predicted by humans?
num_incorrect = df[(df.answer_type=='yes/no') & (df.correct==0)]['annotations'].count()
num_atleast1 = df[(df.answer_type=='yes/no') & (df.correct==0) & (df.partial>0)]['annotations'].count() 
num_atleast2 = df[(df.answer_type=='yes/no') & (df.correct==0) & (df.partial>0.35)]['annotations'].count() 
num_atleast3 = df[(df.answer_type=='yes/no') & (df.correct==0) & (df.partial==1)]['annotations'].count() 

print('Percentage of incorrect answers predicted by at least one human: {:.1%}'.format(num_atleast1/num_incorrect))
print('Percentage of incorrect answers predicted by at least two humans: {:.1%}'.format(num_atleast2/num_incorrect))
print('Percentage of incorrect answers predicted by at least three humans: {:.1%}'.format(num_atleast3/num_incorrect))

Percentage of incorrect answers predicted by at least one human: 47.8%
Percentage of incorrect answers predicted by at least two humans: 29.0%
Percentage of incorrect answers predicted by at least three humans: 18.2%
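
The thresholds on `partial` in the cell above stand in for match counts. Because `partial` equals min(1, matches/3), `> 0` means at least one matching annotation, `> 0.35` at least two, and `== 1` at least three. A minimal sketch, on made-up toy rows rather than the real results, that counts the matching annotations explicitly so that no float thresholds are needed (`count_matches` and `toy` are hypothetical names):

```python
import pandas as pd

def count_matches(prediction, annotations):
    """Number of annotations equal to the prediction, ignoring case and whitespace."""
    target = prediction.strip().lower()
    return sum(1 for a in annotations if a.strip().lower() == target)

# Hypothetical incorrect yes/no predictions, invented for illustration only.
toy = pd.DataFrame({
    'predicted_answer': ['yes', 'yes', 'no', 'yes'],
    'annotations': [
        ['no'] * 10,                # 0 annotators agree with the prediction
        ['no'] * 9 + ['yes'],       # 1 annotator agrees
        ['yes'] * 8 + ['no'] * 2,   # 2 annotators agree
        ['no'] * 7 + ['yes'] * 3,   # 3 annotators agree
    ],
})
toy['n_matches'] = [count_matches(p, a)
                    for p, a in zip(toy['predicted_answer'], toy['annotations'])]

# Share of rows where at least k annotators gave the predicted answer.
shares = {k: float((toy['n_matches'] >= k).mean()) for k in (1, 2, 3)}
for k, share in shares.items():
    print('at least {} human(s): {:.1%}'.format(k, share))
```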


### 1b. Other Answer Type

In [20]:
# accuracy by question type
df[df.answer_type=='other'].groupby(['question_type'])[['correct', 'partial']].mean().sort_values(['correct'])

question_type,correct,partial
can you,0.0,0.2
how many people are in,0.0,0.0
how many,0.034483,0.08046
are there,0.142857,0.214286
why is the,0.207792,0.281385
why,0.230303,0.268687
is he,0.25,0.75
how many people are,0.25,0.083333
are these,0.259259,0.259259
is that a,0.263158,0.298246


In [21]:
# examples of incorrect predictions for other answer type
df[(df.answer_type=='other') & (df.correct==0)].tail(10)

Unnamed: 0,annotations,answer_id,answer_str,answer_type,complement_id,image_id,one_hot_index,predicted_answer,question_id,question_str,question_type,correct,partial
52183,"[lights, lights, lights, lights, lights, light...",53979120,lights,other,,539791,321,pigeons,5397912,what's at the top of the poles,none of the above,0,0.0
52186,"[black and white, black and white striped, bla...",18924110,black,other,,189241,11,red,1892411,what color apron is the woman wearing,what color,0,0.0
52187,"[white silver, gray and white, white, white an...",18924120,white,other,,189241,5,red,1892412,what color is the vehicles,what color is the,0,0.0
52189,"[green and yellow, green, yellow and green, gr...",11380110,green and yellow,other,,113801,814,red,1138011,what color is the bus,what color is the,0,0.0
52193,"[helmet, helmet, helmet, helmet, helmet, helme...",40976310,helmet,other,,409763,91,hat,4097631,what is on the man's head,what is on the,0,0.333333
52200,"[night, evening, twilight, night time, dusk, n...",26004810,night,other,,260048,92,sunset,2600481,what time of day is it,what time,0,0.0
52202,"[flowers, flowers, flowers, sunglass, flowers ...",8603610,flowers,other,,86036,73,hat,860361,what is on the girl's head,what is on the,0,0.0
52205,"[log, tree, tree, tree, log, tree, log, tree, ...",53298910,tree,other,,532989,110,grass,5329891,what is laying on the ground behind the giraffe,what is,0,0.0
52207,"[yellow, yellow, yellow white black, yellow, y...",20132610,yellow,other,,201326,12,white,2013261,what color is he wearing,what color is,0,0.0
52208,"[red black, red white black, red, orange, red ...",20132620,red and black,other,,201326,436,red,2013262,what color is the racquet,what color is the,0,0.333333


In [22]:
# create column with number of words in answers
df['answer_length'] = df.apply(lambda data: len(data['answer_str'].split()), axis=1)
df.tail(10)

Unnamed: 0,annotations,answer_id,answer_str,answer_type,complement_id,image_id,one_hot_index,predicted_answer,question_id,question_str,question_type,correct,partial,answer_length
52202,"[flowers, flowers, flowers, sunglass, flowers ...",8603610,flowers,other,,86036,73,hat,860361,what is on the girl's head,what is on the,0,0.0,1
52203,"[yes, yes, yes, yes, yes, yes, yes, yes, yes, ...",8603620,yes,yes/no,,86036,1,no,860362,are the girls topless,are the,0,0.0,1
52204,"[yes, yes, yes, yes, yes, yes, yes, yes, yes, ...",53298900,yes,yes/no,,532989,1,no,5329890,is the giraffe at the zoo,is the,0,0.0,1
52205,"[log, tree, tree, tree, log, tree, log, tree, ...",53298910,tree,other,,532989,110,grass,5329891,what is laying on the ground behind the giraffe,what is,0,0.0,1
52206,"[yes, yes, yes, yes, yes, yes, yes, no, yes, yes]",20132600,yes,yes/no,,201326,1,yes,2013260,is the tennis player wearing a nike t shirt,is the,1,1.0,1
52207,"[yellow, yellow, yellow white black, yellow, y...",20132610,yellow,other,,201326,12,white,2013261,what color is he wearing,what color is,0,0.0,1
52208,"[red black, red white black, red, orange, red ...",20132620,red and black,other,,201326,436,red,2013262,what color is the racquet,what color is the,0,0.333333,3
52209,"[night, night, night, night, night, night, nig...",47747000,night,other,,477470,92,night,4774700,what time of day was this photo taken,what time,1,1.0,1
52210,"[yes, yes, yes, no, no, yes, yes, yes, yes, yes]",47747010,yes,yes/no,,477470,1,no,4774701,is this the right atmosphere for dracula,is this,0,0.666667,1
52211,"[stop, stop, stop, stop, stop, stop, stop, sto...",47747020,stop,other,,477470,50,stop,4774702,what does the traffic light say to do,what does the,1,1.0,1


In [23]:
# number of incorrect predictions by answer length
df[(df.answer_type=='other') & (df.correct==0)].groupby(['answer_length'])['annotations'].count()

answer_length
0        5
1    11541
2      740
3      306
4       10
Name: annotations, dtype: int64

In [24]:
# number of correct predictions by answer length
df[(df.answer_type=='other') & (df.correct==1)].groupby(['answer_length'])['annotations'].count()

answer_length
1    9456
2     481
3     145
4       1
Name: annotations, dtype: int64

### 1c. Number Answer Type

In [25]:
# accuracy by question type
df[df.answer_type=='number'].groupby(['question_type'])[['correct', 'partial']].mean().sort_values(['correct'])

question_type,correct,partial
is the man,0.0,0.0
does the,0.0,0.0
what are the,0.0,0.0
was,0.0,0.333333
is this person,0.0,0.0
what type of,0.0,0.0
is,0.0,0.333333
is there,0.0,1.0
is the,0.0,0.0
what is,0.0,0.0


In [26]:
# number of data points by question type
df[(df.answer_type=='number')].groupby(['question_type'])['annotations'].count().sort_values(ascending=False)

question_type
how many                  5449
how many people are        508
how many people are in     221
how                        136
what                       129
what number is             118
what is the                 83
none of the above           41
which                       13
what does the                8
what time                    6
what is                      5
does the                     4
is                           2
is this                      2
what are the                 2
is the                       1
is he                        1
is the man                   1
is there                     1
what type of                 1
is this person               1
was                          1
are there                    1
Name: annotations, dtype: int64

In [27]:
# examples of incorrect predictions for number answer type
# df[(df.answer_type=='number') & (df.correct==1) & (df.question_type=='how many people are')].tail(10)
df[(df.answer_type=='number') & (df.correct==0)].tail(10)

Unnamed: 0,annotations,answer_id,answer_str,answer_type,complement_id,image_id,one_hot_index,predicted_answer,question_id,question_str,question_type,correct,partial,answer_length
52102,"[10, 10, 10, 10, i can't tell, few, 11, 10, 10...",31425100,10,number,,314251,40,2,3142510,how many motorcycle,how many,0,0.0,1
52103,"[more than 15, 20, many, many, 60, 20, 10, 22,...",31425110,many,number,,314251,161,4,3142511,how many trees are by the road,how many,0,0.0,1
52136,"[4, 7, 4, 5, 2, 3, 6, 6, 10, 15]",22553200,4,number,,225532,9,3,2255320,how many buildings are in this picture,how many,0,0.333333,1
52137,"[20, 20, 20, 20, 20, 20, 20, 20, 60, 20]",22553210,20,number,,225532,112,3,2255321,how many mph,how many,0,0.0,1
52149,"[1, 1, 1, 1, 1, obits, 1, 1, 1, 1]",51762910,1,number,,517629,4,2,5176291,how many doors are in the room,how many,0,0.0,1
52162,"[5, 1, 6, 4, 4, 3, 4, 4, 10, 4]",23307910,4,number,,233079,9,2,2330791,how many benches are in the lobby,how many,0,0.0,1
52173,"[20, 4, 20, 20, 9, 3, lot, 1, 23, 19]",16313220,20,number,,163132,112,3,1631322,how many lights are below the plane,how many,0,0.333333,1
52180,"[2, 2, 2, 2, 2, 2, 2, 2, 2, 2]",3766000,2,number,,37660,3,5,376600,how many items are in the hand,how many,0,0.0,1
52188,"[0, 0, 0, 0, 0, 1, 0, 0, 0, 0]",11380100,0,number,,113801,19,3,1138010,how many boats are in the photo,how many,0,0.0,1
52191,"[2, 2, 2, 2, 2, 2, 2, 2, 2, 2]",40976320,2,number,,409763,3,6,4097632,what is the number on the back of the batter o...,what is the,0,0.0,1


In [28]:
# create column with unk token flag
df['unk_flag'] = df.apply(lambda data: data['predicted_answer']=='<unk>', axis=1)
df.head(5)

Unnamed: 0,annotations,answer_id,answer_str,answer_type,complement_id,image_id,one_hot_index,predicted_answer,question_id,question_str,question_type,correct,partial,answer_length,unk_flag
0,"[16, 16, 16, 16, 16, 16, 16, 16, 16, 16]",9786500,16,number,,97865,241,frisbee,978650,what # is it,what,0,0.0,1,False
1,"[yes, yes, yes, yes, yes, yes, yes, yes, yes, ...",9786510,yes,yes/no,,97865,1,no,978651,is there people here,is there,0,0.0,1,False
2,"[container, frisbee golf, frisbee golf goal, f...",9786520,frisbee,other,,97865,24,frisbee,978652,what is the object on the right,what is the,1,0.666667,1,False
3,"[garbage, no, no, no, no, no, no, no, no, no]",57484500,no,yes/no,,574845,2,no,5748450,is this inside,is this,1,1.0,1,False
4,"[yes, yes, yes, yes, yes, yes, yes, yes, yes, ...",57484510,yes,yes/no,,574845,1,yes,5748451,could someone sleep here,could,1,1.0,1,False


In [29]:
# number of incorrect answers by unk token flag
df[(df.answer_type=='number') & (df.correct==0)].groupby(['unk_flag'])['annotations'].count()

unk_flag
False    5105
Name: annotations, dtype: int64

In [30]:
# examples of incorrect predictions that are unk tokens
df[(df.answer_type=='number') & (df.correct==0) & (df.predicted_answer == '<unk>')].tail(50)

Unnamed: 0,annotations,answer_id,answer_str,answer_type,complement_id,image_id,one_hot_index,predicted_answer,question_id,question_str,question_type,correct,partial,answer_length,unk_flag


In [31]:
# examples of incorrect predictions that are not unk tokens
df[(df.answer_type=='number') & (df.correct==0) & (df.predicted_answer != '<unk>')].head(10)

Unnamed: 0,annotations,answer_id,answer_str,answer_type,complement_id,image_id,one_hot_index,predicted_answer,question_id,question_str,question_type,correct,partial,answer_length,unk_flag
0,"[16, 16, 16, 16, 16, 16, 16, 16, 16, 16]",9786500,16,number,,97865,241,frisbee,978650,what # is it,what,0,0.0,1,False
10,"[2, 2, 2, 2, 2, 2, 2, 2, 2, 2]",19991810,2,number,,199918,3,1,1999181,how many waterfalls are entering the pool,how many,0,0.0,1,False
16,"[2, 2, 2, 2, 2, 2, 2, 2, 2, 2]",43068110,2,number,,430681,3,1,4306811,how many animals are there,how many,0,0.0,1,False
46,"[16, 15, 14, 13, 18, 10, 16, 15, lot, 13]",38084200,13,number,,380842,137,1,3808420,how many crosswalk stripes painted on the street,how many,0,0.0,1,False
50,"[5, 4, 4, 5, 5, 5, 5, 5, 5, 5]",57433200,5,number,,574332,14,3,5743320,how many umbrellas are open,how many,0,0.0,1,False
78,"[yes, yes, yes, 3, 3, 2, 3, 3, 2, 3]",892320,3,number,,8923,6,1,89232,how many different colored flowers are in fron...,how many,0,0.0,1,False
81,"[2, 2, 2, 2, 2, 2, 2, 4, 2, 2]",44040010,2,number,,440400,3,3,4404001,how many people are behind the woman,how many people are,0,0.0,1,False
83,"[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]",53446720,1,number,,534467,4,3,5344672,how many towels are in the photo,how many,0,0.0,1,False
101,"[2, 2, 2, 2, 2, 3, 2, 3, 2, 2]",3268210,2,number,,32682,3,4,326821,how many species of animals are visible,how many,0,0.0,1,False
134,"[bananas, 7, boggles, 7, 7, 7, 7, 7, 6, 7]",49597510,7,number,,495975,31,5,4959751,how many bunches are on this scene,how many,0,0.0,1,False


#### Summary
- Best accuracy (66%): Yes/No answer type
- Second-best accuracy (44%): Other answer type
- Worst accuracy (24%): Number answer type

**Yes/No Answer Type**  
The model does consistently well across question types, with accuracy above 50% for all but three: "why", "what", and "what is the".  The phrasing of these question types does not point to a yes or no answer, so it is no surprise that the model does poorly on them; together they also cover only seven data points.  For the other question types, the model does best when the questions have clear and direct answers, and it performs poorly when the questions are abstract, subjective, or require common-sense knowledge.  It is also interesting that almost half of the incorrect predictions (47.8%) match the answer given by at least one human annotator, and 29.0% match at least two.

**Other Answer Type**  
We explored whether multi-word answers led to low accuracy for this answer type, since multi-word phrases are more likely to be excluded from the training vocabulary.  However, most of the incorrect predictions (about 92%) have one-word answers, and only about 8% have multi-word answers.  In general, the model does best at predicting rooms, animals, sports, and colors for this answer type.
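
The split between one-word and multi-word answers can be recomputed from the answer-length counts printed in cells 23 and 24. The sketch below copies those counts by hand (it does not reload the results file) and also derives accuracy per answer length; the variable names are illustrative.

```python
# Counts copied from the outputs of cells 23 (incorrect) and 24 (correct),
# keyed by answer length in words, for the 'other' answer type.
incorrect = {0: 5, 1: 11541, 2: 740, 3: 306, 4: 10}
correct = {1: 9456, 2: 481, 3: 145, 4: 1}

total_incorrect = sum(incorrect.values())
one_word_share = incorrect[1] / total_incorrect
multi_word_share = sum(n for length, n in incorrect.items() if length >= 2) / total_incorrect
print('one-word: {:.1%}, multi-word: {:.1%}'.format(one_word_share, multi_word_share))

# Accuracy per answer length = correct / (correct + incorrect).
accuracy_by_length = {
    length: correct.get(length, 0) / (correct.get(length, 0) + incorrect[length])
    for length in sorted(incorrect)
}
for length, acc in accuracy_by_length.items():
    print('{} word(s): accuracy {:.3f}'.format(length, acc))
```

Looking at accuracy per length, rather than only the share of errors, separates "most answers are one word long" from "longer answers are harder".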


**Number Answer Type**  
The model does worst on the number answer type, with only 24% accuracy.  Most of the questions for this answer type involve counting (i.e., "how many").  In the results shown above, none of the 5,105 incorrect predictions for this answer type is the UNK token, so these errors are wrong in-vocabulary answers rather than unknown-token predictions.  A smaller share of the questions involve time or number sequences (such as a bus number or the number on a jersey) that need to be read from objects in the image.



In future iterations, we could measure accuracy using the distance between word vectors, so that words or phrases with similar meanings receive credit (e.g., "nighttime" vs. "night").  We could also look for better ways to tokenize times and numbers and explore different UNK replacement techniques.
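
One way the word-vector idea might look is sketched below. Everything in it is an assumption for illustration: the three-dimensional `toy_vectors` are invented (a real setup would load pretrained embeddings), and `phrase_vector`, `soft_match`, and the 0.95 threshold are hypothetical choices, not part of the current pipeline.

```python
import numpy as np

# Invented 3-d embeddings, for illustration only.
toy_vectors = {
    'night':     np.array([0.90, 0.10, 0.00]),
    'nighttime': np.array([0.85, 0.15, 0.05]),
    'sunset':    np.array([0.50, 0.50, 0.10]),
    'frisbee':   np.array([0.00, 0.10, 0.90]),
}

def phrase_vector(phrase, vectors):
    """Average the vectors of the known tokens; return None if no token is known."""
    found = [vectors[t] for t in phrase.strip().lower().split() if t in vectors]
    return np.mean(found, axis=0) if found else None

def soft_match(prediction, answer, vectors, threshold=0.95):
    """Exact matches score 1.0; otherwise return the cosine similarity
    if it reaches the threshold, and 0.0 if it does not."""
    if prediction.strip().lower() == answer.strip().lower():
        return 1.0
    p = phrase_vector(prediction, vectors)
    a = phrase_vector(answer, vectors)
    if p is None or a is None:
        return 0.0  # out-of-vocabulary phrases get no credit
    cos = float(np.dot(p, a) / (np.linalg.norm(p) * np.linalg.norm(a)))
    return cos if cos >= threshold else 0.0

for pred in ['nighttime', 'sunset', 'frisbee']:
    print(pred, 'vs. night ->', round(soft_match(pred, 'night', toy_vectors), 3))
```

The threshold keeps loosely related answers (here "sunset" vs. "night") from earning credit; its value would need tuning against human judgments.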
