# This experiment shows how to use retrieved information to train a model to answer science exam questions

The data and approach follow the GitHub repo [tambetm/allenAI](https://github.com/tambetm/allenAI).

In [1]:
from tqdm import tqdm
import numpy as np

## 1. Load the AI2 data

In [2]:
# Replace punctuation with spaces and collapse repeated spaces
filters=set('!\"#$%&(),.:;<=>?@[\\\\]^`{|}~-')
filterpunct = lambda s: ''.join([x if x not in filters else ' ' for x in s])
filterspace = lambda s: ' '.join(filter(lambda x: len(x)>0, s.split(' ')))

demo_s = '1-hypothesis 2.theory 3--law'
print('demo of filters:')
print('\tinput string: %s' % demo_s)
print('\toutput string: %s' % filterspace(filterpunct(demo_s)))

demo of filters:
	input string: 1-hypothesis 2.theory 3--law
	output string: 1 hypothesis 2 theory 3 law
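
The two lambdas above can also be written with `str.translate`. This is a minimal sketch, assuming the same punctuation set as `filters`; the names `PUNCT`, `TABLE` and `clean` are illustrative and not part of the notebook. One behavioral difference: `str.split()` with no argument also collapses tabs and newlines, whereas `filterspace` only splits on spaces.

```python
# Assumed copy of the notebook's punctuation set (one backslash after unescaping).
PUNCT = '!"#$%&(),.:;<=>?@[\\]^`{|}~-'
# Translation table that maps every punctuation character to a space.
TABLE = str.maketrans({ch: ' ' for ch in PUNCT})

def clean(s):
    # translate() swaps punctuation for spaces; split()/join() collapses whitespace runs.
    return ' '.join(s.translate(TABLE).split())

print(clean('1-hypothesis 2.theory 3--law'))  # 1 hypothesis 2 theory 3 law
print(clean("earth's  surface!"))             # earth's surface
```

The apostrophe is not in the set, so tokens such as `earth's` or `object's` pass through unchanged.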


In [4]:
import json

filename_tr = 'AI2-ScienceQuestions-V2-May2017/MiddleSchool/Middle-NDMC-Train.jsonl'
filename_dev = 'AI2-ScienceQuestions-V2-May2017/MiddleSchool/Middle-NDMC-Dev.jsonl'
filename_te = 'AI2-ScienceQuestions-V2-May2017/MiddleSchool/Middle-NDMC-Test.jsonl'
data, count_AI2 = [], {}

id = 0
with open(filename_tr, encoding='utf-8') as f:
    for line in f:
        data.append(json.loads(line))
        id += 1
count_AI2['train'] = [0, id-1]

with open(filename_dev, encoding='utf-8') as f:
    for line in f:
        data.append(json.loads(line))
        id += 1
count_AI2['dev'] = [count_AI2['train'][-1]+1, id-1]

with open(filename_te, encoding='utf-8') as f:
    for line in f:
        data.append(json.loads(line))
        id += 1
count_AI2['test'] = [count_AI2['dev'][-1]+1, id-1]

print('AI2 data, number of (train, dev, test) samples: (%d, %d, %d).' %
      (count_AI2['train'][1] - count_AI2['train'][0]+1,
       count_AI2['dev'][1] - count_AI2['dev'][0]+1,
       count_AI2['test'][1] - count_AI2['test'][0]+1))

AI2 data, number of (train, dev, test) samples: (605, 125, 679).


In [5]:
questions = []
choices = []
solutions = []
for s in data:
    if len(s['question']['choices']) == 4:
        questions.append(filterspace(filterpunct(s['question']['stem'])).lower())
        solutions.append(ord(s['answerKey']) - ord('A'))
        choices.append([filterspace(filterpunct(s['question']['choices'][i]['text'])).lower() for i in range(4)])

In [7]:
print('examples of AI2 questions, solutions and choices:')
i = 0
print('question:\t%s' % questions[i])
print('choices:\t%s' % choices[i])
print('true choice:\t%s\n' % choices[i][solutions[i]])

examples of AI2 questions, solutions and choices:
question:	which correctly arranges three scientific terms theory law and hypothesis from least to most accepted or tested
choices:	['theory hypothesis law', 'hypothesis law theory', 'theory law hypothesis', 'hypothesis theory law']
true choice:	hypothesis theory law



### Example questions (original English):

**Question**: which correctly arranges three scientific terms theory law and hypothesis from least to most accepted or tested
**Choices**: ['theory hypothesis law', 'hypothesis law theory', 'theory law hypothesis', 'hypothesis theory law']
**Answer**: hypothesis theory law

**Question**: which of these best defines communicable diseases
**Choices**: ['they can be cured', 'they are caused by bacteria', 'they are spread to others', 'they can spread only in winter']
**Answer**: they are spread to others

**Question**: a scientist combines oxygen and hydrogen to form water this combination illustrates that water is
**Choices**: ['an atom', 'an element', 'a mixture', 'a compound']
**Answer**: a compound

**Question**: comparing the skeletons of which of the following fish would best show the evolution of a fish species
**Choices**: ['a male fish and a female fish that could produce offspring', 'the same fish just before it received a cut and after it healed', 'a fish that lived recently and a fish that lived a long time ago', 'the same fish just after it hatched and when it was fullgrown']
**Answer**: a fish that lived recently and a fish that lived a long time ago

**Question**: when oil is burning the reaction will
**Choices**: ['only release energy', 'only absorb energy', 'neither absorb nor release energy', 'sometimes release and sometimes absorb energy depending on the oil']
**Answer**: only release energy

**Question**: two pure substances combine to make a new substance the new substance cannot be physically separated and has a different boiling point than each of the original substances this new substance can best be classified as
**Choices**: ['an atom', 'a mixture', 'an element', 'a compound']
**Answer**: a compound

**Question**: a change in the environment that causes a response is known as a
**Choices**: ['stimulus', 'habit', 'reflex', 'source']
**Answer**: stimulus

**Question**: the movement of an air mass over earth's surface causes
**Choices**: ['earthquake activity', 'local weather changes', 'global warming', 'ecological succession']
**Answer**: local weather changes

**Question**: sound will not travel in a
**Choices**: ['solid', 'liquid', 'gas', 'vacuum']
**Answer**: vacuum

**Question**: what is the smallest unit of an element that still has the properties of that element
**Choices**: ['an atom', 'a compound', 'an electron', 'a molecule']
**Answer**: an atom

**Question**: in a food pyramid which best explains why the number of organisms decreases from one trophic level to the next
**Choices**: ['consumers at the lower level require more energy than the toplevel consumers', 'consumers at the top level require more energy than the lowerlevel consumers', 'the consumers are feeding on larger organisms that have less energy', 'the consumers are feeding on smaller organisms that have less energy']
**Answer**: consumers at the top level require more energy than the lowerlevel consumers


## 2. Build a small knowledge base from studystack notes

In [8]:
name = 'studystack_qa_cleaner_no_qm.txt'
defs, terms = [], []
with open(name, encoding='utf-8') as f:
    for line in f:
        l = line.strip().split('\t')
        defs.append(filterspace(filterpunct(l[1])).lower())
        terms.append(filterspace(filterpunct(l[2])).lower())

In [13]:
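print('Example of a knowledge-base term (or answer) and its definition (or question):\n')
i = 2
print('term (or solution):\t%s' % terms[i])
print('definition (or question):\t%s\n' % defs[i])

Example of a knowledge-base term (or answer) and its definition (or question):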

term (or solution):	reference point
definition (or question):	a place or object used for comparison to determine if an object is in motion must be stationary



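### More example entries

**motion**: the state in which an object's distance from another object is changing

**parasitism**: a symbiotic relationship in which one organism benefits and the other is harmed.

**global warming**: the hypothesized rise in earth's average temperature due to excess carbon dioxide

**power = work divided by time, which can be written P = W / T; rearranged, work = power times time, or W = P * T; rearranged again for time, time = work divided by power, or T = W / P**: what is the equation for power?

**salt water**: (O.Comp.) what fresh water mixed with sea water is called

**mass**: what things are made up of

**mushroom**: what is an example of a club fungus?

**agriculture**: what is the business of growing crops and raising animals?

**solid, liquid, gas, plasma**: the four states of matter

**not made by humans, occurring naturally in the environment**: what is a natural structure?

**data**: information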

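## 3. Preprocessing

Following the DrQA approach, we use a simplified scheme for information retrieval:
1. Build the vocabulary (this includes removing stop words, filtering samples, and extracting nouns and verbs)
2. Compute the IDF of every word
3. Compute the TF of every word in each sample
4. Use a score function to select the relevant entries
  + Entries are selected from studystack only
  + The score is computed on the Q and A of each studystack entry together
5. Select 5 entries per sample as its story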


In [14]:
import nltk
# Filter out stop words
from nltk.corpus import stopwords
stopwords = set(stopwords.words('english'))
# Reduce verbs (-ing, -ed forms) and nouns (plurals) to their base form
from nltk.stem.wordnet import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

In [15]:
print('demo, default nltk stop words:\n%s' % stopwords)

demo, default nltk stop words:
{'not', "needn't", 're', 'only', 'weren', 'ain', 'between', 'up', 'with', 'now', 'did', 'before', 'your', 'will', 'below', 'didn', 'shan', 'me', 'she', 'be', 'haven', 'few', 'was', 'hasn', 'most', "mustn't", 'are', "you're", 'wouldn', 'such', 'his', 'out', "isn't", "weren't", 'themselves', "mightn't", 'which', 'that', 'm', 'herself', 'won', 'y', 'don', 'doesn', 'needn', "that'll", 'very', 'both', "aren't", 'shouldn', "it's", 'and', "haven't", 'once', "couldn't", 'here', "should've", 'couldn', 'just', 'is', 'by', 'there', 'my', 'hers', 'll', "hadn't", 'doing', 'can', "she's", "wouldn't", 'itself', 'to', "you'll", "wasn't", 'while', 'it', 'any', 'himself', "shan't", 'does', 's', 'an', 'when', 'during', 'because', 'isn', 'them', 'ourselves', 'they', 'what', 'so', 'd', 'no', 'their', 'aren', 've', 'who', 'how', 'through', "hasn't", 'again', 'wasn', 'other', 'been', 'why', 'yours', 'after', 't', 'further', 'as', 'being', "you've", 'those', 'am', 'its', 'for', 'her', 'this

#### `textfilter()`: tag the part of speech of every word with nltk.pos_tag, drop stop words, keep only verbs and nouns, and reduce them to their base form with the nltk lemmatizer

In [16]:
def textfilter(text, stopwords, lemmatizer):
    text = nltk.pos_tag(text)
    text = [w for w in text if w[0] not in stopwords and w[1][:2] in ['NN', 'VB']]
    words = []
    for w in text:
        if w[1].startswith('VB'):
            words.append(lemmatizer.lemmatize(w[0], 'v'))
        else:
            words.append(lemmatizer.lemmatize(w[0], 'n'))
    return words

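#### Build a dataset aligned one-to-one with the original samples, keeping only the base forms of nouns and verbs
* `questions` (the code that appends `choices` is commented out) : `filter_AI2`
* `defs + terms` : `filter_SS`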

In [17]:
filter_AI2 = []
outliers = []
for i in tqdm(range(len(questions))):
    try:
        text = questions[i].split(' ')
        #for j in range(4):
        #    text.extend(choices[i][j].split(' '))
        filter_AI2.append(textfilter(text, stopwords, lemmatizer))
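    except Exception:
        # tagging or lemmatization failed for this question; remember its index
        outliers.append(i)

if outliers:
    print('some of the filtered-out samples:')
    for i in range(min(3, len(outliers))):
        print('%s, %s' % (questions[outliers[i]], choices[outliers[i]]))
    print('filtering samples...')
    # delete from the end so that earlier indices stay valid
    for i in outliers[::-1]:
        del questions[i]
        del choices[i]
        del solutions[i]
    print('done')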

100%|█████████████████████████████████████| 1394/1394 [00:04<00:00, 313.06it/s]


In [27]:
# Apply textfilter() to the studystack entries
filter_SS = []
for i in tqdm(range(len(defs))):
    text = [x for x in defs[i].split(' ') + terms[i].split(' ') if x]
    filter_SS.append(textfilter(text, stopwords, lemmatizer))

# Find empty entries, i.e. entries that contain no nouns or verbs
empty_lines = {i for i, words in enumerate(filter_SS) if not words}

N = len(defs)
# Drop the empty entries from all three lists so that they stay aligned
# (guarding on empty_lines avoids an IndexError when nothing is empty)
if empty_lines:
    defs = [defs[i] for i in range(N) if i not in empty_lines]
    terms = [terms[i] for i in range(N) if i not in empty_lines]
    filter_SS = [filter_SS[i] for i in range(N) if i not in empty_lines]

100%|█████████████████████████████████| 454743/454743 [13:13<00:00, 573.44it/s]


#### Build the vocabulary

In [28]:
# 1. build vocab
from collections import Counter
c = Counter()
for line in filter_AI2:
    c.update(line)
for line in filter_SS:
    c.update(line)

In [29]:
print('100 most common words:\n\t%s' % c.most_common(100))

100 most common words:
	[('cell', 42122), ('water', 24315), ('use', 24127), ('energy', 24054), ('change', 19717), ('make', 19637), ('organism', 18938), ('form', 17789), ('body', 17469), ('plant', 15939), ('system', 14965), ('food', 13260), ('substance', 12988), ('cause', 12917), ('blood', 12850), ('part', 12312), ('process', 12060), ('type', 12036), ('move', 11949), ('object', 11871), ('force', 10645), ('produce', 10382), ('rock', 10151), ('mass', 9983), ('call', 9955), ('animal', 9884), ('air', 9635), ('surface', 9331), ('group', 9251), ('matter', 8877), ('area', 8872), ('material', 8866), ('muscle', 8688), ('earth', 8670), ('gas', 8637), ('time', 8598), ('measure', 8408), ('chemical', 8350), ('thing', 8338), ('element', 8164), ('wave', 7939), ('tissue', 7912), ('atom', 7899), ('bone', 7791), ('find', 7736), ('number', 7678), ('layer', 7668), ('amount', 7462), ('structure', 7147), ('function', 7099), ('particle', 6916), ('disease', 6887), ('molecule', 6834), ('membrane', 6711), ('live', 6640), (

In [30]:
vocab, ivocab = {}, {}
id = 0
for k, v in c.items():
    vocab[k] = id
    ivocab[id] = k
    id += 1

In [32]:
filter_AI2[0], filter_SS[0]

(['arrange', 'term', 'law', 'hypothesis', 'test'],
 ['state', "object's", 'distance', 'change', 'motion'])

### Compute tf-idf

In [33]:
# 2. Compute IDF
N = len(filter_AI2) + len(filter_SS) # total number of documents
df = {i: 0 for i in range(len(vocab))}
for text in filter_AI2:
    for w in set(text):
        df[vocab[w]] += 1

for text in filter_SS:
    for w in set(text):
        df[vocab[w]] += 1

for i in range(len(vocab)):
    df[i] = df[i] / N

import math
idf = {ivocab[k]: -math.log(v) / math.log(10) for k, v in df.items()}

In [34]:
idf['organism']

1.4343630731481094

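### Retrieve entries
1. For each entry of `filter_AI2`, retrieve the 5 closest entries from `filter_SS`
2. For each entry of `filter_SS`, retrieve the 6 closest entries from a window of 6000 surrounding entries of `filter_SS`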

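### Term frequency tf
* The **term frequency** $tf(t,d)$ of term $t$ in document $d$ is the number of times $t$ occurs in $d$
* When *tf* is used to compute **query-document match scores**, the relevance of a document to a query is assumed not to grow linearly with term frequency
  + Instead the log of the term frequency is used: $w(t,d) = \log_{10}(1 + tf(t,d))$
* The relevance of a query and a document is the sum of these weights over the terms that occur in both: $score = \sum_{t \in q \cap d} \log_{10}(1 + tf(t,d))$
* If none of the query's terms occur in a document, the relevance of the query to that document is 0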
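
To make the sublinear growth concrete, the short sketch below evaluates $\log_{10}(1 + tf)$ for a few term counts. The helper name `tf_weight` is illustrative and does not appear in the notebook.

```python
import math

def tf_weight(tf):
    # Log-scaled term frequency: log10(1 + tf), which is 0 when the term is absent.
    return math.log10(1 + tf)

weights = {tf: round(tf_weight(tf), 3) for tf in (0, 1, 10, 100)}
print(weights)  # {0: 0.0, 1: 0.301, 10: 1.041, 100: 2.004}
```

A term that occurs 100 times weighs less than 7 times as much as a term that occurs once.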

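### Document frequency: *idf* weights
* Common words usually carry less information than rare words, for example
  + stop words usually carry very little information
  + rare words can be very useful for retrieval: if a query contains the rare word *arachnocentric*, a document that contains *arachnocentric* is very likely to be relevant
* How can the weight of a term be quantified?
  + $df(t)$ is the **document frequency** of term $t$, i.e. the number of documents that contain $t$
  + The log-scaled inverse of $df$, $idf(t) = \log_{10}\frac{N}{df(t)}$ with $N$ the total number of documents, describes the weight or informativeness of term $t$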
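
The definition can be tried on a toy corpus. This is a minimal sketch with made-up documents; `toy_docs`, `compute_idf` and `idf_toy` are illustrative names, not the notebook's variables.

```python
import math
from collections import Counter

# Four tiny "documents", already tokenized (made-up data).
toy_docs = [
    ['cell', 'membrane', 'water'],
    ['water', 'energy'],
    ['cell', 'energy', 'water'],
    ['spider', 'water'],
]

def compute_idf(docs):
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # count each word at most once per document
    return {w: math.log10(n_docs / count) for w, count in df.items()}

idf_toy = compute_idf(toy_docs)
for word in ('water', 'cell', 'spider'):
    print(word, round(idf_toy[word], 3))
```

A word that occurs in every document (`water`) gets weight 0, and the rarest word (`spider`) gets the largest weight. Read the other way round, the notebook's `idf['organism']` of about 1.434 corresponds to a document fraction of $10^{-1.434}$, roughly 3.7% of all entries.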

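### tf-idf score
* $W(t,d) = \log_{10}(1+tf(t,d))\times \log_{10}\frac{N}{df(t)}$
* $score(q,d) = \sum_{t\in q\cap d} W(t,d)$
* This score is used to rank entries during retrieval. The `calc_score` function in the code uses the natural log for the tf factor, which multiplies every score by the constant $\ln 10$ and leaves the ranking unchanged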
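
The sketch below applies the score to a toy corpus and ranks the documents for one query. The data and the names `tfidf_score`, `toy_docs`, `toy_idf` and `ranking` are assumptions made for illustration; the formula follows the $\log_{10}$ definitions above.

```python
import math
from collections import Counter

def tfidf_score(query, doc_counts, idf):
    # Sum log10(1 + tf) * idf over the terms shared by the query and the document.
    shared = set(query) & set(doc_counts)
    return sum(math.log10(1 + doc_counts[t]) * idf[t] for t in shared)

toy_docs = [
    ['atom', 'element', 'atom'],
    ['water', 'compound', 'element'],
    ['sound', 'wave', 'vacuum'],
]
n_docs = len(toy_docs)
df = Counter(w for doc in toy_docs for w in set(doc))
toy_idf = {w: math.log10(n_docs / c) for w, c in df.items()}
toy_counts = [Counter(doc) for doc in toy_docs]

query = ['water', 'compound']
scores = [tfidf_score(query, dc, toy_idf) for dc in toy_counts]
# Sort document indices by descending score; ties keep their original order.
ranking = sorted(range(n_docs), key=lambda j: scores[j], reverse=True)
print(ranking)  # [1, 0, 2]
```

Only the second document shares terms with the query, so it ranks first; the other two score 0.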

In [36]:
# Helper: per-entry word counts, used to look up tf
filter_SS_dict = []
for line in filter_SS:
    filter_SS_dict.append(Counter(line))

# Compute the tf-idf matching score between a query q and one entry
def calc_score(idf, q, list_d, dict_d):
    s = set(q).intersection(set(list_d))
    score = 0
    if len(s) > 0:
        for w in s:
            score += math.log(1 + dict_d[w]) * idf[w]
    return score
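
Because `calc_score` returns 0 whenever the query and an entry share no word, most entries need not be scored at all. One possible way to exploit this is an inverted index from words to entry indices. This is a sketch on made-up data; `build_inverted_index` and `candidate_docs` are hypothetical helpers, not part of the notebook.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each word to the set of document indices that contain it."""
    index = defaultdict(set)
    for j, doc in enumerate(docs):
        for w in set(doc):
            index[w].add(j)
    return index

def candidate_docs(query, index):
    """Indices of documents that share at least one word with the query."""
    found = set()
    for w in set(query):
        found |= index.get(w, set())
    return found

toy_docs = [
    ['atom', 'element'],
    ['water', 'compound', 'element'],
    ['sound', 'wave', 'vacuum'],
    ['water', 'wave'],
]
toy_index = build_inverted_index(toy_docs)
print(sorted(candidate_docs(['water', 'vacuum'], toy_index)))  # [1, 2, 3]
```

Scoring only the candidates gives the same ranking among entries with a positive score. It differs from an exhaustive loop only when fewer entries than requested share a word with the query, because zero-score entries are then never considered.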

## 4. Select knowledge-base entries for each question (and its answer choices)

In [37]:
idx_AI2 = []

for i in tqdm(range(len(filter_AI2))):
    idx_AI2.append([])
    scores = []
    for j in range(len(filter_SS)):
        scores.append(calc_score(idf, filter_AI2[i], filter_SS[j], filter_SS_dict[j]))
    values = np.array(scores)
    for k in range(5):
        idx_AI2[-1].append(np.argmax(values))
        values[idx_AI2[-1][-1]] = -100

100%|██████████████████████████████████████| 1394/1394 [16:58<00:00,  1.37it/s]
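
The five-fold `argmax` with masking in the cell above is a top-k selection. The sketch below shows an equivalent formulation with a stable `np.argsort`; both function names are illustrative. The scoring loop over all entries, not this selection step, probably dominates the running time, so the point here is clarity rather than speed.

```python
import numpy as np

def top_k_argmax(values, k):
    """Notebook-style selection: take the argmax k times, masking each pick."""
    values = np.array(values, dtype=float)  # work on a copy
    picked = []
    for _ in range(k):
        j = int(np.argmax(values))
        picked.append(j)
        values[j] = -np.inf
    return picked

def top_k_argsort(values, k):
    """Same result via one stable sort of the negated scores."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(-values, kind='stable')
    return [int(j) for j in order[:k]]

toy_scores = [0.2, 1.5, 0.0, 1.5, 0.7, 0.2]
print(top_k_argmax(toy_scores, 3))   # [1, 3, 4]
print(top_k_argsort(toy_scores, 3))  # [1, 3, 4]
```

Both versions break ties in favor of the lower index, so duplicated entries come out in the same order.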


In [39]:
i = 0
print('question: %s' % questions[i])
for j in range(5):
    print('\t%d:%s' % (j, terms[idx_AI2[i][j]] + ' ' + defs[idx_AI2[i][j]]))
print('\n')

question: which correctly arranges three scientific terms theory law and hypothesis from least to most accepted or tested
	0:make observations form a hypothesis perform experiments to confirm hypothesis hypothesis is now a theory perform many experiments over several years theory is now a law put into their proper order a theory is now a law b hypothesis is now a theory c make observations d perform experiments to confirm the hypothesis e form a hypothesis f perform many experiments over several years
	1:make observations form a hypothesis perform experiments to confirm hypothesis hypothesis is now a theory perform many experiments over several years theory is now a law put into their proper order a theory is now a law b hypothesis is now a theory c make observations d perform experiments to confirm the hypothesis e form a hypothesis f perform many experiments over several years
	2:after a hypothesis is made scientists will design an experiment to test the hypothesis what is done when 

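question: which correctly arranges three scientific terms theory law and hypothesis from least to most accepted or tested
0: make observations form a hypothesis perform experiments to confirm hypothesis hypothesis is now a theory perform many experiments over several years theory is now a law put into their proper order a theory is now a law b hypothesis is now a theory c make observations d perform experiments to confirm the hypothesis e form a hypothesis f perform many experiments over several years
1: make observations form a hypothesis perform experiments to confirm hypothesis hypothesis is now a theory perform many experiments over several years theory is now a law put into their proper order a theory is now a law b hypothesis is now a theory c make observations d perform experiments to confirm the hypothesis e form a hypothesis f perform many experiments over several years
2: define counterexample hypothesis theory a counterexample is an example that contradicts a scientific conclusion a hypothesis is an educated guess that tries to explain an observation or answer a question a theory is a hypothesis that has been tested extensively
3: a scientific theory differs from a scientific law because a scientific law describes an observed pattern in nature without attempting to explain it while a scientific theory is a well tested explanation how does a scientific theory differ from a scientific law
4: ideas based on what you think conclusion theory hypothesis understand the increasing validity from hypothesis to conclusion to theory to law


question: which of these best defines communicable diseases
0: you would have to restore the substantia nigra neurons by stopping them from dying or making new substantia nigra neurons stem cell transplants of substantia nigra neurons are currently being done but the new neurons still die again so this prolongs treatment but does not cure drug combinations significantly improve the quality of life of parkinson's patients but still cannot cure the disease **how to cure** are there any current treatments that do this
1: common cold and polio common cold coughing and sneezing rhinovirus or adenovirus cannot be cured symptoms can be treated polio poliovirus muscle weakness and paralysis no cure vaccine spread from feces or hands to mouth
2: chickenpox and smallpox chickenpox varicella virus itchy rash sores high fever abdominal pain vague sick feeling cannot be cured symptoms can be treated smallpox spread by the variola virus symptoms small itchy spots high fever death no cure vaccine spread by contact
3: influenza because it occurs in one geographic area and then spreads to others the influenza virus originates in southeast asia and spreads to countries including the united states and europe an infectious disease that affects millions of people worldwide is called
4: lyme disease a vector borne infectious disease its symptoms in humans are flu like with a rash called erythema untreated it can spread to other body systems the bacterium that causes it is called a spirochete


question: a scientist combines oxygen and hydrogen to form water this combination illustrates that water is
0: a molecule can be made of two or more atoms of the same element such as hydrogen gas 2 hydrogen atoms or of two or more atoms of different elements joined together such as water 2 hydrogen atoms and 1 oxygen atom
1: subscript a subscript is a number that tells you how many atoms of an element are present in a formula water h20 has 2 hydrogen atoms and one oxygen • glucose c6h12o6 has 6 carbon atoms 12 hydrogen and 6 oxygen
2: compound 2 or more elements combined together to make a molecule • compounds of the same substance have the same chemical composition water is hydrogen atoms bonded to 1 oxygen atom • examples salt water carbon dioxide
3: two or more atoms of different elements joined together such as water 2 hydrogen atoms and 1 oxygen atom
4: a combination of two or more elements h2o is a compound containing the elements hydrogen and oxygen


question: comparing the skeletons of which of the following fish would best show the evolution of a fish species
0: list 3 pieces of evidence that support the theory of evolution the fossil record shows changes in species dna can show whether different species are closely related similar structures comparing the structures of living animals to look for similarities
1: mold a type of fossil that can help show the body shape of a fish that lived long ago ____
2: compare and contrast the life cycles of a burmese python and a garden snail they are alike because both lay eggs they differ because the snail leaves after laying its eggs while the python waits until the eggs hatch
3: species concept different kinds of species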

In [40]:
# For each choice, compute its relevance to the retrieved support facts
wrong = 0
for i in range(len(questions)):
    q = questions[i]
    c = choices[i]
    s = solutions[i]
    scores = []
    for j in range(4):
        choice = textfilter(c[j].split(), stopwords, lemmatizer)
        score = 0
        for k in idx_AI2[i]:
            score += calc_score(idf, choice, filter_SS[k], filter_SS_dict[k])
        scores.append(score)
    if scores[s] < max(scores):
        wrong += 1

In [41]:
print('among %d questions, %d are correctly classified' % (len(questions), len(questions)-wrong))
print('accuracy: %f' % (1-wrong/len(questions)))

among 1394 questions, 950 are correctly classified
accuracy: 0.681492
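
The test `scores[s] < max(scores)` counts a question as correct whenever the true choice ties for the highest score, including the case where all four choices score 0. How much this affects the reported accuracy cannot be read off the output above. The sketch below contrasts the notebook's rule with a stricter one that requires a unique maximum; both function names are illustrative.

```python
def is_correct_lenient(scores, s):
    # Notebook rule: wrong only if the true choice scores strictly below the maximum.
    return not (scores[s] < max(scores))

def is_correct_strict(scores, s):
    # Stricter rule: the true choice must be the single highest-scoring choice.
    best = max(scores)
    return scores[s] == best and scores.count(best) == 1

all_tied = [0, 0, 0, 0]
clear_winner = [1.0, 2.0, 0.5, 0.1]
print(is_correct_lenient(all_tied, 2), is_correct_strict(all_tied, 2))          # True False
print(is_correct_lenient(clear_winner, 1), is_correct_strict(clear_winner, 1))  # True True
```

Counting with both rules over all questions would show how many of the correct answers come from ties.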
