# crfsgd examples

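## Introduction
Three open-source CRF tools are most commonly used: CRF++, CRFsuite, and SGD (crfsgd). crfsgd is a CRF tool trained with stochastic gradient descent (SGD); it reaches comparable accuracy while training much faster. See the [official site](http://leon.bottou.org/projects/sgd) for details. The blog post [《三种CRF实现在中文分词任务上的表现比较》](https://jianqiangma.wordpress.com/2011/11/14/%E4%B8%89%E7%A7%8Dcrf%E5%AE%9E%E7%8E%B0%E7%9A%84%E7%AE%80%E5%8D%95%E6%AF%94%E8%BE%83/) ("A comparison of three CRF implementations on Chinese word segmentation") compares the three tools experimentally.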

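For a description of the tool, source download, build instructions, and documentation, see the [official site](http://leon.bottou.org/projects/sgd). For this example, the source was downloaded from the official site and compiled locally into executables placed in bin/. crfsgd and crfasgd are two variants of the tool (plain SGD and averaged SGD), and conlleval is the evaluation script. This example shows how to use the tool on two tasks: chunking and NER.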

## Corpora
### CONLL2000 (chunking task)
* Training data: data/conll2000/train.txt
* Test data: data/conll2000/test.txt  
    Sample:

In [5]:
!head -20 data/conll2000/train.txt

Confidence NN B-NP
in IN B-PP
the DT B-NP
pound NN I-NP
is VBZ B-VP
widely RB I-VP
expected VBN I-VP
to TO I-VP
take VB I-VP
another DT B-NP
sharp JJ I-NP
dive NN I-NP
if IN B-SBAR
trade NN B-NP
figures NNS I-NP
for IN B-PP
September NNP B-NP
, , O
due JJ B-ADJP
for IN B-PP
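
Each line of the sample holds a word, its POS tag, and its chunk tag, where `B-X` opens a chunk of type `X` and `I-X` continues it. As a rough illustration of how these tags group tokens into phrases, here is a minimal Python sketch. The function name `extract_chunks` is hypothetical and is not part of crfsgd or conlleval; the sample lines are copied from the output above.

```python
from typing import List, Optional, Tuple


def extract_chunks(tagged: List[Tuple[str, str]]) -> List[Tuple[str, List[str]]]:
    """Group (word, tag) pairs into (chunk_type, words) spans.

    A span starts at a B-X tag, or at an I-X tag whose type differs from the
    span that is currently open. An O tag closes the open span.
    """
    chunks: List[Tuple[str, List[str]]] = []
    current_type: Optional[str] = None
    for word, tag in tagged:
        if tag == "O":
            current_type = None
            continue
        prefix, _, chunk_type = tag.partition("-")
        if prefix == "B" or chunk_type != current_type:
            chunks.append((chunk_type, [word]))  # open a new span
            current_type = chunk_type
        else:
            chunks[-1][1].append(word)  # extend the open span
    return chunks


# First lines of the training sample: word, POS tag, chunk tag.
sample = """\
Confidence NN B-NP
in IN B-PP
the DT B-NP
pound NN I-NP
is VBZ B-VP
widely RB I-VP
expected VBN I-VP
"""

# Keep the first column (word) and the last column (chunk tag).
tagged = [
    (cols[0], cols[-1])
    for cols in (line.split() for line in sample.splitlines())
    if cols
]

for chunk_type, words in extract_chunks(tagged):
    print(chunk_type, " ".join(words))
```

Because a span also starts at an `I-X` tag that follows `O` or a different type, the same function can be applied to data that never uses `B-` tags for the first token of a phrase.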


### CONLL2003 (NER, named entity recognition task)
* Training data: data/conll2003/eng.train
* Test data: data/conll2003/eng.testa, data/conll2003/eng.testb  
    Sample:

In [8]:
!head -20 data/conll2003/eng.train

EU NNP I-NP I-ORG
rejects VBZ I-VP O
German JJ I-NP I-MISC
call NN I-NP O
to TO I-VP O
boycott VB I-VP O
British JJ I-NP I-MISC
lamb NN I-NP O
. . O O

Peter NNP I-NP I-PER
Blackburn NNP I-NP I-PER

BRUSSELS NNP I-NP I-LOC
1996-08-22 CD I-NP O

The DT I-NP O
European NNP I-NP I-ORG
Commission NNP I-NP I-ORG
said VBD I-VP O
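
In this sample, each line has four columns (word, POS tag, chunk tag, entity tag) and a blank line separates sentences. The following Python sketch splits such data into sentences and counts the entity tags. The helper name `read_sentences` is hypothetical, and the embedded text is the sample shown above rather than the full file.

```python
from collections import Counter
from typing import Iterable, Iterator, List, Tuple

Token = Tuple[str, ...]  # one tuple of columns per token


def read_sentences(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield sentences from column data; a blank line ends a sentence."""
    sentence: List[Token] = []
    for line in lines:
        columns = line.split()
        if columns:
            sentence.append(tuple(columns))
        elif sentence:
            yield sentence
            sentence = []
    if sentence:  # the input may not end with a blank line
        yield sentence


sample = """\
EU NNP I-NP I-ORG
rejects VBZ I-VP O
German JJ I-NP I-MISC
call NN I-NP O
to TO I-VP O
boycott VB I-VP O
British JJ I-NP I-MISC
lamb NN I-NP O
. . O O

Peter NNP I-NP I-PER
Blackburn NNP I-NP I-PER

BRUSSELS NNP I-NP I-LOC
1996-08-22 CD I-NP O

The DT I-NP O
European NNP I-NP I-ORG
Commission NNP I-NP I-ORG
said VBD I-VP O
"""

sentences = list(read_sentences(sample.splitlines()))
print("sentences:", len(sentences))
print("tokens per sentence:", [len(s) for s in sentences])

# The last column is the entity tag that the NER model has to predict.
label_counts = Counter(token[-1] for s in sentences for token in s)
print(label_counts.most_common())
```

To run it on the real data, pass an open file handle instead of `sample.splitlines()`, for example `read_sentences(open("data/conll2003/eng.train"))`.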


## Training

### CONLL2000 (chunking task)

In [18]:
!./bin/crfsgd -c 1.0 -f 3 -r 10 ./model/chunk_model.gz template ../data/conll2000/train.txt

Reading template file template.
  u-templates: 19  b-templates: 1
Scanning ../data/conll2000/train.txt to build dictionary.
  sentences: 8936  outputs: 22
  cutoff: 3  features: 76329  parameters: 1679700
  duration: 2.12 seconds.
Using c=1, i.e. lambda=0.000111907
Reading and preprocessing ../data/conll2000/train.txt.
  processed: 8936 sentences.
  duration: 2.65 seconds.
[Calibrating] --  1000 samples
 initial objective=72.0924
 trying eta=0.1  obj=2.31697 (possible)
 trying eta=0.2  obj=2.11996 (possible)
 trying eta=0.4  obj=3.11353 (possible)
 trying eta=0.8  obj=7.48934 (possible)
 trying eta=1.6  obj=20.8896 (possible)
 trying eta=3.2  obj=55.0811 (possible)
 trying eta=6.4  obj=152.961 (too large)
 trying eta=0.05  obj=3.21627 (possible)
 trying eta=0.025  obj=4.52218 (possible)
 trying eta=0.0125  obj=6.1844 (possible)
 trying eta=0.00625  obj=8.43184 (possible)
 taking eta=0.1  t0=89360 time=8.49s.
[Epoch 1] -- wnorm=3428.22 time=13.76s.
[Epoch 2] -- wnorm=4997.64 time=19.1s.
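
Two numbers in this log can be checked by hand. The reported values match `lambda = 1 / (c * n)` and `t0 = 1 / (lambda * eta)`, where `n` is the number of training sentences. These relations are read off the log output, not taken from the crfsgd source, so treat the sketch below as a consistency check. The function names are hypothetical.

```python
def regularization_lambda(c: float, n_sentences: int) -> float:
    """Regularization strength implied by the capacity parameter c (assumed relation)."""
    return 1.0 / (c * n_sentences)


def initial_t0(lam: float, eta: float) -> float:
    """Schedule offset implied by lambda and the calibrated eta (assumed relation)."""
    return 1.0 / (lam * eta)


# Values from the chunking log: c=1, 8936 sentences, calibrated eta=0.1.
lam = regularization_lambda(c=1.0, n_sentences=8936)
t0 = initial_t0(lam, eta=0.1)

print(f"lambda = {lam:.6g}")  # compare with "lambda=0.000111907" in the log
print(f"t0     = {t0:.0f}")   # compare with "t0=89360" in the log
```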

### CONLL2003 (NER, named entity recognition task)

In [19]:
!./bin/crfsgd -c 1.0 -f 3 -r 10 ./model/ner_model.gz template ../data/conll2003/eng.train

Reading template file template.
  u-templates: 19  b-templates: 1
Scanning ../data/conll2003/eng.train to build dictionary.
  sentences: 14986  outputs: 8
  cutoff: 3  features: 79313  parameters: 634560
  duration: 1.98 seconds.
Using c=1, i.e. lambda=6.67289e-05
Reading and preprocessing ../data/conll2003/eng.train.
  processed: 14986 sentences.
  duration: 2.12 seconds.
[Calibrating] --  1000 samples
 initial objective=29.1247
 trying eta=0.1  obj=2.10574 (possible)
 trying eta=0.2  obj=2.96499 (possible)
 trying eta=0.4  obj=6.10803 (possible)
 trying eta=0.8  obj=11.961 (possible)
 trying eta=1.6  obj=33.2681 (too large)
 trying eta=0.05  obj=2.22704 (possible)
 trying eta=0.025  obj=2.68477 (possible)
 trying eta=0.0125  obj=3.23359 (possible)
 trying eta=0.00625  obj=3.83905 (possible)
 trying eta=0.003125  obj=4.61365 (possible)
 trying eta=0.0015625  obj=5.71213 (possible)
 taking eta=0.05  t0=299720 time=1.22s.
[Epoch 1] -- wnorm=1285.97 time=2.36s.
[Epoch 2] -- wnorm=2283.23
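
The `parameters` figure in both training logs can be reproduced from the `features` and `outputs` counts. The sketch below assumes that exactly one feature comes from the single b-template and carries one weight per pair of output labels, while every other feature carries one weight per output label. That split is an inference from the logged numbers, not something the logs state, and `parameter_count` is a hypothetical helper.

```python
def parameter_count(n_features: int, n_outputs: int, n_bigram_features: int = 1) -> int:
    """Weights implied by the feature and label counts (assumed decomposition)."""
    n_unigram_features = n_features - n_bigram_features
    unigram_weights = n_unigram_features * n_outputs      # one weight per label
    bigram_weights = n_bigram_features * n_outputs ** 2   # one weight per label pair
    return unigram_weights + bigram_weights


chunk_params = parameter_count(n_features=76329, n_outputs=22)
ner_params = parameter_count(n_features=79313, n_outputs=8)

print("chunking:", chunk_params)  # log reports parameters: 1679700
print("NER:     ", ner_params)    # log reports parameters: 634560
```

This also explains why the chunking model is larger even though it has fewer features: it has 22 output labels against 8 for NER.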

## Evaluation

### CONLL2000 (chunking task)

In [20]:
!./bin/crfsgd -t model/chunk_model.gz data/conll2000/test.txt | ./conlleval

processed 47377 tokens with 23852 phrases; found: 23815 phrases; correct: 22281.
accuracy:  95.86%; precision:  93.56%; recall:  93.41%; FB1:  93.49
             ADJP: precision:  79.05%; recall:  75.80%; FB1:  77.39  420
             ADVP: precision:  83.08%; recall:  80.48%; FB1:  81.76  839
            CONJP: precision:  55.56%; recall:  55.56%; FB1:  55.56  9
             INTJ: precision: 100.00%; recall:  50.00%; FB1:  66.67  1
              LST: precision:   0.00%; recall:   0.00%; FB1:   0.00  0
               NP: precision:  93.93%; recall:  93.64%; FB1:  93.78  12384
               PP: precision:  96.54%; recall:  98.00%; FB1:  97.27  4884
              PRT: precision:  79.21%; recall:  75.47%; FB1:  77.29  101
             SBAR: precision:  88.21%; recall:  83.93%; FB1:  86.02  509
               VP: precision:  93.62%; recall:  93.82%; FB1:  93.72  4668


### CONLL2003 (NER, named entity recognition task)

In [21]:
!./bin/crfsgd -t model/ner_model.gz data/conll2003/eng.testa | ./conlleval

processed 51577 tokens with 5942 phrases; found: 5573 phrases; correct: 4639.
accuracy:  96.48%; precision:  83.24%; recall:  78.07%; FB1:  80.57
              LOC: precision:  82.52%; recall:  81.22%; FB1:  81.87  1808
             MISC: precision:  87.07%; recall:  63.56%; FB1:  73.48  673
              ORG: precision:  82.92%; recall:  72.04%; FB1:  77.09  1165
              PER: precision:  82.77%; recall:  86.59%; FB1:  84.64  1927


In [23]:
!./bin/crfsgd -t model/ner_model.gz data/conll2003/eng.testb | ./conlleval

processed 46665 tokens with 5648 phrases; found: 5324 phrases; correct: 4246.
accuracy:  95.44%; precision:  79.75%; recall:  75.18%; FB1:  77.40
              LOC: precision:  82.63%; recall:  83.27%; FB1:  82.95  1681
             MISC: precision:  77.27%; recall:  53.28%; FB1:  63.07  484
              ORG: precision:  79.47%; recall:  67.37%; FB1:  72.92  1408
              PER: precision:  77.90%; recall:  84.35%; FB1:  81.00  1751
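
The overall precision, recall, and FB1 values follow from the three counts on the first line of each conlleval report (phrases in the gold data, phrases found, phrases correct). The Python sketch below recomputes them for the three runs above. It uses the standard phrase-level definitions; the helper name `prf` is hypothetical.

```python
from typing import Tuple


def prf(total: int, found: int, correct: int) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) as fractions from phrase counts."""
    precision = correct / found if found else 0.0
    recall = correct / total if total else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# (gold phrases, found phrases, correct phrases) from the reports above.
runs = {
    "conll2000 test.txt": (23852, 23815, 22281),
    "conll2003 eng.testa": (5942, 5573, 4639),
    "conll2003 eng.testb": (5648, 5324, 4246),
}

for name, (total, found, correct) in runs.items():
    p, r, f = prf(total, found, correct)
    print(f"{name}: precision {100 * p:.2f}%  recall {100 * r:.2f}%  FB1 {100 * f:.2f}")
```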
