<h1>Create Answer from DG model</h1>
<h2>What do you need to train SEFR?</h2>
<ul>
  <li><b>Baseline model</b>: DeepCut (already prepared for you)</li>
  <li><b>Train/test corpus</b>: the experiments in the paper use Wisesight <a href="https://github.com/PyThaiNLP/wisesight-sentiment/tree/master/word-tokenization">[here]</a> and TNHC <a href="https://attapol.github.io/tlc.html">[here]</a></li>
</ul>


In [1]:
import warnings
warnings.filterwarnings('ignore')
import numpy as np
import copy as cp
import operator
from preprocessing import preprocess #Our class
prepro = preprocess()
import extract_features
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

In [2]:
'''
path_corpus : put your training corpus in corpus/ and list its folder name here
y_pred      : labels predicted by DeepCut, shape (#sentences, #characters in sentence), e.g. [[1,0,0,0,.....,0],[1,0,0,1,....,0]]
y_entropy   : entropy calculated from y_prob, shape (#sentences, #characters in sentence), e.g. [[0.01,0.1,0.15,.....,0],[0.01,0.2,0.45,.....,0]]
y_prob      : probability from the softmax layer, shape (#sentences, #characters in sentence), e.g. [[0.01,0.1,0.15,.....,0],[0.01,0.2,0.45,.....,0]]
'''
path_corpus = ['CORPUS_FOLDER_NAME']

# create x,y
x,y_true = prepro.preprocess_x_y(path_corpus)

# Flatten one nesting level, keeping only entries longer than one element
y_true = [j for sub in y_true for j in sub if len(j) > 1]
x = [j for sub in x for j in sub if len(j) > 1]

y_pred,y_entropy,y_prob = prepro.predict_(x) # DeepCut Baseline/BEST+WS/WS
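
The docstring above says that y_entropy is calculated from y_prob. Here is a minimal sketch of that relationship, assuming the entropy is the binary Shannon entropy (natural log) of each character's probability. The name binary_entropy is hypothetical and is not part of the preprocessing module. The sample feature values printed further down in this notebook (prob of about 0.99998, entropy of about 0.00021) are consistent with this assumption.

```python
import math

def binary_entropy(p: float) -> float:
    """Shannon entropy (natural log) of a two-outcome event with probability p."""
    # Hypothetical helper for illustration; not the function used by prepro.predict_.
    if p <= 0.0 or p >= 1.0:
        return 0.0  # by convention 0 * log(0) = 0
    q = 1.0 - p
    return -(p * math.log(p) + q * math.log(q))

# One probability per character, one inner list per sentence (made-up values)
y_prob_example = [[0.99, 0.5, 0.02], [0.9, 0.1]]
y_entropy_example = [[binary_entropy(p) for p in sent] for sent in y_prob_example]
print(y_entropy_example)
```

Entropy peaks at p = 0.5, so high values mark the characters where the baseline model is least certain.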


<h1>Train CRF Model</h1>

In [3]:
'''
read more about pycrfsuite: https://python-crfsuite.readthedocs.io/en/latest/
'''
import pycrfsuite

In [17]:
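# Build the CRF features sentence by sentence; each call receives the sentence,
# its index, and the entropy and probability arrays from DeepCut.
X_data = []
for idx, sentence in enumerate(x):
    X_data.append(extract_features.extract_features_crf(sentence, idx, y_entropy, y_prob))
y_data = [list(map(str, l)) for l in y_true]  # pycrfsuite expects string labels

# Flatten from per-sentence lists to per-character items
X_data_1d = [j for sub in X_data for j in sub]
y_data_1d = [j for sub in y_data for j in sub]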


In [19]:
# Sample data
X_data_1d[0]

[{'bias': 'b',
  'char': 'E',
  'entropy': 0.00021009574620019548,
  'prob': 0.9999824166297913,
  'start': True,
  'end': False,
  'char_[-4]': 'ห',
  'ctype[-4]': 'n',
  'char_[-3]': 'น',
  'ctype[-3]': 'c',
  'char_[-2]': '้',
  'ctype[-2]': 't',
  'char_[-1]': '…',
  'ctype[-1]': 'x',
  'char_[+1]': 'u',
  'ctype[+1]': 'o',
  'char_[+2]': 'c',
  'ctype[+2]': 'o',
  'dict_start': False,
  'dict_end': False}]
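
The sample above shows one feature dict per character, with neighbouring characters stored under keys such as char_[-4] and char_[+2]. Below is a hypothetical sketch of how such a context window could be assembled. char_window_features and its parameters are illustrative names, not the notebook's API. The real extract_features.extract_features_crf also adds character types, start/end and dictionary flags, entropy, and probability, all of which are omitted here.

```python
def char_window_features(sentence, i, left=4, right=2):
    """Feature dict for character i plus a window of neighbouring characters."""
    # Hypothetical helper; key names mirror the sample output, the logic is assumed.
    feats = {
        'bias': 'b',
        'char': sentence[i],
    }
    for off in range(-left, right + 1):
        j = i + off
        if off == 0 or j < 0 or j >= len(sentence):
            continue  # skip the character itself and positions outside the sentence
        feats['char_[%+d]' % off] = sentence[j]
    return feats

features = [char_window_features("abcdef", i) for i in range(6)]
print(features[0])
```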

In [20]:
X_train, X_test, y_train, y_test = train_test_split(X_data_1d, y_data_1d, test_size=0.2, random_state=99)


In [21]:
# Train model

trainer = pycrfsuite.Trainer(verbose=True)
#trainer = pycrfsuite.Trainer(verbose=False)

for xseq, yseq in zip(X_train, y_train):
    trainer.append(xseq, yseq)

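trainer.set_params({
    'c1': 0.01,  # coefficient for L1 regularization
    'c2': 0.01,  # coefficient for L2 regularization
    'max_iterations': 1000,  # upper bound on L-BFGS iterations
    'feature.possible_transitions': True,  # also generate transition features not seen in the training data
})

# Output path for the trained model; change the file name as needed
trainer.train('model/my_model.model')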

Feature generation
type: CRF1d
feature.minfreq: 0.000000
feature.possible_states: 0
feature.possible_transitions: 1
0....1....2....3....4....5....6....7....8....9....10
Number of features: 2933
Seconds required: 0.122

L-BFGS optimization
c1: 0.010000
c2: 0.010000
num_memories: 6
max_iterations: 1000
epsilon: 0.000010
stop: 10
delta: 0.000010
linesearch: MoreThuente
linesearch.max_iterations: 20

***** Iteration #1 *****
Loss: 33278.665343
Feature norm: 1.000000
Error norm: 33985.191455
Active features: 2805
Line search trials: 1
Line search step: 0.000028
Seconds required for this iteration: 0.096

***** Iteration #2 *****
Loss: 23922.799944
Feature norm: 1.054541
Error norm: 21200.004357
Active features: 2791
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 0.054

***** Iteration #3 *****
Loss: 14303.387012
Feature norm: 1.837843
Error norm: 10243.724081
Active features: 2777
Line search trials: 1
Line search step: 1.000000
Seconds required for th

***** Iteration #41 *****
Loss: 4520.009734
Feature norm: 48.786587
Error norm: 46.199086
Active features: 2626
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 0.056

***** Iteration #42 *****
Loss: 4514.202972
Feature norm: 50.080546
Error norm: 59.400648
Active features: 2630
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 0.054

***** Iteration #43 *****
Loss: 4511.589069
Feature norm: 49.958082
Error norm: 64.538362
Active features: 2632
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 0.052

***** Iteration #44 *****
Loss: 4509.610987
Feature norm: 50.021076
Error norm: 18.033953
Active features: 2632
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 0.055

***** Iteration #45 *****
Loss: 4505.590489
Feature norm: 50.204850
Error norm: 29.639797
Active features: 2629
Line search trials: 1
Line search step: 1.000000
Seconds required fo

***** Iteration #83 *****
Loss: 4480.629092
Feature norm: 53.979660
Error norm: 14.560446
Active features: 2564
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 0.051

***** Iteration #84 *****
Loss: 4480.527593
Feature norm: 54.025335
Error norm: 31.797866
Active features: 2562
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 0.055

***** Iteration #85 *****
Loss: 4480.391182
Feature norm: 54.079564
Error norm: 17.285795
Active features: 2560
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 0.051

***** Iteration #86 *****
Loss: 4480.274351
Feature norm: 54.116250
Error norm: 28.325434
Active features: 2562
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 0.058

***** Iteration #87 *****
Loss: 4480.140249
Feature norm: 54.211088
Error norm: 17.999675
Active features: 2565
Line search trials: 1
Line search step: 1.000000
Seconds required fo

***** Iteration #126 *****
Loss: 4477.919134
Feature norm: 54.929838
Error norm: 21.194331
Active features: 2557
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 0.053

***** Iteration #127 *****
Loss: 4477.876227
Feature norm: 54.931793
Error norm: 11.103104
Active features: 2559
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 0.054

***** Iteration #128 *****
Loss: 4477.845209
Feature norm: 54.934272
Error norm: 16.191454
Active features: 2562
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 0.046

***** Iteration #129 *****
Loss: 4477.817505
Feature norm: 54.936942
Error norm: 19.039962
Active features: 2564
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 0.055

***** Iteration #130 *****
Loss: 4477.793598
Feature norm: 54.947125
Error norm: 23.088236
Active features: 2564
Line search trials: 1
Line search step: 1.000000
Seconds requir

***** Iteration #167 *****
Loss: 4476.996824
Feature norm: 55.122405
Error norm: 14.232289
Active features: 2554
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 0.060

***** Iteration #168 *****
Loss: 4476.978996
Feature norm: 55.129257
Error norm: 9.842149
Active features: 2552
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 0.061

***** Iteration #169 *****
Loss: 4476.968553
Feature norm: 55.134449
Error norm: 12.150917
Active features: 2554
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 0.061

***** Iteration #170 *****
Loss: 4476.953073
Feature norm: 55.143637
Error norm: 7.526520
Active features: 2555
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 0.060

***** Iteration #171 *****
Loss: 4476.940569
Feature norm: 55.148357
Error norm: 7.484489
Active features: 2557
Line search trials: 1
Line search step: 1.000000
Seconds required 

***** Iteration #209 *****
Loss: 4476.424398
Feature norm: 55.096577
Error norm: 9.622077
Active features: 2553
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 0.049

***** Iteration #210 *****
Loss: 4476.408817
Feature norm: 55.086271
Error norm: 10.995182
Active features: 2551
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 0.059

***** Iteration #211 *****
Loss: 4476.389635
Feature norm: 55.070733
Error norm: 8.608657
Active features: 2550
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 0.055

***** Iteration #212 *****
Loss: 4476.376245
Feature norm: 55.063239
Error norm: 8.760699
Active features: 2550
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 0.052

***** Iteration #213 *****
Loss: 4476.359794
Feature norm: 55.051061
Error norm: 5.658010
Active features: 2548
Line search trials: 1
Line search step: 1.000000
Seconds required f

***** Iteration #253 *****
Loss: 4476.051340
Feature norm: 54.929081
Error norm: 6.236913
Active features: 2542
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 0.062

***** Iteration #254 *****
Loss: 4476.047328
Feature norm: 54.929687
Error norm: 6.828530
Active features: 2544
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 0.062

***** Iteration #255 *****
Loss: 4476.040317
Feature norm: 54.929556
Error norm: 5.016259
Active features: 2544
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 0.056

***** Iteration #256 *****
Loss: 4476.034331
Feature norm: 54.924185
Error norm: 5.447318
Active features: 2542
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 0.048

***** Iteration #257 *****
Loss: 4476.030723
Feature norm: 54.924533
Error norm: 5.340295
Active features: 2544
Line search trials: 1
Line search step: 1.000000
Seconds required fo
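
The c1 and c2 values passed to trainer.set_params, and echoed at the top of the log above, are the L1 and L2 regularization coefficients of CRFsuite's L-BFGS trainer. As a sketch, assuming the usual elastic-net form of the penalized negative log-likelihood (the symbols w, x^(n), y^(n), and N are introduced here for illustration only):

```latex
\min_{w}\; -\sum_{n=1}^{N} \log p\left(y^{(n)} \mid x^{(n)}; w\right) + c_1 \lVert w \rVert_1 + c_2 \lVert w \rVert_2^2
```

With c1 > 0, some weights are driven exactly to zero, which is in line with the "Active features" count in the log staying below the 2933 generated features.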

In [22]:
# load model
tagger = pycrfsuite.Tagger()
tagger.open('model/my_model.model')
y_pred = [tagger.tag(xseq) for xseq in X_test]

# Evaluate
labels = {'1': 1, "0": 0}  # map string tags back to integers for classification_report()
predictions = np.array([labels[tag] for row in y_pred for tag in row])
truths = np.array([labels[tag] for row in y_test for tag in row])

# target_names are assigned in sorted label order: the first name goes to label 0, the second to label 1
print(classification_report(
    truths, predictions,
    target_names=["B", "I"]))

              precision    recall  f1-score   support

           B       0.98      0.97      0.98     10493
           I       0.94      0.95      0.95      4534

    accuracy                           0.97     15027
   macro avg       0.96      0.96      0.96     15027
weighted avg       0.97      0.97      0.97     15027
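
The report above lists precision, recall, and F1 per class. Here is a minimal pure-Python sketch of how those per-class figures are defined, using made-up toy labels rather than this notebook's data. per_class_scores is a hypothetical helper; classification_report remains the tool actually used here.

```python
def per_class_scores(truths, predictions, label):
    """Precision, recall and F1 for one label, computed from raw counts."""
    pairs = list(zip(truths, predictions))
    tp = sum(1 for t, p in pairs if t == label and p == label)  # true positives
    fp = sum(1 for t, p in pairs if t != label and p == label)  # false positives
    fn = sum(1 for t, p in pairs if t == label and p != label)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy labels (not taken from the corpus)
toy_truths = [0, 0, 0, 1, 1, 0]
toy_predictions = [0, 0, 1, 1, 0, 0]
for label in (0, 1):
    print(label, per_class_scores(toy_truths, toy_predictions, label))
```

The macro average in the report is the unweighted mean of the per-class scores, while the weighted average weights each class by its support.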



<h1>What's next?</h1>
Try <b>'Stacked Model Example.ipynb'</b> to see <i><b>how we build the stacked model and apply filter-and-refine</b></i>.

<h1>Credit</h1>
The CRF training code is from Arthit Suriyawongkul (PyCon Thailand 2018) <a href="https://docs.google.com/presentation/d/1zsn-DqoWm2HNPxjuJvjPB3qS892QmNko2Sjr8ZwFmOc/edit?fbclid=IwAR3xNVxG7xmM1G4PN3o1nAHiLDb8B1d33GJ7q2tXPN1GcX1eXTiOU1xz8Ro#slide=id.p">[here]</a>