# Evaluation

## Data Pre-processing

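First, read the preprocessed subtitle data from the proc/ directory, which sits alongside this notebook, and segment the text into words with *jieba*.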

In [1]:
from sklearn.metrics import f1_score
from utils import read_csvdir, sep_train_test, accuracy
from simple_nb import SimpleNB
from binary_nb import BinaryNB
from tfidf_nb import tfidfNB

data_arr = read_csvdir('./proc', details=True)

Building prefix dict from the default dictionary ...
Loading model from cache /var/folders/jt/ms93_ljx2mvgwxx24xrlw03r0000gn/T/jieba.cache


Read 2944 lines from Inferno_2016.csv
Read 6516 lines from Schindlers_List_1993.csv
Read 3088 lines from Wonder_Woman_2017.csv
Read 4437 lines from Inception_2010.csv
Read 4485 lines from Spider-Man_Homecoming_2017.csv
Read 3104 lines from Baby_Driver_2017.csv


Loading model cost 1.023 seconds.
Prefix dict has been built succesfully.


Shuffle all of the data, then split it 9:1 into a training set and a test set.

In [2]:
train_arr, test_arr = sep_train_test(data_arr, ratio=0.1)
test_Y, test_X = zip(*test_arr)

print('# of training data =', len(train_arr))
print('# of test data     =', len(test_arr))

# of training data = 22117
# of test data     = 2457
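
The implementation of `sep_train_test` lives in `utils` and is not shown here. The following is only a minimal sketch of a shuffle-then-split helper, assuming the test set size is the truncated product of the data length and `ratio`. That assumption is consistent with the counts above, since the six files total 24574 lines and `int(24574 * 0.1)` is 2457. The name `split_train_test` and the `seed` parameter are hypothetical.

```python
import random

def split_train_test(data, ratio=0.1, seed=0):
    """Shuffle a copy of `data` and hold out a `ratio` fraction as test data.

    Hypothetical stand-in for utils.sep_train_test; the real helper may differ.
    """
    shuffled = list(data)                      # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)      # seeded for a reproducible split
    n_test = int(len(shuffled) * ratio)        # truncate, e.g. int(24574 * 0.1) == 2457
    return shuffled[n_test:], shuffled[:n_test]

# Toy (label, text) pairs standing in for the subtitle lines.
samples = [('label_{}'.format(i % 3), 'text {}'.format(i)) for i in range(100)]
train_part, test_part = split_train_test(samples, ratio=0.1)
print(len(train_part), len(test_part))
```

Fixing the seed keeps the split stable across runs, which makes the scores of different models comparable.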


## Cross-validation and Test

![cross-validation](https://raw.githubusercontent.com/ritchieng/machine-learning-dataschool/master/images/07_cross_validation_diagram.png)

In [3]:
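def cross_val(model_class, data, cv=5):
    """Print the per-fold and mean macro F1 of `model_class` under `cv`-fold cross-validation."""
    scores = []
    nb_seg = len(data) // cv  # fold size; leftover samples never land in a dev fold

    print('Cross Validation')
    for i in range(cv):
        # Hold out the i-th segment as the dev set and train on everything else.
        dev_data = data[i*nb_seg : (i+1)*nb_seg]
        train_data = data[: i*nb_seg] + data[(i+1)*nb_seg :]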
        
        dev_Y, dev_X = zip(*dev_data)
        
        model = model_class()
        model.train(train_data)
        preds = model.predict(dev_X)
        
        score = f1_score(dev_Y, preds, average='macro')
        scores.append(score)

    print('  f1_score = [{}]'.format(', '.join(['{:4f}'.format(s) for s in scores])))
    print('  {}-fold cross-validation f1_score = {:.4f}'.format(cv, sum(scores)/cv))

def test(model_class, train_data, test_X, test_Y):
    model = model_class()
    model.train(train_data)
    preds = model.predict(test_X)
    
    print('\nTest')
    print('  accuracy        = {:.4f}'.format(accuracy(test_Y, preds)))
    print('  f1_score(micro) = {:.4f}'.format(f1_score(test_Y, preds, average='micro')))
    print('  f1_score(macro) = {:.4f}'.format(f1_score(test_Y, preds, average='macro')))
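
To make the fold slicing in `cross_val` concrete, the sketch below computes the dev-set boundaries for the 22117 training samples with `cv=5`. The helper name `fold_bounds` is hypothetical and exists only for illustration. Because the fold size uses integer division, the last two samples are never used as dev data and always stay in the training portion.

```python
def fold_bounds(n_samples, cv=5):
    """Return the (start, end) index pair of each dev fold, as sliced in cross_val."""
    seg = n_samples // cv                      # same integer division as nb_seg
    return [(i * seg, (i + 1) * seg) for i in range(cv)]

n_samples = 22117                              # training set size reported above
bounds = fold_bounds(n_samples, cv=5)
leftover = n_samples - bounds[-1][1]           # samples that never appear in a dev fold

for start, end in bounds:
    print('dev = data[{}:{}]'.format(start, end))
print('samples never used for dev:', leftover)
```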

### Naive Bayes

In [4]:
cross_val(SimpleNB, train_arr, cv=5)
test(SimpleNB, train_arr, test_X, test_Y)

Cross Validation
  f1_score = [0.642867, 0.633875, 0.630873, 0.624296, 0.623581]
  5-fold cross-validation f1_score = 0.6311

Test
  accuracy        = 0.6422
  f1_score(micro) = 0.6422
  f1_score(macro) = 0.6234
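
Accuracy and micro-averaged F1 are identical in the test output. This is expected in single-label multi-class classification: every wrong prediction counts once as a false positive (for the predicted class) and once as a false negative (for the true class), so pooled precision, recall, and F1 all reduce to accuracy. Macro F1 averages the per-class F1 scores with equal weight, so it drops when smaller classes are predicted poorly. The sketch below checks this on made-up toy labels in plain Python; it is an illustration, not a replacement for `sklearn.metrics.f1_score`.

```python
from collections import Counter

def f1_from_counts(tp, fp, fn):
    """F1 = 2TP / (2TP + FP + FN), defined as 0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def micro_macro_f1(y_true, y_pred):
    """Return (micro F1, macro F1) for single-label multi-class predictions."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1                         # wrong prediction: FP for predicted class
            fn[t] += 1                         # ... and FN for the true class
    labels = sorted(set(y_true) | set(y_pred))
    micro = f1_from_counts(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    macro = sum(f1_from_counts(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    return micro, macro

# Toy labels (not taken from the subtitle data): class 'b' is always missed.
y_true = ['a', 'a', 'a', 'a', 'b', 'c']
y_pred = ['a', 'a', 'a', 'a', 'a', 'c']

acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
micro, macro = micro_macro_f1(y_true, y_pred)
print('accuracy = {:.4f}, micro = {:.4f}, macro = {:.4f}'.format(acc, micro, macro))
```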


### Binarized Naive Bayes

In [5]:
cross_val(BinaryNB, train_arr, cv=5)
test(BinaryNB, train_arr, test_X, test_Y)

Cross Validation
  f1_score = [0.609302, 0.617092, 0.602089, 0.602649, 0.606567]
  5-fold cross-validation f1_score = 0.6075

Test
  accuracy        = 0.6593
  f1_score(micro) = 0.6593
  f1_score(macro) = 0.6140


### TF-IDF Naive Bayes

In [6]:
cross_val(tfidfNB, train_arr, cv=5)
test(tfidfNB, train_arr, test_X, test_Y)

Cross Validation


KeyError: 'CN'