# Cross Entropy Loss Calculation for each n-gram

## To-Do
- Implement cross entropy for language modeling
- N-gram Generation, Word Counting, Cleaning Revisit
- Smoothing (Add-1) implementation

### Entropy

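In information theory, the entropy $H(X)$ of a random variable $X$ is the expected negative log probability of its outcomes:

\begin{equation}
    H(X) = - \sum_{x} P(x) \log_2 P(x)
\end{equation}

It measures uncertainty: the more evenly the probability is spread over the outcomes, the higher the entropy.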
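
As a quick numerical check of the definition, the following minimal sketch computes $H(X)$ for two small hand-made distributions. The function name `entropy_bits` and the two example distributions are illustrative assumptions and are not part of the notebook's code.

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits of a discrete distribution given as probabilities."""
    # Terms with P(x) = 0 are skipped: they contribute nothing to the sum.
    return -sum(p * math.log2(p) for p in probs if p > 0)

fair_coin = [0.5, 0.5]      # toy distribution: both outcomes equally likely
biased_coin = [0.9, 0.1]    # toy distribution: one outcome dominates

h_fair = entropy_bits(fair_coin)
h_biased = entropy_bits(biased_coin)
print(h_fair, h_biased)
```

The fair coin has the larger entropy because its outcome is harder to predict.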


### Defn: Cross Entropy

The cross entropy $H(p,m)$ of a true distribution **p** and a model distribution **m** is defined as:

\begin{equation}
    H(p,m) = - \sum_{x} p(x) \log_2 m(x)
\end{equation}

The lower the cross entropy, the closer the model distribution **m** is to the true distribution **p**. Its minimum is the entropy $H(p)$, reached when **m** equals **p**.
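
To make the definition concrete, here is a small sketch that evaluates the cross entropy of one toy distribution against two toy models. The helper `cross_entropy_bits` and all probability values are made up for illustration.

```python
import math

def cross_entropy_bits(p, m):
    """H(p, m) in bits; p and m map each outcome to its probability."""
    return -sum(px * math.log2(m[x]) for x, px in p.items() if px > 0)

p = {"a": 0.5, "b": 0.25, "c": 0.25}       # "true" distribution (toy values)
m_close = {"a": 0.4, "b": 0.3, "c": 0.3}   # model that is near p
m_far = {"a": 0.1, "b": 0.1, "c": 0.8}     # model that is far from p

h_p = cross_entropy_bits(p, p)             # H(p, p) is just the entropy H(p)
h_close = cross_entropy_bits(p, m_close)
h_far = cross_entropy_bits(p, m_far)
print(h_p, h_close, h_far)
```

The values come out ordered as `h_p <= h_close <= h_far`: the worse the model matches **p**, the more bits are needed on average.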

## Contents
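- `NIRW1900000011.json` contains news articles from 전자신문 (Electronic Times) and is used as the training data.
- `NWRW1800000045.json` contains news articles from 동아일보 (Dong-A Ilbo) and is used as test data.
- `EBRW1908000138.json` (attached), one of the web corpora (WEB) of the National Institute of Korean Language (국립국어원), contains blog text and is used as test data.
- Cross entropy is used to examine how well the unigram, bigram, and trigram models over Korean jamo, syllables (characters), and eojeol (space-delimited words), all learned from the training data, fit the newspaper articles and the web text.
- From each of the three data sets, only the `"form"` field is extracted. Everything except Hangul characters is removed (spaces are taken into account), then unigram, bigram, and trigram sequences are built and their frequencies counted.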
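
The cleaning and counting step described in the last item can be sketched as follows. This is a simplified illustration at the syllable level only; the function name `char_ngram_counts` and the sample sentence are assumptions, and the notebook's own `preprocessing` function handles more cases.

```python
import re
from collections import Counter

def char_ngram_counts(text, n):
    """Count syllable-level n-grams inside each space-separated word."""
    # Keep only spaces, compatibility jamo, and Hangul syllables.
    cleaned = re.sub(r"[^ ㄱ-ㅣ가-힣]", "", text)
    counts = Counter()
    for word in cleaned.split():
        # Zipping n shifted copies of the word yields its sliding windows.
        counts.update(zip(*(word[i:] for i in range(n))))
    return counts

sample = "한국어 모델, 한국어 corpus 2019!"   # made-up sentence
unigrams = char_ngram_counts(sample, 1)
bigrams = char_ngram_counts(sample, 2)
print(bigrams)
```

Because n-grams are built per word, no n-gram spans a space, which is also how the notebook treats syllables and jamo.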

## Things to consider
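- On the training corpus, entropy and cross entropy are equal, so their difference is 0.
- To evaluate on a test corpus, compute each model's probabilities and plug them into the cross entropy formula. **Here P(x) is the probability of each jamo-, syllable-, or eojeol-level unigram/bigram/trigram in the test corpus, and the model term log m(x) uses the probability of the same n-gram estimated from the training corpus.** Multiplying the two for each n-gram, summing over all n-grams, and negating the sum gives the cross entropy. The better the language model built from the training corpus fits the test corpus, the closer the test-corpus probability of each n-gram is to its training-corpus probability. In the ideal case where the two distributions coincide, the cross entropy equals the actual entropy and the difference is 0. The smaller H(p,m) - H(p) is, the better the model.
- N-grams that occur in the test corpus but not in the training corpus are a problem, because their model probability would be zero and the logarithm undefined. Use add-1 smoothing to assign them a probability.
(Hint: compare the n-grams of the training data with those of the test data, fill in the missing ones, and compute the probabilities with add-1.)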
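
One common way to apply the hint is to smooth both corpora over their shared vocabulary, so that each distribution sums to 1. The sketch below uses made-up counts and a hypothetical helper `add_one_probs`; it is not necessarily identical to what the notebook's `cross_entropy_calc` does.

```python
import math
from collections import Counter

def add_one_probs(counts, vocab):
    """Add-1 (Laplace) estimate over a fixed vocabulary; unseen items count as 0."""
    total = sum(counts.values()) + len(vocab)
    return {key: (counts.get(key, 0) + 1) / total for key in vocab}

train = Counter({("가",): 6, ("나",): 3, ("다",): 1})   # toy training counts
test = Counter({("가",): 2, ("나",): 1, ("라",): 1})    # ("라",) is unseen in training

vocab = set(train) | set(test)      # fill in n-grams missing from either side
p = add_one_probs(test, vocab)      # test distribution P(x)
m = add_one_probs(train, vocab)     # model distribution m(x)

entropy = -sum(p[k] * math.log2(p[k]) for k in vocab)
cross_entropy = -sum(p[k] * math.log2(m[k]) for k in vocab)
difference = cross_entropy - entropy   # non-negative when both sum to 1
print(round(entropy, 4), round(cross_entropy, 4), round(difference, 4))
```

With both distributions normalized over the same vocabulary, the difference is the Kullback-Leibler divergence of **p** from **m**, which is zero only when the two distributions are identical.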

# Reusable function definitions

In [8]:
import json
import re
from jamo import h2j, j2hcj
from collections import Counter
import math

# Preprocessing function
"""
:param file: JSON file to preprocess
:param n: order of the n-gram (default = 1)
:return: word_dict, letter_dict, jamo_dict; frequency dictionaries of word, syllable, and jamo n-grams
"""
def preprocessing(file, n=1):
    with open(file, 'rt', encoding='UTF8') as json_file:
        data = json.load(json_file)
    word_dict = {}
    letter_dict = {}
    jamo_dict = {}
    for i in range(len(data['document'])):
        each_document = data['document'][i]
        for j in range(len(each_document['paragraph'])):
            # Paragraph whose 'form' text is cleaned below
            each_pragraph = each_document['paragraph'][j]
            # Strip ideographic spaces, tabs, newlines, and the \x03 control character,
            # turn '”' and '=' into spaces, then drop everything except spaces and Hangul
            each_pragraph_preprocessed = re.sub('[^ ㄱ-ㅣ가-힣]', ''
                                                , re.sub("[\u3000\t\n\x03]", "", each_pragraph['form']).replace('”',' ').replace('=', ' '))
            temp_word_list = re.findall(r'\w+', each_pragraph_preprocessed)
            temp_word_dict = dict(Counter(ngrams(temp_word_list, n)))
            word_dict = update_with_increment(word_dict, temp_word_dict)
            temp_letter_list = []
            words_preprocessed = each_pragraph_preprocessed.split(' ')
            for k in range(len(words_preprocessed)):
                each_word_preprocessed = words_preprocessed[k]
                # Build syllable and jamo n-grams within each space-delimited word
                temp_letter_list_by_words = []
                temp_jamo_list_by_words = []                
                for l in range(len(list(each_word_preprocessed))):
                    each_letter_preprocessed = list(each_word_preprocessed)[l]
                    temp_letter_list_by_words.append(each_letter_preprocessed)
                    temp_jamo_list = []
                    each_jamo_preprocessed = list(j2hcj(h2j(each_letter_preprocessed)))
                    if len(each_jamo_preprocessed) == 2:
                        each_jamo_preprocessed.append('X')
                    temp_jamo_list_by_words.append(each_jamo_preprocessed)
                temp_letter_dict = dict(Counter(ngrams(temp_letter_list_by_words, n)))
                letter_dict = update_with_increment(letter_dict, temp_letter_dict)
                flatten_temp_jamo_list_by_words = flatten(temp_jamo_list_by_words)
                temp_jamo_dict = dict(Counter(ngrams(flatten_temp_jamo_list_by_words, n)))
                jamo_dict = update_with_increment(jamo_dict, temp_jamo_dict)    
                        
    return word_dict, letter_dict, jamo_dict


# N-gram function
"""
:param lst: list of corpus tokens to turn into n-grams, e.g. ['ㄱ', 'ㅏ', 'ㅇ']
:param n: the 'N' of the n-gram
:return: generator that yields each n-gram as a tuple
"""
from itertools import tee, islice
def ngrams(lst, n):
    tlst = lst
    while True:
        a, b = tee(tlst)
        l = tuple(islice(a, n))
        if len(l) == n:
            yield l
            next(b)
            tlst = b
        else:
            break
            
            
# Flatten a nested list (merges the per-syllable jamo lists into one list per word)
"""
:param t: a nested list
:return: list, a flattened list
"""
def flatten(t):
    return [item for sublist in t for item in sublist]


# Merge dict2 into dict1: add the counts of existing keys and insert new keys
"""
:param dict1: previous dictionary
:param dict2: current dictionary
:return: dict, updated dictionary
"""
def update_with_increment(dict1, dict2):
    for tkey, tvalue in dict2.items():
        dict1[tkey] = dict1.get(tkey, 0) + tvalue
    return dict1


# Convert a frequency dictionary into probabilities
"""
:param dictionary: frequency dictionary whose values are to be converted
:return: dict, dictionary whose values are probabilities (add-1 smoothing applied)
"""
def dict_to_prob(dictionary):
    cdict = dictionary.copy()
    # V of add-1 smoothing: the number of distinct keys
    v = len(cdict)
    total = sum(cdict.values())
    for key, value in cdict.items():
        cdict[key] = (value+1) / (total+v)
    return cdict


# Compute entropy
"""
:param X: frequency dictionary
:return: float, entropy in bits of the add-1 smoothed distribution
"""
def entropy_calc(X):
    dist_X = dict_to_prob(X)
    entropy = 0
    for value in dist_X.values():
        entropy += -(value*math.log(value, 2))
    return entropy


# Compute the cross entropy between the true distribution and the model distribution
"""
:param X: n-gram actual frequency dictionary
:param M: n-gram model frequency dictionary
:return: cross_entropy, cross entropy between X and M
"""
def cross_entropy_calc(X, M):
    dist_X = dict_to_prob(X)
    dist_M = dict_to_prob(M)
    # Number of distinct n-grams (V) and total count (N) of each corpus
    V_X = len(X)
    N_X = sum(X.values())
    V_M = len(M)
    N_M = sum(M.values())
    cross_entropy = 0
    allkey = set(X) | set(M)
    temp_dict_X = dict.fromkeys(allkey, 0)
    temp_dict_M = dict.fromkeys(allkey, 0)
    # An n-gram missing from a corpus gets the add-1 probability of a zero count
    for key in temp_dict_X:
        if key in X:
            temp_dict_X[key] = dist_X[key]
        else:
            temp_dict_X[key] = 1 / (N_X + V_X)
    for key in temp_dict_M:
        if key in M:
            temp_dict_M[key] = dist_M[key]
        else:
            temp_dict_M[key] = 1 / (N_M + V_M)
    for key in temp_dict_X:
        cross_entropy += -(temp_dict_X[key] * math.log(temp_dict_M[key], 2))
    return cross_entropy
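
The jamo step in `preprocessing` relies on the `jamo` package (`h2j`, `j2hcj`) and pads syllables that have no final consonant with `'X'`. For readers who want to see what that decomposition amounts to, the sketch below reproduces it with plain Unicode arithmetic on precomposed Hangul syllables (U+AC00 onward, 19 initials x 21 medials x 28 finals). The function `decompose_syllable` is a hypothetical stand-in and is only expected to roughly mirror the library's output.

```python
CHO = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"
JUNG = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"
JONG = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")

def decompose_syllable(ch, pad="X"):
    """Split one precomposed Hangul syllable into [initial, medial, final]."""
    index = ord(ch) - 0xAC00
    if not 0 <= index < 19 * 21 * 28:
        return [ch]                    # not a precomposed syllable: leave as is
    initial, rest = divmod(index, 21 * 28)
    medial, final = divmod(rest, 28)
    # Pad syllables without a final consonant so each syllable yields 3 symbols.
    return [CHO[initial], JUNG[medial], JONG[final] or pad]

print(decompose_syllable("강"), decompose_syllable("가"))
```

Padding keeps every syllable at exactly three symbols, so jamo n-grams stay aligned with syllable boundaries.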

# Frequency counts of eojeol, syllable, and jamo n-grams for train, test1, and test2 (uni, bi, tri)

In [9]:
#unigram
word_dict_train_uni, letter_dict_train_uni, jamo_dict_train_uni = preprocessing('NIRW1900000011.json', 1)
word_dict_test1_uni, letter_dict_test1_uni, jamo_dict_test1_uni = preprocessing('NWRW1800000045.json', 1)
word_dict_test2_uni, letter_dict_test2_uni, jamo_dict_test2_uni = preprocessing('EBRW1908000138.json', 1)

#bigram
word_dict_train_bi, letter_dict_train_bi, jamo_dict_train_bi = preprocessing('NIRW1900000011.json', 2)
word_dict_test1_bi, letter_dict_test1_bi, jamo_dict_test1_bi = preprocessing('NWRW1800000045.json', 2)
word_dict_test2_bi, letter_dict_test2_bi, jamo_dict_test2_bi = preprocessing('EBRW1908000138.json', 2)

#trigram
word_dict_train_tri, letter_dict_train_tri, jamo_dict_train_tri = preprocessing('NIRW1900000011.json', 3)
word_dict_test1_tri, letter_dict_test1_tri, jamo_dict_test1_tri = preprocessing('NWRW1800000045.json', 3)
word_dict_test2_tri, letter_dict_test2_tri, jamo_dict_test2_tri = preprocessing('EBRW1908000138.json', 3)

# Entropy, cross entropy, and difference for (train, test1, test2) X (uni, bi, tri)

## train

In [10]:
#unigram
entropy_word_train_uni = entropy_calc(word_dict_train_uni)
entropy_letter_train_uni = entropy_calc(letter_dict_train_uni)
entropy_jamo_train_uni = entropy_calc(jamo_dict_train_uni)

cross_entropy_word_train_uni = cross_entropy_calc(word_dict_train_uni, word_dict_train_uni)
cross_entropy_letter_train_uni = cross_entropy_calc(letter_dict_train_uni, letter_dict_train_uni)
cross_entropy_jamo_train_uni = cross_entropy_calc(jamo_dict_train_uni, jamo_dict_train_uni)

diff_word_uni_train = cross_entropy_word_train_uni - entropy_word_train_uni
diff_letter_uni_train = cross_entropy_letter_train_uni - entropy_letter_train_uni
diff_jamo_uni_train = cross_entropy_jamo_train_uni - entropy_jamo_train_uni

#bigram
entropy_word_train_bi = entropy_calc(word_dict_train_bi)
entropy_letter_train_bi = entropy_calc(letter_dict_train_bi)
entropy_jamo_train_bi = entropy_calc(jamo_dict_train_bi)

cross_entropy_word_train_bi = cross_entropy_calc(word_dict_train_bi, word_dict_train_bi)
cross_entropy_letter_train_bi = cross_entropy_calc(letter_dict_train_bi, letter_dict_train_bi)
cross_entropy_jamo_train_bi = cross_entropy_calc(jamo_dict_train_bi, jamo_dict_train_bi)

diff_word_bi_train = cross_entropy_word_train_bi - entropy_word_train_bi
diff_letter_bi_train = cross_entropy_letter_train_bi - entropy_letter_train_bi
diff_jamo_bi_train = cross_entropy_jamo_train_bi - entropy_jamo_train_bi

#trigram
entropy_word_train_tri = entropy_calc(word_dict_train_tri)
entropy_letter_train_tri = entropy_calc(letter_dict_train_tri)
entropy_jamo_train_tri = entropy_calc(jamo_dict_train_tri)

cross_entropy_word_train_tri = cross_entropy_calc(word_dict_train_tri, word_dict_train_tri)
cross_entropy_letter_train_tri = cross_entropy_calc(letter_dict_train_tri, letter_dict_train_tri)
cross_entropy_jamo_train_tri = cross_entropy_calc(jamo_dict_train_tri, jamo_dict_train_tri)

diff_word_tri_train = cross_entropy_word_train_tri - entropy_word_train_tri
diff_letter_tri_train = cross_entropy_letter_train_tri - entropy_letter_train_tri
diff_jamo_tri_train = cross_entropy_jamo_train_tri - entropy_jamo_train_tri

## test1

In [11]:
#unigram
entropy_word_test1_uni = entropy_calc(word_dict_test1_uni)
entropy_letter_test1_uni = entropy_calc(letter_dict_test1_uni)
entropy_jamo_test1_uni = entropy_calc(jamo_dict_test1_uni)

cross_entropy_word_test1_uni = cross_entropy_calc(word_dict_test1_uni, word_dict_train_uni)
cross_entropy_letter_test1_uni = cross_entropy_calc(letter_dict_test1_uni, letter_dict_train_uni)
cross_entropy_jamo_test1_uni = cross_entropy_calc(jamo_dict_test1_uni, jamo_dict_train_uni)

diff_word_uni_test1 = cross_entropy_word_test1_uni - entropy_word_test1_uni
diff_letter_uni_test1 = cross_entropy_letter_test1_uni - entropy_letter_test1_uni
diff_jamo_uni_test1 = cross_entropy_jamo_test1_uni - entropy_jamo_test1_uni

#bigram
entropy_word_test1_bi = entropy_calc(word_dict_test1_bi)
entropy_letter_test1_bi = entropy_calc(letter_dict_test1_bi)
entropy_jamo_test1_bi = entropy_calc(jamo_dict_test1_bi)

cross_entropy_word_test1_bi = cross_entropy_calc(word_dict_test1_bi, word_dict_train_bi)
cross_entropy_letter_test1_bi = cross_entropy_calc(letter_dict_test1_bi, letter_dict_train_bi)
cross_entropy_jamo_test1_bi = cross_entropy_calc(jamo_dict_test1_bi, jamo_dict_train_bi)

diff_word_bi_test1 = cross_entropy_word_test1_bi - entropy_word_test1_bi
diff_letter_bi_test1 = cross_entropy_letter_test1_bi - entropy_letter_test1_bi
diff_jamo_bi_test1 = cross_entropy_jamo_test1_bi - entropy_jamo_test1_bi

#trigram
entropy_word_test1_tri = entropy_calc(word_dict_test1_tri)
entropy_letter_test1_tri = entropy_calc(letter_dict_test1_tri)
entropy_jamo_test1_tri = entropy_calc(jamo_dict_test1_tri)

cross_entropy_word_test1_tri = cross_entropy_calc(word_dict_test1_tri, word_dict_train_tri)
cross_entropy_letter_test1_tri = cross_entropy_calc(letter_dict_test1_tri, letter_dict_train_tri)
cross_entropy_jamo_test1_tri = cross_entropy_calc(jamo_dict_test1_tri, jamo_dict_train_tri)

diff_word_tri_test1 = cross_entropy_word_test1_tri - entropy_word_test1_tri
diff_letter_tri_test1 = cross_entropy_letter_test1_tri - entropy_letter_test1_tri
diff_jamo_tri_test1 = cross_entropy_jamo_test1_tri - entropy_jamo_test1_tri

## test2

In [12]:
#unigram
entropy_word_test2_uni = entropy_calc(word_dict_test2_uni)
entropy_letter_test2_uni = entropy_calc(letter_dict_test2_uni)
entropy_jamo_test2_uni = entropy_calc(jamo_dict_test2_uni)

cross_entropy_word_test2_uni = cross_entropy_calc(word_dict_test2_uni, word_dict_train_uni)
cross_entropy_letter_test2_uni = cross_entropy_calc(letter_dict_test2_uni, letter_dict_train_uni)
cross_entropy_jamo_test2_uni = cross_entropy_calc(jamo_dict_test2_uni, jamo_dict_train_uni)

diff_word_uni_test2 = cross_entropy_word_test2_uni - entropy_word_test2_uni
diff_letter_uni_test2 = cross_entropy_letter_test2_uni - entropy_letter_test2_uni
diff_jamo_uni_test2 = cross_entropy_jamo_test2_uni - entropy_jamo_test2_uni

#bigram
entropy_word_test2_bi = entropy_calc(word_dict_test2_bi)
entropy_letter_test2_bi = entropy_calc(letter_dict_test2_bi)
entropy_jamo_test2_bi = entropy_calc(jamo_dict_test2_bi)

cross_entropy_word_test2_bi = cross_entropy_calc(word_dict_test2_bi, word_dict_train_bi)
cross_entropy_letter_test2_bi = cross_entropy_calc(letter_dict_test2_bi, letter_dict_train_bi)
cross_entropy_jamo_test2_bi = cross_entropy_calc(jamo_dict_test2_bi, jamo_dict_train_bi)

diff_word_bi_test2 = cross_entropy_word_test2_bi - entropy_word_test2_bi
diff_letter_bi_test2 = cross_entropy_letter_test2_bi - entropy_letter_test2_bi
diff_jamo_bi_test2 = cross_entropy_jamo_test2_bi - entropy_jamo_test2_bi

#trigram
entropy_word_test2_tri = entropy_calc(word_dict_test2_tri)
entropy_letter_test2_tri = entropy_calc(letter_dict_test2_tri)
entropy_jamo_test2_tri = entropy_calc(jamo_dict_test2_tri)

cross_entropy_word_test2_tri = cross_entropy_calc(word_dict_test2_tri, word_dict_train_tri)
cross_entropy_letter_test2_tri = cross_entropy_calc(letter_dict_test2_tri, letter_dict_train_tri)
cross_entropy_jamo_test2_tri = cross_entropy_calc(jamo_dict_test2_tri, jamo_dict_train_tri)

diff_word_tri_test2 = cross_entropy_word_test2_tri - entropy_word_test2_tri
diff_letter_tri_test2 = cross_entropy_letter_test2_tri - entropy_letter_test2_tri
diff_jamo_tri_test2 = cross_entropy_jamo_test2_tri - entropy_jamo_test2_tri

# Visualization

In [13]:
import pandas as pd
col = ["Entropy","Cross Entropy","Difference H(p,m)-H(p)"]
# Row labels: corpus x unit x n-gram order, in the same order as the rows of `con`
ind = pd.MultiIndex.from_product(
    [["Training", "Test: news articles", "Test: web articles"],
     ["jamo", "syllable", "eojeol"],
     ["unigram", "bigram", "trigram"]],
    names=["Corpus", "Unit", "N-gram"])
con = [[entropy_jamo_train_uni, cross_entropy_jamo_train_uni, diff_jamo_uni_train],
       [entropy_jamo_train_bi, cross_entropy_jamo_train_bi,diff_jamo_bi_train],
       [entropy_jamo_train_tri, cross_entropy_jamo_train_tri,diff_jamo_tri_train],
       [entropy_letter_train_uni, cross_entropy_letter_train_uni, diff_letter_uni_train],
       [entropy_letter_train_bi, cross_entropy_letter_train_bi,diff_letter_bi_train],
       [entropy_letter_train_tri, cross_entropy_letter_train_tri,diff_letter_tri_train],
       [entropy_word_train_uni, cross_entropy_word_train_uni, diff_word_uni_train],
       [entropy_word_train_bi, cross_entropy_word_train_bi,diff_word_bi_train],
       [entropy_word_train_tri, cross_entropy_word_train_tri,diff_word_tri_train],
       [entropy_jamo_test1_uni, cross_entropy_jamo_test1_uni, diff_jamo_uni_test1],
       [entropy_jamo_test1_bi, cross_entropy_jamo_test1_bi,diff_jamo_bi_test1],
       [entropy_jamo_test1_tri, cross_entropy_jamo_test1_tri,diff_jamo_tri_test1],
       [entropy_letter_test1_uni, cross_entropy_letter_test1_uni, diff_letter_uni_test1],
       [entropy_letter_test1_bi, cross_entropy_letter_test1_bi,diff_letter_bi_test1],
       [entropy_letter_test1_tri, cross_entropy_letter_test1_tri,diff_letter_tri_test1],
       [entropy_word_test1_uni, cross_entropy_word_test1_uni, diff_word_uni_test1],
       [entropy_word_test1_bi, cross_entropy_word_test1_bi,diff_word_bi_test1],
       [entropy_word_test1_tri, cross_entropy_word_test1_tri,diff_word_tri_test1],
       [entropy_jamo_test2_uni, cross_entropy_jamo_test2_uni, diff_jamo_uni_test2],
       [entropy_jamo_test2_bi, cross_entropy_jamo_test2_bi,diff_jamo_bi_test2],
       [entropy_jamo_test2_tri, cross_entropy_jamo_test2_tri,diff_jamo_tri_test2],
       [entropy_letter_test2_uni, cross_entropy_letter_test2_uni, diff_letter_uni_test2],
       [entropy_letter_test2_bi, cross_entropy_letter_test2_bi,diff_letter_bi_test2],
       [entropy_letter_test2_tri, cross_entropy_letter_test2_tri,diff_letter_tri_test2],
       [entropy_word_test2_uni, cross_entropy_word_test2_uni, diff_word_uni_test2],
       [entropy_word_test2_bi, cross_entropy_word_test2_bi,diff_word_bi_test2],
       [entropy_word_test2_tri, cross_entropy_word_test2_tri,diff_word_tri_test2]]
df = pd.DataFrame(con,columns=col,index=ind).round(5)

In [14]:
df

Corpus,Unit,N-gram,Entropy,Cross Entropy,"Difference H(p,m)-H(p)"
Training,jamo,unigram,4.3752,4.3752,0.0
Training,jamo,bigram,7.49098,7.49098,-0.0
Training,jamo,trigram,9.80415,9.80415,-0.0
Training,syllable,unigram,7.96269,7.96269,-0.0
Training,syllable,bigram,12.67227,12.67227,-0.0
Training,syllable,trigram,14.96964,14.96964,-0.0
Training,eojeol,unigram,14.50234,14.50234,-0.0
Training,eojeol,bigram,17.89565,17.89565,0.0
Training,eojeol,trigram,18.22401,18.22401,0.0
Test: news articles,jamo,unigram,4.37371,4.37862,0.00491
