# Chinese Next Character Predictor
### *Bigram model with unigram backoff*

### Overview

There are over 50,000 individual Chinese characters, though the average person fluent in Chinese uses far fewer. Many of these characters share the same or a similar spelling when Romanized using Hanyu Pinyin, so a Pinyin spelling alone often does not identify a single character.

This project uses a bigram model (with unigram backoff) to predict the correct Chinese character when given the Hanyu Pinyin phonetic spelling for each character.

*Please Note: This project has been modified from a project for the course DTSC 685 Natural Language Processing at Eastern University. The datasets used were supplied by the professor. As a challenge for this project, we were only allowed to use base Python and the `collections` module.*

## Chinese Data Setup

There are six files for this project:
- `chinese/train.han`: A training file using Chinese characters.
- `chinese/charmap`: A character map. Each line in this file contains exactly two whitespace-separated columns. The first column is a Chinese character and the second column is the pronunciation.
- `chinese/dev.pin`: A file of inputs for the dev set, written as phonetic (Pinyin) tokens.
- `chinese/test.pin`: A file of inputs for the test set. For each whitespace-separated token in this file, you will predict which character the user meant to type.
- `chinese/dev.han`: A file of correct outputs for the dev set. You will compare your predictions on the dev set to this file.
- `chinese/test.han`: A file of correct outputs for the test set. You will compare your predictions on the test set to this file.

The paths to the Chinese text data are stored below for convenience. The code then builds a list called `chinese_chars` that contains every Chinese character in the charmap file, and a list of tuples called `pronunciations` in which each tuple holds the typed pronunciation as its first entry and the associated Chinese character as its second entry.

In [1]:
# Loading the collections module
import collections

# Mapping the data paths

charmap = './data/charmap'
chinese_dev_pin = './data/dev.pin'
chinese_dev_han = './data/dev.han'
chinese_test_pin = './data/test.pin'
chinese_test_han = './data/test.han'
chinese_train = './data/train.han'


In [2]:
# Reading and mapping the pronunciations

with open(charmap, 'r', encoding='utf-8') as f:
    charmap_list = list(f)
chinese_chars = []
pronunciation = []

for i in charmap_list:
    x = i.split()
    chinese_chars.append(x[0])
    pronunciation.append(x[1])

pronunciations = list(zip(pronunciation, chinese_chars))

In [3]:
# Showing the first 50 pronunciations
pronunciations[:50]

[('qiu', '㐀'),
 ('tian', '㐁'),
 ('kua', '㐄'),
 ('wu', '㐅'),
 ('yin', '㐆'),
 ('yi', '㐌'),
 ('xie', '㐖'),
 ('chou', '㐜'),
 ('nuo', '㐡'),
 ('dan', '㐤'),
 ('xu', '㐨'),
 ('xing', '㐩'),
 ('xiong', '㐫'),
 ('liu', '㐬'),
 ('lin', '㐭'),
 ('xiang', '㐮'),
 ('yong', '㐯'),
 ('xin', '㐰'),
 ('zhen', '㐱'),
 ('dai', '㐲'),
 ('wu', '㐳'),
 ('pan', '㐴'),
 ('ru', '㐵'),
 ('ma', '㐷'),
 ('qian', '㐸'),
 ('yi', '㐹'),
 ('yin', '㐺'),
 ('nei', '㐻'),
 ('cheng', '㐼'),
 ('feng', '㐽'),
 ('zhuo', '㑁'),
 ('fang', '㑂'),
 ('ao', '㑃'),
 ('wu', '㑄'),
 ('zuo', '㑅'),
 ('zhou', '㑇'),
 ('dong', '㑈'),
 ('su', '㑉'),
 ('yi', '㑊'),
 ('qiong', '㑋'),
 ('kuang', '㑌'),
 ('lei', '㑍'),
 ('nao', '㑎'),
 ('zhu', '㑏'),
 ('shu', '㑐'),
 ('xu', '㑔'),
 ('shen', '㑗'),
 ('jie', '㑘'),
 ('die', '㑙'),
 ('nuo', '㑚')]

## Chinese Character Candidate

The `Candidates` function takes an input token as a parameter and returns a list of possible characters that the token could specify. Each input token could be one of the following:
- a typed pronunciation, which can convert to one of the Chinese characters from the charmap,
- an English character, which can convert to itself,
- `<space>`, which converts to a space.

In [4]:
def Candidates(token):
    candidates = []

    # If pronunciation
    for pair in pronunciations:
        if pair[0] == token:
            candidates.append(pair[1])

    # If space
    if token == '<space>':
        candidates.append(' ')

    # If English character
    if len(token) == 1:
        candidates.append(token)

    return candidates
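
`Candidates` scans the whole `pronunciations` list for every token. As a minimal sketch of one alternative, the pairs could be grouped by pronunciation once and then looked up directly. The names `build_index`, `candidates_indexed`, and the toy pairs below are hypothetical and not part of the original project.

```python
import collections

# Hypothetical toy pairs in the same (pronunciation, character) layout as
# `pronunciations`; the real list is read from the charmap file.
toy_pronunciations = [('ma', '吗'), ('ma', '妈'), ('ni', '你'), ('hao', '好')]

def build_index(pairs):
    """Group characters by pronunciation so each lookup is a single dict access."""
    index = collections.defaultdict(list)
    for pron, char in pairs:
        index[pron].append(char)
    return index

def candidates_indexed(token, index):
    """Apply the same three rules as Candidates, using the prebuilt index."""
    # .get avoids inserting empty entries for unknown tokens
    candidates = list(index.get(token, []))
    if token == '<space>':
        candidates.append(' ')
    if len(token) == 1:
        candidates.append(token)
    return candidates

toy_index = build_index(toy_pronunciations)
print(candidates_indexed('ma', toy_index))       # ['吗', '妈']
print(candidates_indexed('<space>', toy_index))  # [' ']
print(candidates_indexed('a', toy_index))        # ['a']
```

The candidate order matches the order of the input pairs, so ties between equally scored candidates would be broken the same way as with the list scan.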

## Bigram Character Predictor (Without Unigram Backoff)

Below are a bigram class and a bigram next character predictor (without unigram backoff). The predictor takes the previously predicted Chinese character and the current Pinyin pronunciation, and predicts which Chinese character was meant.

### Bigram Class

*Original Instructor Instructions: Create a Bigram class, which modifies the Unigram class [from an earlier assignment] to implement a bigram language model. It should contain the same methods as the Unigram class, which should be modified (and in some cases defined) for bigrams. You do not need to modify the `__init__()` method.*

In [5]:
class Bigram(object):

    def __init__(self):
        self.counts = collections.Counter()
        self.total_count = 0
        self.current = []

    def train(self, filename):
        # Trains the model on a text file, counting every pair of adjacent characters
        with open(filename, encoding='utf-8') as f:
            for line in f:
                # Pad each line with two start symbols
                line = '^' + '^' + line
                self.start()
                for w in line.rstrip('\n'):
                    self.read(w)
                    gram = "".join(self.current)
                    self.counts[gram] += 1
                    self.total_count += 1

    def start(self):
        # Resets the state to the initial state
        self.current = ['^', '^']

    def read(self, w):
        # Reads in w, updating the state
        self.current.pop(0)
        self.current.append(w)

    def prob(self, w):
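        # Scores w as the next character: the count of the bigram (previous character + w)
        # divided by the total number of bigrams seen in training. For a fixed previous
        # character, this ranks candidates the same way as the conditional probability.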
        temp = self.current[1:2]
        temp.append(w)
        guess = "".join(temp)

        return self.counts[guess] / self.total_count
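
`prob` divides a bigram count by the total number of bigrams, which is a joint relative frequency rather than a conditional probability. Because the previous character is fixed while candidates are compared, both scores should select the same candidate. The sketch below illustrates this on a toy string; the string and the helper names `joint` and `conditional` are assumptions for illustration, not part of the class above.

```python
import collections

# Toy training text (an assumption for illustration, not the project's data).
text = 'abababac'

# Count every pair of adjacent characters, and how often each character starts a pair.
bigram_counts = collections.Counter(text[i:i + 2] for i in range(len(text) - 1))
context_counts = collections.Counter(text[:-1])
total_bigrams = sum(bigram_counts.values())

def joint(prev, w):
    """Bigram count over all bigrams (the kind of score Bigram.prob returns)."""
    return bigram_counts[prev + w] / total_bigrams

def conditional(prev, w):
    """Bigram count over the count of the previous character: an estimate of P(w | prev)."""
    if context_counts[prev] == 0:
        return 0.0
    return bigram_counts[prev + w] / context_counts[prev]

candidates = ['b', 'c']
best_joint = max(candidates, key=lambda w: joint('a', w))
best_conditional = max(candidates, key=lambda w: conditional('a', w))
print(best_joint, best_conditional)  # b b
```

The two scores differ only by a factor that depends on the previous character, so the choice between them matters for reporting probabilities but not for picking the top candidate.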

### Bigram Character Predictor Function

*Original Instructor Instructions: Create a Bigram Character Prediction function. This function should create an object of the Bigram class and train it on a training file. It should then predict the most probable Chinese character for each token in a test file from the list of candidate characters generated by `Candidates(token)`. The Bigram Character Prediction function should also calculate the total number of correct predictions it makes for the test file and return the percentage of correct predictions.*

In [6]:
def BigramCharPred(train_file, test_file, correct_output):
    bigram = Bigram()
    bigram.train(train_file)

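    with open(test_file, 'r', encoding='utf-8') as f:
        test_list = list(f)
    with open(correct_output, 'r', encoding='utf-8') as f:
        correct_list = list(f)
    correct_list_chars = []

    for line in correct_list:
        # Drop only the trailing newline, so a final line without one keeps its last character
        line = line.rstrip('\n')
        for char in line:
            correct_list_chars.append(char)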

    prediction = []


    bigram.start()

    for line in test_list:
        tokens = line.split()
        for token in tokens:
            candidates = Candidates(token)

            char_probs = []
            for char in candidates:
                char_probs.append(bigram.prob(char))


            index = char_probs.index(max(char_probs))
            most_likely = candidates[index]

            prediction.append(most_likely)
            bigram.read(most_likely)

    correct = 0
    total_tested = 0

    for n in range(len(correct_list_chars)):
        if prediction[n] == correct_list_chars[n]:
            correct += 1
            total_tested += 1
        else:
            total_tested +=1


    return correct/total_tested
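
The scoring loop at the end of `BigramCharPred` compares predictions with the reference characters position by position. A minimal sketch of the same calculation as a standalone helper is shown below; the name `accuracy` and the length check are assumptions added for illustration.

```python
def accuracy(predicted, expected):
    """Return the fraction of positions where the prediction matches the reference."""
    if len(predicted) != len(expected):
        # A length mismatch usually means the inputs are misaligned
        raise ValueError('prediction and reference lengths differ')
    if not expected:
        return 0.0
    correct = sum(p == e for p, e in zip(predicted, expected))
    return correct / len(expected)

# Two of the three toy characters match
print(accuracy(list('你好吗'), list('你好妈')))
```

Raising on a length mismatch is a design choice: it surfaces alignment problems between the `.pin` and `.han` files instead of silently scoring a shifted sequence.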

In [7]:
bigram_dev_accuracy = BigramCharPred(chinese_train, chinese_dev_pin, chinese_dev_han)
print('Chinese character prediction bigram accuracy on dev set: '+ str(bigram_dev_accuracy))

Chinese character prediction bigram accuracy on dev set: 0.7785547785547785


In [8]:
bigram_test_accuracy = BigramCharPred(chinese_train, chinese_test_pin, chinese_test_han)
print('Chinese character prediction bigram accuracy on test set: '+ str(bigram_test_accuracy))

Chinese character prediction bigram accuracy on test set: 0.5886157826649417


## Bigram Character Predictor With Backoff

We can improve our bigram model by adding unigram backoff for instances where we do not have bigram data.

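Below are the unigram class needed for backoff and the bigram next character predictor with unigram backoff. The predictor takes the previously predicted Chinese character and the current Pinyin pronunciation, and predicts which Chinese character was meant.

When we add unigram backoff, our model improves from 59% accuracy to 82% accuracy on the test data (and from 78% to 89% on the dev data). This is a huge improvement! Note that `BackoffCharPred` also resets the bigram state at the start of every line, which `BigramCharPred` does not, so part of the gain may come from that change.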


### Unigram Class

We created this unigram class in an earlier assignment in DTSC 685. We need it for the unigram backoff that improves our next character predictor.

In [9]:
class Unigram(object):

    def __init__(self):
        self.counts = collections.Counter()
        self.total_count = 0

    def train(self, filename):
        # Trains the model on a text file
        with open(filename, encoding='utf-8') as f:
            for line in f:
                for w in line.rstrip('\n'):
                    self.counts[w] += 1
                    self.total_count += 1

    def prob(self, w):
        # Returns the relative frequency of w in the training data (no context is used)
        return self.counts[w] / self.total_count


### Bigram Character Predictor With Backoff Function


*Original Instructor Instructions: Create another Bigram Character Prediction function. This function should work the same as your first Bigram Character prediction function, but this time it should use backoff. If the bigram returns a probability of 0 for all candidate characters, use the unigram to predict the character.*

In [10]:
def BackoffCharPred(train_file, test_file, correct_output):

    # Unigram
    unigram = Unigram()
    unigram.train(train_file)

    # Bigram
    bigram = Bigram()
    bigram.train(train_file)

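    with open(test_file, 'r', encoding='utf-8') as f:
        test_list = list(f)
    with open(correct_output, 'r', encoding='utf-8') as f:
        correct_list = list(f)
    correct_list_chars = []

    for line in correct_list:
        # Drop only the trailing newline, so a final line without one keeps its last character
        line = line.rstrip('\n')
        for char in line:
            correct_list_chars.append(char)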

    prediction = []
    char_num = -1


    for line in test_list:
        bigram.start()
        tokens = line.split()
        for token in tokens:
            candidates = Candidates(token)

            char_probs = []
            for char in candidates:
                char_probs.append(bigram.prob(char))

            # Bigram Selection
            if max(char_probs) > 0:
                index = char_probs.index(max(char_probs))
                most_likely = candidates[index]

                prediction.append(most_likely)
                char_num += 1
                bigram.read(prediction[char_num])

            # Unigram Backoff
            else:
                char_probs = []
                for char in candidates:
                    char_probs.append(unigram.prob(char))

                index = char_probs.index(max(char_probs))

                unigram_answer = candidates[index]

                prediction.append(unigram_answer)
                char_num += 1
                bigram.read(prediction[char_num])

    correct = 0
    total_tested = 0

    for n in range(len(correct_list_chars)):
        if prediction[n] == correct_list_chars[n]:
            correct += 1
            total_tested += 1
        else:
            total_tested += 1

    return correct/total_tested
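
The selection rule inside `BackoffCharPred` can be looked at on its own: use the bigram scores when at least one is non-zero, otherwise rank the same candidates by unigram score. The sketch below isolates that rule; `pick_with_backoff`, `seen_after`, and the toy counts are hypothetical stand-ins for the trained models, not the project's data.

```python
import collections

def pick_with_backoff(candidates, bigram_score, unigram_score):
    """Pick the best candidate by bigram score; if every bigram score is 0, use unigram scores."""
    scores = [bigram_score(c) for c in candidates]
    if max(scores) <= 0:
        scores = [unigram_score(c) for c in candidates]
    return candidates[scores.index(max(scores))]

# Hypothetical toy counts standing in for trained Bigram and Unigram models.
toy_bigrams = collections.Counter({'你好': 3})
toy_unigrams = collections.Counter({'好': 5, '号': 2, '妈': 4, '吗': 6})

def seen_after(prev):
    """Return a scorer that counts how often a character followed prev."""
    return lambda c: toy_bigrams[prev + c]

def unigram_count(c):
    return toy_unigrams[c]

print(pick_with_backoff(['号', '好'], seen_after('你'), unigram_count))  # 好 (bigram evidence)
print(pick_with_backoff(['妈', '吗'], seen_after('你'), unigram_count))  # 吗 (unigram backoff)
```

Without the backoff branch, the second call would return the first candidate in the list, because every score ties at zero. As in the original function, an empty candidate list would raise a `ValueError` from `max`.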


In [11]:
backoff_dev_accuracy = BackoffCharPred(chinese_train, chinese_dev_pin, chinese_dev_han)
print('Chinese character prediction (Bigram with backoff) accuracy on dev set: '+ str(backoff_dev_accuracy))

Chinese character prediction (Bigram with backoff) accuracy on dev set: 0.8863636363636364


In [12]:
backoff_test_accuracy = BackoffCharPred(chinese_train, chinese_test_pin, chinese_test_han)
print('Chinese character prediction (Bigram with backoff) accuracy on test set: '+ str(backoff_test_accuracy))

Chinese character prediction (Bigram with backoff) accuracy on test set: 0.8163001293661061
