# Part 1
In the first part, the task is to implement the Viterbi algorithm for a PoS tagger, building on the provided Python stubs.  

The stubs and the test corpus are in the archive LabWork1.zip.  
The archive also contains the PDF file lm-pos.pdf, which gives a brief NLTK reference and a description of the lab assignment.  
The lab requires completing Assignment Part B - Part-of-Speech Tagging, items 1-5 (item 6 is not required).  

pos.py is used to evaluate accuracy.  
solutionB.py is the application skeleton (a number of functions must be implemented; details are in lm-pos.pdf).  


In [1]:
import sys
import nltk
import math
import time
import collections 
from nltk.tag import CRFTagger
import pycrfsuite
from tqdm import tqdm

START_SYMBOL = '*'
STOP_SYMBOL = 'STOP'
RARE_SYMBOL = '_RARE_'
RARE_WORD_MAX_FREQ = 5
LOG_PROB_OF_ZERO = -1000

In [2]:
# Receives a list of tagged sentences and processes each sentence to generate a list of words and a list of tags.
# Each sentence is a string of space separated "WORD/TAG" tokens, with a newline character at the end.
# Note: this implementation does not add START_SYMBOL / STOP_SYMBOL here; calc_trigrams() and replace_rare()
# add the boundary symbols they need.
# brown_words (the list of words) is a list where every element is a list of the words of a particular sentence.
# brown_tags (the list of tags) is a list where every element is a list of the tags of a particular sentence.
def split_wordtags(brown_train):
    brown_words = []
    brown_tags = []
    for line in brown_train:
        tags = []
        words = []
        for token in line.split():
            # Split on the last '/' only, so words that contain '/' stay intact.
            parts = token.rsplit('/', 1)
            words.extend(parts[:-1])
            tags.append(parts[-1])
        brown_words.append(words)
        brown_tags.append(tags)
    return brown_words, brown_tags

In [3]:
# This function takes tags from the training data and calculates tag trigram probabilities.
# It returns a python dictionary where the keys are tuples that represent the tag trigram, and the values are the log probability of that trigram
def calc_trigrams(brown_tags):
    c_bigram_values = collections.defaultdict(float)
    c_trigram_values = collections.defaultdict(float)
    q_values = collections.defaultdict(float)
    for line in brown_tags:
        tokens = [START_SYMBOL] + line + [STOP_SYMBOL]
        for pair in nltk.bigrams(tokens):
            c_bigram_values[pair] += 1.0
        # A second start symbol gives the first real tag a full trigram context.
        tokens = [START_SYMBOL] + tokens
        for triple in nltk.trigrams(tokens):
            c_trigram_values[triple] += 1.0
    # Convert counts to log probabilities once, after all sentences have been counted
    # (doing this inside the sentence loop gives the same result but is far slower).
    for key in c_trigram_values:
        if key[0] == START_SYMBOL and key[1] == START_SYMBOL:
            # The context (*, *) occurs exactly once per sentence.
            q_values[key] = math.log(c_trigram_values[key], 2) - math.log(len(brown_tags), 2)
        else:
            q_values[key] = math.log(c_trigram_values[key], 2) - math.log(c_bigram_values[(key[0], key[1])], 2)
    return q_values
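
As a quick sanity check of the maximum-likelihood estimate used above, q(t3 | t1, t2) = count(t1, t2, t3) / count(t1, t2), the following minimal sketch recomputes it on a toy tag corpus. The helper name `toy_trigram_logprobs` and the toy data are illustrative assumptions and are not part of the assignment skeleton.

```python
import collections
import math

def toy_trigram_logprobs(tag_sentences, start='*', stop='STOP'):
    """Return log2 q(t3 | t1, t2) estimated by relative frequency."""
    bigram_counts = collections.Counter()
    trigram_counts = collections.Counter()
    for tags in tag_sentences:
        padded = [start, start] + list(tags) + [stop]
        # Count every trigram together with the bigram that forms its context.
        for i in range(len(padded) - 2):
            bigram_counts[(padded[i], padded[i + 1])] += 1
            trigram_counts[tuple(padded[i:i + 3])] += 1
    return {
        trigram: math.log2(count / bigram_counts[trigram[:2]])
        for trigram, count in trigram_counts.items()
    }

toy_q = toy_trigram_logprobs([['DET', 'NOUN'], ['DET', 'VERB']])
print(toy_q[('*', '*', 'DET')])     # 0.0: both sentences start with DET
print(toy_q[('*', 'DET', 'NOUN')])  # -1.0: NOUN follows (*, DET) in one of two cases
```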

In [4]:
# This function takes output from calc_trigrams() and outputs it in the proper format
def q2_output(q_values, filename):
    outfile = open(filename, "w")
    trigrams = q_values.keys()

    # trigrams.sort()
    trigrams = sorted(trigrams)
    for trigram in trigrams:
        output = " ".join(['TRIGRAM', trigram[0], trigram[1], trigram[2], str(q_values[trigram])])
        outfile.write(output + '\n')
    outfile.close()

In [5]:
# Takes the words from the training data and returns a set of all of the words that occur more than 5 times (use RARE_WORD_MAX_FREQ)
# brown_words is a python list where every element is a python list of the words of a particular sentence.
# Note: words that appear exactly 5 times should be considered rare!
def calc_known(brown_words):
    counts = collections.Counter(word for line in brown_words for word in line)
    # Words seen RARE_WORD_MAX_FREQ times or fewer are treated as rare.
    return {word for word, count in counts.items() if count > RARE_WORD_MAX_FREQ}

In [6]:
# Takes the words from the training data and a set of words that should not be replaced by '_RARE_'
# Returns the equivalent of brown_words with the unknown words replaced by '_RARE_' (RARE_SYMBOL constant)
# and STOP_SYMBOL appended to every sentence
def replace_rare(brown_words, known_words):
    brown_words_rare = []
    for line in brown_words:
        sentence = []
        for word in line:
            if word not in known_words:
                sentence.append(RARE_SYMBOL)
            else:
                sentence.append(word)
        sentence.append(STOP_SYMBOL)
        brown_words_rare.append(sentence)
    return brown_words_rare
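
To illustrate the two steps above (collect the frequent words, then mask everything else), here is a self-contained toy run. The function names and the threshold of 1 are assumptions chosen to keep the example small; the notebook itself uses RARE_WORD_MAX_FREQ = 5, and unlike replace_rare() this sketch does not append a stop symbol.

```python
import collections

def toy_known_words(sentences, max_rare_freq):
    """Return the set of words that occur more than max_rare_freq times."""
    counts = collections.Counter(word for sent in sentences for word in sent)
    return {word for word, count in counts.items() if count > max_rare_freq}

def toy_replace_rare(sentences, known, rare_symbol='_RARE_'):
    """Replace every word outside `known` with the rare-word placeholder."""
    return [[word if word in known else rare_symbol for word in sent]
            for sent in sentences]

toy_sentences = [['the', 'cat', 'sat'], ['the', 'dog', 'sat']]
toy_known = toy_known_words(toy_sentences, max_rare_freq=1)
toy_masked = toy_replace_rare(toy_sentences, toy_known)
print(sorted(toy_known))  # ['sat', 'the']
print(toy_masked)         # 'cat' and 'dog' occur once each, so both are masked
```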

In [7]:
# This function takes the output from replace_rare and writes it to a file
def q3_output(rare, filename):
    with open(filename, 'w') as outfile:
        for sentence in rare:
            # replace_rare() appends only the stop symbol, so drop just the last element
            outfile.write(' '.join(sentence[:-1]) + '\n')

In [8]:
# Calculates emission probabilities and creates a set of all possible tags
# The first return value is a python dictionary where each key is a tuple in which the first element is a word
# and the second is a tag, and the value is the log probability of the emission of the word given the tag
# The second return value is a set of all possible tags for this data set
def calc_emission(brown_words_rare, brown_tags):
    e_values = {}
    taglist = []
    zipped_nested_list = [zip(brown_words_rare[i], brown_tags[i]) for i in range(0, len(brown_words_rare))]
    zipped_list = [j for i in zipped_nested_list for j in i]
    d_zipped_list = collections.Counter(zipped_list)             # d->dictionary

    temp_brown_words_rare = brown_words_rare[:]
    temp_brown_words_rare = [j for i in temp_brown_words_rare for j in i]
    d_temp_brown_words_rare = collections.Counter(temp_brown_words_rare)

    temp_brown_tags = brown_tags[:]
    temp_brown_tags = [j for i in temp_brown_tags for j in i]
    d_temp_brown_tags = collections.Counter(temp_brown_tags)

    for i in d_zipped_list:
        e_values[i] = math.log(d_zipped_list[i]/(1.0*d_temp_brown_tags[i[1]]), 2)

    for i in d_temp_brown_tags:
        taglist.append(i)
    taglist = set(taglist)
    return e_values, taglist
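
The emission estimate computed above is e(word | tag) = count(word, tag) / count(tag). A minimal sketch on toy data, using the hypothetical helper `toy_emission_logprobs` (not part of the skeleton), makes the arithmetic easy to verify by hand.

```python
import collections
import math

def toy_emission_logprobs(word_sentences, tag_sentences):
    """Return log2 e(word | tag) estimated by relative frequency."""
    pair_counts = collections.Counter()
    tag_counts = collections.Counter()
    for words, tags in zip(word_sentences, tag_sentences):
        for word, tag in zip(words, tags):
            pair_counts[(word, tag)] += 1
            tag_counts[tag] += 1
    return {
        (word, tag): math.log2(count / tag_counts[tag])
        for (word, tag), count in pair_counts.items()
    }

toy_e = toy_emission_logprobs(
    [['the', 'cat'], ['the', 'dog']],
    [['DET', 'NOUN'], ['DET', 'NOUN']],
)
print(toy_e[('the', 'DET')])   # 0.0: DET always emits 'the'
print(toy_e[('cat', 'NOUN')])  # -1.0: NOUN emits 'cat' half of the time
```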

In [9]:
# This function takes the output from calc_emission() and writes it to a file
def q4_output(e_values, filename):
    outfile = open(filename, "w")
    emissions = e_values.keys()
    # emissions.sort()  for python 2
    emissions = sorted(emissions)  
    for item in emissions:
        output = " ".join([item[0], item[1], str(e_values[item])])
        outfile.write(output + '\n')
    outfile.close()

In [10]:
# This function takes data to tag (brown_dev_words), a set of all possible tags (taglist), a set of all known words (known_words),
# trigram probabilities (q_values) and emission probabilities (e_values) and outputs a list where every element is a tagged sentence
# (in the WORD/TAG format, separated by spaces and with a newline at the end, just like our input tagged data)
# brown_dev_words is a python list where every element is a python list of the words of a particular sentence.
# taglist is a set of all possible tags (not used by this implementation)
# known_words is a set of all known words
# q_values is the return value of calc_trigrams()
# e_values is the first return value of calc_emission()
# The return value is a list of tagged sentences in the format "WORD/TAG", separated by spaces. Each sentence is a string with a
# terminal newline, not a list of tokens. The output does not contain the "_RARE_" symbol, but rather the
# original words of the sentence.
# Note: despite its name, this implementation decodes greedily: for each word it keeps the single tag that maximizes
# emission + transition score given the two previously chosen tags, instead of running the full Viterbi dynamic program.
def viterbi(brown_dev_words, taglist, known_words, q_values, e_values):
    tagged = []
    temp_brown_dev_words = brown_dev_words[:]

    for sentence in temp_brown_dev_words:
        temp = []
        t = [START_SYMBOL, START_SYMBOL, 0]  # make it tuple, like tuple(t) and use it as key in q_values
        for words in sentence:
            d_word_tag_values = {}
            if (words in known_words):
                for key in e_values:
                    if (key[0] == words):
                        t[-1] = key[-1]
                        if (q_values.get(tuple(t), LOG_PROB_OF_ZERO) != LOG_PROB_OF_ZERO):
                            d_word_tag_values[(key, tuple(t))] = e_values[key] + q_values[tuple(t)]
                        else:
                            d_word_tag_values[(key, None)] = e_values[key] + LOG_PROB_OF_ZERO
            else:
                for key in e_values:
                    if (key[0] == RARE_SYMBOL):
                        t[-1] = key[-1]
                        if (q_values.get(tuple(t), LOG_PROB_OF_ZERO) != LOG_PROB_OF_ZERO):
                            d_word_tag_values[(key, tuple(t))] = e_values[key] + q_values[tuple(t)]
                        else:
                            d_word_tag_values[(key, None)] = e_values[key] + LOG_PROB_OF_ZERO

            key_tuple_state = max(d_word_tag_values, key=d_word_tag_values.get)
            t[-1] = key_tuple_state[0][-1]
            temp.append(words + "/" + t[-1])
            t.append(0)
            t = t[1:]
        tagged.append(" ".join(temp[:]) + '\n')
    return tagged
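
The function above chooses each tag from the two previously chosen tags and never revisits a decision. For comparison, here is a minimal sketch of full Viterbi decoding for a trigram HMM, which keeps the best score for every pair of adjacent tags and recovers the path through back-pointers. The names `viterbi_decode` and `LOG_ZERO` are illustrative assumptions; the dictionaries follow the same key layout as q_values and e_values, and the caller is assumed to have already replaced unknown words with the rare symbol.

```python
LOG_ZERO = -1000.0  # plays the same role as LOG_PROB_OF_ZERO in the notebook

def viterbi_decode(words, tags, q, e, start='*', stop='STOP'):
    """Return the most probable tag sequence for `words` under a trigram HMM.

    q maps (t1, t2, t3) -> log-probability of t3 after t1, t2;
    e maps (word, tag)  -> log-probability of the word given the tag.
    Missing entries are scored as LOG_ZERO.
    """
    n = len(words)
    if n == 0:
        return []

    def tagset(k):
        # Positions before the sentence carry only the start symbol.
        return [start] if k < 0 else list(tags)

    # pi[k][(u, v)]: best score of any tag sequence with tag u at k-1 and v at k.
    pi = [{} for _ in range(n)]
    back = [{} for _ in range(n)]
    for k in range(n):
        for v in tagset(k):
            emit = e.get((words[k], v), LOG_ZERO)
            for u in tagset(k - 1):
                best_score, best_w = None, None
                for w in tagset(k - 2):
                    prev = 0.0 if k == 0 else pi[k - 1][(w, u)]
                    score = prev + q.get((w, u, v), LOG_ZERO) + emit
                    if best_score is None or score > best_score:
                        best_score, best_w = score, w
                pi[k][(u, v)] = best_score
                back[k][(u, v)] = best_w

    def final_score(pair):
        # Add the transition into the stop symbol.
        return pi[n - 1][pair] + q.get((pair[0], pair[1], stop), LOG_ZERO)

    u, v = max(pi[n - 1], key=final_score)

    # Follow the back-pointers from the last position to the first.
    path = [None] * n
    path[n - 1] = v
    if n > 1:
        path[n - 2] = u
    for k in range(n - 1, 1, -1):
        path[k - 2] = back[k][(path[k - 1], path[k])]
    return path

toy_tags = ['DET', 'NOUN', 'VERB']
toy_q = {('*', '*', 'DET'): 0.0, ('*', 'DET', 'NOUN'): 0.0,
         ('DET', 'NOUN', 'VERB'): 0.0, ('NOUN', 'VERB', 'STOP'): 0.0}
toy_e = {('the', 'DET'): 0.0, ('dog', 'NOUN'): -1.0, ('dog', 'VERB'): -2.0,
         ('barks', 'VERB'): -1.0, ('barks', 'NOUN'): -2.0}
print(viterbi_decode(['the', 'dog', 'barks'], toy_tags, toy_q, toy_e))
```

The cost is proportional to the sentence length times the cube of the tag-set size, so with a large tag set it may be worth restricting the candidate tags per word to those seen with it in training.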

In [11]:
# This function takes the output of viterbi() and outputs it to file
def q5_output(tagged, filename):
    outfile = open(filename, 'w')
    for sentence in tagged:
        outfile.write(sentence)
    outfile.close()

In [16]:
DATA_PATH = 'LabWork1/data/Brown/'
OUTPUT_PATH = 'LabWork1/output/Brown/'

def main():
    # start timer (time.clock() was removed in Python 3.8)
    start_time = time.perf_counter()

    # open Brown training data
    with open(DATA_PATH + "Brown_tagged_train.txt", "r") as infile:
        brown_train = infile.readlines()

    # split each sentence into a word list and a tag list (question 1)
    brown_words, brown_tags = split_wordtags(brown_train)

    # calculate tag trigram probabilities (question 2)
    q_values = calc_trigrams(brown_tags)

    # question 2 output
    q2_output(q_values, OUTPUT_PATH + 'B2.txt')

    # calculate the set of words with count > 5 (question 3)
    known_words = calc_known(brown_words)

    # get a version of brown_words with rare words replaced with '_RARE_' (question 3)
    brown_words_rare = replace_rare(brown_words, known_words)

    # question 3 output
    q3_output(brown_words_rare, OUTPUT_PATH + "B3.txt")

    # calculate emission probabilities (question 4)
    e_values, taglist = calc_emission(brown_words_rare, brown_tags)

    # question 4 output
    q4_output(e_values, OUTPUT_PATH + "B4.txt")

    # delete unnecessary data
    del brown_train
    del brown_words_rare

    # open Brown development data (question 5)
    with open(DATA_PATH + "Brown_dev.txt", "r") as infile:
        brown_dev = infile.readlines()

    # format Brown development data here
    brown_dev_words = []
    for sentence in brown_dev:
        brown_dev_words.append(sentence.split(" ")[:-1])

    # tag brown_dev_words with the HMM decoder (question 5)
    viterbi_tagged = viterbi(brown_dev_words, taglist, known_words, q_values, e_values)

    # question 5 output
    q5_output(viterbi_tagged, OUTPUT_PATH + 'B5.txt')

    # print total time to run Part B
    print("Part B time: ", str(time.perf_counter() - start_time), ' sec')

if __name__ == "__main__": main()

Part B time:  299.305008  sec


In [17]:
def check (outputfile, referencefile):

    infile = open(outputfile, "r")
    user_sentences = infile.readlines()
    infile.close()

    infile = open(referencefile, "r")
    correct_sentences = infile.readlines()
    infile.close()

    num_correct = 0
    total = 0

    for user_sent, correct_sent in zip(user_sentences, correct_sentences):
        user_tok = user_sent.split()
        correct_tok = correct_sent.split()

        if len(user_tok) != len(correct_tok):
            continue

        for u, c in zip(user_tok, correct_tok):
            if u == c:
                num_correct += 1
            total += 1

    score = float(num_correct) / total * 100
    return score

In [18]:
print(check(OUTPUT_PATH+"B5.txt", DATA_PATH+"Brown_tagged_dev.txt"))

91.68487501710086


# Part 2
The second part of the lab applies the PoS tagger built in Part 1 to a third-party corpus.  
Use the Universal Dependencies corpus: http://universaldependencies.org/  

Universal Dependencies can be downloaded from: https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-1983


The corpora are distributed in the CoNLL format (CoNLL-U, `.conllu` files). Read the documentation for this format and convert the data to the same form used in Part 1 (word/tag).  
Run the PoS tagger on the converted data and report the accuracy of the HMM-based tagger.  

Languages: Korean, Latin, Russian
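
The conversion step can be sketched as follows. CoNLL-U stores one token per line with ten tab-separated fields (ID, FORM, LEMMA, UPOS, ...), comment lines start with `#`, and a blank line ends a sentence. The function name `conllu_to_wordtag` is hypothetical, and taking the tag from the UPOS column (rather than XPOS) is an assumption about what the assignment intends.

```python
def conllu_to_wordtag(lines):
    """Convert CoNLL-U lines into 'WORD/TAG WORD/TAG ...' sentence strings."""
    sentences = []
    tokens = []
    for raw in lines:
        line = raw.rstrip('\n')
        if not line.strip():
            # A blank line closes the current sentence.
            if tokens:
                sentences.append(' '.join(tokens) + '\n')
                tokens = []
            continue
        if line.startswith('#'):
            continue  # sentence-level comment or metadata
        fields = line.split('\t')
        token_id, form, upos = fields[0], fields[1], fields[3]
        if '-' in token_id or '.' in token_id:
            continue  # skip multiword-token ranges (1-2) and empty nodes (1.1)
        # FORM may contain spaces in some treebanks; the word/tag format cannot.
        tokens.append(form.replace(' ', '_') + '/' + upos)
    if tokens:
        # The file may end without a trailing blank line.
        sentences.append(' '.join(tokens) + '\n')
    return sentences

sample = [
    "# text = Собака лает.\n",
    "1\tСобака\tсобака\tNOUN\t_\t_\t2\tnsubj\t_\t_\n",
    "2\tлает\tлаять\tVERB\t_\t_\t0\troot\t_\tSpaceAfter=No\n",
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n",
    "\n",
]
print(conllu_to_wordtag(sample))
```

The resulting strings have the same shape as the lines of Brown_tagged_train.txt, so they can be passed to split_wordtags() unchanged.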

In [19]:
languages_in = {'korean': 'LabWork1/data/UD_Korean', 
             'latin': 'LabWork1/data/UD_Latin-ITTB',
             'russian': 'LabWork1/data/UD_Russian-SynTagRus'}

languages_out = {'korean': 'LabWork1/output/Korean', 
             'latin': 'LabWork1/output/Latin',
             'russian': 'LabWork1/output/Russian'}


# # open Brown training data
# infile = open(DATA_PATH + "ko-ud-train.conllu", "r")
# brown_train = infile.readlines()
# infile.close()

# # split words and tags, and add start and stop symbols (question 1)
# brown_words, brown_tags = split_wordtags(brown_train)

# # calculate tag trigram probabilities (question 2)
# q_values = calc_trigrams(brown_tags)

# # calculate list of words with count > 5 (question 3)
# known_words = calc_known(brown_words)

# # get a version of brown_words with rare words replaced with '_RARE_' (question 3)
# brown_words_rare = replace_rare(brown_words, known_words)

# # calculate emission probabilities (question 4)
# e_values, taglist = calc_emission(brown_words_rare, brown_tags)

# # delete unnecessary data
# del brown_train
# del brown_words_rare

# # open Brown development data (question 5)
# infile = open(DATA_PATH + "Brown_dev.txt", "r")
# brown_dev = infile.readlines()
# infile.close()

# # format Brown development data here
# brown_dev_words = []
# for sentence in brown_dev:
#     brown_dev_words.append(sentence.split(" ")[:-1])

# # do viterbi on brown_dev_words (question 5)
# viterbi_tagged = viterbi(brown_dev_words, taglist, known_words, q_values, e_values)

# # question 5 output
# q5_output(viterbi_tagged, OUTPUT_PATH + 'B5.txt')

# Part 3
The third part of the lab solves the same task using conditional random fields (CRF).
(You may need to design the features yourself.)  

Deliverables:  
Email me the implemented HMM and CRF PoS taggers (source code);  
the training and test corpora from Universal Dependencies, converted to the PoS tagger format;  
the accuracy figures achieved by the PoS taggers;  
a comparison of the two.  

Notes:  
For CRF, it is suggested (but not required) to use http://www.chokkan.org/software/crfsuite/ (a Python wrapper is available: https://python-crfsuite.readthedocs.io/en/latest/).  
There are no restrictions for HMM.  
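
As a starting point for feature design, the sketch below maps each token to a dictionary of simple surface features (lowercased form, affixes, capitalization, neighbouring words). The function names and the particular feature set are illustrative assumptions, not a recommendation from the assignment; per-token feature dictionaries of this kind are, to my understanding, what python-crfsuite's `Trainer.append(xseq, yseq)` accepts, but check the wrapper documentation before relying on it.

```python
def word2features(sentence, i):
    """Build a feature dict for the i-th word of a tokenized sentence."""
    word = sentence[i]
    features = {
        'bias': 1.0,
        'word.lower': word.lower(),
        'suffix3': word[-3:],
        'prefix2': word[:2],
        'is_title': word.istitle(),
        'is_upper': word.isupper(),
        'has_digit': any(ch.isdigit() for ch in word),
        'has_hyphen': '-' in word,
    }
    if i > 0:
        features['prev.lower'] = sentence[i - 1].lower()
    else:
        features['BOS'] = True  # beginning of sentence
    if i < len(sentence) - 1:
        features['next.lower'] = sentence[i + 1].lower()
    else:
        features['EOS'] = True  # end of sentence
    return features

def sentence2features(sentence):
    """Return one feature dict per token."""
    return [word2features(sentence, i) for i in range(len(sentence))]

toy_features = sentence2features(['The', 'dog', 'barks'])
print(toy_features[1])
```

Suffix features tend to matter most for morphologically rich languages such as Russian and Latin, so the affix lengths are worth tuning per language.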


In [20]:
# This function takes the training words (brown_words) and tags (brown_tags) as returned by split_wordtags()
# and trains an NLTK CRFTagger (a wrapper around python-crfsuite) on them.
# brown_words is a python list where every element is a python list of the words of a particular sentence.
# brown_tags is a python list where every element is a python list of the tags of the same sentence.
# The trained model is saved to the file 'model.crf.tagger'.
# Returns the trained tagger and the training data as a list of sentences, each a list of (word, tag) tuples.
def conditional_random_field(brown_words, brown_tags):
    train_words_tags = []
    crf = CRFTagger(verbose=True)
    for words, tags in tqdm(zip(brown_words, brown_tags), total=len(brown_words)):
        train_words_tags.append(list(zip(words, tags)))
    crf.train(train_words_tags, 'model.crf.tagger')
    return crf, train_words_tags

In [None]:
# start timer (time.clock() was removed in Python 3.8)
start_time = time.perf_counter()

# open Brown training data
infile = open(DATA_PATH + "Brown_tagged_train.txt", "r")
brown_train = infile.readlines()
infile.close()

# split words and tags, and add start and stop symbols (question 1)
brown_words, brown_tags = split_wordtags(brown_train)

# train the CRF tagger on the Brown training data
crf, crf_tagged = conditional_random_field(brown_words, brown_tags)

# open Brown development data (question 5)
infile = open(DATA_PATH + "Brown_dev.txt", "r")
brown_dev = infile.readlines()
infile.close()

# format Brown development data here
brown_dev_words = []
for sentence in brown_dev:
    brown_dev_words.append(sentence.split(" ")[:-1])

results = crf.tag_sents(brown_dev_words)
formatted_results = [" ".join([word[0] + "/" + word[1] for word in sentence]) + "\n" for sentence in results]

# write the CRF-tagged development data
q5_output(formatted_results, OUTPUT_PATH + 'B5_CFG.txt')


100%|██████████| 27491/27491 [00:00<00:00, 107670.78it/s]


Feature generation
type: CRF1d
feature.minfreq: 0.000000
feature.possible_states: 0
feature.possible_transitions: 0
0....1....2....3....4....5....6....7....8....9....10
Number of features: 42023
Seconds required: 0.333

L-BFGS optimization
c1: 0.000000
c2: 1.000000
num_memories: 6
max_iterations: 2147483647
epsilon: 0.000010
stop: 10
delta: 0.000010
linesearch: MoreThuente
linesearch.max_iterations: 20

***** Iteration #1 *****
Loss: 808618.785917
Feature norm: 5.000000
Error norm: 72813.733703
Active features: 42023
Line search trials: 2
Line search step: 0.000036
Seconds required for this iteration: 1.223

***** Iteration #2 *****
Loss: 553697.951593
Feature norm: 11.348525
Error norm: 49535.832385
Active features: 42023
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 0.434

***** Iteration #3 *****
Loss: 474113.622321
Feature norm: 15.107980
Error norm: 45528.832898
Active features: 42023
Line search trials: 1
Line search step: 1.000000
Seconds 

In [None]:
print(check(OUTPUT_PATH+"B5_CFG.txt", DATA_PATH+"Brown_tagged_dev.txt"))

| Corpus  | HMM accuracy, % | CRF accuracy, % |
|---------|-----------------|-----------------|
| Brown   | 91.68           | 95.72           |
| Korean  |                 |                 |
| Latin   |                 |                 |
| Russian |                 |                 |


https://github.com/UniversalDependencies/UD_Korean  
https://github.com/UniversalDependencies/UD_Latin-ITTB  
https://github.com/UniversalDependencies/UD_Russian-SynTagRus  