<a href="https://colab.research.google.com/github/heinohen/tko_7095_i2hlt/blob/main/week4_exercise_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Task 8: POS transition probabilities

In the lecture, we briefly saw the concept of hidden Markov models and transition probabilities, as applied to POS tags. In the simplest case, these probabilities model sequences of POS tag pairs: the probability of DET -> NOUN is the probability of seeing a NOUN right after a DET (determiner), i.e. more formally the conditional probability P(NOUN|DET). We also had the intuition that, for example, DET -> NOUN should be much more probable than, say, DET -> VERB. And since these are conditional probabilities, P(x|y) summed over all x should equal 1 for any given y. These probabilities can easily be estimated by counting from the data: P(NOUN|DET) is simply the number of times you saw a NOUN following a DET, divided by the number of times you saw a DET.
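To make the counting estimate concrete, here is a minimal sketch on a toy tag sequence. The sequence and the helper name `transition_prob` are made up for illustration and do not come from any treebank.

```python
from collections import Counter

# Hypothetical toy tag sequence, made up for illustration.
tags = ["DET", "NOUN", "VERB", "DET", "ADJ", "NOUN", "VERB", "DET", "NOUN"]

# Count every (previous, current) pair, and how often each tag
# occurs in the "previous" position.
pair_counts = Counter(zip(tags, tags[1:]))
prev_counts = Counter(tags[:-1])

def transition_prob(prev, cur):
    """Estimate P(cur|prev) as count(prev -> cur) / count(prev)."""
    return pair_counts[(prev, cur)] / prev_counts[prev]

p_det_noun = transition_prob("DET", "NOUN")
p_det_adj = transition_prob("DET", "ADJ")
p_det_verb = transition_prob("DET", "VERB")

# Summing P(x|DET) over all tags x gives 1.
det_row_sum = sum(transition_prob("DET", cur) for cur in set(tags))

print(p_det_noun, p_det_adj, p_det_verb, det_row_sum)
```

In this toy sequence DET is followed by NOUN twice and by ADJ once, so P(NOUN|DET) comes out as 2/3 and P(VERB|DET) as 0.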

Your task is to pick a Universal Dependencies dataset of your choice, e.g. UD_English-EWT training data, calculate these transition probabilities, pretty-print them if you can, and check that our intuitions hold, i.e. that for example DET -> NOUN is substantially more likely than, say, DET -> VERB.

## Get data

The dataset used here is the UD_English-EWT training data.

Description:

A Gold Standard Universal Dependencies Corpus for English, built over the source material of the English Web Treebank LDC2012T13 (https://catalog.ldc.upenn.edu/LDC2012T13).


In [None]:
!wget -nc --quiet https://github.com/UniversalDependencies/UD_English-EWT/raw/master/en_ewt-ud-train.conllu
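The downloaded file is in CoNLL-U format: comment lines start with `#`, sentences are separated by blank lines, and every word line has ten tab-separated columns. The sketch below shows one way to pick the UPOS tag out of such lines. The helper `parse_line` and the sample lines are hypothetical and made up for illustration.

```python
# The ten CoNLL-U column names, in order.
COLUMNS = ["ID", "FORM", "LEMMA", "UPOS", "XPOS", "FEATS",
           "HEAD", "DEPREL", "DEPS", "MISC"]

def parse_line(line):
    """Return a dict for a word line, or None for anything else."""
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None  # sentence boundary or comment
    cols = line.split("\t")
    # Word lines have a plain integer ID; multiword-token ranges ("2-3")
    # and empty nodes ("1.1") do not, and they carry no UPOS tag.
    if len(cols) != len(COLUMNS) or not cols[0].isdigit():
        return None
    return dict(zip(COLUMNS, cols))

# Hypothetical sample sentence, made up for illustration.
sample = [
    "# text = I don't know",
    "1\tI\tI\tPRON\tPRP\t_\t4\tnsubj\t_\t_",
    "2-3\tdon't\t_\t_\t_\t_\t_\t_\t_\t_",
    "2\tdo\tdo\tAUX\tVBP\t_\t4\taux\t_\t_",
    "3\tn't\tnot\tPART\tRB\t_\t4\tadvmod\t_\t_",
    "4\tknow\tknow\tVERB\tVB\t_\t0\troot\t_\t_",
    "",
]

tags = [tok["UPOS"] for tok in map(parse_line, sample) if tok is not None]
print(tags)
```

The range line `2-3` is skipped, so only the four syntactic words contribute a tag.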

## Calculate scores

In [28]:
def counts(file_name, target_pos, another_pos, counter_pos) -> list:
  """Count the tag pairs needed for P(target_pos|another_pos) and P(counter_pos|another_pos).

  Returns [target_pos after another_pos, another_pos, counter_pos after another_pos, data lines].
  """
  with open(file_name, encoding="utf-8") as f:
    first_line = True # the very first line of the file is always skipped
    last_UPOS = ""
    # probability of DET -> NOUN will be the probability of seeing a NOUN, having seen a DET (determiner),
    # i.e. more formally the conditional probability P(NOUN|DET).
    noun_after_det = 0
    # Just DET
    just_det = 0
    # DET -> VERB
    verb_after_det = 0
    # ALL data lines
    all_words = 0
    for line in f:
      line = line.rstrip('\n')

      # skip the first line, comment lines and the blank lines between sentences
      if first_line or line.startswith('#') or not line:
        first_line = False
        continue
      all_words += 1
      # everything else is a data line with the ten tab-separated CoNLL-U columns
      cols = line.split('\t')
      ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC = cols

      # current is NOUN (target_pos) and last seen was DET (another_pos)
      if UPOS == target_pos and last_UPOS == another_pos:
        noun_after_det += 1
        # last_UPOS is left unchanged here, so a further target_pos
        # directly after this one is counted as well

      # current is VERB (counter_pos) and last seen was DET
      elif UPOS == counter_pos and last_UPOS == another_pos:
        verb_after_det += 1
        last_UPOS = UPOS

      # current is DET
      elif UPOS == another_pos:
        just_det += 1
        last_UPOS = UPOS

      # otherwise just update the last UPOS with the current one
      else:
        last_UPOS = UPOS

  return [noun_after_det, just_det, verb_after_det, all_words]



file_name = "en_ewt-ud-train.conllu"
count_list = counts(file_name, "NOUN", "DET", "VERB")
count_list




[10920, 16299, 962, 207227]
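The `counts` function above tracks one hard-coded pair of transitions, keeps `last_UPOS` across sentence boundaries, and also counts multiword-token range lines as words. A more general variant could count all tag pairs at once and reset at every blank line. The sketch below is one possible way to do that; `transition_counts` and the sample lines are hypothetical, and its results on the real file are not shown here and will differ from the numbers above.

```python
from collections import Counter

def transition_counts(lines):
    """Count (previous UPOS, current UPOS) pairs within sentences.

    Returns (pair_counts, prev_counts); prev_counts[t] is the number of
    times tag t was followed by another word in the same sentence.
    """
    pair_counts = Counter()
    prev_counts = Counter()
    prev = None  # no previous tag at the start of a sentence
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            prev = None  # blank line: sentence boundary, reset
            continue
        if line.startswith("#"):
            continue  # comment line
        cols = line.split("\t")
        if not cols[0].isdigit():
            continue  # multiword-token range or empty node
        upos = cols[3]
        if prev is not None:
            pair_counts[(prev, upos)] += 1
            prev_counts[prev] += 1
        prev = upos  # always remember the current tag
    return pair_counts, prev_counts

# Hypothetical two-sentence sample, made up for illustration.
sample = [
    "# text = The dog barks",
    "1\tThe\tthe\tDET\tDT\t_\t2\tdet\t_\t_",
    "2\tdog\tdog\tNOUN\tNN\t_\t3\tnsubj\t_\t_",
    "3\tbarks\tbark\tVERB\tVBZ\t_\t0\troot\t_\t_",
    "",
    "# text = A big cat",
    "1\tA\ta\tDET\tDT\t_\t3\tdet\t_\t_",
    "2\tbig\tbig\tADJ\tJJ\t_\t3\tamod\t_\t_",
    "3\tcat\tcat\tNOUN\tNN\t_\t0\troot\t_\t_",
    "",
]

pair_counts, prev_counts = transition_counts(sample)
p_noun_given_det = pair_counts[("DET", "NOUN")] / prev_counts["DET"]
print(p_noun_given_det)

# On the downloaded file the same function could be called as:
# with open("en_ewt-ud-train.conllu", encoding="utf-8") as f:
#     pair_counts, prev_counts = transition_counts(f)
```

One design choice to be aware of: the denominator here counts only tags that have a successor in the same sentence, which keeps each row of probabilities summing to 1, whereas the task text divides by all occurrences of the tag.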


## Prints

The remaining part of the task is to calculate these transition probabilities, pretty-print them if you can, and check that our intuitions hold, i.e. that for example DET -> NOUN is substantially more likely than, say, DET -> VERB.

The probability of the DET -> NOUN transition, i.e. P(NOUN|DET), is simply the count of how many times you saw NOUN following a DET, divided by how many times you saw DET.
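As a worked instance of this formula, plugging in the counts returned by `counts` above (10920 for NOUN after DET, 962 for VERB after DET, 16299 for DET) gives:

$$
P(\mathrm{NOUN} \mid \mathrm{DET}) = \frac{C(\mathrm{DET} \rightarrow \mathrm{NOUN})}{C(\mathrm{DET})} = \frac{10920}{16299} \approx 0.670,
\qquad
P(\mathrm{VERB} \mid \mathrm{DET}) = \frac{C(\mathrm{DET} \rightarrow \mathrm{VERB})}{C(\mathrm{DET})} = \frac{962}{16299} \approx 0.059
$$

With these counts, DET -> NOUN is roughly 11 times more likely than DET -> VERB.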

In [34]:
import tabulate

det_noun_trans = count_list[0] / count_list[1]
det_verb_trans = count_list[2] / count_list[1]
det_alone = count_list[1] / count_list[3]

data = [
    ["DET NOUN transfer", det_noun_trans],
    ["DET VERB transfer", det_verb_trans],
    ["DET alone", det_alone]
]

print(tabulate.tabulate(data, headers=["Case: ", "(Conditional) probability"]))





Case:                (Conditional) probability
-----------------  ---------------------------
DET NOUN transfer                    0.66998
DET VERB transfer                    0.059022
DET alone                            0.0786529
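To check the "rows sum to 1" property for every tag rather than just DET, pair counts can be normalized row by row and printed as a small matrix. The following is a rough sketch using only the standard library; the counts and the helpers `transition_table` and `format_table` are hypothetical and made up for illustration, not taken from the treebank.

```python
from collections import Counter, defaultdict

# Hypothetical pair counts, made up for illustration.
# Keys are (previous tag, current tag).
pair_counts = Counter({
    ("DET", "NOUN"): 6, ("DET", "ADJ"): 3, ("DET", "VERB"): 1,
    ("ADJ", "NOUN"): 4, ("NOUN", "VERB"): 5, ("NOUN", "NOUN"): 5,
})

def transition_table(pair_counts):
    """Turn pair counts into nested dicts: table[prev][cur] = P(cur|prev)."""
    row_totals = Counter()
    for (prev, _cur), n in pair_counts.items():
        row_totals[prev] += n
    table = defaultdict(dict)
    for (prev, cur), n in pair_counts.items():
        table[prev][cur] = n / row_totals[prev]
    return dict(table)

def format_table(table):
    """Render the table as plain text, one row per previous tag."""
    prevs = sorted(table)
    curs = sorted({cur for row in table.values() for cur in row})
    lines = [f"{'':6}" + "".join(f"{cur:>7}" for cur in curs)]
    for prev in prevs:
        cells = "".join(f"{table[prev].get(cur, 0.0):7.3f}" for cur in curs)
        lines.append(f"{prev:6}" + cells)
    return "\n".join(lines)

table = transition_table(pair_counts)
print(format_table(table))

# Every row is a conditional distribution, so it sums to 1.
row_sums = {prev: sum(row.values()) for prev, row in table.items()}
```

Each row of the printed matrix is the distribution P(x|y) for one previous tag y, so the intuition DET -> NOUN > DET -> VERB can be read off the DET row directly.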
