# XML Parser for NLTK Corpus
## What is this for
To train our CRF, we first have to prepare the data.
Each token should be represented as a tuple of the following form:
('word', 'some tag', 'maybe another tag as well', 'etc')

For starters, we are going to use these tags:
* ctag
* msd
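
As a rough illustration of the target format, one sentence could be stored as a list of such tuples. The sample values are copied from the corpus output in this notebook; the variable names are only illustrative.

```python
# One training sentence: a list of (word, ctag, msd) tuples.
# Sample values are taken from the first tokens of the corpus.
sentence = [
    ("Zatrzasnął", "Verbfin", "sg:ter:past:ind:perf:nrefl:aff:m1"),
    ("drzwi", "Noun", "pl:acc:n"),
    ("od", "Prep", "gen:nwok"),
    ("mieszkania", "Noun", "sg:gen:n"),
    (",", "Interp", ""),  # punctuation carries an empty msd
]

# Words and tags can be separated again with zip().
words, ctags, msds = zip(*sentence)
print(words)
```

Keeping each token as a flat tuple makes it easy to append further columns later without changing the code that reads the first ones.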

## XML Parsing tutorial
Let's learn how to work with XML using Python's built-in `xml.etree.ElementTree` module.

In [5]:
import xml.etree.ElementTree as ET

In [6]:
tree = ET.parse(".\\_\\010-2-000000001\\text.xml")
root = tree.getroot()
print(root.tag)

{http://www.tei-c.org/ns/1.0}teiCorpus


In [7]:
for child in root:
    print(child.tag, child.attrib)

{http://www.w3.org/2001/XInclude}include {'href': 'NKJP_1M_header.xml'}
{http://www.tei-c.org/ns/1.0}TEI {}
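
Every tag above is prefixed with its namespace URI in curly braces. A possible way to avoid repeating that URI is to pass a prefix map to `findall`. The sketch below uses a tiny made-up document instead of a corpus file; the prefix `tei` and the names `NS`, `SAMPLE` and `root_demo` are arbitrary choices for this example.

```python
import xml.etree.ElementTree as ET

# Prefix-to-URI map; the prefix name "tei" is an arbitrary choice.
NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Tiny stand-in document (hypothetical, not a corpus file).
SAMPLE = """<teiCorpus xmlns="http://www.tei-c.org/ns/1.0">
  <TEI><text><body><p><ab>Ala ma kota.</ab></p></body></text></TEI>
</teiCorpus>"""

root_demo = ET.fromstring(SAMPLE)

# ".//tei:ab" matches every <ab> descendant in the TEI namespace.
paragraphs = [ab.text for ab in root_demo.findall(".//tei:ab", NS)]
print(paragraphs)
```

Both spellings, `{uri}tag` and `prefix:tag` with a map, select the same elements; the map is just shorter to type.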


In [8]:
for child in root.iter('{http://www.tei-c.org/ns/1.0}ab'):
    print(child.text.split())

['Zatrzasnął', 'drzwi', 'od', 'mieszkania,', 'dwa', 'razy', 'przekręcił', 'klucz,', 'nacisnął', 'klamkę,', 'by', 'sprawdzić,', 'czy', 'dobrze', 'zamknięte,', 'zbiegł', 'po', 'schodach,', 'minął', 'furtkę,', 'także', 'ją', 'zamknął,', 'i', 'znalazł', 'się', 'na', 'wąskiej', 'uliczce', 'między', 'ogródkami,', 'gdzie', 'drzemały', 'w', 'majowym', 'słońcu', 'trójkątne', 'ciemnozielone', 'świerki,', 'jakich', 'nie', 'było', 'w', 'pobliżu', 'jego', 'domu.']
['Bohaterem', 'powieści', 'Paźniewskiego', 'jest', 'miasto,', 'Krzemieniec.']
['Jak', 'za', 'czasów', 'Słowackiego', 'funkcjonuje', 'Liceum', 'i', 'płynie', 'Ikwa.', 'Krzemieniec', 'powieściowy', 'jest', 'tamtym', 'Krzemieńcem,', 'ale', 'jest', 'także', 'miastem', 'wywołanym', 'z', 'osobistej', 'pamięci', 'Paźniewskiego.', 'Swoją', 'drogę', 'do', 'tego', 'miasta', 'autor', '"Krótkich', 'dni"', 'zaczął', 'z', 'bardzo', 'daleka.', '"Nigdy', 'nie', 'byłem', 'w', 'tym', 'domu,', 'a', 'przecież', 'wszystko', 'pamiętam', 'doskonale".']
['Ale', 

## Parsing our Data
Let's get to know the file we will be working with!

In [9]:
tree = ET.parse(".\\_\\010-2-000000001\\ann_words.xml")
root = tree.getroot()
print(root.tag)

{http://www.tei-c.org/ns/1.0}teiCorpus


In [10]:
for child in root.iter():
    print(child.tag, child.attrib)

{http://www.tei-c.org/ns/1.0}teiCorpus {}
{http://www.w3.org/2001/XInclude}include {'href': 'NKJP_1M_header.xml'}
{http://www.tei-c.org/ns/1.0}TEI {}
{http://www.w3.org/2001/XInclude}include {'href': 'header.xml'}
{http://www.tei-c.org/ns/1.0}text {'{http://www.w3.org/XML/1998/namespace}lang': 'pl'}
{http://www.tei-c.org/ns/1.0}body {}
{http://www.tei-c.org/ns/1.0}p {'{http://www.w3.org/XML/1998/namespace}id': 'words_1-p', 'corresp': 'ann_morphosyntax.xml#morph_1-p'}
{http://www.tei-c.org/ns/1.0}s {'corresp': 'ann_morphosyntax.xml#morph_1.57-s', '{http://www.w3.org/XML/1998/namespace}id': 'words_1.57-s'}
{http://www.tei-c.org/ns/1.0}seg {'{http://www.w3.org/XML/1998/namespace}id': 'words_1.57-s_sa2d'}
{http://www.tei-c.org/ns/1.0}fs {'type': 'words'}
{http://www.tei-c.org/ns/1.0}f {'name': 'orth'}
{http://www.tei-c.org/ns/1.0}string {}
{http://www.tei-c.org/ns/1.0}f {'name': 'base'}
{http://www.tei-c.org/ns/1.0}string {}
{http://www.tei-c.org/ns/1.0}f {'name': 'ctag'}
{http://www.tei-c

{http://www.tei-c.org/ns/1.0}f {'name': 'orth'}
{http://www.tei-c.org/ns/1.0}string {}
{http://www.tei-c.org/ns/1.0}f {'name': 'base'}
{http://www.tei-c.org/ns/1.0}string {}
{http://www.tei-c.org/ns/1.0}f {'name': 'ctag'}
{http://www.tei-c.org/ns/1.0}symbol {'value': 'Interp'}
{http://www.tei-c.org/ns/1.0}f {'name': 'msd'}
{http://www.tei-c.org/ns/1.0}symbol {'value': ''}
{http://www.tei-c.org/ns/1.0}ptr {'target': 'ann_morphosyntax.xml#morph_1.13-seg'}
{http://www.tei-c.org/ns/1.0}seg {'{http://www.w3.org/XML/1998/namespace}id': 'words_1.57-s_sa3f'}
{http://www.tei-c.org/ns/1.0}fs {'type': 'words'}
{http://www.tei-c.org/ns/1.0}f {'name': 'orth'}
{http://www.tei-c.org/ns/1.0}string {}
{http://www.tei-c.org/ns/1.0}f {'name': 'base'}
{http://www.tei-c.org/ns/1.0}string {}
{http://www.tei-c.org/ns/1.0}f {'name': 'ctag'}
{http://www.tei-c.org/ns/1.0}symbol {'value': 'Comp'}
{http://www.tei-c.org/ns/1.0}f {'name': 'msd'}
{http://www.tei-c.org/ns/1.0}symbol {'value': ''}
{http://www.tei-c.or

{http://www.tei-c.org/ns/1.0}f {'name': 'ctag'}
{http://www.tei-c.org/ns/1.0}symbol {'value': 'Noun'}
{http://www.tei-c.org/ns/1.0}f {'name': 'msd'}
{http://www.tei-c.org/ns/1.0}symbol {'value': 'sg:acc:f'}
{http://www.tei-c.org/ns/1.0}ptr {'target': 'ann_morphosyntax.xml#morph_1.26-seg'}
{http://www.tei-c.org/ns/1.0}seg {'{http://www.w3.org/XML/1998/namespace}id': 'words_1.57-s_sa46'}
{http://www.tei-c.org/ns/1.0}fs {'type': 'words'}
{http://www.tei-c.org/ns/1.0}f {'name': 'orth'}
{http://www.tei-c.org/ns/1.0}string {}
{http://www.tei-c.org/ns/1.0}f {'name': 'base'}
{http://www.tei-c.org/ns/1.0}string {}
{http://www.tei-c.org/ns/1.0}f {'name': 'ctag'}
{http://www.tei-c.org/ns/1.0}symbol {'value': 'Interp'}
{http://www.tei-c.org/ns/1.0}f {'name': 'msd'}
{http://www.tei-c.org/ns/1.0}symbol {'value': ''}
{http://www.tei-c.org/ns/1.0}ptr {'target': 'ann_morphosyntax.xml#morph_1.27-seg'}
{http://www.tei-c.org/ns/1.0}seg {'{http://www.w3.org/XML/1998/namespace}id': 'words_1.57-s_sa4c'}
{htt

{http://www.tei-c.org/ns/1.0}symbol {'value': 'Interp'}
{http://www.tei-c.org/ns/1.0}f {'name': 'msd'}
{http://www.tei-c.org/ns/1.0}symbol {'value': ''}
{http://www.tei-c.org/ns/1.0}ptr {'target': 'ann_morphosyntax.xml#morph_1.40-seg'}
{http://www.tei-c.org/ns/1.0}seg {'{http://www.w3.org/XML/1998/namespace}id': 'words_1.57-s_sa3d'}
{http://www.tei-c.org/ns/1.0}fs {'type': 'words'}
{http://www.tei-c.org/ns/1.0}f {'name': 'orth'}
{http://www.tei-c.org/ns/1.0}string {}
{http://www.tei-c.org/ns/1.0}f {'name': 'base'}
{http://www.tei-c.org/ns/1.0}string {}
{http://www.tei-c.org/ns/1.0}f {'name': 'ctag'}
{http://www.tei-c.org/ns/1.0}symbol {'value': 'Adv'}
{http://www.tei-c.org/ns/1.0}f {'name': 'msd'}
{http://www.tei-c.org/ns/1.0}symbol {'value': ''}
{http://www.tei-c.org/ns/1.0}ptr {'target': 'ann_morphosyntax.xml#morph_1.41-seg'}
{http://www.tei-c.org/ns/1.0}seg {'{http://www.w3.org/XML/1998/namespace}id': 'words_1.57-s_sa36'}
{http://www.tei-c.org/ns/1.0}fs {'type': 'words'}
{http://www

{http://www.tei-c.org/ns/1.0}symbol {'value': 'Ppron3'}
{http://www.tei-c.org/ns/1.0}f {'name': 'msd'}
{http://www.tei-c.org/ns/1.0}symbol {'value': 'sg:gen:m1:ter:akc:npraep'}
{http://www.tei-c.org/ns/1.0}ptr {'target': 'ann_morphosyntax.xml#morph_1.55-seg'}
{http://www.tei-c.org/ns/1.0}seg {'{http://www.w3.org/XML/1998/namespace}id': 'words_1.57-s_sa5d'}
{http://www.tei-c.org/ns/1.0}fs {'type': 'words'}
{http://www.tei-c.org/ns/1.0}f {'name': 'orth'}
{http://www.tei-c.org/ns/1.0}string {}
{http://www.tei-c.org/ns/1.0}f {'name': 'base'}
{http://www.tei-c.org/ns/1.0}string {}
{http://www.tei-c.org/ns/1.0}f {'name': 'ctag'}
{http://www.tei-c.org/ns/1.0}symbol {'value': 'Noun'}
{http://www.tei-c.org/ns/1.0}f {'name': 'msd'}
{http://www.tei-c.org/ns/1.0}symbol {'value': 'sg:gen:m3'}
{http://www.tei-c.org/ns/1.0}ptr {'target': 'ann_morphosyntax.xml#morph_1.56-seg'}
{http://www.tei-c.org/ns/1.0}seg {'{http://www.w3.org/XML/1998/namespace}id': 'words_1.57-s_sa4a'}
{http://www.tei-c.org/ns/1.

{http://www.tei-c.org/ns/1.0}f {'name': 'orth'}
{http://www.tei-c.org/ns/1.0}string {}
{http://www.tei-c.org/ns/1.0}f {'name': 'base'}
{http://www.tei-c.org/ns/1.0}string {}
{http://www.tei-c.org/ns/1.0}f {'name': 'ctag'}
{http://www.tei-c.org/ns/1.0}symbol {'value': 'Noun'}
{http://www.tei-c.org/ns/1.0}f {'name': 'msd'}
{http://www.tei-c.org/ns/1.0}symbol {'value': 'pl:gen:m3'}
{http://www.tei-c.org/ns/1.0}ptr {'target': 'ann_morphosyntax.xml#morph_2.11-seg'}
{http://www.tei-c.org/ns/1.0}seg {'{http://www.w3.org/XML/1998/namespace}id': 'words_2.18-s_sa93'}
{http://www.tei-c.org/ns/1.0}fs {'type': 'words'}
{http://www.tei-c.org/ns/1.0}f {'name': 'orth'}
{http://www.tei-c.org/ns/1.0}string {}
{http://www.tei-c.org/ns/1.0}f {'name': 'base'}
{http://www.tei-c.org/ns/1.0}string {}
{http://www.tei-c.org/ns/1.0}f {'name': 'ctag'}
{http://www.tei-c.org/ns/1.0}symbol {'value': 'Noun'}
{http://www.tei-c.org/ns/1.0}f {'name': 'msd'}
{http://www.tei-c.org/ns/1.0}symbol {'value': 'sg:gen:m1'}
{htt

{http://www.tei-c.org/ns/1.0}f {'name': 'orth'}
{http://www.tei-c.org/ns/1.0}string {}
{http://www.tei-c.org/ns/1.0}f {'name': 'base'}
{http://www.tei-c.org/ns/1.0}string {}
{http://www.tei-c.org/ns/1.0}f {'name': 'ctag'}
{http://www.tei-c.org/ns/1.0}symbol {'value': 'Interp'}
{http://www.tei-c.org/ns/1.0}f {'name': 'msd'}
{http://www.tei-c.org/ns/1.0}symbol {'value': ''}
{http://www.tei-c.org/ns/1.0}ptr {'target': 'ann_morphosyntax.xml#morph_2.24-seg'}
{http://www.tei-c.org/ns/1.0}seg {'{http://www.w3.org/XML/1998/namespace}id': 'words_2.34-s_saab'}
{http://www.tei-c.org/ns/1.0}fs {'type': 'words'}
{http://www.tei-c.org/ns/1.0}f {'name': 'orth'}
{http://www.tei-c.org/ns/1.0}string {}
{http://www.tei-c.org/ns/1.0}f {'name': 'base'}
{http://www.tei-c.org/ns/1.0}string {}
{http://www.tei-c.org/ns/1.0}f {'name': 'ctag'}
{http://www.tei-c.org/ns/1.0}symbol {'value': 'Conj'}
{http://www.tei-c.org/ns/1.0}f {'name': 'msd'}
{http://www.tei-c.org/ns/1.0}symbol {'value': ''}
{http://www.tei-c.or

{http://www.tei-c.org/ns/1.0}string {}
{http://www.tei-c.org/ns/1.0}f {'name': 'base'}
{http://www.tei-c.org/ns/1.0}string {}
{http://www.tei-c.org/ns/1.0}f {'name': 'ctag'}
{http://www.tei-c.org/ns/1.0}symbol {'value': 'Prep'}
{http://www.tei-c.org/ns/1.0}f {'name': 'msd'}
{http://www.tei-c.org/ns/1.0}symbol {'value': 'gen'}
{http://www.tei-c.org/ns/1.0}ptr {'target': 'ann_morphosyntax.xml#morph_2.37-seg'}
{http://www.tei-c.org/ns/1.0}seg {'{http://www.w3.org/XML/1998/namespace}id': 'words_2.49-s_sac3'}
{http://www.tei-c.org/ns/1.0}fs {'type': 'words'}
{http://www.tei-c.org/ns/1.0}f {'name': 'orth'}
{http://www.tei-c.org/ns/1.0}string {}
{http://www.tei-c.org/ns/1.0}f {'name': 'base'}
{http://www.tei-c.org/ns/1.0}string {}
{http://www.tei-c.org/ns/1.0}f {'name': 'ctag'}
{http://www.tei-c.org/ns/1.0}symbol {'value': 'Adj'}
{http://www.tei-c.org/ns/1.0}f {'name': 'msd'}
{http://www.tei-c.org/ns/1.0}symbol {'value': 'sg:gen:n:pos'}
{http://www.tei-c.org/ns/1.0}ptr {'target': 'ann_morphos

{http://www.tei-c.org/ns/1.0}string {}
{http://www.tei-c.org/ns/1.0}f {'name': 'base'}
{http://www.tei-c.org/ns/1.0}string {}
{http://www.tei-c.org/ns/1.0}f {'name': 'ctag'}
{http://www.tei-c.org/ns/1.0}symbol {'value': 'Adv'}
{http://www.tei-c.org/ns/1.0}f {'name': 'msd'}
{http://www.tei-c.org/ns/1.0}symbol {'value': ''}
{http://www.tei-c.org/ns/1.0}ptr {'target': 'ann_morphosyntax.xml#morph_2.51-seg'}
{http://www.tei-c.org/ns/1.0}seg {'{http://www.w3.org/XML/1998/namespace}id': 'words_2.65-s_sae1'}
{http://www.tei-c.org/ns/1.0}fs {'type': 'words'}
{http://www.tei-c.org/ns/1.0}f {'name': 'orth'}
{http://www.tei-c.org/ns/1.0}string {}
{http://www.tei-c.org/ns/1.0}f {'name': 'base'}
{http://www.tei-c.org/ns/1.0}string {}
{http://www.tei-c.org/ns/1.0}f {'name': 'ctag'}
{http://www.tei-c.org/ns/1.0}symbol {'value': 'Verbfin'}
{http://www.tei-c.org/ns/1.0}f {'name': 'msd'}
{http://www.tei-c.org/ns/1.0}symbol {'value': 'sg:pri:past:ind:imperf:nrefl:neg:m1'}
{http://www.tei-c.org/ns/1.0}ptr 

{http://www.tei-c.org/ns/1.0}s {'corresp': 'ann_morphosyntax.xml#morph_3.3-s', '{http://www.w3.org/XML/1998/namespace}id': 'words_3.3-s'}
{http://www.tei-c.org/ns/1.0}seg {'{http://www.w3.org/XML/1998/namespace}id': 'words_3.3-s_saf6'}
{http://www.tei-c.org/ns/1.0}fs {'type': 'words'}
{http://www.tei-c.org/ns/1.0}f {'name': 'orth'}
{http://www.tei-c.org/ns/1.0}string {}
{http://www.tei-c.org/ns/1.0}f {'name': 'base'}
{http://www.tei-c.org/ns/1.0}string {}
{http://www.tei-c.org/ns/1.0}f {'name': 'ctag'}
{http://www.tei-c.org/ns/1.0}symbol {'value': 'Conj'}
{http://www.tei-c.org/ns/1.0}f {'name': 'msd'}
{http://www.tei-c.org/ns/1.0}symbol {'value': ''}
{http://www.tei-c.org/ns/1.0}ptr {'target': 'ann_morphosyntax.xml#morph_3.1-seg'}
{http://www.tei-c.org/ns/1.0}seg {'{http://www.w3.org/XML/1998/namespace}id': 'words_3.3-s_saf5'}
{http://www.tei-c.org/ns/1.0}fs {'type': 'words'}
{http://www.tei-c.org/ns/1.0}f {'name': 'orth'}
{http://www.tei-c.org/ns/1.0}string {}
{http://www.tei-c.org/ns

{http://www.tei-c.org/ns/1.0}ptr {'target': 'ann_morphosyntax.xml#morph_3.24-seg'}
{http://www.tei-c.org/ns/1.0}seg {'{http://www.w3.org/XML/1998/namespace}id': 'words_3.36-s_sa12a'}
{http://www.tei-c.org/ns/1.0}fs {'type': 'words'}
{http://www.tei-c.org/ns/1.0}f {'name': 'orth'}
{http://www.tei-c.org/ns/1.0}string {}
{http://www.tei-c.org/ns/1.0}f {'name': 'base'}
{http://www.tei-c.org/ns/1.0}string {}
{http://www.tei-c.org/ns/1.0}f {'name': 'ctag'}
{http://www.tei-c.org/ns/1.0}symbol {'value': 'Prep'}
{http://www.tei-c.org/ns/1.0}f {'name': 'msd'}
{http://www.tei-c.org/ns/1.0}symbol {'value': 'acc:nwok'}
{http://www.tei-c.org/ns/1.0}ptr {'target': 'ann_morphosyntax.xml#morph_3.25-seg'}
{http://www.tei-c.org/ns/1.0}seg {'{http://www.w3.org/XML/1998/namespace}id': 'words_3.36-s_sa12e'}
{http://www.tei-c.org/ns/1.0}fs {'type': 'words'}
{http://www.tei-c.org/ns/1.0}f {'name': 'orth'}
{http://www.tei-c.org/ns/1.0}string {}
{http://www.tei-c.org/ns/1.0}f {'name': 'base'}
{http://www.tei-c.

{http://www.tei-c.org/ns/1.0}ptr {'target': 'ann_morphosyntax.xml#morph_3.37-seg'}
{http://www.tei-c.org/ns/1.0}seg {'{http://www.w3.org/XML/1998/namespace}id': 'words_3.50-s_sa14c'}
{http://www.tei-c.org/ns/1.0}fs {'type': 'words'}
{http://www.tei-c.org/ns/1.0}f {'name': 'orth'}
{http://www.tei-c.org/ns/1.0}string {}
{http://www.tei-c.org/ns/1.0}f {'name': 'base'}
{http://www.tei-c.org/ns/1.0}string {}
{http://www.tei-c.org/ns/1.0}f {'name': 'ctag'}
{http://www.tei-c.org/ns/1.0}symbol {'value': 'Noun'}
{http://www.tei-c.org/ns/1.0}f {'name': 'msd'}
{http://www.tei-c.org/ns/1.0}symbol {'value': 'pl:gen:m1'}
{http://www.tei-c.org/ns/1.0}ptr {'target': 'ann_morphosyntax.xml#morph_3.38-seg'}
{http://www.tei-c.org/ns/1.0}seg {'{http://www.w3.org/XML/1998/namespace}id': 'words_3.50-s_sa143'}
{http://www.tei-c.org/ns/1.0}fs {'type': 'words'}
{http://www.tei-c.org/ns/1.0}f {'name': 'orth'}
{http://www.tei-c.org/ns/1.0}string {}
{http://www.tei-c.org/ns/1.0}f {'name': 'base'}
{http://www.tei-c

{http://www.tei-c.org/ns/1.0}s {'corresp': 'ann_morphosyntax.xml#morph_3.80-s', '{http://www.w3.org/XML/1998/namespace}id': 'words_3.80-s'}
{http://www.tei-c.org/ns/1.0}seg {'{http://www.w3.org/XML/1998/namespace}id': 'words_3.80-s_sa164'}
{http://www.tei-c.org/ns/1.0}fs {'type': 'words'}
{http://www.tei-c.org/ns/1.0}f {'name': 'orth'}
{http://www.tei-c.org/ns/1.0}string {}
{http://www.tei-c.org/ns/1.0}f {'name': 'base'}
{http://www.tei-c.org/ns/1.0}string {}
{http://www.tei-c.org/ns/1.0}f {'name': 'ctag'}
{http://www.tei-c.org/ns/1.0}symbol {'value': 'Adv'}
{http://www.tei-c.org/ns/1.0}f {'name': 'msd'}
{http://www.tei-c.org/ns/1.0}symbol {'value': ''}
{http://www.tei-c.org/ns/1.0}ptr {'target': 'ann_morphosyntax.xml#morph_3.51-seg'}
{http://www.tei-c.org/ns/1.0}seg {'{http://www.w3.org/XML/1998/namespace}id': 'words_3.80-s_sa16f'}
{http://www.tei-c.org/ns/1.0}fs {'type': 'words'}
{http://www.tei-c.org/ns/1.0}f {'name': 'orth'}
{http://www.tei-c.org/ns/1.0}string {}
{http://www.tei-c.

{http://www.tei-c.org/ns/1.0}ptr {'target': 'ann_morphosyntax.xml#morph_4.2-seg'}
{http://www.tei-c.org/ns/1.0}seg {'{http://www.w3.org/XML/1998/namespace}id': 'words_4.11-s_sa19d'}
{http://www.tei-c.org/ns/1.0}fs {'type': 'words'}
{http://www.tei-c.org/ns/1.0}f {'name': 'orth'}
{http://www.tei-c.org/ns/1.0}string {}
{http://www.tei-c.org/ns/1.0}f {'name': 'base'}
{http://www.tei-c.org/ns/1.0}string {}
{http://www.tei-c.org/ns/1.0}f {'name': 'ctag'}
{http://www.tei-c.org/ns/1.0}symbol {'value': 'Prep'}
{http://www.tei-c.org/ns/1.0}f {'name': 'msd'}
{http://www.tei-c.org/ns/1.0}symbol {'value': 'loc:wok'}
{http://www.tei-c.org/ns/1.0}ptr {'target': 'ann_morphosyntax.xml#morph_4.3-seg'}
{http://www.tei-c.org/ns/1.0}seg {'{http://www.w3.org/XML/1998/namespace}id': 'words_4.11-s_sa19a'}
{http://www.tei-c.org/ns/1.0}fs {'type': 'words'}
{http://www.tei-c.org/ns/1.0}f {'name': 'orth'}
{http://www.tei-c.org/ns/1.0}string {}
{http://www.tei-c.org/ns/1.0}f {'name': 'base'}
{http://www.tei-c.org

{http://www.tei-c.org/ns/1.0}ptr {'target': 'ann_morphosyntax.xml#morph_4.15-seg'}
{http://www.tei-c.org/ns/1.0}seg {'{http://www.w3.org/XML/1998/namespace}id': 'words_4.47-s_sa1c0'}
{http://www.tei-c.org/ns/1.0}fs {'type': 'words'}
{http://www.tei-c.org/ns/1.0}f {'name': 'orth'}
{http://www.tei-c.org/ns/1.0}string {}
{http://www.tei-c.org/ns/1.0}f {'name': 'base'}
{http://www.tei-c.org/ns/1.0}string {}
{http://www.tei-c.org/ns/1.0}f {'name': 'ctag'}
{http://www.tei-c.org/ns/1.0}symbol {'value': 'Interp'}
{http://www.tei-c.org/ns/1.0}f {'name': 'msd'}
{http://www.tei-c.org/ns/1.0}symbol {'value': ''}
{http://www.tei-c.org/ns/1.0}ptr {'target': 'ann_morphosyntax.xml#morph_4.16-seg'}
{http://www.tei-c.org/ns/1.0}seg {'{http://www.w3.org/XML/1998/namespace}id': 'words_4.47-s_sa1b6'}
{http://www.tei-c.org/ns/1.0}fs {'type': 'words'}
{http://www.tei-c.org/ns/1.0}f {'name': 'orth'}
{http://www.tei-c.org/ns/1.0}string {}
{http://www.tei-c.org/ns/1.0}f {'name': 'base'}
{http://www.tei-c.org/ns

{http://www.tei-c.org/ns/1.0}symbol {'value': 'Noun'}
{http://www.tei-c.org/ns/1.0}f {'name': 'msd'}
{http://www.tei-c.org/ns/1.0}symbol {'value': 'sg:gen:f'}
{http://www.tei-c.org/ns/1.0}ptr {'target': 'ann_morphosyntax.xml#morph_4.28-seg'}
{http://www.tei-c.org/ns/1.0}seg {'{http://www.w3.org/XML/1998/namespace}id': 'words_4.47-s_sa1ca'}
{http://www.tei-c.org/ns/1.0}fs {'type': 'words'}
{http://www.tei-c.org/ns/1.0}f {'name': 'orth'}
{http://www.tei-c.org/ns/1.0}string {}
{http://www.tei-c.org/ns/1.0}f {'name': 'base'}
{http://www.tei-c.org/ns/1.0}string {}
{http://www.tei-c.org/ns/1.0}f {'name': 'ctag'}
{http://www.tei-c.org/ns/1.0}symbol {'value': 'Prep'}
{http://www.tei-c.org/ns/1.0}f {'name': 'msd'}
{http://www.tei-c.org/ns/1.0}symbol {'value': 'gen'}
{http://www.tei-c.org/ns/1.0}ptr {'target': 'ann_morphosyntax.xml#morph_4.31-seg'}
{http://www.tei-c.org/ns/1.0}seg {'{http://www.w3.org/XML/1998/namespace}id': 'words_4.47-s_sa1d3'}
{http://www.tei-c.org/ns/1.0}fs {'type': 'words'}

{http://www.tei-c.org/ns/1.0}ptr {'target': 'ann_morphosyntax.xml#morph_4.45-seg'}
{http://www.tei-c.org/ns/1.0}seg {'{http://www.w3.org/XML/1998/namespace}id': 'words_4.47-s_sa1d9'}
{http://www.tei-c.org/ns/1.0}fs {'type': 'words'}
{http://www.tei-c.org/ns/1.0}f {'name': 'orth'}
{http://www.tei-c.org/ns/1.0}string {}
{http://www.tei-c.org/ns/1.0}f {'name': 'base'}
{http://www.tei-c.org/ns/1.0}string {}
{http://www.tei-c.org/ns/1.0}f {'name': 'ctag'}
{http://www.tei-c.org/ns/1.0}symbol {'value': 'Noun'}
{http://www.tei-c.org/ns/1.0}f {'name': 'msd'}
{http://www.tei-c.org/ns/1.0}symbol {'value': 'sg:gen:n'}
{http://www.tei-c.org/ns/1.0}ptr {'target': 'ann_morphosyntax.xml#morph_4.46-seg'}
{http://www.tei-c.org/ns/1.0}seg {'{http://www.w3.org/XML/1998/namespace}id': 'words_4.47-s_sa1c4'}
{http://www.tei-c.org/ns/1.0}fs {'type': 'words'}
{http://www.tei-c.org/ns/1.0}f {'name': 'orth'}
{http://www.tei-c.org/ns/1.0}string {}
{http://www.tei-c.org/ns/1.0}f {'name': 'base'}
{http://www.tei-c.

{http://www.tei-c.org/ns/1.0}ptr {'target': 'ann_morphosyntax.xml#morph_4.58-seg'}
{http://www.tei-c.org/ns/1.0}seg {'{http://www.w3.org/XML/1998/namespace}id': 'words_4.67-s_sa1fb'}
{http://www.tei-c.org/ns/1.0}fs {'type': 'words'}
{http://www.tei-c.org/ns/1.0}f {'name': 'orth'}
{http://www.tei-c.org/ns/1.0}string {}
{http://www.tei-c.org/ns/1.0}f {'name': 'base'}
{http://www.tei-c.org/ns/1.0}string {}
{http://www.tei-c.org/ns/1.0}f {'name': 'ctag'}
{http://www.tei-c.org/ns/1.0}symbol {'value': 'Interp'}
{http://www.tei-c.org/ns/1.0}f {'name': 'msd'}
{http://www.tei-c.org/ns/1.0}symbol {'value': ''}
{http://www.tei-c.org/ns/1.0}ptr {'target': 'ann_morphosyntax.xml#morph_4.59-seg'}
{http://www.tei-c.org/ns/1.0}seg {'{http://www.w3.org/XML/1998/namespace}id': 'words_4.67-s_sa203'}
{http://www.tei-c.org/ns/1.0}fs {'type': 'words'}
{http://www.tei-c.org/ns/1.0}f {'name': 'orth'}
{http://www.tei-c.org/ns/1.0}string {}
{http://www.tei-c.org/ns/1.0}f {'name': 'base'}
{http://www.tei-c.org/ns

{http://www.tei-c.org/ns/1.0}symbol {'value': 'pl:loc:m3:pos'}
{http://www.tei-c.org/ns/1.0}ptr {'target': 'ann_morphosyntax.xml#morph_5.4-seg'}
{http://www.tei-c.org/ns/1.0}seg {'{http://www.w3.org/XML/1998/namespace}id': 'words_5.16-s_sa228'}
{http://www.tei-c.org/ns/1.0}fs {'type': 'words'}
{http://www.tei-c.org/ns/1.0}f {'name': 'orth'}
{http://www.tei-c.org/ns/1.0}string {}
{http://www.tei-c.org/ns/1.0}f {'name': 'base'}
{http://www.tei-c.org/ns/1.0}string {}
{http://www.tei-c.org/ns/1.0}f {'name': 'ctag'}
{http://www.tei-c.org/ns/1.0}symbol {'value': 'Noun'}
{http://www.tei-c.org/ns/1.0}f {'name': 'msd'}
{http://www.tei-c.org/ns/1.0}symbol {'value': 'pl:loc:m3'}
{http://www.tei-c.org/ns/1.0}ptr {'target': 'ann_morphosyntax.xml#morph_5.5-seg'}
{http://www.tei-c.org/ns/1.0}seg {'{http://www.w3.org/XML/1998/namespace}id': 'words_5.16-s_sa221'}
{http://www.tei-c.org/ns/1.0}fs {'type': 'words'}
{http://www.tei-c.org/ns/1.0}f {'name': 'orth'}
{http://www.tei-c.org/ns/1.0}string {}
{htt

{http://www.tei-c.org/ns/1.0}string {}
{http://www.tei-c.org/ns/1.0}f {'name': 'base'}
{http://www.tei-c.org/ns/1.0}string {}
{http://www.tei-c.org/ns/1.0}f {'name': 'ctag'}
{http://www.tei-c.org/ns/1.0}symbol {'value': 'Adj'}
{http://www.tei-c.org/ns/1.0}f {'name': 'msd'}
{http://www.tei-c.org/ns/1.0}symbol {'value': 'sg:nom:f:pos'}
{http://www.tei-c.org/ns/1.0}ptr {'target': 'ann_morphosyntax.xml#morph_5.55-seg'}
{http://www.tei-c.org/ns/1.0}ptr {'target': 'ann_morphosyntax.xml#morph_5.56-seg'}
{http://www.tei-c.org/ns/1.0}seg {'{http://www.w3.org/XML/1998/namespace}id': 'words_5.59-s_sa284'}
{http://www.tei-c.org/ns/1.0}fs {'type': 'words'}
{http://www.tei-c.org/ns/1.0}f {'name': 'orth'}
{http://www.tei-c.org/ns/1.0}string {}
{http://www.tei-c.org/ns/1.0}f {'name': 'base'}
{http://www.tei-c.org/ns/1.0}string {}
{http://www.tei-c.org/ns/1.0}f {'name': 'ctag'}
{http://www.tei-c.org/ns/1.0}symbol {'value': 'Adj'}
{http://www.tei-c.org/ns/1.0}f {'name': 'msd'}
{http://www.tei-c.org/ns/1

{http://www.tei-c.org/ns/1.0}f {'name': 'ctag'}
{http://www.tei-c.org/ns/1.0}symbol {'value': 'Adj'}
{http://www.tei-c.org/ns/1.0}f {'name': 'msd'}
{http://www.tei-c.org/ns/1.0}symbol {'value': 'pl:gen:f:sup'}
{http://www.tei-c.org/ns/1.0}ptr {'target': 'ann_morphosyntax.xml#morph_6.9-seg'}
{http://www.tei-c.org/ns/1.0}seg {'{http://www.w3.org/XML/1998/namespace}id': 'words_6.40-s_sa2a3'}
{http://www.tei-c.org/ns/1.0}fs {'type': 'words'}
{http://www.tei-c.org/ns/1.0}f {'name': 'orth'}
{http://www.tei-c.org/ns/1.0}string {}
{http://www.tei-c.org/ns/1.0}f
{http://www.tei-c.org/ns/1.0}f {'name': 'base'}
{http://www.tei-c.org/ns/1.0}string {}
{http://www.tei-c.org/ns/1.0}f {'name': 'ctag'}
{http://www.tei-c.org/ns/1.0}symbol {'value': 'Adj'}
{http://www.tei-c.org/ns/1.0}f {'name': 'msd'}
{http://www.tei-c.org/ns/1.0}symbol {'value': 'sg:gen:f:pos'}
{http://www.tei-c.org/ns/1.0}ptr {'target': 'ann_morphosyntax.xml#morph_6.46-seg'}
{http://www.tei-c.org/ns/1.0}seg {'{http://www.w3.org/XML/19

{http://www.tei-c.org/ns/1.0}string {}
{http://www.tei-c.org/ns/1.0}f {'name': 'ctag'}
{http://www.tei-c.org/ns/1.0}symbol {'value': 'Adj'}
{http://www.tei-c.org/ns/1.0}f {'name': 'msd'}
{http://www.tei-c.org/ns/1.0}symbol {'value': 'sg:acc:f:pos'}
{http://www.tei-c.org/ns/1.0}ptr {'target': 'ann_morphosyntax.xml#morph_6.60-seg'}
{http://www.tei-c.org/ns/1.0}seg {'{http://www.w3.org/XML/1998/namespace}id': 'words_6.63-s_sa2df'}
{http://www.tei-c.org/ns/1.0}fs {'type': 'words'}
{http://www.tei-c.org/ns/1.0}f {'name': 'orth'}
{http://www.tei-c.org/ns/1.0}string {}
{http://www.tei-c.org/ns/1.0}f {'name': 'base'}
{http://www.tei-c.org/ns/1.0}string {}
{http://www.tei-c.org/ns/1.0}f {'name': 'ctag'}
{http://www.tei-c.org/ns/1.0}symbol {'value': 'Pred'}
{http://www.tei-c.org/ns/1.0}f {'name': 'msd'}
{http://www.tei-c.org/ns/1.0}symbol {'value': 'pres:ind:imperf:aff'}
{http://www.tei-c.org/ns/1.0}ptr {'target': 'ann_morphosyntax.xml#morph_6.61-seg'}
{http://www.tei-c.org/ns/1.0}seg {'{http://

{http://www.tei-c.org/ns/1.0}symbol {'value': 'Interp'}
{http://www.tei-c.org/ns/1.0}f {'name': 'msd'}
{http://www.tei-c.org/ns/1.0}symbol {'value': ''}
{http://www.tei-c.org/ns/1.0}ptr {'target': 'ann_morphosyntax.xml#morph_7.49-seg'}
{http://www.tei-c.org/ns/1.0}seg {'{http://www.w3.org/XML/1998/namespace}id': 'words_7.54-s_sa33b'}
{http://www.tei-c.org/ns/1.0}fs {'type': 'words'}
{http://www.tei-c.org/ns/1.0}f {'name': 'orth'}
{http://www.tei-c.org/ns/1.0}string {}
{http://www.tei-c.org/ns/1.0}f {'name': 'base'}
{http://www.tei-c.org/ns/1.0}string {}
{http://www.tei-c.org/ns/1.0}f {'name': 'ctag'}
{http://www.tei-c.org/ns/1.0}symbol {'value': 'Conj'}
{http://www.tei-c.org/ns/1.0}f {'name': 'msd'}
{http://www.tei-c.org/ns/1.0}symbol {'value': ''}
{http://www.tei-c.org/ns/1.0}ptr {'target': 'ann_morphosyntax.xml#morph_7.50-seg'}
{http://www.tei-c.org/ns/1.0}seg {'{http://www.w3.org/XML/1998/namespace}id': 'words_7.54-s_sa344'}
{http://www.tei-c.org/ns/1.0}fs {'type': 'words'}
{http://

{http://www.tei-c.org/ns/1.0}symbol {'value': 'Interp'}
{http://www.tei-c.org/ns/1.0}f {'name': 'msd'}
{http://www.tei-c.org/ns/1.0}symbol {'value': ''}
{http://www.tei-c.org/ns/1.0}ptr {'target': 'ann_morphosyntax.xml#morph_7.62-seg'}


In [18]:
for sentence in root.iter('{http://www.tei-c.org/ns/1.0}s'):
    print("#########################")
    for child in sentence.iter('{http://www.tei-c.org/ns/1.0}f'):
        # Element.getchildren() was removed in Python 3.9; index the element instead.
        if child.attrib['name'] == 'orth':
            print('\n' + child[0].text)
        elif child.attrib['name'] in {'ctag', 'msd'}:
            print(child[0].attrib['value'])

#########################

Zatrzasnął
Verbfin
sg:ter:past:ind:perf:nrefl:aff:m1

drzwi
Noun
pl:acc:n

od
Prep
gen:nwok

mieszkania
Noun
sg:gen:n

,
Interp


dwa
Num
pl:acc:m3:congr

razy
Noun
pl:acc:m3

przekręcił
Verbfin
sg:ter:past:ind:perf:nrefl:aff:m1

klucz
Noun
sg:acc:m3

,
Interp


nacisnął
Verbfin
sg:ter:past:ind:perf:nrefl:aff:m1

klamkę
Noun
sg:acc:f

,
Interp


by
Comp


sprawdzić
Inf
perf:nrefl:aff

,
Interp


czy
Qub


dobrze
Adv
pos

zamknięte
Ppas
pl:nom:n:perf:nrefl:aff

,
Interp


zbiegł
Verbfin
sg:ter:past:ind:perf:nrefl:aff:m1

po
Prep
loc

schodach
Noun
pl:loc:n

,
Interp


minął
Verbfin
sg:ter:past:ind:perf:nrefl:aff:m1

furtkę
Noun
sg:acc:f

,
Interp


także
Qub


ją
Ppron3
sg:acc:f:ter:akc:npraep

zamknął
Verbfin
sg:ter:past:ind:perf:nrefl:aff:m1

,
Interp


i
Conj


znalazł się
Verbfin
sg:ter:past:ind:perf:refl:aff:m1

na
Prep
loc

wąskiej
Adj
sg:loc:f:pos

uliczce
Noun
sg:loc:f

między
Prep
inst

ogródkami
Noun
pl:inst:m3

,
Interp


gdzie
Adv


drzemały
Verbfi

sg:nom:f

Piętaka
Noun
sg:gen:m1

,
Interp


jedna
Adj
sg:nom:f:pos

spośród
Prep
gen

kilku
Num
pl:gen:f:congr

najznakomitszych
Adj
pl:gen:f:sup

współczesnych
Adj
pl:gen:f:pos

powieści
Noun
pl:gen:f

,
Interp


także
Qub


ze względu na
Prep
acc

jej
Ppron3
sg:gen:f:ter:akc:npraep

zaklasyfikowanie
Noun
sg:acc:n:perf:nrefl:aff

wraz z
Prep
inst:nwok

całą
Adj
sg:inst:f:pos

twórczością
Noun
sg:inst:f

tego
Adj
sg:gen:m1:pos

pisarza
Noun
sg:gen:m1

do
Prep
gen

nurtu
Noun
sg:gen:m3

wiejskiego
Adj
sg:gen:m3:pos

,
Interp


nie ma
Verbfin
sg:ter:pres:ind:imperf:nrefl:neg

w
Prep
loc:nwok

odbiorze
Noun
sg:loc:m3

powszechnym
Adj
sg:loc:m3:pos

tej
Adj
sg:gen:f:pos

rangi
Noun
sg:gen:f

,
Interp


jaką
Adj
sg:acc:f:pos

rzeczywiście
Adv
pos

posiada
Verbfin
sg:ter:pres:ind:imperf:nrefl:aff

.
Interp

#########################

Wszystko
Noun
sg:nom:n

co
Noun
sg:acc:n

Piętak
Noun
sg:nom:m1

wyniósł
Verbfin
sg:ter:past:ind:perf:nrefl:aff:m1

z
Prep
gen:nwok

chłopskiej
Adj
sg:gen:f:po

## Actual Parsing
Now we should be able to extract the data we need.

Unfortunately, ann_words.xml does not say whether a word is a named entity.
Because of that, we have to use both the ann_named.xml and ann_words.xml files.
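
The idea of combining the two files can be sketched on a tiny inline document before touching the real data. Everything below is a simplified assumption: `WORDS_XML` is a hand-written miniature of ann_words.xml (the tag values are copied from the output above), `named_demo` stands in for the entity forms read from ann_named.xml, and `seg_to_tuple` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"

# Hypothetical miniature of ann_words.xml: one sentence, two tokens.
WORDS_XML = """<body xmlns="http://www.tei-c.org/ns/1.0"><s>
  <seg><fs type="words">
    <f name="orth"><string>Zatrzasnął</string></f>
    <f name="ctag"><symbol value="Verbfin"/></f>
    <f name="msd"><symbol value="sg:ter:past:ind:perf:nrefl:aff:m1"/></f>
  </fs></seg>
  <seg><fs type="words">
    <f name="orth"><string>Piętaka</string></f>
    <f name="ctag"><symbol value="Noun"/></f>
    <f name="msd"><symbol value="sg:gen:m1"/></f>
  </fs></seg>
</s></body>"""

# Assumed stand-in for the surface forms collected from ann_named.xml.
named_demo = {"Piętaka"}


def seg_to_tuple(seg, named):
    """Turn one <seg> into a (word, ctag, msd, label) tuple."""
    feats = {}
    for f in seg.iter(TEI + "f"):
        value = f[0]  # first child: <string> or <symbol>
        if value.tag == TEI + "string":
            feats[f.get("name")] = value.text
        else:
            feats[f.get("name")] = value.get("value")
    word = feats["orth"]
    label = "B" if word in named else "O"
    return (word, feats["ctag"], feats["msd"], label)


demo_sentence = [
    seg_to_tuple(seg, named_demo)
    for seg in ET.fromstring(WORDS_XML).iter(TEI + "seg")
]
print(demo_sentence)
```

Collecting the features of a token into a dictionary first means the result does not depend on the order in which the `f` elements appear.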

In [88]:
tree = ET.parse(".\\_\\010-2-000000001\\ann_named.xml")
namedRoot = tree.getroot()

In [100]:
# Collect the surface forms of all named entities; a set gives fast lookups.
named_words = set()

for child in namedRoot.iter('{http://www.tei-c.org/ns/1.0}f'):
    if child.get('name') == 'orth':
        # child[0] replaces getchildren()[0], which was removed in Python 3.9
        named_words.add(child[0].text)

sent_arr = []

for sentence in root.iter('{http://www.tei-c.org/ns/1.0}s'):
    data_arr = []

    for child in sentence.iter('{http://www.tei-c.org/ns/1.0}f'):
        if child.attrib['name'] == 'orth':
            word = child[0].text
            # Every word that also appears in ann_named.xml is labelled 'B', all others 'O'
            named = 'B' if word in named_words else 'O'
        elif child.attrib['name'] == 'ctag':
            ctag = child[0].attrib['value']
        elif child.attrib['name'] == 'msd':
            # msd is the last feature of a token, so the tuple is complete here
            msd = child[0].attrib['value']
            data_arr.append((word, ctag, msd, named))

    sent_arr.append(data_arr)

train_sents = list(sent_arr)

In [101]:
train_sents

[[('Zatrzasnął', 'Verbfin', 'sg:ter:past:ind:perf:nrefl:aff:m1', 'O'),
  ('drzwi', 'Noun', 'pl:acc:n', 'O'),
  ('od', 'Prep', 'gen:nwok', 'O'),
  ('mieszkania', 'Noun', 'sg:gen:n', 'O'),
  (',', 'Interp', '', 'O'),
  ('dwa', 'Num', 'pl:acc:m3:congr', 'O'),
  ('razy', 'Noun', 'pl:acc:m3', 'O'),
  ('przekręcił', 'Verbfin', 'sg:ter:past:ind:perf:nrefl:aff:m1', 'O'),
  ('klucz', 'Noun', 'sg:acc:m3', 'O'),
  (',', 'Interp', '', 'O'),
  ('nacisnął', 'Verbfin', 'sg:ter:past:ind:perf:nrefl:aff:m1', 'O'),
  ('klamkę', 'Noun', 'sg:acc:f', 'O'),
  (',', 'Interp', '', 'O'),
  ('by', 'Comp', '', 'O'),
  ('sprawdzić', 'Inf', 'perf:nrefl:aff', 'O'),
  (',', 'Interp', '', 'O'),
  ('czy', 'Qub', '', 'O'),
  ('dobrze', 'Adv', 'pos', 'O'),
  ('zamknięte', 'Ppas', 'pl:nom:n:perf:nrefl:aff', 'O'),
  (',', 'Interp', '', 'O'),
  ('zbiegł', 'Verbfin', 'sg:ter:past:ind:perf:nrefl:aff:m1', 'O'),
  ('po', 'Prep', 'loc', 'O'),
  ('schodach', 'Noun', 'pl:loc:n', 'O'),
  (',', 'Interp', '', 'O'),
  ('minął', 'Verbf

# CRF training
## Preparing the Environment

In [1]:
from itertools import chain

import nltk
import sklearn
import scipy.stats
from sklearn.metrics import make_scorer
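# sklearn.cross_validation and sklearn.grid_search were removed in scikit-learn 0.20;
# both helpers now live in sklearn.model_selection.
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RandomizedSearchCV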

import sklearn_crfsuite
from sklearn_crfsuite import scorers 
from sklearn_crfsuite import metrics



In [2]:
nltk.download_shell()

NLTK Downloader
---------------------------------------------------------------------------
    d) Download   l) List    u) Update   c) Config   h) Help   q) Quit
---------------------------------------------------------------------------
Downloader> d

Download which package (l=list; x=cancel)?
  Identifier> all
    Downloading collection 'all'
       | 
       | Downloading package abc to
       |     C:\Users\nietup\AppData\Roaming\nltk_data...
       |   Unzipping corpora\abc.zip.
       | Downloading package alpino to
       |     C:\Users\nietup\AppData\Roaming\nltk_data...
       |   Unzipping corpora\alpino.zip.
       | Downloading package biocreative_ppi to
       |     C:\Users\nietup\AppData\Roaming\nltk_data...
       |   Unzipping corpora\biocreative_ppi.zip.
       | Downloading package brown to
       |     C:\Users\nietup\AppData\Roaming\nltk_data...
       |   Unzipping corpora\brown.zip.
       | Downloading package brown_tei to
       |     C:\Users\nietup\AppData\R

       |     C:\Users\nietup\AppData\Roaming\nltk_data...
       |   Unzipping corpora\sentiwordnet.zip.
       | Downloading package sentence_polarity to
       |     C:\Users\nietup\AppData\Roaming\nltk_data...
       |   Unzipping corpora\sentence_polarity.zip.
       | Downloading package shakespeare to
       |     C:\Users\nietup\AppData\Roaming\nltk_data...
       |   Unzipping corpora\shakespeare.zip.
       | Downloading package sinica_treebank to
       |     C:\Users\nietup\AppData\Roaming\nltk_data...
       |   Unzipping corpora\sinica_treebank.zip.
       | Downloading package smultron to
       |     C:\Users\nietup\AppData\Roaming\nltk_data...
       |   Unzipping corpora\smultron.zip.
       | Downloading package state_union to
       |     C:\Users\nietup\AppData\Roaming\nltk_data...
       |   Unzipping corpora\state_union.zip.
       | Downloading package stopwords to
       |     C:\Users\nietup\AppData\Roaming\nltk_data...
       |   Unzipping corpora\stopwords.zi

In [33]:
test_sents = list(nltk.corpus.conll2002.iob_sents('esp.testb'))

In [35]:
test_sents[99]

[('"', 'Fe', 'O'),
 ('No', 'RN', 'O'),
 ('se', 'P0', 'O'),
 ('trata', 'VMI', 'O'),
 ('de', 'SP', 'O'),
 ('un', 'DI', 'O'),
 ('abanico', 'NC', 'O'),
 ('ni', 'CC', 'O'),
 ('de', 'SP', 'O'),
 ('un', 'DI', 'O'),
 ('muestrario', 'NC', 'O'),
 (',', 'Fc', 'O'),
 ('donde', 'PR', 'O'),
 ('la', 'DA', 'O'),
 ('música', 'NC', 'O'),
 ('de', 'SP', 'O'),
 ('diferentes', 'DI', 'O'),
 ('países', 'NC', 'O'),
 ('puede', 'VMI', 'O'),
 ('estar', 'VMN', 'O'),
 ('representada', 'AQ', 'O'),
 (',', 'Fc', 'O'),
 ('son', 'VSI', 'O'),
 ('músicas', 'AQ', 'O'),
 ('que', 'PR', 'O'),
 ('yo', 'PP', 'O'),
 ('paso', 'VMI', 'O'),
 ('por', 'SP', 'O'),
 ('mí', 'PP', 'O'),
 ('y', 'CC', 'O'),
 ('luego', 'RG', 'O'),
 ('devuelvo', 'VMI', 'O'),
 ('"', 'Fe', 'O'),
 (',', 'Fc', 'O'),
 ('agregó', 'VMI', 'O'),
 ('Serrat', 'AQ', 'B-PER'),
 (',', 'Fc', 'O'),
 ('que', 'PR', 'O'),
 ('añadirá', 'VMI', 'O'),
 ('posteriormente', 'RG', 'O'),
 ('su', 'DP', 'O'),
 ('voz', 'NC', 'O'),
 ('a', 'SP', 'O'),
 ('los', 'DA', 'O'),
 ('temas', 'NC', '