Added specialized datatypes for tokenized text and POS tagged text.
Mark Granroth-Wilding committed Mar 23, 2016
1 parent 9fbf42c, commit 477271b
Showing 6 changed files with 61 additions and 3 deletions.
@@ -0,0 +1,20 @@
+from pimlico.datatypes.tar import TarredCorpus
+
+
+class PosTaggedCorpus(TarredCorpus):
+    """
+    Specialized datatype for a tarred corpus that's had POS tagging applied.
+    Each document is a list of sentences. Each sentence is a list of words. Each word is a
+    (word, POS tag) pair.
+    """
+    def process_document(self, data):
+        return [
+            [_word_tag_pair(word) for word in sentence.split(" ")] for sentence in data.split("\n")
+        ]
+
+
+def _word_tag_pair(text):
+    word, __, tag = text.rpartition("|")
+    return word, tag
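
To make the parsing above concrete, here is a minimal standalone sketch of the same logic that does not depend on Pimlico. The function names `parse_pos_tagged` and `split_word_tag` and the sample text are hypothetical, introduced only for illustration. The input format (one sentence per line, space-separated `word|TAG` tokens) is inferred from the `split` and `rpartition` calls in the diff.

```python
def split_word_tag(text):
    # rpartition splits on the LAST "|", so a word that itself
    # contains "|" keeps it and only the final field becomes the tag.
    word, _sep, tag = text.rpartition("|")
    return word, tag


def parse_pos_tagged(data):
    # One sentence per line, tokens separated by single spaces
    # (assumed format, mirroring process_document in the diff).
    return [
        [split_word_tag(token) for token in sentence.split(" ")]
        for sentence in data.split("\n")
    ]


# Hypothetical sample document: two sentences.
sample = "The|DT cat|NN sat|VBD\nIt|PRP purred|VBD"
doc = parse_pos_tagged(sample)
for sentence in doc:
    print(sentence)
```

One consequence of using `rpartition` is worth noting: a token with no `|` separator comes back with an empty word and the whole text as the tag, so callers may want to validate their input if untagged tokens are possible.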
@@ -0,0 +1,16 @@
+from pimlico.datatypes.tar import TarredCorpus
+
+
+class TokenizedCorpus(TarredCorpus):
+    """
+    Specialized datatype for a tarred corpus that's had tokenization applied. The datatype does very little -
+    the main reason for its existence is to allow modules to require that a corpus has been tokenized before
+    it's given as input.
+    Each document is a list of sentences. Each sentence is a list of words.
+    """
+    def process_document(self, data):
+        return [
+            sentence.split(" ") for sentence in data.split("\n")
+        ]
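
The tokenized format can be sketched the same way, again without Pimlico. The name `parse_tokenized` and the sample text are hypothetical. The sketch assumes, as the diff's `split` calls suggest, that sentences are newline-separated and tokens are separated by single spaces.

```python
def parse_tokenized(data):
    # Mirrors process_document in the diff: split into sentences on
    # newlines, then split each sentence into tokens on single spaces.
    return [sentence.split(" ") for sentence in data.split("\n")]


# Hypothetical sample document: two tokenized sentences.
sample = "The cat sat .\nIt purred ."
doc = parse_tokenized(sample)
for sentence in doc:
    print(sentence)
```

Because the split is on an explicit `" "` and `"\n"` rather than on arbitrary whitespace, a trailing newline in the input yields a final sentence containing a single empty token, and consecutive spaces yield empty tokens.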
@@ -0,0 +1 @@
+__author__ = 'mtw29'