first commit

0 parents, commit 83b1fede88954d86e5028d95b370bc4ce99a1e9e, @mark-watson committed Jul 10, 2012
Showing with 92,694 additions and 0 deletions.
  1. +1 −0 KBSnlp.st
  2. +31 −0 README
  3. +92,662 −0 lexicon.txt
@@ -0,0 +1 @@
+Object subclass: #NLPtagger
+	instanceVariableNames: ''
+	classVariableNames: 'NLPlexicon'
+	poolDictionaries: ''
+	category: 'KBSnlp'!
+
+!NLPtagger commentStamp: 'MW 1/27/2008 12:20' prior: 0!
+NLP tagger converted to Squeak.
+Copyright 2000-2008 Mark Watson. All rights reserved.!
+
+"-- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- "!
+
+NLPtagger class
+	instanceVariableNames: ''!
+
+!NLPtagger class methodsFor: 'tagging' stamp: 'MW 1/27/2008 12:50'!
+initializeLexicon
+	"Read ./ai_data/lexicon.txt and build the in-memory lexicon"
+	| read count strm aLine word taglist token lex |
+	lex := Dictionary new.
+	read := (FileStream fileNamed: './ai_data/lexicon.txt') readOnly.
+	count := 0.
+	[read atEnd] whileFalse: [
+		count := count + 1.
+		aLine := read upTo: Character lf. "Mac: use lf, Windows: use cr ???"
+		strm := ReadStream on: aLine.
+		word := strm upTo: Character space.
+		taglist := OrderedCollection new.
+		[strm atEnd] whileFalse: [
+			token := strm upTo: Character space.
+			taglist add: token].
+		"Transcript show: word; cr."
+		"Transcript show: taglist printString; cr."
+		lex at: word put: taglist].
+	read close.
+	lex inspect.
+	Smalltalk at: #NLPlexicon put: lex! !
+
+!NLPtagger class methodsFor: 'tagging' stamp: 'MW 1/27/2008 13:21'!
+pptag: wordString
+	"returns a string of word/tag ..."
+	| words tags write size count |
+	words := NLPtagger tokenize: wordString.
+	tags := NLPtagger tag: words.
+	write := TextStream on: String new.
+	size := words size.
+	count := 1.
+	[count <= size] whileTrue: [
+		write nextPutAll: (words at: count).
+		write nextPutAll: '/'.
+		write nextPutAll: (tags at: count).
+		write nextPutAll: ' '.
+		count := count + 1].
+	^write contents string! !
+
+!NLPtagger class methodsFor: 'tagging' stamp: 'MW 1/27/2008 12:53'!
+tag: words
+	"tag an ordered collection of words, returning an ordered collection of corresponding tags"
+	| lex tags tag count i word lastWord lastTag |
+	tags := OrderedCollection new.
+	lex := Smalltalk at: #NLPlexicon.
+	words do: [:aWord |
+		tag := lex at: aWord ifAbsent: [nil].
+		tag isNil ifFalse: [tag := tag at: 1] ifTrue: [tag := 'NN']. " the default tag "
+		tags add: tag].
+	" Apply transformation rules: "
+	lastWord := ''.
+	lastTag := ''.
+	i := 0.
+	count := words size.
+	[i < count] whileTrue: [
+		i := i + 1.
+		word := words at: i.
+		tag := tags at: i. " reuse tag variable "
+		" First, handle all rules for i > 1 "
+		i > 1 ifTrue: [
+			" rule 1: DT, {VBD | VBP | VB} --> DT, NN "
+			lastTag = 'DT' & (tag = 'VBD' | (tag = 'VBP') | (tag = 'VB')) ifTrue: [tags at: i put: 'NN'].
+			tag size > 1 ifTrue: [
+				" rule 6: convert a noun to a verb if the preceding word is 'would' "
+				(tag at: 1) = $N & ((tag at: 2) = $N) & (lastWord asLowercase = 'would') ifTrue: [tags at: i put: 'VB']]].
+		" Now, handle the remaining rules that are valid for i = 1: "
+		" rule 2: convert a noun to a number (CD) if '.' appears in the word "
+		(word findString: '.') > 0 ifTrue: [(tag at: 1) = $N ifTrue: [tags at: i put: 'CD']]. " not working - tokenizer tosses '.' characters "
+		" rule 3: convert a noun to a past participle if words[i] ends with 'ed' "
+		(tag at: 1) = $N & (word endsWith: 'ed') ifTrue: [tags at: i put: 'VBN'].
+		" rule 4: convert any type to adverb if it ends in 'ly' "
+		(word endsWith: 'ly') ifTrue: [tags at: i put: 'RB'].
+		" rule 5: convert a common noun (NN or NNS) to an adjective if it ends with 'al' "
+		(tag at: 1) = $N & (word endsWith: 'al') ifTrue: [tags at: i put: 'JJ'].
+		" rule 7: if a word has been categorized as a common noun and it ends with 's', "
+		" then set its type to plural common noun (NNS) "
+		tag = 'NN' & (word endsWith: 's') ifTrue: [tags at: i put: 'NNS'].
+		" rule 8: convert a common noun to a present participle verb (i.e., a gerund) "
+		(tag at: 1) = $N & (word endsWith: 'ing') ifTrue: [tags at: i put: 'VBG'].
+		lastWord := word.
+		lastTag := tag].
+	^tags! !
+
+!NLPtagger class methodsFor: 'tagging' stamp: 'MW 1/27/2008 12:29'!
+tokenize: wordsInAString
+	"This method is modified by ADvance."
+	"tokenizes a string"
+	^wordsInAString findTokens: ' ;:.,<>[]{}!!@#$%^&*()?'
+		keep: ';:.,<>[]{}!!$' " keep CR in this string!!!! "! !
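Reading note on the lexicon format: initializeLexicon treats each line of lexicon.txt as a word followed by space-separated tags, and tag: later uses the first tag in the list as its initial guess. The workspace sketch below parses a single line with the same stream calls the method uses. The sample line 'dog NN VB' is made up for illustration and is not taken from lexicon.txt.

```smalltalk
"Sketch only: parse one hypothetical lexicon line the way initializeLexicon does."
| strm word taglist |
strm := ReadStream on: 'dog NN VB'.	"hypothetical sample line, not from lexicon.txt"
word := strm upTo: Character space.	"first field is the word"
taglist := OrderedCollection new.
[strm atEnd] whileFalse: [taglist add: (strm upTo: Character space)].
Transcript show: word; show: ' -> '; show: taglist printString; cr.
```

Because lines are split with `upTo: Character lf`, a file saved with CR LF line endings would leave a trailing CR on the last tag of every line, which is probably what the "Mac: use lf, Windows: use cr ???" comment is about.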
31 README
@@ -0,0 +1,31 @@
+= Natural Language Processing Library for Pharo Smalltalk
+
+Copyright 2005 to 2012 by Mark Watson
+
+License: choose either Apache 2 or LGPL 3.0, whichever works best for you.
+
+== Setup
+
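+Copy the files KBSnlp.st and lexicon.txt to the top directory of your Pharo
+distribution. On a Mac, copy them into the App folder that defines
+the Pharo application. Note that initializeLexicon opens ./ai_data/lexicon.txt,
+so put lexicon.txt in an ai_data subdirectory or edit that path in KBSnlp.st.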
+
+== Running an example
+
+Open a File Browser and fileIn the KBSnlp.st source file. Open a Class Browser
+and look at the code in the NLPtagger class (category KBSnlp).
+
+Open a Workspace and evaluate the following once:
+
+* NLPtagger initializeLexicon
+
+Try tagging a sentence:
+
+* NLPtagger pptag: 'The dog ran down the street'
+
+== To be done
+
+The enclosed code is a simple Smalltalk port of my Java FastTag part-of-speech
+(POS) tagger, which is available on my open source page: http://markwatson.com/opensource
+
+If and when I have time, I will also port my classifier and named entity recognizer (NER) code.
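Usage note on the README's "Running an example" steps: pptag: is a thin wrapper around tokenize: and tag:, so the two can also be called separately to inspect the intermediate collections. A minimal workspace sketch, assuming KBSnlp.st has been filed in and NLPtagger initializeLexicon has already been evaluated; the tags printed depend on the contents of lexicon.txt.

```smalltalk
"Sketch only: assumes the lexicon is already loaded via NLPtagger initializeLexicon."
| words tags |
words := NLPtagger tokenize: 'The dog ran down the street'.
tags := NLPtagger tag: words.
"Print one word/tag pair per line on the Transcript."
words with: tags do: [:w :t |
	Transcript show: w; show: '/'; show: t; cr].
```

As written, tag: looks words up exactly as typed, so a capitalized token such as 'The' only matches if the lexicon contains that capitalized form; words missing from the lexicon start out as 'NN' before the transformation rules run.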