mjlavin80/text_process_classes

text_process_classes

These classes (so far just Text()) perform basic text processing tasks, such as standardizing text to UTF-8 encoding, stripping punctuation, removing HTML tags, tokenization, part-of-speech tagging, lemmatization, and building frequency distribution tables. Many of these tasks leverage NLTK but add to its functionality or package it for ease of use.
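To make the list of tasks above concrete, here is a minimal sketch of a few of them (HTML removal, punctuation stripping, tokenization, and a frequency table) using only the Python standard library. This is not the code inside Text(), and the function names are hypothetical; it only illustrates the kind of steps the class packages together.

```python
import re
import string
from collections import Counter


def strip_html(text):
    """Replace anything that looks like an HTML tag with a space (crude regex, not a parser)."""
    return re.sub(r"<[^>]+>", " ", text)


def strip_punctuation(text):
    """Drop ASCII punctuation characters."""
    return "".join(ch for ch in text if ch not in string.punctuation)


def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


raw = "<p>Hello, world! Hello again.</p>"
tokens = tokenize(strip_punctuation(strip_html(raw)))
freqs = Counter(tokens)  # token -> number of occurrences

print(tokens)
print(freqs.most_common(1))
```

A real pipeline would likely swap the whitespace split for an NLTK tokenizer and the regex for a proper HTML parser; the order of the steps is the part worth noting.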

Eventually I hope to add corpus-level functionality, such as vector space conversion and tf-idf weighting.

Usage

  1. Import the Text class at the beginning of any Python script:
from text_process_classes import Text
  2. Instantiate with a source text:
my_string = "Hello, world"
a = Text(my_string)
  3. Optionally set the purge_html parameter, the second argument (default is True):
my_string = "Hello, world"
a = Text(my_string, False)
  4. Run lemmatization and part-of-speech tagging via methods (these don't run on instantiation, as they are memory heavy):
my_text = "run runs hello, world!"
a = Text(my_text)
a.lemmatize()
a.pos_tag()

print(a.lemma_list)
print(a.pos_tuples)
## ['run', 'run', 'hello', 'world']
## [('run', 'NN'), ('runs', 'VBZ'), ('hello', 'NN'), ('world', 'NN')]
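The last step above relies on a deferred-computation pattern: cheap work happens at instantiation, and costly results are filled in only when a method is called. The sketch below shows that pattern in isolation. LazyText is a hypothetical stand-in, not the real Text class, and its "lemmatizer" is a toy rule (strip a trailing "s") rather than NLTK.

```python
class LazyText(object):
    """Hypothetical stand-in for Text: cheap work in __init__, costly work on demand."""

    def __init__(self, source):
        # Cheap step done immediately: lowercase, drop two punctuation marks, split.
        cleaned = source.lower().replace(",", " ").replace("!", " ")
        self.tokens = cleaned.split()
        # The expensive result starts empty and is filled only when requested.
        self.lemma_list = None

    def lemmatize(self):
        # Toy rule standing in for a real lemmatizer: strip a trailing "s"
        # from tokens longer than three characters.
        self.lemma_list = [
            t[:-1] if t.endswith("s") and len(t) > 3 else t
            for t in self.tokens
        ]
        return self.lemma_list


b = LazyText("run runs hello, world!")
print(b.lemma_list)  # still None: nothing expensive has run yet
b.lemmatize()
print(b.lemma_list)
```

Under this design, a script that only needs tokens never pays for lemmatization or tagging, which matters when processing many texts.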
