text_process_classes

These classes (so far just Text()) perform basic text processing tasks, such as standardizing text to UTF-8 encoding, stripping punctuation, cleaning out HTML tags, tokenization, part-of-speech tagging, lemmatization, and building frequency distribution tables. Many leverage NLTK but add to its functionality or package it for ease of use.
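
To give a feel for the simplest of these tasks, here is a minimal standard-library sketch of punctuation stripping, tokenization, and a frequency table. It is not the code inside Text(); the function names simple_tokens and frequency_table are made up for this illustration, and it only handles ASCII punctuation.

```python
import string
from collections import Counter


def simple_tokens(raw_text):
    """Lowercase, replace ASCII punctuation with spaces, split on whitespace."""
    # Map each punctuation character to a space so "hello,world" still splits.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    cleaned = raw_text.lower().translate(table)
    return cleaned.split()


def frequency_table(tokens):
    """Return (token, count) pairs, most frequent first."""
    return Counter(tokens).most_common()


tokens = simple_tokens("Run, runs. Hello, world! Hello?")
print(tokens)
print(frequency_table(tokens))
```

Note that this naive approach also splits contractions such as "don't" into two tokens, which is one reason a proper tokenizer (for example from NLTK) is usually preferable.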

Eventually I hope to add corpus-level functionality such as vector space conversion and tf-idf weighting.

Usage

Import the class at the beginning of any Python script, like so:

from text_process_classes import Text

Instantiate with a source text

my_string = "Hello, world"
a = Text(my_string)

Set the purge_html parameter by passing it as the second argument (default is True)

my_string = "Hello, world"
a = Text(my_string, False)
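
For readers wondering what purging HTML involves, the sketch below strips tags using only Python's standard html.parser module. It is an assumption about the general technique, not the method Text() uses internally, and the names _TagStripper and strip_tags are hypothetical.

```python
from html.parser import HTMLParser


class _TagStripper(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for text between tags; the tags themselves are ignored.
        self.chunks.append(data)


def strip_tags(markup):
    """Return markup with tags removed; plain text passes through unchanged."""
    parser = _TagStripper()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.chunks)


print(strip_tags("<p>Hello, <b>world</b></p>"))
```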

Compute lemmas and part-of-speech tags by calling the methods explicitly (they don't run on instantiation because they are memory heavy)

my_text = "run runs hello, world!"
a = Text(my_text)
a.lemmatize()
a.pos_tag()

print(a.lemma_list)
print(a.pos_tuples)
## ['run', 'run', 'hello', 'world']
## [('run', 'NN'), ('runs', 'VBZ'), ('hello', 'NN'), ('world', 'NN')]
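
The pattern of deferring heavy work until a method is called can be sketched as follows. LazyText is a hypothetical stand-in, not the real Text class, and its "lemmatizer" is a deliberately crude placeholder (it just drops a trailing "s") so the example runs without NLTK.

```python
class LazyText(object):
    """Hypothetical stand-in: cheap work at construction, heavy work on request."""

    def __init__(self, raw_text):
        # Cheap step, done immediately: whitespace tokenization.
        self.tokens = raw_text.split()
        # Expensive results stay empty until the matching method is called.
        self.lemma_list = None

    def lemmatize(self):
        # Placeholder rule, NOT real lemmatization: drop a trailing "s".
        # A real implementation would call a proper lemmatizer here.
        self.lemma_list = [
            t[:-1] if t.endswith("s") and len(t) > 1 else t
            for t in self.tokens
        ]
        return self.lemma_list


doc = LazyText("run runs hello world")
print(doc.lemma_list)  # still None: nothing heavy has run yet
doc.lemmatize()
print(doc.lemma_list)
```

The benefit of this design is that scripts which only need the cleaned text never pay the cost of loading lemmatizer or tagger resources.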


mjlavin80/text_process_classes
