Skip to content

jaredks/tweetokenize

master
Switch branches/tags
Code

Latest commit

 

Git stats

Files

Permalink
Failed to load latest commit information.
Type
Name
Latest commit message
Commit time
 
 
 
 
 
 
 
 
 
 
 
 
 
 

tweetokenize

Regular expression based tokenizer for Twitter. Focused on tokenization and pre-processing to train classifiers for sentiment, emotion, or mood.

Intended as glue between Python wrappers for Twitter API and machine learning algorithms of the Natural Language Toolkit (NLTK), but probably applicable to tokenizing any short messages of the social networking variety.

from tweetokenize import Tokenizer
gettokens = Tokenizer()
gettokens.tokenize('hey playa!:):3.....@SHAQ can you still dunk?#old🍕🍔😵LOL')
[u'hey', u'playa', u'!', u':)', u':3', u'...', u'USERNAME', u'can', u'you', u'still', u'dunk', u'?', u'#old', u'🍕', u'🍔', u'😵', u'LOL']

Features

  • Can easily replace tweet features like usernames, urls, phone numbers, times, etc. with tokens in order to reduce feature set complexity and improve performance of classifiers
  • Allows user-defined sets of emoticons to be used in tokenization
  • Correctly separates emoji, written consecutively, into individual tokens

Installation

python setup.py install

After installation, you can make sure everything is working by running the following inside the project root folder,

python tests

Documentation

http://htmlpreview.github.io/?https://raw.github.com/jaredks/tweetokenize/master/documentation/tweetokenize.Tokenizer-class.html

License

"Modified BSD License". See LICENSE for details. Copyright Jared Suttles, 2013.

About

Tokenization and pre-processing for Twitter data used to train classifiers.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published