This notebook outlines several methods for tokenizing text into words (and sentences) and highlights the differences between them:

* whitespace
* NLTK (Penn Treebank tokenizer)
* NLTK (Twitter-aware)
* spaCy
* custom regular expressions

In [1]:
import nltk, re, json
import spacy
from collections import Counter

In [2]:
# If you haven't downloaded the sentence segmentation model before, do so here
nltk.download("punkt")

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/pierrejaumier/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [3]:
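# Only spaCy's tokenizer is used below, so disable the tagger, NER and parser.
# Each component name must be its own list element.
# (On spaCy 3+ the 'en' shortcut no longer exists; load a model such as 'en_core_web_sm' instead.)
nlp = spacy.load('en', disable=['tagger', 'ner', 'parser'])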

In [4]:
def read_tweets_from_json(filename):
    tweets=[]
    with open(filename, encoding="utf-8") as file:
        data=json.load(file)
        for tweet in data:
            tweets.append(tweet["text"])
    return tweets

trump_tweets.json comes from the Trump Twitter Archive (downloaded 1/19/19): http://www.trumptwitterarchive.com/archive
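The loader above assumes the file is a JSON array of objects that each carry a "text" field. A minimal sketch of that shape, using a made-up two-record sample (the records and the extra "id" field are hypothetical, not taken from the archive):

```python
import json

# Hypothetical sample in the shape read_tweets_from_json expects:
# a JSON array of objects, each with a "text" field
sample_json = '[{"text": "First tweet"}, {"text": "Second tweet", "id": 2}]'

# Keep only the text of each record; other fields are ignored
sample_texts = [record["text"] for record in json.loads(sample_json)]
print(sample_texts)
```

Any record without a "text" key would raise a KeyError, so a file in a different layout needs a different loader.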

In [5]:
filename="../data/trump_tweets.json"

In [7]:
tweets=read_tweets_from_json(filename)

In [8]:
whitespace_tokens=[]
for tweet in tweets:
    whitespace_tokens.append(tweet.split())
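To see what whitespace tokenization does to punctuation, here is a small sketch on a made-up sentence (the example string is an assumption, not a tweet from the dataset):

```python
# Hypothetical example text, not from trump_tweets.json
example = "Not easy! We can't get through, but @someone said #so."

# str.split() with no arguments splits on any run of whitespace,
# so punctuation stays attached to the neighboring word
whitespace_example = example.split()
print(whitespace_example)
```

Tokens such as "easy!" and "through," keep their punctuation, which is the main thing the other tokenizers in this notebook handle differently.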

In [9]:
nltk_tokens=[]
for tweet in tweets:
    nltk_tokens.append(nltk.word_tokenize(tweet, language="english"))

In [10]:
nltk_casual_tokens=[]
for tweet in tweets:
    nltk_casual_tokens.append(nltk.casual_tokenize(tweet))

In [11]:
spacy_tokens=[]
for tweet in tweets:
    spacy_tokens.append([token.text for token in nlp(tweet)])

In [12]:
# Shorter version of http://sentiment.christopherpotts.net/code-data/happyfuntokenizing.py

# The order here is important (match from first to last)

# Keep usernames together (any token starting with @, followed by A-Z, a-z, 0-9, or _)
regexes=(r"(?:@[\w_]+)",

# Keep hashtags together (# followed by at least two word characters; ', _ and - may appear in the middle)
r"(?:\#+[\w_]+[\w\'_\-]*[\w_]+)",

# Keep words with apostrophes, hyphens and underscores together
r"(?:[a-z][a-z’'\-_]+[a-z])",

# Keep all other sequences of A-Z, a-z, 0-9, _ together
r"(?:[\w_]+)",

# Everything else that's not whitespace
r"(?:\S)"
)

big_regex="|".join(regexes)

my_extensible_tokenizer = re.compile(big_regex, re.VERBOSE | re.I | re.UNICODE)

def my_extensible_tokenize(text):
    return my_extensible_tokenizer.findall(text)
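As a quick check of how the ordered alternatives behave, the sketch below repeats the same patterns so it runs on its own and applies them to a made-up tweet (the names demo_tokenizer, sample and the sample text are illustrative assumptions):

```python
import re

# Same patterns as the cell above, repeated so this sketch is self-contained
patterns = (
    r"(?:@[\w_]+)",                    # usernames
    r"(?:\#+[\w_]+[\w\'_\-]*[\w_]+)",  # hashtags
    r"(?:[a-z][a-z’'\-_]+[a-z])",      # words with internal ' - _
    r"(?:[\w_]+)",                     # other word-character runs
    r"(?:\S)",                         # any other non-whitespace character
)
demo_tokenizer = re.compile("|".join(patterns), re.VERBOSE | re.I | re.UNICODE)

# Hypothetical example text
sample = "@FoxNews I can't wait for #MakeAmericaGreatAgain!"
demo_tokens = demo_tokenizer.findall(sample)
print(demo_tokens)

# A one-character hashtag does not satisfy the hashtag pattern,
# so it falls through to the later alternatives
short_tag_tokens = demo_tokenizer.findall("#a")
print(short_tag_tokens)
```

The username, contraction, and hashtag each stay in one piece, while "#a" is split into "#" and "a" because the hashtag pattern needs at least two word characters after the "#".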

In [13]:
extensible_tokens=[]
for tweet in tweets:
    extensible_tokens.append(my_extensible_tokenize(tweet))

Q1: Write a function to print out the first 5 tokenized tweets from each of the five tokenizers above. Examine those tweets; how would you characterize the differences?

In [14]:
for idx, (one, two, three, four, five) in enumerate(zip(nltk_tokens, nltk_casual_tokens, spacy_tokens, whitespace_tokens, extensible_tokens)):
    if idx >= 5:
        break
    print("NLTK      :\t%s" % ' '.join(one))
    print("CASUAL    :\t%s" % ' '.join(two))
    print("SPACY     :\t%s" % ' '.join(three))
    print("WHITESPACE:\t%s" % ' '.join(four))
    print("EXTENSIBLE:\t%s" % ' '.join(five))


    print()


NLTK      :	Mexico is doing NOTHING to stop the Caravan which is now fully formed and heading to the United States . We stopped the last two - many are still in Mexico but can ’ t get through our Wall , but it takes a lot of Border Agents if there is no Wall . Not easy !
CASUAL    :	Mexico is doing NOTHING to stop the Caravan which is now fully formed and heading to the United States . We stopped the last two - many are still in Mexico but can ’ t get through our Wall , but it takes a lot of Border Agents if there is no Wall . Not easy !
SPACY     :	Mexico is doing NOTHING to stop the Caravan which is now fully formed and heading to the United States . We stopped the last two - many are still in Mexico but ca n’t get through our Wall , but it takes a lot of Border Agents if there is no Wall . Not easy !
WHITESPACE:	Mexico is doing NOTHING to stop the Caravan which is now fully formed and heading to the United States. We stopped the last two - many are still in Mexico but can’t get thro

Q2: Write a function `compare(tokenization_one, tokenization_two)` that compares two tokenizations of the same text and finds the 20 most frequent tokens that don't appear in the other.

In [15]:
def compare(one_tokens, two_tokens):
    
    one_counts=Counter()
    two_counts=Counter()

    for sentence in one_tokens:
        for token in sentence:
            one_counts[token]+=1
        
    for sentence in two_tokens:
        for token in sentence:
            two_counts[token]+=1
        
    missing_from_one=Counter()
    missing_from_two=Counter()
    
    for word_type in one_counts:
        if word_type not in two_counts:
            missing_from_two[word_type]=one_counts[word_type]
        
    for word_type in two_counts:
        if word_type not in one_counts:
            missing_from_one[word_type]=two_counts[word_type]

    # Note: len() of each argument is the number of tweets, not the total number of tokens
    print ("Token counts -- one: %s, two: %s" % (len(one_tokens), len(two_tokens)))
    print ("\nNot in one:")
    print ('\n'.join("%s\t%d" % (k,v) for (k,v) in missing_from_one.most_common(20)))
    print ("\nNot in two:")
    print ('\n'.join("%s\t%d" % (k,v) for (k,v) in missing_from_two.most_common(20)))
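One possible variation, sketched here on toy data, returns the differences as Counter objects instead of printing them, which makes the comparison easier to reuse. The helper names count_types and types_missing_from and the toy token lists are hypothetical:

```python
from collections import Counter

def count_types(tokenized_docs):
    """Count every token across a list of token lists."""
    return Counter(token for doc in tokenized_docs for token in doc)

def types_missing_from(other_tokens, these_tokens):
    """Counts of types in these_tokens that never occur in other_tokens."""
    these_counts = count_types(these_tokens)
    other_counts = count_types(other_tokens)
    return Counter({t: c for t, c in these_counts.items() if t not in other_counts})

# Toy tokenizations of the same two hypothetical texts
toy_one = [["don't", "stop"], ["don't", "go", "!"]]
toy_two = [["do", "n't", "stop"], ["do", "n't", "go", "!"]]

not_in_two = types_missing_from(toy_two, toy_one)
not_in_one = types_missing_from(toy_one, toy_two)
print(not_in_two.most_common(20))
print(not_in_one.most_common(20))

# Total tokens is the sum of the counts; len() of the outer list counts texts
total_one = sum(count_types(toy_one).values())
total_two = sum(count_types(toy_two).values())
print(total_one, total_two)
```

Here the only type unique to the first tokenization is "don't", while the second contributes "do" and "n't", mirroring the contraction splits a Treebank-style tokenizer produces.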


In [16]:
compare(nltk_casual_tokens, nltk_tokens)

Token counts -- one: 36583, two: 36583

Not in one:
``	13299
''	11514
's	3541
amp	3364
n't	2503
--	2077
Trump2016	846
U.S.	665
....	542
'm	538
're	528
CelebApprentice	416
Mr.	333
MittRomney	312
've	307
'll	307
IvankaTrump	236
w/	209
'd	175
.....	171

Not in two:
"	24807
@realDonaldTrump	8661
#Trump2016	840
@BarackObama	732
don't	626
#MakeAmericaGreatAgain	560
@FoxNews	547
I'm	524
@foxandfriends	504
can't	423
@ApprenticeNBC	393
@MittRomney	314
It's	304
it's	303
🇺	300
🇸	300
#CelebApprentice	289
@CNN	285
you're	276
doesn't	266


Q3: Use one of the NLTK tokenizers; write code to determine how many sentences are in this dataset, and what the average number of words per sentence is.

In [17]:
count=0.
num_sents=0
for tweet in tweets:
    for sent in nltk.sent_tokenize(tweet):
        count+=len(nltk.word_tokenize(sent))
        num_sents+=1
print("Sents: %s, Tokens/sent: %.1f" % (num_sents, (count/num_sents)))


Sents: 70491, Tokens/sent: 12.6


Q4 (check-plus): modify the extensible tokenizer above to keep urls together (e.g., www.google.com or http://www.google.com)

In [None]:
# Keep usernames together (any token starting with @, followed by A-Z, a-z, 0-9, or _)
regexes=(r"(?:@[\w_]+)",

# Keep hashtags together (# followed by at least two word characters; ', _ and - may appear in the middle)
r"(?:\#+[\w_]+[\w\'_\-]*[\w_]+)",

# Keep urls together
r"(?:https?:\S+)",
r"(?:www\.\S+)",
  
# Keep words with apostrophes, hyphens and underscores together
r"(?:[a-z][a-z’'\-_]+[a-z])",

# Keep all other sequences of A-Z, a-z, 0-9, _ together
r"(?:[\w_]+)",

# Everything else that's not whitespace
r"(?:\S)"
)

big_regex="|".join(regexes)

my_url_extensible_tokenizer = re.compile(big_regex, re.VERBOSE | re.I | re.UNICODE)

def my_extensible_tokenize_with_urls(text):
    return my_url_extensible_tokenizer.findall(text)

In [None]:
print ('\n'.join(my_extensible_tokenize_with_urls("The course website is http://people.ischool.berkeley.edu/~dbamman/info256.html")))
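The two URL patterns can also be tried on the examples named in Q4. The sketch below repeats the patterns so it runs on its own; the sentence and the names url_demo_tokenizer and url_tokens are illustrative assumptions:

```python
import re

# Same patterns as the Q4 cell, repeated so this sketch is self-contained
url_patterns = (
    r"(?:@[\w_]+)",
    r"(?:\#+[\w_]+[\w\'_\-]*[\w_]+)",
    r"(?:https?:\S+)",                 # URLs with an explicit scheme
    r"(?:www\.\S+)",                   # URLs starting with www.
    r"(?:[a-z][a-z’'\-_]+[a-z])",
    r"(?:[\w_]+)",
    r"(?:\S)",
)
url_demo_tokenizer = re.compile("|".join(url_patterns), re.VERBOSE | re.I | re.UNICODE)

# Hypothetical sentence built from the two URLs mentioned in Q4
url_tokens = url_demo_tokenizer.findall("Try www.google.com or http://www.google.com.")
print(url_tokens)
```

Both URLs stay together, but because \S+ consumes every non-whitespace character, the sentence-final period ends up inside the last URL token. Whether to strip trailing punctuation afterwards is a design choice this sketch leaves open.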