This notebook outlines several methods for tokenizing text into words (and sentences) and highlights the differences between them:

* whitespace
* nltk (Penn Treebank tokenizer)
* nltk (Twitter-aware)
* spaCy
* custom regular expressions

In [1]:
import nltk, re, json
nltk.download('punkt')
import spacy
from collections import Counter

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Isak\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [2]:
# Only spaCy's tokenizer is used below, so skip loading the tagger, NER and parser.
# Each component name must be a separate element of the disable list.
nlp = spacy.load('en', disable=['tagger', 'ner', 'parser'])

In [3]:
def read_tweets_from_json(filename):
    """Return the "text" field of every tweet in a JSON file holding a list of tweets."""
    with open(filename, encoding="utf-8") as file:
        data = json.load(file)
    return [tweet["text"] for tweet in data]

trump_tweets.json comes from the Trump Twitter Archive (downloaded 1/19/19):
http://www.trumptwitterarchive.com/archive

In [4]:
filename="../data/trump_tweets.json"

In [5]:
tweets=read_tweets_from_json(filename)

In [6]:
whitespace_tokens=[]
for tweet in tweets:
    whitespace_tokens.append(tweet.split())

In [7]:
nltk_tokens=[]
for tweet in tweets:
    nltk_tokens.append(nltk.word_tokenize(tweet, language="english"))

In [8]:
nltk_casual_tokens=[]
for tweet in tweets:
    nltk_casual_tokens.append(nltk.casual_tokenize(tweet))

In [9]:
spacy_tokens=[]
for tweet in tweets:
    spacy_tokens.append([token.text for token in nlp(tweet)])

In [10]:
# Shorter version of http://sentiment.christopherpotts.net/code-data/happyfuntokenizing.py

# The order here is important (match from first to last)

# Keep usernames together (any token starting with @, followed by letters, digits, or _)
regexes=(r"(?:@[\w_]+)",

# Keep hashtags together (any token starting with #, followed by A-Z, a-z, 0-9, _, or -)
r"(?:\#+[\w_]+[\w\'_\-]*[\w_]+)",

# Keep words with apostrophes, hyphens and underscores together
r"(?:[a-z][a-z’'\-_]+[a-z])",

# Keep all other sequences of A-Z, a-z, 0-9, _ together
r"(?:[\w_]+)",

# Everything else that's not whitespace
r"(?:\S)"
)

big_regex="|".join(regexes)

my_extensible_tokenizer = re.compile(big_regex, re.VERBOSE | re.I | re.UNICODE)

def my_extensible_tokenize(text):
    return my_extensible_tokenizer.findall(text)

In [11]:
extensible_tokens=[]
for tweet in tweets:
    extensible_tokens.append(my_extensible_tokenize(tweet))
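
To make the "order is important" comment concrete, here is a small sketch that rebuilds the same alternation and runs it on a made-up string. The sample text and the names `tokenizer`, `reordered`, and `reordered_tokenizer` are assumptions for illustration and are not part of the original notebook. Moving the generic word pattern to the front means it wins before the apostrophe-aware pattern gets a chance.

```python
import re

# Same alternatives as in the cell above, tried from first to last.
regexes = (
    r"(?:@[\w_]+)",                    # usernames
    r"(?:\#+[\w_]+[\w\'_\-]*[\w_]+)",  # hashtags
    r"(?:[a-z][a-z’'\-_]+[a-z])",      # words with internal ' - _
    r"(?:[\w_]+)",                     # other runs of word characters
    r"(?:\S)",                         # any other non-whitespace character
)
tokenizer = re.compile("|".join(regexes), re.VERBOSE | re.I | re.UNICODE)

# Hypothetical reordering: the generic word pattern is tried first.
reordered = (regexes[3],) + regexes[:3] + regexes[4:]
reordered_tokenizer = re.compile("|".join(reordered), re.VERBOSE | re.I | re.UNICODE)

# Made-up sample text, not taken from the dataset.
sample = "@user can't wait for #MAGA2020!"

tokens = tokenizer.findall(sample)
reordered_tokens = reordered_tokenizer.findall(sample)

print(tokens)            # "can't" stays in one piece
print(reordered_tokens)  # "can't" is split at the apostrophe
```

With the original order the contraction survives as a single token; with the generic pattern first it is split into `can`, `'`, and `t`, while the username and hashtag are unaffected because the generic pattern cannot start at `@` or `#`.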

Q1: Write a function to print out the first 5 tokenized tweets in each of the five tokenizers above. Examine those tweets; how would you characterize the differences?



In [12]:
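def print_tokenizations(tokenizations, n=5):
    """Print the first n tweets as tokenized by each tokenizer."""
    for tweet in range(n):
        print("\nTweet number", tweet+1)
        for tokenization in tokenizations:
            for word in tokenization[tweet]:
                print(word, ' ', end='')
            print('\n')

tokenizations = [whitespace_tokens, nltk_tokens, nltk_casual_tokens, spacy_tokens, extensible_tokens]
print_tokenizations(tokenizations)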


Tweet number 1
Mexico  is  doing  NOTHING  to  stop  the  Caravan  which  is  now  fully  formed  and  heading  to  the  United  States.  We  stopped  the  last  two  -  many  are  still  in  Mexico  but  can’t  get  through  our  Wall,  but  it  takes  a  lot  of  Border  Agents  if  there  is  no  Wall.  Not  easy!  

Mexico  is  doing  NOTHING  to  stop  the  Caravan  which  is  now  fully  formed  and  heading  to  the  United  States  .  We  stopped  the  last  two  -  many  are  still  in  Mexico  but  can  ’  t  get  through  our  Wall  ,  but  it  takes  a  lot  of  Border  Agents  if  there  is  no  Wall  .  Not  easy  !  

Mexico  is  doing  NOTHING  to  stop  the  Caravan  which  is  now  fully  formed  and  heading  to  the  United  States  .  We  stopped  the  last  two  -  many  are  still  in  Mexico  but  can  ’  t  get  through  our  Wall  ,  but  it  takes  a  lot  of  Border  Agents  if  there  is  no  Wall  .  Not  easy  !  

Mexico  is  doing  NOTHING  to  stop  t

Q2: Write a function `compare(tokenization_one, tokenization_two)` that compares two tokenizations of the same text and finds the 20 most frequent tokens that don't appear in the other.



In [13]:
from collections import Counter

def compare(tokenization_one, tokenization_two):
    # Collect every distinct token that occurs anywhere in the second tokenization.
    # A set makes each membership test O(1) instead of a scan over a flat list.
    vocabulary_two = set()
    for tweet in tokenization_two:
        vocabulary_two.update(tweet)

    # Keep the tokens of the first tokenization that never occur in the second
    unique = []
    for tweet in tokenization_one:
        for word in tweet:
            if word not in vocabulary_two:
                unique.append(word)

    # Found documentation on collections.Counter at https://docs.python.org/2/library/collections.html#collections.Counter
    occurrence = Counter(unique)
    print(occurrence.most_common(20))

In [14]:
compare(nltk_casual_tokens, nltk_tokens)

[('"', 24807), ('@realDonaldTrump', 8661), ('#Trump2016', 840), ('@BarackObama', 732), ("don't", 626), ('#MakeAmericaGreatAgain', 560), ('@FoxNews', 547), ("I'm", 524), ('@foxandfriends', 504), ("can't", 423), ('@ApprenticeNBC', 393), ('@MittRomney', 314), ("It's", 304), ("it's", 303), ('🇺', 300), ('🇸', 300), ('#CelebApprentice', 289), ('@CNN', 285), ("you're", 276), ("doesn't", 266)]
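
Most of these entries follow from how the two NLTK tokenizers treat the same characters: `word_tokenize` rewrites double quotes as `` and '', splits contractions such as don't into do and n't, and separates @ and # from the word that follows, whereas `casual_tokenize` keeps all of these intact. The sketch below reproduces the effect on hand-written toy data. The token lists are invented to mimic the two styles rather than produced by NLTK, and `unique_token_counts` is a hypothetical variant of `compare` that returns its result instead of printing it.

```python
from collections import Counter

def unique_token_counts(tokenization_one, tokenization_two, n=20):
    """Return the n most common tokens of tokenization_one that never occur in tokenization_two."""
    vocabulary_two = {word for tweet in tokenization_two for word in tweet}
    counts = Counter(
        word
        for tweet in tokenization_one
        for word in tweet
        if word not in vocabulary_two
    )
    return counts.most_common(n)

# Hand-written toy tokenizations of the same two texts (hypothetical data).
casual_style = [['"', "don't", '"', '@user'], ["don't", '#tag', '!']]
treebank_style = [['``', 'do', "n't", "''", '@', 'user'], ['do', "n't", '#', 'tag', '!']]

result = unique_token_counts(casual_style, treebank_style)
print(result)
```

Only the tokens that exist in one style but not the other are reported; the shared `!` never shows up.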


Q3: Use one of the NLTK tokenizers; write code to determine how many sentences are in this dataset, and what the average number of words per sentence is.



In [15]:
from statistics import mean

sentence_lengths = []
sentence_number = 0
end_signs = ['.','!','?']
for tweet in nltk_tokens:
    # Number of tokens seen since the last sentence-ending token
    counter = 0
    for word in tweet:
        if word not in end_signs:
            counter += 1
        else:
            # A terminator closes the current sentence (tokens after the
            # last terminator in a tweet are not counted as a sentence)
            sentence_lengths.append(counter)
            sentence_number += 1
            counter = 0
avg = round(mean(sentence_lengths))
print("Number of sentences:", sentence_number,
      "\nAverage sentence length is", avg)

Number of sentences: 57646 
Average sentence length is 11
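
The loop above closes a sentence at every terminator token, so if the tokenizer emits a run such as "!!!" as separate tokens, the extra terminators add zero-length sentences, and words after a tweet's last terminator are never counted. A possible variant that handles both cases is sketched below. The function name `count_sentence_lengths` and the sample token lists are hypothetical, and the figures printed above were not produced by this code. Like the original, it still counts non-terminal punctuation tokens such as commas as words.

```python
from statistics import mean

END_SIGNS = {'.', '!', '?'}

def count_sentence_lengths(tokenized_tweets):
    """Return the word count of every sentence in a list of token lists."""
    lengths = []
    for tweet in tokenized_tweets:
        count = 0
        for token in tweet:
            if token in END_SIGNS:
                # Close a sentence only if it contains at least one word,
                # so '!' '!' '!' ends one sentence rather than three.
                if count:
                    lengths.append(count)
                count = 0
            else:
                count += 1
        # Words after the last terminator form a final, unterminated sentence.
        if count:
            lengths.append(count)
    return lengths

# Made-up token lists for illustration.
sample = [['Not', 'easy', '!', '!', '!'], ['Big', 'day', '.', 'Stay', 'tuned']]
lengths = count_sentence_lengths(sample)
print("Number of sentences:", len(lengths))
print("Average sentence length is", mean(lengths))
```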


Q4 (check-plus): modify the extensible tokenizer above to keep urls together (e.g., www.google.com or http://www.google.com)

In [16]:
# Keep usernames together (any token starting with @, followed by letters, digits, or _)
regexes=(r"(?:@[\w_]+)",

# Keep hashtags together (any token starting with #, followed by A-Z, a-z, 0-9, _, or -)
r"(?:\#+[\w_]+[\w\'_\-]*[\w_]+)",

# Keep urls together (this pattern only matches URLs that start with
# http:// or https://, so a bare www.google.com is still split)
# Documentation found at https://www.w3schools.com/python/python_regex.asp
r"(?:http[s]?://(?:[\w_]|[$-_@.&+~]|[!*\(\),])+)",
         
# Keep words with apostrophes, hyphens and underscores together
r"(?:[a-z][a-z’'\-_]+[a-z])",

# Keep all other sequences of A-Z, a-z, 0-9, _ together
r"(?:[\w_]+)",

# Everything else that's not whitespace
r"(?:\S)"
)

big_regex="|".join(regexes)

my_url_extensible_tokenizer = re.compile(big_regex, re.VERBOSE | re.I | re.UNICODE)

def my_extensible_tokenize_with_urls(text):
    return my_url_extensible_tokenizer.findall(text)

In [17]:
print('\n'.join(my_extensible_tokenize_with_urls("The course website is http://people.ischool.berkeley.edu/~dbamman/info256.html")))

The
course
website
is
http://people.ischool.berkeley.edu/~dbamman/info256.html
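
The question also mentions URLs without a scheme, such as www.google.com, which the pattern above does not cover because it requires http:// or https://. One possible extension, shown here as a sketch rather than the intended solution, accepts a leading `www.` as an alternative prefix. The name `url_tokenizer` and the sample sentence are assumptions made for this example.

```python
import re

regexes = (
    r"(?:@[\w_]+)",                    # usernames
    r"(?:\#+[\w_]+[\w\'_\-]*[\w_]+)",  # hashtags
    # URLs: either an explicit http(s):// scheme or a leading "www."
    r"(?:(?:https?://|www\.)(?:[\w_]|[$-_@.&+~]|[!*\(\),])+)",
    r"(?:[a-z][a-z’'\-_]+[a-z])",      # words with internal ' - _
    r"(?:[\w_]+)",                     # other runs of word characters
    r"(?:\S)",                         # any other non-whitespace character
)
url_tokenizer = re.compile("|".join(regexes), re.VERBOSE | re.I | re.UNICODE)

# Made-up sample sentence for illustration.
url_tokens = url_tokenizer.findall("Try www.google.com or http://www.google.com today")
print('\n'.join(url_tokens))
```

One limitation carries over from the original pattern: the character classes include `.` and `,`, so punctuation that directly follows a URL at the end of a sentence is absorbed into the URL token.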
