This notebook outlines several methods for tokenizing text into words (and sentences), including:

* whitespace
* nltk (Penn Treebank tokenizer)
* nltk (Twitter-aware)
* spaCy
* custom regular expressions

highlighting differences between them.
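
As a quick preview of why the choice of tokenizer matters, here is a minimal standard-library sketch on a made-up sentence. The sample text and the small regular expression are illustrative assumptions, not one of the tokenizers used below.

```python
import re

# Made-up sample sentence (not taken from the dataset)
sample = "Not easy! We can't stop, says @someone."

# Whitespace tokenization leaves punctuation glued to the words
whitespace = sample.split()

# A small punctuation-aware pattern: @mentions, words (optionally with
# internal apostrophes), then any other single non-space character
pattern = re.compile(r"@\w+|\w+(?:'\w+)*|\S")
regex_tokens = pattern.findall(sample)

print(whitespace)
print(regex_tokens)
```

The whitespace version yields 7 tokens such as `easy!` and `@someone.`, while the regex version yields 10 tokens with the punctuation split off.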

In [1]:
import nltk, re, json
import spacy
from collections import Counter

In [2]:
# If you haven't downloaded the NLTK sentence segmentation model before, do so here
nltk.download("punkt")

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/pierrejaumier/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [4]:
# Only tokenization is needed here, so disable the tagger, NER and parser.
# Each component name must be its own string in the list.
nlp = spacy.load('en', disable=['tagger', 'ner', 'parser'])

In [5]:
def read_tweets_from_json(filename):
    tweets=[]
    with open(filename, encoding="utf-8") as file:
        data=json.load(file)
        for tweet in data:
            tweets.append(tweet["text"])
    return tweets        

trump_tweets.json comes from the Trump Twitter collection here (downloaded 1/19/19)
http://www.trumptwitterarchive.com/archive

In [6]:
filename="../data/trump_tweets.json"

In [7]:
tweets=read_tweets_from_json(filename)

In [8]:
whitespace_tokens=[]
for tweet in tweets:
    whitespace_tokens.append(tweet.split())

In [9]:
nltk_tokens=[]
for tweet in tweets:
    nltk_tokens.append(nltk.word_tokenize(tweet, language="english"))

In [10]:
nltk_casual_tokens=[]
for tweet in tweets:
    nltk_casual_tokens.append(nltk.casual_tokenize(tweet))

In [11]:
spacy_tokens=[]
for tweet in tweets:
    spacy_tokens.append([token.text for token in nlp(tweet)])

In [12]:
# Shorter version of http://sentiment.christopherpotts.net/code-data/happyfuntokenizing.py

# The order here is important (match from first to last)

# Keep usernames together (any token starting with @, followed by A-Z, a-z, 0-9, or _)
regexes=(r"(?:@[\w_]+)",

# Keep hashtags together (any token starting with #, followed by A-Z, a-z, 0-9, _, or -)
r"(?:\#+[\w_]+[\w\'_\-]*[\w_]+)",

# Keep words with apostrophes, hyphens and underscores together
r"(?:[a-z][a-z’'\-_]+[a-z])",

# Keep all other sequences of A-Z, a-z, 0-9, _ together
r"(?:[\w_]+)",

# Everything else that's not whitespace
r"(?:\S)"
)

big_regex="|".join(regexes)

my_extensible_tokenizer = re.compile(big_regex, re.VERBOSE | re.I | re.UNICODE)

def my_extensible_tokenize(text):
    return my_extensible_tokenizer.findall(text)
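
To see why the order of the alternatives matters, consider this small sketch. It assumes a toy string and two reduced patterns (not the full tokenizer above): the regex engine tries alternatives left to right at each position, so a catch-all placed first prevents the more specific rules from ever matching.

```python
import re

text = "@user #tag"  # toy input

# Specific rules first, catch-all last (same ordering idea as above)
specific_first = re.compile(r"(?:@\w+)|(?:\#\w+)|(?:\S)")

# Catch-all first: \S wins at every position, one character at a time
generic_first = re.compile(r"(?:\S)|(?:@\w+)|(?:\#\w+)")

print(specific_first.findall(text))
print(generic_first.findall(text))
```

The first pattern keeps `@user` and `#tag` whole; the second splits them into single characters.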

In [13]:
extensible_tokens=[]
for tweet in tweets:
    extensible_tokens.append(my_extensible_tokenize(tweet))

Q1: Write a function to print out the first 5 tokenized tweets in each of the five tokenizers above. Examine those tweets; how would you characterize the differences?



In [22]:
tweets[0]

'Mexico is doing NOTHING to stop the Caravan which is now fully formed and heading to the United States. We stopped the last two - many are still in Mexico but can’t get through our Wall, but it takes a lot of Border Agents if there is no Wall. Not easy!'

In [23]:
whitespace_tokens[0]

['Mexico',
 'is',
 'doing',
 'NOTHING',
 'to',
 'stop',
 'the',
 'Caravan',
 'which',
 'is',
 'now',
 'fully',
 'formed',
 'and',
 'heading',
 'to',
 'the',
 'United',
 'States.',
 'We',
 'stopped',
 'the',
 'last',
 'two',
 '-',
 'many',
 'are',
 'still',
 'in',
 'Mexico',
 'but',
 'can’t',
 'get',
 'through',
 'our',
 'Wall,',
 'but',
 'it',
 'takes',
 'a',
 'lot',
 'of',
 'Border',
 'Agents',
 'if',
 'there',
 'is',
 'no',
 'Wall.',
 'Not',
 'easy!']

In [26]:
a = ['a', 'aa']
b = ['b', 'bb', 'bbb']
for i,x in enumerate(zip(a,b)):
    print(i,x)

0 ('a', 'b')
1 ('aa', 'bb')


In [31]:
' '.join(a)

'a aa'

String formatting in Python: https://realpython.com/python-f-strings/

In [32]:
name = "Eric"
age = 74
f"Hello, {name}. You are {age}."

'Hello, Eric. You are 74.'

In [36]:
# Print the first five tweets under each tokenizer.
# Indexing directly avoids loop variables named nltk or spacy,
# which would shadow the imported modules.
for idx in range(5):
    print(f"whitespace >\n {' '.join(whitespace_tokens[idx])}")
    print(f"nltk >\n {' '.join(nltk_tokens[idx])}")
    print(f"nltk_casual >\n {' '.join(nltk_casual_tokens[idx])}")
    print(f"spacy >\n {' '.join(spacy_tokens[idx])}")
    print(f"extensible >\n {' '.join(extensible_tokens[idx])}")
    print("\n")

whitespace >
 Mexico is doing NOTHING to stop the Caravan which is now fully formed and heading to the United States. We stopped the last two - many are still in Mexico but can’t get through our Wall, but it takes a lot of Border Agents if there is no Wall. Not easy!
nltk >
 Mexico is doing NOTHING to stop the Caravan which is now fully formed and heading to the United States . We stopped the last two - many are still in Mexico but can ’ t get through our Wall , but it takes a lot of Border Agents if there is no Wall . Not easy !
nltk_casual >
 Mexico is doing NOTHING to stop the Caravan which is now fully formed and heading to the United States . We stopped the last two - many are still in Mexico but can ’ t get through our Wall , but it takes a lot of Border Agents if there is no Wall . Not easy !
spacy >
 Mexico is doing NOTHING to stop the Caravan which is now fully formed and heading to the United States . We stopped the last two - many are still in Mexico but ca n’t get through o
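
Since Q1 asks for a function, the same comparison could be wrapped as sketched here. The name `format_tokenizations`, its parameters, and the toy data are hypothetical; with the notebook's variables one would pass a dict such as `{"whitespace": whitespace_tokens, "nltk": nltk_tokens, ...}`.

```python
def format_tokenizations(tokenizations, n=5):
    """Return a printable comparison of the first n texts.

    tokenizations maps a label (e.g. "whitespace") to a list of token
    lists, one token list per text. All names here are illustrative.
    """
    n_texts = min(len(token_lists) for token_lists in tokenizations.values())
    blocks = []
    for idx in range(min(n, n_texts)):
        for name, token_lists in tokenizations.items():
            blocks.append(f"{name} >\n {' '.join(token_lists[idx])}")
        blocks.append("")  # blank line between texts
    return "\n".join(blocks)


# Toy data standing in for the notebook's token lists
toy = {
    "whitespace": [["Not", "easy!"]],
    "regex": [["Not", "easy", "!"]],
}
print(format_tokenizations(toy))
```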

Q2: Write a function `compare(tokenization_one, tokenization_two)` that compares two tokenizations of the same text and finds the 20 most frequent tokens that don't appear in the other.

In [48]:
# Token frequencies in a list of lists
flat_list = [item for sublist in nltk_tokens for item in sublist]
Counter(flat_list).most_common(20)

[('@', 39951),
 ('.', 32387),
 (':', 23593),
 ('!', 21713),
 ('the', 21222),
 (',', 20257),
 ('to', 14812),
 ('``', 13299),
 ("''", 11514),
 ('and', 11218),
 ('a', 10622),
 ('is', 9904),
 ('of', 9768),
 ('in', 8773),
 ('realDonaldTrump', 8655),
 ('I', 8642),
 ('you', 7774),
 ('for', 7639),
 ('#', 7333),
 ('on', 6268)]

In [45]:
def compare(tokenization_one, tokenization_two):
    flat_list_one = [token for sublist in tokenization_one for token in sublist]
    flat_list_two = [token for sublist in tokenization_two for token in sublist]
    # Unique tokens in each of the lists
    set_one = set(flat_list_one)
    set_two = set(flat_list_two)

    # Tokens not seen in the other tokenization
    unseen_token = []
    for token in flat_list_one:
        if token not in set_two:
            unseen_token.append(token)
    for token in flat_list_two:
        if token not in set_one:
            unseen_token.append(token)
    return Counter(unseen_token).most_common(20) 

In [49]:
for key, value in compare(nltk_casual_tokens, nltk_tokens):
    print(f"{key}\t{value}")

"	24807
``	13299
''	11514
@realDonaldTrump	8661
's	3541
amp	3364
n't	2503
--	2077
Trump2016	846
#Trump2016	840
@BarackObama	732
U.S.	665
don't	626
#MakeAmericaGreatAgain	560
@FoxNews	547
....	542
'm	538
're	528
I'm	524
@foxandfriends	504
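
The output above mixes tokens from both tokenizers, so it is not obvious which side produced, say, `n't`. One possible variant keeps the two directions separate. This is a sketch: `compare_directional` and the toy inputs are hypothetical names and data, not part of the original exercise.

```python
from collections import Counter


def compare_directional(tokenization_one, tokenization_two, n=20):
    """Return (only_in_one, only_in_two), each a list of (token, count)."""
    flat_one = [tok for sent in tokenization_one for tok in sent]
    flat_two = [tok for sent in tokenization_two for tok in sent]
    vocab_one, vocab_two = set(flat_one), set(flat_two)

    # Count tokens that never occur in the other tokenization
    only_one = Counter(tok for tok in flat_one if tok not in vocab_two)
    only_two = Counter(tok for tok in flat_two if tok not in vocab_one)
    return only_one.most_common(n), only_two.most_common(n)


# Toy tokenizations of the same two texts
toy_one = [["I", "can't", "go"], ["can't", "stop"]]
toy_two = [["I", "ca", "n't", "go"], ["ca", "n't", "stop"]]

only_one, only_two = compare_directional(toy_one, toy_two)
print(only_one)
print(only_two)
```

On the toy data, `can't` appears only in the first tokenization, while `ca` and `n't` appear only in the second.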


Q3: Use one of the NLTK tokenizers; write code to determine how many sentences are in this dataset, and what the average number of words per sentence is.

Help: https://www.guru99.com/tokenize-words-sentences-nltk.html

In [60]:
tweets[0]

'Mexico is doing NOTHING to stop the Caravan which is now fully formed and heading to the United States. We stopped the last two - many are still in Mexico but can’t get through our Wall, but it takes a lot of Border Agents if there is no Wall. Not easy!'

In [57]:
from nltk.tokenize import sent_tokenize
text = "God is Great! I won a lottery."
print(sent_tokenize(text))

['God is Great!', 'I won a lottery.']


In [59]:
from nltk.tokenize import sent_tokenize
text = tweets[0]
print(len(sent_tokenize(text)))

3


In [63]:
from nltk.tokenize import sent_tokenize, word_tokenize

n_sentences = 0
word_per_sentence = []

for tweet in tweets:
    sentences = sent_tokenize(tweet)
    n_sentences += len(sentences)
    for sentence in sentences:
        word_per_sentence.append(len(word_tokenize(sentence)))

print(f"Total number of sentences: {n_sentences}")
print(f"Average words per sentence: {sum(word_per_sentence)/len(word_per_sentence):.2f}")

Total number of sentences: 70491
Average words per sentence: 12.55
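
Note that `word_tokenize` returns punctuation marks as separate tokens (see the `nltk` output in Q1), so the average above counts them as words. If punctuation should be excluded, one option is to count only tokens that contain at least one letter or digit. The helper `count_words` below is a hypothetical sketch, shown on a toy token list.

```python
def count_words(tokens):
    """Count tokens containing at least one letter or digit."""
    return sum(1 for token in tokens if any(ch.isalnum() for ch in token))


tokens = ["Not", "easy", "!"]  # toy example of a tokenized sentence
print(len(tokens), count_words(tokens))
```

In the loop above, this would replace `len(word_tokenize(sentence))` with `count_words(word_tokenize(sentence))`. Clitics such as `n't` would still count as words under this rule.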


Q4 (check-plus): modify the extensible tokenizer above to keep urls together (e.g., www.google.com or http://www.google.com)

In [79]:
# Keep usernames together (any token starting with @, followed by A-Z, a-z, 0-9, or _)
regexes=(r"(?:@[\w_]+)",

# Keep hashtags together (any token starting with #, followed by A-Z, a-z, 0-9, _, or -)
r"(?:\#+[\w_]+[\w\'_\-]*[\w_]+)",

# Keep urls together
r"(?:https?:\S+)",
r"(?:www\.\S+)",
         
# Keep words with apostrophes, hyphens and underscores together
r"(?:[a-z][a-z’'\-_]+[a-z])",

# Keep all other sequences of A-Z, a-z, 0-9, _ together
r"(?:[\w_]+)",

# Everything else that's not whitespace
r"(?:\S)"
)

big_regex="|".join(regexes)

my_url_extensible_tokenizer = re.compile(big_regex, re.VERBOSE | re.I | re.UNICODE)

def my_extensible_tokenize_with_urls(text):
    return my_url_extensible_tokenizer.findall(text)

In [80]:
my_extensible_tokenize_with_urls("The course website is http://people.ischool.berkeley.edu/~dbamman/info256.html")

['The',
 'course',
 'website',
 'is',
 'http://people.ischool.berkeley.edu/~dbamman/info256.html']

In [81]:
print('\n'.join(my_extensible_tokenize_with_urls("The course website is http://people.ischool.berkeley.edu/~dbamman/info256.html")))

The
course
website
is
http://people.ischool.berkeley.edu/~dbamman/info256.html
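
One caveat with matching URLs via `\S+` is that punctuation directly after a URL (a sentence-final period, a comma) becomes part of the token. A possible post-processing step, sketched here with a standalone pattern and a hypothetical helper `find_urls`, strips a small assumed set of trailing punctuation characters.

```python
import re

# Standalone URL pattern in the same spirit as the two rules above
url_pattern = re.compile(r"(?:https?://|www\.)\S+", re.I)

# Characters assumed unlikely to end a real URL (an assumption; e.g. some
# URLs legitimately end in ")" and would be clipped by this)
TRAILING = ".,;:!?)\"'"


def find_urls(text):
    """Return URL-like substrings with trailing punctuation removed."""
    return [match.rstrip(TRAILING) for match in url_pattern.findall(text)]


print(find_urls("See www.google.com, or http://www.google.com."))
```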
