In [1]:
import re

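Prefix regex patterns with `r` (raw strings) so that backslashes such as `\d` or `\s` reach the regex engine instead of being interpreted as Python escape sequences.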

In [2]:
my_string = "Let's write Regex! Won't that be fun? I sure think so. Can you find 4 sentences? Or perhaps, all 19 words?"
sentence_endings = r"[.?!]"

print(re.split(sentence_endings, my_string))

["Let's write Regex", " Won't that be fun", ' I sure think so', ' Can you find 4 sentences', ' Or perhaps, all 19 words', '']


In [3]:
capitalized_words = r"[A-Z]\w+"
print(re.findall(capitalized_words, my_string))

['Let', 'Regex', 'Won', 'Can', 'Or']


In [4]:
spaces = r'\s+'
print(re.split(spaces, my_string))

["Let's", 'write', 'Regex!', "Won't", 'that', 'be', 'fun?', 'I', 'sure', 'think', 'so.', 'Can', 'you', 'find', '4', 'sentences?', 'Or', 'perhaps,', 'all', '19', 'words?']


In [5]:
digits = r'\d+'
print(re.findall(digits, my_string))

['4', '19']


# Tokenization

Turning a string or document into tokens (smaller chunks), for example:

* breaking out words or sentences,
* separating punctuation,
* separating hashtags in a tweet.

## nltk library
NLTK is the Natural Language Toolkit library.

In [6]:
#!pip install nltk
import nltk
#nltk.download('punkt')

In [10]:
from nltk.tokenize import word_tokenize, sent_tokenize
word_tokenize('Hi there!')

['Hi', 'there', '!']

Why tokenize?
* match common words
* remove unwanted tokens
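Both uses can be sketched with the standard library alone. The sample text and the tiny stop-word set below are hypothetical, chosen only for illustration; a real pipeline would more likely use `word_tokenize` and a proper stop-word list.

```python
import re
from collections import Counter

text = "The cat sat. The cat ran! Did the dog see the cat?"

# Hypothetical stop-word list, for illustration only
stop_words = {"the", "did", "a"}

# Simple regex tokenization: lowercase, then grab runs of word characters
tokens = re.findall(r"\w+", text.lower())

# Remove unwanted tokens, then count what is left to find common words
kept = [t for t in tokens if t not in stop_words]
counts = Counter(kept)
print(counts.most_common(1))
```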

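`sent_tokenize`: splits a document into sentences  
`regexp_tokenize`: splits a string according to a regex pattern  
`TweetTokenizer`: tokenizer for tweets that keeps hashtags, mentions, and emoticons intact  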

**re.search vs re.match**

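`re.match` only matches at the beginning of the string and returns `None` if the pattern does not match there.  
`re.search` scans through the entire string and returns the first match it finds.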
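A short side-by-side sketch of the difference, using the same `'abcde'` string as the cell that follows:

```python
import re

text = "abcde"

at_start = re.match("cd", text)    # "cd" is not at position 0, so no match
anywhere = re.search("cd", text)   # search scans forward and finds it
prefix = re.match("ab", text)      # match succeeds when the pattern is at the start

print(at_start)
print(anywhere.span())
print(prefix.span())
```

Because `re.match` can return `None`, it is worth checking the result before calling `.start()`, `.end()`, or `.span()` on it.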

In [8]:
re.search('cd', 'abcde')

<re.Match object; span=(2, 4), match='cd'>

In [9]:
scene_one = "SCENE 1: [wind] [clop clop clop] \nKING ARTHUR: Whoa there!  [clop clop clop] \nSOLDIER #1: Halt!  Who goes there?\nARTHUR: It is I, Arthur, son of Uther Pendragon, from the castle of Camelot.  King of the Britons, defeator of the Saxons, sovereign of all England!\nSOLDIER #1: Pull the other one!\nARTHUR: I am, ...  and this is my trusty servant Patsy.  We have ridden the length and breadth of the land in search of knights who will join me in my court at Camelot.  I must speak with your lord and master.\nSOLDIER #1: What?  Ridden on a horse?\nARTHUR: Yes!\nSOLDIER #1: You're using coconuts!\nARTHUR: What?\nSOLDIER #1: You've got two empty halves of coconut and you're bangin' 'em together.\nARTHUR: So?  We have ridden since the snows of winter covered this land, through the kingdom of Mercea, through--\nSOLDIER #1: Where'd you get the coconuts?\nARTHUR: We found them.\nSOLDIER #1: Found them?  In Mercea?  The coconut's tropical!\nARTHUR: What do you mean?\nSOLDIER #1: Well, this is a temperate zone.\nARTHUR: The swallow may fly south with the sun or the house martin or the plover may seek warmer climes in winter, yet these are not strangers to our land?\nSOLDIER #1: Are you suggesting coconuts migrate?\nARTHUR: Not at all.  They could be carried.\nSOLDIER #1: What?  A swallow carrying a coconut?\nARTHUR: It could grip it by the husk!\nSOLDIER #1: It's not a question of where he grips it!  It's a simple question of weight ratios!  A five ounce bird could not carry a one pound coconut.\nARTHUR: Well, it doesn't matter.  Will you go and tell your master that Arthur from the Court of Camelot is here.\nSOLDIER #1: Listen.  In order to maintain air-speed velocity, a swallow needs to beat its wings forty-three times every second, right?\nARTHUR: Please!\nSOLDIER #1: Am I right?\nARTHUR: I'm not interested!\nSOLDIER #2: It could be carried by an African swallow!\nSOLDIER #1: Oh, yeah, an African swallow maybe, but not a European swallow.  That's my point.\nSOLDIER #2: Oh, yeah, I agree with that.\nARTHUR: Will you ask your master if he wants to join my court at Camelot?!\nSOLDIER #1: But then of course a-- African swallows are non-migratory.\nSOLDIER #2: Oh, yeah...\nSOLDIER #1: So they couldn't bring a coconut back anyway...  [clop clop clop] \nSOLDIER #2: Wait a minute!  Supposing two swallows carried it together?\nSOLDIER #1: No, they'd have to have it on a line.\nSOLDIER #2: Well, simple!  They'd just use a strand of creeper!\nSOLDIER #1: What, held under the dorsal guiding feathers?\nSOLDIER #2: Well, why not?\n"

In [12]:
sentences = sent_tokenize(scene_one)
tokenize_sent = word_tokenize(sentences[3])
unique_tokens = set(word_tokenize(scene_one))

In [14]:
match = re.search('coconuts', scene_one)
print(match.start(), match.end())

580 588


In [16]:
pattern1 = r"\[.*\]"
print(re.search(pattern1, scene_one))

<re.Match object; span=(9, 32), match='[wind] [clop clop clop]'>
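The match above covers both bracketed groups because `.*` is greedy: it extends as far as it can while still allowing a closing `]`. A hedged sketch of the non-greedy alternative `.*?`, applied to just the first line of the scene (the shortened `line` variable is an assumption made for brevity):

```python
import re

line = "SCENE 1: [wind] [clop clop clop] "

greedy = re.search(r"\[.*\]", line)     # runs to the last "]" on the line
lazy = re.search(r"\[.*?\]", line)      # stops at the first "]"
each = re.findall(r"\[.*?\]", line)     # every bracketed chunk separately

print(greedy.group())
print(lazy.group())
print(each)
```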


In [18]:
pattern2 = r"[\D]+:"
print(re.match(pattern2, sentences[3]))

<re.Match object; span=(0, 7), match='ARTHUR:'>


## Advanced tokenization

Regex building blocks:  
OR is represented using `|`  
Groups are designated with `()`  
Explicit character ranges are defined using `[]`

In [20]:
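# example: match runs of digits or runs of word characters (raw string, so \d and \w are not treated as string escapes)
match_digits_and_words = r'(\d+|\w+)'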
re.findall(match_digits_and_words, 'He has 11 cats.')

['He', 'has', '11', 'cats']

In [21]:
match_upper_and_lower = r'[A-Za-z]+'
match_numbers_0_9 = r'[0-9]'
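# escape - and . so they are matched as literal characters inside the class
match_upper_lower_dash_pd = r'[A-Za-z\-\.]+'
# parentheses form a group, not a range: this matches the literal text 'a-z'
match_group_az = r'(a-z)'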
match_group_spaces_or_comma = r'(\s+|,)'
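The patterns above are defined but not run, so here is a small sketch that exercises them. The sample strings are made up for illustration. The last two lines show a side effect worth knowing: when the pattern passed to `re.split` contains a capturing group, the separators are kept in the result, while a non-capturing group `(?:...)` drops them.

```python
import re

match_upper_and_lower = r'[A-Za-z]+'
match_numbers_0_9 = r'[0-9]'
match_upper_lower_dash_pd = r'[A-Za-z\-\.]+'
match_group_az = r'(a-z)'
match_group_spaces_or_comma = r'(\s+|,)'

sample = 'He has 11 cats.'  # illustrative input

letters = re.findall(match_upper_and_lower, sample)
single_digits = re.findall(match_numbers_0_9, sample)  # one digit per match
domain_like = re.findall(match_upper_lower_dash_pd, 'My-Website.com is up')
literal_az = re.findall(match_group_az, 'a-z and abc')  # only the literal 'a-z'

# Capturing group: separators are returned alongside the pieces
with_seps = re.split(match_group_spaces_or_comma, 'a,b c')
# Non-capturing group: separators are discarded
without_seps = re.split(r'(?:\s+|,)', 'a,b c')

print(letters, single_digits, domain_like, literal_az)
print(with_seps, without_seps)
```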

In [25]:
from nltk.tokenize import regexp_tokenize
my_string = "SOLDIER #1: Found them? In Mercea? The coconut's tropical!"
regexp_tokenize(my_string, r'\w+|#\d|\?|!')

['SOLDIER',
 '#1',
 'Found',
 'them',
 '?',
 'In',
 'Mercea',
 '?',
 'The',
 'coconut',
 's',
 'tropical',
 '!']

In [26]:
from nltk.tokenize import regexp_tokenize, TweetTokenizer
tweets = ['This is the best #nlp exercise ive found online! #python',
 '#NLP is super fun! <3 #learning',
 'Thanks @datacamp :) #nlp #python']
hashtag_pattern = r'#\w+'
hashtags = regexp_tokenize(tweets[0], hashtag_pattern)
print(hashtags)

['#nlp', '#python']


In [36]:
at_pattern = r'([@#]\w+)'
mentions_hashtags = regexp_tokenize(tweets[-1], at_pattern)
print(mentions_hashtags)

['@datacamp', '#nlp', '#python']


In [33]:
tweets[-1]

'Thanks @datacamp :) #nlp #python'

In [39]:
tknzr = TweetTokenizer()
all_tokens = [tknzr.tokenize(t) for t in tweets]
print(all_tokens)

[['This', 'is', 'the', 'best', '#nlp', 'exercise', 'ive', 'found', 'online', '!', '#python'], ['#NLP', 'is', 'super', 'fun', '!', '<3', '#learning'], ['Thanks', '@datacamp', ':)', '#nlp', '#python']]


## Non-ascii tokenization

Examples: German text and emojis.

In [40]:
german_text = 'Wann gehen wir Pizza essen? 🍕 Und fährst du mit Über? 🚕'

In [41]:
all_words = word_tokenize(german_text)
print(all_words)
capital_words = r"[A-ZÜ]\w+"
print(regexp_tokenize(german_text, capital_words))
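# ranges go directly inside the character class; quotes and | placed there would be matched as literal characters
emoji = "[\U0001F300-\U0001F5FF\U0001F600-\U0001F64F\U0001F680-\U0001F6FF\u2600-\u26FF\u2700-\u27BF]"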
print(regexp_tokenize(german_text, emoji))

['Wann', 'gehen', 'wir', 'Pizza', 'essen', '?', '🍕', 'Und', 'fährst', 'du', 'mit', 'Über', '?', '🚕']
['Wann', 'Pizza', 'Und', 'Über']
['🍕', '🚕']
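A standard-library sketch of two points behind these results, assuming Python 3 `str` patterns. First, `\w` is Unicode-aware by default, which is why `Über` and `fährst` stay whole; the `re.ASCII` flag turns that off. Second, inside a character class `|` is a literal character rather than OR, so ranges are simply listed one after another. The emoji class here is a shortened version covering only the two ranges needed for this text.

```python
import re

german_text = 'Wann gehen wir Pizza essen? 🍕 Und fährst du mit Über? 🚕'

# Default (Unicode) matching keeps umlauts inside words
unicode_words = re.findall(r"\w+", german_text)
# With re.ASCII, non-ASCII letters are no longer word characters
ascii_words = re.findall(r"\w+", german_text, flags=re.ASCII)

# Two code point ranges listed back to back inside one character class
emoji_hits = re.findall("[\U0001F300-\U0001F5FF\U0001F680-\U0001F6FF]", german_text)

# Inside [], the pipe is matched literally
pipe_hits = re.findall(r"[a|b]", "a|b")

print(unicode_words)
print(ascii_words)
print(emoji_hits, pipe_hits)
```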
