## Which pattern?

In [None]:
import re

my_string = "Let's write RegEx!"

<p>Which of the following regex patterns produces the output shown below?</p>
<pre><code class="{python} language-{python}">&gt;&gt;&gt; my_string = "Let's write RegEx!"
&gt;&gt;&gt; re.findall(PATTERN, my_string)
['Let', 's', 'write', 'RegEx']
</code></pre>
<p>In the IPython Shell, try replacing <code>PATTERN</code> with one of the below options and observe the resulting output. The <code>re</code> module has been pre-imported for you and <code>my_string</code> is available in your namespace.</p>

<p>Which pattern will match upper and lowercase characters? Remember, the <code>+</code> quantifier matches one or more repetitions and is greedy, so it consumes as many characters as it can!</p>

## Practicing regular expressions: re.split() and re.findall()

In [None]:
my_string = "Let's write RegEx!  Won't that be fun?  I sure think so.  Can you find 4 sentences?  Or perhaps, all 19 words?"
import re


In [None]:
# Write a pattern to match sentence endings: sentence_endings
sentence_endings = r"[.?!]"

# Split my_string on sentence endings and print the result
print(re.split(sentence_endings, my_string))

# Find all capitalized words in my_string and print the result
capitalized_words = r"[A-Z]\w+"
print(re.findall(capitalized_words, my_string))

# Split my_string on spaces and print the result
spaces = r"\s+"
print(re.split(spaces, my_string))

# Find all digits in my_string and print the result
digits = r"\d+"
print(re.findall(digits, my_string))


<p>Now you'll get a chance to write some regular expressions to match digits, strings and non-alphanumeric characters. Take a look at <code>my_string</code> first by printing it in the IPython Shell, to determine how you might best match the different steps.</p>
<p>Note: Prefix your regex patterns with <code>r</code> so that they are interpreted the way you intend. Otherwise, you may run into problems with escape sequences in strings. For example, <code>"\n"</code> in Python denotes a new line, but with the <code>r</code> prefix, <code>r"\n"</code> is a raw string made up of the character <code>"\"</code> followed by the character <code>"n"</code>, not a new line.</p>
<p>The regular expression module <code>re</code> has already been imported for you.</p>
<p><em>Remember from the video that the syntax for the regex library is to always pass the <strong>pattern first</strong>, and then the <strong>string second</strong>.</em></p>
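The two notes above (raw strings, and pattern before string) can be checked with a short sketch. The variable names `plain`, `raw`, and `text` are illustrative assumptions, not part of the exercise.

```python
import re

# "\n" is a single character (a newline);
# r"\n" is two characters (a backslash followed by "n")
plain = "\n"
raw = r"\n"
print(len(plain), len(raw))

# Pattern first, string second
text = "one two\nthree"
tokens = re.split(r"\s+", text)
print(tokens)
```

The regex engine itself understands the two-character sequence `\n`, `\s`, `\d`, and so on, which is why passing the raw string is the safer habit.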

<ul>
<li>Split <code>my_string</code> on each sentence ending. To do this:<ul>
<li>Write a pattern called <code>sentence_endings</code> to match sentence endings (<code>.?!</code>).</li>
<li>Use <code>re.split()</code> to split <code>my_string</code> on the pattern and print the result.</li></ul></li>
<li>Find and print all capitalized words in <code>my_string</code> by writing a pattern called <code>capitalized_words</code> and using <code>re.findall()</code>. <ul>
<li>Remember the <code>[a-z]</code> pattern shown in the video to match lowercase groups? Modify that pattern appropriately in order to match uppercase groups.</li></ul></li>
<li>Write a pattern called <code>spaces</code> to match one or more whitespace characters (<code>r"\s+"</code>) and then use <code>re.split()</code> to split <code>my_string</code> on this pattern, keeping all punctuation intact. Print the result.</li>
<li>Find all digits in <code>my_string</code> by writing a pattern called <code>digits</code> (<code>r"\d+"</code>) and using <code>re.findall()</code>. Print the result.</li>
</ul>

<ul>
<li>Remember, you can use <code>"\w"</code> to match alphanumeric characters, <code>"\d"</code> to match digits, <code>"\s"</code> to match whitespace, and <code>"+"</code> to match one or more repetitions of the preceding item, greedily. </li>
<li>To match any one character from a set, use square brackets <code>[]</code> (a character class) as part of the pattern or as the entire pattern.</li>
</ul>
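As a small illustration of the hints above, the sketch below contrasts a pattern with and without `+`, and shows a character class. The `sample` string is an excerpt of `my_string`; the other variable names are assumptions made for the example.

```python
import re

sample = "Can you find 4 sentences?  Or perhaps, all 19 words?"

# Without "+", each digit is its own match
single_digits = re.findall(r"\d", sample)

# With "+", a run of adjacent digits is one match
digit_runs = re.findall(r"\d+", sample)

# A character class matches any single character listed inside the brackets
upper_or_digit = re.findall(r"[A-Z0-9]", sample)

print(single_digits)   # ['4', '1', '9']
print(digit_runs)      # ['4', '19']
print(upper_or_digit)  # ['C', '4', 'O', '1', '9']
```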

## Word tokenization with NLTK

In [None]:
from urllib.request import urlopen, urlretrieve
import os

os.makedirs('tokenizers/punkt/PY3/', exist_ok=True)  # do not fail if the folder already exists
urlretrieve('https://s3.amazonaws.com/assets.datacamp.com/production/course_3747/datasets/english_pickle.txt', 'tokenizers/punkt/PY3/english.pickle')
holy_grail = urlopen('https://s3.amazonaws.com/assets.datacamp.com/production/course_3747/datasets/grail.txt').read().decode('utf-8')

scene_one = holy_grail[:holy_grail.find("SCENE 2")]


In [None]:
# Import necessary modules
from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize

# Split scene_one into sentences: sentences
sentences = sent_tokenize(scene_one)

# Use word_tokenize to tokenize the fourth sentence: tokenized_sent
tokenized_sent = word_tokenize(sentences[3])

# Make a set of unique tokens in the entire scene: unique_tokens
unique_tokens = set(word_tokenize(scene_one))

# Print the unique tokens result
print(unique_tokens)


<p>Here, you'll be using the first scene of Monty Python's Holy Grail, which has been pre-loaded as <code>scene_one</code>. Feel free to check it out in the IPython Shell!</p>
<p>Your job in this exercise is to utilize <code>word_tokenize</code> and <code>sent_tokenize</code> from <code>nltk.tokenize</code> to tokenize both words and sentences from Python strings - in this case, the first scene of Monty Python's Holy Grail.</p>

<ul>
<li>Import the <code>sent_tokenize</code> and <code>word_tokenize</code> functions from <code>nltk.tokenize</code>.</li>
<li>Tokenize all the sentences in <code>scene_one</code> using the <code>sent_tokenize()</code> function.</li>
<li>Tokenize the fourth sentence in <code>sentences</code>, which you can access as <code>sentences[3]</code>, using the <code>word_tokenize()</code> function. </li>
<li>Find the unique tokens in the entire scene by using <code>word_tokenize()</code> on <code>scene_one</code> and then converting it into a set using <code>set()</code>.</li>
<li>Print the unique tokens found. This has been done for you, so hit 'Submit Answer' to see the results!</li>
</ul>

<ul>
<li>Use the command <code>from y import x</code> to import <code>x</code> from <code>y</code>.</li>
<li>Use the <code>sent_tokenize()</code> function to tokenize the sentences in <code>scene_one</code>.</li>
<li>Use <code>word_tokenize()</code> to tokenize the appropriate sentence in <code>sentences</code>. Remember, Python uses 0-based numbering.</li>
<li>After using <code>word_tokenize()</code> on <code>scene_one</code>, use the <code>set()</code> function to convert it into a set.</li>
</ul>
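To make the indexing and the `set()` step concrete without downloading any NLTK data, here is a rough standard-library approximation. It is not what `sent_tokenize()` or `word_tokenize()` do internally; the regex-based splitting and the `sample_scene` text are assumptions made purely for illustration.

```python
import re

# Hypothetical sample text standing in for scene_one
sample_scene = "Whoa there! Halt! Who goes there? It is I, Arthur. Who goes there?"

# Naive sentence split: break on whitespace that follows ., ? or !
sentences = re.split(r"(?<=[.?!])\s+", sample_scene)

# The fourth sentence lives at index 3 because Python counts from 0
tokenized_sent = re.findall(r"\w+|[^\w\s]", sentences[3])

# A set keeps each distinct token only once
all_tokens = re.findall(r"\w+|[^\w\s]", sample_scene)
unique_tokens = set(all_tokens)

print(tokenized_sent)
print(len(all_tokens), len(unique_tokens))
```

The repeated sentence contributes tokens to `all_tokens` but adds nothing new to `unique_tokens`, which is the effect the exercise is after.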

## More regex with re.search()

In [None]:
import re
from nltk.tokenize import sent_tokenize
from urllib.request import urlopen, urlretrieve
import os

os.makedirs('tokenizers/punkt/PY3/', exist_ok=True)  # do not fail if the folder already exists
urlretrieve('https://s3.amazonaws.com/assets.datacamp.com/production/course_3747/datasets/english_pickle.txt', 'tokenizers/punkt/PY3/english.pickle')


holy_grail = urlopen('https://s3.amazonaws.com/assets.datacamp.com/production/course_3747/datasets/grail.txt').read().decode('utf-8')
scene_one = holy_grail[:holy_grail.find("SCENE 2")]
sentences = sent_tokenize(scene_one)

<p>In this exercise, you'll utilize <code>re.search()</code> and <code>re.match()</code> to find specific tokens. Both <code>search</code> and <code>match</code> expect regex patterns, similar to those you defined in an earlier exercise. You'll apply these regex library methods to the same Monty Python text from the <code>nltk</code> corpora.</p>
<p>You have both <code>scene_one</code> and <code>sentences</code> available from the last exercise; now you can use them with <code>re.search()</code> and <code>re.match()</code> to extract and match more text.</p>
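The difference between the two functions can be sketched on a single line of text: `re.search()` scans the whole string, while `re.match()` only succeeds if the pattern matches at the very beginning. The `line` below is a hypothetical example in the style of the script, not text loaded from the file.

```python
import re

# Hypothetical sample line, for illustration only
line = "SOLDIER #1: Where'd you get the coconuts?"

# search() finds the word wherever it occurs
found = re.search(r"coconuts", line)
print(found.start(), found.end())

# match() returns None because the line does not start with "coconuts"
at_start = re.match(r"coconuts", line)
print(at_start)

# match() succeeds when the pattern fits the start of the string
speaker = re.match(r"[A-Z]+", line)
print(speaker.group())
```

Both functions return a match object on success, so `.start()`, `.end()`, and `.group()` work the same way on either result.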

## Choosing a tokenizer

In [None]:
from nltk.tokenize import regexp_tokenize

my_string = "SOLDIER #1: Found them? In Mercea? The coconut's tropical!"

pattern1 = r"(\w+|\?|!)"

pattern2 = r"(\w+|#\d|\?|!)"

pattern3 = r"(#\d\w+\?!)"

pattern4 = r"\s+"


<p>Given the following string, which of the below patterns is the best tokenizer? If possible, you want to retain sentence punctuation as separate tokens, but have <code>'#1'</code> remain a single token.</p>
<pre><code class="{python} language-{python}">my_string = "SOLDIER #1: Found them? In Mercea? The coconut's tropical!"
</code></pre>
<p>The string is available in your workspace as <code>my_string</code>, and the patterns have been pre-loaded as <code>pattern1</code>, <code>pattern2</code>, <code>pattern3</code>, and <code>pattern4</code>, respectively. </p>
<p>Additionally, <code>regexp_tokenize</code> has been imported from <code>nltk.tokenize</code>. You can use <code>regexp_tokenize(string, pattern)</code> with <code>my_string</code> and one of the patterns as arguments to experiment for yourself and see which is the best tokenizer.</p>

<p>The <code>|</code> character operates like an <code>or</code> statement. Try using <code>regexp_tokenize()</code> with <code>my_string</code> and one of the patterns to see how each pattern tokenizes the string.</p>
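If NLTK is not at hand, a rough way to preview the four patterns is `re.findall()` from the standard library. This is a stand-in and an assumption on my part: it should give comparable tokens for these particular patterns, but it is not guaranteed to mirror `regexp_tokenize()` in every case. Note the reversed argument order.

```python
import re

my_string = "SOLDIER #1: Found them? In Mercea? The coconut's tropical!"

patterns = {
    "pattern1": r"(\w+|\?|!)",
    "pattern2": r"(\w+|#\d|\?|!)",
    "pattern3": r"(#\d\w+\?!)",
    "pattern4": r"\s+",
}

# re.findall takes (pattern, string); regexp_tokenize takes (string, pattern)
tokens = {name: re.findall(p, my_string) for name, p in patterns.items()}

for name, toks in tokens.items():
    print(name, toks)
```

Comparing the printed lists shows which pattern keeps `'#1'` together while still emitting `?` and `!` as separate tokens.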

## Regex with NLTK tokenization

In [None]:
tweets = ["This is the best #nlp exercise ive found online! #python", "#NLP is super fun! <3 #learning", "Thanks @datacamp :) #nlp #python"]
# pattern2 = r"[#@]\w+"  # or r"@\w+|#\w+"

<p>Twitter is a frequently used source for NLP text and tasks. In this exercise, you'll build a more complex tokenizer for tweets with hashtags and mentions using <code>nltk</code> and regex. The <code>nltk.tokenize.TweetTokenizer</code> class gives you some extra methods and attributes for parsing tweets. </p>
<p>Here, you're given some example tweets to parse using both <code>TweetTokenizer</code> and <code>regexp_tokenize</code> from the <code>nltk.tokenize</code> module. These example tweets have been pre-loaded into the variable <code>tweets</code>. Feel free to explore it in the IPython Shell!</p>
<p><em>Unlike the syntax for the regex library, with <code>regexp_tokenize()</code> you pass the pattern as the <strong>second</strong> argument.</em></p>
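The hashtag and mention patterns can be tried out with the standard library before moving to NLTK. The sketch below uses `re.findall()` as a stand-in (so the pattern comes first here); the variable names `hashtags` and `tags_per_tweet` are illustrative assumptions.

```python
import re

tweets = [
    "This is the best #nlp exercise ive found online! #python",
    "#NLP is super fun! <3 #learning",
    "Thanks @datacamp :) #nlp #python",
]

# Hashtags in the first tweet: a "#" followed by word characters
hashtags = re.findall(r"#\w+", tweets[0])

# Hashtags and mentions in every tweet: "#" or "@" followed by word characters
tags_per_tweet = [re.findall(r"[#@]\w+", t) for t in tweets]

print(hashtags)
print(tags_per_tweet)
```

With `regexp_tokenize()` the same patterns would be passed after the text rather than before it.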

## Non-ascii tokenization

In [None]:
from urllib.request import urlopen, urlretrieve
import os

os.makedirs('tokenizers/punkt/PY3/', exist_ok=True)  # do not fail if the folder already exists
urlretrieve('https://s3.amazonaws.com/assets.datacamp.com/production/course_3747/datasets/english_pickle.txt', 'tokenizers/punkt/PY3/english.pickle')

from nltk.tokenize import regexp_tokenize, word_tokenize
german_text = "Wann gehen wir Pizza essen? \U0001F355 Und fährst du mit Über? \U0001F695"

print(german_text)


In [None]:
# Tokenize and print all words in german_text
all_words = word_tokenize(german_text)
print(all_words)

# Tokenize and print only capital words
capital_words = r"[A-ZÜ]\w+"
print(regexp_tokenize(german_text, capital_words))

# Tokenize and print only emoji: the ranges are listed back to back inside one character class
emoji = "[\U0001F300-\U0001F5FF\U0001F600-\U0001F64F\U0001F680-\U0001F6FF\u2600-\u26FF\u2700-\u27BF]"
print(regexp_tokenize(german_text, emoji))

<p>In this exercise, you'll practice advanced tokenization by tokenizing some non-ascii based text. You'll be using German with emoji!</p>
<p>Here, you have access to a string called <code>german_text</code>, which has been printed for you in the Shell. Notice the emoji and the German characters!</p>
<p>The following modules have been pre-imported from <code>nltk.tokenize</code>: <code>regexp_tokenize</code> and <code>word_tokenize</code>. </p>
<p>Unicode ranges for emoji are:</p>
<p><code>'\U0001F300-\U0001F5FF'</code>, <code>'\U0001F600-\U0001F64F'</code>, <code>'\U0001F680-\U0001F6FF'</code>, <code>'\u2600-\u26FF'</code>, and <code>'\u2700-\u27BF'</code>.</p>

<ul>
<li>Tokenize all the words in <code>german_text</code> using <code>word_tokenize()</code>, and print the result.</li>
<li>Tokenize only the capital words in <code>german_text</code>. <ul>
<li>First, write a pattern called <code>capital_words</code> to match only capital words. Make sure to check for the German <code>Ü</code>! To use this character in the exercise, copy and paste it from these instructions.</li>
<li>Then, tokenize it using <code>regexp_tokenize()</code>. </li></ul></li>
<li>Tokenize only the emoji in <code>german_text</code>. The pattern using the unicode ranges for emoji given in the assignment text has been written for you. Your job is to use <code>regexp_tokenize()</code> to tokenize the emoji.</li>
</ul>

<ul>
<li>To tokenize all the words in <code>german_text</code>, pass it in as an argument to <code>word_tokenize()</code>.</li>
<li>The pattern to match only capital words is <code>r"[A-ZÜ]\w+"</code>.</li>
<li>The pattern to match emoji combines the unicode ranges for emoji shown in the assignment text into a single pattern. After writing the patterns, be sure to tokenize using <code>regexp_tokenize()</code> and then print the results.</li>
</ul>
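The two non-ASCII patterns can be sanity-checked with `re.findall()` from the standard library, used here as a stand-in for `regexp_tokenize()`. In this sketch the emoji ranges are placed next to each other inside one character class; that layout is my assumption about a workable form, and the variable names are illustrative.

```python
import re

german_text = "Wann gehen wir Pizza essen? \U0001F355 Und fährst du mit Über? \U0001F695"

# Words that start with A-Z or Ü; \w also matches non-ASCII letters such as ä
capital_words = r"[A-ZÜ]\w+"
capitals = re.findall(capital_words, german_text)

# One character class holding all emoji code point ranges
emoji = "[\U0001F300-\U0001F5FF\U0001F600-\U0001F64F\U0001F680-\U0001F6FF\u2600-\u26FF\u2700-\u27BF]"
found_emoji = re.findall(emoji, german_text)

print(capitals)
print(found_emoji)
```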

## Charting practice

In [None]:
from nltk.tokenize import regexp_tokenize
import matplotlib.pyplot as plt
from urllib.request import urlopen
import re

holy_grail = urlopen('https://s3.amazonaws.com/assets.datacamp.com/production/course_3747/datasets/grail.txt').read().decode('utf-8')


In [None]:
# Split the script into lines: lines
lines = holy_grail.split('\n')

# Remove speaker prompts such as 'ARTHUR:' from each line
pattern = r"[A-Z]{2,}(\s)?(#\d)?([A-Z]{2,})?:"
lines = [re.sub(pattern, '', l) for l in lines]

# Tokenize each line: tokenized_lines
tokenized_lines = [regexp_tokenize(s, r"\w+") for s in lines]

# Make a frequency list of lengths: line_num_words
line_num_words = [len(t_line) for t_line in tokenized_lines]

# Plot a histogram of the line lengths
plt.hist(line_num_words)

# Show the plot
plt.show()

<p>Try using your new skills to find and chart the number of words per line in the script using <code>matplotlib</code>. The Holy Grail script is loaded for you, and you need to use regex to find the words per line. </p>
<p>Using list comprehensions here will speed up your computations. For example: <code>my_lines = [tokenize(l) for l in lines]</code> will call a function <code>tokenize</code> on each line in the list <code>lines</code>. The new transformed list will be saved in the <code>my_lines</code> variable.</p>
<p>You have access to the entire script in the variable <code>holy_grail</code>. Go for it!</p>

<ul>
<li>Split the script <code>holy_grail</code> into lines using the newline (<code>'\n'</code>) character.</li>
<li>Use <code>re.sub()</code> inside a list comprehension to replace speaker prompts such as <code>ARTHUR:</code> and <code>SOLDIER #1:</code> with an empty string. The pattern has been written for you. </li>
<li>Use a list comprehension to tokenize <code>lines</code> with <code>regexp_tokenize()</code>, keeping <strong>only words</strong>. Recall that the pattern for words is <code>r"\w+"</code>.</li>
<li>Use a list comprehension to create a list of line lengths called <code>line_num_words</code>.<ul>
<li>Use <code>t_line</code> as your iterator variable to iterate over <code>tokenized_lines</code>, and the <code>len()</code> function to compute line lengths.</li></ul></li>
<li>Plot a histogram of <code>line_num_words</code> using <code>plt.hist()</code>. Don't forget to use <code>plt.show()</code> as well to display the plot.</li>
</ul>

<ul>
<li>Use the <code>.split()</code> method on <code>holy_grail</code> with the newline character (<code>'\n'</code>) as the argument.</li>
<li>Recall that <code>re.sub()</code> requires 3 arguments: The pattern, the replacement, and the string. The pattern is given for you; the replacement is <code>''</code> and the string is <code>l</code>.</li>
<li>Use <code>regexp_tokenize()</code> as the output expression of your list comprehension, with <code>s</code> and <code>r"\w+"</code> as the arguments.</li>
<li>To create <code>line_num_words</code>, use <code>len(t_line)</code> as the output expression of the list comprehension.</li>
<li>Use <code>plt.hist()</code> with <code>line_num_words</code> as the argument to create the histogram, and then <code>plt.show()</code> to display it.</li>
</ul>
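The steps before the plot can be traced on a few lines by hand. The sketch below uses `re.findall()` in place of `regexp_tokenize()` and leaves out the histogram; the `lines` list is illustrative sample data in the style of the script, not the loaded file.

```python
import re

# Illustrative lines in the style of the script (not loaded from the file)
lines = ["ARTHUR: Whoa there!", "SOLDIER #1: Halt! Who goes there?", ""]

# Strip the speaker prompt from the start of each line
pattern = r"[A-Z]{2,}(\s)?(#\d)?([A-Z]{2,})?:"
stripped = [re.sub(pattern, "", l) for l in lines]

# Keep only the words in each line
tokenized_lines = [re.findall(r"\w+", s) for s in stripped]

# Count the words per line
line_num_words = [len(t_line) for t_line in tokenized_lines]

print(stripped)
print(line_num_words)  # [2, 4, 0]
```

Empty lines produce a count of 0, so a script with many blank lines will show a tall first bar in the histogram.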