# <center>Book: Steven Bird, Ewan Klein, Edward Loper, 2009. **Natural Language Processing (NLP) with Python**, O'Reilly.</center> 

This notebook explores the solutions proposed by the user:
    https://github.com/Sturzgefahr

## <span style="color: blue;">Chapter #3</span>

In [1]:
# Let's first import everything we will need for these first questions
import nltk
from nltk.corpus import gutenberg
from nltk.corpus import state_union
from nltk.corpus import wordnet as wn
from nltk.corpus import brown
import matplotlib.pyplot as plt
%matplotlib inline

##### 1. 

☼ Define a string `s = 'colorless'`. Write a Python statement that changes this to "colourless" using only the slice and concatenation operations

In [2]:
s = 'colorless'
s = s[:4] + 'u' + s[4:]
s

'colourless'

##### 2.

☼ We can use the slice notation to remove morphological endings on words. For example, `'dogs'[:-1]` removes the last character of `dogs`, leaving `dog`. Use slice notation to remove the affixes from these words (we've inserted a hyphen to indicate the affix boundary, but omit this from your strings): `dish-es`, `run-ning`, `nation-ality`, `un-do`, `pre-heat`.

In [3]:
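suffixed = [('dishes', 2), 
            ('running', 4),
            ('nationality', 5)]
prefixed = [('undo', 2),
            ('preheat', 3)]

# Suffixes are sliced off the end, prefixes off the front
print([s[:-a] for s, a in suffixed] + [s[a:] for s, a in prefixed])

['dish', 'run', 'nation', 'do', 'heat']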


##### 4. 

☼ We can specify a "step" size for the slice. The following returns every second character within the slice: `monty[6:11:2]`. It also works in the reverse direction: `monty[10:5:-2]` Try these for yourself, then experiment with different step values.

In [4]:
# Portuguese:
tt = 'Minha lingua materna e portugues'
# Every other letter
tt[::2]

'Mnalnu aen  otge'

In [5]:
# Every other letter from the end
tt[::-2]

'suurpeartmagi hi'

In [6]:
# Every third letter
tt[::3]

'Mhlg tneoue'

*You get the point...*

##### 5. 

☼ What happens if you ask the interpreter to evaluate `monty[::-1]`? Explain why this is a reasonable result.

*It returns the string reversed.  With no start or end given and a step of -1, the slice walks from the last character back to the first:*

In [7]:
"redrum"[::-1]

'murder'
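
*A small sketch of why this is reasonable: with a negative step and omitted bounds, the slice should agree with joining the characters produced by Python's built-in `reversed()`.  The sample word is the one used above; the variable names are only illustrative.*

```python
word = "redrum"

# A step of -1 with omitted bounds walks from the last character to the first
backwards = word[::-1]

# Joining the characters produced by reversed() gives the same string
same_as_reversed = "".join(reversed(word))

print(backwards, same_as_reversed, backwards == same_as_reversed)
```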

##### 6.

☼ Describe the class of strings matched by the following regular expressions.

a. `[a-zA-Z]+`

b. `[A-Z][a-z]*`

c. `p[aeiou]{,2}t`

d. `\d+(\.\d+)?`

e. `([^aeiou][aeiou][^aeiou])*`

f. `\w+|[^\w\s]+`

Test your answers using `nltk.re_show()`.

*__a.__ `[a-zA-Z]+` will match one or more consecutive ASCII letters, in either case:*

In [9]:
import nltk, re

nltk.re_show(r'[a-zA-Z]+', "cAMELCASE 6186258313 hybr1d")

{cAMELCASE} 6186258313 {hybr}1{d}


In [10]:
test = 'This is just another random sentence, ' \
       'outra RanDOM sentence.'

nltk.re_show(r'[A-Z][a-z]*', test)

{This} is just another random sentence, outra {Ran}{D}{O}{M} sentence.


In [11]:
wordlist = [w.lower() for w in nltk.corpus.words.words('en')]
len([w for w in wordlist if re.search(r'p[aeiou]{,2}t', w)])

6978

In [12]:
print([w for w in wordlist if re.search(r'p[aeiou]{,2}t', w)][:20])

['abaptiston', 'abepithymia', 'ableptical', 'ableptically', 'abrupt', 'abruptedly', 'abruption', 'abruptly', 'abruptness', 'absorpt', 'absorptance', 'absorptiometer', 'absorptiometric', 'absorption', 'absorptive', 'absorptively', 'absorptiveness', 'absorptivity', 'absumption', 'acalypterae']


In [13]:
print([w for w in wordlist if re.search(r'^p[aeiou]{,2}t$', w)])

['pat', 'pat', 'paut', 'peat', 'pet', 'piet', 'piet', 'pit', 'poet', 'poot', 'pot', 'pout', 'put']


In [14]:
test = ['1234', '12.34', 'example 123.4 in a string', '1-234', '12,4', '$12.34']
for t in test:
    nltk.re_show(r'\d+(\.\d+)?', t) 

{1234}
{12.34}
example {123.4} in a string
{1}-{234}
{12},{4}
${12.34}


In [15]:
nltk.re_show(r'\d+(\.\d+)?', '1.23.4')

{1.23}.{4}


In [16]:
nltk.re_show(r'\d+(\.\d+)+', '1.23.4')

{1.23.4}


<i>__e.__ `([^aeiou][aeiou][^aeiou])*` will match zero or more consecutive non-vowel/vowel/non-vowel triples.  White spaces are considered non-vowels, so a string such as `to ` would match.  The output of `nltk.re_show()` looks strange at first with this RegExp: a string like `"baab"` returns `{}b{}a{}a{}b{}`.  The empty `{}` are zero-length matches, which the `*` allows at every position where no triple starts.  However, I have evaluated this RegExp with online evaluators (such as [this one](https://regexr.com/ "regexr.com")), and there the response is as expected:</i>

In [17]:
string = "babbabbab" \
         "babapapa"
nltk.re_show(r'([^aeiou][aeiou][^aeiou])*', string)

{babbabbabbab}{}a{pap}{}a{}


In [18]:
string = "baab"
nltk.re_show(r'([^aeiou][aeiou][^aeiou])*', string)

{}b{}a{}a{}b{}
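
*The zero-length matches can be reproduced with the standard `re` module alone.  The sketch below rewrites the pattern with a non-capturing group (an assumption made so that `re.findall` returns whole matches) and compares `*` with `+`, which requires at least one triple:*

```python
import re

star = r'(?:[^aeiou][aeiou][^aeiou])*'   # zero or more triples
plus = r'(?:[^aeiou][aeiou][^aeiou])+'   # at least one triple

# With *, every position where no triple starts yields an empty match
print(re.findall(star, "baab"))

# With +, empty matches are no longer possible
print(re.findall(plus, "baab"))
print(re.findall(plus, "babbab papa"))
```

*If the empty matches are unwanted, switching to `+` looks like the simplest change, at the cost of no longer matching the empty string.*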


*__f.__ `\w+|[^\w\s]+` will match either a run of word characters (letters, digits and the underscore) of any length, or a run of any length that contains neither word characters nor whitespace - i.e., punctuation and any other symbols:*

In [19]:
string = "This RegExp needs a fairly long string to show what it can %#$^%&* do."
nltk.re_show(r'\w+|[^\w\s]+', string)

{This} {RegExp} {needs} {a} {fairly} {long} {string} {to} {show} {what} {it} {can} {%#$^%&*} {do}{.}


##### 7.

*☼ Write regular expressions to match the following classes of strings:*

 + *__a.__ A single determiner (assume that __a__, __an__, and __the__ are the only determiners).*
 + <i>__b.__ An arithmetic expression using integers, addition, and multiplication, such as `2*3+8`.</i>
 
*__a.__*

In [20]:
string = "This sentence is just an example to be tested."
nltk.re_show(r'\b[Aa]n?\b|\b[Tt]he\b', string)

This sentence is just {an} example to be tested.


*__b.__*

In [21]:
string = "2 * 3 + 8"
nltk.re_show(r'(\d|[+*= ])+', string)

{2 * 3 + 8}


In [22]:
string = "11 + 4 * 2"
nltk.re_show(r'(\d|[+*= ])+', string)

{11 + 4 * 2}
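
*The pattern above also accepts strings that are not arithmetic expressions, such as `+++`.  A possible stricter variant, given here only as a sketch, requires an integer followed by any number of "operator, integer" pairs.  The name `arith` and the choice to allow optional spaces around operators are my own assumptions:*

```python
import re

# Pattern used in the cells above, kept for comparison
loose = re.compile(r'(\d|[+*= ])+')

# Hypothetical stricter pattern: an integer, then zero or more "operator integer" pairs
arith = re.compile(r'\d+(?:\s*[+*]\s*\d+)*')

for candidate in ["2*3+8", "11 + 4 * 2", "2 + * 3", "+++"]:
    print(repr(candidate),
          "loose:", bool(loose.fullmatch(candidate)),
          "strict:", bool(arith.fullmatch(candidate)))
```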


##### 8.


☼ Write a utility function that takes a URL as its argument, and returns the contents of the URL, with all HTML markup removed. Use `from urllib import request`  and then `request.urlopen('http://nltk.org/').read().decode('utf8')` to access the contents of the URL.

*The code below is inspired by [this Stack Overflow answer](https://stackoverflow.com/a/30565597 "Removing Style, Scripts, and HTML tags - answer"):*

In [23]:
from urllib import request
from bs4 import BeautifulSoup
from unicodedata import normalize

def return_URL_contents(url):
    html = request.urlopen(url).read().decode('utf8')
    raw = BeautifulSoup(html, 'html.parser')
    for r in raw(['script', 'style']):
        r.extract() # remove tags
    
    text = ' '.join(raw.stripped_strings) # retrieve tag content
    
    return normalize('NFKD', text) # normalize escape sequences

In [24]:
url = "https://www.nytimes.com/2017/10/29/business/virtual-reality-driverless-cars.html?module=inline"

return_URL_contents(url)[:2000]

'What Virtual Reality Can Teach a Driverless Car - The New York Times Sections SEARCH Skip to content Skip to site index Business Log in Today’s Paper Business | What Virtual Reality Can Teach a Driverless Car Artificial Intelligence The Bot That Writes Are These People Real? Algorithms Against Suicide Robots Without Bias Advertisement Continue reading the main story Supported by Continue reading the main story What Virtual Reality Can Teach a Driverless Car By Cade Metz Oct. 29, 2017 SAN FRANCISCO — As the computers that operate driverless cars digest the rules of the road, some engineers think it might be nice if they can learn from mistakes made in virtual reality rather than on real streets. Companies like Toyota, Uber and Waymo have discussed at length how they are testing autonomous vehicles on the streets of Mountain View, Calif., Phoenix and other cities. What is not as well known is that they are also testing vehicles inside computer simulations of these same cities. Virtual c

In [25]:
url = "https://en.wikipedia.org/wiki/Guido_van_Rossum"

return_URL_contents(url)[:2000]

'Guido van Rossum - Wikipedia Guido van Rossum From Wikipedia, the free encyclopedia Jump to navigation Jump to search Dutch programmer and creator of Python "GvR" redirects here. For other uses, see gvr (disambiguation) . In this Dutch name , the surname is van Rossum . Guido van Rossum Van Rossum at the Dropbox headquarters in 2014 Born ( 1956-01-31 ) 31 January 1956 (age 65) [1] Haarlem , Netherlands [2] [3] Nationality Dutch Alma mater University of Amsterdam Occupation Computer programmer, author Known for Creating the Python programming language Spouse(s) Kim Knapp \u200b ( m. 2000) \u200b Children 1 [4] Awards Award for the Advancement of Free Software (2001) Website gvanrossum .github .io Guido van Rossum ( Dutch: [ˈɣido vɑn ˈrɔsʏm, -səm] ; born 31 January 1956) is a Dutch programmer best known as the creator of the Python programming language , for which he was the " Benevolent dictator for life " (BDFL) until he stepped down from the position in July 2018. [5] [6] He remained

##### 9. 

☼ Save some text into a file `corpus.txt`. Define a function `load(f)` that reads from the file named in its sole argument, and returns a string containing the text of the file.

 + a. Use `nltk.regexp_tokenize()` to create a tokenizer that tokenizes the various kinds of punctuation in this text. Use one multi-line regular expression, with inline comments, using the verbose flag `(?x)`.
 
 + b. Use `nltk.regexp_tokenize()` to create a tokenizer that tokenizes the following kinds of expression: monetary amounts; dates; names of people and organizations.

In [31]:
url = 'https://www.cbc.ca/news/world/coronavirus-covid19-canada-world-february21-2021-1.5922099'

text = return_URL_contents(url)

with open('corpus.txt', 'w', encoding = "utf-8") as f:
    f.write(text)

In [32]:
def load(f):
    # The context manager closes the file once it has been read
    with open(f, encoding = "utf-8") as text:
        raw = text.read()
    
    return raw

In [33]:
nyt = load('corpus.txt')

*__a.__*

In [34]:
pattern = r'''(?x)
    [][.,;"'?!():_-`] # finds punctuation
'''

print(nltk.regexp_tokenize(nyt, pattern))

[':', "'", ':', "'", ',', '.', '.', '.', ':', ',', ':', ':', "'", '.', '.', '.', '.', '(', ')', ':', '.', '.', ',', '.', '.', "'", ',', '.', '.', '.', ':', ',', '.', '?', '.', ',', ',', '"', '"', '.', '.', "'", '.', ',', '.', ',', ',', '.', '.', '.', '"', '"', '.', "'", ',', ',', '.', '.', ',', "'", ',', '.', '.', ',', ',', '.', ',', '.', '.', "'", '.', '"', '"', '.', "'", '?', ',', "'", ',', '.', ',', ',', '.', ',', '"', ',', ',', ',', '.', '"', ',', '.', '.', "'", '.', '.', '(', ')', '.', ',', '.', '.', '.', '.', '.', ',', '.', ',', ',', ',', ',', '.', ',', ',', '.', '.', "'", ':', "'", ':', ',', "'", '.', "'", ',', '.', '.', ':', '.', ',', '.', '.', "'", ':', '.', '.', ',', ',', ',', ',', '.', ',', '.', ',', '.', '.', ',', ',', "'", '.', '.', '.', '.', "'", "'", ',', '.', '.', '.', ':', ':', ',', '.', '.', ':', '.', ',', ',', '.', '.', ',', ',', '.', ',', '.', ',', '.', '.', "'", '.', ',', '.', ',', "'", '.', "'", ',', '.', ',', '.', '.', '.', '.', '.', '.', '.', "'", '.', ',', '.',
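
*One thing to watch for in the pattern above: a hyphen placed between two characters inside a character class is read as a range, so an underscore followed by a hyphen and a backtick covers only those two characters and the hyphen itself is not matched.  This may explain why no `-` appears in the output.  The sketch below isolates that part of the class on a made-up sample string; moving the hyphen to the end makes it literal:*

```python
import re

sample = "a-b_c`d"

# Hyphen between two characters: a range from "_" to "`", hyphen not included
as_range = re.findall(r"[_-`]", sample)

# Hyphen at the end of the class: a literal hyphen
as_literal = re.findall(r"[_`-]", sample)

print(as_range)
print(as_literal)
```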

*__b.__ Using regular expressions to extract information such as proper names - which can take numerous forms - is fraught with problems, and the regular expressions below are far from perfect.  It could very well be that the point of this exercise was to demonstrate just how difficult this approach is.*

In [35]:
pattern = r'''(?x)
          
          (?:[A-Z])(?:[a-z]+|\.)(?:\s+[A-Z](?:[a-z]+|\.))*(?:\s+[A-Z])(?:[a-z]+|\.)
                                         # proper names
          | \$\d+\s\b[tr|b|m]illion\b    # literal monetary amounts
          | \$?\d+(?:[,\.]\d+)?          # numerical monetary amounts
          | \d{2}\[\\]\d{2}\-\\]\d{4}    # numerical dates (U.S. format)
          | [A-Z][a-z.]*\s\d{2}\,\s\d{2, 4} # literal dates (U.S. format)

          
        '''

print(nltk.regexp_tokenize(nyt, pattern))

['News Skip', 'Main Content Menu Search Search Sign In Quick Links News Sports Radio Music Listen Live', '19', '19', 'Top Stories Local The National Opinion World Canada Politics Indigenous Business Health Entertainment Tech', 'News Investigates Go Public Shows About', 'News World', 'Sunday The British', '31', 'Social Sharing U.', '19', '1', '31', 'The Associated Press', '21', '2021', '9', '00', 'Last Updated', '21', 'Astra Zeneca', 'Westfield Stratford City', '18', 'Henry Nicholls', '31', '19', 'United States', 'The British', '31', '50', '15', '1', 'But U.', 'K. Health Secretary Matt Hancock', '120,000', '17.2', '8', '12', 'Prime Minister Boris Johnson', 'Theresa Tam', '19', '704', '19', 'Quebec City', 'The Marguerite', 'Quebec City', '283', '50', 'Michel Cloutier', '19', '22', '286', '39', 'South Africa', '$2,000', 'Starting Monday', '19', 'Navigating Canada', 'Navigating Canada', 'The National', '1', '8', '25', 'Rohan Jumani', 'Richard Vanderlubbe', '8', '25', 'Prime Minister Justin
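
*Looking at the output, the date and "million" alternatives never seem to fire.  As far as I can tell there are three likely reasons: `[tr|b|m]` is a character class rather than an alternation, the numerical-date line escapes its brackets so it looks for literal bracket and backslash characters, and the space inside `{2, 4}` stops it from being read as a quantifier.  The sketch below shows one possible way to write the money and date alternatives, tested with plain `re.findall` on a made-up sentence.  The exact patterns are my own assumptions and the proper-name alternative is left out:*

```python
import re

pattern = r'''(?x)
      \$\d+(?:\.\d+)?\s(?:tr|b|m)illion\b   # literal monetary amounts
    | \$\d+(?:[,.]\d+)*                     # numerical monetary amounts
    | \d{2}[/-]\d{2}[/-]\d{4}               # numerical dates (U.S. format)
    | [A-Z][a-z]+\.?\s\d{1,2},\s\d{4}       # literal dates (U.S. format)
'''

# Made-up sample sentence, not taken from the corpus
sample = "On Feb 21, 2021 (02/21/2021) the fine rose from $2,000 to $3 million."

found = re.findall(pattern, sample)
print(found)
```

*The same pattern string should be usable with `nltk.regexp_tokenize()`, since it contains no capturing groups.*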

#####  10.

☼ Rewrite the following loop as a list comprehension:

In [36]:
sent = ['The', 'dog', 'gave', 'John', 'the', 'newspaper']
result = []
for word in sent:
    word_len = (word, len(word))
    result.append(word_len)
print(result)

[('The', 3), ('dog', 3), ('gave', 4), ('John', 4), ('the', 3), ('newspaper', 9)]


In [37]:
sent = ['The', 'dog', 'gave', 'John', 'the', 'newspaper']
result = [(word, len(word)) for word in sent]
print(result)

[('The', 3), ('dog', 3), ('gave', 4), ('John', 4), ('the', 3), ('newspaper', 9)]


##### 11.

☼ Define a string `raw` containing a sentence of your own choosing. Now, split `raw` on some character other than space, such as '`s`'.

In [38]:
raw = "How much wood would a woodchuck chuck if a woodchuck could chuck wood?"
raw.split('wood')

['How much ', ' would a ', 'chuck chuck if a ', 'chuck could chuck ', '?']

##### 12.

☼ Write a `for` loop to print out the characters of a string, one per line.

In [39]:
string = "Compared to some of the previous exercises, this seems comically easy."

for s in string[:20]:
    print(s)

C
o
m
p
a
r
e
d
 
t
o
 
s
o
m
e
 
o
f
 


##### 13.

☼ What is the difference between calling `split` on a string with no argument or with `' '` as the argument, e.g. `sent.split()` versus `sent.split(' ')`? What happens when the string being split contains tab characters, consecutive space characters, or a sequence of tabs and spaces? 

*`sent.split()` splits on any run of whitespace (spaces, tabs, newlines) and never returns empty strings.*

*`sent.split(' ')` splits only on the literal space character.  I.e., tabs stay inside the tokens as `\t`, and each additional consecutive space produces an empty string.*

In [40]:
s1 = "This string is a pretty simple string."
s2 = "This\tstrings\thas\ttabs."
s3 = "This        string          has      lots    of     space."
s4 = "This\tstring         has\ttabs       and\tspaces."

Ss = [s1, s2, s3, s4]

for s in Ss:
    print("\nWith `sent.split()`:")
    print(s.split())
    print("\nWith `sent.split(' ')`:")
    print(s.split(' '))


With `sent.split()`:
['This', 'string', 'is', 'a', 'pretty', 'simple', 'string.']

With `sent.split(' ')`:
['This', 'string', 'is', 'a', 'pretty', 'simple', 'string.']

With `sent.split()`:
['This', 'strings', 'has', 'tabs.']

With `sent.split(' ')`:
['This\tstrings\thas\ttabs.']

With `sent.split()`:
['This', 'string', 'has', 'lots', 'of', 'space.']

With `sent.split(' ')`:
['This', '', '', '', '', '', '', '', 'string', '', '', '', '', '', '', '', '', '', 'has', '', '', '', '', '', 'lots', '', '', '', 'of', '', '', '', '', 'space.']

With `sent.split()`:
['This', 'string', 'has', 'tabs', 'and', 'spaces.']

With `sent.split(' ')`:
['This\tstring', '', '', '', '', '', '', '', '', 'has\ttabs', '', '', '', '', '', '', 'and\tspaces.']


##### 14. 

☼ Create a variable `words` containing a list of words. Experiment with `words.sort()` and `sorted(words)`. What is the difference?

*`words.sort()` sorts the list in place and returns `None`, so whenever I call the list again I get the sorted list, not the one I originally stored.  `sorted(words)` returns a new sorted list and leaves the original untouched:*

In [41]:
words = ["slova", "ord", "Wörter", "λόγια", "words", "palabras", "sanat", 
         "mots", "focail", "szavak", "parole", "words", "woorden", "ord", 
         "słowa", "palavras", "từ ngữ", "ווערטער"]

words.sort()
print(words)

['Wörter', 'focail', 'mots', 'ord', 'ord', 'palabras', 'palavras', 'parole', 'sanat', 'slova', 'szavak', 'słowa', 'từ ngữ', 'woorden', 'words', 'words', 'λόγια', 'ווערטער']


In [42]:
print(words)

['Wörter', 'focail', 'mots', 'ord', 'ord', 'palabras', 'palavras', 'parole', 'sanat', 'slova', 'szavak', 'słowa', 'từ ngữ', 'woorden', 'words', 'words', 'λόγια', 'ווערטער']


In [43]:
words = ["slova", "ord", "Wörter", "λόγια", "words", "palabras", "sanat", 
         "mots", "focail", "szavak", "parole", "words", "woorden", "ord", 
         "słowa", "palavras", "từ ngữ", "ווערטער"]

print(sorted(words))

['Wörter', 'focail', 'mots', 'ord', 'ord', 'palabras', 'palavras', 'parole', 'sanat', 'slova', 'szavak', 'słowa', 'từ ngữ', 'woorden', 'words', 'words', 'λόγια', 'ווערטער']


In [44]:
print(words)

['slova', 'ord', 'Wörter', 'λόγια', 'words', 'palabras', 'sanat', 'mots', 'focail', 'szavak', 'parole', 'words', 'woorden', 'ord', 'słowa', 'palavras', 'từ ngữ', 'ווערטער']
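
*In the sorted output above, `Wörter` comes first because uppercase letters sort before lowercase ones.  A minimal sketch, on a shortened list of my own choosing, of the return values of the two calls and of a case-insensitive sort using `key=str.casefold`:*

```python
words = ["slova", "ord", "Wörter"]

# sorted() returns a new list and leaves the original alone
new_list = sorted(words)
unchanged = list(words)

# Ignoring case moves "Wörter" to the end
ignoring_case = sorted(words, key=str.casefold)

# list.sort() works in place and returns None
result_of_sort = words.sort()

print(new_list, unchanged, ignoring_case, result_of_sort, words)
```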


##### 15. 

☼ Explore the difference between strings and integers by typing the following at a Python prompt: `"3" * 7` and `3 * 7`. Try converting between strings and integers using `int("3")` and `str(3)`.

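*Multiplying a string $x$ by an integer $y$ returns a new string made of $x$ repeated $y$ times, whereas multiplying two integers is ordinary arithmetic.  `int()` and `str()` convert between the two types:*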

In [45]:
"3" * 7

'3333333'

In [46]:
3 * 7

21

In [47]:
int("3") * 7

21

In [48]:
str(3) * 7

'3333333'

##### 17. 

☼ What happens when the formatting strings `%6s` and `%-6s` are used to display strings that are longer than six characters?

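*This looks to be a legacy question from an older version of the book, since `%`-formatting is the older method of formatting in Python.  `%6s` and `%-6s` only set a minimum field width of six (right- and left-justified, respectively), so a longer string is displayed in full and unchanged.  Truncating to six characters requires a precision, as in `%.6s`:*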

In [51]:
test = "another test"

"%6s" % (test)

'another test'

In [52]:
"%-6s" % ("hey")

'hey   '

In [53]:
"%.6s" % (test)

'anothe'
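
*For comparison, here is a sketch of the same three cases with f-strings, where `>` and `<` set the alignment and a precision such as `.6` truncates.  The variable names are only illustrative:*

```python
test = "another test"

# Minimum width 6, right-aligned: a longer string is left untouched
wide = f"{test:>6}"

# Minimum width 6, left-aligned: a shorter string is padded on the right
padded = f"{'hey':<6}"

# Precision 6: the string is cut to six characters
truncated = f"{test:.6}"

print(repr(wide), repr(padded), repr(truncated))
```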

##### 18. 

◑ Read in some text from a corpus, tokenize it, and print the list of all *wh*-word types that occur. (*wh*-words in English are used in questions, relative clauses and exclamations: *who*, *which*, *what*, and so on.) Print them in order. Are any words duplicated in this list, because of the presence of case distinctions or punctuation?

*This question is a little difficult to follow.  Most of the corpora we're using have texts that have already been tokenized, so the first part of this question seems a bit redundant.  However, just to play along,  I'll use the raw text version of one of the Project Gutenberg texts.*

In [54]:
from nltk import word_tokenize
from nltk.corpus import gutenberg

raw = gutenberg.raw('bryant-stories.txt')

tokens = word_tokenize(raw)

tokens = sorted(set(tokens))

In [55]:
print([w for w in tokens if re.search('^[Ww]h', w)])

['Whale', 'What', 'When', 'Whenever', 'Where', 'Whether', 'Whiff', 'While', 'Whirling', 'White', 'Who', 'Whose', 'Why', 'what', 'whatever', 'wheat', 'wheelbarrow', 'wheeled', 'when', 'whence', 'whenever', 'where', 'wherein', 'wherever', 'whether', 'which', 'while', 'whimpering', 'whin', 'whinny', 'whipped', 'whirlpool', 'whiruled', 'whisk', 'whisked', 'whisper', 'whisper_', 'whispered', 'whispering', 'whispers', 'whistle', 'whistled', 'white', 'white-haired', 'white-robed', 'whither', 'who', 'whole', 'wholly', 'whom', 'whose', 'why']
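
*The list above does contain duplicates caused by case (`What`/`what`, `Who`/`who`) and by punctuation (`whisper`/`whisper_`), and the regular expression also lets through words such as `wheat` that merely start with "wh".  One possible refinement, sketched on a handful of tokens copied from that output, is to normalize each token and keep only members of a closed set of* wh*-words.  The set `WH_WORDS` is my own assumption and is not exhaustive:*

```python
import re

# A few tokens copied from the output above
tokens = ['What', 'Whale', 'what', 'wheat', 'who', 'Who', 'whisper_', 'whom', 'why']

# Hypothetical closed list of wh-words (not exhaustive)
WH_WORDS = {'who', 'whom', 'whose', 'which', 'what', 'when', 'where', 'why', 'whether'}

# Strip non-letters such as "_" and lowercase before collecting the types
normalized = sorted({re.sub(r'[\W_]+', '', t).lower() for t in tokens})

wh_types = [w for w in normalized if w in WH_WORDS]
print(wh_types)
```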


##### 22. 

◑ Examine the results of processing the URL `http://news.bbc.co.uk/` using the regular expressions suggested above. You will see that there is still a fair amount of non-textual data there, particularly Javascript commands. You may also find that sentence breaks have not been properly preserved. Define further regular expressions that improve the extraction of text from this web page.

In [58]:
url = "https://www.cbc.ca/news/world/coronavirus-covid19-canada-world-february21-2021-1.5922099"
return_URL_contents(url)[:2000]

'Coronavirus: What\'s happening in Canada and around the world Sunday | CBC News Skip to Main Content Menu Search Search Sign In Quick Links News Sports Radio Music Listen Live TV Watch COVID-19 Local updates Watch live COVID-19 tracker Vaccine tracker Top Stories Local The National Opinion World Canada Politics Indigenous Business Health Entertainment Tech & Science CBC News Investigates Go Public Shows About CBC News World · THE LATEST Coronavirus: What\'s happening in Canada and around the world Sunday The British government declared Sunday that every adult in the country should get a first coronavirus vaccine shot by July 31, at least a month earlier than its previous target. Social Sharing U.K. aims to get all adults vaccinated against COVID-19 with 1st dose by July 31 The Associated Press · Posted: Feb 21, 2021 9:00 AM ET | Last Updated: February 21 A health worker prepares an injection with a dose of Astra Zeneca coronavirus vaccine at a vaccination centre at London\'s Westfield

##### 24. 

◑ Try to write code to convert text into *hAck3r*, using regular expressions and substitution, where `e` → `3`, `i` → `1`, `o` → `0`, `l` → `|`, `s` → `5`, `.` → `5w33t!`, `ate` → `8`. Normalize the text to lowercase before converting it. Add more substitutions of your own. Now try to map `s` to two different values: `$` for word-initial `s`, and `5` for word-internal `s`.

In [59]:
test = "Hello suckers.  I ate your lunch.  It was delish."

test = test.lower()

org = ['ate', 'e', 'i', 'o', 'l', 's', r'\.']
sub = ['8', '3', '1', '0', '|', '5', '5w33t!']

for i in range(len(org)):
    test = re.sub(org[i], sub[i], test)

test

'h3||0 5uck3r55w33t!  1 8 y0ur |unch5w33t!  1t wa5 d3|15h5w33t!'

In [60]:
test = "Peter Piper picked a peck of pickled peppers."

test = test.lower()

org = ['e', 'i', 'o', 'l', 's', 't', 'p', r'\.']
sub = ['3', '1', '0', '|', '5', '+', '%', '5w33t!']

for i in range(len(org)):
    test = re.sub(org[i], sub[i], test)

test

'%3+3r %1%3r %1ck3d a %3ck 0f %1ck|3d %3%%3r55w33t!'

In [61]:
test = "Susie lives in Mississippi."

test = test.lower()

org = ['ate', 'e', 'i', 'o', 'l', 't', r'\bs', 's', r'\.']
sub = ['8', '3', '1', '0', '|', '+', '$', '5', '5w33t!']

for i in range(len(org)):
    test = re.sub(org[i], sub[i], test)

test

'$u513 |1v35 1n m1551551pp15w33t!'
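
*The loops above apply one substitution after another, so the order of the list matters: a replacement such as `5w33t!` would itself be rewritten if it were introduced too early.  An alternative, sketched here under my own naming (`SUBS`, `PATTERN` and `to_hacker` are not from the book), is a single pass with one alternation and a replacement function, so that text which has already been substituted is never scanned again:*

```python
import re

# Substitution table from the exercise; word-initial "s" is handled separately
SUBS = {'ate': '8', 'e': '3', 'i': '1', 'o': '0', 'l': '|', 's': '5', '.': '5w33t!'}

# Word-initial "s" and "ate" come first so they win over the single letters
PATTERN = re.compile(r'(?P<initial_s>\bs)|ate|[eiols.]')

def to_hacker(text):
    def replace(match):
        # The named group is only set when a word-initial "s" was matched
        if match.group('initial_s'):
            return '$'
        return SUBS[match.group()]
    return PATTERN.sub(replace, text.lower())

print(to_hacker("Susie lives in Mississippi."))
print(to_hacker("I ate your lunch."))
```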

##### 30. 

◑ Use the Porter Stemmer to normalize some tokenized text, calling the stemmer on each word. Do the same thing with the Lancaster Stemmer and see if you observe any differences.

In [62]:
url = 'https://www.cbc.ca/news/world/coronavirus-covid19-canada-world-february21-2021-1.5922099'

to_be_stemmed = return_URL_contents(url)
print(to_be_stemmed[:250])

Coronavirus: What's happening in Canada and around the world Sunday | CBC News Skip to Main Content Menu Search Search Sign In Quick Links News Sports Radio Music Listen Live TV Watch COVID-19 Local updates Watch live COVID-19 tracker Vaccine tracker


In [63]:
tokens = word_tokenize(to_be_stemmed)
print(tokens[:100])

['Coronavirus', ':', 'What', "'s", 'happening', 'in', 'Canada', 'and', 'around', 'the', 'world', 'Sunday', '|', 'CBC', 'News', 'Skip', 'to', 'Main', 'Content', 'Menu', 'Search', 'Search', 'Sign', 'In', 'Quick', 'Links', 'News', 'Sports', 'Radio', 'Music', 'Listen', 'Live', 'TV', 'Watch', 'COVID-19', 'Local', 'updates', 'Watch', 'live', 'COVID-19', 'tracker', 'Vaccine', 'tracker', 'Top', 'Stories', 'Local', 'The', 'National', 'Opinion', 'World', 'Canada', 'Politics', 'Indigenous', 'Business', 'Health', 'Entertainment', 'Tech', '&', 'Science', 'CBC', 'News', 'Investigates', 'Go', 'Public', 'Shows', 'About', 'CBC', 'News', 'World', '·', 'THE', 'LATEST', 'Coronavirus', ':', 'What', "'s", 'happening', 'in', 'Canada', 'and', 'around', 'the', 'world', 'Sunday', 'The', 'British', 'government', 'declared', 'Sunday', 'that', 'every', 'adult', 'in', 'the', 'country', 'should', 'get', 'a', 'first', 'coronavirus']


In [64]:
porter = nltk.PorterStemmer()
lancaster = nltk.LancasterStemmer()
print([porter.stem(t) for t in tokens[:100]])

['coronaviru', ':', 'what', "'s", 'happen', 'in', 'canada', 'and', 'around', 'the', 'world', 'sunday', '|', 'cbc', 'new', 'skip', 'to', 'main', 'content', 'menu', 'search', 'search', 'sign', 'In', 'quick', 'link', 'new', 'sport', 'radio', 'music', 'listen', 'live', 'TV', 'watch', 'covid-19', 'local', 'updat', 'watch', 'live', 'covid-19', 'tracker', 'vaccin', 'tracker', 'top', 'stori', 'local', 'the', 'nation', 'opinion', 'world', 'canada', 'polit', 'indigen', 'busi', 'health', 'entertain', 'tech', '&', 'scienc', 'cbc', 'new', 'investig', 'Go', 'public', 'show', 'about', 'cbc', 'new', 'world', '·', 'the', 'latest', 'coronaviru', ':', 'what', "'s", 'happen', 'in', 'canada', 'and', 'around', 'the', 'world', 'sunday', 'the', 'british', 'govern', 'declar', 'sunday', 'that', 'everi', 'adult', 'in', 'the', 'countri', 'should', 'get', 'a', 'first', 'coronaviru']


In [65]:
print([lancaster.stem(t) for t in tokens[:100]])

['coronavir', ':', 'what', "'s", 'hap', 'in', 'canad', 'and', 'around', 'the', 'world', 'sunday', '|', 'cbc', 'new', 'skip', 'to', 'main', 'cont', 'menu', 'search', 'search', 'sign', 'in', 'quick', 'link', 'new', 'sport', 'radio', 'mus', 'list', 'liv', 'tv', 'watch', 'covid-19', 'loc', 'upd', 'watch', 'liv', 'covid-19', 'track', 'vaccin', 'track', 'top', 'story', 'loc', 'the', 'nat', 'opin', 'world', 'canad', 'polit', 'indig', 'busy', 'heal', 'entertain', 'tech', '&', 'sci', 'cbc', 'new', 'investig', 'go', 'publ', 'show', 'about', 'cbc', 'new', 'world', '·', 'the', 'latest', 'coronavir', ':', 'what', "'s", 'hap', 'in', 'canad', 'and', 'around', 'the', 'world', 'sunday', 'the', 'brit', 'govern', 'decl', 'sunday', 'that', 'every', 'adult', 'in', 'the', 'country', 'should', 'get', 'a', 'first', 'coronavir']


##### 33.

◑ The `index()` function can be used to look up items in sequences. For example, `'inexpressible'.index('e')` tells us the index of the first position of the letter `e`.

 * a. What happens when you look up a substring, e.g. `'inexpressible'.index('re')`?
 
 * b. Define a variable `words` containing a list of words. Now use `words.index()` to look up the position of an individual word.
 
 * c. Define a variable `silly` as in the exercise above. Use the `index()` function in combination with list slicing to build a list `phrase` consisting of all the words up to (but not including) `in` in `silly`.

In [66]:
'inexpressible'.index('re')

5

In [67]:
words = ["I'm", 'too', 'tired', 'think', 'of', 'a', 'more', 
         'original', 'list']

words.index('tired')

2

In [68]:
silly = 'newly formed bland ideas are inexpressible in an infuriating way'

silly_words = silly.split()
phrase = silly_words[:silly_words.index('in')]

phrase

['newly', 'formed', 'bland', 'ideas', 'are', 'inexpressible']
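
*To spell out part a: looking up a substring returns the position where the substring starts.  The sketch below also shows what happens when the substring is absent, which the exercise does not ask about: `index()` raises a `ValueError`, while the related string method `find()` returns -1.  The missing substring `'xyz'` is just a made-up example:*

```python
word = 'inexpressible'

# index() on a substring gives the position of its first character
start = word.index('re')

# A missing substring raises ValueError with index() ...
try:
    word.index('xyz')
    raised = False
except ValueError:
    raised = True

# ... while find() signals the same situation with -1
not_found = word.find('xyz')

print(start, raised, not_found)
```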