# CS 195: Natural Language Processing
## Tokenization

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ericmanley/f23-CS195NLP/blob/main/F3_1_Tokenization.ipynb)


## References

Python `requests` library quickstart: https://requests.readthedocs.io/en/latest/user/quickstart/

Beautiful Soup documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

GPT Tokenizer Illustration: https://platform.openai.com/tokenizer

Python `split` method: https://docs.python.org/3/library/stdtypes.html#str.split

Hugging Face Byte-Pair Encoding tokenization: https://huggingface.co/learn/nlp-course/chapter6/5?fw=pt

Hugging Face WordPiece tokenization: https://huggingface.co/learn/nlp-course/chapter6/6?fw=pt

In [None]:
import sys
!{sys.executable} -m pip install requests chardet nltk beautifulsoup4 tokenizers transformers


In [None]:
#you shouldn't need to do this in Colab, but I had to do it on my own machine
#in order to connect to the nltk service
import nltk
import ssl

try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    pass
else:
    ssl._create_default_https_context = _create_unverified_https_context


## Tokenization

Before you can feed input into most NLP algorithms, you have to **tokenize** the text - break apart the string into units (the *tokens*) that the algorithm needs to work with.

Tokens can be
* letters
* words
* a mix of words and punctuation
* parts of words
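
To make the first three options concrete, here is a small sketch that cuts the same sentence three ways with built-in string operations. The sample sentence is made up for illustration. Tokens that are parts of words need a learned vocabulary, so they are left out here.

```python
sample = "I code."  # made-up example sentence

# Letter-level tokens: every character, including the space and the period
letter_tokens = list(sample)

# Word-level tokens: split on whitespace (punctuation stays attached)
word_tokens = sample.split()

# Words and punctuation as separate tokens: pad the period with spaces first
word_punct_tokens = sample.replace(".", " . ").split()

print(letter_tokens)
print(word_tokens)
print(word_punct_tokens)
```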

See how GPT tokenizes here: https://platform.openai.com/tokenizer

Tokenization can be done with *rule-based* methods, or the tokenization scheme can be learned automatically from data.

As we saw previously, the Python string `split` method can be very useful for rule-based methods:
* if you give it a parameter, it will break up the string using that delimiter
* if you don't, it splits on whitespace

In [None]:
text = "I code when I am happy . I am happy therefore I code . "
text_tokens = text.split()

print(text_tokens)

['I', 'code', 'when', 'I', 'am', 'happy', '.', 'I', 'am', 'happy', 'therefore', 'I', 'code', '.']


In [None]:
text = "I code when I am happy . I am happy therefore I code . "
text_tokens = text.split("I") #you probably don't want to do this

print(text_tokens)

['', ' code when ', ' am happy . ', ' am happy therefore ', ' code . ']


## The requests library

The `requests` library is useful for loading data stored on the web.

Here's how we can request the text version of *The Adventures of Sherlock Holmes* from Project Gutenberg: https://www.gutenberg.org/ebooks/1661


In [None]:
import requests

response = requests.get("https://www.gutenberg.org/files/1661/1661-0.txt")

print(response)
print(response.headers)

<Response [200]>
{'date': 'Thu, 28 Sep 2023 20:40:38 GMT', 'server': 'Apache', 'last-modified': 'Wed, 09 Jun 2021 16:45:05 GMT', 'accept-ranges': 'bytes', 'content-length': '607430', 'content-type': 'text/plain'}


A response code of 200 means the request worked. The `.headers` attribute holds the other metadata that came back with it.

Now let's look at what some of this text looks like:

In [None]:
#print(response.text) #uncomment to print the whole thing
print(response.text[4000:6000]) #printing a sample of some text in the middle

 my former friend and companion.

One nightâit was on the twentieth of March, 1888âI was returning from a
journey to a patient (for I had now returned to civil practice), when
my way led me through Baker Street. As I passed the well-remembered
door, which must always be associated in my mind with my wooing, and
with the dark incidents of the Study in Scarlet, I was seized with a
keen desire to see Holmes again, and to know how he was employing his
extraordinary powers. His rooms were brilliantly lit, and, even as I
looked up, I saw his tall, spare figure pass twice in a dark silhouette
against the blind. He was pacing the room swiftly, eagerly, with his
head sunk upon his chest and his hands clasped behind him. To me, who
knew his every mood and habit, his attitude and manner told their own
story. He was at work again. He had risen out of his drug-created
dreams and was hot upon the scent of some new problem. I rang the bell
and was shown up to the chamber which had 

Notice: There are a lot of weird characters like â. If the text looks different from what you see when you open the file, it means something went wrong.

Usually, the `requests` library can figure out the encoding that the characters are stored in, and `response.text` decodes the raw bytes using that guess. Here the `content-type` header did not name a character set, so it assumed the [ISO-8859-1](https://en.wikipedia.org/wiki/ISO/IEC_8859-1) encoding, but that's not quite right.
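
Here is a minimal sketch of how a wrong encoding guess produces those characters. It uses a short made-up string instead of the downloaded book: the string is encoded as UTF-8 bytes and then decoded two ways.

```python
original = "One night\u2014it was"  # \u2014 is the long dash used in the book

raw_bytes = original.encode("utf-8")    # the bytes a server would send
wrong = raw_bytes.decode("iso-8859-1")  # decoding with the wrong guess
right = raw_bytes.decode("utf-8")       # decoding with the real encoding

print(raw_bytes)
print(repr(wrong))
print(right)
print(len(raw_bytes), len(wrong), len(right))
```

The dash takes three bytes in UTF-8. ISO-8859-1 treats every byte as its own character, so the wrong decoding turns one dash into three characters, the first of which is â.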


In [None]:
print(response.encoding)

ISO-8859-1


Let's see what the requests module documentation suggests: https://requests.readthedocs.io/en/latest/user/quickstart/#response-content

Look for clues in `response.content`, which shows the data in its raw form, as bytes:

In [None]:
#print(response.content)
print(response.content[4000:6000])

b' my former friend and companion.\r\n\r\nOne night\xe2\x80\x94it was on the twentieth of March, 1888\xe2\x80\x94I was returning from a\r\njourney to a patient (for I had now returned to civil practice), when\r\nmy way led me through Baker Street. As I passed the well-remembered\r\ndoor, which must always be associated in my mind with my wooing, and\r\nwith the dark incidents of the Study in Scarlet, I was seized with a\r\nkeen desire to see Holmes again, and to know how he was employing his\r\nextraordinary powers. His rooms were brilliantly lit, and, even as I\r\nlooked up, I saw his tall, spare figure pass twice in a dark silhouette\r\nagainst the blind. He was pacing the room swiftly, eagerly, with his\r\nhead sunk upon his chest and his hands clasped behind him. To me, who\r\nknew his every mood and habit, his attitude and manner told their own\r\nstory. He was at work again. He had risen out of his drug-created\r\ndreams and was hot upon the scent of some new problem. I rang the 

One thing to notice: newlines are represented as `\r\n` rather than the usual `\n` - that will be important later, so remember it
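
As a quick sketch of why the line endings matter, here is what three ways of splitting into lines do with `\r\n`. The sample string is made up.

```python
sample = "line one\r\nline two\r\n"  # made-up text with \r\n line endings

by_n = sample.split("\n")       # leaves a stray '\r' on each line
by_rn = sample.split("\r\n")    # clean lines, plus a trailing empty string
by_lines = sample.splitlines()  # handles \n, \r\n, and \r alike

print(by_n)
print(by_rn)
print(by_lines)
```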

Now we can use a module like `chardet` to detect the encoding

In [None]:
import chardet


encoding_info = chardet.detect(response.content)
print(encoding_info)

{'encoding': 'UTF-8-SIG', 'confidence': 1.0, 'language': ''}


It looks like the file is actually in a variant of the popular encoding [UTF-8](https://en.wikipedia.org/wiki/UTF-8).
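
The `-SIG` part stands for "signature": a byte order mark (BOM) at the very start of the data, which is typically why `chardet` reports this variant. A small sketch with made-up bytes (not the actual download) shows how Python's `utf-8` and `utf-8-sig` codecs treat it:

```python
raw = b"\xef\xbb\xbfThe Adventures"  # made-up bytes that start with a UTF-8 BOM

with_bom = raw.decode("utf-8")         # the BOM survives as the character '\ufeff'
without_bom = raw.decode("utf-8-sig")  # the BOM is stripped

print(repr(with_bom))
print(repr(without_bom))
```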

Now we can set the encoding to match:

In [None]:
response.encoding = 'UTF-8-SIG'
print(response.text[4000:6000])

r friend and companion.

One night—it was on the twentieth of March, 1888—I was returning from a
journey to a patient (for I had now returned to civil practice), when
my way led me through Baker Street. As I passed the well-remembered
door, which must always be associated in my mind with my wooing, and
with the dark incidents of the Study in Scarlet, I was seized with a
keen desire to see Holmes again, and to know how he was employing his
extraordinary powers. His rooms were brilliantly lit, and, even as I
looked up, I saw his tall, spare figure pass twice in a dark silhouette
against the blind. He was pacing the room swiftly, eagerly, with his
head sunk upon his chest and his hands clasped behind him. To me, who
knew his every mood and habit, his attitude and manner told their own
story. He was at work again. He had risen out of his drug-created
dreams and was hot upon the scent of some new problem. I rang the bell
and was shown up to the chamber which had formerly been

## Cutting to the content

This ebook has markers showing where the actual content of the book starts and stops, so we can cut out the Project Gutenberg preamble at the beginning and the license stuff at the end.
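
The idea can be sketched on a tiny made-up string first. The helper name `text_between` and the short markers are assumptions for illustration, not part of the ebook or of any library.

```python
def text_between(text, start_marker, end_marker):
    """Return the part of text strictly between the two markers."""
    # str.index raises ValueError if a marker is missing
    start = text.index(start_marker) + len(start_marker)
    # look for the end marker only after the start marker
    end = text.index(end_marker, start)
    return text[start:end]

toy = "preamble *** START *** the story *** END *** license"  # made-up text
story = text_between(toy, "*** START ***", "*** END ***")
print(repr(story))
```

Using `index` rather than `find` means a missing marker fails loudly instead of silently slicing from position -1.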

In [None]:
start_text = "*** START OF THE PROJECT GUTENBERG EBOOK THE ADVENTURES OF SHERLOCK HOLMES ***"
end_text = "*** END OF THE PROJECT GUTENBERG EBOOK THE ADVENTURES OF SHERLOCK HOLMES ***"
start_index = response.text.index(start_text)+len(start_text)
end_index = response.text.index(end_text)
print("Start and end index of the text",start_index,end_index)
sherlock_text = response.text[start_index:end_index]
#print(sherlock_text)
print(sherlock_text[:1000])

Start and end index of the text 912 575060


cover




The Adventures of Sherlock Holmes

by Arthur Conan Doyle


Contents

   I.     A Scandal in Bohemia
   II.    The Red-Headed League
   III.   A Case of Identity
   IV.    The Boscombe Valley Mystery
   V.     The Five Orange Pips
   VI.    The Man with the Twisted Lip
   VII.   The Adventure of the Blue Carbuncle
   VIII.  The Adventure of the Speckled Band
   IX.    The Adventure of the Engineer’s Thumb
   X.     The Adventure of the Noble Bachelor
   XI.    The Adventure of the Beryl Coronet
   XII.   The Adventure of the Copper Beeches




I. A SCANDAL IN BOHEMIA


I.

To Sherlock Holmes she is always _the_ woman. I have seldom heard him
mention her under any other name. In his eyes she eclipses and
predominates the whole of her sex. It was not that he felt any emotion
akin to love for Irene Adler. All emotions, and that one particularly,
were abhorrent to his cold, precise but admirably ba

## Now we're ready to tokenize

A question we need to answer: what do we want our tokens to look like?

Do we want to include punctuation? Should it be a separate token?

Do we want it broken into letters? words? sentences?

For this example, let's assume we want to keep punctuation but break it apart from the words it is next to.

Unfortunately, a simple `.split()` won't do the trick - notice the periods are stuck to the words they're next to.



In [None]:
print(sherlock_text[:1000].split())

['cover', 'The', 'Adventures', 'of', 'Sherlock', 'Holmes', 'by', 'Arthur', 'Conan', 'Doyle', 'Contents', 'I.', 'A', 'Scandal', 'in', 'Bohemia', 'II.', 'The', 'Red-Headed', 'League', 'III.', 'A', 'Case', 'of', 'Identity', 'IV.', 'The', 'Boscombe', 'Valley', 'Mystery', 'V.', 'The', 'Five', 'Orange', 'Pips', 'VI.', 'The', 'Man', 'with', 'the', 'Twisted', 'Lip', 'VII.', 'The', 'Adventure', 'of', 'the', 'Blue', 'Carbuncle', 'VIII.', 'The', 'Adventure', 'of', 'the', 'Speckled', 'Band', 'IX.', 'The', 'Adventure', 'of', 'the', 'Engineer’s', 'Thumb', 'X.', 'The', 'Adventure', 'of', 'the', 'Noble', 'Bachelor', 'XI.', 'The', 'Adventure', 'of', 'the', 'Beryl', 'Coronet', 'XII.', 'The', 'Adventure', 'of', 'the', 'Copper', 'Beeches', 'I.', 'A', 'SCANDAL', 'IN', 'BOHEMIA', 'I.', 'To', 'Sherlock', 'Holmes', 'she', 'is', 'always', '_the_', 'woman.', 'I', 'have', 'seldom', 'heard', 'him', 'mention', 'her', 'under', 'any', 'other', 'name.', 'In', 'his', 'eyes', 'she', 'eclipses', 'and', 'predominates', '

One strategy: use the `replace` method to put spaces before and after the periods.

In [None]:
example_strategy = sherlock_text[:1000].replace("."," . ")
print(example_strategy)
print(example_strategy.split()) #now . are separate tokens



cover




The Adventures of Sherlock Holmes

by Arthur Conan Doyle


Contents

   I .      A Scandal in Bohemia
   II .     The Red-Headed League
   III .    A Case of Identity
   IV .     The Boscombe Valley Mystery
   V .      The Five Orange Pips
   VI .     The Man with the Twisted Lip
   VII .    The Adventure of the Blue Carbuncle
   VIII .   The Adventure of the Speckled Band
   IX .     The Adventure of the Engineer’s Thumb
   X .      The Adventure of the Noble Bachelor
   XI .     The Adventure of the Beryl Coronet
   XII .    The Adventure of the Copper Beeches




I .  A SCANDAL IN BOHEMIA


I . 

To Sherlock Holmes she is always _the_ woman .  I have seldom heard him
mention her under any other name .  In his eyes she eclipses and
predominates the whole of her sex .  It was not that he felt any emotion
akin to love for Irene Adler .  All emotions, and that one particularly,
were abhorrent to his cold, precise but admirably balanced 

OK - let's do the whole text and separate lots of other punctuation while we're at it

In [None]:
sherlock_text_intermediate = sherlock_text
sherlock_text_intermediate = sherlock_text_intermediate.replace("."," . ")
sherlock_text_intermediate = sherlock_text_intermediate.replace(","," , ")
sherlock_text_intermediate = sherlock_text_intermediate.replace("!"," ! ")
sherlock_text_intermediate = sherlock_text_intermediate.replace("?"," ? ")
sherlock_text_intermediate = sherlock_text_intermediate.replace(":"," : ")
sherlock_text_intermediate = sherlock_text_intermediate.replace(";"," ; ")
sherlock_text_intermediate = sherlock_text_intermediate.replace("“"," “ ")
sherlock_text_intermediate = sherlock_text_intermediate.replace("”"," ” ")
sherlock_text_intermediate = sherlock_text_intermediate.replace("’"," ’ ")
sherlock_text_intermediate = sherlock_text_intermediate.replace("‘"," ‘ ")
sherlock_text_intermediate = sherlock_text_intermediate.replace("-"," - ")
sherlock_text_intermediate = sherlock_text_intermediate.replace("—"," - ")

print(sherlock_text_intermediate[4000:6000])

w his every mood and habit ,  his attitude and manner told their own
story .  He was at work again .  He had risen out of his drug - created
dreams and was hot upon the scent of some new problem .  I rang the bell
and was shown up to the chamber which had formerly been in part my own . 

His manner was not effusive .  It seldom was ;  but he was glad ,  I think , 
to see me .  With hardly a word spoken ,  but with a kindly eye ,  he waved
me to an armchair ,  threw across his case of cigars ,  and indicated a
spirit case and a gasogene in the corner .  Then he stood before the fire
and looked me over in his singular introspective fashion . 

 “ Wedlock suits you ,  ”  he remarked .   “ I think ,  Watson ,  that you have put
on seven and a half pounds since I saw you .  ” 

 “ Seven !  ”  I answered . 

 “ Indeed ,  I should have thought a little more .  Just a trifle more ,  I
fancy ,  Watson .  And in practice again ,  I observe .  You did not tell me
that you intend

In [None]:
sherlock_tokens = sherlock_text_intermediate.split()
print(sherlock_tokens[:1000])

['cover', 'The', 'Adventures', 'of', 'Sherlock', 'Holmes', 'by', 'Arthur', 'Conan', 'Doyle', 'Contents', 'I', '.', 'A', 'Scandal', 'in', 'Bohemia', 'II', '.', 'The', 'Red', '-', 'Headed', 'League', 'III', '.', 'A', 'Case', 'of', 'Identity', 'IV', '.', 'The', 'Boscombe', 'Valley', 'Mystery', 'V', '.', 'The', 'Five', 'Orange', 'Pips', 'VI', '.', 'The', 'Man', 'with', 'the', 'Twisted', 'Lip', 'VII', '.', 'The', 'Adventure', 'of', 'the', 'Blue', 'Carbuncle', 'VIII', '.', 'The', 'Adventure', 'of', 'the', 'Speckled', 'Band', 'IX', '.', 'The', 'Adventure', 'of', 'the', 'Engineer', '’', 's', 'Thumb', 'X', '.', 'The', 'Adventure', 'of', 'the', 'Noble', 'Bachelor', 'XI', '.', 'The', 'Adventure', 'of', 'the', 'Beryl', 'Coronet', 'XII', '.', 'The', 'Adventure', 'of', 'the', 'Copper', 'Beeches', 'I', '.', 'A', 'SCANDAL', 'IN', 'BOHEMIA', 'I', '.', 'To', 'Sherlock', 'Holmes', 'she', 'is', 'always', '_the_', 'woman', '.', 'I', 'have', 'seldom', 'heard', 'him', 'mention', 'her', 'under', 'any', 'other
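
The chain of `replace` calls can also be written as a single pattern. This is one possible alternative, assuming Python's built-in `re` module. The function name and the sample sentence are made up for illustration.

```python
import re

def split_words_and_punct(text):
    # \w+ matches a run of letters, digits, or underscores;
    # [^\w\s] matches one character that is neither a word character nor whitespace
    return re.findall(r"\w+|[^\w\s]", text)

sample = "To Sherlock Holmes she is always _the_ woman. Red-Headed League!"
tokens = split_words_and_punct(sample)
print(tokens)
```

One difference from the cell above: the pattern keeps every punctuation character as it is, so a long dash stays a long dash instead of being turned into a hyphen.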

## Exercise

The text also contains some underscores. What do these signify?

Should we separate them out? Should we remove them? Go ahead and do what you think you should do.

Can you find any other special characters we should deal with?

In [None]:
chars_of_interest = []
SET = set("ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-.!?,;:'“”")

for x in sherlock_text_intermediate:
  if x not in SET and x not in chars_of_interest:
    chars_of_interest.append(x)
print(chars_of_interest)

for inter_char in chars_of_interest:
  char_index = sherlock_text_intermediate.find(inter_char)
  char_start = char_index - 30
  char_end = char_index + 30

  print("\nExample \"" + inter_char + "\": " + sherlock_text_intermediate[char_start:char_end])

['\r', '\n', ' ', '’', '_', '(', ')', '‘', '&', 'é', '£', 'æ', 'œ', '½', 'à', 'â', 'è']

Example "": 

Example "
": 

Example " ": 

Example "’": The Adventure of the Engineer ’ s Thumb
   X .      The Adv

Example "_": Sherlock Holmes she is always _the_ woman .  I have seldom h

Example "(":  from a
journey to a patient (for I had now returned to civ

Example ")": now returned to civil practice) ,  when
my way led me throu

Example "‘":   ” 

 “ Not at all .  The  ‘ G ’  with the small  ‘ t ’  

Example "&":  shouted ,   ‘ first to Gross &
Hankey ’ s in Regent Street

Example "é":  

     “ IRENE NORTON ,  _née_ ADLER .  ” 


 “ What a

Example "£":  of the League to a salary of £ 4 a
week for purely nominal

Example "æ": ‘ Is to copy out the _Encyclopædia Britannica_ .  There is t

Example "œ": _L ’ homme c ’ est rien - l ’ œuvre c ’ est tout_ ,  ’ 
as 

Example "½":  New Zealand stock ,  paying 4½ per cent .  Two thousand
fi

Example "à": é ,  James Windibank . 
_Vo

## What if I wanted it broken down by sentences?

In this example, suppose we want the text
* broken down into words
* with no punctuation
* structured by sentence

In [None]:
#split into lists by period
sherlock_sentences = sherlock_text.split(".")
print(sherlock_sentences[:100])

['\r\n\r\ncover\r\n\r\n\r\n\r\n\r\nThe Adventures of Sherlock Holmes\r\n\r\nby Arthur Conan Doyle\r\n\r\n\r\nContents\r\n\r\n   I', '     A Scandal in Bohemia\r\n   II', '    The Red-Headed League\r\n   III', '   A Case of Identity\r\n   IV', '    The Boscombe Valley Mystery\r\n   V', '     The Five Orange Pips\r\n   VI', '    The Man with the Twisted Lip\r\n   VII', '   The Adventure of the Blue Carbuncle\r\n   VIII', '  The Adventure of the Speckled Band\r\n   IX', '    The Adventure of the Engineer’s Thumb\r\n   X', '     The Adventure of the Noble Bachelor\r\n   XI', '    The Adventure of the Beryl Coronet\r\n   XII', '   The Adventure of the Copper Beeches\r\n\r\n\r\n\r\n\r\nI', ' A SCANDAL IN BOHEMIA\r\n\r\n\r\nI', '\r\n\r\nTo Sherlock Holmes she is always _the_ woman', ' I have seldom heard him\r\nmention her under any other name', ' In his eyes she eclipses and\r\npredominates the whole of her sex', ' It was not that he felt any emotion\r\nakin to love for Irene Adler', ' All e

In [None]:
chars_to_remove = [",","!","?",";",":","“","”","’","‘"]
chars_to_change_to_spaces = ["-","—","\r\n"]

for idx in range(len(sherlock_sentences)):
    for c in chars_to_remove:
        sherlock_sentences[idx] = sherlock_sentences[idx].replace(c,"") #replace those characters with the empty string
    for c in chars_to_change_to_spaces:
        sherlock_sentences[idx] = sherlock_sentences[idx].replace(c," ") #replace those characters with a space
    sherlock_sentences[idx] = sherlock_sentences[idx].split()

print(sherlock_sentences[:100])

[['cover', 'The', 'Adventures', 'of', 'Sherlock', 'Holmes', 'by', 'Arthur', 'Conan', 'Doyle', 'Contents', 'I'], ['A', 'Scandal', 'in', 'Bohemia', 'II'], ['The', 'Red', 'Headed', 'League', 'III'], ['A', 'Case', 'of', 'Identity', 'IV'], ['The', 'Boscombe', 'Valley', 'Mystery', 'V'], ['The', 'Five', 'Orange', 'Pips', 'VI'], ['The', 'Man', 'with', 'the', 'Twisted', 'Lip', 'VII'], ['The', 'Adventure', 'of', 'the', 'Blue', 'Carbuncle', 'VIII'], ['The', 'Adventure', 'of', 'the', 'Speckled', 'Band', 'IX'], ['The', 'Adventure', 'of', 'the', 'Engineers', 'Thumb', 'X'], ['The', 'Adventure', 'of', 'the', 'Noble', 'Bachelor', 'XI'], ['The', 'Adventure', 'of', 'the', 'Beryl', 'Coronet', 'XII'], ['The', 'Adventure', 'of', 'the', 'Copper', 'Beeches', 'I'], ['A', 'SCANDAL', 'IN', 'BOHEMIA', 'I'], ['To', 'Sherlock', 'Holmes', 'she', 'is', 'always', '_the_', 'woman'], ['I', 'have', 'seldom', 'heard', 'him', 'mention', 'her', 'under', 'any', 'other', 'name'], ['In', 'his', 'eyes', 'she', 'eclipses', 'and'

## Exercise

What if we wanted to convert all of the uppercase letters to lowercase? Edit the code to do this to each sentence.

Recall, you can use the `.lower()` string method.

In [None]:
my_string = "here’s another vacancy on the League of the Red-headed Men"
my_string_lower = my_string.lower()
print(my_string_lower)

here’s another vacancy on the league of the red-headed men
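
When the sentences are already lists of words, one way to apply `.lower()` is a nested list comprehension. The sample list is made up and only mimics the shape of `sherlock_sentences`.

```python
sentences = [["The", "Red", "Headed", "League"], ["A", "SCANDAL", "IN", "BOHEMIA"]]  # made-up sample

# Outer loop walks the sentences, inner loop lowercases each word
lowered = [[word.lower() for word in sentence] for sentence in sentences]
print(lowered)
```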


## What if I wanted it broken down by paragraph?

This time, we'll leave punctuation in.

In [None]:
sherlock_paragraphs = sherlock_text.split("\r\n")
print(sherlock_paragraphs[:50]) #look at the first few paragraphs

['', '', 'cover', '', '', '', '', 'The Adventures of Sherlock Holmes', '', 'by Arthur Conan Doyle', '', '', 'Contents', '', '   I.     A Scandal in Bohemia', '   II.    The Red-Headed League', '   III.   A Case of Identity', '   IV.    The Boscombe Valley Mystery', '   V.     The Five Orange Pips', '   VI.    The Man with the Twisted Lip', '   VII.   The Adventure of the Blue Carbuncle', '   VIII.  The Adventure of the Speckled Band', '   IX.    The Adventure of the Engineer’s Thumb', '   X.     The Adventure of the Noble Bachelor', '   XI.    The Adventure of the Beryl Coronet', '   XII.   The Adventure of the Copper Beeches', '', '', '', '', 'I. A SCANDAL IN BOHEMIA', '', '', 'I.', '', 'To Sherlock Holmes she is always _the_ woman. I have seldom heard him', 'mention her under any other name. In his eyes she eclipses and', 'predominates the whole of her sex. It was not that he felt any emotion', 'akin to love for Irene Adler. All emotions, and that one particularly,', 'were abhorrent 
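
In the output above, each entry is a single line of the file, because the book wraps its paragraphs across several lines and a lone `\r\n` only marks the end of a line. If you want whole paragraphs, one possible approach is to split on blank lines instead. This is a sketch on a made-up excerpt, not a change to the cells in this notebook.

```python
sample = "To Sherlock Holmes she is\r\nalways the woman.\r\n\r\nI had seen little\r\nof Holmes lately.\r\n"  # made-up excerpt

# A blank line shows up as two line endings in a row
blocks = sample.split("\r\n\r\n")

# Rejoin the lines inside each block and drop blocks that are only whitespace
paragraphs = [" ".join(block.split()) for block in blocks if block.strip()]
print(paragraphs)
```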

In [None]:
chars_to_separate = [",","!","?",";",":","“","”","’","‘","-","—","."]

for idx in range(len(sherlock_paragraphs)):
    for c in chars_to_separate:
        sherlock_paragraphs[idx] = sherlock_paragraphs[idx].replace(c," "+c+" ") #put a space before and after the character

    sherlock_paragraphs[idx] = sherlock_paragraphs[idx].split()

print(sherlock_paragraphs[:50])

[[], [], ['cover'], [], [], [], [], ['The', 'Adventures', 'of', 'Sherlock', 'Holmes'], [], ['by', 'Arthur', 'Conan', 'Doyle'], [], [], ['Contents'], [], ['I', '.', 'A', 'Scandal', 'in', 'Bohemia'], ['II', '.', 'The', 'Red', '-', 'Headed', 'League'], ['III', '.', 'A', 'Case', 'of', 'Identity'], ['IV', '.', 'The', 'Boscombe', 'Valley', 'Mystery'], ['V', '.', 'The', 'Five', 'Orange', 'Pips'], ['VI', '.', 'The', 'Man', 'with', 'the', 'Twisted', 'Lip'], ['VII', '.', 'The', 'Adventure', 'of', 'the', 'Blue', 'Carbuncle'], ['VIII', '.', 'The', 'Adventure', 'of', 'the', 'Speckled', 'Band'], ['IX', '.', 'The', 'Adventure', 'of', 'the', 'Engineer', '’', 's', 'Thumb'], ['X', '.', 'The', 'Adventure', 'of', 'the', 'Noble', 'Bachelor'], ['XI', '.', 'The', 'Adventure', 'of', 'the', 'Beryl', 'Coronet'], ['XII', '.', 'The', 'Adventure', 'of', 'the', 'Copper', 'Beeches'], [], [], [], [], ['I', '.', 'A', 'SCANDAL', 'IN', 'BOHEMIA'], [], [], ['I', '.'], [], ['To', 'Sherlock', 'Holmes', 'she', 'is', 'always

## Exercise

Remove empty paragraphs from `sherlock_paragraphs`.

In [None]:
clean_paragraphs = []
for paragraph in sherlock_paragraphs:
  if len(paragraph) != 0:
    clean_paragraphs.append(paragraph)

print(clean_paragraphs[:50])

[['cover'], ['The', 'Adventures', 'of', 'Sherlock', 'Holmes'], ['by', 'Arthur', 'Conan', 'Doyle'], ['Contents'], ['I', '.', 'A', 'Scandal', 'in', 'Bohemia'], ['II', '.', 'The', 'Red', '-', 'Headed', 'League'], ['III', '.', 'A', 'Case', 'of', 'Identity'], ['IV', '.', 'The', 'Boscombe', 'Valley', 'Mystery'], ['V', '.', 'The', 'Five', 'Orange', 'Pips'], ['VI', '.', 'The', 'Man', 'with', 'the', 'Twisted', 'Lip'], ['VII', '.', 'The', 'Adventure', 'of', 'the', 'Blue', 'Carbuncle'], ['VIII', '.', 'The', 'Adventure', 'of', 'the', 'Speckled', 'Band'], ['IX', '.', 'The', 'Adventure', 'of', 'the', 'Engineer', '’', 's', 'Thumb'], ['X', '.', 'The', 'Adventure', 'of', 'the', 'Noble', 'Bachelor'], ['XI', '.', 'The', 'Adventure', 'of', 'the', 'Beryl', 'Coronet'], ['XII', '.', 'The', 'Adventure', 'of', 'the', 'Copper', 'Beeches'], ['I', '.', 'A', 'SCANDAL', 'IN', 'BOHEMIA'], ['I', '.'], ['To', 'Sherlock', 'Holmes', 'she', 'is', 'always', '_the_', 'woman', '.', 'I', 'have', 'seldom', 'heard', 'him'], ['
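
The same filtering can be written as a list comprehension. Here is a short sketch on a made-up list shaped like `sherlock_paragraphs`:

```python
paragraphs = [[], ["cover"], [], ["The", "Adventures"], []]  # made-up sample

# An empty list counts as False, so `if p` keeps only the non-empty paragraphs
non_empty = [p for p in paragraphs if p]
print(non_empty)
```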

## Working with HTML data

Most data you retrieve from the web is not plain text - it usually has lots of HTML tags like `<title>`, `<br>`, and `<p>`.


In [None]:
import requests

response = requests.get("https://en.wikipedia.org/wiki/Sherlock_Holmes")

print(response)
print(response.headers)

<Response [200]>
{'date': 'Wed, 27 Sep 2023 18:02:15 GMT', 'server': 'mw2315.codfw.wmnet', 'x-content-type-options': 'nosniff', 'content-language': 'en', 'accept-ch': '', 'vary': 'Accept-Encoding,Cookie', 'last-modified': 'Mon, 25 Sep 2023 12:02:48 GMT', 'content-type': 'text/html; charset=UTF-8', 'content-encoding': 'gzip', 'age': '95936', 'x-cache': 'cp4041 hit, cp4041 hit/6', 'x-cache-status': 'hit-front', 'server-timing': 'cache;desc="hit-front", host;desc="cp4041"', 'strict-transport-security': 'max-age=106384710; includeSubDomains; preload', 'report-to': '{ "group": "wm_nel", "max_age": 604800, "endpoints": [{ "url": "https://intake-logging.wikimedia.org/v1/events?stream=w3c.reportingapi.network_error&schema_uri=/w3c/reportingapi/network_error/1.0.0" }] }', 'nel': '{ "report_to": "wm_nel", "max_age": 604800, "failure_fraction": 0.05, "success_fraction": 0.0}', 'set-cookie': 'WMF-Last-Access=28-Sep-2023;Path=/;HttpOnly;secure;Expires=Mon, 30 Oct 2023 12:00:00 GMT, WMF-Last-Access-

In [None]:
response.text[:3000]

'<!DOCTYPE html>\n<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-zebra-design-disabled vector-feature-custom-font-size-clientpref-disabled vector-feature-client-preferences-disabled" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8">\n<title>Sherlock Holmes - Wikipedia</title>\n<script>(function(){var className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabl

## Beautiful Soup

The Beautiful Soup package is great for *parsing* and manipulating HTML: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

In [None]:
from bs4 import BeautifulSoup
import requests

response = requests.get("https://en.wikipedia.org/wiki/Sherlock_Holmes")
sherlock_wiki_html = BeautifulSoup(response.text, 'html.parser')

You can look for a title tag:

In [None]:
print(sherlock_wiki_html.title)

<title>Sherlock Holmes - Wikipedia</title>


Or look for all of the `<a>` tags, which are the links to other pages:

In [None]:
list_of_links = sherlock_wiki_html.find_all('a')
for link in list_of_links[:100]:
    print(link.get('href'))

#bodyContent
/wiki/Main_Page
/wiki/Wikipedia:Contents
/wiki/Portal:Current_events
/wiki/Special:Random
/wiki/Wikipedia:About
//en.wikipedia.org/wiki/Wikipedia:Contact_us
https://donate.wikimedia.org/wiki/Special:FundraiserRedirector?utm_source=donate&utm_medium=sidebar&utm_campaign=C13_en.wikipedia.org&uselang=en
/wiki/Help:Contents
/wiki/Help:Introduction
/wiki/Wikipedia:Community_portal
/wiki/Special:RecentChanges
/wiki/Wikipedia:File_upload_wizard
/wiki/Main_Page
/wiki/Special:Search
/w/index.php?title=Special:CreateAccount&returnto=Sherlock+Holmes
/w/index.php?title=Special:UserLogin&returnto=Sherlock+Holmes
/w/index.php?title=Special:CreateAccount&returnto=Sherlock+Holmes
/w/index.php?title=Special:UserLogin&returnto=Sherlock+Holmes
/wiki/Help:Introduction
/wiki/Special:MyContributions
/wiki/Special:MyTalk
#
#Inspiration_for_the_character
#Fictional_character_biography
#Family_and_early_life
#Life_with_Watson
#Practice
#The_Great_Hiatus
#Retirement
#Personality_and_habits
#Drug_us
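
Many of these links are relative (`/wiki/...`), protocol-relative (`//en.wikipedia.org/...`), or just anchors within the page (`#...`). If you wanted to follow them, one option is `urljoin` from the standard library, which resolves each one against the address of the page it came from. The list of hrefs below is copied by hand from the output above, with a `None` added to stand in for an `<a>` tag that has no `href`.

```python
from urllib.parse import urljoin

page_url = "https://en.wikipedia.org/wiki/Sherlock_Holmes"
hrefs = ["/wiki/Main_Page", "//en.wikipedia.org/wiki/Wikipedia:Contact_us", "#Retirement", None]

# link.get('href') returns None when a tag has no href, so skip those
absolute = [urljoin(page_url, h) for h in hrefs if h is not None]
for url in absolute:
    print(url)
```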

## Extracting text with Beautiful Soup

Use the `.get_text()` method on the soup object

In [None]:
sherlock_wiki_text = sherlock_wiki_html.get_text()

sherlock_wiki_text[:2000]

'\n\n\n\nSherlock Holmes - Wikipedia\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nJump to content\n\n\n\n\n\n\n\nMain menu\n\n\n\n\n\nMain menu\nmove to sidebar\nhide\n\n\n\n\t\tNavigation\n\t\n\n\nMain pageContentsCurrent eventsRandom articleAbout WikipediaContact usDonate\n\n\n\n\n\n\t\tContribute\n\t\n\n\nHelpLearn to editCommunity portalRecent changesUpload file\n\n\n\n\n\nLanguages\n\nLanguage links are at the top of the page across from the title.\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nSearch\n\n\n\n\n\n\n\n\n\n\n\nSearch\n\n\n\n\n\n\n\n\nCreate accountLog in\n\n\n\n\n\n\nPersonal tools\n\n\n\n\n\n Create account Log in\n\n\n\n\n\n\t\tPages for logged out editors learn more\n\n\n\nContributionsTalk\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nContents\nmove to sidebar\nhide\n\n\n\n\n(Top)\n\n\n\n\n\n1Inspiration for the character\n\n\n\n\n\n\n\n2Fictional character biography\n\n\n\nToggle Fictional character biography subsection\n\n\n\n\n\n2.1Fam

In [None]:
sherlock_wiki_no_lines = sherlock_wiki_text.replace("\n"," ")
sherlock_wiki_no_lines[:2000]

'    Sherlock Holmes - Wikipedia                                   Jump to content        Main menu      Main menu move to sidebar hide    \t\tNavigation \t   Main pageContentsCurrent eventsRandom articleAbout WikipediaContact usDonate      \t\tContribute \t   HelpLearn to editCommunity portalRecent changesUpload file      Languages  Language links are at the top of the page across from the title.                    Search            Search         Create accountLog in       Personal tools       Create account Log in      \t\tPages for logged out editors learn more    ContributionsTalk                            Contents move to sidebar hide     (Top)      1Inspiration for the character        2Fictional character biography    Toggle Fictional character biography subsection      2.1Family and early life        2.2Life with Watson        2.3Practice        2.4The Great Hiatus        2.5Retirement          3Personality and habits    Toggle Personality and habits subsection      3.1Drug u

In [None]:
chars_to_separate = [",","!","?",";",":","\"","\'","-",".","(",")"]

for c in chars_to_separate:
    sherlock_wiki_no_lines = sherlock_wiki_no_lines.replace(c," "+c+" ")

sherlock_wiki_no_lines[:2000]

'    Sherlock Holmes  -  Wikipedia                                   Jump to content        Main menu      Main menu move to sidebar hide    \t\tNavigation \t   Main pageContentsCurrent eventsRandom articleAbout WikipediaContact usDonate      \t\tContribute \t   HelpLearn to editCommunity portalRecent changesUpload file      Languages  Language links are at the top of the page across from the title .                     Search            Search         Create accountLog in       Personal tools       Create account Log in      \t\tPages for logged out editors learn more    ContributionsTalk                            Contents move to sidebar hide      ( Top )       1Inspiration for the character        2Fictional character biography    Toggle Fictional character biography subsection      2 . 1Family and early life        2 . 2Life with Watson        2 . 3Practice        2 . 4The Great Hiatus        2 . 5Retirement          3Personality and habits    Toggle Personality and habits subsect

In [None]:
sherlock_wiki_tokens = sherlock_wiki_no_lines.split()
print(sherlock_wiki_tokens[:500])

['Sherlock', 'Holmes', '-', 'Wikipedia', 'Jump', 'to', 'content', 'Main', 'menu', 'Main', 'menu', 'move', 'to', 'sidebar', 'hide', 'Navigation', 'Main', 'pageContentsCurrent', 'eventsRandom', 'articleAbout', 'WikipediaContact', 'usDonate', 'Contribute', 'HelpLearn', 'to', 'editCommunity', 'portalRecent', 'changesUpload', 'file', 'Languages', 'Language', 'links', 'are', 'at', 'the', 'top', 'of', 'the', 'page', 'across', 'from', 'the', 'title', '.', 'Search', 'Search', 'Create', 'accountLog', 'in', 'Personal', 'tools', 'Create', 'account', 'Log', 'in', 'Pages', 'for', 'logged', 'out', 'editors', 'learn', 'more', 'ContributionsTalk', 'Contents', 'move', 'to', 'sidebar', 'hide', '(', 'Top', ')', '1Inspiration', 'for', 'the', 'character', '2Fictional', 'character', 'biography', 'Toggle', 'Fictional', 'character', 'biography', 'subsection', '2', '.', '1Family', 'and', 'early', 'life', '2', '.', '2Life', 'with', 'Watson', '2', '.', '3Practice', '2', '.', '4The', 'Great', 'Hiatus', '2', '.', '
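
The replace-then-split approach above can also be written as a single regular expression. This is only a minimal sketch: the function name `regex_tokenize` and the sample string are made up for illustration, and the pattern treats every non-word, non-space character as its own token, which is slightly broader than the `chars_to_separate` list.

```python
import re

def regex_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    # \w+ matches a run of letters, digits, or underscores
    # [^\w\s] matches one character that is neither a word character nor whitespace
    return re.findall(r"\w+|[^\w\s]", text)

# hypothetical sample, modeled on the start of the Wikipedia text
sample = "Sherlock Holmes - Wikipedia (Top) 2.1Family and early life"
sample_tokens = regex_tokenize(sample)
print(sample_tokens)
# ['Sherlock', 'Holmes', '-', 'Wikipedia', '(', 'Top', ')', '2', '.', '1Family', 'and', 'early', 'life']
```

Like the version above, this still splits section numbers such as `2.1` into three tokens, so it does not fix the underlying problem of page navigation text mixed into the content.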

## Exercise

Suppose you needed to tokenize lots of Wikipedia pages like this. Can you come up with a strategy for jumping straight to the content like we did with the Project Gutenberg book?

## NLTK Tokenizers

NLTK ships with several tokenizers; the `punkt` tokenizer is the most popular.

It can tokenize by words:


In [None]:
import nltk
import requests

#downloads the tokenizer models; only needed the first time you run it
#(newer NLTK releases may ask for "punkt_tab" instead)
nltk.download("punkt")

response = requests.get("https://www.gutenberg.org/files/1661/1661-0.txt")
sherlock_raw_text = response.text

sherlock_words = nltk.word_tokenize(sherlock_raw_text)
print(sherlock_words[:1000])

It can also tokenize by sentences:

In [None]:
import nltk
import requests

#nltk.download("punkt") #only need to do this once

response = requests.get("https://www.gutenberg.org/files/1661/1661-0.txt")
sherlock_raw_text = response.text

sherlock_sentences = nltk.sent_tokenize(sherlock_raw_text)
print(sherlock_sentences[:100])

['ï»¿The Project Gutenberg eBook of The Adventures of Sherlock Holmes, by Arthur Conan Doyle\r\n\r\nThis eBook is for the use of anyone anywhere in the United States and\r\nmost other parts of the world at no cost and with almost no restrictions\r\nwhatsoever.', 'You may copy it, give it away or re-use it under the terms\r\nof the Project Gutenberg License included with this eBook or online at\r\nwww.gutenberg.org.', 'If you are not located in the United States, you\r\nwill have to check the laws of the country where you are located before\r\nusing this eBook.', 'Title: The Adventures of Sherlock Holmes\r\n\r\nAuthor: Arthur Conan Doyle\r\n\r\nRelease Date: November 29, 2002 [eBook #1661]\r\n[Most recently updated: May 20, 2019]\r\n\r\nLanguage: English\r\n\r\nCharacter set encoding: UTF-8\r\n\r\nProduced by: an anonymous Project Gutenberg volunteer and Jose Menendez\r\n\r\n*** START OF THE PROJECT GUTENBERG EBOOK THE ADVENTURES OF SHERLOCK HOLMES ***\r\n\r\ncover\r\n\r\n\r\n\r\n\r\nTh

## Exercise

The output still contains some strange characters (for example, `ï»¿` at the very start and `\r\n` line endings). Can you preprocess the text to fix them before using the NLTK tokenizer?

Could you structure the words by sentences like we did earlier?

## Automatic Tokenizers

Rather than programming specific rules for how to tokenize your text, you can learn a tokenizer automatically from a corpus.

Two popular algorithms:
* Byte-Pair Encoding tokenization (used by OpenAI's GPT)
* WordPiece tokenization (used by Google's BERT)

Main idea:
* do some normalization and pre-tokenization, like the rule-based tokenization we used earlier to split the text into words and punctuation separated by spaces
* start with a vocabulary where each character is its own token
* find the most frequent pair of consecutive tokens and merge it into a new token (WordPiece picks the pair using a score that also accounts for how frequent each part is on its own)
* keep merging until your vocabulary reaches the desired size

Frequent words stay whole as single tokens.

Less-frequent words are represented as several subword tokens.
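
The merge loop described above can be sketched in a few lines of plain Python. This is a toy illustration of the Byte-Pair Encoding idea, not the implementation used by any real library: the function name `learn_bpe_merges` and the word counts are made up, and normalization and pre-tokenization are assumed to have already produced a list of words.

```python
from collections import Counter

def learn_bpe_merges(words, num_merges):
    """Learn up to num_merges merge rules from a list of words (toy BPE)."""
    # each distinct word starts as a tuple of single-character symbols
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # count every adjacent symbol pair, weighted by how often the word occurs
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best_pair = max(pair_counts, key=pair_counts.get)
        merges.append(best_pair)
        # rewrite each word, replacing the best pair with one merged symbol
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best_pair:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges, corpus

# hypothetical word counts for illustration
toy_words = ["hug"] * 10 + ["pug"] * 5 + ["pun"] * 12 + ["bun"] * 4 + ["hugs"] * 5
merges, segmented = learn_bpe_merges(toy_words, num_merges=3)
print(merges)
# [('u', 'g'), ('u', 'n'), ('h', 'ug')]
```

After three merges, the frequent word `hug` is a single token, while the less-frequent `pug` is still split into `p` and `ug`.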

For WordPiece, a `##` prefix marks a token that continues a word rather than starting one.
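
To see where the `##` pieces come from, here is a minimal sketch of how a trained WordPiece vocabulary can be applied to one word: take the longest vocabulary entry that matches the start of the word, then repeat on the remainder with `##` prepended. The function name `wordpiece_tokenize` and the tiny `toy_vocab` are assumptions for illustration; a real vocabulary such as BERT's has tens of thousands of entries.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into WordPiece-style tokens."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # try the longest remaining substring first, then shrink it
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # this piece continues a word
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word is unknown
        tokens.append(match)
        start = end
    return tokens

# hypothetical vocabulary, just large enough for the examples below
toy_vocab = {"The", "Project", "G", "##ute", "##nberg", "e", "##B", "##ook"}

print(wordpiece_tokenize("Gutenberg", toy_vocab))  # ['G', '##ute', '##nberg']
print(wordpiece_tokenize("eBook", toy_vocab))      # ['e', '##B', '##ook']
print(wordpiece_tokenize("Holmes", toy_vocab))     # ['[UNK]']
```

With this toy vocabulary, `Holmes` comes out as unknown only because none of its pieces were included.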

In [None]:
import requests
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")


response = requests.get("https://www.gutenberg.org/files/1661/1661-0.txt")
sherlock_raw_text = response.text

sherlock_hf_tokens = tokenizer.tokenize( sherlock_raw_text )
print(sherlock_hf_tokens[:1000])

Token indices sequence length is longer than the specified maximum sequence length for this model (143279 > 512). Running this sequence through the model will result in indexing errors


['ï', '»', '¿', 'The', 'Project', 'G', '##ute', '##nberg', 'e', '##B', '##ook', 'of', 'The', 'Adventures', 'of', 'Sherlock', 'Holmes', ',', 'by', 'Arthur', 'Conan', 'Doyle', 'This', 'e', '##B', '##ook', 'is', 'for', 'the', 'use', 'of', 'anyone', 'anywhere', 'in', 'the', 'United', 'States', 'and', 'most', 'other', 'parts', 'of', 'the', 'world', 'at', 'no', 'cost', 'and', 'with', 'almost', 'no', 'restrictions', 'whatsoever', '.', 'You', 'may', 'copy', 'it', ',', 'give', 'it', 'away', 'or', 're', '-', 'use', 'it', 'under', 'the', 'terms', 'of', 'the', 'Project', 'G', '##ute', '##nberg', 'License', 'included', 'with', 'this', 'e', '##B', '##ook', 'or', 'online', 'at', 'www', '.', 'gut', '##enberg', '.', 'org', '.', 'If', 'you', 'are', 'not', 'located', 'in', 'the', 'United', 'States', ',', 'you', 'will', 'have', 'to', 'check', 'the', 'laws', 'of', 'the', 'country', 'where', 'you', 'are', 'located', 'before', 'using', 'this', 'e', '##B', '##ook', '.', 'Title', ':', 'The', 'Adventures', 'of'

## Applied Exploration

Find some new text and tokenize it using one or more of the methods discussed here.

Use the tokens as input for the Markov Chain in the previous set of notes.

Describe what you did and record notes about your results.

