<a href="https://colab.research.google.com/github/gonzalovaldenebro/NaturalLanguageProcessing-Portfolio/blob/main/F3_1_Tokenization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# CS 195: Natural Language Processing
## Tokenization

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ericmanley/f23-CS195NLP/blob/main/F3_1_Tokenization.ipynb)


## References

Python `requests` library quickstart: https://requests.readthedocs.io/en/latest/user/quickstart/

Beautiful Soup documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

GPT Tokenizer Illustration: https://platform.openai.com/tokenizer

Python `split` method: https://docs.python.org/3/library/stdtypes.html#str.split

Hugging Face Byte-Pair Encoding tokenization: https://huggingface.co/learn/nlp-course/chapter6/5?fw=pt

Hugging Face WordPiece tokenization: https://huggingface.co/learn/nlp-course/chapter6/6?fw=pt

In [1]:
import sys
!{sys.executable} -m pip install requests chardet nltk beautifulsoup4 tokenizers transformers



In [2]:
#you shouldn't need to do this in Colab, but I had to do it on my own machine
#in order to connect to the nltk service
import nltk
import ssl

try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    pass
else:
    ssl._create_default_https_context = _create_unverified_https_context


## Tokenization

Before you can feed input into most NLP algorithms, you have to **tokenize** the text - break apart the string into units (the *tokens*) that the algorithm needs to work with.

Tokens can be
* letters
* words
* a mix of words and punctuation
* parts of words

See how GPT tokenizes here: https://platform.openai.com/tokenizer

Tokenization can be done with *rule-based* methods, or the rules can be learned automatically from data.

As we saw previously, the Python string `split` method can be very useful for rule-based methods:
* if you give it a parameter, it breaks up the string wherever that delimiter appears
* if you don't, it separates by whitespace

In [3]:
text = "I code when I am happy . I am happy therefore I code . "
text_tokens = text.split()

print(text_tokens)

['I', 'code', 'when', 'I', 'am', 'happy', '.', 'I', 'am', 'happy', 'therefore', 'I', 'code', '.']


In [4]:
text = "I code when I am happy . I am happy therefore I code . "
text_tokens = text.split("I") #you probably don't want to do this

print(text_tokens)

['', ' code when ', ' am happy . ', ' am happy therefore ', ' code . ']
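The two calling styles also behave differently when the text contains runs of whitespace. Here is a small sketch of that difference; the sample string is made up for illustration and is not taken from the book.

```python
# Made-up sample with a double space and a trailing newline
sample = "I code  when I am happy .\n"

# No argument: split on any run of whitespace and drop empty strings
whitespace_tokens = sample.split()

# Explicit delimiter: split at every single space, keeping empty strings
space_tokens = sample.split(" ")

print(whitespace_tokens)
print(space_tokens)
```

With an explicit `" "` delimiter you get an empty string for the double space, and the newline stays attached to the last token, so the no-argument form is usually the safer default for tokenizing.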


## The requests library

The `requests` library is useful for loading data stored on the web.

Here's how we can request the text version of *The Adventures of Sherlock Holmes* from Project Gutenberg: https://www.gutenberg.org/ebooks/1661


In [5]:
import requests

response = requests.get("https://www.gutenberg.org/files/1661/1661-0.txt")

print(response)
print(response.headers)

<Response [200]>
{'date': 'Sun, 19 Nov 2023 23:14:36 GMT', 'server': 'Apache', 'last-modified': 'Tue, 10 Oct 2023 11:01:52 GMT', 'accept-ranges': 'bytes', 'content-length': '607504', 'x-backend': 'gutenweb1', 'content-type': 'text/plain'}


A response code of 200 means the request succeeded. The rest of the metadata that came back with it is available in `.headers`.

Now let's look at what some of this text looks like:

In [24]:
print(response.text) #prints the whole thing (it is long!)
#print(response.text[4000:6000]) #uncomment to print just a sample of text from the middle

The Project Gutenberg eBook of The Adventures of Sherlock Holmes,
by Arthur Conan Doyle

This eBook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with almost no restrictions
whatsoever. You may copy it, give it away or re-use it under the terms
of the Project Gutenberg License included with this eBook or online at
www.gutenberg.org. If you are not located in the United States, you
will have to check the laws of the country where you are located before
using this eBook.

Title: The Adventures of Sherlock Holmes

Author: Arthur Conan Doyle

Release Date: November 29, 2002 [eBook #1661]
[Most recently updated: October 10, 2023]

Language: English

Character set encoding: UTF-8

Produced by: an anonymous Project Gutenberg volunteer and Jose Menendez

*** START OF THE PROJECT GUTENBERG EBOOK THE ADVENTURES OF SHERLOCK
HOLMES ***




The Adventures of Sherlock Holmes

by Arthur Conan Doyle


Contents

   I.     A Scandal in Bohemia
  

Notice: there are a lot of odd characters like â. If this looks different from what you see when you open the file yourself, it means something went wrong with the decoding.

Usually, the `requests` library can figure out the encoding that the characters are stored in, and `response.text` decodes the raw bytes using that guess. Here the `content-type` header is `text/plain` with no charset, so `requests` fell back to the [ISO-8859-1](https://en.wikipedia.org/wiki/ISO/IEC_8859-1) encoding, but that's not quite right.


In [7]:
print(response.encoding)

ISO-8859-1
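To see where a character like â comes from, here is a minimal sketch using only built-in string methods. It encodes one em dash the way the file stores it and then decodes the bytes with the wrong codec.

```python
# An em dash, written as an escape so the source stays plain ASCII
dash = "\u2014"

# UTF-8 stores this one character as three bytes
raw_bytes = dash.encode("utf-8")
print(raw_bytes)

# Decoding with ISO-8859-1 turns each byte into its own character
garbled = raw_bytes.decode("iso-8859-1")
print(repr(garbled))

# Decoding with UTF-8 recovers the original character
recovered = raw_bytes.decode("utf-8")
print(recovered == dash)
```

The first of the three garbled characters is â, and the other two are invisible control characters, which is why the text appears sprinkled with stray â's.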


Let's see what the `requests` documentation suggests: https://requests.readthedocs.io/en/latest/user/quickstart/#response-content

We can look for clues in `response.content`, which shows the data in its raw form, as bytes:

In [25]:
#print(response.content)
print(response.content[4000:6000])

b' former friend and companion.\r\n\r\nOne night\xe2\x80\x94it was on the twentieth of March, 1888\xe2\x80\x94I was returning from a\r\njourney to a patient (for I had now returned to civil practice), when\r\nmy way led me through Baker Street. As I passed the well-remembered\r\ndoor, which must always be associated in my mind with my wooing, and\r\nwith the dark incidents of the Study in Scarlet, I was seized with a\r\nkeen desire to see Holmes again, and to know how he was employing his\r\nextraordinary powers. His rooms were brilliantly lit, and, even as I\r\nlooked up, I saw his tall, spare figure pass twice in a dark silhouette\r\nagainst the blind. He was pacing the room swiftly, eagerly, with his\r\nhead sunk upon his chest and his hands clasped behind him. To me, who\r\nknew his every mood and habit, his attitude and manner told their own\r\nstory. He was at work again. He had risen out of his drug-created\r\ndreams and was hot upon the scent of some new problem. I rang the bel

One thing to notice: newlines are represented as `\r\n` rather than the usual `\n` - that will be important later, so remember it

Now we can use a module like `chardet` to detect the encoding

In [26]:
import chardet

encoding_info = chardet.detect(response.content)
print(encoding_info)

{'encoding': 'UTF-8-SIG', 'confidence': 1.0, 'language': ''}


Looks like it is actually a variant of the popular encoding [UTF-8](https://en.wikipedia.org/wiki/UTF-8). The `-SIG` variant is UTF-8 with a byte order mark (BOM) at the very start of the file.

Now we can set the encoding to match:

In [27]:
response.encoding = 'UTF-8-SIG'
print(response.text[4000:6000])

riend and companion.

One night—it was on the twentieth of March, 1888—I was returning from a
journey to a patient (for I had now returned to civil practice), when
my way led me through Baker Street. As I passed the well-remembered
door, which must always be associated in my mind with my wooing, and
with the dark incidents of the Study in Scarlet, I was seized with a
keen desire to see Holmes again, and to know how he was employing his
extraordinary powers. His rooms were brilliantly lit, and, even as I
looked up, I saw his tall, spare figure pass twice in a dark silhouette
against the blind. He was pacing the room swiftly, eagerly, with his
head sunk upon his chest and his hands clasped behind him. To me, who
knew his every mood and habit, his attitude and manner told their own
story. He was at work again. He had risen out of his drug-created
dreams and was hot upon the scent of some new problem. I rang the bell
and was shown up to the chamber which had formerly been in part my own.
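The difference between plain `utf-8` and `utf-8-sig` only shows at the very start of the text. This is a small sketch using the standard `codecs` module; the byte string is made up to mirror the opening words of the file.

```python
import codecs

# Made-up sample: a UTF-8 byte order mark followed by some text
raw_bytes = codecs.BOM_UTF8 + "The Project".encode("utf-8")

as_utf8 = raw_bytes.decode("utf-8")          # BOM survives as an invisible '\ufeff'
as_utf8_sig = raw_bytes.decode("utf-8-sig")  # BOM is stripped
as_latin1 = raw_bytes.decode("iso-8859-1")   # BOM shows up as three stray characters

print(repr(as_utf8))
print(repr(as_utf8_sig))
print(repr(as_latin1))
```

If you decode with the wrong encoding, the BOM becomes three junk characters stuck to the first word, so it is worth fixing the encoding before tokenizing.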



## Cutting to the content

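This ebook has markers showing where the actual content of the book starts and stops, so we can cut out the Project Gutenberg preamble and the license text at the end.

Keep an eye on the printed indices, though: `find` returns -1 when it can't find the string. The output below is `77 -1`, which means neither marker matched exactly. In the file, the start marker is split across two lines, and the end marker as typed here has extra spaces before the final `***`. As a result, the preamble is still part of `sherlock_text`.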

In [35]:
start_text = "*** START OF THE PROJECT GUTENBERG EBOOK THE ADVENTURES OF SHERLOCK HOLMES ***"
end_text   = "*** END OF THE PROJECT GUTENBERG EBOOK THE ADVENTURES OF SHERLOCK HOLMES   ***"

start_index = response.text.find(start_text)+len(start_text)
end_index = response.text.find(end_text)

print("Start and end index of the text",start_index,end_index)
sherlock_text = response.text[start_index:end_index]

#print(sherlock_text)
print(sherlock_text[:1000])

Start and end index of the text 77 -1
Conan Doyle

This eBook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with almost no restrictions
whatsoever. You may copy it, give it away or re-use it under the terms
of the Project Gutenberg License included with this eBook or online at
www.gutenberg.org. If you are not located in the United States, you
will have to check the laws of the country where you are located before
using this eBook.

Title: The Adventures of Sherlock Holmes

Author: Arthur Conan Doyle

Release Date: November 29, 2002 [eBook #1661]
[Most recently updated: October 10, 2023]

Language: English

Character set encoding: UTF-8

Produced by: an anonymous Project Gutenberg volunteer and Jose Menendez

*** START OF THE PROJECT GUTENBERG EBOOK THE ADVENTURES OF SHERLOCK
HOLMES ***




The Adventures of Sherlock Holmes

by Arthur Conan Doyle


Contents

   I.     A Scandal in Bohemia
   II.    The Red-Headed League
   III. 
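Since the markers in the file can wrap across lines, one way to make the search more forgiving is a regular expression that lets any run of whitespace stand in for a space. This is only a sketch: `cut_between_markers` is a hypothetical helper, not something from the course materials, and the sample text is made up.

```python
import re

def cut_between_markers(text, start_marker, end_marker):
    """Return the text between two markers, letting any run of whitespace
    in the text (spaces, \\r\\n, ...) match the spaces in the marker."""
    def flexible(marker):
        # Escape each word, then allow any whitespace between the words
        return r"\s+".join(re.escape(word) for word in marker.split())

    start_match = re.search(flexible(start_marker), text)
    end_match = re.search(flexible(end_marker), text)
    if start_match is None or end_match is None:
        raise ValueError("marker not found")  # fail loudly instead of slicing with -1
    return text[start_match.end():end_match.start()]

# Made-up miniature of the file layout, with the markers wrapped over two lines
sample = (
    "Preamble\r\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK THE ADVENTURES OF SHERLOCK\r\nHOLMES ***\r\n\r\n"
    "The story.\r\n\r\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK THE ADVENTURES OF SHERLOCK\r\nHOLMES ***\r\n"
    "License"
)
body = cut_between_markers(
    sample,
    "*** START OF THE PROJECT GUTENBERG EBOOK THE ADVENTURES OF SHERLOCK HOLMES ***",
    "*** END OF THE PROJECT GUTENBERG EBOOK THE ADVENTURES OF SHERLOCK HOLMES   ***",
)
print(repr(body))
```

Raising an error when a marker is missing is a deliberate choice here: slicing with -1 does not fail, it silently returns the wrong text.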

## Now we're ready to tokenize

A question we need to answer: what do we want our tokens to look like?

Do we want to include punctuation? Should it be a separate token?

Do we want it broken into letters? words? sentences?

For this example, let's assume we want to keep punctuation but break it apart from the words it is next to.

Unfortunately, a simple `.split()` won't do the trick - notice the periods are stuck to the words they're next to.



In [36]:
print(sherlock_text[:1000].split())
print(sherlock_text[:1000])

['Conan', 'Doyle', 'This', 'eBook', 'is', 'for', 'the', 'use', 'of', 'anyone', 'anywhere', 'in', 'the', 'United', 'States', 'and', 'most', 'other', 'parts', 'of', 'the', 'world', 'at', 'no', 'cost', 'and', 'with', 'almost', 'no', 'restrictions', 'whatsoever.', 'You', 'may', 'copy', 'it,', 'give', 'it', 'away', 'or', 're-use', 'it', 'under', 'the', 'terms', 'of', 'the', 'Project', 'Gutenberg', 'License', 'included', 'with', 'this', 'eBook', 'or', 'online', 'at', 'www.gutenberg.org.', 'If', 'you', 'are', 'not', 'located', 'in', 'the', 'United', 'States,', 'you', 'will', 'have', 'to', 'check', 'the', 'laws', 'of', 'the', 'country', 'where', 'you', 'are', 'located', 'before', 'using', 'this', 'eBook.', 'Title:', 'The', 'Adventures', 'of', 'Sherlock', 'Holmes', 'Author:', 'Arthur', 'Conan', 'Doyle', 'Release', 'Date:', 'November', '29,', '2002', '[eBook', '#1661]', '[Most', 'recently', 'updated:', 'October', '10,', '2023]', 'Language:', 'English', 'Character', 'set', 'encoding:', 'UTF-8', '

One strategy: use the `replace` method to put spaces before and after the periods.

In [37]:
example_strategy = sherlock_text[:1000].replace("."," . ")
print(example_strategy)
print(example_strategy.split()) #now . are separate tokens

Conan Doyle

This eBook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with almost no restrictions
whatsoever .  You may copy it, give it away or re-use it under the terms
of the Project Gutenberg License included with this eBook or online at
www . gutenberg . org .  If you are not located in the United States, you
will have to check the laws of the country where you are located before
using this eBook . 

Title: The Adventures of Sherlock Holmes

Author: Arthur Conan Doyle

Release Date: November 29, 2002 [eBook #1661]
[Most recently updated: October 10, 2023]

Language: English

Character set encoding: UTF-8

Produced by: an anonymous Project Gutenberg volunteer and Jose Menendez

*** START OF THE PROJECT GUTENBERG EBOOK THE ADVENTURES OF SHERLOCK
HOLMES ***




The Adventures of Sherlock Holmes

by Arthur Conan Doyle


Contents

   I .      A Scandal in Bohemia
   II .     The Red-Headed League
   III .  
['Conan', 'Doyle', 'T

OK - let's do the whole text and separate lots of other punctuation while we're at it

In [38]:
sherlock_text_intermediate = sherlock_text
sherlock_text_intermediate = sherlock_text_intermediate.replace("."," . ")
sherlock_text_intermediate = sherlock_text_intermediate.replace(","," , ")
sherlock_text_intermediate = sherlock_text_intermediate.replace("!"," ! ")
sherlock_text_intermediate = sherlock_text_intermediate.replace("?"," ? ")
sherlock_text_intermediate = sherlock_text_intermediate.replace(":"," : ")
sherlock_text_intermediate = sherlock_text_intermediate.replace(";"," ; ")
sherlock_text_intermediate = sherlock_text_intermediate.replace("“"," “ ")
sherlock_text_intermediate = sherlock_text_intermediate.replace("”"," ” ")
sherlock_text_intermediate = sherlock_text_intermediate.replace("’"," ’ ")
sherlock_text_intermediate = sherlock_text_intermediate.replace("‘"," ‘ ")
sherlock_text_intermediate = sherlock_text_intermediate.replace("-"," - ")
sherlock_text_intermediate = sherlock_text_intermediate.replace("—"," - ")

print(sherlock_text_intermediate[4000:6000])

ch I merely shared with all the readers of
the daily press ,  I knew little of my former friend and companion . 

One night - it was on the twentieth of March ,  1888 - I was returning from a
journey to a patient (for I had now returned to civil practice) ,  when
my way led me through Baker Street .  As I passed the well - remembered
door ,  which must always be associated in my mind with my wooing ,  and
with the dark incidents of the Study in Scarlet ,  I was seized with a
keen desire to see Holmes again ,  and to know how he was employing his
extraordinary powers .  His rooms were brilliantly lit ,  and ,  even as I
looked up ,  I saw his tall ,  spare figure pass twice in a dark silhouette
against the blind .  He was pacing the room swiftly ,  eagerly ,  with his
head sunk upon his chest and his hands clasped behind him .  To me ,  who
knew his every mood and habit ,  his attitude and manner told their own
story .  He was at work again .  He had risen out of his drug - created
drea

In [39]:
sherlock_tokens = sherlock_text_intermediate.split()
print(sherlock_tokens[:1000])

['Conan', 'Doyle', 'This', 'eBook', 'is', 'for', 'the', 'use', 'of', 'anyone', 'anywhere', 'in', 'the', 'United', 'States', 'and', 'most', 'other', 'parts', 'of', 'the', 'world', 'at', 'no', 'cost', 'and', 'with', 'almost', 'no', 'restrictions', 'whatsoever', '.', 'You', 'may', 'copy', 'it', ',', 'give', 'it', 'away', 'or', 're', '-', 'use', 'it', 'under', 'the', 'terms', 'of', 'the', 'Project', 'Gutenberg', 'License', 'included', 'with', 'this', 'eBook', 'or', 'online', 'at', 'www', '.', 'gutenberg', '.', 'org', '.', 'If', 'you', 'are', 'not', 'located', 'in', 'the', 'United', 'States', ',', 'you', 'will', 'have', 'to', 'check', 'the', 'laws', 'of', 'the', 'country', 'where', 'you', 'are', 'located', 'before', 'using', 'this', 'eBook', '.', 'Title', ':', 'The', 'Adventures', 'of', 'Sherlock', 'Holmes', 'Author', ':', 'Arthur', 'Conan', 'Doyle', 'Release', 'Date', ':', 'November', '29', ',', '2002', '[eBook', '#1661]', '[Most', 'recently', 'updated', ':', 'October', '10', ',', '2023]',
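The chain of `replace` calls works, but every new punctuation mark needs its own line. A regular expression can make this more compact. This sketch is an alternative technique, not the approach used above, and `regex_tokenize` is a hypothetical name.

```python
import re

def regex_tokenize(text):
    # \w+      matches a run of letters, digits, or underscores (one word)
    # [^\w\s]  matches any single character that is neither of those nor whitespace
    return re.findall(r"\w+|[^\w\s]", text)

sample = "whatsoever. You may re-use it [eBook #1661]"
tokens = regex_tokenize(sample)
print(tokens)
```

Two differences from the cell above are worth knowing about: every punctuation character becomes its own token, including brackets and `#`, and because `\w` also matches underscores, a word wrapped in underscores stays glued to them.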

## Exercise

The text also contains some underscores. What do these signify?

 - These signify an italic word

Should we separate them out? Should we remove them? Go ahead and do what you think you should do.

 - Yes, we should remove them from the text

Can you find any other special characters we should deal with?

 - Yes, we found parentheses. Square brackets are also still stuck to words in the output above (`[eBook`, `#1661]`).
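One way to act on these answers is to keep the characters in lists and loop over them. This is a rough sketch: `separate_punctuation` is a hypothetical helper, and the underscores in the sample sentence were added for illustration.

```python
def separate_punctuation(text, chars_to_separate, chars_to_remove):
    # Delete the characters we don't want at all (e.g. italics markers)
    for c in chars_to_remove:
        text = text.replace(c, "")
    # Put a space on both sides of the characters we want as separate tokens
    for c in chars_to_separate:
        text = text.replace(c, " " + c + " ")
    return text.split()

chars_to_separate = [".", ",", "!", "?", ":", ";", "(", ")", "[", "]"]
chars_to_remove = ["_"]

sample = "a patient (for I had _now_ returned to civil practice), when"
tokens = separate_punctuation(sample, chars_to_separate, chars_to_remove)
print(tokens)
```

Keeping the characters in lists means that handling a new special character is a one-item change rather than another line of code.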

In [40]:
sherlock_text_intermediate = sherlock_text
sherlock_text_intermediate = sherlock_text_intermediate.replace("."," . ")
sherlock_text_intermediate = sherlock_text_intermediate.replace(","," , ")
sherlock_text_intermediate = sherlock_text_intermediate.replace("!"," ! ")
sherlock_text_intermediate = sherlock_text_intermediate.replace("?"," ? ")
sherlock_text_intermediate = sherlock_text_intermediate.replace(":"," : ")
sherlock_text_intermediate = sherlock_text_intermediate.replace(";"," ; ")
sherlock_text_intermediate = sherlock_text_intermediate.replace("“"," “ ")
sherlock_text_intermediate = sherlock_text_intermediate.replace("”"," ” ")
sherlock_text_intermediate = sherlock_text_intermediate.replace("’"," ’ ")
sherlock_text_intermediate = sherlock_text_intermediate.replace("‘"," ‘ ")
sherlock_text_intermediate = sherlock_text_intermediate.replace("-"," - ")
sherlock_text_intermediate = sherlock_text_intermediate.replace("—"," - ")
sherlock_text_intermediate = sherlock_text_intermediate.replace("_","")


print(sherlock_text_intermediate[:1000])

Conan Doyle

This eBook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with almost no restrictions
whatsoever .  You may copy it ,  give it away or re - use it under the terms
of the Project Gutenberg License included with this eBook or online at
www . gutenberg . org .  If you are not located in the United States ,  you
will have to check the laws of the country where you are located before
using this eBook . 

Title :  The Adventures of Sherlock Holmes

Author :  Arthur Conan Doyle

Release Date :  November 29 ,  2002 [eBook #1661]
[Most recently updated :  October 10 ,  2023]

Language :  English

Character set encoding :  UTF - 8

Produced by :  an anonymous Project Gutenberg volunteer and Jose Menendez

*** START OF THE PROJECT GUTENBERG EBOOK THE ADVENTURES OF SHERLOCK
HOLMES ***




The Adventures of Sherlock Holmes

by Arthur Conan Doyle


Contents

   I .      A Scandal in Bohemia
   


## What if I wanted it broken down by sentences?

In this example, suppose we want the text
* broken down into words
* with no punctuation
* structured by sentence (a list of sentences, each of which is a list of words)

In [41]:
#split into lists by period
sherlock_sentences = sherlock_text.split(".")
print(sherlock_sentences[:100])

['Conan Doyle\r\n\r\nThis eBook is for the use of anyone anywhere in the United States and\r\nmost other parts of the world at no cost and with almost no restrictions\r\nwhatsoever', ' You may copy it, give it away or re-use it under the terms\r\nof the Project Gutenberg License included with this eBook or online at\r\nwww', 'gutenberg', 'org', ' If you are not located in the United States, you\r\nwill have to check the laws of the country where you are located before\r\nusing this eBook', '\r\n\r\nTitle: The Adventures of Sherlock Holmes\r\n\r\nAuthor: Arthur Conan Doyle\r\n\r\nRelease Date: November 29, 2002 [eBook #1661]\r\n[Most recently updated: October 10, 2023]\r\n\r\nLanguage: English\r\n\r\nCharacter set encoding: UTF-8\r\n\r\nProduced by: an anonymous Project Gutenberg volunteer and Jose Menendez\r\n\r\n*** START OF THE PROJECT GUTENBERG EBOOK THE ADVENTURES OF SHERLOCK\r\nHOLMES ***\r\n\r\n\r\n\r\n\r\nThe Adventures of Sherlock Holmes\r\n\r\nby Arthur Conan Doyle\r\n\r\n\r\n

In [43]:
chars_to_remove = [",","!","?",";",":","“","”","’","‘"]
chars_to_change_to_spaces = ["-","—","\r\n"]

for idx in range(len(sherlock_sentences)):
    sherlock_sentences[idx] = sherlock_sentences[idx].lower() #convert the sentence to lowercase
    for c in chars_to_remove:
        sherlock_sentences[idx] = sherlock_sentences[idx].replace(c,"") #replace those characters with the empty string
    for c in chars_to_change_to_spaces:
        sherlock_sentences[idx] = sherlock_sentences[idx].replace(c," ") #replace those characters with a space
    sherlock_sentences[idx] = sherlock_sentences[idx].split() #break the sentence into a list of words

print(sherlock_sentences[:100])

[['conan', 'doyle', 'this', 'ebook', 'is', 'for', 'the', 'use', 'of', 'anyone', 'anywhere', 'in', 'the', 'united', 'states', 'and', 'most', 'other', 'parts', 'of', 'the', 'world', 'at', 'no', 'cost', 'and', 'with', 'almost', 'no', 'restrictions', 'whatsoever'], ['you', 'may', 'copy', 'it', 'give', 'it', 'away', 'or', 're', 'use', 'it', 'under', 'the', 'terms', 'of', 'the', 'project', 'gutenberg', 'license', 'included', 'with', 'this', 'ebook', 'or', 'online', 'at', 'www'], ['gutenberg'], ['org'], ['if', 'you', 'are', 'not', 'located', 'in', 'the', 'united', 'states', 'you', 'will', 'have', 'to', 'check', 'the', 'laws', 'of', 'the', 'country', 'where', 'you', 'are', 'located', 'before', 'using', 'this', 'ebook'], ['title', 'the', 'adventures', 'of', 'sherlock', 'holmes', 'author', 'arthur', 'conan', 'doyle', 
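Splitting on every period also breaks `www.gutenberg.org` into three "sentences". One possible refinement is to split only where sentence-ending punctuation is followed by whitespace. This is a sketch of that idea, with `split_sentences` as a hypothetical helper and a made-up sample.

```python
import re

def split_sentences(text):
    # Split at whitespace that comes right after ., !, or ?
    # (the lookbehind keeps the punctuation attached to its sentence)
    pieces = re.split(r"(?<=[.!?])\s+", text.strip())
    return [piece for piece in pieces if piece]

sample = "You may copy it. See www.gutenberg.org. Is that clear? Yes!"
sentences = split_sentences(sample)
print(sentences)
```

This still has blind spots: an abbreviation followed by a space, such as a title before a name, would be treated as the end of a sentence.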

## Exercise

What if we wanted to convert all of the uppercase letters to lowercase? Edit the code to do this to each sentence.

Recall, you can use the `.lower()` string method.

In [97]:
my_string = "here’s another VACANCY on the LEAGUE of the Red-headed Men"
my_string_lower = my_string.lower()
print(my_string_lower)

here’s another vacancy on the league of the red-headed men


## What if I wanted it broken down by paragraph?

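This time, we'll leave punctuation in.

Note that splitting on `\r\n` gives one entry per line of the file, so a paragraph that wraps over several lines ends up spread over several entries, and the blank lines between paragraphs show up as empty strings.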

In [94]:
sherlock_paragraphs = sherlock_text.split("\r\n")
print(sherlock_paragraphs[:100]) #look at the first few paragraphs

['Conan Doyle', '', 'This eBook is for the use of anyone anywhere in the United States and', 'most other parts of the world at no cost and with almost no restrictions', 'whatsoever. You may copy it, give it away or re-use it under the terms', 'of the Project Gutenberg License included with this eBook or online at', 'www.gutenberg.org. If you are not located in the United States, you', 'will have to check the laws of the country where you are located before', 'using this eBook.', '', 'Title: The Adventures of Sherlock Holmes', '', 'Author: Arthur Conan Doyle', '', 'Release Date: November 29, 2002 [eBook #1661]', '[Most recently updated: October 10, 2023]', '', 'Language: English', '', 'Character set encoding: UTF-8', '', 'Produced by: an anonymous Project Gutenberg volunteer and Jose Menendez', '', '*** START OF THE PROJECT GUTENBERG EBOOK THE ADVENTURES OF SHERLOCK', 'HOLMES ***', '', '', '', '', 'The Adventures of Sherlock Holmes', '', 'by Arthur Conan Doyle', '', '', 'Contents', '', 

In [95]:
chars_to_separate = [",","!","?",";",":","“","”","’","‘","-","—","."]

for idx in range(len(sherlock_paragraphs)):
    for c in chars_to_separate:
        sherlock_paragraphs[idx] = sherlock_paragraphs[idx].replace(c," "+c+" ") #put a space before and after the character

    sherlock_paragraphs[idx] = sherlock_paragraphs[idx].split()

print(sherlock_paragraphs[:50])

[['Conan', 'Doyle'], [], ['This', 'eBook', 'is', 'for', 'the', 'use', 'of', 'anyone', 'anywhere', 'in', 'the', 'United', 'States', 'and'], ['most', 'other', 'parts', 'of', 'the', 'world', 'at', 'no', 'cost', 'and', 'with', 'almost', 'no', 'restrictions'], ['whatsoever', '.', 'You', 'may', 'copy', 'it', ',', 'give', 'it', 'away', 'or', 're', '-', 'use', 'it', 'under', 'the', 'terms'], ['of', 'the', 'Project', 'Gutenberg', 'License', 'included', 'with', 'this', 'eBook', 'or', 'online', 'at'], ['www', '.', 'gutenberg', '.', 'org', '.', 'If', 'you', 'are', 'not', 'located', 'in', 'the', 'United', 'States', ',', 'you'], ['will', 'have', 'to', 'check', 'the', 'laws', 'of', 'the', 'country', 'where', 'you', 'are', 'located', 'before'], ['using', 'this', 'eBook', '.'], [], ['Title', ':', 'The', 'Adventures', 'of', 'Sherlock', 'Holmes'], [], ['Author', ':', 'Arthur', 'Conan', 'Doyle'], [], ['Release', 'Date', ':', 'November', '29', ',', '2002', '[eBook', '#1661]'], ['[Most', 'recently', 'update
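If you want one entry per actual paragraph rather than per line, you can split on blank lines instead. This is a minimal sketch, assuming paragraphs are separated by at least one empty line; `split_paragraphs` is a hypothetical helper and the sample is a made-up miniature of the file.

```python
import re

def split_paragraphs(text):
    # Normalize Windows line endings, then split wherever there is a blank line
    normalized = text.replace("\r\n", "\n")
    blocks = re.split(r"\n\s*\n", normalized)
    # Rejoin the wrapped lines of each block and drop blocks that are empty
    return [" ".join(block.split()) for block in blocks if block.strip()]

sample = (
    "Title: The Adventures\r\n\r\n"
    "This eBook is for the use\r\nof anyone anywhere.\r\n\r\n\r\n\r\n"
    "Contents\r\n"
)
paragraphs = split_paragraphs(sample)
print(paragraphs)
```

Each paragraph comes back as a single string, so the punctuation-separating loop above could be applied to these entries in the same way.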

## Exercise

Remove empty paragraphs from `sherlock_paragraphs`.

In [96]:
# Remove empty paragraphs
sherlock_paragraphs = [paragraph for paragraph in sherlock_paragraphs if paragraph]

# Print the result
sherlock_paragraphs

[['Conan', 'Doyle'],
 ['This',
  'eBook',
  'is',
  'for',
  'the',
  'use',
  'of',
  'anyone',
  'anywhere',
  'in',
  'the',
  'United',
  'States',
  'and'],
 ['most',
  'other',
  'parts',
  'of',
  'the',
  'world',
  'at',
  'no',
  'cost',
  'and',
  'with',
  'almost',
  'no',
  'restrictions'],
 ['whatsoever',
  '.',
  'You',
  'may',
  'copy',
  'it',
  ',',
  'give',
  'it',
  'away',
  'or',
  're',
  '-',
  'use',
  'it',
  'under',
  'the',
  'terms'],
 ['of',
  'the',
  'Project',
  'Gutenberg',
  'License',
  'included',
  'with',
  'this',
  'eBook',
  'or',
  'online',
  'at'],
 ['www',
  '.',
  'gutenberg',
  '.',
  'org',
  '.',
  'If',
  'you',
  'are',
  'not',
  'located',
  'in',
  'the',
  'United',
  'States',
  ',',
  'you'],
 ['will',
  'have',
  'to',
  'check',
  'the',
  'laws',
  'of',
  'the',
  'country',
  'where',
  'you',
  'are',
  'located',
  'before'],
 ['using', 'this', 'eBook', '.'],
 ['Title', ':', 'The', 'Adventures', 'of', 'Sherlock', 'Hol

## Working with HTML data

Most data you retrieve from the web is not plain text - it usually has lots of HTML tags like `<title>`, `<br>`, and `<p>`.


In [47]:
import requests

response = requests.get("https://en.wikipedia.org/wiki/Sherlock_Holmes")

print(response)
print(response.headers)

<Response [200]>
{'date': 'Sun, 19 Nov 2023 17:11:24 GMT', 'vary': 'Accept-Encoding,Cookie', 'server': 'ATS/9.1.4', 'x-content-type-options': 'nosniff', 'content-language': 'en', 'accept-ch': '', 'last-modified': 'Mon, 13 Nov 2023 21:00:37 GMT', 'content-type': 'text/html; charset=UTF-8', 'content-encoding': 'gzip', 'age': '25598', 'x-cache': 'cp1106 hit, cp1106 hit/12', 'x-cache-status': 'hit-front', 'server-timing': 'cache;desc="hit-front", host;desc="cp1106"', 'strict-transport-security': 'max-age=106384710; includeSubDomains; preload', 'report-to': '{ "group": "wm_nel", "max_age": 604800, "endpoints": [{ "url": "https://intake-logging.wikimedia.org/v1/events?stream=w3c.reportingapi.network_error&schema_uri=/w3c/reportingapi/network_error/1.0.0" }] }', 'nel': '{ "report_to": "wm_nel", "max_age": 604800, "failure_fraction": 0.05, "success_fraction": 0.0}', 'set-cookie': 'WMF-Last-Access=20-Nov-2023;Path=/;HttpOnly;secure;Expires=Fri, 22 Dec 2023 00:00:00 GMT, WMF-Last-Access-Global=2

In [48]:
response.text[:3000]

'<!DOCTYPE html>\n<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-zebra-design-disabled vector-feature-custom-font-size-clientpref-0 vector-feature-client-preferences-disabled vector-feature-typography-survey-disabled vector-toc-available" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8">\n<title>Sherlock Holmes - Wikipedia</title>\n<script>(function(){var className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width

## Beautiful Soup

The Beautiful Soup package is great for *parsing* and manipulating HTML: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

In [49]:
from bs4 import BeautifulSoup
import requests

response = requests.get("https://en.wikipedia.org/wiki/Sherlock_Holmes")
sherlock_wiki_html = BeautifulSoup(response.text, 'html.parser')

You can look for a title tag:

In [50]:
print(sherlock_wiki_html.title)

<title>Sherlock Holmes - Wikipedia</title>


Or look for all of the `<a>` tags, which are the links to other pages:

In [51]:
list_of_links = sherlock_wiki_html.find_all('a')
for link in list_of_links[:100]:
    print(link.get('href'))

#bodyContent
/wiki/Main_Page
/wiki/Wikipedia:Contents
/wiki/Portal:Current_events
/wiki/Special:Random
/wiki/Wikipedia:About
//en.wikipedia.org/wiki/Wikipedia:Contact_us
https://donate.wikimedia.org/wiki/Special:FundraiserRedirector?utm_source=donate&utm_medium=sidebar&utm_campaign=C13_en.wikipedia.org&uselang=en
/wiki/Help:Contents
/wiki/Help:Introduction
/wiki/Wikipedia:Community_portal
/wiki/Special:RecentChanges
/wiki/Wikipedia:File_upload_wizard
/wiki/Main_Page
/wiki/Special:Search
/w/index.php?title=Special:CreateAccount&returnto=Sherlock+Holmes
/w/index.php?title=Special:UserLogin&returnto=Sherlock+Holmes
/w/index.php?title=Special:CreateAccount&returnto=Sherlock+Holmes
/w/index.php?title=Special:UserLogin&returnto=Sherlock+Holmes
/wiki/Help:Introduction
/wiki/Special:MyContributions
/wiki/Special:MyTalk
#
#Inspiration_for_the_character
#Fictional_character_biography
#Family_and_early_life
#Life_with_Watson
#Practice
#The_Great_Hiatus
#Retirement
#Personality_and_habits
#Drug_us

## Extracting text with Beautiful Soup

Use the `.get_text()` method on the soup object

In [52]:
sherlock_wiki_text = sherlock_wiki_html.get_text()

sherlock_wiki_text[:2000]

'\n\n\n\nSherlock Holmes - Wikipedia\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nJump to content\n\n\n\n\n\n\n\nMain menu\n\n\n\n\n\nMain menu\nmove to sidebar\nhide\n\n\n\n\t\tNavigation\n\t\n\n\nMain pageContentsCurrent eventsRandom articleAbout WikipediaContact usDonate\n\n\n\n\n\n\t\tContribute\n\t\n\n\nHelpLearn to editCommunity portalRecent changesUpload file\n\n\n\n\n\nLanguages\n\nLanguage links are at the top of the page across from the title.\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nSearch\n\n\n\n\n\n\n\n\n\n\n\nSearch\n\n\n\n\n\n\n\n\nCreate accountLog in\n\n\n\n\n\n\nPersonal tools\n\n\n\n\n\n Create account Log in\n\n\n\n\n\n\t\tPages for logged out editors learn more\n\n\n\nContributionsTalk\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nContents\nmove to sidebar\nhide\n\n\n\n\n(Top)\n\n\n\n\n\n1Inspiration for the character\n\n\n\n\n\n\n\n2Fictional character biography\n\n\n\nToggle Fictional character biography subsection\n\n\n\n\n\n2.1Famil

In [53]:
sherlock_wiki_no_lines = sherlock_wiki_text.replace("\n"," ")
sherlock_wiki_no_lines[:2000]

'    Sherlock Holmes - Wikipedia                                   Jump to content        Main menu      Main menu move to sidebar hide    \t\tNavigation \t   Main pageContentsCurrent eventsRandom articleAbout WikipediaContact usDonate      \t\tContribute \t   HelpLearn to editCommunity portalRecent changesUpload file      Languages  Language links are at the top of the page across from the title.                    Search            Search         Create accountLog in       Personal tools       Create account Log in      \t\tPages for logged out editors learn more    ContributionsTalk                           Contents move to sidebar hide     (Top)      1Inspiration for the character        2Fictional character biography    Toggle Fictional character biography subsection      2.1Family and early life        2.2Life with Watson        2.3Practice        2.4The Great Hiatus        2.5Retirement          3Personality and habits    Toggle Personality and habits subsection      3.1Drug us
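Replacing newlines with spaces still leaves long runs of spaces and tabs. If you want a clean string to read, one option is to split on whitespace and join the pieces back with single spaces. The sample below is made up to resemble the page text.

```python
# Made-up sample with the same kind of newline and tab clutter
sample = "\n\n\n\nSherlock Holmes - Wikipedia\n\n\n\t\tNavigation\n\t\n\nJump to content\n"

# split() with no argument treats any run of spaces, tabs, or newlines as one separator
collapsed = " ".join(sample.split())
print(collapsed)
```

For tokenizing, this cleanup is optional, because the later call to `.split()` ignores the extra whitespace anyway.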

In [54]:
chars_to_separate = [",","!","?",";",":","\"","\'","-",".","(",")"]

for c in chars_to_separate:
    sherlock_wiki_no_lines = sherlock_wiki_no_lines.replace(c," "+c+" ")

sherlock_wiki_no_lines[:2000]

'    Sherlock Holmes  -  Wikipedia                                   Jump to content        Main menu      Main menu move to sidebar hide    \t\tNavigation \t   Main pageContentsCurrent eventsRandom articleAbout WikipediaContact usDonate      \t\tContribute \t   HelpLearn to editCommunity portalRecent changesUpload file      Languages  Language links are at the top of the page across from the title .                     Search            Search         Create accountLog in       Personal tools       Create account Log in      \t\tPages for logged out editors learn more    ContributionsTalk                           Contents move to sidebar hide      ( Top )       1Inspiration for the character        2Fictional character biography    Toggle Fictional character biography subsection      2 . 1Family and early life        2 . 2Life with Watson        2 . 3Practice        2 . 4The Great Hiatus        2 . 5Retirement          3Personality and habits    Toggle Personality and habits subsecti

In [55]:
sherlock_wiki_tokens = sherlock_wiki_no_lines.split()
print(sherlock_wiki_tokens[:500])

['Sherlock', 'Holmes', '-', 'Wikipedia', 'Jump', 'to', 'content', 'Main', 'menu', 'Main', 'menu', 'move', 'to', 'sidebar', 'hide', 'Navigation', 'Main', 'pageContentsCurrent', 'eventsRandom', 'articleAbout', 'WikipediaContact', 'usDonate', 'Contribute', 'HelpLearn', 'to', 'editCommunity', 'portalRecent', 'changesUpload', 'file', 'Languages', 'Language', 'links', 'are', 'at', 'the', 'top', 'of', 'the', 'page', 'across', 'from', 'the', 'title', '.', 'Search', 'Search', 'Create', 'accountLog', 'in', 'Personal', 'tools', 'Create', 'account', 'Log', 'in', 'Pages', 'for', 'logged', 'out', 'editors', 'learn', 'more', 'ContributionsTalk', 'Contents', 'move', 'to', 'sidebar', 'hide', '(', 'Top', ')', '1Inspiration', 'for', 'the', 'character', '2Fictional', 'character', 'biography', 'Toggle', 'Fictional', 'character', 'biography', 'subsection', '2', '.', '1Family', 'and', 'early', 'life', '2', '.', '2Life', 'with', 'Watson', '2', '.', '3Practice', '2', '.', '4The', 'Great', 'Hiatus', '2', '.', '

## Exercise

Suppose you needed to tokenize lots of Wikipedia pages like this. Can you come up with a strategy for jumping straight to the content like we did with the Project Gutenberg book?
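
One possible strategy, sketched below, is to reuse the marker-based slicing from the Project Gutenberg example: find a string that reliably precedes the article body and one that follows it, then keep only what lies between. The helper name `extract_between` and both marker strings are assumptions for illustration; check them against the actual page text before relying on them.

```python
def extract_between(text, start_marker, end_marker):
    """Return the text between two markers, or the full text if either is missing."""
    start = text.find(start_marker)
    if start == -1:
        return text
    start += len(start_marker)
    # Search for the end marker only after the start marker
    end = text.find(end_marker, start)
    if end == -1:
        return text
    return text[start:end].strip()

# Toy stand-in for the scraped page text; the markers are hypothetical
toy_page = (
    "Main menu Search From Wikipedia, the free encyclopedia "
    "Sherlock Holmes is a fictional detective. References Retrieved from"
)
content = extract_between(
    toy_page, "From Wikipedia, the free encyclopedia", "References"
)
print(content)
```

A marker such as "References" can also occur inside the article body, so a sturdier approach would be to select the content element with Beautiful Soup before extracting the text.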

## NLTK Tokenizers

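NLTK includes several tokenizers. The pretrained `punkt` model is the most popular: it handles sentence splitting, and `nltk.word_tokenize` relies on it before splitting each sentence into words.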

It can tokenize by words:


In [56]:
import nltk
import requests

#nltk.download("punkt") #need to do this the first time you run it

response = requests.get("https://www.gutenberg.org/files/1661/1661-0.txt")
sherlock_raw_text = response.text

sherlock_words = nltk.word_tokenize(sherlock_raw_text)
print(sherlock_words[:1000])

['ï', '»', '¿The', 'Project', 'Gutenberg', 'eBook', 'of', 'The', 'Adventures', 'of', 'Sherlock', 'Holmes', ',', 'by', 'Arthur', 'Conan', 'Doyle', 'This', 'eBook', 'is', 'for', 'the', 'use', 'of', 'anyone', 'anywhere', 'in', 'the', 'United', 'States', 'and', 'most', 'other', 'parts', 'of', 'the', 'world', 'at', 'no', 'cost', 'and', 'with', 'almost', 'no', 'restrictions', 'whatsoever', '.', 'You', 'may', 'copy', 'it', ',', 'give', 'it', 'away', 'or', 're-use', 'it', 'under', 'the', 'terms', 'of', 'the', 'Project', 'Gutenberg', 'License', 'included', 'with', 'this', 'eBook', 'or', 'online', 'at', 'www.gutenberg.org', '.', 'If', 'you', 'are', 'not', 'located', 'in', 'the', 'United', 'States', ',', 'you', 'will', 'have', 'to', 'check', 'the', 'laws', 'of', 'the', 'country', 'where', 'you', 'are', 'located', 'before', 'using', 'this', 'eBook', '.', 'Title', ':', 'The', 'Adventures', 'of', 'Sherlock', 'Holmes', 'Author', ':', 'Arthur', 'Conan', 'Doyle', 'Release', 'Date', ':', 'November', '29

or by sentences:

In [57]:
import nltk
import requests

#nltk.download("punkt") #only need to do this once

response = requests.get("https://www.gutenberg.org/files/1661/1661-0.txt")
sherlock_raw_text = response.text

sherlock_sentences = nltk.sent_tokenize(sherlock_raw_text)
print(sherlock_sentences[:100])

['ï»¿The Project Gutenberg eBook of The Adventures of Sherlock Holmes,\r\nby Arthur Conan Doyle\r\n\r\nThis eBook is for the use of anyone anywhere in the United States and\r\nmost other parts of the world at no cost and with almost no restrictions\r\nwhatsoever.', 'You may copy it, give it away or re-use it under the terms\r\nof the Project Gutenberg License included with this eBook or online at\r\nwww.gutenberg.org.', 'If you are not located in the United States, you\r\nwill have to check the laws of the country where you are located before\r\nusing this eBook.', 'Title: The Adventures of Sherlock Holmes\r\n\r\nAuthor: Arthur Conan Doyle\r\n\r\nRelease Date: November 29, 2002 [eBook #1661]\r\n[Most recently updated: October 10, 2023]\r\n\r\nLanguage: English\r\n\r\nCharacter set encoding: UTF-8\r\n\r\nProduced by: an anonymous Project Gutenberg volunteer and Jose Menendez\r\n\r\n*** START OF THE PROJECT GUTENBERG EBOOK THE ADVENTURES OF SHERLOCK\r\nHOLMES ***\r\n\r\n\r\n\r\n\r\nThe A

## Exercise

It seems that there are still some strange characters - can you preprocess the text to fix them before using the NLTK tokenizer?

Could you structure the words by sentences like we did earlier?
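
For the first question, the strange leading characters (`ï»¿`) look like a UTF-8 byte order mark that was decoded with the wrong encoding. The sketch below shows the effect on a short byte string and one way to avoid it. It assumes the download really is UTF-8 with a BOM; with `requests`, the equivalent would be decoding `response.content` yourself instead of using `response.text`.

```python
# Simulate the first bytes of a UTF-8 file that starts with a byte order mark (BOM)
raw_bytes = "\ufeffThe Project Gutenberg eBook".encode("utf-8")

# Decoding the bytes as Latin-1 turns the three BOM bytes into three stray characters
misdecoded = raw_bytes.decode("latin-1")

# The "utf-8-sig" codec decodes UTF-8 and drops a leading BOM if one is present
fixed = raw_bytes.decode("utf-8-sig")

print(repr(misdecoded[:10]))
print(repr(fixed[:10]))
```

With the real download, `sherlock_raw_text = response.content.decode("utf-8-sig")` would apply the same fix before the text is passed to the NLTK tokenizer.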

## Automatic Tokenizers

Rather than having to program specific rules for how to tokenize your text, you could learn to do it automatically.

Two popular algorithms:
* Byte-Pair Encoding tokenization (used by OpenAI's GPT)
* WordPiece tokenization (used by Google's BERT)

Main idea:
* normalize and pre-tokenize the text - for example, split it into words with rule-based tokenization like we used above
* start with a vocabulary in which each character is its own token
* find the most frequent pair of adjacent tokens and merge it into a new token (this is the BPE rule; WordPiece scores candidate pairs with a different formula)
* keep merging until the vocabulary reaches the desired size

Frequent words - don't break them apart

Less-frequent words - represent them as several subwords

For WordPiece, `##` represents a partial word
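
The merge loop described in the main idea can be sketched in a few lines. This is a minimal illustration of BPE-style training on a toy word-frequency table, not the implementation used by any real tokenizer library; the function names and the toy counts are assumptions made for the example.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs, weighted by word frequency, and return the top one."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of `pair` in every word with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_bpe(word_counts, num_merges):
    """Start from single characters and apply `num_merges` greedy merges."""
    words = {tuple(word): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        words = merge_pair(words, pair)
        merges.append(pair)
    return merges, words

# Toy pre-tokenized corpus: word -> number of occurrences (made-up counts)
toy_counts = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
merges, segmented = learn_bpe(toy_counts, num_merges=3)
print(merges)
print(segmented)
```

Frequent character sequences such as `ug` and `un` get merged first, which is why common words end up as single tokens while rarer words stay split into several subwords.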

In [58]:
import requests
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")


response = requests.get("https://www.gutenberg.org/files/1661/1661-0.txt")
sherlock_raw_text = response.text

sherlock_hf_tokens = tokenizer.tokenize( sherlock_raw_text )
print(sherlock_hf_tokens[:1000])

Downloading tokenizer.json:   0%|          | 0.00/436k [00:00<?, ?B/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (143245 > 512). Running this sequence through the model will result in indexing errors


['ï', '»', '¿', 'The', 'Project', 'G', '##ute', '##nberg', 'e', '##B', '##ook', 'of', 'The', 'Adventures', 'of', 'Sherlock', 'Holmes', ',', 'by', 'Arthur', 'Conan', 'Doyle', 'This', 'e', '##B', '##ook', 'is', 'for', 'the', 'use', 'of', 'anyone', 'anywhere', 'in', 'the', 'United', 'States', 'and', 'most', 'other', 'parts', 'of', 'the', 'world', 'at', 'no', 'cost', 'and', 'with', 'almost', 'no', 'restrictions', 'whatsoever', '.', 'You', 'may', 'copy', 'it', ',', 'give', 'it', 'away', 'or', 're', '-', 'use', 'it', 'under', 'the', 'terms', 'of', 'the', 'Project', 'G', '##ute', '##nberg', 'License', 'included', 'with', 'this', 'e', '##B', '##ook', 'or', 'online', 'at', 'www', '.', 'gut', '##enberg', '.', 'org', '.', 'If', 'you', 'are', 'not', 'located', 'in', 'the', 'United', 'States', ',', 'you', 'will', 'have', 'to', 'check', 'the', 'laws', 'of', 'the', 'country', 'where', 'you', 'are', 'located', 'before', 'using', 'this', 'e', '##B', '##ook', '.', 'Title', ':', 'The', 'Adventures', 'of'

## Applied Exploration

Find some new text and tokenize it using one or more of the methods discussed here.

Use it as input for the Markov Chain in the previous set of notes.

Describe what you did and record notes about your results.



## New text

I will be using [Dracula](https://www.gutenberg.org/cache/epub/345/pg345-images.html) from Project Gutenberg. The code below downloads the plain-text version of the same book.

In [64]:
import requests

response_dracula = requests.get("https://www.gutenberg.org/cache/epub/345/pg345.txt")

print(response_dracula)
print(response_dracula.headers)

<Response [200]>
{'date': 'Mon, 20 Nov 2023 00:39:11 GMT', 'server': 'Apache', 'last-modified': 'Sun, 12 Nov 2023 07:04:26 GMT', 'accept-ranges': 'bytes', 'content-length': '890355', 'x-backend': 'gutenweb1', 'content-type': 'text/plain; charset=utf-8'}


In [65]:
start_text = "*** START OF THE PROJECT GUTENBERG EBOOK DRACULA ***"
end_text   = "*** END OF THE PROJECT GUTENBERG EBOOK DRACULA ***"

start_index = response_dracula.text.find(start_text)+len(start_text)
end_index = response_dracula.text.find(end_text)

print("Start and end index of the text",start_index,end_index)
dracula_text = response_dracula.text[start_index:end_index]

print(dracula_text[:1000])

Start and end index of the text 723 862152




                                DRACULA

                                  _by_

                              Bram Stoker

                        [Illustration: colophon]

                                NEW YORK

                            GROSSET & DUNLAP

                              _Publishers_

      Copyright, 1897, in the United States of America, according
                   to Act of Congress, by Bram Stoker

                        [_All rights reserved._]

                      PRINTED IN THE UNITED STATES
                                   AT
               THE COUNTRY LIFE PRESS, GARDEN CITY, N.Y.




                                   TO

                             MY DEAR FRIEND

                               HOMMY-BEG




Contents

CHAPTER I. Jonathan Harker’s Journal
CHAPTER II. Jonathan Harker’s Journal
CHAPTER III. Jonathan Harker’s Journal
CHAPTER IV. Jonathan Harker’s Journal
CHAPTER V. Letters—Lucy and Mina


In [68]:
def preprocess_text(text):
    text = text.replace(".", " . ")
    text = text.replace(",", " , ")
    text = text.replace("!", " ! ")
    text = text.replace("?", " ? ")
    text = text.replace(":", " : ")
    text = text.replace(";", " ; ")
    text = text.replace("“", " “ ")
    text = text.replace("”", " ” ")
    text = text.replace("’", " ’ ")
    text = text.replace("‘", " ‘ ")
    text = text.replace("-", " - ")
    text = text.replace("—", " - ")
    return text

dracula_text_intermediate = preprocess_text(dracula_text)

print(dracula_text_intermediate[:1000])






                                DRACULA

                                  _by_

                              Bram Stoker

                        [Illustration :  colophon]

                                NEW YORK

                            GROSSET & DUNLAP

                              _Publishers_

      Copyright ,  1897 ,  in the United States of America ,  according
                   to Act of Congress ,  by Bram Stoker

                        [_All rights reserved . _]

                      PRINTED IN THE UNITED STATES
                                   AT
               THE COUNTRY LIFE PRESS ,  GARDEN CITY ,  N . Y . 




                                   TO

                             MY DEAR FRIEND

                               HOMMY - BEG




Contents

CHAPTER I .  Jonathan Harker ’ s Journal
CHAPTER II .  Jonathan Harker ’ s Journal
CHAPTER III .  Jonathan Harker ’ s Journal
CHAPTER IV .  Jonathan Harker ’ s Jour


In [73]:
def preprocess_and_print(text):
    dracula_paragraphs = text.split("\r\n")

    for i in range(len(dracula_paragraphs)):
        dracula_paragraphs[i] = dracula_paragraphs[i].replace(".", " . ")
        dracula_paragraphs[i] = dracula_paragraphs[i].replace(",", " , ")
        dracula_paragraphs[i] = dracula_paragraphs[i].replace("!", " ! ")
        dracula_paragraphs[i] = dracula_paragraphs[i].replace("?", " ? ")
        dracula_paragraphs[i] = dracula_paragraphs[i].replace(":", " : ")
        dracula_paragraphs[i] = dracula_paragraphs[i].replace(";", " ; ")
        dracula_paragraphs[i] = dracula_paragraphs[i].replace("“", " “ ")
        dracula_paragraphs[i] = dracula_paragraphs[i].replace("”", " ” ")
        dracula_paragraphs[i] = dracula_paragraphs[i].replace("’", " ’ ")
        dracula_paragraphs[i] = dracula_paragraphs[i].replace("‘", " ‘ ")
        dracula_paragraphs[i] = dracula_paragraphs[i].replace("-", " - ")
        dracula_paragraphs[i] = dracula_paragraphs[i].replace("—", " - ")
        dracula_paragraphs[i] = dracula_paragraphs[i].replace("[", " [ ")
        dracula_paragraphs[i] = dracula_paragraphs[i].replace("]", " ] ")

    # Remove empty paragraphs
    dracula_paragraphs = [paragraph for paragraph in dracula_paragraphs if paragraph]

    # Print a preview and return the list so the caller can keep using it
    print(dracula_paragraphs[:600])
    return dracula_paragraphs

tokenized_data = preprocess_and_print(dracula_text_intermediate)


['                                DRACULA', '                                  _by_', '                              Bram Stoker', '                         [ Illustration  :   colophon ] ', '                                NEW YORK', '                            GROSSET & DUNLAP', '                              _Publishers_', '      Copyright  ,   1897  ,   in the United States of America  ,   according', '                   to Act of Congress  ,   by Bram Stoker', '                         [ _All rights reserved  .  _ ] ', '                      PRINTED IN THE UNITED STATES', '                                   AT', '               THE COUNTRY LIFE PRESS  ,   GARDEN CITY  ,   N  .  Y  .  ', '                                   TO', '                             MY DEAR FRIEND', '                               HOMMY  -  BEG', 'Contents', 'CHAPTER I  .   Jonathan Harker  ’  s Journal', 'CHAPTER II  .   Jonathan Harker  ’  s Journal', 'CHAPTER III  .   Jonathan Harker  ’  s Journal', 'CH

## Markov Chain 

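This is the Markov Chain from the previous set of notes. I have included the earlier preprocessing code as the function `preprocess()`, so the text is preprocessed and then fed directly into the Markov Chain. The results are poor: `preprocess()` returns a list of whole lines rather than individual words, so each state in the model is a pair of lines and the start words `("and", "the")` never occur as a state.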
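
A word-level variant would flatten the text into a single list of tokens before training. The sketch below is one way to do that, assuming the same punctuation marks as the preprocessing above; `tokenize_words`, `PUNCTUATION`, and the sample sentence are hypothetical names and data for this example, and the small counting loop only mirrors the idea of the model's `train` method.

```python
from collections import defaultdict

# Punctuation to separate from words (assumed to match the marks handled above)
PUNCTUATION = ".,!?:;“”’‘-—_[]"

def tokenize_words(text):
    """Lowercase the text, pad punctuation with spaces, and split into word tokens."""
    text = text.lower()
    for mark in PUNCTUATION:
        text = text.replace(mark, " " + mark + " ")
    # split() with no argument also discards newlines and repeated spaces
    return text.split()

sample = "I saw the Count. The Count saw me!"
tokens = tokenize_words(sample)
print(tokens)

# Count second-order transitions over the flat token list
order = 2
transition_counts = defaultdict(lambda: defaultdict(int))
for idx in range(len(tokens) - order):
    state = tuple(tokens[idx:idx + order])
    transition_counts[state][tokens[idx + order]] += 1

print(dict(transition_counts[("the", "count")]))
```

Because the states are now tuples of words, start words such as `("and", "the")` can match a state as long as that pair occurs somewhere in the text.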


In [112]:
from nltk.corpus import gutenberg
from collections import defaultdict
import networkx as nx
import matplotlib.pyplot as plt
import random
from pprint import pformat

# Your preprocessing function
def preprocess(text):
    text = text.lower()
    paragraphs = text.split("\r\n")

    for i in range(len(paragraphs)):
        paragraphs[i] = paragraphs[i].replace(".", " . ")
        paragraphs[i] = paragraphs[i].replace(",", " , ")
        paragraphs[i] = paragraphs[i].replace("!", " ! ")
        paragraphs[i] = paragraphs[i].replace("?", " ? ")
        paragraphs[i] = paragraphs[i].replace(":", " : ")
        paragraphs[i] = paragraphs[i].replace(";", " ; ")
        paragraphs[i] = paragraphs[i].replace("“", " “ ")
        paragraphs[i] = paragraphs[i].replace("”", " ” ")
        paragraphs[i] = paragraphs[i].replace("’", " ’ ")
        paragraphs[i] = paragraphs[i].replace("‘", " ‘ ")
        paragraphs[i] = paragraphs[i].replace("-", " - ")
        paragraphs[i] = paragraphs[i].replace("_", " _ ")
        paragraphs[i] = paragraphs[i].replace("[", " [ ")
        paragraphs[i] = paragraphs[i].replace("]", " ] ")

        # Remove numerical values
        paragraphs[i] = ''.join(char for char in paragraphs[i] if not char.isdigit())

    # Remove empty paragraphs
    paragraphs = [paragraph for paragraph in paragraphs if paragraph]

    # Return the result
    return paragraphs

class MarkovModel:

    def __init__(self, order=1):
        # empty nested dictionary mapping words to words to ints
        self.transition_counts = defaultdict(lambda: defaultdict(int))
        self.order = order

    def train(self, corpus):
        # loop through the corpus and, for each run of `order` tokens,
        # record the token that follows it in that state's frequency dictionary
        for idx in range(len(corpus) - self.order):
            current_token = tuple(corpus[idx:idx + self.order])
            next_token = corpus[idx + self.order]
            self.transition_counts[current_token][next_token] += 1

    def generate_random_next_word(self, current_words):
        # get the frequency of all words that come after current_words
        possible_words_counts = self.transition_counts[current_words]
        # count up the total of all words that come after current_words
        total_occurrences = sum(possible_words_counts.values())

        # check if there are no occurrences
        if total_occurrences == 0:
            return None  # Handle the case when there are no occurrences

        # we are going to select one occurrence randomly
        random_num = random.randint(1, total_occurrences)

        # subtract words counts from our random number until we hit 0
        # this will hit more frequent words proportionally more often
        for word in possible_words_counts:
            random_num = random_num - possible_words_counts[word]
            if random_num <= 0:
                return word

    def generate_text(self, num=100, start_words=("I",)):
        # a running string to build on with random words
        markov_text = " ".join(start_words) + " "
        curr_words = tuple(start_words)

        # add num random words onto our running string
        for n in range(num):
            next_word = self.generate_random_next_word(curr_words)

            # handle the case when generate_random_next_word returns None
            if next_word is None:
                break

            markov_text += next_word + " "
            curr_words = curr_words[1:] + (next_word,)

        return markov_text

    def print_generated_texts(self, num_texts=5, text_length=100, start_words=("I",)):
        for _ in range(num_texts):
            generated_text = self.generate_text(num=text_length, start_words=start_words)
            print(generated_text)
            print('\n' + '-'*50 + '\n')

    def __str__(self):
        # convert defaultdicts to dicts and format using the pprint formatter
        return pformat({key: dict(self.transition_counts[key]) for key in self.transition_counts})

    def visualize(self, probabilities=False, layout=nx.kamada_kawai_layout):
        # use this method to generate visualizations of small models
        # it will take too long on large texts - don't do it!
        G = nx.DiGraph()

        if probabilities:
            transition_probabilities = defaultdict(dict)
            for current_words, next_words in self.transition_counts.items():
                total_occurrences = sum(next_words.values())
                for next_word, count in next_words.items():
                    transition_probabilities[current_words][next_word] = count / total_occurrences

            for current_words, next_words in transition_probabilities.items():
                for next_word, probability in next_words.items():
                    G.add_edge(current_words, next_word, weight=probability)
        else:
            for current_words, next_words in self.transition_counts.items():
                for next_word, count in next_words.items():
                    G.add_edge(current_words, next_word, weight=count)

        pos = layout(G)
        edge_labels = {edge: f"{G.edges[edge]['weight']}" for edge in G.edges()}
        nx.draw(G, pos, with_labels=True, node_size=500, node_color='lightblue', font_size=10)
        nx.draw_networkx_edge_labels(G, pos, edge_labels=edge_labels)

        plt.title("Markov Model Visualization")
        #plt.figure(figsize=(25, 14))
        #plt.show()

# Preprocess the text
tokenized_data = preprocess(dracula_text_intermediate[1000:])

# Train the model on the initial data
dracula_model = MarkovModel(order=2)  # Use a 2nd-order model
dracula_model.train(tokenized_data)

# Generate text and print the model
dracula_model.print_generated_texts(num_texts=1, text_length=50, start_words=("and", "the"))
print(dracula_model)

# Visualize the model
#dracula_model.visualize(probabilities=False)


and the 

--------------------------------------------------

{('                                  note', 'seven years ago we all went through the flames  ;   and the happiness of'): {'some of us since then is  ,   we think  ,   well worth the pain we endured  .   it': 1},
 ('                                dracula  .  ', 'this then was the un  -  dead home of the king  -  vampire  ,   to whom so many more'): {'were due  .   its emptiness spoke eloquent to make certain what i knew  .  ': 1},
 ('                                the end', '       *       *       *       *       *'): {'                        _ there  ’  s more to follow  !   _ ': 1},
 ('                             _ extra special  .   _ ', '                         the hampstead horror  .  '): {'                         another child injured  .  ': 1},
 ('                           the escaped wolf  .  ', '         perilous adventure of our interviewer  .  '): {'          _ interview with the keeper in the zoölogical gar