## Tokenization Practice and Simple Document Similarity

For this notebook, you have been provided the top 50 most downloaded books from Project Gutenberg over the last 90 days as text files.

In [1]:
import re
import glob
from tqdm.notebook import tqdm
import pandas as pd
import numpy as np

from nltk import sent_tokenize, word_tokenize, regexp_tokenize
from nltk.corpus import stopwords

from collections import Counter

Given a filepath, you can open the file and use the `read` method to extract the contents as a string.

For example, if we want to import the full text of War and Peace, we can do that using the following block of code.

In [2]:
filepath = '../data/books/War and Peace by graf Leo Tolstoy.txt'

with open(filepath, encoding = 'utf-8') as fi:
    book = fi.read()

You'll notice that there is some metadata at the top of the file and at the bottom of the file.

In [3]:
book[:1000]

'\ufeffThe Project Gutenberg eBook of War and Peace, by Leo Tolstoy\n\nThis eBook is for the use of anyone anywhere in the United States and\nmost other parts of the world at no cost and with almost no restrictions\nwhatsoever. You may copy it, give it away or re-use it under the terms\nof the Project Gutenberg License included with this eBook or online at\nwww.gutenberg.org. If you are not located in the United States, you\nwill have to check the laws of the country where you are located before\nusing this eBook.\n\nTitle: War and Peace\n\nAuthor: Leo Tolstoy\n\nTranslators: Louise and Aylmer Maude\n\nRelease Date: April, 2001 [eBook #2600]\n[Most recently updated: January 21, 2019]\n\nLanguage: English\n\nCharacter set encoding: UTF-8\n\nProduced by: An Anonymous Volunteer and David Widger\n\n*** START OF THE PROJECT GUTENBERG EBOOK WAR AND PEACE ***\n\n\n\n\nWAR AND PEACE\n\n\nBy Leo Tolstoy/Tolstoi\n\n\n    Contents\n\n    BOOK ONE: 1805\n\n    CHAPTER I\n\n    CHAPTER II\n\n    CH

In [4]:
book[-18420:-18000]

'scious.\n\n\n\n\n*** END OF THE PROJECT GUTENBERG EBOOK WAR AND PEACE ***\n\nUpdated editions will replace the previous one--the old editions will\nbe renamed.\n\nCreating the works from print editions not protected by U.S. copyright\nlaw means that no one owns a United States copyright in these works,\nso the Foundation (and you!) can copy and distribute it in the\nUnited States without permission and without paying copyright\nro'

Write some code that will remove this text at the bottom and top of the string.

**Hint:** You might want to make use of the [`re.search`](https://docs.python.org/3/library/re.html#re.search) function from the `re` library.

In [5]:

# The first line wrapped in *** ... *** is the START marker
FrontCover_MetaData = re.search(r"\*\*\*(.+)\*\*\*", book)
FrontCover_MetaData

<re.Match object; span=(775, 833), match='*** START OF THE PROJECT GUTENBERG EBOOK WAR AND >

In [6]:
FrontCover_MetaData_Examine = re.search(r"\*\*\*(.+)\*\*\*", book).group()
FrontCover_MetaData_Examine

'*** START OF THE PROJECT GUTENBERG EBOOK WAR AND PEACE ***'

In [7]:
FrontCover_MetaData.span()

(775, 833)

In [8]:
Front_Start_Pos = FrontCover_MetaData.span()[1]

In [9]:

# The END marker line closes the body of the book
BackCover_MetaData = re.search(r"\*\*\* END OF THE PROJECT GUTENBERG EBOOK(.+)", book)
BackCover_MetaData

<re.Match object; span=(3209115, 3209171), match='*** END OF THE PROJECT GUTENBERG EBOOK WAR AND PE>

In [10]:
BackCover_MetaData_Examine = re.search(r"\*\*\* END OF THE PROJECT GUTENBERG EBOOK(.+)", book).group()
BackCover_MetaData_Examine

'*** END OF THE PROJECT GUTENBERG EBOOK WAR AND PEACE ***'

In [11]:
BackCover_MetaData.span()

(3209115, 3209171)

In [12]:
BackCover_MetaData.span()[0]

3209115

In [13]:
Rear_End_Pos = BackCover_MetaData.span()[0]

If we want to be able to scale up our analysis to multiple books, it would be nice to have a function to use repeatedly. Write a function called `import_book` which takes as an argument a filepath and returns the contents of that file as a string with the metadata at the top and bottom removed.
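One way to approach the single-filepath version is sketched below. The names `strip_gutenberg_metadata` and `import_book_from_path` are hypothetical, and the marker patterns are an assumption based on the War and Peace file shown above; other files may use slightly different wording.

```python
import re

# Marker patterns are assumptions based on the War and Peace file above.
START_PATTERN = re.compile(r"\*\*\* ?START OF.*?\*\*\*")
END_PATTERN = re.compile(r"\*\*\* ?END OF")


def strip_gutenberg_metadata(text):
    """Return the text between the START and END marker lines.

    If either marker is missing, the text is returned unchanged.
    """
    start = START_PATTERN.search(text)
    if start is None:
        return text
    # Only look for the END marker after the START marker.
    end = END_PATTERN.search(text, start.end())
    if end is None:
        return text
    return text[start.end():end.start()]


def import_book_from_path(filepath):
    """Read one book file and strip the Project Gutenberg metadata."""
    with open(filepath, encoding="utf-8") as fi:
        return strip_gutenberg_metadata(fi.read())


sample = (
    "Title: Example\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "Body text.\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "License text."
)
print(repr(strip_gutenberg_metadata(sample)))
```

Keeping the string handling separate from the file handling makes it possible to check the stripping logic on a small sample before running it on a full book.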

In [14]:
import os

In [15]:
book_list = []
directory = '../data/books'

In [None]:
def import_book(directory = '../data/books'):
    """Read every .txt file in `directory`, strip the Gutenberg metadata,
    and return a DataFrame with titles in column 0 and text in column 1."""
    Bk_Content_List = []
    Bk_Title_List = []

    for filename in os.listdir(directory):
        if not filename.endswith('.txt'):
            continue

        with open(os.path.join(directory, filename), encoding = 'utf-8') as file:
            BkTxt_with_MetaData = file.read()

        # Locate the "*** START ... ***" and "*** END ..." marker lines
        FrontCover_MetaData = re.search(r"\*\*\*(.+)\*\*\*", BkTxt_with_MetaData)
        BackCover_MetaData = re.search(r"\*\*\* END", BkTxt_with_MetaData)

        # Report and skip any file where a marker is missing, rather than
        # slicing with positions left over from a previous book
        if FrontCover_MetaData is None or BackCover_MetaData is None:
            print(filename)
            continue

        Front_Start_Pos = FrontCover_MetaData.end()
        Rear_End_Pos = BackCover_MetaData.start()

        # Keep only the text between the two markers (the END marker is excluded)
        Bk_Content_List.append(BkTxt_with_MetaData[Front_Start_Pos : Rear_End_Pos])
        Bk_Title_List.append(filename[:-4])

    # Build the DataFrame once, after all files have been read
    Bk_Title_Content_Lists_df = pd.concat([pd.Series(Bk_Title_List), pd.Series(Bk_Content_List)], axis = 1)

    return Bk_Title_Content_Lists_df

In [46]:
Bk_Title_Content_Lists_df = import_book(directory = '../data/books')

In [47]:
Bk_Title_Content_Lists_df

|   | 0 | 1 |
|---|---|---|
| 0 | Pygmalion by Bernard Shaw | \n\n\n\n\nTRANSCRIBER’S NOTE: In the printed v... |
| 1 | The War of the Worlds by H. G. Wells | \n\ncover \n\n\n\n\nThe War of the Worlds\n\nb... |
| 2 | Leviathan by Thomas Hobbes | \n\n\n\n\nLEVIATHAN\n\nBy Thomas Hobbes\n\n165... |
| 3 | Don Quixote by Miguel de Cervantes Saavedra | \n\n\n\n\nbookcover.jpg (230K)\n\n\nFull Size\... |
| 4 | The Awakening, and Selected Short Stories by K... | \n\n\n\n\nThe Awakening\nand Selected Short St... |
| 5 | The Count of Monte Cristo, Illustrated by Alex... | \n\n\n\n\nTHE COUNT OF MONTE CRISTO\n\n\n\nby ... |
| 6 | Les Misérables by Victor Hugo | \n\n\n\n\nLES MISÉRABLES\n\nBy Victor Hugo\n\n... |
| 7 | The Republic by Plato | \n\n\n\n\nTHE REPUBLIC\n\nBy Plato\n\nTranslat... |
| 8 | Walden, and On The Duty Of Civil Disobedience ... | \n\n\n\n\nWALDEN\n\n\n\n\nand\n\n\n\nON THE DU... |
| 9 | Dubliners by James Joyce | \n\ncover\n\n\n\n\nDUBLINERS\n\nby James Joyce... |


Now, let's utilize our function to import all of the books into a data structure of some kind.

First, we need to be able to iterate through the list of filepaths. For this, we can use the `glob` function, which takes as an argument a pattern to match and returns the list of matching paths. Try it out.

In [19]:
glob.glob('../data/books/*.txt')

['../data/books/Pygmalion by Bernard Shaw.txt',
 '../data/books/The War of the Worlds by H. G.  Wells.txt',
 '../data/books/Leviathan by Thomas Hobbes.txt',
 '../data/books/Don Quixote by Miguel de Cervantes Saavedra.txt',
 '../data/books/The Awakening, and Selected Short Stories by Kate Chopin.txt',
 '../data/books/The Count of Monte Cristo, Illustrated by Alexandre Dumas.txt',
 '../data/books/Les Misérables by Victor Hugo.txt',
 '../data/books/The Republic by Plato.txt',
 '../data/books/Walden, and On The Duty Of Civil Disobedience by Henry David Thoreau.txt',
 '../data/books/Dubliners by James Joyce.txt',
 '../data/books/Great Expectations by Charles Dickens.txt',
 '../data/books/Anthem by Ayn Rand.txt',
 '../data/books/Anna Karenina by graf Leo Tolstoy.txt',
 '../data/books/The Yellow Wallpaper by Charlotte Perkins Gilman.txt',
 '../data/books/Moby Dick; Or, The Whale by Herman Melville.txt',
 '../data/books/The Adventures of Sherlock Holmes by Arthur Conan Doyle.txt',
 '../data/bo

In [20]:
filepath = glob.glob('../data/books/*.txt')[0]
filepath

'../data/books/Pygmalion by Bernard Shaw.txt'

It would be nice to save the title of each book without the extra pieces around it. Write code that will remove the "../data/books/" from the front of the filepath and the ".txt" from the end. For the current filepath, that means extracting just "Pygmalion by Bernard Shaw".

In [21]:
# Your Code Here
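A minimal sketch of one approach, using `os.path` so that the directory depth does not matter. The helper name `title_from_filepath` is hypothetical.

```python
import os


def title_from_filepath(filepath):
    """Drop the directory and the file extension from a path."""
    filename = os.path.basename(filepath)
    # splitext splits at the last dot, so dots inside the title
    # (for example "H. G.") are left alone.
    title, _extension = os.path.splitext(filename)
    return title


print(title_from_filepath('../data/books/Pygmalion by Bernard Shaw.txt'))
```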

Now, combine the function you created with the code that you just wrote: iterate through the filepaths for the books and save the contents of each book into a dictionary whose keys are the cleaned-up titles.

In [22]:
# Your Code Here
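The loop could look something like the sketch below. Both `build_book_dict` and `read_text` are hypothetical names; `read_text` is a placeholder loader that returns the raw file contents, and a metadata-stripping function that takes a single filepath could be passed as `loader` instead.

```python
import glob
import os


def read_text(filepath):
    """Placeholder loader: returns the raw file contents."""
    with open(filepath, encoding="utf-8") as fi:
        return fi.read()


def build_book_dict(pattern, loader=read_text):
    """Map each cleaned-up title to the text returned by `loader`."""
    books = {}
    for filepath in sorted(glob.glob(pattern)):
        title = os.path.splitext(os.path.basename(filepath))[0]
        books[title] = loader(filepath)
    return books


books = build_book_dict('../data/books/*.txt')
print(len(books))
```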

Now let's write some code so that we can cluster our books. In order to cluster, we'll need to be able to compute a similarity or distance between books.

A simple way to compute the similarity of documents is the [Jaccard similarity](https://en.wikipedia.org/wiki/Jaccard_index) of the sets of words that they contain. This metric is the size of the intersection of the two sets divided by the size of their union. Two books which contain exactly the same words (but not necessarily in the same order or with the same frequency) will have a Jaccard similarity of 1, and two books which have no words in common will have a Jaccard similarity of 0.
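Written out for two word sets $A$ and $B$, with a small worked example (the toy sets are chosen purely for illustration):

```latex
J(A, B) = \frac{|A \cap B|}{|A \cup B|}

A = \{\text{war}, \text{and}, \text{peace}\}, \quad
B = \{\text{peace}, \text{and}, \text{love}\}

J(A, B)
  = \frac{|\{\text{and}, \text{peace}\}|}{|\{\text{war}, \text{and}, \text{peace}, \text{love}\}|}
  = \frac{2}{4}
  = 0.5
```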

**Question:** What might be some of the downsides to using Jaccard similarity to compute the similarity of two books?

In order to use this, we'll need to tokenize each book and store the results in a collection of some kind. Since we are interested in which words appear but not necessarily in what order or how frequently, we can make use of a [set](https://docs.python.org/3/library/stdtypes.html#set-types-set-frozenset). A set is similar to a list, but the order of the contents does not matter and a set cannot contain duplicates.
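A quick illustration of how a set differs from a list, using a made-up word list:

```python
words = ["the", "cat", "saw", "the", "other", "cat"]
unique_words = set(words)

# Duplicates collapse: six words in the list, four distinct words in the set.
print(len(words), len(unique_words))

# Membership tests work as with lists, and order plays no role in equality.
print("cat" in unique_words)
print(unique_words == {"saw", "other", "cat", "the"})
```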

For practice, let's grab one of our books.

In [23]:
book = books['Little Women by Louisa May Alcott']

NameError: name 'books' is not defined

Write some code which tokenizes Little Women and stores the tokens it contains in a set. It is up to you to decide exactly how you want to tokenize or what you want to count as a token.

Once you are happy with your tokenization method, convert it into a function named `tokenize_book` which takes in a string and returns a set of tokens.

In [None]:
# Your Code Here
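One possible tokenizer is sketched below. It relies only on `re` and treats a token as a lowercase run of letters with an optional apostrophe part; that definition is an assumption made for this sketch, and NLTK's `word_tokenize` or `regexp_tokenize` would be equally valid choices.

```python
import re

# Assumed token definition: a run of letters (including accented ones),
# optionally followed by an apostrophe and more letters, as in "didn't".
TOKEN_PATTERN = re.compile(r"[^\W\d_]+(?:['’][^\W\d_]+)?")


def tokenize_book(text):
    """Lowercase the text and return the set of distinct word tokens."""
    return set(TOKEN_PATTERN.findall(text.lower()))


print(sorted(tokenize_book("Jo's sisters laughed; Jo didn't laugh.")))
```

Lowercasing merges "The" and "the" into one token. Punctuation and digits are dropped entirely, which may or may not suit the analysis.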

Now, write a function `jaccard` which takes in two sets of tokens and returns the Jaccard similarity between them. **Hint:** Python sets have `intersection` and `union` methods.

In [None]:
# Your Code Here
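A minimal sketch following the hint. Returning 1.0 when both sets are empty is a convention chosen for this sketch, since the ratio is otherwise undefined.

```python
def jaccard(tokens_a, tokens_b):
    """Jaccard similarity: size of the intersection over size of the union."""
    union = tokens_a.union(tokens_b)
    if not union:
        # Both sets are empty; treat them as identical (an assumption).
        return 1.0
    return len(tokens_a.intersection(tokens_b)) / len(union)


print(jaccard({"war", "and", "peace"}, {"peace", "and", "love"}))
print(jaccard({"a", "b"}, {"a", "b"}))
print(jaccard({"a"}, {"b"}))
```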

Is Little Women more similar (using Jaccard Similarity) to Heart of Darkness or Anthem?

In [None]:
# Your Code Here

In [None]:
# Your Code Here

Let's create another dictionary called `book_tokens` that contains the title of each book as a key and the tokenized version of the book as values.

In [None]:
# Your Code Here

Using this, let's create a distance matrix for our books using the jaccard function above. **Note:** You created a function for jaccard _similarity_. This can be converted to a **distance** by subtracting the similarity score from 1.

In [None]:
dists = np.zeros(shape = (len(book_tokens), len(book_tokens)))

Now, fill in the distance matrix so that entry (i, j) holds one minus the Jaccard similarity of the ith and jth books.

In [None]:
# Your Code Here
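The fill-in step might look like the sketch below. The toy `book_tokens` dictionary stands in for the real one, and the `jaccard` helper is a minimal version included only to keep the example self-contained.

```python
from itertools import combinations

import numpy as np


def jaccard(tokens_a, tokens_b):
    """Minimal Jaccard similarity; assumes at least one set is non-empty."""
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


# Toy stand-in for the real `book_tokens` dictionary.
book_tokens = {
    "Book A": {"a", "b", "c"},
    "Book B": {"b", "c", "d"},
    "Book C": {"x"},
}

titles = list(book_tokens.keys())
dists = np.zeros(shape=(len(titles), len(titles)))

# Each unordered pair is computed once and mirrored, so the matrix is
# symmetric and the diagonal stays at 0 (a book is at distance 0 from itself).
for i, j in combinations(range(len(titles)), 2):
    similarity = jaccard(book_tokens[titles[i]], book_tokens[titles[j]])
    dists[i, j] = dists[j, i] = 1 - similarity

print(dists)
```

A symmetric matrix with a zero diagonal is the form that `squareform` expects when it converts the matrix to the condensed form used by `linkage`.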

Once we have our distance matrix, we can compute a **dendrogram**.

A dendrogram is a way to visualize a hierarchical clustering of a dataset. You can read more about it [here](https://www.statisticshowto.com/hierarchical-clustering/).

In [None]:
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import pdist, squareform
import matplotlib.pyplot as plt

In [None]:
mergings = linkage(squareform(dists), method='complete')

plt.figure(figsize = (12,8))
dendrogram(mergings,
           labels = list(book_tokens.keys()),
           leaf_rotation = 90,
           leaf_font_size = 6);

plt.tight_layout()
plt.savefig('images/dendogram_complete_jaccard.png', transparent=False, facecolor='white', dpi = 150);

**Bonus Material** Jaccard similarity does not account for how frequently each word is used, only whether or not it is used.

We might be better off using the **cosine similarity** as a way to measure the similarity of two books.

Create a dataframe named `books_df` where each row corresponds to a book and each column corresponds to a word. It should count the number of times the word appears in that book (including zero). Use the book title as the index of this dataframe.
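A sketch of one way to build such a dataframe and compare two rows with cosine similarity. The `toy_books` dictionary, the `count_words` tokenizer, and the `cosine_similarity` helper are all assumptions made for illustration; the real version would start from the dictionary of full book texts.

```python
from collections import Counter
import re

import numpy as np
import pandas as pd

# Toy stand-in for the dictionary mapping titles to full texts.
toy_books = {
    "Book A": "the whale and the sea",
    "Book B": "the sea and the sky",
}


def count_words(text):
    """Count lowercase letter-only tokens (an assumed tokenization)."""
    return Counter(re.findall(r"[^\W\d_]+", text.lower()))


# One row per book, one column per word; words missing from a book count as 0.
counts = [count_words(text) for text in toy_books.values()]
books_df = pd.DataFrame(counts, index=list(toy_books.keys())).fillna(0).astype(int)


def cosine_similarity(u, v):
    """Cosine of the angle between two count vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


print(books_df)
print(cosine_similarity(books_df.loc["Book A"].to_numpy(),
                        books_df.loc["Book B"].to_numpy()))
```

Unlike Jaccard similarity, this measure changes when a shared word is used more or less often in one of the books.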