# Chapter 9: Counting and Indexing Words
Building an inverted index

Programs from the book: [_Python for Natural Language Processing_](https://link.springer.com/book/9783031575488)

__Author__: Pierre Nugues

## Outline

This program indexes all the words in a corpus. Conceptually, an index consists of rows, one per word, where each row lists the files in which the word occurs and its positions in each file. Such a row is called a _posting list_. We encode the position of a word as the number of characters from the start of the file to the start of the word.
<pre>
word1: file_name pos1 pos2 pos3... file_name pos1 pos2 ...
word2: file_name pos1 pos2 pos3... file_name pos1 pos2 ...
...
</pre>
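
As a minimal sketch of this structure in Python, the rows can be stored as a dictionary of dictionaries. The words, file names, and positions below are invented for the illustration and do not come from the corpus.

```python
# Toy inverted index: word -> {file name -> list of character offsets}
# The file names and offsets are invented for this illustration.
toy_index = {
    'wine': {'doc1.txt': [32, 150], 'doc2.txt': [7]},
    'vendor': {'doc1.txt': [22]},
}

# Format each posting list as a row: word, then file names and positions
rows = []
for word, postings in toy_index.items():
    parts = [word + ':']
    for file_name, positions in postings.items():
        parts.append(file_name)
        parts.extend(str(pos) for pos in positions)
    rows.append(' '.join(parts))
print('\n'.join(rows))
```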

## Modules

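We import the standard modules we need and `regex`, a third-party module (installed with `pip install regex`) that supports Unicode properties such as `\p{L}`.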

In [1]:
import math
import os
import pickle
import regex as re

## Corpus

A function to list the files in a folder whose names end with a given suffix.

In [2]:
def get_files(dir, suffix):
    """
    Returns all the files in a folder ending with suffix
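    :param dir: path of the folder to scan
    :param suffix: ending of the file names to keep, for instance 'txt'
    :return: the list of file names (without the folder path)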
    """
    files = []
    for file in os.listdir(dir):
        if file.endswith(suffix):
            files.append(file)
    return files
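
The order returned by `os.listdir()` is arbitrary, and a bare suffix such as `'txt'` also matches names that merely end with these letters. The variant below is a sketch under these observations, not the notebook's function: `list_files()` is a hypothetical name, it sorts the result, and it is called with the suffix `'.txt'` on a temporary folder.

```python
import os
import tempfile

def list_files(dir_path, suffix):
    """Sorted names of the files in dir_path that end with suffix."""
    return sorted(name for name in os.listdir(dir_path)
                  if name.endswith(suffix))

# Create three empty files in a temporary folder and list the .txt ones
with tempfile.TemporaryDirectory() as tmp_dir:
    for name in ['b.txt', 'a.txt', 'notes.md']:
        open(os.path.join(tmp_dir, name), 'w').close()
    found = list_files(tmp_dir, '.txt')

print(found)  # ['a.txt', 'b.txt']
```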

We will create an index for a corpus of Dickens works.

In [3]:
folder = '../datasets/dickens/'

In [4]:
corpus_files = get_files(folder, 'txt')
corpus_files

['Hard Times.txt',
 'Oliver Twist.txt',
 'Great Expectations.txt',
 'The Old Curiosity Shop.txt',
 'A Tale of Two Cities.txt',
 'Dombey and Son.txt',
 'The Pickwick Papers.txt',
 'Bleak House.txt',
 'Our Mutual Friend.txt',
 'The Mystery of Edwin Drood.txt',
 'Nicholas Nickleby.txt',
 'David Copperfield.txt',
 'Little Dorrit.txt',
 'A Christmas Carol in Prose.txt']

## Programming the Indexer

### Tokenizer 

In [5]:
regex = r'\p{L}+'

In [6]:
re.findall(regex, 'Monsieur the Marquis, vendor of wine.')

['Monsieur', 'the', 'Marquis', 'vendor', 'of', 'wine']
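
The pattern `\p{L}+` matches maximal runs of Unicode letters, so punctuation, digits, and hyphens act as separators. The `\p{L}` class requires the `regex` module. As a rough sketch with the standard `re` module only, the class `[^\W\d_]` behaves similarly on ordinary text; this substitution is an assumption of the example, not the notebook's tokenizer.

```python
import re as std_re  # standard library module, not the third-party regex

# [^\W\d_] = word characters minus digits and the underscore, i.e. letters
letters = r'[^\W\d_]+'

sample = 'Ernest Defarge, wine-vendor of St. Antoine, 1789.'
toy_tokens = std_re.findall(letters, sample)
print(toy_tokens)
# ['Ernest', 'Defarge', 'wine', 'vendor', 'of', 'St', 'Antoine']
```

The hyphenated `wine-vendor` is split into two tokens, which is why `vendor` is found in this compound in the corpus.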

In [7]:
def tokenize(text):
    words = re.finditer(regex, text)
    return words

In [8]:
tokens = tokenize(
    'Monsieur the Marquis, vendor of wine.')
list(tokens)

[<regex.Match object; span=(0, 8), match='Monsieur'>,
 <regex.Match object; span=(9, 12), match='the'>,
 <regex.Match object; span=(13, 20), match='Marquis'>,
 <regex.Match object; span=(22, 28), match='vendor'>,
 <regex.Match object; span=(29, 31), match='of'>,
 <regex.Match object; span=(32, 36), match='wine'>]

#### Extracting indices

The `text_to_idx(words)` function takes the match objects returned by `tokenize()` and builds a dictionary that maps each word to the list of its start positions.

In [9]:
def text_to_idx(words):
    """
    Builds an index from a list of match objects
    """
    word_idx = {}
    for word in words:
        try:
            word_idx[word.group()].append(word.start())
        except KeyError:
            # First occurrence of this word: create its posting list
            word_idx[word.group()] = [word.start()]
    return word_idx

In [10]:
tokens = tokenize(
    'Monsieur the Marquis, vendor of wine.'.lower().strip())
text_to_idx(tokens)

{'monsieur': [0],
 'the': [9],
 'marquis': [13],
 'vendor': [22],
 'of': [29],
 'wine': [32]}
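
When a word occurs several times, its list of positions grows. The sketch below shows this on a short sentence. It is self-contained: it uses the standard `re` module with an approximate letter class, `dict.setdefault()` in place of `try`/`except`, and a hypothetical helper name, `build_positions()`.

```python
import re as std_re

def build_positions(text):
    """Maps each word of text to the list of its start offsets."""
    positions = {}
    for match in std_re.finditer(r'[^\W\d_]+', text):
        # setdefault() creates the empty list at the first occurrence
        positions.setdefault(match.group(), []).append(match.start())
    return positions

toy_positions = build_positions('the vendor of wine and the vendor of bread')
print(toy_positions['the'], toy_positions['vendor'])  # [0, 23] [4, 27]
```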

#### Reading one file

We read one file, _A Tale of Two Cities_ (`A Tale of Two Cities.txt`), convert it to lowercase, strip the leading and trailing whitespace, tokenize it, and index it. We call this index `idx`.

In [11]:
first_file = folder + 'A Tale of Two Cities.txt'
with open(first_file, encoding='utf-8') as f:
    text = f.read().lower().strip()
words = tokenize(text)
idx = text_to_idx(words)

In [12]:
idx['vendor']

[218582, 218631, 219234, 635168]

#### Saving the index

We save the index in a file with the `pickle` module.

In [13]:
index_file = 'a_tale_of_two_cities.idx'
with open(index_file, 'wb') as f:
    pickle.dump(idx, f)

We read the file back and store its content in `idx`.

In [14]:
with open(index_file, 'rb') as f:
    idx = pickle.load(f)

In [15]:
idx['vendor']

[218582, 218631, 219234, 635168]
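
The save and reload steps can be checked on a small dictionary. This is a sketch under assumptions: the index content and the file name `toy.idx` are invented, and the file is written to a temporary folder so that nothing on disk is overwritten.

```python
import os
import pickle
import tempfile

toy_idx = {'vendor': [4, 27], 'wine': [14]}

# The with statements close the files even if an error occurs
with tempfile.TemporaryDirectory() as tmp_dir:
    toy_file = os.path.join(tmp_dir, 'toy.idx')
    with open(toy_file, 'wb') as f:
        pickle.dump(toy_idx, f)
    with open(toy_file, 'rb') as f:
        restored = pickle.load(f)

print(restored == toy_idx)  # True
```

Note that `pickle.load()` can execute arbitrary code, so it should only be applied to files from a trusted source.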

### Reading the content of a folder

In [16]:
corpus_files = get_files(folder, 'txt')
corpus_files

['Hard Times.txt',
 'Oliver Twist.txt',
 'Great Expectations.txt',
 'The Old Curiosity Shop.txt',
 'A Tale of Two Cities.txt',
 'Dombey and Son.txt',
 'The Pickwick Papers.txt',
 'Bleak House.txt',
 'Our Mutual Friend.txt',
 'The Mystery of Edwin Drood.txt',
 'Nicholas Nickleby.txt',
 'David Copperfield.txt',
 'Little Dorrit.txt',
 'A Christmas Carol in Prose.txt']

### Creating a master index

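The master index maps each word to a dictionary, where the keys are the file names and the values are the lists of positions of the word in these files. The word _vendor_, for instance, occurs four times in _A Tale of Two Cities_ at positions 218582, 218631, 219234, and 635168.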

In [17]:
master_index = {}
for file in corpus_files:
    with open(folder + file, encoding='utf-8') as f:
        text = f.read().lower().strip()
    words = tokenize(text)
    idx = text_to_idx(words)
    # Merge the index of this file into the master index
    for word in idx:
        if word not in master_index:
            master_index[word] = {}
        master_index[word][file] = idx[word]

In [18]:
master_index['vendor']

{'Oliver Twist.txt': [788457],
 'A Tale of Two Cities.txt': [218582, 218631, 219234, 635168],
 'Dombey and Son.txt': [1080291],
 'The Pickwick Papers.txt': [28715],
 'Bleak House.txt': [1474429]}
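
The merging step can be sketched on two short in-memory documents instead of files. The document names and texts are invented, and the standard `re` module with an approximate letter class stands in for the notebook's tokenizer.

```python
import re as std_re

# Two invented documents standing in for the corpus files
toy_docs = {
    'a.txt': 'the vendor of wine',
    'b.txt': 'wine and more wine',
}

toy_master = {}
for file_name, text in toy_docs.items():
    for match in std_re.finditer(r'[^\W\d_]+', text):
        # word -> file name -> list of start offsets
        postings = toy_master.setdefault(match.group(), {})
        postings.setdefault(file_name, []).append(match.start())

print(toy_master['wine'])  # {'a.txt': [14], 'b.txt': [0, 14]}
```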

In [19]:
master_index['deserve']

{'Hard Times.txt': [206688, 329331, 330018],
 'Oliver Twist.txt': [117173, 152567, 257637, 568782, 673524],
 'Great Expectations.txt': [272920, 321608, 648710, 982503],
 'The Old Curiosity Shop.txt': [7177, 57252, 187226],
 'A Tale of Two Cities.txt': [269196, 618140, 669252],
 'Dombey and Son.txt': [100181,
  328794,
  498361,
  622188,
  622246,
  766143,
  912596,
  1204627],
 'The Pickwick Papers.txt': [85832,
  425008,
  425370,
  533925,
  650753,
  1343247,
  1592405,
  1629557,
  1673972],
 'Bleak House.txt': [962433, 1662029, 1840382, 1897369],
 'Our Mutual Friend.txt': [91952,
  92020,
  410589,
  414573,
  683351,
  835951,
  888199,
  926327,
  969205,
  1254188,
  1258422,
  1318630,
  1457539,
  1466035,
  1490735,
  1673403,
  1737036,
  1794595],
 'Nicholas Nickleby.txt': [411260,
  790675,
  1168530,
  1197303,
  1240532,
  1391071,
  1457691,
  1766702],
 'David Copperfield.txt': [533653,
  597133,
  699197,
  819121,
  1161620,
  1297435,
  1297468,
  1522835,
  1827

We save the master index in a file and read it back.

In [20]:
with open('master.idx', 'wb') as f:
    pickle.dump(master_index, f)
with open('master.idx', 'rb') as f:
    master_index = pickle.load(f)

In [21]:
master_index['vendor']

{'Oliver Twist.txt': [788457],
 'A Tale of Two Cities.txt': [218582, 218631, 219234, 635168],
 'Dombey and Son.txt': [1080291],
 'The Pickwick Papers.txt': [28715],
 'Bleak House.txt': [1474429]}

#### Concordances

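The `concordance(word, master_index, window)` function prints the concordances of `word`: for each occurrence, it shows the text from `window` characters before the start of the word to `window` characters after the start of the word.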

In [22]:
def concordance(word, master_index, window):
    for document in master_index[word]:
        print(document)
        # Apply the same normalization as at indexing time so that
        # the character offsets stored in the index remain valid
        with open(folder + document, encoding='utf-8') as f:
            text = f.read().lower().strip()
        for idx in master_index[word][document]:
            # Clamp the window to the boundaries of the text
            idx_left = max(idx - window, 0)
            idx_right = min(idx + window, len(text))
            # Replace each whitespace character, including newlines,
            # with a space (raw string to avoid an invalid escape)
            context = re.sub(r'\s', ' ', text[idx_left:idx_right])
            print('\t' + context)

In [23]:
concordance('vendor', master_index, 25)

Oliver Twist.txt
	 plainly hesitated.  the vendor observing this, in
A Tale of Two Cities.txt
	  “monsieur the marquis, vendor of wine.”  “pick u
	up that, philosopher and vendor of wine,” said the
	e spot where defarge the vendor of wine had stood,
	es. ernest defarge, wine-vendor of st. antoine.”  
Dombey and Son.txt
	eral garments, which the vendor declared to be suc
The Pickwick Papers.txt
	orcing the heated pastry-vendor’s proposition: and
Bleak House.txt
	ariably, taken in by the vendor and installed in t
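
The window extraction can be tried on an in-memory string. In this sketch, `context_window()` is a hypothetical helper that isolates the slicing and whitespace normalization, and the sample text is adapted from the example sentence used earlier.

```python
import re as std_re

def context_window(text, position, window):
    """Returns the text around position, with whitespace normalized."""
    left = max(position - window, 0)           # do not go before the start
    right = min(position + window, len(text))  # do not go past the end
    return std_re.sub(r'\s', ' ', text[left:right])

toy_text = 'monsieur the marquis,\nvendor of wine.'
start = toy_text.index('vendor')
print(repr(context_window(toy_text, start, 10)))
# ' marquis, vendor of '
```

The newline before `vendor` is replaced with a space, so each concordance fits on one line.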


### Representing Documents with tf-idf

Once we have created the index, we can represent each document in the corpus as a dictionary. The keys of these dictionaries are the words, and the value of a word is its tf-idf score.

We use the following definition of tf-idf:
 * tf is the relative frequency of the term in the document, and
 * idf is the base-10 logarithm of the inverse document frequency, that is, the number of documents in the corpus divided by the number of documents that contain the term.

Conceptually, the tf-idf representation is a vector with one dimension per word of the corpus vocabulary.
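
To make this definition concrete, here is a small worked example. It is a sketch on invented word counts for three toy documents, and `toy_tfidf()` is a hypothetical helper, not part of the notebook.

```python
import math

# Invented word counts for a three-document toy corpus
toy_counts = {
    'd1': {'wine': 2, 'vendor': 1, 'the': 1},
    'd2': {'wine': 1, 'the': 3},
    'd3': {'the': 2},
}
n_docs = len(toy_counts)

def toy_tfidf(word, doc):
    """tf-idf of word in doc: relative frequency times log10(N / df).

    The word must occur in at least one document, otherwise df is 0.
    """
    total = sum(toy_counts[doc].values())
    tf = toy_counts[doc].get(word, 0) / total
    df = sum(1 for counts in toy_counts.values() if word in counts)
    return tf * math.log10(n_docs / df)

print(toy_tfidf('the', 'd1'))     # 0.0: 'the' occurs in every document
print(toy_tfidf('vendor', 'd1'))  # 0.25 * log10(3)
```

A word that occurs in all the documents has an idf of zero, so it does not contribute to the representation, whatever its frequency.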

In [24]:
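tfidf = {}
for file in corpus_files:
    tfidf[file] = {}
    total_words = 0
    for word in master_index:
        # idf: base-10 logarithm of the number of documents divided by
        # the number of documents that contain the word
        idf = math.log(len(corpus_files) / len(master_index[word]), 10)
        if file in master_index[word]:
            # Raw count of the word in the file
            tf = len(master_index[word][file])
            tfidf[file][word] = tf * idf
            total_words += tf
        else:
            tfidf[file][word] = 0
    # Divide by the number of words in the file to turn the raw counts
    # into relative frequencies
    for word in tfidf[file]:
        tfidf[file][word] /= total_words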

In [25]:
tfidf['A Tale of Two Cities.txt']['vendor']

1.2924669774106877e-05

In [26]:
tfidf['A Tale of Two Cities.txt']['deserve']

1.451274081696086e-06

### Comparing Documents

Using cosine similarity, we compare all the pairs of documents through their tf-idf representations. We compute the similarity of a pair with `cosine_similarity(document1, document2)`.

In [27]:
def cosine_similarity(document1, document2):
    """
    Cosine of the angle between the tf-idf vectors of two documents.
    The vectors are read from the global tfidf dictionary, where all
    the documents share the same keys.
    """
    scalar_prod = 0.0
    norm1 = 0.0
    norm2 = 0.0
    for word in tfidf[document1]:
        scalar_prod += tfidf[document1][word] * tfidf[document2][word]
        norm1 += tfidf[document1][word] * tfidf[document1][word]
    for word in tfidf[document2]:
        norm2 += tfidf[document2][word] * tfidf[document2][word]
    return scalar_prod / (math.sqrt(norm1) * math.sqrt(norm2))
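
The computation can be verified by hand on two small vectors. The sketch below uses invented values and a hypothetical helper, `toy_cosine()`. Unlike the function above, it takes the vectors as arguments and treats missing keys as zeros instead of reading a global dictionary with explicit zeros.

```python
import math

def toy_cosine(vec1, vec2):
    """Cosine similarity of two sparse vectors stored as dictionaries.

    Both vectors must have at least one nonzero value.
    """
    dot = sum(value * vec2.get(key, 0.0) for key, value in vec1.items())
    norm1 = math.sqrt(sum(value * value for value in vec1.values()))
    norm2 = math.sqrt(sum(value * value for value in vec2.values()))
    return dot / (norm1 * norm2)

u = {'wine': 3.0, 'vendor': 4.0}
v = {'wine': 3.0, 'marquis': 4.0}
print(toy_cosine(u, u))  # a vector compared with itself: 1.0
print(toy_cosine(u, v))  # shared 'wine' only: 9 / (5 * 5) = 0.36
```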

#### Similarity matrix

We compute the similarity matrix between the documents of the corpus. While computing the similarities, we record the two most similar documents `most_sim_doc1` and `most_sim_doc2`.

In [28]:
max_similarity = 0.0
most_sim_doc1 = ''
most_sim_doc2 = ''
print(corpus_files)
for doc1 in corpus_files:
    print(doc1, end='\t')
    for doc2 in corpus_files:
        cos_similarity = cosine_similarity(doc1, doc2)
        print("%1.4f" % cos_similarity, end='\t')
        if cos_similarity > max_similarity and doc1 != doc2:
            max_similarity = cos_similarity
            most_sim_doc1 = doc1
            most_sim_doc2 = doc2
    print()

['Hard Times.txt', 'Oliver Twist.txt', 'Great Expectations.txt', 'The Old Curiosity Shop.txt', 'A Tale of Two Cities.txt', 'Dombey and Son.txt', 'The Pickwick Papers.txt', 'Bleak House.txt', 'Our Mutual Friend.txt', 'The Mystery of Edwin Drood.txt', 'Nicholas Nickleby.txt', 'David Copperfield.txt', 'Little Dorrit.txt', 'A Christmas Carol in Prose.txt']
Hard Times.txt	1.0000	0.0005	0.0009	0.0008	0.0007	0.0054	0.0015	0.0023	0.0009	0.0010	0.0009	0.0011	0.0005	0.0002	
Oliver Twist.txt	0.0005	1.0000	0.0009	0.0021	0.0007	0.0007	0.0010	0.0030	0.0018	0.0016	0.0006	0.0020	0.0006	0.0003	
Great Expectations.txt	0.0009	0.0009	1.0000	0.0015	0.0011	0.0008	0.0008	0.0028	0.0015	0.0009	0.0007	0.0012	0.0013	0.0003	
The Old Curiosity Shop.txt	0.0008	0.0021	0.0015	1.0000	0.0022	0.0010	0.0016	0.0086	0.0046	0.0013	0.0011	0.0028	0.0013	0.0014	
A Tale of Two Cities.txt	0.0007	0.0007	0.0011	0.0022	1.0000	0.0013	0.0012	0.0012	0.0008	0.0008	0.0025	0.0008	0.0035	0.0003	
Dombey and Son.txt	0.0054	0.0007	0.0008	0.0

In [29]:
print("Most similar:", most_sim_doc1,
      most_sim_doc2, "Similarity:", max_similarity)

Most similar: The Old Curiosity Shop.txt Bleak House.txt Similarity: 0.008569236177211179
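
Since the cosine similarity is symmetric, the search for the most similar pair only needs to visit each unordered pair once. The sketch below illustrates this with `itertools.combinations()` on invented similarity scores. The names `toy_sim`, `toy_names`, and `best_pair` are hypothetical and the scores are not those of the corpus.

```python
from itertools import combinations

# Invented symmetric similarity scores between three documents
toy_sim = {
    frozenset(['d1', 'd2']): 0.12,
    frozenset(['d1', 'd3']): 0.40,
    frozenset(['d2', 'd3']): 0.05,
}
toy_names = ['d1', 'd2', 'd3']

# combinations() yields each unordered pair once and never pairs a
# document with itself, so no doc1 != doc2 test is needed
best_pair = max(combinations(toy_names, 2),
                key=lambda pair: toy_sim[frozenset(pair)])
print(best_pair, toy_sim[frozenset(best_pair)])  # ('d1', 'd3') 0.4
```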
