# Assignment #1: Building an inverted index
Author: Pierre Nugues

## Objectives

The objectives of this assignment are to:
* Write a program that collects all the words from a set of documents
* Build an index from the words
* Represent a document using the Tf.Idf values
* Write a short report of 1 to 2 pages on the assignment
* Read a short text on an industrial system

## Submission

When you have written all the missing code and run all the cells, you will submit your notebook to an automatic marking system. Do not erase the content of the cells, as we may check your programs manually.
The submission instructions are at the bottom of the notebook.

## Description of the assignment

### Outline

In this lab, you will build an indexer to index all the words in a corpus. Conceptually, an index consists of rows, with one word per row followed by the list of files and positions where this word occurs. Such a row is called a _posting list_. You will encode the position of a word by the number of characters from the start of the file.
<pre>
word1: file_name pos1 pos2 pos3... file_name pos1 pos2 ...
word2: file_name pos1 pos2 pos3... file_name pos1 pos2 ...
...
</pre>
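As a rough illustration of the layout above, the posting lists could be held in nested dictionaries. The file names and positions in this sketch are invented for the example and do not come from the corpus.

```python
# Toy posting lists: word -> {file name -> [character offsets]}.
# File names and offsets are invented for illustration only.
postings = {
    "word1": {"a.txt": [0, 42], "b.txt": [7]},
    "word2": {"b.txt": [13, 99, 250]},
}

def format_row(word, files):
    """Render one posting list in the 'word: file pos pos ...' layout."""
    parts = []
    for file_name, positions in files.items():
        parts.append(file_name + " " + " ".join(str(p) for p in positions))
    return word + ": " + " ".join(parts)

rows = [format_row(w, f) for w, f in postings.items()]
for row in rows:
    print(row)
```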

#### Imports

Some imports. Add others as needed

In [6]:
import bz2
import math
import os
import pickle
import regex as re
import requests
import sys
import numpy as np
import time
from zipfile import ZipFile

## Corpus

You will create an index for a corpus of Selma Lagerlöf's works. To gather the corpus, you can either:
1. Download the <a href="https://github.com/pnugues/ilppp/raw/master/programs/corpus/Selma.zip">Selma folder</a> and uncompress it. It contains novels by <a href="https://sv.wikipedia.org/wiki/Selma_Lagerl%C3%B6f">Selma Lagerlöf</a>. The text of these novels was extracted from <a href="https://litteraturbanken.se/forfattare/LagerlofS/titlar">Lagerlöf arkivet</a> at <a href="https://litteraturbanken.se/">Litteraturbanken</a>.
2. Or run this cell that will download the corpus and place it in your folder.

In [2]:
# Parameters for Selma dataset
SELMA_URL = "https://github.com/pnugues/ilppp/raw/master/programs/corpus/Selma.zip"

SELMA_FILES = [
    os.path.join("Selma", fname) 
    for fname in 
    [
        "bannlyst.txt", 
        "gosta.txt", 
        "herrgard.txt", 
        "jerusalem.txt", 
        "kejsaren.txt", 
        "marbacka.txt", 
        "nils.txt", 
        "osynliga.txt", 
        "troll.txt"
    ]
]

def download_and_extract_selma():
    """Downloads and unpacks Selma.zip"""
    
    global SELMA_URL
    # Download if not all files exist
    req = requests.get(SELMA_URL, stream=True)
    if req.status_code != 200:
        print("Failed to download file, got status: " + str(req.status_code))
        req.close()
    else:
        with open("Selma.zip", "wb") as fd:
            written = 0
            for chunk in req.iter_content(chunk_size=65536):
                fd.write(chunk)
                written += len(chunk)
                print("Downloading: %d bytes written to Selma.zip" % written)

        print("Selma.zip downloaded.")
        req.close()
        
        selma_zipfile = ZipFile("Selma.zip")
        selma_files_to_extract = [zi for zi in selma_zipfile.filelist if not zi.filename.startswith("__") and zi.filename.endswith(".txt")]
        for zi in selma_files_to_extract:
            selma_zipfile.extract(zi)
            print("Extracted: " + zi.filename)
            
        print("Done!")
        
# Download the corpus unless all the files already exist
if not all([os.path.exists(fname) for fname in SELMA_FILES]):
    download_and_extract_selma()
else:
    print("Selma has been downloaded.")
    
SELMA_FILES

Selma has been downloaded.


['Selma/bannlyst.txt',
 'Selma/gosta.txt',
 'Selma/herrgard.txt',
 'Selma/jerusalem.txt',
 'Selma/kejsaren.txt',
 'Selma/marbacka.txt',
 'Selma/nils.txt',
 'Selma/osynliga.txt',
 'Selma/troll.txt']

### Running the indexer (optional)

In a production context, your final program would take a corpus as input (here Selma Lagerlöf's novels) and create an index of all the words with their positions. You should be able to run it this way:
<pre>$ python indexer.py folder_name</pre>
In this lab, you will write the indexer in a Jupyter Notebook. The conversion into a Python program is left as an optional exercise.

## Programming the Indexer

To make programming easier, you will split this exercise into five steps:
1. Index one file
2. Read the content of a folder
3. Create a master index for all the files
4. Use tf-idf to represent the documents (novels)
5. Compare the documents of a collection

You will use dictionaries to represent the postings.

### Indexing one file

#### Description

<p>Write a program that reads one document <tt>file_name.txt</tt> and outputs an index file:
            <tt>file_name.idx</tt>:
        </p>
        <ol>
            <li>The index file will contain all the unique words in the document,
                where each word is associated with the list of its positions in the document.
            </li>
            <li>You will represent this index as a dictionary, where the keys will be the words, and
                the values, the lists of positions
            </li>
            <li>As words, you will consider all the strings of letters that you will set in lower case.
                You will not index the rest (i.e. numbers, punctuation, or symbols).
            </li>
            <li>To extract the words, use Unicode regular expressions. Do not use <tt>\w+</tt>,
                for instance, but the Unicode equivalent.
            </li>
            <li>The word positions will correspond to the number of characters from the beginning of the file.
                (The word offset from the beginning)
            </li>
            <li>You will use the <tt>finditer()</tt> method to find the positions of the words.
                This will return you match objects,
                where you will get the matches and the positions with
                the <tt>group()</tt> and <tt>start()</tt> methods.
            </li>
            <li>You will use the pickle package to write your dictionary in a file,
                see <a href="https://wiki.python.org/moin/UsingPickle">https://wiki.python.org/moin/UsingPickle</a>.
            </li>
        </ol>

Below is an excerpt of the index of the `bannlyst.txt` text for the words <i>gjord</i>, <i>uppklarnande</i>, and <i>stjärnor</i>. The data is stored in a dictionary:

<pre>
{...
'gjord': [8600, 183039, 220445],
'uppklarnande': [8617],
'stjärnor': [8641], ...
}
</pre>
where the word <i>gjord</i> occurs three times in the text at positions 8600, 183039, and 220445, <i>uppklarnande</i>, once at position 8617, and <i>stjärnor</i>, once at position 8641.
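The steps described above can be sketched on an in-memory string instead of a file. This sketch uses the standard library `re` module with `[^\W\d_]+` as an approximation of a Unicode letter class; that pattern and the names `LETTERS`, `index_text`, and `sample_index` are assumptions of the sketch, whereas the assignment asks for the Unicode properties of the `regex` module.

```python
import re

# Stdlib stand-in for "one or more letters": word characters minus digits
# and underscore. Assumption for this sketch only.
LETTERS = re.compile(r"[^\W\d_]+")

def index_text(text):
    """Map each lowercased word to the list of its character offsets."""
    index = {}
    for match in LETTERS.finditer(text.lower()):
        # group() is the matched word, start() its offset in the text
        index.setdefault(match.group(), []).append(match.start())
    return index

sample_index = index_text("Nils såg en gås. En gås såg Nils, 2 gånger!")
print(sample_index["gås"])   # [12, 20]
print(sample_index["nils"])  # [0, 28]
```

The digit `2` is skipped because it is not a letter, which matches the rule that numbers are not indexed.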

#### Writing a tokenizer 

Write a Unicode regular expression to find words defined as sequences of letters.

In [7]:
# Write your regex here
regex = r'\p{L}+'  # Unicode property class for letters; needs the regex module, not the standard re

In [8]:
re.findall(regex, 'En gång hade de på Mårbacka en barnpiga, som hette Back-Kajsa')

['En',
 'gång',
 'hade',
 'de',
 'på',
 'Mårbacka',
 'en',
 'barnpiga',
 'som',
 'hette',
 'Back',
 'Kajsa']

Using `regex`, write a `tokenize(text)` function that tokenizes a text and returns the tokens with their positions.

In [9]:
# Write your code here
def tokenize(text):  # text is a string, not a list
    # Note: strip() drops leading whitespace, so positions are offsets in the stripped text
    text = text.lower().strip()
    temp = []
    for m in re.finditer(regex, text):
        temp.append(m)
    return temp

In [10]:
tokens = tokenize('En gång hade de på Mårbacka en barnpiga, som hette Back-Kajsa.')
list(tokens)

[<regex.Match object; span=(0, 2), match='en'>,
 <regex.Match object; span=(3, 7), match='gång'>,
 <regex.Match object; span=(8, 12), match='hade'>,
 <regex.Match object; span=(13, 15), match='de'>,
 <regex.Match object; span=(16, 18), match='på'>,
 <regex.Match object; span=(19, 27), match='mårbacka'>,
 <regex.Match object; span=(28, 30), match='en'>,
 <regex.Match object; span=(31, 39), match='barnpiga'>,
 <regex.Match object; span=(41, 44), match='som'>,
 <regex.Match object; span=(45, 50), match='hette'>,
 <regex.Match object; span=(51, 55), match='back'>,
 <regex.Match object; span=(56, 61), match='kajsa'>]

#### Extracting indices

Write a `text_to_idx(words)` function to extract the indices from the list of tokens (words). Return a dictionary, where the keys will be the tokens (words), and the values a list of positions.

In [17]:
# Write your code here
def text_to_idx(words):  # words is a list of regex match objects
    dictionary = {}
    for w in words:
        if w.group(0) in dictionary:  # word seen before
            # dictionary[w.group(0)] is the list of positions; append the new one
            dictionary[w.group(0)].append(w.start())
        else:
            # first occurrence: map the word to a new list holding its start position
            dictionary[w.group(0)] = [w.start()]
    return dictionary  # returns a dictionary

In [18]:
tokens = tokenize('En gång hade de på Mårbacka en barnpiga, som hette Back-Kajsa.')
text_to_idx(tokens)


{'en': [0, 28],
 'gång': [3],
 'hade': [8],
 'de': [13],
 'på': [16],
 'mårbacka': [19],
 'barnpiga': [31],
 'som': [41],
 'hette': [45],
 'back': [51],
 'kajsa': [56]}

#### Reading one file

Read one file, _Mårbacka_, `marbacka.txt`, set it in lowercase, tokenize it, and index it. Call this index `idx`

In [13]:
# Write your code here
# UTF-8 must be specified, otherwise some characters cannot be read
with open('Selma/marbacka.txt', encoding='utf-8') as file:
    file_text = file.read()  # read the file (tokenize handles the lowercasing)

tokens = tokenize(file_text)  # file_text is a string, not a list
idx = text_to_idx(tokens)  # tokens is a list of match objects
#print(idx)

In [14]:
idx['mårbacka']

[16,
 139,
 752,
 1700,
 2582,
 3324,
 15117,
 15404,
 27794,
 42175,
 49126,
 50407,
 52053,
 60144,
 63374,
 64910,
 67182,
 67330,
 67799,
 67824,
 69232,
 71328,
 72099,
 74147,
 74255,
 74614,
 76610,
 76884,
 77138,
 77509,
 77787,
 77936,
 78574,
 80597,
 81782,
 82003,
 84363,
 84786,
 85251,
 89837,
 97093,
 98642,
 100474,
 105063,
 105298,
 105721,
 108710,
 109133,
 112844,
 113725,
 114997,
 115583,
 115833,
 116368,
 116557,
 121896,
 124823,
 126409,
 126542,
 128758,
 130976,
 131939,
 132826,
 136914,
 137187,
 137872,
 139196,
 140721,
 142324,
 146781,
 151497,
 154335,
 155139,
 155438,
 155886,
 156405,
 158108,
 159817,
 160107,
 161158,
 162085,
 165847,
 168316,
 168528,
 169111,
 170333,
 172684,
 182047,
 182427,
 186362,
 189535,
 190999,
 191110,
 193177,
 196686,
 202552,
 206340,
 207789,
 208382,
 209874,
 210525,
 217464,
 219933,
 221393,
 221533,
 221880,
 222213,
 224190,
 229501,
 229598,
 230783,
 231453,
 232140,
 234427,
 236193,
 236950,
 240168,

#### Saving the index

Save your index in a file so that you can reuse it. Use the pickle module.
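A minimal sketch of a pickle round trip, using context managers so that the file handles are closed promptly. The toy dictionary, the temporary directory, and the file name `toy_idx.p` are assumptions made for illustration.

```python
import os
import pickle
import tempfile

toy_index = {"gjord": [8600, 183039, 220445], "stjärnor": [8641]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_idx.p")
    # pickle writes bytes, so the file is opened in binary mode ("wb")
    with open(path, "wb") as f:
        pickle.dump(toy_index, f)
    # and read back in binary mode ("rb")
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == toy_index)
```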

In [11]:
# Write your code here
# Save a dictionary (idx in this case) into a pickle file
# The .p extension marks a pickle file; "wb" opens it for writing in binary mode
with open("idx.p", "wb") as f:
    pickle.dump(idx, f)

In [12]:
# Write your code here
# Load the dictionary back from the pickle file; "rb" opens it for reading in binary mode
with open("idx.p", "rb") as f:
    idx = pickle.load(f)

In [13]:
idx['mårbacka']

[16,
 139,
 752,
 1700,
 2582,
 3324,
 15117,
 15404,
 27794,
 42175,
 49126,
 50407,
 52053,
 60144,
 63374,
 64910,
 67182,
 67330,
 67799,
 67824,
 69232,
 71328,
 72099,
 74147,
 74255,
 74614,
 76610,
 76884,
 77138,
 77509,
 77787,
 77936,
 78574,
 80597,
 81782,
 82003,
 84363,
 84786,
 85251,
 89837,
 97093,
 98642,
 100474,
 105063,
 105298,
 105721,
 108710,
 109133,
 112844,
 113725,
 114997,
 115583,
 115833,
 116368,
 116557,
 121896,
 124823,
 126409,
 126542,
 128758,
 130976,
 131939,
 132826,
 136914,
 137187,
 137872,
 139196,
 140721,
 142324,
 146781,
 151497,
 154335,
 155139,
 155438,
 155886,
 156405,
 158108,
 159817,
 160107,
 161158,
 162085,
 165847,
 168316,
 168528,
 169111,
 170333,
 172684,
 182047,
 182427,
 186362,
 189535,
 190999,
 191110,
 193177,
 196686,
 202552,
 206340,
 207789,
 208382,
 209874,
 210525,
 217464,
 219933,
 221393,
 221533,
 221880,
 222213,
 224190,
 229501,
 229598,
 230783,
 231453,
 232140,
 234427,
 236193,
 236950,
 240168,

### Reading the content of a folder

Write a `get_files(dir, suffix)` function that returns the names of all the files in a folder ending with a specific `suffix` (txt). You will need the Python `os` package, see <a href="https://docs.python.org/3/library/os.html">https://docs.python.org/3/library/os.html</a>. The function will return the file names in a list.

You can reuse this function:

In [15]:
def get_files(dir, suffix):
    """
    Returns all the files in a folder ending with suffix
    :param dir:
    :param suffix:
    :return: the list of file names
    """  # triple quotes delimit a multi-line docstring
    
    files = []
    for file in os.listdir(dir):
        if file.endswith(suffix):
            files.append(file)
    return files

In [16]:
# Write your code here
files = get_files('Selma', 'txt')
print(files)

['bannlyst.txt', 'gosta.txt', 'herrgard.txt', 'jerusalem.txt', 'kejsaren.txt', 'marbacka.txt', 'nils.txt', 'osynliga.txt', 'troll.txt']


### Creating a master index

Complete your program with the creation of a master index, where you will associate each word of the corpus with the files where it occurs and its positions in them: a posting list.

Below is an excerpt of the master index with the words <i>samlar</i> and <i>ände</i>:

In [16]:
{'samlar':
            {'troll.txt': [641880, 654233],
            'nils.txt': [51805, 118943],
            'osynliga.txt': [399121],
            'gosta.txt': [313784, 409998, 538165]},
 'ände':
            {'troll.txt': [39562, 650112],
            'kejsaren.txt': [50171],
            'marbacka.txt': [370324],
            'nils.txt': [1794],
            'osynliga.txt': [272144]}
}

{'samlar': {'troll.txt': [641880, 654233],
  'nils.txt': [51805, 118943],
  'osynliga.txt': [399121],
  'gosta.txt': [313784, 409998, 538165]},
 'ände': {'troll.txt': [39562, 650112],
  'kejsaren.txt': [50171],
  'marbacka.txt': [370324],
  'nils.txt': [1794],
  'osynliga.txt': [272144]}}

The word <i>samlar</i>, for instance, occurs three times in the gosta text at positions 313784, 409998, and 538165.
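One possible way to obtain this nested structure is to merge per-file indices. The sketch below works on invented toy data; `per_file`, `merge_indices`, and `toy_master` are hypothetical names used only here.

```python
# Toy per-file indices (word -> positions); the values are invented.
per_file = {
    "a.txt": {"samlar": [10, 57], "ände": [3]},
    "b.txt": {"samlar": [8]},
}

def merge_indices(per_file_indices):
    """Combine per-file indices into word -> {file name -> positions}."""
    master = {}
    for file_name, index in per_file_indices.items():
        for word, positions in index.items():
            # setdefault creates the inner dictionary the first time a word is seen
            master.setdefault(word, {})[file_name] = positions
    return master

toy_master = merge_indices(per_file)
print(toy_master["samlar"])
```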

In [17]:
# write your code here
def master_index(txt_files):  # takes a list of text file names and loops over them
    master_idx = {}
    text_idx = {}

    for f in txt_files:
        with open('Selma/' + f, encoding='utf-8') as file:
            file_text = file.read()
        tokens = tokenize(file_text)  # file_text is a string, not a list
        idx = text_to_idx(tokens)  # tokens is a list of match objects
        text_idx[f] = idx  # map the file name to its index

        for word in text_idx[f]:  # word iterates over the keys of text_idx[f]
            if word in master_idx:
                master_idx[word].update({f: text_idx[f].get(word)})
            else:
                # New word: map it to a new dictionary whose key is the file name
                # and whose value is the list of occurrences
                master_idx[word] = {f: text_idx[f].get(word)}
    return master_idx

In [18]:
my_master_index = master_index(files)
my_master_index['samlar']

{'gosta.txt': [313784, 409998, 538165],
 'nils.txt': [51805, 118943],
 'osynliga.txt': [399121],
 'troll.txt': [641880, 654233]}

In [19]:
my_master_index['mårbacka']

{'marbacka.txt': [16,
  139,
  752,
  1700,
  2582,
  3324,
  15117,
  15404,
  27794,
  42175,
  49126,
  50407,
  52053,
  60144,
  63374,
  64910,
  67182,
  67330,
  67799,
  67824,
  69232,
  71328,
  72099,
  74147,
  74255,
  74614,
  76610,
  76884,
  77138,
  77509,
  77787,
  77936,
  78574,
  80597,
  81782,
  82003,
  84363,
  84786,
  85251,
  89837,
  97093,
  98642,
  100474,
  105063,
  105298,
  105721,
  108710,
  109133,
  112844,
  113725,
  114997,
  115583,
  115833,
  116368,
  116557,
  121896,
  124823,
  126409,
  126542,
  128758,
  130976,
  131939,
  132826,
  136914,
  137187,
  137872,
  139196,
  140721,
  142324,
  146781,
  151497,
  154335,
  155139,
  155438,
  155886,
  156405,
  158108,
  159817,
  160107,
  161158,
  162085,
  165847,
  168316,
  168528,
  169111,
  170333,
  172684,
  182047,
  182427,
  186362,
  189535,
  190999,
  191110,
  193177,
  196686,
  202552,
  206340,
  207789,
  208382,
  209874,
  210525,
  217464,
  219933,
  2213

Save your master index in a file and read it again

In [20]:
# Write your code here
with open("master_index.p", "wb") as f:
    pickle.dump(my_master_index, f)  # dump the index itself, not the master_index function
with open("master_index.p", "rb") as f:
    my_master_index = pickle.load(f)

In [20]:
my_master_index['samlar']

{'gosta.txt': [313784, 409998, 538165],
 'nils.txt': [51805, 118943],
 'osynliga.txt': [399121],
 'troll.txt': [641880, 654233]}

#### Concordances

Write a `concordance(word, master_index, window)` function to extract the concordances of a `word` within a window of `window` characters
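The windowing step can be sketched on an in-memory string. The helper `concordance_lines` and the toy text are assumptions of this sketch; it returns the snippets instead of printing them and does not read any file.

```python
def concordance_lines(text, positions, window):
    """Return the text around each position, `window` characters on either side."""
    lines = []
    for pos in positions:
        start = max(pos - window, 0)        # clamp at the start of the text
        snippet = text[start:pos + window]  # slicing clamps at the end by itself
        lines.append(snippet.replace("\n", " "))
    return lines

toy_text = "hon samlar blommor\noch han samlar stenar"
toy_lines = concordance_lines(toy_text, [4, 27], 10)
for line in toy_lines:
    print(line)
```

With a window of 10 characters, the two snippets are `hon samlar blo` and `r och han samlar ste`; the newline inside the second window is replaced by a space.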

In [21]:
# Write your code here
def concordance(word, master_index, window):
    # Look up the posting list directly instead of scanning every key
    if word not in master_index:
        return
    for txt_file, positions in master_index[word].items():
        with open('Selma/' + txt_file, encoding='utf-8') as file:
            file_text = file.read()

        print(txt_file)
        for pos in positions:
            # Take window characters on each side of the start of the word
            row = '        ' + file_text[max(pos - window, 0):min(pos + window, len(file_text) - 1)]
            row = row.replace('\n', ' ')
            print(row.lower())

In [23]:
concordance('samlar', my_master_index, 25)

gosta.txt
        om ligger nära borg, och samlar ihop ett litet mid
        lika förstämda.  men hon samlar upp allt detta som
        n ensam i livet.  därmed samlar han korten tillhop
nils.txt
         bara, att du i all hast samlar ihop så mycket bos
        ar stannat hemma, och nu samlar de sig för att int
osynliga.txt
         till höger i kärran och samlar just ihop tömmarna
troll.txt
        en örtkunnig läkare, som samlar in markens växter 
        älper dem, och medan hon samlar och handlar för de


### Representing Documents with tf-idf

Once you have created the index, you will represent each document in your corpus as a dictionary. The keys of these dictionaries will be the words and you will define the value of a word with the tf-idf metric: 
1. Read the description of the tf-idf measure on Wikipedia (<a href="https://en.wikipedia.org/wiki/Tf%E2%80%93idf">https://en.wikipedia.org/wiki/Tf-idf</a>)
2. After reading the description, you probably realized that there are multiple definitions of tf-idf. In this assignment, 
 * Tf will be the relative frequency of the term in the document and 
 * idf, the logarithm base 10 of the inverse document frequency.
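With these two definitions, the weight of a word in a document is its relative frequency multiplied by log10(N / df), where N is the number of documents and df the number of documents containing the word. A small worked sketch on an invented three-document corpus (the names `docs`, `tf_idf`, and `toy_weights` are hypothetical):

```python
import math

# Toy corpus: document -> list of tokens (invented for illustration)
docs = {
    "d1": ["nils", "gås", "gås", "och"],
    "d2": ["nils", "och"],
    "d3": ["och", "och", "troll", "skog"],
}

def tf_idf(docs):
    """tf = relative frequency in the document, idf = log10(N / df)."""
    n_docs = len(docs)
    vocabulary = {w for tokens in docs.values() for w in tokens}
    # Document frequency: number of documents containing the word
    df = {w: sum(w in tokens for tokens in docs.values()) for w in vocabulary}
    weights = {}
    for name, tokens in docs.items():
        weights[name] = {
            w: (tokens.count(w) / len(tokens)) * math.log10(n_docs / df[w])
            for w in vocabulary
        }
    return weights

toy_weights = tf_idf(docs)
print(toy_weights["d1"]["gås"])  # 0.5 * log10(3)
print(toy_weights["d1"]["och"])  # 0.0, since "och" is in every document
```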
        
You have below the tf-idf values for a few words. In our example, the word <i>gås</i> has the value 0 in bannlyst.txt and the value 0.000101001964 in nils.txt

<pre>
troll.txt
	känna	 0.0
	gås	 0.0
	nils	 2.148161748868631e-06
	et	 0.0
kejsaren.txt
	känna	 0.0
	gås	 0.0
	nils	 8.08284798629935e-06
	et	 8.273225429362848e-05
marbacka.txt
	känna	 0.0
	gås	 0.0
	nils	 7.582276564686669e-06
	et	 9.70107989686256e-06
herrgard.txt
	känna	 0.0
	gås	 0.0
	nils	 0.0
	et	 0.0
nils.txt
	känna	 0.0
	gås	 0.00010100196417506702
	nils	 0.00010164426900380124
	et	 0.0
osynliga.txt
	känna	 0.0
	gås	 0.0
	nils	 0.0
	et	 0.0
jerusalem.txt
	känna	 0.0
	gås	 0.0
	nils	 4.968292117670952e-06
	et	 0.0
bannlyst.txt
	känna	 0.0
	gås	 0.0
	nils	 0.0
	et	 0.0
gosta.txt
	känna	 0.0
	gås	 0.0
	nils	 0.0
	et	 0.0
</pre>

Conceptually, the tf-idf representation is a vector. In your program, you will keep this idea and use all the words in the corpus as keys: each dictionary will include all the words of the corpus. The value of a key may then be 0, meaning either that the word is not in the document, as for `nils` in `gosta.txt`, or that it occurs in all the documents, as for `känna`.

As further work, you may think of optimizing this part.

In [74]:
# Write your code here
def tfidf(master_index, files):
    tf = {}
    idf = {}
    tf_idf = {}
    for file in files:
        with open('Selma/' + file, encoding='utf-8') as f:
            text = f.read().lower().strip()
        total_nbr_word = len(re.findall(regex, text))  # number of tokens in the document
        tf[file] = {}
        tf_idf[file] = {}
        for w in master_index:
            # Number of occurrences; 0 if the word does not occur in this file
            freq = len(master_index[w].get(file, []))
            tf[file][w] = freq / total_nbr_word  # relative frequency
    N = len(files)
    for w in master_index:
        # Base-10 logarithm of (number of documents / documents containing w)
        idf[w] = math.log(N / float(len(master_index[w])), 10)
    for file in files:
        for w in tf[file]:
            tf_idf[file][w] = tf[file][w] * idf[w]
    return tf_idf

In [75]:
my_tf_idf = tfidf(my_master_index, files)

my_tf_idf['troll.txt']['känna']

0.0

In [76]:
my_tf_idf['troll.txt']['nils'] # 2.148161748868631e-06

2.1481617488686316e-06

### Comparing Documents

Using the cosine similarity, compare all the pairs of documents with their tf-idf representation and present your results in a table. You will include this table in your report.

#### Cosine similarity

Write a function computing the cosine similarity between two documents
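The cosine similarity of two vectors is their dot product divided by the product of their Euclidean norms. A minimal sketch on word-to-weight dictionaries follows; the function name `cosine`, the toy vectors, and the choice to return 0.0 for an all-zero vector are assumptions of this sketch.

```python
import math

def cosine(u, v):
    """Cosine similarity of two vectors stored as word -> weight dictionaries."""
    # Words missing from one dictionary count as 0, so only shared keys
    # contribute to the dot product
    dot = sum(weight * v[word] for word, weight in u.items() if word in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen for this sketch; the cosine is undefined here
    return dot / (norm_u * norm_v)

a = {"gås": 3.0, "nils": 4.0}
b = {"gås": 3.0, "nils": 4.0, "troll": 0.0}
c = {"troll": 2.0}
print(cosine(a, b))  # 1.0: same direction
print(cosine(a, c))  # 0.0: no shared nonzero word
```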

In [83]:
# Write your code here
def cosine_similarity(doc_1, doc_2, tf_idf):
    
    a = tf_idf[doc_1]
    b = tf_idf[doc_2]
    
    sum_tf_idf = 0
    pow_a = 0
    pow_b = 0
    for word in tf_idf[doc_1]:  # tf_idf[doc_2] works too, since both contain the same (all) words
        sum_tf_idf = sum_tf_idf + a[word]*b[word]
        pow_a = pow_a + pow(a[word], 2)
        pow_b = pow_b + pow(b[word], 2)
    
    similarity = sum_tf_idf/(math.sqrt(pow_a)*math.sqrt(pow_b))

    return similarity

#### Similarity matrix

Compute the similarity matrix between the documents of the corpus. While computing the similarities, you will record the two most similar documents that you will call `most_sim_doc1` and `most_sim_doc2`.
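Because the matrix is symmetric and its diagonal is always 1, the search for the most similar pair only needs each unordered pair once. The sketch below illustrates this with `itertools.combinations` on invented scores; `names`, `toy_sim`, and `most_similar_pair` are hypothetical names, not part of the assignment.

```python
from itertools import combinations

# Toy symmetric similarities, invented for illustration
names = ["a.txt", "b.txt", "c.txt"]
toy_sim = {
    ("a.txt", "b.txt"): 0.12,
    ("a.txt", "c.txt"): 0.37,
    ("b.txt", "c.txt"): 0.05,
}

def most_similar_pair(names, similarity):
    """Return (score, doc1, doc2) for the best pair of distinct documents."""
    best = None
    # combinations() yields each unordered pair once and never (x, x),
    # so the diagonal of the matrix is skipped by construction
    for doc1, doc2 in combinations(names, 2):
        score = similarity(doc1, doc2)
        if best is None or score > best[0]:
            best = (score, doc1, doc2)
    return best

best_pair = most_similar_pair(names, lambda x, y: toy_sim[(x, y)])
print(best_pair)
```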

In [85]:
# Write your code here
size = len(files)
max_similarity = -1
most_sim_doc1 = ''
most_sim_doc2 = ''
print(files)

for row in range(size):
    row_values = []  # avoid shadowing the built-in list
    for col in range(size):
        sim = round(cosine_similarity(files[row], files[col], my_tf_idf), 5)
        row_values.append(str(sim))
        # Skip the diagonal: a document is trivially identical to itself
        if row != col and sim > max_similarity:
            max_similarity = sim
            most_sim_doc1 = files[row]
            most_sim_doc2 = files[col]

    indent = 20 - len(files[row])
    print(files[row] + ' ' * indent + ' '.join(row_values))

['bannlyst.txt', 'gosta.txt', 'herrgard.txt', 'jerusalem.txt', 'kejsaren.txt', 'marbacka.txt', 'nils.txt', 'osynliga.txt', 'troll.txt']
bannlyst.txt        1.0 0.04904 0.00095 0.00646 0.02401 0.03681 0.05098 0.05206 0.08862
gosta.txt           0.04904 1.0 0.00311 0.00432 0.04802 0.08017 0.10483 0.12476 0.19574
herrgard.txt        0.00095 0.00311 1.0 0.37069 0.00074 0.00361 0.00507 0.00483 0.00407
jerusalem.txt       0.00646 0.00432 0.37069 1.0 0.00183 0.00487 0.00454 0.0283 0.00706
kejsaren.txt        0.02401 0.04802 0.00074 0.00183 1.0 0.07112 0.04966 0.05111 0.18128
marbacka.txt        0.03681 0.08017 0.00361 0.00487 0.07112 1.0 0.08474 0.09317 0.14715
nils.txt            0.05098 0.10483 0.00507 0.00454 0.04966 0.08474 1.0 0.11057 0.18848
osynliga.txt        0.05206 0.12476 0.00483 0.0283 0.05111 0.09317 0.11057 1.0 0.1926
troll.txt           0.08862 0.19574 0.00407 0.00706 0.18128 0.14715 0.18848 0.1926 1.0


Give the name of the two novels that are the most similar.

In [86]:
print("Most similar:", most_sim_doc1, most_sim_doc2, "Similarity:", max_similarity)

Most similar: herrgard.txt jerusalem.txt Similarity: 0.37069


## Submission

When you have written all the code and run all the cells, fill in your ID as well as the name of the notebook.

In [87]:
STIL_ID = ["elt15jli"] # Write your stil ids
CURRENT_NOTEBOOK_PATH = os.path.join(os.getcwd(), 
                                     "1-indexer.ipynb") # Write the name of your notebook

The submission code will send your answer. It consists of the two most similar novels.

In [88]:
ANSWER = ' '.join(sorted([most_sim_doc1, most_sim_doc2]))
ANSWER

'herrgard.txt jerusalem.txt'

Now the moment of truth:
1. Save your notebook and
2. Run the cells below

In [89]:
SUBMISSION_NOTEBOOK_PATH = CURRENT_NOTEBOOK_PATH + ".submission.bz2"

In [90]:
ASSIGNMENT = 1
API_KEY = "f581ba347babfea0b8f2c74a3a6776a7"

# Copy and compress current notebook
with bz2.open(SUBMISSION_NOTEBOOK_PATH, mode="wb") as fout:
    with open(CURRENT_NOTEBOOK_PATH, "rb") as fin:
        fout.write(fin.read())

In [91]:
res = requests.post("https://vilde.cs.lth.se/edan20checker/submit", 
                    files={"notebook_file": open(SUBMISSION_NOTEBOOK_PATH, "rb")}, 
                    data={
                        "stil_id": STIL_ID,
                        "assignment": ASSIGNMENT,
                        "answer": ANSWER,
                        "api_key": API_KEY,
                    },
                   verify=False)


# from IPython.display import display, JSON
res.json()



{'msg': None,
 'status': 'correct',
 'signature': 'c7eb11d723ecd8d77e2f4e5c1898b5a240b1128671bce46552a1658624cc11990eb91e191d593d663fed3ed6258981517be0a1b59158f2db9e5a5b7e2548d834',
 'submission_id': '30b8a919-c18f-438d-b017-e1731f692e9e'}

Check the `status` and be sure it is `correct`. If not, revise your code; verify that you obtained intermediate results identical to those in the notebook; and resubmit your notebook. You can submit multiple times.

<h2>Reading</h2>

Now that you are done, it is time to write your individual report. You will describe the indexer and comment on the results.

You will also read the text: <i>Challenges in Building Large-Scale Information Retrieval Systems</i> about the history of <a href="https://research.google.com/people/jeff/WSDM09-keynote.pdf">Google indexing</a> by <a href="https://research.google.com/pubs/jeff.html">Jeff Dean</a>.

In your report, you will explain how your index encoding relates to what Google did. You must identify the slide that shows the indexing technique most similar to yours and give its title and number in your report.