In [1]:
import numpy as np
import pandas as pd
import math
import nltk
import itertools
import functions as fn
import os
from collections import defaultdict
from collections import OrderedDict
from collections import Counter
from nltk.corpus import stopwords 
from nltk.stem import PorterStemmer
from nltk.tokenize import RegexpTokenizer

import pickle

## Load Data into Data Frame and Save it

#### EXECUTED JUST ONCE AT THE BEGINNING

In [None]:
df = fn.create_dataframe()

To save the dataframe we created to a single csv file, so that we can just import it instead of recreating it every time, we ran this command JUST ONCE:

In [3]:
df.to_csv(r'./data.csv', index = False)    

#### READ ONLY THE COLUMNS THAT WE NEED IN THE DATAFRAME CREATED

In [2]:
df = pd.read_csv("./data.csv", usecols= ['bookTitle','plot','url'])    #to read

This time we import only the columns needed for our analysis, which makes loading smoother and faster.

In [3]:
df

Unnamed: 0,bookTitle,plot,url
0,The Hunger Games,"Could you survive on your own in the wild, wit...",https://www.goodreads.com/work/best_book/27927...
1,Harry Potter and the Order of the Phoenix,There is a door at the end of a silent corrido...,https://www.goodreads.com/work/best_book/28092...
2,To Kill a Mockingbird,The unforgettable novel of a childhood in a sl...,https://www.goodreads.com/work/best_book/32757...
3,Pride and Prejudice,Alternate cover edition of ISBN 9780679783268S...,https://www.goodreads.com/work/best_book/30609...
4,Twilight,About three things I was absolutely positive.F...,https://www.goodreads.com/work/best_book/32122...
...,...,...,...
27104,Chicken Soup for the Soul: From Lemons to Lemo...,From lemons to lemonade; from heartbreak to ha...,https://www.goodreads.com/work/best_book/21955...
27105,The Lucy Variations,Lucy Beck-Moreau once had a promising future a...,https://www.goodreads.com/work/best_book/16774...
27106,Look to See Me: A Collection of Reflections,,https://www.goodreads.com/book/show/11457776-l...
27107,The Silver Wolf,Regeane is a fatherless royal relation who hap...,https://www.goodreads.com/work/best_book/25070...


#### Before starting we want to define some useful functions that we will use to save dictionaries into pickle files and load them back every time we might need them

In [4]:
#Useful functions to save and load files in pickle format
def save_dict(obj, name ):
    with open(f'{name}.pickle', 'wb') as f:
        pickle.dump(obj, f)

def load_obj(name ):
    with open(f'{name}.pickle', 'rb') as f:
        return pickle.load(f)

# Q2. Search Engine
## 2.1. Conjunctive query
In this first version of the search engine we just want to evaluate conjunctive queries (AND) with respect to the **plot** of the books only.

### Examine Null Values

In [5]:
df['plot'].isnull().sum()

582

#### Replace null values in the plot with a placeholder string

In [6]:
df['plot'] = df['plot'].fillna('unknown')   # assign back instead of inplace on a column selection

In [7]:
df['plot'].isnull().sum()

0

The first thing to do when building a search engine is to create a **vocabulary**. We first collect all the unique words in the plot of each book into a list; then we map each word in this list to an integer, which we call **term_id** and which will be used later to build the **inverted index**.
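As a rough illustration of the word-to-**term_id** mapping described above, here is a minimal sketch. It is not the code in `functions.py`: the `tokenize` and `build_vocabulary` helpers and the tiny stopword set are assumptions made for the example, and stemming is left out to keep it dependency-free (so `games` and `game` stay distinct here, whereas the notebook's preprocessing reduces both to `game`).

```python
import re

# Hypothetical stand-in for the real preprocessing: lowercase,
# keep alphanumeric tokens, drop a tiny illustrative stopword set.
STOPWORDS = {"the", "a", "of", "in", "and", "to", "is"}

def tokenize(text):
    """Split a plot into lowercase tokens, skipping stopwords."""
    return [t for t in re.findall(r"\w+", text.lower()) if t not in STOPWORDS]

def build_vocabulary(plots):
    """Map every distinct token to an integer term_id, in order of first appearance."""
    vocabulary = {}
    for plot in plots:
        for token in tokenize(plot):
            if token not in vocabulary:
                vocabulary[token] = len(vocabulary)
    return vocabulary

plots = ["The games begin in the arena", "A game of survival"]
vocab = build_vocabulary(plots)
print(vocab)
```

The only property the later steps rely on is that each distinct term gets exactly one integer id; the order in which ids are assigned does not matter.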

#### The next two cells were run just once at the beginning

In [8]:
dictionary = fn.build_dictionary(df)

In [9]:
save_dict(dictionary,'vocabulary')   #to save

#### Every time we restart the notebook we just need to load the file we have already created, using the command below:

In [8]:
dictionary = load_obj('vocabulary')  #to load

In [9]:
dictionary

{'upcom': 0,
 'shiversam': 1,
 '2195': 2,
 'kippur': 3,
 'novi': 4,
 'absorb': 5,
 'mel': 6,
 'jawless': 7,
 'offbeat': 8,
 'kilomet': 9,
 'kasparkova': 10,
 'être': 11,
 'valiard': 12,
 'bogucki': 13,
 'rubbl': 14,
 'villag': 15,
 'rushdi': 16,
 'phaedru': 17,
 'doberman': 18,
 'empieza': 19,
 'twinborn': 20,
 'transgend': 21,
 'nocturn': 22,
 'molina': 23,
 'darkey': 24,
 'tompkin': 25,
 'spilfer': 26,
 'mackensi': 27,
 '1187': 28,
 'eldest': 29,
 'shrine': 30,
 'specialand': 31,
 'brine': 32,
 'pandem': 33,
 'gardnerian': 34,
 'miama': 35,
 'pincu': 36,
 'potti': 37,
 'resist': 38,
 'amazonit': 39,
 'parliament': 40,
 'comeback': 41,
 'hoenikk': 42,
 'membranerefer': 43,
 'walloon': 44,
 'tunak': 45,
 'multiplex': 46,
 'eldanarian': 47,
 'zing': 48,
 'seadragon': 49,
 'raynor': 50,
 'traductor': 51,
 'talesskilgannon': 52,
 'merton': 53,
 'sackett': 54,
 'themag': 55,
 'delani': 56,
 'thunderbook': 57,
 'narcolepsi': 58,
 'quinzel': 59,
 'mulan': 60,
 'innisfallen': 61,
 'metalog': 

Now, before calculating the inverted index we will determine, for each document, the frequency of each word in the document:

In [10]:
frequency_of_words = fn.frequency_of_words_per_book(df,dictionary)
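A per-document word count of this kind could be sketched with `collections.Counter`. The helper name `word_frequencies` and the toy documents are assumptions for illustration; the actual `fn.frequency_of_words_per_book` may be organized differently.

```python
from collections import Counter

def word_frequencies(tokenized_docs):
    """Return {doc_id: Counter(term -> count)} for a list of token lists."""
    return {doc_id: Counter(tokens) for doc_id, tokens in enumerate(tokenized_docs)}

# Toy, already pre-processed documents (made up for the example).
docs = [["surviv", "game", "game"], ["game", "arena"]]
freq = word_frequencies(docs)

print(freq[0]["game"])    # count of 'game' in document 0
print(freq[1]["surviv"])  # a Counter returns 0 for missing terms
```

Indexing first by document and then by term keeps lookups for a single book cheap, and a `Counter` avoids `KeyError` for terms that do not occur.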

We now have all the tools needed to create our first **inverted index**: we will run the cell below just one time and store the inverted index in a pickle file so that we can recall it every time we need it.

#### The next two cells were run just once at the beginning

In [31]:
inverted_index_1 = fn.inverted_index1(df,dictionary,frequency_of_words)

In [32]:
save_dict(inverted_index_1,'inverted_index_1')    #to save

#### Every time we restart the notebook we just need to load the file we have already created, using the command below:

In [11]:
inverted_index_1 = load_obj('inverted_index_1')   #to load
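For readers unfamiliar with the structure, an inverted index of this first kind can be sketched as a mapping from each term_id to the list of documents containing that term. The function below is a hypothetical illustration on toy data, not the implementation of `fn.inverted_index1`.

```python
from collections import defaultdict

def build_inverted_index(tokenized_docs, vocabulary):
    """Map each term_id to the sorted list of doc ids whose text contains the term."""
    index = defaultdict(list)
    for doc_id, tokens in enumerate(tokenized_docs):
        for token in set(tokens):          # each document is listed once per term
            index[vocabulary[token]].append(doc_id)
    return dict(index)

# Toy vocabulary and documents (made up for the example).
vocabulary = {"surviv": 0, "game": 1, "arena": 2}
docs = [["surviv", "game", "game"], ["game", "arena"], ["surviv"]]
index = build_inverted_index(docs, vocabulary)
print(index)
```

Because documents are visited in increasing order of `doc_id`, every posting list comes out sorted without an explicit sort.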

Let's now provide a query as input:

In [13]:
query = input()

survival games


We have to pre-process the query in the same way as the plots, otherwise its terms will not match the entries of our inverted index.

In [14]:
query = fn.query_processed(query)
query

['surviv', 'game']

### SEARCH ENGINE 1

Now we have **everything** so we can implement our very first **search engine**

In [15]:
output = fn.search_engine1(query,df,inverted_index_1,dictionary)

In [16]:
output.head(10)

Unnamed: 0,bookTitle,plot,url
0,The Hunger Games,"Could you survive on your own in the wild, wit...",https://www.goodreads.com/work/best_book/27927...
1,Catching Fire,SPARKS ARE IGNITING.FLAMES ARE SPREADING.AND T...,https://www.goodreads.com/work/best_book/61714...
2,Mockingjay,The final book in the ground-breaking HUNGER G...,https://www.goodreads.com/work/best_book/88127...
3,Legend,What was once the western United States is now...,https://www.goodreads.com/work/best_book/14157...
4,"A Child Called ""It""",Also see: Alternate Cover Editions for this IS...,https://www.goodreads.com/work/best_book/59104...
5,The Magus,"This daring literary thriller, rich with eroti...",https://www.goodreads.com/work/best_book/18164...
6,Ender's Shadow,Welcome to Battleschool.Growing up is never ea...,https://www.goodreads.com/work/best_book/31455...
7,The Lucky One,When U.S. Marine Logan Thibault finds a photog...,https://www.goodreads.com/work/best_book/30944...
8,Sliding on the Snow Stone,It is astonishing that anyone lived this story...,https://www.goodreads.com/work/best_book/18004...
9,Code Name Verity,"Oct. 11th, 1943 - A British spy plane crashes ...",https://www.goodreads.com/work/best_book/16885...
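The conjunctive (AND) lookup behind a result like this can be sketched as an intersection of posting lists. This is a minimal sketch under the assumption that the index maps term_id to a list of document ids; `conjunctive_search` is a hypothetical helper, and `fn.search_engine1` may work differently.

```python
def conjunctive_search(query_terms, inverted_index, vocabulary):
    """Return the sorted doc ids that contain every term of the query."""
    postings = []
    for term in query_terms:
        term_id = vocabulary.get(term)
        if term_id is None:
            return []                      # unknown term: no document can match
        postings.append(set(inverted_index.get(term_id, [])))
    if not postings:
        return []                          # empty query
    return sorted(set.intersection(*postings))

# Toy vocabulary and index (made up for the example).
vocabulary = {"surviv": 0, "game": 1, "arena": 2}
inverted_index = {0: [0, 2], 1: [0, 1], 2: [1]}

hits = conjunctive_search(["surviv", "game"], inverted_index, vocabulary)
print(hits)
```

The matching ids can then be used to select rows of the dataframe, for example with `df.loc[hits]`.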


## 2.2 Conjunctive query & Ranking score
### For the second search engine, given a query, we want to retrieve the top-k documents most related to it. In particular:
- Find all the documents that contain all the words in the query.
- Sort them by their similarity to the query.
- Return the top k documents, or all the documents with non-zero similarity to the query when there are fewer than k results.
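The selection step in the list above can be sketched with a heap. The `top_k` helper and the scores are made up for illustration; they assume the similarity of each matching document has already been computed.

```python
import heapq

def top_k(scores, k):
    """Return up to k (doc_id, score) pairs with non-zero score, best first."""
    positive = [(score, doc_id) for doc_id, score in scores.items() if score > 0]
    best = heapq.nlargest(k, positive)     # compares by score, then by doc_id
    return [(doc_id, score) for score, doc_id in best]

# Hypothetical similarity scores keyed by document id.
scores = {0: 0.32, 1: 0.51, 2: 0.0, 3: 0.40}

print(top_k(scores, 2))
print(top_k(scores, 10))   # fewer than k results: all non-zero ones are returned
```

`heapq.nlargest` avoids sorting the whole candidate set when k is small compared with the number of matching documents.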

In [17]:
df

Unnamed: 0,bookTitle,plot,url
0,The Hunger Games,"Could you survive on your own in the wild, wit...",https://www.goodreads.com/work/best_book/27927...
1,Harry Potter and the Order of the Phoenix,There is a door at the end of a silent corrido...,https://www.goodreads.com/work/best_book/28092...
2,To Kill a Mockingbird,The unforgettable novel of a childhood in a sl...,https://www.goodreads.com/work/best_book/32757...
3,Pride and Prejudice,Alternate cover edition of ISBN 9780679783268S...,https://www.goodreads.com/work/best_book/30609...
4,Twilight,About three things I was absolutely positive.F...,https://www.goodreads.com/work/best_book/32122...
...,...,...,...
27104,Chicken Soup for the Soul: From Lemons to Lemo...,From lemons to lemonade; from heartbreak to ha...,https://www.goodreads.com/work/best_book/21955...
27105,The Lucy Variations,Lucy Beck-Moreau once had a promising future a...,https://www.goodreads.com/work/best_book/16774...
27106,Look to See Me: A Collection of Reflections,unknown,https://www.goodreads.com/book/show/11457776-l...
27107,The Silver Wolf,Regeane is a fatherless royal relation who hap...,https://www.goodreads.com/work/best_book/25070...


In [18]:
df.isnull().sum()

bookTitle    0
plot         0
url          0
dtype: int64

To answer this question we first need to define the concepts of **TF-IDF score** and **cosine similarity**.

**TF-IDF** stands for **Term Frequency-Inverse Document Frequency**. This technique quantifies the words inside a document by giving each word a weight proportional to its importance.

So, this score is given by two main components:
- *Term Frequency*: it measures how often a word occurs in a document, relative to the document's length. It is given by: $\textrm{tf}(t,d)=\frac{\textrm{count of t in d}}{\textrm{number of words in d}}$


- *Document Frequency*: it measures how common a word is across the corpus, by counting the number of documents in which the word is present (at least once): 
$\textrm{df}(t) = \textrm{number of documents containing t}$

But what we actually need is the *Inverse Document Frequency*:
it measures how informative the term $t$ is across the corpus. IDF is small if the word occurs in many documents, and vice versa:
    $\textrm{idf}(t) = \frac{N}{\textrm{df}(t)}$

where $N$ is the total number of documents. When the corpus is really large this ratio can become very large too, so it is convenient to take its logarithm.
Also, if a word that is not in the pre-determined vocabulary occurs, its df is equal to $0$, and we would be dividing by zero, which leads to an error. To solve this issue we add 1 to the denominator. So, the final formula is: $\textrm{idf}(t) = \log\left(\frac{N}{\textrm{df}(t)+1}\right)$

Finally, the TF-IDF score is defined by:

$\textrm{tf-idf}(t,d) = \textrm{tf}(t,d)\cdot\log\left(\frac{N}{\textrm{df}(t)+1}\right)$
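To make the formula concrete, the sketch below evaluates it on a toy corpus of four pre-processed documents. The function names and the data are assumptions made for the example and are not taken from `functions.py`.

```python
import math

def tf(term, doc_tokens):
    """Term frequency: count of the term divided by the document length."""
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, tokenized_docs):
    """Smoothed inverse document frequency: log(N / (df + 1))."""
    n_docs = len(tokenized_docs)
    df = sum(1 for tokens in tokenized_docs if term in tokens)
    return math.log(n_docs / (df + 1))

def tf_idf(term, doc_tokens, tokenized_docs):
    return tf(term, doc_tokens) * idf(term, tokenized_docs)

# Toy corpus (made up for the example): N = 4 documents.
docs = [
    ["surviv", "game", "game"],
    ["game", "arena"],
    ["arena", "war"],
    ["war", "peac"],
]

# 'surviv' occurs once in a 3-word document and in 1 of 4 documents:
# tf = 1/3, idf = log(4 / (1 + 1)) = log(2)
score = tf_idf("surviv", docs[0], docs)
print(score)
```

One consequence of adding 1 to the denominator is that a term present in every document gets a slightly negative idf, since $\log\left(\frac{N}{N+1}\right) < 0$, and a term present in $N-1$ documents gets an idf of exactly 0.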


## Step 1: calculate TF-IDF score

In [41]:
tf_score = fn.tf(df,dictionary,frequency_of_words)

In [42]:
idf_score = fn.idf(df,dictionary,frequency_of_words)

#### The next two cells were run just once at the beginning

In [43]:
tf_idf_scores = fn.tf_idf_score(df,dictionary,tf_score,idf_score)

In [44]:
save_dict(tf_idf_scores,'tf_idf_scores')    #to save

#### Every time we restart the notebook we just need to load the file we have already created, using the command below:

In [19]:
tf_idf_scores = load_obj('tf_idf_scores')   #to load

In [20]:
frequency_of_words[463]['upcom']

1

## Step 2: from list of words calculate inverted index 

#### The next two cells were run just once at the beginning

In [53]:
inverted_index_2 = fn.inverted_index2(df,dictionary,tf_idf_scores,frequency_of_words)

In [54]:
save_dict(inverted_index_2,'inverted_index_2')    #to save

#### Every time we restart the notebook we just need to load the file we have already created, using the command below:

In [21]:
inverted_index_2 = load_obj('inverted_index_2')   #to load

## Step 3: calculate cosine similarity

Let's now talk about **cosine similarity**:
this is a metric used to measure how similar two documents are, regardless of their size. Geometrically speaking, it is the cosine of the angle between two vectors, where each vector represents a document (or the query) and each component is the weight of one term. The formula to use is the following: $\textrm{cos}(\theta) = \textrm{cos}(\vec{x},\vec{y}) = \frac{\vec{x}\cdot\vec{y}}{||\vec{x}||\cdot||\vec{y}||}=\frac{\sum_{i=1}^{m} x_i\cdot y_i}{||\vec{x}||\cdot||\vec{y}||}$



In our specific case the final formula will be the following:
 $\textrm{score} (\vec{q},\vec{d_i}) = \frac{1}{||\vec{q}||}\cdot \frac{1}{||\vec{d_i}||} \cdot \sum_{j=1}^{m} q_j\, d^i_j$
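As a small numerical check of the formula, here is a sketch that computes the cosine between two dense vectors. The `cosine` helper and the example vectors are illustrative assumptions; they do not reproduce the signature or the internals of `fn.cosine_similarity`.

```python
import math

def cosine(x, y):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return dot / (norm_x * norm_y)

# Toy weight vectors over a 2-term vocabulary (made up for the example).
query_vec = [1.0, 0.0]
doc_vec = [1.0, 1.0]

print(cosine(query_vec, doc_vec))          # 1 / sqrt(2)
print(cosine([1.0, 2.0], [2.0, 4.0]))      # parallel vectors: close to 1
```

Because both vectors are divided by their norms, a long plot and a short plot with the same term proportions get the same score, which is the length-independence mentioned above.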
 


In [22]:
cosine_similarity_score = fn.cosine_similarity(query,df,tf_idf_scores)

### SEARCH ENGINE 2

We are now ready to build our second search engine. First, we need a query from the user as input:

In [28]:
query = input()

survival games


Then, we pre-process the query so that its terms can be found in the inverted index.

In [29]:
query = fn.query_processed(query)
query

['surviv', 'game']


Finally, we can look at the output we want, sorted in decreasing order based on the book's cosine similarity score:

In [23]:
output2 = fn.search_engine2(df,query,inverted_index_1,dictionary,cosine_similarity_score)

In [27]:
output2

Unnamed: 0,bookTitle,plot,url,Similarity
83,The Warden,Alice has led a normal life up until now. She ...,https://www.goodreads.com/work/best_book/54520...,0.5093
120,Devil's Own,"After surviving slavery, Aiden MacAlpin has no...",https://www.goodreads.com/work/best_book/13578...,0.3972
38,The Quillan Games,LET THE GAMES BEGIN....Quillan is a territory ...,https://www.goodreads.com/work/best_book/73848...,0.3652
0,The Hunger Games,"Could you survive on your own in the wild, wit...",https://www.goodreads.com/work/best_book/27927...,0.3248
4,"A Child Called ""It""",Also see: Alternate Cover Editions for this IS...,https://www.goodreads.com/work/best_book/59104...,0.2849
37,Truth,From New York Times and USA Today bestselling ...,https://www.goodreads.com/work/best_book/21863...,0.2749
58,The Books of the South,Marching south after the ghastly battle at the...,https://www.goodreads.com/work/best_book/23725...,0.2556
50,Cage of Darkness,"While traveling to Fren, Allyssa and Odar are ...",https://www.goodreads.com/work/best_book/43296...,0.2343
102,Blood Awakening,"A dangerous game of life, blood, and survival…...",https://www.goodreads.com/work/best_book/15950...,0.2266
26,The Calling,"Twelve thousand years ago, they came. They des...",https://www.goodreads.com/work/best_book/35441...,0.2257
