This notebook uses a subset of the Python libraries used in `generator.ipynb`. In addition, `cosine_similarity` from scikit-learn is used to measure the similarity between documents.

In [1]:
import json

import cloudpickle as pickle
import nltk
import numpy as np
import pandas as pd

from src.common import get_text
from sklearn.metrics.pairwise import cosine_similarity

Load the database (documents in tf-idf representation) generated in `generator.ipynb`.

In [2]:
database = pd.read_csv('./src/database.csv', index_col=0)

Load the tf-idf transformer fitted in `generator.ipynb`.

In [3]:
with open('./src/tfidf.pkl', 'rb') as fp:
    tfidf_transformer = pickle.load(fp)

For the sake of explainability, create a dictionary that maps term indices in the tf-idf representation to human-readable words.

In [4]:
term_dict = {value: key for key, value in tfidf_transformer.vocabulary_.items()}
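
As a small illustration of this inversion, the sketch below uses a made-up three-term vocabulary (the terms and indices are hypothetical, not taken from the fitted transformer). It assumes that `vocabulary_` maps each term to its column index, so swapping keys and values gives a lookup from column index back to term.

```python
# Hypothetical vocabulary in the "term -> column index" layout.
vocabulary = {"inform": 0, "queri": 1, "retriev": 2}

# Invert the mapping so that a column index can be translated back to its term.
index_to_term = {index: term for term, index in vocabulary.items()}

print(index_to_term[2])
```

The inversion is only safe because every term has a distinct column index; with repeated values, later entries would silently overwrite earlier ones.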

Function converting pages, given by their URLs, to their tf-idf content representation. In general, the function repeats the pipeline from `generator.ipynb`.

In [5]:
def urls_to_tfidf(urls: list) -> pd.DataFrame:  # urls: list of URL strings
    porter = nltk.PorterStemmer()
    wordnet = nltk.WordNetLemmatizer()
    stopwords = set(nltk.corpus.stopwords.words('english'))
    documents = [get_text(url) for url in urls]
    tokens = [nltk.tokenize.wordpunct_tokenize(doc) for doc in documents]
    tokens = [[token for token in document if token.isalpha() and token not in stopwords] for document in tokens]
    lemmas = [[wordnet.lemmatize(porter.stem(token)) for token in document] for document in tokens]
    tfidf = tfidf_transformer.transform(lemmas)
    df = pd.DataFrame.sparse.from_spmatrix(tfidf)
    df.index = urls
    return df
    

Function scanning the database to find the documents most similar to the query. Based on what I learned in the Information Retrieval course, I decided to measure the similarity between documents in tf-idf representation with cosine similarity. The function filters out documents whose similarity to any document in the query is equal to 1.0 (or within 1e-3 of 1.0, to allow for limited floating-point precision). This removes from the results not only the query documents themselves but also all their duplicates hidden under different links (https://en.wikipedia.org/wiki/Information_retrieval_applications is just an alias for https://en.wikipedia.org/wiki/Information_retrieval#Applications). The similarity between the query and a record is computed as the average of the record's cosine similarities with all documents in the query.
$$\text{sim}(\text{query}, \text{doc})=\frac{1}{|\text{query}|} \sum_{q \in \text{query}} \cos(q, \text{doc})$$

In [6]:
def query_database(database: pd.DataFrame, query: pd.DataFrame, top: int = None) -> pd.DataFrame:
    df = database.copy()
    for i in range(query.shape[0]):
        q = query.iloc[i, :].to_frame().transpose()
        df[f'cos_sim_{i}'] = cosine_similarity(df.iloc[:, :database.shape[1]], q)
       
    cos_temp_cols = list(df.columns[df.columns.str.contains('cos_sim_')])
    
    for col in cos_temp_cols:
        df = df[np.abs(df[col] - 1.0) > 1e-3]
        
    df['cosine_similarity'] = df.iloc[:, df.columns.str.contains('cos_sim_')].sum(axis=1) / query.shape[0]
    df = df.drop(columns=cos_temp_cols)  

    df = df.sort_values(by='cosine_similarity', ascending=False)   
    if top: df = df.head(top)
    return df
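
The scoring rule can be checked by hand on a toy example. The sketch below is not the function above; it is a minimal NumPy-only illustration with made-up three-term vectors and hypothetical document names, showing the near-duplicate filter (same 1e-3 tolerance) followed by the averaging step.

```python
import numpy as np


def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine of the angle between two non-zero vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


# Toy tf-idf vectors (made-up values, three terms each).
query_docs = np.array([[1.0, 0.0, 0.0],
                       [0.0, 1.0, 0.0]])
database_docs = {
    "duplicate_of_first_query_doc": np.array([2.0, 0.0, 0.0]),
    "mixed": np.array([1.0, 1.0, 0.0]),
    "unrelated": np.array([0.0, 0.0, 1.0]),
}

scores = {}
for name, doc in database_docs.items():
    sims = [cosine(q, doc) for q in query_docs]
    # Skip documents that are (nearly) identical to any query document.
    if any(abs(s - 1.0) <= 1e-3 for s in sims):
        continue
    # Average the similarities over all query documents.
    scores[name] = sum(sims) / len(sims)

# Most similar documents first.
ranking = sorted(scores.items(), key=lambda item: item[1], reverse=True)
print(ranking)
```

Note that the duplicate is dropped even though its vector is twice as long as the query vector: cosine similarity depends only on direction, which is why aliases of a query page are caught by the filter.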

Function explaining key similarities and differences between the query and the result. I believe that the provided explanations are easily interpretable, so I do not elaborate on how they are computed. The output lets the user find key similarities and differences between documents and thus decide whether a page is worth considering (e.g. "cellphone" vs. "red cell": similar words, different domains).

In [7]:
def explain_result(result: pd.DataFrame, query: pd.DataFrame, top: int = 10) -> None:
    result = result.drop(columns='cosine_similarity')
    result_all = np.ceil(result).sum(axis=0)
    result_unique = set(result_all[result_all == 1].index)
    result_any = set(result_all[result_all >= 1].index)
    result_all = set(result_all[result_all == result.shape[0]].index)
    
    query_all = np.ceil(query).sum(axis=0)
    query_unique = set(query_all[query_all == 1].index.map(str))
    query_any = set(query_all[query_all >= 1].index.map(str))
    query_all = set(query_all[query_all == query.shape[0]].index.map(str))

    print(f"IMPORTANT TERMS")
    print(f"\tResult: {len(result_any)}\tQuery: {len(query_any)}")
    
    common = query_all.intersection(result_all)
    print(f"COMMON TERMS")
    print(f"\tIn total (result and query): {len(common)}")
    print("\t", end="")
    if len(common) in range(0, top):
        for c in list(common)[:top]: print(f"\"{term_dict[int(c)]}\"", end=" ")
    print("")
    print(f"\tResult: {len(result_all)}\tQuery: {len(query_all)}")
    
    common = query_any.intersection(result_any)
    different = query_any.union(result_any)
    different = different.difference(common)
    unique_query = len(query_any.difference(common))
    unique_result = len(result_any.difference(common))
    
    
    unique = result_unique.difference(query_unique)
    print(f"UNIQUE TERMS")
    print(f"\tIn total: {len(unique)}")
    print(f"\tResult: {len(result_unique)}\tQuery: {len(query_unique)}")

    print(f"DIFFERENT TERMS")
    print(f"\tIn total (or in result either in query): {len(different)}")
    print(f"\tResult: {unique_result}\tQuery: {unique_query}")

    result = result.sum(axis=0)
    missing = result[~result.index.isin(query_all)].sort_values(ascending=False)[:top]
    print(f"\tTERMS IN THE RESULT NOT OCCURING IN THE QUERY")
    print("\t\t", end="")
    for mis in missing.index:
        print(f"\"{term_dict[int(mis)]}\"", end=" ")
        
    query = query.sum(axis=0)
    query.index = query.index.map(str)
    missing = query[~query.index.isin(result_all)].sort_values(ascending=False)[:top]
    print("")
    print(f"\tTERMS IN THE QUERY NOT OCCURING IN THE RESULT")
    print("\t\t", end="")
    for mis in missing.index:
        print(f"\"{term_dict[int(mis)]}\"", end=" ")
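
The three kinds of term sets used above ("any", "all" and "unique") can be illustrated on a tiny matrix. The sketch below is a simplified stand-in that uses plain NumPy instead of DataFrames; the term names and weights are made up. It counts, for each term, the number of documents with a non-zero weight, which plays the same role as summing `np.ceil` of the tf-idf values.

```python
import numpy as np

terms = ["cell", "phone", "blood", "red"]

# Rows are documents, columns are terms; values are made-up tf-idf weights.
docs = np.array([[0.4, 0.3, 0.0, 0.0],
                 [0.2, 0.0, 0.5, 0.1]])

# Number of documents in which each term has a non-zero weight.
doc_counts = (docs > 0).sum(axis=0)

# Terms present in at least one document, in every document, in exactly one.
terms_any = {t for t, c in zip(terms, doc_counts) if c >= 1}
terms_all = {t for t, c in zip(terms, doc_counts) if c == docs.shape[0]}
terms_unique = {t for t, c in zip(terms, doc_counts) if c == 1}

print(sorted(terms_any), sorted(terms_all), sorted(terms_unique))
```

Comparing such sets between the query and the result with `intersection` and `difference` gives the counts reported in the individual sections.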

# DEMO
### 1. Provide query and generate its tf-idf representation
Define the links constituting the query in the file `query.json`, as an array in the field named "urls".

In [8]:
with open('query.json', 'r') as fp:
    urls = json.load(fp)['urls']
query = urls_to_tfidf(urls)
query

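| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 30475 | 30476 | 30477 | 30478 | 30479 | 30480 | 30481 | 30482 | 30483 | 30484 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| https://en.wikipedia.org/wiki/Information_retrieval | 0.012078 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| https://en.wikipedia.org/wiki/Artificial_intelligence | 0.035195 | 0.0 | 0.0 | 0.0 | 0.0 | 0.008004 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| https://en.wikipedia.org/wiki/Pug | 0.02837 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |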
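
For reference, the contents of a `query.json` consistent with the query above can be sketched as follows. The key name `urls` follows the loading code, and the three links are the ones indexing the query table; the variable names are hypothetical, and nothing is written to disk here.

```python
import json

query_spec = {
    "urls": [
        "https://en.wikipedia.org/wiki/Information_retrieval",
        "https://en.wikipedia.org/wiki/Artificial_intelligence",
        "https://en.wikipedia.org/wiki/Pug",
    ]
}

# Serialize to the text that would be stored in query.json.
query_json = json.dumps(query_spec, indent=4)
print(query_json)

# Reading it back mirrors the loading step of the demo.
urls_example = json.loads(query_json)["urls"]
```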


### 2. Query database
The returned `pd.DataFrame` contains pages ordered by similarity (most similar at the top). To limit the number of results, pass the argument `top`.

In [9]:
result = query_database(database, query, top=10)
result

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,30476,30477,30478,30479,30480,30481,30482,30483,30484,cosine_similarity
/wiki/concept_search,0.049664,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.28642
/wiki/document_retrieval,0.085551,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25786
/wiki/information_science,0.041869,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.252889
/wiki/information_science#history,0.041869,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.252889
/wiki/human%e2%80%93computer_information_retrieval,0.027073,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.245405
/wiki/ranking_(information_retrieval),0.014867,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.22609
/wiki/relevance_(information_retrieval),0.041853,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.218742
/wiki/information_system,0.042404,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.210526
/wiki/information_systems,0.042404,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.210526
/wiki/information_systems_(discipline),0.042404,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.210526


### 3. Choose result and display similarities and differences
Below you can see the key similarities and differences between the query and the results.

- The section "Important terms" shows the number of terms with nonzero tf-idf values in each set.
- The section "Common terms" concerns terms shared by both sets. The first row gives the number of terms occurring in all documents of both the result and the query; if there are fewer than `top` of them, they are listed on the next line. The last row gives the number of terms occurring in all documents of each set, computed separately for the result and the query.
- The section "Unique terms" concerns terms occurring in only one document of a set. The first row gives the number of terms that are unique in the result but not unique in the query; the second row gives the number of unique terms in each set separately.
- The section "Different terms" shows the number of terms occurring either in the result or in the query but not in both, followed by the breakdown into terms found only in the result and terms found only in the query. The two subsections that follow list the highest-weighted terms of each set after dropping the terms that occur in every document of the other set, which is why a term such as "queri" can appear in both lists.

In [10]:
explain_result(result, query)

IMPORTANT TERMS
	Result: 1628	Query: 1973
COMMON TERMS
	In total (result and query): 7
	"system" "the" "two" "exampl" "one" "a" "use" 
	Result: 16	Query: 58
UNIQUE TERMS
	In total: 278
	Result: 389	Query: 1583
DIFFERENT TERMS
	In total (or in result either in query): 1877
	Result: 766	Query: 1111
	TERMS IN THE RESULT NOT OCCURING IN THE QUERY
		"inform" "retriev" "document" "queri" "relev" "scienc" "search" "technolog" "is" "process" 
	TERMS IN THE QUERY NOT OCCURING IN THE RESULT
		"ai" "breed" "dog" "search" "intellig" "queri" "machin" "ir" "fawn" "artifici" 