# Term Project
“The Cranfield collection [...] was the pioneering test collection in allowing precise quantitative measures of information retrieval effectiveness [...]. Collected in the United Kingdom starting in the late 1950s, it contains 1398 abstracts of aerodynamics journal articles, a set of 225 queries, and exhaustive relevance judgments of all (query, document) pairs.” [1, Section 8.2]

Your tasks, reviewed by your colleagues and the course instructors, are the following:

1.   *Implement a ranked retrieval system* [1, Chapter 6] that produces a list of documents from the Cranfield collection in descending order of relevance to a query from the Cranfield collection. You MUST NOT use relevance judgements from the Cranfield collection in your information retrieval system. Relevance judgements MUST only be used for the evaluation of your information retrieval system.

2.   *Document your code* in accordance with [PEP 257](https://www.python.org/dev/peps/pep-0257/), ideally using [the NumPy style guide](https://numpydoc.readthedocs.io/en/latest/format.html#docstring-standard) as seen in the code from exercises.  
     *Stick to a consistent coding style* in accordance with [PEP 8](https://www.python.org/dev/peps/pep-0008/).

3.   *Reach at least 35% mean average precision* [1, Section 8.4] with your system on the Cranfield collection. You are encouraged to use techniques for tokenization [1, Section 2.2], document representation [1, Section 6.4], tolerant retrieval [1, Chapter 3], relevance feedback and query expansion [1, Chapter 9], and others discussed in the course.

4.   *Upload a link to your Google Colaboratory document to the homework vault in IS MU.* You MAY also include a brief description of your information retrieval system.

#### Install the fresh version of utils

In [1]:
! pip install git+https://gitlab.fi.muni.cz/xstefan3/pv211-utils.git@master | grep '^Successfully'

  Running command git clone -q https://gitlab.fi.muni.cz/xstefan3/pv211-utils.git /tmp/pip-req-build-hyhdxteb
Successfully built gdown
Successfully built pv211-utils
Successfully installed gdown-3.12.2 pv211-utils-1.0.0


## Loading the Cranfield collection

### Loading the documents
The following code loads documents from the Cranfield collection into the `documents` [ordered dictionary](https://docs.python.org/3.8/library/collections.html#collections.OrderedDict). Tokenization of the `title` and `body` attributes of the individual documents as well as the creative use of the `authors`, `bibliography`, and `title` attributes is left to your imagination and craftsmanship.

In [2]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
import re, string
import inflect
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from nltk.tokenize import WordPunctTokenizer
import copy

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


In [3]:
def ending_cleaner(text, ending):
  """Cut off the end of a string if it ends with the given ending.

  Parameters
  ----------
  text : str
      The text which is to be cleaned.

  ending : str
      The ending which is to be removed if present.

  Returns
  -------
  text : str
      The text without the ending.

  """
  # an empty ending is skipped, since text[:-0] would return an empty string
  if ending and text.endswith(ending):
    return text[:-len(ending)]

  return text

def preproces_text(text):
  """Perform a series of preprocessing tasks on text.

  Parameters
  ----------
  text : str
      The text which is to be cleaned.

  Returns
  -------
  proces : list of str
      The preprocessed tokens of text.

  """
  #make a copy of text     
  text_copy = copy.deepcopy(text)
  # strip numbers
  text_copy = re.sub('[0-9]+', '', text_copy)
  # tokenize text
  proces = WordPunctTokenizer().tokenize(text_copy)
  # convert to lower case
  proces = list(map(lambda x: x.lower(), proces))
  # remove punctuation
  proces = list(map(lambda x: re.sub(r'[^\w\s]', '', x), proces))
  proces = [token for token in proces if token != '']
  # use ending cleaner with ending 'eous', since the word gaseous does not get correctly stemmed
  proces = [ending_cleaner(token, 'eous') for token in proces]
  # lemmatize the tokens
  lemmatizer = WordNetLemmatizer()
  proces = [lemmatizer.lemmatize(token) for token in proces]
  # words from the nltk stopword list that should nevertheless be kept
  neg_stopwords = ['must']
  # copy nltk stopwords
  nltk_stopwords = list(nltk.corpus.stopwords.words('english'))
  # remove neg_stopwords from nltk_stopwords
  actual_stopwords = set(nltk_stopwords) - set(neg_stopwords)
  # add some custom stopwords which I determined should be removed while examining query and document bodies
  custom_stopwords = ['viz', 'fig', 'ie', 'eg', 'see', 'dash', 'far', 'one', 'else', 'anyone', 'made', 'way', 'exist']
  custom_stopwords.extend(['available', 'done', 'play', 'within', 'work', 'paper', 'quite', 'title', 'efforts', 'chart'])
  custom_stopwords.extend(['best', 'year', 'called', 'mention', 'pre', 'subscript', 'let'])
  # remove stopwords from proces list
  proces = [token for token in proces if token not in actual_stopwords and token not in custom_stopwords]
  # stemming
  stemmer = nltk.stem.PorterStemmer()
  # prevent 'gas' from being stemmed, since different versions of the word don't get stemmed equally
  non_stem_words = ['gas']
  new_proces = []
  for token in proces:
    # tokens of up to three characters and protected words are kept unstemmed
    if len(token) > 3 and token not in non_stem_words:
      new_proces.append(stemmer.stem(token))
    else:
      new_proces.append(token)
  proces = new_proces
  # remove tokens which are too short
  proces = [token for token in proces if len(token) > 2]
  return proces


In [4]:
from pv211_utils.entities import DocumentBase

class Document(DocumentBase):
    """
    A Cranfield collection document.

    Parameters
    ----------
    document_id : int
        A unique identifier of the document.
    authors : list of str
        Unique identifiers of the authors of the document.
    bibliography : str
        The bibliographical entry for the document.
    title : str
        The title of the document.
    body : str
        The abstract of the document.

    Attributes
    ----------
    preprocessed_title : list of str
        Preprocessed version of the document title.
    preprocessed_body : list of str
        Preprocessed version of the document body.

    """
    def __init__(self, document_id, authors, bibliography, title, body):
        super().__init__(document_id, authors, bibliography, title, body)
        self.preprocessed_title = preproces_text(title)
        self.preprocessed_body = preproces_text(body)

In [5]:
from pv211_utils.loader import load_documents
documents = load_documents(Document)


### Loading the queries
The following code loads queries from the Cranfield collection into the `queries` [ordered dictionary](https://docs.python.org/3.8/library/collections.html#collections.OrderedDict). Tokenization of the `body` attribute of the individual queries is left to your imagination and craftsmanship.

In [None]:
from pv211_utils.entities import QueryBase

class Query(QueryBase):
    """
    A Cranfield collection query.

    Parameters
    ----------
    query_id : int
        A unique identifier of the query.
    body : str
        The text of the query.

    Attributes
    ----------
    preprocessed_body : list of str
        Preprocessed version of the query body.

    """
    def __init__(self, query_id, body):
        super().__init__(query_id, body)
        self.preprocessed_body = preproces_text(body)

In [None]:
from pv211_utils.loader import load_queries
queries = load_queries(Query)

### Loading the relevance judgements
The following code loads relevance judgements from the Cranfield collection into the `relevant` set. Relevance judgements MUST NOT be used in your information retrieval system. Relevance judgements MUST only be used for the evaluation of your information retrieval system.

In [None]:
from pv211_utils.loader import load_judgements
relevant = load_judgements(queries, documents)

In [None]:
def uniq(sorted_list):
    """Remove duplicates from a sorted list.

    This code is borrowed from the week 1 and 2 class notebook and is used
    to build the inverted index.

    Parameters
    ----------
    sorted_list : list
        The sorted list.
  
    Returns
    -------
    list
        The sorted list with duplicates removed.

    """
    if len(sorted_list) <= 1:
        return sorted_list
    uniq_list = sorted_list[:1]
    previous_value = sorted_list[0]
    for value in sorted_list[1:]:
        if value != previous_value:
            uniq_list.append(value)
        previous_value = value
    return uniq_list
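
The sort-then-deduplicate approach used with `uniq` can also be expressed with sets. The following is a minimal sketch, assuming a mapping from document identifiers to preprocessed tokens; the function name `build_inverted_index_sketch` and the toy collection are hypothetical and not part of the notebook's code.

```python
from collections import defaultdict


def build_inverted_index_sketch(tokenized_docs):
    """Map each term to the sorted ids of the documents that contain it."""
    postings = defaultdict(set)
    for doc_id, tokens in tokenized_docs.items():
        for token in tokens:
            # A set ignores repeated occurrences of a term in one document.
            postings[token].add(doc_id)
    # Sort the terms and the posting lists, as sorting the pairs would.
    return {term: sorted(doc_ids) for term, doc_ids in sorted(postings.items())}


# Hypothetical toy collection: document id -> preprocessed tokens.
toy_docs = {
    1: ["boundari", "layer", "flow"],
    2: ["shock", "wave", "flow", "flow"],
    3: ["boundari", "layer", "shock"],
}
toy_index = build_inverted_index_sketch(toy_docs)
print(toy_index["flow"])
```

Both approaches yield one sorted posting list per term; the set-based variant avoids materializing and sorting every (term, document) pair.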

## Implementation of your information retrieval system
The following code implements the information retrieval system in the `MyIRSystem` class. It represents the bodies and titles of the documents as tf-idf vectors and ranks the documents by the cosine similarity between these vectors and the tf-idf vector of the query.
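
The core idea of tf-idf ranking with cosine similarity [1, Chapter 6] can be sketched with the standard library alone. The sketch below uses the textbook weighting tf × log(N/df), not the tuned weighting of the class that follows; the toy collection and all function names are assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(tokenized_docs):
    """Return sparse tf-idf vectors (doc id -> {term: weight}) and the idf table."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter(term for tokens in tokenized_docs.values() for term in set(tokens))
    idf = {term: math.log(n_docs / freq) for term, freq in doc_freq.items()}
    vectors = {
        doc_id: {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        for doc_id, tokens in tokenized_docs.items()
    }
    return vectors, idf


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dictionaries."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(weight * weight for weight in u.values()))
    norm_v = math.sqrt(sum(weight * weight for weight in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_documents(query_tokens, vectors, idf):
    """Return document ids in descending order of similarity to the query."""
    # Query terms that never occur in the collection are ignored.
    query_vec = {t: tf * idf[t] for t, tf in Counter(query_tokens).items() if t in idf}
    scores = {doc_id: cosine(query_vec, vec) for doc_id, vec in vectors.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical toy collection: document id -> preprocessed tokens.
toy_collection = {
    1: ["shock", "wave"],
    2: ["boundari", "layer"],
    3: ["shock", "layer", "flow"],
}
toy_vectors, toy_idf = tfidf_vectors(toy_collection)
print(rank_documents(["shock", "wave"], toy_vectors, toy_idf))
```

Document 1 shares both query terms, document 3 shares only the common term `shock`, and document 2 shares none, so the ranking places them in that order.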

In [None]:
from collections import OrderedDict

from pv211_utils.irsystem import IRSystem

class MyIRSystem(IRSystem):
    """
    A system that ranks documents by the cosine similarity of tf-idf vectors.

    Parameters
    ----------
    threshold : float, optional
        Stored on the instance; not used by the current ranking code.

    Attributes
    ----------
    all_documents : OrderedDict
        The documents of the Cranfield collection.
    output_documents : list of Document
        The ranked results of the most recent search.

    """
    def __init__(self, threshold = 0):
        self.all_documents = documents
        self.output_documents = []  # changed from [all_documents[0]]
        self.build_inverted_index()
        self.build_tf_idf_frames()
        self.threshold = threshold

    def build_inverted_index(self):
      """
      Build inverted indexes of document bodies, document titles and query bodies.

      The basic structure is borrowed from the week 1 and 2 class notebook.
      Nothing is returned; the results are stored in the following globals.

      dictionary : list
          The sorted unique terms of the preprocessed document bodies.
      inverted_index : dict
          Maps each body term to the sorted ids of the documents containing it.
      dictionary_titles : list
          The sorted unique terms of the preprocessed document titles.
      inverted_index_titles : dict
          Maps each title term to the sorted ids of the documents containing it.
      query_dictionary : list
          The sorted unique terms of the preprocessed query bodies.
      query_inverted_index : dict
          Maps each query term to the sorted ids of the queries containing it.

      """

      # build token, document pairs from preprocessed document bodies
      pairs = []
      for doc_id in documents:
          tokens = documents[doc_id].preprocessed_body
          for token in tokens:
              pair = (token, documents[doc_id].document_id)
              pairs.append(pair)
      sorted_pairs = uniq(sorted(pairs, key=lambda x: (x[0].lower(), x[1])))

      # build dictionary and inverted index from preprocessed document bodies
      global dictionary
      dictionary = []
      global inverted_index
      inverted_index = {}
      for term, doc_id in sorted_pairs:
        if term not in inverted_index:
          inverted_index[term] = []
          dictionary.append(term)
        inverted_index[term].append(doc_id) 

      # build token, document pairs from preprocessed document titles
      pairs = []
      for doc_id in documents:
          tokens = documents[doc_id].preprocessed_title
          for token in tokens:
              pair = (token, documents[doc_id].document_id)
              pairs.append(pair)
      sorted_pairs = uniq(sorted(pairs, key=lambda x: (x[0].lower(), x[1])))

      # build dictionary and inverted index from preprocessed document titles
      global dictionary_titles
      dictionary_titles = []
      global inverted_index_titles
      inverted_index_titles = {}
      for term, doc_id in sorted_pairs:
        if term not in inverted_index_titles:
          inverted_index_titles[term] = []
          dictionary_titles.append(term)
        inverted_index_titles[term].append(doc_id)           

      # build token, document pairs from preprocessed query bodies
      pairs = []
      for query_id in queries:
          tokens = queries[query_id].preprocessed_body
          for token in tokens:
              pair = (token, queries[query_id].query_id) # doc instead of doc.document_id
              pairs.append(pair)
      sorted_pairs = uniq(sorted(pairs, key=lambda x: (x[0].lower(), x[1])))

      # build dictionary and inverted index from preprocessed query bodies      
      global query_dictionary
      query_dictionary = []
      global query_inverted_index
      query_inverted_index = {}
      for term, query_id in sorted_pairs:
        if term not in query_inverted_index:
          query_inverted_index[term] = []
          query_dictionary.append(term)
        query_inverted_index[term].append(query_id)     

    def build_tf_idf_frames(self):
      """
      Build tfidf dataframes for preprocessed document bodies and titles.

      Nothing is returned; the results are stored in the following globals.

      tfidf_frame : dataframe
          The tfidf value of each term (row) in each preprocessed
          document body (column).
      tfidf_frame_titles : dataframe
          The tfidf value of each term (row) in each preprocessed
          document title (column).

      """
      # calculate average documents body size for special weighting
      avg_doc_size = 0
      counter = 0
      for doc_id in documents:
        avg_doc_size += len(documents[doc_id].preprocessed_body)
        counter += 1
      avg_doc_size = avg_doc_size/counter

      # build a data frame of term frequencies for document bodies
      global tf_frame
      tf_frame = pd.DataFrame()
      tf_frame['terms'] = dictionary
      for doc_id in self.all_documents:
        term_frequency = {}
        for term in self.all_documents[doc_id].preprocessed_body:
          if term not in term_frequency:
            term_frequency[term] = 1
          else:
            term_frequency[term] += 1
        tf_doc_entry = {}
        for term in dictionary:
          if term in term_frequency:
            tf_doc_entry[term] = term_frequency[term]/(term_frequency[term]+4.289*len(documents[doc_id].preprocessed_body)/avg_doc_size)
          else:
            tf_doc_entry[term] = 0
        tf_frame[doc_id] = tf_frame['terms'].map(tf_doc_entry)
      tf_frame.set_index('terms', inplace=True)

      # N is the number of documents in the collection
      N = len(documents)
      # document frequency of each term is the length of its posting list
      df_dict = {term: len(inverted_index[term]) for term in dictionary}

      # build idf dictionary
      global idf_dict
      idf_dict = dict((term, np.log2((N+1)/(df+1)+1.0001)) for term, df in df_dict.items())
      # build tfidf dataframe
      global tfidf_frame
      tfidf_frame = tf_frame.apply(lambda x: np.log(1.00097+x)*idf_dict[x.name], axis = 1)

      # same steps for document titles
      global tf_frame_titles
      tf_frame_titles = pd.DataFrame()
      tf_frame_titles['terms'] = dictionary
      for doc_id in self.all_documents:
        term_frequency = {}
        for term in self.all_documents[doc_id].preprocessed_title:
          if term not in term_frequency:
            term_frequency[term] = 1
          else:
            term_frequency[term] += 1
        tf_doc_entry = {}
        for term in dictionary:
          if term in term_frequency:
            tf_doc_entry[term] = term_frequency[term]/(term_frequency[term]+4.289*len(documents[doc_id].preprocessed_title)/avg_doc_size)
          else:
            tf_doc_entry[term] = 0
        tf_frame_titles[doc_id] = tf_frame_titles['terms'].map(tf_doc_entry)
      tf_frame_titles.set_index('terms', inplace=True)
      df_dict_titles = {}
      for term in dictionary:
        if term not in df_dict_titles:
          df_dict_titles[term] = len(inverted_index[term])
        else:
          df_dict_titles[term] = 0

      global idf_dict_titles
      idf_dict_titles = dict((term, np.log2((N+1)/(df+1)+1.00097)) for term, df in df_dict_titles.items())
      global tfidf_frame_titles
      tfidf_frame_titles = tf_frame_titles.apply(lambda x: np.log(1.00097+x)*idf_dict_titles[x.name], axis = 1)  
             
    def query_to_tfidf(self, query):
      '''
      Builds a dataframe of tfidf values for provided query

      Parameters
      ----------
      query : query object
        Provided query.      

      Returns
      -------
      query_tfidf : dataframe
          Dataframe consisting of tfidf values for provided query.

      '''     
      terms = queries[query.query_id].preprocessed_body
      # build term frequency dataframe
      term_frequency = {}
      for term in terms:
        if term not in term_frequency.keys():
          term_frequency[term] = 1
        else:
          term_frequency[term] += 1
      full_term_frequency = {}
      for term in dictionary:
        if term in term_frequency:
          full_term_frequency[term] = term_frequency[term]
        else:
          full_term_frequency[term] = 0
      full_term_frequency = pd.DataFrame.from_dict(full_term_frequency,orient='index',columns=['Query'])
      query_tfidf = full_term_frequency.apply(lambda x:  np.log(1.00097+x)*idf_dict[x.name], axis = 1)

      return query_tfidf

    def tfidf_search(self, query):
      '''
      For provided query finds the most relevant documents

      Parameters
      ----------
      query : Query
          A provided query.

      Returns
      -------
      output_documents : list
          The ids of all documents, sorted by descending similarity to the query.

      '''
      # get the query's tfidf values
      query_tfidf = self.query_to_tfidf(query)

      # build dictionary of similarities to all documents bodies
      columns = list(tfidf_frame)
      similarity_dict = {}
      for column in columns:
        # cosine_similarity returns a 1x1 array here, so take its only element
        similarity = float(cosine_similarity(query_tfidf['Query'].values.reshape(1, -1), tfidf_frame[column].values.reshape(1, -1))[0, 0])
        similarity_dict[column] = similarity     
      
      # build dictionary of similarities to all document titles
      columns = list(tfidf_frame_titles)
      similarity_dict_titles = {}
      for column in columns:
        # cosine_similarity returns a 1x1 array here, so take its only element
        similarity = float(cosine_similarity(query_tfidf['Query'].values.reshape(1, -1), tfidf_frame_titles[column].values.reshape(1, -1))[0, 0])
        similarity_dict_titles[column] = similarity

      # if title similarity is at least 0.65, use title similarity, otherwise use body similarity
      combined_similarity_dict = {}
      for key, value in similarity_dict.items():
        similarity_titles = similarity_dict_titles[key]
        if similarity_titles >= 0.65:
          combined_similarity_dict[key] = similarity_titles
        else:
          combined_similarity_dict[key] = value
      combined_similarity_dict = OrderedDict(sorted(combined_similarity_dict.items(), key=lambda x: x[1], reverse=True))    
     
      # build output document list
      output_documents = []
      for key, value in combined_similarity_dict.items():
        output_documents.append(key)     

      return output_documents

    def search(self, query):
        """The ranked retrieval results for a query.

        Parameters
        ----------
        query : Query
            A query.
        
        Returns
        -------
        list of Document
            The ranked retrieval results for a query.

        """

        relevant_doclist = self.tfidf_search(query)
        result = []
        for doc in relevant_doclist:
          result.append(documents[doc])
        self.output_documents = result

        return self.output_documents
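
The weighting used in `build_tf_idf_frames` and the title rule used in `tfidf_search` can be isolated into small functions. The following sketch restates them as read from the code above; the constants are copied from that code, while the function and constant names are hypothetical and introduced only for illustration.

```python
import math

# Constants copied from the code above; the names are hypothetical.
LENGTH_FACTOR = 4.289      # scales the document length penalty
TF_OFFSET = 1.00097        # offset added to the term frequency inside the logarithm
TITLE_THRESHOLD = 0.65     # minimum title similarity that overrides body similarity


def saturated_tf(raw_tf, doc_len, avg_doc_len):
    """Length-normalised term frequency that grows towards, but never reaches, 1."""
    if raw_tf == 0:
        return 0.0
    return raw_tf / (raw_tf + LENGTH_FACTOR * doc_len / avg_doc_len)


def term_weight(raw_tf, doc_len, avg_doc_len, idf):
    """Weight of one term in one document: log of the shifted tf, times idf."""
    return math.log(TF_OFFSET + saturated_tf(raw_tf, doc_len, avg_doc_len)) * idf


def combined_score(body_similarity, title_similarity):
    """Use the title similarity when it reaches the threshold, else the body's."""
    if title_similarity >= TITLE_THRESHOLD:
        return title_similarity
    return body_similarity


# A term occurring twice weighs more in a short document than in a long one.
print(saturated_tf(2, 50, 100) > saturated_tf(2, 200, 100))
print(combined_score(0.3, 0.7), combined_score(0.3, 0.5))
```

Because the offset is slightly greater than 1, a term that is absent from a document still receives a small positive weight, so the resulting vectors are dense.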

## Evaluation

The following code evaluates your information retrieval system using the Mean Average Precision evaluation measure.
You can [check out on GitLab](https://gitlab.fi.muni.cz/xstefan3/pv211-utils/blob/master/pv211_utils/eval.py) how Mean Average Precision is computed.
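
As a rough illustration of the measure, the following sketch computes average precision and mean average precision according to the textbook definition [1, Section 8.4]. It is not the `pv211_utils` implementation, which may differ in details such as the handling of queries without relevant documents; the function names and the toy data are assumptions.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average of the precision values at the ranks of the relevant documents."""
    relevant_ids = set(relevant_ids)
    if not relevant_ids:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank
    # Relevant documents that were never retrieved contribute a precision of 0.
    return precision_sum / len(relevant_ids)


def mean_average_precision_sketch(rankings, judgements):
    """Mean of the average precision over all queries in rankings."""
    scores = [
        average_precision(ranked_ids, judgements.get(query_id, ()))
        for query_id, ranked_ids in rankings.items()
    ]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical toy data: query id -> ranked document ids / relevant document ids.
toy_rankings = {1: [3, 1, 2, 4], 2: [5, 6]}
toy_judgements = {1: {1, 4}, 2: {5}}
print(mean_average_precision_sketch(toy_rankings, toy_judgements))
```

For the first toy query the relevant documents appear at ranks 2 and 4, giving precisions 1/2 and 2/4 and an average precision of 0.5; the second query has an average precision of 1.0, so the mean is 0.75.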

If you choose to `submit_result`, the result of your run will appear among our [Leaderboard submissions](https://docs.google.com/spreadsheets/d/e/2PACX-1vSGTg_Agc0SowDIsDDsaBN_UD-9r-F2eSpozyvVA8F51YHt3GmAle3niaCoj0ocazjDm01OJNgNEykZ/pubhtml).

Then, your best score for each week will be submitted and ranked in the Leaderboard sheet. The best solvers will get small **awards during the semester**, or some **seriously big awards** after the personal check at the end of the competition (currently planned for the 8th of May).

In [None]:
from pv211_utils.eval import mean_average_precision

mean_average_precision(MyIRSystem(), submit_result=False, author_name="Čapek, Nicholas")

Please be polite and do not spoil the game for the others ;)

**Have fun!**

## Bibliography
[1] Manning, Christopher D., Prabhakar Raghavan, and Hinrich Schütze. [*Introduction to Information Retrieval*](https://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf). Cambridge University Press, 2008.