# Assignment 4: Evaluating Search Engines

For this assignment, we leave aside the code we developed so far, and look into the more general issue of how to evaluate and compare different search engines. The ultimate test for any Information Retrieval system is how well it is able to satisfy the information needs of users.

# Cohen's Kappa

Our evaluation will involve the calculation of [Cohen's Kappa](https://en.wikipedia.org/wiki/Cohen's_kappa) to quantify the degree to which two human assessors agree or disagree on whether results are considered relevant or not. To calculate Cohen's Kappa, we are going to use the [scikit-learn library](http://scikit-learn.org/stable/):

In [2]:
! pip install --user scikit-learn



In [3]:
from sklearn.metrics import cohen_kappa_score

This library expects relevance assessments as lists of elements where `1` stands for _relevant_ and `0` stands for _not relevant_, for example like this:

In [4]:
a1=[1,0,1,0,1,0,1,0]

This list means that the first document was assessed to be relevant, the second to be not relevant, the third to be relevant etc.

We need two assessments in order to calculate Cohen's Kappa, so let's make a second example list that differs only in the last element:

In [5]:
a2=[1,0,1,0,1,0,1,1]

We can now invoke the library as follows to calculate the agreement between the two:

In [6]:
cohen_kappa_score(a1, a2)

0.75

This value represents high agreement. We can reach maximal agreement if the two assessments are identical:

In [7]:
cohen_kappa_score(a1, a1)

1.0

Now, let's see what happens for a third assessment that differs from the first one in three positions (the last three):

In [8]:
a3=[1,0,1,0,1,1,0,1]

cohen_kappa_score(a1, a3)

0.25

We get a smaller but still positive value, because these two assessments still mostly agree. If we make a further example that differs on 6 of the 8 positions, we get the following result:

In [9]:
a4=[1,0,0,1,0,1,0,1]

cohen_kappa_score(a1, a4)

-0.5

The score is now negative, because the two differ on more positions than they agree on. The agreement is in fact less than what you would expect to occur just by chance. We get the maximal disagreement if we define a fifth example that disagrees on all positions:

In [10]:
a5=[0,1,0,1,0,1,0,1]

cohen_kappa_score(a1, a5)

-1.0

Now that we understand how this function works, we will apply it below for our specific evaluation.
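To make the scores above less opaque, the same numbers can be reproduced by hand. The following is a minimal sketch for binary labels only, using the standard definition kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance. The function name `cohen_kappa_by_hand` is made up for this illustration and is not part of scikit-learn.

```python
def cohen_kappa_by_hand(r1, r2):
    """Cohen's kappa for two equally long lists of 0/1 labels (illustrative sketch)."""
    n = len(r1)
    # Observed agreement: fraction of positions where both assessors agree.
    p_o = sum(x == y for x, y in zip(r1, r2)) / n
    # Chance agreement: probability that both say 1 plus probability that
    # both say 0, if each assessor labelled at random with their own label rates.
    p1 = sum(r1) / n
    p2 = sum(r2) / n
    p_e = p1 * p2 + (1 - p1) * (1 - p2)
    return (p_o - p_e) / (1 - p_e)


a1 = [1, 0, 1, 0, 1, 0, 1, 0]
a2 = [1, 0, 1, 0, 1, 0, 1, 1]
a4 = [1, 0, 0, 1, 0, 1, 0, 1]

print(cohen_kappa_by_hand(a1, a2))  # 0.75
print(cohen_kappa_by_hand(a1, a4))  # -0.5
```

Note that this sketch divides by zero when p_e equals 1, which happens when both assessors use one and the same label for every document.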

# Results and Assessments

Next, we will define some auxiliary code to deal with lists of URLs from search engines and associated relevance assessments. We will encode result lists like this:

In [13]:
urls = [
    'https://en.wikipedia.org/wiki/Information_retrieval/',  # 1st result
    'http://www.dictionary.com/browse/information',          # 2nd result
    'https://nlp.stanford.edu/IR-book/'                      # ...
]

And we represent corresponding assessments, as above, as lists of the same size containing relevance values:

In [14]:
my_assessment = [1, 0, 1]
another_assessment = [0, 0, 1]
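
Because an assessment only makes sense if it lines up position by position with its URL list, a small sanity check can catch mistakes early. This is a sketch only; the helper name `check_assessment` is hypothetical and not part of the assignment code.

```python
def check_assessment(urls, assessment):
    """Raise ValueError if the assessment does not line up with the URL list."""
    if len(assessment) != len(urls):
        raise ValueError(
            'expected {} relevance values, got {}'.format(len(urls), len(assessment))
        )
    # Only the two labels used in this notebook are allowed.
    invalid = [value for value in assessment if value not in (0, 1)]
    if invalid:
        raise ValueError('relevance values must be 0 or 1, got {}'.format(invalid))
    return True


example_urls = ['https://example.org/a', 'https://example.org/b', 'https://example.org/c']
print(check_assessment(example_urls, [1, 0, 1]))  # True
```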

In order to nicely display URL lists, with or without related assessments, we define a function called `display_results`:

In [15]:
from IPython.display import display, HTML

def display_results(urls, assessment1=None, assessment2=None):
    """Show result URLs, optionally with one or two assessment columns, as an HTML table."""
    lines = ['<table>']
    header = '<tr><th>#</th><th>Result URL</th>'
    if assessment1:
        header += '<th>Assessment 1</th>'
    if assessment2:
        header += '<th>Assessment 2</th>'
    header += '</tr>'
    lines.append(header)
    for i, url in enumerate(urls):
        # Shorten long URLs for display; the link itself keeps the full URL.
        show_url = url
        if len(url) > 80:
            show_url = url[:75] + '...'
        line = '<tr><td>{}</td><td><a href="{:s}">{:s}</a></td>'.format(i+1, url, show_url)
        # Add one cell per assessment list that was passed in.
        for assessment in (assessment1, assessment2):
            if assessment:
                if assessment[i] == 0:
                    line += '<td><em>Not relevant</em></td>'
                else:
                    line += '<td><strong>Relevant</strong></td>'
        line += '</tr>'
        lines.append(line)
    lines.append('</table>')
    display(HTML(''.join(lines)))
    
display_results(urls, my_assessment, another_assessment)

| # | Result URL | Assessment 1 | Assessment 2 |
|---|---|---|---|
| 1 | https://en.wikipedia.org/wiki/Information_retrieval/ | Relevant | Not relevant |
| 2 | http://www.dictionary.com/browse/information | Not relevant | Not relevant |
| 3 | https://nlp.stanford.edu/IR-book/ | Relevant | Relevant |


We can use this function to display a list of URLs, optionally together with one or two assessment lists:

In [16]:
print("Just a list of URLs:")
display_results(urls)

print("With one assessment:")
display_results(urls, my_assessment)

print("With two assessments:")
display_results(urls, my_assessment, another_assessment)

Just a list of URLs:

| # | Result URL |
|---|---|
| 1 | https://en.wikipedia.org/wiki/Information_retrieval/ |
| 2 | http://www.dictionary.com/browse/information |
| 3 | https://nlp.stanford.edu/IR-book/ |

With one assessment:

| # | Result URL | Assessment 1 |
|---|---|---|
| 1 | https://en.wikipedia.org/wiki/Information_retrieval/ | Relevant |
| 2 | http://www.dictionary.com/browse/information | Not relevant |
| 3 | https://nlp.stanford.edu/IR-book/ | Relevant |

With two assessments:

| # | Result URL | Assessment 1 | Assessment 2 |
|---|---|---|---|
| 1 | https://en.wikipedia.org/wiki/Information_retrieval/ | Relevant | Not relevant |
| 2 | http://www.dictionary.com/browse/information | Not relevant | Not relevant |
| 3 | https://nlp.stanford.edu/IR-book/ | Relevant | Relevant |


Now we are ready to perform an actual evaluation, which will involve a substantial amount of manual work.

---

# Tasks

**Your name:** Polina Boneva

### Task 1

Think up and formulate an information need in the areas of Computer Science or the Life Sciences (medicine, biology, etc.) for which you think the answer can be found in scientific publications. On page 152 in the book an example of such an information need is shown: "Information on whether drinking red wine is more effective at reducing your risk of heart attacks than white wine."

**Answer:** "functional programming vs oop in technological security"

Next, write down specifically what documents have to look like to satisfy your information need. For example if your information need is about finding an overview of different cancer types, you could state that a document would need to list at least ten types of cancer to satisfy your information need (among other criteria). Write this down as a protocol with rules and examples. For example, such a protocol could state that at least three out of five given criteria have to be fulfilled for a document to be considered relevant for the information need, and then specify the criteria. Or your protocol could have the form of a sequence of rules, where each rule lets you either label the document as relevant or not relevant, or proceed with the next rule. Such rules and criteria can, for example, be about the general topic of the paper, the concepts mentioned in it, the covered relations between concepts, the type of publication (research paper, overview paper, etc.), the number of references, the types of contained diagrams, and so on, depending on your specified information need.

**Answer:** 
### Protocol to determine the relevance of a document when looking up "functional programming vs oop in technological security"
> The following requirements are a must-have for a document to be relevant:
- assessment of at least 3 of the most important aspects of technological security. For example:
    - quality control
    - amount and types of attacks covered 
    - ease of taking over control by malicious users
    - future relevance of currently established security principles
- description of functional programming for security
- description of object-oriented programming for security
- comparison **data** between functional programming and oop. The data must be one of the following:
    - directly relevant to security
    - not directly about security, but accompanied by further analysis that relates it to security
- conclusion. One of the following must be covered:
    - decision whether functional or oop prevails in security technology
    - no decision, but an overview of both functional programming and oop

---
> The following requirements are recommended for a document to be relevant:
- overview of at least 3 programming languages for security 
- assessment of at least 3 programming languages. Three or more of the following must be covered: 
    - description of its functional and oop principles and practices 
    - relevance for security
    - work-arounds when encountering a problem in development
    - ease of use (documentation, libraries, community)
- at least 3 obstacles when developing technology for security
- overview of at least 3 already developed security systems. Each must cover:
    - use of functional programming and/or oop
    - future possibilities: ease of development, relevance to different types of attacks, ease of use


\* A **single** missing must-have requirement can be replaced by **two** recommended ones.
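
The decision rule of this protocol can be sketched in code. The sketch assumes one particular reading of the replacement note: every missing must-have requirement has to be compensated by two fulfilled recommended requirements, and each recommended requirement can be counted only once. The function name `is_relevant` and the constants are made up for this illustration; judging whether an individual requirement is fulfilled remains manual work.

```python
MUST_HAVE_TOTAL = 5        # must-have requirements listed in the protocol
RECOMMENDED_TOTAL = 4      # recommended requirements listed in the protocol
RECOMMENDED_PER_MISSING = 2  # assumed exchange rate from the replacement note


def is_relevant(must_have_met, recommended_met):
    """Apply the protocol, given how many requirements of each kind a document fulfils."""
    if not 0 <= must_have_met <= MUST_HAVE_TOTAL:
        raise ValueError('must_have_met out of range')
    if not 0 <= recommended_met <= RECOMMENDED_TOTAL:
        raise ValueError('recommended_met out of range')
    missing = MUST_HAVE_TOTAL - must_have_met
    # Each missing must-have needs two recommended requirements in its place.
    return recommended_met >= RECOMMENDED_PER_MISSING * missing


print(is_relevant(5, 0))  # True: all must-haves fulfilled
print(is_relevant(4, 2))  # True: one missing must-have, two recommended instead
print(is_relevant(4, 1))  # False: one recommended is not enough
```

Under this reading, a document can miss at most two must-have requirements, since there are only four recommended ones to trade in.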


### Task 2

Formulate a keyword query that represents the information need. For the example on page 152 in the book (see above), the example query "wine AND red AND white AND heart AND attack AND effective" is given. (You don't need to use connectors like "AND", but if you do, first make sure your chosen search engines below actually support them.)

**Answer:** functional programming OR oop AND technological security

Then submit your query to **two** of the following academic search engines:

- [Google Scholar](https://scholar.google.com) (all science disciplines)
- [Semantic Scholar](https://www.semanticscholar.org) (computer science and biomedicine)
- [PubMed Search](https://www.ncbi.nlm.nih.gov/pubmed) (Life Sciences / biomedicine)

The right choice of two from the three search engines depends on the topic of your information need. If your information need is in Computer Science, for example, you should use Google Scholar and Semantic Scholar.

For each search engine, extract the top 10 URLs returned for the query. Try to access the resulting publications. Where that is not possible (because of dead links or because the publication is pay-walled even within the VU network), exclude the publication from the list and append further results to the end of your list (that is, results number 11, then 12, etc.) so that you end up with two lists of 10 publications each. To deal with paywalls, try accessing the articles from the VU network, use [UBVU Off-Campus Access](http://www.ub.vu.nl.vu-nl.idm.oclc.org/nl/faciliteiten/toegang-buiten-de-campus/index.aspx), or try to find the respective documents from alternative sources (Google Scholar, for example, is very good at finding free PDFs of articles). If you get fewer than 10 results for one of the search engines, modify the keyword query above to make it more inclusive, and then redo the steps of this task.
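
The selection procedure described above (skip inaccessible results, keep going down the ranking until 10 remain) can be sketched as follows. Checking accessibility is manual in this assignment, so the sketch simply takes a set of URLs that were found to be inaccessible. The function name `top_accessible` and the `example.org` URLs are hypothetical.

```python
def top_accessible(ranked_urls, inaccessible, k=10):
    """Return the first k URLs of a ranked result list that could be accessed."""
    selected = []
    for url in ranked_urls:
        if url in inaccessible:
            continue  # dead link or pay-walled: skip and move down the ranking
        selected.append(url)
        if len(selected) == k:
            break
    return selected


# Hypothetical ranking of 14 results, of which results 2 and 5 were inaccessible.
ranking = ['https://example.org/paper/{}'.format(i) for i in range(1, 15)]
blocked = {'https://example.org/paper/2', 'https://example.org/paper/5'}

top10 = top_accessible(ranking, blocked)
print(len(top10))  # 10
print(top10[-1])   # https://example.org/paper/12
```

If the function returns fewer than `k` URLs, the ranking was too short, which corresponds to the case where the query has to be made more inclusive.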

Store your two lists of URLs in the form of Python lists as introduced above. Then, use the `display_results` function to nicely display them.

In [23]:
# Create two of the lists below, depending on your chosen engines:

urls_google = [
    'https://patents.google.com/patent/US5014312A/en',
    'http://en.cnki.com.cn/Article_en/CJFDTotal-JSJK201103026.htm',
    'https://people.cs.kuleuven.be/~eddy.truyen/IMAGES/aosd_book.pdf',
    'https://pdfs.semanticscholar.org/617b/fa500a4dfccb03b3024f819fc4c4d1fd49d0.pdf',
    'https://patents.google.com/patent/US6501369B1/en',
    'https://patents.google.com/patent/US6184779B1/en',
    'https://link.springer.com/chapter/10.1007/978-3-540-74974-5_51',
    'https://plg.uwaterloo.ca/~migod/846/papers/aop-intro.pdf',
    'https://img.sauf.ca/pictures/2016-04-11/49263cddb34640e4db611c0effe2d3f3.pdf',
    'https://ieeexplore.ieee.org/abstract/document/6157179',
]
urls_semantic = [
    'https://www.semanticscholar.org/paper/Analyzing-the-functional-dynamics-of-technological-Bergek-Jacobsson/43963b810bf8903138e08f73e4ecbc441e1ab814',
    'https://www.semanticscholar.org/paper/Endogenous-Technological-Change-Romer/e85ede931ca515b0fec1b650fb044310d0c62fe7',
    'https://www.semanticscholar.org/paper/The-Technological-Society-Falk-Ellul/70c021d366002ef2ee895a4d72ba666f11c57b4b',
    'https://www.semanticscholar.org/paper/Technological-Discontinuities-and-Organizational-Tushman-Anderson/c74bed3065d1aa4fa3102e48e5c6baf2cfaef77a',
    'https://www.semanticscholar.org/paper/Eliciting-security-requirements-with-misuse-cases-Sindre-Opdahl/5053e086955182440b2c3e6bd21a29240b1565ed',
    'https://www.semanticscholar.org/paper/Technological-forecasting-for-decision-making-Martino/ce53faa89074b3c84c22757da5aca27d02621bf3',
    'https://www.semanticscholar.org/paper/Information-technology-implementation-research%3A-a-Cooper-Zmud/11bebab9e623723fa510d89a7f743e211e0b286d',
    'https://www.semanticscholar.org/paper/Long-Run-Implications-of-Investment-Specific-Change-Greenwood-Hercowitz/59183f4b41e6377c5d68cad2402a8f97205388a5',
    'https://www.semanticscholar.org/paper/Physical-Layer-Security%3A-From-Information-Theory-to-Bloch-Barros/c1da30faf9b6ee56b70be1d20b934c6e130cac6e',
    'https://www.semanticscholar.org/paper/Splintering-Urbanism%3A-Networked-Infrastructures%2C-Warf/f74c40cbe6622023e6dbbad056ff29394a90d8e8',
]


display_results(urls_google)
display_results(urls_semantic)

| # | Result URL |
|---|---|
| 1 | https://patents.google.com/patent/US5014312A/en |
| 2 | http://en.cnki.com.cn/Article_en/CJFDTotal-JSJK201103026.htm |
| 3 | https://people.cs.kuleuven.be/~eddy.truyen/IMAGES/aosd_book.pdf |
| 4 | https://pdfs.semanticscholar.org/617b/fa500a4dfccb03b3024f819fc4c4d1fd49d0.pdf |
| 5 | https://patents.google.com/patent/US6501369B1/en |
| 6 | https://patents.google.com/patent/US6184779B1/en |
| 7 | https://link.springer.com/chapter/10.1007/978-3-540-74974-5_51 |
| 8 | https://plg.uwaterloo.ca/~migod/846/papers/aop-intro.pdf |
| 9 | https://img.sauf.ca/pictures/2016-04-11/49263cddb34640e4db611c0effe2d3f3.pdf |
| 10 | https://ieeexplore.ieee.org/abstract/document/6157179 |

| # | Result URL |
|---|---|
| 1 | https://www.semanticscholar.org/paper/Analyzing-the-functional-dynamics-of-... |
| 2 | https://www.semanticscholar.org/paper/Endogenous-Technological-Change-Romer... |
| 3 | https://www.semanticscholar.org/paper/The-Technological-Society-Falk-Ellul/... |
| 4 | https://www.semanticscholar.org/paper/Technological-Discontinuities-and-Org... |
| 5 | https://www.semanticscholar.org/paper/Eliciting-security-requirements-with-... |
| 6 | https://www.semanticscholar.org/paper/Technological-forecasting-for-decisio... |
| 7 | https://www.semanticscholar.org/paper/Information-technology-implementation... |
| 8 | https://www.semanticscholar.org/paper/Long-Run-Implications-of-Investment-S... |
| 9 | https://www.semanticscholar.org/paper/Physical-Layer-Security%3A-From-Infor... |
| 10 | https://www.semanticscholar.org/paper/Splintering-Urbanism%3A-Networked-Inf... |


### Task 3

Then, find a fellow student who will **independently** assess the results as "relevant" or "not relevant" using the protocol that you have defined above, and also help (at least) one other student with his/her assessment. Write down their names here:

**Name of the student who assesses my results:** Ioana Kirova

**Name of the student who I help to assess his/her results:** Ioana Kirova

Show to the other assessor everything you have written down above for Tasks 1 and 2 (and you might also want to give him/her the PDFs you got for these papers to simplify the process).

You as assessors need to stick to the protocol you made in Task 1 and should not discuss with each other, especially when you doubt whether a result is relevant or not. Write down your assessments as lists of relevance values, as introduced above, and make sure they correctly map to the URLs by displaying them together with the `display_results` function.

To avoid problems with extreme results, mark in each list at least one paper as 'relevant' and at least one paper as 'not relevant'. That is, if all papers seem relevant, mark the one that seems least relevant 'not relevant', and conversely, if none of the papers seem relevant, mark the one that seems a bit more relevant than the others as 'relevant'.
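
The rule above can be checked mechanically before computing any scores. The reason it matters: if both assessors give one and the same label to every document, the chance agreement in Cohen's kappa becomes 1 and the formula divides by zero. A minimal sketch, with the hypothetical helper name `has_both_labels`:

```python
def has_both_labels(assessment):
    """True if the list contains at least one 1 (relevant) and one 0 (not relevant)."""
    return 0 in assessment and 1 in assessment


print(has_both_labels([0, 0, 1, 0]))  # True
print(has_both_labels([0, 0, 0, 0]))  # False: nothing marked relevant
print(has_both_labels([1, 1, 1, 1]))  # False: nothing marked not relevant
```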

In [24]:
# 0 = not relevant; 1 = relevant

assessment1_google = [0, 0, 1, 1, 0, 0, 0, 1, 0, 0]
assessment1_semantic = [0, 0, 0, 0, 1, 0, 0, 0, 1, 0]

assessment2_google = [0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
assessment2_semantic = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]

display_results(urls_google, assessment1_google, assessment2_google)
display_results(urls_semantic, assessment1_semantic, assessment2_semantic)

| # | Result URL | Assessment 1 | Assessment 2 |
|---|---|---|---|
| 1 | https://patents.google.com/patent/US5014312A/en | Not relevant | Not relevant |
| 2 | http://en.cnki.com.cn/Article_en/CJFDTotal-JSJK201103026.htm | Not relevant | Not relevant |
| 3 | https://people.cs.kuleuven.be/~eddy.truyen/IMAGES/aosd_book.pdf | Relevant | Relevant |
| 4 | https://pdfs.semanticscholar.org/617b/fa500a4dfccb03b3024f819fc4c4d1fd49d0.pdf | Relevant | Not relevant |
| 5 | https://patents.google.com/patent/US6501369B1/en | Not relevant | Not relevant |
| 6 | https://patents.google.com/patent/US6184779B1/en | Not relevant | Not relevant |
| 7 | https://link.springer.com/chapter/10.1007/978-3-540-74974-5_51 | Not relevant | Not relevant |
| 8 | https://plg.uwaterloo.ca/~migod/846/papers/aop-intro.pdf | Relevant | Not relevant |
| 9 | https://img.sauf.ca/pictures/2016-04-11/49263cddb34640e4db611c0effe2d3f3.pdf | Not relevant | Not relevant |
| 10 | https://ieeexplore.ieee.org/abstract/document/6157179 | Not relevant | Not relevant |

| # | Result URL | Assessment 1 | Assessment 2 |
|---|---|---|---|
| 1 | https://www.semanticscholar.org/paper/Analyzing-the-functional-dynamics-of-... | Not relevant | Not relevant |
| 2 | https://www.semanticscholar.org/paper/Endogenous-Technological-Change-Romer... | Not relevant | Not relevant |
| 3 | https://www.semanticscholar.org/paper/The-Technological-Society-Falk-Ellul/... | Not relevant | Not relevant |
| 4 | https://www.semanticscholar.org/paper/Technological-Discontinuities-and-Org... | Not relevant | Relevant |
| 5 | https://www.semanticscholar.org/paper/Eliciting-security-requirements-with-... | Relevant | Not relevant |
| 6 | https://www.semanticscholar.org/paper/Technological-forecasting-for-decisio... | Not relevant | Not relevant |
| 7 | https://www.semanticscholar.org/paper/Information-technology-implementation... | Not relevant | Not relevant |
| 8 | https://www.semanticscholar.org/paper/Long-Run-Implications-of-Investment-S... | Not relevant | Not relevant |
| 9 | https://www.semanticscholar.org/paper/Physical-Layer-Security%3A-From-Infor... | Relevant | Not relevant |
| 10 | https://www.semanticscholar.org/paper/Splintering-Urbanism%3A-Networked-Inf... | Not relevant | Not relevant |


### Task 4

Compute Cohen's kappa to quantify how much the two assessors agreed. Use the function `cohen_kappa_score` demonstrated above to calculate the inter-annotator agreement twice (once for each of the two search engines), and print out the results.

In [22]:
kappa_google = cohen_kappa_score(assessment1_google, assessment2_google)
kappa_semantic = cohen_kappa_score(assessment1_semantic, assessment2_semantic)

print("Kappa for Google Scholar:", kappa_google)
print("Kappa for Semantic Scholar:", kappa_semantic)


Kappa for Google Scholar: 0.41176470588235303
Kappa for Semantic Scholar: -0.15384615384615374


Explain whether the agreement can be considered high or not, based on the interpretation table on [this Wikipedia page](https://en.wikipedia.org/wiki/Fleiss'_kappa#Interpretation) (this Wikipedia page is about a different type of kappa but the interpretation table can also be used for Cohen's kappa).

**Answer:** According to the table, we are in moderate agreement on the articles returned by Google Scholar (kappa of about 0.41, just inside the 0.41 to 0.60 band), but in poor agreement on the articles returned by Semantic Scholar (kappa of about -0.15, below 0). Neither value can be considered high.
The moderate agreement for Google Scholar stems mainly from the fact that most articles were simply irrelevant to my search query, so we both saw the same irrelevance when applying the criteria I created. The criteria in the protocol above are quite specific, so it is difficult to label an article as relevant: most of them do not mention functional or object-oriented programming at all (some do not even mention security).
For Semantic Scholar, however, the two of us marked three different documents as relevant, with no overlap. This suggests that the criteria in the protocol are somewhat confusing, or that the documents were overall so irrelevant that each of us separately tried to find relevant ones by giving them the benefit of the doubt, that is, crediting them with more qualities than they have. The criteria in the protocol seem hard to match, which may indicate that there has not been much substantial research on the programming paradigms used in security.
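
The lookup in the interpretation table can also be written down as a small function. This is a sketch that assumes the bands from the linked Wikipedia table (poor, slight, fair, moderate, substantial, almost perfect); how values that fall exactly on or between the printed band limits are assigned is a choice made here, and the function name `interpret_kappa` is hypothetical.

```python
def interpret_kappa(kappa):
    """Map a kappa value to the agreement label of the interpretation table."""
    if kappa < 0:
        return 'poor'
    if kappa <= 0.20:
        return 'slight'
    if kappa <= 0.40:
        return 'fair'
    if kappa <= 0.60:
        return 'moderate'
    if kappa <= 0.80:
        return 'substantial'
    return 'almost perfect'


print(interpret_kappa(0.41176470588235303))   # moderate
print(interpret_kappa(-0.15384615384615374))  # poor
```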

### Task 5

Define a function called `precision_at_n` that calculates Precision@n as described in the lecture slides, which takes as input an assessment list and a value for _n_ and returns the respective Precision@n value. Run this function to calculate Precision@10 (that is, n=10) on all four assessments (two assessors and two search engines).

In [27]:
def precision_at_n(assessment_list, n):
    # Fraction of the top n results that were assessed as relevant (1)
    return sum(assessment_list[:n]) / n

print(precision_at_n(assessment1_google, 10))
print(precision_at_n(assessment2_google, 10))

print(precision_at_n(assessment1_semantic, 10))
print(precision_at_n(assessment2_semantic, 10))

0.3
0.1
0.2
0.1


Explain what these specific Precision@10 results tell us (or don't tell us) about the quality of the two search engines for your particular information need. You can also refer to the results of Task 4 if necessary.

**Answer:** The Precision@10 results tell us what share of each engine's top 10 results an assessor judged relevant for this one query; they say nothing about recall or about how the engines perform on other queries. Judging relevance is of course subjective, but this is acceptable because we use the search engines for our own needs, not for an objective observer. Given the low agreement found in Task 4, though, the individual values should be read with caution.
Both of Ioana's Precision@10 values are 0.1, which means she saw almost no value in the results of either search engine. She marked only two results as relevant across Google Scholar and Semantic Scholar combined, and eighteen as not relevant. Each search engine returned its 10 best results, but 90% of each list was not relevant, according to Ioana.
I saw a bit more value in the Google Scholar results: 0.3, meaning three of the ten returned articles were relevant and seven were not, from my perspective. Semantic Scholar scores even lower with 0.2: only two of the ten articles seemed relevant to me, even though the search engine ranked all of them in its top 10.
Thus, for my particular information need as a beginner in security systems, expressed as "functional programming OR oop AND technological security", neither engine delivers much quality. Google Scholar is slightly better.
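
Precision@10 compresses each list into a single number. As a supplementary sketch, the same measure at smaller cutoffs shows where in the ranking the relevant results sit. The assessment lists are copied from Task 3; the helper name `precision_at` and the dictionary labels are made up for this illustration, and the helper is defined here so that the block runs on its own.

```python
def precision_at(assessment, n):
    """Fraction of the first n results that were assessed as relevant."""
    return sum(assessment[:n]) / n


# Assessment lists copied from Task 3 (0 = not relevant, 1 = relevant).
assessments = {
    'Google Scholar, assessor 1': [0, 0, 1, 1, 0, 0, 0, 1, 0, 0],
    'Google Scholar, assessor 2': [0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
    'Semantic Scholar, assessor 1': [0, 0, 0, 0, 1, 0, 0, 0, 1, 0],
    'Semantic Scholar, assessor 2': [0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
}

for label, assessment in assessments.items():
    scores = [precision_at(assessment, n) for n in (1, 5, 10)]
    print(label, scores)
```

In all four lists the first result was judged not relevant, so Precision@1 is 0 for both engines and both assessors.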


# Submission

Submit the answers to the assignment via Canvas as a modified version of this Notebook file (file with `.ipynb` extension) that includes your code and your answers.

Before submitting, restart the kernel and re-run the complete code (**Kernel > Restart & Run All**), and then check whether your assignment code still works as expected.

Don't forget to add your name, and remember that the assignments have to be done individually and group submissions are **not allowed**.