# Data-Driven Research Assignment 1: Build your own Vertical Search Engine
This notebook contains the first individual graded assignment of the 2023 Data-Driven Research course. In this assignment you will build your own search engine and compare it against search engine giant Google, with a realistic chance of doing better than Google!

To complete the assignment, fill in **Report part 1**, **Report part 2**, **Report part 3** and **Report part 4**, and complete the **code** in the last section that is needed to write Report part 3.

This is an individual assignment. In the text cell below, please add your name.

If you used code or a solution from the internet (such as StackOverflow) or another external resource, please make reference to it (in any format). Unattributed copied code will be considered plagiarism and therefore fraud.


**Author of this answer: George Christian Cotea - 13842013**

# 1. Introduction

Search engines play a crucial role in making sense of the almost infinite amount of information online. In fact, your mental model of what the web is, is completely determined by the small fraction of web data that search engines serve in a streamlined way. Your reliance on search engines also makes them the gatekeepers of information: with millions or billions of search results for every query, they determine the ranking, and only the first handful of results get any clicks. Is this generic approach really optimal for everyone and any topic?

A *Vertical Search Engine* is a "specialized" search engine that focuses on a specific domain or service, tailored to the
particular information needs of niche audiences and professions.  Search experts believe that the performance of general-purpose search engines (such as http://google.com/, http://bing.com/, and http://onesearch.com/ or regional champions like https://baidu.cn/ and https://yandex.ru/) cannot improve much due to the short and highly ambiguous queries that are standard on the web.  A natural alternative to the one-size-fits-all approach of general-purpose search engines is the
use of a dedicated vertical search engine, which is expected to provide more focused search results.

# 2. Accessing Google using Python

You probably access the web mostly by searching for something through a search engine like Google. But what if you want to use this information for data-driven or quantitative research? Copy-pasting all your search results into a file for further analysis is a hassle.

We could use Python's Requests library to simply request a page of search results from Google. Let's try this:

In [2]:
import requests  # External package: https://requests.readthedocs.io/en/master/

# Example of a Google search
query = "How to search Google using Python"
headers = {
    "referer":"https://www.google.com/",
    "user-agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.114 Safari/537.36"
    }

with requests.Session() as s:
    s.post('https://www.google.com/search', params={'q': query}, headers=headers)
    r = s.get('https://www.google.com/search', params={'q': query}, headers=headers)

If you are wondering what `headers` is: it provides information about the user (mainly the browser) to the website. We add the headers here so that Google treats us as a regular returning user rather than a first-time visitor; otherwise we would get cookie and privacy consent pop-ups instead of the results we want.

You can find a selection of HTTP headers [here](https://en.wikipedia.org/wiki/List_of_HTTP_header_fields). Here's an explanation for the ones used above:
- [referer](https://en.wikipedia.org/wiki/HTTP_referer):  This is the address of the previous web page from which a link to the currently requested page was followed. 
- [user-agent](https://en.wikipedia.org/wiki/User_agent#User_agent_identification): any software acting on behalf of a user. It often identifies itself, its application type, operating system, device model, software vendor, or software revision.

In [3]:
r.status_code

200

A code of 200 means success in the world of the web. View the HTTP response status codes here, including the famous 404: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status
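
If you want to look up what a status code means without leaving Python, the standard library's `http.HTTPStatus` enum maps each standard code to its reason phrase. A small sketch (the helper name `describe_status` is made up for this illustration):

```python
from http import HTTPStatus

def describe_status(code):
    # Hypothetical helper: return the standard reason phrase for an HTTP status code.
    try:
        return HTTPStatus(code).phrase
    except ValueError:
        # Codes that are not in the standard list end up here.
        return "Unknown status code"

print(describe_status(200))
print(describe_status(404))
```

With the response from above, `describe_status(r.status_code)` would give the phrase for the code Google returned.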

In [4]:
print(r.url)
r.text[:2000]

https://www.google.com/search?q=How+to+search+Google+using+Python


'<!doctype html><html itemscope="" itemtype="http://schema.org/SearchResultsPage" lang="de"><head><meta charset="UTF-8"><meta content="origin" name="referrer"><meta content="/images/branding/googleg/1x/googleg_standard_color_128dp.png" itemprop="image"><title>How to search Google using Python - Google Suche</title><script nonce="xME6uIwMpS0Kv-RNTeqqbA">(function(){var b=window.addEventListener;window.addEventListener=function(a,c,d){"unload"!==a&&b(a,c,d)};}).call(this);(function(){window.google={kEI:\'Qdk1ZKDUFOuJ9u8P-bCI6Ac\',kEXPI:\'31\',kBL:\'x4dB\',kOPI:89978449};google.sn=\'web\';google.kHL=\'de\';})();(function(){\nvar e=this||self;var g,h=[];function k(a){for(var c;a&&(!a.getAttribute||!(c=a.getAttribute("eid")));)a=a.parentNode;return c||g}function l(a){for(var c=null;a&&(!a.getAttribute||!(c=a.getAttribute("leid")));)a=a.parentNode;return c}function m(a){/^http:/i.test(a)&&"https:"===window.location.protocol&&(google.ml&&google.ml(Error("a"),!1,{src:a,glmm:1}),a="");return a}

We indeed get a result, but the result is just the HTML code for Google's webpage containing the results. It is possible to actually extract the search results from this mess (and various companies offer this as a paid service) but it requires further knowledge of web scraping and is not ideal as the format of these pages that Google gives you might change. These pages are designed for being visually rendered by a web browser and read by humans, not for being read by machines.
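
To get a feel for what such extraction involves, here is a minimal sketch that pulls only the page title out of raw HTML with Python's built-in `html.parser`. The `TitleParser` class and the shortened `sample_html` string are illustrative assumptions; extracting the actual list of results would need far more of this kind of code, and it could break whenever the page layout changes.

```python
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    """Collects the text found inside the <title> tag."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        # Only keep text that appears between <title> and </title>
        if self.in_title:
            self.title += data

# Shortened, hand-written stand-in for r.text
sample_html = (
    '<!doctype html><html lang="de"><head><meta charset="UTF-8">'
    "<title>How to search Google using Python - Google Suche</title>"
    "</head><body></body></html>"
)

parser = TitleParser()
parser.feed(sample_html)
print(parser.title)
```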

Fortunately, Google also offers programmatic access to its search services through an API, to a limited extent, which avoids the above complications with headers and web scraping.

## Working with APIs

An **Application Programming Interface** is a set of protocols that defines how software programs communicate with each other. Without APIs, we have to scrape the Web or get the data directly. With APIs, we can often get structured data, which is a much more convenient way to work.

APIs are a great option in that they implement extensively tested routines (**high reliability**). However, you need to spend time learning how they work and, in some cases, they don't give you access to the piece of information you need (**low flexibility**).

In this assignment, we will see how we can use Google's Custom Search API to build our own search engine on a specific topic.

# 3. My First Search Engine

In the first assignment, you'll:
* Identify a specific topic of interest, the purpose of the vertical search engine, and the selection principles used to determine what's in and what's not in.
* Build your very own vertical search engine using Google Co-op.
* Evaluate the performance of your vertical search engine.

## Subject, Purpose, and Selection Principles

In any sort of data-driven research, you need to specify what kind of data is in and what kind of data is out. This is needed to clarify your domain of research and to make your research reproducible to others.

First select a *subject* or topic for which you want to build a vertical search engine.  Generally speaking, a very specific topic like *Miffy (Nijntje)*, *Dodo's*, or *Olympic Games in  Amsterdam 1992* works better than general topics like *Literature*, *Ornithology*, or *Sports*.  

Discuss the following issues, and write down your findings in the section below. You will have to specify the subject very specifically in terms of the *purpose* of the vertical search engine: what is the goal of the search engine (what should it make possible), and who are the envisaged users (what are special characteristics of the audience, in terms of background knowledge and preferences)?  You also have to translate this purpose into clear *selection principles*: what sites/pages do belong in the search engine, and---more importantly---what sites/pages do not belong in it. Give examples of sites that are in, and that are out, and explain this using your selection principles.

<div class="alert alert-block alert-info">
In practice, spiral development works best. That is, after an initial formulation of the subject, immediately build an initial version of the vertical search engine, and let it evolve in parallel. So, feel free to return to this section later to improve it.
</div>

### Report Part 1: Description of my search engine

(your discussion here)

## Build the Vertical Search Engine

We will use Google Co-op's custom search engine http://cse.google.com/. Follow these steps to create your search engine:

* Click on ''New Search Engine'' and sign in.
* Specify a list of websites to search. You can start with a few websites, and add more later (aim for 10-20). If you add too many, it may not search all of them; you can check this in the Control Panel. A website will be searched only if it has a green checkmark.
* Enter the basic information: language, name.
* Click on ''Create''.

<div class="alert alert-block alert-info">
If you first use Google to find sites to include and then search for them again shortly afterwards, the evaluation of Google's effectiveness will not be fair, because Google personalizes results based on your recent search history. For a fair comparison, please use another search engine such as https://onesearch.com/ or https://duckduckgo.com/ (both powered by non-personalized Bing API results) to discover sites to include. Alternatively, you can develop with Google, but later compare your search engine to another large search engine.
</div>

You can now try your own search engine by clicking ''View it on the Web.''  Your own search engine now has
its own homepage, and a  control panel to help further develop the search engine (i.e., add and manage the sites; search
refinements; look-and-feel; collaboration; etc.). Please include a link to your search engine in the text box below `https://cse.google.com/cse/publicurl?cx='Search_engine_unique_ID'` e.g. https://cse.google.com/cse?cx=015090763398590173596:fny6ovbqr4y.

### Report part 2: Search engine link and List of Websites

(your search engine link and list of websites here)

## Access your search engine from Python

To use your search engine in Python, you need two things: your Search Engine ID (visible in the CSE control panel) and an API key. To get a key, go to this page: https://developers.google.com/custom-search/v1/introduction and click Get a Key.

Now, we can try to search using Python. Insert the two things you just obtained in the code block below and run it.

<div class="alert alert-block alert-info">
In most real API access situations, you would not put your keys in the notebook like this for security reasons. Instead, you would put them in a separate file and load that file. But for now, we will do it like this for convenience.
</div>

In [None]:
import requests

# get the API KEY here: https://developers.google.com/custom-search/v1/overview
API_KEY = "<INSERT_YOUR_API_KEY_HERE>"
# get your Search Engine ID on your CSE control panel
SEARCH_ENGINE_ID = "<INSERT_YOUR_SEARCH_ENGINE_ID_HERE>"
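
Outside of a quick exercise like this one, a common alternative to hard-coding keys is to read them from environment variables. The sketch below assumes two variable names, `GOOGLE_API_KEY` and `GOOGLE_CSE_ID`, which are made up for this example; the helper `load_credentials` is likewise hypothetical.

```python
import os

def load_credentials(env=None):
    # Read the API key and Search Engine ID from environment variables.
    # The variable names are assumptions; pick any names you like.
    env = os.environ if env is None else env
    api_key = env.get("GOOGLE_API_KEY")
    engine_id = env.get("GOOGLE_CSE_ID")
    if not api_key or not engine_id:
        raise RuntimeError("Set GOOGLE_API_KEY and GOOGLE_CSE_ID before running the notebook")
    return api_key, engine_id

# Demonstration with a plain dict standing in for os.environ
demo_env = {"GOOGLE_API_KEY": "my-key", "GOOGLE_CSE_ID": "my-engine-id"}
print(load_credentials(demo_env))
```

Calling `load_credentials()` without arguments reads the real environment, so the keys never appear in the notebook itself.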

And, let's test it:

In [None]:
# The query you want to search for.
query = "test"

# using the first page
page = 1

# Making the link to Google to search
# Documentation on this topic: https://developers.google.com/custom-search/v1/using_rest
# Start should be the index of the first result you want to see, and each page has 10 results.
# So if we want to see page 2, we need to start at result number 11.
start = (page - 1) * 10 + 1
# Building the link to send to Google
url = f"https://www.googleapis.com/customsearch/v1?key={API_KEY}&cx={SEARCH_ENGINE_ID}&q={query}&start={start}"
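
The URL built in the cell above interpolates the query string as-is. Queries that contain spaces or characters such as `&` are safer to encode explicitly. A minimal sketch using the standard library's `urllib.parse.urlencode`; the function name `build_search_url` and the placeholder credentials are assumptions for illustration:

```python
from urllib.parse import urlencode

BASE_URL = "https://www.googleapis.com/customsearch/v1"

def build_search_url(api_key, engine_id, query, page=1):
    # Each page holds 10 results, so page 2 starts at result 11.
    start = (page - 1) * 10 + 1
    params = {"key": api_key, "cx": engine_id, "q": query, "start": start}
    # urlencode escapes spaces and special characters in the values
    return f"{BASE_URL}?{urlencode(params)}"

print(build_search_url("MY_KEY", "MY_ENGINE", "miffy & friends", page=2))
```

Alternatively, passing the same dictionary as `params=` to `requests.get` lets Requests do the encoding for you.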

<div class="alert alert-block alert-danger">
Warning: With the free version, you can only make 100 queries per day to Google (the requests.get part). So don't run the "get" queries too often or you will hit the limit before being able to finish the assignment.
</div>

In [None]:
# Make the search request to the API. This is a cell you want to run as few times as possible.
data = requests.get(url).json()

In [None]:
search_results = data.get("items")

This returns a dictionary *data* containing the result of our request. The actual search results are stored in `data["items"]` so we have saved that to the variable search_results. We can re-use this variable to avoid making many requests to Google.
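
Note that when a query has no hits, the response has no `items` key, so `data.get("items")` returns `None`. When something goes wrong (for instance an invalid key or an exhausted quota), the response typically carries an `error` object instead. A defensive sketch, with `extract_items` as a hypothetical helper and hand-written dictionaries standing in for real responses:

```python
def extract_items(data):
    # Return the list of results, or an empty list when there are none.
    if "error" in data:
        # Error responses typically look like {"error": {"code": ..., "message": ...}}
        error = data["error"]
        raise RuntimeError(f"API error {error.get('code')}: {error.get('message')}")
    return data.get("items") or []

# Hand-written stand-ins for real responses
ok_response = {"items": [{"title": "Example", "link": "https://example.com/"}]}
empty_response = {"searchInformation": {"totalResults": "0"}}

print(len(extract_items(ok_response)))
print(extract_items(empty_response))
```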

### Read through the search results

Let's print the results in a nice way:

In [None]:

def print_search_results(search_results, start=1):
    # start is the index of the first result on the page (1 for page 1, 11 for page 2, ...)
    for i, search_item in enumerate(search_results, start=start):
        # get the page title
        title = search_item.get("title")
        # page snippet
        snippet = search_item.get("snippet")
        # alternatively, you can get the HTML snippet (bolded keywords)
        html_snippet = search_item.get("htmlSnippet")
        # extract the page url
        link = search_item.get("link")
        # print the results
        print("="*15, f"Result #{i}", "="*15)
        print("Title:", title)
        print("Description:", snippet)
        print("URL:", link, "\n")

# Print the search results we just got, numbered from the start index used in the request.
print_search_results(search_results, start)

## Evaluation: Can You Beat Google?

Once you are completely satisfied with your vertical search engine, we
move to the final stage: evaluate the effectiveness of your vertical
search engine in comparison to a general-purpose search engine.  Please
document the evaluations below.

* Search/navigate/browse to a web page that is part of your
  collection (List of Websites).  Skim-read the page and close the page in your
  browser. Now write down 2-3 keywords below that you may use when trying to
  find this page again at a later time.  Do this for **5 pages**
  in total. This will be our evaluation dataset: we say that a search engine is good if it finds those pages using those keywords. Write the keywords and link for each of the 5 pages in the Python dictionary below.


In [None]:
# Manually define a dictionary with our 5 webpages and keywords
eval_dataset_dict = {
    "keyword1 keyword2 keyword3": "https://example.com/",
    "keyword3 keyword4": "https://wikipedia.com/"
}

First, let's get the results for each keyword query we defined above using your vertical search engine. We want to do this only once to avoid hitting the limit of 100 queries per day, so **avoid running this code block multiple times**:

In [None]:
eval_results_dict = {} #new dictionary to store query results

for query in eval_dataset_dict:
    #loop through each of the 5 search queries and search for them using our vertical search engine
    
    #Make use of the API key and search engine ID you defined above
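    #Passing params lets requests URL-encode the query (spaces, special characters)
    params = {"key": API_KEY, "cx": SEARCH_ENGINE_ID, "q": query, "start": 1}
    data = requests.get("https://www.googleapis.com/customsearch/v1", params=params).json() #Ask Google for the results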
    eval_results_dict[query] = data.get("items") #Extract the found items

Now, we have an additional dictionary *eval_results_dict* that contains our 5 keyword queries and the results from our search engine. We can re-use this variable to avoid making many requests to Google.
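
Because the variable disappears when the notebook kernel restarts, it can help to store the results on disk as well. A sketch using the standard `json` module; the file name `eval_results.json` and both helper functions are assumptions, not part of the assignment template:

```python
import json
import os

CACHE_FILE = "eval_results.json"  # hypothetical file name

def save_results(results, path=CACHE_FILE):
    # Write the query -> results dictionary to disk as JSON.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(results, f, ensure_ascii=False, indent=2)

def load_results(path=CACHE_FILE):
    # Return the cached dictionary, or None if nothing was saved yet.
    if not os.path.exists(path):
        return None
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```

After the request loop you could call `save_results(eval_results_dict)`; in a later session, `eval_results_dict = load_results()` would restore the data without contacting Google.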

In [None]:
len(eval_results_dict)

Next, we write an evaluation function.

* Write a re-usable function that takes as its input a query from this dictionary and the results of your search engine for that query (as stored in *eval_results_dict*). The function should find the rank (result number 1, 2, 3, ...) at which the page appears in the top 10 of your results. Then, it should calculate $\frac{1}{rank}$ as the score for this search: if the page is at rank 1, you get a score of $\frac{1}{1} = 1$, rank 2 gives $\frac{1}{2}$, etc. If the page is not in the top 10, the score is 0. Return the score. For more details on this evaluation, see http://en.wikipedia.org/wiki/Mean_reciprocal_rank
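
As an illustration of the scoring rule only (not the required solution, which has to work on the API results), the sketch below computes the reciprocal rank on a hand-written list of links. The function name `reciprocal_rank`, the sample URLs, and the choice to ignore a trailing slash when comparing links are all assumptions.

```python
def reciprocal_rank(result_links, target_link, max_rank=10):
    # Return 1/rank of target_link within the first max_rank links, else 0.
    target = target_link.rstrip("/")
    for rank, link in enumerate(result_links[:max_rank], start=1):
        if link.rstrip("/") == target:
            return 1 / rank
    return 0

# Made-up result list for demonstration
links = ["https://a.example/", "https://b.example/page", "https://c.example/"]
print(reciprocal_rank(links, "https://b.example/page"))    # found at rank 2
print(reciprocal_rank(links, "https://missing.example/"))  # not in the list
```

Averaging these scores over a set of queries $Q$ gives the mean reciprocal rank, $\mathrm{MRR} = \frac{1}{|Q|}\sum_{i=1}^{|Q|} \frac{1}{rank_i}$, where a page that is not found contributes 0.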

Feel free to make use of the *print_search_results()* function I defined above, or adapt it. I provide a start for the code below. Feel free to also expand it by e.g. computing the mean of all the scores. Or if you are confident, delete my code and do it from scratch yourself!

In [None]:
def evaluate_result(search_results, target_link):
    # A function that checks whether the target link was found
    # in your search engine results, and returns the Reciprocal Rank.
    
    #Your code here
    raise NotImplementedError("Replace this line with your implementation")


Use this function to evaluate all queries in our dataset:

In [None]:
for query in eval_results_dict: 
#Loop through all 5 search results obtained above
    
    search_results = eval_results_dict[query] #This contains the results given by your search engine
    target_link = eval_dataset_dict[query] #This contains the link to the webpage it was supposed to find
    
    #Use your function to return the Reciprocal Rank score for this query
    rr_score = evaluate_result(search_results, target_link)
    
    # Print the score for the link. I'm sure you can make the printing nicer.
    print(f'For the page {target_link}, my search engine scored {rr_score}.')
    
# Try to also compute the average RR score for all the queries here

<div class="alert alert-block alert-info">
Debugging advice: If you are struggling to understand what is inside the variables, print them!
</div>

* Do the same using the general search engine, www.google.com. Here you can simply search for your keywords in the browser and compute the RR scores by hand.
* Now compare the average score for both search engines, and also look at the number of topics where one or the other is better.
* Is your vertical search engine better than Google?
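
The comparison in the bullets above can be sketched as follows. The scores are invented purely for illustration, and `compare_engines` is a hypothetical helper; substitute the reciprocal-rank scores you actually measured.

```python
from statistics import mean

def compare_engines(scores_a, scores_b):
    # scores_a and scores_b hold one reciprocal-rank score per query, in the same order.
    wins_a = sum(a > b for a, b in zip(scores_a, scores_b))
    wins_b = sum(b > a for a, b in zip(scores_a, scores_b))
    ties = len(scores_a) - wins_a - wins_b
    return {
        "mean_a": mean(scores_a),
        "mean_b": mean(scores_b),
        "wins_a": wins_a,
        "wins_b": wins_b,
        "ties": ties,
    }

# Invented scores for five queries, for illustration only
vertical = [1, 0.5, 1, 0, 0.25]
general = [0.5, 0.5, 0, 0, 1]
print(compare_engines(vertical, general))
```

Looking at wins and ties next to the averages is useful because, with only five queries, a single query can dominate the mean.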

### Report part 3: Evaluation

(Put the comparison of your search engine with Google here)

## Discussion
Feel free to do some further exploration here to go for the bonus points. For example, you could evaluate your search engine's Precision (http://en.wikipedia.org/wiki/Precision_(information_retrieval)#Precision). For this, you think of a general topic, count how many of the results pertain to that topic, and again compare this to Google.
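
Precision is the fraction of returned results that are relevant. A minimal sketch, assuming you have judged each returned result by hand as on topic or not; the judgments below are invented and `precision` is a hypothetical helper:

```python
def precision(relevance_judgments):
    # relevance_judgments: one boolean per returned result (True = on topic).
    if not relevance_judgments:
        return 0.0
    return sum(relevance_judgments) / len(relevance_judgments)

# Invented judgments for a page of 10 results, for illustration only
judged = [True, True, False, True, False, True, True, False, True, True]
print(precision(judged))
```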

You could also reflect on the strengths and limitations of text-based searching. Viewing documents as just ''bags of words'' sometimes works surprisingly well, but also has significant limitations. Was the search engine perfect (finding exactly all, and only, relevant information), and if not, why not? What do you think are the strengths and weaknesses of standard keyword search? What is the main barrier when it fails to retrieve a relevant document: is it on the user's end in the query formulation, on the system's end in the ranking component, or on the document's end in the way information is expressed? In what situations are the limitations harmful? Is there a way to compensate for these? For which kinds of tasks would this imperfect tool already be very useful? What would be different if advanced computational language understanding was possible? Etc.
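
To make the ''bag of words'' idea concrete, the sketch below reduces a text to word counts and shows that two sentences with opposite meanings end up with the same representation. The tokenization (lowercasing and splitting on whitespace) is a deliberately crude assumption, and `bag_of_words` is a hypothetical helper.

```python
from collections import Counter

def bag_of_words(text):
    # Lowercase, split on whitespace, and count each word; word order is discarded.
    return Counter(text.lower().split())

a = bag_of_words("dog bites man")
b = bag_of_words("man bites dog")
print(a == b)
```

Both sentences contain the same words, so a purely word-based matcher cannot tell them apart.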

### Report part 4: Discussion/Reflection

(Discuss whether your vertical search engine is better than Google based on your evaluation results)

(Further evaluation and/or reflection)
