# Exercise #1: Searching in Wikipedia

Your task is to make a small subset of Wikipedia entities searchable using Elasticsearch.
To do so, you'll need to:

  1. Run the Elasticsearch server on your local machine
  1. Download and index a subset of Wikipedia pages
  1. Come up with some queries and score them against your index

## Elasticsearch

See [this document](https://github.com/kbalog/uis-dat640-fall2019/tree/master/code/elasticsearch) for an introduction to Elasticsearch. Note that you need to both download and run the binary (the Elasticsearch service running on your local machine) and install the Python client.

In [1]:
from elasticsearch import Elasticsearch

es = Elasticsearch()

In [4]:
# Initialize index

INDEX_NAME = "wikipedia"

INDEX_SETTINGS = {
    "settings": {
        "index": {
            "number_of_shards": 1,
            "number_of_replicas": 1
        }
    },
    "mappings": {
        "properties": {
            "content": {
                "type": "text",
                "term_vector": "yes",
                "analyzer": "english"
            },
            "title": {
                "type": "text",
                "term_vector": "yes",
                "analyzer": "english"
            },
            "id": {
                "type": "keyword"
            },
            "url": {
                "type": "keyword"
            }
        }
    }
}

# Create the index only if it doesn't exist yet
if not es.indices.exists(index=INDEX_NAME):
    es.indices.create(index=INDEX_NAME, body=INDEX_SETTINGS)
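While experimenting with the mapping, it can be handy to start from a clean index, since the mapping of an existing field generally cannot be changed in place. The following is a minimal sketch, not part of the original exercise; `reset_index` is a hypothetical helper name, and it assumes a client object that exposes `indices.exists`, `indices.delete`, and `indices.create` like the `es` object above.

```python
def reset_index(client, name, settings):
    """Delete the index if it already exists, then create it from scratch.

    `client` is assumed to behave like the `es` object above.
    Warning: this discards all documents stored in the index.
    """
    if client.indices.exists(index=name):
        client.indices.delete(index=name)
    client.indices.create(index=name, body=settings)


# Usage with the objects defined above (requires a running server):
# reset_index(es, INDEX_NAME, INDEX_SETTINGS)
```

Keeping the client as a parameter makes the helper easy to try out against a stub before pointing it at a real server.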

## Downloading and indexing Wikipedia pages

Select a small subset of pages (min. 10) to index. Specifically, we want articles corresponding to Norwegian cities.

This may be done manually, by simply listing the pages (as below), or programmatically, e.g., by taking all pages from a given category or list (such as https://en.wikipedia.org/wiki/List_of_towns_and_cities_in_Norway).
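The programmatic route could look roughly like the sketch below. It assumes the category page object offers a `categorymembers` dictionary (title to page) whose values carry a namespace number `ns`, as in the Wikipedia-API package, and that namespace 0 denotes regular articles. The helper name `article_titles` and the category name in the usage comment are assumptions for illustration only.

```python
def article_titles(category_page, main_namespace=0):
    """Return the sorted titles of category members that are regular articles.

    Members in other namespaces (e.g., subcategories) are skipped.
    """
    return sorted(
        title
        for title, member in category_page.categorymembers.items()
        if member.ns == main_namespace
    )


# Usage (requires network access; the category name is an assumption,
# and wiki_wiki is the wikipediaapi.Wikipedia object used for crawling):
# pages = article_titles(wiki_wiki.page("Category:Cities and towns in Norway"))
```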

In [6]:
pages = [
    "Bergen", "Oslo", "Stavanger", "Trondheim", "Kristiansand",
    "Haugesund", "Bodø_(town)", "Egersund", "Lillehammer",
    "Mandal", "Molde"
]

Crawl the content of these pages using the [Python Wikipedia API](https://pypi.org/project/Wikipedia-API/). You need to install it using

```
pip install wikipedia-api
```

It's up to you what you put in the index, but at a minimum, index the title of the article and its first paragraph (i.e., the summary). To go further:

  * Try to index the full document content and the categories that the article belongs to.
  * Separate the different parts of the article into multiple fields in your index.
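One possible way to split an article into separate fields is sketched below. It assumes a page object with the attributes `title`, `summary`, `text`, `categories` (a dictionary keyed by category title), and `fullurl`, as provided by the Wikipedia-API package. The helper name `build_document` and the field name `categories` are choices made for this illustration, not part of the exercise.

```python
def build_document(page):
    """Turn a page object into a dictionary with one entry per index field."""
    return {
        "title": page.title,
        "summary": page.summary,                       # first paragraph(s)
        "content": page.text,                          # full article text
        "categories": sorted(page.categories.keys()),  # category titles
        "url": page.fullurl,
    }


# Usage inside the crawling loop (requires network access and a running server):
# es.index(index=INDEX_NAME, id=page, body=build_document(wiki_wiki.page(page)))
```

Fields that are not declared in the index mapping (here `summary` and `categories`) would be mapped dynamically by Elasticsearch unless you add them to the mapping first.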

In [7]:
import wikipediaapi

# Note: recent releases of Wikipedia-API may also require a user_agent argument.
wiki_wiki = wikipediaapi.Wikipedia('en')

for page in pages:
    # Get a page
    page_py = wiki_wiki.page(page)
    print("Page: {}".format(page))
    print("Title: {}".format(page_py.title))
    print("Summary: {:.200}...".format(page_py.summary))
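    # Index the article. Note that 'summary' is not declared in INDEX_SETTINGS,
    # so Elasticsearch maps it dynamically (as text with the default analyzer).
    es.index(index=INDEX_NAME, id=page, body={'title': page_py.title, 'summary': page_py.summary})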

Page: Bergen
Title: Bergen
Summary: Bergen (Norwegian pronunciation: [ˈbærɡn̩] (listen)), historically Bjørgvin, is a city and municipality in Hordaland on the west coast of Norway. At the end of the first quarter of 2018, the municipal...
Page: Oslo
Title: Oslo
Summary: Oslo ( OZ-loh, also US:  OSS-loh, Norwegian: [²ʊʂlʊ] (listen), rarely [²ʊslʊ, ˈʊʂlʊ]; Southern Sami: Oslove) is the capital and most populous city of Norway. It constitutes both a county and a municip...
Page: Stavanger
Title: Stavanger
Summary: Stavanger (, also UK: , US usually , Norwegian: [stɑˈvɑŋər] (listen)) is a city and municipality in Norway. It is the third largest city and metropolitan area in Norway (through conurbation with neigh...
Page: Trondheim
Title: Trondheim
Summary: Trondheim (UK: , US: , Urban East Norwegian: [²trɔn(h)æɪm] (listen)), historically Kaupangen, Nidaros and Trondhjem, is a city and municipality in Trøndelag county, Norway. It has a population of 193,...
Page: Kristiansand
Title: Kristi

## Running queries

Create some 'interesting' queries, i.e., queries for which you can judge what sensible results would be.
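One way to make such judgments concrete is to write down, for each query, the documents you consider relevant and then measure how many of them appear among the top results. The sketch below computes precision at rank k. The function name `precision_at_k` and the relevance judgments in the usage comment are illustrative assumptions, not part of the exercise.

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k ranked document IDs that are judged relevant."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    top_k = ranked_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    return hits / k


# Usage with a search response `res` from es.search (judgments are made up):
# ranked = [hit['_id'] for hit in res['hits']['hits']]
# print(precision_at_k(ranked, {"Oslo"}, k=3))
```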

In [13]:
queries = ["capital", "commercial centre"]

for query in queries:
    print("Scoring query '{}'".format(query))
    res = es.search(index=INDEX_NAME, q=query, _source=False, size=3)
    for hit in res['hits']['hits']:
        print("Doc ID: %3r  Score: %5.2f" % (hit['_id'], hit['_score']))

Scoring query 'capital'
Doc ID: 'Oslo'  Score:  1.46
Doc ID: 'Stavanger'  Score:  1.30
Doc ID: 'Bergen'  Score:  1.23
Scoring query 'commercial centre'
Doc ID: 'Molde'  Score:  2.39
Doc ID: 'Haugesund'  Score:  0.22
Doc ID: 'Lillehammer'  Score:  0.20
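The `q=` parameter used above is a shorthand for a query string search. If you want more control, e.g., to weight title matches more heavily than summary matches, you can send a request body instead. The following is a minimal sketch using a `multi_match` query; the helper name `build_query` and the boost of 2 on `title` are arbitrary assumptions, and the field names must match whatever you actually indexed.

```python
def build_query(text, fields=("title^2", "summary"), size=3):
    """Build a search request body that matches `text` against several fields.

    A suffix such as ^2 boosts matches in that field by the given factor.
    """
    return {
        "size": size,
        "_source": False,  # only IDs and scores are needed
        "query": {
            "multi_match": {
                "query": text,
                "fields": list(fields),
            }
        },
    }


# Usage (requires a running server and the index built above):
# res = es.search(index=INDEX_NAME, body=build_query("capital"))
# for hit in res['hits']['hits']:
#     print(hit['_id'], hit['_score'])
```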


## Feedback

Please give (anonymous) feedback on this exercise by filling out [this form](https://forms.gle/22o3ursi5YsR1Ztb8).