In [None]:
import warnings
warnings.filterwarnings("ignore")

# CAI Lab Session 3: Programming with ElasticSearch

In this session you will:

- Learn how to tell ElasticSearch to apply different tokenizers and filters to the documents, like removing stopwords or stemming the words.
- Study how these changes affect the terms that ElasticSearch puts in the index, and how this in turn affects searches.
- Continuing previous work, implement the tf-idf scheme over a repository of scientific article abstracts, including the cosine measure for document similarities.

## 1. Preprocessing with ElasticSearch

One of the tasks of the previous session was to remove from the documents' vocabulary all those strings that were not proper words. This is a frequent task, and databases of this kind provide standard processes that help to filter out and reduce the terms that are not useful for searching.

Text, before being indexed, can be subjected to a pipeline of different processes that strip it of anything that will not be useful for a specific application. In ES these preprocessing pipelines are called _Analyzers_; ES includes many choices for each preprocessing step.


The [following picture](https://www.elastic.co/es/blog/found-text-analysis-part-1) illustrates the chaining of preprocessing steps:

![](https://api.contentstack.io/v2/assets/575e4c8c3dc542cb38c08267/download?uid=blt51e787daed39eae9?uid=blt51e787daed39eae9)

The first step of the pipeline is usually a process that converts _raw text_ into _tokens_. We can, for example, tokenize a text using blanks and punctuation marks, use a language-specific analyzer that detects words in a specific language, or parse HTML/XML.

[This section](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-tokenizers.html) of the ElasticSearch manual explains the different text tokenizers available.

Once we have obtained tokens, we can _normalize_ the strings and/or filter out valid tokens that are not useful. For instance, strings can be transformed to lowercase so that all occurrences of the same word are mapped to the same token regardless of whether they were capitalized. Also, some words are not semantically useful when searching, such as adverbs, articles or prepositions; each language has its own standard list of such words, usually called "_stopwords_". Another language-specific token normalization is stemming. The stem of a word is the common part from which all its variants are formed by inflection or by the addition of suffixes or prefixes. For instance, the words "unstoppable", "stops" and "stopping" all derive from the stem "stop". The idea is that all variations of a word will be represented by the same token.

[This section](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-tokenfilters.html) of the ElasticSearch manual will give you an idea of the possibilities.
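
To make the normalization steps described above concrete, here is a minimal pure-Python sketch of such a pipeline: tokenize, lowercase, drop stopwords, stem. It does not use ElasticSearch at all. The stopword list and the `naive_stem` rule are toy assumptions made up for illustration, and they are far cruder than the real `stop`, `porter_stem` or `snowball` filters.

```python
import re

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"the", "a", "an", "is", "was", "my", "of", "in"}


def tokenize(text):
    """Split raw text into maximal runs of (Unicode) letters."""
    return re.findall(r"[^\W\d_]+", text)


def naive_stem(token):
    """Toy stemmer: strips 'ing', 'ed' and a plural 's'. Not a real algorithm."""
    for suffix in ("ing", "ed"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            stem = token[: -len(suffix)]
            # Undo consonant doubling, e.g. "stopp" -> "stop".
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiouslz":
                stem = stem[:-1]
            return stem
    if token.endswith("s") and not token.endswith("ss") and len(token) > 3:
        return token[:-1]
    return token


def analyze(text):
    """Chain the steps: tokenize -> lowercase -> remove stopwords -> stem."""
    tokens = [t.lower() for t in tokenize(text)]
    tokens = [t for t in tokens if t not in STOPWORDS]
    return [naive_stem(t) for t in tokens]


print(analyze("Stopping stops the stopped printing"))
# prints ['stop', 'stop', 'stop', 'print']
```

The order of the steps matters: lowercasing before the stopword filter is what lets `"The"` match the entry `"the"`.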


## 2. Modifying `ElasticSearch` index behavior (using Analyzers)

In this section we are going to learn how to set up preprocessing with ElasticSearch. We are going to do it _inline_ so that you have a few examples and get familiar with how to set up ES analyzers. We are going to showcase the different options with the toy English phrase

```
my taylor 4ís was% &printing printed rich the.
```

which contains symbols and weird things to see what effect the different tokenizers and filtering options have. We are going to work with three of the usual processes:

* Tokenization
* Normalization
* Token filtering (stopwords and stemming)
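
As a point of reference, the three processes above map onto the `analysis` section of an index's settings: one tokenizer followed by an ordered list of token filters. The sketch below only builds that settings body as a plain Python dictionary and prints it as JSON; it does not contact a server. The helper name `build_analysis_settings` is hypothetical, and the notebook itself configures the analyzer through `elasticsearch_dsl` rather than through raw JSON.

```python
import json


def build_analysis_settings(tokenizer_name, filters):
    """Return a settings body declaring a custom analyzer named `default`."""
    return {
        "settings": {
            "analysis": {
                "analyzer": {
                    "default": {
                        "type": "custom",
                        "tokenizer": tokenizer_name,   # tokenization step
                        "filter": list(filters),       # applied in this order
                    }
                }
            }
        }
    }


body = build_analysis_settings("letter", ["lowercase", "asciifolding"])
print(json.dumps(body, indent=2))
```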

The next cells configure the default analyzer of an index and analyze an example text. We are going to play a little with the possibilities and see which tokens result from the analysis.


In [None]:
from elasticsearch import Elasticsearch
from elasticsearch_dsl import Index, analyzer, tokenizer
from elasticsearch.exceptions import NotFoundError
from pprint import pprint


client = Elasticsearch("http://localhost:9200", request_timeout=1000)

### Token `whitespace` filter `lowercase`:

In [None]:
# Index analyzer configuration
# Change the configuration and run this cell and the next to see the changes

# Tokenizers: whitespace, standard, classic, letter
# Filters: lowercase, asciifolding, stop, porter_stem, kstem, snowball
my_analyzer = analyzer('default',
    type='custom',
    tokenizer=tokenizer('whitespace'),
    filter=['lowercase']
)

# work with dummy index called 'foo'
ind = Index('foo', using=client)
ind.settings(number_of_shards=1)
try:
    # drop if exists
    ind.delete()
except NotFoundError:
    pass

# create it
ind.create()

# close to update analyzer to custom `my_analyzer`    
ind.close()
ind.analyzer(my_analyzer)    
ind.save()
ind.open()

Now you can ask the index to analyze any text; feel free to change it.

In [None]:
res = ind.analyze(body={'analyzer':'default', 'text':u'my taylor 4ís was% &printing printed rich the.'})
for r in res['tokens']:
    print(r)

### Token `standard` filter `lowercase`:

In [None]:
# Index analyzer configuration
# Change the configuration and run this cell and the next to see the changes

# Tokenizers: whitespace, standard, classic, letter
# Filters: lowercase, asciifolding, stop, porter_stem, kstem, snowball
my_analyzer = analyzer('default',
    type='custom',
    tokenizer=tokenizer('standard'),
    filter=['lowercase']
)
   
ind = Index('foo', using=client)
ind.close()
ind.analyzer(my_analyzer)    
ind.save()
ind.open()

Now you can ask the index to analyze any text; feel free to change it.

In [None]:
res = ind.analyze(body={'analyzer':'default', 'text':u'my taylor 4ís was% &printing printed rich the.'})
for r in res['tokens']:
    print(r)

### Token `letter` filter `lowercase`:

In [None]:
# Index analyzer configuration
# Change the configuration and run this cell and the next to see the changes

# Tokenizers: whitespace, standard, classic, letter
# Filters: lowercase, asciifolding, stop, porter_stem, kstem, snowball
my_analyzer = analyzer('default',
    type='custom',
    tokenizer=tokenizer('letter'),
    filter=['lowercase']
)

ind = Index('foo', using=client)
ind.close()
ind.analyzer(my_analyzer)    
ind.save()
ind.open()

Now you can ask the index to analyze any text; feel free to change it.

In [None]:
res = ind.analyze(body={'analyzer':'default', 'text':u'my taylor 4ís was% &printing printed rich the.'})
for r in res['tokens']:
    print(r)

### Filter `asciifolding`

In [None]:
# Index analyzer configuration
# Change the configuration and run this cell and the next to see the changes

# Tokenizers: whitespace, standard, classic, letter
# Filters: lowercase, asciifolding, stop, porter_stem, kstem, snowball
my_analyzer = analyzer('default',
    type='custom',
    tokenizer=tokenizer('letter'),
    filter=['lowercase','asciifolding']
)
   
ind = Index('foo', using=client)
ind.close()
ind.analyzer(my_analyzer)    
ind.save()
ind.open()

Now you can ask the index to analyze any text; feel free to change it.

In [None]:
res = ind.analyze(body={'analyzer':'default', 'text':u'my taylor 4ís was% &printing printed rich the.'})
for r in res['tokens']:
    print(r)

### Filter `asciifolding` + `stop`

In [None]:
# Index analyzer configuration
# Change the configuration and run this cell and the next to see the changes

# Tokenizers: whitespace, standard, classic, letter
# Filters: lowercase, asciifolding, stop, porter_stem, kstem, snowball
my_analyzer = analyzer('default',
    type='custom',
    tokenizer=tokenizer('letter'),
    filter=['lowercase','asciifolding', 'stop']
)
   
ind = Index('foo', using=client)
ind.close()
ind.analyzer(my_analyzer)    
ind.save()
ind.open()

Now you can ask the index to analyze any text; feel free to change it.

In [None]:
res = ind.analyze(body={'analyzer':'default', 'text':u'my taylor 4ís was% &printing printed rich the.'})
for r in res['tokens']:
    print(r)

### Filter `asciifolding` + `stop` + `snowball`

In [None]:
# Index analyzer configuration
# Change the configuration and run this cell and the next to see the changes

# Tokenizers: whitespace, standard, classic, letter
# Filters: lowercase, asciifolding, stop, porter_stem, kstem, snowball
my_analyzer = analyzer('default',
    type='custom',
    tokenizer=tokenizer('letter'),
    filter=['lowercase','asciifolding','stop', 'snowball']
)
   
ind = Index('foo', using=client)
ind.close()
ind.analyzer(my_analyzer)    
ind.save()
ind.open()

Now you can ask the index to analyze any text; feel free to change it.

In [None]:
res = ind.analyze(body={'analyzer':'default', 'text':u'my taylor 4ís was% &printing printed rich the.'})
for r in res['tokens']:
    print(r)

---

**Exercise 1:** solve exercise 1 (stopword removal & stemming) from problem set 1 using ElasticSearch. You can use the following string.

```
moonstone = """
We found my lady with no light in the room but the reading-lamp.
The shade was screwed down so as to over-shadow her face. Instead of looking up at us in her usual straightforward way, she sat
close at the table, and kept her eyes fixed obstinately on an open
book.
“Officer,” she said, “it is important to the inquiry you are conducting to know beforehand if any person now in this house wishes
to leave it?”
"""
```

---

In [None]:
moonstone = """
    We found my lady with no light in the room but the reading-lamp.
    The shade was screwed down so as to over-shadow her face. Instead of looking up at us in her usual straightforward way, she sat
    close at the table, and kept her eyes fixed obstinately on an open
    book.
    “Officer,” she said, “it is important to the inquiry you are conducting to know beforehand if any person now in this house wishes
    to leave it?”
"""

my_analyzer = analyzer('default',
    type='custom',
    tokenizer=tokenizer('letter'),
    filter=['lowercase', 'stop', 'snowball']
)
   
ind = Index('foo', using=client)
ind.close()
ind.analyzer(my_analyzer)    
ind.save()
ind.open()
res = ind.analyze(body={'analyzer':'default', 'text':moonstone})
for r in res['tokens']:
    print(r)

In [None]:
# cleanup ..
ind.delete()

## 3. Indexing script `IndexFilesPreprocess.py`

You should study how the provided indexer script named `IndexFilesPreprocess.py` works. 
Its usage is as follows:

```
usage: IndexFilesPreprocess.py [-h] --path PATH --index INDEX
                               [--token {standard,whitespace,classic,letter}]
                               [--filter ...]

optional arguments:
  -h, --help            show this help message and exit
  --path PATH           Path to the files
  --index INDEX         Index for the files
  --token {standard,whitespace,classic,letter}
                        Text tokenizer
  --filter ...          Text filter: lowercase, asciifolding, stop,
                        porter_stem, kstem, snowball
```

So, you can pass a `--path` argument which is the path to a directory where the files that you want to index are located (possibly in subdirectories);
you can specify through `--index` the name of the index to be created; you can also specify the _tokenization_ procedure to be used with the `--token` argument;
and finally you can apply preprocessing filters through the `--filter` argument. As an example call,

```
$ python3 IndexFilesPreprocess.py --index toy --path toy-docs --token letter --filter lowercase asciifolding
```

would create an index called `toy` adding all files located within the subdirectory `toy-docs`, applying the letter tokenizer and applying `lowercase` and `asciifolding` preprocessing.
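
The usage message above is consistent with an `argparse` parser along the following lines. This is only a sketch reconstructed from the help text, not the script's actual code: the default values are assumptions, and the use of `argparse.REMAINDER` is inferred from the `--filter ...` notation in the usage line.

```python
import argparse


def build_parser():
    """Sketch of a parser matching the usage message (defaults are assumed)."""
    parser = argparse.ArgumentParser(prog="IndexFilesPreprocess.py")
    parser.add_argument("--path", required=True, help="Path to the files")
    parser.add_argument("--index", required=True, help="Index for the files")
    parser.add_argument(
        "--token",
        default="standard",  # assumed default
        choices=["standard", "whitespace", "classic", "letter"],
        help="Text tokenizer",
    )
    parser.add_argument(
        "--filter",
        default=["lowercase"],  # assumed default
        nargs=argparse.REMAINDER,  # swallows everything after --filter
        help="Text filter: lowercase, asciifolding, stop, porter_stem, kstem, snowball",
    )
    return parser


args = build_parser().parse_args(
    "--index toy --path toy-docs --token letter --filter lowercase asciifolding".split()
)
print(args.index, args.path, args.token, args.filter)
```

With `argparse.REMAINDER`, every argument after `--filter` is treated as a filter name, so under this assumption `--filter` has to be the last option on the command line.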


In particular, you should pay attention to:

- how preprocessing is done within the script
- how the `bulk` operation is used for adding documents to the index (instead of adding files one-by-one)
- the structure of the documents added, which contains a `text` field with the content but also a `path` field with the name of the file being added
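
As a rough illustration of the last two points, the sketch below walks a directory and yields one action per file, each carrying a `path` and a `text` field. The function name `generate_actions` is hypothetical and the provided script may organize this differently; the commented-out call shows how such a generator could be handed to `elasticsearch.helpers.bulk` on a running server.

```python
import os


def generate_actions(root, index_name):
    """Yield one bulk 'index' action per file found under `root` (recursively)."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="replace") as handle:
                text = handle.read()
            yield {
                "_op_type": "index",
                "_index": index_name,
                "_source": {"path": path, "text": text},
            }


# With a running ElasticSearch server, the actions could be sent in batches:
#   from elasticsearch.helpers import bulk
#   bulk(client, generate_actions("toy-docs", "toy"))
```

Sending documents through `bulk` groups many index operations into a few requests, which is what makes indexing tens of thousands of small files feasible.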



## 4. Suggested coding exercises

---

**Exercise 2:**  

Download the `arxiv_abs.zip` repository from `https://www.cs.upc.edu/~marias/arxiv_abs.zip`; unzip it. You should see a directory containing folders that contain
text files. These correspond to abstracts of scientific papers in several topics from the [arXiv.org](https://arxiv.org) repository. Index these abstracts using the `IndexFilesPreprocess.py` script (be patient, it takes a while). Double check that your index contains around 58K documents. Pay special attention to how file names are stored in the `path` field of the indexed elasticsearch documents.

In [None]:
index = Index("arxiv", using=client)
index.delete()

Command used:

`python IndexFilesPreprocess.py --index arxiv --path ~/Desktop/CAI/Data/arxiv --token letter --filter lowercase asciifolding`

---

**Exercise 3:**

Write a function that computes the _cosine similarity_ between pairs of documents in your index. For that, the computations from last week that produced the _tf-idf_ vectors of documents in the toy-document dataset will be useful. It is important to use a _sparse representation_ for these vectors, either a Python dictionary (with `term: weight` entries) or a list of `(term, weight)` pairs; if you choose the latter, it is useful to sort the lists by term so that you can find common terms efficiently when computing the similarities.
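
A minimal sketch of the cosine computation for both sparse representations mentioned above. The function names are hypothetical, and the weights are assumed to be already computed; the example vectors are hand-made and do not come from the index.

```python
import math


def _norm(weights):
    """Euclidean norm of an iterable of weights."""
    return math.sqrt(sum(w * w for w in weights))


def cosine_dict(vec_a, vec_b):
    """Cosine similarity of two {term: weight} dictionaries."""
    if len(vec_a) > len(vec_b):
        vec_a, vec_b = vec_b, vec_a  # iterate over the shorter vector
    dot = sum(w * vec_b[t] for t, w in vec_a.items() if t in vec_b)
    denom = _norm(vec_a.values()) * _norm(vec_b.values())
    return dot / denom if denom else 0.0


def cosine_sorted(pairs_a, pairs_b):
    """Cosine similarity of two lists of (term, weight) sorted by term."""
    i = j = 0
    dot = 0.0
    while i < len(pairs_a) and j < len(pairs_b):
        term_a, w_a = pairs_a[i]
        term_b, w_b = pairs_b[j]
        if term_a == term_b:      # common term: contributes to the dot product
            dot += w_a * w_b
            i += 1
            j += 1
        elif term_a < term_b:     # advance whichever list is behind
            i += 1
        else:
            j += 1
    denom = _norm(w for _, w in pairs_a) * _norm(w for _, w in pairs_b)
    return dot / denom if denom else 0.0


a = {"field": 3.0, "quantum": 4.0}
b = {"field": 3.0}
print(cosine_dict(a, b))
print(cosine_sorted(sorted(a.items()), sorted(b.items())))
```

Both versions only touch terms that actually occur in the documents, so the cost depends on the document lengths rather than on the size of the vocabulary.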

In [None]:
from elasticsearch.helpers import scan

index_name = "arxiv"

sc = scan(client, index=index_name, query={"query" : {"match_all": {}}})

for doc in sc:
    print(doc)
    tv = client.termvectors(index=index_name, id=doc['_id'], fields=['text'], term_statistics=True, positions=False)
    break

print(tv)
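
The response printed above holds, for each term of the document, its `term_freq` and (because `term_statistics=True`) its `doc_freq`, together with a `field_statistics` block. The sketch below turns such a response into a sparse tf-idf dictionary. The weighting used here (term frequency divided by the maximum frequency in the document, times the base-2 logarithm of `doc_count / doc_freq`) is an assumption, so substitute the scheme from the previous session if it differs. The `example_tv` dictionary is hand-made, not real output, and the function name is hypothetical.

```python
import math


def tfidf_from_termvectors(tv, field="text"):
    """Build a {term: weight} dict from a termvectors response (assumed scheme)."""
    field_data = tv.get("term_vectors", {}).get(field)
    if not field_data or not field_data.get("terms"):
        return {}  # document without indexed terms in this field
    n_docs = field_data["field_statistics"]["doc_count"]
    terms = field_data["terms"]
    max_tf = max(info["term_freq"] for info in terms.values())
    weights = {}
    for term, info in terms.items():
        tf = info["term_freq"] / max_tf
        idf = math.log2(n_docs / info["doc_freq"])
        weights[term] = tf * idf
    return weights


# Hand-made response fragment with the same shape as the real one.
example_tv = {
    "term_vectors": {
        "text": {
            "field_statistics": {"doc_count": 8, "sum_doc_freq": 20, "sum_ttf": 30},
            "terms": {
                "quantum": {"term_freq": 2, "doc_freq": 2, "ttf": 5},
                "field": {"term_freq": 1, "doc_freq": 8, "ttf": 12},
            },
        }
    }
}
print(tfidf_from_termvectors(example_tv))
```

Note that these statistics are computed per shard, so on an index with several shards `doc_count` may not equal the total number of documents.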

---

**Exercise 4:**

Finally, using your code above, build a matrix that reflects the average cosine similarities between pairs of documents in different paper abstract categories. These categories are reflected in the path names of the files, e.g. on my computer, the path name to abstract `/Users/marias/Downloads/arxiv/hep-ph.updates.on.arXiv.org/000787` corresponds to the category of `hep-ph` papers. The categories are `astro-ph, cs, hep-th, physics, cond-mat, hep-ph, math, quant-ph`, which can be extracted from path names.
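
One possible way to organize this is sketched below. It assumes that every category folder is named like the example above, with the category as the part of the folder name before the first dot. Both function names are hypothetical, and the similarity function is passed in as a parameter so that the sketch stays independent of how the vectors are built.

```python
import os
from collections import defaultdict
from itertools import combinations


def category_from_path(path):
    """Extract e.g. 'hep-ph' from '.../hep-ph.updates.on.arXiv.org/000787'."""
    folder = os.path.basename(os.path.dirname(path))
    return folder.split(".")[0]


def average_similarity_matrix(docs, similarity):
    """Average `similarity` over all document pairs, grouped by category pair.

    `docs` is a list of (category, vector) tuples. The result maps an
    alphabetically sorted pair (cat_a, cat_b) to the mean similarity.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for (cat_a, vec_a), (cat_b, vec_b) in combinations(docs, 2):
        key = tuple(sorted((cat_a, cat_b)))
        totals[key] += similarity(vec_a, vec_b)
        counts[key] += 1
    return {key: totals[key] / counts[key] for key in totals}


example = "/Users/marias/Downloads/arxiv/hep-ph.updates.on.arXiv.org/000787"
print(category_from_path(example))
```

With around 58K documents, the number of pairs runs into the billions, so in practice it is probably necessary to average over a random sample of documents per category rather than over all of them.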

The following piece of code may be useful to see the content of a few documents within an index.

In [None]:
def print_docs_from_index(index_name, client, max_docs):

    print("===================")
    info = client.cat.count(index=index_name, format = "json")[0]
    print(f"Index: {index_name} with {info['count']} documents.")
    print()

    res = client.search(index=index_name, size = max_docs, query= {'match_all' : {}})

    for doc in res['hits']['hits']:
        print (doc['_id'], doc['_source'])

print_docs_from_index('arxiv', Elasticsearch("http://localhost:9200", request_timeout=1000), max_docs=10)