# Session 2 - Programming with Elastic Search

In [1]:
#%pip install elasticsearch
#%pip install elasticsearch-dsl
#%pip install pandas

## 1 Modifying ElasticSearch index behavior

In the previous session we had to manually clean the list of words in order to compute Zipf's and Heaps' laws.

ElasticSearch lets us define a pipeline of processes that cleans the text as it is indexed, discarding anything that is not useful.

We are going to work with three of the usual processes:

* Tokenization
* Normalization
* Token filtering (stopwords and stemming)

The next cells configure the default analyzer for an index and analyze an example text. We are going to play a little with the possibilities and see which tokens result from the analysis.


In [2]:
from elasticsearch import Elasticsearch
from elasticsearch_dsl import Index, analyzer, tokenizer
import pandas as pd

In [3]:
client = Elasticsearch( hosts=['http://localhost:9200'], request_timeout=1000)

## Token Whitespace filter lowercase
The whitespace tokenizer divides text into terms whenever it encounters any whitespace character.
The lowercase token filter converts each token to lowercase.

In [4]:
# Index analyzer configuration
# Change the configuration and run this cell and the next to see the changes

# Tokenizers: whitespace, standard, classic, letter
# Filters: lowercase, asciifolding, stop, porter_stem, kstem, snowball
my_analyzer = analyzer('default',
    type='custom',
    tokenizer=tokenizer('whitespace'),
    filter=['lowercase']
)
   
ind = Index('news', using=client)
ind.close()
ind.analyzer(my_analyzer)    
ind.save()
ind.open()



ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True})

Now you can ask the index to analyze any text; feel free to change the text.

In [5]:
res = ind.analyze(body={'analyzer':'default', 'text':u'my taylor 4ís was% &printing printed rich the.'})
for r in res['tokens']:
    print(r)

{'token': 'my', 'start_offset': 0, 'end_offset': 2, 'type': 'word', 'position': 0}
{'token': 'taylor', 'start_offset': 3, 'end_offset': 9, 'type': 'word', 'position': 1}
{'token': '4ís', 'start_offset': 10, 'end_offset': 13, 'type': 'word', 'position': 2}
{'token': 'was%', 'start_offset': 14, 'end_offset': 18, 'type': 'word', 'position': 3}
{'token': '&printing', 'start_offset': 19, 'end_offset': 28, 'type': 'word', 'position': 4}
{'token': 'printed', 'start_offset': 29, 'end_offset': 36, 'type': 'word', 'position': 5}
{'token': 'rich', 'start_offset': 37, 'end_offset': 41, 'type': 'word', 'position': 6}
{'token': 'the.', 'start_offset': 42, 'end_offset': 46, 'type': 'word', 'position': 7}
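
As a rough local illustration of what this analyzer does, the sketch below splits the same text on whitespace and lowercases each token without contacting the server. The function name `whitespace_lowercase` is hypothetical, and the regular expression only approximates the behavior of the `whitespace` tokenizer.

```python
import re

def whitespace_lowercase(text):
    """Split on runs of whitespace and lowercase every token (approximation)."""
    tokens = []
    for position, match in enumerate(re.finditer(r'\S+', text)):
        tokens.append({
            'token': match.group().lower(),
            'start_offset': match.start(),
            'end_offset': match.end(),
            'position': position,
        })
    return tokens

tokens = whitespace_lowercase('my taylor 4ís was% &printing printed rich the.')
for t in tokens:
    print(t)
```

Note that punctuation stays attached to the tokens (`was%`, `&printing`, `the.`), as in the output of the index.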




## Token Standard
The standard tokenizer divides text into terms on word boundaries, as defined by the Unicode Text Segmentation algorithm. It removes most punctuation symbols.

In [6]:
# Index analyzer configuration
# Change the configuration and run this cell and the next to see the changes

# Tokenizers: whitespace, standard, classic, letter
# Filters: lowercase, asciifolding, stop, porter_stem, kstem, snowball
my_analyzer = analyzer('default',
    type='custom',
    tokenizer=tokenizer('standard'),
    filter=['lowercase']
)
   
ind = Index('news', using=client)
ind.close()
ind.analyzer(my_analyzer)    
ind.save()
ind.open()

ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True})

Now you can ask the index to analyze any text; feel free to change the text.

In [7]:
res = ind.analyze(body={'analyzer':'default', 'text':u'my taylor 4ís was% &printing printed rich the.'})
for r in res['tokens']:
    print(r)

{'token': 'my', 'start_offset': 0, 'end_offset': 2, 'type': '<ALPHANUM>', 'position': 0}
{'token': 'taylor', 'start_offset': 3, 'end_offset': 9, 'type': '<ALPHANUM>', 'position': 1}
{'token': '4ís', 'start_offset': 10, 'end_offset': 13, 'type': '<ALPHANUM>', 'position': 2}
{'token': 'was', 'start_offset': 14, 'end_offset': 17, 'type': '<ALPHANUM>', 'position': 3}
{'token': 'printing', 'start_offset': 20, 'end_offset': 28, 'type': '<ALPHANUM>', 'position': 4}
{'token': 'printed', 'start_offset': 29, 'end_offset': 36, 'type': '<ALPHANUM>', 'position': 5}
{'token': 'rich', 'start_offset': 37, 'end_offset': 41, 'type': '<ALPHANUM>', 'position': 6}
{'token': 'the', 'start_offset': 42, 'end_offset': 45, 'type': '<ALPHANUM>', 'position': 7}


## Token Letter
The letter tokenizer divides text into terms whenever it encounters a character that is not a letter.

In [8]:
# Index analyzer configuration
# Change the configuration and run this cell and the next to see the changes

# Tokenizers: whitespace, standard, classic, letter
# Filters: lowercase, asciifolding, stop, porter_stem, kstem, snowball
my_analyzer = analyzer('default',
    type='custom',
    tokenizer=tokenizer('letter'),
    filter=['lowercase']
)
   
ind = Index('news', using=client)
ind.close()
ind.analyzer(my_analyzer)    
ind.save()
ind.open()

ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True})

Now you can ask the index to analyze any text; feel free to change the text.

In [9]:
res = ind.analyze(body={'analyzer':'default', 'text':u'my taylor 4ís was% &printing printed rich the.'})
for r in res['tokens']:
    print(r)

{'token': 'my', 'start_offset': 0, 'end_offset': 2, 'type': 'word', 'position': 0}
{'token': 'taylor', 'start_offset': 3, 'end_offset': 9, 'type': 'word', 'position': 1}
{'token': 'ís', 'start_offset': 11, 'end_offset': 13, 'type': 'word', 'position': 2}
{'token': 'was', 'start_offset': 14, 'end_offset': 17, 'type': 'word', 'position': 3}
{'token': 'printing', 'start_offset': 20, 'end_offset': 28, 'type': 'word', 'position': 4}
{'token': 'printed', 'start_offset': 29, 'end_offset': 36, 'type': 'word', 'position': 5}
{'token': 'rich', 'start_offset': 37, 'end_offset': 41, 'type': 'word', 'position': 6}
{'token': 'the', 'start_offset': 42, 'end_offset': 45, 'type': 'word', 'position': 7}
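
The letter tokenizer can be imitated locally with a regular expression that keeps only runs of letters. This is a minimal sketch under the assumption that Python's Unicode notion of a letter is close enough to the one used by the index; `letter_tokenize` is a hypothetical helper, not part of any library.

```python
import re

# [^\W\d_] matches a word character that is neither a digit nor an underscore,
# which in Python 3 means a Unicode letter.
LETTER_RUN = re.compile(r'[^\W\d_]+')

def letter_tokenize(text):
    """Return (token, start_offset, end_offset) for every run of letters."""
    return [(m.group().lower(), m.start(), m.end()) for m in LETTER_RUN.finditer(text)]

letter_tokens = letter_tokenize('my taylor 4ís was% &printing printed rich the.')
for token in letter_tokens:
    print(token)
```

The digit in `4ís` is discarded, so the token becomes `ís` and starts at offset 11, which matches what the index returns.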


## Filter asciifolding
The `filter` parameter accepts a list of filters, which are applied in order. We keep `lowercase` and add `asciifolding`, which converts alphabetic, numeric, and symbolic characters that are not in the Basic Latin Unicode block (the first 127 ASCII characters) to their ASCII equivalent, if one exists. For example, the filter changes à to a.

In [10]:
# Index analyzer configuration
# Change the configuration and run this cell and the next to see the changes

# Tokenizers: whitespace, standard, classic, letter
# Filters: lowercase, asciifolding, stop, porter_stem, kstem, snowball
my_analyzer = analyzer('default',
    type='custom',
    tokenizer=tokenizer('letter'),
    filter=['lowercase','asciifolding']
)
   
ind = Index('news', using=client)
ind.close()
ind.analyzer(my_analyzer)    
ind.save()
ind.open()

ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True})

Now you can ask the index to analyze any text; feel free to change the text.

In [11]:
res = ind.analyze(body={'analyzer':'default', 'text':u'my taylor 4ís was% &printing printed rich the.'})
for r in res['tokens']:
    print(r)

{'token': 'my', 'start_offset': 0, 'end_offset': 2, 'type': 'word', 'position': 0}
{'token': 'taylor', 'start_offset': 3, 'end_offset': 9, 'type': 'word', 'position': 1}
{'token': 'is', 'start_offset': 11, 'end_offset': 13, 'type': 'word', 'position': 2}
{'token': 'was', 'start_offset': 14, 'end_offset': 17, 'type': 'word', 'position': 3}
{'token': 'printing', 'start_offset': 20, 'end_offset': 28, 'type': 'word', 'position': 4}
{'token': 'printed', 'start_offset': 29, 'end_offset': 36, 'type': 'word', 'position': 5}
{'token': 'rich', 'start_offset': 37, 'end_offset': 41, 'type': 'word', 'position': 6}
{'token': 'the', 'start_offset': 42, 'end_offset': 45, 'type': 'word', 'position': 7}
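
The effect of folding accented letters can be approximated in plain Python with Unicode decomposition. This is only a sketch: `ascii_fold` is a hypothetical helper, and it handles just the characters that decompose into a base letter plus combining marks, whereas the `asciifolding` filter covers a wider range of characters.

```python
import unicodedata

def ascii_fold(token):
    """Strip combining marks after NFKD decomposition (rough approximation)."""
    decomposed = unicodedata.normalize('NFKD', token)
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))

folded = [ascii_fold(t) for t in ['my', 'taylor', 'ís', 'à']]
print(folded)
```

With this helper `ís` becomes `is`, which is the same change seen in the third token of the output above.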


## Filter asciifolding + stop

In addition to the previous two filters, we now add `stop`, which removes stop words from a token stream.

When not customized, the filter removes a default set of English stop words. In the example below, it drops `is`, `was`, and `the`.



In [12]:
# Index analyzer configuration
# Change the configuration and run this cell and the next to see the changes

# Tokenizers: whitespace, standard, classic, letter
# Filters: lowercase, asciifolding, stop, porter_stem, kstem, snowball
my_analyzer = analyzer('default',
    type='custom',
    tokenizer=tokenizer('letter'),
    filter=['lowercase','asciifolding', 'stop']
)
   
ind = Index('news', using=client)
ind.close()
ind.analyzer(my_analyzer)    
ind.save()
ind.open()

ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True})

Now you can ask the index to analyze any text; feel free to change the text.

In [13]:
res = ind.analyze(body={'analyzer':'default', 'text':u'my taylor 4ís was% &printing printed rich the.'})
for r in res['tokens']:
    print(r)

{'token': 'my', 'start_offset': 0, 'end_offset': 2, 'type': 'word', 'position': 0}
{'token': 'taylor', 'start_offset': 3, 'end_offset': 9, 'type': 'word', 'position': 1}
{'token': 'printing', 'start_offset': 20, 'end_offset': 28, 'type': 'word', 'position': 4}
{'token': 'printed', 'start_offset': 29, 'end_offset': 36, 'type': 'word', 'position': 5}
{'token': 'rich', 'start_offset': 37, 'end_offset': 41, 'type': 'word', 'position': 6}
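
Notice that the surviving tokens keep their original positions (0, 1, 4, 5, 6), so the removed words leave gaps. The sketch below reproduces that behavior locally. The `STOP_WORDS` set is an assumption for illustration: it contains only the three words removed in this example, not the full default list used by the filter.

```python
# Assumed subset of the default English stop words, for illustration only.
STOP_WORDS = {'is', 'was', 'the'}

def stop_filter(tokens):
    """Drop stop words while keeping each remaining token's original position."""
    return [(position, token)
            for position, token in enumerate(tokens)
            if token not in STOP_WORDS]

kept = stop_filter(['my', 'taylor', 'is', 'was', 'printing', 'printed', 'rich', 'the'])
for position, token in kept:
    print(position, token)
```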


## Filter asciifolding + stop + snowball

Snowball is a filter that stems words using a Snowball-generated stemmer. 

In [14]:
# Index analyzer configuration
# Change the configuration and run this cell and the next to see the changes

# Tokenizers: whitespace, standard, classic, letter
# Filters: lowercase, asciifolding, stop, porter_stem, kstem, snowball
my_analyzer = analyzer('default',
    type='custom',
    tokenizer=tokenizer('letter'),
    filter=['lowercase','asciifolding','stop', 'snowball']
)
   
ind = Index('news', using=client)
ind.close()
ind.analyzer(my_analyzer)    
ind.save()
ind.open()

ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True})

Now you can ask the index to analyze any text; feel free to change the text.

In [15]:
res = ind.analyze(body={'analyzer':'default', 'text':u'my taylor 4ís was% &printing printed rich the.'})
for r in res['tokens']:
    print(r)

{'token': 'my', 'start_offset': 0, 'end_offset': 2, 'type': 'word', 'position': 0}
{'token': 'taylor', 'start_offset': 3, 'end_offset': 9, 'type': 'word', 'position': 1}
{'token': 'print', 'start_offset': 20, 'end_offset': 28, 'type': 'word', 'position': 4}
{'token': 'print', 'start_offset': 29, 'end_offset': 36, 'type': 'word', 'position': 5}
{'token': 'rich', 'start_offset': 37, 'end_offset': 41, 'type': 'word', 'position': 6}


Now **follow the instructions** of the documentation, index the documents from the previous session using the script 'IndexFilesPreprocess.py', and use the script 'CountWords.py' from the previous session to see how the set of tokens changes.

***

## 2 The index reloaded

We will now use the `novels` index with different kinds of tokenizers and filters. To analyze the results, we have turned `CountWords.py` into a class so that its functionality can be imported into the notebook. The previous section showed how the different tokenizers and filters affect a sample text; now we want to see how they affect the total word count.

In [16]:
from CountWords import ElasticFunctionals

results_df = pd.DataFrame(columns=['tokenizer', 'filters', 'word_count'])

index = 'novels'

ef = ElasticFunctionals(client)

tokenizers = ['whitespace', 'standard', 'classic', 'letter']
filters = [['lowercase'], ['lowercase','asciifolding'], ['lowercase', 'porter_stem'], ['lowercase', 'kstem'], ['lowercase', 'snowball']]

In [17]:
for tokenizer_name in tokenizers:
    for filter_list in filters:
        my_analyzer = analyzer('default',
            type='custom',
            tokenizer=tokenizer(tokenizer_name),
            filter=filter_list
        )

        ind = Index(index, using=client)
        ind.close()
        ind.analyzer(my_analyzer)
        ind.save()
        ind.open()

        # Count the words using the ElasticFunctionals class
        word_count = ef.count_words(index, alpha=False)

        # Build a temporary DataFrame with the results of the current iteration
        new_row = pd.DataFrame({
            'tokenizer': [tokenizer_name],
            'filters': [', '.join(filter_list)],
            'word_count': [word_count]
        })

        # Append the new row to the existing DataFrame
        results_df = pd.concat([results_df, new_row], ignore_index=True)




In [18]:
print(results_df)

     tokenizer                  filters word_count
0   whitespace                lowercase     167063
1   whitespace  lowercase, asciifolding     167063
2   whitespace   lowercase, porter_stem     149270
3   whitespace         lowercase, kstem     153217
4   whitespace      lowercase, snowball     141967
5     standard                lowercase      61825
6     standard  lowercase, asciifolding      61739
7     standard   lowercase, porter_stem      41955
8     standard         lowercase, kstem      45530
9     standard      lowercase, snowball      39067
10     classic                lowercase      59569
11     classic  lowercase, asciifolding      59480
12     classic   lowercase, porter_stem      39643
13     classic         lowercase, kstem      43219
14     classic      lowercase, snowball      36760
15      letter                lowercase      54455
16      letter  lowercase, asciifolding      54365
17      letter   lowercase, porter_stem      34361
18      letter         lowercas

The `whitespace` tokenizer with the `lowercase` filter produces the highest word count, reaching 167,063 tokens. Applying stemming filters such as `porter_stem` and `kstem` reduces the number of tokens significantly, to 149,270 and 153,217 respectively. The `standard` and `classic` tokenizers yield considerably lower counts; `standard` reaches only 61,825 words with the `lowercase` filter. This shows how strongly the tokenizer and the filters applied influence the processing of the text.
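
To put these differences in perspective, the following sketch computes the relative reduction that each stemmer achieves with respect to the `lowercase`-only count. The figures are copied from the table above; the `letter` tokenizer is left out because its rows are incomplete in the printed output. The variable names are chosen for this example only.

```python
# Word counts copied from the results table (lowercase combined with each filter).
word_counts = {
    'whitespace': {'lowercase': 167063, 'porter_stem': 149270, 'kstem': 153217, 'snowball': 141967},
    'standard':   {'lowercase': 61825,  'porter_stem': 41955,  'kstem': 45530,  'snowball': 39067},
    'classic':    {'lowercase': 59569,  'porter_stem': 39643,  'kstem': 43219,  'snowball': 36760},
}

reductions = {}
for tokenizer_name, counts in word_counts.items():
    baseline = counts['lowercase']
    # Percentage of the baseline count that disappears once the stemmer is added.
    reductions[tokenizer_name] = {
        stemmer: 100 * (baseline - count) / baseline
        for stemmer, count in counts.items()
        if stemmer != 'lowercase'
    }

for tokenizer_name, per_stemmer in reductions.items():
    for stemmer, pct in per_stemmer.items():
        print(f'{tokenizer_name:<10} {stemmer:<12} {pct:5.1f}% fewer words')
```

In relative terms, stemming shrinks the count much more for `standard` and `classic` than for `whitespace`, where punctuation attached to the tokens keeps many word forms from being merged.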

### Analyzing the most common word 

In [21]:
indices = ['novels', 'scientific']
results_frequency = pd.DataFrame(columns=['Index', 'Most Common Word', 'Count'])

ef = ElasticFunctionals(client)

# Count the most common words in each index
for index in indices:
    word_counts = ef.count_word_frequency(index)
    
    if word_counts:
        most_common_word = word_counts[0]  # Get the most common word
        new_row = {
            'Index': index,
            'Most Common Word': most_common_word[0],
            'Count': most_common_word[1]
        }
        results_frequency = pd.concat([results_frequency, pd.DataFrame([new_row])], ignore_index=True)

print(results_frequency)



        Index Most Common Word   Count
0      novels              the  206706
1  scientific              the  257240


***

## 3 Computing Tf-Idf and Cosine similarity

Now it is your turn to work on the session task.

The idea is to write a script that, given two document paths, obtains their ids, computes the Tf-Idf representation of the documents, and then computes and prints their cosine similarity.
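
The computation itself can be sketched on toy data before connecting it to the index. The weighting used here, `(f / max_f) * log2(N / df)`, is an assumption; check the session documentation for the exact formula you are expected to use. The term frequencies, document frequencies, and function names below are all made up for illustration.

```python
import math

def tf_idf(term_freqs, doc_freqs, n_docs):
    """Weight each term as (f / max_f) * log2(n_docs / df) (assumed formula)."""
    max_f = max(term_freqs.values())
    return {
        term: (f / max_f) * math.log2(n_docs / doc_freqs[term])
        for term, f in term_freqs.items()
    }

def normalize(weights):
    """Scale a sparse vector to unit Euclidean length."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else dict(weights)

def cosine_similarity(a, b):
    """Dot product of the two normalized sparse vectors."""
    a, b = normalize(a), normalize(b)
    return sum(w * b[t] for t, w in a.items() if t in b)

# Toy data: term frequencies per document and document frequencies in the index.
doc1 = {'print': 2, 'rich': 1}
doc2 = {'print': 1, 'taylor': 3}
doc_freqs = {'print': 2, 'rich': 1, 'taylor': 1}
n_docs = 4

v1 = tf_idf(doc1, doc_freqs, n_docs)
v2 = tf_idf(doc2, doc_freqs, n_docs)
similarity = cosine_similarity(v1, v2)
print(f'Similarity = {similarity:.5f}')
```

In the actual script, the term and document frequencies would come from the index rather than from hard-coded dictionaries.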

**Follow the instructions** in the documentation and **pay attention** to the documentation that you have to deliver for this session.