# Elasticsearch & Python

## Concepts

Elasticsearch is a storage system that indexes and analyzes large volumes of data in near real time.
It acts as an information repository, storing the documents it indexes so that they can be searched quickly.
**It does not require a declared schema** for the data we add, although to get the most out of that data we will need to define so-called **mappings**.
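
To make the idea of a mapping concrete, here is a minimal sketch of an explicit mapping body written as a Python dictionary. It assumes the typeless layout used by Elasticsearch 7 and later (older versions nest `properties` under a type name), and the field names are hypothetical, chosen only for illustration.

```python
import json

# Hypothetical field names, chosen only for illustration.
# Layout assumes the typeless mapping format of Elasticsearch 7+.
explicit_mapping = {
    "mappings": {
        "properties": {
            "title": {"type": "text"},           # analyzed for full-text search
            "tags": {"type": "keyword"},         # stored as exact values
            "published_from": {"type": "date"},
            "words": {"type": "integer"},
        }
    }
}


def field_types(mapping):
    """Return a {field name: type} summary of a mapping body."""
    properties = mapping["mappings"]["properties"]
    return {name: spec["type"] for name, spec in properties.items()}


print(json.dumps(field_types(explicit_mapping), indent=2))
```

Without such a mapping, Elasticsearch infers a type for each field the first time it sees it. Declaring the types up front gives control over how each field is analyzed and searched.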

- Open source
- Written in Java
- Built on Apache Lucene
- Distributed architecture
- Scalable, with high availability
- Useful for NoSQL solutions (no distributed transactions)
- RESTful API over HTTP for querying, indexing, and administration
- Supports index aliases
- Supports queries over one or several indices
- Full-text search
- Indexes every field of JSON documents by default
- Searches through the Elasticsearch Query DSL (Domain Specific Language): multi-language, geolocation, contextual, autocomplete, etc.

Its RESTful API makes it possible to integrate with almost any other technology.
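
As an illustration of how little is needed to talk to that API, the sketch below builds (but does not send) a search request using only the Python standard library. The base URL and the index names are assumptions, and `build_search_request` is a hypothetical helper rather than part of any Elasticsearch client.

```python
import json
from urllib.request import Request


def build_search_request(base_url, indices, query):
    """Build (but do not send) an HTTP request for the _search endpoint."""
    path = ",".join(indices)  # several indices are separated by commas
    url = f"{base_url.rstrip('/')}/{path}/_search"
    body = json.dumps({"query": query}).encode("utf-8")
    return Request(url, data=body, method="POST",
                   headers={"Content-Type": "application/json"})


request = build_search_request(
    "http://localhost:9200",          # assumed cluster address
    ["blog", "titanic"],              # assumed index names
    {"match": {"body": "ipsum"}},
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Against a running cluster the request could then be sent with `urllib.request.urlopen(request)`. Any language with an HTTP client and a JSON library can do the same.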

## Mapping the concepts to relational terms

| Relational database  | Elasticsearch         |
|----------------------|-----------------------|
| Database             | Index                 |
| Table                | Type                  |
| Row                  | Document              |
| Column               | Field                 |
| Schema               | Mapping               |
| Index                | Everything is indexed |
| SQL                  | Query DSL             |
| SELECT * FROM table… | GET http://…          |
| UPDATE table SET     | PUT http://…          |
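
The correspondence between SQL and the Query DSL can be sketched with a small translation helper. This is an illustration only: `select_to_query_dsl` is a hypothetical function, not part of any Elasticsearch client, and it covers just equality filters, ordering, and a row limit. The table and field names in the example are also assumptions.

```python
def select_to_query_dsl(where=None, order_by=None, limit=10):
    """Translate a tiny subset of SELECT into a Query DSL search body.

    where    -- dict of {field: exact value}, like `WHERE field = value`
    order_by -- list of (field, "asc" | "desc") pairs, like `ORDER BY`
    limit    -- maximum number of hits, like `LIMIT`
    """
    if where:
        filters = [{"term": {field: value}} for field, value in where.items()]
        query = {"bool": {"filter": filters}}
    else:
        query = {"match_all": {}}  # SELECT * with no WHERE clause
    body = {"query": query, "size": limit}
    if order_by:
        body["sort"] = [{field: {"order": direction}}
                        for field, direction in order_by]
    return body


# Roughly: SELECT * FROM articles WHERE tags = 'test' ORDER BY words DESC LIMIT 5
body = select_to_query_dsl({"tags": "test"}, [("words", "desc")], limit=5)
print(body)
```

The resulting dictionary is what would travel as the JSON body of a search request. The index, which plays the role of the table, goes in the URL rather than in the body.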

## Elasticsearch analyzer, term frequency, and index terms

<img src="assets/analyser.png"/>
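
The three ideas in this section's title can be illustrated with a toy example in plain Python. It is a deliberately simplified sketch, not how Lucene or the `snowball` analyzer actually work: the stop-word list, the `stem` rules, and the sample documents are all made up for illustration.

```python
import re
from collections import Counter, defaultdict

STOP_WORDS = {"a", "an", "and", "the", "of", "to", "is"}  # tiny illustrative list


def stem(token):
    """Very crude suffix stripping; real stemmers are far more careful."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Tokenize, lowercase, drop stop words, and stem: text -> index terms."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [stem(t) for t in tokens if t not in STOP_WORDS]


def build_inverted_index(docs):
    """Map each index term to {doc id: term frequency in that document}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, freq in Counter(analyze(text)).items():
            index[term][doc_id] = freq
    return index


docs = {
    1: "The quick brown foxes jumped",
    2: "Quick foxes and quick dogs",
}
index = build_inverted_index(docs)
print(index["quick"])  # {1: 1, 2: 2}
print(index["fox"])    # {1: 1, 2: 1}
```

The analyzer turns raw text into index terms, the inverted index records which documents contain each term, and the term frequency (how often a term occurs in a document) is one of the inputs to relevance scoring.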

In [2]:
from elasticsearch_dsl import connections

connections.create_connection(hosts=['elasticsearch'])

<Elasticsearch([{'host': 'elasticsearch'}])>

In [25]:
print(connections.get_connection().cluster.health()['status'])

yellow
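
The health status is one of three colors. As a quick reference, the sketch below maps each one to its meaning; `describe_health` is a hypothetical helper written for this note, not a client function.

```python
HEALTH_MEANING = {
    "green": "all primary and replica shards are allocated",
    "yellow": "all primary shards are allocated, but at least one replica is not",
    "red": "at least one primary shard is not allocated",
}


def describe_health(status):
    """Return a human-readable explanation of a cluster health status."""
    return HEALTH_MEANING.get(status, f"unknown status: {status!r}")


print(describe_health("yellow"))
```

A `yellow` status is common on a single-node cluster, because a replica is never placed on the same node as its primary shard.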


In [3]:
from datetime import datetime
from elasticsearch_dsl import Document, Date, Integer, Keyword, Text
from elasticsearch_dsl.connections import connections


class Article(Document):
    title = Text(analyzer='snowball', fields={'raw': Keyword()})
    body = Text(analyzer='snowball')
    tags = Keyword()
    published_from = Date()
    words = Integer()

    class Index:
        name = 'blog'
        settings = {
          "number_of_shards": 2,
        }

    def save(self, **kwargs):
        # Store the word count of the body alongside the document.
        self.words = len(self.body.split())
        return super().save(**kwargs)

    def is_published(self):
        return datetime.now() >= self.published_from

In [4]:
import elasticsearch.exceptions as elasticexceptions
from elasticsearch_dsl import Index

index = Index("blog")

try:
    index.delete()
except elasticexceptions.NotFoundError:
    pass

Article.init()

In [81]:
article1 = Article(title='Hello world!', tags=['test'])
article1.body = ''' Lorem ipsum dolor sit amet, consectetur adipiscing elit. Ut magna augue, congue eget vulputate id, vestibulum ut augue. Sed accumsan at diam ut consectetur. Nam sed massa ac libero lobortis sodales ac eu justo. Cras lacus ipsum, lobortis et porttitor vehicula, congue vitae lacus. Maecenas pharetra justo risus, eu ultrices risus sollicitudin eu. Orci varius natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus. Donec ac est ultrices, congue odio ac, faucibus nisi. Nullam eu nibh ut ligula rutrum ullamcorper vitae ac diam. '''
article1.published_from = datetime.now()
article1.save()

article2 = Article(title='Bye world!', tags=['test'])
article2.body = ''' Vivamus eu ipsum neque. Nullam eget congue tortor, ut blandit dui. Proin eu ultrices sem. Phasellus tellus sem, egestas vitae ullamcorper sit amet, sollicitudin non velit. Aenean vel neque nec tellus semper ultricies eget sed ex. Aliquam porttitor eu eros non gravida. Nulla eu ligula dapibus, eleifend nisl ac, consectetur massa. Ut vitae dolor felis. Praesent sit amet est sit amet dui sagittis iaculis. Orci varius natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus. Curabitur orci libero, dictum nec ex vitae, accumsan pellentesque nisi. '''
article2.published_from = datetime.now()
article2.save()

article3 = Article(title='Other world!', tags=['awesome'])
article3.body = ''' Curabitur rutrum arcu nec cursus viverra. Maecenas gravida pellentesque erat, id semper leo rutrum sed. Proin faucibus lectus eu libero sagittis pulvinar.'''
article3.published_from = datetime.now()
article3.save()

True

In [91]:
search = Article.search().query("match", body="adipiscing")

In [92]:
print(search.to_dict())
response = search.execute()

{'query': {'match': {'body': 'adipiscing'}}}


In [93]:
for hit in response:
    print(f"{hit.title}  {hit.words}")

Hello world!  84


In [94]:
search = Article.search().query("match", body="tortor")
response = search.execute()
for hit in response:
    print(f"{hit.title}  {hit.words}")


Bye world!  86


In [95]:
search = Article.search().query("match", body="ipsum").sort('-words')

print(search.to_dict())
print()

response = search.execute()
for hit in response:
    print(f"{hit.title}  {hit.words}")

{'query': {'match': {'body': 'ipsum'}}, 'sort': [{'words': {'order': 'desc'}}]}

Bye world!  86
Hello world!  84


In [100]:
search = Article.search().filter('terms', tags=['awesome'])

print(search.to_dict())
print()

response = search.execute()
for hit in response:
    print(f"{hit.title}  {hit.words}")

{'query': {'bool': {'filter': [{'terms': {'tags': ['awesome']}}]}}}

Other world!  22


In [101]:
from elasticsearch import helpers, Elasticsearch
import csv

#es = Elasticsearch()

with open('/home/docker_worker/work/data/titanic.csv') as f:
    reader = csv.DictReader(f)
    #helpers.bulk(es, reader, index='my-index', doc_type='my-type')
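
The bulk call in the cell above is left commented out. As a hedged sketch of what it would need, the snippet below turns the rows of a `csv.DictReader` into the action dictionaries that `helpers.bulk` accepts (`_index` plus `_source`). `bulk_actions` is a hypothetical helper, and the two-row sample stands in for the real file so the example runs without a cluster.

```python
import csv
import io


def bulk_actions(rows, index_name):
    """Yield one bulk action per row, in the dict format helpers.bulk accepts."""
    for row in rows:
        yield {"_index": index_name, "_source": dict(row)}


# A two-row in-memory stand-in for the CSV file (made-up sample data).
sample = io.StringIO("name,sex\nAlice,female\nBob,male\n")
actions = list(bulk_actions(csv.DictReader(sample), "my-index"))
print(actions[0])

# With a live cluster and an open file this could be sent as, for example:
# helpers.bulk(es, bulk_actions(reader, "my-index"))
```

Because the generator is consumed lazily, the `helpers.bulk` call has to happen inside the `with open(...)` block, while the file is still open.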

In [53]:
from elasticsearch_dsl import Document, Date, Integer, Keyword, Text, Boolean, Float

class TitanicPassenger(Document):
    passengerid = Integer()
    survived = Integer()
    pclass = Integer()
    name = Text(analyzer='snowball')
    sex = Keyword()
    age = Integer()
    sibsp = Integer()
    parch = Integer()
    ticket = Keyword()
    fare = Float()
    cabin = Keyword()
    embarked = Keyword()
    

    class Index:
        name = 'titanic'
        settings = {
          "number_of_shards": 2,
          "blocks":{'read_only_allow_delete': None},
        }



In [54]:
import elasticsearch.exceptions as elasticexceptions
from elasticsearch_dsl import Index

index = Index("titanic")

try:
    index.delete()
except elasticexceptions.NotFoundError:
    pass

TitanicPassenger.init()

In [55]:
csv_path='/home/docker_worker/work/data/titanic.csv'

In [56]:
import pandas as pd
df = pd.read_csv(csv_path, sep='\t')
df.head()

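| Unnamed: 0 | passengerid | survived | pclass | name                         | sex    | age | sibsp | parch | ticket           | fare    | cabin | embarked |
|------------|-------------|----------|--------|------------------------------|--------|-----|-------|-------|------------------|---------|-------|----------|
| 0          | 1           | 0        | 3      | Braund, Mr. Owen Harris      | male   | 22  | 1     | 0     | A/5 21171        | 7.25    |       | S        |
| 1          | 2           | 1        | 1      | Cumings, Mrs. John Bradley   | female | 38  | 1     | 0     | PC 17599         | 71.2833 | C85   | C        |
| 2          | 3           | 1        | 3      | Heikkinen, Miss. Laina       | female | 26  | 0     | 0     | STON/O2. 3101282 | 7925.0  |       | S        |
| 3          | 4           | 1        | 1      | Futrelle, Mrs. Jacques Heath | female | 35  | 1     | 0     | 113803           | 53.1    | C123  | S        |
| 4          | 5           | 0        | 3      | Allen, Mr. William Henry     | male   | 35  | 0     | 0     | 373450           | 8.05    |       | S        |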


In [57]:
from elasticsearch import helpers, Elasticsearch
import csv
es = Elasticsearch("elasticsearch:9200")

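def docs_for_load():
    """Index every row of the CSV file, one save() call per row."""
    with open(csv_path, newline='') as f:
        reader = csv.reader(f, delimiter="\t")
        next(reader)  # skip the header row
        for row in reader:
            # Every value arrives as a string, exactly as read from the file.
            TitanicPassenger(passengerid=row[0],
                             survived=row[1],
                             pclass=row[2],
                             name=row[3],
                             sex=row[4],
                             age=row[5],
                             sibsp=row[6],
                             parch=row[7],
                             ticket=row[8],
                             fare=row[9],
                             cabin=row[10],
                             embarked=row[11]).save()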

docs_for_load()
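
`csv.reader` yields every cell as a string, so the loader above passes raw strings (including empty ones) to the document class. One possible refinement, sketched here under assumptions, is to convert each cell before indexing. `parse_cell`, `COLUMNS`, and `clean_row` are hypothetical names; the column order is assumed to match the loader, and `age` is parsed as a float on the assumption that the file may contain fractional ages.

```python
def parse_cell(value, cast=str):
    """Return cast(value), or None when the CSV cell is empty."""
    value = value.strip()
    return cast(value) if value else None


# Column order assumed to match the loader; age as float is an assumption.
COLUMNS = [
    ("passengerid", int), ("survived", int), ("pclass", int),
    ("name", str), ("sex", str), ("age", float),
    ("sibsp", int), ("parch", int), ("ticket", str),
    ("fare", float), ("cabin", str), ("embarked", str),
]


def clean_row(row):
    """Turn a list of CSV strings into typed keyword arguments."""
    return {name: parse_cell(cell, cast)
            for (name, cast), cell in zip(COLUMNS, row)}


sample = ["1", "0", "3", "Braund, Mr. Owen Harris", "male", "22",
          "1", "0", "A/5 21171", "7.25", "", "S"]
passenger = clean_row(sample)
print(passenger)

# In the loader this could be used as, for example:
# TitanicPassenger(**clean_row(row)).save()
```

Empty cells become `None`, so missing values such as an unknown cabin are not sent as empty strings.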

In [63]:
search = TitanicPassenger.search().query("match", name="Meo")
response = search.execute()
for hit in response:
    print(f"{hit.passengerid}) {hit.name}")

153) Meo, Mr. Alfonzo
