# Elasticsearch, Kibana and FSCrawler
- Elasticsearch is a full-text search engine built on Lucene
- Kibana is the interface to the data in Elasticsearch
- FSCrawler helps to index binary documents such as PDF, Open Office, MS Office.

## Elasticsearch use cases
- security/log analytics
- marketing
- operations
- search
![title](img/es.png)

## What are documents and indices?
Documents:
- Documents are the data
- Documents have attributes
- Attributes have values
- Attributes have types

Indices:
- Documents are grouped in an index
- Elasticsearch can have multiple indices
- Indices can be named to reflect the data they hold
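
To make the terms concrete, here is a minimal sketch that models one document and two indices as plain Python structures. The field names and the index names (`reports`, `invoices`) are hypothetical and only illustrate the idea that attributes carry values and types.

```python
from datetime import date

# A document is a set of attributes (fields); each attribute has a value and a type.
# All field names below are hypothetical examples.
document = {
    "title": "Quarterly report",                   # string
    "pages": 12,                                   # number
    "published": date(2018, 8, 23).isoformat(),    # date, written as an ISO 8601 string
    "confidential": False,                         # boolean
}

# Documents are grouped into indices, and indices are named after the data they hold.
indices = {
    "reports": [document],
    "invoices": [],
}

# Show every attribute with its value and its Python type.
for name, value in document.items():
    print(f"{name}: {value!r} ({type(value).__name__})")
```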

## What is mapping?
- Predefined document types, applied to an index
- Don't make Elasticsearch guess how to treat the data (without a mapping, it will). A mapping defines:
    - which string fields should be treated as full-text fields
    - which fields contain numbers, dates, or geolocations
    - whether the values of all fields in the document should be indexed into the _all field (deprecated as of Elasticsearch 6.0)
    - the format of date values
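
As an illustration, the sketch below builds the request body for creating an index with an explicit mapping. The field names are hypothetical, and the `_doc` mapping type name assumes Elasticsearch 6.x, where each index holds a single mapping type.

```python
import json


def build_index_body():
    """Return a request body for creating an index with an explicit mapping."""
    properties = {
        "title": {"type": "text"},        # analyzed for full-text search
        "author": {"type": "keyword"},    # exact-match string, not analyzed
        "pages": {"type": "integer"},
        "published": {"type": "date", "format": "yyyy-MM-dd"},
        "location": {"type": "geo_point"},
    }
    # Assumption: Elasticsearch 6.x, one mapping type per index, named "_doc".
    return {"mappings": {"_doc": {"properties": properties}}}


body = build_index_body()
print(json.dumps(body, indent=2))
```

The printed JSON could be sent as the body of a `PUT` request to `http://localhost:9200/<index_name>`, for example from the Kibana Dev Tools console.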

## Data Types
![title](img/es_data_types.png)

- Core Data Types:
    - text, keyword
    - long, integer, short, byte, double, float, half_float, scaled_float
    - date
    - boolean
    - binary
    - integer_range, float_range, long_range, double_range, date_range
- Complex Data Types:
    - Array
    - Object
    - Nested objects
- Geo Data Types:
    - Geo-Point
    - Geo-Shape
- Specialized Data Types:
    - IP (IPv4 IPv6)
    - Completion
    - Token count
    - Mapper-murmur3
    - Percolator
    - join

## What is a shard and replica?
To be scalable, an index has to be distributed, and Elasticsearch does this using shards and replicas.

A shard is a portion of an index, and a replica is a copy of a shard. Because a replica serves as a backup, it can never be located on the same node as the primary shard that it copies.

The default when creating an index (in Elasticsearch 6.x) is five shards and one replica. That equals five primary shards and five replica shards, distributed across two different nodes so that no replica shares a node with its primary.
![title](img/es_shards_replicas.png)
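
The shard arithmetic can be sketched in a few lines. The function names are hypothetical helpers, not part of any Elasticsearch client.

```python
def total_shards(primaries, replicas):
    """Total shard copies: every primary plus `replicas` copies of it."""
    return primaries * (1 + replicas)


def min_nodes_for_full_allocation(replicas):
    """Copies of the same shard never share a node, so each copy needs its own node."""
    return 1 + replicas


# Defaults described above: five primary shards, one replica of each.
print(total_shards(5, 1))                # 10
print(min_nodes_for_full_allocation(1))  # 2
```

With a single node, the five replica shards stay unassigned, which is why a fresh one-node cluster typically reports yellow health.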

## First Steps with ElasticSearch and Kibana
ElasticSearch:
- Download and unzip Elasticsearch
- Install ElasticSearch

Kibana:
- Download and unzip Kibana
- Install and configure Kibana to connect to ElasticSearch

### Download and unzip Elasticsearch (latest version - 6.4.0)
![title](img/Elasticsearch-Logo-Color-H.png)
https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-6.4.0.tar.gz

### Install ElasticSearch (more info - https://www.elastic.co/downloads/elasticsearch)
<path/to/elasticsearch_directory>/bin/elasticsearch

### Test ElasticSearch with browser
http://localhost:9200/
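
The same check can be scripted. This is a minimal sketch that assumes the node listens on the default `http://localhost:9200/` without authentication. The helper names are hypothetical, and the sample body is an abridged illustration of the JSON the root endpoint returns.

```python
import json
import urllib.error
import urllib.request


def parse_root_response(body):
    """Extract the cluster name and version number from the root endpoint's JSON."""
    info = json.loads(body)
    return info["cluster_name"], info["version"]["number"]


def check_elasticsearch(url="http://localhost:9200/", timeout=5):
    """Return (cluster_name, version), or None if the node cannot be reached."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return parse_root_response(response.read())
    except (urllib.error.URLError, OSError):
        return None


# Abridged, illustrative response body (a real one contains more fields).
sample = b'{"cluster_name": "elasticsearch", "version": {"number": "6.4.0"}}'
print(parse_root_response(sample))
```

Calling `check_elasticsearch()` while the node is running should return the cluster name and version, and `None` when nothing answers on that port.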

### Download and unzip Kibana (latest version - 6.4.0)
![title](img/Kibana-Logo-Color-H.png)
https://artifacts.elastic.co/downloads/kibana/kibana-6.4.0-linux-x86_64.tar.gz

### Install and configure Kibana to connect to Elasticsearch (more info - https://www.elastic.co/downloads/kibana)
sudo nano <path/to/kibana_directory>/config/kibana.yml

elasticsearch.url: "http://localhost:9200"

<path/to/kibana_directory>/bin/kibana

### Test Kibana with browser
http://localhost:5601

# Load some data
- To load data, we will use fscrawler
- Download and unzip fscrawler
- Start fscrawler to create a job
- Configure the fscrawler _settings.json file
- Start fscrawler to run the job

### Download and unzip fscrawler (latest stable version - 2.5.0)
https://repo1.maven.org/maven2/fr/pilato/elasticsearch/crawler/fscrawler/2.5/

### fscrawler documentation
https://fscrawler.readthedocs.io/en/fscrawler-2.5

NOTE: FSCrawler 2.5 uses Tika 1.18 and Elasticsearch Rest Client 6.3.2.

### Start fscrawler
bin/fscrawler <job_name>

### Configure fscrawler _settings.json file
nano /path/to/.fscrawler/<job_name>/_settings.json

{
  "name" : "<job_name>",
  "fs" : {
    "url" : "<path/to/your/directory>",
    "update_rate" : "15m",
    "excludes" : [ "*/~*" ],
    "json_support" : false,
    "filename_as_id" : false,
    "add_filesize" : true,
    "remove_deleted" : true,
    "add_as_inner_object" : false,
    "store_source" : false,
    "index_content" : true,
    "indexed_chars" : "100%",
    "attributes_support" : false,
    "raw_metadata" : true,
    "xml_support" : false,
    "index_folders" : true,
    "lang_detect" : false,
    "continue_on_error" : false,
    "pdf_ocr" : true,
    "ocr" : {
      "language" : "eng"
    }
  },
  "elasticsearch" : {
    "nodes" : [ {
      "host" : "127.0.0.1",
      "port" : 9200,
      "scheme" : "HTTP"
    } ],
    "bulk_size" : 100,
    "flush_interval" : "5s",
    "byte_size" : "10mb"
  },
  "rest" : {
    "scheme" : "HTTP",
    "host" : "127.0.0.1",
    "port" : 8080,
    "endpoint" : "fscrawler"
  }
}
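
If the settings file needs to be generated rather than edited by hand, a sketch along these lines could write it. It uses only keys that appear in the settings above and assumes FSCrawler falls back to its defaults for the omitted ones. The function name, the job name `my_job`, and the `/tmp/docs` directory are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def write_settings(config_dir, job_name, docs_dir):
    """Write <config_dir>/<job_name>/_settings.json and return its path."""
    settings = {
        "name": job_name,
        "fs": {
            "url": str(docs_dir),      # directory to crawl
            "update_rate": "15m",      # how often the directory is rescanned
            "excludes": ["*/~*"],
        },
        "elasticsearch": {
            "nodes": [{"host": "127.0.0.1", "port": 9200, "scheme": "HTTP"}],
        },
    }
    job_dir = Path(config_dir) / job_name
    job_dir.mkdir(parents=True, exist_ok=True)
    path = job_dir / "_settings.json"
    path.write_text(json.dumps(settings, indent=2), encoding="utf-8")
    return path


# Demo in a throwaway directory; a real run would point at the .fscrawler directory.
with tempfile.TemporaryDirectory() as config_dir:
    settings_path = write_settings(config_dir, "my_job", "/tmp/docs")
    print(settings_path.read_text(encoding="utf-8"))
```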


### Start fscrawler job

<path/to/fscrawler_directory>/bin/fscrawler <job_name>

### Restart fscrawler job (if necessary)

<path/to/fscrawler_directory>/bin/fscrawler <job_name> --loop 1 --restart #loop is optional
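
To launch the job from a script, the command line above can be assembled programmatically. This sketch only builds the argument list and uses just the `--loop` and `--restart` flags shown above. The install directory `/opt/fscrawler-2.5` and the job name `my_job` are hypothetical.

```python
import shlex
from pathlib import PurePosixPath


def build_fscrawler_command(fscrawler_dir, job_name, loop=None, restart=False):
    """Build the argument list for starting (or restarting) an fscrawler job."""
    command = [str(PurePosixPath(fscrawler_dir) / "bin" / "fscrawler"), job_name]
    if loop is not None:
        command += ["--loop", str(loop)]   # optional: number of crawl runs
    if restart:
        command.append("--restart")        # restart the job from scratch
    return command


command = build_fscrawler_command("/opt/fscrawler-2.5", "my_job", loop=1, restart=True)
print(" ".join(shlex.quote(part) for part in command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` to start the crawler.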