# (Lab 1) Installing and Running Elasticsearch

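## Lab Overview
1) Purpose <br>
  In this lab, you install and run Elasticsearch, one of the most widely used search engine solutions. <br>
  After starting it as a daemon, you run a minimal indexing and search test to confirm that it works correctly. <br>
2) Learning objectives
  * Install Elasticsearch in a notebook environment.
  * Start Elasticsearch.
  * Run simple indexing and search commands.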

### Lab Contents
* 1. Installing Elasticsearch
* 2. Starting Elasticsearch
* 3. Running Indexing and Search Commands

### Dataset Overview
* Dataset: cnn_dailymail
* Description: about 287,000 CNN and Daily Mail news articles (train split)

## 1. Installing Elasticsearch

In [None]:
# Install the Elasticsearch Python client package
!pip install elasticsearch==8.8.0

Collecting elasticsearch==8.8.0
  Downloading elasticsearch-8.8.0-py3-none-any.whl (393 kB)
Collecting elastic-transport<9,>=8 (from elasticsearch==8.8.0)
  Downloading elastic_transport-8.12.0-py3-none-any.whl (59 kB)
Installing collected packages: elastic-transport, elasticsearch
Successfully installed elastic-transport-8.12.0 elasticsearch-8.8.0


In [None]:
# Download and extract Elasticsearch 8.8.0

# Download the Linux package used to install the Elasticsearch server
!wget -q https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-8.8.0-linux-x86_64.tar.gz
!tar -xzf elasticsearch-8.8.0-linux-x86_64.tar.gz
!ls elasticsearch-8.8.0/

bin  config  jdk  lib  LICENSE.txt  logs  modules  NOTICE.txt  plugins	README.asciidoc


In [None]:
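# The server will run in the Colab notebook as a background process owned by the daemon user (UID 1),
# so change the owner of the folder to give that user access
!sudo chown -R daemon:daemon elasticsearch-8.8.0/

# Run the commands below for resource limiting/isolation (cgroups) when running the server in the Colab notebook environment
!umount /sys/fs/cgroup
!apt install -y cgroup-tools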

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following additional packages will be installed:
  libcgroup1
The following NEW packages will be installed:
  cgroup-tools libcgroup1
0 upgraded, 2 newly installed, 0 to remove and 30 not upgraded.
Need to get 121 kB of archives.
After this operation, 435 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu jammy/universe amd64 libcgroup1 amd64 2.0-2 [49.8 kB]
Get:2 http://archive.ubuntu.com/ubuntu jammy/universe amd64 cgroup-tools amd64 2.0-2 [70.8 kB]
Fetched 121 kB in 1s (222 kB/s)
Selecting previously unselected package libcgroup1:amd64.
(Reading database ... 121658 files and directories currently installed.)
Preparing to unpack .../libcgroup1_2.0-2_amd64.deb ...
Unpacking libcgroup1:amd64 (2.0-2) ...
Selecting previously unselected package cgroup-tools.
Preparing to unpack .../cgroup-tools_2.0-2_amd64.deb ...
Unpacking cgroup-tools (2.0-2) ...
Setting up

## 2. Starting Elasticsearch

In [None]:
# Start Elasticsearch as a daemon process
import os
from elasticsearch import Elasticsearch, helpers
import numpy as np
import pandas as pd
import json
from subprocess import Popen, PIPE, STDOUT

es_server = Popen(['elasticsearch-8.8.0/bin/elasticsearch'],
                  stdout=PIPE, stderr=STDOUT,
                  preexec_fn=lambda: os.setuid(1)  # as daemon
                 )

# The instance takes some time to load
import time
time.sleep(30)
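
A fixed 30-second wait can be too short on a slow machine and longer than necessary on a fast one. As a hedged alternative, the sketch below polls the TCP port until something accepts a connection. The helper name `wait_for_port` and its parameters are assumptions made for illustration and are not part of the original notebook. An open port only shows that the process is listening; the cluster may still need a few more seconds before it answers requests.

```python
import socket
import time


def wait_for_port(host, port, timeout=60.0, interval=1.0):
    """Return True once a TCP connection to (host, port) succeeds, False on timeout.

    Hypothetical helper: polls instead of sleeping for a fixed time.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Nothing is listening yet (or the connection timed out); retry shortly.
            time.sleep(interval)
    return False
```

In this notebook it could replace the fixed sleep with `wait_for_port("localhost", 9200, timeout=120)`, assuming the server listens on the default port 9200.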

In [None]:
# Check that the daemon is running (there should be three processes owned by daemon)
!ps -ef | grep elasticsearch

daemon       867     178 52 08:29 ?        00:00:22 /content/elasticsearch-8.8.0/jdk/bin/java -Xms4m
daemon       991     867 99 08:29 ?        00:00:42 /content/elasticsearch-8.8.0/jdk/bin/java -Des.n
daemon      1035     991  0 08:29 ?        00:00:00 /content/elasticsearch-8.8.0/modules/x-pack-ml/p
root        1124     178  0 08:30 ?        00:00:00 /bin/bash -c ps -ef | grep elasticsearch
root        1126    1124  0 08:30 ?        00:00:00 grep elasticsearch
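
The check above is done by eye, and the `grep` command itself also shows up in the listing. A minimal sketch of the same check in Python follows, assuming the `ps -ef` text has already been captured as a string. The function name `count_user_processes` is hypothetical.

```python
def count_user_processes(ps_output, user, keyword):
    """Count `ps -ef` lines owned by `user` whose command contains `keyword`."""
    count = 0
    for line in ps_output.splitlines():
        # ps -ef columns: UID PID PPID C STIME TTY TIME CMD (CMD may contain spaces)
        fields = line.split(None, 7)
        if len(fields) == 8 and fields[0] == user and keyword in fields[7]:
            count += 1
    return count
```

Because only rows owned by `daemon` are counted, the `grep` processes owned by `root` are ignored. In the notebook the text could come from `subprocess.run(["ps", "-ef"], capture_output=True, text=True).stdout`.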


In [None]:
# After the daemon starts, the passwords must be set
# When prompted with "Please confirm that you would like to continue", enter y
!/content/elasticsearch-8.8.0/bin/elasticsearch-setup-passwords auto -url "https://localhost:9200"

******************************************************************************
Note: The 'elasticsearch-setup-passwords' tool has been deprecated. This       command will be removed in a future release.
******************************************************************************

Initiating the setup of passwords for reserved users elastic,apm_system,kibana,kibana_system,logstash_system,beats_system,remote_monitoring_user.
The passwords will be randomly generated and printed to the console.
Please confirm that you would like to continue [y/N]y


Changed password for user apm_system
PASSWORD apm_system = 8QD8qH0yiy7s9W2yRh5W

Changed password for user kibana_system
PASSWORD kibana_system = Qt0dJ8EBQU8Aofg9T1lN

Changed password for user kibana
PASSWORD kibana = Qt0dJ8EBQU8Aofg9T1lN

Changed password for user logstash_system
PASSWORD logstash_system = QLcW4iHCAcGA57nOHyxZ

Changed password for user beats_system
PASSWORD beats_system = ih2cTZJlWhDpvqXv8JCt

Changed password for user rem
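
Copying the generated password by hand is error-prone. The sketch below extracts every `PASSWORD <user> = <value>` line from the tool's output into a dictionary. It assumes the output text has been captured as a string, which this notebook does not do. The names `PASSWORD_LINE` and `parse_generated_passwords` are hypothetical.

```python
import re

# Matches lines such as "PASSWORD apm_system = 8QD8qH0yiy7s9W2yRh5W".
PASSWORD_LINE = re.compile(r"^PASSWORD\s+(\S+)\s+=\s+(\S+)\s*$", re.MULTILINE)


def parse_generated_passwords(tool_output):
    """Return a {user: password} dict parsed from the setup-passwords output."""
    return {user: password for user, password in PASSWORD_LINE.findall(tool_output)}
```

With the output in a variable named, for example, `tool_output`, the value for the `elastic` user would be `parse_generated_passwords(tool_output)["elastic"]`.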

In [None]:
username = 'elastic'

# Replace with the PASSWORD elastic value printed near the end of the output of the command above
password = '15Fg7u0gPf393wgNAnVd'

es = Elasticsearch(['https://localhost:9200'], basic_auth=(username, password), ca_certs="/content/elasticsearch-8.8.0/config/certs/http_ca.crt")

resp = dict(es.info())

resp

{'name': '78b526436184',
 'cluster_name': 'elasticsearch',
 'cluster_uuid': '5xYegAVaT9C1Icui0c66-g',
 'version': {'number': '8.8.0',
  'build_flavor': 'default',
  'build_type': 'tar',
  'build_hash': 'c01029875a091076ed42cdb3a41c10b1a9a5a20f',
  'build_date': '2023-05-23T17:16:07.179039820Z',
  'build_snapshot': False,
  'lucene_version': '9.6.0',
  'minimum_wire_compatibility_version': '7.17.0',
  'minimum_index_compatibility_version': '7.0.0'},
 'tagline': 'You Know, for Search'}

## 3. Running Indexing and Search Commands

In [None]:
# Install the Hugging Face datasets package to load the dataset
!pip install datasets -q

In [None]:
# Download and load the dataset
import datasets
dataset = datasets.load_dataset("cnn_dailymail", "3.0.0", split="train").to_pandas()
dataset.drop("id", axis=1, inplace=True)
print(f"shape of dataset: {dataset.shape}")
dataset.head()

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



shape of dataset: (287113, 2)


| | article | highlights |
|---|---|---|
| 0 | LONDON, England (Reuters) -- Harry Potter star... | Harry Potter star Daniel Radcliffe gets £20M f... |
| 1 | Editor's note: In our Behind the Scenes series... | Mentally ill inmates in Miami are housed on th... |
| 2 | MINNEAPOLIS, Minnesota (CNN) -- Drivers who we... | NEW: "I thought I was going to die," driver sa... |
| 3 | WASHINGTON (CNN) -- Doctors removed five small... | Five small polyps found during procedure; "non... |
| 4 | (CNN) -- The National Football League has ind... | NEW: NFL chief, Atlanta Falcons owner critical... |


In [None]:
# Define the settings and mappings of the Elasticsearch index
settings = {
    "settings":{
        "number_of_shards":1,
        "number_of_replicas":0
    },
    "mappings":{
        "properties":{
            "article":{
                "type":"text"
            },
            "highlights":{
                "type":"text"
            }
        }
    }
}

In [None]:
def json_formatter(dataset, index_name):
    """
    Build the list of JSON-style dictionaries (bulk actions) used for Elasticsearch indexing.

    Args:
      dataset: pandas DataFrame whose rows are the documents to index.
      index_name: name of the Elasticsearch index.

    Returns:
      A list of dicts, one per row, each with an '_index' and a '_source' key.
    """
    actions = []
    for _, row in dataset.iterrows():
        actions.append({
            '_index': index_name,
            # Every column of the row becomes a field of the document.
            '_source': {column: row[column] for column in dataset.columns},
        })
    return actions
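
`json_formatter` builds the whole list of actions in memory before anything is sent. For larger subsets of the dataset, a generator can hand actions to the bulk helper one at a time. The following is a minimal sketch, assuming the rows are available as plain dictionaries (for a pandas DataFrame, `dataset.to_dict("records")` produces them). The name `generate_actions` is hypothetical.

```python
def generate_actions(records, index_name):
    """Yield one bulk action per record (a dict mapping field name to value)."""
    for record in records:
        # Copy the record so later changes to the input do not affect the action.
        yield {"_index": index_name, "_source": dict(record)}
```

Since `helpers.bulk` accepts any iterable of actions, the generator could be passed to it directly, for example `helpers.bulk(es, generate_actions(dataset.to_dict("records"), "news_index"))`.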

In [None]:
# Create the index; the 8.x client takes settings and mappings as keyword arguments (body= is deprecated)
MY_INDEX = es.indices.create(index="news_index", settings=settings["settings"], mappings=settings["mappings"])
MY_INDEX


ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'news_index'})

In [None]:
# The full dataset is large, so use only the first 100 rows
dataset = dataset[:100]

json_Formatted_dataset = json_formatter(dataset=dataset, index_name='news_index')
json_Formatted_dataset[0]

{'_index': 'news_index',
 '_source': {'article': 'LONDON, England (Reuters) -- Harry Potter star Daniel Radcliffe gains access to a reported £20 million ($41.1 million) fortune as he turns 18 on Monday, but he insists the money won\'t cast a spell on him. Daniel Radcliffe as Harry Potter in "Harry Potter and the Order of the Phoenix" To the disappointment of gossip columnists around the world, the young actor says he has no plans to fritter his cash away on fast cars, drink and celebrity parties. "I don\'t plan to be one of those people who, as soon as they turn 18, suddenly buy themselves a massive sports car collection or something similar," he told an Australian interviewer earlier this month. "I don\'t think I\'ll be particularly extravagant. "The things I like buying are things that cost about 10 pounds -- books and CDs and DVDs." At 18, Radcliffe will be able to gamble in a casino, buy a drink in a pub or see the horror film "Hostel: Part II," currently six places below his numbe

In [None]:
# Use the elasticsearch.helpers API for indexing
res = helpers.bulk(es, json_Formatted_dataset[:100])
res

(100, [])
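
The tuple returned by `helpers.bulk` holds the number of successfully executed actions and a list of errors. With the default settings a failed action raises an exception, so the error list is only populated when `raise_on_error=False` is passed. The sketch below turns the result into an explicit check. The function name `check_bulk_result` is an assumption made for illustration.

```python
def check_bulk_result(result, expected_count):
    """Validate the (success_count, errors) tuple returned by helpers.bulk."""
    success_count, errors = result
    if errors:
        raise RuntimeError(
            f"{len(errors)} bulk action(s) failed; first error: {errors[0]}"
        )
    if success_count != expected_count:
        raise RuntimeError(
            f"indexed {success_count} documents, expected {expected_count}"
        )
    return success_count
```

For the run above, `check_bulk_result(res, 100)` would return 100, because all 100 documents were indexed and the error list is empty.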

In [None]:
# Fetch a sample of 10 of the indexed documents
response = es.search(
    index="news_index",
    size=10,
    # match_all matches every document (each _score is 1.0)
    query={"match_all": {}}
)

output = pd.json_normalize(response['hits']['hits'])
output


| | _index | _id | _score | _source.article | _source.highlights |
|---|---|---|---|---|---|
| 0 | news_index | QrR1NY0Bkd_aQ6hN-Fe6 | 1.0 | LONDON, England (Reuters) -- Harry Potter star... | Harry Potter star Daniel Radcliffe gets £20M f... |
| 1 | news_index | Q7R1NY0Bkd_aQ6hN-Fe8 | 1.0 | Editor's note: In our Behind the Scenes series... | Mentally ill inmates in Miami are housed on th... |
| 2 | news_index | RLR1NY0Bkd_aQ6hN-Fe8 | 1.0 | MINNEAPOLIS, Minnesota (CNN) -- Drivers who we... | NEW: "I thought I was going to die," driver sa... |
| 3 | news_index | RbR1NY0Bkd_aQ6hN-Fe8 | 1.0 | WASHINGTON (CNN) -- Doctors removed five small... | Five small polyps found during procedure; "non... |
| 4 | news_index | RrR1NY0Bkd_aQ6hN-Fe8 | 1.0 | (CNN) -- The National Football League has ind... | NEW: NFL chief, Atlanta Falcons owner critical... |
| 5 | news_index | R7R1NY0Bkd_aQ6hN-Fe8 | 1.0 | BAGHDAD, Iraq (CNN) -- Dressed in a Superman s... | Parents beam with pride, can't stop from smili... |
| 6 | news_index | SLR1NY0Bkd_aQ6hN-Fe8 | 1.0 | BAGHDAD, Iraq (CNN) -- The women are too afrai... | Aid workers: Violence, increased cost of livin... |
| 7 | news_index | SbR1NY0Bkd_aQ6hN-Fe8 | 1.0 | BOGOTA, Colombia (CNN) -- A key rebel commande... | Tomas Medina Caracas was a fugitive from a U.S... |
| 8 | news_index | SrR1NY0Bkd_aQ6hN-Fe8 | 1.0 | WASHINGTON (CNN) -- White House press secretar... | President Bush says Tony Snow "will battle can... |
| 9 | news_index | S7R1NY0Bkd_aQ6hN-Fe8 | 1.0 | (CNN) -- Police and FBI agents are investigati... | Empty anti-tank weapon turns up in front of Ne... |
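
`match_all` returns every document with the same score, so it confirms that indexing worked but does not exercise full-text search. As a hedged next step, the sketch below builds the arguments for a `match` query, which analyzes the search text and ranks documents by relevance. The helper name `build_match_search` is hypothetical, and the field name and search text in the usage note are only examples.

```python
def build_match_search(field, text, size=10):
    """Return keyword arguments for es.search() that run a full-text match query."""
    return {
        "size": size,
        # A match query analyzes `text` and scores documents by relevance on `field`.
        "query": {"match": {field: text}},
    }
```

With the client from this notebook, a call could look like `es.search(index="news_index", **build_match_search("article", "Harry Potter", size=5))`. Unlike the `match_all` result, the `_score` values would then differ from document to document.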

