# Unit J
# Search Database Model

- Examples From Video Lecture 


In [1]:
import pyspark
from pyspark.sql import SparkSession
# ELASTICSEARCH CONFIGURATION
elastic_host = "elasticsearch"
elastic_port = "9200"
spark = SparkSession.builder \
    .master("local") \
    .appName('jupyter-pyspark') \
    .config("spark.jars.packages","org.elasticsearch:elasticsearch-spark-30_2.12:8.17.0")\
    .config("spark.es.nodes", elastic_host) \
    .config("spark.es.port",elastic_port) \
    .getOrCreate()
sc = spark.sparkContext
sc.setLogLevel("ERROR")


## Elasticsearch and Kibana

### Elasticsearch REST API

- Use HTTP to add and query data in Elasticsearch.
- Open a terminal in Jupyter to run these commands from the Linux command prompt.

```
# add three documents

curl -X POST "http://elasticsearch:9200/people/students" -H 'Content-Type: application/json' -d '{ "name" : "mike", "major" : "math", "gpa" : 3.4 }'
curl -X POST "http://elasticsearch:9200/people/students" -H 'Content-Type: application/json' -d '{ "name" : "phil", "major" : "math", "gpa" : 3.2 }'
curl -X POST "http://elasticsearch:9200/people/students" -H 'Content-Type: application/json' -d '{ "name" : "pete", "major" : "bio", "gpa" : 3.7 }'

# Get a count of students (documents)

curl -X GET "http://elasticsearch:9200/_cat/count/people"

#  Find the math majors

curl -X GET  "http://elasticsearch:9200/people/_search?pretty&q=major:math"
```
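
The same three kinds of request can also be built from Python. The sketch below is a minimal illustration using only the standard library's `urllib.request`. The helper names `index_request` and `search_request` are hypothetical, and the host, port, and `people/students` path are assumed to match the curl commands above. The helpers only build request objects; nothing is sent until one is passed to `urllib.request.urlopen`.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "http://elasticsearch:9200"  # host and port from the curl commands above


def index_request(path, doc, base_url=BASE_URL):
    """Build (but do not send) a POST that adds one JSON document."""
    return urllib.request.Request(
        url=f"{base_url}/{path}",
        data=json.dumps(doc).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def search_request(index, query, base_url=BASE_URL):
    """Build (but do not send) a GET for a query-string search such as major:math."""
    params = urllib.parse.urlencode({"pretty": "true", "q": query})
    return urllib.request.Request(
        url=f"{base_url}/{index}/_search?{params}", method="GET"
    )


students = [
    {"name": "mike", "major": "math", "gpa": 3.4},
    {"name": "phil", "major": "math", "gpa": 3.2},
    {"name": "pete", "major": "bio", "gpa": 3.7},
]
add_requests = [index_request("people/students", s) for s in students]
find_math = search_request("people", "major:math")

# To actually send a request (needs a reachable Elasticsearch):
# with urllib.request.urlopen(find_math) as resp:
#     print(resp.read().decode("utf-8"))
```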

- Type this into your web browser to find students with a GPA of 3.3 or higher (the `[3.3 TO *]` range includes 3.3).
- NOTE: The host is `localhost` now because the browser runs outside the Docker container.

```
http://localhost:9200/people/_search?pretty&q=gpa:[3.3 TO *]
```
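
A browser will usually encode the spaces and brackets in that URL for you, but code that builds the URL has to do it explicitly. The sketch below shows one way with `urllib.parse`; the helper name `range_query_url` is hypothetical. It also shows the difference between an inclusive range (square bracket) and an exclusive one (curly brace) in the query-string syntax.

```python
from urllib.parse import quote, urlencode


def range_query_url(field, low, inclusive=True,
                    base="http://localhost:9200/people/_search"):
    """Return a search URL for field >= low (inclusive) or field > low (exclusive)."""
    opening = "[" if inclusive else "{"
    query = f"{field}:{opening}{low} TO *]"
    # quote_via=quote encodes spaces as %20 rather than '+'
    params = urlencode({"pretty": "true", "q": query}, quote_via=quote)
    return f"{base}?{params}"


at_least = range_query_url("gpa", 3.3)                         # gpa >= 3.3
strictly_above = range_query_url("gpa", 3.3, inclusive=False)  # gpa > 3.3
print(at_least)
print(strictly_above)
```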

### Kibana as a client of Elasticsearch
 
- Kibana serves as an Elasticsearch client.
- To view the people/students index, go to Management => Stack Management => Index Management.
    - Summary shows how many documents are in the index.
    - Mappings shows the searchable fields that were discovered and their data types.
- To use the index in Kibana, we must create a Kibana => Index Pattern.
    - There are text fields like `major` and keyword fields like `major.keyword` for aggregations.
    - You can add custom fields by editing the value, for example a `deanslist` field defined as `emit(doc['gpa'].value > 3.3)`.
- Once your Index Pattern is created, you can search: Analytics => Discover.
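
To make the `deanslist` expression concrete, the sketch below mirrors it in plain Python for the three students added earlier. The `deanslist` function is a Python stand-in, not Painless. The `runtime_mapping` dict shows how the same script might be written as a runtime field in an index mapping body; that layout is an assumption here and should be checked against your Elasticsearch version.

```python
import json

students = [
    {"name": "mike", "major": "math", "gpa": 3.4},
    {"name": "phil", "major": "math", "gpa": 3.2},
    {"name": "pete", "major": "bio", "gpa": 3.7},
]


def deanslist(doc):
    """Python mirror of the Painless expression doc['gpa'].value > 3.3."""
    return doc["gpa"] > 3.3


# Which students would the custom field flag?
flags = {s["name"]: deanslist(s) for s in students}
print(flags)

# Assumed shape of a runtime field definition carrying the same script.
runtime_mapping = {
    "runtime": {
        "deanslist": {
            "type": "boolean",
            "script": {"source": "emit(doc['gpa'].value > 3.3)"},
        }
    }
}
print(json.dumps(runtime_mapping, indent=2))
```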



## Tweet Simulator 

- Generate some fake tweets in quasi-real time as a streaming data source.
- Each tweet is posted to Elasticsearch and echoed to the console.
- This code runs until the user stops the notebook cell or the tweet limit is reached.

In [3]:
from simtweet import generateRandomTweet
import requests
from time import sleep
import random
import json
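# Target index and base URL (elastic_host and elastic_port come from the first cell)
index = "tweets"
url = f"http://{elastic_host}:{elastic_port}/{index}"
headers = {"Content-Type": "application/json"}
tweet_limit = 100  # stop after this many tweets
min_delay = 1      # shortest pause between tweets, in seconds
max_delay = 5      # longest pause between tweets, in seconds

for i in range(tweet_limit):
    # Pause a random whole number of seconds to mimic a live stream
    sleep(random.randint(min_delay, max_delay))
    tweet = generateRandomTweet()
    # Use the tweet id as the Elasticsearch document id
    endpoint = f"{url}/_doc/{tweet['id']}"
    response = requests.post(endpoint, headers=headers, data=json.dumps(tweet))
    response.raise_for_status()  # stop on an HTTP error
    # Echo the equivalent curl command to the console
    print(f"curl -X POST {endpoint} -H 'Content-Type: application/json' \n\t-d '{json.dumps(tweet)}'")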
    

curl -X POST http://elasticsearch:9200/tweets/_doc/1463065144848886549 -H 'Content-Type: application/json' 
	-d '{"id": "1463065144848886549", "timestamp": [2023, 4, 10, 17, 51, 55, 0, 100, 0], "timestamp_format": "2023-04-10T17:51:55", "date": "2023-04-10", "time": "17:51:55", "user": "tanott", "lat": 29.38, "lon": -94.84, "text": "Hey fudgemart, why is your support so bad? #upset", "sentiment": "negative", "mentions": [], "hashtags": ["#upset"]}'
curl -X POST http://elasticsearch:9200/tweets/_doc/4537573983336726708 -H 'Content-Type: application/json' 
	-d '{"id": "4537573983336726708", "timestamp": [2023, 4, 10, 17, 52, 0, 0, 100, 0], "timestamp_format": "2023-04-10T17:52:00", "date": "2023-04-10", "time": "17:52:00", "user": "afresco", "lat": 33.38, "lon": -96.79, "text": "Just got some electronics from #fudgemart. Awesome!", "sentiment": "positive", "mentions": [], "hashtags": ["#fudgemart."]}'
curl -X POST http://elasticsearch:9200/tweets/_doc/1677081450422603036 -H 'Content-Type
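
The `simtweet` module is not shown in this notebook. If it is unavailable, a stand-in that produces documents of the same shape is enough to exercise the loop. The sketch below is hypothetical: `fake_tweet`, the sample users, and the sample texts are invented for illustration, and only the key names and value formats are taken from the echoed output above.

```python
import json
import random
import time

# Invented sample data; the real simtweet module may draw from different values.
USERS = ["user_a", "user_b", "user_c"]
TEXTS = [
    ("Great service today! #happy", "positive"),
    ("Still waiting on my order. #upset", "negative"),
]


def fake_tweet(epoch_seconds=None):
    """Return a dict with the same keys as the tweets echoed above."""
    now = time.gmtime(epoch_seconds)  # None means the current time
    text, sentiment = random.choice(TEXTS)
    words = text.split()
    return {
        "id": str(random.randint(10**18, 10**19 - 1)),
        "timestamp": list(now),  # nine fields, shaped like time.struct_time
        "timestamp_format": time.strftime("%Y-%m-%dT%H:%M:%S", now),
        "date": time.strftime("%Y-%m-%d", now),
        "time": time.strftime("%H:%M:%S", now),
        "user": random.choice(USERS),
        "lat": round(random.uniform(25.0, 49.0), 2),
        "lon": round(random.uniform(-124.0, -67.0), 2),
        "text": text,
        "sentiment": sentiment,
        "mentions": [w for w in words if w.startswith("@")],
        "hashtags": [w for w in words if w.startswith("#")],
    }


sample = fake_tweet(1681149115)  # 2023-04-10T17:51:55 UTC
print(json.dumps(sample))
```

Swapping `generateRandomTweet()` for `fake_tweet()` in the loop leaves the rest of the cell unchanged.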

## Spark Elasticsearch

The Elasticsearch Spark connector:

 - Supports writing Spark DataFrames to an ES index.
 - Supports reading an entire ES index into a Spark DataFrame.
 - Does NOT support ES queries.
 - Does not handle nested schemas without creating a custom mapping in Elasticsearch.
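
One way around the nested-schema limitation, as an alternative to a custom mapping, is to flatten nested records into top-level columns before writing. The sketch below is a plain-Python illustration of that idea, not something the connector provides. The `flatten` helper, the `_` separator, and the sample record are all assumptions.

```python
def flatten(record, parent_key="", sep="_"):
    """Flatten nested dicts into one level, joining key paths with sep."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested dicts, carrying the key path along
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


# Invented sample record with two levels of nesting
order = {
    "order_id": 1,
    "customer": {"city": "Fresno", "geo": {"lat": 36.7, "lon": -119.8}},
}
flat_order = flatten(order)
print(flat_order)
```

A list of such flat dicts can then be turned into a DataFrame and written with the same `format("es")` call used below.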

In [None]:
fm = spark.read.option("inferSchema",True).option("header",True).csv("file:///home/jovyan/datasets/fudgemart/fudgemart-order-details.csv")
fm.printSchema()

In [None]:
fm.show(5)

In [None]:
fm.write.mode("Overwrite").format("es").save("fm-order-details")

### NOTE: Wait a minute for Elasticsearch to catch up!!!!

- Don't read until the mapping is created.
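
Instead of waiting a fixed amount of time, the notebook could poll until the index reports the expected number of documents. A minimal sketch, assuming a caller-supplied `get_count` function (hypothetical) that returns the current document count, for example by wrapping an HTTP call to the `_cat/count` endpoint used earlier:

```python
import time


def wait_for_count(get_count, expected, timeout=60.0, interval=2.0, sleep=time.sleep):
    """Poll get_count() until it returns at least `expected`.

    Returns True on success, or False once `timeout` seconds of waiting have passed.
    """
    waited = 0.0
    while True:
        if get_count() >= expected:
            return True
        if waited >= timeout:
            return False
        sleep(interval)
        waited += interval


# Demo with a fake counter standing in for Elasticsearch: reports 0, 0, then 10.
fake_counts = iter([0, 0, 10])
ready = wait_for_count(lambda: next(fake_counts), expected=10, sleep=lambda s: None)
print(ready)
```

The `sleep` parameter exists only so the demo and any tests can skip the real delay.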

In [None]:
fm2 = spark.read.format("es").load("fm-order-details")
fm2.show()

In [None]:
query = fm2.select("product_name","product_retail_price","order_qty").where("product_name='Steam Iron'")

In [None]:
query.explain()

In [None]:
# Filters are pushed down into elastic!!
fm2 = spark.read.format("es").load("fm-order-details")
fm2.where("customer_city='Fresno' and order_total<500").explain()

In [None]:
fm2.where("customer_city='Fresno' and order_total<500").toPandas()
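
Push-down means the `where` clause is translated into an Elasticsearch query, so only matching documents are sent back to Spark. As a rough illustration only, the filter above corresponds in intent to a query body like the dict below. This is hand-written Query DSL, not the connector's actual output, which `explain()` reports in its own format.

```python
import json

# Hand-written approximation of: customer_city='Fresno' and order_total<500
pushed_down = {
    "query": {
        "bool": {
            "filter": [
                {"match": {"customer_city": "Fresno"}},
                {"range": {"order_total": {"lt": 500}}},
            ]
        }
    }
}
print(json.dumps(pushed_down, indent=2))
```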
