# Unit J
# Search Database Model

- Examples From Video Lecture 


In [3]:
import pyspark
from pyspark.sql import SparkSession
# ELASTICSEARCH CONFIGURATION
elastic_host = "elasticsearch"
elastic_port = "9200"
spark = SparkSession.builder \
    .master("local") \
    .appName('jupyter-pyspark') \
    .config("spark.jars.packages","org.elasticsearch:elasticsearch-spark-20_2.12:7.15.0")\
    .config("spark.es.nodes", elastic_host) \
    .config("spark.es.port",elastic_port) \
    .getOrCreate()
sc = spark.sparkContext
sc.setLogLevel("ERROR")

## Elasticsearch and Kibana

### Elasticsearch REST API

- Use HTTP to add and query data in Elasticsearch.
- Open a terminal in Jupyter to run these commands from the Linux command prompt.

```
# add three documents

curl -X POST "http://elasticsearch:9200/people/students" -H 'Content-Type: application/json' -d '{ "name" : "mike", "major" : "math", "gpa" : 3.4 }'
curl -X POST "http://elasticsearch:9200/people/students" -H 'Content-Type: application/json' -d '{ "name" : "phil", "major" : "math", "gpa" : 3.2 }'
curl -X POST "http://elasticsearch:9200/people/students" -H 'Content-Type: application/json' -d '{ "name" : "pete", "major" : "bio", "gpa" : 3.7 }'

# Get a count of students (documents)

curl -X GET "http://elasticsearch:9200/_cat/count/people"

#  Find the math majors

curl -X GET  "http://elasticsearch:9200/people/_search?pretty&q=major:math"
```
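
The same requests can be prepared from Python. The following is a minimal sketch using only the standard library; the helper names `build_add_student`, `build_find_major`, and `send` are hypothetical and chosen for illustration. It assumes the same `elasticsearch:9200` host and port used in the curl commands.

```python
import json
import urllib.request

# Assumption: same host and port as in the curl examples above.
BASE_URL = "http://elasticsearch:9200"


def build_add_student(doc):
    """Prepare (but do not send) the POST request that adds one student document."""
    return urllib.request.Request(
        url=f"{BASE_URL}/people/students",
        data=json.dumps(doc).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_find_major(major):
    """Prepare the GET request for the URI search q=major:<major>."""
    return urllib.request.Request(
        url=f"{BASE_URL}/people/_search?pretty&q=major:{major}",
        method="GET",
    )


def send(request):
    """Send a prepared request and return the response body as text."""
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

With Elasticsearch reachable, `print(send(build_find_major("math")))` should return the same JSON as the last curl command.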

- Type this into your web browser to find students with a GPA of 3.3 or higher (the square-bracket range is inclusive).
- NOTE: The host is localhost here because the browser runs outside the Docker container.

```
http://localhost:9200/people/_search?pretty&q=gpa:[3.3 TO *]
```
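
A browser encodes the spaces and brackets in that URL for you, but a script has to do it explicitly. The sketch below shows one way to build the same range query with the standard library; the function name `range_search_url` and its default arguments are assumptions made for illustration.

```python
from urllib.parse import quote


def range_search_url(field, lower, host="localhost", port=9200, index="people"):
    """Build a URI search for documents where `field` is at least `lower`."""
    # Square brackets make the lower bound inclusive; * leaves the upper bound open.
    query = f"{field}:[{lower} TO *]"
    # quote() percent-encodes the colon, brackets, spaces, and asterisk.
    return f"http://{host}:{port}/{index}/_search?pretty&q={quote(query)}"


print(range_search_url("gpa", 3.3))
```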

### Kibana as a client of Elasticsearch
 
- Kibana serves as an Elasticsearch client.
- To view the people/students index, go to Management => Stack Management => Index Management
    - Summary shows how many documents are in the index.
    - Mappings shows the searchable fields discovered and their data types.
- To use the index in Kibana, we must create a Kibana => Index Pattern
    - There are text fields like major and keyword fields like major.keyword for aggregations.
    - You can add custom fields by editing the value, for example a `deanslist` field defined as `emit(doc['gpa'].value > 3.3)`
- Once your Index Pattern is created you can search: Analytics => Discover
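
The `deanslist` expression can also be tried outside Kibana. As a rough sketch, and assuming an Elasticsearch version that accepts `runtime_mappings` in a search request (7.11 or later), the same script could be sent as part of a search body. The function name `deanslist_search_body` is hypothetical, and the body would be POSTed to `people/_search`.

```python
import json


def deanslist_search_body(threshold=3.3):
    """Build a search body that defines `deanslist` at query time and filters on it."""
    return {
        # Runtime field computed per document from the indexed gpa value.
        "runtime_mappings": {
            "deanslist": {
                "type": "boolean",
                "script": {"source": f"emit(doc['gpa'].value > {threshold})"},
            }
        },
        # Keep only documents where the runtime field evaluates to true.
        "query": {"term": {"deanslist": True}},
    }


print(json.dumps(deanslist_search_body(), indent=2))
```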



## Tweet Simulator 

- Generate some fake tweets in quasi-real time as a streaming data source.
- Each tweet is posted to Elasticsearch and echoed to the console.
- This code will run until the user stops the notebook cell or the tweet limit is reached.

In [None]:
from simtweet import generateRandomTweet
import requests
from time import sleep
import random
import json
index = "tweets"
url = f"http://{elastic_host}:{elastic_port}/{index}"
headers = { "Content-Type" : "application/json" }
tweet_limit = 25
min_delay = 1
max_delay = 15

for i in range(tweet_limit):
    sleep(random.randint(min_delay, max_delay))
    tweet = generateRandomTweet()
    endpoint = f"{url}/_doc/{tweet['id']}"
    response = requests.post(endpoint, headers = headers, data = json.dumps(tweet))
    response.raise_for_status()
    print(f"curl -X POST {endpoint} -H 'Content-Type: application/json' \n\t-d '{json.dumps(tweet)}'")
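
The `simtweet` module is not shown in these notes. If it is unavailable, a stand-in along the following lines could be used to exercise the loop. Everything here is hypothetical: the function name `fake_tweet` and the fields other than `id` (the only key the loop above relies on) are assumptions, not the actual output of `generateRandomTweet`.

```python
import json
import random
import uuid
from datetime import datetime, timezone

# Hypothetical sample values, for illustration only.
USERS = ["mike", "phil", "pete"]
TOPICS = ["spark", "elasticsearch", "kibana"]


def fake_tweet(rng=random):
    """Return a JSON-serializable dict with a unique id, usable as a document."""
    return {
        "id": uuid.uuid4().hex,  # used as the document id in the endpoint URL
        "user": rng.choice(USERS),
        "text": f"Learning about {rng.choice(TOPICS)} today",
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }


print(json.dumps(fake_tweet()))
```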
    

## Spark Elasticsearch

The Elasticsearch Spark connector supports:

 - Writing Spark DataFrames to an ES index
 - Reading an entire ES index into a Spark DataFrame
 - It does NOT support ES Queries.
 - It does not handle nested schemas without creating a custom mapping in Elasticsearch
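
One way to sidestep the nested-schema limitation is to flatten records before they reach Spark. This is a different technique from creating a custom mapping, shown only as a minimal sketch in plain Python; the `flatten` helper and the sample record are hypothetical.

```python
def flatten(record, parent_key="", sep="_"):
    """Collapse nested dicts into one level, joining key names with `sep`."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested objects, carrying the prefix along.
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


# Hypothetical nested record, for illustration only.
nested = {"id": 1, "user": {"name": "mike", "geo": {"lat": 43.0}}}
print(flatten(nested))
```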

In [77]:
fm = spark.read.option("inferSchema",True).option("header",True).csv("file:///home/jovyan/datasets/fudgemart/fudgemart-order-details.csv")
fm.printSchema()

root
 |-- customer_id: integer (nullable = true)
 |-- customer_email: string (nullable = true)
 |-- customer_name: string (nullable = true)
 |-- customer_address: string (nullable = true)
 |-- customer_city: string (nullable = true)
 |-- customer_state: string (nullable = true)
 |-- customer_zip: integer (nullable = true)
 |-- order_id: integer (nullable = true)
 |-- order_date: string (nullable = true)
 |-- creditcard_number: string (nullable = true)
 |-- creditcard_exp_date: string (nullable = true)
 |-- order_total: double (nullable = true)
 |-- ship_via: string (nullable = true)
 |-- shipped_date: string (nullable = true)
 |-- product_id: integer (nullable = true)
 |-- order_item_id: integer (nullable = true)
 |-- order_qty: integer (nullable = true)
 |-- product_name: string (nullable = true)
 |-- product_retail_price: double (nullable = true)



In [81]:
fm.write.mode("Overwrite").format("es").save("fm-order-details/_doc")

### NOTE: Wait a minute for Elasticsearch to catch up!!!!

- Don't read until the mapping is created.
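
Instead of waiting a fixed amount of time, the notebook could poll until the index is visible. The sketch below uses only the standard library; `index_exists` and `wait_until` are hypothetical helper names, and the default host and port are assumed to match the configuration at the top of the notebook. Note that the index existing does not by itself guarantee every document is searchable yet.

```python
import time
import urllib.request


def index_exists(index, host="elasticsearch", port=9200):
    """Return True if a HEAD request for the index answers with HTTP 200."""
    request = urllib.request.Request(f"http://{host}:{port}/{index}", method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=5) as response:
            return response.status == 200
    except OSError:
        # Covers HTTP errors such as 404 as well as connection failures.
        return False


def wait_until(check, timeout=60.0, interval=2.0):
    """Call check() repeatedly until it returns True or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

For example, `wait_until(lambda: index_exists("fm-order-details"))` could be run before the read cell below.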

In [84]:
fm2 = spark.read.format("es").load("fm-order-details/_doc")
fm2.show()

+--------------------+-------------------+--------------------+-------------+--------------------+-----------+----------------+--------------+------------+--------------------+--------+-------------+---------+-----------+----------+--------------------+--------------------+--------------+--------------------+
| creditcard_exp_date|  creditcard_number|    customer_address|customer_city|      customer_email|customer_id|   customer_name|customer_state|customer_zip|          order_date|order_id|order_item_id|order_qty|order_total|product_id|        product_name|product_retail_price|      ship_via|        shipped_date|
+--------------------+-------------------+--------------------+-------------+--------------------+-----------+----------------+--------------+------------+--------------------+--------+-------------+---------+-----------+----------+--------------------+--------------------+--------------+--------------------+
|2013-12-31 00:00:...| 644167 329790 0456|     2508 W Shaw Ave|    