<a href="https://colab.research.google.com/github/moyeed/Elasticsearch/blob/main/elastic_search_spotlight.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

---
#CSCI-642: Information Storage and Retrival
##Spotlight: ElasticSearch
##Authors: 

> **1) Mohammed Abdul Moyeed(Z1912165)**

> **2) Omer Bin Ali Bajubair(Z1905006)**

---



# Introduction
* Elastic  Search is a powerful text retrival and analysis tool.
* While most of the databases are enough for storing structured data, Elastic search not only helps in storing structured, unstructed or any type of huge data efficiently but also provides scalable search, various filtering, sharding and visualization abilities in real time.
* It also maintains powerful API, that allows to acces elastic search from any device and language.
* ElasticSearch is an interface to Lucene(an open source java library for text search) designed for BIGDATA from ground to up, including all the features of Lucene out of box.
* Elastic Search can be accessed in diferent ways, 

>> 1) Download to local machine and run it on local server when ever needed.

>> 2) Create an Insatnce on cloud services like AWS, AZURE, etc. and access it on their dashboard.

>> 3) Create a cloud server on elastic search website itself(we have followed this for simplicity and easy access)






# Getting Started : Connecting to our ElasticSearch Instance.

## Install elasticSearch python library.

In [1]:
%run -m pip install elasticsearch

Collecting elasticsearch
  Downloading elasticsearch-7.15.1-py2.py3-none-any.whl (378 kB)
Installing collected packages: elasticsearch
Successfully installed elasticsearch-7.15.1


## Connection to Elastic Search Insatnce: There are many ways depedning on how you are using the elastic search, below mentioned are few of them.

> 1) If you have elastic search downloaded to local machine and it is running, you can use "connection_with_local_server method after running the elastic insatnce on oyur machine and providng correct port.

> 2) You can connect using the API key or cloud_id if the server is instatiated on cloud using any of the two methods, "connection_with_API_key" or "connection_with_cloud_id"

In [2]:
from elasticsearch import Elasticsearch,helpers

def connection_with_local_server(port : str):
  es = Elasticsearch(
      {'host':'localhost',port:port})
  es.ping()

  return es

def connection_with_API_key(list_of_nodes,API_key : str):
  es = Elasticsearch(
    list_of_nodes,    #['node-1', 'node-2', 'node-3']
    api_key=('id', API_key),
)
  es.ping()

  return es

def connection_with_cloud_id(cloud_id,password):
  es = Elasticsearch(
      cloud_id = cloud_id,
      http_auth=("elastic", password),
  )
  es.ping()

  return es

es = connection_with_cloud_id("ISR-spotlight:ZWFzdHVzMi5henVyZS5lbGFzdGljLWNsb3VkLmNvbTo5MjQzJDc0MzJjNTg4M2I1MjQ2ZGViYzllOWJmNTE0YzFlNTUyJDM4NTJjMTIyNTg3YzQ5NWViZjQxMDQ0ZGExZTNkNGFi","G33hwear6NKHA79sa3EwNwHK")




GET https://7432c5883b5246debc9e9bf514c1e552.eastus2.azure.elastic-cloud.com:9243/ [status:200 request:0.512s]
HEAD https://7432c5883b5246debc9e9bf514c1e552.eastus2.azure.elastic-cloud.com:9243/ [status:200 request:0.092s]


# Working with Indices (Tables)

> * Generally a Database is called Index in elastic search till version 7 but in further releases the Index is a table not database.


> * All the methods we are calling using the es insatnce will result in an API call either GET, PUT, POST or DELETE based on each operation and their states can be seen in the output of each execution.



## Creating a new Index

> * creating a new database or index uses PUT request to our server.




In [3]:
def create_index(index_name):
  response = es.indices.create(index = index_name,ignore = 400)
  if(response.get("acknowledged",False)):
    print("index created")

create_index("data")

PUT https://7432c5883b5246debc9e9bf514c1e552.eastus2.azure.elastic-cloud.com:9243/data [status:200 request:0.286s]
index created


## Getting all currently available indices or Tables

> * This uses a GET request to fetch all indices.




In [4]:
def list_all_indices(index_name):
  indices_list = es.indices.get_alias(index = index_name)
  for each_index in indices_list:
    if(each_index.startswith(".") or each_index.startswith("apm")):
      continue
    print(each_index)

list_all_indices("*")

GET https://7432c5883b5246debc9e9bf514c1e552.eastus2.azure.elastic-cloud.com:9243/*/_alias [status:200 request:0.099s]
student
data
facebook_outage_tweets
kibana_sample_data_flights




## Deleting an index or table

> * This uses a DELETE request to detlete required index.




In [5]:
def delete_index(index_name):
  response = es.indices.delete(index = index_name)
  if(response.get("acknowledged",False)):
    print("index deleted")

delete_index("data")


DELETE https://7432c5883b5246debc9e9bf514c1e552.eastus2.azure.elastic-cloud.com:9243/data [status:200 request:0.204s]
index deleted


# Deploy/Access data in Elasticsearch indices

> * We can use JSON to add structured data in the index.


> * Each index or table is given an unique id to it.

> * The acknowledgement message shows all the attributes discussed above.









In [6]:
delete_index("student")
create_index("student")

DELETE https://7432c5883b5246debc9e9bf514c1e552.eastus2.azure.elastic-cloud.com:9243/student [status:200 request:0.207s]
index deleted
PUT https://7432c5883b5246debc9e9bf514c1e552.eastus2.azure.elastic-cloud.com:9243/student [status:200 request:0.268s]
index created


In [7]:
#create a JSON data for students table
student1 = {
    "name":"Moyeed",
    "Z-id":"Z-1912165",
    "level": "Graduate",
    "Location":"Dekalb"
}
student2 = {
    "name":"Omer",
    "Z-id":"Z-1905006",
    "level": "Graduate",
    "Location":"Dekalb"
}

#create a studnet table and insert this data
es.index(index = "student", document = {"s1":student1,"s2":student2})

POST https://7432c5883b5246debc9e9bf514c1e552.eastus2.azure.elastic-cloud.com:9243/student/_doc [status:201 request:0.181s]


{'_id': 'qIJhmXwBUCa3jWZUdwQw',
 '_index': 'student',
 '_primary_term': 1,
 '_seq_no': 0,
 '_shards': {'failed': 0, 'successful': 2, 'total': 2},
 '_type': '_doc',
 '_version': 1,
 'result': 'created'}

## Import data from pandas dataframe to elasticsearch Index.










### downlad the dataset and extract it.

In [9]:


import zipfile
import os
from urllib.request import urlretrieve

import pandas as pd

file_to_download = ("tweets","https://github.com/moyeed/Elasticsearch/raw/main/tweets.zip")
dataset_fileName = ("Facebook_outage_Tweet_data_4th_October.csv","/content/Facebook_outage_Tweet_data_4th_October.csv")
if not os.path.exists(file_to_download[0]):
  urlretrieve(file_to_download[1], file_to_download[0])
  print("zip file downloaded")

if not os.path.exists(dataset_fileName[0]):
  !unzip /content/tweets
  print("Facebook_outage_Tweet_data_4th_October.csv, extracted from zip file")


zip file downloaded
Archive:  /content/tweets
  inflating: Facebook_outage_Tweet_data_4th_October.csv  
Facebook_outage_Tweet_data_4th_October.csv, extracted from zip file


### Read the data into a pandas datframe and replace Null values with -1.
> * Pandas dataframe does not have any problem with holding null values, but elasticsearch does not support indexing null values for efficiency purposes.

In [10]:
fb_outage_tweets_df = pd.read_csv(dataset_fileName[1], error_bad_lines=False)
fb_outage_tweets_df = fb_outage_tweets_df.fillna("-1")
fb_outage_tweets_df.drop(columns = ["thumbnail"]) 
# print(df.isnull().sum())
fb_outage_tweets_df.head()

  interactivity=interactivity, compiler=compiler, result=result)


Unnamed: 0,id,conversation_id,created_at,date,time,timezone,user_id,username,name,place,tweet,language,mentions,urls,photos,replies_count,retweets_count,likes_count,hashtags,cashtags,link,retweet,quote_url,video,thumbnail,near,geo,source,user_rt_id,user_rt,retweet_id,reply_to,retweet_date,translate,trans_src,trans_dest
0,1445144194033590276,1445144194033590276,2021-10-04 21:49:59 UTC,2021-10-04,21:49:59,0,833121104,haythamkhateer,Haytham Khater,-1,#Facebook seems like a bgp routing update wit...,en,[],[],[],0,0,0,['facebook'],[],https://twitter.com/Haythamkhateer/status/1445...,False,-1,0,-1,-1,-1,-1,-1,-1,-1,[],-1,-1,-1,-1
1,1445144194025246720,1445144194025246720,2021-10-04 21:49:59 UTC,2021-10-04,21:49:59,0,2177188897,ms_catarsis,M.,-1,Tsss una hora antes de salir vuelve Facebook e...,es,[],[],['https://pbs.twimg.com/tweet_video_thumb/FA4u...,0,0,0,[],[],https://twitter.com/Ms_Catarsis/status/1445144...,False,-1,1,https://pbs.twimg.com/tweet_video_thumb/FA4u-w...,-1,-1,-1,-1,-1,-1,[],-1,-1,-1,-1
2,1445144193849073670,1445141972323344384,2021-10-04 21:49:59 UTC,2021-10-04,21:49:59,0,1156613502863233024,walter21839737,Nobody,-1,@tcsnoticias Sobrevimos a 30 años de arena y f...,es,[],[],[],0,0,2,[],[],https://twitter.com/Walter21839737/status/1445...,False,-1,0,-1,-1,-1,-1,-1,-1,-1,"[{'screen_name': 'tcsnoticias', 'name': 'TCS N...",-1,-1,-1,-1
3,1445144193819574285,1445144193819574285,2021-10-04 21:49:59 UTC,2021-10-04,21:49:59,0,1439752489864151045,sh07398411rubal,Rubal Sharma,-1,Aaj Facebook Instagram aur WhatsApp nahin chal...,hi,[],[],[],0,0,0,[],[],https://twitter.com/Sh07398411Rubal/status/144...,False,-1,0,-1,-1,-1,-1,-1,-1,-1,[],-1,-1,-1,-1
4,1445144193714774025,1445144193714774025,2021-10-04 21:49:59 UTC,2021-10-04,21:49:59,0,1096607025688121344,stefanidaluz17,Stefani,-1,O Facebook voltou ou eu tô ficando louca?,pt,[],[],[],1,0,1,[],[],https://twitter.com/StefaniDaLuz17/status/1445...,False,-1,0,-1,-1,-1,-1,-1,-1,-1,[],-1,-1,-1,-1


### Create a generator object that can be sent for uploadig dataset from pandas dataframe to our index.
> * we can import the dataset through kibana but the file should be less than 100mb, if the saize is greater, we can use something called bulk upload programatiically.

> * we have to create a generator to upload huge datasets to elastic search programatically.

In [11]:
def generator_for_bulk_upload(dataframe):
  columns = list(dataframe.columns)
  # print(type(df))
  # print(columns)
  for index, each_record in dataframe.iterrows():
    temp_source = {}
    # print(each_record[0])
    # for index,col in enumerate(columns):
    for col in columns:
      temp_source[col] = each_record[col]
    yield {
        '_index':'facebook_outage_tweets',
        '_id':each_record.id,
        '_source':temp_source
    }
# gen = generator_for_bulk_upload(fb_outage_tweets_df)

### create index to upload, if not already present.

In [12]:
# next(gen)
if es.indices.exists(index="facebook_outage_tweets"):
  print("index already exists")
  pass
else:
  create_index("facebook_outage_tweets")
  

HEAD https://7432c5883b5246debc9e9bf514c1e552.eastus2.azure.elastic-cloud.com:9243/facebook_outage_tweets [status:200 request:0.098s]
index already exists


### upload the data in dataframe to elaticsearch index using _bulk API

In [None]:
helpers.bulk(es,generator_for_bulk_upload(fb_outage_tweets_df))


POST https://7432c5883b5246debc9e9bf514c1e552.eastus2.azure.elastic-cloud.com:9243/_bulk [status:200 request:0.544s]
POST https://7432c5883b5246debc9e9bf514c1e552.eastus2.azure.elastic-cloud.com:9243/_bulk [status:200 request:0.355s]
POST https://7432c5883b5246debc9e9bf514c1e552.eastus2.azure.elastic-cloud.com:9243/_bulk [status:200 request:0.388s]
POST https://7432c5883b5246debc9e9bf514c1e552.eastus2.azure.elastic-cloud.com:9243/_bulk [status:200 request:0.387s]
POST https://7432c5883b5246debc9e9bf514c1e552.eastus2.azure.elastic-cloud.com:9243/_bulk [status:200 request:0.366s]
POST https://7432c5883b5246debc9e9bf514c1e552.eastus2.azure.elastic-cloud.com:9243/_bulk [status:200 request:0.377s]
POST https://7432c5883b5246debc9e9bf514c1e552.eastus2.azure.elastic-cloud.com:9243/_bulk [status:200 request:0.368s]
POST https://7432c5883b5246debc9e9bf514c1e552.eastus2.azure.elastic-cloud.com:9243/_bulk [status:200 request:0.339s]
POST https://7432c5883b5246debc9e9bf514c1e552.eastus2.azure.elas

(304135, [])

## Get data from elastic search to pandas dataframe

> * we use the search query here to get full documnet without any filters.(we will go through the queries in detail in further cells, here just trying to show that data can be uploaded and downloaded from pandas dataframe.)




In [13]:
print("The original shape of data is ", fb_outage_tweets_df.shape)
res = es.search(index = "facebook_outage_tweets")

data_from_es = pd.DataFrame(res["hits"]["hits"])
print("The shape of returned recieved data is ",data_from_es.shape,"\n")
data_from_es.head()

The original shape of data is  (304135, 36)
POST https://7432c5883b5246debc9e9bf514c1e552.eastus2.azure.elastic-cloud.com:9243/facebook_outage_tweets/_search [status:200 request:0.105s]
The shape of returned recieved data is  (10, 6) 



Unnamed: 0,_index,_type,_id,_score,_source,_ignored
0,facebook_outage_tweets,_doc,1445143573364498447,1.0,"{'id': 1445143573364498447, 'conversation_id':...",
1,facebook_outage_tweets,_doc,1445143573347852292,1.0,"{'id': 1445143573347852292, 'conversation_id':...",
2,facebook_outage_tweets,_doc,1445143573339545608,1.0,"{'id': 1445143573339545608, 'conversation_id':...",
3,facebook_outage_tweets,_doc,1445143573175980033,1.0,"{'id': 1445143573175980033, 'conversation_id':...",
4,facebook_outage_tweets,_doc,1445143573175877638,1.0,"{'id': 1445143573175877638, 'conversation_id':...",


### Pagination or scrolling over a index

> * As we have seen above, we we do not give any JSON formatted query, it gets first 10 results of the index, the highest size we can get is 10,000 hits at once, we will not be able to get more than that at a time. We need to use scroll attribute of elastic search API to be able to receive more than 10,000 results.

> * Another problem is the data is stored in _soucre of each item in new dataframe, which can be solved by using pandas to extract the data into columns.(That will not be shown here)





In [14]:
data = []
print("The original shape of data is ", fb_outage_tweets_df.shape)

res = es.search(index = "facebook_outage_tweets",size = 10000, scroll = "10m" )
data.extend(res["hits"]["hits"])
while(len(res["hits"]["hits"])):
  scroll_id = res['_scroll_id']
  res = es.scroll(scroll_id = scroll_id, scroll='10m')
  data.extend(res["hits"]["hits"])

data_from_es = pd.DataFrame(data)
print("The shape of returned recieved data is ",data_from_es.shape,"\n")
data_from_es.head()

The original shape of data is  (304135, 36)
POST https://7432c5883b5246debc9e9bf514c1e552.eastus2.azure.elastic-cloud.com:9243/facebook_outage_tweets/_search?scroll=10m [status:200 request:1.088s]
POST https://7432c5883b5246debc9e9bf514c1e552.eastus2.azure.elastic-cloud.com:9243/_search/scroll [status:200 request:0.498s]
POST https://7432c5883b5246debc9e9bf514c1e552.eastus2.azure.elastic-cloud.com:9243/_search/scroll [status:200 request:0.518s]
POST https://7432c5883b5246debc9e9bf514c1e552.eastus2.azure.elastic-cloud.com:9243/_search/scroll [status:200 request:0.528s]
POST https://7432c5883b5246debc9e9bf514c1e552.eastus2.azure.elastic-cloud.com:9243/_search/scroll [status:200 request:0.457s]
POST https://7432c5883b5246debc9e9bf514c1e552.eastus2.azure.elastic-cloud.com:9243/_search/scroll [status:200 request:0.480s]
POST https://7432c5883b5246debc9e9bf514c1e552.eastus2.azure.elastic-cloud.com:9243/_search/scroll [status:200 request:0.552s]
POST https://7432c5883b5246debc9e9bf514c1e552.e

Unnamed: 0,_index,_type,_id,_score,_source,_ignored
0,facebook_outage_tweets,_doc,1445143573364498447,1.0,"{'id': 1445143573364498447, 'conversation_id':...",
1,facebook_outage_tweets,_doc,1445143573347852292,1.0,"{'id': 1445143573347852292, 'conversation_id':...",
2,facebook_outage_tweets,_doc,1445143573339545608,1.0,"{'id': 1445143573339545608, 'conversation_id':...",
3,facebook_outage_tweets,_doc,1445143573175980033,1.0,"{'id': 1445143573175980033, 'conversation_id':...",
4,facebook_outage_tweets,_doc,1445143573175877638,1.0,"{'id': 1445143573175877638, 'conversation_id':...",


# Querying in ElasticSearch (Let the fun begin!!)

> * There are various ways in which a query can be written for text retrival in elastic search, but everything comes donw to one JSON object.

> * Elasticsearch provides easy to use and different filters that can be used to model our query to return data as per requirements.

> * A query in short is the POST request to our elasticsearch endpoint to retrive data.







## First match query (like Hello world! in programming)

> * The match query is used to match the term specified at any column of the table. we define what columns the search result should match in the match field of the JSON.

> * By default th top 10 documents based on the query are returned, we can modify the size of the result to be returned by using size property.

> * The output is not just the top 10 documents but some additional data is also displayed which is useful in understanding how the query has perfomed on our data.
  >> * **_shards**: This dictionary tells us how many shards were queried and how many were successful.
  >> * **hits**: This dictionary shows the total number of results matched and has an internal hits object for matched documents.
  >>> * **total**: This dictionary holds the count of all matched documents and the relation that tells us if the count is accurate or not(i.e count  is either exactly equal if "eq" or less than or greater than if "lte" or "gte"). what ever the count, if size is not specified it returns 10 documents at once, we can get a maximum of 10000 documents at once in a single query.
  >>> * **max_score**: This filed show the max score that was observed in all matches.
  >>> * **hits**: This internal hits dictionary holds the full document that is ranked in that speicific position based on score.
  >>>> * **_id**: is the unique id of each document or entry in the index.
  >>>> * **_score**: is the score of the document based on the query. higher the score better the rank.
  >>>>* **_source**: This dictionary has all the attributes of the entry or document of the matched document in the index.

In [None]:
# find the documents that has "signal" in the column "tweet".
query = {
    "match":{
        "tweet":"Signal"
        }
 }
es.search(query = query,index = "facebook_outage_tweets")


POST https://7432c5883b5246debc9e9bf514c1e552.eastus2.azure.elastic-cloud.com:9243/facebook_outage_tweets/_search [status:200 request:0.027s]


{'_shards': {'failed': 0, 'skipped': 0, 'successful': 1, 'total': 1},
 'hits': {'hits': [{'_id': '1445136468247293962',
    '_index': 'facebook_outage_tweets',
    '_score': 10.717007,
    '_source': {'cashtags': '[]',
     'conversation_id': 1445136468247293962,
     'created_at': '2021-10-04 21:19:18 UTC',
     'date': '2021-10-04',
     'geo': '-1',
     'hashtags': "['facebook', 'whatsapp', 'instagramdown', 'signal']",
     'id': 1445136468247293962,
     'language': 'und',
     'likes_count': 0,
     'link': 'https://twitter.com/omar_karbil/status/1445136468247293962',
     'mentions': '[]',
     'name': 'Omar Karbil',
     'near': '-1',
     'photos': '[]',
     'place': '-1',
     'quote_url': '-1',
     'replies_count': 0,
     'reply_to': '[]',
     'retweet': False,
     'retweet_date': '-1',
     'retweet_id': '-1',
     'retweets_count': 0,
     'source': '-1',
     'thumbnail': '-1',
     'time': '21:19:18',
     'timezone': 0,
     'trans_dest': '-1',
     'trans_src': '-

## Keeping track of Total matched documents

> * We have seen above that hits has a total named dictionary that holds the count of matched documents(not the retrived documents), we only retrive a few documents, if 1000 are matched we only need first 100 or some arbitary value, so keeping a track of total is costly and we can skip if we want.

> * The attribute "track_total_hits" helps us in keeping track of the total hits as needed.
>> * If it is True, exact matched queries count is tracked and the relation will be "eq"
>> * instead if the value is False while querying, we will not track this count at all.
>> * Besides being a booleean value, the attribute can also be a integer. In this case, the total count till the integer value provided will be accurate but once it exceeds that value, it is not accurate.



In [None]:
# find the documents that has "signal" in the column "tweet".
query = {
    "track_total_hits":100,
    "query": {
    "match":{
        "tweet":"Signal"
        }
     }
     }
     
es.search(body = query,index = "facebook_outage_tweets")


POST https://7432c5883b5246debc9e9bf514c1e552.eastus2.azure.elastic-cloud.com:9243/facebook_outage_tweets/_search [status:200 request:0.032s]


  # This is added back by InteractiveShellApp.init_path()


{'_shards': {'failed': 0, 'skipped': 0, 'successful': 1, 'total': 1},
 'hits': {'hits': [{'_id': '1445136468247293962',
    '_index': 'facebook_outage_tweets',
    '_score': 10.717007,
    '_source': {'cashtags': '[]',
     'conversation_id': 1445136468247293962,
     'created_at': '2021-10-04 21:19:18 UTC',
     'date': '2021-10-04',
     'geo': '-1',
     'hashtags': "['facebook', 'whatsapp', 'instagramdown', 'signal']",
     'id': 1445136468247293962,
     'language': 'und',
     'likes_count': 0,
     'link': 'https://twitter.com/omar_karbil/status/1445136468247293962',
     'mentions': '[]',
     'name': 'Omar Karbil',
     'near': '-1',
     'photos': '[]',
     'place': '-1',
     'quote_url': '-1',
     'replies_count': 0,
     'reply_to': '[]',
     'retweet': False,
     'retweet_date': '-1',
     'retweet_id': '-1',
     'retweets_count': 0,
     'source': '-1',
     'thumbnail': '-1',
     'time': '21:19:18',
     'timezone': 0,
     'trans_dest': '-1',
     'trans_src': '-

## The "Size" Attribute
> * As mentioned before, if size attribute is not given, we will get a 10 results if matches are more than 10 and number of matches if they are less than 10.
> * We can get a maximum of 10,000 documents at once, if we wnat more, we need to use scroll/pagination.
> * if we set the size to zero, we can just check how many documents amtch the query.

In [None]:
query = {
    "match":{
        "tweet":"Signal"
        }
 }
es.search(query = query,index = "facebook_outage_tweets",size = 0)

POST https://7432c5883b5246debc9e9bf514c1e552.eastus2.azure.elastic-cloud.com:9243/facebook_outage_tweets/_search [status:200 request:0.025s]


{'_shards': {'failed': 0, 'skipped': 0, 'successful': 1, 'total': 1},
 'hits': {'hits': [],
  'max_score': None,
  'total': {'relation': 'eq', 'value': 545}},
 'timed_out': False,
 'took': 1}

## Filtering Search results
> * Filter attribute in our query will help in filtering and getting only the data that matches the filter as descriibed.
> * term key in filter item is used when you have to check if a field has that term or not
> * exists field is used to check if the field exists or not.
> * range field is used when we want to search for range, less than or greater than
> * match is used to check if the term or sentence matches in the given field.
> * Filter is used inside bool attribute.


In [None]:

query = {
"track_total_hits":True,
"query":{
    "bool": {
      "filter": [
        { "term": { "language": "fr"}},
        {"exists":{"field":"tweet"}},
        {"range":{"replies_count":{"gte":10}}},
        {"match":{"tweet":"mark"}}

      ]
    }
 }
}
es.search(body = query,index = "facebook_outage_tweets")

POST https://7432c5883b5246debc9e9bf514c1e552.eastus2.azure.elastic-cloud.com:9243/facebook_outage_tweets/_search [status:200 request:0.029s]


  app.launch_new_instance()


{'_shards': {'failed': 0, 'skipped': 0, 'successful': 1, 'total': 1},
 'hits': {'hits': [{'_id': '1445135556422377472',
    '_index': 'facebook_outage_tweets',
    '_score': 0.0,
    '_source': {'cashtags': '[]',
     'conversation_id': 1445135556422377472,
     'created_at': '2021-10-04 21:15:40 UTC',
     'date': '2021-10-04',
     'geo': '-1',
     'hashtags': '[]',
     'id': 1445135556422377472,
     'language': 'fr',
     'likes_count': 1850,
     'link': 'https://twitter.com/CurryElPatron/status/1445135556422377472',
     'mentions': '[]',
     'name': 'Shady',
     'near': '-1',
     'photos': '[]',
     'place': '-1',
     'quote_url': '-1',
     'replies_count': 24,
     'reply_to': '[]',
     'retweet': False,
     'retweet_date': '-1',
     'retweet_id': '-1',
     'retweets_count': 477,
     'source': '-1',
     'thumbnail': 'https://pbs.twimg.com/ext_tw_video_thumb/1445135096588210181/pu/img/whzbo1Ou6bPYTW7I.jpg',
     'time': '21:15:40',
     'timezone': 0,
     'trans_d

## Aggregation
> * Aggregration is used to group and analyze the data on a field.
> * we will bring the count of top 10 highest tweeted conversationIds.
> * Then we will also get the maximum likes in the top ten conversation-id tweets.
> * These are just a few common items, there are different things that can be done with aggregation which can be found [here.](https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html)


In [None]:
  query = {
"track_total_hits":False,
"query":{
    "bool": {
      "filter": {"exists":{"field":"conversation_id"}}
    }
 },
 "aggs":{
     "MyBuckets":{
         "terms":{
             "field":"conversation_id",
             "order":{
                 "_count":"desc"
             },
             "size":10
         },"aggs": {
        "max_likes_count": { "max": { "field": "likes_count" } }
      }
     }
 },
}
es.search(body = query,index = "facebook_outage_tweets")



POST https://7432c5883b5246debc9e9bf514c1e552.eastus2.azure.elastic-cloud.com:9243/facebook_outage_tweets/_search [status:200 request:0.532s]


{'_shards': {'failed': 0, 'skipped': 0, 'successful': 1, 'total': 1},
 'aggregations': {'MyBuckets': {'buckets': [{'doc_count': 5446,
     'key': 1445061804636479493,
     'max_likes_count': {'value': 835.0}},
    {'doc_count': 3708,
     'key': 1445114730151043073,
     'max_likes_count': {'value': 720.0}},
    {'doc_count': 1079,
     'key': 1445078208190291973,
     'max_likes_count': {'value': 23992.0}},
    {'doc_count': 651,
     'key': 1445104389014835213,
     'max_likes_count': {'value': 332.0}},
    {'doc_count': 418,
     'key': 1445131408993923077,
     'max_likes_count': {'value': 32586.0}},
    {'doc_count': 376,
     'key': 1445125556408836096,
     'max_likes_count': {'value': 408.0}},
    {'doc_count': 233,
     'key': 1445066240817573903,
     'max_likes_count': {'value': 1278.0}},
    {'doc_count': 176,
     'key': 1445060216161116168,
     'max_likes_count': {'value': 37.0}},
    {'doc_count': 170,
     'key': 1445134713245601796,
     'max_likes_count': {'value': 6

## Get only required fields
> * we can filter the data to return just that fields that are mentioned in the fields array.
> * wildcard character can also be used.

In [None]:
query = {
  "query": {
    "match": {
      "tweet": "Mark"
    }
  },
  "fields": [
    "tweet",
    "replies_count",
    "retweet*"
  ],
  "_source": False
}
es.search(body = query,index = "facebook_outage_tweets")

POST https://7432c5883b5246debc9e9bf514c1e552.eastus2.azure.elastic-cloud.com:9243/facebook_outage_tweets/_search [status:200 request:0.032s]


  


{'_shards': {'failed': 0, 'skipped': 0, 'successful': 1, 'total': 1},
 'hits': {'hits': [{'_id': '1445132497499607040',
    '_index': 'facebook_outage_tweets',
    '_score': 6.394503,
    '_type': '_doc',
    'fields': {'replies_count': [0],
     'retweet': [False],
     'retweet_date': ['-1'],
     'retweet_date.keyword': ['-1'],
     'retweet_id': ['-1'],
     'retweet_id.keyword': ['-1'],
     'retweets_count': [0],
     'tweet': ['Mark! Mark! Facebook is down.']}},
   {'_id': '1445129213883953154',
    '_index': 'facebook_outage_tweets',
    '_score': 6.0430903,
    '_type': '_doc',
    'fields': {'replies_count': [0],
     'retweet': [False],
     'retweet_date': ['-1'],
     'retweet_date.keyword': ['-1'],
     'retweet_id': ['-1'],
     'retweet_id.keyword': ['-1'],
     'retweets_count': [0],
     'tweet': ["Mark Zuckerberg'in instagram ve facebook adresi Mark Zuckerberg."]}},
   {'_id': '1445130053508022278',
    '_index': 'facebook_outage_tweets',
    '_score': 5.9343815,
   

## Boolean Query
> * **must**: The term must exist in the field, it contributes towards score.
> * **filter**: The term must exist in the filed, but unlike must attribute it does not contribute towards score.
> * **should**: The term should be  in the field, not compusary.
>* **must_not**: The query is that matches the condition given will be excluded form scoring.
> * **minimum_should_match**: This attribute can hold different typees of values,
>> * Integer: Indicates a fixed value regardless of the number of optional clauses. example = 2
>> * negative Integer: Indicates that the total number of optional clauses, minus this number should be mandatory. example = -3
>> * Percentage: Indicates that this percent of the total number of optional clauses are necessary. The number computed from the percentage is rounded down and used as the minimum. Example = 75%
>> * Negative percentage: Indicates that this percent of the total number of optional clauses can be missing. The number computed from the percentage is rounded down, before being subtracted from the total to determine the minimum.
Example = -25%
>> * Combination: A positive integer, followed by the less-than symbol, followed by any of the previously mentioned specifiers is a conditional specification. It indicates that if the number of optional clauses is equal to (or less than) the integer, they are all required, but if it’s greater than the integer, the specification applies. example = 3 < 90% In this example: if there are 1 to 3 clauses they are all required, but for 4 or more clauses only 90% are required.
>> * Multiple combinations:Multiple conditional specifications can be separated by spaces, each one only being valid for numbers greater than the one before it.example - 2<-25% 9<-3, In this example: if there are 1 or 2 clauses both are required, if there are 3-9 clauses all but 25% are required, and if there are more than 9 clauses, all but three are required.


In [None]:
query = {
  "query": {
    "bool" : {
      "must" : {
        "term" : { "tweet" : "uninstall" }
      },
      "filter": {
        "term" : { "language" : "en" }
      },
      "must_not" : {
        "range" : {
          "replies_count" : { "gte" : 10, "lte" : 20 }
        }
      },
      "should" : [
        { "term" : { "timezone" : 0 } },
        { "term" : { "video" : 0 } }
      ],
      "minimum_should_match" : 1,
      "boost" : 1.0
    }
  }
}
es.search(body = query,index = "facebook_outage_tweets")

POST https://7432c5883b5246debc9e9bf514c1e552.eastus2.azure.elastic-cloud.com:9243/facebook_outage_tweets/_search [status:200 request:0.033s]




{'_shards': {'failed': 0, 'skipped': 0, 'successful': 1, 'total': 1},
 'hits': {'hits': [{'_id': '1445127546270531596',
    '_index': 'facebook_outage_tweets',
    '_score': 15.512344,
    '_source': {'cashtags': '[]',
     'conversation_id': 1445127546270531596,
     'created_at': '2021-10-04 20:43:50 UTC',
     'date': '2021-10-04',
     'geo': '-1',
     'hashtags': '[]',
     'id': 1445127546270531596,
     'language': 'en',
     'likes_count': 1,
     'link': 'https://twitter.com/dr5bludgeoning/status/1445127546270531596',
     'mentions': '[]',
     'name': 'i_amJeremy',
     'near': '-1',
     'photos': '[]',
     'place': '-1',
     'quote_url': '-1',
     'replies_count': 0,
     'reply_to': '[]',
     'retweet': False,
     'retweet_date': '-1',
     'retweet_id': '-1',
     'retweets_count': 0,
     'source': '-1',
     'thumbnail': '-1',
     'time': '20:43:50',
     'timezone': 0,
     'trans_dest': '-1',
     'trans_src': '-1',
     'translate': '-1',
     'tweet': 'Unins

## Score boosting

> * we can decrease the score of the required terms or phrases as required.
> * below query adds to the score when "signal" matches in tweet and decreases the relevence score by penalizing the terms that exists in negative field.
> * boost value ranges form 0 to 1 only.


In [None]:
query = {
  "query": {
    "boosting": {
      "positive": {
        "match": {
          "tweet": "Signal"
        }
      },
      "negative": {
        "term": {
          "text": "facebook"
        }
      },
      "negative_boost": 0.8,
      "negative": {
        "term": {
          "text": "watsapp"
        }
      },
      "negative_boost": 0.3,
      "negative": {
        "term": {
          "text": "instagram"
        }
      },
      "negative_boost": 0.5
      
    }
  }
}
es.search(body = query,index = "facebook_outage_tweets")

POST https://7432c5883b5246debc9e9bf514c1e552.eastus2.azure.elastic-cloud.com:9243/facebook_outage_tweets/_search [status:200 request:0.028s]




{'_shards': {'failed': 0, 'skipped': 0, 'successful': 1, 'total': 1},
 'hits': {'hits': [{'_id': '1445136468247293962',
    '_index': 'facebook_outage_tweets',
    '_score': 10.717007,
    '_source': {'cashtags': '[]',
     'conversation_id': 1445136468247293962,
     'created_at': '2021-10-04 21:19:18 UTC',
     'date': '2021-10-04',
     'geo': '-1',
     'hashtags': "['facebook', 'whatsapp', 'instagramdown', 'signal']",
     'id': 1445136468247293962,
     'language': 'und',
     'likes_count': 0,
     'link': 'https://twitter.com/omar_karbil/status/1445136468247293962',
     'mentions': '[]',
     'name': 'Omar Karbil',
     'near': '-1',
     'photos': '[]',
     'place': '-1',
     'quote_url': '-1',
     'replies_count': 0,
     'reply_to': '[]',
     'retweet': False,
     'retweet_date': '-1',
     'retweet_id': '-1',
     'retweets_count': 0,
     'source': '-1',
     'thumbnail': '-1',
     'time': '21:19:18',
     'timezone': 0,
     'trans_dest': '-1',
     'trans_src': '-

## Intervals Query
> * Returns documents based on the order and proximity of matching terms.
> * **all_of**: the mathcing document must hav the phrase in this segment.
> * **any_of**: the matching document will have any of the matching query in this segment.
> * **Max_gaps**: this provides the spcae between the first phrse and other phrases that can be there.

In [None]:
query = {
  "query": {
    "intervals" : {
      "tweet" : {
        "all_of" : {
          "ordered" : True,
          "intervals" : [
            {
              "match" : {
                "query" : "My first Tweet ever because",
                "max_gaps" : 0,
                "ordered" : True
              }
            },
            {
              "any_of" : {
                "intervals" : [
                  { "match" : { "query" : " Facebook is down! " } },
                  { "match" : { "query" : "watsapp is not working" } }
                ]
              }
            }
          ]
        }
      }
    }
  }
}
es.search(body = query,index = "facebook_outage_tweets")

POST https://7432c5883b5246debc9e9bf514c1e552.eastus2.azure.elastic-cloud.com:9243/facebook_outage_tweets/_search [status:200 request:0.201s]




{'_shards': {'failed': 0, 'skipped': 0, 'successful': 1, 'total': 1},
 'hits': {'hits': [{'_id': '1445142277249290241',
    '_index': 'facebook_outage_tweets',
    '_score': 0.5,
    '_source': {'cashtags': '[]',
     'conversation_id': 1445142277249290241,
     'created_at': '2021-10-04 21:42:22 UTC',
     'date': '2021-10-04',
     'geo': '-1',
     'hashtags': '[]',
     'id': 1445142277249290241,
     'language': 'en',
     'likes_count': 0,
     'link': 'https://twitter.com/lovedhim78/status/1445142277249290241',
     'mentions': '[]',
     'name': 'Joy Davis',
     'near': '-1',
     'photos': '[]',
     'place': '-1',
     'quote_url': '-1',
     'replies_count': 0,
     'reply_to': '[]',
     'retweet': False,
     'retweet_date': '-1',
     'retweet_id': '-1',
     'retweets_count': 0,
     'source': '-1',
     'thumbnail': '-1',
     'time': '21:42:22',
     'timezone': 0,
     'trans_dest': '-1',
     'trans_src': '-1',
     'translate': '-1',
     'tweet': 'My first Tweet e

## Match phrase query

> * This attribute is used to match the query that has exact same phrase in it.
> * Those document that has this term and some prefix or suffix to it also match the query.


In [None]:
query = {
  "query": {
    "match_phrase": {
      "tweet": "the Facebook shutdown"
    }
  }
}
es.search(body = query,index = "facebook_outage_tweets")

  


POST https://7432c5883b5246debc9e9bf514c1e552.eastus2.azure.elastic-cloud.com:9243/facebook_outage_tweets/_search [status:200 request:0.221s]


{'_shards': {'failed': 0, 'skipped': 0, 'successful': 1, 'total': 1},
 'hits': {'hits': [{'_id': '1445128574407675915',
    '_index': 'facebook_outage_tweets',
    '_score': 10.944517,
    '_source': {'cashtags': '[]',
     'conversation_id': 1445128574407675915,
     'created_at': '2021-10-04 20:47:55 UTC',
     'date': '2021-10-04',
     'geo': '-1',
     'hashtags': '[]',
     'id': 1445128574407675915,
     'language': 'en',
     'likes_count': 0,
     'link': 'https://twitter.com/theguykt/status/1445128574407675915',
     'mentions': '[]',
     'name': 'kt.',
     'near': '-1',
     'photos': '[]',
     'place': '-1',
     'quote_url': 'https://twitter.com/nocontextdrumar/status/1445085983335231493',
     'replies_count': 1,
     'reply_to': '[]',
     'retweet': False,
     'retweet_date': '-1',
     'retweet_id': '-1',
     'retweets_count': 1,
     'source': '-1',
     'thumbnail': '-1',
     'time': '20:47:55',
     'timezone': 0,
     'trans_dest': '-1',
     'trans_src': '-1

## Match phrase prefix query
> * Returns documents that contain the words of a provided text, in the same order as provided. The last term of the provided text is treated as a prefix, matching any words that begin with that term.

In [None]:
query = {
  "query": {
    "match_phrase_prefix": {
      "tweet": {
        "query": "Facebook s",
      }
    }
  }
}
es.search(body = query,index = "facebook_outage_tweets")

POST https://7432c5883b5246debc9e9bf514c1e552.eastus2.azure.elastic-cloud.com:9243/facebook_outage_tweets/_search [status:200 request:0.041s]


  # Remove the CWD from sys.path while we load stuff.


{'_shards': {'failed': 0, 'skipped': 0, 'successful': 1, 'total': 1},
 'hits': {'hits': [{'_id': '1445143350252822536',
    '_index': 'facebook_outage_tweets',
    '_score': 791.3332,
    '_source': {'cashtags': '[]',
     'conversation_id': 1445114730151043073,
     'created_at': '2021-10-04 21:46:38 UTC',
     'date': '2021-10-04',
     'geo': '-1',
     'hashtags': '[]',
     'id': 1445143350252822536,
     'language': 'und',
     'likes_count': 0,
     'link': 'https://twitter.com/Omeed_Shahab/status/1445143350252822536',
     'mentions': '[]',
     'name': 'Omeed-Shahab-Hamad',
     'near': '-1',
     'photos': '[]',
     'place': '-1',
     'quote_url': '-1',
     'replies_count': 0,
     'reply_to': "[{'screen_name': 'schrep', 'name': 'Mike Schroepfer', 'id': '6182852'}, {'screen_name': 'Facebook', 'name': 'Facebook', 'id': '2425151'}]",
     'retweet': False,
     'retweet_date': '-1',
     'retweet_id': '-1',
     'retweets_count': 0,
     'source': '-1',
     'thumbnail': '-1

## Multi macth query
> * The multi_match query builds on the match query to allow multi-field queries.

In [None]:
query = {
  "query": {
    "multi_match" : {
      "query": "facebook", 
      "fields": [ "tweet", "mentions" ] 
    }
  }
}
es.search(body = query,index = "facebook_outage_tweets")

POST https://7432c5883b5246debc9e9bf514c1e552.eastus2.azure.elastic-cloud.com:9243/facebook_outage_tweets/_search [status:200 request:0.040s]


  if __name__ == '__main__':


{'_shards': {'failed': 0, 'skipped': 0, 'successful': 1, 'total': 1},
 'hits': {'hits': [{'_id': '1445128160534749196',
    '_ignored': ['mentions.keyword'],
    '_index': 'facebook_outage_tweets',
    '_score': 1.4122502,
    '_source': {'cashtags': '[]',
     'conversation_id': 1445128160534749196,
     'created_at': '2021-10-04 20:46:17 UTC',
     'date': '2021-10-04',
     'geo': '-1',
     'hashtags': '[]',
     'id': 1445128160534749196,
     'language': 'es',
     'likes_count': 0,
     'link': 'https://twitter.com/ALEXANDERCALLAO/status/1445128160534749196',
     'mentions': "[{'screen_name': 'facebook', 'name': 'facebook', 'id': '2425151'}, {'screen_name': 'facebook', 'name': 'facebook', 'id': '2425151'}, {'screen_name': 'facebook', 'name': 'facebook', 'id': '2425151'}, {'screen_name': 'facebook', 'name': 'facebook', 'id': '2425151'}, {'screen_name': 'facebook', 'name': 'facebook', 'id': '2425151'}, {'screen_name': 'facebook', 'name': 'facebook', 'id': '2425151'}, {'screen_nam

##Specialized queries(Distance feature query)
> * Boosts the relevance score of documents closer to a provided origin date or point. For example, you can use this query to give more weight to documents closer to a certain date or location.
> * You can use the distance_feature query to find the nearest neighbors to a location. You can also use the query in a bool search’s should filter to add boosted relevance scores to the bool query’s scores.
> * The attributes used are described as below,
>> * **field**: This can be used to store date or geolocation.
>> * **origin**: This is the point of reference form where the current value distace has to be measured.
>> * **pivot**: Distance from the origin at which relevance scores receive half of the boost value
>> * **boost**: Floating point number used to multiply the relevance score of matching documents. This value cannot be negative. Defaults to 1.0. 

In [19]:
 query = {
  "query": {
    "bool": {
      "must": {
        "match": {
          "tweet": "facebook is down"
        }
      },
      "should": {
        "distance_feature": {
          "field": "date",
          "pivot": "7d",
          "origin": "now",
          "boost":0.5
        }
      }
    }
  }
}
es.search(body = query,index = "facebook_outage_tweets")



POST https://7432c5883b5246debc9e9bf514c1e552.eastus2.azure.elastic-cloud.com:9243/facebook_outage_tweets/_search [status:200 request:0.496s]


{'_shards': {'failed': 0, 'skipped': 0, 'successful': 1, 'total': 1},
 'hits': {'hits': [{'_id': '1445126990202380298',
    '_index': 'facebook_outage_tweets',
    '_score': 6.948606,
    '_source': {'cashtags': '[]',
     'conversation_id': 1445126990202380298,
     'created_at': '2021-10-04 20:41:38 UTC',
     'date': '2021-10-04',
     'geo': '-1',
     'hashtags': '[]',
     'id': 1445126990202380298,
     'language': 'en',
     'likes_count': 0,
     'link': 'https://twitter.com/yoncesmegan/status/1445126990202380298',
     'mentions': '[]',
     'name': 'Charlie 🐝🐝',
     'near': '-1',
     'photos': '[]',
     'place': '-1',
     'quote_url': '-1',
     'replies_count': 0,
     'reply_to': '[]',
     'retweet': False,
     'retweet_date': '-1',
     'retweet_id': '-1',
     'retweets_count': 0,
     'source': '-1',
     'thumbnail': '-1',
     'time': '20:41:38',
     'timezone': 0,
     'trans_dest': '-1',
     'trans_src': '-1',
     'translate': '-1',
     'tweet': 'instagram

##Specialized queries(More like this query)
> * It helps in searching the documents that are similar to given text.
> * It uses TF-IDF scoring to retrive similar documents matching the query.
> * The terms used are defined as,
>> * **like**: The docuements that match this text or terms
>> * **unllike**: The documents that does not match the terms.
>> * If both like and unlike are used, it will behave like an or condition.

In [27]:
query = {
  "query": {
    "more_like_this" : {
      "fields" : ["tweet", "hashtags"],
      "like" : "#Tor #signal",
      "unlike":"#Facebook",
      "min_term_freq" : 1,
      "max_query_terms" : 12
    }
  }
}
es.search(body = query,index = "facebook_outage_tweets")

POST https://7432c5883b5246debc9e9bf514c1e552.eastus2.azure.elastic-cloud.com:9243/facebook_outage_tweets/_search [status:200 request:0.102s]


  if sys.path[0] == '':


{'_shards': {'failed': 0, 'skipped': 0, 'successful': 1, 'total': 1},
 'hits': {'hits': [{'_id': '1445143173551075330',
    '_index': 'facebook_outage_tweets',
    '_score': 22.750309,
    '_source': {'cashtags': '[]',
     'conversation_id': 1445114730151043073,
     'created_at': '2021-10-04 21:45:56 UTC',
     'date': '2021-10-04',
     'geo': '-1',
     'hashtags': '[]',
     'id': 1445143173551075330,
     'language': 'en',
     'likes_count': 0,
     'link': 'https://twitter.com/AnonBlueCC/status/1445143173551075330',
     'mentions': '[]',
     'name': 'Richard Sanchez',
     'near': '-1',
     'photos': '[]',
     'place': '-1',
     'quote_url': '-1',
     'replies_count': 0,
     'reply_to': "[{'screen_name': 'schrep', 'name': 'Mike Schroepfer', 'id': '6182852'}, {'screen_name': 'Facebook', 'name': 'Facebook', 'id': '2425151'}]",
     'retweet': False,
     'retweet_date': '-1',
     'retweet_id': '-1',
     'retweets_count': 0,
     'source': '-1',
     'thumbnail': '-1',
  

##Specialized queries(Script Query)
> * Filters documents based on a provided script. The script query is typically used in a filter context.
> * Using scripts can result in slower search speeds.
> * This script must return a boolean value, true or false

In [47]:
query = {
  "query": {
    "bool": {
      "filter": {
        "script": {
          "script": {
            "source": "doc['tweet'].value > params.size_limit",
            "lang": "painless",
            "params": {
              "size_limit": 10
            }
          }
        }
      }
    }
  }
}
es.search(body = query,index = "facebook_outage_tweets")



RequestError: ignored

##Specialized queries(Pinned Query)
> * Promotes selected documents to rank higher than those matching a given query. This feature is typically used to guide searchers to curated documents that are promoted over and above any "organic" matches for a search. The promoted or "pinned" documents are identified using the document IDs stored in the _id field.

In [48]:
query = {
  "query": {
    "pinned": {
      "ids": [ "1,445,143,573,364,498,432", "1,445,143,573,347,852,288", "1,445,143,573,339,545,600" ],
      "organic": {
        "match": {
          "tweet": "facebook"
        }
      }
    }
  }
}
es.search(body = query,index = "facebook_outage_tweets")

  del sys.path[0]


POST https://7432c5883b5246debc9e9bf514c1e552.eastus2.azure.elastic-cloud.com:9243/facebook_outage_tweets/_search [status:200 request:0.558s]


{'_shards': {'failed': 0, 'skipped': 0, 'successful': 1, 'total': 1},
 'hits': {'hits': [{'_id': '1445131666754785281',
    '_index': 'facebook_outage_tweets',
    '_score': 0.029181112,
    '_source': {'cashtags': '[]',
     'conversation_id': 1445131666754785281,
     'created_at': '2021-10-04 21:00:13 UTC',
     'date': '2021-10-04',
     'geo': '-1',
     'hashtags': '[]',
     'id': 1445131666754785281,
     'language': 'hu',
     'likes_count': 0,
     'link': 'https://twitter.com/RobertoAylmer/status/1445131666754785281',
     'mentions': '[]',
     'name': 'Robert  Pilon',
     'near': '-1',
     'photos': '[]',
     'place': '-1',
     'quote_url': '-1',
     'replies_count': 0,
     'reply_to': '[]',
     'retweet': False,
     'retweet_date': '-1',
     'retweet_id': '-1',
     'retweets_count': 0,
     'source': '-1',
     'thumbnail': '-1',
     'time': '21:00:13',
     'timezone': 0,
     'trans_dest': '-1',
     'trans_src': '-1',
     'translate': '-1',
     'tweet': 'F

## Term level Queries(Fuzzy query)
> * Returns documents that contain terms similar to the search term, as measured by a Levenshtein edit distance.
> * An edit distance is the number of one-character changes needed to turn one term into another. These changes can include:
>> * Changing a character (box → fox)
>> * Removing a character (black → lack)
>> * Inserting a character (sic → sick)
>> * Transposing two adjacent characters (act → cat)
> * To find similar terms, the fuzzy query creates a set of all possible variations, or expansions, of the search term within a specified edit distance. The query then returns exact matches for each expansion.
> * This can be useful to search even if the word is not spelled correctly.
> * The fields used are described below,
>> * **value**: The term to be searched.
>> * **fuzziness**: Maximum edit distance allowed.
>> * **max_expansions**: Maximum number of variation created default -50
>> * **prefix_length**:Number of beginning characters left unchanged when creating expansions. Defaults to 0.
>> * **transpositions**:Indicates whether edits include transpositions of two adjacent characters (ab → ba). Defaults to true.

In [56]:
query = {
  "query": {
    "fuzzy": {
      "tweet": {
        "value": "faceook",
        "fuzziness": "AUTO",
        "max_expansions": 50,
        "prefix_length": 0,
        "transpositions": True
      }
    }
  }
}
es.search(body = query,index = "facebook_outage_tweets")

POST https://7432c5883b5246debc9e9bf514c1e552.eastus2.azure.elastic-cloud.com:9243/facebook_outage_tweets/_search [status:200 request:0.143s]


  


{'_shards': {'failed': 0, 'skipped': 0, 'successful': 1, 'total': 1},
 'hits': {'hits': [{'_id': '1445143936603930629',
    '_index': 'facebook_outage_tweets',
    '_score': 0.03839601,
    '_source': {'cashtags': '[]',
     'conversation_id': 1445143936603930629,
     'created_at': '2021-10-04 21:48:58 UTC',
     'date': '2021-10-04',
     'geo': '-1',
     'hashtags': '[]',
     'id': 1445143936603930629,
     'language': 'pt',
     'likes_count': 1,
     'link': 'https://twitter.com/Sakur4t3Amo/status/1445143936603930629',
     'mentions': '[]',
     'name': 'Mia',
     'near': '-1',
     'photos': '[]',
     'place': '-1',
     'quote_url': '-1',
     'replies_count': 1,
     'reply_to': '[]',
     'retweet': False,
     'retweet_date': '-1',
     'retweet_id': '-1',
     'retweets_count': 0,
     'source': '-1',
     'thumbnail': '-1',
     'time': '21:48:58',
     'timezone': 0,
     'trans_dest': '-1',
     'trans_src': '-1',
     'translate': '-1',
     'tweet': 'volvio facebok

## Term level Queries(prefix query)
> * Returns documents that contain a specific prefix in a provided field.


In [58]:
query = {
  "query": {
    "prefix": {
      "name": {
        "value": "tw"
      }
    }
  }
}
es.search(body = query,index = "facebook_outage_tweets")

POST https://7432c5883b5246debc9e9bf514c1e552.eastus2.azure.elastic-cloud.com:9243/facebook_outage_tweets/_search [status:200 request:0.109s]


  # Remove the CWD from sys.path while we load stuff.


{'_shards': {'failed': 0, 'skipped': 0, 'successful': 1, 'total': 1},
 'hits': {'hits': [{'_id': '1445143889472630784',
    '_index': 'facebook_outage_tweets',
    '_score': 1.0,
    '_source': {'cashtags': '[]',
     'conversation_id': 1445143889472630784,
     'created_at': '2021-10-04 21:48:47 UTC',
     'date': '2021-10-04',
     'geo': '-1',
     'hashtags': '[]',
     'id': 1445143889472630784,
     'language': 'en',
     'likes_count': 0,
     'link': 'https://twitter.com/TwoPeasNAPod2/status/1445143889472630784',
     'mentions': "[{'screen_name': 'cryptomanran', 'name': 'ran neuner', 'id': '58487473'}, {'screen_name': 'el33th4xor', 'name': 'emin gün sirer🔺', 'id': '399412477'}]",
     'name': 'Two Peas N A Pod',
     'near': '-1',
     'photos': '[]',
     'place': '-1',
     'quote_url': '-1',
     'replies_count': 1,
     'reply_to': '[]',
     'retweet': False,
     'retweet_date': '-1',
     'retweet_id': '-1',
     'retweets_count': 0,
     'source': '-1',
     'thumbnail

## Term level Queries(Range query)
> * Returns documents that contain terms within a provided range.


In [59]:
query = {
  "query": {
    "range": {
      "likes_count": {
        "gte": 100,
        "lte": 200,
        "boost": 2.0
      }
    }
  }
}
es.search(body = query,index = "facebook_outage_tweets")

POST https://7432c5883b5246debc9e9bf514c1e552.eastus2.azure.elastic-cloud.com:9243/facebook_outage_tweets/_search [status:200 request:0.109s]


  if sys.path[0] == '':


{'_shards': {'failed': 0, 'skipped': 0, 'successful': 1, 'total': 1},
 'hits': {'hits': [{'_id': '1445143926038532099',
    '_index': 'facebook_outage_tweets',
    '_score': 2.0,
    '_source': {'cashtags': '[]',
     'conversation_id': 1445143926038532099,
     'created_at': '2021-10-04 21:48:56 UTC',
     'date': '2021-10-04',
     'geo': '-1',
     'hashtags': '[]',
     'id': 1445143926038532099,
     'language': 'fr',
     'likes_count': 131,
     'link': 'https://twitter.com/SKFCB_/status/1445143926038532099',
     'mentions': '[]',
     'name': 'ƧK',
     'near': '-1',
     'photos': "['https://pbs.twimg.com/media/FA4uvTQWQAEzgoH.jpg']",
     'place': '-1',
     'quote_url': '-1',
     'replies_count': 2,
     'reply_to': '[]',
     'retweet': False,
     'retweet_date': '-1',
     'retweet_id': '-1',
     'retweets_count': 2,
     'source': '-1',
     'thumbnail': 'https://pbs.twimg.com/media/FA4uvTQWQAEzgoH.jpg',
     'time': '21:48:56',
     'timezone': 0,
     'trans_dest': 

## Term level Queries(Regexp query)
> * Returns documents that contain terms matching a regular expression.
> * **value**: This will have the regular expression that is needed.
> * **flags**:Enables optional operators for the regular expression. For valid values and more information, see [Regular expression syntax](https://www.elastic.co/guide/en/elasticsearch/reference/current/regexp-syntax.html#regexp-optional-operators).
> * **max_determinized_states**: Maximum number of [automaton states](https://en.wikipedia.org/wiki/Deterministic_finite_automaton) required for the query. Default is 10000.
> * **rewrite**: Method used to rewrite the query.

In [63]:
query = {
  "query": {
    "regexp": {
      "tweet": {
        "value": "k.*y",
        "flags": "ALL",
        "case_insensitive": True,
        "max_determinized_states": 10000,
        "rewrite": "constant_score"
      }
    }
  }
}
es.search(body = query,index = "facebook_outage_tweets")

POST https://7432c5883b5246debc9e9bf514c1e552.eastus2.azure.elastic-cloud.com:9243/facebook_outage_tweets/_search [status:200 request:0.105s]


  


{'_shards': {'failed': 0, 'skipped': 0, 'successful': 1, 'total': 1},
 'hits': {'hits': [{'_id': '1445143875945828365',
    '_index': 'facebook_outage_tweets',
    '_score': 1.0,
    '_source': {'cashtags': '[]',
     'conversation_id': 1445143875945828365,
     'created_at': '2021-10-04 21:48:44 UTC',
     'date': '2021-10-04',
     'geo': '-1',
     'hashtags': '[]',
     'id': 1445143875945828365,
     'language': 'tl',
     'likes_count': 0,
     'link': 'https://twitter.com/kiman13_/status/1445143875945828365',
     'mentions': '[]',
     'name': 'kim',
     'near': '-1',
     'photos': '[]',
     'place': '-1',
     'quote_url': '-1',
     'replies_count': 1,
     'reply_to': '[]',
     'retweet': False,
     'retweet_date': '-1',
     'retweet_id': '-1',
     'retweets_count': 0,
     'source': '-1',
     'thumbnail': '-1',
     'time': '21:48:44',
     'timezone': 0,
     'trans_dest': '-1',
     'trans_src': '-1',
     'translate': '-1',
     'tweet': 'yey wara klase kay outag

# Conclusion:
> * This Notebook tries to show how to connect with the elastic search insatnce and basic text retrival techniques using elastic search.
> * Most of the queries used can be nested together to form a complex query as per requirement.

