## Once more, with Python

### Installation of the Elasticsearch client

We've seen in the previous notebooks how we can send requests to the Elasticsearch API through the console developer tool. However, we can get the same functionality directly from our Python notebook, without having to switch to the browser interface. Let's recreate some of our steps with the Elasticsearch Python client.

First, we will install `elasticsearch` in our Python (virtual) environment. This is the official low-level client for Elasticsearch, which lets you interact with the search engine directly from Python. You will see that the syntax is not that different from the requests we wrote in the previous steps. [See the docs](https://elasticsearch-py.readthedocs.io/en/v8.9.0/) for more info on this client.

```
pip install elasticsearch
```


### Connecting & Authenticating to Elastic Cloud

First we have to create a connection to our Elasticsearch deployment.

For security, keep the keys and secrets out of the notebook itself. You can save them in a config file [(see docs for more info)](https://docs.python.org/3/library/configparser.html) by filling in your own keys [in the example file provided](/foobar-example.ini), or enter them at the prompt as in the next block. Only copy and paste the values directly into the code if you do not plan to share it.

If you are connecting to Elastic Cloud using SSO, you can still find your username (usually `elastic`) and password in the cloud UI under Deployments - Security. [(see example to reset password)](https://www.elastic.co/guide/en/cloud/current/ec-password-reset.html)

In [1]:
from getpass import getpass  # For securely getting user input
from elasticsearch import Elasticsearch

# Prompt the user to enter their Elastic Cloud ID and API Key securely
ELASTIC_CLOUD_ID = getpass("Elastic Cloud ID: ")
ELASTIC_API_KEY = getpass("Elastic API Key: ")

# Create an Elasticsearch client using the provided credentials
client = Elasticsearch(
    cloud_id=ELASTIC_CLOUD_ID,  # cloud id can be found under deployment management
    api_key=ELASTIC_API_KEY # API keys can be generated under management / security
)
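
If the credentials are kept in a config file instead of being typed at a prompt, reading them could look roughly like the sketch below. The section name `ELASTIC` and the keys `cloud_id` and `api_key` are assumptions made for this illustration, not necessarily the layout of the example file, and the demo writes placeholder values to a temporary file so it runs on its own.

```python
import configparser
import tempfile
from pathlib import Path


def load_credentials(path, section="ELASTIC"):
    """Read the Cloud ID and API key from an INI file.

    The section and key names are hypothetical; match them to your own file.
    """
    config = configparser.ConfigParser()
    # read() silently skips missing files, so check that the section was loaded
    if not config.read(path) or section not in config:
        raise FileNotFoundError(f"No [{section}] section found in {path}")
    return {
        "cloud_id": config[section]["cloud_id"],
        "api_key": config[section]["api_key"],
    }


# Demo with a throwaway file holding placeholder values
with tempfile.TemporaryDirectory() as tmp:
    ini_path = Path(tmp) / "example.ini"
    ini_path.write_text(
        "[ELASTIC]\ncloud_id = my-deployment-id\napi_key = not-a-real-key\n"
    )
    credentials = load_credentials(ini_path)

print(sorted(credentials))
```

Because the returned keys match the client's keyword arguments, the dictionary could then be unpacked with `Elasticsearch(**credentials)`.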

### Elasticsearch Queries

Now that we're successfully connected to our cluster, we can run the same queries as [in the previous notebooks](/4.%20Search%20Magic.md), but with Python!

As a reminder, here is the first query we ran directly in the Elasticsearch console:
``` json
GET hp/_search
{
  "query": {
    "match": {
      "Loyalty": "Dumbledores Army"
    }
  }
}
```

The same query with the Python client will look like this:

In [9]:
response = client.search(index="hp", query={
    "match": {
        "Loyalty": "Dumbledores Army"
    }
})

print(response)

{'took': 1, 'timed_out': False, '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0}, 'hits': {'total': {'value': 31, 'relation': 'eq'}, 'max_score': 3.8102126, 'hits': [{'_index': 'hp', '_id': '8t5A9IkBHcQ5Wxo9CJrt', '_score': 3.8102126, '_ignored': ['Death', 'Birth'], '_source': {'column1': 84, 'Wand': 'Unknown', 'Hair colour': 'Red', 'House': 'Hufflepuff', 'Gender': 'Male', 'Patronus': 'Noncorporeal', 'Birth': 'Unknown', 'Blood status': 'muggleborn', 'Name': 'Justin FinchFletchley', 'Skills': 'Unknown', 'Death': 'Unknown', 'Eye colour': 'Unknown', 'Job': 'Student', 'Loyalty': 'Dumbledores Army', 'Species': 'Human'}}, {'_index': 'hp', '_id': '895A9IkBHcQ5Wxo9CJrt', '_score': 3.8102126, '_ignored': ['Death', 'Birth'], '_source': {'column1': 85, 'Wand': 'Unknown', 'Hair colour': 'Blonde', 'House': 'Hufflepuff', 'Gender': 'Male', 'Patronus': 'Unknown', 'Birth': 'Unknown', 'Blood status': 'pureblood or halfblood', 'Name': 'Zacharias Smith', 'Skills': 'Chaser', 'Death': 'Un

We see the same JSON response as we got in the direct console calls. However, since we're already working in Python, we can also clean up our response and make it more readable:

In [10]:
print("We get back {total} results, here are the top ones:".format(total=response["hits"]['total']['value']))
for hit in response["hits"]["hits"]:
    print(hit['_source']['Name'])

We get back 31 results, here are the top ones:
Justin FinchFletchley
Zacharias Smith
Hannah Abbott
Ernest Macmillan
Susan Bones
Dennis Creevey
Dean Thomas
Seamus Finnigan
Angelina Johnson
Katie Bell
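
Only ten names are listed although 31 documents match, because a search returns ten hits by default; the `size` parameter of `client.search` raises that limit. Since the same unpacking is repeated after every query, it can be wrapped in small helpers. The function names and the trimmed-down sample response below are made up for this sketch; a real response from the client can be indexed the same way.

```python
def total_hits(response):
    """Return the total number of matching documents reported by Elasticsearch."""
    return response["hits"]["total"]["value"]


def hit_values(response, field):
    """Return the value of `field` from the _source of every returned hit."""
    return [hit["_source"].get(field) for hit in response["hits"]["hits"]]


# A trimmed-down stand-in with the same shape as the response printed above
sample_response = {
    "hits": {
        "total": {"value": 31, "relation": "eq"},
        "hits": [
            {"_source": {"Name": "Justin FinchFletchley", "House": "Hufflepuff"}},
            {"_source": {"Name": "Zacharias Smith", "House": "Hufflepuff"}},
        ],
    }
}

print(total_hits(sample_response), hit_values(sample_response, "Name"))
```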


Everything we could do with queries in the console is available here too. Let's take our most complex query from the previous notebooks; you can use the same JSON within the Python search function like this:

In [11]:
response = client.search(index="hp", query={
    "bool": {
        "must": [
            {
                "multi_match": {
                    "query": "quidditch chaser keeper beater seeker",
                    "fields": ["Job", "Skills"]
                }
            },
            {
                "match": {
                    "House": "Gryffindor"
                }
            }
        ],
        "must_not": {
            "range": {
                "Birth": {
                    "lte": "1980-01-01"
                }
            }
        },
        "filter": {
            "term": {
                "Hair colour": "Red"
            }
        }
    }
})

In [12]:
print("We get back {total} results, here are the top ones:".format(total=response["hits"]['total']['value']))
for hit in response["hits"]["hits"]:
    print(hit['_source']['Name'])

We get back 3 results, here are the top ones:
Rose GrangerWeasley
Ronald Bilius Weasley
Ginevra Ginny Molly Weasley


## New Data Cleaning

Now let's expand upon our project and try out some more functionality in Python.

We will introduce our second dataset to the project: the (dialogue) script of the first Harry Potter movie. Like the previous dataset, this is taken from [this Kaggle project](https://www.kaggle.com/datasets/gulsahdemiryurek/harry-potter-dataset?select=Harry+Potter+1.csv) where you can download it. It has also been added to our [data folder](/data/Harry_Potter_1.csv) for convenience.

In [38]:
import pandas as pd
hp_script = pd.read_csv("data/Harry_Potter_1.csv", sep = ";" )

In [39]:
hp_script.head()

Unnamed: 0,Character,Sentence
0,Dumbledore,"I should've known that you would be here, Prof..."
1,McGonagall,"Good evening, Professor Dumbledore."
2,McGonagall,"Are the rumors true, Albus?"
3,Dumbledore,"I'm afraid so, professor."
4,Dumbledore,The good and the bad.


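First we have to clean up our data a bit. We'll make sure we don't have multiple instances of the same character by trimming the surrounding whitespace and removing special characters from the names. This takes us from 91 unique characters to 56. Note that the clean-up is applied to every column, so punctuation is removed from the sentences as well.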

In [40]:
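import re
unique_chars = hp_script["Character"].unique()
print("There are {} unique characters: {}".format(len(unique_chars), unique_chars))
# Trim surrounding whitespace, then drop everything except spaces, word characters
# and "+" (inside a character class "+" is literal). Applied to every cell.
# Note: pandas 2.1+ renames applymap to DataFrame.map.
hp_script = hp_script.applymap(lambda x: re.sub(r'[^ \w+]', '', str(x).strip()))
unique_chars = hp_script["Character"].unique()
print("There are {} unique characters: {}".format(len(unique_chars), unique_chars))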


There are 91 unique characters: ['Dumbledore' 'McGonagall' 'Hagrid' 'Petunia' 'Dudley' 'Vernon' 'Harry'
 'Snake' 'Someone' 'Barkeep\xa0Tom' 'Man' 'Witch' 'Quirrell' 'Boy'
 'Goblin' 'Griphook' 'Ollivander' 'Trainmaster' 'Mrs. Weasley' 'George'
 'Fred' 'Ginny' 'Ron' 'Woman' 'Hermione' 'Neville' 'Malfoy' 'Whispers'
 'Sorting Hat' 'Seamus' 'Percy' 'Sir Nicholas' 'Girl' 'Man in paint'
 'Fat Lady' 'Snape' 'Dean' 'Madam Hooch' 'Class' 'Harry ' 'Fred  ' 'Ron  '
 'George  ' 'Harry  ' 'Hermione  ' 'Ron ' 'Hermione ' 'Filch' 'All  '
 'Oliver ' 'Oliver  ' 'Flitwick' 'Draco  ' 'Flitwick  ' 'Seamus  '
 'Girl  ' 'Boy  ' 'Percy  ' 'McGonagall ' 'Ron and Harry' 'McGonagall  '
 'Quirrell  ' 'Snape  ' 'OIiver  ' 'Lee Jordan' 'Hagrid ' 'Gryffindors  '
 'Flint  ' 'Crowd  ' 'Flint' 'Hagrid  ' 'Man  ' 'Lee  Jordan'
 'Madam Hooch ' 'Quirrell ' 'Filch  ' 'Dumbledore  ' 'Hermoine'
 'Ron and Harry  ' 'All 3  ' 'Filch ' 'Firenze  ' 'Firenze ' 'Snape '
 'Neville  ' 'Ron   ' 'Voldemort ' 'Voldemort' 'Voldemort  ' '

In [41]:
hp_script["Line_number"] = hp_script.index
hp_script.head()

Unnamed: 0,Character,Sentence,Line_number
0,Dumbledore,I shouldve known that you would be here Profes...,0
1,McGonagall,Good evening Professor Dumbledore,1
2,McGonagall,Are the rumors true Albus,2
3,Dumbledore,Im afraid so professor,3
4,Dumbledore,The good and the bad,4
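
The name clean-up above trims only the ends of each string, so a name with a doubled inner space (`'Lee  Jordan'`) remains distinct from `'Lee Jordan'`, and the non-breaking space in `'Barkeep\xa0Tom'` is deleted rather than turned into a normal space. As a possible refinement, and not the method used in this notebook, a stricter normaliser could also collapse inner whitespace. `clean_name` is a hypothetical helper; applying it would change the count of 56 reported above, and it does not fix misspellings such as `'Hermoine'`.

```python
import re


def clean_name(raw):
    """Normalise a character name: drop punctuation, unify whitespace, trim."""
    # remove everything that is neither a word character nor whitespace
    without_punctuation = re.sub(r"[^\w\s]", "", str(raw))
    # \s also matches the non-breaking space (\xa0), so any run of
    # whitespace becomes a single plain space
    collapsed = re.sub(r"\s+", " ", without_punctuation)
    return collapsed.strip()


samples = ["Harry  ", "Lee  Jordan", "Lee Jordan", "Barkeep\xa0Tom", "Mrs. Weasley"]
print(sorted({clean_name(name) for name in samples}))
```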


### Creating an index and mapping

We will add this to our Elasticsearch cluster via code. First, we must create a new index and mapping like we've seen in [the previous notebooks](/3.%20Index%20Mapping.md).

In [42]:
index = "hp_script_1"
settings = {}
mappings = {
    "_meta" : {
        "created_by" : "Iulia Feroli"
    },
    "properties" : {
        "Line_number" : {
            "type" : "long"
        },
        "Character" : {
            # a field has exactly one type; "text" lets names be matched word by word
            "type" : "text"
        },
        "Sentence" : {
            "type" : "text"
        }
    }
}

client.indices.create(index=index, settings=settings, mappings=mappings)

ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'hp_script_1'})
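
A field in the mapping can have only one `type`, and a Python dict literal silently keeps the last value when a key is repeated. If both full-text search and exact matching on `Character` are wanted, one common option is a multi-field mapping. The sketch below is an alternative under that assumption, not the mapping created above.

```python
# Repeating a key in a dict literal keeps only the last value
duplicate_key = {"type": "keyword", "type": "text"}
print(duplicate_key)

# Hypothetical alternative mapping: "text" for full-text search,
# plus a "keyword" sub-field for exact matches and aggregations
multi_field_mappings = {
    "properties": {
        "Line_number": {"type": "long"},
        "Character": {
            "type": "text",
            "fields": {"keyword": {"type": "keyword"}},
        },
        "Sentence": {"type": "text"},
    }
}

print(multi_field_mappings["properties"]["Character"])
```

With such a mapping, the exact-match version of the field would be addressed as `Character.keyword` in queries.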

Now we can add our documents to the index. We can easily convert our dataframe into a list of dictionaries, one per row, to see the format each of our documents will take in the index.

In [52]:
from json import loads
docs = hp_script.to_json(orient = "records")
hp_script_docs = loads(docs)
hp_script_docs[0:5]

[{'Character': 'Dumbledore',
  'Sentence': 'I shouldve known that you would be here Professor McGonagall',
  'Line_number': 0},
 {'Character': 'McGonagall',
  'Sentence': 'Good evening Professor Dumbledore',
  'Line_number': 1},
 {'Character': 'McGonagall',
  'Sentence': 'Are the rumors true Albus',
  'Line_number': 2},
 {'Character': 'Dumbledore',
  'Sentence': 'Im afraid so professor',
  'Line_number': 3},
 {'Character': 'Dumbledore',
  'Sentence': 'The good and the bad',
  'Line_number': 4}]

We can index documents into our new index either one by one with the `index` function or, more conveniently when dealing with large numbers of documents, with the bulk helper. Let's run a test to see how `index` and `delete` work. See more info in the [docs here](https://www.elastic.co/guide/en/elasticsearch/client/python-api/current/examples.html).

In [86]:
doc_test = {
    'Character': 'Iulia Feroli',
    'Sentence': "Wow, I've just added myself to the Harry Potter Books, I have so much to say!",
    'Line_number': 0
}

response = client.index(index = index, id = 1, document = doc_test)
print(response['result'])

response = client.search(index = index)
for hit in response["hits"]["hits"]:
    print(hit['_source'])

updated
{'Character': 'Iulia Feroli', 'Sentence': "Wow, I've just added myself to the Harry Potter Books, I have so much to say!", 'Line_number': 0}


In [87]:
response = client.delete(index = index, id = 1)
print(response["result"])

deleted


And now let's index all our Harry Potter script lines. See the [bulk helper docs here](https://elasticsearch-py.readthedocs.io/en/7.x/helpers.html).

The helper consumes an iterable of documents and sends them to Elasticsearch in batched bulk requests rather than one request per document. Here we pass an iterator over our document list as the `actions` parameter, although any iterable works. With `stats_only = True` it returns only the number of successful and failed operations.

See a more complex example of this [here](https://github.com/elastic/elasticsearch-py/blob/main/examples/bulk-ingest/bulk-ingest.py).

In [97]:
from elasticsearch.helpers import bulk

response = bulk(client = client, index = index, actions = iter(hp_script_docs), stats_only = True )
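
Because the documents above carry no id, Elasticsearch generates one for each, so running the bulk cell a second time would add every line again. One way around this, sketched here as an assumption rather than what the notebook does, is to yield explicit bulk actions that reuse `Line_number` as the document id. `generate_actions` is a hypothetical helper; `_index`, `_id` and `_source` are the action keys the bulk helper understands.

```python
def generate_actions(docs, index_name):
    """Yield one bulk action per document, using Line_number as the document id."""
    for doc in docs:
        yield {
            "_index": index_name,
            "_id": doc["Line_number"],
            "_source": doc,
        }


sample_docs = [
    {"Character": "Dumbledore", "Sentence": "The good and the bad", "Line_number": 4},
    {"Character": "McGonagall", "Sentence": "And the boy", "Line_number": 5},
]

actions = list(generate_actions(sample_docs, "hp_script_1"))
print(actions[0])

# With a live connection, the generator could be passed straight to the helper:
# successes, failures = bulk(client=client,
#                            actions=generate_actions(hp_script_docs, index),
#                            stats_only=True)
```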

And that's it! Let's see if the bulk ingest worked by doing a general search of all index documents:

### Fuzzy, magical searches

In [100]:
response = client.search(index = index)

print("We get back {total} results, here are the top ones:".format(total=response["hits"]['total']['value']))
for hit in response["hits"]["hits"]:
    print(hit['_source'])

We get back 1587 results, here are the top ones:
{'Character': 'Dumbledore', 'Sentence': 'I shouldve known that you would be here Professor McGonagall', 'Line_number': 0}
{'Character': 'McGonagall', 'Sentence': 'Good evening Professor Dumbledore', 'Line_number': 1}
{'Character': 'McGonagall', 'Sentence': 'Are the rumors true Albus', 'Line_number': 2}
{'Character': 'Dumbledore', 'Sentence': 'Im afraid so professor', 'Line_number': 3}
{'Character': 'Dumbledore', 'Sentence': 'The good and the bad', 'Line_number': 4}
{'Character': 'McGonagall', 'Sentence': 'And the boy', 'Line_number': 5}
{'Character': 'Dumbledore', 'Sentence': 'Hagrid is bringing him', 'Line_number': 6}
{'Character': 'McGonagall', 'Sentence': 'Do you think it wise to trust Hagrid with something as important as this', 'Line_number': 7}
{'Character': 'Dumbledore', 'Sentence': 'Ah Professor I would trust Hagrid with my life', 'Line_number': 8}
{'Character': 'Hagrid', 'Sentence': 'Professor Dumbledore sir', 'Line_number': 9}


### Getting to NLP magic 
Now we can once again play with the search engine to look through our new index. This time we have much more natural language text, so we can build much cooler solutions!

As a first tease, let's look for one of the most famous recurring lines in the first movie.

You will see that the matches we get back are not limited to the exact query, word for word. The `match` query breaks the text into individual terms and returns sentences that contain any of them, so similar sentences are retrieved as well.

Elasticsearch also returns a relevance score by which the responses are ranked. Now things are getting exciting.

In the next phase, we will use vector search, embeddings, and similarity scores to explore some really fun stuff in the Harry Potter world. 

![](img/hagrid.jpeg)

In [107]:
response = client.search(index = index, query={
    "match" : {
        "Sentence" : "shouldn't have said that"
    }
})

print("We get back {total} results, here are the top ones:".format(total=response["hits"]['total']['value']))
for hit in response["hits"]["hits"]:
    print(hit["_score"], hit['_source'])

We get back 157 results, here are the top ones:
11.772943 {'Character': 'Hagrid', 'Sentence': 'I shouldnt have said that', 'Line_number': 961}
10.969593 {'Character': 'Hagrid', 'Sentence': 'I should not have said that', 'Line_number': 962}
10.969593 {'Character': 'Hagrid', 'Sentence': 'I should not have said that', 'Line_number': 963}
7.562082 {'Character': 'Hagrid', 'Sentence': 'Shouldnta said that  No more questions', 'Line_number': 945}
6.277094 {'Character': 'Neville', 'Sentence': 'She said that shed been in there all afternooncrying', 'Line_number': 800}
6.0444646 {'Character': 'Hagrid', 'Sentence': 'I shouldnt have told you that', 'Line_number': 1140}
6.0444646 {'Character': 'Hagrid', 'Sentence': 'I shouldnt have told you that', 'Line_number': 1141}
6.0444646 {'Character': 'Hagrid', 'Sentence': 'I shouldnt have told you that', 'Line_number': 1292}
5.6583557 {'Character': 'Harry', 'Sentence': 'Will I have to wear that too', 'Line_number': 96}
5.3653812 {'Character': 'Neville', 'Se
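
The long tail of loosely related results comes from the fact that a `match` query, by default, accepts a sentence as soon as it shares a single term with the query. The toy function below imitates that with plain lowercasing and whitespace splitting, which is only a rough stand-in for Elasticsearch's real text analysis. The `minimum_should_match` variant at the end is a possible way to tighten the search; it has not been run against the index here.

```python
def shared_terms(query, sentence):
    """Return the lowercase, whitespace-separated terms two texts have in common.

    This is only a rough stand-in for Elasticsearch's text analysis.
    """
    return set(query.lower().split()) & set(sentence.lower().split())


query_text = "shouldn't have said that"
print(shared_terms(query_text, "Will I have to wear that too"))
print(shared_terms(query_text, "I should not have said that"))

# A stricter variant of the query: at least 3 of the query terms must be present
strict_query = {
    "match": {
        "Sentence": {
            "query": query_text,
            "minimum_should_match": 3,
        }
    }
}
# response = client.search(index=index, query=strict_query)
```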