# Index JSON File in Elasticsearch using Python Client

https://www.elastic.co/docs/reference/elasticsearch/clients/python

To index a file containing a list of JSON objects into Elasticsearch, the most efficient method is to use the Elasticsearch Bulk API via Python. This approach allows you to index multiple documents in a single request, significantly improving performance compared to indexing documents individually.

First, ensure the Elasticsearch Python client is installed. You can install it with pip (`pip install elasticsearch`) or, as in the notebook cells below, with conda:

```bash
# Install the client (uncomment to run from a notebook cell)
# !conda install -y elasticsearch

# Check which version is installed
!conda list elasticsearch

# packages in environment at /opt/anaconda3:
#
# Name                     Version          Build            Channel
elasticsearch              9.2.0            py313hca03da5_0
```
Next, create a Python script that reads the JSON file, processes the list of objects, and uses the `helpers.bulk()` method to index them. Here's a complete example:
```python

import json
from elasticsearch import Elasticsearch, helpers
from datetime import datetime

# Connect to Elasticsearch over HTTPS.
# Copy http_ca.crt out of the ES container first (config/certs/http_ca.crt).
client = Elasticsearch(
    "https://localhost:9200",
    ca_certs="/Users/blauerbock/.ssh/http_ca.crt",
    basic_auth=("elastic", "giraffe"),  # replace with your own credentials
)

# Define the index name
index_name = "users"

# Function to load JSON data from a file
def load_json_data(filename):
    with open(filename, "r", encoding="utf-8") as file:
        return json.load(file)

# Load the JSON data (replace the path with your own file)
data_path = r"/Users/blauerbock/workspaces/python-workout/recursive_json_parsing/data/users.json"
json_data = load_json_data(data_path)

# Prepare the list of documents for bulk indexing
doc_list = []
for i, doc in enumerate(json_data):
    # Optionally add a timestamp
    doc["timestamp"] = datetime.now().isoformat()
    # Optionally set a custom _id
    doc["_id"] = i
    doc_list.append(doc)

# Inspect the first two documents
print(doc_list[:2])

# Output (truncated in the source):
# [{'id': 1, 'firstName': 'Emily', 'lastName': 'Johnson', 'maidenName': 'Smith', 'age': 29, 'gender': 'female', 'email': 'emily.johnson@x.dummyjson.com', 'phone': '+81 965-431-3024', 'username': 'emilys', 'password': 'emilyspass', 'birthDate': '1996-5-30', 'image': 'https://dummyjson.com/icon/emilys/128', 'bloodGroup': 'O-', 'height': 193.24, 'weight': 63.16, 'eyeColor': 'Green', 'hair': {'color': 'Brown', 'type': 'Curly'}, 'ip': '42.48.100.32', 'address': {'address': '626 Main Street', 'city': 'Phoenix', 'state': 'Mississippi', 'stateCode': 'MS', 'postalCode': '29112', 'coordinates': {'lat': -77.16213, 'lng': -92.084824}, 'country': 'United States'}, 'macAddress': '47:fa:41:18:ec:eb', 'university': 'University of Wisconsin--Madison', 'bank': {'cardExpire': '05/28', 'cardNumber': '3693233511855044', 'cardType': 'Diners Club International', 'currency': 'GBP', 'iban': 'GB74MH2UZLR9TRPHYNU8F8'}, 'company': {'department': 'Engineering', 'name': 'Dooley, Kozey and Cronin', 'title': 'Sales Man

# Perform bulk indexing
try:
    print("Attempting to index the list of docs using helpers.bulk()")
    # helpers.bulk() returns a tuple: (number of successes, list of errors)
    response = helpers.bulk(
        client,
        doc_list,
        index=index_name,
        # No doc_type argument: mapping types were removed in Elasticsearch 8.0
    )
    print("Bulk indexing completed:", response)
except Exception as e:
    print("Error during bulk indexing:", e)
# Output from a successful run:
# Attempting to index the list of docs using helpers.bulk()
# Bulk indexing completed: (30, [])
#
# Output from a run where the connection failed:
# Attempting to index the list of docs using helpers.bulk()
# Error during bulk indexing: Connection error caused by: ConnectionError(Connection error caused by: ProtocolError(('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))))
```

This script assumes your JSON file contains a list of objects, such as:
```json
[
  {"name": "Alice", "age": 30},
  {"name": "Bob", "age": 25}
]
```
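`helpers.bulk()` accepts any iterable of dictionaries (a list, as here, or a generator), where each dictionary represents one Elasticsearch document. Metadata keys such as `_id` are pulled out of the dictionary and sent as bulk action metadata, while ordinary keys such as `timestamp` are indexed as document fields. The `doc_type` parameter is intentionally left out: mapping types were deprecated starting with Elasticsearch 6.0 and removed in 8.0, so it does not apply to recent client versions such as the 9.2.0 release installed above.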
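
As a minimal sketch of that document shape, the generator below wraps each object in an explicit action with `_index`, `_id`, and `_source` keys instead of adding `_id` to the original dictionary. The function name `build_actions` and the sample records are illustrative assumptions, not part of the script above; the resulting iterable could be passed to `helpers.bulk(client, actions)` in place of `doc_list`.

```python
from datetime import datetime, timezone


def build_actions(docs, index_name):
    """Yield one bulk action per document without mutating the input."""
    for i, doc in enumerate(docs):
        # Copy the document so the caller's data is left untouched.
        source = dict(doc)
        source["timestamp"] = datetime.now(timezone.utc).isoformat()
        yield {
            "_index": index_name,  # target index
            "_id": i,              # custom document ID (optional)
            "_source": source,     # the document body itself
        }


# Hypothetical sample data in the same shape as the JSON list above.
sample = [{"name": "Alice", "age": 30}, {"name": "Bob", "age": 25}]
actions = list(build_actions(sample, "users"))
print(actions[0]["_id"], actions[0]["_source"]["name"])
```

Because the function is a generator, documents are produced one at a time, which may help when the input file is too large to duplicate comfortably in memory.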

Alternatively, you can use the `curl` command with the `_bulk` endpoint if you prefer a command-line approach. The JSON file must be formatted with alternating lines: one line for the action/metadata (e.g., `{"index": {"_id": "1"}}`) and one line for the document body. For example:

```json
{"index": {"_id": "1"}}
{"name": "Alice", "age": 30}
{"index": {"_id": "2"}}
{"name": "Bob", "age": 25}
```

Then use the following command:

```bash
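# Elasticsearch rejects bulk requests without a JSON content type.
# For a TLS-secured cluster, use https:// and add --cacert and -u as needed.
curl -X POST "http://localhost:9200/your_index_name/_bulk" \
  -H "Content-Type: application/x-ndjson" \
  --data-binary @data.json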
```
This method requires the file to be in NDJSON (Newline Delimited JSON) format, where each line is a valid JSON object. The last line must also end with a newline character.
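
If your data starts out as a plain JSON list, it has to be converted to this alternating layout first. The following is one possible sketch of that conversion; the helper name `to_bulk_ndjson` and the choice of sequential string IDs starting at 1 are assumptions made for illustration.

```python
import json


def to_bulk_ndjson(docs, start_id=1):
    """Return a bulk request body: one action line, then one document line."""
    lines = []
    for doc_id, doc in enumerate(docs, start=start_id):
        lines.append(json.dumps({"index": {"_id": str(doc_id)}}))
        lines.append(json.dumps(doc))
    # The bulk body must end with a newline character.
    return "\n".join(lines) + "\n"


# Hypothetical sample data in the same shape as the JSON list above.
sample = [{"name": "Alice", "age": 30}, {"name": "Bob", "age": 25}]
body = to_bulk_ndjson(sample)
print(body, end="")
```

Writing `body` to a file such as `data.json` would give an input suitable for the `--data-binary @data.json` option.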

Both approaches work, but the Python `helpers.bulk()` method is generally preferred because it integrates easily with existing code, reports indexing errors, and lets you preprocess the data before indexing.
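
To illustrate the kind of preprocessing the Python route makes easy, here is a small sketch that drops selected fields from a document before it is indexed. The function name `scrub`, the `SENSITIVE_FIELDS` set, and the sample record are hypothetical and chosen only for this example.

```python
import copy

# Hypothetical set of top-level fields to drop before indexing.
SENSITIVE_FIELDS = {"password"}


def scrub(doc, drop=SENSITIVE_FIELDS):
    """Return a deep copy of doc with the listed top-level fields removed."""
    cleaned = copy.deepcopy(doc)
    for field in drop:
        cleaned.pop(field, None)  # ignore fields that are not present
    return cleaned


# Hypothetical sample record.
sample = {"username": "emilys", "password": "emilyspass", "age": 29}
print(scrub(sample))
```

A list comprehension such as `[scrub(doc) for doc in json_data]` could apply the same step to every document before the list is handed to `helpers.bulk()`.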