# ETL: JSON file to Elasticsearch / OpenSearch (non-SQL data)

In this notebook we process a JSON file in Python (Extract), select some of its fields (Transform), and upload the result to Elasticsearch / OpenSearch (Load).

In [1]:
import pandas as pd
import json

In [2]:
from elasticsearch import Elasticsearch
from elasticsearch import helpers

es = Elasticsearch([{'host':'localhost', 'port': 9200}])
print(es)

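from opensearchpy import OpenSearch
# Create the OpenSearch client with default settings (no SSL/TLS options are passed here).
# It is named os_client so that it does not shadow Python's standard `os` module.
os_client = OpenSearch(hosts = [{'host': 'localhost', 'port': 9200}])
print(os_client)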

<Elasticsearch([{'host': 'localhost', 'port': 9200}])>
<OpenSearch([{'host': 'localhost', 'port': 9200}])>


Loading JSON data into Python

In [3]:
filename='./data_json/datos_json.json'

# Each line of the file is parsed as one JSON document.
with open(filename, 'r') as f:
    object_list = []
    for line in f:
        object_list.append(json.loads(line))
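The loop above stops at the first blank or malformed line, because `json.loads` raises an exception. As a minimal sketch, assuming the file holds one JSON document per line, a more tolerant reader could look like the following. The helper name `read_json_lines` is hypothetical and not part of the original notebook.

```python
import json

def read_json_lines(path):
    """Parse a file with one JSON document per line.

    Returns a tuple (records, errors), where errors is a list of
    (line_number, message) pairs for lines that could not be parsed.
    """
    records, errors = [], []
    with open(path, 'r', encoding='utf-8') as f:
        for line_number, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                # Skip empty lines instead of failing on them.
                continue
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError as exc:
                # Keep going, but remember which line was malformed.
                errors.append((line_number, str(exc)))
    return records, errors
```

It could be called as `object_list, errors = read_json_lines(filename)`, after which `errors` can be inspected before anything is loaded.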


After loading the file, we can explore its structure

In [4]:
#print(len(object_list))
print(json.dumps(object_list[0], indent=4, sort_keys=True))

{
    "country": "AR",
    "id": "x7mDIiDB3jEiPGPHOmDzyw",
    "variables": {
        "abstract": "The Big Data is a technological advent of processing large volumes of data that has gained notoriety because of opportunities and challenges around their usefulness in supporting the business. Therefore, this article, through a systematic literature review to identify how they are connected to Big Data and the corporate world. For this, 439 papers are investigated in terms of type of publication, annual trends in production, leading authors and institutions. As a result it was identified that there is a growing interest in the topic Big Data connected to business, both in the productions linked to scientific institutions, as those linked to companies. Were also observed theme leads to a very wide range of businesses, health marketing, or public transportation to education, always linked to better decision making.",
        "concepts": [
            "Big",
            "Big Data",
         

We can access a specific record and field

In [5]:
object_list[0]['variables'].get('abstract')

'The Big Data is a technological advent of processing large volumes of data that has gained notoriety because of opportunities and challenges around their usefulness in supporting the business. Therefore, this article, through a systematic literature review to identify how they are connected to Big Data and the corporate world. For this, 439 papers are investigated in terms of type of publication, annual trends in production, leading authors and institutions. As a result it was identified that there is a growing interest in the topic Big Data connected to business, both in the productions linked to scientific institutions, as those linked to companies. Were also observed theme leads to a very wide range of businesses, health marketing, or public transportation to education, always linked to better decision making.'
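The expression above raises a `KeyError` if a record has no `variables` key. A minimal sketch of a safer lookup is shown below. The helper `get_nested` is hypothetical, and `sample` is an abbreviated stand-in for the first record, not the full data.

```python
def get_nested(record, *keys, default=None):
    """Walk nested dictionaries key by key; return default if any key is missing."""
    current = record
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current

# Abbreviated stand-in for object_list[0] (assumption: real records have more fields).
sample = {
    'id': 'x7mDIiDB3jEiPGPHOmDzyw',
    'country': 'AR',
    'variables': {'concepts': ['Big', 'Big Data']},
}

print(get_nested(sample, 'variables', 'concepts'))
print(get_nested(sample, 'variables', 'abstract', default=''))
```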

We can load each JSON document into Elasticsearch / OpenSearch

In [7]:
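# `record` is used instead of `object` to avoid shadowing the Python built-in.
for record in object_list:
    print("loading json file", record['id'])
    # doc_type applies to older clusters; mapping types were deprecated in Elasticsearch 7 and removed in 8.
    es.index(index='test', doc_type='doc', id=record['id'], body=record)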

loading json file x7mDIiDB3jEiPGPHOmDzyw
loading json file dDl8zu1vWPdKGihJrwQbpw
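Indexing one document per request becomes slow for larger files. The `helpers` module imported in cell In [2] provides `helpers.bulk`, which takes a client and an iterable of actions. The sketch below only builds the actions, so it runs without a cluster. The generator name `build_actions` and the abbreviated `sample_records` are assumptions for illustration.

```python
def build_actions(records, index_name):
    """Yield one bulk action per record, reusing the record's own id as the document id."""
    for record in records:
        yield {
            '_index': index_name,
            '_id': record['id'],
            '_source': record,
        }

# Abbreviated stand-in for object_list (assumption: real records have more fields).
sample_records = [{'id': 'x7mDIiDB3jEiPGPHOmDzyw', 'country': 'AR'}]
actions = list(build_actions(sample_records, 'test'))
print(actions[0]['_id'])

# With a reachable cluster and the `es` client created earlier, the call would be:
# success_count, errors = helpers.bulk(es, build_actions(object_list, 'test'))
```

Because `build_actions` is a generator, the documents are streamed to the bulk helper rather than copied into a second list first.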


Alternatively, we can keep only some fields of the original JSON document before loading it into Elasticsearch

In [17]:
for record in object_list:
    # First, extract the fields we want to keep.
    # doc_id is used instead of id to avoid shadowing the Python built-in.
    doc_id = record['id']
    concepts = record['variables'].get('concepts', '')
    # Second, build a dictionary with the selected fields.
    data = {}
    data['id'] = doc_id
    data['concepts'] = concepts
    # Third, serialize the dictionary to a JSON string.
    json_data = json.dumps(data)
    print(json_data)

    es.index(index='funciona', doc_type='doc', id=doc_id, body=json_data)

{"id": "x7mDIiDB3jEiPGPHOmDzyw", "concepts": ["Big", "Big Data", "Data", "technological advent", "advent", "large volumes", "volumes", "data", "notoriety", "opportunities", "challenges", "usefulness", "business", "article", "systematic literature", "systematic literature review", "literature", "literature review", "review", "Big", "Big Data", "Data", "corporate world", "world", "papers", "terms", "type", "publication", "annual trends", "trends", "production", "authors", "institutions", "result", "interest", "topic", "Big", "Big Data", "Data", "business", "productions", "scientific institutions", "institutions", "companies", "theme", "wide range", "range", "businesses", "health", "health marketing", "marketing", "public transportation", "transportation", "education", "better decision", "better decision making", "decision", "decision making", "making"]}
{"id": "dDl8zu1vWPdKGihJrwQbpw", "concepts": ["Big", "Big Data", "Data", "technological advent", "advent", "large volumes", "volumes", "
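The printed output shows that `concepts` contains repeated entries ("Big", "Big Data" and "Data" appear several times). If the repetitions are not needed, the transform step can drop them while keeping the original order. This is an optional sketch, not part of the original notebook. The function name `select_fields` is hypothetical, and `sample` is an abbreviated stand-in for a real record.

```python
def select_fields(record):
    """Keep only id and concepts, removing duplicate concepts in first-seen order."""
    concepts = record.get('variables', {}).get('concepts', [])
    # dict.fromkeys keeps the first occurrence of each key and preserves insertion order.
    unique_concepts = list(dict.fromkeys(concepts))
    return {'id': record['id'], 'concepts': unique_concepts}

# Abbreviated stand-in for the first record (assumption: the real list is much longer).
sample = {
    'id': 'x7mDIiDB3jEiPGPHOmDzyw',
    'variables': {
        'concepts': ['Big', 'Big Data', 'Data', 'Big', 'Big Data', 'Data', 'corporate world'],
    },
}

print(select_fields(sample))
```

Keeping the transform in a pure function like this makes it easy to check on a few records before anything is sent to the cluster.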