# Bulk import: Movielens dataset

In this lab we are going to import the MovieLens dataset into the `movies` index in Elasticsearch.

If you do not have your own OpenSearch instance running, you can import it into the `movies.cursoXXX` index in the main Elasticsearch cluster.

The `movies.csv` file has the following format:
```
movieId,title,genres
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,Jumanji (1995),Adventure|Children|Fantasy
```

The first line is the header, and each following line holds the data for one film.

We want to import it with the following fields:
- movieId
- title
- year
- genres

So we will have to split the CSV `title` field into two parts: the title and the year.

We will also have to parse the genres and create a list with them.
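The two parsing steps above can be sketched as follows. This is a minimal illustration, not the code of the lab's conversion script: the names `split_title` and `parse_genres` are hypothetical, and the sketch assumes that the year always appears as four digits in parentheses at the end of the title.

```python
import re

# Trailing four-digit year in parentheses, e.g. "Toy Story (1995)".
# Assumption: the year is always the last parenthesised group in the title.
TITLE_YEAR = re.compile(r"^(?P<title>.*?)\s*\((?P<year>\d{4})\)\s*$")


def split_title(raw):
    """Split 'Toy Story (1995)' into ('Toy Story', '1995').

    Returns (raw, None) when no trailing year is found.
    """
    match = TITLE_YEAR.match(raw)
    if match is None:
        return raw.strip(), None
    return match.group("title").strip(), match.group("year")


def parse_genres(raw):
    """Turn 'Adventure|Children|Fantasy' into a list of genre names."""
    return [genre for genre in raw.split("|") if genre]


print(split_title("Jumanji (1995)"))
print(parse_genres("Adventure|Children|Fantasy"))
```

Titles without a recognisable year are returned unchanged with `None` as the year, so the caller can decide whether to skip the `year` field for those rows.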

The objective is to generate a JSON file in this format, with one action line followed by one document line per film:
```
{"create": {"_index": "movies", "_id": 1}}
{"movieId": 1, "title": "Toy Story", "year": "1995", "genres": ["Adventure", "Animation", "Children", "Comedy", "Fantasy"]}
{"create": {"_index": "movies", "_id": 2}}
{"movieId": 2, "title": "Jumanji", "year": "1995", "genres": ["Adventure", "Children", "Fantasy"]}
```
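As a rough sketch of how each pair of lines in this format could be produced, the hypothetical helper `bulk_lines` below serialises an already parsed document with the standard `json` module. It assumes the document dictionary already contains `movieId`, `title`, `year` and `genres`.

```python
import json


def bulk_lines(doc, index):
    """Return the action line and the document line for one film.

    Each line is a complete JSON object terminated by a newline, which is
    what the bulk API expects (including after the last document).
    """
    action = {"create": {"_index": index, "_id": doc["movieId"]}}
    return json.dumps(action) + "\n" + json.dumps(doc) + "\n"


doc = {
    "movieId": 2,
    "title": "Jumanji",
    "year": "1995",
    "genres": ["Adventure", "Children", "Fantasy"],
}
print(bulk_lines(doc, "movies"), end="")
```

Writing one JSON object per line matters here: pretty-printed JSON that spans several lines would not be a valid bulk body.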

## Configuring the environment to point to our OpenSearch instance

Remember that before launching the following commands we have to set the configuration variables so that they point to our OpenSearch instance.

Set the `OPENSEARCH_HOST` variable to the IP address of your instance:

In [1]:
%%writefile opensearch-lab-config
OPENSEARCH_HOST="10.133.29.238"
OPENSEARCH_PORT=9200
OPENSEARCH_USER="admin"
OPENSEARCH_PASSWD="admin"

DATASET_LOCATION="/opt/cesga/cursos/pyspark_2022/datasets/"

export OPENSEARCH_HOST OPENSEARCH_PORT OPENSEARCH_USER OPENSEARCH_PASSWD DATASET_LOCATION

Overwriting opensearch-lab-config


Then load it with:
```bash
source notebook/supplementary/elasticsearch/exercises/opensearch-lab-config
```
and restart the notebook.
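After restarting, it can be useful to confirm that the variables are actually visible from the notebook. The sketch below is one possible check from a Python cell; the helper names `missing_vars` and `base_url` are hypothetical and not part of the lab material.

```python
import os

# Variables exported by opensearch-lab-config.
REQUIRED = (
    "OPENSEARCH_HOST",
    "OPENSEARCH_PORT",
    "OPENSEARCH_USER",
    "OPENSEARCH_PASSWD",
    "DATASET_LOCATION",
)


def missing_vars(env):
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED if not env.get(name)]


def base_url(env):
    """Build the HTTPS endpoint used by the curl commands in this lab."""
    return "https://{}:{}".format(env["OPENSEARCH_HOST"], env["OPENSEARCH_PORT"])


missing = missing_vars(os.environ)
if missing:
    print("Missing variables:", ", ".join(missing))
else:
    print("Endpoint:", base_url(os.environ))
```

If any variable is reported as missing, the configuration file was probably not sourced before the notebook was started.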

## Converting the csv file to bulk format

You can use the `convert_csv_to_bulk_format.py` script to convert the CSV file to the bulk format.

In [2]:
%%bash

module load anaconda3
python3 convert_csv_to_bulk_format.py ${DATASET_LOCATION}/movielens-latest-small/movies.csv movies.${USER} > movies-bulk.json

The generated bulk file will index the data into the `movies.${USER}` index.
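Before sending the file to the cluster, a quick sanity check of its structure can save a failed import. The following is a minimal sketch with a hypothetical `check_bulk` helper; it assumes the file only contains `create` actions and that each `_id` is taken from the document's `movieId`, as in the format shown earlier.

```python
import json


def check_bulk(text, index):
    """Return a list of problems found in a bulk request body.

    Lines are expected to alternate between an action and a document.
    Invalid JSON is not caught here and raises json.JSONDecodeError.
    """
    problems = []
    if not text.endswith("\n"):
        problems.append("body does not end with a newline")
    lines = text.splitlines()
    if len(lines) % 2 != 0:
        problems.append("odd number of lines")
    for i in range(0, len(lines) - 1, 2):
        meta = json.loads(lines[i]).get("create", {})
        doc = json.loads(lines[i + 1])
        if meta.get("_index") != index:
            problems.append(f"line {i + 1}: unexpected index {meta.get('_index')!r}")
        if meta.get("_id") != doc.get("movieId"):
            problems.append(f"line {i + 1}: _id does not match movieId")
    return problems


sample = (
    '{"create": {"_index": "movies", "_id": 1}}\n'
    '{"movieId": 1, "title": "Toy Story", "year": "1995", "genres": ["Adventure"]}\n'
)
print(check_bulk(sample, "movies"))
```

To check the real file, read `movies-bulk.json` into a string and pass the index name that was given to the conversion script.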

## Importing with dynamic mapping
We will start by importing the data using **dynamic mapping**.

TIP: Since the output of the command will be very long, in the notebook cell options (press right mouse button) you can "Enable Scrolling for Outputs".

In [3]:
%%bash

curl --insecure -u ${OPENSEARCH_USER}:${OPENSEARCH_PASSWD} \
    -X PUT -H "Content-Type: application/json" \
    --data-binary @movies-bulk.json \
    "https://${OPENSEARCH_HOST}:${OPENSEARCH_PORT}/_bulk"

{"took":546,"errors":false,"items":[{"create":{"_index":"movies.curso849","_type":"_doc","_id":"1","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":0,"_primary_term":1,"status":201}},{"create":{"_index":"movies.curso849","_type":"_doc","_id":"2","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":1,"_primary_term":1,"status":201}},{"create":{"_index":"movies.curso849","_type":"_doc","_id":"3","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":2,"_primary_term":1,"status":201}},{"create":{"_index":"movies.curso849","_type":"_doc","_id":"4","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":3,"_primary_term":1,"status":201}},{"create":{"_index":"movies.curso849","_type":"_doc","_id":"5","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4,"_primary_term":1,"status":201}},{"create":{"_index":"mov

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 3351k  100 1849k  100 1501k  2926k  2375k --:--:-- --:--:-- --:--:-- 5294k
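Since the bulk response contains one entry per document, it is easier to read as a summary. The sketch below assumes the response has been saved and parsed with `json.loads`; `summarize_bulk` is a hypothetical helper that relies only on the `errors`, `items`, `status` and `error` keys of the bulk response.

```python
from collections import Counter


def summarize_bulk(response):
    """Count HTTP statuses per item and collect the items that failed."""
    statuses = Counter()
    failures = []
    for item in response["items"]:
        # Each item has a single key naming the action, e.g. "create".
        action, result = next(iter(item.items()))
        statuses[result["status"]] += 1
        if "error" in result:
            failures.append((result.get("_id"), result["error"].get("type")))
    return {
        "errors": response["errors"],
        "statuses": dict(statuses),
        "failures": failures,
    }


# Small hand-written response in the same shape as the output above.
response = {
    "took": 5,
    "errors": False,
    "items": [
        {"create": {"_index": "movies", "_id": "1", "status": 201}},
        {"create": {"_index": "movies", "_id": "2", "status": 201}},
    ],
}
print(summarize_bulk(response))
```

A top-level `"errors": false` together with only `201` statuses means that every document was created.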


Let's see the mappings generated:

In [4]:
%%bash

curl --silent --insecure -u ${OPENSEARCH_USER}:${OPENSEARCH_PASSWD} \
    -X GET \
    "https://${OPENSEARCH_HOST}:${OPENSEARCH_PORT}/movies.${USER}?pretty"

{
  "movies.curso849" : {
    "aliases" : { },
    "mappings" : {
      "properties" : {
        "genres" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "movieId" : {
          "type" : "long"
        },
        "title" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "year" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        }
      }
    },
    "settings" : {
      "index" : {
        "creation_date" : "1666866256692",
        "number_of_shards" : "1",
        "number_of_replicas" : "1",
        "uuid" : "ACVINOE_Ssqzph1h6SHV4w",
        "version" : {
          "create

We see that the `year` field has been detected as `text` instead of `date`, so let's improve it using explicit mapping.

We also see that the number of shards and the number of replicas are both set to 1. In our 3 node cluster we prefer 3 primary shards and 2 replicas, so that every node holds a copy of every shard.
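Both observations can also be read out of the response programmatically instead of scanning the JSON by eye. This is a minimal sketch that assumes the output of the `GET` request has been parsed into a dictionary; the helper name `describe_index` is hypothetical.

```python
def describe_index(index_info):
    """Summarise field types and shard settings for each index.

    index_info is the parsed JSON returned by GET /<index>, which is
    keyed by index name.
    """
    summary = {}
    for name, body in index_info.items():
        props = body.get("mappings", {}).get("properties", {})
        settings = body.get("settings", {}).get("index", {})
        summary[name] = {
            "fields": {field: spec.get("type") for field, spec in props.items()},
            "shards": int(settings["number_of_shards"]),
            "replicas": int(settings["number_of_replicas"]),
        }
    return summary


# Trimmed-down version of the response shown above.
index_info = {
    "movies.curso849": {
        "mappings": {"properties": {
            "genres": {"type": "text"},
            "movieId": {"type": "long"},
            "title": {"type": "text"},
            "year": {"type": "text"},
        }},
        "settings": {"index": {"number_of_shards": "1", "number_of_replicas": "1"}},
    }
}
print(describe_index(index_info))
```

Note that the settings are returned as strings, which is why the sketch converts them to integers.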

## Using explicit mapping
We could keep using dynamic mapping, but in this case we will use **explicit mapping** so that we can set the number of shards and declare the `year` field as `date`.

NOTE: The request below uses `number_of_shards` 1 and `number_of_replicas` 0, which are the right values for a 1 node cluster. On the 3 node cluster, use 3 primary shards and 2 replicas instead.

First we have to delete the index:

In [5]:
%%bash

curl --silent --insecure -u ${OPENSEARCH_USER}:${OPENSEARCH_PASSWD} \
    -X DELETE \
    "https://${OPENSEARCH_HOST}:${OPENSEARCH_PORT}/movies.${USER}?pretty"

{
  "acknowledged" : true
}


And now we can re-create it with the right settings:

In [6]:
%%bash

curl --silent --insecure -u ${OPENSEARCH_USER}:${OPENSEARCH_PASSWD} \
    -X PUT -H "Content-Type: application/json" \
    "https://${OPENSEARCH_HOST}:${OPENSEARCH_PORT}/movies.${USER}" -d '
{
    "settings" : {
        "index" : {
            "number_of_shards" : "1",
            "number_of_replicas" : "0"
        }
    },
    "mappings": {
        "properties": {
            "year": {"type": "date"}
        }
    }
}'

{"acknowledged":true,"shards_acknowledged":true,"index":"movies.curso849"}
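If you switch between the 1 node and the 3 node cluster, the request body can be generated instead of edited by hand. The following sketch uses a hypothetical `index_body` helper and mirrors the body sent above, keeping the settings as strings as in the original request.

```python
import json


def index_body(shards, replicas):
    """Build the create-index body with an explicit `date` mapping for year."""
    return {
        "settings": {
            "index": {
                "number_of_shards": str(shards),
                "number_of_replicas": str(replicas),
            }
        },
        "mappings": {"properties": {"year": {"type": "date"}}},
    }


SINGLE_NODE = index_body(1, 0)   # values used in the request above
THREE_NODES = index_body(3, 2)   # values suggested for the 3 node cluster

print(json.dumps(SINGLE_NODE, indent=4))
```

The printed JSON can be pasted after `-d` in the `curl` command, or saved to a file and sent with `--data-binary @file`.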

Let's verify that it has the right settings:

In [7]:
%%bash

curl --silent --insecure -u ${OPENSEARCH_USER}:${OPENSEARCH_PASSWD} \
    -X GET \
    "https://${OPENSEARCH_HOST}:${OPENSEARCH_PORT}/movies.${USER}?pretty"

{
  "movies.curso849" : {
    "aliases" : { },
    "mappings" : {
      "properties" : {
        "year" : {
          "type" : "date"
        }
      }
    },
    "settings" : {
      "index" : {
        "creation_date" : "1666866257925",
        "number_of_shards" : "1",
        "number_of_replicas" : "0",
        "uuid" : "1ApYsz5nS_usGidAhxLJpQ",
        "version" : {
          "created" : "135248227"
        },
        "provided_name" : "movies.curso849"
      }
    }
  }
}


As you can see, only the `year` field appears right now: **the other fields will not appear until we import the data, because the cluster has not seen them yet**.

## Importing the data
Let's now import the data into the existing index:

In [8]:
%%bash

curl --silent --insecure -u ${OPENSEARCH_USER}:${OPENSEARCH_PASSWD} \
    -X PUT -H "Content-Type: application/json" \
    --data-binary @movies-bulk.json \
    "https://${OPENSEARCH_HOST}:${OPENSEARCH_PORT}/_bulk"

{"took":391,"errors":false,"items":[{"create":{"_index":"movies.curso849","_type":"_doc","_id":"1","_version":1,"result":"created","_shards":{"total":1,"successful":1,"failed":0},"_seq_no":0,"_primary_term":1,"status":201}},{"create":{"_index":"movies.curso849","_type":"_doc","_id":"2","_version":1,"result":"created","_shards":{"total":1,"successful":1,"failed":0},"_seq_no":1,"_primary_term":1,"status":201}},{"create":{"_index":"movies.curso849","_type":"_doc","_id":"3","_version":1,"result":"created","_shards":{"total":1,"successful":1,"failed":0},"_seq_no":2,"_primary_term":1,"status":201}},{"create":{"_index":"movies.curso849","_type":"_doc","_id":"4","_version":1,"result":"created","_shards":{"total":1,"successful":1,"failed":0},"_seq_no":3,"_primary_term":1,"status":201}},{"create":{"_index":"movies.curso849","_type":"_doc","_id":"5","_version":1,"result":"created","_shards":{"total":1,"successful":1,"failed":0},"_seq_no":4,"_primary_term":1,"status":201}},{"create":{"_index":"mov

## Verifying the mapping

In [9]:
%%bash

curl --silent --insecure -u ${OPENSEARCH_USER}:${OPENSEARCH_PASSWD} \
    -X GET \
    "https://${OPENSEARCH_HOST}:${OPENSEARCH_PORT}/movies.${USER}?pretty"

{
  "movies.curso849" : {
    "aliases" : { },
    "mappings" : {
      "properties" : {
        "genres" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "movieId" : {
          "type" : "long"
        },
        "title" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "year" : {
          "type" : "date"
        }
      }
    },
    "settings" : {
      "index" : {
        "creation_date" : "1666866257925",
        "number_of_shards" : "1",
        "number_of_replicas" : "0",
        "uuid" : "1ApYsz5nS_usGidAhxLJpQ",
        "version" : {
          "created" : "135248227"
        },
        "provided_name" : "movies.curso849"
      }
    }
  }
}
