
[RFC] GeoIP database auto update - API design #5860

Closed
heemin32 opened this issue Jan 13, 2023 · 11 comments
Labels
Build Libraries & Interfaces enhancement Enhancement or improvement to existing feature or request RFC Issues requesting major changes

Comments

@heemin32
Contributor

heemin32 commented Jan 13, 2023

The purpose of this RFC (request for comments) is to gather community feedback on a proposed API design for #5856.

Manifest file in a database distribution server

The free database distribution server will host the following manifest file, which an OpenSearch cluster uses to locate the actual database file and read its metadata.

  • url: URL of a single zip file
  • db_name: name of the database file inside the zip file
  • md5_hash: MD5 hash of the zip file
  • valid_for: duration in days for which the database can be used without an update
  • updated_at: last update time of the zip file, in Unix epoch seconds
  • provider: provider name of the database

Example

{
  "url": "https://geoip.opensearch.org/v1/database/geolite2-city.zip",
  "db_name": "GeoLite2-City.csv",
  "md5_hash": "63d0cea9d550e495fde1b81310951bd7",
  "valid_for": 30,
  "updated_at": 1673639523,
  "provider": "maxmind"
}
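The md5_hash check implied by the manifest can be sketched as below. `verify_zip` is a hypothetical helper (not part of the RFC), and the download step is stubbed with in-memory bytes:

```python
import hashlib
import json

def verify_zip(manifest_json: str, zip_bytes: bytes) -> bool:
    """Return True if the downloaded zip matches the manifest's md5_hash."""
    manifest = json.loads(manifest_json)
    return hashlib.md5(zip_bytes).hexdigest() == manifest["md5_hash"]

# Manifest mirroring the example format above; the hash is computed from the
# stand-in bytes so the sketch is self-consistent.
manifest = json.dumps({
    "url": "https://geoip.opensearch.org/v1/database/geolite2-city.zip",
    "db_name": "GeoLite2-City.csv",
    "md5_hash": hashlib.md5(b"example zip content").hexdigest(),
    "valid_for": 30,
    "updated_at": 1673639523,
    "provider": "maxmind",
})
print(verify_zip(manifest, b"example zip content"))  # True
```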

API to trigger an auto update of a database

A user will call a new API to create a GeoIP datasource. Once the API is called, the OpenSearch cluster starts downloading the file from the given endpoint at the given interval. Both parameters are optional and have default values. A user can update the values of a datasource. If a user deletes the datasource, OpenSearch removes the GeoIP data from the cluster. If update_interval is larger than valid_for in the manifest file, the API throws an error.

  • endpoint: Endpoint URL
  • update_interval: Update interval in days

Example

PUT /_geoip/datasource/my-datasource
{
  "endpoint": "https://geoip.opensearch.org/v1/geolite2-city/manifest.json",
  "update_interval": 7
}
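The validation rule just described (update_interval must not exceed the manifest's valid_for) can be sketched as follows; `validate_update_interval` is a hypothetical name, not an actual API:

```python
def validate_update_interval(update_interval_days: int, valid_for_days: int) -> None:
    """Reject intervals longer than the manifest's validity window, since the
    database would expire before the next scheduled update."""
    if update_interval_days > valid_for_days:
        raise ValueError(
            f"update_interval ({update_interval_days}d) must not exceed "
            f"the manifest's valid_for ({valid_for_days}d)"
        )

validate_update_interval(7, 30)   # OK: refreshed well before expiry
# validate_update_interval(45, 30) would raise ValueError
```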

API to get the status of the datasource

After the OpenSearch cluster successfully reads the manifest file, it stores all relevant metadata of the datasource, which a user can then query.

  • state: Indicates the state of the datasource. It starts as preparing and moves to available once a GeoIP database is ready to be used. If there was an issue during the first database preparation, it is marked as failed. When a user deletes the datasource, it is marked as deleting and the actual deletion takes place afterward.

(Screenshot: datasource state diagram)

  • error_msg: Error message when a database update fails
  • endpoint: Endpoint of the manifest file
  • update_interval: Update interval in days
  • database.provider: Database provider
  • database.md5_hash: Hash of the zip file; used to check whether the file on the server has been updated
  • database.expire_after: Epoch time after which the database can no longer be used
  • database.updated_at: Epoch time when the database was last updated; used to check whether the server's file is newer
  • database.fields: Fields that the GeoIP processor can append to a document during ingestion

Example

GET /_geoip/datasource/my-datasource?filter_path=state,error_msg,endpoint,update_interval,database
{
  "state": "preparing",
  "error_msg": "",
  "endpoint": "https://geoip.opensearch.org/v1/manifest/geolite2-city",
  "update_interval": 15,
  "database": {
    "provider": "maxmind",
    "md5_hash": "63d0cea9d550e495fde1b81310951bd7",
    "expire_after": 1673739523,
    "updated_at": 1673639523,
    "fields": ["latitude", "longitude", "country", "city"]
  }
}
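The md5_hash/updated_at comparison described in the field list above can be sketched as below; `needs_update` and the second hash value are illustrative assumptions, not the actual implementation:

```python
def needs_update(local: dict, remote: dict) -> bool:
    """Skip the download when the server's manifest carries the same md5_hash,
    or when its updated_at is not newer than what the cluster already has."""
    if remote["md5_hash"] == local["md5_hash"]:
        return False
    return remote["updated_at"] > local["updated_at"]

local = {"md5_hash": "63d0cea9d550e495fde1b81310951bd7", "updated_at": 1673639523}
same = {"md5_hash": "63d0cea9d550e495fde1b81310951bd7", "updated_at": 1673639523}
newer = {"md5_hash": "d41d8cd98f00b204e9800998ecf8427e", "updated_at": 1673739523}
print(needs_update(local, same))   # False -> recorded as last_skipped_at
print(needs_update(local, newer))  # True  -> triggers a download
```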

GeoIP processor using the datasource

Once a GeoIP datasource is available, a user can create a GeoIP processor that uses it by providing the datasource name in the new datasource field. The value is optional; if it is not provided, the processor falls back to the current behavior of using a static GeoIP database.

  • datasource: Name of a GeoIP datasource the user created previously

Example

PUT _ingest/pipeline/geoip
{
  "processors" : [
    {
      "geoip" : {
        "field" : "ip",
        "target_field" : "location",
        "datasource" : "my-datasource"
      }
    }
  ]
}

API to get metrics of the datasource

The GeoIP datasource also tracks metrics for its update activity.

  • last_succeeded_at: Epoch time when a database update last succeeded
  • last_failed_at: Epoch time when a database update last failed
  • last_skipped_at: Epoch time when an update was last skipped because the endpoint had no new data
  • last_processing_time: Time taken to update the database, in seconds

Example

GET /_geoip/datasource/my-datasource?filter_path=update_stats
{
  "update_stats": {
    "last_succeeded_at": 1673639523,
    "last_failed_at": 1673631523,
    "last_skipped_at": 1673639123,
    "last_processing_time": 1123123
  }
}
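One way to read the update_stats payload above: the largest of the three timestamps tells you the outcome of the most recent scheduled run. This is an illustrative sketch of that reading, not cluster code:

```python
def last_run_outcome(stats: dict) -> str:
    """Return which of succeeded/failed/skipped happened most recently."""
    outcomes = {
        "succeeded": stats["last_succeeded_at"],
        "failed": stats["last_failed_at"],
        "skipped": stats["last_skipped_at"],
    }
    return max(outcomes, key=outcomes.get)

stats = {
    "last_succeeded_at": 1673639523,
    "last_failed_at": 1673631523,
    "last_skipped_at": 1673639123,
}
print(last_run_outcome(stats))  # succeeded
```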
@navneet1v
Contributor

A couple of comments:

PUT /_geoip/policy/my-policy
{
  "endpoint": "geoip.opensearch.org/v1/manifest/geolite2-city",
  "update_interval": 20
}

should the endpoint variable value be a valid URI?

  • If update does not happen on time, the value changes to expired. If there was an issue during the first database preparation, it will be marked as failed

Will there be automated retries? Or can a user retry without changing policy names?

@heemin32
Contributor Author

heemin32 commented Jan 14, 2023

should the endpoint variable value be a valid URI?

Correct. Updated the post.

Will there be automated retries? or can a user do the retry without changing policy names?

There won't be automated retries. It will try to update at the next interval. A user can shorten the interval to trigger an update earlier than originally scheduled.

@kotwanikunal kotwanikunal added Build Libraries & Interfaces Clients Clients within the Core repository such as High level Rest client and low level client and removed Build Libraries & Interfaces labels Jan 17, 2023
@minalsha minalsha added RFC Issues requesting major changes Build Libraries & Interfaces and removed untriaged Clients Clients within the Core repository such as High level Rest client and low level client labels Jan 31, 2023
@minalsha
Contributor

@dagneyb , @dbwiddis , @saratvemulapalli any comments?

@dblock
Member

dblock commented Feb 7, 2023

Other than regular updates, how is this a "policy"? Is "geoip data" (aka /geoip/data) more appropriate?

@heemin32
Contributor Author

heemin32 commented Feb 7, 2023

Other than regular updates, how is this a "policy"? Is "geoip data" (aka /geoip/data) more appropriate?

I didn't use the term "data" because the endpoint and update interval do not represent "geoip data" by themselves. The actual data will be stored in an index. I think of it as a policy about where and how to get GeoIP data.

@dblock
Member

dblock commented Feb 7, 2023

Other names could be "datasource". But maybe that's confusing with actual data sources. Policy implies some kind of principle, so that was an odd name. I don't feel strongly about it, maybe others have comments.

@heemin32
Contributor Author

heemin32 commented Feb 8, 2023

"datasource" sounds fair to me. JDBC uses the term datasource when it accesses a database.

@dblock
Member

dblock commented Mar 24, 2023

The design in #5856 talks about the possibility of storing all data in an OpenSearch index. Is that an option? These files are large and now we're introducing additional storage requirements. In the future I'd like to be able to have remote storage (e.g. S3) entirely, including for this GeoIp data.

@heemin32
Contributor Author

Yes. We are going to store the GeoIP database file in an index, which will take about 500 MB on each node.

@heemin32
Contributor Author

heemin32 commented Apr 4, 2023

I am considering implementing the feature as a new processor type called ip2geo in the geospatial repository, with the following advantages:

  1. Can utilize job-scheduler right away.
  2. Easy to deprecate the current geoip processor.
  3. The plugin can be installed only when a user wants to use it.

@dblock
Member

dblock commented Apr 5, 2023

I like this plan better, assuming the new geoip processor can fully replace the existing one in 3.0. We should mark the latter deprecated when we release the first version of the new one.

@heemin32 heemin32 closed this as completed Sep 8, 2023
5 participants