
Automatic SDK code generation #40

Closed · DadiBit opened this issue Oct 14, 2023 · 23 comments

@DadiBit

DadiBit commented Oct 14, 2023

It would be nice to create a GitHub action that can automatically export the hourly/daily/minutely_15 options to the Python SDK (already in the TODO) and the Kotlin SDK. Of course, this could be extended to all parameters.

Options

Previously, in the Kotlin SDK, I used to fetch the website and grep the data out. Unfortunately this no longer works, since some options aren't loaded until their tab is selected (namely the UV index, pressure & co. options in the forecast API).
The main idea would be to either work directly with the API source code (which may lead to generating undocumented stuff, like the seasonal forecast API) or with the docs page (much better IMO), and trigger a workflow_dispatch action on the SDK projects.

Any help with parsing the docs website source code would be appreciated.

@patrick-zippenfenig

Thanks for bringing this up. In the past weeks I have spent a lot of time implementing the FlatBuffers serialisation format. The underlying goals are:

  • Efficiently serialise on server side. Especially long time-series data for historical data
  • Reduce transfer size. Technically, streaming data for a list of locations is also supported
  • Deserialise data with low overhead. No parsing, direct access to large arrays, zero-copy

Using JSON for time-series data is rather inefficient. I implemented an optimised JSON serialiser on the server side to quickly encode data to JSON, but most client-side JSON implementations do not work particularly well with large floating point arrays. Even for a moderate amount of weather data, parsing can take 20-100 milliseconds or more.

Using binary serialisation formats like FlatBuffers or Protobuf can solve this issue. Especially with FlatBuffers, floating point arrays can be transferred directly. Because FlatBuffers uses fixed types, this also makes working with strict typing easier on the client.
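
To illustrate why (not a formal benchmark; exact numbers depend entirely on the machine and data size), compare parsing a large float array from JSON with taking a zero-copy view of the same values as raw bytes:

import json
import time
import numpy as np

# roughly two years of hourly data for a handful of variables
values = np.random.rand(2 * 365 * 24 * 5).astype(np.float32)

payload_json = json.dumps(values.tolist())
t0 = time.perf_counter()
json.loads(payload_json)  # every float has to be parsed from text
t_json = time.perf_counter() - t0

payload_bin = values.tobytes()
t0 = time.perf_counter()
np.frombuffer(payload_bin, dtype=np.float32)  # zero-copy view, no parsing
t_bin = time.perf_counter() - t0

print(f"JSON parse: {t_json * 1000:.1f} ms, binary view: {t_bin * 1000:.3f} ms")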

Right now, you can find the FlatBuffers definitions here: https://github.com/open-meteo/swift-sdk/tree/main/FlatBuffers. I am still actively developing, so they are likely to change.

The basic idea is to provide client libraries that offer a simple interface to decode data. For example in Python it may look like this:

om = HttpxClient()
params = {
    "latitude": [52.54, 48.1, 48.4],
    "longitude": [13.41, 9.31, 8.5],
    "hourly": ["temperature_2m", "precipitation"],
    "start_date": "2023-08-01",
    "end_date": "2023-08-02",
    # 'timezone': 'auto',
    # 'current': ['temperature_2m','precipitation'],
    "format": "flatbuffers",
}

results = om.weather_api("https://archive-api.open-meteo.com/v1/archive", params=params)
assert len(results) == 3
res = results[0]
assert res.Latitude() == pytest.approx(52.5)
assert res.Longitude() == pytest.approx(13.4)
res = results[1]
assert res.Latitude() == pytest.approx(48.1)
assert res.Longitude() == pytest.approx(9.3)
print("Coordinates ", res.Latitude(), res.Longitude(), res.Elevation())
print(res.Timezone(), res.TimezoneAbbreviation())
print("Generation time", res.GenerationtimeMs())

print(res.Hourly().Temperature2m().ValuesAsNumpy())

All attributes like Temperature2m are defined in the FlatBuffers schema and enable IDE code completion. In the case of Python, it is possible to do a zero-copy read to get a numpy array.

The principles for each client in individual programming languages are:

  • The client accepts a URL and parameters. The URL should not be hard-coded because users may use their own API instances.
  • URL parameters should use simple strings. Using enumerations does not work well, as the number of weather variables is basically endless.
  • The client offers endpoints like weather_api(url, params), air_quality_api(url, params) or ensemble_api(url, params) which call the appropriate FlatBuffers message decoder
  • Each function returns an array of API responses (required for multi-location or multi-domain calls). See the Python example above. Note, the server sends multiple size-prefixed FlatBuffers messages.
  • If possible, the client should provide APIs for common HTTP client implementations. For Python: aiohttp, requests and httpx. For Swift: the built-in client and nio-http-client
  • As few dependencies as possible. If multiple HTTP clients are used, offer multiple versions. E.g. for Python and Swift I am planning to have one base library with only the compiled FlatBuffers definition and individual clients that add a shim library for each HTTP client (Sans-IO).
  • Integrate retry functions for HTTP errors 429 and 5xx. Should support exponential back-off (see the sketch after this list).
  • If useful, the client should offer caching functionality. For Python, just examples for libraries like requests-cache should be fine.
  • MIT Licensed
  • CI integration for package distribution like npm or pip with semantic versioning
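
As a minimal sketch of the retry point above (using the plain requests library and urllib3's Retry helper purely for illustration, not the planned client API):

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# retry up to 5 times on 429 and 5xx responses with exponentially growing back-off
retry = Retry(total=5, backoff_factor=0.5, status_forcelist=[429, 500, 502, 503, 504])
session = requests.Session()
session.mount("https://", HTTPAdapter(max_retries=retry))

response = session.get(
    "https://api.open-meteo.com/v1/forecast",
    params={"latitude": 52.54, "longitude": 13.41, "hourly": "temperature_2m"},
)
print(response.status_code)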

My rough plan for the FlatBuffers format is

  1. Test FlatBuffers integration with Python, TypeScript and Swift. I first need to understand if the current schema works well. There have already been multiple schema iterations. The current schema seems to be working well, but there are still some edge cases.
  2. Finalise the schema definition. Adding additional fields to schemas is a non-breaking change.
  3. Finalise client libraries for Python, TypeScript and Swift
  4. Coordinate with other contributors on how to provide clients for other languages
  5. Automatically generate demo code in the API documentation to make it easy to use

Wdyt? Would you like to have a look at the FlatBuffers format and try to compile it for Kotlin?

@DadiBit

DadiBit commented Oct 15, 2023

Thanks for bringing this up. In the past weeks I have spent a lot of time implementing the FlatBuffers serialisation format. The underlying goals are:

  • Efficiently serialise on server side. Especially long time-series data for historical data

  • Reduce transfer size. Technically, streaming data for a list of locations is also supported

  • Deserialise data with low overhead. No parsing, direct access to large arrays, zero-copy

In addition to the long historical arrays, now that more and more parameters are being added to the APIs, the data transfer can get quite heavy.
Second point (list of locations): I didn't see it was implemented! Sounds amazing, especially for apps that show you data over multiple spots (perhaps along a bike route... who knows!)

Using JSON for time-series data is rather inefficient. I implemented an optimised JSON serialiser on the server side to quickly encode data to JSON, but most client-side JSON implementations do not work particularly well with large floating point arrays. Even for a moderate amount of weather data, parsing can take 20-100 milliseconds or more.

Using binary serialisation formats like FlatBuffers or Protobuf can solve this issue. Especially with FlatBuffers, floating point arrays can be transferred directly. Because FlatBuffers uses fixed types, this also makes working with strict typing easier on the client.

Right now, you can find the FlatBuffers definitions here: https://github.com/open-meteo/swift-sdk/tree/main/FlatBuffers. I am still actively developing, so they are likely to change.

I've looked a bit at the FlatBuffers docs and it seems quite interesting. I'm already using ProtoBuf in the Kotlin SDK, and it works flawlessly (apart from some issues I had with the element order initially).
To me, it looks like the FlatBuffers schema is more complex, or at least less readable(?). I will play a bit with it, but right now I'm not sure it makes sense to jump to a new format (still, it would be dope to implement both ProtoBuf and FlatBuffers)... It could be helpful to see some benchmarks 😄

The basic idea is to provide client libraries that offer a simple interface to decode data. For example in Python it may look like this:

om = HttpxClient()
params = {
    "latitude": [52.54, 48.1, 48.4],
    "longitude": [13.41, 9.31, 8.5],
    "hourly": ["temperature_2m", "precipitation"],
    "start_date": "2023-08-01",
    "end_date": "2023-08-02",
    # 'timezone': 'auto',
    # 'current': ['temperature_2m','precipitation'],
    "format": "flatbuffers",
}

results = om.weather_api("https://archive-api.open-meteo.com/v1/archive", params=params)
assert len(results) == 3
res = results[0]
assert res.Latitude() == pytest.approx(52.5)
assert res.Longitude() == pytest.approx(13.4)
res = results[1]
assert res.Latitude() == pytest.approx(48.1)
assert res.Longitude() == pytest.approx(9.3)
print("Coordinates ", res.Latitude(), res.Longitude(), res.Elevation())
print(res.Timezone(), res.TimezoneAbbreviation())
print("Generation time", res.GenerationtimeMs())

print(res.Hourly().Temperature2m().ValuesAsNumpy())

All attributes like Temperature2m are defined in the FlatBuffers schema and enable IDE code completion. In the case of Python, it is possible to do a zero-copy read to get a numpy array.

I think having IDE completion is a must for the SDKs. So far I had implemented a bash script that could create the Hourly/Daily/Models "options" (that's how I called them in the SDK code at least, more on this at the end).

The principles for each client in individual programming languages are:

  • The client accepts a URL and parameters. The URL should not be hard-coded because users may use their own API instances.

In the Kotlin SDK I have an Endpoint class that is inherited by all endpoints and has a variable context (base URL), which is used when the API is queried. This way, if needed, you can just create a marine endpoint with your very own custom domain.
In the marine API I pretty much hard-coded the context URL, but it can still be set on every call (invoke function):

object Marine : Endpoint(
    URL("https://marine-api.open-meteo.com/v1/marine")
) {

But in the OpenMeteo class (which is the recommended way to interface with the SDK) I let the user pass a set of contexts to use for every call (with defaults, of course):

open class OpenMeteo(
    var latitude: Float,
    var longitude: Float,
    var apikey: String? = null,
    var contexts: Contexts = Contexts(),
) {

    /**
     * A list of URL endpoints contexts for all the implemented APIs.
     */
    class Contexts(
        var airQuality: URL = AirQuality.context,
        var climateChange: URL = ClimateChange.context,
        var elevation: URL = Elevation.context,
        var ensemble: URL = Ensemble.context,
        var flood: URL = Flood.context,
        var forecast: URL = Forecast.context,
        var geocodingGet: URL = GeocodingGet.context,
        var geocodingSearch: URL = GeocodingSearch.context,
        var historical: URL = Historical.context,
        var marine: URL = Marine.context,
    )

    // ...

    /**
     * Query the [Marine] API.
     * @param context The API endpoint to use. Useful for self-hosted server instances.
     * @param query The query modifier.
     */
    inline fun marine(
        context: URL = contexts.marine,
        query: Marine.Query.() -> Unit,
    ) = Marine(latitude, longitude, apikey, context, query)

}

As you can see, the user always has a way to set the endpoint context for every call, no matter what. With this said, maybe some sort of standardization across the SDKs could be helpful.

  • URL parameters should use simple strings. Using enumerations does not work well, as the number of weather variables is basically endless.

IDE implementation could suffer when using just strings... So far I just made a simple object with key-value pairs:

object Hourly : Options.Hourly, Options.Listable<Hourly>() {
    const val waveHeight = "wave_height"
    const val waveDirection = "wave_direction"
    // ...
}

The Listable class simply has a method that lets you pick a list of parameters (whose values are strings), which are automatically joined with a comma when returned. It looks clean and works smooth as butter (you can even add additional options and access them in the corresponding hourly/daily Map<String, ...>), but I admit that's a bit hacky.

  • The client offers endpoints like weather_api(url, params), air_quality_api(url, params) or ensemble_api(url, params) which calls the appropriate FlatBuffer message decoder

  • Each function returns an array of API responses (required for multi-location or multi-domain calls). Python example. Note, the server sends multiple size-prefixed FlatBuffers messages.

It's like "forcing" the use of arrays for the response data, even if only one city is returned, right? Okay 👍

  • If possible, the client should provide APIs for common HTTP client implementations. For Python aiohttp, requests and httpx. For Swift built-in client and nio-http-client

  • As few dependencies as possible. If multiple HTTP clients are used, offer multiple versions. E.g. For Python and Swift I am planning to have one base library with only the compiled FlatBuffer definition and individual clients that add a shim library for each HTTP client (Sans IO).

What do you mean by this? Like, giving the dev the option to pick which client they want to use? Right now I'm using the built-in HTTP client and it just works (+ it doesn't increase the bundle size, since any library would use the built-in client anyway).
HTTP libraries make sense when working with POST requests, but for the SDK scenario built-in options are usually enough IMO.

  • Integrate retry functions for HTTP error 429 and 5xx. Should support exponential back-off.

Should this be done by default? Like 5 retries? Still, it should be configurable, in case someone doesn't want to hog the server...

  • If useful, the client should offer caching functionality. For Python, just examples for libraries like requests-cache should be fine.

I think this should be part of the actual client, rather than the SDK package. Let's say, hypothetically, someone is using the Kotlin SDK in a weather app. It's the app (client) that should cache the data for offline usage, not the SDK.

  • MIT Licensed

  • CI integration for package distribution like npm or pip with semantic versioning

👍 to both points

My rough plan for the FlatBuffers format is

  1. Test FlatBuffers integration with Python, Typescript and Swift. I first need to understand if the current schema is working well. There already have been multiple schema iterations. The current schema seems to be working well, but there are still some edge cases.

  2. Finalise schema definition. Adding additional fields in schemas is a non-breaking change.

  3. Finalise client libraries for Python, Typescript and Swift

I'm pretty sure Kotlin can be transpiled to JavaScript/TypeScript, so don't rush that SDK.

  4. Coordinate with other contributors how to provide clients for other languages

Should we just rename this issue?

  5. Automatically generate demo code in the API documentation to make it easy to use

This looks a bit tough to do automatically... I have no idea how to do it, but it can probably be an SDK-related script, perhaps. Otherwise the coding structure should be super strict.

Wdyt? Would like to have a look at the FlatBuffers format and try to compile it for Kotlin?

Ok, I'll start experimenting with FlatBuffers and report back as soon as I get some stuff working and have an actual idea of what it's like.

  • Naming conventions & Co.: I'd like to propose a naming standardization for the object names, as well as the data structure. This way a developer using the TypeScript SDK and the Python one could just take a quick read of the docs and start coding straight away. I've used the example of the Options objects in my case, but the same could go for the URL context and pretty much anything else.

@DadiBit

DadiBit commented Oct 15, 2023

I just had an idea that could ease the SDK coordination: why not create a (template) repo with some "common" docs/scripts that can just be forked for each SDK and customized per language? Having a common CONTRIBUTING.md file could be a bit too much, but if you'd like to, you can peek at the one used in the Kotlin SDK.

@DadiBit

DadiBit commented Oct 15, 2023

I tried "compiling" just the units.fbs and weather_api.fbs source schemas. I'll be honest: it's cool to just run flatc, import the stuff and get everything working out of the box in a standard manner, but holy moly, the generated code gets quite heavy:

Language            Lines of code   KB    × size of .fbs source
FlatBuffers source  830             32    1
Java                1787            192   5
TypeScript          3862            192   5
Kotlin              4134            200   5.3
Python              6369            268   7.4
Swift¹              26671           272   7.5

The byte size/SLOC is not a perfect code size measure, I know, but as a reference: the whole (not just units and weather_api) Kotlin SDK main has 2721 SLOC for a total of 236KB (comments included, tests excluded, updated parameters excluded and deserialization library excluded).
I'm pretty sure that later versions of FlatBuffer will reduce the code size, but I'm a bit hesitant to implement it right now, tbh.

The big advantage of course is that we could drop any JSON/ProtoBuf library on the client side and reduce the server/client load, so the final package/bundle/library size could probably still end up smaller: when I find a proper way to bundle the Kotlin SDK into a single package with the deserialization library, I'll post some more scientific results.

Footnotes

  1. Swift has a bunch of one-line functions that contain multiple statements, so just looking at SLOC was kind of misleading (the code is generated by a machine, not a human, so it actually makes sense)

@patrick-zippenfenig

Using strings as params: IDE implementation could suffer when using just strings... So far I just made a simple object with key-value pairs:

I am worried that a fixed list of weather variables will be long and limiting. Especially with data on pressure levels like temperature_500hPa, the number of options is already around 300. In the future there will be more variables on model levels as well, e.g. temperature_1000m with 500m or 1km increments. I considered completely decoupling variable and level, giving you the option to specify temperature and 1000m separately, but this does not work well with the current API syntax (could be a v2 in the far future). Lastly, most applications select a handful of weather variables for their needs and never change them. If this process is well supported by the API documentation, just using strings should be ok.

Using fixed types for the result-set clearly improves code and makes it safer to use, like res.Hourly().Temperature2m().ValuesAsNumpy(). I also considered encoding variable and level in attributes, but the resulting client code would be horrible and unnecessarily complex for many use cases (e.g. res.Hourly().filter({ $0.variable == .temperature && $0.altitude == 2 }).asFloat()?).

What do you mean by this? Like, giving the dev the option to pick which client they want to use? Right now I'm using the built-in HTTP client and it just works

In programming languages like Python, you need to use an HTTP client library to fetch data. There are different clients for different use-cases. If there is a built-in client, sure, that's already sufficient.

It's like "forcing" to use arrays for the response data, even if one city is returned, right?

Correct. The user will also get multiple responses if multiple weather models are used.
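
For reference, splitting such a response into its individual messages only requires reading the 4-byte little-endian size prefix in front of each one; a rough Python sketch (the helper name is made up, not part of the SDK):

import struct

def split_size_prefixed(body: bytes):
    # yield each size-prefixed FlatBuffers message from the raw response body
    offset = 0
    while offset < len(body):
        (length,) = struct.unpack_from("<I", body, offset)  # 32-bit little-endian size prefix
        offset += 4
        yield body[offset:offset + length]
        offset += length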

Integrate retry functions for HTTP error 429 and 5xx. Should support exponential back-off.
Should this be done by default? Like 5 retries? Still, it should be configurable, in case someone doesn't want to hog the server...

Yes, this should be done by default, but configurable by the user. In many cases, a simple network error could interrupt an application. Many developers are also unaware that the HTTP transport can be unreliable.

This looks a bit tough to do automatically... I have no idea how to do it, but it can probably be an SDK-related script, perhaps. Otherwise the coding structure should be super strict.

Yes, this is generated per SDK. There could be a simple switch to select the programming language, which then shows simple instructions on how to use it, given the currently selected weather variables. For Python it would just be:

# Run `pip install xxxxxxx`
cache = HttpxCache(path=".cache", ttl=86400)
om = HttpxClient(cache=cache)
params = {
    "latitude": [52.54],
    "longitude": [13.41],
    "hourly": ["temperature_2m", "precipitation"],
    "start_date": "2023-08-01",
    "end_date": "2023-08-02"
}

results = om.weather_api("https://archive-api.open-meteo.com/v1/archive", params=params)
result = results[0]
print("Coordinates ", result.Latitude(), result.Longitude(), resulr.Elevation())
print(res.Timezone(), res.TimezoneAbbreviation())

hourly = result.Hourly()
time = hourly.something_that_generates_a_time_iterator()
temperature2m = hourly.Temperature2m().ValuesAsNumpy()
precipitation = hourly.Precipitation().ValuesAsNumpy()

This is simply a jump-start for users to quickly get data into their program. Depending on the programming language, it supports different use-cases. For Python, it is geared towards data science and therefore I want to encourage the use of a cache. For a web application in TypeScript, a cache like this does not make sense.

I just had an idea that could ease the SDK coordination: why not creating a (template) repo with some "common" docs/scripts that can just be forked for each SDK and customized per language?

I am considering a mono repo which contains the compiled schemas for all programming languages. The big advantage is that any addition to the schema files will generate the code for all programming languages in one go. The drawback: it will be a pain to set up all the package manager integrations...

The whole (not just units and weather_api) Kotlin SDK main has 2721 SLOC for a total of 236KB (comments included, tests excluded, updated parameters excluded and deserialization library excluded).

Yes, this is a trade-off, but the code size is still reasonable IMHO. The compiled size is of course significantly smaller and many code paths could even be removed entirely by the compiler/linker. Right now, I do not see any other way to reduce the code size significantly.

@patrick-zippenfenig

Update: There is now a mono repository for the compiled FlatBuffers schema with package manager integration for Python, TypeScript and Swift. Using a mono repository keeps it simple to update schema files and consistently distribute all files.

I mostly documented the structure. I still have some remaining doubts and I do not like that certain weather variables have a lot of duplicates (e.g. temperature or soil properties on different levels), but it is consistent and it works. Other approaches (like using enums for all variables) have drawbacks on client and/or server side.

The first Python API client is also relatively far along. It is based on the Python requests HTTP client. Integrations for other clients like aiohttp or httpx may follow at some point.

Other programming languages and package managers will follow later.

Code generation in the API documentation is mostly done (#42). All selected parameters are applied automatically and the generated dummy code should work as a good starting point for any data scientist. The code also includes caching and retries.

[Screenshot of the generated Python code preview in the API documentation]

@DadiBit Wdyt?

@DadiBit

DadiBit commented Oct 19, 2023

TL;DR: why not let the dev provide "temperature_2m" both in the request query and the response access? See FlexBuffers for unstructured data.

I am worried that a fixed list of weather variables will be long and limiting. Especially with data on pressure levels like temperature_500hPa, the number of options is already around 300. In the future there will be more variables on model levels as well, e.g. temperature_1000m with 500m or 1km increments. I considered completely decoupling variable and level, giving you the option to specify temperature and 1000m separately, but this does not work well with the current API syntax (could be a v2 in the far future). Lastly, most applications select a handful of weather variables for their needs and never change them. If this process is well supported by the API documentation, just using strings should be ok.

I shall reconsider my take on this: if all parameters are automatically generated, a bunch of useless code gets pushed, making everything huge. So yeah, it makes little sense when the dev has to go to the docs website to check the available params anyway.

Using fixed types for the result-set clearly improves code and makes it safer to use, like res.Hourly().Temperature2m().ValuesAsNumpy(). I also considered encoding variable and level in attributes, but the resulting client code would be horrible and unnecessarily complex for many use cases (e.g. res.Hourly().filter({ $0.variable == .temperature && $0.altitude == 2 }).asFloat()?).

I think we could simply implement a res.hourly["temperature_2m"] or similar (a quick Stack Overflow search led me to this): as far as I know, Python, JavaScript/TypeScript and Kotlin can have custom functions called when accessing data through square brackets.
Since the dev needs to know the key while querying the API, we could just let them write it again when accessing the response...

The way I would implement it in Python¹:

class Hourly(object):
    def __init__(self, values):
        self._values = values  # I have zero idea of how data is stored...

    # [A, B] is not doable in JavaScript, only [A][B]... Maybe a middle type could be a feasible standard across all languages
    def __getitem__(self, key):
        # p["temperature", 2] arrives here as a tuple, so unpack it
        variable, altitude = key if isinstance(key, tuple) else (key, None)
        if altitude is None:
            return self._values[variable]
        # concatenate the two keys? That's the if-else fork above, but it can access the String-Values dictionary directly, maybe
        return self._values[f"{variable}_{altitude}"]
        # or, like your Swift example: .filter({ $0.variable == key && $0.altitude == altitude }).asFloat()? -- are you sure it's .temperature and not "temperature"/key?

p = Hourly(...)
print(p["temperature", 2])
  1. Probably it could work with some casting as well, but honestly I have developed very little with Python
  2. Again, I have no idea of how data is stored, so maybe it makes no sense what I've just written
  3. I see FlatBuffers support key-value Maps (slower): https://flatbuffers.dev/flexbuffers.html

⚠️ What if a dev can't update the library (bug or something), but still wants to access a new parameter? Wouldn't it just make more sense to give them full control, from the query to the response access? If we use the [key] trick, the code would look quite clean, plus (!) the generated code should be much lighter.

FlexBuffers should work perfectly for this job, but they are slower. It'd be cool to use a FlexBuffer just to store unknown data/keys. The problem? How the hell does the server know which keys the client doesn't know? Doable, but a bit hard to implement.

An easy way to kill two birds with one stone is to arbitrarily² pick the more popular parameters and provide them in the fbs schema, while letting everything else exist in the FlexBuffer. See this interesting Stack Overflow answer.

Footnotes

  1. I'm no Python expert + I haven't tested this code + stolen from Stack Overflow 😄

  2. Maybe an anonymous analysis of data access could rank the params and give us a hint on which one should be extracted out from the "ugly" FlexBuffer (of course once it's added it shouldn't get removed, since it would just break compatibility)

@DadiBit

DadiBit commented Oct 19, 2023

Update: There is now a mono repository for the compiled FlatBuffers schema with package manager integration for Python, TypeScript and Swift. Using a mono repository keeps it simple to update schema files and consistently distribute all files.

Yep, it's plain and simple: no webhooks, no manual/scheduled gh action

I mostly documented the structure. I still have some remaining doubts and I do not like that certain weather variables have a lot of duplicates (e.g. temperature or soil properties on different levels), but it is consistent and it works. Other approaches (like using enums for all variables) have drawbacks on client and/or server side.

I still wonder if something like this could work:

res.hourly("temperature_1000hPa")
res.hourly("temperature", hPa = 1000) # internally calls `.hourly( "temperature_1000hPa")`
res.hourly("temperature_2m")
res.hourly("temperature", m = 2) # internally calls `.hourly( "temperature_2m")`
res.hourly("is_day")
  1. Of course, it needs some handwritten helper classes, but they should just act as proxies to the generated ones, so nothing too complex + it uses a full FlexBuffer, without using the faster static data structure (unless a manual re-mapping is done, but... why would you do it? Waste of time and energy)
  2. If we drop the idea of "helping" the dev with the altitude/pressure params then we can just use square brackets, which are a lot better from a semantics standpoint (at least coming from JavaScript/Kotlin)
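
To make the "internally calls" idea above concrete, here is a tiny Python sketch (class and field names are invented, not the generated code):

class HourlyProxy:
    def __init__(self, series):
        self._series = series  # e.g. a dict of "temperature_2m" -> values, however the generated code exposes it

    def __call__(self, key, m=None, hPa=None):
        # rebuild the documented string name from the split arguments
        if m is not None:
            key = f"{key}_{m}m"
        elif hPa is not None:
            key = f"{key}_{hPa}hPa"
        return self._series[key]

hourly = HourlyProxy({"temperature_2m": [13.2, 12.8], "is_day": [1, 1]})
print(hourly("temperature", m=2))
print(hourly("is_day"))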

The first Python API client is also relatively far. It is based on the Python requests http client. Integrations for other clients like aiohttp or httpx may follow at some point.

So, if I can just "bake" a mini HTTP client into the library, is it still compliant? It was quite simple to implement in Kotlin, since I only used it for GET requests. I still need to implement the "retry" logic.

Other programming languages or package manager follow later.

Problem: JitPack (a Kotlin/Java package publishing platform) uses the root directory to get the build configuration (maybe a subdirectory can be used, I'm not 100% sure).
I know it's a pain in the butt to publish to Maven compared to JitPack (it's still my first serious project in Kotlin...), but it's more official and easier for the end-user to use.

Code generation in the API documentation is mostly done (#42). All selected parameter are applied automatically and the generated dummy code should work as a good starting point for any data scientist. The code also includes cache and retry.

Neat! I believe the other languages will appear next to "Python" when they're ready, right? Love it!

@patrick-zippenfenig

FlexBuffers do not work well. They are not supported in all programming languages and large floating point arrays are encoded differently. Ideally I want to be able to serve data 1:1 from my backend code. As FlexBuffers need to be parsed as well and have no fixed data types, there is not much benefit compared to formats like BSON.

I was considering a FlatBuffers schema like you mentioned, with:

table ApiResponse {
  latitude: float;
  longitude: float;
  model: Model;

  hourly: [TimeSeries];
  hourly_time: TimeRange;
  daily: [TimeSeries];
  daily_time: TimeRange;
}

table TimeSeries {
  variable: Variable;
  unit: SiUnit;
  altitude: int; // meters above sea level
  pressure_level: int; // pressure level in hPa
  depth_from: int; // soil depth e.g. soil temperature from 0-100 cm
  depth_to: int;
  aggregation: Aggregation; // daily min/max/mean
  ensemble_member: int;

  values: [float];
  valuesInt64: [int64]; // sunrise and sunset timestamps
}

struct TimeRange { start: int64; end: int64; interval: int32; }
enum Variable : short { temperature, windspeed, ... }
enum SiUnit : byte { celsius, fahrenheit, kph, ... }
enum Aggregation : byte { min, max, mean, ... }
enum Model : short { best_match, icon_d2, gfs012, ... }

The schema is shorter, but it requires more logic in each programming language. E.g. helper functions like

// Return only first one matching
get(variable, altitude, pressure_level, ...) -> TimeSeries

// Return all matching, Could also be a "generator" or callback function
getMultiple(variable, altitude, pressure_level, ...) -> [TimeSeries]

// Convert to string like `soil_temperature_0_to_100_cm` or `temperature_2m_member15`
toString() -> String

Note: I do not want to use strings, but use enumerations. This works better with code completion and is slightly more efficient.

Automatic code generation in the API documentation will be more complicated as I need to map temperature_2m to get(.Temperature, altitude=2). Doable, but it needs a lookup table.
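
A rough illustration of what that lookup table and the get() helper could look like on the Python side (record layout and names are invented for the sketch, not the actual generated code):

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class TimeSeries:
    variable: str
    altitude: Optional[int] = None
    pressure_level: Optional[int] = None
    values: List[float] = field(default_factory=list)

# maps the documented string names to the structured filter arguments
LOOKUP = {
    "temperature_2m": {"variable": "temperature", "altitude": 2},
    "temperature_500hPa": {"variable": "temperature", "pressure_level": 500},
}

def get(series: List[TimeSeries], variable: str, altitude=None, pressure_level=None):
    # return the first series matching all given filters
    for s in series:
        if (s.variable, s.altitude, s.pressure_level) == (variable, altitude, pressure_level):
            return s
    return None

hourly = [TimeSeries("temperature", altitude=2, values=[13.2, 12.8])]
print(get(hourly, **LOOKUP["temperature_2m"]).values)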

I might spend a couple of hours and test this schema. It looks feasible at first sight, but I am still undecided...

Problem: JitPack (a Kotlin/Java package publishing platform) uses the root directory to manage to get the build configuration (maybe a subdirectory can be used, I'm not 100% sure).

Yes, I want to integrate it into Maven. No clue how it works, but there is a Maven plugin for semantic-release that I use to automate the release process with GitHub Actions. It is really neat, as I only have to merge a PR and all packages will be built and distributed automatically :D

@patrick-zippenfenig

Update: I have now merged all the required changes into the API code and updated the SDK. The schema uses the proposed array format instead of hard-coded attributes.

Python, Swift and Typescript releases are fully automated and publish packages to the corresponding registries.

Currently, I am working on the setup for Java. The process to get access to Maven Central and the Gradle portal is quite painful. @DadiBit do you know if it is sufficient to only publish Java packages and use those in Kotlin? I did not find an elegant way to publish a single distribution with Java and Kotlin. Alternatively, I can split them into com.open-meteo:sdk-java and com.open-meteo:sdk-kotlin.

@DadiBit

DadiBit commented Oct 26, 2023

To my understanding, Kotlin compiles to JVM bytecode and runs on the JVM; a bit like TypeScript for JavaScript. In other words, if you have a Kotlin library you can use it in Java, and if you have a Java library you can use it in Kotlin.
I personally prefer Kotlin to Java, but to my understanding they can co-exist in a single project.

@DadiBit

DadiBit commented Oct 26, 2023

I know it's a bit late in the implementation of FlatBuffers, but according to this benchmark in Go and Rust, Protobuf seems to be (or at least was 9 months ago) faster than FlatBuffers... If you want, I can do a benchmark in Kotlin with some sample data from the historical API (which is the "fattest" one).

@patrick-zippenfenig

patrick-zippenfenig commented Oct 27, 2023

The benchmark shows that decoding is significantly faster, because it does not need to parse data :). The advantage gets even bigger for large floating point arrays. This works great on the client side.

On the server side, the encoding speeds of Protobuf and FlatBuffers are similar. However, because the wire format for floating point arrays is just binary floating point data, I will also be able to implement a customised writer to send data without encoding it again. Right now, I am using the integrated FlatBuffers writer, but once the format is well established, I will develop a customised, faster version.

The SDK is now on Maven Central. Instructions are available here: https://github.com/open-meteo/sdk/tree/main/java. I do not have any Java examples yet

Edit: All API servers do support the new FlatBuffers structure as of today!

@patrick-zippenfenig

Forgot to mention: the Python code generation is now integrated into the API documentation and can be tested here: http://staging.open-meteo.com/en/docs (API Response -> switch to Preview Python)

@DadiBit

DadiBit commented Oct 28, 2023

I've been working a bit on a test fork for the GitHub Action yml file; here you can see a successful¹ run with an implementation of the commands you wrote in DEVELOPMENT.md.

  1. I've simplified the commands used to replace the namespace for each language (I am not 100% sure it works on other distros/OSes, but on my Ubuntu machine sed works with multiple files as well...)
  2. Maybe I should implement the cache action for the flatc binary, but tbh it's so quick and easy to just wget the zip and unzip it, that I don't know if it makes sense (plus, it's super easy to upgrade the version)

I would like to implement a step to push the changes, but I have no idea how I should do it: do I create a branch and then a PR? Should/could I push directly to main?
I have never collaborated on open-source projects before (I've always pushed to main on all "my" projects), so I would kindly ask you for some guidance @patrick-zippenfenig on how I should interact with the repo 😄

Footnotes

  1. Actually, I saw some awk errors regarding the Python SDK, but everything else works just fine.

@DadiBit

DadiBit commented Oct 28, 2023

Sidenote: I've split the FlatBuffers tables/enums, and the sed command works fine with multiple fbs files as well. My idea was to just run flatc on the updated files to reduce the action runtime (possibly once with all language flags set), but since it requires all included files as well, this would only work for the enums, which are not many: the idea could just be dropped, to be fair.

@patrick-zippenfenig

Hi,
In order to contribute you would have to fork the repository, create a branch and create a pull request.

What kind of changes do you want to make exactly? I want to keep the flatc process manual to better control any changes. I do not want to automatically compile the FlatBuffers files in CI. In the future, there could be manual changes to the compiled files, e.g. injected functions to filter for weather variables.

I would also prefer to keep it in a single file. There will be additional FlatBuffers schema files for the geocoding and elevation APIs. Keeping each "kind" of API in its own file keeps things better separated.

@DadiBit

DadiBit commented Oct 31, 2023

Hi, In order to contribute you would have to fork the repository, create a branch and create a pull request.

Ok, thank you.

What kind of changes do you want to do exactly? I want to keep the flatc process manual to better control any changes. I do not want to automatically compile the flatbuffers files in CI. In the future, there could manual changes to the compiled files. E.g. inject functions to filter for weather variables.

Oops, my issue originated when I wanted to integrate the code generation automatically through the GitHub Action; that's why I was thinking of using it for this job (no pun intended).
Regarding the changes: I started testing with Kotlin and the FlatBuffers API on a sample historical API response, but so far I couldn't get it to decode (but hey, at least it compiled!)... I will try to work with simpler languages like TypeScript first and then move on to Kotlin: probably I just need to implement the decoding of the "array" response. As soon as I get it to work I'll fork the repo and make a pull request 😄

I would also prefer to keep it in a single file. There will be additional FlatBuffers schema files for geocoding and elevation APIs. Keeping each "kind" of API in a file keeps better separated.

👍

@patrick-zippenfenig

A colleague provided some Java example code using the Maven Central package: https://github.com/open-meteo/sdk/blob/main/java/README.md

I also tested the TypeScript integration with Svelte here: https://github.com/open-meteo/open-meteo-website/blob/main/src/routes/en/weather/%2Bpage.svelte https://github.com/open-meteo/typescript

The Python instructions also got updated yesterday evening with some structure changes: https://github.com/open-meteo/python-requests

I also added an example of how an API response can be decoded using flatc: https://github.com/open-meteo/sdk#convert-api-response-to-json

@DadiBit

DadiBit commented Oct 31, 2023

Thank you for all the resources. I got the generated code implementation working in Kotlin, hurray! Plus, there's even a basic streaming feature: it decodes one location entry at a time. 🙇‍♂️

If you're interested, here's the snippet of code:

val inputStream = get(url) // `get` is the internal built-in HTTPS client
// TODO: loop here until the end of the response stream
val lengthBytes = inputStream.readNBytes(4)
lengthBytes.reverse() // the FlatBuffers size prefix is little-endian, while ByteBuffer.getInt reads big-endian by default
val buffer: ByteBuffer = ByteBuffer.allocate(Integer.BYTES)
buffer.put(lengthBytes)
buffer.rewind()
val length = buffer.getInt()
val bytes = inputStream.readNBytes(length)
val apiResponse = ApiResponse.asRoot(ArrayReadWriteBuffer(bytes))
// enjoy apiResponse.latitude & apiResponse.longitude

Porting these few lines of code to Java should be easy, but I'm pretty sure it's better to stick to either Java or Kotlin, not both. I think Kotlin is easier to read and maintain.

peterfication added commits to open-meteo-ruby/open-meteo-ruby that referenced this issue Nov 13, 2023
@DadiBit

DadiBit commented Dec 26, 2023

Ooops, my issue originated when I wanted to integrate the code generation automatically through the gh action, that's why I was thinking to use them for this job (no pun intended). Regarding the changes: I started testing with kotlin and the flatbuffer API on a sample historical API response, but so far I couldn't get it to decode (but hey, at least it compiled!)... I will try to work with simpler languages like TypeScript first and then move on to Kotlin: probably I just need to implement the decoding of the "array" response. As soon as I get it to work I'll fork the repo and make a pull request 😄

Well, that was a lie ;D
Since you already implemented the Java SDK there was little to no point in me working on the Kotlin one, but, to be fair, I really wanted to get it working in Kotlin Multiplatform... I started waiting for the Google repo to get some fresh updates on KMP, but there were none and I honestly kind of forgot about the project...

If you're interested in the current status of the Kotlin (not yet multiplatform) SDK, there's issue #12 on the Kotlin SDK repo.

@patrick-zippenfenig

Let me know if it would help to publish a Kotlin SDK version on Maven Central, or if you need any help!

@DadiBit

DadiBit commented Jan 5, 2024

Moved to open-meteo/open-meteo#580

@DadiBit DadiBit closed this as completed Jan 5, 2024