
provide push API for statical information #35

Closed
Orbiter opened this issue Jun 4, 2015 · 8 comments

Comments


Orbiter commented Jun 4, 2015

To add more sources beyond those harvested from Twitter, we want to add data from other sources, including RSS feeds and geoJSON data. These sources must be added to the message index in the context of a lifetime flag (#33).

The data submitted to the API must therefore include:

  • URL of the source
  • data format of the source (i.e. RSS/GeoRSS/geoJSON etc.)
  • a harvesting frequency (the submitter knows best how often the data changes)
  • a lifetime. The lifetime must be less than or equal to the harvesting frequency. The lifetime is applied to the index, and it may mean that the data disappears from search results after that time. A special lifetime of 2^31-1 can be set to announce that the data is static forever, like a normal 'news' message, or a location that will never change (i.e. the place of a city).
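The constraints in this list can be sketched as a small validation routine. The field names, the accepted format strings, and the function itself are assumptions for illustration, not the actual loklak schema:

```python
# Illustrative sketch of validating a push-API source registration.
# Field names ("url", "format", "harvesting_frequency", "lifetime")
# are assumptions, not loklak's real schema.
FOREVER = 2**31 - 1  # special lifetime: the data is static forever

def validate_source(source: dict) -> dict:
    """Check a submitted source descriptor against the rules above."""
    required = {"url", "format", "harvesting_frequency", "lifetime"}
    missing = required - source.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    if source["format"] not in {"RSS", "GeoRSS", "geoJSON"}:
        raise ValueError(f"unknown format: {source['format']}")
    # the lifetime must be less than or equal to the harvesting frequency,
    # unless the special FOREVER value marks data that never expires
    if source["lifetime"] != FOREVER and \
            source["lifetime"] > source["harvesting_frequency"]:
        raise ValueError("lifetime must not exceed harvesting frequency")
    return source

source = validate_source({
    "url": "http://example.org/feed.geojson",  # hypothetical source
    "format": "geoJSON",
    "harvesting_frequency": 3600,  # seconds between harvests
    "lifetime": 3600,              # seconds until the data may expire
})
```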

Orbiter commented Jun 4, 2015

create #37 first


zyzo commented Jul 3, 2015

Hi, what needs to be done to achieve this? A new push API was created (#55), but data is saved as a MessageEntry without a harvesting frequency or a lifetime as described above.


Orbiter commented Jul 6, 2015

implemented with api/push/geojson.json


zyzo commented Jul 7, 2015

I have several blocker questions that are critical for implementing the connect service interface:

  • which field name is the source URL saved to? Same for the harvesting frequency?
  • Is the periodic harvester implemented yet? So I just save the harvesting frequency as a message field, and it will be automatically detected and harvested by the server?
  • What if one source URL contains multiple messages? Is it better to save the common information (URL, harvesting frequency) and the list of messages the source contains as a new data type, rather than duplicating it in each message? It would be much easier to update the source.


Orbiter commented Jul 10, 2015

  • which field name is the source URL saved to? Same for the harvesting frequency?

Nowhere yet. There must be a new data structure to hold this. At this time, data can just be read from that URL, and if the import shall start again, the API must be called again. That is of course not the target design. It is true that the URL must be stored, and then the harvesting frequency must either be submitted as well or be computed by trial.

  • Is the periodic harvester implemented yet? So I just save the harvesting frequency as a message field, and it will be automatically detected and harvested by the server?

We already have a mechanism in loklak which does very much the same thing: the query index. This index stores all words which have been submitted as queries, stores the message frequency, and provides a prediction of when the next message for the query may appear. We need something like this for IoT imports as well. Designing such a thing is somewhat critical because it is difficult to clean up a messed-up data structure later. Therefore I would like to collect some more experience with the API before starting automated imports. From my point of view they can be added later, and meanwhile we can help ourselves with cron jobs calling the API again and again.
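Until automated imports exist, the query-index idea described above could be sketched roughly as follows: estimate the harvesting frequency from the observed update times and schedule the next harvest accordingly. The function name and the one-hour fallback are illustrative assumptions, not loklak code:

```python
from datetime import datetime, timedelta

def predict_next_harvest(update_times: list[datetime]) -> datetime:
    """Predict when a source should next be harvested, query-index style:
    use the mean interval between past updates as the expected frequency."""
    if len(update_times) < 2:
        # no usable history yet: retry after a default interval (assumption)
        return update_times[-1] + timedelta(hours=1)
    times = sorted(update_times)
    intervals = [(b - a).total_seconds() for a, b in zip(times, times[1:])]
    mean = sum(intervals) / len(intervals)
    return times[-1] + timedelta(seconds=mean)
```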

  • what if one source URL contains multiple messages?

One source should of course contain several messages! You may be thinking of duplicate messages (that's your next question), but that should never be the case for several messages from one import. However, we must take care of it, see below.

Is it better to save the common information (URL, harvesting frequency) and the list of messages the source contains as a new data type, rather than duplicating it in each message? It would be much easier to update the source.

I don't exactly understand how the several topics you address (re-harvesting, data types, and message duplication) are related; I believe they are unrelated. They should each be considered, but not connected:

  • re-harvesting: I answered that in another issue
  • data types: all the different JSON schemas must be considered separately, but of course they should be handled with the same re-harvesting mechanism.
  • duplicate message detection: if we harvest from IoT sources, the data may have been updated yet be identical, or not updated at all and therefore also the same. I would not distinguish these cases and would just compare the new data with the old data stored in the index. Therefore we must find a way to identify whether an IoT data object refers to the same message-generating entity or not. I believe we can identify the device using the geolocation information and the source URL. That would of course mean that no source may contain several IoT entities at the same place.

We could compute a hash from a string consisting of the harvesting URL and the location. That hash would be stored in a kind of device-hash field in the message, or we could re-use another field for it, e.g. the link field in the form <source-url>#<lat>,<lon>, or create a new field like provider_id.
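That hashing scheme could be sketched like this; the SHA-1 choice, the 16-character truncation, and the function name are illustrative assumptions:

```python
import hashlib

def provider_id(source_url: str, lat: float, lon: float) -> str:
    """Derive a stable device identifier from the harvesting URL and the
    geolocation, in the <source-url>#<lat>,<lon> form proposed above."""
    key = f"{source_url}#{lat},{lon}"
    # truncate the hex digest to keep the field short (arbitrary choice)
    return hashlib.sha1(key.encode("utf-8")).hexdigest()[:16]
```

Because the identifier depends only on the URL and the coordinates, two pushes of the same device from the same source collapse to one id, which is exactly what the duplicate detection above needs.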


zyzo commented Jul 10, 2015

I don't exactly understand how the several topics you address (re-harvesting, data types and message duplication) are related, I believe they are unrelated.

This is just a question about the data schema. The question, in a cleaner format, is: is it wiser to save the import source information and the list of imported messages in a new data structure (e.g. SourceEntry), rather than saving the import source information inside each imported message? I think you already answered this, and I totally agree that saving it in a new data structure is the way to go:

There must be a new data structure to hold this. At this time, data can just be read from that URL, and if the import shall start again, the API must be called again. That is of course not the target design. It is true that the URL must be stored, and then the harvesting frequency must either be submitted as well or be computed by trial.

And thank you for the high level of detail. This answer definitely helps a lot.

@zyzo zyzo mentioned this issue Jul 13, 2015

Orbiter commented Jul 14, 2015

I think there is a mix-up of two things here:

The question, in a cleaner format, is: is it wiser to save the import source information and the list of imported messages in a new data structure (e.g. SourceEntry), rather than saving the import source information inside each imported message? I think you already answered this, and I totally agree that saving it in a new data structure is the way to go:

The new data structure should hold the source URL and import metadata, not the source content. The imported content must be adjusted to fit our message format. How to do that is already answered: you implemented a mapping for this, and I suggested adding the source content (i.e. the content of the properties object from GeoJSON) as part of the message, in the same fashion as rich texts are stored.
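A minimal sketch of that mapping, assuming hypothetical message field names (source_type, link, location_point, text) rather than loklak's actual message format:

```python
import json

def feature_to_message(source_url: str, feature: dict) -> dict:
    """Map a GeoJSON feature into a message, embedding the feature's
    "properties" object in the message text so no source content is lost
    (similar in spirit to how rich texts are stored)."""
    # GeoJSON point coordinates are [longitude, latitude]
    lon, lat = feature["geometry"]["coordinates"]
    return {
        "source_type": "geojson",
        "link": source_url,
        "location_point": [lon, lat],
        "text": json.dumps(feature.get("properties", {}), sort_keys=True),
    }
```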

zyzo added a commit that referenced this issue Jul 23, 2015
   This commit introduces two new features:
      - save the import profile when pushing custom messages. Currently it is only implemented in /api/push/geojson.json
      - /api/import.json with a source_type parameter to retrieve the list of import profiles by source_type

zyzo commented Jul 31, 2015

Implemented in #83

@zyzo zyzo closed this as completed Jul 31, 2015