This repository has been archived by the owner on May 5, 2022. It is now read-only.

What are the hash values of those individual addresses for? #683

Open
eugeneYWang opened this issue Feb 6, 2018 · 11 comments

Comments


eugeneYWang commented Feb 6, 2018

This might be a dumb question, but I really wonder what those values, which are assigned to every address record, indicate.

Edit: my questions can be summarized as the three below:

  1. Will those values be updated once their addresses or the corresponding lat/long change?

  2. Or is the value just an index that will never change?

  3. Will the same address have two different indices if, for example, it is provided by two sources (e.g. a locality and a state government)? This is quite common, since OpenAddresses keeps address points from different sources even though one of them might cover another.

Member

migurski commented Feb 6, 2018

The hash value is calculated as a content hash, and it can be used to determine that two addresses are identical between different runs of a single source. The values are calculated each time a source is run, and you might use it to detect new addresses. I hope this helps!
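As a minimal sketch of that idea, the snippet below compares the hash values from two runs of the same source to find added and removed rows. It assumes each run's CSV output carries the hash in a column named `HASH`; the function names and the sample rows are made up for illustration.

```python
import csv
import io


def read_hashes(csv_text, hash_column="HASH"):
    """Return the set of hash values found in one run's CSV output."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[hash_column] for row in reader}


def diff_runs(old_csv_text, new_csv_text):
    """Return (added, removed) hash sets between two runs of one source."""
    old_hashes = read_hashes(old_csv_text)
    new_hashes = read_hashes(new_csv_text)
    return new_hashes - old_hashes, old_hashes - new_hashes


# Made-up sample data, for illustration only.
OLD_RUN = "NUMBER,STREET,HASH\n1,Main St,aaa111\n2,Main St,bbb222\n"
NEW_RUN = "NUMBER,STREET,HASH\n1,Main St,aaa111\n3,Main St,ccc333\n"

added, removed = diff_runs(OLD_RUN, NEW_RUN)
```

A row whose content changed between runs would show up as one removed hash plus one added hash, since the hash follows the content rather than a stable record identity.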

Author

eugeneYWang commented Feb 6, 2018

@migurski Thank you for responding!

Your answer certainly helps.

Just a few more questions after seeing your answer:

  1. Is the hash calculated only from the address text, and not from lat/long?

  2. Is there any promise that records in different sources with the same info (address, city, state) will have the same hash?

    2.1 If the answer to the previous question is yes, here is a follow-up to the third question in my first post. As I have observed, OpenAddresses data from local sources often lacks city and state info. If the hash is based on address text, and the OpenAddresses data from state sources has detailed city and state info, then the same address with city and state info will have a different hash from the same address without it, right?

Member

migurski commented Feb 6, 2018

It’s calculated based on the entire row in conform.py#L1210-L1219, which will include lat/lon and other details.

There’s no particular promise that rows in different sources will have identical hashes. Where I’ve looked into overlaps such as cities and counties in the Bay Area, sources will have subtly different locations, such as here:

[Image: screenshot, http://mike.teczno.com/img/oa-dotmaps-cupertino/Screen%20Shot%202017-03-31%20at%203.15.14%20PM.png]
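To illustrate why a whole-row hash differs between overlapping sources, here is a rough sketch. It is not the code in conform.py: the `row_hash` helper, the choice of MD5 over a JSON dump, and the sample coordinates are all assumptions made for the example.

```python
import hashlib
import json


def row_hash(row):
    """Hash every field of a row; a hypothetical stand-in, not conform.py."""
    # Sort the items so the result does not depend on dict ordering.
    payload = json.dumps(sorted(row.items()), separators=(",", ":"))
    return hashlib.md5(payload.encode("utf-8")).hexdigest()


# Made-up rows: the same address as two sources might publish it.
city_row = {"LON": "-122.0322", "LAT": "37.3230", "NUMBER": "10300", "STREET": "Torre Ave"}
county_row = dict(city_row, LON="-122.0323")  # slightly shifted point

same_address_different_hash = row_hash(city_row) != row_hash(county_row)
```

Because lat/lon is part of the hashed content, even a tiny shift in location gives a different hash, so matching addresses across sources would need some other comparison.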

Author

eugeneYWang commented Feb 8, 2018

@migurski Thanks for all this helpful information!

Here is probably the last relevant question:

I saw there is a "fingerprint" field in openaddresses.com/state.txt, and I wonder whether those fingerprints also reflect changes to ALL the content of their corresponding sources. In other words, can I count on those fingerprints to detect changes in each source?

Member

migurski commented Feb 8, 2018

I believe the fingerprint is an MD5 hash of the entire source. For static files, this is a great indicator of change. For ESRI FeatureServer sources, it might be more volatile than you want.
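For reference, a whole-file MD5 can be sketched as below. This only illustrates the general technique; whether the fingerprint in state.txt is computed exactly this way is not confirmed here, and `md5_fingerprint` is a hypothetical helper name.

```python
import hashlib
import io


def md5_fingerprint(stream, chunk_size=65536):
    """Return the hex MD5 digest of a binary stream, read in chunks."""
    digest = hashlib.md5()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


# Made-up file contents standing in for two downloads of one source.
original = io.BytesIO(b"1,Main St\n2,Main St\n")
modified = io.BytesIO(b"1,Main St\n2,Main Street\n")

changed = md5_fingerprint(original) != md5_fingerprint(modified)
```

Any byte-level difference changes the digest, which is why a source whose download is not byte-for-byte stable can look changed even when the addresses are the same.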

What are you hoping to do?

Author

eugeneYWang commented Feb 8, 2018

@migurski I am hoping to keep a local copy of OpenAddresses in PostGIS and keep it updated daily. So I need to find a way to figure out which source got updated and when.

Besides scraping the web page at http://results.openaddresses.io/?runs=all#runs, do you have any suggestions?

Member

migurski commented Feb 8, 2018

You might find the plaintext version of that page useful for this purpose: http://results.openaddresses.io/state.txt

It’ll tell you that a source was changed: the URLs for our processed files are immutable, so if you’ve already downloaded a zip file once you shouldn't ever need to request that same file again.
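A daily sync could use that immutability roughly as follows. This sketch assumes state.txt is tab-delimited with a header row containing `source` and `processed` columns; the function name, the sample rows, and the example.com URLs are placeholders.

```python
import csv
import io


def new_processed_urls(state_text, already_downloaded):
    """Map source -> processed URL for every URL not downloaded before."""
    reader = csv.DictReader(io.StringIO(state_text), delimiter="\t")
    pending = {}
    for row in reader:
        url = row.get("processed")
        # Skip sources with no processed file and URLs we already have.
        if url and url not in already_downloaded:
            pending[row["source"]] = url
    return pending


# Made-up rows; the example.com URLs are placeholders.
STATE_TEXT = (
    "source\tprocessed\n"
    "us/ca/a.json\thttps://example.com/runs/1/a.zip\n"
    "us/ca/b.json\thttps://example.com/runs/2/b.zip\n"
    "us/ca/c.json\t\n"
)
seen = {"https://example.com/runs/1/a.zip"}
pending = new_processed_urls(STATE_TEXT, seen)
```

Keeping the set of already-downloaded URLs in a local table means each day's job only fetches the zip files whose URLs are new.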

Author

eugeneYWang commented Feb 8, 2018

@migurski Got it. Should I monitor "fingerprint" or some other field for changes?

Just a thought: I like the "cached date" in http://results.openaddresses.io/?runs=all#runs because it is straightforward. It might be nice to also have it in http://results.openaddresses.io/state.txt

@eugeneYWang
Author

@migurski Sorry to bother you again.

I guess state.txt might be the lifesaver. Do you know of any official documentation explaining the meaning of the columns in state.txt?

Member

trescube commented Mar 8, 2018

I don't think there's an official doc, but here's the description of the fields:

  • source: the source in sources
  • cache: the URL for the cached copy of the raw source data
  • sample: 5 sample records from the latest machine run
  • geometry type: the geometry type of the source (usually Point or Polygon)
  • address count: the number of addresses processed (and available in the output) from the latest machine run
  • version fingerprint: unique hash code for the source run
  • cache time: the amount of time spent caching the source data
  • processed: the URL for the processed results from the machine run
  • process time: the amount of time spent processing the cached data
  • process hash: hash of the contents of the processed data
  • output: machine output for the source run
  • attribution required: true or false denoting whether using the data requires attribution
  • attribution name: text of the attribution to use when attribution required is true
  • share-alike: true, false, or blank when source requires share-alike licensing
  • code version: the tagged machine version used to process the source
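As a rough sketch, a few of the fields above could be loaded into typed values like this. The column names are taken from the list; the `SourceState` class, the helper names, and the sample row are hypothetical, and the exact spelling of values in the real file is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


def parse_bool(text):
    """Map 'true'/'false' to bool and a blank value to None."""
    cleaned = text.strip().lower()
    if cleaned == "":
        return None
    return cleaned == "true"


@dataclass
class SourceState:
    source: str
    address_count: Optional[int]
    process_hash: str
    attribution_required: Optional[bool]
    share_alike: Optional[bool]


def parse_state_row(row):
    """Convert one state.txt row (a dict of strings) to a SourceState."""
    count = row.get("address count", "").strip()
    return SourceState(
        source=row["source"],
        address_count=int(count) if count else None,
        process_hash=row.get("process hash", ""),
        attribution_required=parse_bool(row.get("attribution required", "")),
        share_alike=parse_bool(row.get("share-alike", "")),
    )


# Made-up row, for illustration only.
example = parse_state_row({
    "source": "us/ca/example.json",
    "address count": "1234",
    "process hash": "abc123",
    "attribution required": "true",
    "share-alike": "",
})
```

Storing `process hash` per source alongside the data would give a simple way to check whether the processed contents differ from the copy already loaded.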

Member

iandees commented Mar 8, 2018

Documentation for some of these fields is in: https://github.com/openaddresses/openaddresses/blob/master/CONTRIBUTING.md


4 participants