Brazil census data #2303
This sounds similar to US Census interpolated address ranges à la TIGER.
+1 for OpenAddresses adding support for this data type (at least interpolation points at each end of line).
On Dec 30, 2016, at 06:03, astoff wrote:
Census data could be used to generate a country-wide address list in Brazil, but it would require a dedicated script to parse and process. I will describe the situation so you can decide if this makes sense for openaddresses.
There are two datasets that would need to be combined:
An address list in each municipality (or, sometimes, a district within a municipality). For rural addresses, a GPS reading is included. For urban addresses, the id number of a city block and sidewalk segment is given. (This is at ftp://ftp.ibge.gov.br/Censos/Censo_Demografico_2010/Cadastro_Nacional_de_Enderecos_Fins_Estatisticos/)
Vector maps with the sidewalk segments. (This is at ftp://geoftp.ibge.gov.br/recortes_para_fins_estatisticos/malha_de_setores_censitarios/censo_2010/base_de_faces_de_logradouros/)
Thus, at least the block corresponding to each address can be determined.
I can't say much about the quality of the data; I suppose it varies a lot throughout the country. One issue is that the geospatial information is often not precisely aligned with OSM, see the pictures below. In any case, this is the best data one can hope to gather in Brazil (apart from a few bigger cities that may release their own addresses) at least until the next census comes out, in 5 to 10 years.
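The join described above can be sketched in a few lines. This is a minimal illustration with hypothetical, simplified records (the real CNEFE files are fixed-width text and the face ids follow IBGE's geocode scheme, neither of which is reproduced here): the address list references a "face" (one side of a city block) by id, and the face shapefile supplies its geometry, reduced here to a centroid.

```python
# Hypothetical records standing in for the two IBGE datasets.
addresses = [
    {"street": "Rua XV de Novembro", "number": "123", "face_id": "4106902-0001"},
    {"street": "Rua XV de Novembro", "number": "125", "face_id": "4106902-0001"},
]
face_centroids = {
    "4106902-0001": (-25.4296, -49.2713),  # (lat, lon) of the block face
}

def geolocate(addr, faces):
    """Attach block-face coordinates to an address record, if known."""
    coords = faces.get(addr["face_id"])
    if coords is None:
        return None  # face missing from the vector maps
    return {**addr, "lat": coords[0], "lon": coords[1]}

located = [r for r in (geolocate(a, face_centroids) for a in addresses) if r]
```

This is why the result is block-level rather than rooftop: every address on the same face gets the same coordinates.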
@nvkelso These are not address ranges. They are more like the Japanese dataset: individual addresses geolocated to the city block (as opposed to parcel or rooftop).
Interesting!
Can you provide a fully formed Brazil example address, and what this dataset would allow? Is it 1:1 or missing something?
I have attached a little example of what one can get (a part of Curitiba). The sample above contains only unique buildings, ignoring units. In another case (Porto Alegre) there are 638K dwellings (house number + unit) and 288K unique buildings (i.e., after discarding units). Again, that seems about right for a fairly dense city of 1.5 million. Should we keep or discard unit information? Note that different units at the same number will always have equal coordinates anyway.
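The dwelling-to-building collapse described above amounts to keying on (street, number) and keeping one representative record. A minimal sketch with hypothetical field names:

```python
# Dwelling-level records: house number + unit. Units at the same number
# share coordinates, so the first record seen for a (street, number)
# pair is representative of the whole building.
dwellings = [
    {"street": "Rua A", "number": "10", "unit": "101"},
    {"street": "Rua A", "number": "10", "unit": "102"},
    {"street": "Rua A", "number": "12", "unit": ""},
]

buildings = {}
for d in dwellings:
    buildings.setdefault((d["street"], d["number"]), d)
```

Here three dwellings collapse to two buildings, mirroring the 638K-to-288K reduction mentioned for Porto Alegre.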
Wow, that's amazing! Let's get this added :)
- Import strategy: Since there is some manual process involved, I suggest breaking the import into sources per state (27 total: 26 states plus the Federal District).
- Accuracy: It looks like a lot of these addresses are stacked on top of each other (so there are multiple addresses per "street block"), with "block" level precision. @migurski, is there an accuracy config to use for this?
- Units: It looks like many of the "stacked" records have an empty unit value.
- Projection: For the alignment issues, we sometimes see that in other sources. It is possible to override the projection information specified in the shapefile in the OpenAddresses source config if we can figure out a better one to use after the per-state files are generated.
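A per-state source config with a projection override might look roughly like the sketch below. This is illustrative only: the URL, the attribute names, and the EPSG code (4674 is SIRGAS 2000, the usual Brazilian datum, but the right choice would need to be verified against the generated files) are all assumptions, not taken from the actual data.

```json
{
    "data": "http://example.com/cnefe/pr.zip",
    "type": "http",
    "conform": {
        "type": "shapefile",
        "srs": "EPSG:4674",
        "number": "NUM_ENDERECO",
        "street": "NOM_LOGRADOURO"
    }
}
```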
+1, very interested in Brazil countrywide data. Units could be useful for libpostal purposes as well.
This data is great. I would say "5" for accuracy, and maybe we can reevaluate what we're using for accuracy anyway. The integer values are not great; it'd be nice to just use strings like "rooftop" or "block level".
OK, great to hear you like this. Let's discuss the concern with units. In the vast majority of cases, the "units" in this dataset refer to apartments of an apartment tower, and not different buildings within the same parcel or different street-level entrances to a given building. I imagined you would not want to retain this level of detail — are you saying you actually do want to keep it? I should also remark that every single building (residential or not) appears in the listing, but only residential units are listed.
We do have a place in the OA schema to put units, so if they exist in the data it wouldn't hurt to add them! For @thatdatabaseguy's purposes, I believe that the presence of units has also helped with libpostal training data sets.
Yes, the libpostal parser takes data sets like OpenAddresses that are already separated into fields, recreates what the full address would look like, and then trains a model to parse full addresses (from geocoder input, CSVs, etc.) into components. Having the unit information helps the system handle more types of input, for example recognizing that "Apto 202" is a unit, and thus most geocoding systems can ignore it for the purposes of determining the lat/lon (sometimes they can get confused by extra information like this, so it's useful to determine which parts of a geocoder query are important). They're not mission-critical but nice to have!
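The "recreate the full address" step described above can be sketched as follows. This is not libpostal's actual training pipeline, just a minimal illustration with hypothetical field names showing how component fields might be recombined into a full-address string with the unit kept as a separable component.

```python
def to_training_string(rec):
    """Recombine component fields into one full-address string."""
    parts = [f'{rec["street"]}, {rec["number"]}']
    if rec.get("unit"):
        parts.append(f'Apto {rec["unit"]}')  # common Brazilian unit style
    parts.append(f'{rec["city"]} - {rec["state"]}')
    return ", ".join(parts)

rec = {"street": "Rua das Flores", "number": "100", "unit": "202",
       "city": "Curitiba", "state": "PR"}
full = to_training_string(rec)
# -> "Rua das Flores, 100, Apto 202, Curitiba - PR"
```

Because the unit was a labeled field to begin with, a parser trained on such strings can learn that "Apto 202" is a unit and drop it before geocoding.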
OK, I'll stop discarding the units. One more question before I am satisfied with the script: should we try to clean up the original data in any way? (By the way, is a CSV enhancer (openaddresses/machine#283) still on the roadmap?) This particular dataset would benefit mostly from two things: splitting number suffixes into a number_suffix field, and restoring the diacritics that are stripped from street names. The second may be quite important if this is to be used as training data.
I would prefer to see caching/downloading scripts be as "dumb" as possible and constrain themselves to only downloading the data from the source in as raw a format as possible. Doing this means that we end up with the raw data downloaded to our system to start with and can adjust the transform/conform process to include different data later. The CSV enhancer/cleaner steps are still on our roadmap, and I think the changes you list would be an interesting set of "fixes" to apply to the output.
Agreed, it's usually reasonable to keep the raw data raw and update the transform over time. The number_suffix changes are a bit more intricate than what the OA transforms can currently handle, but AFAICT from the São Paulo data, these cases are fairly infrequent and shouldn't affect consumers much. Libpostal has its own cleanup/normalization which can accommodate all of the above, so no worries there, but useful to know the edge cases. Re: diacritics, it looks like they're already stripped in the source, no? For certain known words like "Praca" => "Praça" it should be possible to recover them, and libpostal has dictionaries for Portuguese that can handle cases like that, but I'd think it would be a bit more difficult for random streets (not impossible, could be done by building an index of sans-diacritics forms to their most common unnormalized forms using a data set that's known to use proper diacritics, e.g. OSM roads). In any case, probably outside the scope of OpenAddresses.
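The index idea above can be sketched with the standard library: strip combining marks from every name in a reference data set (OSM road names, say), then map each stripped form to its most frequent accented form. A minimal sketch under that assumption, with a toy reference list in place of real OSM data:

```python
import unicodedata
from collections import Counter, defaultdict

def strip_diacritics(s):
    """Remove combining marks: 'Praça' -> 'Praca'."""
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def build_restore_index(reference_names):
    """Map each stripped form to its most common accented form
    in a reference data set known to use proper diacritics."""
    by_stripped = defaultdict(Counter)
    for name in reference_names:
        by_stripped[strip_diacritics(name)][name] += 1
    return {k: c.most_common(1)[0][0] for k, c in by_stripped.items()}

index = build_restore_index(["Praça da Sé", "Praça da Sé", "Avenida São João"])
restored = index.get("Praca da Se", "Praca da Se")  # -> "Praça da Sé"
```

Names absent from the reference set simply fall through unchanged, which matches the observation that random streets are harder to recover.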
#2315 closes this. |
Anyone crazy enough to try updating this to the 2019 data: https://geoftp.ibge.gov.br/recortes_para_fins_estatisticos/malha_de_setores_censitarios/censo_2010/base_de_faces_de_logradouros_versao_2019/
I'll try running my old script. Hopefully this solves the issue where some entire cities are missing. By the way, I was automatically kicked out of the organization some time ago because I didn't have two-factor authentication set up. Can someone include me again?
@astoff I tried to run the script, but the 2019 version changed a bit. The files are packed differently and the internal structure changed too. In case you need to download all the files again, I'm seeding them as torrents. Downloading might be faster than the IBGE FTP: https://github.com/vgeorge/cnefe