
Brazil census data #2303

Closed · astoff opened this issue Dec 30, 2016 · 17 comments

Comments
@astoff
Contributor

astoff commented Dec 30, 2016

Census data could be used to generate a country-wide address list in Brazil, but it would require a dedicated script to parse and process. I will describe the situation so you can decide whether this makes sense for OpenAddresses.

There are two datasets that would need to be combined:

  • An address list in each municipality (or, sometimes, a district within a municipality). For rural addresses, a GPS reading is included. For urban addresses, the id number of a city block and sidewalk segment is given. (This is at ftp://ftp.ibge.gov.br/Censos/Censo_Demografico_2010/Cadastro_Nacional_de_Enderecos_Fins_Estatisticos/)
  • Vector maps with the sidewalk segments. (This is at ftp://geoftp.ibge.gov.br/recortes_para_fins_estatisticos/malha_de_setores_censitarios/censo_2010/base_de_faces_de_logradouros/)

Thus, at least the block corresponding to each address can be determined.
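
Roughly, a script would join the two datasets on the face id and take a point on the matching segment. A minimal sketch of the idea; the file and field names here (FACE_ID, etc.) are hypothetical, not the real CNEFE fixed-width layout:

```python
import geopandas as gpd
import pandas as pd

# Street-face geometries; one record per sidewalk segment ("face").
faces = gpd.read_file("base_de_faces_de_logradouros.shp")
faces["point"] = faces.geometry.centroid   # one block-level point per face

# CNEFE address records; FACE_ID is a hypothetical key assembled from
# the sector/block/face id columns of the fixed-width source files.
addresses = pd.read_csv("cnefe_municipio.csv", dtype=str)

merged = addresses.merge(faces[["FACE_ID", "point"]], on="FACE_ID", how="left")
merged["lon"] = merged["point"].map(lambda p: p.x if pd.notna(p) else None)
merged["lat"] = merged["point"].map(lambda p: p.y if pd.notna(p) else None)
```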

I can't say much about the quality of the data; I suppose it varies a lot throughout the country. One issue is that the geospatial information is often not precisely aligned with OSM, see the pictures below. In any case, this is the best data one can hope to gather in Brazil (apart from a few bigger cities that may release their own addresses) at least until the next census comes out, in 5 to 10 years.

[Pictures 1 and 2: census geometry overlaid on OSM, illustrating the alignment offset]

@nvkelso
Member

nvkelso commented Dec 30, 2016 via email

@astoff
Contributor Author

astoff commented Dec 30, 2016

@nvkelso These are not address ranges. They are more like the Japanese dataset: individual addresses geolocated to city-block precision (as opposed to parcel or rooftop).

@nvkelso
Member

nvkelso commented Dec 30, 2016 via email

@astoff
Contributor Author

astoff commented Jan 1, 2017

I have attached a small example of what one can get (a part of Curitiba): c-uniq.csv.gz. Judging by the number of dots, I'd say the dataset is fairly complete.

The sample above contains only unique buildings, ignoring units. In another case (Porto Alegre) there are 638K dwellings (house number + unit), and 288K unique buildings (i.e., after discarding units). Again, that seems about right for a fairly dense city of 1.5 million.

Should we keep or discard unit information? Note that different units at the same number will always have identical coordinates anyway.
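
For reference, the deduplication above is just a matter of dropping the unit column before deduplicating. A sketch, assuming hypothetical column names rather than the real CNEFE fields:

```python
import pandas as pd

df = pd.read_csv("porto_alegre.csv", dtype=str)

dwellings = len(df)  # rows including units (~638K for Porto Alegre)
buildings = df.drop_duplicates(subset=["street", "number", "lon", "lat"])
print(dwellings, len(buildings))  # ~638K dwellings vs. ~288K buildings
```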

@nvkelso
Member

nvkelso commented Jan 7, 2017

Wow, that's amazing! Let's get this added :)

Import strategy: Since there is some manual process involved, I suggest breaking the import into one source per state (27 total: 26 states plus the Federal District).

Accuracy: It looks like a lot of these addresses are stacked on top of each other (so there are multiple addresses per "street block"), with "block" level precision. @migurski is there an accuracy config to use for this? Looks like 4 or 5 now, or maybe add one for 6 to indicate block level?

Units: Looks like many of the "stacked" records have an empty unit property in the CSV table. Is there a unit available in the source data?

Projection: As for the alignment issues, we sometimes see that in other sources. It is possible to override the projection information specified in the shapefile in the OpenAddresses source config, if we can figure out a better one to use after the per-state files are generated.
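
One rough way to pick an override is to reproject a sample point under a few candidate CRSes and compare against a trusted OSM location. A sketch with pyproj; the EPSG candidates and coordinates are assumptions (SIRGAS 2000 is the official Brazilian datum, so its UTM variant is a likely suspect):

```python
from pyproj import Transformer

# Candidate CRSes (assumptions): SIRGAS 2000 / UTM 22S, SAD69 / UTM 22S,
# WGS 84 / UTM 22S -- plausible suspects for Paraná shapefiles.
candidates = ["EPSG:31982", "EPSG:29192", "EPSG:32722"]
sample_xy = (674000.0, 7186000.0)  # hypothetical coordinate from the shapefile
osm_lonlat = (-49.27, -25.43)      # trusted reference point (OSM, Curitiba)

for crs in candidates:
    to_wgs84 = Transformer.from_crs(crs, "EPSG:4326", always_xy=True)
    lon, lat = to_wgs84.transform(*sample_xy)
    offset = ((lon - osm_lonlat[0]) ** 2 + (lat - osm_lonlat[1]) ** 2) ** 0.5
    print(crs, f"offset ~ {offset:.5f} deg")  # smallest offset wins
```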

@albarrentine
Contributor

+1, very interested in Brazil countrywide data. Units could be useful for libpostal purposes as well.

@migurski
Member

migurski commented Jan 7, 2017

This data is great. I would say "5" for accuracy, and maybe we can reevaluate what we're using for accuracy anyway. The integer values are not great; it'd be nicer to just use strings like "rooftop" or "block level".

@astoff
Contributor Author

astoff commented Jan 7, 2017

OK, great to hear you like this. Let's discuss the concern with units.

In the vast majority of cases, the "units" in this dataset refer to apartments of an apartment tower, and not different buildings within the same parcel or different street-level entrances to a given building. I imagined you would not want to retain this level of detail — are you saying you actually do want to keep it?

I should also remark that every single building (residential or not) appears in the listing, but only residential units are listed.

@migurski
Member

migurski commented Jan 7, 2017

We do have a place in the OA schema to put units, so if they exist in the data it wouldn't hurt to add them! For @thatdatabaseguy’s purposes, I believe that the presence of units has also helped with libpostal training data sets.

@albarrentine
Contributor

Yes, the libpostal parser takes data sets like OpenAddresses that are already separated into fields, recreates what the full address would look like, and then trains a model to parse full addresses (from geocoder input, CSVs, etc) into components. Having the unit information helps the system handle more types of input, for example recognizing that "Apto 202" is a unit, and thus most geocoding systems can ignore it for the purposes of determining the lat/lon (sometimes they can get confused by extra information like this, so it's useful to determine which parts of a geocoder query are important). They're not mission-critical but nice to have!

@astoff
Contributor Author

astoff commented Jan 10, 2017

OK, I'll stop discarding the units.

One more question before I am satisfied with the script: Should we try to clean up the original data in any way? (By the way, is a CSV enhancer (openaddresses/machine#283) still on the roadmap?)

This particular dataset would benefit mostly from the following two things, the second of which may be quite important if this is to be used as training data:

  1. Titlecase and add diacritics to street names.
  2. Clean up the number_suffix field (which is appended to the number in the current form of Brazil census addresses #2315). This field is used for various random purposes (a normalization sketch follows this list):
    1. An actual suffix for the number, like "A" in "7A"
    2. "SN" for "sem número", which is most commonly rendered "s/nº"
    3. "KM" in rural areas indicates not a house number but the milestone where the property is located. Should typically be prepended, not appended, to the number.
    4. Sometimes unit information is given in the number_suffix, the most common cases being "FRENTE" and "FUNDOS" (front side versus back side of the parcel)
    5. Various annotations:
      • "CASA" or "ED", to indicate it's a single-unit respectively multi-unit building. Safe to discard.
      • The name of a utility or government agency ("DMAE", "CEEE", "FUNASA", etc.) indicates an "issuing authority" for the number. Common in informal or very recent settlements. This I'd say should be kept.
      • Many other random annotations, which usually seem safe to discard.
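
A hedged sketch of that number_suffix normalization, covering only the cases enumerated above; the function name and the agency list are illustrative, and real data will need a longer rule list:

```python
import re

AGENCIES = {"DMAE", "CEEE", "FUNASA"}  # known issuing authorities: keep

def clean_number(number: str, suffix: str):
    """Fold number_suffix into a (number, unit) pair per the cases above."""
    s = suffix.strip().upper()
    if not s:
        return number, ""
    if s == "SN":                        # "sem número"
        return "s/nº", ""
    if s == "KM":                        # rural milestone: prepend
        return f"KM {number}", ""
    if s in ("FRENTE", "FUNDOS"):        # really unit information
        return number, s.capitalize()
    if s in ("CASA", "ED"):              # annotation, safe to discard
        return number, ""
    if s in AGENCIES:                    # issuing authority, keep
        return f"{number} {s}", ""
    if re.fullmatch(r"[A-Z]", s):        # true suffix: "7" + "A" -> "7A"
        return f"{number}{s}", ""
    return number, ""                    # other annotations: discard
```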

@iandees
Member

iandees commented Jan 10, 2017

I would prefer to see caching/downloading scripts be as "dumb" as possible and constrain themselves to downloading the data from the source in as raw a format as possible. That way we end up with the raw data on our system to start with, and we can adjust the transform/conform process to include different data later.

The CSV enhancer/cleaner steps are still on our roadmap, and I think these changes you list would be an interesting set of "fixes" to apply to the output.

@albarrentine
Contributor

Agreed, it's usually reasonable to keep the raw data raw and update the transform over time. The number_suffix changes are a bit more intricate than what the OA transforms can currently handle, but AFAICT from the São Paulo data, these cases are fairly infrequent and shouldn't affect consumers much.

Libpostal has its own cleanup/normalization which can accommodate all of the above, so no worries there, but useful to know the edge cases.

Re: diacritics, it looks like they're already stripped in the source, no? For certain known words like "Praca" => "Praça" it should be possible to recover them, and libpostal has dictionaries for Portuguese that handle cases like that. Arbitrary street names would be harder (though not impossible: one could build an index from sans-diacritics forms to their most common accented forms, using a data set known to carry proper diacritics, e.g. OSM roads). In any case, this is probably outside the scope of OpenAddresses.
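
For illustration, such an index could be built like this; the strip_accents helper and the sample names are just for demonstration:

```python
import unicodedata
from collections import Counter, defaultdict

def strip_accents(s: str) -> str:
    # NFD splits "ç" into "c" + combining cedilla; drop the marks (Mn).
    return "".join(c for c in unicodedata.normalize("NFD", s)
                   if unicodedata.category(c) != "Mn")

def build_index(reference_names):
    counts = defaultdict(Counter)
    for name in reference_names:
        counts[strip_accents(name).lower()][name] += 1
    # For each stripped key, keep the most frequent accented form.
    return {key: c.most_common(1)[0][0] for key, c in counts.items()}

index = build_index(["Praça da Sé", "Praça da Sé", "Praca da Se"])
print(index["praca da se"])  # -> "Praça da Sé"
```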

@astoff
Contributor Author

astoff commented Jan 12, 2017

#2315 closes this.

@astoff astoff closed this as completed Jan 12, 2017
@justinelliotmeyers
Member

justinelliotmeyers commented Jan 26, 2021

Anyone crazy enough to try to update this to the 2019 data: https://geoftp.ibge.gov.br/recortes_para_fins_estatisticos/malha_de_setores_censitarios/censo_2010/base_de_faces_de_logradouros_versao_2019/
Red is 2019, black is 2010:
[image]

The old data is spatially off:
[image]

@astoff
Contributor Author

astoff commented Jan 31, 2021

I'll try running my old script. Hopefully this solves the issue where some entire cities are missing.

By the way, I was automatically removed from the organization some time ago because I didn't have two-factor authentication set up. Can someone add me back?

@vgeorge

vgeorge commented Jan 31, 2021

@astoff I tried to run the script, but the 2019 version changed a bit: the files are packed differently, and the internal structure changed too. In case you need to download all the files again, I'm seeding them as torrents; downloading might be faster than the IBGE FTP: https://github.com/vgeorge/cnefe
