
Brazil census data #2303

Closed · astoff opened this issue Dec 30, 2016 · 17 comments

Comments
@astoff
Contributor

astoff commented Dec 30, 2016

Census data could be used to generate a country-wide address list in Brazil, but it would require a dedicated script to parse and process. I will describe the situation so you can decide whether this makes sense for OpenAddresses.

There are two datasets that would need to be combined:

  • An address list in each municipality (or, sometimes, a district within a municipality). For rural addresses, a GPS reading is included. For urban addresses, the id number of a city block and sidewalk segment is given. (This is at ftp://ftp.ibge.gov.br/Censos/Censo_Demografico_2010/Cadastro_Nacional_de_Enderecos_Fins_Estatisticos/)
  • Vector maps with the sidewalk segments. (This is at ftp://geoftp.ibge.gov.br/recortes_para_fins_estatisticos/malha_de_setores_censitarios/censo_2010/base_de_faces_de_logradouros/)

Thus, at least the block corresponding to each address can be determined.
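
Roughly, a script would join the two datasets on the face id and take a point on the matching segment. A minimal sketch of the idea; the file and field names here (FACE_ID, etc.) are hypothetical, not the real CNEFE fixed-width layout:

```python
import geopandas as gpd
import pandas as pd

# Street-face geometries; one record per sidewalk segment ("face").
faces = gpd.read_file("base_de_faces_de_logradouros.shp")
faces["point"] = faces.geometry.centroid   # one block-level point per face

# CNEFE address records; FACE_ID is a hypothetical key assembled from
# the sector/block/face id columns of the fixed-width source files.
addresses = pd.read_csv("cnefe_municipio.csv", dtype=str)

merged = addresses.merge(faces[["FACE_ID", "point"]], on="FACE_ID", how="left")
merged["lon"] = merged["point"].map(lambda p: p.x if pd.notna(p) else None)
merged["lat"] = merged["point"].map(lambda p: p.y if pd.notna(p) else None)
```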

I can't say much about the quality of the data; I suppose it varies a lot throughout the country. One issue is that the geospatial information is often not precisely aligned with OSM, see the pictures below. In any case, this is the best data one can hope to gather in Brazil (apart from a few bigger cities that may release their own addresses) at least until the next census comes out, in 5 to 10 years.

[Pictures 1 and 2: census geometry overlaid on OSM, illustrating the alignment offset]

@nvkelso
Member

nvkelso commented Dec 30, 2016 via email

@astoff
Contributor Author

astoff commented Dec 30, 2016

@nvkelso These are not address ranges. They are more like the Japanese dataset: individual addresses geolocated to city-block precision (as opposed to parcel or rooftop).

@nvkelso
Member

nvkelso commented Dec 30, 2016 via email

@astoff
Contributor Author

astoff commented Jan 1, 2017

I have attached a small example of what one can get (a part of Curitiba): c-uniq.csv.gz. Judging by the number of dots, I'd say the dataset is fairly complete.

The sample above contains only unique buildings, ignoring units. In another case (Porto Alegre) there are 638K dwellings (house number + unit), and 288K unique buildings (i.e., after discarding units). Again, that seems about right for a fairly dense city of 1.5 million.

Should we keep or discard unit information? Note that different units at the same number will always have identical coordinates anyway.
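
For reference, the deduplication above is just a matter of dropping the unit column before deduplicating. A sketch, assuming hypothetical column names rather than the real CNEFE fields:

```python
import pandas as pd

df = pd.read_csv("porto_alegre.csv", dtype=str)

dwellings = len(df)  # rows including units (~638K for Porto Alegre)
buildings = df.drop_duplicates(subset=["street", "number", "lon", "lat"])
print(dwellings, len(buildings))  # ~638K dwellings vs. ~288K buildings
```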

@nvkelso
Member

nvkelso commented Jan 7, 2017

Wow, that's amazing! Let's get this added :)

Import strategy: Since there is some manual process involved, I suggest breaking the import into one source per state (27 total: 26 states plus the Federal District).

Accuracy: It looks like a lot of these addresses are stacked on top of each other (so there are multiple addresses per "street block"), with "block" level precision. @migurski is there an accuracy config to use for this? Looks like 4 or 5 now, or maybe add one for 6 to indicate block level?

Units: Looks like many of the "stacked" records have an empty unit property in the CSV table. Is there a unit available in the source data?

Projection: As for the alignment issues, we sometimes see that in other sources. It is possible to override the projection information specified in the shapefile in the OpenAddresses source config, if we can figure out a better one to use after the per-state files are generated.
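
One rough way to pick an override is to reproject a sample point under a few candidate CRSes and compare against a trusted OSM location. A sketch with pyproj; the EPSG candidates and coordinates are assumptions (SIRGAS 2000 is the official Brazilian datum, so its UTM variant is a likely suspect):

```python
from pyproj import Transformer

# Candidate CRSes (assumptions): SIRGAS 2000 / UTM 22S, SAD69 / UTM 22S,
# WGS 84 / UTM 22S -- plausible suspects for Paraná shapefiles.
candidates = ["EPSG:31982", "EPSG:29192", "EPSG:32722"]
sample_xy = (674000.0, 7186000.0)  # hypothetical coordinate from the shapefile
osm_lonlat = (-49.27, -25.43)      # trusted reference point (OSM, Curitiba)

for crs in candidates:
    to_wgs84 = Transformer.from_crs(crs, "EPSG:4326", always_xy=True)
    lon, lat = to_wgs84.transform(*sample_xy)
    offset = ((lon - osm_lonlat[0]) ** 2 + (lat - osm_lonlat[1]) ** 2) ** 0.5
    print(crs, f"offset ~ {offset:.5f} deg")  # smallest offset wins
```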

@albarrentine
Contributor

+1, very interested in Brazil countrywide data. Units could be useful for libpostal purposes as well.

@migurski
Member

migurski commented Jan 7, 2017

This data is great. I would say "5" for accuracy, and maybe we can reevaluate what we're using for accuracy anyway. The integer values are not great; it'd be nicer to just use strings like "rooftop" or "block level".

@astoff
Contributor Author

astoff commented Jan 7, 2017

OK, great to hear you like this. Let's discuss the concern with units.

In the vast majority of cases, the "units" in this dataset refer to apartments of an apartment tower, and not different buildings within the same parcel or different street-level entrances to a given building. I imagined you would not want to retain this level of detail — are you saying you actually do want to keep it?

I should also remark that every single building (residential or not) appears in the listing, but only residential units are listed.

@migurski
Member

migurski commented Jan 7, 2017

We do have a place in the OA schema to put units, so if they exist in the data it wouldn't hurt to add them! For @thatdatabaseguy’s purposes, I believe that the presence of units has also helped with libpostal training data sets.

@albarrentine
Contributor

Yes, the libpostal parser takes data sets like OpenAddresses that are already separated into fields, recreates what the full address would look like, and then trains a model to parse full addresses (from geocoder input, CSVs, etc) into components. Having the unit information helps the system handle more types of input, for example recognizing that "Apto 202" is a unit, and thus most geocoding systems can ignore it for the purposes of determining the lat/lon (sometimes they can get confused by extra information like this, so it's useful to determine which parts of a geocoder query are important). They're not mission-critical but nice to have!

@astoff
Contributor Author

astoff commented Jan 10, 2017

OK, I'll stop discarding the units.

One more question before I am satisfied with the script: Should we try to clean up the original data in any way? (By the way, is a CSV enhancer (openaddresses/machine#283) still on the roadmap?)

This particular dataset would benefit mostly from the following two things, the second of which may be quite important if this is to be used as training data:

  1. Titlecase and add diacritics to street names.
  2. Clean up the number_suffix field (which is appended to the number in the current form of Brazil census addresses #2315). This field is used for various random purposes (a normalization sketch follows this list):
    1. An actual suffix for the number, like "A" in "7A"
    2. "SN" for "sem número", which is most commonly rendered "s/nº"
    3. "KM" in rural areas indicates not a house number but the milestone where the property is located. Should typically be prepended, not appended, to the number.
    4. Sometimes unit information is given in the number_suffix, the most common cases being "FRENTE" and "FUNDOS" (front side versus back side of the parcel)
    5. Various annotations:
      • "CASA" or "ED", to indicate it's a single-unit respectively multi-unit building. Safe to discard.
      • The name of a utility or government agency ("DMAE", "CEEE", "FUNASA", etc.) indicates an "issuing authority" for the number. Common in informal or very recent settlements. This I'd say should be kept.
      • Many other random annotations, which usually seem safe to discard.
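
A hedged sketch of that number_suffix normalization, covering only the cases enumerated above; the function name and the agency list are illustrative, and real data will need a longer rule list:

```python
import re

AGENCIES = {"DMAE", "CEEE", "FUNASA"}  # known issuing authorities: keep

def clean_number(number: str, suffix: str):
    """Fold number_suffix into a (number, unit) pair per the cases above."""
    s = suffix.strip().upper()
    if not s:
        return number, ""
    if s == "SN":                        # "sem número"
        return "s/nº", ""
    if s == "KM":                        # rural milestone: prepend
        return f"KM {number}", ""
    if s in ("FRENTE", "FUNDOS"):        # really unit information
        return number, s.capitalize()
    if s in ("CASA", "ED"):              # annotation, safe to discard
        return number, ""
    if s in AGENCIES:                    # issuing authority, keep
        return f"{number} {s}", ""
    if re.fullmatch(r"[A-Z]", s):        # true suffix: "7" + "A" -> "7A"
        return f"{number}{s}", ""
    return number, ""                    # other annotations: discard
```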

@iandees
Member

iandees commented Jan 10, 2017

I would prefer to see caching/downloading scripts be as "dumb" as possible and constrain themselves to downloading the data from the source in as raw a format as possible. That way we end up with the raw data on our system to start with, and we can adjust the transform/conform process to include different data later.

The CSV enhancer/cleaner steps are still on our roadmap, and I think these changes you list would be an interesting set of "fixes" to apply to the output.

@albarrentine
Contributor

Agreed, it's usually reasonable to keep the raw data raw and update the transform over time. The number_suffix changes are a bit more intricate than what the OA transforms can currently handle, but AFAICT from the São Paulo data, these cases are fairly infrequent and shouldn't affect consumers much.

Libpostal has its own cleanup/normalization which can accommodate all of the above, so no worries there, but useful to know the edge cases.

Re: diacritics, it looks like they're already stripped in the source, no? For certain known words like "Praca" => "Praça" it should be possible to recover them, and libpostal has dictionaries for Portuguese that handle cases like that. Arbitrary street names would be harder (though not impossible: one could build an index from sans-diacritics forms to their most common accented forms, using a data set known to carry proper diacritics, e.g. OSM roads). In any case, this is probably outside the scope of OpenAddresses.
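
For illustration, such an index could be built like this; the strip_accents helper and the sample names are just for demonstration:

```python
import unicodedata
from collections import Counter, defaultdict

def strip_accents(s: str) -> str:
    # NFD splits "ç" into "c" + combining cedilla; drop the marks (Mn).
    return "".join(c for c in unicodedata.normalize("NFD", s)
                   if unicodedata.category(c) != "Mn")

def build_index(reference_names):
    counts = defaultdict(Counter)
    for name in reference_names:
        counts[strip_accents(name).lower()][name] += 1
    # For each stripped key, keep the most frequent accented form.
    return {key: c.most_common(1)[0][0] for key, c in counts.items()}

index = build_index(["Praça da Sé", "Praça da Sé", "Praca da Se"])
print(index["praca da se"])  # -> "Praça da Sé"
```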

@astoff
Contributor Author

astoff commented Jan 12, 2017

#2315 closes this.

@astoff astoff closed this as completed Jan 12, 2017
@justinelliotmeyers
Member

justinelliotmeyers commented Jan 26, 2021

Anyone crazy enough to try to update this to the 2019 data: https://geoftp.ibge.gov.br/recortes_para_fins_estatisticos/malha_de_setores_censitarios/censo_2010/base_de_faces_de_logradouros_versao_2019/
Red is 2019, black is 2010:
[image]

The old data is spatially off:
[image]

@astoff
Contributor Author

astoff commented Jan 31, 2021

I'll try running my old script. Hopefully this solves the issue where some entire cities are missing.

By the way, I was automatically removed from the organization some time ago because I didn't have two-factor authentication set up. Can someone add me back?

@vgeorge

vgeorge commented Jan 31, 2021

@astoff I tried to run the script, but the 2019 version changed a bit: the files are packed differently, and the internal structure changed too. In case you need to download all the files again, I'm seeding them as torrents; downloading might be faster than the IBGE FTP: https://github.com/vgeorge/cnefe
