# Geocoding Brazilian Addresses

## Overview

This notebook forward-geocodes Brazilian firms' addresses using the Google Geocoding API, after a trial run of 1,000 sampled addresses against the Mapbox geocoder. It keeps a cache file to avoid repeated API calls for the same address. It also generates a `folium` map with markers for a random sample of 100 high-precision addresses so the geocoding results can be spot-checked visually.

## Output

This notebook writes three `.csv` files: `geocode_cache.csv`, `geocoded_data.csv`, and `geocoded_data_high_precision.csv`. It also saves the spot-check map as `spotcheck.html`.

- `geocode_cache.csv`: A cache file that stores the geocoding results for addresses already processed.
- `geocoded_data.csv`: A file containing the original dataset with 8 extra columns: `municipio`, `uf`, `pais`, `full_address`, `lat`, `lng`, `status`, and `location_type`. `status` is `OK` if the address was successfully geocoded; otherwise it holds the status returned by the API, such as `UNKNOWN_ERROR` or `ZERO_RESULTS`. `location_type` indicates the type of location returned by the API.
- `geocoded_data_high_precision.csv`: `geocoded_data.csv` filtered to include only addresses with `location_type` equal to `ROOFTOP` or `RANGE_INTERPOLATED`, i.e. addresses with high-precision geocoding results.
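
As a quick way to sanity-check an output file, the rows can be tallied by `status` and `location_type`. The sketch below is only illustrative: `SAMPLE_CSV` is an invented three-row excerpt standing in for `geocoded_data.csv` (only a subset of its columns), and `summarize` is a hypothetical helper, not part of the notebook.

```python
import csv
import io
from collections import Counter

# Hypothetical excerpt standing in for geocoded_data.csv (values invented for illustration)
SAMPLE_CSV = """full_address,lat,lng,status,location_type
"RUA A 1, X, Y, BR",-10.0,-50.0,OK,ROOFTOP
"RUA B 2, X, Y, BR",-11.0,-51.0,OK,APPROXIMATE
"RUA C 3, X, Y, BR",,,ZERO_RESULTS,
"""

def summarize(csv_text):
    """Count rows per status and, for successfully geocoded rows, per location_type."""
    status_counts, type_counts = Counter(), Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        status_counts[row["status"]] += 1
        if row["status"] == "OK":
            type_counts[row["location_type"]] += 1
    return status_counts, type_counts

status_counts, type_counts = summarize(SAMPLE_CSV)
print(dict(status_counts))
print(dict(type_counts))
```

For the real file, the same function could be fed the file's text instead of `SAMPLE_CSV`.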

In [None]:
from chardet.universaldetector import UniversalDetector

file_path = 'data/raw/firm_ids_and_cities.csv'

# 1. Determine the encoding of the input file
detector = UniversalDetector()
with open(file_path, 'rb') as f:
    # Feed the file line by line and stop as soon as the detector is confident
    for line in f:
        detector.feed(line)
        if detector.done:
            break
detector.close()
encoding = detector.result['encoding']
print(f"Detected encoding: {encoding}")


Detected encoding: ISO-8859-1
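
`chardet` can report `None` as the encoding when it cannot decide, in which case `pd.read_csv` would fall back to its default. A minimal sketch of a guard, assuming a hypothetical helper `pick_encoding` and an assumed fallback order (UTF-8 first, then ISO-8859-1, which can decode any byte sequence):

```python
def pick_encoding(detected, raw_sample, fallbacks=("utf-8", "ISO-8859-1")):
    """Return the first candidate encoding that decodes raw_sample without error."""
    candidates = [detected] if detected else []
    candidates += [enc for enc in fallbacks if enc not in candidates]
    for enc in candidates:
        try:
            raw_sample.decode(enc)
            return enc
        except (UnicodeDecodeError, LookupError):
            # UnicodeDecodeError: wrong encoding; LookupError: unknown encoding name
            continue
    raise ValueError("no candidate encoding could decode the sample")

# Latin-1 bytes for an accented city name are not valid UTF-8
sample_bytes = "Rondônia".encode("latin-1")
print(pick_encoding(None, sample_bytes))
```

A successful decode does not prove the encoding is right, so this is a fallback of last resort rather than a replacement for detection.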


In [None]:
import pandas as pd
import unidecode
import re

dtypes = {
    "cnpj_cei": "string",
    "city_code": "float64",
    "end_logradouro": "string",
    "city": "string",
    "year": "int64",
}

# 2.1 Read the full panel (CSV or feather/pkl)
df = pd.read_csv(file_path, encoding=encoding, dtype=dtypes)
# or: df = pd.read_pickle("rais_data.pkl")

print(df.shape)
print(df.columns)
print(df.head())

fill_na_values = {
    "cnpj_cei": "",
    "city_code": 0,
    "end_logradouro": "",
    "city": "",
    "year": 0,
}

print(f"Number of null values in each column: \n{df.isna().sum()}")

# Fill null values with per-column defaults ("" for text columns, 0 for numeric ones)
df.fillna(value=fill_na_values, inplace=True)

# 2.2 Normalize & build full_address
def clean_text(s):
    s = str(s).strip()                        # trim whitespace
    s = re.sub(r"\s+", " ", s)               # collapse spaces
    return s

df["municipio"], df["uf"] = zip(*df["city"]
    .str.split(",", n=1)
    .apply(lambda parts: (clean_text(parts[0]), clean_text(parts[1]) if len(parts)>1 else "")))

df["pais"] = "BR"
df["end_logradouro"] = df["end_logradouro"].apply(clean_text)

# Concatenate the address components
df["full_address"] = (
    df["end_logradouro"] + ", " +
    df["municipio"]     + ", " +
    df["uf"]            + ", " +
    df["pais"]
)
# Upper-case, remove accents
df["full_address"] = (
    df["full_address"]
      .str.upper()
      .apply(unidecode.unidecode)
)

# 2.3 Extract unique addresses lookup
lookup = pd.DataFrame(df["full_address"].unique(), columns=["full_address"])
print(lookup.head())
print(f"Lookup shape: {lookup.shape}")
print(f"Number of null values: {lookup.isna().sum().sum()}")

(549082, 5)
Index(['cnpj_cei', 'city_code', 'end_logradouro', 'city', 'year'], dtype='object')
         cnpj_cei  city_code                              end_logradouro  \
0   2460658001930   110002.0                 RODOVIA BR 364, KM 523,5  .   
1  84643881000159   110002.0                                AV JARU  S/N   
2  34773267000133   110002.0                          AV RIO NEGRO 2260    
3  84623768000101   110002.0  RUA A COM RODOVIA BR 421  CAIXA POSTAL 135   
4  22861090000148   110002.0                 RODOVIA RO 01 KM 01 1 KM 01   

                  city  year  
0  Ariquemes, Rondônia  2003  
1  Ariquemes, Rondônia  2003  
2  Ariquemes, Rondônia  2003  
3  Ariquemes, Rondônia  2003  
4  Ariquemes, Rondônia  2003  
Number of null values in each column: 
cnpj_cei           7
city_code         27
end_logradouro     9
city              27
year               0
dtype: int64
                                        full_address
0  RODOVIA BR 364, KM 523,5 ., ARIQUEMES, RONDONI...
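
The normalization in step 2.2 can be tried on a single row without loading the panel. The sketch below mirrors the same steps (trim, collapse whitespace, split the city, concatenate, upper-case, remove accents). It swaps `unidecode` for the standard-library `unicodedata` module as a stand-in, which handles Portuguese accents but is not equivalent for every character; `strip_accents` and `build_full_address` are hypothetical names.

```python
import re
import unicodedata

def clean_text(s):
    """Trim and collapse runs of whitespace, as in the notebook."""
    return re.sub(r"\s+", " ", str(s).strip())

def strip_accents(s):
    """Stand-in for unidecode: decompose characters and drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", s)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def build_full_address(logradouro, city, country="BR"):
    """Build 'STREET, MUNICIPIO, UF, BR' from a street and a 'Municipio, UF' string."""
    parts = city.split(",", 1)
    municipio = clean_text(parts[0])
    uf = clean_text(parts[1]) if len(parts) > 1 else ""
    full = ", ".join([clean_text(logradouro), municipio, uf, country])
    return strip_accents(full.upper())

print(build_full_address("  AV  JARU  S/N ", "Ariquemes, Rondônia"))
# → AV JARU S/N, ARIQUEMES, RONDONIA, BR
```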

In [None]:
import time
import requests
import dotenv
import os

dotenv.load_dotenv()

# 3.1 Sample 1,000 for trial geocode
sample = lookup.sample(1000, random_state=1)

# 3.2 Send `sample["full_address"]` to a free geocoder - Mapbox
#     Flag those with errors (e.g. missing numbers, “S/N”, bad municipio).
#     Correct or drop in `lookup` before moving on.

MAPBOX_URL = "https://api.mapbox.com/search/geocode/v6/forward"

def geocode_address(address: str, max_retries: int = 3) -> dict:
    """Forward-geocode one address with Mapbox; status is "OK" or "ERROR"."""
    params = {
        "q": address,
        "access_token": os.getenv("MAPBOX_ACCESS_TOKEN"),
        "country": "BR",
        "limit": 1,
    }
    # Default record, returned when no feature is found or the request fails
    record = {
        "full_address": address,
        "lat": None,
        "lng": None,
        "status": "ERROR",
        "location_type": None,
    }
    for _ in range(max_retries):
        response = requests.get(MAPBOX_URL, params=params,
                                headers={"User-Agent": "kcao@bu.edu"}, timeout=30)
        if response.status_code == 429:
            # Rate limited: wait, then retry the same address
            print("Rate limit exceeded. Waiting 5 seconds before retrying.")
            time.sleep(5)
            continue
        if response.status_code == 200:
            features = response.json().get("features") or []
            if features:
                feature = features[0]
                lng, lat = feature["geometry"]["coordinates"]  # GeoJSON order is [lng, lat]
                record.update(lat=lat, lng=lng, status="OK",
                              location_type=feature["properties"]["feature_type"])
        break
    return record

new_rows = []
for address in sample["full_address"]:
    record = geocode_address(address)
    new_rows.append(record)
    time.sleep(0.1)

num_errors = sum(1 for row in new_rows if row["status"] == "ERROR")
print(f"Number of errors: {num_errors}")
print(f"Number of successful geocodes: {len(new_rows) - num_errors}")
error_rows = [row for row in new_rows if row["status"] == "ERROR"]
print(f"Error rows: {error_rows}")







## Sampling results

Number of samples: 2000 

Number of errors encountered: 0
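
A geocoder can return a result for an address that is still too vague to trust, which is why step 3.2 mentions flagging missing numbers, "S/N", and bad municipio values. A rough heuristic sketch of such flags follows; `flag_address` and the flag names are hypothetical, and the rules are assumptions that would need tuning against the real data.

```python
import re

SN_PATTERN = re.compile(r"\bS/?N\b")  # "S/N" or "SN" (sem numero) as a standalone token
HAS_DIGIT = re.compile(r"\d")

def flag_address(full_address):
    """Return a list of heuristic quality flags for 'STREET, MUNICIPIO, UF, BR'."""
    # The street itself may contain commas, so split off the last three components
    parts = full_address.rsplit(", ", 3)
    if len(parts) != 4:
        return ["malformed"]
    street, municipio, uf, _country = parts
    flags = []
    if SN_PATTERN.search(street) or not HAS_DIGIT.search(street):
        flags.append("no_street_number")
    if not municipio:
        flags.append("missing_municipio")
    return flags

print(flag_address("AV JARU S/N, ARIQUEMES, RONDONIA, BR"))
print(flag_address("AV RIO NEGRO 2260, ARIQUEMES, RONDONIA, BR"))
```

Any digit counts as a number here, so a highway name such as "RODOVIA BR 364" passes the check even without a building number.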

In [None]:
# 3.3 Inspect the sample geocode results
new_rows[:10]

[{'full_address': 'RUA SIMAO BARBOSA 1208 EDWIGES LOJA 10, CANINDE, CEARA, BR',
  'lat': -4.360424,
  'lng': -39.313174,
  'status': 'OK',
  'location_type': 'address'},
 {'full_address': 'PANDIA CALOGENAS, BLUMENAU, SANTA CATARINA, BR',
  'lat': -26.927142,
  'lng': -49.062726,
  'status': 'OK',
  'location_type': 'street'},
 {'full_address': 'AV. PRINCESA ISABEL 10 AV ATLA 1020, RIO DE JANEIRO, RIO DE JANEIRO, BR',
  'lat': -22.96442,
  'lng': -43.173397,
  'status': 'OK',
  'location_type': 'address'},
 {'full_address': 'RUA JOSE ALVES BEZERRA 454, JABOATAO DOS GUARARAPES, PERNAMBUCO, BR',
  'lat': -8.163556,
  'lng': -34.937768,
  'status': 'OK',
  'location_type': 'address'},
 {'full_address': 'RUA MARACA, BELO HORIZONTE, MINAS GERAIS, BR',
  'lat': -19.945241,
  'lng': -43.946738,
  'status': 'OK',
  'location_type': 'street'},
 {'full_address': 'AVENIDA BASILEIA, RESENDE, RIO DE JANEIRO, BR',
  'lat': -22.470257,
  'lng': -44.46742,
  'status': 'OK',
  'location_type': 'street'}

In [None]:
import requests
import time
from pathlib import Path
import os
from dotenv import load_dotenv
load_dotenv()

# 4.1 Load or initialize cache
cache_dtypes = {
    "full_address": "string",
    "lat": "float64",
    "lng": "float64",
    "status": "string",
    "location_type": "string",
}
cache_file = Path("geocode_cache.csv")
if cache_file.exists():
    cache = pd.read_csv(cache_file, dtype=cache_dtypes)
else:
    # DataFrame(dtype=...) accepts a single dtype only, so set per-column dtypes with astype
    cache = pd.DataFrame(columns=list(cache_dtypes)).astype(cache_dtypes)

# 4.2 Geocoding loop
API_KEY = os.getenv("GOOGLE_GEOCODING_API_KEY")
base_url = "https://maps.googleapis.com/maps/api/geocode/json"
new_rows = []
num_api_requests = 0  # counts every request sent to the API

MAX_RATE_LIMIT_RETRIES = 5

CHECKPOINT = 1000

def flush_cache(cache, new_rows):
    """Append pending rows to the cache and write it to disk; return the updated cache."""
    if new_rows:
        cache = pd.concat([cache, pd.DataFrame(new_rows)], ignore_index=True)
        cache.to_csv(cache_file, index=False)
    return cache

# A set gives constant-time membership tests for the "already geocoded" check
cached_addresses = set(cache["full_address"])

try:
    for i, addr in enumerate(lookup["full_address"]):
        if i % CHECKPOINT == 0:  # Checkpoint: save cache every CHECKPOINT addresses
            cache = flush_cache(cache, new_rows)
            new_rows = []
        if addr in cached_addresses:
            continue  # already geocoded
        params = {
            "address": addr,
            "components": "country:BR",
            "key": API_KEY
        }

        # Rate-limit + backoff: retry the same address while the quota is exceeded
        for _ in range(MAX_RATE_LIMIT_RETRIES + 1):
            num_api_requests += 1
            resp = requests.get(base_url, params=params, timeout=30).json()
            status = resp.get("status")
            if status != "OVER_QUERY_LIMIT":
                break
            print("Rate limit exceeded. Waiting for 5 seconds.")
            time.sleep(5)
        else:
            raise RuntimeError("Max retries exceeded.")

        if status == "OK" and resp["results"]:
            res = resp["results"][0]
            loc = res["geometry"]["location"]
            lat, lng = loc["lat"], loc["lng"]
            loc_type = res["geometry"]["location_type"]
        else:
            lat = lng = loc_type = None
        new_rows.append({
            "full_address": addr,
            "lat": lat,
            "lng": lng,
            "status": status,
            "location_type": loc_type
        })
        time.sleep(0.01)  # caps the rate at ~100 requests/sec
except KeyboardInterrupt:
    print("Geocoding interrupted. Saving cache.")
except Exception as e:
    print(f"Error processing address {addr}: {e}")
    raise
finally:
    # 4.3 Append & save cache; this also runs after an interrupt or an error
    cache = flush_cache(cache, new_rows)
    new_rows = []

print(f"API requests made: {num_api_requests}")
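
The loop above waits a fixed 5 seconds when the API returns `OVER_QUERY_LIMIT`. One common alternative is exponential backoff, sketched below under stated assumptions: `fetch_with_backoff`, `fetch`, `base_delay`, and the injectable `sleep` are hypothetical names, and the responses are faked so the example runs without network access.

```python
import time

def fetch_with_backoff(fetch, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until its status is not OVER_QUERY_LIMIT, doubling the wait each time."""
    for attempt in range(max_retries + 1):
        resp = fetch()
        if resp.get("status") != "OVER_QUERY_LIMIT":
            return resp
        if attempt < max_retries:
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    raise RuntimeError("rate limit still exceeded after retries")

# Fake API: rate-limited twice, then succeeds; delays are recorded instead of slept
fake_responses = iter([
    {"status": "OVER_QUERY_LIMIT"},
    {"status": "OVER_QUERY_LIMIT"},
    {"status": "OK"},
])
delays = []
result = fetch_with_backoff(lambda: next(fake_responses), sleep=delays.append)
print(result, delays)
```

In the notebook, `fetch` would wrap the `requests.get(...).json()` call for a single address.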

In [None]:
# 5 Inspect the cache and check for errors
print(f"Final cache size: {cache.shape[0]}")
num_errors = cache[cache["status"] != "OK"].shape[0]
print(f"Number of errors: {num_errors}")

error_df = cache[cache["status"] != "OK"]
if not error_df.empty:
    print(error_df.head(10))

Final cache size: 193727
Number of errors: 19
                                            full_address  lat  lng  \
5208   AV CRISTIANO MACHADO 4000 LJ ANCORA D, BELO HO...  NaN  NaN   
10375        PRACA DAS FLORES 30, BARUERI, SAO PAULO, BR  NaN  NaN   
11800  RUA GERONIMO CAYETANO GARCIA 176, FRANCISCO MO...  NaN  NaN   
12885  AV CARLOS DRUMOND DE ANDRADE 434, LENCOIS PAUL...  NaN  NaN   
18091        RUA SAO PEDRO S/N, SAO ROQUE, SAO PAULO, BR  NaN  NaN   
32263  FAZENDA SANT ANA S/N, SAO JOAQUIM DA BARRA, SA...  NaN  NaN   
32338  AVENIDA HEITOR VILA LOBOS 1172 SALA 4, SAO JOS...  NaN  NaN   
46898  AV. DAS NACOES UNIDAS 4254, NOVO HAMBURGO, RIO...  NaN  NaN   
54175  R SENADOR BENTO PEREIRA BUENO 129, JUNDIAI, SA...  NaN  NaN   
64839  RUA ROMUALDO ANDREAZZI 492, CAMPINAS, SAO PAUL...  NaN  NaN   

              status location_type  
5208   UNKNOWN_ERROR          <NA>  
10375   ZERO_RESULTS          <NA>  
11800  UNKNOWN_ERROR          <NA>  
12885  UNKNOWN_ERROR          <NA> 
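
Because every status is cached, an address that failed with a transient error is never queried again. Google documents `UNKNOWN_ERROR` as a server error that may succeed on retry, whereas `ZERO_RESULTS` is a definitive answer. A minimal sketch of separating the two before a re-run, using plain dict rows rather than the `cache` DataFrame; `split_cache`, `RETRIABLE`, and the example rows are hypothetical.

```python
# Statuses assumed worth retrying; ZERO_RESULTS is deliberately excluded
RETRIABLE = frozenset({"UNKNOWN_ERROR", "OVER_QUERY_LIMIT"})

def split_cache(rows, retriable=RETRIABLE):
    """Split cache rows into those to keep and those to drop so they get re-queried."""
    keep, retry = [], []
    for row in rows:
        (retry if row["status"] in retriable else keep).append(row)
    return keep, retry

# Invented rows for illustration
rows = [
    {"full_address": "ADDRESS A", "status": "OK"},
    {"full_address": "ADDRESS B", "status": "UNKNOWN_ERROR"},
    {"full_address": "ADDRESS C", "status": "ZERO_RESULTS"},
]
keep, retry = split_cache(rows)
print([r["full_address"] for r in keep])
print([r["full_address"] for r in retry])
```

Writing only the `keep` rows back to `geocode_cache.csv` would make the next run of the geocoding loop retry the rest.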

In [None]:
# 6.1 Merge lat/lng into full panel
df_geocoded = df.merge(cache, on="full_address", how="left")

df_geocoded.to_csv("geocoded_data.csv", index=False)

# 6.2 Keep only high-precision hits if desired
high_prec = ["ROOFTOP", "RANGE_INTERPOLATED"]
df_high = df_geocoded[df_geocoded["location_type"].isin(high_prec)]

df_high.to_csv("geocoded_data_high_precision.csv", index=False)

# 6.3 Spot-check a random sample on a map (e.g. with folium)
import folium
m = folium.Map(location=[-15, -55], zoom_start=4)
for _, row in df_high.sample(100).iterrows():
    folium.CircleMarker(
        [row.lat, row.lng], radius=2, color="blue"
    ).add_to(m)
m.save("spotcheck.html")
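
The map is a visual check; a cheap programmatic complement is to test whether each coordinate falls inside a bounding box around Brazil. The bounds below are rough, deliberately generous values chosen for this sketch (an assumption, not an official figure), and `in_brazil_bbox` is a hypothetical helper.

```python
# Approximate bounding box around Brazil, padded to include oceanic islands (assumed values)
BR_BOUNDS = {"lat": (-34.0, 5.5), "lng": (-74.5, -28.0)}

def in_brazil_bbox(lat, lng, bounds=BR_BOUNDS):
    """Return True if (lat, lng) lies inside the box; missing or NaN values give False."""
    if lat is None or lng is None:
        return False
    lat_min, lat_max = bounds["lat"]
    lng_min, lng_max = bounds["lng"]
    # Comparisons with NaN are always False, so NaN coordinates are rejected too
    return lat_min <= lat <= lat_max and lng_min <= lng <= lng_max

print(in_brazil_bbox(-4.360424, -39.313174))  # a sample result from the trial geocode
print(in_brazil_bbox(40.7128, -74.0060))      # a point in North America
```

A point inside the box can still be in a neighbouring country, so this only catches gross errors.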

In [64]:
# 7 Inspect the final geocoded data
print(f"Full geocoded data size: {df_geocoded.shape}")
num_errors_full = df_geocoded[df_geocoded["status"] != "OK"].shape[0]
print(f"Number of errors in full data: {num_errors_full}")

print(f"High-precision geocoded data size: {df_high.shape}")



Full geocoded data size: (549082, 13)
Number of errors in full data: 60
High-precision geocoded data size: (224670, 13)
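
The counts above are easier to read as shares of the panel. A small arithmetic sketch using only the numbers printed in this cell:

```python
# Counts copied from the output above
total_rows = 549082
high_precision_rows = 224670
error_rows = 60

high_precision_share = high_precision_rows / total_rows
error_share = error_rows / total_rows

print(f"High-precision share: {high_precision_share:.1%}")  # → 40.9%
print(f"Error share: {error_share:.3%}")                    # → 0.011%
```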
