<a href="https://colab.research.google.com/github/iiSherBearii/Geospatial-Tutorial/blob/main/Geocoding.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Welcome to Geocoding and Shapefiles!
Before you can analyze and visualize spatial data, you often need to convert it into a shapefile, a spatial data format used to create maps and run spatial analyses such as spatial lag models.

Geographic units such as states, counties, and census tracts are typically assigned a unique geocode identifier that ties their spatial location to non-spatial data such as population size. In spatial databases like the U.S. Census Bureau's American Community Survey (ACS), each geographic unit has a unique GEOID that acts as a geocode, linking its location to the survey data in the ACS database.
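
To make the idea of a GEOID more concrete, here is a minimal sketch that splits an 11-digit census tract GEOID into its state, county, and tract parts. The helper name `split_tract_geoid` and the sample value are illustrative assumptions, not part of any Census tooling.

```python
def split_tract_geoid(geoid: str) -> dict:
    """Split an 11-digit census tract GEOID into its FIPS components."""
    if len(geoid) != 11 or not geoid.isdigit():
        raise ValueError("A census tract GEOID is expected to be 11 digits")
    return {
        "state": geoid[:2],    # 2-digit state FIPS code
        "county": geoid[2:5],  # 3-digit county FIPS code
        "tract": geoid[5:],    # 6-digit census tract code
    }

# Illustrative value: state 36 (New York), county 061 (New York County)
parts = split_tract_geoid("36061000100")
print(parts)
```

Because the identifier is hierarchical, the first five digits alone are enough to group tracts by county.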

A shapefile is a spatial dataset that contains shapes, such as polygons or points, each matched to a unique GEOID. Shapefiles are used to create the maps we see in most spatial research. Raw address datasets, however, can't be mapped until each address is converted to X, Y coordinates, which can then be matched to a GEOID.
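
As a rough sketch of how X, Y coordinates become mappable shapes, the example below builds point geometries from a small table with geopandas. The records, column names, and coordinates are hypothetical stand-ins, not values from the tutorial dataset.

```python
import pandas as pd
import geopandas as gpd

# Hypothetical geocoded records: two illustrative points in New York City
records = pd.DataFrame({
    "stop_id": [1, 2],
    "latitude": [40.7128, 40.7580],
    "longitude": [-74.0060, -73.9855],
})

# points_from_xy takes x (longitude) first, then y (latitude)
points = gpd.GeoDataFrame(
    records,
    geometry=gpd.points_from_xy(records["longitude"], records["latitude"]),
    crs="EPSG:4326",  # WGS 84, the usual latitude/longitude coordinate system
)
print(points)
```

Calling `points.to_file("stops.shp")` would then write the result to disk as a shapefile; the file name here is only a placeholder.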

Today we'll learn how to take raw address files, geocode them through a geocoding server, and transform them into a shapefile for mapping!

##What is Geocoding?
Geocoding is the process of transforming location descriptions (like addresses or place names) into geographic coordinates (latitude and longitude), which are then used to place markers or position maps. Reverse geocoding is the opposite process, converting geographic coordinates into a human-readable address.
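
The two directions can be sketched as a pair of small helpers. This is a minimal illustration, assuming a geopy-style geocoder object that exposes `geocode` and `reverse` methods; the helper names `address_to_coordinates` and `coordinates_to_address` are hypothetical.

```python
def address_to_coordinates(geolocator, address):
    """Geocode: return (latitude, longitude) for an address, or None if not found."""
    location = geolocator.geocode(address)
    return (location.latitude, location.longitude) if location else None


def coordinates_to_address(geolocator, latitude, longitude):
    """Reverse geocode: return the address string for a point, or None if not found."""
    location = geolocator.reverse((latitude, longitude))
    return location.address if location else None
```

With geopy installed, a `Nominatim(user_agent="myGeocoder")` object could be passed as `geolocator`. Both calls send a request to the geocoding server, so they need a network connection.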

#Downloading Raw Address File

If you have access to a dataset with spatial identifiers, like addresses, you can turn it into geospatial data by geocoding the addresses and converting the result into a shapefile. For today's tutorial, we'll use a dataset from the NYPD Stop, Question and Frisk Database (https://www.nyc.gov/site/nypd/stats/reports-analysis/stopfrisk.page).

This dataset contains incident reports of stops in which NYPD officers stopped a pedestrian on suspicion. The dataset includes a column for the location where each stop and frisk took place. That's all we need to start geocoding.

In [1]:
#Install and import packages
!pip install pandas
!pip install geopandas
!pip install geopy
!pip install python-dotenv

import pandas as pd
import geopandas as gpd
from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter
from tqdm import tqdm #for progress bar in the batch geocoding script

Collecting python-dotenv
  Downloading python_dotenv-1.1.0-py3-none-any.whl.metadata (24 kB)
Downloading python_dotenv-1.1.0-py3-none-any.whl (20 kB)
Installing collected packages: python-dotenv
Successfully installed python-dotenv-1.1.0


In [2]:
# Import CSV data
url = 'https://raw.githubusercontent.com/iiSherBearii/Geospatial-Tutorial/refs/heads/main/stop_frisk.csv'
df = pd.read_csv(url)

In [3]:
print(df)

      STOP_ID STOP_FRISK_DATE STOP_FRISK_TIME  YEAR2   MONTH2      DAY2  \
0   279772561          1/1/24        01:58:00   2024  January    Monday   
1   279772564          1/1/24        00:48:00   2024  January    Monday   
2   279772565          1/1/24        01:10:00   2024  January    Monday   
3   279772566          1/1/24        01:10:00   2024  January    Monday   
4   279772567          1/1/24        01:10:00   2024  January    Monday   
..        ...             ...             ...    ...      ...       ...   
95  279990285          1/1/24        19:29:00   2024  January    Monday   
96  279990286          1/1/24        19:29:00   2024  January    Monday   
97  279990288          1/4/24        02:40:00   2024  January  Thursday   
98  279990292          1/4/24        00:02:00   2024  January  Thursday   
99  279990293          1/4/24        00:02:00   2024  January  Thursday   

         STOP_WAS_INITIATED ISSUING_OFFICER_RANK  \
0   Based on Self Initiated                   P

#Geocoding using the Nominatim Server (OpenStreetMap)

##What is the Nominatim Server?
"Nominatim is the geocoding software that powers the official OSM site www.openstreetmap.org. It serves 30 million queries per day on a single server." - Nominatim

##Alternative Spatial Databases
  - U.S. Census
  - ArcGIS

##Batch Geocoding

In [5]:
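#Connecting to Nominatim server
geolocator = Nominatim(user_agent="myGeocoder")
#Wait at least 1 second between requests to respect Nominatim's usage policy
geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)
# Batch Geocoding Function (with progress bar)
tqdm.pandas()

def geocode_address(row):
    #Build the query string from the street name and city columns
    address = f"{row['STOP_LOCATION_STREET_NAME']}, {row['city']}"
    try:
        location = geocode(address)
        if location:
            return pd.Series({'latitude': location.latitude, 'longitude': location.longitude})
        else:
            return pd.Series({'latitude': None, 'longitude': None})
    except Exception:
        #Treat any geocoding failure as a missing result instead of stopping the batch
        return pd.Series({'latitude': None, 'longitude': None})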

# Apply batch geocoding with progress bar
df[['latitude', 'longitude']] = df.progress_apply(geocode_address, axis=1)

100%|██████████| 100/100 [04:50<00:00,  2.90s/it]
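
The progress bar reports roughly 2.9 seconds per address, so run time grows linearly with the number of rows. Since several stops can share the same street, one way to cut the number of requests is to geocode each distinct address once and reuse the result. Below is a minimal sketch, assuming a hypothetical `make_cached_geocoder` helper; `fake_geocode` and its coordinates are stand-ins so the example runs offline.

```python
from functools import lru_cache

def make_cached_geocoder(geocode_func):
    """Wrap a geocoding function so each distinct address is looked up only once."""
    @lru_cache(maxsize=None)
    def cached(address):
        return geocode_func(address)
    return cached

# Stand-in for a real geocoder (hypothetical coordinates, no network access)
calls = []
def fake_geocode(address):
    calls.append(address)
    return (40.0, -74.0)

cached_geocode = make_cached_geocoder(fake_geocode)
addresses = ["BROADWAY, New York", "BROADWAY, New York", "5 AVENUE, New York"]
results = [cached_geocode(a) for a in addresses]
print(len(results), "results from", len(calls), "lookups")  # 3 results from 2 lookups
```

In the tutorial's code, `geolocator.geocode` could be passed in place of `fake_geocode`. Note that failed lookups are cached too, so an address that returns no match is not retried.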


In [6]:
#display geocoded dataset
print(df)
#save geocoded dataset to a .csv
df.to_csv('geocoded_stop_frisk.csv', index=False)

      STOP_ID STOP_FRISK_DATE STOP_FRISK_TIME  YEAR2   MONTH2      DAY2  \
0   279772561          1/1/24        01:58:00   2024  January    Monday   
1   279772564          1/1/24        00:48:00   2024  January    Monday   
2   279772565          1/1/24        01:10:00   2024  January    Monday   
3   279772566          1/1/24        01:10:00   2024  January    Monday   
4   279772567          1/1/24        01:10:00   2024  January    Monday   
..        ...             ...             ...    ...      ...       ...   
95  279990285          1/1/24        19:29:00   2024  January    Monday   
96  279990286          1/1/24        19:29:00   2024  January    Monday   
97  279990288          1/4/24        02:40:00   2024  January  Thursday   
98  279990292          1/4/24        00:02:00   2024  January  Thursday   
99  279990293          1/4/24        00:02:00   2024  January  Thursday   

         STOP_WAS_INITIATED ISSUING_OFFICER_RANK  \
0   Based on Self Initiated                   P