
Geospatial Data Analysis

Abe edited this page May 3, 2022 · 5 revisions

Introduction

Geospatial data is important in many civic tech projects. Data published by various stakeholders is used in multiple Hack for LA projects. A few recent examples include:

  • 311-data - Foundation project combining live 311 data (point features) and Neighborhood Council (polygon features) into a portal for empowerla
  • accessthedata - Combines 311 requests (point features) and Neighborhood Council boundaries (polygon features) to create data stories for workshops
  • Pedestrian Safety Analysis - Maps traffic accidents (point features), administrative boundaries (polygon features), and intersection data (point features) to analyze accident patterns

The typical geodata workflow involves finding and combining multiple feature layers into a unified picture of the world. Once the data is assembled, statistical and visualization tools are used to tell the story. Analysis may center on one variable, but understanding comes from the other contextual features.

This note is intended to get you started with basic geospatial analysis. Specifically:

  1. Geodata overview and where to look for data
  2. Using the python data ecosystem to process and analyze
  3. Basic map visualizations in geospatial analysis
  4. Ideas on putting it all together

Understanding Geospatial Data

Governments at the national, state, county, and local levels produce open data. Each dataset is purpose-built, by a department for a given use. Geospatial analysts need to understand the why of the data and its implications for data fusion.

In Los Angeles, there are some important sites for data.

  1. The City of Los Angeles has two data portals, LA Data and LA geodata. Explore the data at each of these sites to see what they offer!
  2. The County of LA also publishes data. You always need to understand which jurisdiction is responsible for the data. A good example is the property tax assessor - that is usually a county function.
  3. The Southern California Association of Governments (SCAG) also publishes open data on the LA region.
  4. Finally, while there are many other data sources, a couple are important. First is US Census data. It is the authoritative source for demographics, but it can be a bit complicated to navigate. Second is OpenStreetMap. OSM has good road network data, points of interest, and an open source geocoder!

Geospatial data formats include both vector and raster. For this brief I focus on three vector data types.

  • Points - Used for features like buildings, 311 requests, and traffic accidents. The location information typically comes as latitude/longitude pairs, geometries (encoded latitude/longitude), or addresses. Spatial analysis needs the geometry, so some processing may be required.
  • Lines - Used for linear features like roads, rivers, or power distribution lines. Depending on the application, this data is represented as geometries and/or network objects.
  • Polygons - Boundary features like Neighborhood Councils, City Council Districts, cities, and census tracts. These datasets include geometries.
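To make the three vector types concrete, here is a minimal sketch using shapely (the geometry library discussed below). All coordinates are invented for illustration.

```python
# Minimal sketch of the three vector geometry types using shapely.
# All coordinates are invented for illustration.
from shapely.geometry import Point, LineString, Polygon

# A point feature, e.g. a single 311 request (x=longitude, y=latitude)
request = Point(-118.24, 34.05)

# A line feature, e.g. a short road segment
road = LineString([(-118.25, 34.05), (-118.24, 34.05), (-118.23, 34.06)])

# A polygon feature, e.g. a small administrative boundary
boundary = Polygon([(-118.26, 34.04), (-118.22, 34.04),
                    (-118.22, 34.07), (-118.26, 34.07)])

# Spatial relationships come for free once you have geometries
print(boundary.contains(request))  # True: the request falls inside the boundary
print(road.length)                 # length in coordinate units (degrees here)
```

Real datasets deliver these geometries in files (shapefile, GeoJSON, ...) rather than hand-typed coordinates, but the objects you end up working with are the same.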

It is often said that spatial is special. Geodata enables the where in our models. It adds to non-spatial data science, providing notions of location, distance, and spatial interactions/relationships. To use spatial data we need to deal with some representational issues. I'll identify a few issues to wrap up this section.

  1. Coordinate reference systems (CRS) - CRSs are an important topic in geodata, and there are many reference sites covering the basics. You don't need to understand every nuance of this aspect of geodata for your analysis. What matters is knowing which reference system each of your datasets uses and then standardizing on one.

  2. Scale - Scale matters when generating statistics from geospatial data. For example, there are known pitfalls when inferring statistics for a large area (e.g. a Neighborhood Council) from small-area features (e.g. census tracts). Note: this is not the same thing as map scale!

  3. Standards - Well-known text (WKT) is an important standard for representing geometries. It was initially developed by the OGC and later moved to the ISO.
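As a concrete illustration of the CRS and WKT points above, here is a sketch using pyproj (a common projection library, not otherwise discussed in this note) to convert a WGS 84 latitude/longitude pair into Web Mercator, and shapely to serialize a geometry as WKT. The coordinates are invented.

```python
# Sketch: standardizing coordinates on one CRS with pyproj, and
# serializing a geometry as well-known text (WKT) with shapely.
from pyproj import Transformer
from shapely.geometry import Point

# WGS 84 lat/long (EPSG:4326) -> Web Mercator (EPSG:3857).
# always_xy=True keeps the (longitude, latitude) axis order.
transformer = Transformer.from_crs("EPSG:4326", "EPSG:3857", always_xy=True)
x, y = transformer.transform(-118.24, 34.05)  # an invented downtown-LA-ish point
print(x, y)  # projected coordinates in meters

# The same point as WKT, the standard text representation of geometries
print(Point(-118.24, 34.05).wkt)  # e.g. "POINT (-118.24 34.05)"
```

With geopandas (below), the equivalent whole-dataset operation is `gdf.to_crs(...)`, which reprojects every geometry in a geodataframe at once.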

Geospatial Tools, Analysis, and Visualization

This note assumes you know the basics of python and jupyter for data analysis. It focuses on basic geodata tools and leaves many options uncovered; for example, it does not address GIS tools like QGIS or ArcGIS, and the discussion of map visualization tools is deliberately brief.

Python libraries

  • geopandas - Dataframes are a common way to organize information for analysis; examples include pandas, R, and Excel. Geopandas is a (pandas) dataframe with geometry. It adds spatial support to the basic tools of pandas and is the central data structure for python-based geodata. A few examples include:

    • It adds where to the dataframe row - absolute location, i.e. the lat/long, or, with some computation, relative location.
    • Geometries support spatial distance between rows.
    • Geometry adds the ability to combine data. You can compute intersections or apply features, with geometry, for spatial joins.
    • geopandas supports multiple file types, including shapefiles, geodatabase files, GeoJSON, and Parquet.
  • shapely - The spatial tools in the geodataframe are supplied by the shapely package; indeed, the geopandas documentation states that it combines pandas and shapely. You don't need to understand every detail of the package, but know that there are many operations. Some examples:

    • area, bounds, and length of features with extent (lines, polygons)
    • contains, covers, and crosses can be used with LineStrings (think roads) and Polygons (think city)
    • disjoint, intersects, touches, and within predicates are useful with LineStrings and Polygons
    • difference, intersection, and union are useful set-oriented operators for Polygons
    • buffer is useful to get an area around a LineString
    • shapely underlies the geopandas dissolve() implementation for combining geometries
  • osmnx - This is an amazing package. It combines the data from OSM with the power of networkx. It provides the interface to read OSM node-edge graphs by place name, bounding box, or general geometry, and has utilities to convert the graph to geodataframes for use with your geopandas tools. Its powerful API can generate paths, snap external points to the nearest edge, and compute complexity measures, to name a few features. A very good package for analyzing the built environment.
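To show the spatial-join idea from the geopandas bullets above, here is a minimal sketch assuming a recent geopandas. It tags point features (think 311 requests) with the polygon (think district) each falls in; all names and coordinates are invented.

```python
# Sketch: a geopandas spatial join - attach polygon attributes to points.
# All district names, request ids, and coordinates are invented.
import geopandas as gpd
from shapely.geometry import Point, Polygon

# Two toy "districts" (polygon features)
districts = gpd.GeoDataFrame(
    {"district": ["A", "B"]},
    geometry=[Polygon([(0, 0), (2, 0), (2, 2), (0, 2)]),
              Polygon([(2, 0), (4, 0), (4, 2), (2, 2)])],
    crs="EPSG:4326",
)

# Three toy "requests" (point features); the third falls outside both districts
requests = gpd.GeoDataFrame(
    {"request_id": [1, 2, 3]},
    geometry=[Point(1, 1), Point(3, 1), Point(5, 5)],
    crs="EPSG:4326",
)

# Spatial join: each point picks up the attributes of the polygon containing it
joined = gpd.sjoin(requests, districts, how="inner", predicate="within")
print(joined[["request_id", "district"]])
```

From here, an ordinary pandas `groupby("district")` gives per-district counts - the core move behind maps like the 311-data portal.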

The next couple of packages are specific to geospatial analysis. They are more complex, with a variety of use cases. I'm including them in this basic overview to give a starting point for the next level of analysis.

  • pysal - The Swiss Army knife for spatial data science. It was started at Arizona State University by Serge Rey and Luc Anselin, and has a strong core of current developers and past participants. The packages are organized around three themes:

    • Explore - Tools for exploratory spatial data analysis (ESDA). Two important ones used for analyzing LA data are momepy (urban morphology) and spaghetti (network data analysis).
    • Model - Statistical and spatial relationship model tools. For starters, check out tobler. It provides tools and techniques to combine data from different polygon resolutions. The techniques can be used to support small-area analysis, e.g. combining census tract information with Neighborhood Councils for better, current demographics.
    • Visualize - This package includes tools to help design maps (choropleths) and legends. This is a bit on the geeky side since the tools help to see what your color selections will look like when designing map visualizations.
  • geosnap - This project is under development at UC Riverside. It is led by Serge Rey, one of the core developers of pysal. It is built with pysal and comes with batteries included (census data). It is focused on understanding neighborhoods. This provides some powerful tools for LA (county, city, ...), OC, ...

Visualization

The visualization tutorial noted there are many different visualization options in the python ecosystem. This is also true for map-based visualizations. This section will look at three approaches to map visualization.

  • folium - This leaflet.js based library is one of the standards. You can find many tutorials covering a variety of geospatial use cases. An example use of folium maps with 311 data for access-the-data workshops shows the processing steps from observations to visualization.

  • leafmap - leafmap is an interesting map option. It is advertised as mapping with minimal coding. It started in the Google Earth world with geemap, so if you use Google Earth Engine it is a good option. Even though they say minimal coding, it also has some good examples (dare I say code) of widget-based development. The code is worth reading if you want to extend or integrate widgets in your workflow.

  • holoviz - A very nice set of tools for visualization, cross-filtering, and dashboarding. Built on bokeh, it provides an interesting approach to serving the analysis. I also like their concept of data pipelines. Worth the effort to explore!

Putting it all together

There is a github repo covering many of the basics.

Resources/References

This document, and the repo, include many different references. This section includes some courses I've seen along the way. Bits and pieces from each will help you on your journey.

Geographic Data Science for Applied Economists

Introduction to Python GIS

Open Source geospatial Programming & Remote Sensing

git repo for Geographic Data Science with Python



Contributors

Mike Morgan
