# <center> Spatial Data Types </center>

## <center> A Brief Introduction </center>

## Satellite Images
<img src="satelite.png" width="600">

Without the text labels, I could just read this in as an image like last class.

The issue is, I would be missing information...


__Example__: what if I read in an image from Chicago and one from Champaign? Could I relate them in any way?

Numpy arrays are not enough because I need to __reference what I'm seeing with everything else in the world.__

How do we solve this "referencing problem"? A system! 

Hence why we have __Coordinate Reference Systems__ (CRS).


## How to Make a Reference System
Imagine a world where no maps exist but we have two images of Champaign and Chicago. How would we make a system that allows us to find the distance between them?

1. Measure the earth.
    - We would want to understand the entire planet and what its dimensions are.
2. Make a useful approximation of the earth.
    - We could mock up a sphere, ellipsoid, or other shape that roughly approximates the earth.
3. Project your three-dimensional earth onto a two-dimensional surface.
    - We use some formula to translate the geographic coordinates to a grid system.

1. Measure the earth.
    - Called __geodesy__, the study of the measurement of the earth.
2. Make a useful approximation of the earth.
    - A three-dimensional model results in a __geographic coordinate system__.
    - These coordinates are angles (degrees), not distances. The most common are "latitude" and "longitude," the angular distances from a fixed reference such as the equator or the prime meridian.
3. Project your three-dimensional earth onto a two-dimensional surface.
    - An approximation of three dimensions in two dimensions is a __projected coordinate system__.

### Geographic Reference System
<center>
<img src="GRS.png" width="400">
</center>

[figure source](https://en.wikipedia.org/wiki/Geographic_coordinate_system#/media/File:ECEF.svg)

- With a reference ellipsoid, a coordinate system can simply be angular distance from some reference point.


- __Note__: the Earth is NOT a perfect sphere or ellipsoid. It is instead a lumpy, awkward rock floating through space.
- What this means: over large distances a sphere is a good approximation. Over short distances you can get distortions due to the lumpiness of the Earth.
- Different __datums__ (a reference ellipsoid plus how it is anchored to the Earth) can be used for different parts of the earth to correct for this.

<center>
<img src="GRS_details.png" width="400">
</center>

[figure source](https://www.esri.com/arcgis-blog/products/arcgis-pro/mapping/coordinate-systems-difference/)

This is the kind of information that goes into a GRS, for example __WGS 1984__ (a very popular one).

### A way to solve the "distance problem"
#### Haversine Distance
[figure source](https://github.com/DaniilSydorenko/haversine-geolocation)
<center>
<img src="haversine.png" width="400">
</center>

- A mathematical approach to calculating distance on a sphere, which takes latitude and longitude as inputs.
- Assumes a perfect sphere, so expect small errors (on the order of a fraction of a percent) relative to an ellipsoidal model.
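
As a rough illustration of the haversine formula, here is a minimal Python sketch. The function name `haversine_km`, the mean Earth radius of 6371 km, and the approximate city coordinates are assumptions added for this example, not values from the lecture.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius; the real Earth is not a sphere


def haversine_km(lat1, lon1, lat2, lon2, radius=EARTH_RADIUS_KM):
    """Great-circle distance between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    # min() guards against tiny floating-point overshoot for near-antipodal points.
    return 2 * radius * math.asin(min(1.0, math.sqrt(a)))


# Approximate city-center coordinates (assumed for illustration).
champaign = (40.1164, -88.2434)
chicago = (41.8781, -87.6298)
print(f"{haversine_km(*champaign, *chicago):.0f} km")
```

Because the inputs are plain latitude and longitude, this works directly on coordinates stored in a geographic reference system, with no projection step.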

## Projected Reference Systems

<center>
<img src="CRS.png" width="400">
</center>

[figure source](https://docs.qgis.org/3.16/en/docs/gentle_gis_introduction/coordinate_reference_systems.html)

Every projection is technically wrong in some way. Three properties a PRS has to balance:
- Angular conformity
- Distance
- Area

### Angular Conformity Preserving
The __Mercator Projection__ preserves angles, which is why Europeans used it to sail across oceans starting in the 16th century: a line of constant compass bearing is a straight line on the map.
<center>
<img src="mercator.jpg" width="400">
</center>


As a result, it severely distorts the relative sizes of countries that are far away from the equator.

<center>
<img src="mercator-vs-truesize.gif" width="600">
</center>
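
The size distortion can be quantified with a small sketch. On a spherical Mercator projection the local scale factor at latitude φ is 1/cos(φ), so areas are inflated by its square. The function names are hypothetical and the latitudes are only illustrative.

```python
import math


def mercator_scale(lat_deg):
    """Linear scale factor of the spherical Mercator projection at a latitude."""
    return 1.0 / math.cos(math.radians(lat_deg))


def mercator_area_inflation(lat_deg):
    """How many times larger an area appears on the map than it would at the equator."""
    return mercator_scale(lat_deg) ** 2


for lat in (0, 40, 60, 75):
    print(lat, round(mercator_scale(lat), 2), round(mercator_area_inflation(lat), 2))
```

At 60 degrees latitude lengths are doubled and areas quadrupled, which is why high-latitude countries look so large.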


### Distance Preserving
#### Plate Carrée (Equidistant Cylindrical)

This one preserves distances along meridians (north-south), and has become a near-universal standard for raster data sets.
<center>
<img src="plateecaree.jpg" width="400">
</center>
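
The plate carrée formula is simple enough to sketch directly: longitude and latitude (in radians) are just scaled by the sphere's radius. The function name and the spherical radius below are assumptions for illustration.

```python
import math

R_KM = 6371.0  # assumed spherical Earth radius


def plate_carree(lat, lon, radius=R_KM):
    """Project (lat, lon) in degrees to planar (x, y) in km, standard parallel at the equator."""
    return radius * math.radians(lon), radius * math.radians(lat)


# One degree of latitude covers the same map distance everywhere...
_, y0 = plate_carree(0, 0)
_, y1 = plate_carree(1, 0)
# ...and so does one degree of longitude, even though on the globe it
# shrinks by cos(latitude) away from the equator.
x0, _ = plate_carree(60, 0)
x1, _ = plate_carree(60, 1)
print(y1 - y0, x1 - x0)
```

At 60 degrees latitude the true east-west ground distance is about half of what the map shows, so only north-south distances can be trusted.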


### Area Preserving
#### Albers Equal Area Conic

This one preserves the area of objects (important for area calculations and densities). Used by the Census Bureau and most atlases of the US.

<center>
<img src="albers.jpg" width="400">
</center>




### A Compromise
#### UTM Zones
The Universal Transverse Mercator (UTM) coordinate system splits the earth into 60 zones, each 6 degrees of longitude wide, and gives each zone its own transverse Mercator projection. The idea is that there is less distortion within each UTM zone.

<center>
<img src="utm_zones.png" width="600">
</center>
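
A minimal sketch of how a UTM zone can be looked up from a longitude, assuming the regular 6-degree grid (the special-case zones around Norway and Svalbard are ignored). The function names and the Champaign coordinates are assumptions for illustration; the EPSG numbering shown is the WGS84 UTM series.

```python
def utm_zone(lon_deg):
    """UTM zone number (1-60) for a longitude in degrees, using the regular 6-degree grid."""
    return int(((lon_deg + 180.0) % 360.0) // 6) + 1


def utm_epsg(lat_deg, lon_deg):
    """EPSG code of the WGS84 UTM zone: 326xx in the north, 327xx in the south."""
    base = 32600 if lat_deg >= 0 else 32700
    return base + utm_zone(lon_deg)


# Approximate Champaign coordinates (assumed for illustration).
print(utm_zone(-88.2434), utm_epsg(40.1164, -88.2434))
```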


## So... who cares?

- Since all projections have their own issues, which one you need depends on what you want to do.
- For area calculations you will want an equal-area projection, and for distances an equidistant projection.

__However__, in most applications we end up using the __WGS84__ GRS (EPSG: 4326), which has become a standard GRS for storing data.

When projecting, the __Plate Carrée__ PRS (ESRI: 54001) is used for most raster maps.

Google has projected all of its maps in __Pseudo-Mercator__ (EPSG: 3857) since 2005. Mapbox, OpenStreetMap, and basically every other mapping app do the same. It's now known as "Web Mercator" since it is the standard for web apps.

### Common pitfalls:
- Trying to find "distance" with only a GRS.
    - In this case, either project to an equidistant projection or use haversine distance.
- __Using two files that are not in the same CRS__.
    - A check that needs to be in every workflow is "are they the same CRS?"
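
The "same CRS" check can be sketched as a small guard function. It relies only on each layer exposing a `.crs` attribute, which geopandas GeoDataFrames and rasterio datasets both do; the function name and the stand-in objects below are hypothetical, and comparing CRS objects from two different libraries may need extra care.

```python
from types import SimpleNamespace


def assert_same_crs(*layers):
    """Raise ValueError unless every layer exposes the same, non-missing `.crs`."""
    crs_values = [getattr(layer, "crs", None) for layer in layers]
    if any(crs is None for crs in crs_values):
        raise ValueError("at least one layer has no CRS set")
    if any(crs != crs_values[0] for crs in crs_values[1:]):
        raise ValueError(f"CRS mismatch: {crs_values}")


# Stand-ins for real layers (hypothetical objects with only a `.crs` attribute).
roads = SimpleNamespace(crs="EPSG:4326")
parcels = SimpleNamespace(crs="EPSG:3857")

try:
    assert_same_crs(roads, parcels)
except ValueError as err:
    print(err)
```

With geopandas, a mismatch is usually fixed by reprojecting one layer to match the other, for example `parcels.to_crs(roads.crs)`.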

## Spatial Data Types
- Rasters
    - Essentially the pixels of an image that is geo-referenced.
    - Attributes: CRS, resolution (e.g. 100 km grid).
    - Usually stored as .tif/.tiff (GeoTIFF)

- Vectors
    - Either points, lines, or polygons made up of vertices.
    - Attributes: CRS.
    - Usually stored as .shp or .geojson
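
To make "geo-referenced pixels" concrete, here is a minimal sketch of how a raster's origin and resolution turn a pixel index into a coordinate. It assumes a north-up raster with square pixels; the function name and the example grid are hypothetical. In practice, `rasterio` stores this mapping as an affine transform on the dataset.

```python
def pixel_center(row, col, x_min, y_max, resolution):
    """Map a pixel (row, col) to the (x, y) of its center in the raster's CRS.

    Assumes a north-up raster with square pixels: x grows with columns,
    y shrinks with rows, and (x_min, y_max) is the top-left corner.
    """
    x = x_min + (col + 0.5) * resolution
    y = y_max - (row + 0.5) * resolution
    return x, y


# Hypothetical 0.25-degree grid whose top-left corner is at lon=-90, lat=42.
print(pixel_center(0, 0, x_min=-90.0, y_max=42.0, resolution=0.25))
```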


### Python Packages
- Rasters: `rasterio`.
- Vectors: `shapely`, `fiona`.
- Zonal statistics (raster values summarized over vector zones): `rasterstats`.
- Handling spatial data: `geopandas`.

Most of these depend on `gdal` (imported in Python as `osgeo`).

### Installing packages
<img src="gdal_meme.jpeg">

### Geospatial package setup in Python

These packages tend to pin older versions of their dependencies. For example, the newest `numpy` may break `geopandas`.

Typical ways to take care of this problem:
- Build packages from source (annoying).
- Create a new conda environment and install __geopandas first__.
    - This allows it to specifically install the versions it wants to use.

I typically keep a separate Python environment just for geospatial work because of this.

## Other Tools for GIS

We have barely scratched the surface of the tools we could use in GIS. Here are some things that might be worth looking into:

### Advanced Topics
That we didn't have time for:
- Making buffers
- Polygonizing rasters
- Calculating "hub distance"

The hope is that with the baseline knowledge you have here you can figure these out on your own.

### Better Plotting Libraries
- [plotly](https://plotly.com/python/maps/): very nice and easy to use plotting that can make your geographic plots interactive fairly easily.
- [bokeh](https://docs.bokeh.org/en/latest/docs/user_guide/geo.html): similar capabilities to plotly, but not as expansive.

Some of these can be deployed to websites to be interactive.

__We will return to some of these packages later in the semester__


### Google Maps API
Conditional on paying for it, Google Maps has some nice functionality for geolocating or getting directions.

It can also calculate things like road distance between points as opposed to straight-line distance.

The problem is that you need to pay for it (though it isn't super expensive).

### OpenStreetMap
An open-source alternative to Google Maps.

Has some great functionality for large cities where things are well identified.

One of its main features is being able to identify "amenities" (restaurants, hospitals, etc.) and also road networks.

In my experience, it is powerful but has a steep learning curve.

OpenStreetMap can be accessed using the [Overpass API](https://wiki.openstreetmap.org/wiki/Overpass_API/Language_Guide).

One promising way to learn this API is to experiment with [Overpass turbo](https://overpass-turbo.eu/), which can help with constructing queries.

#### OpenStreetMap Example

I've extracted the "bounding box" for the CU area.

I can run a query to find all the trees in the area:



In [1]:
import overpy
import pandas as pd 

api = overpy.Overpass()
r = api.query("""
node(40.03602, -88.32493, 40.17404, -88.1076)
    [natural=tree];
out;
""")

In [2]:
[x for x in r.nodes]

[<overpy.Node id=38040799 lat=40.0908484 lon=-88.2732766>,
 <overpy.Node id=3603476177 lat=40.0933996 lon=-88.3187209>,
 <overpy.Node id=3624181219 lat=40.1320683 lon=-88.2528260>,
 <overpy.Node id=3624181220 lat=40.1320689 lon=-88.2680456>,
 <overpy.Node id=3624181221 lat=40.1320811 lon=-88.2660013>,
 <overpy.Node id=3624181222 lat=40.1320840 lon=-88.2716764>,
 <overpy.Node id=3624181223 lat=40.1320872 lon=-88.2685168>,
 <overpy.Node id=3624181224 lat=40.1320938 lon=-88.2756467>,
 <overpy.Node id=3624181225 lat=40.1321066 lon=-88.2713504>,
 <overpy.Node id=3624181226 lat=40.1321081 lon=-88.2978471>,
 <overpy.Node id=3624181227 lat=40.1321097 lon=-88.2414131>,
 <overpy.Node id=3624181228 lat=40.1321115 lon=-88.2414536>,
 <overpy.Node id=3624181229 lat=40.1321245 lon=-88.2661704>,
 <overpy.Node id=3624181230 lat=40.1321361 lon=-88.2981694>,
 <overpy.Node id=3624181231 lat=40.1321362 lon=-88.2983646>,
 <overpy.Node id=3624181232 lat=40.1321385 lon=-88.2682098>,
 <overpy.Node id=362418123

In [3]:
len(r.nodes)

22000

Each of these nodes is a tree with information:

In [9]:
r.nodes[56].tags

{'natural': 'tree', 'species': 'Pinus strobus'}

#### Finding Restaurants
What about all of the restaurants in the area?

In [10]:
r = api.query("""
node(40.03602, -88.32493, 40.17404, -88.1076)
    ["amenity"="restaurant"];
out;
""")

Each node looks like this:

In [11]:
r.nodes[0]

<overpy.Node id=287470185 lat=40.1125742 lon=-88.2095509>

Each of these nodes has a dictionary of tags:

In [12]:
r.nodes[11].tags

{'addr:city': 'Urbana',
 'addr:housename': 'The Channing Murray Foundation',
 'addr:housenumber': '1209',
 'addr:postcode': '61801',
 'addr:street': 'W. Oregon Street, Urbana, Illinois',
 'amenity': 'restaurant',
 'contact:facebook': 'https://www.facebook.com/RedHerringVegetarianRestaurant/',
 'contact:instagram': 'https://www.instagram.com/redherringrestaurant/',
 'contact:website': 'https://www.redherringlove.com',
 'cuisine': 'international',
 'diet:vegan': 'only',
 'diet:vegetarian': 'only',
 'name': 'The Red Herring',
 'opening_hours': 'Mo-Fr 11:00-14:30; We 17:00-20:00'}

I can make a dataframe from these tags, one row at a time.

In [13]:
pd.DataFrame(r.nodes[11].tags,index=[0])

| | addr:city | addr:housename | addr:housenumber | addr:postcode | addr:street | amenity | contact:facebook | contact:instagram | contact:website | cuisine | diet:vegan | diet:vegetarian | name | opening_hours |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Urbana | The Channing Murray Foundation | 1209 | 61801 | W. Oregon Street, Urbana, Illinois | restaurant | https://www.facebook.com/RedHerringVegetarianR... | https://www.instagram.com/redherringrestaurant/ | https://www.redherringlove.com | international | only | only | The Red Herring | Mo-Fr 11:00-14:30; We 17:00-20:00 |


Get a list of these rows with a list comprehension.

In [14]:
DFs = [pd.DataFrame(node.tags, index=[0]) for node in r.nodes]

Now concatenate them together:

In [15]:
rest_df = pd.concat(DFs,axis=0)

In [16]:
rest_df

| | amenity | cuisine | name | contact:phone | contact:website | opening_hours | addr:city | addr:housenumber | addr:postcode | addr:state | ... | takeaway | alt_name | amenity_1 | bar | shop_1 | name:en | brand:website | diet:dairy_free | diet:nut_free | email |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | restaurant | thai | Siam Terrace | | | | | | | | ... | | | | | | | | | | |
| 0 | restaurant | | Houlihan's | +1 217 8195005 | http://www.houlihans.com/my-houlihans/champaign | Su-Th 11:00-22:00;Fr 11:00-23:00;Sa 07:00-23:0... | | | | | ... | | | | | | | | | | |
| 0 | restaurant | asian | Spicy Tang | | | Mo-Su 11:00-21:00 | Champaign | 607 | 61801 | IL | ... | | | | | | | | | | |
| 0 | restaurant | middle_eastern | Jerusalem Restaurant | | | | Champaign | 601 | 61820 | IL | ... | | | | | | | | | | |
| 0 | restaurant | korean | Ar-Ri-Rang | | | | | | | | ... | | | | | | | | | | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 0 | restaurant | chinese | Rice Bowl | | | Mo-Sa 11:00-21:00 | Champaign | 625 | 61820 | IL | ... | | | | | | | | | | |
| 0 | restaurant | | C-U La La Noodle | | | Mo-Su 11:00-20:30 | | | | | ... | | | | | | | | | | |
| 0 | restaurant | mediterranean | Shawarma Joint | | | Mo-Sa 11:00-22:00; Su 12:00-21:00 | Champaign | 627 | 61820 | IL | ... | | | | | | | | | | |
| 0 | restaurant | mexican | El Toro Bravo | | | | Champaign | 2561 | 61821 | IL | ... | | | | | | | | | | |


### What is the most popular cuisine in Chambana?

In [17]:
rest_df['cuisine'].value_counts().head()

cuisine
chinese     13
asian        7
mexican      6
american     5
thai         4
Name: count, dtype: int64

In [18]:
rest_df[rest_df['cuisine'] == "mexican"]

| | amenity | cuisine | name | contact:phone | contact:website | opening_hours | addr:city | addr:housenumber | addr:postcode | addr:state | ... | takeaway | alt_name | amenity_1 | bar | shop_1 | name:en | brand:website | diet:dairy_free | diet:nut_free | email |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | restaurant | mexican | Dos Reales | | | | | | | | ... | | | | | | | | | | |
| 0 | restaurant | mexican | Pancheros | | | | Urbana | 102.0 | 61801.0 | | ... | | | | | | | | | | |
| 0 | restaurant | mexican | Maize | | | | | | | | ... | | | | | | | | | | |
| 0 | restaurant | mexican | Pancheros | | | Su-Th 10:30-23:00; Fr-Sa 10:30-00:00 | | 2009.0 | 61820.0 | | ... | | | | | | | | | | |
| 0 | restaurant | mexican | El Toro | | | | | | | | ... | | | | | | | | | | |
| 0 | restaurant | mexican | El Toro Bravo | | | | Champaign | 2561.0 | 61821.0 | IL | ... | | | | | | | | | | |


We're missing a few restaurants here.
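
One possible reason for the gaps (an assumption, not something verified from the output above): the query only asks for `node` elements, while many restaurants in OpenStreetMap are mapped as building outlines (ways) or relations. Below is a hedged sketch of a query builder that uses Overpass QL's `nwr` shortcut and `out center;` to include them. The helper name `amenity_query` is hypothetical.

```python
def amenity_query(south, west, north, east, amenity):
    """Build an Overpass QL query for nodes, ways, and relations in a bounding box."""
    bbox = f"{south},{west},{north},{east}"
    return (
        f'nwr({bbox})["amenity"="{amenity}"];\n'
        "out center;"  # ask for one representative point per way/relation
    )


# Same bounding box as the restaurant query above.
query = amenity_query(40.03602, -88.32493, 40.17404, -88.1076, "restaurant")
print(query)
```

The resulting string can be passed to `api.query(...)` as before. With `overpy`, matching ways then show up in `r.ways` rather than `r.nodes`, so the dataframe-building step would need to loop over both.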

### What area should we search?

In [30]:
r = api.query("""
node(39.096937863355755,-121.70104980468751,39.18757116788847,-121.57333374023439)
    ["amenity"="restaurant"];
out;
""")

In [31]:
DFs = [pd.DataFrame(node.tags, index=[0]) for node in r.nodes]

rest_df = pd.concat(DFs,axis=0)

In [32]:
rest_df['name'].value_counts().sort_index()

name
Applebee's                      1
Buffalo Wild Wings              1
Chicago's Pizza With a Twist    1
IHOP                            1
Peach Tree                      1
Pizza Hut                       1
Sutter Buttes Brewing           1
Name: count, dtype: int64

In [26]:
rest_df['cuisine'].value_counts().head(10)

cuisine
international        2
regional             2
chicken              1
burger               1
local                1
pizza                1
asian;diner;local    1
Name: count, dtype: int64
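
Note that OpenStreetMap tags can hold several values separated by semicolons, as in `asian;diner;local` above, so counting raw strings undercounts individual cuisines. Here is a minimal sketch of splitting them before counting, using plain Python and a hand-copied list of the cuisine values shown above. The function name is hypothetical.

```python
from collections import Counter

# Cuisine tag values copied from the value counts above.
cuisines = [
    "international", "international", "regional", "regional",
    "chicken", "burger", "local", "pizza", "asian;diner;local",
]


def count_cuisines(values):
    """Count cuisines, splitting multi-valued tags on ';'."""
    counts = Counter()
    for value in values:
        counts.update(part.strip() for part in value.split(";") if part.strip())
    return counts


print(count_cuisines(cuisines).most_common(3))
```

On a real dataframe the same function could be applied to `rest_df['cuisine'].dropna()`, since restaurants without a cuisine tag show up as missing values.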