## Spatial Modeling and Analytics Try-it Notebook #2
### Calculating distances and areas

## Reminder
<a href="#/slide-2-0" class="navigate-right" style="background-color:blue;color:white;padding:8px;margin:2px;font-weight:bold;">Continue with the lesson</a>

<br>
</br>
<font size="+1">

By continuing with this lesson you are granting your permission to take part in this research study for the Hour of Cyberinfrastructure: Developing Cyber Literacy for GIScience project. In this study, you will be learning about cyberinfrastructure and related concepts using a web-based platform that will take approximately one hour per lesson. Participation in this study is voluntary.

Participants in this research must be 18 years or older. If you are under the age of 18 then please exit this webpage or navigate to another website such as the Hour of Code at https://hourofcode.com, which is designed for K-12 students.

If you are not interested in participating please exit the browser or navigate to this website: http://www.umn.edu. Your participation is voluntary and you are free to stop the lesson at any time.

For the full description please navigate to this website: <a href="../../gateway-lesson/gateway/gateway-1.ipynb">Gateway Lesson Research Study Permission</a>.

</font>

In this try-it you'll download some data and calculate distances between points. 

Once again we'll work in a notebook without using slides. Scroll down through the series of code blocks, executing them as you go. Run through this notebook as presented without making any changes. Then when you're done, try experimenting with the code by making minor modifications. Enjoy!

Remember: for each of the code chunks below, click the arrow to the left of the box. Be patient; sometimes these take a few seconds to execute. Wait for the asterisk to change into a number.

# Key modules and libraries we'll use

There are three Python packages we import a lot in spatial analytics and modeling: pandas, geopandas and shapely. 
- <a href="https://pandas.pydata.org/docs/">Pandas</a> is "a library providing high-performance, easy-to-use data structures and data analysis tools." You will use it in many of your Python operations, not just geospatial ones. Pandas' key data structure is the dataframe (often labelled "df"), a simple tabular structure like a spreadsheet. 
- <a href="https://geopandas.org/">Geopandas</a> is built on pandas to "extend the datatypes used by pandas to allow spatial operations on geometric types." Geopandas' key data structure is the geodataframe (often labelled "gdf"). 
- Finally, <a href="https://shapely.readthedocs.io/en/stable/#">shapely</a> handles "manipulation and analysis of planar geometric objects."

In the code below, we execute "import pandas as pd". Python programmers like to use shortcuts so setting "pd" to refer to pandas is pretty standard. Likewise, you'll see below we'll import geopandas as "gpd". Get used to it! 

Other libraries and modules we use here are:
- <a href="https://docs.python.org/3/library/csv.html">The CSV module</a> implements classes to read and write tabular data in CSV format.
- <a href="https://docs.python-requests.org/en/master/">Requests</a> provides a simple way to send http requests to APIs from which you wish to pull data.
- <a href="https://python-visualization.github.io/folium/quickstart.html">Folium</a> is one of the packages you can use to make "beautiful, interactive maps with Python and Leaflet.js."

In [None]:
import pandas as pd
import geopandas as gpd
import shapely
import csv
import requests
import warnings
warnings.filterwarnings('ignore') # Hide warnings

# Get the Data

There are tons of tabular data with lat/long attributes on the web, much of it provided in CSV (comma separated values) format, which is a simple text file with the attribute data in each row separated by commas. It's super simple to grab a CSV file and turn it into something you can do spatial analysis on in a notebook and Python. 
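
To make the CSV format concrete, here is a minimal sketch that parses a small in-memory sample with Python's built-in csv module. The well names and coordinates are made up for illustration and are not taken from the New York dataset; only the column names mirror the ones used later in this notebook.

```python
import csv
import io

# Hypothetical sample data: a header row, then one well per row.
sample = """WELL NAME,SURFACE LATITUDE,SURFACE LONGITUDE
Well A,42.10,-77.20
Well B,42.55,-76.90
"""

# DictReader uses the header row as the keys for every following row.
rows = list(csv.DictReader(io.StringIO(sample)))

for row in rows:
    # Values are read as text, so convert the coordinates to numbers.
    lat = float(row["SURFACE LATITUDE"])
    lon = float(row["SURFACE LONGITUDE"])
    print(row["WELL NAME"], lat, lon)
```

In the cell below, pandas does this parsing and type conversion for us in one call.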

For this exercise we're going to look at abandoned wells in the State of New York and we've found a source of tabular data at the general website <a href="https://www.data.gov/">data.gov</a>, "The home of the U.S. Government’s open data". 

Our data for this exercise can be found at https://catalog.data.gov/dataset/abandoned-wells. It is a list of wells that are regulated under the Oil, Gas and Solution Mining Law (ECL Article 23) in New York State that are abandoned and not plugged.

Let's get the data and look at it. 

In [None]:
path = 'https://data.ny.gov/api/views/vgue-bamz/rows.csv'
wells = pd.read_csv(path)

wells.info()
wells.head()

# Clean and prepare the data for analysis

By looking at this info, we can see it has some problems (as raw data usually does). There are 6851 rows (entries), but only 6682 have latitude and longitude values. The rows without coordinates are of no use to us, so we can delete them. 

Also, you'll see there is a column called Georeference that appears to be a merge of the lat and long values. As you'll discover when working with Python, and many other languages like R, different modules have different data format requirements. We need a geometry column for the coding we're going to do here and it's not clear if this merged column suits the format requirements of the modules we'll use, so just to avoid confusion, we'll delete that column and generate a new geometry one within geopandas.  

In [None]:
#first we delete the column "Georeference"
del wells['Georeference']

#then we remove all rows that have no lat and long data
wells = wells[wells['SURFACE LONGITUDE'].notna() & wells['SURFACE LATITUDE'].notna()]

wells.info()

OK, that got rid of the rows with no location data. But this is still a lot of data for a little exercise, so we'll extract only the wells in Region 8 and we might as well trim off the ones for which location is not verified. 

In [None]:
# .copy() makes wells8 an independent dataframe, so adding a column to it later is safe
wells8 = wells[(wells['VERIFIED LOCATION']=='YES') & (wells['REGION']==8)].copy()

wells8.info()

So now we've got how many entries left?

[77] 

OK, that's plenty small now. 
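
If you'd like to try the two cleaning steps in isolation, here is a small sketch on a made-up table. The values are hypothetical; only the column names match the wells data. It uses `dropna(subset=...)`, which is an alternative to the `notna()` mask used above, and then combines two conditions with `&`.

```python
import pandas as pd

# Hypothetical toy table; None marks a missing coordinate.
toy = pd.DataFrame({
    "SURFACE LATITUDE": [42.1, None, 42.6, 43.0],
    "SURFACE LONGITUDE": [-77.2, -77.0, None, -76.5],
    "VERIFIED LOCATION": ["YES", "YES", "NO", "NO"],
    "REGION": [8, 8, 8, 9],
})

# Step 1: keep only rows where both coordinates are present.
located = toy.dropna(subset=["SURFACE LATITUDE", "SURFACE LONGITUDE"])

# Step 2: combine two conditions with & (each wrapped in parentheses),
# then copy so the result is independent of the original table.
subset = located[
    (located["VERIFIED LOCATION"] == "YES") & (located["REGION"] == 8)
].copy()

print(len(toy), len(located), len(subset))
```

The parentheses around each condition matter: without them, Python applies `&` before `==` and the comparison fails.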

Now we need to do a bit more data munging so we can start doing geometry calculations. Here we use shapely to create a new column at the far right called geometry which contains point features composed of the lat and long values. 
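
Watch the coordinate order in the next cell: shapely builds points as (x, y), which for geographic coordinates means (longitude, latitude), the reverse of the spoken "lat/long". This small sketch, using made-up values, shows the pairs that `zip` produces before each one is handed to `Point`.

```python
# Hypothetical coordinates for two wells (not from the dataset).
longitudes = [-77.20, -76.90]
latitudes = [42.10, 42.55]

# zip pairs the values row by row; longitude comes first because it is x.
xy_pairs = list(zip(longitudes, latitudes))
print(xy_pairs)
```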

In [None]:
from shapely.geometry import Point

# Make a list that contains a series of point geometries from each pair of lat/long values.
geometry_latlon = [Point(xy) for xy in zip(wells8['SURFACE LONGITUDE'], wells8['SURFACE LATITUDE'])]

# Append the list to the wells8 dataframe to create a new column "geometry"
wells8['geometry'] = geometry_latlon 

wells8.head()

And finally, let's convert the pandas dataframe into a geopandas geodataframe and specify that the coordinate reference system (CRS) for the geometry is WGS84 (lat/long). This is EPSG code 4326.

[The <a href='http://www.epsg-registry.org/'>EPSG Geodetic Parameter Dataset</a> is a structured dataset of CRS and coordinate transformations. It was originally compiled by the now-defunct European Petroleum Survey Group, hence the acronym, though it is no longer maintained by that group.]

In [None]:
# Pass the CRS when building the geodataframe instead of assigning it afterwards
wells_gdf = gpd.GeoDataFrame(wells8, geometry='geometry', crs='EPSG:4326')

Now that our geodataframe has a CRS, we can convert the lat/long coordinates to the UTM coordinate system simply by providing the EPSG code for the UTM zone covering this location (18N). That's EPSG code 32618. That will make our data finally ready for some Cartesian geometry calculations! (Remember: Data prep often takes a large portion of most spatial analytics project time.)

In [None]:
wells_gdf_utm = wells_gdf.to_crs(epsg=32618)
wells_gdf_utm.head()

You'll see the content of these two geodataframes is identical except for the final geometry column, which now holds UTM coordinates in meters rather than degrees. 
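
The EPSG code 32618 used above follows a regular pattern: WGS84 UTM zones in the northern hemisphere are numbered 32601 to 32660, and southern ones 32701 to 32760. Here is a rough sketch of how you could work out the code from a longitude and latitude. The helper name is hypothetical, and it ignores the irregular zones around Norway and Svalbard.

```python
import math

def utm_epsg_from_lonlat(lon, lat):
    """Return the WGS84 / UTM EPSG code for a longitude/latitude pair.

    Hypothetical helper for illustration only; it ignores the special
    zones around Norway and Svalbard.
    """
    # Zones are 6 degrees wide, starting at 180 degrees west.
    zone = min(int(math.floor((lon + 180) / 6)) + 1, 60)
    # 326xx codes are northern hemisphere, 327xx are southern.
    base = 32600 if lat >= 0 else 32700
    return base + zone

# The map centre used later in this notebook falls in zone 18N.
print(utm_epsg_from_lonlat(-77.16301, 42.58758))  # 32618
```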

And a good final step in our data preparation stage is to take a look at the final data on a map. 

In [None]:
import folium

mymap = folium.Map(location = [42.58758, -77.16301], tiles='OpenStreetMap' , zoom_start = 8) 

for _, r in wells_gdf.iterrows():
    sim_geo = gpd.GeoSeries(r['geometry']) 
    geo_j = sim_geo.to_json()
    geo_j = folium.GeoJson(data=geo_j, 
                style_function = lambda x: {'color': 'red', 'weight': 1,  'fillColor': 'YlGnBu'})
    folium.Popup(f"<i>Well Name: {r['WELL NAME']}, <br> Well Type Code: {r['WELL TYPE CODE']}, <br> Company Name: {r['COMPANY NAME']}</i>", min_width=200, max_width=400).add_to(geo_j)
    folium.Tooltip(f"<i>Well Name: {r['WELL NAME']}, <br> Well Type Code: {r['WELL TYPE CODE']}, <br> Company Name: {r['COMPANY NAME']}</i>").add_to(geo_j)
    geo_j.add_to(mymap)

mymap

# Find distances

FINALLY, let's do some spatial analytics! Earlier we looked at the algorithm for calculating the distance between two points. Good news! That algorithm is built into geopandas, so all we have to do is invoke the "distance" function. 
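
As a reminder of what that built-in method works out for two points in a projected CRS such as UTM, here is a minimal sketch of the straight-line distance, assuming the algorithm referred to is the familiar Pythagorean formula. The function name and coordinates are made up for illustration.

```python
import math

def planar_distance(x1, y1, x2, y2):
    """Straight-line distance between two points in projected coordinates."""
    return math.hypot(x2 - x1, y2 - y1)

# Hypothetical UTM-style coordinates in meters (easting, northing).
d = planar_distance(320000.0, 4700000.0, 323000.0, 4704000.0)
print(d)  # 3 km east and 4 km north gives 5000.0 m
```

This only makes sense when the coordinates are in linear units such as meters, which is why we converted from lat/long to UTM first.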

One more nomenclature thing - how to pull out a single cell in our geodataframe? Like this:

     gdf_name.column_name.iloc[row_number]

So the UTM geometries of the first two wells are:

In [None]:
print(wells_gdf_utm.geometry.iloc[0])
print(wells_gdf_utm.geometry.iloc[1])

So now we can calculate the distance in meters between the first well and all others. We start by defining an empty list variable and then iterate through the dataframe, one row at a time, adding each calculated distance to the list.

In [None]:
distances = []

for i in range(len(wells_gdf_utm)):
    distances.append(wells_gdf_utm.geometry.iloc[0].distance(wells_gdf_utm.geometry.iloc[i]))

distances   

And let's do the big one! Build a matrix that shows the distance between every point and every other point!
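
Before running the full version on the wells, it may help to see what such a matrix looks like for three made-up points. This sketch uses only Python's math module; the coordinates are hypothetical and form a 3-4-5 right triangle (in kilometers) so the results are easy to check by eye.

```python
import math

# Hypothetical points in meters forming a 3-4-5 right triangle.
points = [(0.0, 0.0), (3000.0, 0.0), (3000.0, 4000.0)]

# matrix[i][j] holds the distance from point i to point j.
matrix = [[math.dist(p, q) for q in points] for p in points]

for row in matrix:
    print(row)
```

Notice two properties that hold for any distance matrix: the diagonal is all zeros (each point is zero meters from itself), and the matrix is symmetric (the distance from i to j equals the distance from j to i).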

In [None]:
# Build the matrix one column at a time, using names such as Point 0, Point 1, etc.
pdf = {}
col_names = [f'Point {i}' for i in range(len(wells_gdf_utm))]
pdf.update({' ': col_names})  # first column holds the row labels

for i in range(len(wells_gdf_utm)):
    distances = []  # distances from point i to every point j
    for j in range(len(wells_gdf_utm)):
        distances.append(wells_gdf_utm.geometry.iloc[i].distance(wells_gdf_utm.geometry.iloc[j]))
    pdf.update({f'Point {i}': distances})  # store the finished column once per point

dist_matrix = pd.DataFrame(pdf)
dist_matrix.set_index(' ', inplace=True) # Y axis naming style
dist_matrix

Very cool! Now there are plenty of statistics we could calculate on this matrix (average, max, min, etc.) and we could do more advanced matrix math to calculate clusters and spatial autocorrelation! But that's enough for now.
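
If you want a head start on those statistics, here is a minimal sketch using NumPy on a small made-up matrix. It is one possible approach, not the only one. The key idea is that every pair appears twice in a distance matrix and the diagonal is all zeros, so we summarize only the entries above the diagonal.

```python
import numpy as np

# Hypothetical symmetric distance matrix in meters (3 points).
dist = np.array([
    [0.0, 3000.0, 5000.0],
    [3000.0, 0.0, 4000.0],
    [5000.0, 4000.0, 0.0],
])

# Take only the entries above the diagonal so each pair counts once.
pairs = dist[np.triu_indices_from(dist, k=1)]
print(pairs.mean(), pairs.max(), pairs.min())

# Nearest neighbour of each point: hide the zero diagonal first,
# otherwise every point would be its own nearest neighbour.
masked = dist.copy()
np.fill_diagonal(masked, np.inf)
nearest = masked.argmin(axis=1)
print(nearest)
```

To try this on the wells, you could replace the toy array with `dist_matrix.to_numpy()` after running the cell above.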

<font size="+1"><a style="background-color:blue;color:white;font-weight:bold;" 
href="sma-5.ipynb">OK, let's go back to the final part of the lesson!</a></font>