# 1. Introduction and Preliminaries
## 1.1 Installation
#### 1.1.1 Anaconda

<img src="./img/logo-horizontal-large.svg" width="300" height="300"/>


The fastest way to install all the libraries we need for the workshop is through Anaconda. I am running these notebooks on Anaconda 4.3.0 and Python 3.6. You can find the installers here:

[Anaconda 4.3.0](https://www.continuum.io/downloads)

Anaconda installs most of the packages and dependencies for this workshop, but it can be useful to force `conda` to use the conda-forge channel to ensure the correct dependency versions are installed.

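Open `~/.condarc` in a text editor (the file lives in your home directory, so `sudo` is not needed):

`$ gedit ~/.condarc`

then add

```
channels:
  - conda-forge
  - defaults
```

and update the installed packages:

`$ conda update --all`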
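
Once the update finishes, a quick way to confirm the environment is ready is to check that the main libraries can be found by Python. The following is a minimal sketch, not an official requirements list: the package names are an assumption based on the libraries mentioned in this workshop (`sklearn` and `osgeo` are the import names for scikit-learn and the GDAL Python bindings), so adjust the list to match the notebooks you plan to run.

```python
import importlib.util

# Hypothetical package list for illustration; these are import names,
# which can differ from the names used by conda or pip.
WORKSHOP_PACKAGES = ["geopandas", "pysal", "sklearn", "osgeo", "matplotlib"]


def check_packages(names):
    """Return a dict mapping each top-level import name to True if it can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


# Report which packages are available without actually importing them.
for name, found in check_packages(WORKSHOP_PACKAGES).items():
    print(f"{name:<12} {'ok' if found else 'MISSING'}")
```

Anything reported as `MISSING` can usually be installed with `conda install <package>`, which will pull from conda-forge first if the channel order above is in place.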

#### 1.1.2 Quantum GIS (QGIS)

<img src="./img/qgis.jpg" width="100" height="300"/>

Quantum GIS [(QGIS)](http://www.qgis.org/en/site/forusers/download.html) is a free and open source Geographic Information System (GIS) that I recommend for rapidly visualizing geospatial data. A number of geospatial visualization tools in Python are built on top of matplotlib, and we will use the plot function in GeoPandas later in the workshop. In addition to quickly visualizing a wide variety of geospatial data types, QGIS offers a number of other benefits:
* Built on GDAL/OGR
* Python scripting framework
* Multiple OS distributions

#### 1.1.3 Notebooks and Data

The workshop notebooks and data are available at https://github.com/kas673/odsc. To get a local copy, clone the repository:

`$ git clone https://github.com/kas673/odsc.git`

#### 1.1.4 GeoDa (Optional)

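[GeoDa](http://geodacenter.github.io/) is a free desktop application with a graphical user interface for exploratory spatial data analysis. It offers many of the same methods we will use through PySAL.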

## 1.2 Agenda
#### 1.2.1 Philosophy

We are lucky to have so many open source Python libraries available to us now. Many of the tools we will use are based on other libraries (e.g. GDAL/OGR). These are the guidelines I will try to follow in this workshop:

* Path of least resistance
    * Use GIS to visualize when needed
    * Use the highest-level package (e.g. GeoPandas vs. GDAL/OGR)
    * Use the least amount of code with simple Python examples
* Share, share, share
    * Packages
    * Textbooks
    * Web resources
* Replicate common ML workflows
    * Vector geospatial data (tabular)
    * DataFrames and Series
    * sklearn


#### 1.2.2 Outline

This workshop is organized into eight modules, and I will try to spend approximately 30 minutes on each. Each module is structured with 5 to 15 minutes of instruction and examples, followed by approximately 15 minutes of working on the exercise. We'll walk through the exercise line by line, and the final exercise notebooks will be posted to my GitHub account after the workshop is finished this evening. Here are the topics we'll cover:

* Geospatial Data Formats and I/O
    * GDAL/OGR
    * File-based formats
    * Database formats
* Exploratory Spatial Data Analysis (ESDA)
    * PySAL
    * Spatial Weights Matrix
    * Spatial Autocorrelation
* Spatial Smoothing, Regionalization, and Neighborhood Analysis
    * Spatial Smoothing
    * Regionalization
    * Neighborhood Analysis
* Geospatial Feature Engineering
    * Geometry-based Features
    * Topologically-based Features
    * Set-theoretic Features
* Geospatial Feature Enrichment
    * Joins
    * Geocoding
    * Zonal Statistics
* Spatial Econometric Approaches
    * Spatial Lag
    * Spatial Error
    * spreg
* Traditional ML Approaches
    * Spatial Group Partitioning
    * Tree-based methods
    * Two-way partial dependence on location

#### 1.2.3 Target Audience

This workshop should be useful for a wide range of audiences. If you have basic Python programming skills and are interested in geospatial analysis and machine learning, this workshop will be a good starting point. Here are some good target users:
* Beginners in machine learning and geospatial analysis
* Experienced machine learning practitioners who have never used geospatial data
* GIS Analysts that primarily conduct analysis in traditional Desktop GIS (ArcGIS, QGIS, etc.)
* Experienced Geospatial developers with little ML experience


## 1.3 Background
#### 1.3.1 About Me
* Customer Facing Data Scientist at DataRobot
    * Work with account executives to ensure customer success
    * Work with the product team to address geospatial requirements on the roadmap
    * Help customers with geospatial use cases utilize DataRobot
* Adjunct Professor of Geographic Information Systems at Penn State
    * Teach a graduate-level GIS course (ArcGIS, QGIS, GeoDa, Carto)
    * Vector and raster analyses
    * Exposure to open source
* Applied Spatial Analysis
    * Experienced in point pattern analysis
    * Spatial econometrics and geostatistics
    * Wide range of software and formats
    
#### 1.3.2 What I'm Not About

* Computer vision, neural networks, and deep learning
* Expert Python programming
* Hadoop

#### 1.3.3 GIS

A GIS is a computer-based system to aid in the collection, maintenance, storage, analysis, output, and distribution of spatial data and information (Bolstad 2016). The traditional behemoth in the GIS industry has been [ESRI](http://www.esri.com/) and its flagship product ArcGIS. QGIS has become more popular recently and replicates many of the functions in ArcGIS Desktop. The popularity of web mapping applications and geotagged device data has also driven the rise of specialized SaaS products ([Mapbox](https://www.mapbox.com/) and [CARTO](https://carto.com/)) and open source front-end libraries such as [leaflet.js](http://leafletjs.com/) and [turf.js](http://turfjs.org/).

A word of caution:

<img src="./img/sidewalkballet_gis.png" width="600" height="300"/>

Terminology: the same concept goes by a different name in each field.

* Machine Learning: Feature
* Statistics: Variable
* Geospatial: Attribute

In geospatial usage, "feature" can also refer to an individual geometry, as we'll see in the next module.
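
To make the two meanings of "feature" concrete, here is a minimal sketch using GeoJSON-style records and only the standard library. The site names, coordinates, and population values are made up for illustration, and `to_rows` is a hypothetical helper, not part of any library. Each record is a feature in the geospatial sense (one geometry plus its attributes), and each column of the flattened table is a feature in the machine learning sense.

```python
# Two GeoJSON-style features: each has a geometry and a set of attributes
# ("properties"). All values are invented for illustration.
features = [
    {
        "type": "Feature",
        "geometry": {"type": "Point", "coordinates": [-77.86, 40.79]},
        "properties": {"name": "Site A", "population": 1200},
    },
    {
        "type": "Feature",
        "geometry": {"type": "Point", "coordinates": [-75.16, 39.95]},
        "properties": {"name": "Site B", "population": 3400},
    },
]


def to_rows(features):
    """Flatten point features into table rows: one row per geometry, one column per attribute."""
    rows = []
    for feature in features:
        row = dict(feature["properties"])  # attributes become columns
        # GeoJSON stores point coordinates as (longitude, latitude).
        lon, lat = feature["geometry"]["coordinates"]
        row["lon"], row["lat"] = lon, lat
        rows.append(row)
    return rows


rows = to_rows(features)
columns = sorted(rows[0])
print(columns)
```

The resulting rows have the same shape as a DataFrame: an attribute table to a GIS analyst, a set of variables to a statistician, and a feature matrix to a machine learning practitioner.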