## Introduction to Big Data
### Segment 2 of 5

# Spatial Big Data (Variety)

*Lesson Developer: Jayakrishnan Ajayakumar, jxa421@case.edu*


In [None]:
# This code cell starts the necessary setup for Hour of CI lesson notebooks.
# First, it enables users to hide and unhide code by producing a 'Toggle raw code' button below.
# Second, it imports the hourofci package, which is necessary for lessons and interactive Jupyter Widgets.
# Third, it helps hide/control other aspects of Jupyter Notebooks to improve the user experience
# This is an initialization cell
# It is not displayed because the Slide Type is 'Skip'

from IPython.display import HTML, IFrame, Javascript, display
from ipywidgets import interactive
import ipywidgets as widgets
from ipywidgets import Layout
import pandas as pd
import queue
import threading
import time
import getpass # This library allows us to get the username (User agent string)

# import package for hourofci project
import sys
sys.path.append('../../supplementary') # relative path (may change depending on the location of the lesson notebook)
import hourofci

import warnings
warnings.filterwarnings('ignore') # Hide warnings

# load javascript to initialize/hide cells, get user agent string, and hide output indicator
# hide code by introducing a toggle button "Toggle raw code"
# HTML(''' 
#     <script type="text/javascript" src=\"../../supplementary/js/custom.js\"></script>
    
#     <input id="toggle_code" type="button" value="Toggle raw code">
# ''')

HTML(''' 
    <script type="text/javascript" src=\"../../supplementary/js/custom.js\"></script>
    
    <style>
        .output_prompt{opacity:0;}
    </style>
    
    <input id="toggle_code" type="button" value="Toggle raw code">
''')



## Reminder
<a href="#/slide-2-0" class="navigate-right" style="background-color:blue;color:white;padding:8px;margin:2px;font-weight:bold;">Continue with the lesson</a>

<br>
</br>
<font size="+1">

By continuing with this lesson you are granting your permission to take part in this research study for the Hour of Cyberinfrastructure: Developing Cyber Literacy for GIScience project. In this study, you will be learning about cyberinfrastructure and related concepts using a web-based platform that will take approximately one hour per lesson. Participation in this study is voluntary.

Participants in this research must be 18 years or older. If you are under the age of 18 then please exit this webpage or navigate to another website such as the Hour of Code at https://hourofcode.com, which is designed for K-12 students.

If you are not interested in participating please exit the browser or navigate to this website: http://www.umn.edu. Your participation is voluntary and you are free to stop the lesson at any time.

For the full description please navigate to this website: <a href="../../gateway-lesson/gateway/gateway-1.ipynb">Gateway Lesson Research Study Permission</a>.

</font>

![variety](supplementary/images/variety.jpg)
Spatial data comes from many sources, including sensors, social media platforms, satellite and other remote sensing systems, and GPS-enabled devices, all of which generate data at different spatial and temporal scales. Making sense of spatial data that is ingested from various sources at various granularities and frequencies can be an arduous task. To illustrate the variety attribute of Spatial Big Data, we provide a small experiment.

## Experiment (Snow storm warning!!!)
![snow_storm](supplementary/images/snow_storm.jpg)

In this experiment, we will again use our good old taxi GPS data. Along with the taxi data, we will also use live snow depth data, which reports snow depth (in inches) every second. The snow depth data is a raster: a matrix of cells (or pixels) organized into rows and columns (a grid), where each cell contains a value representing information, such as snow depth.
![raster_examples](supplementary/images/raster_examples.png)

Let's first do some warmup with raster data. We use the library called rasterio to read a raster file.

In [None]:
import rasterio

In [None]:
rasterSample = rasterio.open('supplementary/data/nyc_snow_depth_data/1.tif')

In [None]:
print(rasterSample.width)
print(rasterSample.height)
print(rasterSample.crs)

We can even plot this raster file.

In [None]:
from rasterio.plot import show
show(rasterSample,cmap='Reds');

An increase in intensity represents an increase in value (in our case snow depth).

The values are stored as a multidimensional array in the raster file. Let's examine the values.

In [None]:
data = rasterSample.read(1)
print(data.shape)
#print (data)

As you can see, data is a two-dimensional array with 13 rows and 17 columns. If you print the array, you will see that many cells hold the same value (-99999). This value is a special marker called "No Data". You can check the "No Data" value for a raster directly from the raster dataset object.

In [None]:
rasterSample.nodata

As you can see, the "No Data" value is indeed -99999. A No Data value indicates that there is no recorded value in that particular raster cell. In our plot, the No Data cells are shown in white.
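Because -99999 is a marker rather than a measurement, it should be left out of any statistic computed on the raster. The sketch below shows one possible way to do that with a NumPy masked array. The small array and its values are made up for illustration and stand in for the output of `rasterSample.read(1)`.

```python
import numpy as np

# Hypothetical 3x4 snow depth grid (inches); -99999 marks cells with no recorded value.
NODATA = -99999
depth = np.array([
    [NODATA, 2.0, 5.5, NODATA],
    [1.0, 4.0, 9.0, 12.5],
    [NODATA, NODATA, 3.0, 7.0],
])

# Hide the No Data cells so that count/min/max/mean ignore them.
masked = np.ma.masked_equal(depth, NODATA)

print(masked.count())              # number of cells holding a real measurement
print(masked.min(), masked.max())  # range of the real measurements
print(masked.mean())               # mean of the real measurements
```

Without the mask, the minimum would be -99999 and the mean would be meaningless. A threshold test such as `depth >= 4` is not affected, because the marker is far below any real snow depth.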

To represent the raster data on our live, interactive map, we have created a set of polygons (rectangles) that cover the same extent as the raster, with each polygon representing one grid cell. Let us look at the geometric grid data.

In [None]:
import geopandas as gpd
weatherGrid = gpd.read_file(r'supplementary/data/nyc_grid/nyc_grid.shp')
weatherGrid

Let's see the plot for the grid.

In [None]:
weatherGrid.plot(facecolor="none", edgecolor="black");

As the grid has the same dimensions as the raster, there is a one-to-one mapping between the raster and the grid. For example, the grid cell with id 1 is the same as the top left corner cell of the raster. And since the grid is a collection of geometries, any kind of geometrical operation (such as counting how many taxis are currently within a particular grid cell) can be performed on the grid.

Now for the experiment itself. Our good old taxis will run for an hour through every nook and corner of NYC. Every second, the weather station sends a snow depth update in the form of a raster file. The snow depth values are categorized into four classes:

1) Less than 4 inches -- No warning (No color)

2) 4 to 8 inches -- Level 1 (Drive Cautiously) (Yellow color)

3) 8 to 12 inches -- Level 2 (Extreme Caution) (Orange color)

4) 12 inches and above -- Level 3 (Don't Drive) (Red color)

Based on the current position, the taxis will get warnings and their background color will change to match the warning (for No Warning the color would remain as green).
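The four classes can be written as a small lookup function. This is only a sketch: the function name `warning_level` is hypothetical, and it assumes each class includes its lower bound and excludes its upper bound (so exactly 8 inches is Level 2), which matches the `>=4 && <8` style of test used in the notebook's JavaScript.

```python
def warning_level(snow_depth_inches):
    """Return (level, color) for a snow depth in inches, following the four classes."""
    if snow_depth_inches >= 12:
        return 3, "red"      # Level 3: Don't Drive
    if snow_depth_inches >= 8:
        return 2, "orange"   # Level 2: Extreme Caution
    if snow_depth_inches >= 4:
        return 1, "yellow"   # Level 1: Drive Cautiously
    return 0, "green"        # No warning: the taxi stays green


for depth in (0, 3.9, 4, 8, 12, 20):
    print(depth, warning_level(depth))
```

Checking the largest threshold first keeps each branch to a single comparison. A No Data value of -99999 falls through to the last line and produces no warning.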

Without further delay, let us get into the implementation.

In [None]:
import pandas as pd
import queue
import threading
import time
import geopandas as gpd
import json
from scipy.spatial import cKDTree
import numpy as np
import rasterio

In [None]:
class GPSThread(threading.Thread):
    # dataFrame receives the GPS records; status is a shared one-element list that the main thread uses as a kill switch
    def __init__(self, dataFrame,status):
        threading.Thread.__init__(self)
        self.dataFrame = dataFrame
        self.status = status
    def run(self):
        #load the data file
        data = pd.read_parquet(r'supplementary/data/taxi1hr_gps.parquet')
        #create an index based on seconds 
        data.set_index('sec',inplace=True)
        #now we need to loop through the dataset
        for sec in range(data.index.min(),data.index.max()):
            #kill switch: the main thread sets status[0] to 0 to stop the simulation
            if self.status[0]==0:
                break
            dat = data.loc[sec]
            gpsDat = dat[['id','lng','lat']].to_json(orient='records')
            self.dataFrame.loc[sec] = gpsDat
            #after one iteration sleep for a second......
            time.sleep(1)
            #remove data that is older than 5 minutes with respect to current time
            #self.dataFrame.drop(self.dataFrame.index[self.dataFrame.index<(sec-300)],inplace=True)
        #if simulation is over, signal it by setting the status to 2
        self.status[0] = 2
        
class WeatherThread(threading.Thread):
    def __init__(self, dataFrame,status):
        threading.Thread.__init__(self)
        self.dataFrame = dataFrame
        self.status = status
    def run(self):
        #load the data file
        weatherDataFolder = r'supplementary/data/nyc_snow_depth_data'
        #now we need to loop through the dataset
        for sec in range(0,3600):
            #kill switch: the main thread sets status[0] to 0 to stop the simulation
            if self.status[0]==0:
                break
            self.dataFrame.loc[sec] = weatherDataFolder+'/'+str((sec//100))+'.tif'
            #after one iteration sleep for a second......
            time.sleep(1)
            #remove data that is older than 5 minutes with respect to current time
            #self.dataFrame.drop(self.dataFrame.index[self.dataFrame.index<(sec-300)],inplace=True)

gpsData = None
status = [-1]
weatherGrid = gpd.read_file(r'supplementary/data/nyc_grid/nyc_grid.shp')[['id','geometry']]
weatherGrid['id'] = weatherGrid['id'].astype(int)
weatherRawData = pd.DataFrame(columns = ['sec','data'])
weatherDict = None
warningSnowDepth = 4
def start():
    global gpsData
    global status
    global weatherRawData
    global weatherDict
    gpsData = pd.DataFrame(columns = ['sec','data'])
    gpsData.set_index('sec',inplace=True)
    weatherRawData = pd.DataFrame(columns = ['sec','data'])
    weatherRawData.set_index('sec',inplace=True)
    weatherDict = {}
    status[0] = 1
    #startup threads
    gpsThread = GPSThread(gpsData,status)
    gpsThread.start()
    weatherThread = WeatherThread(weatherRawData,status)
    weatherThread.start()
    #small delay for the thread to startup
    return "started"
    
def getGPSData(sec):
    if sec in gpsData.index:
        currentData = pd.read_json(gpsData.loc[sec].data)
        currentData['sD'] = np.zeros(len(currentData))
        if sec in weatherDict and not isinstance(weatherDict[sec], str):
            warningGrids = weatherDict[sec]
            currentDataGeo = gpd.GeoDataFrame(currentData['id'],geometry = gpd.points_from_xy(currentData.lng,currentData.lat),crs='EPSG:4326')
            wGrid = weatherGrid.merge(warningGrids,on='id')
            matches = currentDataGeo.sjoin(wGrid,predicate="within")
            currentData.loc[currentData.id.isin(matches.id_left),"sD"] = matches.snowDepth
        return currentData.to_json(orient="records")+">>>currentGPSTime:"+str(max(gpsData.index))
    return "No Data"  

def getStatus():
    global status
    if status[0] == 2 or status[0] == 0:
        return "sim over"
    return "running"

def stop():
    global status
    status[0] = 0
    return "stopping"

def getWeatherData(sec):
    if sec in weatherRawData.index:
        weatherGridData = rasterio.open(weatherRawData.loc[sec].data).read(1)
        warning = np.where(weatherGridData>=warningSnowDepth)
        if len(warning[0])!=0:
            matchingIds = warning[0]+1+((warning[1])*weatherGridData.shape[0])
            weatherDict[sec] = pd.DataFrame({'id':matchingIds,'snowDepth':weatherGridData[warning]})
            return weatherDict[sec].to_json(orient='records')
        else:
            weatherDict[sec] = "No Warnings"
            return weatherDict[sec]
    else:
        return "No Data"

def getGrid():
    return weatherGrid.to_json()
        

In [None]:
%%html
<link rel="stylesheet" href="https://unpkg.com/leaflet@1.8.0/dist/leaflet.css"/>
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/leaflet.draw/1.0.4/leaflet.draw.css"/>
<div id="main">
    <div id="mapandcontrols">
        <div id="map"></div>
        <div id="controls">
            <button id="start_button" class ="controlbuttons" onclick = "run()">Start Simulation</button>
            <button id="stop_button" class ="controlbuttons" onclick = "stop()" disabled>Stop Simulation</button>
            <span id="active_cars" class ="controlbuttons">Active Cars:0</span>
            <span id="gpstime" class ="controlbuttons">GPSTime:0</span>
            <span id="realgpstime" class ="controlbuttons">RealGPSTime:0</span>
        </div>
    </div>
</div>

<style>
#main { height: 500px;width:800px; }
#mapandcontrols { height: 100%;width:65%;float:left}
#map { height:75%;width:100%; }
#controls { height: 20%;margin-top:4%; }
.maxwidth{width:100%;}
.maxheight{height:100%;}
.halfwidth{width:50%;}
.halfheight{height:50%;}
.controlbuttons{float:left;margin:1%;}

</style>
<script>
    require.config({
        paths: {
            d3: 'https://d3js.org/d3.v7.min',
            L: 'https://unpkg.com/leaflet@1.8.0/dist/leaflet'
        }
    });
    var map,tooltip,datInterval, current,weatherCurrent, weatherInterval, statusInterval,pathCreator,warnings = new Map(),currentlyProcessing = new Set();
    var projectPoint = function(x, y) {
        const point = map.latLngToLayerPoint(new L.LatLng(y, x));
        this.stream.point(point.x, point.y)
    }
    require(['d3'], function(d3) {
        const projection = d3.geoTransform({point: projectPoint})
        pathCreator = d3.geoPath().projection(projection)
    });
    
    function drawGrid(){
        IPython.notebook.kernel.execute(
            "getGrid()", 
            {
                iopub: {
                    output: function(response) {
                        require(['d3'], function(d3) {
                            var dataString = response.content.data['text/plain'];
                            var data = JSON.parse(dataString.slice(1, dataString.length - 1));
                            d3.select('#gridFrame')
                                .selectAll("path")
                                .data(data.features, d => d.properties.id)
                                .join(
                                    enter => enter.append('path')
                                    .attr('fill-opacity', 0)
                                    .style('fill','none')
                                    .style("pointer-events", "auto")
                                    .attr('stroke', 'grey')
                                    .attr('stroke-width', 0.3)
                                    .attr('d', pathCreator)
                                    .selection()
                                )
                        });
                    }
                }
            },
            {
                silent: false, 
                store_history: false, 
                stop_on_error: true
            }
        );
    }

    function cleanup(){
        clearInterval(statusInterval);
        clearInterval(datInterval);
        clearInterval(weatherInterval);
        current = 0;
        weatherCurrent = 0;
        warnings = new Map();
        currentlyProcessing = new Set();
        require(['d3'], function(d3) {
            d3.select("#gpsFrame").selectAll("circle").remove();
            d3.select("#gridFrame").selectAll("path").remove();
            document.getElementById("start_button").disabled = false;
            document.getElementById("stop_button").disabled = true;
            document.getElementById("active_cars").innerText = "Active Cars:0";
            document.getElementById("gpstime").innerText = "GPSTime:0";
            document.getElementById("realgpstime").innerText = "RealGPSTime:0";
        });
    }
    
    function getStatus(){
        IPython.notebook.kernel.execute(
            "getStatus()", 
            {
                iopub: {
                    output: function(response) {
                        var dataString = response.content.data['text/plain'];
                        if (dataString.includes("sim over")){
                            console.log('Time to clean up everything');
                            cleanup();
                        }
                    }
                }
            },
            {
                silent: false, 
                store_history: false, 
                stop_on_error: true
            }
        );    
    }

    function fetchGPSData() {
        if (currentlyProcessing.has(current))
            return;
        currentlyProcessing.add(current);
        IPython.notebook.kernel.execute(
            "getGPSData(" + current + ")", {
                iopub: {
                    output: function(response) {
                        // Print the return value of the Python code to the console
                        var dataString = response.content.data['text/plain'];
                        if (!(dataString.includes("sim over") || dataString.includes("No Data"))) {
                            require(['d3'], function(d3) {
                                var sections = dataString.split(">>>");
                                var data = JSON.parse(sections[0].slice(1, sections[0].length));
                                document.getElementById("active_cars").innerText = "Active Cars: " + data.length;
                                d3.select("#gpsFrame")
                                    .selectAll("circle")
                                    .data(data, d => d.id)
                                    .join(
                                        enter => enter.append('circle')
                                        .attr("cx", d => map.latLngToLayerPoint([d.lat, d.lng]).x)
                                        .attr("cy", d => map.latLngToLayerPoint([d.lat, d.lng]).y)
                                        .attr("r", 2)
                                        .attr("stroke-width", 0.4)
                                        .attr("fill-opacity", 0.3)
                                        .style("fill",function(d){
                                            if(d.sD>=4 && d.sD<8)
                                                return "yellow";
                                            else if(d.sD>=8 && d.sD<12){
                                                return "orange";
                                            }
                                            else if(d.sD>=12)
                                                return "red";
                                            else 
                                                return "green"
                                        })
                                        .selection(),

                                        update => update
                                        .attr("cx", d => map.latLngToLayerPoint([d.lat, d.lng]).x)
                                        .attr("cy", d => map.latLngToLayerPoint([d.lat, d.lng]).y)
                                        .style("fill",function(d){
                                            if(d.sD>=4 && d.sD<8)
                                                return "yellow";
                                            else if(d.sD>=8 && d.sD<12)
                                                return "orange"
                                            else if(d.sD>=12){
                                                return "red";
                                            }
                                            else 
                                                return "green"
                                        })
                                        .selection(),

                                        exit => exit
                                        .remove()
                                    )
                                document.getElementById("gpstime").innerText = "GPSTime: " + current;
                                document.getElementById("realgpstime").innerText = "RealGPSTime: " + sections[1].replace("'","").split(":")[1];
                                current += 1;
                            });
                        }
                        else if(dataString.includes("No Data")){
                            currentlyProcessing.delete(current);
                        }
                    }
                }
            }, {
                silent: false,
                store_history: false,
                stop_on_error: true
            }
        );
    }

    
    function fetchWeatherData(){
        IPython.notebook.kernel.execute(
            "getWeatherData(" + weatherCurrent + ")", 
            {
                iopub: {
                    output: function(response) {
                        require(['d3'], function(d3) {
                            var dataString = response.content.data['text/plain'];
                            warnings = new Map();
                            if (!(dataString.includes("sim over") || dataString.includes("No Data") || dataString.includes("No Warnings"))) {
                                var data = JSON.parse(dataString.slice(1, dataString.length - 1));
                                data.forEach(function(d){
                                    warnings.set(parseInt(d.id),parseFloat(d.snowDepth));
                                });
                                d3.select('#gridFrame')
                                    .selectAll("path")
                                    .attr('fill-opacity', function(d){
                                        if(warnings.has(d.properties.id))
                                            return 0.3;
                                        else
                                            return 0;
                                    })
                                    .style('fill',function(d){
                                        if(warnings.has(d.properties.id)){
                                            snowDepth = warnings.get(d.properties.id);
                                            if(snowDepth>=4 && snowDepth<8)
                                                return "yellow";
                                            else if(snowDepth>=8 && snowDepth<12){
                                                return "orange";
                                            }
                                            else if(snowDepth>=12)
                                                return "red";
                                            else 
                                                return "none"
                                        }  
                                        else
                                            return "none";
                                    });
                                weatherCurrent+=1
                            }
                            else if(dataString.includes("No Warnings")){
                                d3.select('#gridFrame')
                                    .selectAll("path")
                                    .attr('fill-opacity',0)
                                    .style('fill','none');
                                weatherCurrent+=1
                            }
                        });
                    }
                }
            },
            {
                silent: false, 
                store_history: false, 
                stop_on_error: true
            }
        );
    }

    function update() {
        require(['d3'], function(d3) {
            d3.selectAll("circle")
                .attr("cx",d => map.latLngToLayerPoint([d.lat, d.lng]).x)
                .attr("cy", d => map.latLngToLayerPoint([d.lat, d.lng]).y);
            d3.select('#gridFrame').selectAll("path")
                .attr('d', pathCreator);
        });
    }

    function stop() {
        IPython.notebook.kernel.execute(
            "stop()", {
                iopub: {
                    output: function(response) {}
                }
            }, {
                silent: false,
                store_history: false,
                stop_on_error: true
            }
        )
    }

    function run() {
        current = 0;
        weatherCurrent = 0;
        IPython.notebook.kernel.execute(
            "start()", {
                iopub: {
                    output: function(response) {
                        statusInterval = setInterval(getStatus, 500);
                        weatherInterval = setInterval(fetchWeatherData, 900);
                        datInterval = setInterval(fetchGPSData, 1000);
                        drawGrid();
                        //disable the start button
                        document.getElementById("start_button").disabled = true;
                        document.getElementById("stop_button").disabled = false;
                    }
                }
            }, {
                silent: false,
                store_history: false,
                stop_on_error: true
            }
        )
    }

    require(['d3','L'], function(d3, L) {
        map = L
            .map('map')
            .setView([40.763231753511604, -73.98383956127027], 10); // center position + zoom

        // Add a tile to the map = a background. Comes from OpenStreetmap
        L.tileLayer('https://tile.openstreetmap.org/{z}/{x}/{y}.png', {
            maxZoom: 19,
            attribution: '© OpenStreetMap'
        }).addTo(map);
        // Add a svg layer to the map
        L.svg().addTo(map);
        map.on("moveend", update)
        d3.select("#map").select("svg").append("g").attr("id", "gpsFrame");
        d3.select("#map").select("svg").append("g").attr("id", "gridFrame");
    });
    
</script>

So let's see the important sections of the code.

```python
def getWeatherData(sec):
    if sec in weatherRawData.index:
        weatherGridData = rasterio.open(weatherRawData.loc[sec].data).read(1)
        warning = np.where(weatherGridData>=warningSnowDepth)
        if len(warning[0])!=0:
            matchingIds = warning[0]+1+((warning[1])*weatherGridData.shape[0])
            weatherDict[sec] = pd.DataFrame({'id':matchingIds,'snowDepth':weatherGridData[warning]})
            return weatherDict[sec].to_json(orient='records')
        else:
            weatherDict[sec] = "No Warnings"
            return weatherDict[sec]
    else:
        return "No Data"
```

The getWeatherData function pulls the snow depth file and reads the data as a multidimensional array.

<code>weatherGridData = rasterio.open(weatherRawData.loc[sec].data).read(1)</code>

where <code>weatherRawData.loc[sec].data</code> gives you the filename.

Now we need to extract the grid cells whose snow depth is 4 inches or more. We can do this with the NumPy where function.

<code>warning = np.where(weatherGridData>=warningSnowDepth)</code>

Let's see a quick example with np.where.

In [None]:
import numpy as np
array = np.asarray([10,5,6,23,34,2,1])
np.where(array>5)

Note that np.where returns the indices where the value is greater than 5, not the values themselves. To get the values, index the array with the result:

In [None]:
array[np.where(array>5)]

Then we need to retrieve the matching ids (for our geometric grid) based on the indices.

<code>matchingIds = warning[0]+1+((warning[1])*weatherGridData.shape[0])</code>
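To see what this formula does, it helps to run it on a tiny array. The sketch below uses a made-up 3 x 4 raster. It assumes, as the formula implies, that grid ids start at 1 in the top left corner and run down each column before moving to the next column. Whether `nyc_grid.shp` is numbered exactly this way is taken from the formula, not checked here.

```python
import numpy as np

# Hypothetical 3-row x 4-column snow depth raster (inches).
depth = np.array([
    [1.0, 5.0, 2.0, 0.0],
    [6.0, 3.0, 9.0, 1.0],
    [2.0, 2.0, 2.0, 13.0],
])

# For a 2D array, np.where returns two arrays: row indices and column indices.
rows, cols = np.where(depth >= 4)
print(rows)  # row index of each matching cell
print(cols)  # column index of each matching cell

# Same formula as the lesson: id = row + 1 + column * number_of_rows.
n_rows = depth.shape[0]
matching_ids = rows + 1 + cols * n_rows
print(matching_ids)

# Indexing with both index arrays gives the snow depth of each matching cell.
print(depth[rows, cols])
```

In this example the cell in row 0, column 1 gets id 4: the whole first column (ids 1 to 3) is counted before the second column starts.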

Once we have calculated the matching ids, we store the ids and the corresponding values in a dataframe.

<code>weatherDict[sec] = pd.DataFrame({'id':matchingIds,'snowDepth':weatherGridData[warning]})</code>

The next important section of the code is assigning snow depth values and warnings to taxis.

```python
def getGPSData(sec):
    if sec in gpsData.index:
        currentData = pd.read_json(gpsData.loc[sec].data)
        currentData['sD'] = np.zeros(len(currentData))
        if sec in weatherDict and not isinstance(weatherDict[sec], str):
            warningGrids = weatherDict[sec]
            currentDataGeo = gpd.GeoDataFrame(currentData['id'],geometry = gpd.points_from_xy(currentData.lng,currentData.lat),crs='EPSG:4326')
            wGrid = weatherGrid.merge(warningGrids,on='id')
            matches = currentDataGeo.sjoin(wGrid,predicate="within")
            currentData.loc[currentData.id.isin(matches.id_left),"sD"] = matches.snowDepth
        return currentData.to_json(orient="records")+">>>currentGPSTime:"+str(max(gpsData.index))
    return "No Data"  
```

Here <code>currentData</code> contains the GPS data for all the taxis at a particular second.

Then we add a new attribute to our currentData dataframe and initialize it with zero. This attribute will carry the snow depth value. 

<code>currentData['sD'] = np.zeros(len(currentData))</code>

We retrieve the grid cells with a snow depth of 4 inches or more, which we calculated previously.

<code>warningGrids = weatherDict[sec]</code>

Then we create a GeoDataFrame for our currentData.

<code>currentDataGeo = gpd.GeoDataFrame(currentData['id'],geometry = gpd.points_from_xy(currentData.lng,currentData.lat),crs='EPSG:4326')</code>

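The warning cells are then merged with the grid geometries (<code>wGrid = weatherGrid.merge(warningGrids,on='id')</code>), which brings us to the critical section: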

<code>matches = currentDataGeo.sjoin(wGrid,predicate="within")</code>

Here we use the sjoin method to check which taxis are within a grid cell. This operation is called a spatial join. We have already seen spatial joins in earlier chapters, but here is a small diagrammatic illustration as a reminder.

![sample_sjoin](supplementary/images/sample_sjoin.jpg)
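Conceptually, the spatial join asks of every taxi: "which warning cell, if any, contains this point?" The plain Python sketch below spells out that question for rectangular cells. It is not how geopandas does it (geopandas relies on a spatial index rather than a nested loop), and the function name `join_within`, the coordinates, and the ids are all made up for illustration.

```python
# Hypothetical warning cells: id -> (min_lng, min_lat, max_lng, max_lat, snow depth in inches).
cells = {
    1: (0.0, 0.0, 1.0, 1.0, 5.0),
    2: (1.0, 0.0, 2.0, 1.0, 9.0),
}
# Hypothetical taxis: id -> (lng, lat).
taxis = {"a": (0.5, 0.5), "b": (1.5, 0.2), "c": (3.0, 3.0)}


def join_within(taxis, cells):
    """Pair each taxi with the cell that strictly contains it, if any."""
    matches = {}
    for taxi_id, (lng, lat) in taxis.items():
        for cell_id, (x0, y0, x1, y1, depth) in cells.items():
            # "within" excludes points lying exactly on the cell boundary.
            if x0 < lng < x1 and y0 < lat < y1:
                matches[taxi_id] = (cell_id, depth)
                break
    return matches


print(join_within(taxis, cells))
```

Taxi "c" lies outside every warning cell, so it does not appear in the result, just as unmatched taxis are absent from `matches` in the lesson.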

Finally, using the spatially joined dataset, we can assign a snow depth value to each of the taxi records.

<code>currentData.loc[currentData.id.isin(matches.id_left),"sD"] = matches.snowDepth</code>
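This assignment works because pandas aligns the right-hand Series with the selected rows by index label, and the join result keeps the row labels of the taxi frame. The sketch below reproduces that behaviour with two small, made-up frames in plain pandas. The frame `matches` is built by hand here to stand in for the output of sjoin.

```python
import pandas as pd

# Hypothetical taxi records for one second; sD starts at zero, as in the lesson.
current = pd.DataFrame({"id": [101, 102, 103, 104], "sD": [0.0, 0.0, 0.0, 0.0]})

# Hypothetical join result: only the taxis at row labels 1 and 3 matched a warning cell.
matches = pd.DataFrame({"id_left": [102, 104], "snowDepth": [5.0, 13.0]}, index=[1, 3])

# The right-hand Series is aligned on row labels, so each depth lands on its own taxi.
current.loc[current.id.isin(matches.id_left), "sD"] = matches.snowDepth
print(current)
```

Taxis that matched no warning cell keep their initial value of zero, which the map later draws in green.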

So that wraps up our session on Spatial Big Data (Variety).

<font size="+1"><a style="background-color:blue;color:white;padding:12px;margin:10px;font-weight:bold;" 
href="bigdata-4.ipynb">Click here to go to the next notebook.</a></font>