## Big Data

### Segment 3 of 3
# Exploration

<i>Lesson Developers: </i>
<ul>
    <li>
    <i>Edwin Chow, chow@txstate.edu</i>
    </li>
    <li>
    <i>Jayakrishnan Ajayakumar, jxa421@case.edu</i>
    </li>
</ul>


In [None]:
# This code cell starts the necessary setup for Hour of CI lesson notebooks.
# First, it enables users to hide and unhide code by producing a 'Toggle raw code' button below.
# Second, it imports the hourofci package, which is necessary for lessons and interactive Jupyter Widgets.
# Third, it helps hide/control other aspects of Jupyter Notebooks to improve the user experience
# This is an initialization cell
# It is not displayed because the Slide Type is 'Skip'

from IPython.display import HTML, IFrame, Javascript, display
from ipywidgets import interactive
import ipywidgets as widgets
from ipywidgets import Layout
import pandas as pd
import queue
import threading
import time

import getpass # This library allows us to get the username (User agent string)

# import package for hourofci project
import sys
sys.path.append('../../supplementary') # relative path (may change depending on the location of the lesson notebook)
import hourofci


# load javascript to initialize/hide cells, get user agent string, and hide output indicator
# hide code by introducing a toggle button "Toggle raw code"
HTML(''' 
    <script type="text/javascript" src=\"../../supplementary/js/custom.js\"></script>
    
    <style>
        .output_prompt{opacity:0;}
    </style>
    
    <input id="toggle_code" type="button" value="Toggle raw code">
''')

## Reminder
<a href="#/slide-2-0" class="navigate-right" style="background-color:blue;color:white;padding:8px;margin:2px;font-weight:bold;">Continue with the lesson</a>

<br>
</br>
<font size="+1">

By continuing with this lesson you are granting your permission to take part in this research study for the Hour of Cyberinfrastructure: Developing Cyber Literacy for GIScience project. In this study, you will be learning about cyberinfrastructure and related concepts using a web-based platform that will take approximately one hour per lesson. Participation in this study is voluntary.

Participants in this research must be 18 years or older. If you are under the age of 18 then please exit this webpage or navigate to another website such as the Hour of Code at https://hourofcode.com, which is designed for K-12 students.

If you are not interested in participating please exit the browser or navigate to this website: http://www.umn.edu. Your participation is voluntary and you are free to stop the lesson at any time.

For the full description please navigate to this website: <a href="../../gateway-lesson/gateway/gateway-1.ipynb">Gateway Lesson Research Study Permission</a>.

</font>

If you have a large volume of Spatial Big Data (SBD) that is essentially static, or is updated only once every few days, months, or years, the processing challenges can usually be met with adequate computational resources. For example, the crime data from the City of Chicago has more than 7 million records, but it is not updated in real time (in fact, it is updated weekly). So we have a buffer of 7 days to process the incoming data and make sense of it. 

But the real computational challenge associated with Spatial Big Data arises when the data is updated in real time (every few milliseconds, seconds, or minutes). Now we have to process a large volume of data in real time. We have already seen some examples in the Introduction session. 

One of the biggest challenges of handling SBD, or any Big Data (BD) with high velocity, is that processing has to keep pace with the enormous amount of data being generated in real time. If data processing cannot keep up with data ingestion, the end result is <span STYLE="font-size:18.0pt;color:black">Data Loss</span>.

<img src = "supplementary/firehose.jpg" width = 50%>
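
The link between ingestion speed and data loss can be made concrete with a small simulation. The sketch below is only an illustration and is not part of the lesson's own code: the function name `simulate_ingestion`, the buffer size, and the rates are all assumptions chosen for the example. Records arrive in fixed-size batches, wait in a bounded buffer, and are dropped when the buffer is full.

```python
import queue

def simulate_ingestion(ticks, produced_per_tick, consumed_per_tick, buffer_size):
    """Return (processed, dropped) counts for a simple ingest/process loop.

    All names and numbers here are hypothetical and exist only for illustration.
    """
    buffer = queue.Queue(maxsize=buffer_size)
    processed = 0
    dropped = 0
    for tick in range(ticks):
        # Ingestion: new records arrive whether or not there is room for them.
        for i in range(produced_per_tick):
            try:
                buffer.put_nowait((tick, i))
            except queue.Full:
                dropped += 1  # the buffer is full, so this record is lost
        # Processing: handle at most consumed_per_tick records in this tick.
        for _ in range(consumed_per_tick):
            if buffer.empty():
                break
            buffer.get_nowait()
            processed += 1
    return processed, dropped

# Processing rate matches the ingestion rate: nothing is lost.
fast = simulate_ingestion(ticks=10, produced_per_tick=5, consumed_per_tick=5, buffer_size=10)
# Processing is slower than ingestion: the buffer fills up and records are dropped.
slow = simulate_ingestion(ticks=10, produced_per_tick=5, consumed_per_tick=3, buffer_size=10)

print("processing keeps up (processed, dropped):", fast)
print("processing falls behind (processed, dropped):", slow)
```

A larger buffer only delays the first dropped record. As long as processing stays slower than ingestion, the backlog keeps growing until the buffer overflows.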

Let's check out the Velocity aspect of SBD through some real-world examples.


## An hour in NYC (From a Yellow Cab Perspective)

![yellow_taxi](supplementary/yellow_taxi.jpg)
Yellow taxicabs are widely recognized as a symbol of NYC. There are roughly 13,587 yellow cabs in NYC. The NYC Taxi and Limousine Commission (TLC) releases Trip Record Data (https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page). The trip record data includes pick-up and drop-off dates/times, pick-up and drop-off locations (the spatial component), trip distances, itemized fares, rate types, payment types, and driver-reported passenger counts.

The taxi dataset you are going to use for this lesson is an extract from the entire dataset. It contains the trip records for one hour. While the taxi records only contain the pick-up and drop-off locations, we have generated a GPS trajectory dataset for each taxi based on its pick-up and drop-off locations. 

Let's get into action with our first experiment.


### Experiment I (Taxis Taxis Everywhere!!!)

Let's explore the dataset first.

Our data is in Parquet format, a columnar storage format that is highly optimized for storing large amounts of data.

In [None]:
import pandas as pd
import queue
import threading
import time

data = pd.read_parquet(r'supplementary/taxi1hr_gps.parquet')
data

Even for one hour of taxi trajectory data, there are 10,891,783 coordinates. The `id` is unique for each taxi, and `sec` is the number of seconds elapsed since the start of the simulation.
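
To get a feel for the velocity this implies, the record count can be turned into an average arrival rate. The short calculation below is a back-of-the-envelope sketch that uses only the count stated above and assumes the records are spread evenly over the hour, which the real data need not be. The variable names are illustrative.

```python
total_points = 10_891_783   # GPS coordinates in the one-hour extract (count stated in the lesson text)
seconds = 60 * 60           # one hour, in seconds

# Average arrival rate, assuming the records are spread evenly over the hour.
points_per_second = total_points / seconds
print(f"{points_per_second:,.0f} coordinates per second on average")

# Hypothetical processing budget: the time available per coordinate if a
# single worker had to keep pace with the stream.
budget_microseconds = 1_000_000 / points_per_second
print(f"about {budget_microseconds:.0f} microseconds per coordinate")
```

Any processing step that takes longer than this budget per record would fall behind a live stream arriving at this rate.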

You can click on the Start Simulation Button to see the taxis in action!!!

In [None]:
class GPSThread(threading.Thread):
    # the thread shares two objects with the main thread: a dict that receives the GPS data
    # for each second, and a one-element status list (1 = running, 0 = stop requested, 2 = finished)
    def __init__(self, dataDict,status):
        threading.Thread.__init__(self)
        self.dataDict = dataDict
        self.status = status
    def run(self):
        #load the data file
        data = pd.read_parquet(r'supplementary/taxi1hr_gps.parquet')
        #create an index based on seconds 
        data.set_index('sec',inplace=True)
        #now we need to loop through the dataset (max()+1 so the last second is included)
        for sec in range(data.index.min(),data.index.max()+1):
            #kill switch: the main thread sets status[0] to 0 to request a stop
            if self.status[0]==0:
                break
            dat = data.loc[sec]
            self.dataDict[sec] = dat[['id','lng','lat']]
            #after one iteration sleep for a second......
            time.sleep(1)
        #the simulation is over (or was stopped): set the status to 2 so getData reports "sim over"
        self.status[0] = 2
datDict = {}
status = [-1]
def start():
    global datDict
    global status
    datDict = {}
    status[0] = 1
    gpsThread = GPSThread(datDict,status)
    gpsThread.start()
    #small delay for the thread to startup
    time.sleep(.5)
    return "started"
def getData(sec):
    if status[0] == 2 or status[0] == 0:
        return "sim over"
    if sec in datDict:
        return datDict.pop(sec).to_json(orient="records")
    return "No Data"   
def stop():
    global status
    status[0] = 0
    return "stopping"

In [None]:
%%html
<link rel="stylesheet" href="https://unpkg.com/leaflet@1.8.0/dist/leaflet.css"/>
<div id="main">
    <div id="map"></div>
    <div id="controls">
        <button id="start_button" class ="controlbuttons" onclick = "run()">Start Simulation</button>
        <button id="stop_button" class ="controlbuttons" onclick = "stop()" disabled>Stop Simulation</button>
        <span id="active_cars" class ="controlbuttons">Active Cars:0</span>
    </div>
</div>
<style>
#main { height: 500px;width:800px; }
#map { height: 75%;width:100%; }
#controls { height: 20%;margin-top:4%; }
.controlbuttons{float:left;margin:1%;}
</style>
<script>
    var map,datInterval,current;
    require.config({ 
        paths: { 
             d3: 'https://d3js.org/d3.v7.min',
             L: 'https://unpkg.com/leaflet@1.8.0/dist/leaflet'
        }});
    
    function fetch(){
        IPython.notebook.kernel.execute(
            "getData("+current+")", 
            {
                iopub: {
                    output: function(response) {
                        // Print the return value of the Python code to the console
                        var dataString = response.content.data['text/plain'];
                        if (dataString.includes("sim over")){
                            console.log('Time to clean up');
                            require(['d3'], function(d3) { 
                                d3.select("#map")
                                        .select("svg")
                                        .selectAll("circle").remove();
                            });
                            clearInterval(datInterval)
                            current=0;
                            //enable the start button
                            document.getElementById("start_button").disabled = false;
                            document.getElementById("stop_button").disabled = true;
                            document.getElementById("active_cars").innerText = "Active Cars:0";
                        }
                        else if (dataString.includes("No Data")){
                            console.log('No data for '+current)
                        }
                        else{
                            require(['d3'], function(d3) { 
                                var data = JSON.parse(dataString.slice(1,dataString.length-1));
                                document.getElementById("active_cars").innerText = "Active Cars: "+data.length;
                                d3.select("#map")
                                    .select("svg")
                                    .selectAll("circle")
                                    .data(data, d => d.id)
                                    .join(
                                        enter => enter.append('circle')
                                        .attr("cx", d => map.latLngToLayerPoint([d.lat,d.lng]).x)
                                        .attr("cy", d => map.latLngToLayerPoint([d.lat,d.lng]).y)
                                        .attr("r", 2)
                                        .style("fill", "green")
                                        .attr("stroke", "green")
                                        .attr("stroke-width", 1)
                                        .attr("fill-opacity", 1) 
                                        .selection(),

                                        update => update
                                        .attr("cx", d => map.latLngToLayerPoint([d.lat,d.lng]).x)
                                        .attr("cy", d => map.latLngToLayerPoint([d.lat,d.lng]).y)
                                        .selection(),

                                        exit => exit
                                        .remove()
                                    )
                            });
                            current+=1;
                        }
                    }
                }
            },
            {
                silent: false, 
                store_history: false, 
                stop_on_error: true
            }
        );
    }

    function update() {
        require(['d3'], function(d3) { 
            d3.selectAll("circle")
            .attr("cx", d => map.latLngToLayerPoint([d.lat,d.lng]).x)
            .attr("cy", d => map.latLngToLayerPoint([d.lat,d.lng]).y)
        });
    } 
    
    function stop(){
        IPython.notebook.kernel.execute(
            "stop()", 
            {
                iopub: {
                    output: function(response) {
                    }
                }
            },
            {
                silent: false, 
                store_history: false, 
                stop_on_error: true
            }
        )
    }
    
    function run(){
        current = 0;
        IPython.notebook.kernel.execute(
            "start()", 
            {
                iopub: {
                    output: function(response) {
                        datInterval = setInterval(fetch, 1000);
                        //disable the start button
                        document.getElementById("start_button").disabled = true;
                        document.getElementById("stop_button").disabled = false;
                    }
                }
            },
            {
                silent: false, 
                store_history: false, 
                stop_on_error: true
            }
        )
    }
    
    require(['d3','L'], function(d3,L) { 
        map = L
            .map('map')
            .setView([40.763231753511604, -73.98383956127027], 10);   // center position + zoom

        // Add a tile to the map = a background. Comes from OpenStreetmap
        L.tileLayer('https://tile.openstreetmap.org/{z}/{x}/{y}.png', {
            maxZoom: 19,
            attribution: '© OpenStreetMap'
        }).addTo(map);
        // Add a svg layer to the map
        L.svg().addTo(map);
        map.on("moveend", update)
    });
    
    
</script>

So we can see our Yellow Taxis (shown in green, though!) in action. 

### Explore more...
You have done the following: 
- Learned what big data is
- Learned about the 'V's of big data and their relevance to applications
- Explored some big data on the Internet
- Loaded COVID-19 data into a table using Pandas
- Parsed the data and calculated new columns 

Here are some pointers for further exploration: 
- You may have noticed that some calculations return the value "NaN". What does that mean?
- Explore more county-level COVID-19 data from the NY Times at: https://github.com/nytimes/covid-19-data
- Load the mask use data: https://github.com/nytimes/covid-19-data/tree/master/mask-use

If you are interested, feel free to check out the intermediate lesson. We will introduce more techniques to process, analyze, and visualize big data! 

# Congratulations!


**You have finished an Hour of CI!**


But, before you go ... 

1. Please fill out a very brief questionnaire to provide feedback and help us improve the Hour of CI lessons. It is fast and your feedback is very important to let us know what you learned and how we can improve the lessons in the future.
2. If you would like a certificate, then please type your name below and click "Create Certificate" and you will be presented with a PDF certificate.

<font size="+1"><a style="background-color:blue;color:white;padding:12px;margin:10px;font-weight:bold;" href="https://forms.gle/JUUBm76rLB8iYppN7">Take the questionnaire and provide feedback</a></font>

In [None]:

# This code cell loads the Interact Textbox that will ask users for their name
# Once they click "Create Certificate" then it will add their name to the certificate template
# And present them a PDF certificate
from PIL import Image
from PIL import ImageFont
from PIL import ImageDraw

from ipywidgets import interact

def make_cert(learner_name, lesson_name):
    cert_filename = 'hourofci_certificate.pdf'

    img = Image.open("../../supplementary/hci-certificate-template.jpg")
    draw = ImageDraw.Draw(img)

    cert_font   = ImageFont.truetype('../../supplementary/cruft.ttf', 150)
    cert_fontsm = ImageFont.truetype('../../supplementary/cruft.ttf', 80)
    
    _,_,w,h = cert_font.getbbox(learner_name)  
    draw.text( xy = (1650-w/2,1100-h/2), text = learner_name, fill=(0,0,0),font=cert_font)
    
    _,_,w,h = cert_fontsm.getbbox(lesson_name)
    draw.text( xy = (1650-w/2,1100-h/2 + 750), text = lesson_name, fill=(0,0,0),font=cert_fontsm)
    
    img.save(cert_filename, "PDF", resolution=100.0)   
    return cert_filename


interact_cert=interact.options(manual=True, manual_name="Create Certificate")

@interact_cert(name="Your Name")
def f(name):
    print("Congratulations",name)
    filename = make_cert(name, 'Beginner Big Data')
    print("Download your certificate by clicking the link below.")
    
    

<font size="+1"><a style="background-color:blue;color:white;padding:12px;margin:10px;font-weight:bold;" href="hourofci_certificate.pdf?download=1" download="hourofci_certificate.pdf">Download your certificate</a></font>