# Selecting with Spark Lab

### Introduction

In this lesson, we'll work with dataframes to explore a dataset.  Let's get started.

### Getting Set Up (For Google Colab)

> If we are running this on Google Colab, we can run the following steps so that we can later open our Spark UI.

* Begin by installing some pip packages and the Java Development Kit.

In [None]:
!pip install pyspark --quiet
!pip install -U -q PyDrive --quiet 
!apt install openjdk-8-jdk-headless &> /dev/null

* Then set the `JAVA_HOME` environment variable.

In [None]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"

* Then create a SparkSession, setting the Spark UI port to `4050`.

In [None]:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("civComplaints") \
    .config("spark.ui.port", "4050") \
    .getOrCreate()

* Then we need to install ngrok, which will allow us to expose our local Spark UI on the web.

In [None]:
!wget https://bin.equinox.io/c/4VmDzA7iaHb/ngrok-stable-linux-amd64.zip &> /dev/null
!unzip ngrok-stable-linux-amd64.zip &> /dev/null
get_ipython().system_raw('./ngrok http 4050 &')

* And finally, we get a link to our Spark UI.

In [None]:
!curl -s http://localhost:4040/api/tunnels | python3 -c \
    "import sys, json; print(json.load(sys.stdin)['tunnels'][0]['public_url'])"
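
The `curl` call above talks to ngrok's own local API, which listens on port `4040` by default. That is also Spark's default UI port, which is why the Spark UI was moved to `4050` earlier. As a minimal sketch of what the one-liner does, the snippet below parses a response body and pulls out the first tunnel's public URL. The `first_public_url` helper and the sample payload are hypothetical, trimmed to the fields the one-liner reads.

```python
import json


def first_public_url(tunnels_json: str) -> str:
    """Return the public URL of the first tunnel in an ngrok /api/tunnels response."""
    payload = json.loads(tunnels_json)
    return payload["tunnels"][0]["public_url"]


# Hypothetical response body (a real response carries more fields per tunnel).
sample = '{"tunnels": [{"public_url": "https://example.ngrok.io", "proto": "https"}]}'
print(first_public_url(sample))
```

If no tunnel has started yet, the `tunnels` list is empty and the lookup raises an `IndexError`, so rerunning the cell after a few seconds is usually enough.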

### Loading the Data

And now let's open up the Spark UI, so we can see what's occurring as we use Spark.  If we are running in Google Colab, click the link above.  Or, if running on your local computer, type in `spark`, and we'll see a link to our Spark UI.

In [None]:
spark

Ok, with our Spark UI opened in a new tab, it's time to begin loading our data.  We'll first load up our data into a pandas dataframe.

In [None]:
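import pandas as pd

# Reading an s3:// URL with pandas requires the s3fs package to be installed.
url = "s3://jigsaw-labs/civ_complaints.csv"
# Cast every column to string so all columns share one type.
df = pd.read_csv(url).astype(str)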

And now, we'll need to convert this into a Spark dataframe.

In [None]:
complaints_df = None

In [None]:
complaints_df
# DataFrame[Extract Run Date: string, Randomized Id: string, 
# CCRB Received Year: string, Days Between Incident Date and Received Date: string,
# Case Type: string, Complaint Received Place: string, 
# Complaint Received Mode: string, Borough Of Incident: string, 
# Patrol Borough Of Incident: string, Reason For Initial Contact: string]


So we can begin to see some of the columns that we have in our dataset.

### Exploring Data

Let's get a better sense by displaying the schema.

root
 |-- Extract Run Date: string (nullable = true)
 |-- Randomized Id: string (nullable = true)
 |-- CCRB Received Year: string (nullable = true)
 |-- Days Between Incident Date and Received Date: string (nullable = true)
 |-- Case Type: string (nullable = true)
 |-- Complaint Received Place: string (nullable = true)
 |-- Complaint Received Mode: string (nullable = true)
 |-- Borough Of Incident: string (nullable = true)
 |-- Patrol Borough Of Incident: string (nullable = true)
 |-- Reason For Initial Contact: string (nullable = true)



Let's also display the first two records of our dataset, and set vertical to `True`.

-RECORD 0------------------------------------------------------------
 Extract Run Date                             | 05/25/2018           
 Randomized Id                                | 1                    
 CCRB Received Year                           | 2000                 
 Days Between Incident Date and Received Date | 2.0                  
 Case Type                                    | IAB                  
 Complaint Received Place                     | CCRB                 
 Complaint Received Mode                      | Phone                
 Borough Of Incident                          | Bronx                
 Patrol Borough Of Incident                   | Bronx                
 Reason For Initial Contact                   | PD suspected C/V ... 
-RECORD 1------------------------------------------------------------
 Extract Run Date                             | 05/25/2018           
 Randomized Id                                | 2                    
 CCRB Received Year 

Now take a look at the Spark UI.  View the most recent job, and click on the Stage to get a more detailed view of the steps in the stage.

> Notice that the first step was `readRDDFromFile`, so our data was loaded in this stage, and then later on we see an `applySchemaToPythonRDD`, where it seems that our columns were formatted.  Only after these steps were the first few rows printed.

> <img src="https://github.com/jigsawlabs-student/pyspark-dataframes/blob/main/dag_viz_print.png?raw=1" width="40%">

Also, if we take a look at the tasks, we can see that only one of the cores was used for this call.

> <img src="https://github.com/jigsawlabs-student/pyspark-dataframes/blob/main/total_tasks.png?raw=1" width="90%">

### Viewing Columns

Ok, so now let's narrow down our data by displaying just a couple of columns.  We can see which columns are available by viewing the `columns` attribute of our dataframe.

In [None]:


# ['Extract Run Date',
#  'Randomized Id',
#  'CCRB Received Year',
#  'Days Between Incident Date and Received Date',
#  'Case Type',
#  'Complaint Received Place',
#  'Complaint Received Mode',
#  'Borough Of Incident',
#  'Patrol Borough Of Incident',
#  'Reason For Initial Contact']


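And from here, let's select just the `Borough of Incident` column and display the first 3 results.  (Spark resolves column names case-insensitively by default, so `Borough of Incident` matches the `Borough Of Incident` column.)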

In [None]:

# +-------------------+
# |Borough of Incident|
# +-------------------+
# |              Bronx|
# |           Brooklyn|
# |             Queens|
# +-------------------+
# only showing top 3 rows





Ok, now let's select both the `Borough of Incident` and the `Reason For Initial Contact` columns, and display the first three rows.

In [None]:
complaints_df.select(['Borough of Incident', 'Reason For Initial Contact']).show(3)

+-------------------+--------------------------+
|Borough of Incident|Reason For Initial Contact|
+-------------------+--------------------------+
|              Bronx|      PD suspected C/V ...|
|           Brooklyn|         Parking violation|
|             Queens|                       nan|
+-------------------+--------------------------+
only showing top 3 rows



### Selecting Rows

Now that we've practiced selecting our columns, let's also select our rows.  So to begin, select all of the complaints located in `Brooklyn`, and display the first two results vertically.

In [None]:


# -RECORD 0---------------------------------------------------------
#  Extract Run Date                             | 05/25/2018        
#  Randomized Id                                | 2                 
#  CCRB Received Year                           | 2000              
#  Days Between Incident Date and Received Date | 86.0              
#  Case Type                                    | OCD               
#  Complaint Received Place                     | Other NYPD unit   
#  Complaint Received Mode                      | In-person         
#  Borough Of Incident                          | Brooklyn          
#  Patrol Borough Of Incident                   | Brooklyn North    
#  Reason For Initial Contact                   | Parking violation 
# -RECORD 1---------------------------------------------------------
#  Extract Run Date                             | 05/25/2018        
#  Randomized Id                                | 6                 
#  CCRB Received Year                           | 2000              
#  Days Between Incident Date and Received Date | 1.0               
#  Case Type                                    | CCRB              
#  Complaint Received Place                     | CCRB              
#  Complaint Received Mode                      | Phone             
#  Borough Of Incident                          | Brooklyn          
#  Patrol Borough Of Incident                   | Brooklyn South    
#  Reason For Initial Contact                   | Other             
# only showing top 2 rows


Ok, let's say we only care about the `Reason For Initial Contact` in Brooklyn.  So this time, select only the incidents that occurred in Brooklyn, and only select the `Borough of Incident` and `Reason For Initial Contact` columns.  Display the first 5 results.

In [None]:


# +-------------------+--------------------------+
# |Borough of Incident|Reason For Initial Contact|
# +-------------------+--------------------------+
# |           Brooklyn|         Parking violation|
# |           Brooklyn|                     Other|
# |           Brooklyn|      Other violation o...|
# |           Brooklyn|                       nan|
# |           Brooklyn|      PD suspected C/V ...|
# +-------------------+--------------------------+
# only showing top 5 rows




Finally, note that our list of columns includes a `Randomized Id` column.

In [None]:
complaints_df.columns

['Extract Run Date',
 'Randomized Id',
 'CCRB Received Year',
 'Days Between Incident Date and Received Date',
 'Case Type',
 'Complaint Received Place',
 'Complaint Received Mode',
 'Borough Of Incident',
 'Patrol Borough Of Incident',
 'Reason For Initial Contact']

So let's practice selecting a row of data by that id.  Select the record with the randomized id equal to `200`, and display the result vertically.

In [None]:


# -RECORD 0--------------------------------------------------
#  Extract Run Date                             | 05/25/2018 
#  Randomized Id                                | 200        
#  CCRB Received Year                           | 2000       
#  Days Between Incident Date and Received Date | 9.0        
#  Case Type                                    | OCD        
#  Complaint Received Place                     | IAB        
#  Complaint Received Mode                      | Phone      
#  Borough Of Incident                          | Queens     
#  Patrol Borough Of Incident                   | Other      
#  Reason For Initial Contact                   | nan   




So even though Spark does not allow us to access a record with an index, we still can filter through the records to find a match.

### Summary

In this lesson, we practiced working with a Spark dataframe: we displayed a few rows of data and then looked at the resulting DAG in the Spark UI.  From there, we practiced using various Spark methods like the following:

* `printSchema` to display the schema
* `columns` to list just the columns
* `select` to select specific columns
* `df[df[column] == 'value']` to select specific rows

<a href="3_Spark_RDDs_Lab.ipynb" style="background-color:blue;color:white;padding:10px;margin:2px;font-weight:bold;">Next Notebook</a>
