# Wrangle

This exercise uses the case.csv, dept.csv, and source.csv files from the San Antonio 311 call dataset.

### Part 1

__Read the case, department, and source data into their own spark dataframes.__

In [3]:
import numpy as np
import pandas as pd
import pyspark

In [4]:
#Create (or reuse) the Spark session
spark = pyspark.sql.SparkSession.builder.getOrCreate()

In [11]:
#Create the case spark df
case_df = spark.read.csv("case.csv", inferSchema = True, header = True)

In [23]:
case_df.show(2, truncate = False, vertical = True)

-RECORD 0----------------------------------------------------
 case_id              | 1014127332                           
 case_opened_date     | 1/1/18 0:42                          
 case_closed_date     | 1/1/18 12:29                         
 SLA_due_date         | 9/26/20 0:42                         
 case_late            | NO                                   
 num_days_late        | -998.5087616000001                   
 case_closed          | YES                                  
 dept_division        | Field Operations                     
 service_request_type | Stray Animal                         
 SLA_days             | 999.0                                
 case_status          | Closed                               
 source_id            | svcCRMLS                             
 request_address      | 2315  EL PASO ST, San Antonio, 78207 
 council_district     | 5                                    
-RECORD 1----------------------------------------------------
 case_id

In [15]:
#Create the dept spark df
dept_df = spark.read.csv('dept.csv', header = True, inferSchema = True)

In [17]:
dept_df.show(5)

+--------------------+--------------------+----------------------+-------------------+
|       dept_division|           dept_name|standardized_dept_name|dept_subject_to_SLA|
+--------------------+--------------------+----------------------+-------------------+
|     311 Call Center|    Customer Service|      Customer Service|                YES|
|               Brush|Solid Waste Manag...|           Solid Waste|                YES|
|     Clean and Green|Parks and Recreation|    Parks & Recreation|                YES|
|Clean and Green N...|Parks and Recreation|    Parks & Recreation|                YES|
|    Code Enforcement|Code Enforcement ...|  DSD/Code Enforcement|                YES|
+--------------------+--------------------+----------------------+-------------------+
only showing top 5 rows



In [18]:
#Create the source spark df
source_df = spark.read.csv('source.csv', header = True, inferSchema = True)

In [19]:
source_df.show(5)

+---------+----------------+
|source_id| source_username|
+---------+----------------+
|   100137|Merlene Blodgett|
|   103582|     Carmen Cura|
|   106463| Richard Sanchez|
|   119403|  Betty De Hoyos|
|   119555|  Socorro Quiara|
+---------+----------------+
only showing top 5 rows



__Write the code necessary to store the source data in both csv and json format, store these as sources_csv and sources_json__

In [20]:
#Save as csv
source_df.write.csv('sources_csv', mode = 'overwrite')

In [21]:
#Save as json
source_df.write.json('sources_json', mode = 'overwrite')

__Inspect the data in your dataframes. Are the data types appropriate? Write the code necessary to cast the values to the appropriate types.__

In [22]:
#Check case_df
case_df.printSchema()

root
 |-- case_id: integer (nullable = true)
 |-- case_opened_date: string (nullable = true)
 |-- case_closed_date: string (nullable = true)
 |-- SLA_due_date: string (nullable = true)
 |-- case_late: string (nullable = true)
 |-- num_days_late: double (nullable = true)
 |-- case_closed: string (nullable = true)
 |-- dept_division: string (nullable = true)
 |-- service_request_type: string (nullable = true)
 |-- SLA_days: double (nullable = true)
 |-- case_status: string (nullable = true)
 |-- source_id: string (nullable = true)
 |-- request_address: string (nullable = true)
 |-- council_district: integer (nullable = true)



Things to change:
* I think case_id should be a string, not an integer
* case_opened_date should be a datetime object
* case_closed_date should be a datetime object
* SLA_due_date should be a datetime object
* case_late should be a boolean
* num_days_late should be an int
* case_closed should be a boolean
* SLA_days should be an int
* Although there's nothing inherently wrong with request_address, I would like to go ahead and create new columns for city and zip code.
* council_district should be a string.

In [24]:
#Rename SLA_due_date to case_due_date to match the other column names
case_df = case_df.withColumnRenamed('SLA_due_date', 'case_due_date')

In [26]:
from pyspark.sql.functions import col

#Cast case_id and council_district as strings
case_df = (
    case_df.withColumn('case_id', col('case_id').cast('string'))
    .withColumn('council_district', col('council_district').cast('string'))
)

In [30]:
from pyspark.sql.functions import to_timestamp

#Convert the date strings to timestamps
#The raw values look like "1/1/18 0:42": month/day/two-digit year, then 24-hour time
fmt = "M/d/yy H:mm"
case_df = (
    case_df.withColumn('case_opened_date', to_timestamp('case_opened_date', fmt))
    .withColumn('case_closed_date', to_timestamp('case_closed_date', fmt))
    .withColumn('case_due_date', to_timestamp('case_due_date', fmt))
)
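As a quick sanity check outside Spark, the same layout can be parsed with Python's standard library. This is only a sketch: `strptime` uses different format codes, and `%m/%d/%y %H:%M` is assumed here to be the counterpart of the Spark pattern `M/d/yy H:mm`. The two sample strings are copied from the first case record shown earlier.

```python
from datetime import datetime

# Assumed strptime counterpart of the Spark pattern "M/d/yy H:mm"
py_fmt = "%m/%d/%y %H:%M"

# Sample values copied from the first case record
opened = datetime.strptime("1/1/18 0:42", py_fmt)
due = datetime.strptime("9/26/20 0:42", py_fmt)

print(opened.isoformat())
print(due.isoformat())
```

If both values parse to the expected year, month, and day, the month-first reading of the raw strings is at least plausible; it does not prove that every row in the file follows the same layout.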

In [33]:
from pyspark.sql.functions import expr

#Convert case_closed and case_late to booleans
#The raw values are upper-case 'YES'/'NO', and the string comparison is case-sensitive
case_df = (
    case_df.withColumn('case_closed', expr("case_closed == 'YES'"))
    .withColumn('case_late', expr("case_late == 'YES'"))
)

In [34]:
#Convert num_days_late and SLA_days to ints
case_df = (
    case_df.withColumn('num_days_late', col('num_days_late').cast('integer'))
    .withColumn('SLA_days', col('SLA_days').cast('integer'))
)
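Casting a double to an integer drops the fractional part rather than rounding it. The plain-Python sketch below mirrors that on the `num_days_late` value from the sample record. It assumes Spark's cast truncates toward zero the way Python's `int()` does, so it is worth confirming on the real column if fractional days matter for later questions.

```python
import math

# num_days_late from the first case record shown earlier
raw = -998.5087616000001

truncated = int(raw)       # drops the fraction, moving toward zero
floored = math.floor(raw)  # always rounds down
rounded = round(raw)       # rounds to the nearest integer

print(truncated, floored, rounded)
```

For negative values the three results differ, so a case that is about 998.5 days early would be stored as 998 days early after truncation.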

In [62]:
from pyspark.sql.functions import regexp_extract, trim, lower

#Strip all leading and trailing whitespace from the request_address and convert to lowercase
case_df = case_df.withColumn('request_address', trim(lower(case_df.request_address)))

#Now create new columns for city and zip code
case_df = (
    case_df.withColumn('zip_code', regexp_extract('request_address', r'\d{5}$', 0))
    .withColumn('city', regexp_extract('request_address', r', (.*),', 1))
)
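The extraction patterns can be tried on a single address with Python's `re` module before running them over the whole dataframe. This is a rough sketch, not Spark itself: Spark evaluates Java regular expressions, which are assumed to behave the same as Python's for patterns this simple. The `extract` helper and the `no_zip` address are hypothetical, written only for this illustration; the helper returns an empty string on no match, as `regexp_extract` does.

```python
import re

def extract(pattern, text, group=0):
    """Return the matched group, or "" when nothing matches."""
    match = re.search(pattern, text)
    return match.group(group) if match else ""

# Sample address from the first case record, after trim + lower
address = "2315  el paso st, san antonio, 78207"

city = extract(r", (.*),", address, 1)
zip_digits = extract(r"\d{5}$", address)

# Hypothetical address with no ZIP code, to compare \w against \d
no_zip = "100 main st, san antonio"
zip_word_class = extract(r"\w{5}$", no_zip)
zip_digit_class = extract(r"\d{5}$", no_zip)

print(city, zip_digits, repr(zip_word_class), repr(zip_digit_class))
```

The comparison on the second address shows why the character class matters: `\w` also accepts letters, so it can pick up the tail of a city name when the ZIP code is missing, while `\d` leaves the field empty.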

In [65]:
from pyspark.sql.functions import format_string

#Format the council district string so that there are leading 0s
case_df = case_df.withColumn('council_district', format_string('%03d', col('council_district').cast('integer')))

In [68]:
#Now check the schema for the dept_df
dept_df.printSchema()

root
 |-- dept_division: string (nullable = true)
 |-- dept_name: string (nullable = true)
 |-- standardized_dept_name: string (nullable = true)
 |-- dept_subject_to_SLA: string (nullable = true)



Things to Change:
* dept_subject_to_SLA should be a boolean

In [69]:
#Convert dept_subject_to_SLA to a boolean (the raw values are upper-case 'YES'/'NO')
dept_df = dept_df.withColumn('dept_subject_to_SLA', expr("dept_subject_to_SLA == 'YES'"))

In [71]:
#Now check the schema for source_df
source_df.printSchema()

root
 |-- source_id: string (nullable = true)
 |-- source_username: string (nullable = true)



Things to Change:
* Nothing. These datatypes seem fine.

### Part 2

__1) How old is the latest (in terms of days past SLA) currently open issue? How long has the oldest (in terms of days since opened) currently open issue been open?__