# Wrangle Exercises
___
### Data Acquisition

These exercises should go in a notebook or script named wrangle. Add, commit, and push your changes.

This exercises uses the `case.csv`, `dept.csv`, and `source.csv` files from the san antonio 311 call dataset.

#### 1. Read the case, department, and source data into their own spark dataframes.

#### 2. Let's see how writing to the local disk works in spark:
   - Write the code necessary to store the source data in both csv and json format, store these as `sources_csv` and `sources_json`
   - Inspect your folder structure. What do you notice?
   
#### 3. Inspect the data in your dataframes. Are the data types appropriate? Write the code necessary to cast the values to the appropriate types.

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import *

spark = SparkSession.builder.getOrCreate()

Read the case, department, and source data into their own spark dataframes.

In [2]:
case = spark.read.csv("case.csv", sep=",", header=True, inferSchema=True)
dept = spark.read.csv("dept.csv", sep=",", header=True, inferSchema=True)
source = spark.read.csv("source.csv", sep=",", header=True, inferSchema=True)

Write the code necessary to store the source data in both `csv` and `json` format, store these as `sources_csv` and `sources_json`

In [3]:
source.write.json("data/source_json", mode="overwrite")
source.write.csv("data/source_csv", mode="overwrite")

Inspect your folder structure. What do you notice?
- It generated a directory (`data`) with the 2 formats (`csv` & `json`) in 2 seperate folders.

Inspect the data in your dataframes. Are the data types appropriate? Write the code necessary to cast the values to the appropriate types.

In [4]:
case.dtypes

[('case_id', 'int'),
 ('case_opened_date', 'string'),
 ('case_closed_date', 'string'),
 ('SLA_due_date', 'string'),
 ('case_late', 'string'),
 ('num_days_late', 'double'),
 ('case_closed', 'string'),
 ('dept_division', 'string'),
 ('service_request_type', 'string'),
 ('SLA_days', 'double'),
 ('case_status', 'string'),
 ('source_id', 'string'),
 ('request_address', 'string'),
 ('council_district', 'int')]

I am going to be using the sollutions from the wrangle lesson to correct the dtypes (Also in Codeup DS Curriculum).

In [5]:
# use .withColumn to change columns from string to boolean values
case = case.withColumn("case_closed", expr('case_closed == "YES"')).withColumn(
    "case_late", expr('case_late == "YES"')
)

In [6]:
# council_district as a string instead of int

case = case.withColumn("council_district", col("council_district").cast("string"))

In [7]:
# to_timestamp, fmt

fmt = "M/d/yy H:mm"

case = case.withColumnRenamed('SLA_due_date', 'case_due_date')

case = case.withColumn('case_opened_date', to_timestamp('case_opened_date', fmt))\
.withColumn('case_closed_date', to_timestamp('case_opened_date', fmt))\
.withColumn('case_due_date', to_timestamp('case_due_date', fmt))

In [8]:
case.dtypes

[('case_id', 'int'),
 ('case_opened_date', 'timestamp'),
 ('case_closed_date', 'timestamp'),
 ('case_due_date', 'timestamp'),
 ('case_late', 'boolean'),
 ('num_days_late', 'double'),
 ('case_closed', 'boolean'),
 ('dept_division', 'string'),
 ('service_request_type', 'string'),
 ('SLA_days', 'double'),
 ('case_status', 'string'),
 ('source_id', 'string'),
 ('request_address', 'string'),
 ('council_district', 'string')]

`case` is looks good in terms of dtypes now.

In [9]:
dept.dtypes

[('dept_division', 'string'),
 ('dept_name', 'string'),
 ('standardized_dept_name', 'string'),
 ('dept_subject_to_SLA', 'string')]

In [10]:
# use .withColumn to change columns from string to boolean values
dept = dept.withColumn("dept_subject_to_SLA", expr('dept_subject_to_SLA == "YES"'))
dept.show(1)

+---------------+----------------+----------------------+-------------------+
|  dept_division|       dept_name|standardized_dept_name|dept_subject_to_SLA|
+---------------+----------------+----------------------+-------------------+
|311 Call Center|Customer Service|      Customer Service|               true|
+---------------+----------------+----------------------+-------------------+
only showing top 1 row



In [11]:
dept.dtypes

[('dept_division', 'string'),
 ('dept_name', 'string'),
 ('standardized_dept_name', 'string'),
 ('dept_subject_to_SLA', 'boolean')]

`dept` dtypes look good now.

In [12]:
source.dtypes

[('source_id', 'string'), ('source_username', 'string')]

I am leaving `source` as it is, because half of the id's contain letters.
___

#### 1. How old is the latest (in terms of days past SLA) currently open issue? How long has the oldest (in terms of days since opened) currently opened issue been open?
- **A:** 1,854 Days

In [13]:
(
   case.select(datediff(current_timestamp(), 'case_due_date')
    .alias('days_past_due'))
    .where(case.case_closed == False)
    .sort(col('days_past_due').desc())
    .show(5)
    
)

+-------------+
|days_past_due|
+-------------+
|         1854|
|         1854|
|         1854|
|         1853|
|         1851|
+-------------+
only showing top 5 rows



#### 2. How many Stray Animal cases are there?
- **A:** 26,760 Stray Animal Cases

In [14]:
case.filter(case.service_request_type == 'Stray Animal').count()

26760

#### 3. How many service requests that are assigned to the Field Operations department (`dept_division`) are not classified as 
#### "Officer Standby" request type (`service_request_type`)?
- **A:** 113,902 Service Requests

In [15]:
(
    case.filter(case.dept_division == 'Field Operations')
    .filter(case.service_request_type != 'Officer Standby')
    .count()
)

113902

#### 4. Convert the `council_district` column to a string column.
Was already done above, but here is the code:

In [16]:
case = case.withColumn("council_district", col("council_district").cast("string"))

#### 5. Extract the year from the `case_closed_date column`.

In [17]:
case.select('case_closed_date', year('case_closed_date')).show(5)

+-------------------+----------------------+
|   case_closed_date|year(case_closed_date)|
+-------------------+----------------------+
|2018-01-01 00:42:00|                  2018|
|2018-01-01 00:46:00|                  2018|
|2018-01-01 00:48:00|                  2018|
|2018-01-01 01:29:00|                  2018|
|2018-01-01 01:34:00|                  2018|
+-------------------+----------------------+
only showing top 5 rows



#### 6. Convert `num_days_late` from days to hours in new columns `num_hours_late`.

In [18]:
(
    case.withColumn('num_hours_late', case.num_days_late * 24)
    .select('num_days_late', 'num_hours_late')
    .show(10)
)

+-------------------+-------------------+
|      num_days_late|     num_hours_late|
+-------------------+-------------------+
| -998.5087616000001|     -23964.2102784|
|-2.0126041669999997|-48.302500007999996|
|       -3.022337963|      -72.536111112|
|       -15.01148148|      -360.27555552|
|0.37216435200000003|  8.931944448000001|
|       -29.74398148| -713.8555555199999|
|       -14.70673611|      -352.96166664|
|       -14.70662037|      -352.95888888|
|       -14.70662037|      -352.95888888|
|       -14.70649306|      -352.95583344|
+-------------------+-------------------+
only showing top 10 rows



#### 7. Join the case data with the source and department data.

In [19]:
all_dfs = case.join(dept, 'dept_division', 'left').join(source, 'source_id', 'left')
all_dfs.show(1, vertical=True, truncate=False)

-RECORD 0------------------------------------------------------
 source_id              | svcCRMLS                             
 dept_division          | Field Operations                     
 case_id                | 1014127332                           
 case_opened_date       | 2018-01-01 00:42:00                  
 case_closed_date       | 2018-01-01 00:42:00                  
 case_due_date          | 2020-09-26 00:42:00                  
 case_late              | false                                
 num_days_late          | -998.5087616000001                   
 case_closed            | true                                 
 service_request_type   | Stray Animal                         
 SLA_days               | 999.0                                
 case_status            | Closed                               
 request_address        | 2315  EL PASO ST, San Antonio, 78207 
 council_district       | 5                                    
 dept_name              | Animal Care Se

#### 8. Are there any cases that do not have a request source?

In [20]:
case.filter('source_id is null').count()

0

#### 9. What are the top 10 service request types in terms of number of requests?

In [21]:
(
    case.groupby('service_request_type')
    .count()
    .sort(col('count').desc())
    .show(10, truncate=False)
)

+--------------------------------+-----+
|service_request_type            |count|
+--------------------------------+-----+
|No Pickup                       |86855|
|Overgrown Yard/Trash            |65895|
|Bandit Signs                    |32910|
|Damaged Cart                    |30338|
|Front Or Side Yard Parking      |28794|
|Stray Animal                    |26760|
|Aggressive Animal(Non-Critical) |24882|
|Cart Exchange Request           |22024|
|Junk Vehicle On Private Property|21473|
|Pot Hole Repair                 |20616|
+--------------------------------+-----+
only showing top 10 rows



#### 10. What are the top 10 service request types in terms of average days late?

In [22]:
(
    case.where('case_late') 
    .groupBy('service_request_type')
    .agg(mean('num_days_late').alias('n_days_late'), count('*').alias('n_cases'))
    .sort(desc('n_days_late'))
    .show(10, truncate=False)
)

+--------------------------------------+------------------+-------+
|service_request_type                  |n_days_late       |n_cases|
+--------------------------------------+------------------+-------+
|Zoning: Recycle Yard                  |210.89201994318182|132    |
|Zoning: Junk Yards                    |200.20517608494276|262    |
|Structure/Housing Maintenance         |190.20707698509807|51     |
|Donation Container Enforcement        |171.09115313942615|122    |
|Storage of Used Mattress              |163.96812829714287|7      |
|Labeling for Used Mattress            |162.43032902285717|7      |
|Record Keeping of Used Mattresses     |153.99724039428568|7      |
|Signage Requied for Sale of Used Mattr|151.63868055333333|12     |
|Traffic Signal Graffiti               |137.64583330000002|2      |
|License Requied Used Mattress Sales   |128.79828704142858|7      |
+--------------------------------------+------------------+-------+
only showing top 10 rows



#### 11. Does number of days late depend on department?

In [23]:
case.select(case.num_days_late).show(5)

+-------------------+
|      num_days_late|
+-------------------+
| -998.5087616000001|
|-2.0126041669999997|
|       -3.022337963|
|       -15.01148148|
|0.37216435200000003|
+-------------------+
only showing top 5 rows



In [24]:
(
   case.filter('case_late')
    .groupby('dept_name')
    .agg(mean('num_days_late').alias('days_late'), count('num_days_late').alias('n_cases_late'))
    .sort('days_late')
    .withColumn('days_late', round(col('days_late'), 1))
    .show(truncate=False)
)

AnalysisException: cannot resolve 'dept_name' given input columns: [SLA_days, case_closed, case_closed_date, case_due_date, case_id, case_late, case_opened_date, case_status, council_district, dept_division, num_days_late, request_address, service_request_type, source_id];
'Aggregate ['dept_name], ['dept_name, avg(num_days_late#21) AS days_late#579, count(num_days_late#21) AS n_cases_late#581L]
+- Filter case_late#111: boolean
   +- Project [case_id#16, case_opened_date#156, case_closed_date#171, case_due_date#186, case_late#111, num_days_late#21, case_closed#96, dept_division#23, service_request_type#24, SLA_days#25, case_status#26, source_id#27, request_address#28, cast(council_district#126 as string) AS council_district#277]
      +- Project [case_id#16, case_opened_date#156, case_closed_date#171, to_timestamp('case_due_date, Some(M/d/yy H:mm)) AS case_due_date#186, case_late#111, num_days_late#21, case_closed#96, dept_division#23, service_request_type#24, SLA_days#25, case_status#26, source_id#27, request_address#28, council_district#126]
         +- Project [case_id#16, case_opened_date#156, to_timestamp('case_opened_date, Some(M/d/yy H:mm)) AS case_closed_date#171, case_due_date#141, case_late#111, num_days_late#21, case_closed#96, dept_division#23, service_request_type#24, SLA_days#25, case_status#26, source_id#27, request_address#28, council_district#126]
            +- Project [case_id#16, to_timestamp('case_opened_date, Some(M/d/yy H:mm)) AS case_opened_date#156, case_closed_date#18, case_due_date#141, case_late#111, num_days_late#21, case_closed#96, dept_division#23, service_request_type#24, SLA_days#25, case_status#26, source_id#27, request_address#28, council_district#126]
               +- Project [case_id#16, case_opened_date#17, case_closed_date#18, SLA_due_date#19 AS case_due_date#141, case_late#111, num_days_late#21, case_closed#96, dept_division#23, service_request_type#24, SLA_days#25, case_status#26, source_id#27, request_address#28, council_district#126]
                  +- Project [case_id#16, case_opened_date#17, case_closed_date#18, SLA_due_date#19, case_late#111, num_days_late#21, case_closed#96, dept_division#23, service_request_type#24, SLA_days#25, case_status#26, source_id#27, request_address#28, cast(council_district#29 as string) AS council_district#126]
                     +- Project [case_id#16, case_opened_date#17, case_closed_date#18, SLA_due_date#19, (case_late#20 = YES) AS case_late#111, num_days_late#21, case_closed#96, dept_division#23, service_request_type#24, SLA_days#25, case_status#26, source_id#27, request_address#28, council_district#29]
                        +- Project [case_id#16, case_opened_date#17, case_closed_date#18, SLA_due_date#19, case_late#20, num_days_late#21, (case_closed#22 = YES) AS case_closed#96, dept_division#23, service_request_type#24, SLA_days#25, case_status#26, source_id#27, request_address#28, council_district#29]
                           +- Relation [case_id#16,case_opened_date#17,case_closed_date#18,SLA_due_date#19,case_late#20,num_days_late#21,case_closed#22,dept_division#23,service_request_type#24,SLA_days#25,case_status#26,source_id#27,request_address#28,council_district#29] csv


#### 12. How do number of days late depend on department and request type?