
In [56]:
import pyspark

import pandas as pd
from pyspark.sql import SparkSession
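# Import only the functions used below; a wildcard import would shadow
# Python builtins such as max, min and sum.
from pyspark.sql.functions import col, current_timestamp, datediff, to_timestamp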

import warnings
warnings.filterwarnings("ignore")

In [2]:
spark = SparkSession.builder.master("local").appName("read").\
    enableHiveSupport().\
    getOrCreate()

## Part 1
Read the `case.csv` file from the 311 call data into a Spark DataFrame.

In [3]:
df = spark.read.csv('sa311/case.csv', sep=',', header=True, inferSchema=True)

How many days past its SLA due date is the most overdue case that is still open? How long (in days since it was opened) has the oldest still-open case been open?

In [39]:
df.columns

['case_id',
 'case_opened_date',
 'case_closed_date',
 'SLA_due_date',
 'case_late',
 'num_days_late',
 'case_closed',
 'dept_division',
 'service_request_type',
 'SLA_days',
 'case_status',
 'source_id',
 'request_address',
 'council_district']

In [45]:
df.select('SLA_due_date', 'case_closed', 'num_days_late')\
    .where(df.case_closed == 'NO')\
    .groupby().max('num_days_late').show()

+------------------+
|max(num_days_late)|
+------------------+
|       348.6458333|
+------------------+



In [60]:
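# NOTE: this shows the first row only and does not filter to open cases.
# Answering the question needs a filter on case_closed == 'NO' and an
# ascending sort on the parsed date before taking the first row.
df.select('case_opened_date').\
    withColumn('date',
               to_timestamp(df.case_opened_date, 'M/d/yy')).\
    select(col('date'),
           datediff(current_timestamp(), col('date'))).show(1)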

+-------------------+-----------------------------------+
|               date|datediff(current_timestamp(), date)|
+-------------------+-----------------------------------+
|2018-01-01 00:00:00|                                500|
+-------------------+-----------------------------------+
only showing top 1 row
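
The cell above parses the opening date and computes a day difference for a single row. The full logic the question asks for (keep open cases, take the earliest opening date, subtract it from today) can be sketched in plain Python. This is a minimal sketch, assuming dates are `M/d/yy` strings and `case_closed` holds `'YES'` or `'NO'`; the sample rows and the function name `days_open_of_oldest` are hypothetical and not taken from `case.csv`.

```python
from datetime import date, datetime


def days_open_of_oldest(rows, today):
    """Days between `today` and the earliest opening date among open cases.

    `rows` is an iterable of (case_opened_date, case_closed) string pairs.
    Returns None when no case is open.
    """
    opened = [
        datetime.strptime(opened_str, "%m/%d/%y").date()  # assumed M/d/yy format
        for opened_str, closed in rows
        if closed == "NO"  # keep only cases that are still open
    ]
    if not opened:
        return None
    return (today - min(opened)).days


# Hypothetical rows, not taken from case.csv
sample = [("1/1/18", "YES"), ("3/15/18", "NO"), ("2/1/18", "NO")]
print(days_open_of_oldest(sample, date(2018, 4, 1)))  # 59
```

The earliest row overall (`1/1/18`) is closed, so it is ignored; the result is measured from `2/1/18`, the earliest case that is still open.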



How many Stray Animal cases are there?

In [70]:
df.select('service_request_type')\
    .where(df.service_request_type == 'Stray Animal')\
    .count()

26760

How many service requests assigned to the Field Operations department (`dept_division`) are not classified as the "Officer Standby" request type (`service_request_type`)?

In [73]:
df.select('dept_division', 'service_request_type')\
    .where(df.dept_division == 'Field Operations')\
    .where(df.service_request_type != 'Officer Standby')\
    .count()

113902

Create a new DataFrame without any information related to dates or location.

In [76]:
no_dates_df = df.drop('case_opened_date',
                      'case_closed_date',
                      'SLA_due_date',
                      'request_address',
                      'council_district')  # council district is also location data

Read `dept.csv` into a Spark DataFrame. Inspect the `dept_name` column and replace its missing values with "other".
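
In Spark this step would typically use `DataFrame.fillna` with its `subset` argument. The plain-Python sketch below only illustrates the replacement logic on an in-memory stand-in for the file. The CSV contents, the `dept_division` column and the helper name `fill_missing` are assumptions for illustration; only `dept_name` is named in the exercise.

```python
import csv
import io

# Hypothetical stand-in for dept.csv; the second row has no dept_name.
raw = """dept_division,dept_name
Division A,Department A
Division B,
"""


def fill_missing(rows, column, default="other"):
    """Return copies of `rows` with empty or absent `column` values replaced."""
    return [{**row, column: row.get(column) or default} for row in rows]


rows = list(csv.DictReader(io.StringIO(raw)))
filled = fill_missing(rows, "dept_name")
print([r["dept_name"] for r in filled])  # ['Department A', 'other']
```

The helper returns new dictionaries rather than mutating the input, which mirrors how Spark DataFrame operations return a new DataFrame.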

In [24]:
from math import factorial as f

In [25]:
print(len(str(f(1000000))))

5565709
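
The digit count above can also be obtained without building the full integer, because the number of decimal digits of n! is floor(log10(n!)) + 1 and log10(n!) = lgamma(n + 1) / ln(10). A minimal sketch follows; the function name `factorial_digit_count` is hypothetical, and since it relies on floating-point arithmetic it could in principle be off by one when log10(n!) lies extremely close to an integer.

```python
import math


def factorial_digit_count(n):
    """Number of decimal digits of n!, via log10(n!) = lgamma(n + 1) / ln(10)."""
    if n < 0:
        raise ValueError("n must be non-negative")
    return math.floor(math.lgamma(n + 1) / math.log(10)) + 1


print(factorial_digit_count(1000000))  # 5565709
```

This runs in constant time, whereas `len(str(f(1000000)))` has to compute and stringify a multi-million-digit integer.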
