<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Preparation" data-toc-modified-id="Preparation-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Preparation</a></span></li></ul></div>

In [31]:
%matplotlib inline
import pyspark
import pyspark.ml
from pyspark.sql.functions import *
from pyspark.ml.feature import RFormula
from pyspark.ml.classification import LogisticRegression
from pyspark.sql import SparkSession, DataFrame, Column, Row, GroupedData, \
    DataFrameNaFunctions, DataFrameStatFunctions, functions, types, Window
from pyspark.sql import functions as f

spark = pyspark.sql.SparkSession.builder.getOrCreate()

## Preparation

Use the `.randomSplit` method to split the 311 data into training and test sets.

Read in the CSV that was prepared earlier.

In [80]:
df = spark.read.csv('sa311/joined_df.csv/', header=True, inferSchema=True)

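Split into 75/25 train/test. `randomSplit` normalizes the weights, so `[3.0, 1.0]` yields roughly 75% and 25% of the rows; the second argument is the random seed. Because DataFrames are immutable, `train` and `test` do not pick up columns that are added to or dropped from `df` in later cells.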

In [81]:
train, test = df.randomSplit([3.0, 1.0], 123)

Create a classification model to predict whether a case will be late or not (i.e. predict `case_late`). Experiment with different combinations of features and different classification algorithms.

Create a new column with the total number of days each case was open.

In [82]:
df = df.withColumn('case_days_open', f.datediff('case_closed_date', 'case_opened_date'))

Drop columns that appear to be insignificant.

In [83]:
df = df.drop('request_address',
             'source_username',
             'case_id',
             'source_id',
             'case_closed_date',
             'case_opened_date',
             'dept_subject_to_SLA',
             'SLA_due_date',
             'num_days_late')

In [84]:
df.show(10)

+----------------+---------+-----------+--------------------+------------------+-----------+----------------+--------------------+----------------------+--------------+
|   dept_division|case_late|case_closed|service_request_type|          SLA_days|case_status|council_district|           dept_name|standardized_dept_name|case_days_open|
+----------------+---------+-----------+--------------------+------------------+-----------+----------------+--------------------+----------------------+--------------+
|Field Operations|       NO|        YES|        Stray Animal|             999.0|     Closed|               5|Animal Care Services|  Animal Care Services|          null|
|     Storm Water|       NO|        YES|Removal Of Obstru...|       4.322222222|     Closed|               3|Trans & Cap Impro...|  Trans & Cap Impro...|          null|
|     Storm Water|       NO|        YES|Removal Of Obstru...|       4.320729167|     Closed|               3|Trans & Cap Impro...|  Trans & Cap Impro...|  

Round `SLA_days` to 2 decimal places. Note that `f.round` rounds to the nearest value (half up) rather than rounding down.

In [85]:
df = df.withColumn('SLA_days', f.round('SLA_days', 2))

In [87]:
df.show(10)

+----------------+---------+-----------+--------------------+--------+-----------+----------------+--------------------+----------------------+--------------+
|   dept_division|case_late|case_closed|service_request_type|SLA_days|case_status|council_district|           dept_name|standardized_dept_name|case_days_open|
+----------------+---------+-----------+--------------------+--------+-----------+----------------+--------------------+----------------------+--------------+
|Field Operations|       NO|        YES|        Stray Animal|   999.0|     Closed|               5|Animal Care Services|  Animal Care Services|          null|
|     Storm Water|       NO|        YES|Removal Of Obstru...|    4.32|     Closed|               3|Trans & Cap Impro...|  Trans & Cap Impro...|          null|
|     Storm Water|       NO|        YES|Removal Of Obstru...|    4.32|     Closed|               3|Trans & Cap Impro...|  Trans & Cap Impro...|          null|
|Code Enforcement|       NO|        YES|Front 

In [91]:
rf = RFormula(formula='case_late ~ SLA_days + council_district')

rf_df = rf.fit(df).transform(df).select('features', 'labels')

AnalysisException: "cannot resolve '`labels`' given input columns: [dept_division, label, service_request_type, case_late, case_days_open, council_district, SLA_days, standardized_dept_name, case_status, dept_name, case_closed, features];;
'Project [features#2516, 'labels]
+- Project [dept_division#1475, case_late#1481, case_closed#1483, service_request_type#1484, SLA_days#1583, case_status#1486, council_district#1488, dept_name#1490, standardized_dept_name#1491, case_days_open#1511, features#2516, UDF(cast(case_late#1481 as string)) AS label#2539]
   +- Project [dept_division#1475, case_late#1481, case_closed#1483, service_request_type#1484, SLA_days#1583, case_status#1486, council_district#1488, dept_name#1490, standardized_dept_name#1491, case_days_open#1511, features#2516]
      +- Project [dept_division#1475, case_late#1481, case_closed#1483, service_request_type#1484, SLA_days#1583, case_status#1486, council_district#1488, dept_name#1490, standardized_dept_name#1491, case_days_open#1511, features#2504 AS features#2516]
         +- Project [dept_division#1475, case_late#1481, case_closed#1483, service_request_type#1484, SLA_days#1583, case_status#1486, council_district#1488, dept_name#1490, standardized_dept_name#1491, case_days_open#1511, UDF(named_struct(SLA_days, SLA_days#1583, council_district_double_RFormula_791d32eb10a5, cast(council_district#1488 as double))) AS features#2504]
            +- Project [dept_division#1475, case_late#1481, case_closed#1483, service_request_type#1484, round(SLA_days#1485, 2) AS SLA_days#1583, case_status#1486, council_district#1488, dept_name#1490, standardized_dept_name#1491, case_days_open#1511]
               +- Project [dept_division#1475, case_late#1481, case_closed#1483, service_request_type#1484, SLA_days#1485, case_status#1486, council_district#1488, dept_name#1490, standardized_dept_name#1491, case_days_open#1511]
                  +- Project [dept_division#1475, source_id#1476, case_id#1477, case_opened_date#1478, case_closed_date#1479, SLA_due_date#1480, case_late#1481, num_days_late#1482, case_closed#1483, service_request_type#1484, SLA_days#1485, case_status#1486, request_address#1487, council_district#1488, source_username#1489, dept_name#1490, standardized_dept_name#1491, dept_subject_to_SLA#1492, datediff(cast(case_closed_date#1479 as date), cast(case_opened_date#1478 as date)) AS case_days_open#1511]
                     +- Relation[dept_division#1475,source_id#1476,case_id#1477,case_opened_date#1478,case_closed_date#1479,SLA_due_date#1480,case_late#1481,num_days_late#1482,case_closed#1483,service_request_type#1484,SLA_days#1485,case_status#1486,request_address#1487,council_district#1488,source_username#1489,dept_name#1490,standardized_dept_name#1491,dept_subject_to_SLA#1492] csv"