# NYC Housing Complaints Project EDA
### Holden Bruce, Dara Maguire, Francisco G. Estrada

"The learning problems that we consider can be roughly categorized as either supervised or unsupervised. In supervised learning, the goal is to predict the value of an outcome measure based on a number of input measures; in unsupervised learning, there is no outcome measure, and the goal is to describe the associations and patterns among a set of input measures." - Preface, ESL

Here we set up a supervised learning problem: predict closeTime, the time NYC Gov takes to close a complaint, from a series of features in the NYCOpenData Housing Complaints dataset. One approach is to fit a statistical regression that predicts the value directly. Another way to think about the problem is closer to an unsupervised approach: a new complaint is compared to similar past complaints, and its likely behavior (how long NYC Gov will take to close the ticket) is predicted from how those complaints behaved.
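
The second approach described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the method used later in the project: the helper names `days_to_close` and `predict_from_neighbors`, the similarity measure (number of matching feature values), and all sample values are hypothetical. The date format is assumed to match the `MM/DD/YYYY` pattern seen in ReceivedDate.

```python
from datetime import datetime

DATE_FORMAT = "%m/%d/%Y"  # assumed format, e.g. 11/07/2013


def days_to_close(received: str, closed: str) -> int:
    """Days elapsed between the received date and the closing status date."""
    delta = datetime.strptime(closed, DATE_FORMAT) - datetime.strptime(received, DATE_FORMAT)
    return delta.days


def predict_from_neighbors(new_features, history, k=3):
    """Average the close time of the k past complaints sharing the most feature values.

    history is a list of (features_dict, days_to_close) pairs.
    """
    def similarity(features):
        # Count how many feature values of the new complaint match this past complaint
        return sum(1 for key, value in new_features.items() if features.get(key) == value)

    ranked = sorted(history, key=lambda item: similarity(item[0]), reverse=True)
    nearest = ranked[:k]
    return sum(days for _, days in nearest) / len(nearest)


# Made-up history, not taken from the dataset
history = [
    ({"Borough": "QUEENS", "MajorCategory": "HEATING"}, 4),
    ({"Borough": "QUEENS", "MajorCategory": "HEATING"}, 6),
    ({"Borough": "BRONX", "MajorCategory": "PLUMBING"}, 30),
]
new_complaint = {"Borough": "QUEENS", "MajorCategory": "HEATING"}
estimate = predict_from_neighbors(new_complaint, history, k=2)
```

Strictly speaking, this neighbor-based prediction still relies on a known outcome for past complaints, so it sits between the two categories in the ESL quote rather than being purely unsupervised.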

In [None]:
import os
import pandas
import pyforest
import pyspark
from pyspark import SparkContext, SparkConf
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.feature import HashingTF
from pyspark.mllib.classification import LogisticRegressionWithSGD
from pyspark.sql import SparkSession, SQLContext
from pyspark.sql.functions import pandas_udf, PandasUDFType, col, split, explode
from pyspark.sql.types import *
# Build a local Spark session; the parentheses make line continuations unnecessary
spark = (SparkSession.builder
    .master('local')
    .appName('nycHousing')
    .config('spark.executor.memory', '28g')
    .config('spark.driver.memory', '4g')
    .config('spark.cores.max', '6')
    .getOrCreate())
spark.conf.set('spark.sql.execution.arrow.pyspark.enabled', 'true')  # enable PyArrow

In [None]:
cp_df = spark.read.csv("Complaint_Problems.csv", inferSchema=True, header=True)

In [None]:
cp_df.show(5)

In [None]:
type(cp_df),cp_df.count(),len(cp_df.columns)

(pyspark.sql.dataframe.DataFrame, 4116011, 18)

In [None]:
hmcc_df = spark.read.csv("Housing_Maintenance_Code_Complaints.csv", inferSchema=True, header=True)

In [None]:
hmcc_df.show(5)

In [None]:
type(hmcc_df),hmcc_df.count(),len(hmcc_df.columns)

(pyspark.sql.dataframe.DataFrame, 2354181, 15)

In [None]:
hmcc_df = hmcc_df.drop("status", "statusdate", "statusid")

In [None]:
hmcc_df.show()

+-----------+----------+---------+--------+-----------+---------------+-----+-----+---+---------+--------------+------------+
|ComplaintID|BuildingID|BoroughID| Borough|HouseNumber|     StreetName|  Zip|Block|Lot|Apartment|CommunityBoard|ReceivedDate|
+-----------+----------+---------+--------+-----------+---------------+-----+-----+---+---------+--------------+------------+
|    6573046|    949821|        4|  QUEENS|     177-14|     129 AVENUE|11434|12538|156|     1FLR|            12|  11/07/2013|
|    6684157|    843381|        4|  QUEENS|      35-01|     101 STREET|11368| 1742|  1|      1FL|             3|  01/03/2014|
|    6714273|    612583|        4|  QUEENS|     133-14|     226 STREET|11413|12964|254|      1FL|            13|  01/11/2014|
|    6718256|    674042|        4|  QUEENS|      96-01| LIBERTY AVENUE|11417| 9119| 43|      3FL|            10|  01/14/2014|
|    6719783|    545149|        4|  QUEENS|     121-11|     133 AVENUE|11420|11728|  5|     BSMT|            10|  01/1

In [None]:
#merged_df = cp_df.join(hmcc_df, on=["ComplaintID", "status", "statusdate", "statusid"])
merged_df = cp_df.join(hmcc_df, on=["ComplaintID"])

In [None]:
merged_df.show(5)

In [None]:
type(merged_df),merged_df.count(),len(merged_df.columns)

(pyspark.sql.dataframe.DataFrame, 4115911, 29)
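
The counts above show 4,116,011 rows in cp_df and 4,115,911 rows after the join, which suggests that roughly 100 problem rows had no matching complaint (assuming ComplaintID is unique in hmcc_df). A minimal plain-Python sketch of how such dropped rows could be identified follows; the helper `unmatched_keys` and the sample rows are hypothetical and only illustrate the idea of an anti-join.

```python
def unmatched_keys(left_rows, right_rows, key):
    """Return left rows whose key value never appears in right_rows (an anti-join)."""
    right_keys = {row[key] for row in right_rows}
    return [row for row in left_rows if row[key] not in right_keys]


# Made-up sample rows, not taken from the dataset
problems = [
    {"ComplaintID": 1, "ProblemID": 10},
    {"ComplaintID": 1, "ProblemID": 11},
    {"ComplaintID": 2, "ProblemID": 12},
]
complaints = [{"ComplaintID": 1, "Borough": "QUEENS"}]

# Rows that an inner join on ComplaintID would silently drop
dropped = unmatched_keys(problems, complaints, "ComplaintID")
```

In PySpark the same check can be expressed as a join with `how='left_anti'`, which returns only the left-side rows that have no match on the right.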

In [None]:
population = spark.read.csv("NY State Population by Zip.csv",inferSchema=True, header=True)

In [None]:
population = population.withColumnRenamed("Zip Code","Zip")

In [None]:
from pyspark.sql.types import IntegerType
population = population.withColumn("Zip", population.Zip.cast(IntegerType()))

In [None]:
# https://luminousmen.com/post/introduction-to-pyspark-join-types 
merged_population_join = merged_df.join(population, on='Zip', how='inner')

In [None]:
merged_population_join.printSchema()

root
 |-- Zip: string (nullable = true)
 |-- ComplaintID: integer (nullable = true)
 |-- ProblemID: integer (nullable = true)
 |-- UnitTypeID: integer (nullable = true)
 |-- UnitType: string (nullable = true)
 |-- SpaceTypeID : integer (nullable = true)
 |-- SpaceType: string (nullable = true)
 |-- TypeID: integer (nullable = true)
 |-- Type: string (nullable = true)
 |-- MajorCategoryID: integer (nullable = true)
 |-- MajorCategory: string (nullable = true)
 |-- MinorCategoryID: integer (nullable = true)
 |-- MinorCategory: string (nullable = true)
 |-- CodeID: integer (nullable = true)
 |-- Code: string (nullable = true)
 |-- StatusID: integer (nullable = true)
 |-- Status: string (nullable = true)
 |-- StatusDate: string (nullable = true)
 |-- StatusDescription: string (nullable = true)
 |-- BuildingID: integer (nullable = true)
 |-- BoroughID: integer (nullable = true)
 |-- Borough: string (nullable = true)
 |-- HouseNumber: string (nullable = true)
 |-- StreetName: string (nullabl

In [None]:
merged_population_join.show(3)

+-----+-----------+---------+----------+---------+------------+----------------+------+---------+---------------+-------------+---------------+-------------+------+--------------------+--------+------+----------+--------------------+----------+---------+-------+-----------+----------+-----+---+---------+--------------+------------+----------+
|  Zip|ComplaintID|ProblemID|UnitTypeID| UnitType|SpaceTypeID |       SpaceType|TypeID|     Type|MajorCategoryID|MajorCategory|MinorCategoryID|MinorCategory|CodeID|                Code|StatusID|Status|StatusDate|   StatusDescription|BuildingID|BoroughID|Borough|HouseNumber|StreetName|Block|Lot|Apartment|CommunityBoard|ReceivedDate|Population|
+-----+-----------+---------+----------+---------+------------+----------------+------+---------+---------------+-------------+---------------+-------------+------+--------------------+--------+------+----------+--------------------+----------+---------+-------+-----------+----------+-----+---+---------+-----

In [None]:
merged_df=merged_population_join

In [None]:
income = spark.read.csv("NY State Income by Zip.csv",inferSchema=True, header=True)

In [None]:
# https://sparkbyexamples.com/pyspark/pyspark-rename-dataframe-column/
income = income.withColumnRenamed("Zip Code","Zip")

In [None]:
# https://sparkbyexamples.com/pyspark/pyspark-cast-column-type/
# https://sparkbyexamples.com/pyspark/pyspark-rename-dataframe-column/

merged_df = merged_df.withColumn("Zip", merged_df.Zip.cast(IntegerType()))

In [None]:
# https://luminousmen.com/post/introduction-to-pyspark-join-types 
merged_income_join = merged_df.join(income, on='Zip', how='inner')

In [None]:
merged_income_join.show(3)

+-----+-----------+---------+----------+---------+------------+----------------+------+---------+---------------+-------------+---------------+-------------+------+--------------------+--------+------+----------+--------------------+----------+---------+-------+-----------+----------+-----+---+---------+--------------+------------+----------+--------------+--------+----------------------+-------------------------+
|  Zip|ComplaintID|ProblemID|UnitTypeID| UnitType|SpaceTypeID |       SpaceType|TypeID|     Type|MajorCategoryID|MajorCategory|MinorCategoryID|MinorCategory|CodeID|                Code|StatusID|Status|StatusDate|   StatusDescription|BuildingID|BoroughID|Borough|HouseNumber|StreetName|Block|Lot|Apartment|CommunityBoard|ReceivedDate|Population|All Households|Families|Families with Children|Families without Children|
+-----+-----------+---------+----------+---------+------------+----------------+------+---------+---------------+-------------+---------------+-------------+------+

In [None]:
merged_df=merged_income_join
merged_df = merged_df.withColumnRenamed("All Households","All Households Income")
merged_df = merged_df.withColumnRenamed("Families","Families Income")
merged_df = merged_df.withColumnRenamed("Families with Children","Families with Children Income")
merged_df = merged_df.withColumnRenamed("Families without Children","Families without Children Income")

In [None]:
merged_df.show(3)

+-----+-----------+---------+----------+---------+------------+----------------+------+---------+---------------+-------------+---------------+-------------+------+--------------------+--------+------+----------+--------------------+----------+---------+-------+-----------+----------+-----+---+---------+--------------+------------+----------+---------------------+---------------+-----------------------------+--------------------------------+
|  Zip|ComplaintID|ProblemID|UnitTypeID| UnitType|SpaceTypeID |       SpaceType|TypeID|     Type|MajorCategoryID|MajorCategory|MinorCategoryID|MinorCategory|CodeID|                Code|StatusID|Status|StatusDate|   StatusDescription|BuildingID|BoroughID|Borough|HouseNumber|StreetName|Block|Lot|Apartment|CommunityBoard|ReceivedDate|Population|All Households Income|Families Income|Families with Children Income|Families without Children Income|
+-----+-----------+---------+----------+---------+------------+----------------+------+---------+-----------

In [None]:
from pyspark.sql.functions import isnan, when, count, col
# Count NaN values per column; note that isnan() does not flag nulls
merged_df.select([count(when(isnan(c), c)).alias(c) for c in merged_df.columns]).toPandas().head()

Unnamed: 0,Zip,ComplaintID,ProblemID,UnitTypeID,UnitType,SpaceTypeID,SpaceType,TypeID,Type,MajorCategoryID,...,Block,Lot,Apartment,CommunityBoard,ReceivedDate,Population,All Households Income,Families Income,Families with Children Income,Families without Children Income
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
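
The all-zero result above counts only NaN values. Empty fields in a CSV are typically loaded as nulls rather than NaN, so a NaN-only check can miss them. The distinction can be sketched in plain Python; the helper `count_missing` and the sample values are hypothetical.

```python
import math


def count_missing(values):
    """Count None (null) and float NaN entries separately."""
    nulls = sum(1 for v in values if v is None)
    nans = sum(1 for v in values if isinstance(v, float) and math.isnan(v))
    return {"null": nulls, "nan": nans}


# Made-up column values, not taken from the dataset
sample = [11434.0, None, float("nan"), 11368.0, None]
missing = count_missing(sample)
```

In PySpark, nulls can be counted by using `col(c).isNull()` as the condition in place of `isnan(c)`, or both conditions can be combined with `|`.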


In [None]:
#merged_df.write.csv('NYC_Merged_Complaints_Data')
#merged_df.repartition(1).write.csv("NYC_Merged_Complaints_Data", sep='|')
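# Write the merged data with a header row, replacing any previous output.
# Spark writes a directory of part files at this path, not a single CSV file.
merged_df.write.csv('NYC_Merged_Complaints_Data.csv', mode='overwrite', header=True)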

In [None]:
spark.catalog.clearCache()
hmcc_df.unpersist()
cp_df.unpersist()
income.unpersist()
population.unpersist()
merged_df.unpersist()
spark.stop()