## NYC Permit Data Cleaner Notebook

This notebook cleans the NYC permit data, organizes it, and saves it to a .csv file to be joined with other data sets in the 'Master DataFrame' notebook. It follows the same format as the heavily documented 'NYC Crime Data Cleaning' and 'NYC Sales Data Cleaning' notebooks. The contents are outlined below.

**Notebook Contents**

> 1. Read in the raw CSV, selecting relevant columns and cleaning NaN values.
> 2. Convert the filing date column to datetime, then slice the main DataFrame by date and desired borough.
> 3. Visualize on map, histogram, and export to CSV.

In [1]:
import pandas as pd
import numpy as np
import datetime as dt

In [77]:
df = pd.read_csv('./DOB_Permit_Issuance.csv', usecols = ['Filing Date','Job Type','LONGITUDE','LATITUDE','BOROUGH'])
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3667352 entries, 0 to 3667351
Data columns (total 5 columns):
 #   Column       Dtype  
---  ------       -----  
 0   BOROUGH      object 
 1   Job Type     object 
 2   Filing Date  object 
 3   LATITUDE     float64
 4   LONGITUDE    float64
dtypes: float64(2), object(3)
memory usage: 139.9+ MB


In [79]:
df['Filing Date']=pd.to_datetime(df['Filing Date'])


In [113]:
# Keep Manhattan permits filed in 2014-2015; the exclusive upper bound keeps all of 12/31/2015.
in_range = (df['Filing Date'] >= '2014-01-01') & (df['Filing Date'] < '2016-01-01')
sub_df = df[in_range & (df['BOROUGH'] == 'MANHATTAN')].copy()
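
The date window used in this cell can be checked on a handful of rows. The sketch below is illustrative only: the rows are made up, and it assumes the window is meant to cover every timestamp from 01/01/2014 through the end of 12/31/2015, which it expresses as a half-open interval on the datetime column.

```python
import pandas as pd

# Toy data; the column names mirror the notebook, the rows are made up.
toy = pd.DataFrame({
    'Filing Date': pd.to_datetime([
        '2013-12-31 23:59:00',  # just before the window
        '2014-01-01 00:00:00',  # first instant of the window
        '2015-12-31 18:30:00',  # last day, non-midnight time
        '2016-01-01 00:00:00',  # just after the window
    ]),
    'BOROUGH': ['MANHATTAN'] * 4,
})

# Half-open interval [2014-01-01, 2016-01-01): every timestamp on 12/31/2015 is kept.
in_window = (toy['Filing Date'] >= '2014-01-01') & (toy['Filing Date'] < '2016-01-01')
kept = toy[in_window & (toy['BOROUGH'] == 'MANHATTAN')].copy()

print(kept['Filing Date'].dt.strftime('%Y-%m-%d %H:%M').tolist())
```

Only the two middle rows should survive. An inclusive `<= '2015-12-31'` comparison on full timestamps would drop the 18:30 row, which is why the upper bound is written as the first instant of 2016.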

In [114]:
sub_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 155400 entries, 5 to 3664552
Data columns (total 5 columns):
 #   Column       Non-Null Count   Dtype         
---  ------       --------------   -----         
 0   BOROUGH      155400 non-null  object        
 1   Job Type     155400 non-null  object        
 2   Filing Date  155400 non-null  datetime64[ns]
 3   LATITUDE     154980 non-null  float64       
 4   LONGITUDE    154980 non-null  float64       
dtypes: datetime64[ns](1), float64(2), object(2)
memory usage: 7.1+ MB


In [115]:
sub_df['Latitude'] = sub_df['LATITUDE']
sub_df['Longitude']= sub_df['LONGITUDE']
sub_df['DATE'] = sub_df['Filing Date']


In [116]:
sub_df.dropna(inplace=True)

In [117]:
sub_df.drop(['LATITUDE', 'LONGITUDE','Filing Date','BOROUGH'],axis=1, inplace=True)

In [118]:
sub_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 154980 entries, 5 to 3664552
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype         
---  ------     --------------   -----         
 0   Job Type   154980 non-null  object        
 1   Latitude   154980 non-null  float64       
 2   Longitude  154980 non-null  float64       
 3   DATE       154980 non-null  datetime64[ns]
dtypes: datetime64[ns](1), float64(2), object(1)
memory usage: 5.9+ MB


In [119]:
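# One-hot encode the job type, then round coordinates to 3 decimal places so nearby permits share a grid cell.
permit_df = sub_df.join(pd.get_dummies(sub_df['Job Type']))
permit_df[['Latitude','Longitude']] = permit_df[['Latitude','Longitude']].round(3)
# Coords is stored as (Longitude, Latitude) and later serves as the grouping key.
permit_df['Coords'] = list(zip(permit_df.Longitude, permit_df.Latitude))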
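
Rounding to three decimal places fixes the size of the grid cell that permits are later grouped into. The arithmetic below is a rough sketch of that size. It assumes a spherical Earth with a 6,371 km mean radius and uses latitude 40.78, the centre passed to the map at the end of this notebook, as the reference point. The constant names are illustrative.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius; a spherical approximation
STEP_DEG = 0.001            # resolution left after round(3)
REF_LAT_DEG = 40.78         # map centre used later in this notebook

# Length of one degree of latitude along a meridian.
metres_per_degree = math.pi / 180 * EARTH_RADIUS_M

# Height of a grid cell (north-south) and its width (east-west),
# the latter shrinking with the cosine of the latitude.
north_south_m = STEP_DEG * metres_per_degree
east_west_m = north_south_m * math.cos(math.radians(REF_LAT_DEG))

print(f"cell height ~{north_south_m:.0f} m, cell width ~{east_west_m:.0f} m")
```

Under these assumptions each cell is on the order of 100 m on a side, somewhat narrower east-west than north-south.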

In [120]:
permit_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 154980 entries, 5 to 3664552
Data columns (total 11 columns):
 #   Column     Non-Null Count   Dtype         
---  ------     --------------   -----         
 0   Job Type   154980 non-null  object        
 1   Latitude   154980 non-null  float64       
 2   Longitude  154980 non-null  float64       
 3   DATE       154980 non-null  datetime64[ns]
 4   A1         154980 non-null  uint8         
 5   A2         154980 non-null  uint8         
 6   A3         154980 non-null  uint8         
 7   DM         154980 non-null  uint8         
 8   NB         154980 non-null  uint8         
 9   SG         154980 non-null  uint8         
 10  Coords     154980 non-null  object        
dtypes: datetime64[ns](1), float64(2), object(2), uint8(6)
memory usage: 13.0+ MB


In [121]:
permit_df.drop(['Job Type'], axis = 1, inplace=True)

In [122]:
permit_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 154980 entries, 5 to 3664552
Data columns (total 10 columns):
 #   Column     Non-Null Count   Dtype         
---  ------     --------------   -----         
 0   Latitude   154980 non-null  float64       
 1   Longitude  154980 non-null  float64       
 2   DATE       154980 non-null  datetime64[ns]
 3   A1         154980 non-null  uint8         
 4   A2         154980 non-null  uint8         
 5   A3         154980 non-null  uint8         
 6   DM         154980 non-null  uint8         
 7   NB         154980 non-null  uint8         
 8   SG         154980 non-null  uint8         
 9   Coords     154980 non-null  object        
dtypes: datetime64[ns](1), float64(2), object(1), uint8(6)
memory usage: 11.8+ MB


In [123]:
permit_df.describe()

| | Latitude | Longitude | A1 | A2 | A3 | DM | NB | SG |
|---|---|---|---|---|---|---|---|---|
| count | 154980.0 | 154980.0 | 154980.0 | 154980.0 | 154980.0 | 154980.0 | 154980.0 | 154980.0 |
| mean | 40.758529 | -73.979187 | 0.039683 | 0.778591 | 0.137024 | 0.006369 | 0.022267 | 0.016067 |
| std | 0.032324 | 0.020204 | 0.195213 | 0.415197 | 0.343874 | 0.079549 | 0.147552 | 0.125732 |
| min | 40.691 | -74.018 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 40.736 | -73.994 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 40.756 | -73.981 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 75% | 40.775 | -73.966 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| max | 40.877 | -73.908 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |


In [124]:
permit_df.to_csv('20142015permit.csv')
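
Because `to_csv` writes the index as an unnamed first column, the file needs some care when it is read back for joining. The sketch below is a hedged illustration on made-up rows shaped like `permit_df`. An in-memory buffer stands in for the file, and the reader options shown are one reasonable choice, not necessarily what the 'Master DataFrame' notebook does.

```python
import io
import pandas as pd

# Made-up rows shaped like permit_df (only two of the job-type columns shown).
toy = pd.DataFrame(
    {
        'Latitude': [40.756, 40.775],
        'Longitude': [-73.981, -73.966],
        'DATE': pd.to_datetime(['2014-03-01', '2015-07-15']),
        'A2': [1, 0],
        'NB': [0, 1],
    },
    index=[5, 42],
)
toy['Coords'] = list(zip(toy.Longitude, toy.Latitude))

buffer = io.StringIO()  # stands in for the .csv file on disk
toy.to_csv(buffer)
buffer.seek(0)

# index_col=0 restores the saved index instead of creating an 'Unnamed: 0' column;
# parse_dates turns DATE back into datetime64.
restored = pd.read_csv(buffer, index_col=0, parse_dates=['DATE'])

print(restored.dtypes)
print(type(restored.loc[5, 'Coords']))  # Coords comes back as a string, not a tuple
```

The `Coords` tuples do not survive the round trip as tuples, so a downstream notebook would likely rebuild them from `Longitude` and `Latitude` rather than parse the string.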

In [109]:
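# Sum permit counts per rounded coordinate; numeric_only skips the DATE column, which cannot be summed.
g_prt_df = permit_df.groupby('Coords').sum(numeric_only=True).reset_index()
# Restore the coordinates from the key, since the summed Latitude/Longitude values are meaningless.
g_prt_df['Longitude'], g_prt_df['Latitude'] = zip(*g_prt_df.Coords)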
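
The aggregation in this cell turns one row per permit into one row per rounded coordinate. A minimal sketch on made-up rows follows. As a variation on the cell above, it selects the job-type columns explicitly so that `DATE` and the raw coordinates never enter the sum. The toy frame and the `counts` name are illustrative.

```python
import pandas as pd

# Made-up permits: two share a rounded coordinate, one stands alone.
toy = pd.DataFrame({
    'Coords': [(-73.981, 40.756), (-73.981, 40.756), (-73.966, 40.775)],
    'DATE': pd.to_datetime(['2014-03-01', '2014-09-10', '2015-07-15']),
    'A2': [1, 1, 0],
    'NB': [0, 0, 1],
})

# Sum only the indicator columns, so DATE never enters the aggregation.
counts = toy.groupby('Coords')[['A2', 'NB']].sum().reset_index()

# Unpack the (Longitude, Latitude) tuple key back into separate columns.
counts['Longitude'], counts['Latitude'] = zip(*counts['Coords'])

print(counts)
```

Each output row then holds the number of permits of each job type filed within one grid cell, which is the quantity summarized by `describe()` in the next cell.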

In [110]:
g_prt_df.describe()

| | Latitude | Longitude | A1 | A2 | A3 | DM | NB | SG |
|---|---|---|---|---|---|---|---|---|
| count | 4018.0 | 4018.0 | 4018.0 | 4018.0 | 4018.0 | 4018.0 | 4018.0 | 4018.0 |
| mean | 40.771933 | -73.971228 | 0.900946 | 15.244898 | 2.631658 | 0.142359 | 0.465406 | 0.298158 |
| std | 0.041044 | 0.024209 | 1.764345 | 19.839015 | 3.440481 | 1.027694 | 1.69388 | 1.241832 |
| min | 40.691 | -74.018 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 40.739 | -73.991 | 0.0 | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 40.767 | -73.975 | 0.0 | 9.0 | 2.0 | 0.0 | 0.0 | 0.0 |
| 75% | 40.8 | -73.95 | 1.0 | 19.0 | 4.0 | 0.0 | 0.0 | 0.0 |
| max | 40.877 | -73.908 | 17.0 | 270.0 | 40.0 | 24.0 | 19.0 | 33.0 |


In [111]:
import folium
from IPython.display import IFrame

nyc_map = folium.Map([40.78, -73.97], tiles = 'CartoDB positron')

# Add one small marker per aggregated coordinate.
for lat, lon in zip(g_prt_df['Latitude'], g_prt_df['Longitude']):
    icon = folium.CustomIcon(icon_image='https://i.imgur.com/CYx04oC.png', icon_size=(10, 10))
    folium.Marker([lat, lon], icon=icon).add_to(nyc_map)

nyc_map.save('nyc_permit.html')
IFrame(src='nyc_permit.html', width='100%', height=500)