# Day 1 of the 5-Day Kaggle Data Challenge 

In [46]:
import pandas as pd
import numpy as np


#read in data
nfl_data = pd.read_csv("NFL Play by Play 2009-2017 (v4).csv")
sf_permits = pd.read_csv("Building_Permits.csv")
                         
#set seed for reproducibility
np.random.seed(0)



In [47]:
# look at a sample of 5 rows in the sf_permits file. 
sf_permits.sample(5)

Unnamed: 0,Permit Number,Permit Type,Permit Type Definition,Permit Creation Date,Block,Lot,Street Number,Street Number Suffix,Street Name,Street Suffix,...,Existing Construction Type,Existing Construction Type Description,Proposed Construction Type,Proposed Construction Type Description,Site Permit,Supervisor District,Neighborhoods - Analysis Boundaries,Zipcode,Location,Record ID
40553,201403039652,8,otc alterations permit,03/03/2014,3732,8,400,,Clementina,St,...,,,1.0,constr type 1,,6.0,South of Market,94103.0,"(37.780460571778164, -122.40450626524974)",1334094491645
169731,201510159735,3,additions alterations or repairs,10/15/2015,2609,28,79,,Buena Vista,Tr,...,5.0,wood frame (5),5.0,wood frame (5),,8.0,Castro/Upper Market,94117.0,"(37.76757916496494, -122.43793170417105)",1399356139170
19180,M409787,8,otc alterations permit,07/22/2013,4624,31,178,,West Point,Rd,...,,,,,,10.0,Bayview Hunters Point,94124.0,"(37.73524725436046, -122.38063828309745)",1311685491725
68047,201411191888,8,otc alterations permit,11/19/2014,39,109,294,,Francisco,St,...,5.0,wood frame (5),5.0,wood frame (5),,3.0,North Beach,94133.0,"(37.805257822817126, -122.40998545760392)",1362881288870
64238,M527228,8,otc alterations permit,10/14/2014,1251,2,707,,Cole,St,...,,,,,,5.0,Haight Ashbury,94117.0,"(37.76836885973765, -122.45074431487859)",135886493776


In [48]:
# number of missing data points per column
missing_values_count = sf_permits.isnull().sum()

# number of missing points in the first ten columns
missing_values_count[0:10]

Permit Number                  0
Permit Type                    0
Permit Type Definition         0
Permit Creation Date           0
Block                          0
Lot                            0
Street Number                  0
Street Number Suffix      196684
Street Name                    0
Street Suffix               2768
dtype: int64

There appear to be many missing values, particularly in the "Street Number Suffix" column. Let's compute the percentage of all values in the sf_permits dataset that are missing, to get a better sense of the scale of the problem.

In [49]:
# total missing values
total_cells = np.prod(sf_permits.shape)  # np.product was removed in NumPy 2.0
total_missing = missing_values_count.sum()

# percent of data missing
(total_missing/total_cells) * 100

26.26002315058403

A little over a quarter of the data appears to be missing. Next, we will see what is happening with these missing values and take a closer look at the data.  
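One way to take that closer look is to break the overall figure down by column. The sketch below uses a small made-up DataFrame (`toy`, with illustrative values rather than rows from the real file) so that it runs without the CSV; the same expression should work on `sf_permits`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for sf_permits; the values are illustrative only.
toy = pd.DataFrame({
    "Street Number": [140, 440, 1647, 1230],
    "Street Number Suffix": [np.nan, np.nan, "A", np.nan],
    "Zipcode": [94103.0, np.nan, 94117.0, 94133.0],
})

# isnull() marks each missing cell as True; the column mean of those
# booleans is the fraction of missing values per column.
missing_pct = toy.isnull().mean().mul(100).sort_values(ascending=False)
print(missing_pct)
```

Sorting in descending order puts the most incomplete columns first, which makes it easier to decide where to look next.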
<br>

## A good question to ask when dealing with missing values is this:

**"Is this value missing because it wasn't recorded or because it doesn't exist?"**
<br>
<br>
As an example, let's take a closer look at the Zipcode and Street Number Suffix columns of our sf_permits dataset. According to the documentation, the Zipcode column contains the zip code of the building's address, and the Street Number Suffix column is also part of the address. Every building must have a zip code, so a missing Zipcode is a value that exists but wasn't recorded. A street number suffix, on the other hand, is used when there aren't enough numbers for the buildings on a street (e.g. 550A Front Street). Most addresses have no suffix, so a missing value in this column very likely means the suffix doesn't exist, although it is also possible that it simply wasn't recorded.
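The distinction can be reflected in how each column is treated. The following is a minimal sketch on a made-up DataFrame; the empty-string placeholder and the `zipcode_missing` flag column are assumptions chosen for illustration, not part of the original dataset.

```python
import numpy as np
import pandas as pd

# Toy rows; the values are illustrative only.
toy = pd.DataFrame({
    "Street Number": [550, 550, 79],
    "Street Number Suffix": ["A", np.nan, np.nan],
    "Zipcode": [94111.0, np.nan, 94117.0],
})

# Suffix: a missing value most likely means "no suffix exists",
# so an explicit empty string is a reasonable stand-in.
toy["Street Number Suffix"] = toy["Street Number Suffix"].fillna("")

# Zipcode: the value exists but was not recorded, so leave it as NaN
# and flag the row so it can be imputed or looked up later.
toy["zipcode_missing"] = toy["Zipcode"].isnull()
print(toy)
```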
<br>
<br>
# Drop missing values

Be careful when dropping missing values from your data, because you can lose far more than you expect. For example, if you wanted to just drop all observations with missing values, this is what would happen:

In [50]:
# remove all rows containing a missing value
sf_permits.dropna()

Unnamed: 0,Permit Number,Permit Type,Permit Type Definition,Permit Creation Date,Block,Lot,Street Number,Street Number Suffix,Street Name,Street Suffix,...,Existing Construction Type,Existing Construction Type Description,Proposed Construction Type,Proposed Construction Type Description,Site Permit,Supervisor District,Neighborhoods - Analysis Boundaries,Zipcode,Location,Record ID


As you can see, we have lost all of our data. This is because every row in the sf_permits dataset contained at least one missing value. 
<br> 
Say you wanted to remove all columns with at least one missing value instead.

In [51]:
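# remove all columns containing at least one missing value
# (despite its name, columns_to_drop holds the columns that remain)
columns_to_drop = sf_permits.dropna(axis=1)
print(columns_to_drop.shape)
columns_to_drop.head()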

(198900, 12)


Unnamed: 0,Permit Number,Permit Type,Permit Type Definition,Permit Creation Date,Block,Lot,Street Number,Street Name,Current Status,Current Status Date,Filed Date,Record ID
0,201505065519,4,sign - erect,05/06/2015,326,23,140,Ellis,expired,12/21/2017,05/06/2015,1380611233945
1,201604195146,4,sign - erect,04/19/2016,306,7,440,Geary,issued,08/03/2017,04/19/2016,1420164406718
2,201605278609,3,additions alterations or repairs,05/27/2016,595,203,1647,Pacific,withdrawn,09/26/2017,05/27/2016,1424856504716
3,201611072166,8,otc alterations permit,11/07/2016,156,11,1230,Pacific,complete,07/24/2017,11/07/2016,1443574295566
4,201611283529,6,demolitions,11/28/2016,342,1,950,Market,issued,12/01/2017,11/28/2016,144548169992


In [52]:
# how much data did we lose
print("Number of columns in original dataset: %d \n" % sf_permits.shape[1])
print("Number of columns in na's dropped: %d \n" % columns_to_drop.shape[1])


Number of columns in original dataset: 43 

Number of columns in na's dropped: 12 



We lost quite a bit of our data: 31 of the 43 columns, or about 72% of them!
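A less drastic variant, not used in the original notebook, is to drop only the columns that are mostly empty. The sketch below uses the `thresh` parameter of `DataFrame.dropna`, which sets the minimum number of non-missing values a column needs in order to be kept. The 75% cutoff and the toy data are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for sf_permits; the values are illustrative only.
toy = pd.DataFrame({
    "Permit Number": ["a", "b", "c", "d"],
    "Street Suffix": ["St", np.nan, "Av", "St"],             # 1 of 4 missing
    "Street Number Suffix": [np.nan, np.nan, np.nan, "A"],   # 3 of 4 missing
})

# Keep a column only if at least 75% of its values are present.
min_non_missing = int(0.75 * len(toy))
mostly_complete = toy.dropna(axis=1, thresh=min_non_missing)
print(mostly_complete.columns.tolist())
```

The right cutoff depends on the analysis; a column that is nearly empty may still matter if its missing values carry meaning.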
<br>
<br>
<br>
# Filling in missing values automatically
Our next option for dealing with missing values is to fill them in, a process known as "imputation".

In [53]:
# subset of the SF Permits dataset
subset_sf_permits = sf_permits.loc[:,"Lot":"Issued Date"].head()
subset_sf_permits

Unnamed: 0,Lot,Street Number,Street Number Suffix,Street Name,Street Suffix,Unit,Unit Suffix,Description,Current Status,Current Status Date,Filed Date,Issued Date
0,23,140,,Ellis,St,,,"ground fl facade: to erect illuminated, electr...",expired,12/21/2017,05/06/2015,11/09/2015
1,7,440,,Geary,St,0.0,,remove (e) awning and associated signs.,issued,08/03/2017,04/19/2016,08/03/2017
2,203,1647,,Pacific,Av,,,installation of separating wall,withdrawn,09/26/2017,05/27/2016,
3,11,1230,,Pacific,Av,0.0,,repair dryrot & stucco at front of bldg.,complete,07/24/2017,11/07/2016,07/18/2017
4,1,950,,Market,St,,,demolish retail/office/commercial 3-story buil...,issued,12/01/2017,11/28/2016,12/01/2017


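The simplest option is to replace every missing value with a fixed value such as 0. We could also replace each missing value with whatever value comes directly after it in the same column (a backward fill). This approach makes sense for datasets where the observations have some sort of logical order to them.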

In [54]:
# replace all NA's with 0
subset_sf_permits.fillna(0)

# replace each NA with the value that comes directly after it in the same column,
# then replace any remaining NA's with 0
# (bfill() replaces fillna(method='bfill'), which is deprecated in recent pandas)
subset_sf_permits.bfill(axis=0).fillna(0)

Unnamed: 0,Lot,Street Number,Street Number Suffix,Street Name,Street Suffix,Unit,Unit Suffix,Description,Current Status,Current Status Date,Filed Date,Issued Date
0,23,140,0.0,Ellis,St,0.0,0.0,"ground fl facade: to erect illuminated, electr...",expired,12/21/2017,05/06/2015,11/09/2015
1,7,440,0.0,Geary,St,0.0,0.0,remove (e) awning and associated signs.,issued,08/03/2017,04/19/2016,08/03/2017
2,203,1647,0.0,Pacific,Av,0.0,0.0,installation of separating wall,withdrawn,09/26/2017,05/27/2016,07/18/2017
3,11,1230,0.0,Pacific,Av,0.0,0.0,repair dryrot & stucco at front of bldg.,complete,07/24/2017,11/07/2016,07/18/2017
4,1,950,0.0,Market,St,0.0,0.0,demolish retail/office/commercial 3-story buil...,issued,12/01/2017,11/28/2016,12/01/2017
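To see why backward filling suits ordered data, here is a minimal sketch on a made-up daily series (`readings` is a hypothetical name, and the numbers are illustrative). Each gap takes the next recorded value, and only a trailing gap, which has nothing after it, needs the fallback of 0.

```python
import numpy as np
import pandas as pd

# Hypothetical daily readings with gaps; the values are illustrative only.
readings = pd.Series(
    [10.0, np.nan, np.nan, 13.0, np.nan],
    index=pd.date_range("2017-01-01", periods=5, freq="D"),
)

# Backward fill: each NaN takes the next non-missing value below it.
back_filled = readings.bfill()

# The last entry has no later value to copy, so it is still NaN
# until the fallback fill is applied.
back_filled_then_zero = back_filled.fillna(0)
print(back_filled_then_zero)
```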


Another way we could have dealt with the missing Zipcode values is to look them up from the addresses provided, using an external dataset of some kind. Handling missing values is a crucial part of working with datasets, and we covered multiple ways of doing it here: dropping rows or columns, filling with a fixed value, and backward filling. This ends Day 1 of the 5-day challenge.