**1) Download the Building_Permits.csv from Kaggle**


**2) Clean the San Francisco Building permit dataset**


**3) Use imputation where necessary**



### Importing the data and Python packages

In [514]:
import numpy as np 
import pandas as pd 

from sklearn import preprocessing
import matplotlib.pyplot as plt 
plt.rc("font", size=14)
import seaborn as sns

sns.set(style="white")
sns.set(style="whitegrid", color_codes=True)


In [515]:
# load the building permit CSV file into a DataFrame
building_df = pd.read_csv("./Building_Permits.csv")

# fix the NumPy seed so that the random preview below is reproducible
np.random.seed(0)
# preview five random rows
building_df.sample(5)

Unnamed: 0,Permit Number,Permit Type,Permit Type Definition,Permit Creation Date,Block,Lot,Street Number,Street Number Suffix,Street Name,Street Suffix,...,Existing Construction Type,Existing Construction Type Description,Proposed Construction Type,Proposed Construction Type Description,Site Permit,Supervisor District,Neighborhoods - Analysis Boundaries,Zipcode,Location,Record ID
40553,201403039652,8,otc alterations permit,03/03/2014,3732,8,400,,Clementina,St,...,,,1.0,constr type 1,,6.0,South of Market,94103.0,"(37.780460571778164, -122.40450626524974)",1334094491645
169731,201510159735,3,additions alterations or repairs,10/15/2015,2609,28,79,,Buena Vista,Tr,...,5.0,wood frame (5),5.0,wood frame (5),,8.0,Castro/Upper Market,94117.0,"(37.76757916496494, -122.43793170417105)",1399356139170
19180,M409787,8,otc alterations permit,07/22/2013,4624,31,178,,West Point,Rd,...,,,,,,10.0,Bayview Hunters Point,94124.0,"(37.73524725436046, -122.38063828309745)",1311685491725
68047,201411191888,8,otc alterations permit,11/19/2014,39,109,294,,Francisco,St,...,5.0,wood frame (5),5.0,wood frame (5),,3.0,North Beach,94133.0,"(37.805257822817126, -122.40998545760392)",1362881288870
64238,M527228,8,otc alterations permit,10/14/2014,1251,2,707,,Cole,St,...,,,,,,5.0,Haight Ashbury,94117.0,"(37.76836885973765, -122.45074431487859)",135886493776


### Data analysis: how many values are missing in each column?

In [516]:
# share of missing values per column, sorted from most to least missing
sums = building_df.isnull().sum()
missing_percent = (sums/building_df.shape[0]).sort_values(ascending=False)
missing_percent

TIDF Compliance                           0.999990
Voluntary Soft-Story Retrofit             0.999824
Unit Suffix                               0.990141
Street Number Suffix                      0.988859
Site Permit                               0.973057
Structural Notification                   0.965199
Fire Only Permit                          0.905344
Unit                                      0.851790
Completed Date                            0.511357
Permit Expiration Date                    0.260835
Existing Units                            0.259115
Proposed Units                            0.255963
Existing Construction Type                0.218029
Existing Construction Type Description    0.218029
Proposed Construction Type                0.217004
Proposed Construction Type Description    0.217004
Number of Proposed Stories                0.215525
Number of Existing Stories                0.215103
Proposed Use                              0.213369
Existing Use                   

In [517]:
# total number of cells 
total_cells = np.prod(building_df.shape)  # np.product was removed in NumPy 2.0
# total missing values 
missing_values = building_df.isnull().sum()
total_missing_values = missing_values.sum()

# percentage of nan or None cells 
nan_or_null_cell_percentage = (total_missing_values/total_cells)*100
nan_or_null_cell_percentage

26.26002315058403
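
The per-column and overall missing shares computed above can be reproduced on a small example. The following is a minimal sketch on a toy frame with made-up column names (`a`, `b`, `c` are assumptions, not columns of the permit data). It uses the fact that `isnull()` returns booleans, so a mean is already a share.

```python
import numpy as np
import pandas as pd

# toy frame with hypothetical column names, not the permit data
toy_df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": ["x", "y", None, "z"],
    "c": [1, 2, 3, 4],
})

# isnull() yields booleans, so the column mean is the share of missing cells
missing_share = toy_df.isnull().mean().sort_values(ascending=False)

# the share over all cells is the mean over the whole boolean array
overall_missing_percent = toy_df.isnull().to_numpy().mean() * 100

print(missing_share)
print(overall_missing_percent)
```

Here 3 of the 12 cells are missing, so the overall figure is 25%. The same two expressions should give the per-column shares and the overall percentage for `building_df`.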

### Remark

Overall, more than 26% of all cells in the dataset are missing.

The following columns are missing an extremely large share of their values (in %) and are therefore unsuitable if we later want to use this data for prediction:


* TIDF Compliance                           100.00
* Voluntary Soft-Story Retrofit              99.98
* Unit Suffix                                99.01
* Street Number Suffix                       98.89
* Site Permit                                97.31
* Structural Notification                    96.52
* Fire Only Permit                           90.53
* Unit                                       85.18
* Completed Date                             51.14

### Decision: filling the Y/N columns

We agreed to ignore most of these columns. For Voluntary Soft-Story Retrofit, Site Permit, and Fire Only Permit, we enter N (No) for every missing value, because these are pure Y/N columns. This fills the "missing" values, so the columns can be used for prediction.

In [518]:
building_df['Voluntary Soft-Story Retrofit'].value_counts()


Y    35
Name: Voluntary Soft-Story Retrofit, dtype: int64

In [519]:
building_df['Site Permit'].value_counts()


Y    5359
Name: Site Permit, dtype: int64

In [520]:
building_df['Fire Only Permit'].value_counts()

Y    18827
Name: Fire Only Permit, dtype: int64

In [521]:
# Each column contains either 'Y' or nothing, so every entry that is not 'Y' becomes 'N'.
building_df['Voluntary Soft-Story Retrofit'] = np.where(building_df['Voluntary Soft-Story Retrofit']=='Y', 'Y', 'N')
building_df['Site Permit'] = np.where(building_df['Site Permit']=='Y', 'Y', 'N')
building_df['Fire Only Permit'] = np.where(building_df['Fire Only Permit']=='Y', 'Y', 'N')

In [522]:
# it worked
building_df['Fire Only Permit'].value_counts()

N    180073
Y     18827
Name: Fire Only Permit, dtype: int64
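
The `np.where` fill used above can be compared with a plain `fillna`. This is a minimal sketch on a toy Series (the values are made up); it assumes, as the `value_counts()` outputs suggest, that `'Y'` is the only value ever recorded in such a column.

```python
import numpy as np
import pandas as pd

# toy flag column: only 'Y' is ever recorded, everything else is missing
flag = pd.Series(["Y", np.nan, np.nan, "Y", np.nan])

# variant 1: everything that is not 'Y' becomes 'N'
via_where = pd.Series(np.where(flag == "Y", "Y", "N"))

# variant 2: only missing entries become 'N', other values are kept
via_fillna = flag.fillna("N")

print((via_where == via_fillna).all())
```

The two variants differ only if a column contains something other than `'Y'` or a missing value: `np.where` would turn it into `'N'`, while `fillna` would leave it unchanged.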

### Result

As the output shows, the columns treated above no longer have any missing values:

In [523]:
sums = building_df.isnull().sum()
missing_percent = (sums/building_df.shape[0]).sort_values(ascending=False)
missing_percent

TIDF Compliance                           0.999990
Unit Suffix                               0.990141
Street Number Suffix                      0.988859
Structural Notification                   0.965199
Unit                                      0.851790
Completed Date                            0.511357
Permit Expiration Date                    0.260835
Existing Units                            0.259115
Proposed Units                            0.255963
Existing Construction Type Description    0.218029
Existing Construction Type                0.218029
Proposed Construction Type                0.217004
Proposed Construction Type Description    0.217004
Number of Proposed Stories                0.215525
Number of Existing Stories                0.215103
Proposed Use                              0.213369
Existing Use                              0.206707
Estimated Cost                            0.191383
Plansets                                  0.187577
First Construction Document Dat

Next, we deal with the following remaining columns:

* TIDF Compliance                           100.00
* Unit Suffix                                99.01
* Street Number Suffix                       98.89
* Structural Notification                    96.52
* Unit                                       85.18
* Completed Date                             51.14

### Ignoring the columns with many missing values

Reconstructing or imputing these values would be too laborious, and too much data is missing for it anyway (>50%), so we ignore the columns listed above. As we see it, there is also no point in predicting anything from, say, the Street Number Suffix, and the effort would be out of proportion to the benefit.

In [524]:
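def drop_column(column_name):
    building_df.drop(column_name, axis=1, inplace=True)

# missing_percent is sorted in descending order, so we can stop at the
# first column that is missing at most 50% of its values
for column_name, share in missing_percent.items():
    if share > 0.5:
        drop_column(column_name)
    else:
        break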

As we can see, the corresponding columns have been dropped.

In [525]:
sums = building_df.isnull().sum()
missing_percent = (sums/building_df.shape[0]).sort_values(ascending=False)
missing_percent

Permit Expiration Date                    0.260835
Existing Units                            0.259115
Proposed Units                            0.255963
Existing Construction Type Description    0.218029
Existing Construction Type                0.218029
Proposed Construction Type Description    0.217004
Proposed Construction Type                0.217004
Number of Proposed Stories                0.215525
Number of Existing Stories                0.215103
Proposed Use                              0.213369
Existing Use                              0.206707
Estimated Cost                            0.191383
Plansets                                  0.187577
First Construction Document Date          0.075143
Issued Date                               0.075113
Revised Cost                              0.030498
Street Suffix                             0.013917
Neighborhoods - Analysis Boundaries       0.008673
Supervisor District                       0.008632
Zipcode                        

After this treatment, only about 8.4% of the cells are still missing.

In [526]:
# total number of cells 
total_cells = np.prod(building_df.shape)  # np.product was removed in NumPy 2.0
# total missing values 
missing_values = building_df.isnull().sum()
total_missing_values = missing_values.sum()

# percentage of nan or None cells 
nan_or_null_cell_percentage = (total_missing_values/total_cells)*100
nan_or_null_cell_percentage

8.39526857173916
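
The 50% threshold rule used above can also be written without an explicit loop. This is a minimal sketch on a toy frame; the column names and the `threshold` variable are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

toy_df = pd.DataFrame({
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0],
    "half_missing": [np.nan, np.nan, 1.0, 2.0],
    "complete": [1, 2, 3, 4],
})

threshold = 0.5
missing_share = toy_df.isnull().mean()

# keep only the columns whose missing share does not exceed the threshold
kept_df = toy_df.loc[:, missing_share <= threshold]
print(list(kept_df.columns))
```

As in the loop above, a column with exactly 50% missing values is kept, because only shares strictly greater than 0.5 are dropped.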

### Ignoring descriptive columns

Now that we have dealt with the columns with the most missing entries, we look at the rest of the dataset. Some columns merely describe other columns.

For example: Permit Type and Permit Type Definition.

We decided to drop the columns that only describe a numeric column, such as Permit Type Definition.

In [527]:
unuseful_columns = ['Record ID', 'Description', 'Permit Type Definition', 'Existing Construction Type Description', 'Proposed Construction Type Description']
building_df.drop(unuseful_columns, axis=1, inplace=True)


In [528]:
#show columns
building_df.iloc[0]

Permit Number                                                       201505065519
Permit Type                                                                    4
Permit Creation Date                                                  05/06/2015
Block                                                                       0326
Lot                                                                          023
Street Number                                                                140
Street Name                                                                Ellis
Street Suffix                                                                 St
Current Status                                                           expired
Current Status Date                                                   12/21/2017
Filed Date                                                            05/06/2015
Issued Date                                                           11/09/2015
First Construction Document 

In [529]:
sums = building_df.isnull().sum()
missing_percent = (sums/building_df.shape[0]).sort_values(ascending=False)
missing_percent

Permit Expiration Date                 0.260835
Existing Units                         0.259115
Proposed Units                         0.255963
Existing Construction Type             0.218029
Proposed Construction Type             0.217004
Number of Proposed Stories             0.215525
Number of Existing Stories             0.215103
Proposed Use                           0.213369
Existing Use                           0.206707
Estimated Cost                         0.191383
Plansets                               0.187577
First Construction Document Date       0.075143
Issued Date                            0.075113
Revised Cost                           0.030498
Street Suffix                          0.013917
Neighborhoods - Analysis Boundaries    0.008673
Supervisor District                    0.008632
Zipcode                                0.008627
Location                               0.008547
Permit Creation Date                   0.000000
Street Name                            0

### Logical relationship between the Proposed and Existing columns

We took a closer look at the dataset and found that for many rows the entries in the Proposed and Existing columns are identical (roughly 75% in each pair checked below). We therefore decided to fill the missing values of the Existing columns from the corresponding Proposed columns (and Revised Cost from Estimated Cost) and to ignore the Proposed columns afterwards. The numbers should not differ much, and we think the Existing columns matter more for building permits.

In [530]:
(building_df['Existing Use'] == building_df['Proposed Use']).sum() / len(building_df['Proposed Use'])

0.7422976370035194

In [531]:
(building_df['Existing Construction Type'] == building_df['Proposed Construction Type']).sum() / len(building_df['Proposed Construction Type'])

0.7635796882855707

In [532]:
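# Note: this compares 'Revised Cost' with itself. Because NaN != NaN, the result
# is the share of non-missing 'Revised Cost' values, not the agreement with 'Estimated Cost'.
(building_df['Revised Cost'] == building_df['Revised Cost']).sum() / len(building_df['Estimated Cost'])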

0.969502262443439

In [533]:
(building_df['Number of Existing Stories'] == building_df['Number of Proposed Stories']).sum() / len(building_df['Number of Proposed Stories'])

0.7525641025641026

In [535]:
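# assign the result back instead of calling fillna(inplace=True) on a column selection,
# which newer pandas versions warn about and which may not modify building_df
building_df['Existing Use'] = building_df['Existing Use'].fillna(building_df['Proposed Use'])
building_df['Revised Cost'] = building_df['Revised Cost'].fillna(building_df['Estimated Cost'])
building_df['Existing Units'] = building_df['Existing Units'].fillna(building_df['Proposed Units'])
building_df['Number of Existing Stories'] = building_df['Number of Existing Stories'].fillna(building_df['Number of Proposed Stories'])
building_df['Existing Construction Type'] = building_df['Existing Construction Type'].fillna(building_df['Proposed Construction Type'])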

Note: the missing-value shares have improved slightly.

In [536]:
sums = building_df.isnull().sum()
missing_percent = (sums/building_df.shape[0]).sort_values(ascending=False)
missing_percent

Permit Expiration Date                 0.260835
Proposed Units                         0.255963
Existing Units                         0.243957
Proposed Construction Type             0.217004
Number of Proposed Stories             0.215525
Proposed Use                           0.213369
Existing Construction Type             0.199211
Number of Existing Stories             0.197959
Existing Use                           0.195093
Estimated Cost                         0.191383
Plansets                               0.187577
First Construction Document Date       0.075143
Issued Date                            0.075113
Street Suffix                          0.013917
Neighborhoods - Analysis Boundaries    0.008673
Supervisor District                    0.008632
Zipcode                                0.008627
Location                               0.008547
Revised Cost                           0.003042
Permit Creation Date                   0.000000
Street Name                            0
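
The agreement check and the fill step in this section both rely on pandas aligning two columns row by row. This is a minimal sketch on a toy frame; the column names `existing` and `proposed` are stand-ins, not columns of the permit data.

```python
import numpy as np
import pandas as pd

toy_df = pd.DataFrame({
    "existing": [2.0, np.nan, 5.0, np.nan],
    "proposed": [2.0, 3.0, 6.0, np.nan],
})

# share of rows where both columns agree; a comparison involving NaN is False,
# so rows with a missing value never count as a match
agreement = (toy_df["existing"] == toy_df["proposed"]).mean()

# fill gaps in 'existing' from 'proposed'; recorded values stay untouched
toy_df["existing"] = toy_df["existing"].fillna(toy_df["proposed"])

print(agreement)
print(toy_df)
```

Two consequences are visible here: the agreement share is measured over all rows, including those with missing values, and a gap remains wherever both columns are missing. That would explain why the missing shares above only drop slightly.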

### Adjusting the dates: new column layout and removal of the old date columns

Because the dates in these columns can appear in different formats, we decided to split each column into separate day, month, and year columns and to remove the original column. The splitting code below assumes the MM/DD/YYYY layout seen in the data preview.

In [537]:
date_columns = [
    'Permit Creation Date',
    'Current Status Date',
    'Filed Date',
    'Issued Date',
    'First Construction Document Date',
    'Permit Expiration Date'
]

In [538]:
def split_dates(column_name):
    # 'Filed Date' -> 'Filed ': strip the trailing 'Date' to get the prefix of the new columns
    name = column_name[0:-4]
    # dates are MM/DD/YYYY strings; missing values turn into 'nan' and are coerced back to NaN
    dates = building_df[column_name].astype(str)
    building_df[name+'Year'] = pd.to_numeric(dates.str[-4:], errors='coerce')
    building_df[name+'Day'] = pd.to_numeric(dates.str[3:5], errors='coerce')
    building_df[name+'Month'] = pd.to_numeric(dates.str[0:2], errors='coerce')

for column_name in date_columns:
    split_dates(column_name)

In [539]:
building_df.columns

Index(['Permit Number', 'Permit Type', 'Permit Creation Date', 'Block', 'Lot',
       'Street Number', 'Street Name', 'Street Suffix', 'Current Status',
       'Current Status Date', 'Filed Date', 'Issued Date',
       'First Construction Document Date', 'Number of Existing Stories',
       'Number of Proposed Stories', 'Voluntary Soft-Story Retrofit',
       'Fire Only Permit', 'Permit Expiration Date', 'Estimated Cost',
       'Revised Cost', 'Existing Use', 'Existing Units', 'Proposed Use',
       'Proposed Units', 'Plansets', 'Existing Construction Type',
       'Proposed Construction Type', 'Site Permit', 'Supervisor District',
       'Neighborhoods - Analysis Boundaries', 'Zipcode', 'Location',
       'Permit Creation Year', 'Permit Creation Day', 'Permit Creation Month',
       'Current Status Year', 'Current Status Day', 'Current Status Month',
       'Filed Year', 'Filed Day', 'Filed Month', 'Issued Year', 'Issued Day',
       'Issued Month', 'First Construction Document Year'

In [540]:
for i in date_columns:
    drop_column(i)

In [541]:
building_df.columns

Index(['Permit Number', 'Permit Type', 'Block', 'Lot', 'Street Number',
       'Street Name', 'Street Suffix', 'Current Status',
       'Number of Existing Stories', 'Number of Proposed Stories',
       'Voluntary Soft-Story Retrofit', 'Fire Only Permit', 'Estimated Cost',
       'Revised Cost', 'Existing Use', 'Existing Units', 'Proposed Use',
       'Proposed Units', 'Plansets', 'Existing Construction Type',
       'Proposed Construction Type', 'Site Permit', 'Supervisor District',
       'Neighborhoods - Analysis Boundaries', 'Zipcode', 'Location',
       'Permit Creation Year', 'Permit Creation Day', 'Permit Creation Month',
       'Current Status Year', 'Current Status Day', 'Current Status Month',
       'Filed Year', 'Filed Day', 'Filed Month', 'Issued Year', 'Issued Day',
       'Issued Month', 'First Construction Document Year',
       'First Construction Document Day', 'First Construction Document Month',
       'Permit Expiration Year', 'Permit Expiration Day',
       'Permi

In [542]:
sums = building_df.isnull().sum()
missing_percent = (sums/building_df.shape[0]).sort_values(ascending=False)
missing_percent

Permit Expiration Month                0.260835
Permit Expiration Year                 0.260835
Permit Expiration Day                  0.260835
Proposed Units                         0.255963
Existing Units                         0.243957
Proposed Construction Type             0.217004
Number of Proposed Stories             0.215525
Proposed Use                           0.213369
Existing Construction Type             0.199211
Number of Existing Stories             0.197959
Existing Use                           0.195093
Estimated Cost                         0.191383
Plansets                               0.187577
First Construction Document Month      0.075143
First Construction Document Day        0.075143
First Construction Document Year       0.075143
Issued Year                            0.075113
Issued Month                           0.075113
Issued Day                             0.075113
Street Suffix                          0.013917
Neighborhoods - Analysis Boundaries    0
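
As an alternative to slicing the date strings by position, the dates could be parsed first and the parts read from the parsed values. This is a minimal sketch on a toy frame, assuming every date follows the MM/DD/YYYY layout; the sample values are illustrative.

```python
import numpy as np
import pandas as pd

toy_df = pd.DataFrame({"Filed Date": ["05/06/2015", "11/09/2015", np.nan]})

# assumption: every date is MM/DD/YYYY; anything that cannot be parsed becomes NaT
parsed = pd.to_datetime(toy_df["Filed Date"], format="%m/%d/%Y", errors="coerce")

# the .dt accessor exposes the date parts; NaT yields NaN in each part
toy_df["Filed Year"] = parsed.dt.year
toy_df["Filed Day"] = parsed.dt.day
toy_df["Filed Month"] = parsed.dt.month
toy_df = toy_df.drop(columns="Filed Date")

print(toy_df)
```

With `errors="coerce"`, a date in an unexpected format ends up as a missing value instead of silently producing wrong numbers, which positional slicing cannot guarantee.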

### Which columns are (logically) useful for the prediction, and which are not?

Having dealt with several columns already, we considered which columns of this partly cleaned dataset to keep and which we do not consider useful.

Here is an overview (all listed columns are dropped except Street Name):

* Location - the location is just a geo-coordinate, which is of no use without the zones (too complex for this purpose, and the zip code serves better).
* Proposed Use
* Estimated Cost
* Proposed Units
* Number of Proposed Stories
* Proposed Construction Type
* Street Number - we ignore the house number because it is categorical but has too many categories and has (or should have) no influence on the building permit.
* Block - the address does not provide useful data for building permits and cannot be used.
* Lot
* **Street Name** - we decided to keep the street name because it does not have too many distinct values and every row has one.
* Street Suffix - has (or should have) no influence on the building permit.
* Permit Number - just a number, too many categories

In [543]:
len(building_df['Street Name'].unique())

1704
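
The "too many categories" argument can be checked for several columns at once by comparing the number of distinct values with the number of rows. This is a minimal sketch on a toy frame with made-up values; `cardinality_ratio` is a name introduced here for illustration.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "Street Name": ["Ellis", "Geary", "Pacific", "Pacific", "Market"],
    "Current Status": ["expired", "issued", "withdrawn", "complete", "issued"],
    "Permit Number": ["A1", "A2", "A3", "A4", "A5"],
})

# number of distinct values per column
cardinality = toy_df.nunique()

# a ratio close to 1 means almost every row has its own value (an identifier)
cardinality_ratio = cardinality / len(toy_df)
print(cardinality_ratio.sort_values(ascending=False))
```

In the permit data, the 1704 distinct street names over 198900 rows correspond to a ratio below 1%, whereas an identifier such as Permit Number would be expected to sit close to 1.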

In [544]:
columns_to_drop = ['Location', 
                   'Proposed Use',
                   'Estimated Cost',
                   'Proposed Units',
                   'Number of Proposed Stories',
                   'Proposed Construction Type',
                   'Street Number', 
                   'Block', 
                   'Lot',
                   'Street Suffix',
                   'Permit Number', 
                   
                  
                  ]

In [545]:
for i in columns_to_drop:
    drop_column(i)

In [546]:
building_df.shape

(198900, 33)

In [547]:
sums = building_df.isnull().sum()
missing_percent = (sums/building_df.shape[0]).sort_values(ascending=False)
missing_percent

Permit Expiration Month                0.260835
Permit Expiration Year                 0.260835
Permit Expiration Day                  0.260835
Existing Units                         0.243957
Existing Construction Type             0.199211
Number of Existing Stories             0.197959
Existing Use                           0.195093
Plansets                               0.187577
First Construction Document Year       0.075143
First Construction Document Day        0.075143
First Construction Document Month      0.075143
Issued Month                           0.075113
Issued Day                             0.075113
Issued Year                            0.075113
Neighborhoods - Analysis Boundaries    0.008673
Supervisor District                    0.008632
Zipcode                                0.008627
Revised Cost                           0.003042
Street Name                            0.000000
Current Status                         0.000000
Voluntary Soft-Story Retrofit          0

# Imputations


So that we can later tell which values were imputed, we add an extra indicator column for every column with missing values. It records whether the value was originally missing and has therefore been filled in.

In [563]:
imputed = building_df.copy()
cols_with_missing = (col for col in building_df.columns 
                                 if building_df[col].isnull().any())
for col in cols_with_missing:
    imputed[col + '_was_missing'] = imputed[col].isnull()
    

In [564]:
imputed.head(5)

Unnamed: 0,Permit Type,Street Name,Current Status,Number of Existing Stories,Voluntary Soft-Story Retrofit,Fire Only Permit,Revised Cost,Existing Use,Existing Units,Plansets,...,Zipcode_was_missing,Issued Year_was_missing,Issued Day_was_missing,Issued Month_was_missing,First Construction Document Year_was_missing,First Construction Document Day_was_missing,First Construction Document Month_was_missing,Permit Expiration Year_was_missing,Permit Expiration Day_was_missing,Permit Expiration Month_was_missing
0,4,Ellis,expired,6.0,N,N,4000.0,tourist hotel/motel,143.0,2.0,...,False,False,False,False,False,False,False,False,False,False
1,4,Geary,issued,7.0,N,N,500.0,tourist hotel/motel,,2.0,...,False,False,False,False,False,False,False,False,False,False
2,3,Pacific,withdrawn,6.0,N,N,20000.0,retail sales,39.0,2.0,...,False,True,True,True,True,True,True,True,True,True
3,8,Pacific,complete,2.0,N,N,2000.0,1 family dwelling,1.0,2.0,...,False,False,False,False,False,False,False,False,False,False
4,6,Market,issued,3.0,N,N,100000.0,retail sales,,2.0,...,False,False,False,False,False,False,False,False,False,False


In [565]:
sums = imputed.isnull().sum()
missing_percent = (sums/imputed.shape[0]).sort_values(ascending=False)
missing_percent

Permit Expiration Month                            0.260835
Permit Expiration Year                             0.260835
Permit Expiration Day                              0.260835
Existing Units                                     0.243957
Existing Construction Type                         0.199211
Number of Existing Stories                         0.197959
Existing Use                                       0.195093
Plansets                                           0.187577
First Construction Document Year                   0.075143
First Construction Document Day                    0.075143
First Construction Document Month                  0.075143
Issued Day                                         0.075113
Issued Year                                        0.075113
Issued Month                                       0.075113
Neighborhoods - Analysis Boundaries                0.008673
Supervisor District                                0.008632
Zipcode                                 

In [551]:
# compare the shape of the imputed frame with the original
building_df.shape


(198900, 33)

In [566]:
imputed.shape

(198900, 51)
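
The effect of the `_was_missing` columns is easiest to see on a small example. This is a minimal sketch on a toy frame; the column names `units` and `status` are assumptions, and the median fill is only one possible choice.

```python
import numpy as np
import pandas as pd

toy_df = pd.DataFrame({
    "units": [143.0, np.nan, 39.0],
    "status": ["expired", "issued", "withdrawn"],
})

flagged = toy_df.copy()
cols_with_missing = [col for col in toy_df.columns if toy_df[col].isnull().any()]
for col in cols_with_missing:
    # record the gap before it is filled, so the information is not lost
    flagged[col + "_was_missing"] = flagged[col].isnull()

# after filling, the indicator is the only trace of which rows were imputed
flagged["units"] = flagged["units"].fillna(flagged["units"].median())
print(flagged)
```

Only columns that actually contain gaps get an indicator, which matches the growth from 33 to 51 columns above: 18 columns had missing values.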

Looking at the dataset, a few columns (e.g. Revised Cost, Existing Units) are numeric. For these we can impute a central value such as the median, as covered in the lecture. The loop below fills the gaps with the column mean.

In [553]:
numerical_cols = [
    'Number of Existing Stories',
    'Revised Cost',
    'Existing Units'
]

In [567]:
imputed['Existing Units'].isnull().sum()

48523

In [570]:
for column in numerical_cols:
    # assign back instead of using inplace=True on a column selection
    imputed[column] = imputed[column].fillna(imputed[column].mean(skipna=True))

In [571]:
imputed['Existing Units'].isnull().sum()

0
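
Mean and median can differ considerably on skewed columns, so the choice of fill value matters. This is a minimal sketch on made-up, right-skewed values (one very large building among small ones), not on the permit data.

```python
import numpy as np
import pandas as pd

# toy, right-skewed values with two gaps
units = pd.Series([1.0, 1.0, 2.0, 2.0, 144.0, np.nan, np.nan])

# the mean is pulled upwards by the single large value
mean_filled = units.fillna(units.mean())

# the median stays with the typical values
median_filled = units.fillna(units.median())

print(mean_filled.tolist())
print(median_filled.tolist())
```

In this toy case the mean fill inserts 30.0 while the median fill inserts 2.0. If a column such as Existing Units is similarly skewed, the median may be the more representative fill value.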

In [572]:
sums = imputed.isnull().sum()
missing_percent = (sums/imputed.shape[0]).sort_values(ascending=False)
missing_percent

Permit Expiration Month                            0.260835
Permit Expiration Year                             0.260835
Permit Expiration Day                              0.260835
Existing Construction Type                         0.199211
Existing Use                                       0.195093
Plansets                                           0.187577
First Construction Document Year                   0.075143
First Construction Document Day                    0.075143
First Construction Document Month                  0.075143
Issued Day                                         0.075113
Issued Year                                        0.075113
Issued Month                                       0.075113
Neighborhoods - Analysis Boundaries                0.008673
Supervisor District                                0.008632
Zipcode                                            0.008627
Current Status Month                               0.000000
Permit Creation Year                    

In some columns (especially the date columns) many values are missing, and the values that are present are fairly evenly distributed. Filling the gaps with the most frequent value therefore distorts the distribution somewhat, but because an indicator column records which values were imputed, this should still work for the model.


In [573]:
imputed.columns

Index(['Permit Type', 'Street Name', 'Current Status',
       'Number of Existing Stories', 'Voluntary Soft-Story Retrofit',
       'Fire Only Permit', 'Revised Cost', 'Existing Use', 'Existing Units',
       'Plansets', 'Existing Construction Type', 'Site Permit',
       'Supervisor District', 'Neighborhoods - Analysis Boundaries', 'Zipcode',
       'Permit Creation Year', 'Permit Creation Day', 'Permit Creation Month',
       'Current Status Year', 'Current Status Day', 'Current Status Month',
       'Filed Year', 'Filed Day', 'Filed Month', 'Issued Year', 'Issued Day',
       'Issued Month', 'First Construction Document Year',
       'First Construction Document Day', 'First Construction Document Month',
       'Permit Expiration Year', 'Permit Expiration Day',
       'Permit Expiration Month', 'Number of Existing Stories_was_missing',
       'Revised Cost_was_missing', 'Existing Use_was_missing',
       'Existing Units_was_missing', 'Plansets_was_missing',
       'Existing Construc

In [574]:
for column in imputed.columns:
    if imputed[column].isnull().sum() == 0:
        continue
    # fill the remaining gaps with the most frequent value (mode) of the column
    imputed[column] = imputed[column].fillna(imputed[column].mode()[0])

In [575]:
imputed['Permit Expiration Month'].value_counts()

5.0     66152
3.0     14135
6.0     13491
4.0     13394
2.0     13061
1.0     13029
7.0     12726
8.0     11524
10.0    11158
9.0     10897
11.0     9838
12.0     9495
Name: Permit Expiration Month, dtype: int64
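
The large count for month 5 in the output above is what mode imputation produces: every gap receives the same value. This is a minimal sketch on a toy Series with made-up months.

```python
import numpy as np
import pandas as pd

months = pd.Series([5.0, 3.0, 5.0, 6.0, np.nan, np.nan, np.nan, 4.0])

# mode() returns a Series because there may be ties, so take the first entry
most_frequent = months.mode()[0]
filled = months.fillna(most_frequent)

# value_counts() ignores NaN, so 'before' only counts the recorded values
before = months.value_counts()
after = filled.value_counts()

print(before)
print(after)
```

In the toy data the count of month 5 grows from 2 to 5 while all other counts stay the same. In the permit data, about 26% of the rows had no expiration date, which is consistent with month 5 standing far above the other months after the fill.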

In [576]:
sums = imputed.isnull().sum()
missing_percent = (sums/imputed.shape[0]).sort_values(ascending=False)
missing_percent

Permit Expiration Month_was_missing                0.0
Supervisor District                                0.0
Filed Day                                          0.0
Filed Year                                         0.0
Current Status Month                               0.0
Current Status Day                                 0.0
Current Status Year                                0.0
Permit Creation Month                              0.0
Permit Creation Day                                0.0
Permit Creation Year                               0.0
Zipcode                                            0.0
Neighborhoods - Analysis Boundaries                0.0
Site Permit                                        0.0
Issued Year                                        0.0
Existing Construction Type                         0.0
Plansets                                           0.0
Existing Units                                     0.0
Existing Use                                       0.0
Revised Co

In [577]:
imputed.head(5)

Unnamed: 0,Permit Type,Street Name,Current Status,Number of Existing Stories,Voluntary Soft-Story Retrofit,Fire Only Permit,Revised Cost,Existing Use,Existing Units,Plansets,...,Zipcode_was_missing,Issued Year_was_missing,Issued Day_was_missing,Issued Month_was_missing,First Construction Document Year_was_missing,First Construction Document Day_was_missing,First Construction Document Month_was_missing,Permit Expiration Year_was_missing,Permit Expiration Day_was_missing,Permit Expiration Month_was_missing
0,4,Ellis,expired,6.0,N,N,4000.0,tourist hotel/motel,143.0,2.0,...,False,False,False,False,False,False,False,False,False,False
1,4,Geary,issued,7.0,N,N,500.0,tourist hotel/motel,16.350162,2.0,...,False,False,False,False,False,False,False,False,False,False
2,3,Pacific,withdrawn,6.0,N,N,20000.0,retail sales,39.0,2.0,...,False,True,True,True,True,True,True,True,True,True
3,8,Pacific,complete,2.0,N,N,2000.0,1 family dwelling,1.0,2.0,...,False,False,False,False,False,False,False,False,False,False
4,6,Market,issued,3.0,N,N,100000.0,retail sales,16.350162,2.0,...,False,False,False,False,False,False,False,False,False,False


### Conclusion


Cleaning the San Francisco Building Permits dataset was harder than expected. At first it was very difficult to understand the dataset and to see how its columns relate to each other. Only after some research and discussion did it slowly start to make sense.
Deciding which columns matter for the prediction and which do not was especially hard for us. We generally agreed on rational, logical steps, such as the 50% threshold for missing data.
Because we removed columns step by step, we also had a better overview of the dataset towards the end.