# B SF Permits Exploration
6 points

- Explore the data. Which columns correlate strongly with the "Current Status" column? How do other columns correlate to each other?
- Describe problems with "Current Status" as a target column to predict. Can you construct a better target column?

### Note:

The values in the "Current Status" column are text rather than numbers, which makes finding correlations difficult, because correlation coefficients are defined for numeric values. As a workaround, we encode the different states as numeric categories.
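
A minimal sketch of this encoding idea, assuming a small made-up sample instead of the real permit data (the `sample` frame and its values are hypothetical): once the status text is mapped to 0/1, it can be correlated with a numeric column.

```python
import pandas as pd

# Hypothetical sample; the real data comes from Building_Permits.csv
sample = pd.DataFrame({
    "Current Status": ["issued", "filed", "complete", "cancelled"],
    "Permit Type": [8, 3, 8, 3],
})

# Encode the text status as 1 (permitted) or 0 (not permitted)
permitted = {"approved", "issued", "complete"}
sample["status_code"] = sample["Current Status"].isin(permitted).astype(int)

# Pearson correlation between the encoded status and a numeric column
corr = sample["status_code"].corr(sample["Permit Type"])
print(corr)
```

In this toy sample the two columns move together exactly, so the coefficient is at its maximum; real data will rarely look like that.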

In [2]:
import numpy as np 
import pandas as pd 

from sklearn import preprocessing
import matplotlib.pyplot as plt 
plt.rc("font", size=14)
import seaborn as sns
sns.set(style="white")
sns.set(style="whitegrid", color_codes=True)

In [3]:
# load the San Francisco building permits CSV file into a DataFrame
building_df = pd.read_csv("./Building_Permits.csv")

# preview the first rows
building_df.head(5)

FileNotFoundError: [Errno 2] File b'./Building_Permits.csv' does not exist: b'./Building_Permits.csv'

In [None]:
# Construct a better, binary target column from 'Current Status':
# 1 = permitted, 0 = not permitted
permitted = ['approved', 'issued', 'complete']
not_permitted = ['appeal', 'plancheck', 'suspend', 'reinstated', 'filed', 'disapproved', 'incomplete', 'revoked', 'expired', 'cancelled', 'withdrawn']

# Build one lookup table and map it in a single assignment; this avoids
# replace(..., inplace=True) on a column selection, which may not modify building_df
status_map = {**dict.fromkeys(permitted, 1), **dict.fromkeys(not_permitted, 0)}

# Statuses that appear in neither list become NaN
building_df['Current Status'] = building_df['Current Status'].map(status_map)

In [None]:
# check whether the conversion worked
building_df.head()
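
Beyond looking at the first rows, a quick check is to count the values of the new target, including rows without a label. The sketch below uses a hypothetical `demo` frame in place of `building_df`; the counts are illustrative only.

```python
import pandas as pd

# Hypothetical stand-in for building_df after the conversion
demo = pd.DataFrame({"Current Status": [1, 1, 0, 1, None, 0, 1]})

# How many rows fall into each class, including rows without a label (NaN)
class_counts = demo["Current Status"].value_counts(dropna=False)
print(class_counts)

# Share of permitted rows among the rows that have a label
permitted_share = demo["Current Status"].mean()
print(f"permitted share: {permitted_share:.2f}")
```

A strongly unbalanced target would be one reason why simple group means look similar across most categories.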

In [None]:
# Show how many permits there are of each Permit Type
building_df['Permit Type'].mode()
ax = building_df['Permit Type'].hist(bins=8, color='teal', alpha=0.8)
ax.set(xlabel='Permit Type', ylabel='Count')
plt.show()

In [None]:
# Some columns (Existing Use, Neighborhoods - Analysis Boundaries, ...) still contain strings,
# so we replace them with numeric codes (pd.factorize encodes missing values as -1)
use_numbers, use_label = pd.factorize(building_df['Existing Use'])
building_df['Existing Use'] = use_numbers

neighborhood_numbers, neighborhood_label = pd.factorize(building_df['Neighborhoods - Analysis Boundaries'])
building_df['Neighborhoods - Analysis Boundaries'] = neighborhood_numbers
building_df.head(10)
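
The integer codes from `pd.factorize` only number the categories in order of first appearance, so their magnitude carries no meaning. One possible alternative for measuring association between two categorical columns is Cramér's V. It is not part of the original analysis; the sketch below computes it from a contingency table on hypothetical data, and the helper name `cramers_v` is an assumption made for illustration.

```python
import numpy as np
import pandas as pd

def cramers_v(x: pd.Series, y: pd.Series) -> float:
    """Cramér's V (without bias correction) for two categorical series.

    Assumes both series have at least two distinct values.
    """
    observed = pd.crosstab(x, y).to_numpy(dtype=float)
    n = observed.sum()
    # Expected counts under independence: row total * column total / n
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    chi2 = ((observed - expected) ** 2 / expected).sum()
    min_dim = min(observed.shape) - 1
    return float(np.sqrt(chi2 / (n * min_dim)))

# Hypothetical sample in which the status depends fully on the use
toy = pd.DataFrame({
    "Existing Use": ["office", "office", "retail", "retail", "office", "retail"],
    "Current Status": [1, 1, 0, 0, 1, 0],
})
v = cramers_v(toy["Existing Use"], toy["Current Status"])
print(round(v, 3))
```

Values near 0 indicate no association and values near 1 a strong one, independent of how the categories happen to be numbered.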

### Correlation between Permit Type and Current Status

In [None]:
plt.figure(figsize=(20,4))
avg_status_by_permittype = building_df[['Permit Type','Current Status']].groupby('Permit Type',as_index=False).mean()
sns.barplot(x='Permit Type', y='Current Status', data=avg_status_by_permittype, color="LightSeaGreen")
plt.show()

As the plot shows, permit types 4, 5 and 8 are approved most often.
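
Because a group mean can be driven by very few rows, it may help to look at the group size next to the permitted share. A small sketch on hypothetical data (the `toy` values are made up, not taken from the dataset):

```python
import pandas as pd

# Hypothetical sample with an already binary 'Current Status'
toy = pd.DataFrame({
    "Permit Type": [8, 8, 8, 3, 3, 4],
    "Current Status": [1, 1, 0, 0, 1, 1],
})

# Permitted share and number of rows per permit type, highest share first
summary = (
    toy.groupby("Permit Type")["Current Status"]
    .agg(permitted_share="mean", n_permits="count")
    .sort_values("permitted_share", ascending=False)
)
print(summary)
```

In this toy data the type with the highest share has only a single row, which is exactly the situation where a bar plot of means alone can mislead.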

### Correlation between Existing Use and Current Status

In [None]:
plt.figure(figsize=(20,4))
avg_status_by_permittype = building_df[['Existing Use','Current Status']].groupby('Existing Use',as_index=False).mean()
sns.barplot(x='Existing Use', y='Current Status', data=avg_status_by_permittype, color="LightSeaGreen")
plt.show()

The plot shows almost no correlation between the two columns tested above.

### Correlation between Number of Existing Stories and Current Status

In [None]:
plt.figure(figsize=(20,4))
avg_status_by_permittype = building_df[['Number of Existing Stories','Current Status']].groupby('Number of Existing Stories',as_index=False).mean()
sns.barplot(x='Number of Existing Stories', y='Current Status', data=avg_status_by_permittype, color="LightSeaGreen")
plt.show()

The plot shows almost no correlation between the two columns tested above.

### Correlation between Plansets and Current Status

In [None]:
plt.figure(figsize=(20,4))
avg_status_by_permittype = building_df[['Plansets','Current Status']].groupby('Plansets',as_index=False).mean()
sns.barplot(x='Plansets', y='Current Status', data=avg_status_by_permittype, color="LightSeaGreen")
plt.show()

### Correlation between Existing Construction Type and Current Status

In [None]:
plt.figure(figsize=(20,4))
avg_status_by_permittype = building_df[['Existing Construction Type','Current Status']].groupby('Existing Construction Type',as_index=False).mean()
sns.barplot(x='Existing Construction Type', y='Current Status', data=avg_status_by_permittype, color="LightSeaGreen")
plt.show()


The plot shows almost no correlation between the two columns tested above.

### Correlation between Neighborhoods - Analysis Boundaries and Current Status

In [None]:
plt.figure(figsize=(20,4))
avg_status_by_permittype = building_df[['Neighborhoods - Analysis Boundaries','Current Status']].groupby('Neighborhoods - Analysis Boundaries',as_index=False).mean()
sns.barplot(x='Neighborhoods - Analysis Boundaries', y='Current Status', data=avg_status_by_permittype, color="LightSeaGreen")
plt.show()


The plot shows almost no correlation between the two columns tested above.

In [None]:
plt.figure(figsize=(15,7))
# fill=True replaces shade=True, which newer seaborn versions deprecate
sns.kdeplot(building_df['Existing Construction Type'][building_df['Current Status'] == 1], color="darkturquoise", fill=True)
sns.kdeplot(building_df['Existing Construction Type'][building_df['Current Status'] == 0], color="lightcoral", fill=True)
plt.legend(['permitted', 'not permitted'])
plt.xlabel('Existing Construction Type')
# the y-axis of a KDE plot shows the estimated density
plt.ylabel('Density')
plt.xlim(-2,8)
plt.show()

All in all, there is no clear relationship between whether a building permit is granted and the other columns.
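
One way to back this impression with numbers is to compute the correlation of every numeric column with the binary target. A minimal sketch, assuming a hypothetical frame `toy` with made-up values; on the real data the same steps would run on `building_df` after the target conversion.

```python
import pandas as pd

# Hypothetical sample; column names follow the dataset, values are made up
toy = pd.DataFrame({
    "Current Status": [1, 0, 1, 1, 0, 1],
    "Plansets": [2, 0, 2, 2, 0, 2],
    "Number of Existing Stories": [3, 3, 1, 2, 2, 4],
    "Description": ["a", "b", "c", "d", "e", "f"],  # non-numeric, ignored
})

# Keep numeric columns only, then take the target's column of the matrix
numeric = toy.select_dtypes("number")
target_corr = numeric.corr()["Current Status"].drop("Current Status")

# Sort by absolute strength so negative correlations are not hidden
print(target_corr.sort_values(key=abs, ascending=False))
```

The full matrix returned by `numeric.corr()` also shows how the remaining columns correlate with each other.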

In [None]:
plt.figure(figsize=(20,4))
avg_status_by_permittype = building_df[['Neighborhoods - Analysis Boundaries','Existing Construction Type']].groupby('Neighborhoods - Analysis Boundaries',as_index=False).mean()
sns.barplot(x='Neighborhoods - Analysis Boundaries', y='Existing Construction Type', data=avg_status_by_permittype, color="LightSeaGreen")
plt.show()

### Conclusion:

After reviewing the dataset, we find that no column correlates well with "Current Status", because the columns have too many distinct categories. The share of permitted applications is roughly the same everywhere. On close inspection, Permit Type may show a weak correlation, since types 4, 5 and 8 are approved most often.

