<h1> Combine Crime, Census and Green Space </h1>

The purpose of this notebook is to combine all of the data gathered so far: crime, census and green space counts.

Once the datasets are combined, the area and population of each subdivision sit in the same dataframe, so the population density of each subdivision can be calculated.

In [27]:
import pandas as pd

In [28]:
# Read the necessary files (crime, census and green space) into dataframes.
# Each CSV carries a saved index column ('Unnamed: 0'), which is dropped.
dub_crime = pd.read_csv('CrimeDublin.csv')
dub_crime = dub_crime.drop(columns='Unnamed: 0')
dub_census = pd.read_csv('CensusDublin.csv')
dub_census = dub_census.drop(columns='Unnamed: 0')
green_area = pd.read_csv('DublinGreenSpace&Area.csv')
green_area = green_area.drop(columns='Unnamed: 0')

In [29]:
# Merge the census data onto the crime data. The station name column is
# 'Garda Station' in the crime file and 'GardaStation' in the other two files.
df_censuscrime = pd.merge(dub_crime, dub_census, how='left', left_on='Garda Station', right_on='GardaStation')
# Merge green space count and area data onto the dataframe containing census and crime data
df_censuscrime = pd.merge(df_censuscrime, green_area, how='left', left_on='Garda Station', right_on='GardaStation')
# Both merges bring in a 'GardaStation' key, so pandas suffixes them _x and _y; drop the duplicates
df_censuscrime = df_censuscrime.drop(columns=['GardaStation_x', 'GardaStation_y'])
df_censuscrime.head(40)

Unnamed: 0,Garda Station,Theft,Assault & Kidnapping,Fraud & Drugs & Weapons,Environment & Public Order,TotalCrime,PopulationTotalMale,PopulationTotalFemale,PopulationTotal,PrincipalStatusAtWork,...,GeneralHealthGood,GeneralHealthFair,GeneralHealthBad,GeneralHealthVeryBad,GeneralHealthNotStated,Total_Under18Male,Total_Under18Female,Total_Under18,Green,Area
0,Balbriggan,193,168,179,470,1010,11746,12296,24042,9191,...,6532,1638,281,66,513,4134,3892,8026,9,60.000023
1,Ballyfermot,272,173,349,820,1614,13247,14332,27579,9627,...,8213,3122,546,133,567,3426,3286,6712,4,8.568579
2,Ballymun,189,217,260,752,1418,10408,11267,21675,6870,...,6412,2150,426,75,749,2993,2893,5886,4,7.025158
3,BlackrockCoDublin,243,106,156,354,859,15003,16975,31978,13827,...,7535,1883,284,65,511,3563,3469,7032,3,8.965902
4,Blanchardstown,790,543,579,2084,3996,48192,49886,98078,42376,...,25938,5398,904,195,2698,15255,14502,29757,15,45.28753
5,BridewellDublin,279,174,299,4611,5363,11883,11111,22994,10800,...,7194,2209,438,91,862,1636,1546,3182,0,3.005021
6,Cabinteely,230,83,74,310,697,15800,16954,32754,13227,...,8113,2025,303,74,685,4157,3983,8140,35,18.411407
7,Cabra,172,130,157,598,1057,10342,11509,21851,9363,...,6025,1916,371,70,775,2366,2331,4697,3,11.359962
8,Clontarf,415,99,113,469,1096,17749,19494,37243,16745,...,9954,2903,419,84,725,3949,3698,7647,21,18.083254
9,Coolock,403,279,608,951,2241,25244,26969,52213,20131,...,15359,4671,795,169,1343,6946,6642,13588,2,17.676889


In [30]:
# These areas have no green space count, so they are dropped
df_censuscrime = df_censuscrime[df_censuscrime['Garda Station'] != 'Airport']
df_censuscrime = df_censuscrime[df_censuscrime['Garda Station'] != 'Garristown']
df_censuscrime = df_censuscrime.reset_index(drop=True)
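
The stations to drop can also be found programmatically rather than by inspection. The following is a minimal sketch on toy data, assuming the affected stations are simply absent from the green space file (if they are present with an empty count, checking `Green` with `isna()` would be the equivalent test). The frames `crime_toy` and `green_toy` and all of their values are placeholders, not the real files. Passing `indicator=True` to `pd.merge` adds a `_merge` column that flags rows with no match in the right-hand frame.

```python
import pandas as pd

# Toy stand-ins for the real dataframes; all values are placeholders.
crime_toy = pd.DataFrame({
    'Garda Station': ['Airport', 'Balbriggan', 'Cabra'],
    'TotalCrime': [100, 200, 300],
})
green_toy = pd.DataFrame({
    'GardaStation': ['Balbriggan', 'Cabra'],
    'Green': [9, 3],
})

merged = pd.merge(crime_toy, green_toy, how='left',
                  left_on='Garda Station', right_on='GardaStation',
                  indicator=True)

# Rows flagged 'left_only' had no green space record to join to.
unmatched = merged.loc[merged['_merge'] == 'left_only', 'Garda Station'].tolist()
print(unmatched)

# Keep only the matched rows and remove the helper columns.
matched = (merged[merged['_merge'] == 'both']
           .drop(columns=['GardaStation', '_merge'])
           .reset_index(drop=True))
```

Listing the unmatched stations this way keeps the exclusion list in step with the data if the input files change.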

In [31]:
# Find the population density of each subdivision (population / area) and add it to the dataframe
df_censuscrime['PopulationDensity'] = (df_censuscrime['PopulationTotal']/df_censuscrime['Area'])
print(df_censuscrime['PopulationDensity'])

0       400.699846
1      3218.620030
2      3085.339783
3      3566.624005
4      2165.673399
5      7651.860597
6      1779.005778
7      1923.509972
8      2059.529726
9      2953.743672
10     3491.432368
11     3346.585953
12     1074.760818
13    10309.834669
14     1444.968583
15     1770.284709
16    10478.366461
17     4913.336912
18     1286.237047
19      335.018127
20     1433.200106
21     5659.055363
22     5940.879322
23     1829.891495
24     2358.014011
25     1095.320316
26      372.843897
27     3128.079865
28     5200.587305
29      466.993718
Name: PopulationDensity, dtype: float64
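
As a sanity check, the first two values in the series above can be reproduced by hand. This small sketch copies the `PopulationTotal` and `Area` values for Balbriggan and Ballyfermot from the merged table shown earlier; the `sample` dataframe is illustrative only. Since the notebook later divides by `Area` to get green spaces per square kilometre, the density is taken to be people per square kilometre.

```python
import pandas as pd

# Two rows copied from the merged table shown earlier in the notebook.
sample = pd.DataFrame({
    'Garda Station': ['Balbriggan', 'Ballyfermot'],
    'PopulationTotal': [24042, 27579],
    'Area': [60.000023, 8.568579],
})

# Population density = population / area.
sample['PopulationDensity'] = sample['PopulationTotal'] / sample['Area']
print(sample[['Garda Station', 'PopulationDensity']])
```

The results should agree with rows 0 and 1 of the printed series (roughly 400.7 and 3218.6).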


In [32]:
# Save the combined file, containing census, crime and green space data
df_censuscrime.to_csv('CombinedFinalDublin.csv')
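
A note on the saved index: by default `to_csv` writes the dataframe index as an extra first column with an empty header, and `read_csv` then names that column `Unnamed: 0`. That is most likely where the `Unnamed: 0` column dropped at the top of this notebook comes from. The sketch below demonstrates the round trip on a small illustrative dataframe, using an in-memory buffer instead of a file.

```python
import io
import pandas as pd

df = pd.DataFrame({'Garda Station': ['Balbriggan', 'Cabra'], 'Green': [9, 3]})

# Default to_csv writes the index as a first column with an empty header...
buffer = io.StringIO()
df.to_csv(buffer)
buffer.seek(0)
# ...which read_csv names 'Unnamed: 0' when the data is loaded again.
with_index = pd.read_csv(buffer)

# Writing with index=False leaves the extra column out.
buffer_no_index = io.StringIO()
df.to_csv(buffer_no_index, index=False)
buffer_no_index.seek(0)
without_index = pd.read_csv(buffer_no_index)

print(list(with_index.columns))
print(list(without_index.columns))
```

Switching to `index=False` here would mean any notebook that reads the file and drops `Unnamed: 0` has to be updated as well, so the default is kept.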

<h1> Features per capita </h1>

As we are dealing with subdivisions of varying size and population, this needed to be taken into account before we started any analysis or modeling.

The population density feature accounts for the varying population sizes of different areas, but every other feature is a raw count that scales with the population of its subdivision.

To represent the significance of each feature in each subdivision more accurately, we converted the count for each feature to a per capita value.
   <br>Feature per capita = ( Feature / Population of subdivision )
   <br>e.g. Total Crime per capita in Kilmainham = Total Crime Count for Kilmainham / Population of Kilmainham
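
The formula can be sketched as a small helper that returns a scaled copy instead of overwriting the original columns. This is an illustrative alternative, not the notebook's own code: the function name `per_capita` and the `sample` dataframe are assumptions, and the two rows are copied from the merged table shown earlier.

```python
import pandas as pd

def per_capita(df, population_col, exclude):
    """Return a copy of df with each non-excluded column divided by population_col."""
    result = df.copy()
    skip = set(exclude) | {population_col}
    for col in result.columns:
        if col not in skip:
            # Row-wise division: each count is divided by that row's population.
            result[col] = result[col] / result[population_col]
    return result

# Two rows copied from the merged table shown earlier in the notebook.
sample = pd.DataFrame({
    'Garda Station': ['Balbriggan', 'Ballyfermot'],
    'Theft': [193, 272],
    'TotalCrime': [1010, 1614],
    'PopulationTotal': [24042, 27579],
})

scaled = per_capita(sample, 'PopulationTotal', exclude=['Garda Station'])
print(scaled)
```

For Balbriggan this gives a total crime rate of 1010 / 24042, about 0.042 per person. Because the helper works on a copy, re-running the cell does not divide the counts a second time.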


In [20]:
# exclusions lists the columns that should not be converted to per capita values
exclusions = ['Garda Station', 'Green', 'Area', 'PopulationDensity', 'PopulationTotal']
# columns_to_scale is every column of the dataframe except the exclusions
columns_to_scale = [col for col in df_censuscrime.columns if col not in exclusions]
# Divide each remaining count by the subdivision's population.
# This overwrites the columns in place, so the cell should only be run once.
for col in columns_to_scale:
    df_censuscrime[col] = df_censuscrime[col] / df_censuscrime['PopulationTotal']

In [34]:
# Additional features of interest

# Green spaces per square kilometre
df_censuscrime['GreensPerKMSQ'] = df_censuscrime['Green']/df_censuscrime['Area']
# Green spaces per person
df_censuscrime['GreensPerPerson'] = df_censuscrime['Green']/df_censuscrime['PopulationTotal']

In [22]:
df_censuscrime.to_csv('FinalPerCapita.csv')