### In this notebook: 

#### After using fetchimage.py with boston_safety_subsample.csv (20,000 samples from Boston) on GCP to fetch images, we checked the resulting 'metadata.json' and found that images were obtained for 19,904 of the 20,000 samples. The proportions of 'safety' and 'subdistrict' were essentially unaffected by the 96 missing samples, so we proceed to feed the CNN with the 19,904 images.

#### We then split the 19,904 images into Training (80%, 15,923) and Test (20%, 3,981) sets. In both sets, the proportions of the 'safety' and 'subdistrict' columns are close to those of the 'boston_safety_subsample' table.

In [83]:
import pandas as pd

In [3]:
metadata = pd.read_json('~/Desktop/metadata.json')

### Metadata for all 20,000 samples, obtained while fetching images

In [16]:
metadata.head()

Unnamed: 0,_file,copyright,date,location,pano_id,status
0,gsv_0.jpg,© Google,2018-08-01,"{'lat': 42.34874, 'lng': -71.0714033}",6oZeECWwjS0xTzdsN-ZE7g,OK
1,gsv_1.jpg,© Google,2018-08-01,"{'lat': 42.28859413668265, 'lng': -71.06564443...",q5dOW8BMAXU0vEtSKfa4yA,OK
2,gsv_2.jpg,© Google,2017-07-01,"{'lat': 42.2957334, 'lng': -71.1380628}",fsVE7knT_rhVkKqyAGbkyg,OK
3,gsv_3.jpg,© Google,2018-08-01,"{'lat': 42.31308354620529, 'lng': -71.07641430...",AJ7cTQ8rL1aADzsM11K39A,OK
4,gsv_4.jpg,© Google,2018-08-01,"{'lat': 42.3112202, 'lng': -71.0601081}",MnI2F0n9yqnIGie2rQSAZw,OK


In [10]:
metadata.shape

(20000, 6)

### Of the 20,000 samples, images were downloaded for 19,904; the remaining 96 returned "ZERO_RESULTS". The index of the 'metadata' table identifies which samples had an image fetched successfully.

In [15]:
metadata.groupby('status').size()

status
OK              19904
ZERO_RESULTS       96
dtype: int64

In [9]:
metadata[metadata['status'] == 'OK'].shape

(19904, 6)

In [20]:
metadata[metadata['status'] == 'ZERO_RESULTS'].head(5)

Unnamed: 0,_file,copyright,date,location,pano_id,status
37,,,NaT,,,ZERO_RESULTS
170,,,NaT,,,ZERO_RESULTS
1034,,,NaT,,,ZERO_RESULTS
1380,,,NaT,,,ZERO_RESULTS
1609,,,NaT,,,ZERO_RESULTS


In [21]:
metadata[metadata['status'] == 'OK'].head(5)

Unnamed: 0,_file,copyright,date,location,pano_id,status
0,gsv_0.jpg,© Google,2018-08-01,"{'lat': 42.34874, 'lng': -71.0714033}",6oZeECWwjS0xTzdsN-ZE7g,OK
1,gsv_1.jpg,© Google,2018-08-01,"{'lat': 42.28859413668265, 'lng': -71.06564443...",q5dOW8BMAXU0vEtSKfa4yA,OK
2,gsv_2.jpg,© Google,2017-07-01,"{'lat': 42.2957334, 'lng': -71.1380628}",fsVE7knT_rhVkKqyAGbkyg,OK
3,gsv_3.jpg,© Google,2018-08-01,"{'lat': 42.31308354620529, 'lng': -71.07641430...",AJ7cTQ8rL1aADzsM11K39A,OK
4,gsv_4.jpg,© Google,2018-08-01,"{'lat': 42.3112202, 'lng': -71.0601081}",MnI2F0n9yqnIGie2rQSAZw,OK
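
The rows shown above suggest that the image for sample `i` is stored as `gsv_<i>.jpg`, which is what makes the index usable as a join key later. A minimal sketch of how that could be checked, assuming this naming convention holds for every row; `toy_metadata` is an illustrative stand-in, not the real table:

```python
import pandas as pd

# Illustrative stand-in for the real metadata table (values are made up).
toy_metadata = pd.DataFrame(
    {
        "_file": ["gsv_0.jpg", None, "gsv_2.jpg"],
        "status": ["OK", "ZERO_RESULTS", "OK"],
    }
)

ok_rows = toy_metadata[toy_metadata["status"] == "OK"]

# Assumed convention: the image for sample i is saved as gsv_<i>.jpg.
expected_names = [f"gsv_{i}.jpg" for i in ok_rows.index]
index_matches_file = ok_rows["_file"].tolist() == expected_names

# Rows without an image should carry no file name at all.
missing_rows = toy_metadata[toy_metadata["status"] != "OK"]
no_file_when_missing = bool(missing_rows["_file"].isna().all())

print(index_matches_file, no_file_when_missing)
```

Running the same two checks on the real `metadata` table would confirm that filtering on `status == 'OK'` keeps the index aligned with the image files.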


### Check whether the proportions of 'safety' and 'subdistrict' in the 19,904 samples were affected

Method: Merge the 'image_fetched' table with the 'boston_safety_subsample' table by index.

Result: the proportions of 'safety' and 'subdistrict' in the 19,904 samples are essentially unchanged compared with those of the original 'boston_safety_subsample.csv'.

In [22]:
boston_safety_subsample = pd.read_csv("~/Desktop/ML1030/us_safety/boston_safety_subsample.csv")

In [23]:
boston_safety_subsample.head()

Unnamed: 0.1,Unnamed: 0,city,latitude,longitude,q-score,safety,Coordinates,OBJECTID,SUBDISTRIC
0,25915,Boston,42.348724,-71.071495,28.634125,1,POINT (-71.071495 42.348724),54998,Residential
1,1131,Boston,42.288597,-71.065681,26.525396,1,POINT (-71.06568100000001 42.288597),55642,Residential
2,121773,Boston,42.295563,-71.138023,29.614868,1,POINT (-71.138023 42.295563),54450,Residential
3,7005,Boston,42.312973,-71.076004,30.076191,1,POINT (-71.076004 42.312973),55523,Residential
4,6342,Boston,42.311348,-71.060051,26.038879,1,POINT (-71.060051 42.311348),55604,Residential


In [28]:
boston_safety_subsample.shape

(20000, 9)

In [24]:
image_fetched = metadata[metadata['status'] == 'OK']

In [26]:
image_fetched.shape

(19904, 6)

In [29]:
image_fetched_with_target = pd.merge(image_fetched, boston_safety_subsample, left_index=True, right_index=True)
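
An index-based merge silently duplicates or drops rows if either index is not unique or the two tables are misaligned. A small sketch of how the merge could be guarded, using toy frames (`fetched` and `labels` are hypothetical names with made-up values) and the `validate` argument of `pd.merge`:

```python
import pandas as pd

# Toy stand-ins: sample 1 has no image, so it is absent from `fetched`.
fetched = pd.DataFrame({"status": ["OK", "OK"]}, index=[0, 2])
labels = pd.DataFrame(
    {
        "safety": [1, 0, 1],
        "SUBDISTRIC": ["Residential", "Industrial", "Business"],
    }
)

# validate="one_to_one" raises MergeError if either index has duplicates.
merged = pd.merge(
    fetched,
    labels,
    left_index=True,
    right_index=True,
    how="inner",
    validate="one_to_one",
)

# Every fetched image should find exactly one label row.
assert len(merged) == len(fetched)
print(merged)
```

Applied to the real tables, the same row-count check would confirm that all 19,904 fetched samples kept their labels.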

In [39]:
image_fetched_with_target.head()

Unnamed: 0.1,_file,copyright,date,location,pano_id,status,Unnamed: 0,city,latitude,longitude,q-score,safety,Coordinates,OBJECTID,SUBDISTRIC
0,gsv_0.jpg,© Google,2018-08-01,"{'lat': 42.34874, 'lng': -71.0714033}",6oZeECWwjS0xTzdsN-ZE7g,OK,25915,Boston,42.348724,-71.071495,28.634125,1,POINT (-71.071495 42.348724),54998,Residential
1,gsv_1.jpg,© Google,2018-08-01,"{'lat': 42.28859413668265, 'lng': -71.06564443...",q5dOW8BMAXU0vEtSKfa4yA,OK,1131,Boston,42.288597,-71.065681,26.525396,1,POINT (-71.06568100000001 42.288597),55642,Residential
2,gsv_2.jpg,© Google,2017-07-01,"{'lat': 42.2957334, 'lng': -71.1380628}",fsVE7knT_rhVkKqyAGbkyg,OK,121773,Boston,42.295563,-71.138023,29.614868,1,POINT (-71.138023 42.295563),54450,Residential
3,gsv_3.jpg,© Google,2018-08-01,"{'lat': 42.31308354620529, 'lng': -71.07641430...",AJ7cTQ8rL1aADzsM11K39A,OK,7005,Boston,42.312973,-71.076004,30.076191,1,POINT (-71.076004 42.312973),55523,Residential
4,gsv_4.jpg,© Google,2018-08-01,"{'lat': 42.3112202, 'lng': -71.0601081}",MnI2F0n9yqnIGie2rQSAZw,OK,6342,Boston,42.311348,-71.060051,26.038879,1,POINT (-71.060051 42.311348),55604,Residential


#### boston_safety_subsample: count by 'SUBDISTRIC','safety'

In [35]:
boston_safety_subsample.groupby(['SUBDISTRIC','safety']).count()

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 0,city,latitude,longitude,q-score,Coordinates,OBJECTID
SUBDISTRIC,safety,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
Business,0,534,534,534,534,534,534,534
Business,1,534,534,534,534,534,534,534
Comm/Instit,0,210,210,210,210,210,210,210
Comm/Instit,1,210,210,210,210,210,210,210
Industrial,0,2275,2275,2275,2275,2275,2275,2275
Industrial,1,505,505,505,505,505,505,505
Miscellaneous,0,428,428,428,428,428,428,428
Miscellaneous,1,439,439,439,439,439,439,439
Mixed Use,0,663,663,663,663,663,663,663
Mixed Use,1,652,652,652,652,652,652,652


#### image_fetched_with_target: count by 'SUBDISTRIC', 'safety'

In [32]:
image_fetched_with_target.groupby(['SUBDISTRIC','safety']).count()

Unnamed: 0_level_0,Unnamed: 1_level_0,_file,copyright,date,location,pano_id,status,Unnamed: 0,city,latitude,longitude,q-score,Coordinates,OBJECTID
SUBDISTRIC,safety,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
Business,0,534,534,533,534,534,534,534,534,534,534,534,534,534
Business,1,533,533,529,533,533,533,533,533,533,533,533,533,533
Comm/Instit,0,200,200,200,200,200,200,200,200,200,200,200,200,200
Comm/Instit,1,209,209,209,209,209,209,209,209,209,209,209,209,209
Industrial,0,2249,2249,2249,2249,2249,2249,2249,2249,2249,2249,2249,2249,2249
Industrial,1,497,497,497,497,497,497,497,497,497,497,497,497,497
Miscellaneous,0,426,426,426,426,426,426,426,426,426,426,426,426,426
Miscellaneous,1,437,437,437,437,437,437,437,437,437,437,437,437,437
Mixed Use,0,660,660,657,660,660,660,660,660,660,660,660,660,660
Mixed Use,1,649,649,646,649,649,649,649,649,649,649,649,649,649


#### boston_safety_subsample: count by 'SUBDISTRIC'

In [37]:
boston_safety_subsample.groupby(['SUBDISTRIC']).count()

Unnamed: 0_level_0,Unnamed: 0,city,latitude,longitude,q-score,safety,Coordinates,OBJECTID
SUBDISTRIC,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
Business,1068,1068,1068,1068,1068,1068,1068,1068
Comm/Instit,420,420,420,420,420,420,420,420
Industrial,2780,2780,2780,2780,2780,2780,2780,2780
Miscellaneous,867,867,867,867,867,867,867,867
Mixed Use,1315,1315,1315,1315,1315,1315,1315,1315
Open Space,3572,3572,3572,3572,3572,3572,3572,3572
Residential,9978,9978,9978,9978,9978,9978,9978,9978


#### image_fetched_with_target: count by 'SUBDISTRIC'

In [36]:
image_fetched_with_target.groupby(['SUBDISTRIC']).count()

Unnamed: 0_level_0,_file,copyright,date,location,pano_id,status,Unnamed: 0,city,latitude,longitude,q-score,safety,Coordinates,OBJECTID
SUBDISTRIC,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
Business,1067,1067,1062,1067,1067,1067,1067,1067,1067,1067,1067,1067,1067,1067
Comm/Instit,409,409,409,409,409,409,409,409,409,409,409,409,409,409
Industrial,2746,2746,2746,2746,2746,2746,2746,2746,2746,2746,2746,2746,2746,2746
Miscellaneous,863,863,863,863,863,863,863,863,863,863,863,863,863,863
Mixed Use,1309,1309,1303,1309,1309,1309,1309,1309,1309,1309,1309,1309,1309,1309
Open Space,3549,3549,3549,3549,3549,3549,3549,3549,3549,3549,3549,3549,3549,3549
Residential,9961,9961,9959,9961,9961,9961,9961,9961,9961,9961,9961,9961,9961,9961


#### boston_safety_subsample: count by 'safety'

In [34]:
boston_safety_subsample.groupby(['safety']).count()

Unnamed: 0_level_0,Unnamed: 0,city,latitude,longitude,q-score,Coordinates,OBJECTID,SUBDISTRIC
safety,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
0,10885,10885,10885,10885,10885,10885,10885,10885
1,9115,9115,9115,9115,9115,9115,9115,9115


#### image_fetched_with_target: count by 'safety'

In [38]:
image_fetched_with_target.groupby(['safety']).count()

Unnamed: 0_level_0,_file,copyright,date,location,pano_id,status,Unnamed: 0,city,latitude,longitude,q-score,Coordinates,OBJECTID,SUBDISTRIC
safety,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
0,10821,10821,10816,10821,10821,10821,10821,10821,10821,10821,10821,10821,10821,10821
1,9083,9083,9075,9083,9083,9083,9083,9083,9083,9083,9083,9083,9083,9083
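
Raw counts are hard to compare by eye, so it may help to turn them into shares and look at the shift directly. The sketch below uses the counts printed in the tables above; `share_shift` is a hypothetical helper, not part of the original notebook:

```python
import pandas as pd

def share_shift(full_counts, fetched_counts):
    """Return per-group shares before and after, plus their absolute difference."""
    table = pd.DataFrame(
        {
            "full_share": full_counts / full_counts.sum(),
            "fetched_share": fetched_counts / fetched_counts.sum(),
        }
    )
    table["abs_diff"] = (table["full_share"] - table["fetched_share"]).abs()
    return table

groups = ["Business", "Comm/Instit", "Industrial", "Miscellaneous",
          "Mixed Use", "Open Space", "Residential"]

# Counts copied from the 'SUBDISTRIC' tables above (20,000 vs 19,904 rows).
subdistrict_shift = share_shift(
    pd.Series([1068, 420, 2780, 867, 1315, 3572, 9978], index=groups),
    pd.Series([1067, 409, 2746, 863, 1309, 3549, 9961], index=groups),
)

# Counts copied from the 'safety' tables above.
safety_shift = share_shift(
    pd.Series([10885, 9115], index=[0, 1]),
    pd.Series([10821, 9083], index=[0, 1]),
)

print(subdistrict_shift)
print(safety_shift)
```

With these counts, every share moves by well under one percentage point, which is consistent with the conclusion that the 96 missing samples did not distort the distribution.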


In [50]:
image_fetched_with_target.head(5)

Unnamed: 0.1,_file,copyright,date,location,pano_id,status,Unnamed: 0,city,latitude,longitude,q-score,safety,Coordinates,OBJECTID,SUBDISTRIC
0,gsv_0.jpg,© Google,2018-08-01,"{'lat': 42.34874, 'lng': -71.0714033}",6oZeECWwjS0xTzdsN-ZE7g,OK,25915,Boston,42.348724,-71.071495,28.634125,1,POINT (-71.071495 42.348724),54998,Residential
1,gsv_1.jpg,© Google,2018-08-01,"{'lat': 42.28859413668265, 'lng': -71.06564443...",q5dOW8BMAXU0vEtSKfa4yA,OK,1131,Boston,42.288597,-71.065681,26.525396,1,POINT (-71.06568100000001 42.288597),55642,Residential
2,gsv_2.jpg,© Google,2017-07-01,"{'lat': 42.2957334, 'lng': -71.1380628}",fsVE7knT_rhVkKqyAGbkyg,OK,121773,Boston,42.295563,-71.138023,29.614868,1,POINT (-71.138023 42.295563),54450,Residential
3,gsv_3.jpg,© Google,2018-08-01,"{'lat': 42.31308354620529, 'lng': -71.07641430...",AJ7cTQ8rL1aADzsM11K39A,OK,7005,Boston,42.312973,-71.076004,30.076191,1,POINT (-71.076004 42.312973),55523,Residential
4,gsv_4.jpg,© Google,2018-08-01,"{'lat': 42.3112202, 'lng': -71.0601081}",MnI2F0n9yqnIGie2rQSAZw,OK,6342,Boston,42.311348,-71.060051,26.038879,1,POINT (-71.060051 42.311348),55604,Residential


In [51]:
image_fetched_with_target.to_csv("~/Desktop/ML1030/us_safety/boston19904image_with_target.csv")

### Split the 19,904 samples 80:20 into Train and Test sets

In [46]:
from sklearn.model_selection import train_test_split

In [47]:
train, test = train_test_split(image_fetched_with_target, test_size=0.2)
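
The split above is purely random and unseeded, so the class proportions only match approximately and differ from run to run. If closer, reproducible proportions were wanted, one option would be a stratified, seeded split. This is a sketch on toy data, not the split used in this notebook; the combined `strata` key and the seed value are assumptions:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: two subdistricts, two safety classes, ten rows per combination.
toy = pd.DataFrame(
    {
        "SUBDISTRIC": ["Residential"] * 20 + ["Industrial"] * 20,
        "safety": ([0] * 10 + [1] * 10) * 2,
    }
)

# Stratify on the combination of both columns so each pair keeps its share.
strata = toy["SUBDISTRIC"] + "_" + toy["safety"].astype(str)

train_part, test_part = train_test_split(
    toy,
    test_size=0.2,
    stratify=strata,
    random_state=42,  # arbitrary seed, fixed only for reproducibility
)

print(strata.loc[test_part.index].value_counts())
```

Stratifying on a combined key requires every combination to have at least two rows; very rare combinations would need to be merged or handled separately.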

### Examine the Train set's proportions of safety and subdistrict

In [49]:
train.groupby(['SUBDISTRIC','safety']).count()

Unnamed: 0_level_0,Unnamed: 1_level_0,_file,copyright,date,location,pano_id,status,Unnamed: 0,city,latitude,longitude,q-score,Coordinates,OBJECTID
SUBDISTRIC,safety,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
Business,0,427,427,427,427,427,427,427,427,427,427,427,427,427
Business,1,424,424,420,424,424,424,424,424,424,424,424,424,424
Comm/Instit,0,164,164,164,164,164,164,164,164,164,164,164,164,164
Comm/Instit,1,177,177,177,177,177,177,177,177,177,177,177,177,177
Industrial,0,1837,1837,1837,1837,1837,1837,1837,1837,1837,1837,1837,1837,1837
Industrial,1,391,391,391,391,391,391,391,391,391,391,391,391,391
Miscellaneous,0,338,338,338,338,338,338,338,338,338,338,338,338,338
Miscellaneous,1,341,341,341,341,341,341,341,341,341,341,341,341,341
Mixed Use,0,544,544,542,544,544,544,544,544,544,544,544,544,544
Mixed Use,1,537,537,534,537,537,537,537,537,537,537,537,537,537


In [52]:
train.groupby(['SUBDISTRIC']).count()

Unnamed: 0_level_0,_file,copyright,date,location,pano_id,status,Unnamed: 0,city,latitude,longitude,q-score,safety,Coordinates,OBJECTID
SUBDISTRIC,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
Business,851,851,847,851,851,851,851,851,851,851,851,851,851,851
Comm/Instit,341,341,341,341,341,341,341,341,341,341,341,341,341,341
Industrial,2228,2228,2228,2228,2228,2228,2228,2228,2228,2228,2228,2228,2228,2228
Miscellaneous,679,679,679,679,679,679,679,679,679,679,679,679,679,679
Mixed Use,1081,1081,1076,1081,1081,1081,1081,1081,1081,1081,1081,1081,1081,1081
Open Space,2841,2841,2841,2841,2841,2841,2841,2841,2841,2841,2841,2841,2841,2841
Residential,7902,7902,7900,7902,7902,7902,7902,7902,7902,7902,7902,7902,7902,7902


In [74]:
train.groupby(['safety']).count()

Unnamed: 0_level_0,_file,copyright,date,location,pano_id,status,Unnamed: 0,city,latitude,longitude,q-score,Coordinates,OBJECTID,SUBDISTRIC
safety,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
0,8702,8702,8699,8702,8702,8702,8702,8702,8702,8702,8702,8702,8702,8702
1,7221,7221,7213,7221,7221,7221,7221,7221,7221,7221,7221,7221,7221,7221


### The Train set's proportions of safety and subdistrict are similar to those of 'boston_safety_subsample'

In [61]:
train.groupby(['SUBDISTRIC']).count()['_file'] / train.groupby(['SUBDISTRIC']).count()['_file'].sum()

SUBDISTRIC
Business         0.053445
Comm/Instit      0.021416
Industrial       0.139923
Miscellaneous    0.042643
Mixed Use        0.067889
Open Space       0.178421
Residential      0.496263
Name: _file, dtype: float64

In [75]:
train.groupby(['safety']).count()['_file'] / train.groupby(['safety']).count()['_file'].sum()

safety
0    0.546505
1    0.453495
Name: _file, dtype: float64

In [77]:
boston_safety_subsample.groupby(['SUBDISTRIC']).count()['Unnamed: 0'] / boston_safety_subsample.groupby(['SUBDISTRIC']).count()['Unnamed: 0'].sum()

SUBDISTRIC
Business         0.05340
Comm/Instit      0.02100
Industrial       0.13900
Miscellaneous    0.04335
Mixed Use        0.06575
Open Space       0.17860
Residential      0.49890
Name: Unnamed: 0, dtype: float64

In [76]:
boston_safety_subsample.groupby(['safety']).count()['Unnamed: 0'] / boston_safety_subsample.groupby(['safety']).count()['Unnamed: 0'].sum()

safety
0    0.54425
1    0.45575
Name: Unnamed: 0, dtype: float64

### Examine the Test set's proportions of safety and subdistrict

In [71]:
test.groupby(['SUBDISTRIC','safety']).count()

Unnamed: 0_level_0,Unnamed: 1_level_0,_file,copyright,date,location,pano_id,status,Unnamed: 0,city,latitude,longitude,q-score,Coordinates,OBJECTID
SUBDISTRIC,safety,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
Business,0,107,107,106,107,107,107,107,107,107,107,107,107,107
Business,1,109,109,109,109,109,109,109,109,109,109,109,109,109
Comm/Instit,0,36,36,36,36,36,36,36,36,36,36,36,36,36
Comm/Instit,1,32,32,32,32,32,32,32,32,32,32,32,32,32
Industrial,0,412,412,412,412,412,412,412,412,412,412,412,412,412
Industrial,1,106,106,106,106,106,106,106,106,106,106,106,106,106
Miscellaneous,0,88,88,88,88,88,88,88,88,88,88,88,88,88
Miscellaneous,1,96,96,96,96,96,96,96,96,96,96,96,96,96
Mixed Use,0,116,116,115,116,116,116,116,116,116,116,116,116,116
Mixed Use,1,112,112,112,112,112,112,112,112,112,112,112,112,112


In [55]:
test.groupby(['safety']).count()

Unnamed: 0_level_0,_file,copyright,date,location,pano_id,status,Unnamed: 0,city,latitude,longitude,q-score,Coordinates,OBJECTID,SUBDISTRIC
safety,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
0,2119,2119,2117,2119,2119,2119,2119,2119,2119,2119,2119,2119,2119,2119
1,1862,1862,1862,1862,1862,1862,1862,1862,1862,1862,1862,1862,1862,1862


### The Test set's proportions of safety and subdistrict are similar to those of 'boston_safety_subsample'

In [69]:
test.groupby(['SUBDISTRIC']).count()['_file'] / test.groupby(['SUBDISTRIC']).count()['_file'].sum()

SUBDISTRIC
Business         0.054258
Comm/Instit      0.017081
Industrial       0.130118
Miscellaneous    0.046220
Mixed Use        0.057272
Open Space       0.177845
Residential      0.517207
Name: _file, dtype: float64

In [68]:
test.groupby(['safety']).count()['_file'] / test.groupby(['safety']).count()['_file'].sum()

safety
0    0.532278
1    0.467722
Name: _file, dtype: float64

In [65]:
boston_safety_subsample.groupby(['SUBDISTRIC']).count()['Unnamed: 0'] / boston_safety_subsample.groupby(['SUBDISTRIC']).count()['Unnamed: 0'].sum()

SUBDISTRIC
Business         0.05340
Comm/Instit      0.02100
Industrial       0.13900
Miscellaneous    0.04335
Mixed Use        0.06575
Open Space       0.17860
Residential      0.49890
Name: Unnamed: 0, dtype: float64

In [70]:
boston_safety_subsample.groupby(['safety']).count()['Unnamed: 0'] / boston_safety_subsample.groupby(['safety']).count()['Unnamed: 0'].sum()

safety
0    0.54425
1    0.45575
Name: Unnamed: 0, dtype: float64

### Save Train and Test. We will use the Train set to build a Boston model and the Test set to evaluate the model's accuracy

In [79]:
train.shape

(15923, 15)

In [80]:
test.shape

(3981, 15)

In [81]:
train.to_csv("~/Desktop/ML1030/us_safety/boston_train.csv")

In [82]:
test.to_csv("~/Desktop/ML1030/us_safety/boston_test.csv")
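
Because `to_csv` writes the index as an unnamed first column, reading these files back with a plain `pd.read_csv` produces an extra `Unnamed: 0` column, as seen in the tables earlier in this notebook. A minimal sketch of a round trip that restores the original sample index instead, using an in-memory buffer and a toy frame (`toy_train` is illustrative only):

```python
import io
import pandas as pd

# Toy frame whose index carries the original sample numbers.
toy_train = pd.DataFrame(
    {"_file": ["gsv_3.jpg", "gsv_7.jpg"], "safety": [1, 0]},
    index=[3, 7],
)

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy_train.to_csv(buffer)
buffer.seek(0)

# index_col=0 turns the first column back into the index.
reloaded = pd.read_csv(buffer, index_col=0)
print(reloaded)
```

Keeping the index intact matters here because it is what links each row to its image file.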