#### In this notebook: we used fetchimage.py and boston_safety_subsample.csv (20,000 samples from Boston) on GCP to fetch images.

#### The 'metadata.json' file written while fetching the images shows that images were obtained for 19,904 of the 20,000 samples. The proportions of 'safety' and 'subdistrict' were essentially unaffected by the 96 missing samples, so we proceed to feed the CNN with the 19,904 images.

In [1]:
import pandas as pd

In [3]:
metadata = pd.read_json('~/Desktop/metadata.json')

### Metadata for all 20,000 samples, obtained while fetching images

In [16]:
metadata.head()

Unnamed: 0,_file,copyright,date,location,pano_id,status
0,gsv_0.jpg,© Google,2018-08-01,"{'lat': 42.34874, 'lng': -71.0714033}",6oZeECWwjS0xTzdsN-ZE7g,OK
1,gsv_1.jpg,© Google,2018-08-01,"{'lat': 42.28859413668265, 'lng': -71.06564443...",q5dOW8BMAXU0vEtSKfa4yA,OK
2,gsv_2.jpg,© Google,2017-07-01,"{'lat': 42.2957334, 'lng': -71.1380628}",fsVE7knT_rhVkKqyAGbkyg,OK
3,gsv_3.jpg,© Google,2018-08-01,"{'lat': 42.31308354620529, 'lng': -71.07641430...",AJ7cTQ8rL1aADzsM11K39A,OK
4,gsv_4.jpg,© Google,2018-08-01,"{'lat': 42.3112202, 'lng': -71.0601081}",MnI2F0n9yqnIGie2rQSAZw,OK


In [10]:
metadata.shape

(20000, 6)

### Of the 20,000 samples, images were downloaded for 19,904; the remaining 96 returned "ZERO_RESULTS". The index of the 'metadata' table identifies each sample, so it shows which samples were fetched successfully

In [15]:
metadata.groupby('status').size()

status
OK              19904
ZERO_RESULTS       96
dtype: int64
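
As a minimal sketch of how the index can be used to list the samples that failed, the snippet below builds a small stand-in for the `metadata` table (the four rows are illustrative, not the real data) and separates the index labels by `status`. The names `ok_mask`, `fetched_idx` and `failed_idx` are assumptions introduced for this example.

```python
import pandas as pd

# Toy stand-in for the metadata table; values are illustrative only.
metadata = pd.DataFrame(
    {
        "_file": ["gsv_0.jpg", None, "gsv_2.jpg", None],
        "status": ["OK", "ZERO_RESULTS", "OK", "ZERO_RESULTS"],
    }
)

# Boolean mask of rows whose image was fetched successfully.
ok_mask = metadata["status"].eq("OK")

# Index labels of fetched and failed samples.
fetched_idx = metadata.index[ok_mask]
failed_idx = metadata.index[~ok_mask]

print(metadata["status"].value_counts())
print("failed rows:", failed_idx.tolist())
```

Applied to the real table, `failed_idx` would hold the 96 row labels (37, 170, 1034, ...) that returned "ZERO_RESULTS".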

In [9]:
metadata[metadata['status'] == 'OK'].shape

(19904, 6)

In [20]:
metadata[metadata['status'] == 'ZERO_RESULTS'].head(5)

Unnamed: 0,_file,copyright,date,location,pano_id,status
37,,,NaT,,,ZERO_RESULTS
170,,,NaT,,,ZERO_RESULTS
1034,,,NaT,,,ZERO_RESULTS
1380,,,NaT,,,ZERO_RESULTS
1609,,,NaT,,,ZERO_RESULTS


In [21]:
metadata[metadata['status'] == 'OK'].head(5)

Unnamed: 0,_file,copyright,date,location,pano_id,status
0,gsv_0.jpg,© Google,2018-08-01,"{'lat': 42.34874, 'lng': -71.0714033}",6oZeECWwjS0xTzdsN-ZE7g,OK
1,gsv_1.jpg,© Google,2018-08-01,"{'lat': 42.28859413668265, 'lng': -71.06564443...",q5dOW8BMAXU0vEtSKfa4yA,OK
2,gsv_2.jpg,© Google,2017-07-01,"{'lat': 42.2957334, 'lng': -71.1380628}",fsVE7knT_rhVkKqyAGbkyg,OK
3,gsv_3.jpg,© Google,2018-08-01,"{'lat': 42.31308354620529, 'lng': -71.07641430...",AJ7cTQ8rL1aADzsM11K39A,OK
4,gsv_4.jpg,© Google,2018-08-01,"{'lat': 42.3112202, 'lng': -71.0601081}",MnI2F0n9yqnIGie2rQSAZw,OK


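### Check whether the proportions of 'safety' and 'subdistrict' among the 19,904 samples were affected

Method: Merge the 'image_fetched' table with the 'boston_safety_subsample' table (loaded from 'boston_safety_subsample.csv') on the index.

Result: the proportions of 'safety' and 'subdistrict' among the 19,904 samples are essentially unchanged from those of the original 'boston_safety_subsample.csv'. For example, the share of safety = 0 is about 54.4% both before (10,885 of 20,000) and after (10,821 of 19,904).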

In [22]:
boston_safety_subsample = pd.read_csv("~/Desktop/ML1030/us_safety/boston_safety_subsample.csv")

In [23]:
boston_safety_subsample.head()

Unnamed: 0.1,Unnamed: 0,city,latitude,longitude,q-score,safety,Coordinates,OBJECTID,SUBDISTRIC
0,25915,Boston,42.348724,-71.071495,28.634125,1,POINT (-71.071495 42.348724),54998,Residential
1,1131,Boston,42.288597,-71.065681,26.525396,1,POINT (-71.06568100000001 42.288597),55642,Residential
2,121773,Boston,42.295563,-71.138023,29.614868,1,POINT (-71.138023 42.295563),54450,Residential
3,7005,Boston,42.312973,-71.076004,30.076191,1,POINT (-71.076004 42.312973),55523,Residential
4,6342,Boston,42.311348,-71.060051,26.038879,1,POINT (-71.060051 42.311348),55604,Residential


In [28]:
boston_safety_subsample.shape

(20000, 9)

In [24]:
image_fetched = metadata[metadata['status'] == 'OK']

In [26]:
image_fetched.shape

(19904, 6)

In [29]:
image_fetched_with_target = pd.merge(image_fetched, boston_safety_subsample, left_index=True, right_index=True)

In [39]:
image_fetched_with_target.head()

Unnamed: 0.1,_file,copyright,date,location,pano_id,status,Unnamed: 0,city,latitude,longitude,q-score,safety,Coordinates,OBJECTID,SUBDISTRIC
0,gsv_0.jpg,© Google,2018-08-01,"{'lat': 42.34874, 'lng': -71.0714033}",6oZeECWwjS0xTzdsN-ZE7g,OK,25915,Boston,42.348724,-71.071495,28.634125,1,POINT (-71.071495 42.348724),54998,Residential
1,gsv_1.jpg,© Google,2018-08-01,"{'lat': 42.28859413668265, 'lng': -71.06564443...",q5dOW8BMAXU0vEtSKfa4yA,OK,1131,Boston,42.288597,-71.065681,26.525396,1,POINT (-71.06568100000001 42.288597),55642,Residential
2,gsv_2.jpg,© Google,2017-07-01,"{'lat': 42.2957334, 'lng': -71.1380628}",fsVE7knT_rhVkKqyAGbkyg,OK,121773,Boston,42.295563,-71.138023,29.614868,1,POINT (-71.138023 42.295563),54450,Residential
3,gsv_3.jpg,© Google,2018-08-01,"{'lat': 42.31308354620529, 'lng': -71.07641430...",AJ7cTQ8rL1aADzsM11K39A,OK,7005,Boston,42.312973,-71.076004,30.076191,1,POINT (-71.076004 42.312973),55523,Residential
4,gsv_4.jpg,© Google,2018-08-01,"{'lat': 42.3112202, 'lng': -71.0601081}",MnI2F0n9yqnIGie2rQSAZw,OK,6342,Boston,42.311348,-71.060051,26.038879,1,POINT (-71.060051 42.311348),55604,Residential
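
The index merge is only valid if row i of 'metadata' describes the same point as row i of the CSV. One way to check this, sketched below, is to compare the panorama coordinates in `location` with the `latitude` and `longitude` columns. The three rows are copied from the output above (rows 0, 2 and 4, whose coordinates are not truncated). The tolerance `TOLERANCE_DEG` is an assumed threshold chosen for illustration, and the sketch assumes `location` holds Python dicts, as the printed output suggests.

```python
import pandas as pd

# Rows 0, 2 and 4 of the merged table, copied from the output above.
merged = pd.DataFrame(
    {
        "location": [
            {"lat": 42.34874, "lng": -71.0714033},
            {"lat": 42.2957334, "lng": -71.1380628},
            {"lat": 42.3112202, "lng": -71.0601081},
        ],
        "latitude": [42.348724, 42.295563, 42.311348],
        "longitude": [-71.071495, -71.138023, -71.060051],
    },
    index=[0, 2, 4],
)

# Expand the location dicts into 'lat' and 'lng' columns, keeping the index.
pano = pd.json_normalize(merged["location"].tolist())
pano.index = merged.index

# Assumed tolerance in degrees (hypothetical; tune for the real data).
TOLERANCE_DEG = 0.001

lat_gap = (pano["lat"] - merged["latitude"]).abs()
lng_gap = (pano["lng"] - merged["longitude"]).abs()

# Rows whose panorama is farther from the requested point than the tolerance.
misaligned = merged[(lat_gap > TOLERANCE_DEG) | (lng_gap > TOLERANCE_DEG)]
print("rows outside tolerance:", len(misaligned))
```

A non-empty `misaligned` frame on the full table would suggest that the two indexes do not refer to the same samples, or that some panoramas lie far from the requested coordinates.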


#### boston_safety_subsample: count by 'SUBDISTRIC','safety'

In [35]:
boston_safety_subsample.groupby(['SUBDISTRIC','safety']).count()

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 0,city,latitude,longitude,q-score,Coordinates,OBJECTID
SUBDISTRIC,safety,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
Business,0,534,534,534,534,534,534,534
Business,1,534,534,534,534,534,534,534
Comm/Instit,0,210,210,210,210,210,210,210
Comm/Instit,1,210,210,210,210,210,210,210
Industrial,0,2275,2275,2275,2275,2275,2275,2275
Industrial,1,505,505,505,505,505,505,505
Miscellaneous,0,428,428,428,428,428,428,428
Miscellaneous,1,439,439,439,439,439,439,439
Mixed Use,0,663,663,663,663,663,663,663
Mixed Use,1,652,652,652,652,652,652,652


#### image_fetched_with_target: count by 'SUBDISTRIC', 'safety'

In [32]:
image_fetched_with_target.groupby(['SUBDISTRIC','safety']).count()

Unnamed: 0_level_0,Unnamed: 1_level_0,_file,copyright,date,location,pano_id,status,Unnamed: 0,city,latitude,longitude,q-score,Coordinates,OBJECTID
SUBDISTRIC,safety,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
Business,0,534,534,533,534,534,534,534,534,534,534,534,534,534
Business,1,533,533,529,533,533,533,533,533,533,533,533,533,533
Comm/Instit,0,200,200,200,200,200,200,200,200,200,200,200,200,200
Comm/Instit,1,209,209,209,209,209,209,209,209,209,209,209,209,209
Industrial,0,2249,2249,2249,2249,2249,2249,2249,2249,2249,2249,2249,2249,2249
Industrial,1,497,497,497,497,497,497,497,497,497,497,497,497,497
Miscellaneous,0,426,426,426,426,426,426,426,426,426,426,426,426,426
Miscellaneous,1,437,437,437,437,437,437,437,437,437,437,437,437,437
Mixed Use,0,660,660,657,660,660,660,660,660,660,660,660,660,660
Mixed Use,1,649,649,646,649,649,649,649,649,649,649,649,649,649


#### boston_safety_subsample: count by 'SUBDISTRIC'

In [37]:
boston_safety_subsample.groupby(['SUBDISTRIC']).count()

Unnamed: 0_level_0,Unnamed: 0,city,latitude,longitude,q-score,safety,Coordinates,OBJECTID
SUBDISTRIC,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
Business,1068,1068,1068,1068,1068,1068,1068,1068
Comm/Instit,420,420,420,420,420,420,420,420
Industrial,2780,2780,2780,2780,2780,2780,2780,2780
Miscellaneous,867,867,867,867,867,867,867,867
Mixed Use,1315,1315,1315,1315,1315,1315,1315,1315
Open Space,3572,3572,3572,3572,3572,3572,3572,3572
Residential,9978,9978,9978,9978,9978,9978,9978,9978


#### image_fetched_with_target: count by 'SUBDISTRIC'

In [36]:
image_fetched_with_target.groupby(['SUBDISTRIC']).count()

Unnamed: 0_level_0,_file,copyright,date,location,pano_id,status,Unnamed: 0,city,latitude,longitude,q-score,safety,Coordinates,OBJECTID
SUBDISTRIC,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
Business,1067,1067,1062,1067,1067,1067,1067,1067,1067,1067,1067,1067,1067,1067
Comm/Instit,409,409,409,409,409,409,409,409,409,409,409,409,409,409
Industrial,2746,2746,2746,2746,2746,2746,2746,2746,2746,2746,2746,2746,2746,2746
Miscellaneous,863,863,863,863,863,863,863,863,863,863,863,863,863,863
Mixed Use,1309,1309,1303,1309,1309,1309,1309,1309,1309,1309,1309,1309,1309,1309
Open Space,3549,3549,3549,3549,3549,3549,3549,3549,3549,3549,3549,3549,3549,3549
Residential,9961,9961,9959,9961,9961,9961,9961,9961,9961,9961,9961,9961,9961,9961
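
The two 'SUBDISTRIC' tables above can be compared directly. The sketch below copies the row counts from both outputs into one frame and derives, for each subdistrict, the number of lost samples, the loss rate, and the change in share. The column names (`lost`, `loss_rate`, `share_diff`, ...) are assumptions introduced for this example.

```python
import pandas as pd

# Row counts per SUBDISTRIC, copied from the two groupby outputs above.
counts = pd.DataFrame(
    {
        "original": [1068, 420, 2780, 867, 1315, 3572, 9978],
        "fetched": [1067, 409, 2746, 863, 1309, 3549, 9961],
    },
    index=[
        "Business", "Comm/Instit", "Industrial", "Miscellaneous",
        "Mixed Use", "Open Space", "Residential",
    ],
)

# Samples without an image, in absolute terms and relative to the class size.
counts["lost"] = counts["original"] - counts["fetched"]
counts["loss_rate"] = counts["lost"] / counts["original"]

# Share of each subdistrict before and after dropping the failed samples.
counts["original_share"] = counts["original"] / counts["original"].sum()
counts["fetched_share"] = counts["fetched"] / counts["fetched"].sum()
counts["share_diff"] = counts["fetched_share"] - counts["original_share"]

print(counts.round(4))
```

On these counts no subdistrict's share moves by more than about 0.2 percentage points, although the loss rate itself is uneven: Comm/Instit loses 11 of its 420 samples, the highest rate of any class.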


#### boston_safety_subsample: count by 'safety'

In [34]:
boston_safety_subsample.groupby(['safety']).count()

Unnamed: 0_level_0,Unnamed: 0,city,latitude,longitude,q-score,Coordinates,OBJECTID,SUBDISTRIC
safety,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
0,10885,10885,10885,10885,10885,10885,10885,10885
1,9115,9115,9115,9115,9115,9115,9115,9115


#### image_fetched_with_target: count by 'safety'

In [38]:
image_fetched_with_target.groupby(['safety']).count()

Unnamed: 0_level_0,_file,copyright,date,location,pano_id,status,Unnamed: 0,city,latitude,longitude,q-score,Coordinates,OBJECTID,SUBDISTRIC
safety,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
0,10821,10821,10816,10821,10821,10821,10821,10821,10821,10821,10821,10821,10821,10821
1,9083,9083,9075,9083,9083,9083,9083,9083,9083,9083,9083,9083,9083,9083
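
The same comparison can be run on the frames themselves rather than on copied counts. The sketch below defines a hypothetical helper, `share_shift`, that uses `value_counts(normalize=True)` to compare class shares of one column in two frames. Since the real files are not available here, the two frames are stand-ins rebuilt from the 'safety' counts reported above.

```python
import pandas as pd


def share_shift(original: pd.DataFrame, subset: pd.DataFrame, column: str) -> pd.DataFrame:
    """Compare the class shares of `column` in two frames (hypothetical helper)."""
    shares = pd.DataFrame(
        {
            "original": original[column].value_counts(normalize=True),
            "subset": subset[column].value_counts(normalize=True),
        }
    ).fillna(0.0)  # a class absent from one frame gets a share of 0
    shares["diff"] = shares["subset"] - shares["original"]
    return shares.sort_index()


# Stand-in frames rebuilt from the 'safety' counts reported above.
original = pd.DataFrame({"safety": [0] * 10885 + [1] * 9115})
fetched = pd.DataFrame({"safety": [0] * 10821 + [1] * 9083})

result = share_shift(original, fetched, "safety")
print(result.round(4))
```

In the notebook itself, the call would presumably be `share_shift(boston_safety_subsample, image_fetched_with_target, 'safety')`, and likewise with `'SUBDISTRIC'`.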
