# Dataset Augmentation Attempt

In [5]:
import numpy as np
import pandas as pd

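Additional data (images, Wikidata tags, geographic coordinates) was requested for the following areas:
- British Columbia
- Greater Vancouver
- Vancouver & Burnaby

The data was requested through the Wikidata Query Service, which is queried with SPARQL rather than SQL. The query results were saved as JSON files and are parsed below.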
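The query text itself is not reproduced in this notebook. As a rough sketch only, a request of this kind could be assembled as below. The function name `build_area_query`, the selected variable names, and the choice of properties (P131 for "located in the administrative territorial entity", P625 for coordinate location, P18 for image, P373 for Commons category) are assumptions, not the query that was actually run.

```python
def build_area_query(area_qid, limit=None):
    """Return a SPARQL string for items located in the given area (hypothetical helper)."""
    query = f"""
SELECT ?item ?coordinate_location ?image ?Commons_category WHERE {{
  ?item wdt:P131* wd:{area_qid} .
  ?item wdt:P625 ?coordinate_location .
  OPTIONAL {{ ?item wdt:P18 ?image . }}
  OPTIONAL {{ ?item wdt:P373 ?Commons_category . }}
}}"""
    if limit is not None:
        query += f"\nLIMIT {int(limit)}"
    return query.strip()

# Q1974 is assumed here to be the Wikidata item for British Columbia.
print(build_area_query("Q1974", limit=10))
```

The string can be pasted into the Query Service web interface; the sketch deliberately makes no network request.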

In [6]:
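# Extracts the Wikidata tag (e.g. Q316373) from an entity URL
def splitWikiTag(link):
    return link.rsplit('/')[-1]

# Parses the longitude from a coordinate string of the form "Point(lon lat)".
# Returns False when the string does not have two parts (e.g. it is empty).
def parseLongitude(loc):
    split_part = loc.rsplit(' ')
    if(len(split_part) < 2):
        return False
    splitLeft = split_part[0].split('(')
    return np.float32(splitLeft[1])

# Parses the latitude from a coordinate string of the form "Point(lon lat)".
# Returns False when the string does not have two parts (e.g. it is empty).
def parseLatitude(loc):
    split_part = loc.rsplit(' ')
    if(len(split_part) < 2):
        return False
    splitRight = split_part[1].rsplit(')')
    return np.float32(splitRight[0])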
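
Wikidata returns coordinates as WKT text of the form `Point(longitude latitude)`, which is why the first number is the longitude. A minimal alternative sketch, assuming that format, parses both values in one pass and returns NaN rather than `False` for missing input, so that the resulting columns stay numeric. The name `parsePoint` is hypothetical and is not used elsewhere in this notebook.

```python
import re
import numpy as np

# Matches "Point(<lon> <lat>)" with optional sign and decimals.
_POINT_RE = re.compile(r"Point\(\s*(-?\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)\s*\)")

def parsePoint(loc):
    """Return (lat, lon) as floats, or (nan, nan) if loc is not a WKT point."""
    match = _POINT_RE.fullmatch(loc.strip()) if isinstance(loc, str) else None
    if match is None:
        return (np.nan, np.nan)
    lon, lat = float(match.group(1)), float(match.group(2))
    return (lat, lon)

print(parsePoint("Point(-122.916 49.2789)"))
print(parsePoint(""))
```

With NaN in place of `False`, rows without coordinates can later be filtered with `dropna` instead of a comparison against a boolean.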

### Wikidata Query: British Columbia

In [56]:
# Load Data
bc_data = pd.read_json("./WikiData/query-BC.json", lines = False)

In [57]:
# Parse
bc_data['WikiData'] = bc_data['country'].apply(splitWikiTag)
bc_data['lat'] = bc_data['coordinate_location'].fillna("").apply(parseLatitude)
bc_data['lon'] = bc_data['coordinate_location'].fillna("").apply(parseLongitude)

In [58]:
# Drop columns that are no longer needed
bc_data = bc_data.drop(['coordinate_location', 'imageLabel','country'], axis = 1)
bc_data.shape

(38317, 5)

In [60]:
bc_data

Unnamed: 0,image,Commons_category,WikiData,lat,lon
0,,,Q316373,49.5167,-125.233
1,,,Q316380,49.45,-122.017
2,http://commons.wikimedia.org/wiki/Special:File...,Floe Lake,Q316383,51.053,-116.141
3,,,Q316387,48.9,-124.9
4,,,Q279989,49.0333,-117.383
...,...,...,...,...,...
38312,,,Q61281083,49.2,-122.03
38313,,,Q61281084,49.37,-123.1
38314,,,Q61281087,49.35,-123.12
38315,,,Q61281090,49.32,-123.08


In [718]:
# Number of Images available for British Columbia
bc_data.image.unique().shape

(1158,)
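
Note that `unique()` treats a missing value as one distinct entry, so the shape above is one higher than the number of distinct image URLs whenever at least one row has no image. A small sketch on made-up data shows the difference; `nunique()` excludes missing values by default.

```python
import numpy as np
import pandas as pd

# Toy column: two distinct URLs, one repeated, two rows without an image.
images = pd.Series(["a.jpg", np.nan, "b.jpg", "a.jpg", np.nan])

with_missing = images.unique().shape[0]   # counts NaN as a value
without_missing = images.nunique()        # ignores NaN

print(with_missing, without_missing)
```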



### Wikidata Query: Metro Vancouver

In [11]:
# Load Data
metro_van_data = pd.read_json("./WikiData/query_Matro_van.json", lines = False)

In [12]:
# Parse
metro_van_data['WikiData'] = metro_van_data['country'].apply(splitWikiTag)
metro_van_data['lat'] = metro_van_data['coordinate_location'].fillna("").apply(parseLatitude)
metro_van_data['lon'] = metro_van_data['coordinate_location'].fillna("").apply(parseLongitude)

In [13]:
# Drop columns that are no longer needed
metro_van_data = metro_van_data.drop(['coordinate_location', 'imageLabel', 'country'], axis = 1)
metro_van_data.shape

(57, 5)

In [719]:
metro_van_data.head(5)

Unnamed: 0,Commons_category,image,WikiData,lat,lon
0,Maple Ridge,,Q16740,49.2167,-122.6
1,Vancouver,http://commons.wikimedia.org/wiki/Special:File...,Q24639,49.25,-123.1
2,Peace Arch Park,http://commons.wikimedia.org/wiki/Special:File...,Q192531,49.0021,-122.757
3,"Richmond, British Columbia",http://commons.wikimedia.org/wiki/Special:File...,Q236837,49.1667,-123.133
4,"Burnaby, British Columbia",http://commons.wikimedia.org/wiki/Special:File...,Q244025,49.25,-122.949


In [720]:
# Number of Images for Metro Vancouver
metro_van_data.image.unique().shape

(41,)

### Wikidata Query: Vancouver

In [726]:
# Load Data
van_data = pd.read_json("./WikiData/query-Van.json", lines = False)

In [727]:
# Parse
van_data['WikiData'] = van_data['country'].apply(splitWikiTag)
van_data['lat'] = van_data['coordinate_location'].fillna("").apply(parseLatitude)
van_data['lon'] = van_data['coordinate_location'].fillna("").apply(parseLongitude)

# Drop columns that are no longer needed
van_data = van_data.drop(['coordinate_location', 'imageLabel', 'country'], axis = 1)
van_data.shape

(449, 5)

In [728]:
van_data.head(5)

Unnamed: 0,image,Commons_category,WikiData,lat,lon
0,,,Q30265517,49.2813,-123.094
1,,,Q30270127,49.2809,-123.119
2,,,Q30271056,49.2655,-123.25
3,,,Q30271079,49.2612,-123.106
4,,,Q30272159,49.2705,-123.134


In [731]:
# Number of Images Available for Vancouver
van_data.image.unique().shape

(164,)

### Wikidata Query: Burnaby

In [18]:
# Load Data
burnaby_data = pd.read_json("./WikiData/query-Burnaby.json", lines = False)

In [19]:
# Parse
burnaby_data['WikiData'] = burnaby_data['country'].apply(splitWikiTag)
burnaby_data['lat'] = burnaby_data['coordinate_location'].fillna("").apply(parseLatitude)
burnaby_data['lon'] = burnaby_data['coordinate_location'].fillna("").apply(parseLongitude)

# Drop columns that are no longer needed
burnaby_data = burnaby_data.drop(['coordinate_location', 'imageLabel', 'country'], axis = 1)
burnaby_data.shape

(44, 5)

In [20]:
burnaby_data.head(5)

Unnamed: 0,image,Commons_category,WikiData,lat,lon
0,http://commons.wikimedia.org/wiki/Special:File...,Simon Fraser University,Q201603,49.2789,-122.916
1,http://commons.wikimedia.org/wiki/Special:File...,Burnaby Lake Regional Park,Q731136,49.2424,-122.946
2,http://commons.wikimedia.org/wiki/Special:File...,British Columbia Institute of Technology,Q820796,49.25,-123.002
3,http://commons.wikimedia.org/wiki/Special:File...,Lougheed Town Centre Station,Q1826172,49.2485,-122.897
4,http://commons.wikimedia.org/wiki/Special:File...,,Q3484338,False,False


In [733]:
# Number of Images Available for Burnaby
burnaby_data.image.unique().shape

(11,)

# Joining OSM & WikiData

We join the provided OSM data with Wikidata entries to potentially augment our data with images.

In [76]:
# Concatenate all datasets retrieved from WikiData for all areas.
greater_van = [bc_data, van_data, burnaby_data, metro_van_data]
greater_van = pd.concat(greater_van, sort = True)
greater_van

Unnamed: 0,Commons_category,WikiData,image,lat,lon
0,,Q316373,,49.5167,-125.233
1,,Q316380,,49.45,-122.017
2,Floe Lake,Q316383,http://commons.wikimedia.org/wiki/Special:File...,51.053,-116.141
3,,Q316387,,48.9,-124.9
4,,Q279989,,49.0333,-117.383
...,...,...,...,...,...
52,,Q46941204,http://commons.wikimedia.org/wiki/Special:File...,49.1058,-123.303
53,Gurdwara Sahib Brookside,Q65228151,http://commons.wikimedia.org/wiki/Special:File...,49.1549,-122.835
54,Gurdwara Darbar Shri Guru Granth Sahib ji,Q65428660,http://commons.wikimedia.org/wiki/Special:File...,49.1634,-122.756
55,Gurdwara Shri Guru Singh Sabha,Q65428844,http://commons.wikimedia.org/wiki/Special:File...,49.1505,-122.858


In [735]:
# Total Images Available
greater_van.image.unique().shape

(1363,)

## Merging OSM Transit Data & Wikidata

In [265]:
# Returns the value stored under the given key in an OSM tags dict, or None if the key is absent
def parseTags(tag, string):
    return tag.get(string)

# Load Transit Data from OSM
osm_transit = pd.read_json("./OSM_cleaned/transit_only", lines = False)

# Create a WikiData column from the OSM 'wikidata' tag
osm_transit['WikiData'] = osm_transit['tags'].apply(parseTags, string = 'wikidata')

In [267]:
# Attempt Merging Data from Every Area Requested
transit_van = pd.merge(van_data, osm_transit, on = 'WikiData')
transit_bc = pd.merge(bc_data, osm_transit, on = 'WikiData')
transit_metro = pd.merge(metro_van_data, osm_transit, on = "WikiData")
transit_burnaby = pd.merge(burnaby_data, osm_transit, on = "WikiData")

transit_final = pd.concat([transit_van, transit_bc, transit_metro, transit_burnaby], sort = True)

In [269]:
# Drop Duplicate Entries and Write to a file
transit_final = transit_final.drop_duplicates(subset = 'name').drop(['lat_x', 'lon_x', 'Commons_category'], axis = 1)
transit_final.columns = ['WikiData', 'image', 'lat', 'lon', 'name', 'tags']
pd.DataFrame.to_json(transit_final, 'Prediction Data/subway_final', orient = 'records')
transit_final.head(5)

Unnamed: 0,WikiData,image,lat,lon,name,tags
0,Q1440907,http://commons.wikimedia.org/wiki/Special:File...,49.244208,-123.045922,29th Avenue,"{'wheelchair': 'yes', 'alt_name': '29th Avenue..."
1,Q1625437,http://commons.wikimedia.org/wiki/Special:File...,49.285964,-123.112285,Waterfront,"{'wheelchair': 'yes', 'alt_name': 'Waterfront ..."
2,Q2153961,http://commons.wikimedia.org/wiki/Special:File...,49.274538,-123.121905,Yaletown–Roundhouse,"{'wheelchair': 'yes', 'alt_name': 'Yaletown–Ro..."
4,Q2153968,http://commons.wikimedia.org/wiki/Special:File...,49.282015,-123.118936,Vancouver City Centre,"{'subway': 'yes', 'public_transport': 'stop_po..."
6,Q2258039,http://commons.wikimedia.org/wiki/Special:File...,49.249176,-123.11583,King Edward,"{'wheelchair': 'yes', 'alt_name': 'King Edward..."


In [270]:
# Number of Transit locations available as a result of merging
transit_final.shape

(23, 6)
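
To see how many OSM entries found a Wikidata match, rather than only the rows that survived the inner join, a left merge with `indicator=True` can help. The sketch below uses small made-up frames in place of `osm_transit` and `van_data`; the column names follow the notebook, the values are invented.

```python
import pandas as pd

osm = pd.DataFrame({
    "name": ["Stop A", "Stop B", "Stop C"],
    "WikiData": ["Q1", None, "Q3"],
})
wiki = pd.DataFrame({
    "WikiData": ["Q1", "Q2"],
    "image": ["a.jpg", "b.jpg"],
})

# Keep every OSM row; the _merge column records whether a Wikidata row was found.
merged = pd.merge(osm, wiki, on="WikiData", how="left", indicator=True)
matched = int((merged["_merge"] == "both").sum())
print(matched, "of", len(osm), "OSM entries matched")
```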

## OSM Entertainment Data & Wikidata

In [369]:
# Load Entertainment Data from OSM
osm_ent = pd.read_json("./OSM_cleaned/entr_only")
osm_ent['WikiData'] = osm_ent['tags'].apply(parseTags, string = 'wikidata')

In [370]:
# Attempt Merging
ent_van = pd.merge(van_data, osm_ent, on = 'WikiData')
ent_bc = pd.merge(bc_data, osm_ent, on = 'WikiData')
ent_metro = pd.merge(metro_van_data, osm_ent, on = "WikiData")
ent_burnaby = pd.merge(burnaby_data, osm_ent, on = "WikiData")

# Prepare frames: keep the OSM coordinates (lat_y, lon_y) and drop the Wikidata ones.
# Renaming by column name rather than by position keeps 'image' correct even when a
# source frame lists its columns in a different order, as metro_van_data does.
def prepareEnt(frame):
    frame = frame.drop(['lat_x', 'lon_x', 'Commons_category'], axis = 1)
    return frame.rename(columns = {'lat_y': 'lat', 'lon_y': 'lon'})

ent_van = prepareEnt(ent_van)
ent_bc = prepareEnt(ent_bc)
ent_metro = prepareEnt(ent_metro)
ent_burnaby = prepareEnt(ent_burnaby)

# Assemble the Frame
enterntainment_final = pd.concat([ent_van, ent_bc, ent_metro, ent_burnaby, osm_ent], sort = True)
enterntainment_final = enterntainment_final.drop_duplicates(subset = 'name')

In [583]:
# Write to a file
pd.DataFrame.to_json(enterntainment_final, 'Prediction Data/entr_final', orient = 'records')
enterntainment_final.head(5)

Unnamed: 0,WikiData,amenity,image,lat,lon,name,tags
0,Q38377943,theatre,,49.245578,-123.185447,Dunbar Theatre,"{'addr:housenumber': '4555', 'addr:street': 'D..."
1,Q38378154,theatre,,49.203711,-123.137523,Metro Theatre,"{'addr:housenumber': '1370', 'website': 'https..."
2,Q38378234,theatre,,49.281101,-123.09837,The Rickshaw,"{'addr:housenumber': '254', 'website': 'http:/..."
3,Q38378292,cinema,,49.281998,-123.124346,Scotiabank Theatre Vancouver,"{'addr:housenumber': '900', 'website': 'https:..."
4,Q38378369,theatre,,49.279991,-123.116458,The Centre For Performing Arts,"{'addr:province': 'BC', 'addr:housenumber': '7..."


In [372]:
# Number of Entertainment Locations available as a result of merging
enterntainment_final.shape

(78, 7)
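
Because `drop_duplicates` keeps the first occurrence by default, listing the merged frames before `osm_ent` in the `concat` call means that a row which gained an image is the one retained. It also means that distinct places sharing a name are collapsed into one. A sketch on invented data:

```python
import pandas as pd

# Stand-in for a merged frame: one entry that received an image.
merged = pd.DataFrame({"name": ["Theatre X"], "image": ["x.jpg"], "lat": [49.1]})
# Stand-in for the plain OSM frame: no image column, one name appearing twice.
osm_only = pd.DataFrame({
    "name": ["Theatre X", "Cinema Y", "Cinema Y"],
    "lat": [49.1, 49.2, 49.3],
})

combined = pd.concat([merged, osm_only], sort=True, ignore_index=True)
deduped = combined.drop_duplicates(subset="name")  # keep="first" is the default
print(deduped)
```

In this toy case the two "Cinema Y" rows have different coordinates, yet only the first survives; deduplicating on name together with coordinates would avoid that, at the cost of keeping the untagged copy of "Theatre X" if its coordinates differ slightly.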

## OSM Food Data & Wikidata
Data related to the food industry has only two entries that contain a WikiData tag, so we will use the OSM data without images instead.

In [531]:
# Load Food-Related OSM data
osm_food = pd.read_json("./OSM_cleaned/food_only", lines = False)
osm_food['WikiData'] = osm_food['tags'].apply(parseTags, string = 'brand:wikidata')

In [532]:
# Unique 'brand:wikidata' values found in the OSM food entries
osm_food['WikiData'].unique()

array([None, 'Q3472954', 'Q894578', 'Q2996960', 'Q7371139', 'Q191615',
       'Q17022490', 'Q2372909', 'Q2438391', 'Q8000869', 'Q7995414',
       'Q1185675', 'Q1189695', 'Q5503051', 'Q7744066', 'Q16997755',
       'Q23891278', 'Q1051593', 'Q7304886', 'Q55629932', 'Q3045312',
       'Q5326525', 'Q39054369', 'Q37158', 'Q175106', 'Q16829306',
       'Q3114287', 'Q64827032', 'Q64827025', 'Q862180', 'Q38076',
       'Q524757', 'Q177054', 'Q244457', 'Q7757289', 'Q7673972',
       'Q65148332', 'Q1131810', 'Q1141226', 'Q1936229', 'Q4943796',
       'Q2818848', 'Q7013558', 'Q6747622', 'Q5503082', 'Q2759586',
       'Q17111672', 'Q839466', 'Q7711610', 'Q3355059', 'Q7673969',
       'Q550258', 'Q1089932', 'Q465751', 'Q752941', 'Q1194143',
       'Q1393809', 'Q1466184', 'Q28229116', 'Q64876898', 'Q1066777',
       'Q66070360', 'Q7352199', 'Q1043486', 'Q1330910', 'Q7132349',
       'Q630866', 'Q7049671', 'Q7702453', 'Q64876684', 'Q17020087',
       'Q6816528', 'Q584601', 'Q8054358'], dtype=object)

In [540]:
# Attempt Merging
food_van = pd.merge(van_data, osm_food, on = 'WikiData')
food_van.columns = ['image', 'nameCommon', 'WikiData', 'lat', 'lon', 'lat1', 'lon1', 'amenity', 'name', 'tags']
food_van = food_van.drop(['lat1', 'lon1', 'nameCommon'], axis = 1)

food_bc = pd.merge(bc_data, osm_food, on = 'WikiData')
food_bc.columns = ['image', 'nameCommon', 'WikiData', 'lat', 'lon', 'lat1', 'lon1', 'amenity', 'name', 'tags']
food_bc = food_bc.drop(['lat1', 'lon1', 'nameCommon'], axis = 1)

food_metro = pd.merge(metro_van_data, osm_food, on = "WikiData")
food_metro.columns = ['image', 'nameCommon', 'WikiData', 'lat', 'lon', 'lat1', 'lon1', 'amenity', 'name', 'tags']
food_metro = food_metro.drop(['lat1', 'lon1', 'nameCommon'], axis = 1)

food_burnaby = pd.merge(burnaby_data, osm_food, on = "WikiData")
food_burnaby.columns = ['image', 'nameCommon', 'WikiData', 'lat', 'lon', 'lat1', 'lon1', 'amenity', 'name', 'tags']
food_burnaby = food_burnaby.drop(['lat1', 'lon1', 'nameCommon'], axis = 1)

# Assemble the Frame
food_final = pd.concat([food_van, food_bc, food_burnaby, food_metro, osm_food], sort = True)
food_final = food_final.drop_duplicates(subset = 'name')
food_final = food_final.drop('image', axis = 1)
food_final['cuisine'] = food_final['tags'].apply(parseTags, string = 'cuisine')

In [582]:
food_final.head(5)

Unnamed: 0,WikiData,amenity,lat,lon,name,tags,cuisine
13,,restaurant,49.12665,-123.18247,Best Bite Indian Cuisine,"{'addr:housenumber': '10-3891', 'phone': '+1-6...",indian
58,,restaurant,49.171276,-123.134873,Oriental Rice Noodle,"{'addr:housenumber': '8100', 'phone': '+1-604-...",chinese
65,Q3472954,restaurant,49.132705,-123.099186,Nando's,"{'brand:wikidata': 'Q3472954', 'cuisine': 'chi...",chicken;portuguese
66,Q894578,restaurant,49.13301,-123.095536,Boston Pizza,"{'brand:wikidata': 'Q894578', 'cuisine': 'pizz...",pizza
121,,restaurant,49.266575,-123.103744,Peaceful Restaurant,{'opening_hours': 'Su-Th 11:00-21:30; Fr-Sa 11...,


In [544]:
# Write to a file
pd.DataFrame.to_json(food_final, 'Prediction Data/food_final')

In [543]:
# Number of Food-Related Entities available as a result of merging
food_final.shape

(3157, 7)

## OSM Nightlife & Wikidata
Nightlife-related data has only 2 unique entries with a WikiData tag.<br>
Without an attribute to join on, there is no way to augment the data.

We use the plain OSM data instead.

In [577]:
# Load Nightlife-Related OSM data
osm_night = pd.read_json("./OSM_cleaned/night_only", lines = False)
osm_night['WikiData'] = osm_night['tags'].apply(parseTags, string = 'wikidata')

In [578]:
# Confirm that these are the only WikiData tags found in the OSM data.
osm_night['WikiData'].unique()

array([None, 'Q5153212', 'Q5060558'], dtype=object)
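
The share of tagged entries can be quantified directly, which makes the decision to skip the merge easier to justify. A minimal sketch, using a made-up frame in place of `osm_night`; the helper name `tagCoverage` is hypothetical.

```python
import pandas as pd

def tagCoverage(frame, column="WikiData"):
    """Return (tagged_rows, total_rows, fraction_tagged) for the given column."""
    tagged = int(frame[column].notna().sum())
    total = len(frame)
    fraction = tagged / total if total else 0.0
    return tagged, total, fraction

# Toy data: four entries, two of which carry a tag.
toy = pd.DataFrame({"name": ["A", "B", "C", "D"],
                    "WikiData": [None, "Q5153212", None, "Q5060558"]})
print(tagCoverage(toy))
```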

In [579]:
# Aborted as we lack the attribute to join on.

# night_only_van = pd.merge(van_data, osm_night, on = 'WikiData')
# night_only_bc = pd.merge(bc_data, osm_night, on = 'WikiData')
# night_only_metro = pd.merge(metro_van_data, osm_night, on = "WikiData")
# night_only_burnaby = pd.merge(burnaby_data, osm_night, on = "WikiData")

# night_final = pd.concat([night_only_bc, night_only_burnaby, night_only_metro, night_only_van], sort = True)
# night_final = night_final.drop_duplicates(subset = 'name').drop(['lat_x', 'lon_x', 'Commons_category'], axis = 1)

In [581]:
# Assemble the Frame and Write to a File
night_final = osm_night.drop('WikiData', axis = 1)
pd.DataFrame.to_json(night_final, 'Prediction Data/night_final')
night_final.head(5)

Unnamed: 0,lat,lon,amenity,name,tags
878,49.275391,-123.10964,nightclub,Privé,{}
2102,49.105807,-122.66035,nightclub,Gabby's Country Cabaret,{'name:en': 'Gabby's Country Cabaret'}
3484,49.280145,-123.131354,nightclub,Playhouse Nightclub,"{'addr:housenumber': '1240', 'addr:street': 'T..."
3485,49.279385,-123.129915,nightclub,Celebrities,"{'addr:housenumber': '1022', 'addr:street': 'D..."
4679,49.277291,-123.125726,nightclub,Aura,"{'addr:housenumber': '1180', 'website': 'https..."


In [575]:
# Number of Nightlife-Related entities available.
night_final.shape

(380, 5)

## OSM Historic spots & Wikidata
Historic data has only 3 entries with WikiData tags attached,<br> so we will use the original OSM data.

In [586]:
# Load Historic-Related OSM data
osm_hist = pd.read_json("./OSM_cleaned/historic_only", lines = False)
osm_hist['WikiData'] = osm_hist['tags'].apply(parseTags, string = 'wikidata')

In [590]:
# Aborted due to lack of attributes to join on

# hist_van = pd.merge(van_data, osm_hist, on = 'WikiData')
# hist_bc = pd.merge(bc_data, osm_hist, on = 'WikiData')
# hist_metro = pd.merge(metro_van_data, osm_hist, on = "WikiData")
# hist_burnaby = pd.merge(burnaby_data, osm_hist, on = "WikiData")
# 
# hist_final = pd.concat([hist_van, hist_bc, hist_metro, hist_burnaby], sort = True)
# hist_final = hist_final.drop_duplicates(subset = 'name').drop(['lat_x', 'lon_x', 'Commons_category'], axis = 1)

In [593]:
# Assemble the Frame and Write to a file.
hist_final = osm_hist.drop('WikiData', axis = 1)
hist_final.columns = ['lat', 'lon', 'name', 'tags', 'type']
pd.DataFrame.to_json(hist_final, 'Prediction Data/hist_final')
hist_final.head(5)

Unnamed: 0,lat,lon,name,tags,type
123,49.201003,-122.911255,"""Wait for Me Daddy"" War Memorial Sculpture",{'website_1': 'https://www.newwestcity.ca/publ...,monument
583,49.367672,-123.078425,F-86 Sabre Jet Crash Site Memorial,{'historic': 'memorial'},memorial
1304,49.115089,-122.90493,Water Tower,{'historic': 'ruins'},ruins
1408,49.198302,-122.594237,Heritage Area,"{'addr:housenumber': '10749', 'historic': 'yes...",yes
1426,49.279851,-123.108462,International Village Globe,{'historic': 'monument'},monument


In [595]:
# Number of Historic Spots available.
hist_final.shape

(200, 5)

## OSM Tourism data & Wikidata

In [687]:
# Load Tourism-Related OSM data
osm_tourism = pd.read_json("./OSM_cleaned/tourism_only", lines = False)
osm_tourism['WikiData'] = osm_tourism['tags'].apply(parseTags, string = 'wikidata')

In [688]:
# Inspect the WikiData tags present; only a few entries are tagged
osm_tourism.WikiData.unique()

array([None, 'Q27919857', 'Q4244994', 'Q841926', 'Q1224775', 'Q14874629',
       'Q7731049', 'Q14874772', 'Q14874600', 'Q7099175', 'Q5564477',
       'Q2510009', 'Q14874771', 'Q16967365', 'Q7230805', 'Q17118059',
       'Q2411231', 'Q5377412', 'Q867340'], dtype=object)

In [691]:
# Prepare the Frame
osm_tourism.columns = ['lat', 'lon', 'name', 'tags', 'type', 'WikiData']
tour_final = osm_tourism.drop('WikiData', axis = 1)

In [693]:
# Write to a File
pd.DataFrame.to_json(tour_final, 'Prediction Data/tour_final')
tour_final

Unnamed: 0,lat,lon,name,tags,type
0,49.279297,-122.920352,Simon Fraser University,"{'tourism': 'information', 'information': 'off...",information
1,49.153774,-122.525594,Eagle Acres Dairy,"{'tourism': 'attraction', 'addr:housenumber': ...",attraction
2,49.279103,-123.123671,HI Vancouver Central,"{'guest_house': 'hostel', 'addr:housenumber': ...",hostel
37,49.273064,-123.102547,Solar Bike Tree,{'tourism': 'artwork'},artwork
76,49.064828,-122.430143,8,"{'tourism': 'information', 'information': 'sig...",information
...,...,...,...,...,...
7794,49.342984,-122.832763,Barton Point,{'tourism': 'viewpoint'},viewpoint
7795,49.340373,-122.875030,Punta Del Este (East Point),{'tourism': 'viewpoint'},viewpoint
7796,49.347591,-122.879524,Vista #2,{'tourism': 'viewpoint'},viewpoint
7797,49.367366,-122.851936,Swan Falls,{'tourism': 'viewpoint'},viewpoint


## Results of Dataset Augmentation

Unfortunately, too many entries in our OSM data have no Wikidata tag, so we could not get the results we hoped for in terms of augmenting our initial dataset with images.

The British Columbia query returned 1158 unique values in its `image` column, and the four queries combined returned 1363. Both counts come from `unique()`, which also counts the missing value once.

Result:
- Food - No images added
- Night - No images added
- Tourism - No images added
- Historical - No images added

Even though augmentation did not succeed, the fact that some entries exist in both datasets supports their validity,<br>
i.e. it indicates that the OSM data does not consist solely of community-added entries.