<a href="https://colab.research.google.com/github/kavyajeetbora/Delhi_NCR_dashboard/blob/master/notebooks/OSM_buildings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Setup environment

In [2]:
!pip install osmnx



In [77]:
import os
import osmnx as ox
import geopandas as gpd
import pandas as pd
from tqdm.notebook import tqdm
from shapely import Polygon, MultiPolygon

## Download the data from OSM

In [4]:
%%time
# Download all building footprints in Delhi from OSM using osmnx

buildings = ox.features_from_place('Delhi', tags={'building': True})
gdf = gpd.GeoDataFrame(buildings).reset_index()

CPU times: user 55.7 s, sys: 9.16 s, total: 1min 4s
Wall time: 1min 8s


## Clean the data

- Remove all geometries that are not polygons, starting with point features (OSM nodes)

In [5]:
buildings = gdf[gdf['element_type']!='node']
buildings.shape



(226978, 351)

In [6]:
buildings.geometry.type.value_counts()

Polygon         226905
MultiPolygon        72
LineString           1
dtype: int64

Remove the remaining LineString geometry:

In [90]:
buildings = buildings[buildings.geometry.type.isin(['Polygon', 'MultiPolygon'])]

Now remove the unnecessary columns. First, compute the percentage of null values in each column:

In [19]:
%%time
data = []
for col in tqdm(buildings.columns):
    perc_na = buildings[col].isna().sum()/buildings.shape[0]*100
    data.append((col, perc_na))


df = pd.DataFrame(data, columns=['column', "percentage_null"])
df = df.sort_values(by='percentage_null', ascending=True)
df.sample(5)


CPU times: user 3.36 s, sys: 12.4 ms, total: 3.37 s
Wall time: 3.5 s


| | column | percentage_null |
|---|---|---|
| 330 | unisex | 99.999119 |
| 0 | element_type | 0.0 |
| 177 | official_name | 99.999119 |
| 12 | subway | 99.990307 |
| 86 | contact:website | 99.999559 |


Keep only the required columns and drop those with more than 90% null values. The 15 columns with the fewest nulls are:

In [92]:
df.head(15)['column']

0          element_type
1                 osmid
14             building
13             geometry
114               nodes
43      building:levels
16     addr:housenumber
171                type
345                ways
6                  name
28          addr:street
17        addr:postcode
33           addr:place
32            addr:city
133              height
Name: column, dtype: object
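
The 90% rule stated above can also be applied directly as a filter instead of reading it off the sorted list. The following is a minimal sketch on a toy table, not the notebook's actual data: the `toy` frame, its values, and the `threshold` and `keep_cols` names are assumptions made for illustration. It relies on `isna().mean()`, which returns the fraction of nulls per column without an explicit loop.

```python
import pandas as pd

# Toy stand-in for the buildings table (hypothetical values).
toy = pd.DataFrame({
    "osmid": [1, 2, 3, 4],
    "building": ["yes", "yes", "house", "yes"],
    "name": [None, None, None, "School"],
    "unisex": [None, None, None, None],
})

# Fraction of nulls per column, scaled to a percentage.
percentage_null = toy.isna().mean() * 100

# Keep only the columns whose null share does not exceed the threshold.
threshold = 90
keep_cols = percentage_null[percentage_null <= threshold].index.tolist()

print(percentage_null.sort_values())
print(keep_cols)
```

On the real data the same two lines would be run against `buildings`; the column list actually used in this notebook is still chosen by hand from the result.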

Only a few of these columns are needed:

In [93]:
top_cols = ['osmid', 'building', 'building:levels',  'height', 'name','geometry']
final_gdf = buildings[top_cols]
final_gdf.sample(5)

| | osmid | building | building:levels | height | name | geometry |
|---|---|---|---|---|---|---|
| 126638 | 351321928 | yes | | | | POLYGON ((77.20809 28.53128, 77.20828 28.53132... |
| 160725 | 351645398 | yes | | | | POLYGON ((77.10086 28.62810, 77.10101 28.62815... |
| 48386 | 350499316 | yes | | | | POLYGON ((77.14869 28.70433, 77.14872 28.70437... |
| 139225 | 351403820 | yes | | | | POLYGON ((77.06795 28.51407, 77.06802 28.51407... |
| 150822 | 351571860 | yes | | | | POLYGON ((77.29629 28.64386, 77.29637 28.64386... |


## Export the GDF to GeoPackage and GeoParquet

In [61]:
%%time
final_gdf.to_file('buildings.gpkg')

CPU times: user 1min 19s, sys: 1.04 s, total: 1min 20s
Wall time: 1min 26s


In [62]:
%%time
final_gdf.to_parquet('buildings.parquet')

CPU times: user 588 ms, sys: 163 ms, total: 751 ms
Wall time: 802 ms


## Check each file size

In [75]:
def fileSize(filename):
    file_stats = os.stat(filename)
    print(f'File Size of {filename} is {file_stats.st_size / (1024 * 1024):.2f} MB')

In [76]:
fileSize('buildings.gpkg')
fileSize('buildings.parquet')

File Size of buildings.gpkg is 50.71 MB
File Size of buildings.parquet is 20.42 MB


GeoParquet is the better format for this dataset: the file is about 2.5 times smaller (20.42 MB vs 50.71 MB) and it is written roughly **100X** faster (about 0.8 s vs 1 min 26 s of wall time).
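
Both ratios can be re-derived from the sizes and wall times printed in the cells above. This is a quick arithmetic check using only figures copied from those outputs; the variable names are illustrative, and the timings come from a single run, so the speedup should be read as approximate.

```python
# Figures copied from the cell outputs above.
gpkg_mb, parquet_mb = 50.71, 20.42
gpkg_wall_s = 60 + 26      # GeoPackage write: 1min 26s
parquet_wall_s = 0.802     # GeoParquet write: 802 ms

size_ratio = gpkg_mb / parquet_mb
speedup = gpkg_wall_s / parquet_wall_s

print(f"size ratio: {size_ratio:.2f}x, write speedup: {speedup:.0f}x")
```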