::: {#fig-etl}

![](images/geospatial.png){fig-align="left" width=30%}

Image by Henrikki Tenkanen, Vuokko Heikinheimo, David Whipp

:::

## Environment setting

In [1]:
# import libraries
import numpy as np
import pandas as pd
import polars as pl
import duckdb as db
import folium
from great_tables import GT, md
from warnings import filterwarnings
filterwarnings('ignore')

## Data collection

In [2]:
conn = db.connect('datasets/geospatial.db')

In [3]:
conn.sql('show tables')

┌─────────┐
│  name   │
│ varchar │
├─────────┤
│ zomato  │
└─────────┘

In [4]:
data = conn.sql('select * from zomato').pl()

In [5]:
data.columns

['url',
 'address',
 'name',
 'online_order',
 'book_table',
 'rate',
 'votes',
 'phone',
 'location',
 'rest_type',
 'dish_liked',
 'cuisines',
 'approx_cost(for two people)',
 'reviews_list',
 'menu_item',
 'listed_in(type)',
 'listed_in(city)']

In [6]:
data.shape

(51717, 17)

## Data preprocessing

In [7]:
data.is_duplicated().sum()

0

In [8]:
data.select(pl.all().is_null().sum()).to_dicts()

[{'url': 0,
  'address': 0,
  'name': 0,
  'online_order': 0,
  'book_table': 0,
  'rate': 7775,
  'votes': 0,
  'phone': 1208,
  'location': 21,
  'rest_type': 227,
  'dish_liked': 28078,
  'cuisines': 45,
  'approx_cost(for two people)': 346,
  'reviews_list': 0,
  'menu_item': 0,
  'listed_in(type)': 0,
  'listed_in(city)': 0}]

In [9]:
# Only a few rows are missing the location feature, so we can drop them
data = data.drop_nulls(subset=pl.col('location'))

In [10]:
#| label: tbl-head
#| tbl-cap: "Zomato Restaurants from Singh, S (2024) Geospatial Data Science in Python"
(
    GT(data.select('address','name','rate','votes','location','rest_type','dish_liked','cuisines').head(3))
    .tab_header(
        title=md('Zomato Restaurants')
    )
    .cols_width(
        cases={'rate':'50px',
              }
               )
    .tab_source_note(source_note=md('<br> *Source: Shan Singh*'))
)

Zomato Restaurants

| address | name | rate | votes | location | rest_type | dish_liked | cuisines |
|---|---|---|---|---|---|---|---|
| 942, 21st Main Road, 2nd Stage, Banashankari, Bangalore | Jalsa | 4.1/5 | 775 | Banashankari | Casual Dining | Pasta, Lunch Buffet, Masala Papad, Paneer Lajawab, Tomato Shorba, Dum Biryani, Sweet Corn Soup | North Indian, Mughlai, Chinese |
| 2nd Floor, 80 Feet Road, Near Big Bazaar, 6th Block, Kathriguppe, 3rd Stage, Banashankari, Bangalore | Spice Elephant | 4.1/5 | 787 | Banashankari | Casual Dining | Momos, Lunch Buffet, Chocolate Nirvana, Thai Green Curry, Paneer Tikka, Dum Biryani, Chicken Biryani | Chinese, North Indian, Thai |
| 1112, Next to KIMS Medical College, 17th Cross, 2nd Stage, Banashankari, Bangalore | San Churro Cafe | 3.8/5 | 918 | Banashankari | Cafe, Casual Dining | Churros, Cannelloni, Minestrone Soup, Hot Chocolate, Pink Sauce Pasta, Salsa, Veg Supreme Pizza | Cafe, Mexican, Italian |

*Source: Shan Singh*


In [11]:
# create a copy
df = data.clone()

Let's make every place name more specific so that the geocoder returns more accurate geographic coordinates.

In [12]:
df = df.with_columns(
    location=(pl.col('location') + ', Bangalore, Karnataka, India')
)

In [13]:
df.select('location').sample(5).to_dicts()

[{'location': 'HSR, Bangalore, Karnataka, India'},
 {'location': 'BTM, Bangalore, Karnataka, India'},
 {'location': 'BTM, Bangalore, Karnataka, India'},
 {'location': 'BTM, Bangalore, Karnataka, India'},
 {'location': 'BTM, Bangalore, Karnataka, India'}]

In [14]:
df.schema

Schema([('url', String),
        ('address', String),
        ('name', String),
        ('online_order', Boolean),
        ('book_table', Boolean),
        ('rate', String),
        ('votes', Int64),
        ('phone', String),
        ('location', String),
        ('rest_type', String),
        ('dish_liked', String),
        ('cuisines', String),
        ('approx_cost(for two people)', String),
        ('reviews_list', String),
        ('menu_item', String),
        ('listed_in(type)', String),
        ('listed_in(city)', String)])

## Extract coordinates from data

First, we will extract latitudes and longitudes using the 'location' feature.

In [15]:
rest_loc = pl.DataFrame()

In [16]:
rest_loc = pl.DataFrame({'name': df.select('location').unique()})

In [17]:
rest_loc.sample(5).to_dicts()

[{'name': 'Jakkur, Bangalore, Karnataka, India'},
 {'name': 'Kalyan Nagar, Bangalore, Karnataka, India'},
 {'name': 'RT Nagar, Bangalore, Karnataka, India'},
 {'name': 'Koramangala 7th Block, Bangalore, Karnataka, India'},
 {'name': 'Kaggadasapura, Bangalore, Karnataka, India'}]

In [18]:
# Nominatim is a tool to search OpenStreetMap data by address or location
from geopy.geocoders import Nominatim

In [19]:
geolocator = Nominatim(user_agent='app', timeout=None)

In [20]:
lat = [] # define lat list to store all the latitudes
lon = [] # define lon list to store all the longitudes

for name in pl.Series(rest_loc.select('name')):
    location = geolocator.geocode(name)
    
    if location is None:
        lat.append(np.nan)
        lon.append(np.nan)
        
    else:
        lat.append(location.latitude)
        lon.append(location.longitude)

In [21]:
lat[:10]

[13.0621474,
 12.9846713,
 12.981015523680384,
 12.985098650000001,
 12.9096941,
 nan,
 12.9067683,
 12.938455602031697,
 12.9176571,
 12.9489339]

In [22]:
rest_loc = rest_loc.with_columns(
    lat=pl.Series(lat), # For python lists, construct a Series
    lon=pl.Series(lon),
)

In [23]:
#| label: tbl-rest_loc
#| tbl-cap: "Zomato restaurants coordinates from Singh, S (2024) Geospatial Data Science in Python"
(
    GT(rest_loc.head(5), auto_align=True)
    .tab_header(
        title=md('Zomato Restaurants Coordinates')
    )
    .fmt_number(columns=['lat','lon'], decimals=4, use_seps=False)
    .cols_width(
        cases={'name':'200%',
               'lat':'90%',
               'lon':'90%',
              }
               )
    .tab_source_note(source_note=md('<br>*Source: Shan Singh*'))
)

Zomato Restaurants Coordinates

| name | lat | lon |
|---|---|---|
| Sahakara Nagar, Bangalore, Karnataka, India | 13.0621 | 77.5801 |
| Kaggadasapura, Bangalore, Karnataka, India | 12.9847 | 77.6791 |
| Infantry Road, Bangalore, Karnataka, India | 12.9810 | 77.6021 |
| CV Raman Nagar, Bangalore, Karnataka, India | 12.9851 | 77.6631 |
| JP Nagar, Bangalore, Karnataka, India | 12.9097 | 77.5866 |

*Source: Shan Singh*


Using geopy, we have found the latitude and longitude of the locations listed in the dataset, wherever the geocoder returned a match.
These coordinates are used to plot the maps.
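The lookup loop sends one request per location name, and Nominatim's public usage policy asks for at most one request per second. The sketch below is one possible way to throttle the calls and avoid repeating a lookup; `geocode_all` is a hypothetical helper, not part of geopy, and the stub geocoder and the "Unknown Place" name are made up so the example runs offline. In the notebook, `geocode_fn` would be `geolocator.geocode`. geopy also ships a `RateLimiter` helper in `geopy.extra.rate_limiter` that serves a similar purpose.

```python
import time
from types import SimpleNamespace


def geocode_all(names, geocode_fn, delay_seconds=1.0, sleep=time.sleep):
    """Geocode each distinct name once and pause between lookups.

    geocode_fn is any callable that returns an object exposing .latitude and
    .longitude, or None when there is no match.
    """
    coords = {}
    for name in names:
        if name in coords:
            continue  # already resolved, so no second request is sent
        match = geocode_fn(name)
        if match is None:
            coords[name] = (float("nan"), float("nan"))
        else:
            coords[name] = (match.latitude, match.longitude)
        sleep(delay_seconds)  # throttle consecutive requests
    return coords


# Stub geocoder for illustration only; it stands in for geolocator.geocode.
known = {
    "BTM, Bangalore, Karnataka, India": SimpleNamespace(
        latitude=12.9163603, longitude=77.604733
    ),
}
names = [
    "BTM, Bangalore, Karnataka, India",
    "Unknown Place, Bangalore, Karnataka, India",  # made-up name with no match
    "BTM, Bangalore, Karnataka, India",  # repeated on purpose
]
# The delay and sleep function are overridden so the demo finishes instantly.
coords = geocode_all(names, known.get, delay_seconds=0, sleep=lambda _: None)
print(coords)
```

Passing the geocoder in as an argument keeps the helper testable without network access.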

In [24]:
pl.Series(rest_loc.select('lat')).is_null().sum()

0

In [25]:
pl.Series(rest_loc.select('lat')).is_nan().sum()

2

In [26]:
rest_loc.filter(pl.col('lat').is_nan())

| name | lat | lon |
|---|---|---|
| str | f64 | f64 |
| "Sadashiv Nagar, Bangalore, Kar…" | NaN | NaN |
| "Rammurthy Nagar, Bangalore, Ka…" | NaN | NaN |


In [27]:
rest_loc = rest_loc.drop_nans()
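
The null count above is 0 while the NaN count is 2 because the geocoding loop appended np.nan for failed lookups. NaN is an ordinary floating-point value, whereas a null marks a missing entry, so the two checks look for different things. The following plain-Python sketch illustrates the same distinction with None standing in for a null; the list values are only illustrative.

```python
import math

# None plays the role of a null; float("nan") is a float that is "not a number".
lat = [13.0621474, float("nan"), None, 12.9846713]

null_count = sum(value is None for value in lat)
nan_count = sum(isinstance(value, float) and math.isnan(value) for value in lat)

# Keep only the entries that are neither missing nor NaN.
usable = [v for v in lat if v is not None and not math.isnan(v)]

print(null_count, nan_count, usable)
```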

## Where are most restaurants located in Bangalore?

In [28]:
rest_locations = pl.Series(df.select('location')).value_counts(sort=True, name='total')

In [29]:
rest_locations = rest_locations.rename({'location':'name', 'total':'count'})

In [30]:
#| label: tbl-rest_count
#| tbl-cap: "Zomato restaurants count from Singh, S (2024) Geospatial Data Science in Python"
(
    GT(rest_locations.head(), auto_align=True)
    .tab_header(
        title=md('Zomato Restaurants Count')
    )
    .cols_width(cases={'name': '200%',})
    .tab_source_note(source_note=md('<br>*Source: Shan Singh*'))
)

Zomato Restaurants Count

| name | count |
|---|---|
| BTM, Bangalore, Karnataka, India | 5124 |
| HSR, Bangalore, Karnataka, India | 2523 |
| Koramangala 5th Block, Bangalore, Karnataka, India | 2504 |
| JP Nagar, Bangalore, Karnataka, India | 2235 |
| Whitefield, Bangalore, Karnataka, India | 2144 |

*Source: Shan Singh*


These are the locations with the most restaurants.

Let's create a heatmap of these results to make them easier to read.

To perform spatial analysis, we need the latitude and longitude of every location, so let's merge both dataframes to get the geographic coordinates.

In [31]:
beng_rest_locations = rest_locations.join(rest_loc, on='name')

In [32]:
#| label: tbl-rest_count_coords
#| tbl-cap: "Zomato restaurants count and coordinates from Singh, S (2024) Geospatial Data Science in Python"
(
    GT(beng_rest_locations.head(), auto_align=True)
    .tab_header(
        title=md('Zomato Restaurants Count & coordinates')
    )
    .cols_width(cases={'name': '200%',})
    .tab_source_note(source_note=md('<br>*Source: Shan Singh*'))
)

Zomato Restaurants Count & coordinates

| name | count | lat | lon |
|---|---|---|---|
| BTM, Bangalore, Karnataka, India | 5124 | 12.9163603 | 77.604733 |
| HSR, Bangalore, Karnataka, India | 2523 | 12.90056335 | 77.64947470503677 |
| Koramangala 5th Block, Bangalore, Karnataka, India | 2504 | 12.9348429 | 77.6189768 |
| JP Nagar, Bangalore, Karnataka, India | 2235 | 12.9096941 | 77.5866067 |
| Whitefield, Bangalore, Karnataka, India | 2144 | 12.9696365 | 77.7497448 |

*Source: Shan Singh*
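
The Polars join used above defaults to an inner join, so any location whose name has no row in the coordinates table is silently left out of the result. The sketch below mimics that behaviour with plain dictionaries to make the effect visible; `inner_join` is a hypothetical helper, and the "Unmatched Place" entry and its count are made up for illustration.

```python
def inner_join(counts, coords):
    """Combine counts and coordinates, keeping only names present in both."""
    joined = []
    unmatched = []
    for name, count in counts.items():
        if name in coords:
            lat, lon = coords[name]
            joined.append({"name": name, "count": count, "lat": lat, "lon": lon})
        else:
            unmatched.append(name)  # would be dropped by an inner join
    return joined, unmatched


counts = {
    "BTM, Bangalore, Karnataka, India": 5124,
    "HSR, Bangalore, Karnataka, India": 2523,
    "Unmatched Place": 10,  # made-up entry without coordinates
}
coords = {
    "BTM, Bangalore, Karnataka, India": (12.9163603, 77.604733),
    "HSR, Bangalore, Karnataka, India": (12.90056335, 77.64947470503677),
}

joined, unmatched = inner_join(counts, coords)
print(len(joined), unmatched)
```

Listing the unmatched names is a cheap way to check how many restaurants disappear from the map because their location could not be geocoded.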


To display this on a map, we first need to create a base map on which the heatmap can be overlaid.

In [33]:
def Generate_basemap():
    basemap = folium.Map(location=[12.97 , 77.59], zoom_start=11)
    return basemap

In [34]:
# Geographic heat maps show where something occurs and highlight areas of high and low density
from folium.plugins import HeatMap

In [35]:
basemap = Generate_basemap()

In [36]:
beng_rest_locations = beng_rest_locations.to_pandas()

In [37]:
HeatMap(beng_rest_locations[['lat', 'lon' , 'count']]).add_to(basemap)

<folium.plugins.heat_map.HeatMap at 0x3058e0da0>

In [38]:
#| label: fig-heatmap
#| fig-cap: "Zomato Restaurants Heatmap"
basemap

::: {.callout-note}
You can interact with the above map by zooming in or out.
:::

The majority of the restaurants are in the city centre area.
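
HeatMap accepts rows of the form [lat, lon] or [lat, lon, weight], which is why the count column was passed as the third value. The sketch below shows one way to prepare such rows in plain Python, skipping NaN coordinates and scaling the weights to the 0 to 1 range. The scaling is an optional design choice assumed here, not something the notebook does, and `to_heat_points` is a hypothetical helper.

```python
import math


def to_heat_points(rows, weight_key="count"):
    """Return [lat, lon, weight] rows with weights scaled by the largest one."""
    valid = [
        r for r in rows
        if not (math.isnan(r["lat"]) or math.isnan(r["lon"]))
    ]
    if not valid:
        return []
    max_weight = max(r[weight_key] for r in valid)
    if max_weight == 0:
        return [[r["lat"], r["lon"], 0.0] for r in valid]
    return [[r["lat"], r["lon"], r[weight_key] / max_weight] for r in valid]


rows = [
    {"name": "BTM", "count": 5124, "lat": 12.9163603, "lon": 77.604733},
    {"name": "Whitefield", "count": 2144, "lat": 12.9696365, "lon": 77.7497448},
    {"name": "No match", "count": 5, "lat": float("nan"), "lon": float("nan")},
]

points = to_heat_points(rows)
print(points)
# The resulting list could then be passed to HeatMap(points).add_to(basemap).
```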

## Performing Marker Cluster Analysis

In [39]:
from folium.plugins import FastMarkerCluster

In [40]:
basemap = Generate_basemap()

In [41]:
FastMarkerCluster(beng_rest_locations[['lat', 'lon' , 'count']]).add_to(basemap)

<folium.plugins.fast_marker_cluster.FastMarkerCluster at 0x30a2cd280>

In [42]:
#| label: fig-marker-cluster
#| fig-cap: "Zomato Marker Cluster Map"
basemap

::: {.callout-note}
You can interact with the above map by zooming in or out.
:::

## Mapping all the markers of places of Bangalore

Plotting markers on the map:

Folium provides the folium.Marker() class for plotting markers on a map.

Pass the latitude and longitude of the location, set the popup and tooltip, and add the marker to the map.

Plotting markers is a two-step process:

1) Create a base map on which the markers will be placed.
2) Add the markers to it.

In [43]:
m = Generate_basemap()

In [44]:
# Add points to the map
for index, row in beng_rest_locations.iterrows():
    folium.Marker(location=[row['lat'], row['lon']], popup=row['count']).add_to(m)

In [64]:
#| label: fig-markers
#| fig-cap: "Zomato Restaurants Marker Map"
m

::: {.callout-note}
You can interact with the above map by zooming in or out.
:::

**Rate field cleaning**

To analyse where the restaurants with a high average rating are located, we first need to clean the 'rate' feature.

In [46]:
(
    df.filter(
        pl.col('rate').str.contains('^([^0-9]*)$')
    )
    .select('rate')
    .unique()
    .to_dicts()
)

[{'rate': '-'}, {'rate': 'NEW'}]

In [47]:
pl.Series(df.select('rate')).is_null().sum()

7754

In [48]:
# approximately 15% of the ratings are missing
pl.Series(df.select('rate')).is_null().sum()/pl.Series(df.select('rate')).len()*100

14.999226245744351

In [49]:
df = (
    df.drop_nulls(subset='rate')
        .with_columns(
            pl.col('rate').replace(['NEW', '-',], ['0', '0'])
        )
        .with_columns(
            rating=pl.col('rate').str.replace('/5', '')
        )
        .with_columns(
            pl.col('rating').str.strip_chars()
        )
        .cast({'rating': pl.Float32})
)

In [50]:
df.select('rating').unique().to_dicts()

[{'rating': 2.4000000953674316},
 {'rating': 2.299999952316284},
 {'rating': 0.0},
 {'rating': 3.5},
 {'rating': 4.099999904632568},
 {'rating': 4.400000095367432},
 {'rating': 4.699999809265137},
 {'rating': 4.900000095367432},
 {'rating': 2.700000047683716},
 {'rating': 3.9000000953674316},
 {'rating': 3.799999952316284},
 {'rating': 3.4000000953674316},
 {'rating': 3.0},
 {'rating': 2.5999999046325684},
 {'rating': 3.299999952316284},
 {'rating': 4.199999809265137},
 {'rating': 2.200000047683716},
 {'rating': 4.0},
 {'rating': 4.5},
 {'rating': 2.5},
 {'rating': 3.5999999046325684},
 {'rating': 3.700000047683716},
 {'rating': 2.0999999046325684},
 {'rating': 4.800000190734863},
 {'rating': 3.200000047683716},
 {'rating': 2.799999952316284},
 {'rating': 4.300000190734863},
 {'rating': 2.9000000953674316},
 {'rating': 2.0},
 {'rating': 4.599999904632568},
 {'rating': 3.0999999046325684},
 {'rating': 1.7999999523162842}]
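
Two details of the cleaning step are worth noting. Mapping 'NEW' and '-' to 0 means those restaurants count as zero-rated when averages are computed, and casting to Float32 is what produces values such as 4.099999904632568 in the output above. The sketch below shows an alternative under a different assumption, namely that unrated entries should be treated as missing; `parse_rating` is a hypothetical helper and not the notebook's method.

```python
import re
from typing import Optional

# Matches strings such as "4.1/5" or " 3.8 /5 " and captures the number.
_RATING = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*/\s*5\s*$")


def parse_rating(raw: Optional[str]) -> Optional[float]:
    """Return the numeric rating, or None for missing and unrated entries."""
    if raw is None:
        return None
    match = _RATING.match(raw)
    return float(match.group(1)) if match else None


samples = ["4.1/5", " 3.8 /5 ", "NEW", "-", None]
parsed = [parse_rating(s) for s in samples]

# Only real ratings take part in any later average.
valid = [r for r in parsed if r is not None]
print(parsed, valid)
```

Python floats are 64-bit, so 4.1 prints as 4.1 here; in Polars, casting to Float64 instead of Float32 would have a similar effect on the displayed values.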

## Highest-rated restaurants

In [51]:
df.select('name','rate','votes','location','dish_liked','rating').sort('rating', descending=True).head()

| name | rate | votes | location | dish_liked | rating |
|---|---|---|---|---|---|
| str | str | i64 | str | str | f32 |
| "Byg Brewski Brewing Company" | "4.9/5" | 16345 | "Sarjapur Road, Bangalore, Karn…" | "Cocktails, Dahi Kebab, Rajma C…" | 4.9 |
| "Byg Brewski Brewing Company" | "4.9/5" | 16345 | "Sarjapur Road, Bangalore, Karn…" | "Cocktails, Dahi Kebab, Rajma C…" | 4.9 |
| "Byg Brewski Brewing Company" | "4.9/5" | 16345 | "Sarjapur Road, Bangalore, Karn…" | "Cocktails, Dahi Kebab, Rajma C…" | 4.9 |
| "Belgian Waffle Factory" | "4.9/5" | 1746 | "Brigade Road, Bangalore, Karna…" | "Coffee, Berryblast, Nachos, Ch…" | 4.9 |
| "Belgian Waffle Factory" | "4.9/5" | 1746 | "Brigade Road, Bangalore, Karna…" | "Coffee, Berryblast, Nachos, Ch…" | 4.9 |

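The same restaurant appears several times in the table above even though the earlier duplicate check returned 0, presumably because the rows differ in columns that are not displayed here. One way to get a list of distinct restaurants is to deduplicate on name and location before ranking. The sketch below does this in plain Python; `top_rated` is a hypothetical helper and the location labels are shortened. In Polars, `df.unique(subset=['name', 'location'])` could play the same role.

```python
def top_rated(rows, n=5):
    """Keep the first row per (name, location), then rank by rating and votes."""
    seen = set()
    unique_rows = []
    for row in rows:
        key = (row["name"], row["location"])
        if key not in seen:
            seen.add(key)
            unique_rows.append(row)
    ranked = sorted(
        unique_rows, key=lambda r: (r["rating"], r["votes"]), reverse=True
    )
    return ranked[:n]


byg = {"name": "Byg Brewski Brewing Company", "location": "Sarjapur Road",
       "votes": 16345, "rating": 4.9}
waffle = {"name": "Belgian Waffle Factory", "location": "Brigade Road",
          "votes": 1746, "rating": 4.9}
rows = [byg, dict(byg), dict(byg), waffle, dict(waffle)]

best = top_rated(rows)
print([r["name"] for r in best])
```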

In [52]:
grp_df = (
    df.group_by('location').agg(pl.col('rating').mean(), pl.col('name').count())
        .rename({'location':'name', 'rating':'avg_rating', 'name':'count'})
)

In [53]:
grp_df

| name | avg_rating | count |
|---|---|---|
| str | f32 | u32 |
| "Brookefield, Bangalore, Karnat…" | 3.374697 | 581 |
| "Thippasandra, Bangalore, Karna…" | 3.095396 | 152 |
| "Electronic City, Bangalore, Ka…" | 3.04191 | 964 |
| "Koramangala 1st Block, Bangalo…" | 3.263946 | 965 |
| "Koramangala 3rd Block, Bangalo…" | 3.978755 | 193 |
| … | … | … |
| "RT Nagar, Bangalore, Karnataka…" | 3.278125 | 64 |
| "Jalahalli, Bangalore, Karnatak…" | 3.486956 | 23 |
| "Commercial Street, Bangalore, …" | 3.109709 | 309 |
| "Banaswadi, Bangalore, Karnatak…" | 3.362927 | 499 |


Let's consider only those locations with more than 400 rated restaurant listings.

In [54]:
temp_df = grp_df.filter(pl.col('count')>400)

In [55]:
temp_df.shape

(35, 3)
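
The aggregation and filter above compute, for each location, the mean rating and the number of listings, and then keep the locations above a threshold. The plain-Python sketch below spells out the same logic step by step; `average_rating_by_location` is a hypothetical helper, and the ratings in the toy data are made up.

```python
from collections import defaultdict


def average_rating_by_location(rows, min_count=400):
    """Map each location to (mean rating, listing count) if count > min_count."""
    ratings = defaultdict(list)
    for location, rating in rows:
        ratings[location].append(rating)
    return {
        location: (sum(values) / len(values), len(values))
        for location, values in ratings.items()
        if len(values) > min_count
    }


# Toy data: the ratings are invented for illustration only.
rows = [("BTM", 4.0), ("BTM", 3.0), ("BTM", 5.0), ("Jalahalli", 2.0)]

# A small threshold is used here so that the toy data produces a result.
summary = average_rating_by_location(rows, min_count=2)
print(summary)
```

The threshold guards against locations whose average is based on only a handful of listings.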

In [56]:
temp_df

| name | avg_rating | count |
|---|---|---|
| str | f32 | u32 |
| "Brookefield, Bangalore, Karnat…" | 3.374697 | 581 |
| "Electronic City, Bangalore, Ka…" | 3.04191 | 964 |
| "Koramangala 1st Block, Bangalo…" | 3.263946 | 965 |
| "Bannerghatta Road, Bangalore, …" | 3.271675 | 1324 |
| "HSR, Bangalore, Karnataka, Ind…" | 3.484063 | 2128 |
| … | … | … |
| "Richmond Road, Bangalore, Karn…" | 3.688013 | 634 |
| "Koramangala 7th Block, Bangalo…" | 3.747846 | 1089 |
| "Frazer Town, Bangalore, Karnat…" | 3.56488 | 578 |
| "Banaswadi, Bangalore, Karnatak…" | 3.362927 | 499 |


In [57]:
rest_loc

| name | lat | lon |
|---|---|---|
| str | f64 | f64 |
| "Sahakara Nagar, Bangalore, Kar…" | 13.062147 | 77.580061 |
| "Kaggadasapura, Bangalore, Karn…" | 12.984671 | 77.679091 |
| "Infantry Road, Bangalore, Karn…" | 12.981016 | 77.602133 |
| "CV Raman Nagar, Bangalore, Kar…" | 12.985099 | 77.663117 |
| "JP Nagar, Bangalore, Karnataka…" | 12.909694 | 77.586607 |
| … | … | … |
| "Seshadripuram, Bangalore, Karn…" | 12.993188 | 77.575342 |
| "Jakkur, Bangalore, Karnataka, …" | 13.078474 | 77.606894 |
| "Bommanahalli, Bangalore, Karna…" | 12.908945 | 77.623904 |
| "Kammanahalli, Bangalore, Karna…" | 13.009346 | 77.637709 |


Let's merge both dataframes so that we get the coordinates as well.

In [58]:
ratings_locations = temp_df.join(rest_loc, on='name')

In [59]:
ratings_locations

| name | avg_rating | count | lat | lon |
|---|---|---|---|---|
| str | f32 | u32 | f64 | f64 |
| "JP Nagar, Bangalore, Karnataka…" | 3.412929 | 1849 | 12.909694 | 77.586607 |
| "Koramangala 4th Block, Bangalo…" | 3.814351 | 864 | 12.932778 | 77.629405 |
| "Whitefield, Bangalore, Karnata…" | 3.384171 | 1693 | 12.969637 | 77.749745 |
| "Bannerghatta Road, Bangalore, …" | 3.271675 | 1324 | 12.951856 | 77.604011 |
| "Jayanagar, Bangalore, Karnatak…" | 3.61525 | 1718 | 12.939904 | 77.582638 |
| … | … | … | … | … |
| "Ulsoor, Bangalore, Karnataka, …" | 3.541396 | 901 | 12.977879 | 77.62467 |
| "Frazer Town, Bangalore, Karnat…" | 3.56488 | 578 | 12.998683 | 77.615525 |
| "Indiranagar, Bangalore, Karnat…" | 3.652168 | 1936 | 12.996298 | 77.545278 |
| "Koramangala 6th Block, Bangalo…" | 3.662465 | 1111 | 12.939025 | 77.623848 |


In [60]:
basemap = Generate_basemap()

In [61]:
ratings_locations = ratings_locations.to_pandas()

In [62]:
HeatMap(ratings_locations[['lat', 'lon' , 'avg_rating']]).add_to(basemap)

<folium.plugins.heat_map.HeatMap at 0x30a39bcb0>

In [63]:
#| label: fig-heatmap-rated
#| fig-cap: "Highest-rated Zomato Restaurants Heatmap"
basemap

::: {.callout-note}
You can interact with the above map by zooming in or out.
:::

## Conclusions

Python, with its powerful libraries and ease of use, has become an indispensable tool for geospatial analysis.
By leveraging the capabilities of libraries like GeoPandas, Shapely, and folium, data scientists can effectively explore and analyze geospatial data, gain valuable insights, and make informed decisions.

In this article, we have shown a brief overview of geospatial analysis in Python.

## References

* Singh, S (2024) [Spatial Analysis & Geospatial Data Science in Python](https://www.udemy.com/course/spatial-data-science-in-python)
* Tenkanen, H et al (2022) [Introduction to Python for Geographic Data Analysis](https://pythongis.org)

## Contact

**Jesus L. Monroy**
<br>
*Economist & Data Scientist*

[Medium](https://medium.com/@jesuslm) | [Linkedin](https://www.linkedin.com/in/j3sus-lm) | [Twitter](https://x.com/j3suslm)