<a href="https://colab.research.google.com/github/kavyajeetbora/AIS_data_analysis/blob/main/development/02_Get_POIs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install --quiet duckdb
!pip install --quiet jupysql
!pip install --quiet duckdb-engine
!pip install -q pydeck
!touch __init__.py

In [None]:
import geopandas as gpd
import pandas as pd
import pydeck as pdk
import shapely
import duckdb
import os
import time
import json

# Import jupysql Jupyter extension to create SQL cells
%load_ext sql

from lxml import etree
import requests

import geemap
import ee

ee.Authenticate()
ee.Initialize(project='kavyajeetbora-ee')



In [None]:
from tqdm.notebook import tqdm
import math
import re

import warnings
warnings.filterwarnings('ignore')

In [None]:
%config SqlMagic.autopandas = True
%config SqlMagic.feedback = False
%config SqlMagic.displaycon = False

%sql duckdb:///:memory:
# %sql duckdb:///path/to/file.db

In [None]:
%%sql
INSTALL httpfs;
INSTALL spatial;



Bounding Box:

In [None]:
W,S,E,N = 91.7730008646,26.1579902082,91.7868625208,26.1682553455
bbox = shapely.box(W,S,E,N)
lon,lat = [x[0] for x in bbox.centroid.xy]
print(lon,lat)

91.77993169269999 26.163122776850006


## Overture Maps latest release

In [None]:
try:
    url = r'https://docs.overturemaps.org/release/latest/'
    req = requests.get(url)
    html_text = req.text
    tree = etree.HTML(html_text)

    result = tree.xpath('//h1')
    latest_release = result[0].text
    print("latest Overture map release:" ,latest_release)

except Exception as e:
    print("Error", e)

latest Overture map release: 2024-10-23.0
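
Because the scraped heading is interpolated directly into the S3 path used later, a quick format check can catch a changed page layout early. The sketch below assumes release tags look like `2024-10-23.0` (a date, a dot, then a revision number), which matches the value printed above but is not a documented guarantee. `is_release_tag` is a hypothetical helper, not part of any library.

```python
import re

# Assumed tag shape: YYYY-MM-DD.N, inferred from the value scraped above
RELEASE_TAG = re.compile(r"\d{4}-\d{2}-\d{2}\.\d+")

def is_release_tag(text):
    """Return True if text (ignoring surrounding whitespace) looks like a release tag."""
    if not isinstance(text, str):
        return False
    return RELEASE_TAG.fullmatch(text.strip()) is not None

print(is_release_tag("2024-10-23.0"))
print(is_release_tag("Latest release"))
```

Running such a check before building the S3 path would turn a silent scraping failure into an explicit one.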


## Get Places (POIs)

[Refer to the place schema reference](https://docs.overturemaps.org/schema/reference/places/place/)

In [None]:
%%time

# Check if the file already exists and remove it
if os.path.exists('overture_places.geojson'):
    os.remove('overture_places.geojson')

places_data_url = rf"s3://overturemaps-us-west-2/release/{latest_release}/theme=places/type=*/*"

con = duckdb.connect()
con.execute("INSTALL spatial") # Install the spatial extension, which includes GDAL
con.execute("LOAD spatial")  # Load the spatial extension

df = con.sql(
    f'''
    COPY(
        SELECT
            pois.names.primary as name,
            pois.categories.primary as category,
            pois.confidence as confindence,
            pois.websites[1] as website,
            pois.addresses[1].postcode as Address,
            pois.phones[1] as phone_number,
            pois.geometry
        FROM read_parquet('{places_data_url}', filename=true, hive_partitioning=1) AS pois
        WHERE pois.bbox.xmin > {W}
        AND pois.bbox.xmax < {E}
        AND pois.bbox.ymin > {S}
        AND pois.bbox.ymax < {N}
    ) TO 'overture_places.geojson' WITH (FORMAT GDAL, DRIVER 'GeoJSON');
    '''
)
con.close() # Close the connection


CPU times: user 1.99 s, sys: 134 ms, total: 2.13 s
Wall time: 6.13 s
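
The `WHERE` clause in the query keeps only places whose own bounding box lies strictly inside the query window. As a rough illustration of that predicate outside SQL, the sketch below applies the same four comparisons to plain numbers. `inside_window` is a hypothetical helper and the two sample points are illustrative; the window values are the ones defined in the bounding box cell.

```python
# Query window from the bounding box cell: west, south, east, north
W, S, E, N = 91.7730008646, 26.1579902082, 91.7868625208, 26.1682553455

def inside_window(xmin, ymin, xmax, ymax):
    """Mirror the SQL filter: the feature bbox must sit strictly inside the window."""
    return xmin > W and xmax < E and ymin > S and ymax < N

# A point feature has a degenerate bbox (xmin == xmax, ymin == ymax)
print(inside_window(91.77417, 26.15808, 91.77417, 26.15808))
print(inside_window(91.7663208, 26.1672898, 91.7663208, 26.1672898))
```

Note that this is a containment test, so a feature whose bounding box merely overlaps the window edge is excluded.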


## Summarize the data

In [None]:
gdf = gpd.read_file("overture_places.geojson")
gdf['coordinates'] = list(zip(gdf['geometry'].x, gdf['geometry'].y))
gdf = gdf.drop('geometry',axis=1)
print(f"Total number of points of interest found: {gdf.shape[0]}")
gdf.head()

Total number of points of interest found: 343


Unnamed: 0,name,category,confindence,website,Address,phone_number,coordinates
0,Webdevlo,software_development,0.308564,https://www.webdevlo.com/,781005,,"(91.77417, 26.15808)"
1,Asian Thai Foods India Pvt Ltd,,0.926606,https://rumpum.net.in/,781123,,"(91.7736912, 26.1584426)"
2,"Ace School of Music, Guwahati",campus_building,0.275468,http://www.aceacoustics.co.in/ace_school_music...,"Assam, India",917399079027.0,"(91.7740118, 26.1584135)"
3,ASpritirajVlogs,weight_loss_center,0.308564,https://youtube.com/channel/ucvp8jbikmluwmzasn...,,,"(91.77416, 26.1583)"
4,Creative Reactive - Digital Marketing Agency,advertising_agency,0.77,https://creativereactive.com/,781005,6000490867.0,"(91.7742392, 26.1583028)"


In [None]:
gdf['category'].value_counts()

category,count
restaurant,17
hotel,12
resort,11
college_university,11
education,10
...,...
eat_and_drink,1
cinema,1
dairy_farm,1
elementary_school,1


Several NLP techniques and tools can help group these fine-grained categories into broader bins. A few approaches to consider:

1. Manual mapping with NLP assistance
   - Use pre-trained models like BERT or GPT to understand the context of each category and suggest appropriate bins.
   - Tools like spaCy or NLTK can preprocess and analyze the text to find similarities and differences between categories.
2. Clustering algorithms
   - Algorithms like K-means or hierarchical clustering can group similar categories together based on their features.
   - Libraries like Scikit-learn provide Python implementations of these clustering techniques.
3. Topic modeling
   - Techniques like Latent Dirichlet Allocation (LDA) can identify topics within the categories and group them accordingly.
   - Libraries like Gensim can be used for topic modeling.
4. Custom classification models
   - Train a custom classifier on labeled data to predict the bin for each category.
   - Use machine learning frameworks like TensorFlow or PyTorch to build and train the model.
5. Pre-built tools and APIs
   - Services like IBM Watson, Google Cloud Natural Language, or Microsoft Azure Text Analytics offer APIs that can categorize text based on predefined or custom categories.
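
As a minimal, dependency-free starting point for the manual-mapping idea, the sketch below bins Overture category strings by keyword. The bin names and keyword lists are assumptions made up for illustration, and this simple substring rule stands in for the model-based approaches listed above rather than implementing any of them.

```python
# Hypothetical bins and keywords; adjust to the categories actually present
BINS = {
    "food": ("restaurant", "eat_and_drink", "cafe", "bakery"),
    "lodging": ("hotel", "resort"),
    "education": ("school", "college", "university", "education"),
}

def bin_category(category):
    """Return the first bin whose keyword occurs in the category, else 'other'."""
    if not isinstance(category, str):
        return "other"  # covers None and NaN from missing values
    for bin_name, keywords in BINS.items():
        if any(keyword in category for keyword in keywords):
            return bin_name
    return "other"

for cat in ["restaurant", "college_university", "elementary_school", "dairy_farm", None]:
    print(cat, "->", bin_category(cat))
```

Applied to the notebook's data, this could be used as `gdf['category'].map(bin_category)`, with the `other` bin showing which categories still need a rule.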

## Downloading data from the Google Maps API

Some references:

1. https://serpapi.com/blog/how-we-reverse-engineered-google-maps-pagination/

In [None]:
PAGINATION_PARAMETERS_REGEX = re.compile(
    r"""
    \A                                      # Start of string
    (?:\s*)                                 # Initial possible whitespace
    @(?P<latitude>[-+]?\d{1,2}(?:[.,]\d+)?)  # Latitude: @10.78472
    (?:\s*,\s*)                             # Separator between latitude and longitude
    (?P<longitude>[-+]?\d{1,3}(?:[.,]\d+)?)  # Longitude: @-110
    (?:\s*,\s*)                             # Separator between longitude and zoom
    (?P<zoom>\d{1,2}(?:[.,]\d+)?)z           # Zoom: 9.22
    $                                      # End of string
    """, re.VERBOSE

)

EARTH_RADIUS_IN_METERS = 6371010
TILE_SIZE = 256
SCREEN_PIXEL_HEIGHT = 768
RADIUS_X_PIXEL_HEIGHT = 27.3611 * EARTH_RADIUS_IN_METERS * SCREEN_PIXEL_HEIGHT

def altitude(zoom, latitude):
    return str((RADIUS_X_PIXEL_HEIGHT * math.cos((latitude * math.pi) / 180)) / ((2 ** zoom) * TILE_SIZE))

def pagination(location_lat_long, start_offset):
    extracted_parameters = PAGINATION_PARAMETERS_REGEX.match(location_lat_long)

    if not extracted_parameters:
        return ""

    return (
        "!4m8!1m3!1d" +
        altitude(float(extracted_parameters['zoom']), float(extracted_parameters['latitude'])) +
        "!2d" +
        extracted_parameters['longitude'] +
        "!3d" +
        extracted_parameters['latitude'] +
        "!3m2!1i1024!2i768!4f13.1!7i20!8i" +
        (start_offset or "0") +
        "!10b1!12m25!1m1!18b1!2m3!5m1!6e2!20e3!6m16!4b1!23b1!26i1!27i1!41i2!45b1!49b1!63m0!67b1!73m0!74i150000!75b1!89b1!105b1!109b1!110m0!10b1!16b1!19m4!2m3!1i360!2i120!4i8!20m65!2m2!1i203!2i100!3m2!2i4!5b1!6m6!1m2!1i86!2i86!1m2!1i408!2i240!7m50!1m3!1e1!2b0!3e3!1m3!1e2!2b1!3e2!1m3!1e2!2b0!3e3!1m3!1e3!2b0!3e3!1m3!1e8!2b0!3e3!1m3!1e3!2b1!3e2!1m3!1e10!2b0!3e3!1m3!1e10!2b1!3e2!1m3!1e9!2b1!3e2!1m3!1e10!2b0!3e3!1m3!1e10!2b1!3e2!1m3!1e10!2b0!3e4!2b1!4b1!9b0!22m3!1s!2z!7e81!24m55!1m15!13m7!2b1!3b1!4b1!6i1!8b1!9b1!20b0!18m6!3b1!4b1!5b1!6b1!13b0!14b0!2b1!5m5!2b1!3b1!5b1!6b1!7b1!10m1!8e3!14m1!3b1!17b1!20m4!1e3!1e6!1e14!1e15!24b1!25b1!26b1!29b1!30m1!2b1!36b1!43b1!52b1!54m1!1b1!55b1!56m2!1b1!3b1!65m5!3m4!1m3!1m2!1i224!2i298!89b1!26m4!2m3!1i80!2i92!4i8!30m28!1m6!1m2!1i0!2i0!2m2!1i458!2i768!1m6!1m2!1i974!2i0!2m2!1i1024!2i768!1m6!1m2!1i0!2i0!2m2!1i1024!2i20!1m6!1m2!1i0!2i748!2m2!1i1024!2i768!34m16!2b1!3b1!4b1!6b1!8m4!1b1!3b1!4b1!6b1!9b1!12b1!14b1!20b1!23b1!25b1!26b1!37m1!1e81!42b1!46m1!1e9!47m0!49m1!3b1!50m53!1m49!2m7!1u3!4s!5e1!9s!10m2!3m1!1e1!2m7!1u2!4s!5e1!9s!10m2!2m1!1e1!2m7!1u16!4s!5e1!9s!10m2!16m1!1e1!2m7!1u16!4s!5e1!9s!10m2!16m1!1e2!3m11!1u16!2m4!1m2!16m1!1e1!2s!2m4!1m2!16m1!1e2!2s!3m1!1u2!3m1!1u3!4BIAE!2e2!3m1!3b1!59B!65m0!69i540"
    )

def place_url(url):
    # Extract the place ID from the result link; return None if it is missing
    match = re.search(r"placeid=([^&]+)", url)
    if match is None:
        return None
    return 'https://www.google.com/maps/place/?q=place_id:' + match.group(1)

def pagination_to_pandas(messy_data):
    data = json.loads(messy_data[:-6])
    javascript_array_str = data["d"][5:]
    python_list = json.loads(javascript_array_str)
    places_data = python_list[0][1]

    name = [i[14][11] for i in places_data[1:]]
    Addresses = [i[14][2] for i in places_data[1:]]
    website = [i[14][7] for i in places_data[1:]]
    phone_number = [i[14][178] for i in places_data[1:]]
    open_close_timing = [i[14][34] for i in places_data[1:]]
    reviews_rating = [i[14][4] for i in places_data[1:]]


    places_dict = {
        "name": [i for i in name],
        "address": [i[0] for i in Addresses],
        "website": [i[0] if i is not None else "None" for i in website],
        "phone_number": [i for i in phone_number],
        "open_close_timing": [i for i in open_close_timing],
        "reviews_rating": [i for i in reviews_rating],
        "latitude": [
            i[14][9][2] if i[14][9] is not None else None for i in places_data[1:]
        ],
        "longitude": [
            i[14][9][3] if i[14][9] is not None else None for i in places_data[1:]
        ],
    }
    places_dict["phone_number"] = [
        data[0][0] if data is not None and data[0] is not None else "None"
        for data in places_dict["phone_number"]
    ]

    timings = []
    for list_index in range(len(places_dict["open_close_timing"])):
        temp_dic = {}
        if places_dict["open_close_timing"][list_index]:
            if places_dict["open_close_timing"][list_index][1]:
                for i in range(7):
                    temp_dic[places_dict["open_close_timing"][list_index][1][i][0]] = (
                        places_dict["open_close_timing"][list_index][1][i][1][0].replace(
                            "\u202f", " "
                        )
                    )
        else:
            temp_dic["days"] = None
        timings.append(temp_dic)
    places_dict["open_close_timing"] = [i for i in timings]

    places_dict["ratings"] = [
        i[7] if i is not None else None for i in places_dict["reviews_rating"]
    ]
    places_dict["reviews"] = [
        i[3][1] if i is not None else None for i in places_dict["reviews_rating"]
    ]
    places_dict["gmap_link"] = [
        place_url(x) if x is not None else None
        for x in [i[3][0] if i is not None else None for i in places_dict["reviews_rating"]]
    ]
    # The raw rating/review entries are no longer needed once split out
    places_dict.pop("reviews_rating", None)

    ## Export to pandas dataframe
    df = pd.DataFrame(places_dict)
    new_order = [
        "name",
        "latitude",
        "longitude",
        "phone_number",
        "ratings",
        "reviews",
        "website",
        "address",
        "gmap_link",
        "open_close_timing"
    ]
    df = df[new_order]
    # regex=False treats the brackets literally on every pandas version
    df["address"] = df.address.astype(str).str.replace("[", "", regex=False).str.replace("]", "", regex=False)
    df = df.astype(object)
    df.ratings = df.ratings.astype(str)
    df.open_close_timing = df.open_close_timing.astype(str)

    return df
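
To make the first part of the `pb` value less opaque, the sketch below re-derives only the `!1d` altitude term for a given location string. It is a simplified, self-contained restatement of `altitude` and the regex above (decimal points only, no comma decimals or inner whitespace), using the same constants as the cell above. `viewport_altitude` is a hypothetical name introduced for illustration.

```python
import math
import re

# Simplified pattern for strings such as "@26.1631,91.7799,16z"
LOCATION = re.compile(
    r"@(?P<lat>[-+]?\d+(?:\.\d+)?),(?P<lon>[-+]?\d+(?:\.\d+)?),(?P<zoom>\d+(?:\.\d+)?)z"
)

EARTH_RADIUS_IN_METERS = 6371010
TILE_SIZE = 256
SCREEN_PIXEL_HEIGHT = 768
RADIUS_X_PIXEL_HEIGHT = 27.3611 * EARTH_RADIUS_IN_METERS * SCREEN_PIXEL_HEIGHT

def viewport_altitude(location):
    """Parse "@lat,lon,zoomz" and return the altitude term as a float, or None."""
    match = LOCATION.fullmatch(location.strip())
    if match is None:
        return None
    lat = float(match["lat"])
    zoom = float(match["zoom"])
    return RADIUS_X_PIXEL_HEIGHT * math.cos(math.radians(lat)) / (2 ** zoom * TILE_SIZE)

a16 = viewport_altitude("@26.1631,91.7799,16z")
a17 = viewport_altitude("@26.1631,91.7799,17z")
print(a16, a17)  # each extra zoom level halves the value
```

Since the term scales with `1 / 2 ** zoom`, changing `zoom_level` in the request cell directly changes the area the search is bound to.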

Fetch the paginated results and convert the messy response data to a DataFrame

In [None]:
place_name = 'restaurants'

zoom_level = 16
location_lat_long = f"@{lat},{lon},{zoom_level}z"

headers = {
      'User-Agent': 'Mozilla/5.0 (X11; CrOS x86_64 14541.0.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
      'Cookie': "place_your_cookie_here",
      'Referer': 'https://www.google.com/',
      'Accept' : '*/*',
      'Accept-Language' : 'en-US,en;q=0.9'
}

url = "https://www.google.com/search"

results = []

for i in tqdm(range(20,200,20)):
    start_offset = f'{i}'

    params = {
        'tbm': 'map',
        'authuser': '0',
        'hl': 'en',
        'pb': pagination(location_lat_long, start_offset),  # Encoded viewport plus the pagination offset
        'q': place_name,
        'tch': '1',
        'ech': '5',
        "near": f"{lat},{lon}",  # This is the key parameter for location binding
    }
    response = requests.get(url, params=params, headers=headers)
    messy_data = response.text

    df = pagination_to_pandas(messy_data)
    if df.shape[0]>0:
        results.append(df)
    else:
        print(f"No more data found at offset value: {i}")
        break

results = pd.concat(results)
results = results.drop_duplicates(subset=['gmap_link', 'name'])
results['coordinates'] = list(zip(results['longitude'], results['latitude']))
results.drop(['latitude', 'longitude'],axis=1, inplace=True)
print("Total Unique Restaurants found:",results.shape[0])

results.sample(10)


No more data found at offset value: 160
Total Unique Restaurants found: 128


Unnamed: 0,name,phone_number,ratings,reviews,website,address,gmap_link,open_close_timing,coordinates
7,Maa Manasha Hotel & Restaurant,+91 70028 68934,3.7,15 reviews,,5Q88+WG8,https://www.google.com/maps/place/?q=place_id:...,"{'Wednesday': '10 AM–10 PM', 'Thursday': '10 A...","(91.7663208, 26.1672898)"
8,DAWAT THE TREAT GUWAHATI,+91 94357 48987,4.0,48 reviews,,"46, GMC Hostel Rd",https://www.google.com/maps/place/?q=place_id:...,"{'Wednesday': '10 AM–10 PM', 'Thursday': '10 A...","(91.7756874, 26.1567723)"
14,Vaibhavam Sweets & Restaurant: Best Restaurant...,+91 60021 44544,4.1,244 reviews,https://www.vaibhavamsweets.com/,"191, Exotica Greens, Guwahati Central Building...",https://www.google.com/maps/place/?q=place_id:...,"{'Wednesday': '9 AM–10 PM', 'Thursday': '9 AM–...","(91.7795208, 26.165705799999998)"
4,Pay Per Minute,+91 94351 10827,3.1,109 reviews,,"First Floor, Spanish Garden",https://www.google.com/maps/place/?q=place_id:...,"{'Wednesday': '1–10 PM', 'Thursday': '1–10 PM'...","(91.7803431, 26.1632568)"
5,Munkey Houzz,,4.8,108 reviews,https://instagram.com/munkeyhouzz,"Munkey Houzz, 6th Floor, Mayur Heights",https://www.google.com/maps/place/?q=place_id:...,"{'Wednesday': '11:30 AM–11:45 PM', 'Thursday':...","(91.77091519999999, 26.161562)"
0,Maihang,+91 93659 63774,4.2,"2,176 reviews",,Public Health,https://www.google.com/maps/place/?q=place_id:...,"{'Wednesday': '11:30 AM–10 PM', 'Thursday': '1...","(91.7924063, 26.1515289)"
14,Cosy Kitchen,+91 84038 19968,4.0,"1,337 reviews",https://m.facebook.com/cosykitchen12/,ABC,https://www.google.com/maps/place/?q=place_id:...,"{'Wednesday': '12–11:30 PM', 'Thursday': '12–1...","(91.7701536, 26.163208899999997)"
14,Dosa Express Namma Chennai,+91 96598 55555,4.2,836 reviews,,Sugam Path,https://www.google.com/maps/place/?q=place_id:...,"{'Wednesday': '9 AM–10 PM', 'Thursday': '9 AM–...","(91.77698439999999, 26.173423099999997)"
8,Surabhi Restaurant,,4.9,17 reviews,,5R32+82M,https://www.google.com/maps/place/?q=place_id:...,"{'Wednesday': '7:30 AM–2:30 PM', 'Thursday': '...","(91.8001, 26.153395999999997)"
7,"DROP City Night Lounge - Terrace Pub, Bar & Re...",+91 88228 40734,4.1,576 reviews,,"4th Floor, Roodraksh Mall",https://www.google.com/maps/place/?q=place_id:...,"{'Wednesday': '12–11 PM', 'Thursday': '12–11 P...","(91.7678893, 26.166040199999998)"
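
The sampled rows show that `reviews` holds strings such as `2,176 reviews` and that `open_close_timing` was cast to a string. If numeric counts or the original dictionary are needed later, they can be recovered roughly as follows. `parse_review_count` and `parse_timing` are hypothetical helpers, and the input formats are assumed from the rows shown above.

```python
import ast
import re

def parse_review_count(text):
    """Turn '2,176 reviews' into 2176; return None when no number is present."""
    if not isinstance(text, str):
        return None
    match = re.search(r"\d[\d,]*", text)
    return int(match.group().replace(",", "")) if match else None

def parse_timing(text):
    """Rebuild the opening-hours dict from its str() form; return {} on failure."""
    try:
        value = ast.literal_eval(text)
    except (ValueError, SyntaxError, TypeError):
        return {}
    return value if isinstance(value, dict) else {}

print(parse_review_count("2,176 reviews"))
print(parse_timing("{'Wednesday': '10 AM–10 PM', 'Thursday': '10 AM–10 PM'}"))
```

`ast.literal_eval` only evaluates Python literals, which makes it a safer choice than `eval` for strings that originate from a scraped response.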


## Merge Data

Once the data from both sources has been fetched, concatenate it into a single DataFrame for visualization.

In [None]:
df_final = pd.concat([results, gdf])
print("Total nearby places found:",df_final.shape[0])
df_final.sample(6)

Total nearby places found: 471


Unnamed: 0,name,phone_number,ratings,reviews,website,address,gmap_link,open_close_timing,coordinates,category,confindence,Address
109,Deep's Fishing Store,+919706339249,,,,,,,"(91.77625, 26.16289)",sporting_goods,0.308564,781005.0
157,CJ Darcl Logistics Limited,18002124455,,,https://logistics-near-me.cjdarcl.com/cj-darcl...,,,,"(91.7795108, 26.1658235)",freight_and_cargo_service,0.77,781005.0
5,Naga Kitchen,+91 98642 68266,4.1,"1,087 reviews",https://www.facebook.com/nagakitchenaidc,"1, Prasanta Path",https://www.google.com/maps/place/?q=place_id:...,"{'Wednesday': '11 AM–10 PM', 'Thursday': '11 A...","(91.77760699999999, 26.1703191)",,,
122,"Bytecore - Laptop, Desktop & MacBook Repair Ex...",07578849910,,,http://bytecore.in/,,,,"(91.7739944, 26.1650181)",computer_store,0.77,781006.0
8,Surabhi Restaurant,,4.9,17 reviews,,5R32+82M,https://www.google.com/maps/place/?q=place_id:...,"{'Wednesday': '7:30 AM–2:30 PM', 'Thursday': '...","(91.8001, 26.153395999999997)",,,
323,Collection O 50144 Ir Luxuria Guwahati Central,01246201305,,,https://www.oyorooms.com/h/84390?utm_source=Bi...,,,,"(91.7819816, 26.1646603)",hotel,0.77,781024.0


## Visualize the data

In [None]:
tooltip = {
  "text": "🏷️ Name: {name}\n📞 Phone Number: {phone_number}\n⭐ Ratings: {ratings}\n📝 Reviews: {reviews}\n🌐 Website: {website}\n📍 Address: {address}"
}

In [None]:
# Define a layer to display on a map
layer = pdk.Layer(
    "ScatterplotLayer",
    results,
    pickable=True,
    opacity=0.8,
    stroked=True,
    filled=True,
    radius_scale=20,
    radius_min_pixels=1,
    radius_max_pixels=100,
    line_width_min_pixels=1,
    get_position="coordinates",
    get_fill_color=[255, 140, 0],
    get_line_color=[0, 0, 0],
)

layer2 = pdk.Layer(
    "ScatterplotLayer",
    gdf,
    pickable=True,
    opacity=0.8,
    stroked=True,
    filled=True,
    radius_scale=20,
    radius_min_pixels=1,
    radius_max_pixels=100,
    line_width_min_pixels=1,
    get_position="coordinates",
    get_fill_color=[3, 194, 252],
    get_line_color=[0, 0, 0],
)

# Set the viewport location
view_state = pdk.ViewState(latitude=lat, longitude=lon, zoom=zoom_level-1, bearing=0, pitch=45)

# Render
r = pdk.Deck(layers=[layer2, layer], initial_view_state=view_state, tooltip=tooltip)
r.to_html("index.html")
