# Popularity Correlation

With this notebook you can correlate any geo-referenced value with the Google popularity score. You can upload your
own data as a CSV file. The only requirements are a header row and columns for latitude and longitude.
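Before running the notebook, it can help to confirm that the CSV header contains the columns you plan to reference. The following is a minimal sketch using only the standard library; the helper name `missing_columns` and the sample data are hypothetical and not part of Kuwala.

```python
import csv
import io


def missing_columns(csv_text, required, delimiter=';'):
    """Return the required column names that are absent from the CSV header."""
    reader = csv.reader(io.StringIO(csv_text), delimiter=delimiter)
    header = next(reader, [])  # an empty input yields an empty header
    present = {name.strip() for name in header}
    return [name for name in required if name not in present]


# Hypothetical sample mirroring the column names used later in this notebook.
sample = "latitude;longitude;weekly_traversals\n38.72;-9.14;120\n"

print(missing_columns(sample, ['latitude', 'longitude', 'weekly_traversals']))
print(missing_columns(sample, ['latitude', 'lng']))
```

An empty list means every required column is present. For a real file, read its text first and pass the same delimiter you set in the parameters.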

The value columns can be specific to your use case, e.g., scooter bookings, shop sales, or crimes. The popularity
score is aggregated per week, so ideally the value columns you want to correlate are aggregated on a weekly timeframe
as well.
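If your raw data is recorded per day, one possible way to bring it to a weekly timeframe is to sum values per location and ISO calendar week. This is only a sketch under that assumption; the function `aggregate_weekly` and the sample records are hypothetical, and how you reduce several weeks to a single weekly figure (for example, by averaging) depends on your use case.

```python
from collections import defaultdict
from datetime import date


def aggregate_weekly(records):
    """Sum values per (latitude, longitude, ISO year, ISO week)."""
    totals = defaultdict(float)
    for lat, lng, day, value in records:
        iso_year, iso_week, _ = day.isocalendar()
        totals[(lat, lng, iso_year, iso_week)] += value
    return dict(totals)


# Hypothetical daily records: (latitude, longitude, date, value)
daily = [
    (38.72, -9.14, date(2020, 1, 6), 10),   # Monday of ISO week 2
    (38.72, -9.14, date(2020, 1, 8), 5),    # Wednesday of ISO week 2
    (38.72, -9.14, date(2020, 1, 13), 7),   # Monday of ISO week 3
]

weekly = aggregate_weekly(daily)
for key, total in sorted(weekly.items()):
    print(key, total)
```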

As an example, we use an open data set from Uber that contains the traversals of rides through specific hexagons.
You can find the raw data on [their open data platform](https://movement.uber.com/?lang=en-US). We preprocessed the raw
data so that the traversals are already aggregated per week.

## 1. Set Parameters

1. Set the file path to your CSV and the delimiter. Simply place your file under `kuwala/resources` from within the
Jupyter environment or under `kuwala/common/jupyter/resources` from the repository root on your local file system.

In [None]:
file_path = '../resources/lisbon_uber_traversals.csv'
delimiter = ';'

2. Set the H3 resolution to aggregate the results on.

    To see the average size of a hexagon at a given resolution, go to the
    [official H3 documentation](https://h3geo.org/docs/core-library/restable). The currently set resolution 8 has an
    average edge length of 0.46 km, which can be loosely interpreted as a radius.

In [None]:
resolution = 8

3. Set the column names for the coordinates and the columns of the file you want to correlate.

In [None]:
lat_column = 'latitude'
lng_column = 'longitude'
value_columns = ['weekly_traversals']

4. You can provide polygon coordinates as a GeoJSON-conformant array to select a subregion. Otherwise, data from the
entire database will be analyzed. (The default coordinates are a rough representation of Lisbon, Portugal.)

In [None]:
polygon_coords = '[[[-9.092559814453125,38.794500078219826],[-9.164314270019531,38.793429729760994],[-9.217529296875,38.76666579487878],[-9.216842651367188,38.68792166352608],[-9.12139892578125,38.70399894245585],[-9.0911865234375,38.74551518488265],[-9.092559814453125,38.794500078219826]]]'
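If you supply your own polygon, a quick sanity check can catch common mistakes such as an unclosed ring or swapped latitude and longitude. GeoJSON positions are ordered `[longitude, latitude]`, and each linear ring must have at least four positions with identical first and last entries. The helper `check_polygon` below is a hypothetical sketch, not a Kuwala function, and the example triangle is made up.

```python
import json


def check_polygon(coords_json):
    """Parse GeoJSON polygon coordinates and run basic validity checks."""
    rings = json.loads(coords_json)
    for ring in rings:
        if len(ring) < 4:
            raise ValueError('a linear ring needs at least four positions')
        if ring[0] != ring[-1]:
            raise ValueError('first and last position of a ring must be identical')
        for position in ring:
            lng, lat = position[:2]  # ignore an optional elevation value
            if not (-180 <= lng <= 180 and -90 <= lat <= 90):
                raise ValueError(f'position out of range: {position}')
    return rings


# Hypothetical closed triangle near Lisbon, in [longitude, latitude] order.
example = '[[[-9.2,38.7],[-9.1,38.7],[-9.1,38.8],[-9.2,38.7]]]'
rings = check_polygon(example)
print(len(rings), 'ring(s) with', len(rings[0]), 'positions')
```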


## 2. Load dataframes

#### Create a Spark session that is used to load your file.

In [None]:
from kuwala.modules.common import get_spark_session

sp = get_spark_session(memory_in_gb=16)

#### Load the file

In [None]:
import json
from geojson import Polygon
from kuwala.modules.common import add_h3_index_column, polyfill_polygon

df_file = sp.read.option('delimiter', delimiter).csv(file_path, header=True)
df_file = add_h3_index_column(df=df_file, lat_column=lat_column, lng_column=lng_column, resolution=resolution)

if polygon_coords:
    polygon_coords_json = json.loads(polygon_coords)
    polygon = Polygon(polygon_coords_json)
    h3_index_in_polygon = list(polyfill_polygon(polygon=polygon, resolution=resolution))
    df_file = df_file.filter(df_file.h3_index.isin(h3_index_in_polygon))

# Sum every value column per hexagon. Spark names the resulting columns 'sum(<column>)'.
aggregations = {x: 'sum' for x in value_columns}
df_file = df_file.select('h3_index', *value_columns).groupBy('h3_index').agg(aggregations)

df_file.show(n=10)

#### Get weekly popularity per hexagon

##### Initialize dbt controller

In [None]:
from kuwala.modules.common import get_dbt_controller

dbt_controller = get_dbt_controller()

##### Run dbt macro to get the aggregated popularity data

In [None]:
from kuwala.modules.poi_controller import get_popularity_in_polygon

popularity = get_popularity_in_polygon(dbt_controller=dbt_controller, resolution=resolution, polygon_coords=polygon_coords)

popularity.head(n=10)

## 3. Join dataframes

In [None]:
popularity = sp.createDataFrame(popularity)
popularity = popularity.withColumnRenamed('h3_index', 'join_h3_index')
result = df_file \
    .join(popularity, df_file.h3_index == popularity.join_h3_index, 'left') \
    .drop('join_h3_index') \
    .fillna(0, subset=['popularity'])

result.show(n=10)
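The join above keeps every hexagon from your file and fills in a popularity of 0 wherever no popularity data exists for that hexagon. The plain-Python sketch below illustrates only that left-join-then-fill behavior; the function name, dictionary layout, and hexagon identifiers are hypothetical placeholders rather than real H3 indexes.

```python
def left_join_popularity(values_by_hex, popularity_by_hex):
    """Keep every hexagon with a value; default missing popularity to 0."""
    return {
        h3_index: {'value': value, 'popularity': popularity_by_hex.get(h3_index, 0)}
        for h3_index, value in values_by_hex.items()
    }


# Hypothetical inputs keyed by placeholder hexagon identifiers.
values = {'hex_a': 120, 'hex_b': 45}
popularity_scores = {'hex_a': 830, 'hex_c': 12}

joined = left_join_popularity(values, popularity_scores)
for h3_index, row in joined.items():
    print(h3_index, row)
```

Note that hexagons appearing only in the popularity data (here `hex_c`) are dropped, and the filled-in zeros take part in any correlation computed afterwards.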

## 4. Visualize Results

#### Pandas Profiling Report

In [None]:
from pandas_profiling import ProfileReport

result_pd = result.toPandas()
profile = ProfileReport(result_pd, title="Pandas Profiling Report", explorative=True)

profile.to_notebook_iframe()
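If you only need a single number rather than the full report, a Pearson correlation coefficient between a value column and the popularity can be computed directly. The following is a minimal standard-library sketch; the function `pearson` and the sample numbers are illustrative assumptions, and in practice you would pass the corresponding columns of `result_pd` as lists.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError('need two sequences of equal length with at least two items')
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError('correlation is undefined for a constant sequence')
    return cov / sqrt(var_x * var_y)


# Made-up numbers purely for illustration.
traversals = [120, 45, 300, 80]
popularity_values = [830, 0, 1500, 400]
print(round(pearson(traversals, popularity_values), 3))
```

A value close to 1 suggests that hexagons with higher values also tend to be more popular, a value close to -1 suggests the opposite, and a value near 0 suggests no linear relationship.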

#### Map

In [None]:
from unfolded.map_sdk import UnfoldedMap
from sidecar import Sidecar
from uuid import uuid4

unfolded_map = UnfoldedMap()
sc = Sidecar(title='Popularity Correlation', anchor='split-right')

with sc:
    display(unfolded_map)

dataset_id_combined = uuid4()

unfolded_map.add_dataset({
    'uuid': dataset_id_combined,
    'label': 'Correlated values',
    'data': result_pd
})

## 5. Save Results as Dataset

#### JSON

In [None]:
import ipynbname
from kuwala.modules.common import to_json, to_csv, to_parquet

currentNB_name = ipynbname.name()
to_json(df=result, nb_name=currentNB_name)

#### CSV

In [None]:
to_csv(df=result, nb_name=currentNB_name, header=True)

#### Parquet

In [None]:
to_parquet(df=result, nb_name=currentNB_name)