# Processing data for Yestermap

In [None]:
import pandas as pd
import geopandas
import json

I downloaded my location history from Google Takeout.

In [None]:
with open("data/Takeout/Location History/Location History.json") as f:
    data = json.load(f)

In [None]:
from pandas import json_normalize
df = json_normalize(data, "locations")

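Preparing coordinates for converting into a GeoPandas dataframe. The `E7` fields store degrees multiplied by 10^7 as integers, so dividing by 10^7 gives decimal degrees.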

In [None]:
df["latitude"] = df["latitudeE7"] / 10 ** 7
df["longitude"] = df["longitudeE7"] / 10 ** 7
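
As a quick check of the scaling, here is a minimal sketch on a single made-up row. The values and the `toy` name are assumptions for illustration, not taken from my actual history.

```python
import pandas as pd

# Hypothetical E7 values: integers equal to degrees multiplied by 10**7.
toy = pd.DataFrame({"latitudeE7": [145995120], "longitudeE7": [1209842190]})

toy["latitude"] = toy["latitudeE7"] / 10**7
toy["longitude"] = toy["longitudeE7"] / 10**7

# Sanity check: converted values should be valid decimal degrees.
assert toy["latitude"].between(-90, 90).all()
assert toy["longitude"].between(-180, 180).all()
print(toy[["latitude", "longitude"]])
```

If the range check fails on real data, the scaling factor or the column choice is probably wrong.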

In [None]:
df

Convert to Geopandas dataframe keeping only the timestamp and geometry.

In [None]:
gdf = geopandas.GeoDataFrame(
    df, geometry=geopandas.points_from_xy(df.longitude, df.latitude)
)[["timestampMs", "geometry"]]

Setting the CRS in preparation for spatial joins.

In [None]:
gdf = gdf.set_crs(epsg=4326)

I'm resetting the index here so that the original row number is kept as an `index` column, because we'll need it later on.

In [None]:
gdf = gdf.reset_index()

In [None]:
gdf

Got geometries for PH admin boundary level 4 (barangay) from our geodata warehouse.

In [None]:
ph_barangays = geopandas.read_file("data/ph_barangays.csv").set_crs(epsg=4326)

In [None]:
ph_barangays

Combining barangay and city because PH likes to repeat location names :P

In [None]:
ph_barangays["NAME"] = ph_barangays["BARANGAY"].str.split("(").str[0].str.strip().str.upper() + ", " + ph_barangays["MUNCITY_NAME"]
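
To see what that expression does, here is a small sketch on made-up rows. The barangay and city values are invented for illustration; only the column names `BARANGAY` and `MUNCITY_NAME` come from the cell above.

```python
import pandas as pd

# Made-up rows: the same barangay name appearing in two different cities.
toy = pd.DataFrame({
    "BARANGAY": ["San Antonio (Pob.)", "San Antonio"],
    "MUNCITY_NAME": ["MAKATI CITY", "PASIG CITY"],
})

# Drop any parenthetical suffix, trim whitespace, upper-case, then append the city.
toy["NAME"] = (
    toy["BARANGAY"].str.split("(").str[0].str.strip().str.upper()
    + ", "
    + toy["MUNCITY_NAME"]
)
print(toy["NAME"].tolist())
```

Both rows normalize to the same barangay name, and the city suffix is what keeps them apart.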

In [None]:
ph_barangays = ph_barangays[["NAME", "geometry"]]

In [None]:
ph_barangays

Running a spatial join on PH barangays and my location history coordinates. We're only keeping the barangay geometries to reduce granularity.

In [None]:
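# `predicate` replaces the `op` keyword, which was deprecated in geopandas 0.10 and later removed.
gdf_ph_barangay = geopandas.sjoin(ph_barangays, gdf, how="inner", predicate="contains").set_index("index").drop(columns="index_right")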
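
The join pattern can be tried on toy data first. This is a minimal sketch, assuming geopandas 0.10 or newer (for the `predicate` keyword) and shapely; the squares, points, and names `areas`, `points`, and `joined` are made up for illustration.

```python
import geopandas
from shapely.geometry import Point, box

# Two made-up square areas and three made-up points (the last lies outside both).
areas = geopandas.GeoDataFrame(
    {"NAME": ["A", "B"]},
    geometry=[box(0, 0, 1, 1), box(2, 0, 3, 1)],
    crs="EPSG:4326",
)
points = geopandas.GeoDataFrame(
    {"timestampMs": ["1", "2", "3"]},
    geometry=[Point(0.5, 0.5), Point(2.5, 0.5), Point(5, 5)],
    crs="EPSG:4326",
).reset_index()  # keeps the row number as an "index" column

# Polygons go on the left, so each matched row carries the polygon geometry.
# Older geopandas releases spell the keyword `op` instead of `predicate`.
joined = (
    geopandas.sjoin(areas, points, how="inner", predicate="contains")
    .set_index("index")
    .drop(columns="index_right")
    .sort_index()
)
print(joined[["NAME", "timestampMs"]])
```

The point at (5, 5) falls in neither square, so its row number is absent from `joined.index`. That gap is what lets unmatched points be recovered afterwards.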

In [None]:
gdf_ph_barangay

Here's where the index from above is useful. I take all remaining locations that didn't fall within any of the PH barangay geometries.

In [None]:
# Index.difference returns labels, so select with label-based .loc.
gdf_missing = gdf.loc[gdf.index.difference(gdf_ph_barangay.index)]
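
The index bookkeeping is easier to follow on a tiny example. A sketch with made-up stand-ins (`source`, `matched`, and `missing` are hypothetical names):

```python
import pandas as pd

# Made-up stand-ins: five source rows, of which rows 0, 1 and 3 were matched.
source = pd.DataFrame({"timestampMs": ["10", "20", "30", "40", "50"]})
matched = source.loc[[0, 1, 3]]

# Index.difference returns the labels present in `source` but absent from `matched`.
missing = source.loc[source.index.difference(matched.index)]
print(missing.index.tolist())
```

The selection here is label-based (`.loc`). With a default 0..n-1 index, positional selection picks the same rows.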

In [None]:
gdf_missing

I repeat the same process but with less granular PH city geometries that I got from [GADM](https://gadm.org/download_country_v3.html).

In [None]:
ph_cities = geopandas.read_file("data/ph_cities.geojson")

In [None]:
ph_cities["NAME"] = ph_cities["NAME_2"].str.upper() + ", " + ph_cities["NAME_1"].str.upper()

In [None]:
ph_cities = ph_cities[["NAME", "geometry"]]

In [None]:
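# `predicate` replaces the `op` keyword, which was deprecated in geopandas 0.10 and later removed.
gdf_ph_city = geopandas.sjoin(ph_cities, gdf_missing, how="inner", predicate="contains").set_index("index").drop(columns="index_right")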

In [None]:
gdf_ph_city

In [None]:
gdf_ph = pd.concat([gdf_ph_barangay, gdf_ph_city])

In [None]:
# Index.difference returns labels, so select with label-based .loc.
gdf_missing = gdf.loc[gdf.index.difference(gdf_ph.index)]

Most of the remaining entries will be from the rest of the world. I got world city geometries from https://github.com/drei01/geojson-world-cities to match these.

In [None]:
world_cities = geopandas.read_file("data/world_cities.geojson")

In [None]:
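# `predicate` replaces the `op` keyword, which was deprecated in geopandas 0.10 and later removed.
gdf_world = geopandas.sjoin(world_cities, gdf_missing, how="inner", predicate="contains").set_index("index").drop(columns="index_right")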

In [None]:
gdf_world = gdf_world[gdf_world["NAME"] != "MANILA"]

In [None]:
gdf_world

Combining and cleaning the results.

In [None]:
gdf_cleaned = pd.concat([gdf_ph, gdf_world])

Removing consecutive duplicates: after sorting by timestamp, I keep a row only when its name differs from the previous row's.

In [None]:
gdf_cleaned = gdf_cleaned.sort_values(by="timestampMs")

In [None]:
gdf_cleaned = gdf_cleaned[gdf_cleaned["NAME"] != gdf_cleaned["NAME"].shift()]
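
A small sketch of the `shift()` comparison on made-up data (`toy` and `deduped` are hypothetical names):

```python
import pandas as pd

# Made-up visit sequence: A, A, B, A, A.
toy = pd.DataFrame({
    "timestampMs": ["1", "2", "3", "4", "5"],
    "NAME": ["A", "A", "B", "A", "A"],
})

# shift() moves NAME down one row, so the comparison is True only where the name changes.
# The first row is compared against NaN and is therefore always kept.
deduped = toy[toy["NAME"] != toy["NAME"].shift()]
print(deduped["NAME"].tolist())
```

Only runs of the same name collapse. A later return to "A" is kept as a separate visit, unlike with `drop_duplicates`.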

In [None]:
gdf_cleaned

Getting the centroids of the cities since I only need points.

In [None]:
gdf_cleaned = gdf_cleaned.set_crs(epsg=4326)
gdf_cleaned["longitude"] = gdf_cleaned.geometry.centroid.x
gdf_cleaned["latitude"] = gdf_cleaned.geometry.centroid.y

In [None]:
gdf_output = gdf_cleaned.rename(columns={"NAME": "name"})[["timestampMs", "name", "longitude", "latitude"]]

Final output is an ndjson file, which I'll be loading into Firestore.

In [None]:
gdf_output.sort_values(by="timestampMs").to_json("data/location_history.ndjson", orient="records", lines=True)
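
To check the shape of the output, here is a sketch that serializes a made-up frame the same way and parses it back line by line. The rows are invented, and it writes to a string instead of a file.

```python
import json
import pandas as pd

# Made-up rows with the same columns as the final output.
toy = pd.DataFrame({
    "timestampMs": ["1", "2"],
    "name": ["A", "B"],
    "longitude": [120.98, 121.02],
    "latitude": [14.6, 14.55],
})

# With no path given, to_json returns the text; lines=True writes one JSON object per line.
text = toy.to_json(orient="records", lines=True)

# Each non-empty line is an independent JSON document.
records = [json.loads(line) for line in text.splitlines() if line]
print(records[0])
```

Because every line parses on its own, a loader can stream the file record by record instead of reading it all into memory.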