<h1>Hot Spot Analysis - Australian Bushfire Demo (v0.1.0)</h1>

<h2>Step 1: Load Dependencies and Data</h2>

<h3>Dependencies</h3>

The first step is to load the dependencies required to prepare the data, perform the analysis, and visualise the results. Pandas and folium are both external open-source libraries used for manipulating data and mapping geographical data, respectively. GeoJikuu is an open-source library for analysing geospatial data and hence will be used for the hot spot analysis as well as any required pre-processing steps.

In [1]:
# External
import pandas as pd
import folium
import warnings
warnings.filterwarnings("ignore")

# GeoJikuu
from preprocessing.projection import *
from aggregation.point_aggregators import *
from hypothesis_testing.hot_spot_analysis import *

<h3>Data</h3>

The next step is to load the dataset to which the hot spot analysis will be applied. For this demo, we will be using the dataset [19th Century Australian Bushfire Reporting](https://ghap.tlcmap.org/publicdatasets/170), courtesy of Dr. Fiannuala Morgan.

This dataset contains the approximate locations of bushfires in Australia from 1824 to 1899. As per the dataset's description, these locations were extracted from newspapers and geo-located using GHAP.

Let's start by loading the dataset, and then examining the first five entries.

In [2]:
df = pd.read_csv("bushfiredemo.csv")
df.head()

Unnamed: 0,ghap_id,dataset_id,title,recordtype_id,latitude,longitude,datestart,dateend,placename,created_at,updated_at,State,Newspaper Place of Publication,Newspaper,Article Word Count,Article Link,coordinates
0,t79e4,170,"TRIBUTARY LINES, Addressed to LIEUTENANT GOVER...",1,-33.906667,151.059444,1824-05-14,1824-05-14,GREECE,23/01/2022 13:51,23/01/2022 13:51,NSW,Hobart,Hobart Town Gazette and Van Diemen's Land Adve...,276,https://nla.gov.au/nla.news-article1090181,"-33.90666667,151.0594444"
1,t79e5,170,IMPROVED TRAVELLING.,1,-33.751528,150.994167,1831-12-24,1831-12-24,BAULKHAM HILLS,23/01/2022 13:51,23/01/2022 13:51,NSW,Sydney,The Sydney Gazette and New South Wales Adverti...,1280,https://nla.gov.au/nla.news-article2204168,"-33.75152778,150.9941667"
2,t79e6,170,FIRE.,1,-32.055085,115.7459,1834-03-15,1834-03-15,FREMANTLE,23/01/2022 13:51,23/01/2022 13:51,WA,Perth,The Perth Gazette and Western Australian Journ...,289,https://nla.gov.au/nla.news-article641603,"-32.055085,115.7459"
3,t79e7,170,The Broad and Liberal.,1,-19.225,138.35,1835-02-14,1835-02-14,NORFOLK,23/01/2022 13:51,23/01/2022 13:51,QLD,Hobart,The True Colonist Van Diemen's Land Political ...,2493,https://nla.gov.au/nla.news-article200328486,"-19.225,138.35"
4,t79e9,170,1835. SKETCH OF OCCURRENCES DURING THE YEAR.,1,-33.641667,151.267778,1835-12-19,1835-12-19,ELIZA BAY,23/01/2022 13:51,23/01/2022 13:51,NSW,Perth,The Perth Gazette and Western Australian Journ...,432,https://nla.gov.au/nla.news-article640616,"-33.64166667,151.2677778"


As can be seen above, the dataset contains several columns of useful information, ranging from place names, to newspaper articles, to dates and coordinates. For the purposes of this demonstration, however, we are only interested in the (lat, lon) coordinates of each bushfire.

We proceed by dropping all columns other than the 'coordinates' column.

In [3]:
df.drop(df.columns.difference(['coordinates']), axis=1, inplace=True)
df.head()

Unnamed: 0,coordinates
0,"-33.90666667,151.0594444"
1,"-33.75152778,150.9941667"
2,"-32.055085,115.7459"
3,"-19.225,138.35"
4,"-33.64166667,151.2677778"
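
Each entry in the 'coordinates' column is a single "lat,lon" string rather than a pair of numbers. As a minimal sketch, the snippet below splits such strings into numeric columns with pandas. The `sample` frame and the `latitude`/`longitude` column names are illustrative only; the two values are copied from the table above.

```python
import pandas as pd

# Hypothetical two-row frame in the same "lat,lon" string format as the dataset
sample = pd.DataFrame(
    {"coordinates": ["-33.90666667,151.0594444", "-32.055085,115.7459"]}
)

# Split on the comma into two columns, then convert the text to floats
latlon = sample["coordinates"].str.split(",", expand=True).astype(float)
latlon.columns = ["latitude", "longitude"]

print(latlon)
```

The demo itself keeps the string column and parses it inside the projection step instead.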


<h2>Step 2: Preprocessing</h2>

<h3>Project Coordinates</h3>

Hot spot analysis involves applying arithmetic operations to the input coordinates. However, such operations only make sense when dealing with linear coordinate systems. Since our dataset uses an angular coordinate system (WGS84), we must first project the coordinates to some linear system.
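
To illustrate why arithmetic on raw (lat, lon) values is misleading, the sketch below measures the ground length of one degree of longitude at the equator and at a latitude of -38° (roughly that of Victoria). It uses the haversine formula on a spherical Earth with a mean radius of 6371 km, which is an approximation chosen for this illustration and is not part of GeoJikuu.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; spherical approximation (assumption)

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# The same one-degree step in longitude, taken at two different latitudes
one_degree_at_equator = haversine_km(0.0, 144.0, 0.0, 145.0)
one_degree_at_minus_38 = haversine_km(-38.0, 144.0, -38.0, 145.0)

print(round(one_degree_at_equator, 1), round(one_degree_at_minus_38, 1))
```

The same difference in degrees covers a noticeably shorter distance further from the equator, so distances computed directly on degrees are not comparable across the study area. A projected system in metres avoids this.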

There are many linear projection systems, and the best choice depends on the study area and the purpose of the analysis. For the purposes of this demo, we will use the [Map Grid of Australia 1994 (MGA94)](https://www.icsm.gov.au/datum/geocentric-datum-australia-1994-gda94#:~:text=The%20standard%20map%20projection%20associated,Universal%20Transverse%20Mercator%20Grid%20system.) projection system, which is available as part of GeoJikuu's preprocessing.projection package.

The following block of code shows how this projection can be done:

In [4]:
mga1994_projector = MGA1994Projector("WGS84")
wgs84_coordinates = []
for row in df["coordinates"].tolist():
    # Each row is a "lat,lon" string; split it once and convert both parts to floats
    lat, lon = row.split(',')
    wgs84_coordinates.append((float(lat), float(lon)))

results = mga1994_projector.project(wgs84_coordinates)
mga1994_coordinates = results["mga1994_coordinates"]
# unit_conversion = results["unit_conversion"]

df['mga94'] = mga1994_coordinates
df.head()

Unnamed: 0,coordinates,mga94
0,"-33.90666667,151.0594444","(6240766.353818081, 875409.8550096203)"
1,"-33.75152778,150.9941667","(6258220.583029894, 870038.6276743908)"
2,"-32.055085,115.7459","(5988838.712066973, -2511909.2671285663)"
3,"-19.225,138.35","(7851487.897910243, -411913.177344884)"
4,"-33.64166667,151.2677778","(6269396.349533128, 895909.5787525391)"


Now that we have the projected coordinates, we can drop the WGS84 coordinate column (i.e., 'coordinates').

In [5]:
df.drop("coordinates", axis=1, inplace=True)
df.head()

Unnamed: 0,mga94
0,"(6240766.353818081, 875409.8550096203)"
1,"(6258220.583029894, 870038.6276743908)"
2,"(5988838.712066973, -2511909.2671285663)"
3,"(7851487.897910243, -411913.177344884)"
4,"(6269396.349533128, 895909.5787525391)"


<h2>Step 3: Aggregate Data</h2>

To perform a hot spot analysis on a large number of dispersed points, it typically makes more sense to first aggregate them into clusters. The hot spot algorithm then analyses these clusters to determine whether any hot spots (or cold spots) exist among them. Points can be aggregated in many ways; in this case, we will use the k-Nearest Neighbours algorithm to partition the points and then aggregate by count. We can do this using GeoJikuu's KNearestNeighbours class, which is located in the aggregation.point_aggregators package.


In [6]:
knn_aggregator = KNearestNeighbours(df, "mga94")

bushfires_agg = knn_aggregator.aggregate(k=5)

bushfires_agg.head()

Aggregated 7454 points into 175 clusters.


Unnamed: 0,midpoint,count
,,
0.0,"(6239639.0801122, 874101.5371682383)",11.0
1.0,"(6260922.489469169, 861040.7434953144)",63.0
2.0,"(6024440.629068807, -2482320.381361867)",107.0
3.0,"(6884539.213468155, -499493.5306094593)",562.0
4.0,"(6245010.547854997, 889431.7893844307)",504.0


The above output shows that the 7454 bushfires were aggregated into 175 bushfire clusters. 
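
To make the idea of "aggregate by count" concrete, here is a minimal sketch that reduces points to a midpoint and a count per group. It is not GeoJikuu's KNearestNeighbours implementation: it swaps in simple fixed-grid binning as a stand-in, and the point values, `cell_size`, and column names are all invented for illustration.

```python
import pandas as pd

# Hypothetical projected points in metres
points = pd.DataFrame({
    "x": [1000.0, 3000.0, 15000.0, 16000.0, 17000.0, 25000.0],
    "y": [2000.0, 4000.0, 2000.0, 3000.0, 1000.0, 25000.0],
})
cell_size = 10_000.0  # grid cell width in metres (assumption)

# Assign each point to a grid cell by integer division of its coordinates
cells = points.assign(
    cell_x=(points["x"] // cell_size).astype(int),
    cell_y=(points["y"] // cell_size).astype(int),
)

# One row per occupied cell: mean position of its points and how many it holds
aggregated = (
    cells.groupby(["cell_x", "cell_y"])
    .agg(mid_x=("x", "mean"), mid_y=("y", "mean"), count=("x", "size"))
    .reset_index(drop=True)
)

print(aggregated)
```

Whatever the grouping method, the result has the same shape as the table above: one representative location per cluster plus the number of points it absorbed.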

<h2>Step 4: Hot Spot Analysis</h2>

<h3>Run Analysis</h3>

We are finally ready to perform the analysis. This can be done using GeoJikuu's GiStarHotSpotAnalysis class, which is imported from the hypothesis_testing.hot_spot_analysis module. It determines which aggregated locations are statistically significant hot spots (or cold spots). We are running the analysis on the 'count' variable, which means we are looking for the clusters where bushfires occur more or less frequently than would be expected given the other clusters in the research area.

In [7]:
analysis = GiStarHotSpotAnalysis(bushfires_agg, "midpoint")

results = analysis.run("count")

Getis-Ord Gi* Hot Spot Analysis Summary
---------------------------------------
Statistically Significant Clusters: 5
    Statistically Significant Hot Spots: 3
    Statistically Significant Cold Spots: 2
Non-Statistically Significant Clusters: 170
Total Clusters: 175

Null Hypothesis (H₀): The observed pattern of the variable 'count' in cluster i is the result of spatial randomness alone.
Alpha Level (α): 0.05

Verdict: Sufficient evidence to reject H₀ when α = 0.05 for clusters i = {60, 65, 92, 139, 140}


The above output shows that the analysis found three statistically significant hot spots and two statistically significant cold spots. The index values (cluster IDs) of these statistically significant clusters are 60, 65, 92, 139, and 140.
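
For readers unfamiliar with the statistic, the following sketch computes the standard Getis-Ord Gi* z-score and a two-sided p-value with NumPy. It is an illustration under stated assumptions, not GeoJikuu's implementation: the binary distance-band weights, the `band` value, and the toy coordinates and counts are all chosen for this example.

```python
import math
import numpy as np

def gi_star(coords, values, band):
    """Gi* z-scores using binary weights: w_ij = 1 if distance(i, j) <= band, else 0."""
    coords = np.asarray(coords, dtype=float)
    values = np.asarray(values, dtype=float)
    n = len(values)

    # Pairwise Euclidean distances; each point counts as its own neighbour (distance 0)
    dists = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
    weights = (dists <= band).astype(float)

    mean = values.mean()
    std = math.sqrt((values ** 2).mean() - mean ** 2)
    w_sum = weights.sum(axis=1)
    w_sq_sum = (weights ** 2).sum(axis=1)

    numerator = weights @ values - mean * w_sum
    denominator = std * np.sqrt((n * w_sq_sum - w_sum ** 2) / (n - 1))
    return numerator / denominator

# Toy data: three high-count clusters close together, three low-count ones far away
coords = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0), (12, 0)]
counts = [10, 12, 11, 1, 2, 1]

z_scores = gi_star(coords, counts, band=1.5)
# Two-sided p-value under a standard normal null distribution
p_values = np.array([math.erfc(abs(z) / math.sqrt(2)) for z in z_scores])

print(np.round(z_scores, 3))
print(np.round(p_values, 3))
```

A large positive z-score marks a location surrounded by high values (a hot spot), and a large negative one marks a location surrounded by low values (a cold spot). The denominator becomes zero if every point falls within the band of every other, so the band must be smaller than the extent of the data.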

<h3>Convert Data to Initial Format</h3>

Before viewing and mapping our statistically significant hot spots, we first need to convert them from MGA94 back to WGS84 (lat, lon) coordinates. This can be done using the inverse_project() method of the MGA1994Projector class.

In [8]:
# Project coordinates back to WGS84
string_list = results.midpoint.values.tolist()
tuple_list = []
for string in string_list:
    x_string = string.strip('(').strip(')').split(", ")
    x = tuple([float(i) for i in x_string])
    tuple_list.append(x)
    
results["coordinates"] = mga1994_projector.inverse_project(tuple_list)
results.drop("midpoint", axis=1, inplace=True)
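
The loop above parses each midpoint string by stripping the parentheses and splitting on ", ". As a hedged alternative, the standard library's `ast.literal_eval` can parse the same text and does not depend on the exact spacing. The two sample strings below are copied from the aggregation output; the variable names are illustrative.

```python
import ast

# Midpoints in the same "(a, b)" string form as the aggregated results
midpoint_strings = [
    "(6239639.0801122, 874101.5371682383)",
    "(6024440.629068807, -2482320.381361867)",
]

# literal_eval safely evaluates the tuple literal; floats are enforced explicitly
parsed_midpoints = [
    tuple(float(value) for value in ast.literal_eval(text))
    for text in midpoint_strings
]

print(parsed_midpoints)
```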

<h3>View Results</h3>

We can view our statistically significant hot and cold spots by filtering the results DataFrame:

In [9]:
sig_results = results[results['significant'] == "TRUE"].sort_values(by=['type'], ascending=False)
sig_results

Unnamed: 0,z-score,p-value,significant,type,coordinates
60,0.000605,0.004344,True,HOT SPOT,"(-37.30144942430419, 144.27002664967975)"
65,0.004511,0.032381,True,HOT SPOT,"(-37.702885430864846, 143.79277016751269)"
140,0.000186,0.001337,True,HOT SPOT,"(-38.53875, 143.9843056)"
92,-0.004729,0.033945,True,COLD SPOT,"(-37.64034940185914, 143.6840688162685)"
139,-0.000327,0.002345,True,COLD SPOT,"(-37.949983755975865, 143.5693624430405)"
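
The filter in cell In [9] compares the 'significant' column against the string "TRUE". Whether that column holds strings or booleans is not stated in this demo, so the sketch below shows a comparison that works for either representation. The `is_significant` helper and both sample frames are hypothetical.

```python
import pandas as pd

def is_significant(series):
    """Return a boolean mask for values that read as true, whether stored as bool or text."""
    return series.astype(str).str.upper() == "TRUE"

# The same flags stored two different ways
as_bool = pd.DataFrame({"significant": [True, False, True]})
as_text = pd.DataFrame({"significant": ["TRUE", "FALSE", "TRUE"]})

mask_from_bool = is_significant(as_bool["significant"])
mask_from_text = is_significant(as_text["significant"])

print(as_bool[mask_from_bool])
print(as_text[mask_from_text])
```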


<h2>Step 5: Map Results</h2>

Finally, we visualise the results on a map. There are many ways to do this using third-party libraries. In this case, we will use Leaflet via a Python package called folium.

Statistically significant hot spots are displayed as red dots and statistically significant cold spots as blue dots. Non-statistically significant clusters are displayed as grey dots.

Clicking on a dot will show a popup containing the corresponding coordinate and test results.

In [10]:
import folium

aus_coords = [-25.2744, 133.7751]
demo_map = folium.Map(location = aus_coords, zoom_start = 5)

sig_results_hot = sig_results[sig_results['type'] == "HOT SPOT"]
coord_list_hot = sig_results_hot["coordinates"].values.tolist()
z_score_list_hot = sig_results_hot["z-score"].values.tolist()
p_value_list_hot = sig_results_hot["p-value"].values.tolist()

sig_results_cold = sig_results[sig_results['type'] == "COLD SPOT"]
coord_list_cold = sig_results_cold["coordinates"].values.tolist()
z_score_list_cold = sig_results_cold["z-score"].values.tolist()
p_value_list_cold = sig_results_cold["p-value"].values.tolist()

insig_results = results[results['significant'] == "FALSE"]
coord_list_insig = insig_results["coordinates"].values.tolist()
z_score_list_insig = insig_results["z-score"].values.tolist()
p_value_list_insig = insig_results["p-value"].values.tolist()

In [11]:
# One entry per category: coordinates, z-scores, p-values, outline colour, fill colour
layers = [
    (coord_list_hot, z_score_list_hot, p_value_list_hot, 'red', '#f03'),
    (coord_list_cold, z_score_list_cold, p_value_list_cold, 'blue', '#13a8e8'),
    (coord_list_insig, z_score_list_insig, p_value_list_insig, 'gray', '#808080'),
]

for coords, z_scores, p_values, color, fill_color in layers:
    for coord, z_score, p_value in zip(coords, z_scores, p_values):
        popup = f"<b>Coords: </b>{coord}<br><b>Z-Score: </b> {z_score}<br><b>p-value: </b> {p_value}"
        # folium's documented keyword names are fill, fill_color and fill_opacity
        folium.Circle(coord, color=color, fill=True, fill_color=fill_color, fill_opacity=0.5, radius=1000, popup=popup).add_to(demo_map)

In [12]:
folium.TileLayer(
    tiles = 'https://server.arcgisonline.com/ArcGIS/rest/services/World_Imagery/MapServer/tile/{z}/{y}/{x}',
    attr = 'Esri',
    name = 'Esri Satellite',
    overlay = False,
    control = True
    ).add_to(demo_map)

<folium.raster_layers.TileLayer at 0x2429d9f46d0>

In [13]:
demo_map

As a final note, the analysis results can be saved as a CSV file by calling the DataFrame's to_csv() function:

In [14]:
results.to_csv('bushfire_hsa_results.csv')
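
One thing to keep in mind when reloading the saved file: CSV stores everything as text, so a tuple in the 'coordinates' column comes back as a string. The sketch below demonstrates the round trip with an in-memory buffer instead of a file on disk. The one-row `demo` frame is made up for illustration.

```python
import ast
import io
import pandas as pd

# Hypothetical one-row result with a tuple-valued coordinates column
demo = pd.DataFrame({"z-score": [0.5], "coordinates": [(-37.3, 144.27)]})

# Write to an in-memory buffer and read it back, as to_csv/read_csv would with a file
buffer = io.StringIO()
demo.to_csv(buffer)
buffer.seek(0)
loaded = pd.read_csv(buffer, index_col=0)

# After loading, the tuple is plain text such as "(-37.3, 144.27)"
type_after_loading = type(loaded.loc[0, "coordinates"])

# Convert the text back into real tuples
loaded["coordinates"] = loaded["coordinates"].apply(ast.literal_eval)

print(type_after_loading, loaded.loc[0, "coordinates"])
```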