# Python for Spatial Analysis
### Second part of the module of GG3209 Spatial Analysis with GIS
---
# Notebook to practice Spatial Clustering - Exercises
---
Dr Fernando Benitez -  University of St Andrews - School of Geography and Sustainable Development

# 1. Get the required data in your Drive

Go to [Kaggle - Car Accidents in the UK](https://www.kaggle.com/datasets/devansodariya/road-accident-united-kingdom-uk-dataset/) and download the **Road Accident (United Kingdom (UK)) dataset**. This dataset includes **more than a million observations, so you will definitely need to slice it to be able to work with it in Colab. This will be one of the challenges you will face.**

Upload the dataset to your **Google Drive**. Make sure you mount the Drive in this Notebook (if you don't recall how to do that, check the guideline about GeoPandas) so you can access the data in the following tasks.


---




# 2. Exploratory Data Analysis and K-means Clustering

Install additional libraries like Lonboard to display large datasets.

## Part A: Data Exploration and Pre-Processing

1. Use pandas to load the car accidents dataset.
2. Display the first few rows to understand the available attributes.
3. Keep only the necessary columns, making sure you retain a mix of numerical and categorical attributes.
4. Slice (cut) the pandas dataframe by including only records from 2010, which will reduce your dataset to approx 770585 rows.
5. Make a simple plot to represent which day of the week historically has had more car accidents. Which day?
6. Make a second plot to explore the relationship between **Accident Severity** and **Road Conditions**. What insights can you gain from it? Use a Text Cell to reflect on the two previous charts.
7. Using the Lonboard library, map all the car accidents included in the filtered dataset.
8. Apply a spatial filter (creating a new dataset) to keep only the car accidents in the Glasgow-Edinburgh region, then create another map with the Lonboard library that displays the car accidents in that region only.
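
The slicing steps above can be sketched with plain pandas on a tiny made-up table. This is only an illustration under stated assumptions: the column names `Year` and `Day_of_Week` are assumed to exist in the Kaggle file, "from 2010" is read here as 2010 onwards, and the bounding-box numbers are rough placeholders rather than an official extent for the Glasgow-Edinburgh region.

```python
import pandas as pd

# Tiny made-up sample; the real data would come from pd.read_csv(...)
accidents = pd.DataFrame({
    "Year": [2008, 2010, 2012, 2013, 2014],
    "Longitude": [-0.12, -4.25, -3.19, -1.89, -3.94],
    "Latitude": [51.50, 55.86, 55.95, 52.48, 56.00],
    "Day_of_Week": [6, 2, 6, 4, 6],
})

# Slice by year (assumption: 2010 onwards; use == 2010 for that single year)
recent = accidents[accidents["Year"] >= 2010]

# Count accidents per day of the week, ready for a bar plot
per_day = recent["Day_of_Week"].value_counts().sort_index()

# Hypothetical bounding box (min_lon, min_lat, max_lon, max_lat)
min_lon, min_lat, max_lon, max_lat = -4.6, 55.7, -2.9, 56.2
in_box = (
    recent["Longitude"].between(min_lon, max_lon)
    & recent["Latitude"].between(min_lat, max_lat)
)
glasgow_edinburgh = recent[in_box].copy()

print(per_day)
print(len(glasgow_edinburgh), "accidents inside the box")
```

Filtering on the coordinate columns before building a GeoDataFrame keeps memory use low, which matters in Colab with a file of this size.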

## Part B: K-means Clustering Implementation:

1. Implement K-means clustering with different values of k (e.g. 3 and 5 clusters) on the filtered dataset you created for the Glasgow-Edinburgh region.
2. Map the clusters using the lonboard library.
3. Describe the clustering results in a Text Cell. **How does the choice of k impact the clusters?** Describe how the clusters change as you try several values of that parameter.
4. **Finally**: In the guideline, we created the clusters using only the coordinates (`['Longitude', 'Latitude']`). In another code cell, implement K-means clustering again, but now using other attributes included in the dataframe, such as `Accident_Severity` and `Number_of_Vehicles`.
5. Visualise the results using the `lonboard` library.
6. In a Text Cell, reflect on the clusters that include only the coordinates and the ones that also include other attributes. What insights can you gain about that?
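
As a rough sketch of the two K-means variants described above, the snippet below runs scikit-learn's `KMeans` on synthetic points, first on coordinates only and then on attributes only. The data are randomly generated, and the `StandardScaler` step is an addition assumed here (not something the guideline prescribes) so that attributes measured on different scales contribute comparably.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 90

# Synthetic stand-in for the filtered accidents (three loose groups of points)
toy = pd.DataFrame({
    "Longitude": np.concatenate([
        rng.normal(-4.25, 0.03, 30),
        rng.normal(-3.78, 0.03, 30),
        rng.normal(-3.19, 0.03, 30),
    ]),
    "Latitude": np.concatenate([
        rng.normal(55.86, 0.02, 30),
        rng.normal(56.00, 0.02, 30),
        rng.normal(55.95, 0.02, 30),
    ]),
    "Accident_Severity": rng.integers(1, 4, n),
    "Number_of_Vehicles": rng.integers(1, 5, n),
})

# Variant 1: coordinates only, for two values of k
labels_by_k = {}
for k in (3, 5):
    model = KMeans(n_clusters=k, n_init=10, random_state=0)
    labels_by_k[k] = model.fit_predict(toy[["Longitude", "Latitude"]])

# Variant 2: attributes only, scaled first (scaling is an assumption of this sketch)
scaled = StandardScaler().fit_transform(
    toy[["Accident_Severity", "Number_of_Vehicles"]]
)
attr_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(scaled)

toy["cluster_k3"] = labels_by_k[3]
toy["cluster_attr"] = attr_labels
print(toy[["cluster_k3", "cluster_attr"]].value_counts().head())
```

Clusters built from attributes alone have no reason to be spatially compact, which is worth keeping in mind when comparing the two maps.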

---


# 3. Spatial Analysis and DBSCAN Clustering

## Part A: Spatial Correlation

1. Create another GeoPandas GeoDataFrame by rereading the data, to avoid any confusion with the previous GeoDataFrame. This new one is for DBSCAN, so name it accordingly.

2. Using the [BBox website](https://boundingbox.klokantech.com/) to obtain the bounding-box coordinates, filter the new GeoDataFrame to contain only the accidents around **Birmingham**.

3. Using the Lonboard library, map the filtered dataset in **Birmingham**.

Before creating any spatial clustering, **it would be beneficial to explore any correlations to identify potential relationships between variables**, such as whether bad weather conditions influence accident severity or whether the number of vehicles involved correlates with the number of casualties.

4. In a code cell, inspect the data type of each attribute so you can identify which attributes are numerical and which are categorical. Tip: use `.dtypes`

5. In a code cell, compute the correlation between the numerical attributes by including `corr = your_dataframe.corr()` in your code.

**You probably got an error running the previous code.**
**How can you solve this issue?**

Before asking ChatGPT or any GenAI tool, try the **Pandas documentation**: look at the parameters of the **corr** function and find the one you need to build a correlation matrix for the numerical attributes only. https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.corr.html
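
Independently of the parameter described in the documentation, one possible way to see why the error appears is to separate the columns by type first. The sketch below uses a small made-up table; the `Weather_Conditions` column and its values are hypothetical and only stand in for any text attribute.

```python
import pandas as pd

toy = pd.DataFrame({
    "Accident_Severity": [3, 2, 3, 1, 3],
    "Number_of_Vehicles": [2, 1, 3, 2, 2],
    "Number_of_Casualties": [1, 1, 2, 3, 1],
    "Weather_Conditions": ["Fine", "Rain", "Fine", "Fog", "Rain"],  # hypothetical text column
})

# Inspect the type of every column
print(toy.dtypes)

# Split the column names by type
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

# Pearson correlation on the numeric subset only
corr = toy[numeric_cols].corr()
print(corr.round(2))
```

Text columns cannot be turned into Pearson coefficients directly, so they have to be excluded (or encoded) before the matrix is computed.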


6. In a new Code Cell, adjust the following code to create a heatmap plot of your correlation values.

```
import matplotlib.pyplot as plt
import seaborn as sns

# Replace your_corr_variable with the correlation matrix from step 5
plt.figure(figsize=(10, 8))
sns.heatmap(
    your_corr_variable,
    annot=True,          # Write the coefficient inside each cell
    cmap='coolwarm',
    fmt='.2f',           # Format annotations to 2 decimal places
    linewidths=0.5,
    cbar_kws={'label': 'Correlation Coefficient'}
)

plt.title('Pearson Correlation Matrix')
plt.show()
```

In terms of predictive modelling, a strong correlation indicates that one variable is a good predictor of the other. For example, if `Number_of_Casualties` has a 0.99 (Pearson) correlation with a separate column called `Injured_People_Count`, you know they convey almost identical information.

But **there are downsides, such as potential multicollinearity**: if you are building a predictive model (like linear regression), two predictors that are too highly correlated (often > 0.9) make the model unstable, and you may need to remove one of them.
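
A minimal sketch of how such pairs could be flagged is shown below. The helper name `highly_correlated_pairs` and the toy values are invented for illustration, and the 0.9 cut-off simply mirrors the rule of thumb mentioned above.

```python
import numpy as np
import pandas as pd

def highly_correlated_pairs(df, threshold=0.9):
    """Return pairs of numeric columns whose absolute Pearson r exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair is reported once
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack()
    return pairs[pairs > threshold]

# Made-up values: the first two columns carry almost the same information
toy = pd.DataFrame({
    "Number_of_Casualties": [1, 2, 3, 4, 5, 6],
    "Injured_People_Count": [1, 2, 3, 4, 5, 7],
    "Number_of_Vehicles": [2, 1, 3, 1, 2, 1],
})

flagged = highly_correlated_pairs(toy)
print(flagged)
```

Any pair returned here is a candidate for dropping one of its two columns before fitting a regression.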

Here is where our module associated with **Spatial Data Science** plays an important role. **Spatial statistical models** differ from standard (non-spatial) statistical models because they explicitly account for geographic location and for the principle that nearby things are more related than distant things (Tobler's first law of geography).

Just as you evaluated and learned from the GWR method, in this exercise we can apply Moran's I.

In the context of our car accident dataset, while a standard statistic (Pearson) might tell you that severe weather correlates globally with severe accidents, a spatial model could tell us that this relationship only holds true in coastal areas, while in mountain areas road surface conditions are more important locally.

---

7. Install the library **pysal** by running in a code cell:
`pip install pysal`

8. Now import the newly required libraries.

```
import libpysal.weights as weights
from esda.moran import Moran
```
9. You must reproject your dataset to a projected CRS (recall the EPSG code you used in the guideline notebook to study spatial data in the UK), because the distance threshold used below only makes sense in projected units.



After you reproject the dataset, you can use it to compute the global Moran's I and, later, to run the DBSCAN spatial clustering. **Adjust the following code to match your variables and datasets.**

```
# Adjust your_dataset_projected to match your own projected GeoDataFrame.
# threshold is expressed in the units of the projected CRS.
w = weights.DistanceBand.from_dataframe(your_dataset_projected, threshold=500, ids=your_dataset_projected.index)
w.transform = 'R'  # Row-standardise the weights
moran = Moran(your_dataset_projected['Accident_Severity'], w)

print("\n--- Moran's I Spatial Autocorrelation Analysis ---")
print(f"Defined {w.n} observations and {w.mean_neighbors:.2f} average neighbors per point.")
print(f"\nMoran's I Statistic (Observed I): {moran.I:.4f}")
print(f"P-value (significance): {moran.p_sim:.4f}")
```

**How to read the results:**

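The resulting `moran.I` value describes the spatial pattern of the chosen attribute in your dataset.

- **Near +1**: High positive spatial autocorrelation (similar values are found close together)
- **Near -1**: Negative spatial autocorrelation (neighbouring values tend to be dissimilar)
- **Near 0**: A random spatial pattern

Read it together with `moran.p_sim`, the pseudo p-value obtained from random permutations: a small value (commonly below 0.05) suggests the observed pattern is unlikely to be due to chance.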
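
To make these readings more concrete, here is a small NumPy-only sketch of the global Moran's I formula with row-standardised distance-band weights. It is not the `esda` implementation, the function name `morans_i` is made up for this example, and the six points on a line are invented data.

```python
import numpy as np

def morans_i(values, coords, threshold):
    """Global Moran's I with row-standardised distance-band weights."""
    values = np.asarray(values, dtype=float)
    coords = np.asarray(coords, dtype=float)

    # Pairwise Euclidean distances between all points
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))

    # Neighbours: within the threshold, excluding zero distance
    # (this drops each point itself and any exactly coincident points)
    w = ((dist > 0) & (dist <= threshold)).astype(float)

    # Row-standardise; rows without neighbours stay at zero
    row_sums = w.sum(axis=1, keepdims=True)
    w = np.divide(w, row_sums, out=np.zeros_like(w), where=row_sums > 0)

    z = values - values.mean()
    n = len(values)
    s0 = w.sum()
    return (n / s0) * (z @ w @ z) / (z @ z)

# Six invented points on a line, 100 m apart
coords = [[x, 0] for x in range(0, 600, 100)]

clustered = [1, 1, 1, 3, 3, 3]    # similar values sit next to each other
alternating = [1, 3, 1, 3, 1, 3]  # every neighbour has a different value

i_clustered = morans_i(clustered, coords, threshold=150)
i_alternating = morans_i(alternating, coords, threshold=150)
print(f"clustered:   {i_clustered:.4f}")
print(f"alternating: {i_alternating:.4f}")
```

The clustered arrangement yields a positive value and the alternating one a negative value, matching the first two bullet points above.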

10. In a Text Cell, describe the results in your own words. What insights can you gain from the correlation analysis?

## Part B: DBSCAN Clustering Implementation:

1. Implement DBSCAN clustering with different **eps** and **min_samples** to the projected dataset.
2. Map the clusters using the Plotly Library.
3. Describe the clustering results in a Text Cell. **How does the choice of eps and min_samples impact the clusters?** Describe how the clusters change as you try several values of those parameters.
4. In a Text Cell, briefly reflect on the clusters created using **K-Means** and the ones generated with **DBSCAN**. What insights can you gain from that? Do you see any limitations?
5. Finally, in a new text cell address the following question: **What do you think are the real-world implications of the identified clusters in the field of urban planning?**
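
As a hedged illustration of how **eps** and **min_samples** interact, the sketch below runs scikit-learn's `DBSCAN` on nine invented points given in projected coordinates (assumed to be metres). The two parameter combinations are arbitrary choices for demonstration, not recommended settings for the Birmingham data.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Invented projected coordinates: two tight groups and one isolated point
coords = np.array([
    [0, 0], [100, 0], [0, 100], [100, 100],
    [5000, 5000], [5100, 5000], [5000, 5100], [5100, 5100],
    [20000, 20000],
], dtype=float)

summary = {}
for eps, min_samples in [(150, 3), (50, 3)]:
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(coords)
    n_clusters = len(set(labels) - {-1})   # label -1 marks noise
    n_noise = int((labels == -1).sum())
    summary[(eps, min_samples)] = (n_clusters, n_noise)
    print(f"eps={eps}, min_samples={min_samples}: "
          f"{n_clusters} clusters, {n_noise} noise points")
```

With the larger eps the two groups are recovered and the isolated point is labelled as noise; with the smaller eps no point has enough neighbours, so everything is treated as noise.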

---

If you have finished the initial guide and all the challenges included in this notebook, **upload the finished version of both notebooks to your GitHub repository** (check how to do it in the workbook lab document included in Moodle). **Congrats, you have finished!**

### **Important Note:** Avoid using ChatGPT for your reflective notes. Instead, describe in your own words what you observe from your analysis results. I want to see and read your authentic thoughts and insights based on your understanding, rather than a complicated or overly structured response. Take some time to evaluate the results you have obtained and make an effort to briefly describe what you have found.
