# Mapping Lepidoptera Diversity across Northern Ireland between 2014 and 2024, using Python and NBN Atlas Data

## Introduction
Biodiversity is the variety of all life on Earth, from micro-organisms through to plant and animal species (World Health Organisation (WHO), 2025). Each species plays a role in maintaining ecosystems by providing specific ecosystem services, and the more species within a community, the healthier that community is (Pavid, n.d.). However, species and ecosystems are fragile, and changes to the environment can have detrimental effects on the strength and condition of both.

The insect order *Lepidoptera* consists of butterflies and moths, important pollinators that contribute to plant reproduction and to ecosystem health and resilience (Baral et al., 2025). Many *Lepidoptera* species have specialised habitats or host plants, making them highly sensitive to environmental change (Carus and Carneiro, 2024). The order is also well studied, making it useful as a diversity indicator (Barkmann et al., 2024).

This project demonstrates how to map biodiversity for butterflies and moths in Northern Ireland (NI) between 2014 and 2024, using open data from the National Biodiversity Network (NBN) Atlas. The aim of the study is to visualise areas with high species richness and assess biodiversity patterns, helping inform conservation and land management efforts.  
The project uses `python` in `jupyter-lab` to process, clean and analyse *Lepidoptera* occurrence data, and to create species richness maps. The maps are overlaid with protected areas for added ecological context.

## Setup and Installation

### Getting Started
To get started on the project, `git` and `conda` must be installed on your computer. The `git` installer for your operating system can be downloaded from [git-scm.com](https://git-scm.com/downloads).

To install `conda`, download Anaconda, which includes the graphical interface Anaconda Navigator, from [https://www.anaconda.com/download/success](https://www.anaconda.com/download/success).

### Optional Steps
A GitHub account is recommended so that you can create a fork of the repository while working through the project. A free account is available at [https://github.com/](https://github.com/).  

It is also recommended to use an integrated development environment (IDE) such as PyCharm or VSCode to write your code. Downloads are available from:
- **PyCharm Community Edition** [https://www.jetbrains.com/pycharm/download/](https://www.jetbrains.com/pycharm/download/)
- **VSCode** [https://visualstudio.microsoft.com/downloads/](https://visualstudio.microsoft.com/downloads/)

### Download/Clone the Project Repository
**The project repository is available at:** [https://github.com/izlbyzl/egm722_project_02](https://github.com/izlbyzl/egm722_project_02)    

First, fork the repository to your GitHub account, then clone your fork using the following command:  

`git clone https://github.com/{your_username}/egm722_project_02`  

replacing `{your_username}` with your own GitHub username.

If you do not have a GitHub account, the project repository can be cloned using: 

`git clone https://github.com/izlbyzl/egm722_project_02`


### Setting up the Conda Environment
After cloning the repository, a `conda` environment is created using the **environment.yml** file provided in the repository. 

In Anaconda Navigator, select **Import** from the **Environments** panel and choose the **environment.yml** file.  
Alternatively, open a command prompt or terminal in the directory where the project was cloned and run:  

`conda env create -f environment.yml`  

The dependencies this project requires are:
- `numpy`: for performing mathematical functions [https://numpy.org/](https://numpy.org/)
- `geopandas`: for working with geospatial data [https://geopandas.org/en/stable/](https://geopandas.org/en/stable/)
- `matplotlib`: for visualising data [https://matplotlib.org/](https://matplotlib.org/)
- `cartopy`: for producing maps and geospatial data analysis [https://scitools.org.uk/cartopy/docs/latest/](https://scitools.org.uk/cartopy/docs/latest/)
- `jupyter-lab`: interactive notebook environment for running the code [https://jupyter.org/](https://jupyter.org/)
- `pandas`: for data analysis and manipulation [https://pandas.pydata.org/](https://pandas.pydata.org/)
- `shapely`: for manipulation and analysis of geometric objects such as points and polygons [https://pypi.org/project/shapely/](https://pypi.org/project/shapely/)
- `ipywidgets`: adds interactive widgets to Jupyter [https://ipywidgets.readthedocs.io/en/stable/](https://ipywidgets.readthedocs.io/en/stable/)
- `rasterio`: for use with raster datasets [https://rasterio.readthedocs.io/en/stable/](https://rasterio.readthedocs.io/en/stable/)
- `pyepsg`: provides access to EPSG codes [https://pyepsg.readthedocs.io/en/latest/](https://pyepsg.readthedocs.io/en/latest/)

Finally, launch `jupyter-lab` from your `conda` environment and work through the code.

### Additional Setup Steps
Northern Ireland shapefiles (**.shp**) containing vector data for counties, towns, lakes, Areas of Outstanding Natural Beauty (AONBs) and Areas of Special Scientific Interest (ASSIs) were collected and prepared for use by Ulster University as part of the Geographic Information Systems (GIS) MSc. Sourced from: [Ordnance Survey of Northern Ireland (n.d.).](https://www.nidirect.gov.uk/articles/osni-open-data-product-list)  

Species occurrence data were downloaded from the [National Biodiversity Network (NBN) Atlas (2025).](https://nbnatlas.org/) A search for the order *Lepidoptera* gives access to a UK dataset, which can be refined by location, time frame and other filter parameters, then downloaded in **.csv** format. Once downloaded, unnecessary columns are removed from the dataframe, ensuring that the species name, geographic coordinate (latitude/longitude) and date columns remain.


## Methods
Analysis was carried out in `jupyter-lab` using a `python` script environment. The workflow includes data collection, pre-processing, map creation and species diversity analysis using Simpson's Diversity Index. A list of dependencies is given in the *Setting up the Conda Environment* section.

### Data Collection 
The required `python` dependencies were loaded into the notebook using `import` statements, to enable data manipulation, geospatial analysis and visualisation.
A custom function was defined to assign EPSG:4326 to any GeoDataFrame with no coordinate reference system (CRS) set, and then to reproject the data to the target EPSG:32629. This ensures that all dataframes share a consistent CRS suitable for analysis within Northern Ireland.
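
The CRS step described above could be sketched as follows. This is a minimal illustration, not the repository's own function: the name `ensure_crs`, its parameters and the sample point are assumptions, and it presumes that any layer without a CRS really does hold WGS 84 longitude/latitude values.

```python
import geopandas as gpd
from shapely.geometry import Point

TARGET_EPSG = 32629  # WGS 84 / UTM Zone 29N


def ensure_crs(gdf, target_epsg=TARGET_EPSG, assumed_epsg=4326):
    """Return a copy of gdf in target_epsg (hypothetical helper).

    If gdf has no CRS, it is assumed to be in assumed_epsg first.
    """
    if gdf.crs is None:
        # Assumption: data without a CRS are longitude/latitude in WGS 84
        gdf = gdf.set_crs(epsg=assumed_epsg)
    return gdf.to_crs(epsg=target_epsg)


# Hypothetical single record near Belfast, created without a CRS
points = gpd.GeoDataFrame({"name": ["Belfast"]}, geometry=[Point(-5.93, 54.60)])
points_utm = ensure_crs(points)
print(points_utm.crs.to_epsg())
```

Because `set_crs()` and `to_crs()` both return new objects by default, the input GeoDataFrame is left unchanged.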

Datasets were loaded into the notebook using:
- `geopandas.read_file()` for vector data (**.shp**) representing the NI boundary, counties, towns, lakes, AONBs and ASSIs.
- `pandas.read_csv()` for the species occurrence dataset (**.csv**)

### Pre-Processing
Further helper functions were defined:
- `generate_handles()` for legend creation
- `scale_bar()` to add a scale bar to maps
- `simpsons_diversity()` for calculating Simpson's Diversity Index

The *Lepidoptera* **.csv** file was transformed into a spatial dataset by creating a geometry column from the X/Y coordinates (longitude/latitude). The result was converted into a `GeoDataFrame` and reprojected to the UTM Zone 29 projection (EPSG:32629). Further unnecessary columns were removed, and the remaining columns renamed to standardise the dataset, which was then saved as a **.shp** file for future use.
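
As a rough sketch of this conversion, assuming hypothetical column names (`species`, `latitude`, `longitude`, `date`) that may differ from those in a real NBN Atlas download:

```python
import io

import geopandas as gpd
import pandas as pd

# Hypothetical extract standing in for the downloaded .csv file
csv_text = """species,latitude,longitude,date
Aglais io,54.60,-5.93,2019-07-14
Pieris napi,54.35,-6.65,2021-05-02
"""
records = pd.read_csv(io.StringIO(csv_text))

# points_from_xy takes x (longitude) first, then y (latitude)
geometry = gpd.points_from_xy(records["longitude"], records["latitude"])
lepidoptera = gpd.GeoDataFrame(records, geometry=geometry, crs="EPSG:4326")

# Reproject from WGS 84 to UTM Zone 29N
lepidoptera = lepidoptera.to_crs(epsg=32629)

# lepidoptera.to_file("lepidoptera.shp")  # hypothetical output path
print(lepidoptera[["species", "geometry"]])
```

Shapefile field names are limited to 10 characters, which is one reason to rename long columns before saving.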

### Map Creation
To visualise the data, a figure and axis object were created using `matplotlib`. The CRS for all layers was set to UTM Zone 29, and the map extent was defined using the total bounds of the NI outline geometry.

Features were added to the map as follows:
- Polygons were wrapped in `ShapelyFeature()` and added using `add_feature()`
- Point data were plotted using `ax.plot()`

Each county was assigned its own symbology so that counties can be distinguished on the map and in the legend. The unique county names were iterated through in a `for` loop and each was assigned a colour.
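
The colour assignment could look something like the sketch below. The county names and colours are illustrative assumptions; the names in the actual counties layer may be spelled or capitalised differently.

```python
# Hypothetical county names, as they might appear in the counties layer
county_names = ["Antrim", "Armagh", "Down", "Fermanagh", "Londonderry", "Tyrone"]

# Any valid matplotlib colour names would work here
colours = ["firebrick", "seagreen", "royalblue", "goldenrod", "orchid", "slategrey"]

# Pair each unique county name with one colour
county_colours = {}
for name, colour in zip(sorted(set(county_names)), colours):
    county_colours[name] = colour

print(county_colours["Down"])
```

The resulting dictionary can then be used both when drawing each county and when building the matching legend entries, so the two stay in step.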

Custom handles were generated to represent each feature in the legend and added to the map, along with gridlines, a scale bar and town name labels. The final map was saved as a high-resolution image.

### Biodiversity Analysis
To assess *Lepidoptera* biodiversity across Northern Ireland:
- Species Richness (SR), the total number of distinct species, was calculated by counting the unique species names in the dataset.
- **Simpson's Diversity Index** (SDI) was computed to measure species richness and evenness in NI.

SDI calculates diversity using the number of species and their abundance. There are three forms of Simpson's index: the Dominance Index (D), the Diversity Index (1-D) and the Reciprocal Index (1/D), represented by the following equations:

- Simpson's Dominance Index (D):
$$
D = \displaystyle\sum_{i=1}^s p_i^2
$$


- Simpson's Diversity Index (1-D):
$$
1-D = 1 - \displaystyle\sum_{i=1}^s p_i^2
$$

- Simpson's Reciprocal Index (1/D):

$$
1/D = \frac{1}{\displaystyle\sum_{i=1}^s p_i^2}
$$

where *p<sub>i</sub>* is the proportion of individuals in species *i* and *s* is the total number of species (Suf, 2024).

The value of D is the probability that two individuals drawn at random belong to the same species (Suf, 2024). It falls between 0 and 1, where values closer to 1 indicate dominance by one or a few species, and values closer to 0 indicate many equally abundant species. Because higher values mean lower diversity, this result is counter-intuitive, so D is transformed into 1-D and 1/D.
Subtracting D from 1 gives higher values for greater diversity. The value of 1-D lies between 0 and 1, where 1 is complete diversity and 0 is complete uniformity (Royal Geographical Society, n.d.). Comparing this value between locations indicates their relative degree of biodiversity.
The value of 1/D ranges from 1, when only a single species is present, up to the number of species in the dataset, when all species are equally abundant (Bobbitt, 2021). When comparing between locations, higher values mean greater diversity.
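
A small worked example may help to make the three indices concrete. The sketch below uses invented counts (5, 3 and 2 records of three species) and a hypothetical function name, `simpsons_indices`, which is not the repository's `simpsons_diversity()` function.

```python
from collections import Counter


def simpsons_indices(species_names):
    """Return (D, 1 - D, 1 / D) for an iterable of species names, one per record."""
    counts = Counter(species_names)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("at least one record is required")
    # D is the sum of squared proportions p_i
    dominance = sum((n / total) ** 2 for n in counts.values())
    return dominance, 1 - dominance, 1 / dominance


# Hypothetical records: 5, 3 and 2 sightings of three species
records = ["Aglais io"] * 5 + ["Pieris napi"] * 3 + ["Vanessa atalanta"] * 2

richness = len(set(records))  # species richness: number of unique names
d, diversity, reciprocal = simpsons_indices(records)
print(richness, round(d, 2), round(diversity, 2), round(reciprocal, 2))
# prints: 3 0.38 0.62 2.63
```

Here the proportions are 0.5, 0.3 and 0.2, so D = 0.25 + 0.09 + 0.04 = 0.38. The reciprocal value of about 2.63 is below the richness of 3 because the three species are not equally abundant.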

Initially, SDI was assessed for the whole of Northern Ireland. Then, to further analyse spatial patterns in diversity, the *Lepidoptera* dataset was spatially joined to the county, AONB and ASSI layers using `geopandas.sjoin()`, enabling diversity comparisons between these areas.
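
The join-then-compare step might be sketched as below. The two square "counties", the four points, the `CountyName` column and the `diversity` helper are all hypothetical stand-ins for the real layers and functions, chosen so the result is easy to check by hand.

```python
import geopandas as gpd
from shapely.geometry import Point, box

# Hypothetical layers: two square "counties" and four occurrence points
counties = gpd.GeoDataFrame(
    {"CountyName": ["West", "East"]},
    geometry=[box(0, 0, 10, 10), box(10, 0, 20, 10)],
    crs="EPSG:32629",
)
occurrences = gpd.GeoDataFrame(
    {"species": ["Aglais io", "Pieris napi", "Aglais io", "Aglais io"]},
    geometry=[Point(2, 2), Point(5, 5), Point(12, 3), Point(15, 8)],
    crs="EPSG:32629",
)

# Each point gains the attributes of the polygon it intersects
joined = gpd.sjoin(occurrences, counties, how="inner")


def diversity(names):
    """Simpson's Diversity Index (1 - D) for a pandas Series of species names."""
    proportions = names.value_counts(normalize=True)
    return 1 - (proportions ** 2).sum()


by_county = joined.groupby("CountyName")["species"].agg(diversity)
print(by_county)
```

In this toy data, "West" holds two equally abundant species (1-D = 0.5) while "East" holds only one (1-D = 0). Both layers must share the same CRS before joining.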

## Expected Results
After working through the `python` script, a series of maps should be generated. Figure 1 shows the distribution of *Lepidoptera* point data throughout Northern Ireland between 2014 and 2024, with counties colour coded and town names labelled. The number of unique *Lepidoptera* species in Northern Ireland will also be computed.

![Map of Northern Ireland *Lepidoptera* occurrence.](map.png)
*Figure 1 Lepidoptera distribution across Northern Ireland*

Simpson's Diversity Index is then calculated for all of Northern Ireland, by county, and by protected/unprotected areas. This produces ..... 


## Troubleshooting Tips

### Python Troubleshooting
When writing `python` code, error messages and warnings will be frequently encountered and will need to be addressed. For an experienced user, the type of error and attached message can help establish the issue, but novices may need to seek further help.  

#### **Error Types**
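- *NameError*: a variable or function name is not defined; check spelling and that the cell defining it has been run
- *ValueError*: a function received an argument of the right type but with an invalid value
- *AttributeError*: the attribute or method does not exist on that object; check for typos
- *KeyError*: the key does not exist, for example a missing dataframe column; check column names
- *UserWarning*: the code will still run, but the message alerts to a possible future issue and may suggest a modification  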
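
To see what these errors look like in practice, the short sketch below deliberately triggers four of them and collects the messages. The dictionary and values are invented for illustration.

```python
# Each block deliberately triggers one of the errors listed above.
records = {"species": ["Aglais io", "Pieris napi"]}
messages = {}

try:
    undefined_variable  # NameError: this name was never assigned
except NameError as err:
    messages["NameError"] = str(err)

try:
    int("fifty")  # ValueError: a string is accepted, but not this value
except ValueError as err:
    messages["ValueError"] = str(err)

try:
    records["species"].add("Vanessa atalanta")  # AttributeError: lists use append()
except AttributeError as err:
    messages["AttributeError"] = str(err)

try:
    records["Species"]  # KeyError: keys are case-sensitive
except KeyError as err:
    messages["KeyError"] = str(err)

for error_type, message in messages.items():
    print(f"{error_type}: {message}")
```

Reading the final line of a traceback first usually gives the error type and the name, value or key that caused it.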

### Git Troubleshooting
Problems can also occur when using `git`, where information can be lost or wrongly included. There are roll-back methods available in `git` to help resolve issues such as merge conflicts or to restore deleted branches. Commands that can be used to fix issues ([geeksforgeeks, 2024](https://www.geeksforgeeks.org/common-git-problems-and-their-fixes/)) include:  
- `commit --amend`
- `rebase`
- `reset`
- `revert`
- `bisect`

To avoid problems using `git`, it is good practice to make frequent commits with useful messages and to regularly synchronise the repository ([Git Scripts, 2024](https://gitscripts.com/git-problems)).

### Further Help
Useful tips from [Python Central, 2022](https://www.pythoncentral.io/how-to-solve-python-bugs-for-python-novices/) for writing code include: 
- use code that already works: search online for existing scripts and modify them to suit
- print outputs often to check that values are correct
- run the code after each change, which makes errors easier to find
- read the error message and search for it online
- take a break, as fresh minds approach problems differently
- ask for help: there are many forums, with [StackOverflow, 2025](https://stackoverflow.com/questions) being a popular choice.

## References

- Barkmann et al. (2024)
- Bobbitt, Z. (2021)
- Carus and Carneiro (2024) 
- geeksforgeeks (2024) *Common Git Problems and Their Fixes*. Available at: [https://www.geeksforgeeks.org/common-git-problems-and-their-fixes/](https://www.geeksforgeeks.org/common-git-problems-and-their-fixes/) (Accessed: 22 April 2025).
- Git Scripts (2024) *Mastering Git Problems: Quick Fixes for Common Issues*. Available at: [https://gitscripts.com/git-problems](https://gitscripts.com/git-problems) (Accessed: 22 April 2025).
- National Biodiversity Network (NBN) Atlas (2025) *NBN Atlas*. Available at: [https://nbnatlas.org/](https://nbnatlas.org/) (Accessed: 25 April 2025).
- Ordnance Survey of Northern Ireland (n.d.) *OSNI Open Data Product List*. Available at: [https://www.nidirect.gov.uk/articles/osni-open-data-product-list](https://www.nidirect.gov.uk/articles/osni-open-data-product-list) (Accessed: 25 April 2025).
- Python Central (2022) *How to Solve Python Bugs for Python Novices*. Available at: [https://www.pythoncentral.io/how-to-solve-python-bugs-for-python-novices/](https://www.pythoncentral.io/how-to-solve-python-bugs-for-python-novices/) (Accessed: 22 April 2025).
- Royal Geographical Society (n.d.)
- StackOverflow (2025) *Newest Questions*. Available at: [https://stackoverflow.com/questions](https://stackoverflow.com/questions) (Accessed: 22 April 2025).
- Suf (2024) Simpson’s Diversity Index: Calculating Species Dominance and Evenness. *The Research Scientist Pod*. Available at: [https://researchdatapod.com/simpsons-diversity-index/](https://researchdatapod.com/simpsons-diversity-index/) (Accessed: 25 April 2025).
- World Health Organisation (2025)