# MapReader Autumn Workshop (2024)


## Set up for Google Colab

>**NOTE**: Skip this section if you are not using Google Colab!

The cells below will:

- Mount your Google Drive
- Create a directory for the workshop
- Change the working directory to the workshop directory
- Download and install the required packages and data

In [None]:
# mount your drive
from google.colab import drive
drive.mount('/content/drive')

# set up MapReader_Autumn_Workshop directory
!mkdir -p /content/drive/MyDrive/MapReader_Autumn_Workshop
%cd /content/drive/MyDrive/MapReader_Autumn_Workshop

In [None]:
!git clone https://github.com/maps-as-data/mapreader-autumn-workshop-2024.git
!pip install mapreader[dev]
!pip install sentence-transformers scikit-learn plotly

In [None]:
# enable custom widgets in colab
from google.colab import output
output.enable_custom_widget_manager()

## Define root directory

>**NOTE**: Start from here!

In [None]:
try:
    import google.colab
    ROOT = './mapreader-autumn-workshop-2024'
except ImportError:
    ROOT = '.'

## 'Urban' vs 'Rural' text

This notebook provides some examples of ways in which MapReader's patch classification and text spotting outputs can be combined.

We will use our datasets to investigate the textual description of urban and rural landscapes by comparing text that often appears in the built environment (i.e. near building patches) versus the rest of the map. 

In [None]:
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
import rasterio
import rasterio.plot

from tqdm.auto import tqdm

## Load the data

We will first need to load our building classification data and text spotting data.

Since these were saved as `geojson` files we can load them with the `geopandas` library.

> **NOTE**: If you annotated/trained a model for something other than buildings, you will need to download the building predictions from [here](https://drive.google.com/file/d/1tqQAZ5rcHHel8WTRCEANRzGj88AGbXxA/view?usp=sharing) and save them to the workshop directory in your Google Drive. You should then update the path in the cell below to "./building_predicted_outputs.geojson".

In [None]:
# load building patch predictions
building_predictions = gpd.read_file("./predicted_outputs.geojson")
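
For orientation, a GeoJSON file is plain JSON: a collection of features, each with a `geometry` and a set of `properties` that `geopandas` turns into columns. The sketch below parses a tiny hand-written example with the standard library; the values are illustrative only and are not real MapReader output.

```python
import json

# A tiny hand-written FeatureCollection (illustrative values only)
geojson_text = """
{
  "type": "FeatureCollection",
  "features": [
    {
      "type": "Feature",
      "properties": {"parent_id": "map_74427695.png", "predicted_label": "building"},
      "geometry": {
        "type": "Polygon",
        "coordinates": [[[0, 0], [1, 0], [1, 1], [0, 1], [0, 0]]]
      }
    }
  ]
}
"""

collection = json.loads(geojson_text)
first = collection["features"][0]

print(first["properties"]["predicted_label"])  # the property that becomes a column
print(first["geometry"]["type"])               # the geometry type of the feature
```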

Since we only have text outputs for "map_74427695.png", we will filter our building predictions to only include the building patches from this map.

In [None]:
building_predictions = building_predictions[building_predictions["parent_id"] == "map_74427695.png"] # filter for only map_74427695.png

In [None]:
building_predictions.head() # view the building predictions

We can plot our building predictions on the map image using `rasterio` and `matplotlib`. 

This will allow us to see the building patches that we will be working with.

In [None]:
src = rasterio.open(f"{ROOT}/maps/map_74427695.tif") # open the tiff file

fig, ax = plt.subplots(figsize=(10, 10)) # create a plot

rasterio.plot.show(src, transform=src.transform, ax=ax) # plot the map image
building_predictions.plot("predicted_label", legend=True, cmap="viridis", alpha=0.4, ax=ax) # plot the building predictions

In [None]:
# load the text predictions
spotted_text = gpd.read_file("./deepsolo_text_predictions.geojson")

In [None]:
spotted_text.head() # view the text predictions

## Find text on building patches

We can identify text that appears on building patches by checking if the centroid of a text instance is within a patch classified as a building.

This can give us an idea of text found in "urban" areas (i.e. near buildings) versus "rural" areas.

In [None]:
building_patches = building_predictions[building_predictions["predicted_label"] == "building"] # filter for patches that are predicted as "building"

In [None]:
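building_area = building_patches.unary_union # merge all building patches into one geometry (computed once rather than once per row)
spotted_text["close to building"] = spotted_text["geometry"].apply(lambda x: x.centroid.within(building_area)) # check if the centroid of the text is within a building patch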
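
To make the centroid test concrete, here is a minimal sketch using `shapely` directly. The patch and text coordinates are made up for illustration. Note that a text box can overlap a building patch while its centroid still falls outside it, in which case it is not counted as close to a building.

```python
from shapely.geometry import box
from shapely.ops import unary_union

# Two hypothetical square "building" patches (coordinates are made up)
patches = [box(0, 0, 10, 10), box(10, 0, 20, 10)]
building_area = unary_union(patches)  # merge the patches into a single geometry

# Two hypothetical text bounding boxes
text_on_patch = box(4, 4, 8, 6)      # centroid at (6, 5), inside the first patch
text_off_patch = box(18, 8, 26, 14)  # overlaps a patch, but its centroid (22, 11) lies outside

for name, geom in [("on", text_on_patch), ("off", text_off_patch)]:
    print(name, geom.centroid.within(building_area))
```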

In [None]:
spotted_text.head() # view the text predictions

In [None]:
spotted_text["close to building"].value_counts() # count how many text instances are close to buildings

We can then plot the text instances on the map image, coloured by whether they appear on building patches:

In [None]:
fig, ax = plt.subplots(figsize=(10, 10)) # create a plot

rasterio.plot.show(src, transform=src.transform, ax=ax) # plot the map
spotted_text.plot(column="close to building", legend=True, cmap="viridis", ax=ax) # plot the text patches, coloured by whether they are close to a building

### Find probabilities of each word/phrase

Now that we have separated our text into these two categories, we can calculate the probability of each word/phrase appearing in each category.

In [None]:
building_text = spotted_text[spotted_text["close to building"]] # filter for text that is on building patches
other_text = spotted_text[~spotted_text["close to building"]] # filter for text that is not on building patches

In [None]:
# get counts and probabilities of text for the building patches
building_text_freq = building_text["text"].value_counts()
building_text_prob = building_text["text"].value_counts(normalize=True)
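
As a small illustration of what `value_counts(normalize=True)` returns, the sketch below uses a handful of made-up labels (not taken from the workshop data). Each probability is simply the label's count divided by the total number of labels in that category.

```python
import pandas as pd

# Hypothetical text labels (not taken from the workshop data)
labels = pd.Series(["STREET", "ROAD", "STREET", "P.H.", "STREET"])

counts = labels.value_counts()               # raw counts per label
probs = labels.value_counts(normalize=True)  # counts divided by the total number of labels

print(counts.get("STREET", 0))
print(probs.get("STREET", 0))
print(probs.get("FIELD", 0))  # labels that never occur fall back to the default value
```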

In [None]:
building_text_prob.head(20) # view the top 20 probabilities

In [None]:
# get counts and probabilities of text for the non-building patches
other_text_freq = other_text["text"].value_counts()
other_text_prob = other_text["text"].value_counts(normalize=True)

In [None]:
other_text_prob.head(20) # view the top 20 probabilities

We can compare the probabilities of each word/phrase appearing in the 'urban' and 'rural' categories to see if there are any words/phrases that are more likely to appear in one category than the other.

e.g. We might expect "street" to be more likely to appear in the 'urban' category.

In [None]:
word = "street" # choose a word to investigate
word = word.upper() # convert to uppercase (since our predictions are all in uppercase)

pd.DataFrame(
    {
        "frequency": [building_text_freq.get(word, 0), other_text_freq.get(word, 0)],
        "probability": [building_text_prob.get(word, 0), other_text_prob.get(word, 0)],
    },
    index=["building", "other"],
)

Conversely, we might expect "field" to be more likely to appear in the 'rural' category.

In [None]:
word = "field" # choose a word to investigate
word = word.upper() # convert to uppercase (since our predictions are all in uppercase)

pd.DataFrame(
    {
        "frequency": [building_text_freq.get(word, 0), other_text_freq.get(word, 0)],
        "probability": [building_text_prob.get(word, 0), other_text_prob.get(word, 0)],
    },
    index=["building", "other"],
)

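The difference in probability shows how much more likely a word/phrase is to appear in one category than in the other. Words that appear in only one category are given a probability of 0 in the other.

In [None]:
# compute the difference between building and non-building text probabilities
# fill_value=0 keeps words that appear in only one category (plain subtraction would give NaN for them)
proportional_diff = building_text_prob.sub(other_text_prob, fill_value=0)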
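
Subtracting two `pandas` Series aligns them on their index, so a label that occurs in only one category yields `NaN` unless a fill value is supplied. A minimal sketch with made-up probabilities, comparing plain subtraction with `Series.sub(..., fill_value=0)`:

```python
import pandas as pd

# Hypothetical probabilities for two categories (values are made up)
near_buildings = pd.Series({"STREET": 0.5, "ROAD": 0.3, "P.H.": 0.2})
elsewhere = pd.Series({"ROAD": 0.4, "FIELD": 0.6})

plain = near_buildings - elsewhere                    # NaN where a label is missing from either side
filled = near_buildings.sub(elsewhere, fill_value=0)  # missing labels are treated as probability 0

print(plain.isna().sum())                   # number of labels lost to NaN
print(filled.sort_values(ascending=False))  # every label keeps a numeric difference
```

Whether a missing label should count as probability 0 is a modelling choice; with small samples, a label may be absent from one category purely by chance.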

In [None]:
proportional_diff.sort_values(ascending=False)[:20].plot(
    kind="bar", 
    ylabel="difference in probability",
    title="Words more likely to be near buildings"
    )

In [None]:
proportional_diff.sort_values()[:20].plot(
    kind="bar", 
    ylabel="difference in probability",
    title="Words less likely to be near buildings"
    )

To get a sense of what some of the abbreviations mean, please go to the NLS website: https://maps.nls.uk/os/abbrev/

## Visualizing the semantics of text on maps

In the visualization below, we encode each label as a vector using a BERT-type language model.
This generates a vector for each label that approximates its 'meaning'.
We then visualize these embeddings in two-dimensional space, where you can explore the different semantic regions of the text data.
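
To give an intuition for vectors that approximate meaning: labels with related meanings should end up pointing in similar directions. One common way to compare two vectors is cosine similarity. The sketch below uses made-up three-dimensional vectors purely for illustration; real sentence embeddings have hundreds of dimensions, and this notebook does not compute similarities itself.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up 3-dimensional "embeddings" (hypothetical values, not model output)
toy_embeddings = {
    "street": np.array([0.9, 0.1, 0.0]),
    "road": np.array([0.8, 0.2, 0.1]),
    "field": np.array([0.1, 0.9, 0.2]),
}

sim_road = cosine_similarity(toy_embeddings["street"], toy_embeddings["road"])
sim_field = cosine_similarity(toy_embeddings["street"], toy_embeddings["field"])

print(sim_road)   # higher: the toy vectors point in similar directions
print(sim_field)  # lower: the toy vectors point in different directions
```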

In [None]:
from sklearn.manifold import TSNE
from sentence_transformers import SentenceTransformer
import plotly.express as px

In [None]:
text_labels = spotted_text.text.str.lower().tolist() # get the text labels

In [None]:
# load pre-trained sentence transformer model
# if you are working with a different language, you can change the model to a multilingual one
# please refer to the documentation for more information: https://www.sbert.net/docs/pretrained_models.html
model = SentenceTransformer('distilbert-base-nli-mean-tokens')

sentence_embeddings = model.encode(text_labels) # encode the text

We will use t-SNE to reduce the dimensionality of the embeddings to two so that we can plot them.

In [None]:
tsne = TSNE(n_components=2, random_state=42)
embeddings_tsne = tsne.fit_transform(sentence_embeddings)
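
One practical caveat when adapting this to a small dataset: scikit-learn's `TSNE` requires `perplexity` to be smaller than the number of samples, and its default is 30. The sketch below runs t-SNE on randomly generated vectors (stand-ins for real embeddings) with a lower perplexity; the value 5 is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
toy_embeddings = rng.normal(size=(20, 8))  # 20 made-up vectors of dimension 8

# perplexity must be smaller than the number of samples (here 20)
tsne = TSNE(n_components=2, perplexity=5, random_state=42)
coords = tsne.fit_transform(toy_embeddings)

print(coords.shape)  # one (x, y) pair per input vector
```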

In [None]:
# create a dataframe to store the results
data = pd.DataFrame(embeddings_tsne, columns=['x','y'])
data['text'] = text_labels

In [None]:
data.head() # view the results (the numbers in x and y are the coordinates of the text in the 2D space)

In [None]:
# visualize the labels in 2D scatter plot
fig = px.scatter(data, x="x", y="y", text='text', width=1000, height=1000)
fig.show()

In [None]:
# visualize only the text labels in 2D scatter plot
data_text = data[data.text.str.isalpha()] # filter for text labels that contain only alphabetic characters i.e. remove numbers and special characters

fig = px.scatter(data_text, x="x", y="y", text='text', width=1000, height=1000,)
fig.show()
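
Note that `str.isalpha()` is strict: it is true only when every character in the label is a letter, so abbreviations with full stops and multi-word labels containing spaces are filtered out along with the numbers. A minimal sketch with made-up labels:

```python
import pandas as pd

# Hypothetical lowercase labels (not taken from the workshop data)
labels = pd.Series(["street", "b.m.", "mill lane", "123", "church"])

mask = labels.str.isalpha()  # True only if every character is a letter
print(labels[mask].tolist())
```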

In [None]:
# visualize only the unique text labels in 2D scatter plot
data_text_unique = data[data.text.str.isalpha()].drop_duplicates(subset='text') # as above plus remove duplicates

fig = px.scatter(data_text_unique, x="x", y="y", text='text', width=1000, height=1000,)
fig.show()