# MapReader Autumn Workshop (2024)


## Set up for Google Colab

The cells below will:

- Mount your Google Drive
- Create a directory for the workshop
- Change the working directory to the workshop directory
- Download and install the required packages and data

In [None]:
# mount your drive
from google.colab import drive
drive.mount('/content/drive')

# set up MapReader_Autumn_Workshop directory
!mkdir -p /content/drive/MyDrive/MapReader_Autumn_Workshop
%cd /content/drive/MyDrive/MapReader_Autumn_Workshop

In [None]:
!git clone https://github.com/maps-as-data/mapreader-autumn-workshop-2024.git
!pip install mapreader[dev]
!pip install sentence-transformers scikit-learn plotly

In [None]:
# enable custom widgets in colab
from google.colab import output
output.enable_custom_widget_manager()

# Exploring Text on Maps

### 'Urban' vs 'Rural' text

This notebook provides some examples of ways in which MapReader's patch classification and text spotting outputs can be combined.

We will use our datasets to investigate the textual description of urban and rural landscapes by comparing text that often appears in the built environment (i.e. near building patches) versus the rest of the map. 

In [None]:
import pandas as pd
import geopandas as gpd
from collections import Counter
from tqdm import tqdm

### Load the data

We will first need to load our building classification data and text spotting data.

Since these were saved as `geojson` files we can load them with the `geopandas` library.

> **NOTE**: If you annotated/trained a model for something other than buildings, you will need to save the building predictions from [here](https://drive.google.com/file/d/1xrD5TQz_ILASsx17qScGbcH5OLq1XeE8/view?usp=drive_link) to the workshop directory in your Google Drive. You should then update the path in the cell below to "./building_predicted_outputs.geojson".
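
If you would rather not edit the path by hand, the choice between the two files can be sketched as below. This is only an illustration: the helper name `pick_predictions_path` is hypothetical (it is not part of MapReader), and it assumes the two file names mentioned in this notebook.

```python
from pathlib import Path


def pick_predictions_path(directory="."):
    """Return the building-specific predictions file if it exists,
    otherwise fall back to the default predictions file.

    Hypothetical helper for illustration; not part of MapReader.
    """
    candidates = [
        "building_predicted_outputs.geojson",  # downloaded building predictions
        "predicted_outputs.geojson",           # your own model's predictions
    ]
    for name in candidates:
        path = Path(directory) / name
        if path.exists():
            return path
    raise FileNotFoundError(f"No predictions file found in {directory!r}")
```

The returned path could then be passed to `gpd.read_file(...)` in place of the hard-coded string.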

In [None]:
# load patch predictions and spotted text
predictions = gpd.read_file("./predicted_outputs.geojson")
spotted_text = gpd.read_file("./deepsolo_text_predictions.geojson")

In [None]:
predictions.head()

In [None]:
spotted_text.head()

For simplicity, we will convert all our patch and text polygons to centroids (i.e. the point in the middle of the polygon).

This will make distance calculations easier.
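
To make the idea of a centroid concrete, here is a minimal pure-Python sketch of the area-weighted centroid of a simple polygon (the shoelace formula). It is for illustration only: in this notebook the `.centroid` attribute from `geopandas` does the work, and the function name `polygon_centroid` is an assumption made for this example.

```python
def polygon_centroid(vertices):
    """Area-weighted centroid of a simple polygon.

    `vertices` is a list of (x, y) tuples in order around the polygon;
    the first vertex does not need to be repeated at the end.
    """
    area2 = 0.0  # twice the signed area of the polygon
    cx = cy = 0.0
    n = len(vertices)
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]  # wrap around to close the ring
        cross = x0 * y1 - x1 * y0
        area2 += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    if area2 == 0:
        raise ValueError("Polygon has zero area")
    # centroid = sum / (6 * area) and area = area2 / 2
    return cx / (3 * area2), cy / (3 * area2)


# a 2 x 2 square patch whose centre is at (1, 1)
square_patch = [(0, 0), (2, 0), (2, 2), (0, 2)]
centre = polygon_centroid(square_patch)
```

Reducing each polygon to a single point like this means every patch-to-text comparison becomes one point-to-point distance.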

In [None]:
# convert polygons to centroids
predictions['centroid'] = predictions['geometry'].centroid 
spotted_text['centroid'] = spotted_text['geometry'].centroid

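The cell below finds the text that falls within 100 m of each patch centroid.

Text within this distance of a patch classified as "building" is stored as "adjacent text"; text within this distance of any other patch is stored as "other text". Distances are measured in metres after reprojecting to the British National Grid (EPSG:27700), and a piece of text that lies near several patches is counted once per patch.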
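
The logic of this split can be sketched on toy data without any geospatial libraries. The coordinates below are invented and assumed to already be in metres; the `no_building` label, the place names and the function name `split_text_by_patch_label` are hypothetical and only serve this example.

```python
from math import dist

# toy patches and text, with centroids given in metres (made-up values)
patches = [
    {"label": "building", "centroid": (0.0, 0.0)},
    {"label": "no_building", "centroid": (500.0, 0.0)},
]
texts = [
    {"text": "High Street", "centroid": (30.0, 40.0)},  # 50 m from the first patch
    {"text": "Mill Farm", "centroid": (520.0, 0.0)},    # 20 m from the second patch
    {"text": "Moor", "centroid": (2000.0, 2000.0)},     # far from both patches
]


def split_text_by_patch_label(patches, texts, target_label="building", max_distance=100.0):
    """Collect text near target-label patches and text near all other patches."""
    adjacent, other = [], []
    for patch in patches:
        # text whose centroid lies within max_distance of this patch centroid
        nearby = [
            t["text"]
            for t in texts
            if dist(t["centroid"], patch["centroid"]) <= max_distance
        ]
        if patch["label"] == target_label:
            adjacent.extend(nearby)
        else:
            other.extend(nearby)
    return adjacent, other


adjacent, other = split_text_by_patch_label(patches, texts)
```

In this sketch "Moor" ends up in neither list because it is not within 100 m of any patch, and a label close to several patches would be added once for each of them.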

In [None]:
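adjacent_text = [] # here we store text close to the target category
other_text = [] # here we store the other text

target_label = "building"
max_distance = 100 # in metres

# reproject the centroids once to British National Grid (EPSG:27700) so that
# both sides of the distance calculation are in the same CRS and in metres
patch_centroids = predictions["centroid"].to_crs(epsg=27700)
text_centroids = spotted_text["centroid"].to_crs(epsg=27700)

for i, row in tqdm(predictions.iterrows(), total=len(predictions)):
    # get text within a certain distance from the patch centroid
    text = spotted_text[text_centroids.distance(patch_centroids.loc[i]) <= max_distance]["text"].tolist()
    # if patch is classified as the target label, add text to adjacent_text, otherwise add to other_text
    if row['predicted_label'] == target_label:
        adjacent_text.extend(text)
    else:
        other_text.extend(text)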

In [None]:
print(f"Text near to buildings: {len(adjacent_text)}") 
print(f"Text far from buildings: {len(other_text)}")

### Find probabilities of each word/phrase
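
As a toy illustration of the calculation in the cells that follow, the sketch below turns two small lists of labels into relative frequencies and then ranks words by the difference between the two. The example labels and the helper name `relative_frequencies` are invented for this sketch, and the vocabulary is taken as the union of both lists.

```python
from collections import Counter

# made-up labels standing in for text near buildings and text elsewhere
near = ["Street", "street", "Church", "Inn"]
far = ["Farm", "Wood", "farm", "Street"]


def relative_frequencies(labels):
    """Map each lower-cased label to its share of all labels in the list."""
    counts = Counter(label.lower() for label in labels)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


near_prob = relative_frequencies(near)
far_prob = relative_frequencies(far)

# positive values: relatively more common near buildings
# negative values: relatively more common elsewhere
vocabulary = set(near_prob) | set(far_prob)
difference = {w: near_prob.get(w, 0) - far_prob.get(w, 0) for w in vocabulary}
ranked = sorted(difference.items(), key=lambda item: item[1], reverse=True)
```

Here "street" makes up half of the labels near buildings and a quarter of the labels elsewhere, so its difference is 0.25, while "farm" only appears away from buildings and ends up last in the ranking.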

In [None]:
# get counts and probabilities of the text labels for the building category
building_text_freq = Counter(t.lower() for t in adjacent_text)
building_total = sum(building_text_freq.values()) # compute the total once
building_text_prob = {k: v / building_total for k, v in building_text_freq.items()}

In [None]:
# get counts and probabilities of the text labels for the other category
other_text_freq = Counter(t.lower() for t in other_text)
other_total = sum(other_text_freq.values()) # compute the total once
other_text_prob = {k: v / other_total for k, v in other_text_freq.items()}

In [None]:
# compare both absolute counts and probabilities of a given word
word = 'street'
print(building_text_freq[word], other_text_freq[word])
# use .get so that a word missing from one category gives 0 instead of a KeyError
print(building_text_prob.get(word, 0), other_text_prob.get(word, 0))

In [None]:
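# compute the difference in probability (building minus other) for every word
# we use the union of both vocabularies so that words found only near buildings are included
all_words = set(building_text_prob) | set(other_text_prob)
proportional_difference = sorted({w: building_text_prob.get(w,0) - other_text_prob.get(w,0) for w in all_words}.items(), key=lambda x: x[1], reverse=True)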


In [None]:
print(f'Building labels: {proportional_difference[:5]}')
print(f'Other labels: {proportional_difference[-5:]}')

In [None]:
pd.DataFrame(proportional_difference[:10]).plot(kind='bar', x=0, y=1, legend=False, 
                            title='Top 10 terms in Building labels', 
                            xlabel='Term', ylabel='Difference in probability')

In [None]:
pd.DataFrame(proportional_difference[-10:]).plot(kind='bar', x=0, y=1, legend=False, 
                            title='Top 10 terms in Other labels', 
                            xlabel='Term', ylabel='Difference in probability')

To get a sense of what some of the abbreviations mean, please go to the NLS website: https://maps.nls.uk/os/abbrev/

# Visualizing the semantics of text on maps

In the visualization below, we encode each label as a vector using a BERT-type language model. This generates a vector for each label that approximates its 'meaning'. We then project these embeddings into two-dimensional space, where you can explore the different semantic regions of the text data.
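
To give a feel for what "similar meaning" looks like in vector form, the sketch below compares hand-made toy vectors using cosine similarity. The three-dimensional vectors are invented for illustration; real sentence embeddings have hundreds of dimensions and come from the model, not from values typed in by hand.

```python
from math import sqrt


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# made-up "embeddings": the first two point in a similar direction
toy_embeddings = {
    "church": [0.9, 0.1, 0.0],
    "chapel": [0.8, 0.2, 0.1],
    "quarry": [0.1, 0.1, 0.9],
}

sim_church_chapel = cosine_similarity(toy_embeddings["church"], toy_embeddings["chapel"])
sim_church_quarry = cosine_similarity(toy_embeddings["church"], toy_embeddings["quarry"])
```

With these toy values "church" and "chapel" are much more similar to each other than "church" and "quarry". The dimensionality reduction step then aims to place labels with nearby vectors close together in the 2D plot.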

In [None]:
from sklearn.manifold import TSNE
from sentence_transformers import SentenceTransformer
import plotly.express as px

In [None]:
# get all text labels
text_labels = spotted_text.text.str.lower().tolist()

In [None]:
# load pre-trained sentence transformer model
# if you are working with a different language, you can change the model to a multilingual one
# please refer to the documentation for more information: https://www.sbert.net/docs/pretrained_models.html
model = SentenceTransformer('distilbert-base-nli-mean-tokens')

# encode the sentences
sentence_embeddings = model.encode(text_labels)


In [None]:
# perform dimensionality reduction using TSNE
tsne = TSNE(n_components=2, random_state=42)
embeddings_tsne = tsne.fit_transform(sentence_embeddings)

In [None]:
# visualize the labels in a 2D scatter plot
data = pd.DataFrame(embeddings_tsne, columns=['x','y'])
data['text'] = text_labels
fig = px.scatter(data, x="x", y="y", text='text', width=1000, height=1000,)
fig.show()

In [None]:
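# visualize only the alphabetic text labels in a 2D scatter plot
# str.isalpha() keeps labels made up entirely of letters, so this removes numbers
# but also any label containing spaces, digits or punctuation
data_text = data[data.text.str.isalpha()]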
fig = px.scatter(data_text, x="x", y="y", text='text', width=1000, height=1000,)
fig.show()

In [None]:
# visualize only the unique alphabetic text labels in a 2D scatter plot
# i.e. keep labels made up entirely of letters and drop duplicates
data_text_unique = data[data.text.str.isalpha()].drop_duplicates(subset='text')
fig = px.scatter(data_text_unique, x="x", y="y", text='text', width=1000, height=1000,)
fig.show()