# Exploring Text on Maps

This notebook provides some examples of how to load and visualise the text outputs of MapReader. 

To keep the data small and quick to process, we will use only the outputs from the six Glasgow maps used throughout the workshop. You can download these files from [here](https://drive.google.com/drive/folders/14tSvFbH2DJSnuRX1B9dA7Ic9JhzVPmc1).

----

In [None]:
# uncomment and run the following lines if you have not yet installed geopandas or plotly
#!pip install -q geopandas==1.0.0a1
#!pip install plotly

In [None]:
import pandas as pd
import geopandas as geopd
import matplotlib.pyplot as plt
import plotly.express as px
from ast import literal_eval
from collections import Counter
from tqdm import tqdm

## 'Urban' vs 'Rural' text

In this section we investigate how urban and rural landscapes are described in the map text. We compare labels that appear near the built environment with labels that appear on the rest of the map.

This notebook has the following structure:
- for simplicity, we convert all polygons to centroids
- we iterate over the dataframe of patch predictions
- we find the labels that fall within a certain distance of each patch centroid
- depending on the patch prediction (building or not building), we save the labels in different lists (`adjacent_text` and `other_text`)
- we compute the probability of each label within each list and take the difference in proportions, to foreground the words associated with building patches ('urban' labels) and those that are not ('rural' labels).
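
As a minimal sketch of the last step, the difference in proportions can be worked through on a handful of invented labels. The label lists and the helper `label_probabilities` below are hypothetical and only illustrate the arithmetic; the real lists are built later in this notebook.

```python
from collections import Counter

# Hypothetical toy labels, invented for illustration only
adjacent_text = ["Street", "street", "church", "school"]
other_text = ["farm", "wood", "street", "farm"]


def label_probabilities(labels):
    """Return each lower-cased label's share of all labels in the list."""
    counts = Counter(label.lower() for label in labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


building_prob = label_probabilities(adjacent_text)
other_prob = label_probabilities(other_text)

# Positive values lean towards building patches, negative values towards the rest
vocabulary = set(building_prob) | set(other_prob)
difference = {w: building_prob.get(w, 0) - other_prob.get(w, 0) for w in vocabulary}

for word, diff in sorted(difference.items(), key=lambda x: x[1], reverse=True):
    print(f"{word:>8} {diff:+.2f}")
```

Here "street" makes up half of the building labels but only a quarter of the other labels, so its difference is positive, while "farm" only appears away from buildings and ends up with the most negative value.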

In [None]:
# load patch predictions and spotted text
predictions = geopd.read_file("/path/to/building_patches.geojson")
spotted_text = geopd.read_file("/path/to/deepsolo_outputs.geojson")

In [None]:
predictions.head()

In [None]:
spotted_text.head()

In [None]:
# print the shape of the dataframes
print(f"Building predictions shape: {predictions.shape}")
print(f"Spotted text shape: {spotted_text.shape}")

To save time for the workshop we will filter to one parent map, `map_75650907.png`.

> **NOTE**: If you'd like to run on all the results you can skip or comment out the cell below. It will take a long time to run.

In [None]:
# filter to one parent map to save time
# .copy() avoids a SettingWithCopyWarning when we add the centroid columns below
predictions = predictions[predictions['parent_id'] == 'map_75650907.png'].copy()
spotted_text = spotted_text[spotted_text['image_id'] == 'map_75650907.png'].copy()

In [None]:
# convert polygons to centroids
# EPSG:27700 (British National Grid) uses metres, so distances computed later are in metres
predictions['centroid'] = predictions['geometry'].to_crs(epsg=27700).centroid
spotted_text['centroid'] = spotted_text['geometry'].to_crs(epsg=27700).centroid

The cell below collects, for each patch, the text that lies within 100 m of the patch centroid. Text found near a patch predicted as a building is stored as "adjacent text"; text found near any other patch is stored as "other text".
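
The grouping logic can be sketched in plain Python before running it on the real data. The coordinates, the patch dictionaries and the `"no_building"` label below are hypothetical; they stand in for centroids that have already been projected to a CRS measured in metres.

```python
import math

# Hypothetical patch and text centroids in projected coordinates (metres)
patches = [
    {"centroid": (0.0, 0.0), "predicted_label": "building"},
    {"centroid": (500.0, 0.0), "predicted_label": "no_building"},
]
texts = [
    {"text": "Church", "centroid": (30.0, 40.0)},   # 50 m from the first patch
    {"text": "Farm", "centroid": (500.0, 80.0)},    # 80 m from the second patch
    {"text": "Quarry", "centroid": (250.0, 0.0)},   # 250 m from both patches
]

target_label = "building"
distance = 100  # maximum distance in metres
adjacent_text, other_text = [], []

for patch in patches:
    # keep the text whose centroid lies within `distance` of this patch centroid
    nearby = [
        t["text"]
        for t in texts
        if math.dist(patch["centroid"], t["centroid"]) <= distance
    ]
    if patch["predicted_label"] == target_label:
        adjacent_text.extend(nearby)
    else:
        other_text.extend(nearby)

print(adjacent_text, other_text)
```

Note that text further than `distance` from every patch is stored in neither list, and text close to both a building patch and a non-building patch is stored in both.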

In [None]:
tqdm.pandas()
adjacent_text = [] # labels close to patches predicted as the target label
other_text = [] # labels close to all other patches
target_label = "building"
distance = 100 # maximum distance in metres between patch and text centroid

# the centroid columns are already in EPSG:27700, so we reuse them instead of
# reprojecting the whole dataframe on every iteration
text_centroids = spotted_text['centroid']

for i, row in tqdm(predictions.iterrows(), total=predictions.shape[0]):
    # get text whose centroid lies within the given distance of the patch centroid
    labels = spotted_text[text_centroids.distance(row.centroid) <= distance].text.tolist()
    # if patch is classified as the target label, add text to adjacent_text, otherwise add to other_text
    if row['predicted_label'] == target_label:
        adjacent_text.extend(labels)
    else:
        other_text.extend(labels)

print('Building labels:', len(adjacent_text), 'Other labels:', len(other_text))

In [None]:
# get counts and probabilities of the text labels for the building category
building_text_freq = Counter(label.lower() for label in adjacent_text)
building_total = sum(building_text_freq.values())  # computed once, not once per word
building_text_prob = {k: v / building_total for k, v in building_text_freq.items()}

In [None]:
# get counts and probabilities of the text labels for the other category
other_text_freq = Counter(label.lower() for label in other_text)
other_total = sum(other_text_freq.values())  # computed once, not once per word
other_text_prob = {k: v / other_total for k, v in other_text_freq.items()}

In [None]:
# compare both absolute counts and probabilities of a given word
word = 'street'
print(building_text_freq[word], other_text_freq[word])
# .get avoids a KeyError if the word is missing from one of the categories
print(building_text_prob.get(word, 0), other_text_prob.get(word, 0))

In [None]:
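# compute the difference in proportions for every word seen in either category
# (using only the keys of other_text_prob would drop words that appear solely near buildings)
vocabulary = set(building_text_prob) | set(other_text_prob)
proportional_difference = sorted({w: building_text_prob.get(w,0) - other_text_prob.get(w,0) for w in vocabulary}.items(), key=lambda x: x[1], reverse=True)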


In [None]:
print('Building labels')
print(proportional_difference[:5])
print('Other labels')
print(proportional_difference[-5:])

In [None]:
pd.DataFrame(proportional_difference[:10]).plot(kind='bar', x=0, y=1, legend=False, 
                            title='Top 10 terms in Building labels', 
                            xlabel='Term', ylabel='Difference in probability')

In [None]:
pd.DataFrame(proportional_difference[-10:]).plot(kind='bar', x=0, y=1, legend=False, 
                            title='Top 10 terms in Other labels', 
                            xlabel='Term', ylabel='Difference in probability')

To get a sense of what some of the abbreviations mean, please go to the NLS website: https://maps.nls.uk/os/abbrev/

# Visualizing the semantics of text on maps

In the visualization below we encode each label as a vector using a BERT-type language model. This generates one vector per label that approximates the 'meaning' of that label. We then project these embeddings into two-dimensional space, where you can explore the different semantic regions of the text data.
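
To give an intuition for what "approximating meaning" implies, the sketch below compares invented three-dimensional vectors with cosine similarity, one common way to compare embeddings. The vectors are hypothetical and are not outputs of the model used in this notebook; real sentence embeddings have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings, invented for illustration only
toy_embeddings = {
    "church": [0.9, 0.1, 0.0],
    "chapel": [0.8, 0.2, 0.1],
    "quarry": [0.0, 0.2, 0.9],
}

similar = cosine_similarity(toy_embeddings["church"], toy_embeddings["chapel"])
dissimilar = cosine_similarity(toy_embeddings["church"], toy_embeddings["quarry"])
print(f"church vs chapel: {similar:.2f}")
print(f"church vs quarry: {dissimilar:.2f}")
```

Labels with related meanings are expected to have vectors pointing in similar directions, which is why they tend to end up close together in the two-dimensional plot.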

In [None]:
# uncomment and run the following line if you have not yet installed sentence-transformers, scikit-learn and plotly
#!pip install -U -q sentence-transformers scikit-learn plotly

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from sentence_transformers import SentenceTransformer
import plotly.express as px

In [None]:
# get all text labels
text_labels = spotted_text.text.str.lower().tolist()

In [None]:
# load pre-trained sentence transformer model
# if you are working with a different language, you can change the model to a multilingual one
# please refer to the documentation for more information: https://www.sbert.net/docs/pretrained_models.html
model = SentenceTransformer('distilbert-base-nli-mean-tokens')

# encode the sentences
sentence_embeddings = model.encode(text_labels)


In [None]:
# perform dimensionality reduction using TSNE
tsne = TSNE(n_components=2, random_state=42)
embeddings_tsne = tsne.fit_transform(sentence_embeddings)

In [None]:
# visualize the labels in 2D scatter plot
data = pd.DataFrame(embeddings_tsne, columns=['x','y'])
data['text'] = text_labels
fig = px.scatter(data, x="x", y="y", text='text', width=1000, height=1000,)
fig.show()

In [None]:
# visualize only the text labels in 2D scatter plot
# i.e. keep labels made up of letters only; this removes numbers,
# and also any label containing spaces or punctuation
data_text = data[data.text.str.isalpha()]
fig = px.scatter(data_text, x="x", y="y", text='text', width=1000, height=1000,)
fig.show()

In [None]:
# visualize only the unique text labels in 2D scatter plot
# i.e. keep letter-only labels (no numbers, spaces or punctuation) and drop duplicates
data_text_unique = data[data.text.str.isalpha()].drop_duplicates(subset='text')
fig = px.scatter(data_text_unique, x="x", y="y", text='text', width=1000, height=1000,)
fig.show()
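
The `str.isalpha()` filter used in the last two cells is stricter than "remove all numbers". A small sketch with invented labels shows which kinds of label survive it; pandas applies the same per-string test as Python's built-in `str.isalpha()`.

```python
# Hypothetical labels, invented for illustration only
labels = ["street", "12", "b.m.", "north street", "st"]

# keep only labels that consist entirely of letters
kept = [label for label in labels if label.isalpha()]
print(kept)
```

Only "street" and "st" are kept: the number is dropped as intended, but so are the abbreviation with full stops and the two-word label, which is worth bearing in mind when reading the plots.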