# Simple statistics for the "Du bist Am Zug" project

## Project links
https://dubistamzug.net/en/

https://www.instagram.com/dubistamzugberlin/

https://www.facebook.com/dubistamzugberlin

## Getting .kml map
This notebook is tailored to the 2024 version of the map; some details may differ in future versions.

Open the [map](https://www.google.com/maps/d/u/0/viewer?mid=1jXqAMP9-YYyS75qjMMC6zf45UsSkVIs&ll=52.530777634910116%2C13.465575394245812&z=10),
click the three vertical dots, choose "Download KML", select "Export to KML instead of KMZ", and press "OK".
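
If the map was saved as KMZ anyway, it does not have to be downloaded again: a KMZ file is a ZIP archive that wraps the KML document. The following is a minimal sketch, assuming the archive contains at least one `.kml` entry; `read_kml_root` is a hypothetical helper and is not used elsewhere in this notebook.

```python
import xml.etree.ElementTree as ET
import zipfile
from pathlib import Path


def read_kml_root(path: str) -> ET.Element:
    """Return the root element of a .kml file, or of the .kml inside a .kmz archive."""
    if Path(path).suffix.lower() == ".kmz":
        with zipfile.ZipFile(path) as archive:
            # Assumption: the first .kml entry in the archive is the map itself
            kml_name = next(n for n in archive.namelist() if n.lower().endswith(".kml"))
            return ET.fromstring(archive.read(kml_name))
    return ET.parse(path).getroot()
```

With this helper, `root = read_kml_root(file_name)` could stand in for the `ET.parse(...)` / `getroot()` pair, whichever format was downloaded.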

In [1]:
import re
import logging
import xml.etree.ElementTree as ET

from collections import Counter


STYLE_YELLOW = "#icon-1899-FFD600"
STYLE_RED = "#icon-1899-A52714"


# Name of the downloaded file; if yours has a different name, change the variable here.
# The file is assumed to be in the same folder as this notebook.
file_name = "Standorte Plakate DBAZ 2024  1. & 2. Woche.kml"

tree = ET.parse(file_name)
root = tree.getroot()

In [2]:
def norm_text(text: str) -> str:
    """Remove all html tags and fix some common typos in the source file"""
    norm = re.sub(r"<.*?>", " ", text).strip()
    norm = re.sub("8Foto", "(Foto", norm)
    norm = re.sub(r"\s+", " ", norm)

    # if the text holds only coordinates and no real description, return an empty string
    if not re.sub(r"[\d\.\s]", "", norm):
        return ""

    return norm.strip()

def get_spotter(text: str) -> str:
    """
    Try to extract spotter from the description
    
    Examples:
        >>> get_spotter("Author Name (Foto SpotterName)   52.45022583 13.50795078")
        'SpotterName'
        >>> get_spotter("Anonym (Bild SpotterName SpotterSurname)  52.55495479 13.39554122")
        'SpotterName SpotterSurname'
        >>> get_spotter("Author Name, 1, 4 (Foto Spotter1) (Foto2 Spotter2)  52.49294, 13.3868")
        'Spotter1'
    """
    if "(" in text and ")" in text:
        tt = text
        tt = re.sub(r"[Ff]oto?:?", "", tt)
        tt = re.sub(r"[Bb]ild:?", "", tt)
        return tt.split("(")[1].split(")")[0].strip()
    elif "(" in text:
        return text.split("(")[1].split(" ")[1].strip()
    logging.warning(f"No rules to parse correctly: {text}")
    return ""

def parse_description(text: str) -> dict:
    """Parse description into a dictionary with various helper fields"""
    norm = norm_text(text)
    res = {"raw": text, "norm": norm}
    if not norm:
        return res
    res["poster_by"] = norm.split("(")[0].strip()  # any text before first '('
    res["spotted_by"] = get_spotter(norm)
    return res


def placemark_to_dict(placemark: ET.Element) -> dict:
    """Convert <Placemark> structure into the dictionary"""
    name = placemark.find('kml:name', namespace).text
    styleUrl = placemark.find('kml:styleUrl', namespace).text
    description = placemark.find('kml:description', namespace).text
    description = parse_description(description)
    coordinates = placemark.find('kml:Point/kml:coordinates', namespace).text.strip()
    coordinates = coordinates.split(",")[:2]
    return {
        "name": name,
        "style": styleUrl,
        "description": description,
        "coordinates": coordinates,
    }


# parse everything

namespace = {'kml': 'http://www.opengis.net/kml/2.2'}

unmarked = 0
locations = []
for placemark in root.findall('.//kml:Placemark', namespace):
    locations.append(placemark_to_dict(placemark))


In [3]:
# Ensure that all parsed yellow locations have non-empty poster_by and spotted_by fields
for l in locations:
    if l["style"] != STYLE_YELLOW:
        continue
    d = l["description"]
    if not d["norm"]:
        continue
    if not d["spotted_by"] or not d["poster_by"]:
        logging.warning(f"No poster/spotter for {l['name']}: {d}")

In [4]:
# Typo search: show all yellow locations whose description has neither "(Foto " nor "(Bild "
for l in locations:
    if l["style"] != STYLE_YELLOW:
        continue
    d = l["description"]
    if not ("(Foto " in d["norm"] or "(Bild " in d["norm"]):
        print("-", l["name"], "DESCR:", re.sub(r"\s+", "  ", d["norm"]))

- Otto-Suhr-Allee-18-20 Mittelinsel vor ggü. Marie-Elisabeth-Lüders-Str. staw DESCR: Jutta  Widrinsky  (Fotoangiie_pamela_photography)  52.51394238  13.31753913
- Alt-Friedrichsfelde-23 hinter Robert-Uhrig-Str. DESCR: Ali,  12,  (Foto:  deinkarma666)  52.50999195  13.51268279
- Gensinger Str.- vor Haus Nr. 103 (vor Seniorenzentrum) DESCR: Julia  Liebisch  (Fot  stricktdagegen)  52.51132243  13.53273538
- Blumberger Damm-146 hinter Zinndorfer Str. DESCR: M.  Maschke  (FotoPeter)  52.54367173  13.57086907
- Cecilienstr.-161 hinter Irmfriedstr. DESCR: Svenia  Andresen  (Fot  kadomaz_55)  52.52876293  13.57089755
- Tempelhofer Weg-36-42 Mittelinsel vor ggü. Holzmindener Str. staw. DESCR: Abdul  &  Daniel,  Fläming-Grundschule,  6d  (Fotoangiie_pamela_photography)  52.45580292  13.42825031
- Berliner Str.-13 A-B Mittelinsel vor Florastr.. Anlage B DESCR: Anika  Voß  (foto  splendid_andrea)  52.56774002  13.41208837
- Dietzgenstr.-191 hinter ggü. Rosenthaler Weg stew. DESCR: Annika  Wissen  
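
Most of the lines above are misspelled photo markers: "Fot" instead of "Foto", a missing space, a lowercase "foto", or an extra colon. One way to cope with all of them is a single case-insensitive pattern. This is only a sketch: `SPOTTER_RE` and `extract_spotter` are hypothetical names, not the notebook's `get_spotter`, and the pattern can misfire on a spotter name that begins with "o" directly after "Fot".

```python
import re

# "(" + Foto/Fot/Bild in any case + optional colon + spotter name + ")"
SPOTTER_RE = re.compile(r"\(\s*(?:foto?|bild)\s*:?\s*([^)]*)\)", re.IGNORECASE)


def extract_spotter(description: str) -> str:
    """Return the spotter from the first photo marker, or "" if there is none."""
    match = SPOTTER_RE.search(description)
    return match.group(1).strip() if match else ""


samples = [
    "Jutta Widrinsky (Fotoangiie_pamela_photography) 52.51394238 13.31753913",
    "Ali, 12, (Foto: deinkarma666) 52.50999195 13.51268279",
    "Julia Liebisch (Fot stricktdagegen) 52.51132243 13.53273538",
    "Anika Voß (foto splendid_andrea) 52.56774002 13.41208837",
    "Annika Wissen",
]
for sample in samples:
    print(repr(extract_spotter(sample)))
```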

In [5]:
print("Locations with the same name if any")
name_counter = Counter(l["name"] for l in locations)
for name, count in name_counter.most_common():
    if count > 1:
        print("-", name, ":", count)

print()
print("Locations with the same coord if any")
coord_counter = Counter(" ".join(l["coordinates"]) for l in locations)
for coord, count in coord_counter.most_common():
    if count > 1:
        print("-", coord, ":", count)

Locations with the same name if any
- Neuköllner Str.-303 hinter Alt Rudow 1 : 3
- Oranienburger Str.-297 vor Eichborndamm : 3
- Siegener Str.- vor Falkenseer Chaussee 241 : 3
- Attilastr.-173 vor Alarichstr. 1 : 3
- Johannisthaler Chaussee-263 Mittelinsel hinter Fritz-Erler-Allee Rtg. Treptow : 2
- Seestr.-44a vor Müllerstr. : 2

Locations with the same coord if any


In [6]:
# To check whether the map contains posters you are interested in,
# replace the names here. Any string works. This cell exists only because
# searching for a name or substring in Google Maps doesn't work well for now (imho)
POSTERS_TO_CHECK = [
    "Nosyrev",
    "Kaltauskaite",
    "Dvayaitca",
    "Holubeva",
    "Pasichnyk",
    "Saliukhina",
]
for l in locations:
    d = l["description"]
    if not d["norm"]:
        continue
    for p in POSTERS_TO_CHECK:
        if p.lower() in d["poster_by"].lower():
            print(p)
            print("  ", l["name"])
            print("  ", d["norm"])

Holubeva
   Sterndamm-37 Mittelinsel vor Pietschkerstr. Halle A
   Sofiia Holubeva (Foto Anke) (Foto 2 sabineberlin.de) 52.45022583 13.50795078


# Statistics

In [7]:
print("Different dots on the map count")
print("Red:", sum(1 for l in locations if l["style"] == STYLE_RED))
print("Yellow:", sum(1 for l in locations if l["style"] == STYLE_YELLOW))

Different dots on the map count
Red: 0
Yellow: 489


In [8]:
print("Different styles of dots, sanity check:")
print(Counter(l["style"] for l in locations))

Different styles of dots, sanity check:
Counter({'#icon-1899-FFD600': 489, '#icon-1899-BDBDBD': 12})


In [9]:
# Sanity check
print("Yellow locations without any description:")
for l in locations:
    d = l["description"]
    if l["style"] != STYLE_YELLOW:
        continue
    if not d["norm"]:
        print(l["name"])

Yellow locations without any description:


In [10]:
print("Most common posters")
posters_by_stat = Counter(l["description"].get("poster_by", "NOT_FOUND") for l in locations if l["style"] == STYLE_YELLOW)
for poster_by, count in posters_by_stat.most_common(10):
    print(f"{count}: {poster_by}")

print()
print(len(set(posters_by_stat)), "unique posters out of", sum(1 for l in locations if l["style"] == STYLE_YELLOW), "in total")

print()
print("<how_many_times_poster_was_spotted>: <posters_that_were_spotted_that_many_times>")
for freq, posters_like_that in sorted(Counter(posters_by_stat.values()).items(), reverse=True):
    print(f"{freq}: {posters_like_that}")


Most common posters
36: Anonym
5: Die Rap Girls, 9-13 Jahre
4: Lina, 8
4: Gunda Leiss
3: Blaxx443 Schneider/Werle
3: Kåre, 5
3: Denise Taureg
2: Nastasya Tikhnovetskaya
2: Maria Wirth
2: Sibylle Meister

364 unique posters out of 489 in total

<how_many_times_poster_was_spotted>: <posters_that_were_spotted_that_many_times>
36: 1
5: 1
4: 2
3: 3
2: 74
1: 283
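
The last table is a "count of counts": the inner `Counter` maps each name to the number of posters carrying it, and the outer `Counter` tallies how often each of those numbers occurs. A toy sketch with invented names shows the idea, together with two checks that any such table should pass.

```python
from collections import Counter

# Invented data: one entry per spotted poster
names = ["Anonym", "Anonym", "Anonym", "Lina, 8", "Lina, 8", "Maria", "Tim"]

per_name = Counter(names)                # name -> number of posters
histogram = Counter(per_name.values())   # number of posters -> how many names have it

for freq, how_many in sorted(histogram.items(), reverse=True):
    print(f"{freq}: {how_many}")

# The histogram must account for every unique name and for every poster
assert sum(histogram.values()) == len(per_name)
assert sum(freq * how_many for freq, how_many in histogram.items()) == len(names)
```

The same checks hold for the output above: 1 + 1 + 2 + 3 + 74 + 283 = 364 unique names, and 36 + 5 + 8 + 9 + 148 + 283 = 489 posters.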


In [11]:
n = 10
print(f"Top {n} spotters:")
spotted_by_stat = Counter(l["description"].get("spotted_by", "NOT_FOUND") for l in locations if l["style"] == STYLE_YELLOW)
for spotted_by, count in spotted_by_stat.most_common(n):
    print(f"{count}: {spotted_by}")

print()
print(len(set(spotted_by_stat)), "unique spotters spotted", sum(spotted_by_stat.values()), "posters")

print()
print("Spotter statistics:")
print("<number_of_spotted_posters>: <spotters_that_spotted_that_many_posters>")
for freq, spotters_like_that in sorted(Counter(spotted_by_stat.values()).items(), reverse=True):
    print(f"{freq}: {spotters_like_that}")

Top 10 spotters:
55: Peter
53: Svenja
48: Tim
41: andii.
36: angiie_pamela_photography
27: Anke
21: jo_wisz
20: grigorynosyrev
16: dbaz.kunstopfer
15: Vicky

86 unique spotters spotted 489 posters

Spotter statistics:
<number_of_spotted_posters>: <spotters_that_spotted_that_many_posters>
55: 1
53: 1
48: 1
41: 1
36: 1
27: 1
21: 1
20: 1
16: 1
15: 1
10: 1
9: 1
7: 2
6: 1
4: 4
3: 11
2: 13
1: 43


In [12]:
from pathlib import Path

import pandas as pd

df = pd.DataFrame()
df["name"] = [l["name"] for l in locations]
df["description"] = [l["description"]["norm"] for l in locations]
df["coordinates"] = [" ".join(l["coordinates"]) for l in locations]
df["style"] = [l["style"] for l in locations]
df = df.sort_values("name")
df.head()

# saving file to .csv format
df.to_csv(Path(file_name).with_suffix(".csv"), index=False)
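
KML lists coordinates as longitude,latitude, so the `coordinates` column written above is in "longitude latitude" order, the reverse of the "latitude longitude" pairs found in the descriptions. For tools that expect separate numeric columns, the split could look like the sketch below. It uses the standard `csv` module so that it runs on its own; the row is invented and `split_coordinates` is a hypothetical helper.

```python
import csv
import io


def split_coordinates(coordinates: str) -> tuple[float, float]:
    """Split a "longitude latitude" string into (latitude, longitude) floats."""
    lon, lat = (float(part) for part in coordinates.split())
    return lat, lon


# Invented row in the same shape as the exported CSV
rows = [{"name": "Example Str.-1", "coordinates": "13.50795078 52.45022583"}]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["name", "latitude", "longitude"])
writer.writeheader()
for row in rows:
    lat, lon = split_coordinates(row["coordinates"])
    writer.writerow({"name": row["name"], "latitude": lat, "longitude": lon})

print(buffer.getvalue())
```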

