# Automatic event position extraction from a post

Methodology:

1. Find all posts / tweets with the #AlertMPK hashtag, because only those posts notified users about accidents.
2. Using a regex, tokenize the posts and find addresses. Addresses are (almost) always written with capital letters. If two consecutive words both start with a capital letter and are separated by a space, they are combined into one group. A list of keywords is used to strip commonly used words from the posts (e.g. tramwaje, Brak, Informujemy, ...).  
Examples of post texts:
```
Brak przejazdu przy pl. Legionów - pomoc medyczna dla pasażera.
🚋 Tramwaje linii 0L, 24>FAT, 31, 32>GAJ skierowano objazdem od pl. Orląt Lwowskich przez ul. Podwale, Sądową do pl. Legionów.
#Extracted addresses: Legionów, FAT, GAJ, Orląt Lwowskich, Podwale, Sądową, Legionów
```
```
Ul. Gliniana/Gajowa>Gaj - kolizja samochodu z tramwajem.
🚋 Tramwaje linii 8 skierowano do Parku Południowego.
🚋 Tramwaje linii 31 i 32 skierowano do Zajezdni I (Uniwersytet Ekonomiczny).
#Extracted addresses: Gliniana, Gajowa, Gaj, Parku Południowego, Uniwersytet Ekonomiczny
```

3. Using publicly available lists of stops and of all addresses in Wrocław, each extracted address is matched against these lists. The first address that matches is taken as the event position.  
Matching algorithm:
    1. Look up the extracted string in the list of all stops. If there is an exact match, return the stop name and its position.
    2. Calculate the similarity of the string to each stop name, sort by similarity, and return the most similar stop with its position if the similarity is above the threshold of 0.8.
    3. Calculate the similarity of the string to each street address (only addresses starting with the same letter are compared), sort by similarity, and return the most similar address if the similarity is above the threshold of 0.7. Query the position of the street using OpenStreetMap.
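
The three matching stages can be sketched on toy data as follows. This is a minimal illustration, not the notebook's implementation: `STOPS`, `STREETS`, `best_match` and `resolve` are hypothetical names, the lists hold only a few sample entries, and the position lookup is left out.

```python
from difflib import SequenceMatcher

# Toy stand-ins for the real stop and street lists (sample names only).
STOPS = ["zajezdnia gaj", "pomorska", "bałtycka"]
STREETS = ["powstańców śląskich", "legnicka", "podwale"]


def best_match(query, candidates, threshold):
    """Return the candidate most similar to query, or None if not above threshold."""
    scored = [(SequenceMatcher(None, query, c).ratio(), c) for c in candidates]
    if not scored:
        return None
    score, candidate = max(scored)
    return candidate if score > threshold else None


def resolve(query):
    """Return (kind, name) for the first stage that matches, or None."""
    if query in STOPS:                          # stage 1: exact stop name
        return "stop", query
    stop = best_match(query, STOPS, 0.8)        # stage 2: fuzzy stop name
    if stop:
        return "stop", stop
    street = best_match(query, STREETS, 0.7)    # stage 3: fuzzy street name
    if street:
        return "street", street
    return None
```

With these toy lists, `resolve("zajezdni gaj")` falls through to stage 2 and `resolve("powstańców śl")` to stage 3, which mirrors the order in which the real lists are consulted.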


### Libraries used:
* difflib - calculates similarity between strings
* geopy - queries positions through OpenStreetMap (Nominatim geocoder)


In [None]:
import json
import csv
from collections import OrderedDict
import re
import pandas as pd
from geopy.geocoders import Nominatim

In [None]:
with open('data/WRO_Ulice.txt', 'r') as f:
    streets = list(OrderedDict.fromkeys([s.strip().lower() for s in f.readlines()]))

In [137]:
print(len(streets))

2315


In [138]:
streets[:10]

['abrahama',
 'abramowskiego',
 'adamczewskich',
 'adamieckiego',
 'admiralska',
 'afgańska',
 'agatowa',
 'agrestowa',
 'akacjowa',
 'alberta']

In [None]:
with open('data/MPK_stops.txt', 'r') as f:
    reader = csv.DictReader(f, delimiter=',', quotechar='"')
    stops = {row['stop_name'].lower():(row['stop_lat'], row['stop_lon']) for row in reader}

In [134]:
print(len(stops.keys()))

862


In [135]:
print(list(stops.keys())[:10])

['wzgórze partyzantów', 'zoo', 'metalowców', 'bojanowska', '8 maja', 'głogowska', 'główna', 'bujwida', 'strachowice general aviation', 'nowy dwór (rogowska)']


In [None]:
facebook_texts = pd.read_json('data/alert_mpk_facebook_posts.json', lines=True)
facebook_texts = facebook_texts[facebook_texts['text'].str.contains("#AlertMPK")]
facebook_texts = facebook_texts['text'].tolist()
facebook_texts = [re.sub(r"#(\w+)", '', t, flags=re.MULTILINE) for t in facebook_texts]
facebook_texts = [re.sub(r"@(\w+)", '', t, flags=re.MULTILINE) for t in facebook_texts]

with open('data/AlertMPK_tweets.json', 'r') as f:
    twitter_tweets = [json.loads(l) for l in f.readlines() if "#AlertMPK" in l]
    twitter_texts = [tweet['tweet'] for tweet in twitter_tweets]
    twitter_texts = [re.sub(r"#(\w+)", '', t, flags=re.MULTILINE) for t in twitter_texts]
    twitter_texts = [re.sub(r"@(\w+)", '', t, flags=re.MULTILINE) for t in twitter_texts]
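
As a small illustration of what the two substitutions do to a single post, the sketch below applies them to a made-up input string. The helper name `strip_tags` and the sample text are assumptions for the example; only the two patterns come from the cell above.

```python
import re


def strip_tags(text):
    """Remove #hashtags and @mentions, leaving the surrounding whitespace as is."""
    text = re.sub(r"#(\w+)", "", text)
    return re.sub(r"@(\w+)", "", text)


# Hypothetical post text; the hashtag is removed but the space after it stays.
sample = "#AlertMPK ul. Legnicka - ruch przywrócony."
cleaned = strip_tags(sample)
```

Because only the tag itself is removed, the cleaned text keeps a leading space; calling `.strip()` on the result would remove it if that matters downstream.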

In [86]:
print(len(facebook_texts))
print(len(twitter_texts))

628
4630


In [70]:
twitter_texts[:10]

[' ul. Legnicka - ruch przywrócony. Tramwaje wracają na stałe trasy przejazdu.',
 ' - \n⚠ Ul. Legnicka/Kwiska>pl. Jana Pawła II - awaria tramwaju.\n🚋 Tramwaje linii 3, 10, 20, 23, 33 zawracają przez Most Pomorski, Pomorską, Dubois, Most Sikorskiego.\n🚋 Tramwaje linii 31 i 32 zawracają na Mostach Mieszczańskich.',
 ' ul. Kosmonautów - ruch przywrócony. Tramwaje wracają na stałe trasy przejazdu.',
 ' - \n⚠ Ul. Kosmonautów - awaria tramwaju. \n🚋 Tramwaje linii 3, 10, 20 skrócono do pętli Pilczyce.\n🚌 Uruchomiono autobusy "za tramwaj" w relacji Pilczyce - Leśnica.',
 ' ul. Legnicka/Na Ostatnim Groszu- ruch przywrócony. Tramwaje wracają na stałe trasy przejazdu.',
 ' \n⚠️ ul. Legnicka/ Na ostatnim groszu - brak przejazdu z powodu wypadku bez udziału pojazdów MPK. \n🚋 Tramwaje skierowano ruchem wahadłowym w relacji Kozanów>Pilczyce. \n🚌 Uruchomiono autobusy "za tramwaj" w relacji Kwiska>Pilczyce.',
 ' ul. Sienkiewicza - ruch przywrócony. Tramwaje wracają na stałe trasy przejazdu.',
 ' \n⚠️ u

In [None]:
bus_lines = ['a', 'c', 'd', 'k', 'n', 'e', 'i']

In [None]:
tram_lines = ['t1', 't2', 't3', 't4', 't5', 't6', 't7', 't8', 't9']

In [None]:
keywords = [
    'tramwaje',
    'autobusy',
    'uruchomiono',
    'ul', 'al', 'pl', 'mpk',
    'aktualizacja',
    'ruch', 'brak', 'w',
    'informujemy',
    'na', 'z' , 'see',
    'linie', 'linia',
    'celem', 'możliwe',
    'wprowadzono'
]

In [None]:
regex = re.compile(r'(?:\b[A-Z\u0141\u015A\u0179\u017B].*?\b)+(?: (?:\b[A-Z\u0141\u015A\u0179\u017B].*?\b)+)*')

In [None]:
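def find_groups(text):
    """Return lower-cased capitalized word groups, minus keywords and line names."""
    ignored = set(keywords + bus_lines + tram_lines)
    groups = [g.lower() for g in regex.findall(text) if g.lower() not in ignored]
    if not groups:
        # Nothing was extracted: print the post so it can be inspected manually.
        print(text)
    return groups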
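
To make the grouping rule easier to follow, here is a simplified sketch of the same idea. It is not the notebook's regex: `CAP_WORD`, `GROUPS`, `STOP_WORDS` and `extract_groups` are hypothetical names, the pattern is a reduced variant that lists the Polish capitals explicitly, and the stop-word set is a short sample.

```python
import re

# Short sample of stop words (assumption); the notebook's keyword list is longer.
STOP_WORDS = {"brak", "tramwaje", "ul", "pl"}

# One capitalized word: an ASCII or Polish capital followed by word characters.
CAP_WORD = r"[A-ZĄĆĘŁŃÓŚŹŻ]\w*"
# A group: one capitalized word, then any further ones separated by single spaces.
GROUPS = re.compile(rf"\b{CAP_WORD}(?: {CAP_WORD})*")


def extract_groups(text):
    """Return lower-cased capitalized groups that are not stop words."""
    found = GROUPS.findall(text)
    return [g.lower() for g in found if g.lower() not in STOP_WORDS]


post = ("Tramwaje skierowano objazdem od pl. Orląt Lwowskich "
        "przez ul. Podwale, Sądową do pl. Legionów.")
groups = extract_groups(post)
```

Here "Orląt Lwowskich" stays together as one group because the two capitalized words are separated only by a space, while "Podwale" and "Sądową" are split by the comma.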

In [None]:
for text in facebook_texts[50:100]:
    find_groups(text)

In [None]:
from difflib import SequenceMatcher
address_similarity_threshold = 0.7
stop_similarity_threshold = 0.8

In [None]:
class AddressResult:
    name: str = None
    lat: float = None
    lon: float = None

    def __str__(self):
        return f"{self.name} [{self.lat}, {self.lon}]"

In [None]:
def match_address(address):
    """Return the street most similar to address, or None if none exceeds the threshold."""
    if not address:
        return None
    # Only compare against streets sharing the first letter, to reduce the candidates.
    matches = [
        (existing_addr, SequenceMatcher(None, address, existing_addr).ratio())
        for existing_addr in streets
        if existing_addr.startswith(address[0])
    ]
    matches = sorted(matches, key=lambda x: x[1], reverse=True)
    matches_thresholded = [m for m in matches if m[1] > address_similarity_threshold]
    if matches_thresholded:
        return matches_thresholded[0][0]
    return None

In [None]:
def match_stop(address):
    """Return the stop name most similar to address, or None if none exceeds the threshold."""
    matches = [
        (existing_stop, SequenceMatcher(None, address, existing_stop).ratio())
        for existing_stop in stops.keys()
    ]
    matches = sorted(matches, key=lambda x: x[1], reverse=True)
    matches_thresholded = [m for m in matches if m[1] > stop_similarity_threshold]
    if matches_thresholded:
        return matches_thresholded[0][0]
    return None
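
The score, sort and threshold steps in `match_address` and `match_stop` can also be expressed with `difflib.get_close_matches`, which does all three in one call. The sketch below is an alternative formulation, not the code used in this notebook; `closest` is a hypothetical helper. One difference to be aware of: `cutoff` is inclusive (score at least `cutoff`), whereas the functions above use a strict `>`.

```python
from difflib import get_close_matches


def closest(query, candidates, cutoff):
    """Return the single best candidate scoring at least cutoff, or None."""
    matches = get_close_matches(query, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# Sample stop names only; the real list comes from the stops file.
sample_stops = ["zoo", "zajezdnia gaj", "gaj"]
best = closest("zajezdni gaj", sample_stops, 0.8)
```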

In [None]:
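def get_address_position(address):
    """Geocode a Wrocław street name; return (lat, lon), or (None, None) if not found."""
    # Recent geopy versions require an explicit user agent; this name is a placeholder.
    geolocator = Nominatim(user_agent="alert-mpk-notebook")
    loc = geolocator.geocode(f"{address},Wrocław,PL")
    if loc is None:
        return None, None
    return loc.latitude, loc.longitude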
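
Posts often mention the same street more than once, so it may be worth caching geocoding results instead of querying the service each time. The sketch below is one possible approach under that assumption: `make_cached_lookup` is a hypothetical helper that wraps any `geocode(query)` callable (for example the `geocode` method of a `Nominatim` instance) and remembers the answer per street.

```python
from functools import lru_cache


def make_cached_lookup(geocode):
    """Wrap a geocode(query) callable so each distinct street is queried only once."""

    @lru_cache(maxsize=None)
    def lookup(street):
        location = geocode(f"{street},Wrocław,PL")
        if location is None:
            # The geocoder found nothing for this query.
            return None
        return location.latitude, location.longitude

    return lookup
```

Passing the geocoder in as a parameter also makes the function easy to exercise with a fake geocoder, without any network access.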

In [None]:
def parse_address(address):
    # print("matching:", address)
    result = AddressResult()
    if address in stops.keys():
        # print("found stop:", address)
        result.name = address
        result.lat = stops[address][0]
        result.lon = stops[address][1]
        return result 
    matched_stop = match_stop(address)
    if matched_stop:
        # print('matched stop:', matched_stop)
        result.name = matched_stop
        result.lat = stops[matched_stop][0]
        result.lon = stops[matched_stop][1]
        return result 
    matched_address = match_address(address)
    if not matched_address:
        print('Error finding address:', address)
        return None
    # print("matched address:", matched_address)
    result.name = matched_address
    result.lat, result.lon = get_address_position(matched_address)
    return result

In [None]:
def parse_groups(groups):
    """Return the result for the first group that resolves to a position, else None."""
    for gr in groups:
        result = parse_address(gr)
        if result is not None:
            return result
    return None

In [239]:
for text in facebook_texts[100:150]:
    groups = find_groups(text)
    result = parse_groups(groups)
    if result is None:
        print("Error parsing post:", text)
        print("Debug", groups)
    else:
        print(result)

matching: powstańców śl
matched address: powstańców śląskich




powstańców śląskich [51.0931937, 17.0207054]
matching: bałtycka
found stop: bałtycka
bałtycka [51.1382819000, 17.0293741400]
matching: kajdasza
found stop: kajdasza
kajdasza [51.0536790000, 17.0579230000]
matching: karłowice
found stop: karłowice
karłowice [51.1342531700, 17.0371479900]
matching: przyjaźni
found stop: przyjaźni
przyjaźni [51.0685463800, 17.0032825400]
matching: piłsudskiego
found stop: piłsudskiego
piłsudskiego [51.1018217400, 17.0247599100]
matching: zajezdni gaj
matched stop: zajezdnia gaj
zajezdnia gaj [51.0892658800, 51.0892658800]
matching: grabiszyńska
found stop: grabiszyńska
grabiszyńska [51.1024615200, 17.0145108900]
matching: pomorska
found stop: pomorska
pomorska [51.1180575500, 17.0305601600]
matching: brodzka
found stop: brodzka
brodzka [51.1581602600, 16.9223663000]
matching: legnicka
matched address: legnicka
legnicka [51.1131612, 17.0078288]
matching: podwale
matched address: podwale
podwale [51.1039523, 17.0410579]
matching: legnicka
matched address: l