# Schmerzgrenze der Wiener 🚉

<img src="https://www.biorama.eu/wp-content/uploads/2016/02/Bildschirmfoto-2016-02-26-um-17.14.57.png"></img>

### Project Aim
This Project aims to correlate, visualize and find patterns regarding the developments in the Viennese public transport grid and gauging public sentiment in correlation to such incidents. 



### Team Members
Julian Deleja-Hotko\
Nicolas Markl\
Dionis Ramadani

### Data Sources
**// TODO: DIE VERSCHIEDENEN DATEN GENAUER BESCHREIBEN**

##### [Digitales Wien / Open Government Data Portal Wien](https://digitales.wien.gv.at/open-data/) - REST Endpoint
Several hundred data sets provide detailed information about one-way streets, real-time information from Wiener Linien, historical aerial photographs, measurement data of air pollutants or WLAN locations, to name just a few areas.

##### [Öffi.at](öffi.at)  - XML / Web Scraping
A website gathering and organizing historical data about Wiener Linien outages, courtesy of Klaus Kirnbauer.\
Data available starting from July 2020.

##### [Twitter@WienerLinien](https://twitter.com/wienerlinien) - Web Scraping
A social media platform popular in Vienna with dedicated accounts from public service providers, useful for gauging sentiment about specific routes in Vienna

##### [data.gv.at](data.gv.at) - Flat Files
Gathering general info about the public transport network, usage, etc.



### Architecture Diagram
**//TODO: Hier Architekturdiagramm (verpflichtend)**

### Packages
Here we'll install and import all relevant Python packages for this project

In [3]:
!pip install pymongo
!pip install pyspark
!pip install requests
!pip install beautifulsoup4
!pip install pyspark 
!pip install pandas
!pip install geopandas

Collecting geopandas
  Using cached geopandas-0.11.0-py3-none-any.whl (1.0 MB)
Installing collected packages: geopandas
Successfully installed geopandas-0.11.0


In [1]:
import requestst it to a Dat
import re
from pymongo import MongoClient
from pyspark import SparkContext
from bs4 import BeautifulSoup
import pandas as pd
from pyspark.sql import SparkSession

### Gathering, Storing and Cleaning our Data
The data will be collected and processed via Kafka, analyzed with Spark and all the relevant data will be stored on our MongoDB instance after an ETL-style pipeline.


##### Setting up DB connection
We're connecting to our local MongoDB instance, this is to pipe our extracted and transformed data into the DB later.

In [7]:
# Provide the mongodb connection string
CONNECTION_STRING = 'mongodb://Mongo:mongo@192.168.50.25:27017/'

# Create a connection using MongoClient
myclient = MongoClient(CONNECTION_STRING)

# Check DB collections
for db in myclient.list_databases():
    print(db)

{'name': 'Project', 'sizeOnDisk': 45056, 'empty': False}
{'name': 'Sleepstudy', 'sizeOnDisk': 159744, 'empty': False}
{'name': 'admin', 'sizeOnDisk': 102400, 'empty': False}
{'name': 'config', 'sizeOnDisk': 110592, 'empty': False}
{'name': 'local', 'sizeOnDisk': 73728, 'empty': False}
{'name': 'wienerLinien', 'sizeOnDisk': 5718016, 'empty': False}


##### Scraping together historical Wiener Linien Data
For this step, we're using the Öffi.at website by Klaus Kirnbauer who has aggregated all historical Wiener Linien public transport incidents in an easily queriable fashion.

Since Öffi.at uses conveniently utilizes server-side rendering, we can use BeautifulSoup for our data transformation. 

Now first, we need to model a framework of parsing the relevant data from the 1520 available historical sites;\
Since the data is variable in some cases we have decided on the following format:

| [Affected Lines] | [Affected Stations] | Start Time | End Time | Time Problem Fixed | Title |
|------------------|---------------------|------------|----------|--------------------|-------|


In [3]:
# Format
# ([Affected Lines], [Affected Stations], Start Time, End Time, Time Problem Fixed, Title)
def parse_oeffi_soup(soup):
    data = list(map(lambda li: (list(map(lambda trafficline: trafficline.getText(), li.select('.trafficline'))), 
                                 list(map(lambda liSub: liSub.split('<li>')[1],
                                     list(filter(re.compile('[^+]*\n<b>Von</b>:.').match, str(li).split('<br/>')))[0].split('</li>')[0:-1])),
                                 list(filter(re.compile('[^+]*\n<b>Von</b>:.').match, str(li).split('<br/>')))[0].split('<b>Von</b>: ')[1],
                                 list(filter(re.compile('\n<b>Bis</b>:').match, str(li).split('<br/>'))),
                                 list(filter(re.compile('\n<b>Verkehrsaufnahme</b>:').match, str(li).split('<br/>'))),
                                 li.select('.disruption-title')[0].getText()),
                     soup.select('li.disruption')))

    return list(map(lambda x: (x[0], x[1], x[2], x[3][0].split('</b>: ')[1] if len(x[3]) > 0 else None, x[4][0].split('</b>: ')[1] if len(x[4]) > 0 else None, x[5]), data))

And now we can run this model on all the available sites and aggregate this data! \
For estimation, this takes around 15-20 minutes to run to completion with all 1520 requests.

In [None]:
data = []

for i in range(1, 1520):
    URL = 'https://xn--ffi-rna.at/?archive=1&page=' + str(i)
    page = requests.get(URL)
    soup = BeautifulSoup(page.content, 'html.parser')
    data.extend(parse_oeffi_soup(soup))
    
display(data)

After arduously gathering and cleaning our data, we'll now convert it to a Dataframe and insert it into our Mongo DB instance:

In [6]:
# Transform into Pandas DF
df = pd.DataFrame(data, columns=['Affected Lines', 'Affected Stations', 'Start Time', 'End Time', 'Fixed Time', 'Title'])

# Create Database
db = myclient['wienerLinien']

# Insert
db.stoerungen.insert_many(df.to_dict('records'))

<pymongo.results.InsertManyResult at 0x7f05a4866b50>

Now we can also check if we have inserted our data correctly:

In [188]:
stoerungen_col = db['stoerungen']
print('Stoerungen: ', len(list(stoerungen_col.find())))

Stoerungen:  27088


##### Gathering Geodata about Stations
For this step we read in a data.gv.at flat file for further usage in visualizing and interpreting the data:

// TODO: Read in Shapefile, prepare for drawing of map

##### Gathering Twitter data 

For XYZ reasons we're using the Twitter API to gather XYZ as follows:

// TODO: Write Twitter API code

In [1]:
import tweepy

client=tweepy.Client("AAAAAAAAAAAAAAAAAAAAAHJMdgEAAAAAaazY9nw6SjvKprN4BDsBWVjfIZU%3DlzAqZ4fbC4UTH1caM2XLfZYlTjszcqASp43jAiDcBBO9Lsv08H")
public_tweets =client.search_recent_tweets('"@WienerLinien" (Ausfall OR Störung)')

for tweet in public_tweets[0]:
    print(tweet.text)

@LandauDaniel @wienerlinien @Tom_Harb Beim Hund im Auto kommt die @LPDWien und schlägt die Fenster ein. In Wien werben die @wienerlinien damit, dass in nur 20min eine klimatisierte Tram kommt. Vl, kommt auf die Linie drauf an und dann ist eine Störung (wie immer)
@oebb die frage geht auch an @wienerlinien weil wien mobil zeigt keine störung
@wienerlinien Gibt es einen Grund dafür, dass die 44 Richtung Schottentor in den letzten Wochen &amp; Monaten so unfassbar unzuverlässig fährt? Laut Wien Mobil App keine Störung, sollte normal alle 7min fahren, an der Haltestelle steht kommt erst in 15min.. So macht das keinen Spaß☹️ https://t.co/FmW13vYhzv


##### Setting up Spark via SparkContext for MongoDB

In [11]:
my_spark = SparkSession \
    .builder \
    .appName("MongoSparkConnector") \
    .config("spark.mongodb.input.uri", "mongodb://localhost:27017/wienerLinien.stoerungen") \
    .config("spark.mongodb.output.uri", "mongodb://localhost:27017/wienerLinien.stoerungen") \
    .getOrCreate()


### Analysis

Graphs, maps and heatmaps showing / highlighting patterns and outages of the Viennese public transport system.

##### Generating a Heat Map of most affected stations
// TODO: Use the geolocation data and outage data in our DB to calculate the total amount of time a station was closed