
@title[Title Slide]

CryptoCrawler

Mining information about Crypto-Currencies from the Web.


by Holger Büch & Kevin Hendel

Module "Web & Social Media Analytics"
by Prof. Dr. Stephan Wilczek, Prof. Dr. Jan Kirenz
Master Data Science & Business Analytics
University of Media Stuttgart, Germany

---?image=assets/bg-idea.png @title[Introduction]

Introduction

Idea & Planning

+++ @title[Goal]

Goal

  • Topic: Ongoing Hype on Crypto-Currencies in 2017 and 2018
  • Idea A: Correlate development of Tweets and Stock-Values over time.
  • Idea B: Provide additional information that helps to interpret those developments.
  • Idea C: Automatically buy/sell stocks based on prediction. (not done)
  • Try and use new technologies
  • Gain experience as a reward for the overhead in DevOps

+++ @title[Data Sources]

Data Sources

  • Twitter Stream
  • Crypto-Stock-Market Stream
  • Crypto-Prices API
  • News (planned)

+++ @title[KPIs]

KPIs

  • Correlation coefficient between Stock-Values and (Sentiment of) Tweets (see the sketch below)
  • Delay between events in Stock-Values and on Twitter
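
A toy sketch (synthetic data, not project code) of how both KPIs could be computed with pandas: the correlation coefficient directly, and the delay as the lag with the highest cross-correlation:

```python
import numpy as np
import pandas as pd

# Synthetic hourly series standing in for prices and tweet sentiment
rng = np.random.RandomState(0)
price = pd.Series(np.cumsum(rng.normal(size=200)))
sentiment = price.shift(3) + rng.normal(scale=0.5, size=200)

# KPI 1: correlation coefficient between the two series
print('Correlation:', price.corr(sentiment))

# KPI 2: delay = lag (in hours) with the highest cross-correlation
lags = list(range(-24, 25))
xcorr = [price.corr(sentiment.shift(lag)) for lag in lags]
print('Delay (hours):', lags[int(np.argmax(xcorr))])
```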

+++ @title[Setup]

Project Setup

Github for collaboration

  • Feature Branches & Pull requests
  • Ticketing / Bug Tracking
  • Slides (gitpitch)

+++ @title[Setup2]

AWS for Hosting

  • t2.medium running Ubuntu
  • Access via SSH

+++ @title[Setup3]

Architecture

  • Docker-based virtualized Microservices

---?image=assets/bg-docker2.png @title[Architecture]

Architecture

Docker-based Microservices

+++ @title[Docker]

Docker

  • virtualized Containers for each part of the software
  • independent from host system
  • deployable locally and in the cloud
  • brings all dependencies

+++

Dockerfile

FROM continuumio/miniconda3

RUN conda install -y pymongo pyyaml
RUN conda install -c conda-forge -y tweepy
RUN conda install -c gomss-nowcast schedule

WORKDIR /home
RUN git clone https://github.com/kevhen/CryptoCrawler.git

WORKDIR /home/CryptoCrawler/crypto-price-crawler

CMD while true; do python pricelistener.py; done

@[1](Get defined base image online) @[3,4,5](Install additional packages) @[8](Clone project into the container) @[12](Run python script in a loop in case it exits with an error)

+++ @title[Microservices]

Microservices

  • one Microservice each for single (or small set of) functions
  • every Microservice is independent and stateless
  • restarting of single services or the system without breaking it
  • internal and external networking

+++ @title[Docker Compose]

Docker Compose

Configuration file (docker-compose.yaml)

version: '3'

networks:
  backend:
    internal: true
  frontend:
    internal: false

services:
  crypto-mongo:
    image: mongo:jessie
    volumes:
      - /data/mongodb:/data/db:rw
    networks:
      - backend
    entrypoint:
      - docker-entrypoint.sh
      - --storageEngine
      - mmapv1

  crypto-jupyter:
    volumes:
      - /data/notebooks:/home/jovyan/work
    networks:
      - backend
      - frontend
    ports:
      - 8888:8888
    build:
      context: ./jupyter
    depends_on:
      - crypto-mongo
    entrypoint:
      - start-notebook.sh
      - --NotebookApp.password='sha1:f6a0093ff7ca:be25a6064ba30e37265b0f800cbb925c636cc4fe'
  .
  .
  .
  .

@[1](docker-compose version)

@[3,4,5,6,7](Define virtual networks for the containers) @[4,5](Internal backend network for the Database. Not accessible from outside the containers) @[6,7](Hybrid network. Accessible by the containers and from outside via mapped ports)

@[9](Definition of two example microservices)

@[10-19](MongoDB container) @[11](Base Image) @[12,13](Connect external volumes) @[14,15](Define network connection) @[16-19](Start script inside the container and additional parameters)

@[21,22,23,24,25,26,27,28,29,30,31,32,33,34,35](Jupyter Notebook) @[22,23](Connect external volumes) @[24,25,26](Define network connections) @[27,28](Map internal to external ports) @[31,32](Dependencies to specify build and start order) @[33,34,35](Start script inside the container and additional parameters)

+++ @title[In Action]

In Action

  • docker-compose build
  • docker-compose up

Combined log output

twitter-listener_1       | INFO - 01/30/2018 15:32:36: 374000 Tweets received. Still listening...
crypto-price-listener_1  | INFO - 01/30/2018 15:32:36: Prices are {'BTC': {'USD': 10407.7, 'EUR': 8416.89}, 'ETH': {'USD': 1115.7, 'EUR': 907.42}, 'IOT': {'USD': 2.39, 'EUR': 1.95}}
crypto-price-listener_1  | INFO - 01/30/2018 15:32:36: Trying to save the prices for timestamp 1517326356797 to mongo
crypto-price-listener_1  | INFO - 01/30/2018 15:32:36: Saved prices. Waiting until next call
crypto-dash_1            | INFO - 01/30/2018 15:32:38: 37.201.7.172 - - [30/Jan/2018 15:32:38] "POST /_dash-update-component HTTP/1.1" 200 -
crypto-dash_1            | INFO - 01/30/2018 15:32:43: 37.201.7.172 - - [30/Jan/2018 15:32:43] "POST /_dash-update-component HTTP/1.1" 200 -
crypto-price-listener_1  | INFO - 01/30/2018 15:32:46: Running job Every 10 seconds do startListening({'mongodb': {'host': 'crypto-mongo', 'port': 27017, 'db': 'cryptocrawl'}, 'cryptocompare': {'coinlist': 'https://min-api.cryptocompare.com/data/all/coinlist', 'price': 'https://min-api.cryptocompare.com/data/pricemulti'}, 'collections': {'generalcrypto': {'keywords': ['blockchain', 'crypto', 'altcoins', 'altcoin']}, 'bitcoin': {'keywords': ['bitcoin', 'bitcoins'], 'currencycode': 'BTC'}, 'ethereum': {'keywords': ['ethereum'], 'currencycode': 'ETH'}, 'iota': {'keywords': ['iota', 'iotas'], 'currencycode': 'IOT'}, 'trump': {'keywords': ['trump']}, 'car2go': {'keywords': ['car2go']}}, 'dash': {'live': {'default': ['bitcoin', 'ethereum', 'iota'], 'interval': 5}}}, Database(MongoClient(host=['crypto-mongo:27017'], document_class=dict, tz_aware=False, connect=True), 'cryptocrawl')) (last run: 2018-01-30 15:32:36, next run: 2018-01-30 15:32:46)
crypto-price-listener_1  | INFO - 01/30/2018 15:32:46: Starting Currency Listener
crypto-price-listener_1  | INFO - 01/30/2018 15:32:47: Valid coins are ['BTC', 'ETH', 'IOT']
crypto-price-listener_1  | INFO - 01/30/2018 15:32:47: The coin string is BTC,ETH,IOT
crypto-price-listener_1  | INFO - 01/30/2018 15:32:48: Prices are {'BTC': {'USD': 10403.83, 'EUR': 8412.23}, 'ETH': {'USD': 1115.97, 'EUR': 903.88}, 'IOT': {'USD': 2.39, 'EUR': 1.94}}
crypto-price-listener_1  | INFO - 01/30/2018 15:32:48: Trying to save the prices for timestamp 1517326368073 to mongo
crypto-price-listener_1  | INFO - 01/30/2018 15:32:48: Saved prices. Waiting until next call
crypto-dash_1            | INFO - 01/30/2018 15:32:48: 37.201.7.172 - - [30/Jan/2018 15:32:48] "POST /_dash-update-component HTTP/1.1" 200 -
crypto-dash_1            | INFO - 01/30/2018 15:32:53: 37.201.7.172 - - [30/Jan/2018 15:32:53] "POST /_dash-update-component HTTP/1.1" 200 -

+++ @title[Centralized Configuration]

Centralized Configuration

  • one configuration file for changes in a single place

mongodb:
    host: crypto-mongo
    port: 27017
    db:   cryptocrawl

cryptocompare:
    coinlist: https://min-api.cryptocompare.com/data/all/coinlist
    price: https://min-api.cryptocompare.com/data/pricemulti
    histo: https://min-api.cryptocompare.com/data/histo

collections:
    generalcrypto:
        keywords:
            - blockchain
            - crypto
            - altcoins
            - altcoin
    bitcoin:
        keywords:
            - bitcoin
            - bitcoins
        currencycode: BTC
    ethereum:
        keywords:
            - ethereum
        currencycode: ETH
    iota:
        keywords:
            - iota
            - iotas
        currencycode: IOT
    trump:
        keywords:
            - trump
    car2go:
        keywords:
            - car2go

dash:
    live:
        default:
            - bitcoin
            - ethereum
            - iota
        interval: 5

@[1-4](MongoDB connection information) @[6-9](URLs to the CryptoCompare API) @[11-37](Topic collections with Twitter keywords and currency codes) @[39-45](Dashboard defaults and refresh interval)

---?image=assets/bg-mongodb.png @title[Microservice - MongoDB]

Microservice 1

Mongo DB

+++ @title[MongoDB]

MongoDB

  • Document based Database
  • A Document contains a JSON object
  • Multiple Documents grouped to Collections
  • Documents can be queried using JSON-based Syntax

+++ @title[Store Documents]

Storing Documents with Python

from pymongo import MongoClient

client = MongoClient('crypto-mongo', 27017)
db = client['cryptocrawl']

json_obj = {
    'timestamp_ms': 1517343098,
    'text': 'Something...'
    }

db['bitcoin'].insert(json_obj)

@[1](Import Module (has to be installed)) @[3](Initialize Client-Connection to MongoDB) @[4](Select Database) @[6-9](MongoDB ♥ JSON Documents) @[11](Write JSON as Document in a Collection)

+++ @title[Query Documents]

Query Documents with Python

import pandas
from pymongo import MongoClient

client = MongoClient('crypto-mongo', 27017)
db = client['cryptocrawl']

query = {'timestamp_ms': {'$gt': 1517343098}}
projection = {'text': 1, 'timestamp_ms': 1}

cursor = db['bitcoin'].find(query, projection).limit(100)

df = pandas.DataFrame(list(cursor))

@[1-5](Import Module, Initialize Client-Connection to MongoDB) @[7](Define Filter (similar to WHERE in SQL)) @[8](Define Fields to return (similar to SELECT in SQL)) @[10](find() returns a cursor object (here also limited to 100 results)) @[12](Cursor can be converted into list and transformed into Pandas Dataframe)

*Show in Robo 3T*

+++ @title[Problems]

Problems

+++ @title[Problem with Speed - 1]

Slow Queries over Timestamp

  • We often query for a specific Range of Timestamps, e.g.:
query = {'timestamp_ms': {'$gt': 1517243098, '$lt': 1517343098}}
  • Performance was weak
  • CPU-Usage on VM peaked

+++ @title[Problem with Speed - 2]

Create Index on Timestamp attribute

$ mongo
Mongo > use cryptocrawl
Mongo > show collections
> bitcoin
> ethereum
> generalcrypto
> iota
Mongo > db.bitcoin.createIndex({"timestamp_ms": 1}, {background:true})
Mongo > db.collection.totalIndexSize()

@[1](Start the MongoDB CLI client) @[2](Open the Database "cryptocrawl") @[3-7](List all Collections of this DB) @[8](Create Index on Attribute 'timestamp_ms'. Repeat for all Collections.) @[9](Show Size of Indexes. Should fit in RAM.) @[1-9]
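
The same index can also be created from Python with pymongo (a sketch using our collection names; `create_index` is the pymongo counterpart of the shell command above):

```python
import pymongo
from pymongo import MongoClient

db = MongoClient('crypto-mongo', 27017)['cryptocrawl']

# Index 'timestamp_ms' in every collection, built in the background
for coll in ['bitcoin', 'ethereum', 'generalcrypto', 'iota']:
    db[coll].create_index([('timestamp_ms', pymongo.ASCENDING)],
                          background=True)
```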

+++ @title[Problem with Aggregation - 1]

Aggregate by Timestamp in Milliseconds

  • We need to aggregate on Time-Intervals, e.g. for Tweets per Hour
  • MongoDB can aggregate DateTime-Object on Intervals
  • But we get Timestamps in Milliseconds from Twitter
  • How to aggregate Integer with Milliseconds on hourly Interval?

+++ @title[Problem with Aggregation - 2]

Aggregate using Math

  • To aggregate Milliseconds by Hours, get the Milliseconds per Hour: |
1 Hour is 1000 ms * 60 sec * 60 min = 3,600,000 ms
  • Then divide the Timestamp by this Value and round down: |
floor(timestamp / 3,600,000)
  • All Timestamps from the same Hour will result in the same Value |
  • Then Aggregation can be done on this Value |
  • Sadly, MongoDB has no floor-Function |
  • Luckily, it has a modulo Function (see the sketch below): |
timestamp/3,600,000 - ((timestamp/3,600,000) mod 1)
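
As a minimal sketch (not necessarily our exact pipeline), the modulo trick inside a MongoDB aggregation counting Tweets per Hour could look like this:

```python
pipeline = [
    # floor(ts / 3600000) via: x - (x mod 1)
    {'$project': {'hour': {'$subtract': [
        {'$divide': ['$timestamp_ms', 3600000]},
        {'$mod': [{'$divide': ['$timestamp_ms', 3600000]}, 1]}]}}},
    # Count Tweets per hour bucket
    {'$group': {'_id': '$hour', 'tweets': {'$sum': 1}}},
    {'$sort': {'_id': 1}}
]
tweets_per_hour = list(db['bitcoin'].aggregate(pipeline))
```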

+++ @title[Problem with Aggregation - 3]

Alternative Solutions

  • Cast to DateTime during Query
  • Convert & store Milliseconds as DateTime-Value

Would those be faster?

---?image=assets/bg-twitterlistener.png @title[Twitter Stream Listener]

Microservice 2

Twitter Stream Listener

+++ @title[Tweepy]

Using Tweepy to listen to Twitter Streaming API

import tweepy

class MyStreamListener(tweepy.StreamListener):
    def on_status(self, status):
        print(status.text)  # Then store the tweets...
    def on_error(self, status_code):
        if status_code == 420:
            time.sleep(300) # Then reconnect ...

auth = tweepy.OAuthHandler(api_key, api_secret)
auth.set_access_token(access_token, access_secret)
api = tweepy.API(auth)

stream_listener = MyStreamListener(conf=conf)
stream = tweepy.Stream(auth=api.auth, listener=stream_listener)
stream.filter(track=list(['bitcoin','iota','...']), async=True)

@[1](Import Module) @[3](Inherit StreamListener class) @[4-5](Define what to do when a Tweet arrives) @[6-8](Handle API Errors, especially 420, to avoid a penalty) @[10-13](Set credentials and create API object) @[15-18](Instantiate class, start listening to Tweets with keywords) @[1-18]

+++ @title[Problems]

Problems

+++ @title[Information overload - 1]

Too much information

  • Over 500 MB of Data within the first two Hours
  • Over 900 Tweets per Minute

(Charts: Tweets received after two hours; Tweets per 15 min, crypto topics only)

+++ @title[Information overload - 2]

Filter Tweets

  • Exclude non-English Tweets
  • Exclude Retweets



Store only a subset of Attributes (see the sketch below)

  • TweetID
  • AuthorID
  • Text
  • Timestamps
  • Geo-Information
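
A sketch of how this filtering could look in the `on_status` handler (attribute names follow the raw Twitter JSON; `matching_collection` is a hypothetical placeholder for the keyword-based routing):

```python
def on_status(self, status):
    # Exclude non-English Tweets and Retweets
    if status.lang != 'en' or hasattr(status, 'retweeted_status'):
        return
    # Store only the subset of attributes we need
    doc = {
        'id': status.id,
        'user_id': status.user.id,
        'text': status.text,
        'timestamp_ms': status.timestamp_ms,
        'geo': status.coordinates,
    }
    db[matching_collection].insert(doc)  # hypothetical routing by keyword
```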

+++ @title[Tweepy Bug - 1]

Bug in Tweepy Module

Tweepy kept raising Exceptions after some Days of running:

File "tstreamer.py", line 109, in myStream.userstream("with=following")
File "/tweepy/streaming.py", line 394, in userstream
self._start(async)
File "/tweepy/streaming.py", line 361, in _start
self._run()
File "/tweepy/streaming.py", line 294, in _run
raise exception
AttributeError: 'NoneType' object has no attribute 'strip'

tweepy/tweepy#869 (open since March 2017)

+++ @title[Tweepy Bug - 2]

Implement Workaround

Handle Exceptions and reconnect Tweepy:

def startListening():
    """Start listening to twitter streams."""
    try:
        stream_listener = MyStreamListener(conf=conf)
        # [...]
    except Exception as e:
        logger.error('Exception raised: %s', e)
        startListening()

Auto-restart Microservice on exit (just in case...):

while true; do python streamlistener.py; done

---?image=assets/bg-pricecrawler.png @title[Microservice - Price Crawler]

Microservice 3

Crypto Price Crawler

+++

Crypto Price Crawler

  • receive prices at a defined interval
  • call the API to get all currencies and their current values
  • save the values to the database, each in a separate collection

+++

Scheduled listener

def init():
    with open('../config.yaml', 'r') as stream:
        conf = yaml.load(stream)
    # Open a connection to mongo:
    client = MongoClient(conf['mongodb']['host'], conf['mongodb']['port'])
    db = client[conf['mongodb']['db']]
    schedule.every(10).seconds.do(startListening, conf, db)

@[2,3](Load config file) @[5,6](Connect to the database) @[7](Schedule listener to run every 10 seconds)
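
The `schedule` module only registers jobs; a driver loop (a sketch of the part not shown above) has to trigger them:

```python
import time

# Run pending jobs forever; startListening fires every 10 seconds
while True:
    schedule.run_pending()
    time.sleep(1)
```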

+++

REST API Call

def getPricesOnce(currencies, conf):
    """Gets the current price for the currencies defined in the config.yaml"""
    payload = { 'fsyms': currencies, 'tsyms': 'USD,EUR' }
    response = requests.get(conf['cryptocompare']['price'], params=payload)
    if response.status_code == 200:
        prices = json.loads(response.content)
        logger.info('Prices are {}'.format(prices))
        return prices
    else:
        logger.warning('Could not get prices. Error code {}'.format(response.status_code))
        return False

@[1](Crypto currencies as a comma-separated string and conf (config) as parameters) @[3](Build payload for the REST-Call) @[4](Send request and save response) @[5,6,7,8](Handle successful request and parse content) @[9,10,11](Handle error case)

---?image=assets/bg-cryptowrapper.png @title[Microservice - API Wrapper]

Microservice 4

Crypto API Wrapper

+++

Crypto API Wrapper

  • serve an API as an interface between the dashboard and the database or external APIs

  • benefits

    • prevent direct database access
    • filter data and fit format for later needs
    • less logic in the frontend
  • using the Flask framework to build the actual API

+++

Random tweets endpoint

  • return random tweets about certain topics in a defined timeframe

Example: /tweets?topics=bitcoin,ethereum,iota&amount=5&from=1516110498&to=1516290284

def getTweetsForTopics(topicstring, amount, fromTs, toTs):
    topicList = parseTopics(topicstring)
    randomTweets = []
    for topic in topicList:
        cursor = db[topic].aggregate([
                { '$match': { 'timestamp_ms': {'$gt': fromTs , '$lt': toTs }}},
                { '$sample': { 'size': amount } },
                { '$project' : { '_id': 0 } }
            ])
        tweetListForTopic = list(cursor)
        identifiedTweetListForTopic = []
        for singleTweet in tweetListForTopic:
            identifiedTweet = { 'topic': topic, 'tweet': singleTweet }
            identifiedTweetListForTopic.append(identifiedTweet)
        randomTweets = randomTweets + identifiedTweetListForTopic
    if len(randomTweets) >= amount:
        randomListFinal = random.sample(randomTweets, amount)
    else:
        randomListFinal = randomTweets
    resultDict = {'tweets': randomListFinal }
    return resultDict

@[1](Get URL parameters as input) @[2](Get list of topics from the comma-separated topicstring) @[4](Do one request per topic because every topic is in a separate Mongo collection) @[5-9](Build Mongo aggregation) @[6](Match all entries between the start and end timestamps) @[7](Get a random sample of the specified size from the matched entries) @[8](Remove the _id field from the results) @[12,13,14](Add the topic to each tweet) @[17](From all sampled tweets across the collections, take a final sample of the requested amount)

+++

Add endpoint in Flask

class RandomTweets(Resource):
    def get(self):
        now = int(time.time())

        fromTs = handleTs(request.args.get('from'), now) * 1000
        toTs = handleTs(request.args.get('to'), now) * 1000

        topicstring = request.args.get('topics')
        amount = parseAmount(request.args.get('amount'))

        result = getTweetsForTopics(topicstring, amount, fromTs, toTs)
        return jsonify(result)

api.add_resource(RandomTweets, '/tweets')

@[3](Get the current timestamp for timeframe calculations) @[5](Handle the beginning of the timeframe. Check that it is not in the future. Convert seconds to milliseconds) @[6](Handle the end of the timeframe. Check that it is not in the future. Convert seconds to milliseconds. Default to now if not set.) @[8](Get the passed topicstring) @[9](Handle the amount) @[11](Do the actual request with all parameters) @[12](Return the result as JSON) @[14](Add the endpoint to Flask at the route /tweets)

+++

Additional historical price endpoint

  • endpoint to retrieve historical prices from the CryptoCompare API
  • implemented to take a timeframe and parse the parameters to fit the CryptoCompare API definitions
  • returns the daily, hourly or per-minute prices for a crypto currency
  • planned to supplement the price listener and to backfill values missing due to system outages (see the sketch below)
  • not yet integrated into the dashboard
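
A hedged sketch of such a call (parameter names as documented by CryptoCompare; `histohour` shown, `histoday`/`histominute` work analogously):

```python
import requests

payload = {'fsym': 'BTC', 'tsym': 'USD', 'limit': 24}
response = requests.get('https://min-api.cryptocompare.com/data/histohour',
                        params=payload)
if response.status_code == 200:
    hourly_prices = response.json()['Data']  # list of price entries per hour
```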

---?image=assets/bg-anomaly.png @title[Microservice - Anomaly Detection]

Microservice 5

Anomaly Detection

+++ @title[Idea]

Detect 'unusual' Events in:

  • Amount of Tweets received
  • Sentiment
  • Prices of Crypto-Currencies



Use this information for:

  • Visualization in Dashboard
  • Easing the Interpretation of the Data
  • Searching for News in those Time-Ranges (not done)

+++ @title[Method]

Method

  • Research on Algorithms |
  • Tested an ARIMA Model first (in a Jupyter Notebook) |
  • Our Data doesn't cover a long enough Time-Span |
  • Twitter itself published an Algorithm |
  • But it is implemented only in R (the Python ports are unreliable) |
  • Solution: Implement a very simplified Version |

Twitter's Algorithm explained

+++ @title[Step 1]

Step 1: Seasonal Decomposition

(Figure: Seasonal Decomposition)

+++ @title[Step 2]

Step 2: Extreme Studentized Deviate test (ESD)

Detect Outliers in univariate Data that is approximately normally distributed (had to be tested beforehand!).


  • Set Parameter for the Maximum Number of Outliers |
  • Set Parameter for Significance p |
  • ESD test checks the 1 most extreme Value |
  • Computes Test Statistic R and Critical Value λ |
  • ESD test checks the 2 most extreme Values |
  • Computes Test Statistic R and Critical Value λ |
  • ... |
  • Optimal Number of Outliers selected by comparing R against λ |

+++ @title[Results]

Test for normal distribution

(Figure: Normal Distribution)

(Tested already during ARIMA Modelling approach)
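
A minimal sketch of such a check (assuming `resid` holds the decomposition residuals; `scipy.stats.normaltest` is one common choice, not necessarily the exact test we used):

```python
from scipy import stats

# D'Agostino-Pearson test: H0 = sample comes from a normal distribution
statistic, p_value = stats.normaltest(resid)
if p_value > 0.05:
    print('Normality not rejected - ESD is applicable')
```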

+++ @title[Results]

Results of ESD Test

Number of outliers:  7
Indices of outliers:  [73, 63, 111, 119, 87, 117, 118]
        R      Lambda                   R      Lambda
 1   3.92406   2.85719          6   2.82568   2.84412
 2   3.36138   2.85462          7   2.92393   2.84144
 3   2.95692   2.85203          8   2.76510   2.83873
 4   2.96757   2.84941          9   2.51757   2.83600
 5   2.80918   2.84678          10  2.54265   2.83325


+++ @title[Implementation in Python]

Implementation in Python

import statsmodels.api as sm
from PyAstronomy import pyasl

# Seasonal Decomposition
model = sm.tsa.seasonal_decompose(ary, freq=freq)
resid = model.resid

# [...] clean/transform resid values

# ESD
anomalies = pyasl.generalizedESD(resid, max_anoms, p_value)

@[1-2](Load Modules) @[4-5](Seasonal Decomposition) @[6](We only need the resid Values) @[8](Some Transformation for the next Step (e.g. remove NaN)) @[10-11](Apply ESD) @[1-11]

Whole thing wrapped as Flask API (Show in Postman)

---?image=assets/bg-topic.png @title[Microservice - Topic Modelling]

Microservice 6

Topic Modelling

+++ @title[Idea]

Identify Topics in Tweet-Texts

  • Aggregated view on what the Tweets are about
  • Add Information to Dashboard
  • Search for those Topics in a News API (not done)

+++ @title[Method]

Latent Dirichlet Allocation (LDA)

  • A "Document" contains some Topics with different Weights |
  • A "Topic" is Probability Distribution about all Words in Corpus |
  • A "Word" can be assigned to more than one Topics |

Does this work for Documents as short as Tweets? Let's try!

+++ @title[Preprocessing]

Preprocessing

exclude_custom = '“”…‘’x'
exclude = set(string.punctuation + exclude_custom)

stop_custom = ['rt', 'bitcoin', 'bitcoins', 'iota', 'ethereum', 'btc',
               'eth', 'iot', 'ltc', 'litecoin', 'litecoins', 'iotas',
               'ltc', 'cryptocurrency', 'crypto', 'cryptocurrencies',
               'coin']
stop = set(stopwords.words('english') + stop_custom)

# Remove punctuation
doc = ''.join(ch for ch in doc
                      if ch not in exclude)

# Remove URLS
doc = ' '.join([i for i in doc.lower().split()
                      if not i.startswith('http')])

# Remove anything containing numbers
doc = ' '.join([i for i in doc.lower().split()
                      if not any(char.isdigit() for char in i)])

# Remove short words
doc = ' '.join([i for i in doc.lower().split()
                      if len(i) >= 4])

# Lemmatize
# clean_doc = ' '.join(lemma.lemmatize(word)
#     for word in clean_doc.split())

# Remove Stopwords
doc = ' '.join([i for i in doc.lower().split()
                      if i not in stop])

@[1-2](Define Chars to remove) @[4-8](Define Stopwords) @[10-12](Remove Punctuation and special Chars) @[14-16](Remove URLs) @[18-20](Remove Words containing Numbers) @[22-24](Remove Words shorter than 4 Chars) @[26-28](Lemmatization, currently disabled) @[30-32](Remove Stopwords) @[1-32]

+++ @title[LDA in Python]

LDA in Python

import gensim
from gensim import corpora

dictionary = corpora.Dictionary(docs)

doc_term_matrix = [dictionary.doc2bow(doc) for doc in docs]

Lda = gensim.models.ldamodel.LdaModel
ldamodel = Lda(doc_term_matrix, num_topics=num_topics,
               id2word=dictionary, passes=20)

# [...] Converts LDA model into nice list

@[1-2](Import the Gensim Module for Vector Space Modelling) @[4](Load Documents into a Corpora Dictionary) @[6](Prepare the Document-Term Matrix) @[8-10](Do the Modelling with Parameters for the Number of Topics and Passes) @[12](Convert Results into a consumable List, see the sketch below) @[1-12]
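
One way such a conversion could look (a sketch using gensim's built-in `print_topics`, not necessarily our exact code):

```python
# Returns a list of (topic_id, 'weight*word + ...') tuples
for topic_id, words in ldamodel.print_topics(num_topics=num_topics,
                                             num_words=5):
    print(topic_id, words)
```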

Whole thing wrapped as Flask API (Show in Postman)

---?image=assets/bg-senti.png @title[Microservice - Sentiment Analysis]

Microservice 7

Sentiment Analysis

+++ @title[Method]

Simple Wordlist-based approach

  • Research about Sentiment Analysis in the financial Sector
  • Found the Wordlist by Loughran & McDonald (2011)
  • Decided to use a simple Wordlist-based approach

+++ @title[Implementation]

Implementation

  • As a scheduled Service
  • Runs every 5 minutes
  • Searches for Tweets without Sentiment in MongoDB
  • Counts positive & negative Words contained in the Tweet
  • Sentiment Score = Positive - Negative (see the sketch below)
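
A minimal sketch of the scoring (the word sets shown are illustrative placeholders, not actual entries of the Loughran & McDonald lists):

```python
positive = {'gain', 'improve', 'success'}   # placeholder excerpt
negative = {'loss', 'fraud', 'decline'}     # placeholder excerpt

def sentiment_score(text):
    """Sentiment Score = count of positive minus count of negative Words."""
    words = text.lower().split()
    return sum(w in positive for w in words) - sum(w in negative for w in words)

print(sentiment_score('fears of fraud offset the gain'))  # 1 - 1 = 0
```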

---?image=assets/bg-jupyter.png @title[Microservice - Jupyter Notebook]

Microservice 8

Jupyter Notebook

+++

Integrate an instance of Jupyter notebook

  • Jupyter notebook inside its own container
  • connection to the database and all other services
  • test environment for database and API calls
  • fast feedback without redeploying any of the components

---?image=assets/bg-dash.png @title[Microservice - Dash]

Microservice 9

Dashboard

+++

Building a Dashboard Box (here: Avg. Stock Prices per Hour)

html.Div([
    html.Div([
        html.H3([
            html.Span('€', className='icon'),
             'Avg. Stock Prices per Hour'])
    ], className='title'),
    html.Div([
        dcc.Graph(
            style={'width': '878px', 'height': '250px'},
            id='stock-plot')
    ], className='content')
], className='box'),

@[1,12](Surrounding div element) @[2-11](Array of inner elements) @[2-6](Box title with icon) @[7-11](Stock graph)
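
The graph itself is filled by a callback; a sketch of how that could look (`interval-component` and `load_prices` are hypothetical names, not our actual code):

```python
from dash.dependencies import Input, Output

@app.callback(Output('stock-plot', 'figure'),
              [Input('interval-component', 'n_intervals')])
def update_stock_plot(n):
    df = load_prices()  # hypothetical helper calling the API wrapper
    return {
        'data': [{'x': df['time'], 'y': df['price'], 'type': 'line'}],
        'layout': {'margin': {'t': 10}}
    }
```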


@title[Wrap up]

Wrap up

+++ @title[What we have learned]

What we have learned

  • Plotly Dash is nice, but Data Management & Cross-Selection are quite difficult.
  • Take more care with Exception Handling, especially for critical Services (e.g. Stream Listener)
  • Docker(-Compose) is really cool for Development & Deployment!
  • Putting the right Data into Models is crucial (e.g. Missing Data in Anomaly-Detection)
  • Choosing the right Model for the Data is not easy (e.g. LDA for Tweets)

+++ @title[What we would improve]

What we would improve (1)

Architecture & Code

  • Connect the Frontend through a single API
  • Refactor Dashboard-Code

Topic Modelling

  • Search for a better Model for short Texts
  • Research whether LDA can be sped up somehow

Anomaly Detection

  • More complete Twitter Algorithm Implementation
  • Or build an R Microservice

+++ @title[What we would improve]

What we would improve (2)

Dashboard UX / UI

  • Non-blocking Interactions
  • Add loading Indicators
  • Improve Cross-Selection
  • Improve Performance, e.g. on Data-Loading

Additional Features

  • Leverage News APIs
  • Calculate and display the Correlation Coefficient
  • Show Tweets on Map

@title[Thank you!]

Thank you for your Attention!