![](https://www.saa-authors.eu/picture/739/ftw_768/saa-mtcwmza4nzq5mq.jpg)

# Logging & Service-Level Agreements (SLAs) 

### Plan for the day

* Project Status
* PREP Discussion
* Research Prototype: Flask Monitoring Dashboard
* Topic: Logging
* Topic: Service Level Agreements
   

# Project Status

- monitoring?
- let's look at the charts

In [5]:
from IPython.display import IFrame
IFrame('http://164.92.246.227/chart.svg', width=900, height=500)

In [3]:
from IPython.display import IFrame
IFrame('http://159.89.26.109/commit_activity_weekly.svg', width=900, height=500)

In [4]:
from IPython.display import IFrame
IFrame('http://159.89.26.109/release_activity_weekly.svg', width=900, height=500)

# Prep Questions

### Question #1 for You: Which of your endpoints is the slowest? How slow is it?

### Question #2 for You: Where is the time being spent in this endpoint? How did you find out? 

Groups
- E
- O
- F
- R




## Flask Monitoring Dashboard

Research Project; [Open Source](https://github.com/flask-dashboard/)

Application Performance Monitor for Python + Flask

Goals:
* Simple to deploy & use
* Leverage version-control information

Observations:
* Aimed at small projects (stand-alone API / App)
* Can be easily installed because it's laser-focused: Flask/Python

* Profiling introduces significant overhead
  - monitoring levels deployable with fine granularity
  - measuring the overhead is tricky
  - current ongoing project: automatically detecting performance regressions



# Meta-Observations

* Started as a bachelor's thesis, continued as a master's thesis
  - Stick to your project (it takes years until something becomes successful)
  - There is still room for several theses extending this project



Open source, MIT 
- Code at: https://github.com/flask-dashboard/
- Demo: https://fmd-master.herokuapp.com/
- Mini-Case Study: https://mircealungu.com/post/18-09-07--db-indexes/
- Short paper: https://github.com/flask-dashboard/Vissoft-17-Paper/blob/master/FlaskDashboard-Preprint.pdf 
- Try it if you have a Flask API/App!



# Limitations of Monitoring


* Tracks high-level metrics of the system (e.g. error rate of an endpoint: 2%)
* Does not explain WHY there was a problem 

For the WHY we need more detailed information: 

  - **profiling** (as we saw in FMD) = dynamic program analysis for errors & performance problems
  - or **logging** as we will discuss today
  - or **tracing** = observing requests as they propagate through distributed systems ("not today")
         
         
Read More: [Why Grafana is Good at Metrics and Not Logs](https://grafana.com/blog/2016/01/05/logs-and-metrics-and-graphs-oh-my/)

# Logging


Logs are **the stream of aggregated, time-ordered events** collected from the output streams of all running processes and backing services


Purpose

* Understanding (how is your system being used?)


* Diagnosis (of an actual problem)

  - what happened yesterday? why was the service slow/down?


* Audit trails 
  - Sometimes logs are legally required
  - Ultimate example: bitcoin - a big log of all transactions


> The value in aggregating all the data to a centralized database for alerting and analysis simply cannot be understated

## Logfiles 


In server-based environments they are commonly written to a file on disk (a “logfile”)

    e.g. cat /var/log/auth.log
    
<img src="images/auth_log.png" alt="Drawing" style="width: 800px;"/>


* A reminder of the importance of security...

* ... but this is only one of the many possible output formats

## Different programs have different log formats

#### /var/log/auth.log

    Mar 16 07:15:55 zeeguu-amsterdam sshd[29424]: Invalid user vultr from 144.217.243.216 port 56450


#### /var/log/apache2/error.log

    [Wed Mar 18 20:39:02.962354 2020] [wsgi:error] [pid 18:tid 140056344164096] [remote 212.187.36.136:57046] Session is retrived from cookies
    
#### nginx

    66.249.65.62 - - [06/Nov/2014:19:12:14 +0600] "GET /?q=%E0%A6%A6%E0%A7%8B%E0%A7%9F%E0%A6%BE HTTP/1.1" 200 4356 "-" "Mozilla/5.0 (compatible; Googlebot/2.1;)"

#### /var/log/system.log (on mac)

    Mar 18 21:25:16 Harlequin logd[85]: #DECODE failed to resolve UUID: [pc:0x7fff75485ac7 ns:0x06 type:0x82 flags:0x8208 main:A52374C3-0F9D-3062-A636 pid:435]
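
The formats above differ even in how they write the time of an event. As a minimal sketch (the helper names and the assumed year are illustrative, not part of any tool), the following normalizes the timestamps of the `auth.log` and nginx samples. Note that the classic syslog-style line carries neither a year nor a timezone, so both have to be supplied from outside.

```python
from datetime import datetime, timezone

# Timestamps copied from the auth.log and nginx samples above.
AUTH_TS = "Mar 16 07:15:55"
NGINX_TS = "06/Nov/2014:19:12:14 +0600"


def parse_auth_ts(text, assumed_year):
    """Parse a classic syslog timestamp; the year is NOT in the log line."""
    # Assumption: the caller knows the year (e.g. from the file's modification time).
    # "%b" is locale-dependent; this sketch assumes English month names.
    return datetime.strptime(f"{assumed_year} {text}", "%Y %b %d %H:%M:%S")


def parse_nginx_ts(text):
    """Parse an access-log timestamp, which carries an explicit UTC offset."""
    return datetime.strptime(text, "%d/%b/%Y:%H:%M:%S %z")


auth_dt = parse_auth_ts(AUTH_TS, assumed_year=2020)
nginx_dt = parse_nginx_ts(NGINX_TS)

print(auth_dt.isoformat())                             # naive: no timezone known
print(nginx_dt.astimezone(timezone.utc).isoformat())   # comparable across servers
```

Any tool that aggregates logs from several programs has to do this kind of normalization for every format it ingests.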



# Syslog: What and How to Log

Syslog is a protocol 
* developed in the 1980s
* aims at standardizing the way logs are formatted
* not only for Linux, but for any system exchanging logs.


Designed to enable the separation between:
  - sender
  - collector, and 
  - transport
    


More on syslog
- https://tools.ietf.org/html/rfc3164 (2001, cool intro, more humane)
- https://tools.ietf.org/html/rfc5424 (2009)


### Example Syslog Entry 

<img src="images/syslog_line.png" alt="Drawing" style="width: 400px;"/>

### Log Levels (Syslog)

<img src="images/syslog_levels.png" alt="Drawing" style="width: 700px;"/>
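
A syslog message starts with a priority value (the number in angle brackets) that packs the facility and the severity level into one integer: `PRI = facility * 8 + severity`. The sketch below shows that encoding. It uses the short keywords commonly used for the eight RFC severities and only a handful of the facility codes; the function names are made up for illustration.

```python
# Severity keywords for the eight syslog levels (0 is the most severe).
SEVERITIES = ["emerg", "alert", "crit", "err", "warning", "notice", "info", "debug"]

# A few facility codes from the syslog RFCs; the full table has 24 entries.
FACILITIES = {"kern": 0, "user": 1, "mail": 2, "daemon": 3, "auth": 4}


def encode_pri(facility, severity):
    """Combine a facility name and a severity keyword into the PRI number."""
    return FACILITIES[facility] * 8 + SEVERITIES.index(severity)


def decode_pri(pri):
    """Split a PRI number into (facility code, severity keyword)."""
    facility_code, severity_code = divmod(pri, 8)
    return facility_code, SEVERITIES[severity_code]


print(encode_pri("auth", "crit"))   # 4 * 8 + 2
print(decode_pri(38))               # facility 4 (auth), severity 6
```

Because the severity occupies the low three bits, a collector can filter by level without parsing the rest of the message.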





# Application Level Logging Principles


1. A Process Should Not Worry About Storage
2. A Process Should Log Only What Is Necessary
3. Use Log Levels to Allow Controlling the Amount of Output



### A Process Should Not Worry About Storage

Don't bother to decide which logfile to write to...


If each process **writes to its unbuffered stdout stream** ...


* During development: dev looks at the terminal

* During deployment: output from the process is routed where needed by ops

* Different contexts result in different logfiles 
  - e.g. cronjob vs stand-alone

Challenges: 
  - searching through a deluge of log messages in the terminal is not much fun either
  - when you depend on a very verbose upstream, you might have to find out how to silence some of its logs
  - when using Docker containers, make sure not to lose the logfiles when recreating containers
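
A minimal sketch of the "write to stdout and let the environment decide" principle, using Python's standard `logging` module. The logger name `minitwit` and the helper `build_stdout_logger` are illustrative assumptions.

```python
import logging
import sys


def build_stdout_logger(name):
    """Return a logger that writes to stdout and knows nothing about files."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate lines via the root logger
    if not logger.handlers:   # keep repeated calls (e.g. in a notebook) idempotent
        handler = logging.StreamHandler(sys.stdout)  # flushes after every record
        handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
        logger.addHandler(handler)
    return logger


log = build_stdout_logger("minitwit")  # hypothetical service name
log.info("service started")
```

The process itself contains no file path. Whoever runs it decides where the stream ends up, for instance by shell redirection, a service manager, or the container runtime.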

### A Process Should Log Only What Is Necessary

* Log files can become huge fast
  - One solution is log rotation
    - set a threshold of time / size after which the data in the file is truncated / stored elsewhere

* Imagine 10 servers × 1000 req/sec × 1 KB per event ≈ 10 MB/s of log data (you'd need quite some network bandwidth...)


* What is necessary? What is your business?
  - if you're Apache - ...
  - if you're a bank - ...
  - if you're MiniTwit - ... 
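
The back-of-the-envelope figure above (10 servers, 1000 requests per second each, 1 KB per event) can be worked out as follows. The sketch assumes exactly one log event per request and takes 1 KB as 1000 bytes.

```python
SERVERS = 10
REQUESTS_PER_SECOND = 1000   # per server
BYTES_PER_EVENT = 1000       # "1 KB", taken here as 1000 bytes

bytes_per_second = SERVERS * REQUESTS_PER_SECOND * BYTES_PER_EVENT
megabytes_per_second = bytes_per_second / 1_000_000
megabits_per_second = bytes_per_second * 8 / 1_000_000
gigabytes_per_day = bytes_per_second * 86_400 / 1_000_000_000

print(f"{megabytes_per_second:.0f} MB/s = {megabits_per_second:.0f} Mbit/s")
print(f"{gigabytes_per_day:.0f} GB/day")
```

That is 10 MB/s, or 864 GB per day, before any replication or indexing overhead.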


Even in the context of simple research projects, logfiles grow to multi-GB size within a few months
- sufficient to fill the disk
- to the point of not being able to run bash commands!!!
  - Hint: ssh user@server rm ... 
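
Log rotation, mentioned above as one solution, is available in Python's standard library. The sketch below uses `logging.handlers.RotatingFileHandler` with a deliberately tiny size limit so that the effect is visible. The file name, logger name, and limits are assumptions chosen for the demo.

```python
import logging
import os
import tempfile
from logging.handlers import RotatingFileHandler

log_dir = tempfile.mkdtemp()                  # throwaway directory for the demo
log_path = os.path.join(log_dir, "app.log")   # hypothetical file name

# Roll over before the file reaches 1 KB; keep at most 3 old files.
handler = RotatingFileHandler(log_path, maxBytes=1024, backupCount=3)
handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))

rot_logger = logging.getLogger("rotation_demo")
rot_logger.setLevel(logging.INFO)
rot_logger.propagate = False
rot_logger.addHandler(handler)

for i in range(200):
    rot_logger.info("event number %d", i)

handler.close()
print(sorted(os.listdir(log_dir)))
```

Disk usage is now bounded by roughly `maxBytes * (backupCount + 1)`; older events are discarded, so anything that must be kept has to be shipped elsewhere first.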


### Use Log Levels to Allow Controlling the Amount of Output

* Allows the user of the app / library to control the amount of logging


* Better than print 
  - intention revealing
  - can be turned off gradually
  


See following example...




In [2]:
import sys
from importlib import reload 
import logging
reload(logging)  # hack for Jupyter: start from a fresh logging module so that basicConfig takes effect

logging.basicConfig(
        # note: the literal "Z" suggests UTC, but asctime is local time
        # unless logging.Formatter.converter is set to time.gmtime
        format="%(asctime)-15sZ %(levelname)s [%(module)s] %(message)s",
        # no "%f" here: datefmt is handed to time.strftime, which has no microseconds directive
        datefmt="%Y-%m-%d %H:%M:%S",
        level=logging.ERROR,  # only ERROR and CRITICAL pass this threshold
        stream=sys.stdout
)

logging.debug("Got here!")
logging.info("Demo started.")
logging.warning("Ooops. Something went wrong!")
logging.error("Logging is broken in Jupyter")
logging.critical("System will shut down")


2021-03-21 15:55:05Z ERROR [<ipython-input-2-eb853a24a201>] Logging is broken in Jupyter
2021-03-21 15:55:05Z CRITICAL [<ipython-input-2-eb853a24a201>] System will shut down
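
Levels can also be set per logger, which is how the user of a library controls how much that library logs. A small sketch, with made-up logger names (`verbose_upstream`, `myapp`) and output captured in memory so that the effect is easy to inspect:

```python
import io
import logging

# Hypothetical logger names: one for a chatty dependency, one for our own app.
noisy = logging.getLogger("verbose_upstream")
app = logging.getLogger("myapp")

buffer = io.StringIO()
capture = logging.StreamHandler(buffer)
capture.setFormatter(logging.Formatter("%(name)s %(levelname)s %(message)s"))

for lg in (noisy, app):
    lg.addHandler(capture)
    lg.propagate = False

noisy.setLevel(logging.WARNING)  # silence DEBUG/INFO chatter from the dependency
app.setLevel(logging.DEBUG)      # keep full detail for our own code

noisy.info("connection pool resized")   # dropped
noisy.warning("retrying request")       # kept
app.debug("cache miss for user 42")     # kept

print(buffer.getvalue(), end="")
```

This only works if the library logs through a named logger instead of calling `print`, which is one more reason to prefer logging over print statements.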


# The ELK Stack

One of the most popular solutions at the moment

<img src="images/elk.png" alt="Drawing" style="width: 600px;"/>


Stands for:
* ElasticSearch
* Logstash 
* Kibana


## ElasticSearch - Scalable Full Text Search

Solution based on JSON over HTTP

* Based on Apache Lucene
* Distributed & Replicated
* Near-real-time full-text search
* Stores logs in dedicated log indexes

Personal Info: Student project comparing MySQL with ES: 
- Did you know that MySQL has [full text search](https://dev.mysql.com/doc/refman/8.0/en/fulltext-search.html)? 
- Guess who wins?
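
"JSON over HTTP" means that storing a log event is just an HTTP request with a JSON body. The sketch below only builds such a request and does not send it. The host, the index name `zeeguu_web`, and the event fields are assumptions; a secured cluster would also need credentials.

```python
import json
from urllib.request import Request

# Assumptions: a local, unauthenticated node and an index named "zeeguu_web".
ES_URL = "http://localhost:9200/zeeguu_web/_doc"

event = {
    "timestamp": "2021-03-21T15:55:05Z",
    "level": "ERROR",
    "process": "web",
    "log": "Logging is broken in Jupyter",
}

request = Request(
    ES_URL,
    data=json.dumps(event).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)

print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
# Sending it would be: urllib.request.urlopen(request)  (needs a running node)
```

Searching works the same way: a JSON query sent to the index's `_search` endpoint.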



## Logstash - Log Parser 


Converts from various log line formats to JSON

Tails log files and emits events when a new log message is added

Comes with a powerful pattern parsing plugin (Grok)

<img src="images/logstash_example_.png" alt="Drawing" style="width: 600px;"/>


Challenges:
- Not that easy to configure
- Resource hungry
- Difficult troubleshooting 
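
"Tailing" a log file boils down to remembering how far you have read and picking up whatever was appended since. The following is a simplified sketch of that idea, not how Logstash implements it; the function name is made up, and real shippers also handle rotation and persist the offset across restarts.

```python
import os
import tempfile


def read_new_lines(path, offset):
    """Return (complete lines appended since byte `offset`, new offset)."""
    with open(path, "rb") as fh:
        fh.seek(offset)
        chunk = fh.read()
    # Only consume complete lines; a partial last line is left for the next poll.
    end = chunk.rfind(b"\n") + 1
    lines = chunk[:end].decode("utf-8").splitlines()
    return lines, offset + end


# Demo with a temporary file standing in for a real logfile.
fd, demo_path = tempfile.mkstemp(suffix=".log")
os.close(fd)

with open(demo_path, "ab") as fh:
    fh.write(b"first event\n")
lines_1, pos = read_new_lines(demo_path, 0)

with open(demo_path, "ab") as fh:
    fh.write(b"second event\nthird ev")   # last line not finished yet
lines_2, pos = read_new_lines(demo_path, pos)

print(lines_1, lines_2)
os.remove(demo_path)
```

Each returned line would then be turned into one event and passed on to the parsing stage.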



## Logstash Configured to Read Syslog from Logfile


    input {
    	file {
    		path => "/Users/mircea/local/projects/zeeguu/CodeBase/Zeeguu-Mono-Web/zeeguu_mono_web.log"
    	}
    }    

    filter {
    	grok {
    		match => { "message" => "%{TIMESTAMP_ISO8601:timestamp} %{DATA:level} %{DATA:process} %{GREEDYDATA:log}" }
    	}
        date {
            # must match the format captured by grok above (ISO 8601), not the Apache access-log format
            match => [ "timestamp" , "ISO8601" ]
        }

    }    

    output {
    	elasticsearch {
    		hosts => "elasticsearch:9200"
    		user => "elastic"
    		password => "changeme"
    	    index => "zeeguu_web"
    	}
    }    
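
To see what the grok filter above does, it helps to think of it as a regular expression with named groups: `DATA` is a lazy `.*?`, `GREEDYDATA` a greedy `.*`. The Python sketch below is a rough emulation under that reading. The timestamp regex is a simplification of `TIMESTAMP_ISO8601`, and the sample line is made up in the format used by the logging example earlier.

```python
import json
import re

LINE_RE = re.compile(
    r"(?P<timestamp>\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}(?:[.,]\d+)?Z?) "
    r"(?P<level>.*?) "      # DATA: as little as possible up to the next space
    r"(?P<process>.*?) "
    r"(?P<log>.*)"          # GREEDYDATA: the rest of the line
)


def line_to_event(line):
    """Turn one log line into a dict, roughly like grok turns it into fields."""
    match = LINE_RE.match(line)
    if match is None:
        # Logstash tags such events with _grokparsefailure; here we keep the raw line.
        return {"message": line, "tags": ["parse_failure"]}
    return match.groupdict()


sample = "2021-03-21 15:55:05Z ERROR [web] Logging is broken in Jupyter"
print(json.dumps(line_to_event(sample), indent=2))
```

Lines that do not match are kept, not dropped, so that parsing problems stay visible downstream.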







## Kibana


Powerful visualization tool tailored for ElasticSearch

Demo on http://35.214.243.123:5601/

## Filebeat - Log Shipper


- Lightweight agent running on each machine; addresses the resource consumption of Logstash
- Sends logs to Logstash
- Has special plugin for *docker* -- see your exercises for an example


<img src="images/FELK.png" alt="Drawing" style="width: 400px;" />


- Can also send data straight to ElasticSearch (in your exercises example)
  - if you don't need to parse the `@message` field further


- More on the relationship between Filebeat and Logstash  https://logz.io/blog/filebeat-vs-logstash/

## Alternative Architectures 

- Filebeat w/o Logstash = less parsing and transformation power; 
- With Redis message broker = prevention of data loss

<img src="images/FRELK.png" alt="Drawing" style="width: 400px;" />

- With Replicated Redis & Logstash = increased stability

- With alternative parsers and shippers:

> My group ditched LogStash last year in part because it is slow, but also because .NET had a logging package that seamlessly integrated with ElasticSearch. So we basically just logged straight into ElasticSearch. (one of your older colleagues)
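
The idea behind the message-broker variant is that shippers and parsers no longer have to be available at the same time: events wait in a queue until someone consumes them. The sketch below illustrates only this decoupling, with an in-memory `deque` as a stand-in for a Redis list. The class name is made up, and unlike a persistent broker this stand-in loses everything if the process dies.

```python
from collections import deque


class InMemoryBroker:
    """Stand-in for a Redis list: shippers push on one end, parsers pop from the other."""

    def __init__(self):
        self._queue = deque()

    def push(self, event):
        self._queue.append(event)

    def pop(self):
        return self._queue.popleft() if self._queue else None

    def __len__(self):
        return len(self._queue)


broker = InMemoryBroker()
indexed = []

# The shipper keeps producing while the parser is down...
for i in range(5):
    broker.push(f"event {i}")
backlog = len(broker)

# ...and the parser drains the backlog, in order, once it is back.
while (event := broker.pop()) is not None:
    indexed.append(event)

print(backlog, indexed)
```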

# Ethical & Security Aspects


* With the rise in privacy awareness you might want to use your own logging infrastructure for analytics instead of relying on Google Analytics; this means many more logs, and more privacy concerns


* Ensure that you're not logging user secrets in plain text (or server secrets, for that matter!)


* Be aware of who has access to the data and of what data is logged
  - What if you log user details and users ask you to remove them under GDPR?
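
One way to reduce the risk of secrets ending up in logs is to mask them before they are written. Below is a minimal sketch using a `logging.Filter`. It assumes that secrets appear as `key=value` pairs with the illustrative key names listed in the regex, so it is a safety net and not a guarantee.

```python
import io
import logging
import re

# Assumption: secrets appear as key=value pairs with these (illustrative) key names.
SECRET_RE = re.compile(r"(?i)\b(password|token|api_key)=(\S+)")


class RedactSecrets(logging.Filter):
    """Mask secret values before the handler writes the record."""

    def filter(self, record):
        # Render the message first so that %-style arguments are covered too.
        record.msg = SECRET_RE.sub(r"\1=***", record.getMessage())
        record.args = None
        return True  # keep the record, only its text changes


redacted_out = io.StringIO()  # in-memory stream so the result can be inspected
redacting_handler = logging.StreamHandler(redacted_out)
redacting_handler.addFilter(RedactSecrets())

auth_log = logging.getLogger("auth_demo")
auth_log.setLevel(logging.INFO)
auth_log.propagate = False
auth_log.addHandler(redacting_handler)

auth_log.info("login attempt user=%s password=%s", "alice", "hunter2")
print(redacted_out.getvalue(), end="")
```

The more robust habit is still not to pass secrets to the logger in the first place.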




## Further Discussion Points




- Docker - with the default `json-file` logging driver, container logs are stored under `/var/lib/docker/containers/<container_id>`


- Logging vs. Crash Reporting (e.g. Sentry)
  - similarity: often written to the same logfile (in the case of web apps)
  - difference: some log entries are not crashes but simply informational
  
  
- With sufficiently *high-resolution* logging you can have a practical backup of the state of the database... 

  - *Binary logging* in the MySQL context: stream of events that modify the DB; you can ship them across machines -> related to event-based 
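
The claim that a detailed enough log doubles as a backup can be illustrated with a toy example: if every state-changing operation is logged in order, replaying the log rebuilds the state. The table, keys, and operations below are made up; this is not MySQL's binary log format.

```python
# Each entry records one state-changing operation, in the order it happened.
change_log = [
    ("insert", "alice", {"followers": 0}),
    ("insert", "bob", {"followers": 0}),
    ("update", "alice", {"followers": 2}),
    ("delete", "bob", None),
]


def replay(events):
    """Rebuild the current state by applying every logged change in order."""
    state = {}
    for operation, key, value in events:
        if operation == "delete":
            state.pop(key, None)
        else:  # "insert" and "update" both set the row to the logged value
            state[key] = dict(value)
    return state


print(replay(change_log))       # state now
print(replay(change_log[:2]))   # state as it was after the first two events
```

Replaying only a prefix of the log gives the state at an earlier point in time, and shipping the log to another machine lets that machine build a replica.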



# Service Level Agreements (SLAs)

Commitment between a service provider and a client. 

Particular aspects of the service are agreed upon

* quality
* availability
* responsibilities 




# SLA Metrics

Common metrics for web apps / services

- Uptime/availability (usually percentage of all time)
- Mean response time (average time before answer)
- Mean time to recover (time to recover after outage)
- Failure frequency (number of failures/timeouts over time)
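
The metrics above are simple to compute once outages and response times are recorded. A sketch with made-up monitoring data for a 30-day month (all numbers and names are illustrative):

```python
from statistics import mean

# Made-up monitoring data for a 30-day month.
PERIOD_MINUTES = 30 * 24 * 60
outage_minutes = [12, 30, 3]                  # duration of each outage
response_times_ms = [120, 95, 310, 150, 105]  # sampled requests

downtime = sum(outage_minutes)
availability_pct = 100 * (PERIOD_MINUTES - downtime) / PERIOD_MINUTES
mean_response_ms = mean(response_times_ms)
mean_time_to_recover = downtime / len(outage_minutes)
failures_per_week = len(outage_minutes) / (PERIOD_MINUTES / (7 * 24 * 60))


def allowed_downtime_minutes(target_pct, period_minutes=PERIOD_MINUTES):
    """Downtime budget implied by an availability target such as 99.9."""
    return period_minutes * (100 - target_pct) / 100


print(f"availability: {availability_pct:.2f}%")
print(f"mean response time: {mean_response_ms} ms")
print(f"mean time to recover: {mean_time_to_recover} min")
print(f"failures per week: {failures_per_week:.1f}")
print(f"budget at 99.9%: {allowed_downtime_minutes(99.9):.1f} min")
```

With 45 minutes of downtime this service ends up just below a 99.9% target, whose budget for a 30-day month is about 43 minutes.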






# SLA @ Google vs. Microsoft

https://cloud.google.com/translate/sla

https://azure.microsoft.com/en-us/support/legal/sla/machine-learning-service/v1_0/

What do you see? 

Which one is more fair?


## Resources


Installing ELK with Docker: https://logz.io/blog/elk-stack-on-docker/

Good patterns for Grok: https://qbox.io/blog/logstash-grok-filter-tutorial-patterns

This one uses Filebeat: https://www.elastic.co/guide/en/logstash/current/advanced-pipeline.html

Grok Debugger: https://grokdebug.herokuapp.com/