LISA19

October 28-30, 2019

These notes are mostly a mess. I took notes during every talk I attended at LISA19, and then condensed the information as best I could into one tweet per talk. Find the tweets on my timeline from the end of October 2019.

The Container Operator's Manual

Alice Goldfuss

What is a container?

Containers are processes, born from tarballs, anchored to namespaces and controlled by cgroups.
Namespaces determine what a process sees
Container strengths:
- running a stateless application
- portable (runs locally, in dev, prod, etc.)
- easy to upgrade and iterate
- good for testing different envs
Container weaknesses:
- Stateful apps
  - databases. Only Google (or any FAANG company) needs to do this.
Containers need friends
- it's never just containers
- questions:
  - how will you install the code from the tarball?
  - how will you schedule container resources (orchestration)?
  - how will you manage clusters?
  - how will you handle networking? routing, access control and service discovery
  - how will you deploy?
- It will take you at least a year to get your container platform up and running.
  - you will have a hybrid system. it's cool.
Containers need headcount
- Ideally 6-8 people; no less than 4
- You must have a new team to build your container platform.
  - You can't just give it to your existing ops team.
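
The "controlled by cgroups" part of the definition at the top of this talk is visible from inside any Linux process: `/proc/<pid>/cgroup` lists the cgroups the process belongs to, one `hierarchy-ID:controller-list:cgroup-path` entry per line. Here is a minimal sketch that parses that format. The sample text and the `parse_cgroup_file` / `CgroupEntry` names are my own illustration, not something shown in the talk.

```python
from typing import NamedTuple


class CgroupEntry(NamedTuple):
    hierarchy_id: int
    controllers: list
    path: str


def parse_cgroup_file(text: str) -> list:
    """Parse the contents of /proc/<pid>/cgroup.

    Each non-empty line has the form hierarchy-ID:controller-list:cgroup-path.
    On cgroup v2 the controller list is empty (for example "0::/some/path").
    """
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue
        hierarchy_id, controllers, path = line.split(":", 2)
        entries.append(
            CgroupEntry(
                hierarchy_id=int(hierarchy_id),
                controllers=[c for c in controllers.split(",") if c],
                path=path,
            )
        )
    return entries


# Hypothetical sample content, not captured from a real host.
SAMPLE = "4:cpu,cpuacct:/docker/abc123\n0::/docker/abc123\n"

for entry in parse_cgroup_file(SAMPLE):
    print(entry.hierarchy_id, entry.controllers or "(v2 unified)", entry.path)
```

On a real Linux box you could feed it `open("/proc/self/cgroup").read()` instead of the sample string.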

Tweet: containers are cool if you have lots of stateless apps and your company will pay for a new team to build your platform.

Alice used a lot of vintage photos, and she included several pictures of people of color. Representation matters. Thank you.

In Search of Security Shangri-la

Rich Smith, Duo Security

In this talk security means both security and privacy
The security industry generates FUD in order to sell hope and you (software companies) let them get away with it.
We need to stop blaming end-users for security breaches
- devops practices like blameless post-mortems will help.
Where people and technology overlap is information security
- Until now, the security industry has only focused on technology
Security has cultivated its image as the hooded hacker. This is silly and 'we need to grow up'.
Truisms:
- Security isn't absolute. It will always be incomplete in some way. Find secure-enough
- Security isn't binary, or black and white. 'Secure against who?' Context is everything.
- Security is a vector to be travelled
- Security isn't static
- Security doesn't end in zero-risk. There is always risk; know what it is.
Rules for your security team:
- Enabling: a security team's success should be measured by what they enable, not what they block
- Transparent: a security team who embraces openness about what it does and why spreads understanding
- Blameless: Security failures will happen; only without blame will you be able to understand true causes, learn from them and improve.
Recap:
- The security industry is failing to serve those it is trying to protect
- security is people-centric
- if the solution isn't usable, it isn't a solution
- security is a shared responsibility and we need to make this marriage work
- hold your security team accountable
- take the lessons of devops and apply them to security
- enabling, transparent, blameless

Lookup: Cory Doctorow

Tweet: The security industry needs to drop the image of the lone, hooded hacker and embrace inclusive and agile practices.

Deep Dive Into Kubernetes Internals for Builders and Operators

Jerome Petazzoni

- "The easiest way to install Kubernetes is to get someone else to do it for you."
- K8s feels imperative at first (see tutorials), but it is a declarative system under the hood.
- kubelet
- dockerd
- kubectl get ep => get endpoints
- ens5
- kubenet is a network plugin option

Tweet: When first learning Kubernetes, it feels imperative because tutorials have you running so many commands with flags (`kubectl run mypod --image=nginx --restart=Never --port=80`). In reality, it is a declarative system because you declare the state you want, and the K8s control plane does whatever it can to make it so.
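
To make "declare the state you want and the control plane makes it so" concrete, here is a toy reconcile loop. It is only a sketch of the idea: the `reconcile` and `apply_actions` functions and the workload names are hypothetical, and nothing here talks to a real Kubernetes API.

```python
def reconcile(desired: dict, actual: dict) -> list:
    """Return the actions needed to move `actual` toward `desired`.

    Both arguments map a workload name to a replica count.
    """
    actions = []
    for name, want in sorted(desired.items()):
        have = actual.get(name, 0)
        if have < want:
            actions.append(("start", name, want - have))
        elif have > want:
            actions.append(("stop", name, have - want))
    # Anything running that was never declared gets stopped.
    for name, have in sorted(actual.items()):
        if name not in desired and have > 0:
            actions.append(("stop", name, have))
    return actions


def apply_actions(actual: dict, actions: list) -> dict:
    """Pretend to carry out the actions and return the new actual state."""
    result = dict(actual)
    for verb, name, count in actions:
        delta = count if verb == "start" else -count
        result[name] = result.get(name, 0) + delta
        if result[name] == 0:
            del result[name]
    return result


# Hypothetical workloads for illustration.
desired = {"web": 3, "worker": 1}
actual = {"web": 1, "cron": 2}

actions = reconcile(desired, actual)
converged = apply_actions(actual, actions)
print(actions)
print(converged == desired)
```

The point is that the user only ever edits `desired`; the loop works out the imperative steps, and running it again once the states match produces no actions.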

Earthquakes, Forest Fires and Your Next Production Incident

Alex Hidalgo

changes emit other changes
The process-related problems in tech are not new. We just assume we're special.
The Incident Command System is a useful approach
4 Problems
- Lack of insight
- Poor communication
- No established hierarchy
- Too much freelancing
You need an incident commander
- In charge of the incident and holds all high-level state about it.
Operations lead
Command post
- usually a slack channel
Planning lead

Multi-Architecture Container Images: Why Bother, and How To

Lisa Seelye, Red Hat

architecture refers to CPU architecture
- arm64 vs. amd64
- the usual assumption is amd64
an image is a tarball with more tarballs and JSON files inside

Inside the image tarball:

JSON config file (sha256:somehash)
Layer tarballs (code for your OS and programs)
JSON manifest file (manifest.json) ties all these things together
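
A quick way to see this layout is to open an image tarball with Python's `tarfile` module. The sketch below builds a tiny fake tarball in memory rather than using a real image, and the `manifest.json` keys (`Config`, `RepoTags`, `Layers`) follow the `docker save` layout as I understand it. All file names and the `example:latest` tag are made up.

```python
import io
import json
import tarfile


def build_fake_image_tar() -> bytes:
    """Build a tiny in-memory tarball shaped like the list above."""
    config_name = "sha256-config-placeholder.json"  # hypothetical file name
    layer_name = "layer0/layer.tar"  # hypothetical file name
    files = {
        config_name: json.dumps({"architecture": "amd64", "os": "linux"}).encode(),
        layer_name: b"",  # a real layer is itself a tarball of filesystem changes
        "manifest.json": json.dumps(
            [{"Config": config_name, "RepoTags": ["example:latest"], "Layers": [layer_name]}]
        ).encode(),
    }
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def describe_image(tar_bytes: bytes) -> dict:
    """Follow manifest.json to the config file and the layer list."""
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r") as tar:
        manifest = json.load(tar.extractfile("manifest.json"))[0]
        config = json.load(tar.extractfile(manifest["Config"]))
    return {
        "config_file": manifest["Config"],
        "layers": manifest["Layers"],
        "architecture": config["architecture"],
        "os": config["os"],
    }


print(describe_image(build_fake_image_tar()))
```

Note that the CPU architecture lives in the config JSON, which is why a single-manifest image can only ever describe one architecture.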

No manifest list

widely used
gives you the image with no questions asked

Manifest list (a Docker thing)

less-common
'fat' manifest which allows for images that work on multiple CPU architectures

What happens when you pull an image?

docker pull <image>
docker run --rm -i <image>
exec format error is the error you see when a cpu architecture isn't supported

Problem: Docker images you grab with docker pull are designed to work with amd64 CPU architecture. This excludes non-traditional architectures.

Solution: Use the Docker image manifest v2 because there is an architecture field in the manifest list that allows for more than one type of CPU architecture

https://docs.docker.com/registry/spec/manifest-v2-2/
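
Roughly what a client does with a manifest list: look through the `manifests` array for an entry whose `platform` matches the local machine, then pull the manifest with that digest. A minimal sketch follows; the field names come from the spec linked above, while the digests are placeholders, the entries are trimmed (real ones also carry a `size`), and `select_manifest` is a hypothetical helper.

```python
# Trimmed manifest list; digests are placeholders, not real hashes.
MANIFEST_LIST = {
    "schemaVersion": 2,
    "mediaType": "application/vnd.docker.distribution.manifest.list.v2+json",
    "manifests": [
        {
            "mediaType": "application/vnd.docker.distribution.manifest.v2+json",
            "digest": "sha256:placeholder-amd64",
            "platform": {"architecture": "amd64", "os": "linux"},
        },
        {
            "mediaType": "application/vnd.docker.distribution.manifest.v2+json",
            "digest": "sha256:placeholder-arm64",
            "platform": {"architecture": "arm64", "os": "linux"},
        },
    ],
}


def select_manifest(manifest_list: dict, architecture: str, os_name: str = "linux") -> str:
    """Return the digest of the image manifest matching the given platform."""
    for entry in manifest_list["manifests"]:
        platform = entry["platform"]
        if platform["architecture"] == architecture and platform["os"] == os_name:
            return entry["digest"]
    raise LookupError(f"no image for {os_name}/{architecture}")


print(select_manifest(MANIFEST_LIST, "arm64"))
```

When no entry matches, the pull fails up front, which is a much clearer failure than an `exec format error` at run time.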

Multi-Architecture Container Images by Lisa Seelye AKA @thedoh

Assertive Communication

Kyira Wackett

Types of Communication

Verbal
- Passive
- Aggressive
- Passive-aggressive
- Assertive
Non-verbal
- emoji, fonts
- non-verbal cues
Tech/Passive
Self-Talk

Things that influence your communication

Early-childhood experiences
Fear and Shame
Imposter Syndrome
Self-esteem
Implicit Biases and Cultural Norms
- yours and others
Context and environment

We work on assertiveness so that you can set the tone for how people interact with you.

How we talk to ourselves affects our ability to

give and receive feedback
advocate for our needs
set boundaries

Thinking Assertively

identify the role of your thoughts
re-anchor the narrative

Talking Assertively

set pre-emptive boundaries
ask yourself, is it my responsibility?
"Both/and" statements
- instead of "thank you for going to the grocery store, but you got horseradish sauce instead of a horseradish" try "thank you for going to the grocery store, and I realize that I wasn't clear enough and didn't tell you to get a root vegetable."
Follow through on what you say (Dr. Mayne)
- Your boundaries have to be real

Assertive communication is conveying your thoughts, feelings and opinions clearly without shaming or belittling the person you are talking with. For many of us, this style of communication is very different from our default, and a good first step is to examine your self-talk.

https://www.kindakreative.com/blog/how-is-that-thought-serving-you

Thinking Assertively by Kyira Wackett

#LISA19

Enabling Invisible Infrastructure Upgrades with Automated Canary Analysis

Adam McKenna, Pinterest

Canary analysis definition
Benefits of canary analysis
Examples

need: CI/CD, service metrics stored in a time-series DB, and organizational buy-in

invisible means the service owner has confidence in a process that will happen without significant time from them
infrastructure: physical hardware, operating system, language
canary analysis: comparing a control cluster and a canary cluster. the canary cluster serves production traffic.
problem: infra has an expiration date. Think Python. OSes, Container runtimes.
upgrading is not optional: your business needs it to meet compliance requirements, get security bug fixes, and gain access to new features
developers don't want to have to support different OSes or language versions
upgrading is hard: complexity (lots of microservices), don't like downtime, migration work is not as exciting for engineers
"Canary analysis is a tool that helps us automate and normalize the most mundane migration tasks"
Canary analysis still requires testing (unit, integration and service health checks). There is no off-the-shelf product, and it can make deploys slower.

Components:

CI/CD pipeline
- workflow orchestration
Time-series metrics database
Execution env for custom code
Canary Judge software

Best Practices:

canary and control should be a similar size so you can compare
both clusters should be serving a meaningful amount of production traffic
a minimum of 50 data points for a Canary Analysis score
You need good metrics (think 4 Golden Signals)
Kayenta
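
To get a feel for what a "canary judge" does, here is a deliberately simple sketch: compare the mean of one metric between control and canary, and refuse to score at all with fewer than 50 data points. This is not Kayenta's algorithm (it uses proper statistical tests); the 10% threshold, the `judge_canary` name and the latency numbers are all assumptions for illustration.

```python
from statistics import mean

MIN_DATA_POINTS = 50  # from the best practices above


def judge_canary(control: list, canary: list, max_relative_increase: float = 0.10) -> str:
    """Compare one metric (lower is better, e.g. latency or error rate).

    Returns "pass", "fail" or "insufficient data".
    """
    if len(control) < MIN_DATA_POINTS or len(canary) < MIN_DATA_POINTS:
        return "insufficient data"
    control_mean = mean(control)
    canary_mean = mean(canary)
    if control_mean == 0:
        return "pass" if canary_mean == 0 else "fail"
    relative_increase = (canary_mean - control_mean) / control_mean
    return "pass" if relative_increase <= max_relative_increase else "fail"


# Hypothetical latency samples in milliseconds.
control_latency = [100.0] * 60
good_canary = [105.0] * 60
bad_canary = [130.0] * 60

print(judge_canary(control_latency, good_canary))
print(judge_canary(control_latency, bad_canary))
print(judge_canary(control_latency, good_canary[:10]))
```

This is also why the "good and bad versions of your app" lesson matters: you need a known-bad build to confirm the judge actually fails it.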

Lessons learned:

UX matters
Have good and bad versions of your app to test with

Tweet:

Q: Infrastructure upgrades can be risky and time-consuming. How can you upgrade prod, reduce the possibility of an outage and decrease toil?

A: Try Spinnaker to create a canary deploy of your new infra and Kayenta to do an automated canary analysis.

Enabling Invisible Infrastructure Upgrades with Automated Canary Analysis by @deathtocss

#LISA19

GitOps, an Elegant Tool for Hybrid Cloud Kubernetes

Ryan Cook, Red Hat, Inc.

GitOps

yaml objects stored in a git repo with version control
templating available in Helm or Kustomize
kubectl apply -f <repo> over and over

Best Practices in Gitops

least privileges
store code and k8s objects in different repos
k8s objects should be in a private repo
- secrets
- routing info
Document the process to create/recreate the GitOps artifacts
Kustomize is the 'infra as code' tool
The team decided to use ArgoCD

https://www.katacoda.com/mvazquezc/courses/introduction/gitops-introduction

need to figure out base and overlays
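
My rough mental model of base and overlays so far: the base holds the objects every cluster shares, and each overlay is a small patch layered on top per environment. The sketch below imitates that with a recursive dict merge. It is a simplification and not how Kustomize really patches (strategic merge and JSON patches have more rules, especially for lists); `apply_overlay` and the example objects are hypothetical.

```python
import copy


def apply_overlay(base: dict, overlay: dict) -> dict:
    """Return a new dict with `overlay` merged on top of `base`.

    Nested dicts are merged key by key; any other value in the overlay
    replaces the value from the base. The inputs are not modified.
    """
    merged = copy.deepcopy(base)
    for key, value in overlay.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = apply_overlay(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical base object shared by every cluster.
BASE = {"kind": "Deployment", "metadata": {"name": "web"}, "spec": {"replicas": 1}}

# Hypothetical production overlay: more replicas and an extra label.
PROD_OVERLAY = {"spec": {"replicas": 5}, "metadata": {"labels": {"env": "prod"}}}

prod = apply_overlay(BASE, PROD_OVERLAY)
print(prod)
```

Keeping the overlays this small is what makes a hybrid or multi-cloud setup manageable: each cluster's differences are reviewable in a few lines.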

tweet: Gitops is a system in which pushing a commit to a code repository triggers an infrastructure update. Kustomize and ArgoCD are good tools for enabling gitops for multi-cloud K8s clusters.

GitOps, an Elegant Tool for Hybrid Cloud Kubernetes by Ryan Cook

Ops on the Edge of Democracy

Chris Alfano and Julia Schaumburg

We're conditioned to define success as hockey-stick growth
We need devops not for global scale, but for supporting local creativity and growth (local innovation)
Human interactions vs. scalable infrastructures
Science Leadership Academy in Philly
- tool: SLATE
"Software isn't transforming schools, folks that are transforming schools need software"
The Cloud vs. The Ground
Growth vs. Evolution
Frictionless (dehumanized) vs. Human interaction is the whole point
- Kids need time with an adult, not an optimized cloud platform
Cogs vs. Contributors
Permissioned, symmetric infra vs. distributed, asymmetric infrastructure (git and linux have done this)
Consistent deployment vs. every deployment is a free canvas
Civic hacking is the engine of change

Things that are built in the civic-hacking movement aren't online for long

Build a new tool kit:
- everyone needs the tools to contribute without learning the whole fucking stack
- running on your own machine is great, but people need community infrastructures
- Idling must be free
"No one is coming, it's up to us" --Code for America
"Go upstream" there are bigger patterns and problems

Tweet:

We need a new devops movement that supports local innovation, not hyper-growth on a global scale.

Human interactions > scalable infrastructures

Civic hacking is the engine of this change.

Ops on the Edge of Democracy by Chris Alfano and Julia Schaumburg

Running Excellent Retrospectives: Talking with People

Courtney Eckhardt

"Words mean things"

Goals of the workshop:

run a retrospective
create a good emotional space for a retro
how to run a great meeting

When you run a retro, you have three jobs:

facilitation
run a productive meeting

"perceptual learning"

Better words (less blame):

how, what, what if, could we, what do you think about, what would you have wanted to know

"human error is not a root cause" --John Allspaw

how did the human make the error?
what enabled the error?
how did that error take the system down?
how long did it take for the human to notice the error?

Miller's Law: "in order to understand what another person is saying, you must assume it is true and try to imagine what it could be true of"

Conway's Law: orgs design systems that are copies of the comms structure of their org

How to run a good retro:

select a notetaker

How to Have an Operational Incident (A Crash Course)

Courtney Eckhardt

Emergency response
saying "nobody's going to die" isn't a good idea. you don't know the effects of your system on your users.
urgent vs. important
- urgent: people want to know soon. lunch order
- important: you need to refill a prescription
emergencies are urgent and important
someone has to notice the emergency
thinking takes time. "I didn't have time to think"
- an incident response system needs to make it so you don't have to think
example: when you call 911. the responder has a framework to decide how to respond.
if you respond to an emergency and your mind goes blank, you haven't been set up for success

Determine 'what's an emergency?'
- Your users can't
How do I get help?
- Have a single point of contact you can engage. 911, or hotline, or chatops in slack
- This person or bot is the dispatcher (and often incident responder)
You need more than one person for incident response
Assemble: physical => ambulance, firetruck; digital => conference call, Slack
Communicate: physical => radio, phone; digital => conference call, Slack
How to assess the situation
How to delegate
Incident Commander is the Start of Authority for the emergency
How to disperse
Have shifts for long incidents. No more than 4 hours
Have an incident response plan, then train your folks

When you run infrastructure for a company, there will be emergencies. Effective devops teams have strong incident response plans which allow the person on call to use muscle memory, not cognitive energy, to kick off the response. A real world example is dialing 911.

How to Have an Operational Incident (A Crash Course) by @hashoctothorpe

#LISA19

Sysadmins' Introduction to Vulnerability Scanning

Tabitha Sable

People see vulnerability scanning as a chore
Figure out how to make it fun

Agenda

What's in it for you

you'll enjoy this: you can uncover the servers time forgot
your infosec team is going to love you

Understand Findings

common exploit types:
- remote code execution (EternalBlue)
- Authz and authn (Dirty COW)
- Info disclosure (Heartbleed)
- Denial of service (SACK Panic)
usual scanner output:
- Name, host, port, CVSS (Common Vulnerability Scoring System), References, Explanation

Validate Findings

Ask yourself "do these findings apply to us?"
Is it exploitable?

Evaluate Risk

What are you protecting?
- uptime
- customer data
- business data
- service disruption
- embarrassment
- misuse of infrastructure (bitcoin mining)
Does this vuln help attackers?
- use attack trees to model the potential threats

Remediate Right

Patch
- how did this get missed?
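
A first pass at triage can be as simple as sorting the scanner output by CVSS score and bucketing it by severity. The sketch below uses the CVSS v3 qualitative ratings (0.0 none, 0.1-3.9 low, 4.0-6.9 medium, 7.0-8.9 high, 9.0-10.0 critical). The `Finding` class mirrors the scanner output fields listed earlier; the hosts, names and scores are invented for illustration and are not real CVE scores.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    name: str
    host: str
    port: int
    cvss: float


def severity(cvss: float) -> str:
    """Map a CVSS v3 base score to its qualitative severity rating."""
    if cvss == 0.0:
        return "none"
    if cvss < 4.0:
        return "low"
    if cvss < 7.0:
        return "medium"
    if cvss < 9.0:
        return "high"
    return "critical"


def triage(findings: list) -> list:
    """Return findings ordered from highest to lowest CVSS score."""
    return sorted(findings, key=lambda f: f.cvss, reverse=True)


# Hypothetical findings; hosts, names and scores are made up.
FINDINGS = [
    Finding("example-info-disclosure", "web01.example.internal", 443, 5.3),
    Finding("example-remote-code-execution", "files01.example.internal", 445, 9.8),
    Finding("example-denial-of-service", "db01.example.internal", 5432, 7.5),
]

for finding in triage(FINDINGS):
    print(severity(finding.cvss), finding.host, finding.port, finding.name)
```

The score alone doesn't settle it: a finding still needs the validation and risk questions above before it goes to the top of the patch queue.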

Show your security team you've got a system for patching

Vulnerability scanners aren't always aware of backported fixes. The exploit might not apply to you right now.

Penetration Testing a Beginner's Guide to Hacking (?)

Make vulnerability scanning more enjoyable by viewing it as a way to find 'servers time forgot.' When you get the results of a scan, determine if a patch is necessary, and whether or not it needs to happen ASAP.

Sysadmins' Introduction to Vulnerability Scanning by @tabbysable

#LISA19

Speculative and Traditional Execution Side Channel and Software Protection Mechanisms

Neelima Krishnan, Intel

Side channels
- a way that a malicious actor can collect information about what a system does
- if you leak data accidentally, it is found in the side channel
- broad group: power, electromagnetic radiation, sound, time
- characteristics: deep systems knowledge, target implementation, system works as expected
Timing attacks
- the attacker measures how long operations take in order to infer secret data
- Diffie-Hellman is susceptible
Speculative execution side channel methods
- what's new: out of order execution (OoOE), speculative execution
- characteristics: local methods, no privilege escalation, read access
End users can protect themselves by enabling automatic software updates
Datacenters can enable updates and perform risk analysis
Developers should lean into long-standing security practices
Microarchitectural Data Sampling (MDS)
- 3 internal CPU structures: store buffer, fill buffer and load port
MDS mitigation: microcode
- used since the mid 90s to change a CPU's behavior
MDS mitigation: OS
1. implement a function that calls VERW
2. Invoke on ring transitions
3. Provide a mechanism to enable/disable mitigations
4. Use sysfs
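
On Linux, the sysfs interface from step 4 lives under `/sys/devices/system/cpu/vulnerabilities/`, with one small text file per issue (for example `mds`). A minimal sketch for reading it, assuming that directory layout; `read_mitigation_status` is a hypothetical helper and returns an empty dict on systems that don't have the directory.

```python
from pathlib import Path

DEFAULT_DIR = Path("/sys/devices/system/cpu/vulnerabilities")


def read_mitigation_status(directory=DEFAULT_DIR) -> dict:
    """Map each vulnerability file name to its one-line status string.

    Returns an empty dict if the directory does not exist
    (non-Linux systems or older kernels).
    """
    directory = Path(directory)
    if not directory.is_dir():
        return {}
    return {
        entry.name: entry.read_text().strip()
        for entry in sorted(directory.iterdir())
        if entry.is_file()
    }


if __name__ == "__main__":
    for name, status in read_mitigation_status().items():
        print(f"{name}: {status}")
```

The status strings are human-readable one-liners, so this is handy for a quick fleet-wide report before doing the risk analysis.
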
Improve process isolation
- Can we improve process isolation to limit resource sharing?
- What's the finest granularity possible to allocate processes?
  - Node, physical core or logical processor level?
- Can we define different trust domains?
The solution is core scheduling user to user
- under open development
- not too different under low contention
- may look different with high contention
Intel's Rendezvous- user to kernel
- improve kernel isolation from user space
  1. if a hyperthread goes to kernel space... ???

Key takeaways:

perform risk analysis
identify if risk matches workload deployment
enable mitigation
change workload deployment as needed

There are traditional ways to

A side channel attack is when a malicious actor collects information about what a computer system does and exploits it by examining power usage, timing, electromagnetic leaks or sound, instead of attacking through a software bug or spying on traffic sent to or from a server.
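
The timing flavor is the easiest one to demonstrate in a few lines. A comparison that stops at the first mismatching byte does more work the more of the secret you guess correctly, and that extra work is what an attacker measures. The sketch below counts comparisons instead of measuring wall-clock time so the leak is deterministic; `leaky_equals` is a hypothetical function, and `hmac.compare_digest` from the standard library is the usual constant-time replacement.

```python
import hmac


def leaky_equals(secret: bytes, guess: bytes) -> tuple:
    """Naive comparison that stops at the first mismatch.

    Returns (is_equal, bytes_compared). The second value stands in for
    the running time an attacker would observe.
    """
    compared = 0
    if len(secret) != len(guess):
        return False, compared
    for a, b in zip(secret, guess):
        compared += 1
        if a != b:
            return False, compared
    return True, compared


def constant_time_equals(secret: bytes, guess: bytes) -> bool:
    """Comparison whose duration does not depend on where the bytes differ."""
    return hmac.compare_digest(secret, guess)


SECRET = b"hunter2"  # hypothetical secret

print(leaky_equals(SECRET, b"xunter2"))  # wrong first byte: stops immediately
print(leaky_equals(SECRET, b"huntex2"))  # wrong sixth byte: runs longer
print(constant_time_equals(SECRET, b"huntex2"))
```

This matches the characteristics listed in the notes: the system works exactly as expected, and the attacker still learns something from how it behaves.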

Speculative and Traditional Execution Side Channel and Software Protection Mechanisms by Neelima Krishnan

Some traditional side channel attacks include

CPUs cache data

When /bin/sh Attacks: Revisiting "Automate All the Things"

J. Paul Reed

Engineering Resilience into Modern IT Operations: A Play in Three Acts

Act 1: Resilience in Incident Response

Heuristic #1: what has changed since the system was in a good state?
H #2: Go Wide
H3: Convergent Searching: confirm and disqualify diagnoses by matching signals and symptoms
- Engineers use a specific and past diagnosis or a general and recent diagnosis
- This means engineers use a really painful incident memory
- An incident still in your L1 cache
How can we get better during an incident response?
Experts can recognize typicality, make fine discriminations, use mental simulation, and use their knowledge base to know when to apply higher-order rules

Being an expert means experiencing lots of failure. You can increase your rate of failure with personal experience, directed experience (being a student, then a teacher), manufactured experiences (chaos engineering/game days), and vicarious experiences (hearing about other issues).

Engineering Resilience into Modern IT Operations: A Play in Three Acts by @jpaulreed

Act 2: Resilience in Incident Analysis

Anti-patterns in retros/post-mortems:

Only reviewing failure
Forgetting about bias
De-prioritizing retros and learning processes

Act 3: Remediation and Prevention

In incident response, "Expertise takes time and space. Make that time and create that space."

For incident analysis, beware your biases. Success and failure: two sides, same coin.

Remediation: Automation that truly participates in our joint cognitive systems

Why Are Distributed Systems So Hard?

Denise Yu

when data was the way the business made $$$, hosting your servers made sense
Now, the technology itself is the valuable commodity
business analysis, ML and natural language processing, faster queries
we started scaling vertically: adding CPU
- this worked for a while (and now Moore's law no longer applies)
CLOUD COMPUTING gave us an easy way to scale horizontally. The workload gets spread across multiple machines
- Some reasons:
  - scalability
  - availability
  - latency
Ways to deal with the unreliability in distributed systems
monitor and observe
test things out with chaos engineering
CAP Theorem:
It's not a pick two of three
C is for linearizability
- creates a situation where all nodes should see the change
- upper bound is the speed at which signals travel through fiber optic cables
- Consistency is a spectrum, not a binary
A is for availability
- also a spectrum
- messy because of network latency
P is for partition tolerance
- network partitions occur when network connectivity between two nodes (or datacenters) gets interrupted
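
A toy model helped me see why C and A pull against each other once a partition happens. With two replicas and a broken link, a write either gets refused (giving up some availability) or gets accepted on one side only (giving up consistency, so the other replica serves stale data). The `TwoNodeStore` class below is a hypothetical sketch, not a real replication protocol.

```python
class TwoNodeStore:
    """Two replicas, a and b. Writes arrive at a and are copied to b."""

    def __init__(self, prefer_consistency: bool):
        self.value_a = None
        self.value_b = None
        self.partitioned = False
        self.prefer_consistency = prefer_consistency

    def write(self, value):
        if self.partitioned and self.prefer_consistency:
            # Refuse the write rather than let the replicas diverge.
            raise RuntimeError("unavailable: cannot reach replica b")
        self.value_a = value
        if not self.partitioned:
            self.value_b = value

    def read_from_b(self):
        return self.value_b


# Favor consistency: during the partition the write is rejected.
cp_store = TwoNodeStore(prefer_consistency=True)
cp_store.write(1)
cp_store.partitioned = True
try:
    cp_store.write(2)
    cp_write_accepted = True
except RuntimeError:
    cp_write_accepted = False

# Favor availability: the write succeeds, but replica b is now stale.
ap_store = TwoNodeStore(prefer_consistency=False)
ap_store.write(1)
ap_store.partitioned = True
ap_store.write(2)

print(cp_write_accepted, cp_store.read_from_b())
print(ap_store.value_a, ap_store.read_from_b())
```

Real systems sit somewhere between these two extremes, which fits the point that consistency and availability are both spectrums.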

Hardware will fail
network cables fail
Software will behave in a weird way
"some part of every system is always at risk of failing"

Distributed systems are so complex that humans cannot construct an accurate mental model of them. This makes communicating about them nearly impossible. We can begin to build shared understandings through incident analysis, but more importantly, as we iterate on these systems, we must keep the design human-centered.

Why Are Distributed Systems So Hard? by @deniseyu21

#LISA19

We can't have an accurate mental model of big, distributed systems (Woods' Theorem). This makes it impossible to accurately communicate with our peers.

Proxy is built during an incident analysis

design systems for humans, not machines

Thanks ## and ##

3 high-level takeaways:

"Socio-technical system" is a term that describes the group of humans that build the thing with code and computer hardware, their collective habits and the system itself.
Container images are just tarballs (.tgz or .tar.gz files)
I'd like to see more tech conferences feature at least one talk from a local mental-health professional who can coach attendees on how to be better teammates.
