LISA19
October 28-30, 2019
These notes are mostly a mess. I took notes during every talk I attended at LISA19, and then condensed the information as best I could into one tweet per talk. Find the tweets on my timeline from the end of October 2019.
Alice Goldfuss
What is a container?
- Containers are processes, born from tarballs, anchored to namespaces and controlled by cgroups.
- Namespaces determine what a process sees
- Container strengths:
- running a stateless application
- portable (runs locally, in dev, prod, etc.)
- easy to upgrade and iterate
- good for testing different envs
- Container weaknesses:
- Stateful apps
- databases. Only Google (or any FAANG company) needs to run these in containers.
- Containers need friends
- it's never just containers
- questions:
- how will you install the code from the tarball?
- how will you schedule container resources (orchestration)?
- how will you manage clusters?
- how will you handle networking? routing, access control and service discovery
- how will you deploy?
- It will take you at least a year to get your container platform up and running.
- you will have a hybrid system. it's cool.
- Containers need headcount
- Ideally 6-8 people; no less than 4
- You must have a new team to build your container platform.
- You can't just give it to your existing ops team.
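The claim that containers are ordinary processes anchored to namespaces and controlled by cgroups can be checked from inside any Linux process. The sketch below is not from the talk; it assumes a Linux host with `/proc` mounted, and the helper names (`list_namespaces`, `shares_namespace`, `read_cgroups`) are made up for illustration. On other systems the functions simply return empty results.

```python
import os
from pathlib import Path


def list_namespaces(ns_dir="/proc/self/ns"):
    """Map namespace name -> kernel identifier for one process.

    Returns an empty dict when the directory is missing (non-Linux hosts).
    """
    base = Path(ns_dir)
    if not base.is_dir():
        return {}
    namespaces = {}
    for entry in sorted(base.iterdir()):
        try:
            # Each entry is a symlink whose target looks like "pid:[4026531836]".
            namespaces[entry.name] = os.readlink(entry)
        except OSError:
            continue  # not a symlink or not readable; skip it
    return namespaces


def shares_namespace(ns_a, ns_b, kind):
    """True when both processes report the same identifier for `kind`."""
    return kind in ns_a and ns_a.get(kind) == ns_b.get(kind)


def read_cgroups(cgroup_file="/proc/self/cgroup"):
    """Return (hierarchy_id, controllers, path) tuples, or [] if unavailable."""
    try:
        lines = Path(cgroup_file).read_text().splitlines()
    except OSError:
        return []
    return [tuple(line.split(":", 2)) for line in lines if line.count(":") >= 2]


if __name__ == "__main__":
    for name, ident in list_namespaces().items():
        print(f"{name:10s} {ident}")
    for hierarchy, controllers, path in read_cgroups():
        print(f"cgroup {hierarchy} [{controllers}] {path}")
```

Running this on the host and again inside a container should show different identifiers for namespaces such as `pid`, `net` and `mnt`, which is what "namespaces determine what a process sees" means in practice.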
Tweet: containers are cool if you have lots of stateless apps and your company will pay for a new team to build your platform.
Alice used a lot of vintage photos, and she included several pictures of people of color. Representation matters. Thank you.
Rich Smith, Duo Security
- In this talk security means both security and privacy
- The security industry generates FUD in order to sell hope and you (software companies) let them get away with it.
- We need to stop blaming end-users for security breaches
- devops practices like blameless post-mortems will help.
- Where people and technology overlap is information security
- Until now, the security industry has only focused on technology
- Security has cultivated its image as the hooded hacker. This is silly, and 'we need to grow up'
Truisms:
- Security isn't absolute. It will always be incomplete in some way. Find secure-enough
- Security isn't binary, or black and white. 'Secure against who?' Context is everything.
- Security is a vector to be travelled
- Security isn't static
- Security doesn't end in zero-risk. There is always risk; know what it is.
Rules for your security team:
- Enabling: a security team's success should be measured by what they enable, not what they block
- Transparent: a security team who embraces openness about what it does and why spreads understanding
- Blameless: Security failures will happen; only without blame will you be able to understand the true causes, learn from them and improve.
Recap:
- The security industry is failing to serve those it is trying to protect
- security is people-centric
- if the solution isn't usable, it isn't a solution
- security is a shared responsibility and we need to make this marriage work
- hold your security team accountable
- take the lessons of devops and apply them to security
- enabling, transparent, blameless
Lookup: Cory Doctorow
Tweet: The security industry needs to drop the image of the lone, hooded hacker and embrace inclusive and agile practices.
Jerome Petazzoni
* "The easiest way to install Kubernetes is to get someone else to do it for you."
* K8s feels imperative at first (see tutorials), but it is a declarative system under the hood.
* kubelet
* dockerd
* kubectl get ep
=> get endpoints
* ens5
* kubenet is a network plugin option
Tweet: When first learning Kubernetes, it feels imperative because tutorials have you running so many commands with flags (`kubectl run mypod --image=nginx --restart=Never --port=80`). In reality, it is a declarative system because you declare the state you want, and the K8s control plane does whatever it can to make it so.
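The "declare the state you want" idea can be illustrated with a toy reconciliation step. This is a simplified sketch and not Kubernetes code: the `reconcile` function, the workload names and the action tuples are all invented here, and real controllers watch the API server continuously rather than diffing two dicts once.

```python
def reconcile(desired, observed):
    """Compare desired replica counts with observed ones and list the actions
    a control loop would take to converge. Both arguments map a workload name
    to a replica count.
    """
    actions = []
    for name, replicas in sorted(desired.items()):
        current = observed.get(name, 0)
        if current < replicas:
            actions.append(("start", name, replicas - current))
        elif current > replicas:
            actions.append(("stop", name, current - replicas))
    # Anything running that is no longer declared gets removed entirely.
    for name in sorted(set(observed) - set(desired)):
        actions.append(("stop", name, observed[name]))
    return actions


if __name__ == "__main__":
    desired = {"web": 3, "worker": 1}
    observed = {"web": 1, "cron": 2}
    for action in reconcile(desired, observed):
        print(action)
```

The user never says "start two more web pods"; they only change the declared count, and the loop works out the commands.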
Alex Hidalgo
- changes emit other changes
- The process-related problems in tech are not new. We just assume we're special.
- The Incident Command System is a useful approach
4 Problems
- Lack of insight
- Poor communication
- No established hierarchy
- Too much freelancing
- You need an incident commander
- In charge of the incident and holds all high-level state about it.
- Operations lead
- Command post
- usually a slack channel
- Planning lead
Lisa Seelye, Red Hat
- architecture refers to CPU architecture
- arm64 vs. amd64
- the usual assumption is amd64
- an image is a tarball with more tarballs and JSON files inside
Inside the image tarball:
- JSON config file (sha256:somehash)
- Layer tarballs (code for your OS and programs)
- JSON manifest file (manifest.json) ties all these things together
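To make the "tarball with more tarballs and JSON files inside" description concrete, here is a small sketch that builds a toy image archive in memory and reads it back. It loosely follows the layout that `docker save` produces (a `manifest.json` pointing at a config file and layer tarballs), but the file names, the tag `toy:latest` and the helper functions are simplifications chosen for illustration, not a complete or authoritative image format.

```python
import io
import json
import tarfile


def add_bytes(tar, name, data):
    """Add an in-memory file to an open tar archive."""
    info = tarfile.TarInfo(name=name)
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))


def build_toy_image():
    """Return the bytes of a tiny image-like tarball."""
    layer_buf = io.BytesIO()
    with tarfile.open(fileobj=layer_buf, mode="w") as layer:
        add_bytes(layer, "hello.txt", b"hi\n")  # the "filesystem" of the layer

    config = json.dumps({"architecture": "amd64", "os": "linux"}).encode()
    manifest = json.dumps([{
        "Config": "config.json",
        "RepoTags": ["toy:latest"],
        "Layers": ["layer0/layer.tar"],
    }]).encode()

    image_buf = io.BytesIO()
    with tarfile.open(fileobj=image_buf, mode="w") as image:
        add_bytes(image, "config.json", config)
        add_bytes(image, "layer0/layer.tar", layer_buf.getvalue())
        add_bytes(image, "manifest.json", manifest)
    return image_buf.getvalue()


def describe_image(image_bytes):
    """Follow manifest.json to the config and the layer tarballs."""
    with tarfile.open(fileobj=io.BytesIO(image_bytes), mode="r") as image:
        manifest = json.load(image.extractfile("manifest.json"))[0]
        config = json.load(image.extractfile(manifest["Config"]))
        layers = {}
        for layer_name in manifest["Layers"]:
            layer_bytes = image.extractfile(layer_name).read()
            with tarfile.open(fileobj=io.BytesIO(layer_bytes), mode="r") as layer:
                layers[layer_name] = layer.getnames()
    return {
        "architecture": config["architecture"],
        "tags": manifest["RepoTags"],
        "layers": layers,
    }


if __name__ == "__main__":
    print(describe_image(build_toy_image()))
```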
No manifest list
- widely used
- gives you the image with no questions asked
Manifest list (a Docker thing)
- less-common
- 'fat' manifest which allows for images that work on m
What happens when you pull an image?
- `docker pull <image>`
- `docker run --rm -i <image>`
- `exec format error` is the error you see when a CPU architecture isn't supported
Problem: Docker images you grab with `docker pull` are designed to work with the amd64 CPU architecture. This excludes non-traditional architectures.
Solution: Use the Docker image manifest v2, because there is an architecture field in the manifest list that allows for more than one type of CPU architecture
https://docs.docker.com/registry/spec/manifest-v2-2/
Multi-Architecture Container Images by Lisa Seelye AKA @thedoh
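A rough sketch of how the architecture field in a manifest list lets a client choose the right image. The structure of `MANIFEST_LIST` follows the manifest list format described at the URL above, but the digests are placeholders rather than real image digests, and `pick_manifest` and `normalize_arch` are hypothetical helpers, not Docker's implementation.

```python
import platform

# Placeholder data shaped like a manifest list; the digests are fake.
MANIFEST_LIST = {
    "schemaVersion": 2,
    "mediaType": "application/vnd.docker.distribution.manifest.list.v2+json",
    "manifests": [
        {
            "mediaType": "application/vnd.docker.distribution.manifest.v2+json",
            "digest": "sha256:placeholder-amd64",
            "platform": {"architecture": "amd64", "os": "linux"},
        },
        {
            "mediaType": "application/vnd.docker.distribution.manifest.v2+json",
            "digest": "sha256:placeholder-arm64",
            "platform": {"architecture": "arm64", "os": "linux"},
        },
    ],
}


def normalize_arch(machine):
    """Translate common `uname -m` style names to image architecture names."""
    aliases = {"x86_64": "amd64", "AMD64": "amd64", "aarch64": "arm64"}
    return aliases.get(machine, machine)


def pick_manifest(manifest_list, architecture, os_name="linux"):
    """Return the digest of the entry matching the requested platform."""
    for entry in manifest_list["manifests"]:
        plat = entry["platform"]
        if plat["architecture"] == architecture and plat["os"] == os_name:
            return entry["digest"]
    # No match is the situation that otherwise ends in "exec format error".
    raise LookupError(f"no image for {os_name}/{architecture}")


if __name__ == "__main__":
    arch = normalize_arch(platform.machine())
    try:
        print(arch, "->", pick_manifest(MANIFEST_LIST, arch))
    except LookupError as err:
        print(err)
```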
Kyira Wackett
Types of Communication
- Verbal
- Passive
- Aggressive
- Passive-aggressive
- Assertive
- Non-verbal
- emoji, fonts
- non-verbal cues
- Tech/Passive
- Self-Talk
Things that influence your communication
- Early-childhood experiences
- Fear and Shame
- Imposter Syndrome
- Self-esteem
- Implicit Biases and Cultural Norms
- yours and others
- Context and environment
We work on assertiveness so that you can set the tone for how people interact with you.
How we talk to ourselves affects our ability to
- give and receive feedback
- advocate for our needs
- set boundaries
Thinking Assertively
- identify the role of your thoughts
- re-anchor the narrative
Talking Assertively
- set pre-emptive boundaries
- ask yourself, is it my responsibility?
- both "And" statements
- instead of "thank you for going to the grocery store, but you got horseradish sauce instead of a horseradish" try "thank you for going to the grocery store, and I realize that I wasn't clear enough and didn't tell you to get a root vegetable."
- Follow through on what you say (Dr. Mayne)
- Your boundaries have to be real
Assertive communication is conveying your thoughts, feelings and opinions clearly without shaming or belittling the person you are talking with. For many of us, this style of communication is very different from our default, and a good first step is to examine your self-talk.
https://www.kindakreative.com/blog/how-is-that-thought-serving-you
Thinking Assertively by Kyira Wackett
#LISA19
Adam McKenna, Pinterest
- Canary analysis definition
- Benefits of canary analysis
- Examples
need: ci/cd, service metrics stored in a timeseries DB, organizational buy-in
- invisible means the service owner has confidence in a process that will happen without significant time from them
- infrastructure: physical hardware, operating system, language
- canary analysis: comparing a control cluster and a canary cluster. the canary cluster serves production traffic.
- problem: infra has an expiration date. Think Python. OSes, Container runtimes.
- upgrading is not optional: your business needs to meet compliance requirements, pick up security bug fixes, and get access to new features
- developers don't want to have to support different OSes or language versions
- upgrading is hard: complexity (lots of microservices), don't like downtime, migration work is not as exciting for engineers
- "Canary analysis is a tool that helps us automate and normalize the most mundane migration tasks"
- Canary analysis still requires testing (unit, integration and service health checks). There is no off-the-shelf product, and it can make deploys slower.
Components:
- CI/CD pipeline
- workflow orchestration
- Time-series metrics database
- Execution env for custom code
- Canary Judge software
Best Practices:
- canary and control should be a similar size so you can compare
- both clusters should be serving a meaningful amount of production traffic
- a minimum of 50 data points for a Canary Analysis score
- You need good metrics (think 4 Golden Signals)
- Kayenta
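The comparison between a control cluster and a canary cluster can be sketched in a few lines. This is a deliberately naive stand-in and not Kayenta's algorithm: it compares means with a hypothetical 10% tolerance, assumes that higher values are worse (latency, error rate), and borrows only the 50-data-point minimum from the notes above.

```python
from statistics import mean

MIN_POINTS = 50  # minimum data points per cluster, from the best practices above


def judge_metric(control, canary, tolerance=0.10):
    """Classify one metric as "pass", "fail" or "insufficient-data".

    Assumes higher values are worse. The tolerance is an illustrative default.
    """
    if len(control) < MIN_POINTS or len(canary) < MIN_POINTS:
        return "insufficient-data"
    control_mean = mean(control)
    canary_mean = mean(canary)
    if control_mean == 0:
        return "pass" if canary_mean == 0 else "fail"
    relative_change = (canary_mean - control_mean) / control_mean
    return "pass" if relative_change <= tolerance else "fail"


def canary_score(results):
    """Percentage of decided metrics that passed, or None if none were decided."""
    decided = [r for r in results.values() if r != "insufficient-data"]
    if not decided:
        return None
    return 100.0 * sum(r == "pass" for r in decided) / len(decided)


if __name__ == "__main__":
    control_latency = [100.0] * 60
    canary_latency = [105.0] * 60
    results = {"latency_ms": judge_metric(control_latency, canary_latency)}
    print(results, canary_score(results))
```

A real judge would use a proper statistical test and per-metric direction settings; the point here is only the shape of the decision.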
Lessons learned:
- UX matters
- Have good and bad versions of your app to test with
Tweet:
Q: Infrastructure upgrades can be risky and time-consuming. How can you upgrade prod, reduce the possibility of an outage and decrease toil?
A: Try Spinnaker to create a canary deploy of your new infra and Kayenta to do an automated canary analysis.
Enabling Invisible Infrastructure Upgrades with Automated Canary Analysis by @deathtocss
#LISA19
GitOps, an Elegant Tool for Hybrid Cloud Kubernetes Ryan Cook, Red Hat, Inc.
GitOps
- yaml objects stored in a git repo with version control
- templating available in Helm or Kustomize
- `kubectl apply -f <repo>` over and over
Best Practices in Gitops
- least privileges
- store code and k8s objects in different repos
- k8s objects should be in a private repo
  - secrets
  - routing info
- Document the process to create/recreate the GitOps artifacts
- Kustomize is the 'infra as code' tool
- The team decided to use ArgoCD
https://www.katacoda.com/mvazquezc/courses/introduction/gitops-introduction
need to figure out base and overlays
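One way to think about base and overlays: the base holds the objects shared by every cluster, and each overlay holds only the fields that differ. The sketch below shows that idea as a recursive dictionary merge. It is not how Kustomize is implemented (Kustomize has richer patch semantics, especially for lists), and the `apply_overlay` helper and the sample objects are invented for illustration.

```python
import copy


def apply_overlay(base, overlay):
    """Return a new dict: `base` with `overlay` merged on top.

    Nested dicts are merged key by key; any other value in the overlay
    replaces the base value. Neither input is modified.
    """
    merged = copy.deepcopy(base)
    for key, value in overlay.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = apply_overlay(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


if __name__ == "__main__":
    base = {
        "kind": "Deployment",
        "metadata": {"name": "web"},
        "spec": {"replicas": 1, "image": "web:1.0"},
    }
    production = {"spec": {"replicas": 3}}  # only what differs in prod
    print(apply_overlay(base, production))
```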
tweet: Gitops is a system in which pushing a commit to a code repository triggers an infrastructure update. Kustomize and ArgoCD are good tools for enabling gitops for multi-cloud K8s clusters.
GitOps, an Elegant Tool for Hybrid Cloud Kubernetes Ryan Cook
Chris Alfano and Julia Schaumburg
- We're conditioned to define success as hockey-stick growth
- We need devops not for global scale, but for supporting local creativity and growth (local innovation)
- Human interactions vs. scalable infrastructures
- Science Leadership Academy in Philly
- tool: SLATE
- "Software isn't transforming schools, folks that are transforming schools need software"
- The Cloud vs. The Ground
- Growth vs. Evolution
- Frictionless (dehumanized) vs. Human interaction is the whole point
- Kids need time with an adult, not an optimized cloud platform
- Cogs vs. Contributors
- Permissioned, symmetric infra vs. distributed, asymmetric infrastructure (git and linux have done this)
- Consistent deployment vs. every deployment is a free canvas
- Civic hacking is the engine of change
Things that are built in the civic-hacking movement aren't online for long
- Build a new tool kit:
- everyone needs the tools to contribute without learning the whole fucking stack
- running on your own machine is great, but people need community infrastructures
- Idling must be free
- "No one is coming, it's up to us" --Code for America
- "Go upstream" there are bigger patterns and problems
Tweet:
We need a new devops movement that supports local innovation, not hyper-growth on a global scale.
Human interactions > scalable infrastructures
Civic hacking is the engine of this change.
Ops on the Edge of Democracy by Chris Alfano and Julia Schaumburg
Courtney Eckhardt
"Words mean things"
Goals of the workshop:
- run a retrospective
- create a good emotional space for a retro
- how to run a great meeting
When you run a retro, you have three jobs:
- facilitation
- run a productive meeting
"perceptual learning"
Better words (less blame):
- how, what, what if, could we, what do you think about, what would you have wanted to know
"human error is not a root cause" --John Allspaw
- how did the human make the error?
- what enabled the error?
- how did that error take the system down?
- how long did it take for the human to notice the error?
Miller's Law: "in order to understand what another person is saying, you must assume it is true and try to imagine what it could be true of"
Conway's Law: orgs design systems that are copies of the comms structure of their org
How to run a good retro:
- select a notetaker
Courtney Eckhardt
- Emergency response
- saying "nobody's going to die" isn't a good idea. you don't know the effects of your system on your users.
- urgent vs. important
- urgent: people want to know soon. lunch order
- important: you need to refill a prescription
- emergencies are urgent and important
- someone has to notice the emergency
- thinking takes time. "I didn't have time to think"
- an incident response system needs to make it so you don't have to think
- example: when you call 911. the responder has a framework to decide how to respond.
- if you respond to an emergency and your mind goes blank, you haven't been set up for success
- Determine 'what's an emergency?'
- Your users can't
- How do I get help?
- Have a single point of contact you can engage. 911, or hotline, or chatops in slack
- This person or bot is the dispatcher (and often incident responder)
- You need more than one person for incident response
- Assemble: physical => ambulance, firetruck; digital => conference call, Slack
- Communicate: physical => radio, phone; digital => conference call, Slack
- How to assess the situation
- How to delegate
- Incident Commander is the Start of Authority for the emergency
- How to disperse
- Have shifts for long incidents. No more than 4 hours
- Have an incident response plan, then train your folks
When you run infrastructure for a company, there will be emergencies. Effective devops teams have strong incident response plans which allow the person on call to use muscle memory, not cognitive energy, to kick-off the response. A real world example is dialing 911.
How to Have an Operational Incident (A Crash Course) by @hashoctothorpe
#LISA19
Tabitha Sable
- People see vulnerability scanning as a chore
- Figure out how to make it fun
Agenda
What's in it for you
- you'll enjoy this: you can uncover the servers time forgot
- your infosec team is going to love you
Understand Findings
- common exploit types:
- remote code execution (EternalBlue)
- Authz and authn (Dirty CoW)
- Info disclosure (HeartBleed)
- Denial of service (SACK panic)
- usual scanner output:
- Name, host, port, CVSS (Common Vulnerability Scoring System), References, Explanation
Validate Findings
- Ask yourself "do these findings apply to us?"
- Is it exploitable?
Evaluate Risk
- What are you protecting?
- uptime
- customer data
- business data
- service disruption
- embarrassment
- misuse of infrastructure (bitcoin mining)
- Does this vuln help attackers?
- use attack trees to model the potential threats
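An attack tree puts the attacker's goal at the root and breaks it into sub-goals joined by OR (any one child suffices) or AND (all children are needed). The sketch below is a minimal, hypothetical model: the `Node` class, the `achievable` function and the example goals are invented here and were not shown in the talk.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    goal: str
    kind: str = "leaf"        # "leaf", "or", or "and"
    feasible: bool = False    # only meaningful for leaves
    children: List["Node"] = field(default_factory=list)


def achievable(node):
    """Can the attacker reach this goal given which leaves are feasible?"""
    if node.kind == "leaf":
        return node.feasible
    results = [achievable(child) for child in node.children]
    if node.kind == "and":
        return all(results)
    if node.kind == "or":
        return any(results)
    raise ValueError(f"unknown node kind: {node.kind}")


if __name__ == "__main__":
    tree = Node("read customer data", "or", children=[
        Node("exploit unpatched server", feasible=True),
        Node("take over admin account", "and", children=[
            Node("phish admin password", feasible=True),
            Node("bypass second factor", feasible=False),
        ]),
    ])
    print("goal reachable:", achievable(tree))
    tree.children[0].feasible = False  # model the effect of patching
    print("goal reachable after patch:", achievable(tree))
```

Flipping a leaf to "not feasible" and re-evaluating the root is a quick way to answer "does this vuln help attackers?" for a specific finding.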
Remediate Right
- use attack trees to model the potential threats
- Patch
- how did this get missed?
Show your security team you've got a system for patching
Vulnerability scanners aren't always aware of backported fixes. The exploit might not apply to you right now.
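The validate-then-prioritize flow can be sketched as a small triage function over scanner findings. The field names mirror the scanner output listed earlier, while `applies_to_us` and `exploitable` are hypothetical flags a person would fill in after validating each finding. The severity bands follow the CVSS v3 qualitative rating scale; the hosts and findings in the example are invented.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    name: str
    host: str
    port: int
    cvss: float
    applies_to_us: bool   # set during validation (e.g. fix already backported?)
    exploitable: bool     # set during validation


def severity(cvss):
    """Map a CVSS v3 base score to its qualitative rating."""
    if cvss >= 9.0:
        return "critical"
    if cvss >= 7.0:
        return "high"
    if cvss >= 4.0:
        return "medium"
    if cvss > 0.0:
        return "low"
    return "none"


def triage(findings):
    """Keep findings that survived validation, highest score first."""
    actionable = [f for f in findings if f.applies_to_us and f.exploitable]
    return sorted(actionable, key=lambda f: f.cvss, reverse=True)


if __name__ == "__main__":
    findings = [
        Finding("example-info-leak", "10.0.0.7", 443, 5.3, True, True),
        Finding("example-rce", "10.0.0.5", 445, 9.8, True, True),
        Finding("example-already-backported", "10.0.0.9", 22, 7.5, False, False),
    ]
    for f in triage(findings):
        print(f"{severity(f.cvss):8s} {f.host}:{f.port} {f.name}")
```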
- Penetration Testing a Beginner's Guide to Hacking (?)
Make vulnerability scanning more enjoyable by viewing it as a way to find 'servers time forgot.' When you get the results of a scan, determine if a patch is necessary, and whether or not it needs to happen ASAP.
Sysadmins' Introduction to Vulnerability Scanning by @tabbysable
#LISA19
Neelima Krishnan, Intel
- Side channels
  - a way that a malicious actor can collect information about what a system does
  - if you leak data accidentally, it is found in the side channel
  - broad group: power, electromagnetic radiation, sound, time
  - characteristics: deep systems knowledge, target implementation, system works as expected
- Timing attacks
  - the time an operation takes is measured to infer secret data
  - Diffie-Hellman is susceptible
- Speculative execution side channel methods
  - what's new: out of order execution (OoOE), speculative execution
  - characteristics: local methods, no privilege escalation, read access
- End users can protect themselves by enabling automatic software updates
- Datacenters can enable updates and perform risk analysis
- Developers should lean into long-standing security practices
- Microarchitectural Data Sampling (MDS)
  - 3 internal CPU structures: store buffer, fill buffer and load port
- MDS mitigation: microcode
  - used since the mid 90s to change a CPU's behavior
- MDS mitigation: OS
  - implement a function that calls VERW
  - Invoke on ring transitions
  - Provide a mechanism to enable/disable mitigations
  - Use sysfs
- Improve process isolation
  - Can we improve process isolation to limit resource sharing?
  - What's the finest granularity possible to allocate processes?
  - Node, physical core or logical processor level?
  - Can we define different trust domains?
- The solution is core scheduling (user to user)
  - under open development
  - not too different under low contention
  - may look different with high contention
- Intel's Rendezvous (user to kernel)
  - improve kernel isolation from user space
  - if a hyperthread goes to kernel space... ???
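The timing attacks item above can be illustrated with the classic string-comparison case. A comparison that returns at the first mismatching byte does less work the earlier the mismatch occurs, so an attacker who can measure the time learns how many leading bytes of a guess were right. The sketch counts comparisons instead of measuring wall-clock time so that the result is deterministic; `leaky_equals` and `comparisons_before_exit` are illustrative helpers, while `hmac.compare_digest` is the standard-library function intended for this situation.

```python
import hmac


def leaky_equals(secret: bytes, guess: bytes) -> bool:
    """Early-exit comparison: running time depends on the matching prefix."""
    if len(secret) != len(guess):
        return False
    for s, g in zip(secret, guess):
        if s != g:
            return False
    return True


def comparisons_before_exit(secret: bytes, guess: bytes) -> int:
    """How many byte comparisons the early-exit loop performs.

    This stands in for elapsed time: more comparisons means a longer run.
    """
    count = 0
    for s, g in zip(secret, guess):
        count += 1
        if s != g:
            break
    return count


def constant_time_equals(secret: bytes, guess: bytes) -> bool:
    """Comparison designed not to leak where the first mismatch is."""
    return hmac.compare_digest(secret, guess)


if __name__ == "__main__":
    secret = b"secret"
    for guess in (b"xxxxxx", b"sxxxxx", b"sexxxx", b"secxxx"):
        print(guess, comparisons_before_exit(secret, guess))
```

The work grows with each correct leading byte, which is the signal a timing attack exploits.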
Key takeaways:
- perform risk analysis
- identify whether the risk matches your workload deployment
- enable mitigation
- change workload deployment as needed
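As a starting point for the risk analysis, Linux reports per-vulnerability mitigation status through sysfs, under `/sys/devices/system/cpu/vulnerabilities/` (one file per issue, including `mds` on kernels that know about it). The sketch below reads those files; the helper names and the rule that only statuses beginning with "Vulnerable" need attention are assumptions made for illustration. On systems without that directory it returns an empty result.

```python
from pathlib import Path

SYSFS_DIR = "/sys/devices/system/cpu/vulnerabilities"


def read_cpu_vulnerabilities(base=SYSFS_DIR):
    """Map vulnerability name -> status string as reported by the kernel."""
    directory = Path(base)
    if not directory.is_dir():
        return {}
    statuses = {}
    for entry in sorted(directory.iterdir()):
        try:
            statuses[entry.name] = entry.read_text().strip()
        except OSError:
            continue  # unreadable entry; skip it
    return statuses


def needs_attention(statuses):
    """Names whose status starts with "Vulnerable" (an illustrative rule)."""
    return sorted(
        name for name, status in statuses.items()
        if status.startswith("Vulnerable")
    )


if __name__ == "__main__":
    statuses = read_cpu_vulnerabilities()
    for name, status in statuses.items():
        print(f"{name:20s} {status}")
    print("needs attention:", needs_attention(statuses))
```

A status such as "Mitigation: ...; SMT vulnerable" passes this simple rule but may still matter for shared-core workloads, which is where the workload deployment question comes in.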
There are traditional ways to
A side channel attack is when a malicious actor collects information about what a computer system does and exploits it by examining power usage, timing, electromagnetic leaks or sound instead of an attack through a software bug or spying on traffic sent to or from a server.
Speculative and Traditional Execution Side Channel and Software Protection Mechanisms by Neelima Krishnan
Some traditional side channel attacks include
CPUs cache data
J. Paul Reed
Act 1: Resilience in Incident Response
- Heuristic #1: what has changed since the system was in a good state?
- H #2: Go Wide
- H3: Convergent Searching: confirm and disqualify diagnoses by matching signals and symptoms
- Engineers use a specific and past diagnosis or a general and recent diagnosis
- This means engineers use a really painful incident memory
- An incident still in your L1 cache
How can we get better during an incident response? Experts can recognize typicality, make fine discriminations, use mental simulation, and use their knowledge base to know when to apply higher order rules
Being an expert means experiencing lots of failure. You can increase your rate of failure with personal experience, directed experience (being a student, then a teacher), manufactured experiences (chaos engineering/game days), vicarious experiences (hearing about other issues)
Engineering Resilience into Modern IT Operations: A Play in Three Acts by @jpaulreed
Act 2: Resilience in Incident Analysis
Anti-patterns in retros/post-mortems:
- Only reviewing failure
- Forgetting about bias
- De-prioritizing retros and learning processes
Act 3: Remediation and Prevention
In incident response, "Expertise takes time and space. Make that time and create that space."
For incident analysis, beware your biases. Success and failure: two sides, same coin
Remediation: Automation that truly participates in our joint cognitive systems
Denise Yu
- when data was the business's way to make $$$, hosting your own servers made sense
- Now, the technology itself is the valuable commodity
- business analysis, ML and natural language processing, faster queries
- we started scaling vertically: adding CPU
- this worked for a while (and now Moore's law doesn't apply)
- CLOUD COMPUTING gave us an easy way to scale horizontally. The workload gets spread across multiple machines
- Some reasons:
- scalability
- availability
- latency
Ways to deal with the unreliability in distributed systems
- monitor and observe
- test things out with chaos engineering
CAP Theorem:
- It's not a pick two of three
- C is for linearizability
- creates a situation where all nodes should see the change
- upper bound is the speed at which light travels through fiber optic cables
- Consistency is a spectrum, not a binary
- A is for availability
- also a spectrum
- messy because of network latency
- P is for partition tolerance
- network partitions occur when network connectivity between two nodes (or datacenters) gets interrupted
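The trade-off during a network partition can be shown with a toy two-replica store. This is a teaching sketch under heavy simplification, not a real replication protocol: the `TwoNodeStore` class, its `prefer_consistency` switch and the `Unavailable` exception are all invented here. When the replicas cannot reach each other, the store either refuses writes (staying consistent but unavailable) or accepts them locally (staying available while the replicas diverge).

```python
class Unavailable(Exception):
    """Raised when a write is refused to protect consistency."""


class TwoNodeStore:
    def __init__(self, prefer_consistency):
        self.values = {"a": None, "b": None}   # one value per replica
        self.partitioned = False
        self.prefer_consistency = prefer_consistency

    def write(self, replica, value):
        if not self.partitioned:
            # Healthy network: every replica sees the change.
            for name in self.values:
                self.values[name] = value
            return
        if self.prefer_consistency:
            raise Unavailable("cannot reach peer during partition")
        # Availability chosen: accept the write locally, replicas now differ.
        self.values[replica] = value

    def read(self, replica):
        return self.values[replica]


if __name__ == "__main__":
    store = TwoNodeStore(prefer_consistency=False)
    store.write("a", "v1")
    store.partitioned = True
    store.write("a", "v2")
    print(store.read("a"), store.read("b"))  # the two replicas disagree
```

Real systems sit somewhere between these two extremes, which matches the point above that consistency and availability are spectrums rather than binaries.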
Hardware will fail
Network cables fail
Software will behave in a weird way
"some part of every system is always at risk of failing"
Distributed systems are so complex that humans cannot construct an accurate mental model of them. This makes communicating about them nearly impossible. We can begin to build shared understandings through incident analysis, but more importantly, as we iterate on these systems, we must keep the design human-centered.
Why Are Distributed Systems So Hard? by @deniseyu21
We can't have an accurate mental model of big, distributed systems (Woods' Theorem). This makes it impossible to accurately communicate with our peers
Proxy is built during an incident analysis
design systems for humans, not machines
Thanks ## and ##
3 high-level takeaways:
- "Socio-technical system" is a term that describes the group of humans that build the thing with code and computer hardware, their collective habits and the system itself.
- Container images are just tarballs (.tgz or .tar.gz files)
- I'd like to see more tech conferences feature at least one talk from a local mental-health professional who can coach attendees on how to be better teammates.