# Day 2 Talks

April 30, 2022

## Setup

In [1]:
# %load_ext autoreload
# %autoreload 2

In [2]:
# Imports
...

In [3]:
# Setup
# %config InlineBackend.figure_format = "retina"

## Using Python for Disease Variant Analysis

presenter: Atia Binte Amin  
link: https://us.pycon.org/2022/schedule/presentation/113/

> Variant, a term once only known to the researchers of biological sciences, is now quite familiar to the general people.
> Rising of the new variants of SARS-Cov2 virus with novel mutations have become a topic of concern during this COVID-19 pandemic.
> How do the researchers identify these variants from the analysis of genomics data? How could Python be used in this analysis?
> This talk will address these questions. 
> 
> Mutations in any organism are usually identified after performing a Next Generation Sequence analysis experiment named variant calling.
> Variant calling generates the output in a specialized file format called Variant Call Format (VCF) file.
> VCF file carries the meta data and the information of thousands of mutations and is generally large in size.
> Thus, it is challenging to extract information and identify mutations from this file, especially when there are hundreds of samples.
> The Python package scikit-allel provides utilities for exploring this large-scale genetic variation data in VCF file and helps to identify important mutations from the downstream analysis.
> This package depends on scipy, matplotlib, seaborn, pandas, scikit-learn, h5py and zarr.
> After identifying the mutations, the next step is the visualization of the mutations in a meaningful way.
> This task might be simpler for a small size virus like SARS-Cov2, but complicated for eukaryotic organisms with multiple chromosomes like mouse or human.
> Another python package QMplot is handy and useful for the visualization of thousands of mutations in each chromosome, making the interpretation of the extracted mutations easier for the biologists.
> This package uses numpy, scipy, pandas and matplotlib.

- alignment of reads to reference genome
    - outputs a BAM ("binary alignment map")
- use variant callers to identify variants from the BAM
    - GATK, FreeBayes, DeepVariant
    - outputs a VCF
    - there are errors in the sequencing that need to be accounted for
    - `scikit-allel`: Python package to explore VCF files
- annotate variants
    - `Varcode`: Python package for variant annotation
        - can provide a custom variant database
        - creates objects for each variant with the relevant information about the gene and effect on gene product
- visualization of variants
    - `qmplot`: Python package for making plots
    - `CMplot`: more advanced visualizations (R package)
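
To make the VCF step concrete, here is a minimal standard-library sketch that parses a tiny, made-up VCF snippet and keeps the variants that passed filtering. It is not the `scikit-allel` API (which handles real files at scale); the `parse_vcf` helper and every record in `VCF_TEXT` are invented for illustration.

```python
import io

# A made-up, three-record VCF: "##" lines are metadata, the single "#" line
# names the columns, and every other line describes one variant.
VCF_TEXT = """\
##fileformat=VCFv4.2
##source=made-up-example
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO
chr1\t10177\t.\tA\tC\t50\tPASS\tDP=14
chr1\t10352\t.\tT\tA,G\t12\tLowQual\tDP=3
chr2\t20045\trs123\tG\tT\t99\tPASS\tDP=40
"""


def parse_vcf(handle):
    """Hypothetical helper: split a VCF stream into metadata lines and records."""
    meta, header, records = [], None, []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith("##"):
            meta.append(line[2:])
        elif line.startswith("#"):
            header = line[1:].split("\t")
        elif line:
            fields = dict(zip(header, line.split("\t")))
            fields["POS"] = int(fields["POS"])
            fields["ALT"] = fields["ALT"].split(",")  # several alternate alleles are allowed
            records.append(fields)
    return meta, records


meta, records = parse_vcf(io.StringIO(VCF_TEXT))
passing = [r for r in records if r["FILTER"] == "PASS"]
for r in passing:
    print(r["CHROM"], r["POS"], r["REF"], ">", ",".join(r["ALT"]))
```

A real VCF also carries per-sample genotype columns, which is where a dedicated package becomes worthwhile.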

## Hooking into the import system

presenter: Fred Phillips  
link: https://us.pycon.org/2022/schedule/presentation/76/

> Import hooks and the import system in general is an under-used and under-documented resource within Python.
> This talk will introduce the audience to the import system, how it works, and how it can be adapted for their needs.
> We will build a simple import hook that can inspect what is being imported, and go on to demonstrate how we can use the import system to load Python modules from a database and how to reload files on disk immediately as they are changed.

### Intro to Python import system
- two main steps:
    1. finding: discover the code to import
        - `BuiltinImporter`: imports built-in modules
        - `FrozenImporter`: imports frozen modules; used to bootstrap the import system before `import` is available
        - `PathFinder`: finds modules on the import path (`sys.path`)
    1. loading: execute the code and bring the module into the environment
- `PathFinder`:
    - walks the entries of `sys.path`, handing each entry to a path entry finder
    - searched in order
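
The default finders are registered on `sys.meta_path`, and Python asks them in list order. A quick way to see which ones are active in the current interpreter (extra entries may appear if installed tools add their own finders):

```python
import sys

# Entries can be classes (the built-in finders) or instances (third-party hooks),
# so fall back to the type name when the entry has no __name__ of its own.
finder_names = [getattr(f, "__name__", type(f).__name__) for f in sys.meta_path]
for name in finder_names:
    print(name)
```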

### Import hook system

- to add a custom finder, add to the list `sys.meta_path`:

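```python
import importlib.abc  # `import importlib` alone does not guarantee the `abc` submodule
import sys


class MyFinder(importlib.abc.MetaPathFinder):
    ...


# Insert at the front so the custom finder is consulted before the defaults.
sys.meta_path.insert(0, MyFinder())
```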
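
As a minimal sketch of a hook that only inspects imports, the finder below records each module name it is asked about and then declines, so the regular finders do the real work. The class name `LoggingFinder` is made up for this example, and `colorsys` is used only because it is a small standard-library module with no imports of its own.

```python
import importlib
import importlib.abc
import sys


class LoggingFinder(importlib.abc.MetaPathFinder):
    """Record every module name the import system asks about."""

    def __init__(self):
        self.seen = []

    def find_spec(self, fullname, path, target=None):
        self.seen.append(fullname)
        return None  # None means "not mine"; the next finder gets a turn


finder = LoggingFinder()
sys.meta_path.insert(0, finder)
try:
    # Finders are only consulted on a cache miss, so drop any cached copy first.
    sys.modules.pop("colorsys", None)
    importlib.import_module("colorsys")
finally:
    sys.meta_path.remove(finder)  # leave the interpreter as we found it

print(finder.seen)
```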

### Uses

1. "blocklist"
    - prevent user from importing certain modules
    - meant to help less-experienced users
1. "database loader"
    - for loading files from places other than the local disk
    - save modules in a db and import from there
    - useful for quickly adding changes to modules for use in production
        - as easy as saving to a db
    - can maintain multiple versions of a module and import specific versions
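
Both uses can be sketched with the standard library alone. In the sketch below, a plain dict stands in for the database, and the names `BlocklistFinder`, `DictLoader`, `DictFinder`, `FAKE_DB`, and the `greetings` module are all invented for illustration; the talk's own implementation may differ.

```python
import importlib
import importlib.abc
import importlib.util
import sys


class BlocklistFinder(importlib.abc.MetaPathFinder):
    """Refuse to import any module named in `blocked`."""

    def __init__(self, blocked):
        self.blocked = set(blocked)

    def find_spec(self, fullname, path, target=None):
        if fullname in self.blocked:
            raise ImportError(f"import of {fullname!r} is blocked")
        return None  # let the remaining finders handle everything else


class DictLoader(importlib.abc.Loader):
    """Execute module source held in a dict (a stand-in for a database)."""

    def __init__(self, sources):
        self.sources = sources

    def exec_module(self, module):
        source = self.sources[module.__name__]
        code = compile(source, f"<db:{module.__name__}>", "exec")
        exec(code, module.__dict__)


class DictFinder(importlib.abc.MetaPathFinder):
    """Claim only the modules whose source is stored in `sources`."""

    def __init__(self, sources):
        self.sources = sources

    def find_spec(self, fullname, path, target=None):
        if fullname not in self.sources:
            return None
        return importlib.util.spec_from_loader(fullname, DictLoader(self.sources))


FAKE_DB = {"greetings": "def hello(name):\n    return f'Hello, {name}!'\n"}
finders = [BlocklistFinder({"ftplib"}), DictFinder(FAKE_DB)]
sys.modules.pop("ftplib", None)  # make sure the import is not served from cache
sys.meta_path[:0] = finders
try:
    greetings = importlib.import_module("greetings")  # loaded from the dict
    message = greetings.hello("PyCon")
    try:
        importlib.import_module("ftplib")
        blocked = False
    except ImportError:
        blocked = True
finally:
    for f in finders:
        sys.meta_path.remove(f)
    sys.modules.pop("greetings", None)

print(message, blocked)
```

Swapping the dict for a real database query inside `exec_module` and `find_spec` is the only change the "database loader" idea needs; selecting a specific module version would be an extra lookup key.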

## Flexible ML Experiment Tracking System for Python Coders with DVC and Streamlit

presenter: Antoine Toubhans  
link: https://us.pycon.org/2022/schedule/presentation/87/

> There are so many tools to do data science today that it can be difficult to navigate.
> Many of them are AI platforms that "do everything by clicking on a UI" and do not leverage pre-existing tools e.g., GIT for versioning, or good old python IDE instead of Jupyter Notebooks.
> On the other hand, ML engineering is not classical software engineering:
> 
> - in addition to the code, the data should also be versioned;
> - in its essence, ML engineering is an exploratory work: one can not know if the model is going to work before testing it;
> - there is no clear way to guarantee the quality of the trained model: the data-scientist has to play with it to make it "talk".
> 
> In this talk, we will build a fully customizable and complete system in python to track Machine Learning experiments.
> For the purpose of this talk, we will train a neural network (Tensorflow) to classify images between cat and dog, though, the main focus is on the tooling and not the ML algorithm.
> We will use:
> 
> - **DVC** (Data Version Control) to 1) version the data alongside the code with GIT 2) build training pipelines to orchestrate the python scripts 3) version experiments.
> - **Streamlit** to build data exploration apps to play with the trained models.
> Both DVC and Streamlit are open-source libraries with python APIs.
> 
> In the second part of the talk, we will focus on various ways of combining DVC and Streamlit.
> For instance, we will see how to build a Streamlit app that allows selecting any trained model tracked with DVC (provided its GIT commit), loading it, and testing it on given input images.

- challenges of ML engineering vs. traditional software engineering
    - track data
    - exploration
    - track progress and compare models
    - reproducibility and the ability to return to previous versions
- DVC (data version control)
    - Python package; pip installed
    - track large files that are too big for git
        - replaces the large files with smaller metadata files
        - similar interface as git
    - has a pipeline feature
        - defined in a YAML format
        - indicate command to run, dependencies, and outputs
        - tracks output automatically
        - caching mechanism that tracks changes to dependencies
    - a good replacement for a bash script that runs a series of scripts for downloading and processing data
- R&D is a non-linear workflow
    - unlike traditional software engineering
    - DVC has features for tracking experiments in a non-linear way
        - can store and explore model performance
        - the experiments are listed under the parent commit
            - the parameters may change in each experiment, but the underlying code has not changed (so under the same parent commit)
    - `dvc` can be imported and used to read history and logged results
        - `dvc_repo_info = dvc.repo.Repo(".")`
- used Streamlit to build an interface for investigating the model and results
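
The pipeline caching idea (rerun a stage only when one of its dependencies changed) can be illustrated in a few lines. This is a conceptual sketch, not DVC's implementation or API: the `run_stage` function, the JSON lock-file layout, and the file names are all assumptions made for the example.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def file_hash(path: Path) -> str:
    """Content hash of a dependency; a changed file yields a different hash."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def run_stage(name, deps, command, lock_path: Path) -> bool:
    """Run `command` only if a dependency hash differs from the recorded one.

    Returns True when the stage actually ran.
    """
    lock = json.loads(lock_path.read_text()) if lock_path.exists() else {}
    current = {str(dep): file_hash(dep) for dep in deps}
    if lock.get(name) == current:
        return False  # nothing changed, so the cached outputs are still valid
    command()
    lock[name] = current
    lock_path.write_text(json.dumps(lock, indent=2))
    return True


with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp) / "raw.csv"
    lock_file = Path(tmp) / "stages.lock"
    runs = []

    raw.write_text("x\n1\n2\n")
    first = run_stage("process", [raw], lambda: runs.append("process"), lock_file)
    second = run_stage("process", [raw], lambda: runs.append("process"), lock_file)
    raw.write_text("x\n1\n2\n3\n")  # edit the dependency
    third = run_stage("process", [raw], lambda: runs.append("process"), lock_file)

print(first, second, third)  # True False True
```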

## (Professionally) Coding with Others

presenter: William Morrell  
link: https://us.pycon.org/2022/schedule/presentation/103/

> A mix of tools and practices to incorporate for facilitating collaboration between developers.
> As a nice side-effect, these also let past-you help future-you work on entirely solo projects.
> Topics include:
>
> - Documentation, specifically calling out a README and contributor guidelines, and site generators à la Sphinx or MkDocs
> - Version control / git, collecting changes in logical commits, writing good commit and pull request messages
> - Auto-lint and formatting: pre-commit, black, isort, flake8
> - Dependency management: pyenv, pipenv/poetry, Docker

### Documentation

- README file
- site generation
    - PyCon 2021 talk: "Static Site with Sphinx and Markdown"

### Version control

- git

```bash
# git undo button: remove the last commit but keep its changes in the working tree
git reset HEAD~

# select specific lines of a file to commit
git add --patch  # or -p

# fix merge conflicts by selecting which side's version of a file to keep
git checkout --ours [FILE]  # or --theirs

# apply specific commits on top of your current branch
git cherry-pick [HASH]

# replay your branch's commits on top of [REF] (e.g. `main`),
# choosing interactively which commits to keep, squash, or edit
git rebase -i [REF]
```

### Code quality tools

- readability, maintainability, catches bugs
- linters:
    - identify potential errors or anti-patterns
- formatters:
    - identify and *fix* "problems" in the code
    - `isort`
    - upgrade tools: `2to3` and `pyupgrade`
- where to run these tools:
    - standalone commands
    - IDE
    - with project tests
    - CI
    - `pre-commit`

### Pull requests

- bring in the human component to code quality

### Dependency management

- pinning deps.
- presenter recommends using `poetry`
    - wraps `pip` and provides some security features
- containers with Docker for managing all dependencies including non-Python ones
    - check out the blog posts at [pythonspeed.com/docker/](https://pythonspeed.com/docker/)

## Write Docs Devs Love: Ten Tips To Level Up Your Tech Writing

presenter: Mason Egger  
link: https://us.pycon.org/2022/schedule/presentation/173/

> Think of that feeling you get when you follow an online tutorial or documentation and the code works on the first run.
> Now think of all the hours spent wasted following broken, outdated, or incomplete documentation.
> From our favorite tutorials to bad product docs we all consume technical writing.
> Tutorials, blog posts, and product docs help developers learn new things, build projects, and debug issues.
> But what makes one tutorial better than another?
> In this talk I'll discuss how you can write the documentation that developers love and I'll share 10 tips and tricks to improve your technical writing.

1. make the end goal clear
    - the point of the tutorial/page
    - "Here we will use X to do Y."
    - make it very obvious what will be accomplished
1. be concise
    - need not use SAT words
    - remember readers may not be native English speakers
    - aim for a relatively low reading level (3rd grade)
1. use inclusive language
    - avoid demeaning terms like "noobs" or "10x dev"
    - avoid "easily," "simple," "obvious," etc.
1. limit technical jargon
    - depends on audience (internal team vs. general public)
1. define all acronyms
1. avoid memes/idioms and regional language
    - depends on audience
1. use meaningful code samples and variable names
    - show, don't tell
    - include everything that is required to run the code
        - e.g. `import` statements
    - include a full copy of the code
1. avoid making your readers leave your docs
    - links to other websites
    - if you need to send them away, give them a way to come back
    - perhaps list prerequisites at the top with links on how to do them
        - avoid having to leave part way through the body of the tutorial
1. make the content "scannable"
    - users tend not to read everything
    - make it easy to find a single piece of info
    - headings and subheadings or other visual breaks
    - add a Table of Contents
    - use consistent styling
1. verify your instructions
    - add tests and other verifications
    - incorrect doc is worse than no doc

## Writing Functional Code in Python

presenter: Vic Kumar  
link: https://us.pycon.org/2022/schedule/presentation/80/

> In this talk, we'll define exactly what functional programming is and how it helps us.
> We'll explore the main concepts from functional programming and see how we can apply them to our Python code going over some concrete examples.
> 
> **WHAT IS REFERENTIAL TRANSPARENCY (3 MIN)**
> We'll look at how we can prove if a function is pure or not, and what the implications are of writing our code more explicitly.
> We'll look at an existing snippet of code that uses mutation and how we can transform such a function into one that is referentially transparent.
> 
> **HANDLING EMPTINESS: THE OPTION MONAD (7 MIN)**
> We'll go over a concrete example of using branching logic and nested ifs to check for emptiness.
> Again, we'll transform this code by using the Option monad and make our code more readable and robust.
> 
> **HANDLING EXCEPTIONS: THE TRY MONAD (7 MIN)**
> We'll take a look at dealing with exceptions using the `try..except` syntax.
> We'll change this to using the Try monad and see how our code reads from left-to-right and how we can make the side-effects of our functions more explicit to the caller.
> 
> **HANDLING ASYNCHRONOUS CALLS: THE FUTURE MONAD (8 MIN)**
> We'll take an example which uses threads and processes and change it into an example that uses the Future monad.
> Along the way, we'll see how we can more explicitly handle asynchronous behaviors and their effects.
> 
> **CONCLUSIONS AND TAKEAWAYS (3 MIN)**
> We'll conclude by looking at the pros and cons of functional programming and where we can use it in our Python code.

- functional programming (FP): construct programs using only *pure* functions
    - functions that have no side effects
    - side effects examples: modifying a var, modifying a data structure in place, setting a field of an object
    - *referential transparency* requirement for all functions
        - follow standard mathematical principles

In [4]:
# Not referentially transparent.


def add_numbers(numbers: list[int]) -> int:
    total = 0  # problem: `total` is mutated in the loop, so it is not always 0
    for num in numbers:
        total += num
    return total

In [5]:
# Referentially transparent.

from functools import reduce


def add_numbers_2(numbers: list[int]) -> int:
    return reduce(lambda a, b: a + b, numbers, 0)  # initial value 0 handles an empty list

- one difficulty of functional programming is dealing with `None`s
    - solve using `pyeffects` library to introduce `Option` classes
    - instead of dealing with `if/else` control flow and branching logic, functional code reads left-to-right
- exceptions:
    - using a `try`/`except` block results in an impure function
        - the function signature wouldn't indicate that it could fail
    - return a `Try` class that can be a `Success` or `Failure`
        - now the function is pure
    - makes handling errors more explicit than the current Python system

## The Model Review: improving transparency, reproducibility, & knowledge sharing using MLflow

presenter: Jes Ford  
link: https://us.pycon.org/2022/schedule/presentation/68/

> Code Review is an integral part of software development, but many teams don’t have similar processes in place for the development and deployment of Machine Learning (ML) models.
> I will motivate the decision to create a Model Review process, starting from the principles of transparency, reproducibility, and knowledge sharing.
> MLflow is a useful Python package to help simplify and automate much of the tracking necessary to create detailed records of machine learning experiments.
> Much of this talk will be spent introducing this tool, and demonstrating the core MLflow Tracking functionality.
> I’ll discuss how my team is currently running a Model Review process for any ML models that we push to production, and how we use MLflow to streamline this work and learn from each other.

[MLflow](https://mlflow.org/)

- model review system
    - goals
        1. transparency and record keeping of what was deployed
        1. reproducibility
        1. format for knowledge sharing
    - to track:
        - model performance
        - what data was used
        - entire ML process (including failed experiments)
    - cannot track all of these with a standard code review
- MLflow: platform for managing the end-to-end machine learning lifecycle
    - parts:
        - tracking
        - projects
        - models
        - registry: manage model lifecycle
    - language and ML tool agnostic
    - can log anything you want to track
        - parameters, metrics, files (or "artifacts"), version of code, training data

Simple example:

```python
import mlflow

mlflow.log_param("param-name", 0)
mlflow.log_metric("metric-name", 1)
mlflow.end_run()
```

Or use a context manager:

```python
with mlflow.start_run(run_name="log-artifacts"):
    ...
```

- can see results in a GUI
    - bash command: `mlflow ui`
- can log data locally or to a remote server
    - more info in docs
- `mlflow` has a `sklearn`-specific module for tracking the model itself
    - pickles the model and saves it with conda and pip requirements files
    - also saves an MLflow model that can be used later
- MLflow UI has a notes section
- auto-logging model training:
    - automatic tracking of different types of models (check out list of supported modeling frameworks)
        - scikit-learn, TF, PyTorch
    - tracks parameters, metrics, and even some evaluation plots
- integrating MLflow for the team:
    1. common model-training process used by the entire team
    1. build MLflow parameters, metrics, and other artifacts into the infrastructure so they are used automatically
    1. use the notes section reliably
    1. review the model before deployment

---

In [6]:
%load_ext watermark
%watermark -d -u -v -iv -b -h -m

Last updated: 2022-04-30

Python implementation: CPython
Python version       : 3.10.4
IPython version      : 8.2.0

Compiler    : Clang 12.0.1 
OS          : Darwin
Release     : 21.4.0
Machine     : x86_64
Processor   : i386
CPU cores   : 4
Architecture: 64bit

Hostname: JHCookMac.local

Git branch: pycon2022

