# Day 1 Talks

April 29, 2022

## Setup

In [1]:
# %load_ext autoreload
# %autoreload 2

In [2]:
# Imports
...

In [3]:
# Setup
# %config InlineBackend.figure_format = "retina"

## Python Oddities Explained

presenter: Trey Hunner  
link: https://us.pycon.org/2022/schedule/presentation/31/  

> A number of Python features often seem counter-intuitive at first glance, especially when moving from another programming language to Python.
> Often what at first seems like a bug, will later reveal itself to be a misunderstood feature.
> 
> During this talk we'll look at a number of Python's unique features and quirks and attempt to re-shape our mental models of Python to better match reality.
> By the end of this talk you'll have a deeper understanding of Python's rules behind objects, scope, and variables.
> 
> Warning: this talk will include many Python head-scratchers so show up prepared to think on your feet!

In [4]:
x = 0
for x in [1, 2, 8]:
    y = x**2
print(y, x)

64 8


- has an interesting example of weird behavior and errors when assigning variables in local vs. global scope
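    - *cannot* assign to a global var from inside a function unless it is declared `global`; a plain assignment creates a local
    - *can* modify (mutate) global objects from inside a function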
- some examples of mutability
    - can have unexpected behavior with reference values
    - can have multiple vars pointing to same object
    - can modify a list in a tuple
    - can append a list to itself
        - only works because a list holds a *reference* to an object
    - ambiguity about what "change" means for a Python variable: rebinding the name vs. mutating the object
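
A minimal sketch of the scope and mutability points above. The names (`counter`, `visits`, and the helper functions) are made up for illustration and are not from the talk.

```python
counter = 0
visits = []


def increment():
    # The assignment makes `counter` local to this function, so reading it
    # before it has a local value raises UnboundLocalError.
    counter += 1


def record_visit(page):
    # No assignment here: `visits` resolves to the global list, which is mutated.
    visits.append(page)


def increment_global():
    global counter  # opt in to rebinding the module-level name
    counter += 1


try:
    increment()
except UnboundLocalError as err:
    error_name = type(err).__name__

record_visit("home")
increment_global()

# Two names, one object: a list stored in a tuple can still be mutated.
inner = [1, 2]
frozen = (inner, "x")
inner.append(3)

print(error_name, visits, counter, frozen)
```

The tuple itself never changes: it still holds a reference to the same list, and only the list's contents differ.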

In [5]:
# append a list to itself
my_list = []
my_list.append(my_list)
my_list

[[...]]

Can add a tuple to a list with `+=`; the tuple's items are appended to the list:

In [6]:
first_list = [1, 2]
first_tuple = (3, 4)
first_list += first_tuple
first_list

[1, 2, 3, 4]

Cannot add a list to a tuple:

```python
first_tuple += first_list
#> TypeError: can only concatenate tuple (not "list") to tuple
```

- recommends embracing duck typing and being comfortable with more generic terms:
    - iterable, hashable, mapping, etc.

- weird behavior with in-place addition:
    - "augmented assignment"
    - allows for *mutation* if possible (e.g. for lists, but not tuple)

In [7]:
# in-place addition results in mutation if possible
l1 = l2 = [0]
l1 += [2]
l1, l2

([0, 2], [0, 2])

In [8]:
l1 = l2 = [0]
l1 += l2
l1, l2

([0, 0], [0, 0])

In [9]:
l1 = l2 = [0]
l1 = l1 + l2
l1, l2

([0, 0], [0])
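
For contrast, a small sketch of the same augmented assignment on tuples, which cannot be mutated in place. The variable names are illustrative only.

```python
# With tuples, += cannot mutate, so it builds a new tuple and rebinds t1 only.
t1 = t2 = (0,)
t1 += (1,)
print(t1, t2)

# A list inside a tuple: += mutates the list in place, then the assignment
# back into the tuple fails, so the error and the mutation both happen.
box = ([0],)
try:
    box[0] += [1]
except TypeError as err:
    error_name = type(err).__name__
print(error_name, box)
```

The second case is the classic head-scratcher: the statement raises, yet the list has already been extended.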

- take a look at `#pythonoddity` on Twitter for more of these

## Finding Penguins with a Snake: Linux Features for a Python User

presenter: Mario Corchero    
link: https://us.pycon.org/2022/schedule/presentation/40/  

> Python has APIs that allow developers to use Linux features that many are often unaware of.
> If you are a modest Linux/Unix user and want to learn some features of the OS through the APIs that Python offers, this is the perfect talk with you.
> We will speak about processes, named pipes, fork and exec, inodes, and signals, among others, all whilst seeing how to play with these through the APIs that the Python standard library offers us.

- some useful built-in modules
    - `filecmp`: for comparing files
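
A quick sketch of `filecmp.cmp()` on temporary files. The file names and contents are invented for the example.

```python
import filecmp
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    first = Path(tmp) / "first.txt"
    copy = Path(tmp) / "copy.txt"
    other = Path(tmp) / "other.txt"
    first.write_text("penguin\n")
    copy.write_text("penguin\n")
    other.write_text("snake\n")

    # shallow=False compares file contents rather than only os.stat() data
    same = filecmp.cmp(str(first), str(copy), shallow=False)
    different = filecmp.cmp(str(first), str(other), shallow=False)

print(same, different)
```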

- `os.fork()` forks the current process
    - Linux uses "copy-on-write", so the parent's memory is not duplicated up front; memory is copied only when one of the processes modifies it
    - interesting related function `os.register_at_fork()` to register callables that run when the process forks
- `os.execvp()` to run a command
    - *replaces* the current process
    - any instructions afterwards are not run because the original process has been replaced
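
A minimal fork-and-exec sketch, assuming a Unix-like OS (`os.fork()` is not available on Windows). The helper name `run_in_child` is hypothetical, and the child runs the current Python interpreter so the example does not depend on any particular program being on `PATH`.

```python
import os
import sys


def run_in_child(argv):
    """Fork, replace the child with `argv`, and return the child's exit code."""
    pid = os.fork()
    if pid == 0:
        # Child process: execvp() replaces this process image, so on success
        # nothing after this call ever runs in the child.
        try:
            os.execvp(argv[0], argv)
        finally:
            # Only reached if execvp() itself failed.
            os._exit(127)
    # Parent process: wait for the child and decode its exit status.
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status)


exit_code = run_in_child([sys.executable, "-c", "raise SystemExit(7)"])
print(exit_code)
```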

- `locale` module
    - get information about the user's locale settings (e.g. language and number/date formatting)
- `time` module
    - the time is provided by the OS in Python
    - the `datetime` module uses `time`, too
- `signal` module
    - send a signal to a process: `os.kill(PID, signal.SIGUSR1)` (despite its name, `os.kill()` can send any signal)
    - can send signals to other processes
    - can register handlers for signals received by the process
- named pipes: inter-process communication
    - unidirectional
- Unix domain sockets
    - bi-directional channel
    - is kind of like a file that two processes can read and write to for communication
    - can make a socket and hand it to the parent and child when forking a process
- `memray`: a new profiler from Bloomberg
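
A sketch of the "socket shared across a fork" idea using `socket.socketpair()`, assuming a Unix-like OS. The message contents and variable names are invented for the example.

```python
import os
import socket

# A connected pair of sockets; after the fork, each process keeps one end.
parent_sock, child_sock = socket.socketpair()

pid = os.fork()
if pid == 0:
    # Child: read a request on its end and send back an upper-cased reply.
    try:
        parent_sock.close()
        request = child_sock.recv(1024)
        child_sock.sendall(request.upper())
        child_sock.close()
    finally:
        os._exit(0)

# Parent: the channel is bi-directional, so it can both send and receive.
child_sock.close()
parent_sock.sendall(b"ping")
reply = parent_sock.recv(1024)
parent_sock.close()
os.waitpid(pid, 0)

print(reply)
```

A named pipe would only carry data one way, so the reply would need a second pipe.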

## Building a Binary Extension

presenter: Henry Fredrick Schreiner III  
link: https://us.pycon.org/2022/schedule/presentation/57/  

> Support for binary extensions is an exceptional advantage of Python that is too often avoided for smaller packages with low developer resources.
> Binary extensions are used to achieve high performance for libraries like PyTorch, MyPy, and many thousands more.
> Binary extensions also allow access to a wealth of existing compiled libraries.
> Building your own binary extension is plagued by historically poor documentation, bad common practices, and many misconceptions.
> But it is actually easy to write extensions today that work seamlessly on all common developer platforms using modern libraries and continuous integration.
> 
> We will take a look at packaging a binary extension from start to finish.
> This starts with `pybind11` for C++ bindings, providing simple, header only builds and avoiding the need for a new language or pre-processor step.
> We will look at `scikit-build` for building, providing powerful `CMake` based builds with library search, multithreaded builds, and more. We will use PyPA's `build` to produce SDists.
> And we will use PyPA's `cibuildwheel` to produce binaries for all common platforms with minimal setup and simple CI code in GitHub Actions (but trivially movable to any other CI system).
> We will talk about how to automate common tasks, like using GitHub's Dependabot to keep `cibuildwheel` up-to-date while also ensuring reproducible builds.
> 
> After this talk, it is our hope that you will no longer shy away from using compiled code in libraries, but will feel comfortable writing extensions to accelerate or advance your libraries functionality.

- can achieve better performance for an algorithm with compiled packages
    - keep "driver" code in Python and then move heavier, longer-running code to another language
    - a good solution is Numba, worth trying first
    - other implementations of Python may provide speed-ups
    - pre-compiling (presented here) is another option
- what is a binary extension:
    - compiled code in a 3rd party library
    - can write parts of the code in C, C++, Rust, etc.
    - need to pre-compile for each OS distribution
- divide process into 3 stages (tool to use):
    1. write code in a compiled language (pybind11)
    1. use a build system to make the binary (scikit-build)
    1. build a wheel (cibuildwheel)
- `nanobind`: recent substitute for `pybind11`
    - written by same author
    - requires recent versions of Python and C++
- build systems – `scikit-build`
    - currently wraps `setuptools`/`distutils`
    - built by the team behind `CMake` to use `CMake` within Python
- `cibuildwheel` for building wheels for different OS
    - can run locally and on CI providers

## Understanding Attributes

presenter: Reuven M. Lerner  
link: https://us.pycon.org/2022/schedule/presentation/22/  

> Attributes in Python, which we use dozens of times each day, seem boring, obvious, and not worthy of attention.
> But it turns out that they're key to the Python language: Every time you say `a.b` in Python, that little dot is hiding a lot of work, from searching across multiple objects to silently rewriting things.
> And it turns out that what happens with attributes, while not always obvious to developers, determines a great deal of behavior in the Python language.
> 
> In this talk, I'll discuss what attributes are (and aren't), what Python does when you use a dot (`.`) in your code, and how you can take advantage of it.
> We'll talk about attribute lookup, about inheritance, and about methods vs. functions.
> We'll also look into properties, and how they allow us to have attributes that look like data but behave like `setters` and `getters`.
> Finally, we'll look at the descriptor protocol, which makes so much of Python's functionality possible, including the automatic insertion of "self" as the first argument in method calls.

- class attributes do work, but are an **anti-pattern**
    - classes are objects and any object can be assigned an attribute

In [10]:
class Person:
    def __init__(self, name: str) -> None:
        self.name = name
        Person.population += 1
        return None


Person.population = 0

In [11]:
joe = Person("Joe")
Person.population

1

In [12]:
suzy = Person("Suzy")
Person.population

2

- Python has no separate syntax for shared ("static") attributes; attributes set on the class object play that role
    - referred to as "static variables" in C-family languages (and in Swift, I think)
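
A small sketch of how attribute lookup falls back from an instance to its class, and how assigning through an instance shadows the class attribute instead of changing it. The `Counter` class is hypothetical.

```python
class Counter:
    total = 0  # stored on the class object, visible through every instance


a = Counter()
b = Counter()

Counter.total += 1
# Neither instance has its own `total`, so lookup falls back to the class.
seen_by_instances = (a.total, b.total)

# Assigning through an instance creates an instance attribute that shadows
# the class attribute; the class and other instances are unaffected.
a.total = 10

print(seen_by_instances, a.total, b.total, Counter.total)
print(vars(a), vars(b))
```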

- side-note: interesting function `functools.partial()`

In [13]:
def add(a, b):
    return a + b


add(2, 3)

5

In [14]:
from functools import partial

add5 = partial(add, 5)
add5(10)

15

## Distributed web scraping in Python

presenter: Josh Weissbock  
link: https://us.pycon.org/2022/schedule/presentation/48/

> Web scraping is easy to do in Python, but it quickly becomes tedious when routinely running large batch scraping jobs.
> This talk looks at how to build a distributed web scraper to reduce batch scraping job times and improve durability of your code as well as lessons learned & stories along the way.

### Intro

- steps of data science project
    1. collect
    1. store data
    1. clean data
    1. prepare data
    1. analyze and model data
- web scraping: the act of extracting data from a website, then processing and storing it
    - distributed: over multiple computers

### Building a distributed scraper

- intermediate improvements
    - introduce a middle proxy layer (to fool bots)
    - add headers to make it look like there are different requests from real humans
    - delays between requests from the same computer
    - use Selenium as a headless browser so dynamic websites will run their JavaScript
    - try-until-success to avoid missing URLs that fail
    - basic tracking of successes and failures
- make the system distributed
    - controller to make a list of work
    - manages a work queue
    - distribute the jobs to workers/scraper nodes
        - no longer need to use proxies now because each job is run on a separate computer
- queues: `pika`, `puka`, and `celery` are worth looking into
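
A rough sketch of the controller, work queue, and worker pattern with try-until-success retries. It rests on several assumptions: threads stand in for separate scraper machines, the standard library `queue.Queue` stands in for a real message broker, and `fetch()` is a hypothetical stub that simulates failures instead of touching the network.

```python
import queue
import threading


def fetch(url, attempt):
    """Hypothetical stand-in for a real HTTP request (no network access)."""
    if url.endswith("/dead"):
        raise ConnectionError(url)
    if url.endswith("/flaky") and attempt < 2:
        raise ConnectionError(url)
    return f"<html>{url}</html>"


def scraper_node(jobs, results, failures, max_attempts=3):
    """Worker: pull URLs until the queue is empty, retrying failed fetches."""
    while True:
        try:
            url = jobs.get_nowait()
        except queue.Empty:
            return
        for attempt in range(max_attempts):
            try:
                results[url] = fetch(url, attempt)
                break
            except ConnectionError:
                continue
        else:
            # Basic tracking of URLs that never succeeded.
            failures.append(url)


def controller(urls, n_workers=2):
    """Controller: build the work queue and hand jobs to the workers."""
    jobs = queue.Queue()
    for url in urls:
        jobs.put(url)
    results, failures = {}, []
    workers = [
        threading.Thread(target=scraper_node, args=(jobs, results, failures))
        for _ in range(n_workers)
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return results, failures


urls = [f"https://example.com/{name}" for name in ("a", "b", "flaky", "dead")]
results, failures = controller(urls)
print(sorted(results), failures)
```

In a truly distributed setup the queue would live in a broker and each worker would run on its own machine, but the division of responsibilities is the same.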

### Code management

- automatically pull the most recent code version

### Helpful Python packages

- requests, beautifulsoup, selenium
- `retry` and `backoff-retry`: retry failing functions
- `requests-cache`: auto cache request results
    - good for dev
- `fake-useragent`: add a `User-Agent` header to `requests.get()` drawn from a database of real browser user agents

### Considerations

- full automation
- alerts (emails) from distributed system

## Dock your Jupyter Notebook

presenter: Nir Barazida  
link: https://us.pycon.org/2022/schedule/presentation/15/

> To perfect your Jupyter Notebook craft, you'd want to make your work reproducible and shareable outside your local machine.
> In this talk, we will learn how to use Docker to build an isolated and pre-defined environment suited for ML project that runs smoothly on a remote machine.

(presenter works for DagsHub: "the GitHub for ML")

### Docker for ML

- reproducibility and portability
    - e.g. moving from local to remote machines/compute cluster
    - makes moving from dev/research to production easier

### Docker

- main Docker concepts:
    1. Dockerfile: instructions used to build the Docker image
    1. image: a read-only template that packages the files and dependencies
    1. container: a single instance of the application that is alive and running
- can get images from registries to be used in our project
    - popular registry is DockerHub
    
### Dockerfile

- build on top of existing Docker images
    - helpful to search DockerHub to find images with desired environment already setup
- install dependencies

```docker
# Dockerfile
# Start from a pre-built image (Dockerfile comments must be on their own line)
FROM jupyter/scipy-notebook:latest

RUN git clone https://www.my-repo.github \
    && cd project-dir \
    && pip install -r requirements \
    && ...
    
...
    
CMD jupyter lab
```

- other useful commands:
    - `ENV`: set environment variables
    - `COPY`: copy files from the local file system (the build context) into the Docker image
    - `MAINTAINER`: indicates the maintainer of the image (deprecated in favor of `LABEL`)
    - `CMD`: default command to run when a container is started from the image
    
```bash
docker build --pull -t name:tag .
# name: name of the project
# tag: specific tag for this image
# .: build context (directory containing the Dockerfile)
```

- to run the image and create a container:

```bash
docker run ...
```
- after doing some work on the project, to commit and push the changes:
    - open a shell inside the running Docker container

```bash
docker exec -it <container ID> /bin/bash
```

## Implementing Shared Functionality using Middleware

presenter: Amit Saha  
link: https://us.pycon.org/2022/schedule/presentation/6/

> In this talk, I will provide an introduction to the topic of writing middleware for your web applications.
> Middleware is often simply brought in to an application's code base, without perhaps a thorough understanding of how they work.
> This talk will shed light on how middleware components work in popular Python web frameworks - Flask, Django and FastAPI.
> 
> Armed with that understanding, you will learn how to write your own middleware as well as use standard community contributed middleware to implement vital functionality in your applications.

- *middleware*:
    - the "-" in "client-server" or the "-to-" in "peer-to-peer"
    - "glue" in an application
    
### Flask

- example of adding middleware in Flask
    - order of handlers declaration matters

```python
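from flask import Blueprint

bp = Blueprint(...)  # some other Flask stuff

# before_request handlers run in the order they are registered
@bp.before_request
def before_request_handler_1():
    ...

@bp.before_request
def before_request_handler_2():
    ...

# after_request handlers receive the response and must return it;
# they run in reverse order of registration
@bp.after_request
def after_request_handler_2(response):
    ...
    return response

@bp.after_request
def after_request_handler_1(response):
    ...
    return response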
```

- example use:
    - measure page rendering time

### Django

- have the option of a class-based or function-based handler

```python
# class-based
class ExecHandlingMiddleware:
    
    def __init__(self, get_response):
        self.get_response = get_response
    
    def __call__(self, request):
        ...
```

```python
# function-based
def latency_reporter(get_response):
    
    def middleware(request):
        ...
    
    return middleware
```

Then add to the `MIDDLEWARE` list in `settings.py`:

```python
# settings.py
MIDDLEWARE = [
    "path.to.handler_function1",
    "path.to.handler_function2",
]
```

- want to create middleware in a framework-agnostic way
    - can apply to Django, Flask, or any other WSGI framework
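
A minimal sketch of a framework-agnostic WSGI middleware, here one that reports timing in a response header. The class name `TimingMiddleware`, the header name `X-Elapsed-Ms`, and the toy `hello_app` are all made up for illustration. It only measures the time until the wrapped app calls `start_response`, not the time spent producing the body.

```python
import time
from wsgiref.util import setup_testing_defaults


class TimingMiddleware:
    """Wrap any WSGI app and add an elapsed-time response header."""

    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        started = time.perf_counter()

        def timed_start_response(status, headers, exc_info=None):
            elapsed_ms = (time.perf_counter() - started) * 1000
            headers = list(headers) + [("X-Elapsed-Ms", f"{elapsed_ms:.3f}")]
            return start_response(status, headers, exc_info)

        return self.app(environ, timed_start_response)


def hello_app(environ, start_response):
    """Toy WSGI application used only to exercise the middleware."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello"]


# Drive the wrapped app by hand with a fake request environment.
environ = {}
setup_testing_defaults(environ)
captured = {}


def capture(status, headers, exc_info=None):
    captured["status"] = status
    captured["headers"] = dict(headers)


body = b"".join(TimingMiddleware(hello_app)(environ, capture))
print(captured["status"], body)
```

Because the wrapper only relies on the WSGI calling convention, the same class could wrap a Flask application (for example via `app.wsgi_app`) or a Django WSGI application.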

### FastAPI

- implement an "ASGI" middleware because it is asynchronous

```python
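from fastapi import FastAPI

app = FastAPI()
# MiddlewareHandlerClass is a placeholder for your own ASGI middleware class
app.add_middleware(MiddlewareHandlerClass)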
...
```

- can implement a WSGI middleware inside of a FastAPI ASGI app
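
A sketch of what a bare ASGI middleware can look like, written against the ASGI calling convention (`scope`, `receive`, `send`) with only the standard library. The class name, the `x-middleware` header, and the toy app are hypothetical, and the `drive()` helper fakes a server so the example runs without FastAPI installed.

```python
import asyncio


class HeaderMiddleware:
    """Wrap an ASGI app and append a header to every HTTP response."""

    def __init__(self, app):
        self.app = app

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":
            # Pass through anything that is not an HTTP request untouched.
            await self.app(scope, receive, send)
            return

        async def send_wrapper(message):
            if message["type"] == "http.response.start":
                headers = list(message.get("headers", []))
                headers.append((b"x-middleware", b"seen"))
                message = {**message, "headers": headers}
            await send(message)

        await self.app(scope, receive, send_wrapper)


async def hello_app(scope, receive, send):
    """Toy ASGI application used only to exercise the middleware."""
    await send({
        "type": "http.response.start",
        "status": 200,
        "headers": [(b"content-type", b"text/plain")],
    })
    await send({"type": "http.response.body", "body": b"hello"})


async def drive():
    """Play the role of the server: call the app and record what it sends."""
    sent = []

    async def receive():
        return {"type": "http.request", "body": b"", "more_body": False}

    async def send(message):
        sent.append(message)

    scope = {"type": "http", "method": "GET", "path": "/"}
    await HeaderMiddleware(hello_app)(scope, receive, send)
    return sent


messages = asyncio.run(drive())
print(messages[0]["headers"], messages[1]["body"])
```

As far as I understand, a class whose constructor takes the wrapped `app` like this is the shape that `app.add_middleware()` expects.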

## How We Migrated 3.8 Million Lines of Python 2 Without Interrupting Development

presenter: Benjamin Bariteau  
link: https://us.pycon.org/2022/schedule/presentation/81/ 

> Migrating from python 2 to python 3 is not very easy, but it can be exacerbated by needing to port a large codebase modified frequently by many different developers.
> Our codebase was nearly 4 million lines of code modified dozens of times a day by hundreds of total developers.
> It is also business critical, containing a large portion of our most important code and data.
> We used several tools, techniques, and patterns to achieve the migration without disrupting day-to-day development and keeping regressions a minimum.
> In this talk, we’ll detail our migration steps, our usage of pre-commit hooks to reduce regressions to fixes, our usage of a reverse proxy to allow granular, low risk rollout for a webapp, and our migration of pickle to rollforward safe json for caching.

- multiple stages
    1. parsability
        - tried to parse files
        - fairly easy to fix - took about 2 weeks
    1. importability
        - import all files and fix failures
        - took about 1 month
    1. functional parity
        - (took the longest)
        - run test suite
        - group failing tests by root cause and address
    1. rollout
- `python-modernize`: make Python2 code compatible with Python2 *and* 3
    - uses `lib2to3` package
    - other packages use the `future` module or the `six` library
- common syntactic changes:
    - change to import names
    - dictionary `.values()`, etc. now return view objects instead of lists
    - change in syntax for `raise`
- used `pre-commit` to ensure compatibility with new/modified code
    - (also included `black` hook)
- misc:
    - a lot of the major breaks in the "functional parity" step were related to strings
        - not caught at parse time, so they don't throw obvious errors
    - stopped caching with pickle, switched to JSON files
        - made the switch over time so that it didn't disrupt performance
- rollout
    - maintained Python 2 and Python 3 virtualenvs
    - separate URL prefixes to decide which Python to use
    - enabled rolling out the code base piece by piece and rolling back parts that had bugs
- outcomes
    - presenter claims: "type annotations would have been useful"
    - 15-20% speedup and 26% less RAM
        - were not the original goals, but side effects of maintaining existing code and using modern tools
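
A short sketch of two of the Python 3 behaviors mentioned above that parse fine but behave differently at runtime. The dictionary contents and the cache key are invented for the example.

```python
# Dictionary views: .values() is a live view in Python 3 (it was a list
# copy in Python 2), so it reflects later changes and cannot be indexed.
prices = {"a": 1, "b": 2}
values = prices.values()
prices["c"] = 3
snapshot = list(values)

try:
    values[0]
except TypeError as err:
    indexing_error = type(err).__name__

# Strings: bytes and str never compare equal in Python 3, and no exception
# is raised, which is why these bugs only show up as failing tests.
keys_match = b"cache-key" == "cache-key"

# raise syntax: Python 2 also allowed `raise ValueError, "msg"`;
# Python 3 only accepts the call form used here.
try:
    raise ValueError("msg")
except ValueError as err:
    message = str(err)

print(snapshot, indexing_error, keys_match, message)
```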

---

In [15]:
%load_ext watermark
%watermark -d -u -v -iv -b -h -m

Last updated: 2022-04-29

Python implementation: CPython
Python version       : 3.10.4
IPython version      : 8.2.0

Compiler    : Clang 12.0.1 
OS          : Darwin
Release     : 21.4.0
Machine     : x86_64
Processor   : i386
CPU cores   : 4
Architecture: 64bit

Hostname: JHCookMac.local

Git branch: pycon2022

