# The Python Almanac

The world of Python packages is adventurous and can be confusing at times.
Here, I try to aggregate and showcase a diverse set of Python packages which have become useful at some point.

## Introduction

### Installing Python

Normally you should use your system's package manager.
In case of problems, try [pyenv](https://github.com/pyenv/pyenv):

```bash
$ pyenv versions
$ pyenv install <foo>
$ pyenv global <foo>
```

This will install the specified Python version to `$(pyenv root)/versions`.

### Installing packages

Python packages can be easily installed from [PyPI (Python Package Index)](https://pypi.org/):

```bash
$ pip install --user <package>  # local install does not clash with system packages
```

Using `--user` will install the package only for the current user. This is good if multiple users need different package versions, but can lead to redundant installations.

To install from a git repository, use the following command:

```bash
$ pip install --user -U git+https://github.com/<user>/<repository>@<branch>
```

### Package management

While packages can be installed globally or user-specific, it often makes sense to create project-specific virtual environments.

This can be easily accomplished using [venv](https://docs.python.org/3/library/venv.html):

```bash
$ python -m venv my_venv
$ . my_venv/bin/activate
$ pip install <package>
```
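
The same kind of environment can also be created from within Python using the standard library's `venv` module. The following is a minimal sketch: the temporary directory and the name `my_venv` are assumptions made for this example, and `with_pip=False` is chosen only to keep creation fast.

```python
import tempfile
import venv
from pathlib import Path

# Create the environment inside a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    env_dir = Path(tmp_dir) / "my_venv"

    # with_pip=False skips bootstrapping pip (faster); use True for a usable environment.
    venv.create(env_dir, with_pip=False)

    # Every virtual environment contains a pyvenv.cfg marker file.
    has_config = (env_dir / "pyvenv.cfg").is_file()
    print(f"created: {has_config}")
```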

## Software development

### Package Distribution

Use `setuptools` for the basics.
[poetry](https://github.com/sdispater/poetry) handles many otherwise slightly annoying things (dependency resolution, virtual environments, building, publishing):

```bash
$ poetry init/add/install/run/publish
```

Run your test suite in isolated environments, both locally and in CI, using [tox](https://github.com/tox-dev/tox).

Keeping track of version numbers can be achieved using [bump2version](https://github.com/c4urself/bump2version).

Transform between various project file formats using [dephell](https://github.com/dephell/dephell).

### Testing

Set up testing using [pytest](https://github.com/pytest-dev/pytest). It has a wide range of useful features, such as fixtures (modularized per-test setup code) and test parametrization (quickly execute the same test for multiple inputs).

In [None]:
%%writefile /tmp/tests.py

import os
import pytest


@pytest.fixture(scope='session')
def custom_directory(tmp_path_factory):
    return tmp_path_factory.mktemp('workflow_test')


def test_fixture_execution(custom_directory):
    assert os.path.isdir(custom_directory)


@pytest.mark.parametrize('expression_str,result', [
    ('2+2', 4), ('2*2', 4), ('2**2', 4)
])
def test_expression_evaluation(expression_str, result):
    assert eval(expression_str) == result

In [None]:
!pytest -v /tmp/tests.py

### Linting/Formatting

Linters and code formatters improve the quality of your Python code by conducting a static analysis and flagging issues.

* [flake8](https://github.com/PyCQA/flake8): Catch various common errors and adhere to PEP8. Supports many [plugins](https://github.com/DmytroLitvinov/awesome-flake8-extensions).
* [pylint](https://github.com/PyCQA/pylint): Looks for even more sources of code smell.
* [black](https://github.com/psf/black): "*the* uncompromising Python code formatter".

While there can be a considerable overlap between the tools' outputs, each offers its own advantages and they can typically be used together.

### Profiling

Code profiling tools are a great way of finding parts of your code which can be optimized.
They come in various flavors:

* [line_profiler](https://github.com/pyutils/line_profiler): which parts of the code require most execution time
* [memory_profiler](https://github.com/pythonprofilers/memory_profiler): which parts of the code consume the most memory

Consider the following script (note the `@profile` decorator):

In [None]:
%%writefile /tmp/script.py

@profile
def main():
    # takes a long time
    for _ in range(100_000):
        1337**42

    # requires a lot of memory
    arr = [1] * 1_000_000
    
main()

#### line_profiler

In [None]:
!kernprof -l -v -o /tmp/script.py.lprof /tmp/script.py

#### memory_profiler

In [None]:
!python3 -m memory_profiler /tmp/script.py
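
If neither tool is installed, the standard library offers a coarser view of the same two questions. The following is a sketch under that assumption, not a replacement for the profilers above: `cProfile` reports time per function (not per line) and `tracemalloc` reports the peak size of Python allocations. The function names `slow_part` and `hungry_part` are made up for illustration.

```python
import cProfile
import io
import pstats
import tracemalloc


def slow_part():
    # takes a long time
    for _ in range(100_000):
        1337**42


def hungry_part():
    # requires a lot of memory
    return [1] * 1_000_000


# Time: collect per-function statistics with cProfile.
profiler = cProfile.Profile()
profiler.enable()
slow_part()
profiler.disable()

stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
print(report)

# Memory: track the peak amount allocated while the function runs.
tracemalloc.start()
arr = hungry_part()
_, peak_bytes = tracemalloc.get_traced_memory()
tracemalloc.stop()
print(f"peak: {peak_bytes / 1e6:.1f} MB")
```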

### Debugging

#### Raw python

[ipdb](https://github.com/gotcha/ipdb) is a useful Python command-line debugger.
To invoke it, simply put `import ipdb; ipdb.set_trace()` in your code.
Starting with Python 3.7, you can also write `breakpoint()`. This honors the `PYTHONBREAKPOINT` environment variable.
To automatically start the debugger when an error occurs, run your script with `python -m ipdb -c continue <script>`.

The debugger supports various commands:
* p: print expression
* pp: pretty print
* n: next line in current function
* s: execute current line and stop at next possible location (e.g. in function call)
* c: continue execution
* unt: execute until a line with a greater line number is reached
* l: list source (`l .`)
* ll: whole source code of current function
* b: breakpoint (`[ ([filename:]lineno | function) [, condition] ]`)
* w/bt: print stack trace
* u: move up the stack trace
* d: move down the stack trace
* h: help
* q: quit
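
How `breakpoint()` chooses its debugger can be sketched as follows. By default it calls `sys.breakpointhook()`, which consults `PYTHONBREAKPOINT`: for example, `PYTHONBREAKPOINT=ipdb.set_trace` selects ipdb and `PYTHONBREAKPOINT=0` disables all breakpoints. The example below replaces the hook with a recording function (`fake_debugger`, a name invented here) so that it runs without user interaction.

```python
import sys

calls = []


def fake_debugger(*args, **kwargs):
    # Stand-in for a real debugger: just note that we were invoked.
    calls.append("breakpoint hit")


original_hook = sys.breakpointhook
sys.breakpointhook = fake_debugger
try:
    breakpoint()  # dispatches to sys.breakpointhook, i.e. fake_debugger
finally:
    sys.breakpointhook = original_hook  # always restore the previous hook

print(calls)
```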

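#### C++ extension

Open two windows: one running IPython, one running lldb (or gdb). Look up the IPython process ID, attach the debugger to it, set a breakpoint in the C++ source, and then run the script from IPython: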

In [1]: !ps aux | grep -i ipython
(lldb) attach --pid 1234
(lldb) continue

(lldb) breakpoint set -f myfile.cpp -l 400

In [2]: run myscript.py

### Documentation

sphinx, nbsphinx

### Logging

There are various built-in and third-party logging modules available.

In [None]:
from loguru import logger

In [None]:
logger.debug('Helpful debug message')
logger.error('oh no')
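
For comparison, the built-in `logging` module needs a little more setup to produce similar output. This is a minimal sketch; the logger name `almanac`, the format string, and the in-memory buffer are arbitrary choices for this example.

```python
import io
import logging

# Write to an in-memory buffer here; a real script would usually log to stderr or a file.
buffer = io.StringIO()
handler = logging.StreamHandler(buffer)
handler.setFormatter(logging.Formatter("%(levelname)s | %(name)s | %(message)s"))

std_logger = logging.getLogger("almanac")
std_logger.setLevel(logging.DEBUG)  # the inherited default (WARNING) would hide debug messages
std_logger.addHandler(handler)
std_logger.propagate = False  # do not forward records to the root logger

std_logger.debug("Helpful debug message")
std_logger.error("oh no")

log_output = buffer.getvalue()
print(log_output)
```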

## Data Science

### SciPy

[SciPy](https://www.scipy.org/) comprises various popular Python modules for scientific computing.

[NumPy](https://numpy.org/) provides n-dimensional arrays and fast, vectorized operations on them.

In [None]:
import numpy as np

In [None]:
data = np.random.normal(size=(100, 3))

In [None]:
data[:10, :]

### Dataframes

Organizing your data in dataframes using [pandas](https://pandas.pydata.org/) makes nearly everything easier.

In [None]:
import pandas as pd

In [None]:
df = pd.DataFrame(data, columns=['A', 'B', 'C'])
df['group'] = np.random.choice(['G1', 'G2'], size=df.shape[0])

In [None]:
df.head()

### Networkx

[Networkx](https://github.com/networkx/networkx) is a wonderful library for conducting network analysis.

In [None]:
import networkx as nx

In [None]:
graph = nx.watts_strogatz_graph(100, 4, 0.1)
print(graph)  # nx.info(graph) was removed in NetworkX 3.0

In [None]:
pos = nx.drawing.nx_agraph.graphviz_layout(graph, prog='neato', args='-Goverlap=scale')
list(pos.items())[:3]

In [None]:
node_clustering = nx.clustering(graph)
list(node_clustering.items())[:3]

In [None]:
nx.draw(
    graph, pos,
    node_size=100,
    nodelist=list(node_clustering.keys()),
    node_color=list(node_clustering.values())
)

### Plotting

#### Matplotlib

[Matplotlib](https://matplotlib.org/) is the de facto standard plotting library for Python.

In [None]:
import matplotlib.pyplot as plt

In [None]:
fig, ax = plt.subplots()

ax.scatter(data[:, 0], data[:, 1])

fig.tight_layout()

Axis ticks can be formatted in a multitude of different [ways](https://matplotlib.org/api/ticker_api.html#tick-formatting).
The most versatile way is probably `FuncFormatter`.

In [None]:
from matplotlib.ticker import FuncFormatter

In [None]:
@FuncFormatter
def my_formatter(x, pos):
    return f'{x=}, {pos=}'

In [None]:
fig, ax = plt.subplots(figsize=(10, 6))

ax.scatter(data[:, 0], data[:, 1])

ax.xaxis.set_major_formatter(my_formatter)
ax.yaxis.set_major_formatter(my_formatter)

#### Seaborn

[Seaborn](https://seaborn.pydata.org/) makes working with dataframes and creating commonly used plots accessible and comfortable.

In [None]:
import seaborn as sns

In [None]:
# first convert dataframe from wide to long format
df_long = pd.melt(df, id_vars=['group'])
df_long.head()

In [None]:
sns.boxplot(data=df_long, x='variable', y='value', hue='group')

#### Statannot

[Statannot](https://github.com/webermarcolivier/statannot) can be used to quickly add markers of significance to comparison plots.

In [None]:
import statannot

In [None]:
ax = sns.boxplot(
    data=df_long,
    x='variable', y='value', hue='group',
    order=['A', 'B', 'C'], hue_order=['G1', 'G2']
)

statannot.add_stat_annotation(
    ax, plot='boxplot',
    data=df_long,
    x='variable', y='value', hue='group',
    order=['A', 'B', 'C'], hue_order=['G1', 'G2'],
    box_pairs=[(('B', 'G1'), ('B', 'G2'))],
    text_format='simple', test='Mann-Whitney'
)

#### Brokenaxes

[Brokenaxes](https://github.com/bendichter/brokenaxes) can be used to include outliers in a plot without messing up the axis range. Note that this can be quite misleading.

In [None]:
import brokenaxes

In [None]:
bax = brokenaxes.brokenaxes(ylims=((0, 20), (90, 110)))
bax.boxplot([np.random.normal(10, size=100), np.random.normal(100, size=100)]);

#### Adjusttext

[Adjusttext](https://github.com/Phlya/adjustText) helps with plots that have many, potentially overlapping labels.

In [None]:
from adjustText import adjust_text

In [None]:
data_sub = data[:40, :]
fig, (ax_raw, ax_adj) = plt.subplots(nrows=1, ncols=2, figsize=(16, 6))

ax_raw.scatter(data_sub[:, 0], data_sub[:, 1])
[ax_raw.annotate(f'{round(x, 1)},{round(y, 1)}', xy=(x, y)) for x, y in data_sub[:, [0, 1]]]

ax_adj.scatter(data_sub[:, 0], data_sub[:, 1])
adjust_text([ax_adj.annotate(f'{round(x, 1)},{round(y, 1)}', xy=(x, y)) for x, y in data_sub[:, [0, 1]]], arrowprops=dict(arrowstyle='->'))

#### Plotnine

While matplotlib's pyplot provides a plotting interface similar to MATLAB's, [plotnine](https://github.com/has2k1/plotnine) implements a grammar of graphics and is (in spirit) based on R's [ggplot2](https://github.com/tidyverse/ggplot2).

In [None]:
import plotnine

In [None]:
(plotnine.ggplot(df_long, plotnine.aes(x='variable', y='value', color='group')) +
    plotnine.geom_boxplot() +
    plotnine.facet_wrap('~group') +
    plotnine.theme_minimal())

#### Folium

[Folium](https://github.com/python-visualization/folium) is a Python wrapper of the [Leaflet.js](https://leafletjs.com/) library to visualize dynamic maps.

In [None]:
import folium

In [None]:
folium.Map(
    location=[np.random.uniform(40, 70), np.random.uniform(10, 30)], zoom_start=7,
    width=500, height=500
)

### High performance

When dealing with large amounts of data or many computations, it can make sense to optimize hotspots in C++ or use specialized libraries.

#### Dask

[Dask](https://github.com/dask/dask) provides a pandas-like interface to high-performance dataframes which support out-of-core processing, cluster distribution, and more.
It is particularly useful when the dataframe no longer fits in RAM. Operations work on chunks of the dataframe and are evaluated lazily, i.e. only when explicitly requested (e.g. via `.compute()`).

In [None]:
import dask.dataframe as dd

In [None]:
df = pd.DataFrame(np.random.normal(size=(1_000_000, 2)), columns=['A', 'B'])

In [None]:
ddf = dd.from_pandas(df, npartitions=4)
ddf.head()

In [None]:
ddf['A'] + ddf['B']

In [None]:
(ddf['A'] + ddf['B']).compute()

#### Vaex

[Vaex](https://github.com/vaexio/vaex) fills a similar niche to Dask and makes working with out-of-core dataframes easy.
It has a slightly more intuitive interface and offers many cool visualizations right out of the box.

In [None]:
import vaex as vx

In [None]:
vdf = vx.from_pandas(df)
vdf.head()

In [None]:
vdf['A'] + vdf['B']

In [None]:
vdf.plot(vdf['A'], vdf['B'])

#### Joblib

[Joblib](https://github.com/joblib/joblib) makes executing functions in parallel very easy and removes boilerplate code.

In [None]:
import time
import random

import joblib

In [None]:
def heavy_function(i):
    print(f'{i=}')
    time.sleep(random.random())
    return i ** i

In [None]:
joblib.Parallel(n_jobs=2)([joblib.delayed(heavy_function)(i) for i in range(10)])
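
To illustrate the boilerplate that joblib hides, here is a rough standard-library counterpart using `concurrent.futures`. This is a sketch, not how joblib works internally: it uses threads (which suits the sleeping workload here), whereas joblib by default typically runs tasks in worker processes. The function is repeated with a shorter sleep so that the block is self-contained.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor


def heavy_function(i):
    time.sleep(random.random() / 100)  # shortened sleep for the example
    return i ** i


# map() returns results in input order, regardless of which task finishes first.
with ThreadPoolExecutor(max_workers=2) as executor:
    results = list(executor.map(heavy_function, range(10)))

print(results)
```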

#### Swifter

Choosing the correct way of parallelizing your computations can be non-trivial. [Swifter](https://github.com/jmcarpenter2/swifter) tries to automatically select the most suitable one.

In [None]:
import swifter

In [None]:
df_big = pd.DataFrame({
    'A': np.random.randint(0, 100, size=1_000_000)
})
df_big.head()

In [None]:
%%timeit
df_big['A'].apply(lambda x: x**2)

In [None]:
%%timeit
df_big['A'].swifter.apply(lambda x: x**2)

### Bioinformatics

#### PyRanges

[PyRanges](https://github.com/biocore-ntnu/pyranges) makes working with genomic ranges easy as pie.

In [None]:
import pyranges as pr

In [None]:
df_exons = pr.data.exons()
df_exons

In [None]:
df_locus = pr.PyRanges(pd.DataFrame({'Chromosome': ['chrX'], 'Start': [1_400_000], 'End': [1_500_000]}))
df_locus

In [None]:
df_exons.overlap(df_locus).df

#### Obonet

[Obonet](https://github.com/dhimmel/obonet) is a library for working with (OBO-formatted) ontologies.

In [None]:
import obonet

In [None]:
url = 'https://github.com/DiseaseOntology/HumanDiseaseOntology/raw/master/src/ontology/HumanDO.obo'
graph = obonet.read_obo(url)

In [None]:
list(graph.nodes(data=True))[0]

### Statistics/Machine Learning

#### Statsmodels

[Statsmodels](https://github.com/statsmodels/statsmodels) helps with statistical modelling.

In [None]:
import statsmodels.formula.api as smf

In [None]:
df_data = pd.DataFrame({
    'X': np.random.normal(size=100)
})

df_data['Y'] = 1.3 * df_data['X'] + 4.2

df_data.head()

In [None]:
mod = smf.ols('Y ~ X', data=df_data)
res = mod.fit()

In [None]:
res.params

In [None]:
res.summary().tables[1]
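
Since `Y` is an exact linear function of `X` here, the fit should recover an intercept of 4.2 and a slope of 1.3. For a single predictor, the ordinary least squares estimates have a closed form, which the following dependency-free sketch computes by hand. It illustrates the formula, not statsmodels' internals, and `ols_fit` is a made-up helper.

```python
import random


def ols_fit(xs, ys):
    """Return (intercept, slope) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return intercept, slope


rng = random.Random(0)
xs = [rng.gauss(0, 1) for _ in range(100)]
ys = [1.3 * x + 4.2 for x in xs]

intercept, slope = ols_fit(xs, ys)
print(f"Intercept: {intercept:.3f}, X: {slope:.3f}")
```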

#### Pingouin

[Pingouin](https://github.com/raphaelvallat/pingouin) provides additional statistical methods.

In [None]:
import pingouin as pg

In [None]:
pg.normality(np.random.normal(size=100))

In [None]:
pg.normality(np.random.uniform(size=100))

#### Scikit-learn

[Scikit-learn](https://github.com/scikit-learn/scikit-learn) facilitates machine learning in Python.

In [None]:
from sklearn import svm
from sklearn import datasets
from sklearn.model_selection import train_test_split

In [None]:
X, y = datasets.load_iris(return_X_y=True)
X.shape, y.shape

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)

In [None]:
clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
clf.score(X_test, y_test)

Scikit-learn offers various plugins which deal with common issues encountered while modeling.

[Imbalanced-learn](https://github.com/scikit-learn-contrib/imbalanced-learn) provides various re-sampling techniques when the dataset has annoying class imbalances.

In [None]:
import collections

from imblearn.over_sampling import RandomOverSampler

In [None]:
ros = RandomOverSampler(random_state=0)

In [None]:
X_sub, y_sub = X[:60, :], y[:60]
X_resampled, y_resampled = ros.fit_resample(X_sub, y_sub)

In [None]:
print('sub:', sorted(collections.Counter(y_sub).items()))
print('resampled:', sorted(collections.Counter(y_resampled).items()))
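
The idea behind random over-sampling is simple: samples from the smaller classes are drawn with replacement until every class is as large as the biggest one. A plain-Python sketch of that idea follows. It is not imbalanced-learn's implementation, `random_oversample` is a hypothetical helper, and the toy data only mimics the class sizes of the iris subset (50 vs. 10).

```python
import collections
import random


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen minority samples until all classes are equal in size."""
    rng = random.Random(seed)
    by_class = collections.defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)

    target = max(len(members) for members in by_class.values())

    out_samples, out_labels = [], []
    for label, members in by_class.items():
        # keep all originals, then draw the missing ones with replacement
        extra = rng.choices(members, k=target - len(members))
        out_samples.extend(members + extra)
        out_labels.extend([label] * target)
    return out_samples, out_labels


# 50 samples of class 0 and 10 of class 1
toy_X = [[float(i)] for i in range(60)]
toy_y = [0] * 50 + [1] * 10

res_X, res_y = random_oversample(toy_X, toy_y)
print(sorted(collections.Counter(res_y).items()))  # [(0, 50), (1, 50)]
```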

[Category_encoders](https://github.com/scikit-learn-contrib/category_encoders) helps with converting categorical variables to numerical ones.

In [None]:
import category_encoders

In [None]:
tmp = np.random.choice(['A', 'B'], size=10)
df_cat = pd.DataFrame({
    'original_class': tmp,
    'feature01': tmp
})
df_cat.head()

In [None]:
category_encoders.OneHotEncoder(cols=['feature01']).fit_transform(df_cat)

[Yellowbrick](https://github.com/DistrictDataLabs/yellowbrick) makes a multitude of visual diagnostic tools readily accessible.

In [None]:
from yellowbrick.classifier import ROCAUC

In [None]:
clf.fit(X, y)

In [None]:
visualizer = ROCAUC(clf)
visualizer.score(X, y)
visualizer.show()

## Language Bindings

### Pybind11

[Pybind11](https://github.com/pybind/pybind11) makes writing bindings between Python and C++ enjoyable. In combination with [cppimport](https://github.com/tbenthompson/cppimport) some might even call it fun.
It is possible to implement [custom typecasters](https://pybind11.readthedocs.io/en/stable/advanced/cast/custom.html) to support bindings for arbitrary objects.

In [None]:
%%writefile cpp_source.cpp

#include <pybind11/pybind11.h>

namespace py = pybind11;


int square(int x) {
    return x * x;
}

PYBIND11_MODULE(cpp_source, m) {
    m.def(
        "square", &square,
        py::arg("x") = 1
    );
}

/*
<%
setup_pybind11(cfg)
cfg['compiler_args'] = ['-std=c++11']
%>
*/

In [None]:
import cppimport

In [None]:
cpp_source = cppimport.imp('cpp_source')

In [None]:
cpp_source.square(5)

## Jupyter

### Nbstripout

Committing Jupyter notebooks to a version control system (e.g. git) can be annoying because non-code properties such as cell outputs and execution counts are saved as well.
[Nbstripout](https://github.com/kynan/nbstripout) strips all of those away and can be run automatically for each committed notebook by executing `nbstripout --install` once.
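
What gets stripped can be seen in the notebook file itself, which is JSON: each code cell stores its outputs and an execution counter next to the source. The sketch below clears both on a hand-written miniature notebook. It only illustrates the idea; nbstripout itself handles more than this, and `strip_outputs` is a name invented here.

```python
import json


def strip_outputs(notebook):
    """Return a copy of a notebook dict with outputs and execution counts cleared."""
    stripped = json.loads(json.dumps(notebook))  # cheap deep copy
    for cell in stripped.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return stripped


notebook = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Title"]},
        {
            "cell_type": "code",
            "metadata": {},
            "execution_count": 7,
            "source": ["1 + 1"],
            "outputs": [
                {
                    "output_type": "execute_result",
                    "execution_count": 7,
                    "data": {"text/plain": ["2"]},
                    "metadata": {},
                }
            ],
        },
    ],
}

clean = strip_outputs(notebook)
print(json.dumps(clean["cells"][1], indent=1))
```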

## ToDo

* Validate your config files using [schemas](https://github.com/Julian/jsonschema/).
* Design your pipelines using [Snakemake](https://bitbucket.org/snakemake/snakemake).
* moviepy
* https://github.com/tqdm/tqdm
* https://github.com/pyca/cryptography
* https://github.com/jmoiron/humanize
* numba
* pythran
* https://github.com/cloudpipe/cloudpickle
* jupytext
* dfply(/plydata)
* tensorflow
* filprofiler
* https://github.com/mitmproxy/mitmproxy
* https://github.com/secdev/scapy