# Analysis of space used by `docker-texlive` container during build

When building the `docker-texlive` image based on the `Dockerfile` [specified as of 6th April 2019](https://github.com/lanecodes/docker-texlive/tree/9d541ba302e90e2d3aaf9cffa8602b09ef8dbd89), I ran into problems caused by running out of disk space.

To investigate, I tracked the disk usage of the directory where Docker stores its data on my machine (`/media/docker`) using the Bash script given below. I found that adding an additional call to `apt-get clean` during the build process was sufficient to keep the size of the container below 10 GB, with a maximum size of 9.3 GB during the build. The file `./data/docker-space.log` is the output of the script below when this additional call to `apt-get clean`, placed after the installation of `texlive`, is included in the `Dockerfile`.

```bash
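#!/usr/bin/env bash
# Log disk usage of the Docker data directory every 10 seconds.

LOGFILE=~/docker-space.log
echo "time Used Avail Use%" > "$LOGFILE"

while true
do
    # Print the current time followed by the Used, Avail and Use% columns
    # of the df output, skipping the header row.
    echo \
        "$(date +%H:%M:%S)" \
        "$(df -h /media/docker \
              | awk '$1 != "Filesystem" {print $3, $4, $5}')" \
        | tee -a "$LOGFILE"
    sleep 10
done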
```
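
For readers who would rather stay in Python, the same sampling can be sketched with the standard library's `shutil.disk_usage`. This is only an illustration, not the script that produced the log: the names `sample_line` and `log_usage` are hypothetical, sizes are always reported in `G` rather than in whichever unit `df -h` picks, and the percentage is computed as used/total, which can differ slightly from the figure `df` reports.

```python
import shutil
import time
from datetime import datetime


def sample_line(path=".", now=None):
    """Return one log line: time, used space, free space and percent used."""
    usage = shutil.disk_usage(path)  # named tuple (total, used, free) in bytes
    now = now or datetime.now()
    gib = 1024 ** 3
    # Assumption: percent is used/total; df rounds and counts reserved blocks differently
    pct = round(100 * usage.used / usage.total) if usage.total else 0
    return f"{now:%H:%M:%S} {usage.used / gib:.1f}G {usage.free / gib:.1f}G {pct}%"


def log_usage(path, logfile, interval_s=10, n_samples=None):
    """Write a header, then append one sample every interval_s seconds.

    With n_samples=None the loop runs until interrupted, like the Bash script.
    """
    with open(logfile, "w") as f:
        f.write("time Used Avail Use%\n")
    taken = 0
    while n_samples is None or taken < n_samples:
        line = sample_line(path)
        print(line)
        with open(logfile, "a") as f:
            f.write(line + "\n")
        taken += 1
        time.sleep(interval_s)


# One sample for the current directory; pass '/media/docker' to mirror the script
print(sample_line("."))
```

Calling `log_usage('/media/docker', 'docker-space.log')` would then approximate the behaviour of the Bash loop above.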

In [None]:
from pathlib import Path

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import matplotlib
%matplotlib inline

In [None]:
DATA_DIR = Path('data')
IMG_DIR = Path('../img')
IMG_DIR.mkdir(exist_ok=True)

In [None]:
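def unit_gb_value(value_str: str) -> float:
    """Convert a `df -h` size string such as '9.3G' or '512M' to gigabytes."""
    unit_str = value_str[-1]
    if unit_str == '0':
        # No unit suffix: df prints a bare 0 for an empty value
        return float(value_str)
    elif unit_str == 'M':
        # df -h reports sizes in powers of 1024
        return float(value_str[:-1]) / 1024
    elif unit_str == 'G':
        return float(value_str[:-1])
    else:
        raise ValueError(f"couldn't parse unit in {value_str!r}")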
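
The parser above only needs the `M` and `G` suffixes that appear in this log. If a log also contained `K` or `T` values, a table-driven variant along the following lines might be easier to extend. This is a sketch under assumptions: the name `size_to_gib` and the `UNIT_TO_GIB` table are hypothetical, and treating a value with no suffix as a plain byte count is a guess about `df` output rather than something this log demonstrates.

```python
# Multipliers to GiB for the suffixes `df -h` prints (powers of 1024)
UNIT_TO_GIB = {"K": 1 / 1024**2, "M": 1 / 1024, "G": 1.0, "T": 1024.0}


def size_to_gib(text):
    """Convert a `df -h` size such as '9.3G' or '512M' to GiB."""
    text = text.strip()
    if text and text[-1] in UNIT_TO_GIB:
        return float(text[:-1]) * UNIT_TO_GIB[text[-1]]
    # Assumption: no recognised suffix means a plain byte count (e.g. a bare '0')
    return float(text) / 1024**3


for example in ["9.3G", "512M", "2T", "0"]:
    print(example, "->", size_to_gib(example))
```

Anything that is neither a known suffix nor a number still raises `ValueError` from `float`, so malformed rows fail loudly rather than being silently converted.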

In [None]:
df = (
    pd.read_csv(DATA_DIR / 'docker-space.log', sep=' ')
    .assign(time=lambda df: (
        pd.to_datetime('20190409' + df['time'], format='%Y%m%d%H:%M:%S')))
    .assign(avail_gb=lambda df: df['Avail'].apply(unit_gb_value))
    .assign(used_gb=lambda df: df['Used'].apply(unit_gb_value))
    .assign(used_pct=lambda df: df['Use%'].str[:-1].astype(int))
    .set_index('time')
    .drop(columns=['Used', 'Avail', 'Use%'])
)

In [None]:
df[['avail_gb', 'used_gb']].plot()

In [None]:
container_size_s = (
    df.reset_index()
    # Seconds elapsed since the first sample
    .assign(build_time=lambda df: (
        (df['time'] - df['time'].iloc[0]).dt.total_seconds().astype(int)))
    .set_index('build_time')
    # Growth in disk usage relative to the first sample
    .assign(container_size_gb=lambda df: df['used_gb'] - df['used_gb'].iloc[0])
    ['container_size_gb']
)
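
The build time is just the number of seconds between each sample and the first one. Because the log records only `HH:MM:SS` and the notebook attaches a fixed date, a build that ran past midnight would produce negative elapsed times. The sketch below shows the same calculation in plain Python with a rollover check added. The function name `elapsed_seconds` is hypothetical, and the rollover handling is an addition that the log analysed here did not need.

```python
from datetime import datetime, timedelta


def elapsed_seconds(time_strings):
    """Seconds since the first sample for a list of 'HH:MM:SS' strings."""
    elapsed = []
    day_offset = timedelta(0)
    start = previous = None
    for text in time_strings:
        stamp = datetime.strptime(text, "%H:%M:%S") + day_offset
        if previous is not None and stamp < previous:
            # Clock went backwards, so assume the build crossed midnight
            day_offset += timedelta(days=1)
            stamp += timedelta(days=1)
        if start is None:
            start = stamp
        previous = stamp
        elapsed.append(int((stamp - start).total_seconds()))
    return elapsed


print(elapsed_seconds(["23:59:50", "00:00:00", "00:00:10"]))  # [0, 10, 20]
```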

In [None]:
print(f'Maximum size of container: {container_size_s.max():.1f} GB')

In [None]:
matplotlib.rc('font', size=14)
fig, ax = plt.subplots(figsize=(8, 5))
annotation_c = 'dimgrey'
container_size_s.plot(ax=ax)
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
ax.set_xlabel('Build time [s]')
ax.set_ylabel('Image size [GB]')
ax.annotate('1st\napt-get clean', xy=(1450, 2), color=annotation_c)
ax.axvline(x=2280, color=annotation_c, ls='--')
ax.annotate('2nd\napt-get clean', xy=(2950, 2), color=annotation_c)
ax.axvline(x=2850, color=annotation_c, ls='--')
plt.tight_layout()
plt.savefig(IMG_DIR / 'image_size_during_build.png')