# arXiv Matplotlib Query

Anecdotally, the Matplotlib maintainers were told:

*"About 15% of arXiv papers use Matplotlib"*

Unfortunately, the original analysis behind this number was lost, so we reproduce it here.

## Watermark

Since the early 2010s, Matplotlib has included the bytes `b"Matplotlib"` in every PNG and PDF that it produces.  These bytes persist in the output PDFs stored on arXiv.  As a result, it's pretty simple to check whether a PDF contains a Matplotlib image.  All we have to do is scan through every PDF and look for these bytes; no parsing required.
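
As a minimal sketch of that check, the snippet below searches raw bytes for the watermark. The helper name `has_matplotlib_watermark` and the toy byte strings are made up for illustration; only the case-insensitive substring test mirrors what this notebook does later on.

```python
def has_matplotlib_watermark(data: bytes) -> bool:
    """Return True if the raw bytes contain the Matplotlib watermark.

    The comparison is case-insensitive, so b"Matplotlib" and
    b"matplotlib" both count as a match.
    """
    return b"matplotlib" in data.lower()


# Toy byte strings standing in for real PDF contents (hypothetical data)
with_plot = b"%PDF-1.4 ... Matplotlib ... %%EOF"
without_plot = b"%PDF-1.4 ... some other producer ... %%EOF"

print(has_matplotlib_watermark(with_plot))
print(has_matplotlib_watermark(without_plot))
```

Because this is a plain substring search, it would also flag a PDF that merely contains the word as uncompressed text, so the resulting counts are best read as an estimate.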

## Data

The data is stored in a requester-pays bucket at s3://arxiv (more information at https://arxiv.org/help/bulk_data_s3 ) and also on GCS, hosted by Kaggle (more information at https://www.kaggle.com/datasets/Cornell-University/arxiv).

The data is about 1 TB in size, which is more than we want to process on a single machine, so we're going to use Dask for this.

## Create Dask Cluster

We start with a small Dask cluster on AWS in the same region where the data is stored.  We also mimic the local software environment on the cluster with `package_sync=True`.

In [None]:
import coiled

cluster = coiled.Cluster(
    name="arxiv",
    shutdown_on_close=False,
    package_sync=True, 
    backend_options={"region": "us-east-1"},
)

In [None]:
from dask.distributed import Client, wait
client = Client(cluster)

### Get all filenames

In [None]:
import s3fs
s3 = s3fs.S3FileSystem(requester_pays=True)

directories = s3.ls("s3://arxiv/pdf")
len(directories)

In [None]:
directories[1000]

## Process one file

Mostly we have to muck about with tar files.  This wasn't hard.  The `tarfile` library is in the standard library.  It's not beautiful, but it's also not hard to use.

In [None]:
import tarfile
import io

def extract(filename):
    """Return (member name, has_matplotlib) for every PDF in one tar file on S3."""
    out = []
    # Read the whole tar file into memory (uses the global `s3` filesystem)
    with s3.open(filename) as f:
        raw = f.read()
    with io.BytesIO(raw) as bio:
        with tarfile.TarFile(fileobj=bio) as tf:
            for member in tf.getmembers():
                if member.isfile() and member.name.endswith(".pdf"):
                    data = tf.extractfile(member).read()
                    # Case-insensitive search for the watermark bytes
                    out.append((
                        member.name,
                        b"matplotlib" in data.lower()
                    ))
    return out

In [None]:
# See an example of its use
extract(directories[20])[:10]
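
To try the tar-scanning logic without touching S3, here is a small sketch that builds a tar archive in memory and scans it the same way.  The helpers `scan_tar_bytes` and `make_toy_tar`, the member names, and the payloads are all hypothetical stand-ins, not real arXiv data.

```python
import io
import tarfile


def scan_tar_bytes(tar_bytes):
    """Return (member name, has_matplotlib) for each PDF in an uncompressed tar."""
    out = []
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:") as tf:
        for member in tf.getmembers():
            # Only regular files with a .pdf suffix are scanned
            if member.isfile() and member.name.endswith(".pdf"):
                data = tf.extractfile(member).read()
                out.append((member.name, b"matplotlib" in data.lower()))
    return out


def make_toy_tar(files):
    """Build an in-memory tar archive from a {name: bytes} mapping."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tf:
        for name, payload in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tf.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


# Hypothetical members: two fake PDFs and one non-PDF file
toy = make_toy_tar({
    "0001/with-plot.pdf": b"%PDF-1.4 ... Matplotlib ...",
    "0001/no-plot.pdf": b"%PDF-1.4 ... nothing to see ...",
    "0001/notes.txt": b"Matplotlib is mentioned here, but this is not a PDF",
})
results = scan_tar_bytes(toy)
print(results)
```

The text file is skipped even though it contains the word, because only members ending in `.pdf` are read.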

## Scale function to full dataset

In [None]:
cluster.scale(100)

In [None]:
futures = client.map(extract, directories)
wait(futures)

# We had one error in one file.  Let's just ignore and move on.
good = [future for future in futures if future.status == "finished"]

lists = client.gather(good)

In [None]:
# Scale down now that we're done
cluster.scale(4)

In [None]:
# Convert to Pandas
import pandas as pd

dfs = [
    pd.DataFrame(rows, columns=["filename", "has_matplotlib"])
    for rows in lists
]

df = pd.concat(dfs)

df

## Enrich Data

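Let's write a function to enrich our data a bit.  The directory part of each filename encodes a two-digit year and a month (`YYMM`), which we convert into a timestamp.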

In [None]:
def date(filename):
    year = int(filename.split("/")[0][:2])
    month = int(filename.split("/")[0][2:4])
    if year > 80:
        year = 1900 + year
    else:
        year = 2000 + year
    
    return pd.Timestamp(year=year, month=month, day=1)

date("0005/astro-ph0001322.pdf")
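
The two-digit year pivot in `date` above is easy to get wrong, so here is a standard-library-only sketch of the same rule with a few worked cases.  The name `month_from_filename`, the `pivot` parameter, and the example filenames are hypothetical; it returns a `datetime.date` rather than a `pd.Timestamp`.

```python
import datetime


def month_from_filename(filename, pivot=80):
    """Map a 'YYMM/...' filename to the first day of that month.

    Two-digit years above `pivot` are read as 19YY, the rest as 20YY.
    """
    yymm = filename.split("/")[0]
    year, month = int(yymm[:2]), int(yymm[2:4])
    year += 1900 if year > pivot else 2000
    return datetime.date(year, month, 1)


# Hypothetical filenames covering both sides of the pivot
print(month_from_filename("9108/example-a.pdf"))
print(month_from_filename("0005/example-b.pdf"))
print(month_from_filename("2301/example-c.pdf"))
```

With a pivot of 80, directory names from `8101` through `9912` land in the 1900s and everything else lands in the 2000s, so the rule stays unambiguous only until 2080.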

In [None]:
df["date"] = df.filename.map(date)
df.head()

## Plot

The scalable work is over.  Now we can just fool around with Pandas and Matplotlib.

In [None]:
df.groupby("date").has_matplotlib.mean().plot(
    title="Matplotlib Usage in arXiv", 
    ylabel="Fraction of papers"
).get_figure().savefig("results.png")

Yup.  Matplotlib is used pretty commonly on arXiv.  Go team.

## Save results

This data was slightly painful to procure.  Let's save the results locally for future analysis.

In [None]:
df.to_csv("arxiv-matplotlib.csv")

In [None]:
!du -hs arxiv-matplotlib.csv

In [None]:
df.to_parquet("arxiv-matplotlib.parquet", compression="snappy")

In [None]:
!du -hs arxiv-matplotlib.parquet