# Anaconda Package Download Data

This notebook demonstrates how to load and use Anaconda package data. For more details, see the [GitHub repository](https://github.com/ContinuumIO/anaconda-package-data/blob/master/README.md). Because of resource limits on Binder, some of the analysis examples below may run slowly or need more memory than the Binder instance provides. Feel free to download this notebook and run it locally.


## Setting up

To start, install the required packages by running `conda install dask intake numpy pandas` and `conda install -c conda-forge hvplot`. Then import them:

In [None]:
import dask
import dask.dataframe as dd
from datetime import datetime
import hvplot.pandas
import intake
import numpy as np
import pandas as pd

In [None]:
from dask.distributed import Client
client = Client()
client

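This registers the Dask progress bar globally. Note that `dask.diagnostics.ProgressBar` only reports on Dask's local schedulers; when the distributed `Client` created above is active, progress is shown in the dashboard linked from the `client` output instead.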

In [None]:
from dask.diagnostics import ProgressBar
pbar = ProgressBar()
pbar.register()

## Loading Data

There are multiple ways to load Anaconda package data. The examples below load all of the data for 2019; a single month, such as December 2018, can be loaded in much the same way.

#### Method 1: Load data from an S3 URL

First, we can read the Parquet files directly from an S3 URL. We recommend using `dask.dataframe` to read the data files into a Dask DataFrame. Please visit the [Dask website](http://docs.dask.org/en/latest/dataframe.html) for more information.

In [None]:
#df = dd.read_parquet('s3://anaconda-package-data/conda/hourly/2019/*/*.parquet',
#                     storage_options={'anon': True})
#df
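
The glob in the cell above selects every file for 2019. As a minimal sketch, the hypothetical helper `hourly_glob` below builds the same pattern for any year and, optionally, narrows it to one month. The bucket and prefix come from the cell above; that the directory under the year is a zero-padded month number is an assumption inferred from the `2019/*/*.parquet` pattern, so check it against the repository README before relying on it.

```python
# Bucket and prefix are copied from the read_parquet call above.
BASE = "s3://anaconda-package-data/conda/hourly"


def hourly_glob(year, month=None):
    """Return a glob for one year of hourly files, or for one month if given.

    Assumption (not confirmed by this notebook): the directory below the
    year is a zero-padded month number such as "03" or "12".
    """
    month_part = "*" if month is None else f"{month:02d}"
    return f"{BASE}/{year}/{month_part}/*.parquet"


print(hourly_glob(2019))      # same pattern as the cell above
print(hourly_glob(2018, 12))  # narrowed to a single month
```

The returned string can be passed to `dd.read_parquet(..., storage_options={'anon': True})` in place of the hard-coded path.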

#### Method 2: Load data from an Intake catalog

Second, we can load data from an Intake catalog file. One advantage of using an Intake catalog is that the cache specifications can be defined in the catalog, so that Intake caches the remote data files locally. This saves bandwidth and improves the performance of future analyses. To remove the Intake cache, run `intake cache clear`. For more information on Intake catalogs, see the Intake documentation.

Before loading the data file, we need to load the Intake catalog file. We can use a URL to the catalog file directly:

In [None]:
cat = intake.open_catalog('https://raw.githubusercontent.com/ContinuumIO/anaconda-package-data/master/catalog/anaconda_package_data.yaml')

Then we can load the data for a user-specified year:

In [None]:
df = cat.anaconda_package_data_by_year(year=2019).to_dask()

If you would rather read the data directly into a pandas DataFrame, use the monthly entry of the same catalog, for example `cat.anaconda_package_data_by_month(year=2018, month=12).read()` for December 2018.

## Examples

After loading the data, we can do a lot of data wrangling and visualization to answer interesting questions. Below we show a few examples of how people can use the data. 

#### Example 1: Dask download statistics

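In this first example, we look at the download statistics of Dask alongside two other packages, Ray and Koalas. First, let's see how many times each package was downloaded in 2019. The commented-out line shows how to restrict the count to downloads from the `anaconda` data source: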

In [None]:
pkg_names = ['dask', 'ray', 'koalas']
#df.loc[(df.data_source == 'anaconda') & (df.pkg_name == 'dask')]['counts'].sum().compute()

pkg_counts = [{pkg_name: df.loc[df.pkg_name == pkg_name]['counts'].sum()} 
              for pkg_name in pkg_names]
dask.compute(pkg_counts)

Note that `dask.compute()` (or `.compute()`) is needed when `df` is a Dask DataFrame, because Dask only builds a task graph until it is asked for a result. Drop the call if you loaded the data into a pandas DataFrame. Please visit the [Dask website](http://docs.dask.org/en/latest/dataframe.html) for more information.
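
To illustrate the pandas case, here is a minimal sketch of the same per-package total on a small synthetic DataFrame. The rows are made up for illustration; only the column names `pkg_name` and `counts` come from the dataset used above.

```python
import pandas as pd

# Synthetic rows that mimic two columns of the real dataset (values are made up).
sample = pd.DataFrame({
    "pkg_name": ["dask", "dask", "ray", "koalas", "numpy"],
    "counts": [10, 5, 7, 3, 100],
})

pkg_names = ["dask", "ray", "koalas"]

# pandas evaluates eagerly, so the sum is available without .compute().
pkg_counts = {
    name: int(sample.loc[sample.pkg_name == name, "counts"].sum())
    for name in pkg_names
}
print(pkg_counts)
```

With a Dask DataFrame the same expression would return lazy objects, which is why the cell above wraps the list in `dask.compute`.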

Next, let's take a look at the monthly download trends of these three packages.

In [None]:
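# Tag each row with its calendar month, then total the downloads per
# month and package. Selecting 'counts' avoids summing non-numeric columns.
df['month'] = df.time.dt.month
pkg_month_agg = df\
    .loc[(df.pkg_name == 'dask') | (df.pkg_name == 'ray') | (df.pkg_name == 'koalas')]\
    .groupby(['month', 'pkg_name'])['counts']\
    .sum()\
    .reset_index()\
    .compute()
pkg_month_agg.head()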
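
The aggregate is in long format, with one row per month and package. For a side-by-side comparison it can help to pivot it so that each package becomes a column. This is a sketch on a synthetic stand-in for `pkg_month_agg`; the numbers are invented and only the column names match the cell above.

```python
import pandas as pd

# Synthetic stand-in for pkg_month_agg (values are made up).
agg = pd.DataFrame({
    "month": [1, 1, 2, 2],
    "pkg_name": ["dask", "ray", "dask", "ray"],
    "counts": [100, 40, 120, 60],
})

# One row per month, one column per package.
wide = agg.pivot(index="month", columns="pkg_name", values="counts")
print(wide)
```

Because `pkg_month_agg` is already a pandas DataFrame after `.compute()`, the same `pivot` call should apply to it directly.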

In [None]:
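# Group the bars by package so the three packages are not drawn on top of each other.
pkg_month_agg.hvplot.bar('month', 'counts', by='pkg_name')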

In [None]:
import holoviews as hv
hv.extension('bokeh')

In [None]:
ds = hv.Dataset(pkg_month_agg, ['month', 'pkg_name'], 'counts')
bars = ds.to(hv.Bars, ['month', 'pkg_name'], 'counts')
bars

#### Example 2: Package platform comparison

We can also compare package platforms. Here we calculate the total number of downloads from each platform for the same three packages and visualize the results in a bar chart. (Note that "noarch" packages have no platform value because they work on all platforms.)

In [None]:
platform_month = df\
    .loc[(df.pkg_name == 'dask') | (df.pkg_name == 'ray') | (df.pkg_name == 'koalas')]\
    .groupby(['pkg_platform'])['counts'].sum().reset_index().compute()
platform_month

In [None]:
platform_month.hvplot.bar('pkg_platform', 'counts', rot=90)
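
Raw totals can be hard to compare at a glance, so one option is to express each platform as a share of all downloads. The sketch below uses a synthetic table shaped like `platform_month`; the platform labels and counts are illustrative only and are not taken from the real data.

```python
import pandas as pd

# Synthetic table shaped like platform_month (labels and values are made up).
platform_totals = pd.DataFrame({
    "pkg_platform": ["linux-64", "osx-64", "win-64"],
    "counts": [600, 150, 250],
})

# Share of total downloads per platform, as a percentage.
platform_totals["share_pct"] = (
    100 * platform_totals["counts"] / platform_totals["counts"].sum()
).round(1)

# Largest platform first.
ranked = platform_totals.sort_values("counts", ascending=False).reset_index(drop=True)
print(ranked)
```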