# Tutorial: analysis of pandas docstring errors

In this tutorial we will perform an exploratory data analysis of
pandas docstring errors.

We will use two source files obtained from previous tutorials:

- `docstring_errors_pandas023.hd5`
- `pandas_page_views_2018.parquet`

In [5]:
import os
import pandas

DOCSTRING_ERRORS_FNAME = os.path.join('data', 'docstring_errors_pandas023.hd5')
PAGE_VIEWS_FNAME = os.path.join('data', 'pandas_page_views_2018.parquet')

### Join the two sources of data

- Load the data for every source
- Transform the "primary key" of the sources so they match
- Join both sources into a single `DataFrame`
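
The three steps above can be sketched on toy data before touching the real files. This is a minimal sketch, not the tutorial's solution: the URLs, object names and values are hypothetical, and it assumes the page-view source is keyed by page URL while the docstring-error source is keyed by object name.

```python
import pandas

# Hypothetical stand-in for the page-view source: one row per URL.
toy_page_views = pandas.DataFrame({
    'Page': ['https://example.org/docs/pandas.DataFrame.head.html',
             'https://example.org/docs/pandas.Series.sum.html',
             'https://example.org/old/pandas.Series.sum.html'],
    'Pageviews': [120, 30, 50],
})

# Hypothetical stand-in for the docstring-error source: indexed by object name.
toy_errors = pandas.DataFrame(
    {'docstring_length': [400, 250, 90]},
    index=['pandas.DataFrame.head', 'pandas.Series.sum', 'pandas.Series.mean'],
)

# Transform the key: keep the last URL segment and drop the '.html' suffix.
toy_page_views['Page'] = (toy_page_views['Page']
                          .str.split('/')
                          .str[-1]
                          .str.replace(r'\.html$', '', regex=True))

# Two URLs now share a key, so aggregate before joining.
views_per_page = toy_page_views.groupby('Page')['Pageviews'].sum()

# Left join on the index; objects without page views get NaN.
joined = toy_errors.join(views_per_page)
print(joined)
```

`DataFrame.join` performs a left join on the index by default, so every row of the docstring-error table is kept, and rows with no matching page show `NaN` in `Pageviews`.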

### Use pandas to answer your questions about the data

- Discuss what questions you want to get answered
- Use pandas to get the answers for them
- Can you use pandas visualization?
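
As an illustration of the kind of questions that can be asked, here is a small sketch on a hypothetical joined table. The three questions, the object names and the numbers are assumptions made up for the example; only the column names `docstring_length` and `Pageviews` come from this tutorial.

```python
import pandas

# Hypothetical joined data: one row per documented object.
toy = pandas.DataFrame(
    {'docstring_length': [400, 250, 90, 600],
     'Pageviews': [120.0, 80.0, None, 300.0]},
    index=['pandas.DataFrame.head', 'pandas.Series.sum',
           'pandas.Series.mean', 'pandas.read_csv'],
)

# Question 1: how many objects have no recorded page views?
n_without_views = toy['Pageviews'].isna().sum()

# Question 2: which objects receive the most page views?
top_viewed = toy['Pageviews'].nlargest(2)

# Question 3: is docstring length linearly related to page views?
# Series.corr computes the Pearson correlation by default and skips NaN pairs.
length_views_corr = toy['docstring_length'].corr(toy['Pageviews'])

print(n_without_views)
print(top_viewed)
print(length_views_corr)
```

With matplotlib installed, a result such as `top_viewed` can be drawn directly, for example with `top_viewed.plot(kind='barh')`.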

In [10]:
import operator
import urllib.parse

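# Load the page views; the index holds the URL of each documentation page
page_views = pandas.read_parquet(PAGE_VIEWS_FNAME)

# Reduce each URL to the object name so it matches the docstring errors index,
# e.g. '.../pandas.DataFrame.head.html' -> 'pandas.DataFrame.head'
page_views.index = (page_views.index
                              .to_series()
                              .apply(urllib.parse.urlparse)
                              .apply(operator.attrgetter('path'))
                              .str.split('/')
                              .str[-1]
                              # rstrip('.html') would strip any trailing '.', 'h', 't',
                              # 'm' or 'l' characters, so remove the suffix with a regex
                              .str.replace(r'\.html$', '', regex=True))

# Different URLs may map to the same name, so sum their views before joining
docstring_errors = (pandas.read_hdf(DOCSTRING_ERRORS_FNAME)
                          .join(page_views.groupby('Page')['Pageviews'].sum()))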
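
When trimming the `.html` extension from page names, keep in mind that `str.rstrip` takes a set of characters to strip, not a suffix. The sketch below uses two hypothetical page names to compare it with an anchored regular-expression replacement.

```python
import pandas

# Hypothetical page names whose stem ends in characters found in '.html'.
names = pandas.Series(['pandas.DataFrame.tail.html', 'pandas.Series.sum.html'])

# rstrip removes every trailing character that is in the set {'.', 'h', 't', 'm', 'l'}.
stripped_chars = names.str.rstrip('.html')

# An anchored regex removes only the literal '.html' suffix.
stripped_suffix = names.str.replace(r'\.html$', '', regex=True)

print(stripped_chars.tolist())   # ['pandas.DataFrame.tai', 'pandas.Series.su']
print(stripped_suffix.tolist())  # ['pandas.DataFrame.tail', 'pandas.Series.sum']
```

Names damaged this way no longer match the docstring-error index, so those rows would silently end up with missing `Pageviews` after the join.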

In [17]:
(docstring_errors[['docstring_length', 'Pageviews']]
    .dropna()
    .drop_duplicates()
    .plot(kind='scatter', x='docstring_length', y='Pageviews'))