**NOTE** This notebook is a work in progress.

# Interactive exploration of current errors in pandas docstrings

*DISCLAIMER: This notebook is based on the one uploaded by @dujm [here](https://github.com/python-sprints/pandas-mentoring/blob/master/notebooks/docstring_error_interactive.ipynb)*


This notebook helps you find the errors that are still present in pandas docstrings, so that you can pick one, fix it, and submit a PR to the [pandas repository](https://github.com/pandas-dev/pandas).

**IMPORTANT!** Before you start fixing an error, check that nobody is already working on it by searching the issues and PRs in the pandas repository. If nobody is, open an issue to let others know that you will be fixing that docstring.

This notebook currently supports pandas versions >= 0.25.0.

Let's start by importing the necessary packages:

In [1]:
import os

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import json

import ipywidgets as widgets
from IPython.display import display, clear_output, Markdown
import qgrid

from error_descriptions import error_descriptions

## *Static exploration*

## 1. Generate a .json containing all current errors

If you are running this notebook from Binder, this step has already been done for you. Keep in mind that the .json file is only refreshed every 15 minutes, so it may be slightly out of date. When you select an error to work on, double-check that nobody has already opened an issue for it.

If you want to generate the .json file locally, simply run the following command from your pandas clone:

`./scripts/validate_docstrings.py --format=json > /path/to/json/pandas_docstring_errors.json`
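
The cells below depend on the shape of that file. As a rough sketch, and judging only from how this notebook reads the file later on (the layout is an assumption, not taken from the script's documentation), each top-level key is a function name whose value holds an `errors` list of `[code, message]` pairs, plus `file` and `file_line`. The sample entries, including `pandas.Example.no_errors` and the `/path/to/...` paths, are hypothetical:

```python
import json

# Illustrative sample only; the real file may hold more keys per entry.
sample_json = """
{
  "pandas.BooleanDtype": {
    "errors": [["SA01", "See Also section not found"]],
    "file": "/path/to/pandas/core/arrays/boolean.py",
    "file_line": 40
  },
  "pandas.Example.no_errors": {
    "errors": [],
    "file": "/path/to/pandas/core/example.py",
    "file_line": 1
  }
}
"""


def functions_with_errors(raw_json):
    """Return {function name: [error codes]} for entries with at least one error."""
    data = json.loads(raw_json)
    return {
        name: [code for code, _message in entry["errors"]]
        for name, entry in data.items()
        if entry.get("errors")
    }


print(functions_with_errors(sample_json))
```

A quick check like this can confirm that a locally generated file parses and is not empty before you load it into pandas.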

## 2. Plot a table describing the errors

We will build a table listing the pandas functions that still have errors in their docstrings, together with the code and description of each error.

In [2]:
# Load the JSON file into a DataFrame with one row per (function, error) pair
file = 'pandas_docstring_errors.json'
df = (pd.read_json(file)
            .transpose()
            .filter(items=['errors', 'file', 'file_line'])
            .explode('errors')
            .dropna()
            .reset_index()
            .rename(columns={"index": "function"})
     )

# Split each error into its code and its description
df[['error_code','error_description']] = pd.DataFrame(df.errors.tolist())
df = df.drop(["errors"], axis=1)

# Keep only the part of the file path that follows '/pandas'
df['file'] = df['file'].str.split('/pandas').str[1]

# Show the first ten rows
df.head(10)

Unnamed: 0,function,file,file_line,error_code,error_description
0,pandas.BooleanDtype,/core/arrays/boolean.py,40,SA01,See Also section not found
1,pandas.Categorical,/core/arrays/categorical.py,213,PR01,Parameters {'fastpath'} not documented
2,pandas.Categorical.__array__,/core/arrays/categorical.py,1268,ES01,No extended summary found
3,pandas.Categorical.__array__,/core/arrays/categorical.py,1268,PR01,Parameters {'dtype'} not documented
4,pandas.Categorical.__array__,/core/arrays/categorical.py,1268,SA01,See Also section not found
5,pandas.Categorical.__array__,/core/arrays/categorical.py,1268,EX01,No examples section found
6,pandas.Categorical.from_codes,/core/arrays/categorical.py,589,SA01,See Also section not found
7,pandas.CategoricalDtype,/core/dtypes/dtypes.py,168,SA04,"Missing description for See Also ""Categorical""..."
8,pandas.CategoricalIndex,/core/indexes/category.py,69,EX02,Examples do not pass tests:\n*****************...
9,pandas.CategoricalIndex,/core/indexes/category.py,69,EX03,"flake8 error: E231 missing whitespace after ',..."
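
To see what `explode` and `dropna` do in the cell above, here is a minimal sketch on a toy input. The two entries (`pandas.A.method`, `pandas.B.clean`) and their paths are hypothetical, and `pd.DataFrame.from_dict(..., orient="index")` stands in for `pd.read_json(file).transpose()` so that the example does not need a file on disk:

```python
import pandas as pd

# Hypothetical entries in the layout the notebook appears to expect.
sample = {
    "pandas.A.method": {
        "errors": [["ES01", "No extended summary found"],
                   ["SA01", "See Also section not found"]],
        "file": "/path/to/pandas/core/a.py",
        "file_line": 10,
    },
    "pandas.B.clean": {  # no errors: this row disappears after dropna()
        "errors": [],
        "file": "/path/to/pandas/core/b.py",
        "file_line": 20,
    },
}

toy = (pd.DataFrame.from_dict(sample, orient="index")
         .explode("errors")   # one row per [code, description] pair
         .dropna()            # an empty error list explodes to NaN
         .reset_index()
         .rename(columns={"index": "function"}))

# Split each [code, description] pair into two columns
toy[["error_code", "error_description"]] = pd.DataFrame(toy["errors"].tolist())
toy = toy.drop(columns="errors")

print(toy)
```

The function with two errors ends up on two rows, while the function with an empty `errors` list is dropped, which is why the table only lists functions that still need work.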


## 3. Count number of functions with errors per error type

In [3]:
df_code = df['error_code'].value_counts().reset_index()
df_code.columns = ['error_code','counts']

df_code

Unnamed: 0,error_code,counts
0,SA01,372
1,EX01,362
2,ES01,354
3,RT03,288
4,GL08,242
5,SA04,227
6,PR07,212
7,PR01,212
8,EX03,145
9,RT02,121


## 4. Count number of errors per function

In [4]:
df_function = df['function'].value_counts().reset_index()
df_function.columns = ['function','counts']

df_function

Unnamed: 0,function,counts
0,pandas.core.groupby.DataFrameGroupBy.boxplot,14
1,pandas.HDFStore.append,13
2,pandas.PeriodIndex,12
3,pandas.CategoricalIndex.remove_unused_categories,11
4,pandas.Series.cat.remove_unused_categories,11
...,...,...
1013,pandas.tseries.offsets.BusinessDay.is_on_offset,1
1014,pandas.DataFrame.max,1
1015,pandas.Series.str.get_dummies,1
1016,pandas.Series.max,1


## *Interactive exploration* 

Select an error code from the dropdown menu below to see its full description, a bad and a good example, and the number of errors of that type in pandas:

In [6]:
def unique_sorted_values(array):
    unique = array.unique().tolist()
    unique.sort()
    return unique

w = widgets.Dropdown(
    options=unique_sorted_values(df_code.error_code),
    value='ES01',
    description='Task:',
)
display(w)

out = widgets.Output()
display(out)

def on_change(change):
    """Show the description and examples for the newly selected error code."""
    error_code = change['new']
    # Look up how many times this error code occurs
    error_cnt = df_code.loc[df_code['error_code'] == error_code, 'counts'].iloc[0]
    description = error_descriptions[error_code]["description"]
    bad_example = error_descriptions[error_code]["bad_example"]
    good_example = error_descriptions[error_code]["good_example"]
    references = error_descriptions[error_code]["references"]

    with out:
        # Replace whatever was shown for the previous selection
        clear_output()

        # Display text
        display(Markdown("## {} ({} errors)".format(error_code, error_cnt)))
        display(Markdown(description))
        display(Markdown("### Bad Example"))
        display(Markdown(bad_example))
        display(Markdown("### Good Example"))
        display(Markdown(good_example))
        display(Markdown("### References"))
        display(Markdown(references))

# Only react to changes of the dropdown's selected value
w.observe(on_change, names='value')

Dropdown(description='Task:', options=('ES01', 'EX01', 'EX02', 'EX03', 'GL08', 'PR01', 'PR02', 'PR06', 'PR07',…

Output()
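
The callback above mixes the lookup with the widget calls. As a rough sketch of the lookup alone, the text it displays can be assembled by a plain function. `render_error_markdown` and `fake_descriptions` are hypothetical names used for illustration; the dictionary only mimics the keys that the notebook reads from `error_descriptions`, and its texts are placeholders:

```python
def render_error_markdown(error_code, error_count, descriptions):
    """Build the Markdown shown for one error code as a single string."""
    entry = descriptions[error_code]
    sections = [
        "## {} ({} errors)".format(error_code, error_count),
        entry["description"],
        "### Bad Example",
        entry["bad_example"],
        "### Good Example",
        entry["good_example"],
        "### References",
        entry["references"],
    ]
    return "\n\n".join(sections)


# Hypothetical stand-in for the imported error_descriptions dictionary
fake_descriptions = {
    "ES01": {
        "description": "No extended summary found.",
        "bad_example": "(placeholder text)",
        "good_example": "(placeholder text)",
        "references": "(placeholder text)",
    }
}

print(render_error_markdown("ES01", 354, fake_descriptions))
```

Keeping the text assembly separate from the widget makes it possible to check the output for a given error code without running the dropdown.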

You can filter the following table by the error code you want to work on, or by function name.

In [None]:
# Create qgrid widget
qgrid_widget = qgrid.show_grid(df, grid_options={'forceFitColumns': True})
qgrid_widget
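
If the qgrid widget does not render in your environment, the same filtering can be approximated with plain pandas. This is a minimal sketch, not part of the original notebook: `filter_errors` is a hypothetical helper, and the three rows are copied from the table in section 2:

```python
import pandas as pd


def filter_errors(frame, error_code=None, function_contains=None):
    """Return rows matching an error code and/or a substring of the function name."""
    mask = pd.Series(True, index=frame.index)
    if error_code is not None:
        mask &= frame["error_code"] == error_code
    if function_contains is not None:
        # regex=False treats the text literally, so dots are not wildcards
        mask &= frame["function"].str.contains(function_contains, regex=False)
    return frame[mask]


# Three rows copied from the table in section 2
rows = pd.DataFrame({
    "function": ["pandas.BooleanDtype",
                 "pandas.Categorical",
                 "pandas.Categorical.from_codes"],
    "error_code": ["SA01", "PR01", "SA01"],
})

print(filter_errors(rows, error_code="SA01", function_contains="Categorical"))
```

In the notebook itself you would pass `df` instead of `rows`.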