In [4]:
import pandas as pd

pd.set_option("display.max_rows", 5)

# Data Analysis guide

See also these resources:

* [pandas Series methods API](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html)
* [siuba verb API reference](api_index.rst)
* [siuba examples](examples)

## Overview


In [5]:
from siuba.data import cars

cars

|     | cyl | mpg  | hp  |
| --- | --- | ---- | --- |
| 0   | 6   | 21.0 | 110 |
| 1   | 6   | 21.0 | 110 |
| ... | ... | ...  | ... |
| 30  | 8   | 15.0 | 335 |
| 31  | 4   | 21.4 | 109 |


## Split-apply-combine

* Split with `group_by()`
* Apply with `_`
* Combine with verbs

## Dates and times

> 🚧 Coming soon. See this [article on timeseries](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html) in the pandas docs.

## Strings

> 🚧 Coming soon. See this [article on working with text](https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html) in the pandas docs.

## Reshaping

> 🚧 Coming soon. See the following User API entries for reshaping verbs.

* [gather](api_tidy/02_gather.Rmd)
* [spread](api_tidy/03_spread.Rmd)

## Table joins

> 🚧 Coming soon. See siuba's User API entry for joins.

* [joins](api_table_two/joins.Rmd)

## Nested data

> 🚧 Coming soon. See siuba's User API entry for nest and unnest.

* [nest and unnest](api_tidy/01_nest)

## Debugging

This section covers the four issues people hit most often.

1. Referring to a column that doesn't exist
2. A pandas Series method raising an error
3. Python syntax errors
4. Any of the above in a pipe

> Note that stack traces shown here are shorter than normal, to help make them clearer. This is something siuba does for SQL by default, and will be implemented for pandas in the future.

In [38]:
import pandas as pd
from siuba import mutate, _

df = pd.DataFrame({
    'g': ['a','a','b'],
    'x': [1,2,3]
})


In [36]:
test = {}

def limit_traceback(f, keep_first = True, limit = 1):
    """Wraps the ipython shell._showtraceback, to cut out some pieces.
    
    Note: ipython allows Exceptions to have a _render_traceback_ method, to
          do what this wrapper does, but that doesn't help us change the
          behavior of existing classes. This is a situation where generic
          function dispatch would help.
    """
    from functools import wraps

    if getattr(f, '_wrapped_lt', False):
        # don't wrap multiple times. re-wrap original
        f = f.__wrapped__
    
    @wraps(f)
    def wrapper(etype, evalue, stb):
        test['stb'] = stb
        header = stb[0:3] if keep_first and len(stb) > 3 else []
        body = stb[-limit:]
        
        f(etype, evalue, [*header, *body])
    
    # ensure we don't wrap multiple times
    wrapper._wrapped_lt = True
    
    # otherwise, return wrapper
    return wrapper

from IPython import get_ipython
from IPython.core.magic import register_cell_magic

@register_cell_magic
def short_traceback(line, cell):
    shell = get_ipython()
    shell._showtraceback = limit_traceback(shell._showtraceback, limit = 1)
    shell.run_cell(cell)
    shell._showtraceback = shell._showtraceback.__wrapped__

shell = get_ipython()
shell._showtraceback = limit_traceback(shell._showtraceback, limit = 1)

### Missing columns

In [39]:
mutate(df, y = _.X + 1)

AttributeError: 'DataFrame' object has no attribute 'X'

In this case, the data doesn't have a column named "X".

In [40]:
df.columns

Index(['g', 'x'], dtype='object')
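
Column names are case sensitive, so `_.X` does not match the column `x`. One way to catch this kind of slip is to compare the requested name against the existing columns. The sketch below uses a hypothetical helper, `suggest_columns`, which is not part of siuba or pandas. It first looks for a match that differs only by case, then falls back to `difflib.get_close_matches` from the standard library.

```python
import difflib

import pandas as pd

df = pd.DataFrame({
    'g': ['a', 'a', 'b'],
    'x': [1, 2, 3]
})

def suggest_columns(data, name, n = 3):
    """Hypothetical helper: list existing column names that resemble `name`."""
    columns = [str(col) for col in data.columns]

    # names that differ only by upper/lower case are the most likely culprit
    same_ignoring_case = [col for col in columns if col.lower() == name.lower()]
    if same_ignoring_case:
        return same_ignoring_case

    # otherwise, fall back to fuzzy matching on the spelling
    return difflib.get_close_matches(name, columns, n = n)

suggest_columns(df, "X")
```

Here the helper returns `['x']`, which points to the lowercase column as the intended name.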

### Series method error

In [41]:
mutate(df, y = _.x.mean(bad_arg = True))

TypeError: mean() got an unexpected keyword argument 'bad_arg'

In this case, it's helpful to try replacing `_` with the actual data.

In [42]:
# expression to debug
_.x.mean(bad_arg = True)

# replacing _ with the data
df.x.mean(bad_arg = True)

TypeError: mean() got an unexpected keyword argument 'bad_arg'
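
Because the same error appears without siuba, the problem lies in the call to the pandas method itself. As a small sketch of the next step, the code below captures the error from the direct call, then checks that the call succeeds once the unsupported argument is dropped. The variable names are chosen only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'g': ['a', 'a', 'b'],
    'x': [1, 2, 3]
})

message = None
try:
    # same call as in the failing mutate, but on the real data
    df.x.mean(bad_arg = True)
except TypeError as err:
    message = str(err)

print(message)

# dropping the unsupported argument makes the call work
fixed = df.x.mean()
print(fixed)
```

Once the direct call works, the corrected expression can go back into the verb as `mutate(df, y = _.x.mean())`.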

### Python syntax errors

In [43]:
df
    >> mutate(y = _.x + 1)

IndentationError: unexpected indent (<ipython-input-43-fded324be8e0>, line 2)

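In this case, Python treats `df` on its own line as a complete statement, so the indented `>>` line is unexpected. To continue the expression across lines, we either need to end the first line with a backslash, or put the code in parentheses.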

In [46]:
df \
    >> mutate(y = _.x + 1)

(df
    >> mutate(y = _.x + 1)
)

|     | g   | x   | y   |
| --- | --- | --- | --- |
| 0   | a   | 1   | 2   |
| 1   | a   | 2   | 3   |
| 2   | b   | 3   | 4   |


### Pipes

When the error occurs in a pipe, it's helpful to comment out parts of the pipe.

For example, consider the three-step pipe below.

In [31]:
from siuba import select, arrange, mutate

(df
   >> select(_.g, _.x)
   >> mutate(res = _.X + 1)
   >> arrange(-_.res)
)

AttributeError: 'DataFrame' object has no attribute 'X'

Notice that the arrow in the traceback points to line 6, the `arrange` step. This is not because that's where the error is, but because Python will always point to the last line of a pipe.

Let's debug by running only the first line, then only the first two, and so on, until we find the error.

In [32]:
(df
   >> select(_.g, _.x)
#    >> mutate(res = _.X + 1)
#    >> arrange(-_.res)
)

|     | g   | x   |
| --- | --- | --- |
| 0   | a   | 1   |
| 1   | a   | 2   |
| 2   | b   | 3   |


The `select` step works okay, so let's uncomment the next line.

In [37]:
(df
   >> select(_.g, _.x)
   >> mutate(res = _.X + 1)
#    >> arrange(-_.res)
)

AttributeError: 'DataFrame' object has no attribute 'X'

We found our bug! Note that when working with SQL, siuba prints out the name of the verb where the error occurred. This is very useful, and will be added for pandas in the future!