<span style="color:red; font-family:Helvetica Neue, Helvetica, Arial, sans-serif; font-size:2em;">An Exception was encountered at '<a href="#papermill-error-cell">In [7]</a>'.</span>

In [None]:
# Parameters
year = 1997
plot_type = "boxviolin"
continent_to_exclude = "Europe"


In [None]:
# command:
## papermill --kernel ir ggbetweenstats.ipynb ggbetweenstats.ipynb \
## -p plot_type "boxviolin" \
## -p continent_to_exclude "Asia" \
## -p year 1987

The function `ggstatsplot::ggbetweenstats` is designed both to facilitate
**data exploration** and to make highly customizable **publication-ready plots**,
with relevant statistical details included in the plot itself if desired. We
will see examples of how to use this function in this vignette.

To begin with, here are some instances where you would want to use
`ggbetweenstats`:

 - to check if a continuous variable differs across multiple groups/conditions

 - to compare distributions visually and check for outliers

**Note**: This vignette uses the pipe operator (`%>%`). If you are not
familiar with this operator, here is a good explanation:
<http://r4ds.had.co.nz/pipes.html>

## Comparisons between groups with `ggbetweenstats`

To illustrate how this function can be used, we will use the `gapminder` dataset
throughout this vignette. This dataset provides values for life expectancy, GDP
per capita, and population, at 5-year intervals, from 1952 to 2007, for each of
142 countries (courtesy [Gapminder Foundation](https://www.gapminder.org/)).
Let's have a look at the data:

In [None]:
library(gapminder)

dplyr::glimpse(x = gapminder::gapminder)

**Note**: For the remainder of the vignette, we're going to exclude *Oceania*
from the analysis simply because there are so few observations (countries).

Suppose the first thing we want to inspect is the distribution of life
expectancy for the countries of each continent in a single year (set by the
`year` parameter above). We also want to know whether the mean differences in
life expectancy between the continents are statistically significant.

The simplest form of the function call is:

In [None]:
# since the confidence intervals for the effect sizes are computed using
# bootstrapping, it is important to set a seed for reproducibility
options(repr.plot.width=12, repr.plot.height=6)
set.seed(123)

# function call
# `!!year` forces the papermill parameter; a bare `year == year` would compare
# the `year` column with itself and keep every row
ggstatsplot::ggbetweenstats(
  data = dplyr::filter(gapminder::gapminder,
                       year == !!year,
                       continent != continent_to_exclude),
  x = continent,
  y = lifeExp,
  nboot = 10,
  messages = FALSE
)
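
Because the papermill parameter `year` has the same name as the `year` column, the filter condition has to tell the two apart. The base-R sketch below, on made-up toy data, shows the pitfall with `subset()`, which evaluates its condition inside the data much as `dplyr::filter()` does. The `toy` data frame and the name `target_year` are illustrative assumptions, not part of the original analysis.

```r
# Hypothetical toy data standing in for gapminder (values are made up)
toy <- data.frame(
  country = c("A", "B", "C", "D"),
  year = c(1997, 1997, 2007, 2007),
  lifeExp = c(60, 72, 63, 75)
)

year <- 1997 # plays the role of the papermill parameter

# `year == year` is evaluated inside the data, so the column is compared
# with itself and every row is kept
masked <- subset(toy, year == year)

# binding the parameter to a name that is not a column avoids the clash
target_year <- year
filtered <- subset(toy, year == target_year)

nrow(masked)   # 4
nrow(filtered) # 2
```

In dplyr, writing `year == !!year` or `year == .env$year` achieves the same disambiguation.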

**Note**:
  
  - The function automatically decides whether an independent samples *t*-test
    is preferred (for 2 groups) or a one-way ANOVA (3 or more groups), based on
    the number of levels in the grouping variable.
    
  - The output of the function is a `ggplot` object which means that it can be
    further modified with `ggplot2` functions.
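
The choice between a *t*-test and a one-way ANOVA described in the first point can be sketched in base R. This is only an illustration of the idea, not the package's internal code: `choose_test` is a hypothetical helper, and the built-in `sleep` and `PlantGrowth` datasets stand in for real data.

```r
# Hypothetical helper (not part of ggstatsplot): pick a parametric test
# from the number of levels in the grouping variable
choose_test <- function(data, x, y) {
  groups <- droplevels(as.factor(data[[x]]))
  fml <- stats::reformulate(x, response = y) # builds y ~ x
  if (nlevels(groups) == 2) {
    stats::t.test(fml, data = data)      # base R default: Welch t-test
  } else {
    stats::oneway.test(fml, data = data) # base R default: Welch one-way ANOVA
  }
}

# two groups -> t-test
two_groups <- choose_test(sleep, x = "group", y = "extra")
two_groups$method

# three groups -> one-way ANOVA
three_groups <- choose_test(PlantGrowth, x = "group", y = "weight")
three_groups$method
```

The helper takes column names as strings to keep the sketch free of non-standard evaluation; `ggbetweenstats` itself accepts bare column names.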

As can be seen from the plot, the function by default returns a Bayes factor
for the test. If the null hypothesis can't be rejected with the null hypothesis
significance testing (NHST) approach, the Bayesian approach can help index
evidence in favor of the null hypothesis (i.e., $BF_{01}$).

By default, natural logarithms are shown because Bayes factor values can
sometimes be very large. Having values on a logarithmic scale also makes it easy
to compare evidence in favor of the alternative ($BF_{10}$) versus the null
($BF_{01}$) hypothesis (since $\log_{e}(BF_{01}) = -\log_{e}(BF_{10})$).
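
The symmetry on the log scale follows from $BF_{01} = 1 / BF_{10}$ and can be checked numerically. The Bayes factor values below are made up purely for illustration.

```r
# Hypothetical Bayes factors in favor of the alternative (made-up values)
bf10 <- c(0.05, 1, 20, 1e12)
bf01 <- 1 / bf10 # evidence in favor of the null

log_bf10 <- log(bf10) # natural logarithm
log_bf01 <- log(bf01)

# the two log Bayes factors differ only in sign
all.equal(log_bf01, -log_bf10)

# a very large Bayes factor becomes a manageable number on the log scale
log(1e12) # about 27.6
```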

We can make the output much more aesthetically pleasing as well as informative
by making use of the many optional parameters in `ggbetweenstats`. We'll add a
title and caption, better `x` and `y` axis labels, and tag and label the
outliers in the data. We can and will change the overall theme as well as the
color palette in use.

In [None]:
# for reproducibility
set.seed(123)
library(ggstatsplot)
library(gapminder)

# plot
ggstatsplot::ggbetweenstats(
  data = dplyr::filter(.data = gapminder, year == 2007, continent != "Oceania"),
  x = continent, # grouping/independent variable
  y = lifeExp, # dependent variable
  type = "robust", # type of statistics
  xlab = "Continent", # label for the x-axis
  ylab = "Life expectancy", # label for the y-axis
  plot.type = "boxviolin", # type of plot
  outlier.tagging = TRUE, # whether outliers should be flagged
  outlier.coef = 1.5, # coefficient for Tukey's rule
  outlier.label = country, # label to attach to outlier values
  outlier.label.args = list(color = "red"), # outlier point label color
  ggtheme = ggplot2::theme_gray(), # a different theme
  package = "yarrr", # package from which color palette is to be taken
  palette = "info2", # choosing a different color palette
  title = "Comparison of life expectancy across continents (Year: 2007)",
  caption = "Source: Gapminder Foundation"
) + # modifying the plot further
  ggplot2::scale_y_continuous(
    limits = c(35, 85),
    breaks = seq(from = 35, to = 85, by = 5)
  )

As can be appreciated from the effect size (partial eta squared) of 0.635, there
are large differences in the mean life expectancy across continents.
Importantly, this plot also helps us appreciate the distributions within any
given continent. For example, although Asian countries are doing much better
than African countries, on average, Afghanistan has a particularly low life
expectancy for the Asian continent, possibly reflecting the war and the
political turmoil.
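
Outlier tagging with `outlier.coef = 1.5` refers to Tukey's rule: a value is flagged when it lies more than 1.5 interquartile ranges beyond the quartiles. The base-R sketch below applies that rule to made-up life expectancy values; it illustrates the rule only and may differ in detail from how the package computes quartiles.

```r
# Hypothetical life expectancies (years) for one group; values are made up
life_exp <- c(43.8, 62.1, 64.7, 66.8, 69.0, 70.6, 71.7, 72.4, 74.2, 78.6, 82.6)

coef <- 1.5 # coefficient for Tukey's rule

# quartiles and interquartile range
q <- stats::quantile(life_exp, probs = c(0.25, 0.75))
iqr <- q[[2]] - q[[1]]

# fences beyond which a value counts as an outlier
lower <- q[[1]] - coef * iqr
upper <- q[[2]] + coef * iqr

outliers <- life_exp[life_exp < lower | life_exp > upper]
outliers # 43.8

# base R's boxplot.stats() applies the same rule, using hinges
grDevices::boxplot.stats(life_exp, coef = coef)$out
```

A larger coefficient widens the fences and flags fewer points.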

So far we have used the default parametric test and a robust test, both with a
boxviolin plot, but other options are available:

  - The `type` (of test) argument also accepts the following abbreviations:
    `"p"` (for *parametric*), `"np"` (for *nonparametric*), `"r"` (for
    *robust*), `"bf"` (for *Bayes Factor*). 

  - The type of plot to be displayed can also be modified (`"box"`, `"violin"`,
  or `"boxviolin"`).

  - The color palettes can be modified.
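
To give a feel for what switching the type of test means, the base-R sketch below runs a parametric and a rank-based test on the same data. It assumes the usual pairing of one-way ANOVA with the Kruskal-Wallis test for three or more groups, and uses the built-in `PlantGrowth` dataset rather than `gapminder`; it does not call `ggbetweenstats` itself.

```r
# parametric: one-way ANOVA (base R default does not assume equal variances)
parametric <- stats::oneway.test(weight ~ group, data = PlantGrowth)

# nonparametric: Kruskal-Wallis rank sum test on the same formula
nonparametric <- stats::kruskal.test(weight ~ group, data = PlantGrowth)

# compare the two p-values side by side
c(
  parametric = parametric$p.value,
  nonparametric = nonparametric$p.value
)
```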

Let's use the `combine_plots` function to make one plot from four separate
plots that demonstrate all of these options. Let's compare life expectancy for
all countries for an early year and the last year of available data, 1957 and
2007. We will generate the plots one by one and then use `combine_plots` to
merge them into one plot with some common labeling. It is possible, but not
necessarily recommended, to make each plot have different colors or themes.

For example,

In [None]:
?ggstatsplot::ggbetweenstats

<span id="papermill-error-cell" style="color:red; font-family:Helvetica Neue, Helvetica, Arial, sans-serif; font-size:2em;">Execution using papermill encountered an exception here and stopped:</span>