## Introduction and background

These notebooks describe how to undertake analyses introduced as examples in the
Ninth Edition of *Introduction to the Practice of Statistics* (2017) by Moore, McCabe, and Craig.
The data used in the notebooks are from the R version of the notebooks found at https://nhorton.people.amherst.edu/ips9/.

## Setup

First, load the IPS (short for *Introduction to the Practice of Statistics*) package and its dependencies:

In [None]:
(ql:quickload :ips)

(in-package :ips)

(do-external-symbols (s (find-package "IPS"))
  (print s))

(format t "Hello world")


# Chapter 1: Looking at data &mdash; distributions

## 1.1 Key characteristics of a data set

## 1.2 Displaying distributions with graphs

### Categorical variables: Bar graphs and pie charts

The distribution of a categorical variable lists the categories and gives either the count or the percent of cases that fall in each category. An alternative to the percent is the proportion, the count divided by the sum of the counts. Note that the percent is simply the proportion times 100.
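The relationship between counts, proportions, and percents can be sketched in plain Common Lisp. This is a minimal illustration, not part of the `IPS` package: the function names `proportions` and `percents` are hypothetical, and the counts are made up for the example rather than taken from the book.

```lisp
;; Made-up counts for illustration only; NOT the survey data used in the examples.
(defparameter *example-counts*
  '(("A" . 12) ("B" . 6) ("C" . 2)))

(defun proportions (counts)
  "Return an alist mapping each category in COUNTS to count / total."
  (let ((total (reduce #'+ counts :key #'cdr)))
    (mapcar (lambda (pair)
              (cons (car pair) (/ (cdr pair) total)))
            counts)))

(defun percents (counts)
  "Return an alist mapping each category in COUNTS to its percent of the total."
  (mapcar (lambda (pair)
            (cons (car pair) (* 100 (cdr pair))))
          (proportions counts)))

(proportions *example-counts*) ; => (("A" . 3/5) ("B" . 3/10) ("C" . 1/10))
(percents *example-counts*)    ; => (("A" . 60) ("B" . 30) ("C" . 10))
```

Because Common Lisp keeps exact rationals, the proportions sum to exactly 1 and the percents to exactly 100.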

#### Example 1.7 The distribution of a categorical variable using a bar graph.

Example 1.7 examines the distribution of a categorical variable and shows the count of each category as a percentage of the total. The data set records preferences for online information resources, taken from a survey of 552 first-year university students.

First, read the data into a data frame named `online`. Note that we're not using the usual convention of \*earmuffs\* on the variable name. The example data sets are named after the example number in the book; for instance, the data for example 1.7 is named `eg01-07`.

In [None]:
(defparameter online (read-csv (dex:get eg01-07 :want-stream t)))

We can view the counts by typing the variable's name:

In [None]:
online

On page 10, figure 1.2 shows the data as a bar chart. We will create a Vega-Lite specification for this data:

In [None]:
(defparameter online-bar-chart (bar-chart online "SOURCE" "COUNT"))

and plot it:

In [None]:
(plot online-bar-chart)

We can see that the source labels are overlapping. Let's fix this by adding a width setting:

In [None]:
(plot (pushnew '("width" . 300) online-bar-chart))

You should always consider the best way to order the values in a bar chart. In this case, we will sort X
by the value of Y, descending:

In [None]:
(pushnew '("sort" . "-y") (accesses online-bar-chart :encoding :x))

In [None]:
(plot online-bar-chart)

#### Example 1.10 Pie chart for the online resource preference data
Figure 1.3 (page 11) displays the same data in a pie chart. We can create a spec and plot the data as a pie chart like this:

In [None]:
(defparameter online-pie-chart (pie-chart online "SOURCE" "COUNT"))

In [None]:
(plot online-pie-chart)

### Quantitative Variables: Stemplots and histograms
A _stemplot_ (stem-and-leaf plot) provides a quick graphical summary of the shape of a distribution. Stemplots work well for small data sets; for larger data sets, histograms work better.
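To make the idea concrete, here is a minimal sketch of how a stemplot can be built by hand in plain Common Lisp. It is not the `stem-and-leaf` function used in these notebooks: `simple-stemplot` is a hypothetical helper, it assumes a non-empty list of non-negative integers, and the input values are made up for illustration.

```lisp
(defun simple-stemplot (values &optional (stream *standard-output*))
  "Print a basic stem-and-leaf plot of the non-negative integers VALUES to STREAM.
The stem is every digit but the last; the leaf is the last digit."
  (let* ((sorted (sort (copy-list values) #'<))
         (min-stem (floor (first sorted) 10))
         (max-stem (floor (car (last sorted)) 10)))
    ;; Print every stem from smallest to largest, including empty ones,
    ;; so that gaps in the distribution stay visible.
    (loop for stem from min-stem to max-stem
          do (format stream "~3d | ~{~d~}~%"
                     stem
                     (loop for v in sorted
                           when (= (floor v 10) stem)
                             collect (mod v 10))))))

;; Made-up values for illustration only.
(simple-stemplot '(12 15 21 23 23 30 47))
;; prints:
;;   1 | 25
;;   2 | 133
;;   3 | 0
;;   4 | 7
```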

#### Example 1.11 - Soluble corn fiber and calcium
This example shows a stem-and-leaf plot of the effect of soluble corn fiber (SCF) on the absorption of calcium in adolescent boys and girls.

In [None]:
(define-data-frame scf (read-csv (dex:get ips:eg01-11 :want-stream t)))

Let's take a high-level look at this data set:

In [None]:
(summary scf)

The data frame has 3 variables and 46 observations, which is small enough for us to print in its entirety. We'll use a helper function that we've added to the `IPS` package to make printing with Lisp-Stat easier when working with notebooks.

In [None]:
(print-df scf)

We want the treatment group, so we'll use the [select](https://lisp-stat.dev/docs/tasks/select/) package to subset the rows with IDs 24 through 46:

In [None]:
(stem-and-leaf (select scf (range 23 nil) 'scf:absorption)) ;arrays are 0 based, so we start at 23

#### Examples 1.12 & 1.13
Example 1.12 compares the SCF and control groups using a back-to-back stemplot. Example 1.13 demonstrates *splitting stems* and *trimming* digits in the leaves to fine-tune the display of the stemplot and better reveal characteristics of the data. Trimming (rounding) the data can be done using Common Lisp before plotting. Implementing a back-to-back stemplot is outlined in [issue #1](https://github.com/Lisp-Stat/plot/issues/1) and splitting in [issue #2](https://github.com/Lisp-Stat/plot/issues/2).
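As a rough sketch of trimming before plotting, the standard functions `truncate` and `round` can drop trailing digits from integer data. The helper names `trim-digits` and `round-digits` are hypothetical, and the values are made up for illustration; neither helper is part of Lisp-Stat or the `IPS` package.

```lisp
(defun trim-digits (x &optional (n 1))
  "Drop the last N digits of the integer X by truncating toward zero."
  (values (truncate x (expt 10 n))))

(defun round-digits (x &optional (n 1))
  "Drop the last N digits of the integer X by rounding to the nearest integer.
Note that ROUND resolves exact ties toward the even integer."
  (values (round x (expt 10 n))))

;; Made-up values for illustration only.
(mapcar #'trim-digits  '(1234 1289 1305)) ; => (123 128 130)
(mapcar #'round-digits '(1234 1289 1305)) ; => (123 129 130)
```

The trimmed values can then be passed to the stemplot function in place of the raw data.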